Tags: The Value of Community Data

Over the last few weeks we spent a lot of time analyzing the meta data. We are working on a model that combines several domains, but there is nothing definite yet.

In the meantime, we stumbled upon a new possible source of data. Of course, we knew it was there all along, but because we were a little afraid that the publicly available data would not be sufficient to train a good model, we did not pursue the idea any further. Now, however, we decided that it is a good idea to at least check whether this data can improve our model notably; if so, we can worry about data acquisition later.

The experiment we conducted was pretty simple: we used a publicly available data set that contains tags for movies. To simplify our setup, we only considered tags that were used at least 50 times, and the tags were pre-processed to unify them as well as possible. This resulted in 335 tags for 4239 movies. The data was then used to create a binary vector for each movie, with a “1” at position i if the movie carries tag i and a “0” elsewhere. Next, we trained an RBM (restricted Boltzmann machine) with binary units in the hidden layer on these vectors. As usual, we used weight decay to regularize the model and momentum to speed up the training.
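The setup described above can be sketched roughly as follows. This is a minimal illustration and not our actual code: the function names, the `(movie, tag)` input format, the use of one-step contrastive divergence (CD-1) and all hyperparameter values are assumptions made for the example.

```python
from collections import Counter

import numpy as np


def build_tag_matrix(assignments, min_count=50):
    """Turn (movie, tag) assignments into a binary movie-by-tag matrix.

    Tags are assumed to be normalized already. Only tags that were
    assigned at least `min_count` times are kept.
    """
    assignments = list(assignments)
    counts = Counter(tag for _, tag in assignments)
    vocab = sorted(t for t, c in counts.items() if c >= min_count)
    col = {t: i for i, t in enumerate(vocab)}
    kept = [(m, t) for m, t in assignments if t in col]
    movies = sorted({m for m, _ in kept})
    row = {m: i for i, m in enumerate(movies)}
    X = np.zeros((len(movies), len(vocab)))
    for m, t in kept:
        X[row[m], col[t]] = 1.0  # "1" at position i if the movie has tag i
    return movies, vocab, X


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def train_rbm(X, n_hidden=50, epochs=20, lr=0.05, momentum=0.9,
              weight_decay=1e-4, batch_size=32, seed=0):
    """Binary-binary RBM trained with CD-1 (assumed), momentum and weight decay."""
    rng = np.random.default_rng(seed)
    n_visible = X.shape[1]
    W = 0.01 * rng.standard_normal((n_visible, n_hidden))
    b_v, b_h = np.zeros(n_visible), np.zeros(n_hidden)
    dW, db_v, db_h = np.zeros_like(W), np.zeros_like(b_v), np.zeros_like(b_h)
    for _ in range(epochs):
        order = rng.permutation(len(X))
        for start in range(0, len(X), batch_size):
            v0 = X[order[start:start + batch_size]]
            # Positive phase: hidden probabilities and a binary sample.
            h0_prob = sigmoid(v0 @ W + b_h)
            h0 = (rng.random(h0_prob.shape) < h0_prob).astype(float)
            # Negative phase: one reconstruction step.
            v1_prob = sigmoid(h0 @ W.T + b_v)
            h1_prob = sigmoid(v1_prob @ W + b_h)
            n = len(v0)
            grad_W = (v0.T @ h0_prob - v1_prob.T @ h1_prob) / n
            grad_W -= weight_decay * W  # L2 weight decay on the weights only
            # Momentum: keep a running velocity for each parameter.
            dW = momentum * dW + lr * grad_W
            db_v = momentum * db_v + lr * (v0 - v1_prob).mean(axis=0)
            db_h = momentum * db_h + lr * (h0_prob - h1_prob).mean(axis=0)
            W += dW
            b_v += db_v
            b_h += db_h
    return W, b_v, b_h


def hidden_features(X, W, b_h):
    """Map binary tag vectors into the learned feature space."""
    return sigmoid(X @ W + b_h)
```

With the real data, `X` would have 4239 rows and 335 columns, and the rows of `hidden_features(X, W, b_h)` would be the movie representations in the new feature space.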

To get a better understanding of the model, we randomly selected a “popular” movie and calculated its distance to all other movies in the new feature space; in addition, we determined the Jaccard coefficient (JC) of the tag sets for each pair. The intuition is that the JC should decrease as the distance increases. We checked this assumption by calculating the correlation coefficient between distance and JC for some randomly drawn movies.
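A rough sketch of this check, under a few assumptions: we use the Euclidean distance and the Pearson correlation coefficient here purely for illustration, and the function names and the dictionary-based inputs are made up for the example.

```python
import numpy as np


def jaccard(a, b):
    """Jaccard coefficient of two tag sets: |A ∩ B| / |A ∪ B|."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def distance_vs_jaccard(query, features, tag_sets):
    """Compare feature-space distance with tag overlap for one query movie.

    features: dict movie -> feature vector (e.g. hidden activations)
    tag_sets: dict movie -> set of tags
    Returns the other movies, their distances to the query, their JC
    values, and the correlation between the two series.
    """
    others = [m for m in features if m != query]
    dist = np.array([np.linalg.norm(features[query] - features[m])
                     for m in others])
    jc = np.array([jaccard(tag_sets[query], tag_sets[m]) for m in others])
    r = np.corrcoef(dist, jc)[0, 1]  # expected to be negative
    return others, dist, jc, r
```

If the intuition holds, `r` should come out clearly negative for most query movies; sorting `others` by `dist` gives the ranking for the query.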

However, since the model also captures non-linear relations in the data, the interpretation of the results is not always straightforward. Stated differently, some tags are more important than others, and so are some pairs of tags. That means it is still possible that a pair of movies has a notable distance but nevertheless a fairly high JC value. Such titles will then appear at a higher position in the ranking because of the valuable tags or tag combinations they contain.

Despite these possible obstacles, it is still easy to get a better intuition for the model by considering some classic movies. We used 007 movies for this purpose. The reason for our choice is simple: there are lots of them, and we expect their tags to be very similar. And the model did not disappoint: for the top-20 results, each of the listed movies shared at least the tags “007” and “bond”. In fact, the lowest JC value was 0.36, followed by 0.55.

So, in a nutshell, we can say that tags are extremely valuable for semantic clustering or for enhancing the existing meta data of movies. This is well known, regardless of the domain, and tags are already used in the literature to improve the accuracy of classification and other tasks. Now we face the dilemma that we need tags for more recent movies, as well as an active community that continually interacts with the data.

