Tags – Gold Nuggets Of The 21st Century

With a matrix factorization, it is pretty easy to learn what concepts are present in bag-of-words text data. Very often, such models learn “neurons” that detect specific themes, like supernatural, pirates, cowboys and other useful topics. The transductive nature of NMF is not a real problem here, because we can easily incorporate the knowledge from the learned latent topics into other models. As a nice side-effect, when trained on the full dataset, the factorization can be used to cluster the data. This is done by using the latent topic ID with the highest activation as a pseudo label. The idea is that two “zombie” movies have their highest overlap with the “supernatural” theme, and thus the ID of this topic is used as a label for both movies. Of course, this is not limited to horror movies.
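A minimal sketch of this pseudo-labeling idea, assuming a toy bag-of-words matrix and scikit-learn's NMF implementation (the data and dimensions here are made up for illustration):

```python
import numpy as np
from sklearn.decomposition import NMF

# Toy bag-of-words matrix: rows = movies, columns = word counts.
rng = np.random.default_rng(0)
X = rng.integers(0, 5, size=(100, 500)).astype(float)

# Factorize X ~ W @ H, where W holds the per-movie topic activations.
model = NMF(n_components=20, init="nndsvd", random_state=0)
W = model.fit_transform(X)  # shape: (movies, latent topics)

# The most active latent topic of each movie serves as its pseudo label,
# so two movies dominated by the same topic end up in the same cluster.
pseudo_labels = W.argmax(axis=1)
```

Movies that share a dominant topic can then be fed as labeled pairs or clusters into any downstream model.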

We used the tags from the latest MovieLens dataset, stemmed them and kept only the frequent ones. Then we built the tag co-occurrence matrix and factorized it with a non-negative matrix factorization (NMF). As expected, there are dedicated, sometimes funny, themes present:
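The pipeline can be sketched as follows; this is an illustrative reconstruction, not the exact code used for the post, with a random binary movie-tag matrix standing in for the real MovieLens tag assignments:

```python
import numpy as np
from sklearn.decomposition import NMF

# Toy tag-assignment matrix: rows = movies, columns = stemmed, frequent tags.
# A[m, t] = 1 if movie m was tagged with tag t.
rng = np.random.default_rng(0)
A = (rng.random((200, 50)) < 0.1).astype(float)

# Tag co-occurrence: C[i, j] counts how often tags i and j
# appear on the same movie.
C = A.T @ A

# Factorize the co-occurrence matrix into latent topics.
model = NMF(n_components=10, init="nndsvd", random_state=0)
W = model.fit_transform(C)  # shape: (tags, latent topics)

# The tags with the highest loadings characterize a topic,
# which is how lists like the ones below are obtained.
top_tags_of_topic_0 = W[:, 0].argsort()[::-1][:5]
```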

– “Lord of the Rings”: {orcs, orc, hobbits, hobbit, mortensen, ring, tolkien, liv, astin, orlando}
As we can see, the tags consist of both names of actors and keywords.

– “Western”: {leone, sergio, ennio, kinski, hunter, showdown, western, harsh, bount, gold}.
Again, the most important tags are a mixture of names and keywords.

– “Horror”: {zombie, occult, slasher, ghost, demonic, possessed}.
This topic consists only of keywords; tags like “romero” would very likely fit here, but were probably not frequent enough to be kept.

Since tags are not restricted in any way, not all topics learned by the model can be (easily) interpreted. However, we can see that some concepts are hard-wired to specific people, for instance actors, directors, authors or even non-fictional persons.

A further approach to analyzing the “neighborhood” of tags is to embed them into the NMF space. To do this, we encode each tag “t” as a one-hot vector and project it with the learned weight matrix “W”: h = dot(t, W). The vector h contains the contribution of the tag “t” to each latent topic. The procedure is repeated for all available tags. Next, we can use the projected values to find the nearest neighbors of a specific tag “t” in this new space. Stated differently, if two tags are closely related, they should have their highest activation values in the same or similar latent topics, and small values for the remaining topics.
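A possible sketch of this projection and nearest-neighbor search, assuming W is the learned (tags × topics) weight matrix from the factorization above (here filled with random values for illustration); note that projecting a one-hot vector simply selects the corresponding row of W:

```python
import numpy as np

rng = np.random.default_rng(1)
n_tags, n_topics = 50, 10
W = rng.random((n_tags, n_topics))  # learned NMF weights: tags x topics

def embed(tag_id):
    """Project the one-hot encoding of a tag into the latent topic space."""
    t = np.zeros(n_tags)
    t[tag_id] = 1.0
    return t @ W  # h = dot(t, W): one activation per latent topic

# Embed all tags once.
H = np.vstack([embed(i) for i in range(n_tags)])

def nearest_neighbors(tag_id, k=5):
    """Return the k tags whose topic activations are most similar."""
    h = H[tag_id]
    # Cosine similarity between the query tag and all embedded tags.
    sims = H @ h / (np.linalg.norm(H, axis=1) * np.linalg.norm(h) + 1e-12)
    order = sims.argsort()[::-1]
    return [i for i in order if i != tag_id][:k]
```

With a real W, `nearest_neighbors` of the stemmed “cyborg” tag would surface the kind of list shown below.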

The results for some tags are pretty amazing. One such tag is ‘cyborgs’ or ‘cyborg’ which are both stemmed to the same value. Those are the nearest neighbors:
– androidscyborgs, futuristic, dystopic/dystopia, robots, sf/scifi, android, future, dick, authorphilip

The last tags probably refer to the science fiction author Philip K. Dick, who is well known for short stories and novels about androids, but also about many other things, including the nature of reality.

With this post, we have only scratched the surface of what can be done with tags created by a community, but one thing is for sure: tags have immense value for recommender systems, and that includes content-based systems.
