The Power Of Sets

We all know that good features are the workhorses of machine learning, which is why so much energy is put into this subject. The lucky ones among us can extract features directly from the data, while the rest need to hand-craft them. And that is clearly the crux, since there is no guarantee that hand-crafted features are really well suited to the problem at hand. We wrote about this before, and the thing is that, right now, hand-crafted features are our only option for the movie domain. The only exception is collaborative filtering, but that only works once a critical mass of data exists.

So, what’s new? Extracting useful latent topics from keywords is pretty straightforward with a matrix factorization and leads to good results when used on the covariance matrix X^T*X. If we consider only the highest-ranked words in a latent topic, the problem remains that the intersection with the keywords of an arbitrary movie is very small, roughly 1-2 items. That means if a latent topic consists of a ‘robbery’ theme like {heist, jewel, robbery, thief, …}, a movie with just the keyword “jewel” does not provide sufficient evidence that the theme is present. However, if “thief” and “jewel” are both present, the situation is much clearer.
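To make the "one keyword is not enough" argument concrete, here is a minimal sketch of such an overlap test. It uses only the four topic words listed above; the function names, the example movies and the threshold of two matching words are our assumptions for illustration, not part of any existing pipeline.

```python
# The four words named in the 'robbery' example above; the real topic has more.
ROBBERY_TOPIC = frozenset({"heist", "jewel", "robbery", "thief"})


def topic_overlap(movie_keywords, topic_words):
    """Return the topic words that also occur among the movie's keywords."""
    return set(movie_keywords) & set(topic_words)


def has_theme(movie_keywords, topic_words, min_overlap=2):
    """Treat the theme as present only if at least `min_overlap` topic words match.

    min_overlap=2 is an assumed threshold, chosen to mirror the
    "thief" + "jewel" example in the text.
    """
    return len(topic_overlap(movie_keywords, topic_words)) >= min_overlap


# Hypothetical keyword lists for two made-up movies.
weak_evidence = ["jewel", "wedding", "paris"]
strong_evidence = ["thief", "jewel", "paris"]

print(has_theme(weak_evidence, ROBBERY_TOPIC))    # only "jewel" matches
print(has_theme(strong_evidence, ROBBERY_TOPIC))  # "thief" and "jewel" match
```

With a threshold of two, the first movie is rejected and the second is accepted, which is exactly the distinction described above.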

Grouping features is not new, but it is severely limited by its computational complexity: the power set of a set with N elements has 2**N distinct subsets, which makes the approach infeasible even for fairly small N. For example, we took a latent topic with six words and counted, for every movie, how many of these words appear among its keywords: 235: #2, 78: #3, 3: #6. In other words, only 235 of approximately 25K movies contain two words from the concept and, even worse, only 78 contain three words and only 3 contain all six.
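The counting itself is simple. The following is a rough sketch under stated assumptions: the six topic words are placeholders (`w1` to `w6`), the movies are toy data, and the histogram counts movies with exactly k matching words, which may or may not be how the numbers above were tallied.

```python
from collections import Counter
from itertools import combinations


def overlap_histogram(movies, topic_words):
    """Map overlap size k to the number of movies sharing exactly k topic words."""
    topic = set(topic_words)
    return Counter(len(topic & set(keywords)) for keywords in movies)


def count_subsets(n):
    """Size of the power set of an n-element set."""
    return 2 ** n


def subsets_of_size(words, k):
    """All k-word groups that can be formed from the given words."""
    return list(combinations(sorted(words), k))


# Placeholder topic and toy movies, purely for illustration.
topic = {"w1", "w2", "w3", "w4", "w5", "w6"}
movies = [
    ["w1", "x"],         # one topic word
    ["w1", "w2"],        # two topic words
    ["w3", "w4", "y"],   # two topic words
    ["w1", "w2", "w3"],  # three topic words
    ["z"],               # no topic word
]

hist = overlap_histogram(movies, topic)
for k in sorted(hist):
    print(f"{hist[k]}: #{k}")

print(count_subsets(len(topic)))       # all candidate groups, including the empty one
print(len(subsets_of_size(topic, 2)))  # candidate word pairs only
```

Restricting the groups to a fixed small size, such as pairs, keeps the number of candidates manageable, but as the counts above show, the groups that remain are rarely matched by any movie.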

The lesson that we learned, again, is that even a broader concept like ‘robbery’ is only present in about 4% (1,000) of the movies and that it is almost impossible to find good word partners when we further restrict them to be part of a latent concept.
