Distilling Knowledge From Raw Data

More than 80% of the time, we have to fight very high-dimensional data that is also very, very sparse. Recently, we got our hands on a keyword dataset for movies. It can be summarized as follows: about 500K movies, 150K raw keywords, and about 10 keywords per movie with a variance of about 600. The catch is that about 40% of the keywords have a frequency of one. This sounds a lot like the problems we have written so much about.
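Summary numbers like these can be computed in a few lines. The following is only a minimal sketch: the input layout (a dict from movie id to a list of keywords), the function name `keyword_stats`, and the toy data are assumptions made for illustration, not the actual dataset or tooling.

```python
from collections import Counter
from statistics import mean, pvariance


def keyword_stats(movie_keywords):
    """Summarize a keyword dataset.

    movie_keywords: dict mapping a movie id to a list of keyword strings
    (hypothetical layout, chosen for this sketch).
    """
    # Document frequency: in how many movies does each keyword occur?
    counts = Counter(kw for kws in movie_keywords.values() for kw in set(kws))
    # Number of distinct keywords per movie.
    per_movie = [len(set(kws)) for kws in movie_keywords.values()]
    # Keywords that occur in exactly one movie.
    singletons = sum(1 for c in counts.values() if c == 1)
    return {
        "movies": len(movie_keywords),
        "keywords": len(counts),
        "mean_per_movie": mean(per_movie),
        "var_per_movie": pvariance(per_movie),
        "singleton_fraction": singletons / len(counts),
    }


# Toy data, invented for illustration only.
toy = {
    "m1": ["cowboy", "outlaw", "rifle"],
    "m2": ["cowboy", "saloon"],
    "m3": ["jury", "trial", "lawyer", "verdict"],
    "m4": ["trial"],
}
stats = keyword_stats(toy)
```

On the real data, a high `singleton_fraction` is the signal that most of the vocabulary cannot be learned from directly.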

To successfully train an RBM/AE model on such bag-of-words data, we need to limit the number of keywords. The problem is that high-frequency keywords are not very discriminative, and even with 10K selected keywords, the frequency of the “bottom” keywords is pretty low. Therefore, we decided to start with a matrix factorization of the covariance matrix to see whether we can extract useful latent topics from the keyword co-occurrences. For training, we used the 5K most frequent keywords.

And indeed, the latent topics often make sense, which is not really surprising, since factorizing the covariance matrix is much easier than learning a factor model from the data.

music: audience, music-band, tour, drums, blues, drummer, tap-dancing, folk, stage, saxophone
law: arrest, testimony, court, verdict, press, trial, jury, lawyer
western: colt, cowboys, outlaws, rifler, hat, shooter, gunslinger, boots, wild-west
martial-arts: fight, samurai, kendo, kung-fu, black-belt, roundhouse-kick, katana
demonic: supernatural, possession, exorcism, occult, demon, devil, seance, medium, ghost

Some of the topics are not that easy to summarize with a single word, but that is not surprising either, since the extracted topics are latent.
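To make the idea of factorizing the keyword covariance matrix concrete, here is a minimal sketch. It assumes an eigendecomposition as the factorization and a dense binary movie-by-keyword matrix; the exact method used for the topics above is not specified, and the function name `top_keywords_per_topic` and the toy data are invented for illustration.

```python
import numpy as np


def top_keywords_per_topic(X, vocab, n_topics=2, n_top=3):
    """List the strongest keywords of the leading covariance eigenvectors.

    X: binary (movies x keywords) array, vocab: list of keyword names.
    """
    X = np.asarray(X, dtype=float)
    C = np.cov(X, rowvar=False)            # keyword-by-keyword covariance
    eigvals, eigvecs = np.linalg.eigh(C)   # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:n_topics]
    topics = []
    for idx in order:
        v = eigvecs[:, idx]
        # The sign of an eigenvector is arbitrary: orient it so that the
        # loading with the largest magnitude is positive.
        if v[np.argmax(np.abs(v))] < 0:
            v = -v
        top = np.argsort(v)[::-1][:n_top]
        topics.append([vocab[i] for i in top])
    return topics


# Toy data: 4 "western" movies, 2 "law" movies, 4 movies without keywords.
vocab = ["cowboy", "outlaw", "rifle", "jury", "trial", "lawyer"]
X = np.array(
    [[1, 1, 1, 0, 0, 0]] * 4
    + [[0, 0, 0, 1, 1, 1]] * 2
    + [[0, 0, 0, 0, 0, 0]] * 4
)
topics = top_keywords_per_topic(X, vocab)
```

On this toy input, the two leading components separate the western keywords from the law keywords. With a 5K vocabulary the covariance matrix is only 5K by 5K, so a dense decomposition stays cheap no matter how many movies there are.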

However, a full matrix factorization of the data is unlikely to succeed, since 99.8% of each sample consists of zeros and only 0.2% really carries information. The situation is similar to collaborative filtering, where more than 99% of the data is unknown and thus a modified SVD approach is required to factor the data. In a nutshell, it can often be shown that datasets contain useful patterns, but it is a different story to use this knowledge for feature engineering. Too often, the sparsity of the input data prevents the use of off-the-shelf algorithms for feature learning.
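For readers unfamiliar with the collaborative filtering analogy, the following sketch shows one common form of such a modified factorization: the factors are fitted with stochastic gradient descent on the observed entries only, and unknown entries are ignored. It is an illustration under assumptions, not a method we applied to the keyword data; the function names, hyperparameters, and toy matrix are all hypothetical.

```python
import numpy as np


def masked_factorization(R, mask, rank=2, lr=0.05, reg=0.01, epochs=300, seed=0):
    """Fit R ~ P @ Q.T using only the entries where mask is True."""
    rng = np.random.default_rng(seed)
    n, m = R.shape
    P = 0.1 * rng.standard_normal((n, rank))
    Q = 0.1 * rng.standard_normal((m, rank))
    rows, cols = np.nonzero(mask)
    for _ in range(epochs):
        for i, j in zip(rows, cols):
            err = R[i, j] - P[i] @ Q[j]
            p_old = P[i].copy()
            # Gradient step on the regularized squared error of one entry.
            P[i] += lr * (err * Q[j] - reg * P[i])
            Q[j] += lr * (err * p_old - reg * Q[j])
    return P, Q


def observed_rmse(R, mask, P, Q):
    """Root mean squared error over the observed entries only."""
    diff = (R - P @ Q.T)[mask]
    return float(np.sqrt(np.mean(diff ** 2)))


# Toy rank-1 matrix with three entries treated as unknown.
R = np.outer([1.0, 1.5, 2.0, 0.5], [1.0, 0.5, 1.5])
mask = np.ones_like(R, dtype=bool)
mask[0, 2] = mask[2, 1] = mask[3, 0] = False

P0, Q0 = masked_factorization(R, mask, epochs=0)   # untrained baseline
P, Q = masked_factorization(R, mask)
rmse_before = observed_rmse(R, mask, P0, Q0)
rmse_after = observed_rmse(R, mask, P, Q)
```

The important difference to a plain SVD is that the loop never touches the masked entries. Note that the analogy is not exact: in the keyword data the zeros are known values rather than missing ones, which is part of why off-the-shelf methods do not transfer directly.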

