Fighting Sparsity with Concepts

In the last post we mentioned that about 30% of the available keywords have a frequency of one. Stated differently, we throw away 30% of our valuable data!

We further mentioned that these keywords are very likely to be semantically related like ‘police-station’ and ‘police-officer’ but without additional methods this fact cannot be used. Next, we will describe a way out of this misery.

Recently, we talked about taxonomies and concepts to describe a domain in tree-like way. With a (too) simple example, we can illustrate the power of it: Let us assume that we want to describe the movie concept “police” with some expressive keywords like “cop, detective, gangster, police-station, undercover, robbery, chase”.

To make use of the new concept, we introduce a new feature “police” that is “1” whenever a keyword of the movie is part of it and “0” otherwise. The last challenge is now to map the (one-time) keywords to one or more concepts, which can be done by bi-gram matching or some other approach. The drawback is that the output needs to be checked by a real human once.

Now, the one-time keywords are utilized by compressing them into a set of concepts that are used as additional features. To return to our running example, all the one-times keywords police-{detective,officer,negotiator} would be mapped to the “police” concept which would further strengthen the (semantic) connection between movies that contain those words.

Without this step, it is very likely that we lose a lot of expressive power of our model because connections between movies could not be established or at least not as strong as with concepts.

To summarize, it is very likely that most of the one-time keywords can be encoded with a small set of common concepts, like ‘police’, ‘zombies’, ‘disaster’, or ‘good-vs-evil’. Stated differently, the keywords are clustered and a cluster represents a binary feature that tells us if a specific keyword is part of it or not.

And last but not least, this does not only help for one-time keywords but also for frequent keywords, because connections between movies can be modeled more efficiently: Without the genre, two movies, one with “ranch”, the other with “cowboy” are not directly related, but with the concept “western” that contains both words, there is now a direct connection between them. Sure this example was a bit suggestive, but with a good taxonomy this will also work for non-trivial cases.


Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )


Connecting to %s