Higher-Order Features

Regardless of what we are trying to learn, the success of training a good model largely depends on the features. It is no real surprise that “flat” features learned by a linear model are not powerful enough to disentangle most concepts. This is in contrast to neural nets (NN), where features are learned with respect to the problem at hand. However, in the domain of text and this includes keywords, models that do not consider relations between words (co-occurrences) are limited.

The following example demonstrates such limitations. We assume that we want to find related movies for a given query. The query might be a movie itself, or just some descriptive keywords from a dictionary. If the query is q={zombie, infection}, only movies with these keywords will be returned. The ranking starts with all movies that contain both keywords and then all with one of them. The order of the latter depends on the weights for each keyword. With this scheme, movies with no overlapping keywords will never be returned, which is a serious limitation.

A more powerful model would also consider how two keywords are related. For instance, good partners for “zombie” could be “undead”, “virus” or “creature”. In other words, it would make sense, to return also movies that contain related keywords even if “zombie” itself is not present in those movies. Such a model could also learn to map variants of words to a single concept, like “zombies” -> “zombie”, or “ghul”->”zombie” (because both are “undead” creatures).

To model such behavior, we could use a polynomial model of degree 3 which includes a model of degree 2:
degree_2 = W_i_j * q_i * d_j
degree_3 = W_i_j_k * q_i * d_j * d_k + degree_2
where “q” is the query, and “d” a document. Such a higher order model is able to relate triplets of words (a, b, c) in a positive, or negative way. For instance, (zombie, undead, virus) should be weighted higher than (zombie, undead, car) and similar, (zombie, undead) should be weighted higher than (zombie, car).

The drawback of such models is that for a large dictionary the weight matrix |words| x |words| won’t fit into memory which is the reason why low-rank approximations are often used: W = U * V * Y, where each low-rank matrix induces latent concepts of size |words| x N. Details about the training will be discussed in the next post.

In a nutshell, the described model captures higher order relations and assigns a single score to a pair of movies that describes how similar the pair is. That means, we can rank movies by sorting their score with respect to a query movie.


Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s