Fun With No Teachers

For the domain of movies there are lots of labels, for instance ratings created by users, genres assigned to movies or themes to capture more coarse aspects of movies. But regardless of the availability of labeled data, it is very likely that there is much more unlabeled data. In case of movies, this includes for example, reviews, descriptions and meta information like budget, certificates, places, music and involved persons, but also information like relations between movies and so on. In other words, there are patterns everywhere that are waiting to be extracted.

Because we have no clear goal, except for “mining patterns”, supervised methods are not suited for the task. In other words, we want the “student” to explore the data without the supervision of any teacher and furthermore, we want the process to be as lightweight as possible. This time, we do not use a neural network or a sophisticated clustering algorithm. Instead we use a very simple competitive approach that uses the winner-takes-all strategy.

The description of the algorithm is pretty easy. We have X that contains our input data, L2 normalized and we have W, a random matrix that is used to capture K latent topics in the data, also L2 normalized. The following steps are repeated in a loop:

(1) select an arbitrary x from X
(2) h = dot(W, x)
(2.1) j = argmax(h)
(3) W[j] += learning_rate*x
(3.1) L2 normalize W[j]

The result is very similar to a matrix factorization with rank K.

With the normalization of W and X, the dot product (2) is the cosine similarity of all topics of W and x and the winner is the topic with the largest magnitude (2.1). Furthermore, the normalization also prevents that the norm of W becomes too large (3.1). The winning cluster is updated with the information from x (3) multiplied by a learning_rate that is usually decayed over time. Step (3) ensures that features that are captured by the chosen topic “j” gain gradually more weight.

Besides the ability to explore potential “topics” in the data, the approach can be also used to cluster the data since (2.1) returns an integer that represents the cluster ID. The intuition is that the dot product of some input x and the topics in W is large, when the overlap of features is maximal. For instance, a crime movie with the keywords x=[heist, robbery, police, officer] is not likely to have much in common with latent topics like sci-fi & aliens, love & romance or sword & sandals.

Bottom line, it is amazing how much we can learn from the data with such a simple algorithm and without any kind of error signal from a teacher. It is not surprising that the approach is not powerful enough to capture higher-order relations between features, but the results are still impressive.


Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )


Connecting to %s