Topic Neurons

Just for fun, we created a co-occurrence matrix and used non-negative matrix factorization to study some latent topics in the results. One that caught our attention was a “detective” topic:

– amateur-sleuth, private-detective, suspect, polic-detective
– femme-fatal, blackmail, investigation, murder, cover-up, killer
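To make the factorization step concrete, here is a minimal pure-Python sketch. It rests on several assumptions: the toy movies are invented, the co-occurrence matrix simply counts how often two keywords share a movie, and the factorization uses the standard multiplicative update rules. The post does not say which NMF variant or library was actually used, so treat this as an illustration only.

```python
import random

def cooccurrence(movies, vocab):
    """Symmetric keyword-by-keyword counts of shared movies (diagonal left at 0)."""
    index = {kw: i for i, kw in enumerate(vocab)}
    C = [[0.0] * len(vocab) for _ in vocab]
    for keywords in movies:
        present = [index[kw] for kw in set(keywords) if kw in index]
        for i in present:
            for j in present:
                if i != j:
                    C[i][j] += 1.0
    return C

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def frobenius_error(V, W, H):
    """Squared Frobenius norm of V - W*H."""
    WH = matmul(W, H)
    return sum((v - x) ** 2 for rv, rx in zip(V, WH) for v, x in zip(rv, rx))

def nmf(V, rank, iterations=200, seed=0, eps=1e-9):
    """Factorize V ~ W*H with non-negative W, H via multiplicative updates."""
    rng = random.Random(seed)
    W = [[rng.random() + 0.1 for _ in range(rank)] for _ in V]
    H = [[rng.random() + 0.1 for _ in V[0]] for _ in range(rank)]
    for _ in range(iterations):
        Wt = transpose(W)
        num, den = matmul(Wt, V), matmul(matmul(Wt, W), H)
        H = [[h * a / (b + eps) for h, a, b in zip(rh, ra, rb)]
             for rh, ra, rb in zip(H, num, den)]
        Ht = transpose(H)
        num, den = matmul(V, Ht), matmul(W, matmul(H, Ht))
        W = [[w * a / (b + eps) for w, a, b in zip(rw, ra, rb)]
             for rw, ra, rb in zip(W, num, den)]
    return W, H

def top_keywords(H, vocab, k=3):
    """For each latent topic (row of H), the k keywords with the largest weight."""
    return [[kw for _, kw in sorted(zip(row, vocab), reverse=True)[:k]] for row in H]

# Hypothetical toy data, not taken from the real movie collection.
movies = [
    ["private-detective", "suspect", "murder"],
    ["murder", "investigation", "killer"],
    ["private-detective", "femme-fatal", "blackmail"],
    ["love", "wedding", "jealousy"],
    ["love", "jealousy", "affair"],
    ["wedding", "affair", "love"],
]
vocab = sorted({kw for m in movies for kw in m})
V = cooccurrence(movies, vocab)
W, H = nmf(V, rank=2)
topics = top_keywords(H, vocab, k=3)
```

Each row of `H` is one latent topic, and its largest entries give the top-k keywords that define a topic neuron.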

With the top-k keywords listed above, it should be possible to find some of the classic “private eye” themes in movies. Now the question is: how do we model the excitation of the neuron with respect to some arbitrary input? Stated differently, the input to the neuron is the set of keywords of a movie, and the output should be a real number that describes how well the given keywords fit the topic.

In our basic setup, we do not use weights for the different topic keywords. The excitation of the neuron is just a modified Dice coefficient with binary input values: #{overlap_keywords(movie, neuron)} / #{keywords in neuron}. It ranges from 0 (no shared keyword) to 1 (the movie contains every keyword of the neuron).
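The unweighted excitation can be sketched in a few lines of Python. The function name `excitation` and the example movie are assumptions made for illustration; the keyword list is copied from the detective topic above, spelled as it appears there.

```python
# Keywords of the "detective" topic neuron, as listed in the post.
DETECTIVE = [
    "amateur-sleuth", "private-detective", "suspect", "polic-detective",
    "femme-fatal", "blackmail", "investigation", "murder", "cover-up", "killer",
]

def excitation(movie_keywords, neuron_keywords):
    """Share of the neuron's keywords that also occur in the movie (0.0 to 1.0)."""
    neuron = set(neuron_keywords)
    if not neuron:
        return 0.0
    return len(neuron & set(movie_keywords)) / len(neuron)

# Hypothetical movie: three of its four keywords belong to the neuron.
movie = ["private-detective", "murder", "suspect", "rain"]
print(excitation(movie, DETECTIVE))  # 3 of 10 neuron keywords match -> 0.3
```

Because the denominator is the size of the neuron and not of the movie, a movie with many unrelated keywords is not penalized for them.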

Actually, the neuron allows a lot of variations of the topic, because any subset of the keywords can excite it. However, there are limitations. For instance, if movie A has only ‘femme-fatal’ and movie B has only ‘private-detective’, the relation between them is very vague, but the output of the neuron is nevertheless positive in both cases.

Stated differently, if the overlap between the neuron and a movie falls below a certain threshold, the excitation cannot be used to relate movies reliably, as in this example:
A = there is a _private detective_
B = he observes a _suspect_
C = a man was accused of _murder_

To grasp even a small picture of the movie, we need at least two facts, better three or more. That is obvious, but it needs to be explicitly modeled in an algorithm.
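One way to model this explicitly is a hard minimum on the number of overlapping keywords. The sketch below is only one possible reading of the idea: the name `thresholded_excitation` and the default `min_overlap=2` are assumptions, chosen to match the "at least two facts" rule of thumb.

```python
DETECTIVE = [
    "amateur-sleuth", "private-detective", "suspect", "polic-detective",
    "femme-fatal", "blackmail", "investigation", "murder", "cover-up", "killer",
]

def thresholded_excitation(movie_keywords, neuron_keywords, min_overlap=2):
    """Excitation that stays at 0.0 until at least `min_overlap` keywords match."""
    neuron = set(neuron_keywords)
    overlap = len(neuron & set(movie_keywords))
    if not neuron or overlap < min_overlap:
        return 0.0
    return overlap / len(neuron)

# Facts A, B and C from the example, first alone and then combined.
fact_a = ["private-detective"]
fact_b = ["suspect"]
fact_c = ["murder"]

single = [thresholded_excitation(f, DETECTIVE) for f in (fact_a, fact_b, fact_c)]
combined = thresholded_excitation(fact_a + fact_b + fact_c, DETECTIVE)
print(single)    # each isolated fact stays below the threshold
print(combined)  # three matching facts out of ten keywords
```

A single fact no longer produces a positive output, while the three facts together do.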

The problem has a larger impact. It is likely, for example, that a non-crime movie contains some keywords that are present in arbitrary topic neurons, including the detective neuron. A romantic movie, for instance, could have a “femme-fatale” but otherwise not really be about detectives. With a minimum overlap threshold, the detective neuron would not get excited by such a movie. The encoding would then clearly indicate only a “romantic” or “twisted” topic, with no useless activation of the “police” topic.
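The effect on a whole encoding can be illustrated with two neurons. In this sketch the "romantic" keyword set, the example movie and the function name `encode` are all hypothetical; only the detective keywords come from the post.

```python
NEURONS = {
    "detective": {
        "amateur-sleuth", "private-detective", "suspect", "polic-detective",
        "femme-fatal", "blackmail", "investigation", "murder", "cover-up", "killer",
    },
    # Hypothetical keyword set, invented for this illustration.
    "romantic": {"love", "romance", "affair", "jealousy"},
}

def encode(movie_keywords, neurons, min_overlap=2):
    """Map each neuron name to its excitation, zeroing overlaps below the threshold."""
    movie = set(movie_keywords)
    encoding = {}
    for name, keywords in neurons.items():
        overlap = len(movie & keywords)
        active = bool(keywords) and overlap >= min_overlap
        encoding[name] = overlap / len(keywords) if active else 0.0
    return encoding

# A romantic movie that happens to feature a femme fatale.
movie = ["love", "affair", "femme-fatal"]
with_threshold = encode(movie, NEURONS)                    # detective stays silent
without_threshold = encode(movie, NEURONS, min_overlap=1)  # detective fires weakly
```

With the threshold, only the romantic neuron is active; without it, the single ‘femme-fatal’ keyword is enough to produce a small detective activation.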

Again, the devil is in the details. Given the sparsity of our movie metadata, it might not be possible for a considerable number of movies to reach the required threshold.
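How serious this is for a given collection can be estimated by counting the movies that reach the threshold for at least one neuron. The following sketch uses invented toy data and a hypothetical helper named `coverage`; the real numbers depend entirely on the actual metadata.

```python
def coverage(movies, neurons, min_overlap=2):
    """Fraction of movies that reach `min_overlap` for at least one neuron."""
    if not movies:
        return 0.0
    reached = sum(
        1 for keywords in movies.values()
        if any(len(set(keywords) & set(n)) >= min_overlap for n in neurons.values())
    )
    return reached / len(movies)

# Hypothetical neurons and movies, for illustration only.
neurons = {
    "detective": ["private-detective", "suspect", "murder", "investigation"],
    "romantic": ["love", "romance", "affair", "jealousy"],
}
movies = {
    "movie-1": ["private-detective", "murder", "rain"],
    "movie-2": ["love"],                      # too sparse for any neuron
    "movie-3": ["affair", "jealousy"],
    "movie-4": ["suspect", "love", "train"],  # one keyword per neuron only
}
print(coverage(movies, neurons))                 # share of movies with min_overlap=2
print(coverage(movies, neurons, min_overlap=1))  # share with the threshold disabled
```

Comparing the two values shows how many movies a given threshold would leave without any active neuron.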
