In Every Binary There Is Some Weight

Some time ago, we experimented with Siamese Networks to build our own similarity measure. The idea to project our raw feature data into a new concept space would solve lots of our problems and it would help to build a semantic neighborhood for movies. As we previously discussed, the main problem is that we need a label to model the similarities of movie pairs at some level. The genre is a good start but more work has to be done here.

But even if the genre would suffice as a label, the next hurdle would be to adjust the input weights of the keywords. Let us demonstrate this with an example: In a crime-drama movie with a strong focus on cops and heists, keywords like ‘love’ or ‘piano’ are important, otherwise they would not be part of the meta data, but the weight of such words is very likely different compared to words like ‘thief’ ‘robbery’ or ‘loot’.

To use the well-known TF-IDF statistic sound like a good idea but we have a problem here. In contrast to real documents, our term frequency is 1 (present) or 0 (not present). The other parts are less problematical. We have the number of movies and we have the number of times a keyword is present in all movies. What is missing, is a replacement for the term frequency TF. Since it is very likely that the weight also depends on the genre, we will suggest a conditional TF value.

Again, it is easy to demonstrate this with an example: In an action movie the keyword ‘love’ might describe one aspect of the movie, maybe the twisted relationship of the main character, but in the context of a romantic movie, this keyword might describe the whole mood of the movie or a main concept of it. Thus, the weight of the keyword should be adjusted for the different genres.
In one line: Our TF value describes how popular a certain keyword in a specific genre.

Combined with the genre as a label, the new approach is much better suited to discriminate between items with similar features because now, the additional weights of each feature tells the model how important it is. In the case of binary features, the keyword ‘love’ is always treated the same, even if it is only a minor aspect of one movie but very important in the other.

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s