Some time ago, we experimented with Siamese networks to build our own similarity measure. Projecting our raw feature data into a new concept space would solve many of our problems and help to build a semantic neighborhood for movies. As we previously discussed, the main problem is that we need a label to model the similarity of movie pairs at some level. The genre is a good start, but more work has to be done here.
But even if the genre sufficed as a label, the next hurdle would be to adjust the input weights of the keywords. Let us demonstrate this with an example: in a crime-drama movie with a strong focus on cops and heists, keywords like ‘love’ or ‘piano’ are important, otherwise they would not be part of the metadata, but their weight is very likely different from that of words like ‘thief’, ‘robbery’ or ‘loot’.
Using the well-known TF-IDF statistic sounds like a good idea, but we have a problem here. In contrast to real documents, our term frequency is either 1 (present) or 0 (not present). The other parts are less problematic: we have the number of movies, and we have the number of movies in which a keyword is present. What is missing is a replacement for the term frequency TF. Since it is very likely that the weight also depends on the genre, we suggest a conditional TF value.
Again, it is easy to demonstrate this with an example: in an action movie, the keyword ‘love’ might describe one aspect of the movie, maybe the twisted relationship of the main character, but in the context of a romantic movie, this keyword might describe the whole mood of the movie or one of its main concepts. Thus, the weight of the keyword should be adjusted for the different genres.
In one line: our TF value describes how popular a certain keyword is in a specific genre.
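To make the idea concrete, here is a minimal sketch of such a genre-conditional weight. The text above does not fix an exact formula, so the following are assumptions: the conditional TF is the fraction of movies in a genre that carry the keyword, the IDF is the plain log(N / df) over all movies, and each movie has exactly one genre. The toy data and the function name `conditional_tf_idf` are purely illustrative.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy data: one genre and a set of keywords per movie.
movies = [
    {"genre": "action",  "keywords": {"heist", "thief", "love"}},
    {"genre": "action",  "keywords": {"heist", "loot"}},
    {"genre": "action",  "keywords": {"thief", "robbery"}},
    {"genre": "romance", "keywords": {"love", "piano"}},
    {"genre": "romance", "keywords": {"love", "wedding"}},
]

def conditional_tf_idf(movies):
    """Return a dict mapping (genre, keyword) to a weight.

    Assumed definitions (not taken from the article):
      TF(keyword | genre) = movies in genre with keyword / movies in genre
      IDF(keyword)        = log(all movies / movies with keyword)
    """
    n_movies = len(movies)
    genre_sizes = Counter(m["genre"] for m in movies)
    doc_freq = Counter()             # keyword -> number of movies containing it
    genre_kw = defaultdict(Counter)  # genre -> keyword -> number of movies

    for m in movies:
        for kw in m["keywords"]:
            doc_freq[kw] += 1
            genre_kw[m["genre"]][kw] += 1

    result = {}
    for genre, counts in genre_kw.items():
        for kw, count in counts.items():
            tf = count / genre_sizes[genre]
            idf = math.log(n_movies / doc_freq[kw])
            result[(genre, kw)] = tf * idf
    return result

weights = conditional_tf_idf(movies)
print(weights[("romance", "love")], weights[("action", "love")])
```

In this toy set, ‘love’ appears in every romance movie but in only one of three action movies, so its weight in the romance genre comes out higher than in the action genre. A real data set would also have to deal with movies that belong to several genres, which this sketch ignores.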
Combined with the genre as a label, the new approach is much better suited to discriminating between items with similar features, because the additional weight of each feature now tells the model how important that feature is. With binary features, in contrast, the keyword ‘love’ is always treated the same, even if it is only a minor aspect of one movie but very important in another.