In a recently published paper [arxiv:1701.06538], the authors introduced a mixture of experts which is not new. However, the twist is to use only a small subset of those experts which cannot be done with an ordinary softmax, since the output of a softmax is always -slightly- positive. The idea is to keep only the truly top-k experts by setting the values, before applying the softmax operation, of all non-top-k experts to a large negative value. The result is that the actual output value at the corresponding position is zero.
With numpy, x is a vector, this is actually straightforward:
def keep_topk(x, k, neg=-10):
rest = x.shape - k
idx = np.argsort(x)[0:rest]
x[idx] = neg
We just sort the values of x, getting the indicies for the |x|-k positions and set the values to -10.
But since we want to use all the nice features of Theano, we need to port the code to the tensor world. Frankly, this is no big deal either, but it requires a tiny adaption since we cannot assign values to tensors directly.
def keep_topk(x, k, neg=-10):
rest = x.shape - k
idx = T.argsort(x)[0:rest]
return T.set_subtensor(x[idx], neg)
And that’s it.
The reason why we spent some time with the porting is that we also had the idea to use soft attention to model the final prediction as a decision of a small set of experts. The experts might have different opinions and with the gating, we can blend different confidence levels with different outputs.
Besides the actual preference estimation with a trained model on user data, we are facing the problem to present the results to users. We believe that it is always a good idea to explain why a model decided to recommend a movie to a user. In the most elementary form, we just score all movies and present the top-k selections as a list to the user. However, the movie with the highest score might not be always the best choice, since it also depends on the mood and many other things to decide what is the best item to watch (today).
In contrast to a streaming service, we are limited to the movies that will be aired in the next ~24 hours. But nevertheless, a grouping of items is still beneficial to assist users to pick the best match. The most basic choice is to use genres to group items, but on the other hand, the clustering is very coarse since ‘action’ has too many facets. In case of collaborative filtering, movies could be grouped by the proximity in the space of some learned factor model. This has the advantage that more semantic is captured, but the method requires lots of ratings and is therefore not always feasible. An alternative is to embed tags of movies, but this also requires lots of data and does not work for movies without tags.
Since all methods require extra meta data, we need to use directly the feature data of movies to learn the embedding. But since similarity can be highly subjective, it is not clear if a learned model is actually useful to assist a user to select an item with regards to the mood and other impressions. However, even if the groups do not fully capture those latent factors, a “semantic” re-arrangement will often help users to reduce the number of required decisions to pick the best match. For instance, if a user is in the mood for ‘zombies’ comparing only horror movies is much faster than comparing all recommended items.
Bottom line, with the rise of streaming services, users have access to a very large catalog of items and the ability to watch those items instantly. Thus, a service has to deliver good predictions what to watch next, but also to group those items because of the sheer amount of possible matches. So, the question is once all items are scored, how to aggregate the results to maximize the benefits for the user. This is clearly more a usability issue than a machine learning issue.
After the successful training of a larger model using the most frequent keywords, we decided to focus on a simple clustering scheme. To recap the situation: we learned a model on the top-400 feature words and we set the number of latent topics to 50. Furthermore, the input data was combined with the co-occurence matrix.
We thought some time about a function to convert the input space into the feature space. Due to the very high sparsity in the original space, we decided to start with a simple mapping that just calculates the overlap of a topic neuron t with the feature words x: |x AND t| / |t|. This procedure is repeated for each topic neuron. The result is a vector with 50 overlap measurements for each data sample x.
Before we continue, we need to elaborate on some restrictions of the training data. As we mentioned before, each movie is described a small set of keywords to capture the mood and the topics of a movie. In general, it is very likely that the meta data is either incomplete or that the keywords are not sufficient to capture the whole depth of a topic.
If we take ‘Captain America’ as an example, some keywords could be ‘hero’, ‘soldiers’, ‘superpower’ or ‘patriotism’. Since such meta data is usually provided by humans, we expect some variance and different emphasis regarding the topics. For our example, we consider the case that the keywords focus on the ‘soldier’-theme and therefore, the ‘superhero’-theme cannot be clearly
inferred from the data.
As a preliminary step, we analyzed the distance in the feature space following a information retrieval approach to find similar movies. The issue we described above explains, why our approach returned so many ‘war’ movies when we performed top-k retrieval for ‘Captain America’. The situation is no single case, it can happen with every movie because it is possible that a movie was only tagged with a few keywords to describe all topics or that the keywords only describe a single topic.
However, except for some hurdles we found the system works very well on some genres. For instance, the nearest neighbors of the movie ‘Batman’ are: Batman Returns, Batman & Robin, Batman Forever, Blade, Spider-Man and Daredevil.
As we can see, most of the results fall into the superhero genre as one would except. Furthermore, the closest hits are actually Batman films. A perfect match would be all Batman films first but as we mentioned, the actual meta data of the movies is not sufficient to infer such a ranking.
Here we can see a clear limitation of all these methods: A human would be able to extract more specific latent topics for a Batman movie that would allow to cluster all of them, if appropriate, together. However, a machine has to rely on the given meta data and if the data is limited, so are the inferred latent topics. That is one reason why a collaborative approach is sometimes favored because it does not require to provide the meta data explicitly. Instead, the latent factors are derived by the given ratings.
In other words, we need to combine our approach with further meta data to perform a semantic clustering. We did some tests with the genre information but without a proper weighting the influence of the genre leads a very strong bias.
With these results, we do not expect that a ‘perfect’ clustering is possible since some movies themes are not visible if we only consider the meta data.