We are still searching for the holy grail for handling rare keywords that are not part of a top-k keyword set. However, the evaluations so far make it rather clear that no solution will ever be optimal, at least not without using a large corpus of words to infer the context of those keywords. The major problem is that even in large-scale training sets, many keywords still have a very low frequency. But since we are very modest, a continual improvement of our current method would suffice.
We experimented with non-negative matrix factorizations, and at least for the co-occurrence matrix, the results are very good. The idea is to take the words with the highest weights from each latent topic, then select an arbitrary keyword that is not in the top-k keyword set and find the best-matching latent topic.
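As a minimal sketch of this step, the following factorizes a toy keyword co-occurrence matrix with scikit-learn's NMF and reads off the highest-weighted words per latent topic. The vocabulary, the random counts, and the helper `top_words` are illustrative stand-ins, not the actual data or code from our experiments:

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
# Hypothetical vocabulary; the real one holds the top-k keyword set.
vocab = ["medical", "doctor", "nurse", "hospital", "space", "alien", "ship", "crew"]

# Toy symmetric co-occurrence counts (real data: keyword-by-keyword counts over movies).
C = rng.integers(0, 5, size=(len(vocab), len(vocab))).astype(float)
C = (C + C.T) / 2  # make it symmetric like a true co-occurrence matrix

model = NMF(n_components=2, init="nndsvda", random_state=0, max_iter=500)
W = model.fit_transform(C)  # word-topic weights, shape (|vocab|, n_topics)

def top_words(topic, k=4):
    """Return the k words with the highest weight in the given latent topic."""
    idx = np.argsort(W[:, topic])[::-1][:k]
    return [(vocab[i], W[i, topic]) for i in idx]

for t in range(2):
    print(t, top_words(t))
```

With the real model, `n_components` would be 50 and `k` would be 20, matching the setup described below.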
Here is an example: we want to find “semantic neighbors” for the word ‘cure’, which is not part of the top-k keyword set. From an NMF model, we extracted the 20 words with the highest weights from each latent topic. Then we iterated over all movies that contain the word ‘cure’ and computed the overlap of each movie's other keywords with the top words of each latent topic. Because each keyword in the intersection has a weight, we can form a weighted sum for a specific topic. Finally, each topic has a dedicated score, or zero, that indicates how well the keywords of a movie fit into this latent topic. The NMF model we trained contains 50 latent topics, and the topic with the highest score for this keyword clearly belongs to the medical domain: medical, doctor, nurse, hospital, health, transplant, surgery. In other words, by considering all movies in our data set in combination with the latent topics, it is often possible to assign a low-frequency keyword to a dedicated topic by averaging the weights.
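The scoring procedure above can be sketched in a few lines. The tiny `topic_top_words` and `movies` dictionaries here are hypothetical placeholders for the real NMF output and movie keyword sets; the function mirrors the described steps of iterating over movies, intersecting keywords with each topic's top words, summing the weights, and averaging:

```python
# Hypothetical top words per latent topic (word -> NMF weight).
topic_top_words = {
    0: {"medical": 0.9, "doctor": 0.8, "nurse": 0.7, "hospital": 0.6},
    1: {"space": 0.9, "alien": 0.8, "ship": 0.7, "crew": 0.6},
}
# Hypothetical movie -> keyword sets.
movies = {
    "m1": {"cure", "doctor", "hospital", "romance"},
    "m2": {"cure", "nurse", "medical"},
    "m3": {"space", "alien", "crew"},
}

def topic_scores(keyword):
    """For each latent topic, average over all movies containing `keyword`
    the summed weights of the movie's other keywords that appear among the
    topic's top words."""
    scores = {t: 0.0 for t in topic_top_words}
    hits = 0
    for kws in movies.values():
        if keyword not in kws:
            continue
        hits += 1
        others = kws - {keyword}
        for t, top in topic_top_words.items():
            scores[t] += sum(top[w] for w in others if w in top)
    return {t: s / hits for t, s in scores.items()} if hits else scores

print(topic_scores("cure"))  # the medical topic (0) should score highest here
```

On the real data, the argmax over these averaged scores picks the latent topic that the rare keyword is assigned to.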
The next step is to convert this knowledge into the feature space. A straightforward solution is to use each latent topic as a kind of pooled feature that condenses a group of features into a single value, like a “medical” NMF neuron that combines several related words into a high-level concept. Still, to really benefit from those new features, the encoding must be hierarchical, because the weight of such a condensed feature is much higher than that of ordinary single-keyword features.
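One way to sketch such pooled features, under the assumption that each topic's value is simply the weighted sum of its matching keywords: the `alpha` scaling factor is a hypothetical knob addressing the weighting concern above, down-scaling the condensed features so they do not dominate the single-keyword features.

```python
# Hypothetical top words per latent topic (word -> NMF weight).
topic_top_words = {
    "medical": {"medical": 0.9, "doctor": 0.8, "nurse": 0.7},
    "space": {"space": 0.9, "alien": 0.8},
}

def pooled_features(keywords, alpha=0.5):
    """Condense a movie's keyword set into one value per latent topic.
    `alpha` is an illustrative scaling factor that keeps the pooled
    features from outweighing ordinary single-keyword features."""
    return {
        topic: alpha * sum(w for kw, w in top.items() if kw in keywords)
        for topic, top in topic_top_words.items()
    }

print(pooled_features({"doctor", "nurse", "alien"}))
```

A hierarchical encoding would then place these pooled values on a separate level of the feature vector rather than mixing them directly with the raw keyword indicators.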