The last year was quite a ride with lots of new ideas, controversial discussions and real-world examples that large-scale machine learning is actually more than a flash in the pan. However, we still have the feeling that more basic research should be done. For instance, a while ago we stumbled about an excellent blog that explains neural nets in a very descriptive but nevertheless still formal way. Since manifolds are a very powerful tool to explain what the representation that a net learned looks like, we tried to find similar, but introductory literature. It was a bit surprising that we did not find much. Maybe we didn’t try hard enough, but it’s more likely that relevant literature is buried somewhere and cannot be easily accessed through search engines, or, in the worst case, there is not much at all.
Nevertheless, the post inspired us to rethink the way how we tackle our current problems. For instance, we try to build a preference-model that ranks items according to the known preferences of users. But instead of simply classifying new items, we would like to capture latent topics in known item pairs to be able to perform something like a k-NN search at test time to better explain why a user might like an item or not. The reason why we are working on alternatives is that because the input space is very sparse and heterogeneous the generalization often fails. In other words, we get good scores for train/test but unseen items still land on the wrong side of the decision boundary way too often.
To get a better understanding of what’s going wrong, we worked on ways to visualize decision boundaries. But it is not hard to believe that a hyperplane is not always the best way to support fine-grained classification. A problem is that all +1 items are not really equal, like some image category, but might contain items from very different categories and the same is true for -1 items. Thus, it might be much easier for the network to learn a local neighborhood instead. As a baseline, we could use k-NN, but it is well known that the accuracy of the algorithm highly depends on the feature representation and since our input space consists of highly entangled data, it does not work in the original space.
As a result, we need a neural net to disentangle the representation with some loss function and then we can use k-NN to perform a prediction that is more robust compared to an end-to-end classification. For example, let’s assume that we trained a classification model and we get an item x that lies on the wrong side of the decision boundary. However, there is also a wrongly classified -positive- item x’ that is pretty close to x and the next negative x” item is farther away than x’: dist(x, x”) > dist(x, x’). Therefore, while the classification of x would be -1, the 1-NN model would still get it right because we consider the neighborhood of x.
Of course it’s not always that trivial, but instead of pushing every category into a single corner of the feature space, we can make the life of the neural net easier by just separating positive and negative items by a fixed margin. This is nothing more than the good old triplet loss, but we still have one challenge to address!
The problem is that not all +1 items come from the same distribution. Thus, if we just sample uniformly, we likely sample items with different topics and force them to be close together in the feature space. This still works, but at the end, we probably just learn a representation that is linearly separable again. So, the question is how to preserve the local neighborhood in each +1/-1 category? We could cheat by using tags or meta-data, or we could also extract topics with a NMF.
Of course the new expressive power won’t come for free. The beauty of a classifier is that we can make predictions but just feeding the item to the network and get the score for it. In case of a k-NN-based method, we also need to store labeled training data and perform a lookup each time a new item needs to be labeled. But depending on how much labeled training data is required for a good score, the overhead is often negligible, since we just need to store a matrix of the size (N x dim) and perform a single matrix multiplication, following by an argument sorting. And if all items have unit-norm, this corresponds to the cosine score.
Of course we are not done here, but we have a strong feeling that pursuing this way will lead somewhere, even if it takes a lot of steps until we arrive somewhere.