It is the dream of all machine learning guys to find a way to use the available data in combination with some unsupervised learning algorithm to train a useful representation of the data. Yes, we drastically simplifying things here, but the point is to learn without the necessity to label the data which is very expensive.
For example, there are tons of documents available which could be used for learning, but the problem is what cost function do we want to optimize? In case of word2vec and friends, we try to predict surrounding or center words without explicit labels. This works very good, but the result is an embedding of words and besides simple aggregation methods, there is no general way to represent documents with a learned embedding in a meaningful way. However, it is still a simple, but powerful approach that can easily utilize huge amounts of unlabeled text data to learn a useful representation.
Another example is a recently published paper [arxiv:1704.01444] that is also using a large text corpus without labels, at least for the first model, to just predict the next character of the data block. So far, this is nothing new, but it is remarkable that a single unit learned to predict the sentiment of a data block. In other words, all those models learn by predicting the next “thing” which can be, for instance, a word, a character, or some other token.
The interesting part is that such an “autoregression” model can be learned by just taking a sequence, removing the last item and try to predict it, given the previous data. This also works for sets, but the process is not straightforward since sets are not ordered. Furthermore, it is not obvious how to select the item, since there is no “previous” data.
Bottom line, it has been demonstrated several times that it is possible to learn a good representation of data by just predicting the next token. Nevertheless, often such methods are limited to short texts, since processing longer texts require to remember lots of data or context, especially for models based on RNNs.
However, since we are usually dealing with short descriptions of items, with the exception that we handle sets and no sequences, we adapted the method and trained a model to predict a keyword from the set, given the rest of the set, with moderate results. Despite the problems we encountered, we still believe that no (strongly) supervised model will ever be able to learn powerful but also general representation of data. Thus, it seems a good idea to follow the research track and address the existing problems one by one, until we
eventually find a method that addressed the major hurdles.