We did not give up our old dream of purely unsupervised learning, but right now, we neither seem have the right data nor a powerful enough model. However, we recently stumbled about a research paper about neural variational inference for text and some ideas caught our attention.
The idea is to encode a whole document as a latent representation and then a softmax is used to independently generating the words the document consists of. The document is encoded as a bag-of-words with an additional integer index for the non-zero words. This sounds very much like our data, very high-dimensional, very sparse and not ordered. So, we shamelessly took their ideas from the paper and turned them into our context, but of course we give credit to the authors[1511.06038].
Since we mainly use the softmax to ‘reconstruct’ the individual words, our new model looks a lot like CBOW, continuous bag of words, with the exception that now a context is not used to predict a single word, but several words. In Theano notation:
x_idx = T.ivector() # the list of word IDs
W = weight matrix of |V| x embed_dim
b0 = bias with embed_dim
R = weight matrix of embed_dim x |V|
b = bias with |V|
h_lin = T.mean(W[x_idx], axis=0) # mean of all input words
h = elu(h_lin + b0)
y_hat = T.nnet.softmax(T.dot(h, R) + b)
loss = -T.sum(T.log(y_hat[0, x_idx]))
which means we average the embedding of all words that are present in the document, optionally with a non-linearity at the end. Then we use this combined representation of the document to maximize the probability of all present words while pushing down the values for all others.
Again, we do not deny a strong resemblance to CBOW and we are pretty sure the novelty is limited, but maybe this is a stepping stone to something greater. The analysis of word neighbors in the learned space at least confirms that some semantic is definitely captured, but more work needs to be done.