We have a dream, and the dream is to have something as powerful as Convolutional Networks, but for non-image domains. The good thing about images is that everything we need is contained in them; apart from some labels, we do not need extra data to learn something useful. Of course, this benefit is also a big burden, because the amount of information in a picture is usually huge.
At least for full text documents, where words are ordered, likely appear several times, and let us define a vocabulary, the situation is more similar to images. Of course text and images are different, but we can at least use the content directly to learn something.
With something abstract like a movie, the situation is much harder. Would the full script of a movie help to build a better model for the data? Would the video data, frame by frame, help us to learn the topics present in a movie? Maybe, but the truth is that we will never have access to all the information, and even if video data were available, processing hundreds of hours of movies would require immense resources. So we are back where we started, but do we give up? Of course not. A car might not fly us to the moon, but it surely brings us to the next video rental store and back home.
In other words, machine learning, and especially neural nets, is constantly evolving, and it is only a matter of time before the next breakthrough happens. Until then, we use the full existing repertoire that we have:
- neural language models for context prediction
- rectifier units to speed up learning
- dropout to avoid over-fitting and for model averaging
- AdaGrad to adapt the learning rate per parameter
- semi-supervised learning for more discriminative power
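To make the middle three items concrete, here is a minimal NumPy sketch of a single hidden layer with rectifier units, (inverted) dropout, and one AdaGrad step. All sizes, names, and the stand-in gradient are illustrative assumptions, not code from any particular model:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy layer: 4 inputs -> 3 hidden units (sizes are illustrative).
W = rng.normal(scale=0.1, size=(4, 3))
x = rng.normal(size=(4,))

# Rectifier (ReLU) units: max(0, a) instead of a saturating sigmoid.
h = np.maximum(0.0, x @ W)

# Inverted dropout: randomly zero units during training and rescale,
# so the expected activation matches what the net sees at test time.
p_keep = 0.5
mask = rng.random(h.shape) < p_keep
h_dropped = h * mask / p_keep

# AdaGrad: per-parameter learning rates from accumulated squared gradients.
grad = rng.normal(size=W.shape)  # stand-in for a real backprop gradient
hist = np.zeros_like(W)          # running sum of squared gradients
lr, eps = 0.1, 1e-8
hist += grad ** 2
W -= lr * grad / (np.sqrt(hist) + eps)
```

Parameters with a large gradient history thus get smaller effective step sizes, which is the whole point of AdaGrad.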
And last but not least, we can use the wonderful Theano library for rapid prototyping; it is also the reason we are not spending hours fixing errors in hand-derived gradients, since Theano computes them symbolically for us.
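The tedium Theano spares us is easy to demonstrate: whenever gradients are derived by hand, they have to be checked against finite differences. The loss and its analytic gradient below are a toy example chosen for illustration:

```python
import numpy as np

def loss(w, x, y):
    """Squared-error loss 0.5 * ||x @ w - y||^2 (toy example)."""
    return 0.5 * np.sum((x @ w - y) ** 2)

def grad(w, x, y):
    """Hand-derived gradient of the loss above: x^T (x @ w - y)."""
    return x.T @ (x @ w - y)

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 3))
y = rng.normal(size=(5,))
w = rng.normal(size=(3,))

# Numerical gradient via central differences, one coordinate at a time.
eps = 1e-6
num = np.zeros_like(w)
for i in range(len(w)):
    wp, wm = w.copy(), w.copy()
    wp[i] += eps
    wm[i] -= eps
    num[i] = (loss(wp, x, y) - loss(wm, x, y)) / (2 * eps)

match = np.allclose(num, grad(w, x, y), atol=1e-4)
```

With symbolic differentiation, this whole check (and the debugging when it fails) disappears: you write down the loss and ask the library for the gradient.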