For some kind of data it is not unusual to have a couple of thousand dimensions but only very few of them carry actual values. Like a bag-of-word approach with thousands of binary features but on average 99.5% of them are zero. In case of a neural network this means we have a projection matrix W with N rows and M columns where N is the number of input features (~10,000). Since it is obvious that a dot-product just depends on non-zero entries, the procedure can be speed-up a lot if we use a sparse matrix instead of a dense one, for the input data. However, we only need the sparse tensor type once, since after the first layer, the output is always dense again.
The actual implementation in Theano is not a big deal. Instead of T.fmatrix(), we use sparse.csc_matrix() which comes from the sparse module of Theano:
from theano import sparse. If we use a generic projection layer, all we have to check is the instance type of the input tensor to use the appropriate dot function:
if type(input.output) is theano.sparse.basic.SparseVariable:
op = sparse.structured_dot
op = T.dot
That is all and the rest can stay as it is.
The idea of “structured_dot” is that the first operand, the input data, is sparse and the other operand, the projection matrix, is dense. The derived gradient is also sparse and according to the docs, both fprop and bprop is using C-code.
Bottom line, if the input dimension is huge but only very few elements are actually “non-zero” using a sparse matrix object is essential for a good performance. The fact that non-contiguous objects cannot be used on the GPU is a drawback, but not a real problem for our models since they are CPU-optimized anyway.