For some domains, like NLP or movie descriptions, the dataset might be very high dimensional, several thousands of raw features, in combination with a very high sparsity of the input. In other words, if we have 10K dimensions with a sparsity of 99%, about 9900 zeros per vector, the training of some kind of models can be very challenging.
Let us assume that we want to train an auto-associator to learn something about the structure of the data at hand. With about 100 non-zero and 99000 zero features, the procedure is computationally challenging. Why? At test time, we only need to multiply the non-zero entries of a vector with the learned weight matrix, but during the training we need to reconstruct each and every dimension of the vector.
We address the issue, we borrow ideas from papers which describe how to solve large-scale NLP problems. The main idea is very simple: We only reconstruct the non-zero entries of a vector and a small subset of zero entries. This way, the computation only depends on a fixed number of entries and not all of them. When we return to our initial example, we only need to reconstruct about 200 dimensions instead of 10K (100 non-zero and 100 arbitrary zero positions). This procedure will introduce a bias in the reconstruction error, but this can be sufficiently compensated with a proper weighting scheme.
Bottom line, if the data is sparse and high-dimensional, we can use the sparsity to speed up the model training, but also to guide the model, because reconstructing the non-zero entries is usually much more important.