Some algorithms seem very powerful at first glance. For instance, we stumbled upon a semi-supervised matrix factorization that clusters arbitrary data. To introduce a supervised signal, pairwise constraints express that a pair of documents should be in the same cluster (W_like) or not (W_dislike). The model combines a similarity matrix S with W_like and W_dislike into a matrix A: A = S + W_like - W_dislike.
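As a minimal sketch of how A is assembled, here is a toy numpy example; the concrete similarity values and constraint pairs are made up for illustration:

```python
import numpy as np

# Toy example with n=4 documents and assumed pairwise similarities in S.
S = np.array([[1.0, 0.8, 0.1, 0.0],
              [0.8, 1.0, 0.2, 0.1],
              [0.1, 0.2, 1.0, 0.7],
              [0.0, 0.1, 0.7, 1.0]], dtype=np.float32)

# Pairwise constraints (symmetric indicator matrices): documents 0 and 2
# should share a cluster, documents 1 and 3 should not.
W_like = np.zeros_like(S)
W_like[0, 2] = W_like[2, 0] = 1.0

W_dislike = np.zeros_like(S)
W_dislike[1, 3] = W_dislike[3, 1] = 1.0

# Supervision-augmented similarity matrix:
A = S + W_like - W_dislike
```

The constraints simply push the affected entries of A up or down, so a must-link pair looks more similar and a cannot-link pair looks less similar than the raw similarity suggests.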
We will illustrate this with an example from the movie domain. Assume we have n=30K movies, which means S is an n × n matrix, and so are W_like and W_dislike, and we use k=50 clusters. If each value is encoded as a float32, a single n × n matrix already requires about 3.4 GB of memory. The memory required for the model parameters, in contrast, is negligible, since we only need to store n × k values for the cluster assignments and k × k for the centers, which is less than 6 MB.
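The numbers above can be verified with a quick back-of-the-envelope calculation (using the n and k from the example):

```python
import numpy as np

n = 30_000  # number of movies
k = 50      # number of clusters
bytes_per_value = np.dtype(np.float32).itemsize  # 4 bytes

# One dense n x n similarity (or constraint) matrix:
dense_matrix_gb = n * n * bytes_per_value / 1024**3
# Model parameters: n x k cluster assignments plus k x k centers:
params_mb = (n * k + k * k) * bytes_per_value / 1024**2

print(f"one dense n x n matrix: {dense_matrix_gb:.2f} GB")  # ~3.35 GB
print(f"model parameters:       {params_mb:.2f} MB")        # ~5.73 MB
```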
Regardless of the efficiency and precision of models that use a full similarity matrix, the bottleneck is memory consumption during training. For 5K samples we need ~100 MB, but for 50K samples we already need ~9.5 GB, since memory grows quadratically, not linearly, with the number of samples. Combined with the fact that such models are transductive and cannot project unseen samples into the learned space, their usefulness is limited for datasets with lots of samples.
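The quadratic growth is easy to make concrete: doubling the number of samples quadruples the memory of the dense matrix.

```python
import numpy as np

bytes_per_value = np.dtype(np.float32).itemsize  # float32 -> 4 bytes

for n in (5_000, 10_000, 30_000, 50_000):
    gb = n * n * bytes_per_value / 1024**3
    print(f"n={n:>6}: {gb:6.2f} GB")  # memory scales with n**2
```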
The impact could be alleviated if the similarity matrix could be stored as a sparse matrix, since that would drastically reduce memory consumption. However, this requires that the similarity is zero for most pairs of movies. Furthermore, even though most full-fledged optimization frameworks support sparse matrices, working with them can be problematic in practice.