Quite a while ago, we discussed an idea how sparse coding could be useful to build a semantic feature space. It is a bit ironic that we cannot use SC directly on the raw data since the features are too sparse and the data is too noisy. However, we can cheat a little and use NMF to build a useful set of bases. In our case, we factored the matrix of co-occurrences of features and we used 50 latent topics. Furthermore, we set all features below a certain threshold to zero which makes sure that we only focus on the top-k words in each latent topic.
Next, we initialize some sparse coder with these adjusted bases and infer the coefficients for each available movie in the data set. We used a widespread algorithm called Orthogonal Matching Pursuit (OMP) with the settings to get 10% (of 50) non-zero entries. The idea is to determine how much a latent topic is present, if we consider the keywords of an item. With 10% non-zero entries, we can explain an item with up to 5 latent topics which is usually sufficient movies.
In contrast to NMF, we can get negative coefficients which are hard to interpret in the domain of movies. For images, we have to remove such a base from the datum which is plausible. With the settings we used, about 30% of all non-zero coefficients were negative. Further research is required to better understand the meaning of negative values.