Right now, we are training a larger model that can be used for feature extraction, and hopefully the additional layers will help to improve the accuracy. But while the training is underway, we thought it wouldn’t hurt to tweak our existing model a little. More precisely, we want to reduce the dimensionality and, at the same time, de-correlate the feature data.
For this experiment, we store the features along with the movie data in the back-end, so the procedure only needs to be done once. The drawback is that new movies cannot easily be transformed into the feature space, but since our ultimate goal is a decomposition of the features, this is no problem.
The obvious choice is to perform a PCA on the data, and why not? The approach is very fast, and since we can measure the explained variance of the resulting model, we can easily determine how many of the original feature dimensions we need to keep. In our setup, each movie is described by 50 feature values. With 40 dimensions, we can explain about 92% of the variance in the data, which is sufficient for us. Next, we stored the reduced feature data and performed a simple comparison of the results.
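To make the selection step concrete, here is a minimal sketch of how the number of dimensions can be picked from the explained variance. It uses a plain NumPy SVD rather than whatever PCA implementation we actually ran, the data is a synthetic stand-in for the real movie features, and the names `fit_pca` and `transform` are made up for this illustration.

```python
import numpy as np


def fit_pca(features, variance_target=0.92):
    """Fit a PCA and keep the fewest components that reach the variance target.

    Returns the feature mean, the kept principal axes (one per row) and the
    explained-variance ratio of each kept component.
    """
    mean = features.mean(axis=0)
    centered = features - mean
    # Rows of vt are the principal axes, ordered by decreasing variance.
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    variances = singular_values ** 2 / (features.shape[0] - 1)
    ratio = variances / variances.sum()
    cumulative = np.cumsum(ratio)
    # First position where the cumulative ratio reaches the target.
    n_keep = int(np.searchsorted(cumulative, variance_target)) + 1
    n_keep = min(n_keep, len(ratio))
    return mean, vt[:n_keep], ratio[:n_keep]


def transform(features, mean, components):
    """Project features onto the kept principal axes."""
    return (features - mean) @ components.T


# Synthetic stand-in (assumption): 500 "movies" with 50 correlated features.
rng = np.random.default_rng(0)
features = rng.normal(size=(500, 50)) @ rng.normal(size=(50, 50))

mean, components, ratio = fit_pca(features, variance_target=0.92)
reduced = transform(features, mean, components)
print(components.shape[0], "dimensions explain", round(float(ratio.sum()), 3))
```

The projected columns are uncorrelated by construction, which covers the de-correlation goal as well. Because the mean and the components are kept, the same `transform` could in principle be applied to new rows later.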
The experiment was quite simple. We performed a Top-K retrieval, using the Euclidean distances in the feature space, to find the K nearest neighbors of a movie query. In most cases, the results were pretty similar, but in some cases the ordering changed slightly. Since we only looked at a sample of movies, we cannot be sure whether the new model really improved the results significantly. However, the new model clearly improved the results for some movies, so it seems to be a good idea to pursue dimensionality reduction further.
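The comparison can be sketched roughly like this. Again, the data is synthetic, the 40-dimensional projection is recomputed inline so the snippet runs on its own, and `top_k` and `overlap` are hypothetical helper names, not our actual retrieval code.

```python
import numpy as np


def top_k(features, query_index, k):
    """Indices of the k rows closest to the query row (Euclidean distance)."""
    distances = np.linalg.norm(features - features[query_index], axis=1)
    distances[query_index] = np.inf  # never return the query itself
    nearest = np.argpartition(distances, k)[:k]
    # argpartition does not sort the k candidates, so order them by distance.
    return nearest[np.argsort(distances[nearest])]


def overlap(a, b):
    """Fraction of neighbors shared by two result lists, ignoring order."""
    return len(set(a.tolist()) & set(b.tolist())) / len(a)


# Synthetic stand-in (assumption): 200 "movies" with 50 correlated features.
rng = np.random.default_rng(1)
full = rng.normal(size=(200, 50)) @ rng.normal(size=(50, 50))

# Project onto the first 40 principal axes.
centered = full - full.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
reduced = centered @ vt[:40].T

before = top_k(full, query_index=0, k=10)
after = top_k(reduced, query_index=0, k=10)
print("original:", before)
print("reduced: ", after)
print("shared neighbors:", overlap(before, after))
```

Printing both lists side by side shows whether only the ordering changed or whether different movies entered the result, which is the distinction the overlap score alone hides.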
Some last words. The simplicity and speed of the PCA approach come at a price: linearity. Even with the limited analysis of the movie metadata, it is clear that some correlations in the data are not linear. That means we cannot use a linear model, like a PCA, to fully explain the data. In other words, the extracted concepts (features) are probably not powerful enough for the task at hand. Nevertheless, the insights are still very useful for the next iteration.