It is no secret that most of the energy in machine learning has gone into advancing supervised approaches. One reason is that many problems can be phrased as predicting labels, often with very good results. So the question is, especially for commercial solutions where time and resources are very limited, whether it isn’t better to spend some time labeling data and training a classifier to get some insights into the data. We got some inspiration from a recent Twitter post that suggested a similar approach.
For instance, if we want to predict whether an “event” is an outlier or not, we have to decide between supervised and unsupervised methods. The advantage of the latter is that we have access to lots of data, but we have no clear notion of “outliers”. For the former, we need labeled events, with the risk that the labeled data is not very representative and the trained model is therefore of limited use.
In other words, it is the old story again: a supervised model is usually easier to train if we have sufficient labeled data, but at the expense that we only get out what we “feed” in. Thus, more labeled data is likely to improve the model, but we can never be sure that we have captured all irregularities. On the other hand, unsupervised learning might be able to (fully) disentangle the explanatory factors of the data and thus lead to a more powerful model, but coming up with a proper loss function and the actual training can be very hard.
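To make the trade-off concrete, here is a minimal sketch on made-up one-dimensional data. Everything in it is an assumption chosen for illustration: the synthetic events, the z-score rule standing in for an "unsupervised" detector, the nearest-class-mean rule standing in for a "supervised" one, and the function names `fit_unsupervised` and `fit_supervised`. The supervised detector only ever sees labels for one kind of outlier.

```python
import random
import statistics

random.seed(0)  # fixed seed so the toy data is reproducible

# Hypothetical 1-D "events": mostly normal values, plus two kinds of outliers.
normal = [random.gauss(0.0, 1.0) for _ in range(500)]
high_outliers = [random.gauss(8.0, 0.5) for _ in range(10)]  # the kind we labeled
low_outliers = [random.gauss(-8.0, 0.5) for _ in range(10)]  # the kind nobody labeled


def fit_unsupervised(events, k=3.0):
    """No labels: flag anything more than k standard deviations from the mean."""
    mu = statistics.fmean(events)
    sigma = statistics.pstdev(events)
    return lambda x: abs(x - mu) > k * sigma


def fit_supervised(events, labels):
    """Labels (1 = outlier): assign each event to the nearest class mean."""
    outliers = [x for x, y in zip(events, labels) if y == 1]
    inliers = [x for x, y in zip(events, labels) if y == 0]
    mean_outlier = statistics.fmean(outliers)
    mean_inlier = statistics.fmean(inliers)
    return lambda x: abs(x - mean_outlier) < abs(x - mean_inlier)


def recall(detector, outliers):
    """Fraction of the given outliers that the detector flags."""
    return sum(detector(x) for x in outliers) / len(outliers)


# The unsupervised detector sees all events, but no labels.
unsupervised = fit_unsupervised(normal + high_outliers + low_outliers)

# The supervised detector sees labels, but only for the "high" outliers.
train_events = normal + high_outliers
train_labels = [0] * len(normal) + [1] * len(high_outliers)
supervised = fit_supervised(train_events, train_labels)

unsup_high = recall(unsupervised, high_outliers)
unsup_low = recall(unsupervised, low_outliers)
sup_high = recall(supervised, high_outliers)
sup_low = recall(supervised, low_outliers)

print("unsupervised recall (high, low):", unsup_high, unsup_low)
print("supervised recall   (high, low):", sup_high, sup_low)
```

On this toy data the supervised detector finds the outliers it was shown and misses the kind it never saw, which is the "we get what we feed" effect. The unsupervised rule catches both kinds, but only because its built-in notion of an outlier (far from the mean) happens to fit data we designed ourselves; on real events, choosing that notion is exactly the hard part.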
Bottom line: there is some truth to the idea that if you cannot come up with a good unsupervised model, but you can partly solve the problem with a supervised one, you should start with that. With some luck, the simple model will lead to additional insights that might eventually lead to an unsupervised solution.