In Machine Learning, the best model is useless if it does not generalize to new, unseen data. In other words, even if a model drives the error rate on the training samples down to zero, it is of little practical use if its error rate on new samples is barely better than flipping a coin.
Of course it makes sense to choose a proper architecture before we train a model. If the model is too powerful and thus has a large capacity, we need enough data to estimate all its parameters; otherwise, the model overfits: it learns to reproduce (possibly spurious) patterns in the training data almost perfectly but does not generalize to unseen data. The opposite is underfitting, where a model with too little capacity fails to detect the actual pattern in the data at all. Both sides of the coin are bad, and therefore we try to meet in the middle.
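The trade-off can be made concrete with a small experiment: fit polynomials of increasing degree to noisy samples of a quadratic function and compare training and test error. This is a minimal sketch using NumPy least-squares fits; the data, noise level, and degrees are illustrative assumptions, not taken from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

# Noisy training samples from a quadratic ground truth,
# plus a clean test grid to measure generalization.
f = lambda x: x ** 2
x_train = np.linspace(-1, 1, 20)
x_test = np.linspace(-1, 1, 100)
y_train = f(x_train) + rng.normal(0, 0.1, x_train.shape)
y_test = f(x_test)

def mse(deg):
    # Least-squares polynomial fit of the given degree (the model capacity).
    coeffs = np.polyfit(x_train, y_train, deg)
    train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_err = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    return train_err, test_err

for deg in (1, 2, 15):
    tr, te = mse(deg)
    print(f"degree {deg:2d}: train MSE {tr:.4f}, test MSE {te:.4f}")
```

Degree 1 underfits (both errors high), degree 2 matches the true pattern, and degree 15 drives the training error toward zero while its test error grows again: the model has memorized the noise.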
Occam’s razor states that given a set of competing hypotheses, we should prefer the one with the fewest assumptions. In the domain of Machine Learning, that means we should start with a simple, perhaps linear, model. If it works, there is no need to go further; otherwise, we increase the capacity a little and try again.
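The "start simple, then add capacity" procedure can be sketched as a loop that grows the model only while a held-out validation error keeps improving. Again a hypothetical NumPy example with polynomial degree standing in for capacity; the cubic ground truth and split sizes are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical noisy data whose true pattern is cubic.
x = rng.uniform(-1, 1, 60)
y = x ** 3 - x + rng.normal(0, 0.05, x.shape)

# Hold out a validation split to judge generalization.
x_train, y_train = x[:40], y[:40]
x_val, y_val = x[40:], y[40:]

best_deg, best_err = None, np.inf
for deg in range(1, 11):  # start simple, add capacity gradually
    coeffs = np.polyfit(x_train, y_train, deg)
    err = np.mean((np.polyval(coeffs, x_val) - y_val) ** 2)
    if err < best_err:
        best_deg, best_err = deg, err

print(f"selected degree: {best_deg}")
```

The linear model is tried first and kept only if nothing with more capacity validates better; here the cubic signal means the loop settles on a richer model.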
How is this related to Deep Learning? In many domains, the explanatory factors of the input data are heavily entangled, and thus a powerful model is required to disentangle them. This may call for a deep model that can re-use features and stack them to build more abstract features. However, for some kinds of data this is overkill, and a much simpler model, e.g. a linear one, would suffice to explain the factors.