It is no secret that the advances in deep learning were only possible because of the immense amount of data available on the Internet, combined with the computational power of GPUs. In the case of computer vision, it was also the huge human-annotated ImageNet dataset that made it possible to train large-scale supervised classification models on high-quality data. Since then, however, the landscape of neural networks has changed a lot, and not only for computer vision. In the early days, we could obtain a reasonable model by stacking basic layers together and combining them with a loss function. This still works today, but in most cases it won't deliver mind-blowing performance, or it may even get stuck during training.
With the introduction of batch normalization (BN), it became possible to train much larger and deeper models, but we should not forget that this comes at a price. First, BN introduces additional complexity, and second, BN requires a sufficiently large batch size to reliably estimate the statistics of the data. In other words, it does not work for online learning and can fail for small batch sizes. After BN, other normalization methods were proposed, most notably weight normalization (WN) and layer normalization (LN), but since LN does not work as well for ConvNets, BN remains a standard tool for designing arbitrary networks. At least for recurrent networks, where applying BN is not straightforward, LN seems to get more attention. In general, normalization seems to be one key to success when the network is deep and the loss function is complex.
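The batch-size dependence of BN is easy to demonstrate. As a small sketch in PyTorch (the framework choice is mine, not the text's): BN estimates statistics across the batch, so it refuses a single-sample batch in training mode, while LN normalizes each sample over its own features and is unaffected.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

bn = nn.BatchNorm1d(8)  # normalizes each feature across the batch dimension
ln = nn.LayerNorm(8)    # normalizes each sample across its features

# BatchNorm needs more than one sample in training mode:
# the batch statistics of a single sample are degenerate.
try:
    bn(torch.randn(1, 8))
except ValueError as e:
    print("BatchNorm1d failed:", e)

# LayerNorm works per sample, so batch size 1 is fine.
print(ln(torch.randn(1, 8)).shape)  # torch.Size([1, 8])
```

This is exactly why LN is attractive for online learning and recurrent models, where the effective batch per normalization step can be tiny.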
However, even with normalization, modern networks often consist of building blocks that go far beyond the classical approach of stacking basic layers. For instance, in 2014 GoogLeNet introduced inception modules, to better handle objects at different scales, in combination with auxiliary classifiers. The network architecture looks a bit scary, but on closer inspection it can be broken down into modular building blocks. Then, in 2015, ResNets were introduced; they also used building blocks, but much simpler ones, and they made it possible to train networks more than 100 layers deep. Since then, more and more networks have used modular building blocks to increase capacity and improve accuracy, at the price of a more complex model.
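To make the idea of modular building blocks concrete, here is a minimal residual block sketched in PyTorch (my simplification, not the original ResNet code: same channel count in and out, stride 1). Stacking such blocks is what makes very deep networks trainable.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Two conv-BN layers plus a skip connection, ResNet-style."""

    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU()

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + x)  # the identity skip connection

# Depth comes from repeating the block, not from hand-crafted layers.
net = nn.Sequential(*[ResidualBlock(16) for _ in range(5)])
x = torch.randn(2, 16, 8, 8)
print(net(x).shape)  # torch.Size([2, 16, 8, 8])
```

The skip connection `out + x` is the whole trick: gradients can flow around each block, which is why depths beyond 100 layers became feasible.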
Stated differently, for more and more problems, the network architecture required to solve them can be burdensome to implement from scratch, not to mention the many possible pitfalls if you have to do all the implementation on your own. In a nutshell, you need both a broad range of knowledge and practical experience to work on challenging problems. However, if you are an engineer who just wants to get the problem solved without all the algorithmic overhead, you need a good framework that lets you build a complex network from pre-existing building blocks, combining these modules into a single model encoded as a computational graph. And finally, you want automatic differentiation wherever possible, to optimize arbitrary loss functions instead of deriving gradients by hand, which can be a real pain and a source of very nasty errors.