Since we send the sum of the products of signals by their weights to the input of the activation function, then conceived, we need a different threshold level than from 0.5. We can shift it by a constant, adding it to the sum at the input to the activation function using the bias neuron to remember it. It has no inputs and always outputs one, and the offset itself is set by the weight of the connection with it. But, for multi-neural networks, it is not required, since the weights themselves by the previous layers are adjusted to such a size (smaller or negative) in order to use the standard threshold level – this gives standardization, but requires a larger number of neurons.

When training a neuron, we know the error of the network itself, that is, on the input neurons. Based on them, you can calculate the error in the previous layer and so on up to the input – which is called the Backpropagation method.

The learning process itself can be divided into stages: initialization, learning itself, and prediction.

If our figure can be of different sizes, then pooling layers are applied, which scale the image down. Which algorithm will calculate what will be written when merging depends on the algorithm, usually this is the “max” function for the “max pooling” or “avg” algorithm (mean-square value of neighboring matrix cells) – average pooling.

We already have a few layers. But, in neural networks used in practice, there can be a lot of them. Networks with more than four layers are commonly called deep neural networks (DML, Deep ML). But, there can be a lot of them, so there are 152 of them in ResNet and this is far from the deepest network. But, as you have already noticed, the number of layers is not taken, according to the principle, the more the better, but prototyped. An excessive amount degrades the quality due to attenuation, unless certain solutions are used for this, such as data forwarding with subsequent summation. Examples of neural network architectures include ResNeXt, SENet, DenseNet, Inception-Res Net-V2, Inception-V4, Xception, NASNet, MobileNet V2, Shuffle Net, and Squeeze Net. Most of these networks are designed for image analysis and it is the images that often contain the greatest amount of detail and the largest number of operations is assigned to these networks, which determines their depth. We will consider one of such architectures when creating a number classification network – LeNet-5, created in 1998.

If we need not just to recognize a number or a letter, but their sequence, the meaning inherent in them, we need a connection between them. For this, the neural network, after analyzing the first letter, poisons the result of the analysis of the current one along with the next letter to its input. This can be compared to dynamic memory, and a network that implements this principle is called recurrent (RNN). Examples of such networks (with feedbacks): Kohonen's network, Hopfield's network, ART-model. The recurrent network analyzes text, speech, video information, translates from one language into another, generates text descriptions for images, generates speech (WaveNet MoL, Tacotron 2), categorizes texts by content (belonging to spam). The main direction in which researchers are working in an attempt to improve in such networks is to determine the principle by which the network will decide which, for how long and how much the network will take into account the previous information in the future. Networks adopting specialized tools for storing information are called LSTM (Long-short term memory).