At the output, we will receive the activation of that output neuron that corresponds to a certain number. It does this on the basis of data received from neurons from the previous layer, which are responsible for the digit sectors, namely from which neurons the signals came and which ones did not. Let's take that the incoming signals from neurons through the connections are zero, that is, the sector is not filled, if one, then the sector is painted. Then, the weights of the links from the right sectors are half, which will give one, while the rest have negative weights, which will not give one at the output if some other sector is activated. At the output of the neuron, there is a normalizer that decides for the decision. He needs to decide, based on the input data and weights, to give one or zero. To do this, it multiplies the input data by the weights, adds them, and outputs one or zero based on the threshold value. This normalizer is needed so that, after summing up the information coming from neurons, it will transfer logical information to the next layer of neurons, the degree of importance of which will be determined by the weights on the receiving neuron, and not by them. For this, functions are used that convert the entire range of input signal levels to the range from zero to one. Such a function is called activation functions and is selected for the entire neural network. There are many functions that consider everything less than one to be zero. The weights themselves are not encoded, but are selected during training. Teaching is either supervised or ansupervised and suitable for different classes of tasks. When teaching without a teacher (auto-encoders and generating networks), we give data to the input of the neurons of the network about the expectation when it itself finds some patterns, while the data is not labeled (does not have any labels by classification), which will allow us to identify previously unknown features, similarities and differences, and classifies according to not yet found signs, but how this will happen is difficult to predict. For most tasks, we need to get a classification according to the given groups, for which we input a training set with marked data containing labels about the correct solution (for example, classification, and try to achieve a match with this test set. It can also be reinforced), in which the network tries to find the best solution based on incentives, for example, when playing to achieve superiority over an opponent, but for now, we will postpone consideration of this learning strategy for later. When teaching with a teacher, much less attempts to pick up the weight are required, but still it is from several hundred to tens of thousands, while the network itself contains a huge number of connections. In order to find the weights, we select them by selection and directed refinement. With each pass, we reduce the error, and when the accuracy suits us, we can submit a test sample to validate the quality of the cut ( the network could learn badly or retrain), then you can use the network. However, the numbers may be slightly skewed, but because we are highlighting the areas, this does not greatly affect the accuracy.

When a neuron is trained with a teacher, we send training signals to it and get results at the output. For each input and output signal, we receive a result about the degree of error in prediction. When we ran all the training signals, we got a set (vector) of errors that can be represented as a function of errors. This error function depends on the input parameters (weights) and we need to find the weights at which this error function becomes minimal. To determine these weights, the Gradient Descent algorithm is used, the essence of which is to gradually move to the local target, and the direction of movement is determined by the derivative of this function and the activation function. The activation function is usually sigmoid for regular networks or truncated ReLU for deep networks. Sigmoid outputs a range from zero to one at all times. The truncated ReLU still allows for very large numbers (information is very important) at the input to transfer more than one to the output, there they themselves affect the layers that follow immediately after. For example, the dot above the dash separates the letter L from the letter i, and the information of one pixel affects the decision making at the output, so it is important not to lose this feature and transfer it to the last level. There are not so many varieties of activation functions – they are limited by the requirements for ease of training when it is required to take a derivative. So the sigmoid f after arbitrarily turns into f (1-f), which is effective. With Leaky ReLu (truncated ReLu with leakage) it is even simpler, since it takes the value 0 at x <0, then its wired in this section is also 0, and at x> = 0 it takes 0.01 * x, which with the derivative will be will be 0.01, and for x> 1 it takes 1 + 0.01 * x, which gives 0.01 for the derivative. Calculation is not required here at all, so learning is much faster.