“Error” is a word that doesn’t have many positive connotations. But in our task of teaching an artificial neural network (ANN) to learn, it’s great.
Why? Because it gives us the opportunity to show the network how to learn. The error here is the measure, which determines how the weights must be adjusted. We’ll answer the following question in the upcoming pages: How can a measured error be used to program learning for an ANN?
Let’s first consider how to measure an error, that is, a deviation from a target value. At the output layer of a network, for example, there is a desired output y and an actual (calculated) output ŷ. You can measure the error of the perceptron using E = (y - ŷ), that is, the desired output of the perceptron minus the calculated output, where the calculated output was determined using the step function. Using Adaline, you can measure the error differently, namely using E = (y - s), where s is the net input. In principle, any function of the deviation could serve as the error measure; there is no reason why the error shouldn’t be described as
E = (y - ŷ)²
with the underlying goal of keeping the error between the desired and calculated value as small as possible. If we compare the two error measures, that is, (y - ŷ) and (y - ŷ)², the graph looks like the one shown in this figure.
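To make the comparison concrete, here is a minimal sketch (not from the book) that evaluates both error measures for a few differences between desired and calculated output. The function names and sample values are illustrative assumptions.

```python
def linear_error(diff):
    """Perceptron-style error measure: the raw difference y - y_hat."""
    return diff

def squared_error(diff):
    """Adaline-style error measure: the squared difference (y - y_hat) ** 2."""
    return diff ** 2

# Evaluate both measures over a range of deviations.
for d in [-2.0, -1.0, 0.0, 1.0, 2.0]:
    print(f"diff={d:+.1f}  linear={linear_error(d):+.1f}  squared={squared_error(d):.1f}")
```

Note how the linear measure keeps its sign, while the squared measure is always non-negative and has a single minimum at 0 — exactly the bowl shape that gradient descent will exploit.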
With the perceptron learning rule, the behavior is quite logical. Our goal is for the error to be 0. If the difference between the desired and calculated value is less than 0, then the calculated value is greater than the desired value. This means we have to reduce the weights so that the calculated value shrinks and the difference moves closer to 0, that is, becomes smaller in magnitude. The reverse applies when the difference is greater than 0. So, we either climb up the line or slide down it toward 0.
If we look at the square of the error measurement, which plays a central role in the derivation of the Adaline learning rule, we discover a completely different situation, caused by squaring the error (experts refer to this as the squared error [SE]). How can we proceed here so that the error becomes 0? If we were a sphere, things would be easy: we would simply switch off our brains and let gravity and friction take over. We would roll down one side, perhaps up the other, and with each swing back and forth we would get closer to 0 — friction and gravity would take care of that.
Unfortunately, gravity and friction aren’t available to us to determine in which direction the error surface slopes downward, but we have another means: gradient descent. “Descent” is relatively clear, as in mountaineering, but what about “gradient”? Perhaps an alternative name for gradient descent helps: the method of steepest descent. This mathematical procedure tells us the direction in which the error decreases fastest.
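The idea can be sketched in a few lines. The following example (an illustration, not the book’s code) applies gradient descent to the squared error E(w) = (y - w·x)² for a single weight; the values of x, y, and the learning rate eta are assumptions chosen for the demo.

```python
# Gradient descent on E(w) = (y - w * x) ** 2 for a single weight w.
# The derivative is dE/dw = -2 * x * (y - w * x); stepping against the
# gradient moves w in the direction of steepest descent of the error.

x, y = 2.0, 4.0   # input and desired output (illustrative values)
w = 0.0           # initial weight
eta = 0.05        # learning rate (step size)

for _ in range(100):
    error = y - w * x            # current deviation
    gradient = -2.0 * x * error  # dE/dw
    w -= eta * gradient          # step downhill

print(round(w, 4))  # prints 2.0, since the error is zero at w = y / x
```

Like the rolling sphere, w overshoots less and less as the slope flattens near the minimum, settling where the squared error is 0.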
In short, the learning method based on gradient descent is the backpropagation algorithm (backprop for short), which is a generalization of the delta rule. This algorithm ensures that the weights are adjusted in such a way that the overall error is minimized.
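To see the delta rule that backprop generalizes, here is a minimal sketch (illustrative, not the book’s code) of the Adaline update w_i ← w_i + η·(y - s)·x_i, where s is the net input. The tiny data set is an assumption: the linear target y = x₁ + x₂, which a single linear unit can learn exactly.

```python
# Delta rule (Adaline / LMS update) on a linearly realizable target.
# Each step nudges the weights against the gradient of (y - s) ** 2.

data = [([0.0, 0.0], 0.0), ([0.0, 1.0], 1.0),
        ([1.0, 0.0], 1.0), ([1.0, 1.0], 2.0)]

weights = [0.0, 0.0]
bias = 0.0
eta = 0.1  # learning rate (assumed value)

for epoch in range(500):
    for x, y in data:
        s = sum(w * xi for w, xi in zip(weights, x)) + bias  # net input s
        delta = y - s                                        # error (y - s)
        weights = [w + eta * delta * xi for w, xi in zip(weights, x)]
        bias += eta * delta

# Summed squared error over the training set after learning.
sse = sum((y - (sum(w * xi for w, xi in zip(weights, x)) + bias)) ** 2
          for x, y in data)
print(round(weights[0], 2), round(weights[1], 2), round(sse, 6))
```

Backpropagation extends exactly this update to multi-layer networks by propagating the error term backward through the layers via the chain rule.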
Editor’s note: This post has been adapted from a section of the book Programming Neural Networks with Python by Joachim Steinwendner and Roland Schwaiger. Dr. Steinwendner is a scientific project leader specializing in data science, machine learning, recommendation systems, and deep learning. Dr. Schwaiger is a software developer, freelance trainer, and consultant. He has a PhD in mathematics and he has spent many years working as a researcher in the development of artificial neural networks, applying them in the field of image recognition.
This post was originally published 10/2025.