Learn Computing from the Experts | The Rheinwerk Computing Blog

How Machines Interpret Visual Data Using Neural Networks

Written by Rheinwerk Computing | Mar 25, 2026 12:59:59 PM

Saying that a computer “looks at” an image is somewhat misleading.

 

When a machine is fed an image, it actually receives three matrices that instruct the machine on how to mix red, green, and blue light to ensure that each individual pixel in the image has the correct color. When the pixels are assembled together into an image, a human looking at it can see the whole picture and describe what it depicts. Therefore, when it comes to image recognition, “seeing the big picture” is not something machines do; they deal with individual pixels. This detail is important because it means that anyone who wants to make a computer understand what an image shows needs to enable the computer to transition from dealing with individual pixels to the complete picture.
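As a minimal sketch of this representation, here is a tiny 2×2 image stored as three matrices, one per color channel (the values are made up purely for illustration):

```python
# A 2x2 image stored as three matrices, one per color channel.
# Each value tells the machine how much of that color of light
# to mix into a pixel, from 0 (none) to 255 (full intensity).
red   = [[255,   0],
         [  0, 128]]
green = [[  0, 255],
         [  0, 128]]
blue  = [[  0,   0],
         [255, 128]]

# The machine never sees "a picture" -- only these numbers.
# The color of the top-left pixel is the mix of the three
# channel values at row 0, column 0:
top_left = (red[0][0], green[0][0], blue[0][0])
print(top_left)  # (255, 0, 0): pure red
```

Everything the rest of this post describes starts from numbers like these, not from anything resembling a picture.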

 

Japanese computer scientist Kunihiko Fukushima first achieved this feat in 1979 when he invented a specific architecture—that is, a structure—of neural networks. This structure was inspired by the cat brain model proposed by Hubel and Wiesel in 1959, in which simple patterns are recognized first and thereafter contribute to understanding the broader picture. Fukushima’s architecture is called neocognitron, and although it did not immediately revolutionize computer vision, it laid the foundation for the neural networks used for computer vision today. The main reason neocognitron could not be used at the time is that it was (arguably) the first truly deep neural network in history, and the method needed to adjust all the parameters of a neural network wasn’t invented until several years later.

 

When neural networks are used for computer vision, we say that they perform image recognition. By combining the neocognitron’s structure with backpropagation to train the network, neural networks can learn to recognize what an image shows. Today, anyone with some programming skills can sit down and build a neural network for image recognition. The only thing we need to know is that the nodes must be assembled in a specific way, organized into distinct layers. These layers are called convolutional layers—an atrocious-sounding term. But what these layers do is so smart that we should take a closer look at them.

 

A convolutional layer works like a filter—or a large sieve—consisting of artificial neurons that let through only the information the neural network is looking for. If the image fed into the network depicts a horse, and a particular layer is only looking for triangles, it will focus solely on the horse’s ears and nothing else (unless the horse has triangle-shaped nostrils or something similar).

 
Using several convolutional layers (that is, several filters), a neural network can assemble everything that makes up an image: the angles, circles, stripes, grids—anything an image can contain. As such, a neural network for image recognition consists of many convolutional layers that work together to extract all the shapes and patterns found in images. And how does this neural network determine which shapes and patterns to create filters for? By training itself! Before training, the convolutional layers don’t function as filters for anything. Only by seeing tens of thousands of images, and being rewarded each time it correctly predicts what’s in an image, can the network adjust its parameters so the filters work well together. Neural networks composed of convolutional layers are often simply called convolutional networks (or convnets, for short).
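To sketch the idea of several filters, here is a hand-written convolution (no machine learning library) applied with two small kernels: one that responds to vertical edges and one that responds to horizontal edges. In a real convnet the kernel values are learned during training; here they are set by hand purely for illustration:

```python
def convolve(image, kernel):
    """Slide the kernel over the image; at each position, multiply
    the overlapping values pairwise and sum them into one number."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[i + a][j + b] * kernel[a][b]
                 for a in range(kh) for b in range(kw))
             for j in range(out_w)]
            for i in range(out_h)]

# A 4x4 grayscale image: dark on the left, bright on the right,
# so it contains one vertical edge and no horizontal edge.
image = [[0, 0, 1, 1],
         [0, 0, 1, 1],
         [0, 0, 1, 1],
         [0, 0, 1, 1]]

vertical_edge   = [[-1, 1],
                   [-1, 1]]   # fires where brightness changes left-to-right
horizontal_edge = [[-1, -1],
                   [ 1,  1]]  # fires where brightness changes top-to-bottom

print(convolve(image, vertical_edge))
# [[0, 2, 0], [0, 2, 0], [0, 2, 0]] -- strong response along the edge
print(convolve(image, horizontal_edge))
# [[0, 0, 0], [0, 0, 0], [0, 0, 0]] -- nothing: no horizontal edge exists
```

Each filter produces its own output matrix (a "feature map") that is large only where its pattern occurs; a convnet stacks many such filters, and later layers filter the feature maps of earlier ones.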

 

When we examine well-functioning convolutional networks, we find that early convolutional layers, those placed right after where the image enters, become experts at detecting edges and lines—just like the individual cells of a cat’s visual cortex, as Hubel and Wiesel found. And if that wasn’t enough, the later layers in convolutional networks combine the simple information from previous layers into more complex structures—precisely as a cat’s brain does! And human brains, for that matter. Convolutional networks are probably the area where machines most closely resemble humans since convolutional networks process information using the same strategy our brains do—although machines can’t look at images directly; they first need the images to be converted into matrices before they can do anything with them.

 

The following detail is for those who are particularly curious, and if you’re one of us, you’ll appreciate this idea: A convolution is a mathematical operation involving matrices. It works by sliding a small matrix (the filter) across a larger one, multiplying the overlapping values pairwise and summing the products at each position. The result is a new matrix that has large values only where the larger matrix matches what the filter is searching for. It’s a mathematical filter, expressed in the language of neural networks.
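Here is one step of that operation worked out by hand, with made-up numbers: the filter is laid over a single 2×2 patch of an image, the overlapping entries are multiplied pairwise, and the products are summed into one number.

```python
# One 2x2 patch of an image, and a 2x2 filter searching for
# a left-dark / right-bright pattern (a vertical edge).
patch  = [[0, 1],
          [0, 1]]
kernel = [[-1, 1],
          [-1, 1]]

# Multiply overlapping entries pairwise and sum the products.
# Note this is an elementwise multiply-and-sum, not an
# ordinary matrix multiplication.
response = sum(patch[i][j] * kernel[i][j]
               for i in range(2) for j in range(2))
print(response)  # 2: a strong response, the patch matches the pattern

# A uniform patch with no edge produces no response:
flat = [[1, 1],
        [1, 1]]
response_flat = sum(flat[i][j] * kernel[i][j]
                    for i in range(2) for j in range(2))
print(response_flat)  # 0: the filter ignores what it isn't looking for
```

Repeating this at every position of the image yields the filter's output matrix: large numbers where the pattern is, zeros (or small numbers) everywhere else.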

 

The most important thing to remember is that machine learning can train neural networks to “understand what they are looking at” using two ingredients: First, the network must be structured so that its nodes can perform convolutions, that is, function like filters. Second, the training process must ensure that the different layers learn what to look for and can provide each other with relevant information. During training, the neural network looks at thousands, maybe millions, of images. As the convolutional layers learn to identify the right features in each image, the entire network becomes increasingly accurate at predicting what the image shows. When the network guesses correctly, it receives a reward, reinforcing the process. In the end, you have a convolutional network that behaves surprisingly like the part of a cat’s brain that processes visual input.

 

Is computer vision solved now that we have convolutional networks? In many ways, the answer is yes: To measure how well different programs understand what they’re seeing, the research field has had a long tradition of organizing competitions. At these competitions, a standardized dataset is used that all the programs—whether based on hard-coded rules or machine learning—are tested on. After 2010, the standard for these competitions was an enormous digital photo album named ImageNet, which contains more than a million images of over a thousand different objects. In 2017, a convolutional network won the competition with an accuracy of 98%. Since then, many have regarded ImageNet as “solved,” and leading researchers in the field have turned their attention to more challenging tasks.

 

Convolutional networks have revolutionized computer vision, and thanks to them, sorting and searching through images is easier than ever. We see this capability all the time, all around us. Computers can categorize photos automatically. The social network X automatically filters out tweets with pornographic content, and medical software can detect skin cancer from images of moles. Your phone recognizes your face, and at LAX and many other major airports, passport control is partially automated. These scenarios are just a handful of ways in which convolutional networks are applied worldwide.

 

Does this mean that convolutional networks understand what they are looking at as well as humans do? Well… modern cameras have a higher resolution than the human eye does, which means that machines can extract greater detail than we humans can (estimates of the resolution the human eye can achieve sit just below 600 megapixels). But if the image shows a situation that requires contextual understanding, it’s a whole other story. Machine learning models, for the time being, do not understand the physical world. They don’t know that the world consists of separate objects—cars, cats, houses—that aren’t physically connected and that have distinct functions. Cars can travel at 50 miles per hour, which neither cats nor houses can do. This presents a significant challenge when making machines “understand what they’re looking at”: If a Tesla drives behind a truck decorated with a realistic image of a field, the Tesla’s image recognition model will likely categorize the truck as a field. A convolutional network’s high accuracy at classifying images is not the same as an understanding of the relationships between the objects in those images. In this area, machine learning still has a long way to go, and human responsibility comes back into play. It’s essential to only use machine learning models in situations that are well represented by their training data; if a model has never seen a truck with an image of a field on it, there is little reason to believe that it will understand “what it’s looking at.”

 

My friend runs into the same problem every time she passes through passport control at Oslo airport in Norway. The automated facial recognition system is there to expedite passengers, but my friend is always sent to manual evaluation—because she’s Asian. The model used for facial recognition performs well on Caucasian faces but struggles with most other ethnicities. The reason for this is probably that it was trained on faces that resemble the average white person. However, that’s not a good excuse, and it again highlights the importance of human involvement in developing machine learning models. It’s our responsibility to ensure that they’ve had the opportunity to learn everything they need to know, by ensuring that their training data covers all the situations they’re expected to encounter.

 

Google learned this the hard way in the form of a true PR nightmare in 2015. The company released a machine learning model trained on images from the internet and claimed that it could classify anything. Shortly after, a young web developer named Jacky Alciné posted the tweet “My friend’s not a gorilla,” accompanied by a photo. The photo was a selfie of Jacky and his friend—both African American—that the model had labeled as “gorillas.” The whole thing was incredibly inappropriate and uncomfortable for everyone involved. Google issued an unconditional apology, and it didn’t take long before they removed the categories “gorilla,” “chimpanzee,” and “monkey” from the system. This solution was a cheap fix, but Google wanted to ensure that the same thing would never happen again.

 

It’s up to us to decide which categories a classification model can use to categorize things. If the category “African American” wasn’t included, and the model created an internal representation where Jacky and his friend were closer to “gorilla” than “person,” it was probably because the model’s training data didn’t contain enough people of non-Western ethnicities. Exactly which internal representations a neural network creates is an interesting question. Machine learning models are attentive to what they consider most important, and just as in our earlier example of people with lung disease and asthma, this isn’t necessarily the same as what humans consider most important. The good news is that there are several ways to investigate which part of an image a convolutional network devotes the most attention to.

 

Editor’s note: This post has been adapted from a section of the book Machines That Think: How Artificial Intelligence Works and What It Means for Us by Inga Strümke. Inga is a Norwegian physicist specializing in artificial intelligence and machine learning. She was born in 1989 in Gummersbach, Germany, and grew up in Narvik, Norway. Strümke holds a master’s degree in theoretical physics from the Norwegian University of Science and Technology (NTNU) and a doctorate in particle physics from the University of Bergen. She is currently an associate professor at NTNU. Strümke is also known for her work in AI ethics and has received an award for science communication from the Norwegian Research Council. She published Maskiner som tenker in 2023. The book was recognized with the Brageprisen, a prestigious Norwegian literature prize.

 

This post was originally published 3/2026.