A convolutional neural network (CNN) is a special multilayer network consisting of a detection part and an identification part.
Let’s look at the detection part first. An image consists of many individual pixels, and each pixel can take values between 0 and 255. A filter (e.g., a 3 × 3 matrix) systematically scans the entire image and combines the filter values with the underlying pixel values (this is known as convolution). The result is written to a new layer, the convolutional layer. Depending on how the filter matrix is structured, certain features such as lines, edges, points, corners, and so on can be extracted. Typically, the image is convolved with several filters in succession to extract multiple features and use them as input data for the AI black box. For an AI model, it’s not the sweet smile of a child in the picture that matters for classification, but only the features extracted from the picture.
In this post, we’ll take a closer look at the methods used to extract these features. Here too, you won’t have to carry out any calculations yourself later when developing your own AI models. However, once you’ve understood the concept, you can tune the hyperparameters in a targeted manner. In addition to those of ANN models, there are some new hyperparameters, such as the size, type, and number of filters.
In the figure below, you can see Bambam the family dog. This image demonstrates what the result of filtering can look like. The goal is to extract features from the image. These characteristics will then be used in the subsequent processes. Convolution reduces the image information and restricts the content to corners, edges, circles, and so on. This reduces the input data for the AI model to pixel patterns.
Special filters are used for pattern recognition. Filter matrices of this kind are referred to as kernels. The first kernel we want to look at is used for edge detection:
The values in the matrix are empirical values that are also used in image-processing programs. Another kernel, which sharpens the image, has the following structure:
The last example is a kernel that creates a 3D effect:
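The kernel matrices themselves can be written down in a few lines of NumPy. The exact values shown in the figures may differ slightly from the versions below, which are the ones commonly used in image processing; the sharpen kernel is the same one applied in the OpenCV example later in this post.

```python
import numpy as np

# Edge detection: the neighbors are subtracted from the center pixel,
# so uniform areas become 0 and edges stand out.
edge = np.array([[-1, -1, -1],
                 [-1,  8, -1],
                 [-1, -1, -1]])

# Sharpen: amplifies the center pixel relative to its neighbors.
sharpen = np.array([[ 0, -1,  0],
                    [-1,  5, -1],
                    [ 0, -1,  0]])

# Emboss (3D effect): opposite corners are weighted against each other.
emboss = np.array([[-2, -1, 0],
                   [-1,  1, 1],
                   [ 0,  1, 2]])
```

Note that the edge kernel sums to 0 (flat areas map to black), while the sharpen and emboss kernels sum to 1 (flat areas keep their brightness).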
A filter moves across the image from left to right and line by line from top to bottom. Values in the kernel are multiplied by the pixel values and added up. The results are then transferred to a new layer (convolutional layer).
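This multiply-and-sum mechanic can be sketched directly in NumPy. The following is a minimal illustration with stride 1 and no padding, using a tiny made-up "image" containing a vertical edge and a simple hypothetical edge kernel:

```python
import numpy as np

def convolve2d(image, kernel):
    """Slide the kernel over the image from left to right, top to bottom:
    at each position, multiply element-wise and sum the results."""
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            out[y, x] = np.sum(image[y:y+kh, x:x+kw] * kernel)
    return out

# A tiny 4x4 "image" with a vertical edge between columns 1 and 2:
img = np.array([[0, 0, 255, 255],
                [0, 0, 255, 255],
                [0, 0, 255, 255],
                [0, 0, 255, 255]], dtype=float)

# A simple kernel that responds to vertical brightness changes:
kernel = np.array([[-1, 0, 1],
                   [-1, 0, 1],
                   [-1, 0, 1]])

result = convolve2d(img, kernel)  # every position straddles the edge
```

Every kernel position here straddles the brightness jump, so all output values are large; on a uniform image the same kernel would return zeros.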
OpenCV Python: OpenCV (Open Source Computer Vision Library) is an open-source library that provides many methods for image processing. Install the OpenCV Python module via Anaconda.
It’s relatively easy to implement this procedure yourself with a few lines of Python source code.
import cv2
import numpy as np

# Load the original image
img = cv2.imread('bambam.png')

# Sharpening kernel
kernel = np.array([[ 0, -1,  0],
                   [-1,  5, -1],
                   [ 0, -1,  0]])

# Apply the kernel; depth -1 keeps the image depth of the input
img_final = cv2.filter2D(img, -1, kernel)
cv2.imwrite("bambam-sharp.png", img_final)
As usual, the required modules and then the initial image are loaded. A kernel in the form of a two-dimensional array (arrays in an array) is then defined and applied to the image using the filter2D method. When called, this method receives the initial image as the first parameter, followed by the information about the image depth (number of bits per pixel; at -1, the depth of the initial image is retained) and the kernel to be used as the last parameter. The result is saved as bambam-sharp.png.
The figure below shows the results. The first image in the top left is the original, on the right you can see the sharpened image. At the bottom left you can see a 3D effect, while at the bottom right the edges have been extracted.
Experiment with Kernel Parameters: Be sure to experiment with other kernel parameters. The result varies depending on the structure and numerical value.
What happens at the edge pixels during convolution depends on which padding you choose. With same padding, the border areas are filled with zeros so the output keeps the dimensions of the input. With valid padding, the image isn’t padded; the filter only moves within the image area, so the dimension of the output layer is smaller than with same padding.
You’re now familiar with the convolution process. You also know that the kernel can move over the edge or remain in the image. The number of pixels by which the filter moves can be specified using stride. The filter moves from left to right and line by line from top to bottom. A typical value for stride is 2, which is also an empirical value. In many examples on the internet and also in the literature, you’ll find CNNs with a kernel size of 3 × 3 and a stride of 2. You can, of course, experiment with these values later.
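The effect of padding and stride on the output dimensions can be written as a small formula, sketched here following the usual convention (also used by Keras):

```python
import math

def output_size(n, k, s, padding):
    """Spatial output size of one convolution or pooling step.
    n: input width/height, k: kernel size, s: stride."""
    if padding == 'same':
        # zero-padded, so only the stride shrinks the output
        return math.ceil(n / s)
    elif padding == 'valid':
        # the kernel stays entirely inside the image
        return (n - k) // s + 1

# 28x28 input with a 3x3 kernel:
print(output_size(28, 3, 1, 'valid'))  # 26
print(output_size(28, 3, 2, 'valid'))  # 13
print(output_size(28, 3, 2, 'same'))   # 14
```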
As mentioned earlier, the result of the convolution is stored in the new convolutional layer. The convolutional layer is followed by the pooling layer, which reduces the dimension. In this context, max pooling is widely used. A filter (e.g., 2 × 2 matrix, also an empirical value) runs through the convolutional layer and copies only the largest number in the filter area into a new layer, the pooling layer. This reduces the dimensions of the image again and further simplifies the input data for the AI black box.
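A minimal NumPy sketch of max pooling with a 2 × 2 window and stride 2, applied to a small made-up convolutional layer:

```python
import numpy as np

def max_pool(layer, size=2):
    """2x2 max pooling with stride 2: keep only the largest value
    in each non-overlapping window."""
    h, w = layer.shape
    out = np.zeros((h // size, w // size))
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            window = layer[y*size:(y+1)*size, x*size:(x+1)*size]
            out[y, x] = window.max()
    return out

conv = np.array([[1, 3, 2, 0],
                 [4, 2, 1, 1],
                 [0, 1, 5, 6],
                 [2, 3, 7, 8]])

print(max_pool(conv))
# [[4. 2.]
#  [3. 8.]]
```

The 4 × 4 layer shrinks to 2 × 2, but the strongest activation in each region survives.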
The filters are then applied, and the data is further reduced by the pooling process. How should the data be fed into the AI model? As the last step in the detection part, the final pooling layers (as a result of several convolutions) are converted into a one-dimensional vector (flatten, see figure below). From here on, we can apply everything we’ve already learned about ANNs. The number of nodes in the last layer again depends on the task and must correspond to the number of possible classes.
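The flatten step itself is a pure reshaping operation, sketched here with illustrative dimensions (not taken from a specific model):

```python
import numpy as np

# Suppose the detection part ends with 64 pooling maps of 7 x 7 pixels:
pooling_output = np.zeros((7, 7, 64))

# Flatten converts the stack into a one-dimensional input vector
flat = pooling_output.flatten()
print(flat.shape)  # (3136,) = 7 * 7 * 64
```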
RGB Color Model: The RGB color model is established in the IT world and means that the colors red, green, and blue can be used to mix all other colors. The information for any color is saved as follows: A total of 3 bytes are required, 1 byte for each basic color. The numbers 0 to 255, which represent the intensity of the respective basic color, are stored in each byte. Another connection between the basic colors is that the higher the individual intensities, the brighter the result (additive color mixing). So, if you select the highest intensity for each basic color (255), you get the color white. If, on the other hand, you select the intensity 0 for all basic colors, you get the color black. All other colors are somewhere in between. The connection can be verified as follows: if you shine different flashlights (with different colors) on one spot on the wall, the result becomes brighter and brighter.
Black-and-white images created in grayscale require 1 byte per pixel. In color images, each pixel has 3 bytes, 1 byte each for red, green, and blue.
But why do we need a detection part? If the image objects are harder to distinguish from each other, you’ll achieve better results with a CNN than with a simple ANN alone. The same applies to objects that aren’t positioned neatly in the center of the image. Let’s assume you want to recognize cat pictures: the cat can be at the top left or bottom right of the picture or on a person’s lap. Here, too, better results are achieved if the original images are first processed with filters and the features are extracted.
Let’s apply what we’ve learned to a program (K5_mnist_fashion-3.ipynb). Here, we want to classify garments again, but this time with a detection and identification part. The only change is in the structure of the model.
model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, (3, 3), padding='same',
                           activation=tf.nn.relu,
                           input_shape=(28, 28, 1)),
    tf.keras.layers.MaxPooling2D((2, 2), strides=2),
    tf.keras.layers.Conv2D(64, (3, 3), padding='same',
                           activation=tf.nn.relu),
    tf.keras.layers.MaxPooling2D((2, 2), strides=2),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(128, activation=tf.nn.relu),
    tf.keras.layers.Dense(128, activation=tf.nn.relu),
    tf.keras.layers.Dense(10, activation=tf.nn.softmax)
])
Conv2D is a convolutional layer. The first parameter specifies the number of filters to be applied, followed by the size of the filter in parentheses. You can also see that same padding is used here. The MaxPooling2D matrix is 2 × 2 in size and moves 2 pixels at a time. The combination of a convolutional and a max pooling layer is repeated in this example; the second time, 64 filters are applied. This example achieves a correct classification rate of 90%, but in return you have many additional hyperparameters, such as the number, size, and type of filters, the padding, and the stride.
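You can trace the spatial dimensions through this model by hand: same padding keeps the width and height, while each pooling step halves them.

```python
n = 28              # input: 28 x 28 x 1
# Conv2D(32, (3,3), padding='same') keeps the size: 28 x 28 x 32
n = n // 2          # MaxPooling2D((2,2), strides=2): 14 x 14 x 32
# Conv2D(64, (3,3), padding='same'): 14 x 14 x 64
n = n // 2          # MaxPooling2D((2,2), strides=2): 7 x 7 x 64
flat = n * n * 64   # Flatten: 7 * 7 * 64 = 3136 values for the Dense layers
print(n, flat)  # 7 3136
```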
For your own applications, you should again copy a similar program and adapt the source code to the new task. However, this program was only intended as an introduction to the topic of CNNs. With large amounts of data and more complex images, the detection part is much more extensive, for example, 16 convolutional layers and 5 pooling layers. There are also models with even more layers.
Editor’s note: This post has been adapted from a section of the book Developing AI Applications: An Introduction by Metin Karatas.