
A Beginner’s Guide to Transformer Neural Networks and Self-Attention in AI

Written by Rheinwerk Computing | Dec 24, 2025 2:00:01 PM

Transformer neural networks are a relatively young but revolutionary architecture in the field of artificial intelligence (AI) and machine learning (ML).

 

Originally presented in the paper “Attention Is All You Need” by Ashish Vaswani et al. in 2017, transformer networks have quickly gained popularity and are now a central component of many advanced systems, such as large language models (LLMs), especially those for natural language processing (NLP).

 

In contrast to traditional recurrent neural networks (RNNs), which process sequential data step-by-step and usually only from left to right, transformers work with a mechanism referred to as self-attention. This mechanism enables the model to consider and weigh different parts of an input sequence (e.g., a sentence or an entire text) simultaneously, regardless of how far apart these parts are. This makes transformer networks particularly effective at processing long sequences and complex dependencies.

 

A major advantage of the transformer architecture is that it can be parallelized. While RNNs are difficult to parallelize due to their sequential nature and therefore often train more slowly, the structure of the transformer allows calculations to be performed independently and simultaneously. This leads to a considerable acceleration of the training process, especially with large datasets.

 

The transformer architecture essentially consists of an encoder-decoder system based on multiple layers of self-attention and feed-forward networks. Each layer in this system helps to filter out and combine the relevant parts of the input sequence to enable the most accurate prediction or translation possible.

 

In the following sections, we’ll take a detailed look at the basic concepts and functionality of transformer neural networks. In particular, we’ll focus on embedding, positional encoding, and the self-attention mechanism, which are crucial for the model’s ability to capture the context and order of information in a sequence without errors.

 

The Network Structure

Let’s take a look at the structure and functioning of a transformer neural network using a specific NLP task: we want to generate a text based on a short piece of text, which is called a prompt.

 

Prompt and Prompt Engineering

A prompt is an input or request given to an AI model, for example, a language model, to generate a specific response or output. This can be a single word, a sentence, or even a complex command that provides the model with the context and the direction in which the response should go.

 

Prompt engineering refers to the art and science of designing and optimizing prompts in such a way that an AI model delivers the desired results. Careful consideration is given to how a prompt must be formulated to maximize the quality, relevance, and precision of the response generated by the model. This can be done by experimenting with different formulations, contextual information, and special instructions.

 

Let’s look at a quick example:

  • Prompt: “Explain the difference between weather and climate.”
  • Prompt engineering: “Describe in simple terms the difference between weather and climate, and give examples of each.”

Prompt engineering is sometimes regarded as the profession of the future, but in the meantime, LLMs are already very good at improving prompts on their own.

 

Transformer neural networks are actually nothing more than next-word estimators that append the best-matching word (or token, to be more precise) to an input text. This process is repeated iteratively until the model outputs an end token (see figure below).

 

 

For this purpose, the transformer neural network must produce a list of words (or tokens) and their probabilities in each step, as shown; the probabilities add up to 100%. This sounds simple at first, but arriving at this list requires a complex but ingenious network structure and a very large amount of text (i.e., training data).
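To make this loop concrete, here is a minimal Python sketch. The `next_token_probabilities` function is a hypothetical stand-in for the full transformer; in a real system it would be the trained model described in the rest of this post, and the token ids are made up.

```python
import numpy as np

# Hypothetical stand-in for the trained transformer: given the tokens so far,
# it returns one probability per token in the vocabulary.
def next_token_probabilities(tokens, vocab_size=5):
    rng = np.random.default_rng(len(tokens))        # deterministic toy output
    logits = rng.normal(size=vocab_size)
    return np.exp(logits) / np.exp(logits).sum()    # softmax: probabilities sum to 1

END_TOKEN = 0           # assumed id of the <End> token
tokens = [1, 4, 2]      # the prompt, already converted to token ids

# Append the most probable next token until <End> is produced.
for _ in range(20):                      # safety limit on the output length
    probs = next_token_probabilities(tokens)
    next_token = int(np.argmax(probs))   # pick the best-matching token
    tokens.append(next_token)
    if next_token == END_TOKEN:
        break

print(tokens)
```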

 

Based on the groundbreaking work by Vaswani and his colleagues mentioned earlier, we know that a transformer neural network consists of an encoder block and a decoder block.

 

The next figure provides a somewhat simplified overview of the structure of a transformer neural network. There are a lot of derivatives with minor or major modifications, but the basic structure with an encoder and a decoder block is present in all of them. To feed text into the network, we must first convert it into numbers through embedding.

 

 

Embeddings

First, the input (i.e., a sentence, paragraph, or entire book) must be split up. A natural division is into individual words and punctuation marks, and we can keep this as a mental model. However, we’re actually talking about tokens, which can be entire words but often also parts of words or punctuation marks. There are also two special tokens: the <Start> and <End> tokens, which mark the start and end of the output. Each word (or token) is then represented by a vector.

 

Word embedding in the figure below is a simplified representation. You can imagine a word as a point in a high-dimensional space, where ideally words with a similar meaning are in close proximity to each other. When we talk about a multidimensional space here, it’s not a five-dimensional space, as suggested in the figure, but can comprise several thousand dimensions in LLMs. (The embedding space for GPT-3 has more than 12,000 dimensions, for example.)

 

 

How is the embedding (i.e., the vector values) determined? This happens in the course of training the neural network with a huge dataset. Word embeddings are nothing new in themselves and have been used in NLP for a long time. However, the meaning of a word (or token) depends not only on the word itself but also on its context in a text. Transformer networks manage to incorporate this context, adapt the original embedding vector depending on the context, and place it in the correct position in the embedding space (or meaning space).
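In code, the initial embedding step is just a table lookup. The following minimal sketch assumes a tiny made-up vocabulary and a randomly initialized embedding matrix; in a real model, the matrix values are learned during training and the dimensions are far larger.

```python
import numpy as np

# Toy vocabulary; real models use tens of thousands of tokens.
vocab = {"<Start>": 0, "<End>": 1, "the": 2, "fox": 3, "jumps": 4}
d_emb = 5                                  # tiny embedding dimension for illustration

# In a real model, the embedding matrix is learned during training; here it is random.
rng = np.random.default_rng(42)
embedding_matrix = rng.normal(size=(len(vocab), d_emb))

tokens = ["the", "fox", "jumps"]
token_ids = [vocab[t] for t in tokens]

# Embedding is simply a row lookup: one d_emb-dimensional vector per token.
embedded = embedding_matrix[token_ids]
print(embedded.shape)                      # (3, 5)
```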

 

Positional Encoding

Until the development of transformer neural networks, NLP tasks were solved with RNNs. The text was presented to this form of neural network word by word or token by token. The position of a word in a text is automatically given by the sequential feeding of the RNN with words or their embedding vectors and is therefore not an issue. Transformer neural networks, on the other hand, receive the input text in parallel, that is, all at once. This makes it necessary to encode the position; otherwise, differences in meaning that arise from word order can’t be recognized.

 

 

A simple solution is an absolute positional encoding with digits, as shown below, by numbering the words of the input. However, several problems arise with this approach. First, words with a higher number are implicitly given a greater weight. Second, the inputs to a transformer neural network can have different lengths, which, with absolute numbering, also leads to unintentional shifts in meaning. If the word combination “the fox” occurs again later, as in our example, then the higher numbering would change the meaning, even though it’s the same fox in the same context.

 

 

We want to use a variant of positional encoding that has the following properties:

  • Unambiguous: Each position corresponds to exactly one value.
  • Deterministic: Our variant is based on an underlying rule.
  • Distance invariant: This makes it possible to estimate the distance between the positions of any two tokens.
  • Independent of sequence length: The input can have a different number of tokens.

In their work on transformer neural networks, Vaswani and his colleagues proposed a frequency-based positional encoding that fulfills these properties. With frequency-based positional encoding, a vector is returned depending on the position number. This position vector P has the same length as the word embedding vectors (in our example, that length is 5; with GPT-3, it’s more than 12,000). We refer to the vector length as demb.

 

The frequency-based encoding is then based on sine and cosine functions in the following way: There is a run variable k, which takes values from 1 to the vector length, demb.

 

Another run variable, p, stands for the position number of a token in the text and can have a maximum value of the sequence length. We also define a constant L, which is usually set to the value 10,000. This constant represents a scaling factor that keeps the values of our position vector within reasonable limits. Now, we can determine the rules that define the values of the position vector with the length, demb. The term “frequency-based positional encoding” comes from the fact that sine and cosine functions are used for these rules.

 

What Do Sine and Cosine Have to Do with Frequencies?

Sine and cosine are trigonometric functions that are used to describe harmonic oscillations. An oscillation can be represented by a wave function that crosses the x-axis in a diagram. The more often this crossing occurs per interval, the higher the frequency.

 

 

The red function in this figure therefore has a higher frequency than the green function; both functions can be represented by sine and cosine.

 

A sine function is used for the even elements of the position vector and a cosine function for the odd elements. The reason for this is that all sine functions return the value 0 for the input 0, so a position vector could end up as a zero vector; alternating sine and cosine functions easily prevents this.

 

So, if our run variable k is even, then we define the value at position k of the position vector as follows:

 

 

For the odd k, we define it as follows:

 

 

Formulas are sometimes not so easy to understand, so we’ve visualized the result in the figure below.

 

 

The word embedding vector and the position embedding vector are added together and serve as input for the encoder.
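As a concrete illustration, here is a minimal NumPy sketch of this frequency-based positional encoding in the standard formulation from the Vaswani et al. paper, with sine on the even vector indices and cosine on the odd ones (counting from 0, as is usual in code) and the scaling constant L = 10,000; the exact indexing convention varies slightly between presentations.

```python
import numpy as np

def positional_encoding(seq_len, d_emb, L=10_000):
    """Frequency-based positional encoding in the style of Vaswani et al. (2017)."""
    P = np.zeros((seq_len, d_emb))
    positions = np.arange(seq_len)[:, None]        # p = 0 .. seq_len - 1
    dims = np.arange(0, d_emb, 2)[None, :]         # even vector indices 0, 2, 4, ...
    angles = positions / L ** (dims / d_emb)       # lower frequencies for higher indices
    P[:, 0::2] = np.sin(angles)                    # sine for the even indices
    P[:, 1::2] = np.cos(angles[:, : d_emb // 2])   # cosine for the odd indices
    return P

seq_len, d_emb = 3, 5                              # 3 tokens, 5 embedding dimensions
pos = positional_encoding(seq_len, d_emb)

# The word embedding vector and the position vector are simply added element-wise.
rng = np.random.default_rng(0)
word_embeddings = rng.normal(size=(seq_len, d_emb))
encoder_input = word_embeddings + pos
print(encoder_input.shape)                         # (3, 5)
```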

 

Encoder

The main task of the encoder is to convert the input sequence into mathematical vectors. It must take the position (via positional encoding) and the context information (via multi-head attention) into account accordingly to provide a rich, contextualized representation for further processing.

Self-Attention

One of the basic problems of text processing is that the meaning of a word depends on the context, that is, on the entire sequence of an input text. Recurrent networks were already able to cover this to a certain extent. However, the context dependency was only taken into account in one direction in the sentence. The second problem with recurrent networks is that it’s difficult to encode the context of longer text sequences. The key to the attention mechanism is that the model has access to the entire (potentially very long) sequence at once, because a sequence is presented to the transformer neural network in its entirety and not word by word, as is the case with recurrent networks.

 

The task of self-attention is to enrich the representation of a sequence by taking the context into account. You essentially start with the input embedding, that is, at a point in the multidimensional meaning space, and move this point to the appropriate location in that space depending on the context. To achieve this, the relationship with all words (tokens) in the sequence is considered for each word (or token).

 

In the figure below, you can see symbolically what these connections should look like and what influence a word change can have on them. This illustration shows the relationship of the word “it” to all other words. This representation could also be implemented as a square matrix in which the strength of the connection is entered. (In the figure, the darker the connecting line is colored, the higher the value.) The strength of the connection indicates how much attention “it” should pay to the other words in the sentence.

 

 

How can you calculate these connections? For each word, a query vector, a key vector, and a value vector are generated during training, which allow the relevance to be calculated.

Query, Key, and Value Vectors

For each word (token), a query, key, and value vector is calculated. Let’s take a look at a sentence: “The cat chases the mouse.”

  • Query: What is a particular word looking for or asking about? At the word level, the query vector could represent the question: “What other words am I related to?”
  • Key: How can a word be compared with other words? The key vector is a description of each word. The key vector of the word “mouse” could contain the following information: “I am an object that is being chased.”
  • Value: What information can a word contribute as a result? If the model knows that “cat” and “mouse” are important, then the value vector will contain the relevant information that the model should use.

At the end of the self-attention step, each word has a new representation (embedding) that takes into account how strongly it’s related to the other words in the sentence and is therefore ideally located in the appropriate place in the meaning space.
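The following minimal NumPy sketch shows this calculation as scaled dot-product attention. The projection matrices W_q, W_k, and W_v are random here purely for illustration; in a real model they are learned during training, and the sequence length and dimensions are toy values.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(1)
seq_len, d_emb, d_k = 5, 8, 8            # "The cat chases the mouse." -> 5 tokens
X = rng.normal(size=(seq_len, d_emb))    # embeddings plus positional encoding

# In a real model, W_q, W_k, and W_v are learned; here they are random.
W_q = rng.normal(size=(d_emb, d_k))
W_k = rng.normal(size=(d_emb, d_k))
W_v = rng.normal(size=(d_emb, d_k))

Q, K, V = X @ W_q, X @ W_k, X @ W_v      # query, key, and value vectors per token

# Connection strengths: how much attention each token pays to every other token.
scores = Q @ K.T / np.sqrt(d_k)          # (5, 5) matrix, scaled for numerical stability
weights = softmax(scores, axis=-1)       # each row sums to 1

# New, context-aware representation of every token.
Z = weights @ V                          # (5, 8)
print(weights.round(2))
```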

 

If you’ve read this section carefully, you’ll have noticed that earlier we didn’t talk about self-attention but about multi-head attention. Instead of just one self-attention calculation (a single “head”), multiple self-attention calculations, each with its own query, key, and value vectors, are executed in parallel. Each head can focus on different aspects of the word relationships (e.g., one head looks at the grammatical relationship, another at the semantic proximity of the words, etc.). The results of the different heads are combined to provide a more comprehensive representation of the input sequence.
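Building on the previous sketch, the following toy example runs several such heads in parallel on smaller subspaces and concatenates their results; the number of heads, the dimensions, and the final output projection W_o are illustrative assumptions, and all matrices would be learned in a real model.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_head(X, W_q, W_k, W_v):
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    weights = softmax(Q @ K.T / np.sqrt(K.shape[-1]), axis=-1)
    return weights @ V

rng = np.random.default_rng(2)
seq_len, d_emb, n_heads = 5, 8, 2
d_head = d_emb // n_heads                  # each head works in a smaller subspace
X = rng.normal(size=(seq_len, d_emb))

# One independent set of projections per head (random here, learned in practice).
head_outputs = []
for _ in range(n_heads):
    W_q, W_k, W_v = (rng.normal(size=(d_emb, d_head)) for _ in range(3))
    head_outputs.append(attention_head(X, W_q, W_k, W_v))

# The heads' results are concatenated and mixed by a final projection W_o.
W_o = rng.normal(size=(d_emb, d_emb))
multi_head_output = np.concatenate(head_outputs, axis=-1) @ W_o
print(multi_head_output.shape)             # (5, 8)
```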

 

Decoder

The decoder in a transformer neural network has the task of generating an output from the representations of the input sequence produced by the encoder. In a translation model, for example, the encoder’s representations of an English text could be used by the decoder to create a German sentence.

 

A decoder usually works autoregressively; that is, it generates the output (e.g., a word or token) step-by-step, with each new word being based on the previously generated words. This process is usually started with start information or a start token <Start>; the next word or token is generated from the text created so far, and this process is continued until an end symbol <End> is reached.

Masked Self-Attention

Of course, only the tokens that have already been generated may be used to produce the output, not tokens that will only be created in the future. Masked self-attention is used to prevent future information from being processed. For example, if the model predicts the third word, it may only access the first two words, whereas the future words are masked. In terms of math, this is usually achieved by assigning very large negative values to the connection strengths of future words, so that their attention weights become practically zero.
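A minimal sketch of such a causal mask, assuming the raw attention scores have already been computed as in the earlier self-attention example:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

seq_len = 4
rng = np.random.default_rng(3)
scores = rng.normal(size=(seq_len, seq_len))   # raw attention scores (Q K^T / sqrt(d_k))

# Causal mask: everything above the diagonal refers to future tokens.
future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
scores[future] = -1e9                          # very negative -> weight ~ 0 after softmax

weights = softmax(scores, axis=-1)
print(weights.round(2))                        # row i only attends to tokens 0 .. i
```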

Encoder-Decoder Attention

With encoder-decoder attention, the decoder uses the representation produced by the encoder to learn which parts of the input sequence are important for generating the next token. In this step, the relevance of the input sequence (e.g., the original text in English) is taken into account to correctly determine the next token (e.g., of the German translation).
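The mechanism is the same as in self-attention, except that the queries come from the decoder while the keys and values come from the encoder output. The following toy sketch with random matrices and made-up sequence lengths illustrates only this data flow.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(4)
d_model = 8
encoder_output = rng.normal(size=(6, d_model))   # representations of 6 input tokens
decoder_state = rng.normal(size=(3, d_model))    # 3 output tokens generated so far

# Queries come from the decoder; keys and values come from the encoder output.
W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))
Q = decoder_state @ W_q
K = encoder_output @ W_k
V = encoder_output @ W_v

weights = softmax(Q @ K.T / np.sqrt(d_model), axis=-1)   # (3, 6): which input tokens matter
context = weights @ V                                    # context used to predict the next token
print(weights.shape, context.shape)                      # (3, 6) (3, 8)
```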

Output

We already know that transformer neural networks are next-word estimators. So, you might think that a word gets output in every step. But this isn’t really the case; instead, the entire list of possible tokens and the corresponding probabilities are output. The tokens with the highest probabilities are then also the most likely candidates for the next position in the output sequence. For this step, a softmax function is used to ensure that the sum of the probabilities is equal to 1. In some transformer neural networks, you can set a hyperparameter called Temp(erature), which can assume a value between 0 and 1. If it’s set to 0, the token with the highest probability is always used; otherwise, a token is selected at random from the most probable candidates. This randomness can be used to increase the “creativity” of a transformer neural network.
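The effect of the temperature setting can be sketched in a few lines; the four logits below are made-up scores for four candidate tokens.

```python
import numpy as np

def probabilities_with_temperature(logits, temp):
    logits = np.asarray(logits, dtype=float)
    if temp == 0:                              # greedy: always the most probable token
        probs = np.zeros_like(logits)
        probs[np.argmax(logits)] = 1.0
        return probs
    scaled = logits / temp
    e = np.exp(scaled - scaled.max())          # softmax over temperature-scaled scores
    return e / e.sum()

logits = [2.0, 1.0, 0.5, 0.1]                  # made-up scores for four candidate tokens
rng = np.random.default_rng(5)

for temp in (0, 0.5, 1.0):
    probs = probabilities_with_temperature(logits, temp)
    token = rng.choice(len(logits), p=probs)   # higher temperature -> more randomness
    print(f"T={temp}: probabilities={probs.round(2)}, sampled token={token}")
```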

 

Training Transformer Neural Networks

Training a transformer neural network is hardly any different from training a convolutional neural network (CNN) or any other neural network. However, transformer neural networks are usually trained with much larger datasets and require correspondingly more computing power.

 

The training dataset of transformer neural networks consists of input-output pairs that depend on the specific task:

 

Translation:

Input: “The wolf chases the fox.”

Output: “Der Wolf jagt den Fuchs.”

 

Text completion:

Input: “The wolf chases”

Output: “the fox”

 

Text classification:

Input: “This movie was a disaster. I would never watch it again!”

Output: “negative”
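To show how such a pair turns into a training signal, here is a toy sketch of next-token prediction with a cross-entropy loss; the vocabulary, the random “model output,” and the sentence are purely illustrative.

```python
import numpy as np

# Toy vocabulary and one text-completion training example.
vocab = {"<Start>": 0, "<End>": 1, "the": 2, "wolf": 3, "chases": 4, "fox": 5}
sequence = ["<Start>", "the", "wolf", "chases", "the", "fox", "<End>"]
ids = [vocab[t] for t in sequence]

# Next-token prediction: the target at each position is simply the following token.
inputs, targets = ids[:-1], ids[1:]

# Hypothetical model output: one probability distribution per input position.
rng = np.random.default_rng(6)
logits = rng.normal(size=(len(inputs), len(vocab)))
probs = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)

# Cross-entropy loss: penalize a low probability for the correct next token.
loss = -np.log(probs[np.arange(len(targets)), targets]).mean()
print(round(float(loss), 3))
```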

 

Conclusion

Transformer neural networks have reshaped the landscape of artificial intelligence by addressing the limitations of earlier sequence models and introducing an architecture built for parallelization, long-range context handling, and exceptional scalability. Through components such as embeddings, positional encoding, self-attention, and encoder-decoder interactions, transformers can interpret complex relationships within text and generate high-quality outputs across tasks ranging from translation to text generation. As research and applications continue to evolve, transformer architectures remain at the core of modern NLP and form the foundation of today’s most capable large language models.

 

Editor’s note: This post has been adapted from a section of the book Programming Neural Networks with Python by Joachim Steinwendner and Roland Schwaiger. Dr. Steinwendner is a scientific project leader specializing in data science, machine learning, recommendation systems, and deep learning. Dr. Schwaiger is a software developer, freelance trainer, and consultant. He has a PhD in mathematics and has spent many years working as a researcher in the development of artificial neural networks, applying them in the field of image recognition.

 

This post was originally published 12/2025.