Large language models (LLMs) are a special case of neural networks that are specifically optimized for dealing with language.
LLMs “understand” text instructions and can answer questions. Programming languages are ultimately just languages, albeit quite specialized ones. LLMs that have been trained with a sufficient number of code examples can therefore program surprisingly well.
Children begin to learn a language by hearing their parents and other people speak, often articulating their first words quite indistinctly. In nursery school, they sing songs with other children. At elementary school, they learn to read and write. Over the course of nearly 20 years, children acquire more and more words and the underlying grammar. They constantly receive feedback, interact with their environment, and learn to communicate as well as possible. As teenagers, they can communicate in speech and writing, understand complex texts, and formulate such texts themselves. If you’re reading this post, you’ve already largely completed this learning process. Nevertheless, you’re still expanding your vocabulary with new technical terms!
LLMs work in a very similar way. During training, which takes only a few months in enormous server farms, the neural network is fed with gigantic amounts of text—all of Wikipedia, the code collected on GitHub, entire libraries of books, Facebook posts, and so on. (The extent to which this training is permissible at all without the consent of the authors of these texts must first be clarified in court proceedings.)
Presumably because of the legal uncertainty just mentioned, AI companies are revealing less and less about which data they have used for training. The sources of the free language model Llama 1 are broken down relatively precisely. The training material consists of approximately 1,200 billion tokens. This corresponds to around 2.4 billion pages of text in the format of a book!
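To get a feeling for this scale, here is a minimal back-of-the-envelope sketch in Python. The figure of roughly 500 tokens per printed book page is our own assumption for illustration, not an official number:

```python
# Rough estimate: how many book pages do 1,200 billion tokens correspond to?
# (Assumption: ~500 tokens per printed page.)
tokens = 1_200_000_000_000      # ~1,200 billion training tokens (Llama 1)
tokens_per_page = 500           # assumed average for a book page

pages = tokens / tokens_per_page
print(f"{pages:,.0f} pages")    # -> 2,400,000,000 pages, i.e., ~2.4 billion
```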
Almost 90% of the training material comes from various publicly accessible websites, around 5% from GitHub, and around 2% from Wikipedia. The rest consists of publicly accessible scientific publications (arXiv) and Stack Exchange contributions. Various training datasets are available for download from the internet for AI research purposes.
Current LLMs store this knowledge in models comprising many billions of parameters. In principle, these parameters correspond to the weightings of the connections between the nodes of the neural network. (In the human brain, too, the synapses between neurons differ in strength; that is, they transmit signals more or less strongly. During learning, the synapses change in the brain, and the weightings change in the neural model.)
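The following minimal sketch illustrates what such a parameter is: the weight of a connection feeding into a node. All values here are made up purely for illustration:

```python
# A single artificial neuron combines its inputs using trainable weights.
# These weights (and the bias) are the "parameters" of the model.
import math

inputs  = [0.2, 0.7, 0.1]       # activations coming from previous nodes
weights = [1.5, -0.8, 0.3]      # trainable parameters (the "synapses")
bias    = 0.05                  # an additional trainable parameter

z = sum(w * x for w, x in zip(weights, inputs)) + bias
activation = 1 / (1 + math.exp(-z))   # sigmoid squashes the result to 0..1
print(activation)
```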
LLMs differ not only in the number of parameters but also in the precision (quantization) with which those parameters are stored. Often only 4 bits are used per parameter. This means that only 16 distinct values can be represented, but two parameters fit into a single byte, which saves space.
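The idea of fitting two 4-bit parameters into one byte can be sketched in a few lines of Python. Note that this is only an illustration of the packing; real quantization schemes additionally store scaling factors per block of weights, which we omit here:

```python
# Sketch of 4-bit packing: two parameters share one byte.

def pack(p1: int, p2: int) -> int:
    """Pack two 4-bit values (0..15) into a single byte."""
    assert 0 <= p1 <= 15 and 0 <= p2 <= 15
    return (p1 << 4) | p2

def unpack(byte: int) -> tuple[int, int]:
    """Recover the two 4-bit values from one byte."""
    return byte >> 4, byte & 0x0F

b = pack(9, 3)
print(bin(b), unpack(b))   # -> 0b10010011 (9, 3)
```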
For example, the freely available language model llama3:8b comprises 8 billion parameters (the 8b in llama3:8b stands for 8 billion). The total memory requirement of this model is around 5 GB because, in addition to the parameters, some other data is stored.
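A quick calculation shows where this figure comes from; the 4-bit figure is assumed for the bulk of the weights, and the remaining gigabyte or so is taken up by data stored at higher precision:

```python
# Rough memory estimate for an 8-billion-parameter model at 4-bit quantization
# (sketch only; the real model file also contains embeddings, metadata, and
# other data stored at higher precision).
parameters     = 8_000_000_000
bits_per_param = 4

weight_bytes = parameters * bits_per_param / 8
print(f"{weight_bytes / 1e9:.1f} GB just for the weights")  # -> 4.0 GB
```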
There are currently attempts to store parameters with only 1 bit or with 1.58 bits (b1.58). The odd number arises because each parameter takes one of three states—-1, 0, and 1—and encoding three states requires log2(3) ≈ 1.58 bits. This kind of economical quantization allows models with even more parameters in the same amount of memory. However, it remains to be proven whether this is really advantageous. Using these so-called 1-bit LLMs efficiently also requires new hardware.
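The following sketch shows where the 1.58 comes from and how several ternary parameters could, in principle, share a byte. The packing scheme shown is purely illustrative and not the one used by any specific 1-bit LLM implementation:

```python
# Encoding one of three states (-1, 0, 1) requires log2(3) bits of information.
# Five ternary values fit into one byte because 3**5 = 243 <= 256.
import math

print(math.log2(3))            # -> 1.584962500721156

def pack_ternary(values):
    """Pack up to 5 ternary values (-1, 0, 1) into one byte."""
    assert len(values) <= 5
    code = 0
    for v in values:
        code = code * 3 + (v + 1)   # map -1/0/1 to 0/1/2
    return code

print(pack_ternary([-1, 0, 1, 1, 0]))   # a single number in the range 0..242
```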
Language models with more parameters understand more complex questions and usually give better answers. However, they also require more resources, both during training and later during use. The GPT-4 model behind ChatGPT, which isn’t publicly accessible, probably comprises more than 1,000 billion parameters. The largest variant of the free Llama language model has around 400 billion parameters (as of mid-2024).
However, the number of parameters isn’t the only criterion. The quality of a language model depends heavily on the material used during training and fine-tuning. Finally, language models can be specialized in terms of content. Some LLMs are explicitly intended for programming; for these, a particularly large number of code examples were used during training.
One of the most exciting AI research topics at the moment is the attempt to make high-quality language models as small as possible. Small language models (SLMs) could run locally on notebooks and even smartphones. “Small” is relative, though; such models still comprise several billion parameters.
One approach is not to train on terabytes of undifferentiated text collected from the internet, but to select a smaller amount of higher-quality training material. It remains to be seen whether and when the quality of such “small” models can keep up with large commercial models.
Editor’s note: This post has been adapted from a section of the book AI-Assisted Coding: The Practical Guide for Software Development by Michael Kofler, Bernd Öggl, and Sebastian Springer. Michael studied telematics at Graz University of Technology and is one of the most successful German-language IT specialist authors. In addition to Linux, his areas of expertise include IT security, Python, Swift, Java, and the Raspberry Pi. He is a developer, advises companies, and works as a lecturer. Bernd is an experienced system administrator and web developer. Since 2001 he has been creating websites for customers, implementing individual development projects, and passing on his knowledge at conferences and in publications. Sebastian is a JavaScript engineer at MaibornWolff. In addition to developing and designing both client-side and server-side JavaScript applications, he focuses on imparting knowledge. He inspires enthusiasm for professional development with JavaScript as a lecturer for JavaScript, a speaker at numerous conferences, and an author.
This blog post was originally published in April 2025.