LLM Model Selection: Key Criteria for Generative AI Projects

How do you select the “right” LLM model for your task? This post explores the options to consider.

 

Depending on your project, you might have hard criteria and soft criteria. To process long input prompts, the context window is a hard factor. To interact with a model that includes recent developments and trends, the model cutoff date could be extremely important.

 

A sweet spot can be found within the boundaries of model price, latency, and performance. Finding the right balance is important for any generative AI application. This trade-off is highly contextual and influenced by the specific requirements of your project and the resources at your disposal.

 

A model's performance directly affects the quality of its outputs. Closely related to performance is the price of using the model: in general, the more powerful a model is, the more expensive it is to use. If you need a model that inherently "knows" about recent events, you must take the knowledge cutoff date into account. Your input data might also drive model selection, since a context window that is too small will truncate your input. And for real-time applications, you need to ensure that the model has low latency.

 

Depending on the data you work with, you might want to operate the LLM on your own resources (on-premise) rather than using a model hosted in the cloud. Associated with the model hosting is the model license. Open-source and open-weight models provide their model weights so that you can download and run them on your own hardware, which is not possible with proprietary models that are accessible only via APIs.

 

Many of these parameters are interconnected, which makes model selection quite difficult. Let’s consider the performance of a model and take a closer look at model selection criteria.

 

Performance

You can check the performance of different models at Chatbot Arena and get a result, as shown in this figure.

 

Chatbot Arena Leaderboard

 

The models are sorted by arena score. But how is this score calculated? It is called an "arena" for a reason. In the arena, the user interacts with two anonymous models: model A and model B. The user enters a prompt, receives a response from each model, and votes for the one that performed better. Chatbot Arena is thus a double-blind test, which is considered the gold standard in the evaluation of test outcomes. In the screenshot, notice that multiple models can share the same rank: ranks are assigned using the 95% confidence interval of the arena score, so models with overlapping intervals are ranked equally. The leaderboard changes frequently, so the rankings you see will likely look different by the time you visit.
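To build intuition for how pairwise votes turn into a ranking, here is a simplified Elo-style rating update. This is only an illustration of the general idea behind arena-style leaderboards; Chatbot Arena's actual methodology (a Bradley-Terry model with confidence intervals) is more involved, and the starting rating and k-factor below are arbitrary choices.

```python
# Simplified Elo-style rating update for pairwise model comparisons.
# This is an illustrative sketch, not Chatbot Arena's actual method.

def expected_score(r_a: float, r_b: float) -> float:
    """Probability that model A beats model B, given current ratings."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update(r_a: float, r_b: float, a_won: bool, k: float = 32.0):
    """Return updated ratings after one head-to-head vote."""
    e_a = expected_score(r_a, r_b)
    s_a = 1.0 if a_won else 0.0
    # Winner gains rating, loser loses the same amount.
    return r_a + k * (s_a - e_a), r_b + k * ((1.0 - s_a) - (1.0 - e_a))

# Two models start at 1000; a user votes for model A.
r_a, r_b = update(1000.0, 1000.0, a_won=True)
print(round(r_a), round(r_b))  # → 1016 984
```

After many such votes across many users, the ratings converge toward a stable ordering of the models.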

 

By default, the overall ranking is shown, but you can also check the rankings in other categories like Math or Coding. When searching for a model for a specific task, we advise selecting the category that matches your task.

 

But performance is not the only relevant factor.

 

Knowledge Cutoff Date

Each model has a knowledge cutoff date, which is the date on which its training data was finalized. The model is subsequently trained on that data, so nothing more recent can be represented in the model weights. This is why the knowledge cutoff date matters: if you ask a model about an event that occurred after its cutoff, the model simply cannot know about it.

 

For chatbots, this parameter is becoming less relevant because these models are more often equipped with internet search capabilities to retrieve up-to-date information by themselves. However, for you as a developer of AI systems, knowledge cutoffs might be an important factor to consider.

 

On-Premise versus Cloud-Based Hosting

Another important aspect of model selection is data privacy. If you're dealing with confidential information, you or your customers might not want the information to leave the company network. In that case, you want to use a model that is hosted in the safe haven of the internal network. But in that case, you're most likely bound to open-source or open-weight models.

 

Open-Source, Open-Weight, and Proprietary Models

Proprietary models are provided to users via web applications or APIs. Well-known providers in this class include OpenAI and Anthropic, both of which offer their models this way.

 

Google is a special case: some of its models are provided as proprietary models via APIs (e.g., Gemini), while other model families, like Gemma, are released with open weights via Hugging Face.

 

To be completely accurate, we should differentiate "open source" and "open weight." Truly open-source models are provided with all their details, such as the model architecture and the training data used. This total transparency is rare, however. A provider might release a trained model with its weights to the public, but keep specific details of the underlying data and training secret.

 

A famous example of this group is Meta with its Llama model family. These models are free to use, but the company doesn’t disclose its training data.

 

Price

The price of using an LLM service can be a significant factor in your model selection process. Typically, proprietary models are paid on a token basis. To be more precise, a differentiation is made between input tokens and output tokens, with input tokens typically being cheaper than output tokens. You can find current prices on the providers' pricing pages, for example those of OpenAI and Anthropic.

 

You should come up with some estimates on how many API requests and how many tokens will be processed. Based on these estimates, you can derive an estimate of your total costs.
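Such an estimate is simple arithmetic once you have assumed values for request volume and token counts. The sketch below uses placeholder prices and volumes; substitute your provider's actual per-token rates and your own traffic estimates.

```python
# Back-of-the-envelope cost estimate for a token-priced LLM API.
# All prices and volumes below are assumed placeholders, not real rates.

PRICE_PER_1M_INPUT = 3.00    # USD per 1M input tokens (assumption)
PRICE_PER_1M_OUTPUT = 15.00  # USD per 1M output tokens (assumption)

requests_per_day = 10_000    # estimated API requests per day
avg_input_tokens = 1_500     # estimated prompt + context per request
avg_output_tokens = 300      # estimated response length per request

# Cost per request = input cost + output cost, scaled by daily volume.
daily_cost = requests_per_day * (
    avg_input_tokens / 1e6 * PRICE_PER_1M_INPUT
    + avg_output_tokens / 1e6 * PRICE_PER_1M_OUTPUT
)
print(f"~${daily_cost:.2f}/day, ~${daily_cost * 30:.2f}/month")
# → ~$90.00/day, ~$2700.00/month
```

Note how output tokens dominate the bill despite being a fraction of the volume; this is why response length limits are a common cost control.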

 

Context Window

Your project might involve processing very long documents, making it necessary to pass as much information as possible to the model. In that case, the context window is a driving factor in the choice of model.

 

If you check, for example, the models on Groq, you'll find models with rather small context windows, like LLaVA 1.5 7B with a context window of 4,096 tokens, as well as models with very large ones, like Llama 3.3 70B Versatile with a context window of 128,000 tokens.
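A quick feasibility check is to estimate whether your documents fit a candidate model's context window. The sketch below uses the crude heuristic of roughly four characters per token for English text; for exact counts, use the provider's own tokenizer instead.

```python
# Rough check whether a document fits a model's context window.
# The 4-characters-per-token heuristic is a crude approximation for
# English text; use the provider's tokenizer for exact counts.

def rough_token_count(text: str) -> int:
    """Very rough token estimate: ~4 characters per token."""
    return max(1, len(text) // 4)

def fits_context(text: str, context_window: int,
                 reserved_output: int = 512) -> bool:
    """Check the prompt fits while leaving room for the model's answer."""
    return rough_token_count(text) + reserved_output <= context_window

doc = "word " * 5000  # a ~25,000-character document
print(fits_context(doc, context_window=4_096))    # → False (too small)
print(fits_context(doc, context_window=128_000))  # → True
```

Reserving headroom for the output matters because the context window is shared between the prompt and the generated response.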

 

Latency

Some use cases require extremely fast model responses, regardless of the trade-offs that come with hosting a model. If latency does not play a role, you might even run an open-source model on a CPU. In other cases, latency might be the most relevant factor.

 

For example, let's say you want to couple an LLM with voice generation to enable real-time chats. In such a situation, the LLM can easily become the bottleneck and degrade the user experience, because a conversation cannot feel "natural" if the conversation partner takes a long time to respond.
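Before committing to a model, it is worth measuring its response time under realistic conditions. The sketch below shows the measurement pattern; `call_model` is a hypothetical stand-in, simulated here with a sleep so the snippet runs self-contained. In practice you would replace it with your actual API or local inference call.

```python
# Minimal latency measurement sketch. call_model is a hypothetical
# stand-in, simulated with a sleep; swap in your real model call.
import time

def call_model(prompt: str) -> str:
    time.sleep(0.05)  # simulate ~50 ms of inference latency
    return "response"

def measure_latency(prompt: str, runs: int = 5) -> float:
    """Average wall-clock seconds per call over several runs."""
    start = time.perf_counter()
    for _ in range(runs):
        call_model(prompt)
    return (time.perf_counter() - start) / runs

avg = measure_latency("Hello")
print(f"average latency: {avg * 1000:.0f} ms")
```

For streaming APIs, also measure time to first token, which often matters more for perceived responsiveness than total generation time.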

 

Conclusion

Selecting the right LLM is rarely a straightforward decision. The criteria explored here are deeply interconnected, and optimizing for one often means making trade-offs on another. A high-performing proprietary model may be too costly at scale; an open-weight model hosted on-premise may introduce latency that rules it out for real-time applications.

 

The right choice is always shaped by the specific demands of your project. By systematically working through these criteria and weighing them against your constraints, you can move from an overwhelming field of options to a well-reasoned model selection that serves your application effectively.

 

Editor’s note: This post has been adapted from a section of the book Generative AI with Python: The Developer’s Guide to Pretrained LLMs, Vector Databases, Retrieval-Augmented Generation, and Agentic Systems by Bert Gollnick. Bert is a senior data scientist who specializes in renewable energies. For many years, he has taught courses about data science and machine learning, and more recently, about generative AI and natural language processing. Bert studied aeronautics at the Technical University of Berlin and economics at the University of Hagen. His main areas of interest are machine learning and data science.

 

This post was originally published 4/2026.

Recommendation

Develop Your Own Gen AI Applications with Python!

Ready to build your own generative AI applications? This developer's guide walks you through working with pretrained LLMs, vector databases, and retrieval-augmented generation using Python — then takes it further with agentic systems built on frameworks like LangChain, CrewAI, and OpenAI Agents. It's a practical, code-first path from AI concepts to fully deployed applications.

Learn More
by Rheinwerk Computing

Rheinwerk Computing is an imprint of Rheinwerk Publishing and publishes books by leading experts in the fields of programming, administration, security, analytics, and more.
