Learn Computing from the Experts | The Rheinwerk Computing Blog

How to Choose a Machine Learning Model for Your Project

Written by Rheinwerk Computing | Apr 8, 2026 1:00:03 PM

Choosing the right machine learning model can feel overwhelming, especially when you're new to the field.

 

With dozens of algorithms to consider and no shortage of conflicting advice online, it's easy to get lost before you've even written a line of code. The good news is that model selection doesn't have to be a guessing game. By working through a few key questions about your data and your goals, you can narrow the field quickly and make a confident, well-reasoned choice.

 

There are a number of resources available online that map out the full suite of options in the modeling space; the scikit-learn (sklearn) documentation is a good example.

 

The figure below shows the framework we’ll use, which consists of the general order of algorithms to consider as well as their associated complexity.

[Figure: model selection framework, ordering the algorithms by the sequence in which to consider them and by their relative complexity, from regression (lowest) up through tree-based models (higher)]

You’ll see that regression should usually be considered first, and it has relatively low complexity. Then, we get into the tree-based models, which have an increasing level of complexity.

 

The trends and themes in this framework won’t always hold true in every situation. The goal is to give you a general framework you can use to think about model selection, since it can be a confusing and intimidating task when you’re getting started.

 

How Important Is Interpretability?

Interpretability is the ability to understand why your model makes the decisions it makes, and it deserves a moment’s thought before you pick a model. We’ve just established that regression is lower in complexity, but here’s a curveball: if interpretability isn’t high on your priority list, you can skip regression entirely. This is primarily because of the additional data cleaning and assumptions regression requires. A decision tree may take slightly longer to train, but that minimal extra training time is more than offset by the effort saved in preparing your data properly for a regression model.

 

The biggest drawback you’ll see with tree-based models is that they’re black boxes that are hard to interpret. There are techniques for interpreting tree-based models, so this is by no means a binary decision where you’re either getting a better model or an interpretable one. However, if your stakeholder is hyperfocused on understanding the why behind a prediction, or the model is being used in a day-to-day operational setting, understanding how the model makes its predictions yields significant value. The coefficients generated by regression models provide a degree of specificity in interpretation that is hard to replicate with tree-based models.
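To make the coefficient point concrete, here’s a minimal sketch using plain NumPy least squares (sklearn’s LinearRegression fits the same model). The variable names and data are invented for illustration:

```python
import numpy as np

# Hypothetical example: predicting sales from ad spend and store size.
rng = np.random.default_rng(0)
ad_spend = rng.uniform(1, 10, 200)
store_size = rng.uniform(100, 500, 200)
# Simulated "true" relationship: each $1 of ad spend adds ~3 sales,
# each square foot of store size adds ~0.05.
sales = 3.0 * ad_spend + 0.05 * store_size + rng.normal(0, 1, 200)

# Ordinary least squares via NumPy; the first column is the intercept.
X = np.column_stack([np.ones_like(ad_spend), ad_spend, store_size])
coefs, *_ = np.linalg.lstsq(X, sales, rcond=None)
intercept, beta_ad, beta_size = coefs

# The coefficients read directly as effect sizes: "each additional
# dollar of ad spend is associated with ~beta_ad more sales."
print(f"ad spend coefficient: {beta_ad:.2f}")
print(f"store size coefficient: {beta_size:.3f}")
```

That one-sentence reading of a coefficient is exactly the interpretability that is hard to get out of a tree ensemble.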

 

Let’s summarize our decision-making process:

  • If interpretability is very important: Stick with regression as the default and move to the next question.
  • If interpretability is not important: Skip the next question and switch to tree-based models.

How Many Rows and Columns?

If you’ve identified that interpretability is important, the next step is identifying whether your data can support a regression use case. You should think about rows and columns together because of the limitations in how the backend math of regression works.

 

Here’s a mental model: the more rows you have, the more columns you can use. As the number of records grows, the math behind the model can look across more columns and find relationships between them. A dataset with only 100 rows and 30 columns won’t work for regression.

 

As a rule of thumb, keep your number of columns at 30 or fewer if you have fewer than 100,000 rows to train your model on. If you have hundreds of thousands of rows, 50 columns or fewer is generally acceptable. For larger datasets, aim for roughly 3,000–5,000 rows per column.
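These rules of thumb can be captured in a small helper function. The thresholds below simply encode the heuristics above; they’re guidelines, not hard mathematical limits:

```python
def max_columns_for_rows(n_rows: int) -> int:
    """Rule-of-thumb cap on regression columns for a given row count."""
    if n_rows < 100_000:
        return 30
    if n_rows < 1_000_000:  # "hundreds of thousands of rows"
        return 50
    # Larger datasets: roughly 3,000-5,000 rows per column;
    # use the conservative end of that range here.
    return n_rows // 5_000

print(max_columns_for_rows(50_000))      # 30
print(max_columns_for_rows(300_000))     # 50
print(max_columns_for_rows(10_000_000))  # 2000
```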

 

These guidelines allow the math behind the regression to operate correctly. They also ensure you’re not breaking any of the assumptions in regression.

 

Keep Multicollinearity in Mind

Multicollinearity occurs when you have two columns that are highly correlated with each other. Regression assumes there is no multicollinearity in the dataset. As your column count grows, this becomes a more challenging dynamic to manage.
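One quick way to screen for multicollinearity is a correlation matrix. Here’s a minimal NumPy sketch on made-up data, where one column is nearly a rescaled copy of another:

```python
import numpy as np

rng = np.random.default_rng(1)
income = rng.normal(50_000, 10_000, 500)
# A nearly redundant column: income in thousands plus a little noise.
income_k = income / 1_000 + rng.normal(0, 0.5, 500)
tenure = rng.uniform(0, 20, 500)  # an unrelated column

data = np.column_stack([income, income_k, tenure])
corr = np.corrcoef(data, rowvar=False)

# income vs. income_k correlate near 1.0 -- a multicollinearity red flag.
# A common rule of thumb is to drop one of any pair above ~0.8-0.9.
print(round(corr[0, 1], 2))
print(round(corr[0, 2], 2))
```

For a more rigorous check, variance inflation factors (VIF) extend this idea beyond pairwise correlations.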

 

To summarize our rules:

  • If you have an appropriate column-to-row ratio: Stick with regression and move to the next question.
  • If you do not have an appropriate column-to-row ratio: Switch to a tree-based model and move to the next question.

What Is Being Predicted?

At this point, you’ve identified the category of model you’ll be using: either a regression approach or tree-based approach. Now, you’ll need to understand what you’re predicting and the associated next step based on the type of model you selected. The two categories of what we’re predicting are called regression (if you’re confused, keep reading) and classification.

Regression

Regression in this context translates to predicting a number. If you’re predicting the number of sales, this is a regression or regressor prediction. Regression models naturally do this, and linear regression does this explicitly. When most people think about predictive models, they’re likely thinking about a model that predicts a specific number.

Sometimes Regression Is Regression… Sometimes It’s Not

I’m not entirely sure who thought it was a good idea to name the overall approach to predicting a number “regression” when this terminology is already reserved for linear and logistic regression, but it is what it is. This has confused me on a few occasions when onboarding onto a new project or team, so it’s never a bad idea to clarify what someone means when they say “regression.”

Classification

Classification is well-named. The objective of the classification model is to classify your data. In practice, it’s still technically predicting a number. For example, if you’re building a model to predict whether an employee will leave the company, your model will classify them either as someone who will stay or as someone who will leave. Someone who will leave is often coded in the data as a 1 and someone who will stay is coded as a 0.

 

Probabilities are at the core of classification. While it can depend on the use case, how valuable is it to provide a binary prediction? Psychologically, it creates a perception of confidence. However, using the employee turnover model example, what if your model predicts an employee will leave in the next six months, but they’re still employed at the company on month seven?

 

Thinking probabilistically is often more valuable for stakeholders, and it also keeps your model from taking unnecessary heat for being wrong. All models are wrong—the good ones are just less wrong. As an alternative, what if your model predicted the probability someone would leave in the next six months? For a specific employee, the same binary prediction that they may leave could correspond to only a 25% probability. Most decision-makers will interpret a 25% probability of an event occurring differently than simply being told the event will occur.

 

So why the lecture about probabilistic thinking? Because all classification models start with probabilities, which are then converted into binary predictions using a cutoff chosen to minimize false positives and false negatives. The difference in the code is relatively trivial, since the model outputs both, so as always it comes back to the use case.
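As a sketch of how small that code difference is, scikit-learn’s LogisticRegression exposes both outputs from the same fitted model: predict_proba for the probabilities and predict for the labels thresholded at 0.5. The turnover data below is made up for illustration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy turnover data: one feature (say, months since last promotion),
# with leavers (1) clustered at higher values. Purely illustrative.
rng = np.random.default_rng(2)
X = rng.uniform(0, 48, (300, 1))
y = (X[:, 0] + rng.normal(0, 6, 300) > 30).astype(int)

model = LogisticRegression().fit(X, y)

employee = np.array([[24.0]])
# Both outputs come from the same fitted model:
prob_leave = model.predict_proba(employee)[0, 1]  # probability of class 1
label = model.predict(employee)[0]                # probability cut at 0.5
print(round(float(prob_leave), 2), int(label))
```

Whether you surface the probability, the label, or both is the use-case decision, not a modeling one.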

 

There is also a spectrum to consider. Going back to the employee turnover example, what if a probability of 15% is considered high in this business context? In scenarios where your target variable is imbalanced (one outcome you’re predicting is more likely than the other), it can be challenging for your stakeholders to understand the full context. For employee turnover, most companies will retain the majority of their employees in a six-month span rather than see them leave (I hope). This can lead to your model output recommending that the optimal binary cutoff point for turnover should be 15%. While the math may be optimal, a stakeholder is likely to question the value of your model’s outputs. In this scenario, you can consider grouping your data into logical categories. One approach could be to group anyone with a probability of 50% or greater as high risk, 15% to 49% as medium risk, and anything below that as low risk. While you’ve introduced subjectivity into the model’s output, you’ve also met the stakeholder where they need to be to effectively consume your model’s output.
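The grouping step described above can be sketched as a small helper. The 15% and 50% cutoffs are the illustrative thresholds from the example, not universal values:

```python
def risk_bucket(p_leave: float) -> str:
    """Map a turnover probability to a stakeholder-friendly category."""
    if p_leave >= 0.50:
        return "high risk"
    if p_leave >= 0.15:
        return "medium risk"
    return "low risk"

for p in (0.62, 0.25, 0.08):
    print(p, risk_bucket(p))
```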

Translating to Model Selection

Linear regression and logistic regression models have distinct use cases. If you’re trying to predict a number, you’d use linear regression (should we call it “regression regression”?). If you’re trying to classify data, you’d use logistic regression.

 

For tree-based models, the regression versus classification distinction is almost completely abstracted from our perspective. Each model has a regressor and classifier function that can be loaded in, and the inputs required are more or less the same (e.g., DecisionTreeClassifier and DecisionTreeRegressor). In practice, this is quite nice. It’s easier to switch between approaches when you can just change the name of the function without having to change all your hyperparameters. However, it can muddy the waters from a learning perspective, because there isn’t much of a distinction when you’re applying it.
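A minimal sketch of that symmetry on made-up data: both variants take the same constructor arguments and expose the same fit/predict interface, so switching between them really is just a change of class name (and of target):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

rng = np.random.default_rng(3)
X = rng.uniform(0, 10, (200, 2))
y_number = X[:, 0] * 2 + X[:, 1]       # numeric target -> regressor
y_class = (y_number > 15).astype(int)  # binary target  -> classifier

# Identical hyperparameters, identical interface; only the class
# name and the target variable differ.
reg = DecisionTreeRegressor(max_depth=3).fit(X, y_number)
clf = DecisionTreeClassifier(max_depth=3).fit(X, y_class)

print(reg.predict(X[:2]).shape, clf.predict(X[:2]).shape)
```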

 

Conclusion

Model selection is one of those skills that gets easier the more you practice it. The framework covered in this post—weighing interpretability, evaluating your data's row-to-column ratio, and identifying whether you're predicting a number or a category—gives you a repeatable process for approaching any new project. You won't always land on the perfect model the first time, and that's expected. What matters is that you're making deliberate, informed decisions rather than reaching for whatever is most familiar or most popular. As you continue building projects and refining your intuition, these questions will become second nature.

 

Editor’s note: This post has been adapted from a section of the book Applied Machine Learning: Using Machine Learning to Solve Business Problems by Jason Hodson. Jason has worked in data-centric roles for nearly a decade. He currently works as an HR analytics manager, and he has prior experience in a forecasting role using the full range of applied machine learning. In a previous role, Jason wrote the end-to-end code for an enterprise hiring manager and candidate experience process, collaborating with recruiting leaders to understand and leverage data from a company-wide survey. He’s built large data models and dashboards and taught nontechnical users how to adopt and use them. Jason has been a technical mentor in all his roles, helping others develop their analytics and programming skill set. The common thread across Jason’s career is his ability to be a translator for stakeholders, peers, and junior team members. His learning journey also gives him a unique perspective: Before earning a master’s degree in business analytics, he was entirely self-taught. This has made his approach to teaching more practical, allowing concepts to translate better (and faster) into the business world.

 

This post was originally published 4/2026.