Along with self-healing, horizontal scaling is one of the best features of Kubernetes that significantly simplifies IT operations.
The idea behind this is simple: when the load increases, start another container, fully automatically. A metric monitors the load on the container, based on the CPU or the number of messages in a queue, for example. If the metric rises above a defined threshold, a new container is started, and a load balancer distributes the load across all running containers. Of course, the number of containers is reduced again when the load decreases. As shown in the figure below, this principle allows you to consume only what you need. In the cloud, that means you save money, in line with the pay-as-you-go principle.
In my first job after graduating, I worked for a company that was right in the middle of a cloud migration. The old sales platform ran 24/7 on a powerful server system from HP in the company's own data center. It was a huge server rack full of computing power. This server needed plenty of capacity to cope with the rush of buyers at a sales event. But most of the time, the server was only running at 40% capacity (and that's a good utilization!) and was unnecessarily heating up the data center.
However, not every application is designed for horizontal scaling; stateless applications are ideally suited to it. Horizontal scaling is a requirement that must be taken into account in the software architecture. But if everything fits, you and the operations team will be able to sleep soundly.
For horizontal pod scaling, the horizontal pod autoscaler (HPA) object is available. The HPA enables your applications to respond dynamically to changes in load by automatically increasing or decreasing the number of pods, as shown in the next figure. It can monitor certain metrics such as CPU utilization and scale automatically when threshold values are exceeded or fallen below.
Horizontal in this context means that the number of pods is increased; that is, the cluster grows in width. The counterpart to this is vertical scaling.
Good to Know: The HPA process is a control loop that runs and checks at regular intervals. The default interval is 15 seconds. This means that scaling does not take effect immediately when the threshold value of a metric is exceeded.
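If you operate your own control plane, this interval can be adjusted via a flag of the kube-controller-manager. Note that this is a cluster-wide setting, not something you configure per HPA:

kube-controller-manager --horizontal-pod-autoscaler-sync-period=15s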
Let's jump straight into an example. For Minikube, run the minikube addons enable metrics-server command in preparation. The metrics server then collects from the kubelets the pod metrics that the HPA needs.
I had to stop and restart Minikube after activating the add-on so that the HPA could get the metrics.
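In summary, the preparation looks like this; kubectl top is a quick way to check whether metrics are actually arriving:

minikube addons enable metrics-server
minikube stop
minikube start
kubectl top pods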
Note: If you want to install the metrics server on an "ordinary" cluster such as the sample Raspberry Pi cluster, you can find more information at the following address: http://s-prs.co/v596452.
For this example, we are using the Apache pod, which Kubernetes provides specifically for this use case. You can find the manifest in this code listing.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: apache-hpa
spec:
  selector:
    matchLabels:
      run: apache-hpa
  template:
    metadata:
      labels:
        run: apache-hpa
    spec:
      containers:
      - name: apache-hpa
        image: registry.k8s.io/hpa-example
        ports:
        - containerPort: 80
        resources:
          limits:
            cpu: 300m
          requests:
            cpu: 300m
---
apiVersion: v1
kind: Service
metadata:
  name: apache-hpa
  labels:
    run: apache-hpa
spec:
  ports:
  - port: 80
  selector:
    run: apache-hpa
Roll out the manifests for the deployment and the service. You can then roll out the HPA object. There you define that the monitored metric is the CPU and that the autoscaler can scale between a minimum of one pod and a maximum of three pods.
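As a minimal sketch, assuming you have saved the listing above in a file named apache-hpa.yaml (the file name is just an example), the rollout looks like this:

kubectl apply -f apache-hpa.yaml
kubectl get deployment,service apache-hpa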
Note: You can find the HPA example from the Kubernetes documentation at the following address: http://s-prs.co/v596453.
apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: apache-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: apache-hpa
  minReplicas: 1
  maxReplicas: 3
  targetCPUUtilizationPercentage: 50
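Alternatively, you can create an equivalent HPA imperatively with kubectl autoscale instead of writing a manifest:

kubectl autoscale deployment apache-hpa --cpu-percent=50 --min=1 --max=3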
Good to Know: The HPA expects requests and limits to be defined. This makes sense because the target CPU utilization is calculated relative to the pod's requests; if the pod could simply use the entire CPU during your load test, it would be difficult to see a result.
We now need to put the application under load to see the HPA in action. You can use the kubectl command for this purpose. Make sure that you create the load generator pod in the same namespace as the Apache pod; only then can it reach Apache with the command provided, because the URL uses the name of the service. If your load generator is in a different namespace, you must adapt the URL.
kubectl run -i --tty load-generator --rm --image=busybox:1.28 \
--restart=Never -- /bin/sh -c "while sleep 0.01; do wget \
-q -O- http://apache-hpa; done"
Now observe the behavior of the HPA and the deployment. As in the next figure, you can see that the load on the pods increases and the HPA starts new pods.
Note: Regarding the command, it is important that you write the command in your console in one line. Simply copying and pasting the multiline command caused problems for me.
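In a second terminal, you can watch the scaling live, for example like this:

kubectl get hpa apache-hpa --watch
kubectl get pods -l run=apache-hpa --watch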
You have now created a very simple HPA and seen it in action. The HPA becomes particularly interesting when you use custom metrics. If you have a suitable application, you will find more information on the following page: http://s-prs.co/v596454.
While the HPA adjusts the number of pods to handle the load, the vertical pod autoscaler (VPA) focuses on the resource allocation of the individual pods. The VPA optimizes the CPU and memory requirements of the pods running in your Kubernetes cluster. It enlarges or shrinks the pods' resource allocation as required, as you can see in this figure.
The VPA continuously monitors the resource utilization of the pods and compares it with the defined requests and limits. If it determines that the resource requirements are not ideal, then it adjusts the requirements.
Note: I used the VPA in a project for Prometheus. This was a good way to make Prometheus scalable without having to synchronize multiple replicas. What I found very critical about it is that the requests and limits are not recognizable at a glance. In addition, the pod behaves in a different way than the manifest in version management suggests.
For me, the VPA was an invisible magic hand that I found difficult to understand. The HPA is much easier because you can quickly see how many replicas of a pod are currently running.
If you have the option, it is best to develop your application in such a way that it can scale horizontally.
Let's briefly go through an example. We use the Apache pod again, but now we use a VPA. For this reason, make sure to delete the HPA for this example if you have not already done so.
To install the VPA, you first need a set of CRDs. You can install them using the following commands:
kubectl apply -f https://raw.githubusercontent.com/kubernetes/autoscaler/vpa-release-1.0/vertical-pod-autoscaler/deploy/vpa-v1-crd-gen.yaml
kubectl apply -f https://raw.githubusercontent.com/kubernetes/autoscaler/vpa-release-1.0/vertical-pod-autoscaler/deploy/vpa-rbac.yaml
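The CRDs and RBAC rules alone do not scale anything yet; the VPA components (recommender, updater, and admission controller) also have to run in the cluster. At the time of writing, the autoscaler repository ships a setup script for this. Treat the following as a sketch and check the repository's README for the current procedure:

git clone https://github.com/kubernetes/autoscaler.git
cd autoscaler/vertical-pod-autoscaler
./hack/vpa-up.sh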
Then you can apply the VPA object and start the load generator again as in the previous example. Observe the pod and the way the VPA handles it.
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: my-vpa
spec:
  targetRef:
    apiVersion: "apps/v1"
    kind: Deployment
    name: apache-hpa
  updatePolicy:
    updateMode: "Auto"
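To see what the VPA is doing, you can inspect its recommendations, which kubectl describe shows in the status section; watching the pods also reveals when the updater evicts and recreates them with new requests:

kubectl describe vpa my-vpa
kubectl get pods -l run=apache-hpa --watch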
You have now become familiar with both options for the automatic scaling of Kubernetes. I always prefer horizontal scaling to vertical scaling. First, it allows you to create multiple pods that run on different nodes, which ensures greater reliability. Second, the requests and limits of a single pod are set in such a way that it can still find space even on well-utilized nodes. This increases the capacity utilization and thus the efficiency of your cluster.
In addition, applications that can scale horizontally are usually more robust. But that also means that scaling is part of the application's design: in the development phase, you already need to think about how the shutdown of a pod works and how the overall application survives it. This way, you can make sure that your application survives an unintentional failure of a pod and can be scaled accordingly. Of course, scaling during operation can help, but it does not save poorly programmed apps whose architecture has a bottleneck.
In real life, you must select the scaling type that best suits your application. For example, applications that depend on a stable state are difficult to scale horizontally: databases are a prime example in this respect. With a web server like Apache, the result depends on whether the requests can be distributed well to different pods via a load balancer.
For the sake of completeness, I also want to mention the cluster autoscaler. This tool is particularly interesting if you have a very volatile load on your applications, but it is usually the responsibility of the cluster admins. It allows you to automatically start new nodes and delete old nodes. Especially in public cloud environments, you can save money immediately. The next figure shows a graphical representation of the scaling.
Good to Know: In my opinion, the cluster autoscaler provides several advantages, which become clear when you look at how it works.
How does the cluster autoscaler work? It continuously monitors the utilization of the pods and nodes in your cluster and detects when pods cannot be started because not enough resources such as CPU or memory are available on the existing nodes. Based on this knowledge, the autoscaler then initiates the addition of new nodes to provide the required resources. At the same time, it also recognizes when nodes are underutilized and removes these nodes to save resources and costs. Remaining pods are evicted and started on other nodes. This empties the node, and then it can be switched off.
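On managed cloud clusters, enabling the cluster autoscaler is usually a matter of configuration rather than installation. As an illustration, on GKE it can be switched on with a single command (the cluster name and node counts here are just example values; check your provider's documentation):

gcloud container clusters update my-cluster --enable-autoscaling --min-nodes=1 --max-nodes=5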
You need the cluster autoscaler in particular if you want to manage Kubernetes clusters in large, dynamic environments and have to think about geographical scaling in the cloud. If you are not yet using it and are looking for more information, you can find the GitHub repository at the following address: http://s-prs.co/v596455.
Editor’s note: This post has been adapted from a section of the book Kubernetes: Practical Guide for Developers and DevOps Teams by Kevin Welter.