<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[Arslan's Tech Blog]]></title><description><![CDATA[With over 2 decades of technology and leadership experience, I help startups in designing, developing, scaling, and rolling out concepts using state-of-the-art ]]></description><link>https://blog.arslanali.io</link><generator>RSS for Node</generator><lastBuildDate>Mon, 13 Apr 2026 00:09:32 GMT</lastBuildDate><atom:link href="https://blog.arslanali.io/rss.xml" rel="self" type="application/rss+xml"/><language><![CDATA[en]]></language><ttl>60</ttl><item><title><![CDATA[Deploying Applications on Google Cloud Run: Maximizing Cost Efficiency with Serverless Containers]]></title><description><![CDATA[Google Cloud Run is a fully managed compute platform that enables you to run stateless containers without worrying about the underlying infrastructure. 
It abstracts away server management, automatically scales your applications based on demand, and o...]]></description><link>https://blog.arslanali.io/deploying-applications-on-google-cloud-run-almost-free-of-cost</link><guid isPermaLink="true">https://blog.arslanali.io/deploying-applications-on-google-cloud-run-almost-free-of-cost</guid><category><![CDATA[Kubernetes]]></category><category><![CDATA[cloud native]]></category><category><![CDATA[google cloud run]]></category><category><![CDATA[knative]]></category><category><![CDATA[Docker]]></category><dc:creator><![CDATA[Arslan Ali Ansari]]></dc:creator><pubDate>Wed, 12 Mar 2025 21:45:56 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1741804375242/87ae3044-b1e5-4247-b95d-950e9c549d19.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Google Cloud Run is a fully managed compute platform that enables you to run stateless containers without worrying about the underlying infrastructure. It abstracts away server management, automatically scales your applications based on demand, and only charges you for the resources you use.</p>
<p>It is essential to understand how Cloud Run differs from other serverless technologies and even from the traditional way of deploying on Kubernetes.</p>
<ol>
<li><p><strong>Serverless vs. Cloud Run</strong>: While both are serverless in nature, Cloud Run specifically focuses on running containerized applications. Unlike traditional serverless platforms (e.g., AWS Lambda, Google Cloud Functions), which are limited to specific runtimes and code formats, Cloud Run allows you to deploy any containerized application, giving you more flexibility.</p>
</li>
<li><p><strong>Traditional Kubernetes vs. Cloud Run</strong>: Kubernetes (e.g., GKE) requires you to manage clusters, nodes, and scaling policies. Cloud Run, on the other hand, is fully managed, eliminating the need for cluster management. It automatically scales to zero when idle and scales up instantly during traffic spikes, making it more cost-effective and easier to use for lightweight, event-driven workloads.</p>
</li>
</ol>
<p>Before we start with the deployment of our first application on Google Cloud Run, I want to showcase the pricing plan and highlight that up to 2M requests per month are already free. Even after that, it is only $0.40 per million requests. Review the detailed pricing <a target="_blank" href="https://cloud.google.com/run/pricing">here.</a></p>
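<p>As a back-of-the-envelope check, the free-tier math can be scripted. The numbers below are assumptions for illustration (2M free requests per month and a tier-1 price of $0.40 per additional million; verify current rates on the pricing page), and they exclude CPU/memory charges:</p>

```shell
# Request-based cost estimate for 5M requests/month (illustrative numbers only).
REQUESTS=5000000
FREE_TIER=2000000
PRICE_PER_MILLION=0.40
# Requests beyond the free tier are billable; never negative.
BILLABLE=$(( REQUESTS > FREE_TIER ? REQUESTS - FREE_TIER : 0 ))
COST=$(awk "BEGIN { printf \"%.2f\", $BILLABLE / 1000000 * $PRICE_PER_MILLION }")
echo "5M requests => \$$COST in request charges (CPU/memory billed separately)"
```

<p>At 5 million requests a month, the request charges alone stay close to a dollar, which is why small projects often run effectively free on Cloud Run.</p>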
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1741805712357/25b3b779-9dcf-4ef9-b152-9497c0b81857.png" alt class="image--center mx-auto" /></p>
<p>Now that we understand the basics of Google Cloud Run and how cheap it is to host our applications there, let’s try creating a “Hello World” application.</p>
<h2 id="heading-pre-requisites">Pre-requisites:</h2>
<ol>
<li><p>Google Cloud Account</p>
</li>
<li><p>Billing Enabled</p>
</li>
<li><p>Google Cloud Run API Enabled</p>
</li>
<li><p>gcloud CLI installed on your local machine (you can also use Cloud Shell)</p>
</li>
</ol>
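<p>Before proceeding, it may help to confirm the required CLIs are actually installed. A small helper sketch (the <code>need</code> function below is just an illustration, not part of gcloud):</p>

```shell
# Pre-flight check: confirm the required CLIs are on PATH before starting.
need() { command -v "$1" >/dev/null 2>&1 || { echo "missing: $1"; return 1; }; }
need gcloud || echo "install the gcloud CLI, or use Cloud Shell"
need docker || echo "install Docker, or build with 'gcloud builds submit' instead"
```
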
<h2 id="heading-agenda">Agenda:</h2>
<p>During this hands-on exploration of deployment on <strong>Google Cloud Run</strong>, we will go through the following concepts:</p>
<ol>
<li><p>Enable Cloud Run API using CLI</p>
</li>
<li><p>Create a simple Node.js Application</p>
</li>
<li><p>Create a Dockerfile to containerize the application</p>
</li>
<li><p>Create and push the image to Google Container Registry (GCR)</p>
</li>
<li><p>Deploy the Application</p>
</li>
<li><p>Clean up by deleting the service and images.</p>
</li>
</ol>
<h3 id="heading-cloud-run-api-setup">Cloud Run API setup</h3>
<p>If you are already logged in with the gcloud CLI, list your accounts to verify that the right one is active:</p>
<pre><code class="lang-bash">gcloud auth list
</code></pre>
<p>You should see an output like:</p>
<pre><code class="lang-plaintext">Credentialed accounts:
 - myaccount@mydomain.com (active)
</code></pre>
<p>Now check the project configuration to verify that the right project is selected as the current one (create a project first if required):</p>
<pre><code class="lang-bash">gcloud config list project
</code></pre>
<p>This will give output like the following:</p>
<pre><code class="lang-plaintext">[core]
project = nodejs-gcp-66776654433
</code></pre>
<p>For more information on gcloud CLI, review the cli documentation <a target="_blank" href="https://cloud.google.com/sdk/gcloud">here</a>.</p>
<p>Now that we have selected the right account and project, let’s start by enabling the Cloud Run API. You can do this from the APIs & Services section of the Google Cloud Console, but let’s use the gcloud CLI instead; run the following command:</p>
<pre><code class="lang-bash">gcloud services <span class="hljs-built_in">enable</span> run.googleapis.com
</code></pre>
<p>Enabling the API can take a moment. Once done, let us set the compute region and the environment variables we will use later in the docker commands.</p>
<pre><code class="lang-bash"><span class="hljs-built_in">export</span> LOCATION=<span class="hljs-string">"us-east1"</span>
<span class="hljs-built_in">export</span> GOOGLE_CLOUD_PROJECT=<span class="hljs-string">"nodejs-gcp-66776654433"</span>
gcloud config set compute/region <span class="hljs-variable">$LOCATION</span>
</code></pre>
<p>Set <code>LOCATION</code> to your preferred region.</p>
<h3 id="heading-write-a-nodejs-hello-world-application">Write a Nodejs Hello World Application</h3>
<p>To write an Express-based Node.js application, you will need two files: a <code>package.json</code> listing all the dependencies and an <code>index.js</code> file with the application logic.</p>
<ol>
<li><p>Create a <code>package.json</code> file and write the following content into it. (You can also use npm init, if you have npm installed on your machine)</p>
<pre><code class="lang-bash"> vi package.json
</code></pre>
<pre><code class="lang-json"> {
   <span class="hljs-attr">"name"</span>: <span class="hljs-string">"helloworld"</span>,
   <span class="hljs-attr">"description"</span>: <span class="hljs-string">"Simple hello world sample in Node"</span>,
   <span class="hljs-attr">"version"</span>: <span class="hljs-string">"1.0.0"</span>,
   <span class="hljs-attr">"main"</span>: <span class="hljs-string">"index.js"</span>,
   <span class="hljs-attr">"scripts"</span>: {
     <span class="hljs-attr">"start"</span>: <span class="hljs-string">"node index.js"</span>
   },
   <span class="hljs-attr">"author"</span>: <span class="hljs-string">"Google LLC"</span>,
   <span class="hljs-attr">"license"</span>: <span class="hljs-string">"Apache-2.0"</span>,
   <span class="hljs-attr">"dependencies"</span>: {
     <span class="hljs-attr">"express"</span>: <span class="hljs-string">"^4.17.1"</span>
   }
 }
</code></pre>
</li>
<li><p>Create <code>index.js</code> file and write the following content into it:</p>
<pre><code class="lang-javascript"> <span class="hljs-keyword">const</span> express = <span class="hljs-built_in">require</span>(<span class="hljs-string">'express'</span>);
 <span class="hljs-keyword">const</span> app = express();
 <span class="hljs-keyword">const</span> port = process.env.PORT || <span class="hljs-number">8080</span>;

 app.get(<span class="hljs-string">'/'</span>, <span class="hljs-function">(<span class="hljs-params">req, res</span>) =&gt;</span> {
   <span class="hljs-keyword">const</span> name = process.env.NAME || <span class="hljs-string">'World'</span>;
   res.send(<span class="hljs-string">`Hello <span class="hljs-subst">${name}</span>!`</span>);
 });

 app.listen(port, <span class="hljs-function">() =&gt;</span> {
   <span class="hljs-built_in">console</span>.log(<span class="hljs-string">`helloworld: listening on port <span class="hljs-subst">${port}</span>`</span>);
 });
</code></pre>
<p> You can try running it locally by first installing the dependencies with <code>npm install</code> and then running <code>node index.js</code>.</p>
</li>
</ol>
<h3 id="heading-containerize-the-application-using-dockerfile">Containerize the Application using Dockerfile</h3>
<p>The Docker daemon uses a Dockerfile to build a Docker image, which is later used to run containers. Create a file named <code>Dockerfile</code> in the same folder with the following content:</p>
<pre><code class="lang-dockerfile"><span class="hljs-comment"># Use the official lightweight Node.js 12 image.</span>
<span class="hljs-comment"># https://hub.docker.com/_/node</span>
<span class="hljs-keyword">FROM</span> node:<span class="hljs-number">12</span>-slim

<span class="hljs-comment"># Create and change to the app directory.</span>
<span class="hljs-keyword">WORKDIR</span><span class="bash"> /usr/src/app</span>

<span class="hljs-comment"># Copy application dependency manifests to the container image.</span>
<span class="hljs-comment"># A wildcard is used to ensure copying both package.json AND package-lock.json (when available).</span>
<span class="hljs-comment"># Copying this first prevents re-running npm install on every code change.</span>
<span class="hljs-keyword">COPY</span><span class="bash"> package*.json ./</span>

<span class="hljs-comment"># Install production dependencies.</span>
<span class="hljs-comment"># If you add a package-lock.json, speed your build by switching to 'npm ci'.</span>
<span class="hljs-comment"># RUN npm ci --only=production</span>
<span class="hljs-keyword">RUN</span><span class="bash"> npm install --only=production</span>

<span class="hljs-comment"># Copy local code to the container image.</span>
<span class="hljs-keyword">COPY</span><span class="bash"> . ./</span>

<span class="hljs-comment"># Run the web service on container startup.</span>
<span class="hljs-keyword">CMD</span><span class="bash"> [ <span class="hljs-string">"npm"</span>, <span class="hljs-string">"start"</span> ]</span>
</code></pre>
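<p>One caveat with <code>COPY . ./</code>: it copies everything in the build context, including a local <code>node_modules</code> folder if you previously ran <code>npm install</code> on your machine. A <code>.dockerignore</code> file next to the Dockerfile (suggested content below) keeps such artifacts out of the image:</p>

```plaintext
node_modules
npm-debug.log
.git
```
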
<p>Now you need to build your Docker image using either the local Docker daemon or, if you are running these commands in Google Cloud Shell, the <code>gcloud</code> CLI to build it in the cloud. Here are both procedures:</p>
<ol>
<li><p>Build Docker Image using Local Docker Daemon (Docker Desktop in most cases)</p>
<pre><code class="lang-bash"> docker build . -t gcr.io/<span class="hljs-variable">$GOOGLE_CLOUD_PROJECT</span>/helloworld:1.0.0
 docker push gcr.io/<span class="hljs-variable">$GOOGLE_CLOUD_PROJECT</span>/helloworld:1.0.0
</code></pre>
<p> Remember, we already created the <code>GOOGLE_CLOUD_PROJECT</code> environment variable. In Cloud Shell, you will not need to create it, as it is set by default.</p>
<p> The first command builds the image, whereas the second pushes it to GCR (Google Container Registry).</p>
</li>
<li><p>Build and push Docker image using <code>gcloud</code> cli by running the following command</p>
<pre><code class="lang-bash"> gcloud builds submit --tag gcr.io/<span class="hljs-variable">$GOOGLE_CLOUD_PROJECT</span>/helloworld
</code></pre>
<p> The above command builds the image in the cloud and pushes it to the registry in one step.</p>
</li>
<li><p>You may list the images by using the following commands</p>
<pre><code class="lang-bash"> gcloud container images list
 docker images
</code></pre>
<p> The first command lists the images in GCR, whereas the second lists the images on your local machine.</p>
</li>
<li><p>Let’s test the application by running the container locally:</p>
<pre><code class="lang-bash"> docker run -d -p 8080:8080 gcr.io/<span class="hljs-variable">$GOOGLE_CLOUD_PROJECT</span>/helloworld
</code></pre>
<p> The above command pulls the image from GCR if it is not already available locally, then runs a container from it. To test the application, run <code>curl localhost:8080</code>; you should see a <code>Hello World!</code> response from the container.</p>
</li>
</ol>
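<p>If you script this smoke test, a small retry helper avoids racing the container startup. This is an illustrative sketch, not part of any toolchain:</p>

```shell
# Poll a URL until it responds (or retries run out), then report status.
wait_http() {
  url=$1; tries=${2:-10}; delay=${3:-1}; i=0
  while [ "$i" -lt "$tries" ]; do
    curl -sf "$url" >/dev/null 2>&1 && { echo "up: $url"; return 0; }
    i=$((i + 1)); sleep "$delay"
  done
  echo "down: $url"; return 1
}
# After `docker run -d -p 8080:8080 ...`:
# wait_http http://localhost:8080 && curl localhost:8080
```
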
<h3 id="heading-deploy-the-image-as-a-cloud-run-service">Deploy the Image as a Cloud Run Service</h3>
<p>Deploying any Docker image as a container on Google Cloud Run is fairly simple and straightforward; use the following command:</p>
<pre><code class="lang-bash">gcloud run deploy --image gcr.io/<span class="hljs-variable">$GOOGLE_CLOUD_PROJECT</span>/helloworld \
--allow-unauthenticated --region=<span class="hljs-variable">$LOCATION</span>
</code></pre>
<p>We used the <code>--allow-unauthenticated</code> flag because we want to keep the application public, and the same <code>$LOCATION</code> environment variable we created earlier.</p>
<p>On Success, you will get a service URL and other details as follows:</p>
<pre><code class="lang-bash">Service [helloworld] revision [helloworld-00001-xit] has been deployed
and is serving 100 percent of traffic.

Service URL: https://helloworld-abc1234-uc.a.run.app
</code></pre>
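<p>For repeatable deployments, the same service can also be described declaratively, since Cloud Run implements the Knative Serving API. A minimal manifest (the name and image below are placeholders matching this walkthrough) that you could apply with <code>gcloud run services replace service.yaml</code> might look like:</p>

```yaml
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: helloworld
spec:
  template:
    spec:
      containers:
        - image: gcr.io/nodejs-gcp-66776654433/helloworld:1.0.0
          ports:
            - containerPort: 8080
```
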
<p><strong>Congratulations!</strong> You have successfully deployed your first Cloud Run application. You can also deploy images directly from public registries like Docker Hub and other gcr.io registries.</p>
<p>To verify your deployment, use the Service URL to open it in your browser. You can also verify and manage the deployed services by navigating to Cloud Run in your Google Cloud Console as follows:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1741811087381/4e759aaf-f48a-43f4-9d51-47371abeab27.png" alt class="image--center mx-auto" /></p>
<h3 id="heading-cleanup">Cleanup</h3>
<p>Run the following command to delete the docker image from the GCR registry:</p>
<p><code>gcloud container images delete gcr.io/$GOOGLE_CLOUD_PROJECT/helloworld</code></p>
<p>Also, run the following commands to delete the local docker image:</p>
<pre><code class="lang-bash">docker rmi gcr.io/$GOOGLE_CLOUD_PROJECT/helloworld:1.0.0
docker image prune -f
</code></pre>
<p>Finally, delete the running Cloud Run service using the CLI as follows:</p>
<p><code>gcloud run services delete helloworld --region=$LOCATION</code></p>
<h2 id="heading-conclusion">Conclusion:</h2>
<p>Deploying a container on Google Cloud Run is simple, straightforward, and cost-effective, making it an excellent choice for a wide range of applications. By leveraging its serverless architecture, you can focus on building and scaling your applications without worrying about infrastructure management. Here are some key use cases where Google Cloud Run shines, offering both convenience and cost efficiency:</p>
<ol>
<li><p><strong>Deploying a Static Website Using NGINX</strong>:<br /> Hosting a static website is a perfect fit for Cloud Run. With its ability to scale to zero, you only pay when users access your site. This is ideal for portfolios, documentation sites, or small business websites that don’t require constant uptime.</p>
</li>
<li><p><strong>Deploying a Front-End Application (React, Angular, or Vue)</strong>:<br /> Cloud Run seamlessly handles single-page applications (SPAs) built with modern frameworks. Its automatic scaling ensures your app remains responsive during traffic spikes, while the pay-as-you-go model keeps costs low during periods of low activity.</p>
</li>
<li><p><strong>Deploying an API</strong>:<br /> Whether it’s a RESTful API or a GraphQL endpoint, Cloud Run is an excellent platform for backend services. Its ability to handle concurrent requests and scale instantly makes it ideal for APIs with fluctuating traffic, such as those used in mobile apps or microservices architectures.</p>
</li>
<li><p><strong>Deploying Machine Learning Models (e.g., Deepseek LLM)</strong>:<br /> Cloud Run is a great choice for deploying machine learning models or AI-powered applications. For instance, you can containerize a large language model (LLM) like Deepseek and deploy it as a scalable, cost-effective service. Since Cloud Run scales to zero, you avoid paying for idle resources when the model isn’t in use.</p>
</li>
<li><p><strong>Event-Driven Applications</strong>:<br /> Cloud Run integrates seamlessly with event-driven architectures. For example, you can deploy a service that processes data from Pub/Sub, triggers workflows in response to Cloud Storage events, or handles webhooks from third-party services. This makes it ideal for batch processing, data pipelines, or automation tasks.</p>
</li>
<li><p><strong>Microservices and Lightweight Workloads</strong>:<br /> If you’re building a microservices-based application, Cloud Run allows you to deploy each service independently. Its lightweight nature and fast cold-start times make it perfect for small, focused services that don’t require the overhead of a full Kubernetes cluster.</p>
</li>
<li><p><strong>Prototyping and Development</strong>:<br /> For developers and startups, Cloud Run is a cost-effective way to prototype and test new ideas. You can quickly deploy and iterate on your applications without worrying about infrastructure costs or complexity.</p>
</li>
</ol>
<p>In summary, Google Cloud Run offers a versatile, cost-efficient platform for deploying a wide variety of applications. Its serverless nature, automatic scaling, and pay-as-you-go pricing make it an attractive option for everything from static websites and APIs to advanced AI models like Deepseek LLM. By choosing Cloud Run, you not only simplify deployment but also optimize costs, ensuring you only pay for what you use. Whether you’re a startup, a developer, or an enterprise, Cloud Run empowers you to focus on innovation while leaving the infrastructure management to Google Cloud.</p>
]]></content:encoded></item><item><title><![CDATA[Achieving Kubestronaut in 40 Days]]></title><description><![CDATA[My first interaction with Kubernetes was in the fall of 2017, when I planned to move my Meteor.js application away from Heroku as I wanted to implement microservices to scale individual service units.
After 3 years of playing around with Kubernetes, in 20...]]></description><link>https://blog.arslanali.io/achieving-kubestronaut-in-40-days</link><guid isPermaLink="true">https://blog.arslanali.io/achieving-kubestronaut-in-40-days</guid><category><![CDATA[Kubernetes]]></category><category><![CDATA[kubestronaut]]></category><category><![CDATA[CNCF]]></category><category><![CDATA[cncfstudents]]></category><category><![CDATA[cka]]></category><category><![CDATA[ckad]]></category><category><![CDATA[CKS]]></category><dc:creator><![CDATA[Arslan Ali Ansari]]></dc:creator><pubDate>Wed, 19 Feb 2025 07:38:20 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1739947908436/45bb2924-4005-4960-8215-e7a8fe52d77f.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>My first interaction with Kubernetes was in the fall of 2017, when I planned to move my Meteor.js application away from Heroku as I wanted to implement microservices to scale individual service units.</p>
<p>After 3 years of playing around with Kubernetes, in 2020, a recruiter wanted to discuss a consulting requirement further and said, “It would have been great if you had a Kubernetes certification”. I told him to schedule an interview with the client and assured him I would have it before then. Indeed, I passed it in less than a week, though with a score of only 71%. Scores were not important to me then; I got the offer letter, but the offer was pulled with a regret letter because of the 2020 pandemic.</p>
<p>Fast forward to December 2024: our company <a target="_blank" href="https://lowcodesol.com/">Low Code Solutions</a> got a lead to set up an on-prem OpenShift multi-cluster capability for a telco. This is when I decided to support the company’s pre-qualification in front of the customer and started preparing for the Kubestronaut certifications on Dec 13, 2024.</p>
<p>Due to my busy schedule, I was not able to spend more than a couple of hours a day, which is why I am going to share how you can smartly achieve these certifications in even less time. I will discuss each exam in the sequence I attempted them:</p>
<h2 id="heading-cka"><strong>CKA</strong></h2>
<p>I started with CKA, thinking it would be the hardest; however, it turned out to be the easiest compared to CKAD and CKS. If you have a reasonable knowledge of Linux basics, Docker, and Linux package managers, then you can prepare for it in less than a week.</p>
<h3 id="heading-important-topics"><strong>Important Topics:</strong></h3>
<p>Following are the most important topics, where you will need to spend most of your time:</p>
<ol>
<li><p>Imperative Commands to create, delete and update Kubernetes Resources, e.g: Pods, Deployments, Services etc.</p>
</li>
<li><p>Cluster Installation, Upgrade and Basic Troubleshooting</p>
</li>
<li><p>Understanding of Kubernetes Services Architecture (ETCD, Kube-Scheduler, Kube-Controller-Manager, Kube-ApiServer, etc.)</p>
</li>
<li><p>Understanding of Deployment, and Management of Nodes into a Cluster, understanding the role of Kubelet and Kube-Proxy</p>
</li>
<li><p>Understanding Static Pods and Important Config file Locations e.g: /etc/kubernetes, /var/lib/kubelet etc.</p>
</li>
<li><p>Understanding of TLS and Certificate Management.</p>
</li>
<li><p>Backup and Restore of ETCD</p>
</li>
<li><p>RBAC, how to create roles, rolebindings, clusterroles, clusterrolebindings, and service accounts. If you understand these 5 elements in depth then RBAC will be a piece of cake for you. (tip: it will also help you with your CKS exam, so spend more time with it.)</p>
</li>
</ol>
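<p>For topic 7 in particular, muscle memory matters. Here is a sketch of the backup/restore commands wrapped as shell functions (kubeadm default certificate paths are assumed; adjust for your cluster):</p>

```shell
# etcd snapshot save/restore drill for CKA (run on a control-plane node).
backup_etcd() {
  ETCDCTL_API=3 etcdctl snapshot save "$1" \
    --endpoints=https://127.0.0.1:2379 \
    --cacert=/etc/kubernetes/pki/etcd/ca.crt \
    --cert=/etc/kubernetes/pki/etcd/server.crt \
    --key=/etc/kubernetes/pki/etcd/server.key
}
restore_etcd() {
  ETCDCTL_API=3 etcdctl snapshot restore "$1" --data-dir=/var/lib/etcd-restore
}
# backup_etcd /opt/etcd-backup.db
# restore_etcd /opt/etcd-backup.db   # then point the etcd static pod at the new data dir
```
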
<h2 id="heading-ckad"><strong>CKAD</strong></h2>
<p>CKAD is more about deploying and managing cloud-native applications on Kubernetes. Remember, if you have done CKA, it will help you prepare for CKAD and even CKS; all you need to do is cover some additional concepts. Here are the topics I think you should emphasize more than others:</p>
<ol>
<li>Imperative Commands as mentioned for CKA also.</li>
</ol>
<pre><code class="lang-plaintext">k create --help 
k run --help
k expose --help
k edit --help
k replace -f file.yaml -n namespace
k delete --help
k scale --help
</code></pre>
<ol start="2">
<li>Understand the concept of Ingress Resource and practice the following command:</li>
</ol>
<pre><code class="lang-plaintext">k create ingress NAME --rule=host/path=service:port[,tls[=secret]]
</code></pre>
<ol start="3">
<li><p>Practice creating, updating, and mounting secrets, configmaps, service accounts, volumes, and volume claims in Pods/containers.</p>
</li>
<li><p>Taints, Tolerations, Node Affinity, Node Selector and nodeName for controlling how we schedule pods on specific nodes. Also understand why we need node affinity when we have taints and tolerations.</p>
</li>
<li><p>Init and Sidecar Containers</p>
</li>
<li><p>Logging, Monitoring and Probes.</p>
</li>
<li><p>Blue-Green and Canary Deployments</p>
</li>
<li><p>Pod SecurityContext and RBAC as mentioned above also.</p>
</li>
<li><p>Custom Resource Definitions and HELM</p>
</li>
</ol>
<h2 id="heading-cks"><strong>CKS</strong></h2>
<p>The CKS exam is where the real challenge begins. If you excelled in the CKA and CKAD exams, it’s easy to underestimate the difficulty of the CKS. I made the same mistake: despite my extensive Kubernetes experience and proficiency with complex commands and vim, I failed on my first attempt. So before I tell you what to prepare most, let me share my mistakes:</p>
<h3 id="heading-mistake-1"><strong>Mistake #1:</strong></h3>
<p>I spent my first hour on just 3 questions, which left me no chance to attempt them all; I was not even able to view the last 3 questions. Time management is critical. Cluster Upgrade, Setting Up Audit Policy, and ImagePolicyWebhook are the most time-consuming questions, so attempt them wisely.</p>
<h3 id="heading-mistake-2"><strong>Mistake #2:</strong></h3>
<p>My core concepts were clear and vivid on all the topics, so I thought I would lean on the documentation to solve the questions. But practice is the key; you can’t keep switching to the documentation and expect to finish in time.</p>
<h3 id="heading-mistake-3"><strong>Mistake #3:</strong></h3>
<p>I did not practice troubleshooting cluster crashes. So practice troubleshooting Api-Server or Kubelet issues by utilizing docker/crictl logs, displaying logs from /var/log folders for pods and containers, and other interactive ways. There are specific scenarios of such troubleshooting in <a target="_blank" href="https://killercoda.com/">KillerCoda Playgrounds</a>.</p>
<h3 id="heading-cks-exam-tips"><strong>CKS Exam TIPs:</strong></h3>
<ol>
<li><p>Falco questions are tricky; most of my peers were unable to solve them, so do not spend too much time on them.</p>
</li>
<li><p>Make sure you know the difference between Layer 3, Layer 4, and Layer 7 Cilium policies.</p>
</li>
<li><p>Some questions have supplementary tasks at the end of the questions. Read the whole scenario, do not assume that you are done.</p>
</li>
<li><p>Verify your answers. For every scenario, practice the verification process. Spare at least 15 to 20 min for your answer verification, which leaves you with 90 to 100 minutes for solving the scenarios.</p>
</li>
</ol>
<h3 id="heading-important-topics-for-cks-which-require-extreme-practice"><strong>Important Topics for CKS which require extreme practice:</strong></h3>
<ol>
<li><p>SBOM, benchmarking, and vulnerability-scanning CLI tools, e.g., bom, kube-bench, Trivy, lsof, strace, netstat -plnt, etc.</p>
</li>
<li><p>Api-Server Audit Log, ImagePolicyWebhook, Cluster Upgrade, Securing and Encrypting ETCD data and Setting up Network Policy (both native and cilium)</p>
</li>
<li><p>Sandboxing the Containers, Immutability and Runtime Security with Falco etc.</p>
</li>
<li><p>Setting a proper Pod securityContext, AppArmor/seccomp profiles, readOnlyRootFilesystem, allowPrivilegeEscalation, etc., in the context of Pod Security Standards.</p>
</li>
<li><p>Ingress resource with TLS secrets and its important annotations. (Hint: imperative commands will help save time.)</p>
</li>
</ol>
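<p>As a concrete reference for topic 4, a pod spec hardened along those lines (names and image are illustrative) looks like this:</p>

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: hardened-app
spec:
  securityContext:
    runAsNonRoot: true
    seccompProfile:
      type: RuntimeDefault
  containers:
    - name: app
      image: nginx:1.25
      securityContext:
        allowPrivilegeEscalation: false
        readOnlyRootFilesystem: true
        capabilities:
          drop: ["ALL"]
```
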
<h2 id="heading-general-exam-tips"><strong>General Exam Tips</strong></h2>
<p>KCNA and KCSA are relatively easier exams. I recommend taking them at the end: after preparing for the three exams above, you will have enough general knowledge to attempt the MCQs effectively. However, there are additional topics you may have to prepare, e.g., compliance standards like CIS, OWASP, and NIST. Here are some generic exam tips for CKAD, CKA, and CKS:</p>
<ol>
<li><p>Practice is the key. You may know and understand the concepts in depth, but remember that CKA, CKAD, and CKS are performance-based exams where you will get 16 to 18 scenarios to solve. In our consulting world, we take at least a day or two to solve even one.</p>
</li>
<li><p>Solve the <a target="_blank" href="https://killercoda.com/">KillerCoda Playgrounds</a> at least 2 to 3 times; thanks to <a target="_blank" href="https://www.linkedin.com/in/kimwuestkamp/">Kim Wustkamp</a>, the founder, for keeping them free. However, I would recommend taking at least a 1-month pro subscription, because that way you get the exam desktop, which will help you familiarize yourself with the exam environment. I wasted 5 minutes figuring out basic functionality like copy/pasting and note-keeping.</p>
</li>
<li><p>Get a <a target="_blank" href="https://kodekloud.com/">KodeKloud</a> subscription; it is a gold mine of learning material for infrastructure and DevOps.</p>
</li>
<li><p>Vim proficiency: you need to start using shortcuts in vim. The handiest are the ones for copying, pasting, deleting a line, replacing in place, indentation, set number to show line numbers, visual mode to copy or duplicate multiple lines, and finding text.</p>
</li>
<li><p>Grep proficiency: grep can help you in unimaginable ways, e.g., finding a particular vulnerability in an SBOM scan, finding the file containing a specific text within a folder of many such files (hint: grep -r), or grepping for two alternatives (hint: grep -E “one|two”).</p>
</li>
</ol>
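<p>The grep tips above can be drilled anywhere in seconds; for example (a throwaway file standing in for a vulnerability scan report):</p>

```shell
# Create a fake scan report, then practice the two grep patterns from tip #5.
dir=$(mktemp -d)
printf 'CVE-2024-0001 HIGH openssl\nINFO clean package\n' > "$dir/scan.txt"
grep -r "CVE-2024-0001" "$dir"             # find a specific vulnerability across files
grep -E "openssl|libssl" "$dir/scan.txt"   # match either of two alternatives
rm -rf "$dir"
```
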
<h2 id="heading-profession-tips"><strong>Professional Tips</strong></h2>
<p><a target="_blank" href="https://www.linuxfoundation.org/">The Linux Foundation</a> and the <a target="_blank" href="https://www.cncf.io/">CNCF</a> have done a great job crafting these exams and certification programs, making sure that only the exceptional come out shining. However, it is important to know that real-world scenarios are even more complex than the ones you will encounter in these exams. Getting the certification does not guarantee that you are suitable for a challenging role; it will help you land an interview, but to get a great offer you will need a lot more than the certifications. After hiring hundreds of individuals, I am going to share some of the most important traits you also need to land your dream job:</p>
<ol>
<li><p>Prepare for these certifications keeping your professional goals in mind. Prepare yourself for the industry challenges not only for the exam, have your previous job scenarios in mind and relate every topic with your current or upcoming role in business.</p>
</li>
<li><p>Work on your soft skills; make sure you know how to express your learning and experiences effectively. Understand and make use of industry and technology buzzwords and their concepts, e.g., cloud-native, cluster hardening, encryption at rest, zero-trust, software security compliance and standards, etc.</p>
</li>
<li><p>Make yourself vocal on platforms like <a target="_blank" href="https://linkedin.com/">LinkedIn</a> and <a target="_blank" href="https://x.com/">X</a> by sharing your thoughts on generic technology discussions. (even I am struggling to do that often)</p>
</li>
<li><p>Express your ideas in Diagrams and Visuals. Make every presentation or discussion more meaningful with your visual skills.</p>
</li>
<li><p>Learn how to explain a complex idea or architecture to multiple audiences, tech, non-tech, business or a layman.</p>
</li>
<li><p>Write blogs/articles while you are learning or even after accomplishing a goal. Focus on spreading the knowledge, tips, and tricks rather than publicity or marketing.</p>
</li>
<li><p>Start contributing to Kubernetes and/or other CNCF projects. Begin by attending the weekly or monthly meetings to understand the working-group processes until you are ready to take on an assignment.</p>
</li>
<li><p>Remember the motive, we are working to help make the world a better place.</p>
</li>
</ol>
]]></content:encoded></item><item><title><![CDATA[PEGA Customer Cloud Deployment Models]]></title><description><![CDATA[We all know that PEGA began embracing the true essence of cloud-native starting from version 8.6. Now, with the release and planning of its new version system, starting with PEGA 23, we understand that a major release is scheduled every year. All upc...]]></description><link>https://blog.arslanali.io/pega-customer-cloud-deployment-models</link><guid isPermaLink="true">https://blog.arslanali.io/pega-customer-cloud-deployment-models</guid><category><![CDATA[pega deploy]]></category><category><![CDATA[Kubernetes]]></category><category><![CDATA[pega]]></category><category><![CDATA[Devops]]></category><category><![CDATA[enterprise]]></category><dc:creator><![CDATA[Arslan Ali Ansari]]></dc:creator><pubDate>Sun, 22 Sep 2024 11:51:19 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1727005547096/ee8cf98d-7ecc-46d5-9f37-855bbe8e65f3.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>We all know that PEGA began embracing the true essence of cloud-native starting from version 8.6. Now, with the release and planning of its new version system, starting with PEGA 23, we understand that a major release is scheduled every year. All upcoming releases incorporate the concept of separation of concerns at their core, which means that third-party services like Kafka, Elastic Search, and Cassandra have been externalized, allowing you to share your enterprise deployments of these services. This architecture is further implemented in the core of PEGA services, with Hazelcast now serving as the clustering service that connects to your enterprise Kafka layer. Additionally, you can have a single clustering service feeding and managing the cache of multiple PEGA environments. This architecture is followed in Constellation and even for the Search and Reporting Service.</p>
<p>By establishing such an externalized architecture, PEGA can now be deployed in over a dozen different configurations. In this article, we will discuss some of the most important ones.</p>
<h2 id="heading-pega-classic-deployment">PEGA CLASSIC DEPLOYMENT</h2>
<p>A classic Pega deployment entails deploying all components for every environment. For example, consider an insurance company that requires Pega to be installed in multiple Kubernetes clusters, with each cluster representing a distinct environment.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1727002037063/d304d8a7-1ffa-4ec9-9757-5ea38a86b338.png" alt="Pega Deployment with Dedicated Services for Each Environment" class="image--center mx-auto" /></p>
<p>Each cluster must have its own database, Kafka, Elasticsearch, Cassandra, and more. In addition to third-party services, Pega services are also deployed individually within each cluster. This type of deployment is recommended in several situations, but it has drawbacks. On the one hand, it can be advantageous since each environment is entirely isolated from the others, eliminating the need to worry about micro-configurations for each service when managing environments simultaneously. On the other hand, it requires at least twice the hardware resources to run everything and roughly doubles the maintenance cost, as each environment must be managed separately.</p>
<h2 id="heading-pega-connect-deployment">PEGA CONNECT DEPLOYMENT</h2>
<p>A connected deployment is one in which the customer already has running clusters of one or more third-party services, such as Elasticsearch, Kafka, or Cassandra. We often encounter customer requirements mentioning existing clusters of these third-party services that they wish to utilize for PEGA as well.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1727002193995/ac187981-534c-4b3e-9f0e-2d6fbbc4c00e.png" alt="Shared Services among all Pega Environments" class="image--center mx-auto" /></p>
<p>Generally, it is not recommended to share these services with other Pega applications; however, it is ultimately a design decision based on the Cloud Topology Architecture. We see no harm in implementing this kind of design, where we deploy only the PEGA platform and utilize the customer-provided clusters for third-party applications. This setup reduces friction in rolling out the PEGA platform, uses significantly fewer resources than the classic model, and is sometimes more efficient and performant as well.</p>
<h2 id="heading-pega-shared-deployment">PEGA SHARED DEPLOYMENT</h2>
<p>A shared deployment is one in which we deploy shareable PEGA services once and share them across different environments and platforms. Sharing most of these services is highly recommended, as it serves the purpose of breaking Pega down into microservices. Let's drill into this deployment in more detail by walking through different scenarios.</p>
<ol>
<li><p><strong>Shared CDN-ONPREM</strong><br /> The CDN can be shared among all of your environments without hesitation; you may even utilize the PEGA-provided CDN for this purpose (but only if your pods have access to the internet).</p>
</li>
<li><p><strong>Shared Constellation App Static</strong></p>
<p> This service keeps track of your custom components and serves them to PEGA applications that use the Constellation UI instead of the Traditional UI. It is highly recommended that you share your Constellation App Static service among your environments. A major reason is that you get consistent static custom components across all environments: once a component is deployed and tested, it never needs to be tested again, even after a feature update by Deployment Manager (DM) in the Pega environment. One scenario where I might recommend a separate Constellation App Static per environment is when you use a third-party pipeline manager like Jenkins to trigger DM pipelines alongside other deployment pipelines (React apps, mobile apps, etc.); this way, you can trigger a Constellation App Static custom component publish in each environment as a deployment pipeline task.</p>
</li>
<li><p><strong>Shared Search and Reporting Service</strong></p>
<p> SRS is a multi-tenant service, which means many environments with different environment/tenant IDs can connect to a single SRS instance. In a Kubernetes cluster, we can deploy it along with a single Elasticsearch cluster serving many different environments.</p>
</li>
<li><p><strong>Deployment Manager and PDC</strong></p>
<p> PDC and Deployment Manager are designed as shared services for multiple environments. DM requires multiple environments to perform code promotions and feature deployments from one environment to another, and PDC is likewise a multi-tenant, multi-environment service. This means you can use a single PDC for numerous Pega environments across multiple customers/business units.</p>
</li>
</ol>
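<p>As a rough sketch of wiring a shared SRS into an individual environment, the Pega Helm charts expose search settings in the values file. The key names and the service URL below are assumptions based on the public <code>pega-helm-charts</code> layout; verify them against the chart version you deploy:</p>

```yaml
# Illustrative fragment of a "pega" chart values.yaml -- key names may
# differ between chart versions; the URL is a placeholder.
pegasearch:
  # Point this environment at a shared, multi-tenant Search and Reporting
  # Service instead of an embedded search deployment.
  externalSearchService: true
  externalURL: "http://srs-service.shared-srs.svc.cluster.local"
```

<p>Each environment then differs only in its tenant configuration, while the SRS and its Elasticsearch cluster are deployed once.</p>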
<p>The following figure shows a fully shared model, where all multi-tenant applications and services of Pega as well as other shareable third-party services are shown:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1727002474257/8d3fbb0c-20b6-4200-a473-a236d0e12ac9.png" alt="Pega Fully Shared Application Deployment Model" class="image--center mx-auto" /></p>
<p>Deploying Pega for enterprises has long been a challenge for infrastructure and DevOps teams. While finding skilled Infra/DevOps professionals is common, individuals with both Pega Admin and Kubernetes DevOps experience are rare. This scarcity is one of the primary reasons many Pega customers face difficulties in managing and maintaining their enterprise Pega deployments.</p>
<p>With the introduction of Pega Cloud, enterprise customers can significantly reduce the complexity and burden of implementation and ongoing maintenance once migrated to Pega Cloud. However, the transition can still be a complex and time-consuming process for organizations with multiple integrations between Pega and other on-prem applications.</p>
<p>As a consultant, I have had the privilege of working with numerous enterprise customers who faced complex application integration challenges. With over two decades of experience in software engineering and cloud technologies, I specialize in designing intricate, scalable architectures tailored to unique business needs. My expertise extends to creating comprehensive communication matrices well in advance, ensuring seamless coordination and faster deployments. I have consistently delivered projects in record time, helping organizations save millions of dollars while achieving their digital transformation goals efficiently.</p>
<p>So feel free to contact me for any design, implementation, or infrastructure management reviews or consulting requirements.</p>
]]></content:encoded></item><item><title><![CDATA[Optimizing Workload and Compute in Kubernetes using descheduler]]></title><description><![CDATA[Binding and placement of pending Pods on to respective Nodes are managed by a scheduler in Kubernetes called Kube-scheduler. Configurable scheduler policies, plugins, and extensions manage the placement decisions, often called predicates and prioriti...]]></description><link>https://blog.arslanali.io/optimizing-workload-and-compute-in-kubernetes-using-descheduler</link><guid isPermaLink="true">https://blog.arslanali.io/optimizing-workload-and-compute-in-kubernetes-using-descheduler</guid><category><![CDATA[Kubernetes]]></category><category><![CDATA[Cloud Computing]]></category><category><![CDATA[Microservices]]></category><category><![CDATA[openshift]]></category><dc:creator><![CDATA[Arslan Ali Ansari]]></dc:creator><pubDate>Mon, 11 Mar 2024 08:19:05 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/stock/unsplash/CkZF0-etxU8/upload/b2752581674afea81a3d9a14c7654bd7.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Binding and placement of pending Pods onto their respective Nodes are managed by the Kubernetes scheduler, kube-scheduler. Configurable scheduler policies, plugins, and extensions govern the placement decisions, often called predicates and priorities. The scheduler's decision is based on the actual condition or state of the cluster at the time the Pod is scheduled. But a Kubernetes cluster changes state over time, through updates to labels, taints, and tolerations, or through the introduction of new nodes, so there may later be a desire to relocate a pod from one node to another. This is where the descheduler comes in.</p>
<p>So before we explore the descheduler, let's recap how the scheduler works. The scheduling decision passes through four stages, or extension points:</p>
<ol>
<li><p>Scheduling Queue</p>
</li>
<li><p>Filtering</p>
</li>
<li><p>Scoring</p>
</li>
<li><p>Binding</p>
</li>
</ol>
<p>Multiple plugins can be installed at these extension points, e.g., the PrioritySort plugin on the scheduling queue, and the NodeResourcesFit and NodeName plugins at the filtering extension point. The highly extensible nature of Kubernetes makes it possible to customize which plugin goes where and also allows us to write our own custom plugins. It even lets us add plugins in the pre- and post-stages of the extension points.</p>
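<p>As an illustration of this extensibility, plugins can be enabled or disabled per extension point through a <code>KubeSchedulerConfiguration</code>. The profile below is a sketch, not a recommended configuration:</p>

```yaml
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
  - schedulerName: default-scheduler
    plugins:
      queueSort:
        enabled:
          - name: PrioritySort      # orders the scheduling queue
      filter:
        enabled:
          - name: NodeResourcesFit  # filters out nodes lacking requested resources
          - name: NodeName          # honors spec.nodeName
```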
<p>Now, since the scheduling decision is produced by multiple plugins and their placement at the extension points at the time of scheduling, it is quite possible that the original scheduling decision is no longer valid later on.</p>
<h2 id="heading-when-do-you-need-a-descheduler">When do you need a Descheduler</h2>
<p>Due to the dynamic nature of the Kubernetes Cluster, there could be several reasons why you may want to evict (deschedule) a Pod from a node:</p>
<ol>
<li><p>To improve cluster performance and availability by redistributing pods to optimize resource usage and reduce contention for resources.</p>
</li>
<li><p>To minimize downtime by automatically rescheduling pods on healthy nodes when a node fails.</p>
</li>
<li><p>To help with scaling by removing underutilized pods and redistributing them to nodes where they can be better utilized.</p>
</li>
<li><p>To improve security by ensuring that only authorized pods are running on the cluster.</p>
</li>
<li><p>To enforce policies such as inter-pod anti-affinity, where it can detect and deschedule the pods that don't conform to the policy and redistribute the pods to other nodes.</p>
</li>
<li><p>To improve cost-efficiency by reducing wastage of resources by identifying and removing duplicate pods and rescheduling them.</p>
</li>
</ol>
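<p>For reference, the descheduler itself usually runs inside the cluster as a Job, CronJob, or Deployment. The sketch below follows the upstream <code>kubernetes-sigs/descheduler</code> examples; the image tag, schedule, and ConfigMap name are placeholders to adapt:</p>

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: descheduler
  namespace: kube-system
spec:
  schedule: "*/30 * * * *"                   # re-evaluate eviction policies every 30 minutes
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: descheduler-sa # needs RBAC to list nodes/pods and evict pods
          restartPolicy: Never
          containers:
            - name: descheduler
              image: registry.k8s.io/descheduler/descheduler:v0.29.0
              command:
                - /bin/descheduler
                - --policy-config-file=/policy-dir/policy.yaml
              volumeMounts:
                - name: policy-volume
                  mountPath: /policy-dir
          volumes:
            - name: policy-volume
              configMap:
                name: descheduler-policy-configmap  # holds the DeschedulerPolicy
```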
<p>Below is the list of different scenarios where the use of a descheduler is unavoidable:</p>
<h3 id="heading-1-a-new-node-is-introduced-in-a-cluster">1. A new node is introduced in a cluster</h3>
<p>You have just introduced a new node into the cluster and want to distribute the workload evenly. Without descheduling, your pods may reside on the original nodes indefinitely, so adding new nodes will not yield any immediate performance benefit. Descheduling pods and redistributing them onto the new nodes helps improve resource usage and ensures that resources are distributed evenly across the cluster. By spreading the pods across different nodes, the load on individual nodes is reduced, improving the performance and stability of the cluster. It also helps the default scheduler and auto-scaler adjust the number of replicas to match the new capacity and resources available in the cluster.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1673478229643/1e482dd7-36fa-4e86-823e-020149611350.jpeg" alt class="image--center mx-auto" /></p>
<h3 id="heading-2-node-labels-are-updated">2. Node labels are updated</h3>
<p>A node label update can affect different scenarios and the original scheduling decision may not be appropriate for certain pods. Here are some of the important ones:</p>
<ol>
<li><p><strong>Node Affinity:</strong> Node affinity allows pods to be scheduled based on the labels assigned to a node. If the labels of a node are changed, it may no longer match the node affinity rules of a pod, which can lead to an undesired state.</p>
</li>
<li><p><strong>Node Selector:</strong> A node selector lets you schedule pods on specific nodes based on node labels; if a label is updated, those decisions are no longer valid and the pods require eviction.</p>
</li>
<li><p><strong>Failure Domain:</strong> Node labels can indicate the failure domain of a node, such as a region, rack, or zone, which can be used to spread pods across multiple failure domains. An intelligent descheduler ensures high availability of services by making optimal descheduling decisions.</p>
</li>
</ol>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1673478270716/92b72f6e-334c-4125-b289-f3d1830fba58.jpeg" alt class="image--center mx-auto" /></p>
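<p>When node labels change, the descheduler can evict pods whose node-affinity rules no longer hold. A minimal policy sketch (using the <code>v1alpha1</code> policy format; newer descheduler releases use a profiles/plugins layout):</p>

```yaml
apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  "RemovePodsViolatingNodeAffinity":
    enabled: true
    params:
      nodeAffinityType:
        # Only hard affinity rules are re-checked and enforced by eviction.
        - "requiredDuringSchedulingIgnoredDuringExecution"
```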
<h3 id="heading-3-node-failure-requires-pods-to-be-moved">3. Node failure requires Pods to be moved</h3>
<p>It is important to deschedule pods on a failed node because:</p>
<ol>
<li><p><strong>High availability:</strong> When a node fails, the pods running on that node can become unavailable, and this can have a significant impact on the availability of the applications and services running on the cluster. By descheduling the pods on a failed node, the cluster can automatically reschedule the pods on healthy nodes, which can help to minimize downtime and improve availability.</p>
</li>
<li><p><strong>Resource Utilization:</strong> A failed node can cause a significant drain on resources such as CPU and memory. By descheduling the pods on a failed node, the cluster can free up these resources, which can be used more efficiently by other pods, improving overall cluster performance.</p>
</li>
<li><p><strong>Auto Scaling:</strong> A failed node can impact the scaling of pods. By descheduling the pods running on a failed node, the auto-scaler can automatically adjust the number of pods running on healthy nodes to maintain the desired number of replicas.</p>
</li>
<li><p><strong>Networking:</strong> Descheduling pods on a failed node can help prevent networking issues, such as IP conflicts or service outages, which may be caused by pods running on a failed node.</p>
</li>
<li><p><strong>Security:</strong> Descheduling pods running on a failed node can help prevent security risks. A failed node can be compromised and running malicious pods on a compromised node can pose a significant risk to the cluster.</p>
</li>
</ol>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1673529526271/a532a439-cc8e-4133-a6e7-77813c27fe6c.jpeg" alt class="image--center mx-auto" /></p>
<p>Descheduling pods on failed nodes is important to ensure that the cluster remains highly available, that resources are used efficiently, and that the auto-scaler can adjust the number of running replicas, while avoiding networking issues and maintaining security.</p>
<h3 id="heading-4-remove-duplicates">4. Remove Duplicates</h3>
<p>Duplicate pods in a Kubernetes cluster can cause several issues that can negatively impact the performance and availability of the cluster. Some reasons why it's important to remove duplicate pods from a node running in Kubernetes include:</p>
<ol>
<li><p><strong>Resource Utilization:</strong> Duplicate pods consume resources such as CPU and memory that could be used more efficiently by other pods. This can cause resource contention, which can lead to delays in container startup times and negatively impact the overall performance of the cluster.</p>
</li>
<li><p><strong>Networking:</strong> Each pod consumes network resources such as IP addresses, and having multiple pods with the same IP address can cause networking issues such as IP conflicts, which can cause communication problems between pods and services.</p>
</li>
<li><p><strong>Scalability:</strong> Duplicate pods can make it difficult to scale the number of pods running in a cluster. For example, if a Deployment controller creates multiple replicas of a pod, each replica will have a different replica number but the same pod name, which can cause confusion when trying to scale the number of replicas.</p>
</li>
<li><p><strong>Security:</strong> Having duplicate pods can make it difficult to keep track of what is running on a cluster and can open security vulnerabilities. It is important to ensure that all running pods are authorized and that no rogue pods are running.</p>
</li>
<li><p><strong>Cost:</strong> Running duplicate pods can result in a waste of resources, which can result in higher costs.</p>
<p> Removing duplicate pods from a node can help improve the overall performance and availability of a Kubernetes cluster by optimizing resource utilization, reducing network conflicts, improving scalability and security, and reducing costs.</p>
</li>
</ol>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1673529214337/65edb487-c125-4ce9-ba73-4144e6039c75.jpeg" alt class="image--center mx-auto" /></p>
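<p>The descheduler ships a <code>RemoveDuplicates</code> strategy for exactly this case: it evicts extra pods of the same owner running on one node so they can be rescheduled elsewhere. A hedged sketch in the <code>v1alpha1</code> policy format (the excluded kind is illustrative):</p>

```yaml
apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  "RemoveDuplicates":
    enabled: true
    params:
      removeDuplicates:
        excludeOwnerKinds:
          - "Job"   # optionally skip pods owned by certain controller kinds
```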
<h3 id="heading-5-lowhigh-node-utilization">5. Low/High Node Utilization</h3>
<p>A descheduler can help distribute load across different nodes in the cluster in several ways:</p>
<ol>
<li><p><strong>Balancing resource usage:</strong> A descheduler can identify pods that are consuming a disproportionate amount of resources on a node, such as CPU or memory, and move them to other nodes where resources are more available. This can help to balance the resource usage across the cluster and improve overall cluster performance.</p>
</li>
<li><p><strong>Reducing node overcommitment:</strong> A descheduler can identify nodes that have a high number of pods running on them and redistribute the pods to other nodes to reduce the number of pods running on the overcommitted node. This can help to reduce contention for resources and improve the overall performance of the cluster.</p>
</li>
<li><p><strong>Improving node utilization:</strong> A descheduler can help to identify and remove underutilized pods on a node, and redistribute them to nodes where they can be utilized better. This can help to improve the utilization of resources across the cluster.</p>
</li>
</ol>
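<p>The <code>LowNodeUtilization</code> strategy implements this balancing: nodes below all of the <code>thresholds</code> are considered underutilized, and pods are evicted from nodes above any of the <code>targetThresholds</code> so the scheduler can move them. The percentages below are illustrative, not recommendations:</p>

```yaml
apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  "LowNodeUtilization":
    enabled: true
    params:
      nodeResourceUtilizationThresholds:
        thresholds:          # below ALL of these => node is underutilized
          cpu: 20
          memory: 20
          pods: 20
        targetThresholds:    # above ANY of these => node is a candidate for eviction
          cpu: 50
          memory: 50
          pods: 50
```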
<h3 id="heading-6-pods-violating-inter-pod-antiaffinity">6. Pods Violating Inter Pod AntiAffinity</h3>
<p>Inter-pod anti-affinity is a feature that allows you to specify rules for how pods should be scheduled in relation to one another. These rules can be used to ensure that pods that belong to the same application or service are spread across different nodes in a cluster, in order to improve availability and reduce the risk of single points of failure.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1673477748574/85b8bc5c-cd4d-4245-9453-e6b8e87bed3a.jpeg" alt="Descheduling Example of Anti Pod Affinity" class="image--center mx-auto" /></p>
<p>One important reason to use inter-pod anti-affinity is to ensure that pods that need to be highly available, such as database pods, are not scheduled on the same node. If multiple pods that need to be highly available are scheduled on the same node and that node goes down, multiple pods will become unavailable at the same time. Spreading these pods across different nodes can help to mitigate this risk.</p>
<p>Another reason to use inter-pod anti-affinity is to ensure that pods are spread across different zones or regions to improve resiliency in the event of a zone or region failure.</p>
<p>It is important to use a descheduler in this scenario because, despite the best efforts of Kubernetes scheduler, sometimes pods can violate inter-pod anti-affinity rules due to various reasons like over-commitment of resources or other factors. A descheduler can help identify and remove these pods, ensuring they are rescheduled to comply with the specified anti-affinity rules. This can help to improve cluster availability, reduce the risk of single points of failure, and optimize resource utilization.</p>
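<p>Tying the two pieces together: the pod spec declares the anti-affinity rule, and the descheduler's <code>RemovePodsViolatingInterPodAntiAffinity</code> strategy evicts pods that end up violating it. Both snippets below are illustrative sketches (labels are placeholders, and the policy format varies by descheduler version):</p>

```yaml
# Pod template fragment: keep database replicas on different nodes.
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: database
        topologyKey: kubernetes.io/hostname
---
# Descheduler policy (v1alpha1 format) that enforces the rule after the fact.
apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  "RemovePodsViolatingInterPodAntiAffinity":
    enabled: true
```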
<h3 id="heading-10-pods-violating-topology-spread-constraint">7. Pods Violating Topology Spread Constraint</h3>
<p>Topology spread constraints allow pods to be spread evenly across different nodes, zones, regions, or racks. By descheduling pods that violate these constraints, the cluster ensures that resources are utilized more efficiently and that pods are spread out to reduce resource contention; it guarantees higher availability by running services across multiple nodes, zones, and regions; and it helps minimize the impact of an infrastructure failure through failure-domain awareness.</p>
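<p>As a sketch, a pod spec can declare the spread requirement, and the descheduler's <code>RemovePodsViolatingTopologySpreadConstraint</code> strategy evicts pods when the skew later exceeds the limit (labels below are illustrative):</p>

```yaml
# Pod spec fragment: spread replicas evenly across zones.
topologySpreadConstraints:
  - maxSkew: 1                              # max pod-count difference between zones
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        app: web
```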
<h3 id="heading-11-pods-having-too-many-restarts">8. Pods Having Too Many Restarts</h3>
<p>When a pod has too many restarts, it can indicate that there is an issue with the pod or the node it is running on. Pods that are continuously restarting can destabilize the cluster, can delay container startup times, and negatively impact the overall performance of the cluster. Descheduling such a pod can help the operation teams to understand the issue behind the continuous restarts and stabilize the application and cluster performance.</p>
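<p>A hedged sketch of the corresponding <code>RemovePodsHavingTooManyRestarts</code> strategy in the <code>v1alpha1</code> policy format (the threshold is illustrative):</p>

```yaml
apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  "RemovePodsHavingTooManyRestarts":
    enabled: true
    params:
      podsHavingTooManyRestarts:
        podRestartThreshold: 100        # evict pods restarted 100+ times
        includingInitContainers: true   # count init-container restarts too
```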
<h2 id="heading-conclusion">Conclusion</h2>
<p>In summary, a descheduler plays an important role in managing and optimizing the distribution of pods within a Kubernetes cluster. It can help to ensure that the cluster is running at optimal performance, that resources are being used efficiently, and that the cluster is secure and available.</p>
]]></content:encoded></item><item><title><![CDATA[How to Host Static HTML Website on AWS Amplify]]></title><description><![CDATA[There are many low cost static website hosting options available but most of them require complicated process of uploading the website, configuring the domain names and managing the SSL certificates etc. Amazon Amplify simplifies these tasks by provi...]]></description><link>https://blog.arslanali.io/how-to-host-static-html-website-on-aws-amplify</link><guid isPermaLink="true">https://blog.arslanali.io/how-to-host-static-html-website-on-aws-amplify</guid><category><![CDATA[AWS Amplify]]></category><category><![CDATA[CI/CD]]></category><category><![CDATA[Static Website]]></category><category><![CDATA[hosting]]></category><dc:creator><![CDATA[Arslan Ali Ansari]]></dc:creator><pubDate>Thu, 12 Jan 2023 23:54:08 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1673567438882/3a151812-0ed7-4ca4-b667-1852d597a290.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>There are many low-cost static website hosting options available, but most of them require a complicated process of uploading the website, configuring domain names, managing SSL certificates, and so on. AWS Amplify simplifies these tasks by providing simple CI/CD: it lets you connect your code repository, provides automatic build pipelines for multiple environments, and, above all, provisions an SSL certificate for your custom domains out of the box.</p>
<p>Deploying a static website on AWS Amplify is a straightforward process that can be broken down into the following steps:</p>
<h2 id="heading-1-create-a-new-amplify-app">1. Create a new Amplify app.</h2>
<p>Log in to the AWS Amplify Console and create a new app by selecting the <strong>Host Web App</strong> option and by providing a name, environment, and repository for your website.</p>
<h2 id="heading-2connect-your-repository">2. Connect your repository</h2>
<p>Connect your repository to the app by linking it to your GitHub, GitLab, or Bitbucket account, or by manually uploading your website files. Here are the steps to connect your GitHub repository in AWS Amplify:</p>
<ol>
<li><p>Go to the Amplify Console and select the app that you want to connect to your GitHub repository.</p>
</li>
<li><p>Click on the "Connect branch" button.</p>
</li>
<li><p>Select "GitHub" from the list of repository providers.</p>
</li>
<li><p>Use your GitHub account to sign in, and authorize Amplify to access your GitHub repositories.</p>
</li>
<li><p>Select the repository that contains your website or app, and choose the branch that you want to connect.</p>
</li>
<li><p>Click on the "Next" button and configure the build settings, environment variables, and other options as needed.</p>
</li>
<li><p>Click on the "Save and Deploy" button to start the build and deploy process.</p>
</li>
<li><p>After the deployment is complete, you can monitor the build and deploy process, and make updates to your app through the Amplify console.</p>
</li>
</ol>
<p>Here's a screenshot of the available code repository options:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1673564857018/4195e571-74d0-4a5a-bfcd-67d118b268a7.png" alt class="image--center mx-auto" /></p>
<p>It's important to note that in order to connect to your GitHub repository, you need to have a GitHub account, and your repository should be public or accessible by AWS Amplify.</p>
<h2 id="heading-3-build-and-deploy-your-app">3. Build and Deploy your App</h2>
<p>Once your repository is connected, Amplify will automatically build and deploy your app. You can also configure custom build settings and environment variables if needed. Please comment if you need any support with your custom build process.</p>
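<p>For a plain HTML site there is usually nothing to build; a minimal <code>amplify.yml</code> build spec can simply publish the repository contents. This is a sketch; adjust <code>baseDirectory</code> if your site lives in a subfolder:</p>

```yaml
version: 1
frontend:
  phases:
    build:
      commands: []        # no build step for static HTML
  artifacts:
    baseDirectory: /      # publish from the repository root
    files:
      - '**/*'
  cache:
    paths: []
```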
<h2 id="heading-4-configure-custom-domains">4. Configure custom domains</h2>
<p>Amplify will provide a default domain for your app, but you can also configure custom domains if needed. Here are the steps to configure custom domains for a deployed app on AWS Amplify:</p>
<ol>
<li><p>Go to the Amplify Console and select the app that you want to configure a custom domain for.</p>
</li>
<li><p>Click on the "Domain settings" tab and then click on the "Connect domain" button.</p>
</li>
<li><p>Enter the domain name that you want to use for your app. Amplify will automatically check the availability of the domain and suggest a domain if it's not available.</p>
</li>
<li><p>If the domain is available, Amplify will provide you with a set of instructions to verify that you own the domain. This typically involves adding a CNAME or A record to your domain's DNS settings.</p>
</li>
<li><p>Once the domain is verified, Amplify will automatically create a certificate for your custom domain.</p>
</li>
<li><p>Once the certificate is ready, you can associate the custom domain with your app by clicking on the "Associate" button.</p>
</li>
<li><p>Now, your custom domain is configured and should be active within a few minutes. You can test it by visiting the custom domain in your browser.</p>
</li>
</ol>
<p>Here is a screenshot of how I have added my custom domain and mapped the branches to each of the sub-domains:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1673564771721/2bdfbb09-3b9a-4307-a5ce-6a50bdd0c94b.png" alt class="image--center mx-auto" /></p>
<p>It's important to note that you will need access to your domain's DNS settings to configure custom domains in Amplify. Also, Amplify requires a valid SSL certificate for custom domains; it will create one for you automatically, but you can also use your own certificate.</p>
<h2 id="heading-5-monitor-and-update-your-app">5. Monitor and update your app</h2>
<p>After your app is deployed, you can monitor its performance, view analytics, and make updates through the Amplify console. You can also monitor the application's access logs and set alarms based on different metrics, such as 4xx and 5xx errors.</p>
<p>Below are screenshots of the access logs and the site analytics of my <a target="_blank" href="https://arslanali.io">personal portfolio website</a>, which I deployed on Amplify earlier.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1673564542800/fad163c6-8eab-4cf7-b931-d47b374bf006.png" alt class="image--center mx-auto" /></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1673564585656/4e20bf7d-884a-4c2c-8fd8-b39942ccb3ac.png" alt class="image--center mx-auto" /></p>
<h2 id="heading-6-enable-cloudfront-distribution">6. Enable CloudFront distribution</h2>
<p>By default, a CloudFront distribution is enabled for your application if you are using Route 53 to configure your DNS.</p>
]]></content:encoded></item><item><title><![CDATA[Securing the Kubernetes Clusters with AI and ML]]></title><description><![CDATA[More than half of all enterprises consider security as their biggest challenge when publishing their microservice workloads in production. 50% require developers to use validated images only, around 80% want to have a DevSecOps initiative, more than ...]]></description><link>https://blog.arslanali.io/how-to-secure-kubernetes-clusters-with-ai-and-ml</link><guid isPermaLink="true">https://blog.arslanali.io/how-to-secure-kubernetes-clusters-with-ai-and-ml</guid><category><![CDATA[Kubernetes]]></category><category><![CDATA[DevSecOps]]></category><category><![CDATA[Devops]]></category><category><![CDATA[Security]]></category><dc:creator><![CDATA[Arslan Ali Ansari]]></dc:creator><pubDate>Wed, 19 Oct 2022 20:46:21 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/unsplash/M5tzZtFCOfs/upload/v1666007015730/xsAT_EIBf.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>More than half of all enterprises consider <strong>security as their biggest</strong> challenge when publishing their microservice workloads in production. 50% require developers to <strong>use validated images</strong> only, around 80% want to have a <strong>DevSecOps initiative</strong>, more than 40% <strong>consider DevOps</strong> as the role most responsible for Kubernetes security, and most importantly, more than half have delayed application deployment due to security concerns.</p>
<blockquote>
<p>According to the <strong>State of Kubernetes Security Report 2022</strong>, security is one of the biggest concerns with container adoption, and security issues continue to cause delays in deploying applications into production.</p>
</blockquote>
<p>In the last year, 93% said that they have experienced at least one major security incident. More than 30% of them have experienced customer or revenue losses due to these incidents. According to a recent study, 95% of breaches were due to human error.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1666011633384/nC1T82Mde.png" alt="Screen Shot 2022-10-17 at 4.59.21 PM.png" /><br />Source (Red Hat State of Kubernetes Security Report - 2022)</p>
<h2 id="heading-impact">Impact</h2>
<p><strong>1. Data Compromise</strong><br />An attacker with access to business or infrastructure data can leak or destroy the data.</p>
<p><strong>2. Resource Hijacking</strong><br />If an attacker gets access to a node, or any compute resource in a cluster; he can easily run resource-hungry scripts like crypto mining (crypto-jacking), AI model processing, etc.</p>
<p><strong>3. Denial of Service</strong><br />DoS can be achieved using buffer overflow (by flooding general requests), ICMP flooding (by sending spoofed packets), or SYN flooding (by sending false connection requests to the server without completing the handshake). A DoS attack is meant to shut down a machine or make a network inaccessible.</p>
<p><strong>4. Ransom</strong><br />An attacker can take over the Management Layer or even remove the (important) data and ask for a ransom in return for the data or control.</p>
<p><strong>5. Loss in Customers and/or revenue</strong><br />One may incur major losses once a service is down because of any reason mentioned above.</p>
<h2 id="heading-aiops-to-the-rescue">AIOps to the Rescue</h2>
<p>KaiOps is an AI/ML SaaS-based tool that connects to your clusters using a secure connection to the KubeAPI server. You may need to update the default security profile to enable the KaiOps security agents to scan and flag different vulnerabilities as low, medium, or high threats. Once a threat is found, it sends notifications on your preferred communication channels and informs you of the possible implications, remediation, and/or prevention strategies.</p>
<p>KaiOps scans the cluster every few seconds at runtime and observes the most important Kube events related to networking, storage, and workloads. It also watches all whitelisted Kube objects for changes and runs AI inference using a GNN (Graph Neural Network) to detect misconfigurations, policy violations, and potential threats.<br />Cluster monitoring is divided into 10 classes:</p>
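<p>While KaiOps automates this observation, you can inspect the same raw signals manually. For example, the following commands (run against your own cluster) stream the warning events and object changes that such tooling typically consumes:</p>
<pre><code># Stream warning-level events cluster-wide as they occur
kubectl get events --all-namespaces --field-selector type=Warning --watch

# Watch a specific object kind, printing ADDED/MODIFIED/DELETED change events
kubectl get pods --all-namespaces --watch --output-watch-events
</code></pre>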
<p><strong>1. Access</strong><br />Exposing cluster nodes publicly can give the same exposure to containers with potential vulnerabilities and may let attackers penetrate the cluster.<br /><strong>2. Execution</strong><br />Command execution is monitored very closely, as an SSH server or a bash script running in a container can be compromised through brute-force attacks.<br /><strong>3. Persistence</strong><br />Writable hostPaths in containers or persistent volumes, backdoor containers, and exposed or compromised CronJobs are also monitored to reduce the risk of penetration.<br /><strong>4. Privilege</strong><br />The security contexts of all running containers are continuously monitored along with the RBAC context.<br /><strong>5. Defense Evasion</strong><br />Deletion of Kube objects, clearing of container logs, and connections through a proxy server are some of the defense-evasion events it monitors according to the security policy.<br /><strong>6. Credentials Leak</strong><br />Examples include leakage of the Kube CA data of an exposed ApiServer, or of cloud credential files on a hosted Kubernetes service. KaiOps continuously monitors Kube's internal network and even object-call data, which is fed into the GNN for AI model training and anomaly detection.<br /><strong>7. Discovery</strong><br />Exposed observability platforms, the K8s dashboard, the ApiServer, and similar services may help an attacker penetrate the cluster.<br /><strong>8. Lateral Movement</strong><br />Writable volumes mounted on hosts, exposed access to cloud resources, privileged service accounts assigned to a container, application configuration exposed through environment variables, CoreDNS poisoning, ARP poisoning, IP spoofing, or even public IPs assigned to a container may compromise the security of a cluster. KaiOps flags these vulnerabilities and whitelists some of them based on a user-defined security policy.<br /><strong>9. Collection</strong><br />Images from unknown or public sources and compromised container images can jeopardize cluster security. KaiOps continuously scans for image and package vulnerabilities using static image scanners and feeds the data into the AI inference engine.<br /><strong>10. Custom K-Tactics</strong><br />KaiOps uses custom, patented K-Tactics to help its GNN models detect errors, ambiguities, vulnerabilities, and anomalies. K-Tactics are also used to reduce nuisance alarms and notifications.</p>
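<p>The static image scanning mentioned under Collection can be reproduced with an open-source scanner. As an illustration (not necessarily the scanner KaiOps uses internally), Trivy flags known CVEs in an image like this:</p>
<pre><code># Scan a public image for known vulnerabilities, reporting only high/critical findings
trivy image --severity HIGH,CRITICAL nginx:latest
</code></pre>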
<h2 id="heading-conclusion">Conclusion</h2>
<p>With almost <a target="_blank" href="https://www.bleepingcomputer.com/news/security/over-900-000-kubernetes-instances-found-exposed-online/"><strong>one million</strong> Kubernetes clusters found exposed online</a>, it is evident that the security vulnerabilities above are present in the vast majority of Kubernetes clusters, making remediation difficult. KaiOps uses its patented AI/ML agents, feeding telemetry data into a GNN model to predict and detect security vulnerabilities, misconfigurations, and anomalies. KaiOps SaaS is free for a cluster with up to 3 nodes for 14 days and includes live support for setting up the service on your clusters. KaiOps currently supports native Kubernetes clusters on almost all hosted Kubernetes services. Use this <a target="_blank" href="https://portal.KaiOps.io/login/register">link to register</a> and get an extended one-month free trial.</p>
<h3 id="heading-references">References</h3>
<ul>
<li><a target="_blank" href="https://www.redhat.com/en/resources/state-kubernetes-security-report">Red Hat State of Kubernetes Security Report</a></li>
<li><a target="_blank" href="https://kaiops.io/articles/aiops-use-cases">AIOps use cases</a></li>
<li><a target="_blank" href="https://blog.arslanali.io/common-security-issues-in-kubernetes-part-1">Common Security Issues found in almost all Kubernetes Deployments </a></li>
<li><a target="_blank" href="https://www.weave.works/blog/mitre-attack-matrix-for-kubernetes">MITRE ATT&amp;CK Matrix</a></li>
<li><a target="_blank" href="https://www.youtube.com/watch?v=ka0C09CAfho">Kubernetes Security Workshop</a></li>
</ul>
]]></content:encoded></item><item><title><![CDATA[Common Security Issues in Kubernetes Clusters -- Part 1]]></title><description><![CDATA[54% of enterprises mentioned Security as their biggest challenge in Kubernetes. Almost 7 out of 10 have limited or lack resources to manage security as Kubernetes has a steep learning curve. Now it is not only the Kubernetes lacking native security s...]]></description><link>https://blog.arslanali.io/common-security-issues-in-kubernetes-part-1</link><guid isPermaLink="true">https://blog.arslanali.io/common-security-issues-in-kubernetes-part-1</guid><category><![CDATA[Kubernetes]]></category><category><![CDATA[Devops]]></category><category><![CDATA[Security]]></category><category><![CDATA[DevSecOps]]></category><category><![CDATA[SecOps]]></category><dc:creator><![CDATA[Arslan Ali Ansari]]></dc:creator><pubDate>Tue, 11 Oct 2022 07:11:38 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/unsplash/Fa9b57hffnM/upload/v1665471592921/KYgZparCN.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>54% of enterprises cite security as their biggest challenge with Kubernetes, and almost 7 out of 10 have limited or no resources to manage security, given Kubernetes's steep learning curve. The problem is not only that Kubernetes lacks native security scanning: any vulnerability in an application running on the cluster can escalate into a larger threat. Here are a few hacks from the past:</p>
<ul>
<li><a target="_blank" href="https://www.bleepingcomputer.com/news/security/over-900-000-kubernetes-instances-found-exposed-online/">Over 900,000 Kubernetes Instances are found Exposed Online</a></li>
<li><a target="_blank" href="https://www.wired.com/story/cryptojacking-tesla-amazon-cloud/">Tesla Clusters on Amazon Cloud were Hacked with Cryptojackers</a></li>
<li><a target="_blank" href="https://sysdig.com/blog/exposed-prometheus-exploit-kubernetes-kubeconeu/">Exposed Prometheus exploit Kubecon EU</a></li>
</ul>
<p>Now that we know a single container vulnerability can compromise the security of the whole cluster, let us review a few common weak points.</p>
<h2 id="heading-public-nodes">Public Nodes</h2>
<p>A public cluster is one whose nodes are accessible from the internet; technically, they have one or more network interfaces configured with an ExternalIP. You can test this with the following command:</p>
<pre><code>kubectl get node &lt;node-name&gt; -o jsonpath=<span class="hljs-string">'{.status.addresses}'</span> | grep ExternalIP
</code></pre><p>Replace <code>&lt;node-name&gt;</code> with the name of a node in your cluster. The following is a sample output:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1665376504399/cP-xToki7.png" alt="Screen Shot 2022-10-10 at 9.34.50 AM.png" /></p>
<p>If any of your nodes have an ExternalIP then it is exposed, which means all your containers/pods running on this node are also exposed to the public. It is highly recommended to keep your nodes private and route the outgoing traffic using a <a target="_blank" href="https://cloud.google.com/nat/docs/overview">Cloud NAT</a>.</p>
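<p>On GKE, for example, a cluster can be created with private nodes (relying on Cloud NAT for egress) roughly as follows; the exact flags vary by provider and version, and the cluster name here is just a placeholder:</p>
<pre><code># Create a GKE cluster whose nodes get no external IPs
gcloud container clusters create my-private-cluster \
  --enable-private-nodes \
  --enable-ip-alias \
  --master-ipv4-cidr 172.16.0.0/28
</code></pre>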
<h2 id="heading-public-kubernetes-dashboard">Public Kubernetes Dashboard</h2>
<p>Kubernetes Dashboard is an open-source deployment used to manage and monitor cluster deployments and other objects. It can be installed on any cluster with a single command. Here is how you can deploy it on your cluster:</p>
<pre><code>kubectl apply -f https:<span class="hljs-comment">//raw.githubusercontent.com/kubernetes/dashboard/v2.6.1/aio/deploy/recommended.yaml</span>
</code></pre><p>Once installed, you can access it from a remote machine (one with cluster access) by running the <code>kubectl proxy</code> command.
Then open the dashboard at the following URL: <code>http://localhost:8001/api/v1/namespaces/kubernetes-dashboard/services/https:kubernetes-dashboard:/proxy/</code></p>
<blockquote>
<p>Note: The kubeconfig authentication method does not support external identity providers or X.509 certificate-based authentication. The UI can only be accessed from the machine where the command is executed. See <code>kubectl proxy --help</code> for more options.</p>
</blockquote>
<p>This is the standard access method; if one of your vulnerable (or compromised) containers has access to the dashboard, an attacker will have access to the whole Kubernetes cluster.</p>
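<p>A quick way to verify that the dashboard itself is not published outside the cluster is to check its Service type, which should be ClusterIP rather than LoadBalancer or NodePort:</p>
<pre><code># The dashboard Service should be of type ClusterIP (internal only)
kubectl -n kubernetes-dashboard get svc kubernetes-dashboard
</code></pre>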
<h2 id="heading-exposed-observability-platform">Exposed Observability Platform</h2>
<p>If you are using Prometheus, Grafana, or Splunk, chances are that you are exposing them through an external load balancer. These and other observability platforms are often used to monitor and prevent performance, security, and other issues in a cluster. Exposing them to public traffic, whether through an ingress or through a vulnerable (or compromised) container, may invite brute-force attacks. An attacker who gains access to such a platform can take control of the whole cluster.</p>
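<p>You can audit your cluster for services exposed through an external load balancer with a one-liner such as:</p>
<pre><code># List Services exposed via an external load balancer;
# Prometheus, Grafana, or Splunk should not appear here
kubectl get svc --all-namespaces | grep LoadBalancer
</code></pre>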
<h2 id="heading-privileged-container">Privileged Container</h2>
<p>A privileged container runs with full capabilities, and through the <strong>reverse shell</strong> technique an attacker inside it can gain full control over the node that hosts it. Full access to a node exposes all the images stored on it and may lead to access to other nodes on the network, exposing every resource object in the cluster. Attackers may deploy a cron job or hijack node resources without it even being noticed from the Kubernetes platform.</p>
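<p>To find out whether any running pod requests privileged mode, you can query the securityContext of every container; one sketch of such a check:</p>
<pre><code># Print namespace, pod name, and the privileged flag of each pod's containers,
# then keep only the rows where the flag is set to true
kubectl get pods --all-namespaces \
  -o jsonpath='{range .items[*]}{.metadata.namespace}{"\t"}{.metadata.name}{"\t"}{.spec.containers[*].securityContext.privileged}{"\n"}{end}' \
  | grep true
</code></pre>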
<h2 id="heading-conclusion">Conclusion</h2>
<p>In this part, I have only scratched the surface, covering some of the most common security vulnerabilities. An attacker can use impact techniques to destroy, abuse, or disrupt the normal behavior of an environment. I will discuss more security vulnerabilities in upcoming articles.</p>
]]></content:encoded></item></channel></rss>