EKS Upgrade: Fixing High CPU in aws-node After v1.33

by Richard

Introduction

Hey guys! Ever upgraded your Kubernetes cluster and suddenly felt like your CPU is doing a marathon? Yeah, it's a pain. This article dives deep into a recent issue where upgrading an EKS cluster from Kubernetes v1.32 to v1.33 caused a significant spike in CPU usage within the aws-node container (VPC CNI). We'll break down the problem, explore potential causes, and discuss how to troubleshoot it. So, grab your coffee, and let's get started!

The Problem: CPU Usage Spike After Kubernetes v1.33 Upgrade

After a Kubernetes cluster upgrade from v1.32 to v1.33, the aws-node container, which is crucial for VPC CNI (Container Network Interface), showed approximately 2x higher CPU usage. This is a major headache because increased CPU consumption can lead to performance bottlenecks, higher costs, and an overall unstable environment. Imagine your applications suddenly running slower, or your cloud bill skyrocketing – not a fun scenario, right? The reported issue clearly illustrates this problem with a visual representation of the CPU usage before and after the upgrade. The sudden jump in CPU usage is quite alarming and demands a thorough investigation.

The key question here is: Why did this happen? The initial thought was whether the CNI release notes, which mentioned support for multi-NICs on an instance, could be the culprit. The reasoning was that perhaps the increased functionality meant more work for the container, leading to higher CPU usage. However, there was nothing in the Kubernetes release notes that explicitly pointed to this being a trigger after the upgrade. This is the kind of puzzle we need to solve. We'll need to dig into the configurations, logs, and environmental factors to really understand what's going on. It’s like being a detective, but instead of a crime scene, we have a Kubernetes cluster!

To give you a clearer picture, the CNI driver version remained consistent at v1.20.x before and after the upgrade. This is an important detail because it rules out version changes as the direct cause. It means we need to look beyond the obvious and consider more subtle changes or interactions triggered by the Kubernetes upgrade. We need to ask ourselves, what else changed? What dependencies or underlying systems could have been affected? This is where detailed troubleshooting comes into play, and we'll explore the steps involved in the sections below.

Environment Configuration

To really understand the context, it's essential to look at the environment configurations. Several key environment variables were set for the aws-node container. Let's break them down:

env:
  - name: AWS_VPC_CNI_NODE_PORT_SUPPORT
    value: "true"
  - name: AWS_VPC_ENI_MTU
    value: "9001"
  - name: AWS_VPC_K8S_CNI_EXTERNALSNAT
    value: "false"
  - name: AWS_VPC_K8S_CNI_LOGLEVEL
    value: "DEBUG"
  - name: AWS_VPC_K8S_CNI_CUSTOM_NETWORK_CFG
    value: "false"
  - name: ENABLE_PREFIX_DELEGATION
    value: "true"
  - name: WARM_ENI_TARGET
    value: "1"
  - name: WARM_PREFIX_TARGET
    value: "1"
  - name: DISABLE_TCP_EARLY_DEMUX
    value: "true"

  • AWS_VPC_CNI_NODE_PORT_SUPPORT: This is set to true, which means that NodePort services are supported. NodePort services expose applications running in the cluster on a specific port on each node. This can impact networking and CPU usage due to the additional traffic management.
  • AWS_VPC_ENI_MTU: The Maximum Transmission Unit (MTU) is set to 9001, indicating the use of jumbo frames. Jumbo frames can improve network performance by reducing overhead, but they also require proper configuration across the entire network. If there are any misconfigurations, it could lead to fragmentation and increased CPU usage.
  • AWS_VPC_K8S_CNI_EXTERNALSNAT: Set to false, this disables Source Network Address Translation (SNAT) for traffic leaving the cluster. Disabling SNAT can affect how traffic is routed and might have implications for CPU usage depending on the network policies and configurations.
  • AWS_VPC_K8S_CNI_LOGLEVEL: The log level is set to DEBUG, which is great for troubleshooting because it provides detailed logs. However, it can also increase CPU usage due to the extra logging overhead. When troubleshooting, it's helpful, but in production, you might want to reduce this.
  • AWS_VPC_K8S_CNI_CUSTOM_NETWORK_CFG: Set to false, this indicates that custom network configurations are not being used. This simplifies the network setup, but it also means we should focus on the standard network configurations for potential issues.
  • ENABLE_PREFIX_DELEGATION: This is set to true, enabling prefix delegation, which optimizes IP address allocation. Prefix delegation can reduce the overhead of IP address management, but it also adds complexity to the network configuration, and any issues here could impact CPU usage.
  • WARM_ENI_TARGET and WARM_PREFIX_TARGET: Both are set to 1, meaning the system keeps one Elastic Network Interface (ENI) and one IP prefix in reserve. This helps in quickly scaling up the network, but maintaining these warm resources can also consume CPU.
  • DISABLE_TCP_EARLY_DEMUX: This is set to true, disabling early demultiplexing of TCP connections. This can affect how connections are handled and might influence CPU usage, especially if there are many short-lived connections.
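To make the prefix delegation trade-off concrete, here's a rough sketch of the pod-capacity math in Python. The ENI and IP counts below are illustrative figures for an m5.large, not values from the affected cluster, and the 110-pod cap is the common EKS recommendation for smaller instance types:

```python
def max_pods(enis: int, ips_per_eni: int, prefix_delegation: bool = False,
             cap: int = 110) -> int:
    """Sketch of the EKS max-pods math. The primary IP on each ENI is
    reserved, so only (ips_per_eni - 1) slots carry pod addresses; with
    prefix delegation each slot holds a /28 prefix (16 addresses)."""
    slots = enis * (ips_per_eni - 1)
    addresses = slots * 16 if prefix_delegation else slots
    return min(addresses + 2, cap)  # +2 accounts for host-network pods

# Illustrative m5.large figures (3 ENIs, 10 IPv4 addresses per ENI):
print(max_pods(3, 10))                          # -> 29
print(max_pods(3, 10, prefix_delegation=True))  # -> 110 (capped)
```

The takeaway: prefix delegation multiplies the addresses ipamd has to track per node, so the IP pool the daemon reconciles is much larger than without it.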

Understanding these environment variables is critical because they define how the aws-node container interacts with the network. Any changes or optimizations in Kubernetes v1.33 might interact differently with these settings, leading to unexpected CPU usage. It’s like tuning a car engine – you need to understand each component to optimize performance effectively.

Recreating the Issue: A Step-by-Step Guide

To effectively troubleshoot this issue, it's crucial to understand how to reproduce it. Here’s a breakdown of the steps:

  1. Start with the Baseline: Begin by running the CNI driver at version v1.20.1 on an EKS cluster running Kubernetes v1.32. This is your control environment, where you should observe normal CPU usage.
  2. Perform the Upgrade: Upgrade the EKS control plane to Kubernetes v1.33. This is the critical step where the issue is triggered.
  3. Monitor CPU Usage: After the upgrade, closely monitor the CPU usage of the aws-node container. Compared to the baseline, you should observe roughly double the CPU usage.
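As a sanity check on step 3, you can script a quick before/after comparison against whatever metrics backend you scrape aws-node from. The samples below are hypothetical millicore readings invented for illustration, not data from the reported cluster:

```python
from statistics import mean

# Hypothetical aws-node CPU samples (millicores), e.g. scraped from
# Prometheus or CloudWatch; these numbers are made up for illustration.
before = [24, 26, 25, 23, 27]   # on Kubernetes v1.32
after  = [51, 49, 53, 50, 52]   # after upgrading to v1.33

ratio = mean(after) / mean(before)
print(f"aws-node CPU after/before: {ratio:.2f}x")  # roughly the 2x reported
```

Averaging over a window rather than eyeballing single samples keeps normal reconcile-loop spikes from being mistaken for the regression.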

By following these steps, you can reliably reproduce the issue and have a controlled environment for testing potential solutions. This is like setting up a science experiment – you need to control the variables to understand the cause and effect. If you can consistently reproduce the problem, you’re one step closer to solving it!

Diving into the Environment Details

To paint a complete picture, let's look at the specific environment details:

  • Kubernetes Version: v1.33.2-eks-931bdca
  • CNI Version: v1.20.1
  • Operating System: Bottlerocket, a Linux-based operating system designed for running containers. The specific version is 1.44.0 (aws-k8s-1.33).
  • Kernel Version: Linux ip-100-64-63-20.dev1.internal 6.12.37 #1 SMP PREEMPT_DYNAMIC Thu Jul 24 23:19:42 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux

These details are important because they highlight the specific versions and configurations where the issue occurred. For instance, Bottlerocket is designed to be lightweight and secure, so any unexpected CPU usage could indicate an incompatibility or misconfiguration specific to this OS. Similarly, the kernel version can play a role, as certain kernel features or bugs might interact differently with Kubernetes v1.33.

Understanding the OS and kernel versions is like knowing the exact ingredients in a recipe. If something goes wrong, you need to know all the components to figure out what caused the problem. In this case, knowing we're using Bottlerocket and a specific kernel version helps us narrow down potential causes and search for relevant issues or known bugs.

Initial Troubleshooting Steps

Okay, so we know the problem, we know the environment, and we know how to reproduce it. Now, let’s talk about how to start troubleshooting this increased CPU usage. Here’s a structured approach:

  1. Check the Logs: The first thing you should always do is dive into the logs. Specifically, look at the logs for the aws-node container. Since the AWS_VPC_K8S_CNI_LOGLEVEL is set to DEBUG, you should have plenty of information. Look for any errors, warnings, or unusual patterns that might indicate what’s going on. Log analysis is like reading a story – you’re looking for clues that tell you what happened.
  2. Monitor Network Traffic: Since the aws-node container is responsible for networking, monitoring network traffic is crucial. Use tools like tcpdump or Wireshark to capture and analyze network packets. Look for any anomalies, such as excessive traffic, dropped packets, or unusual connection patterns. Think of this as listening to the heartbeat of your network – any irregular rhythms can indicate a problem.
  3. Profile CPU Usage: Use profiling tools like perf or pprof to understand which functions or processes are consuming the most CPU. This will help you pinpoint the exact source of the CPU spike. CPU profiling is like doing an autopsy – you’re dissecting the CPU usage to find the root cause.
  4. Review Kubernetes Events: Check Kubernetes events for any errors or warnings related to networking or the aws-node container. Events can provide valuable insights into what’s happening in your cluster. Kubernetes events are like news reports – they tell you about significant occurrences in your cluster.
  5. Compare Configurations: Compare the configurations of your cluster before and after the upgrade. Look for any differences that might be contributing to the increased CPU usage. This is like comparing blueprints – you’re looking for changes that might explain the new behavior.
  6. Check CNI Configuration: Review the CNI configuration files to ensure they are correctly set up and haven’t been inadvertently changed during the upgrade. CNI configurations are the rules of the game for networking – if they’re not right, things can go wrong.
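Since the log level is already DEBUG, step 1 can produce a lot of output. A small script that buckets log lines by level and message helps spot a sudden change in reconcile chatter after the upgrade. The sample lines below mimic ipamd's JSON log format and are invented for illustration (on the node, the real file is typically /var/log/aws-routed-eni/ipamd.log):

```python
import json
from collections import Counter

# Invented sample lines imitating ipamd's JSON log format.
sample_log = """\
{"level":"debug","msg":"Reconciling ENI/IP pool"}
{"level":"debug","msg":"Reconciling ENI/IP pool"}
{"level":"info","msg":"Starting ipamd"}
{"level":"warn","msg":"Datastore pool low on available IPs"}
{"level":"debug","msg":"GetPodIP request"}
"""

counts = Counter()
for line in sample_log.splitlines():
    entry = json.loads(line)
    counts[(entry["level"], entry["msg"])] += 1

# The most frequent (level, message) pairs hint at what the daemon is
# spending time on; diff this distribution before and after the upgrade.
for (level, msg), n in counts.most_common(3):
    print(f"{n:4d}  {level:5s}  {msg}")
```

If a particular debug message suddenly dominates after the upgrade, that reconcile path is a strong candidate for the extra CPU.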

By following these initial steps, you can gather a lot of information and start narrowing down the potential causes of the increased CPU usage. Remember, troubleshooting is a process of elimination – you gather data, form hypotheses, and test them until you find the root cause.

Log Analysis and Findings

The initial investigation involved collecting logs from the CNI using the aws-cni-support.sh script and sharing them with [email protected]. However, the log output of the CNI did not show any significant changes before and after the upgrade. This is a crucial piece of information because it suggests that the issue might not be directly related to errors or warnings within the CNI logs themselves.

This situation is like a medical mystery where the initial tests come back normal. It means we need to dig deeper and consider other possibilities. We need to look beyond the obvious and explore potential interactions between the CNI, Kubernetes, and the underlying infrastructure. It's like saying,