June 17, 2025

This year's KubeCon + CloudNativeCon, the first edition held in Japan, was an incredible gathering of the cloud-native community. With AI dominating so much of the agenda, the hot subjects were the intersection of AI and High-Performance Computing (HPC) workloads, and how we manage accelerated hardware in Kubernetes.

A major topic of discussion was Dynamic Resource Allocation (DRA), a Kubernetes feature designed to expose devices to containers in a more abstract way. DRA, which is maturing and approaching general availability, provides a more powerful and flexible API for requesting devices than the older Device Plugin mechanism, which often forced extensive system-level cross-checks outside the standard Kubernetes model. Requesting a device through DRA is starting to feel much like how PersistentVolumes are handled for storage.
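To make that comparison concrete, here is a minimal sketch of what requesting a device through DRA can look like, assuming the v1beta1 resource.k8s.io API; the device class, image, and names are hypothetical and depend entirely on which DRA driver is installed in the cluster.

```yaml
# Illustrative only: a ResourceClaimTemplate describing "one device of this class",
# and a Pod that consumes a claim generated from it. All names are assumptions.
apiVersion: resource.k8s.io/v1beta1
kind: ResourceClaimTemplate
metadata:
  name: single-gpu
spec:
  spec:
    devices:
      requests:
      - name: gpu
        deviceClassName: gpu.example.com      # class published by a hypothetical DRA driver
---
apiVersion: v1
kind: Pod
metadata:
  name: trainer
spec:
  resourceClaims:
  - name: gpu
    resourceClaimTemplateName: single-gpu
  containers:
  - name: main
    image: registry.example.com/trainer:latest   # placeholder image
    resources:
      claims:
      - name: gpu                                # consume the claim by name
```

The shape of the flow mirrors PersistentVolumeClaims: the claim describes what is needed, and a driver on the node is responsible for satisfying it.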

The conversation around DRA has naturally extended to networking, treating network interfaces as another "resource" that can be requested and configured. This has raised questions about whether DRA could eventually replace CNI meta-plugins like Multus, which are currently used to attach multiple network interfaces to pods. For complex use cases like SR-IOV, which require not just allocation but also specific device configuration, DRA presents a potentially simpler, more integrated approach. It could remove the need for an intermediary like Multus to pass information along. However, DRA is still an evolving feature, and while projects like DraNet are emerging to leverage it for high-performance networking, a full replacement for Multus is not imminent. The future may involve a period of coexistence or integration, as the community determines how to best use DRA's power for advanced networking scenarios.
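As a rough illustration of why that appeals for SR-IOV-style cases, a DRA claim can carry driver-specific configuration alongside the allocation request itself, the role Multus and the SR-IOV CNI split between them today. The driver name, device class, and parameters below are purely hypothetical and are not DraNet's actual API.

```yaml
# Sketch only: one claim that both allocates a NIC and passes vendor-specific
# configuration to whichever DRA driver owns the device.
apiVersion: resource.k8s.io/v1beta1
kind: ResourceClaim
metadata:
  name: fast-nic
spec:
  devices:
    requests:
    - name: nic
      deviceClassName: sriov-nic.example.com   # hypothetical device class
    config:
    - opaque:
        driver: sriov-nic.example.com          # hypothetical DRA driver
        parameters:                            # opaque, driver-defined settings
          vlan: "100"
          mtu: "9000"
```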

I also noticed a growing trend around true multi-cloud utilization driven by advanced auto-scaling controllers. While Cluster Autoscaler has been the well-known solution, it can lack flexibility. During this KubeCon I heard that many companies in Japan are adopting Karpenter, a name that was new to me. The significant paradigm shift with Karpenter is that it moves away from the rigid, one-size-fits-all approach of the predefined node groups that Cluster Autoscaler scales. Instead of merely growing and shrinking a group of identical nodes, Karpenter directly provisions the right type and size of machine for workloads that don't fit in the current cluster, making it potentially faster and more cost-effective.
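For context, a Karpenter NodePool describes constraints rather than a fixed instance type, and the controller picks whatever machine satisfies the pending pods. The sketch below targets the v1 API and assumes an AWS EC2NodeClass, since that is where Karpenter originated; all values are illustrative.

```yaml
# Minimal sketch of a Karpenter NodePool: constraints instead of a node group.
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: general-purpose
spec:
  template:
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default                   # assumed to exist in the cluster
      requirements:
      - key: kubernetes.io/arch
        operator: In
        values: ["amd64"]
      - key: karpenter.sh/capacity-type
        operator: In
        values: ["on-demand", "spot"]
  limits:
    cpu: "1000"                         # cap on total provisioned CPU
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
```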

This focus on autoscaling naturally leads to the challenge of scheduling, an issue I've encountered with my own on-prem home cluster. Once a pod is scheduled, it stays on that node until it is terminated, so even with the correct affinity rules, a pod's placement might not remain optimal over time, a problem magnified in an autoscaling environment. I learned that the PlayStation platform team tried to tackle a similar issue using Pod Disruption Budgets (PDBs) and Pod Topology Spread Constraints, with limited success.
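For reference, that combination looks roughly like the sketch below: a topology spread constraint to keep replicas balanced across nodes, plus a PDB to bound how many can be disrupted at once. The names and numbers are illustrative, not the PlayStation team's actual configuration.

```yaml
# Illustrative only: spread pods across nodes and protect availability during
# voluntary disruptions such as node drains or evictions.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 6
  selector:
    matchLabels: {app: web}
  template:
    metadata:
      labels: {app: web}
    spec:
      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: kubernetes.io/hostname
        whenUnsatisfiable: ScheduleAnyway
        labelSelector:
          matchLabels: {app: web}
      containers:
      - name: web
        image: registry.example.com/web:latest   # placeholder image
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web
spec:
  minAvailable: 4
  selector:
    matchLabels: {app: web}
```

The catch is that spread constraints only apply at scheduling time; nothing moves pods afterwards to restore the balance, which is exactly the gap described next.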

The core of this problem lies in the Kubernetes scheduling model. During a traffic spike, nodes might be added and pods distributed evenly. However, as traffic subsides or a large rollout occurs, pod downscaling can be unpredictable: a few nodes can remain highly utilized while others are left nearly empty. This is where the Descheduler project comes in. It periodically evicts pods based on a configurable set of policies, such as nodes with high or low CPU and memory utilization, or pods that no longer satisfy their affinity rules. The evicted pods are then rescheduled by the default scheduler onto more appropriate nodes, a process often called "repacking."
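A Descheduler policy for that repacking scenario might look roughly like this, using the v1alpha2 policy format; the thresholds are made-up values that only illustrate the idea of under- and over-utilized nodes.

```yaml
# Sketch of a Descheduler policy: evict pods from over-utilized nodes when
# under-utilized nodes exist, so the default scheduler can repack them.
apiVersion: "descheduler/v1alpha2"
kind: "DeschedulerPolicy"
profiles:
- name: default
  pluginConfig:
  - name: LowNodeUtilization
    args:
      thresholds:          # below all of these, a node counts as under-utilized
        cpu: 20
        memory: 20
        pods: 20
      targetThresholds:    # above any of these, a node counts as over-utilized
        cpu: 70
        memory: 70
        pods: 70
  plugins:
    balance:
      enabled:
      - LowNodeUtilization
```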

Another fascinating conference track I attended was on Non-Uniform Memory Access (NUMA), a recurring topic that I had never had the chance to delve into. On motherboards with multiple CPU sockets, or with chiplet architectures like AMD EPYC, the system can act like a single giant processor, but there are trade-offs. Each CPU has low-latency direct access to its own local memory; that pairing of CPU and local memory is called a NUMA node (or zone). When a CPU needs to access memory attached to a neighboring CPU, it must cross the interconnects on the board, which adds significant latency.

This made me realize that the massive virtual machines offered by cloud providers likely span NUMA architectures without documenting it very explicitly. Cross-NUMA traffic typically only matters when a large program doesn't fit into a single NUMA zone and has to spill over into another, causing a significant performance hit for the entire application. For AI and HPC workloads, NUMA-aware scheduling can be critical.

Out of the box, the Kubernetes kubelet includes a Topology Manager that can make NUMA-aware decisions. When a pod runs in the Guaranteed QoS class (its resource requests equal its limits), the Topology Manager can enforce policies like 'single-numa-node' to ensure all of the pod's resources are allocated from the same NUMA zone, which is critical for latency-sensitive applications.
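Concretely, enabling that behavior is a kubelet configuration change; the fragment below is a minimal sketch with one reasonable combination of settings, not a universal recommendation.

```yaml
# Fragment of a KubeletConfiguration enabling NUMA-aware placement.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
cpuManagerPolicy: static            # exclusive CPUs for Guaranteed pods
memoryManagerPolicy: Static         # NUMA-aware memory allocation (also requires reservedMemory to be configured)
topologyManagerPolicy: single-numa-node
topologyManagerScope: pod           # align all containers of the pod on the same NUMA node
```

It only kicks in for pods whose requests equal their limits, i.e. the Guaranteed QoS class mentioned above.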

Finally, in a conversation with some CNCF members who are actively managing the Kubernetes project, I learned more about how community feedback is gathered. The process is far more structured than just a simple feedback form. It involves a robust system of proposals, security audits, and project milestones managed through the Technical Oversight Committee (TOC), which is responsible for the technical vision and project oversight of the CNCF.

Attending the first KubeCon + CloudNativeCon in Japan was an incredibly rewarding experience that truly challenged my assumptions. I went in expecting to find a tech ecosystem somewhat slowed by the traditional corporate culture I’d always heard about, and to be honest, I did hear whispers of those challenges in many conversations.

And yes, the language barrier was very real; Japanese was essentially the only language spoken around the venue.