Suggested Reads

Open Source Survival Guide

https://chrisshort.net/abstracts/open-source-survival-guide/

Open Source Survival Guide provides practical rules for navigating the open source ecosystem. Learn how to balance community collaboration with business goals, contribute effectively, build trust, and maintain your sanity in the complex world of open source software development.

via Chris Short https://chrisshort.net/

March 26, 2025

·chrisshort.net·
Open Source Survival Guide
Fresh Swap Features for Linux Users in Kubernetes 1.32

https://kubernetes.io/blog/2025/03/25/swap-linux-improvements/

Swap is a fundamental and invaluable Linux feature. It offers numerous benefits, such as effectively increasing a node's memory by swapping out unused data, shielding nodes from system-level memory spikes, preventing Pods from crashing when they hit their memory limits, and much more. As a result, the node special interest group within the Kubernetes project has invested significant effort into supporting swap on Linux nodes.

The 1.22 release introduced Alpha support for configuring swap memory usage for Kubernetes workloads running on Linux on a per-node basis. Later, in release 1.28, support for swap on Linux nodes graduated to Beta, along with many new improvements. In subsequent Kubernetes releases, more improvements were made, paving the way to GA in the near future.

Prior to version 1.22, Kubernetes did not provide support for swap memory on Linux systems. This was due to the inherent difficulty in guaranteeing and accounting for pod memory utilization when swap memory was involved. As a result, swap support was deemed out of scope in the initial design of Kubernetes, and the default behavior of a kubelet was to fail to start if swap memory was detected on a node.

In version 1.22, the swap feature for Linux was initially introduced in its Alpha stage. This provided Linux users the opportunity to experiment with the swap feature for the first time. However, as an Alpha version, it was not fully developed and worked only partially, in limited environments.

In version 1.28, swap support on Linux nodes was promoted to Beta. The Beta version was a drastic leap forward. Not only did it fix a large number of bugs and make swap work in a stable way, but it also brought cgroup v2 support and introduced a wide variety of tests, including complex scenarios such as node-level pressure. It also brought many exciting new capabilities, such as the LimitedSwap behavior which sets an auto-calculated swap limit for containers, OpenMetrics instrumentation support (through the /metrics/resource endpoint) and Summary API for VerticalPodAutoscalers (through the /stats/summary endpoint), and more.

Today we are working on more improvements, paving the way for GA. Currently, the focus is especially on ensuring node stability, enhancing debugging abilities, addressing user feedback, and polishing the feature to make it stable. For example, in order to increase stability, containers in high-priority pods cannot access swap, which ensures the memory they need is ready to use. In addition, the UnlimitedSwap behavior was removed since it might compromise the node's health. Secret content protection against swapping has also been introduced (see the relevant security-risk section for more info).

To conclude, compared to previous releases, the kubelet's support for running with swap enabled is more stable and robust, more user-friendly, and addresses many known shortcomings. That said, the NodeSwap feature introduces basic swap support, and this is just the beginning. In the near future, additional features are planned to enhance swap functionality in various ways, such as improving evictions, extending the API, increasing customizability, and more!

How do I use it?

In order for the kubelet to initialize on a swap-enabled node, the failSwapOn field must be set to false in the kubelet's configuration, or the deprecated --fail-swap-on command line flag must be deactivated.
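For reference, a minimal sketch of such a configuration, assuming the kubelet is driven by a KubeletConfiguration file rather than command line flags:

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
# Allow the kubelet to start even though swap is enabled on the node
failSwapOn: false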

It is possible to configure the memorySwap.swapBehavior option to define the manner in which a node utilizes swap memory. For instance:

# This fragment goes into the kubelet's configuration file
memorySwap:
  swapBehavior: LimitedSwap

The currently available configuration options for swapBehavior are:

NoSwap (default): Kubernetes workloads cannot use swap. However, processes outside of Kubernetes' scope, like system daemons (such as the kubelet itself!), can utilize swap. This behavior is beneficial for protecting the node from system-level memory spikes, but it does not safeguard the workloads themselves from such spikes.

LimitedSwap: Kubernetes workloads can utilize swap memory, but with certain limitations. The amount of swap available to a Pod is determined automatically, based on the proportion of the memory requested relative to the node's total memory. Only non-high-priority Pods under the Burstable Quality of Service (QoS) tier are permitted to use swap. For more details, see the section below.

If configuration for memorySwap is not specified, by default the kubelet will apply the same behaviour as the NoSwap setting.

On Linux nodes, Kubernetes only supports running with swap enabled for hosts that use cgroup v2. On cgroup v1 systems, Kubernetes workloads are not allowed to use swap memory.
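As a quick check that is not part of the original post, you can inspect which cgroup version a node is running:

# Prints "cgroup2fs" on cgroup v2 hosts and "tmpfs" on cgroup v1 hosts
stat -fc %T /sys/fs/cgroup/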

Install a swap-enabled cluster with kubeadm

Before you begin

It is required for this demo that the kubeadm tool be installed, following the steps outlined in the kubeadm installation guide. If swap is already enabled on the node, cluster creation may proceed. If swap is not enabled, please refer to the provided instructions for enabling swap.

Create a swap file and turn swap on

I'll demonstrate creating 4GiB of swap, in both the encrypted and unencrypted cases.

Setting up unencrypted swap

An unencrypted swap file can be set up as follows.

Allocate storage and restrict access

fallocate --length 4GiB /swapfile
chmod 600 /swapfile

Format the swap space

mkswap /swapfile

Activate the swap space for paging

swapon /swapfile

Setting up encrypted swap

An encrypted swap file can be set up as follows. Bear in mind that this example uses the cryptsetup binary (which is available on most Linux distributions).

Allocate storage and restrict access

fallocate --length 4GiB /swapfile
chmod 600 /swapfile

Create an encrypted device backed by the allocated storage

cryptsetup --type plain --cipher aes-xts-plain64 --key-size 256 -d /dev/urandom open /swapfile cryptswap

Format the swap space

mkswap /dev/mapper/cryptswap

Activate the swap space for paging

swapon /dev/mapper/cryptswap

Verify that swap is enabled

You can verify that swap is enabled with either the swapon -s command or the free command:

swapon -s
Filename   Type       Size     Used  Priority
/dev/dm-0  partition  4194300  0     -2

free -h
       total   used    free    shared  buff/cache  available
Mem:   3.8Gi   1.3Gi   249Mi   25Mi    2.5Gi       2.5Gi
Swap:  4.0Gi   0B      4.0Gi

Enable swap on boot

After setting up swap, to activate the swap file at boot time, either set up a systemd unit that activates the (encrypted) swap, or add a line similar to /swapfile swap swap defaults 0 0 to /etc/fstab.
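For the unencrypted case, a sketch of such a systemd unit might look like the following (systemd requires a swap unit to be named after the escaped path it activates, so /swapfile becomes swapfile.swap):

# /etc/systemd/system/swapfile.swap
[Unit]
Description=Swap file

[Swap]
What=/swapfile

[Install]
WantedBy=swap.target

It can then be enabled with systemctl enable --now swapfile.swap. Encrypted swap set up via cryptsetup is usually handled through /etc/crypttab or a companion unit for the dm-crypt mapping; the details vary by distribution.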

Set up a Kubernetes cluster that uses swap-enabled nodes

To make things clearer, here is an example kubeadm configuration file kubeadm-config.yaml for the swap-enabled cluster.

---
apiVersion: "kubeadm.k8s.io/v1beta3"
kind: InitConfiguration
---
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
failSwapOn: false
memorySwap:
  swapBehavior: LimitedSwap

Then create a single-node cluster using kubeadm init --config kubeadm-config.yaml. During init, a warning is shown if swap is enabled on the node, and the kubelet will fail to start if failSwapOn is set to true. We plan to remove this warning in a future release.

How is the swap limit being determined with LimitedSwap?

The configuration of swap memory, including its limitations, presents a significant challenge. Not only is it prone to misconfiguration, but as a system-level property, any misconfiguration could potentially compromise the entire node rather than just a specific workload. To mitigate this risk and ensure the health of the node, we have implemented Swap with automatic configuration of limitations.

With LimitedSwap, Pods that do not fall under the Burstable QoS classification (i.e. BestEffort/Guaranteed QoS Pods) are prohibited from utilizing swap memory. BestEffort QoS Pods exhibit unpredictable memory consumption patterns and lack information regarding their memory usage, making it difficult to determine a safe allocation of swap memory. Conversely, Guaranteed QoS Pods are typically employed for applications that rely on the precise allocation of resources specified by the workload, with memory being immediately available. To maintain the aforementioned security and node health guarantees, these Pods are not permitted to use swap memory when LimitedSwap is in effect. In addition, high-priority pods are not permitted to use swap, in order to ensure that the memory they consume always resides in RAM and is hence ready to use.

Prior to detailing the calculation of the swap limit, it is necessary to define the following terms:

nodeTotalMemory: The total amount of physical memory available on the node.

totalPodsSwapAvailable: The total amount of swap memory on the node that is available for use by Pods (some swap memory may be reserved for system use).

containerMemoryRequest: The container's memory request.

Swap limitation is configured as: (containerMemoryRequest / nodeTotalMemory) × totalPodsSwapAvailable

In other words, the amount of swap that a container is able to use is proportional to its memory request relative to the node's total physical memory, scaled by the total amount of swap memory on the node that is available for use by Pods.
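As a worked example with made-up numbers: on a node with 64 GiB of physical memory and 16 GiB of swap available to Pods, a container requesting 8 GiB of memory would be limited to (8 GiB / 64 GiB) × 16 GiB = 2 GiB of swap.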

It is important to note that, for containers within Burstable QoS Pods, it is possible to opt-out of swap usage by specifying memory requests that are equal to memory limits. Containers configured in this manner will not have access to swap memory.
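A hypothetical container fragment illustrating this opt-out, with the memory request set equal to the limit:

containers:
- name: app            # hypothetical name and image, for illustration only
  image: nginx
  resources:
    requests:
      memory: 4Gi
    limits:
      memory: 4Gi      # request == limit, so this container gets no access to swap under LimitedSwap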

How does it work?

There are a number of possible ways that one could envision swap use on a node. When swap is already provisioned and available on a node, the kubelet can be configured so that:

It can start with swap on.

It will direct the Container Runtime Interface to allocate zero swap memory to Kubernetes workloads by default.

Swap configuration on a node is exposed to a cluster administrator via memorySwap in the KubeletConfiguration.

·kubernetes.io·
Fresh Swap Features for Linux Users in Kubernetes 1.32
Last Week in Kubernetes Development - Week Ending March 23 2025

Week Ending March 23, 2025

https://lwkd.info/2025/20250325

Developer News

Five security vulnerabilities, one critical, in Ingress-Nginx that can result in arbitrary code execution (CVE-2025-24513, CVE-2025-24514, CVE-2025-1097, CVE-2025-1098, CVE-2025-1974) were reported to the SRC. In a default installation, this can compromise all Secrets on the cluster. Upgrade Ingress-Nginx to the latest version (v1.11.5 or v1.12.1) immediately. If unable to upgrade, some exploits will be disabled if you disable Validating Admission Controllers.

There is also a new low risk vulnerability in Kubernetes network policy enforcement: CVE-2024-7598; a long-term solution is being discussed in a KEP.

Siyuan Zhang has begun a discussion on Emulation Version changes coming over the next few releases.

Registration for the Kubecon London Maintainer Summit closes Thursday, don’t miss it! Also, remember to sign up with your SIG for the Meet & Greet on April 3.

There will not be an LWKD issue next week because of KubeCon + CloudNativeCon EU. Happy KubeCon week to everyone attending!

Release Schedule

Next Deadline: Docs PRs ready for review, March 25

Code freeze is in effect for Kubernetes v1.33. Folks who have got their KEPs tracked (all 58) for the release, make sure to get your docs PRs ready for review soon!

Featured PRs

Container Stop Signals

This PR adds the initial implementation for the alpha release of custom container stop signals. A new container Lifecycle field, StopSignal, has been added, with which users are able to define custom stop signals for their containers, overriding the default signal set in the image or container runtime. This PR adds StopSignal to the container Lifecycle and also adds a StopSignal field to both ContainerConfig and ContainerStatus in the CRI API. Once the logic for using the custom stop signal has been added to the different container runtimes, the runtimes would also report the effective stop signal used by containers in their respective container statuses.
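As a rough sketch of what using this could look like at the Pod level (the field name and placement here are assumptions about the alpha API and may differ; the corresponding feature gate must also be enabled):

apiVersion: v1
kind: Pod
metadata:
  name: custom-stop-signal   # hypothetical example
spec:
  containers:
  - name: app
    image: nginx
    lifecycle:
      stopSignal: SIGUSR1    # assumed field: overrides the default stop signal from the image/runtime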

KEP of the Week

KEP 1790: Recovery from volume expansion failure

This KEP proposes allowing users to reduce a PersistentVolumeClaim (PVC) size after a failed expansion due to storage provider limitations. To prevent quota abuse, a new field, pvc.Status.AllocatedResources, ensures accurate tracking. Users can retry expansion with a smaller size, and quota calculations will use the maximum of pvc.Spec.Capacity and pvc.Status.AllocatedResources.

This KEP is tracked for beta in the ongoing release cycle.

Other Merges

CPUManager feature gate removed after graduating to GA

Separate container runtime filesystem e2e tests added

DisableNodeKubeProxyVersion feature gate to be enabled by default

HTTPS Proxy support for WebSockets

Compressed and uncompressed kubelet log file permissions to be consistent

ListFromCacheSnapshot feature gate added to allow apiserver to serve LISTs with exact RV and continuations from cache

Integration tests for PreferSameZone/PreferSameNode

Mutation of authn options removed by binding flag setters to a tracking bool in options

InPlacePodVerticalScaling: Errors that occur during pod resize actuation will be surfaced in the PodResizeInProgress condition

InPlace Pod Resize disabled for swap-enabled containers that do not have the memory ResizePolicy set to RestartContainer

New ‘tolerance’ field to HorizontalPodAutoscaler, overriding the cluster-wide default

SchedulerPopFromBackoffQ beta feature gate to improve scheduling queue behavior by popping pods from the backoffQ when the activeQ is empty

Dynamic Resource Allocation to support partitionable devices allocation with DRAPartitionableDevices feature gate

More e2e tests added for the kubelet mappings functionality

Pressure Stall Information (PSI) metrics added to node metrics

Pod API updated to support hugepage resources at spec level for pod-level resources

InPlacePodVerticalScaling E2E tests to run in the default PR-blocking jobs

Bugfix for when pods did not correctly have a Pending phase after node reboot

Topology labels to be copied from Node objects to Pods upon scheduling

Feature gated test labeling implemented

Promotions

SupplementalGroupsPolicy to beta

CPUManagerPolicyOptions to GA

NodeInclusionPolicyInPodTopologySpread to GA

ProcMountType to beta

PodLifecycleSleepActionAllowZero to beta

Deprecated

InPlacePodVerticalScalingAllocatedStatus feature gate is deprecated

Shoutouts

Nina Polshakova: Huge shoutout to the v1.33 Enhancements team for a seamless code and test freeze yesterday: @Dipesh, @Arka, @eunji, @Faeka Ansari, @Jenny Shu, and @lzung, amazing work! And props to Dipesh for accurately predicting the number of KEPs (58!) tracked at code freeze!

via Last Week in Kubernetes Development https://lwkd.info/

March 25, 2025 at 07:00PM

·lwkd.info·
Last Week in Kubernetes Development - Week Ending March 23 2025
Ingress-nginx CVE-2025-1974: What You Need to Know

https://kubernetes.io/blog/2025/03/24/ingress-nginx-cve-2025-1974/

Today, the ingress-nginx maintainers have released patches for a batch of critical vulnerabilities that could make it easy for attackers to take over your Kubernetes cluster. If you are among the over 40% of Kubernetes administrators using ingress-nginx, you should take action immediately to protect your users and data.

Background

Ingress is the traditional Kubernetes feature for exposing your workload Pods to the world so that they can be useful. In an implementation-agnostic way, Kubernetes users can define how their applications should be made available on the network. Then, an ingress controller uses that definition to set up local or cloud resources as required for the user’s particular situation and needs.

Many different ingress controllers are available, to suit users of different cloud providers or brands of load balancers. Ingress-nginx is a software-only ingress controller provided by the Kubernetes project. Because of its versatility and ease of use, ingress-nginx is quite popular: it is deployed in over 40% of Kubernetes clusters!

Ingress-nginx translates the requirements from Ingress objects into configuration for nginx, a powerful open source webserver daemon. Then, nginx uses that configuration to accept and route requests to the various applications running within a Kubernetes cluster. Proper handling of these nginx configuration parameters is crucial, because ingress-nginx needs to allow users significant flexibility while preventing them from accidentally or intentionally tricking nginx into doing things it shouldn’t.

Vulnerabilities Patched Today

Four of today’s ingress-nginx vulnerabilities are improvements to how ingress-nginx handles particular bits of nginx config. Without these fixes, a specially-crafted Ingress object can cause nginx to misbehave in various ways, including revealing the values of Secrets that are accessible to ingress-nginx. By default, ingress-nginx has access to all Secrets cluster-wide, so this can often lead to complete cluster takeover by any user or entity that has permission to create an Ingress.

The most serious of today’s vulnerabilities, CVE-2025-1974, rated 9.8 CVSS, allows anything on the Pod network to exploit configuration injection vulnerabilities via the Validating Admission Controller feature of ingress-nginx. This makes such vulnerabilities far more dangerous: ordinarily one would need to be able to create an Ingress object in the cluster, which is a fairly privileged action. When combined with today’s other vulnerabilities, CVE-2025-1974 means that anything on the Pod network has a good chance of taking over your Kubernetes cluster, with no credentials or administrative access required. In many common scenarios, the Pod network is accessible to all workloads in your cloud VPC, or even anyone connected to your corporate network! This is a very serious situation.

Today, we have released ingress-nginx v1.12.1 and v1.11.5, which have fixes for all five of these vulnerabilities.

Your next steps

First, determine if your clusters are using ingress-nginx. In most cases, you can check this by running kubectl get pods --all-namespaces --selector app.kubernetes.io/name=ingress-nginx with cluster administrator permissions.
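To also see which controller image versions are running (this uses the same label selector as above; the output format is just one option), you could run:

kubectl get pods --all-namespaces --selector app.kubernetes.io/name=ingress-nginx \
  -o jsonpath='{range .items[*]}{.spec.containers[*].image}{"\n"}{end}'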

If you are using ingress-nginx, make a plan to remediate these vulnerabilities immediately.

The best and easiest remedy is to upgrade to the new patch release of ingress-nginx. All five of today’s vulnerabilities are fixed by installing today’s patches.

If you can’t upgrade right away, you can significantly reduce your risk by turning off the Validating Admission Controller feature of ingress-nginx.

If you have installed ingress-nginx using Helm

Reinstall, setting the Helm value controller.admissionWebhooks.enabled=false
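Assuming a typical Helm installation (the release name, chart reference, and namespace below are assumptions and may differ in your cluster), that could look like:

helm upgrade --install ingress-nginx ingress-nginx/ingress-nginx \
  --namespace ingress-nginx \
  --reuse-values \
  --set controller.admissionWebhooks.enabled=false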

If you have installed ingress-nginx manually

delete the ValidatingWebhookConfiguration called ingress-nginx-admission

edit the ingress-nginx-controller Deployment or DaemonSet, removing --validating-webhook from the controller container's argument list
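Hedged example commands for the two manual steps above (the controller's namespace and Deployment name are assumptions and may differ in your installation):

# Remove the webhook configuration
kubectl delete validatingwebhookconfiguration ingress-nginx-admission

# Edit the controller and remove --validating-webhook from the container's arguments
kubectl -n ingress-nginx edit deployment ingress-nginx-controller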

If you turn off the Validating Admission Controller feature as a mitigation for CVE-2025-1974, remember to turn it back on after you upgrade. This feature provides important quality of life improvements for your users, warning them about incorrect Ingress configurations before they can take effect.

Conclusion, thanks, and further reading

The ingress-nginx vulnerabilities announced today, including CVE-2025-1974, present a serious risk to many Kubernetes users and their data. If you use ingress-nginx, you should take action immediately to keep yourself safe.

Thanks go out to Nir Ohfeld, Sagi Tzadik, Ronen Shustin, and Hillai Ben-Sasson from Wiz for responsibly disclosing these vulnerabilities, and for working with the Kubernetes SRC members and ingress-nginx maintainers (Marco Ebert and James Strong) to ensure we fixed them effectively.

For further information about the maintenance and future of ingress-nginx, please see this GitHub issue and/or attend James and Marco’s KubeCon/CloudNativeCon EU 2025 presentation.

For further information about the specific vulnerabilities discussed in this article, please see the appropriate GitHub issue: CVE-2025-24513, CVE-2025-24514, CVE-2025-1097, CVE-2025-1098, or CVE-2025-1974

via Kubernetes Blog https://kubernetes.io/

March 24, 2025 at 04:00PM

·kubernetes.io·
Ingress-nginx CVE-2025-1974: What You Need to Know
DevOps Toolkit - KubeVela & OAM: The Resurrection of Simplified App Management? - https://www.youtube.com/watch?v=hEquSxuaZUM

KubeVela & OAM: The Resurrection of Simplified App Management?

Learn how to deploy and manage backend applications effortlessly using KubeVela and the Open Application Model (OAM). In this video, we explore creating one of the components of an Internal Developer Platform that simplifies Kubernetes complexities. Discover how to deploy an app with just a few lines of YAML, promote it to production, and integrate a database. We delve into KubeVela Components, Traits, Policies, and Workflows, highlighting its strengths and limitations. By the end, you'll be equipped to decide if KubeVela is right for your platform needs.

▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬ Sponsor: Twingate 🔗 https://twingate.com ▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬

#KubeVela #OpenApplicationModel #OAM

Consider joining the channel: https://www.youtube.com/c/devopstoolkit/join

▬▬▬▬▬▬ 🔗 Additional Info 🔗 ▬▬▬▬▬▬ ➡ Transcript and commands: https://devopstoolkit.live/internal-developer-platforms/kubevela--oam-the-resurrection-of-simplified-app-management? 🔗 KubeVela: https://kubevela.io

▬▬▬▬▬▬ 💰 Sponsorships 💰 ▬▬▬▬▬▬ If you are interested in sponsoring this channel, please visit https://devopstoolkit.live/sponsor for more information. Alternatively, feel free to contact me over Twitter or LinkedIn (see below).

▬▬▬▬▬▬ 👋 Contact me 👋 ▬▬▬▬▬▬ ➡ BlueSky: https://vfarcic.bsky.social ➡ LinkedIn: https://www.linkedin.com/in/viktorfarcic/

▬▬▬▬▬▬ 🚀 Other Channels 🚀 ▬▬▬▬▬▬ 🎤 Podcast: https://www.devopsparadox.com/ 💬 Live streams: https://www.youtube.com/c/DevOpsParadox

▬▬▬▬▬▬ ⏱ Timecodes ⏱ ▬▬▬▬▬▬ 00:00 Introduction to KubeVela and OAM 02:09 Twingate (sponsor) 03:13 Open Application Model (OAM) and KubeVela (Revisited) 05:02 Define KubeVela Components and Traits 11:02 Use KubeVela Components and Traits 13:42 KubeVela Policies and Workflows 16:19 KubeVela in Action 25:32 KubeVela Pros and Cons

via YouTube https://www.youtube.com/watch?v=hEquSxuaZUM

·youtube.com·
DevOps Toolkit - KubeVela & OAM: The Resurrection of Simplified App Management? - https://www.youtube.com/watch?v=hEquSxuaZUM
Introducing JobSet

https://kubernetes.io/blog/2025/03/23/introducing-jobset/

Authors: Daniel Vega-Myhre (Google), Abdullah Gharaibeh (Google), Kevin Hannon (Red Hat)

In this article, we introduce JobSet, an open source API for representing distributed jobs. The goal of JobSet is to provide a unified API for distributed ML training and HPC workloads on Kubernetes.

Why JobSet?

The Kubernetes community's recent enhancements to the batch ecosystem on Kubernetes have attracted ML engineers who have found it to be a natural fit for the requirements of running distributed training workloads.

Large ML models (particularly LLMs) which cannot fit into the memory of the GPU or TPU chips on a single host are often distributed across tens of thousands of accelerator chips, which in turn may span thousands of hosts.

As such, the model training code is often containerized and executed simultaneously on all these hosts, performing distributed computations which often shard both the model parameters and/or the training dataset across the target accelerator chips, using collective communication primitives like all-gather and all-reduce to perform distributed computations and synchronize gradients between hosts.

These workload characteristics make Kubernetes a great fit for this type of workload, as efficiently scheduling and managing the lifecycle of containerized applications across a cluster of compute resources is an area where it shines.

It is also very extensible, allowing developers to define their own Kubernetes APIs, objects, and controllers which manage the behavior and life cycle of these objects, allowing engineers to develop custom distributed training orchestration solutions to fit their needs.

However, as distributed ML training techniques continue to evolve, existing Kubernetes primitives do not adequately model them alone anymore.

Furthermore, the landscape of Kubernetes distributed training orchestration APIs has become fragmented, and each of the existing solutions in this fragmented landscape has certain limitations that make it non-optimal for distributed ML training.

For example, the Kubeflow training operator defines custom APIs for different frameworks (e.g. PyTorchJob, TFJob, MPIJob, etc.); however, each of these job types is in fact a solution fit specifically to the target framework, each with different semantics and behavior.

On the other hand, the Job API fixed many gaps for running batch workloads, including Indexed completion mode, higher scalability, Pod failure policies and Pod backoff policy to mention a few of the most recent enhancements. However, running ML training and HPC workloads using the upstream Job API requires extra orchestration to fill the following gaps:

Multi-template Pods: Most HPC or ML training jobs include more than one type of Pod. The different Pods are part of the same workload, but they need to run a different container, request different resources or have different failure policies. A common example is the driver-worker pattern.

Job groups: Large-scale training workloads span multiple network topologies, running across multiple racks for example. Such workloads are network latency sensitive, and aim to localize communication and minimize traffic crossing the higher-latency network links. To facilitate this, the workload needs to be split into groups of Pods, each assigned to a network topology.

Inter-Pod communication: Create and manage the resources (e.g. headless Services) necessary to establish communication between the Pods of a job.

Startup sequencing: Some jobs require a specific start sequence of pods; sometimes the driver is expected to start first (like Ray or Spark), while in other cases the workers are expected to be ready before starting the driver (like MPI).

JobSet aims to address those gaps using the Job API as a building block to build a richer API for large-scale distributed HPC and ML use cases.

How JobSet Works

JobSet models a distributed batch workload as a group of Kubernetes Jobs. This allows a user to easily specify different pod templates for different distinct groups of pods (e.g. a leader, workers, parameter servers, etc.).

It uses the abstraction of a ReplicatedJob to manage child Jobs, where a ReplicatedJob is essentially a Job Template with some desired number of Job replicas specified. This provides a declarative way to easily create identical child-jobs to run on different islands of accelerators, without resorting to scripting or Helm charts to generate many versions of the same job but with different names.

Some other key JobSet features which address the problems described above include:

Replicated Jobs: In modern data centers, hardware accelerators like GPUs and TPUs are allocated in islands of homogenous accelerators connected via specialized, high-bandwidth network links. For example, a user might provision nodes containing a group of hosts co-located on a rack, each with H100 GPUs, where GPU chips within each host are connected via NVLink, with an NVLink Switch connecting the multiple NVLinks. TPU Pods are another example of this: TPU ViperLitePods consist of 64 hosts, each with 4 TPU v5e chips attached, all connected via an ICI mesh. When running a distributed training job across multiple of these islands, we often want to partition the workload into a group of smaller identical jobs, one per island, where each pod primarily communicates with the pods within the same island to do segments of distributed computation, keeping the gradient synchronization over DCN (data center network, which is lower bandwidth than ICI) to a bare minimum.

Automatic headless service creation, configuration, and lifecycle management: Pod-to-pod communication via pod hostname is enabled by default, with automatic configuration and lifecycle management of the headless service enabling this.

Configurable success policies: JobSet has configurable success policies which target specific ReplicatedJobs, with operators to target "Any" or "All" of their child jobs. For example, you can configure the JobSet to be marked complete if and only if all pods that are part of the "worker" ReplicatedJob are completed (a sketch follows after this list).

Configurable failure policies: JobSet has configurable failure policies which allow the user to specify a maximum number of times the JobSet should be restarted in the event of a failure. If any job is marked failed, the entire JobSet will be recreated, allowing the workload to resume from the last checkpoint. When no failure policy is specified, if any job fails, the JobSet simply fails.

Exclusive placement per topology domain: JobSet allows users to express that child jobs have 1:1 exclusive assignment to a topology domain, typically an accelerator island like a rack. For example, if the JobSet creates two child jobs, then this feature will enforce that the pods of each child job will be co-located on the same island, and that only one child job is allowed to schedule per island. This is useful for scenarios where we want to use a distributed data parallel (DDP) training strategy to train a model using multiple islands of compute resources (GPU racks or TPU slices), running one model replica in each accelerator island, ensuring that the forward and backward passes within a single model replica occur over the high-bandwidth interconnect linking the accelerator chips within the island, and that only the gradient synchronization between model replicas crosses accelerator islands over the lower-bandwidth data center network.

Integration with Kueue: Users can submit JobSets via Kueue to oversubscribe their clusters, queue workloads to run as capacity becomes available, prevent partial scheduling and deadlocks, enable multi-tenancy, and more.
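To make the success-policy idea above concrete, here is a hedged sketch of a JobSet spec fragment (v1alpha2 field names as I understand them; check the JobSet docs before relying on it) that marks the JobSet complete only when every child Job of the "workers" ReplicatedJob completes:

spec:
  successPolicy:
    operator: All              # "Any" would complete the JobSet on the first finished target Job
    targetReplicatedJobs:
    - workers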

Example use case

Distributed ML training on multiple TPU slices with Jax

The following example is a JobSet spec for running a TPU Multislice workload on 4 TPU v5e slices. To learn more about TPU concepts and terminology, please refer to these docs.

This example uses Jax, an ML framework with native support for Just-In-Time (JIT) compilation targeting TPU chips via OpenXLA. However, you can also use PyTorch/XLA to do ML training on TPUs.

This example makes use of several JobSet features (both explicitly and implicitly) to support the unique scheduling requirements of TPU multislice training out-of-the-box with very little configuration required by the user.

# Run a simple Jax workload
apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSet
metadata:
  name: multislice
  annotations:
    # Give each child Job exclusive usage of a TPU slice
    alpha.jobset.sigs.k8s.io/exclusive-topology: cloud.google.com/gke-nodepool
spec:
  failurePolicy:
    maxRestarts: 3
  replicatedJobs:
  - name: workers
    replicas: 4 # Set to number of TPU slices
    template:
      spec:
        parallelism: 2 # Set to number of VMs per TPU slice
        completions: 2 # Set to number of VMs per TPU slice
        backoffLimit: 0
        template:
          spec:
            hostNetwork: true
            dnsPolicy: ClusterFirstWithHostNet
            nodeSelector:
              cloud.google.com/gke-tpu-accelerator: tpu-v5-lite-podslice
              cloud.google.com/gke-tpu-topology: 2x4
            containers:
            - name: jax-tpu
              image: python:3.8
              ports:
              - containerPort: 8471
              - containerPort: 8080
              securityContext:
                privileged: true
              command:
              - bash
              - -c
              - |
                pip install "jax[tpu]" -f https://storage.googleapis.com/jax-releases/libtpu_releases.html
                python -c 'import jax; print("Global device count:", jax.device_count())'
                sleep 60
              resources:
                limits:
                  google.com/tpu: 4

Future work and getting involved

We have a number of features planned for development this year, which can be found in the JobSet roadmap.

Please feel free to reach out with feedback of any kind. We’re also open to additional contributors, whether it is to fix or report bugs, or help add new features or write documentation.

You can get in touch with us via our repo, mailing list or on Slack.

Last but not least, thanks to all our contributors.

·kubernetes.io·
Introducing JobSet
Putin sent Trump a portrait | US envoy says ‘elephant in the room’ in peace talks is whether Ukraine will cede occupied regions | CNN
The biggest obstacle to resolving Russia’s war in Ukraine is the status of Crimea and the four mainland Ukrainian regions occupied by Russia, said US special envoy Steve Witkoff, calling them “the elephant in the room” in peace talks.
·cnn.com·
Putin sent Trump a portrait | US envoy says ‘elephant in the room’ in peace talks is whether Ukraine will cede occupied regions | CNN
Platform engineering challenges: balancing simplicity and autonomy in Kubernetes | KubeFM
This interview explores how Kubernetes is evolving to support modern workloads while addressing platform engineering challenges. In this interview, Roland Barcia, Director at AWS leading the specialist technology team, discusses:

- How emerging tools like Karpenter and Argo CD are adapting to support diverse workloads from LLMs to data processing
- The balance between platform standardization and team autonomy in Kubernetes environments
- The future of Kubernetes and its evolution to support new workloads like LLMs and stateful applications
·kube.fm·
Platform engineering challenges: balancing simplicity and autonomy in Kubernetes | KubeFM
ai.robots.txt/robots.txt
A list of AI agents and robots to block. Contribute to ai-robots-txt/ai.robots.txt development by creating an account on GitHub.
·github.com·
ai.robots.txt/robots.txt