Introducing JobSet
Introducing JobSet

Introducing JobSet

https://kubernetes.io/blog/2025/03/23/introducing-jobset/

Authors: Daniel Vega-Myhre (Google), Abdullah Gharaibeh (Google), Kevin Hannon (Red Hat)

In this article, we introduce JobSet, an open source API for representing distributed jobs. The goal of JobSet is to provide a unified API for distributed ML training and HPC workloads on Kubernetes.

Why JobSet?

The Kubernetes community’s recent enhancements to the batch ecosystem have attracted ML engineers, who have found it to be a natural fit for the requirements of running distributed training workloads.

Large ML models (particularly LLMs) which cannot fit into the memory of the GPU or TPU chips on a single host are often distributed across tens of thousands of accelerator chips, which in turn may span thousands of hosts.

As such, the model training code is often containerized and executed simultaneously on all these hosts. The computation typically shards the model parameters and/or the training dataset across the target accelerator chips, using collective communication primitives like all-gather and all-reduce to perform distributed computations and synchronize gradients between hosts.

These characteristics make Kubernetes a great fit for this type of workload: efficiently scheduling and managing the lifecycle of containerized applications across a cluster of compute resources is an area where it shines.

It is also very extensible: developers can define their own Kubernetes APIs, objects, and controllers that manage the behavior and life cycle of those objects, allowing engineers to develop custom distributed training orchestration solutions to fit their needs.

However, as distributed ML training techniques continue to evolve, the existing Kubernetes primitives alone no longer model them adequately.

Furthermore, the landscape of Kubernetes distributed training orchestration APIs has become fragmented, and each of the existing solutions has limitations that make it suboptimal for distributed ML training.

For example, the Kubeflow training operator defines custom APIs for different frameworks (e.g. PyTorchJob, TFJob, MPIJob, etc.); however, each of these job types is in fact a solution tailored specifically to the target framework, each with different semantics and behavior.

On the other hand, the Job API fixed many gaps for running batch workloads, including Indexed completion mode, higher scalability, Pod failure policies and Pod backoff policy to mention a few of the most recent enhancements. However, running ML training and HPC workloads using the upstream Job API requires extra orchestration to fill the following gaps:

Multi-template Pods: Most HPC or ML training jobs include more than one type of Pod. The different Pods are part of the same workload, but they need to run a different container, request different resources, or have different failure policies. A common example is the driver-worker pattern.

Job groups: Large-scale training workloads span multiple network topologies, for example running across multiple racks. Such workloads are sensitive to network latency and aim to localize communication and minimize traffic crossing the higher-latency network links. To facilitate this, the workload needs to be split into groups of Pods, each assigned to a network topology.

Inter-Pod communication: Create and manage the resources (e.g. headless Services) necessary to establish communication between the Pods of a job.

Startup sequencing: Some jobs require a specific startup sequence of Pods; sometimes the driver is expected to start first (as in Ray or Spark), while in other cases the workers are expected to be ready before starting the driver (as in MPI).

JobSet aims to address those gaps using the Job API as a building block to build a richer API for large-scale distributed HPC and ML use cases.

How JobSet Works

JobSet models a distributed batch workload as a group of Kubernetes Jobs. This allows a user to easily specify different pod templates for distinct groups of pods (e.g. a leader, workers, parameter servers, etc.).

It uses the abstraction of a ReplicatedJob to manage child Jobs, where a ReplicatedJob is essentially a Job Template with some desired number of Job replicas specified. This provides a declarative way to easily create identical child-jobs to run on different islands of accelerators, without resorting to scripting or Helm charts to generate many versions of the same job but with different names.
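
To make the ReplicatedJob abstraction concrete, here is a minimal sketch of the structure. The names and images are illustrative, and the fields follow the jobset.x-k8s.io/v1alpha2 API used in the full example later in this post:

apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSet
metadata:
  name: example-jobset              # illustrative name
spec:
  replicatedJobs:
  - name: workers                   # a ReplicatedJob: a Job template plus a replica count
    replicas: 2                     # creates two identical child Jobs, e.g. one per accelerator island
    template:                       # an ordinary batch/v1 Job template
      spec:
        parallelism: 4
        completions: 4
        template:
          spec:
            restartPolicy: Never
            containers:
            - name: worker
              image: example-training-image:latest   # illustrative image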

Some other key JobSet features which address the problems described above include:

Replicated Jobs: In modern data centers, hardware accelerators like GPUs and TPUs are allocated in islands of homogeneous accelerators connected via specialized, high-bandwidth network links. For example, a user might provision nodes containing a group of hosts co-located on a rack, each with H100 GPUs, where the GPU chips within each host are connected via NVLink, with an NVLink Switch connecting the multiple NVLinks. TPU Pods are another example of this: TPU ViperLitePods consist of 64 hosts, each with 4 TPU v5e chips attached, all connected via an ICI mesh. When running a distributed training job across multiple of these islands, we often want to partition the workload into a group of smaller identical jobs, one per island, where the pods in each island primarily communicate with one another to do segments of distributed computation, keeping the gradient synchronization over DCN (data center network, which is lower bandwidth than ICI) to a bare minimum.

Automatic headless service creation, configuration, and lifecycle management: Pod-to-pod communication via pod hostname is enabled by default, and the headless service that enables it is automatically configured and managed throughout the JobSet's lifecycle.
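
For cases where the defaults need tuning, the JobSet spec exposes a network section. A minimal sketch, assuming the enableDNSHostnames and subdomain fields of the v1alpha2 API (check the current API reference before relying on the exact field names):

spec:
  network:
    enableDNSHostnames: true    # create a headless Service and give each pod a stable DNS hostname
    subdomain: training-svc     # illustrative name for the generated headless Service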

Configurable success policies: JobSet has configurable success policies which target specific ReplicatedJobs, with operators to target “Any” or “All” of their child jobs. For example, you can configure the JobSet to be marked complete if and only if all pods that are part of the “worker” ReplicatedJob are completed.
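
A sketch of what such a policy might look like in the JobSet spec; the "worker" ReplicatedJob name is illustrative, and the operator and targetReplicatedJobs fields are taken from the v1alpha2 successPolicy API:

spec:
  successPolicy:
    operator: All            # every targeted child Job must complete
    targetReplicatedJobs:
    - worker                 # only Jobs belonging to the "worker" ReplicatedJob count toward success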

Configurable failure policies: JobSet has configurable failure policies which allow the user to specify a maximum number of times the JobSet should be restarted in the event of a failure. If any job is marked failed, the entire JobSet will be recreated, allowing the workload to resume from the last checkpoint. When no failure policy is specified, if any job fails, the JobSet simply fails.

Exclusive placement per topology domain: JobSet allows users to express that child jobs have 1:1 exclusive assignment to a topology domain, typically an accelerator island like a rack. For example, if the JobSet creates two child jobs, this feature will enforce that the pods of each child job are co-located on the same island, and that only one child job is allowed to schedule per island. This is useful for scenarios where we want to use a distributed data parallel (DDP) training strategy to train a model using multiple islands of compute resources (GPU racks or TPU slices), running one model replica in each accelerator island, so that the forward and backward passes within a single model replica occur over the high-bandwidth interconnect linking the accelerator chips within the island, and only the gradient synchronization between model replicas crosses accelerator islands over the lower-bandwidth data center network.

Integration with Kueue: Users can submit JobSets via Kueue to oversubscribe their clusters, queue workloads to run as capacity becomes available, prevent partial scheduling and deadlocks, enable multi-tenancy, and more.
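
Submitting a JobSet through Kueue typically amounts to labeling it with the target local queue. A minimal sketch, assuming the standard kueue.x-k8s.io/queue-name label and an illustrative queue called ml-queue:

apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSet
metadata:
  name: queued-training                   # illustrative name
  labels:
    kueue.x-k8s.io/queue-name: ml-queue   # Kueue holds the JobSet until quota in this LocalQueue is available
spec:
  # ... replicatedJobs as in the examples in this post ...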

Example use case

Distributed ML training on multiple TPU slices with Jax

The following example is a JobSet spec for running a TPU Multislice workload on 4 TPU v5e slices. To learn more about TPU concepts and terminology, please refer to these docs.

This example uses Jax, an ML framework with native support for Just-In-Time (JIT) compilation targeting TPU chips via OpenXLA. However, you can also use PyTorch/XLA to do ML training on TPUs.

This example makes use of several JobSet features (both explicitly and implicitly) to support the unique scheduling requirements of TPU multislice training out-of-the-box with very little configuration required by the user.

# Run a simple Jax workload on
apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSet
metadata:
  name: multislice
  annotations:
    # Give each child Job exclusive usage of a TPU slice
    alpha.jobset.sigs.k8s.io/exclusive-topology: cloud.google.com/gke-nodepool
spec:
  failurePolicy:
    maxRestarts: 3
  replicatedJobs:
  - name: workers
    replicas: 4 # Set to number of TPU slices
    template:
      spec:
        parallelism: 2 # Set to number of VMs per TPU slice
        completions: 2 # Set to number of VMs per TPU slice
        backoffLimit: 0
        template:
          spec:
            hostNetwork: true
            dnsPolicy: ClusterFirstWithHostNet
            nodeSelector:
              cloud.google.com/gke-tpu-accelerator: tpu-v5-lite-podslice
              cloud.google.com/gke-tpu-topology: 2x4
            containers:
            - name: jax-tpu
              image: python:3.8
              ports:
              - containerPort: 8471
              - containerPort: 8080
              securityContext:
                privileged: true
              command:
              - bash
              - -c
              - |
                pip install "jax[tpu]" -f https://storage.googleapis.com/jax-releases/libtpu_releases.html
                python -c 'import jax; print("Global device count:", jax.device_count())'
                sleep 60
              resources:
                limits:
                  google.com/tpu: 4
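
Note that with replicas: 4 and parallelism and completions both set to 2, this spec produces 4 child Jobs of 2 pods each, 8 pods in total (one per TPU VM), and the exclusive-topology annotation ensures each child Job is scheduled onto its own node pool, i.e. one TPU slice per child Job.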

Future work and getting involved

We have a number of features planned for development this year, which can be found in the JobSet roadmap.

Please feel free to reach out with feedback of any kind. We’re also open to additional contributors, whether that’s fixing or reporting bugs, helping to add new features, or writing documentation.

You can get in touch with us via our repo, mailing list or on Slack.

Last but not least, thanks to all our contributors.

·kubernetes.io·
Introducing JobSet
Platform engineering challenges: balancing simplicity and autonomy in Kubernetes | KubeFM
Platform engineering challenges: balancing simplicity and autonomy in Kubernetes | KubeFM
This interview explores how Kubernetes is evolving to support modern workloads while addressing platform engineering challenges. In this interview, Roland Barcia, Director at AWS leading the specialist technology team, discusses:

- How emerging tools like Karpenter and Argo CD are adapting to support diverse workloads, from LLMs to data processing
- The balance between platform standardization and team autonomy in Kubernetes environments
- The future of Kubernetes and its evolution to support new workloads like LLMs and stateful applications
·kube.fm·
Platform engineering challenges: balancing simplicity and autonomy in Kubernetes | KubeFM
ai.robots.txt/robots.txt
ai.robots.txt/robots.txt
A list of AI agents and robots to block. Contribute to ai-robots-txt/ai.robots.txt development by creating an account on GitHub.
·github.com·
ai.robots.txt/robots.txt
Fooling a Self-Driving Tesla Is Dangerously Easy
Fooling a Self-Driving Tesla Is Dangerously Easy
In his latest video, Mark Rober shows how easy it is to fool Tesla’s self-driving capability (they use cheaper video cameras).
·kottke.org·
Fooling a Self-Driving Tesla Is Dangerously Easy
The Best Monitor Arms
The Best Monitor Arms
We researched and tested more than a dozen monitor arms and stands to find the best options to raise your screen and free up space on your desk.
·nytimes.com·
The Best Monitor Arms
Poisoned Windows shortcuts found to be a favorite of Chinese, Russian, N. Korean state hackers | The Record from Recorded Future News
Poisoned Windows shortcuts found to be a favorite of Chinese, Russian, N. Korean state hackers | The Record from Recorded Future News
The Zero Day Initiative measured the prevalence of manipulated Windows shortcut files in campaigns attributed to nation-state hacking groups — finding at least 11 exploited a bug that allows malicious use of the files.
·therecord.media·
Poisoned Windows shortcuts found to be a favorite of Chinese, Russian, N. Korean state hackers | The Record from Recorded Future News
Last Week in Kubernetes Development - Week Ending March 16 2025
Last Week in Kubernetes Development - Week Ending March 16 2025

Week Ending March 16, 2025

https://lwkd.info/2025/20250319

Developer News

CVE-2025-1767 allows authenticated users to access git repos belonging to other users if created with the in-tree gitRepo volume type. In-tree gitRepo volumes have been deprecated. The SRC suggests several workarounds in the issue.

SIG-Windows plans to make the Windows unit tests release-informing. This is a big step forward for support of Kubernetes on Windows.

Release Schedule

Next Deadline: Code and Test Freeze, March 20/21

Code and Test Freeze starts at 0200 UTC on Friday, March 21. Your PRs should all be merged by then; file an exception as soon as possible if you think you won’t make that deadline.

Other Merges

kube-openapi updated and integrated streaming tags validation

TestListCorruptObject corrupts the object in etcd instead of changing encryption key

A new function verifyAlphaFeatures implemented to ensure that alpha features cannot be enabled by default

Extracted delegator.Helper interface to allow making delegate decision based on cache state

Split subfunction to allow adding more subtests

Unit tests for Windows DSR and Overlay Support added

scheduler_perf topology spreading tests moved to a separate package

Fixes for unit tests on Windows

PodResourceAllocation type replaced with PodResourceInfoMap

Support for emulation versioning of custom resource formats

Unit tests for credential provider in service account mode

DRA adds user RBAC

InPlacePodVerticalScaling moves pod resize status to pod conditions

DeclarativeValidation feature gate to be enabled by default

ReplicationController spec.replicas and spec.minReadySeconds fields migrated to declarative validation

Declarative Validation enabled for ReplicationController

Fix for incorrect AppArmorProfile.Type marker

JobSuccessPolicy E2E tests promoted to conformance

kubelet to set observedGeneration field on pod conditions if PodObservedGenerationTracking feature gate is set

Workqueue for node updates in DaemonSetController

PreEnqueue plugins to be called before adding pod to backoffQ

Forward compatibility added for compatibility mode

Alpha support for Windows HostNetwork containers removed

Add metrics to track allocation of Uncore Cache blocks

Updated /version response to report binary version information separate from compatibility version

New alpha feature gate MutableCSINodeAllocatableCount introduced

Swap capacity to be reported as part of node.status.nodeSystemInfo

Quota support for PVC with VolumeAttributesClass

UpdatePodSandboxResources CRI method

Multi-tenancy in accessing node images via Pod API

Storage capacity scoring added to VolumeBinding plugin

GA feature gate PersistentVolumeLastPhaseTransitionTime removed

Refactoring for featuregate lifecycle management script

Promotions

InPlacePodVerticalScaling to beta

DRAResourceClaimDeviceStatus to beta

CoordinatedLeaderElection to beta

TopologyAwareHints to GA

RemoteRequestHeaderUID to beta

SchedulerAsyncPreemption to beta

JobSuccessPolicy to GA

Deprecated

apidiscovery.k8s.io/v2beta1 API group is disabled by default

gitRepo volume plugin disabled by default

via Last Week in Kubernetes Development https://lwkd.info/

March 19, 2025 at 02:00PM

·lwkd.info·
Last Week in Kubernetes Development - Week Ending March 16 2025
Exploitation of Apache Tomcat Vulnerability CVE-2025-24813 - NHS England Digital
Exploitation of Apache Tomcat Vulnerability CVE-2025-24813 - NHS England Digital
Exploitation of remote arbitrary code execution vulnerability CVE-2025-24813 reported in the wild.  CVE-2025-24813 is a vulnerability that an attacker could exploit to achieve remote code execution (RCE), view security sensitive files, or inject content into those files.
·digital.nhs.uk·
Exploitation of Apache Tomcat Vulnerability CVE-2025-24813 - NHS England Digital
Asahi Lina Pausing Work On Apple GPU Linux Driver Development
Asahi Lina Pausing Work On Apple GPU Linux Driver Development
Following Hector Martin stepping down from the Asahi Linux project that he founded for bringing Linux to Apple Silicon hardware, Asahi Lina announced today that she is pausing work on all of the Apple GPU driver development she had been pursuing for Asahi Linux with the open-source DRM kernel driver as well as Mesa contributions.
·phoronix.com·
Asahi Lina Pausing Work On Apple GPU Linux Driver Development
HTTP/3 is everywhere but nowhere
HTTP/3 is everywhere but nowhere
HTTP/3 has been in development since at least 2016, while QUIC (the protocol beneath it) was first introduced by Google way back in 2013. Both are now...
·httptoolkit.com·
HTTP/3 is everywhere but nowhere
DevOps Toolkit - Ep15 - Ask Me Anything About DevOps Cloud Kubernetes Platform Engineering... w/Endre Sara - https://www.youtube.com/watch?v=lK0Hh47YUc8
DevOps Toolkit - Ep15 - Ask Me Anything About DevOps Cloud Kubernetes Platform Engineering... w/Endre Sara - https://www.youtube.com/watch?v=lK0Hh47YUc8

Ep15 - Ask Me Anything About DevOps, Cloud, Kubernetes, Platform Engineering,... w/Endre Sara

There are no restrictions in this AMA session. You can ask anything about DevOps, Cloud, Kubernetes, Platform Engineering, containers, or anything else. We'll have a special guest Endre Sara to help us out.

▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬ Sponsor: Codefresh 🔗 GitOps Argo CD Certifications: https://learning.codefresh.io (use "viktor" for a 50% discount) ▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬

▬▬▬▬▬▬ 👋 Contact me 👋 ▬▬▬▬▬▬ ➡ BlueSky: https://vfarcic.bsky.social ➡ LinkedIn: https://www.linkedin.com/in/viktorfarcic/

▬▬▬▬▬▬ 🚀 Other Channels 🚀 ▬▬▬▬▬▬ 🎤 Podcast: https://www.devopsparadox.com/ 💬 Live streams: https://www.youtube.com/c/DevOpsParadox

via YouTube https://www.youtube.com/watch?v=lK0Hh47YUc8

·youtube.com·
DevOps Toolkit - Ep15 - Ask Me Anything About DevOps Cloud Kubernetes Platform Engineering... w/Endre Sara - https://www.youtube.com/watch?v=lK0Hh47YUc8
Saving 10s of thousands of dollars deploying AI at scale with Kubernetes with John McBride
Saving 10s of thousands of dollars deploying AI at scale with Kubernetes with John McBride

Saving 10s of thousands of dollars deploying AI at scale with Kubernetes, with John McBride

https://ku.bz/wP6bTlrFs

Curious about running AI models on Kubernetes without breaking the bank? This episode delivers practical insights from someone who's done it successfully at scale.

John McBride, VP of Infrastructure and AI Engineering at the Linux Foundation, shares how his team at OpenSauced built StarSearch, an AI feature that uses natural language processing to analyze GitHub contributions and provide insights through semantic queries. By using open-source models instead of commercial APIs, the team saved tens of thousands of dollars.

You will learn:

How to deploy vLLM on Kubernetes to serve open-source LLMs like Mistral and Llama, including configuration challenges with GPU drivers and daemon sets

Why smaller models (7-14B parameters) can achieve 95% effectiveness for many tasks compared to larger commercial models, with proper prompt engineering

How running inference workloads on your own infrastructure with T4 GPUs can reduce costs from tens of thousands to just a couple thousand dollars monthly

Practical approaches to monitoring GPU workloads in production, including handling unpredictable failures and VRAM consumption issues

Sponsor

This episode is brought to you by StackGen! Don't let infrastructure block your teams. StackGen deterministically generates secure cloud infrastructure from any input - existing cloud environments, IaC or application code.

More info

Find all the links and info for this episode here: https://ku.bz/wP6bTlrFs

Interested in sponsoring an episode? Learn more.

via KubeFM https://kube.fm

March 18, 2025 at 06:00AM

·kube.fm·
Saving 10s of thousands of dollars deploying AI at scale with Kubernetes with John McBride