Kubernetes v1.34: Pod Replacement Policy for Jobs Goes GA
https://kubernetes.io/blog/2025/09/05/kubernetes-v1-34-pod-replacement-policy-for-jobs-goes-ga/
In Kubernetes v1.34, the Pod replacement policy feature has reached general availability (GA). This blog post describes the Pod replacement policy feature and how to use it in your Jobs.
About Pod Replacement Policy
By default, the Job controller recreates Pods as soon as they either fail or begin terminating (that is, when they have a deletion timestamp).
As a result, while some Pods are terminating, the total number of running Pods for a Job can temporarily exceed the specified parallelism. For Indexed Jobs, this can even mean multiple Pods running for the same index at the same time.
This behavior works fine for many workloads, but it can cause problems in certain cases.
For example, popular machine learning frameworks like TensorFlow and JAX expect exactly one Pod per worker index. If two Pods run at the same time, you might encounter errors such as:
/job:worker/task:4: Duplicate task registration with task_name=/job:worker/replica:0/task:4
Additionally, starting replacement Pods before the old ones fully terminate can lead to:
Scheduling delays, because kube-scheduler cannot place the replacement Pods while the old Pods still occupy their nodes.
Unnecessary cluster scale-ups to accommodate the replacement Pods.
Temporary bypassing of quota checks by workload orchestrators like Kueue.
With Pod replacement policy, Kubernetes gives you control over when the control plane replaces terminating Pods, helping you avoid these issues.
How Pod Replacement Policy works
With this enhancement, the Job spec gains an optional field, .spec.podReplacementPolicy.
You can choose one of two policies:
TerminatingOrFailed (default): Replaces Pods as soon as they start terminating.
Failed: Replaces Pods only after they fully terminate and transition to the Failed phase.
Setting the policy to Failed ensures that a new Pod is only created after the previous one has completely terminated.
For Jobs with a Pod Failure Policy, the default podReplacementPolicy is Failed, and no other value is allowed. See Pod Failure Policy to learn more about Pod Failure Policies for Jobs.
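For illustration, here is a minimal sketch of a Job that combines a Pod Failure Policy with the (required) Failed replacement policy; the Job name, container name, and exit code are illustrative assumptions, not taken from the original example:

apiVersion: batch/v1
kind: Job
metadata:
  name: example-job-with-pfp  # illustrative name
spec:
  completions: 2
  parallelism: 2
  podReplacementPolicy: Failed  # the default (and only allowed value) when podFailurePolicy is set
  podFailurePolicy:
    rules:
    - action: FailJob  # fail the whole Job if the container exits with code 42 (illustrative)
      onExitCodes:
        containerName: worker
        operator: In
        values: [42]
  template:
    spec:
      restartPolicy: Never  # a Pod Failure Policy requires restartPolicy: Never
      containers:
      - name: worker
        image: your-image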
You can check how many Pods are currently terminating by inspecting the Job’s .status.terminating field:
kubectl get job myjob -o=jsonpath='{.status.terminating}'
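For example, to print the active and terminating counts side by side for the same Job (a small convenience, assuming the Job is still named myjob):

kubectl get job myjob -o=jsonpath='{.status.active} active, {.status.terminating} terminating{"\n"}'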
Example
Here’s a Job example that executes a task two times (spec.completions: 2) in parallel (spec.parallelism: 2) and replaces Pods only after they fully terminate (spec.podReplacementPolicy: Failed):
apiVersion: batch/v1
kind: Job
metadata:
  name: example-job
spec:
  completions: 2
  parallelism: 2
  podReplacementPolicy: Failed
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: worker
        image: your-image
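Assuming you save this manifest as job.yaml, you can create the Job with:

kubectl apply -f job.yaml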
When a Pod receives a SIGTERM signal (for example, due to deletion, eviction, or preemption), it begins terminating. If the container handles termination gracefully, cleanup may take some time.
When the Job starts, we will see two Pods running:
kubectl get pods
NAME                READY   STATUS    RESTARTS   AGE
example-job-qr8kf   1/1     Running   0          2s
example-job-stvb4   1/1     Running   0          2s
Let's delete one of the Pods (example-job-qr8kf):
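kubectl delete pod example-job-qr8kf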
With the TerminatingOrFailed policy, as soon as one Pod (example-job-qr8kf) starts terminating, the Job controller immediately creates a new Pod (example-job-b59zk) to replace it.
kubectl get pods
NAME                READY   STATUS        RESTARTS   AGE
example-job-b59zk   1/1     Running       0          1s
example-job-qr8kf   1/1     Terminating   0          17s
example-job-stvb4   1/1     Running       0          17s
With the Failed policy, the new Pod (example-job-b59zk) is not created while the old Pod (example-job-qr8kf) is terminating.
kubectl get pods
NAME                READY   STATUS        RESTARTS   AGE
example-job-qr8kf   1/1     Terminating   0          17s
example-job-stvb4   1/1     Running       0          17s
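While the old Pod is terminating, the Job's .status.terminating field reflects this; checking it at this point should report one terminating Pod (the exact value depends on timing):

kubectl get job example-job -o=jsonpath='{.status.terminating}'
1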
When the terminating Pod has fully transitioned to the Failed phase, a new Pod is created:
kubectl get pods
NAME                READY   STATUS    RESTARTS   AGE
example-job-b59zk   1/1     Running   0          1s
example-job-stvb4   1/1     Running   0          25s
How can you learn more?
Read the user-facing documentation for Pod Replacement Policy, Backoff Limit per Index, and Pod Failure Policy.
Read the KEPs for Pod Replacement Policy, Backoff Limit per Index, and Pod Failure Policy.
Acknowledgments
As with any Kubernetes feature, multiple people contributed to getting this done, from testing and filing bugs to reviewing code.
As this feature moves to stable after 2 years, we would like to thank the following people:
Kevin Hannon - for writing the KEP and the initial implementation.
Michał Woźniak - for guidance, mentorship, and reviews.
Aldo Culquicondor - for guidance, mentorship, and reviews.
Maciej Szulik - for guidance, mentorship, and reviews.
Dejan Zele Pejchev - for taking over the feature and promoting it from Alpha through Beta to GA.
Get involved
This work was sponsored by the Kubernetes batch working group in close collaboration with the SIG Apps community.
If you are interested in working on new features in this space, we recommend subscribing to our Slack channel and attending the regular community meetings.