Blog: Kubernetes 1.27: More fine-grained pod topology spread policies reached beta
Authors: Alex Wang (Shopee), Kante Yin (DaoCloud), Kensei Nakada (Mercari)
In Kubernetes v1.19, Pod topology spread constraints
went to general availability (GA).
As time passed, we - SIG Scheduling - received feedback from users,
and, as a result, we're actively working on improving the Topology Spread feature via three KEPs.
All of these features have reached beta in Kubernetes v1.27 and are enabled by default.
This blog post introduces each feature and the use case behind each of them.
KEP-3022: min domains in Pod Topology Spread
Pod Topology Spread has the maxSkew parameter to define the degree to which Pods may be unevenly distributed.
But, there wasn't a way to control the number of domains over which we should spread.
Some users want to force spreading Pods over a minimum number of domains, and if there aren't enough already present, make the cluster-autoscaler provision them.
Kubernetes v1.24 introduced the minDomains parameter for pod topology spread constraints,
as an alpha feature.
Via the minDomains parameter, you can define the minimum number of domains.
For example, assume there are 3 Nodes with enough capacity,
and a newly created ReplicaSet has the following topologySpreadConstraints in its Pod template.
...
topologySpreadConstraints:
- maxSkew: 1
  minDomains: 5 # requires at least 5 Nodes (because each Node has a unique hostname).
  whenUnsatisfiable: DoNotSchedule # minDomains is valid only when DoNotSchedule is used.
  topologyKey: kubernetes.io/hostname
  labelSelector:
    matchLabels:
      foo: bar
In this case, 3 Pods will be scheduled to those 3 Nodes,
but the other 2 Pods from this ReplicaSet will be unschedulable until more Nodes join the cluster.
You can imagine that the cluster autoscaler provisions new Nodes based on these unschedulable Pods,
and as a result, the replicas are finally spread over 5 Nodes.
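Putting this together, a minimal ReplicaSet manifest using such a constraint could look like the sketch below; the example-replicaset name, the foo: bar labels, and the pause image are placeholders chosen for illustration, not part of the feature itself.
apiVersion: apps/v1
kind: ReplicaSet
metadata:
  name: example-replicaset   # placeholder name
spec:
  replicas: 5
  selector:
    matchLabels:
      foo: bar
  template:
    metadata:
      labels:
        foo: bar
    spec:
      topologySpreadConstraints:
      - maxSkew: 1
        minDomains: 5                       # require at least 5 distinct hostnames
        whenUnsatisfiable: DoNotSchedule    # minDomains only takes effect with DoNotSchedule
        topologyKey: kubernetes.io/hostname
        labelSelector:
          matchLabels:
            foo: bar
      containers:
      - name: app
        image: registry.k8s.io/pause:3.9    # placeholder image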
KEP-3094: Take taints/tolerations into consideration when calculating podTopologySpread skew
Before this enhancement, when you deploy a pod with podTopologySpread configured, kube-scheduler would
take the Nodes that satisfy the Pod's nodeAffinity and nodeSelector into consideration
in filtering and scoring, but would not care about whether the node taints are tolerated by the incoming pod or not.
This may lead to a node with an untolerated taint being the only candidate for spreading, and as a result,
the pod will be stuck in Pending if it doesn't tolerate that taint.
To allow more fine-grained decisions about which Nodes to account for when calculating spreading skew,
Kubernetes 1.25 introduced two new fields within topologySpreadConstraints to define node inclusion policies:
nodeAffinityPolicy and nodeTaintsPolicy.
A manifest that applies these policies looks like the following:
apiVersion: v1
kind: Pod
metadata:
  name: example-pod
spec:
  # Configure a topology spread constraint
  topologySpreadConstraints:
  - maxSkew: integer
    # ...
    nodeAffinityPolicy: [Honor|Ignore]
    nodeTaintsPolicy: [Honor|Ignore]
  # other Pod fields go here
The nodeAffinityPolicy field indicates how Kubernetes treats a Pod's nodeAffinity or nodeSelector for
pod topology spreading.
If Honor, kube-scheduler filters out nodes not matching nodeAffinity/nodeSelector in the calculation of
spreading skew.
If Ignore, all nodes will be included, regardless of whether they match the Pod's nodeAffinity/nodeSelector
or not.
For backwards compatibility, nodeAffinityPolicy defaults to Honor.
The nodeTaintsPolicy field defines how Kubernetes considers node taints for pod topology spreading.
If Honor, only tainted nodes for which the incoming pod has a toleration will be included in the calculation of spreading skew.
If Ignore, kube-scheduler will not consider the node taints at all in the calculation of spreading skew, so a node with
a taint the pod does not tolerate will also be included.
For backwards compatibility, nodeTaintsPolicy defaults to Ignore.
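As a sketch of how the two policies combine, the following constraint counts only Nodes that match the Pod's nodeAffinity/nodeSelector and whose taints the Pod tolerates; the app: foo label is an assumed example.
topologySpreadConstraints:
- maxSkew: 1
  topologyKey: kubernetes.io/hostname
  whenUnsatisfiable: DoNotSchedule
  nodeAffinityPolicy: Honor   # exclude Nodes the Pod's nodeAffinity/nodeSelector does not match
  nodeTaintsPolicy: Honor     # exclude Nodes whose taints the Pod does not tolerate
  labelSelector:
    matchLabels:
      app: foo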
The feature was introduced in v1.25 as alpha. By default, it was disabled, so if you wanted to use this feature in v1.25,
you had to explicitly enable the feature gate NodeInclusionPolicyInPodTopologySpread. In the following v1.26
release, the associated feature graduated to beta and is enabled by default.
KEP-3243: Respect Pod topology spread after rolling upgrades
Pod Topology Spread uses the field labelSelector to identify the group of pods over which
spreading will be calculated. When using topology spreading with Deployments, it is common
practice to use the labelSelector of the Deployment as the labelSelector in the topology
spread constraints. However, this implies that all pods of a Deployment are part of the spreading
calculation, regardless of whether they belong to different revisions. As a result, when a new revision
is rolled out, spreading will apply across pods from both the old and new ReplicaSets, and so by the
time the new ReplicaSet is completely rolled out and the old one is scaled down, the actual spreading
we are left with may not match expectations because the deleted pods from the older ReplicaSet will cause
skewed distribution for the remaining pods. To avoid this problem, in the past users needed to add a
revision label to the Deployment and update it manually at each rolling upgrade (both the label on the
pod template and the labelSelector in the topologySpreadConstraints).
To solve this problem with a simpler API, Kubernetes v1.25 introduced a new field named
matchLabelKeys to topologySpreadConstraints. matchLabelKeys is a list of pod label keys to select
the pods over which spreading will be calculated. The keys are used to look up values from the labels of
the Pod being scheduled, and those key-value labels are ANDed with labelSelector to select the group of
existing pods over which spreading will be calculated for the incoming pod.
With matchLabelKeys , you don't need to update the pod.spec between different revisions.
The controller or operator managing rollouts just needs to set different values to the same label key for different revisions.
The scheduler will read those values from the incoming Pod's labels automatically, based on matchLabelKeys.
For example, if you are configuring a Deployment, you can use the label keyed with
pod-template-hash,
which is added automatically by the Deployment controller, to distinguish between different
revisions in a single Deployment.
topologySpreadConstraints:
- maxSkew: 1
  topologyKey: kubernetes.io/hostname
  whenUnsatisfiable: DoNotSchedule
  labelSelector:
    matchLabels:
      app: foo
  matchLabelKeys:
  - pod-template-hash
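Embedded in a Deployment, this could look like the sketch below (the example-deployment name and the pause image are placeholders); the Deployment controller adds the pod-template-hash label to every Pod it creates, so only the key is referenced here.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-deployment   # placeholder name
spec:
  replicas: 3
  selector:
    matchLabels:
      app: foo
  template:
    metadata:
      labels:
        app: foo             # pod-template-hash is added automatically by the Deployment controller
    spec:
      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: kubernetes.io/hostname
        whenUnsatisfiable: DoNotSchedule
        labelSelector:
          matchLabels:
            app: foo
        matchLabelKeys:
        - pod-template-hash  # spreading is evaluated per revision
      containers:
      - name: app
        image: registry.k8s.io/pause:3.9   # placeholder image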
Getting involved
These features are managed by Kubernetes SIG Scheduling.
Please join us and share your feedback. We look forward to hearing from you!
How can I learn more?
Pod Topology Spread Constraints in the Kubernetes documentation
KEP-3022: min domains in Pod Topology Spread
KEP-3094: Take taints/tolerations into consideration when calculating PodTopologySpread skew
KEP-3243: Respect PodTopologySpread after rolling upgrades