r/devopsish
Kubernetes v1.35: Fine-grained Supplemental Groups Control Graduates to GA

https://kubernetes.io/blog/2025/12/23/kubernetes-v1-35-fine-grained-supplementalgroups-control-ga/

On behalf of Kubernetes SIG Node, we are pleased to announce the graduation of fine-grained supplemental groups control to General Availability (GA) in Kubernetes v1.35!

The new Pod field, supplementalGroupsPolicy, was introduced as an opt-in alpha feature in Kubernetes v1.31 and graduated to beta in v1.33. Now, the feature is generally available. It gives you more precise control over supplemental groups in Linux containers, which can strengthen your security posture, particularly around volume access. It also makes UID/GID details in containers more transparent, offering improved security oversight.

If you are planning to upgrade your cluster from v1.32 or an earlier version, please be aware that some breaking behavioral changes were introduced in beta (v1.33). For more details, see the behavioral changes introduced in beta and the upgrade considerations sections of the earlier blog post about the graduation to beta.

Motivation: Implicit group memberships defined in /etc/group in the container image

Although many Kubernetes cluster admins and users may not be aware of it, by default Kubernetes merges the group information from the Pod with the group memberships defined in /etc/group in the container image.

Here's an example: a Pod manifest that specifies spec.securityContext.runAsUser: 1000, spec.securityContext.runAsGroup: 3000, and spec.securityContext.supplementalGroups: [4000] as part of the Pod's security context.

apiVersion: v1
kind: Pod
metadata:
  name: implicit-groups-example
spec:
  securityContext:
    runAsUser: 1000
    runAsGroup: 3000
    supplementalGroups: [4000]
  containers:
  - name: example-container
    image: registry.k8s.io/e2e-test-images/agnhost:2.45
    command: [ "sh", "-c", "sleep 1h" ]
    securityContext:
      allowPrivilegeEscalation: false

What is the result of the id command in the example-container container? The output should be similar to this:

uid=1000 gid=3000 groups=3000,4000,50000

Where does group ID 50000 in the supplementary groups (the groups field) come from, even though 50000 is not defined in the Pod's manifest at all? The answer is the /etc/group file in the container image.

Checking the contents of /etc/group in the container image reveals something like the following:

user-defined-in-image:x:1000:
group-defined-in-image:x:50000:user-defined-in-image

The last entry shows that the container's primary user (UID 1000) belongs to the group 50000.

Thus, the group membership defined in /etc/group in the container image for the container's primary user is implicitly merged with the information from the Pod. Please note that this was a design decision the current CRI implementations inherited from Docker, and the community never really reconsidered it until now.

What's wrong with it?

The implicitly merged group information from /etc/group in the container image poses a security risk. These implicit GIDs can't be detected or validated by policy engines because there's no record of them in the Pod manifest. This can lead to unexpected access control issues, particularly when accessing volumes (see kubernetes/kubernetes#112879 for details), because file permissions are controlled by UIDs/GIDs in Linux.

Fine-grained supplemental groups control in a Pod: supplementalGroupsPolicy

To tackle this problem, a Pod's .spec.securityContext now includes a supplementalGroupsPolicy field.

This field lets you control how Kubernetes calculates the supplementary groups for container processes within a Pod. The available policies are:

Merge: The group memberships defined in /etc/group for the container's primary user are merged in. If the field is not specified, this policy applies (i.e. the existing behavior, preserved for backward compatibility).

Strict: Only the group IDs specified in fsGroup, supplementalGroups, or runAsGroup are attached as supplementary groups to the container processes. Group memberships defined in /etc/group for the container's primary user are ignored.

I'll explain how the Strict policy works. The following Pod manifest specifies supplementalGroupsPolicy: Strict:

apiVersion: v1
kind: Pod
metadata:
  name: strict-supplementalgroups-policy-example
spec:
  securityContext:
    runAsUser: 1000
    runAsGroup: 3000
    supplementalGroups: [4000]
    supplementalGroupsPolicy: Strict
  containers:
  - name: example-container
    image: registry.k8s.io/e2e-test-images/agnhost:2.45
    command: [ "sh", "-c", "sleep 1h" ]
    securityContext:
      allowPrivilegeEscalation: false

The result of the id command in the example-container container should be similar to this:

uid=1000 gid=3000 groups=3000,4000

You can see that the Strict policy excludes group 50000 from groups!

Thus, ensuring supplementalGroupsPolicy: Strict (enforced by some policy mechanism) helps prevent implicit supplementary groups from being attached in a Pod.

Note:

A container with sufficient privileges can change its process identity. The supplementalGroupsPolicy only affects the initial process identity.

Read on for more details.

Attached process identity in Pod status

This feature also exposes the process identity attached to the first container process via the .status.containerStatuses[].user.linux field. This is helpful for checking whether implicit group IDs are attached.

...
status:
  containerStatuses:
  - name: ctr
    user:
      linux:
        gid: 3000
        supplementalGroups:
        - 3000
        - 4000
        uid: 1000
...
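On a live cluster, one way to read this back is with a jsonpath query (a hedged example; the Pod name matches the Strict example above and it assumes a single container):

kubectl get pod strict-supplementalgroups-policy-example \
  -o jsonpath='{.status.containerStatuses[0].user.linux}'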

Note:

Please note that the values in the status.containerStatuses[].user.linux field reflect the process identity initially attached to the first container process in the container. If the container has sufficient privilege to call process-identity-related system calls (e.g. setuid(2), setgid(2), or setgroups(2)), the container process can change its identity. Thus, the actual process identity may be dynamic.

There are several ways to restrict these permissions in containers. We suggest the following as simple solutions:

setting privileged: false and allowPrivilegeEscalation: false in your container's securityContext, or

conforming your Pod to the Restricted policy of the Pod Security Standards.

Also, the kubelet has no visibility into NRI plugins or the container runtime's internal workings. Cluster administrators configuring nodes, or highly privileged workloads running with local administrator permissions, may change the supplemental groups for any pod. However, this is outside the scope of Kubernetes control and should not be a concern for security-hardened nodes.

Strict policy requires up-to-date container runtimes

The high-level container runtime (e.g. containerd, CRI-O) plays a key role in calculating the supplementary group IDs that are attached to containers. Thus, supplementalGroupsPolicy: Strict requires a CRI runtime that supports this feature. The old behavior (supplementalGroupsPolicy: Merge) works with CRI runtimes that do not support this feature, because that policy is fully backward compatible.

Here are some CRI runtimes that support this feature, and the versions you need to be running:

containerd: v2.0 or later

CRI-O: v1.31 or later

You can check whether the feature is supported via the Node's .status.features.supplementalGroupsPolicy field. Please note that this field is different from status.declaredFeatures, introduced in KEP-5328: Node Declared Features (formerly Node Capabilities).

apiVersion: v1
kind: Node
...
status:
  features:
    supplementalGroupsPolicy: true
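To check this across all nodes, one possible query (illustrative; it simply reads the field shown above) is:

kubectl get nodes -o custom-columns='NAME:.metadata.name,SUPPLEMENTAL_GROUPS_POLICY:.status.features.supplementalGroupsPolicy'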

As container runtimes universally support this feature, various security policies may start enforcing the Strict behavior as the more secure option. It is best practice to ensure that your Pods are ready for this enforcement and that all supplemental groups are transparently declared in the Pod spec rather than in images.

Getting involved

This enhancement was driven by the SIG Node community. Please join us to connect with the community and share your ideas and feedback around the above feature and beyond. We look forward to hearing from you!

How can I learn more?

Configure a Security Context for a Pod or Container, for further details on supplementalGroupsPolicy

KEP-3619: Fine-grained SupplementalGroups control

via Kubernetes Blog https://kubernetes.io/

December 23, 2025 at 01:30PM

·kubernetes.io·
Kubernetes v1.35: Fine-grained Supplemental Groups Control Graduates to GA
Kubernetes v1.35: Kubelet Configuration Drop-in Directory Graduates to GA

https://kubernetes.io/blog/2025/12/22/kubernetes-v1-35-kubelet-config-drop-in-directory-ga/

With the recent v1.35 release of Kubernetes, support for a kubelet configuration drop-in directory is generally available. The newly stable feature simplifies the management of kubelet configuration across large, heterogeneous clusters.

With v1.35, the kubelet command line argument --config-dir is production-ready and fully supported, allowing you to specify a directory containing kubelet configuration drop-in files. All files in that directory will be automatically merged with your main kubelet configuration. This allows cluster administrators to maintain a cohesive base configuration for kubelets while enabling targeted customizations for different node groups or use cases, and without complex tooling or manual configuration management.
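As a rough sketch of how this is typically wired up (the file and directory paths here are illustrative, not prescriptive), the kubelet is started with both its main configuration file and the drop-in directory:

kubelet --config=/etc/kubernetes/kubelet-config.yaml \
  --config-dir=/etc/kubernetes/kubelet.conf.d

Files in the drop-in directory are merged on top of the main configuration, with later (higher-prefixed) files taking precedence.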

The problem: managing kubelet configuration at scale

As Kubernetes clusters grow larger and more complex, they often include heterogeneous node pools with different hardware capabilities, workload requirements, and operational constraints. This diversity necessitates different kubelet configurations across node groups—yet managing these varied configurations at scale becomes increasingly challenging. Several pain points emerge:

Configuration drift: Different nodes may have slightly different configurations, leading to inconsistent behavior

Node group customization: GPU nodes, edge nodes, and standard compute nodes often require different kubelet settings

Operational overhead: Maintaining separate, complete configuration files for each node type is error-prone and difficult to audit

Change management: Rolling out configuration changes across heterogeneous node pools requires careful coordination

Before this support was added to Kubernetes, cluster administrators had to choose between using a single monolithic configuration file for all nodes, manually maintaining multiple complete configuration files, or relying on separate tooling. Each approach had its own drawbacks. This graduation to stable gives cluster administrators a fully supported fourth way to solve that challenge.

Example use cases

Managing heterogeneous node pools

Consider a cluster with multiple node types: standard compute nodes, high-capacity nodes (such as those with GPUs or large amounts of memory), and edge nodes with specialized requirements.

Base configuration

File: 00-base.conf

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
clusterDNS:
- "10.96.0.10"
clusterDomain: cluster.local

High-capacity node override

File: 50-high-capacity-nodes.conf

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
maxPods: 50
systemReserved:
  memory: "4Gi"
  cpu: "1000m"

Edge node override

File: 50-edge-nodes.conf (edge compute typically has lower capacity)

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
evictionHard:
  memory.available: "500Mi"
  nodefs.available: "5%"

With this structure, high-capacity nodes apply both the base configuration and the capacity-specific overrides, while edge nodes apply the base configuration with edge-specific settings.
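Concretely, the drop-in directory on a high-capacity node might look something like this (the directory path is illustrative):

/etc/kubernetes/kubelet.conf.d/
├── 00-base.conf                 # shared defaults applied to every node
└── 50-high-capacity-nodes.conf  # overrides merged on top of the base

while edge nodes would carry 00-base.conf plus 50-edge-nodes.conf instead.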

Gradual configuration rollouts

When rolling out configuration changes, you can:

Add a new drop-in file with a high numeric prefix (e.g., 99-new-feature.conf)

Test the changes on a subset of nodes

Gradually roll out to more nodes

Once stable, merge changes into the base configuration

Viewing the merged configuration

Since configuration is now spread across multiple files, you can inspect the final merged configuration using the kubelet's /configz endpoint:

# Start kubectl proxy
kubectl proxy

# In another terminal, fetch the merged configuration
# Change the '<node-name>' placeholder before running the curl command
curl -X GET http://127.0.0.1:8001/api/v1/nodes/<node-name>/proxy/configz | jq .

This shows the actual configuration the kubelet is using after all merging has been applied. The merged configuration also includes any configuration settings that were specified via kubelet command-line arguments.

For detailed setup instructions, configuration examples, and merging behavior, see the official documentation:

Set Kubelet Parameters Via A Configuration File

Kubelet Configuration Directory Merging

Good practices

When using the kubelet configuration drop-in directory:

Test configurations incrementally: Always test new drop-in configurations on a subset of nodes before rolling out cluster-wide to minimize risk

Version control your drop-ins: Store your drop-in configuration files in version control (or the configuration source from which these are generated) alongside your infrastructure as code to track changes and enable easy rollbacks

Use numeric prefixes for predictable ordering: Name files with numeric prefixes (e.g., 00-, 50-, 90-) to explicitly control merge order and make the configuration layering obvious to other administrators

Be mindful of temporary files: Some text editors automatically create backup files (such as .bak, .swp, or files with ~ suffix) in the same directory when editing. Ensure these temporary or backup files are not left in the configuration directory, as they may be processed by the kubelet

Acknowledgments

This feature was developed through the collaborative efforts of SIG Node. Special thanks to all contributors who helped design, implement, test, and document this feature across its journey from alpha in v1.28, through beta in v1.30, to GA in v1.35.

To provide feedback on this feature, join the Kubernetes Node Special Interest Group, participate in discussions on the public Slack channel (#sig-node), or file an issue on GitHub.

Get involved

If you have feedback or questions about kubelet configuration management, or want to share your experience using this feature, join the discussion:

SIG Node community page

Kubernetes Slack in the #sig-node channel

SIG Node mailing list

SIG Node would love to hear about your experiences using this feature in production!

via Kubernetes Blog https://kubernetes.io/

December 22, 2025 at 01:30PM

·kubernetes.io·
Kubernetes v1.35: Kubelet Configuration Drop-in Directory Graduates to GA
DevOps & AI Toolkit - Stop Resisting AI or Get Left Behind! (A Wake-Up Call) - https://www.youtube.com/watch?v=ZEB2pKs2R-Q

Stop Resisting AI or Get Left Behind! (A Wake-Up Call)

What happens when the team that always adapts first suddenly refuses to play by new rules? This video tells the story of a seemingly safe bet that went terribly wrong—a bet on a team with every advantage imaginable: the most skilled players, unlimited budgets, deep experience, and even the power to make the rules. But when AI was allowed on the field, everything changed. While other teams embraced the new reality and showed up with full rosters ready to play, that team's players mostly refused to participate, calling it hype and yelling at the few teammates who dared to try.

This story is about what's happening right now in tech companies with AI. The teams that led every previous transformation—VMs, cloud, containers, Kubernetes—are sitting on the sidelines while historically resistant teams are running full speed ahead. The irony is brutal, and the lesson is clear: being adaptable in the past doesn't guarantee you'll adapt in the future. The question isn't whether AI will change how we work, but whether you'll be in the field playing or on the bench yelling at those who are.

#AIAdoption #TechLeadership #ChangeManagement

Consider joining the channel: https://www.youtube.com/c/devopstoolkit/join

▬▬▬▬▬▬ 🔗 Additional Info 🔗 ▬▬▬▬▬▬ ➡ Transcript and commands: https://devopstoolkit.live/ai/stop-resisting-ai-or-get-left-behind-a-wake-up-call

▬▬▬▬▬▬ 💰 Sponsorships 💰 ▬▬▬▬▬▬ If you are interested in sponsoring this channel, please visit https://devopstoolkit.live/sponsor for more information. Alternatively, feel free to contact me over Twitter or LinkedIn (see below).

▬▬▬▬▬▬ 👋 Contact me 👋 ▬▬▬▬▬▬ ➡ BlueSky: https://vfarcic.bsky.social ➡ LinkedIn: https://www.linkedin.com/in/viktorfarcic/

▬▬▬▬▬▬ 🚀 Other Channels 🚀 ▬▬▬▬▬▬ 🎤 Podcast: https://www.devopsparadox.com/ 💬 Live streams: https://www.youtube.com/c/DevOpsParadox

via YouTube https://www.youtube.com/watch?v=ZEB2pKs2R-Q

·youtube.com·
DevOps & AI Toolkit - Stop Resisting AI or Get Left Behind! (A Wake-Up Call) - https://www.youtube.com/watch?v=ZEB2pKs2R-Q
Avoiding Zombie Cluster Members When Upgrading to etcd v3.6

https://kubernetes.io/blog/2025/12/21/preventing-etcd-zombies/

This article is a mirror of an original that was recently published to the official etcd blog. The key takeaway? Always upgrade to etcd v3.5.26 or later before moving to v3.6. This ensures your cluster is automatically repaired, and avoids zombie members.

Issue summary

Recently, the etcd community addressed an issue that may appear when users upgrade from v3.5 to v3.6. This bug can cause the cluster to report "zombie members", which are etcd nodes that were removed from the database cluster some time ago, and are re-appearing and joining database consensus. The etcd cluster is then inoperable until these zombie members are removed.

In etcd v3.5 and earlier, the v2store was the source of truth for membership data, even though the v3store was also present. As a part of our v2store deprecation plan, in v3.6 the v3store is the source of truth for cluster membership. Through a bug report we found out that, in some older clusters, v2store and v3store could become inconsistent. This inconsistency manifests after upgrading as seeing old, removed "zombie" cluster members re-appearing in the cluster.

The fix and upgrade path

We’ve added a mechanism in etcd v3.5.26 to automatically sync v3store from v2store, ensuring that affected clusters are repaired before upgrading to 3.6.x.

To support the many users currently upgrading to 3.6, we have provided the following safe upgrade path:

1. Upgrade your cluster to v3.5.26 or later.

2. Wait and confirm that all members are healthy post-update.

3. Upgrade to v3.6.
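For step 2, one hedged way to confirm that all members are healthy and that no unexpected members have appeared (assuming etcdctl with the v3 API and your endpoints/TLS flags already configured) is:

etcdctl endpoint health --cluster -w table
etcdctl member list -w table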

We are unable to provide a safe workaround path for users who have some obstacle preventing updating to v3.5.26. As such, if v3.5.26 is not available from your packaging source or vendor, you should delay upgrading to v3.6 until it is.

Additional technical detail

Information below is offered for reference only. Users can follow the safe upgrade path without knowledge of the following details.

This issue is encountered with clusters that have been running in production on etcd v3.5.25 or earlier. It is a side effect of adding and removing members from the cluster, or recovering the cluster from failure. This means that the issue is more likely the older the etcd cluster is, but it cannot be ruled out for any user regardless of the age of the cluster.

etcd maintainers, working with issue reporters, have found three possible triggers for the issue based on symptoms and an analysis of etcd code and logs:

Bug in etcdctl snapshot restore (v3.4 and older versions): When restoring a snapshot using etcdctl snapshot restore, etcdctl was supposed to remove existing members before adding the new ones. In v3.4, due to a bug, old members were not removed, resulting in zombie members. Refer to the comment on etcdctl.

--force-new-cluster in v3.5 and earlier versions: In rare cases, forcibly creating a new single-member cluster did not fully remove old members, leaving zombies. The issue was resolved in v3.5.22. Please refer to this PR in the Raft project for detailed technical information.

--unsafe-no-sync enabled: If --unsafe-no-sync is enabled, in rare cases etcd might persist a membership change to v3store but crash before writing it to the WAL, causing inconsistency between v2store and v3store. This is a problem for single-member clusters. For multi-member clusters, forcibly creating a new single-member cluster from the crashed node’s data may lead to zombie members.

Note

--unsafe-no-sync is generally not recommended, as it may break the guarantees given by the consensus protocol.

Importantly, there may be other triggers for v2store and v3store membership data becoming inconsistent that we have not yet found. This means that you cannot assume that you are safe just because you have not performed any of the three actions above. Once users are upgraded to etcd v3.6, v3store becomes the source of membership data, and further inconsistency is not possible.

Advanced users who want to verify the consistency between v2store and v3store can follow the steps described in this comment. This check is not required to fix the issue, nor does SIG etcd recommend bypassing the v3.5.26 update regardless of the results of the check.

Key takeaway

Always upgrade to v3.5.26 or later before moving to v3.6. This ensures your cluster is automatically repaired and avoids zombie members.

Acknowledgements

We would like to thank Christian Baumann for reporting this long-standing upgrade issue. His report and follow-up work helped bring the issue to our attention so that we could investigate and resolve it upstream.

via Kubernetes Blog https://kubernetes.io/

December 20, 2025 at 07:00PM

·kubernetes.io·
Avoiding Zombie Cluster Members When Upgrading to etcd v3.6
Kubernetes 1.35: In-Place Pod Resize Graduates to Stable

https://kubernetes.io/blog/2025/12/19/kubernetes-v1-35-in-place-pod-resize-ga/

This release marks a major step: more than 6 years after its initial conception, the In-Place Pod Resize feature (also known as In-Place Pod Vertical Scaling), first introduced as alpha in Kubernetes v1.27 and graduated to beta in v1.33, is now stable (GA) in Kubernetes v1.35!

This graduation is a major milestone for improving resource efficiency and flexibility for workloads running on Kubernetes.

What is in-place Pod Resize?

In the past, the CPU and memory resources allocated to a container in a Pod were immutable. This meant changing them required deleting and recreating the entire Pod. For stateful services, batch jobs, or latency-sensitive workloads, this was an incredibly disruptive operation.

In-Place Pod Resize makes CPU and memory requests and limits mutable, allowing you to adjust these resources within a running Pod, often without requiring a container restart.

Key Concept:

Desired Resources: A container's spec.containers[*].resources field now represents the desired resources. For CPU and memory, these fields are now mutable.

Actual Resources: The status.containerStatuses[*].resources field reflects the resources currently configured for a running container.

Triggering a Resize: You can request a resize by updating the desired requests and limits in the Pod's specification by utilizing the new resize subresource.
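For example, a resize could be requested like this (a hedged sketch; the Pod and container names are placeholders, and it assumes a kubectl version that supports the resize subresource):

kubectl patch pod resize-demo --subresource resize --patch \
  '{"spec":{"containers":[{"name":"app","resources":{"requests":{"cpu":"800m"},"limits":{"cpu":"800m"}}}]}}'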

How can I start using in-place Pod Resize?

Detailed usage instructions and examples are provided in the official documentation: Resize CPU and Memory Resources assigned to Containers.

How does this help me?

In-place Pod Resize is a foundational building block that unlocks seamless, vertical autoscaling and improvements to workload efficiency.

Resources adjusted without disruption: Workloads sensitive to latency or restarts can have their resources modified in place without downtime or loss of state.

More powerful autoscaling: Autoscalers are now empowered to adjust resources with less impact. For example, the Vertical Pod Autoscaler (VPA)'s InPlaceOrRecreate update mode, which leverages this feature, has graduated to beta. This allows resources to be adjusted automatically and seamlessly based on usage, with minimal disruption.

See AEP-4016 for more details.

Address transient resource needs: Workloads that temporarily need more resources can be adjusted quickly. This enables features like CPU Startup Boost (AEP-7862), where applications can request more CPU during startup and then automatically scale back down.

Here are a few examples of some use cases:

A game server that needs to adjust its size with shifting player count.

A pre-warmed worker that can be shrunk while unused but inflated with the first request.

Dynamically scale with load for efficient bin-packing.

Increased resources for JIT compilation on startup.

Changes between beta (1.33) and stable (1.35)

Since the initial beta in v1.33, development effort has primarily been around stabilizing the feature and improving its usability based on community feedback. Here are the primary changes for the stable release:

Memory limit decrease: Decreasing memory limits was previously prohibited. This restriction has been lifted, and memory limit decreases are now permitted. The kubelet attempts to prevent OOM kills by allowing the resize only if current memory usage is below the new desired limit. However, this check is best-effort and not guaranteed.

Prioritized resizes: If a node doesn't have enough room to accept all resize requests, Deferred resizes are reattempted based on the following priority:

PriorityClass

QoS class

Duration spent Deferred, with older requests prioritized first.

Pod Level Resources (Alpha): Support for in-place Pod resize with Pod Level Resources has been introduced behind its own feature gate, which is alpha in v1.35.

Increased observability: There are now new Kubelet metrics and Pod events specifically associated with In-Place Pod Resize to help users track and debug resource changes.

What's next?

The graduation of In-Place Pod Resize to stable opens the door for powerful integrations across the Kubernetes ecosystem. There are several areas for further improvement that are currently planned.

Integration with autoscalers and other projects

There are planned integrations with several autoscalers and other projects to improve workload efficiency at a larger scale. Some projects under discussion:

VPA CPU startup boost (AEP-7862): Allows applications to request more CPU at startup and scale back down after a specific period of time.

VPA Support for in-place updates (AEP-4016): VPA support for InPlaceOrRecreate has recently graduated to beta, with the eventual goal being to graduate the feature to stable. Support for InPlace mode is still being worked on; see this pull request.

Ray autoscaler: Plans to leverage In-Place Pod Resize to improve workload efficiency. See this Google Cloud blog post for more details.

Agent-sandbox "Soft-Pause": Investigating leveraging in-place Pod Resize for better improved latency. See the Github issue for more details.

Runtime support: Java and Python runtimes do not support resizing memory without restart. There is an open conversation with the Java developers, see the bug.

If you have a project that could benefit from integration with in-place pod resize, please reach out using the channels listed in the feedback section!

Feature expansion

Today, In-Place Pod Resize is prohibited when used in combination with: swap, the static CPU Manager, and the static Memory Manager. Additionally, resources other than CPU and memory are still immutable. Expanding the set of supported features and resources is under consideration as more feedback about community needs comes in.

There are also plans to support workload preemption; if there is not enough room on the node for the resize of a high priority pod, the goal is to enable policies to automatically evict a lower-priority pod or upsize the node.

Improved stability

Resolve kubelet-scheduler race conditions: There are known race conditions between the kubelet and the scheduler with regard to in-place Pod resize. Work is underway to resolve these issues over the next few releases. See the issue for more details.

Safer memory limit decrease: The kubelet's best-effort check for OOM-kill prevention can be made even safer by moving the memory usage check into the container runtime itself. See the issue for more details.

Providing feedback

As we look to build further on this foundational feature, please share your feedback on how to improve and extend it. You can share feedback through GitHub issues, mailing lists, or the Slack channels of the Kubernetes #sig-node and #sig-autoscaling communities.

Thank you to everyone who contributed to making this long-awaited feature a reality!

via Kubernetes Blog https://kubernetes.io/

December 19, 2025 at 01:30PM

·kubernetes.io·
Kubernetes 1.35: In-Place Pod Resize Graduates to Stable
Kubernetes v1.35: Job Managed By Goes GA

https://kubernetes.io/blog/2025/12/18/kubernetes-v1-35-job-managedby-for-jobs-goes-ga/

In Kubernetes v1.35, the ability to specify an external Job controller (through .spec.managedBy) graduates to General Availability.

This feature allows external controllers to take full responsibility for Job reconciliation, unlocking powerful scheduling patterns like multi-cluster dispatching with MultiKueue.

Why delegate Job reconciliation?

The primary motivation for this feature is to support multi-cluster batch scheduling architectures, such as MultiKueue.

The MultiKueue architecture distinguishes between a Management Cluster and a pool of Worker Clusters:

The Management Cluster is responsible for dispatching Jobs but not executing them. It needs to accept Job objects to track status, but it skips the creation and execution of Pods.

The Worker Clusters receive the dispatched Jobs and execute the actual Pods.

Users usually interact with the Management Cluster. Because the status is automatically propagated back, they can observe the Job's progress "live" without accessing the Worker Clusters.

In the Worker Clusters, the dispatched Jobs run as regular Jobs managed by the built-in Job controller, with no .spec.managedBy set.

By using .spec.managedBy, the MultiKueue controller on the Management Cluster can take over the reconciliation of a Job. It copies the status from the "mirror" Job running on the Worker Cluster back to the Management Cluster.

Why not just disable the Job controller? While one could theoretically achieve this by disabling the built-in Job controller entirely, this is often impossible or impractical for two reasons:

Managed Control Planes: In many cloud environments, the Kubernetes control plane is locked, and users cannot modify controller manager flags.

Hybrid Cluster Role: Users often need a "hybrid" mode where the Management Cluster dispatches some heavy workloads to remote clusters but still executes smaller or control-plane-related Jobs in the Management Cluster. .spec.managedBy allows this granularity on a per-Job basis.

How .spec.managedBy works

The .spec.managedBy field indicates which controller is responsible for the Job, specifically there are two modes of operation:

Standard: if unset or set to the reserved value kubernetes.io/job-controller, the built-in Job controller reconciles the Job as usual (standard behavior).

Delegation: If set to any other value, the built-in Job controller skips reconciliation entirely for that Job.

To prevent orphaned Pods or resource leaks, this field is immutable. You cannot transfer a running Job from one controller to another.
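As a minimal sketch, a delegated Job might look like this (the managedBy value is illustrative; any value other than the reserved kubernetes.io/job-controller causes the built-in controller to skip the Job):

apiVersion: batch/v1
kind: Job
metadata:
  name: dispatched-job-example
spec:
  managedBy: example.com/external-job-controller
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: main
        image: registry.k8s.io/e2e-test-images/agnhost:2.45
        command: ["sh", "-c", "sleep 10"]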

If you are looking into implementing an external controller, be aware that it needs to conform to the definitions of the Job API. To enforce that conformance, a significant part of the effort went into introducing extensive Job status validation rules. See the How can you learn more? section for details.

Ecosystem Adoption

The .spec.managedBy field is rapidly becoming the standard interface for delegating control in the Kubernetes batch ecosystem.

Various custom workload controllers are adding this field (or an equivalent) to allow MultiKueue to take over their reconciliation and orchestrate them across clusters:

JobSet

Kubeflow Trainer

KubeRay

AppWrapper

Tekton Pipelines

While it is possible to use .spec.managedBy to implement a custom Job controller from scratch, we haven't observed that yet. The feature is specifically designed to support delegation patterns, like MultiKueue, without reinventing the wheel.

How can you learn more?

If you want to dig deeper:

Read the user-facing documentation for:

Jobs,

Delegation of managing a Job object to an external controller, and

MultiKueue.

Deep dive into the design history:

The Kubernetes Enhancement Proposal (KEP) Job's managed-by mechanism including introduction of the extensive Job status validation rules.

The Kueue KEP for MultiKueue.

Explore how MultiKueue uses .spec.managedBy in practice in the task guide for running Jobs across clusters.

Acknowledgments

As with any Kubernetes feature, a lot of people helped shape this one through design discussions, reviews, test runs, and bug reports.

We would like to thank, in particular:

Maciej Szulik - for guidance, mentorship, and reviews.

Filip Křepinský - for guidance, mentorship, and reviews.

Get involved

This work was sponsored by the Kubernetes Batch Working Group in close collaboration with the SIG Apps, and with strong input from the SIG Scheduling community.

If you are interested in batch scheduling, multi-cluster solutions, or further improving the Job API:

Join us in the Batch WG and SIG Apps meetings.

Subscribe to the WG Batch Slack channel.

via Kubernetes Blog https://kubernetes.io/

December 18, 2025 at 01:30PM

·kubernetes.io·
Kubernetes v1.35: Job Managed By Goes GA
Kubernetes v1.35: Timbernetes (The World Tree Release)

https://kubernetes.io/blog/2025/12/17/kubernetes-v1-35-release/

Editors: Aakanksha Bhende, Arujjwal Negi, Chad M. Crowell, Graziano Casto, Swathi Rao

Similar to previous releases, the release of Kubernetes v1.35 introduces new stable, beta, and alpha features. The consistent delivery of high-quality releases underscores the strength of our development cycle and the vibrant support from our community.

This release consists of 60 enhancements, including 17 stable, 19 beta, and 22 alpha features.

There are also some deprecations and removals in this release; make sure to read about those.

Release theme and logo

2025 began in the shimmer of Octarine: The Color of Magic (v1.33) and rode the gusts Of Wind & Will (v1.34). We close the year with our hands on the World Tree, inspired by Yggdrasil, the tree of life that binds many realms. Like any great tree, Kubernetes grows ring by ring and release by release, shaped by the care of a global community.

At its center sits the Kubernetes wheel wrapped around the Earth, grounded by the resilient maintainers, contributors and users who keep showing up. Between day jobs, life changes, and steady open-source stewardship, they prune old APIs, graft new features and keep one of the world’s largest open source projects healthy.

Three squirrels guard the tree: a wizard holding the LGTM scroll for reviewers, a warrior with an axe and Kubernetes shield for the release crews who cut new branches, and a rogue with a lantern for the triagers who bring light to dark issue queues.

Together, they stand in for a much larger adventuring party. Kubernetes v1.35 adds another growth ring to the World Tree, a fresh cut shaped by many hands, many paths and a community whose branches reach higher as its roots grow deeper.

Spotlight on key updates

Kubernetes v1.35 is packed with new features and improvements. Here are a few select updates the Release Team would like to highlight!

Stable: In-place update of Pod resources

Kubernetes has graduated in-place updates for Pod resources to General Availability (GA). This feature allows users to adjust CPU and memory resources without restarting Pods or Containers. Previously, such modifications required recreating Pods, which could disrupt workloads, particularly for stateful or batch applications. Earlier Kubernetes releases allowed you to change only infrastructure resource settings (requests and limits) for existing Pods. The new in-place functionality allows for smoother, nondisruptive vertical scaling, improves efficiency, and can also simplify development.

This work was done as part of KEP #1287 led by SIG Node.

Beta: Pod certificates for workload identity and security

Previously, delivering certificates to pods required external controllers (cert-manager, SPIFFE/SPIRE), CRD orchestration, and Secret management, with rotation handled by sidecars or init containers. Kubernetes v1.35 enables native workload identity with automated certificate rotation, drastically simplifying service mesh and zero-trust architectures.

Now, the kubelet generates keys, requests certificates via PodCertificateRequest, and writes credential bundles directly to the Pod's filesystem. The kube-apiserver enforces node restriction at admission time, eliminating the most common pitfall for third-party signers: accidentally violating node isolation boundaries. This enables pure mTLS flows with no bearer tokens in the issuance path.

This work was done as part of KEP #4317 led by SIG Auth.

Alpha: Node declared features before scheduling

When control planes enable new features but nodes lag behind (permitted by the Kubernetes skew policy), the scheduler can place Pods requiring those features onto incompatible older nodes. The node declared features framework allows nodes to declare their supported Kubernetes features. With the new alpha feature enabled, a Node reports the features it supports, publishing this information to the control plane via a new .status.declaredFeatures field. The kube-scheduler, admission controllers, and third-party components can then use these declarations. For example, you can enforce scheduling and API validation constraints to ensure that Pods run only on compatible nodes.
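Once the alpha feature gate is enabled, one hedged way to peek at what a node declares (the exact output shape may evolve while the feature is alpha) is:

kubectl get node <node-name> -o jsonpath='{.status.declaredFeatures}'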

This work was done as part of KEP #5328 led by SIG Node.

Features graduating to Stable

This is a selection of some of the improvements that are now stable following the v1.35 release.

PreferSameNode traffic distribution

The trafficDistribution field for Services has been updated to provide more explicit control over traffic routing. A new option, PreferSameNode, has been introduced to let services strictly prioritize endpoints on the local node if available, falling back to remote endpoints otherwise.

Simultaneously, the existing PreferClose option has been renamed to PreferSameZone. This change makes the API self-explanatory by explicitly indicating that traffic is preferred within the current availability zone. While PreferClose is preserved for backward compatibility, PreferSameZone is now the standard for zonal routing, ensuring that both node-level and zone-level preferences are clearly distinguished.
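As a hedged sketch, a Service opting into node-local routing might look like this (the name, selector, and port are illustrative):

apiVersion: v1
kind: Service
metadata:
  name: node-local-metrics
spec:
  selector:
    app: metrics-agent
  ports:
  - port: 8080
    targetPort: 8080
  trafficDistribution: PreferSameNode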

This work was done as part of KEP #3015 led by SIG Network.

Job API managed-by mechanism

The Job API now includes a managedBy field that allows an external controller to handle Job status synchronization. This feature, which graduates to stable in Kubernetes v1.35, is primarily driven by MultiKueue, a multi-cluster dispatching system where a Job created in a management cluster is mirrored and executed in a worker cluster, with status updates propagated back. To enable this workflow, the built-in Job controller must not act on a particular Job resource so that the Kueue controller can manage status updates instead.

The goal is to allow clean delegation of Job synchronization to another controller. It does not aim to pass custom parameters to that controller or modify CronJob concurrency policies.

This work was done as part of KEP #4368 led by SIG Apps.

Reliable Pod update tracking with .metadata.generation

Historically, the Pod API lacked the metadata.generation field found in other Kubernetes objects such as Deployments. Because of this omission, controllers and users had no reliable way to verify whether the kubelet had actually processed the latest changes to a Pod's specification. This ambiguity was particularly problematic for features like In-Place Pod Vertical Scaling, where it was difficult to know exactly when a resource resize request had been enacted.

Kubernetes v1.33 added the .metadata.generation field for Pods as an alpha feature. That field is now stable in the v1.35 Pod API, which means that every time a Pod's spec is updated, the .metadata.generation value is incremented. As part of this improvement, the Pod API also gained a .status.observedGeneration field, which reports the generation that the kubelet has successfully seen and processed. Pod conditions also each contain their own individual observedGeneration field that clients can report and/or observe.

Because this feature has graduated to stable in v1.35, it is available for all workloads.
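One way to observe the two fields side by side on a running Pod (illustrative; the Pod name is a placeholder) is:

kubectl get pod <pod-name> -o jsonpath='generation={.metadata.generation} observedGeneration={.status.observedGeneration}{"\n"}'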

This work was done as part of KEP #5067 led by SIG Node.

Configurable NUMA node limit for topology manager

The topology manager historically used a hard-coded limit of 8 for the maximum number of NUMA nodes it can support, preventing state explosion during affinity calculation. (There's an important detail here; a NUMA node is not the same as a Node in the Kubernetes API.) This limit on the number of NUMA nodes prevented Kubernetes from fully utilizing modern high-end servers, which increasingly feature CPU architectures with more than 8 NUMA nodes.

Kubernetes v1.31 introduced a new, beta max-allowable-numa-nodes option to the topology manager policy configuration. In Kubernetes v1.35, that option is stable. Cluster administrators who enable it can use servers with more than 8 NUMA nodes.

Although the configuration option is stable, the Kubernetes community is aware of the poor performance for large NUMA hosts, and there is a proposed enhancement (KEP-5726) that aims to improve on it. You can learn more about this by reading Control Topology Management Policies on a node.

This work was done as part of KEP #4622 led by SIG Node.

New features in Beta

This is a selection of some of the improvements that are now beta following the v1.35 release.

Expose node topology labels via Downward API

Accessing node topology information, such as region and zone, from within a Pod has typically required querying the Kubernetes API server. While functional, this approach creates complexity and security risks by necessitating broad RBAC permissions or sidecar containers just to retrieve infrastructure metadata. Kubernetes v1.35 promotes the capability to expose node topology labels directly via the Downward API to beta.

The kubelet can now inject standard topology labels, such as topology.kubernetes.io/zone and topology.kubernetes.io/region, into Pods as environment variables or projected volume files. The primary benefit is a safer and more efficient way for workloads to be topology-aware. This allows applications to natively adapt to their availability zone or region without dependencies on the API server, strengthening security by upholding the principle of least privilege and simplifying cluster configuration.

Note: Kubernetes now injects the available topology labels into every Pod so that they can be used as inputs to the downward API. With the v1.35 upgrade, most cluster administrators will see several new labels added to each Pod; this is expected as part of the design.
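As a hedged sketch, a container could consume one of the injected labels through the downward API like this (the environment variable name is arbitrary; the label key is one of the standard topology labels mentioned above):

env:
- name: NODE_ZONE
  valueFrom:
    fieldRef:
      fieldPath: metadata.labels['topology.kubernetes.io/zone']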

This work was done as part of KEP #4742 led by SIG Node.

Native support for storage version migration

In Kubernetes v1.35, the native support for storage version migration graduates to beta and is enabled by default. This move integrates the migration logic directly i

·kubernetes.io·
Kubernetes v1.35: Timbernetes (The World Tree Release)
DevOps & AI Toolkit - Distributed Tracing Explained: OpenTelemetry & Jaeger Tutorial - https://www.youtube.com/watch?v=Oa-zqv-EBpw

Distributed Tracing Explained: OpenTelemetry & Jaeger Tutorial

Your users are complaining about slow response times—sometimes 8 seconds, other times 2 seconds—but your metrics show everything is fine. Average response times look acceptable, all services report healthy, and your dashboards are green. So what's really happening? The problem is that what looks like a single user request is actually dozens of separate, independent requests cascading through your microservices. Each service only sees its own operations, with no way to know they're part of the same logical transaction. Your logs show individual services completed successfully, but you can't correlate these entries across services or identify which specific operation is causing the delay.

This video shows you exactly how to solve this blindness using distributed tracing with OpenTelemetry. You'll learn the difference between automatic and manual instrumentation, see real examples of tracing implementation in TypeScript, and analyze actual traces using Jaeger to understand request flows through complex systems. We'll cover traces, spans, context propagation, semantic conventions, sampling strategies, and how to export trace data to any backend without vendor lock-in. By the end, you'll understand why traditional observability tools can't see what's happening in distributed systems and how to implement tracing that reveals the complete journey of every request through your architecture.

#DistributedTracing #OpenTelemetry #Microservices

▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬ Sponsor: DevStats 🔗 https://devstats.plug.dev/5W1oh9J ▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬

Consider joining the channel: https://www.youtube.com/c/devopstoolkit/join

▬▬▬▬▬▬ 🔗 Additional Info 🔗 ▬▬▬▬▬▬ ➡ Transcript and commands: https://devopstoolkit.live/observability/distributed-tracing-explained-opentelemetry--jaeger-tutorial 🔗 OpenTelemetry: https://opentelemetry.io

▬▬▬▬▬▬ 💰 Sponsorships 💰 ▬▬▬▬▬▬ If you are interested in sponsoring this channel, please visit https://devopstoolkit.live/sponsor for more information. Alternatively, feel free to contact me over Twitter or LinkedIn (see below).

▬▬▬▬▬▬ 👋 Contact me 👋 ▬▬▬▬▬▬ ➡ BlueSky: https://vfarcic.bsky.social ➡ LinkedIn: https://www.linkedin.com/in/viktorfarcic/

▬▬▬▬▬▬ 🚀 Other Channels 🚀 ▬▬▬▬▬▬ 🎤 Podcast: https://www.devopsparadox.com/ 💬 Live streams: https://www.youtube.com/c/DevOpsParadox

▬▬▬▬▬▬ ⏱ Timecodes ⏱ ▬▬▬▬▬▬ 00:00 Distributed Tracing with OpenTelemetry (OTEL) 01:18 DevStats (sponsor) 02:34 Microservices Performance Mystery 06:24 OpenTelemetry Distributed Tracing 10:57 Analyzing Traces with Jaeger 14:53 Understanding OpenTelemetry Traces 20:14 Tracing Solves Observability Blindness

via YouTube https://www.youtube.com/watch?v=Oa-zqv-EBpw

·youtube.com·
DevOps & AI Toolkit - Distributed Tracing Explained: OpenTelemetry & Jaeger Tutorial - https://www.youtube.com/watch?v=Oa-zqv-EBpw
Open Source: Inside 2025's 4 Biggest Trends
The biggest open source stories in 2025 clustered around AI, licensing/governance, security and the shift in the “commercial open source” business model.
·thenewstack.io·
Open Source: Inside 2025's 4 Biggest Trends
Claude Agent Skills: A First Principles Deep Dive
Technical deep dive into Claude Agent Skills' prompt-based meta-tool architecture. Learn how context injection design, two-message patterns, LLM-based routing, and runtime context modification enable sophisticated AI agent behaviors. Complete implementation guide covering SKILL.md structure, execution lifecycle, permission scoping, and design patterns for building LLM tool systems. Essential for AI engineers, prompt engineers, and technical architects building agentic applications.
·leehanchung.github.io·
Claude Agent Skills: A First Principles Deep Dive
Papermoon: A Space-Grade Linux for the NewSpace Era
Discover Papermoon, the open source project creating a standard, space-grade Linux to replace bespoke software for satellites and spacecraft in the NewSpace era.
·thenewstack.io·
Papermoon: A Space-Grade Linux for the NewSpace Era
Last Week in Kubernetes Development - Week Ending December 7 2025

Week Ending December 7, 2025

https://lwkd.info/2025/20251211

Developer News

Maintainer activity proposals for KubeCon EU are due this Sunday. That includes Maintainer Track sessions, Maintainer Summit proposals, Project Lightning Talks, requests for ContribFests, and requests for Project Pavilion Kiosks. Leads for SIGs and CNCF Projects should submit their proposals as soon as possible.

The Checkpoint/Restore WG has started and you can join a kickoff meeting next Thursday.

Release Schedule

Next Deadline: Release day, 17 December

We are in the final week before releasing 1.35. Make sure to respond quickly to any blocker issues or test failures your SIG is tagged on.

Kubernetes v1.35.0-rc.1 has been built and pushed using Golang version 1.25.5. Patch releases 1.33.7 and 1.34.3 were published this week, built with Golang version 1.24.11; the release of v1.32.11 has been delayed.

Merges

Golang bump: to v1.24.11 in 1.32-1.34, and to v1.25.5 in 1.35

Fix IP address detection in image building for all supported versions

IPallocator won’t automatically retry all errors

Shoutouts

Urvashi: Hey team, Just wanted to give a big shout-out to Doc Team! We successfully tracked all 60 KEPs for the release — truly awesome work! This definitely wouldn’t have been possible without each one of you @yudocaa @kernel-kun @Khang Nguyen @anshuman @Orlix thank you for consistently showing up, collaborating, and keeping everything on track. A special mention to @Dipesh, who has been our guiding star throughout the cycle always there with quick suggestions, new approaches, and calm leadership. Huge thanks to you Dipesh!! And huge thanks to @Drew Hagen and @Kat Cosgrove as well for stepping in at the right moments, nudging (and sometimes gently pushing) the KEP owners so we didn’t miss any critical deadlines. That really helped us keep everything on schedule. Overall, feeling super proud of us all!

via Last Week in Kubernetes Development https://lwkd.info/

December 11, 2025 at 06:48PM

·lwkd.info·
Last Week in Kubernetes Development - Week Ending December 7 2025
AI, DevOps, and Kubernetes: Kelsey Hightower on What's Next
DevOps: Failed or Evolved? Kelsey Hightower – one of the most influential figures in DevOps and cloud-native engineering – breaks down the future of DevOps, ...
·youtube.com·
AI, DevOps, and Kubernetes: Kelsey Hightower on What’s Next