
Kubernetes upgrades: beyond the one-click update, with Tanat Lokejaroenlarb
Discover how Adevinta manages Kubernetes upgrades at scale in this episode with Tanat Lokejaroenlarb. Tanat shares his team's journey from time-consuming blue-green deployments to efficient in-place upgrades for their multi-tenant Kubernetes platform SHIP, detailing the engineering decisions and operational challenges they overcame.
You will learn:
How to transition from blue-green to in-place Kubernetes upgrades while maintaining service reliability
Techniques for tracking and addressing API deprecations using tools like Pluto and Kube-no-trouble
Strategies for minimizing SLO impact during node rebuilds through serialized approaches and proper PDB configuration (see the PodDisruptionBudget sketch after this list)
Why a phased upgrade approach with "cluster waves" provides safer production deployments even with thorough testing
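Since proper PDB configuration comes up in the episode, here is a minimal, hedged sketch of a PodDisruptionBudget that keeps a hypothetical workload (label app: checkout) available while nodes are drained and rebuilt one at a time:

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: checkout-pdb
spec:
  minAvailable: 2        # a drain never evicts below two ready pods
  selector:
    matchLabels:
      app: checkout      # hypothetical label; match it to your own workload

With a budget like this in place, a serialized node rebuild waits for evicted pods to be rescheduled and become ready before the next node is drained, which is what keeps the SLO impact bounded.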
Sponsor
This episode is sponsored by Learnk8s — get started on your Kubernetes journey through comprehensive online, in-person or remote training.
More info
Find all the links and info for this episode here: https://ku.bz/VVHFfXGl_
Interested in sponsoring an episode? Learn more.
via KubeFM https://kube.fm
May 06, 2025 at 06:00AM
Kubernetes v1.33: Prevent PersistentVolume Leaks When Deleting out of Order graduates to GA
I am thrilled to announce that the feature to prevent PersistentVolume (or PVs for short) leaks when deleting out of order has graduated to General Availability (GA) in Kubernetes v1.33! This improvement, initially introduced as a beta feature in Kubernetes v1.31, ensures that your storage resources are properly reclaimed, preventing unwanted leaks.
How did reclaim work in previous Kubernetes releases?
PersistentVolumeClaim (or PVC for short) is a user's request for storage. A PV and PVC are considered Bound once a matching PV is found for the claim or a new PV is provisioned for it. The PVs themselves are backed by volumes allocated by the storage backend.
Normally, if the volume is to be deleted, then the expectation is to delete the PVC for a bound PV-PVC pair. However, there are no restrictions on deleting a PV before deleting a PVC.
For a Bound PV-PVC pair, the ordering of PV-PVC deletion determines whether the PV reclaim policy is honored. The reclaim policy is honored if the PVC is deleted first; however, if the PV is deleted prior to deleting the PVC, then the reclaim policy is not exercised. As a result of this behavior, the associated storage asset in the external infrastructure is not removed.
PV reclaim policy with Kubernetes v1.33
With the graduation to GA in Kubernetes v1.33, this issue is now resolved. Kubernetes now reliably honors the configured Delete reclaim policy, even when PVs are deleted before their bound PVCs. This is achieved through the use of finalizers, ensuring that the storage backend releases the allocated storage resource as intended.
How does it work?
For CSI volumes, the new behavior is achieved by adding the finalizer external-provisioner.volume.kubernetes.io/finalizer to new and existing PVs. The finalizer is only removed after the storage from the backend is deleted. Addition and removal of the finalizer are handled by the external-provisioner.
An example of a PV with the finalizer; notice the new finalizer in the finalizers list:
kubectl get pv pvc-a7b7e3ba-f837-45ba-b243-dec7d8aaed53 -o yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  annotations:
    pv.kubernetes.io/provisioned-by: csi.example.driver.com
  creationTimestamp: "2021-11-17T19:28:56Z"
  finalizers:
  - kubernetes.io/pv-protection
  - external-provisioner.volume.kubernetes.io/finalizer
  name: pvc-a7b7e3ba-f837-45ba-b243-dec7d8aaed53
  resourceVersion: "194711"
  uid: 087f14f2-4157-4e95-8a70-8294b039d30e
spec:
  accessModes:
  - ReadWriteOnce
  capacity:
    storage: 1Gi
  claimRef:
    apiVersion: v1
    kind: PersistentVolumeClaim
    name: example-vanilla-block-pvc
    namespace: default
    resourceVersion: "194677"
    uid: a7b7e3ba-f837-45ba-b243-dec7d8aaed53
  csi:
    driver: csi.example.driver.com
    fsType: ext4
    volumeAttributes:
      storage.kubernetes.io/csiProvisionerIdentity: 1637110610497-8081-csi.example.driver.com
      type: CNS Block Volume
    volumeHandle: 2dacf297-803f-4ccc-afc7-3d3c3f02051e
  persistentVolumeReclaimPolicy: Delete
  storageClassName: example-vanilla-block-sc
  volumeMode: Filesystem
status:
  phase: Bound
The finalizer prevents this PersistentVolume from being removed from the cluster. As stated previously, the finalizer is only removed from the PV object after it is successfully deleted from the storage backend. To learn more about finalizers, please refer to Using Finalizers to Control Deletion.
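As an illustration of the new behavior, here is a hedged kubectl sketch of the out-of-order case, using the PV and PVC names from the example above (exact states and timing will vary by cluster and driver):

# Delete the PV first (out of order). The object is not removed immediately;
# the finalizer keeps it around, typically in a Terminating state.
kubectl delete pv pvc-a7b7e3ba-f837-45ba-b243-dec7d8aaed53 --wait=false

# Delete the bound PVC. The Delete reclaim policy is still honored: the
# external-provisioner removes the backend volume, drops the finalizer,
# and only then does the PV object disappear.
kubectl delete pvc example-vanilla-block-pvc
kubectl get pv pvc-a7b7e3ba-f837-45ba-b243-dec7d8aaed53   # eventually returns NotFound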
Similarly, the finalizer kubernetes.io/pv-controller is added to dynamically provisioned in-tree plugin volumes.
Important note
The fix does not apply to statically provisioned in-tree plugin volumes.
How to enable new behavior?
To take advantage of the new behavior, you must have upgraded your cluster to the v1.33 release of Kubernetes and be running CSI external-provisioner version 5.0.1 or later. The feature was released as beta in the v1.31 release of Kubernetes, where it was enabled by default.
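One hedged way to check the sidecar version is to list the images of your CSI controller pods and look for a csi-provisioner tag of v5.0.1 or newer (the namespace and label below are hypothetical; adjust them to your driver's deployment):

kubectl get pods -n kube-system -l app=csi-example-controller \
  -o jsonpath='{.items[*].spec.containers[*].image}' | tr ' ' '\n' | grep csi-provisioner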
References
KEP-2644
Volume leak issue
Beta Release Blog
How do I get involved?
The Kubernetes Slack and the standard SIG Storage communication channels are great mediums to reach out to the SIG Storage and migration working group teams.
Special thanks to the following people for their insightful reviews, thorough consideration, and valuable contributions:
Fan Baofa (carlory)
Jan Šafránek (jsafrane)
Xing Yang (xing-yang)
Matthew Wong (wongma7)
Join the Kubernetes Storage Special Interest Group (SIG) if you're interested in getting involved with the design and development of CSI or any part of the Kubernetes Storage system. We’re rapidly growing and always welcome new contributors.
via Kubernetes Blog https://kubernetes.io/
May 05, 2025 at 02:30PM
Kubernetes v1.33: Mutable CSI Node Allocatable Count
https://kubernetes.io/blog/2025/05/02/kubernetes-1-33-mutable-csi-node-allocatable-count/
Scheduling stateful applications reliably depends heavily on accurate information about resource availability on nodes. Kubernetes v1.33 introduces an alpha feature called mutable CSI node allocatable count, allowing Container Storage Interface (CSI) drivers to dynamically update the reported maximum number of volumes that a node can handle. This capability significantly enhances the accuracy of pod scheduling decisions and reduces scheduling failures caused by outdated volume capacity information.
Background
Traditionally, Kubernetes CSI drivers report a static maximum volume attachment limit when initializing. However, actual attachment capacities can change during a node's lifecycle for various reasons, such as:
Manual or external operations attaching/detaching volumes outside of Kubernetes control.
Dynamically attached network interfaces or specialized hardware (GPUs, NICs, etc.) consuming available slots.
Multi-driver scenarios, where one CSI driver’s operations affect available capacity reported by another.
Static reporting can cause Kubernetes to schedule pods onto nodes that appear to have capacity but don't, leading to pods stuck in a ContainerCreating state.
Dynamically adapting CSI volume limits
With the new feature gate MutableCSINodeAllocatableCount, Kubernetes enables CSI drivers to dynamically adjust and report node attachment capacities at runtime. This ensures that the scheduler has the most accurate, up-to-date view of node capacity.
How it works
When this feature is enabled, Kubernetes supports two mechanisms for updating the reported node volume limits:
Periodic Updates: CSI drivers specify an interval to periodically refresh the node's allocatable capacity.
Reactive Updates: An immediate update triggered when a volume attachment fails due to exhausted resources (ResourceExhausted error).
Enabling the feature
To use this alpha feature, you must enable the MutableCSINodeAllocatableCount feature gate in these components:
kube-apiserver
kubelet
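For example, on a kubeadm-managed cluster (an assumption; adapt the mechanism to however you run your control plane and kubelets), the gate could be enabled roughly like this:

# ClusterConfiguration excerpt: enable the gate on the kube-apiserver.
apiVersion: kubeadm.k8s.io/v1beta3
kind: ClusterConfiguration
apiServer:
  extraArgs:
    feature-gates: "MutableCSINodeAllocatableCount=true"
---
# KubeletConfiguration excerpt: enable the same gate on each node's kubelet.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
featureGates:
  MutableCSINodeAllocatableCount: true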
Example CSI driver configuration
Below is an example of configuring a CSI driver to enable periodic updates every 60 seconds:
apiVersion: storage.k8s.io/v1
kind: CSIDriver
metadata:
  name: example.csi.k8s.io
spec:
  nodeAllocatableUpdatePeriodSeconds: 60
This configuration directs Kubelet to periodically call the CSI driver's NodeGetInfo method every 60 seconds, updating the node’s allocatable volume count. Kubernetes enforces a minimum update interval of 10 seconds to balance accuracy and resource usage.
Immediate updates on attachment failures
In addition to periodic updates, Kubernetes now reacts to attachment failures. Specifically, if a volume attachment fails with a ResourceExhausted error (gRPC code 8), an immediate update is triggered to correct the allocatable count promptly.
This proactive correction prevents repeated scheduling errors and helps maintain cluster health.
Getting started
To experiment with mutable CSI node allocatable count in your Kubernetes v1.33 cluster:
Enable the feature gate MutableCSINodeAllocatableCount on the kube-apiserver and kubelet components.
Update your CSI driver configuration by setting nodeAllocatableUpdatePeriodSeconds.
Monitor and observe improvements in scheduling accuracy and pod placement reliability.
Next steps
This feature is currently in alpha and the Kubernetes community welcomes your feedback. Test it, share your experiences, and help guide its evolution toward beta and GA stability.
Join discussions in the Kubernetes Storage Special Interest Group (SIG-Storage) to shape the future of Kubernetes storage capabilities.
via Kubernetes Blog https://kubernetes.io/
May 02, 2025 at 02:30PM
Kubernetes v1.33: New features in DRA
https://kubernetes.io/blog/2025/05/01/kubernetes-v1-33-dra-updates/
Kubernetes Dynamic Resource Allocation (DRA) was originally introduced as an alpha feature in the v1.26 release, and then went through a significant redesign for Kubernetes v1.31. The main DRA feature went to beta in v1.32, and the project hopes it will be generally available in Kubernetes v1.34.
The basic feature set of DRA provides a far more powerful and flexible API for requesting devices than Device Plugin. And while DRA remains a beta feature for v1.33, the DRA team has been hard at work implementing a number of new features and UX improvements. One feature has been promoted to beta, while a number of new features have been added in alpha. The team has also made progress towards getting DRA ready for GA.
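For readers who have not tried DRA yet, here is a minimal, hedged sketch of requesting a device through the v1beta1 API (the DeviceClass name example.com-gpu and the container image are placeholders, not real artifacts):

apiVersion: resource.k8s.io/v1beta1
kind: ResourceClaim
metadata:
  name: single-gpu
spec:
  devices:
    requests:
    - name: gpu
      deviceClassName: example.com-gpu   # hypothetical DeviceClass published by a DRA driver
---
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  containers:
  - name: app
    image: registry.example/app:latest   # placeholder image
    resources:
      claims:
      - name: gpu                        # references the entry in resourceClaims below
  resourceClaims:
  - name: gpu
    resourceClaimName: single-gpu

Compared to the Device Plugin model, the claim is a first-class API object, so drivers can attach structured parameters and status to it rather than exposing devices only as opaque extended-resource counters.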
Features promoted to beta
Driver-owned Resource Claim Status was promoted to beta. This allows the driver to report driver-specific device status data for each allocated device in a resource claim, which is particularly useful for supporting network devices.
New alpha features
Partitionable Devices lets a driver advertise several overlapping logical devices (“partitions”), and the driver can reconfigure the physical device dynamically based on the actual devices allocated. This makes it possible to partition devices on-demand to meet the needs of the workloads and therefore increase the utilization.
Device Taints and Tolerations allow devices to be tainted and for workloads to tolerate those taints. This makes it possible for drivers or cluster administrators to mark devices as unavailable. Depending on the effect of the taint, this can prevent devices from being allocated or cause eviction of pods that are using the device.
Prioritized List lets users specify a list of acceptable devices for their workloads, rather than just a single type of device. So while the workload might run best on a single high-performance GPU, it might also be able to run on 2 mid-level GPUs. The scheduler will attempt to satisfy the alternatives in the list in order, so the workload will be allocated the best set of devices available in the cluster.
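A hedged sketch of what such a request might look like (field names follow the v1.33 alpha and may still change; the DeviceClass names are hypothetical):

apiVersion: resource.k8s.io/v1beta1
kind: ResourceClaim
metadata:
  name: gpu-or-gpus
spec:
  devices:
    requests:
    - name: accelerator
      firstAvailable:                    # alternatives, tried in order
      - name: one-big-gpu
        deviceClassName: example.com-large-gpu
        count: 1
      - name: two-mid-gpus
        deviceClassName: example.com-mid-gpu
        count: 2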
Admin Access has been updated so that only users with access to a namespace carrying the resource.k8s.io/admin-access: "true" label are authorized to create ResourceClaim or ResourceClaimTemplate objects with the adminAccess field within that namespace. This grants administrators access to in-use devices and may enable additional permissions when making the device available in a container. This ensures that non-admin users cannot misuse the feature.
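For example, a cluster administrator might dedicate a namespace to such claims by labeling it accordingly (a minimal sketch; the namespace name is arbitrary):

apiVersion: v1
kind: Namespace
metadata:
  name: dra-admin
  labels:
    resource.k8s.io/admin-access: "true"   # only claims in labeled namespaces may set adminAccess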
Preparing for general availability
A new v1beta2 API has been added to simplify the user experience and to prepare for additional features being added in the future. The RBAC rules for DRA have been improved and support has been added for seamless upgrades of DRA drivers.
What’s next?
The plan for v1.34 is even more ambitious than for v1.33. Most importantly, we (the Kubernetes device management working group) hope to bring DRA to general availability, which will make it available by default on all v1.34 Kubernetes clusters. This also means that many, perhaps all, of the DRA features that are still beta in v1.34 will become enabled by default, making it much easier to use them.
The alpha features that were added in v1.33 will be brought to beta in v1.34.
Getting involved
A good starting point is joining the WG Device Management Slack channel and meetings, which happen at US/EU and EU/APAC friendly time slots.
Not all enhancement ideas are tracked as issues yet, so come talk to us if you want to help or have some ideas yourself! We have work to do at all levels, from difficult core changes to usability enhancements in kubectl, which could be picked up by newcomers.
Acknowledgments
A huge thanks to everyone who has contributed:
Cici Huang (cici37)
Ed Bartosh (bart0sh)
John Belamaric (johnbelamaric)
Jon Huhn (nojnhuh)
Kevin Klues (klueska)
Morten Torkildsen (mortent)
Patrick Ohly (pohly)
Rita Zhang (ritazh)
Shingo Omura (everpeace)
via Kubernetes Blog https://kubernetes.io/
May 01, 2025 at 02:30PM