
Kubernetes v1.34: Pods Report DRA Resource Health
https://kubernetes.io/blog/2025/09/17/kubernetes-v1-34-pods-report-dra-resource-health/
The rise of AI/ML and other high-performance workloads has made specialized hardware like GPUs, TPUs, and FPGAs a critical component of many Kubernetes clusters. However, as discussed in a previous blog post about navigating failures in Pods with devices, when this hardware fails, it can be difficult to diagnose, leading to significant downtime. With the release of Kubernetes v1.34, we are excited to announce a new alpha feature that brings much-needed visibility into the health of these devices.
This work extends the functionality of KEP-4680, which first introduced a mechanism for reporting the health of devices managed by Device Plugins. Now, this capability is being extended to Dynamic Resource Allocation (DRA). Controlled by the ResourceHealthStatus feature gate, this enhancement allows DRA drivers to report device health directly into a Pod's .status field, providing crucial insights for operators and developers.
Why expose device health in Pod status?
For stateful applications or long-running jobs, a device failure can be disruptive and costly. By exposing device health in the .status field for a Pod, Kubernetes provides a standardized way for users and automation tools to quickly diagnose issues. If a Pod is failing, you can now check its status to see if an unhealthy device is the root cause, saving valuable time that might otherwise be spent debugging application code.
How it works
This feature introduces a new, optional communication channel between the Kubelet and DRA drivers, built on three core components.
A new gRPC health service
A new gRPC service, DRAResourceHealth, is defined in the dra-health/v1alpha1 API group. DRA drivers can implement this service to stream device health updates to the Kubelet. The service includes a NodeWatchResources server-streaming RPC that sends the health status (Healthy, Unhealthy, or Unknown) for the devices it manages.
Kubelet integration
The Kubelet’s DRAPluginManager discovers which drivers implement the health service. For each compatible driver, it starts a long-lived NodeWatchResources stream to receive health updates. The DRA Manager then consumes these updates and stores them in a persistent healthInfoCache that can survive Kubelet restarts.
Populating the Pod status
When a device's health changes, the DRA manager identifies all Pods affected by the change and triggers a Pod status update. A new field, allocatedResourcesStatus, is now part of the v1.ContainerStatus API object. The Kubelet populates this field with the current health of each device allocated to the container.
A practical example
If a Pod is in a CrashLoopBackOff state, you can use kubectl describe pod <pod-name> to inspect its status. If an allocated device has failed, the output will now include the allocatedResourcesStatus field, clearly indicating the problem:
status:
  containerStatuses:
  - name: my-gpu-intensive-container
    # ... other container statuses
    allocatedResourcesStatus:
    - name: "claim:my-gpu-claim"
      resources:
      - resourceID: "example.com/gpu-a1b2-c3d4"
        health: "Unhealthy"
This explicit status makes it clear that the issue is with the underlying hardware, not the application.
You can now improve your failure detection logic to react to unhealthy devices associated with a Pod, for example by de-scheduling the Pod.
How to use this feature
As this is an alpha feature in Kubernetes v1.34, you must take the following steps to use it:
Enable the ResourceHealthStatus feature gate on your kube-apiserver and kubelets (see the configuration sketch below).
Ensure you are using a DRA driver that implements the v1alpha1 DRAResourceHealth gRPC service.
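For step 1, here is a minimal sketch of the kubelet side, assuming you manage the kubelet through a KubeletConfiguration file (the file location and how it is delivered depend on how your cluster is deployed):

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
featureGates:
  ResourceHealthStatus: true

On the kube-apiserver, the equivalent is typically the --feature-gates=ResourceHealthStatus=true command-line flag.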
DRA drivers
If you are developing a DRA driver, make sure to think about your device failure detection strategy and integrate your driver with this feature. Doing so will improve the user experience and simplify debugging of hardware issues.
What's next?
This is the first step in a broader effort to improve how Kubernetes handles device failures. As we gather feedback on this alpha feature, the community is planning several key enhancements before graduating to Beta:
Detailed health messages: To improve the troubleshooting experience, we plan to add a human-readable message field to the gRPC API. This will allow DRA drivers to provide specific context for a health status, such as "GPU temperature exceeds threshold" or "NVLink connection lost".
Configurable health timeouts: The timeout for marking a device's health as "Unknown" is currently hardcoded. We plan to make this configurable, likely on a per-driver basis, to better accommodate the different health-reporting characteristics of various hardware.
Improved post-mortem troubleshooting: We will address a known limitation where health updates may not be applied to pods that have already terminated. This fix will ensure that the health status of a device at the time of failure is preserved, which is crucial for troubleshooting batch jobs and other "run-to-completion" workloads.
This feature was developed as part of KEP-4680, and community feedback is crucial as we work toward graduating it to Beta. More improvements to device failure handling in Kubernetes are planned, and we encourage you to try this feature out and share your experiences with the SIG Node community!
via Kubernetes Blog https://kubernetes.io/
September 17, 2025 at 02:30PM
Kubernetes v1.34: Moving Volume Group Snapshots to v1beta2
https://kubernetes.io/blog/2025/09/16/kubernetes-v1-34-volume-group-snapshot-beta-2/
Volume group snapshots were introduced as an Alpha feature with the Kubernetes 1.27 release and moved to Beta in the Kubernetes 1.32 release. The recent release of Kubernetes v1.34 moved that support to a second beta. The support for volume group snapshots relies on a set of extension APIs for group snapshots. These APIs allow users to take crash-consistent snapshots of a set of volumes. Behind the scenes, Kubernetes uses a label selector to group multiple PersistentVolumeClaims for snapshotting. A key aim is to allow you to restore that set of snapshots to new volumes and recover your workload based on a crash-consistent recovery point.
This new feature is only supported for CSI volume drivers.
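To illustrate the label-selector grouping described above, a VolumeGroupSnapshot that snapshots every PVC carrying a given label might look roughly like the sketch below (names and the class are hypothetical; the spec shape follows the groupsnapshot.storage.k8s.io extension API):

apiVersion: groupsnapshot.storage.k8s.io/v1beta2
kind: VolumeGroupSnapshot
metadata:
  name: my-group-snapshot
  namespace: my-app
spec:
  volumeGroupSnapshotClassName: my-group-snapshot-class
  source:
    selector:
      matchLabels:
        app: my-stateful-app   # every PVC with this label is snapshotted together

Restoring then works by creating new PVCs whose dataSource points at the individual VolumeSnapshots that belong to the group.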
What's new in Beta 2?
While testing the beta version, we encountered an issue where the restoreSize field is not set for individual VolumeSnapshotContents and VolumeSnapshots if the CSI driver does not implement the ListSnapshots RPC call. We evaluated various options and decided to make this change by releasing a new beta of the API.
Specifically, a VolumeSnapshotInfo struct was added in v1beta2; it contains information for an individual volume snapshot that is a member of a volume group snapshot. VolumeSnapshotInfoList, a list of VolumeSnapshotInfo, was added to VolumeGroupSnapshotContentStatus, replacing VolumeSnapshotHandlePairList. The list identifies the snapshots on the storage system and is populated by the csi-snapshotter sidecar based on the CSI CreateVolumeGroupSnapshotResponse returned by the CSI driver's CreateVolumeGroupSnapshot call.
The existing v1beta1 API objects will be converted to the new v1beta2 API objects by a conversion webhook.
What’s next?
Depending on feedback and adoption, the Kubernetes project plans to push the volume group snapshot implementation to general availability (GA) in a future release.
How can I learn more?
The design spec for the volume group snapshot feature.
The code repository for volume group snapshot APIs and controller.
CSI documentation on the group snapshot feature.
How do I get involved?
This project, like all of Kubernetes, is the result of hard work by many contributors from diverse backgrounds working together. On behalf of SIG Storage, I would like to offer a huge thank you to the contributors who stepped up these last few quarters to help the project reach beta:
Ben Swartzlander (bswartz)
Hemant Kumar (gnufied)
Jan Šafránek (jsafrane)
Madhu Rajanna (Madhu-1)
Michelle Au (msau42)
Niels de Vos (nixpanic)
Leonardo Cecchi (leonardoce)
Saad Ali (saad-ali)
Xing Yang (xing-yang)
Yati Padia (yati1998)
For those interested in getting involved with the design and development of CSI or any part of the Kubernetes Storage system, join the Kubernetes Storage Special Interest Group (SIG). We always welcome new contributors.
We also hold regular Data Protection Working Group meetings. New attendees are welcome to join our discussions.
via Kubernetes Blog https://kubernetes.io/
September 16, 2025 at 02:30PM
VerticalPodAutoscaler Went Rogue: It Took Down Our Cluster, with Thibault Jamet
Running 30 Kubernetes clusters serving 300,000 requests per second sounds impressive until your Vertical Pod Autoscaler goes rogue and starts evicting critical system pods in an endless loop.
Thibault Jamet shares the technical details of debugging a complex VPA failure at Adevinta, where webhook timeouts triggered continuous pod evictions across their multi-tenant Kubernetes platform.
You will learn:
VPA architecture deep dive - How the recommender, updater, and mutating webhook components interact and what happens when the webhook fails
Hidden Kubernetes limits - How default QPS and burst rate limits in the Kubernetes Go client can cause widespread failures, and why these aren't well documented in Helm charts
Monitoring strategies for autoscaling - What metrics to track for webhook latency and pod eviction rates to catch similar issues before they become critical
Sponsor
This episode is brought to you by Testkube—where teams run millions of performance tests in real Kubernetes infrastructure. From air-gapped environments to massive scale deployments, orchestrate every testing tool in one platform. Check it out at testkube.io
More info
Find all the links and info for this episode here: https://ku.bz/rf1pbWXdN
Interested in sponsoring an episode? Learn more.
via KubeFM https://kube.fm
September 16, 2025 at 06:00AM
Kubernetes v1.34: Decoupled Taint Manager Is Now Stable
https://kubernetes.io/blog/2025/09/15/kubernetes-v1-34-decoupled-taint-manager-is-now-stable/
This enhancement separates the responsibility of managing node lifecycle and pod eviction into two distinct components. Previously, the node lifecycle controller handled both marking nodes as unhealthy with NoExecute taints and evicting pods from them. Now, a dedicated taint eviction controller manages the eviction process, while the node lifecycle controller focuses solely on applying taints. This separation not only improves code organization but also makes it easier to improve the taint eviction controller or to build custom implementations of taint-based eviction.
What's new?
The feature gate SeparateTaintEvictionController has been promoted to GA in this release. Users can optionally disable taint-based eviction by setting --controllers=-taint-eviction-controller in kube-controller-manager.
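As an illustration, on clusters where kube-controller-manager runs as a static Pod, that flag would go in the manifest roughly as sketched below (excerpt only; the leading '*' keeps the other default controllers enabled):

# Sketch of a kube-controller-manager static Pod manifest (only the relevant parts shown).
apiVersion: v1
kind: Pod
metadata:
  name: kube-controller-manager
  namespace: kube-system
spec:
  containers:
  - name: kube-controller-manager
    image: registry.k8s.io/kube-controller-manager:v1.34.0
    command:
    - kube-controller-manager
    - --controllers=*,-taint-eviction-controller   # disable only the taint eviction controller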
How can I learn more?
For more details, refer to the KEP and to the beta announcement article: Kubernetes 1.29: Decoupling taint manager from node lifecycle controller.
How to get involved?
We offer a huge thank you to all the contributors who helped with design, implementation, and review of this feature and helped move it from beta to stable:
Ed Bartosh (@bart0sh)
Yuan Chen (@yuanchen8911)
Aldo Culquicondor (@alculquicondor)
Baofa Fan (@carlory)
Sergey Kanzhelev (@SergeyKanzhelev)
Tim Bannister (@lmktfy)
Maciej Skoczeń (@macsko)
Maciej Szulik (@soltysh)
Wojciech Tyczynski (@wojtek-t)
via Kubernetes Blog https://kubernetes.io/
September 15, 2025 at 02:30PM
Terminal Agents: Codex vs. Crush vs. OpenCode vs. Cursor CLI vs. Claude Code
I love Claude Code, but I hate being locked into Anthropic models. What if I want to use GPT5 or whatever comes out next week? So I went on a quest to find a terminal-based coding agent that works with different models and doesn't suck compared to Claude Code.
I tested every terminal agent I could find: Codex CLI from OpenAI, Charm Crush, OpenCode, and Cursor CLI. My requirements were simple - intuitive interface, MCP servers support, saved prompts, and actual functionality for coding and operations. The results were... disappointing. From agents that couldn't even fetch their own documentation to beautiful UIs that prioritized looks over functionality, each had critical flaws that made them unusable for real work. Even GPT5, hyped as the best coding model ever, couldn't shine through these broken wrappers. By the end, you'll understand why having a great model isn't enough - you need the complete package, and right now, that's still painfully rare in the terminal agent space.
▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬ 👉 Grab your free seat to the 2-Day AI Mastermind: https://link.outskill.com/AIDOS2 🔐 100% Discount for the first 1000 people 💥 Dive deep into AI and Learn Automations, Build AI Agents, Make videos & images – all for free! 🎁 Bonuses worth $5100+ if you join and attend ▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬
#TerminalAgents #CodingAI #GPT5
Consider joining the channel: https://www.youtube.com/c/devopstoolkit/join
▬▬▬▬▬▬ 🔗 Additional Info 🔗 ▬▬▬▬▬▬ ➡ Transcript and commands: https://devopstoolkit.live/ai/terminal-agents-codex-vs.-crush-vs.-opencode-vs.-cursor-cli-vs.-claude-code
▬▬▬▬▬▬ 💰 Sponsorships 💰 ▬▬▬▬▬▬ If you are interested in sponsoring this channel, please visit https://devopstoolkit.live/sponsor for more information. Alternatively, feel free to contact me over Twitter or LinkedIn (see below).
▬▬▬▬▬▬ 👋 Contact me 👋 ▬▬▬▬▬▬ ➡ BlueSky: https://vfarcic.bsky.social ➡ LinkedIn: https://www.linkedin.com/in/viktorfarcic/
▬▬▬▬▬▬ 🚀 Other Channels 🚀 ▬▬▬▬▬▬ 🎤 Podcast: https://www.devopsparadox.com/ 💬 Live streams: https://www.youtube.com/c/DevOpsParadox
▬▬▬▬▬▬ ⏱ Timecodes ⏱ ▬▬▬▬▬▬ 00:00 Terminal-Based Coding Agents 01:09 Outskill (sponsor) 02:31 Why Terminal AI Agents Matter 06:54 Codex CLI - OpenAI's Terminal Agent 12:03 Charm Crush - Beautiful Terminal UI Agent 17:18 OpenCode - SST's Terminal Agent 20:13 Cursor CLI - From Cursor IDE Makers 24:10 Terminal AI Agents - Final Verdict
via YouTube https://www.youtube.com/watch?v=MXOP4WELkCc
The making of Flux: The origin
This episode unpacks the technical and governance milestones that secured Flux's place in the cloud-native ecosystem, from a 45-minute production outage that led to the birth of GitOps to the CNCF process that defines project maturity and the handover of stewardship after Weaveworks' closure.
You will learn:
How a single incident pushed Weaveworks to adopt Git as the source of truth, creating the foundation of GitOps.
How Flux sustained continuity after Weaveworks shut down through community governance.
Where Flux is heading next with security guidance, Flux v2, and an enterprise-ready roadmap.
Sponsor
Join the Flux maintainers and community at FluxCon, November 11th in Salt Lake City—register here
More info
Find all the links and info for this episode here: https://ku.bz/5Sf5wpd8y
Interested in sponsoring an episode? Learn more.
via KubeFM https://kube.fm
September 15, 2025 at 06:00AM
Kubernetes v1.34: Autoconfiguration for Node Cgroup Driver Goes GA
https://kubernetes.io/blog/2025/09/12/kubernetes-v1-34-cri-cgroup-driver-lookup-now-ga/
Historically, configuring the correct cgroup driver has been a pain point for users running new Kubernetes clusters. On Linux systems, there are two different cgroup drivers: cgroupfs and systemd. In the past, both the kubelet and CRI implementation (like CRI-O or containerd) needed to be configured to use the same cgroup driver, or else the kubelet would misbehave without any explicit error message. This was a source of headaches for many cluster admins. Now, we've (almost) arrived at the end of that headache.
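For context, the manual step this removes looked roughly like the kubelet configuration below, which had to be kept in sync with the driver configured in containerd or CRI-O (a sketch; cgroupDriver is the real KubeletConfiguration field):

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
# Previously this had to match the runtime's cgroup driver; a mismatch made the
# kubelet misbehave with no clear error. With automatic detection, the kubelet
# simply asks the CRI runtime which driver to use.
cgroupDriver: systemd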
Automated cgroup driver detection
In v1.28.0, the SIG Node community introduced the feature gate KubeletCgroupDriverFromCRI, which instructs the kubelet to ask the CRI implementation which cgroup driver to use. You can read more here. After many releases of waiting for each CRI implementation to have major versions released and packaged in major operating systems, this feature has gone GA as of Kubernetes 1.34.0.
In addition to setting the feature gate, a cluster admin needs to ensure their CRI implementation is new enough:
containerd: Support was added in v2.0.0
CRI-O: Support was added in v1.28.0
Announcement: Kubernetes is deprecating containerd v1.y support
While CRI-O releases versions that match Kubernetes versions, and thus CRI-O versions without this behavior are no longer supported, containerd maintains its own release cycle. containerd support for this feature is only in v2.0 and later, but Kubernetes 1.34 still supports containerd 1.7 and other LTS releases of containerd.
The Kubernetes SIG Node community has formally agreed upon a final support timeline for containerd v1.y. The last Kubernetes release to offer this support will be the last released version of v1.35, and support will be dropped in v1.36.0. To assist administrators in managing this future transition, a new detection mechanism is available. You are able to monitor the kubelet_cri_losing_support metric to determine if any nodes in your cluster are using a containerd version that will soon be outdated. The presence of this metric with a version label of 1.36.0 will indicate that the node's containerd runtime is not new enough for the upcoming requirements. Consequently, an administrator will need to upgrade containerd to v2.0 or a later version before, or at the same time as, upgrading the kubelet to v1.36.0.
via Kubernetes Blog https://kubernetes.io/
September 12, 2025 at 02:30PM
Kubernetes v1.34: Mutable CSI Node Allocatable Graduates to Beta
https://kubernetes.io/blog/2025/09/11/kubernetes-v1-34-mutable-csi-node-allocatable-count/
The functionality for CSI drivers to update information about attachable volume count on the nodes, first introduced as Alpha in Kubernetes v1.33, has graduated to Beta in the Kubernetes v1.34 release! This marks a significant milestone in enhancing the accuracy of stateful pod scheduling by reducing failures due to outdated attachable volume capacity information.
Background
Traditionally, Kubernetes CSI drivers report a static maximum volume attachment limit when initializing. However, actual attachment capacities can change during a node's lifecycle for various reasons, such as:
Manual or external operations attaching/detaching volumes outside of Kubernetes control.
Dynamically attached network interfaces or specialized hardware (GPUs, NICs, etc.) consuming available slots.
Multi-driver scenarios, where one CSI driver’s operations affect available capacity reported by another.
Static reporting can cause Kubernetes to schedule pods onto nodes that appear to have capacity but don't, leading to pods stuck in a ContainerCreating state.
Dynamically adapting CSI volume limits
With this new feature, Kubernetes enables CSI drivers to dynamically adjust and report node attachment capacities at runtime. This ensures that the scheduler, as well as other components relying on this information, have the most accurate, up-to-date view of node capacity.
How it works
Kubernetes supports two mechanisms for updating the reported node volume limits:
Periodic Updates: CSI drivers specify an interval to periodically refresh the node's allocatable capacity.
Reactive Updates: An immediate update triggered when a volume attachment fails due to exhausted resources (ResourceExhausted error).
Enabling the feature
To use this beta feature, the MutableCSINodeAllocatableCount feature gate must be enabled in these components:
kube-apiserver
kubelet
Example CSI driver configuration
Below is an example of configuring a CSI driver to enable periodic updates every 60 seconds:
apiVersion: storage.k8s.io/v1
kind: CSIDriver
metadata:
  name: example.csi.k8s.io
spec:
  nodeAllocatableUpdatePeriodSeconds: 60
This configuration directs the kubelet to call the CSI driver's NodeGetInfo method every 60 seconds and update the node's allocatable volume count. Kubernetes enforces a minimum update interval of 10 seconds to balance accuracy and resource usage.
Immediate updates on attachment failures
When a volume attachment operation fails due to a ResourceExhausted error (gRPC code 8), Kubernetes immediately updates the allocatable count instead of waiting for the next periodic update. The Kubelet then marks the affected pods as Failed, enabling their controllers to recreate them. This prevents pods from getting permanently stuck in the ContainerCreating state.
Getting started
To enable this feature in your Kubernetes v1.34 cluster:
Enable the feature gate MutableCSINodeAllocatableCount on the kube-apiserver and kubelet components.
Update your CSI driver configuration by setting nodeAllocatableUpdatePeriodSeconds.
Monitor and observe improvements in scheduling accuracy and pod placement reliability.
Next steps
This feature is currently in beta and the Kubernetes community welcomes your feedback. Test it, share your experiences, and help guide its evolution to GA stability.
Join discussions in the Kubernetes Storage Special Interest Group (SIG-Storage) to shape the future of Kubernetes storage capabilities.
via Kubernetes Blog https://kubernetes.io/
September 11, 2025 at 02:30PM
Week Ending September 7, 2025
https://lwkd.info/2025/20250910
Developer News
The Kubernetes v1.35 Release Team shadow application is open till Sept 14, 2025, with results by Sept 22 and the release cycle running Sept 15–Dec 17. Learn more in the Release Team Overview, Shadows Guide, Role Handbooks, and Selection Criteria. Updates will be shared in the #sig-release Slack channel and the kubernetes/sig-release repo.
A medium-severity flaw (CVE-2025-7445) in secrets-store-sync-controller < v0.0.2 may expose service account tokens in logs, risking cloud vault access. Upgrade to v0.0.2+ and check logs for leaked or misused tokens. See the Kubernetes CVE details here.
Steering Committee Election
The nomination period for the Kubernetes Steering Committee Election has ended.
Now it’s time for your vote! The Steering Committee Election (https://github.com/kubernetes/community/tree/master/elections/steering/2025#voting-process) begins on Friday, 12th September. You can check your eligibility to vote in the voting app, and file an exception request if you need to.
Release Schedule
Next Deadline: 1.35 Release Cycle Starts, September 15
The Kubernetes v1.35 Release Team shadow application opened on Sept 4 and will close on Sept 14, 2025 (midnight anywhere). Selected applicants will be notified by Sept 22, and the release cycle is expected to run from Sept 15 to Dec 17, 2025. This is a great opportunity to get involved with the release process!
The cherry pick deadlines closed on Sept 5 for Kubernetes 1.33.5, 1.32.9, and 1.31.13, all targeting release on Sept 9, 2025.
Featured PRs
133097: Resolve confusing use of TooManyRequests error for eviction
This PR resolves an issue where pod eviction requests could return a TooManyRequests (429) error with an unrelated disruption budget message. The API server now reports a clearer error when eviction is blocked by the fail-safe mechanism in the DisruptionController, avoiding misleading responses.
133890: Fix missing kubelet_volume_stats* metrics
This PR fixes a regression in v1.34 where kubelet_volume_stats* metrics disappeared from the kubelet metrics endpoint. The bug was caused by multiple calls to Register(); the fix ensures the metrics are registered correctly and reported again.
KEP of the Week
KEP 740: Support external signing of service account tokens
This KEP enables Kubernetes to integrate with external key management solutions such as HSMs and cloud KMS for signing service account tokens. It supports out-of-process JWT signing and dynamic public key discovery, improving security and allowing key rotation without restarting kube-apiserver. Existing file-based key management remains supported as a fallback.
This KEP is tracked for beta in v1.34.
Other Merges
DRA kubelet: Avoid deadlock when gRPC connection to driver goes idle
Add k8s-long-name and k8s-short-name format validation tags
Prevent missing kubelet_volume_stats metrics
Show real error reason in pod STATUS when a pod has both Running and Error containers
Migrate plugin-manager logs to contextual logging — improves developer diagnostics, no user-facing impact
Add Close() API to remote runtime/image — enables graceful gRPC cleanup, prevents resource leaks
Add the correct error when eviction is blocked due to the failSafe mechanism of the DisruptionController
Configure JSON content type for generic webhook RESTClient
Disable estimating resource size for resources with watch cache disabled
Enforce that all resources set resourcePrefix
Prevent error logs by skipping stats collection for resources missing resourcePrefix
Add paths section to kubelet statusz endpoint
Lock down the AllowOverwriteTerminationGracePeriodSeconds feature gate.
Add +k8s:ifEnabled / +k8s:ifDisabled / +k8s:enumExclude tags for validation
Add stress test for pod cleanup on VolumeAttachmentLimitExceeded
Deprecated
Removed deprecated gogo protocol definitions from k8s.io/kubelet/pkg/apis/dra in favor of google.golang.org/protobuf.
Drop SizeMemoryBackedVolumes after the feature GA-ed in 1.32
Remove GA feature gate ComponentSLIs (now always on)
Version Updates
Update CNI plugins to v1.8.0
Bump gengo to v2.0.0-20250903151518-081d64401ab4
Subprojects and Dependency Updates
cloud-provider-aws v1.34.0 resolves nil pointer dereferences, updates topology labels and EC2 SDK, adds a TG reconciler for NLB hairpinning, and refreshes Go deps
coredns v1.12.4 fixes DoH context propagation, file plugin label offsets, gRPC/transfer leaks, and adds loadbalance prefer and metrics timeouts
cri-o v1.34.0 moves to Kubernetes v1.34 dev, switches to opencontainers/cgroups with runc 1.3, improves container monitoring, and fixes deadlocks and terminal resize issues.
minikube v1.37.0 adds krunkit driver for macOS GPU AI workloads, introduces kubetail addon, supports Kubernetes v1.34.0, deprecates HyperKit, and updates key addons and CNIs
via Last Week in Kubernetes Development https://lwkd.info/
September 10, 2025 at 06:00PM
Kubernetes v1.34: Use An Init Container To Define App Environment Variables
https://kubernetes.io/blog/2025/09/10/kubernetes-v1-34-env-files/
Kubernetes typically uses ConfigMaps and Secrets to set environment variables, which introduces additional API calls and complexity. For example, you need to separately manage the Pods of your workloads and their configurations, while ensuring orderly updates for both the configurations and the workload Pods.
Alternatively, you might be using a vendor-supplied container that requires environment variables (such as a license key or a one-time token), but you don’t want to hard-code them or mount volumes just to get the job done.
If that's the situation you are in, you now have a new (alpha) way to achieve it. Provided you have the EnvFiles feature gate enabled across your cluster, you can tell the kubelet to load a container's environment variables from a volume (the volume must be part of the Pod that the container belongs to). This lets you load environment variables directly from a file in an emptyDir volume without actually mounting that file into the container. It's a simple yet elegant solution to some surprisingly common problems.
What’s this all about?
At its core, this feature allows you to point your container to a file, one generated by an initContainer, and have Kubernetes parse that file to set your environment variables. The file lives in an emptyDir volume (a temporary storage space that lasts as long as the pod does). Your main container doesn't need to mount the volume; the kubelet reads the file and injects the variables when the container starts.
How It Works
Here's a simple example:
apiVersion: v1
kind: Pod
spec:
  initContainers:
  - name: generate-config
    image: busybox
    command: ['sh', '-c', 'echo "CONFIG_VAR=HELLO" > /config/config.env']
    volumeMounts:
    - name: config-volume
      mountPath: /config
  containers:
  - name: app-container
    image: gcr.io/distroless/static
    env:
    - name: CONFIG_VAR
      valueFrom:
        fileKeyRef:
          path: config.env
          volumeName: config-volume
          key: CONFIG_VAR
  volumes:
  - name: config-volume
    emptyDir: {}
Using this approach is a breeze. You define your environment variables in the pod spec using the fileKeyRef field, which tells Kubernetes where to find the file and which key to pull. The file itself follows the standard .env syntax (think KEY=VALUE), and (for this alpha stage at least) you must ensure that it is written into an emptyDir volume; other volume types aren't supported for this feature. At least one init container must mount that emptyDir volume (to write the file), but the main container doesn't need to mount it; it just gets the variables handed to it at startup.
A word on security
While this feature supports handling sensitive data such as keys or tokens, note that its implementation relies on emptyDir volumes mounted into the pod. Operators with node filesystem access could therefore easily retrieve this sensitive data through pod directory paths.
If you store sensitive data like keys or tokens using this feature, ensure your cluster security policies effectively protect nodes against unauthorized access, to prevent exposure of confidential information.
Summary
This feature eliminates a number of complex workarounds used today, simplifying application authoring and opening the door to more use cases. Kubernetes stays flexible and open to feedback: tell us how you use this feature or what is missing.
via Kubernetes Blog https://kubernetes.io/
September 10, 2025 at 02:30PM
Ep34 - Ask Me Anything About Anything with Scott Rosenberg
There are no restrictions in this AMA session. You can ask anything about DevOps, AI, Cloud, Kubernetes, Platform Engineering, containers, or anything else. Scott Rosenberg, a regular guest, will be here to help us out.
▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬ Sponsor: Codefresh 🔗 GitOps Argo CD Certifications: https://learning.codefresh.io (use "viktor" for a 50% discount) ▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬
▬▬▬▬▬▬ 👋 Contact me 👋 ▬▬▬▬▬▬ ➡ BlueSky: https://vfarcic.bsky.social ➡ LinkedIn: https://www.linkedin.com/in/viktorfarcic/
▬▬▬▬▬▬ 🚀 Other Channels 🚀 ▬▬▬▬▬▬ 🎤 Podcast: https://www.devopsparadox.com/ 💬 Live streams: https://www.youtube.com/c/DevOpsParadox
via YouTube https://www.youtube.com/watch?v=IoBrA6gUESk
Kubernetes v1.34: Snapshottable API server cache
https://kubernetes.io/blog/2025/09/09/kubernetes-v1-34-snapshottable-api-server-cache/
For years, the Kubernetes community has been on a mission to improve the stability and performance predictability of the API server. A major focus of this effort has been taming list requests, which have historically been a primary source of high memory usage and heavy load on the etcd datastore. With each release, we've chipped away at the problem, and today, we're thrilled to announce the final major piece of this puzzle.
The snapshottable API server cache feature has graduated to Beta in Kubernetes v1.34, culminating a multi-release effort to allow virtually all read requests to be served directly from the API server's cache.
Evolving the cache for performance and stability
The path to the current state involved several key enhancements over recent releases that paved the way for today's announcement.
Consistent reads from cache (Beta in v1.31)
While the API server has long used a cache for performance, a key milestone was guaranteeing consistent reads of the latest data from it. This v1.31 enhancement allowed the watch cache to be used for strongly-consistent read requests for the first time, a huge win as it enabled filtered collections (e.g. "a list of pods bound to this node") to be safely served from the cache instead of etcd, dramatically reducing its load for common workloads.
Taming large responses with streaming (Beta in v1.33)
Another key improvement was tackling the problem of memory spikes when transmitting large responses. The streaming encoder, introduced in v1.33, allowed the API server to send list items one by one, rather than buffering the entire multi-gigabyte response in memory. This made the memory cost of sending a response predictable and minimal, regardless of its size.
The missing piece
Despite these huge improvements, a critical gap remained. Any request for a historical LIST—most commonly used for paginating through large result sets—still had to bypass the cache and query etcd directly. This meant that the cost of retrieving the data was still unpredictable and could put significant memory pressure on the API server.
Kubernetes 1.34: snapshots complete the picture
The snapshottable API server cache solves this final piece of the puzzle. This feature enhances the watch cache, enabling it to generate efficient, point-in-time snapshots of its state.
Here’s how it works: for each update, the cache creates a lightweight snapshot. These snapshots are "lazy copies," meaning they don't duplicate objects but simply store pointers, making them incredibly memory-efficient.
When a list request for a historical resourceVersion arrives, the API server now finds the corresponding snapshot and serves the response directly from its memory. This closes the final major gap, allowing paginated requests to be served entirely from the cache.
A new era of API Server performance 🚀
With this final piece in place, the synergy of these three features ushers in a new era of API server predictability and performance:
Get Data from Cache: Consistent reads and snapshottable cache work together to ensure nearly all read requests—whether for the latest data or a historical snapshot—are served from the API server's memory.
Send data via stream: Streaming list responses ensure that sending this data to the client has a minimal and constant memory footprint.
The result is a system where the resource cost of read operations is almost fully predictable and much more resilient to spikes in request load. This means dramatically reduced memory pressure, a lighter load on etcd, and a more stable, scalable, and reliable control plane for all Kubernetes clusters.
How to get started
With its graduation to Beta, the SnapshottableCache feature gate is enabled by default in Kubernetes v1.34. There are no actions required to start benefiting from these performance and stability improvements.
Acknowledgements
Special thanks for designing, implementing, and reviewing these critical features go to:
Ahmad Zolfaghari (@ah8ad3)
Ben Luddy (@benluddy) – Red Hat
Chen Chen (@z1cheng) – Microsoft
Davanum Srinivas (@dims) – Nvidia
David Eads (@deads2k) – Red Hat
Han Kang (@logicalhan) – CoreWeave
haosdent (@haosdent) – Shopee
Joe Betz (@jpbetz) – Google
Jordan Liggitt (@liggitt) – Google
Łukasz Szaszkiewicz (@p0lyn0mial) – Red Hat
Maciej Borsz (@mborsz) – Google
Madhav Jivrajani (@MadhavJivrajani) – UIUC
Marek Siarkowicz (@serathius) – Google
NKeert (@NKeert)
Tim Bannister (@lmktfy)
Wei Fu (@fuweid) - Microsoft
Wojtek Tyczyński (@wojtek-t) – Google
...and many others in SIG API Machinery. This milestone is a testament to the community's dedication to building a more scalable and robust Kubernetes.
via Kubernetes Blog https://kubernetes.io/
September 09, 2025 at 02:30PM
Predictive vs Reactive: A Journey to Smarter Kubernetes Scaling, with Jorrick Stempher
Jorrick Stempher shares how his team of eight students built a complete predictive scaling system for Kubernetes clusters using machine learning.
Rather than waiting for nodes to become overloaded, their system uses the Prophet forecasting model to proactively anticipate load patterns and scale infrastructure, giving them the 8-9 minutes needed to provision new nodes on Vultr.
You will learn:
How to implement predictive scaling using Prophet ML model, Prometheus metrics, and custom APIs to forecast Kubernetes workload patterns
The Node Ranking Index (NRI) - a unified metric that combines CPU, RAM, and request data into a single comparable number for efficient scaling decisions
Real-world implementation challenges, including data validation, node startup timing constraints, load testing strategies, and the importance of proper research before building complex scaling solutions
Sponsor
This episode is brought to you by Testkube—the ultimate Continuous Testing Platform for Cloud Native applications. Scale fast, test continuously, and ship confidently. Check it out at testkube.io
More info
Find all the links and info for this episode here: https://ku.bz/clbDWqPYp
Interested in sponsoring an episode? Learn more.
via KubeFM https://kube.fm
September 09, 2025 at 06:00AM
Kubernetes v1.34: VolumeAttributesClass for Volume Modification GA
https://kubernetes.io/blog/2025/09/08/kubernetes-v1-34-volume-attributes-class/
The VolumeAttributesClass API, which empowers users to dynamically modify volume attributes, has officially graduated to General Availability (GA) in Kubernetes v1.34. This marks a significant milestone, providing a robust and stable way to tune your persistent storage directly within Kubernetes.
What is VolumeAttributesClass?
At its core, VolumeAttributesClass is a cluster-scoped resource that defines a set of mutable parameters for a volume. Think of it as a "profile" for your storage, allowing cluster administrators to expose different quality-of-service (QoS) levels or performance tiers.
Users can then specify a volumeAttributesClassName in their PersistentVolumeClaim (PVC) to indicate which class of attributes they desire. The magic happens through the Container Storage Interface (CSI): when a PVC referencing a VolumeAttributesClass is updated, the associated CSI driver interacts with the underlying storage system to apply the specified changes to the volume.
This means you can now:
Dynamically scale performance: Increase IOPS or throughput for a busy database, or reduce it for a less critical application.
Optimize costs: Adjust attributes on the fly to match your current needs, avoiding over-provisioning.
Simplify operations: Manage volume modifications directly within the Kubernetes API, rather than relying on external tools or manual processes.
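Putting that together, here is a rough sketch of a class and a PVC that references it (the class name and parameters are hypothetical and interpreted by the CSI driver, not by Kubernetes; driverName and parameters are the GA storage.k8s.io/v1 fields):

apiVersion: storage.k8s.io/v1
kind: VolumeAttributesClass
metadata:
  name: gold-tier
driverName: ebs.csi.aws.com
parameters:
  iops: "10000"
  throughput: "500"
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: my-database-data
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: my-storage-class       # hypothetical StorageClass
  volumeAttributesClassName: gold-tier     # switch to another class later to modify the volume in place
  resources:
    requests:
      storage: 100Gi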
What is new from Beta to GA
There are two major enhancements from beta.
Cancel support for infeasible errors
To improve resilience and user experience, the GA release introduces explicit cancel support when a requested volume modification becomes infeasible. If the underlying storage system or CSI driver indicates that the requested changes cannot be applied (e.g., due to invalid arguments), users can cancel the operation and revert the volume to its previous stable configuration, preventing the volume from being left in an inconsistent state.
Quota support based on scope
While VolumeAttributesClass doesn't add a new quota type, the Kubernetes control plane can be configured to enforce quotas on PersistentVolumeClaims that reference a specific VolumeAttributesClass.
This is achieved by using the scopeSelector field in a ResourceQuota to target PVCs that have .spec.volumeAttributesClassName set to a particular VolumeAttributesClass name. Please see more details here.
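Here is a hedged sketch of such a quota, assuming a VolumeAttributesClass named gold-tier and the VolumeAttributesClass scope supported by ResourceQuota's scopeSelector:

apiVersion: v1
kind: ResourceQuota
metadata:
  name: gold-tier-quota
  namespace: team-a
spec:
  hard:
    persistentvolumeclaims: "10"   # at most 10 PVCs in this namespace may reference the class
    requests.storage: 500Gi
  scopeSelector:
    matchExpressions:
    - scopeName: VolumeAttributesClass
      operator: In
      values: ["gold-tier"]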
Drivers support VolumeAttributesClass
Amazon EBS CSI Driver: The AWS EBS CSI driver has robust support for VolumeAttributesClass and allows you to modify parameters like volume type (e.g., gp2 to gp3, io1 to io2), IOPS, and throughput of EBS volumes dynamically.
Google Compute Engine (GCE) Persistent Disk CSI Driver (pd.csi.storage.gke.io): This driver also supports dynamic modification of persistent disk attributes, including IOPS and throughput, via VolumeAttributesClass.
Contact
For any inquiries or specific questions related to VolumeAttributesClass, please reach out to the SIG Storage community.
via Kubernetes Blog https://kubernetes.io/
September 08, 2025 at 02:30PM
Why Kubernetes Discovery Sucks for AI (And How Vector DBs Fix It)
Discover why the Kubernetes API is brilliant for execution but a complete nightmare for discovery, and learn how semantic search with vector databases can finally solve this problem. This video demonstrates the real-world challenge of finding the right Kubernetes resources when you have hundreds of cryptically named resource types in your cluster, and shows how AI struggles with the same discovery issues that plague human users.
We'll walk through a practical scenario where you need to create a PostgreSQL database with schema management in AWS, revealing how traditional keyword-based searching through 443+ Kubernetes resources becomes an exercise in frustration. Even when filtering by logical terms like "database," "postgresql," and "aws," the perfect solution remains hidden because it doesn't match your search keywords. The video then introduces a game-changing approach using vector databases and semantic search that enables both humans and AI to discover resources through natural language queries, regardless of exact keyword matches. By converting Kubernetes resource definitions into embeddings that capture semantic meaning, we transform an unsearchable cluster into an instantly discoverable one where you can simply describe what you want to accomplish rather than memorizing cryptic resource names.
▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬ Sponsor: UpCloud 🔗 https://signup.upcloud.com/?promo=devopstoolkit500 👉 Promo code: devopstoolkit500 ▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬
#KubernetesAPI #SemanticSearch #VectorDatabase
Consider joining the channel: https://www.youtube.com/c/devopstoolkit/join
▬▬▬▬▬▬ 🔗 Additional Info 🔗 ▬▬▬▬▬▬ ➡ Transcript and commands: https://devopstoolkit.live/kubernetes/why-kubernetes-discovery-sucks-for-ai-and-how-vector-dbs-fix-it 🔗 DevOps AI Toolkit: https://github.com/vfarcic/dot-ai
▬▬▬▬▬▬ 💰 Sponsorships 💰 ▬▬▬▬▬▬ If you are interested in sponsoring this channel, please visit https://devopstoolkit.live/sponsor for more information. Alternatively, feel free to contact me over Twitter or LinkedIn (see below).
▬▬▬▬▬▬ 👋 Contact me 👋 ▬▬▬▬▬▬ ➡ BlueSky: https://vfarcic.bsky.social ➡ LinkedIn: https://www.linkedin.com/in/viktorfarcic/
▬▬▬▬▬▬ 🚀 Other Channels 🚀 ▬▬▬▬▬▬ 🎤 Podcast: https://www.devopsparadox.com/ 💬 Live streams: https://www.youtube.com/c/DevOpsParadox
▬▬▬▬▬▬ ⏱ Timecodes ⏱ ▬▬▬▬▬▬ 00:00 Kubernetes API Discovery with AI 01:30 UpCloud (sponsor) 02:37 Kubernetes API Discovery Nightmare 11:33 Why AI Fails at Kubernetes Discovery 16:47 Vector Database Semantic Search Solution 23:15 Semantic Search Pros, Cons, and Key Takeaways
via YouTube https://www.youtube.com/watch?v=MSNstHj4rmk