r/devopsish

Kubernetes v1.34: Decoupled Taint Manager Is Now Stable
Kubernetes v1.34: Decoupled Taint Manager Is Now Stable

Kubernetes v1.34: Decoupled Taint Manager Is Now Stable

https://kubernetes.io/blog/2025/09/15/kubernetes-v1-34-decoupled-taint-manager-is-now-stable/

This enhancement separates the responsibilities of node lifecycle management and pod eviction into two distinct components. Previously, the node lifecycle controller both marked unhealthy nodes with NoExecute taints and evicted pods from them. Now, a dedicated taint eviction controller manages the eviction process, while the node lifecycle controller focuses solely on applying taints. This separation not only improves code organization but also makes it easier to improve the taint eviction controller or to build custom implementations of taint-based eviction.

What's new?

The feature gate SeparateTaintEvictionController has been promoted to GA in this release. Users can optionally disable taint-based eviction by setting --controllers=-taint-eviction-controller in kube-controller-manager.
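For illustration, here is a minimal sketch of how that flag might be passed to kube-controller-manager; the other flags are elided and will vary per cluster, so treat this as an example of the option named above rather than a drop-in configuration:

kube-controller-manager \
  --kubeconfig=/etc/kubernetes/controller-manager.conf \
  ...other flags... \
  --controllers=-taint-eviction-controller   # keep applying NoExecute taints, but skip the built-in taint-based eviction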

How can I learn more?

For more details, refer to the KEP and to the beta announcement article: Kubernetes 1.29: Decoupling taint manager from node lifecycle controller.

How to get involved?

We offer a huge thank you to all the contributors who helped with design, implementation, and review of this feature and helped move it from beta to stable:

Ed Bartosh (@bart0sh)

Yuan Chen (@yuanchen8911)

Aldo Culquicondor (@alculquicondor)

Baofa Fan (@carlory)

Sergey Kanzhelev (@SergeyKanzhelev)

Tim Bannister (@lmktfy)

Maciej Skoczeń (@macsko)

Maciej Szulik (@soltysh)

Wojciech Tyczynski (@wojtek-t)

via Kubernetes Blog https://kubernetes.io/

September 15, 2025 at 02:30PM

·kubernetes.io·
Kubernetes v1.34: Decoupled Taint Manager Is Now Stable
Open Source Under Attack • All Things Open 2025
Open Source Under Attack • All Things Open 2025
By Chris Short In March 2022, when the US Federal Reserve System ended its Zero Interest Rate Policy, or ZIRP era, interest rates began rising, marking the end of the “cheap money period.” Almost all major tech companies conducted layoffs in the following months, citing various economic pressures, including higher borrowing costs. This shift significantly... Read More
·2025.allthingsopen.org·
Open Source Under Attack • All Things Open 2025
AI & DevOps Toolkit - Terminal Agents: Codex vs. Crush vs. OpenCode vs. Cursor CLI vs. Claude Code - https://www.youtube.com/watch?v=MXOP4WELkCc
AI & DevOps Toolkit - Terminal Agents: Codex vs. Crush vs. OpenCode vs. Cursor CLI vs. Claude Code - https://www.youtube.com/watch?v=MXOP4WELkCc

Terminal Agents: Codex vs. Crush vs. OpenCode vs. Cursor CLI vs. Claude Code

I love Claude Code, but I hate being locked into Anthropic models. What if I want to use GPT5 or whatever comes out next week? So I went on a quest to find a terminal-based coding agent that works with different models and doesn't suck compared to Claude Code.

I tested every terminal agent I could find: Codex CLI from OpenAI, Charm Crush, OpenCode, and Cursor CLI. My requirements were simple - intuitive interface, MCP servers support, saved prompts, and actual functionality for coding and operations. The results were... disappointing. From agents that couldn't even fetch their own documentation to beautiful UIs that prioritized looks over functionality, each had critical flaws that made them unusable for real work. Even GPT5, hyped as the best coding model ever, couldn't shine through these broken wrappers. By the end, you'll understand why having a great model isn't enough - you need the complete package, and right now, that's still painfully rare in the terminal agent space.

▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬ 👉 Grab your free seat to the 2-Day AI Mastermind: https://link.outskill.com/AIDOS2 🔐 100% Discount for the first 1000 people 💥 Dive deep into AI and Learn Automations, Build AI Agents, Make videos & images – all for free! 🎁 Bonuses worth $5100+ if you join and attend ▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬

#TerminalAgents #CodingAI #GPT5

Consider joining the channel: https://www.youtube.com/c/devopstoolkit/join

▬▬▬▬▬▬ 🔗 Additional Info 🔗 ▬▬▬▬▬▬ ➡ Transcript and commands: https://devopstoolkit.live/ai/terminal-agents-codex-vs.-crush-vs.-opencode-vs.-cursor-cli-vs.-claude-code

▬▬▬▬▬▬ 💰 Sponsorships 💰 ▬▬▬▬▬▬ If you are interested in sponsoring this channel, please visit https://devopstoolkit.live/sponsor for more information. Alternatively, feel free to contact me over Twitter or LinkedIn (see below).

▬▬▬▬▬▬ 👋 Contact me 👋 ▬▬▬▬▬▬ ➡ BlueSky: https://vfarcic.bsky.social ➡ LinkedIn: https://www.linkedin.com/in/viktorfarcic/

▬▬▬▬▬▬ 🚀 Other Channels 🚀 ▬▬▬▬▬▬ 🎤 Podcast: https://www.devopsparadox.com/ 💬 Live streams: https://www.youtube.com/c/DevOpsParadox

▬▬▬▬▬▬ ⏱ Timecodes ⏱ ▬▬▬▬▬▬ 00:00 Terminal-Based Coding Agents 01:09 Outskill (sponsor) 02:31 Why Terminal AI Agents Matter 06:54 Codex CLI - OpenAI's Terminal Agent 12:03 Charm Crush - Beautiful Terminal UI Agent 17:18 OpenCode - SST's Terminal Agent 20:13 Cursor CLI - From Cursor IDE Makers 24:10 Terminal AI Agents - Final Verdict

via YouTube https://www.youtube.com/watch?v=MXOP4WELkCc

·youtube.com·
AI & DevOps Toolkit - Terminal Agents: Codex vs. Crush vs. OpenCode vs. Cursor CLI vs. Claude Code - https://www.youtube.com/watch?v=MXOP4WELkCc
The making of Flux: The origin
The making of Flux: The origin

The making of Flux: The origin

https://ku.bz/5Sf5wpd8y

This episode unpacks the technical and governance milestones that secured Flux's place in the cloud-native ecosystem, from a 45-minute production outage that led to the birth of GitOps to the CNCF process that defines project maturity and the handover of stewardship after Weaveworks' closure.

You will learn:

How a single incident pushed Weaveworks to adopt Git as the source of truth, creating the foundation of GitOps.

How Flux sustained continuity after Weaveworks shut down through community governance.

Where Flux is heading next with security guidance, Flux v2, and an enterprise-ready roadmap.

Sponsor

Join the Flux maintainers and community at FluxCon, November 11th in Salt Lake City—register here

More info

Find all the links and info for this episode here: https://ku.bz/5Sf5wpd8y

Interested in sponsoring an episode? Learn more.

via KubeFM https://kube.fm

September 15, 2025 at 06:00AM

·kube.fm·
The making of Flux: The origin
Kubernetes v1.34: Autoconfiguration for Node Cgroup Driver Goes GA
Kubernetes v1.34: Autoconfiguration for Node Cgroup Driver Goes GA

Kubernetes v1.34: Autoconfiguration for Node Cgroup Driver Goes GA

https://kubernetes.io/blog/2025/09/12/kubernetes-v1-34-cri-cgroup-driver-lookup-now-ga/

Historically, configuring the correct cgroup driver has been a pain point for users running new Kubernetes clusters. On Linux systems, there are two different cgroup drivers: cgroupfs and systemd. In the past, both the kubelet and CRI implementation (like CRI-O or containerd) needed to be configured to use the same cgroup driver, or else the kubelet would misbehave without any explicit error message. This was a source of headaches for many cluster admins. Now, we've (almost) arrived at the end of that headache.

Automated cgroup driver detection

In v1.28.0, the SIG Node community introduced the feature gate KubeletCgroupDriverFromCRI, which instructs the kubelet to ask the CRI implementation which cgroup driver to use. You can read more here. After many releases of waiting for each CRI implementation to have major versions released and packaged in major operating systems, this feature has gone GA as of Kubernetes 1.34.0.

In addition to setting the feature gate, a cluster admin needs to ensure their CRI implementation is new enough:

containerd: Support was added in v2.0.0

CRI-O: Support was added in v1.28.0

Announcement: Kubernetes is deprecating containerd v1.y support

While CRI-O releases versions that match Kubernetes versions, and thus CRI-O versions without this behavior are no longer supported, containerd maintains its own release cycle. containerd support for this feature is only in v2.0 and later, but Kubernetes 1.34 still supports containerd 1.7 and other LTS releases of containerd.

The Kubernetes SIG Node community has formally agreed upon a final support timeline for containerd v1.y. Kubernetes v1.35 will be the last release to offer this support, and support will be dropped in v1.36.0. To assist administrators in managing this transition, a new detection mechanism is available: you can monitor the kubelet_cri_losing_support metric to determine whether any nodes in your cluster are using a containerd version that will soon be outdated. The presence of this metric with a version label of 1.36.0 indicates that the node's containerd runtime is not new enough for the upcoming requirements. Consequently, an administrator will need to upgrade containerd to v2.0 or a later version before, or at the same time as, upgrading the kubelet to v1.36.0.
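As a rough sketch of that monitoring (assuming your account may proxy to nodes, and <node-name> is a placeholder), you could spot-check one node's kubelet metrics directly, or alert on the metric from any Prometheus setup that already scrapes the kubelet:

# One-off check against a single node's kubelet metrics endpoint:
kubectl get --raw "/api/v1/nodes/<node-name>/proxy/metrics" | grep kubelet_cri_losing_support
# Example Prometheus alert expression (sketch):
#   count(kubelet_cri_losing_support{version="1.36.0"}) > 0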

via Kubernetes Blog https://kubernetes.io/

September 12, 2025 at 02:30PM

·kubernetes.io·
Kubernetes v1.34: Autoconfiguration for Node Cgroup Driver Goes GA
Kubernetes v1.34: Mutable CSI Node Allocatable Graduates to Beta
Kubernetes v1.34: Mutable CSI Node Allocatable Graduates to Beta

Kubernetes v1.34: Mutable CSI Node Allocatable Graduates to Beta

https://kubernetes.io/blog/2025/09/11/kubernetes-v1-34-mutable-csi-node-allocatable-count/

The functionality for CSI drivers to update information about attachable volume count on the nodes, first introduced as Alpha in Kubernetes v1.33, has graduated to Beta in the Kubernetes v1.34 release! This marks a significant milestone in enhancing the accuracy of stateful pod scheduling by reducing failures due to outdated attachable volume capacity information.

Background

Traditionally, Kubernetes CSI drivers report a static maximum volume attachment limit when initializing. However, actual attachment capacities can change during a node's lifecycle for various reasons, such as:

Manual or external operations attaching/detaching volumes outside of Kubernetes control.

Dynamically attached network interfaces or specialized hardware (GPUs, NICs, etc.) consuming available slots.

Multi-driver scenarios, where one CSI driver’s operations affect available capacity reported by another.

Static reporting can cause Kubernetes to schedule pods onto nodes that appear to have capacity but don't, leading to pods stuck in a ContainerCreating state.

Dynamically adapting CSI volume limits

With this new feature, Kubernetes enables CSI drivers to dynamically adjust and report node attachment capacities at runtime. This ensures that the scheduler, as well as other components relying on this information, have the most accurate, up-to-date view of node capacity.

How it works

Kubernetes supports two mechanisms for updating the reported node volume limits:

Periodic Updates: CSI drivers specify an interval to periodically refresh the node's allocatable capacity.

Reactive Updates: An immediate update triggered when a volume attachment fails due to exhausted resources (ResourceExhausted error).

Enabling the feature

To use this beta feature, the MutableCSINodeAllocatableCount feature gate must be enabled in these components:

kube-apiserver

kubelet
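As a minimal sketch of enabling the gate via command-line flags (a kubeadm extraArgs entry or a KubeletConfiguration featureGates field would work equally well; other flags are elided):

kube-apiserver ... --feature-gates=MutableCSINodeAllocatableCount=true
kubelet ... --feature-gates=MutableCSINodeAllocatableCount=true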

Example CSI driver configuration

Below is an example of configuring a CSI driver to enable periodic updates every 60 seconds:

apiVersion: storage.k8s.io/v1
kind: CSIDriver
metadata:
  name: example.csi.k8s.io
spec:
  nodeAllocatableUpdatePeriodSeconds: 60

This configuration directs kubelet to periodically call the CSI driver's NodeGetInfo method every 60 seconds, updating the node’s allocatable volume count. Kubernetes enforces a minimum update interval of 10 seconds to balance accuracy and resource usage.

Immediate updates on attachment failures

When a volume attachment operation fails due to a ResourceExhausted error (gRPC code 8), Kubernetes immediately updates the allocatable count instead of waiting for the next periodic update. The Kubelet then marks the affected pods as Failed, enabling their controllers to recreate them. This prevents pods from getting permanently stuck in the ContainerCreating state.

Getting started

To enable this feature in your Kubernetes v1.34 cluster:

Enable the feature gate MutableCSINodeAllocatableCount on the kube-apiserver and kubelet components.

Update your CSI driver configuration by setting nodeAllocatableUpdatePeriodSeconds.

Monitor and observe improvements in scheduling accuracy and pod placement reliability.

Next steps

This feature is currently in beta and the Kubernetes community welcomes your feedback. Test it, share your experiences, and help guide its evolution to GA stability.

Join discussions in the Kubernetes Storage Special Interest Group (SIG-Storage) to shape the future of Kubernetes storage capabilities.

via Kubernetes Blog https://kubernetes.io/

September 11, 2025 at 02:30PM

·kubernetes.io·
Kubernetes v1.34: Mutable CSI Node Allocatable Graduates to Beta
Last Week in Kubernetes Development - Week Ending September 7 2025
Last Week in Kubernetes Development - Week Ending September 7 2025

Week Ending September 7, 2025

https://lwkd.info/2025/20250910

Developer News

The Kubernetes v1.35 Release Team shadow application is open until Sept 14, 2025, with results by Sept 22 and the release cycle running Sept 15–Dec 17. Learn more in the Release Team Overview, Shadows Guide, Role Handbooks, and Selection Criteria. Updates will be shared in the #sig-release Slack channel and the kubernetes/sig-release repo.

A medium-severity flaw (CVE-2025-7445) in secrets-store-sync-controller < v0.0.2 may expose service account tokens in logs, risking cloud vault access. Upgrade to v0.0.2+ and check logs for leaked or misused tokens. See the Kubernetes CVE details here.

Steering Committee Election

The nomination period for the Kubernetes Steering Committee Election has ended.

Now it’s time for your vote! The Steering Committee Election (https://github.com/kubernetes/community/tree/master/elections/steering/2025#voting-process) begins on Friday, 12th September. You can check your eligibility to vote in the voting app, and file an exception request if you need to.

Release Schedule

Next Deadline: 1.35 Release Cycle Starts, September 15

The Kubernetes v1.35 Release Team shadow application opened on Sept 4 and will close on Sept 14, 2025 (midnight anywhere). Selected applicants will be notified by Sept 22, and the release cycle is expected to run from Sept 15 to Dec 17, 2025. This is a great opportunity to get involved with the release process!

The cherry pick deadlines closed on Sept 5 for Kubernetes 1.33.5, 1.32.9, and 1.31.13, all targeting release on Sept 9, 2025.

Featured PRs

133097: Resolve confusing use of TooManyRequests error for eviction

This PR resolves an issue where pod eviction requests could return a TooManyRequests (429) error with an unrelated disruption budget message. The API server now reports a clearer error when eviction is blocked by the fail-safe mechanism in the DisruptionController, avoiding misleading responses.

133890: Fix missing kubelet_volume_stats_* metrics

This PR fixes a regression in v1.34 where kubelet_volume_stats_* metrics disappeared from the kubelet metrics endpoint. The bug was caused by multiple calls to Register(); the fix ensures the metrics are registered correctly and reported again.

KEP of the Week

KEP 740: Support external signing of service account tokens

This KEP enables Kubernetes to integrate with external key management solutions such as HSMs and cloud KMS for signing service account tokens. It supports out-of-process JWT signing and dynamic public key discovery, improving security and allowing key rotation without restarting kube-apiserver. Existing file-based key management remains supported as a fallback.

This KEP is tracked for beta in v1.34.

Other Merges

DRA kubelet: Avoid deadlock when gRPC connection to driver goes idle

Add k8s-long-name and k8s-short-name format validation tags

Prevent missing kubelet_volume_stats metrics

Show real error reason in pod STATUS when a pod has both Running and Error containers

Migrate plugin-manager logs to contextual logging — improves developer diagnostics, no user-facing impact

Add Close() API to remote runtime/image — enables graceful gRPC cleanup, prevents resource leaks

Add the correct error when eviction is blocked due to the failSafe mechanism of the DisruptionController

Configure JSON content type for generic webhook RESTClient

Disable estimating resource size for resources with watch cache disabled

Enforce that all resources set resourcePrefix

Prevent error logs by skipping stats collection for resources missing resourcePrefix

Add paths section to kubelet statusz endpoint

Lock down the AllowOverwriteTerminationGracePeriodSeconds feature gate.

Add +k8s:ifEnabled / +k8s:ifDisabled / +k8s:enumExclude tags for validation

Add stress test for pod cleanup on VolumeAttachmentLimitExceeded

Deprecated

Removed deprecated gogo protocol definitions from k8s.io/kubelet/pkg/apis/dra in favor of google.golang.org/protobuf.

Drop SizeMemoryBackedVolumes after the feature GA-ed in 1.32

Remove GA feature gate ComponentSLIs (now always on)

Version Updates

Update CNI plugins to v1.8.0

Bump gengo to v2.0.0-20250903151518-081d64401ab4

Subprojects and Dependency Updates

cloud-provider-aws v1.34.0 resolves nil pointer dereferences, updates topology labels and EC2 SDK, adds a TG reconciler for NLB hairpinning, and refreshes Go deps

coredns v1.12.4 fixes DoH context propagation, file plugin label offsets, gRPC/transfer leaks, and adds loadbalance prefer and metrics timeouts

cri-o v1.34.0 moves to Kubernetes v1.34 dev, switches to opencontainers/cgroups with runc 1.3, improves container monitoring, and fixes deadlocks and terminal resize issues.

minikube v1.37.0 adds krunkit driver for macOS GPU AI workloads, introduces kubetail addon, supports Kubernetes v1.34.0, deprecates HyperKit, and updates key addons and CNIs

via Last Week in Kubernetes Development https://lwkd.info/

September 10, 2025 at 06:00PM

·lwkd.info·
Last Week in Kubernetes Development - Week Ending September 7 2025
KDE Linux
KDE Linux
A free Linux®-based operating system built by KDE
·kde.org·
KDE Linux
Kubernetes v1.34: Use An Init Container To Define App Environment Variables
Kubernetes v1.34: Use An Init Container To Define App Environment Variables

Kubernetes v1.34: Use An Init Container To Define App Environment Variables

https://kubernetes.io/blog/2025/09/10/kubernetes-v1-34-env-files/

Kubernetes typically uses ConfigMaps and Secrets to set environment variables, which introduces additional API calls and complexity. For example, you need to separately manage the Pods of your workloads and their configurations, while ensuring orderly updates for both the configurations and the workload Pods.

Alternatively, you might be using a vendor-supplied container that requires environment variables (such as a license key or a one-time token), but you don’t want to hard-code them or mount volumes just to get the job done.

If that's the situation you are in, you now have a new (alpha) way to achieve it. Provided you have the EnvFiles feature gate enabled across your cluster, you can tell the kubelet to load a container's environment variables from a volume (the volume must be part of the Pod that the container belongs to). This feature lets you load environment variables directly from a file in an emptyDir volume without actually mounting that file into the container. It’s a simple yet elegant solution to some surprisingly common problems.

What’s this all about?

At its core, this feature allows you to point your container at a file, one generated by an initContainer, and have Kubernetes parse that file to set your environment variables. The file lives in an emptyDir volume (a temporary storage space that lasts as long as the Pod does). Your main container doesn’t need to mount the volume; the kubelet reads the file and injects these variables when the container starts.

How It Works

Here's a simple example:

apiVersion: v1
kind: Pod
spec:
  initContainers:
  - name: generate-config
    image: busybox
    command: ['sh', '-c', 'echo "CONFIG_VAR=HELLO" > /config/config.env']
    volumeMounts:
    - name: config-volume
      mountPath: /config
  containers:
  - name: app-container
    image: gcr.io/distroless/static
    env:
    - name: CONFIG_VAR
      valueFrom:
        fileKeyRef:
          path: config.env
          volumeName: config-volume
          key: CONFIG_VAR
  volumes:
  - name: config-volume
    emptyDir: {}

Using this approach is a breeze. You define your environment variables in the pod spec using the fileKeyRef field, which tells Kubernetes where to find the file and which key to pull. The file itself follows the standard .env syntax (think KEY=VALUE), and (for this alpha stage at least) you must ensure that it is written into an emptyDir volume; other volume types aren't supported for this feature. At least one init container must mount that emptyDir volume (to write the file), but the main container doesn’t need to; it just gets the variables handed to it at startup.

A word on security

While this feature supports handling sensitive data such as keys or tokens, note that its implementation relies on emptyDir volumes mounted into the Pod. Operators with node filesystem access could therefore easily retrieve this sensitive data through pod directory paths.

If storing sensitive data like keys or tokens using this feature, ensure your cluster security policies effectively protect nodes against unauthorized access to prevent exposure of confidential information.

Summary

This feature eliminates a number of complex workarounds used today, simplifying application authoring and opening the door to more use cases. Kubernetes stays flexible and open to feedback. Tell us how you use this feature or what is missing.

via Kubernetes Blog https://kubernetes.io/

September 10, 2025 at 02:30PM

·kubernetes.io·
Kubernetes v1.34: Use An Init Container To Define App Environment Variables
CIQ to Accelerate AI and HPC Workloads with NVIDIA CUDA
CIQ to Accelerate AI and HPC Workloads with NVIDIA CUDA
CIQ is the leading Enterprise Linux provider licensed to include NVIDIA CUDA in all AI and HPC stacks built on CIQ's optimized version of Rocky Linux. RENO, Nev., September 10, 2025 - CIQ, the…
·ciq.com·
CIQ to Accelerate AI and HPC Workloads with NVIDIA CUDA
Base - SQLite editor for macOS
Base - SQLite editor for macOS
Base is the SQLite database editor Mac users love. Designed for everyone, with a comfortable interface that makes database work so much nicer.
·menial.co.uk·
Base - SQLite editor for macOS
AI & DevOps Toolkit - Ep34 - Ask Me Anything About Anything with Scott Rosenberg - https://www.youtube.com/watch?v=IoBrA6gUESk
AI & DevOps Toolkit - Ep34 - Ask Me Anything About Anything with Scott Rosenberg - https://www.youtube.com/watch?v=IoBrA6gUESk

Ep34 - Ask Me Anything About Anything with Scott Rosenberg

There are no restrictions in this AMA session. You can ask anything about DevOps, AI, Cloud, Kubernetes, Platform Engineering, containers, or anything else. Scott Rosenberg, a regular guest, will be here to help us out.

▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬ Sponsor: Codefresh 🔗 GitOps Argo CD Certifications: https://learning.codefresh.io (use "viktor" for a 50% discount) ▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬

▬▬▬▬▬▬ 👋 Contact me 👋 ▬▬▬▬▬▬ ➡ BlueSky: https://vfarcic.bsky.social ➡ LinkedIn: https://www.linkedin.com/in/viktorfarcic/

▬▬▬▬▬▬ 🚀 Other Channels 🚀 ▬▬▬▬▬▬ 🎤 Podcast: https://www.devopsparadox.com/ 💬 Live streams: https://www.youtube.com/c/DevOpsParadox

via YouTube https://www.youtube.com/watch?v=IoBrA6gUESk

·youtube.com·
AI & DevOps Toolkit - Ep34 - Ask Me Anything About Anything with Scott Rosenberg - https://www.youtube.com/watch?v=IoBrA6gUESk
Kubernetes v1.34: Snapshottable API server cache
Kubernetes v1.34: Snapshottable API server cache

Kubernetes v1.34: Snapshottable API server cache

https://kubernetes.io/blog/2025/09/09/kubernetes-v1-34-snapshottable-api-server-cache/

For years, the Kubernetes community has been on a mission to improve the stability and performance predictability of the API server. A major focus of this effort has been taming list requests, which have historically been a primary source of high memory usage and heavy load on the etcd datastore. With each release, we've chipped away at the problem, and today, we're thrilled to announce the final major piece of this puzzle.

The snapshottable API server cache feature has graduated to Beta in Kubernetes v1.34, culminating a multi-release effort to allow virtually all read requests to be served directly from the API server's cache.

Evolving the cache for performance and stability

The path to the current state involved several key enhancements over recent releases that paved the way for today's announcement.

Consistent reads from cache (Beta in v1.31)

While the API server has long used a cache for performance, a key milestone was guaranteeing consistent reads of the latest data from it. This v1.31 enhancement allowed the watch cache to be used for strongly-consistent read requests for the first time. That was a huge win: it enabled filtered collections (e.g. "a list of pods bound to this node") to be safely served from the cache instead of etcd, dramatically reducing etcd's load for common workloads.

Taming large responses with streaming (Beta in v1.33)

Another key improvement was tackling the problem of memory spikes when transmitting large responses. The streaming encoder, introduced in v1.33, allowed the API server to send list items one by one, rather than buffering the entire multi-gigabyte response in memory. This made the memory cost of sending a response predictable and minimal, regardless of its size.

The missing piece

Despite these huge improvements, a critical gap remained. Any request for a historical LIST—most commonly used for paginating through large result sets—still had to bypass the cache and query etcd directly. This meant that the cost of retrieving the data was still unpredictable and could put significant memory pressure on the API server.

Kubernetes 1.34: snapshots complete the picture

The snapshottable API server cache solves this final piece of the puzzle. This feature enhances the watch cache, enabling it to generate efficient, point-in-time snapshots of its state.

Here’s how it works: for each update, the cache creates a lightweight snapshot. These snapshots are "lazy copies," meaning they don't duplicate objects but simply store pointers, making them incredibly memory-efficient.

When a list request for a historical resourceVersion arrives, the API server now finds the corresponding snapshot and serves the response directly from its memory. This closes the final major gap, allowing paginated requests to be served entirely from the cache.
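As a sketch of the kind of request that benefits: a paginated LIST pins later pages to the resourceVersion of the first page via the continue token, and those follow-up pages can now be answered from a cache snapshot instead of etcd (the token below is a placeholder):

kubectl get --raw '/api/v1/pods?limit=500'
kubectl get --raw '/api/v1/pods?limit=500&continue=<token-from-previous-page>'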

A new era of API Server performance 🚀

With this final piece in place, the synergy of these three features ushers in a new era of API server predictability and performance:

Get Data from Cache: Consistent reads and snapshottable cache work together to ensure nearly all read requests—whether for the latest data or a historical snapshot—are served from the API server's memory.

Send data via stream: Streaming list responses ensure that sending this data to the client has a minimal and constant memory footprint.

The result is a system where the resource cost of read operations is almost fully predictable and much more resilient to spikes in request load. This means dramatically reduced memory pressure, a lighter load on etcd, and a more stable, scalable, and reliable control plane for all Kubernetes clusters.

How to get started

With its graduation to Beta, the SnapshottableCache feature gate is enabled by default in Kubernetes v1.34. There are no actions required to start benefiting from these performance and stability improvements.

Acknowledgements

Special thanks for designing, implementing, and reviewing these critical features go to:

Ahmad Zolfaghari (@ah8ad3)

Ben Luddy (@benluddy) – Red Hat

Chen Chen (@z1cheng) – Microsoft

Davanum Srinivas (@dims) – Nvidia

David Eads (@deads2k) – Red Hat

Han Kang (@logicalhan) – CoreWeave

haosdent (@haosdent) – Shopee

Joe Betz (@jpbetz) – Google

Jordan Liggitt (@liggitt) – Google

Łukasz Szaszkiewicz (@p0lyn0mial) – Red Hat

Maciej Borsz (@mborsz) – Google

Madhav Jivrajani (@MadhavJivrajani) – UIUC

Marek Siarkowicz (@serathius) – Google

NKeert (@NKeert)

Tim Bannister (@lmktfy)

Wei Fu (@fuweid) - Microsoft

Wojtek Tyczyński (@wojtek-t) – Google

...and many others in SIG API Machinery. This milestone is a testament to the community's dedication to building a more scalable and robust Kubernetes.

via Kubernetes Blog https://kubernetes.io/

September 09, 2025 at 02:30PM

·kubernetes.io·
Kubernetes v1.34: Snapshottable API server cache
CHAOSScast Episode 117: Business Success with Open Source with VM (Vicky) Brasseur
CHAOSScast Episode 117: Business Success with Open Source with VM (Vicky) Brasseur
In this episode of CHAOSScast, Georg Link and Sean Goggins welcome guest Vicky Brasseur, author of *Business Success with Open Source* and *Forge Your Future with Open Source*. The conversation explores Vicky’s early journey into open source, starting from discovering Project Gutenberg in the early '90s to using Linux for the first time, the challenges companies face when using open source software, and how organizations can better leverage it strategically. The discussion also delves into her book, *Forge Your Future with Open Source*, which addresses common questions about contributing to open source projects. Vicky highlights the gaps in strategic open source usage within organizations and offers insights on how companies can better utilize open source software to reduce business risks. The conversation wraps up with practical advice for making a compelling business case for open source contributions and the importance of speaking the language of decision-makers. Press download now!
·podcast.chaoss.community·
CHAOSScast Episode 117: Business Success with Open Source with VM (Vicky) Brasseur
Predictive vs Reactive: A Journey to Smarter Kubernetes Scaling with Jorrick Stempher
Predictive vs Reactive: A Journey to Smarter Kubernetes Scaling with Jorrick Stempher

Predictive vs Reactive: A Journey to Smarter Kubernetes Scaling, with Jorrick Stempher

https://ku.bz/clbDWqPYp

Jorrick Stempher shares how his team of eight students built a complete predictive scaling system for Kubernetes clusters using machine learning.

Rather than waiting for nodes to become overloaded, their system uses the Prophet forecasting model to proactively anticipate load patterns and scale infrastructure, giving them the 8-9 minutes needed to provision new nodes on Vultr.

You will learn:

How to implement predictive scaling using Prophet ML model, Prometheus metrics, and custom APIs to forecast Kubernetes workload patterns

The Node Ranking Index (NRI) - a unified metric that combines CPU, RAM, and request data into a single comparable number for efficient scaling decisions

Real-world implementation challenges, including data validation, node startup timing constraints, load testing strategies, and the importance of proper research before building complex scaling solutions

Sponsor

This episode is brought to you by Testkube—the ultimate Continuous Testing Platform for Cloud Native applications. Scale fast, test continuously, and ship confidently. Check it out at testkube.io

More info

Find all the links and info for this episode here: https://ku.bz/clbDWqPYp

Interested in sponsoring an episode? Learn more.

via KubeFM https://kube.fm

September 09, 2025 at 06:00AM

·kube.fm·
Predictive vs Reactive: A Journey to Smarter Kubernetes Scaling with Jorrick Stempher
Kubernetes v1.34: VolumeAttributesClass for Volume Modification GA
Kubernetes v1.34: VolumeAttributesClass for Volume Modification GA

Kubernetes v1.34: VolumeAttributesClass for Volume Modification GA

https://kubernetes.io/blog/2025/09/08/kubernetes-v1-34-volume-attributes-class/

The VolumeAttributesClass API, which empowers users to dynamically modify volume attributes, has officially graduated to General Availability (GA) in Kubernetes v1.34. This marks a significant milestone, providing a robust and stable way to tune your persistent storage directly within Kubernetes.

What is VolumeAttributesClass?

At its core, VolumeAttributesClass is a cluster-scoped resource that defines a set of mutable parameters for a volume. Think of it as a "profile" for your storage, allowing cluster administrators to expose different quality-of-service (QoS) levels or performance tiers.

Users can then specify a volumeAttributesClassName in their PersistentVolumeClaim (PVC) to indicate which class of attributes they desire. The magic happens through the Container Storage Interface (CSI): when a PVC referencing a VolumeAttributesClass is updated, the associated CSI driver interacts with the underlying storage system to apply the specified changes to the volume.

This means you can now:

Dynamically scale performance: Increase IOPS or throughput for a busy database, or reduce it for a less critical application.

Optimize costs: Adjust attributes on the fly to match your current needs, avoiding over-provisioning.

Simplify operations: Manage volume modifications directly within the Kubernetes API, rather than relying on external tools or manual processes.
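As a concrete illustration, here is a minimal sketch of that pairing: a VolumeAttributesClass and a PersistentVolumeClaim that references it. The class name fast-io, the StorageClass ebs-sc, and the parameter keys are illustrative and driver-specific, not prescribed values:

apiVersion: storage.k8s.io/v1
kind: VolumeAttributesClass
metadata:
  name: fast-io                # illustrative name
driverName: ebs.csi.aws.com    # one of the drivers listed later in this post
parameters:                    # keys are driver-specific; these mirror common EBS attributes
  iops: "16000"
  throughput: "500"
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: app-data
spec:
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 100Gi
  storageClassName: ebs-sc            # illustrative StorageClass
  volumeAttributesClassName: fast-io  # point at another class to request a modification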

What is new from Beta to GA

There are two major enhancements from beta.

Cancel support for infeasible errors

To improve resilience and user experience, the GA release introduces explicit cancel support when a requested volume modification becomes infeasible. If the underlying storage system or CSI driver indicates that the requested changes cannot be applied (e.g., due to invalid arguments), users can cancel the operation and revert the volume to its previous stable configuration, preventing the volume from being left in an inconsistent state.
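A hedged sketch of what that can look like in practice, assuming a PVC named app-data that was previously using a class named standard-io (both names hypothetical): if the modification to the new class is reported as infeasible, you point the claim back at its previous class to cancel the change.

kubectl patch pvc app-data --type merge \
  -p '{"spec":{"volumeAttributesClassName":"standard-io"}}'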

Quota support based on scope

While VolumeAttributesClass doesn't add a new quota type, the Kubernetes control plane can be configured to enforce quotas on PersistentVolumeClaims that reference a specific VolumeAttributesClass.

This is achieved by using the scopeSelector field in a ResourceQuota to target PVCs that have .spec.volumeAttributesClassName set to a particular VolumeAttributesClass name. Please see more details here.
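A minimal sketch of such a quota, with a hypothetical namespace team-a, a hypothetical VolumeAttributesClass named gold, and arbitrary example limits:

apiVersion: v1
kind: ResourceQuota
metadata:
  name: gold-vac-quota
  namespace: team-a
spec:
  hard:
    persistentvolumeclaims: "5"
    requests.storage: 500Gi
  scopeSelector:
    matchExpressions:
    - scopeName: VolumeAttributesClass
      operator: In
      values: ["gold"]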

Drivers support VolumeAttributesClass

Amazon EBS CSI Driver: The AWS EBS CSI driver has robust support for VolumeAttributesClass and allows you to modify parameters like volume type (e.g., gp2 to gp3, io1 to io2), IOPS, and throughput of EBS volumes dynamically.

Google Compute Engine (GCE) Persistent Disk CSI Driver (pd.csi.storage.gke.io): This driver also supports dynamic modification of persistent disk attributes, including IOPS and throughput, via VolumeAttributesClass.

Contact

For any inquiries or specific questions related to VolumeAttributesClass, please reach out to the SIG Storage community.

via Kubernetes Blog https://kubernetes.io/

September 08, 2025 at 02:30PM

·kubernetes.io·
Kubernetes v1.34: VolumeAttributesClass for Volume Modification GA
AI & DevOps Toolkit - Why Kubernetes Discovery Sucks for AI (And How Vector DBs Fix It) - https://www.youtube.com/watch?v=MSNstHj4rmk
AI & DevOps Toolkit - Why Kubernetes Discovery Sucks for AI (And How Vector DBs Fix It) - https://www.youtube.com/watch?v=MSNstHj4rmk

Why Kubernetes Discovery Sucks for AI (And How Vector DBs Fix It)

Discover why the Kubernetes API is brilliant for execution but a complete nightmare for discovery, and learn how semantic search with vector databases can finally solve this problem. This video demonstrates the real-world challenge of finding the right Kubernetes resources when you have hundreds of cryptically named resource types in your cluster, and shows how AI struggles with the same discovery issues that plague human users.

We'll walk through a practical scenario where you need to create a PostgreSQL database with schema management in AWS, revealing how traditional keyword-based searching through 443+ Kubernetes resources becomes an exercise in frustration. Even when filtering by logical terms like "database," "postgresql," and "aws," the perfect solution remains hidden because it doesn't match your search keywords. The video then introduces a game-changing approach using vector databases and semantic search that enables both humans and AI to discover resources through natural language queries, regardless of exact keyword matches. By converting Kubernetes resource definitions into embeddings that capture semantic meaning, we transform an unsearchable cluster into an instantly discoverable one where you can simply describe what you want to accomplish rather than memorizing cryptic resource names.

▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬ Sponsor: UpCloud 🔗 https://signup.upcloud.com/?promo=devopstoolkit500 👉 Promo code: devopstoolkit500 ▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬

#KubernetesAPI #SemanticSearch #VectorDatabase

Consider joining the channel: https://www.youtube.com/c/devopstoolkit/join

▬▬▬▬▬▬ 🔗 Additional Info 🔗 ▬▬▬▬▬▬ ➡ Transcript and commands: https://devopstoolkit.live/kubernetes/why-kubernetes-discovery-sucks-for-ai-and-how-vector-dbs-fix-it 🔗 DevOps AI Toolkit: https://github.com/vfarcic/dot-ai

▬▬▬▬▬▬ 💰 Sponsorships 💰 ▬▬▬▬▬▬ If you are interested in sponsoring this channel, please visit https://devopstoolkit.live/sponsor for more information. Alternatively, feel free to contact me over Twitter or LinkedIn (see below).

▬▬▬▬▬▬ 👋 Contact me 👋 ▬▬▬▬▬▬ ➡ BlueSky: https://vfarcic.bsky.social ➡ LinkedIn: https://www.linkedin.com/in/viktorfarcic/

▬▬▬▬▬▬ 🚀 Other Channels 🚀 ▬▬▬▬▬▬ 🎤 Podcast: https://www.devopsparadox.com/ 💬 Live streams: https://www.youtube.com/c/DevOpsParadox

▬▬▬▬▬▬ ⏱ Timecodes ⏱ ▬▬▬▬▬▬ 00:00 Kubernetes API Discovery with AI 01:30 UpCloud (sponsor) 02:37 Kubernetes API Discovery Nightmare 11:33 Why AI Fails at Kubernetes Discovery 16:47 Vector Database Semantic Search Solution 23:15 Semantic Search Pros, Cons, and Key Takeaways

via YouTube https://www.youtube.com/watch?v=MSNstHj4rmk

·youtube.com·
AI & DevOps Toolkit - Why Kubernetes Discovery Sucks for AI (And How Vector DBs Fix It) - https://www.youtube.com/watch?v=MSNstHj4rmk
Kubernetes v1.34: Pod Replacement Policy for Jobs Goes GA
Kubernetes v1.34: Pod Replacement Policy for Jobs Goes GA

Kubernetes v1.34: Pod Replacement Policy for Jobs Goes GA

https://kubernetes.io/blog/2025/09/05/kubernetes-v1-34-pod-replacement-policy-for-jobs-goes-ga/

In Kubernetes v1.34, the Pod replacement policy feature has reached general availability (GA). This blog post describes the Pod replacement policy feature and how to use it in your Jobs.

About Pod Replacement Policy

By default, the Job controller immediately recreates Pods as soon as they fail or begin terminating (when they have a deletion timestamp).

As a result, while some Pods are terminating, the total number of running Pods for a Job can temporarily exceed the specified parallelism. For Indexed Jobs, this can even mean multiple Pods running for the same index at the same time.

This behavior works fine for many workloads, but it can cause problems in certain cases.

For example, popular machine learning frameworks like TensorFlow and JAX expect exactly one Pod per worker index. If two Pods run at the same time, you might encounter errors such as:

/job:worker/task:4: Duplicate task registration with task_name=/job:worker/replica:0/task:4

Additionally, starting replacement Pods before the old ones fully terminate can lead to:

Scheduling delays by kube-scheduler as the nodes remain occupied.

Unnecessary cluster scale-ups to accommodate the replacement Pods.

Temporary bypassing of quota checks by workload orchestrators like Kueue.

With Pod replacement policy, Kubernetes gives you control over when the control plane replaces terminating Pods, helping you avoid these issues.

How Pod Replacement Policy works

This enhancement means that Jobs in Kubernetes have an optional field .spec.podReplacementPolicy.

You can choose one of two policies:

TerminatingOrFailed (default): Replaces Pods as soon as they start terminating.

Failed: Replaces Pods only after they fully terminate and transition to the Failed phase.

Setting the policy to Failed ensures that a new Pod is only created after the previous one has completely terminated.

For Jobs with a Pod Failure Policy, the default podReplacementPolicy is Failed, and no other value is allowed. See Pod Failure Policy to learn more about Pod Failure Policies for Jobs.

You can check how many Pods are currently terminating by inspecting the Job’s .status.terminating field:

kubectl get job myjob -o=jsonpath='{.status.terminating}'

Example

Here’s a Job example that executes a task two times (spec.completions: 2) in parallel (spec.parallelism: 2) and replaces Pods only after they fully terminate (spec.podReplacementPolicy: Failed):

apiVersion: batch/v1
kind: Job
metadata:
  name: example-job
spec:
  completions: 2
  parallelism: 2
  podReplacementPolicy: Failed
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: worker
        image: your-image

If a Pod receives a SIGTERM signal (deletion, eviction, preemption...), it begins terminating. When the container handles termination gracefully, cleanup may take some time.

When the Job starts, we will see two Pods running:

kubectl get pods

NAME                READY   STATUS    RESTARTS   AGE
example-job-qr8kf   1/1     Running   0          2s
example-job-stvb4   1/1     Running   0          2s

Let's delete one of the Pods (example-job-qr8kf).

With the TerminatingOrFailed policy, as soon as one Pod (example-job-qr8kf) starts terminating, the Job controller immediately creates a new Pod (example-job-b59zk) to replace it.

kubectl get pods

NAME                READY   STATUS        RESTARTS   AGE
example-job-b59zk   1/1     Running       0          1s
example-job-qr8kf   1/1     Terminating   0          17s
example-job-stvb4   1/1     Running       0          17s

With the Failed policy, the new Pod (example-job-b59zk) is not created while the old Pod (example-job-qr8kf) is terminating.

kubectl get pods

NAME                READY   STATUS        RESTARTS   AGE
example-job-qr8kf   1/1     Terminating   0          17s
example-job-stvb4   1/1     Running       0          17s

When the terminating Pod has fully transitioned to the Failed phase, a new Pod is created:

kubectl get pods

NAME                READY   STATUS    RESTARTS   AGE
example-job-b59zk   1/1     Running   0          1s
example-job-stvb4   1/1     Running   0          25s

How can you learn more?

Read the user-facing documentation for Pod Replacement Policy, Backoff Limit per Index, and Pod Failure Policy.

Read the KEPs for Pod Replacement Policy, Backoff Limit per Index, and Pod Failure Policy.

Acknowledgments

As with any Kubernetes feature, multiple people contributed to getting this done, from testing and filing bugs to reviewing code.

As this feature moves to stable after 2 years, we would like to thank the following people:

Kevin Hannon - for writing the KEP and the initial implementation.

Michał Woźniak - for guidance, mentorship, and reviews.

Aldo Culquicondor - for guidance, mentorship, and reviews.

Maciej Szulik - for guidance, mentorship, and reviews.

Dejan Zele Pejchev - for taking over the feature and promoting it from Alpha through Beta to GA.

Get involved

This work was sponsored by the Kubernetes batch working group in close collaboration with the SIG Apps community.

If you are interested in working on new features in the space we recommend subscribing to our Slack channel and attending the regular community meetings.

via Kubernetes Blog https://kubernetes.io/

September 05, 2025 at 02:30PM

·kubernetes.io·
Kubernetes v1.34: Pod Replacement Policy for Jobs Goes GA
Should AI Get Legal Rights?
Should AI Get Legal Rights?
Model welfare is an emerging field of research that seeks to determine whether AI is conscious and, if so, how humanity should respond.
·wired.com·
Should AI Get Legal Rights?
PSI Metrics for Kubernetes Graduates to Beta
PSI Metrics for Kubernetes Graduates to Beta

PSI Metrics for Kubernetes Graduates to Beta

https://kubernetes.io/blog/2025/09/04/kubernetes-v1-34-introducing-psi-metrics-beta/

As Kubernetes clusters grow in size and complexity, understanding the health and performance of individual nodes becomes increasingly critical. We are excited to announce that as of Kubernetes v1.34, Pressure Stall Information (PSI) Metrics has graduated to Beta.

What is Pressure Stall Information (PSI)?

Pressure Stall Information (PSI) is a feature of the Linux kernel (version 4.20 and later) that provides a canonical way to quantify pressure on infrastructure resources, in terms of whether demand for a resource exceeds current supply. It moves beyond simple resource utilization metrics and instead measures the amount of time that tasks are stalled due to resource contention. This is a powerful way to identify and diagnose resource bottlenecks that can impact application performance.

PSI exposes metrics for CPU, memory, and I/O, categorized as either some or full pressure:

some

The percentage of time that at least one task is stalled on a resource. This indicates some level of resource contention.

full

The percentage of time that all non-idle tasks are stalled on a resource simultaneously. This indicates a more severe resource bottleneck.

[Figure: PSI 'some' vs. 'full' pressure]

These metrics are aggregated over 10-second, 1-minute, and 5-minute rolling windows, providing a comprehensive view of resource pressure over time.

PSI metrics in Kubernetes

With the KubeletPSI feature gate enabled, the kubelet can now collect PSI metrics from the Linux kernel and expose them through two channels: the Summary API and the /metrics/cadvisor Prometheus endpoint. This allows you to monitor and alert on resource pressure at the node, pod, and container level.

The following new metrics are available in Prometheus exposition format via /metrics/cadvisor:

container_pressure_cpu_stalled_seconds_total

container_pressure_cpu_waiting_seconds_total

container_pressure_memory_stalled_seconds_total

container_pressure_memory_waiting_seconds_total

container_pressure_io_stalled_seconds_total

container_pressure_io_waiting_seconds_total

These metrics, along with the data from the Summary API, provide a granular view of resource pressure, enabling you to pinpoint the source of performance issues and take corrective action. For example, you can use these metrics to:

Identify memory leaks: A steadily increasing some pressure for memory can indicate a memory leak in an application.

Optimize resource requests and limits: By understanding the resource pressure of your workloads, you can more accurately tune their resource requests and limits.

Autoscale workloads: You can use PSI metrics to trigger autoscaling events, ensuring that your workloads have the resources they need to perform optimally.

How to enable PSI metrics

To enable PSI metrics in your Kubernetes cluster, you need to:

Ensure your nodes are running a Linux kernel version 4.20 or later and are using cgroup v2.

Enable the KubeletPSI feature gate on the kubelet.
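For the kubelet side, one way to do this (a sketch assuming you manage the kubelet through a KubeletConfiguration file; a --feature-gates command-line flag works as well) is:

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
featureGates:
  KubeletPSI: true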

Once enabled, you can start scraping the /metrics/cadvisor endpoint with your Prometheus-compatible monitoring solution or query the Summary API to collect and visualize the new PSI metrics. Note that PSI is a Linux-kernel feature, so these metrics are not available on Windows nodes. Your cluster can contain a mix of Linux and Windows nodes, and on the Windows nodes the kubelet does not expose PSI metrics.
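As a quick sketch of what that looks like before wiring up a full monitoring stack (assuming permission to proxy to nodes; <node-name> is a placeholder), you can pull the raw counters from one node and then turn them into a stall fraction in Prometheus:

# Raw PSI counters straight from one node's cadvisor endpoint:
kubectl get --raw "/api/v1/nodes/<node-name>/proxy/metrics/cadvisor" | grep container_pressure
# Example PromQL: approximate fraction of time tasks waited on memory over the last 5 minutes:
#   rate(container_pressure_memory_waiting_seconds_total[5m])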

What's next?

We are excited to bring PSI metrics to the Kubernetes community and look forward to your feedback. As a beta feature, we are actively working on improving and extending this functionality towards a stable GA release. We encourage you to try it out and share your experiences with us.

To learn more about PSI metrics, check out the official Kubernetes documentation. You can also get involved in the conversation on the #sig-node Slack channel.

via Kubernetes Blog https://kubernetes.io/

September 04, 2025 at 02:30PM

·kubernetes.io·
PSI Metrics for Kubernetes Graduates to Beta