1_r/devopsish

Kubernetes v1.34: Use An Init Container To Define App Environment Variables

Kubernetes v1.34: Use An Init Container To Define App Environment Variables

https://kubernetes.io/blog/2025/09/10/kubernetes-v1-34-env-files/

Kubernetes typically uses ConfigMaps and Secrets to set environment variables, which introduces additional API calls and complexity. For example, you need to separately manage the Pods of your workloads and their configurations, while ensuring orderly updates for both.

Alternatively, you might be using a vendor-supplied container that requires environment variables (such as a license key or a one-time token), but you don’t want to hard-code them or mount volumes just to get the job done.

If that's the situation you are in, you now have a new (alpha) way to achieve it. Provided you have the EnvFiles feature gate enabled across your cluster, you can tell the kubelet to load a container's environment variables from a volume (the volume must be part of the Pod that the container belongs to). This feature lets you load environment variables directly from a file in an emptyDir volume without actually mounting that file into the container. It’s a simple yet elegant solution to some surprisingly common problems.

What’s this all about?

At its core, this feature allows you to point your container to a file, one generated by an initContainer, and have Kubernetes parse that file to set your environment variables. The file lives in an emptyDir volume (a temporary storage space that lasts as long as the Pod does). Your main container doesn’t need to mount the volume; the kubelet reads the file and injects these variables when the container starts.

How It Works

Here's a simple example:

apiVersion: v1
kind: Pod
spec:
  initContainers:
  - name: generate-config
    image: busybox
    command: ['sh', '-c', 'echo "CONFIG_VAR=HELLO" > /config/config.env']
    volumeMounts:
    - name: config-volume
      mountPath: /config
  containers:
  - name: app-container
    image: gcr.io/distroless/static
    env:
    - name: CONFIG_VAR
      valueFrom:
        fileKeyRef:
          path: config.env
          volumeName: config-volume
          key: CONFIG_VAR
  volumes:
  - name: config-volume
    emptyDir: {}

Using this approach is a breeze. You define your environment variables in the pod spec using the fileKeyRef field, which tells Kubernetes where to find the file and which key to pull. The file itself follows the standard .env syntax (think KEY=VALUE), and (for this alpha stage at least) you must ensure it is written into an emptyDir volume; other volume types aren't supported for this feature. At least one init container must mount that emptyDir volume (to write the file), but the main container doesn’t need to: it just gets the variables handed to it at startup.
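To make the file format concrete, here is a small sketch of writing and reading such a file outside of Kubernetes (the path and variable names are illustrative; in a real Pod the init container writes the file into the shared emptyDir volume):

```shell
# Write a .env-style file, as the init container would; one KEY=VALUE per line:
cat > /tmp/config.env <<'EOF'
CONFIG_VAR=HELLO
API_TOKEN=abc123
EOF

# The value a fileKeyRef entry with key: CONFIG_VAR resolves to:
grep '^CONFIG_VAR=' /tmp/config.env | cut -d= -f2   # prints HELLO
```

Each env entry in the pod spec references exactly one key from the file, so one generated file can feed any number of variables.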

A word on security

While this feature supports handling sensitive data such as keys or tokens, note that its implementation relies on emptyDir volumes mounted into the Pod. Operators with node filesystem access could therefore easily retrieve this sensitive data through pod directory paths.

If you store sensitive data such as keys or tokens using this feature, ensure your cluster security policies protect nodes against unauthorized access, to prevent exposure of confidential information.

Summary

This feature eliminates a number of complex workarounds used today, simplifying application authoring and opening the door to more use cases. Kubernetes stays flexible and open to feedback: tell us how you use this feature, or what is missing.

via Kubernetes Blog https://kubernetes.io/

September 10, 2025 at 02:30PM

·kubernetes.io·
Kubernetes v1.34: Use An Init Container To Define App Environment Variables
CIQ to Accelerate AI and HPC Workloads with NVIDIA CUDA
CIQ is the leading Enterprise Linux provider licensed to include NVIDIA CUDA in all AI and HPC stacks built on CIQ's optimized version of Rocky Linux. RENO, Nev., September 10, 2025 - CIQ, the…
·ciq.com·
CIQ to Accelerate AI and HPC Workloads with NVIDIA CUDA
Base - SQLite editor for macOS
Base is the SQLite database editor Mac users love. Designed for everyone, with a comfortable interface that makes database work so much nicer.
·menial.co.uk·
Base - SQLite editor for macOS
AI & DevOps Toolkit - Ep34 - Ask Me Anything About Anything with Scott Rosenberg - https://www.youtube.com/watch?v=IoBrA6gUESk

Ep34 - Ask Me Anything About Anything with Scott Rosenberg

There are no restrictions in this AMA session. You can ask anything about DevOps, AI, Cloud, Kubernetes, Platform Engineering, containers, or anything else. Scott Rosenberg, a regular guest, will be here to help us out.

▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬ Sponsor: Codefresh 🔗 GitOps Argo CD Certifications: https://learning.codefresh.io (use "viktor" for a 50% discount) ▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬

▬▬▬▬▬▬ 👋 Contact me 👋 ▬▬▬▬▬▬ ➡ BlueSky: https://vfarcic.bsky.social ➡ LinkedIn: https://www.linkedin.com/in/viktorfarcic/

▬▬▬▬▬▬ 🚀 Other Channels 🚀 ▬▬▬▬▬▬ 🎤 Podcast: https://www.devopsparadox.com/ 💬 Live streams: https://www.youtube.com/c/DevOpsParadox

via YouTube https://www.youtube.com/watch?v=IoBrA6gUESk

·youtube.com·
AI & DevOps Toolkit - Ep34 - Ask Me Anything About Anything with Scott Rosenberg - https://www.youtube.com/watch?v=IoBrA6gUESk
Kubernetes v1.34: Snapshottable API server cache

Kubernetes v1.34: Snapshottable API server cache

https://kubernetes.io/blog/2025/09/09/kubernetes-v1-34-snapshottable-api-server-cache/

For years, the Kubernetes community has been on a mission to improve the stability and performance predictability of the API server. A major focus of this effort has been taming list requests, which have historically been a primary source of high memory usage and heavy load on the etcd datastore. With each release, we've chipped away at the problem, and today, we're thrilled to announce the final major piece of this puzzle.

The snapshottable API server cache feature has graduated to Beta in Kubernetes v1.34, culminating a multi-release effort to allow virtually all read requests to be served directly from the API server's cache.

Evolving the cache for performance and stability

The path to the current state involved several key enhancements over recent releases that paved the way for today's announcement.

Consistent reads from cache (Beta in v1.31)

While the API server has long used a cache for performance, a key milestone was guaranteeing consistent reads of the latest data from it. This v1.31 enhancement allowed the watch cache to be used for strongly-consistent read requests for the first time. That was a huge win: it enabled filtered collections (e.g. "a list of pods bound to this node") to be safely served from the cache instead of etcd, dramatically reducing etcd load for common workloads.

Taming large responses with streaming (Beta in v1.33)

Another key improvement was tackling the problem of memory spikes when transmitting large responses. The streaming encoder, introduced in v1.33, allowed the API server to send list items one by one, rather than buffering the entire multi-gigabyte response in memory. This made the memory cost of sending a response predictable and minimal, regardless of its size.

The missing piece

Despite these huge improvements, a critical gap remained. Any request for a historical LIST—most commonly used for paginating through large result sets—still had to bypass the cache and query etcd directly. This meant that the cost of retrieving the data was still unpredictable and could put significant memory pressure on the API server.

Kubernetes 1.34: snapshots complete the picture

The snapshottable API server cache solves this final piece of the puzzle. This feature enhances the watch cache, enabling it to generate efficient, point-in-time snapshots of its state.

Here’s how it works: for each update, the cache creates a lightweight snapshot. These snapshots are "lazy copies," meaning they don't duplicate objects but simply store pointers, making them incredibly memory-efficient.

When a list request for a historical resourceVersion arrives, the API server now finds the corresponding snapshot and serves the response directly from its memory. This closes the final major gap, allowing paginated requests to be served entirely from the cache.

A new era of API Server performance 🚀

With this final piece in place, the synergy of these three features ushers in a new era of API server predictability and performance:

Get Data from Cache: Consistent reads and snapshottable cache work together to ensure nearly all read requests—whether for the latest data or a historical snapshot—are served from the API server's memory.

Send data via stream: Streaming list responses ensure that sending this data to the client has a minimal and constant memory footprint.

The result is a system where the resource cost of read operations is almost fully predictable and much more resilient to spikes in request load. This means dramatically reduced memory pressure, a lighter load on etcd, and a more stable, scalable, and reliable control plane for all Kubernetes clusters.

How to get started

With its graduation to Beta, the SnapshottableCache feature gate is enabled by default in Kubernetes v1.34. There are no actions required to start benefiting from these performance and stability improvements.

Acknowledgements

Special thanks for designing, implementing, and reviewing these critical features go to:

Ahmad Zolfaghari (@ah8ad3)

Ben Luddy (@benluddy) – Red Hat

Chen Chen (@z1cheng) – Microsoft

Davanum Srinivas (@dims) – Nvidia

David Eads (@deads2k) – Red Hat

Han Kang (@logicalhan) – CoreWeave

haosdent (@haosdent) – Shopee

Joe Betz (@jpbetz) – Google

Jordan Liggitt (@liggitt) – Google

Łukasz Szaszkiewicz (@p0lyn0mial) – Red Hat

Maciej Borsz (@mborsz) – Google

Madhav Jivrajani (@MadhavJivrajani) – UIUC

Marek Siarkowicz (@serathius) – Google

NKeert (@NKeert)

Tim Bannister (@lmktfy)

Wei Fu (@fuweid) - Microsoft

Wojtek Tyczyński (@wojtek-t) – Google

...and many others in SIG API Machinery. This milestone is a testament to the community's dedication to building a more scalable and robust Kubernetes.

via Kubernetes Blog https://kubernetes.io/

September 09, 2025 at 02:30PM

·kubernetes.io·
Kubernetes v1.34: Snapshottable API server cache
CHAOSScast Episode 117: Business Success with Open Source with VM (Vicky) Brasseur
In this episode of CHAOSScast, Georg Link and Sean Goggins welcome guest Vicky Brasseur, author of *Business Success with Open Source* and *Forge Your Future with Open Source*. The conversation explores Vicky’s early journey into open source, starting from discovering Project Gutenberg in the early '90s to using Linux for the first time, the challenges companies face when using open source software, and how organizations can better leverage it strategically. The discussion also delves into her book, *Forge Your Future with Open Source*, which addresses common questions about contributing to open source projects. Vicky highlights the gaps in strategic open source usage within organizations and offers insights on how companies can better utilize open source software to reduce business risks. The conversation wraps up with practical advice for making a compelling business case for open source contributions and the importance of speaking the language of decision-makers. Press download now!
·podcast.chaoss.community·
CHAOSScast Episode 117: Business Success with Open Source with VM (Vicky) Brasseur
Predictive vs Reactive: A Journey to Smarter Kubernetes Scaling with Jorrick Stempher

Predictive vs Reactive: A Journey to Smarter Kubernetes Scaling, with Jorrick Stempher

https://ku.bz/clbDWqPYp

Jorrick Stempher shares how his team of eight students built a complete predictive scaling system for Kubernetes clusters using machine learning.

Rather than waiting for nodes to become overloaded, their system uses the Prophet forecasting model to proactively anticipate load patterns and scale infrastructure, giving them the 8-9 minutes needed to provision new nodes on Vultr.

You will learn:

How to implement predictive scaling using Prophet ML model, Prometheus metrics, and custom APIs to forecast Kubernetes workload patterns

The Node Ranking Index (NRI) - a unified metric that combines CPU, RAM, and request data into a single comparable number for efficient scaling decisions

Real-world implementation challenges, including data validation, node startup timing constraints, load testing strategies, and the importance of proper research before building complex scaling solutions
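The episode notes don't give the actual NRI formula, but the idea of collapsing several utilization signals into one comparable number can be sketched as a weighted average (the weights and input ratios below are invented purely for illustration):

```shell
# Hypothetical per-node utilization ratios (0..1), e.g. from Prometheus:
cpu=0.80; ram=0.60; req=0.40

# Toy ranking score: weighted average of the three signals,
# so every node reduces to one number that can be sorted on.
awk -v c="$cpu" -v r="$ram" -v q="$req" \
  'BEGIN { printf "%.2f\n", 0.4*c + 0.4*r + 0.2*q }'   # prints 0.64
```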

Sponsor

This episode is brought to you by Testkube—the ultimate Continuous Testing Platform for Cloud Native applications. Scale fast, test continuously, and ship confidently. Check it out at testkube.io

More info

Find all the links and info for this episode here: https://ku.bz/clbDWqPYp

Interested in sponsoring an episode? Learn more.

via KubeFM https://kube.fm

September 09, 2025 at 06:00AM

·kube.fm·
Predictive vs Reactive: A Journey to Smarter Kubernetes Scaling with Jorrick Stempher
Kubernetes v1.34: VolumeAttributesClass for Volume Modification GA

Kubernetes v1.34: VolumeAttributesClass for Volume Modification GA

https://kubernetes.io/blog/2025/09/08/kubernetes-v1-34-volume-attributes-class/

The VolumeAttributesClass API, which empowers users to dynamically modify volume attributes, has officially graduated to General Availability (GA) in Kubernetes v1.34. This marks a significant milestone, providing a robust and stable way to tune your persistent storage directly within Kubernetes.

What is VolumeAttributesClass?

At its core, VolumeAttributesClass is a cluster-scoped resource that defines a set of mutable parameters for a volume. Think of it as a "profile" for your storage, allowing cluster administrators to expose different quality-of-service (QoS) levels or performance tiers.

Users can then specify a volumeAttributesClassName in their PersistentVolumeClaim (PVC) to indicate which class of attributes they desire. The magic happens through the Container Storage Interface (CSI): when a PVC referencing a VolumeAttributesClass is updated, the associated CSI driver interacts with the underlying storage system to apply the specified changes to the volume.

This means you can now:

Dynamically scale performance: Increase IOPS or throughput for a busy database, or reduce it for a less critical application.

Optimize costs: Adjust attributes on the fly to match your current needs, avoiding over-provisioning.

Simplify operations: Manage volume modifications directly within the Kubernetes API, rather than relying on external tools or manual processes.
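As a sketch of what that looks like in practice, here is a VolumeAttributesClass and a PVC that references it. The class name and PVC details are arbitrary, and the parameters shown are specific to the AWS EBS CSI driver; other drivers accept different keys:

```yaml
apiVersion: storage.k8s.io/v1
kind: VolumeAttributesClass
metadata:
  name: gold
driverName: ebs.csi.aws.com
parameters:
  type: gp3
  iops: "4000"
  throughput: "250"
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data
spec:
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 100Gi
  volumeAttributesClassName: gold
```

Changing volumeAttributesClassName on an existing PVC to a different class asks the CSI driver to modify the volume's attributes in place.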

What is new from Beta to GA

There are two major enhancements from beta.

Cancel support from infeasible errors

To improve resilience and user experience, the GA release introduces explicit cancel support when a requested volume modification becomes infeasible. If the underlying storage system or CSI driver indicates that the requested changes cannot be applied (e.g., due to invalid arguments), users can cancel the operation and revert the volume to its previous stable configuration, preventing the volume from being left in an inconsistent state.

Quota support based on scope

While VolumeAttributesClass doesn't add a new quota type, the Kubernetes control plane can be configured to enforce quotas on PersistentVolumeClaims that reference a specific VolumeAttributesClass.

This is achieved by using the scopeSelector field in a ResourceQuota to target PVCs that have .spec.volumeAttributesClassName set to a particular VolumeAttributesClass name. Please see more details here.

Drivers support VolumeAttributesClass

Amazon EBS CSI Driver: The AWS EBS CSI driver has robust support for VolumeAttributesClass and allows you to modify parameters like volume type (e.g., gp2 to gp3, io1 to io2), IOPS, and throughput of EBS volumes dynamically.

Google Compute Engine (GCE) Persistent Disk CSI Driver (pd.csi.storage.gke.io): This driver also supports dynamic modification of persistent disk attributes, including IOPS and throughput, via VolumeAttributesClass.

Contact

For any inquiries or specific questions related to VolumeAttributesClass, please reach out to the SIG Storage community.

via Kubernetes Blog https://kubernetes.io/

September 08, 2025 at 02:30PM

·kubernetes.io·
Kubernetes v1.34: VolumeAttributesClass for Volume Modification GA
AI & DevOps Toolkit - Why Kubernetes Discovery Sucks for AI (And How Vector DBs Fix It) - https://www.youtube.com/watch?v=MSNstHj4rmk

Why Kubernetes Discovery Sucks for AI (And How Vector DBs Fix It)

Discover why the Kubernetes API is brilliant for execution but a complete nightmare for discovery, and learn how semantic search with vector databases can finally solve this problem. This video demonstrates the real-world challenge of finding the right Kubernetes resources when you have hundreds of cryptically named resource types in your cluster, and shows how AI struggles with the same discovery issues that plague human users.

We'll walk through a practical scenario where you need to create a PostgreSQL database with schema management in AWS, revealing how traditional keyword-based searching through 443+ Kubernetes resources becomes an exercise in frustration. Even when filtering by logical terms like "database," "postgresql," and "aws," the perfect solution remains hidden because it doesn't match your search keywords. The video then introduces a game-changing approach using vector databases and semantic search that enables both humans and AI to discover resources through natural language queries, regardless of exact keyword matches. By converting Kubernetes resource definitions into embeddings that capture semantic meaning, we transform an unsearchable cluster into an instantly discoverable one where you can simply describe what you want to accomplish rather than memorizing cryptic resource names.

▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬ Sponsor: UpCloud 🔗 https://signup.upcloud.com/?promo=devopstoolkit500 👉 Promo code: devopstoolkit500 ▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬

#KubernetesAPI #SemanticSearch #VectorDatabase

Consider joining the channel: https://www.youtube.com/c/devopstoolkit/join

▬▬▬▬▬▬ 🔗 Additional Info 🔗 ▬▬▬▬▬▬ ➡ Transcript and commands: https://devopstoolkit.live/kubernetes/why-kubernetes-discovery-sucks-for-ai-and-how-vector-dbs-fix-it 🔗 DevOps AI Toolkit: https://github.com/vfarcic/dot-ai

▬▬▬▬▬▬ 💰 Sponsorships 💰 ▬▬▬▬▬▬ If you are interested in sponsoring this channel, please visit https://devopstoolkit.live/sponsor for more information. Alternatively, feel free to contact me over Twitter or LinkedIn (see below).

▬▬▬▬▬▬ 👋 Contact me 👋 ▬▬▬▬▬▬ ➡ BlueSky: https://vfarcic.bsky.social ➡ LinkedIn: https://www.linkedin.com/in/viktorfarcic/

▬▬▬▬▬▬ 🚀 Other Channels 🚀 ▬▬▬▬▬▬ 🎤 Podcast: https://www.devopsparadox.com/ 💬 Live streams: https://www.youtube.com/c/DevOpsParadox

▬▬▬▬▬▬ ⏱ Timecodes ⏱ ▬▬▬▬▬▬ 00:00 Kubernetes API Discovery with AI 01:30 UpCloud (sponsor) 02:37 Kubernetes API Discovery Nightmare 11:33 Why AI Fails at Kubernetes Discovery 16:47 Vector Database Semantic Search Solution 23:15 Semantic Search Pros, Cons, and Key Takeaways

via YouTube https://www.youtube.com/watch?v=MSNstHj4rmk

·youtube.com·
AI & DevOps Toolkit - Why Kubernetes Discovery Sucks for AI (And How Vector DBs Fix It) - https://www.youtube.com/watch?v=MSNstHj4rmk
Kubernetes v1.34: Pod Replacement Policy for Jobs Goes GA

Kubernetes v1.34: Pod Replacement Policy for Jobs Goes GA

https://kubernetes.io/blog/2025/09/05/kubernetes-v1-34-pod-replacement-policy-for-jobs-goes-ga/

In Kubernetes v1.34, the Pod replacement policy feature has reached general availability (GA). This blog post describes the Pod replacement policy feature and how to use it in your Jobs.

About Pod Replacement Policy

By default, the Job controller immediately recreates Pods as soon as they fail or begin terminating (when they have a deletion timestamp).

As a result, while some Pods are terminating, the total number of running Pods for a Job can temporarily exceed the specified parallelism. For Indexed Jobs, this can even mean multiple Pods running for the same index at the same time.

This behavior works fine for many workloads, but it can cause problems in certain cases.

For example, popular machine learning frameworks like TensorFlow and JAX expect exactly one Pod per worker index. If two Pods run at the same time, you might encounter errors such as:

/job:worker/task:4: Duplicate task registration with task_name=/job:worker/replica:0/task:4

Additionally, starting replacement Pods before the old ones fully terminate can lead to:

Scheduling delays by kube-scheduler as the nodes remain occupied.

Unnecessary cluster scale-ups to accommodate the replacement Pods.

Temporary bypassing of quota checks by workload orchestrators like Kueue.

With Pod replacement policy, Kubernetes gives you control over when the control plane replaces terminating Pods, helping you avoid these issues.

How Pod Replacement Policy works

This enhancement means that Jobs in Kubernetes have an optional field .spec.podReplacementPolicy.

You can choose one of two policies:

TerminatingOrFailed (default): Replaces Pods as soon as they start terminating.

Failed: Replaces Pods only after they fully terminate and transition to the Failed phase.

Setting the policy to Failed ensures that a new Pod is only created after the previous one has completely terminated.

For Jobs with a Pod Failure Policy, the default podReplacementPolicy is Failed, and no other value is allowed. See Pod Failure Policy to learn more about Pod Failure Policies for Jobs.

You can check how many Pods are currently terminating by inspecting the Job’s .status.terminating field:

kubectl get job myjob -o=jsonpath='{.status.terminating}'

Example

Here’s a Job example that executes a task two times (spec.completions: 2) in parallel (spec.parallelism: 2) and replaces Pods only after they fully terminate (spec.podReplacementPolicy: Failed):

apiVersion: batch/v1
kind: Job
metadata:
  name: example-job
spec:
  completions: 2
  parallelism: 2
  podReplacementPolicy: Failed
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: worker
        image: your-image

If a Pod receives a SIGTERM signal (deletion, eviction, preemption...), it begins terminating. When the container handles termination gracefully, cleanup may take some time.

When the Job starts, we will see two Pods running:

kubectl get pods

NAME                READY   STATUS    RESTARTS   AGE
example-job-qr8kf   1/1     Running   0          2s
example-job-stvb4   1/1     Running   0          2s

Let's delete one of the Pods (example-job-qr8kf).

With the TerminatingOrFailed policy, as soon as one Pod (example-job-qr8kf) starts terminating, the Job controller immediately creates a new Pod (example-job-b59zk) to replace it.

kubectl get pods

NAME                READY   STATUS        RESTARTS   AGE
example-job-b59zk   1/1     Running       0          1s
example-job-qr8kf   1/1     Terminating   0          17s
example-job-stvb4   1/1     Running       0          17s

With the Failed policy, the new Pod (example-job-b59zk) is not created while the old Pod (example-job-qr8kf) is terminating.

kubectl get pods

NAME                READY   STATUS        RESTARTS   AGE
example-job-qr8kf   1/1     Terminating   0          17s
example-job-stvb4   1/1     Running       0          17s

When the terminating Pod has fully transitioned to the Failed phase, a new Pod is created:

kubectl get pods

NAME                READY   STATUS    RESTARTS   AGE
example-job-b59zk   1/1     Running   0          1s
example-job-stvb4   1/1     Running   0          25s

How can you learn more?

Read the user-facing documentation for Pod Replacement Policy, Backoff Limit per Index, and Pod Failure Policy.

Read the KEPs for Pod Replacement Policy, Backoff Limit per Index, and Pod Failure Policy.

Acknowledgments

As with any Kubernetes feature, multiple people contributed to getting this done, from testing and filing bugs to reviewing code.

As this feature moves to stable after 2 years, we would like to thank the following people:

Kevin Hannon - for writing the KEP and the initial implementation.

Michał Woźniak - for guidance, mentorship, and reviews.

Aldo Culquicondor - for guidance, mentorship, and reviews.

Maciej Szulik - for guidance, mentorship, and reviews.

Dejan Zele Pejchev - for taking over the feature and promoting it from Alpha through Beta to GA.

Get involved

This work was sponsored by the Kubernetes batch working group in close collaboration with the SIG Apps community.

If you are interested in working on new features in the space we recommend subscribing to our Slack channel and attending the regular community meetings.

via Kubernetes Blog https://kubernetes.io/

September 05, 2025 at 02:30PM

·kubernetes.io·
Kubernetes v1.34: Pod Replacement Policy for Jobs Goes GA
Should AI Get Legal Rights?
Model welfare is an emerging field of research that seeks to determine whether AI is conscious and, if so, how humanity should respond.
·wired.com·
Should AI Get Legal Rights?
PSI Metrics for Kubernetes Graduates to Beta

PSI Metrics for Kubernetes Graduates to Beta

https://kubernetes.io/blog/2025/09/04/kubernetes-v1-34-introducing-psi-metrics-beta/

As Kubernetes clusters grow in size and complexity, understanding the health and performance of individual nodes becomes increasingly critical. We are excited to announce that as of Kubernetes v1.34, Pressure Stall Information (PSI) Metrics has graduated to Beta.

What is Pressure Stall Information (PSI)?

Pressure Stall Information (PSI) is a feature of the Linux kernel (version 4.20 and later) that provides a canonical way to quantify pressure on infrastructure resources, in terms of whether demand for a resource exceeds current supply. It moves beyond simple resource utilization metrics and instead measures the amount of time that tasks are stalled due to resource contention. This is a powerful way to identify and diagnose resource bottlenecks that can impact application performance.

PSI exposes metrics for CPU, memory, and I/O, categorized as either some or full pressure:

some: The percentage of time that at least one task is stalled on a resource. This indicates some level of resource contention.

full: The percentage of time that all non-idle tasks are stalled on a resource simultaneously. This indicates a more severe resource bottleneck.

PSI: 'Some' vs. 'Full' Pressure

These metrics are aggregated over 10-second, 1-minute, and 5-minute rolling windows, providing a comprehensive view of resource pressure over time.
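On a Linux node you can inspect the raw data these metrics are built from under /proc/pressure/. A small sketch of the line format and how to pull one field out of it (the numbers below are made up for illustration):

```shell
# Shape of a line from /proc/pressure/memory (the "some" row; values invented):
line='some avg10=1.23 avg60=0.50 avg300=0.10 total=123456'

# Extract the 10-second rolling average for "some" memory pressure:
echo "$line" | grep -o 'avg10=[0-9.]*' | cut -d= -f2   # prints 1.23
```

The avg10/avg60/avg300 fields are the three rolling windows described above; total is the cumulative stalled time in microseconds.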

PSI metrics in Kubernetes

With the KubeletPSI feature gate enabled, the kubelet can now collect PSI metrics from the Linux kernel and expose them through two channels: the Summary API and the /metrics/cadvisor Prometheus endpoint. This allows you to monitor and alert on resource pressure at the node, pod, and container level.

The following new metrics are available in Prometheus exposition format via /metrics/cadvisor:

container_pressure_cpu_stalled_seconds_total

container_pressure_cpu_waiting_seconds_total

container_pressure_memory_stalled_seconds_total

container_pressure_memory_waiting_seconds_total

container_pressure_io_stalled_seconds_total

container_pressure_io_waiting_seconds_total

These metrics, along with the data from the Summary API, provide a granular view of resource pressure, enabling you to pinpoint the source of performance issues and take corrective action. For example, you can use these metrics to:

Identify memory leaks: A steadily increasing some pressure for memory can indicate a memory leak in an application.

Optimize resource requests and limits: By understanding the resource pressure of your workloads, you can more accurately tune their resource requests and limits.

Autoscale workloads: You can use PSI metrics to trigger autoscaling events, ensuring that your workloads have the resources they need to perform optimally.

How to enable PSI metrics

To enable PSI metrics in your Kubernetes cluster, you need to:

Ensure your nodes are running a Linux kernel version 4.20 or later and are using cgroup v2.

Enable the KubeletPSI feature gate on the kubelet.
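As a sketch, assuming you manage the kubelet through its configuration file, step 2 amounts to setting the gate in the featureGates field of the KubeletConfiguration (passing --feature-gates=KubeletPSI=true on the kubelet command line is an alternative):

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
featureGates:
  KubeletPSI: true
```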

Once enabled, you can start scraping the /metrics/cadvisor endpoint with your Prometheus-compatible monitoring solution or query the Summary API to collect and visualize the new PSI metrics. Note that PSI is a Linux-kernel feature, so these metrics are not available on Windows nodes. Your cluster can contain a mix of Linux and Windows nodes, and on the Windows nodes the kubelet does not expose PSI metrics.

What's next?

We are excited to bring PSI metrics to the Kubernetes community and look forward to your feedback. As a beta feature, we are actively working on improving and extending this functionality towards a stable GA release. We encourage you to try it out and share your experiences with us.

To learn more about PSI metrics, check out the official Kubernetes documentation. You can also get involved in the conversation on the #sig-node Slack channel.

via Kubernetes Blog https://kubernetes.io/

September 04, 2025 at 02:30PM

·kubernetes.io·
PSI Metrics for Kubernetes Graduates to Beta
Last Week in Kubernetes Development - Week Ending August 21 2025

Week Ending August 21, 2025

https://lwkd.info/2025/20250904

Developer News

The Kubernetes Steering Committee 2025 election is open for four seats. Candidate nominations are due by September 8 and voting begins on September 10. Voting will be conducted through Elekto using GitHub login, where you can also verify your voter eligibility. The election ends October 24, and results will be announced on November 5.

Equinix Metal platform will shut down on June 30, 2026, so SIG Cloud Provider will deprecate cloud-provider-equinix-metal. The repo will be updated to Kubernetes 1.34, maintained with fixes and tests, and archived after the 1.37 release.

The KubeCon North America 2025 Maintainer Summit schedule is out.

Release Schedule

Next Deadline: 1.35 Release Cycle Starts, September

We are between release cycles right now. The 1.35 cycle will start in September. Watch the Dev mailing list for the call for release shadows.

The cherry-pick deadline for the next set of patch releases is this Friday.

Featured PRs

132798: Show simple values in validation rule failures

This PR improves error messages from CEL validation in CRDs. Previously, failures displayed the field type (for example, "string") instead of the value that caused the failure. Now, when the value is a number, boolean, or string, the error message shows that value, making validation errors clearer and easier to understand.

133323: Make kubectl auth reconcile retry on conflict

This PR improves the kubectl auth reconcile command. Previously, if the command tried to update an object and hit a conflict (for example, because another change happened at the same time), it would fail immediately. Now it retries when a conflict occurs, making the command more reliable when multiple updates happen concurrently.

Other Merges

Use consistent documentation of aliases in API

Improve shell completion for api resources

Drop experimental prefix from kubectl wait command

Remove ListType marker from non-list field

Move GetAffinityTerms functions to staging repo

kube-proxy iptables logging now displays correctly

PodFailurePolicy conditions no longer require explicit status

Add resourceClaimModified to bindClaim update assume cache

Gate Storage version migration behind RealFIFO to prevent possible race conditions.

Improve godoc by enabling accurate deprecation warnings

Validate flush frequency is positive

Skip PreEnqueue when pod reactivated from backoffQ

Add conversion for timeoutForControlPlane field

Optimize calculatePodRequests for specific container lookups

Make DeleteOptions decode return 400 instead of 500

Enable KYAML gate by default

Improve conversion-gen handling of unexported fields and pointer conversions

Make kubectl auth reconcile retry on conflict

Store WithContext ctx in a wrapper to avoid conflict

Extend applyconfiguration-gen to generate extract functions for all subresources, not just status.

Report actionable error when GC fails due to disk pressure

Increment metric for duplicate validation errors

Remove duplicate RBAC resources update validations

Prevent race in scheduler integration test

Resolve kubectl writing current-context to the wrong kubeconfig file when using multiple kubeconfig files

Enable multiple volume references to a single PVC

Promote VAC API test to conformance

Deprecated

Removed deprecated gogo protocol definitions from k8s.io/kubelet/pkg/apis/dra in favor of google.golang.org/protobuf.

Remove StatefulSetAutoDeletePVC after feature GA-ed in 1.32

Remove OuterVolumeSpecName from ASW

Version Updates

Bumped cri-tools to v1.34.0

Update CoreDNS to v1.12.3

Subprojects and Dependency Updates

cloud-provider-vSphere v1.34.0 adds daemonset volumes, shared sessions, fixes service/tag issues, and updates Go, CAPI, CAPV, and Kubernetes.

cluster-api v1.11.1 extends Kubernetes support to v1.34 for both management and workload clusters

cluster-api-provider-vsphere v1.14.0 upgrades to CAPI v1.11, Go 1.24, and adds multi-networking for NSX-VPC and vSphere providers

CRI-O v1.33.4 fixes CNI teardown, validates memory limits, pulls OCI images earlier and adds hostnetwork info

Ingress-NGINX v1.13.2 fixes nginx_ingress_controller_config_last_reload_successful metrics and hardens socket security; Helm Chart v4.13.2 updates to controller v1.13.2 and bumps Kube Webhook CertGen.

kind v0.30.0 contains patched dependencies and Kubernetes 1.34, as well as a bugfix for Kubernetes v1.33.0+ cluster reboots

kOps v1.33.1 adds Debian 13 support, fixes Amazon Linux 2 and CoreDNS issues, and updates Kubernetes hashes

Shoutouts

Rajalakshmi Girish: A big shout-out to the Kubernetes v1.34 Release Signal Team! @adil @ChengHao Yang (tico88612) @elieser1101 @Prajyot Parab @Sarthak Negi It has been an incredible journey with such a dedicated and committed group throughout this cycle. Experienced members supported and guided the new ones, while the newcomers showed eagerness and openness to learn. This team consistently showed up with the highest attendance in release team calls, whether it was the weekly syncs or burndown meetings. From diligently updating meeting notes, giving timely Go/No-Go signals for release cuts, and collaborating without a hitch, every member stepped up and delivered flawlessly. Despite busy schedules—whether balancing organizational responsibilities or internship commitments, everyone fulfilled their role with remarkable dedication. Our direct chat group reflected the unity and support within the team, always backing each other up whenever needed. Kudos to each of you. I am proud to have led such an energetic, collaborative, and committed team!

Vyom Yadav: Kubernetes v1.34 is shipped It was an absolute pleasure to be a part of this journey across the ocean, which wouldn’t have been possible without my fellow sailors. Lead Shadows: @Wendy Ha @Sreeram Venkitesh @Ryota @dchan - I felt very comfortable knowing I had y’all to help me steer this ship and proactively check the state of things on your own! Enhancements: @Jenny Shu @Drew Hagen @rayandas @Faeka Ansari @Sean McGinnis @jmickey - Enhancements gets quite busy early on in the cycle and it’s due to your efforts that we’ve 58 strong enhancements this cycle and a very well rounded Kubernetes release. Comms: @aibarbetta @Alejandro Leon @Dipesh @Graziano Casto @Melony Q. (aka.cloudmelon ) - Going through all the enhancements to select a few is quite daunting, especially when there are about 75 of them before the code freeze, y’all did an amazing job highlighting the enhancements we’ve and coordinating with CNCF to get things done on time. Release Signal: @Rajalakshmi Girish @ChengHao Yang (tico88612) @elieser1101 @Prajyot Parab @adil @Sarthak Negi - The flake that we find just before the release cut is always there, but the way you navigated those (and the structure of communication) to not cause any delay to the release is commendable. Docs: @Michelle Nguyen @Urvashi @Arvind Parekh @YuJen Huang(Dylan) @DangerBuff @Rashan - Docs is a team that’s busy during the complete cycle, from enforcing KEPs to have docs to managing release notes, when we’ve inherited some rough winds is a job well done. Branch Management: @Matteo (away until Jan ‘26) @Drew Hagen @Angelos Kolaitis @satyampsoni - Thank you for actually shipping Kubernetes (literally), and all the improvements you have been making to the process. and a very special thank you to @Kat Cosgrove and @fsmunoz for all the guidance and being there, jumping in when I required help, and to all SIG leads, tech leads, contributors for helping us ship this release. 
I’ve a lot to say about this cycle and the release team. I joined back in v1.27 and every cycle I’ve learned, grown, made friends and just enjoyed myself working to ship one the largest open source projects on this planet which is a no small feat and y’all should be incredibly proud. Feels absolutely nostalgic and I can’t thank everyone enough whom I’ve worked with (this or previous cycles) and it was an honor to be a part of this crew and steering the ship this release (until the next time!)

Michelle Nguyen: A big shout-out to Kubernetes v1.34 Docs team! @Rashan @Urvashi @Arvind Parekh @YuJen Huang(Dylan) @DangerBuff You all consistently went above and beyond—whether updating meeting notes meticulously, tracking down docs, or supporting each other with krel tasks. Every single person delivered exceptional work without fail. Thanks to you all, our release was smooth, especially from a docs perspective! A special shoutout to @Drew Hagen for helping Docs out during Docs Freeze. You absolutely rock! I’m incredibly proud of what we’ve accomplished as a team and am extremely grateful for the opportunity to work alongside everyone.

Agustina Barbetta: As we wrap up post-release communications for v1.34, I want to give a big shoutout to the Kubernetes v1.34 Comms team: @Dipesh @Graziano Casto @Alejandro Leon @Melony Q. (aka.cloudmelon ) Comms gets more challenging as the cycle progresses, but you’ve consistently stepped up and tackled everything from a quick outreach to major writing tasks. The second half of the cycle saw us publish 2 blogs, one of which highlighted 44 SIG features, while also reviewing 18 Feature Blogs that are currently rolling out. And through it all, we stayed on track with every commitment in the v1.34 timeline. Thank you for making v1.34 communications a huge success!

via Last Week in Kubernetes Development https://lwkd.info/

September 04, 2025 at 07:52AM

·lwkd.info·
Last Week in Kubernetes Development - Week Ending August 21 2025
Kubernetes v1.34: Service Account Token Integration for Image Pulls Graduates to Beta
Kubernetes v1.34: Service Account Token Integration for Image Pulls Graduates to Beta

Kubernetes v1.34: Service Account Token Integration for Image Pulls Graduates to Beta

https://kubernetes.io/blog/2025/09/03/kubernetes-v1-34-sa-tokens-image-pulls-beta/

The Kubernetes community continues to advance security best practices by reducing reliance on long-lived credentials. Following the successful alpha release in Kubernetes v1.33, Service Account Token Integration for Kubelet Credential Providers has now graduated to beta in Kubernetes v1.34, bringing us closer to eliminating long-lived image pull secrets from Kubernetes clusters.

This enhancement allows credential providers to use workload-specific service account tokens to obtain registry credentials, providing a secure, ephemeral alternative to traditional image pull secrets.

What's new in beta?

The beta graduation brings several important changes that make the feature more robust and production-ready:

Required cacheType field

Breaking change from alpha: The cacheType field is required in the credential provider configuration when using service account tokens. This field is new in beta and must be specified to ensure proper caching behavior.

CAUTION: this is not a complete configuration example, just a reference for the 'tokenAttributes.cacheType' field.

tokenAttributes:
  serviceAccountTokenAudience: "my-registry-audience"
  cacheType: "ServiceAccount" # Required field in beta
  requireServiceAccount: true

Choose between two caching strategies:

Token: Cache credentials per service account token (use when credential lifetime is tied to the token). This is useful when the credential provider transforms the service account token into registry credentials with the same lifetime as the token, or when registries support Kubernetes service account tokens directly. Note: The kubelet cannot send service account tokens directly to registries; credential provider plugins are needed to transform tokens into the username/password format expected by registries.

ServiceAccount: Cache credentials per service account identity (use when credentials are valid for all pods using the same service account)

Isolated image pull credentials

The beta release provides stronger security isolation for container images when using service account tokens for image pulls. It ensures that pods can only access images that were pulled using ServiceAccounts they're authorized to use. This prevents unauthorized access to sensitive container images and enables granular access control where different workloads can have different registry permissions based on their ServiceAccount.

When credential providers use service account tokens, the system tracks ServiceAccount identity (namespace, name, and UID) for each pulled image. When a pod attempts to use a cached image, the system verifies that the pod's ServiceAccount matches exactly with the ServiceAccount that was used to originally pull the image.

Administrators can revoke access to previously pulled images by deleting and recreating the ServiceAccount, which changes the UID and invalidates cached image access.

For more details about this capability, see the image pull credential verification documentation.

How it works

Configuration

Credential providers opt into using ServiceAccount tokens by configuring the tokenAttributes field:

# CAUTION: this is an example configuration.
# Do not use this for your own cluster!
apiVersion: kubelet.config.k8s.io/v1
kind: CredentialProviderConfig
providers:
- name: my-credential-provider
  matchImages:
  - "*.myregistry.io/*"
  defaultCacheDuration: "10m"
  apiVersion: credentialprovider.kubelet.k8s.io/v1
  tokenAttributes:
    serviceAccountTokenAudience: "my-registry-audience"
    cacheType: "ServiceAccount" # New in beta
    requireServiceAccount: true
    requiredServiceAccountAnnotationKeys:
    - "myregistry.io/identity-id"
    optionalServiceAccountAnnotationKeys:
    - "myregistry.io/optional-annotation"

Image pull flow

At a high level, kubelet coordinates with your credential provider and the container runtime as follows:

When the image is not present locally:

kubelet checks its credential cache using the configured cacheType (Token or ServiceAccount)

If needed, kubelet requests a ServiceAccount token for the pod's ServiceAccount and passes it, plus any required annotations, to the credential provider

The provider exchanges that token for registry credentials and returns them to kubelet

kubelet caches credentials per the cacheType strategy and pulls the image with those credentials

kubelet records the ServiceAccount coordinates (namespace, name, UID) associated with the pulled image for later authorization checks

When the image is already present locally:

kubelet verifies the pod's ServiceAccount coordinates match the coordinates recorded for the cached image

If they match exactly, the cached image can be used without pulling from the registry

If they differ, kubelet performs a fresh pull using credentials for the new ServiceAccount

With image pull credential verification enabled:

Authorization is enforced using the recorded ServiceAccount coordinates, ensuring pods only use images pulled by a ServiceAccount they are authorized to use

Administrators can revoke access by deleting and recreating a ServiceAccount; the UID changes and previously recorded authorization no longer matches

Audience restriction

The beta release builds on service account node audience restriction (beta since v1.33) to ensure kubelet can only request tokens for authorized audiences. Administrators configure allowed audiences using RBAC to enable kubelet to request service account tokens for image pulls:

# CAUTION: this is an example configuration.
# Do not use this for your own cluster!
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: kubelet-credential-provider-audiences
rules:
- verbs: ["request-serviceaccounts-token-audience"]
  apiGroups: [""]
  resources: ["my-registry-audience"]
  resourceNames: ["registry-access-sa"] # Optional: specific SA

Getting started with beta

Prerequisites

Kubernetes v1.34 or later

Feature gate enabled: KubeletServiceAccountTokenForCredentialProviders=true (beta, enabled by default)

Credential provider support: Update your credential provider to handle ServiceAccount tokens

Migration from alpha

If you're already using the alpha version, the migration to beta requires minimal changes:

Add cacheType field: Update your credential provider configuration to include the required cacheType field

Review caching strategy: Choose between Token and ServiceAccount cache types based on your provider's behavior

Test audience restrictions: Ensure your RBAC configuration, or other cluster authorization rules, will properly restrict token audiences

Example setup

Here's a complete example for setting up a credential provider with service account tokens (this example assumes your cluster uses RBAC authorization):

# CAUTION: this is an example configuration.
# Do not use this for your own cluster!

# Service Account with registry annotations
apiVersion: v1
kind: ServiceAccount
metadata:
  name: registry-access-sa
  namespace: default
  annotations:
    myregistry.io/identity-id: "user123"
---
# RBAC for audience restriction
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: registry-audience-access
rules:
- verbs: ["request-serviceaccounts-token-audience"]
  apiGroups: [""]
  resources: ["my-registry-audience"]
  resourceNames: ["registry-access-sa"] # Optional: specific ServiceAccount
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: kubelet-registry-audience
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: registry-audience-access
subjects:
- kind: Group
  name: system:nodes
  apiGroup: rbac.authorization.k8s.io
---
# Pod using the ServiceAccount
apiVersion: v1
kind: Pod
metadata:
  name: my-pod
spec:
  serviceAccountName: registry-access-sa
  containers:
  - name: my-app
    image: myregistry.example/my-app:latest

What's next?

For Kubernetes v1.35, we - Kubernetes SIG Auth - expect the feature to stay in beta, and we will continue to solicit feedback.

You can learn more about this feature on the service account token for image pulls page in the Kubernetes documentation.

You can also follow along on the KEP-4412 to track progress across the coming Kubernetes releases.

Call to action

In this blog post, I have covered the beta graduation of ServiceAccount token integration for Kubelet Credential Providers in Kubernetes v1.34. I discussed the key improvements, including the required cacheType field and enhanced integration with Ensure Secret Pull Images.

We have been receiving positive feedback from the community during the alpha phase and would love to hear more as we stabilize this feature for GA. In particular, we would like feedback from credential provider implementors as they integrate with the new beta API and caching mechanisms. Please reach out to us on the #sig-auth-authenticators-dev channel on Kubernetes Slack.

How to get involved

If you are interested in getting involved in the development of this feature, share feedback, or participate in any other ongoing SIG Auth projects, please reach out on the #sig-auth channel on Kubernetes Slack.

You are also welcome to join the bi-weekly SIG Auth meetings, held every other Wednesday.

via Kubernetes Blog https://kubernetes.io/

September 03, 2025 at 02:30PM

·kubernetes.io·
Kubernetes v1.34: Service Account Token Integration for Image Pulls Graduates to Beta
Impossible Puzzles Quantum Computers Dream of Solving
Impossible Puzzles Quantum Computers Dream of Solving
Imagine you’re the mayor of a bustling city. Every day you face decisions that seem impossible: how to schedule buses so no one waits too long, how to route traffic without creating new jams, how to balance electricity demand when everyone cranks up their air conditioning at once.
·linkedin.com·
Impossible Puzzles Quantum Computers Dream of Solving
Kubernetes v1.34: Introducing CPU Manager Static Policy Option for Uncore Cache Alignment
Kubernetes v1.34: Introducing CPU Manager Static Policy Option for Uncore Cache Alignment

Kubernetes v1.34: Introducing CPU Manager Static Policy Option for Uncore Cache Alignment

https://kubernetes.io/blog/2025/09/02/kubernetes-v1-34-prefer-align-by-uncore-cache-cpumanager-static-policy-optimization/

A new CPU Manager Static Policy Option called prefer-align-cpus-by-uncorecache was introduced in Kubernetes v1.32 as an alpha feature, and has graduated to beta in Kubernetes v1.34. This CPU Manager Policy Option is designed to optimize performance for specific workloads running on processors with a split uncore cache architecture. In this article, I'll explain what that means and why it's useful.

Understanding the feature

What is uncore cache?

Until relatively recently, nearly all mainstream computer processors had a monolithic last-level cache that was shared across every core in a multi-CPU package. This monolithic cache is also referred to as uncore cache (because it is not linked to a specific core), or as Level 3 cache. Alongside the Level 3 cache, there is other cache, commonly called Level 1 and Level 2 cache, that is associated with a specific CPU core.

In order to reduce access latency between the CPU cores and their cache, recent AMD64 and ARM architecture based processors have introduced a split uncore cache architecture, where the last-level cache is divided into multiple physical caches that are aligned to specific CPU groupings within the physical package. The shorter distances within the CPU package help to reduce latency.

Kubernetes is able to place workloads in a way that accounts for the cache topology within the CPU package(s).

Cache-aware workload placement

The matrix below shows the CPU-to-CPU latency measured in nanoseconds (lower is better) when passing a packet between CPUs, via its cache coherence protocol on a processor that uses split uncore cache. In this example, the processor package consists of 2 uncore caches. Each uncore cache serves 8 CPU cores.

Blue entries in the matrix represent latency between CPUs sharing the same uncore cache, while grey entries indicate latency between CPUs corresponding to different uncore caches. Latency between CPUs that correspond to different caches are higher than the latency between CPUs that belong to the same cache.

With prefer-align-cpus-by-uncorecache enabled, the static CPU Manager attempts to allocate CPU resources for a container such that all CPUs assigned to the container share the same uncore cache. This policy operates on a best-effort basis, aiming to minimize the distribution of a container's CPU resources across uncore caches, based on the container's requirements and the allocatable resources on the node.

By running a workload, where possible, on a set of CPUs that spans the smallest feasible number of uncore caches, applications benefit from reduced cache latency (as seen in the matrix above) and from reduced contention with other workloads, which can result in higher overall throughput. The benefit only shows up if your nodes' processors use a split uncore cache topology.

The following diagram illustrates uncore cache alignment when the feature is enabled.

By default, Kubernetes does not account for uncore cache topology; containers are assigned CPU resources using a packed methodology. As a result, Container 1 and Container 2 can experience a noisy neighbor impact due to cache access contention on Uncore Cache 0. Additionally, Container 2 will have CPUs distributed across both caches which can introduce a cross-cache latency.

With prefer-align-cpus-by-uncorecache enabled, each container is isolated on an individual cache. This resolves the cache contention between the containers and minimizes the cache latency for the CPUs being utilized.

Use cases

Common use cases can include telco applications like vRAN, Mobile Packet Core, and Firewalls. It's important to note that the optimization provided by prefer-align-cpus-by-uncorecache can be dependent on the workload. For example, applications that are memory bandwidth bound may not benefit from uncore cache alignment, as utilizing more uncore caches can increase memory bandwidth access.

Enabling the feature

To enable this feature, set the CPU Manager Policy to static and enable the CPU Manager Policy Options with prefer-align-cpus-by-uncorecache.

For Kubernetes 1.34, the feature is in the beta stage and requires the CPUManagerPolicyBetaOptions feature gate to also be enabled.

Append the following to the kubelet configuration file:

kind: KubeletConfiguration
apiVersion: kubelet.config.k8s.io/v1beta1
featureGates:
  ...
  CPUManagerPolicyBetaOptions: true
cpuManagerPolicy: "static"
cpuManagerPolicyOptions:
  prefer-align-cpus-by-uncorecache: "true"
reservedSystemCPUs: "0"
...

If you're making this change to an existing node, remove the cpu_manager_state file and then restart kubelet.

prefer-align-cpus-by-uncorecache can be enabled on nodes with a monolithic uncore cache processor. The feature will mimic a best-effort socket alignment effect and will pack CPU resources on the socket similar to the default static CPU Manager policy.
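One operational detail worth remembering (not spelled out in the post above): the static CPU Manager only grants exclusive CPUs to containers in Guaranteed QoS pods that request whole CPUs, so uncore-cache alignment applies to pods shaped roughly like this sketch (the pod name and image are hypothetical):

```yaml
# Sketch of a pod the static CPU Manager can pin exclusive CPUs for:
# Guaranteed QoS (requests equal limits) and an integer CPU count.
apiVersion: v1
kind: Pod
metadata:
  name: uncore-aligned-app        # hypothetical name
spec:
  containers:
  - name: app
    image: registry.example/app:latest   # hypothetical image
    resources:
      requests:
        cpu: "8"
        memory: "4Gi"
      limits:
        cpu: "8"
        memory: "4Gi"
```

With prefer-align-cpus-by-uncorecache enabled, the kubelet would try to pick those 8 CPUs from a single uncore cache where capacity allows.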

Further reading

See Node Resource Managers to learn more about the CPU Manager and the available policies.

Reference the documentation for prefer-align-cpus-by-uncorecache here.

Please see the Kubernetes Enhancement Proposal for more information on how prefer-align-cpus-by-uncorecache is implemented.

Getting involved

This feature is driven by SIG Node. If you are interested in helping develop this feature, sharing feedback, or participating in any other ongoing SIG Node projects, please attend the SIG Node meeting for more details.

via Kubernetes Blog https://kubernetes.io/

September 02, 2025 at 02:30PM

·kubernetes.io·
Kubernetes v1.34: Introducing CPU Manager Static Policy Option for Uncore Cache Alignment
Solving Cold Starts: Uses Istio to Warm Up Java Pods with Frédéric Gaudet
Solving Cold Starts: Uses Istio to Warm Up Java Pods with Frédéric Gaudet

Solving Cold Starts: Uses Istio to Warm Up Java Pods, with Frédéric Gaudet

https://ku.bz/grxcypt9j

If you're running Java applications in Kubernetes, you've likely experienced the pain of slow pod startups affecting user experience during deployments and scaling events.

Frédéric Gaudet, Senior SRE at BlaBlaCar, shares how his team solved the cold start problem for their 1,500 Java microservices using Istio's warm-up capabilities.

You will learn:

Why Java applications struggle with cold starts and how JIT compilation affects initial request latency in Kubernetes environments

How Istio's warm-up feature works to gradually ramp up traffic to new pods

Why other common solutions fail, including resource over-provisioning, init containers, and tools like GraalVM

Real production impact from implementing this solution, including dramatic improvements in message moderation SLOs at BlaBlaCar's scale of 4,000 pods

Sponsor

This episode is brought to you by Testkube—the ultimate Continuous Testing Platform for Cloud Native applications. Scale fast, test continuously, and ship confidently. Check it out at testkube.io

More info

Find all the links and info for this episode here: https://ku.bz/grxcypt9j

Interested in sponsoring an episode? Learn more.

via KubeFM https://kube.fm

September 02, 2025 at 06:00AM

·kube.fm·
Solving Cold Starts: Uses Istio to Warm Up Java Pods with Frédéric Gaudet
Kubernetes v1.34: DRA has graduated to GA
Kubernetes v1.34: DRA has graduated to GA

Kubernetes v1.34: DRA has graduated to GA

https://kubernetes.io/blog/2025/09/01/kubernetes-v1-34-dra-updates/

Kubernetes 1.34 is here, and it has brought a huge wave of enhancements for Dynamic Resource Allocation (DRA)! This release marks a major milestone with many APIs in the resource.k8s.io group graduating to General Availability (GA), unlocking the full potential of how you manage devices on Kubernetes. On top of that, several key features have moved to beta, and a fresh batch of new alpha features promise even more expressiveness and flexibility.

Let's dive into what's new for DRA in Kubernetes 1.34!

The core of DRA is now GA

The headline feature of the v1.34 release is that the core of DRA has graduated to General Availability.

Kubernetes Dynamic Resource Allocation (DRA) provides a flexible framework for managing specialized hardware and infrastructure resources, such as GPUs or FPGAs. DRA provides APIs that enable each workload to specify the properties of the devices it needs, but leaving it to the scheduler to allocate actual devices, allowing increased reliability and improved utilization of expensive hardware.

With the graduation to GA, DRA is stable and will be part of Kubernetes for the long run. The community can still expect a steady stream of new features being added to DRA over the next several Kubernetes releases, but they will not make any breaking changes to DRA. So users and developers of DRA drivers can start adopting DRA with confidence.

Starting with Kubernetes 1.34, DRA is enabled by default; the DRA features that have reached beta are also enabled by default. That's because the default API version for DRA is now the stable v1 version, rather than the earlier versions (for example, v1beta1 or v1beta2) that needed explicit opt-in.
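For orientation, a minimal claim against the stable API might look like the following sketch; the device class name is hypothetical, and field names should be checked against the v1 API reference:

```yaml
# Hedged sketch: a claim for one device via the stable resource.k8s.io/v1 API.
apiVersion: resource.k8s.io/v1
kind: ResourceClaim
metadata:
  name: single-gpu
spec:
  devices:
    requests:
    - name: gpu
      exactly:
        deviceClassName: example.com-gpu   # hypothetical class from your DRA driver
---
# A pod referencing the claim and consuming it in a container.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  resourceClaims:
  - name: gpu-claim
    resourceClaimName: single-gpu
  containers:
  - name: app
    image: registry.example/app:latest     # hypothetical image
    resources:
      claims:
      - name: gpu-claim
```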

Features promoted to beta

Several powerful features have been promoted to beta, adding more control, flexibility, and observability to resource management with DRA.

Admin access labelling has been updated. In v1.34, you can restrict device support to people (or software) authorized to use it. This is meant as a way to avoid privilege escalation if a DRA driver grants additional privileges when admin access is requested and to avoid accessing devices which are in use by normal applications, potentially in another namespace. The restriction works by ensuring that only users with access to a namespace with the resource.k8s.io/admin-access: "true" label are authorized to create ResourceClaim or ResourceClaimTemplates objects with the adminAccess field set to true. This ensures that non-admin users cannot misuse the feature.
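A hedged sketch of that gate, with a hypothetical device class and invented object names:

```yaml
# Sketch: the namespace label gates who may set adminAccess on a claim.
apiVersion: v1
kind: Namespace
metadata:
  name: dra-admins                          # hypothetical namespace
  labels:
    resource.k8s.io/admin-access: "true"    # required for adminAccess claims
---
apiVersion: resource.k8s.io/v1
kind: ResourceClaim
metadata:
  name: diagnostics                         # hypothetical claim
  namespace: dra-admins
spec:
  devices:
    requests:
    - name: gpu
      exactly:
        deviceClassName: example.com-gpu    # hypothetical device class
        adminAccess: true                   # rejected outside labelled namespaces
```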

Prioritized list lets users specify a list of acceptable devices for their workloads, rather than just a single type of device. So while the workload might run best on a single high-performance GPU, it might also be able to run on 2 mid-level GPUs. The scheduler will attempt to satisfy the alternatives in the list in order, so the workload will be allocated the best set of devices available on the node.
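A sketch of such an alternatives list (device class names are hypothetical; verify the exact field names against the v1 API reference):

```yaml
# Sketch: prefer one large GPU, fall back to two mid-level GPUs.
apiVersion: resource.k8s.io/v1
kind: ResourceClaim
metadata:
  name: flexible-gpu
spec:
  devices:
    requests:
    - name: gpu
      firstAvailable:                        # tried in order
      - name: big-gpu
        deviceClassName: example.com-large-gpu   # hypothetical class
        allocationMode: ExactCount
        count: 1
      - name: mid-gpus
        deviceClassName: example.com-mid-gpu     # hypothetical class
        allocationMode: ExactCount
        count: 2
```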

The kubelet's API has been updated to report on Pod resources allocated through DRA. This allows node monitoring agents to know the allocated DRA resources for Pods on a node and makes it possible to use the DRA information in the PodResources API to develop new features and integrations.

New alpha features

Kubernetes 1.34 also introduces several new alpha features that give us a glimpse into the future of resource management with DRA.

Extended resource mapping support in DRA allows cluster administrators to advertise DRA-managed resources as extended resources, allowing developers to consume them using the familiar, simpler request syntax while still benefiting from dynamic allocation. This makes it possible for existing workloads to start using DRA without modifications, simplifying the transition to DRA for both application developers and cluster administrators.

Consumable capacity introduces a flexible device sharing model where multiple, independent resource claims from unrelated pods can each be allocated a share of the same underlying physical device. This new capability is managed through optional, administrator-defined sharing policies that govern how a device's total capacity is divided and enforced by the platform for each request. This allows for sharing of devices in scenarios where pre-defined partitions are not viable. A blog about this feature is coming soon.

Binding conditions improve scheduling reliability for certain classes of devices by allowing the Kubernetes scheduler to delay binding a pod to a node until its required external resources, such as attachable devices or FPGAs, are confirmed to be fully prepared. This prevents premature pod assignments that could lead to failures and ensures more robust, predictable scheduling by explicitly modeling resource readiness before the pod is committed to a node.

Resource health status for DRA improves observability by exposing the health status of devices allocated to a Pod via Pod Status. This works whether the device is allocated through DRA or Device Plugin. This makes it easier to understand the cause of an unhealthy device and respond properly. A blog about this feature is coming soon.

What’s next?

While DRA got promoted to GA this cycle, the hard work on DRA doesn't stop. There are several features in alpha and beta that we plan to bring to GA in the next couple of releases and we are looking to continue to improve performance, scalability and reliability of DRA. So expect an equally ambitious set of features in DRA for the 1.35 release.

Getting involved

A good starting point is joining the WG Device Management Slack channel and meetings, which happen at US/EU and EU/APAC friendly time slots.

Not all enhancement ideas are tracked as issues yet, so come talk to us if you want to help or have some ideas yourself! We have work to do at all levels, from difficult core changes to usability enhancements in kubectl, which could be picked up by newcomers.

Acknowledgments

A huge thanks to the new contributors to DRA this cycle:

Alay Patel (alaypatel07)

Gaurav Kumar Ghildiyal (gauravkghildiyal)

JP (Jpsassine)

Kobayashi Daisuke (KobayashiD27)

Laura Lorenz (lauralorenz)

Sunyanan Choochotkaew (sunya-ch)

Swati Gupta (guptaNswati)

Yu Liao (yliaog)

via Kubernetes Blog https://kubernetes.io/

September 01, 2025 at 02:30PM

·kubernetes.io·
Kubernetes v1.34: DRA has graduated to GA
AI & DevOps Toolkit - Stop Blaming AI: Vector DBs + RAG = Game Changer - https://www.youtube.com/watch?v=zqpJr1qZhTg
AI & DevOps Toolkit - Stop Blaming AI: Vector DBs + RAG = Game Changer - https://www.youtube.com/watch?v=zqpJr1qZhTg

Stop Blaming AI: Vector DBs + RAG = Game Changer

Still think AI "doesn't work" because it hallucinates about your codebase and infrastructure? The problem isn't AI – it's you. You're asking AI about information it never had access to, then acting surprised when it makes things up. This video reveals the uncomfortable truth about why your AI experiments failed and shows you exactly how to fix them using vector databases and RAG (Retrieval-Augmented Generation).

Learn how to transform AI from a generic assistant that invents procedures and suggests deprecated APIs into one that knows your actual policies, architectural decisions, and operational standards. We'll explore why traditional APIs aren't designed for AI's semantic queries, how vector databases enable meaning-based search instead of keyword matching, and how RAG grounds AI responses in your real documentation. Plus, get a hands-on demonstration using Qdrant vector database to semantically search organizational knowledge. Stop blaming the technology and start implementing AI that actually understands your organization.
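The core mechanics the video describes, meaning-based retrieval plus grounding the prompt in retrieved documents, can be sketched in a few lines of plain Python. This is a conceptual illustration only: the documents and hand-written three-dimensional "embeddings" below are invented, and in practice the vectors would come from an embedding model and be stored in a vector database such as Qdrant rather than in a dict.

```python
import math

# Toy corpus of organizational docs with hand-written "embedding" vectors.
# Real systems use model-generated embeddings stored in a vector database.
DOCS = {
    "Deployments must set resource limits per team policy.": [0.9, 0.1, 0.0],
    "On-call rotation is documented in the runbook wiki.":   [0.1, 0.9, 0.0],
    "All images are scanned before they reach production.":  [0.2, 0.1, 0.9],
}

def cosine(a, b):
    """Cosine similarity: how close two vectors point, ignoring magnitude."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def retrieve(query_vec, k=1):
    """Meaning-based search: rank documents by vector similarity, not keywords."""
    ranked = sorted(DOCS, key=lambda d: cosine(query_vec, DOCS[d]), reverse=True)
    return ranked[:k]

def build_prompt(question, query_vec):
    """RAG: ground the model's prompt in the retrieved documents."""
    context = "\n".join(retrieve(query_vec))
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

# A query vector near the "resource policy" document retrieves that document,
# so the model answers from the actual policy instead of inventing one.
print(build_prompt("What limits do deployments need?", [0.85, 0.15, 0.05]))
```

The design point is the one the video makes: the AI never "knows" your policies; the retrieval step injects them into the prompt at question time, which is what stops the hallucination.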

▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬ Sponsor: Outskill 👉 Grab your free seat to the 2-Day AI Mastermind: https://link.outskill.com/AIDOPSS1 🔐 100% Discount for the first 1000 people 💥 Dive deep into AI and Learn Automations, Build AI Agents, Make videos & images – all for free! 🎁 Bonuses worth $5100+ if you join and attend ▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬

#AIImplementation #VectorDatabases #RAG

Consider joining the channel: https://www.youtube.com/c/devopstoolkit/join

▬▬▬▬▬▬ 🔗 Additional Info 🔗 ▬▬▬▬▬▬ ➡ Transcript and commands: https://devopstoolkit.live/ai/stop-blaming-ai-vector-dbs-+-rag-=-game-changer 🔗 Qdrant: https://qdrant.tech

▬▬▬▬▬▬ 💰 Sponsorships 💰 ▬▬▬▬▬▬ If you are interested in sponsoring this channel, please visit https://devopstoolkit.live/sponsor for more information. Alternatively, feel free to contact me over Twitter or LinkedIn (see below).

▬▬▬▬▬▬ 👋 Contact me 👋 ▬▬▬▬▬▬ ➡ BlueSky: https://vfarcic.bsky.social ➡ LinkedIn: https://www.linkedin.com/in/viktorfarcic/

▬▬▬▬▬▬ 🚀 Other Channels 🚀 ▬▬▬▬▬▬ 🎤 Podcast: https://www.devopsparadox.com/ 💬 Live streams: https://www.youtube.com/c/DevOpsParadox

▬▬▬▬▬▬ ⏱ Timecodes ⏱ ▬▬▬▬▬▬ 00:00 Vector Databases for AI Agents 02:03 Outskill (sponsor) 03:34 Why AI Hallucinates About Your Code 11:15 Vector Databases for AI Context 21:47 RAG: How AI Gets Your Context 30:50 Fix Your AI Implementation Now

via YouTube https://www.youtube.com/watch?v=zqpJr1qZhTg

·youtube.com·
AI & DevOps Toolkit - Stop Blaming AI: Vector DBs + RAG = Game Changer - https://www.youtube.com/watch?v=zqpJr1qZhTg