Ep35 - Ask Me Anything About Anything
There are no restrictions in this AMA session. You can ask anything about DevOps, AI, Cloud, Kubernetes, Platform Engineering, containers, or anything else.
▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬ Sponsor: Codefresh 🔗 GitOps Argo CD Certifications: https://learning.codefresh.io (use "viktor" for a 50% discount) ▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬
▬▬▬▬▬▬ 👋 Contact me 👋 ▬▬▬▬▬▬ ➡ BlueSky: https://vfarcic.bsky.social ➡ LinkedIn: https://www.linkedin.com/in/viktorfarcic/
▬▬▬▬▬▬ 🚀 Other Channels 🚀 ▬▬▬▬▬▬ 🎤 Podcast: https://www.devopsparadox.com/ 💬 Live streams: https://www.youtube.com/c/DevOpsParadox
via YouTube https://www.youtube.com/watch?v=ym9AX4kEkss
Scaling CI horizontally with Buildkite, Kubernetes, and multiple pipelines, with Ben Poland
Ben Poland walks through Faire's complete CI transformation, from a single Jenkins instance struggling with thousands of lines of Groovy to a distributed Buildkite system running across multiple Kubernetes clusters.
He details the technical challenges of running CI workloads at scale, including API rate limiting, etcd pressure points, and the trade-offs of splitting monolithic pipelines into service-scoped ones.
You will learn:
How to architect CI systems that match team ownership and eliminate shared failure points across services
Kubernetes scaling patterns for CI workloads, including multi-cluster strategies, predictive node provisioning, and handling API throttling
Performance optimization techniques like Git mirroring, node-level caching, and spot instance management for variable CI demands
Migration strategies and lessons learned from moving away from monolithic CI, including proof-of-concept approaches and avoiding the sunk cost fallacy
Sponsor
This episode is brought to you by Testkube—where teams run millions of performance tests in real Kubernetes infrastructure. From air-gapped environments to massive scale deployments, orchestrate every testing tool in one platform. Check it out at testkube.io
More info
Find all the links and info for this episode here: https://ku.bz/klBmzMY5-
Interested in sponsoring an episode? Learn more.
via KubeFM https://kube.fm
September 30, 2025 at 06:00AM
How I Tamed Chaotic AI Coding with Simple Workflow Commands
Tired of AI coding agents that jump between tasks chaotically and lose track of context? This video demonstrates a complete systematic workflow for AI-assisted development that keeps both you and your AI agent focused and organized from initial idea through production deployment.
I'll walk you through my entire PRD-based development system, showing real implementation of a complex feature from start to finish. You'll see how to create comprehensive technical requirements with AI analysis, track progress systematically, handle inevitable plan changes, prioritize tasks intelligently, and complete features with full traceability. The workflow uses simple MCP commands like /prd-create, /prd-next, /prd-update-progress, and /prd-done to guide systematic development without requiring complex external tools. By the end, you'll understand how to transform chaotic AI coding sessions into structured, professional development workflows that actually ship reliable software.
▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬ Sponsor: OutSkill 👉 Grab your free seat to the 2-Day AI Mastermind: https://link.outskill.com/AIDOS2 🔐 100% Discount for the first 1000 people 💥 Dive deep into AI and Learn Automations, Build AI Agents, Make videos & images – all for free! 🎁 Bonuses worth $5100+ if you join and attend ▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬
#AICoding #PRDWorkflow #ClaudeCode
Consider joining the channel: https://www.youtube.com/c/devopstoolkit/join
▬▬▬▬▬▬ 🔗 Additional Info 🔗 ▬▬▬▬▬▬ ➡ Transcript and commands: https://devopstoolkit.live/development/how-i-tamed-chaotic-ai-coding-with-simple-workflow-commands 🔗 DevOps AI Toolkit: https://github.com/vfarcic/dot-ai 🎬 Stop Wasting Time: Turn AI Prompts and Context Into Production Code: https://youtu.be/XwWCFINXIoU
▬▬▬▬▬▬ 💰 Sponsorships 💰 ▬▬▬▬▬▬ If you are interested in sponsoring this channel, please visit https://devopstoolkit.live/sponsor for more information. Alternatively, feel free to contact me over Twitter or LinkedIn (see below).
▬▬▬▬▬▬ ⏱ Timecodes ⏱ ▬▬▬▬▬▬ 00:00 Introduction 01:50 AI Development Workflow 05:03 Outskill (sponsor) 06:25 Create PRDs with AI 12:27 Find Active PRDs with AI 14:17 Start PRD Implementation with AI 18:21 Track Development Progress with AI 20:50 AI Task Prioritization 22:41 Update PRD Decisions with AI 24:56 Complete PRD Workflow with AI 28:44 Key Takeaways
via YouTube https://www.youtube.com/watch?v=LUFJuj1yIik
Week Ending September 21, 2025
https://lwkd.info/2025/20250925
Developer News
Ray Wainman shared that he is stepping down as co-lead of SIG Autoscaling, and Adrian Moisey will step into the role alongside Jack Francis.
In SIG K8s Infra, leaders Davanum Srinivas (@dims) and Benjamin Elder (@bentheelder) are stepping down and have nominated Ciprian Hacman (@hakman) and Dylan Page (@GenPage) as the new chairs.
Release Schedule
Next Deadline: PRR Freeze, October 9
The Kubernetes v1.35 release cycle has officially started and we are now collecting enhancements. Work with your SIG leads to get a lead-opted-in label for your KEPs to get them added to the v1.35 cycle.
Please note that the PRR Freeze is a hard deadline starting v1.35. You can read more about the PRR Freeze deadline here and the exception process here.
Other Merges
Replace HandleCrash with HandleCrashWithContext in apiserver — adds contextual logging
Add case-insensitive DNS subdomain validation via k8s-long-name-caseless format — lets long names be validated without forcing lower case
Enable declarative validation for the DeviceClass type in resource APIs (v1, v1beta1, v1beta2) — validation-gen tags + tests.
Ensure cacher and etcd3 use consistent key schema requirements
Add RunWithContext variant to EstablishingController — enables context-aware cancellation and richer logging for controller actions
Use iifname in kube-proxy’s nftables mode for interface matching — improves correct filtering by interface name.
Add k8s-label-key & k8s-label-value formats for declarative validation — enables using those formats in +k8s:format= tags so label keys/values are validated automatically
Honor KUBEADM_UPGRADE_DRYRUN_DIR during kubeadm upgrades
Replace WaitForNamedCacheSync with WaitForNamedCacheSyncWithContext in pkg/controller/ and pkg/controller/garbagecollector
Add fine-grained metrics to distinguish declarative validation mismatches & panics — includes a validation_identifier label for better diagnostics.
Add metric for StatefulSet MaxUnavailable violations — tracks when availability drops below spec’s threshold
Enforce API conventions for Conditions fields — ensures metav1.Condition is used and markers/tags follow standard format
Make admission & pod-security admission checks respect emulation version
Add proper goroutine management in kube-controller-manager to prevent leaks
Update MutatingAdmissionPolicy storage version to use v1beta1
Promotions
Graduate ControlPlaneKubeletLocalMode to GA in kubeadm
Deprecated
Set the deprecated version to 1.34.0 for apiserver_storage_objects metric
Remove automaxprocs workaround now that Go 1.25 manages GOMAXPROCS automatically
Version Updates
golangci-lint to v2.4.0
go language version upgraded to v1.25
system-validators to v1.11.1
Bump Go to 1.25.1, update dependencies & distroless iptables images
Subprojects and Dependency Updates
etcd v3.6.5 fixes lease renewals, snapshot/defrag corruption, removes a flag, builds with Go 1.24.7
kubebuilder v4.9.0 upgrades deps, updates Helm CRDs, fixes Docker builds and CRD handling
prometheus v3.6.0 adds PromQL duration funcs, new TSDB blocks API, OTLP/tracing tweaks, bug fixes
vertical-pod-autoscaler v1.5.0 makes In-Place Updates Beta, deprecates Auto mode, adds metrics, supports K8s 1.34
via Last Week in Kubernetes Development https://lwkd.info/
September 25, 2025 at 07:19PM
Announcing Changed Block Tracking API support (alpha)
https://kubernetes.io/blog/2025/09/25/csi-changed-block-tracking/
We're excited to announce the alpha support for a changed block tracking mechanism. This enhances the Kubernetes storage ecosystem by providing an efficient way for CSI storage drivers to identify changed blocks in PersistentVolume snapshots. With a driver that can use the feature, you could benefit from faster and more resource-efficient backup operations.
If you're eager to try this feature, you can skip to the Getting Started section.
What is changed block tracking?
Changed block tracking enables storage systems to identify and track modifications at the block level between snapshots, eliminating the need to scan entire volumes during backup operations. The improvement is a change to the Container Storage Interface (CSI), and also to the storage support in Kubernetes itself. With the alpha feature enabled, your cluster can:
Identify allocated blocks within a CSI volume snapshot
Determine changed blocks between two snapshots of the same volume
Streamline backup operations by focusing only on changed data blocks
For Kubernetes users managing large datasets, this API enables significantly more efficient backup processes. Backup applications can now focus only on the blocks that have changed, rather than processing entire volumes.
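As a rough illustration of what "focusing only on changed blocks" means for a backup tool, here is a minimal Python sketch. It assumes the changed regions have already been obtained as (offset, size) pairs; the `ChangedExtent` type and `copy_changed_extents` function are hypothetical names invented for this example and are not part of the Kubernetes or CSI APIs. In-memory buffers stand in for a snapshot and a backup image.

```python
import io
from dataclasses import dataclass
from typing import BinaryIO, Iterable


@dataclass(frozen=True)
class ChangedExtent:
    """Hypothetical record describing one changed region, in bytes."""
    byte_offset: int
    size_bytes: int


def copy_changed_extents(
    source: BinaryIO, target: BinaryIO, extents: Iterable[ChangedExtent]
) -> int:
    """Copy only the listed byte ranges from source to target.

    Returns the number of bytes copied, so it can be compared with the
    full volume size.
    """
    copied = 0
    for extent in extents:
        source.seek(extent.byte_offset)
        data = source.read(extent.size_bytes)
        target.seek(extent.byte_offset)
        target.write(data)
        copied += len(data)
    return copied


# Illustrative data: a 16-byte "volume" with 4-byte blocks.
previous_backup = io.BytesIO(b"AAAABBBBCCCCDDDD")  # backup of the older snapshot
new_snapshot = io.BytesIO(b"AAAAXXXXCCCCYYYY")     # contents of the newer snapshot
changed = [ChangedExtent(4, 4), ChangedExtent(12, 4)]

bytes_copied = copy_changed_extents(new_snapshot, previous_backup, changed)
print(bytes_copied, previous_backup.getvalue())
```

Only 8 of the 16 bytes are read and written here; a real backup tool would get the extent list from the driver instead of hard-coding it.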
Note: As of now, the Changed Block Tracking API is supported only for block volumes and not for file volumes. CSI drivers that manage file-based storage systems will not be able to implement this capability.
Benefits of changed block tracking support in Kubernetes
As Kubernetes adoption grows for stateful workloads managing critical data, the need for efficient backup solutions becomes increasingly important. Traditional full backup approaches face challenges with:
Long backup windows: Full volume backups can take hours for large datasets, making it difficult to complete within maintenance windows.
High resource utilization: Backup operations consume substantial network bandwidth and I/O resources, especially for large data volumes and data-intensive applications.
Increased storage costs: Repetitive full backups store redundant data, causing storage requirements to grow linearly even when only a small percentage of data actually changes between backups.
The Changed Block Tracking API addresses these challenges by providing native Kubernetes support for incremental backup capabilities through the CSI interface.
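To make the storage-cost argument concrete, the following back-of-the-envelope Python sketch compares repeated full backups with one full backup followed by incrementals. The volume size, backup count, and change rate are made-up numbers for illustration only, and the model ignores compression, deduplication, and metadata overhead.

```python
def full_backup_storage_gib(volume_gib: float, backups: int) -> float:
    """Storage used when every backup is a full copy of the volume."""
    return volume_gib * backups


def incremental_backup_storage_gib(
    volume_gib: float, backups: int, change_rate: float
) -> float:
    """Storage used by one full backup plus (backups - 1) incrementals.

    change_rate is the assumed fraction of the volume that changes
    between two consecutive backups.
    """
    if backups < 1:
        return 0.0
    return volume_gib + volume_gib * change_rate * (backups - 1)


# Illustrative numbers: a 1000 GiB volume, 30 daily backups, 2% daily change.
full = full_backup_storage_gib(1000, 30)
incremental = incremental_backup_storage_gib(1000, 30, 0.02)
print(f"full: {full:.0f} GiB, incremental: {incremental:.0f} GiB")
```

Under these assumptions the full-backup approach stores 30,000 GiB while the incremental approach stores about 1,580 GiB. The gap grows with the number of backups and shrinks as the change rate rises.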
Key components
The implementation consists of three primary components:
CSI SnapshotMetadata Service API: An API, served over gRPC, that provides volume snapshot and changed block data.
SnapshotMetadataService API: A Kubernetes CustomResourceDefinition (CRD) that advertises CSI driver metadata service availability and connection details to cluster clients.
External Snapshot Metadata Sidecar: An intermediary component that connects CSI drivers to backup applications via a standardized gRPC interface.
Implementation requirements
Storage provider responsibilities
If you're an author of a storage integration with Kubernetes and want to support the changed block tracking feature, you must implement specific requirements:
Implement CSI RPCs: Storage providers need to implement the SnapshotMetadata service as defined in the CSI specifications protobuf. This service requires server-side streaming implementations for the following RPCs:
GetMetadataAllocated: For identifying allocated blocks in a snapshot
GetMetadataDelta: For determining changed blocks between two snapshots
Storage backend capabilities: Ensure the storage backend has the capability to track and report block-level changes.
Deploy external components: Integrate with the external-snapshot-metadata sidecar to expose the snapshot metadata service.
Register custom resource: Register the SnapshotMetadataService resource using a CustomResourceDefinition and create a SnapshotMetadataService custom resource that advertises the availability of the metadata service and provides connection details.
Support error handling: Implement proper error handling for these RPCs according to the CSI specification requirements.
Backup solution responsibilities
A backup solution looking to leverage this feature must:
Set up authentication: The backup application must provide a Kubernetes ServiceAccount token when using the Kubernetes SnapshotMetadataService API. Appropriate access grants, such as RBAC RoleBindings, must be established to authorize the backup application ServiceAccount to obtain such tokens.
Implement streaming client-side code: Develop clients that implement the streaming gRPC APIs defined in the schema.proto file. Specifically:
Implement streaming client code for GetMetadataAllocated and GetMetadataDelta methods
Handle server-side streaming responses efficiently as the metadata comes in chunks
Process the SnapshotMetadataResponse message format with proper error handling
The external-snapshot-metadata GitHub repository provides a convenient iterator support package to simplify client implementation.
Handle large dataset streaming: Design clients to efficiently handle large streams of block metadata that could be returned for volumes with significant changes.
Optimize backup processes: Modify backup workflows to use the changed block metadata to identify and only transfer changed blocks to make backups more efficient, reducing both backup duration and resource consumption.
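The streaming and large-dataset points above can be sketched without any gRPC code. The Python example below consumes block metadata chunk by chunk and lazily merges adjacent ranges, so the client never holds the whole list in memory. `BlockRange`, `coalesce_stream`, and `fake_stream` are hypothetical names for this illustration; the sketch assumes ranges arrive in ascending offset order and uses a plain generator in place of the real server-side stream.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, List, Optional


@dataclass(frozen=True)
class BlockRange:
    """Hypothetical block metadata entry: a byte offset and a length."""
    byte_offset: int
    size_bytes: int


def coalesce_stream(chunks: Iterable[List[BlockRange]]) -> Iterator[BlockRange]:
    """Lazily merge adjacent or overlapping ranges from a stream of chunks.

    Assumes ranges arrive in ascending offset order. Only one pending
    range is kept in memory at a time.
    """
    pending: Optional[BlockRange] = None
    for chunk in chunks:
        for block in chunk:
            if pending is None:
                pending = block
                continue
            pending_end = pending.byte_offset + pending.size_bytes
            if block.byte_offset <= pending_end:
                # Extend the pending range to cover this block as well.
                new_end = max(pending_end, block.byte_offset + block.size_bytes)
                pending = BlockRange(pending.byte_offset, new_end - pending.byte_offset)
            else:
                yield pending
                pending = block
    if pending is not None:
        yield pending


def fake_stream() -> Iterator[List[BlockRange]]:
    """Stand-in for a server-side stream delivering metadata in chunks."""
    yield [BlockRange(0, 4096), BlockRange(4096, 4096)]
    yield [BlockRange(8192, 4096), BlockRange(65536, 4096)]


merged = list(coalesce_stream(fake_stream()))
print(merged)
```

Merging contiguous ranges before issuing reads tends to reduce the number of I/O requests. Whether that helps in practice depends on the storage backend.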
Getting started
To use changed block tracking in your cluster:
Ensure your CSI driver supports volume snapshots and implements the snapshot metadata capabilities with the required external-snapshot-metadata sidecar
Make sure the SnapshotMetadataService custom resource is registered using CRD
Verify the presence of a SnapshotMetadataService custom resource for your CSI driver
Create clients that can access the API using appropriate authentication (via Kubernetes ServiceAccount tokens)
The API provides two main functions:
GetMetadataAllocated: Lists blocks allocated in a single snapshot
GetMetadataDelta: Lists blocks changed between two snapshots
What’s next?
Depending on feedback and adoption, the Kubernetes developers hope to promote the CSI Snapshot Metadata implementation to Beta in a future release.
Where can I learn more?
For those interested in trying out this new feature:
Official Kubernetes CSI Developer Documentation
The enhancement proposal for the snapshot metadata feature.
GitHub repository for implementation and release status of external-snapshot-metadata
Complete gRPC protocol definitions for snapshot metadata API: schema.proto
Example snapshot metadata client implementation: snapshot-metadata-lister
End-to-end example with csi-hostpath-driver: example documentation
How do I get involved?
This project, like all of Kubernetes, is the result of hard work by many contributors from diverse backgrounds working together. On behalf of SIG Storage, I would like to offer a huge thank you to the contributors who helped review the design and implementation of the project, including but not limited to the following:
Ben Swartzlander (bswartz)
Carl Braganza (carlbraganza)
Daniil Fedotov (hairyhum)
Ivan Sim (ihcsim)
Nikhil Ladha (Nikhil-Ladha)
Prasad Ghangal (PrasadG193)
Praveen M (iPraveenParihar)
Rakshith R (Rakshith-R)
Xing Yang (xing-yang)
Thanks also to everyone who has contributed to the project, including others who helped review the KEP and the CSI spec PR.
For those interested in getting involved with the design and development of CSI or any part of the Kubernetes Storage system, join the Kubernetes Storage Special Interest Group (SIG). We always welcome new contributors.
The SIG also holds regular Data Protection Working Group meetings. New attendees are welcome to join our discussions.
via Kubernetes Blog https://kubernetes.io/
September 25, 2025 at 09:00AM
Not Every Problem Needs Kubernetes, with Danyl Novhorodov
Danyl Novhorodov, a veteran .NET engineer and architect at Eneco, presents his controversial thesis that 90% of teams don't actually need Kubernetes. He walks through practical decision-making frameworks, explores powerful alternatives like BEAM runtimes and Actor models, and explains why starting with modular monoliths often beats premature microservices adoption.
You will learn:
The COST decision framework - How to evaluate infrastructure choices based on Complexity, Ownership, Skills, and Time rather than industry hype
Platform engineering vs. managed services - How to honestly assess whether your team can compete with AWS, Azure, and Google's managed container platforms
Evolutionary architecture approach - Why modular monoliths with clear boundaries often provide better foundations than distributed systems from day one
More info
Find all the links and info for this episode here: https://ku.bz/BYhFw8RwW
via KubeFM https://kube.fm
September 23, 2025 at 06:00AM
Kubernetes v1.34: Pod Level Resources Graduated to Beta
https://kubernetes.io/blog/2025/09/22/kubernetes-v1-34-pod-level-resources/
On behalf of the Kubernetes community, I am thrilled to announce that the Pod Level Resources feature has graduated to Beta in the Kubernetes v1.34 release and is enabled by default! This significant milestone introduces a new layer of flexibility for defining and managing resource allocation for your Pods. This flexibility stems from the ability to specify CPU and memory resources for the Pod as a whole. Pod level resources can be combined with the container-level specifications to express the exact resource requirements and limits your application needs.
Pod-level specification for resources
Until recently, resource specifications that applied to Pods were primarily defined at the individual container level. While effective, this approach sometimes required duplicating or meticulously calculating resource needs across multiple containers within a single Pod. As a beta feature, Kubernetes allows you to specify the CPU, memory and hugepages resources at the Pod-level. This means you can now define resource requests and limits for an entire Pod, enabling easier resource sharing without requiring granular, per-container management of these resources where it's not needed.
Why does Pod-level specification matter?
This feature enhances resource management in Kubernetes by offering flexible resource management at both the Pod and container levels.
It provides a consolidated approach to resource declaration, reducing the need for meticulous, per-container management, especially for Pods with multiple containers.
Pod-level resources enable containers within a pod to share unused resources amongst themselves, promoting efficient utilization within the pod. For example, it prevents sidecar containers from becoming performance bottlenecks. Previously, a sidecar (e.g., a logging agent or service mesh proxy) hitting its individual CPU limit could be throttled and slow down the entire Pod, even if the main application container had plenty of spare CPU. With pod-level resources, the sidecar and the main container can share the Pod's resource budget, ensuring smooth operation during traffic spikes - either the whole Pod is throttled or all containers work.
When both pod-level and container-level resources are specified, pod-level requests and limits take precedence. This gives you – and cluster administrators - a powerful way to enforce overall resource boundaries for your Pods.
For scheduling, if a pod-level request is explicitly defined, the scheduler uses that specific value to find a suitable node, instead of the aggregated requests of the individual containers. At runtime, the pod-level limit acts as a hard ceiling for the combined resource usage of all containers. Crucially, this pod-level limit is the absolute enforcer; even if the sum of the individual container limits is higher, the total resource consumption can never exceed the pod-level limit.
Pod-level resources are prioritized in influencing the Quality of Service (QoS) class of the Pod.
For Pods running on Linux nodes, the Out-Of-Memory (OOM) score adjustment calculation considers both pod-level and container-level resource requests.
Pod-level resources are designed to be compatible with existing Kubernetes functionalities, ensuring a smooth integration into your workflows.
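The scheduling rule described above (use the pod-level request when it is defined, otherwise fall back to the sum of container requests) can be modelled in a few lines of Python. This is a simplified sketch, not the scheduler's actual code: `effective_pod_request` is a hypothetical helper, resources are plain numbers, and init containers and Pod overhead are ignored.

```python
from typing import Dict, List, Optional


def effective_pod_request(
    pod_level: Optional[Dict[str, float]],
    containers: List[Dict[str, float]],
) -> Dict[str, float]:
    """Return the per-resource request used for placement in this model.

    For each resource, an explicitly defined pod-level request wins;
    otherwise the container requests are summed.
    """
    aggregated: Dict[str, float] = {}
    for requests in containers:
        for resource, amount in requests.items():
            aggregated[resource] = aggregated.get(resource, 0.0) + amount
    effective = dict(aggregated)
    effective.update(pod_level or {})  # pod-level values override the aggregate
    return effective


# One container with explicit requests, one without any.
container_requests = [{"cpu": 0.5, "memory_mib": 50.0}, {}]

with_pod_level = effective_pod_request(
    {"cpu": 1.0, "memory_mib": 100.0}, container_requests
)
without_pod_level = effective_pod_request(None, container_requests)
print(with_pod_level, without_pod_level)
```

With a pod-level request of 1 CPU and 100 MiB, the placement decision in this model is based on those values even though the containers themselves only ask for 0.5 CPU and 50 MiB.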
How to specify resources for an entire Pod
Using the PodLevelResources feature gate requires Kubernetes v1.34 or newer for all cluster components, including the control plane and every node. This feature gate is in beta and enabled by default in v1.34.
Example manifest
You can specify CPU, memory and hugepages resources directly in the Pod spec manifest at the resources field for the entire Pod.
Here’s an example demonstrating a Pod with both CPU and memory requests and limits defined at the Pod level:
apiVersion: v1
kind: Pod
metadata:
  name: pod-resources-demo
  namespace: pod-resources-example
spec:
  # The 'resources' field at the Pod specification level defines the overall
  # resource budget for all containers within this Pod combined.
  resources: # Pod-level resources
    # 'limits' specifies the maximum amount of resources the Pod is allowed to use.
    # The combined usage of all containers in the Pod cannot exceed these values.
    limits:
      cpu: "1" # The entire Pod cannot use more than 1 CPU core.
      memory: "200Mi" # The entire Pod cannot use more than 200 MiB of memory.
    # 'requests' specifies the minimum amount of resources guaranteed to the Pod.
    # This value is used by the Kubernetes scheduler to find a node with enough capacity.
    requests:
      cpu: "1" # The Pod is guaranteed 1 CPU core when scheduled.
      memory: "100Mi" # The Pod is guaranteed 100 MiB of memory when scheduled.
  containers:
  - name: main-app-container
    image: nginx
    # ...
    # This container has no resource requests or limits specified.
  - name: auxiliary-container
    image: fedora
    command: ["sleep", "inf"]
    # ...
    # This container has no resource requests or limits specified.
In this example, the pod-resources-demo Pod as a whole requests 1 CPU and 100 MiB of memory, and is limited to 1 CPU and 200 MiB of memory. The containers within will operate under these overall Pod-level constraints, as explained in the next section.
Interaction with container-level resource requests or limits
When both pod-level and container-level resources are specified, pod-level requests and limits take precedence. This means the node allocates resources based on the pod-level specifications.
Consider a Pod with two containers where pod-level CPU and memory requests and limits are defined, and only one container has its own explicit resource definitions:
apiVersion: v1
kind: Pod
metadata:
  name: pod-resources-demo
  namespace: pod-resources-example
spec:
  resources:
    limits:
      cpu: "1"
      memory: "200Mi"
    requests:
      cpu: "1"
      memory: "100Mi"
  containers:
  - name: main-app-container
    image: nginx
    resources:
      requests:
        cpu: "0.5"
        memory: "50Mi"
  - name: auxiliary-container
    image: fedora
    command: ["sleep", "inf"]
    # This container has no resource requests or limits specified.
Pod-Level Limits: The pod-level limits (cpu: "1", memory: "200Mi") establish an absolute boundary for the entire Pod. The sum of resources consumed by all its containers is enforced at this ceiling and cannot be surpassed.
Resource Sharing and Bursting: Containers can dynamically borrow any unused capacity, allowing them to burst as needed, so long as the Pod's aggregate usage stays within the overall limit.
Pod-Level Requests: The pod-level requests (cpu: "1", memory: "100Mi") serve as the foundational resource guarantee for the entire Pod. This value informs the scheduler's placement decision and represents the minimum resources the Pod can rely on during node-level contention.
Container-Level Requests: Container-level requests create a priority system within the Pod's guaranteed budget. Because main-app-container has an explicit request (cpu: "0.5", memory: "50Mi"), it is given precedence for its share of resources under resource pressure over the auxiliary-container, which has no such explicit claim.
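The arithmetic behind this example can be checked with a short Python sketch that parses the quantities from the manifest and computes how much of the pod-level request is not explicitly claimed by any container. The `parse_cpu` and `parse_memory` helpers are hypothetical and handle only a small subset of Kubernetes quantity syntax (plain or millicore CPU values, and Ki/Mi/Gi memory suffixes). How the kubelet actually distributes the unclaimed share is not modelled here.

```python
def parse_cpu(quantity: str) -> int:
    """Convert a CPU quantity such as "1", "0.5" or "250m" to millicores."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return int(round(float(quantity) * 1000))


_BINARY_SUFFIXES = {"Ki": 1024, "Mi": 1024 ** 2, "Gi": 1024 ** 3}


def parse_memory(quantity: str) -> int:
    """Convert a memory quantity such as "200Mi" to bytes (subset of suffixes)."""
    for suffix, factor in _BINARY_SUFFIXES.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    return int(quantity)  # plain number of bytes


# Values taken from the example manifest.
pod_requests = {"cpu": "1", "memory": "100Mi"}
container_requests = [
    {"cpu": "0.5", "memory": "50Mi"},  # main-app-container
    {},                                # auxiliary-container: nothing specified
]

claimed_cpu = sum(parse_cpu(c["cpu"]) for c in container_requests if "cpu" in c)
claimed_memory = sum(
    parse_memory(c["memory"]) for c in container_requests if "memory" in c
)

unclaimed_cpu_millicores = parse_cpu(pod_requests["cpu"]) - claimed_cpu
unclaimed_memory_bytes = parse_memory(pod_requests["memory"]) - claimed_memory
print(unclaimed_cpu_millicores, unclaimed_memory_bytes // 1024 ** 2)
```

For the manifest in this section, 500 millicores and 50 MiB of the Pod's guaranteed budget are not tied to a specific container.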
Limitations
First of all, in-place resize of pod-level resources is not supported for Kubernetes v1.34 (or earlier). Attempting to modify the pod-level resource limits or requests on a running Pod results in an error: the resize is rejected. The v1.34 implementation of Pod level resources focuses on allowing initial declaration of an overall resource envelope, that applies to the entire Pod. That is distinct from in-place pod resize, which (despite what the name might suggest) allows you to make dynamic adjustments to container resource requests and limits, within a running Pod, and potentially without a container restart. In-place resizing is also not yet a stable feature; it graduated to Beta in the v1.33 release.
Only CPU, memory, and hugepages resources can be specified at pod-level.
Pod-level resources are not supported for Windows pods. If the Pod specification explicitly targets Windows (e.g., by setting spec.os.name: "windows"), the API server will reject the Pod during the validation step. If the Pod is not explicitly marked for Windows but is scheduled to a Windows node (e.g., via a nodeSelector), the Kubelet on that Windows node will reject the Pod during its admission process.
The Topology Manager, Memory Manager and CPU Manager do not align pods and containers based on pod-level resources as these resource managers don't currently support pod-level resources.
Getting started and providing feedback
Ready to explore the Pod Level Resources feature? You'll need a Kubernetes cluster running version 1.34 or later. The PodLevelResources feature gate is enabled by default in v1.34, so make sure it has not been disabled on your control plane or any node.
As this feature moves through Beta, your feedback is invaluable. Please report any issues or share your experiences via the standard Kubernetes communication channels:
Slack: #sig-node
Mailing list
Open Community Issues/PRs
via Kubernetes Blog https://kubernetes.io/
September 22, 2025 at 02:30PM
Blog: Spotlight on the Kubernetes Steering Committee
https://www.kubernetes.dev/blog/2025/09/22/k8s-steering-spotlight-2025/
This interview was conducted in August 2024, and due to the dynamic nature of the Steering Committee membership and election process it might not represent the actual composition accurately. The topics covered are, however, overwhelmingly relevant to understand its scope of work. As we approach the Steering Committee elections, it provides useful insights into the workings of the Committee.
The Kubernetes Steering Committee is the backbone of the Kubernetes project, ensuring that its vibrant community and governance structures operate smoothly and effectively. While the technical brilliance of Kubernetes is often spotlighted through its Special Interest Groups (SIGs) and Working Groups (WGs), the unsung heroes quietly steering the ship are the members of the Steering Committee. They tackle complex organizational challenges, empower contributors, and foster the thriving open source ecosystem that Kubernetes is celebrated for.
But what does it really take to lead one of the world’s largest open source communities? What are the hidden challenges, and what drives these individuals to dedicate their time and effort to such an impactful role? In this exclusive conversation, we sit down with current Steering Committee (SC) members — Ben, Nabarun, Paco, Patrick, and Maciej — to uncover the rewarding, and sometimes demanding, realities of steering Kubernetes. From their personal journeys and motivations to the committee’s vital responsibilities and future outlook, this Spotlight offers a rare behind-the-scenes glimpse into the people who keep Kubernetes on course.
Introductions
Sandipan: Can you tell us a little bit about yourself?
Ben: Hi, I’m Benjamin Elder, also known as BenTheElder. I started in Kubernetes as a Google Summer of Code student in 2015 and have been working at Google in the space since 2017. I have contributed a lot to many areas but especially build, CI, test tooling, etc. My favorite project so far was building KIND. I have been on the release team, a chair of SIG Testing, and currently a tech lead of SIG Testing and SIG K8s Infra.
Nabarun: Hi, I am Nabarun from India. I have been working on Kubernetes since 2019. I have been contributing across multiple areas in Kubernetes: SIG ContribEx (where I am also a chair), API Machinery, Architecture, and SIG Release, where I contributed to several releases including being the Release Team Lead of Kubernetes 1.21.
Paco: I am Paco from China. I worked as an open source team lead in DaoCloud, Shanghai. In the community, I participate mainly in kubeadm, SIG Node and SIG Testing. Besides, I helped in KCD China and was co-chair of the recent KubeCon+CloudNativeCon China 2024 in Hong Kong.
Patrick: Hello! I’m Patrick. I’ve contributed to Kubernetes since 2018. I started in SIG Storage and then got involved in more and more areas. Nowadays, I am a SIG Testing tech lead, logging infrastructure maintainer, organizer of the Structured Logging and Device Management working groups, contributor in SIG Scheduling, and of course member of the Steering Committee. My main focus area currently is Dynamic Resource Allocation (DRA), a new API for accelerators.
Maciej: Hey, my name is Maciej and I’ve been working on Kubernetes since late 2014 in various areas, including controllers, apiserver and kubectl. Aside from being part of the Steering Committee, I’m also helping guide SIG CLI, SIG Apps and WG Batch.
About the Steering Committee
Sandipan: What does Steering do?
Ben: The charter is the definitive answer, but I see Steering as helping resolve Kubernetes-organization-level “people problems” (as opposed to technical problems), such as clarifying project governance and liaising with the Cloud Native Computing Foundation (for example, to request additional resources and support) and other CNCF projects.
Maciej: Our charter nicely describes all the responsibilities. In short, we make sure the project runs smoothly by supporting our maintainers and contributors in their daily tasks.
Patrick: Ideally, we don’t do anything 😀 All of the day-to-day business has been delegated to SIGs and WGs. Steering gets involved when something pops up where it isn’t obvious who should handle it or when conflicts need to be resolved.
Sandipan: And how is Steering different from SIGs?
Ben: From a governance perspective: Steering delegates all of the ownership of subprojects to the SIGs and/or committees (Security Response, Code Of Conduct, etc.). They’re very different. The SIGs own pieces of the project, and Steering handles some of the overarching people and policy issues. You’ll find all of the software development, releasing, communications and documentation work happening in the SIGs and committees.
Maciej: SIGs or WGs are primarily concerned with the technical direction of a particular area in Kubernetes. Steering, on the other hand, is primarily concerned with ensuring all the SIGs, WGs, and most importantly maintainers have everything they need to run the project smoothly. This includes anything from ensuring financing of our CI systems, through governance structures and policies all the way to supporting individual maintainers in various inquiries.
Sandipan: You’ve mentioned projects, could you give us an example of a project Steering has worked on recently?
Ben: We’ve been discussing the logistics to sync a better definition of the project’s official maintainers to the CNCF, which are used, for example, to vote for the Technical Oversight Committee (TOC). Currently that list is the Steering Committee, with SIG Contributor Experience and Infra + Release leads having access to the CNCF service desk. This isn’t well standardized yet across CNCF projects but I think it’s important.
Maciej: In the past year that I’ve been sitting on the SC, I believe the majority of the tasks we’ve been involved in were about providing letters supporting visa applications. Also, like every year, we’ve been helping all the SIGs and WGs with their annual reports.
Patrick: Apparently it has been a quiet year since Maciej and I joined the Steering Committee at the end of 2023. That’s exactly how it should be.
Sandipan: Do you have any examples of projects that came to Steering, which you then redirected to SIGs?
Ben: We often get requests for test/build related resources that we redirect to SIG K8s Infra + SIG Testing, or more specifically about releasing for subprojects that we redirect to SIG K8s Infra / SIG Release.
The road to the Steering Committee
Sandipan: What motivated you to be part of the Steering Committee? What has your journey been like?
Ben: I had a few people reach out and prompt me to run, but I was motivated by my passion for this community and the project. I think we have something really special going here and I care deeply about the ongoing success. I’ve been involved in this space my whole career and while there’s always rough edges, this community has been really supportive and I hope we can keep it that way.
Paco: At the Kubernetes Contributor Summit EU 2023, I met and chatted with many maintainers and members, and attended the Steering AMA for the first time, as there hadn’t been a contributor summit in China since 2019. I then started to connect with contributors in China to make one happen later that year. Through conversations at KCS EU and with local contributors, I realized how important it is to make it easy for APAC contributors to start their contributor journey, and I wanted to attract more contributors to the community. Then, I was elected just after KCS CN 2023.
Patrick: I had done a lot of technical work, of which some affects and (hopefully) benefits all contributors to Kubernetes (linting and testing) and users (better log output). I saw joining the Steering Committee as an opportunity to help also with the organizational aspects of running a big open source project.
Maciej: I had been considering running for the SC for a while. My biggest motivation was conversations with various members of our community. Eventually, last year, I decided to follow their advice and got elected :-)
Sandipan: What is your favorite part of being part of Steering?
Ben: When we get to help contributors directly. For example, sometimes extensive contributors reach out for an official letter from Steering explaining their contribution and its value for visa support. When we get to just purely help out Kubernetes contributors, that’s my favorite part.
Patrick: It’s a good place to learn more about how the project is actually run, directly from the other great people who are doing it.
Maciej: The same thing as with the project — it’s always the people that surround us, that give us opportunities to collaborate and create something interesting and exciting.
Sandipan: What do you think is most challenging about being part of Steering?
Ben: I think we’ve all spent a lot of time grappling with the sustainability issues in the project and not having a single great answer to solve them. A lot of people are working on these problems but we have limited time and resources. We’ve officially delegated most of this (for example, to SIGs Contributor Experience and K8s Infra), but I think we all still consider it very important and deserving of more time and energy, yet we only have so much and the answers are not obvious. The balancing act is hard.
Paco: Sustainability of contributors and maintainers is one of the most challenging aspects to me. I am constantly advocating for OSS users and employers to join the community. The community is a place where developers can learn from each other, discuss issues they encounter, and share their experience or solutions. Ensuring that everyone in the community feels supported and valued is crucial for the long-term health of the project.
Patrick: There is documentation about how things are done,
Teaching AI Your Company Policies: Vector Search + Enforcement
Ever wondered why AI keeps failing at simple infrastructure tasks? The problem isn't AI itself - it's that AI doesn't know your company's policies. Most organizations have their rules scattered across wikis, Slack messages, and locked in people's heads, making it impossible for AI to make compliant decisions.
This video demonstrates a different approach: extracting tribal knowledge from your brain and turning it into both AI-searchable policies and automatic Kubernetes enforcement. Using a guided workflow, we'll create database regional compliance policies that simultaneously feed semantic search for AI guidance and generate Kyverno policies for cluster enforcement. Watch as AI learns to proactively recommend compliant configurations while Kubernetes blocks any attempts to violate your rules - creating a dual strategy that works whether someone follows the guidance or tries to bypass it entirely.
#KubernetesPolicies #DevOpsAI #Kyverno
Consider joining the channel: https://www.youtube.com/c/devopstoolkit/join
▬▬▬▬▬▬ 🔗 Additional Info 🔗 ▬▬▬▬▬▬ ➡ Transcript and commands: https://devopstoolkit.live/ai/teaching-ai-your-company-policies-vector-search-+-enforcement 🔗 DevOps AI Toolkit: https://github.com/vfarcic/dot-ai 🎬 Stop Blaming AI: Vector DBs + RAG = Game Changer: https://youtu.be/zqpJr1qZhTg
▬▬▬▬▬▬ ⏱ Timecodes ⏱ ▬▬▬▬▬▬ 00:00 What are Policies? 05:36 AI Policy Extraction 13:36 Policy Enforcement Demo 17:51 Dual Policy Strategy
via YouTube https://www.youtube.com/watch?v=hLK9j2cn6c0
Kubernetes v1.34: Recovery From Volume Expansion Failure (GA)
https://kubernetes.io/blog/2025/09/19/kubernetes-v1-34-recover-expansion-failure/
Have you ever made a typo when expanding your persistent volumes in Kubernetes? Meant to specify 2TB but specified 20TiB? This seemingly innocuous problem was surprisingly hard to solve, and it took the project almost 5 years to get there. Automated recovery from storage expansion has been around for a while in beta; however, with the v1.34 release, we have graduated this to general availability.
While it was always possible to recover from failing volume expansions manually, it usually required cluster-admin access and was tedious to do (see the aforementioned link for more information).
What if you make a mistake and then realize immediately? With Kubernetes v1.34, you should be able to reduce the requested size of the PersistentVolumeClaim (PVC) and, as long as the expansion to previously requested size hadn't finished, you can amend the size requested. Kubernetes will automatically work to correct it. Any quota consumed by failed expansion will be returned to the user and the associated PersistentVolume should be resized to the latest size you specified.
I'll walk through an example of how all of this works.
Reducing PVC size to recover from failed expansion
Imagine that you are running out of disk space for one of your database servers, and you want to expand the PVC from previously specified 10TB to 100TB - but you make a typo and specify 1000TB.
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: myclaim
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1000TB # newly specified size - but incorrect!
Now, you may be out of disk space on your disk array, or you may simply have run out of allocated quota with your cloud provider. Either way, assume that the expansion to 1000TB is never going to succeed.
In Kubernetes v1.34, you can simply correct your mistake and request a new PVC size that is smaller than the mistaken one, provided it is still larger than the original size of the actual PersistentVolume.
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: myclaim
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 100TB # Corrected size; has to be greater than 10TB.
                     # You cannot shrink the volume below its actual size.
This requires no admin intervention. Even better, any surplus Kubernetes quota that you temporarily consumed will be automatically returned.
This fault recovery mechanism does have a caveat: whatever new size you specify for the PVC, it must be still higher than the original size in .status.capacity. Since Kubernetes doesn't support shrinking your PV objects, you can never go below the size that was originally allocated for your PVC request.
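The sizing rule above can be sketched as a small check. This is an illustrative Python helper, not Kubernetes code: parse_size and can_recover_to are hypothetical names, the parser only understands the handful of suffixes listed in it (including the TB spelling used in this post), and the strict "greater than" comparison simply follows the wording above.

```python
import re

# Multipliers for the spellings used in this post (TB, GB) plus the
# Kubernetes-style decimal (G, T) and binary (Gi, Ti) suffixes.
_UNITS = {
    "G": 10**9, "GB": 10**9, "T": 10**12, "TB": 10**12,
    "Gi": 2**30, "GiB": 2**30, "Ti": 2**40, "TiB": 2**40,
}


def parse_size(text: str) -> int:
    """Convert a size string such as '100TB' or '10Ti' to bytes."""
    match = re.fullmatch(r"(\d+)([A-Za-z]+)", text.strip())
    if not match or match.group(2) not in _UNITS:
        raise ValueError(f"unsupported size: {text!r}")
    return int(match.group(1)) * _UNITS[match.group(2)]


def can_recover_to(new_request: str, status_capacity: str) -> bool:
    """True if the corrected request is still larger than the actual volume size."""
    return parse_size(new_request) > parse_size(status_capacity)


if __name__ == "__main__":
    print(can_recover_to("100TB", "10TB"))  # True: 100TB is above the 10TB volume
    print(can_recover_to("5TB", "10TB"))    # False: would shrink below the actual size
```

Here the second argument stands for the value you would read from .status.capacity of the PVC.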
Improved error handling and observability of volume expansion
Implementing what might look like a relatively minor change also required us to almost fully redo how volume expansion works under the hood in Kubernetes. There are new API fields available in PVC objects which you can monitor to observe progress of volume expansion.
Improved observability of in-progress expansion
You can query .status.allocatedResourceStatuses['storage'] of a PVC to monitor the progress of a volume expansion operation. For a typical block volume, this should transition between ControllerResizeInProgress, NodeResizePending and NodeResizeInProgress, and become nil/empty when volume expansion has finished.
If, for some reason, volume expansion to the requested size is not feasible, the status should instead be ControllerResizeInfeasible or NodeResizeInfeasible.
You can also observe the size towards which Kubernetes is working by watching pvc.status.allocatedResources.
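As a rough illustration of how these fields could be read together, here is a minimal Python sketch. It assumes the PVC has already been loaded as a dictionary (for example from the JSON printed by kubectl get pvc myclaim -o json); describe_expansion is a hypothetical helper, and the status map is spelled allocatedResourceStatuses, as in the PVC API.

```python
# States named in this post for status.allocatedResourceStatuses["storage"].
IN_PROGRESS = {"ControllerResizeInProgress", "NodeResizePending", "NodeResizeInProgress"}
INFEASIBLE = {"ControllerResizeInfeasible", "NodeResizeInfeasible"}


def describe_expansion(pvc: dict) -> str:
    """Summarize the expansion state of a PVC given as a plain dictionary."""
    status = pvc.get("status", {})
    state = (status.get("allocatedResourceStatuses") or {}).get("storage")
    target = (status.get("allocatedResources") or {}).get("storage")
    if not state:
        # The entry is empty or absent once expansion has finished.
        return "no expansion in progress"
    if state in INFEASIBLE:
        return f"expansion to {target} is infeasible ({state})"
    if state in IN_PROGRESS:
        return f"expanding to {target} ({state})"
    return f"unknown state: {state}"


if __name__ == "__main__":
    sample = {
        "status": {
            "allocatedResourceStatuses": {"storage": "NodeResizePending"},
            "allocatedResources": {"storage": "100T"},
        }
    }
    print(describe_expansion(sample))
```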
Improved error handling and reporting
Kubernetes should now retry your failed volume expansions at a slower rate, so it makes fewer requests to both the storage system and the Kubernetes apiserver.
Errors observed during volume expansion are now reported as conditions on PVC objects and, unlike events, should persist. Kubernetes will now populate pvc.status.conditions with error keys ControllerResizeError or NodeResizeError when volume expansion fails.
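A small sketch of how those conditions might be picked out of a PVC follows. It is written in Python against a PVC loaded as a dictionary; resize_errors is a hypothetical helper, and filtering on status == "True" is an assumption about how active conditions are marked rather than something stated above.

```python
# Condition types named in this post for failed expansions.
RESIZE_ERROR_TYPES = ("ControllerResizeError", "NodeResizeError")


def resize_errors(pvc: dict) -> list:
    """Return (type, message) pairs for active resize error conditions."""
    conditions = pvc.get("status", {}).get("conditions") or []
    return [
        (cond["type"], cond.get("message", ""))
        for cond in conditions
        # Assumption: an active condition carries status "True".
        if cond.get("type") in RESIZE_ERROR_TYPES and cond.get("status") == "True"
    ]


if __name__ == "__main__":
    sample = {
        "status": {
            "conditions": [
                {"type": "ControllerResizeError", "status": "True",
                 "message": "quota exceeded"},
                {"type": "Resizing", "status": "True"},
            ]
        }
    }
    for kind, message in resize_errors(sample):
        print(f"{kind}: {message}")
```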
Fixes long standing bugs in resizing workflows
This feature also has allowed us to fix long standing bugs in resizing workflow such as Kubernetes issue #115294. If you observe anything broken, please report your bugs to https://github.com/kubernetes/kubernetes/issues, along with details about how to reproduce the problem.
Working on this feature through its lifecycle was challenging and it wouldn't have been possible to reach GA without feedback from @msau42, @jsafrane and @xing-yang.
All of the contributors who worked on this also appreciate the input provided by @thockin and @liggitt at various Kubernetes contributor summits.
via Kubernetes Blog https://kubernetes.io/
September 19, 2025 at 02:30PM
Week Ending September 14, 2025
https://lwkd.info/2025/20250918
Developer News
The Steering Committee Election is underway. Please make sure to vote before October 25th, and request an exception if you need one before October 20th.
The Kubernetes Steering Committee reaffirmed that SIG Release and the Release Team have full authority to enforce policies, deadlines, and requirements, including blocking releases if needed. Steering does not override release execution but will back policy updates and clearer communication to ensure safe, stable, and predictable releases.
A medium-severity vulnerability (CVE-2025-9708) affects the Kubernetes C# client ≤ v17.0.13, where improper certificate validation could enable man-in-the-middle attacks. Users are advised to upgrade to v17.0.14+ and review any custom CA usage in kubeconfig files. See the GitHub issue for more details.
Release Schedule
Next Deadline: 1.35 Release Cycle Starts, September 15
Kubernetes 1.35 release cycle kicks off on Sept 15, targeting final release on Dec 17, 2025, with key milestones including Enhancements Freeze on Oct 16 and Code Freeze on Nov 6.
Patch releases v1.34.1, v1.33.5, v1.32.9, and v1.31.13 came out last week, delivering the latest fixes and updates.
KEP of the Week
KEP-3243: Respect PodTopologySpread after rolling upgrades
This KEP introduces a complementary field, MatchLabelKeys, in TopologySpreadConstraint to enhance pod topology spread. It allows users to specify only label keys, with kube-apiserver resolving their values from the incoming pod and merging them with the existing LabelSelector to identify the target pod group. This simplifies skew calculation, supports revision-level spreading during Deployment rollouts, and is also handled by kube-scheduler when used in cluster-level default constraints.
This KEP is tracked for beta in v1.34.
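To make the merge step described for KEP-3243 concrete, here is a simplified Python sketch of how label keys could be resolved against the incoming pod and folded into the selector. It is not the kube-apiserver implementation: merge_match_label_keys is a hypothetical name, selectors are plain dictionaries, and keys the pod does not carry are skipped.

```python
import copy


def merge_match_label_keys(label_selector: dict, match_label_keys: list, pod_labels: dict) -> dict:
    """Return a copy of label_selector extended with 'key In (value)' terms."""
    merged = copy.deepcopy(label_selector) if label_selector else {}
    expressions = merged.setdefault("matchExpressions", [])
    for key in match_label_keys:
        if key not in pod_labels:
            continue  # keys the incoming pod does not carry are skipped
        expressions.append(
            {"key": key, "operator": "In", "values": [pod_labels[key]]}
        )
    return merged


if __name__ == "__main__":
    selector = {"matchLabels": {"app": "web"}}
    labels = {"app": "web", "pod-template-hash": "abc123"}
    print(merge_match_label_keys(selector, ["pod-template-hash"], labels))
```

Because pod-template-hash differs between Deployment revisions, the merged selector only counts pods from the same revision, which is what enables revision-level spreading during a rollout.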
Other Merges
Remove container name from container event messages
Replace NewIndexerInformerWatcher with NewIndexerInformerWatcherWithLogger
Standardize not found error message of kubectl scale
validation-gen uses JSON names for error paths
Prevent ClusterIP load balancer loss with InternalTrafficPolicy: Local in kube-proxy
Avoid deadlock when gRPC connection to driver goes idle
validation-gen adds uuid format for string fields
client-go/cli-runtime fixes config override when ClientKey/ClientCertificate are set
Replace WaitForNamedCacheSync with WaitForNamedCacheSyncWithContext
Update PodObservedGenerationTracking description in OpenAPI
kubectl includes container fieldPath in event messages
StorageVersionMigrator adds discovery check to avoid stuck migrations
agnhost adds fake-registry-server for e2e image-pull tests
Add E2e test for cleaning of terminated containers
kube-apiserver protects against delete/finalizer race
Update pod resize test to accept new cpu.weight conversion
DRA accepts implicit device-class extended resource names even when extendedResourceName is set in the DeviceClass
Skip creating storage for non-stored and non-served versions
Allow OpenAPI model package names to be declared by APIs
kubelet fixes negative pod startup duration values
kube-scheduler statusz lists registered paths
applyconfiguration-gen preserves struct and field comments in generated code
Scheduler framework interfaces move to k8s.io/kube-scheduler
CRD validation ratchets the max selectableFields limit
apiserver storage only accesses keys under resourcePrefix
apiserver storage replace SetKeysFunc with EnableResourceSizeEstimation
Subprojects and Dependency Updates
grpc v1.75.0 introduces Spiffe verification, OTel C++ retry metrics, bug fixes, and Python and Ruby updates
nerdctl v2.1.4 adds manifest, export, import commands, improves networking, and drops containerd 1.6 support
vertical-pod-autoscaler v1.4.2 improves logging, fixes updater metrics, adjusts webhook CA, and falls back to eviction on failed updates
via Last Week in Kubernetes Development https://lwkd.info/
September 18, 2025 at 05:00PM
Kubernetes v1.34: DRA Consumable Capacity
https://kubernetes.io/blog/2025/09/18/kubernetes-v1-34-dra-consumable-capacity/
Dynamic Resource Allocation (DRA) is a Kubernetes API for managing scarce resources across Pods and containers. It enables flexible resource requests, going beyond simply allocating N number of devices to support more granular usage scenarios. With DRA, users can request specific types of devices based on their attributes, define custom configurations tailored to their workloads, and even share the same resource among multiple containers or Pods.
In this blog, we focus on the device sharing feature and dive into a new capability introduced in Kubernetes 1.34: DRA consumable capacity, which extends DRA to support finer-grained device sharing.
Background: device sharing via ResourceClaims
From the beginning, DRA introduced the ability for multiple Pods to share a device by referencing the same ResourceClaim. This design decouples resource allocation from specific hardware, allowing for more dynamic and reusable provisioning of devices.
In Kubernetes 1.33, the new support for partitionable devices allowed resource drivers to advertise slices of a device that are available, rather than exposing the entire device as an all-or-nothing resource. This enabled Kubernetes to model shareable hardware more accurately.
But there was still a missing piece: it didn't yet support scenarios where the device driver manages fine-grained, dynamic portions of a device resource — like network bandwidth — based on user demand, or to share those resources independently of ResourceClaims, which are restricted by their spec and namespace.
That’s where consumable capacity for DRA comes in.
Benefits of DRA consumable capacity support
Here's a taste of what you get in a cluster with the DRAConsumableCapacity feature gate enabled.
Device sharing across multiple ResourceClaims or DeviceRequests
Resource drivers can now support sharing the same device — or even a slice of a device — across multiple ResourceClaims or across multiple DeviceRequests.
This means that Pods from different namespaces can simultaneously share the same device, if permitted and supported by the specific DRA driver.
Device resource allocation
Kubernetes extends the allocation algorithm in the scheduler to support allocating a portion of a device's resources, as defined in the capacity field. The scheduler ensures that the total allocated capacity across all consumers never exceeds the device’s total capacity, even when shared across multiple ResourceClaims or DeviceRequests. This is very similar to the way the scheduler allows Pods and containers to share allocatable resources on Nodes; in this case, it allows them to share allocatable (consumable) resources on Devices.
This feature expands support for scenarios where the device driver is able to manage resources within a device and on a per-process basis — for example, allocating a specific amount of memory (e.g., 8 GiB) from a virtual GPU, or setting bandwidth limits on virtual network interfaces allocated to specific Pods. This aims to provide safe and efficient resource sharing.
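The invariant described above (allocations across all consumers never exceed the device's capacity) can be illustrated with a toy Python model. SharedDevice is a hypothetical class for illustration only, with capacity counted in whole GiB; it says nothing about how the scheduler actually stores this state.

```python
class SharedDevice:
    """Toy model of a device whose capacity is consumed by several claims."""

    def __init__(self, name: str, capacity_gib: int) -> None:
        self.name = name
        self.capacity_gib = capacity_gib
        self.allocations = {}  # consumer name -> GiB consumed

    @property
    def free_gib(self) -> int:
        return self.capacity_gib - sum(self.allocations.values())

    def allocate(self, consumer: str, amount_gib: int) -> bool:
        """Reserve capacity for a consumer; refuse if the total would overflow."""
        if amount_gib <= 0 or amount_gib > self.free_gib:
            return False
        self.allocations[consumer] = self.allocations.get(consumer, 0) + amount_gib
        return True


if __name__ == "__main__":
    gpu = SharedDevice("gpu0", 40)
    print(gpu.allocate("claim-a", 8))   # True
    print(gpu.allocate("claim-b", 30))  # True
    print(gpu.allocate("claim-c", 8))   # False: only 2 GiB left
```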
DistinctAttribute constraint
This feature also introduces a new constraint: DistinctAttribute, which is the complement of the existing MatchAttribute constraint.
The primary goal of DistinctAttribute is to prevent the same underlying device from being allocated multiple times within a single ResourceClaim, which could happen since we are allocating shares (or subsets) of devices. This constraint ensures that each allocation refers to a distinct resource, even if they belong to the same device class.
It is useful for use cases such as allocating network devices connecting to different subnets to expand coverage or provide redundancy across failure domains.
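The relationship between the two constraints can be sketched in a few lines of Python. Both functions are hypothetical illustrations that operate on plain dictionaries, the attribute name example.com/subnet is made up, and treating a device that lacks the attribute as failing the constraint is an assumption.

```python
def satisfies_match_attribute(devices: list, attribute: str) -> bool:
    """All devices must carry the same value for the attribute."""
    values = [d["attributes"].get(attribute) for d in devices]
    return None not in values and len(set(values)) == 1


def satisfies_distinct_attribute(devices: list, attribute: str) -> bool:
    """Every device must carry a different value for the attribute."""
    values = [d["attributes"].get(attribute) for d in devices]
    return None not in values and len(set(values)) == len(values)


if __name__ == "__main__":
    nics = [
        {"name": "nic0", "attributes": {"example.com/subnet": "10.0.1.0/24"}},
        {"name": "nic1", "attributes": {"example.com/subnet": "10.0.2.0/24"}},
    ]
    print(satisfies_distinct_attribute(nics, "example.com/subnet"))  # True
    print(satisfies_match_attribute(nics, "example.com/subnet"))     # False
```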
How to use consumable capacity?
DRAConsumableCapacity is introduced as an alpha feature in Kubernetes 1.34. The feature gate DRAConsumableCapacity must be enabled in kubelet, kube-apiserver, kube-scheduler and kube-controller-manager.
--feature-gates=...,DRAConsumableCapacity=true
As a DRA driver developer
As a DRA driver developer writing in Golang, you can make a device within a ResourceSlice allocatable to multiple ResourceClaims (or devices.requests) by setting AllowMultipleAllocations to true.
Device {
    ...
    AllowMultipleAllocations: ptr.To(true),
    ...
}
Additionally, you can define a policy to restrict how each device's Capacity should be consumed by each DeviceRequest by defining the RequestPolicy field in the DeviceCapacity. The example below shows how to define a policy that requires a GPU with 40 GiB of memory to allocate at least 5 GiB per request, with each allocation in multiples of 5 GiB.
DeviceCapacity{
    Value: resource.MustParse("40Gi"),
    RequestPolicy: &CapacityRequestPolicy{
        Default: ptr.To(resource.MustParse("5Gi")),
        ValidRange: &CapacityRequestPolicyRange{
            Min:  ptr.To(resource.MustParse("5Gi")),
            Step: ptr.To(resource.MustParse("5Gi")),
        },
    },
}
This will be published to the ResourceSlice, as partially shown below:
apiVersion: resource.k8s.io/v1
kind: ResourceSlice
...
spec:
  devices:
    - name: gpu0
      allowMultipleAllocations: true
      capacity:
        memory:
          value: 40Gi
          requestPolicy:
            default: 5Gi
            validRange:
              min: 5Gi
              step: 5Gi
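To get a feel for what such a policy means for individual requests, here is a hedged Python sketch. consumed_capacity is a hypothetical helper working in whole GiB, and it assumes that a request below the minimum or between steps is rounded up to the next valid amount; the post does not say whether the scheduler rounds or rejects such requests, so treat that behavior as an assumption.

```python
from typing import Optional


def consumed_capacity(requested: Optional[int], default: int, minimum: int,
                      step: int, capacity: int) -> Optional[int]:
    """Return the GiB a request would consume, or None if it cannot fit."""
    # A request without a capacity entry falls back to the policy default.
    amount = default if requested is None else requested
    # Assumption: undersized requests are raised to the minimum.
    if amount < minimum:
        amount = minimum
    # Assumption: amounts between steps are rounded up to the next step.
    remainder = (amount - minimum) % step
    if remainder:
        amount += step - remainder
    return amount if amount <= capacity else None


if __name__ == "__main__":
    # Policy from the example: default 5, min 5, step 5, device capacity 40.
    for request in (None, 3, 7, 10, 41):
        print(request, "->", consumed_capacity(request, 5, 5, 5, 40))
```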
An allocated device with a specified portion of consumed capacity will have a ShareID field set in the allocation status.
claim.Status.Allocation.Devices.Results[i].ShareID
This ShareID allows the driver to distinguish between different allocations that refer to the same device or same statically-partitioned slice but come from different ResourceClaim requests.
It acts as a unique identifier for each shared slice, enabling the driver to manage and enforce resource limits independently across multiple consumers.
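One way a driver could use this identifier is as part of the key under which it stores per-consumer limits. The Python sketch below is purely illustrative: ShareTable and its methods are hypothetical names, not part of any DRA driver library, and the share IDs are made-up strings.

```python
from typing import Optional


class ShareTable:
    """Per-share limits keyed by (device name, share ID)."""

    def __init__(self) -> None:
        self._limits = {}

    def record(self, device: str, share_id: str, limit_gib: int) -> None:
        """Remember the limit to enforce for one share of a device."""
        self._limits[(device, share_id)] = limit_gib

    def release(self, device: str, share_id: str) -> None:
        """Forget a share once its claim has been deallocated."""
        self._limits.pop((device, share_id), None)

    def limit(self, device: str, share_id: str) -> Optional[int]:
        return self._limits.get((device, share_id))


if __name__ == "__main__":
    table = ShareTable()
    table.record("gpu0", "share-a", 10)
    table.record("gpu0", "share-b", 5)
    print(table.limit("gpu0", "share-a"), table.limit("gpu0", "share-b"))
```

Keying on the pair rather than on the device name alone is what lets two allocations of the same device carry different limits.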
As a consumer
As a consumer (or user), the device resource can be requested with a ResourceClaim like this:
apiVersion: resource.k8s.io/v1
kind: ResourceClaim
...
spec:
  devices:
    requests: # for devices
      - name: req0
        exactly:
          deviceClassName: resource.example.com
          capacity:
            requests: # for resources which must be provided by those devices
              memory: 10Gi
This configuration ensures that the requested device can provide at least 10GiB of memory.
Note that any resource.example.com device that has at least 10GiB of memory can be allocated. If a device that does not support multiple allocations is chosen, the allocation consumes the entire device. To filter for only devices that support multiple allocations, you can define a selector like this:
selectors:
  - cel:
      expression: |-
        device.allowMultipleAllocations == true
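The effect of that selector, and of choosing a device that cannot be shared, can be mimicked with a short Python sketch. candidate_devices is a hypothetical function over plain dictionaries with memory in whole GiB; it only mirrors the behavior described above and is not how the scheduler evaluates CEL.

```python
def candidate_devices(devices: list, requested_gib: int, shareable_only: bool = False) -> list:
    """Return (device name, GiB consumed) for each device that can serve the request."""
    result = []
    for device in devices:
        shareable = device.get("allowMultipleAllocations", False)
        if device["memory_gib"] < requested_gib:
            continue  # not enough memory for the request
        if shareable_only and not shareable:
            continue  # mirrors device.allowMultipleAllocations == true
        # A device that cannot be shared is consumed entirely.
        consumed = requested_gib if shareable else device["memory_gib"]
        result.append((device["name"], consumed))
    return result


if __name__ == "__main__":
    devices = [
        {"name": "gpu0", "memory_gib": 40, "allowMultipleAllocations": True},
        {"name": "gpu1", "memory_gib": 40},
        {"name": "gpu2", "memory_gib": 8, "allowMultipleAllocations": True},
    ]
    print(candidate_devices(devices, 10))
    print(candidate_devices(devices, 10, shareable_only=True))
```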
Integration with DRA device status
In device sharing, general device information is provided through the resource slice. However, some details are set dynamically after allocation. These can be conveyed using the .status.devices field of a ResourceClaim. That field is only published in clusters where the DRAResourceClaimDeviceStatus feature gate is enabled.
If you do have device status support available, a driver can expose additional device-specific information beyond the ShareID. One particularly useful use case is for virtual networks, where a driver can include the assigned IP address(es) in the status. This is valuable for both network service operations and troubleshooting.
You can find more information by watching our recording at: KubeCon Japan 2025 - Reimagining Cloud Native Networks: The Critical Role of DRA.
What can you do next?
Check out the CNI DRA Driver project for an example of DRA integration in Kubernetes networking. Try integrating with network resources like macvlan, ipvlan, or smart NICs.
Start enabling the DRAConsumableCapacity feature gate and experimenting with virtualized or partitionable devices. Specify your workloads with consumable capacity (for example: fractional bandwidth or memory).
Let us know your feedback:
✅ What worked well?
⚠️ What didn’t?
If you encounter issues that need fixing or see opportunities for enhancement, please file a new issue and reference KEP-5075 there, or reach out via Slack (#wg-device-management).
Conclusion
Consumable capacity support enhances the device sharing capability of DRA by allowing effective device sharing across namespaces, across claims, and tailored to each Pod’s actual needs. It also empowers drivers to enforce capacity limits, improves scheduling accuracy, and unlocks new use cases like bandwidth-aware networking and multi-tenant device sharing.
Try it out, experiment with consumable resources, and help shape the future of dynamic resource allocation in Kubernetes!
Further Reading
DRA in the Kubernetes documentation
KEP for DRA Partitionable Devices
KEP for DRA Device Status
KEP for DRA Consumable Capacity
Kubernetes 1.34 Release Notes
via Kubernetes Blog https://kubernetes.io/
September 18, 2025 at 02:30PM