1_r/devopsish
Week Ending October 19, 2025
https://lwkd.info/2025/20251022
Developer News
SIG etcd has found another potential upgrade failure preventing some users from upgrading to etcd 3.6. The blog post gives instructions on how to avoid it, mainly updating to 3.5.24 first.
Release Schedule
Next Deadline: Docs Deadline for placeholder PRs, October 23
The deadline for opening your placeholder docs PRs is coming up soon. If you have a KEP tracked for v1.35, make sure that you have a placeholder PR in k/website for your docs before the deadline.
The v1.35 Enhancements Freeze has been in effect since October 17th. Out of the 101 KEPs opted in for the release, 75 made the cut for enhancements freeze.
Steering Committee Election
The Steering Committee Election voting ends later this week on Friday, 24th October, AoE. You can check your eligibility to vote in the voting app. Don’t forget to cast your votes if you haven’t already!
The deadline to file an exception request is 22nd October, AoE. Submit an exception request soon if you think you’re eligible!
KEP of the Week
KEP-4742: Expose Node Topology Labels via Downward API
This KEP introduces a built-in Kubernetes admission plugin that automatically copies node topology labels (like zone, region, or rack) onto Pods. It allows Pods to access this topology data through the Downward API without using privileged init containers or custom scripts. The change simplifies topology-aware workloads such as distributed AI/ML training, CNI optimizations, and sharded databases, making topology awareness a secure and native part of Kubernetes.
This KEP is tracked for beta in v1.35.
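For a sense of what this looks like in practice, here is a minimal sketch of a Pod reading its own (admission-copied) zone label through the Downward API. The Pod name and image are placeholders, and the label key assumes the standard topology.kubernetes.io/zone node label.

```yaml
# Sketch only: assumes the new admission plugin has already copied the node's
# topology labels (e.g. topology.kubernetes.io/zone) onto the Pod's metadata.
apiVersion: v1
kind: Pod
metadata:
  name: topology-aware-worker          # hypothetical name
spec:
  containers:
    - name: worker
      image: registry.example.com/worker:1.0   # placeholder image
      env:
        - name: NODE_ZONE
          valueFrom:
            fieldRef:
              fieldPath: metadata.labels['topology.kubernetes.io/zone']
```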
Other Merges
Declarative validation tags have a StabilityLevel
Test external VolumeGroupSnapshots in 1.35
AllocationConfigSource is validated
APF properly counts legacy watches
Declarative Validation rollout: DeviceClassName, update, ResourceClaim, maxItems, DRA fields, DeviceAllocationMode
Simplify kube-cross builds
Promotions
ExecProbeTimeout to GA
max-allowable-numa-nodes to GA
Deprecated
storage.k8s.io/v1alpha1 is no longer served
Version Updates
Golang update: 1.24.9 in 1.31 through 1.34, 1.25.3 in 1.35
etcd to v3.5.23, just in time to replace it with 3.5.24
Shoutouts
Rayan Das – A big shout-out to the v1.35 Enhancements shadows (@dchan, @jmickey, @aibarbetta, @Subhasmita, @Faeka Ansari) for their hard work leading up to Enhancements Freeze yesterday.
via Last Week in Kubernetes Development https://lwkd.info/
October 22, 2025 at 07:55PM
Ep37 - Ask Me Anything About Anything with Scott Rosenberg
There are no restrictions in this AMA session. You can ask anything about DevOps, AI, Cloud, Kubernetes, Platform Engineering, containers, or anything else. Scott Rosenberg, a regular guest, will be here to help us out.
▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬ Sponsor: Octopus 🔗 Enterprise Support for Argo: https://octopus.com/support/enterprise-argo-support ▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬
▬▬▬▬▬▬ 👋 Contact me 👋 ▬▬▬▬▬▬ ➡ BlueSky: https://vfarcic.bsky.social ➡ LinkedIn: https://www.linkedin.com/in/viktorfarcic/
▬▬▬▬▬▬ 🚀 Other Channels 🚀 ▬▬▬▬▬▬ 🎤 Podcast: https://www.devopsparadox.com/ 💬 Live streams: https://www.youtube.com/c/DevOpsParadox
via YouTube https://www.youtube.com/watch?v=--z0NqQN3J8
The Double-Edged Sword of AI-Assisted Kubernetes Operations, with Mai Nishitani
Mai Nishitani, Director of Enterprise Architecture at NTT Data and AWS Community Builder, demonstrates how Model Context Protocol (MCP) enables Claude to directly interact with Kubernetes clusters through natural language commands.
You will learn:
How MCP servers work and why they're significant for standardizing AI integration with DevOps tools, moving beyond custom integrations to a universal protocol
The practical capabilities and critical limitations of AI in Kubernetes operations
Why fundamental troubleshooting skills matter more than ever as AI abstractions can fail in unexpected ways, especially during crisis scenarios and complex system failures
How DevOps roles are evolving from manual administration toward strategic architecture and orchestration
Sponsor
This episode is brought to you by Testkube—where teams run millions of performance tests in real Kubernetes infrastructure. From air-gapped environments to massive scale deployments, orchestrate every testing tool in one platform. Check it out at testkube.io
More info
Find all the links and info for this episode here: https://ku.bz/3hWvQjXxp
Interested in sponsoring an episode? Learn more.
via KubeFM https://kube.fm
October 21, 2025 at 06:00AM
MCP Server Deployment Guide: From Local To Production
Discover the four main ways to deploy MCP servers, from simple local execution to enterprise-ready Kubernetes clusters. This comprehensive guide explores the trade-offs between NPX local deployment, Docker containerization, Kubernetes production setups, and cloud platform alternatives like Fly.io and Cloudflare Workers.
You'll see practical demonstrations of each approach using a real MCP server, learning about security implications, scalability challenges, and team collaboration benefits. The video covers why local NPX execution creates security risks and dependency nightmares, how Docker provides better isolation but remains single-user, and why Kubernetes offers the best solution for shared organizational infrastructure. We also examine the ToolHive operator's limitations and explore various cloud deployment options with their respective vendor lock-in considerations. Whether you're developing MCP servers or deploying them for your team, this guide will help you choose the right deployment strategy for your specific needs.
▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬ Sponsor: Browserbase 🔗 https://browserbase.com ▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬
#MCP #ModelContextProtocol
Consider joining the channel: https://www.youtube.com/c/devopstoolkit/join
▬▬▬▬▬▬ 🔗 Additional Info 🔗 ▬▬▬▬▬▬ ➡ Transcript and commands: https://devopstoolkit.live/ai/mcp-server-deployment-guide-from-local-to-production 🔗 Model Context Protocol: https://modelcontextprotocol.io
▬▬▬▬▬▬ 💰 Sponsorships 💰 ▬▬▬▬▬▬ If you are interested in sponsoring this channel, please visit https://devopstoolkit.live/sponsor for more information. Alternatively, feel free to contact me over Twitter or LinkedIn (see below).
▬▬▬▬▬▬ 👋 Contact me 👋 ▬▬▬▬▬▬ ➡ BlueSky: https://vfarcic.bsky.social ➡ LinkedIn: https://www.linkedin.com/in/viktorfarcic/
▬▬▬▬▬▬ 🚀 Other Channels 🚀 ▬▬▬▬▬▬ 🎤 Podcast: https://www.devopsparadox.com/ 💬 Live streams: https://www.youtube.com/c/DevOpsParadox
▬▬▬▬▬▬ ⏱ Timecodes ⏱ ▬▬▬▬▬▬ 00:00 Model Context Protocol (MCP) Deployment 01:40 Browserbase (sponsor) 02:50 MCP Local NPX Deployment 06:32 MCP Docker Container Deployment 09:23 MCP Kubernetes Production Deployment 14:09 MCP ToolHive Kubernetes Operator 19:15 Alternative MCP Deployment Options 22:46 Choosing the Right MCP Deployment
via YouTube https://www.youtube.com/watch?v=MHf-M8qOogY
7 Common Kubernetes Pitfalls (and How I Learned to Avoid Them)
https://kubernetes.io/blog/2025/10/20/seven-kubernetes-pitfalls-and-how-to-avoid/
It’s no secret that Kubernetes can be both powerful and frustrating at times. When I first started dabbling with container orchestration, I made more than my fair share of mistakes, enough to compile a whole list of pitfalls. In this post, I want to walk through seven big gotchas I’ve encountered (or seen others run into) and share some tips on how to avoid them. Whether you’re just kicking the tires on Kubernetes or already managing production clusters, I hope these insights help you steer clear of a little extra stress.
- Skipping resource requests and limits
The pitfall: Not specifying CPU and memory requirements in Pod specifications. This typically happens because Kubernetes does not require these fields, and workloads can often start and run without them—making the omission easy to overlook in early configurations or during rapid deployment cycles.
Context: In Kubernetes, resource requests and limits are critical for efficient cluster management. Resource requests ensure that the scheduler reserves the appropriate amount of CPU and memory for each pod, guaranteeing that it has the necessary resources to operate. Resource limits cap the amount of CPU and memory a pod can use, preventing any single pod from consuming excessive resources and potentially starving other pods. When resource requests and limits are not set:
Resource Starvation: Pods may get insufficient resources, leading to degraded performance or failures. This is because Kubernetes schedules pods based on these requests. Without them, the scheduler might place too many pods on a single node, leading to resource contention and performance bottlenecks.
Resource Hoarding: Conversely, without limits, a pod might consume more than its fair share of resources, impacting the performance and stability of other pods on the same node. This can lead to issues such as other pods getting evicted or killed by the Out-Of-Memory (OOM) killer due to lack of available memory.
How to avoid it:
Start with modest requests (for example 100m CPU, 128Mi memory) and see how your app behaves.
Monitor real-world usage and refine your values; the HorizontalPodAutoscaler can help automate scaling based on metrics.
Keep an eye on kubectl top pods or your logging/monitoring tool to confirm you’re not over- or under-provisioning.
My reality check: Early on, I never thought about memory limits. Things seemed fine on my local cluster. Then, on a larger environment, Pods got OOMKilled left and right. Lesson learned. For detailed instructions on configuring resource requests and limits for your containers, please refer to Assign Memory Resources to Containers and Pods (part of the official Kubernetes documentation).
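As a concrete starting point for the advice above, here is a minimal sketch using the modest 100m/128Mi values mentioned earlier; the Pod name, image, and limit values are placeholders to tune against observed usage.

```yaml
# Minimal sketch: modest starting requests/limits as suggested above.
# Name and image are placeholders; refine values from real-world usage.
apiVersion: v1
kind: Pod
metadata:
  name: web-app                                  # hypothetical
spec:
  containers:
    - name: web
      image: registry.example.com/web-app:1.2.3  # placeholder
      resources:
        requests:
          cpu: 100m
          memory: 128Mi
        limits:
          cpu: 500m          # assumed ceiling; adjust per workload
          memory: 256Mi
```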
- Underestimating liveness and readiness probes
The pitfall: Deploying containers without explicitly defining how Kubernetes should check their health or readiness. This tends to happen because Kubernetes will consider a container “running” as long as the process inside hasn’t exited. Without additional signals, Kubernetes assumes the workload is functioning—even if the application inside is unresponsive, initializing, or stuck.
Context:
Liveness, readiness, and startup probes are mechanisms Kubernetes uses to monitor container health and availability.
Liveness probes determine if the application is still alive. If a liveness check fails, the container is restarted.
Readiness probes control whether a container is ready to serve traffic. Until the readiness probe passes, the container is removed from Service endpoints.
Startup probes help distinguish between long startup times and actual failures.
How to avoid it:
Add a simple HTTP livenessProbe to check a health endpoint (for example /healthz) so Kubernetes can restart a hung container.
Use a readinessProbe to ensure traffic doesn’t reach your app until it’s warmed up.
Keep probes simple. Overly complex checks can create false alarms and unnecessary restarts.
My reality check: I once forgot a readiness probe for a web service that took a while to load. Users hit it prematurely, got weird timeouts, and I spent hours scratching my head. A 3-line readiness probe would have saved the day.
For comprehensive instructions on configuring liveness, readiness, and startup probes for containers, please refer to Configure Liveness, Readiness and Startup Probes in the official Kubernetes documentation.
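In the spirit of that three-line readiness probe, here is a hedged container-level sketch; the /healthz and /ready paths, the port, and the timings are assumptions to adapt to your application.

```yaml
# Container-level snippet; endpoint paths, port, and timings are assumptions.
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 10
readinessProbe:
  httpGet:
    path: /ready
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 5
```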
- “We’ll just look at container logs” (famous last words)
The pitfall: Relying solely on container logs retrieved via kubectl logs. This often happens because the command is quick and convenient, and in many setups, logs appear accessible during development or early troubleshooting. However, kubectl logs only retrieves logs from currently running or recently terminated containers, and those logs are stored on the node’s local disk. As soon as the container is deleted, evicted, or the node is restarted, the log files may be rotated out or permanently lost.
How to avoid it:
Centralize logs using CNCF tools like Fluentd or Fluent Bit to aggregate output from all Pods.
Adopt OpenTelemetry for a unified view of logs, metrics, and (if needed) traces. This lets you spot correlations between infrastructure events and app-level behavior.
Pair logs with Prometheus metrics to track cluster-level data alongside application logs. If you need distributed tracing, consider CNCF projects like Jaeger.
My reality check: The first time I lost Pod logs to a quick restart, I realized how flimsy “kubectl logs” can be on its own. Since then, I’ve set up a proper pipeline for every cluster to avoid missing vital clues.
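If you go the Fluent Bit route, the usual pattern is a DaemonSet that tails container logs on every node. The sketch below only shows the shape of it; the namespace and image tag are assumptions, and the Fluent Bit configuration itself (inputs, filters, outputs) is omitted, so treat the official Fluent Bit manifests or Helm chart as the practical starting point.

```yaml
# Shape of a node-level log collector; namespace and image tag are assumptions,
# and the Fluent Bit configuration (ConfigMap) is omitted for brevity.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: fluent-bit
  namespace: logging
spec:
  selector:
    matchLabels:
      app: fluent-bit
  template:
    metadata:
      labels:
        app: fluent-bit
    spec:
      containers:
        - name: fluent-bit
          image: fluent/fluent-bit:3.1   # pin a version you have verified
          volumeMounts:
            - name: varlog
              mountPath: /var/log
              readOnly: true
      volumes:
        - name: varlog
          hostPath:
            path: /var/log
```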
- Treating dev and prod exactly the same
The pitfall: Deploying the same Kubernetes manifests with identical settings across development, staging, and production environments. This often occurs when teams aim for consistency and reuse, but overlook that environment-specific factors—such as traffic patterns, resource availability, scaling needs, or access control—can differ significantly. Without customization, configurations optimized for one environment may cause instability, poor performance, or security gaps in another.
How to avoid it:
Use environment overlays or kustomize to maintain a shared base while customizing resource requests, replicas, or config for each environment.
Extract environment-specific configuration into ConfigMaps and/or Secrets. You can use a specialized tool such as Sealed Secrets to manage confidential data.
Plan for scale in production. Your dev cluster can probably get away with minimal CPU/memory, but prod might need significantly more.
My reality check: One time, I scaled up replicaCount from 2 to 10 in a tiny dev environment just to “test.” I promptly ran out of resources and spent half a day cleaning up the aftermath. Oops.
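To make the overlay idea concrete, here is a minimal kustomize sketch for a hypothetical prod overlay that reuses a shared base but bumps the replica count; the base path, Deployment name, and values are illustrative.

```yaml
# overlays/prod/kustomization.yaml (hypothetical layout): same base, bigger footprint.
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../base
patches:
  - target:
      kind: Deployment
      name: web-app            # illustrative name
    patch: |-
      - op: replace
        path: /spec/replicas
        value: 10
```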
- Leaving old stuff floating around
The pitfall: Leaving unused or outdated resources—such as Deployments, Services, ConfigMaps, or PersistentVolumeClaims—running in the cluster. This often happens because Kubernetes does not automatically remove resources unless explicitly instructed, and there is no built-in mechanism to track ownership or expiration. Over time, these forgotten objects can accumulate, consuming cluster resources, increasing cloud costs, and creating operational confusion, especially when stale Services or LoadBalancers continue to route traffic.
How to avoid it:
Label everything with a purpose or owner label. That way, you can easily query resources you no longer need.
Regularly audit your cluster: run kubectl get all -n <namespace> to see what’s actually running, and confirm it’s all legit.
Adopt Kubernetes’ Garbage Collection: K8s docs show how to remove dependent objects automatically.
Leverage policy automation: Tools like Kyverno can automatically delete or block stale resources after a certain period, or enforce lifecycle policies so you don’t have to remember every single cleanup step.
My reality check: After a hackathon, I forgot to tear down a “test-svc” pinned to an external load balancer. Three weeks later, I realized I’d been paying for that load balancer the entire time. Facepalm.
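Labeling everything by owner and purpose makes that kind of audit a one-liner later. A small sketch of the idea, with made-up label values and the label-selector query you could run during cleanup:

```yaml
# Example ownership labels (values are made up); query later with a selector, e.g.:
#   kubectl get all --all-namespaces -l owner=platform-team,purpose=hackathon
apiVersion: v1
kind: Service
metadata:
  name: test-svc
  labels:
    owner: platform-team
    purpose: hackathon
spec:
  selector:
    app: test-app
  ports:
    - port: 80
      targetPort: 8080
```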
- Diving too deep into networking too soon
The pitfall: Introducing advanced networking solutions—such as service meshes, custom CNI plugins, or multi-cluster communication—before fully understanding Kubernetes' native networking primitives. This commonly occurs when teams implement features like traffic routing, observability, or mTLS using external tools without first mastering how core Kubernetes networking works: Pod-to-Pod communication, ClusterIP Services, DNS resolution, and basic ingress traffic handling. As a result, network-related issues become harder to troubleshoot, especially when overlays introduce additional abstractions and failure points.
How to avoid it:
Start small: a Deployment, a Service, and a basic ingress controller such as one based on NGINX (e.g., Ingress-NGINX).
Make sure you understand how traffic flows within the cluster, how service discovery works, and how DNS is configured.
Only move to a full-blown mesh or advanced CNI features when you actually need them; complex networking adds overhead.
My reality check: I tried Istio on a small internal app once, then spent more time debugging Istio itself than the actual app. Eventually, I stepped back, removed Istio, and everything worked fine.
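To make "start small" concrete, here is a sketch of the plain Service-plus-Ingress path before reaching for a mesh; the host, resource names, and ingressClassName are assumptions for illustration.

```yaml
# Minimal north-south path without a mesh: one Service, one Ingress.
# Host, names, and ingressClassName are assumptions.
apiVersion: v1
kind: Service
metadata:
  name: web-app
spec:
  selector:
    app: web-app
  ports:
    - port: 80
      targetPort: 8080
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: web-app
spec:
  ingressClassName: nginx
  rules:
    - host: web.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: web-app
                port:
                  number: 80
```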
- Going too light on security and RBAC
The pitfall: Deploying workloads with insecure configurations, such as running containers as the root user, using the latest image tag, disabling security contexts, or assigning overly broad RBAC roles like cluster-admin. These practices persist because Kubernetes does not enforce strict security defaults out of the box, and the platform is designed to be flexible rather than opinionated. Without explicit security settings, workloads often end up running with far more privilege and access than they need.
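A hedged sketch of tightening those defaults: a non-root, pinned-image security context and a narrowly scoped Role in place of cluster-admin. Names, namespace, and the exact permissions your workloads need are placeholders.

```yaml
# Sketch: run as non-root with a pinned image, and grant a narrow namespaced
# Role instead of cluster-admin. Names, namespace, and verbs are illustrative.
apiVersion: v1
kind: Pod
metadata:
  name: web-app
spec:
  containers:
    - name: web
      image: registry.example.com/web-app:1.2.3   # pinned, not "latest"
      securityContext:
        runAsNonRoot: true
        allowPrivilegeEscalation: false
        readOnlyRootFilesystem: true
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: web-app-reader
  namespace: demo
rules:
  - apiGroups: [""]
    resources: ["configmaps"]
    verbs: ["get", "list", "watch"]
```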
The Making of Flux: The Future, a KubeFM Original Series
In this closing episode, Bryan Ross (Field CTO at GitLab), Jane Yan (Principal Program Manager at Microsoft), Sean O’Meara (CTO at Mirantis) and William Rizzo (Strategy Lead, CTO Office at Mirantis) discuss how GitOps evolves in practice.
How enterprises are embedding Flux into developer platforms and managed cloud services.
Why bridging CI/CD and infrastructure remains a core challenge—and how GitOps addresses it.
What leading platform teams (GitLab, Microsoft, Mirantis) see as the next frontier for GitOps.
Sponsor
Join the Flux maintainers and community at FluxCon, November 11th in Atlanta—register here
More info
Find all the links and info for this episode here: https://ku.bz/tVqKwNYQH
Interested in sponsoring an episode? Learn more.
via KubeFM https://kube.fm
October 20, 2025 at 06:00AM