
Week Ending May 18, 2025
https://lwkd.info/2025/20250521
Developer News
James Sturtevant and Amim Knabben are stepping down from their roles as technical leads in SIG Windows, and Yuanliang Zhang is nominated as the new lead.
Wenjia Zhang has stepped down as the co-chair of Kubernetes SIG etcd. Siyuan Zhang is nominated to take over Wenjia’s role as the co-chair.
SIG Contributor Experience has updated the help-wanted guidelines to remove the “low barrier to entry” requirement. This sharpens the distinction between “good first issue” and “help wanted” and better aligns with other open source projects. Help-wanted issues still require a clearly defined task, a “goldilocks” priority (neither too high nor too low), and up-to-date information.
Release Schedule
Next Deadline: v1.34 cycle starts May 19
The v1.34 release cycle officially started this week, with a planned release date of August 27.
Patch releases v1.33.1, v1.32.5, v1.31.9, and v1.30.13 are available. These are mostly bugfix releases, with a Go update.
Featured PRs
131299: DRA: prevent admin access claims from getting duplicate devices
This PR fixes a bug where ResourceClaims with adminAccess could be allocated the same device multiple times within a single claim. The DRA allocator now checks that each device is used only once per claim, preventing invalid CDI specs and ensuring correct behavior for device sharing with Dynamic Resource Allocation.
131345: scheduler: return UnschedulableAndUnresolvable when node capacity is insufficient
This PR updates the NodeResourcesFit plugin to return UnschedulableAndUnresolvable when a pod’s resource requests exceed a node’s allocatable capacity, even if the node is empty. This avoids unnecessary preemption attempts against nodes that can never satisfy the request, improves scheduling efficiency in large clusters, and provides clearer signals for unschedulable pods.
KEP of the Week
KEP 4247: Per-plugin callback functions for efficient requeueing in the scheduling queue
This KEP introduced the QueueingHint functionality to the Kubernetes scheduler, enabling plugins to provide more precise suggestions about when to requeue Pods. By filtering out low-impact events, such as Node updates that are irrelevant to NodeAffinity, the scheduler reduced redundant retries and improved scheduling throughput. The KEP also allowed plugins like the DRA plugin to skip backoff in specific cases, enhancing performance for Pods requiring dynamic resource allocation by avoiding unnecessary delays while waiting for device driver updates.
This KEP is tracked for beta in v1.34.
Other Merges
e2e tests for kuberc added
Scheduler improved the backoff calculation to O(1)
Response body closed after HTTP calls in watch test
Error message improved when a pod with user namespaces is created and the runtime doesn’t support user namespaces
DRA: Reject NodePrepareResources if the cached claim UID doesn’t match resource claim
Added suggestChangeEmulationVersion to clarify how to test a locked feature with an emulation version
kubelet removed the deprecated --cloud-config flag
Non-scheduling-related errors no longer lengthen the Pod scheduling backoff time
kube-log-runner adds log rotation
Scheduler introduced pInfo.GatingPlugin to filter out events more generally
Subprojects and Dependency Updates
etcd released v3.6.0, bringing bugfixes and features such as robust downgrade support, full migration to the v3store backend, Kubernetes-style feature gates, major memory optimizations, and new health check endpoints for improved cluster monitoring.
Shoutouts
Josh Berkus (@jberkus): A big TY to Benjamin Wang (@Benjamin Wang) and Wenjia Zhang (@wenjiaswe) for getting Etcd 3.6 out the door, and to Tim Bannister (@LMKTFY), Ryota Sawada (@Ryota), Mario Fahlandt (@Mario Fahlandt) and Kaslin Fields (@kaslin) for helping promote it!
via Last Week in Kubernetes Development https://lwkd.info/
May 21, 2025 at 04:00PM
Managing 100s of Kubernetes Clusters using Cluster API, with Zain Malik
Discover how to manage Kubernetes at scale with declarative infrastructure and automation principles.
Zain Malik shares his experience managing multi-tenant Kubernetes clusters with up to 30,000 pods across clusters capped at 950 nodes. He explains how his team transitioned from Terraform to Cluster API for declarative cluster lifecycle management, contributing upstream to improve AKS support while implementing GitOps workflows.
You will learn:
How to address challenges in large-scale Kubernetes operations, including node pool management inconsistencies and lengthy provisioning times
Why Cluster API provides a powerful foundation for multi-cloud cluster management, and how to extend it with custom operators for production-specific needs
How implementing GitOps principles eliminates manual intervention in critical operations like cluster upgrades
Strategies for handling production incidents and bugs when adopting emerging technologies like Cluster API
Sponsor
This episode is sponsored by Learnk8s — get started on your Kubernetes journey through comprehensive online, in-person or remote training.
More info
Find all the links and info for this episode here: https://ku.bz/5PLksqVlk
Interested in sponsoring an episode? Learn more.
via KubeFM https://kube.fm
May 20, 2025 at 06:00AM
Ep22 - Ask Me Anything About Anything with Scott Rosenberg
There are no restrictions in this AMA session. You can ask anything about DevOps, Cloud, Kubernetes, Platform Engineering, containers, or anything else. We'll have special guests Scott Rosenberg and Ramiro Berrelleza to help us out.
▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬ Sponsor: Codefresh 🔗 GitOps Argo CD Certifications: https://learning.codefresh.io (use "viktor" for a 50% discount) ▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬
▬▬▬▬▬▬ 👋 Contact me 👋 ▬▬▬▬▬▬ ➡ BlueSky: https://vfarcic.bsky.social ➡ LinkedIn: https://www.linkedin.com/in/viktorfarcic/
▬▬▬▬▬▬ 🚀 Other Channels 🚀 ▬▬▬▬▬▬ 🎤 Podcast: https://www.devopsparadox.com/ 💬 Live streams: https://www.youtube.com/c/DevOpsParadox
via YouTube https://www.youtube.com/watch?v=7brdKxUiB9s
Outdated AI Responses? Context7 Solves LLMs' Biggest Flaw
Discover the power of AI-enhanced coding with Context7! This video explores how to overcome outdated LLM information using Context7, an MCP server that provides up-to-date documentation. See how Context7 integrates with AI agents, improving their ability to provide current, reliable information for over 11,000 projects. Boost your development workflow and stay ahead with cutting-edge tools and techniques.
▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬ Sponsor: Korbit AI 🔗 https://korbit.ai ▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬
#AIAgents #Context7 #AIDocs
Consider joining the channel: https://www.youtube.com/c/devopstoolkit/join
▬▬▬▬▬▬ 🔗 Additional Info 🔗 ▬▬▬▬▬▬ ➡ Transcript and commands: https://devopstoolkit.live/ai/outdated-ai-responses?-context7-solves-llms-biggest-flaw 🔗 Context7: https://context7.com
▬▬▬▬▬▬ 💰 Sponsorships 💰 ▬▬▬▬▬▬ If you are interested in sponsoring this channel, please visit https://devopstoolkit.live/sponsor for more information. Alternatively, feel free to contact me over Twitter or LinkedIn (see below).
▬▬▬▬▬▬ ⏱ Timecodes ⏱ ▬▬▬▬▬▬ 00:00 The Problem with Models (LLMs) 01:07 Korbit AI (sponsor) 02:13 The Problem with Models (LLMs) (cont.) 02:23 Agents Using LLM Alone 04:19 Agents with Context7 MCP 07:04 What Is Context7?
via YouTube https://www.youtube.com/watch?v=DeZ-gw_aop0
Kubernetes v1.33: In-Place Pod Resize Graduated to Beta
https://kubernetes.io/blog/2025/05/16/kubernetes-v1-33-in-place-pod-resize-beta/
On behalf of the Kubernetes project, I am excited to announce that the in-place Pod resize feature (also known as In-Place Pod Vertical Scaling), first introduced as alpha in Kubernetes v1.27, has graduated to Beta and will be enabled by default in the Kubernetes v1.33 release! This marks a significant milestone in making resource management for Kubernetes workloads more flexible and less disruptive.
What is in-place Pod resize?
Traditionally, changing the CPU or memory resources allocated to a container required restarting the Pod. While acceptable for many stateless applications, this could be disruptive for stateful services, batch jobs, or any workloads sensitive to restarts.
In-place Pod resizing allows you to change the CPU and memory requests and limits assigned to containers within a running Pod, often without requiring a container restart.
Here's the core idea:
The spec.containers[*].resources field in a Pod specification now represents the desired resources and is mutable for CPU and memory.
The status.containerStatuses[*].resources field reflects the actual resources currently configured on a running container.
You can trigger a resize by updating the desired resources in the Pod spec via the new resize subresource.
You can try it out on a v1.33 Kubernetes cluster by using kubectl to edit a Pod (requires kubectl v1.32+):
kubectl edit pod <pod-name> --subresource resize
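If you prefer an explicit request over an interactive edit, the same resize can be expressed with kubectl patch against the resize subresource. A minimal sketch, assuming a container named app and a CPU resize to 800m (the pod name, container name, and values are placeholders):

kubectl patch pod <pod-name> --subresource resize --patch \
  '{"spec":{"containers":[{"name":"app","resources":{"requests":{"cpu":"800m"},"limits":{"cpu":"800m"}}}]}}'

Afterwards, the desired and actual values can be compared by reading the spec and status fields described above:

kubectl get pod <pod-name> -o jsonpath='{.spec.containers[0].resources}'
kubectl get pod <pod-name> -o jsonpath='{.status.containerStatuses[0].resources}'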
For detailed usage instructions and examples, please refer to the official Kubernetes documentation: Resize CPU and Memory Resources assigned to Containers.
Why does in-place Pod resize matter?
Kubernetes still excels at scaling workloads horizontally (adding or removing replicas), but in-place Pod resizing unlocks several key benefits for vertical scaling:
Reduced Disruption: Stateful applications, long-running batch jobs, and sensitive workloads can have their resources adjusted without suffering the downtime or state loss associated with a Pod restart.
Improved Resource Utilization: Scale down over-provisioned Pods without disruption, freeing up resources in the cluster. Conversely, provide more resources to Pods under heavy load without needing a restart.
Faster Scaling: Address transient resource needs more quickly. For example, Java applications often need more CPU during startup than during steady-state operation. Start with higher CPU and resize down later.
What's changed between Alpha and Beta?
Since the alpha release in v1.27, significant work has gone into maturing the feature, improving its stability, and refining the user experience based on feedback and further development. Here are the key changes:
Notable user-facing changes
resize Subresource: Modifying Pod resources must now be done via the Pod's resize subresource (kubectl patch pod <name> --subresource resize ...). kubectl versions v1.32+ support this argument.
Resize Status via Conditions: The old status.resize field is deprecated. The status of a resize operation is now exposed via two Pod conditions (see the example after this list):
PodResizePending: Indicates the Kubelet cannot grant the resize immediately (e.g., reason: Deferred if temporarily unable, reason: Infeasible if impossible on the node).
PodResizeInProgress: Indicates the resize is accepted and being applied. Errors encountered during this phase are now reported in this condition's message with reason: Error.
Sidecar Support: Resizing sidecar containers in-place is now supported.
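You can read these conditions directly off the Pod to see where a resize stands. A minimal sketch using kubectl's JSONPath support (the pod name is a placeholder):

kubectl get pod <pod-name> -o jsonpath='{.status.conditions[?(@.type=="PodResizePending")]}'
kubectl get pod <pod-name> -o jsonpath='{.status.conditions[?(@.type=="PodResizeInProgress")]}'

An empty result means the condition is not currently set; otherwise, its reason and message fields indicate whether the resize is deferred, infeasible, or in progress.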
Stability and reliability enhancements
Refined Allocated Resources Management: The allocation management logic within the Kubelet was significantly reworked, making it more consistent and robust. The changes eliminated whole classes of bugs and greatly improved the reliability of in-place Pod resize.
Improved Checkpointing & State Tracking: A more robust system for tracking "allocated" and "actuated" resources was implemented, using new checkpoint files (allocated_pods_state, actuated_pods_state) to reliably manage resize state across Kubelet restarts and handle edge cases where runtime-reported resources differ from requested ones. Several bugs related to checkpointing and state restoration were fixed. Checkpointing efficiency was also improved.
Faster Resize Detection: Enhancements to the Kubelet's Pod Lifecycle Event Generator (PLEG) allow the Kubelet to respond to and complete resizes much more quickly.
Enhanced CRI Integration: A new UpdatePodSandboxResources CRI call was added to better inform runtimes and plugins (like NRI) about Pod-level resource changes.
Numerous Bug Fixes: Addressed issues related to systemd cgroup drivers, handling of containers without limits, CPU minimum share calculations, container restart backoffs, error propagation, test stability, and more.
What's next?
Graduating to Beta means the feature is ready for broader adoption, but development doesn't stop here! Here's what the community is focusing on next:
Stability and Productionization: Continued focus on hardening the feature, improving performance, and ensuring it is robust for production environments.
Addressing Limitations: Working towards relaxing some of the current limitations noted in the documentation, such as allowing memory limit decreases.
VerticalPodAutoscaler (VPA) Integration: Work to enable VPA to leverage in-place Pod resize is already underway. A new InPlaceOrRecreate update mode will allow it to attempt non-disruptive resizes first, or fall back to recreation if needed. This will allow users to benefit from VPA's recommendations with significantly less disruption (see the sketch after this list).
User Feedback: Gathering feedback from users adopting the beta feature is crucial for prioritizing further enhancements and addressing any uncovered issues or bugs.
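For illustration, here is roughly what opting into that mode might look like once it ships. This is a sketch, not a finalized API: the InPlaceOrRecreate mode name comes from the ongoing VPA work described above, the rest follows the existing autoscaling.k8s.io/v1 VPA schema, and the Deployment name is a placeholder:

kubectl apply -f - <<EOF
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: my-app-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  updatePolicy:
    updateMode: "InPlaceOrRecreate"
EOF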
Getting started and providing feedback
With the InPlacePodVerticalScaling feature gate enabled by default in v1.33, you can start experimenting with in-place Pod resizing right away!
Refer to the documentation for detailed guides and examples.
As this feature moves through Beta, your feedback is invaluable. Please report any issues or share your experiences via the standard Kubernetes communication channels (GitHub issues, mailing lists, Slack). You can also review the KEP-1287: In-place Update of Pod Resources for the full in-depth design details.
We look forward to seeing how the community leverages in-place Pod resize to build more efficient and resilient applications on Kubernetes!
via Kubernetes Blog https://kubernetes.io/
May 16, 2025 at 02:30PM