Suggested Reads

Blog: Changes to Kubernetes Slack

https://www.kubernetes.dev/blog/2025/06/16/changes-to-kubernetes-slack-2025/

Kubernetes Slack will lose its special status and will be changing into a standard free Slack on June 20. Sometime later this year, our community will likely move to a new platform. If you are responsible for a channel or private channel, or a member of a User Group, you will need to take some actions as soon as you can.

For the last decade, Slack has supported our project with a free customized enterprise account. They have let us know that they can no longer do so, particularly since our Slack is one of the largest and most active ones on the platform. As such, they will be downgrading it to a standard free Slack while we decide on, and implement, other options.

On Friday, June 20, we will be subject to the [feature limitations of free Slack](https://slack.com/help/articles/27204752526611-Feature-limitations-on-the-free-version-of-Slack). The main limitations affecting us will be retaining only 90 days of history and having to disable several apps and workflows that we currently use. The Slack Admin team will do their best to manage these limitations.

Channel owners, members of private channels, and members of User Groups should take some actions to prepare for the downgrade and preserve information as soon as possible.

The CNCF Projects Staff have proposed that our community look at migrating to Discord. Because we have been pushing the limits of Slack for some time, they have already explored what a Kubernetes Discord would look like. Discord would allow us to implement new tools and integrations that would help the community, such as GitHub group membership synchronization. The Steering Committee will discuss and decide on our future platform.

Please see our FAQ, and check the kubernetes-dev mailing list and the #announcements channel for further news. If you have specific feedback on our Slack status, join the discussion on GitHub.

via Kubernetes Contributors – Contributor Blog https://www.kubernetes.dev/blog/

June 15, 2025 at 08:00PM

·kubernetes.dev·
It matters. I care. - Molly White
When we throw up our hands and say none of it matters, we're doing the fascists’ work for them. They don't need to hide their corruption if they can convince us it's pointless to look. They don't need to silence truth-tellers if we've already decided truth is meaningless.
·citationneeded.news·
Last Week in Kubernetes Development - Week Ending June 8 2025

Week Ending June 8, 2025

https://lwkd.info/2025/20250611

Developer News

The next New Contributor Orientations will be held June 17th. If your SIG/WG/team has any help wanted opportunities to share, please let Mario know in #chairs-and-techleads.

The Elections Subproject is looking for another election officer for the 2025 Steering Election. Please review the role requirements, and express your interest.

KubeCon NA: The CFP for Maintainer Track talks and Project Kiosks is open and closes on July 7th. The CFP for the Maintainer Summit closes on July 20th.

Release Schedule

Next Deadline: PRR Freeze, June 12

Once you get done putting info in your KEPs for production readiness, you’ll be ready for the Enhancements Freeze 8 days later. Now’s the time to decide whether your enhancement is tracked for 1.34 or not.

Patch releases for June have been delayed until next week, as has the 1.34a1 release.

Featured PRs

131632: feat: Allow leases to have custom labels set when a new holder takes the lease

This PR allows custom labels to be set on the Lease object when a new holder acquires the lease, so users can track which node currently holds it, improving observability for components that use leader election.
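
For context, here is a minimal sketch of the client-go leader-election setup this touches; the new label option is shown only as a commented-out, hypothetical field, since the exact field name and location come from the PR itself.

    import (
        "context"
        "time"

        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/client-go/kubernetes"
        "k8s.io/client-go/tools/leaderelection"
        "k8s.io/client-go/tools/leaderelection/resourcelock"
    )

    func runWithLeaderElection(ctx context.Context, cs *kubernetes.Clientset, id string) {
        lock := &resourcelock.LeaseLock{
            LeaseMeta: metav1.ObjectMeta{Name: "my-controller", Namespace: "kube-system"},
            Client:    cs.CoordinationV1(),
            LockConfig: resourcelock.ResourceLockConfig{
                Identity: id,
            },
            // Hypothetical: with the PR above, custom labels (for example the
            // holder's node or zone) can be attached to the Lease when a new
            // holder acquires it. Check PR 131632 for the exact field name.
            // Labels: map[string]string{"holder-node": nodeName},
        }

        leaderelection.RunOrDie(ctx, leaderelection.LeaderElectionConfig{
            Lock:          lock,
            LeaseDuration: 15 * time.Second,
            RenewDeadline: 10 * time.Second,
            RetryPeriod:   2 * time.Second,
            Callbacks: leaderelection.LeaderCallbacks{
                OnStartedLeading: func(ctx context.Context) { /* do leader work */ },
                OnStoppedLeading: func() { /* step down cleanly */ },
            },
        })
    }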

KEP of the Week

KEP 3015: PreferSameZone and PreferSameNode Traffic Distribution

This enhancement deprecates the PreferClose value of the Service trafficDistribution field and replaces it with PreferSameZone as a new name for the same behaviour. The KEP also adds a new value, PreferSameNode, which indicates that traffic for a Service should preferentially be routed to endpoints on the same node as the client. This makes traffic distribution less ambiguous and delivers traffic to a local endpoint when possible; if no local endpoint is available, traffic is routed to a remote endpoint.

This KEP is tracked for beta in v1.34.
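
As a sketch of how a Service might opt in, assuming standard client-go types and using a string literal for the new value (the generated constant name is not quoted here):

    import (
        corev1 "k8s.io/api/core/v1"
        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/utils/ptr"
    )

    // A Service that asks the dataplane to keep traffic on the client's node
    // when a local endpoint is available, falling back to other endpoints.
    func preferSameNodeService() corev1.Service {
        return corev1.Service{
            ObjectMeta: metav1.ObjectMeta{Name: "my-service"},
            Spec: corev1.ServiceSpec{
                Selector:            map[string]string{"app": "my-app"},
                Ports:               []corev1.ServicePort{{Port: 80}},
                TrafficDistribution: ptr.To("PreferSameNode"),
            },
        }
    }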

Other Merges

IsDNS1123SubdomainWithUnderscore function to return the correct error message

Fix for incorrect logging of insufficientResources in preemption

Support for API streaming from the List() method of the metadata client removed

Declarative validation to use named params and structured tags

Fix for unexpected delay of creating pods for newly created jobs

queue.FIFOs replaced with k8s.io/utils/buffer.Ring

kubeadm to consistently print an ‘error: ‘ prefix before errors

Promotions

ResilientWatchCacheInitialization to GA

Version Updates

gengo/v2 to latest

Subprojects and Dependency Updates

cloud-provider-openstack v1.33.0 adds OpenStack 2024.1, updates drivers, improves load balancer, fixes security and metadata, releases csi and controller charts v2.33.0

CoreDNS v1.12.2 adds multicluster, file fallthrough, forward proxy options, limits QUIC streams

etcd v3.6.1 replaces otelgrpc, adds member protections, fixes cluster removal and watcher race, validates discovery, builds with Go 1.23.10

grpc v1.73.0 enables Abseil sync on macOS/iOS, updates Protobuf, adds OpenSSL and disable sync flags

Shoutouts

Josh Berkus (@jberkus): Kudos to Carson Weeks (@Carson Weeks) and Ludo (@Ludo) for getting Elekto (the thing we use for Steering elections) to 97% unit test coverage. Yay!

via Last Week in Kubernetes Development https://lwkd.info/

June 11, 2025 at 07:59AM

·lwkd.info·
Arguing point-by-point considered harmful
Engineers love to have technical discussions point-by-point: replying to every idea in turn, treating each as its own mini-discussion. It just makes sense! A…
·seangoedecke.com·
Shared Nothing, Shared Everything: The Truth About Kubernetes Multi-Tenancy, with Molly Sheets

https://ku.bz/Rmpl8948_

Molly Sheets, Director of Engineering for Kubernetes at Zynga, discusses her team's approach to platform engineering. She explains why their initial one-cluster-per-team model became unsustainable and how they're transitioning to multi-tenant architectures.

You will learn:

Why slowing down deployments actually increases risk and how manual approval gates can make systems less resilient than faster, smaller deployments

The operational reality of cluster proliferation - why managing hundreds of clusters becomes unsustainable and when multi-tenancy becomes necessary

Practical multi-tenancy implementation strategies including resource quotas, priority classes, and namespace organization patterns that work in production (a small sketch follows this list)

Better metrics for multi-tenant environments - why control plane uptime doesn't matter and how to build meaningful SLOs for distributed platform health
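
As a rough, hypothetical illustration of the resource-quota and priority-class side of that discussion (names and numbers are invented for the sketch):

    import (
        "context"

        corev1 "k8s.io/api/core/v1"
        schedulingv1 "k8s.io/api/scheduling/v1"
        "k8s.io/apimachinery/pkg/api/resource"
        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/client-go/kubernetes"
    )

    // setUpTenant provisions a per-tenant namespace with a hard resource quota,
    // plus a low-priority class that the tenant's batch workloads can opt into.
    func setUpTenant(ctx context.Context, cs *kubernetes.Clientset, tenant string) error {
        if _, err := cs.CoreV1().Namespaces().Create(ctx, &corev1.Namespace{
            ObjectMeta: metav1.ObjectMeta{Name: tenant},
        }, metav1.CreateOptions{}); err != nil {
            return err
        }

        if _, err := cs.CoreV1().ResourceQuotas(tenant).Create(ctx, &corev1.ResourceQuota{
            ObjectMeta: metav1.ObjectMeta{Name: tenant + "-quota", Namespace: tenant},
            Spec: corev1.ResourceQuotaSpec{
                Hard: corev1.ResourceList{
                    corev1.ResourceRequestsCPU:    resource.MustParse("8"),
                    corev1.ResourceRequestsMemory: resource.MustParse("32Gi"),
                    corev1.ResourcePods:           resource.MustParse("100"),
                },
            },
        }, metav1.CreateOptions{}); err != nil {
            return err
        }

        _, err := cs.SchedulingV1().PriorityClasses().Create(ctx, &schedulingv1.PriorityClass{
            ObjectMeta: metav1.ObjectMeta{Name: tenant + "-batch"},
            Value:      1000, // lower than platform-critical workloads
        }, metav1.CreateOptions{})
        return err
    }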

Sponsor

This episode is sponsored by Learnk8s — get started on your Kubernetes journey through comprehensive online, in-person or remote training.

More info

Find all the links and info for this episode here: https://ku.bz/Rmpl8948_

Interested in sponsoring an episode? Learn more.

via KubeFM https://kube.fm

June 10, 2025 at 06:00AM

·kube.fm·
Enhancing Kubernetes Event Management with Custom Aggregation

https://kubernetes.io/blog/2025/06/10/enhancing-kubernetes-event-management-custom-aggregation/

Kubernetes Events provide crucial insights into cluster operations, but as clusters grow, managing and analyzing these events becomes increasingly challenging. This blog post explores how to build custom event aggregation systems that help engineering teams better understand cluster behavior and troubleshoot issues more effectively.

The challenge with Kubernetes events

In a Kubernetes cluster, events are generated for various operations - from pod scheduling and container starts to volume mounts and network configurations. While these events are invaluable for debugging and monitoring, several challenges emerge in production environments:

Volume: Large clusters can generate thousands of events per minute

Retention: Default event retention is limited to one hour

Correlation: Related events from different components are not automatically linked

Classification: Events lack standardized severity or category classifications

Aggregation: Similar events are not automatically grouped

To learn more about Events in Kubernetes, read the Event API reference.

Real-World value

Consider a production environment with dozens of microservices where users report intermittent transaction failures:

Without custom event aggregation: Engineers waste hours sifting through thousands of isolated events spread across namespaces. By the time they investigate, the older events have long since been purged, and correlating pod restarts with node-level issues is practically impossible.

With custom event aggregation: The system groups related events across resources, instantly surfacing correlation patterns such as volume mount timeouts preceding pod restarts. Historical data shows that the same pattern appeared during previous traffic spikes, pointing to a storage scalability issue in minutes rather than hours.

Organizations that implement this approach commonly cut their troubleshooting time significantly and improve system reliability by detecting recurring patterns early.

Building an Event aggregation system

This post explores how to build a custom event aggregation system that addresses these challenges, aligned with Kubernetes best practices. I've picked the Go programming language for the examples.

Architecture overview

This event aggregation system consists of three main components:

Event Watcher: Monitors the Kubernetes API for new events

Event Processor: Processes, categorizes, and correlates events

Storage Backend: Stores processed events for longer retention

Here's a sketch for how to implement the event watcher:

package main

import (
    "context"
    "fmt"  // fmt, sort, and time are used by the later snippets in this post
    "sort"
    "time"

    eventsv1 "k8s.io/api/events/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/rest"
)

type EventWatcher struct {
    clientset *kubernetes.Clientset
}

func NewEventWatcher(config *rest.Config) (*EventWatcher, error) {
    clientset, err := kubernetes.NewForConfig(config)
    if err != nil {
        return nil, err
    }
    return &EventWatcher{clientset: clientset}, nil
}

func (w *EventWatcher) Watch(ctx context.Context) (<-chan *eventsv1.Event, error) {
    events := make(chan *eventsv1.Event)

    watcher, err := w.clientset.EventsV1().Events("").Watch(ctx, metav1.ListOptions{})
    if err != nil {
        return nil, err
    }

    go func() {
        defer close(events)
        for {
            select {
            case event, ok := <-watcher.ResultChan():
                if !ok {
                    // The watch channel was closed by the server; stop cleanly.
                    return
                }
                if e, ok := event.Object.(*eventsv1.Event); ok {
                    events <- e
                }
            case <-ctx.Done():
                watcher.Stop()
                return
            }
        }
    }()

    return events, nil
}

Event processing and classification

The event processor enriches events with additional context and classification:

type EventProcessor struct {
    categoryRules    []CategoryRule
    correlationRules []CorrelationRule
}

type ProcessedEvent struct {
    Event         *eventsv1.Event
    Category      string
    Severity      string
    CorrelationID string
    Metadata      map[string]string
}

func (p *EventProcessor) Process(event *eventsv1.Event) *ProcessedEvent {
    processed := &ProcessedEvent{
        Event:    event,
        Metadata: make(map[string]string),
    }

    // Apply classification rules
    processed.Category = p.classifyEvent(event)
    processed.Severity = p.determineSeverity(event)

    // Generate correlation ID for related events
    processed.CorrelationID = p.correlateEvent(event)

    // Add useful metadata
    processed.Metadata = p.extractMetadata(event)

    return processed
}
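
The classifyEvent, determineSeverity, and extractMetadata helpers are referenced above but not defined in the post. A minimal sketch of what they could look like, assuming a simple reason-based CategoryRule (the rule shape and the category/severity names are illustrative, not from the original):

    // CategoryRule is an assumed shape: it assigns a category to events whose
    // Reason matches one of the listed values.
    type CategoryRule struct {
        Category string
        Reasons  []string
    }

    func (p *EventProcessor) classifyEvent(event *eventsv1.Event) string {
        for _, rule := range p.categoryRules {
            for _, reason := range rule.Reasons {
                if event.Reason == reason {
                    return rule.Category
                }
            }
        }
        return "uncategorized"
    }

    func (p *EventProcessor) determineSeverity(event *eventsv1.Event) string {
        // The Events API itself only distinguishes Normal and Warning; anything
        // finer-grained has to come from your own rules.
        if event.Type == "Warning" {
            return "warning"
        }
        return "info"
    }

    func (p *EventProcessor) extractMetadata(event *eventsv1.Event) map[string]string {
        return map[string]string{
            "namespace":  event.Regarding.Namespace,
            "kind":       event.Regarding.Kind,
            "name":       event.Regarding.Name,
            "controller": event.ReportingController,
        }
    }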

Implementing Event correlation

One of the key features you could implement is a way of correlating related Events. Here's an example correlation strategy:

func (p *EventProcessor) correlateEvent(event *eventsv1.Event) string {
    // Correlation strategies:
    // 1. Time-based: Events within a time window
    // 2. Resource-based: Events affecting the same resource
    // 3. Causation-based: Events with cause-effect relationships

    correlationKey := generateCorrelationKey(event)
    return correlationKey
}

func generateCorrelationKey(event *eventsv1.Event) string {
    // Example: Combine namespace, resource type, and name.
    // Note: events.k8s.io/v1 calls the referenced object Regarding (the
    // core/v1 Event type calls it InvolvedObject).
    return fmt.Sprintf("%s/%s/%s",
        event.Regarding.Namespace,
        event.Regarding.Kind,
        event.Regarding.Name,
    )
}

Event storage and retention

For long-term storage and analysis, you'll probably want a backend that supports:

Efficient querying of large event volumes

Flexible retention policies

Support for aggregation queries

Here's a sample storage interface:

type EventStorage interface {
    Store(context.Context, *ProcessedEvent) error
    Query(context.Context, EventQuery) ([]ProcessedEvent, error)
    Aggregate(context.Context, AggregationParams) ([]EventAggregate, error)
}

type EventQuery struct {
    TimeRange     TimeRange
    Categories    []string
    Severity      []string
    CorrelationID string
    Limit         int
}

type AggregationParams struct {
    GroupBy    []string
    TimeWindow string
    Metrics    []string
}
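
Tying the pieces together, a hypothetical wiring of watcher, processor, and storage might look like the following; rule configuration and error handling are elided:

    func run(ctx context.Context, cfg *rest.Config, storage EventStorage) error {
        watcher, err := NewEventWatcher(cfg)
        if err != nil {
            return err
        }

        events, err := watcher.Watch(ctx)
        if err != nil {
            return err
        }

        processor := &EventProcessor{ /* category and correlation rules go here */ }

        // Consume events until the watch channel closes or the context is cancelled.
        for event := range events {
            processed := processor.Process(event)
            if err := storage.Store(ctx, processed); err != nil {
                // A real system would buffer and retry here; see the
                // reliability practices below.
                continue
            }
        }
        return nil
    }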

Good practices for Event management

Resource Efficiency

Implement rate limiting for event processing

Use efficient filtering at the API server level

Batch events for storage operations
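
A sketch combining the three practices above, assuming the core/v1 events client (whose type field selector is well supported) and golang.org/x/time/rate for rate limiting; the store callback is a stand-in for the storage backend:

    import (
        "context"
        "time"

        "golang.org/x/time/rate"
        corev1 "k8s.io/api/core/v1"
        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/client-go/kubernetes"
    )

    func watchWarnings(ctx context.Context, cs *kubernetes.Clientset,
        store func(context.Context, []corev1.Event) error) error {

        // Server-side filtering: only Warning events are sent to us.
        w, err := cs.CoreV1().Events("").Watch(ctx, metav1.ListOptions{
            FieldSelector: "type=Warning",
        })
        if err != nil {
            return err
        }
        defer w.Stop()

        limiter := rate.NewLimiter(rate.Limit(100), 200) // ~100 events/s, burst of 200
        batch := make([]corev1.Event, 0, 50)
        flush := func() {
            if len(batch) > 0 && store(ctx, batch) == nil {
                batch = batch[:0]
            }
        }
        ticker := time.NewTicker(5 * time.Second)
        defer ticker.Stop()

        for {
            select {
            case ev, ok := <-w.ResultChan():
                if !ok {
                    flush()
                    return nil
                }
                if err := limiter.Wait(ctx); err != nil {
                    return err
                }
                if e, ok := ev.Object.(*corev1.Event); ok {
                    batch = append(batch, *e)
                }
                if len(batch) >= 50 {
                    flush()
                }
            case <-ticker.C:
                flush()
            case <-ctx.Done():
                return ctx.Err()
            }
        }
    }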

Scalability

Distribute event processing across multiple workers

Use leader election for coordination

Implement backoff strategies for API rate limits

Reliability

Handle API server disconnections gracefully

Buffer events during storage backend unavailability

Implement retry mechanisms with exponential backoff
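
For the retry point, a sketch using client-go's retry helper against the EventStorage interface defined earlier; treating every error as retriable is a simplification a real system would refine:

    import (
        "context"
        "time"

        "k8s.io/apimachinery/pkg/util/wait"
        "k8s.io/client-go/util/retry"
    )

    // storeWithRetry retries transient storage failures with exponential backoff
    // so that a short backend outage does not drop events.
    func storeWithRetry(ctx context.Context, storage EventStorage, event *ProcessedEvent) error {
        backoff := wait.Backoff{
            Steps:    5,
            Duration: 200 * time.Millisecond,
            Factor:   2.0,
            Jitter:   0.1,
        }
        return retry.OnError(backoff, func(error) bool {
            return true // assume every error is retriable in this sketch
        }, func() error {
            return storage.Store(ctx, event)
        })
    }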

Advanced features

Pattern detection

Implement pattern detection to identify recurring issues:

type PatternDetector struct {
    patterns  map[string]*Pattern
    threshold int
}

func (d *PatternDetector) Detect(events []ProcessedEvent) []Pattern {
    // Group similar events
    groups := groupSimilarEvents(events)

    // Analyze frequency and timing
    patterns := identifyPatterns(groups)

    return patterns
}

func groupSimilarEvents(events []ProcessedEvent) map[string][]ProcessedEvent {
    groups := make(map[string][]ProcessedEvent)

    for _, event := range events {
        // Create similarity key based on event characteristics
        similarityKey := fmt.Sprintf("%s:%s:%s",
            event.Event.Reason,
            event.Event.Regarding.Kind,
            event.Event.Regarding.Namespace,
        )

        // Group events with the same key
        groups[similarityKey] = append(groups[similarityKey], event)
    }

    return groups
}

// eventTime returns the best available timestamp for an events.k8s.io/v1 Event:
// EventTime for natively recorded events, falling back to the deprecated
// timestamp populated for events that arrived through the core/v1 path.
func eventTime(e *eventsv1.Event) time.Time {
    if !e.EventTime.IsZero() {
        return e.EventTime.Time
    }
    return e.DeprecatedLastTimestamp.Time
}

func identifyPatterns(groups map[string][]ProcessedEvent) []Pattern {
    var patterns []Pattern

    for key, events := range groups {
        // Only consider groups with enough events to form a pattern
        if len(events) < 3 {
            continue
        }

        // Sort events by time
        sort.Slice(events, func(i, j int) bool {
            return eventTime(events[i].Event).Before(eventTime(events[j].Event))
        })

        // Calculate time range and frequency
        firstSeen := eventTime(events[0].Event)
        lastSeen := eventTime(events[len(events)-1].Event)
        duration := lastSeen.Sub(firstSeen).Minutes()

        var frequency float64
        if duration > 0 {
            frequency = float64(len(events)) / duration
        }

        // Create a pattern if it meets threshold criteria
        if frequency > 0.5 { // More than 1 event per 2 minutes
            pattern := Pattern{
                Type:         key,
                Count:        len(events),
                FirstSeen:    firstSeen,
                LastSeen:     lastSeen,
                Frequency:    frequency,
                EventSamples: events[:min(3, len(events))], // Keep up to 3 samples
            }
            patterns = append(patterns, pattern)
        }
    }

    return patterns
}

With this implementation, the system can identify recurring patterns such as node pressure events, pod scheduling failures, or networking issues that occur with a specific frequency.

Real-time alerts

The following example provides a starting point for building an alerting system based on event patterns. It is not a complete solution but a conceptual sketch to illustrate the approach.

type AlertManager struct {
    rules     []AlertRule
    notifiers []Notifier
}

func (a *AlertManager) EvaluateEvents(events []ProcessedEvent) {
    for _, rule := range a.rules {
        if rule.Matches(events) {
            alert := rule.GenerateAlert(events)
            a.notify(alert)
        }
    }
}
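
The AlertRule and Notifier types are left open above; one possible shape, with purely illustrative names and a simple frequency-based rule, might be:

    type Alert struct {
        Title   string
        Message string
        Events  []ProcessedEvent
    }

    type AlertRule interface {
        Matches(events []ProcessedEvent) bool
        GenerateAlert(events []ProcessedEvent) Alert
    }

    type Notifier interface {
        Send(alert Alert) error
    }

    // FrequencyRule fires when a category produces at least Threshold events
    // in the evaluated batch.
    type FrequencyRule struct {
        Category  string
        Threshold int
    }

    func (r FrequencyRule) Matches(events []ProcessedEvent) bool {
        count := 0
        for _, e := range events {
            if e.Category == r.Category {
                count++
            }
        }
        return count >= r.Threshold
    }

    func (r FrequencyRule) GenerateAlert(events []ProcessedEvent) Alert {
        return Alert{
            Title:   fmt.Sprintf("High volume of %s events", r.Category),
            Message: fmt.Sprintf("%d or more %s events observed in the current window", r.Threshold, r.Category),
            Events:  events,
        }
    }

    func (a *AlertManager) notify(alert Alert) {
        for _, n := range a.notifiers {
            _ = n.Send(alert) // errors would be logged in a real system
        }
    }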

Conclusion

A well-designed event aggregation system can significantly improve cluster observability and troubleshooting capabilities. By implementing custom event processing, correlation, and storage, operators can better understand cluster behavior and respond to issues more effectively.

The solutions presented here can be extended and customized based on specific requirements while maintaining compatibility with the Kubernetes API and following best practices for scalability and reliability.

Next steps

Future enhancements could include:

Machine learning for anomaly detection

Integration with popular observability platforms

Custom event APIs for application-specific events

Enhanced visualization and reporting capabilities

For more information on Kubernetes events and custom controllers, refer to the official Kubernetes documentation.

via Kubernetes Blog https://kubernetes.io/

June 09, 2025 at 08:00PM

·kubernetes.io·
DevOps Toolkit - Ep24 - Ask Me Anything About Anything with Scott Rosenberg - https://www.youtube.com/watch?v=JaO74iWnRwY

Ep24 - Ask Me Anything About Anything with Scott Rosenberg

There are no restrictions in this AMA session. You can ask anything about DevOps, Cloud, Kubernetes, Platform Engineering, containers, or anything else. We'll have special guests Scott Rosenberg and Ramiro Berrelleza to help us out.

▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬ Sponsor: Codefresh 🔗 GitOps Argo CD Certifications: https://learning.codefresh.io (use "viktor" for a 50% discount) ▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬

▬▬▬▬▬▬ 👋 Contact me 👋 ▬▬▬▬▬▬ ➡ BlueSky: https://vfarcic.bsky.social ➡ LinkedIn: https://www.linkedin.com/in/viktorfarcic/

▬▬▬▬▬▬ 🚀 Other Channels 🚀 ▬▬▬▬▬▬ 🎤 Podcast: https://www.devopsparadox.com/ 💬 Live streams: https://www.youtube.com/c/DevOpsParadox

via YouTube https://www.youtube.com/watch?v=JaO74iWnRwY

·youtube.com·
DevOps Toolkit - How I Fixed My Lazy Vibe Coding Habits with Taskmaster - https://www.youtube.com/watch?v=0WtCBbIHoKE

How I Fixed My Lazy Vibe Coding Habits with Taskmaster

AI agents often struggle with large, complex tasks, losing context and producing inconsistent results. Enter Taskmaster, an open-source project designed to orchestrate AI agents, maintain permanent context, and efficiently handle multi-step tasks. In this video, we'll explore how Taskmaster can improve your workflow by automatically generating detailed Product Requirements Documents (PRDs), breaking down tasks, and guiding AI agents seamlessly through complex projects without losing context or focus.

Witness how Taskmaster effortlessly plans, organizes, and manages tasks that would otherwise require hours of tedious manual effort. Whether you're using GitHub Copilot, Cursor, or other AI assistants, Taskmaster will significantly enhance your productivity and change your approach to working with AI.

▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬ Sponsor: Blacksmith 🔗 https://blacksmith.sh ▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬

#AIProductivity, #SoftwareDevelopment, #TaskManagement

Consider joining the channel: https://www.youtube.com/c/devopstoolkit/join

▬▬▬▬▬▬ 🔗 Additional Info 🔗 ▬▬▬▬▬▬ 🔗 Taskmaster: https://task-master.dev

▬▬▬▬▬▬ 💰 Sponsorships 💰 ▬▬▬▬▬▬ If you are interested in sponsoring this channel, please visit https://devopstoolkit.live/sponsor for more information. Alternatively, feel free to contact me over Twitter or LinkedIn (see below).

▬▬▬▬▬▬ 👋 Contact me 👋 ▬▬▬▬▬▬ ➡ BlueSky: https://vfarcic.bsky.social ➡ LinkedIn: https://www.linkedin.com/in/viktorfarcic/

▬▬▬▬▬▬ 🚀 Other Channels 🚀 ▬▬▬▬▬▬ 🎤 Podcast: https://www.devopsparadox.com/ 💬 Live streams: https://www.youtube.com/c/DevOpsParadox

▬▬▬▬▬▬ ⏱ Timecodes ⏱ ▬▬▬▬▬▬ 00:00 Problems with AI for Larger Tasks 02:31 Blacksmith (sponsor) 03:39 The Shame 06:03 Product Requirements Document (PRD) with Taskmaster 11:00 Working on Tasks with Taskmaster 14:27 Taskmaster Pros and Cons

via YouTube https://www.youtube.com/watch?v=0WtCBbIHoKE

·youtube.com·
Google's Cloud IDP Could Replace Platform Engineering
Google Cloud's Internal Development Platform project promises to revolutionize software building by shifting platform engineering responsibilities from developers to the cloud itself through integrated, app-centric services.
·thenewstack.io·