Found 54467 bookmarks

Microsoft to lay off as many as 9,000 employees in latest round

The move follows two waves of layoffs in May and June, which saw Microsoft fire more than 6,000 employees.

·seattletimes.com·today at 1:02 PM

andreybleme/lazycontainer

Fancy terminal UI for Apple Containers. Contribute to andreybleme/lazycontainer development by creating an account on GitHub.

brew install container

·github.com·today at 12:47 PM

tursodatabase/turso: Turso Database is a project to build the next evolution of SQLite.

Turso Database is a project to build the next evolution of SQLite. - tursodatabase/turso

·github.com·today at 12:45 PM

Linux Sudo chroot Vulnerability Enables Hackers to Elevate Privileges to Root

A security vulnerability in the widely used Linux Sudo utility has been disclosed, allowing any local unprivileged user to escalate privileges.

·cybersecuritynews.com·today at 12:04 PM

Navigating Failures in Pods With Devices

https://kubernetes.io/blog/2025/07/03/navigating-failures-in-pods-with-devices/

Kubernetes is the de facto standard for container orchestration, but when it comes to handling specialized hardware like GPUs and other accelerators, things get a bit complicated. This blog post dives into the challenges of managing failure modes when operating pods with devices in Kubernetes, based on insights from Sergey Kanzhelev and Mrunal Patel's talk at KubeCon NA

You can follow the links to slides and recording.

The AI/ML boom and its impact on Kubernetes

The rise of AI/ML workloads has brought new challenges to Kubernetes. These workloads often rely heavily on specialized hardware, and any device failure can significantly impact performance and lead to frustrating interruptions. As highlighted in the 2024 Llama paper, hardware issues, particularly GPU failures, are a major cause of disruption in AI/ML training. The KubeCon talk by Ryan Hallisey and Piotr Prokop, "All-Your-GPUs-Are-Belong-to-Us: An Inside Look at NVIDIA's Self-Healing GeForce NOW Infrastructure" (recording), shows how much effort NVIDIA spends on handling device failures and maintenance: they see 19 remediation requests per 1000 nodes a day! We also see data centers offering spot consumption models and overcommitting on power, making device failures commonplace and a part of the business model.

However, Kubernetes’s view on resources is still very static. The resource is either there or not. And if it is there, the assumption is that it will stay there fully functional - Kubernetes lacks good support for handling full or partial hardware failures. These long-existing assumptions combined with the overall complexity of a setup lead to a variety of failure modes, which we discuss here.

Understanding AI/ML workloads

Generally, all AI/ML workloads require specialized hardware, have challenging scheduling requirements, and are expensive when idle. AI/ML workloads typically fall into two categories - training and inference. Here is an oversimplified view of those categories’ characteristics, which are different from traditional workloads like web services:

Training

These workloads are resource-intensive, often consuming entire machines and running as gangs of pods. Training jobs are usually "run to completion" - but that could be days, weeks or even months. Any failure in a single pod can necessitate restarting the entire step across all the pods.

Inference

These workloads are usually long-running or run indefinitely, and can be small enough to consume a subset of a Node’s devices or large enough to span multiple nodes. They often require downloading huge files with the model weights.

These workload types specifically break many past assumptions:

Workload assumptions before and now

Before | Now
Can get a better CPU and the app will work faster. | Require a specific device (or class of devices) to run.
When something doesn’t work, just recreate it. | Allocation or reallocation is expensive.
Any node will work. No need to coordinate between Pods. | Scheduled in a special way - devices often connected in a cross-node topology.
Each Pod can be plug-and-play replaced if failed. | Pods are a part of a larger task. Lifecycle of an entire task depends on each Pod.
Container images are slim and easily available. | Container images may be so big that they require special handling.
Long initialization can be offset by slow rollout. | Initialization may be long and should be optimized, sometimes across many Pods together.
Compute nodes are commoditized and relatively inexpensive, so some idle time is acceptable. | Nodes with specialized hardware can be an order of magnitude more expensive than those without, so idle time is very wasteful.

The existing failure model relies on old assumptions. It may still work for the new workload types, but it has limited knowledge about devices and is very expensive for them, in some cases prohibitively so. You will see more examples later in this article.

Why Kubernetes still reigns supreme

This article does not go deeper into the question of why not start fresh for AI/ML workloads, since they are so different from traditional Kubernetes workloads. Despite many challenges, Kubernetes remains the platform of choice for AI/ML workloads. Its maturity, security, and rich ecosystem of tools make it a compelling option. While alternatives exist, they often lack the years of development and refinement that Kubernetes offers. And the Kubernetes developers are actively addressing the gaps identified in this article and beyond.

The current state of device failure handling

This section outlines different failure modes and the best practices and DIY (Do-It-Yourself) solutions used today. The next section describes a roadmap for improving the handling of those failure modes.

Failure modes: K8s infrastructure

To understand the failures related to the Kubernetes infrastructure, you need to understand how many moving parts are involved in scheduling a Pod on a node. The sequence of events when a Pod is scheduled on a Node is as follows:

Device plugin is scheduled on the Node

Device plugin is registered with the kubelet via local gRPC

Kubelet uses device plugin to watch for devices and updates capacity of the node

Scheduler places a user Pod on a Node based on the updated capacity

Kubelet asks Device plugin to Allocate devices for a User Pod

Kubelet creates a User Pod with the allocated devices attached to it
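The sequence above can be sketched as a toy model. The following Python is only an illustration under simplifying assumptions: the class names, the `example.com/gpu` resource name, and the device IDs are hypothetical, and the real kubelet, scheduler, and device plugin talk over gRPC and the API server rather than through direct method calls.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class DevicePlugin:
    """Hypothetical stand-in for a device plugin running on a node."""
    resource: str
    devices: List[str]
    allocated: Dict[str, List[str]] = field(default_factory=dict)

    def allocate(self, pod: str, count: int) -> List[str]:
        # Step 5: hand out device IDs that no other Pod holds.
        in_use = {d for ids in self.allocated.values() for d in ids}
        free = [d for d in self.devices if d not in in_use]
        if count > len(free):
            raise RuntimeError(f"{pod}: only {len(free)} free devices")
        self.allocated[pod] = free[:count]
        return self.allocated[pod]


@dataclass
class Node:
    """Toy node that holds the kubelet's bookkeeping for one machine."""
    plugins: Dict[str, DevicePlugin] = field(default_factory=dict)
    capacity: Dict[str, int] = field(default_factory=dict)
    requested: Dict[str, int] = field(default_factory=dict)

    def register(self, plugin: DevicePlugin) -> None:
        # Steps 2-3: the plugin registers and the node capacity is updated.
        self.plugins[plugin.resource] = plugin
        self.capacity[plugin.resource] = len(plugin.devices)


def place_pod(node: Node, pod: str, resource: str, count: int) -> Optional[List[str]]:
    # Step 4: the scheduler places the Pod only if the advertised capacity fits.
    used = node.requested.get(resource, 0)
    if used + count > node.capacity.get(resource, 0):
        return None
    node.requested[resource] = used + count
    # Steps 5-6: the kubelet asks the plugin to allocate, then starts the Pod.
    return node.plugins[resource].allocate(pod, count)
```

Even in this toy version, a Pod cannot be placed before the plugin has registered, which mirrors why the ordering of these steps matters on a real node.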

This diagram shows some of those actors involved:

As there are so many actors interconnected, every one of them and every connection may experience interruptions. This leads to many exceptional situations that are often considered failures, and may cause serious workload interruptions:

Pods failing admission at various stages of their lifecycle

Pods unable to run on perfectly fine hardware

Scheduling taking an unexpectedly long time

The goal for Kubernetes is to make the interaction between these components as reliable as possible. Kubelet already implements retries, grace periods, and other techniques to improve it. The roadmap section goes into detail on other edge cases that the Kubernetes project tracks. However, all these improvements only work when these best practices are followed:

Configure and restart kubelet and the container runtime (such as containerd or CRI-O) as early as possible to not interrupt the workload.

Monitor device plugin health and carefully plan for upgrades.

Do not overload the node with less-important workloads to prevent interruption of device plugin and other components.

Configure tolerations on user Pods to handle node readiness flakes.

Configure and code graceful termination logic carefully to not block devices for too long.
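As a rough illustration of the last two practices, the Pod spec fields involved can be built as follows. This is a sketch, not a recommendation: the function name is hypothetical, and the 600-second and 60-second values are arbitrary assumptions that would need tuning per workload. The taint keys `node.kubernetes.io/not-ready` and `node.kubernetes.io/unreachable` and the `tolerationSeconds` and `terminationGracePeriodSeconds` fields are standard Pod spec fields.

```python
import json


def device_pod_spec_overrides(tolerate_seconds: int = 600, grace_seconds: int = 60) -> dict:
    """Return Pod spec fields that ride out short node readiness flakes.

    Both defaults are assumed example values, not recommended settings.
    """
    taint_keys = [
        "node.kubernetes.io/not-ready",
        "node.kubernetes.io/unreachable",
    ]
    return {
        # Keep the Pod bound to the node for a while if the node flaps.
        "tolerations": [
            {
                "key": key,
                "operator": "Exists",
                "effect": "NoExecute",
                "tolerationSeconds": tolerate_seconds,
            }
            for key in taint_keys
        ],
        # Bound how long a terminating Pod can hold on to its devices.
        "terminationGracePeriodSeconds": grace_seconds,
    }


if __name__ == "__main__":
    print(json.dumps(device_pod_spec_overrides(), indent=2))
```

The resulting dictionary can be merged into the `spec` of a Pod template before it is serialized to YAML or JSON.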

Another class of Kubernetes infra-related issues is driver-related. With traditional resources like CPU and memory, no compatibility checks between the application and hardware were needed. With special devices like hardware accelerators, there are new failure modes. Device drivers installed on the node:

Must match the hardware

Must be compatible with the app

Must work with other drivers (like nccl, etc.)

Best practices for handling driver versions:

Monitor driver installer health

Plan upgrades of infrastructure and Pods to match the version

Have canary deployments whenever possible

Following the best practices in this section and using device plugins and device driver installers from trusted and reliable sources generally eliminate this class of failures. Kubernetes is tracking work to make this space even better.

Failure modes: device failed

There is very little handling of device failure in Kubernetes today. Device plugins report a device failure only by changing the count of allocatable devices, and Kubernetes relies on standard mechanisms like liveness probes or container failures to allow Pods to communicate the failure condition to the kubelet. However, Kubernetes does not correlate device failures with container crashes and does not offer any mitigation beyond restarting the container while it remains attached to the same device.

This is why many plugins and DIY solutions exist to handle device failures based on various signals.

Health controller

In many cases a failed device will result in unrecoverable and very expensive nodes doing nothing. A simple DIY solution is a node health controller. The controller could compare the device allocatable count with the capacity and, if the capacity is greater, start a timer. Once the timer reaches a threshold, the health controller kills and recreates the node.
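The timer logic of such a controller can be sketched in a few lines. This is a minimal illustration, assuming the caller periodically reads each node's device capacity and allocatable count from the cluster; the class name, method name, and threshold are hypothetical, and actually deleting and recreating the node is left out.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class NodeHealthTracker:
    """Tracks how long each node has reported fewer allocatable devices than capacity."""
    threshold_seconds: float
    unhealthy_since: Dict[str, float] = field(default_factory=dict)

    def observe(self, node: str, capacity: int, allocatable: int, now: float) -> bool:
        """Record one observation; return True when the node should be recreated."""
        if capacity <= allocatable:
            # All devices are healthy again, so reset the timer.
            self.unhealthy_since.pop(node, None)
            return False
        # Start the timer on the first unhealthy observation, keep it otherwise.
        started = self.unhealthy_since.setdefault(node, now)
        return now - started >= self.threshold_seconds
```

A caller would invoke `observe` on every poll and trigger node replacement when it returns `True`. Note that the sketch shares the limitations listed next: it knows nothing about the workload or about which device failed.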

There are problems with the health controller approach:

Root cause of the device failure is typically not known

The controller is not workload aware

Failed device might not be in use and you want to keep other devices running

The detection may be too slow as it is very generic

The node may be part of a bigger set of nodes and simply cannot be deleted in isolation without other nodes

There are variations of the health controller solving some of the problems above. The overall theme here though is that to best handle failed devices, you need customized handling for the specific workload. Kubernetes doesn’t yet offer enough abstraction to express how critical the device is for a node, for the cluster, and for the Pod it is assigned to.

Pod failure policy

Another DIY approach for device failure handling is a per-pod reaction on a failed device. This approach is applicable for training workloads that are implemented as Jobs.

A Pod can define special exit codes for device failures. For example, whenever unexpected device behavior is encountered, the Pod exits with a special exit code. The Pod failure policy can then handle the device failure in a special way. Read more in Handling retriable and non-retriable pod failures with Pod failure policy.
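A sketch of what this could look like follows. It assumes the training container exits with code 42 on a device failure and that such failures should be retried without counting against `backoffLimit`; the exit code, the container name `trainer`, the function names, and the choice of the `Ignore` action are all assumptions for illustration. The matcher is a simplified model of rule evaluation that only looks at `onExitCodes`, not the Job controller's actual implementation.

```python
from typing import Optional

DEVICE_FAILURE_EXIT_CODE = 42  # hypothetical code chosen by the workload author


def training_job_manifest(name: str, image: str) -> dict:
    """Build a Job manifest whose Pod failure policy special-cases device failures."""
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            "backoffLimit": 3,
            "podFailurePolicy": {
                "rules": [
                    {
                        # Assumed choice: do not count device failures as retries.
                        "action": "Ignore",
                        "onExitCodes": {
                            "containerName": "trainer",
                            "operator": "In",
                            "values": [DEVICE_FAILURE_EXIT_CODE],
                        },
                    }
                ]
            },
            "template": {
                "spec": {
                    # Pod failure policy requires restartPolicy: Never.
                    "restartPolicy": "Never",
                    "containers": [{"name": "trainer", "image": image}],
                }
            },
        },
    }


def first_matching_action(policy: dict, exit_code: int) -> Optional[str]:
    """Simplified model: return the action of the first onExitCodes rule that matches."""
    for rule in policy["rules"]:
        codes = rule.get("onExitCodes")
        if codes is None:
            continue
        in_values = exit_code in codes["values"]
        if (codes["operator"] == "In") == in_values:
            return rule["action"]
    return None  # no rule matched; default Job failure handling applies
```

With this manifest, an exit code of 42 maps to `Ignore`, while any other non-zero exit code falls through to the Job's default handling.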

There are some problems

·kubernetes.io·today at 12:51 AM

One year after EOL - The State of CentOS

This CIQ webinar originally aired June 30, 2025. CentOS is still used widely. One source reports that over 300,000 companies still use it, and almost 800,000 ...

·youtu.be·yesterday at 12:37 PM

AI & DevOps Toolkit - Ep27 - Ask Me Anything About Anything with Scott Rosenberg - https://www.youtube.com/watch?v=3fNbnkzB-po

Ep27 - Ask Me Anything About Anything with Scott Rosenberg

There are no restrictions in this AMA session. You can ask anything about DevOps, AI, Cloud, Kubernetes, Platform Engineering, containers, or anything else. Scott Rosenberg, regular guest, will be here to help us out.

via YouTube https://www.youtube.com/watch?v=3fNbnkzB-po

·youtube.com·yesterday at 11:28 AM

Dehumidifier

·xkcd.com·yesterday at 12:18 AM

Beyond a RHEL Clone: How Rocky Linux Is Evolving Into Something More - Techstrong IT

I’ve been working with Linux for over two decades, and I’ve seen a lot of changes in that time. Some changes start small and grow to have a massive impact, like the birth of Ubuntu. Others have faded into obscurity. And some changes cause massive ripples through the industry. That’s what it felt like in

·techstrong.it·Jul 1, 2025

What I learned trying seven coding agents

There's still room for improvement, but don't underestimate this technology.

·understandingai.org·Jul 1, 2025

New advice for aspiring managers - The Engineering Manager

What does it mean to get into management in 2025? What is expected of you now compared to the last ten years?

·theengineeringmanager.com·Jul 1, 2025

FBI Warns of Scattered Spider's Expanding Attacks on Airlines Using Social Engineering

Scattered Spider targets airlines with advanced social engineering and MFA bypass tactics. Industry must reassess identity verification.

·thehackernews.com·Jul 1, 2025

Substack Is Having a Moment—Again. But Time Is Running Out

While star reporters continue to flock to Substack, subscription fatigue is only getting worse.

·wired.com·Jul 1, 2025

MCP: An (Accidentally) Universal Plugin System

Or: The Day My Toaster Started Taking Phone Calls

·worksonmymachine.substack.com·Jul 1, 2025

Norwegian Dam Valve Forced Open for Hours in Cyberattack

·hackread.com·Jul 1, 2025

Donate Less | Gnome Blog

We have a new donation page. But before you go there, I would like to impress upon you this idea: We would vastly prefer you donate $10/mo for one year ($120 total) than $200 in one lump sum....

·blogs.gnome.org·Jun 30, 2025

Beyond a RHEL Clone: How Rocky Linux Is Evolving Into Something More - DevOps.com

I’ve been working with Linux for over two decades, and I’ve seen a lot of changes in that time. Some changes start small and grow to have a massive impact [...] What sets Rocky Linux apart now isn’t just its stability or compatibility; it’s the energy and breadth of its community, particularly the Special Interest Groups, or SIGs.

·devops.com·Jun 30, 2025

AI & DevOps Toolkit - Better Code Reviews with AI? GitHub Copilot and Qodo Merge Tested - https://www.youtube.com/watch?v=wmmMYFVNxA0

Better Code Reviews with AI? GitHub Copilot and Qodo Merge Tested

Discover how AI is transforming code reviews by comparing two prominent AI agents: GitHub Copilot Code Review and Qodo Merge. We'll explore how these tools integrate seamlessly into GitHub pull requests, evaluate their strengths and weaknesses, and see how their suggestions can be efficiently incorporated into your development workflow directly from your IDE. Whether you're already using AI for code reviews or just getting started, this comparison will help you understand which tool might best fit your needs and why second (or even third) opinions from AI can significantly improve your coding process.

In this video, we'll weigh Qodo's detailed and comprehensive suggestions against Copilot's more structured and familiar commenting style. You'll see firsthand how each tool performs in real-world scenarios, highlighting Qodo's superior issue detection and Copilot's cleaner presentation. By the end, you'll understand the practical benefits of integrating AI code reviews into your workflow and gain clarity on which AI-powered solution is right for you.

#AIcodeReview #GitHubCopilot #QodoMerge

Consider joining the channel: https://www.youtube.com/c/devopstoolkit/join

▬▬▬▬▬▬ 🔗 Additional Info 🔗 ▬▬▬▬▬▬ ➡ Transcript and commands: https://devopstoolkit.live/ai/better-code-reviews-with-ai-github-copilot-and-qodo-merge-tested 🔗 GitHub Copilot Code Review: https://docs.github.com/en/copilot 🔗 Qodo Merge: https://www.qodo.ai/products/qodo-merge/ 🎬 My Workflow With AI: How I Code, Test, and Deploy Faster Than Ever: https://youtu.be/2E610yzqQwg

▬▬▬▬▬▬ ⏱ Timecodes ⏱ ▬▬▬▬▬▬ 00:00 Introduction 01:18 AI Code Reviews 07:10 Fixing Issues Detected in Code Reviews 08:38 AI Code Reviews Pros and Cons

via YouTube https://www.youtube.com/watch?v=wmmMYFVNxA0

·youtube.com·Jun 30, 2025

I somehow managed to avoid these | Anker is recalling another five power banks over fire risks

Anker is having a recall-filled June.

·theverge.com·Jun 30, 2025

OpenAI Leadership Responds to Meta Offers: 'Someone Has Broken Into Our Home'

As Mark Zuckerberg lures away top research talent to Meta, OpenAI executives say they're ‘recalibrating comp,’ according to an internal memo.

·wired.com·Jun 29, 2025

A Video Game Engine Just Broke a Huge Barrier for Nuclear Fusion. This Could Be the Key to Unlimited Power.

Detecting high-speed particle collisions in a fusion reactor is very tricky, but video game tech makes the task surprisingly simple.

·popularmechanics.com·Jun 29, 2025

The Kubernetes Course 2025

🚀 Welcome to the ultimate Kubernetes Course! Whether you're just starting out or want to level up your Kubernetes skills, this hands-on course walks you thr...

·youtu.be·Jun 28, 2025

How AlmaLinux and Rocky Linux Have Diverged Since CentOS

The two main independent Linux distros that emerged from CentOS's demise have gone two different ways.

·thenewstack.io·Jun 28, 2025

The European Union Linux desktop

Opinion: True digital sovereignty begins at the desktop

·theregister.com·Jun 28, 2025

My (mostly) minimalistic AI setup as a Senior Engineer in Big Tech

You can stop being overwhelmed by the hundreds of new tools every day

·read.highgrowthengineer.com·Jun 27, 2025

We Can Just Measure Things

Using programming agents to measure developer productivity.

·lucumr.pocoo.org·Jun 27, 2025

NickTikhonov/snap-ql: AI-powered Postgres Client

AI-powered Postgres Client. Contribute to NickTikhonov/snap-ql development by creating an account on GitHub.

·github.com·Jun 27, 2025

Windows 11 Retires Blue Screen of Death Error Replaces With Black Screen

Microsoft is retiring one of computing's most recognizable error messages after nearly four decades. The iconic BSOD that has haunted Windows users.

·cybersecuritynews.com·Jun 27, 2025

Last Week in Kubernetes Development - Week Ending June 22 2025

Week Ending June 22, 2025

https://lwkd.info/2025/20250627

Developer News

Having completed their work, WG-Policy is being archived. Congrats Policy team!

There is an ongoing discussion in the Kubernetes community regarding the Slack migration, and new platform options are currently being evaluated. Please share your thoughts to help shortlist a suitable new platform.

The CFPs for the CNCF-hosted Co-located Events North America 2025 are closing soon. Make sure to submit your proposals by June 30th.

The KubeCon North America 2025 Maintainer Summit CFP is also open. Please submit your sessions by July 20th.

Release Schedule

Next Deadline: Open Doc Placeholders, July 3

With 70 enhancements tracked, it’s time to wrap up work on those changes. The next step is opening a Docs placeholder PR so that the Docs team knows that you’ll be ready by Docs deadline on Jul 29. Didn’t get your Enhancement approved in time? You have until July 7th to request an exception.

Patch releases v1.33.2, 1.32.6, 1.31.10 and 1.30.14 are released, including a security update for Golang. This is likely to be the last patch release for Kubernetes 1.30, so users on that version should plan to upgrade soon.

Featured PRs

132504: Introduce OpenAPI format support for k8s-short-name and k8s-long-name

This PR introduces support for k8s-short-name and k8s-long-name in OpenAPI schema validation for Custom Resource Definitions (CRDs). These formats are now recognized in the OpenAPI validation of CRD schemas, allowing Kubernetes-native name formats to be used consistently in the validation of CRD fields.

126619: Show namespace on delete

This PR updates the kubectl delete command to include the namespace in the output, improving clarity when resources are deleted across multiple namespaces. Previously, the output could be ambiguous, especially when targeting resources in different namespaces. This enhancement helps to avoid confusion by explicitly showing the namespace during delete operations.

KEP of the Week

KEP 4800: Split UncoreCache Topology Awareness in CPU Manager

This KEP introduced a new static policy option, prefer-align-cpus-by-uncorecache, for the CPU Manager that groups CPU resources by uncore cache where possible. An uncore cache is a cache shared among CPU cores. This is primarily beneficial for CPU architectures that utilize multiple uncore caches, or split uncore caches, within the processor.

This KEP is tracked for beta in v1.34.

Other Merges

Actively poll for namespace termination instead of sleeping

Fix for being able to custom resources with server side apply even when its CustomResourceDefinition was terminating

e2e/watchlist test for checking metadata informer

apimachinery/pkg/util/errors to deprecate MessageCountMap

API response for StorageClassList to return a graceful error message if the provided ResourceVersion is too large

MutableCSINodeAllocatableCount storage e2e test refactored to use the Mock CSI driver

omitempty and opt tag added to the API v1beta2 AdminAccess type in the DeviceRequestAllocationResult struct

Job controller now uses controller UID index for pod lookups

ListAll and ListAllByNamespace optimized to return directly when there is nothing to select

Cleanup after alpha feature MountContainers was removed

New runtime.ApplyConfiguration interface added that is implemented by all generated applyconfigs

cloud provider calls in storage/volume_provisioning.go removed

Usage of deprecated function ExtractCommentTags migrated to ExtractFunctionStyleCommentTags

Delay added to node updates after kubelet startup

Conntrack reconciler now considers service’s target port during cleanup of stale flow entries

kube-scheduler: Apply EnablePlugins to CoreResourceEnqueueTestCases

etcd server overrides to etcd probe factory for healthz and readyz

endpointsleases and configmapsleases options removed from leader-elect-resource-lock in LeaderElectionConfiguration

Deprecated --register-schedulable command line argument removed from the kubelet

Promotions

JobPodReplacementPolicy to GA

Subprojects and Dependency Updates

containerd v2.1.3: fixes registry fetch and transfer service issues

cluster-api v1.11.0-alpha.1: releases alpha version for testing

Shoutouts

Josh Berkus (@jberkus): Kudos to Mario Fahlandt (@Mario Fahlandt) for figuring out how to back up private channels from Slack.

via Last Week in Kubernetes Development https://lwkd.info/

June 27, 2025 at 09:08AM

·lwkd.info·Jun 27, 2025

Google positions itself for 'next decade' of AI as Gemini CLI arrives with generous free tier • DEVCLASS

Google has released Gemini CLI (command line interface), a terminal-based version of its AI assistant, with a generous […]

·devclass.com·Jun 27, 2025
