54549 bookmarks

Exclusive: macOS 26 hints at sealed Mac updates at Apple Stores - 9to5Mac

In recent years, Apple developed a system that updates sealed iPhones wirelessly. Now, it is working on something similar for the Mac.

·9to5mac.com·Jul 10, 2025

Time to cancel that 6E AP upgrade | What Trump's 'big beautiful bill' means for Wi-Fi 6E and Wi-Fi 7 users (Hint: It's not pretty)

Hidden inside the now-passed bill is a provision enabling the Federal Communications Commission to sell off portions of the 6GHz band currently being used by high-end Wi-Fi equipment.

·zdnet.com·Jul 10, 2025

Investigate your dependencies with Deptective

Deptective, our new open-source tool, automatically finds the packages needed to install software dependencies. It does so not based on the software’s self-reported requirements, but by observing what the software needs at runtime.

·blog.trailofbits.com·Jul 10, 2025

Kubernetes List API performance and reliability

At my current employer, we use Kubernetes to run hundreds of thousands of bare metal servers, spread over hundreds of Kubernetes clusters. We use Kubernetes beyond officially supported/tested scale limits by running more than 5,000 nodes and over a...

·ahmet.im·Jul 10, 2025

Am I online?

Checking internet connectivity with 'generate 204' endpoints.

·antonz.org·Jul 10, 2025
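
A minimal sketch of the idea behind this bookmark, assuming a "generate 204" endpoint is simply a URL that answers with HTTP 204 and an empty body. The default URL below is one commonly used public endpoint and is an assumption here, as are the function name and the injectable `opener` parameter; the linked article may use different endpoints or tooling.

```python
import urllib.error
import urllib.request


def is_online(url="http://www.gstatic.com/generate_204",
              timeout=3.0,
              opener=urllib.request.urlopen):
    """Return True only if the endpoint answers 204 with an empty body.

    A captive portal usually answers 200 with a login page (or redirects
    to one), so anything other than an empty 204 is treated as offline.
    """
    try:
        with opener(url, timeout=timeout) as resp:
            return resp.status == 204 and resp.read() == b""
    except (urllib.error.URLError, OSError):
        # DNS failure, refused connection, timeout, HTTP error status, etc.
        return False


if __name__ == "__main__":
    print("online" if is_online() else "offline")
```

The `opener` argument exists only so the check can be exercised without network access.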

systemd has been a complete, utter, unmitigated success

Eleven init systems enter, one init system leaves.

·blog.tjll.net·Jul 10, 2025

US airports relax security rules

The requirement has been in place since 2006 after a British man in 2001 attempted to destroy an airliner with explosives hidden in a shoe.

·semafor.com·Jul 10, 2025

Iranian ransomware group offers bigger payouts for attacks on Israel, US

The Iran-linked ransomware-as-a-service group Pay2Key.I2P told affiliates that they can keep a larger cut of extortion payments if they attack entities within Iran's adversaries.

·therecord.media·Jul 10, 2025

Jeff Bezos sells $666 million in Amazon stock as part of plan to unload 25 million shares

Bezos' latest stock sale comes shortly after his $50 million high-profile wedding to Lauren Sanchez in Venice with a star-studded list of celebrity guests.

·cnbc.com·Jul 10, 2025

Sakana AI’s TreeQuest: Deploy multi-model teams that outperform individual LLMs by 30%

Sakana AI's new inference-time scaling technique uses Monte-Carlo Tree Search to orchestrate multiple LLMs to collaborate on complex tasks.

·venturebeat.com·Jul 9, 2025

Announcing GoReleaser v2.11 | Carlos Becker

This version consists mostly of improvements to the Homebrew Cask feature introduced in the last release, and to other features.

·carlosbecker.com·Jul 9, 2025

TIL: Principle of least astonishment - Wikipedia

·en.wikipedia.org·Jul 8, 2025

AI & DevOps Toolkit - Ep28 - Ask Me Anything About Anything with Scott Rosenberg - https://www.youtube.com/watch?v=-u7zcjeAEh8

Ep28 - Ask Me Anything About Anything with Scott Rosenberg

There are no restrictions in this AMA session. You can ask anything about DevOps, AI, Cloud, Kubernetes, Platform Engineering, containers, or anything else. Scott Rosenberg, regular guest, will be here to help us out.

via YouTube https://www.youtube.com/watch?v=-u7zcjeAEh8

·youtube.com·Jul 8, 2025

DevOps teams and roles were a bad idea. The best DevOps jobs I had never had the word DevOps in them.

We need to talk about DevOps.

·open.substack.com·Jul 8, 2025

AI & DevOps Toolkit - Vibe Coding Explained: AI Coding Best Practices - https://www.youtube.com/watch?v=W1105cy1D84

Vibe Coding Explained: AI Coding Best Practices

Vibe coding is transforming software development by enabling us to interact with AI through simple, natural language instructions. Instead of manually writing code line by line, we can now direct AI agents to generate code, conduct tests, and manage various software development lifecycle operations. In this video, we'll explore essential best practices for vibe coding, including effective session management, the importance of detailed product requirements, memory and context management strategies, testing guidelines, and tips for making the most of "thinking" AI models.

Discover how to leverage vibe coding efficiently and avoid common pitfalls, such as overtrusting AI or neglecting proper code review. Learn why maintaining a small, manageable codebase, regularly updating AI memory, and consistently seeking second opinions are crucial for success. Whether you're new to vibe coding or already integrating it into your workflow, these practical guidelines will help you collaborate more effectively with AI and elevate your development practices.

#VibeCoding #AIProgramming #BestPractices

▬▬▬▬▬▬ 🔗 Additional Info 🔗 ▬▬▬▬▬▬
➡ Transcript and commands: https://devopstoolkit.live/ai/vibe-coding-explained-ai-coding-best-practices
🎬 The Missing Link: How MCP Servers Supercharge Your AI Coding Assistant: https://youtu.be/n0dCFY6wMeI
🎬 From Shame to Fame: How I Fixed My Lazy Vibe Coding Habits with Taskmaster: https://youtu.be/0WtCBbIHoKE
🎬 Outdated AI Responses? Context7 Solves LLMs' Biggest Flaw: https://youtu.be/DeZ-gw_aop0

▬▬▬▬▬▬ ⏱ Timecodes ⏱ ▬▬▬▬▬▬
00:00 Intro to Vibe Coding
04:10 UpCloud (sponsor)
05:09 New Session
08:23 Development
11:28 Memory and Context Management
14:38 Miscellaneous
19:27 Vibe Coding Best Practices and Rules

via YouTube https://www.youtube.com/watch?v=W1105cy1D84

·youtube.com·Jul 7, 2025

LLMs Are Recommending Phishing Sites—Here’s Why That’s Dangerous

What happens when AI chatbots send users to phishing links? Netcraft’s latest research reveals that over one-third of LLM-suggested login URLs aren’t owned by the brand—posing a growing threat to online safety. Learn what’s happening and how to protect your users.

·netcraft.com·Jul 7, 2025

macOS NimDoor | DPRK Threat Actors Target Web3 and Crypto Platforms with Nim-Based Malware

NimDoor reflects a leap in DPRK’s offensive toolkit, mixing compile-time trickery with native scripting to complicate and deter analysis.

·sentinelone.com·Jul 7, 2025

Red Team Tactics: Evading EDR on Linux with io_uring

Learn how to bypass modern defenses with io_uring

·matheuzsecurity.github.io·Jul 7, 2025

Insecure Boot: Injecting initramfs from a debug shell

Many Linux hardening guides focus on well-known protections: full-disk encryption, Secure Boot, and password-protected bootloaders. While these measures are critical, they often overlook a subtle but serious attack vector: the ability to drop into a debug shell via the Initial RAM Filesystem (initramfs). This oversight can enable an attacker with brief physical access to bypass conventional bo ...

·insinuator.net·Jul 7, 2025

donBarbos/awesome-standards: A curated list of technical standards, they may be called requests for comments, proposals, drafts, notes, specifications, or something else

A curated list of technical standards, they may be called requests for comments, proposals, drafts, notes, specifications, or something else - donBarbos/awesome-standards

·github.com·Jul 7, 2025

Behind the scenes with the weed-smoking, Labubu-loving, hackathon king of SF

Rene Turcios has attended over 200 hackathons in two years — and he doesn’t even know how to code.

·sfstandard.com·Jul 6, 2025

Last Week in Kubernetes Development - Week Ending June 29 2025

Week Ending June 29, 2025

https://lwkd.info/2025/20250704

Developer News

Kubernetes is auditing and cleaning up inactive GitHub organization members in the first week of July 2025 to ensure active and accurate community representation. Contributors who are still active but not tracked by Dev-Stats can request an exception by commenting on the cleanup issue before the deadline on July 18, 2025.

The KubeCon North America 2025 Project Lightning Talk and Maintainer Track CFP is now open and closes soon on July 7th. Make sure to submit your talks before the deadline!

The discussion in the Kubernetes community regarding Slack migration is now closed, since Salesforce has postponed the downgrade. Any future conversations about a potential migration will take place later, on a more relaxed timeline.

Release Schedule

Next Deadline: Feature Blog Placeholders, July 11th

1.34-alpha.2 was released this week, in case you want to play around with the new version.

Featured PRs

12937: feature(kubectl): support --cpu, --memory flag to kubectl autoscale

This PR introduces support for the --cpu and --memory flags in the kubectl autoscale command. The new flags allow users to specify CPU and memory metrics for horizontal pod autoscaling. The update supports both percentage-based utilization and fixed resource values, streamlining resource management. This PR also deprecates the --cpu-percent flag in favor of the new approach for defining resource targets.

132351: bugfix(hpa): introduce buildQuantity helper for consistent resource quantity

This PR introduces the buildQuantity helper function in the Horizontal Pod Autoscaler (HPA) controller to ensure consistent handling of resource quantities. Before this change, resource quantities were created inline, which caused inconsistencies in handling CPU and memory metrics. With this update, the buildQuantity function standardizes the process by converting raw memory values to KiB using BinarySI, and by handling CPU and other resources in milli-units with DecimalSI. Memory metrics are now displayed correctly in Ki instead of incorrectly carrying the “m” suffix, which improves consistency in metric calculations and display.

131837: Deny pod admission for static pods referencing API objects

Static pods that reference API objects are now denied admission by the kubelet. This is to prevent static pods silently running even after the mirror pod creation fails. Currently, mirror pod reconciliation for static pods which reference API objects will fail. However the pod itself is not denied admission and the node would be silently running the static pod’s container. A new feature gate PreventStaticPodAPIReferences is introduced to enable stricter validation for static pods. Enabling this feature gate ensures that the static pod container is not created when the mirror pod creation fails.
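
To make "static pods referencing API objects" concrete, here is a rough sketch that scans a pod spec (as a plain dict) for a few well-known reference fields. The function name and the set of fields checked are illustrative assumptions; this is not the kubelet's actual validation logic, and the real list of disallowed references may differ.

```python
def api_object_references(pod_spec: dict) -> list:
    """List places where a pod spec points at other API objects.

    Covers only a handful of common cases (ServiceAccount, ConfigMap,
    Secret, PersistentVolumeClaim); chosen for illustration.
    """
    refs = []
    if pod_spec.get("serviceAccountName"):
        refs.append("serviceAccountName")

    for vol in pod_spec.get("volumes", []):
        for source in ("configMap", "secret", "persistentVolumeClaim"):
            if source in vol:
                refs.append(f"volumes[{vol.get('name')}].{source}")

    containers = pod_spec.get("initContainers", []) + pod_spec.get("containers", [])
    for c in containers:
        for env in c.get("env", []):
            value_from = env.get("valueFrom") or {}
            for key in ("configMapKeyRef", "secretKeyRef"):
                if key in value_from:
                    refs.append(f"containers[{c.get('name')}].env[{env.get('name')}].{key}")
        for src in c.get("envFrom", []):
            for key in ("configMapRef", "secretRef"):
                if key in src:
                    refs.append(f"containers[{c.get('name')}].envFrom.{key}")
    return refs


# A static pod spec like this one would be flagged because of the Secret volume.
spec = {
    "containers": [{"name": "app", "image": "example.com/app:1.0"}],
    "volumes": [{"name": "creds", "secret": {"secretName": "db-creds"}}],
}
print(api_object_references(spec))
```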

KEP of the Week

KEP-3902: Decouple Taint-based Pod Eviction from Node Lifecycle Controller

This KEP splits the existing NodeLifecycleController into two controllers: NodeLifecycleController (to add taints to unhealthy nodes) and TaintEvictionController (to evict pods from tainted nodes). Previously, both tainting and eviction were handled by a single controller, but the main goal of this change is to separate responsibilities for better clarity, organization, and maintainability. The new TaintEvictionController is created from existing taint-manager code and now runs separately. A feature gate called SeparateTaintEvictionController lets you enable or disable the new setup. From Kubernetes v1.29, the taint-based eviction is still enabled by default, but cluster admins can disable the default TaintEvictionController using the --controllers=-taint-eviction-controller flag in kube-controller-manager if needed.

This KEP is tracked as stable in v1.34.

Other Merges

Commonize filtering of Pods by Owner with all orphans in namespace

Fix validation for Job with suspend=true, and completions=0 to set the Complete condition

DRA: the v1alpha4 kubelet gRPC API is no longer supported

Bug fix for replica set failing to be created when a deployment name is too long

Deprecated package ‘k8s.io/utils/pointer’ replaced with ‘k8s.io/utils/ptr’ for the kube-apiserver

More usages of deprecated function ExtractCommentTags migrated to ExtractFunctionStyleCommentTags

Defunct make vet target removed

New SchedulerAsyncAPICalls feature gate added

Code coverage increased for kubelet_client

Validation error message for required fields simplified by removing redundant messages

Flags added to kube-apiserver to make coordinated leader election timers configurable

SizeBasedListCostEstimate feature gate to allow apiserver to estimate sizes of objects to calculate cost of LIST requests

HPA status now displays memory metrics with proper units

ClusterEvent type moved to staging repo

Code and status moved from pkg/scheduler/framework to staging repo

DRA: the kubelet now also cleans up ResourceSlices in some additional failure scenarios

Objects are transformed prior to storage in SharedInformers if a transformer is provided and WatchList is activated

kubectl debug: label added for debugger pod for making cleanup easier

podSpec validation added during StatefulSet creation

Promotions

StreamingCollectionEncodingToJSON and StreamingCollectionEncodingToProtobuf to GA

WaitForAllControlPlaneComponents to GA

Deprecated

StreamingConnectionIdleTimeout field of the kubelet config deprecated

Version Updates

etcd to v3.6.1

kube-openapi bumped

Shoutouts

Jenny Shu (@Jenny Shu): A little belated, but I want to give a big shout-out to the 1.34 Enhancements Shadows: Drew Hagen (@Drew Hagen), Faeka Ansari (@Faeka Ansari), Josh Michielsen (@jmickey), Rayan Das (@rayandas), Sean McGinnis (@Sean McGinnis), for all their hard work leading up to Enhancements Freeze last week! Keep up the great work!

via Last Week in Kubernetes Development https://lwkd.info/

July 04, 2025 at 02:49AM

·lwkd.info·Jul 4, 2025

Microsoft to lay off as many as 9,000 employees in latest round

The move follows two waves of layoffs in May and June, which saw Microsoft fire more than 6,000 employees.

·seattletimes.com·Jul 3, 2025

andreybleme/lazycontainer

Fancy terminal UI for Apple Containers. Contribute to andreybleme/lazycontainer development by creating an account on GitHub.

brew install container

·github.com·Jul 3, 2025

tursodatabase/turso: Turso Database is a project to build the next evolution of SQLite.

Turso Database is a project to build the next evolution of SQLite. - tursodatabase/turso

·github.com·Jul 3, 2025

Linux Sudo chroot Vulnerability Enables Hackers to Elevate Privileges to Root

A security vulnerability in the widely used Linux Sudo utility has been disclosed, allowing any local unprivileged user to escalate privileges.

·cybersecuritynews.com·Jul 3, 2025

Navigating Failures in Pods With Devices

https://kubernetes.io/blog/2025/07/03/navigating-failures-in-pods-with-devices/

Kubernetes is the de facto standard for container orchestration, but when it comes to handling specialized hardware like GPUs and other accelerators, things get a bit complicated. This blog post dives into the challenges of managing failure modes when operating pods with devices in Kubernetes, based on insights from Sergey Kanzhelev and Mrunal Patel's talk at KubeCon NA

You can follow the links to slides and recording.

The AI/ML boom and its impact on Kubernetes

The rise of AI/ML workloads has brought new challenges to Kubernetes. These workloads often rely heavily on specialized hardware, and any device failure can significantly impact performance and lead to frustrating interruptions. As highlighted in the 2024 Llama paper, hardware issues, particularly GPU failures, are a major cause of disruption in AI/ML training. You can also learn how much effort NVIDIA spends on handling device failures and maintenance in the KubeCon talk by Ryan Hallisey and Piotr Prokop, All-Your-GPUs-Are-Belong-to-Us: An Inside Look at NVIDIA's Self-Healing GeForce NOW Infrastructure (recording), where they see 19 remediation requests per 1000 nodes a day! We also see data centers offering spot consumption models and overcommit on power, making device failures commonplace and a part of the business model.

However, Kubernetes’s view on resources is still very static. The resource is either there or not. And if it is there, the assumption is that it will stay there fully functional - Kubernetes lacks good support for handling full or partial hardware failures. These long-existing assumptions combined with the overall complexity of a setup lead to a variety of failure modes, which we discuss here.

Understanding AI/ML workloads

Generally, all AI/ML workloads require specialized hardware, have challenging scheduling requirements, and are expensive when idle. AI/ML workloads typically fall into two categories - training and inference. Here is an oversimplified view of those categories’ characteristics, which are different from traditional workloads like web services:

Training

These workloads are resource-intensive, often consuming entire machines and running as gangs of pods. Training jobs are usually "run to completion" - but that could be days, weeks or even months. Any failure in a single pod can necessitate restarting the entire step across all the pods.

Inference

These workloads are usually long-running or run indefinitely, and can be small enough to consume a subset of a Node’s devices or large enough to span multiple nodes. They often require downloading huge files with the model weights.

These workload types specifically break many past assumptions:

Workload assumptions before and now

Before | Now
Can get a better CPU and the app will work faster. | Require a specific device (or class of devices) to run.
When something doesn’t work, just recreate it. | Allocation or reallocation is expensive.
Any node will work. No need to coordinate between Pods. | Scheduled in a special way - devices often connected in a cross-node topology.
Each Pod can be plug-and-play replaced if failed. | Pods are a part of a larger task. Lifecycle of an entire task depends on each Pod.
Container images are slim and easily available. | Container images may be so big that they require special handling.
Long initialization can be offset by slow rollout. | Initialization may be long and should be optimized, sometimes across many Pods together.
Compute nodes are commoditized and relatively inexpensive, so some idle time is acceptable. | Nodes with specialized hardware can be an order of magnitude more expensive than those without, so idle time is very wasteful.

The existing failure model relies on the old assumptions. It may still work for the new workload types, but it has limited knowledge about devices and is very expensive for them, in some cases prohibitively so. You will see more examples later in this article.

Why Kubernetes still reigns supreme

This article is not going deeper into the question: why not start fresh for AI/ML workloads since they are so different from the traditional Kubernetes workloads. Despite many challenges, Kubernetes remains the platform of choice for AI/ML workloads. Its maturity, security, and rich ecosystem of tools make it a compelling option. While alternatives exist, they often lack the years of development and refinement that Kubernetes offers. And the Kubernetes developers are actively addressing the gaps identified in this article and beyond.

The current state of device failure handling

This section outlines different failure modes and the best practices and DIY (Do-It-Yourself) solutions used today. The next section will describe a roadmap of improving things for those failure modes.

Failure modes: K8s infrastructure

In order to understand the failures related to the Kubernetes infrastructure, you need to understand how many moving parts are involved in scheduling a Pod on the node. The sequence of events when the Pod is scheduled in the Node is as follows:

1. Device plugin is scheduled on the Node

2. Device plugin is registered with the kubelet via local gRPC

3. Kubelet uses device plugin to watch for devices and updates capacity of the node

4. Scheduler places a user Pod on a Node based on the updated capacity

5. Kubelet asks Device plugin to Allocate devices for a User Pod

6. Kubelet creates a User Pod with the allocated devices attached to it

This diagram shows some of those actors involved:

As there are so many actors interconnected, every one of them and every connection may experience interruptions. This leads to many exceptional situations that are often considered failures, and may cause serious workload interruptions:

Pods failing admission at various stages of their lifecycle

Pods unable to run on perfectly fine hardware

Scheduling taking an unexpectedly long time

The goal for Kubernetes is to make the interaction between these components as reliable as possible. Kubelet already implements retries, grace periods, and other techniques to improve it. The roadmap section goes into details on other edge cases that the Kubernetes project tracks. However, all these improvements only work when these best practices are followed:

Configure and restart kubelet and the container runtime (such as containerd or CRI-O) as early as possible to not interrupt the workload.

Monitor device plugin health and carefully plan for upgrades.

Do not overload the node with less-important workloads to prevent interruption of device plugin and other components.

Configure tolerations on user Pods to handle node readiness flakes.

Configure and code graceful termination logic carefully to not block devices for too long.
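
As an illustration of the tolerations practice in the list above, the sketch below builds the `tolerations` entries a Pod spec could carry so that a short node readiness flake does not evict the Pod immediately. The taint keys are the standard node condition taints; the helper name and the 600-second value are assumptions chosen for the example, not recommendations from the article.

```python
def readiness_tolerations(seconds: int) -> list:
    """Tolerate not-ready/unreachable NoExecute taints for `seconds` before eviction."""
    return [
        {
            "key": key,
            "operator": "Exists",
            "effect": "NoExecute",
            "tolerationSeconds": seconds,
        }
        for key in (
            "node.kubernetes.io/not-ready",
            "node.kubernetes.io/unreachable",
        )
    ]


# Hypothetical Pod spec fragment for an expensive training Pod.
pod_spec = {
    "containers": [{"name": "trainer", "image": "example.com/trainer:latest"}],
    "tolerations": readiness_tolerations(600),
}
```

A longer window trades slower reaction to real node failures for fewer needless restarts of expensive workloads.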

Another class of Kubernetes infra-related issues is driver-related. With traditional resources like CPU and memory, no compatibility checks between the application and hardware were needed. With special devices like hardware accelerators, there are new failure modes. Device drivers installed on the node:

Must match the hardware

Must be compatible with the app

Must work with other drivers (like nccl, etc.)

Best practices for handling driver versions:

Monitor driver installer health

Plan upgrades of infrastructure and Pods to match the version

Have canary deployments whenever possible

Following the best practices in this section and using device plugins and device driver installers from trusted and reliable sources generally eliminate this class of failures. Kubernetes is tracking work to make this space even better.

Failure modes: device failed

There is very little handling of device failure in Kubernetes today. Device plugins report the device failure only by changing the count of allocatable devices. And Kubernetes relies on standard mechanisms like liveness probes or container failures to allow Pods to communicate the failure condition to the kubelet. However, Kubernetes does not correlate device failures with container crashes and does not offer any mitigation beyond restarting the container while being attached to the same device.

This is why many plugins and DIY solutions exist to handle device failures based on various signals.

Health controller

In many cases a failed device will result in unrecoverable and very expensive nodes doing nothing. A simple DIY solution is a node health controller. The controller could compare the device allocatable count with the capacity and if the capacity is greater, it starts a timer. Once the timer reaches a threshold, the health controller kills and recreates a node.
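
The timer logic described above can be sketched as follows. This is a simplified model under stated assumptions: `NodeDevices`, `HealthController`, and the threshold value are hypothetical names and numbers for illustration, and the actual kill-and-recreate step (a cloud or cluster API call) is left out.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class NodeDevices:
    name: str
    capacity: int     # devices the node advertises in total
    allocatable: int  # devices currently reported as usable


class HealthController:
    def __init__(self, threshold_seconds: float):
        self.threshold_seconds = threshold_seconds
        # node name -> time at which capacity > allocatable was first seen
        self._degraded_since: Dict[str, float] = {}

    def observe(self, nodes: List[NodeDevices], now: float) -> List[str]:
        """Return the names of nodes whose timer has reached the threshold."""
        to_recreate = []
        for node in nodes:
            if node.capacity > node.allocatable:
                started = self._degraded_since.setdefault(node.name, now)
                if now - started >= self.threshold_seconds:
                    to_recreate.append(node.name)
                    del self._degraded_since[node.name]
            else:
                # Device count recovered: reset the timer.
                self._degraded_since.pop(node.name, None)
        return to_recreate
```

Calling `observe` on each polling interval with the current time keeps the controller stateless apart from the per-node timers.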

There are problems with the health controller approach:

Root cause of the device failure is typically not known

The controller is not workload aware

Failed device might not be in use and you want to keep other devices running

The detection may be too slow as it is very generic

The node may be part of a bigger set of nodes and simply cannot be deleted in isolation without other nodes

There are variations of the health controller solving some of the problems above. The overall theme here though is that to best handle failed devices, you need customized handling for the specific workload. Kubernetes doesn’t yet offer enough abstraction to express how critical the device is for a node, for the cluster, and for the Pod it is assigned to.

Pod failure policy

Another DIY approach for device failure handling is a per-pod reaction on a failed device. This approach is applicable for training workloads that are implemented as Jobs.

A Pod can define special error codes for device failures. For example, whenever unexpected device behavior is encountered, the Pod exits with a special exit code. Then the Pod failure policy can handle the device failure in a special way. Read more on Handling retriable and non-retriable pod failures with Pod failure policy.
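
A rough sketch of this approach, written as a Python dict mirroring a Job manifest plus a simplified matcher. The `podFailurePolicy` fields (`rules`, `action`, `onExitCodes`, `operator`, `values`) are part of the Kubernetes Job API; the exit code 42, the container name, the image, and the choice of the `Ignore` action are assumptions for illustration. The matcher only approximates how rules are evaluated and is not the Job controller's implementation.

```python
DEVICE_FAILURE_EXIT_CODE = 42  # hypothetical code the training container uses

job = {
    "apiVersion": "batch/v1",
    "kind": "Job",
    "metadata": {"name": "training-step"},
    "spec": {
        "backoffLimit": 3,
        "podFailurePolicy": {
            "rules": [
                {
                    # Do not count device failures against backoffLimit.
                    "action": "Ignore",
                    "onExitCodes": {
                        "containerName": "trainer",
                        "operator": "In",
                        "values": [DEVICE_FAILURE_EXIT_CODE],
                    },
                }
            ]
        },
        "template": {
            "spec": {
                "restartPolicy": "Never",  # required when podFailurePolicy is set
                "containers": [
                    {"name": "trainer", "image": "example.com/trainer:latest"}
                ],
            }
        },
    },
}


def matching_action(policy: dict, container_name: str, exit_code: int):
    """Return the action of the first rule matching this exit code, else None."""
    for rule in policy.get("rules", []):
        cond = rule.get("onExitCodes")
        if cond is None:
            continue
        if cond.get("containerName") not in (None, container_name):
            continue
        in_values = exit_code in cond["values"]
        if (cond["operator"] == "In" and in_values) or (
            cond["operator"] == "NotIn" and not in_values
        ):
            return rule["action"]
    return None  # no rule matched: default failure handling applies
```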

There are some problems

·kubernetes.io·Jul 3, 2025

One year after EOL - The State of CentOS

This CIQ webinar originally aired June 30, 2025. CentOS is still used widely. One source reports that over 300,000 companies still use it, and almost 800,000 ...

·youtu.be·Jul 2, 2025
