Suggested Reads

54940 bookmarks
Blog: Kubernetes 1.26: Non-Graceful Node Shutdown Moves to Beta
Author: Xing Yang (VMware), Ashutosh Kumar (VMware)

Kubernetes v1.24 introduced an alpha-quality implementation of improvements for handling a non-graceful node shutdown. In Kubernetes v1.26, this feature moves to beta. It allows stateful workloads to fail over to a different node after the original node is shut down or in a non-recoverable state, such as a hardware failure or a broken OS.

What is a node shutdown in Kubernetes?

In a Kubernetes cluster, a node may shut down, either in a planned way or unexpectedly. You may plan a security patch or a kernel upgrade and need to reboot the node, or it may shut down due to preemption of VM instances. A node may also shut down due to a hardware failure or a software problem. To trigger a node shutdown, you could run a shutdown or poweroff command in a shell, or physically press a button to power off a machine. A node shutdown can lead to workload failure if the node is not drained before the shutdown. Below, we describe what a graceful node shutdown is and what a non-graceful node shutdown is.

What is a graceful node shutdown?

The kubelet's handling of a graceful node shutdown allows the kubelet to detect a node shutdown event, properly terminate the pods on that node, and release resources before the actual shutdown. Critical pods are terminated after all the regular pods are terminated, to ensure that the essential functions of an application can continue to work as long as possible.

What is a non-graceful node shutdown?

A node shutdown can be graceful only if the kubelet's node shutdown manager can detect the upcoming shutdown action. However, there are cases where the kubelet does not detect a node shutdown action. This could happen because the shutdown command does not trigger the Inhibitor Locks mechanism used by the kubelet on Linux, or because of a user error.
For example, if the shutdownGracePeriod and shutdownGracePeriodCriticalPods settings are not configured correctly for that node. When a node is shut down (or crashes) and that shutdown was not detected by the kubelet's node shutdown manager, it becomes a non-graceful node shutdown.

A non-graceful node shutdown is a problem for stateful apps. If a node running a Pod that is part of a StatefulSet is shut down in a non-graceful way, the Pod will be stuck in Terminating status indefinitely, and the control plane cannot create a replacement Pod for that StatefulSet on a healthy node. You can delete the failed Pods manually, but this is not ideal for a self-healing cluster. Similarly, Pods created by a ReplicaSet (for example, as part of a Deployment) that were bound to the now-shutdown node stay in Terminating status indefinitely. If you have set a horizontal scaling limit, even those terminating Pods count against the limit, so your workload may struggle to self-heal if it was already at maximum scale. (By the way: if the node that had a non-graceful shutdown comes back up, the kubelet does delete the old Pods, and the control plane can create replacements.)

What's new for the beta?

For Kubernetes v1.26, the non-graceful node shutdown feature is beta and enabled by default. The NodeOutOfServiceVolumeDetach feature gate is enabled by default on kube-controller-manager instead of being opt-in; you can still disable it if needed (please also file an issue to explain the problem). On the instrumentation side, the kube-controller-manager reports two new metrics:

force_delete_pods_total: the number of pods that are being forcibly deleted (resets on Pod garbage collection controller restart)
force_delete_pod_errors_total: the number of errors encountered when attempting forcible Pod deletion (also resets on Pod garbage collection controller restart)

How does it work?
In the case of a node shutdown, if a graceful shutdown is not working or the node is in a non-recoverable state due to a hardware failure or broken OS, you can manually add an out-of-service taint on the Node, for example node.kubernetes.io/out-of-service=nodeshutdown:NoExecute or node.kubernetes.io/out-of-service=nodeshutdown:NoSchedule. This taint triggers pods on the node to be forcefully deleted if there are no matching tolerations on the pods. Persistent volumes attached to the shutdown node will be detached, and new pods will be created successfully on a different running node.

kubectl taint nodes node-name node.kubernetes.io/out-of-service=nodeshutdown:NoExecute

Note: Before applying the out-of-service taint, you must verify that the node is already in a shutdown or power-off state (not in the middle of restarting), either because the user intentionally shut it down or the node is down due to hardware failures, OS issues, etc.

Once all the workload pods that are linked to the out-of-service node have moved to a new running node, and the shutdown node has been recovered, you should remove that taint from the affected node.

What's next?

Depending on feedback and adoption, the Kubernetes team plans to push the non-graceful node shutdown implementation to GA in either 1.27 or 1.28. This feature requires a user to manually add a taint to the node to trigger the failover of workloads and to remove the taint after the node is recovered. A cluster operator can automate this process by applying the out-of-service taint automatically if there is a programmatic way to determine that the node is really shut down and there is no I/O between the node and storage, and then removing the taint automatically after the workload has failed over successfully to another running node and the shutdown node has been recovered.
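The force-deletion behaviour triggered by the out-of-service taint can be illustrated with a small, self-contained sketch. Everything here (the Pod record, the toleration check, the pod names) is a toy stand-in invented for illustration, not the real controller logic:

```python
from dataclasses import dataclass, field

# The taint key described above; value and effect are omitted in this toy model.
OUT_OF_SERVICE_KEY = "node.kubernetes.io/out-of-service"

@dataclass
class Pod:
    name: str
    node: str
    tolerations: set = field(default_factory=set)  # taint keys this pod tolerates

def force_delete_candidates(pods, tainted_node):
    """Mimic the NoExecute behaviour: pods on the tainted node without a
    matching toleration are forcibly deleted so they can fail over."""
    return [
        p for p in pods
        if p.node == tainted_node and OUT_OF_SERVICE_KEY not in p.tolerations
    ]

pods = [
    Pod("web-0", node="node-1"),  # StatefulSet pod stuck in Terminating
    Pod("agent", node="node-1", tolerations={OUT_OF_SERVICE_KEY}),
    Pod("web-1", node="node-2"),  # on a healthy node, unaffected
]

print([p.name for p in force_delete_candidates(pods, "node-1")])  # ['web-0']
```

In a real cluster, the trigger is the kubectl taint command shown above; the control plane then force-deletes the non-tolerating pods on the tainted node and detaches their volumes so replacements can start elsewhere.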
In the future, we plan to find ways to automatically detect and fence nodes that are shut down or in a non-recoverable state, and to fail their workloads over to another node.

How can I learn more?

To learn more, read Non-Graceful Node Shutdown in the Kubernetes documentation.

How to get involved?

We offer a huge thank you to all the contributors who helped with design, implementation, and review of this feature:

Michelle Au (msau42)
Derek Carr (derekwaynecarr)
Danielle Endocrimes (endocrimes)
Tim Hockin (thockin)
Ashutosh Kumar (sonasingh46)
Hemant Kumar (gnufied)
Yuiko Mouri (YuikoTakada)
Mrunal Patel (mrunalp)
David Porter (bobbypage)
Yassine Tijani (yastij)
Jing Xu (jingxu97)
Xing Yang (xing-yang)

There are many people who have helped review the design and implementation along the way. We want to thank everyone who has contributed to this effort, including the roughly 30 people who have reviewed the KEP and implementation over the last couple of years.

This feature is a collaboration between SIG Storage and SIG Node. For those interested in getting involved with the design and development of any part of the Kubernetes storage system, join the Kubernetes Storage Special Interest Group (SIG). For those interested in getting involved with the design and development of the components that support the controlled interactions between pods and host resources, join the Kubernetes Node SIG.
·kubernetes.io·
Hello, Mastodon
I finally decided to create an account on Mastodon. You can follow me at @jsq@mastodon.social. I put this off for so long because I was skeptical and I did n...
·jessesquires.com·
Twitter manually reviewed all accounts that posted links to ElonJet -exec
Twitter Inc's head of trust and safety told Reuters the company manually reviewed "any and all accounts" that violated its new privacy policy by posting links to a Twitter account called ElonJet that tracked Elon Musk's private jet using information in the public domain.
·reuters.com·
I sold mine because it was far too heavy and exacerbated issues in my arm and elbow. It’s not an accessible device IMHO | Valve: No Performance Upgrades for the Next-Gen Steam Deck
The next generation of Steam Decks will likely focus on better displays and battery life, say designers.
·tomshardware.com·
Never did I ever think RSS would live on forever yet at the same time I see how it came to fall out of favor | How to rebuild social media on top of RSS
We should look for ways to make reading, publishing, and community services all play nicely together. I'm calling this model "the unbundled web," and I think RSS should be the primary method of interop.
·tfos.co·
Interesting | Why using Alpine Docker images and Python is probably bad for your project (right now)
Alpine Linux is a distribution that is designed to be lightweight. In particular, it’s seen a lot of use in Docker images because the resulting image bundles are considerably smaller than those generated by other minimal distros. However, in the context of building a Docker image for a Python application, it’s worth thinking carefully before using Alpine, as it can often result in slower builds and counterintuitively it can even result in larger images occasionally.
·rpep.dev·
Blog: Kubernetes 1.26: Alpha API For Dynamic Resource Allocation
Authors: Patrick Ohly (Intel), Kevin Klues (NVIDIA)

Dynamic resource allocation is a new API for requesting resources. It is a generalization of the persistent volumes API for generic resources, making it possible to:

access the same resource instance in different pods and containers,
attach arbitrary constraints to a resource request to get the exact resource you are looking for,
initialize a resource according to parameters provided by the user.

Third-party resource drivers are responsible for interpreting these parameters as well as tracking and allocating resources as requests come in.

Dynamic resource allocation is an alpha feature and is only enabled when the DynamicResourceAllocation feature gate and the resource.k8s.io/v1alpha1 API group are enabled. For details, see the --feature-gates and --runtime-config kube-apiserver parameters. The kube-scheduler, kube-controller-manager, and kubelet components all need the feature gate enabled as well. The default configuration of kube-scheduler enables the DynamicResources plugin if and only if the feature gate is enabled; custom configurations may have to be modified to include it.

Once dynamic resource allocation is enabled, resource drivers can be installed to manage certain kinds of hardware. Kubernetes has a test driver that is used for end-to-end testing, but it can also be run manually. See below for step-by-step instructions.

API

The new resource.k8s.io/v1alpha1 API group provides four new types:

ResourceClass: defines which resource driver handles a certain kind of resource and provides common parameters for it. ResourceClasses are created by a cluster administrator when installing a resource driver.
ResourceClaim: defines a particular resource instance that is required by a workload. Created by a user (lifecycle managed manually, can be shared between different Pods) or for individual Pods by the control plane based on a ResourceClaimTemplate (automatic lifecycle, typically used by just one Pod).
ResourceClaimTemplate: defines the spec and some metadata for creating ResourceClaims. Created by a user when deploying a workload.
PodScheduling: used internally by the control plane and resource drivers to coordinate pod scheduling when ResourceClaims need to be allocated for a Pod.

Parameters for ResourceClass and ResourceClaim are stored in separate objects, typically using a type defined by a CRD that was created when installing a resource driver.

With this alpha feature enabled, the spec of a Pod defines the ResourceClaims that are needed for the Pod to run: this information goes into a new resourceClaims field. Entries in that list reference either a ResourceClaim or a ResourceClaimTemplate. When referencing a ResourceClaim, all Pods using this .spec (for example, inside a Deployment or StatefulSet) share the same ResourceClaim instance. When referencing a ResourceClaimTemplate, each Pod gets its own ResourceClaim instance.

For a container defined within a Pod, the resources.claims list defines whether that container gets access to these resource instances, which makes it possible to share resources between one or more containers inside the same Pod. For example, an init container could set up the resource before the application uses it.

Here is an example for a fictional resource driver. Two ResourceClaim objects will get created for this Pod, and each container gets access to one of them.
Assuming a resource driver called resource-driver.example.com was installed together with the following resource class:

apiVersion: resource.k8s.io/v1alpha1
kind: ResourceClass
name: resource.example.com
driverName: resource-driver.example.com

An end user could then allocate two specific resources of type resource.example.com as follows:

---
apiVersion: cats.resource.example.com/v1
kind: ClaimParameters
name: large-black-cats
spec:
  color: black
  size: large
---
apiVersion: resource.k8s.io/v1alpha1
kind: ResourceClaimTemplate
metadata:
  name: large-black-cats
spec:
  spec:
    resourceClassName: resource.example.com
    parametersRef:
      apiGroup: cats.resource.example.com
      kind: ClaimParameters
      name: large-black-cats
---
apiVersion: v1
kind: Pod
metadata:
  name: pod-with-cats
spec:
  containers: # two example containers; each container claims one cat resource
  - name: first-example
    image: ubuntu:22.04
    command: ["sleep", "9999"]
    resources:
      claims:
      - name: cat-0
  - name: second-example
    image: ubuntu:22.04
    command: ["sleep", "9999"]
    resources:
      claims:
      - name: cat-1
  resourceClaims:
  - name: cat-0
    source:
      resourceClaimTemplateName: large-black-cats
  - name: cat-1
    source:
      resourceClaimTemplateName: large-black-cats

Scheduling

In contrast to native resources (such as CPU or RAM) and extended resources (managed by a device plugin, advertised by the kubelet), the scheduler has no knowledge of what dynamic resources are available in a cluster or how they could be split up to satisfy the requirements of a specific ResourceClaim. Resource drivers are responsible for that. Drivers mark ResourceClaims as allocated once resources for them are reserved. This also tells the scheduler where in the cluster a claimed resource is actually available.

ResourceClaims can get their resources allocated as soon as the ResourceClaim is created (immediate allocation), without considering which Pods will use the resource.
The default (wait for first consumer) is to delay allocation until a Pod that relies on the ResourceClaim becomes eligible for scheduling. This design, with two allocation options, is similar to how Kubernetes handles storage provisioning with PersistentVolumes and PersistentVolumeClaims.

In the wait-for-first-consumer mode, the scheduler checks all ResourceClaims needed by a Pod. If the Pod has any ResourceClaims, the scheduler creates a PodScheduling object (a special object that requests scheduling details on behalf of the Pod). The PodScheduling object has the same name and namespace as the Pod, and the Pod as its owner. Using its PodScheduling object, the scheduler informs the resource drivers responsible for those ResourceClaims about the nodes that the scheduler considers suitable for the Pod. The resource drivers respond by excluding nodes that don't have enough of the driver's resources left.

Once the scheduler has that resource information, it selects one node and stores that choice in the PodScheduling object. The resource drivers then allocate resources based on the relevant ResourceClaims so that the resources will be available on the selected node. Once that resource allocation is complete, the scheduler attempts to schedule the Pod to a suitable node. Scheduling can still fail at this point; for example, a different Pod could be scheduled to the same node in the meantime. If this happens, already-allocated ResourceClaims may get deallocated to enable scheduling onto a different node.

As part of this process, ResourceClaims also get reserved for the Pod. Currently, ResourceClaims can either be used exclusively by a single Pod or by an unlimited number of Pods. One key feature is that Pods do not get scheduled to a node unless all of their resources are allocated and reserved.
This avoids the scenario where a Pod gets scheduled onto one node and then cannot run there, which is bad because such a pending Pod also blocks all the other resources, like RAM or CPU, that were set aside for it.

Limitations

The scheduler plugin must be involved in scheduling Pods which use ResourceClaims. Bypassing the scheduler by setting the nodeName field leads to Pods that the kubelet refuses to start, because the ResourceClaims are not reserved or not even allocated. It may be possible to remove this limitation in the future.

Writing a resource driver

A dynamic resource allocation driver typically consists of two separate-but-coordinating components: a centralized controller, and a DaemonSet of node-local kubelet plugins. Most of the work required by the centralized controller to coordinate with the scheduler can be handled by boilerplate code. Only the business logic required to actually allocate ResourceClaims against the ResourceClasses owned by the plugin needs to be customized. As such, Kubernetes provides the following package, including APIs for invoking this boilerplate code as well as a Driver interface that you can implement to provide your custom business logic:

k8s.io/dynamic-resource-allocation/controller

Likewise, boilerplate code can be used to register the node-local plugin with the kubelet, as well as to start a gRPC server that implements the kubelet plugin API. For drivers written in Go, the following package is recommended:

k8s.io/dynamic-resource-allocation/kubeletplugin

It is up to the driver developer to decide how these two components communicate. The KEP outlines an approach using CRDs. Within SIG Node, we also plan to provide a complete example driver that can serve as a template for other drivers.

Running the test driver

The following steps bring up a local, one-node cluster directly from the Kubernetes source code. As a prerequisite, your cluster must have nodes with a container runtime that supports the Container Device Interface (CDI).
For example, you can run CRI-O v1.23.2 or later. Once containerd v1.7.0 is released, we expect that you can run that or any later version. In the example below, we use CRI-O.

First, clone the Kubernetes source code. Inside that directory, run:

$ hack/install-etcd.sh
...
$ RUNTIME_CONFIG=resource.k8s.io/v1alpha1 \
  FEATURE_GATES=DynamicResourceAllocation=true \
  DNS_ADDON="coredns" \
  CGROUP_DRIVER=systemd \
  CONTAINER_RUNTIME_ENDPOINT=unix:///var/run/crio/crio.sock \
  LOG_LEVEL=6 \
  ENABLE_CSI_SNAPSHOTTER=false \
  API_SECURE_PORT=6444 \
  ALLOW_PRIVILEGED=1 \
  PATH=$(pwd)/third_party/etcd:$PATH \
  ./hack/local-up-cluster.sh -O
...

To start using your cluster, you...
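Stepping back to the Scheduling section above: the wait-for-first-consumer handshake between the scheduler and a resource driver can be sketched as a toy simulation. The ToyDriver class, its per-node capacity model, and the node names are all invented for illustration; the real logic lives in kube-scheduler's DynamicResources plugin and in each resource driver:

```python
class ToyDriver:
    """Stand-in for a resource driver: tracks free resource instances per node."""

    def __init__(self, capacity):
        self.capacity = dict(capacity)  # node name -> free instances

    def filter_nodes(self, nodes, needed):
        # Step 1: via the PodScheduling object, the driver excludes nodes
        # that don't have enough of its resources left.
        return [n for n in nodes if self.capacity.get(n, 0) >= needed]

    def allocate(self, node, needed):
        # Step 3: the driver allocates the claims for the chosen node.
        self.capacity[node] -= needed

def schedule(claims_needed, candidate_nodes, driver):
    suitable = driver.filter_nodes(candidate_nodes, claims_needed)
    if not suitable:
        return None  # pod stays pending; claims remain unallocated
    # Step 2: the scheduler picks one node and records it in PodScheduling.
    chosen = suitable[0]
    driver.allocate(chosen, claims_needed)
    # Only now, with all claims allocated and reserved, is the pod bound.
    return chosen

driver = ToyDriver({"node-a": 1, "node-b": 2})
print(schedule(2, ["node-a", "node-b"], driver))  # node-b
```

As in the real flow, binding happens only after every claim the pod needs has been allocated and reserved, which avoids stranding CPU and RAM on a node where the pod could never actually run.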
·kubernetes.io·
WebAssembly vs. Kubernetes
WebAssembly, or Wasm, was shown to be a very practical way to run code in a web browser, serving as a compilation target of sorts. Eventually, it dawned on developers that Wasm could run on server operating systems as well, and its use now extends across hardware platforms, leading some to view it as an alternative to Kubernetes.
·thenewstack.io·
Zoë Schiffer on Twitter
NEW: Twitter currently does not have admin access to some of its GitHub repos. These repos contain Twitter source code (much of it is open source; some is not). This includes code for companies Twitter acquired, like Smyte. 1/— Zoë Schiffer (@ZoeSchiffer) December 13, 2022
·twitter.com·
I think the big takeaway is the US gov’t got a Discord Voice chat of the scheme | SEC says social media influencers used Twitter and Discord to manipulate stocks
The regulatory agency charged them in what it says was a $100 million securities fraud scheme run by people who portrayed themselves as successful stock traders.
·nbcnews.com·