1_DevOps'ish

1_DevOps'ish

College Knowledge
College Knowledge
'Your chitin armor is no match for our iron-tipped stingers! Better go hide in your jars!' --common playground taunt
·xkcd.com·
College Knowledge
Blog: Kubernetes 1.27: Quality-of-Service for Memory Resources (alpha)
Blog: Kubernetes 1.27: Quality-of-Service for Memory Resources (alpha)
Authors: Dixita Narang (Google)

Kubernetes v1.27, released in April 2023, introduced changes to Memory QoS (alpha) to improve memory management capabilities in Linux nodes. Support for Memory QoS was initially added in Kubernetes v1.22, and later some limitations around the formula for calculating memory.high were identified. These limitations are addressed in Kubernetes v1.27.

Background

Kubernetes allows you to optionally specify how much of each resource a container needs in the Pod specification. The most common resources to specify are CPU and memory. For example, a Pod manifest that defines container resource requirements could look like:

apiVersion: v1
kind: Pod
metadata:
  name: example
spec:
  containers:
  - name: nginx
    resources:
      requests:
        memory: "64Mi"
        cpu: "250m"
      limits:
        memory: "64Mi"
        cpu: "500m"

spec.containers[].resources.requests: When you specify the resource request for containers in a Pod, the Kubernetes scheduler uses this information to decide which node to place the Pod on. The scheduler ensures that, for each resource type, the sum of the resource requests of the scheduled containers is less than the total allocatable resources on the node.

spec.containers[].resources.limits: When you specify the resource limit for containers in a Pod, the kubelet enforces those limits so that the running containers are not allowed to use more of those resources than the limits you set.

When the kubelet starts a container as part of a Pod, it passes the container's requests and limits for CPU and memory to the container runtime. The container runtime assigns both the CPU request and the CPU limit to the container. Provided the system has free CPU time, containers are guaranteed to be allocated as much CPU as they request. Containers cannot use more CPU than the configured limit; a container's CPU usage will be throttled if it uses more CPU than the specified limit within a given time slice.

Prior to the Memory QoS feature, the container runtime only used the memory limit and discarded the memory request (requests were, and still are, also used to influence scheduling). If a container uses more memory than the configured limit, the Linux Out Of Memory (OOM) killer will be invoked.

Let's compare how the container runtime on Linux typically configures the memory request and limit in cgroups, with and without the Memory QoS feature:

Memory request

The memory request is mainly used by kube-scheduler during (Kubernetes) Pod scheduling. In cgroups v1, there are no controls to specify the minimum amount of memory a cgroup must always retain, so the container runtime did not use the value of the memory request set in the Pod spec. cgroups v2 introduced a memory.min setting, used to specify the minimum amount of memory that should remain available to the processes within a given cgroup. If the memory usage of a cgroup is within its effective min boundary, the cgroup's memory won't be reclaimed under any conditions. If the kernel cannot maintain at least memory.min bytes of memory for the processes within the cgroup, the kernel invokes its OOM killer. In other words, the kernel guarantees at least this much memory is available, or terminates processes (which may be outside the cgroup) in order to make more memory available. Memory QoS maps memory.min to spec.containers[].resources.requests.memory to ensure the availability of memory for containers in Kubernetes Pods.
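As an illustration (not part of the original post), with the example Pod above the container's memory request maps directly onto the cgroup v2 interface:

$memory.min = requests.memory = 64\,Mi = 64 \times 1024 \times 1024 = 67108864\ bytes$

If Memory QoS is enabled, the container runtime would write this value into the container-level cgroup, so the kernel will not reclaim that memory from the container.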
Memory limit

The memory limit specifies the boundary beyond which, if the container tries to allocate more memory, the Linux kernel will terminate a process with an OOM (Out of Memory) kill. If the terminated process was the main (or only) process inside the container, the container may exit.

In cgroups v1, the memory.limit_in_bytes interface is used to set the memory usage limit. However, unlike CPU, it was not possible to apply memory throttling: as soon as a container crossed the memory limit, it would be OOM killed. In cgroups v2, memory.max is analogous to memory.limit_in_bytes in cgroups v1. Memory QoS maps memory.max to spec.containers[].resources.limits.memory to specify the hard limit for memory usage. If the memory consumption goes above this level, the kernel invokes its OOM killer.

cgroups v2 also added the memory.high configuration. Memory QoS uses memory.high to set a memory usage throttle limit. If the memory.high limit is breached, the offending cgroups are throttled, and the kernel tries to reclaim memory, which may avoid an OOM kill.

How it works

Cgroups v2 memory controller interfaces & Kubernetes container resources mapping

Memory QoS uses the memory controller of cgroups v2 to guarantee memory resources in Kubernetes. The cgroups v2 interfaces that this feature uses are memory.max, memory.min, and memory.high.

memory.max is mapped to limits.memory specified in the Pod spec. The kubelet and the container runtime configure the limit in the respective cgroup. The kernel enforces the limit to prevent the container from using more than the configured resource limit. If a process in a container tries to consume more than the specified limit, the kernel terminates the process with an Out of Memory (OOM) error.

memory.min is mapped to requests.memory, which results in a reservation of memory resources that should never be reclaimed by the kernel. This is how Memory QoS ensures the availability of memory for Kubernetes pods. If there's no unprotected reclaimable memory available, the OOM killer is invoked to make more memory available.

For memory protection, in addition to the original way of limiting memory usage, Memory QoS throttles a workload approaching its memory limit, ensuring that the system is not overwhelmed by sporadic increases in memory usage. A new field, memoryThrottlingFactor, is available in the KubeletConfiguration when you enable the MemoryQoS feature; it is set to 0.9 by default. memory.high is mapped to a throttling limit calculated from memoryThrottlingFactor, requests.memory, and limits.memory as in the formula below, rounding the value down to the nearest page size:

$memory.high = floor\left[\frac{requests.memory + memoryThrottlingFactor \times (limits.memory - requests.memory)}{pageSize}\right] \times pageSize$

Note: If a container has no memory limit specified, limits.memory is substituted for node allocatable memory.

Summary:

memory.max: specifies the maximum amount of memory a container is allowed to use. If a process within the container tries to consume more memory than the configured limit, the kernel terminates the process with an Out of Memory (OOM) error. It is mapped to the container's memory limit specified in the Pod manifest.

memory.min: specifies the minimum amount of memory the cgroup must always retain, i.e., memory that should never be reclaimed by the system. If there's no unprotected reclaimable memory available, the OOM killer is invoked. It is mapped to the container's memory request specified in the Pod manifest.
memory.high: specifies the memory usage throttle limit. This is the main mechanism to control a cgroup's memory use. If a cgroup's memory use goes over the high boundary specified here, the cgroup's processes are throttled and put under heavy reclaim pressure. Kubernetes uses a formula to calculate memory.high, depending on the container's memory request, memory limit or node allocatable memory (if the container's memory limit is empty), and a throttling factor. Please refer to the KEP for more details on the formula.

Note: memory.high is set only on container-level cgroups, while memory.min is set on container-, pod-, and node-level cgroups.

memory.min calculations for the cgroups hierarchy

When container memory requests are made, the kubelet passes memory.min to the back-end CRI runtime (such as containerd or CRI-O) via the Unified field in CRI during container creation. memory.min in the container-level cgroup will be set to:

$memory.min = pod.spec.containers[i].resources.requests[memory]$ for the ith container in a pod

Since the memory.min interface requires that the ancestor cgroup directories are all set, the pod and node cgroup directories need to be set correctly.

memory.min in the pod-level cgroup:

$memory.min = \sum_{i=0}^{no.\ of\ containers}pod.spec.containers[i].resources.requests[memory]$ summed over every ith container in the pod

memory.min in the node-level cgroup:

$memory.min = \sum_{i}^{no.\ of\ pods}\sum_{j}^{no.\ of\ containers}pod[i].spec.containers[j].resources.requests[memory]$ summed over every jth container in every ith pod on the node

The kubelet manages the cgroups hierarchy of the pod-level and node-level cgroups directly using the libcontainer library (from the runc project), while container cgroup limits are managed by the container runtime.

Support for Pod QoS classes

Based on user feedback for the alpha feature in Kubernetes v1.22, some users would like to opt out of Memory QoS on a per-pod basis to ensure there is no early memory throttling. Therefore, in Kubernetes v1.27 Memory QoS also supports setting memory.high according to a Pod's Quality of Service (QoS) class. The cases for memory.high per QoS class are:

Guaranteed pods, by their QoS definition, require memory requests to equal memory limits and are not overcommitted. Hence the Memory QoS feature is disabled on those pods by not setting memory.high. This ensures that Guaranteed pods can fully use their memory requests up to their set limit and not hit any throttling.

Burstable pods, by their QoS definition, require at least one container in the Pod with a CPU or memory request or limit set. When requests.memory and limits.memory are set, the formula is used as-is:

$memory.high = floor\left[\frac{requests.memory + memoryThrottlingFactor \times (limits.memory - requests.memory)}{pageSize}\right] \times pageSize$

When requests.memory is set and limits.memory is not set, limits.memory is substituted for node allocatable memory in the formula:

$memory.high = floor\left[\frac{requests.memory + memoryThrottlingFactor \times (node\ allocatable\ memory - requests.memory)}{pageSize}\right] \times pageSize$

BestEffort by their QoS de...
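As a concrete illustration of the settings discussed in this excerpt (not part of the original article), a minimal KubeletConfiguration sketch that enables the Memory QoS feature gate and keeps the default throttling factor might look like the following; it assumes the node runs with cgroups v2:

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
featureGates:
  MemoryQoS: true             # alpha feature gate described in this post
memoryThrottlingFactor: 0.9   # default value; used in the memory.high formula

With this configuration and a hypothetical Burstable container that requests 64Mi and limits memory to 128Mi, the formula above (assuming a 4 KiB page size) gives memory.high = floor[(67108864 + 0.9 × 67108864) / 4096] × 4096 = 127504384 bytes, roughly 121.6 MiB.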
·kubernetes.io·
Blog: Kubernetes 1.27: Quality-of-Service for Memory Resources (alpha)
UEFI Secure Boot on the Raspberry Pi
UEFI Secure Boot on the Raspberry Pi
A port of the free software TianoCore UEFI firmware can be used instead of the proprietary boot blob to boot the Raspberry Pi. This makes it possible to install Debian on the RPi …
·linux.it·
UEFI Secure Boot on the Raspberry Pi
My Weekend With an Emotional Support A.I. Companion
My Weekend With an Emotional Support A.I. Companion
Pi, an A.I. tool that debuted this week, is a twist on the new wave of chatbots: It assists people with their wellness and emotions.
·nytimes.com·
My Weekend With an Emotional Support A.I. Companion
Nordstrom cuts tech workforce
Nordstrom cuts tech workforce
The layoffs come weeks after the company appointed a new chief technology and information officer.
·ciodive.com·
Nordstrom cuts tech workforce
White House to scrutinize use of AI, worker tracking tools
White House to scrutinize use of AI, worker tracking tools
The Office of Science and Technology Policy said it will request information from the public regarding how such tools are used to “surveil, monitor, evaluate and manage” workers.
·ciodive.com·
White House to scrutinize use of AI, worker tracking tools
Unity lays off 600 more, closing half of offices
Unity lays off 600 more, closing half of offices
Unity is cutting another 600 employees in its third round of layoffs in less than a year, The Wall Street Journal repor…
·gamesindustry.biz·
Unity lays off 600 more, closing half of offices
ChatGPT vs. MySQL DBA Challenge
ChatGPT vs. MySQL DBA Challenge
Asking ChatGPT some questions that a MySQL DBA usually needs to answer in an interview process — and seeing how it does.
·percona.com·
ChatGPT vs. MySQL DBA Challenge
Blog: Kubernetes 1.27: StatefulSet PVC Auto-Deletion (beta)
Blog: Kubernetes 1.27: StatefulSet PVC Auto-Deletion (beta)
Author: Matthew Cary (Google)

Kubernetes v1.27 graduated to beta a new policy mechanism for StatefulSets that controls the lifetime of their PersistentVolumeClaims (PVCs). The new PVC retention policy lets users specify whether the PVCs generated from the StatefulSet spec template should be automatically deleted or retained when the StatefulSet is deleted or its replicas are scaled down.

What problem does this solve?

A StatefulSet spec can include Pod and PVC templates. When a replica is first created, the Kubernetes control plane creates a PVC for that replica if one does not already exist. The behavior before the PVC retention policy was that the control plane never cleaned up the PVCs created for StatefulSets - this was left up to the cluster administrator, or to some add-on automation that you'd have to find, check for suitability, and deploy.

The common pattern for managing PVCs, either manually or through tools such as Helm, is that the PVCs are tracked by the tool that manages them, with an explicit lifecycle. Workflows that use StatefulSets must determine on their own what PVCs a StatefulSet creates and what their lifecycle should be.

Before this new feature, when a StatefulSet-managed replica disappears, either because the StatefulSet is reducing its replica count or because its StatefulSet is deleted, the PVC and its backing volume remain and must be manually deleted. While this behavior is appropriate when the data is critical, in many cases the persistent data in these PVCs is either temporary or can be reconstructed from another source. In those cases, PVCs and their backing volumes remaining after their StatefulSet or replicas have been deleted are not necessary, incur cost, and require manual cleanup.

The new StatefulSet PVC retention policy

The new StatefulSet PVC retention policy is used to control if and when PVCs created from a StatefulSet's volumeClaimTemplates are deleted. There are two contexts when this may occur.

The first context is when the StatefulSet resource is deleted (which implies that all replicas are also deleted). This is controlled by the whenDeleted policy. The second context, controlled by whenScaled, is when the StatefulSet is scaled down, which removes some but not all of the replicas in a StatefulSet. In both cases the policy can either be Retain, where the corresponding PVCs are not touched, or Delete, which means that PVCs are deleted. The deletion is done with a normal object deletion, so that, for example, all retention policies for the underlying PV are respected.

This policy forms a matrix with four cases. I'll walk through and give an example for each one.

whenDeleted and whenScaled are both Retain. This matches the existing behavior for StatefulSets, where no PVCs are deleted. This is also the default retention policy. It's appropriate to use when data on StatefulSet volumes may be irreplaceable and should only be deleted manually.

whenDeleted is Delete and whenScaled is Retain. In this case, PVCs are deleted only when the entire StatefulSet is deleted. If the StatefulSet is scaled down, PVCs are not touched, meaning they are available to be reattached, with any data from the previous replica, if a scale-up occurs. This might be used for a temporary StatefulSet, such as in a CI instance or ETL pipeline, where the data on the StatefulSet is needed only during the lifetime of the StatefulSet, but while the task is running the data is not easily reconstructible.
Any retained state is needed for replicas that scale down and then up.

whenDeleted and whenScaled are both Delete. PVCs are deleted immediately when their replica is no longer needed. Note this does not include when a Pod is deleted and a new version rescheduled, for example when a node is drained and Pods need to migrate elsewhere. The PVC is deleted only when the replica is no longer needed, as signified by a scale-down or StatefulSet deletion. This use case is for when data does not need to live beyond the life of its replica. Perhaps the data is easily reconstructable and the cost savings of deleting unused PVCs are more important than quick scale-up, or perhaps when a new replica is created, any data from a previous replica is not usable and must be reconstructed anyway.

whenDeleted is Retain and whenScaled is Delete. This is similar to the previous case, where there is little benefit to keeping PVCs for fast reuse during scale-up. An example of a situation where you might use this is an Elasticsearch cluster. Typically you would scale that workload up and down to match demand, whilst ensuring a minimum number of replicas (for example: 3). When scaling down, data is migrated away from removed replicas and there is no benefit to retaining those PVCs. However, it can be useful to bring the entire Elasticsearch cluster down temporarily for maintenance. If you need to take the Elasticsearch system offline, you can do this by temporarily deleting the StatefulSet, and then bringing the Elasticsearch cluster back by recreating the StatefulSet. The PVCs holding the Elasticsearch data will still exist and the new replicas will automatically use them.

Visit the documentation to see all the details.

What's next?

Try it out! The StatefulSetAutoDeletePVC feature gate is beta and enabled by default on clusters running Kubernetes 1.27. Create a StatefulSet using the new policy, test it out and tell us what you think!

I'm very curious to see if this owner reference mechanism works well in practice. For example, I realized there is no mechanism in Kubernetes for knowing who set a reference, so it's possible that the StatefulSet controller may fight with custom controllers that set their own references. Fortunately, maintaining the existing retention behavior does not involve any new owner references, so the default behavior will be compatible.

Please tag any issues you report with the label sig/apps and assign them to Matthew Cary (@mattcary at GitHub).

Enjoy!
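As a sketch of what the second case above (delete on StatefulSet deletion, retain on scale-down) might look like in practice - not taken from the post; the names, image, and sizes are placeholders - a StatefulSet using the new policy could be declared like this:

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: example
spec:
  serviceName: example              # headless Service name (placeholder)
  replicas: 3
  persistentVolumeClaimRetentionPolicy:
    whenDeleted: Delete             # remove PVCs when the StatefulSet is deleted
    whenScaled: Retain              # keep PVCs when replicas are scaled down
  selector:
    matchLabels:
      app: example
  template:
    metadata:
      labels:
        app: example
    spec:
      containers:
      - name: app
        image: nginx                # placeholder image
        volumeMounts:
        - name: data
          mountPath: /var/lib/example
  volumeClaimTemplates:
  - metadata:
      name: data
    spec:
      accessModes: ["ReadWriteOnce"]
      resources:
        requests:
          storage: 1Gi

This relies on the StatefulSetAutoDeletePVC feature gate being enabled, which, per the post, is the default on Kubernetes 1.27 clusters.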
·kubernetes.io·
Blog: Kubernetes 1.27: StatefulSet PVC Auto-Deletion (beta)
Sun Tzu wouldn't like the cybersecurity industry
Sun Tzu wouldn't like the cybersecurity industry
Sun Tzu quotes are beloved to the point of overuse in cybersecurity. Here’s why the legend himself would be dissatisfied with the security status quo.
·kellyshortridge.com·
Sun Tzu wouldn't like the cybersecurity industry
Blog: Kubernetes 1.27: HorizontalPodAutoscaler ContainerResource type metric moves to beta
Blog: Kubernetes 1.27: HorizontalPodAutoscaler ContainerResource type metric moves to beta
Author: Kensei Nakada (Mercari)

Kubernetes 1.20 introduced the ContainerResource type metric in HorizontalPodAutoscaler (HPA). In Kubernetes 1.27, this feature moves to beta and the corresponding feature gate (HPAContainerMetrics) is enabled by default.

What is the ContainerResource type metric

The ContainerResource type metric allows us to configure autoscaling based on the resource usage of individual containers. In the following example, the HPA controller scales the target so that the average CPU utilization of the application container across all the Pods is around 60%. (See the algorithm details to learn how the desired replica number is calculated exactly.)

type: ContainerResource
containerResource:
  name: cpu
  container: application
  target:
    type: Utilization
    averageUtilization: 60

The difference from the Resource type metric

HPA already had a Resource type metric. You can define the target resource utilization like the following, and then HPA will scale the replicas up or down based on the current utilization.

type: Resource
resource:
  name: cpu
  target:
    type: Utilization
    averageUtilization: 60

But this Resource type metric refers to the average utilization of the Pods. In case a Pod has multiple containers, the utilization calculation is:

sum{the resource usage of each container} / sum{the resource request of each container}

The resource utilization of each container may not have a direct correlation, or may grow at different rates as the load changes. For example:

- A sidecar container only provides an auxiliary service such as log shipping. If the application does not log very frequently or does not produce logs in its hotpath, then the usage of the log shipper will not grow.
- A sidecar container provides authentication. Due to heavy caching, its usage will only increase slightly when the load on the main container increases. In the current blended usage calculation approach, this usually results in the HPA not scaling up the deployment because the blended usage is still low.
- A sidecar may be injected without resources set, which prevents scaling based on utilization. In the current logic, the HPA controller can only scale on the absolute resource usage of the pod when the resource requests are not set.

In such cases, if only one container's resource utilization goes high, the Resource type metric may not suggest scaling up. So, for accurate autoscaling, you may want to use the ContainerResource type metric for such Pods instead.

What's new for the beta?

For Kubernetes v1.27, the ContainerResource type metric is available by default, as described at the beginning of this article. (You can still disable it via the HPAContainerMetrics feature gate.)

Also, we've improved the observability of the HPA controller by exposing some metrics from the kube-controller-manager:

- metric_computation_total: Number of metric computations.
- metric_computation_duration_seconds: The time that the HPA controller takes to calculate one metric.
- reconciliations_total: Number of reconciliations of the HPA controller.
- reconciliation_duration_seconds: The time that the HPA controller takes to reconcile a HPA object once.

These metrics have the labels action (scale_up, scale_down, none) and error (spec, internal, none). In addition, the first two metrics have the metric_type label, which corresponds to .spec.metrics[*].type for a HorizontalPodAutoscaler.
All of these metrics are useful for general monitoring of the HPA controller; you can get deeper insight into which part has a problem, where it takes time, and how much scaling tends to happen at which times on your cluster.

Another minor change: we've updated the SuccessfulRescale event's messages so that everyone can check whether an event came from the Resource metric or the ContainerResource metric (see the related PR).

Getting involved

This feature is managed by SIG Autoscaling. Please join us and share your feedback. We look forward to hearing from you!

How can I learn more?

- The official documentation of the ContainerResource type metric
- KEP-1610: Container Resource based Autoscaling
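To show how the ContainerResource metric from this post fits into a complete object - this sketch is not from the article, and the HPA name, Deployment name, and replica bounds are placeholders - a HorizontalPodAutoscaler using it might look like:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: example-hpa                 # placeholder name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: example                   # placeholder workload to scale
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: ContainerResource
    containerResource:
      name: cpu
      container: application        # compute utilization from this container only
      target:
        type: Utilization
        averageUtilization: 60

Compared with a plain Resource metric, the only difference is the extra container field, which restricts the utilization calculation to the named container instead of blending all containers in the Pod.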
·kubernetes.io·
Blog: Kubernetes 1.27: HorizontalPodAutoscaler ContainerResource type metric moves to beta