via Pocket https://www.reuters.com/world/uk/black-britons-uk-riots-leave-lasting-scars-2024-08-19/
August 19, 2024 at 09:37AM
The Dark Side of Open Source: Are We All Just Selfish?
Open-source software is often seen as a free-for-all, but the reality is more complex. Many companies invest heavily in open source projects as a go-to-market strategy, paying full-time maintainers to ensure project success. This video explores the motivations behind open source, the role of big companies like Google and AWS, and the impact of license changes by companies like MongoDB and HashiCorp. Discover why no open-source project should be owned by a single company and the benefits of foundation-owned projects like Kubernetes and Linux. Learn how you can contribute to and support the open source ecosystem.
via YouTube https://www.youtube.com/watch?v=4l_kK90khNA
Kubernetes 1.31: Pod Failure Policy for Jobs Goes GA
https://kubernetes.io/blog/2024/08/19/kubernetes-1-31-pod-failure-policy-for-jobs-goes-ga/
This post describes Pod failure policy, which graduates to stable in Kubernetes 1.31, and how to use it in your Jobs.
About Pod failure policy
When you run workloads on Kubernetes, Pods might fail for a variety of reasons. Ideally, workloads like Jobs should be able to ignore transient, retriable failures and continue running to completion.
To allow for these transient failures, Kubernetes Jobs include the backoffLimit field, which lets you specify a number of Pod failures that you're willing to tolerate during Job execution. However, if you set a large value for the backoffLimit field and rely solely on this field, you might notice unnecessary increases in operating costs as Pods restart excessively until the backoffLimit is met.
This becomes particularly problematic when running large-scale Jobs with thousands of long-running Pods across thousands of nodes.
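For context, a minimal Job that relies only on backoffLimit might look like the sketch below (the name, image, and command are placeholders):

apiVersion: batch/v1
kind: Job
metadata:
  name: example-job                # hypothetical name
spec:
  backoffLimit: 6                  # tolerate up to six Pod failures before failing the Job
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: main
        image: registry.example/worker:latest   # placeholder image
        command: ["./process-batch"]            # placeholder command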
The Pod failure policy extends the backoff limit mechanism to help you reduce costs in the following ways:
Gives you control to fail the Job as soon as a non-retriable Pod failure occurs.
Allows you to ignore retriable errors without increasing the backoffLimit field.
For example, you can use a Pod failure policy to run your workload on more affordable spot machines by ignoring Pod failures caused by graceful node shutdown.
The policy allows you to distinguish between retriable and non-retriable Pod failures based on container exit codes or Pod conditions in a failed Pod.
How it works
You specify a Pod failure policy in the Job specification, represented as a list of rules.
For each rule you define match requirements based on one of the following properties:
Container exit codes: the onExitCodes property.
Pod conditions: the onPodConditions property.
Additionally, for each rule, you specify one of the following actions to take when a Pod matches the rule:
Ignore: Do not count the failure towards the backoffLimit or backoffLimitPerIndex.
FailJob: Fail the entire Job and terminate all running Pods.
FailIndex: Fail the index corresponding to the failed Pod. This action works with the Backoff limit per index feature.
Count: Count the failure towards the backoffLimit or backoffLimitPerIndex. This is the default behavior.
When Pod failures occur in a running Job, Kubernetes matches the failed Pod status against the list of Pod failure policy rules, in the specified order, and takes the corresponding actions for the first matched rule.
Note that when specifying the Pod failure policy, you must also set the Job's Pod template with restartPolicy: Never. This prevents race conditions between the kubelet and the Job controller when counting Pod failures.
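For instance, the relevant shape of such a Job spec might look like this sketch (the single rule shown is just one example; see the full example below):

apiVersion: batch/v1
kind: Job
spec:
  backoffLimit: 6
  podFailurePolicy:
    rules:
    - action: Ignore               # example rule
      onPodConditions:
      - type: DisruptionTarget
  template:
    spec:
      restartPolicy: Never         # required when podFailurePolicy is set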
Kubernetes-initiated Pod disruptions
To allow matching Pod failure policy rules against failures caused by disruptions initiated by Kubernetes, this feature introduces the DisruptionTarget Pod condition.
Kubernetes adds this condition to any Pod that fails because of a retriable disruption scenario, regardless of whether it's managed by a Job controller. The DisruptionTarget condition contains one of the following reasons that corresponds to these disruption scenarios:
PreemptionByKubeScheduler: Preemption by kube-scheduler to accommodate a new Pod that has a higher priority.
DeletionByTaintManager: the Pod is due to be deleted by kube-controller-manager due to a NoExecute taint that the Pod doesn't tolerate.
EvictionByEvictionAPI: the Pod is due to be deleted by an API-initiated eviction.
DeletionByPodGC: the Pod is bound to a node that no longer exists and is due to be deleted by Pod garbage collection.
TerminationByKubelet: the Pod was terminated by graceful node shutdown, node-pressure eviction, or preemption for system-critical pods.
In all other disruption scenarios, like eviction due to exceeding Pod container limits, Pods don't receive the DisruptionTarget condition because the disruptions were likely caused by the Pod and would reoccur on retry.
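For illustration, a failed Pod that was terminated during a graceful node shutdown would carry a status condition shaped roughly like this (the message text is illustrative):

status:
  phase: Failed
  conditions:
  - type: DisruptionTarget
    status: "True"
    reason: TerminationByKubelet
    message: Pod was terminated in response to imminent node shutdown.   # illustrative message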
Example
The Pod failure policy snippet below demonstrates an example use:
podFailurePolicy:
  rules:
  - action: Ignore
    onPodConditions:
    - type: DisruptionTarget
  - action: FailJob
    onPodConditions:
    - type: ConfigIssue
  - action: FailJob
    onExitCodes:
      operator: In
      values: [42]
In this example, the Pod failure policy does the following:
Ignores any failed Pods that have the built-in DisruptionTarget condition. These Pods don't count towards Job backoff limits.
Fails the Job if any failed Pods have the custom user-supplied ConfigIssue condition, which was added either by a custom controller or webhook.
Fails the Job if any containers exited with the exit code 42.
Counts all other Pod failures towards the default backoffLimit (or backoffLimitPerIndex if used).
Learn more
For a hands-on guide to using Pod failure policy, see Handling retriable and non-retriable pod failures with Pod failure policy
Read the documentation for Pod failure policy and Backoff limit per index
Read the documentation for Pod disruption conditions
Read the KEP for Pod failure policy
Related work
Based on the concepts introduced by Pod failure policy, the following additional work is in progress:
JobSet integration: Configurable Failure Policy API
Pod failure policy extension to add more granular failure reasons
Support for Pod failure policy via JobSet in Kubeflow Training v2
Proposal: Disrupted Pods should be removed from endpoints
Get involved
This work was sponsored by the batch working group in close collaboration with the SIG Apps, SIG Node, and SIG Scheduling communities.
If you are interested in working on new features in this space, we recommend subscribing to our Slack channel and attending the regular community meetings.
Acknowledgments
I would like to thank everyone who was involved in this project over the years - it's been a journey and a joint community effort! The list below is my best-effort attempt to remember and recognize the people who made an impact. Thank you!
Aldo Culquicondor for guidance and reviews throughout the process
Jordan Liggitt for KEP and API reviews
David Eads for API reviews
Maciej Szulik for KEP reviews from SIG Apps PoV
Clayton Coleman for guidance and SIG Node reviews
Sergey Kanzhelev for KEP reviews from SIG Node PoV
Dawn Chen for KEP reviews from SIG Node PoV
Daniel Smith for reviews from SIG API machinery PoV
Antoine Pelisse for reviews from SIG API machinery PoV
John Belamaric for PRR reviews
Filip Křepinský for thorough reviews from SIG Apps PoV and bug-fixing
David Porter for thorough reviews from SIG Node PoV
Jensen Lo for early requirements discussions, testing and reporting issues
Daniel Vega-Myhre for advancing JobSet integration and reporting issues
Abdullah Gharaibeh for early design discussions and guidance
Antonio Ojea for test reviews
Yuki Iwai for reviews and aligning implementation of the closely related Job features
Kevin Hannon for reviews and aligning implementation of the closely related Job features
Tim Bannister for docs reviews
Shannon Kularathna for docs reviews
Paola Cortés for docs reviews
via Kubernetes Blog https://kubernetes.io/
August 18, 2024 at 08:00PM
LEGOs labeled and bins for partial projects allocated. It’s a LEGO playroom now.
August 18, 2024 at 04:56PM
via Instagram https://instagr.am/p/C-01F5yvXDK/
.@juliemshort massively reorganized Max’s playroom a couple days ago into more of a LEGO builder space since that’s really all Max does in here.
Max asked me to come see his battlefield that he’d set up. The organizer is new to help keep minifigs sorted as the previous solution was overflowing. I went to grab a droid and had no idea where they were. Max tells me and I immediately forget.
So this is what I’m doing right now. Labeling drawers after Julie came through and made sure everything was in the right place (it wasn’t; hence the labeling). Happy Sunday! #LEGO #organization #legominifigs
August 18, 2024 at 12:43PM
via Instagram https://instagr.am/p/C-0YI-gvPmH/
CVE-2024-7646
https://github.com/kubernetes/kubernetes/issues/126744
Ingress-nginx Annotation Validation Bypass
via Kubernetes Vulnerability Announcements - CVE Feed https://kubernetes.io/docs/reference/issues-security/official-cve-feed/
August 16, 2024 at 12:10PM
Kubernetes 1.31: MatchLabelKeys in PodAffinity graduates to beta
https://kubernetes.io/blog/2024/08/16/matchlabelkeys-podaffinity/
Kubernetes 1.29 introduced the new fields MatchLabelKeys and MismatchLabelKeys in PodAffinity and PodAntiAffinity.
In Kubernetes 1.31, this feature moves to beta and the corresponding feature gate (MatchLabelKeysInPodAffinity) gets enabled by default.
MatchLabelKeys - Enhanced scheduling for versatile rolling updates
During a rolling update of a workload (for example, a Deployment), a cluster may have Pods from multiple versions at the same time. However, the scheduler cannot distinguish between old and new versions based on the LabelSelector specified in PodAffinity or PodAntiAffinity. As a result, it will co-locate or disperse Pods regardless of their versions.
This can lead to sub-optimal scheduling outcomes, for example:
New version Pods are co-located with old version Pods (PodAffinity), which will eventually be removed after rolling updates.
Old version Pods are distributed across all available topologies, preventing new version Pods from finding nodes due to PodAntiAffinity.
MatchLabelKeys is a set of Pod label keys that addresses this problem. The scheduler looks up the values of these keys from the new Pod's labels and combines them with the LabelSelector so that PodAffinity matches only Pods that have the same key-value pairs in their labels.
By using label pod-template-hash in MatchLabelKeys, you can ensure that only Pods of the same version are evaluated for PodAffinity or PodAntiAffinity.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: application-server
...
      affinity:
        podAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:             # illustrative selector; truncated in the source
              matchExpressions:
              - key: app
                operator: In
                values:
                - database
            topologyKey: topology.kubernetes.io/zone
            matchLabelKeys:            # the new field
            - pod-template-hash
The above matchLabelKeys will be translated in Pods like:
kind: Pod
metadata:
  name: application-server
  labels:
    pod-template-hash: xyz             # added by the Deployment controller
...
      affinity:
        podAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - database
              - key: pod-template-hash # injected from matchLabelKeys
                operator: In
                values:
                - xyz
            topologyKey: topology.kubernetes.io/zone
MismatchLabelKeys - Service isolation
MismatchLabelKeys is, like MatchLabelKeys, a set of Pod label keys. The scheduler looks up the values of these keys from the new Pod's labels and merges them with the LabelSelector as key notin (value), so that PodAffinity does not match Pods that have the same key-value pairs in their labels.
Suppose all Pods for each tenant get a tenant label via a controller or a manifest management tool like Helm.
Although the value of the tenant label is unknown when composing each workload's manifest, the cluster admin wants to achieve exclusive 1:1 tenant-to-domain placement for tenant isolation.
MismatchLabelKeys works for this use case. By applying the following affinity globally using a mutating webhook, the cluster admin can ensure that Pods from the same tenant land on the same domain exclusively, meaning Pods from other tenants won't land on the same domain.
affinity:
  podAffinity:                         # ensures the pods of this tenant land on the same node pool
    requiredDuringSchedulingIgnoredDuringExecution:
    - matchLabelKeys:
      - tenant
      topologyKey: node-pool
  podAntiAffinity:                     # ensures only Pods from this tenant land on the same node pool
    requiredDuringSchedulingIgnoredDuringExecution:
    - mismatchLabelKeys:
      - tenant
      labelSelector:                   # illustrative; applies the rule to all tenant-labeled Pods
        matchExpressions:
        - key: tenant
          operator: Exists
      topologyKey: node-pool
The above matchLabelKeys and mismatchLabelKeys will be translated like this:
kind: Pod
metadata:
  name: application-server
  labels:
    tenant: service-a
spec:
  affinity:
    podAffinity:                       # ensures the pods of this tenant land on the same node pool
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: tenant
            operator: In
            values:
            - service-a                # injected from matchLabelKeys
        topologyKey: node-pool
    podAntiAffinity:                   # ensures only Pods from this tenant land on the same node pool
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: tenant
            operator: NotIn
            values:
            - service-a                # injected from mismatchLabelKeys
          - key: tenant
            operator: Exists
        topologyKey: node-pool
Getting involved
These features are managed by Kubernetes SIG Scheduling.
Please join us and share your feedback. We look forward to hearing from you!
How can I learn more?
The official documentation for PodAffinity
KEP-3633: Introduce MatchLabelKeys and MismatchLabelKeys to PodAffinity and PodAntiAffinity
via Kubernetes Blog https://kubernetes.io/
August 15, 2024 at 08:00PM
Kubernetes 1.31: Prevent PersistentVolume Leaks When Deleting out of Order
PersistentVolumes (or PVs for short) have an associated reclaim policy. The reclaim policy determines the actions that the storage backend needs to take on deletion of the PVC bound to a PV. When the reclaim policy is Delete, the expectation is that the storage backend releases the storage resource allocated for the PV. In essence, the reclaim policy needs to be honored on PV deletion.
With the recent Kubernetes v1.31 release, a beta feature lets you configure your cluster to behave that way and honor the configured reclaim policy.
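For reference, the reclaim policy is a field on the PV object itself. Here is a hedged sketch of a dynamically provisioned CSI PV with a Delete policy (the driver and volume handle are illustrative):

apiVersion: v1
kind: PersistentVolume
metadata:
  name: pvc-6791fdd4-5fad-438e-a7fb-16410363e3da
spec:
  capacity:
    storage: 5Gi
  accessModes:
  - ReadWriteOnce
  persistentVolumeReclaimPolicy: Delete    # backend storage should be released when the PV is deleted
  storageClassName: example-vanilla-block-sc
  csi:
    driver: csi.vsphere.vmware.com         # illustrative driver
    volumeHandle: example-volume-handle    # placeholder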
How did reclaim work in previous Kubernetes releases?
PersistentVolumeClaim (or PVC for short) is a user's request for storage. A PV and a PVC are considered bound when a newly created or matching PV is found for the PVC. The PVs themselves are backed by volumes allocated by the storage backend.
Normally, if the volume is to be deleted, then the expectation is to delete the PVC for a bound PV-PVC pair. However, there are no restrictions on deleting a PV before deleting a PVC.
First, I'll demonstrate the behavior for clusters running an older version of Kubernetes.
Retrieve a PVC that is bound to a PV
Retrieve an existing PVC example-vanilla-block-pvc
kubectl get pvc example-vanilla-block-pvc
The following output shows the PVC and its bound PV; the PV is shown under the VOLUME column:
NAME                        STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS               AGE
example-vanilla-block-pvc   Bound    pvc-6791fdd4-5fad-438e-a7fb-16410363e3da   5Gi        RWO            example-vanilla-block-sc   19s
Delete PV
When I try to delete a bound PV, the kubectl session blocks and the kubectl tool does not return control to the shell; for example:
kubectl delete pv pvc-6791fdd4-5fad-438e-a7fb-16410363e3da
persistentvolume "pvc-6791fdd4-5fad-438e-a7fb-16410363e3da" deleted ^C
Retrieving the PV
kubectl get pv pvc-6791fdd4-5fad-438e-a7fb-16410363e3da
You can observe that the PV is in a Terminating state:
NAME                                       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS        CLAIM                               STORAGECLASS               REASON   AGE
pvc-6791fdd4-5fad-438e-a7fb-16410363e3da   5Gi        RWO            Delete           Terminating   default/example-vanilla-block-pvc   example-vanilla-block-sc            2m23s
Delete PVC
kubectl delete pvc example-vanilla-block-pvc
The following output is seen if the PVC gets successfully deleted:
persistentvolumeclaim "example-vanilla-block-pvc" deleted
The PV object from the cluster also gets deleted. When you attempt to retrieve the PV, you will observe that it is no longer found:
kubectl get pv pvc-6791fdd4-5fad-438e-a7fb-16410363e3da
Error from server (NotFound): persistentvolumes "pvc-6791fdd4-5fad-438e-a7fb-16410363e3da" not found
Although the PV is deleted, the underlying storage resource is not deleted and needs to be removed manually.
To sum up, the reclaim policy associated with the PersistentVolume is currently ignored under certain circumstances. For a Bound PV-PVC pair, the ordering of PV-PVC deletion determines whether the PV reclaim policy is honored. The reclaim policy is honored if the PVC is deleted first; however, if the PV is deleted prior to deleting the PVC, then the reclaim policy is not exercised. As a result of this behavior, the associated storage asset in the external infrastructure is not removed.
PV reclaim policy with Kubernetes v1.31
The new behavior ensures that the underlying storage object is deleted from the backend when users attempt to delete a PV manually.
How to enable new behavior?
To take advantage of the new behavior, you must have upgraded your cluster to the v1.31 release of Kubernetes and run the CSI external-provisioner version 5.0.1 or later.
How does it work?
For CSI volumes, the new behavior is achieved by adding a finalizer external-provisioner.volume.kubernetes.io/finalizer on new and existing PVs. The finalizer is only removed after the storage from the backend is deleted.
Here is an example of a PV with the finalizer; note the new finalizer in the finalizers list:
kubectl get pv pvc-a7b7e3ba-f837-45ba-b243-dec7d8aaed53 -o yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  annotations:
    pv.kubernetes.io/provisioned-by: csi.vsphere.vmware.com
  creationTimestamp: "2021-11-17T19:28:56Z"
  finalizers:
  - kubernetes.io/pv-protection
  - external-provisioner.volume.kubernetes.io/finalizer
  name: pvc-a7b7e3ba-f837-45ba-b243-dec7d8aaed53
...
The finalizer prevents this PersistentVolume from being removed from the cluster. As stated previously, the finalizer is only removed from the PV object after it is successfully deleted from the storage backend. To learn more about finalizers, please refer to Using Finalizers to Control Deletion.
Similarly, the finalizer kubernetes.io/pv-controller is added to dynamically provisioned in-tree plugin volumes.
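Similarly, a hedged sketch of a dynamically provisioned in-tree PV carrying that finalizer (fields abbreviated, name hypothetical):

apiVersion: v1
kind: PersistentVolume
metadata:
  finalizers:
  - kubernetes.io/pv-protection
  - kubernetes.io/pv-controller    # added for dynamically provisioned in-tree volumes
  name: pvc-example-in-tree        # hypothetical name
...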
What about CSI migrated volumes?
The fix applies to CSI migrated volumes as well.
Some caveats
The fix does not apply to statically provisioned in-tree plugin volumes.
References
KEP-2644
Volume leak issue
How do I get involved?
The Kubernetes Slack channel #sig-storage and any of the standard SIG Storage communication channels are great mediums to reach out to the SIG Storage and migration working group teams.
Special thanks to the following people for the insightful reviews, thorough consideration and valuable contribution:
Fan Baofa (carlory)
Jan Šafránek (jsafrane)
Xing Yang (xing-yang)
Matthew Wong (wongma7)
Join the Kubernetes Storage Special Interest Group (SIG) if you're interested in getting involved with the design and development of CSI or any part of the Kubernetes Storage system. We’re rapidly growing and always welcome new contributors.
via Kubernetes Blog https://kubernetes.io/
August 15, 2024 at 08:00PM
Kubernetes 1.31: Read Only Volumes Based On OCI Artifacts (alpha)
https://kubernetes.io/blog/2024/08/16/kubernetes-1-31-image-volume-source/
The Kubernetes community is moving towards fulfilling more Artificial Intelligence (AI) and Machine Learning (ML) use cases in the future. While the project was designed to serve microservice architectures in the past, it’s now time to listen to the end users and introduce features which have a stronger focus on AI/ML.
One of these requirements is to support Open Container Initiative (OCI) compatible images and artifacts (referred to as OCI objects) directly as a native volume source. This allows users to focus on OCI standards and enables them to store and distribute any content using OCI registries. A feature like this gives the Kubernetes project a chance to grow into use cases which go beyond running particular images.
Given that, the Kubernetes community is proud to present a new alpha feature introduced in v1.31: the Image Volume Source (KEP-4639). This feature allows users to specify an image reference as a volume in a pod while reusing it as a volume mount within containers:
…
kind: Pod
spec:
  containers:
  - …
    volumeMounts:
    - name: volume
      mountPath: /path/to/directory
  volumes:
  - name: volume
    image:
      reference: my-image:tag
The above example would result in mounting my-image:tag to /path/to/directory in the pod’s container.
Use cases
The goal of this enhancement is to stick as close as possible to the existing container image implementation within the kubelet, while introducing a new API surface to allow more extended use cases.
For example, users could share a configuration file among multiple containers in a pod without including the file in the main image, so that they can minimize security risks and the overall image size. They can also package and distribute binary artifacts using OCI images and mount them directly into Kubernetes pods, so that they can, for example, streamline their CI/CD pipelines.
Data scientists, MLOps engineers, or AI developers, can mount large language model weights or machine learning model weights in a pod alongside a model-server, so that they can efficiently serve them without including them in the model-server container image. They can package these in an OCI object to take advantage of OCI distribution and ensure efficient model deployment. This allows them to separate the model specifications/content from the executables that process them.
Another use case is that security engineers can use a public image for a malware scanner and mount in a volume of private (commercial) malware signatures, so that they can load those signatures without baking their own combined image (which might not be allowed by the copyright on the public image). Those files work regardless of the OS or version of the scanner software.
But in the long term it will be up to you as an end user of this project to outline further important use cases for the new feature. SIG Node is happy to receive any feedback or suggestions for further enhancements to allow more advanced usage scenarios. Feel free to provide feedback by either using the Kubernetes Slack (#sig-node) channel or the SIG Node mailing list.
Detailed example
The Kubernetes alpha feature gate ImageVolume needs to be enabled on the API Server as well as the kubelet to make it functional. If that’s the case and the container runtime has support for the feature (like CRI-O ≥ v1.31), then an example pod.yaml like this can be created:
apiVersion: v1
kind: Pod
metadata:
  name: pod
spec:
  containers:
  - name: test
    image: registry.k8s.io/e2e-test-images/echoserver:2.3   # illustrative container image
    volumeMounts:
    - name: volume
      mountPath: /volume
  volumes:
  - name: volume
    image:
      reference: quay.io/crio/artifact:v1
      pullPolicy: IfNotPresent
The pod declares a new volume using the image.reference of quay.io/crio/artifact:v1, which refers to an OCI object containing two files. The pullPolicy behaves in the same way as for container images and allows the following values:
Always: the kubelet always attempts to pull the reference and the container creation will fail if the pull fails.
Never: the kubelet never pulls the reference and only uses a local image or artifact. The container creation will fail if the reference isn’t present.
IfNotPresent: the kubelet pulls if the reference isn’t already present on disk. The container creation will fail if the reference isn’t present and the pull fails.
The volumeMounts field indicates that the container named test should mount the volume under the path /volume.
If you now create the pod:
kubectl apply -f pod.yaml
And exec into it:
kubectl exec -it pod -- sh
Then you’re able to investigate what has been mounted:
/ # ls /volume
dir   file
/ # cat /volume/file
2
/ # ls /volume/dir
file
/ # cat /volume/dir/file
1
You managed to consume an OCI artifact using Kubernetes!
The container runtime pulls the image (or artifact), mounts it into the container, and finally makes it available for direct use. There are a bunch of details in the implementation, which closely align with the existing image pull behavior of the kubelet. For example:
If a reference with the :latest tag is provided, then the pullPolicy will default to Always, while in any other case it will default to IfNotPresent if unset.
The volume gets re-resolved if the pod gets deleted and recreated, which means that new remote content will become available on pod recreation. A failure to resolve or pull the image during pod startup will block containers from starting and may add significant latency. Failures will be retried using normal volume backoff and will be reported on the pod reason and message.
Pull secrets will be assembled in the same way as for the container image, by looking up node credentials, service account image pull secrets, and pod spec image pull secrets (see the sketch after this list).
The OCI object gets mounted in a single directory by merging the manifest layers in the same way as for container images.
The volume is mounted as read-only (ro) with non-executable files (noexec).
Sub-path mounts for containers are not supported (spec.containers[*].volumeMounts.subPath).
The field spec.securityContext.fsGroupChangePolicy has no effect on this volume type.
The feature will also work with the AlwaysPullImages admission plugin if enabled.
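As a sketch of the pull-secrets point above, an image volume sits alongside regular imagePullSecrets in the Pod spec (the secret and registry names are hypothetical):

apiVersion: v1
kind: Pod
metadata:
  name: pod-with-private-artifact          # hypothetical name
spec:
  imagePullSecrets:
  - name: my-registry-credentials          # hypothetical secret, consulted for the volume pull too
  containers:
  - name: test
    image: registry.example/app:latest     # placeholder container image
    volumeMounts:
    - name: volume
      mountPath: /volume
  volumes:
  - name: volume
    image:
      reference: registry.example/private-artifact:v1   # hypothetical private artifact
      pullPolicy: IfNotPresent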
Thank you for reading to the end of this blog post! SIG Node is proud and happy to deliver this feature as part of Kubernetes v1.31.
As writer of this blog post, I would like to emphasize my special thanks to all involved individuals out there! You all rock, let’s keep on hacking!
Further reading
Use an Image Volume With a Pod
image volume overview
via Kubernetes Blog https://kubernetes.io/
August 15, 2024 at 08:00PM
Evolving our self-hosted offering and license model
What you need to know about the upcoming changes to CockroachDB Enterprise arriving this…
August 15, 2024 at 10:13AM
via Instapaper
How Trump rolled Elon
Musk tried to meet Trump as an equal. He came away as just another lap dog. No one has torched their image in service of Donald Trump like Elon Musk has.
August 15, 2024 at 09:55AM
Kubernetes 1.31: VolumeAttributesClass for Volume Modification Beta
https://kubernetes.io/blog/2024/08/15/kubernetes-1-31-volume-attributes-class/
Volumes in Kubernetes have been described by two attributes: their storage class, and their capacity. The storage class is an immutable property of the volume, while the capacity can be changed dynamically with volume resize.
This complicates vertical scaling of workloads with volumes. While cloud providers and storage vendors often offer volumes which allow specifying IO quality of service (performance) parameters like IOPS or throughput, and tuning them as workloads operate, Kubernetes has had no API which allows changing them.
We are pleased to announce that the VolumeAttributesClass KEP, alpha since Kubernetes 1.29, will be beta in 1.31. This provides a generic, Kubernetes-native API for modifying volume parameters like provisioned IO.
Like all new volume features in Kubernetes, this API is implemented via the container storage interface (CSI). In addition to the VolumeAttributesClass feature gate, your provisioner-specific CSI driver must support the new ModifyVolume API which is the CSI side of this feature.
See the full documentation for all details. Here we show the common workflow.
Dynamically modifying volume attributes
A VolumeAttributesClass is a cluster-scoped resource that specifies provisioner-specific attributes. These are created by the cluster administrator in the same way as storage classes. For example, a series of gold, silver, and bronze volume attributes classes can be created for volumes with greater or lesser amounts of provisioned IO.
apiVersion: storage.k8s.io/v1alpha1
kind: VolumeAttributesClass
metadata:
  name: silver
driverName: your-csi-driver
parameters:
  provisioned-iops: "500"
  provisioned-throughput: "50MiB/s"
---
apiVersion: storage.k8s.io/v1alpha1
kind: VolumeAttributesClass
metadata:
  name: gold
driverName: your-csi-driver
parameters:
  provisioned-iops: "10000"
  provisioned-throughput: "500MiB/s"
An attribute class is added to a PVC in much the same way as a storage class.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: test-pv-claim
spec:
  storageClassName: any-storage-class
  volumeAttributesClassName: silver
  accessModes:
  - ReadWriteOnce    # illustrative; the list was truncated in the source
...
Unlike a storage class, the volume attributes class can be changed:
kubectl patch pvc test-pv-claim -p '{"spec": {"volumeAttributesClassName": "gold"}}'
Kubernetes will work with the CSI driver to update the attributes of the volume. The status of the PVC will track the current and desired attributes class. The PV resource will also be updated with the new volume attributes class, which will be set to the currently active attributes of the PV.
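For instance, while a modification is in flight, the PVC status might look roughly like this (a sketch based on the KEP's status fields; values are illustrative):

status:
  currentVolumeAttributesClassName: silver    # attributes currently applied to the volume
  modifyVolumeStatus:
    targetVolumeAttributesClassName: gold     # desired attributes class
    status: InProgress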
Limitations with the beta
As a beta feature, there are still some capabilities which are planned for GA but not yet present. The largest is quota support; see the KEP and the discussion in sig-storage for details.
See the Kubernetes CSI driver list for up-to-date information of support for this feature in CSI drivers.
via Kubernetes Blog https://kubernetes.io/
August 14, 2024 at 08:00PM
Kubernetes v1.31: Accelerating Cluster Performance with Consistent Reads from Cache
https://kubernetes.io/blog/2024/08/15/consistent-read-from-cache-beta/
Kubernetes is renowned for its robust orchestration of containerized applications, but as clusters grow, the demands on the control plane can become a bottleneck. A key challenge has been ensuring strongly consistent reads from the etcd datastore, requiring resource-intensive quorum reads.
Today, the Kubernetes community is excited to announce a major improvement: consistent reads from cache, graduating to Beta in Kubernetes v1.31.
Why consistent reads matter
Consistent reads are essential for ensuring that Kubernetes components have an accurate view of the latest cluster state. Guaranteeing consistent reads is crucial for maintaining the accuracy and reliability of Kubernetes operations, enabling components to make informed decisions based on up-to-date information. In large-scale clusters, fetching and processing this data can be a performance bottleneck, especially for requests that involve filtering results. While Kubernetes can filter data by namespace directly within etcd, any other filtering by labels or field selectors requires the entire dataset to be fetched from etcd and then filtered in-memory by the Kubernetes API server. This is particularly impactful for components like the kubelet, which only needs to list pods scheduled to its node - but previously required the API Server and etcd to process all pods in the cluster.
The breakthrough: Caching with confidence
Kubernetes has long used a watch cache to optimize read operations. The watch cache stores a snapshot of the cluster state and receives updates through etcd watches. However, until now, it couldn't serve consistent reads directly, as there was no guarantee the cache was sufficiently up-to-date.
The consistent reads from cache feature addresses this by leveraging etcd's progress notifications mechanism. These notifications inform the watch cache about how current its data is compared to etcd. When a consistent read is requested, the system first checks if the watch cache is up-to-date. If the cache is not up-to-date, the system queries etcd for progress notifications until it's confirmed that the cache is sufficiently fresh. Once ready, the read is efficiently served directly from the cache, which can significantly improve performance, particularly in cases where it would require fetching a lot of data from etcd. This enables requests that filter data to be served from the cache, with only minimal metadata needing to be read from etcd.
Important Note: To benefit from this feature, your Kubernetes cluster must be running etcd version 3.4.31+ or 3.5.13+. For older etcd versions, Kubernetes will automatically fall back to serving consistent reads directly from etcd.
Performance gains you'll notice
This seemingly simple change has a profound impact on Kubernetes performance and scalability:
Reduced etcd Load: Kubernetes v1.31 can offload work from etcd, freeing up resources for other critical operations.
Lower Latency: Serving reads from cache is significantly faster than fetching and processing data from etcd. This translates to quicker responses for components, improving overall cluster responsiveness.
Improved Scalability: Large clusters with thousands of nodes and pods will see the most significant gains, as the reduction in etcd load allows the control plane to handle more requests without sacrificing performance.
5k Node Scalability Test Results: In recent scalability tests on 5,000 node clusters, enabling consistent reads from cache delivered impressive improvements:
30% reduction in kube-apiserver CPU usage
25% reduction in etcd CPU usage
Up to 3x reduction (from 5 seconds to 1.5 seconds) in 99th percentile pod LIST request latency
What's next?
With the graduation to beta, consistent reads from cache are enabled by default, offering a seamless performance boost to all Kubernetes users running a supported etcd version.
Our journey doesn't end here. The Kubernetes community is actively exploring pagination support in the watch cache, which will unlock even more performance optimizations in the future.
Getting started
Upgrading to Kubernetes v1.31 and ensuring you are using etcd version 3.4.31+ or 3.5.13+ is the easiest way to experience the benefits of consistent reads from cache. If you have any questions or feedback, don't hesitate to reach out to the Kubernetes community.
Let us know how consistent reads from cache transforms your Kubernetes experience!
Special thanks to @ah8ad3 and @p0lyn0mial for their contributions to this feature!
via Kubernetes Blog https://kubernetes.io/
August 14, 2024 at 08:00PM