Suggested Reads

(21) Post | Feed | LinkedIn

August 19, 2024 at 11:24AM

via Instapaper

·linkedin.com·
The Dark Side of Open Source: Are We All Just Selfish?

Open-source software is often seen as a free-for-all, but the reality is more complex. Many companies invest heavily in open source projects as a go-to-market strategy, paying full-time maintainers to ensure project success. This video explores the motivations behind open source, the role of big companies like Google and AWS, and the impact of license changes by companies like MongoDB and HashiCorp. Discover why no open-source project should be owned by a single company and the benefits of foundation-owned projects like Kubernetes and Linux. Learn how you can contribute to and support the open source ecosystem.

#OpenSource #TechIndustry #SoftwareDevelopment #CorporateSponsorship

▬▬▬▬▬▬ 💰 Sponsorships 💰 ▬▬▬▬▬▬ If you are interested in sponsoring this channel, please visit https://devopstoolkit.live/sponsor for more information. Alternatively, feel free to contact me over Twitter or LinkedIn (see below).

▬▬▬▬▬▬ 👋 Contact me 👋 ▬▬▬▬▬▬ ➡ Twitter: https://twitter.com/vfarcic ➡ LinkedIn: https://www.linkedin.com/in/viktorfarcic/

▬▬▬▬▬▬ 🚀 Other Channels 🚀 ▬▬▬▬▬▬ 🎤 Podcast: https://www.devopsparadox.com/ 💬 Live streams: https://www.youtube.com/c/DevOpsParadox

via YouTube https://www.youtube.com/watch?v=4l_kK90khNA

·youtube.com·
Kubernetes 1.31: Pod Failure Policy for Jobs Goes GA

https://kubernetes.io/blog/2024/08/19/kubernetes-1-31-pod-failure-policy-for-jobs-goes-ga/

This post describes Pod failure policy, which graduates to stable in Kubernetes 1.31, and how to use it in your Jobs.

About Pod failure policy

When you run workloads on Kubernetes, Pods might fail for a variety of reasons. Ideally, workloads like Jobs should be able to ignore transient, retriable failures and continue running to completion.

To allow for these transient failures, Kubernetes Jobs include the backoffLimit field, which lets you specify a number of Pod failures that you're willing to tolerate during Job execution. However, if you set a large value for the backoffLimit field and rely solely on this field, you might notice unnecessary increases in operating costs as Pods restart excessively until the backoffLimit is met.

This becomes particularly problematic when running large-scale Jobs with thousands of long-running Pods across thousands of nodes.

The Pod failure policy extends the backoff limit mechanism to help you reduce costs in the following ways:

Gives you control to fail the Job as soon as a non-retriable Pod failure occurs.

Allows you to ignore retriable errors without increasing the backoffLimit field.

For example, you can use a Pod failure policy to run your workload on more affordable spot machines by ignoring Pod failures caused by graceful node shutdown.

The policy allows you to distinguish between retriable and non-retriable Pod failures based on container exit codes or Pod conditions in a failed Pod.

How it works

You specify a Pod failure policy in the Job specification, represented as a list of rules.

For each rule you define match requirements based on one of the following properties:

Container exit codes: the onExitCodes property.

Pod conditions: the onPodConditions property.

Additionally, for each rule, you specify one of the following actions to take when a Pod matches the rule:

Ignore: Do not count the failure towards the backoffLimit or backoffLimitPerIndex.

FailJob: Fail the entire Job and terminate all running Pods.

FailIndex: Fail the index corresponding to the failed Pod. This action works with the Backoff limit per index feature.

Count: Count the failure towards the backoffLimit or backoffLimitPerIndex. This is the default behavior.

When Pod failures occur in a running Job, Kubernetes matches the failed Pod status against the list of Pod failure policy rules, in the specified order, and takes the corresponding actions for the first matched rule.

Note that when specifying the Pod failure policy, you must also set the Job's Pod template with restartPolicy: Never. This prevents race conditions between the kubelet and the Job controller when counting Pod failures.

Kubernetes-initiated Pod disruptions

To allow matching Pod failure policy rules against failures caused by disruptions initiated by Kubernetes, this feature introduces the DisruptionTarget Pod condition.

Kubernetes adds this condition to any Pod, regardless of whether it's managed by a Job controller, that fails because of a retriable disruption scenario. The DisruptionTarget condition contains one of the following reasons that corresponds to these disruption scenarios:

PreemptionByKubeScheduler: Preemption by kube-scheduler to accommodate a new Pod that has a higher priority.

DeletionByTaintManager: The Pod is due to be deleted by kube-controller-manager due to a NoExecute taint that the Pod doesn't tolerate.

EvictionByEvictionAPI: The Pod is due to be deleted by an API-initiated eviction.

DeletionByPodGC: The Pod is bound to a node that no longer exists, and is due to be deleted by Pod garbage collection.

TerminationByKubelet: The Pod was terminated by graceful node shutdown, node pressure eviction, or preemption for system critical pods.

In all other disruption scenarios, like eviction due to exceeding Pod container limits, Pods don't receive the DisruptionTarget condition because the disruptions were likely caused by the Pod and would reoccur on retry.

Example

The Pod failure policy snippet below demonstrates an example use:

podFailurePolicy:
  rules:
  - action: Ignore
    onPodConditions:
    - type: DisruptionTarget
  - action: FailJob
    onPodConditions:
    - type: ConfigIssue
  - action: FailJob
    onExitCodes:
      operator: In
      values: [ 42 ]

In this example, the Pod failure policy does the following:

Ignores any failed Pods that have the built-in DisruptionTarget condition. These Pods don't count towards Job backoff limits.

Fails the Job if any failed Pods have the custom user-supplied ConfigIssue condition, which was added either by a custom controller or webhook.

Fails the Job if any containers exited with the exit code 42.

Counts all other Pod failures towards the default backoffLimit (or backoffLimitPerIndex if used).
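
The first-match evaluation described above can be sketched in Python. This is a simplified model for illustration only, not the Job controller's code: the Rule and FailedPod classes, the decide function, and EXAMPLE_POLICY are hypothetical names, and details such as restricting a rule to a named container are left out.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Rule:
    action: str  # "Ignore", "FailJob", "FailIndex" or "Count"
    on_pod_conditions: list = field(default_factory=list)  # condition types to match
    exit_code_operator: Optional[str] = None  # "In" or "NotIn"
    exit_codes: list = field(default_factory=list)


@dataclass
class FailedPod:
    conditions: list  # condition types present on the failed Pod
    exit_codes: list  # exit codes of its containers


def rule_matches(rule: Rule, pod: FailedPod) -> bool:
    """A rule matches on Pod conditions or on container exit codes."""
    if rule.on_pod_conditions:
        return any(c in pod.conditions for c in rule.on_pod_conditions)
    failed_codes = [code for code in pod.exit_codes if code != 0]
    if rule.exit_code_operator == "In":
        return any(code in rule.exit_codes for code in failed_codes)
    if rule.exit_code_operator == "NotIn":
        return any(code not in rule.exit_codes for code in failed_codes)
    return False


def decide(rules: list, pod: FailedPod) -> str:
    """Return the action of the first matching rule, or "Count" if none match."""
    for rule in rules:
        if rule_matches(rule, pod):
            return rule.action
    return "Count"


# The three rules from the example policy, in the same order.
EXAMPLE_POLICY = [
    Rule("Ignore", on_pod_conditions=["DisruptionTarget"]),
    Rule("FailJob", on_pod_conditions=["ConfigIssue"]),
    Rule("FailJob", exit_code_operator="In", exit_codes=[42]),
]

if __name__ == "__main__":
    print(decide(EXAMPLE_POLICY, FailedPod(["DisruptionTarget"], [42])))
    print(decide(EXAMPLE_POLICY, FailedPod([], [1])))
```

Because rules are checked in order, a Pod that carries DisruptionTarget and also exited with code 42 is ignored rather than failing the Job in this model.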

Learn more

For a hands-on guide to using Pod failure policy, see Handling retriable and non-retriable pod failures with Pod failure policy

Read the documentation for Pod failure policy and Backoff limit per index

Read the documentation for Pod disruption conditions

Read the KEP for Pod failure policy

Related work

Based on the concepts introduced by Pod failure policy, the following additional work is in progress:

JobSet integration: Configurable Failure Policy API

Pod failure policy extension to add more granular failure reasons

Support for Pod failure policy via JobSet in Kubeflow Training v2

Proposal: Disrupted Pods should be removed from endpoints

Get involved

This work was sponsored by the Batch Working Group in close collaboration with the SIG Apps, SIG Node, and SIG Scheduling communities.

If you are interested in working on new features in the space, we recommend subscribing to our Slack channel and attending the regular community meetings.

Acknowledgments

I would love to thank everyone who was involved in this project over the years - it's been a journey and a joint community effort! The list below is my best-effort attempt to remember and recognize people who made an impact. Thank you!

Aldo Culquicondor for guidance and reviews throughout the process

Jordan Liggitt for KEP and API reviews

David Eads for API reviews

Maciej Szulik for KEP reviews from SIG Apps PoV

Clayton Coleman for guidance and SIG Node reviews

Sergey Kanzhelev for KEP reviews from SIG Node PoV

Dawn Chen for KEP reviews from SIG Node PoV

Daniel Smith for reviews from SIG API machinery PoV

Antoine Pelisse for reviews from SIG API machinery PoV

John Belamaric for PRR reviews

Filip Křepinský for thorough reviews from SIG Apps PoV and bug-fixing

David Porter for thorough reviews from SIG Node PoV

Jensen Lo for early requirements discussions, testing and reporting issues

Daniel Vega-Myhre for advancing JobSet integration and reporting issues

Abdullah Gharaibeh for early design discussions and guidance

Antonio Ojea for test reviews

Yuki Iwai for reviews and aligning implementation of the closely related Job features

Kevin Hannon for reviews and aligning implementation of the closely related Job features

Tim Bannister for docs reviews

Shannon Kularathna for docs reviews

Paola Cortés for docs reviews

via Kubernetes Blog https://kubernetes.io/

August 18, 2024 at 08:00PM

·kubernetes.io·
.@juliemshort massively reorganized Max’s playroom a couple days ago into more of a LEGO builder space since that’s really all Max does in here.

Max asked me to come see his battlefield that he’d set up. The organizer is new to help keep minifigs sorted as the previous solution was overflowing. I went to grab a droid and had no idea where they were. Max tells me and I immediately forget.

So this is what I’m doing right now. Labeling drawers after Julie came through and made sure everything was in the right place (it wasn’t; hence the labeling). Happy Sunday! #LEGO #organization #legominifigs

August 18, 2024 at 12:43PM

via Instagram https://instagr.am/p/C-0YI-gvPmH/

·instagr.am·
Kubernetes 1.31: MatchLabelKeys in PodAffinity graduates to beta

https://kubernetes.io/blog/2024/08/16/matchlabelkeys-podaffinity/

Kubernetes 1.29 introduced new fields MatchLabelKeys and MismatchLabelKeys in PodAffinity and PodAntiAffinity.

In Kubernetes 1.31, this feature moves to beta and the corresponding feature gate (MatchLabelKeysInPodAffinity) gets enabled by default.

MatchLabelKeys - Enhanced scheduling for versatile rolling updates

During a workload's (e.g., Deployment) rolling update, a cluster may have Pods from multiple versions at the same time. However, the scheduler cannot distinguish between old and new versions based on the LabelSelector specified in PodAffinity or PodAntiAffinity. As a result, it will co-locate or disperse Pods regardless of their versions.

This can lead to suboptimal scheduling outcomes, for example:

New version Pods are co-located with old version Pods (PodAffinity), which will eventually be removed after rolling updates.

Old version Pods are distributed across all available topologies, preventing new version Pods from finding nodes due to PodAntiAffinity.

MatchLabelKeys is a set of Pod label keys that addresses this problem. The scheduler looks up the values of these keys from the new Pod's labels and combines them with LabelSelector so that PodAffinity matches Pods that have the same key-value in labels.

By using label pod-template-hash in MatchLabelKeys, you can ensure that only Pods of the same version are evaluated for PodAffinity or PodAntiAffinity.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: application-server
...
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: app
            operator: In
            values:
            - database
        topologyKey: topology.kubernetes.io/zone
        matchLabelKeys:
        - pod-template-hash

The above matchLabelKeys will be translated in Pods like:

kind: Pod
metadata:
  name: application-server
  labels:
    pod-template-hash: xyz
...
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: app
            operator: In
            values:
            - database
          - key: pod-template-hash # Added from matchLabelKeys; Only Pods from the same replicaset will match this affinity.
            operator: In
            values:
            - xyz
        topologyKey: topology.kubernetes.io/zone
        matchLabelKeys:
        - pod-template-hash
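
The translation step can be sketched in Python as a pure function over plain dictionaries. This is an illustration of the merge described above, not the API server's or scheduler's implementation; merge_match_label_keys is a hypothetical name, and skipping keys that are missing from the Pod's labels is an assumption of this sketch.

```python
import copy


def merge_match_label_keys(term: dict, pod_labels: dict) -> dict:
    """Return a copy of an affinity term with matchLabelKeys merged into labelSelector.

    For every key listed in matchLabelKeys, the value is read from the new
    Pod's labels and appended to matchExpressions as `key In (value)`.
    """
    merged = copy.deepcopy(term)  # leave the caller's term untouched
    selector = merged.setdefault("labelSelector", {})
    expressions = selector.setdefault("matchExpressions", [])
    for key in term.get("matchLabelKeys", []):
        if key in pod_labels:  # assumption: keys absent from the Pod are skipped
            expressions.append(
                {"key": key, "operator": "In", "values": [pod_labels[key]]}
            )
    return merged


if __name__ == "__main__":
    term = {
        "labelSelector": {
            "matchExpressions": [
                {"key": "app", "operator": "In", "values": ["database"]}
            ]
        },
        "topologyKey": "topology.kubernetes.io/zone",
        "matchLabelKeys": ["pod-template-hash"],
    }
    merged = merge_match_label_keys(term, {"pod-template-hash": "xyz"})
    for expression in merged["labelSelector"]["matchExpressions"]:
        print(expression)
```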

MismatchLabelKeys - Service isolation

MismatchLabelKeys is a set of Pod label keys, like MatchLabelKeys. The scheduler looks up the values of these keys from the new Pod's labels and merges them with LabelSelector as key notin (value), so that PodAffinity does not match Pods that have the same key-value in labels.

Suppose all Pods for each tenant get a tenant label via a controller or a manifest management tool like Helm.

Although the value of the tenant label is unknown when composing each workload's manifest, the cluster admin wants to achieve exclusive 1:1 tenant-to-domain placement for tenant isolation.

MismatchLabelKeys works for this use case. By applying the following affinity globally using a mutating webhook, the cluster admin can ensure that the Pods from the same tenant will land on the same domain exclusively, meaning Pods from other tenants won't land on the same domain.

affinity:
  podAffinity:      # ensures the pods of this tenant land on the same node pool
    requiredDuringSchedulingIgnoredDuringExecution:
    - matchLabelKeys:
      - tenant
      topologyKey: node-pool
  podAntiAffinity:  # ensures only Pods from this tenant lands on the same node pool
    requiredDuringSchedulingIgnoredDuringExecution:
    - mismatchLabelKeys:
      - tenant
      labelSelector:
        matchExpressions:
        - key: tenant
          operator: Exists
      topologyKey: node-pool

The above matchLabelKeys and mismatchLabelKeys will be translated as follows:

kind: Pod
metadata:
  name: application-server
  labels:
    tenant: service-a
spec:
  affinity:
    podAffinity:      # ensures the pods of this tenant land on the same node pool
      requiredDuringSchedulingIgnoredDuringExecution:
      - matchLabelKeys:
        - tenant
        topologyKey: node-pool
        labelSelector:
          matchExpressions:
          - key: tenant
            operator: In
            values:
            - service-a
    podAntiAffinity:  # ensures only Pods from this tenant lands on the same node pool
      requiredDuringSchedulingIgnoredDuringExecution:
      - mismatchLabelKeys:
        - tenant
        labelSelector:
          matchExpressions:
          - key: tenant
            operator: Exists
          - key: tenant
            operator: NotIn
            values:
            - service-a
        topologyKey: node-pool
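
To see why the translated anti-affinity selector repels only Pods of other tenants, it can help to evaluate its matchExpressions against a few label sets. The Python below is a minimal sketch of standard label-selector semantics; expression_matches and selector_matches are hypothetical helper names, not Kubernetes functions.

```python
def expression_matches(expression: dict, labels: dict) -> bool:
    """Evaluate one matchExpressions entry against a Pod's labels."""
    key = expression["key"]
    operator = expression["operator"]
    if operator == "Exists":
        return key in labels
    if operator == "DoesNotExist":
        return key not in labels
    if operator == "In":
        return key in labels and labels[key] in expression["values"]
    if operator == "NotIn":
        # NotIn also matches Pods that do not carry the key at all,
        # which is why the example pairs it with Exists.
        return key not in labels or labels[key] not in expression["values"]
    raise ValueError(f"unsupported operator: {operator}")


def selector_matches(match_expressions: list, labels: dict) -> bool:
    """All expressions must hold for the selector to match."""
    return all(expression_matches(e, labels) for e in match_expressions)


# The podAntiAffinity selector after translation for a Pod with tenant=service-a.
ANTI_AFFINITY_SELECTOR = [
    {"key": "tenant", "operator": "Exists"},
    {"key": "tenant", "operator": "NotIn", "values": ["service-a"]},
]

if __name__ == "__main__":
    for labels in ({"tenant": "service-a"}, {"tenant": "service-b"}, {}):
        print(labels, selector_matches(ANTI_AFFINITY_SELECTOR, labels))
```

In this model, only Pods labelled with a different tenant match the anti-affinity selector; Pods of the same tenant and Pods without a tenant label do not.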

Getting involved

These features are managed by Kubernetes SIG Scheduling.

Please join us and share your feedback. We look forward to hearing from you!

How can I learn more?

The official document of PodAffinity

KEP-3633: Introduce MatchLabelKeys and MismatchLabelKeys to PodAffinity and PodAntiAffinity

via Kubernetes Blog https://kubernetes.io/

August 15, 2024 at 08:00PM

·kubernetes.io·
Kubernetes 1.31: Prevent PersistentVolume Leaks When Deleting out of Order

https://kubernetes.io/blog/2024/08/16/kubernetes-1-31-prevent-persistentvolume-leaks-when-deleting-out-of-order/

PersistentVolumes (or PVs for short) are associated with a reclaim policy. The reclaim policy determines the actions that the storage backend needs to take when the PVC bound to a PV is deleted. When the reclaim policy is Delete, the expectation is that the storage backend releases the storage resource allocated for the PV. In essence, the reclaim policy needs to be honored on PV deletion.

With the recent Kubernetes v1.31 release, a beta feature lets you configure your cluster to behave that way and honor the configured reclaim policy.

How did reclaim work in previous Kubernetes releases?

PersistentVolumeClaim (or PVC for short) is a user's request for storage. A PV and PVC are considered Bound if a newly created PV or a matching PV is found. The PVs themselves are backed by volumes allocated by the storage backend.

Normally, if the volume is to be deleted, then the expectation is to delete the PVC for a bound PV-PVC pair. However, there are no restrictions on deleting a PV before deleting a PVC.

First, I'll demonstrate the behavior for clusters running an older version of Kubernetes.

Retrieve a PVC that is bound to a PV

Retrieve an existing PVC example-vanilla-block-pvc

kubectl get pvc example-vanilla-block-pvc

The following output shows the PVC and its bound PV; the PV is shown under the VOLUME column:

NAME                        STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS               AGE
example-vanilla-block-pvc   Bound    pvc-6791fdd4-5fad-438e-a7fb-16410363e3da   5Gi        RWO            example-vanilla-block-sc   19s

Delete PV

When I try to delete a bound PV, the kubectl session blocks and the kubectl tool does not return control to the shell; for example:

kubectl delete pv pvc-6791fdd4-5fad-438e-a7fb-16410363e3da

persistentvolume "pvc-6791fdd4-5fad-438e-a7fb-16410363e3da" deleted
^C

Retrieving the PV

kubectl get pv pvc-6791fdd4-5fad-438e-a7fb-16410363e3da

It can be observed that the PV is in a Terminating state

NAME                                       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS        CLAIM                               STORAGECLASS               REASON   AGE
pvc-6791fdd4-5fad-438e-a7fb-16410363e3da   5Gi        RWO            Delete           Terminating   default/example-vanilla-block-pvc   example-vanilla-block-sc            2m23s

Delete PVC

kubectl delete pvc example-vanilla-block-pvc

The following output is seen if the PVC gets successfully deleted:

persistentvolumeclaim "example-vanilla-block-pvc" deleted

The PV object from the cluster also gets deleted. When attempting to retrieve the PV it will be observed that the PV is no longer found:

kubectl get pv pvc-6791fdd4-5fad-438e-a7fb-16410363e3da

Error from server (NotFound): persistentvolumes "pvc-6791fdd4-5fad-438e-a7fb-16410363e3da" not found

Although the PV is deleted, the underlying storage resource is not deleted and needs to be removed manually.

To sum up, the reclaim policy associated with the PersistentVolume is currently ignored under certain circumstances. For a Bound PV-PVC pair, the ordering of PV-PVC deletion determines whether the PV reclaim policy is honored. The reclaim policy is honored if the PVC is deleted first; however, if the PV is deleted prior to deleting the PVC, then the reclaim policy is not exercised. As a result of this behavior, the associated storage asset in the external infrastructure is not removed.

PV reclaim policy with Kubernetes v1.31

The new behavior ensures that the underlying storage object is deleted from the backend when users attempt to delete a PV manually.

How to enable new behavior?

To take advantage of the new behavior, you must have upgraded your cluster to the v1.31 release of Kubernetes and run the CSI external-provisioner version 5.0.1 or later.

How does it work?

For CSI volumes, the new behavior is achieved by adding a finalizer external-provisioner.volume.kubernetes.io/finalizer on new and existing PVs. The finalizer is only removed after the storage from the backend is deleted.

Here is an example of a PV with the finalizer; notice the new finalizer in the finalizers list:

kubectl get pv pvc-a7b7e3ba-f837-45ba-b243-dec7d8aaed53 -o yaml

apiVersion: v1
kind: PersistentVolume
metadata:
  annotations:
    pv.kubernetes.io/provisioned-by: csi.vsphere.vmware.com
  creationTimestamp: "2021-11-17T19:28:56Z"
  finalizers:
  - kubernetes.io/pv-protection
  - external-provisioner.volume.kubernetes.io/finalizer
  name: pvc-a7b7e3ba-f837-45ba-b243-dec7d8aaed53
  resourceVersion: "194711"
  uid: 087f14f2-4157-4e95-8a70-8294b039d30e
spec:
  accessModes:
  - ReadWriteOnce
  capacity:
    storage: 1Gi
  claimRef:
    apiVersion: v1
    kind: PersistentVolumeClaim
    name: example-vanilla-block-pvc
    namespace: default
    resourceVersion: "194677"
    uid: a7b7e3ba-f837-45ba-b243-dec7d8aaed53
  csi:
    driver: csi.vsphere.vmware.com
    fsType: ext4
    volumeAttributes:
      storage.kubernetes.io/csiProvisionerIdentity: 1637110610497-8081-csi.vsphere.vmware.com
      type: vSphere CNS Block Volume
    volumeHandle: 2dacf297-803f-4ccc-afc7-3d3c3f02051e
  persistentVolumeReclaimPolicy: Delete
  storageClassName: example-vanilla-block-sc
  volumeMode: Filesystem
status:
  phase: Bound

The finalizer prevents this PersistentVolume from being removed from the cluster. As stated previously, the finalizer is only removed from the PV object after it is successfully deleted from the storage backend. To learn more about finalizers, please refer to Using Finalizers to Control Deletion.
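
The effect of the finalizer can be illustrated with a toy model in Python. This is a sketch under simplifying assumptions, not how the API server or the external-provisioner is implemented: ToyCluster and its methods are hypothetical, the PVC and the kubernetes.io/pv-protection finalizer are left out, and a single call stands in for the provisioner's reconcile loop.

```python
PROVISIONER_FINALIZER = "external-provisioner.volume.kubernetes.io/finalizer"


class ToyCluster:
    """Tracks PV objects in the API and volumes on the storage backend."""

    def __init__(self):
        self.pvs = {}         # PV name -> {"finalizers": [...], "deleting": bool}
        self.backend = set()  # names of volumes that exist on the storage backend

    def create_pv(self, name, finalizers):
        self.pvs[name] = {"finalizers": list(finalizers), "deleting": False}
        self.backend.add(name)

    def delete_pv(self, name):
        """Request deletion; the object only disappears once no finalizers remain."""
        self.pvs[name]["deleting"] = True
        self._collect(name)

    def provisioner_sync(self, name):
        """Delete the backend volume first, then release the finalizer."""
        pv = self.pvs.get(name)
        if pv and pv["deleting"] and PROVISIONER_FINALIZER in pv["finalizers"]:
            self.backend.discard(name)
            pv["finalizers"].remove(PROVISIONER_FINALIZER)
            self._collect(name)

    def _collect(self, name):
        pv = self.pvs[name]
        if pv["deleting"] and not pv["finalizers"]:
            del self.pvs[name]


if __name__ == "__main__":
    cluster = ToyCluster()
    cluster.create_pv("leaky", finalizers=[])
    cluster.delete_pv("leaky")
    print("leaky" in cluster.pvs, "leaky" in cluster.backend)

    cluster.create_pv("protected", finalizers=[PROVISIONER_FINALIZER])
    cluster.delete_pv("protected")
    print("protected" in cluster.pvs, "protected" in cluster.backend)
    cluster.provisioner_sync("protected")
    print("protected" in cluster.pvs, "protected" in cluster.backend)
```

Without the finalizer, the PV object vanishes while the backend volume stays behind, which is the leak. With the finalizer, the object remains until the backend volume has been removed.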

Similarly, the finalizer kubernetes.io/pv-controller is added to dynamically provisioned in-tree plugin volumes.

What about CSI migrated volumes?

The fix applies to CSI migrated volumes as well.

Some caveats

The fix does not apply to statically provisioned in-tree plugin volumes.

References

KEP-2644

Volume leak issue

How do I get involved?

The Kubernetes Slack channel and the SIG Storage communication channels are great ways to reach out to the SIG Storage and migration working group teams.

Special thanks to the following people for the insightful reviews, thorough consideration and valuable contribution:

Fan Baofa (carlory)

Jan Šafránek (jsafrane)

Xing Yang (xing-yang)

Matthew Wong (wongma7)

Join the Kubernetes Storage Special Interest Group (SIG) if you're interested in getting involved with the design and development of CSI or any part of the Kubernetes Storage system. We’re rapidly growing and always welcome new contributors.

via Kubernetes Blog https://kubernetes.io/

August 15, 2024 at 08:00PM

·kubernetes.io·
Kubernetes 1.31: Read Only Volumes Based On OCI Artifacts (alpha)

https://kubernetes.io/blog/2024/08/16/kubernetes-1-31-image-volume-source/

The Kubernetes community is moving towards fulfilling more Artificial Intelligence (AI) and Machine Learning (ML) use cases in the future. While the project has been designed to fulfill microservice architectures in the past, it’s now time to listen to the end users and introduce features which have a stronger focus on AI/ML.

One of these requirements is to support Open Container Initiative (OCI) compatible images and artifacts (referred to as OCI objects) directly as a native volume source. This allows users to focus on OCI standards and enables them to store and distribute any content using OCI registries. A feature like this gives the Kubernetes project a chance to grow into use cases which go beyond running particular images.

Given that, the Kubernetes community is proud to present a new alpha feature introduced in v1.31: The Image Volume Source (KEP-4639). This feature allows users to specify an image reference as volume in a pod while reusing it as volume mount within containers:

…
kind: Pod
spec:
  containers:
    - …
      volumeMounts:
        - name: my-volume
          mountPath: /path/to/directory
  volumes:
    - name: my-volume
      image:
        reference: my-image:tag

The above example would result in mounting my-image:tag to /path/to/directory in the pod’s container.

Use cases

The goal of this enhancement is to stick as close as possible to the existing container image implementation within the kubelet, while introducing a new API surface to allow more extended use cases.

For example, users could share a configuration file among multiple containers in a pod without including the file in the main image, so that they can minimize security risks and the overall image size. They can also package and distribute binary artifacts using OCI images and mount them directly into Kubernetes pods, so that they can streamline their CI/CD pipeline as an example.

Data scientists, MLOps engineers, or AI developers, can mount large language model weights or machine learning model weights in a pod alongside a model-server, so that they can efficiently serve them without including them in the model-server container image. They can package these in an OCI object to take advantage of OCI distribution and ensure efficient model deployment. This allows them to separate the model specifications/content from the executables that process them.

Another use case is that security engineers can use a public image for a malware scanner and mount in a volume of private (commercial) malware signatures, so that they can load those signatures without baking their own combined image (which might not be allowed by the copyright on the public image). Those files work regardless of the OS or version of the scanner software.

But in the long term it will be up to you as an end user of this project to outline further important use cases for the new feature. SIG Node is happy to receive any feedback or suggestions for further enhancements to allow more advanced usage scenarios. Feel free to provide feedback by using either the Kubernetes Slack (#sig-node) channel or the SIG Node mailing list.

Detailed example

The Kubernetes alpha feature gate ImageVolume needs to be enabled on the API Server as well as the kubelet to make it functional. If that’s the case and the container runtime has support for the feature (like CRI-O ≥ v1.31), then an example pod.yaml like this can be created:

apiVersion: v1
kind: Pod
metadata:
  name: pod
spec:
  containers:
    - name: test
      image: registry.k8s.io/e2e-test-images/echoserver:2.3
      volumeMounts:
        - name: volume
          mountPath: /volume
  volumes:
    - name: volume
      image:
        reference: quay.io/crio/artifact:v1
        pullPolicy: IfNotPresent

The pod declares a new volume using the image.reference of quay.io/crio/artifact:v1, which refers to an OCI object containing two files. The pullPolicy behaves in the same way as for container images and allows the following values:

Always: the kubelet always attempts to pull the reference and the container creation will fail if the pull fails.

Never: the kubelet never pulls the reference and only uses a local image or artifact. The container creation will fail if the reference isn’t present.

IfNotPresent: the kubelet pulls if the reference isn’t already present on disk. The container creation will fail if the reference isn’t present and the pull fails.
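
The three pullPolicy values can be summarized as a small decision function. The Python below is only a sketch of the behavior listed above; resolve_volume_image and its return strings are hypothetical and do not correspond to kubelet code.

```python
def resolve_volume_image(pull_policy: str, present_locally: bool, pull_succeeds: bool) -> str:
    """Return "use-pulled", "use-local" or "fail" for an image volume reference.

    pull_succeeds says what would happen if a pull were attempted;
    it is ignored when the policy never attempts one.
    """
    if pull_policy == "Always":
        # Always attempts a pull; a failed pull fails container creation.
        return "use-pulled" if pull_succeeds else "fail"
    if pull_policy == "Never":
        # Never pulls; only a local image or artifact can be used.
        return "use-local" if present_locally else "fail"
    if pull_policy == "IfNotPresent":
        # Pulls only when the reference is not already on disk.
        if present_locally:
            return "use-local"
        return "use-pulled" if pull_succeeds else "fail"
    raise ValueError(f"unknown pullPolicy: {pull_policy}")


if __name__ == "__main__":
    for policy in ("Always", "Never", "IfNotPresent"):
        print(policy, resolve_volume_image(policy, present_locally=False, pull_succeeds=True))
```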

The volumeMounts field indicates that the container with the name test should mount the volume under the path /volume.

If you now create the pod:

kubectl apply -f pod.yaml

And exec into it:

kubectl exec -it pod -- sh

Then you’re able to investigate what has been mounted:

/ # ls /volume
dir  file
/ # cat /volume/file
2
/ # ls /volume/dir
file
/ # cat /volume/dir/file
1

You managed to consume an OCI artifact using Kubernetes!

The container runtime pulls the image (or artifact), mounts it to the container and makes it finally available for direct usage. There are a bunch of details in the implementation, which closely align to the existing image pull behavior of the kubelet. For example:

If a :latest tag as reference is provided, then the pullPolicy will default to Always, while in any other case it will default to IfNotPresent if unset.

The volume gets re-resolved if the pod gets deleted and recreated, which means that new remote content will become available on pod recreation. A failure to resolve or pull the image during pod startup will block containers from starting and may add significant latency. Failures will be retried using normal volume backoff and will be reported on the pod reason and message.

Pull secrets will be assembled in the same way as for the container image by looking up node credentials, service account image pull secrets, and pod spec image pull secrets.

The OCI object gets mounted in a single directory by merging the manifest layers in the same way as for container images.

The volume is mounted as read-only (ro) and non-executable files (noexec).

Sub-path mounts for containers are not supported (spec.containers[*].volumeMounts.subpath).

The field spec.securityContext.fsGroupChangePolicy has no effect on this volume type.

The feature will also work with the AlwaysPullImages admission plugin if enabled.
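
The pullPolicy defaulting rule from the list above can be sketched as follows. This Python helper is hypothetical and follows only what this post states (a :latest tag defaults to Always, anything else to IfNotPresent); it does not treat digest references specially, and the kubelet's actual reference parsing may differ.

```python
from typing import Optional


def default_pull_policy(reference: str, pull_policy: Optional[str] = None) -> str:
    """Return the effective pullPolicy for an image volume reference."""
    if pull_policy:
        return pull_policy  # an explicitly set policy is kept as-is
    # Look for a tag only in the last path segment, so that a registry
    # port such as localhost:5000 is not mistaken for a tag.
    last_segment = reference.rsplit("/", 1)[-1]
    tag = last_segment.split(":", 1)[1] if ":" in last_segment else None
    return "Always" if tag == "latest" else "IfNotPresent"


if __name__ == "__main__":
    for ref in ("my-image:latest", "quay.io/crio/artifact:v1", "localhost:5000/my-image"):
        print(ref, default_pull_policy(ref))
```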

Thank you for reading through the end of this blog post! SIG Node is proud and happy to deliver this feature as part of Kubernetes v1.31.

As writer of this blog post, I would like to emphasize my special thanks to all involved individuals out there! You all rock, let’s keep on hacking!

Further reading

Use an Image Volume With a Pod

image volume overview

via Kubernetes Blog https://kubernetes.io/

August 15, 2024 at 08:00PM

·kubernetes.io·
"Swatting" spree targets Michigan pols
Both of Michigan's Senate nominees and its secretary of state have been the targets of "swatting" phone calls over the past week, The Detroit News reported today.
·axios.com·
Evolving our self-hosted offering and license model

What you need to know about the upcoming changes to CockroachDB Enterprise arriving this…

August 15, 2024 at 10:13AM

via Instapaper

·cockroachlabs.com·
Kubernetes 1.31: VolumeAttributesClass for Volume Modification Beta

https://kubernetes.io/blog/2024/08/15/kubernetes-1-31-volume-attributes-class/

Volumes in Kubernetes have been described by two attributes: their storage class, and their capacity. The storage class is an immutable property of the volume, while the capacity can be changed dynamically with volume resize.

This complicates vertical scaling of workloads with volumes. While cloud providers and storage vendors often offer volumes which allow specifying IO quality of service (Performance) parameters like IOPS or throughput and tuning them as workloads operate, Kubernetes has no API which allows changing them.

We are pleased to announce that the VolumeAttributesClass KEP, alpha since Kubernetes 1.29, will be beta in 1.31. This provides a generic, Kubernetes-native API for modifying volume parameters like provisioned IO.

Like all new volume features in Kubernetes, this API is implemented via the Container Storage Interface (CSI). In addition to enabling the VolumeAttributesClass feature gate, your provisioner-specific CSI driver must support the new ModifyVolume API, which is the CSI side of this feature.

See the full documentation for all details. Here we show the common workflow.

Dynamically modifying volume attributes

A VolumeAttributesClass is a cluster-scoped resource that specifies provisioner-specific attributes. These are created by the cluster administrator in the same way as storage classes. For example, a series of gold, silver and bronze volume attribute classes can be created for volumes with greater or lesser amounts of provisioned IO.

apiVersion: storage.k8s.io/v1alpha1
kind: VolumeAttributesClass
metadata:
  name: silver
driverName: your-csi-driver
parameters:
  provisioned-iops: "500"
  provisioned-throughput: "50MiB/s"
---
apiVersion: storage.k8s.io/v1alpha1
kind: VolumeAttributesClass
metadata:
  name: gold
driverName: your-csi-driver
parameters:
  provisioned-iops: "10000"
  provisioned-throughput: "500MiB/s"

An attribute class is added to a PVC in much the same way as a storage class.

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: test-pv-claim
spec:
  storageClassName: any-storage-class
  volumeAttributesClassName: silver
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 64Gi

Unlike a storage class, the volume attributes class can be changed:

kubectl patch pvc test-pv-claim -p '{"spec": {"volumeAttributesClassName": "gold"}}'
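Hand-written JSON patch bodies are easy to mis-nest, so it can help to generate them. The following is a minimal sketch, not part of the original post: the helper names build_vac_patch and kubectl_patch_command are hypothetical, and the only thing taken from the text is that the class name lives at spec.volumeAttributesClassName on the PVC.

```python
import json
import shlex


def build_vac_patch(class_name: str) -> str:
    """Return a JSON patch body that sets spec.volumeAttributesClassName.

    Hypothetical helper for illustration; json.dumps guarantees the
    nested object is well-formed.
    """
    return json.dumps({"spec": {"volumeAttributesClassName": class_name}})


def kubectl_patch_command(pvc_name: str, class_name: str) -> str:
    """Assemble a shell-safe `kubectl patch pvc` command line."""
    patch = build_vac_patch(class_name)
    return f"kubectl patch pvc {shlex.quote(pvc_name)} -p {shlex.quote(patch)}"


if __name__ == "__main__":
    # Switch the example claim from the text to the "gold" class.
    print(kubectl_patch_command("test-pv-claim", "gold"))
```

Generating the body with json.dumps keeps the spec object properly nested, and shlex.quote wraps it in single quotes so the shell passes it to kubectl unchanged.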

Kubernetes will work with the CSI driver to update the attributes of the volume. The status of the PVC will track the current and desired attributes class. The PV resource will also be updated with the new volume attributes class, which reflects the attributes currently active on the PV.

Limitations with the beta

As a beta feature, some capabilities planned for GA are not yet present. The largest is quota support; see the KEP and the discussion in sig-storage for details.

See the Kubernetes CSI driver list for up-to-date information on support for this feature in CSI drivers.

via Kubernetes Blog https://kubernetes.io/

August 14, 2024 at 08:00PM

·kubernetes.io·
Kubernetes 1.31: VolumeAttributesClass for Volume Modification Beta
Announcing Karpenter 1.0 | Amazon Web Services
Announcing Karpenter 1.0 | Amazon Web Services
Introduction In November 2021, AWS announced the launch of v0.5 of Karpenter, “a new open source Kubernetes cluster auto scaling project.” Originally conceived as a flexible, dynamic, and high-performance alternative to the Kubernetes Cluster Autoscaler, in the nearly three years since then Karpenter has evolved substantially into a fully featured, Kubernetes native node lifecycle manager. […]
·aws.amazon.com·
Announcing Karpenter 1.0 | Amazon Web Services
SNMP alerted us when we were supposed to start shutting down servers when our ancient air handlers bit the dust at Pope AFB. I tracked variances to time of day. | Uncertainties and issues in using IPMI temperature data
SNMP alerted us when we were supposed to start shutting down servers when our ancient air handlers bit the dust at Pope AFB. I tracked variances to time of day. | Uncertainties and issues in using IPMI temperature data
Uncertainties and issues in using IPMI temperature data
·utcc.utoronto.ca·
SNMP alerted us when we were supposed to start shutting down servers when our ancient air handlers bit the dust at Pope AFB. I tracked variances to time of day. | Uncertainties and issues in using IPMI temperature data
Kubernetes v1.31: Accelerating Cluster Performance with Consistent Reads from Cache
Kubernetes v1.31: Accelerating Cluster Performance with Consistent Reads from Cache

Kubernetes v1.31: Accelerating Cluster Performance with Consistent Reads from Cache

https://kubernetes.io/blog/2024/08/15/consistent-read-from-cache-beta/

Kubernetes is renowned for its robust orchestration of containerized applications, but as clusters grow, the demands on the control plane can become a bottleneck. A key challenge has been ensuring strongly consistent reads from the etcd datastore, which has required resource-intensive quorum reads.

Today, the Kubernetes community is excited to announce a major improvement: consistent reads from cache, graduating to Beta in Kubernetes v1.31.

Why consistent reads matter

Consistent reads are essential for ensuring that Kubernetes components have an accurate view of the latest cluster state. Guaranteeing consistent reads is crucial for maintaining the accuracy and reliability of Kubernetes operations, enabling components to make informed decisions based on up-to-date information. In large-scale clusters, fetching and processing this data can be a performance bottleneck, especially for requests that involve filtering results. While Kubernetes can filter data by namespace directly within etcd, any other filtering by labels or field selectors requires the entire dataset to be fetched from etcd and then filtered in-memory by the Kubernetes API server. This is particularly impactful for components like the kubelet, which only needs to list pods scheduled to its node - but previously required the API Server and etcd to process all pods in the cluster.

The breakthrough: Caching with confidence

Kubernetes has long used a watch cache to optimize read operations. The watch cache stores a snapshot of the cluster state and receives updates through etcd watches. However, until now, it couldn't serve consistent reads directly, as there was no guarantee the cache was sufficiently up-to-date.

The consistent reads from cache feature addresses this by leveraging etcd's progress notifications mechanism. These notifications inform the watch cache about how current its data is compared to etcd. When a consistent read is requested, the system first checks if the watch cache is up-to-date. If the cache is not up-to-date, the system queries etcd for progress notifications until it's confirmed that the cache is sufficiently fresh. Once ready, the read is efficiently served directly from the cache, which can significantly improve performance, particularly in cases where it would require fetching a lot of data from etcd. This enables requests that filter data to be served from the cache, with only minimal metadata needing to be read from etcd.
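The flow described above can be illustrated with a toy model. This is only a sketch under stated assumptions, not the API server's implementation: FakeEtcd and WatchCache are hypothetical in-memory classes, a single integer stands in for the etcd revision, and sync_one stands in for the watch events and progress notifications that bring the cache up to date.

```python
class FakeEtcd:
    """Hypothetical stand-in for etcd: a key-value store with a revision counter."""

    def __init__(self):
        self.revision = 0
        self.events = []  # ordered log of (revision, key, value)

    def put(self, key, value):
        self.revision += 1
        self.events.append((self.revision, key, value))


class WatchCache:
    """Toy watch cache that replays events from FakeEtcd."""

    def __init__(self, etcd):
        self.etcd = etcd
        self.revision = 0  # revision the cache has caught up to
        self.data = {}
        self._next_event = 0

    def sync_one(self):
        # Apply the next pending event, as a watch stream would deliver it.
        if self._next_event < len(self.etcd.events):
            rev, key, value = self.etcd.events[self._next_event]
            self.data[key] = value
            self.revision = rev
            self._next_event += 1

    def consistent_list(self, predicate):
        # Only minimal metadata (the current revision) is read from the store.
        required = self.etcd.revision
        # Wait until the cache is at least as fresh as that revision.
        while self.revision < required:
            self.sync_one()
        # Filtering happens on cached data, not on a full fetch from the store.
        return sorted(k for k, v in self.data.items() if predicate(v))


etcd = FakeEtcd()
cache = WatchCache(etcd)
etcd.put("pod-a", {"node": "node-1"})
etcd.put("pod-b", {"node": "node-2"})
etcd.put("pod-c", {"node": "node-1"})

# Analogous to a kubelet listing only the pods scheduled to its own node.
result = cache.consistent_list(lambda pod: pod["node"] == "node-1")
print(result)
```

The point of the sketch is the ordering: the read first learns how fresh it must be, then waits for the cache to reach that revision, and only then filters locally.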

Important Note: To benefit from this feature, your Kubernetes cluster must be running etcd version 3.4.31+ or 3.5.13+. For older etcd versions, Kubernetes will automatically fall back to serving consistent reads directly from etcd.
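The version requirement can be checked with a few lines of code before an upgrade. The following is a minimal sketch: the function names are hypothetical, the 3.4.31 and 3.5.13 thresholds come from the note above, and treating any release line newer than 3.5 as supported is an assumption rather than something the post states.

```python
from typing import Tuple


def parse_version(version: str) -> Tuple[int, int, int]:
    """Parse strings such as 'v3.5.13' or '3.4.31' into (major, minor, patch)."""
    core = version.lstrip("v").split("-")[0].split("+")[0]
    major, minor, patch = (int(part) for part in core.split("."))
    return major, minor, patch


def supports_consistent_reads_from_cache(etcd_version: str) -> bool:
    """Return True if the etcd version meets the minimums quoted in the post."""
    major, minor, patch = parse_version(etcd_version)
    if (major, minor) == (3, 4):
        return patch >= 31
    if (major, minor) == (3, 5):
        return patch >= 13
    # Assumption (not from the post): release lines newer than 3.5 qualify.
    return (major, minor) > (3, 5)


if __name__ == "__main__":
    for v in ("3.4.30", "3.4.31", "3.5.12", "v3.5.13"):
        print(v, supports_consistent_reads_from_cache(v))
```

A cluster that fails this check still works; as noted above, Kubernetes falls back to serving consistent reads directly from etcd.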

Performance gains you'll notice

This seemingly simple change has a profound impact on Kubernetes performance and scalability:

Reduced etcd Load: Kubernetes v1.31 can offload work from etcd, freeing up resources for other critical operations.

Lower Latency: Serving reads from cache is significantly faster than fetching and processing data from etcd. This translates to quicker responses for components, improving overall cluster responsiveness.

Improved Scalability: Large clusters with thousands of nodes and pods will see the most significant gains, as the reduction in etcd load allows the control plane to handle more requests without sacrificing performance.

5k Node Scalability Test Results: In recent scalability tests on 5,000 node clusters, enabling consistent reads from cache delivered impressive improvements:

30% reduction in kube-apiserver CPU usage

25% reduction in etcd CPU usage

Up to 3x reduction (from 5 seconds to 1.5 seconds) in 99th percentile pod LIST request latency

What's next?

With the graduation to beta, consistent reads from cache are enabled by default, offering a seamless performance boost to all Kubernetes users running a supported etcd version.

Our journey doesn't end here. The Kubernetes community is actively exploring pagination support in the watch cache, which will unlock even more performance optimizations in the future.

Getting started

Upgrading to Kubernetes v1.31 and ensuring you are using etcd version 3.4.31+ or 3.5.13+ is the easiest way to experience the benefits of consistent reads from cache. If you have any questions or feedback, don't hesitate to reach out to the Kubernetes community.

Let us know how consistent reads from cache transform your Kubernetes experience!

Special thanks to @ah8ad3 and @p0lyn0mial for their contributions to this feature!

via Kubernetes Blog https://kubernetes.io/

August 14, 2024 at 08:00PM

·kubernetes.io·
Kubernetes v1.31: Accelerating Cluster Performance with Consistent Reads from Cache