1_r/devopsish

Hotmail email delivery fails after Microsoft misconfigures DNS
Hotmail users worldwide have problems sending emails, with messages flagged as spam or not delivered after Microsoft misconfigured the domain's DNS SPF record.
·bleepingcomputer.com·
DNF5 delayed
It is fair to say that the DNF package manager is not the favorite tool of many Fedora users. It was brought in as a replacement for Yum but got off to a rather rocky start; DNF has stabilized over the years, though, and the complaints have subsided. That can only mean one thing: it must be time to throw it away and start over from the beginning. The replacement, called DNF5, was slated to be a part of the Fedora 39 release, due in October, but that is not going to happen.
·lwn.net·
Arm's full-year revenue fell 1% ahead of IPO - source
SoftBank Group Corp's Arm Ltd is expected to report a revenue decline of about 1% in the year ended March, when the chip designer reveals its initial public offering (IPO) filing on Monday, according to a person familiar with the matter.
·reuters.com·
Oh Bufferapp devs… | Posting via the Bluesky API | AT Protocol
The Bluesky post record type has many features, including replies, quote-posts, embedded social cards, mentions, and images. Here's some example code for all the common post formats.
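The linked article walks through this starting from a plain-text post. As a rough sketch of that simplest case, using curl against the XRPC HTTP endpoints it describes (the handle, app password, DID, and timestamp below are placeholders):

  # Create a session; the JSON response includes an "accessJwt" and the account's "did"
  curl -X POST https://bsky.social/xrpc/com.atproto.server.createSession \
    -H "Content-Type: application/json" \
    -d '{"identifier": "example.bsky.social", "password": "app-password-here"}'

  # Create a post record, substituting the accessJwt and did returned above
  curl -X POST https://bsky.social/xrpc/com.atproto.repo.createRecord \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer ACCESS_JWT_FROM_ABOVE" \
    -d '{
      "repo": "did:plc:EXAMPLE",
      "collection": "app.bsky.feed.post",
      "record": {
        "$type": "app.bsky.feed.post",
        "text": "Hello world! I posted this via the Bluesky API.",
        "createdAt": "2023-08-15T12:00:00Z"
      }
    }'

Replies, quote-posts, mentions, cards, and images all build on this same createRecord call by adding fields (reply, embed, facets) to the record.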
·atproto.com·
Blog: Kubernetes 1.28: Improved failure handling for Jobs
Authors: Kevin Hannon (G-Research), Michał Woźniak (Google)

This blog discusses two new features in Kubernetes 1.28 to improve Jobs for batch users: Pod replacement policy and Backoff limit per index. These features continue the effort started by the Pod failure policy to improve the handling of Pod failures in a Job.

Pod replacement policy

By default, when a pod enters a terminating state (e.g. due to preemption or eviction), Kubernetes immediately creates a replacement Pod, so both Pods are running at the same time. In API terms, a pod is considered terminating when it has a deletionTimestamp and its phase is Pending or Running.

Having two Pods running at a given time is problematic for some popular machine learning frameworks, such as TensorFlow and JAX, which require at most one Pod running at the same time for a given index. TensorFlow gives the following error if two pods are running for a given index:

  /job:worker/task:4: Duplicate task registration with task_name=/job:worker/replica:0/task:4

See more details in the issue.

Creating the replacement Pod before the previous one fully terminates can also cause problems in clusters with scarce resources or with tight budgets, such as:

- cluster resources can be difficult to obtain for Pods pending to be scheduled, as Kubernetes might take a long time to find available nodes until the existing Pods are fully terminated.
- if cluster autoscaler is enabled, the replacement Pods might produce undesired scale-ups.

How can you use it?

This is an alpha feature, which you can enable by turning on the JobPodReplacementPolicy feature gate in your cluster. Once the feature is enabled in your cluster, you can use it by creating a new Job that specifies a podReplacementPolicy field, as shown here:

  kind: Job
  metadata:
    name: new
    ...
  spec:
    podReplacementPolicy: Failed
    ...

In that Job, the Pods are only replaced once they reach the Failed phase, not while they are terminating.

Additionally, you can inspect the .status.terminating field of a Job. The value of the field is the number of Pods owned by the Job that are currently terminating.

  kubectl get jobs/myjob -o=jsonpath='{.status.terminating}'
  3 # three Pods are terminating and have not yet reached the Failed phase

This can be particularly useful for external queueing controllers, such as Kueue, which track quota from running Pods of a Job until the resources are reclaimed from the currently terminating Job.

Note that podReplacementPolicy: Failed is the default when using a custom Pod failure policy.

Backoff limit per index

By default, Pod failures for Indexed Jobs are counted towards the global limit of retries, represented by .spec.backoffLimit. This means that a consistently failing index is restarted repeatedly until it exhausts the limit. Once the limit is reached, the entire Job is marked failed and some indexes may never even be started.

This is problematic for use cases where you want to handle Pod failures for every index independently. For example, if you use Indexed Jobs for running integration tests where each index corresponds to a testing suite, you may want to account for possible flaky tests by allowing 1 or 2 retries per suite. There might be some buggy suites, making the corresponding indexes fail consistently; in that case you may prefer to limit retries for the buggy suites, yet allow the other suites to complete.

The feature allows you to:

- complete execution of all indexes, despite some indexes failing.
- better utilize the computational resources by avoiding unnecessary retries of consistently failing indexes.

How can you use it?

This is an alpha feature, which you can enable by turning on the JobBackoffLimitPerIndex feature gate in your cluster. Once the feature is enabled in your cluster, you can create an Indexed Job with the .spec.backoffLimitPerIndex field specified.

Example

The following example demonstrates how to use this feature to make sure the Job executes all indexes (provided there is no other reason for early Job termination, such as reaching the activeDeadlineSeconds timeout or being manually deleted by the user), and that the number of failures is controlled per index:

  apiVersion: batch/v1
  kind: Job
  metadata:
    name: job-backoff-limit-per-index-execute-all
  spec:
    completions: 8
    parallelism: 2
    completionMode: Indexed
    backoffLimitPerIndex: 1
    template:
      spec:
        restartPolicy: Never
        containers:
        - name: example
          # this example container returns an error, and fails,
          # when it is run as the second or third index in any Job
          # (even after a retry)
          image: python
          command:
          - python3
          - -c
          - |
            import os, sys, time
            id = int(os.environ.get("JOB_COMPLETION_INDEX"))
            if id == 1 or id == 2:
              sys.exit(1)
            time.sleep(1)

Now, inspect the Pods after the Job is finished:

  kubectl get pods -l job-name=job-backoff-limit-per-index-execute-all

This returns output similar to:

  NAME                                              READY   STATUS      RESTARTS   AGE
  job-backoff-limit-per-index-execute-all-0-b26vc   0/1     Completed   0          49s
  job-backoff-limit-per-index-execute-all-1-6j5gd   0/1     Error       0          49s
  job-backoff-limit-per-index-execute-all-1-6wd82   0/1     Error       0          37s
  job-backoff-limit-per-index-execute-all-2-c66hg   0/1     Error       0          32s
  job-backoff-limit-per-index-execute-all-2-nf982   0/1     Error       0          43s
  job-backoff-limit-per-index-execute-all-3-cxmhf   0/1     Completed   0          33s
  job-backoff-limit-per-index-execute-all-4-9q6kq   0/1     Completed   0          28s
  job-backoff-limit-per-index-execute-all-5-z9hqf   0/1     Completed   0          28s
  job-backoff-limit-per-index-execute-all-6-tbkr8   0/1     Completed   0          23s
  job-backoff-limit-per-index-execute-all-7-hxjsq   0/1     Completed   0          22s

Additionally, you can take a look at the status for that Job:

  kubectl get jobs job-backoff-limit-per-index-execute-all -o yaml

The output ends with a status similar to:

  status:
    completedIndexes: 0,3-7
    failedIndexes: 1,2
    succeeded: 6
    failed: 4
    conditions:
    - message: Job has failed indexes
      reason: FailedIndexes
      status: "True"
      type: Failed

Here, indexes 1 and 2 were each retried once. After the second failure in each of them, the specified .spec.backoffLimitPerIndex was exceeded, so the retries were stopped. For comparison, if the per-index backoff were disabled, the buggy indexes would retry until the global backoffLimit was exceeded, and then the entire Job would be marked failed before some of the higher indexes were ever started.

How can you learn more?

- Read the user-facing documentation for Pod replacement policy, Backoff limit per index, and Pod failure policy.
- Read the KEPs for Pod Replacement Policy, Backoff limit per index, and Pod failure policy.

Getting Involved

These features were sponsored by SIG Apps. Batch use cases are actively being improved for Kubernetes users in the batch working group. Working groups are relatively short-lived initiatives focused on specific goals. The goal of WG Batch is to improve the experience for batch workload users, offer support for batch processing use cases, and enhance the Job API for common use cases. If that interests you, please join the working group either by subscribing to our mailing list or on Slack.

Acknowledgments

As with any Kubernetes feature, multiple people contributed to getting this done, from testing and filing bugs to reviewing code. We would not have been able to achieve either of these features without Aldo Culquicondor (Google) providing excellent domain knowledge and expertise throughout the Kubernetes ecosystem.
·kubernetes.io·
TACOS Framework
TACOS Framework has 6 repositories available. Follow their code on GitHub.
·github.com·
Steve Lord (@stevelord@bladerunner.social)
Attached: 1 image. Hey @linuxfoundation@social.lfx.dev why are you sending takedowns on redbubble for generic Unix terms and project names you don't own?
·bladerunner.social·
The ABCs of Generative AI
Generative AI exploded onto the scene so quickly that many developers haven’t been able to catch up with new technical concepts in Generative AI. Whether you’re a builder without an AI/ML background, or you’re feeling like you’ve “missed the boat,” this glossary is for you!
·community.aws·
Blog: Kubernetes v1.28: Retroactive Default StorageClass move to GA
Author: Roman Bednář (Red Hat)

Announcing graduation to General Availability (GA): Retroactive Default StorageClass Assignment in Kubernetes v1.28!

The Kubernetes SIG Storage team is thrilled to announce that the "Retroactive Default StorageClass Assignment" feature, introduced as alpha in Kubernetes v1.25, has now graduated to GA and is officially part of the Kubernetes v1.28 release. This enhancement brings a significant improvement to how default StorageClasses are assigned to PersistentVolumeClaims (PVCs).

With this feature enabled, you no longer need to create a default StorageClass first and a PVC second in order for the class to be assigned. Instead, any PVCs without a StorageClass assigned are retroactively updated to include the default StorageClass. This ensures that PVCs no longer get stuck in an unbound state and that storage provisioning works seamlessly, even when a default StorageClass is not defined at the time of PVC creation.

What changed?

The PersistentVolume (PV) controller has been modified to automatically assign a default StorageClass to any unbound PersistentVolumeClaim whose storageClassName is not set. Additionally, the PersistentVolumeClaim admission validation mechanism within the API server has been adjusted to allow changing the value from an unset state to an actual StorageClass name.

How to use it?

As this feature has graduated to GA, there is no need to enable a feature gate anymore. Simply make sure you are running Kubernetes v1.28 or later, and the feature will be available for use.

For more details, read about default StorageClass assignment in the Kubernetes documentation. You can also read the previous blog post announcing beta graduation in v1.26.

To provide feedback, join the Kubernetes Storage Special Interest Group (SIG) or participate in discussions on our public Slack channel.
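As a rough sketch of the behavior described above (the PVC name, StorageClass name, and provisioner are made-up placeholders, not taken from the post): a claim created with storageClassName unset, followed by a StorageClass that is only later marked as the cluster default, which the PV controller then assigns to the existing claim retroactively.

  # Created first, while the cluster has no default StorageClass;
  # storageClassName is intentionally left unset.
  apiVersion: v1
  kind: PersistentVolumeClaim
  metadata:
    name: example-pvc
  spec:
    accessModes:
    - ReadWriteOnce
    resources:
      requests:
        storage: 1Gi
  ---
  # Created later and marked as the cluster default via the standard annotation.
  # With retroactive assignment, the PV controller updates example-pvc to use
  # example-default, so the claim can be provisioned and bound instead of staying unbound.
  apiVersion: storage.k8s.io/v1
  kind: StorageClass
  metadata:
    name: example-default
    annotations:
      storageclass.kubernetes.io/is-default-class: "true"
  provisioner: example.vendor.com/hypothetical-provisioner

Applying the first manifest, then the second, and checking the claim again (kubectl get pvc example-pvc) should show the default class filled in on a v1.28 cluster.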
·kubernetes.io·
Whoa!!! University of Chicago agrees to pay $13.5 million to students after being accused of participating in a 'price-fixing cartel' with other prestigious schools to limit financial aid
After nearly two years of litigation, UChicago settled claims it conspired with top colleges including Brown and Yale to limit financial aid packages.
·businessinsider.com·
HashiCorp's license change | LWN.net
Readers have been pointing us to HashiCorp's announcement that it is moving to its own "Business Source License" for some of its (formerly) open-source products. Like other companies (example) that have taken this path, HashiCorp is removing the freedom to use its products commercially in ways that it sees as competitive. This is, in a real sense, an old and tiresome story.
·lwn.net·
Blog: Kubernetes 1.28: Non-Graceful Node Shutdown Moves to GA
Authors: Xing Yang (VMware) and Ashutosh Kumar (Elastic)

The Kubernetes Non-Graceful Node Shutdown feature is now GA in Kubernetes v1.28. It was introduced as alpha in Kubernetes v1.24 and promoted to beta in Kubernetes v1.26. This feature allows stateful workloads to restart on a different node if the original node is shut down unexpectedly or ends up in a non-recoverable state, such as a hardware failure or an unresponsive OS.

What is a Non-Graceful Node Shutdown?

In a Kubernetes cluster, a node can be shut down in a planned, graceful way or unexpectedly, for reasons such as a power outage or something else external. A node shutdown can lead to workload failure if the node is not drained before the shutdown. A node shutdown can be either graceful or non-graceful. The Graceful Node Shutdown feature allows the kubelet to detect a node shutdown event, properly terminate the pods, and release resources before the actual shutdown. When a node is shut down but not detected by the kubelet's Node Shutdown Manager, this becomes a non-graceful node shutdown.

A non-graceful node shutdown is usually not a problem for stateless apps, but it is a problem for stateful apps. A stateful application cannot function properly if its pods are stuck on the shutdown node and are not restarted on a running node. In the case of a non-graceful node shutdown, you can manually add an out-of-service taint on the Node:

  kubectl taint nodes node-name node.kubernetes.io/out-of-service=nodeshutdown:NoExecute

This taint triggers the pods on the node to be forcefully deleted if there are no matching tolerations on the pods. Persistent volumes attached to the shutdown node are detached, and new pods are created successfully on a different running node.

Note: Before applying the out-of-service taint, you must verify that the node is already in a shutdown or power-off state (not in the middle of restarting). Once all the workload pods that are linked to the out-of-service node have moved to a new running node and the shutdown node has been recovered, you should remove that taint from the affected node.

What's new in stable

With the promotion of the Non-Graceful Node Shutdown feature to stable, the feature gate NodeOutOfServiceVolumeDetach is locked to true on kube-controller-manager and cannot be disabled.

The metrics force_delete_pods_total and force_delete_pod_errors_total in the Pod GC Controller are enhanced to account for all forceful pod deletions. A reason is added to the metric to indicate whether the pod is forcefully deleted because it is terminated, orphaned, terminating with the out-of-service taint, or terminating and unscheduled. A "reason" is also added to the metric attachdetach_controller_forced_detaches in the Attach Detach Controller to indicate whether the force detach is caused by the out-of-service taint or a timeout.

What's next?

This feature requires a user to manually add a taint to the node to trigger workload failover and to remove the taint after the node is recovered. In the future, we plan to find ways to automatically detect and fence nodes that are shut down or failed, and to automatically fail workloads over to another node.

How can I learn more?

Check out additional documentation on this feature here.

How to get involved?

We offer a huge thank you to all the contributors who helped with the design, implementation, and review of this feature and helped move it from alpha, through beta, to stable:

- Michelle Au (msau42)
- Derek Carr (derekwaynecarr)
- Danielle Endocrimes (endocrimes)
- Baofa Fan (carlory)
- Tim Hockin (thockin)
- Ashutosh Kumar (sonasingh46)
- Hemant Kumar (gnufied)
- Yuiko Mouri (YuikoTakada)
- Mrunal Patel (mrunalp)
- David Porter (bobbypage)
- Yassine Tijani (yastij)
- Jing Xu (jingxu97)
- Xing Yang (xing-yang)

This feature is a collaboration between SIG Storage and SIG Node. For those interested in getting involved with the design and development of any part of the Kubernetes Storage system, join the Kubernetes Storage Special Interest Group (SIG). For those interested in getting involved with the design and development of the components that support the controlled interactions between pods and host resources, join the Kubernetes Node SIG.
·kubernetes.io·