Author: Matthew Cary (Google)
Kubernetes v1.27 graduated a new policy mechanism for StatefulSets to beta that controls the
lifetime of their PersistentVolumeClaims (PVCs). The new PVC retention policy lets users specify
whether the PVCs generated from the StatefulSet spec template should be automatically deleted or
retained when the StatefulSet is deleted or replicas in the StatefulSet are scaled down.
What problem does this solve?
A StatefulSet spec can include Pod and PVC templates. When a replica is first created, the
Kubernetes control plane creates a PVC for that replica if one does not already exist. The behavior
before the PVC retention policy was that the control plane never cleaned up the PVCs created for
StatefulSets - this was left up to the cluster administrator, or to some add-on automation that
you’d have to find, check for suitability, and deploy. The common pattern for managing PVCs, either
manually or through tools such as Helm, is that the PVCs are tracked by the tool that manages them,
with an explicit lifecycle. Workflows that use StatefulSets must determine on their own which PVCs a
StatefulSet creates and what their lifecycle should be.
Before this new feature, when a StatefulSet-managed replica disappears, either because the
StatefulSet is reducing its replica count or because the StatefulSet is deleted, the PVC and its
backing volume remain and must be manually deleted. While this behavior is appropriate when the
data is critical, in many cases the persistent data in these PVCs is either temporary, or can be
reconstructed from another source. In those cases, PVCs and their backing volumes remaining after
their StatefulSet or replicas have been deleted are not necessary, incur cost, and require manual
cleanup.
The new StatefulSet PVC retention policy
The new StatefulSet PVC retention policy is used to control if and when PVCs created from a
StatefulSet's volumeClaimTemplate are deleted. There are two contexts when this may occur.
The first context is when the StatefulSet resource is deleted (which implies that all replicas are
also deleted). This is controlled by the whenDeleted policy. The second context, controlled by
whenScaled, is when the StatefulSet is scaled down, which removes some but not all of the replicas
in a StatefulSet. In both cases the policy can either be Retain, where the corresponding PVCs are
not touched, or Delete, which means that PVCs are deleted. The deletion is done with a normal
object deletion, so that, for example, the reclaim policy of the underlying PV is respected.
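To show where the policy lives in the API, here is a minimal sketch of a StatefulSet that sets it. The names, image, and storage size are placeholders, and whenDeleted / whenScaled can each be set to Retain or Delete independently:

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: web                   # illustrative name
spec:
  serviceName: web
  replicas: 3
  persistentVolumeClaimRetentionPolicy:
    whenDeleted: Delete       # delete PVCs when the StatefulSet itself is deleted
    whenScaled: Retain        # keep PVCs for replicas removed by a scale-down
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
      - name: web
        image: registry.k8s.io/nginx-slim:0.8   # placeholder image
        volumeMounts:
        - name: www
          mountPath: /usr/share/nginx/html
  volumeClaimTemplates:
  - metadata:
      name: www
    spec:
      accessModes: [ "ReadWriteOnce" ]
      resources:
        requests:
          storage: 1Gi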
This policy forms a matrix with four cases. I’ll walk through and give an example for each one.
whenDeleted and whenScaled are both Retain.
This matches the existing behavior for StatefulSets, where no PVCs are deleted. This is also
the default retention policy. It’s appropriate to use when data on StatefulSet volumes may be
irreplaceable and should only be deleted manually.
whenDeleted is Delete and whenScaled is Retain.
In this case, PVCs are deleted only when the entire StatefulSet is deleted. If the
StatefulSet is scaled down, PVCs are not touched, meaning they are available to be reattached,
with any data from the previous replica intact, if a scale-up occurs. This might be used for a
temporary StatefulSet, such as one in a CI instance or ETL pipeline, where the data on the
StatefulSet is needed only during the lifetime of the StatefulSet, but while the task is running
the data is not easily reconstructible. Any retained data remains available to replicas that
scale down and then back up.
whenDeleted and whenScaled are both Delete.
PVCs are deleted immediately when their replica is no longer needed. Note this does not include
when a Pod is deleted and a new version of it is rescheduled, for example when a node is drained and
Pods need to migrate elsewhere. The PVC is deleted only when the replica is no longer needed,
as signified by a scale-down or StatefulSet deletion. This use case is for when data does not
need to live beyond the life of its replica. Perhaps the data is easily reconstructible and the
cost savings of deleting unused PVCs are more important than quick scale-up, or perhaps when
a new replica is created, any data from a previous replica is not usable and must be
reconstructed anyway.
whenDeleted is Retain and whenScaled is Delete.
This is similar to the previous case, where there is little benefit to keeping PVCs for fast
reuse during scale-up. An example of a situation where you might use this is an Elasticsearch
cluster. Typically you would scale that workload up and down to match demand, whilst ensuring a
minimum number of replicas (for example: 3). When scaling down, data is migrated away from
removed replicas and there is no benefit to retaining those PVCs. However, it can be useful to
bring the entire Elasticsearch cluster down temporarily for maintenance. If you need to take the
Elasticsearch system offline, you can do this by temporarily deleting the StatefulSet, and
then bringing the Elasticsearch cluster back by recreating the StatefulSet. The PVCs holding
the Elasticsearch data will still exist and the new replicas will automatically use them.
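For this Elasticsearch-style setup, the policy stanza would look like the following sketch (the rest of the StatefulSet spec is omitted):

persistentVolumeClaimRetentionPolicy:
  whenDeleted: Retain   # keep data across a temporary deletion of the StatefulSet
  whenScaled: Delete    # drop PVCs for replicas removed by a scale-down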
Visit the documentation to see all the details.
What’s next?
Try it out! The StatefulSetAutoDeletePVC feature gate is beta and enabled by default on
clusters running Kubernetes 1.27. Create a StatefulSet using the new policy, test it out and tell
us what you think!
I'm very curious to see if this owner reference mechanism works well in practice. For example, I
realized there is no mechanism in Kubernetes for knowing who set a reference, so it’s possible that
the StatefulSet controller may fight with custom controllers that set their own
references. Fortunately, maintaining the existing retention behavior does not involve any new owner
references, so default behavior will be compatible.
Please tag any issues you report with the label sig/apps and assign them to Matthew Cary
(@mattcary at GitHub).
Enjoy!
Blog: Kubernetes 1.27: HorizontalPodAutoscaler ContainerResource type metric moves to beta
Author: Kensei Nakada (Mercari)
Kubernetes 1.20 introduced the ContainerResource type metric
in HorizontalPodAutoscaler (HPA).
In Kubernetes 1.27, this feature moves to beta and the corresponding feature gate (HPAContainerMetrics) gets enabled by default.
What is the ContainerResource type metric
The ContainerResource type metric allows us to configure the autoscaling based on resource usage of individual containers.
In the following example, the HPA controller scales the target
so that the average utilization of the cpu in the application container of all the pods is around 60%.
(See the algorithm details
to learn exactly how the desired replica number is calculated.)
type: ContainerResource
containerResource:
  name: cpu
  container: application
  target:
    type: Utilization
    averageUtilization: 60
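For context, here is a minimal sketch of a complete HorizontalPodAutoscaler that embeds this metric under spec.metrics; the HPA name and the target Deployment name are illustrative:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: application-hpa        # illustrative name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: application          # illustrative target workload
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: ContainerResource
    containerResource:
      name: cpu
      container: application   # the container whose usage drives scaling
      target:
        type: Utilization
        averageUtilization: 60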
The difference from the Resource type metric
HPA already had a Resource type metric.
You can define the target resource utilization like the following,
and then HPA will scale up/down the replicas based on the current utilization.
type: Resource
resource:
  name: cpu
  target:
    type: Utilization
    averageUtilization: 60
But, this Resource type metric refers to the average utilization of the Pods.
In case a Pod has multiple containers, the utilization calculation would be:
sum{the resource usage of each container} / sum{the resource request of each container}
The resource utilization of each container may not have a direct correlation or may grow at different rates as the load changes.
For example:
A sidecar container is only providing an auxiliary service such as log shipping.
If the application does not log very frequently or does not produce logs in its hot path,
then the usage of the log shipper will not grow.
A sidecar container that provides authentication. Due to heavy caching,
the usage will only increase slightly when the load on the main container increases.
In the current blended usage calculation approach, this usually results in
the HPA not scaling up the deployment because the blended usage is still low.
A sidecar may be injected without resource requests set, which prevents scaling
based on utilization. In the current logic the HPA controller can only scale
on the absolute resource usage of the Pod when the resource requests are not set.
In such cases, if only one container's resource utilization goes high,
the Resource type metric may not suggest scaling up, as the worked example below shows.
So, for accurate autoscaling, you may want to use the ContainerResource type metric for such Pods instead.
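As a hypothetical illustration (the numbers are invented for this example), assume a Pod with an application container and a sidecar, each requesting 1000m of cpu, and an HPA target of 60% utilization. If the application container uses 900m while the sidecar uses only 100m:

blended (Resource) utilization: (900m + 100m) / (1000m + 1000m) = 50%, below the 60% target, so no scale-up
application container (ContainerResource) utilization: 900m / 1000m = 90%, above the 60% target, so the HPA scales up

The blended metric hides the overloaded application container, while a ContainerResource metric scoped to that container triggers the scale-up.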
What's new for the beta?
For Kubernetes v1.27, the ContainerResource type metric is available by default as described at the beginning
of this article.
(You can still disable it via the HPAContainerMetrics feature gate.)
Also, we've improved the observability of the HPA controller by exposing some metrics from the kube-controller-manager:
metric_computation_total: Number of metric computations.
metric_computation_duration_seconds: The time that the HPA controller takes to calculate one metric.
reconciliations_total: Number of reconciliations of the HPA controller.
reconciliation_duration_seconds: The time that the HPA controller takes to reconcile an HPA object once.
These metrics have labels action (scale_up, scale_down, none) and error (spec, internal, none).
In addition, the first two metrics have the metric_type label,
which corresponds to .spec.metrics[*].type for a HorizontalPodAutoscaler.
All metrics are useful for general monitoring of the HPA controller;
they give you deeper insight into which part has a problem, where it takes time, and how much scaling tends to happen at which time on your cluster.
As another minor improvement, we've changed the SuccessfulRescale event's messages
so that everyone can check whether the events came from the resource metric or
the container resource metric (see the related PR).
Getting involved
This feature is managed by SIG Autoscaling .
Please join us and share your feedback. We look forward to hearing from you!
How can I learn more?
The official document of the ContainerResource type metric
KEP-1610: Container Resource based Autoscaling