Operating in the Kubernetes Cloud on Amazon EKS with Eswar Bala - Last Week in AWS Podcast

Suggested Reads
WebAuthn.wtf
WebAuthn Explained. Everything a developer needs to know about the Web Authentication API.
Blog: Kubernetes 1.27: Introducing An API For Volume Group Snapshots
Author: Xing Yang (VMware)
Volume group snapshot is introduced as an Alpha feature in Kubernetes v1.27.
This feature introduces a Kubernetes API that allows users to take crash consistent
snapshots for multiple volumes together. It uses a label selector to group multiple
PersistentVolumeClaims for snapshotting.
This new feature is only supported for CSI volume drivers.
An overview of volume group snapshots
Some storage systems provide the ability to create a crash consistent snapshot of
multiple volumes. A group snapshot represents “copies” from multiple volumes that
are taken at the same point-in-time. A group snapshot can be used either to rehydrate
new volumes (pre-populated with the snapshot data) or to restore existing volumes to
a previous state (represented by the snapshots).
Why add volume group snapshots to Kubernetes?
The Kubernetes volume plugin system already provides a powerful abstraction that
automates the provisioning, attaching, mounting, resizing, and snapshotting of block
and file storage.
Underpinning all these features is the Kubernetes goal of workload portability:
Kubernetes aims to create an abstraction layer between distributed applications and
underlying clusters so that applications can be agnostic to the specifics of the
cluster they run on and application deployment requires no cluster specific knowledge.
There is already a VolumeSnapshot API
that provides the ability to take a snapshot of a persistent volume to protect against
data loss or data corruption. However, there are other snapshotting functionalities
not covered by the VolumeSnapshot API.
Some storage systems support consistent group snapshots that allow a snapshot to be
taken from multiple volumes at the same point-in-time to achieve write order consistency.
This can be useful for applications that contain multiple volumes. For example,
an application may have data stored in one volume and logs stored in another volume.
If snapshots for the data volume and the logs volume are taken at different times,
the application will not be consistent and will not function properly if it is restored
from those snapshots when a disaster strikes.
It is true that you can quiesce the application first, take an individual snapshot from
each volume that is part of the application one after the other, and then unquiesce the
application after all the individual snapshots are taken. This way, you would get
application consistent snapshots.
However, sometimes it may not be possible to quiesce an application or the application
quiesce can be too expensive so you want to do it less frequently. Taking individual
snapshots one after another may also take longer than taking a consistent
group snapshot. Some users may not want to do application quiesce very often for these
reasons. For example, a user may want to run weekly backups with application quiesce
and nightly backups without application quiesce but with consistent group support which
provides crash consistency across all volumes in the group.
Kubernetes Volume Group Snapshots API
Kubernetes Volume Group Snapshots introduce three new API objects for managing snapshots:
VolumeGroupSnapshot
Created by a Kubernetes user (or perhaps by your own automation) to request
creation of a volume group snapshot for multiple persistent volume claims.
It contains information about the volume group snapshot operation such as the
timestamp when the volume group snapshot was taken and whether it is ready to use.
The creation and deletion of this object represents a desire to create or delete a
cluster resource (a group snapshot).
VolumeGroupSnapshotContent
Created by the snapshot controller for a dynamically created VolumeGroupSnapshot.
It contains information about the volume group snapshot including the volume group
snapshot ID.
This object represents a provisioned resource on the cluster (a group snapshot).
The VolumeGroupSnapshotContent object binds to the VolumeGroupSnapshot for which it
was created with a one-to-one mapping.
VolumeGroupSnapshotClass
Created by cluster administrators to describe how volume group snapshots should be
created, including the driver information, the deletion policy, etc.
These three API kinds are defined as CustomResourceDefinitions (CRDs).
These CRDs must be installed in a Kubernetes cluster for a CSI Driver to support
volume group snapshots.
How do I use Kubernetes Volume Group Snapshots?
Volume group snapshots are implemented in the
external-snapshotter repository. Implementing volume
group snapshots meant adding or changing several components:
Added new CustomResourceDefinitions for VolumeGroupSnapshot and two supporting APIs.
Added volume group snapshot controller logic to the common snapshot controller.
Added volume group snapshot validation webhook logic to the common snapshot validation webhook.
Added logic to the snapshotter sidecar controller to make CSI calls.
The volume snapshot controller, CRDs, and validation webhook are deployed once per
cluster, while the sidecar is bundled with each CSI driver.
Therefore, it makes sense to deploy the volume snapshot controller, CRDs, and validation
webhook as a cluster addon. I strongly recommend that Kubernetes distributors
bundle and deploy the volume snapshot controller, CRDs, and validation webhook as part
of their Kubernetes cluster management process (independent of any CSI Driver).
Creating a new group snapshot with Kubernetes
Once a VolumeGroupSnapshotClass object is defined and you have volumes you want to
snapshot together, you may request a new group snapshot by creating a VolumeGroupSnapshot
object.
The source of the group snapshot specifies whether the underlying group snapshot
should be dynamically created or if a pre-existing VolumeGroupSnapshotContent
should be used.
A pre-existing VolumeGroupSnapshotContent is created by a cluster administrator.
It contains the details of the real volume group snapshot on the storage system which
is available for use by cluster users.
One of the following members in the source of the group snapshot must be set.
selector - a label query over PersistentVolumeClaims that are to be grouped
together for snapshotting. This labelSelector will be used to match the label
added to a PVC.
volumeGroupSnapshotContentName - specifies the name of a pre-existing
VolumeGroupSnapshotContent object representing an existing volume group snapshot.
In the following example, there are two PVCs.
NAME    STATUS   VOLUME                                     CAPACITY   ACCESSMODES   AGE
pvc-0   Bound    pvc-a42d7ea2-e3df-11ed-b5ea-0242ac120002   1Gi        RWO           48s
pvc-1   Bound    pvc-a42d81b8-e3df-11ed-b5ea-0242ac120002   1Gi        RWO           48s
Label the PVCs.
% kubectl label pvc pvc-0 group=myGroup
persistentvolumeclaim/pvc-0 labeled
% kubectl label pvc pvc-1 group=myGroup
persistentvolumeclaim/pvc-1 labeled
For dynamic provisioning, a selector must be set so that the snapshot controller can
find PVCs with the matching labels to be snapshotted together.
apiVersion: groupsnapshot.storage.k8s.io/v1alpha1
kind: VolumeGroupSnapshot
metadata:
  name: new-group-snapshot-demo
  namespace: demo-namespace
spec:
  volumeGroupSnapshotClassName: csi-groupSnapclass
  source:
    selector:
      matchLabels:
        group: myGroup
In the VolumeGroupSnapshot spec, a user can specify the VolumeGroupSnapshotClass which
has the information about which CSI driver should be used for creating the group snapshot.
Two individual volume snapshots will be created as part of the volume group snapshot creation.
snapshot-62abb5db7204ac6e4c1198629fec533f2a5d9d60ea1a25f594de0bf8866c7947-2023-04-26-2.20.4
snapshot-2026811eb9f0787466171fe189c805a22cdb61a326235cd067dc3a1ac0104900-2023-04-26-2.20.4
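The grouping step can be illustrated with a small sketch of matchLabels semantics: a PVC belongs to the group when its labels contain every key/value pair in the selector. This is only an illustration of the matching rule, not the snapshot controller's code. The function names and the third PVC (pvc-2) are hypothetical.

```python
from typing import Dict, List


def matches(selector: Dict[str, str], labels: Dict[str, str]) -> bool:
    """Return True when every key/value pair in the selector appears in labels."""
    return all(labels.get(key) == value for key, value in selector.items())


def select_pvcs(pvcs: List[dict], match_labels: Dict[str, str]) -> List[str]:
    """Return the names of PVCs whose labels satisfy the matchLabels selector."""
    return [pvc["name"] for pvc in pvcs if matches(match_labels, pvc.get("labels", {}))]


pvcs = [
    {"name": "pvc-0", "labels": {"group": "myGroup"}},
    {"name": "pvc-1", "labels": {"group": "myGroup"}},
    {"name": "pvc-2", "labels": {"group": "otherGroup"}},  # hypothetical, not in the example above
]

selected = select_pvcs(pvcs, {"group": "myGroup"})
print(selected)
```

Only the PVCs labeled group=myGroup are selected, which is why two individual snapshots result from the example above.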
How to use group snapshot for restore in Kubernetes
At restore time, the user can request a new PersistentVolumeClaim to be created from
a VolumeSnapshot object that is part of a VolumeGroupSnapshot. This will trigger
provisioning of a new volume that is pre-populated with data from the specified
snapshot. The user should repeat this until all volumes are created from all the
snapshots that are part of a group snapshot.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pvc0-restore
  namespace: demo-namespace
spec:
  storageClassName: csi-hostpath-sc
  dataSource:
    name: snapshot-62abb5db7204ac6e4c1198629fec533f2a5d9d60ea1a25f594de0bf8866c7947-2023-04-26-2.20.4
    kind: VolumeSnapshot
    apiGroup: snapshot.storage.k8s.io
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
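Because the restore has to be repeated for every snapshot in the group, it may help to generate the claims in a loop. The following is a minimal sketch that builds one PersistentVolumeClaim manifest per snapshot, reusing the field values from the example above. The helper name and the pvcN-restore naming scheme are assumptions made for illustration.

```python
import json
from typing import List


def restore_pvc_manifest(pvc_name: str, snapshot_name: str, namespace: str,
                         storage_class: str, size: str) -> dict:
    """Build a PVC manifest that is pre-populated from a VolumeSnapshot."""
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": pvc_name, "namespace": namespace},
        "spec": {
            "storageClassName": storage_class,
            "dataSource": {
                "name": snapshot_name,
                "kind": "VolumeSnapshot",
                "apiGroup": "snapshot.storage.k8s.io",
            },
            "accessModes": ["ReadWriteOnce"],
            "resources": {"requests": {"storage": size}},
        },
    }


# The two snapshot names produced by the group snapshot in the example above.
snapshots: List[str] = [
    "snapshot-62abb5db7204ac6e4c1198629fec533f2a5d9d60ea1a25f594de0bf8866c7947-2023-04-26-2.20.4",
    "snapshot-2026811eb9f0787466171fe189c805a22cdb61a326235cd067dc3a1ac0104900-2023-04-26-2.20.4",
]

# One restore claim per snapshot; the pvcN-restore names are hypothetical.
manifests = [
    restore_pvc_manifest(f"pvc{i}-restore", name, "demo-namespace", "csi-hostpath-sc", "1Gi")
    for i, name in enumerate(snapshots)
]

for manifest in manifests:
    print(json.dumps(manifest, indent=2))
```

Each printed document is a JSON rendering of the same structure as the YAML manifest above.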
As a storage vendor, how do I add support for group snapshots to my CSI driver?
To implement the volume group snapshot feature, a CSI driver must:
Implement a new group controller service.
Implement group controller RPCs: CreateVolumeGroupSnapshot, DeleteVolumeGroupSnapshot, and GetVolumeGroupSnapshot.
Add group controller capability CREATE_DELETE_GET_VOLUME_GROUP_SNAPSHOT.
See the CSI spec
and the Kubernetes-CSI Driver Developer Guide
for more details.
Although Kubernetes is as minimally prescriptive on the packaging and deployment of
a CSI Volume Driver as possible, it provides a suggested mechanism to deploy a
containerized CSI driver to simplify the process.
As part of this recommended deployment process, the Kubernetes team provides a number of
sidecar (helper) containers, including the
external-snapshotter sidecar container
which has been updated to support volume group snapshot.
The external-snapshotter watches the Kubernetes API server for the
VolumeGroupSnapshotContent object and triggers CreateVolumeGroupSnapshot and
DeleteVolumeGroupSnapshot operations against a CSI endpoint.
What are the limitations?
The alpha implementation of volume group snapshots for Kubernetes has the following
limitations:
Does not support reverting an existing PVC to an earlier state represented by
a snapshot (only supports provisioning a new volume from a snapshot).
No application consistency guarantees beyond any guarantees provided by the sto...
Bluesky is not allowing heads of state in beta test
Presidents, prime ministers, and dictators will all have to wait to join Bluesky, the buzzy Twitter competitor.
Beyond the Repository - ACM Queue
💎Crystal the language for humans💎
I recently implemented a Brainfuck interpreter in the Crystal programming language and I’d like to share my honest opinion about working in the language. When looking at a new programming language I am interested in these things in no particular order
dys2p
strengthening digital self-defense | research and development | providing privacy-focused goods and services
Microservices won’t work for everything just like mainframes (which there are MANY, MANY of still in use today) | Monoliths are not dinosaurs
Building evolvable software systems is a strategy, not a religion. And revisiting your architectures with an open mind is a must.
Europe’s major satellite players line up to build Starlink competitor
The bid includes large players such as Airbus Defence and Space, Eutelsat, and SES.
Westinghouse announces a new small nuclear reactor — a notable step in the industry's efforts to remake itself
Westinghouse announced on Thursday it is launching a small nuclear reactor, a miniature version of its flagship AP1000.
How I Got Involved with the OpenSSF - Open Source Security Foundation
Let’s get it out of the way early: it’s not always clear how you can best plug into organizations like OpenSSF. That’s why I’m writing this guest blog post as an “outsider.” I’m just your average tech employee who has become progressively more involved since my company, Sonatype, became members of OpenSSF. If you’re connecting for your first time, the recommended engagement path is effectively “choose your adventure!”
TSMC, partners plan to invest up to $11 billion in German fabrication plant, Bloomberg reports
Taiwan Semiconductor Manufacturing Co is in talks with partners to invest as much as 10 billion euros ($11.04 billion) to build a chip fabrication plant in Germany, Bloomberg News reported on Wednesday, citing people familiar with the matter.
r2d4/llm.ts
Call any LLM with a single API. Zero dependencies.
Discord is growing up, so everyone needs to pick a new username
Discord is dropping the four-digit suffix that followed each username.
Former Uber security chief Sullivan avoids prison in data breach case
macOS Internals
macOS Internals. GitHub Gist: instantly share code, notes, and snippets.
Discord leaks ‘demoralizing’ for US intelligence agencies, DNI Haines says
The leaks of classified documents online by a Massachusetts Air National Guard member have had an emotional impact on the government agencies that produce those products, the director of national intelligence told Congress on Thursday.
1 kilogram! Do that Apple. Save my shoulder. | Asus releases the Zenbook S 13 with world’s slimmest OLED display · TechNode
Taiwan-based personal computer vendor Asus has unveiled its latest Zenbook S 13 OLED globally, touting it as the world’s thinnest laptop.
Adidas Reveals Just How Much Yeezy Stock It's Stuck With After Kanye West Split
The company cut ties with the rapper last year over his antisemitic and offensive comments. Its “options are narrowing" on what to do with the unsold sneakers.
India bans flagship client for the Matrix network
Element is one of 14 messaging apps blocked by the Central Indian Government which - we believe from media reports - relates to Section 69A of the Information Technology Act, 2000.
Ahs inad 2023
PSA. Don’t share your password in your app’s release notes
Cinema chain Odeon may have shared more information than it intended in the release notes accompanying its latest iOS app update.
Apple AirTag Reverse Engineering - Adam Catley
It’s rare that I share press releases | Thales Seizes Control of ESA Demonstration Satellite in First Cybersecurity Exercise of Its Kind
The European Space Agency (ESA) challenged cybersecurity experts in the space industry ecosystem to disrupt the operation of the agency's OPS-SAT demo
I wonder if this will be an option for Advanced Protection accounts? | Google adds passkeys support for passwordless sign-in on all accounts
Google is rolling out support for passkeys for Google Accounts across all services and platforms, allowing users to sign into their Google accounts without entering a password or using 2-Step Verification (2SV) when logging in.
Colorado kills law that made it harder for cities to offer Internet service
State law forced cities and towns to hold elections before offering broadband.
College Knowledge
'Your chitin armor is no match for our iron-tipped stingers! Better go hide in your jars!' --common playground taunt
Blog: Kubernetes 1.27: Quality-of-Service for Memory Resources (alpha)
Authors: Dixita Narang (Google)
Kubernetes v1.27, released in April 2023, introduced changes to
Memory QoS (alpha) to improve memory management capabilities in Linux nodes.
Support for Memory QoS was initially added in Kubernetes v1.22, and later some
limitations
around the formula for calculating memory.high were identified. These limitations are
addressed in Kubernetes v1.27.
Background
Kubernetes allows you to optionally specify how much of each resource a container needs
in the Pod specification. The most common resources to specify are CPU and Memory.
For example, a Pod manifest that defines container resource requirements could look like:
apiVersion: v1
kind: Pod
metadata:
  name: example
spec:
  containers:
  - name: nginx
    resources:
      requests:
        memory: "64Mi"
        cpu: "250m"
      limits:
        memory: "64Mi"
        cpu: "500m"
spec.containers[].resources.requests
When you specify the resource request for containers in a Pod, the
Kubernetes scheduler
uses this information to decide which node to place the Pod on. The scheduler
ensures that for each resource type, the sum of the resource requests of the
scheduled containers is less than the total allocatable resources on the node.
spec.containers[].resources.limits
When you specify the resource limit for containers in a Pod, the kubelet enforces
those limits so that the running containers are not allowed to use more of those
resources than the limits you set.
When the kubelet starts a container as a part of a Pod, kubelet passes the
container's requests and limits for CPU and memory to the container runtime.
The container runtime assigns both CPU request and CPU limit to a container.
Provided the system has free CPU time, the containers are guaranteed to be
allocated as much CPU as they request. Containers cannot use more CPU than
the configured limit, i.e., a container's CPU usage will be throttled if it
uses more CPU than the specified limit within a given time slice.
Prior to the Memory QoS feature, the container runtime only used the memory
limit and discarded the memory request (requests were, and still are,
also used to influence scheduling).
If a container uses more memory than the configured limit,
the Linux Out Of Memory (OOM) killer will be invoked.
Let's compare how the container runtime on Linux typically configures memory
request and limit in cgroups, with and without the Memory QoS feature:
Memory request
The memory request is mainly used by kube-scheduler during (Kubernetes) Pod
scheduling. In cgroups v1, there are no controls to specify the minimum amount
of memory the cgroups must always retain. Hence, the container runtime did not
use the value of requested memory set in the Pod spec.
cgroups v2 introduced a memory.min setting, used to specify the minimum
amount of memory that should remain available to the processes within
a given cgroup. If the memory usage of a cgroup is within its effective
min boundary, the cgroup’s memory won’t be reclaimed under any conditions.
If the kernel cannot maintain at least memory.min bytes of memory for the
processes within the cgroup, the kernel invokes its OOM killer. In other words,
the kernel guarantees at least this much memory is available or terminates
processes (which may be outside the cgroup) in order to make memory more available.
Memory QoS maps memory.min to spec.containers[].resources.requests.memory
to ensure the availability of memory for containers in Kubernetes Pods.
Memory limit
The memory limit specifies the point beyond which, if the container tries
to allocate more memory, the Linux kernel will terminate a process with an
OOM (Out of Memory) kill. If the terminated process was the main (or only) process
inside the container, the container may exit.
In cgroups v1, the memory.limit_in_bytes interface is used to set the memory usage limit.
However, unlike CPU, it was not possible to apply memory throttling: as soon as a
container crossed the memory limit, it would be OOM killed.
In cgroups v2, memory.max is analogous to memory.limit_in_bytes in cgroups v1.
Memory QoS maps memory.max to spec.containers[].resources.limits.memory to
specify the hard limit for memory usage. If the memory consumption goes above this
level, the kernel invokes its OOM Killer.
cgroups v2 also added the memory.high configuration. Memory QoS uses memory.high
to set the memory usage throttle limit. If the memory.high limit is breached,
the offending cgroups are throttled, and the kernel tries to reclaim memory,
which may avoid an OOM kill.
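The direct mappings described above (memory request to memory.min, memory limit to memory.max) can be sketched as follows. This is an illustration rather than kubelet code: the function names are hypothetical, and the quantity parser is a simplification that handles only plain byte counts and the binary suffixes Ki, Mi, Gi, and Ti. memory.high is left out because it is derived from a formula rather than copied from the Pod spec.

```python
from typing import Dict

# Simplified subset of Kubernetes quantity suffixes (binary units only).
_BINARY_SUFFIXES = {"Ki": 1024, "Mi": 1024 ** 2, "Gi": 1024 ** 3, "Ti": 1024 ** 4}


def parse_memory(quantity: str) -> int:
    """Convert a quantity such as "64Mi" to bytes; plain integers are taken as bytes."""
    for suffix, factor in _BINARY_SUFFIXES.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    return int(quantity)


def cgroup_v2_memory_settings(resources: dict) -> Dict[str, int]:
    """Map a container's resources block to cgroup v2 memory.min and memory.max values."""
    settings: Dict[str, int] = {}
    request = resources.get("requests", {}).get("memory")
    limit = resources.get("limits", {}).get("memory")
    if request is not None:
        settings["memory.min"] = parse_memory(request)  # protected from reclaim
    if limit is not None:
        settings["memory.max"] = parse_memory(limit)  # hard limit, OOM kill beyond it
    return settings


# The resources block from the example Pod manifest.
example = {
    "requests": {"memory": "64Mi", "cpu": "250m"},
    "limits": {"memory": "64Mi", "cpu": "500m"},
}
settings = cgroup_v2_memory_settings(example)
print(settings)
```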
How it works
Cgroups v2 memory controller interfaces & Kubernetes container resources mapping
Memory QoS uses the memory controller of cgroups v2 to guarantee memory resources in
Kubernetes. The cgroups v2 interfaces that this feature uses are:
memory.max
memory.min
memory.high
Memory QoS Levels
memory.max is mapped to limits.memory specified in the Pod spec. The kubelet and
the container runtime configure the limit in the respective cgroup. The kernel
enforces the limit to prevent the container from using more than the configured
resource limit. If a process in a container tries to consume more than the
specified limit, the kernel terminates one or more processes with an
Out of Memory (OOM) error.
memory.max maps to limits.memory
memory.min is mapped to requests.memory, which results in reservation of memory resources
that should never be reclaimed by the kernel. This is how Memory QoS ensures the availability of
memory for Kubernetes pods. If there's no unprotected reclaimable memory available, the OOM
killer is invoked to make more memory available.
memory.min maps to requests.memory
For memory protection, in addition to the original way of limiting memory usage, Memory QoS
throttles a workload approaching its memory limit, ensuring that the system is not overwhelmed
by sporadic increases in memory usage. A new field, memoryThrottlingFactor, is available in
the KubeletConfiguration when you enable the MemoryQoS feature. It is set to 0.9 by default.
memory.high is mapped to a throttling limit calculated by using memoryThrottlingFactor,
requests.memory and limits.memory as in the formula below, and rounding down the
value to the nearest multiple of the page size:
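$memory.high = \lfloor (requests.memory + memoryThrottlingFactor \times (limits.memory - requests.memory)) / pageSize \rfloor \times pageSize$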
Note: If a container has no memory limits specified, node allocatable memory is substituted for limits.memory.
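As a worked illustration, the sketch below reads the description above as: start from the request, add memoryThrottlingFactor times the gap between the limit (or node allocatable memory when no limit is set) and the request, then round down to a multiple of the page size. That reading, the function name, the 4096-byte page size, and the sample values are assumptions. The KEP remains the authoritative source for the formula.

```python
import math
from typing import Optional

MI = 1024 ** 2
GI = 1024 ** 3


def memory_high(request_bytes: int,
                limit_bytes: Optional[int],
                node_allocatable_bytes: int,
                throttling_factor: float = 0.9,
                page_size: int = 4096) -> int:
    """Compute a memory.high throttle limit, rounded down to a page-size multiple."""
    # With no memory limit, node allocatable memory stands in for the limit.
    upper = limit_bytes if limit_bytes is not None else node_allocatable_bytes
    raw = request_bytes + throttling_factor * (upper - request_bytes)
    return math.floor(raw / page_size) * page_size


# Hypothetical container: 64Mi request, 128Mi limit, on a node with 8Gi allocatable.
with_limit = memory_high(64 * MI, 128 * MI, 8 * GI)
print(with_limit)  # 127504384

# Hypothetical container with a 1Gi request and no limit on a 4Gi node.
without_limit = memory_high(1 * GI, None, 4 * GI)
print(without_limit)
```

With the default factor of 0.9, throttling starts at roughly 90% of the distance between request and limit, so a container is slowed down before it reaches the hard limit.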
Summary:
File | Description
memory.max | memory.max specifies the maximum memory limit a container is allowed to use. If a process within the container tries to consume more memory than the configured limit, the kernel terminates the process with an Out of Memory (OOM) error. It is mapped to the container's memory limit specified in the Pod manifest.
memory.min | memory.min specifies a minimum amount of memory the cgroups must always retain, i.e., memory that should never be reclaimed by the system. If there's no unprotected reclaimable memory available, an OOM kill is invoked. It is mapped to the container's memory request specified in the Pod manifest.
memory.high | memory.high specifies the memory usage throttle limit. This is the main mechanism to control a cgroup's memory use. If a cgroup's memory use goes over the high boundary specified here, the cgroup's processes are throttled and put under heavy reclaim pressure. Kubernetes uses a formula to calculate memory.high, depending on the container's memory request, memory limit or node allocatable memory (if the container's memory limit is empty) and a throttling factor. Please refer to the KEP for more details on the formula.
Note: memory.high is set only on container level cgroups while memory.min is set on
container, pod, and node level cgroups.
memory.min calculations for cgroups hierarchy
When container memory requests are made, kubelet passes memory.min to the back-end
CRI runtime (such as containerd or CRI-O) via the Unified field in CRI during
container creation. The memory.min in container level cgroups will be set to:
$memory.min = pod.spec.containers[i].resources.requests[memory]$
for every ith container in a pod
Since the memory.min interface requires that the ancestor cgroups directories are all
set, the pod and node cgroups directories need to be set correctly.
memory.min in pod level cgroup:
$memory.min = \sum_{i=0}^{no. of containers}pod.spec.containers[i].resources.requests[memory]$
for every ith container in a pod
memory.min in node level cgroup:
$memory.min = \sum_{i}^{no. of pods}\sum_{j}^{no. of containers}pod[i].spec.containers[j].resources.requests[memory]$
for every jth container in every ith pod on a node
Kubelet will manage the cgroups hierarchy of the pod level and node level cgroups
directly using the libcontainer library (from the runc project), while container
cgroups limits are managed by the container runtime.
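The three levels of the hierarchy can be sketched as nested sums: a container's memory.min is its own request, a pod's is the sum over its containers, and the node's is the sum over its pods. The data layout (dictionaries with a memory_request_bytes key) and the function names are hypothetical, and the sketch ignores everything else the kubelet accounts for.

```python
from typing import List

MI = 1024 ** 2


def container_memory_min(container: dict) -> int:
    """memory.min for a container cgroup: the container's memory request (0 if unset)."""
    return container.get("memory_request_bytes", 0)


def pod_memory_min(pod: dict) -> int:
    """memory.min for a pod cgroup: sum of the requests of its containers."""
    return sum(container_memory_min(c) for c in pod["containers"])


def node_memory_min(pods: List[dict]) -> int:
    """memory.min for the node cgroup: sum over every container of every pod."""
    return sum(pod_memory_min(p) for p in pods)


# Hypothetical pods on one node.
pods = [
    {"name": "web", "containers": [
        {"name": "nginx", "memory_request_bytes": 64 * MI},
        {"name": "sidecar", "memory_request_bytes": 32 * MI},
    ]},
    {"name": "batch", "containers": [
        {"name": "worker", "memory_request_bytes": 128 * MI},
        {"name": "no-request"},  # a container without a memory request
    ]},
]

for pod in pods:
    print(pod["name"], pod_memory_min(pod))
print("node", node_memory_min(pods))
```

Summing upward in this way keeps each ancestor's memory.min at least as large as the total protection promised to its children, which matches the requirement that ancestor cgroup directories be set.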
Support for Pod QoS classes
Based on user feedback for the Alpha feature in Kubernetes v1.22, some users would like
to opt out of MemoryQoS on a per-pod basis to ensure there is no early memory throttling.
Therefore, in Kubernetes v1.27 Memory QoS also supports setting memory.high according to
the Pod's Quality of Service (QoS) class. The following are the different cases for memory.high
per QoS class:
Guaranteed pods by their QoS definition require memory requests=memory limits and are
not overcommitted. Hence the MemoryQoS feature is disabled on those pods by not setting
memory.high. This ensures that Guaranteed pods can fully use their memory requests up
to their set limit, and not hit any throttling.
Burstable pods by their QoS definition require at least one container in the Pod with
CPU or memory request or limit set.
When requests.memory and limits.memory are set, the formula is used as-is:
memory.high when requests and limits are set
When requests.memory is set and limits.memory is not set, node allocatable memory is
substituted for limits.memory in the formula:
memory.high when requests and limits are not set
BestEffort by their QoS de...
UEFI Secure Boot on the Raspberry Pi
A port of the free software TianoCore UEFI firmware can be used instead of the proprietary boot blob to boot the Raspberry Pi. This allows to install Debian on the RPi …
DOJ Detected SolarWinds Breach Months Before Public Disclosure
In May 2020, the US Department of Justice noticed Russian hackers in its network but did not realize the significance of what it had found for six months.