Configuring Ceph pg_autoscale with Rook for OpenStack Deployments: A Guide to Balanced Data Distribution

At Cloudification, we deploy private clouds based on OpenStack, leveraging Rook-Ceph as a highly available storage solution. During the installation process, one of the recurring issues we faced was properly configuring the Ceph cluster to ensure balanced data distribution across OSDs (Object Storage Daemons).

The Problem: PG Imbalance Alerts

Right after a fresh installation, we started receiving PGImbalance alerts from Prometheus, indicating poorly distributed data across hosts. PG stands for Placement Group, an abstraction beneath the storage pool: each individual object in the cluster is assigned to a PG. Since the number of objects in a cluster can reach hundreds of millions, PGs allow Ceph to operate and rebalance without having to address each object individually. Let’s have a look at the Placement Groups in the cluster:

$ ceph pg dump
...
OSD_STAT  USED     AVAIL    USED_RAW  TOTAL    HB_PEERS                                                        PG_SUM  PRIMARY_PG_SUM
23        33 GiB   1.7 TiB  33 GiB    1.7 TiB  [0,2,5,7,8,10,16,18,19,22]                                      4       0
4         113 MiB  1.7 TiB  113 MiB   1.7 TiB  [0,1,2,3,5,6,8,9,11,12,14,15,16,17,20,23]                       2       1
1         49 GiB   1.7 TiB  49 GiB    1.7 TiB  [0,2,5,6,9,10,12,13,15,16,17,18,21,22]                          26      19
19        23 GiB   1.7 TiB  23 GiB    1.7 TiB  [1,2,3,5,10,16,18,20,21,22]                                     15      17
22        19 GiB   1.7 TiB  19 GiB    1.7 TiB  [4,5,6,11,15,17,19,20,21,23]                                    11      0
21        226 GiB  1.5 TiB  226 GiB   1.7 TiB  [1,3,9,10,13,16,17,18,20,22]                                    108     17
20        117 MiB  1.7 TiB  117 MiB   1.7 TiB  [0,4,7,12,14,17,18,19,21,22]                                    5       0
18        258 GiB  1.5 TiB  258 GiB   1.7 TiB  [1,5,8,10,11,14,16,17,19,21,22,23]                              122     19
17        34 GiB   1.7 TiB  34 GiB    1.7 TiB  [0,1,2,3,5,6,8,9,11,12,13,15,16,18,20,21,22,23]                 6       4
16        33 GiB   1.7 TiB  33 GiB    1.7 TiB  [0,5,7,8,11,12,13,15,17,20]                                     23      2
15        109 MiB  1.7 TiB  109 MiB   1.7 TiB  [2,10,12,14,16,18,19,21,22,23]                                  4       0
0         109 MiB  1.7 TiB  109 MiB   1.7 TiB  [1,2,7,8,12,13,14,17,20,23]                                     5       1
13        111 MiB  1.7 TiB  111 MiB   1.7 TiB  [0,1,2,3,8,9,12,14,15,17,19,21]                                 7       2
2         116 MiB  1.7 TiB  116 MiB   1.7 TiB  [1,3,8,11,15,17,18,19,20,22]                                    3       0
3         33 GiB   1.7 TiB  33 GiB    1.7 TiB  [2,4,5,7,8,9,10,11,16,23]                                       12      0
5         52 GiB   1.7 TiB  52 GiB    1.7 TiB  [1,4,6,11,12,13,14,16,17,18,19,20,21,22,23]                     16      2
6         23 GiB   1.7 TiB  23 GiB    1.7 TiB  [4,5,7,9,10,11,15,19,20,22]                                     4       2
7         793 MiB  1.7 TiB  793 MiB   1.7 TiB  [0,1,3,4,6,8,10,12,13,14,15,16,18,19,21,23]                     4       20
8         34 GiB   1.7 TiB  34 GiB    1.7 TiB  [0,5,7,9,12,13,14,18,20,22]                                     5       2
9         60 GiB   1.7 TiB  60 GiB    1.7 TiB  [0,1,3,8,10,12,13,16,17,21]                                     5       2
10        216 GiB  1.5 TiB  216 GiB   1.7 TiB  [1,3,4,5,6,7,9,11,12,14,15,16,18,19,21,22]                      101     18
11        101 MiB  1.7 TiB  101 MiB   1.7 TiB  [1,2,5,10,12,16,18,19,22,23]                                    4       1
12        54 GiB   1.7 TiB  54 GiB    1.7 TiB  [0,1,3,5,6,7,8,9,10,11,13,14,18,20,21]                          16      34
14        25 GiB   1.7 TiB  25 GiB    1.7 TiB  [4,5,6,7,10,12,13,15,19,20,22]                                  5       2
sum       1.1 TiB  41 TiB   1.1 TiB   42 TiB

Let’s check how many PGs are configured for pools:

bash-5.1$ for pool in $(ceph osd lspools | awk '{print $2}') ; do echo "pool: $pool - pg_num: $(ceph osd pool get $pool pg_num)" ; done

pool: .rgw.root - pg_num: pg_num: 1
pool: replicapool - pg_num: pg_num: 1
pool: .mgr - pg_num: pg_num: 1
pool: rgw-data-pool - pg_num: pg_num: 1
pool: s3-store.rgw.log - pg_num: pg_num: 1
pool: s3-store.rgw.control - pg_num: pg_num: 1
pool: s3-store.rgw.buckets.index - pg_num: pg_num: 1
pool: s3-store.rgw.otp - pg_num: pg_num: 1
pool: s3-store.rgw.buckets.non-ec - pg_num: pg_num: 1
pool: s3-store.rgw.meta - pg_num: pg_num: 1
pool: rgw-meta-pool - pg_num: pg_num: 1
pool: s3-store.rgw.buckets.data - pg_num: pg_num: 1
pool: cephfs-metadata - pg_num: pg_num: 1
pool: cephfs-data0 - pg_num: pg_num: 1
pool: cinder.volumes.hdd - pg_num: pg_num: 1
pool: cinder.backups - pg_num: pg_num: 1
pool: glance.images - pg_num: pg_num: 1
pool: nova.ephemeral - pg_num: pg_num: 1

This directly correlates with the imbalanced OSD utilization: Ceph had created only one Placement Group per pool, leading to inefficient data distribution.

To diagnose the issue, we used the rados df command to identify the pools consuming the most space and adjusted their pg_num accordingly. Guidance on calculating a suitable pg_num can be found in the Ceph placement-group documentation.
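As a rough illustration (an addition, not from the original post): the widely used rule of thumb is about 100 PGs per OSD divided by the pool's replica size, rounded to a power of two and then split across pools according to how much data each is expected to hold. A minimal sketch, assuming the common 3-way replication on this 24-OSD cluster and using the cinder.volumes.hdd pool from the listing above:

$ rados df                                    # which pools hold the most data?
$ ceph osd ls | wc -l                         # number of OSDs, 24 in our case
$ ceph osd pool get cinder.volumes.hdd size   # replica count of the pool, e.g. size: 3
# rule of thumb: total PGs across all pools ~= OSDs * 100 / replica size,
# rounded to a power of two and then divided among pools by expected share of data
# (24 * 100 / 3 = 800 -> ~1024 PGs cluster-wide in this example)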

If we manually reconfigure the number of PGs for several pools, for example Cinder, Nova, Glance and CephFS:

$ ceph osd pool set cinder.volumes.nvme pg_num 256
$ ceph osd pool set nova.ephemeral pg_num 16
$ ceph osd pool set glance.images pg_num 16
$ ceph osd pool set cephfs-data0 pg_num 16
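One caveat worth adding here (not from the original post): on Ceph releases before Nautilus, pgp_num had to be raised alongside pg_num before any data would actually move, whereas recent releases adjust it automatically. It can be verified with, for example:

$ ceph osd pool get nova.ephemeral pgp_num   # should follow pg_num on recent releases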

This triggers rebalancing, resulting in more balanced usage and the resolution of the alert:

bash-5.1$ ceph -s
  cluster:
    id:     a6ab9446-2c0d-42f4-b009-514e989fd4a0
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum b,d,f (age 3d)
    mgr: b(active, since 3d), standbys: a
    mds: 1/1 daemons up, 1 hot standby
    osd: 24 osds: 24 up (since 3d), 24 in (since 3d)
    rgw: 3 daemons active (3 hosts, 1 zones)

  data:
    volumes: 1/1 healthy
    pools:   17 pools, 331 pgs
    objects: 101.81k objects, 371 GiB
    usage:   1.2 TiB used, 41 TiB / 42 TiB avail
    pgs:     331 active+clean

  io:
    client: 7.4 KiB/s rd, 1.7 MiB/s wr, 9 op/s rd, 166 op/s wr

...

OSD_STAT  USED    AVAIL    USED_RAW  TOTAL    HB_PEERS                                                        PG_SUM  PRIMARY_PG_SUM
23        68 GiB  1.7 TiB  68 GiB    1.7 TiB  [0,1,2,3,4,5,6,10,11,12,13,14,16,17,18,19,22]                   37      12
4         33 GiB  1.7 TiB  33 GiB    1.7 TiB  [0,1,2,3,5,6,7,8,9,10,11,12,13,14,15,16,17,20,22,23]            34      13
1         37 GiB  1.7 TiB  37 GiB    1.7 TiB  [0,2,3,5,6,7,9,10,11,12,13,14,15,16,17,18,20,21,22]             42      13
19        39 GiB  1.7 TiB  39 GiB    1.7 TiB  [0,2,3,6,7,9,10,11,12,13,15,16,17,18,20,22,23]                  41      12
22        36 GiB  1.7 TiB  36 GiB    1.7 TiB  [0,1,2,3,4,5,6,7,8,9,10,11,12,15,16,19,21,23]                   36      11
21        62 GiB  1.7 TiB  62 GiB    1.7 TiB  [0,1,2,3,5,6,8,9,10,13,14,15,16,17,18,19,20,22]                 37      9
20        35 GiB  1.7 TiB  35 GiB    1.7 TiB  [0,1,4,6,7,8,10,12,14,15,16,17,18,19,21]                        39      10
18        67 GiB  1.7 TiB  67 GiB    1.7 TiB  [1,2,5,7,8,9,10,11,13,14,16,17,19,20,21,22,23]                  37      12
17        65 GiB  1.7 TiB  65 GiB    1.7 TiB  [0,1,2,3,4,5,6,8,9,11,12,13,15,16,18,19,20,21,22,23]            34      14
16        35 GiB  1.7 TiB  35 GiB    1.7 TiB  [0,1,4,5,7,8,9,10,11,12,13,15,17,18,19,20,21,22,23]             39      13
15        39 GiB  1.7 TiB  39 GiB    1.7 TiB  [1,2,6,10,12,13,14,16,18,19,21,23]                              41      5
0         34 GiB  1.7 TiB  34 GiB    1.7 TiB  [1,2,4,5,7,8,9,10,11,12,13,14,15,16,17,19,20,21,22,23]          37      13
13        31 GiB  1.7 TiB  31 GiB    1.7 TiB  [0,1,2,3,4,5,6,7,8,9,12,14,15,16,17,18,19,20,21,22,23]          36      16
2         33 GiB  1.7 TiB  33 GiB    1.7 TiB  [0,1,3,6,8,11,13,14,15,16,17,18,19,20,21,22]                    34      11
3         33 GiB  1.7 TiB  33 GiB    1.7 TiB  [2,4,5,7,8,9,10,12,13,15,16,17,19,21,22,23]                     33      12
5         64 GiB  1.7 TiB  64 GiB    1.7 TiB  [0,1,3,4,6,8,10,11,12,13,14,15,16,17,18,19,20,21,22,23]         37      9
6         54 GiB  1.7 TiB  54 GiB    1.7 TiB  [1,4,5,7,8,9,10,11,12,13,14,15,16,19,20,21,22,23]               32      9
7         38 GiB  1.7 TiB  38 GiB    1.7 TiB  [0,1,3,4,6,8,10,11,12,13,14,15,16,17,18,19,20,22,23]            39      11
8         65 GiB  1.7 TiB  65 GiB    1.7 TiB  [0,3,5,6,7,9,10,12,13,14,15,17,18,20,22]                        33      14
9         95 GiB  1.7 TiB  95 GiB    1.7 TiB  [0,1,3,6,8,10,11,12,13,14,15,16,17,18,19,20,21,23]              36      11
10        62 GiB  1.7 TiB  62 GiB    1.7 TiB  [0,3,4,5,6,7,8,9,11,14,15,16,17,18,19,20,21,22,23]              36      14
11        35 GiB  1.7 TiB  35 GiB    1.7 TiB  [0,1,2,3,5,8,9,10,12,14,15,16,18,19,20,22,23]                   37      14
12        58 GiB  1.7 TiB  58 GiB    1.7 TiB  [0,1,3,4,5,6,7,8,9,11,13,14,15,17,18,19,20,21,23]               35      13
14        56 GiB  1.7 TiB  56 GiB    1.7 TiB  [1,2,4,5,6,7,8,9,10,12,13,15,18,19,20,21,22,23]                 34      15
sum       1.1 TiB 41 TiB   1.1 TiB   42 TiB

Why did this happen?

By default, Ceph might not create the optimal number of PGs for each pool, resulting in data skew and uneven utilization of storage devices. Manually setting the pg_num for each pool is not a sustainable solution, as data volume is expected to grow over time.

That means we need an automated way to keep pg_num in step with data growth, which is exactly what the Ceph PG autoscaler (pg_autoscale) provides.
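As a minimal sketch of where this leads (our illustration, not the post's exact configuration): the autoscaler can be switched on per pool or made the default for new pools, and with Rook the same setting can be applied declaratively on the pool custom resource. The pool name replicapool and the rook-ceph namespace below are the common Rook defaults, assumed here for illustration:

# Enable the autoscaler on an existing pool
$ ceph osd pool set replicapool pg_autoscale_mode on

# Make it the default for newly created pools
$ ceph config set global osd_pool_default_pg_autoscale_mode on

# See what the autoscaler observes and recommends per pool
$ ceph osd pool autoscale-status

# With Rook, the same can be expressed on the CephBlockPool resource,
# e.g. by patching its parameters map:
$ kubectl -n rook-ceph patch cephblockpool replicapool --type merge \
    -p '{"spec":{"parameters":{"pg_autoscale_mode":"on"}}}'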
