From: Fiona Ebner <f.ebner@proxmox.com>
To: "Fabian Grünbichler" <f.gruenbichler@proxmox.com>,
	"Proxmox VE development discussion" <pve-devel@lists.proxmox.com>
Subject: Re: [pve-devel] [PATCH storage 2/2] rbd plugin: status: explain why percentage value can be different from Ceph
Date: Wed, 14 May 2025 11:31:17 +0200
Message-ID: <651c22bb-69b3-43f4-9ed8-9357ce828bcf@proxmox.com>
In-Reply-To: <451129351.14846.1747213617524@webmail.proxmox.com>

On 14.05.25 at 11:06, Fabian Grünbichler wrote:
>> Fiona Ebner <f.ebner@proxmox.com> wrote on 14.05.2025 at 10:22 CEST:
>>
>>  
>> On 13.05.25 at 15:31, Fiona Ebner wrote:
>>> Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
>>> ---
>>>  src/PVE/Storage/RBDPlugin.pm | 6 ++++++
>>>  1 file changed, 6 insertions(+)
>>>
>>> diff --git a/src/PVE/Storage/RBDPlugin.pm b/src/PVE/Storage/RBDPlugin.pm
>>> index 154fa00..b56f8e4 100644
>>> --- a/src/PVE/Storage/RBDPlugin.pm
>>> +++ b/src/PVE/Storage/RBDPlugin.pm
>>> @@ -703,6 +703,12 @@ sub status {
>>>  
>>>      # max_avail -> max available space for data w/o replication in the pool
>>>      # stored -> amount of user data w/o replication in the pool
>>> +    # NOTE These values are used because they are most natural from a user perspective.
>>> +    # However, the %USED/percent_used value in Ceph is calculated from values before factoring out
>>> +    # replication, namely 'bytes_used / (bytes_used + avail_raw)'. In certain setups, e.g. with LZ4
>>> +    # compression, this percentage can be noticeably different from the percentage
>>> +    # 'stored / (stored + max_avail)' shown in the Proxmox VE CLI/UI. See also src/mon/PGMap.cc from
>>> +    # the Ceph source code, which also mentions that 'stored' is an approximation.
>>>      my $free = $d->{stats}->{max_avail};
>>>      my $used = $d->{stats}->{stored};
>>>      my $total = $used + $free;
>>
>> Thinking about this again, I don't think continuing to use 'stored' is
>> best after all, because that is before compression. And this is where
>> the mismatch really comes from AFAICT. For highly compressible data, the
>> mismatch between actual usage on the storage and 'stored' can be very
>> big (in a quick test using the 'yes' command to fill an RBD image, I got
>> stored = 2 * (used / replication_count)). And here in the storage stats
>> we are interested in the usage on the storage, not the actual amount of
>> data written by the user. For ZFS we also don't use 'logicalused', but
>> 'used'.
> 
> but for ZFS, we actually use the "logical" view provided by `zfs list/get`,
> not the "physical" view provided by `zpool list/get` (and even the latter
> would already account for redundancy).

But that is not the same logical view as 'logicalused', which would not
consider compression.
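
For illustration (assuming I'm reading the ZFS properties right), on a
dataset with compression enabled something like

$ zfs get used,logicalused,compressratio testpool/data

should show 'used' (after compression) well below 'logicalused' (before
compression).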

> 
> e.g., with a testpool consisting of three mirrored vdevs of size 1G, with
> a single dataset filled with a file with 512MB of random data:
> 
> $ zpool list -v testpool
> NAME                 SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
> testpool             960M   513M   447M        -         -    42%    53%  1.00x    ONLINE  -
>   mirror-0           960M   513M   447M        -         -    42%  53.4%      -    ONLINE
>     /tmp/vdev1.img     1G      -      -        -         -      -      -      -    ONLINE
>     /tmp/vdev2.img     1G      -      -        -         -      -      -      -    ONLINE
>     /tmp/vdev3.img     1G      -      -        -         -      -      -      -    ONLINE
> 
> and what we use for the storage status:
> 
> $ zfs get available,used testpool/data
> NAME           PROPERTY   VALUE  SOURCE
> testpool/data  available  319M   -
> testpool/data  used       512M   -
> 
> if we switch away from `stored`, we'd have to account for replication
> ourselves to match that, right? and we don't have that information
> readily available (and also no idea how to handle EC pools?)? wouldn't
> we just exchange one wrong set of numbers with another (differently)
> wrong set of numbers?

I would've used avail_raw / max_avail to calculate the replication
factor and apply that to bytes_used. Sure, it won't be perfect, but it
should lead to matching the percent_used reported by Ceph:

percent_used = bytes_used / (bytes_used + avail_raw)
max_avail = avail_raw / rep

(rep is called raw_used_rate in the Ceph source, but I'm shortening it for
readability; bytes_used and avail_raw are the raw values from 'ceph df detail')

Thus:
rep = avail_raw / max_avail

our_used = bytes_used / rep
our_avail = max_avail = avail_raw / rep

our_percentage = our_used / (our_used + our_avail)
               = (bytes_used/rep) / (bytes_used/rep + avail_raw/rep)
               = bytes_used / (bytes_used + avail_raw)    (rep cancels out)
               = percent_used from Ceph

The point is that it'd be much better than not considering compression.
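
Concretely, something like this instead of the current assignments (an
untested sketch; it assumes 'bytes_used' and 'avail_raw' end up in
$d->{stats} just like in the 'ceph df detail' output you quote below):

    my $stats = $d->{stats};
    # effective replication/EC factor: raw free space vs. replication-adjusted free space
    my $rep = $stats->{max_avail} ? $stats->{avail_raw} / $stats->{max_avail} : 1;
    my $free = $stats->{max_avail};
    # bytes_used is the actual raw consumption (post-compression, all replicas),
    # so scale it down by the replication factor
    my $used = int($stats->{bytes_used} / $rep);
    my $total = $used + $free;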

> 
> FWIW, we already provide raw numbers in the pool view, and could maybe
> expand that view to provide more details?
> 
> e.g., for my test rbd pool the pool view shows 50.29% used amounting to
> 163.43GiB, whereas the storage status says 51.38% used amounting to
> 61.11GB of 118.94GB, with the default 3/2 replication
> 
> ceph df detail says:
> 
> {
>       "name": "rbd",
>       "id": 2,
>       "stats": {
>         "stored": 61108710142,               => /1000/1000/1000 == storage used

But this is not really "storage used". This is the amount of user data,
before compression. The actual usage on the storage can be much lower
than this.

>         "stored_data": 61108699136,
>         "stored_omap": 11006,
>         "objects": 15579,
>         "kb_used": 171373017,
>         "bytes_used": 175485968635,          => /1024/1024/1024 == pool used
>         "data_bytes_used": 175485935616,
>         "omap_bytes_used": 33019,
>         "percent_used": 0.5028545260429382,  => rounded this is the pool view percentage
>         "max_avail": 57831211008,            => (this + stored)/1000/1000/1000 storage total
>         "quota_objects": 0,
>         "quota_bytes": 0,
>         "dirty": 0,
>         "rd": 253354,
>         "rd_bytes": 38036885504,
>         "wr": 75833,
>         "wr_bytes": 33857918976,
>         "compress_bytes_used": 0,
>         "compress_under_bytes": 0,
>         "stored_raw": 183326130176,
>         "avail_raw": 173493638191
>       }
>     },
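
Spelling out the two percentages with these numbers:

storage status: stored / (stored + max_avail)
                = 61108710142 / (61108710142 + 57831211008) ~ 0.5138 -> 51.38%
pool view:      bytes_used / (bytes_used + avail_raw)
                = 175485968635 / (175485968635 + 173493638191) ~ 0.5029 -> 50.29%

so the gap you see is exactly the difference between using the
replication-adjusted 'stored'/'max_avail' pair and the raw
'bytes_used'/'avail_raw' pair.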
> 
> 
>> From src/osd/osd_types.h:
>>
>>>   int64_t data_stored = 0;                ///< Bytes actually stored by the user
>>>   int64_t data_compressed = 0;            ///< Bytes stored after compression
>>>   int64_t data_compressed_allocated = 0;  ///< Bytes allocated for compressed data
>>>   int64_t data_compressed_original = 0;   ///< Bytes that were compressed



