From: Dominik Csapak <d.csapak@proxmox.com>
To: Thomas Lamprecht <t.lamprecht@proxmox.com>,
Proxmox VE development discussion <pve-devel@lists.proxmox.com>
Subject: Re: [pve-devel] [RFC PATCH pve-cluster] fix #5728: pmxcfs: allow bigger writes than 4k for fuse
Date: Fri, 20 Sep 2024 08:16:32 +0200 [thread overview]
Message-ID: <fdbbdb59-0e66-425c-a0ba-b5f9de69ba9b@proxmox.com> (raw)
In-Reply-To: <d5b69768-acc4-45f7-bd3b-7cfe26cb261a@proxmox.com>
On 9/19/24 16:57, Thomas Lamprecht wrote:
> Am 19/09/2024 um 14:45 schrieb Dominik Csapak:
>> On 9/19/24 14:01, Thomas Lamprecht wrote:
>>> Am 19/09/2024 um 11:52 schrieb Dominik Csapak:
>>>> by default libfuse2 limits writes to 4k size, which means that on writes
>>>> bigger than that, we do a whole write cycle for each 4k block that comes
>>>> in. To avoid that, add the option 'big_writes' to allow writes bigger
>>>> than 4k at once.
>>>>
>>>> This should improve pmxcfs performance for situations where we often
>>>> write large files (e.g. big ha status) and maybe reduce writes to disk.
>>>
>>> Should? Something like before/after for benchmark numbers, flamegraphs
>>> would be really good to have, without those it's rather hard to discuss
>>> this, and I'd like to avoid having to do those, or check the inner workings
>>> of the affected fuse userspace/kernel code paths here myself.
>>
>> well I mean the code change is relatively small and the result is rather clear:
>
> Well sure the code change is just setting an option... But the actual change is
> abstracted away and would benefit from actually looking into..
>
>> in the current case we have the following calls from pmxcfs (shortened for e-mail)
>> when writing a single 128k block:
>> (dd if=... of=/etc/pve/test bs=128k count=1)
>
> Better than nothing but still no actual numbers (reduced time, reduced write amp
> in combination with sqlite, ...), some basic analysis over file/write size distribution
> on a single node and (e.g. three node) cluster, ...
> If that's all obvious for you then great, but as already mentioned in the past, I
> want actual data in commit messages for such stuff, and I cannot really see a downside
> of having such numbers.
>
> Again, as is I'm not really seeing what's to discuss, you send it as RFC after
> all.
>
>> [...]
>> so a factor of 32 less calls to cfs_fuse_write (including memdb_pwrite)
>
> That can be huge or not so big at all, i.e. as mentioned above, it would we good to
> measure the impact through some other metrics.
>
> And FWIW, I used bpftrace to count [0] with an unpatched pmxcfs, there I get
> the 32 calls to cfs_fuse_write only for a new file, overwriting the existing
> file again with the same amount of data (128k) just causes a single call.
> I tried using more data (e.g. from 128k initially to 256k or 512k) and it's
> always the data divided by 128k (even if the first file has a different size)
>
> We do not override existing files often, but rather write to a new file and
> then rename, but still quite interesting and IMO really showing that just
> because this is 1 +-1 line change it doesn't necessarily have to be trivial
> and obvious in its effects.
>
> [0]: bpftrace -e 'u:cfs_fuse_write /str(args->path) == "/test"/ {@ = count();} END { print(@) }' -p "$(pidof pmxcfs)"
>
>
>>>> If we'd change to libfuse3, this would be a non-issue, since that option
>>>> got removed and is the default there.
>>>
>>> I'd prefer that. At least if done with the future PVE 9.0, as I do not think
>>> it's a good idea in the middle of a stable release cycle.
>>
>> why not this change now, and the rewrite to libfuse3 later? that way we can
>> have some improvements now too...
>
> Because I want some actual data and reasoning first, even if it's quite likely
> that this improves things Somehow™, I'd like to actually know in what metrics
> and by how much (even if just an upper bound due to the benchmark or some
> measurement being rather artificial).
>
> I mean you name the big HA status, why not measure that for real? like, probably
> among other things, in terms of bytes hitting the block layer, i.e. the actual
> backing disk from those requests as then we'd know for real if this can reduce
> the write load there, not just that it maybe should.
sure, since you insist, I'll do some benchmarks today (when i have the time).
I'm just not sure if doing that is actually worth it. Would it make any
difference for the patch if the improvements were 1%, 10% or 1000% ?
The resulting change in behavior seems rather obvious to me:
less fuse -> db round trips for big writes, which can really only make the
whole thing more efficient, so IMHO it's very clear to me that this is an
improvement.
I sent this as an RFC because, as i wrote in the beginning, I'd like
for someone with more C/pmxcfs knowledge/experience to check or warn
for edge cases where we might depend on lower blocksizes (or similar things).
I very much doubt that any performance benchmarks would have revealed such things
that i wouldn't already have noticed by testing my local install.
if we do benchmarks on all changes that might impact performance (which
are most changes really) we'd probably would do little else...
_______________________________________________
pve-devel mailing list
pve-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel
next prev parent reply other threads:[~2024-09-20 6:16 UTC|newest]
Thread overview: 15+ messages / expand[flat|nested] mbox.gz Atom feed top
2024-09-19 9:52 Dominik Csapak
2024-09-19 12:01 ` Thomas Lamprecht
2024-09-19 12:45 ` Dominik Csapak
2024-09-19 14:57 ` Thomas Lamprecht
2024-09-20 4:04 ` Esi Y via pve-devel
2024-09-20 5:29 ` Esi Y via pve-devel
[not found] ` <CABtLnHoQFAUN0KcahbMF6hoX=WTfL8bHL0St77gQMSaojVGhBA@mail.gmail.com>
2024-09-20 7:32 ` Dominik Csapak
2024-09-20 6:16 ` Dominik Csapak [this message]
2024-09-22 9:25 ` Dietmar Maurer
2024-09-23 9:17 ` Dominik Csapak
2024-09-23 11:48 ` Filip Schauer
2024-09-23 14:08 ` Filip Schauer
2024-09-23 12:00 ` Friedrich Weber
2024-09-23 12:03 ` Dominik Csapak
2024-09-19 12:23 ` Esi Y via pve-devel
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=fdbbdb59-0e66-425c-a0ba-b5f9de69ba9b@proxmox.com \
--to=d.csapak@proxmox.com \
--cc=pve-devel@lists.proxmox.com \
--cc=t.lamprecht@proxmox.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox