public inbox for pve-devel@lists.proxmox.com
 help / color / mirror / Atom feed
From: Dominik Csapak <d.csapak@proxmox.com>
To: Thomas Lamprecht <t.lamprecht@proxmox.com>,
	Proxmox VE development discussion <pve-devel@lists.proxmox.com>
Subject: Re: [pve-devel] [RFC PATCH pve-cluster] fix #5728: pmxcfs: allow bigger writes than 4k for fuse
Date: Fri, 20 Sep 2024 08:16:32 +0200	[thread overview]
Message-ID: <fdbbdb59-0e66-425c-a0ba-b5f9de69ba9b@proxmox.com> (raw)
In-Reply-To: <d5b69768-acc4-45f7-bd3b-7cfe26cb261a@proxmox.com>

On 9/19/24 16:57, Thomas Lamprecht wrote:
> Am 19/09/2024 um 14:45 schrieb Dominik Csapak:
>> On 9/19/24 14:01, Thomas Lamprecht wrote:
>>> Am 19/09/2024 um 11:52 schrieb Dominik Csapak:
>>>> by default libfuse2 limits writes to 4k size, which means that on writes
>>>> bigger than that, we do a whole write cycle for each 4k block that comes
>>>> in. To avoid that, add the option 'big_writes' to allow writes bigger
>>>> than 4k at once.
>>>>
>>>> This should improve pmxcfs performance for situations where we often
>>>> write large files (e.g. big ha status) and maybe reduce writes to disk.
>>>
>>> Should? Something like before/after for benchmark numbers, flamegraphs
>>> would be really good to have, without those it's rather hard to discuss
>>> this, and I'd like to avoid having to do those, or check the inner workings
>>> of the affected fuse userspace/kernel code paths here myself.
>>
>> well I mean the code change is relatively small and the result is rather clear:
> 
> Well sure the code change is just setting an option... But the actual change is
> abstracted away and would benefit from actually looking into..
> 
>> in the current case we have the following calls from pmxcfs (shortened for e-mail)
>> when writing a single 128k block:
>> (dd if=... of=/etc/pve/test bs=128k count=1)
> 
> Better than nothing but still no actual numbers (reduced time, reduced write amp
> in combination with sqlite, ...), some basic analysis over file/write size distribution
> on a single node and (e.g. three node) cluster, ...
> If that's all obvious for you then great, but as already mentioned in the past, I
> want actual data in commit messages for such stuff, and I cannot really see a downside
> of having such numbers.
> 
> Again, as is I'm not really seeing what's to discuss, you send it as RFC after
> all.
> 
>> [...]
>> so a factor of 32 less calls to cfs_fuse_write (including memdb_pwrite)
> 
> That can be huge or not so big at all, i.e. as mentioned above, it would we good to
> measure the impact through some other metrics.
> 
> And FWIW, I used bpftrace to count [0] with an unpatched pmxcfs, there I get
> the 32 calls to cfs_fuse_write only for a new file, overwriting the existing
> file again with the same amount of data (128k) just causes a single call.
> I tried using more data (e.g. from 128k initially to 256k or 512k) and it's
> always the data divided by 128k (even if the first file has a different size)
> 
> We do not override existing files often, but rather write to a new file and
> then rename, but still quite interesting and IMO really showing that just
> because this is 1 +-1 line change it doesn't necessarily have to be trivial
> and obvious in its effects.
> 
> [0]: bpftrace -e 'u:cfs_fuse_write /str(args->path) == "/test"/ {@ = count();} END { print(@) }' -p "$(pidof pmxcfs)"
> 
> 
>>>> If we'd change to libfuse3, this would be a non-issue, since that option
>>>> got removed and is the default there.
>>>
>>> I'd prefer that. At least if done with the future PVE 9.0, as I do not think
>>> it's a good idea in the middle of a stable release cycle.
>>
>> why not this change now, and the rewrite to libfuse3 later? that way we can
>> have some improvements now too...
> 
> Because I want some actual data and reasoning first, even if it's quite likely
> that this improves things Somehow™, I'd like to actually know in what metrics
> and by how much (even if just an upper bound due to the benchmark or some
> measurement being rather artificial).
> 
> I mean you name the big HA status, why not measure that for real? like, probably
> among other things, in terms of bytes hitting the block layer, i.e. the actual
> backing disk from those requests as then we'd know for real if this can reduce
> the write load there, not just that it maybe should.

sure, since you insist, I'll do some benchmarks today (when i have the time).

I'm just not sure if doing that is actually worth it. Would it make any
difference for the patch if the improvements were 1%, 10% or 1000% ?

The resulting change in behavior seems rather obvious to me:

less fuse -> db round trips for big writes, which can really only make the
whole thing more efficient, so IMHO it's very clear to me that this is an
improvement.

I sent this as an RFC because, as i wrote in the beginning, I'd like
for someone with more C/pmxcfs knowledge/experience to check or warn
for edge cases where we might depend on lower blocksizes (or similar things).
I very much doubt that any performance benchmarks would have revealed such things
that i wouldn't already have noticed by testing my local install.

if we do benchmarks on all changes that might impact performance (which
are most changes really) we'd probably would do little else...


_______________________________________________
pve-devel mailing list
pve-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel

  parent reply	other threads:[~2024-09-20  6:16 UTC|newest]

Thread overview: 15+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-09-19  9:52 Dominik Csapak
2024-09-19 12:01 ` Thomas Lamprecht
2024-09-19 12:45   ` Dominik Csapak
2024-09-19 14:57     ` Thomas Lamprecht
2024-09-20  4:04       ` Esi Y via pve-devel
2024-09-20  5:29         ` Esi Y via pve-devel
     [not found]         ` <CABtLnHoQFAUN0KcahbMF6hoX=WTfL8bHL0St77gQMSaojVGhBA@mail.gmail.com>
2024-09-20  7:32           ` Dominik Csapak
2024-09-20  6:16       ` Dominik Csapak [this message]
2024-09-22  9:25       ` Dietmar Maurer
2024-09-23  9:17       ` Dominik Csapak
2024-09-23 11:48         ` Filip Schauer
2024-09-23 14:08           ` Filip Schauer
2024-09-23 12:00         ` Friedrich Weber
2024-09-23 12:03           ` Dominik Csapak
2024-09-19 12:23 ` Esi Y via pve-devel

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=fdbbdb59-0e66-425c-a0ba-b5f9de69ba9b@proxmox.com \
    --to=d.csapak@proxmox.com \
    --cc=pve-devel@lists.proxmox.com \
    --cc=t.lamprecht@proxmox.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox
Service provided by Proxmox Server Solutions GmbH | Privacy | Legal