* [pve-devel] [RFC PATCH pve-cluster] fix #5728: pmxcfs: allow bigger writes than 4k for fuse
@ 2024-09-19 9:52 Dominik Csapak
2024-09-19 12:01 ` Thomas Lamprecht
2024-09-19 12:23 ` Esi Y via pve-devel
0 siblings, 2 replies; 15+ messages in thread
From: Dominik Csapak @ 2024-09-19 9:52 UTC (permalink / raw)
To: pve-devel
by default libfuse2 limits writes to 4k size, which means that on writes
bigger than that, we do a whole write cycle for each 4k block that comes
in. To avoid that, add the option 'big_writes' to allow writes bigger
than 4k at once.
This should improve pmxcfs performance for situations where we often
write large files (e.g. big ha status) and maybe reduce writes to disk.
If we'd change to libfuse3, this would be a non-issue, since that option
got removed and is the default there.
Signed-off-by: Dominik Csapak <d.csapak@proxmox.com>
---
sending as RFC, since I'm not sure if there are any side effects I
overlooked. I scanned the pmxcfs code for potential problems where we
would e.g. assume a maximum size of 4096 for fuse writes, but could not
find any (also, my C foo is not that great). I'm running it here currently,
and it seems to work just fine atm.
src/pmxcfs/pmxcfs.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/src/pmxcfs/pmxcfs.c b/src/pmxcfs/pmxcfs.c
index 1cf8ab8..c5b8f04 100644
--- a/src/pmxcfs/pmxcfs.c
+++ b/src/pmxcfs/pmxcfs.c
@@ -945,7 +945,7 @@ int main(int argc, char *argv[])
mkdir(CFSDIR, 0755);
- char *fa[] = { "-f", "-odefault_permissions", "-oallow_other", NULL};
+ char *fa[] = { "-f", "-odefault_permissions", "-oallow_other", "-obig_writes", NULL};
struct fuse_args fuse_args = FUSE_ARGS_INIT(sizeof (fa)/sizeof(gpointer) - 1, fa);
--
2.39.2
* Re: [pve-devel] [RFC PATCH pve-cluster] fix #5728: pmxcfs: allow bigger writes than 4k for fuse
2024-09-19 9:52 [pve-devel] [RFC PATCH pve-cluster] fix #5728: pmxcfs: allow bigger writes than 4k for fuse Dominik Csapak
@ 2024-09-19 12:01 ` Thomas Lamprecht
2024-09-19 12:45 ` Dominik Csapak
2024-09-19 12:23 ` Esi Y via pve-devel
1 sibling, 1 reply; 15+ messages in thread
From: Thomas Lamprecht @ 2024-09-19 12:01 UTC (permalink / raw)
To: Proxmox VE development discussion, Dominik Csapak
On 19/09/2024 at 11:52, Dominik Csapak wrote:
> by default libfuse2 limits writes to 4k size, which means that on writes
> bigger than that, we do a whole write cycle for each 4k block that comes
> in. To avoid that, add the option 'big_writes' to allow writes bigger
> than 4k at once.
>
> This should improve pmxcfs performance for situations where we often
> write large files (e.g. big ha status) and maybe reduce writes to disk.
Should? Something like before/after for benchmark numbers, flamegraphs
would be really good to have, without those it's rather hard to discuss
this, and I'd like to avoid having to do those, or check the inner workings
of the affected fuse userspace/kernel code paths here myself.
>
> If we'd change to libfuse3, this would be a non-issue, since that option
> got removed and is the default there.
I'd prefer that. At least if done with the future PVE 9.0, as I do not think
it's a good idea in the middle of a stable release cycle.
FWIW, some Debian Devs also want to push the fuse 2 to 3 migration forward
[0]. So, sooner or later we have to switch over pmxcfs anyway, and there's
already a POC from Fabian [1].
[0]: https://lists.debian.org/debian-devel/2024/08/msg00177.html
[1]: https://lore.proxmox.com/pve-devel/20220715074742.3387301-1-f.gruenbichler@proxmox.com/
* Re: [pve-devel] [RFC PATCH pve-cluster] fix #5728: pmxcfs: allow bigger writes than 4k for fuse
2024-09-19 9:52 [pve-devel] [RFC PATCH pve-cluster] fix #5728: pmxcfs: allow bigger writes than 4k for fuse Dominik Csapak
2024-09-19 12:01 ` Thomas Lamprecht
@ 2024-09-19 12:23 ` Esi Y via pve-devel
1 sibling, 0 replies; 15+ messages in thread
From: Esi Y via pve-devel @ 2024-09-19 12:23 UTC (permalink / raw)
To: Proxmox VE development discussion; +Cc: Esi Y
From: Esi Y <esiy0676+proxmox@gmail.com>
To: Proxmox VE development discussion <pve-devel@lists.proxmox.com>
Subject: Re: [pve-devel] [RFC PATCH pve-cluster] fix #5728: pmxcfs: allow bigger writes than 4k for fuse
Date: Thu, 19 Sep 2024 14:23:55 +0200
Message-ID: <CABtLnHpSk+XWjzfvG=qOqGOet0JFda4_Lb_DkqR3ZgjsXvQHFA@mail.gmail.com>
First of all, thanks for spotting this!
I believe you will still be limited to 128K; this could be tuned up to
a max of 1M [1] (coincidentally, the current file size limit in pmxcfs), but
that needs FUSE 3.6.0 [2]:
[1] https://github.com/torvalds/linux/commit/5da784cce4308ae10a79e3c8c41b13fb9568e4e0
[2] https://github.com/libfuse/libfuse/blob/master/ChangeLog.rst#libfuse-360-2019-06-13
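For reference, raising that limit under libfuse3 would look roughly like the
sketch below (the handler name is made up, and whether the full 1M is actually
granted depends on the kernel and libfuse versions referenced above):

```
#define FUSE_USE_VERSION 35
#include <fuse.h>

/* hypothetical fuse3 init handler: ask for up to 1 MiB per write request;
 * the kernel and libfuse negotiate this down if max_pages is unsupported */
static void *my_init(struct fuse_conn_info *conn, struct fuse_config *cfg)
{
    (void)cfg;
    conn->max_write = 1024 * 1024;
    return NULL;
}
```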
However, this should obviously be buffered within the pmxcfs logic instead.
--
Esi Y
On Thu, Sep 19, 2024 at 11:52 AM Dominik Csapak <d.csapak@proxmox.com> wrote:
>
> by default libfuse2 limits writes to 4k size, which means that on writes
> bigger than that, we do a whole write cycle for each 4k block that comes
> in. To avoid that, add the option 'big_writes' to allow writes bigger
> than 4k at once.
>
> This should improve pmxcfs performance for situations where we often
> write large files (e.g. big ha status) and maybe reduce writes to disk.
>
> If we'd change to libfuse3, this would be a non-issue, since that option
> got removed and is the default there.
>
> Signed-off-by: Dominik Csapak <d.csapak@proxmox.com>
> ---
> sending as RFC, since I'm not sure if there are any side effects I
> overlooked. I scanned the pmxcfs code for potential problems where we
> would e.g. assume a maximum size of 4096 for fuse writes, but could not
> find any (also, my C foo is not that great). I'm running it here currently,
> and it seems to work just fine atm.
>
> src/pmxcfs/pmxcfs.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/src/pmxcfs/pmxcfs.c b/src/pmxcfs/pmxcfs.c
> index 1cf8ab8..c5b8f04 100644
> --- a/src/pmxcfs/pmxcfs.c
> +++ b/src/pmxcfs/pmxcfs.c
> @@ -945,7 +945,7 @@ int main(int argc, char *argv[])
>
> mkdir(CFSDIR, 0755);
>
> - char *fa[] = { "-f", "-odefault_permissions", "-oallow_other", NULL};
> + char *fa[] = { "-f", "-odefault_permissions", "-oallow_other", "-obig_writes", NULL};
>
> struct fuse_args fuse_args = FUSE_ARGS_INIT(sizeof (fa)/sizeof(gpointer) - 1, fa);
>
> --
> 2.39.2
* Re: [pve-devel] [RFC PATCH pve-cluster] fix #5728: pmxcfs: allow bigger writes than 4k for fuse
2024-09-19 12:01 ` Thomas Lamprecht
@ 2024-09-19 12:45 ` Dominik Csapak
2024-09-19 14:57 ` Thomas Lamprecht
0 siblings, 1 reply; 15+ messages in thread
From: Dominik Csapak @ 2024-09-19 12:45 UTC (permalink / raw)
To: Thomas Lamprecht, Proxmox VE development discussion
On 9/19/24 14:01, Thomas Lamprecht wrote:
> On 19/09/2024 at 11:52, Dominik Csapak wrote:
>> by default libfuse2 limits writes to 4k size, which means that on writes
>> bigger than that, we do a whole write cycle for each 4k block that comes
>> in. To avoid that, add the option 'big_writes' to allow writes bigger
>> than 4k at once.
>>
>> This should improve pmxcfs performance for situations where we often
>> write large files (e.g. big ha status) and maybe reduce writes to disk.
>
> Should? Something like before/after for benchmark numbers, flamegraphs
> would be really good to have, without those it's rather hard to discuss
> this, and I'd like to avoid having to do those, or check the inner workings
> of the affected fuse userspace/kernel code paths here myself.
well I mean the code change is relatively small and the result is rather clear:
in the current case we have the following calls from pmxcfs (shortened for e-mail)
when writing a single 128k block:
(dd if=... of=/etc/pve/test bs=128k count=1)
---
[main] debug: enter cfs_fuse_write /test 4096 0 (pmxcfs.c:355:cfs_fuse_write)
[main] debug: find_plug start test (pmxcfs.c:103:find_plug)
[main] debug: cfs_plug_base_lookup_plug test (cfs-plug.c:52:cfs_plug_base_lookup_plug)
[main] debug: cfs_plug_base_lookup_plug name = test new path = (null)
(cfs-plug.c:59:cfs_plug_base_lookup_plug)
[main] debug: find_plug end test = 0x64319c3e0bc0 (test) (pmxcfs.c:110:find_plug)
[main] debug: enter cfs_plug_base_write test 4096 0 (cfs-plug.c:294:cfs_plug_base_write)
[main] debug: memdb_pwrite . test 0000000000DE88EC 0000000000DE88ED (memdb.c:850:memdb_pwrite)
[database] debug: enter backend_write_inode 0000000000DE88EC 'test', size 4096
(database.c:157:backend_write_inode)
[database] debug: enter backend_write_inode 0000000000000000 '__version__', size 0
(database.c:157:backend_write_inode)
[main] debug: leave cfs_fuse_write /test (4096) (pmxcfs.c:368:cfs_fuse_write)
[main] debug: enter cfs_fuse_write /test 4096 4096 (pmxcfs.c:355:cfs_fuse_write)
[main] debug: find_plug start test (pmxcfs.c:103:find_plug)
[main] debug: cfs_plug_base_lookup_plug test (cfs-plug.c:52:cfs_plug_base_lookup_plug)
[main] debug: cfs_plug_base_lookup_plug name = test new path = (null)
(cfs-plug.c:59:cfs_plug_base_lookup_plug)
[main] debug: find_plug end test = 0x64319c3e0bc0 (test) (pmxcfs.c:110:find_plug)
[main] debug: enter cfs_plug_base_write test 4096 4096 (cfs-plug.c:294:cfs_plug_base_write)
[main] debug: memdb_pwrite . test 0000000000DE88EC 0000000000DE88EE (memdb.c:850:memdb_pwrite)
[database] debug: enter backend_write_inode 0000000000DE88EC 'test', size 8192
(database.c:157:backend_write_inode)
[database] debug: enter backend_write_inode 0000000000000000 '__version__', size 0
(database.c:157:backend_write_inode)
[main] debug: leave cfs_fuse_write /test (4096) (pmxcfs.c:368:cfs_fuse_write)
[main] debug: enter cfs_fuse_write /test 4096 8192 (pmxcfs.c:355:cfs_fuse_write)
---
and so on until we reach
---
[main] debug: enter cfs_fuse_write /test 4096 126976 (pmxcfs.c:355:cfs_fuse_write)
[main] debug: find_plug start test (pmxcfs.c:103:find_plug)
[main] debug: cfs_plug_base_lookup_plug test (cfs-plug.c:52:cfs_plug_base_lookup_plug)
[main] debug: cfs_plug_base_lookup_plug name = test new path = (null)
(cfs-plug.c:59:cfs_plug_base_lookup_plug)
[main] debug: find_plug end test = 0x64319c3e0bc0 (test) (pmxcfs.c:110:find_plug)
[main] debug: enter cfs_plug_base_write test 4096 126976 (cfs-plug.c:294:cfs_plug_base_write)
[main] debug: memdb_pwrite . test 0000000000DE88EC 0000000000DE890C (memdb.c:850:memdb_pwrite)
[database] debug: enter backend_write_inode 0000000000DE88EC 'test', size 131072
(database.c:157:backend_write_inode)
[database] debug: enter backend_write_inode 0000000000000000 '__version__', size 0
(database.c:157:backend_write_inode)
[main] debug: leave cfs_fuse_write /test (4096) (pmxcfs.c:368:cfs_fuse_write)
---
with my changes it results in:
---
[main] debug: enter cfs_fuse_write /test 131072 0 (pmxcfs.c:355:cfs_fuse_write)
[main] debug: find_plug start test (pmxcfs.c:103:find_plug)
[main] debug: cfs_plug_base_lookup_plug test (cfs-plug.c:52:cfs_plug_base_lookup_plug)
[main] debug: cfs_plug_base_lookup_plug name = test new path = (null)
(cfs-plug.c:59:cfs_plug_base_lookup_plug)
[main] debug: find_plug end test = 0x62f47945f360 (test) (pmxcfs.c:110:find_plug)
[main] debug: enter cfs_plug_base_write test 131072 0 (cfs-plug.c:294:cfs_plug_base_write)
[main] debug: memdb_pwrite . test 0000000000DE88EC 0000000000DE8AB8 (memdb.c:850:memdb_pwrite)
[database] debug: enter backend_write_inode 0000000000DE88EC 'test', size 131072
(database.c:157:backend_write_inode)
[database] debug: enter backend_write_inode 0000000000000000 '__version__', size 0
(database.c:157:backend_write_inode)
[main] debug: leave cfs_fuse_write /test (131072) (pmxcfs.c:368:cfs_fuse_write)
---
so a factor of 32 less calls to cfs_fuse_write (including memdb_pwrite)
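(That is simply 128k / 4k = 32: one cfs_fuse_write + memdb_pwrite round trip
per 4k block, versus a single one for the whole 128k buffer.)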
>
>>
>> If we'd change to libfuse3, this would be a non-issue, since that option
>> got removed and is the default there.
>
> I'd prefer that. At least if done with the future PVE 9.0, as I do not think
> it's a good idea in the middle of a stable release cycle.
why not this change now, and the rewrite to libfuse3 later? that way we can
have some improvements now too...
>
> FWIW, some Debian Devs also want to push the fuse 2 to 3 migration forward
> [0]. So, sooner or later we have to switch over pmxcfs anyway, and there's
> already a POC from Fabian [1].
>
> [0]: https://lists.debian.org/debian-devel/2024/08/msg00177.html
> [1]: https://lore.proxmox.com/pve-devel/20220715074742.3387301-1-f.gruenbichler@proxmox.com/
>
>
sure I can look at that when we're nearing 9.0
* Re: [pve-devel] [RFC PATCH pve-cluster] fix #5728: pmxcfs: allow bigger writes than 4k for fuse
2024-09-19 12:45 ` Dominik Csapak
@ 2024-09-19 14:57 ` Thomas Lamprecht
2024-09-20 4:04 ` Esi Y via pve-devel
` (3 more replies)
0 siblings, 4 replies; 15+ messages in thread
From: Thomas Lamprecht @ 2024-09-19 14:57 UTC (permalink / raw)
To: Dominik Csapak, Proxmox VE development discussion
On 19/09/2024 at 14:45, Dominik Csapak wrote:
> On 9/19/24 14:01, Thomas Lamprecht wrote:
>> On 19/09/2024 at 11:52, Dominik Csapak wrote:
>>> by default libfuse2 limits writes to 4k size, which means that on writes
>>> bigger than that, we do a whole write cycle for each 4k block that comes
>>> in. To avoid that, add the option 'big_writes' to allow writes bigger
>>> than 4k at once.
>>>
>>> This should improve pmxcfs performance for situations where we often
>>> write large files (e.g. big ha status) and maybe reduce writes to disk.
>>
>> Should? Something like before/after for benchmark numbers, flamegraphs
>> would be really good to have, without those it's rather hard to discuss
>> this, and I'd like to avoid having to do those, or check the inner workings
>> of the affected fuse userspace/kernel code paths here myself.
>
> well I mean the code change is relatively small and the result is rather clear:
Well sure the code change is just setting an option... But the actual change is
abstracted away and would benefit from actually looking into..
> in the current case we have the following calls from pmxcfs (shortened for e-mail)
> when writing a single 128k block:
> (dd if=... of=/etc/pve/test bs=128k count=1)
Better than nothing but still no actual numbers (reduced time, reduced write amp
in combination with sqlite, ...), some basic analysis over file/write size distribution
on a single node and (e.g. three node) cluster, ...
If that's all obvious for you then great, but as already mentioned in the past, I
want actual data in commit messages for such stuff, and I cannot really see a downside
of having such numbers.
Again, as is I'm not really seeing what's to discuss, you send it as RFC after
all.
> [...]
> so a factor of 32 less calls to cfs_fuse_write (including memdb_pwrite)
That can be huge or not so big at all, i.e. as mentioned above, it would be good to
measure the impact through some other metrics.
And FWIW, I used bpftrace to count [0] with an unpatched pmxcfs, there I get
the 32 calls to cfs_fuse_write only for a new file, overwriting the existing
file again with the same amount of data (128k) just causes a single call.
I tried using more data (e.g. from 128k initially to 256k or 512k) and it's
always the data divided by 128k (even if the first file has a different size)
We do not overwrite existing files often, but rather write to a new file and
then rename, but still quite interesting and IMO really showing that just
because this is 1 +-1 line change it doesn't necessarily have to be trivial
and obvious in its effects.
[0]: bpftrace -e 'u:cfs_fuse_write /str(args->path) == "/test"/ {@ = count();} END { print(@) }' -p "$(pidof pmxcfs)"
>>> If we'd change to libfuse3, this would be a non-issue, since that option
>>> got removed and is the default there.
>>
>> I'd prefer that. At least if done with the future PVE 9.0, as I do not think
>> it's a good idea in the middle of a stable release cycle.
>
> why not this change now, and the rewrite to libfuse3 later? that way we can
> have some improvements now too...
Because I want some actual data and reasoning first, even if it's quite likely
that this improves things Somehow™, I'd like to actually know in what metrics
and by how much (even if just an upper bound due to the benchmark or some
measurement being rather artificial).
I mean you name the big HA status, why not measure that for real? like, probably
among other things, in terms of bytes hitting the block layer, i.e. the actual
backing disk from those requests as then we'd know for real if this can reduce
the write load there, not just that it maybe should.
* Re: [pve-devel] [RFC PATCH pve-cluster] fix #5728: pmxcfs: allow bigger writes than 4k for fuse
2024-09-19 14:57 ` Thomas Lamprecht
@ 2024-09-20 4:04 ` Esi Y via pve-devel
2024-09-20 5:29 ` Esi Y via pve-devel
[not found] ` <CABtLnHoQFAUN0KcahbMF6hoX=WTfL8bHL0St77gQMSaojVGhBA@mail.gmail.com>
2024-09-20 6:16 ` Dominik Csapak
` (2 subsequent siblings)
3 siblings, 2 replies; 15+ messages in thread
From: Esi Y via pve-devel @ 2024-09-20 4:04 UTC (permalink / raw)
To: t.lamprecht, Dominik Csapak; +Cc: Esi Y, Proxmox VE development discussion
From: Esi Y <esiy0676+proxmox@gmail.com>
To: t.lamprecht@proxmox.com, Dominik Csapak <d.csapak@proxmox.com>
Cc: Proxmox VE development discussion <pve-devel@lists.proxmox.com>
Subject: Re: [pve-devel] [RFC PATCH pve-cluster] fix #5728: pmxcfs: allow bigger writes than 4k for fuse
Date: Fri, 20 Sep 2024 06:04:36 +0200
Message-ID: <CABtLnHp=5p9rkD-+aEMfBT2WfQnPhqcsFTGZPnSou4=U2xZOUw@mail.gmail.com>
I can't help it, I am sorry in advance, but ...
No one is going to bring up the elephant in the room (you may want to
call it FUSE), namely that backend_write_inode is hit every single time
on virtually every memdb_pwrite, i.e. in addition to cfs_fuse_write
also on the (logically related):
cfs_fuse_truncate
cfs_fuse_rename
cfs_fuse_utimens
So these are all separate transactions hitting the backend, capturing
one and the same event.
Additionally, there's nothing atomic about updating the __version__ and
the actual file ("inode") DB rows, so the number of transactions
doubles yet again on every amplified hit.
Also, locks are persisted into the backend only to be removed soon after.
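Just to illustrate the point, batching both row updates into one transaction
would conceptually look like the sketch below with the SQLite C API (the
function and the prepared statements are mine, not pmxcfs's actual code or
schema; statement binding/reset is omitted):

```
#include <sqlite3.h>

/* hypothetical: commit the file's inode row and the __version__ bump
 * atomically in a single transaction instead of two independent ones */
static int write_inode_batched(sqlite3 *db, sqlite3_stmt *stmt_inode,
                               sqlite3_stmt *stmt_version)
{
    if (sqlite3_exec(db, "BEGIN", NULL, NULL, NULL) != SQLITE_OK)
        return SQLITE_ERROR;
    if (sqlite3_step(stmt_inode) != SQLITE_DONE ||
        sqlite3_step(stmt_version) != SQLITE_DONE) {
        sqlite3_exec(db, "ROLLBACK", NULL, NULL, NULL);
        return SQLITE_ERROR;
    }
    return sqlite3_exec(db, "COMMIT", NULL, NULL, NULL);
}
```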
WRT FUSE2 buffering doing just fine for overwrites (<= original
size): this is true, but at the same time the PVE mode of
operation (albeit quite correctly) is to create a .tmp.XXX file (so this is
your NEW file being appended) and then rename it, all of that
in-place on that very FUSE mountpoint (not so correctly), with pmxcfs
at the same time being completely oblivious to this.
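To make that pattern concrete, a save in this mode looks roughly like the
following sketch (file names and error handling are illustrative only):

```
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* sketch of the create-temp-then-rename save pattern described above:
 * every save exercises the FUSE create/write/rename paths (each another
 * backend transaction), never the cheap in-place overwrite path */
static int save_config(const char *data, size_t len)
{
    const char *tmp = "/etc/pve/.tmp.somefile.1234";
    int fd = open(tmp, O_WRONLY | O_CREAT | O_EXCL, 0640);
    if (fd < 0)
        return -1;
    if (write(fd, data, len) != (ssize_t)len) { /* arrives as 4k FUSE writes */
        close(fd);
        unlink(tmp);
        return -1;
    }
    close(fd);
    return rename(tmp, "/etc/pve/somefile"); /* hits cfs_fuse_rename */
}
```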
I could not help this, because this is a developer who - in my opinion
- quite rightly wanted to pick the low-hanging fruit first, with his
intuition (and self-evident reasoning) completely disregarded;
however, the same scrutiny was not exercised when e.g. bumping the limits
[1] of that very FS, and all of that back then was "tested with touch".
And this is all on someone else's codebase that is 10 years old (so
designed with a different use case in mind, good enough for ~4K files),
meanwhile the well-meaning individual even admits he is not a C guru,
but is asked to spend a day profiling this bespoke multi-threaded CPG
code?
NB I will completely leave out what the above mentioned does to the
CPG messages flying around, for brevity. But it is why I originally
got interested.
I am sure I made many friends now that I called even the FUSE
migration on its own futile, but well, it is an RFC after all.
Thank you, gentlemen.
Esi Y
On Thu, Sep 19, 2024 at 4:57 PM Thomas Lamprecht
<t.lamprecht@proxmox.com> wrote:
>
> On 19/09/2024 at 14:45, Dominik Csapak wrote:
> > On 9/19/24 14:01, Thomas Lamprecht wrote:
> >> On 19/09/2024 at 11:52, Dominik Csapak wrote:
> >>> by default libfuse2 limits writes to 4k size, which means that on writes
> >>> bigger than that, we do a whole write cycle for each 4k block that comes
> >>> in. To avoid that, add the option 'big_writes' to allow writes bigger
> >>> than 4k at once.
> >>>
> >>> This should improve pmxcfs performance for situations where we often
> >>> write large files (e.g. big ha status) and maybe reduce writes to disk.
> >>
> >> Should? Something like before/after for benchmark numbers, flamegraphs
> >> would be really good to have, without those it's rather hard to discuss
> >> this, and I'd like to avoid having to do those, or check the inner workings
> >> of the affected fuse userspace/kernel code paths here myself.
> >
> > well I mean the code change is relatively small and the result is rather clear:
>
> Well sure the code change is just setting an option... But the actual change is
> abstracted away and would benefit from actually looking into..
>
> > in the current case we have the following calls from pmxcfs (shortened for e-mail)
> > when writing a single 128k block:
> > (dd if=... of=/etc/pve/test bs=128k count=1)
>
> Better than nothing but still no actual numbers (reduced time, reduced write amp
> in combination with sqlite, ...), some basic analysis over file/write size distribution
> on a single node and (e.g. three node) cluster, ...
> If that's all obvious for you then great, but as already mentioned in the past, I
> want actual data in commit messages for such stuff, and I cannot really see a downside
> of having such numbers.
>
> Again, as is I'm not really seeing what's to discuss, you send it as RFC after
> all.
>
> > [...]
> > so a factor of 32 less calls to cfs_fuse_write (including memdb_pwrite)
>
> That can be huge or not so big at all, i.e. as mentioned above, it would be good to
> measure the impact through some other metrics.
>
> And FWIW, I used bpftrace to count [0] with an unpatched pmxcfs, there I get
> the 32 calls to cfs_fuse_write only for a new file, overwriting the existing
> file again with the same amount of data (128k) just causes a single call.
> I tried using more data (e.g. from 128k initially to 256k or 512k) and it's
> always the data divided by 128k (even if the first file has a different size)
>
> We do not overwrite existing files often, but rather write to a new file and
> then rename, but still quite interesting and IMO really showing that just
> because this is 1 +-1 line change it doesn't necessarily have to be trivial
> and obvious in its effects.
>
> [0]: bpftrace -e 'u:cfs_fuse_write /str(args->path) == "/test"/ {@ = count();} END { print(@) }' -p "$(pidof pmxcfs)"
>
>
> >>> If we'd change to libfuse3, this would be a non-issue, since that option
> >>> got removed and is the default there.
> >>
> >> I'd prefer that. At least if done with the future PVE 9.0, as I do not think
> >> it's a good idea in the middle of a stable release cycle.
> >
> > why not this change now, and the rewrite to libfuse3 later? that way we can
> > have some improvements now too...
>
> Because I want some actual data and reasoning first, even if it's quite likely
> that this improves things Somehow™, I'd like to actually know in what metrics
> and by how much (even if just an upper bound due to the benchmark or some
> measurement being rather artificial).
>
> I mean you name the big HA status, why not measure that for real? like, probably
> among other things, in terms of bytes hitting the block layer, i.e. the actual
> backing disk from those requests as then we'd know for real if this can reduce
> the write load there, not just that it maybe should.
* Re: [pve-devel] [RFC PATCH pve-cluster] fix #5728: pmxcfs: allow bigger writes than 4k for fuse
2024-09-20 4:04 ` Esi Y via pve-devel
@ 2024-09-20 5:29 ` Esi Y via pve-devel
[not found] ` <CABtLnHoQFAUN0KcahbMF6hoX=WTfL8bHL0St77gQMSaojVGhBA@mail.gmail.com>
1 sibling, 0 replies; 15+ messages in thread
From: Esi Y via pve-devel @ 2024-09-20 5:29 UTC (permalink / raw)
To: Proxmox VE development discussion; +Cc: Esi Y, t.lamprecht
From: Esi Y <esiy0676+proxmox@gmail.com>
To: Proxmox VE development discussion <pve-devel@lists.proxmox.com>
Cc: t.lamprecht@proxmox.com, Dominik Csapak <d.csapak@proxmox.com>
Subject: Re: [pve-devel] [RFC PATCH pve-cluster] fix #5728: pmxcfs: allow bigger writes than 4k for fuse
Date: Fri, 20 Sep 2024 07:29:05 +0200
Message-ID: <CABtLnHoQFAUN0KcahbMF6hoX=WTfL8bHL0St77gQMSaojVGhBA@mail.gmail.com>
Somehow, my ending did not get included, regarding the "hitting the block
layer" - this is almost impossible to quantify reliably. The reason is
the use of the WAL: 80% of the instructions [1] are happening in the
hocus-pocus SQLite, and it really may or may not checkpoint at any
given time, depending on how many transactions are hitting (from all
sides, i.e. other nodes).
In the end, and for my single-node testing at first, I ended up with the
MEMORY journal (yes, the old-fashioned one); at least it was giving
some consistent results. The amplification dropped rapidly (just from not
using WAL alone) [2], and it would halve with bigger buffers.
But I found this a non-topic. It is in memory, it needs copying there,
then it is in the backend, additionally at times in the WAL, and now FUSE3
would be increasing buffers in order to ... hit memory first.
One has to question if the WAL is necessary (on a read-once,
write-all-the-time DB), but for me it's more about whether the DB is necessary.
It is there for the atomicity, I know, but there are other ways to do it
without all the overhead. At the end of the day, it is about whether
one wants a hard- or soft-state CPG. I personally aim for the soft one; I
don't expect much sympathy with that, though.
[1] https://forum.proxmox.com/threads/etc-pve-pmxcfs-amplification-inefficiencies.154074/#post-703261
[2] https://forum.proxmox.com/threads/ssd-wearout-and-rrdcache-pmxcfs-commit-interval.124638/#post-702765
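For completeness, the journal-mode switch from that experiment is a one-liner
against the SQLite C API (a sketch of what was benchmarked, not a
recommendation; the db handle is assumed to be the pmxcfs database connection):

```
#include <sqlite3.h>

/* switch from WAL to a classic in-memory rollback journal */
static int use_memory_journal(sqlite3 *db)
{
    return sqlite3_exec(db, "PRAGMA journal_mode = MEMORY;", NULL, NULL, NULL);
}
```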
On Fri, Sep 20, 2024 at 6:05 AM Esi Y via pve-devel
<pve-devel@lists.proxmox.com> wrote:
> I can't help it, I am sorry in advance, but ...
>
> No one is going to bring up the elephant in the room (you may want to
> call it FUSE), namely that backend_write_inode is hit every single time
> on virtually every memdb_pwrite, i.e. in addition to cfs_fuse_write
> also on the (logically related):
> cfs_fuse_truncate
> cfs_fuse_rename
> cfs_fuse_utimens
>
> So these are all separate transactions hitting the backend, capturing
> one and the same event.
>
> Additionally, there's nothing atomic about updating the __version__ and
> the actual file ("inode") DB rows, so the number of transactions
> doubles yet again on every amplified hit.
>
> Also, locks are persisted into the backend only to be removed soon after.
>
> WRT FUSE2 buffering doing just fine for overwrites (<= original
> size): this is true, but at the same time the PVE mode of
> operation (albeit quite correctly) is to create a .tmp.XXX file (so this is
> your NEW file being appended) and then rename it, all of that
> in-place on that very FUSE mountpoint (not so correctly), with pmxcfs
> at the same time being completely oblivious to this.
>
> I could not help this, because this is a developer who - in my opinion
> - quite rightly wanted to pick the low-hanging fruit first, with his
> intuition (and self-evident reasoning) completely disregarded;
> however, the same scrutiny was not exercised when e.g. bumping the limits
> [1] of that very FS, and all of that back then was "tested with touch".
> And this is all on someone else's codebase that is 10 years old (so
> designed with a different use case in mind, good enough for ~4K files),
> meanwhile the well-meaning individual even admits he is not a C guru,
> but is asked to spend a day profiling this bespoke multi-threaded CPG
> code?
>
> NB I will completely leave out what the above mentioned does to the
> CPG messages flying around, for brevity. But it is why I originally
> got interested.
>
> I am sure I made many friends now that I called even the FUSE
> migration on its own futile, but well, it is an RFC after all.
>
> Thank you, gentlemen.
>
> Esi Y
>
> On Thu, Sep 19, 2024 at 4:57 PM Thomas Lamprecht
> <t.lamprecht@proxmox.com> wrote:
> >
> > On 19/09/2024 at 14:45, Dominik Csapak wrote:
> > > On 9/19/24 14:01, Thomas Lamprecht wrote:
> > >> On 19/09/2024 at 11:52, Dominik Csapak wrote:
> > >>> by default libfuse2 limits writes to 4k size, which means that on writes
> > >>> bigger than that, we do a whole write cycle for each 4k block that comes
> > >>> in. To avoid that, add the option 'big_writes' to allow writes bigger
> > >>> than 4k at once.
> > >>>
> > >>> This should improve pmxcfs performance for situations where we often
> > >>> write large files (e.g. big ha status) and maybe reduce writes to disk.
> > >>
> > >> Should? Something like before/after for benchmark numbers, flamegraphs
> > >> would be really good to have, without those it's rather hard to discuss
> > >> this, and I'd like to avoid having to do those, or check the inner workings
> > >> of the affected fuse userspace/kernel code paths here myself.
> > >
> > > well I mean the code change is relatively small and the result is rather clear:
> >
> > Well sure the code change is just setting an option... But the actual change is
> > abstracted away and would benefit from actually looking into..
> >
> > > in the current case we have the following calls from pmxcfs (shortened for e-mail)
> > > when writing a single 128k block:
> > > (dd if=... of=/etc/pve/test bs=128k count=1)
> >
> > Better than nothing but still no actual numbers (reduced time, reduced write amp
> > in combination with sqlite, ...), some basic analysis over file/write size distribution
> > on a single node and (e.g. three node) cluster, ...
> > If that's all obvious for you then great, but as already mentioned in the past, I
> > want actual data in commit messages for such stuff, and I cannot really see a downside
> > of having such numbers.
> >
> > Again, as is I'm not really seeing what's to discuss, you send it as RFC after
> > all.
> >
> > > [...]
> > > so a factor of 32 less calls to cfs_fuse_write (including memdb_pwrite)
> >
> > That can be huge or not so big at all, i.e. as mentioned above, it would be good to
> > measure the impact through some other metrics.
> >
> > And FWIW, I used bpftrace to count [0] with an unpatched pmxcfs, there I get
> > the 32 calls to cfs_fuse_write only for a new file, overwriting the existing
> > file again with the same amount of data (128k) just causes a single call.
> > I tried using more data (e.g. from 128k initially to 256k or 512k) and it's
> > always the data divided by 128k (even if the first file has a different size)
> >
> > We do not overwrite existing files often, but rather write to a new file and
> > then rename, but still quite interesting and IMO really showing that just
> > because this is 1 +-1 line change it doesn't necessarily have to be trivial
> > and obvious in its effects.
> >
> > [0]: bpftrace -e 'u:cfs_fuse_write /str(args->path) == "/test"/ {@ = count();} END { print(@) }' -p "$(pidof pmxcfs)"
> >
> >
> > >>> If we'd change to libfuse3, this would be a non-issue, since that option
> > >>> got removed and is the default there.
> > >>
> > >> I'd prefer that. At least if done with the future PVE 9.0, as I do not think
> > >> it's a good idea in the middle of a stable release cycle.
> > >
> > > why not this change now, and the rewrite to libfuse3 later? that way we can
> > > have some improvements now too...
> >
> > Because I want some actual data and reasoning first, even if it's quite likely
> > that this improves things Somehow™, I'd like to actually know in what metrics
> > and by how much (even if just an upper bound due to the benchmark or some
> > measurement being rather artificial).
> >
> > I mean you name the big HA status, why not measure that for real? like, probably
> > among other things, in terms of bytes hitting the block layer, i.e. the actual
> > backing disk from those requests as then we'd know for real if this can reduce
> > the write load there, not just that it maybe should.
* Re: [pve-devel] [RFC PATCH pve-cluster] fix #5728: pmxcfs: allow bigger writes than 4k for fuse
2024-09-19 14:57 ` Thomas Lamprecht
2024-09-20 4:04 ` Esi Y via pve-devel
@ 2024-09-20 6:16 ` Dominik Csapak
2024-09-22 9:25 ` Dietmar Maurer
2024-09-23 9:17 ` Dominik Csapak
3 siblings, 0 replies; 15+ messages in thread
From: Dominik Csapak @ 2024-09-20 6:16 UTC (permalink / raw)
To: Thomas Lamprecht, Proxmox VE development discussion
On 9/19/24 16:57, Thomas Lamprecht wrote:
> On 19/09/2024 at 14:45, Dominik Csapak wrote:
>> On 9/19/24 14:01, Thomas Lamprecht wrote:
>>> On 19/09/2024 at 11:52, Dominik Csapak wrote:
>>>> by default libfuse2 limits writes to 4k size, which means that on writes
>>>> bigger than that, we do a whole write cycle for each 4k block that comes
>>>> in. To avoid that, add the option 'big_writes' to allow writes bigger
>>>> than 4k at once.
>>>>
>>>> This should improve pmxcfs performance for situations where we often
>>>> write large files (e.g. big ha status) and maybe reduce writes to disk.
>>>
>>> Should? Something like before/after for benchmark numbers, flamegraphs
>>> would be really good to have, without those it's rather hard to discuss
>>> this, and I'd like to avoid having to do those, or check the inner workings
>>> of the affected fuse userspace/kernel code paths here myself.
>>
>> well I mean the code change is relatively small and the result is rather clear:
>
> Well sure the code change is just setting an option... But the actual change is
> abstracted away and would benefit from actually looking into..
>
>> in the current case we have the following calls from pmxcfs (shortened for e-mail)
>> when writing a single 128k block:
>> (dd if=... of=/etc/pve/test bs=128k count=1)
>
> Better than nothing but still no actual numbers (reduced time, reduced write amp
> in combination with sqlite, ...), some basic analysis over file/write size distribution
> on a single node and (e.g. three node) cluster, ...
> If that's all obvious for you then great, but as already mentioned in the past, I
> want actual data in commit messages for such stuff, and I cannot really see a downside
> of having such numbers.
>
> Again, as is I'm not really seeing what's to discuss, you send it as RFC after
> all.
>
>> [...]
>> so a factor of 32 less calls to cfs_fuse_write (including memdb_pwrite)
>
> That can be huge or not so big at all, i.e. as mentioned above, it would be good to
> measure the impact through some other metrics.
>
> And FWIW, I used bpftrace to count [0] with an unpatched pmxcfs, there I get
> the 32 calls to cfs_fuse_write only for a new file, overwriting the existing
> file again with the same amount of data (128k) just causes a single call.
> I tried using more data (e.g. from 128k initially to 256k or 512k) and it's
> always the data divided by 128k (even if the first file has a different size)
>
> We do not overwrite existing files often, but rather write to a new file and
> then rename, but still quite interesting and IMO really showing that just
> because this is 1 +-1 line change it doesn't necessarily have to be trivial
> and obvious in its effects.
>
> [0]: bpftrace -e 'u:cfs_fuse_write /str(args->path) == "/test"/ {@ = count();} END { print(@) }' -p "$(pidof pmxcfs)"
>
>
>>>> If we'd change to libfuse3, this would be a non-issue, since that option
>>>> got removed and is the default there.
>>>
>>> I'd prefer that. At least if done with the future PVE 9.0, as I do not think
>>> it's a good idea in the middle of a stable release cycle.
>>
>> why not this change now, and the rewrite to libfuse3 later? that way we can
>> have some improvements now too...
>
> Because I want some actual data and reasoning first, even if it's quite likely
> that this improves things Somehow™, I'd like to actually know in what metrics
> and by how much (even if just an upper bound due to the benchmark or some
> measurement being rather artificial).
>
> I mean you name the big HA status, why not measure that for real? like, probably
> among other things, in terms of bytes hitting the block layer, i.e. the actual
> backing disk from those requests as then we'd know for real if this can reduce
> the write load there, not just that it maybe should.
sure, since you insist, I'll do some benchmarks today (when I have the time).
I'm just not sure if doing that is actually worth it. Would it make any
difference for the patch if the improvements were 1%, 10% or 1000%?
The resulting change in behavior seems rather obvious to me:
fewer fuse -> db round trips for big writes, which can really only make the
whole thing more efficient, so IMHO it's very clear to me that this is an
improvement.
I sent this as an RFC because, as I wrote in the beginning, I'd like
someone with more C/pmxcfs knowledge/experience to check or warn
about edge cases where we might depend on lower block sizes (or similar things).
I very much doubt that any performance benchmarks would have revealed such things
that I wouldn't already have noticed by testing my local install.
If we did benchmarks on all changes that might impact performance (which
is most changes, really), we'd probably do little else...
* Re: [pve-devel] [RFC PATCH pve-cluster] fix #5728: pmxcfs: allow bigger writes than 4k for fuse
[not found] ` <CABtLnHoQFAUN0KcahbMF6hoX=WTfL8bHL0St77gQMSaojVGhBA@mail.gmail.com>
@ 2024-09-20 7:32 ` Dominik Csapak
0 siblings, 0 replies; 15+ messages in thread
From: Dominik Csapak @ 2024-09-20 7:32 UTC (permalink / raw)
To: Esi Y, Proxmox VE development discussion
Hi,
I'm replying to this mail, but it's more of a general answer.
While we always appreciate efforts to improve our software/code/stack, maybe your recent attempts
are not as constructive as they could be.
Of course we can try to optimize every part of our code to be as efficient and good as possible
all the time, but we don't do that because:
* we don't have unlimited developer time
* having stable behavior can be more important than squeezing out the most performance
So we often change working existing code (especially the most important parts like the pmxcfs) only
if either:
* there is a bug
* the change is pretty obviously better
* we encounter an actual limit/problem that is important (e.g. affects many people in a noticeable way)
* there is a big demand from users/customers
(note that this list is not exhaustive, and even if a change fulfills one of these, we might not
pursue it, depending on the specific circumstances)
Instead of just calling out weird/unfamiliar behavior/code/design, I'd recommend going a different
route, for example explaining why a problem has a big impact on many people, best while
providing actual code and/or concrete solutions. (It's always easier to criticize other people's
work than actually doing it better)
I cannot and don't want to stop you from looking into our code base and finding flaws/issues/etc.,
but maybe using a different approach to communicate those would yield better results than what you
have done until now.
Please don't take this answer the wrong way, I just want to improve communication, so we can
improve our software in a better and more efficient way.
best regards
Dominik
* Re: [pve-devel] [RFC PATCH pve-cluster] fix #5728: pmxcfs: allow bigger writes than 4k for fuse
2024-09-19 14:57 ` Thomas Lamprecht
2024-09-20 4:04 ` Esi Y via pve-devel
2024-09-20 6:16 ` Dominik Csapak
@ 2024-09-22 9:25 ` Dietmar Maurer
2024-09-23 9:17 ` Dominik Csapak
3 siblings, 0 replies; 15+ messages in thread
From: Dietmar Maurer @ 2024-09-22 9:25 UTC (permalink / raw)
To: Proxmox VE development discussion, Thomas Lamprecht, Dominik Csapak
> > [...]
> > so a factor of 32 less calls to cfs_fuse_write (including memdb_pwrite)
>
> That can be huge or not so big at all, i.e. as mentioned above, it would be good to
> measure the impact through some other metrics.
>
> And FWIW, I used bpftrace to count [0] with an unpatched pmxcfs, there I get
> the 32 calls to cfs_fuse_write only for a new file, overwriting the existing
> file again with the same amount of data (128k) just causes a single call.
> I tried using more data (e.g. from 128k initially to 256k or 512k) and it's
> always the data divided by 128k (even if the first file has a different size)
I observe the same thing. It seems fuse2 uses 4K blocks for new files, but 128K if the file already exists...
* Re: [pve-devel] [RFC PATCH pve-cluster] fix #5728: pmxcfs: allow bigger writes than 4k for fuse
2024-09-19 14:57 ` Thomas Lamprecht
` (2 preceding siblings ...)
2024-09-22 9:25 ` Dietmar Maurer
@ 2024-09-23 9:17 ` Dominik Csapak
2024-09-23 11:48 ` Filip Schauer
2024-09-23 12:00 ` Friedrich Weber
3 siblings, 2 replies; 15+ messages in thread
From: Dominik Csapak @ 2024-09-23 9:17 UTC (permalink / raw)
To: Thomas Lamprecht, Proxmox VE development discussion
On 9/19/24 16:57, Thomas Lamprecht wrote:
> On 19/09/2024 at 14:45, Dominik Csapak wrote:
>> On 9/19/24 14:01, Thomas Lamprecht wrote:
>>> On 19/09/2024 at 11:52, Dominik Csapak wrote:
>>>> by default libfuse2 limits writes to 4k size, which means that on writes
>>>> bigger than that, we do a whole write cycle for each 4k block that comes
>>>> in. To avoid that, add the option 'big_writes' to allow writes bigger
>>>> than 4k at once.
>>>>
>>>> This should improve pmxcfs performance for situations where we often
>>>> write large files (e.g. big ha status) and maybe reduce writes to disk.
>>>
>>> Should? Something like before/after for benchmark numbers, flamegraphs
>>> would be really good to have, without those it's rather hard to discuss
>>> this, and I'd like to avoid having to do those, or check the inner workings
>>> of the affected fuse userspace/kernel code paths here myself.
>>
>> well I mean the code change is relatively small and the result is rather clear:
>
> Well sure the code change is just setting an option... But the actual change is
> abstracted away and would benefit from actually looking into..
>
>> in the current case we have the following calls from pmxcfs (shortened for e-mail)
>> when writing a single 128k block:
>> (dd if=... of=/etc/pve/test bs=128k count=1)
>
> Better than nothing but still no actual numbers (reduced time, reduced write amp
> in combination with sqlite, ...), some basic analysis over file/write size distribution
> on a single node and (e.g. three node) cluster, ...
> If that's all obvious for you then great, but as already mentioned in the past, I
> want actual data in commit messages for such stuff, and I cannot really see a downside
> of having such numbers.
>
> Again, as is I'm not really seeing what's to discuss, you send it as RFC after
> all.
>
>> [...]
>> so a factor of 32 less calls to cfs_fuse_write (including memdb_pwrite)
>
> That can be huge or not so big at all, i.e. as mentioned above, it would be good to
> measure the impact through some other metrics.
>
> And FWIW, I used bpftrace to count [0] with an unpatched pmxcfs, there I get
> the 32 calls to cfs_fuse_write only for a new file, overwriting the existing
> file again with the same amount of data (128k) just causes a single call.
> I tried using more data (e.g. from 128k initially to 256k or 512k) and it's
> always the data divided by 128k (even if the first file has a different size)
>
> We do not overwrite existing files often, but rather write to a new file and
> then rename, but still quite interesting and IMO really showing that just
> because this is 1 +-1 line change it doesn't necessarily have to be trivial
> and obvious in its effects.
>
> [0]: bpftrace -e 'u:cfs_fuse_write /str(args->path) == "/test"/ {@ = count();} END { print(@) }' -p "$(pidof pmxcfs)"
>
>
>>>> If we'd change to libfuse3, this would be a non-issue, since that option
>>>> got removed and is the default there.
>>>
>>> I'd prefer that. At least if done with the future PVE 9.0, as I do not think
>>> it's a good idea in the middle of a stable release cycle.
>>
>> why not this change now, and the rewrite to libfuse3 later? that way we can
>> have some improvements now too...
>
> Because I want some actual data and reasoning first, even if it's quite likely
> that this improves things Somehow™, I'd like to actually know in what metrics
> and by how much (even if just an upper bound due to the benchmark or some
> measurement being rather artificial).
>
> I mean you name the big HA status, why not measure that for real? like, probably
> among other things, in terms of bytes hitting the block layer, i.e. the actual
> backing disk from those requests as then we'd know for real if this can reduce
> the write load there, not just that it maybe should.
hi,
first I just wanted to say I'm sorry for my snarky comment about not needing to test
performance for such code. You're right, any insight we can gain there
is good, and we (I!) should take the time to do that, even if the
change looks "obvious" like it does here.
So I did some benchmarks (mostly disk writes) and wrote the short script below
(maybe we can reuse that?):
----8<----
use strict;
use warnings;

use IO::File;
use PVE::Tools;

my $size = shift;

# parse /proc/diskstats and return the total bytes written to /dev/sdb;
# the leading whitespace makes split() produce an empty first field, so the
# sectors-written counter ends up in $fields[10] (sectors are 512 bytes)
sub get_bytes_written {
    my $fh = IO::File->new("/proc/diskstats", "r");
    die "unable to open /proc/diskstats\n" if !$fh;
    my $bytes = undef;
    while (defined(my $line = <$fh>)) {
        if ($line =~ m/sdb/) {
            my @fields = split(/\s+/, $line);
            $bytes = $fields[10] * 512;
        }
    }
    return $bytes;
}

sub test_write {
    my ($k) = @_;
    system("rm -f /etc/pve/testfile");
    my $data = "a"x($k*1024);
    # flush caches so the diskstats delta only contains the write below
    system("sync; echo 3 > /proc/sys/vm/drop_caches");
    my $bytes_before = get_bytes_written();
    PVE::Tools::file_set_contents("/etc/pve/testfile", $data);
    system("sync; echo 3 > /proc/sys/vm/drop_caches");
    my $bytes_after = get_bytes_written();
    return $bytes_after - $bytes_before;
}

$size //= 128;

my $written = test_write($size) / 1024;
print("$written\n");
---->8----
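The script takes the file size in kbytes as its single argument (defaulting
to 128) and prints how many kbytes hit /dev/sdb for that single
file_set_contents call, so it was run once per row of the table below.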
To simulate our real write patterns with varying file sizes,
I installed a fresh PVE, turned off pvestatd, and put /var/lib/pve-cluster on its own disk
(/dev/sdb), so diskstats only contains writes from the pmxcfs.
The results are below (all sizes are in kbytes; I ran the tests multiple times, but they
seem to be consistent):
data size   written (old)   amplification (old)   written (new)   amplification (new)
        1              56                    56              56                    56
        2              72                    36              76                    38
        4              84                    21              88                    22
        8             144                    18             104                    13
       16             236                    14             160                    10
       32             532                    16             324                    10
       64            1496                    23             836                    13
      128            6616                    51            3848                    30
      256           20600                    80           10568                    41
      512           87296                   170           43416                    84
     1024          388460                   379          197032                   192
For smaller writes there seems to be a minimum overhead of ~50-100 kbytes;
for big files we have a massive amplification of > 100x.
My patch does seem to make a difference for files > 4k.
But the biggest surprise here is that the write amplification
is not linear, but increases with the amount of bytes we want to write,
so e.g. going from a 128k to a 256k file write we don't just write double the amount,
but 3x to 4x as much.
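To put a number on that: in the table above, going from 128k to 256k the
bytes written grow from 6616k to 20600k, i.e. ~3.1x for 2x the data, and
from 512k to 1024k they grow from 87296k to 388460k, i.e. ~4.5x.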
I also produced a flamegraph according to
https://www.brendangregg.com/FlameGraphs/cpuflamegraphs.html, but it showed virtually no change
between the versions without and with my patch (if someone has a good SVG hoster, I can post
them, of course).
So, tl;dr:
for small writes this does not make that much of a difference, but
we can save ~half the writes for large files in the pmxcfs
(which e.g. a big ha-state file could be).
* Re: [pve-devel] [RFC PATCH pve-cluster] fix #5728: pmxcfs: allow bigger writes than 4k for fuse
2024-09-23 9:17 ` Dominik Csapak
@ 2024-09-23 11:48 ` Filip Schauer
2024-09-23 14:08 ` Filip Schauer
2024-09-23 12:00 ` Friedrich Weber
1 sibling, 1 reply; 15+ messages in thread
From: Filip Schauer @ 2024-09-23 11:48 UTC (permalink / raw)
To: pve-devel
I also ran some benchmarks with the same script.
I created a VM with two virtual disks, (both on an LVM Thin storage)
installed PVE on one disk and set up an ext4 partition on the other.
I stopped pvestatd and pve-cluster,
```
systemctl stop pvestatd
systemctl stop pve-cluster
```
moved the pmxcfs database file to its own disk
```
mv /var/lib/pve-cluster/config.db /tmp/
mount /dev/sdb1 /var/lib/pve-cluster
mv /tmp/config.db /var/lib/pve-cluster/
```
and started pve-cluster again.
```
systemctl start pve-cluster
```
These are my results before and after applying the patch:
data size   written (old)   amplification (old)   written (new)   amplification (new)
        1              48                    48              45                    45
        2              48                    24              45                    23
        4              82                    21              80                    20
        8             121                    15              90                    11
       16             217                    14             146                     9
       32             506                    16             314                    10
       64            1472                    23             826                    13
      128            5585                    44            3765                    29
      256           20424                    80           10743                    42
      512           86715                   169           43650                    85
     1024          369568                   361          187496                   183
As can be seen, my results are similar, with amplification really picking
up at 128k. The patch seems to cut write amplification in half for big
files, while making virtually no difference for files up to 4k.
On 23/09/2024 11:17, Dominik Csapak wrote:
> On 9/19/24 16:57, Thomas Lamprecht wrote:
>> On 19/09/2024 at 14:45, Dominik Csapak wrote:
>>> On 9/19/24 14:01, Thomas Lamprecht wrote:
>>>> On 19/09/2024 at 11:52, Dominik Csapak wrote:
>>>>> by default libfuse2 limits writes to 4k size, which means that on
>>>>> writes
>>>>> bigger than that, we do a whole write cycle for each 4k block that
>>>>> comes
>>>>> in. To avoid that, add the option 'big_writes' to allow writes bigger
>>>>> than 4k at once.
>>>>>
>>>>> This should improve pmxcfs performance for situations where we often
>>>>> write large files (e.g. big ha status) and maybe reduce writes to
>>>>> disk.
>>>>
>>>> Should? Something like before/after for benchmark numbers, flamegraphs
>>>> would be really good to have, without those it's rather hard to
>>>> discuss
>>>> this, and I'd like to avoid having to do those, or check the inner
>>>> workings
>>>> of the affected fuse userspace/kernel code paths here myself.
>>>
>>> well I mean the code change is relatively small and the result is
>>> rather clear:
>>
>> Well sure the code change is just setting an option... But the actual
>> change is
>> abstracted away and would benefit from actually looking into..
>>
>>> in the current case we have the following calls from pmxcfs
>>> (shortened for e-mail)
>>> when writing a single 128k block:
>>> (dd if=... of=/etc/pve/test bs=128k count=1)
>>
>> Better than nothing but still no actual numbers (reduced time,
>> reduced write amp
>> in combination with sqlite, ...), some basic analysis over file/write
>> size distribution
>> on a single node and (e.g. three node) cluster, ...
>> If that's all obvious for you then great, but as already mentioned in
>> the past, I
>> want actual data in commit messages for such stuff, and I cannot
>> really see a downside
>> of having such numbers.
>>
>> Again, as is I'm not really seeing what's to discuss, you send it as
>> RFC after
>> all.
>>
>>> [...]
>>> so a factor of 32 less calls to cfs_fuse_write (including memdb_pwrite)
>>
>> That can be huge or not so big at all, i.e. as mentioned above, it
>> would be good to
>> measure the impact through some other metrics.
>>
>> And FWIW, I used bpftrace to count [0] with an unpatched pmxcfs,
>> there I get
>> the 32 calls to cfs_fuse_write only for a new file, overwriting the
>> existing
>> file again with the same amount of data (128k) just causes a single
>> call.
>> I tried using more data (e.g. from 128k initially to 256k or 512k)
>> and it's
>> always the data divided by 128k (even if the first file has a
>> different size)
>>
>> We do not overwrite existing files often, but rather write to a new
>> file and
>> then rename, but still quite interesting and IMO really showing that
>> just
>> because this is 1 +-1 line change it doesn't necessarily have to be
>> trivial
>> and obvious in its effects.
>>
>> [0]: bpftrace -e 'u:cfs_fuse_write /str(args->path) == "/test"/ {@ =
>> count();} END { print(@) }' -p "$(pidof pmxcfs)"
>>
>>
>>>>> If we'd change to libfuse3, this would be a non-issue, since that
>>>>> option
>>>>> got removed and is the default there.
>>>>
>>>> I'd prefer that. At least if done with the future PVE 9.0, as I do
>>>> not think
>>>> it's a good idea in the middle of a stable release cycle.
>>>
>>> why not this change now, and the rewrite to libfuse3 later? that way
>>> we can
>>> have some improvements now too...
>>
>> Because I want some actual data and reasoning first, even if it's
>> quite likely
>> that this improves things Somehow™, I'd like to actually know in what
>> metrics
>> and by how much (even if just an upper bound due to the benchmark or
>> some
>> measurement being rather artificial).
>>
>> I mean you name the big HA status, why not measure that for real?
>> like, probably
>> among other things, in terms of bytes hitting the block layer, i.e.
>> the actual
>> backing disk from those requests as then we'd know for real if this
>> can reduce
>> the write load there, not just that it maybe should.
>
>
> hi,
>
> first i just wanted to say I'm sorry for my snarky comment about not
> needing to test performance for such code. You're right, any insight we
> can gain there is good and we (I!) should take the time to do that, even
> if the change looks "obvious" like it does here.
>
> so i did some benchmarks (mostly disk writes) and wrote the short script
> below (maybe we can reuse that?)
>
> ----8<----
> use strict;
> use warnings;
>
> use IO::File;
> use PVE::Tools;
>
> my $size = shift;
>
> sub get_bytes_written {
>     my $fh = IO::File->new("/proc/diskstats", "r");
>     die if !$fh;
>     my $bytes = undef;
>     while (defined(my $line = <$fh>)) {
>         if ($line =~ m/sdb/) {
>             my @fields = split(/\s+/, $line);
>             # field 10 is "sectors written"; sectors are 512 bytes
>             $bytes = $fields[10] * 512;
>         }
>     }
>     return $bytes;
> }
>
> sub test_write {
>     my ($k) = @_;
>     system("rm /etc/pve/testfile");
>     my $data = "a"x($k*1024);
>     system("sync; echo -n 3> /proc/sys/vm/drop_caches");
>     my $bytes_before = get_bytes_written();
>     PVE::Tools::file_set_contents("/etc/pve/testfile", $data);
>     system("sync; echo -n 3> /proc/sys/vm/drop_caches");
>     my $bytes_after = get_bytes_written();
>     return $bytes_after - $bytes_before;
> }
>
> $size //= 128;
>
> my $written = test_write($size) / 1024;
> print("$written\n");
> ---->8----
>
> to simulate our real write patterns with varying file sizes
>
> i installed a fresh pve, turned off pvestatd, and put /var/lib/pve-cluster
> on its own disk (/dev/sdb), so diskstats only contains writes from the
> pmxcfs
>
> the results are below (all sizes are kbytes; i ran the tests multiple
> times and they seem to be consistent)
>
> data size  written (old)  amplification (old)  written (new)  amplification (new)
>         1             56                   56             56                   56
>         2             72                   36             76                   38
>         4             84                   21             88                   22
>         8            144                   18            104                   13
>        16            236                   14            160                   10
>        32            532                   16            324                   10
>        64           1496                   23            836                   13
>       128           6616                   51           3848                   30
>       256          20600                   80          10568                   41
>       512          87296                  170          43416                   84
>      1024         388460                  379         197032                  192
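>
> (amplification here is just kbytes written to disk divided by the data
> size, e.g. 6616/128 ≈ 51 for the old code at 128k)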
>
> for smaller writes there seems to be a minimum overhead of ~50-100 kbytes;
> for big files we have a massive amplification of > 100x
>
> my patch does seem to make a difference for files > 4k
>
> but the biggest surprise here is that the write amplification is not
> linear, but grows with the amount of bytes we want to write: going e.g.
> from a 128k to a 256k file write we don't just write double the amount,
> but 3x to 4x as much (6616k -> 20600k in the old case).
>
> i also produced a flamegraph according to
> https://www.brendangregg.com/FlameGraphs/cpuflamegraphs.html, but that
> showed virtually no change between versions without and with my patch
> (if one has a good svg hoster, i can post them ofc)
>
> so, tl;dr
>
> for small writes it does not make that much of a difference, but we can
> save ~half the writes for large files in the pmxcfs (which e.g. a big
> ha-state file could produce)
_______________________________________________
pve-devel mailing list
pve-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [pve-devel] [RFC PATCH pve-cluster] fix #5728: pmxcfs: allow bigger writes than 4k for fuse
2024-09-23 9:17 ` Dominik Csapak
2024-09-23 11:48 ` Filip Schauer
@ 2024-09-23 12:00 ` Friedrich Weber
2024-09-23 12:03 ` Dominik Csapak
1 sibling, 1 reply; 15+ messages in thread
From: Friedrich Weber @ 2024-09-23 12:00 UTC (permalink / raw)
To: Proxmox VE development discussion, Dominik Csapak, Thomas Lamprecht
On 23/09/2024 11:17, Dominik Csapak wrote:
> [...]
> so i did some benchmarks (mostly disk writes) and wrote the short script
> below
> (maybe we can reuse that?)
>
> [...]
> sub test_write {
>     my ($k) = @_;
>     system("rm /etc/pve/testfile");
>     my $data = "a"x($k*1024);
>     system("sync; echo -n 3> /proc/sys/vm/drop_caches");
I'm not sure this actually drops the caches: without the space between
`3` and `>`, I think this redirects fd 3 to that file (so it doesn't
actually write the `3`)? I didn't run the script though, so I'm not sure
whether it makes any difference for the results.
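i.e. within the script's system() calls the two spellings would behave
differently (illustrative sketch, not part of the original script):

  system("sync; echo -n 3> /proc/sys/vm/drop_caches");   # `3>` redirects fd 3, nothing gets written
  system("sync; echo -n 3 > /proc/sys/vm/drop_caches");  # writes '3' and actually drops the caches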
_______________________________________________
pve-devel mailing list
pve-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [pve-devel] [RFC PATCH pve-cluster] fix #5728: pmxcfs: allow bigger writes than 4k for fuse
2024-09-23 12:00 ` Friedrich Weber
@ 2024-09-23 12:03 ` Dominik Csapak
0 siblings, 0 replies; 15+ messages in thread
From: Dominik Csapak @ 2024-09-23 12:03 UTC (permalink / raw)
To: Friedrich Weber, Proxmox VE development discussion, Thomas Lamprecht
On 9/23/24 14:00, Friedrich Weber wrote:
> On 23/09/2024 11:17, Dominik Csapak wrote:
>> [...]
>> so i did some benchmarks (mostly disk writes) and wrote the short script
>> below
>> (maybe we can reuse that?)
>>
>> [...]
>> sub test_write {
>>     my ($k) = @_;
>>     system("rm /etc/pve/testfile");
>>     my $data = "a"x($k*1024);
>>     system("sync; echo -n 3> /proc/sys/vm/drop_caches");
>
> I'm not sure this actually drops the caches: without the space between
> `3` and `>`, I think this redirects fd 3 to that file (so it doesn't
> actually write the `3`)? I didn't run the script though, so I'm not sure
> whether it makes any difference for the results.
ah yeah, i noticed but forgot to answer here. i fixed it locally and
re-ran the tests, but got the same results (± a bit of variation)
_______________________________________________
pve-devel mailing list
pve-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [pve-devel] [RFC PATCH pve-cluster] fix #5728: pmxcfs: allow bigger writes than 4k for fuse
2024-09-23 11:48 ` Filip Schauer
@ 2024-09-23 14:08 ` Filip Schauer
0 siblings, 0 replies; 15+ messages in thread
From: Filip Schauer @ 2024-09-23 14:08 UTC (permalink / raw)
To: pve-devel
Changing the way we write files can eliminate the rapid growth of write
amplification for files of up to 128KiB in size.
Right now we are using `PVE::Tools::file_set_contents`, which itself uses
`print`, and looking at the debug output we can see that this ends up
writing the file contents in 8k blocks (presumably the default PerlIO
buffer size).
```
$ echo 1 > /etc/pve/.debug && perl -e 'use PVE::Tools; my $data = "a"x(128*1024); PVE::Tools::file_set_contents("/etc/pve/testfile", $data);' && echo 0 > /etc/pve/.debug
$ journalctl -n250 -u pve-cluster | grep cfs_fuse_write
Sep 23 15:23:04 pmxcfsbench pmxcfs[16835]: [main] debug: enter cfs_fuse_write /testfile.tmp.22487 8192 0 (pmxcfs.c:355:cfs_fuse_write)
Sep 23 15:23:04 pmxcfsbench pmxcfs[16835]: [main] debug: leave cfs_fuse_write /testfile.tmp.22487 (8192) (pmxcfs.c:368:cfs_fuse_write)
...
```
So let's change the benchmark script to write all the contents in a single
block.
```diff
@@ -21,10 +21,9 @@
 sub test_write {
     my ($k) = @_;
     system("rm /etc/pve/testfile");
-    my $data = "a"x($k*1024);
     system("sync; echo -n 3 > /proc/sys/vm/drop_caches");
     my $bytes_before = get_bytes_written();
-    PVE::Tools::file_set_contents("/etc/pve/testfile", $data);
+    system("dd if=/dev/urandom of=/etc/pve/testfile bs=${k}k count=1 2> /dev/null");
     system("sync; echo -n 3 > /proc/sys/vm/drop_caches");
     my $bytes_after = get_bytes_written();
     return $bytes_after - $bytes_before;
```
With the `-obig_writes` patch applied, this gives the following results:

data size    written    amplification
        1         54             54.0
        2         33             16.5
        4         49             12.3
        8         53              6.6
       16         62              3.9
       32         77              2.4
       64        114              1.8
      128        178              1.4
      256        580              2.3
      512       2157              4.2
     1024       9844              9.6
With this we write in 128k blocks instead of 8k blocks, eliminating the
rapid write amplification growth for data sizes up to 128k.
It seems that `PVE::Tools::file_set_contents` needs to be optimized to not
write the contents in 8k blocks; instead of `print` we might want to use
`syswrite`.
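A rough sketch of how that could look (untested; hypothetical helper name,
not the actual PVE::Tools implementation):

```
use strict;
use warnings;

# sketch: write the whole buffer with syswrite, which bypasses PerlIO
# buffering, so fuse sees writes as big as it allows instead of 8k chunks
sub file_set_contents_syswrite {
    my ($filename, $data) = @_;
    # keep the write-to-temp-file-and-rename pattern mentioned earlier
    my $tmpname = "$filename.tmp.$$";
    open(my $fh, '>', $tmpname) or die "unable to open '$tmpname': $!\n";
    my ($off, $len) = (0, length($data));
    while ($off < $len) {
        my $written = syswrite($fh, $data, $len - $off, $off)
            // die "unable to write '$tmpname': $!\n";
        $off += $written;
    }
    close($fh) or die "closing '$tmpname' failed: $!\n";
    rename($tmpname, $filename)
        or die "renaming '$tmpname' to '$filename' failed: $!\n";
}
```
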
On 23/09/2024 13:48, Filip Schauer wrote:
> I also ran some benchmarks with the same script.
>
> I created a VM with two virtual disks (both on an LVM Thin storage),
> installed PVE on one disk, and set up an ext4 partition on the other.
>
> I stopped pvestatd and pve-cluster:
>
> ```
> systemctl stop pvestatd
> systemctl stop pve-cluster
> ```
>
> moved the pmxcfs database file to its own disk
>
> ```
> mv /var/lib/pve-cluster/config.db /tmp/
> mount /dev/sdb1 /var/lib/pve-cluster
> mv /tmp/config.db /var/lib/pve-cluster/
> ```
>
> and started pve-cluster again.
>
> ```
> systemctl start pve-cluster
> ```
>
> These are my results before and after applying the patch:
>
> data size  written (old)  amplification (old)  written (new)  amplification (new)
>         1             48                   48             45                   45
>         2             48                   24             45                   23
>         4             82                   21             80                   20
>         8            121                   15             90                   11
>        16            217                   14            146                    9
>        32            506                   16            314                   10
>        64           1472                   23            826                   13
>       128           5585                   44           3765                   29
>       256          20424                   80          10743                   42
>       512          86715                  169          43650                   85
>      1024         369568                  361         187496                  183
>
> As can be seen, my results are similar, with amplification really picking
> up at 128k. The patch seems to cut write amplification in half for big
> files, while making virtually no difference for files up to 4k.
>
> On 23/09/2024 11:17, Dominik Csapak wrote:
>> [...]
_______________________________________________
pve-devel mailing list
pve-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel
^ permalink raw reply [flat|nested] 15+ messages in thread
end of thread, other threads:[~2024-09-23 14:08 UTC | newest]
Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-09-19 9:52 [pve-devel] [RFC PATCH pve-cluster] fix #5728: pmxcfs: allow bigger writes than 4k for fuse Dominik Csapak
2024-09-19 12:01 ` Thomas Lamprecht
2024-09-19 12:45 ` Dominik Csapak
2024-09-19 14:57 ` Thomas Lamprecht
2024-09-20 4:04 ` Esi Y via pve-devel
2024-09-20 5:29 ` Esi Y via pve-devel
[not found] ` <CABtLnHoQFAUN0KcahbMF6hoX=WTfL8bHL0St77gQMSaojVGhBA@mail.gmail.com>
2024-09-20 7:32 ` Dominik Csapak
2024-09-20 6:16 ` Dominik Csapak
2024-09-22 9:25 ` Dietmar Maurer
2024-09-23 9:17 ` Dominik Csapak
2024-09-23 11:48 ` Filip Schauer
2024-09-23 14:08 ` Filip Schauer
2024-09-23 12:00 ` Friedrich Weber
2024-09-23 12:03 ` Dominik Csapak
2024-09-19 12:23 ` Esi Y via pve-devel
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox