public inbox for pve-devel@lists.proxmox.com
 help / color / mirror / Atom feed
* [pve-devel] [RFC PATCH pve-cluster] fix #5728: pmxcfs: allow bigger writes than 4k for fuse
@ 2024-09-19  9:52 Dominik Csapak
  2024-09-19 12:01 ` Thomas Lamprecht
  2024-09-19 12:23 ` Esi Y via pve-devel
  0 siblings, 2 replies; 5+ messages in thread
From: Dominik Csapak @ 2024-09-19  9:52 UTC (permalink / raw)
  To: pve-devel

by default libfuse2 limits writes to 4k size, which means that on writes
bigger than that, we do a whole write cycle for each 4k block that comes
in. To avoid that, add the option 'big_writes' to allow writes bigger
than 4k at once.

This should improve pmxcfs performance for situations where we often
write large files (e.g. big ha status) and maybe reduce writes to disk.

If we'd change to libfuse3, this would be a non-issue, since that option
got removed and is the default there.

Signed-off-by: Dominik Csapak <d.csapak@proxmox.com>
---
sending as RFC, since I'm not sure if there are maybe any sideeffects i
overlooked. I scanned the pmxcfs code for potential problems where we
would e.g. assume a maximum size of 4096 for fuse writes, but could not
find any. (also my C foo is not that great). I run it here currently,
and it seems to work just fine atm.

 src/pmxcfs/pmxcfs.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/pmxcfs/pmxcfs.c b/src/pmxcfs/pmxcfs.c
index 1cf8ab8..c5b8f04 100644
--- a/src/pmxcfs/pmxcfs.c
+++ b/src/pmxcfs/pmxcfs.c
@@ -945,7 +945,7 @@ int main(int argc, char *argv[])
 
 	mkdir(CFSDIR, 0755);
 
-	char *fa[] = { "-f", "-odefault_permissions", "-oallow_other", NULL};
+	char *fa[] = { "-f", "-odefault_permissions", "-oallow_other", "-obig_writes", NULL};
 
 	struct fuse_args fuse_args = FUSE_ARGS_INIT(sizeof (fa)/sizeof(gpointer) - 1, fa);
 
-- 
2.39.2



_______________________________________________
pve-devel mailing list
pve-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel


^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [pve-devel] [RFC PATCH pve-cluster] fix #5728: pmxcfs: allow bigger writes than 4k for fuse
  2024-09-19  9:52 [pve-devel] [RFC PATCH pve-cluster] fix #5728: pmxcfs: allow bigger writes than 4k for fuse Dominik Csapak
@ 2024-09-19 12:01 ` Thomas Lamprecht
  2024-09-19 12:45   ` Dominik Csapak
  2024-09-19 12:23 ` Esi Y via pve-devel
  1 sibling, 1 reply; 5+ messages in thread
From: Thomas Lamprecht @ 2024-09-19 12:01 UTC (permalink / raw)
  To: Proxmox VE development discussion, Dominik Csapak

Am 19/09/2024 um 11:52 schrieb Dominik Csapak:
> by default libfuse2 limits writes to 4k size, which means that on writes
> bigger than that, we do a whole write cycle for each 4k block that comes
> in. To avoid that, add the option 'big_writes' to allow writes bigger
> than 4k at once.
> 
> This should improve pmxcfs performance for situations where we often
> write large files (e.g. big ha status) and maybe reduce writes to disk.

Should? Something like before/after for benchmark numbers, flamegraphs
would be really good to have, without those it's rather hard to discuss
this, and I'd like to avoid having to do those, or check the inner workings
of the affected fuse userspace/kernel code paths here myself.

> 
> If we'd change to libfuse3, this would be a non-issue, since that option
> got removed and is the default there.

I'd prefer that. At least if done with the future PVE 9.0, as I do not think
it's a good idea in the middle of a stable release cycle.

FWIW, some Debian Devs also want to push the fuse 2 to 3 migration forward
[0]. So, sooner or later we have to switch over pmxcfs anyway, and there's
already a POC from Fabian [1].

[0]: https://lists.debian.org/debian-devel/2024/08/msg00177.html
[1]: https://lore.proxmox.com/pve-devel/20220715074742.3387301-1-f.gruenbichler@proxmox.com/




_______________________________________________
pve-devel mailing list
pve-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel


^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [pve-devel] [RFC PATCH pve-cluster] fix #5728: pmxcfs: allow bigger writes than 4k for fuse
  2024-09-19  9:52 [pve-devel] [RFC PATCH pve-cluster] fix #5728: pmxcfs: allow bigger writes than 4k for fuse Dominik Csapak
  2024-09-19 12:01 ` Thomas Lamprecht
@ 2024-09-19 12:23 ` Esi Y via pve-devel
  1 sibling, 0 replies; 5+ messages in thread
From: Esi Y via pve-devel @ 2024-09-19 12:23 UTC (permalink / raw)
  To: Proxmox VE development discussion; +Cc: Esi Y

[-- Attachment #1: Type: message/rfc822, Size: 6987 bytes --]

From: Esi Y <esiy0676+proxmox@gmail.com>
To: Proxmox VE development discussion <pve-devel@lists.proxmox.com>
Subject: Re: [pve-devel] [RFC PATCH pve-cluster] fix #5728: pmxcfs: allow bigger writes than 4k for fuse
Date: Thu, 19 Sep 2024 14:23:55 +0200
Message-ID: <CABtLnHpSk+XWjzfvG=qOqGOet0JFda4_Lb_DkqR3ZgjsXvQHFA@mail.gmail.com>

First of all, thanks for spotting this!

I believe you will still be limited to 128K, this could be tuned up to
max 1M [1] (current file limit size in pmxcfs, coincidentally), but
needs FUSE 3.6.0 [2]:

[1] https://github.com/torvalds/linux/commit/5da784cce4308ae10a79e3c8c41b13fb9568e4e0
[2] https://github.com/libfuse/libfuse/blob/master/ChangeLog.rst#libfuse-360-2019-06-13

Obviously, this should be buffered within the pmxcfs logic instead, however.

-- 
Esi Y

On Thu, Sep 19, 2024 at 11:52 AM Dominik Csapak <d.csapak@proxmox.com> wrote:
>
> by default libfuse2 limits writes to 4k size, which means that on writes
> bigger than that, we do a whole write cycle for each 4k block that comes
> in. To avoid that, add the option 'big_writes' to allow writes bigger
> than 4k at once.
>
> This should improve pmxcfs performance for situations where we often
> write large files (e.g. big ha status) and maybe reduce writes to disk.
>
> If we'd change to libfuse3, this would be a non-issue, since that option
> got removed and is the default there.
>
> Signed-off-by: Dominik Csapak <d.csapak@proxmox.com>
> ---
> sending as RFC, since I'm not sure if there are maybe any sideeffects i
> overlooked. I scanned the pmxcfs code for potential problems where we
> would e.g. assume a maximum size of 4096 for fuse writes, but could not
> find any. (also my C foo is not that great). I run it here currently,
> and it seems to work just fine atm.
>
>  src/pmxcfs/pmxcfs.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/src/pmxcfs/pmxcfs.c b/src/pmxcfs/pmxcfs.c
> index 1cf8ab8..c5b8f04 100644
> --- a/src/pmxcfs/pmxcfs.c
> +++ b/src/pmxcfs/pmxcfs.c
> @@ -945,7 +945,7 @@ int main(int argc, char *argv[])
>
>         mkdir(CFSDIR, 0755);
>
> -       char *fa[] = { "-f", "-odefault_permissions", "-oallow_other", NULL};
> +       char *fa[] = { "-f", "-odefault_permissions", "-oallow_other", "-obig_writes", NULL};
>
>         struct fuse_args fuse_args = FUSE_ARGS_INIT(sizeof (fa)/sizeof(gpointer) - 1, fa);
>
> --
> 2.39.2
>
>
>
> _______________________________________________
> pve-devel mailing list
> pve-devel@lists.proxmox.com
> https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel
>


[-- Attachment #2: Type: text/plain, Size: 160 bytes --]

_______________________________________________
pve-devel mailing list
pve-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [pve-devel] [RFC PATCH pve-cluster] fix #5728: pmxcfs: allow bigger writes than 4k for fuse
  2024-09-19 12:01 ` Thomas Lamprecht
@ 2024-09-19 12:45   ` Dominik Csapak
  2024-09-19 14:57     ` Thomas Lamprecht
  0 siblings, 1 reply; 5+ messages in thread
From: Dominik Csapak @ 2024-09-19 12:45 UTC (permalink / raw)
  To: Thomas Lamprecht, Proxmox VE development discussion

On 9/19/24 14:01, Thomas Lamprecht wrote:
> Am 19/09/2024 um 11:52 schrieb Dominik Csapak:
>> by default libfuse2 limits writes to 4k size, which means that on writes
>> bigger than that, we do a whole write cycle for each 4k block that comes
>> in. To avoid that, add the option 'big_writes' to allow writes bigger
>> than 4k at once.
>>
>> This should improve pmxcfs performance for situations where we often
>> write large files (e.g. big ha status) and maybe reduce writes to disk.
> 
> Should? Something like before/after for benchmark numbers, flamegraphs
> would be really good to have, without those it's rather hard to discuss
> this, and I'd like to avoid having to do those, or check the inner workings
> of the affected fuse userspace/kernel code paths here myself.

well I mean the code change is relatively small and the result is rather clear:

in the current case we have the following calls from pmxcfs (shortened for e-mail)
when writing a single 128k block:
(dd if=... of=/etc/pve/test bs=128k count=1)

---
[main] debug: enter cfs_fuse_write /test 4096 0 (pmxcfs.c:355:cfs_fuse_write)
[main] debug: find_plug start test (pmxcfs.c:103:find_plug)
[main] debug: cfs_plug_base_lookup_plug test (cfs-plug.c:52:cfs_plug_base_lookup_plug)
[main] debug: cfs_plug_base_lookup_plug name = test new path = (null) 
(cfs-plug.c:59:cfs_plug_base_lookup_plug)
[main] debug: find_plug end test = 0x64319c3e0bc0 (test) (pmxcfs.c:110:find_plug)
[main] debug: enter cfs_plug_base_write test 4096 0 (cfs-plug.c:294:cfs_plug_base_write)
[main] debug: memdb_pwrite . test 0000000000DE88EC 0000000000DE88ED (memdb.c:850:memdb_pwrite)
[database] debug: enter backend_write_inode 0000000000DE88EC 'test', size 4096 
(database.c:157:backend_write_inode)
[database] debug: enter backend_write_inode 0000000000000000 '__version__', size 0 
(database.c:157:backend_write_inode)
[main] debug: leave cfs_fuse_write /test (4096) (pmxcfs.c:368:cfs_fuse_write)
[main] debug: enter cfs_fuse_write /test 4096 4096 (pmxcfs.c:355:cfs_fuse_write)
[main] debug: find_plug start test (pmxcfs.c:103:find_plug)
[main] debug: cfs_plug_base_lookup_plug test (cfs-plug.c:52:cfs_plug_base_lookup_plug)
[main] debug: cfs_plug_base_lookup_plug name = test new path = (null) 
(cfs-plug.c:59:cfs_plug_base_lookup_plug)
[main] debug: find_plug end test = 0x64319c3e0bc0 (test) (pmxcfs.c:110:find_plug)
[main] debug: enter cfs_plug_base_write test 4096 4096 (cfs-plug.c:294:cfs_plug_base_write)
[main] debug: memdb_pwrite . test 0000000000DE88EC 0000000000DE88EE (memdb.c:850:memdb_pwrite)
[database] debug: enter backend_write_inode 0000000000DE88EC 'test', size 8192 
(database.c:157:backend_write_inode)
[database] debug: enter backend_write_inode 0000000000000000 '__version__', size 0 
(database.c:157:backend_write_inode)
[main] debug: leave cfs_fuse_write /test (4096) (pmxcfs.c:368:cfs_fuse_write)
[main] debug: enter cfs_fuse_write /test 4096 8192 (pmxcfs.c:355:cfs_fuse_write)
---
and so on until we reach

---
[main] debug: enter cfs_fuse_write /test 4096 126976 (pmxcfs.c:355:cfs_fuse_write)
[main] debug: find_plug start test (pmxcfs.c:103:find_plug)
[main] debug: cfs_plug_base_lookup_plug test (cfs-plug.c:52:cfs_plug_base_lookup_plug)
[main] debug: cfs_plug_base_lookup_plug name = test new path = (null) 
(cfs-plug.c:59:cfs_plug_base_lookup_plug)
[main] debug: find_plug end test = 0x64319c3e0bc0 (test) (pmxcfs.c:110:find_plug)
[main] debug: enter cfs_plug_base_write test 4096 126976 (cfs-plug.c:294:cfs_plug_base_write)
[main] debug: memdb_pwrite . test 0000000000DE88EC 0000000000DE890C (memdb.c:850:memdb_pwrite)
[database] debug: enter backend_write_inode 0000000000DE88EC 'test', size 131072 
(database.c:157:backend_write_inode)
[database] debug: enter backend_write_inode 0000000000000000 '__version__', size 0 
(database.c:157:backend_write_inode)
[main] debug: leave cfs_fuse_write /test (4096) (pmxcfs.c:368:cfs_fuse_write)
---

with my changes it results in:

---
[main] debug: enter cfs_fuse_write /test 131072 0 (pmxcfs.c:355:cfs_fuse_write)
[main] debug: find_plug start test (pmxcfs.c:103:find_plug)
[main] debug: cfs_plug_base_lookup_plug test (cfs-plug.c:52:cfs_plug_base_lookup_plug)
[main] debug: cfs_plug_base_lookup_plug name = test new path = (null) 
(cfs-plug.c:59:cfs_plug_base_lookup_plug)
[main] debug: find_plug end test = 0x62f47945f360 (test) (pmxcfs.c:110:find_plug)
[main] debug: enter cfs_plug_base_write test 131072 0 (cfs-plug.c:294:cfs_plug_base_write)
[main] debug: memdb_pwrite . test 0000000000DE88EC 0000000000DE8AB8 (memdb.c:850:memdb_pwrite)
[database] debug: enter backend_write_inode 0000000000DE88EC 'test', size 131072 
(database.c:157:backend_write_inode)
[database] debug: enter backend_write_inode 0000000000000000 '__version__', size 0 
(database.c:157:backend_write_inode)
[main] debug: leave cfs_fuse_write /test (131072) (pmxcfs.c:368:cfs_fuse_write)
---

so a factor of 32 less calls to cfs_fuse_write (including memdb_pwrite)

> 
>>
>> If we'd change to libfuse3, this would be a non-issue, since that option
>> got removed and is the default there.
> 
> I'd prefer that. At least if done with the future PVE 9.0, as I do not think
> it's a good idea in the middle of a stable release cycle.

why not this change now, and the rewrite to libfuse3 later? that way we can
have some improvements now too...

> 
> FWIW, some Debian Devs also want to push the fuse 2 to 3 migration forward
> [0]. So, sooner or later we have to switch over pmxcfs anyway, and there's
> already a POC from Fabian [1].
> 
> [0]: https://lists.debian.org/debian-devel/2024/08/msg00177.html
> [1]: https://lore.proxmox.com/pve-devel/20220715074742.3387301-1-f.gruenbichler@proxmox.com/
> 
> 


sure I can look at that when we're nearing 9.0


_______________________________________________
pve-devel mailing list
pve-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel


^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [pve-devel] [RFC PATCH pve-cluster] fix #5728: pmxcfs: allow bigger writes than 4k for fuse
  2024-09-19 12:45   ` Dominik Csapak
@ 2024-09-19 14:57     ` Thomas Lamprecht
  0 siblings, 0 replies; 5+ messages in thread
From: Thomas Lamprecht @ 2024-09-19 14:57 UTC (permalink / raw)
  To: Dominik Csapak, Proxmox VE development discussion

Am 19/09/2024 um 14:45 schrieb Dominik Csapak:
> On 9/19/24 14:01, Thomas Lamprecht wrote:
>> Am 19/09/2024 um 11:52 schrieb Dominik Csapak:
>>> by default libfuse2 limits writes to 4k size, which means that on writes
>>> bigger than that, we do a whole write cycle for each 4k block that comes
>>> in. To avoid that, add the option 'big_writes' to allow writes bigger
>>> than 4k at once.
>>>
>>> This should improve pmxcfs performance for situations where we often
>>> write large files (e.g. big ha status) and maybe reduce writes to disk.
>>
>> Should? Something like before/after for benchmark numbers, flamegraphs
>> would be really good to have, without those it's rather hard to discuss
>> this, and I'd like to avoid having to do those, or check the inner workings
>> of the affected fuse userspace/kernel code paths here myself.
> 
> well I mean the code change is relatively small and the result is rather clear:

Well sure the code change is just setting an option... But the actual change is
abstracted away and would benefit from actually looking into..

> in the current case we have the following calls from pmxcfs (shortened for e-mail)
> when writing a single 128k block:
> (dd if=... of=/etc/pve/test bs=128k count=1)

Better than nothing but still no actual numbers (reduced time, reduced write amp
in combination with sqlite, ...), some basic analysis over file/write size distribution
on a single node and (e.g. three node) cluster, ...
If that's all obvious for you then great, but as already mentioned in the past, I
want actual data in commit messages for such stuff, and I cannot really see a downside
of having such numbers.

Again, as is I'm not really seeing what's to discuss, you send it as RFC after
all.

> [...] 
> so a factor of 32 less calls to cfs_fuse_write (including memdb_pwrite)

That can be huge or not so big at all, i.e. as mentioned above, it would we good to
measure the impact through some other metrics.

And FWIW, I used bpftrace to count [0] with an unpatched pmxcfs, there I get
the 32 calls to cfs_fuse_write only for a new file, overwriting the existing
file again with the same amount of data (128k) just causes a single call.
I tried using more data (e.g. from 128k initially to 256k or 512k) and it's
always the data divided by 128k (even if the first file has a different size)

We do not override existing files often, but rather write to a new file and
then rename, but still quite interesting and IMO really showing that just
because this is 1 +-1 line change it doesn't necessarily have to be trivial
and obvious in its effects.

[0]: bpftrace -e 'u:cfs_fuse_write /str(args->path) == "/test"/ {@ = count();} END { print(@) }' -p "$(pidof pmxcfs)"


>>> If we'd change to libfuse3, this would be a non-issue, since that option
>>> got removed and is the default there.
>>
>> I'd prefer that. At least if done with the future PVE 9.0, as I do not think
>> it's a good idea in the middle of a stable release cycle.
> 
> why not this change now, and the rewrite to libfuse3 later? that way we can
> have some improvements now too...

Because I want some actual data and reasoning first, even if it's quite likely
that this improves things Somehow™, I'd like to actually know in what metrics
and by how much (even if just an upper bound due to the benchmark or some
measurement being rather artificial).

I mean you name the big HA status, why not measure that for real? like, probably
among other things, in terms of bytes hitting the block layer, i.e. the actual
backing disk from those requests as then we'd know for real if this can reduce
the write load there, not just that it maybe should.


_______________________________________________
pve-devel mailing list
pve-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2024-09-19 14:56 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-09-19  9:52 [pve-devel] [RFC PATCH pve-cluster] fix #5728: pmxcfs: allow bigger writes than 4k for fuse Dominik Csapak
2024-09-19 12:01 ` Thomas Lamprecht
2024-09-19 12:45   ` Dominik Csapak
2024-09-19 14:57     ` Thomas Lamprecht
2024-09-19 12:23 ` Esi Y via pve-devel

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox
Service provided by Proxmox Server Solutions GmbH | Privacy | Legal