From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Fri, 20 Sep 2024 07:29:05 +0200
From: Esi Y via pve-devel
To: Proxmox VE development discussion
Cc: t.lamprecht@proxmox.com, Dominik Csapak
Subject: Re: [pve-devel] [RFC PATCH pve-cluster] fix #5728: pmxcfs: allow bigger writes than 4k for fuse
References: <20240919095202.1375181-1-d.csapak@proxmox.com> <21f250b8-a59c-426d-96de-11606cbb0e42@proxmox.com>
List-Id: Proxmox VE development discussion
Content-Type: text/plain; charset="UTF-8"

Somehow, my ending did not get included. Regarding the "hitting block later": this is almost impossible to quantify reliably.
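One thing that can be shown deterministically, though, is where the journal lives. The following is a minimal sketch against a throwaway SQLite database (hypothetical table name, not the pmxcfs schema): with journal_mode=WAL every committed transaction lands in a `-wal` side file on disk, while a MEMORY journal never materialises on the filesystem at all.

```python
import os
import sqlite3
import tempfile

def journal_hits_disk(journal_mode: str) -> bool:
    """Run a batch of single-row transactions and report whether the
    journal ever materialises as a file next to the database."""
    path = os.path.join(tempfile.mkdtemp(), "test.db")
    con = sqlite3.connect(path)
    con.execute(f"PRAGMA journal_mode={journal_mode}")
    con.execute("CREATE TABLE tree (inode INTEGER PRIMARY KEY, data BLOB)")
    for i in range(50):
        # one transaction per write, mimicking one-transaction-per-update behaviour
        with con:
            con.execute("INSERT OR REPLACE INTO tree VALUES (?, ?)",
                        (i, b"x" * 4096))
    # WAL keeps committed pages in "<db>-wal" until checkpointed; a MEMORY
    # rollback journal would otherwise live in "<db>-journal", but stays in RAM.
    on_disk = os.path.exists(path + "-wal") or os.path.exists(path + "-journal")
    con.close()
    return on_disk

wal_on_disk = journal_hits_disk("WAL")     # WAL pages are written next to the DB
mem_on_disk = journal_hits_disk("MEMORY")  # journal never touches the filesystem
```

This only demonstrates the journal placement, not the checkpoint timing, which is exactly the part that is hard to pin down.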
The reason is the use of the WAL: about 80% of the instructions [1] are happening in the hocus-pocus SQLite layer, and it really may or may not checkpoint at any given time, depending on how many transactions are hitting it (from all sides, i.e. other nodes too).

In the end, at least for my initial single-node testing, I ended up with a MEMORY journal (yes, the old-fashioned one); at least it gave some consistent results. The amplification dropped rapidly just from not using the WAL alone [2], and it would halve with bigger buffers.

But I found this a non-topic. The data is in memory, it needs copying there, then it is in the backend, and at times additionally in the WAL; and now FUSE3 would be increasing buffers in order to ... hit memory first. One has to question whether the WAL is necessary (on a read-once, write-all-the-time DB), but for me it is more about whether the DB is necessary at all. It is there for atomicity, I know, but there are other ways to achieve that without all the overhead. At the end of the day, it is about whether one wants a hard- or soft-state CPG. I personally aim for the soft one, though I do not expect much sympathy for that.

[1] https://forum.proxmox.com/threads/etc-pve-pmxcfs-amplification-inefficiencies.154074/#post-703261
[2] https://forum.proxmox.com/threads/ssd-wearout-and-rrdcache-pmxcfs-commit-interval.124638/#post-702765

On Fri, Sep 20, 2024 at 6:05 AM Esi Y via pve-devel wrote:
>
> ---------- Forwarded message ----------
> From: Esi Y
> To: t.lamprecht@proxmox.com, Dominik Csapak
> Cc: Proxmox VE development discussion
> Date: Fri, 20 Sep 2024 06:04:36 +0200
> Subject: Re: [pve-devel] [RFC PATCH pve-cluster] fix #5728: pmxcfs: allow bigger writes than 4k for fuse
>
> I can't help it, I am sorry in advance, but ...
>
> No one is going to bring up the elephant in the room (you may want to
> call it FUSE), namely that backend_write_inode is hit every single time
> on virtually every memdb_pwrite, i.e. in addition to cfs_fuse_write
> also on the (logically related):
> cfs_fuse_truncate
> cfs_fuse_rename
> cfs_fuse_utimens
>
> So these are all separate transactions hitting the backend, all
> capturing one and the same event.
>
> Additionally, there is nothing atomic about updating the __version__
> row and the actual file ("inode") DB rows, so double that number of
> transactions on every amplified hit yet again.
>
> Also, locks are persisted into the backend only to be removed soon after.
>
> WRT FUSE2 buffering doing just fine for overwrites (<= original
> size): this is true, but at the same time the PVE mode of operation
> (albeit quite correctly) is to create a .tmp.XXX file (so this is
> your NEW file being appended to) and then rename it, all of that
> in place on that very FUSE mountpoint (not so correctly), with
> pmxcfs being completely oblivious to it.
>
> I could not help myself, because here is a developer who - in my
> opinion quite rightly - wanted to pick the low-hanging fruit first,
> with his intuition (and self-evident reasoning) completely
> disregarded; yet the same scrutiny was not exercised when e.g. the
> limits [1] of that very FS were bumped, and all of that back then was
> "tested with touch". And this is all on someone else's codebase that
> is 10 years old (so designed with a different use case in mind, good
> enough for ~4K files), while the well-meaning individual even admits
> he is not a C guru, but is asked to spend a day profiling this
> bespoke multi-threaded CPG code?
>
> NB I will completely leave out, for brevity, what the above does to
> the CPG messages flying around. But it is why I originally got
> interested.
>
> I am sure I have made many friends now that I called even the FUSE
> migration futile on its own, but well, it is an RFC after all.
>
> Thank you, gentlemen.
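As an aside, the .tmp.XXX-then-rename mode of operation mentioned above is the classic POSIX atomic-replace pattern. A minimal sketch (hypothetical filenames, plain Python, not PVE code) of what such a write sequence looks like from the filesystem's point of view:

```python
import os
import tempfile

def replace_atomically(target: str, data: bytes) -> None:
    """Write to a sibling temp file, fsync, then rename over the target:
    readers see either the old or the new content, never a torn write."""
    directory = os.path.dirname(target) or "."
    fd, tmp = tempfile.mkstemp(prefix=".tmp.", dir=directory)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)            # the NEW file being appended to
            f.flush()
            os.fsync(f.fileno())     # make sure the data is durable first
        os.rename(tmp, target)       # atomic within one POSIX filesystem
    except BaseException:
        os.unlink(tmp)               # clean up the temp file on any failure
        raise

path = os.path.join(tempfile.mkdtemp(), "config.sample")
replace_atomically(path, b"old contents\n")
replace_atomically(path, b"new contents\n")
```

The point above is that when this pattern runs on top of the FUSE mountpoint itself, every overwrite is in fact a brand-new file from the filesystem's perspective, so overwrite-path buffering never applies.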
>
> Esi Y
>
> On Thu, Sep 19, 2024 at 4:57 PM Thomas Lamprecht wrote:
> >
> > On 19/09/2024 at 14:45, Dominik Csapak wrote:
> > > On 9/19/24 14:01, Thomas Lamprecht wrote:
> > >> On 19/09/2024 at 11:52, Dominik Csapak wrote:
> > >>> by default libfuse2 limits writes to 4k size, which means that on writes
> > >>> bigger than that, we do a whole write cycle for each 4k block that comes
> > >>> in. To avoid that, add the option 'big_writes' to allow writes bigger
> > >>> than 4k at once.
> > >>>
> > >>> This should improve pmxcfs performance for situations where we often
> > >>> write large files (e.g. big ha status) and maybe reduce writes to disk.
> > >>
> > >> Should? Something like before/after benchmark numbers or flamegraphs
> > >> would be really good to have; without those it's rather hard to discuss
> > >> this, and I'd like to avoid having to do those, or check the inner workings
> > >> of the affected fuse userspace/kernel code paths, here myself.
> > >
> > > well I mean the code change is relatively small and the result is rather clear:
> >
> > Well sure, the code change is just setting an option... But the actual change is
> > abstracted away and would benefit from actually looking into.
> >
> > > in the current case we have the following calls from pmxcfs (shortened for e-mail)
> > > when writing a single 128k block:
> > > (dd if=... of=/etc/pve/test bs=128k count=1)
> >
> > Better than nothing, but still no actual numbers (reduced time, reduced write amp
> > in combination with sqlite, ...), some basic analysis of the file/write size
> > distribution on a single node and on a (e.g. three-node) cluster, ...
> > If that's all obvious to you then great, but as already mentioned in the past, I
> > want actual data in commit messages for such stuff, and I cannot really see a
> > downside to having such numbers.
> >
> > Again, as is, I'm not really seeing what there is to discuss; you sent it as an
> > RFC after all.
> >
> > > [...]
> > > so a factor of 32 fewer calls to cfs_fuse_write (including memdb_pwrite)
> >
> > That can be huge or not so big at all, i.e. as mentioned above, it would be good
> > to measure the impact through some other metrics.
> >
> > And FWIW, I used bpftrace to count [0] with an unpatched pmxcfs; there I get
> > the 32 calls to cfs_fuse_write only for a new file. Overwriting the existing
> > file again with the same amount of data (128k) just causes a single call.
> > I tried using more data (e.g. from 128k initially to 256k or 512k) and it's
> > always the data size divided by 128k (even if the first file has a different size).
> >
> > We do not overwrite existing files often, but rather write to a new file and
> > then rename, but it's still quite interesting and IMO really shows that just
> > because this is a 1 +-1 line change it doesn't necessarily have to be trivial
> > and obvious in its effects.
> >
> > [0]: bpftrace -e 'u:cfs_fuse_write /str(args->path) == "/test"/ {@ = count();} END { print(@) }' -p "$(pidof pmxcfs)"
> >
> > >>> If we'd change to libfuse3, this would be a non-issue, since that option
> > >>> got removed and is the default there.
> > >>
> > >> I'd prefer that. At least if done with the future PVE 9.0, as I do not think
> > >> it's a good idea in the middle of a stable release cycle.
> > >
> > > why not this change now, and the rewrite to libfuse3 later? that way we can
> > > have some improvements now too...
> >
> > Because I want some actual data and reasoning first. Even if it's quite likely
> > that this improves things Somehow™, I'd like to actually know in what metrics
> > and by how much (even if just an upper bound, due to the benchmark or some
> > measurement being rather artificial).
> >
> > I mean, you name the big HA status, so why not measure that for real? Like,
> > probably among other things, in terms of bytes hitting the block layer, i.e.
> > the actual backing disk, from those requests, as then we'd know for real
> > whether this can reduce the write load there, not just that it maybe should.
> >
> > _______________________________________________
> > pve-devel mailing list
> > pve-devel@lists.proxmox.com
> > https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel

_______________________________________________
pve-devel mailing list
pve-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel