From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <pve-devel-bounces@lists.proxmox.com>
Received: from firstgate.proxmox.com (firstgate.proxmox.com [212.224.123.68])
	by lore.proxmox.com (Postfix) with ESMTPS id B030D1FF15E
	for <inbox@lore.proxmox.com>; Fri, 20 Sep 2024 06:05:20 +0200 (CEST)
Received: from firstgate.proxmox.com (localhost [127.0.0.1])
	by firstgate.proxmox.com (Proxmox) with ESMTP id 883073045F;
	Fri, 20 Sep 2024 06:05:28 +0200 (CEST)
References: <20240919095202.1375181-1-d.csapak@proxmox.com>
 <21f250b8-a59c-426d-96de-11606cbb0e42@proxmox.com>
 <c6a05311-215e-4a5a-82ac-a1032fb331ec@proxmox.com>
 <d5b69768-acc4-45f7-bd3b-7cfe26cb261a@proxmox.com>
In-Reply-To: <d5b69768-acc4-45f7-bd3b-7cfe26cb261a@proxmox.com>
Date: Fri, 20 Sep 2024 06:04:36 +0200
To: t.lamprecht@proxmox.com, Dominik Csapak <d.csapak@proxmox.com>
MIME-Version: 1.0
Message-ID: <mailman.23.1726805127.332.pve-devel@lists.proxmox.com>
List-Id: Proxmox VE development discussion <pve-devel.lists.proxmox.com>
List-Post: <mailto:pve-devel@lists.proxmox.com>
From: Esi Y via pve-devel <pve-devel@lists.proxmox.com>
Precedence: list
Cc: Esi Y <esiy0676+proxmox@gmail.com>,
 Proxmox VE development discussion <pve-devel@lists.proxmox.com>
X-Mailman-Version: 2.1.29
X-BeenThere: pve-devel@lists.proxmox.com
List-Subscribe: <https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel>, 
 <mailto:pve-devel-request@lists.proxmox.com?subject=subscribe>
List-Unsubscribe: <https://lists.proxmox.com/cgi-bin/mailman/options/pve-devel>, 
 <mailto:pve-devel-request@lists.proxmox.com?subject=unsubscribe>
List-Archive: <http://lists.proxmox.com/pipermail/pve-devel/>
Reply-To: Proxmox VE development discussion <pve-devel@lists.proxmox.com>
List-Help: <mailto:pve-devel-request@lists.proxmox.com?subject=help>
Subject: Re: [pve-devel] [RFC PATCH pve-cluster] fix #5728: pmxcfs: allow
 bigger writes than 4k for fuse
Content-Type: multipart/mixed; boundary="===============5360423196243984490=="
Errors-To: pve-devel-bounces@lists.proxmox.com
Sender: "pve-devel" <pve-devel-bounces@lists.proxmox.com>

--===============5360423196243984490==
Content-Type: message/rfc822
Content-Disposition: inline

Return-Path: <esiy0676@gmail.com>
X-Original-To: pve-devel@lists.proxmox.com
Delivered-To: pve-devel@lists.proxmox.com
Received: from firstgate.proxmox.com (firstgate.proxmox.com [212.224.123.68])
	(using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits)
	 key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest SHA256)
	(No client certificate requested)
	by lists.proxmox.com (Postfix) with ESMTPS id 6D268C05CD
	for <pve-devel@lists.proxmox.com>; Fri, 20 Sep 2024 06:05:26 +0200 (CEST)
Received: from firstgate.proxmox.com (localhost [127.0.0.1])
	by firstgate.proxmox.com (Proxmox) with ESMTP id 536D93042B
	for <pve-devel@lists.proxmox.com>; Fri, 20 Sep 2024 06:04:56 +0200 (CEST)
Received: from mail-qt1-x831.google.com (mail-qt1-x831.google.com [IPv6:2607:f8b0:4864:20::831])
	(using TLSv1.3 with cipher TLS_AES_128_GCM_SHA256 (128/128 bits)
	 key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest SHA256)
	(No client certificate requested)
	by firstgate.proxmox.com (Proxmox) with ESMTPS
	for <pve-devel@lists.proxmox.com>; Fri, 20 Sep 2024 06:04:54 +0200 (CEST)
Received: by mail-qt1-x831.google.com with SMTP id d75a77b69052e-4581ee65b46so11989021cf.3
        for <pve-devel@lists.proxmox.com>; Thu, 19 Sep 2024 21:04:54 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=gmail.com; s=20230601; t=1726805087; x=1727409887; darn=lists.proxmox.com;
        h=content-transfer-encoding:cc:to:subject:message-id:date:from
         :in-reply-to:references:mime-version:from:to:cc:subject:date
         :message-id:reply-to;
        bh=6Yr5z7J83hOX4xTahEkDgabBCHKymuH1pQW3zw2cTw0=;
        b=SeAohDtBL5shIiXB3KJn6H7E5QF18CUG5pl61Fec4vImcUuBQPSqmIvZCjJJSIocnn
         v4D5Qv+FFh++0M6P+KDzbLt8zhOcMnxL0yvBUvjyInS4uXaWT2Dw75HX6gHzQZaz0BvD
         Ayw1KfKAwfEOP6zS03MHA/xn84C7jq9l6CPkP8QYb6DM5WwYHzKZYmrG779skguCikG9
         EyAx0ZOIXpVec74MItkOHdfbp7qcDlZNrltjgz6HIsw2+vaT3nmZBtXtCQ3xqQ0cGZjJ
         wLAkqCYiZV1JV9Go+As0L3ObZv7YjrWI8AyMSAkSURRoln9Ey4wagIksx+BN+cTjH+Mb
         uBXg==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=1e100.net; s=20230601; t=1726805087; x=1727409887;
        h=content-transfer-encoding:cc:to:subject:message-id:date:from
         :in-reply-to:references:mime-version:x-gm-message-state:from:to:cc
         :subject:date:message-id:reply-to;
        bh=6Yr5z7J83hOX4xTahEkDgabBCHKymuH1pQW3zw2cTw0=;
        b=YbW6FuHkuXkRMhCui2RQg4rABEXkErYN6CnrN6Tcgwh5Z8TqCj7N1vc1tiOy+ZCmhN
         JulEFcQ4XovpBig5zRjHg8XtLK7edTxM2TXOeNjuIvIAXWRFpgDtgPkzMQHGJDuzHflP
         F0PMtewkgcTYmlvmmHbajilRU+hmfWZiTjD417wdSH6hlcHGaNEDkqOLESIolqoJhokG
         AEQbJ2ZBMjK7+jYkDsPQ5y4AzsMMIKN4ImiV8/ZmHnB6vsOfSLY0TQLsfLmgGgE2DyJR
         4o1OEDjVXE8C/SWtDJGLfByfMLR9AJP6dUa0hwVyS26PJBbvazGsyB2vonHgbm+OOmD0
         XIdA==
X-Gm-Message-State: AOJu0Yykp1rtgZPjDaX/+h0Ux7Hr1Ht1+aPeRDof4xh/Irl3UIyI8klO
	Qy6i9a+0+Ie2yECJ8WDmOhQGcj7dNVgK/XzLNXLT3EdVy4FMT7YPbeVr7h4ZhdfUp9hyq1BcNCv
	ask4hknNurB+o2erm2A5+ZQW3nCQ=
X-Google-Smtp-Source: AGHT+IESQnpCpXqn8zTpAfAmlJ5pOPfp9zqBvUaqidWOzv/Ll5a7AsLt7V4dR5xejFMnkl8TREpeo8Qvta6EgzQrTp8=
X-Received: by 2002:a05:622a:8f:b0:458:333b:335f with SMTP id
 d75a77b69052e-45b2058338amr26657521cf.49.1726805087096; Thu, 19 Sep 2024
 21:04:47 -0700 (PDT)
MIME-Version: 1.0
References: <20240919095202.1375181-1-d.csapak@proxmox.com>
 <21f250b8-a59c-426d-96de-11606cbb0e42@proxmox.com> <c6a05311-215e-4a5a-82ac-a1032fb331ec@proxmox.com>
 <d5b69768-acc4-45f7-bd3b-7cfe26cb261a@proxmox.com>
In-Reply-To: <d5b69768-acc4-45f7-bd3b-7cfe26cb261a@proxmox.com>
From: Esi Y <esiy0676+proxmox@gmail.com>
Date: Fri, 20 Sep 2024 06:04:36 +0200
Message-ID: <CABtLnHp=5p9rkD-+aEMfBT2WfQnPhqcsFTGZPnSou4=U2xZOUw@mail.gmail.com>
Subject: Re: [pve-devel] [RFC PATCH pve-cluster] fix #5728: pmxcfs: allow
 bigger writes than 4k for fuse
To: t.lamprecht@proxmox.com, Dominik Csapak <d.csapak@proxmox.com>
Cc: Proxmox VE development discussion <pve-devel@lists.proxmox.com>
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-SPAM-LEVEL: Spam detection results:  0
	BAYES_00                 -1.9 Bayes spam probability is 0 to 1%
	DKIM_SIGNED               0.1 Message has a DKIM or DK signature, not necessarily valid
	DKIM_VALID               -0.1 Message has at least one valid DKIM or DK signature
	DKIM_VALID_AU            -0.1 Message has a valid DKIM or DK signature from author's domain
	DKIM_VALID_EF            -0.1 Message has a valid DKIM or DK signature from envelope-from domain
	DMARC_PASS               -0.1 DMARC pass policy
	FREEMAIL_ENVFROM_END_DIGIT   0.25 Envelope-from freemail username ends in digit
	FREEMAIL_FROM           0.001 Sender email is commonly abused enduser mail provider
	RCVD_IN_DNSWL_NONE     -0.0001 Sender listed at https://www.dnswl.org/, no trust
	SPF_HELO_NONE           0.001 SPF: HELO does not publish an SPF Record
	SPF_PASS               -0.001 SPF: sender matches SPF record
	URIBL_BLOCKED           0.001 ADMINISTRATOR NOTICE: The query to URIBL was blocked.  See http://wiki.apache.org/spamassassin/DnsBlocklists#dnsbl-block for more information. [proxmox.com]

I can't help it, I am sorry in advance, but ...

No one is going to bring up the elephant in the room (you may want to
call FUSE), such as that backend_write_inode is hit every single time
on virtually every memdb_pwrite, i.e. in addition to cfs_fuse_write
also on (logically related):
cfs_fuse_truncate
cfs_fuse_rename
cfs_fuse_utimens

So these are all separate transactions hitting the backend capturing
one and the same event.

Additionally, there's nothing atomic about updating __version__ and
the actual file ("inode") DB rows, so double the number of
transactions hit on every amplified hit yet.

Also, locks are persisted into backend only to be removed soon after.

WRT to FUSE2 buffering doing just fine for overwrites (<=3D original
size), this is true, but then at the same time the mode of PVE
operation (albeit quite correctly) is to create a .tmp.XXX (so this is
your NEW file being appended) and then rename, whilst all that
in-place of that very FUSE mountpoint (not so correctly) and at the
same time pmxcfs being completely oblivious to this.

I could not help this because this is a developer who - in my opinion
- quite rightly wanted to pick the low hanging fruit first with his
intuition (and a self-evident reasoning) completely disregarded,
however the same scrutiny was not exercised when e.g. bumping limits
[1] of that very FS . And that all back then was "tested with touch".
And this is all on someone else's codebase that is 10 years old (so
designed with different use case in mind, good enough for ~4K files),
meanwhile the well-meaning individual even admits he is not a C guru,
but is asked to spend a day profiling this multi-threaded CPG bespoke
code?

NB I will completely leave out what the above mentioned does to the
CPG messages flying around, for brevity. But it is why I originally
got interested.

I am sure I made many friends now that I called even the FUSE
migration on its own futile, but well, it is an RFC after all.

Thank you, gentlemen.

Esi Y

On Thu, Sep 19, 2024 at 4:57=E2=80=AFPM Thomas Lamprecht
<t.lamprecht@proxmox.com> wrote:
>
> Am 19/09/2024 um 14:45 schrieb Dominik Csapak:
> > On 9/19/24 14:01, Thomas Lamprecht wrote:
> >> Am 19/09/2024 um 11:52 schrieb Dominik Csapak:
> >>> by default libfuse2 limits writes to 4k size, which means that on wri=
tes
> >>> bigger than that, we do a whole write cycle for each 4k block that co=
mes
> >>> in. To avoid that, add the option 'big_writes' to allow writes bigger
> >>> than 4k at once.
> >>>
> >>> This should improve pmxcfs performance for situations where we often
> >>> write large files (e.g. big ha status) and maybe reduce writes to dis=
k.
> >>
> >> Should? Something like before/after for benchmark numbers, flamegraphs
> >> would be really good to have, without those it's rather hard to discus=
s
> >> this, and I'd like to avoid having to do those, or check the inner wor=
kings
> >> of the affected fuse userspace/kernel code paths here myself.
> >
> > well I mean the code change is relatively small and the result is rathe=
r clear:
>
> Well sure the code change is just setting an option... But the actual cha=
nge is
> abstracted away and would benefit from actually looking into..
>
> > in the current case we have the following calls from pmxcfs (shortened =
for e-mail)
> > when writing a single 128k block:
> > (dd if=3D... of=3D/etc/pve/test bs=3D128k count=3D1)
>
> Better than nothing but still no actual numbers (reduced time, reduced wr=
ite amp
> in combination with sqlite, ...), some basic analysis over file/write siz=
e distribution
> on a single node and (e.g. three node) cluster, ...
> If that's all obvious for you then great, but as already mentioned in the=
 past, I
> want actual data in commit messages for such stuff, and I cannot really s=
ee a downside
> of having such numbers.
>
> Again, as is I'm not really seeing what's to discuss, you send it as RFC =
after
> all.
>
> > [...]
> > so a factor of 32 less calls to cfs_fuse_write (including memdb_pwrite)
>
> That can be huge or not so big at all, i.e. as mentioned above, it would =
we good to
> measure the impact through some other metrics.
>
> And FWIW, I used bpftrace to count [0] with an unpatched pmxcfs, there I =
get
> the 32 calls to cfs_fuse_write only for a new file, overwriting the exist=
ing
> file again with the same amount of data (128k) just causes a single call.
> I tried using more data (e.g. from 128k initially to 256k or 512k) and it=
's
> always the data divided by 128k (even if the first file has a different s=
ize)
>
> We do not override existing files often, but rather write to a new file a=
nd
> then rename, but still quite interesting and IMO really showing that just
> because this is 1 +-1 line change it doesn't necessarily have to be trivi=
al
> and obvious in its effects.
>
> [0]: bpftrace -e 'u:cfs_fuse_write /str(args->path) =3D=3D "/test"/ {@ =
=3D count();} END { print(@) }' -p "$(pidof pmxcfs)"
>
>
> >>> If we'd change to libfuse3, this would be a non-issue, since that opt=
ion
> >>> got removed and is the default there.
> >>
> >> I'd prefer that. At least if done with the future PVE 9.0, as I do not=
 think
> >> it's a good idea in the middle of a stable release cycle.
> >
> > why not this change now, and the rewrite to libfuse3 later? that way we=
 can
> > have some improvements now too...
>
> Because I want some actual data and reasoning first, even if it's quite l=
ikely
> that this improves things Somehow=E2=84=A2, I'd like to actually know in =
what metrics
> and by how much (even if just an upper bound due to the benchmark or some
> measurement being rather artificial).
>
> I mean you name the big HA status, why not measure that for real? like, p=
robably
> among other things, in terms of bytes hitting the block layer, i.e. the a=
ctual
> backing disk from those requests as then we'd know for real if this can r=
educe
> the write load there, not just that it maybe should.
>
>
> _______________________________________________
> pve-devel mailing list
> pve-devel@lists.proxmox.com
> https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel


--===============5360423196243984490==
Content-Type: text/plain; charset="us-ascii"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
Content-Disposition: inline

_______________________________________________
pve-devel mailing list
pve-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel

--===============5360423196243984490==--