From: Thomas Lamprecht
To: Proxmox VE development discussion, Roland
Date: Wed, 10 Mar 2021 07:55:53 +0100
Subject: Re: [pve-devel] avoidable writes of pmxcfs to /var/lib/pve-cluster/config.db ?
Message-ID: <3463e859-a6d9-ea66-481e-4f7548306e7d@proxmox.com>
In-Reply-To: <792c380c-6757-e058-55f6-f7d5436417f9@web.de>

Hi,

On 09.03.21 21:45, Roland wrote:
> hello Proxmox team,
>
> I found that the pmxcfs process is quite "chatty" and one of the top
> disk writers on our Proxmox nodes.
>
> I had a closer look because I was curious why the wearout of our Samsung
> EVO is already at 4%. Disk I/O of our VMs is typically very low, so we
> used lower-end SSDs for those machines.

FWIW, my Crucial MX200 512GB disk is at 3% wearout after running as the
PVE root FS for 5.5 years.

> it seems pmxcfs is constantly writing into config.db-wal at a rate of
> >10 kB/s and >10 writes/s, whereas I can only see few changes in
> config.db.
>
> From my rough calculation, these writes probably sum up to several
> hundred gigabytes of disk blocks and >100 million write operations per
> year, which isn't "just nothing" for lower-end SSDs (small and cheap
> SSDs may only have some tens of TBW of rated lifetime).
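(A quick sanity check of that estimate with plain shell arithmetic,
assuming the ~10 kB/s and ~10 writes/s rates are sustained over a full
year of uptime:

# echo "$(( 10 * 1000 * 60 * 60 * 24 * 365 / 1000**3 )) GB/year"
315 GB/year
# echo "$(( 10 * 60 * 60 * 24 * 365 )) writes/year"
315360000 writes/year

so "several hundred gigabytes" and ">100 million write operations" per
year are indeed the right order of magnitude.)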
> I know that it's recommended to use enterprise SSDs for Proxmox, but as
> they are expensive, I also dislike getting avoidable wearout on any of
> our systems.
>
> What makes me raise my eyebrow is that most of the data written to the
> SQLite db seems to be unchanged data, i.e. I don't see significant
> changes in config.db over time (compared with sqldiff), whereas the
> write-ahead log at config.db-wal has quite a high "flow rate".
>
> I cannot decide if this really is a must-have, but it looks like the
> writing of (at least parts of) the cluster runtime data (like RSA key
> information) is done in a "just dump it all down into the database" way.
> That may make it easy at the implementation level and easy for the
> programmer.
>
> I would love to hear a comment on this finding.
>
> Maybe there is will/room for optimisation to avoid unnecessary disk
> wearout, saving avoidable database writes (OK, it's a tiny workload),
> but probably also lowering the risk of database corruption in problem
> situations like a server crash or whatever.

The prime candidates for this write load are the PVE HA Local Resource
Manager (LRM) services on each node: they update their status frequently,
which is required to signal the current Cluster Resource Manager (CRM)
master that the HA stack on that node is alive and that commands got
executed with result X. So yes, this is required and intentional.

There may be some room for optimization, but it's not that
straightforward, and (over-)clever solutions are often the wrong ones for
an HA stack, as failure here is something we really want to avoid. But
yeah, some lower-hanging fruit could maybe be found here.

The other thing I just noticed when checking it out: run

# ls -l "/proc/$(pidof pmxcfs)/fd"

to get the FD numbers of all db-related files, and then watch the writes
with:

# strace -v -s $[1<<16] -f -p "$(pidof pmxcfs)" -e write=4,5,6

(here `-e write=FDs` makes strace dump the data written to those
descriptors; adjust the numbers to whatever the ls above shows for
config.db and its WAL.)

I was additionally seeing some writes for the RSA key files which should
just not be there, but I need to investigate this more closely; it seemed
a bit too odd to me.

I'll see if I can find out a bit more about the above; maybe there's
something to improve lurking there.

FWIW, in general we try to keep things rather simple. The main reason is
that simpler systems tend to work more reliably and are easier to
maintain, and the load of even simple services can still add up to
something quite complex, as in PVE. But we still try to avoid trading
efficiency away for oversimplification.

cheers,
Thomas
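P.S.: if you want a rough per-process write total without attaching
strace, the kernel's I/O accounting should be enough; a sketch, assuming
a Linux kernel with task I/O accounting enabled and smartmontools
installed. write_bytes is the cumulative count of bytes the process
caused to be sent to the storage layer:

# grep ^write_bytes "/proc/$(pidof pmxcfs)/io"

and the resulting SSD wear can be tracked over time via the SMART
attributes (names vary by vendor, e.g. Wear_Leveling_Count or
Total_LBAs_Written):

# smartctl -A /dev/sda | grep -i -e wear -e written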