public inbox for pve-devel@lists.proxmox.com
* [pve-devel] avoidable writes of pmxcfs to /var/lib/pve-cluster/config.db ?
@ 2021-03-09 20:45 Roland
  2021-03-10  6:55 ` Thomas Lamprecht
  0 siblings, 1 reply; 4+ messages in thread
From: Roland @ 2021-03-09 20:45 UTC (permalink / raw)
  To: pve-devel

Hello Proxmox team,

I found that the pmxcfs process is quite "chatty" and one of the top
disk writers on our Proxmox nodes.

I had a closer look because I was curious why the wearout of our Samsung
EVO is already at 4%. As the disk I/O of our VMs is typically very low,
we used lower-end SSDs for those machines.

It seems pmxcfs is constantly writing into config.db-wal at a rate of
>10 kB/s and >10 writes/s, whereas I can only see a few changes in config.db.

From my rough calculation, these writes probably sum up to several
hundred gigabytes of disk blocks and >100 million write operations per
year, which isn't "just nothing" for lower-end SSDs (small and cheap
SSDs may only have a rated endurance of some tens of TBW).
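As a rough back-of-envelope check (a sketch, assuming the ~10 kB/s and
~10 writes/s rates from the measurements below stay constant over a
full year):

# yearly write volume and operation count at a constant rate
# (86400 s/day * 365 days = 31,536,000 seconds per year)
echo "$(( 10 * 86400 * 365 / 1000 / 1000 )) GB/year"          # -> 315 GB/year at 10 kB/s
echo "$(( 10 * 86400 * 365 / 1000000 )) million writes/year"  # -> 315 million writes at 10 writes/s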

I know that it's recommended to use enterprise SSDs for Proxmox, but as
they are expensive, I also dislike avoidable wearout on any of our
systems.


What makes me raise my eyebrows is that most of the data written to the
SQLite DB seems to be unchanged data, i.e. I don't see significant
changes in config.db over time (compared with sqldiff), whereas the
write-ahead log at config.db-wal has quite a high "flow rate".
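The comparison can be reproduced roughly like this (a sketch, assuming
the default DB path and that sqldiff from the SQLite tools is installed):

# take two snapshots of the live DB a few minutes apart, then diff them
sqlite3 /var/lib/pve-cluster/config.db ".backup /tmp/config-before.db"
sleep 600
sqlite3 /var/lib/pve-cluster/config.db ".backup /tmp/config-after.db"
sqldiff /tmp/config-before.db /tmp/config-after.db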

I cannot decide if this really is a must-have, but it looks like the
writing of (at least parts of) the cluster runtime data (like RSA key
information) is done in a "just dump it all into the database" way.
This may make it easy at the implementation level and easy for the
programmer.

I would love to hear a comment on this finding.

Maybe there is will/room for optimisation to avoid unnecessary disk
wearout, saving avoidable database writes/workload (OK, it's a tiny
workload), but thus probably also lowering the risk of database
corruption in particular problem situations like a server crash or whatever.

regards
Roland Kletzing

ps:
Sorry if this may look pointy-headed or like bean-counting from an
outsider, and sorry for posting here, but I was unsure whether Bugzilla
was better for this, especially because I could not select "corosync-pve"
as a component for the bug/RFE ticket. This is an open-source project,
and typically open source gets a closer look and better optimization,
which often makes it superior. At the very least it should be allowed to
ask "is this intentional?".



# strace -f -p $(pidof pmxcfs) -s32768 2>&1 | grep pwrite64 | pv -br -i10 >/dev/null
1.93MiB [18.8KiB/s]
^C

# fatrace -c |stdbuf -oL grep pmxcfs | stdbuf -oL pv -lr -i60 >/dev/null
[11.5 /s]

# cat config.db-wal |strings|grep "BEGIN RSA" |cut -b 1-50|sort |uniq -c
     331 pve-www.key-----BEGIN RSA PRIVATE KEY-----

# cat config.db-wal |strings|grep "hp-ml350" |cut -b 1-50|sort |uniq -c
     114 hp-ml350 ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDGN

# iotop -a

Total DISK READ:         0.00 B/s | Total DISK WRITE:        71.79 K/s
Current DISK READ:       0.00 B/s | Current DISK WRITE:       0.00 B/s
   TID  PRIO  USER      DISK READ  DISK WRITE>  SWAPIN     IO  COMMAND
 26962  be/4  root        20.00 K   1295.00 K   0.00 %  0.00 %  pmxcfs [cfs_loop]
  8425  be/4  root         0.00 B     32.00 K   0.00 %  0.00 %  [kworker/u16:1-poll_mpt2sas0_statu]
 26992  be/4  root         0.00 B      8.00 K   0.00 %  0.00 %  rrdcached -B -b /var/lib/rrdcached/db/ -j /var/lib/rrdcached/journal/ -p /var/run/rrdcached.pid -l unix:/var/run/rrdcached.sock
  7832  be/4  www-data     0.00 B   1024.00 B   0.00 %  0.00 %  pveproxy worker
     1  be/4  root         0.00 B      0.00 B   0.00 %  0.00 %  init
     2  be/4  root         0.00 B      0.00 B   0.00 %  0.00 %  [kthreadd]
     3  be/0  root         0.00 B      0.00 B   0.00 %  0.00 %  [rcu_gp]
     4  be/0  root         0.00 B      0.00 B   0.00 %  0.00 %  [rcu_par_gp]
     6  be/0  root         0.00 B      0.00 B   0.00 %  0.00 %  [kworker/0:0H-kblockd]
     8  be/0  root         0.00 B      0.00 B   0.00 %  0.00 %  [mm_percpu_wq]
     9  be/4  root         0.00 B      0.00 B   0.00 %  0.00 %  [ksoftirqd/0]



* Re: [pve-devel] avoidable writes of pmxcfs to /var/lib/pve-cluster/config.db ?
  2021-03-09 20:45 [pve-devel] avoidable writes of pmxcfs to /var/lib/pve-cluster/config.db ? Roland
@ 2021-03-10  6:55 ` Thomas Lamprecht
  2021-03-10  8:18   ` Roland
  0 siblings, 1 reply; 4+ messages in thread
From: Thomas Lamprecht @ 2021-03-10  6:55 UTC (permalink / raw)
  To: Proxmox VE development discussion, Roland

Hi,

On 09.03.21 21:45, Roland wrote:
> Hello Proxmox team,
> 
> I found that the pmxcfs process is quite "chatty" and one of the top
> disk writers on our Proxmox nodes.
> 
> I had a closer look because I was curious why the wearout of our Samsung
> EVO is already at 4%. As the disk I/O of our VMs is typically very low,
> we used lower-end SSDs for those machines.

FWIW, my Crucial MX200 512GB disk is at 3% wearout after running as the
PVE root FS for 5.5 years.

> 
> It seems pmxcfs is constantly writing into config.db-wal at a rate of
> >10 kB/s and >10 writes/s, whereas I can only see a few changes in config.db.
> 
> From my rough calculation, these writes probably sum up to several
> hundred gigabytes of disk blocks and >100 million write operations per
> year, which isn't "just nothing" for lower-end SSDs (small and cheap
> SSDs may only have a rated endurance of some tens of TBW).
> 
> I know that it's recommended to use enterprise SSDs for Proxmox, but as
> they are expensive, I also dislike avoidable wearout on any of our
> systems.
> 
> 
> What makes me raise my eyebrows is that most of the data written to the
> SQLite DB seems to be unchanged data, i.e. I don't see significant
> changes in config.db over time (compared with sqldiff), whereas the
> write-ahead log at config.db-wal has quite a high "flow rate".
> 
> I cannot decide if this really is a must-have, but it looks like the
> writing of (at least parts of) the cluster runtime data (like RSA key
> information) is done in a "just dump it all into the database" way.
> This may make it easy at the implementation level and easy for the
> programmer.
> 
> I would love to hear a comment on this finding.
> 
> Maybe there is will/room for optimisation to avoid unnecessary disk
> wearout, saving avoidable database writes/workload (OK, it's a tiny
> workload), but thus probably also lowering the risk of database
> corruption in particular problem situations like a server crash or whatever.

So the prime candidates for this write load are the PVE HA Local Resource
Manager services on each node: they update their status, and that is often
required to signal the current Cluster Resource Manager's master service
that the HA stack on that node is alive and that commands got executed
with result X. So yes, this is required and intentional.
There may be some room for optimization, but it's not that straightforward,
and (over-)clever solutions are often the wrong ones for an HA stack, as
failure here is something we really want to avoid. But yeah, some
lower-hanging fruit could maybe be found here.
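One way to see this churn from the outside is to watch the HA status
files that the LRM/CRM keep in the cluster filesystem (a rough sketch;
the paths assume a default setup with HA in use):

# print the modification time of the HA status files once per second
watch -n1 'stat -c "%y %n" /etc/pve/nodes/*/lrm_status /etc/pve/ha/manager_status'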

The other thing I just noticed, when checking out
# ls -l "/proc/$(pidof pmxcfs)/fd"
to list all DB-related FDs and then watching the writes with
# strace -v -s $[1<<16] -f -p "$(pidof pmxcfs)" -e write=4,5,6
was that I additionally saw some writes for the RSA key files which should
just not be there. I need to investigate this more closely; it seemed a bit
too odd to me.
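The FD numbers 4, 5 and 6 are just what they happen to be on this system;
which files they actually point to can be confirmed with, for example:

# resolve the traced FD numbers to their file paths (numbers may differ)
for fd in 4 5 6; do readlink "/proc/$(pidof pmxcfs)/fd/$fd"; done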

I'll see if I can find out a bit more detail about the above; maybe there's
something to improve lurking there.

FWIW, in general we try to keep stuff rather simple; the main reason is that
simpler systems tend to work more reliably and are easier to maintain, and
the load of even simple services can still get quite complex in sum, as in
PVE. But we still try not to trade efficiency away for oversimplification.

cheers,
Thomas





* Re: [pve-devel] avoidable writes of pmxcfs to /var/lib/pve-cluster/config.db ?
  2021-03-10  6:55 ` Thomas Lamprecht
@ 2021-03-10  8:18   ` Roland
  2021-03-10  9:08     ` Thomas Lamprecht
  0 siblings, 1 reply; 4+ messages in thread
From: Roland @ 2021-03-10  8:18 UTC (permalink / raw)
  To: Thomas Lamprecht, Proxmox VE development discussion


>> corruption in particular problem situations like a server crash or whatever.
> So the prime candidates for this write load are the PVE HA Local Resource
> Manager services on each node: they update their status, and that is often
> required to signal the current Cluster Resource Manager's master service
> that the HA stack on that node is alive and that commands got executed
> with result X. So yes, this is required and intentional.
> There may be some room for optimization, but it's not that straightforward,
> and (over-)clever solutions are often the wrong ones for an HA stack, as
> failure here is something we really want to avoid. But yeah, some
> lower-hanging fruit could maybe be found here.
>
> The other thing I just noticed, when checking out
> # ls -l "/proc/$(pidof pmxcfs)/fd"
> to list all DB-related FDs and then watching the writes with
> # strace -v -s $[1<<16] -f -p "$(pidof pmxcfs)" -e write=4,5,6
> was that I additionally saw some writes for the RSA key files which should
> just not be there. I need to investigate this more closely; it seemed a bit
> too odd to me.
Not only these; I also see constant rewrites of (non-changing?) VM
configuration data.

Just run cat config.db-wal | strings | grep ..... | sort | uniq -c to see
what's getting written there.
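For instance, something like the following counts how often a given VM's
config file name shows up in the WAL (VMID 100 is only a placeholder):

# count occurrences of a (hypothetical) VM config file name in the WAL
strings /var/lib/pve-cluster/config.db-wal | grep -c '100\.conf'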

The weird thing is that it does not happen for every VM, just for some. I
will send you an email with additional data (I don't want to post all my
VMs' MAC addresses in public).

>
> I'll see if I can find out a bit more detail about the above; maybe there's
> something to improve lurking there.
>
> FWIW, in general we try to keep stuff rather simple; the main reason is that
> simpler systems tend to work more reliably and are easier to maintain, and
> the load of even simple services can still get quite complex in sum, as in
> PVE. But we still try not to trade efficiency away for oversimplification.


Thanks for explaining, for the hints on how to trace the writes, and for
having a look.

Sure, critical components SHOULD be simple!

Roland








* Re: [pve-devel] avoidable writes of pmxcfs to /var/lib/pve-cluster/config.db ?
  2021-03-10  8:18   ` Roland
@ 2021-03-10  9:08     ` Thomas Lamprecht
  0 siblings, 0 replies; 4+ messages in thread
From: Thomas Lamprecht @ 2021-03-10  9:08 UTC (permalink / raw)
  To: Proxmox VE development discussion, Roland

On 10.03.21 09:18, Roland wrote:
> 
>>> corruption in particular problem situations like a server crash or whatever.
>> So the prime candidates for this write load are the PVE HA Local Resource
>> Manager services on each node: they update their status, and that is often
>> required to signal the current Cluster Resource Manager's master service
>> that the HA stack on that node is alive and that commands got executed
>> with result X. So yes, this is required and intentional.
>> There may be some room for optimization, but it's not that straightforward,
>> and (over-)clever solutions are often the wrong ones for an HA stack, as
>> failure here is something we really want to avoid. But yeah, some
>> lower-hanging fruit could maybe be found here.
>>
>> The other thing I just noticed, when checking out
>> # ls -l "/proc/$(pidof pmxcfs)/fd"
>> to list all DB-related FDs and then watching the writes with
>> # strace -v -s $[1<<16] -f -p "$(pidof pmxcfs)" -e write=4,5,6
>> was that I additionally saw some writes for the RSA key files which should
>> just not be there. I need to investigate this more closely; it seemed a bit
>> too odd to me.
> Not only these; I also see constant rewrites of (non-changing?) VM
> configuration data.
> 
> Just run cat config.db-wal | strings | grep ..... | sort | uniq -c to see
> what's getting written there.
> 

But that's not a real issue: the WAL is dimensioned quite big (4 MiB, while
the DB is often only 1 or 2 MiB), so it will always contain lots of DB data.
This big WAL actually reduces additional writes and syncs, as we do not need
to checkpoint it that often, so at least for reads it should be more performant.

Also, the WAL is accessed for reads and writes at offsets (e.g., pwrite64),
and thus only some specific, small, contained parts are actually newly
written. So you cannot really conclude anything from the total content in
it, only from the actual new writes (which can be seen with my strace command).
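For reference, a quick check (assuming the default paths) shows how the
on-disk WAL size compares to the DB, and the offset-based writes can be
watched in isolation:

# compare the on-disk size of the DB and its WAL
ls -lh /var/lib/pve-cluster/config.db /var/lib/pve-cluster/config.db-wal

# trace only the offset-based writes pmxcfs issues
strace -f -p "$(pidof pmxcfs)" -e trace=pwrite64 -s 64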

Regarding the extra data I mentioned, it could be that this is due to SQLite
handling memory pages directly; I still need to check it out more closely.

> The weird thing is that it does not happen for every VM, just for some. I
> will send you an email with additional data (I don't want to post all my
> VMs' MAC addresses in public).
> 

For now I'm good, thanks; I can check that on my test clusters too, but if
I need anything I'll come back to this offer.

cheers,
Thomas





