public inbox for pve-devel@lists.proxmox.com
* [pve-devel] [PATCH SERIES storage/qemu-server/-manager] RFC : add lvmqcow2 storage support
@ 2024-08-26 11:00 Alexandre Derumier via pve-devel
  2024-08-28 12:53 ` Dominik Csapak
  2024-09-05  7:51 ` Fabian Grünbichler
  0 siblings, 2 replies; 5+ messages in thread
From: Alexandre Derumier via pve-devel @ 2024-08-26 11:00 UTC (permalink / raw)
  To: pve-devel; +Cc: Alexandre Derumier


From: Alexandre Derumier <alexandre.derumier@groupe-cyllene.com>
To: pve-devel@lists.proxmox.com
Subject: [PATCH SERIES storage/qemu-server/-manager] RFC : add lvmqcow2 storage support
Date: Mon, 26 Aug 2024 13:00:18 +0200
Message-ID: <20240826110030.1744732-1-alexandre.derumier@groupe-cyllene.com>

This patch series adds support for a new lvmqcow2 storage format.

Currently, we can't do snapshots && thin provisioning on shared block devices because
LVM-thin can't share its metadata volume. I have a lot of on-prem VMware customers
for whom this is really blocking the migration to Proxmox (and they are looking at
oVirt/Oracle virtualisation instead, where it works fine).

It's possible to format a block device directly with the qcow2 format, without a filesystem.
Red Hat RHV/oVirt has been doing this for almost 10 years in its VDSM daemon.

For thin provisioning, or to handle the extra space needed by snapshots, we need to be able
to resize the LVM volume dynamically.
The volume is increased in chunks of 1GB by default (configurable).
QEMU implements events to send an alert when the write usage reaches a threshold.
(The threshold is 50% of the last chunk, so when the VM has 500MB free.)

The resize is async (around 2s), so the user needs to choose an appropriate chunk size &&
threshold if the storage is really fast (NVMe for example, where you can write more than
500MB in 2s).

If the resize is not fast enough, the VM will pause with an io-error.
pvestatd watches for this error, tries to extend again if needed, and resumes the VM.
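
To make the mechanism concrete, here is a rough sketch of the extend path (illustrative
only, not the literal code from the patches; it assumes QEMU's block-set-write-threshold
QMP command and the mon_cmd helper from qemu-server, and the helper name and node-name
handling are simplified):

# rough sketch of what 'qm disk blockextend' does after a write-threshold event
use strict;
use warnings;

use PVE::Tools qw(run_command);
use PVE::QemuServer::Monitor qw(mon_cmd);

my $chunk = 1024 * 1024 * 1024; # default extend step: 1GB

sub extend_lvmqcow2_volume {
    my ($vmid, $deviceid, $lv_path) = @_;

    # grow the underlying LV by one chunk
    run_command(['/sbin/lvextend', '-L', "+${chunk}b", $lv_path],
        errmsg => "lvextend '$lv_path' failed");

    # read back the new LV size in bytes
    my $lv_size = 0;
    run_command(
        ['/sbin/lvs', '--noheadings', '--units', 'b', '--nosuffix', '-o', 'lv_size', $lv_path],
        outfunc => sub { $lv_size = $1 if $_[0] =~ /(\d+)/ });

    # re-arm the threshold at 50% of the last chunk, so QEMU emits the next
    # BLOCK_WRITE_THRESHOLD event ~500MB before the LV is full again
    # (node name is simplified here)
    mon_cmd($vmid, 'block-set-write-threshold',
        'node-name' => "drive-$deviceid",
        'write-threshold' => $lv_size - int($chunk / 2));
}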


pve-storage:

Alexandre Derumier (5):
  add lvmqcow2 plugin
  vdisk_alloc: add underlay_size option
  add volume_underlay_resize
  add refresh volume
  add volume_underlay_shrink

 src/PVE/Storage.pm                |  52 +++++-
 src/PVE/Storage/LVMQcow2Plugin.pm | 272 ++++++++++++++++++++++++++++++
 src/PVE/Storage/Makefile          |   3 +-
 src/PVE/Storage/Plugin.pm         |  20 +++
 4 files changed, 344 insertions(+), 3 deletions(-)
 create mode 100644 src/PVE/Storage/LVMQcow2Plugin.pm


qemu-server:

Alexandre Derumier (6):
  lvmqcow2: set disk write threshold
  qm cli: add blockextend
  qmevent: call qm disk blockextend when write_threshold event is
    received
  migration: refresh remote disk size before resume
  qemu_img_format: lvmqcow2 is a path_storage
  clone: allocate && shrink lvmqcow2 underlay

 PVE/CLI/qm.pm                         |  57 ++++++++++
 PVE/QemuMigrate.pm                    |  13 +++
 PVE/QemuServer.pm                     | 154 +++++++++++++++++++++++++-
 qmeventd/qmeventd.c                   |  27 +++++
 test/MigrationTest/QemuMigrateMock.pm |   2 +
 5 files changed, 251 insertions(+), 2 deletions(-)


pve-manager:

Alexandre Derumier (1):
  pvestatd: lvmqcow2 : extend disk on io-error

 PVE/Service/pvestatd.pm | 62 +++++++++++++++++++++++++++++++++++++++++
 1 file changed, 62 insertions(+)


-- 
2.39.2




* Re: [pve-devel] [PATCH SERIES storage/qemu-server/-manager] RFC : add lvmqcow2 storage support
  2024-08-26 11:00 [pve-devel] [PATCH SERIES storage/qemu-server/-manager] RFC : add lvmqcow2 storage support Alexandre Derumier via pve-devel
@ 2024-08-28 12:53 ` Dominik Csapak
  2024-08-29  8:27   ` DERUMIER, Alexandre via pve-devel
       [not found]   ` <98a4b03b8969f7c4aef42fc5cdd677752b4dbf83.camel@groupe-cyllene.com>
  2024-09-05  7:51 ` Fabian Grünbichler
  1 sibling, 2 replies; 5+ messages in thread
From: Dominik Csapak @ 2024-08-28 12:53 UTC (permalink / raw)
  To: Proxmox VE development discussion

On 8/26/24 13:00, Alexandre Derumier via pve-devel wrote:
> 
> This patch series adds support for a new lvmqcow2 storage format.
> 
> Currently, we can't do snapshots && thin provisioning on shared block devices because
> LVM-thin can't share its metadata volume. I have a lot of on-prem VMware customers
> for whom this is really blocking the migration to Proxmox (and they are looking at
> oVirt/Oracle virtualisation instead, where it works fine).
> 
> It's possible to format a block device directly with the qcow2 format, without a filesystem.
> Red Hat RHV/oVirt has been doing this for almost 10 years in its VDSM daemon.
> 
> For thin provisioning, or to handle the extra space needed by snapshots, we need to be able
> to resize the LVM volume dynamically.
> The volume is increased in chunks of 1GB by default (configurable).
> QEMU implements events to send an alert when the write usage reaches a threshold.
> (The threshold is 50% of the last chunk, so when the VM has 500MB free.)
> 
> The resize is async (around 2s), so the user needs to choose an appropriate chunk size &&
> threshold if the storage is really fast (NVMe for example, where you can write more than
> 500MB in 2s).
> 
> If the resize is not fast enough, the VM will pause with an io-error.
> pvestatd watches for this error, tries to extend again if needed, and resumes the VM.


Hi,

Just my personal opinion, maybe you also want to wait for more feedback from somebody else...
(also, I just glanced over the patches, so correct me if I'm wrong)

I see some problems with this approach (some are maybe fixable, some probably not?)

* as you mentioned, if the storage is fast enough you have a runaway VM
   this is IMHO not acceptable, as that leads to VMs that are completely blocked and
   can't do anything. I fear this will generate many support calls asking why their guests
   are stopped/hanging...

* the code says containers are supported (rootdir => 1) but I don't see how?
   there is AFAICS no code to handle them in any way...
   (maybe it was just copied over by mistake?)

* you lock the local blockextend call, but give it a timeout of 60 seconds.
   What if that timeout expires? The VM again gets completely blocked until it's
   resized by pvestatd.

* IMHO pvestatd is the wrong place to make such a call. It's already doing much
   stuff in a way where a single storage operation blocks many other things
   (metrics, storage/vm status, ballooning, etc.)

   cramming another thing in there seems wrong and will only lead to even more people
   complaining about pvestatd not working, only in this case the VMs
   will be in an io-error state indefinitely then.

   I'd rather make a separate daemon/program, or somehow integrate it into
   qmeventd (but then it would have to become multi-threaded/multi-process/etc.
   to not block its other purposes)

* there is no cluster locking?
   you only mention

   ---8<---
   #don't use global cluster lock here, use on native local lvm lock
   --->8---

   but don't configure any lock? (AFAIR lvm cluster locking needs additional
   configuration/daemons?)

   this *will* lead to errors if multiple VMs on different hosts try
   to resize at the same time.

   even with cluster locking, this will very soon lead to contention, since
   storage operations are inherently expensive, e.g. if i have
   10-100 VMs wanting to resize at the same time, some of them will run
   into a timeout or at least into the blocking state.

   That does not even need much IO, just bad luck when multiple VMs go
   over the threshold within a short time.

All in all, I'm not really sure if the gain (snapshots on shared LVM) is worth
the potential cost in maintenance, support and customer dissatisfaction with
stalled/blocked VMs.

Generally a better approach could be for your customers to use some
kind of shared filesystem (GFS2/OCFS/?). I know those are not really
tested or supported by us, but I would hope that they scale and behave
better than qcow2-on-lvm-with-dynamic-resize.

best regards
Dominik



* Re: [pve-devel] [PATCH SERIES storage/qemu-server/-manager] RFC : add lvmqcow2 storage support
  2024-08-28 12:53 ` Dominik Csapak
@ 2024-08-29  8:27   ` DERUMIER, Alexandre via pve-devel
       [not found]   ` <98a4b03b8969f7c4aef42fc5cdd677752b4dbf83.camel@groupe-cyllene.com>
  1 sibling, 0 replies; 5+ messages in thread
From: DERUMIER, Alexandre via pve-devel @ 2024-08-29  8:27 UTC (permalink / raw)
  To: pve-devel, d.csapak; +Cc: DERUMIER, Alexandre


From: "DERUMIER, Alexandre" <alexandre.derumier@groupe-cyllene.com>
To: "pve-devel@lists.proxmox.com" <pve-devel@lists.proxmox.com>, "d.csapak@proxmox.com" <d.csapak@proxmox.com>
Subject: Re: [pve-devel] [PATCH SERIES storage/qemu-server/-manager] RFC : add lvmqcow2 storage support
Date: Thu, 29 Aug 2024 08:27:07 +0000
Message-ID: <98a4b03b8969f7c4aef42fc5cdd677752b4dbf83.camel@groupe-cyllene.com>




>>just my personal opinion, maybe you also want to wait for more
>>feedback from somebody else...
>>(also i just glanced over the patches, so correct me if I'm wrong)

Hi Dominik !

>>I see some problems with this approach (some are maybe fixable, some
>>probably not?)

>>* as you mentioned, if the storage is fast enough you have a runaway
>>VM
>>   this is IMHO not acceptable, as that leads to VMs that are
>>completely blocked and
>>   can't do anything. I fear this will generate many support calls
>>why their guests
>>   are stopped/hanging...

If the chunk size is correctly configured, it shouldn't happen.
(For example, if the storage is able to write 500MB/s, use a chunk size
of 5~10GB; this gives you a 10~20s window.)
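
To make that sizing rule explicit (just my own back-of-the-envelope helper, not part of
the series): the guest can write at most half a chunk between the threshold event and the
end of the extend, so the chunk has to cover write bandwidth times resize latency plus
some safety margin.

use strict;
use warnings;
use POSIX qw(ceil);

# chunk must satisfy: chunk/2 >= bandwidth * (resize latency + margin)
sub min_chunk_gb {
    my ($write_mb_per_s, $resize_s, $margin_s) = @_;
    my $needed_mb = $write_mb_per_s * ($resize_s + $margin_s);
    return ceil(2 * $needed_mb / 1024);
}

printf "500 MB/s storage : >= %dGB chunk\n", min_chunk_gb(500, 2, 3);  # -> 5GB
printf "3 GB/s NVMe      : >= %dGB chunk\n", min_chunk_gb(3000, 2, 3); # -> 30GB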


>>* the code says containers are supported (rootdir => 1) but i don't
>>see how?
>>   there is AFAICS no code to handle them in any way...
>>   (maybe just falsely copied?)

Oh indeed, I haven't checked CT support yet. (It could be implemented
with a storage usage check every x seconds, but I'm not sure it would
scale well with a lot of CT volumes.)



>>* you lock the local blockextend call, but give it a timeout of 60
>>seconds.
>>   what if that timeout expires? the vm again gets completely blocked
>>until it's
>>   resized by pvestatd

I'm locking it to avoid multiple extends. I set an arbitrary 60s, but
it could be a lot lower. (An LVM extend doesn't take more than 1s for me.)




>>* IMHO pvestatd is the wrong place to make such a call. It's already
>>doing much
>>   stuff in a way where a single storage operation blocks many other
>>things
>>   (metrics, storage/vm status, ballooning, etc..)
>>
>>   cramming another thing in there seems wrong and will only lead to
>>even more people
>>   complaining about the pvestatd not working, only in this case the
>>vms
>>   will be in an io-error state indefinitely then.
>>
>>   I'd rather make a separate daemon/program, or somehow integrate it
>>into
>>   qmeventd (but then it would have to become multi
>>threaded/processes/etc.
>>   to not block it's other purposes)

Yes, I agree with this. (BTW, if one day we could have threading,
queues, or a separate daemon for each storage monitor, it would help a
lot with hanging storage.)




>>* there is no cluster locking?
>>   you only mention
>>
>>   ---8<---
>>   #don't use global cluster lock here, use on native local lvm lock
>>   --->8---
>>
>>   but don't configure any lock? (AFAIR lvm cluster locking needs
>>additional
>>   configuration/daemons?)
>>
>>   this *will* lead to errors if multiple VMs on different hosts try
>>   to resize at the same time.
>>
>>   even with cluster locking, this will very soon lead to contention,
>>since
>>   storage operations are inherently expensive, e.g. if i have
>>   10-100 VMs wanting to resize at the same time, some of them will
>>run
>>   into a timeout or at least into the blocking state.
>>
>>   That does not even need much IO, just bad luck when multiple VMs
>>go
>>   over the threshold within a short time.

Mmm, OK, this one could indeed be a problem.
I need to look at the oVirt code (they have really been using this in
production for 10 years) to see how they handle locks.


>>All in all, I'm not really sure if the gain (snapshots on shared LVM)
>>is worth
>>the potential cost in maintenance, support and customer
>>dissatisfaction with
>>stalled/blocked VMs.

>>Generally a better approach could be for your customers to use some
>>kind of shared filesystem (GFS2/OCFS/?). I know those are not really
>>tested or supported by us, but i would hope that they scale and
>>behave
>>better than qcow2-on-lvm-with-dynamic-resize.

Yes, if we can get it working fine, it could be *a lot* better. I'm
still afraid of kernel bugs/regressions. (At least with OCFS2, 10 years
ago, it was a nightmare. I used it in production for 1~2 years.)

For GFS2, there is a user on the Proxmox forum who has been using it in
production since 2019 without any problems.
https://forum.proxmox.com/threads/pve-7-x-cluster-setup-of-shared-lvm-lv-with-msa2040-sas-partial-howto.57536/


I need to test whether we get storage timeouts if one node goes down.
(For OCFS2 that was the case; the forum user tells me it was OK with
GFS2.)

I'll do tests on my side.

I really need this feature for a lot of on-prem customers migrating
from VMware. They are mostly small clusters (2~3 nodes with
direct-attached SAN).

So even if GFS2 doesn't scale too well with many nodes, personally, it
would be enough for me if we limit the number of supported nodes.


>>best regards
>>Dominik

Thanks again for the review!



(BTW, I have some small fixes to make to the pvestatd code in this
patch series.)



* Re: [pve-devel] [PATCH SERIES storage/qemu-server/-manager] RFC : add lvmqcow2 storage support
       [not found]   ` <98a4b03b8969f7c4aef42fc5cdd677752b4dbf83.camel@groupe-cyllene.com>
@ 2024-08-30  8:44     ` DERUMIER, Alexandre via pve-devel
  0 siblings, 0 replies; 5+ messages in thread
From: DERUMIER, Alexandre via pve-devel @ 2024-08-30  8:44 UTC (permalink / raw)
  To: pve-devel, d.csapak; +Cc: DERUMIER, Alexandre


From: "DERUMIER, Alexandre" <alexandre.derumier@groupe-cyllene.com>
To: "pve-devel@lists.proxmox.com" <pve-devel@lists.proxmox.com>, "d.csapak@proxmox.com" <d.csapak@proxmox.com>
Subject: Re: [pve-devel] [PATCH SERIES storage/qemu-server/-manager] RFC : add lvmqcow2 storage support
Date: Fri, 30 Aug 2024 08:44:03 +0000
Message-ID: <323f0eb43434620a592a46e3ea8f5b3c41bc9fe1.camel@groupe-cyllene.com>

Hi,
I have done more tests

> > * there is no cluster locking?
> >    you only mention
> > 
> >    ---8<---
> >   
> > #don't use global cluster lock here, use on native local lvm lock
> >    --->8---
> > 
> >    but don't configure any lock? (AFAIR lvm cluster locking needs
> > additional
> >    configuration/daemons?)
> > 
> >    this *will* lead to errors if multiple VMs on different hosts
> > try
> >    to resize at the same time.
> > 
> >    even with cluster locking, this will very soon lead to
> > contention,
> > since
> >    storage operations are inherently expensive, e.g. if i have
> >    10-100 VMs wanting to resize at the same time, some of them will
> > run
> >    into a timeout or at least into the blocking state.
> > 
> >    That does not even need much IO, just bad luck when multiple VMs
> > go
> >    over the threshold within a short time.

>>>Mmm, OK, this one could indeed be a problem.
>>>I need to look at the oVirt code (they have really been using this in
>>>production for 10 years) to see how they handle locks.

OK, you are right, we need a cluster lock here. Red Hat uses the sanlock
daemon, or DLM with corosync, for coordination.
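
For comparison, within PVE we could also reuse the pmxcfs-backed storage lock instead of
pulling in sanlock/dlm. A minimal sketch, assuming PVE::Cluster::cfs_lock_storage (the
locked_extend helper is hypothetical); it serializes every extend on the storage
cluster-wide, so the contention concern raised above still applies:

use PVE::Cluster;
use PVE::Tools qw(run_command);

# serialize lvextend cluster-wide via pmxcfs instead of sanlock/dlm
sub locked_extend {
    my ($storeid, $lv_path, $chunk) = @_;

    PVE::Cluster::cfs_lock_storage($storeid, 10, sub {
        run_command(['/sbin/lvextend', '-L', "+${chunk}b", $lv_path]);
    });
    die $@ if $@;
}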



> > * IMHO pvestatd is the wrong place to make such a call. It's already
> > doing much stuff in a way where a single storage operation blocks many
> > other things (metrics, storage/vm status, ballooning, etc.)
> > 
> > cramming another thing in there seems wrong and will only lead to even
> > more people complaining about pvestatd not working, only in this case
> > the VMs will be in an io-error state indefinitely then.
> > 
> > I'd rather make a separate daemon/program, or somehow integrate it into
> > qmeventd (but then it would have to become multi-threaded/multi-process/etc.
> > to not block its other purposes)

>>Yes, I agree with this. (BTW, if one day we could have threading,
>>queues, or a separate daemon for each storage monitor, it would help a
>>lot with hanging storage.)

OK, I think we could manage a queue of disks to resize somewhere.
pvestatd could fill the queue on io-error, and it could be processed
by qmeventd (or maybe another daemon).
It could be done sequentially, as we need a cluster lock anyway.
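
Something like the following is what I have in mind (the spool directory and the
producer/consumer split are hypothetical, just to illustrate the idea): pvestatd would
only drop a marker per volume and never block, and a single worker would drain the queue
sequentially under the cluster lock.

use strict;
use warnings;
use PVE::Tools;

# hypothetical spool directory, one marker file per volume to extend
my $queue_dir = '/run/pve-blockextend';

# producer (pvestatd, on io-error): enqueue and return immediately
sub enqueue_extend {
    my ($vmid, $volid) = @_;
    mkdir $queue_dir if !-d $queue_dir;
    my $safe = $volid =~ s/[^A-Za-z0-9_.-]/_/gr;
    PVE::Tools::file_set_contents("$queue_dir/$vmid-$safe", '');
}

# consumer (qmeventd or a new daemon): one volume at a time
sub drain_queue {
    my ($extend_and_resume) = @_; # callback: cluster lock + lvextend + resume
    for my $file (sort glob("$queue_dir/*")) {
        eval { $extend_and_resume->($file) };
        warn $@ if $@;
        unlink $file;
    }
}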


> > All in all, I'm not really sure if the gain (snapshots on shared
> > LVM)
> > is worth
> > the potential cost in maintenance, support and customer
> > dissatisfaction with
> > stalled/blocked VMs.

> > Generally a better approach could be for your customers to use some
> > kind of shared filesystem (GFS2/OCFS/?). I know those are not
> > really
> > tested or supported by us, but i would hope that they scale and
> > behave
> > better than qcow2-on-lvm-with-dynamic-resize.

>>>Yes, if we can get it working fine, it could be *a lot* better. I'm
>>>still afraid of kernel bugs/regressions. (At least with OCFS2, 10 years
>>>ago, it was a nightmare. I used it in production for 1~2 years.)

>>>For GFS2, there is a user on the Proxmox forum who has been using it in
>>>production since 2019 without any problems.


>>>I need to test whether we get storage timeouts if one node goes down.
>>>(For OCFS2 that was the case; the forum user tells me it was OK with
>>>GFS2.)

>>>I'll do tests on my side.

OK, I have done tests with GFS2. Installation is easy, and it's well
integrated with corosync (using the DLM daemon to manage locks, which
uses corosync). (Note: it needs fencing if corosync dies; it's currently
not able to recover the lock.)

It's working fine with a preallocated qcow2. I get almost the same
performance as a raw device, around 20k 4k IOPS && 3GB/s on my test
storage.

But when the file is not preallocated (or when you take a snapshot on
a preallocated drive, so new writes are not preallocated anymore),
the performance is abysmal (60 4k IOPS, 40MB/s).
This seems to be a well-known problem with GFS2, caused by the cluster
lock on block allocation.
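
For reference, the preallocated case above corresponds to creating the image with
something like the following (path and size are just examples); without
preallocation=full (or after a snapshot), every new cluster allocation goes through the
cluster-wide lock, which is where the ~60 IOPS figure comes from.

use PVE::Tools qw(run_command);

# fully preallocated qcow2 on GFS2: guest writes never allocate new
# clusters, so they don't hit the DLM lock on block allocation
run_command(['/usr/bin/qemu-img', 'create', '-f', 'qcow2',
    '-o', 'preallocation=full',
    '/mnt/gfs2/images/100/vm-100-disk-0.qcow2', '100G']);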


I'll do tests with OCFS2 to compare.




* Re: [pve-devel] [PATCH SERIES storage/qemu-server/-manager] RFC : add lvmqcow2 storage support
  2024-08-26 11:00 [pve-devel] [PATCH SERIES storage/qemu-server/-manager] RFC : add lvmqcow2 storage support Alexandre Derumier via pve-devel
  2024-08-28 12:53 ` Dominik Csapak
@ 2024-09-05  7:51 ` Fabian Grünbichler
  1 sibling, 0 replies; 5+ messages in thread
From: Fabian Grünbichler @ 2024-09-05  7:51 UTC (permalink / raw)
  To: Proxmox VE development discussion


> Alexandre Derumier via pve-devel <pve-devel@lists.proxmox.com> wrote on 26.08.2024 13:00 CEST:
> This patch series adds support for a new lvmqcow2 storage format.
> 
> Currently, we can't do snapshots && thin provisioning on shared block devices because
> LVM-thin can't share its metadata volume. I have a lot of on-prem VMware customers
> for whom this is really blocking the migration to Proxmox (and they are looking at
> oVirt/Oracle virtualisation instead, where it works fine).
> 
> It's possible to format a block device directly with the qcow2 format, without a filesystem.
> Red Hat RHV/oVirt has been doing this for almost 10 years in its VDSM daemon.
> 
> For thin provisioning, or to handle the extra space needed by snapshots, we need to be able
> to resize the LVM volume dynamically.
> The volume is increased in chunks of 1GB by default (configurable).
> QEMU implements events to send an alert when the write usage reaches a threshold.
> (The threshold is 50% of the last chunk, so when the VM has 500MB free.)
> 
> The resize is async (around 2s), so the user needs to choose an appropriate chunk size &&
> threshold if the storage is really fast (NVMe for example, where you can write more than
> 500MB in 2s).
> 
> If the resize is not fast enough, the VM will pause with an io-error.
> pvestatd watches for this error, tries to extend again if needed, and resumes the VM.

I agree with Dominik about the downsides of this approach.

We had a brief chat this morning and came up with a possible alternative that would still allow snapshots (even if thin-provisioning would be out of scope):

- allocate the volume with the full size and put a fully pre-allocated qcow2 file on it
- no need to monitor regular guest I/O, it's guaranteed that the qcow2 file can be fully written
- when creating a snapshot
-- check the actual usage of the qcow2 file
-- extend the underlying volume so that the total size is current usage + size exposed to the guest
-- create the actual (qcow2-internal) snapshot
- still no need to monitor guest I/O, the underlying volume should be big enough to overwrite all data

This would give us effectively the same semantics as thick-provisioned zvols, which also always reserve enough space at snapshot creation time to allow a full overwrite of the whole zvol. If the underlying volume cannot be extended by the required space, snapshot creation would fail.
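
A sketch of how the snapshot path could look under that scheme (my reading of the
proposal, offline case only; the snapshot_with_reserve helper is hypothetical, it assumes
qemu-img check's image-end-offset is a good-enough usage metric on a bare block device,
and a running VM would need the QMP equivalents instead):

use strict;
use warnings;
use JSON qw(decode_json);
use PVE::Tools qw(run_command);

# grow the LV so the guest can still overwrite its whole disk after the
# snapshot, then take the qcow2-internal snapshot (offline sketch)
sub snapshot_with_reserve {
    my ($lv_path, $snapname) = @_;

    my $json = '';
    my $collect = sub { $json .= "$_[0]\n" };

    run_command(['/usr/bin/qemu-img', 'info', '--output=json', '-f', 'qcow2', $lv_path],
        outfunc => $collect);
    my $virtual = decode_json($json)->{'virtual-size'};

    # current usage of the qcow2 on the bare LV ("image end offset")
    # noerr: check may report leaked clusters after earlier snapshots
    $json = '';
    run_command(['/usr/bin/qemu-img', 'check', '--output=json', '-f', 'qcow2', $lv_path],
        outfunc => $collect, noerr => 1);
    my $used = decode_json($json)->{'image-end-offset'};

    # current LV size, to only ever grow the volume
    my $lv_size = 0;
    run_command(['/sbin/lvs', '--noheadings', '--units', 'b', '--nosuffix',
            '-o', 'lv_size', $lv_path],
        outfunc => sub { $lv_size = $1 if $_[0] =~ /(\d+)/ });

    # usage + full guest size, plus a bit of headroom for qcow2 metadata
    my $target = $used + $virtual + 512 * 1024 * 1024;
    run_command(['/sbin/lvextend', '-L', "${target}b", $lv_path])
        if $target > $lv_size;

    run_command(['/usr/bin/qemu-img', 'snapshot', '-c', $snapname, $lv_path]);
}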

some open questions:
- do we actually get enough information about space usage out of the qcow2 file (I think so, but haven't checked in detail)
- is there a way to compact/shrink either when removing snapshots, or as (potentially expensive) standalone action (worst case, compact by copying the whole disk?)

another, less involved approach would be to over-allocate the volume to provide a fixed, limited amount of slack for snapshots (e.g., "allocate 50% extra space for snapshots" when creating a guest volume) - but that has all the usual downsides of thin-provisioning (the guest is lied to about the disk size, and can run into weird error states when space runs out) and is less flexible.

what do you think about the above approaches?


