public inbox for pve-user@lists.proxmox.com
* [PVE-User] Unresponsive VM(s) during VZdump
@ 2024-05-09  8:35 Iztok Gregori
  2024-05-09  9:30 ` Mike O'Connor
  0 siblings, 1 reply; 8+ messages in thread
From: Iztok Gregori @ 2024-05-09  8:35 UTC (permalink / raw)
  To: Proxmox VE user list

Hi to all!

We are in the process of upgrading our Hyper-converged (Ceph based) 
cluster from PVE 6 to PVE 8 and yesterday we finished upgrading all 
nodes to PVE 7.4.1 without issues. Tonight, during our usual VZdump 
backup (vzdump on NFS share), we were notified by our monitoring system 
that 2 VMs (of 107) were unresponsive. In the VM logs there were a lot 
of lines like this:

kernel: hda: irq timeout: status=0xd0 { Busy }
kernel: sd 2:0:0:0: [sda] abort
kernel: sd 2:0:0:1: [sdb] abort

After the backup finished (successfully), the VMs started to function 
correctly again.

On PVE 6 everything was ok.

The affected machines are running old guest kernels ("2.6.18" and 
"2.6.32"); one has the QEMU guest agent enabled, the other does not. Both 
use kvm64 as the processor type; one uses "VirtIO SCSI" as the disk 
controller, the other "LSI 53C895A". All the disks are on Ceph RBD.

Nothing related was logged on the host machine, and the Ceph cluster was 
working as expected. Both VMs are "biggish" (100-200 GB), and their 
backups take 1-2 hours to complete.

Do you have any idea what the culprit could be? I suspect something in 
qemu-kvm, but I haven't found any useful hints yet.

I'm still planning to upgrade everything to PVE 8; maybe the "problem" 
has been fixed in later releases of qemu-kvm...

I can give you more information if needed, any help is appreciated.

Thanks
   Iztok

P.S. This is the software stack on our cluster (16 nodes):
# pveversion -v
proxmox-ve: 7.4-1 (running kernel: 5.15.149-1-pve)
pve-manager: 7.4-17 (running version: 7.4-17/513c62be)
pve-kernel-5.15: 7.4-12
pve-kernel-5.4: 6.4-20
pve-kernel-5.15.149-1-pve: 5.15.149-1
pve-kernel-5.4.203-1-pve: 5.4.203-1
pve-kernel-5.4.157-1-pve: 5.4.157-1
pve-kernel-5.4.106-1-pve: 5.4.106-1
ceph: 15.2.17-pve1
ceph-fuse: 15.2.17-pve1
corosync: 3.1.7-pve1
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown: 0.8.36+pve2
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.24-pve2
libproxmox-acme-perl: 1.4.4
libproxmox-backup-qemu0: 1.3.1-1
libproxmox-rs-perl: 0.2.1
libpve-access-control: 7.4.3
libpve-apiclient-perl: 3.2-2
libpve-common-perl: 7.4-2
libpve-guest-common-perl: 4.2-4
libpve-http-server-perl: 4.2-3
libpve-rs-perl: 0.7.7
libpve-storage-perl: 7.4-3
libqb0: 1.0.5-1
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 5.0.2-2
lxcfs: 5.0.3-pve1
novnc-pve: 1.4.0-1
proxmox-backup-client: 2.4.6-1
proxmox-backup-file-restore: 2.4.6-1
proxmox-kernel-helper: 7.4-1
proxmox-mail-forward: 0.1.1-1
proxmox-mini-journalreader: 1.3-1
proxmox-offline-mirror-helper: 0.5.2
proxmox-widget-toolkit: 3.7.3
pve-cluster: 7.3-3
pve-container: 4.4-6
pve-docs: 7.4-2
pve-edk2-firmware: 3.20230228-4~bpo11+3
pve-firewall: 4.3-5
pve-firmware: 3.6-6
pve-ha-manager: 3.6.1
pve-i18n: 2.12-1
pve-qemu-kvm: 7.2.10-1
pve-xtermjs: 4.16.0-2
qemu-server: 7.4-5
smartmontools: 7.2-pve3
spiceterm: 3.2-2
swtpm: 0.8.0~bpo11+3
vncterm: 1.7-1
zfsutils-linux: 2.1.15-pve1

-- 
Iztok Gregori
ICT Systems and Services
Elettra - Sincrotrone Trieste S.C.p.A.
http://www.elettra.eu

_______________________________________________
pve-user mailing list
pve-user@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-user


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PVE-User] Unresponsive VM(s) during VZdump
  2024-05-09  8:35 [PVE-User] Unresponsive VM(s) during VZdump Iztok Gregori
@ 2024-05-09  9:30 ` Mike O'Connor
  2024-05-09 10:02   ` Iztok Gregori
  0 siblings, 1 reply; 8+ messages in thread
From: Mike O'Connor @ 2024-05-09  9:30 UTC (permalink / raw)
  To: Proxmox VE user list, Iztok Gregori

Hi Iztok

You need to enable fleecing in the advanced backup settings. A slow 
backup storage system will cause this issue; configuring fleecing fixes 
it by storing changes in a local sparse image.
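
For illustration, a rough sketch only (option names are from memory, the 
storage IDs are placeholders, and the exact syntax may differ on your 
release, so check "man vzdump"): besides the Advanced tab of the backup 
job, fleecing can be set on the command line or as a vzdump default:

   # one-off backup, keeping the fleecing image on fast local storage
   vzdump 101 --storage nfs-backup --fleecing enabled=1,storage=local-lvm

   # or as a default for all vzdump runs, in /etc/vzdump.conf
   fleecing: enabled=1,storage=local-lvm

The fleecing storage should be fast and local to the node, since it only 
holds temporary copies of blocks overwritten during the backup.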

Mike

On 9/5/2024 6:05 pm, Iztok Gregori wrote:
> Hi to all!
>
> We are in the process of upgrading our Hyper-converged (Ceph based) 
> cluster from PVE 6 to PVE 8 and yesterday we finished upgrading all 
> nodes to PVE 7.4.1 without issues. Tonight, during our usual VZdump 
> backup (vzdump on NFS share), we were notified by our monitoring 
> system that 2 VMs (of 107) were unresponsive. In the VM logs there 
> were a lot of lines like this:
>
> kernel: hda: irq timeout: status=0xd0 { Busy }
> kernel: sd 2:0:0:0: [sda] abort
> kernel: sd 2:0:0:1: [sdb] abort
>
> After the backup finished (successfully), the VMs started to function 
> correctly again.
>
> On PVE 6 everything was ok.
>
> The affected machines are running old guest kernels ("2.6.18" and 
> "2.6.32"); one has the QEMU guest agent enabled, the other does not. 
> Both use kvm64 as the processor type; one uses "VirtIO SCSI" as the 
> disk controller, the other "LSI 53C895A". All the disks are on Ceph RBD.
>
> Nothing related was logged on the host machine, and the Ceph cluster 
> was working as expected. Both VMs are "biggish" (100-200 GB), and their 
> backups take 1-2 hours to complete.
>
> Do you have any idea what the culprit could be? I suspect something in 
> qemu-kvm, but I haven't found any useful hints yet.
>
> I'm still planning to upgrade everything to PVE 8, maybe the "problem" 
> was fixed in later releases of qemu-kvm...
>
> I can give you more information if needed, any help is appreciated.
>
> Thanks
>   Iztok
>
> P.S This is the software stack on our cluster (16 nodes):
> # pveversion -v
> proxmox-ve: 7.4-1 (running kernel: 5.15.149-1-pve)
> pve-manager: 7.4-17 (running version: 7.4-17/513c62be)
> pve-kernel-5.15: 7.4-12
> pve-kernel-5.4: 6.4-20
> pve-kernel-5.15.149-1-pve: 5.15.149-1
> pve-kernel-5.4.203-1-pve: 5.4.203-1
> pve-kernel-5.4.157-1-pve: 5.4.157-1
> pve-kernel-5.4.106-1-pve: 5.4.106-1
> ceph: 15.2.17-pve1
> ceph-fuse: 15.2.17-pve1
> corosync: 3.1.7-pve1
> criu: 3.15-1+pve-1
> glusterfs-client: 9.2-1
> ifupdown: 0.8.36+pve2
> ksm-control-daemon: 1.4-1
> libjs-extjs: 7.0.0-1
> libknet1: 1.24-pve2
> libproxmox-acme-perl: 1.4.4
> libproxmox-backup-qemu0: 1.3.1-1
> libproxmox-rs-perl: 0.2.1
> libpve-access-control: 7.4.3
> libpve-apiclient-perl: 3.2-2
> libpve-common-perl: 7.4-2
> libpve-guest-common-perl: 4.2-4
> libpve-http-server-perl: 4.2-3
> libpve-rs-perl: 0.7.7
> libpve-storage-perl: 7.4-3
> libqb0: 1.0.5-1
> libspice-server1: 0.14.3-2.1
> lvm2: 2.03.11-2.1
> lxc-pve: 5.0.2-2
> lxcfs: 5.0.3-pve1
> novnc-pve: 1.4.0-1
> proxmox-backup-client: 2.4.6-1
> proxmox-backup-file-restore: 2.4.6-1
> proxmox-kernel-helper: 7.4-1
> proxmox-mail-forward: 0.1.1-1
> proxmox-mini-journalreader: 1.3-1
> proxmox-offline-mirror-helper: 0.5.2
> proxmox-widget-toolkit: 3.7.3
> pve-cluster: 7.3-3
> pve-container: 4.4-6
> pve-docs: 7.4-2
> pve-edk2-firmware: 3.20230228-4~bpo11+3
> pve-firewall: 4.3-5
> pve-firmware: 3.6-6
> pve-ha-manager: 3.6.1
> pve-i18n: 2.12-1
> pve-qemu-kvm: 7.2.10-1
> pve-xtermjs: 4.16.0-2
> qemu-server: 7.4-5
> smartmontools: 7.2-pve3
> spiceterm: 3.2-2
> swtpm: 0.8.0~bpo11+3
> vncterm: 1.7-1
> zfsutils-linux: 2.1.15-pve1
>
_______________________________________________
pve-user mailing list
pve-user@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-user

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PVE-User] Unresponsive VM(s) during VZdump
  2024-05-09  9:30 ` Mike O'Connor
@ 2024-05-09 10:02   ` Iztok Gregori
  2024-05-09 10:11     ` Mike O'Connor
  2024-05-10  7:36     ` Iztok Gregori
  0 siblings, 2 replies; 8+ messages in thread
From: Iztok Gregori @ 2024-05-09 10:02 UTC (permalink / raw)
  To: Mike O'Connor, Proxmox VE user list

Hi Mike!

On 09/05/24 11:30, Mike O'Connor wrote:
> You need to enable fleecing in the advanced backup settings. A slow 
> backup storage system will cause this issue, configuring fleecing will 
> fix this by storing changes in a local sparse image.

I see that fleecing is available from PVE 8.2 onwards; I will enable it 
next week, once all the nodes have been upgraded to the latest version.

Thanks for the suggestion.

In the meantime I found this thread on the forum:

https://forum.proxmox.com/threads/high-io-wait-during-backups-after-upgrading-to-proxmox-7.113790/

which mentions the max_workers parameter. I have changed it to 4 for the 
next scheduled backup to see if there are any improvements (I migrated 
the affected VMs so that they pick up the new configuration).
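
For reference, a rough sketch of how that can be set (parameter name from 
memory, VMID and storage are placeholders; check "man vzdump" for the 
exact syntax on your release):

   # per-run, on the command line
   vzdump 101 --storage nfs-backup --performance max-workers=4

   # or as a node-wide default for scheduled backups, in /etc/vzdump.conf
   performance: max-workers=4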

I will keep you posted!

Iztok

_______________________________________________
pve-user mailing list
pve-user@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-user


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PVE-User] Unresponsive VM(s) during VZdump
  2024-05-09 10:02   ` Iztok Gregori
@ 2024-05-09 10:11     ` Mike O'Connor
  2024-05-09 11:24       ` Alexander Burke via pve-user
       [not found]       ` <11db3d6f-1879-44b3-9f99-01e6fde6ebc8@alexburke.ca>
  2024-05-10  7:36     ` Iztok Gregori
  1 sibling, 2 replies; 8+ messages in thread
From: Mike O'Connor @ 2024-05-09 10:11 UTC (permalink / raw)
  To: Iztok Gregori, Proxmox VE user list

I played with all the drive interface settings; the end result was that 
I lost customers because backups were causing Windows drive failures.
Since fleecing became an option, I've not had a lockup.



On 9/5/2024 7:32 pm, Iztok Gregori wrote:
> Hi Mike!
>
> On 09/05/24 11:30, Mike O'Connor wrote:
>> You need to enable fleecing in the advanced backup settings. A slow 
>> backup storage system will cause this issue; configuring fleecing 
>> fixes it by storing changes in a local sparse image.
>
> I see that fleecing is available from PVE 8.2 onwards; I will enable 
> it next week, once all the nodes have been upgraded to the latest version.
>
> Thanks for the suggestion.
>
> In the meantime I found this thread on the forum:
>
> https://forum.proxmox.com/threads/high-io-wait-during-backups-after-upgrading-to-proxmox-7.113790/ 
>
>
> which mentions the max_workers parameter. I have changed it to 4 for 
> the next scheduled backup to see if there are any improvements (I 
> migrated the affected VMs so that they pick up the new configuration).
>
> I will keep you posted!
>
> Iztok

_______________________________________________
pve-user mailing list
pve-user@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-user


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PVE-User] Unresponsive VM(s) during VZdump
  2024-05-09 10:11     ` Mike O'Connor
@ 2024-05-09 11:24       ` Alexander Burke via pve-user
       [not found]       ` <11db3d6f-1879-44b3-9f99-01e6fde6ebc8@alexburke.ca>
  1 sibling, 0 replies; 8+ messages in thread
From: Alexander Burke via pve-user @ 2024-05-09 11:24 UTC (permalink / raw)
  To: Proxmox VE user list; +Cc: Alexander Burke

[-- Attachment #1: Type: message/rfc822, Size: 5278 bytes --]

From: Alexander Burke <alex@alexburke.ca>
To: Proxmox VE user list <pve-user@lists.proxmox.com>
Subject: Re: [PVE-User] Unresponsive VM(s) during VZdump
Date: Thu, 9 May 2024 11:24:06 +0000 (UTC)
Message-ID: <11db3d6f-1879-44b3-9f99-01e6fde6ebc8@alexburke.ca>

Hello all,

My understanding is that if the backing store is ZFS, a snapshot of the zvol underlying the guest's disk(s) is instant and atomic, and the snapshot is what gets backed up so fleecing is moot. Am I wrong on this?

I know nothing about Ceph other than the fact that it supports snapshots; assuming the above understanding is correct, does snapshot-based backup not work much the same way on Ceph?

Cheers,
Alex
----------------------------------------

2024-05-09T10:11:20Z Mike O'Connor <mike@oeg.com.au>:

> I played with all the drive interface settings; the end result was that I lost customers because backups were causing Windows drive failures.
> Since fleecing became an option, I've not had a lockup.
> 
> 
> 
> On 9/5/2024 7:32 pm, Iztok Gregori wrote:
>> Hi Mike!
>> 
>> On 09/05/24 11:30, Mike O'Connor wrote:
>>> You need to enable fleecing in the advanced backup settings. A slow backup storage system will cause this issue; configuring fleecing fixes it by storing changes in a local sparse image.
>> 
>> I see that fleecing is available from PVE 8.2 onwards; I will enable it next week, once all the nodes have been upgraded to the latest version.
>> 
>> Thanks for the suggestion.
>> 
>> In the meantime I found this thread on the forum:
>> 
>> https://forum.proxmox.com/threads/high-io-wait-during-backups-after-upgrading-to-proxmox-7.113790/
>> 
>> which mentions the max_workers parameter. I have changed it to 4 for the next scheduled backup to see if there are any improvements (I migrated the affected VMs so that they pick up the new configuration).
>> 
>> I will keep you posted!
>> 
>> Iztok
> 
> _______________________________________________
> pve-user mailing list
> pve-user@lists.proxmox.com
> https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-user


[-- Attachment #2: Type: text/plain, Size: 157 bytes --]

_______________________________________________
pve-user mailing list
pve-user@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-user

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PVE-User] Unresponsive VM(s) during VZdump
       [not found]       ` <11db3d6f-1879-44b3-9f99-01e6fde6ebc8@alexburke.ca>
@ 2024-05-10  5:21         ` Mike O'Connor
  2024-05-10  9:07         ` Fiona Ebner
  1 sibling, 0 replies; 8+ messages in thread
From: Mike O'Connor @ 2024-05-10  5:21 UTC (permalink / raw)
  To: Alexander Burke, Proxmox VE user list

Hi Alex

Not in my experience: without this option we get problems on spinning 
rust for single-host systems.

I've never seen a snapshot created during backups for KVM on ZFS or 
Ceph; QEMU/KVM itself does the backup processing.

I had a hand-built system which used snapshots for Ceph before we moved 
to PBS backup. PBS backup is great, but it caused major issues with 
drives going read-only until fleecing became an option.

Mike

On 9/5/2024 8:54 pm, Alexander Burke wrote:
> Hello all,
>
> My understanding is that if the backing store is ZFS, a snapshot of the zvol underlying the guest's disk(s) is instant and atomic, and the snapshot is what gets backed up so fleecing is moot. Am I wrong on this?
>
> I know nothing about Ceph other than the fact that it supports snapshots; assuming the above understanding is correct, does snapshot-based backup not work much the same way on Ceph?
>
> Cheers,
> Alex
> ----------------------------------------
>
> 2024-05-09T10:11:20Z Mike O'Connor<mike@oeg.com.au>:
>
>> I played with all the drive interface settings; the end result was that I lost customers because backups were causing Windows drive failures.
>> Since fleecing became an option, I've not had a lockup.
>>
>>
>>
>> On 9/5/2024 7:32 pm, Iztok Gregori wrote:
>>> Hi Mike!
>>>
>>> On 09/05/24 11:30, Mike O'Connor wrote:
>>>> You need to enable fleecing in the advanced backup settings. A slow backup storage system will cause this issue; configuring fleecing fixes it by storing changes in a local sparse image.
>>> I see that fleecing is available from PVE 8.2 onwards; I will enable it next week, once all the nodes have been upgraded to the latest version.
>>>
>>> Thanks for the suggestion.
>>>
>>> In the meantime I found this thread on the forum:
>>>
>>> https://forum.proxmox.com/threads/high-io-wait-during-backups-after-upgrading-to-proxmox-7.113790/
>>>
>>> which mentions the max_workers parameter. I have changed it to 4 for the next scheduled backup to see if there are any improvements (I migrated the affected VMs so that they pick up the new configuration).
>>>
>>> I will keep you posted!
>>>
>>> Iztok
>> _______________________________________________
>> pve-user mailing list
>> pve-user@lists.proxmox.com
>> https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-user

_______________________________________________
pve-user mailing list
pve-user@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-user


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PVE-User] Unresponsive VM(s) during VZdump
  2024-05-09 10:02   ` Iztok Gregori
  2024-05-09 10:11     ` Mike O'Connor
@ 2024-05-10  7:36     ` Iztok Gregori
  1 sibling, 0 replies; 8+ messages in thread
From: Iztok Gregori @ 2024-05-10  7:36 UTC (permalink / raw)
  To: pve-user

Hi to all!

On 09/05/24 12:02, Iztok Gregori wrote:

[cut]

> 
> In the meantime I found this thread on the forum:
> 
> https://forum.proxmox.com/threads/high-io-wait-during-backups-after-upgrading-to-proxmox-7.113790/
> 
> which mentions the max_workers parameter. I have changed it to 4 for 
> the next scheduled backup to see if there are any improvements (I 
> migrated the affected VMs so that they pick up the new configuration).
> 
> I will keep you posted!

In the end I set the max_workers parameter to 1 (basically reverting it 
to the PVE 6 default) and the backups didn't cause any issues for the 
guest VMs (or at least there were no notifications, and the checks I 
made all came back clean).

Next week, when the upgrade is finished, I will try "fleecing" and 
revert the max_workers configuration to its default value to validate 
the setup.

Also, in the short term, I'm planning to change the target storage for 
vzdump to something faster.

Have a great day!

Iztok

_______________________________________________
pve-user mailing list
pve-user@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-user


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PVE-User] Unresponsive VM(s) during VZdump
       [not found]       ` <11db3d6f-1879-44b3-9f99-01e6fde6ebc8@alexburke.ca>
  2024-05-10  5:21         ` Mike O'Connor
@ 2024-05-10  9:07         ` Fiona Ebner
  1 sibling, 0 replies; 8+ messages in thread
From: Fiona Ebner @ 2024-05-10  9:07 UTC (permalink / raw)
  To: Alexander Burke, Proxmox VE user list

Am 09.05.24 um 13:24 schrieb Alexander Burke:
> Hello all,
> 
> My understanding is that if the backing store is ZFS, a snapshot of the zvol underlying the guest's disk(s) is instant and atomic, and the snapshot is what gets backed up so fleecing is moot. Am I wrong on this?
> 

The snapshot-level backup in Proxmox VE is not done on the storage
layer, but in the QEMU block layer. That makes it independent of
whether/how the underlying storage supports snapshots and allows for
dirty tracking for incremental backup. See here for further
information/discussion why it's not done on the storage layer:

https://pve.proxmox.com/pve-docs/chapter-vzdump.html#_vm_backup_fleecing
https://bugzilla.proxmox.com/show_bug.cgi?id=3233#c19

Best Regards,
Fiona


_______________________________________________
pve-user mailing list
pve-user@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-user


^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2024-05-10  9:08 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-05-09  8:35 [PVE-User] Unresponsive VM(s) during VZdump Iztok Gregori
2024-05-09  9:30 ` Mike O'Connor
2024-05-09 10:02   ` Iztok Gregori
2024-05-09 10:11     ` Mike O'Connor
2024-05-09 11:24       ` Alexander Burke via pve-user
     [not found]       ` <11db3d6f-1879-44b3-9f99-01e6fde6ebc8@alexburke.ca>
2024-05-10  5:21         ` Mike O'Connor
2024-05-10  9:07         ` Fiona Ebner
2024-05-10  7:36     ` Iztok Gregori
