Subject: [PVE-User] qemu-server / qmp issues: One VM cannot complete backup and freezes
Date: 2022-01-05  8:01 UTC
From: iztok Gregori
To: Proxmox VE user list

Hi to all!

Starting a week ago, when we added new nodes to the cluster and 
upgraded everything to the latest Proxmox 6.4 (with the ultimate goal 
of upgrading all the nodes to 7.1 in the not-so-near future), *one* of 
the VMs stopped backing up. The backup job was blocked, and once we 
manually terminated it the VM froze; only a hard poweroff/poweron 
brought the VM back.
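
For the record, the usual cleanup for a stuck backup job, before 
resorting to a hard poweroff, would be something along these lines 
(1043 is the VMID taken from the config further down; note that 
"qm stop" goes through the same QMP socket, so it can time out just 
like the backup did):

   vzdump --stop            # stop the running backup job(s) on this node
   qm unlock 1043           # clear the "backup" lock left on the VM config
   qm stop 1043 --skiplock  # last resort, bypassing the lock check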

In the logs we have a lot of entries like the following:

VM 0000 qmp command failed - VM 0000 qmp command 'query-proxmox-support' 
failed - unable to connect to VM 0000 qmp socket - timeout after 31 retries
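
A quick way to see whether the QEMU process still answers on its 
monitor socket is to talk QMP to it directly. This is only a sketch, 
assuming the standard socket path /var/run/qemu-server/<vmid>.qmp that 
qemu-server uses (1043 is the VMID from the config below); if the main 
loop is stuck, the reply simply never arrives:

   echo '{"execute":"qmp_capabilities"} {"execute":"query-status"}' \
     | timeout 5 socat -t 5 - UNIX-CONNECT:/var/run/qemu-server/1043.qmp

   # the higher-level equivalents go through the same socket:
   qm status 1043 --verbose
   qm monitor 1043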

I searched for it and found multiple threads on the forum, so in some 
form it is a known issue, but I'm curious what the trigger was and 
what we could do to work around the problem (apart from upgrading to 
PVE 7.1, which we will do, but not this week).

Can you give me some advice?

To summarize the work we did last week (which is when the backups 
stopped working):

- Did a full upgrade on all the cluster nodes and rebooted them.
- Upgraded CEPH from Nautilus to Octopus.
- Installed new CEPH OSDs on the new nodes (8 out of 16).
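
As a sanity check that the upgrade really is uniform across the old 
and new nodes, something like this should be enough (just a sketch):

   ceph versions    # per-daemon version counts, all should be 15.2.x (Octopus)
   pveversion -v    # on each node, should match the package list at the bottom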

The problematic VM was running (back when it wasn't problematic) on a 
node which (at that moment) wasn't part of the CEPH cluster (but the 
storage was, and still is, always CEPH). We migrated it to a different 
node but had the same issues. The VM has 12 RBD disks (which is a lot 
more than the cluster average) and all the disks are backed up to an 
NFS share.
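
Next time it hangs I can also check whether the kvm process of that VM 
is stuck in uninterruptible I/O (state "D"), which would point at the 
storage side (NFS target or RBD) rather than at QMP itself. A rough 
sketch, using the pidfile qemu-server keeps for each VM:

   PID=$(cat /var/run/qemu-server/1043.pid)
   ps -o pid,stat,wchan:32,cmd -p "$PID"   # "D" state = blocked in the kernel
   cat /proc/$PID/stack                    # kernel stack (needs root)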

Because the problem is *only* with that particular VM, I could split 
it into 2 VMs and rearrange the number of disks (to be more in line 
with the cluster average), or I could rush the upgrade to 7.1 (hoping 
that the problem is only on PVE 6.4...).
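
As far as I can tell there is no built-in way on 6.4 to move a disk to 
another VM, so splitting would be done by hand. My understanding is 
that, per disk, it boils down to something like this (untested sketch; 
1044 is a hypothetical VMID for the second VM, and it assumes the 
"rbd_vm" storage maps to a Ceph pool of the same name; the exports and 
fstab inside the guests would have to follow, of course):

   qm set 1043 --delete virtio12                        # detach from the old VM
   rbd mv rbd_vm/vm-1043-disk-12 rbd_vm/vm-1044-disk-0  # rename the RBD image
   qm set 1044 --virtio0 rbd_vm:vm-1044-disk-0          # attach to the new VM
   # then drop the leftover "unusedX" entry from VM 1043's config, if any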

Here is the conf:

> agent: 1
> bootdisk: virtio0
> cores: 4
> ide2: none,media=cdrom
> memory: 4096
> name: problematic-vm
> net0: virtio=A2:69:F4:8C:38:22,bridge=vmbr0,tag=000
> numa: 0
> onboot: 1
> ostype: l26
> scsihw: virtio-scsi-pci
> smbios1: uuid=8bd477be-69ac-4b51-9c5a-a149f96da521
> sockets: 1
> virtio0: rbd_vm:vm-1043-disk-0,size=8G
> virtio1: rbd_vm:vm-1043-disk-1,size=100G
> virtio10: rbd_vm:vm-1043-disk-10,size=30G
> virtio11: rbd_vm:vm-1043-disk-11,size=100G
> virtio12: rbd_vm:vm-1043-disk-12,size=200G
> virtio2: rbd_vm:vm-1043-disk-2,size=100G
> virtio3: rbd_vm:vm-1043-disk-3,size=20G
> virtio4: rbd_vm:vm-1043-disk-4,size=20G
> virtio5: rbd_vm:vm-1043-disk-5,size=30G
> virtio6: rbd_vm:vm-1043-disk-6,size=100G
> virtio7: rbd_vm:vm-1043-disk-7,size=200G
> virtio8: rbd_vm:vm-1043-disk-8,size=20G
> virtio9: rbd_vm:vm-1043-disk-9,size=20G

The VM is a CentOS 7 NFS server.

The CEPH cluster health is OK:

>   cluster:
>     id:     645e8181-8424-41c4-9bc9-7e37b740e9a9
>     health: HEALTH_OK
>  
>   services:
>     mon: 5 daemons, quorum node-01,node-02,node-03,node-05,node-07 (age 8d)
>     mgr: node-01(active, since 8d), standbys: node-03, node-02, node-07, node-05
>     osd: 120 osds: 120 up (since 6d), 120 in (since 6d)
>  
>   task status:
>  
>   data:
>     pools:   3 pools, 1057 pgs
>     objects: 4.65M objects, 17 TiB
>     usage:   67 TiB used, 139 TiB / 207 TiB avail
>     pgs:     1056 active+clean
>              1    active+clean+scrubbing+deep
>  


All of the nodes have the same PVE version:

> proxmox-ve: 6.4-1 (running kernel: 5.4.157-1-pve)
> pve-manager: 6.4-13 (running version: 6.4-13/9f411e79)
> pve-kernel-5.4: 6.4-11
> pve-kernel-helper: 6.4-11
> pve-kernel-5.4.157-1-pve: 5.4.157-1
> pve-kernel-5.4.140-1-pve: 5.4.140-1
> pve-kernel-5.4.106-1-pve: 5.4.106-1
> ceph: 15.2.15-pve1~bpo10
> ceph-fuse: 15.2.15-pve1~bpo10
> corosync: 3.1.5-pve2~bpo10+1
> criu: 3.11-3
> glusterfs-client: 5.5-3
> ifupdown: 0.8.35+pve1
> ksm-control-daemon: 1.3-1
> libjs-extjs: 6.0.1-10
> libknet1: 1.22-pve2~bpo10+1
> libproxmox-acme-perl: 1.1.0
> libproxmox-backup-qemu0: 1.1.0-1
> libpve-access-control: 6.4-3
> libpve-apiclient-perl: 3.1-3
> libpve-common-perl: 6.4-4
> libpve-guest-common-perl: 3.1-5
> libpve-http-server-perl: 3.2-3
> libpve-storage-perl: 6.4-1
> libqb0: 1.0.5-1
> libspice-server1: 0.14.2-4~pve6+1
> lvm2: 2.03.02-pve4
> lxc-pve: 4.0.6-2
> lxcfs: 4.0.6-pve1
> novnc-pve: 1.1.0-1
> proxmox-backup-client: 1.1.13-2
> proxmox-mini-journalreader: 1.1-1
> proxmox-widget-toolkit: 2.6-1
> pve-cluster: 6.4-1
> pve-container: 3.3-6
> pve-docs: 6.4-2
> pve-edk2-firmware: 2.20200531-1
> pve-firewall: 4.1-4
> pve-firmware: 3.3-2
> pve-ha-manager: 3.1-1
> pve-i18n: 2.3-1
> pve-qemu-kvm: 5.2.0-6
> pve-xtermjs: 4.7.0-3
> qemu-server: 6.4-2
> smartmontools: 7.2-pve2
> spiceterm: 3.1-1
> vncterm: 1.6-2
> zfsutils-linux: 2.0.6-pve1~bpo10+1

I can provide more information if necessary.

Cheers
Iztok


