From mboxrd@z Thu Jan 1 00:00:00 1970
From: iztok Gregori <iztok.gregori@elettra.eu>
To: Proxmox VE user list <pve-user@lists.proxmox.com>
Date: Wed, 5 Jan 2022 09:01:30 +0100
Message-ID: <0fa9e157-3d2b-a133-6faa-eb4876e214e3@elettra.eu>
Subject: [PVE-User] qemu-server / qmp issues: One VM cannot complete backup and freezes

Hi to all!

About a week ago, when we added new nodes to the cluster and upgraded everything to the latest Proxmox 6.4 (with the ultimate goal of upgrading all the nodes to 7.1 in the not-so-near future), *one* of the VMs stopped backing up. The backup job hung, and once we manually terminated it the VM froze; only a hard poweroff/poweron brought the VM back. In the logs we have a lot of the following:

> VM 0000 qmp command failed - VM 0000 qmp command 'query-proxmox-support' failed - unable to connect to VM 0000 qmp socket - timeout after 31 retries

I searched for this and found multiple threads on the forum, so in some form it is a known issue, but I'm curious what the trigger was and what we could do to work around the problem (apart from upgrading to PVE 7.1, which we will do, but not this week). Can you give me some advice?
To summarize the work we did last week (starting from when the backup stopped working):

- Did a full upgrade on all the cluster nodes and rebooted them.
- Upgraded CEPH from Nautilus to Octopus.
- Installed new CEPH OSDs on the new nodes (8 out of 16).

The problematic VM was running (back when it wasn't problematic) on a node which (at that moment) wasn't part of the CEPH cluster (but the storage was, and still is, always CEPH). We migrated it to a different node but hit the same issues. The VM has 12 RBD disks (a lot more than the cluster average) and all the disks are backed up to an NFS share.

Because the problem is *only* on that particular VM, I could split it into 2 VMs and rearrange the number of disks (to be more in line with the cluster average), or I could rush the upgrade to 7.1 (hoping that the problem is only on PVE 6.4...).

Here is the conf:

> agent: 1
> bootdisk: virtio0
> cores: 4
> ide2: none,media=cdrom
> memory: 4096
> name: problematic-vm
> net0: virtio=A2:69:F4:8C:38:22,bridge=vmbr0,tag=000
> numa: 0
> onboot: 1
> ostype: l26
> scsihw: virtio-scsi-pci
> smbios1: uuid=8bd477be-69ac-4b51-9c5a-a149f96da521
> sockets: 1
> virtio0: rbd_vm:vm-1043-disk-0,size=8G
> virtio1: rbd_vm:vm-1043-disk-1,size=100G
> virtio10: rbd_vm:vm-1043-disk-10,size=30G
> virtio11: rbd_vm:vm-1043-disk-11,size=100G
> virtio12: rbd_vm:vm-1043-disk-12,size=200G
> virtio2: rbd_vm:vm-1043-disk-2,size=100G
> virtio3: rbd_vm:vm-1043-disk-3,size=20G
> virtio4: rbd_vm:vm-1043-disk-4,size=20G
> virtio5: rbd_vm:vm-1043-disk-5,size=30G
> virtio6: rbd_vm:vm-1043-disk-6,size=100G
> virtio7: rbd_vm:vm-1043-disk-7,size=200G
> virtio8: rbd_vm:vm-1043-disk-8,size=20G
> virtio9: rbd_vm:vm-1043-disk-9,size=20G

The VM is a CentOS 7 NFS server.
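For planning how to split the VM and rebalance the disks, a quick tally of the virtio disks in a `qm config`-style dump can be scripted. A minimal sketch (the helper name and the two-line sample config are just illustrations, not a PVE API):

```python
import re

def disk_summary(qm_config):
    """Count the virtioN disks in a 'qm config'-style text dump
    and sum their provisioned sizes in GiB."""
    disks = []
    for line in qm_config.splitlines():
        m = re.match(r"\s*(virtio\d+):\s*\S+?,size=(\d+)G", line)
        if m:
            disks.append((m.group(1), int(m.group(2))))
    total = sum(size for _, size in disks)
    return len(disks), total

sample = """\
virtio0: rbd_vm:vm-1043-disk-0,size=8G
virtio1: rbd_vm:vm-1043-disk-1,size=100G
"""
# disk_summary(sample) -> (2, 108)
```

Run over the full config quoted above, this gives the per-VM disk count and total provisioned space to compare against the cluster average.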
The CEPH cluster health is OK:

> cluster:
>   id:     645e8181-8424-41c4-9bc9-7e37b740e9a9
>   health: HEALTH_OK
>
> services:
>   mon: 5 daemons, quorum node-01,node-02,node-03,node-05,node-07 (age 8d)
>   mgr: node-01(active, since 8d), standbys: node-03, node-02, node-07, node-05
>   osd: 120 osds: 120 up (since 6d), 120 in (since 6d)
>
> task status:
>
> data:
>   pools:   3 pools, 1057 pgs
>   objects: 4.65M objects, 17 TiB
>   usage:   67 TiB used, 139 TiB / 207 TiB avail
>   pgs:     1056 active+clean
>            1    active+clean+scrubbing+deep

All of the nodes have the same PVE version:

> proxmox-ve: 6.4-1 (running kernel: 5.4.157-1-pve)
> pve-manager: 6.4-13 (running version: 6.4-13/9f411e79)
> pve-kernel-5.4: 6.4-11
> pve-kernel-helper: 6.4-11
> pve-kernel-5.4.157-1-pve: 5.4.157-1
> pve-kernel-5.4.140-1-pve: 5.4.140-1
> pve-kernel-5.4.106-1-pve: 5.4.106-1
> ceph: 15.2.15-pve1~bpo10
> ceph-fuse: 15.2.15-pve1~bpo10
> corosync: 3.1.5-pve2~bpo10+1
> criu: 3.11-3
> glusterfs-client: 5.5-3
> ifupdown: 0.8.35+pve1
> ksm-control-daemon: 1.3-1
> libjs-extjs: 6.0.1-10
> libknet1: 1.22-pve2~bpo10+1
> libproxmox-acme-perl: 1.1.0
> libproxmox-backup-qemu0: 1.1.0-1
> libpve-access-control: 6.4-3
> libpve-apiclient-perl: 3.1-3
> libpve-common-perl: 6.4-4
> libpve-guest-common-perl: 3.1-5
> libpve-http-server-perl: 3.2-3
> libpve-storage-perl: 6.4-1
> libqb0: 1.0.5-1
> libspice-server1: 0.14.2-4~pve6+1
> lvm2: 2.03.02-pve4
> lxc-pve: 4.0.6-2
> lxcfs: 4.0.6-pve1
> novnc-pve: 1.1.0-1
> proxmox-backup-client: 1.1.13-2
> proxmox-mini-journalreader: 1.1-1
> proxmox-widget-toolkit: 2.6-1
> pve-cluster: 6.4-1
> pve-container: 3.3-6
> pve-docs: 6.4-2
> pve-edk2-firmware: 2.20200531-1
> pve-firewall: 4.1-4
> pve-firmware: 3.3-2
> pve-ha-manager: 3.1-1
> pve-i18n: 2.3-1
> pve-qemu-kvm: 5.2.0-6
> pve-xtermjs: 4.7.0-3
> qemu-server: 6.4-2
> smartmontools: 7.2-pve2
> spiceterm: 3.1-1
> vncterm: 1.6-2
> zfsutils-linux: 2.0.6-pve1~bpo10+1

I can provide more information if necessary.

Cheers
Iztok