From: Frank Thommen
To: PVE User List
Organization: DKFZ Heidelberg, Omics IT and Data Management Core Facility (ODCF)
Date: Thu, 3 Dec 2020 22:16:59 +0100
Subject: [PVE-User] Backup of one VM always fails

Dear all,

on our PVE cluster, the backup of one specific VM always fails (which worries us, as it is our GitLab instance). The general backup plan is "back up all VMs at 00:30". In the confirmation email we see that the backup of this specific VM takes six to seven hours and then fails.
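For completeness, the scheduled job in /etc/pve/vzdump.cron looks roughly like the sketch below. I have reconstructed the options from the log output further down (snapshot mode, LZO compression, dump directory on CephFS), so the storage name and the exact flags are an approximation rather than a verbatim copy:

---------------------
# /etc/pve/vzdump.cron (sketch; options reconstructed from the logs below)
PATH="/usr/sbin:/usr/bin:/sbin:/bin"

30 0 * * *           root vzdump --all 1 --quiet 1 --mode snapshot --compress lzo --storage cephfs
---------------------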
The error message in the overview table used to be:

  vma_queue_write: write error - Broken pipe

with the detailed log:

---------------------
123: 2020-12-01 02:53:08 INFO: Starting Backup of VM 123 (qemu)
123: 2020-12-01 02:53:08 INFO: status = running
123: 2020-12-01 02:53:09 INFO: update VM 123: -lock backup
123: 2020-12-01 02:53:09 INFO: VM Name: odcf-vm123
123: 2020-12-01 02:53:09 INFO: include disk 'virtio0' 'ceph-rbd:vm-123-disk-0' 20G
123: 2020-12-01 02:53:09 INFO: include disk 'virtio1' 'ceph-rbd:vm-123-disk-2' 1000G
123: 2020-12-01 02:53:09 INFO: include disk 'virtio2' 'ceph-rbd:vm-123-disk-3' 2T
123: 2020-12-01 02:53:09 INFO: backup mode: snapshot
123: 2020-12-01 02:53:09 INFO: ionice priority: 7
123: 2020-12-01 02:53:09 INFO: creating archive '/mnt/pve/cephfs/dump/vzdump-qemu-123-2020_12_01-02_53_08.vma.lzo'
123: 2020-12-01 02:53:09 INFO: started backup task 'a38ff50a-f474-4b0a-a052-01a835d5c5c7'
123: 2020-12-01 02:53:12 INFO: status: 0% (167772160/3294239916032), sparse 0% (31563776), duration 3, read/write 55/45 MB/s
[... etc. etc. ...]
123: 2020-12-01 09:42:14 INFO: status: 35% (1170252365824/3294239916032), sparse 0% (26845003776), duration 24545, read/write 59/56 MB/s
123: 2020-12-01 09:42:14 ERROR: vma_queue_write: write error - Broken pipe
123: 2020-12-01 09:42:14 INFO: aborting backup job
123: 2020-12-01 09:42:15 ERROR: Backup of VM 123 failed - vma_queue_write: write error - Broken pipe
---------------------

Since the recent upgrade to the newest PVE release, it is:

  VM 123 qmp command 'query-backup' failed - got timeout

with the log:

---------------------
123: 2020-12-03 03:29:00 INFO: Starting Backup of VM 123 (qemu)
123: 2020-12-03 03:29:00 INFO: status = running
123: 2020-12-03 03:29:00 INFO: VM Name: odcf-vm123
123: 2020-12-03 03:29:00 INFO: include disk 'virtio0' 'ceph-rbd:vm-123-disk-0' 20G
123: 2020-12-03 03:29:00 INFO: include disk 'virtio1' 'ceph-rbd:vm-123-disk-2' 1000G
123: 2020-12-03 03:29:00 INFO: include disk 'virtio2' 'ceph-rbd:vm-123-disk-3' 2T
123: 2020-12-03 03:29:01 INFO: backup mode: snapshot
123: 2020-12-03 03:29:01 INFO: ionice priority: 7
123: 2020-12-03 03:29:01 INFO: creating vzdump archive '/mnt/pve/cephfs/dump/vzdump-qemu-123-2020_12_03-03_29_00.vma.lzo'
123: 2020-12-03 03:29:01 INFO: started backup task 'cc7cde4e-20e8-4e26-a89a-f6f1aa9e9612'
123: 2020-12-03 03:29:01 INFO: resuming VM again
123: 2020-12-03 03:29:04 INFO: 0% (284.0 MiB of 3.0 TiB) in 3s, read: 94.7 MiB/s, write: 51.7 MiB/s
[... etc. etc. ...]
123: 2020-12-03 09:05:08 INFO: 36% (1.1 TiB of 3.0 TiB) in 5h 36m 7s, read: 57.3 MiB/s, write: 53.6 MiB/s
123: 2020-12-03 09:22:57 ERROR: VM 123 qmp command 'query-backup' failed - got timeout
123: 2020-12-03 09:22:57 INFO: aborting backup job
123: 2020-12-03 09:32:57 ERROR: VM 123 qmp command 'backup-cancel' failed - unable to connect to VM 123 qmp socket - timeout after 5981 retries
123: 2020-12-03 09:32:57 ERROR: Backup of VM 123 failed - VM 123 qmp command 'query-backup' failed - got timeout
---------------------

The VM has some quite big vdisks (20G, 1T and 2T), all stored in Ceph. There is still plenty of space in Ceph.

Can anyone give us a hint on how to investigate and debug this further?

Thanks in advance
Frank
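P.S. For anyone who wants to suggest debugging steps: we can reproduce the failure outside the schedule with something like the following (the storage name is an assumption on my part; the other options match what the scheduled job appears to use):

---------------------
# run the backup for VM 123 alone, in the foreground
vzdump 123 --mode snapshot --compress lzo --storage cephfs

# in parallel, check whether the QMP socket still responds
qm status 123 --verbose

# and keep an eye on Ceph health and OSD latencies
ceph -s
ceph osd perf
---------------------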