From: Frank Thommen
Organization: DKFZ Heidelberg, Omics IT and Data Management Core Facility (ODCF)
To: Proxmox VE user list
Date: Fri, 4 Dec 2020 15:20:38 +0100
Subject: Re: [PVE-User] Backup of one VM always fails

Lots of clicking to configure the other 30 VMs, but yes, that's
probably the most appropriate thing to do now :-)  I have to arrange
for a free NFS share first, though, as there is no free local disk
space for 3+ TB of backups...
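Roughly what I have in mind - an untested sketch; the storage name,
server address and export path below are placeholders, not our real
setup:

  # register the NFS share as an additional backup target
  pvesm add nfs nfs-backup --server 192.0.2.10 \
      --export /export/pve-backup --content backup

  # then a one-off test backup of VM 123 to the new target,
  # same snapshot mode and lzo compression as the nightly job
  vzdump 123 --storage nfs-backup --mode snapshot --compress lzo

If that one-off run also dies around 35%, at least we'll know the
cephfs target is not the problem.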
On 04/12/2020 15:00, Yannis Milios wrote:
> Can you try removing this specific VM from the normal backup schedule
> and then create a new test schedule for it, if possible to a different
> backup target (nfs, local etc)?
>
> On Fri, 4 Dec 2020 at 11:10, Frank Thommen wrote:
>
>> On 04/12/2020 11:36, Arjen via pve-user wrote:
>>> On Fri, 2020-12-04 at 11:22 +0100, Frank Thommen wrote:
>>>>
>>>> On 04/12/2020 09:30, Frank Thommen wrote:
>>>>>> On Thursday, December 3, 2020 10:16 PM, Frank Thommen wrote:
>>>>>>
>>>>>>> Dear all,
>>>>>>>
>>>>>>> on our PVE cluster, the backup of a specific VM always fails
>>>>>>> (which worries us, as it is our GitLab instance). The general
>>>>>>> backup plan is "back up all VMs at 00:30". In the confirmation
>>>>>>> email we see that the backup of this specific VM takes six to
>>>>>>> seven hours and then fails. The error message in the overview
>>>>>>> table used to be:
>>>>>>>
>>>>>>> vma_queue_write: write error - Broken pipe
>>>>>>>
>>>>>>> with the detailed log
>>>>>>>
>>>>>>> ---------------------------------------------------------------
>>>>>>> 123: 2020-12-01 02:53:08 INFO: Starting Backup of VM 123 (qemu)
>>>>>>> 123: 2020-12-01 02:53:08 INFO: status = running
>>>>>>> 123: 2020-12-01 02:53:09 INFO: update VM 123: -lock backup
>>>>>>> 123: 2020-12-01 02:53:09 INFO: VM Name: odcf-vm123
>>>>>>> 123: 2020-12-01 02:53:09 INFO: include disk 'virtio0' 'ceph-rbd:vm-123-disk-0' 20G
>>>>>>> 123: 2020-12-01 02:53:09 INFO: include disk 'virtio1' 'ceph-rbd:vm-123-disk-2' 1000G
>>>>>>> 123: 2020-12-01 02:53:09 INFO: include disk 'virtio2' 'ceph-rbd:vm-123-disk-3' 2T
>>>>>>> 123: 2020-12-01 02:53:09 INFO: backup mode: snapshot
>>>>>>> 123: 2020-12-01 02:53:09 INFO: ionice priority: 7
>>>>>>> 123: 2020-12-01 02:53:09 INFO: creating archive '/mnt/pve/cephfs/dump/vzdump-qemu-123-2020_12_01-02_53_08.vma.lzo'
>>>>>>> 123: 2020-12-01 02:53:09 INFO: started backup task 'a38ff50a-f474-4b0a-a052-01a835d5c5c7'
>>>>>>> 123: 2020-12-01 02:53:12 INFO: status: 0% (167772160/3294239916032), sparse 0% (31563776), duration 3, read/write 55/45 MB/s
>>>>>>> [... etc. etc. ...]
>>>>>>> 123: 2020-12-01 09:42:14 INFO: status: 35% (1170252365824/3294239916032), sparse 0% (26845003776), duration 24545, read/write 59/56 MB/s
>>>>>>> 123: 2020-12-01 09:42:14 ERROR: vma_queue_write: write error - Broken pipe
>>>>>>> 123: 2020-12-01 09:42:14 INFO: aborting backup job
>>>>>>> 123: 2020-12-01 09:42:15 ERROR: Backup of VM 123 failed - vma_queue_write: write error - Broken pipe
>>>>>>> ---------------------------------------------------------------
>>>>>>>
>>>>>>> Lately (since the upgrade to the newest PVE release) it's
>>>>>>>
>>>>>>> VM 123 qmp command 'query-backup' failed - got timeout
>>>>>>>
>>>>>>> with log
>>>>>>>
>>>>>>> ---------------------------------------------------------------
>>>>>>> 123: 2020-12-03 03:29:00 INFO: Starting Backup of VM 123 (qemu)
>>>>>>> 123: 2020-12-03 03:29:00 INFO: status = running
>>>>>>> 123: 2020-12-03 03:29:00 INFO: VM Name: odcf-vm123
>>>>>>> 123: 2020-12-03 03:29:00 INFO: include disk 'virtio0' 'ceph-rbd:vm-123-disk-0' 20G
>>>>>>> 123: 2020-12-03 03:29:00 INFO: include disk 'virtio1' 'ceph-rbd:vm-123-disk-2' 1000G
>>>>>>> 123: 2020-12-03 03:29:00 INFO: include disk 'virtio2' 'ceph-rbd:vm-123-disk-3' 2T
>>>>>>> 123: 2020-12-03 03:29:01 INFO: backup mode: snapshot
>>>>>>> 123: 2020-12-03 03:29:01 INFO: ionice priority: 7
>>>>>>> 123: 2020-12-03 03:29:01 INFO: creating vzdump archive '/mnt/pve/cephfs/dump/vzdump-qemu-123-2020_12_03-03_29_00.vma.lzo'
>>>>>>> 123: 2020-12-03 03:29:01 INFO: started backup task 'cc7cde4e-20e8-4e26-a89a-f6f1aa9e9612'
>>>>>>> 123: 2020-12-03 03:29:01 INFO: resuming VM again
>>>>>>> 123: 2020-12-03 03:29:04 INFO: 0% (284.0 MiB of 3.0 TiB) in 3s, read: 94.7 MiB/s, write: 51.7 MiB/s
>>>>>>> [... etc. etc. ...]
>>>>>>> 123: 2020-12-03 09:05:08 INFO: 36% (1.1 TiB of 3.0 TiB) in 5h 36m 7s, read: 57.3 MiB/s, write: 53.6 MiB/s
>>>>>>> 123: 2020-12-03 09:22:57 ERROR: VM 123 qmp command 'query-backup' failed - got timeout
>>>>>>> 123: 2020-12-03 09:22:57 INFO: aborting backup job
>>>>>>> 123: 2020-12-03 09:32:57 ERROR: VM 123 qmp command 'backup-cancel' failed - unable to connect to VM 123 qmp socket - timeout after 5981 retries
>>>>>>> 123: 2020-12-03 09:32:57 ERROR: Backup of VM 123 failed - VM 123 qmp command 'query-backup' failed - got timeout
>>>>>>> ---------------------------------------------------------------
>>>>>>>
>>>>>>> The VM has some quite big vdisks (20G, 1T and 2T), all stored
>>>>>>> in Ceph. There is still plenty of space in Ceph.
>>>>>>>
>>>>>>> Can anyone give us a hint on how to investigate and debug this
>>>>>>> further?
>>>>>>
>>>>>> Because it is a write error, maybe we should look at the backup
>>>>>> destination. Maybe it is a network connection issue? Maybe
>>>>>> something is wrong with the host? Maybe the disk is full?
>>>>>> Which storage are you using for backup? Can you show us the
>>>>>> corresponding entry in /etc/pve/storage.cfg?
>>>>>
>>>>> We are backing up to cephfs with still 8 TB or so free.
>>>>>
>>>>> /etc/pve/storage.cfg is
>>>>> ------------
>>>>> dir: local
>>>>>         path /var/lib/vz
>>>>>         content vztmpl,backup,iso
>>>>>
>>>>> dir: data
>>>>>         path /data
>>>>>         content snippets,images,backup,iso,rootdir,vztmpl
>>>>>
>>>>> cephfs: cephfs
>>>>>         path /mnt/pve/cephfs
>>>>>         content backup,vztmpl,iso
>>>>>         maxfiles 5
>>>>>
>>>>> rbd: ceph-rbd
>>>>>         content images,rootdir
>>>>>         krbd 0
>>>>>         pool pve-pool1
>>>>> ------------
>>>>>
>>>>
>>>> The problem has reached a new level of urgency: for two days now,
>>>> each time after a failed backup the VM becomes inaccessible and
>>>> has to be stopped and started manually from the PVE UI.
>>>
>>> I don't see anything wrong with the configuration that you shared.
>>> Was anything changed in the last few days since the last successful
>>> backup? Any updates from Proxmox? Changes to the network?
>>> I know very little about Ceph and clusters, sorry.
>>> What makes this VM different, except for the size of the disks?
>>
>> On December 1st the hypervisor was updated to PVE 6.3-2 (I think
>> from 6.1-3). After that the error message changed slightly, and - in
>> hindsight - since then the VM becomes inaccessible after each failed
>> backup.
>>
>> However: the VM has never been backed up successfully, not even
>> before the PVE upgrade. It's just that no one really took notice of
>> it.
>>
>> The VM is not really special. It's our only Debian VM (but I hope
>> that's not an issue :-), and it was migrated 1:1 from oVirt by
>> importing the disk images. But we have a few other such VMs and they
>> run and back up just fine.
>>
>> No network changes. Basically nothing has changed that I can think
>> of.
>>
>> But to be clear: our current main problem is the failing backup, not
>> the crash.
>>
>> Cheers, Frank