From: Frank Thommen
To: Proxmox VE user list
Date: Wed, 16 Dec 2020 19:30:20 +0100
Subject: Re: [PVE-User] Backup of one VM always fails

I finally got an NFS share and ran a backup of said VM onto it.  It took
19 hours (only a 2 Gbit bond is available to the outside), but otherwise
it ran without issues or error messages.

However, I'm not sure what to conclude from this.  Backing up to external
NFS shares is not an option for us in the long run.  PBS is not an option
at this time either; maybe in the very long run.  I'd really like to use
Ceph, and I would like to solve the timeout issue.  Any ideas on how to
troubleshoot that?

Frank
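In case it helps with concrete suggestions: this is roughly how the backup
of that single VM can be triggered manually, outside the nightly schedule
and against a different target, as was suggested earlier in the thread
(a minimal sketch; the storage name 'nfs-test' is a placeholder for
whatever is defined in /etc/pve/storage.cfg, and the log path is my
assumption of the PVE 6.x default):

------------
# one-off backup of the problematic VM, same mode/compression as the
# nightly job
vzdump 123 --storage nfs-test --mode snapshot --compress lzo

# follow the per-VM backup log while the job runs
# (path assumed from the vzdump defaults; adjust if it differs on your node)
tail -f /var/log/vzdump/qemu-123.log
------------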

On 04/12/2020 15:20, Frank Thommen wrote:
> Lots of clicking to configure the other 30 VMs, but yes, that's probably
> the most appropriate thing to do now :-)  I have to arrange for a free
> NFS share first, though, as there is no free local disk space for 3+ TB
> of backup...
>
>
> On 04/12/2020 15:00, Yannis Milios wrote:
>> Can you try removing this specific VM from the normal backup schedule
>> and then create a new test schedule for it, if possible to a different
>> backup target (nfs, local etc)?
>>
>> On Fri, 4 Dec 2020 at 11:10, Frank Thommen wrote:
>>
>>> On 04/12/2020 11:36, Arjen via pve-user wrote:
>>>> On Fri, 2020-12-04 at 11:22 +0100, Frank Thommen wrote:
>>>>>
>>>>> On 04/12/2020 09:30, Frank Thommen wrote:
>>>>>>> On Thursday, December 3, 2020 10:16 PM, Frank Thommen wrote:
>>>>>>>
>>>>>>>> Dear all,
>>>>>>>>
>>>>>>>> on our PVE cluster, the backup of a specific VM always fails (which
>>>>>>>> makes us worry, as it is our GitLab instance).  The general backup
>>>>>>>> plan is "back up all VMs at 00:30".  In the confirmation email we
>>>>>>>> see that the backup of this specific VM takes six to seven hours
>>>>>>>> and then fails.  The error message in the overview table used to be:
>>>>>>>>
>>>>>>>> vma_queue_write: write error - Broken pipe
>>>>>>>>
>>>>>>>> with the detailed log
>>>>>>>>
>>>>>>>> ----------------------------------------------------------------
>>>>>>>> 123: 2020-12-01 02:53:08 INFO: Starting Backup of VM 123 (qemu)
>>>>>>>> 123: 2020-12-01 02:53:08 INFO: status = running
>>>>>>>> 123: 2020-12-01 02:53:09 INFO: update VM 123: -lock backup
>>>>>>>> 123: 2020-12-01 02:53:09 INFO: VM Name: odcf-vm123
>>>>>>>> 123: 2020-12-01 02:53:09 INFO: include disk 'virtio0' 'ceph-rbd:vm-123-disk-0' 20G
>>>>>>>> 123: 2020-12-01 02:53:09 INFO: include disk 'virtio1' 'ceph-rbd:vm-123-disk-2' 1000G
>>>>>>>> 123: 2020-12-01 02:53:09 INFO: include disk 'virtio2' 'ceph-rbd:vm-123-disk-3' 2T
>>>>>>>> 123: 2020-12-01 02:53:09 INFO: backup mode: snapshot
>>>>>>>> 123: 2020-12-01 02:53:09 INFO: ionice priority: 7
>>>>>>>> 123: 2020-12-01 02:53:09 INFO: creating archive '/mnt/pve/cephfs/dump/vzdump-qemu-123-2020_12_01-02_53_08.vma.lzo'
>>>>>>>> 123: 2020-12-01 02:53:09 INFO: started backup task 'a38ff50a-f474-4b0a-a052-01a835d5c5c7'
>>>>>>>> 123: 2020-12-01 02:53:12 INFO: status: 0% (167772160/3294239916032), sparse 0% (31563776), duration 3, read/write 55/45 MB/s
>>>>>>>> [... etc. etc. ...]
>>>>>>>> 123: 2020-12-01 09:42:14 INFO: status: 35% (1170252365824/3294239916032), sparse 0% (26845003776), duration 24545, read/write 59/56 MB/s
>>>>>>>> 123: 2020-12-01 09:42:14 ERROR: vma_queue_write: write error - Broken pipe
>>>>>>>> 123: 2020-12-01 09:42:14 INFO: aborting backup job
>>>>>>>> 123: 2020-12-01 09:42:15 ERROR: Backup of VM 123 failed - vma_queue_write: write error - Broken pipe
>>>>>>>> ----------------------------------------------------------------
>>>>>>>>
>>>>>>>> Since the upgrade to the newest PVE release it is now
>>>>>>>>
>>>>>>>> VM 123 qmp command 'query-backup' failed - got timeout
>>>>>>>>
>>>>>>>> with the log
>>>>>>>>
>>>>>>>> ----------------------------------------------------------------
>>>>>>>> 123: 2020-12-03 03:29:00 INFO: Starting Backup of VM 123 (qemu)
>>>>>>>> 123: 2020-12-03 03:29:00 INFO: status = running
>>>>>>>> 123: 2020-12-03 03:29:00 INFO: VM Name: odcf-vm123
>>>>>>>> 123: 2020-12-03 03:29:00 INFO: include disk 'virtio0' 'ceph-rbd:vm-123-disk-0' 20G
>>>>>>>> 123: 2020-12-03 03:29:00 INFO: include disk 'virtio1' 'ceph-rbd:vm-123-disk-2' 1000G
>>>>>>>> 123: 2020-12-03 03:29:00 INFO: include disk 'virtio2' 'ceph-rbd:vm-123-disk-3' 2T
>>>>>>>> 123: 2020-12-03 03:29:01 INFO: backup mode: snapshot
>>>>>>>> 123: 2020-12-03 03:29:01 INFO: ionice priority: 7
>>>>>>>> 123: 2020-12-03 03:29:01 INFO: creating vzdump archive '/mnt/pve/cephfs/dump/vzdump-qemu-123-2020_12_03-03_29_00.vma.lzo'
>>>>>>>> 123: 2020-12-03 03:29:01 INFO: started backup task 'cc7cde4e-20e8-4e26-a89a-f6f1aa9e9612'
>>>>>>>> 123: 2020-12-03 03:29:01 INFO: resuming VM again
>>>>>>>> 123: 2020-12-03 03:29:04 INFO: 0% (284.0 MiB of 3.0 TiB) in 3s, read: 94.7 MiB/s, write: 51.7 MiB/s
>>>>>>>> [... etc. etc. ...]
>>>>>>>> 123: 2020-12-03 09:05:08 INFO: 36% (1.1 TiB of 3.0 TiB) in 5h 36m 7s, read: 57.3 MiB/s, write: 53.6 MiB/s
>>>>>>>> 123: 2020-12-03 09:22:57 ERROR: VM 123 qmp command 'query-backup' failed - got timeout
>>>>>>>> 123: 2020-12-03 09:22:57 INFO: aborting backup job
>>>>>>>> 123: 2020-12-03 09:32:57 ERROR: VM 123 qmp command 'backup-cancel' failed - unable to connect to VM 123 qmp socket - timeout after 5981 retries
>>>>>>>> 123: 2020-12-03 09:32:57 ERROR: Backup of VM 123 failed - VM 123 qmp command 'query-backup' failed - got timeout
>>>>>>>>
>>>>>>>> The VM has some quite big vdisks (20G, 1T and 2T), all stored in
>>>>>>>> Ceph.  There is still plenty of space in Ceph.
>>>>>>>>
>>>>>>>> Can anyone give us a hint on how to investigate and debug this
>>>>>>>> further?
>>>>>>>
>>>>>>> Because it is a write error, maybe we should look at the backup
>>>>>>> destination.  Maybe it is a network connection issue? Maybe
>>>>>>> something is wrong with the host? Maybe the disk is full?
>>>>>>> Which storage are you using for backup? Can you show us the
>>>>>>> corresponding entry in /etc/pve/storage.cfg?
>>>>>>
>>>>>> We are backing up to cephfs with still 8 TB or so free.
>>>>>>
>>>>>> /etc/pve/storage.cfg is
>>>>>> ------------
>>>>>> dir: local
>>>>>>         path /var/lib/vz
>>>>>>         content vztmpl,backup,iso
>>>>>>
>>>>>> dir: data
>>>>>>         path /data
>>>>>>         content snippets,images,backup,iso,rootdir,vztmpl
>>>>>>
>>>>>> cephfs: cephfs
>>>>>>         path /mnt/pve/cephfs
>>>>>>         content backup,vztmpl,iso
>>>>>>         maxfiles 5
>>>>>>
>>>>>> rbd: ceph-rbd
>>>>>>         content images,rootdir
>>>>>>         krbd 0
>>>>>>         pool pve-pool1
>>>>>> ------------
>>>>>
>>>>> The problem has reached a new level of urgency: for the last two
>>>>> days, each time the backup fails the VM becomes inaccessible and has
>>>>> to be stopped and started manually from the PVE UI.
>>>>
>>>> I don't see anything wrong with the configuration that you shared.
>>>> Was anything changed in the last few days since the last successful
>>>> backup? Any updates from Proxmox? Changes to the network?
>>>> I know very little about Ceph and clusters, sorry.
>>>> What makes this VM different, except for the size of the disks?
>>>
>>> On December 1st the hypervisor was updated to PVE 6.3-2 (I think from
>>> 6.1-3).  After that the error message changed slightly and - in
>>> hindsight - since then the VM stops being accessible after the failed
>>> backup.
>>>
>>> However: the VM has never backed up successfully, not even before the
>>> PVE upgrade.  It's just that no one really took notice of it.
>>>
>>> The VM is not really special.  It's our only Debian VM (but I hope
>>> that's not an issue :-) and it was migrated 1:1 from oVirt by importing
>>> the disk images.  But we have a few other such VMs and they run and
>>> back up just fine.
>>>
>>> No network changes.  Basically nothing has changed that I could think
>>> of.
>>>
>>> But to be clear: our current main problem is the failing backup, not
>>> the crash.
>>>
>>> Cheers, Frank
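
P.S. In case it helps anyone hitting the same qmp timeouts: these are the
kinds of checks that could be run the next time the job hangs, to see
whether the guest's monitor socket is still responsive and whether Ceph is
healthy at that moment (a sketch only; VM ID 123 as in the logs above, and
the socket path is what I understand to be the PVE default):

------------
# does the VM still answer on its management/QMP interface?
qm status 123
ls -l /var/run/qemu-server/123.qmp

# interactive QEMU human monitor for the VM; 'info block-jobs' shows
# whether the backup block job is still making progress
qm monitor 123

# Ceph / CephFS health at the time of the hang
ceph -s
ceph health detail
------------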