From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <f.thommen@dkfz-heidelberg.de>
Received: from firstgate.proxmox.com (firstgate.proxmox.com [212.224.123.68])
 (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits)
 key-exchange X25519 server-signature RSA-PSS (2048 bits))
 (No client certificate requested)
 by lists.proxmox.com (Postfix) with ESMTPS id 1D87A68DB5
 for <pve-user@lists.proxmox.com>; Fri,  4 Dec 2020 15:21:13 +0100 (CET)
Received: from firstgate.proxmox.com (localhost [127.0.0.1])
 by firstgate.proxmox.com (Proxmox) with ESMTP id 0A003AE79
 for <pve-user@lists.proxmox.com>; Fri,  4 Dec 2020 15:20:43 +0100 (CET)
Received: from mx-ext.inet.dkfz-heidelberg.de (mx-ext.inet.dkfz-heidelberg.de
 [192.54.49.101])
 (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
 (No client certificate requested)
 by firstgate.proxmox.com (Proxmox) with ESMTPS id B6DBBAE6C
 for <pve-user@lists.proxmox.com>; Fri,  4 Dec 2020 15:20:41 +0100 (CET)
X-Virus-Scanned-DKFZ: amavisd-new at dkfz-heidelberg.de
Received: from [194.94.115.240] (dkfz-vpn240.inet.dkfz-heidelberg.de
 [194.94.115.240]) (authenticated bits=0)
 by mx-ext.inet.dkfz-heidelberg.de (8.14.7/8.14.7/smtpin) with ESMTP id
 0B4EKcnd025480
 (version=TLSv1/SSLv3 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NO)
 for <pve-user@lists.proxmox.com>; Fri, 4 Dec 2020 15:20:39 +0100
DKIM-Filter: OpenDKIM Filter v2.11.0 mx-ext.inet.dkfz-heidelberg.de
 0B4EKcnd025480
To: Proxmox VE user list <pve-user@lists.proxmox.com>
References: <6f8b35b3-bd74-93f1-5298-eb9980c70d77@dkfz-heidelberg.de>
 <mailman.131.1607062291.440.pve-user@lists.proxmox.com>
 <c1c069d7-af43-ed63-176d-43a9d5fd11b2@dkfz-heidelberg.de>
 <e93c3508-d164-4f6b-bfa1-e36975e36778@dkfz-heidelberg.de>
 <mailman.2.1607078234.376.pve-user@lists.proxmox.com>
 <9d09aa69-95aa-0d96-e119-57b724f29080@dkfz-heidelberg.de>
 <CAFiF2Or3cXrb1zRQ4neCyQs6wqy6xDTppJxBzkUc-zL=2kYrDQ@mail.gmail.com>
From: Frank Thommen <f.thommen@dkfz-heidelberg.de>
Organization: DKFZ Heidelberg, Omics IT and Data Management Core Facility
 (ODCF)
Message-ID: <1d7f5c19-790a-472c-9fcd-6f8d6b8d0f9c@dkfz-heidelberg.de>
Date: Fri, 4 Dec 2020 15:20:38 +0100
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:78.0) Gecko/20100101
 Thunderbird/78.4.0
MIME-Version: 1.0
In-Reply-To: <CAFiF2Or3cXrb1zRQ4neCyQs6wqy6xDTppJxBzkUc-zL=2kYrDQ@mail.gmail.com>
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Language: en-US
Content-Transfer-Encoding: 7bit
X-Greylist: Sender succeeded SMTP AUTH, not delayed by milter-greylist-4.6.2
 (mx-ext.inet.dkfz-heidelberg.de [192.54.49.101]);
 Fri, 04 Dec 2020 15:20:39 +0100 (CET)
X-Spam-Status: No, score=-100.0 required=5.0 tests=ALL_TRUSTED,NICE_REPLY_A,
 URIBL_BLOCKED autolearn=disabled version=3.4.0
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
 mx-ext.inet.dkfz-heidelberg.de
X-SPAM-LEVEL: Spam detection results:  0
 AWL -0.067 Adjusted score from AWL reputation of From: address
 KAM_ASCII_DIVIDERS        0.8 Spam that uses ascii formatting tricks
 KAM_DMARC_STATUS 0.01 Test Rule for DKIM or SPF Failure with Strict Alignment
 NICE_REPLY_A           -0.001 Looks like a legit reply (A)
 SPF_HELO_NONE           0.001 SPF: HELO does not publish an SPF Record
 SPF_PASS               -0.001 SPF: sender matches SPF record
 URIBL_BLOCKED 0.001 ADMINISTRATOR NOTICE: The query to URIBL was blocked. See
 http://wiki.apache.org/spamassassin/DnsBlocklists#dnsbl-block for more
 information. [proxmox.com]
Subject: Re: [PVE-User] Backup of one VM always fails
X-BeenThere: pve-user@lists.proxmox.com
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: Proxmox VE user list <pve-user.lists.proxmox.com>
List-Unsubscribe: <https://lists.proxmox.com/cgi-bin/mailman/options/pve-user>, 
 <mailto:pve-user-request@lists.proxmox.com?subject=unsubscribe>
List-Archive: <http://lists.proxmox.com/pipermail/pve-user/>
List-Post: <mailto:pve-user@lists.proxmox.com>
List-Help: <mailto:pve-user-request@lists.proxmox.com?subject=help>
List-Subscribe: <https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-user>, 
 <mailto:pve-user-request@lists.proxmox.com?subject=subscribe>
X-List-Received-Date: Fri, 04 Dec 2020 14:21:13 -0000

Lots of clicking to configure the other 30 VMs, but yes, that's 
probably the most appropriate thing to do now :-)  I have to arrange for 
a free NFS share first, though, as there is no free local disk space for 
3+ TB of backups...
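
As a rough sketch of what I have in mind (server address, export path 
and storage name below are only placeholders, not our real setup), the 
temporary NFS target would be an extra entry in /etc/pve/storage.cfg:

------------
nfs: nfs-backup-test
          server 192.0.2.10
          export /export/pve-backup
          path /mnt/pve/nfs-backup-test
          content backup
          maxfiles 2
------------

followed by a one-off manual backup of only this VM to that storage, 
before touching the regular schedule:

    vzdump 123 --storage nfs-backup-test --mode snapshot --compress lzo

If that goes through, I'd add a separate scheduled job just for VM 123.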


On 04/12/2020 15:00, Yannis Milios wrote:
> Can you try removing this specific VM from the normal backup schedule and
> then creating a new test schedule for it, if possible to a different backup
> target (nfs, local, etc.)?
> 
> 
> 
> On Fri, 4 Dec 2020 at 11:10, Frank Thommen <f.thommen@dkfz-heidelberg.de>
> wrote:
> 
>> On 04/12/2020 11:36, Arjen via pve-user wrote:
>>> On Fri, 2020-12-04 at 11:22 +0100, Frank Thommen wrote:
>>>>
>>>> On 04/12/2020 09:30, Frank Thommen wrote:
>>>>>> On Thursday, December 3, 2020 10:16 PM, Frank Thommen
>>>>>> <f.thommen@dkfz-heidelberg.de> wrote:
>>>>>>
>>>>>>>
>>>>>>> Dear all,
>>>>>>>
>>>>>>> on our PVE cluster, the backup of a specific VM always fails
>>>>>>> (which makes us worry, as it is our GitLab instance). The
>>>>>>> general backup plan is "back up all VMs at 00:30". In the
>>>>>>> confirmation email we see that the backup of this specific VM
>>>>>>> takes six to seven hours and then fails. The error message in
>>>>>>> the overview table used to be:
>>>>>>>
>>>>>>> vma_queue_write: write error - Broken pipe
>>>>>>>
>>>>>>> With detailed log
>>>>>>>
>>>>>>> -------------------------------------------------------------
>>>>>>>
>>>>>>> 123: 2020-12-01 02:53:08 INFO: Starting Backup of VM 123 (qemu)
>>>>>>> 123: 2020-12-01 02:53:08 INFO: status = running
>>>>>>> 123: 2020-12-01 02:53:09 INFO: update VM 123: -lock backup
>>>>>>> 123: 2020-12-01 02:53:09 INFO: VM Name: odcf-vm123
>>>>>>> 123: 2020-12-01 02:53:09 INFO: include disk 'virtio0'
>>>>>>> 'ceph-rbd:vm-123-disk-0' 20G
>>>>>>> 123: 2020-12-01 02:53:09 INFO: include disk 'virtio1'
>>>>>>> 'ceph-rbd:vm-123-disk-2' 1000G
>>>>>>> 123: 2020-12-01 02:53:09 INFO: include disk 'virtio2'
>>>>>>> 'ceph-rbd:vm-123-disk-3' 2T
>>>>>>> 123: 2020-12-01 02:53:09 INFO: backup mode: snapshot
>>>>>>> 123: 2020-12-01 02:53:09 INFO: ionice priority: 7
>>>>>>> 123: 2020-12-01 02:53:09 INFO: creating archive
>>>>>>> '/mnt/pve/cephfs/dump/vzdump-qemu-123-2020_12_01-
>>>>>>> 02_53_08.vma.lzo'
>>>>>>> 123: 2020-12-01 02:53:09 INFO: started backup task
>>>>>>> 'a38ff50a-f474-4b0a-a052-01a835d5c5c7'
>>>>>>> 123: 2020-12-01 02:53:12 INFO: status: 0%
>>>>>>> (167772160/3294239916032),
>>>>>>> sparse 0% (31563776), duration 3, read/write 55/45 MB/s
>>>>>>> [... etc. etc. ...]
>>>>>>> 123: 2020-12-01 09:42:14 INFO: status: 35%
>>>>>>> (1170252365824/3294239916032), sparse 0% (26845003776),
>>>>>>> duration 24545,
>>>>>>> read/write 59/56 MB/s
>>>>>>> 123: 2020-12-01 09:42:14 ERROR: vma_queue_write: write error -
>>>>>>> Broken
>>>>>>> pipe
>>>>>>> 123: 2020-12-01 09:42:14 INFO: aborting backup job
>>>>>>> 123: 2020-12-01 09:42:15 ERROR: Backup of VM 123 failed -
>>>>>>> vma_queue_write: write error - Broken pipe
>>>>>>>
>>>>>>> -------------------------------------------------------------
>>>>>>>
>>>>>>> Since the upgrade to the newest PVE release, it's
>>>>>>>
>>>>>>> VM 123 qmp command 'query-backup' failed - got timeout
>>>>>>>
>>>>>>> with log
>>>>>>>
>>>>>>> -------------------------------------------------------------
>>>>>>>
>>>>>>> 123: 2020-12-03 03:29:00 INFO: Starting Backup of VM 123 (qemu)
>>>>>>> 123: 2020-12-03 03:29:00 INFO: status = running
>>>>>>> 123: 2020-12-03 03:29:00 INFO: VM Name: odcf-vm123
>>>>>>> 123: 2020-12-03 03:29:00 INFO: include disk 'virtio0'
>>>>>>> 'ceph-rbd:vm-123-disk-0' 20G
>>>>>>> 123: 2020-12-03 03:29:00 INFO: include disk 'virtio1'
>>>>>>> 'ceph-rbd:vm-123-disk-2' 1000G
>>>>>>> 123: 2020-12-03 03:29:00 INFO: include disk 'virtio2'
>>>>>>> 'ceph-rbd:vm-123-disk-3' 2T
>>>>>>> 123: 2020-12-03 03:29:01 INFO: backup mode: snapshot
>>>>>>> 123: 2020-12-03 03:29:01 INFO: ionice priority: 7
>>>>>>> 123: 2020-12-03 03:29:01 INFO: creating vzdump archive
>>>>>>> '/mnt/pve/cephfs/dump/vzdump-qemu-123-2020_12_03-
>>>>>>> 03_29_00.vma.lzo'
>>>>>>> 123: 2020-12-03 03:29:01 INFO: started backup task
>>>>>>> 'cc7cde4e-20e8-4e26-a89a-f6f1aa9e9612'
>>>>>>> 123: 2020-12-03 03:29:01 INFO: resuming VM again
>>>>>>> 123: 2020-12-03 03:29:04 INFO: 0% (284.0 MiB of 3.0 TiB) in 3s,
>>>>>>> read:
>>>>>>> 94.7 MiB/s, write: 51.7 MiB/s
>>>>>>> [... etc. etc. ...]
>>>>>>> 123: 2020-12-03 09:05:08 INFO: 36% (1.1 TiB of 3.0 TiB) in 5h
>>>>>>> 36m 7s,
>>>>>>> read: 57.3 MiB/s, write: 53.6 MiB/s
>>>>>>> 123: 2020-12-03 09:22:57 ERROR: VM 123 qmp command
>>>>>>> 'query-backup' failed - got timeout
>>>>>>> 123: 2020-12-03 09:22:57 INFO: aborting backup job
>>>>>>> 123: 2020-12-03 09:32:57 ERROR: VM 123 qmp command
>>>>>>> 'backup-cancel' failed - unable to connect to VM 123 qmp socket
>>>>>>> - timeout after 5981 retries
>>>>>>> 123: 2020-12-03 09:32:57 ERROR: Backup of VM 123 failed - VM 123
>>>>>>> qmp command 'query-backup' failed - got timeout
>>>>>>>
>>>>>>>
>>>>>>> The VM has some quite big vdisks (20G, 1T and 2T). All stored
>>>>>>> in Ceph.
>>>>>>> There is still plenty of space in Ceph.
>>>>>>>
>>>>>>> Can anyone give us some hints on how to investigate and debug
>>>>>>> this further?
>>>>>>
>>>>>> Because it is a write error, maybe we should look at the backup
>>>>>> destination. Maybe it is a network connection issue? Maybe
>>>>>> something wrong with the host? Maybe the disk is full?
>>>>>> Which storage are you using for backup? Can you show us the
>>>>>> corresponding entry in /etc/pve/storage.cfg?
>>>>>
>>>>> We are backing up to cephfs with still 8 TB or so free.
>>>>>
>>>>> /etc/pve/storage.cfg is
>>>>> ------------
>>>>> dir: local
>>>>>           path /var/lib/vz
>>>>>           content vztmpl,backup,iso
>>>>>
>>>>> dir: data
>>>>>           path /data
>>>>>           content snippets,images,backup,iso,rootdir,vztmpl
>>>>>
>>>>> cephfs: cephfs
>>>>>           path /mnt/pve/cephfs
>>>>>           content backup,vztmpl,iso
>>>>>           maxfiles 5
>>>>>
>>>>> rbd: ceph-rbd
>>>>>           content images,rootdir
>>>>>           krbd 0
>>>>>           pool pve-pool1
>>>>> ------------
>>>>>
>>>>
>>>> The problem has reached a new level of urgency: for the past two
>>>> days, each time after a failed backup the VM becomes inaccessible
>>>> and has to be stopped and started manually from the PVE UI.
>>>
>>> I don't see anything wrong with the configuration that you shared.
>>> Was anything changed in the last few days since the last successful
>>> backup? Any updates from Proxmox? Changes to the network?
>>> I know very little about Ceph and clusters, sorry.
>>> What makes this VM different, except for the size of the disks?
>>
>> On December 1st the hypervisor was updated to PVE 6.3-2 (I think
>> from 6.1-3).  After that the error message changed slightly and - in
>> hindsight - since then the VM becomes inaccessible after each failed
>> backup.
>>
>> However: The VM never ever backed up successfully, not even before the
>> PVE upgrade.  It's just that no one really took notice of it.
>>
>> The VM is not really special.  It's our only Debian VM (but I hope
>> that's not an issue :-) and it was migrated 1:1 from oVirt by importing
>> the disk images.  But we have a few other such VMs and they run and
>> back up just fine.
>>
>> No network changes. Basically nothing changed that I could think of.
>>
>> But to be clear: Our current main problem is the failing backup, not the
>> crash.
>>
>>
>> Cheers, Frank
>>
>>
>>
>>
>> _______________________________________________
>> pve-user mailing list
>> pve-user@lists.proxmox.com
>> https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-user
>>
>> --
> Sent from Gmail Mobile
> _______________________________________________
> pve-user mailing list
> pve-user@lists.proxmox.com
> https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-user
>