* [PVE-User] Backup of one VM always fails
@ 2020-12-03 21:16 Frank Thommen
2020-12-03 22:10 ` Gerald Brandt
[not found] ` <mailman.131.1607062291.440.pve-user@lists.proxmox.com>
0 siblings, 2 replies; 11+ messages in thread
From: Frank Thommen @ 2020-12-03 21:16 UTC (permalink / raw)
To: PVE User List
Dear all,
on our PVE cluster, the backup of a specific VM always fails (which
makes us worry, as it is our GitLab instance). The general backup plan
is "back up all VMs at 00:30". In the confirmation email we see, that
the backup of this specific VM takes six to seven hours and then fails.
The error message in the overview table used to be:
vma_queue_write: write error - Broken pipe
With detailed log
---------------------
123: 2020-12-01 02:53:08 INFO: Starting Backup of VM 123 (qemu)
123: 2020-12-01 02:53:08 INFO: status = running
123: 2020-12-01 02:53:09 INFO: update VM 123: -lock backup
123: 2020-12-01 02:53:09 INFO: VM Name: odcf-vm123
123: 2020-12-01 02:53:09 INFO: include disk 'virtio0'
'ceph-rbd:vm-123-disk-0' 20G
123: 2020-12-01 02:53:09 INFO: include disk 'virtio1'
'ceph-rbd:vm-123-disk-2' 1000G
123: 2020-12-01 02:53:09 INFO: include disk 'virtio2'
'ceph-rbd:vm-123-disk-3' 2T
123: 2020-12-01 02:53:09 INFO: backup mode: snapshot
123: 2020-12-01 02:53:09 INFO: ionice priority: 7
123: 2020-12-01 02:53:09 INFO: creating archive
'/mnt/pve/cephfs/dump/vzdump-qemu-123-2020_12_01-02_53_08.vma.lzo'
123: 2020-12-01 02:53:09 INFO: started backup task
'a38ff50a-f474-4b0a-a052-01a835d5c5c7'
123: 2020-12-01 02:53:12 INFO: status: 0% (167772160/3294239916032),
sparse 0% (31563776), duration 3, read/write 55/45 MB/s
[... etc. etc. ...]
123: 2020-12-01 09:42:14 INFO: status: 35%
(1170252365824/3294239916032), sparse 0% (26845003776), duration 24545,
read/write 59/56 MB/s
123: 2020-12-01 09:42:14 ERROR: vma_queue_write: write error - Broken pipe
123: 2020-12-01 09:42:14 INFO: aborting backup job
123: 2020-12-01 09:42:15 ERROR: Backup of VM 123 failed -
vma_queue_write: write error - Broken pipe
---------------------
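As a quick sanity check on those numbers (plain shell arithmetic on the byte total and the MB/s figures vzdump printed, nothing PVE-specific):

```shell
# Total bytes and write rates are taken verbatim from the status lines
# in the log above (vzdump reports decimal MB/s).
total_bytes=3294239916032
for rate_mb in 45 56; do
  secs=$((total_bytes / (rate_mb * 1000 * 1000)))
  echo "full pass at ${rate_mb} MB/s: ~$((secs / 3600)) h"
done
```

So even at the best write rate we ever see, a complete pass over the ~3 TiB of virtual disks would need roughly 16-20 hours; the job dies at ~35% long before it could possibly finish.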
Since the recent upgrade to the newest PVE release, it's
VM 123 qmp command 'query-backup' failed - got timeout
with log
---------------------
123: 2020-12-03 03:29:00 INFO: Starting Backup of VM 123 (qemu)
123: 2020-12-03 03:29:00 INFO: status = running
123: 2020-12-03 03:29:00 INFO: VM Name: odcf-vm123
123: 2020-12-03 03:29:00 INFO: include disk 'virtio0'
'ceph-rbd:vm-123-disk-0' 20G
123: 2020-12-03 03:29:00 INFO: include disk 'virtio1'
'ceph-rbd:vm-123-disk-2' 1000G
123: 2020-12-03 03:29:00 INFO: include disk 'virtio2'
'ceph-rbd:vm-123-disk-3' 2T
123: 2020-12-03 03:29:01 INFO: backup mode: snapshot
123: 2020-12-03 03:29:01 INFO: ionice priority: 7
123: 2020-12-03 03:29:01 INFO: creating vzdump archive
'/mnt/pve/cephfs/dump/vzdump-qemu-123-2020_12_03-03_29_00.vma.lzo'
123: 2020-12-03 03:29:01 INFO: started backup task
'cc7cde4e-20e8-4e26-a89a-f6f1aa9e9612'
123: 2020-12-03 03:29:01 INFO: resuming VM again
123: 2020-12-03 03:29:04 INFO: 0% (284.0 MiB of 3.0 TiB) in 3s, read:
94.7 MiB/s, write: 51.7 MiB/s
[... etc. etc. ...]
123: 2020-12-03 09:05:08 INFO: 36% (1.1 TiB of 3.0 TiB) in 5h 36m 7s,
read: 57.3 MiB/s, write: 53.6 MiB/s
123: 2020-12-03 09:22:57 ERROR: VM 123 qmp command 'query-backup' failed
- got timeout
123: 2020-12-03 09:22:57 INFO: aborting backup job
123: 2020-12-03 09:32:57 ERROR: VM 123 qmp command 'backup-cancel'
failed - unable to connect to VM 123 qmp socket - timeout after 5981 retries
123: 2020-12-03 09:32:57 ERROR: Backup of VM 123 failed - VM 123 qmp
command 'query-backup' failed - got timeout
---------------------
The VM has some quite big vdisks (20G, 1T and 2T). All stored in Ceph.
There is still plenty of space in Ceph.
Can anyone give us some hint on how to investigate and debug this further?
Thanks in advance
Frank
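P.S. The only idea we have so far is to run the job by hand and watch the failure live instead of reading it in the 00:30 summary mail. A minimal sketch (VM id and storage name taken from the log above; the guard only makes the snippet inert on a non-PVE machine):

```shell
# Run the failing backup interactively; vzdump prints the same status
# lines as the scheduled job, but we can watch the moment it breaks.
if command -v vzdump >/dev/null 2>&1; then
  vzdump 123 --mode snapshot --storage cephfs
else
  echo "vzdump not available on this machine"
fi
```

While that runs, `journalctl -f` in a second shell should show whatever pvedaemon/QEMU messages the summary mail swallows.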
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PVE-User] Backup of one VM always fails
2020-12-03 21:16 [PVE-User] Backup of one VM always fails Frank Thommen
@ 2020-12-03 22:10 ` Gerald Brandt
2020-12-04 8:26 ` Frank Thommen
[not found] ` <mailman.131.1607062291.440.pve-user@lists.proxmox.com>
1 sibling, 1 reply; 11+ messages in thread
From: Gerald Brandt @ 2020-12-03 22:10 UTC (permalink / raw)
To: pve-user
Sounds like a disk space issue. You're running out of it during the
backup process.
Gerald
On 2020-12-03 3:16 p.m., Frank Thommen wrote:
>
> Dear all,
>
> on our PVE cluster, the backup of a specific VM always fails (which
> makes us worry, as it is our GitLab instance). The general backup
> plan is "back up all VMs at 00:30". In the confirmation email we see,
> that the backup of this specific VM takes six to seven hours and then
> fails. The error message in the overview table used to be:
>
> vma_queue_write: write error - Broken pipe
>
> With detailed log
> ---------------------
> 123: 2020-12-01 02:53:08 INFO: Starting Backup of VM 123 (qemu)
> 123: 2020-12-01 02:53:08 INFO: status = running
> 123: 2020-12-01 02:53:09 INFO: update VM 123: -lock backup
> 123: 2020-12-01 02:53:09 INFO: VM Name: odcf-vm123
> 123: 2020-12-01 02:53:09 INFO: include disk 'virtio0'
> 'ceph-rbd:vm-123-disk-0' 20G
> 123: 2020-12-01 02:53:09 INFO: include disk 'virtio1'
> 'ceph-rbd:vm-123-disk-2' 1000G
> 123: 2020-12-01 02:53:09 INFO: include disk 'virtio2'
> 'ceph-rbd:vm-123-disk-3' 2T
> 123: 2020-12-01 02:53:09 INFO: backup mode: snapshot
> 123: 2020-12-01 02:53:09 INFO: ionice priority: 7
> 123: 2020-12-01 02:53:09 INFO: creating archive
> '/mnt/pve/cephfs/dump/vzdump-qemu-123-2020_12_01-02_53_08.vma.lzo'
> 123: 2020-12-01 02:53:09 INFO: started backup task
> 'a38ff50a-f474-4b0a-a052-01a835d5c5c7'
> 123: 2020-12-01 02:53:12 INFO: status: 0% (167772160/3294239916032),
> sparse 0% (31563776), duration 3, read/write 55/45 MB/s
> [... etc. etc. ...]
> 123: 2020-12-01 09:42:14 INFO: status: 35%
> (1170252365824/3294239916032), sparse 0% (26845003776), duration
> 24545, read/write 59/56 MB/s
> 123: 2020-12-01 09:42:14 ERROR: vma_queue_write: write error - Broken
> pipe
> 123: 2020-12-01 09:42:14 INFO: aborting backup job
> 123: 2020-12-01 09:42:15 ERROR: Backup of VM 123 failed -
> vma_queue_write: write error - Broken pipe
> ---------------------
>
> Since the recent upgrade to the newest PVE release, it's
>
> VM 123 qmp command 'query-backup' failed - got timeout
>
> with log
> ---------------------
> 123: 2020-12-03 03:29:00 INFO: Starting Backup of VM 123 (qemu)
> 123: 2020-12-03 03:29:00 INFO: status = running
> 123: 2020-12-03 03:29:00 INFO: VM Name: odcf-vm123
> 123: 2020-12-03 03:29:00 INFO: include disk 'virtio0'
> 'ceph-rbd:vm-123-disk-0' 20G
> 123: 2020-12-03 03:29:00 INFO: include disk 'virtio1'
> 'ceph-rbd:vm-123-disk-2' 1000G
> 123: 2020-12-03 03:29:00 INFO: include disk 'virtio2'
> 'ceph-rbd:vm-123-disk-3' 2T
> 123: 2020-12-03 03:29:01 INFO: backup mode: snapshot
> 123: 2020-12-03 03:29:01 INFO: ionice priority: 7
> 123: 2020-12-03 03:29:01 INFO: creating vzdump archive
> '/mnt/pve/cephfs/dump/vzdump-qemu-123-2020_12_03-03_29_00.vma.lzo'
> 123: 2020-12-03 03:29:01 INFO: started backup task
> 'cc7cde4e-20e8-4e26-a89a-f6f1aa9e9612'
> 123: 2020-12-03 03:29:01 INFO: resuming VM again
> 123: 2020-12-03 03:29:04 INFO: 0% (284.0 MiB of 3.0 TiB) in 3s,
> read: 94.7 MiB/s, write: 51.7 MiB/s
> [... etc. etc. ...]
> 123: 2020-12-03 09:05:08 INFO: 36% (1.1 TiB of 3.0 TiB) in 5h 36m
> 7s, read: 57.3 MiB/s, write: 53.6 MiB/s
> 123: 2020-12-03 09:22:57 ERROR: VM 123 qmp command 'query-backup'
> failed - got timeout
> 123: 2020-12-03 09:22:57 INFO: aborting backup job
> 123: 2020-12-03 09:32:57 ERROR: VM 123 qmp command 'backup-cancel'
> failed - unable to connect to VM 123 qmp socket - timeout after 5981
> retries
> 123: 2020-12-03 09:32:57 ERROR: Backup of VM 123 failed - VM 123 qmp
> command 'query-backup' failed - got timeout
> ---------------------
>
> The VM has some quite big vdisks (20G, 1T and 2T). All stored in
> Ceph. There is still plenty of space in Ceph.
>
> Can anyone give us some hint on how to investigate and debug this
> further?
>
> Thanks in advance
> Frank
>
> _______________________________________________
> pve-user mailing list
> pve-user@lists.proxmox.com
> https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-user
>
>
>
* Re: [PVE-User] Backup of one VM always fails
2020-12-03 22:10 ` Gerald Brandt
@ 2020-12-04 8:26 ` Frank Thommen
0 siblings, 0 replies; 11+ messages in thread
From: Frank Thommen @ 2020-12-04 8:26 UTC (permalink / raw)
To: Proxmox VE user list
I doubt that. We are backing up to Ceph (cephfs) and there are still 8
TB free ;-)
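For the record, this is how the free space can be double-checked on a node (the mount point is the standard CephFS one from our storage.cfg; the fallback to / only keeps the snippet runnable on machines without that mount):

```shell
# Show free space on the backup target filesystem.
mnt=/mnt/pve/cephfs        # CephFS mount point from storage.cfg
[ -d "$mnt" ] || mnt=/     # fallback so the snippet runs anywhere
df -h "$mnt"
```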
On 03/12/2020 23:10, Gerald Brandt wrote:
> Sounds like a disk space issue. You're running out of it during the
> backup process.
>
>
> Gerald
>
> On 2020-12-03 3:16 p.m., Frank Thommen wrote:
>>
>> Dear all,
>>
>> on our PVE cluster, the backup of a specific VM always fails (which
>> makes us worry, as it is our GitLab instance). The general backup
>> plan is "back up all VMs at 00:30". In the confirmation email we see
>> that the backup of this specific VM takes six to seven hours and then
>> fails. The error message in the overview table used to be:
>>
>> vma_queue_write: write error - Broken pipe
>>
>> With detailed log
>> ---------------------
>> 123: 2020-12-01 02:53:08 INFO: Starting Backup of VM 123 (qemu)
>> 123: 2020-12-01 02:53:08 INFO: status = running
>> 123: 2020-12-01 02:53:09 INFO: update VM 123: -lock backup
>> 123: 2020-12-01 02:53:09 INFO: VM Name: odcf-vm123
>> 123: 2020-12-01 02:53:09 INFO: include disk 'virtio0'
>> 'ceph-rbd:vm-123-disk-0' 20G
>> 123: 2020-12-01 02:53:09 INFO: include disk 'virtio1'
>> 'ceph-rbd:vm-123-disk-2' 1000G
>> 123: 2020-12-01 02:53:09 INFO: include disk 'virtio2'
>> 'ceph-rbd:vm-123-disk-3' 2T
>> 123: 2020-12-01 02:53:09 INFO: backup mode: snapshot
>> 123: 2020-12-01 02:53:09 INFO: ionice priority: 7
>> 123: 2020-12-01 02:53:09 INFO: creating archive
>> '/mnt/pve/cephfs/dump/vzdump-qemu-123-2020_12_01-02_53_08.vma.lzo'
>> 123: 2020-12-01 02:53:09 INFO: started backup task
>> 'a38ff50a-f474-4b0a-a052-01a835d5c5c7'
>> 123: 2020-12-01 02:53:12 INFO: status: 0% (167772160/3294239916032),
>> sparse 0% (31563776), duration 3, read/write 55/45 MB/s
>> [... etc. etc. ...]
>> 123: 2020-12-01 09:42:14 INFO: status: 35%
>> (1170252365824/3294239916032), sparse 0% (26845003776), duration
>> 24545, read/write 59/56 MB/s
>> 123: 2020-12-01 09:42:14 ERROR: vma_queue_write: write error - Broken
>> pipe
>> 123: 2020-12-01 09:42:14 INFO: aborting backup job
>> 123: 2020-12-01 09:42:15 ERROR: Backup of VM 123 failed -
>> vma_queue_write: write error - Broken pipe
>> ---------------------
>>
>> Since the recent upgrade to the newest PVE release, it's
>>
>> VM 123 qmp command 'query-backup' failed - got timeout
>>
>> with log
>> ---------------------
>> 123: 2020-12-03 03:29:00 INFO: Starting Backup of VM 123 (qemu)
>> 123: 2020-12-03 03:29:00 INFO: status = running
>> 123: 2020-12-03 03:29:00 INFO: VM Name: odcf-vm123
>> 123: 2020-12-03 03:29:00 INFO: include disk 'virtio0'
>> 'ceph-rbd:vm-123-disk-0' 20G
>> 123: 2020-12-03 03:29:00 INFO: include disk 'virtio1'
>> 'ceph-rbd:vm-123-disk-2' 1000G
>> 123: 2020-12-03 03:29:00 INFO: include disk 'virtio2'
>> 'ceph-rbd:vm-123-disk-3' 2T
>> 123: 2020-12-03 03:29:01 INFO: backup mode: snapshot
>> 123: 2020-12-03 03:29:01 INFO: ionice priority: 7
>> 123: 2020-12-03 03:29:01 INFO: creating vzdump archive
>> '/mnt/pve/cephfs/dump/vzdump-qemu-123-2020_12_03-03_29_00.vma.lzo'
>> 123: 2020-12-03 03:29:01 INFO: started backup task
>> 'cc7cde4e-20e8-4e26-a89a-f6f1aa9e9612'
>> 123: 2020-12-03 03:29:01 INFO: resuming VM again
>> 123: 2020-12-03 03:29:04 INFO: 0% (284.0 MiB of 3.0 TiB) in 3s,
>> read: 94.7 MiB/s, write: 51.7 MiB/s
>> [... etc. etc. ...]
>> 123: 2020-12-03 09:05:08 INFO: 36% (1.1 TiB of 3.0 TiB) in 5h 36m
>> 7s, read: 57.3 MiB/s, write: 53.6 MiB/s
>> 123: 2020-12-03 09:22:57 ERROR: VM 123 qmp command 'query-backup'
>> failed - got timeout
>> 123: 2020-12-03 09:22:57 INFO: aborting backup job
>> 123: 2020-12-03 09:32:57 ERROR: VM 123 qmp command 'backup-cancel'
>> failed - unable to connect to VM 123 qmp socket - timeout after 5981
>> retries
>> 123: 2020-12-03 09:32:57 ERROR: Backup of VM 123 failed - VM 123 qmp
>> command 'query-backup' failed - got timeout
>> ---------------------
>>
>> The VM has some quite big vdisks (20G, 1T and 2T). All stored in
>> Ceph. There is still plenty of space in Ceph.
>>
>> Can anyone give us some hint on how to investigate and debug this
>> further?
>>
>> Thanks in advance
>> Frank
>>
>
* Re: [PVE-User] Backup of one VM always fails
[not found] ` <mailman.131.1607062291.440.pve-user@lists.proxmox.com>
@ 2020-12-04 8:30 ` Frank Thommen
2020-12-04 10:22 ` Frank Thommen
0 siblings, 1 reply; 11+ messages in thread
From: Frank Thommen @ 2020-12-04 8:30 UTC (permalink / raw)
To: pve-user
> On Thursday, December 3, 2020 10:16 PM, Frank Thommen <f.thommen@dkfz-heidelberg.de> wrote:
>
>>
>>
>> Dear all,
>>
>> on our PVE cluster, the backup of a specific VM always fails (which
>> makes us worry, as it is our GitLab instance). The general backup plan
>> is "back up all VMs at 00:30". In the confirmation email we see, that
>> the backup of this specific VM takes six to seven hours and then fails.
>> The error message in the overview table used to be:
>>
>> vma_queue_write: write error - Broken pipe
>>
>> With detailed log
>>
>> ---------------------
>>
>> 123: 2020-12-01 02:53:08 INFO: Starting Backup of VM 123 (qemu)
>> 123: 2020-12-01 02:53:08 INFO: status = running
>> 123: 2020-12-01 02:53:09 INFO: update VM 123: -lock backup
>> 123: 2020-12-01 02:53:09 INFO: VM Name: odcf-vm123
>> 123: 2020-12-01 02:53:09 INFO: include disk 'virtio0'
>> 'ceph-rbd:vm-123-disk-0' 20G
>> 123: 2020-12-01 02:53:09 INFO: include disk 'virtio1'
>> 'ceph-rbd:vm-123-disk-2' 1000G
>> 123: 2020-12-01 02:53:09 INFO: include disk 'virtio2'
>> 'ceph-rbd:vm-123-disk-3' 2T
>> 123: 2020-12-01 02:53:09 INFO: backup mode: snapshot
>> 123: 2020-12-01 02:53:09 INFO: ionice priority: 7
>> 123: 2020-12-01 02:53:09 INFO: creating archive
>> '/mnt/pve/cephfs/dump/vzdump-qemu-123-2020_12_01-02_53_08.vma.lzo'
>> 123: 2020-12-01 02:53:09 INFO: started backup task
>> 'a38ff50a-f474-4b0a-a052-01a835d5c5c7'
>> 123: 2020-12-01 02:53:12 INFO: status: 0% (167772160/3294239916032),
>> sparse 0% (31563776), duration 3, read/write 55/45 MB/s
>> [... etc. etc. ...]
>> 123: 2020-12-01 09:42:14 INFO: status: 35%
>> (1170252365824/3294239916032), sparse 0% (26845003776), duration 24545,
>> read/write 59/56 MB/s
>> 123: 2020-12-01 09:42:14 ERROR: vma_queue_write: write error - Broken pipe
>> 123: 2020-12-01 09:42:14 INFO: aborting backup job
>> 123: 2020-12-01 09:42:15 ERROR: Backup of VM 123 failed -
>> vma_queue_write: write error - Broken pipe
>>
>> ---------------------
>>
>> Since the recent upgrade to the newest PVE release, it's
>>
>> VM 123 qmp command 'query-backup' failed - got timeout
>>
>> with log
>>
>> ---------------------
>>
>> 123: 2020-12-03 03:29:00 INFO: Starting Backup of VM 123 (qemu)
>> 123: 2020-12-03 03:29:00 INFO: status = running
>> 123: 2020-12-03 03:29:00 INFO: VM Name: odcf-vm123
>> 123: 2020-12-03 03:29:00 INFO: include disk 'virtio0'
>> 'ceph-rbd:vm-123-disk-0' 20G
>> 123: 2020-12-03 03:29:00 INFO: include disk 'virtio1'
>> 'ceph-rbd:vm-123-disk-2' 1000G
>> 123: 2020-12-03 03:29:00 INFO: include disk 'virtio2'
>> 'ceph-rbd:vm-123-disk-3' 2T
>> 123: 2020-12-03 03:29:01 INFO: backup mode: snapshot
>> 123: 2020-12-03 03:29:01 INFO: ionice priority: 7
>> 123: 2020-12-03 03:29:01 INFO: creating vzdump archive
>> '/mnt/pve/cephfs/dump/vzdump-qemu-123-2020_12_03-03_29_00.vma.lzo'
>> 123: 2020-12-03 03:29:01 INFO: started backup task
>> 'cc7cde4e-20e8-4e26-a89a-f6f1aa9e9612'
>> 123: 2020-12-03 03:29:01 INFO: resuming VM again
>> 123: 2020-12-03 03:29:04 INFO: 0% (284.0 MiB of 3.0 TiB) in 3s, read:
>> 94.7 MiB/s, write: 51.7 MiB/s
>> [... etc. etc. ...]
>> 123: 2020-12-03 09:05:08 INFO: 36% (1.1 TiB of 3.0 TiB) in 5h 36m 7s,
>> read: 57.3 MiB/s, write: 53.6 MiB/s
>> 123: 2020-12-03 09:22:57 ERROR: VM 123 qmp command 'query-backup' failed
>> - got timeout
>> 123: 2020-12-03 09:22:57 INFO: aborting backup job
>> 123: 2020-12-03 09:32:57 ERROR: VM 123 qmp command 'backup-cancel'
>> failed - unable to connect to VM 123 qmp socket - timeout after 5981 retries
>> 123: 2020-12-03 09:32:57 ERROR: Backup of VM 123 failed - VM 123 qmp
>> command 'query-backup' failed - got timeout
>>
>>
>> The VM has some quite big vdisks (20G, 1T and 2T). All stored in Ceph.
>> There is still plenty of space in Ceph.
>>
>> Can anyone give us some hint on how to investigate and debug this further?
>
> Because it is a write error, maybe we should look at the backup destination.
> Maybe it is a network connection issue? Maybe something wrong with the host? Maybe the disk is full?
> Which storage are you using for backup? Can you show us the corresponding entry in /etc/pve/storage.cfg?
We are backing up to cephfs with still 8 TB or so free.
/etc/pve/storage.cfg is
------------
dir: local
path /var/lib/vz
content vztmpl,backup,iso
dir: data
path /data
content snippets,images,backup,iso,rootdir,vztmpl
cephfs: cephfs
path /mnt/pve/cephfs
content backup,vztmpl,iso
maxfiles 5
rbd: ceph-rbd
content images,rootdir
krbd 0
pool pve-pool1
------------
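One idea we have not tried yet (mentioned only as a possibility; the value below is purely illustrative): vzdump can be throttled globally in /etc/vzdump.conf, which is sometimes suggested when long snapshot backups overrun a slow target:

```
# /etc/vzdump.conf -- global defaults for vzdump
bwlimit: 50000    # KiB/s cap on backup read speed (illustrative value)
ionice: 7
```

A lower limit makes the backup even longer, of course, but can keep the write side from stalling.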
Frank
>
> best regards, Arjen
>
* Re: [PVE-User] Backup of one VM always fails
2020-12-04 8:30 ` Frank Thommen
@ 2020-12-04 10:22 ` Frank Thommen
2020-12-04 10:26 ` Fabrizio Cuseo
[not found] ` <mailman.2.1607078234.376.pve-user@lists.proxmox.com>
0 siblings, 2 replies; 11+ messages in thread
From: Frank Thommen @ 2020-12-04 10:22 UTC (permalink / raw)
To: pve-user
On 04/12/2020 09:30, Frank Thommen wrote:
>> On Thursday, December 3, 2020 10:16 PM, Frank Thommen
>> <f.thommen@dkfz-heidelberg.de> wrote:
>>
>>>
>>>
>>> Dear all,
>>>
>>> on our PVE cluster, the backup of a specific VM always fails (which
>>> makes us worry, as it is our GitLab instance). The general backup plan
>>> is "back up all VMs at 00:30". In the confirmation email we see, that
>>> the backup of this specific VM takes six to seven hours and then fails.
>>> The error message in the overview table used to be:
>>>
>>> vma_queue_write: write error - Broken pipe
>>>
>>> With detailed log
>>>
>>> ---------------------
>>>
>>>
>>> 123: 2020-12-01 02:53:08 INFO: Starting Backup of VM 123 (qemu)
>>> 123: 2020-12-01 02:53:08 INFO: status = running
>>> 123: 2020-12-01 02:53:09 INFO: update VM 123: -lock backup
>>> 123: 2020-12-01 02:53:09 INFO: VM Name: odcf-vm123
>>> 123: 2020-12-01 02:53:09 INFO: include disk 'virtio0'
>>> 'ceph-rbd:vm-123-disk-0' 20G
>>> 123: 2020-12-01 02:53:09 INFO: include disk 'virtio1'
>>> 'ceph-rbd:vm-123-disk-2' 1000G
>>> 123: 2020-12-01 02:53:09 INFO: include disk 'virtio2'
>>> 'ceph-rbd:vm-123-disk-3' 2T
>>> 123: 2020-12-01 02:53:09 INFO: backup mode: snapshot
>>> 123: 2020-12-01 02:53:09 INFO: ionice priority: 7
>>> 123: 2020-12-01 02:53:09 INFO: creating archive
>>> '/mnt/pve/cephfs/dump/vzdump-qemu-123-2020_12_01-02_53_08.vma.lzo'
>>> 123: 2020-12-01 02:53:09 INFO: started backup task
>>> 'a38ff50a-f474-4b0a-a052-01a835d5c5c7'
>>> 123: 2020-12-01 02:53:12 INFO: status: 0% (167772160/3294239916032),
>>> sparse 0% (31563776), duration 3, read/write 55/45 MB/s
>>> [... etc. etc. ...]
>>> 123: 2020-12-01 09:42:14 INFO: status: 35%
>>> (1170252365824/3294239916032), sparse 0% (26845003776), duration 24545,
>>> read/write 59/56 MB/s
>>> 123: 2020-12-01 09:42:14 ERROR: vma_queue_write: write error - Broken
>>> pipe
>>> 123: 2020-12-01 09:42:14 INFO: aborting backup job
>>> 123: 2020-12-01 09:42:15 ERROR: Backup of VM 123 failed -
>>> vma_queue_write: write error - Broken pipe
>>>
>>> ---------------------
>>>
>>>
>>> Since the recent upgrade to the newest PVE release, it's
>>>
>>> VM 123 qmp command 'query-backup' failed - got timeout
>>>
>>> with log
>>>
>>> ---------------------
>>>
>>>
>>> 123: 2020-12-03 03:29:00 INFO: Starting Backup of VM 123 (qemu)
>>> 123: 2020-12-03 03:29:00 INFO: status = running
>>> 123: 2020-12-03 03:29:00 INFO: VM Name: odcf-vm123
>>> 123: 2020-12-03 03:29:00 INFO: include disk 'virtio0'
>>> 'ceph-rbd:vm-123-disk-0' 20G
>>> 123: 2020-12-03 03:29:00 INFO: include disk 'virtio1'
>>> 'ceph-rbd:vm-123-disk-2' 1000G
>>> 123: 2020-12-03 03:29:00 INFO: include disk 'virtio2'
>>> 'ceph-rbd:vm-123-disk-3' 2T
>>> 123: 2020-12-03 03:29:01 INFO: backup mode: snapshot
>>> 123: 2020-12-03 03:29:01 INFO: ionice priority: 7
>>> 123: 2020-12-03 03:29:01 INFO: creating vzdump archive
>>> '/mnt/pve/cephfs/dump/vzdump-qemu-123-2020_12_03-03_29_00.vma.lzo'
>>> 123: 2020-12-03 03:29:01 INFO: started backup task
>>> 'cc7cde4e-20e8-4e26-a89a-f6f1aa9e9612'
>>> 123: 2020-12-03 03:29:01 INFO: resuming VM again
>>> 123: 2020-12-03 03:29:04 INFO: 0% (284.0 MiB of 3.0 TiB) in 3s, read:
>>> 94.7 MiB/s, write: 51.7 MiB/s
>>> [... etc. etc. ...]
>>> 123: 2020-12-03 09:05:08 INFO: 36% (1.1 TiB of 3.0 TiB) in 5h 36m 7s,
>>> read: 57.3 MiB/s, write: 53.6 MiB/s
>>> 123: 2020-12-03 09:22:57 ERROR: VM 123 qmp command 'query-backup' failed
>>> - got timeout
>>> 123: 2020-12-03 09:22:57 INFO: aborting backup job
>>> 123: 2020-12-03 09:32:57 ERROR: VM 123 qmp command 'backup-cancel'
>>> failed - unable to connect to VM 123 qmp socket - timeout after
>>> 5981 retries
>>> 123: 2020-12-03 09:32:57 ERROR: Backup of VM 123 failed - VM 123 qmp
>>> command 'query-backup' failed - got timeout
>>>
>>>
>>> The VM has some quite big vdisks (20G, 1T and 2T). All stored in Ceph.
>>> There is still plenty of space in Ceph.
>>>
>>> Can anyone give us some hint on how to investigate and debug this
>>> further?
>>
>> Because it is a write error, maybe we should look at the backup
>> destination.
>> Maybe it is a network connection issue? Maybe something wrong with the
>> host? Maybe the disk is full?
>> Which storage are you using for backup? Can you show us the
>> corresponding entry in /etc/pve/storage.cfg?
>
>
> We are backing up to cephfs with still 8 TB or so free.
>
> /etc/pve/storage.cfg is
> ------------
> dir: local
> path /var/lib/vz
> content vztmpl,backup,iso
>
> dir: data
> path /data
> content snippets,images,backup,iso,rootdir,vztmpl
>
> cephfs: cephfs
> path /mnt/pve/cephfs
> content backup,vztmpl,iso
> maxfiles 5
>
> rbd: ceph-rbd
> content images,rootdir
> krbd 0
> pool pve-pool1
> ------------
>
The problem has reached a new level of urgency: for the past two days,
each time the backup fails, the VM becomes inaccessible and has to be
stopped and started manually from the PVE UI.
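(For reference, the manual recovery we currently do in the web UI corresponds to the following on the node, with VM id 123 as elsewhere in this thread; the guard just makes the snippet inert where the PVE tools are absent:)

```shell
# Stop and start the hung VM from the CLI, same effect as the UI buttons.
if command -v qm >/dev/null 2>&1; then
  qm stop 123 && qm start 123
else
  echo "qm not available on this machine"
fi
```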
Frank
* Re: [PVE-User] Backup of one VM always fails
2020-12-04 10:22 ` Frank Thommen
@ 2020-12-04 10:26 ` Fabrizio Cuseo
[not found] ` <mailman.2.1607078234.376.pve-user@lists.proxmox.com>
1 sibling, 0 replies; 11+ messages in thread
From: Fabrizio Cuseo @ 2020-12-04 10:26 UTC (permalink / raw)
To: pve-user
Same problem for me, and no space problems. It appeared after an update
about a month ago, and it is always the same VMs.
>>>> how to investigate and debug this
>>>> further?
>>>
>>> Because it is a write error, maybe we should look at the backup
>>> destination.
>>> Maybe it is a network connection issue? Maybe something wrong with the
>>> host? Maybe the disk is full?
>>> Which storage are you using for backup? Can you show us the
>>> corresponding entry in /etc/pve/storage.cfg?
>>
>>
>> We are backing up to cephfs with still 8 TB or so free.
>>
>> /etc/pve/storage.cfg is
>> ------------
>> dir: local
>> path /var/lib/vz
>> content vztmpl,backup,iso
>>
>> dir: data
>> path /data
>> content snippets,images,backup,iso,rootdir,vztmpl
>>
>> cephfs: cephfs
>> path /mnt/pve/cephfs
>> content backup,vztmpl,iso
>> maxfiles 5
>>
>> rbd: ceph-rbd
>> content images,rootdir
>> krbd 0
>> pool pve-pool1
>> ------------
>>
>
> The problem has reached a new level of urgency: for the past two days,
> each time the backup fails, the VM becomes inaccessible and has to be
> stopped and started manually from the PVE UI.
>
> Frank
--
---
Fabrizio Cuseo - mailto:f.cuseo@panservice.it
General Management - Panservice InterNetWorking
Professional services for Internet and Networking
Panservice is an AIIP member - RIPE Local Registry
Phone: +39 0773 410020 - Fax: +39 0773 470219
http://www.panservice.it mailto:info@panservice.it
National toll-free number: 800 901492
* Re: [PVE-User] Backup of one VM always fails
[not found] ` <mailman.2.1607078234.376.pve-user@lists.proxmox.com>
@ 2020-12-04 11:09 ` Frank Thommen
2020-12-04 14:00 ` Yannis Milios
0 siblings, 1 reply; 11+ messages in thread
From: Frank Thommen @ 2020-12-04 11:09 UTC (permalink / raw)
To: pve-user
On 04/12/2020 11:36, Arjen via pve-user wrote:
> On Fri, 2020-12-04 at 11:22 +0100, Frank Thommen wrote:
>>
>> On 04/12/2020 09:30, Frank Thommen wrote:
>>>> On Thursday, December 3, 2020 10:16 PM, Frank Thommen
>>>> <f.thommen@dkfz-heidelberg.de> wrote:
>>>>
>>>>>
>>>>> Dear all,
>>>>>
>>>>> on our PVE cluster, the backup of a specific VM always fails
>>>>> (which
>>>>> makes us worry, as it is our GitLab instance). The general
>>>>> backup plan
>>>>> is "back up all VMs at 00:30". In the confirmation email we
>>>>> see that
>>>>> the backup of this specific VM takes six to seven hours and
>>>>> then fails.
>>>>> The error message in the overview table used to be:
>>>>>
>>>>> vma_queue_write: write error - Broken pipe
>>>>>
>>>>> With detailed log
>>>>>
>>>>> ---------------------
>>>>>
>>>>>
>>>>> 123: 2020-12-01 02:53:08 INFO: Starting Backup of VM 123 (qemu)
>>>>> 123: 2020-12-01 02:53:08 INFO: status = running
>>>>> 123: 2020-12-01 02:53:09 INFO: update VM 123: -lock backup
>>>>> 123: 2020-12-01 02:53:09 INFO: VM Name: odcf-vm123
>>>>> 123: 2020-12-01 02:53:09 INFO: include disk 'virtio0'
>>>>> 'ceph-rbd:vm-123-disk-0' 20G
>>>>> 123: 2020-12-01 02:53:09 INFO: include disk 'virtio1'
>>>>> 'ceph-rbd:vm-123-disk-2' 1000G
>>>>> 123: 2020-12-01 02:53:09 INFO: include disk 'virtio2'
>>>>> 'ceph-rbd:vm-123-disk-3' 2T
>>>>> 123: 2020-12-01 02:53:09 INFO: backup mode: snapshot
>>>>> 123: 2020-12-01 02:53:09 INFO: ionice priority: 7
>>>>> 123: 2020-12-01 02:53:09 INFO: creating archive
>>>>> '/mnt/pve/cephfs/dump/vzdump-qemu-123-2020_12_01-
>>>>> 02_53_08.vma.lzo'
>>>>> 123: 2020-12-01 02:53:09 INFO: started backup task
>>>>> 'a38ff50a-f474-4b0a-a052-01a835d5c5c7'
>>>>> 123: 2020-12-01 02:53:12 INFO: status: 0%
>>>>> (167772160/3294239916032),
>>>>> sparse 0% (31563776), duration 3, read/write 55/45 MB/s
>>>>> [... etc. etc. ...]
>>>>> 123: 2020-12-01 09:42:14 INFO: status: 35%
>>>>> (1170252365824/3294239916032), sparse 0% (26845003776),
>>>>> duration 24545,
>>>>> read/write 59/56 MB/s
>>>>> 123: 2020-12-01 09:42:14 ERROR: vma_queue_write: write error -
>>>>> Broken
>>>>> pipe
>>>>> 123: 2020-12-01 09:42:14 INFO: aborting backup job
>>>>> 123: 2020-12-01 09:42:15 ERROR: Backup of VM 123 failed -
>>>>> vma_queue_write: write error - Broken pipe
>>>>>
>>>>> ---------------------
>>>>>
>>> -----------------------------------------------------------------
>>>
>>>>> Since the recent upgrade to the newest PVE release, the error is now
>>>>>
>>>>> VM 123 qmp command 'query-backup' failed - got timeout
>>>>>
>>>>> with log
>>>>>
>>>>> -------------------------------------------------------------
>>>>> -------------------------------------------------------------
>>>>>
>>>>>
>>>>> 123: 2020-12-03 03:29:00 INFO: Starting Backup of VM 123 (qemu)
>>>>> 123: 2020-12-03 03:29:00 INFO: status = running
>>>>> 123: 2020-12-03 03:29:00 INFO: VM Name: odcf-vm123
>>>>> 123: 2020-12-03 03:29:00 INFO: include disk 'virtio0' 'ceph-rbd:vm-123-disk-0' 20G
>>>>> 123: 2020-12-03 03:29:00 INFO: include disk 'virtio1' 'ceph-rbd:vm-123-disk-2' 1000G
>>>>> 123: 2020-12-03 03:29:00 INFO: include disk 'virtio2' 'ceph-rbd:vm-123-disk-3' 2T
>>>>> 123: 2020-12-03 03:29:01 INFO: backup mode: snapshot
>>>>> 123: 2020-12-03 03:29:01 INFO: ionice priority: 7
>>>>> 123: 2020-12-03 03:29:01 INFO: creating vzdump archive '/mnt/pve/cephfs/dump/vzdump-qemu-123-2020_12_03-03_29_00.vma.lzo'
>>>>> 123: 2020-12-03 03:29:01 INFO: started backup task 'cc7cde4e-20e8-4e26-a89a-f6f1aa9e9612'
>>>>> 123: 2020-12-03 03:29:01 INFO: resuming VM again
>>>>> 123: 2020-12-03 03:29:04 INFO: 0% (284.0 MiB of 3.0 TiB) in 3s, read: 94.7 MiB/s, write: 51.7 MiB/s
>>>>> [... etc. etc. ...]
>>>>> 123: 2020-12-03 09:05:08 INFO: 36% (1.1 TiB of 3.0 TiB) in 5h 36m 7s, read: 57.3 MiB/s, write: 53.6 MiB/s
>>>>> 123: 2020-12-03 09:22:57 ERROR: VM 123 qmp command 'query-backup' failed - got timeout
>>>>> 123: 2020-12-03 09:22:57 INFO: aborting backup job
>>>>> 123: 2020-12-03 09:32:57 ERROR: VM 123 qmp command 'backup-cancel' failed - unable to connect to VM 123 qmp socket - timeout after 5981 retries
>>>>> 123: 2020-12-03 09:32:57 ERROR: Backup of VM 123 failed - VM 123 qmp command 'query-backup' failed - got timeout
>>>>>
>>>>>
>>>>> The VM has some quite big vdisks (20G, 1T and 2T). All stored
>>>>> in Ceph.
>>>>> There is still plenty of space in Ceph.
>>>>>
>>>>> Can anyone give us some hint on how to investigate and debug
>>>>> this
>>>>> further?
>>>>
>>>> Because it is a write error, maybe we should look at the backup
>>>> destination.
>>>> Maybe it is a network connection issue? Maybe something wrong
>>>> with the
>>>> host? Maybe the disk is full?
>>>> Which storage are you using for backup? Can you show us the
>>>> corresponding entry in /etc/pve/storage.cfg?
>>>
>>> We are backing up to cephfs with still 8 TB or so free.
>>>
>>> /etc/pve/storage.cfg is
>>> ------------
>>> dir: local
>>> path /var/lib/vz
>>> content vztmpl,backup,iso
>>>
>>> dir: data
>>> path /data
>>> content snippets,images,backup,iso,rootdir,vztmpl
>>>
>>> cephfs: cephfs
>>> path /mnt/pve/cephfs
>>> content backup,vztmpl,iso
>>> maxfiles 5
>>>
>>> rbd: ceph-rbd
>>> content images,rootdir
>>> krbd 0
>>> pool pve-pool1
>>> ------------
>>>
>>
>> The problem has reached a new level of urgency: for the past two days,
>> each time a backup fails, the VM becomes inaccessible and has to be
>> stopped and started manually from the PVE UI.
>
> I don't see anything wrong with the configuration that you shared.
> Was anything changed in the last few days since the last successful
> backup? Any updates from Proxmox? Changes to the network?
> I know very little about Ceph and clusters, sorry.
> What makes this VM different, except for the size of the disks?
On December 1st the hypervisor was updated to PVE 6.3-2 (I think
from 6.1-3). After that the error message changed slightly and - in
hindsight - since then the VM becomes inaccessible after the failed
backup.
However: The VM never ever backed up successfully, not even before the
PVE upgrade. It's just that no one really took notice of it.
The VM is not really special. It's our only Debian VM (but I hope
that's not an issue :-), and it was migrated 1:1 from oVirt by
importing the disk images. But we have a few other such
VMs and they run and back up just fine.
No network changes. Basically nothing changed that I could think of.
But to be clear: Our current main problem is the failing backup, not the
crash.
Cheers, Frank
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PVE-User] Backup of one VM always fails
2020-12-04 11:09 ` Frank Thommen
@ 2020-12-04 14:00 ` Yannis Milios
2020-12-04 14:20 ` Frank Thommen
0 siblings, 1 reply; 11+ messages in thread
From: Yannis Milios @ 2020-12-04 14:00 UTC (permalink / raw)
To: Proxmox VE user list
Can you try removing this specific VM from the normal backup schedule and
then create a new test schedule for it, if possible to a different backup
target (nfs, local etc) ?
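For a one-off test, a temporary backup storage can be defined in /etc/pve/storage.cfg. The entry below is only a sketch; the storage name, server address and export path are placeholders, not values from this thread:

```
nfs: nfs-test
        path /mnt/pve/nfs-test
        server 192.0.2.10
        export /export/pve-backup
        content backup
        maxfiles 2
```

A single manual backup of just this VM could then be started from the node's shell with `vzdump 123 --storage nfs-test --mode snapshot --compress lzo`, leaving the nightly job untouched.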
On Fri, 4 Dec 2020 at 11:10, Frank Thommen <f.thommen@dkfz-heidelberg.de>
wrote:
> [... earlier quoted text trimmed ...]
> On December 1st the hypervisor was updated to PVE 6.3-2 (I think
> from 6.1-3). After that the error message changed slightly and - in
> hindsight - since then the VM becomes inaccessible after the failed
> backup.
>
> However: The VM never ever backed up successfully, not even before the
> PVE upgrade. It's just that no one really took notice of it.
>
> The VM is not really special. It's our only Debian VM (but I hope
> that's not an issue :-), and it was migrated 1:1 from oVirt by
> importing the disk images. But we have a few other such
> VMs and they run and back up just fine.
>
> No network changes. Basically nothing changed that I could think of.
>
> But to be clear: Our current main problem is the failing backup, not the
> crash.
>
>
> Cheers, Frank
>
>
>
>
> _______________________________________________
> pve-user mailing list
> pve-user@lists.proxmox.com
> https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-user
>
> --
Sent from Gmail Mobile
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PVE-User] Backup of one VM always fails
2020-12-04 14:00 ` Yannis Milios
@ 2020-12-04 14:20 ` Frank Thommen
2020-12-04 14:39 ` [PVE-User] PBS WAS : " Ronny Aasen
2020-12-16 18:30 ` [PVE-User] " Frank Thommen
0 siblings, 2 replies; 11+ messages in thread
From: Frank Thommen @ 2020-12-04 14:20 UTC (permalink / raw)
To: Proxmox VE user list
Lots of clicking to configure the other 30 VMs, but yes, that's
probably the most appropriate thing to do now :-) I have to arrange for
a free NFS share first, though, as there is no free local disk space for
3+ TB of backup...
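As a possible alternative to reconfiguring each VM in the GUI: the scheduled backup jobs live in /etc/pve/vzdump.cron, and a single VM can be excluded from an "all VMs" job with --exclude. A sketch, assuming the existing 00:30 job targets the cephfs storage (the real job's options may differ):

```
# /etc/pve/vzdump.cron (sketch, not the actual job definition)
30 00 * * *           root vzdump --all 1 --exclude 123 --storage cephfs --mode snapshot --compress lzo --quiet 1
```

The excluded VM could then be backed up separately with its own schedule or a manual `vzdump 123 --storage nfs-test` run.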
On 04/12/2020 15:00, Yannis Milios wrote:
> Can you try removing this specific VM from the normal backup schedule and
> then create a new test schedule for it, if possible to a different backup
> target (nfs, local etc) ?
> [... earlier quoted text trimmed ...]
^ permalink raw reply [flat|nested] 11+ messages in thread
* [PVE-User] PBS WAS : Backup of one VM always fails
2020-12-04 14:20 ` Frank Thommen
@ 2020-12-04 14:39 ` Ronny Aasen
2020-12-16 18:30 ` [PVE-User] " Frank Thommen
1 sibling, 0 replies; 11+ messages in thread
From: Ronny Aasen @ 2020-12-04 14:39 UTC (permalink / raw)
To: pve-user
On 04.12.2020 15:20, Frank Thommen wrote:
> Lots of clicking to configure the other 30 VMs, but yes, that's
> probably the most appropriate thing to do now :-) I have to arrange for
> a free NFS share first, though, as there is no free local disk space for
> 3+ TB of backup...
>
>
Completely unrelated... but since you have several large machines and
multi-hour backup times, you should consider testing Proxmox Backup
Server. It has worked flawlessly in our tests: small and speedy backups
with dirty-block tracking, while also keeping many historical copies
via deduplication, and easy restores (though those do take their time :)
We copy from one Proxmox cluster using RBD-backed VMs to another Proxmox
cluster, using a backup repo on CephFS.
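For reference, a PBS instance is attached to PVE as a storage entry in /etc/pve/storage.cfg; the sketch below uses placeholder values throughout (server, datastore, user and fingerprint would come from your own PBS setup):

```
pbs: pbs-test
        server 192.0.2.20
        datastore vm-backups
        username backup@pbs
        fingerprint <fingerprint-from-pbs-dashboard>
        content backup
```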
kind regards
Ronny
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PVE-User] Backup of one VM always fails
2020-12-04 14:20 ` Frank Thommen
2020-12-04 14:39 ` [PVE-User] PBS WAS : " Ronny Aasen
@ 2020-12-16 18:30 ` Frank Thommen
1 sibling, 0 replies; 11+ messages in thread
From: Frank Thommen @ 2020-12-16 18:30 UTC (permalink / raw)
To: Proxmox VE user list
I finally got an NFS share and ran a backup of said VM to it. It took 19
hours (only a 2 Gbit bond is available to the outside) but otherwise ran
without issues or error messages.
However, I'm not sure what to conclude from this. Backing up to
external NFS shares is not an option for us in the long run. PBS is
also not an option at this time; maybe in the very long run.
I'd really like to use Ceph, and I would like to solve the timeout
issue. Any ideas on how to troubleshoot that?
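One knob that is sometimes suggested when long backups stall or time out is limiting vzdump's bandwidth, so the VM and its QMP socket stay responsive during the run. This is only a sketch and the value is an assumption to be tuned, not a known fix for this particular timeout:

```
# /etc/vzdump.conf (sketch; bwlimit is in KiB/s, so 51200 is roughly 50 MiB/s)
bwlimit: 51200
```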
Frank
On 04/12/2020 15:20, Frank Thommen wrote:
> Lots of clicking to configure the other 30 VMs, but yes, that's
> probably the most appropriate thing to do now :-) I have to arrange for
> a free NFS share first, though, as there is no free local disk space for
> 3+ TB of backup...
>
>
> On 04/12/2020 15:00, Yannis Milios wrote:
>> Can you try removing this specific VM from the normal backup schedule and
>> then create a new test schedule for it, if possible to a different backup
>> target (nfs, local etc) ?
>>
>>
>>
>> On Fri, 4 Dec 2020 at 11:10, Frank Thommen <f.thommen@dkfz-heidelberg.de>
>> wrote:
>>
>>> On 04/12/2020 11:36, Arjen via pve-user wrote:
>>>> On Fri, 2020-12-04 at 11:22 +0100, Frank Thommen wrote:
>>>>>
>>>>> On 04/12/2020 09:30, Frank Thommen wrote:
>>>>>>> On Thursday, December 3, 2020 10:16 PM, Frank Thommen
>>>>>>> <f.thommen@dkfz-heidelberg.de> wrote:
>>>>>>>
>>>>>>>>
>>>>>>>> Dear all,
>>>>>>>>
>>>>>>>> on our PVE cluster, the backup of a specific VM always fails
>>>>>>>> (which
>>>>>>>> makes us worry, as it is our GitLab instance). The general
>>>>>>>> backup plan
>>>>>>>> is "back up all VMs at 00:30". In the confirmation email we
>>>>>>>> see, that
>>>>>>>> the backup of this specific VM takes six to seven hours and
>>>>>>>> then fails.
>>>>>>>> The error message in the overview table used to be:
>>>>>>>>
>>>>>>>> vma_queue_write: write error - Broken pipe
>>>>>>>>
>>>>>>>> With detailed log
>>>>>>>>
>>>>>>>> -------------------------------------------------------------
>>>>>>>> -------------------------------------------------------------
>>>>>>>> -------------------------------------------------------------
>>>>>>>> -------------------------------------------------------------
>>>>>>>> -------------------------------------------------------------
>>>>>>>> -------------------------------------------------------------
>>>>>>>> -----------------------------------------------
>>>>>>>>
>>>>>>>>
>>>>>>>> 123: 2020-12-01 02:53:08 INFO: Starting Backup of VM 123 (qemu)
>>>>>>>> 123: 2020-12-01 02:53:08 INFO: status = running
>>>>>>>> 123: 2020-12-01 02:53:09 INFO: update VM 123: -lock backup
>>>>>>>> 123: 2020-12-01 02:53:09 INFO: VM Name: odcf-vm123
>>>>>>>> 123: 2020-12-01 02:53:09 INFO: include disk 'virtio0'
>>>>>>>> 'ceph-rbd:vm-123-disk-0' 20G
>>>>>>>> 123: 2020-12-01 02:53:09 INFO: include disk 'virtio1'
>>>>>>>> 'ceph-rbd:vm-123-disk-2' 1000G
>>>>>>>> 123: 2020-12-01 02:53:09 INFO: include disk 'virtio2'
>>>>>>>> 'ceph-rbd:vm-123-disk-3' 2T
>>>>>>>> 123: 2020-12-01 02:53:09 INFO: backup mode: snapshot
>>>>>>>> 123: 2020-12-01 02:53:09 INFO: ionice priority: 7
>>>>>>>> 123: 2020-12-01 02:53:09 INFO: creating archive
>>>>>>>> '/mnt/pve/cephfs/dump/vzdump-qemu-123-2020_12_01-
>>>>>>>> 02_53_08.vma.lzo'
>>>>>>>> 123: 2020-12-01 02:53:09 INFO: started backup task
>>>>>>>> 'a38ff50a-f474-4b0a-a052-01a835d5c5c7'
>>>>>>>> 123: 2020-12-01 02:53:12 INFO: status: 0%
>>>>>>>> (167772160/3294239916032),
>>>>>>>> sparse 0% (31563776), duration 3, read/write 55/45 MB/s
>>>>>>>> [... ecc. ecc. ...]
>>>>>>>> 123: 2020-12-01 09:42:14 INFO: status: 35%
>>>>>>>> (1170252365824/3294239916032), sparse 0% (26845003776),
>>>>>>>> duration 24545,
>>>>>>>> read/write 59/56 MB/s
>>>>>>>> 123: 2020-12-01 09:42:14 ERROR: vma_queue_write: write error -
>>>>>>>> Broken
>>>>>>>> pipe
>>>>>>>> 123: 2020-12-01 09:42:14 INFO: aborting backup job
>>>>>>>> 123: 2020-12-01 09:42:15 ERROR: Backup of VM 123 failed -
>>>>>>>> vma_queue_write: write error - Broken pipe
>>>>>>>>
>>>>>>>> -------------------------------------------------------------
>>>>>>>> -------------------------------------------------------------
>>>>>>>> -------------------------------------------------------------
>>>>>>>> -------------------------------------------------------------
>>>>>>>> -------------------------------------------------------------
>>>>>>>> -------------------------------------------------------------
>>>>>>>> -------------------------------------------------------------
>>>>>>>> -------------------------------------------------------------
>>>>>>>> -------------------------------------------------------------
>>>>>>>> -------------------------------------------------------------
>>>>>>>> -------------------------------------------------------------
>>>>>>>> -------------------------------------------------------------
>>>>>>>> -------------------------------------------------------------
>>>>>>>> -------------------------------------------------------------
>>>>>>>> -------------------------------------------------------------
>>>>>>>> -------------------------------------------------------------
>>>>>>>> ----------!
>>>>>>>>
>>>>>> -----------------------------------------------------------------
>>>>>>
>>>>>>>> Since recently (after the upgrade to the newest PVE release) it is:
>>>>>>>>
>>>>>>>> VM 123 qmp command 'query-backup' failed - got timeout
>>>>>>>>
>>>>>>>> with log
>>>>>>>>
>>>>>>>> -----------------------------------------------------------------
>>>>>>>>
>>>>>>>>
>>>>>>>> 123: 2020-12-03 03:29:00 INFO: Starting Backup of VM 123 (qemu)
>>>>>>>> 123: 2020-12-03 03:29:00 INFO: status = running
>>>>>>>> 123: 2020-12-03 03:29:00 INFO: VM Name: odcf-vm123
>>>>>>>> 123: 2020-12-03 03:29:00 INFO: include disk 'virtio0' 'ceph-rbd:vm-123-disk-0' 20G
>>>>>>>> 123: 2020-12-03 03:29:00 INFO: include disk 'virtio1' 'ceph-rbd:vm-123-disk-2' 1000G
>>>>>>>> 123: 2020-12-03 03:29:00 INFO: include disk 'virtio2' 'ceph-rbd:vm-123-disk-3' 2T
>>>>>>>> 123: 2020-12-03 03:29:01 INFO: backup mode: snapshot
>>>>>>>> 123: 2020-12-03 03:29:01 INFO: ionice priority: 7
>>>>>>>> 123: 2020-12-03 03:29:01 INFO: creating vzdump archive '/mnt/pve/cephfs/dump/vzdump-qemu-123-2020_12_03-03_29_00.vma.lzo'
>>>>>>>> 123: 2020-12-03 03:29:01 INFO: started backup task 'cc7cde4e-20e8-4e26-a89a-f6f1aa9e9612'
>>>>>>>> 123: 2020-12-03 03:29:01 INFO: resuming VM again
>>>>>>>> 123: 2020-12-03 03:29:04 INFO: 0% (284.0 MiB of 3.0 TiB) in 3s, read: 94.7 MiB/s, write: 51.7 MiB/s
>>>>>>>> [... etc. etc. ...]
>>>>>>>> 123: 2020-12-03 09:05:08 INFO: 36% (1.1 TiB of 3.0 TiB) in 5h 36m 7s, read: 57.3 MiB/s, write: 53.6 MiB/s
>>>>>>>> 123: 2020-12-03 09:22:57 ERROR: VM 123 qmp command 'query-backup' failed - got timeout
>>>>>>>> 123: 2020-12-03 09:22:57 INFO: aborting backup job
>>>>>>>> 123: 2020-12-03 09:32:57 ERROR: VM 123 qmp command 'backup-cancel' failed - unable to connect to VM 123 qmp socket - timeout after 5981 retries
>>>>>>>> 123: 2020-12-03 09:32:57 ERROR: Backup of VM 123 failed - VM 123 qmp command 'query-backup' failed - got timeout
>>>>>>>>
>>>>>>>>
>>>>>>>> The VM has some quite big vdisks (20G, 1T and 2T). All stored in Ceph.
>>>>>>>> There is still plenty of space in Ceph.
>>>>>>>>
>>>>>>>> Can anyone give us some hint on how to investigate and debug this
>>>>>>>> further?
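One way to narrow this down (an illustrative sketch, not something suggested in the thread): temporarily exclude the largest disk from the backup and retry, to see whether the failure tracks the disk size. `backup=0` is a standard option in a `qm` disk definition; the disk string below is copied from the log above, so verify the exact current value with `qm config 123` before changing anything.

```shell
# Sketch only: build the commands to run on the PVE node.
# The VM id and disk string are taken from the log above; the
# disk definition must match the current output of `qm config 123`.
VMID=123
DISK_OPT="ceph-rbd:vm-123-disk-3,backup=0"

# exclude the 2T disk from backup, then retry a manual snapshot backup
EXCLUDE_CMD="qm set $VMID --virtio2 $DISK_OPT"
BACKUP_CMD="vzdump $VMID --mode snapshot --storage cephfs"

printf '%s\n%s\n' "$EXCLUDE_CMD" "$BACKUP_CMD"
```

If the backup then succeeds, the problem is tied to the big disk (or to how long the job runs), not to the VM's guest OS.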
>>>>>>>
>>>>>>> Because it is a write error, maybe we should look at the backup
>>>>>>> destination. Maybe it is a network connection issue? Maybe something
>>>>>>> is wrong with the host? Maybe the disk is full?
>>>>>>> Which storage are you using for backup? Can you show us the
>>>>>>> corresponding entry in /etc/pve/storage.cfg?
>>>>>>
>>>>>> We are backing up to cephfs, which still has about 8 TB free.
>>>>>>
>>>>>> /etc/pve/storage.cfg is
>>>>>> ------------
>>>>>> dir: local
>>>>>> path /var/lib/vz
>>>>>> content vztmpl,backup,iso
>>>>>>
>>>>>> dir: data
>>>>>> path /data
>>>>>> content snippets,images,backup,iso,rootdir,vztmpl
>>>>>>
>>>>>> cephfs: cephfs
>>>>>> path /mnt/pve/cephfs
>>>>>> content backup,vztmpl,iso
>>>>>> maxfiles 5
>>>>>>
>>>>>> rbd: ceph-rbd
>>>>>> content images,rootdir
>>>>>> krbd 0
>>>>>> pool pve-pool1
>>>>>> ------------
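Since the archive is written to the cephfs mount, a quick way to rule out the destination is to check free space and do a sustained, fsync'ed write into the dump directory. A minimal sketch, assuming the path from the storage.cfg above (it falls back to a temp dir when run outside the cluster):

```shell
# Sanity-check the backup destination. The cephfs path comes from
# storage.cfg; fall back to a temp dir so this can run anywhere.
BACKUP_DIR=/mnt/pve/cephfs/dump
[ -d "$BACKUP_DIR" ] || BACKUP_DIR=$(mktemp -d)

# free space on the mount
df -h "$BACKUP_DIR"

# sustained, fsync'ed write; a stall or error here is the kind of
# failure that surfaces mid-backup as
# "vma_queue_write: write error - Broken pipe"
dd if=/dev/zero of="$BACKUP_DIR/write-test.img" bs=1M count=64 conv=fsync
rm -f "$BACKUP_DIR/write-test.img"
```

A multi-hour backup only needs the write path to hiccup once, so it may also be worth leaving a larger version of this test running for a while.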
>>>>>>
>>>>>
>>>>> The problem has reached a new level of urgency: for the last two days,
>>>>> each time after a failed backup the VM becomes inaccessible and has to
>>>>> be stopped and started manually from the PVE UI.
>>>>
>>>> I don't see anything wrong with the configuration that you shared.
>>>> Was anything changed in the last few days since the last successful
>>>> backup? Any updates from Proxmox? Changes to the network?
>>>> I know very little about Ceph and clusters, sorry.
>>>> What makes this VM different, except for the size of the disks?
>>>
>>> On December 1st the hypervisor was updated to PVE 6.3-2 (I think from
>>> 6.1-3). After that the error message changed slightly and, in hindsight,
>>> since then the VM has stopped being accessible after each failed backup.
>>>
>>> However: the VM has never backed up successfully, not even before the
>>> PVE upgrade. It's just that no one really took notice of it.
>>>
>>> The VM is not really special. It's our only Debian VM (but I hope that's
>>> not an issue :-)) and it was migrated 1:1 from oVirt by exporting and
>>> importing the disk images. But we have a few other such VMs and they run
>>> and back up just fine.
>>>
>>> No network changes. Basically nothing has changed that I can think of.
>>>
>>> But to be clear: Our current main problem is the failing backup, not the
>>> crash.
>>>
>>>
>>> Cheers, Frank
>>>
>>>
>>>
>>>
>>> _______________________________________________
>>> pve-user mailing list
>>> pve-user@lists.proxmox.com
>>> https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-user
>>>
>>> --
>> Sent from Gmail Mobile
end of thread, other threads:[~2020-12-16 18:31 UTC | newest]
Thread overview: 11+ messages
2020-12-03 21:16 [PVE-User] Backup of one VM always fails Frank Thommen
2020-12-03 22:10 ` Gerald Brandt
2020-12-04 8:26 ` Frank Thommen
[not found] ` <mailman.131.1607062291.440.pve-user@lists.proxmox.com>
2020-12-04 8:30 ` Frank Thommen
2020-12-04 10:22 ` Frank Thommen
2020-12-04 10:26 ` Fabrizio Cuseo
[not found] ` <mailman.2.1607078234.376.pve-user@lists.proxmox.com>
2020-12-04 11:09 ` Frank Thommen
2020-12-04 14:00 ` Yannis Milios
2020-12-04 14:20 ` Frank Thommen
2020-12-04 14:39 ` [PVE-User] PBS WAS : " Ronny Aasen
2020-12-16 18:30 ` [PVE-User] " Frank Thommen