From mboxrd@z Thu Jan 1 00:00:00 1970
From: Yannis Milios
Date: Fri, 4 Dec 2020 14:00:01 +0000
To: Proxmox VE user list
Subject: Re: [PVE-User] Backup of one VM always fails

Can you try removing this specific VM from the normal backup schedule and
then creating a new test schedule just for it, if possible to a different
backup target (NFS, local, etc.)?
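For a quick one-off test you can also run vzdump by hand, e.g. something
along these lines (the storage name "local" is only an example here; any
configured storage that allows "backup" content will do):

  # back up only VM 123, snapshot mode, to a target other than cephfs
  vzdump 123 --storage local --mode snapshot --compress lzo

If that completes, the problem is more likely on the cephfs target side
than inside the VM itself. See also the checks at the bottom of this mail.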
On Fri, 4 Dec 2020 at 11:10, Frank Thommen wrote:

> On 04/12/2020 11:36, Arjen via pve-user wrote:
> > On Fri, 2020-12-04 at 11:22 +0100, Frank Thommen wrote:
> >>
> >> On 04/12/2020 09:30, Frank Thommen wrote:
> >>>> On Thursday, December 3, 2020 10:16 PM, Frank Thommen wrote:
> >>>>
> >>>>> Dear all,
> >>>>>
> >>>>> on our PVE cluster, the backup of a specific VM always fails (which
> >>>>> makes us worry, as it is our GitLab instance). The general backup
> >>>>> plan is "back up all VMs at 00:30". In the confirmation email we see
> >>>>> that the backup of this specific VM takes six to seven hours and
> >>>>> then fails. The error message in the overview table used to be:
> >>>>>
> >>>>> vma_queue_write: write error - Broken pipe
> >>>>>
> >>>>> With detailed log
> >>>>>
> >>>>> ------------------------------------------------------------------
> >>>>> 123: 2020-12-01 02:53:08 INFO: Starting Backup of VM 123 (qemu)
> >>>>> 123: 2020-12-01 02:53:08 INFO: status = running
> >>>>> 123: 2020-12-01 02:53:09 INFO: update VM 123: -lock backup
> >>>>> 123: 2020-12-01 02:53:09 INFO: VM Name: odcf-vm123
> >>>>> 123: 2020-12-01 02:53:09 INFO: include disk 'virtio0' 'ceph-rbd:vm-123-disk-0' 20G
> >>>>> 123: 2020-12-01 02:53:09 INFO: include disk 'virtio1' 'ceph-rbd:vm-123-disk-2' 1000G
> >>>>> 123: 2020-12-01 02:53:09 INFO: include disk 'virtio2' 'ceph-rbd:vm-123-disk-3' 2T
> >>>>> 123: 2020-12-01 02:53:09 INFO: backup mode: snapshot
> >>>>> 123: 2020-12-01 02:53:09 INFO: ionice priority: 7
> >>>>> 123: 2020-12-01 02:53:09 INFO: creating archive '/mnt/pve/cephfs/dump/vzdump-qemu-123-2020_12_01-02_53_08.vma.lzo'
> >>>>> 123: 2020-12-01 02:53:09 INFO: started backup task 'a38ff50a-f474-4b0a-a052-01a835d5c5c7'
> >>>>> 123: 2020-12-01 02:53:12 INFO: status: 0% (167772160/3294239916032), sparse 0% (31563776), duration 3, read/write 55/45 MB/s
> >>>>> [... etc. etc. ...]
> >>>>> 123: 2020-12-01 09:42:14 INFO: status: 35% (1170252365824/3294239916032), sparse 0% (26845003776), duration 24545, read/write 59/56 MB/s
> >>>>> 123: 2020-12-01 09:42:14 ERROR: vma_queue_write: write error - Broken pipe
> >>>>> 123: 2020-12-01 09:42:14 INFO: aborting backup job
> >>>>> 123: 2020-12-01 09:42:15 ERROR: Backup of VM 123 failed - vma_queue_write: write error - Broken pipe
> >>>>> ------------------------------------------------------------------
> >>>>>
> >>>>> Since the upgrade to the newest PVE release it is
> >>>>>
> >>>>> VM 123 qmp command 'query-backup' failed - got timeout
> >>>>>
> >>>>> with log
> >>>>>
> >>>>> ------------------------------------------------------------------
> >>>>> 123: 2020-12-03 03:29:00 INFO: Starting Backup of VM 123 (qemu)
> >>>>> 123: 2020-12-03 03:29:00 INFO: status = running
> >>>>> 123: 2020-12-03 03:29:00 INFO: VM Name: odcf-vm123
> >>>>> 123: 2020-12-03 03:29:00 INFO: include disk 'virtio0' 'ceph-rbd:vm-123-disk-0' 20G
> >>>>> 123: 2020-12-03 03:29:00 INFO: include disk 'virtio1' 'ceph-rbd:vm-123-disk-2' 1000G
> >>>>> 123: 2020-12-03 03:29:00 INFO: include disk 'virtio2' 'ceph-rbd:vm-123-disk-3' 2T
> >>>>> 123: 2020-12-03 03:29:01 INFO: backup mode: snapshot
> >>>>> 123: 2020-12-03 03:29:01 INFO: ionice priority: 7
> >>>>> 123: 2020-12-03 03:29:01 INFO: creating vzdump archive '/mnt/pve/cephfs/dump/vzdump-qemu-123-2020_12_03-03_29_00.vma.lzo'
> >>>>> 123: 2020-12-03 03:29:01 INFO: started backup task 'cc7cde4e-20e8-4e26-a89a-f6f1aa9e9612'
> >>>>> 123: 2020-12-03 03:29:01 INFO: resuming VM again
> >>>>> 123: 2020-12-03 03:29:04 INFO: 0% (284.0 MiB of 3.0 TiB) in 3s, read: 94.7 MiB/s, write: 51.7 MiB/s
> >>>>> [... etc. etc. ...]
> >>>>> 123: 2020-12-03 09:05:08 INFO: 36% (1.1 TiB of 3.0 TiB) in 5h 36m 7s, read: 57.3 MiB/s, write: 53.6 MiB/s
> >>>>> 123: 2020-12-03 09:22:57 ERROR: VM 123 qmp command 'query-backup' failed - got timeout
> >>>>> 123: 2020-12-03 09:22:57 INFO: aborting backup job
> >>>>> 123: 2020-12-03 09:32:57 ERROR: VM 123 qmp command 'backup-cancel' failed - unable to connect to VM 123 qmp socket - timeout after 5981 retries
> >>>>> 123: 2020-12-03 09:32:57 ERROR: Backup of VM 123 failed - VM 123 qmp command 'query-backup' failed - got timeout
> >>>>> ------------------------------------------------------------------
> >>>>>
> >>>>> The VM has some quite big vdisks (20G, 1T and 2T), all stored in
> >>>>> Ceph. There is still plenty of space in Ceph.
> >>>>>
> >>>>> Can anyone give us a hint on how to investigate and debug this
> >>>>> further?
> >>>>
> >>>> Because it is a write error, maybe we should look at the backup
> >>>> destination. Maybe it is a network connection issue? Maybe something
> >>>> is wrong with the host? Maybe the disk is full?
> >>>> Which storage are you using for backup? Can you show us the
> >>>> corresponding entry in /etc/pve/storage.cfg?
> >>>
> >>> We are backing up to cephfs with still 8 TB or so free.
> >>>
> >>> /etc/pve/storage.cfg is
> >>> ------------
> >>> dir: local
> >>>         path /var/lib/vz
> >>>         content vztmpl,backup,iso
> >>>
> >>> dir: data
> >>>         path /data
> >>>         content snippets,images,backup,iso,rootdir,vztmpl
> >>>
> >>> cephfs: cephfs
> >>>         path /mnt/pve/cephfs
> >>>         content backup,vztmpl,iso
> >>>         maxfiles 5
> >>>
> >>> rbd: ceph-rbd
> >>>         content images,rootdir
> >>>         krbd 0
> >>>         pool pve-pool1
> >>> ------------
> >>
> >> The problem has reached a new level of urgency: for the last two days,
> >> each time the backup fails the VM becomes inaccessible and has to be
> >> stopped and started manually from the PVE UI.
> >
> > I don't see anything wrong with the configuration that you shared.
> > Was anything changed in the last few days since the last successful
> > backup? Any updates from Proxmox? Changes to the network?
> > I know very little about Ceph and clusters, sorry.
> > What makes this VM different, except for the size of the disks?
>
> On December 1st the hypervisor was updated to PVE 6.3-2 (I think from
> 6.1-3). After that the error message changed slightly and - in
> hindsight - since then the VM stops being accessible after the failed
> backup.
>
> However: the VM never ever backed up successfully, not even before the
> PVE upgrade. It's just that nobody really took notice of it.
>
> The VM is not really special. It is our only Debian VM (but I hope
> that's not an issue :-) and it has been migrated 1:1 from oVirt by
> migrating and importing the disk images. But we have a few other such
> VMs and they run and back up just fine.
>
> No network changes. Basically nothing has changed that I could think of.
>
> But to be clear: our current main problem is the failing backup, not
> the crash.
>
> Cheers, Frank
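One more thought: both logs show the job dying five to seven hours in, at
roughly the same point (35-36%), while writing a ~3 TB archive to cephfs,
so it may be worth watching the backup target while a backup is running.
A few generic checks (plain Ceph/Linux tooling, nothing PVE-specific):

  # overall cluster health and capacity
  ceph -s
  ceph df

  # free space on the cephfs mount as seen by the node running the backup
  df -h /mnt/pve/cephfs

  # kernel/cephfs client messages around the time of the failure
  dmesg -T | grep -i ceph

If the cephfs client stalls under the long sequential write, that could
explain both the earlier "Broken pipe" and, after the upgrade, the qmp
timeouts.

Yannis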