From: "Bertorello, Marco"
To: pve-user
Date: Wed, 28 Apr 2021 17:34:14 +0200
Message-ID: <11f33c2d-472d-d2c8-d3e4-c5e4a99900e4@marcobertorello.it>
Subject: [PVE-User] Replication blocked issue

Dear PVE users,

I have a 3-node cluster with ZFS storage.

Every node uses its own storage, and the VMs/LXCs are replicated across the other nodes every 10 minutes.

Sometimes a replication job keeps running without ever finishing. For example, at the moment I have a replication that started yesterday:

2021-04-27 07:20:01 101-1: start replication job
2021-04-27 07:20:01 101-1: guest => CT 101, running => 1
2021-04-27 07:20:01 101-1: volumes => DS1:subvol-101-disk-1
2021-04-27 07:20:02 101-1: freeze guest filesystem
2021-04-27 07:20:05 101-1: create snapshot '__replicate_101-1_1619500801__' on DS1:subvol-101-disk-1
2021-04-27 07:20:06 101-1: thaw guest filesystem
2021-04-27 07:20:06 101-1: using secure transmission, rate limit: none
2021-04-27 07:20:06 101-1: incremental sync 'DS1:subvol-101-disk-1' (__replicate_101-1_1619500201__ => __replicate_101-1_1619500801__)
2021-04-27 07:20:08 101-1: send from @__replicate_101-1_1619500201__ to zp1/subvol-101-disk-1@__replicate_101-0_1619500211__ estimated size is 213K
2021-04-27 07:20:08 101-1: send from @__replicate_101-0_1619500211__ to zp1/subvol-101-disk-1@__replicate_101-1_1619500801__ estimated size is 26.1M
2021-04-27 07:20:08 101-1: total estimated size is 26.4M
2021-04-27 07:20:09 101-1: TIME        SENT   SNAPSHOT zp1/subvol-101-disk-1@__replicate_101-1_1619500801__
2021-04-27 07:20:09 101-1: 07:20:09   3.18M zp1/subvol-101-disk-1@__replicate_101-1_1619500801__
[...]
2021-04-28 17:27:25 101-1: 17:27:25   3.18M zp1/subvol-101-disk-1@__replicate_101-1_1619500801__
2021-04-28 17:27:26 101-1: 17:27:26   3.18M zp1/subvol-101-disk-1@__replicate_101-1_1619500801__
2021-04-28 17:27:27 101-1: 17:27:27   3.18M zp1/subvol-101-disk-1@__replicate_101-1_1619500801__

As you can see, there has been no progress in all this time: still only 3.18M transferred.

There are 2 big problems with this:

1) the blocked replication prevents the other replications scheduled on the source node from running until this one ends or fails

2) I have no solution other than rebooting the destination node to get out of this situation.

I tried to kill the process on the destination node, but the process is in D state and cannot be killed.

Is there a way to get out of this scenario without rebooting the nodes?

Thanks a lot and best regards,

-- 
Marco Bertorello
https://www.marcobertorello.it
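
To quantify a stall like the one in the log above, a minimal sketch such as the following could be used. It assumes that progress lines always have the shape shown in the log (timestamp, job id, time, SENT value, snapshot name). The names `PROGRESS` and `stalled_seconds` are purely illustrative and are not part of PVE or ZFS.

```python
import re
from datetime import datetime

# Assumed shape of a progress line, taken from the log above:
# 2021-04-28 17:27:25 101-1: 17:27:25   3.18M zp1/subvol-101-disk-1@__replicate_101-1_1619500801__
PROGRESS = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) (\S+): "
    r"\d{2}:\d{2}:\d{2}\s+(\S+)\s+(\S+)$"
)


def stalled_seconds(log_lines):
    """Return how many seconds the SENT value has stayed unchanged,
    measured up to the last progress line found in log_lines."""
    last_sent = None        # most recent SENT value seen
    unchanged_since = None  # timestamp at which that value first appeared
    latest = None           # timestamp of the last progress line
    for line in log_lines:
        match = PROGRESS.match(line.strip())
        if not match:
            continue  # skip non-progress lines (job start, header row, "[...]")
        stamp = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
        sent = match.group(3)
        if sent != last_sent:
            last_sent = sent
            unchanged_since = stamp
        latest = stamp
    if latest is None:
        return 0
    return int((latest - unchanged_since).total_seconds())
```

Feeding it the lines of a saved replication log (for example `stalled_seconds(open(path))`, where `path` is wherever the log was saved) gives a number that can be compared against a threshold to flag a stuck job. It only reports the stall and does nothing about the process in D state.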