From: "Bertorello, Marco" <me@marcobertorello.it>
To: pve-user <pve-user@lists.proxmox.com>
Subject: [PVE-User] Replication blocked issue
Date: Wed, 28 Apr 2021 17:34:14 +0200
Message-ID: <11f33c2d-472d-d2c8-d3e4-c5e4a99900e4@marcobertorello.it>

Dear PVE users,

I have a 3-node cluster with ZFS storage.
Each node uses its own local storage, and the VMs/LXCs are replicated
to the other nodes every 10 minutes.
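
For reference, the jobs were created roughly like this ("pve2" here stands in for the actual target node name; the "*/10" schedule runs every 10 minutes):

  # create a replication job for CT 101 towards node pve2
  pvesr create-local-job 101-1 pve2 --schedule "*/10"

  # list the configured replication jobs
  pvesr list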

Sometimes a replication job runs without ever finishing.

For example, at the moment I have a replication job that started yesterday and is still running:

2021-04-27 07:20:01 101-1: start replication job
2021-04-27 07:20:01 101-1: guest => CT 101, running => 1
2021-04-27 07:20:01 101-1: volumes => DS1:subvol-101-disk-1
2021-04-27 07:20:02 101-1: freeze guest filesystem
2021-04-27 07:20:05 101-1: create snapshot '__replicate_101-1_1619500801__' on DS1:subvol-101-disk-1
2021-04-27 07:20:06 101-1: thaw guest filesystem
2021-04-27 07:20:06 101-1: using secure transmission, rate limit: none
2021-04-27 07:20:06 101-1: incremental sync 'DS1:subvol-101-disk-1' (__replicate_101-1_1619500201__ => __replicate_101-1_1619500801__)
2021-04-27 07:20:08 101-1: send from @__replicate_101-1_1619500201__ to zp1/subvol-101-disk-1@__replicate_101-0_1619500211__ estimated size is 213K
2021-04-27 07:20:08 101-1: send from @__replicate_101-0_1619500211__ to zp1/subvol-101-disk-1@__replicate_101-1_1619500801__ estimated size is 26.1M
2021-04-27 07:20:08 101-1: total estimated size is 26.4M
2021-04-27 07:20:09 101-1: TIME        SENT   SNAPSHOT zp1/subvol-101-disk-1@__replicate_101-1_1619500801__
2021-04-27 07:20:09 101-1: 07:20:09   3.18M zp1/subvol-101-disk-1@__replicate_101-1_1619500801__
[...]
2021-04-28 17:27:25 101-1: 17:27:25   3.18M zp1/subvol-101-disk-1@__replicate_101-1_1619500801__
2021-04-28 17:27:26 101-1: 17:27:26   3.18M zp1/subvol-101-disk-1@__replicate_101-1_1619500801__
2021-04-28 17:27:27 101-1: 17:27:27   3.18M zp1/subvol-101-disk-1@__replicate_101-1_1619500801__

As you can see, there has been no progress in this whole time span: the counter is still stuck at 3.18M transferred.
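
When this happens, this is roughly what I use to check both sides (output omitted; the receive on the destination is the hung one):

  # on the source node: status of all replication jobs
  pvesr status

  # on the destination node: look for the hung receive process
  ps -eo pid,stat,wchan:30,args | grep '[z]fs recv'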

There are two big problems with this:

1) the blocked replication prevents the other replication jobs scheduled
on the source node from running until this one ends or fails;

2) I have found no way out of this situation other than rebooting the
destination node.

I tried to kill the process on the destination node, but it is in D
state (uninterruptible sleep) and cannot be killed.
Is there a way to get out of this scenario without rebooting nodes?
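
For what it's worth, this is roughly how I checked the state of the stuck process on the destination node (12345 stands in for the actual PID):

  # STAT shows 'D' (uninterruptible sleep), so SIGKILL has no effect
  ps -o pid,stat,wchan:30,args -p 12345

  # as root, the kernel stack hints at what the process is blocked on
  cat /proc/12345/stack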

Thanks a lot and best regards,

-- 
Marco Bertorello
https://www.marcobertorello.it