From: Fabian Ebner
To: pve-devel@lists.proxmox.com
Date: Thu, 17 Dec 2020 15:17:39 +0100
Message-Id: <20201217141739.22535-3-f.ebner@proxmox.com>
In-Reply-To: <20201217141739.22535-1-f.ebner@proxmox.com>
Subject: [pve-devel] [PATCH v2 zsync 3/3] fix #2821: only abort if there really is a waiting/syncing job
Only abort if there really is a waiting/syncing job. This is done by
remembering the process via PID + start time + boot ID and checking for
that information in the new instance. If the old instance can't be
found, the new one will continue and register itself in the state.

After updating the pve-zsync package, if a waiting instance of the old
version is still running, one additional instance might be created,
because the old instance has not recorded an instance_id. The new
instance will set the instance_id, however, so any later instance will
see it.

More importantly, if the state is wrongly 'waiting' or 'syncing', e.g.
because an instance was terminated before finishing, we no longer abort,
but recover from the wrong state, thus fixing the bug.

Signed-off-by: Fabian Ebner
---

Changes from v1:
* use file reads instead of spawning 'ps'
* use PID+boot specific start time+boot ID instead of PID+absolute start time
* collapse get_process_start_time() and get_instance_id() into one function
* query the ID of the current instance on startup

 pve-zsync | 40 +++++++++++++++++++++++++++++++++++++++-
 1 file changed, 39 insertions(+), 1 deletion(-)

diff --git a/pve-zsync b/pve-zsync
index 76e12ce..5c95955 100755
--- a/pve-zsync
+++ b/pve-zsync
@@ -55,6 +55,8 @@ my $TARGETRE = qr!^(?:($HOSTRE):)?(\d+|(?:[\w\-_]+)(/.+)?)$!;
 
 my $DISK_KEY_RE = qr/^(?:(?:(?:virtio|ide|scsi|sata|efidisk|mp)\d+)|rootfs): /;
 
+my $INSTANCE_ID = get_instance_id($$);
+
 my $command = $ARGV[0];
 
 if (defined($command) && $command ne 'help' && $command ne 'printpod') {
@@ -274,6 +276,7 @@ sub add_state_to_job {
     $job->{state} = $state->{state};
     $job->{lsync} = $state->{lsync};
     $job->{vm_type} = $state->{vm_type};
+    $job->{instance_id} = $state->{instance_id};
 
     for (my $i = 0; $state->{"snap$i"}; $i++) {
         $job->{"snap$i"} = $state->{"snap$i"};
@@ -359,6 +362,7 @@ sub update_state {
         if ($job->{state} ne "del") {
             $state->{state} = $job->{state};
             $state->{lsync} = $job->{lsync};
+            $state->{instance_id} = $job->{instance_id};
             $state->{vm_type} = $job->{vm_type};
 
             for (my $i = 0; $job->{"snap$i"} ; $i++) {
@@ -571,6 +575,33 @@ sub destroy_job {
     });
 }
 
+sub get_instance_id {
+    my ($pid) = @_;
+
+    my $stat = read_file("/proc/$pid/stat", 1)
+        or die "unable to read process stats\n";
+    my $boot_id = read_file("/proc/sys/kernel/random/boot_id", 1)
+        or die "unable to read boot ID\n";
+
+    my $stats = [ split(/\s+/, $stat) ];
+    my $starttime = $stats->[21];
+    chomp($boot_id);
+
+    return "${pid}:${starttime}:${boot_id}";
+}
+
+sub instance_exists {
+    my ($instance_id) = @_;
+
+    if (defined($instance_id) && $instance_id =~ m/^([1-9][0-9]*):/) {
+        my $pid = $1;
+        my $actual_id = eval { get_instance_id($pid); };
+        return defined($actual_id) && $actual_id eq $instance_id;
+    }
+
+    return 0;
+}
+
 sub sync {
     my ($param) = @_;
 
@@ -580,11 +611,16 @@ sub sync {
         eval { $job = get_job($param) };
 
         if ($job) {
-            if (defined($job->{state}) && ($job->{state} eq "syncing" || $job->{state} eq "waiting")) {
+            my $state = $job->{state} // 'ok';
+            $state = 'ok' if !instance_exists($job->{instance_id});
+
+            if ($state eq "syncing" || $state eq "waiting") {
                 die "Job --source $param->{source} --name $param->{name} is already scheduled to sync\n";
             }
 
             $job->{state} = "waiting";
+            $job->{instance_id} = $INSTANCE_ID;
+
             update_state($job);
         }
     });
@@ -658,6 +694,7 @@ sub sync {
         eval { $job = get_job($param); };
         if ($job) {
             $job->{state} = "error";
+            delete $job->{instance_id};
             update_state($job);
         }
     });
@@ -674,6 +711,7 @@ sub sync {
                 $job->{state} = "ok";
             }
             $job->{lsync} = $date;
+            delete $job->{instance_id};
             update_state($job);
         }
     });
-- 
2.20.1
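For anyone who wants to try the liveness check outside pve-zsync, here is a minimal standalone Perl sketch of the same idea: an instance is identified by PID, start time (field 22 of /proc/<pid>/stat, counted from boot) and boot ID, and it is only considered alive if recomputing that triple for the PID gives the identical string. The helper names (parse_starttime, make_instance_id, read_first_line, current_instance_id, instance_alive) are hypothetical and not part of pve-zsync, which uses its own read_file() helper. It assumes Linux /proc semantics as documented in proc(5). One deliberate difference: the patch splits the whole stat line on whitespace, while this sketch splits only after the last ')', because the comm field (field 2) may contain spaces. For pve-zsync's own process this makes no difference, and for a foreign process reusing the PID a mis-split would most likely just produce a non-matching ID.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Extract the start time (field 22 in proc(5), clock ticks since boot)
# from one line of /proc/<pid>/stat. Field 2 (comm) is wrapped in
# parentheses and may contain spaces or ')', so split only after the LAST ')'.
sub parse_starttime {
    my ($stat_line) = @_;

    my $close = rindex($stat_line, ')');
    die "malformed stat line\n" if $close < 0;

    # $rest[0] is field 3 (state), so field 22 (starttime) is $rest[19].
    my @rest = split(' ', substr($stat_line, $close + 1));
    my $starttime = $rest[19];
    die "malformed stat line\n" if !defined($starttime);

    return $starttime;
}

# Hypothetical helper: combine PID, start time and boot ID into one
# "pid:starttime:boot_id" string, the same shape the patch stores.
sub make_instance_id {
    my ($pid, $stat_line, $boot_id) = @_;

    chomp($boot_id);
    return join(':', $pid, parse_starttime($stat_line), $boot_id);
}

# Return the first line of a file, or undef if it cannot be opened.
sub read_first_line {
    my ($path) = @_;

    open(my $fh, '<', $path) or return undef;
    my $line = <$fh>;
    close($fh);
    return $line;
}

# Compute the identifier of a running process (Linux only, needs /proc).
sub current_instance_id {
    my ($pid) = @_;

    my $stat = read_first_line("/proc/$pid/stat")
        // die "unable to read process stats\n";
    my $boot_id = read_first_line("/proc/sys/kernel/random/boot_id")
        // die "unable to read boot ID\n";

    return make_instance_id($pid, $stat, $boot_id);
}

# 1 only if the recorded instance is still the same running process:
# a dead PID, a reused PID (different start time) or a reboot
# (different boot ID) all yield 0.
sub instance_alive {
    my ($instance_id) = @_;

    return 0 if !defined($instance_id) || $instance_id !~ m/^([1-9][0-9]*):/;
    my $pid = $1;

    my $actual = eval { current_instance_id($pid) };
    return (defined($actual) && $actual eq $instance_id) ? 1 : 0;
}
```

In the patch's terms, a recorded 'waiting' or 'syncing' state is treated as stale whenever such a check fails, which is what sync() does via instance_exists() before deciding whether to abort.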