From mboxrd@z Thu Jan  1 00:00:00 1970
From: Fabian Ebner <f.ebner@proxmox.com>
To: pve-devel@lists.proxmox.com
Date: Tue, 11 Aug 2020 11:20:55 +0200
In-Reply-To: <20200810123557.22618-6-f.ebner@proxmox.com>
References: <20200810123557.22618-1-f.ebner@proxmox.com>
 <20200810123557.22618-6-f.ebner@proxmox.com>
Subject: Re: [pve-devel] [PATCH/RFC guest-common 6/6] job_status: return
 jobs with target local node
List-Id: Proxmox VE development discussion

There is another minor issue (with and without my patches):

If there is a job 123-0 for a guest on pve0 with source=target=pve0, and a
job 123-4 with source=pve0 and target=pve1, and we migrate the guest to
pve1, then switching source and target for job 123-4 is not possible,
because there already is a job with target pve0. Thus cfg->write() will
fail, and by extension job_status. (The switch_replication_job_target call
during migration fails for the same reason; a concrete config example
follows below the list.)

Possible solutions:

1. Instead of making such jobs (i.e. jobs with target=source) visible, in
the hope that a user would remove/fix them, we could automatically remove
them ourselves (this could be done as part of the
switch_replication_job_target function as well). Under normal conditions,
there shouldn't be any such jobs anyway.

2. Alternatively (or additionally), we could add checks to the
create/update API paths to ensure that the target is not the node the
guest is currently on.
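To make the problematic scenario concrete, replication.cfg would look
roughly like this before the migration (the schedule values are made up,
and I'm spelling out the source property explicitly):

    local: 123-0
        source pve0
        target pve0
        schedule */15

    local: 123-4
        source pve0
        target pve1
        schedule */15

After the guest moves to pve1, job_status wants to turn 123-4 into
source=pve1/target=pve0, but 123-0 already occupies target pve0 for this
guest, so cfg->write() fails.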
Option 2 would add a reason for using guest_migration locks in the
create/update paths, but I'm not sure we'd want that. The ability to
update job configurations while a replication is running is a feature
IMHO, and I think stealing guests might still lead to a bad configuration.
Therefore, I'd prefer option 1, which just adds a bit to the automatic
fixing we already do (a rough sketch of what that could look like follows
below the quoted patch).

@Fabian G.: Opinions?

On 10.08.20 at 14:35, Fabian Ebner wrote:
> even if not scheduled for removal, while adapting
> replicate to die gracefully except for the removal case.
> 
> Like this, such invalid jobs are not hidden from the user anymore
> (at least via the API; the GUI still hides them)
> 
> Signed-off-by: Fabian Ebner <f.ebner@proxmox.com>
> ---
> 
> I think it's a bit weird that such jobs only show up once
> they are scheduled for removal. I'll send a patch for the
> GUI too if we do want the new behavior.
> 
>  PVE/Replication.pm      | 3 +++
>  PVE/ReplicationState.pm | 5 +----
>  2 files changed, 4 insertions(+), 4 deletions(-)
> 
> diff --git a/PVE/Replication.pm b/PVE/Replication.pm
> index ae0f145..b5835bd 100644
> --- a/PVE/Replication.pm
> +++ b/PVE/Replication.pm
> @@ -207,6 +207,9 @@ sub replicate {
>  
>      die "not implemented - internal error" if $jobcfg->{type} ne 'local';
>  
> +    die "job target is local node\n" if $jobcfg->{target} eq $local_node
> +        && !$jobcfg->{remove_job};
> +
>      my $dc_conf = PVE::Cluster::cfs_read_file('datacenter.cfg');
>  
>      my $migration_network;
> diff --git a/PVE/ReplicationState.pm b/PVE/ReplicationState.pm
> index e486bc7..0b751bb 100644
> --- a/PVE/ReplicationState.pm
> +++ b/PVE/ReplicationState.pm
> @@ -261,10 +261,6 @@ sub job_status {
>          $cfg->switch_replication_job_target_nolock($vmid, $local_node, $jobcfg->{source})
>              if $local_node ne $jobcfg->{source};
>  
> -        my $target = $jobcfg->{target};
> -        # never sync to local node
> -        next if !$jobcfg->{remove_job} && $target eq $local_node;
> -
>          next if !$get_disabled && $jobcfg->{disable};
>  
>          my $state = extract_job_state($stateobj, $jobcfg);
> @@ -280,6 +276,7 @@ sub job_status {
>          } else {
>              if (my $fail_count = $state->{fail_count}) {
>                  my $members = PVE::Cluster::get_members();
> +                my $target = $jobcfg->{target};
>                  if (!$fail_count || ($members->{$target} && $members->{$target}->{online})) {
>                      $next_sync = $state->{last_try} + 60*($fail_count < 3 ? 5*$fail_count : 30);
>                  }
> 
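For option 1, a rough, untested sketch of what the automatic cleanup could
look like if it lives in the target-switching code. The method signature
and the $cfg->{ids} layout follow the call site visible in job_status
above; the 'guest' property and the rest of the internals are assumptions,
not the final implementation:

    sub switch_replication_job_target_nolock {
        my ($cfg, $vmid, $old_target, $new_target) = @_;

        for my $id (sort keys %{$cfg->{ids}}) {
            my $jobcfg = $cfg->{ids}->{$id};
            next if $jobcfg->{guest} != $vmid;

            # Jobs with target == source shouldn't exist under normal
            # conditions - drop them here, so they cannot collide with
            # the swapped job when the config is written out.
            if ($jobcfg->{target} eq $jobcfg->{source}) {
                delete $cfg->{ids}->{$id};
                next;
            }

            next if $jobcfg->{target} ne $old_target;

            # swap source and target for jobs pointing at the old target
            $jobcfg->{source} = $old_target;
            $jobcfg->{target} = $new_target;
        }

        $cfg->write();
    }

With something like that, job 123-0 from the example above would simply be
dropped during migration, and the swap for 123-4 would go through.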