Date: Thu, 13 Aug 2020 12:06:03 +0200
From: Fabian Grünbichler
To: Fabian Ebner, pve-devel@lists.proxmox.com
Message-Id: <1597312701.72ffjl19l1.astroid@nora.none>
References: <20200810123557.22618-1-f.ebner@proxmox.com>
 <20200810123557.22618-6-f.ebner@proxmox.com>
Subject: Re: [pve-devel] [PATCH/RFC guest-common 6/6] job_status: return jobs with target local node

On August 11, 2020 11:20 am, Fabian Ebner wrote:
> There is another minor issue (with and without my patches):
> If there is a job 123-0 for a guest on pve0 with source=target=pve0 and
> a job 123-4 with source=pve0 and target=pve1, and we migrate to pve1,
> then switching source and target for job 123-4 is not possible, because
> there already is a job with target pve0. Thus cfg->write() will fail,
> and by extension job_status. (Also the switch_replication_job_target
> call during migration fails for the same reason).

this patch also breaks replication_test2.pl in pve-manager..

>
> Possible solutions:
>
> 1. Instead of making such jobs (i.e. jobs with target=source) visible,
> in the hope that a user would remove/fix them, we could automatically
> remove them ourselves (this could be done as part of the
> switch_replication_job_target function as well). Under normal
> conditions, there shouldn't be any such jobs anyways.

I guess making them visible is fine, as long as they get filtered out /
warned about early on when actually doing a replication. We'd need to
look at the call sites to make sure that everybody handles this
correctly (and probably also adapt the test case, see above).
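
something along these lines at the call sites maybe - just a rough,
self-contained sketch with a hard-coded job hash standing in for what
job_status() would return, not actual pve-manager code:

    #!/usr/bin/perl
    use strict;
    use warnings;

    # stand-in data: the local node name and a job list as a caller
    # would see it once such jobs are no longer filtered in job_status()
    my $local_node = 'pve0';
    my $jobs = {
        '123-0' => { target => 'pve0', remove_job => 0 }, # broken: target == local node
        '123-4' => { target => 'pve1', remove_job => 0 }, # fine
    };

    foreach my $jobid (sort keys %$jobs) {
        my $jobcfg = $jobs->{$jobid};

        # skip (and warn about) jobs that would replicate to the node the
        # guest is already on, unless they are only scheduled for removal
        if ($jobcfg->{target} eq $local_node && !$jobcfg->{remove_job}) {
            warn "skipping replication job '$jobid' - target is the local node\n";
            next;
        }

        print "would run replication job '$jobid' to '$jobcfg->{target}'\n";
    }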
I would not remove them altogether/automatically - they could be the
result of an admin misediting the config, and we don't want to throw
away a potentially existing replication state if we don't have to..

>
> 2. Alternatively (or additionally), we could also add checks in the
> create/update API paths to ensure that the target is not the node the
> guest is on.
>
> Option 2 would add a reason for using guest_migration locks in the
> create/update paths. But I'm not sure we'd want that. The ability to
> update job configurations while a replication is running is a feature
> IMHO, and I think stealing guests might still lead to a bad
> configuration. Therefore, I'd prefer option 1, which just adds a bit to
> the automatic fixing we already do.

Yes, I'd check that source == current node and target != current node
(see my patch series ;)).

We could leave out the guest lock - worst case, it's possible to
add/modify a replication config such that it represents the
pre-migration source->target pair, which would get cleaned up on the
next run by job_status anyway, unless I am missing something?

>
> @Fabian G.: Opinions?
>
> Am 10.08.20 um 14:35 schrieb Fabian Ebner:
>> even if not scheduled for removal, while adapting
>> replicate to die gracefully except for the removal case.
>>
>> Like this such invalid jobs are not hidden from the user anymore
>> (at least via the API, the GUI still hides them)
>>
>> Signed-off-by: Fabian Ebner
>> ---
>>
>> I think it's a bit weird that such jobs only show up once
>> they are scheduled for removal. I'll send a patch for the
>> GUI too if we do want the new behavior.
>>
>>  PVE/Replication.pm      | 3 +++
>>  PVE/ReplicationState.pm | 5 +----
>>  2 files changed, 4 insertions(+), 4 deletions(-)
>>
>> diff --git a/PVE/Replication.pm b/PVE/Replication.pm
>> index ae0f145..b5835bd 100644
>> --- a/PVE/Replication.pm
>> +++ b/PVE/Replication.pm
>> @@ -207,6 +207,9 @@ sub replicate {
>>  
>>      die "not implemented - internal error" if $jobcfg->{type} ne 'local';
>>  
>> +    die "job target is local node\n" if $jobcfg->{target} eq $local_node
>> +	&& !$jobcfg->{remove_job};
>> +
>>      my $dc_conf = PVE::Cluster::cfs_read_file('datacenter.cfg');
>>  
>>      my $migration_network;
>> diff --git a/PVE/ReplicationState.pm b/PVE/ReplicationState.pm
>> index e486bc7..0b751bb 100644
>> --- a/PVE/ReplicationState.pm
>> +++ b/PVE/ReplicationState.pm
>> @@ -261,10 +261,6 @@ sub job_status {
>>  	$cfg->switch_replication_job_target_nolock($vmid, $local_node, $jobcfg->{source})
>>  	    if $local_node ne $jobcfg->{source};
>>  
>> -	my $target = $jobcfg->{target};
>> -	# never sync to local node
>> -	next if !$jobcfg->{remove_job} && $target eq $local_node;
>> -
>>  	next if !$get_disabled && $jobcfg->{disable};
>>  
>>  	my $state = extract_job_state($stateobj, $jobcfg);
>> @@ -280,6 +276,7 @@ sub job_status {
>>  	} else {
>>  	    if (my $fail_count = $state->{fail_count}) {
>>  		my $members = PVE::Cluster::get_members();
>> +		my $target = $jobcfg->{target};
>>  		if (!$fail_count || ($members->{$target} && $members->{$target}->{online})) {
>>  		    $next_sync = $state->{last_try} + 60*($fail_count < 3 ? 5*$fail_count : 30);
>>  		}
>>
>
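
for reference, the create/update-time check mentioned above could look
roughly like this - an untested sketch with placeholder variable names
($source, $target, $guest_node), not the actual API handler code:

    # die early if a job's nodes don't make sense for the guest's current node
    sub assert_job_nodes {
        my ($source, $target, $guest_node) = @_;

        die "source '$source' does not match the node the guest is on ('$guest_node')\n"
            if defined($source) && $source ne $guest_node;

        die "target '$target' must not be the node the guest is on ('$guest_node')\n"
            if $target eq $guest_node;
    }

    # e.g. for the scenario above (guest on pve0, job with target pve0):
    # assert_job_nodes('pve0', 'pve0', 'pve0') would die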