From mboxrd@z Thu Jan  1 00:00:00 1970
Date: Mon, 2 Mar 2026 15:52:37 +0100
Subject: Re: [PATCH qemu-server v2 06/14] migration: intra-cluster: check config can be parsed on target node
To: Daniel Kral, pve-devel@lists.proxmox.com
From: Fiona Ebner <f.ebner@proxmox.com>
References: <20260225151931.176335-1-f.ebner@proxmox.com> <20260225151931.176335-7-f.ebner@proxmox.com>
Content-Type: text/plain; charset=UTF-8
List-Id: Proxmox VE development discussion

On 27.02.26 at 11:30 AM, Daniel Kral wrote:
> On Wed Feb 25, 2026 at 4:18 PM CET, Fiona Ebner wrote:
>> diff --git a/src/PVE/API2/Qemu.pm b/src/PVE/API2/Qemu.pm
>> index 1f0864f5..47466513 100644
>> --- a/src/PVE/API2/Qemu.pm
>> +++ b/src/PVE/API2/Qemu.pm
>> @@ -5399,7 +5399,9 @@ __PACKAGE__->register_method({
>>          force => {
>>              type => 'boolean',
>>              description =>
>> -                "Allow to migrate VMs which use local devices. Only root may use this option.",
>> +                "Allow to migrate VMs which use local devices and for intra-cluster migration,"
>> +                . " configuration options not understood by the target. Only root may use this"
>> +                . " option.",
>
> HA-managed VMs are always migrated with force set, as it was assumed to
> be only used for local devices at the time [0]. This might need some
> adaptation so that LRM-initiated migrations won't cause problems for the
> VMs that this patch series wants to fix.
>
> [0] https://git.proxmox.com/?p=pve-ha-manager.git;a=blob;f=src/PVE/HA/Resources/PVEVM.pm;h=7586da84b7f19686b680d4e1434a17ffe1633d6d;hb=1a8d8bcef1934a43d37344caf965c082e55d451c#l116

Hmm, yes, it might be good to add another parameter like
'skip-config-check' rather than re-using 'force'.
Then we need a way to allow passing that along to HA migrations, but I
see you have posted [0] recently :)

> As we might want to quickly know which guests can be moved to which
> nodes in the future, e.g. for the load balancer to know which target
> nodes to consider, I briefly considered whether it could also make
> sense to have some config versioning, which is negotiated between the
> source and target node (e.g. qemu-server on the source node is lower
> than on the target node, so the VM can be migrated), but that might be
> too strict, especially for guests that don't even use the new config
> properties of the more recent qemu-server version.
>
> But maybe these load-balancing decisions can also be more
> coarse-grained than this fine-grained check for config compatibility,
> and implemented at a later time when it actually is needed.
>
> What do you think?

If we need the information for all nodes, there should be a cheap way to
get it, ideally not even an API call per node.

One idea is to broadcast the config schema and check whether parsing
with the schema from the other node works, but the schema is quite
large, even with 'description' and 'verbose_description' set to empty
strings:

[I] febner@dev9 ~/repos/pve/qemu-server (master)> ls -lh schema*.json
-rw-rw-r-- 1 febner febner 611K Mar  2 15:20 schema.json
-rw-rw-r-- 1 febner febner 365K Mar  2 15:25 schema-no-desc.json

This approach also has the limitation that changes in the parsing logic
itself would not be detected. For example, introducing support for
special sections was a change in the parsing logic.

Coming back to the general (not HA-only) situation: not having the check
in the preconditions means that users cannot yet select the force (or
skip-config-check) checkbox from the UI, which is also rather bad.
There, we could consider doing it with one API call per node, I suppose
(do it for the target upon selection change).

Or maybe we want to flip it around?
Proactively parse the configs from other nodes and broadcast the
information about which configs could and couldn't be parsed? That only
needs to be updated when configs change and might be relatively cheap.

Does anybody have opinions about that last idea?

>>          optional => 1,
>>      },
>>      migration_type => {
>> diff --git a/src/PVE/QemuMigrate.pm b/src/PVE/QemuMigrate.pm
>> index f7ec3227..901fe96d 100644
>> --- a/src/PVE/QemuMigrate.pm
>> +++ b/src/PVE/QemuMigrate.pm
>> @@ -355,6 +355,33 @@ sub prepare {
>>      my $cmd = [@{ $self->{rem_ssh} }, '/bin/true'];
>>      eval { $self->cmd_quiet($cmd); };
>>      die "Can't connect to destination address using public key\n" if $@;
>> +
>> +    if (!$self->{opts}->{force}) {
>> +        # Fork a short-lived tunnel for checking the config. Later, the proper tunnel with SSH
>> +        # forwarding info is forked.
>> +        my $tunnel = $self->fork_tunnel();
>> +        # Compared to remote migration, which also does volume activation, this only strictly
>> +        # parses the config, so no large timeout is needed. Unfortunately, mtunnel did not
>> +        # indicate that a command is unknown, but did not reply at all, so the timeout must be
>> +        # very low right now.
>> +        # FIXME PVE 10 - bump timeout, the trade-off between delaying backwards migration and
>> +        # giving the config check more time should now be in favor of config checking
>> +        eval {
>> +            my $nodename = PVE::INotify::nodename();
>> +            PVE::Tunnel::write_tunnel($tunnel, 3, "config $vmid $nodename");
>> +        };
>> +        if (my $err = $@) {
>> +            chomp($err);
>> +            # if there is no reply, assume the target did not know the command yet
>> +            if ($err =~ m/^no reply to command/) {
>> +                $self->log('info', "skipping strict configuration check (target too old?)");
>> +            } else {
>> +                die "$err - use --force to migrate regardless\n";
>
> Though unlikely (I couldn't hit `systemctl stop sshd` in time on the
> target node with a few tries ^^), write_tunnel(...) might fail with an
> $err that doesn't really explain why the migration failed.
> It might be better to filter here, or to explicitly prepend that the
> strict config check failed, and then add the full error message?

Good catch! Yes, in v3 I'll match that the error was actually the one
for a failed config check.

[0]: https://lore.proxmox.com/pve-devel/20260225143514.368884-1-d.kral@proxmox.com/
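P.S.: a minimal sketch of what that filtering could look like. The
"failed to parse config" prefix below is a made-up placeholder for the
config check's actual error message, which v3 would have to match
exactly; everything else follows the error handling already shown in
the hunk above.

```perl
use strict;
use warnings;

# Hypothetical helper: classify the tunnel error so that only an actual
# config-check failure gets the "--force" hint, while unrelated errors
# (e.g. the SSH connection dropping) are passed through unchanged.
# Returns an action ('skip' or 'die') and the message to log or die with.
sub migration_error_hint {
    my ($err) = @_;
    chomp($err);
    if ($err =~ m/^no reply to command/) {
        # target is likely too old to know the 'config' command yet
        return ('skip', "skipping strict configuration check (target too old?)");
    } elsif ($err =~ m/^failed to parse config/) {
        # only an actual config-check failure gets the --force hint
        return ('die', "$err - use --force to migrate regardless\n");
    }
    return ('die', "$err\n"); # unrelated failure, keep the original message
}
```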