Date: Wed, 28 May 2025 10:49:43 -0400
To: Fabian Grünbichler <f.gruenbichler@proxmox.com>,
 Proxmox VE development discussion <pve-devel@lists.proxmox.com>
References: <mailman.34.1748269920.395.pve-devel@lists.proxmox.com>
 <1594409888.20674.1748329375230@webmail.proxmox.com>
 <4773cf91-8a0f-453d-b9a1-11dcad1a193f@open-e.com>
 <29844581.21458.1748416013495@webmail.proxmox.com>
In-Reply-To: <29844581.21458.1748416013495@webmail.proxmox.com>
Message-ID: <9d2b0cf4-7037-491c-b4a4-81538e63376d@open-e.com>
List-Id: Proxmox VE development discussion <pve-devel.lists.proxmox.com>
From: Andrei Perapiolkin <andrei.perepiolkin@open-e.com>
Reply-To: Proxmox VE development discussion <pve-devel@lists.proxmox.com>
Subject: Re: [pve-devel] Volume live migration concurrency

Hi Fabian,

Thank you for the time you have dedicated to this issue.

>> My current understanding is that all assets related to snapshots should
>> be removed when the volume is deactivated, is that correct?
>> Or are all volumes and snapshots expected to be present across the entire
>> cluster until they are explicitly deleted?

> I am not quite sure what you mean by "present" - do you mean "exist in an
> activated state"?

Yes, I mean exists in an active state, i.e. activated.


>> How should the cleanup tasks be triggered across the remaining nodes?

> it should not be needed

Consider the following scenarios for live migration of a VM from 'node1' to
'node2':

1. An error occurs on 'node2', resulting in partial activation
2. An error occurs on 'node1', resulting in partial deactivation
3. Errors occur on both 'node1' and 'node2', resulting in dangling
artifacts remaining on both 'node1' and 'node2'

This can leave behind a partial activation (some artifacts might be created)
and a partial deactivation (some artifacts might remain uncleared).
Now, suppose the user unlocks the VM (if it was previously locked due to
the failure) and proceeds with another migration attempt, this time to
'node3', hoping for success.
What would happen to the artifacts on 'node1' and 'node2' in such a case?


Regarding the 'path' function:

In my case it is difficult to deterministically predict the actual path of
the device.
Determining this path essentially requires activating the volume.
That approach is questionable, as it implies calling activate_volume
without Proxmox being aware that the activation has occurred.
What would happen if a failure occurs within Proxmox before it reaches
the stage of officially activating the volume?

Additionally, I believe that providing the 'physical path' of a resource
that is not yet present (i.e. activated and usable) is questionable
practice.
It creates a risk, as there is always a temptation to use the path
directly, under the assumption that the resource is ready.

This approach assumes that all developers are fully aware that a given
$path might merely be a placeholder, and that additional activation is
required before use.
The issue becomes even more complex in larger code bases that integrate
third-party software, such as QEMU.
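
To illustrate, here is a rough, hypothetical sketch of the situation (it
assumes the usual PVE::Storage::Plugin calling convention; the by-id naming
and the plugin behaviour are made up, not our actual code):

    # Hypothetical plugin where the block device only appears after the
    # iSCSI session is logged in, which happens in activate_volume().
    sub path {
        my ($class, $scfg, $volname, $storeid, $snapname) = @_;
        # We can compute a *predicted* device path here, but the device
        # node does not exist yet - it only shows up after activate_volume().
        my $predicted = "/dev/disk/by-id/scsi-$volname";    # placeholder
        return wantarray ? ($predicted, undef, 'images') : $predicted;
    }

    sub activate_volume {
        my ($class, $storeid, $scfg, $volname, $snapname, $cache) = @_;
        # Only here does the device node actually appear (iSCSI login,
        # multipath setup, etc.). Anything that opens the path returned
        # above before this completes is racing against the storage layer.
        ...
    }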

I might be mistaken, but during my experiments with the 'path' function,
I encountered an error where the virtualization system failed to open a
volume that had not been fully activated.
Perhaps this has been addressed in newer versions, but previously there
appeared to be a race condition between volume activation and QEMU
attempting to operate on the expected block device path.


Andrei

On 5/28/25 03:06, Fabian Grünbichler wrote:
>> Andrei Perapiolkin <andrei.perepiolkin@open-e.com> wrote on 27.05.2025 18:08 CEST:
>>
>>
>>> 3. In the context of live migration: Will Proxmox skip calling
>>> /deactivate_volume/ for snapshots that have already been activated?
>>> Should the storage plugin explicitly deactivate all snapshots of a
>>> volume during migration?
>>> a live migration is not concerned with snapshots of shared volumes, and local
>>> volumes are removed on the source node after the migration has finished..
>>>
>>> but maybe you could expand this part?
>> My original idea was that since both 'activate_volume' and
>> 'deactivate_volume' methods have a 'snapname' argument they would both
>> be used to activate and deactivate snapshots respectively.
>> And for each snapshot activation, there would be a corresponding
>> deactivation.
> deactivating volumes (and snapshots) is a lot trickier than activating
> them, because you might have multiple readers in parallel that we don't
> know about.
>
> so if you have the following pattern
>
> activate
> do something
> deactivate
>
> and two instances of that are interleaved:
>
> A: activate
> B: activate
> A: do something
> A: deactivate
> B: do something -> FAILURE, volume not active
>
> you have a problem.
>
> that's why we deactivate in special circumstances:
> - as part of error handling for freshly activated volumes
> - as part of migration when finally stopping the source VM or before
>    freeing local source volumes
> - ..
>
> where we can be reasonably sure that no other user exists, or it is
> required for safety purposes.
>
> otherwise, we'd need to do refcounting on volume activations and have
> some way to hook that for external users, to avoid premature deactivation.
>
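
[As an aside: a rough sketch of what such refcounting could look like if done
inside a plugin - purely illustrative, with made-up helper functions, not
existing PVE code; and it still would not cover external users that bypass
the plugin, which is exactly the problem described above.]

    # Illustrative only: keep a per-volume activation count under a
    # suitable lock and only tear things down for the last user.
    sub activate_volume {
        my ($class, $storeid, $scfg, $volname, $snapname, $cache) = @_;
        my $key   = defined $snapname ? "$volname\@$snapname" : $volname;
        my $count = _bump_refcount($storeid, $key, +1);    # made-up helper
        _login_and_setup($scfg, $volname, $snapname) if $count == 1;
    }

    sub deactivate_volume {
        my ($class, $storeid, $scfg, $volname, $snapname, $cache) = @_;
        my $key   = defined $snapname ? "$volname\@$snapname" : $volname;
        my $count = _bump_refcount($storeid, $key, -1);    # made-up helper
        _teardown($scfg, $volname, $snapname) if $count <= 0;
    }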
>> However, from observing the behavior during migration, I found that
>> 'deactivate_volume' is not called for snapshots that were previously
>> activated with 'activate_volume'.
> were they activated for the migration? or for cloning from a snapshot?
> or ..?
>
> maybe there is a call path that should deactivate that snapshot after using
> it..
>
>> Therefore, I assumed that 'deactivate_volume' is responsible for
>> deactivating all snapshots related to the volume that was previously
>> activated.
>> The purpose of this question was to confirm this.
>>
>> From your response, I conclude the following:
>> 1. Migration does not manage (i.e. it does not activate or deactivate)
>> volume snapshots.
> that really depends. a storage migration might activate a snapshot if
> that is required for transferring the volume. this mostly applies to
> offline migration or unused volumes though, and only for some storages.
>
>> 2. All volumes are expected to be present across all nodes in the cluster,
>> for the 'path' function to work.
> if at all possible, path should just do a "logical" conversion of volume ID
> to a stable/deterministic path, or the information required for Qemu to
> access the volume if no path exists. ideally, this means it works without
> activating the volume, but it might require querying the storage.
>
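
[For comparison, this is how I read the purely "logical" conversion described
above: the returned path is derived from the volume name alone, with no storage
queries; the naming scheme and the udev/multipath aliasing behind it are
hypothetical.]

    # Purely string-based mapping: no storage access, no activation.
    sub path {
        my ($class, $scfg, $volname, $storeid, $snapname) = @_;
        my (undef, $name) = $class->parse_volname($volname);
        # Hypothetical stable alias that activate_volume() would create
        # via udev/multipath rules, so the name is deterministic even
        # while the volume is not activated.
        my $path = "/dev/mapper/$storeid-$name";
        $path .= "-snap-$snapname" if defined $snapname;
        return wantarray ? ($path, undef, 'images') : $path;
    }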
>> 3. For migration to work, the volume should be simultaneously present on both
>> nodes.
> for a live migration and shared storage, yes. for an offline migration with
> shared storage, the VM is never started on the target node, so no volume
> activation is required until that happens later. for local storages, volumes
> only exist on one node anyway (they are copied during the migration).
>
>> However, I couldn't find explicit instructions or guides on when and by
>> whom volume snapshot deactivation should be triggered.
> yes, this is a bit under-specified unfortunately. we are currently working
> on improving the documentation (and the storage plugin API).
>
>> Is it possible for a volume snapshot to remain active after the
>> volume itself was deactivated?
> I'd have to check all the code paths to give an answer to that.
> snapshots are rarely activated in general - IIRC mostly for
> - cloning from a snapshot
> - replication (limited to ZFS at the moment)
> - storage migration
>
> so I just did that:
> - cloning from a snapshot only deactivates if the clone is to a different
>    node, for both VM and CT -> see below
> - CT backup in snapshot mode deletes the snapshot which implies deactivation
> - storage_migrate (move_disk or offline migration) if a snapshot is passed,
>    IIRC this only affects ZFS, which doesn't do activation anyway
>
>> During testing of Proxmox 8.2 I've encountered situations where cloning a
>> volume from a snapshot did not result in snapshot deactivation.
>> This leads to the creation of 'dangling' snapshots if the volume is
>> later migrated.
> ah, that probably answers my question above.
>
> I think this might be one of those cases where deactivation is hard - you
> can have multiple clones from the same source VM running in parallel, and
> only the last one would be allowed to deactivate the snapshot/volume..
>
>> My current understanding is that all assets related to snapshots should
>> be removed when the volume is deactivated, is that correct?
>> Or are all volumes and snapshots expected to be present across the entire
>> cluster until they are explicitly deleted?
> I am not quite sure what you mean by "present" - do you mean "exist in an
> activated state"?
>
>> The second option requires additional recommendations on artifact management.
>> Maybe this should be sent as a separate email, but I'll draft it here.
>>
>> If all volumes and snapshots are consistently present across the entire
>> cluster, and their creation/operation results in the creation of additional
>> artifacts (such as iSCSI targets, multipath sessions, etc.), then these
>> artifacts should be removed on deletion of the associated volume or snapshot.
>> Currently, it is unclear how all nodes in the cluster are notified of
>> such a deletion, as only one node in the cluster receives the 'free_image' or
>> 'volume_snapshot_delete' request.
>> What is the proper way to instruct the plugin on other nodes in the cluster
>> that a given volume/snapshot is requested for deletion and all artifacts
>> related to it have to be removed?
> I now get where you are coming from I think! a volume should only be active
> on a single node, except during a live migration, where the source node
> will always get a deactivation call at the end.
>
> deactivating a volume should also tear down related, volume-specific
> resources, if applicable.
>
>> How should the cleanup tasks be triggered across the remaining nodes?
> it should not be needed, but I think you've found an edge case where we
> need to improve.
>
> I think our RBD plugin is also affected by this, all the other plugins
> either:
> - don't support snapshots (or cloning from them)
> - are local only
> - don't need any special activation/deactivation
>
> I think the safe approach is likely to deactivate all snapshots when
> deactivating the volume itself, for now.
>
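
[If it helps, this is roughly how I read that "safe approach" for a plugin like
ours - a sketch only, with a made-up helper that lists the snapshots of a
volume that still have artifacts active on the local node.]

    sub deactivate_volume {
        my ($class, $storeid, $scfg, $volname, $snapname, $cache) = @_;
        if (!defined $snapname) {
            # Made-up helper: snapshots with artifacts (iSCSI sessions,
            # multipath maps, ...) still active locally for this volume.
            for my $snap (_active_snapshots($scfg, $volname)) {
                $class->deactivate_volume($storeid, $scfg, $volname, $snap, $cache);
            }
        }
        _teardown_artifacts($scfg, $volname, $snapname);    # made-up helper
    }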

