Date: Thu, 29 May 2025 10:33:14 +0300
To: Fabian Grünbichler <f.gruenbichler@proxmox.com>
From: Denis Kanchev via pve-devel <pve-devel@lists.proxmox.com>
Cc: Denis Kanchev <denis.kanchev@storpool.com>,
 Wolfgang Bumiller <w.bumiller@proxmox.com>,
 Proxmox VE development discussion <pve-devel@lists.proxmox.com>
Subject: Re: [pve-devel] PVE child process behavior question

The issue here is that the storage plugin's activate_volume() is called after
the migration has been cancelled, which in the case of network-shared storage
can cause real problems.
This is a sort of race condition, because migrate_cancel won't stop the
storage migration on the remote server. As you can see below, a call to
activate_volume() is performed after migrate_cancel.
In this case we detach the volume from the old node (to keep the data
consistent) and end up with a VM (not migrated) that no longer has this
volume attached.
We track whether activate_volume() is being called as part of a migration via
the 'lock' => 'migrate' flag, which is cleared on migrate_cancel - during a
migration we won't detach the volume from the old VM (a simplified sketch of
this check is included below).
In short: when the parent of this storage migration task gets killed, the
source node stops the migration, but the storage migration on the
destination node continues.
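
For illustration, here is a simplified sketch of that check inside
activate_volume() - this is not the exact StorPool plugin code, and the VM
config lookup via PVE::QemuConfig is only an assumption made for the example:

sub activate_volume {
    my ($class, $storeid, $scfg, $volname, $snapname, $cache) = @_;

    # find out which VM owns this volume (parse_volname() is part of the
    # regular PVE::Storage::Plugin API)
    my (undef, undef, $vmid) = $class->parse_volname($volname);

    my $is_live_migration = 0;
    if (defined($vmid)) {
        # assumed lookup: read the VM config and check for a 'migrate' lock
        my $conf = eval { PVE::QemuConfig->load_config($vmid) };
        $is_live_migration = 1
            if $conf && ($conf->{lock} // '') eq 'migrate';
    }

    if (!$is_live_migration) {
        # no 'migrate' lock (e.g. it was already cleared by migrate_cancel):
        # force-detach the volume from the old node to keep the data consistent
    }

    # ... normal volume activation continues here ...
}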

Source node:
2025-04-11 03:26:50 starting migration of VM 2421 to node 'telpr01pve03' (10.10.17.3)
2025-04-11 03:26:50 starting VM 2421 on remote node 'telpr01pve03'
2025-04-11 03:26:52 ERROR: online migrate failure - remote command failed with exit code 255
2025-04-11 03:26:52 aborting phase 2 - cleanup resources
2025-04-11 03:26:52 migrate_cancel # <<< NOTE the time
2025-04-11 03:26:53 ERROR: migration finished with problems (duration 00:00:03)
TASK ERROR: migration problems

Destination node:
2025-04-11T03:26:51.559671+07:00 telpr01pve03 qm[3670216]: <root@pam> starting task UPID:telpr01pve03:003800D4:00928867:67F8298B:qmstart:2421:root@pam:
2025-04-11T03:26:51.559897+07:00 telpr01pve03 qm[3670228]: start VM 2421: UPID:telpr01pve03:003800D4:00928867:67F8298B:qmstart:2421:root@pam:
2025-04-11T03:26:51.837905+07:00 telpr01pve03 qm[3670228]: StorPool plugin: Volume ~bj7n.b.abe is related to VM 2421, checking status
    ### Call to PVE::Storage::Plugin::activate_volume()
2025-04-11T03:26:53.072206+07:00 telpr01pve03 qm[3670228]: StorPool plugin: NOT a live migration of VM 2421, will force detach volume ~bj7n.b.abe
    ### 'lock' flag missing
2025-04-11T03:26:53.108206+07:00 telpr01pve03 qm[3670228]: StorPool plugin: Volume ~bj7n.b.sdj is related to VM 2421, checking status
    ### Second call to activate_volume() after migrate_cancel
2025-04-11T03:26:53.903357+07:00 telpr01pve03 qm[3670228]: StorPool plugin: NOT a live migration of VM 2421, will force detach volume ~bj7n.b.sdj
    ### 'lock' flag missing



On Wed, May 28, 2025 at 9:33 AM Fabian Grünbichler <f.gruenbichler@proxmox.com> wrote:

>
> > Denis Kanchev <denis.kanchev@storpool.com> wrote on 28.05.2025 08:13 CEST:
> >
> >
> > Here is the task log
> > 2025-04-11 03:45:42 starting migration of VM 2282 to node 'telpr01pve05' (10.10.17.5)
> > 2025-04-11 03:45:42 starting VM 2282 on remote node 'telpr01pve05'
> > 2025-04-11 03:45:45 [telpr01pve05] Warning: sch_htb: quantum of class 10001 is big. Consider r2q change.
> > 2025-04-11 03:45:46 [telpr01pve05] Dump was interrupted and may be inconsistent.
> > 2025-04-11 03:45:46 [telpr01pve05] kvm: failed to find file '/usr/share/qemu-server/bootsplash.jpg'
> > 2025-04-11 03:45:46 start remote tunnel
> > 2025-04-11 03:45:46 ssh tunnel ver 1
> > 2025-04-11 03:45:46 starting online/live migration on unix:/run/qemu-server/2282.migrate
> > 2025-04-11 03:45:46 set migration capabilities
> > 2025-04-11 03:45:46 migration downtime limit: 100 ms
> > 2025-04-11 03:45:46 migration cachesize: 4.0 GiB
> > 2025-04-11 03:45:46 set migration parameters
> > 2025-04-11 03:45:46 start migrate command to unix:/run/qemu-server/2282.migrate
> > 2025-04-11 03:45:47 migration active, transferred 152.2 MiB of 24.0 GiB VM-state, 162.1 MiB/s
> > ...
> > 2025-04-11 03:46:49 migration active, transferred 15.2 GiB of 24.0 GiB VM-state, 2.0 GiB/s
> > 2025-04-11 03:46:50 migration status error: failed
> > 2025-04-11 03:46:50 ERROR: online migrate failure - aborting
> > 2025-04-11 03:46:50 aborting phase 2 - cleanup resources
> > 2025-04-11 03:46:50 migrate_cancel
> > 2025-04-11 03:46:52 ERROR: migration finished with problems (duration 00:01:11)
> > TASK ERROR: migration problems
>
> okay, so no local disks involved.. not sure which process got killed then? ;)
> the state transfer happens entirely within the Qemu process, perl is just
> polling it to print the status, and that perl task worker is not OOM killed
> since it continues to print all the error handling messages..
>
> > > that has weird implications with regards to threads, so I don't think
> > > that is a good idea..
> > What do you mean by that? Are any threads involved?
>
> not intentionally, no. the issue is that the whole "pr_set_deathsig"
> machinery works on the thread level, not the process level for historical
> reasons. so it actually would kill the child if the thread that called
> pr_set_deathsig exits..
>
> I think we do want to improve how run_command handles the parent disappearing.
> but it's not that straight-forward to implement in a race-free fashion (in Perl).
>
>
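
For context on the pr_set_deathsig point above: per prctl(2), PR_SET_PDEATHSIG
is tied to the thread that created the child, not to the parent process as a
whole, so the child is signalled as soon as that thread exits. A minimal
standalone illustration (it uses the CPAN Linux::Prctl module purely for
demonstration - this is not what pve-common does today):

use strict;
use warnings;
use POSIX ();
use Linux::Prctl ();    # CPAN wrapper around prctl(2)

my $pid = fork() // die "fork failed: $!";
if ($pid == 0) {
    # child: ask the kernel to send us SIGTERM when our "parent" goes away.
    # Per prctl(2) the "parent" here is the thread that forked us, so the
    # signal arrives as soon as that thread exits, even if other threads of
    # the parent process are still running.
    Linux::Prctl::set_pdeathsig(POSIX::SIGTERM());
    POSIX::pause();     # wait until a signal arrives
    exit(0);
}
waitpid($pid, 0);

It also shows the race mentioned above: if the parent dies before the child
reaches set_pdeathsig(), no signal is ever delivered, which is part of why a
race-free run_command solution is not straightforward.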
