From: Denis Kanchev via pve-devel <pve-devel@lists.proxmox.com>
Date: Thu, 29 May 2025 10:33:14 +0300
To: Fabian Grünbichler <f.gruenbichler@proxmox.com>
Cc: Denis Kanchev <denis.kanchev@storpool.com>, Wolfgang Bumiller <w.bumiller@proxmox.com>, Proxmox VE development discussion <pve-devel@lists.proxmox.com>
Subject: Re: [pve-devel] PVE child process behavior question
In-Reply-To: <11746909.21389.1748414016786@webmail.proxmox.com>

The issue here is that the storage plugin's activate_volume() is called after migrate_cancel, which in the case of network-shared storage can cause real damage. This is a sort of race condition, because migrate_cancel won't stop the storage migration on the remote server. As you can see below, a call to activate_volume() is performed after migrate_cancel. In this case we detach the volume from the old node (to keep the data consistent) and end up with a VM (not migrated) that no longer has this volume attached. We track whether activate_volume() is called as part of a migration via the VM config flag 'lock' => 'migrate', which is cleared on migrate_cancel - during a migration we do not detach the volume from the old VM (a simplified sketch of this check is at the end of this message).

In short: when the parent of this storage migration task gets killed, the source node stops the migration, but the storage migration on the destination node continues.

Source node:
2025-04-11 03:26:50 starting migration of VM 2421 to node 'telpr01pve03' (10.10.17.3)
2025-04-11 03:26:50 starting VM 2421 on remote node 'telpr01pve03'
2025-04-11 03:26:52 ERROR: online migrate failure - remote command failed with exit code 255
2025-04-11 03:26:52 aborting phase 2 - cleanup resources
2025-04-11 03:26:52 migrate_cancel    # <<< NOTE the time
2025-04-11 03:26:53 ERROR: migration finished with problems (duration 00:00:03)
TASK ERROR: migration problems

Destination node:
2025-04-11T03:26:51.559671+07:00 telpr01pve03 qm[3670216]: <root@pam> starting task UPID:telpr01pve03:003800D4:00928867:67F8298B:qmstart:2421:root@pam:
2025-04-11T03:26:51.559897+07:00 telpr01pve03 qm[3670228]: start VM 2421: UPID:telpr01pve03:003800D4:00928867:67F8298B:qmstart:2421:root@pam:
2025-04-11T03:26:51.837905+07:00 telpr01pve03 qm[3670228]: StorPool plugin: Volume ~bj7n.b.abe is related to VM 2421, checking status    ### call to PVE::Storage::Plugin::activate_volume()
2025-04-11T03:26:53.072206+07:00 telpr01pve03 qm[3670228]: StorPool plugin: NOT a live migration of VM 2421, will force detach volume ~bj7n.b.abe    ### 'lock' flag missing
2025-04-11T03:26:53.108206+07:00 telpr01pve03 qm[3670228]: StorPool plugin: Volume ~bj7n.b.sdj is related to VM 2421, checking status    ### second call to activate_volume() after migrate_cancel
2025-04-11T03:26:53.903357+07:00 telpr01pve03 qm[3670228]: StorPool plugin: NOT a live migration of VM 2421, will force detach volume ~bj7n.b.sdj    ### 'lock' flag missing

On Wed, May 28, 2025 at 9:33 AM Fabian Grünbichler <f.gruenbichler@proxmox.com> wrote:

> > Denis Kanchev <denis.kanchev@storpool.com> wrote on 28.05.2025 08:13 CEST:
> >
> > Here is the task log
> > 2025-04-11 03:45:42 starting migration of VM 2282 to node 'telpr01pve05' (10.10.17.5)
> > 2025-04-11 03:45:42 starting VM 2282 on remote node 'telpr01pve05'
> > 2025-04-11 03:45:45 [telpr01pve05] Warning: sch_htb: quantum of class 10001 is big. Consider r2q change.
> > 2025-04-11 03:45:46 [telpr01pve05] Dump was interrupted and may be inconsistent.
> > 2025-04-11 03:45:46 [telpr01pve05] kvm: failed to find file '/usr/share/qemu-server/bootsplash.jpg'
> > 2025-04-11 03:45:46 start remote tunnel
> > 2025-04-11 03:45:46 ssh tunnel ver 1
> > 2025-04-11 03:45:46 starting online/live migration on unix:/run/qemu-server/2282.migrate
> > 2025-04-11 03:45:46 set migration capabilities
> > 2025-04-11 03:45:46 migration downtime limit: 100 ms
> > 2025-04-11 03:45:46 migration cachesize: 4.0 GiB
> > 2025-04-11 03:45:46 set migration parameters
> > 2025-04-11 03:45:46 start migrate command to unix:/run/qemu-server/2282.migrate
> > 2025-04-11 03:45:47 migration active, transferred 152.2 MiB of 24.0 GiB VM-state, 162.1 MiB/s
> > ...
> > 2025-04-11 03:46:49 migration active, transferred 15.2 GiB of 24.0 GiB VM-state, 2.0 GiB/s
> > 2025-04-11 03:46:50 migration status error: failed
> > 2025-04-11 03:46:50 ERROR: online migrate failure - aborting
> > 2025-04-11 03:46:50 aborting phase 2 - cleanup resources
> > 2025-04-11 03:46:50 migrate_cancel
> > 2025-04-11 03:46:52 ERROR: migration finished with problems (duration 00:01:11)
> > TASK ERROR: migration problems
>
> okay, so no local disks involved.. not sure which process got killed then? ;)
> the state transfer happens entirely within the Qemu process, perl is just polling
> it to print the status, and that perl task worker is not OOM-killed, since it
> continues to print all the error handling messages..
>
> > > that has weird implications with regards to threads, so I don't think that
> > > is a good idea..
> > What do you mean by that? Are any threads involved?
>
> not intentionally, no. the issue is that the whole "pr_set_deathsig" machinery
> works on the thread level, not the process level, for historical reasons. so it
> actually would kill the child if the thread that called pr_set_deathsig exits..
>
> I think we do want to improve how run_command handles the parent disappearing,
> but it's not that straightforward to implement in a race-free fashion (in Perl).
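
For reference, here is a minimal sketch of the check described above - simplified, not the actual StorPool plugin code. vmid_for_volume() is a made-up placeholder for however a plugin maps a volume back to its owning VM, and the decision logic is reduced to the 'lock' => 'migrate' test:

package PVE::Storage::Custom::ExamplePlugin;    # hypothetical example plugin, not the StorPool one

use strict;
use warnings;

use PVE::QemuConfig;

use base qw(PVE::Storage::Plugin);

sub activate_volume {
    my ($class, $storeid, $scfg, $volname, $snapname, $cache) = @_;

    # hypothetical helper: map the volume back to the VM that owns it
    my $vmid = vmid_for_volume($volname);

    if (defined($vmid)) {
        my $conf = PVE::QemuConfig->load_config($vmid);

        if (($conf->{lock} // '') eq 'migrate') {
            # live migration in progress: keep the volume attached on the
            # source node and only make it accessible here as well
            warn "volume $volname: VM $vmid is live-migrating, keeping source attachment\n";
        } else {
            # no 'migrate' lock (e.g. already cleared by migrate_cancel):
            # force-detach elsewhere so there is only a single writer
            warn "volume $volname: no migration lock on VM $vmid, force detaching\n";
            # ... storage-specific detach of the volume on other nodes ...
        }
    }

    # ... storage-specific attach/activation of the volume on this node ...
    return;
}

1;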
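
And, on the run_command point, a rough self-contained illustration in plain Perl (not a proposal for PVE::Tools itself) of the polling way for a forked worker to notice that its parent is gone. The drawback is the polling latency - everything between two checks still runs with the parent already dead - while the poll-free PR_SET_PDEATHSIG route has the thread-level caveat described above:

#!/usr/bin/perl
use strict;
use warnings;

# remember who the parent will be before forking
my $parent_pid = $$;

my $pid = fork();
die "fork failed: $!" if !defined($pid);

if ($pid == 0) {
    # child / worker
    while (1) {
        # once the parent dies the child is reparented (to init or a
        # subreaper), so getppid() no longer matches the remembered pid
        if (getppid() != $parent_pid) {
            warn "parent $parent_pid disappeared, aborting\n";
            exit(1);
        }
        # ... one slice of the actual long-running work would go here ...
        sleep(1);
    }
}

# parent: wait for the worker as usual
waitpid($pid, 0);
exit(0);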