From: Denis Kanchev via pve-devel <pve-devel@lists.proxmox.com>
To: "Fabian Grünbichler" <f.gruenbichler@proxmox.com>
Cc: Denis Kanchev <denis.kanchev@storpool.com>,
	Wolfgang Bumiller <w.bumiller@proxmox.com>,
	Proxmox VE development discussion <pve-devel@lists.proxmox.com>
Subject: Re: [pve-devel] PVE child process behavior question
Date: Thu, 29 May 2025 10:33:14 +0300	[thread overview]
Message-ID: <mailman.91.1748504040.395.pve-devel@lists.proxmox.com> (raw)
In-Reply-To: <11746909.21389.1748414016786@webmail.proxmox.com>

The issue here is that the storage plugin's activate_volume() is called after the migration has been cancelled, which can cause real trouble with network-shared storage.
It is effectively a race condition: migrate_cancel does not stop the storage migration already under way on the remote node, and as the logs below show, activate_volume() is still called after migrate_cancel.
In that case our plugin detaches the volume from the old node (to keep the data consistent), and we end up with a VM that was never migrated but no longer has this volume attached.
We track whether activate_volume() is running as part of a migration via the 'lock' => 'migrate' flag in the guest config - during a migration we do not detach the volume from the old VM. That flag, however, is cleared by migrate_cancel, so the late call no longer sees it.
In short: when the parent of this storage migration task gets killed, the source node stops the migration, but the storage migration on the destination node continues.
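For reference, the check in question looks roughly like this - a minimal sketch rather than the actual StorPool plugin code; detach_volume_from_other_nodes() is a hypothetical stand-in for our detach logic, and the sketch assumes the guest config can be read via PVE::QemuConfig:

use PVE::QemuConfig;

sub activate_volume {
    my ($class, $storeid, $scfg, $volname, $snapname, $cache) = @_;

    # figure out which guest this volume belongs to
    my (undef, undef, $vmid) = $class->parse_volname($volname);

    # during a live migration qemu-server sets 'lock' => 'migrate' in the
    # guest config, but migrate_cancel clears it again, so a late
    # activate_volume() call no longer sees the flag
    my $conf = PVE::QemuConfig->load_config($vmid);
    my $is_live_migration = defined($conf->{lock}) && $conf->{lock} eq 'migrate';

    if (!$is_live_migration) {
        # not (or no longer) a migration: force-detach the volume from any
        # other node to keep the data consistent - this is the step that
        # hurts when it runs after migrate_cancel, because the not-migrated
        # source VM still needs the volume
        detach_volume_from_other_nodes($scfg, $volname);
    }

    # ... attach the volume on this node as usual ...
    return 1;
}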

Source node:
2025-04-11 03:26:50 starting migration of VM 2421 to node 'telpr01pve03' (10.10.17.3)
2025-04-11 03:26:50 starting VM 2421 on remote node 'telpr01pve03'
2025-04-11 03:26:52 ERROR: online migrate failure - remote command failed with exit code 255
2025-04-11 03:26:52 aborting phase 2 - cleanup resources
2025-04-11 03:26:52 migrate_cancel    # <<< NOTE the time
2025-04-11 03:26:53 ERROR: migration finished with problems (duration 00:00:03)
TASK ERROR: migration problems

Destination node:
2025-04-11T03:26:51.559671+07:00 telpr01pve03 qm[3670216]: <root@pam> starting task UPID:telpr01pve03:003800D4:00928867:67F8298B:qmstart:2421:root@pam:
2025-04-11T03:26:51.559897+07:00 telpr01pve03 qm[3670228]: start VM 2421: UPID:telpr01pve03:003800D4:00928867:67F8298B:qmstart:2421:root@pam:
2025-04-11T03:26:51.837905+07:00 telpr01pve03 qm[3670228]: StorPool plugin: Volume ~bj7n.b.abe is related to VM 2421, checking status    ### Call to PVE::Storage::Plugin::activate_volume()
2025-04-11T03:26:53.072206+07:00 telpr01pve03 qm[3670228]: StorPool plugin: NOT a live migration of VM 2421, will force detach volume ~bj7n.b.abe    ### 'lock' flag missing
2025-04-11T03:26:53.108206+07:00 telpr01pve03 qm[3670228]: StorPool plugin: Volume ~bj7n.b.sdj is related to VM 2421, checking status    ### Second call to activate_volume() after migrate_cancel
2025-04-11T03:26:53.903357+07:00 telpr01pve03 qm[3670228]: StorPool plugin: NOT a live migration of VM 2421, will force detach volume ~bj7n.b.sdj    ### 'lock' flag missing



On Wed, May 28, 2025 at 9:33 AM Fabian Grünbichler <f.gruenbichler@proxmox.com> wrote:

>
> > Denis Kanchev <denis.kanchev@storpool.com> wrote on 28.05.2025 08:13 CEST:
> >
> >
> > Here is the task log
> > 2025-04-11 03:45:42 starting migration of VM 2282 to node 'telpr01pve05' (10.10.17.5)
> > 2025-04-11 03:45:42 starting VM 2282 on remote node 'telpr01pve05'
> > 2025-04-11 03:45:45 [telpr01pve05] Warning: sch_htb: quantum of class 10001 is big. Consider r2q change.
> > 2025-04-11 03:45:46 [telpr01pve05] Dump was interrupted and may be inconsistent.
> > 2025-04-11 03:45:46 [telpr01pve05] kvm: failed to find file '/usr/share/qemu-server/bootsplash.jpg'
> > 2025-04-11 03:45:46 start remote tunnel
> > 2025-04-11 03:45:46 ssh tunnel ver 1
> > 2025-04-11 03:45:46 starting online/live migration on unix:/run/qemu-server/2282.migrate
> > 2025-04-11 03:45:46 set migration capabilities
> > 2025-04-11 03:45:46 migration downtime limit: 100 ms
> > 2025-04-11 03:45:46 migration cachesize: 4.0 GiB
> > 2025-04-11 03:45:46 set migration parameters
> > 2025-04-11 03:45:46 start migrate command to unix:/run/qemu-server/2282.migrate
> > 2025-04-11 03:45:47 migration active, transferred 152.2 MiB of 24.0 GiB VM-state, 162.1 MiB/s
> > ...
> > 2025-04-11 03:46:49 migration active, transferred 15.2 GiB of 24.0 GiB VM-state, 2.0 GiB/s
> > 2025-04-11 03:46:50 migration status error: failed
> > 2025-04-11 03:46:50 ERROR: online migrate failure - aborting
> > 2025-04-11 03:46:50 aborting phase 2 - cleanup resources
> > 2025-04-11 03:46:50 migrate_cancel
> > 2025-04-11 03:46:52 ERROR: migration finished with problems (duration 00:01:11)
> > TASK ERROR: migration problems
>
> okay, so no local disks involved.. not sure which process got killed then? ;)
> the state transfer happens entirely within the Qemu process, perl is just
> polling it to print the status, and that perl task worker is not OOM killed
> since it continues to print all the error handling messages..
>
> > > that has weird implications with regards to threads, so I don't think
> > > that is a good idea..
> > What do you mean by that? Are any threads involved?
>
> not intentionally, no. the issue is that the whole "pr_set_deathsig" machinery
> works on the thread level, not the process level, for historical reasons. so it
> actually would kill the child if the thread that called pr_set_deathsig exits..
>
> I think we do want to improve how run_command handles the parent disappearing,
> but it's not that straight-forward to implement in a race-free fashion (in Perl).
>
>
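To make the thread-level semantics concrete, here is a tiny stand-alone prctl() sketch - not how run_command currently works, and the syscall number (157) and PR_SET_PDEATHSIG value (1) are x86_64 Linux assumptions:

use strict;
use warnings;
use POSIX qw(SIGTERM);

use constant SYS_prctl        => 157;  # x86_64 only
use constant PR_SET_PDEATHSIG => 1;

my $pid = fork() // die "fork failed: $!";
if ($pid == 0) {
    # child: ask the kernel to deliver SIGTERM when the parent goes away;
    # per prctl(2) the "parent" here means the thread that created us,
    # which is why this machinery gets awkward once threads are involved
    syscall(SYS_prctl, PR_SET_PDEATHSIG, SIGTERM) != -1
        or die "prctl(PR_SET_PDEATHSIG) failed: $!";
    exec('sleep', '3600') or die "exec failed: $!";
}
waitpid($pid, 0);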
