From: "Fabian Grünbichler" <f.gruenbichler@proxmox.com>
To: Denis Kanchev <denis.kanchev@storpool.com>
Cc: Wolfgang Bumiller <w.bumiller@proxmox.com>,
Proxmox VE development discussion <pve-devel@lists.proxmox.com>
Subject: Re: [pve-devel] PVE child process behavior question
Date: Mon, 2 Jun 2025 09:37:17 +0200 (CEST) [thread overview]
Message-ID: <1695649345.530.1748849837156@webmail.proxmox.com> (raw)
In-Reply-To: <CAHXTzumXeyJQQCj+45Hmy5qdU+BTFBYbHVgPy0u3VS-qS=_bDQ@mail.gmail.com>
> Denis Kanchev <denis.kanchev@storpool.com> wrote on 29.05.2025 09:33 CEST:
>
>
> The issue here is that the storage plugin's activate_volume() is called after the migration is cancelled, which in the case of network-shared storage can cause problems.
> This is a sort of race condition, because migrate_cancel won't stop the storage migration on the remote server. As you can see below, a call to activate_volume() is performed after migrate_cancel.
> In this case we detach the volume from the old node (to keep the data consistent) and end up with a VM (not migrated) that no longer has this volume attached.
> We track whether activate_volume() is being used for a migration via the 'lock' => 'migrate' flag, which is cleared on migrate_cancel - during a migration we won't detach the volume from the old VM.
> In short: when the parent of this storage migration task gets killed, the source node stops the migration, but the storage migration on the destination node continues.
>
> Source node:
> 2025-04-11 03:26:50 starting migration of VM 2421 to node 'telpr01pve03' (10.10.17.3)
> 2025-04-11 03:26:50 starting VM 2421 on remote node 'telpr01pve03'
> 2025-04-11 03:26:52 ERROR: online migrate failure - remote command failed with exit code 255
> 2025-04-11 03:26:52 aborting phase 2 - cleanup resources
> 2025-04-11 03:26:52 migrate_cancel # <<< NOTE the time
> 2025-04-11 03:26:53 ERROR: migration finished with problems (duration 00:00:03)
> TASK ERROR: migration problems
could you provide the full migration task log and the VM config?
I thought your storage plugin provides shared storage, so there should be no storage migration at all, yet you keep talking about storage migration?
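
as an aside, a rough sketch of the kind of lock check you describe could look like the following; the helper name is made up and only PVE::QemuConfig->load_config() is the generic VM config accessor, so this is illustrative rather than your actual plugin code:

    use strict;
    use warnings;

    use PVE::QemuConfig;

    # rough sketch, not the actual StorPool plugin code: decide whether a
    # volume may be force-detached from its old node when activate_volume()
    # runs on the migration target
    sub volume_needs_detach {
        my ($vmid) = @_;

        # qemu-server sets 'lock' => 'migrate' in the VM config for the
        # duration of the migration; migrate_cancel clears it again, which
        # opens the window described above
        my $conf = PVE::QemuConfig->load_config($vmid);

        return 0 if defined($conf->{lock}) && $conf->{lock} eq 'migrate';

        return 1; # no migration lock, treat the attachment as stale
    }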
> Destination node:
> 2025-04-11T03:26:51.559671+07:00 telpr01pve03 qm[3670216]: <root@pam> starting task UPID:telpr01pve03:003800D4:00928867:67F8298B:qmstart:2421:root@pam:
> 2025-04-11T03:26:51.559897+07:00 telpr01pve03 qm[3670228]: start VM 2421: UPID:telpr01pve03:003800D4:00928867:67F8298B:qmstart:2421:root@pam:
so starting the VM on the target node failed? why?
> 2025-04-11T03:26:51.837905+07:00 telpr01pve03 qm[3670228]: StorPool plugin: Volume ~bj7n.b.abe is related to VM 2421, checking status ### Call to PVE::Storage::Plugin::activate_volume()
> 2025-04-11T03:26:53.072206+07:00 telpr01pve03 qm[3670228]: StorPool plugin: NOT a live migration of VM 2421, will force detach volume ~bj7n.b.abe ### 'lock' flag missing
> 2025-04-11T03:26:53.108206+07:00 telpr01pve03 qm[3670228]: StorPool plugin: Volume ~bj7n.b.sdj is related to VM 2421, checking status ### Second call to activate_volume() after migrate_cancel
> 2025-04-11T03:26:53.903357+07:00 telpr01pve03 qm[3670228]: StorPool plugin: NOT a live migration of VM 2421, will force detach volume ~bj7n.b.sdj ### 'lock' flag missing
>
>
>
>
> On Wed, May 28, 2025 at 9:33 AM Fabian Grünbichler <f.gruenbichler@proxmox.com> wrote:
> >
> > > Denis Kanchev <denis.kanchev@storpool.com> wrote on 28.05.2025 08:13 CEST:
> > >
> > >
> > > Here is the task log
> > > 2025-04-11 03:45:42 starting migration of VM 2282 to node 'telpr01pve05' (10.10.17.5)
> > > 2025-04-11 03:45:42 starting VM 2282 on remote node 'telpr01pve05'
> > > 2025-04-11 03:45:45 [telpr01pve05] Warning: sch_htb: quantum of class 10001 is big. Consider r2q change.
> > > 2025-04-11 03:45:46 [telpr01pve05] Dump was interrupted and may be inconsistent.
> > > 2025-04-11 03:45:46 [telpr01pve05] kvm: failed to find file '/usr/share/qemu-server/bootsplash.jpg'
> > > 2025-04-11 03:45:46 start remote tunnel
> > > 2025-04-11 03:45:46 ssh tunnel ver 1
> > > 2025-04-11 03:45:46 starting online/live migration on unix:/run/qemu-server/2282.migrate
> > > 2025-04-11 03:45:46 set migration capabilities
> > > 2025-04-11 03:45:46 migration downtime limit: 100 ms
> > > 2025-04-11 03:45:46 migration cachesize: 4.0 GiB
> > > 2025-04-11 03:45:46 set migration parameters
> > > 2025-04-11 03:45:46 start migrate command to unix:/run/qemu-server/2282.migrate
> > > 2025-04-11 03:45:47 migration active, transferred 152.2 MiB of 24.0 GiB VM-state, 162.1 MiB/s
> > > ...
> > > 2025-04-11 03:46:49 migration active, transferred 15.2 GiB of 24.0 GiB VM-state, 2.0 GiB/s
> > > 2025-04-11 03:46:50 migration status error: failed
> > > 2025-04-11 03:46:50 ERROR: online migrate failure - aborting
> > > 2025-04-11 03:46:50 aborting phase 2 - cleanup resources
> > > 2025-04-11 03:46:50 migrate_cancel
> > > 2025-04-11 03:46:52 ERROR: migration finished with problems (duration 00:01:11)
> > > TASK ERROR: migration problems
> >
> > okay, so no local disks involved.. not sure which process got killed then? ;)
> > the state transfer happens entirely within the Qemu process, perl is just polling
> > it to print the status, and that perl task worker is not OOM killed since it
> > continues to print all the error handling messages..
> >
> > > > that has weird implications with regards to threads, so I don't think that
> > > > is a good idea..
> > > What you mean by that? Are any threads involved?
> >
> > not intentionally, no. the issue is that the whole PR_SET_PDEATHSIG machinery
> > works on the thread level, not the process level, for historical reasons. so it
> > would actually kill the child if the thread that called prctl(PR_SET_PDEATHSIG) exits..
> >
> > I think we do want to improve how run_command handles the parent disappearing.
> > but it's not that straightforward to implement in a race-free fashion (in Perl).
> >
> >
>
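
to make the PR_SET_PDEATHSIG caveat and the race mentioned above concrete, here is a rough sketch of the usual workaround, assuming the Linux::Prctl CPAN module; names and structure are illustrative, this is not what run_command does today:

    use strict;
    use warnings;

    use POSIX qw(SIGTERM _exit);
    use Linux::Prctl ();    # CPAN module, assumed to be available

    my $parent = $$;

    my $pid = fork() // die "fork failed: $!";
    if ($pid == 0) {
        # child: ask the kernel to deliver SIGTERM when the parent dies.
        # note that "parent" here really means the thread that forked us,
        # which is why this gets awkward once threads are involved.
        Linux::Prctl::set_pdeathsig(SIGTERM);

        # close the race: the parent may already have exited between fork()
        # and set_pdeathsig(), in which case no signal will ever arrive
        _exit(1) if getppid() != $parent;

        exec('sleep', '3600') or _exit(1);
    }

    waitpid($pid, 0);

the getppid() check after set_pdeathsig() is what closes the window where the parent dies first; without it the child would simply never receive the signal.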
_______________________________________________
pve-devel mailing list
pve-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel