From: Denis Kanchev via pve-devel <pve-devel@lists.proxmox.com>
To: "Fabian Grünbichler" <f.gruenbichler@proxmox.com>
Cc: Denis Kanchev <denis.kanchev@storpool.com>,
	Wolfgang Bumiller <w.bumiller@proxmox.com>,
	Proxmox VE development discussion <pve-devel@lists.proxmox.com>
Subject: Re: [pve-devel] PVE child process behavior question
Date: Thu, 29 May 2025 10:33:14 +0300
Message-ID: <mailman.91.1748504040.395.pve-devel@lists.proxmox.com>
In-Reply-To: <11746909.21389.1748414016786@webmail.proxmox.com>

The issue here is that the storage plugin's activate_volume() is called
after migrate_cancel, which in the case of network-shared storage can
make things go bad.
This is a sort of race condition, because migrate_cancel won't stop the
storage migration on the remote server. As you can see below, a call to
activate_volume() is performed after migrate_cancel.
In this case we issue a volume detach from the old node (to keep the
data consistent) and end up with a VM (not migrated) that no longer has
this volume attached.
We track whether activate_volume() is called as part of a migration via
the 'lock' => 'migrate' flag, which is cleared on migrate_cancel - during
a migration we won't detach the volume from the old VM.
In short: when the parent of this storage migration task gets killed, the
source node stops the migration, but the storage migration on the
destination node continues.
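
To make the lock check concrete, here is a heavily simplified sketch of
the kind of guard involved (not our actual plugin code; find_owner_vmid()
and force_detach_volume() are made-up helper names, and reading the guest
config via PVE::QemuConfig->load_config() is just one way to look at the
'lock' property):

    use PVE::QemuConfig;

    sub activate_volume {
        my ($class, $storeid, $scfg, $volname, $snapname, $cache) = @_;

        # Hypothetical helper: resolve which VM owns this volume.
        my $vmid = find_owner_vmid($volname);

        # During a live migration qemu-server keeps 'lock' => 'migrate'
        # in the guest config; migrate_cancel clears it again.
        my $conf = PVE::QemuConfig->load_config($vmid);

        if (($conf->{lock} // '') eq 'migrate') {
            # Live migration in progress: activate the volume, but never
            # detach it from the source node.
            return;
        }

        # No migration lock: force-detach the volume from the old node
        # to keep the data consistent. This is the step that misfires
        # when activate_volume() only runs after migrate_cancel has
        # already cleared the lock.
        force_detach_volume($volname);
        return;
    }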

Source node:
2025-04-11 03:26:50 starting migration of VM 2421 to node 'telpr01pve03' (10.10.17.3)
2025-04-11 03:26:50 starting VM 2421 on remote node 'telpr01pve03'
2025-04-11 03:26:52 ERROR: online migrate failure - remote command failed with exit code 255
2025-04-11 03:26:52 aborting phase 2 - cleanup resources
2025-04-11 03:26:52 migrate_cancel   # <<< NOTE the time
2025-04-11 03:26:53 ERROR: migration finished with problems (duration 00:00:03)
TASK ERROR: migration problems

Destination node:
2025-04-11T03:26:51.559671+07:00 telpr01pve03 qm[3670216]: <root@pam> starting task UPID:telpr01pve03:003800D4:00928867:67F8298B:qmstart:2421:root@pam:
2025-04-11T03:26:51.559897+07:00 telpr01pve03 qm[3670228]: start VM 2421: UPID:telpr01pve03:003800D4:00928867:67F8298B:qmstart:2421:root@pam:
2025-04-11T03:26:51.837905+07:00 telpr01pve03 qm[3670228]: StorPool plugin: Volume ~bj7n.b.abe is related to VM 2421, checking status
    ### Call to PVE::Storage::Plugin::activate_volume()
2025-04-11T03:26:53.072206+07:00 telpr01pve03 qm[3670228]: StorPool plugin: NOT a live migration of VM 2421, will force detach volume ~bj7n.b.abe
    ### 'lock' flag missing
2025-04-11T03:26:53.108206+07:00 telpr01pve03 qm[3670228]: StorPool plugin: Volume ~bj7n.b.sdj is related to VM 2421, checking status
    ### Second call to activate_volume() after migrate_cancel
2025-04-11T03:26:53.903357+07:00 telpr01pve03 qm[3670228]: StorPool plugin: NOT a live migration of VM 2421, will force detach volume ~bj7n.b.sdj
    ### 'lock' flag missing



On Wed, May 28, 2025 at 9:33 AM Fabian Grünbichler <f.gruenbichler@proxmox.com> wrote:

>
> > Denis Kanchev <denis.kanchev@storpool.com> wrote on 28.05.2025 08:13 CEST:
> >
> >
> > Here is the task log
> > 2025-04-11 03:45:42 starting migration of VM 2282 to node 'telpr01pve05' (10.10.17.5)
> > 2025-04-11 03:45:42 starting VM 2282 on remote node 'telpr01pve05'
> > 2025-04-11 03:45:45 [telpr01pve05] Warning: sch_htb: quantum of class 10001 is big. Consider r2q change.
> > 2025-04-11 03:45:46 [telpr01pve05] Dump was interrupted and may be inconsistent.
> > 2025-04-11 03:45:46 [telpr01pve05] kvm: failed to find file '/usr/share/qemu-server/bootsplash.jpg'
> > 2025-04-11 03:45:46 start remote tunnel
> > 2025-04-11 03:45:46 ssh tunnel ver 1
> > 2025-04-11 03:45:46 starting online/live migration on unix:/run/qemu-server/2282.migrate
> > 2025-04-11 03:45:46 set migration capabilities
> > 2025-04-11 03:45:46 migration downtime limit: 100 ms
> > 2025-04-11 03:45:46 migration cachesize: 4.0 GiB
> > 2025-04-11 03:45:46 set migration parameters
> > 2025-04-11 03:45:46 start migrate command to unix:/run/qemu-server/2282.migrate
> > 2025-04-11 03:45:47 migration active, transferred 152.2 MiB of 24.0 GiB VM-state, 162.1 MiB/s
> > ...
> > 2025-04-11 03:46:49 migration active, transferred 15.2 GiB of 24.0 GiB VM-state, 2.0 GiB/s
> > 2025-04-11 03:46:50 migration status error: failed
> > 2025-04-11 03:46:50 ERROR: online migrate failure - aborting
> > 2025-04-11 03:46:50 aborting phase 2 - cleanup resources
> > 2025-04-11 03:46:50 migrate_cancel
> > 2025-04-11 03:46:52 ERROR: migration finished with problems (duration 00:01:11)
> > TASK ERROR: migration problems
>
> okay, so no local disks involved.. not sure which process got killed then? ;)
> the state transfer happens entirely within the Qemu process, perl is just
> polling it to print the status, and that perl task worker is not OOM killed
> since it continues to print all the error handling messages..
>
> > > that has weird implications with regards to threads, so I don't think
> > > that is a good idea..
> > What do you mean by that? Are any threads involved?
>
> not intentionally, no. the issue is that the whole "pr_set_deathsig" machinery
> works on the thread level, not the process level, for historical reasons. so it
> actually would kill the child if the thread that called pr_set_deathsig exits..
>
> I think we do want to improve how run_command handles the parent disappearing,
> but it's not that straight-forward to implement in a race-free fashion (in Perl).
>
>
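
For reference, a minimal sketch (not PVE code) of the mechanism described
above: the child registers for a parent-death signal with
prctl(PR_SET_PDEATHSIG) and, since that registration is tied to the forking
thread and also races with the parent exiting, additionally checks
getppid(). The syscall and constant numbers are the x86_64 Linux values;
do_work() is a made-up placeholder.

    #!/usr/bin/perl
    use strict;
    use warnings;
    use POSIX qw(SIGTERM);

    use constant SYS_prctl        => 157;  # prctl(2) syscall number on x86_64
    use constant PR_SET_PDEATHSIG => 1;    # "signal me when my parent dies"

    my $parent = $$;

    my $pid = fork() // die "fork failed: $!";
    if ($pid == 0) {
        # Child: ask the kernel to deliver SIGTERM when the parent goes
        # away. Caveat (as noted above): this is tied to the *thread*
        # that forked us, not to the parent process as a whole.
        syscall(SYS_prctl, PR_SET_PDEATHSIG, SIGTERM) == 0
            or die "prctl(PR_SET_PDEATHSIG) failed: $!";

        # Close the window where the parent already died before the
        # prctl call took effect.
        exit(0) if getppid() != $parent;

        # A run_command()-style loop could also keep polling getppid()
        # and bail out once the parent is gone.
        while (1) {
            exit(0) if getppid() != $parent;
            do_work();    # hypothetical unit of work
            sleep 1;
        }
    }

    waitpid($pid, 0);

    sub do_work { }    # placeholder for illustration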

Thread overview: 15+ messages
2025-05-21 13:13 Denis Kanchev via pve-devel
2025-05-22  6:30 ` Fabian Grünbichler
2025-05-22  6:55   ` Denis Kanchev via pve-devel
     [not found]   ` <857cbd6c-6866-417d-a71f-f5b5297bf09c@storpool.com>
2025-05-22  8:22     ` Fabian Grünbichler
2025-05-28  6:13       ` Denis Kanchev via pve-devel
     [not found]       ` <CAHXTzuk7tYRJV_j=88RWc3R3C7AkiEdFUXi88m5qwnDeYDEC+A@mail.gmail.com>
2025-05-28  6:33         ` Fabian Grünbichler
2025-05-29  7:33           ` Denis Kanchev via pve-devel [this message]
     [not found]           ` <CAHXTzumXeyJQQCj+45Hmy5qdU+BTFBYbHVgPy0u3VS-qS=_bDQ@mail.gmail.com>
2025-06-02  7:37             ` Fabian Grünbichler
2025-06-02  8:35               ` Denis Kanchev via pve-devel
     [not found]               ` <CAHXTzukAMG9050Ynn-KRSqhCz2Y0m6vnAQ7FEkCmEdQT3HapfQ@mail.gmail.com>
2025-06-02  8:49                 ` Fabian Grünbichler
2025-06-02  9:18                   ` Denis Kanchev via pve-devel
     [not found]                   ` <CAHXTzu=AiNx0iTWFEUU2kdzx9-RopwLc7rqGui6f0Q=+Hy52=w@mail.gmail.com>
2025-06-02 11:42                     ` Fabian Grünbichler
2025-06-02 13:23                       ` Denis Kanchev via pve-devel
     [not found]                       ` <CAHXTzu=qrZe2eEZro7qteR=fDjJQX13syfB9fs5VfFbG7Vy6vQ@mail.gmail.com>
2025-06-02 14:31                         ` Fabian Grünbichler
2025-06-04 12:52                           ` Denis Kanchev via pve-devel
