* [pve-devel] PVE child process behavior question
  From: Denis Kanchev <denis.kanchev@storpool.com>
  To: pve-devel@lists.proxmox.com
  Date: Wed, 21 May 2025 16:13:01 +0300

Hello,

We had an issue with a customer migrating a VM between nodes using our
shared storage solution.

On the target host the OOM killer killed the main migration process, but
the child process (which actually performs the migration) kept on
working, which we did not expect, and that caused some issues.

This leads us to the broader question - after a request is submitted,
the parent can be terminated, and not return a response to the client,
while the work is being done, and the request can be wrongly retried or
considered unfinished.

Should the child processes terminate together with the parent to guard
against this, or is this expected behavior?

Here is an example patch to do this:

diff --git a/src/PVE/RESTEnvironment.pm b/src/PVE/RESTEnvironment.pm
index bfde7e6..744fffc 100644
--- a/src/PVE/RESTEnvironment.pm
+++ b/src/PVE/RESTEnvironment.pm
@@ -13,8 +13,9 @@ use Fcntl qw(:flock);
 use IO::File;
 use IO::Handle;
 use IO::Select;
-use POSIX qw(:sys_wait_h EINTR);
+use POSIX qw(:sys_wait_h EINTR SIGKILL);
 use AnyEvent;
+use Linux::Prctl qw(set_pdeathsig);

 use PVE::Exception qw(raise raise_perm_exc);
 use PVE::INotify;
@@ -549,6 +550,9 @@ sub fork_worker {
         POSIX::setsid();
     }

+    # The signal that the calling process will get when its parent dies
+    set_pdeathsig(SIGKILL);
+
     POSIX::close ($psync[0]);
     POSIX::close ($ctrlfd[0]) if $sync;
     POSIX::close ($csync[1]);
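
A minimal standalone sketch of the parent-death-signal behaviour the patch above
relies on - assuming the Linux::Prctl CPAN module is available and a Linux kernel;
this is illustrative code, not part of PVE:

use strict;
use warnings;
use POSIX qw(SIGKILL);
use Linux::Prctl qw(set_pdeathsig);

my $pid = fork() // die "fork failed: $!";
if ($pid == 0) {
    # child: ask the kernel to deliver SIGKILL as soon as the parent exits
    set_pdeathsig(SIGKILL);
    sleep 1 while 1;          # stand-in for a long-running worker
}
# parent: exit after a moment - the kernel then kills the child
sleep 2;
exit 0;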
* Re: [pve-devel] PVE child process behavior question
  From: Fabian Grünbichler @ 2025-05-22 6:30 UTC
  To: Proxmox VE development discussion

> Denis Kanchev via pve-devel <pve-devel@lists.proxmox.com> wrote on 21.05.2025 15:13 CEST:
>
> Hello,
>
> We had an issue with a customer migrating a VM between nodes using our
> shared storage solution.
>
> On the target host the OOM killer killed the main migration process, but
> the child process (which actually performs the migration) kept on
> working, which we did not expect, and that caused some issues.

could you be more specific which process got killed?

when you do a migration, a task worker is forked and its UPID is returned
to the caller for further querying.

as part of the migration, other processes get spawned:
- ssh tunnel to the target node
- storage migration processes (on both nodes)
- VM state management CLI calls (on the target node)

which of those is the "main migration process"? which is the child process?

> This leads us to the broader question - after a request is submitted,
> the parent can be terminated, and not return a response to the client,
> while the work is being done, and the request can be wrongly retried or
> considered unfinished.

the parent should return almost immediately, as all it is doing at that
point is returning the UPID to the client (the process then continues to
do other work though, but that is no longer related to this task).

the only exception is for "sync" task workers, like in a CLI context,
where the "parent" has no other work to do, so it waits for the child/task
to finish and prints its output while doing so, and some "bulk action"
style API calls that fork multiple task workers and poll them themselves.

> Should the child processes terminate together with the parent to guard
> against this, or is this expected behavior?

the parent (API worker process) and child (task worker process) have no
direct relation after the task worker has been spawned.

> Here is an example patch to do this:
>
> +    # The signal that the calling process will get when its parent dies
> +    set_pdeathsig(SIGKILL);

that has weird implications with regards to threads, so I don't think that
is a good idea..
* Re: [pve-devel] PVE child process behavior question
  From: Denis Kanchev <denis.kanchev@storpool.com>
  To: Fabian Grünbichler <f.gruenbichler@proxmox.com>, Proxmox VE development discussion <pve-devel@lists.proxmox.com>
  Date: Thu, 22 May 2025 09:55:49 +0300

The parent of the storage migration process gets killed.

It seems that this is the desired behavior and, as far as I understand it,
the child worker is detached from the parent and has nothing to do with it
after spawning.

Thanks for the information, it was very helpful.
* Re: [pve-devel] PVE child process behavior question
  From: Fabian Grünbichler @ 2025-05-22 8:22 UTC
  To: Denis Kanchev, Proxmox VE development discussion; +Cc: Wolfgang Bumiller

> Denis Kanchev <denis.kanchev@storpool.com> wrote on 22.05.2025 08:55 CEST:
>
> The parent of the storage migration process gets killed.
>
> It seems that this is the desired behavior and, as far as I understand it,
> the child worker is detached from the parent and has nothing to do with it
> after spawning.

was this a remote migration or a regular migration? could you maybe post the
full task log?

for a regular migration, the storage migration just uses our "run_command"
helper. run_command uses open3 to spawn the command, and select for command
output handling. basically the process tree would look like this:

API worker (one of X in pvedaemon)
-> task worker (executing the migration code)
--> storage migration command (xxx | ssh target_node xxx)

and it does seem like run_command doesn't properly forward the parent being
killed/terminated:

$ perl -e 'use strict; use warnings; use PVE::Tools; warn "parent pid: $$\n"; PVE::Tools::run_command([["bash", "-c", "sleep 10; sleep 20; echo after > /tmp/file"]]);'
parent pid: 204620
[1]  204618 terminated  sudo perl -e

(sending SIGTERM from another shell to 204620). the bash command continues
executing, and also writes to /tmp/file after the sleeps are finished.. the
same is also true for SIGKILL. SIGINT properly cleans up the child though.

@Wolfgang: is this desired behaviour?
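
One commonly used alternative for letting a child notice a disappearing parent,
without prctl(), is to have the child watch a pipe whose write end only the
parent holds - the read end becomes readable (EOF) the moment the parent exits
or is killed. A rough sketch of the idea only (an illustration, not how
run_command currently works):

use strict;
use warnings;
use IO::Select;

pipe(my $rd, my $wr) or die "pipe failed: $!";

my $pid = fork() // die "fork failed: $!";
if ($pid == 0) {
    close($wr);                       # child keeps only the read end
    my $sel = IO::Select->new($rd);
    while (1) {
        # ... do a chunk of the actual work here ...
        if ($sel->can_read(1)) {      # readable => EOF => parent is gone
            warn "parent went away, cleaning up\n";
            exit(1);
        }
    }
}
close($rd);                           # parent keeps only the write end
sleep 30;                             # stand-in for the parent's own work;
                                      # SIGKILL this process and the child exits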
* Re: [pve-devel] PVE child process behavior question
  From: Denis Kanchev <denis.kanchev@storpool.com>
  To: Fabian Grünbichler <f.gruenbichler@proxmox.com>
  Cc: Proxmox VE development discussion <pve-devel@lists.proxmox.com>, Wolfgang Bumiller <w.bumiller@proxmox.com>
  Date: Wed, 28 May 2025 09:13:44 +0300

Here is the task log:

2025-04-11 03:45:42 starting migration of VM 2282 to node 'telpr01pve05' (10.10.17.5)
2025-04-11 03:45:42 starting VM 2282 on remote node 'telpr01pve05'
2025-04-11 03:45:45 [telpr01pve05] Warning: sch_htb: quantum of class 10001 is big. Consider r2q change.
2025-04-11 03:45:46 [telpr01pve05] Dump was interrupted and may be inconsistent.
2025-04-11 03:45:46 [telpr01pve05] kvm: failed to find file '/usr/share/qemu-server/bootsplash.jpg'
2025-04-11 03:45:46 start remote tunnel
2025-04-11 03:45:46 ssh tunnel ver 1
2025-04-11 03:45:46 starting online/live migration on unix:/run/qemu-server/2282.migrate
2025-04-11 03:45:46 set migration capabilities
2025-04-11 03:45:46 migration downtime limit: 100 ms
2025-04-11 03:45:46 migration cachesize: 4.0 GiB
2025-04-11 03:45:46 set migration parameters
2025-04-11 03:45:46 start migrate command to unix:/run/qemu-server/2282.migrate
2025-04-11 03:45:47 migration active, transferred 152.2 MiB of 24.0 GiB VM-state, 162.1 MiB/s
...
2025-04-11 03:46:49 migration active, transferred 15.2 GiB of 24.0 GiB VM-state, 2.0 GiB/s
2025-04-11 03:46:50 migration status error: failed
2025-04-11 03:46:50 ERROR: online migrate failure - aborting
2025-04-11 03:46:50 aborting phase 2 - cleanup resources
2025-04-11 03:46:50 migrate_cancel
2025-04-11 03:46:52 ERROR: migration finished with problems (duration 00:01:11)
TASK ERROR: migration problems

> that has weird implications with regards to threads, so I don't think that
> is a good idea..

What do you mean by that? Are any threads involved?
* Re: [pve-devel] PVE child process behavior question
  From: Fabian Grünbichler @ 2025-05-28 6:33 UTC
  To: Denis Kanchev; +Cc: Wolfgang Bumiller, Proxmox VE development discussion

> Denis Kanchev <denis.kanchev@storpool.com> wrote on 28.05.2025 08:13 CEST:
>
> Here is the task log:
>
> 2025-04-11 03:46:50 migration status error: failed
> 2025-04-11 03:46:50 ERROR: online migrate failure - aborting
> 2025-04-11 03:46:50 aborting phase 2 - cleanup resources
> 2025-04-11 03:46:50 migrate_cancel
> 2025-04-11 03:46:52 ERROR: migration finished with problems (duration 00:01:11)
> TASK ERROR: migration problems

okay, so no local disks involved.. not sure which process got killed then? ;)
the state transfer happens entirely within the Qemu process, perl is just
polling it to print the status, and that perl task worker is not OOM killed
since it continues to print all the error handling messages..

> > that has weird implications with regards to threads, so I don't think that
> > is a good idea..
>
> What do you mean by that? Are any threads involved?

not intentionally, no. the issue is that the whole "pr_set_deathsig" machinery
works on the thread level, not the process level for historical reasons. so it
actually would kill the child if the thread that called pr_set_deathsig exits..

I think we do want to improve how run_command handles the parent disappearing.
but it's not that straight-forward to implement in a race-free fashion (in Perl).
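
For reference, the usual recipe for narrowing the fork()/prctl() race is to
request the parent-death signal in the child and then re-check the parent,
since the parent may already have exited before prctl() took effect. A sketch
under the same Linux::Prctl assumption as the patch above (illustrative only,
and it does not address the thread-level caveat just mentioned):

use strict;
use warnings;
use POSIX qw(SIGTERM);
use Linux::Prctl qw(set_pdeathsig);

my $pid = fork() // die "fork failed: $!";
if ($pid == 0) {
    set_pdeathsig(SIGTERM);
    # if the parent died between fork() and set_pdeathsig(), the signal will
    # never arrive - detect the re-parenting explicitly (caveat: with PID
    # namespaces or subreapers the new parent is not necessarily pid 1)
    exit(1) if getppid() == 1;
    # ... worker code or exec() goes here ...
    exit(0);
}
waitpid($pid, 0);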
* Re: [pve-devel] PVE child process behavior question
  From: Denis Kanchev <denis.kanchev@storpool.com>
  To: Fabian Grünbichler <f.gruenbichler@proxmox.com>
  Cc: Proxmox VE development discussion, Wolfgang Bumiller
  Date: Thu, 29 May 2025 10:33:14 +0300

The issue here is that the storage plugin's activate_volume() is called after
the migration is cancelled, which in the case of network shared storage can
make things bad. This is a sort of race condition, because migrate_cancel
won't stop the storage migration on the remote server. As you can see below,
a call to activate_volume() is performed after migrate_cancel.

In this case we issue a volume detach from the old node (to keep the data
consistent) and we end up with a VM (not migrated) without this volume
attached. We track whether activate_volume() is used for migration via the
'lock' => 'migrate' flag, which is cleared on migration cancel - in case of
migration we won't detach the volume from the old VM.

In short: when the parent of this storage migration task gets killed, the
source node stops the migration, but the storage migration on the destination
node continues.

Source node:

2025-04-11 03:26:50 starting migration of VM 2421 to node 'telpr01pve03' (10.10.17.3)
2025-04-11 03:26:50 starting VM 2421 on remote node 'telpr01pve03'
2025-04-11 03:26:52 ERROR: online migrate failure - remote command failed with exit code 255
2025-04-11 03:26:52 aborting phase 2 - cleanup resources
2025-04-11 03:26:52 migrate_cancel            # <<< NOTE the time
2025-04-11 03:26:53 ERROR: migration finished with problems (duration 00:00:03)
TASK ERROR: migration problems

Destination node:

2025-04-11T03:26:51.559671+07:00 telpr01pve03 qm[3670216]: <root@pam> starting task UPID:telpr01pve03:003800D4:00928867:67F8298B:qmstart:2421:root@pam:
2025-04-11T03:26:51.559897+07:00 telpr01pve03 qm[3670228]: start VM 2421: UPID:telpr01pve03:003800D4:00928867:67F8298B:qmstart:2421:root@pam:
2025-04-11T03:26:51.837905+07:00 telpr01pve03 qm[3670228]: StorPool plugin: Volume ~bj7n.b.abe is related to VM 2421, checking status
    ### Call to PVE::Storage::Plugin::activate_volume()
2025-04-11T03:26:53.072206+07:00 telpr01pve03 qm[3670228]: StorPool plugin: NOT a live migration of VM 2421, will force detach volume ~bj7n.b.abe
    ### 'lock' flag missing
2025-04-11T03:26:53.108206+07:00 telpr01pve03 qm[3670228]: StorPool plugin: Volume ~bj7n.b.sdj is related to VM 2421, checking status
    ### Second call to activate_volume() after migrate_cancel
2025-04-11T03:26:53.903357+07:00 telpr01pve03 qm[3670228]: StorPool plugin: NOT a live migration of VM 2421, will force detach volume ~bj7n.b.sdj
    ### 'lock' flag missing
* Re: [pve-devel] PVE child process behavior question
  From: Fabian Grünbichler @ 2025-06-02 7:37 UTC
  To: Denis Kanchev; +Cc: Wolfgang Bumiller, Proxmox VE development discussion

> Denis Kanchev <denis.kanchev@storpool.com> wrote on 29.05.2025 09:33 CEST:
>
> The issue here is that the storage plugin's activate_volume() is called after
> the migration is cancelled, which in the case of network shared storage can
> make things bad. This is a sort of race condition, because migrate_cancel
> won't stop the storage migration on the remote server.
>
> In short: when the parent of this storage migration task gets killed, the
> source node stops the migration, but the storage migration on the destination
> node continues.

could you provide the full migration task log and the VM config?

I thought your storage plugin is a shared storage, so there is no storage
migration at all, yet you keep talking about storage migration?

> Destination node:
>
> 2025-04-11T03:26:51.559671+07:00 telpr01pve03 qm[3670216]: <root@pam> starting task UPID:telpr01pve03:003800D4:00928867:67F8298B:qmstart:2421:root@pam:
> 2025-04-11T03:26:51.559897+07:00 telpr01pve03 qm[3670228]: start VM 2421: UPID:telpr01pve03:003800D4:00928867:67F8298B:qmstart:2421:root@pam:

so starting the VM on the target node failed? why?
* Re: [pve-devel] PVE child process behavior question
  From: Denis Kanchev <denis.kanchev@storpool.com>
  To: Fabian Grünbichler <f.gruenbichler@proxmox.com>
  Cc: Proxmox VE development discussion, Wolfgang Bumiller
  Date: Mon, 2 Jun 2025 11:35:22 +0300

> I thought your storage plugin is a shared storage, so there is no storage
> migration at all, yet you keep talking about storage migration?

It is a shared storage indeed. The issue was that the migration process on the
destination host got OOM-killed and the migration failed - most probably that's
why there is no log about the storage migration - but that didn't stop the
storage migration on the destination host.

2025-04-11T03:26:52.283913+07:00 telpr01pve03 kernel: [96031.290519] pvesh invoked oom-killer: gfp_mask=0xcc0(GFP_KERNEL), order=0, oom_score_adj=0

Here is one more migration task attempt where it lived long enough to show a
more detailed log:

2025-04-11 03:29:11 starting migration of VM 2421 to node 'telpr01pve06' (10.10.17.6)
2025-04-11 03:29:11 starting VM 2421 on remote node 'telpr01pve06'
2025-04-11 03:29:15 [telpr01pve06] Warning: sch_htb: quantum of class 10001 is big. Consider r2q change.
2025-04-11 03:29:15 [telpr01pve06] kvm: failed to find file '/usr/share/qemu-server/bootsplash.jpg'
2025-04-11 03:29:15 start remote tunnel
2025-04-11 03:29:16 ssh tunnel ver 1
2025-04-11 03:29:16 starting online/live migration on unix:/run/qemu-server/2421.migrate
2025-04-11 03:29:16 set migration capabilities
2025-04-11 03:29:16 migration downtime limit: 100 ms
2025-04-11 03:29:16 migration cachesize: 256.0 MiB
2025-04-11 03:29:16 set migration parameters
2025-04-11 03:29:16 start migrate command to unix:/run/qemu-server/2421.migrate
2025-04-11 03:29:17 migration active, transferred 281.0 MiB of 2.0 GiB VM-state, 340.5 MiB/s
2025-04-11 03:29:18 migration active, transferred 561.5 MiB of 2.0 GiB VM-state, 307.2 MiB/s
2025-04-11 03:29:19 migration active, transferred 849.2 MiB of 2.0 GiB VM-state, 288.5 MiB/s
2025-04-11 03:29:20 migration active, transferred 1.1 GiB of 2.0 GiB VM-state, 283.7 MiB/s
2025-04-11 03:29:21 migration active, transferred 1.4 GiB of 2.0 GiB VM-state, 302.5 MiB/s
2025-04-11 03:29:23 migration active, transferred 1.8 GiB of 2.0 GiB VM-state, 278.6 MiB/s
2025-04-11 03:29:23 migration status error: failed
2025-04-11 03:29:23 ERROR: online migrate failure - aborting
2025-04-11 03:29:23 aborting phase 2 - cleanup resources
2025-04-11 03:29:23 migrate_cancel
2025-04-11 03:29:25 ERROR: migration finished with problems (duration 00:00:14)
TASK ERROR: migration problems

> could you provide the full migration task log and the VM config?

2025-04-11 03:26:50 starting migration of VM 2421 to node 'telpr01pve03' (10.10.17.3)    ### QemuMigrate::phase1() +749
2025-04-11 03:26:50 starting VM 2421 on remote node 'telpr01pve03'    # QemuMigrate::phase2_start_local_cluster() +888
2025-04-11 03:26:52 ERROR: online migrate failure - remote command failed with exit code 255
2025-04-11 03:26:52 aborting phase 2 - cleanup resources
2025-04-11 03:26:52 migrate_cancel
2025-04-11 03:26:53 ERROR: migration finished with problems (duration 00:00:03)
TASK ERROR: migration problems

VM config:

#Ubuntu-24.04-14082024
#StorPool adjustment
agent: 1,fstrim_cloned_disks=1
autostart: 1
boot: c
bootdisk: scsi0
cipassword: XXX
citype: nocloud
ciupgrade: 0
ciuser: test
cores: 2
cpu: EPYC-Genoa
cpulimit: 2
ide0: VMDataSp:vm-2421-cloudinit.raw,media=cdrom
ipconfig0: ipxxx
memory: 2048
meta: creation-qemu=8.1.5,ctime=1722917972
name: kredibel-service
nameserver: xxx
net0: virtio=xxx,bridge=vmbr2,firewall=1,rate=250,tag=220
numa: 0
onboot: 1
ostype: l26
scsi0: VMDataSp:vm-2421-disk-0-sp-bj7n.b.sdj.raw,aio=native,discard=on,iops_rd=20000,iops_rd_max=40000,iops_rd_max_length=60,iops_wr=20000,iops_wr_max=40000,iops_wr_max_length=60,iothread=1,size=40G
scsihw: virtio-scsi-single
searchdomain: neo.internal
serial0: socket
smbios1: uuid=dfxxx
sockets: 1
sshkeys: ssh-rsa%
vmgenid: 17b154a0-

In this case the call to PVE::Storage::Plugin::activate_volume() was performed
after the migration cancellation:

2025-04-11T03:26:53.072206+07:00 telpr01pve03 qm[3670228]: StorPool plugin: NOT a live migration of VM 2421, will force detach volume ~bj7n.b.abe
    <<< This log line is from sub activate_volume() in our custom storage plugin
* Re: [pve-devel] PVE child process behavior question
  From: Fabian Grünbichler @ 2025-06-02 8:49 UTC
  To: Denis Kanchev; +Cc: Wolfgang Bumiller, Proxmox VE development discussion

> Denis Kanchev <denis.kanchev@storpool.com> wrote on 02.06.2025 10:35 CEST:
>
> It is a shared storage indeed. The issue was that the migration process on
> the destination host got OOM-killed and the migration failed - most probably
> that's why there is no log about the storage migration - but that didn't stop
> the storage migration on the destination host.

could you please explain what you mean by storage migration? :)

when I say "storage migration" I mean either
- the target VM exporting newly allocated volumes via NBD, and the source VM
  mirroring its disks via blockjob onto those exported volumes
- PVE::Storage::storage_migrate, which exports a volume, pipes it over SSH or
  a websocket tunnel and imports it on the other side

the first is what happens in a live migration for volumes currently used by
the VM. the second is what happens for other volumes, or in case of an offline
migration. both will only happen for local volumes, as with a shared storage,
*there is nothing to migrate*.

are you talking about something your storage does (hand-over of control?)?

there also is no "migration process on the destination host", there just is
the target VM running there - did that VM get OOM-killed? or the `qm start`
invocation itself? or ... ? the migration task is only running on the source
node.. please really try to be specific here, it's easy to misunderstand
things or guess wrongly otherwise..

AFAIU, the sequence was:

migration started
target VM started
live-migration started
something happens on the destination node (??) that aborts the migration
source node does migrate_cancel (which is somehow hooked to your storage and
  removes a flag/lock/.. on the volume?)
something on the destination node calls activate_volume (which checks this
  flag/lock and is confused because it is missing?)
* Re: [pve-devel] PVE child process behavior question
  From: Denis Kanchev <denis.kanchev@storpool.com>
  To: Fabian Grünbichler <f.gruenbichler@proxmox.com>
  Cc: Proxmox VE development discussion, Wolfgang Bumiller
  Date: Mon, 2 Jun 2025 12:18:01 +0300

My bad :) in Proxmox terms it must be a hand-over of storage control - in our
case the storage plugin function activate_volume() is called, which moves the
storage to the new VM. So no data is moved across the nodes and only the
volumes get re-attached.

Thanks for the plentiful information.
* Re: [pve-devel] PVE child process behavior question
  From: Fabian Grünbichler @ 2025-06-02 11:42 UTC
  To: Denis Kanchev; +Cc: Wolfgang Bumiller, Proxmox VE development discussion

> Denis Kanchev <denis.kanchev@storpool.com> wrote on 02.06.2025 11:18 CEST:
>
> My bad :) in Proxmox terms it must be a hand-over of storage control - in our
> case the storage plugin function activate_volume() is called, which moves the
> storage to the new VM. So no data is moved across the nodes and only the
> volumes get re-attached.
> Thanks for the plentiful information.

okay!

so you basically special case this "volume is active on two nodes" case which
should only happen during a live migration, and that somehow runs into an
issue if the migration is aborted because there is some suspected race
somewhere?

as part of a live migration, the sequence should be:

node A: migration starts
node A: start request for target VM on node B (over SSH)
node B: `qm start ..` is called
node B: qm start will activate volumes
node B: qm start returns
node A: migration starts
node A/B: some fatal error
node A: cancel migration (via QMP/the source VM running on node A)
node A: request to stop target VM on node B (over SSH)
node B: `qm stop ..` called
node B: qm stop will deactivate volumes

I am not sure where another activate_volume call after node A has started the
migration could happen? at that point, node A still has control over the VM
(ID), so nothing in PVE should operate on it other than the selective calls
made as part of the migration, which are basically only querying migration
status and error handling at that point..

it would still be good to know what actually got OOM-killed in your case..
was it the `qm start`? was it the `kvm` process itself? something entirely
else?

if you can reproduce the issue, you could also add logging in activate_volume
to find out the exact call path (e.g., log the call stack somewhere), maybe
that helps find the exact scenario that you are seeing..
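
A minimal way to capture that call path from inside the plugin - a sketch only;
the argument list follows the activate_volume() call shape visible in the trace
in the next message, and the output simply goes to STDERR via warn rather than
any particular PVE logging helper:

use Carp ();

sub activate_volume {
    my ($class, $storeid, $scfg, $volname, $snapname, $cache) = @_;

    # log the full caller chain every time the volume is activated
    warn "activate_volume($storeid, $volname) called:\n" . Carp::longmess('trace');

    # ... existing activation logic of the plugin continues here ...
}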
* Re: [pve-devel] PVE child process behavior question
  From: Denis Kanchev <denis.kanchev@storpool.com>
  To: Fabian Grünbichler <f.gruenbichler@proxmox.com>
  Cc: Proxmox VE development discussion, Wolfgang Bumiller
  Date: Mon, 2 Jun 2025 16:23:27 +0300

We tend to prevent having a volume active on two nodes, as that may lead to
data corruption, so we detach the volume from all nodes (except the target
one) via our shared storage system. In sub activate_volume() our logic is to
not detach the volume from other hosts in case of migration - because
activate_volume() can also be called in other cases, where detaching is
necessary.

But in this case, where the `qm start` process is killed, the migration is
marked as failed and activate_volume() is still called on the destination host
after migrate_cancel (we check that the "lock" flag is set to 'migrate').
That's why I proposed that the child processes be killed when the parent one
dies - it would prevent such cases. Not sure whether passing an extra argument
(marking it as a migration) to activate_volume() would solve such an issue too.

Here is a trace log of activate_volume() in case of migration:

2025-05-02 13:03:28.2222 [2712103] took 0.0006: activate_volume: storeid 'autotest__ec2_1', scfg {'type' => 'storpool','shared' => 1,'template' => 'autotest__ec2_1','extra-tags' => 'tier=high','content' => {'iso' => 1,'images' => 1}}, volname 'vm-101-disk-0-sp-z.b.df.raw', exclusive undef at /usr/share/perl5/PVE/Storage/Custom/StorPoolPlugin.pm line 1551.
    PVE::Storage::Custom::StorPoolPlugin::activate_volume("PVE::Storage::Custom::StorPoolPlugin", "autotest__ec2_1", HASH(0x559cd06d88a0), "vm-101-disk-0-sp-z.b.df.raw", undef, HASH(0x559cd076b9a8)) called at /usr/share/perl5/PVE/Storage.pm line 1309
    PVE::Storage::activate_volumes(HASH(0x559cc99d04e0), ARRAY(0x559cd0754558)) called at /usr/share/perl5/PVE/QemuServer.pm line 5823
    PVE::QemuServer::vm_start_nolock(HASH(0x559cc99d04e0), 101, HASH(0x559cd0730ca0), HASH(0x559ccfd6d680), HASH(0x559cc99cfe38)) called at /usr/share/perl5/PVE/QemuServer.pm line 5592
    PVE::QemuServer::__ANON__() called at /usr/share/perl5/PVE/AbstractConfig.pm line 299
    PVE::AbstractConfig::__ANON__() called at /usr/share/perl5/PVE/Tools.pm line 259
    eval {...} called at /usr/share/perl5/PVE/Tools.pm line 259
    PVE::Tools::lock_file_full("/var/lock/qemu-server/lock-101.conf", 10, 0, CODE(0x559ccf14b968)) called at /usr/share/perl5/PVE/AbstractConfig.pm line 302
    PVE::AbstractConfig::__ANON__("PVE::QemuConfig", 101, 10, 0, CODE(0x559ccf59f740)) called at /usr/share/perl5/PVE/AbstractConfig.pm line 322
    PVE::AbstractConfig::lock_config_full("PVE::QemuConfig", 101, 10, CODE(0x559ccf59f740)) called at /usr/share/perl5/PVE/AbstractConfig.pm line 330
    PVE::AbstractConfig::lock_config("PVE::QemuConfig", 101, CODE(0x559ccf59f740)) called at /usr/share/perl5/PVE/QemuServer.pm line 5593
    PVE::QemuServer::vm_start(HASH(0x559cc99d04e0), 101, HASH(0x559ccfd6d680), HASH(0x559cc99cfe38)) called at /usr/share/perl5/PVE/API2/Qemu.pm line 3259
    PVE::API2::Qemu::__ANON__("UPID:lab-dk-2:00296227:0ADF72E0:683DA11F:qmstart:101:root\@pam:") called at /usr/share/perl5/PVE/RESTEnvironment.pm line 620
    eval {...} called at /usr/share/perl5/PVE/RESTEnvironment.pm line 611
    PVE::RESTEnvironment::fork_worker(PVE::RPCEnvironment=HASH(0x559cc99d0558), "qmstart", 101, "root\@pam", CODE(0x559cd06cc160)) called at /usr/share/perl5/PVE/API2/Qemu.pm line 3263
    PVE::API2::Qemu::__ANON__(HASH(0x559cd0700df8)) called at /usr/share/perl5/PVE/RESTHandler.pm line 499
    PVE::RESTHandler::handle("PVE::API2::Qemu", HASH(0x559cd05deb98), HASH(0x559cd0700df8), 1) called at /usr/share/perl5/PVE/RESTHandler.pm line 985
    eval {...} called at /usr/share/perl5/PVE/RESTHandler.pm line 968
    PVE::RESTHandler::cli_handler("PVE::API2::Qemu", "qm start", "vm_start", ARRAY(0x559cc99cfee0), ARRAY(0x559cd0745e98), HASH(0x559cd0745ef8), CODE(0x559cd07091f8), undef) called at /usr/share/perl5/PVE/CLIHandler.pm line 594
    PVE::CLIHandler::__ANON__(ARRAY(0x559cc99d00c0), undef, CODE(0x559cd07091f8)) called at /usr/share/perl5/PVE/CLIHandler.pm line 673
    PVE::CLIHandler::run_cli_handler("PVE::CLI::qm") called at /usr/sbin/qm line 8
* Re: [pve-devel] PVE child process behavior question
From: Fabian Grünbichler @ 2025-06-02 14:31 UTC
To: Denis Kanchev
Cc: Wolfgang Bumiller, Proxmox VE development discussion

> Denis Kanchev <denis.kanchev@storpool.com> wrote on 02.06.2025 15:23 CEST:
>
> We tend to prevent having a volume active on two nodes, as it may lead to data
> corruption, so we detach the volume from all nodes (except the target one) via
> our shared storage system.
> In the sub activate_volume() our logic is to not detach the volume from other
> hosts in case of migration - because activate_volume() can be called in other
> cases, where detaching is necessary.
> But in this case, where the `qm start` process is killed, the migration is marked
> as failed and activate_volume() is still called on the destination host after
> migration_cancel (we track the "lock" flag being set to "migrate").
> That's why I proposed that the child processes be killed when the parent one
> dies - it would prevent such cases.
> Not sure if passing an extra argument (marking it as a migration) to
> activate_volume() would solve such an issue too.
> Here is a trace log of activate_volume() in case of migration.

but that activation happens as part of starting the target VM, which happens
before the (actual) migration is started in QEMU. so in this case, we have

qm start (over SSH, is this being killed?)
-> start_vm task worker (or this?)
--> activate_volume
--> fork, enter systemd scope, run_command to execute the kvm process
---> kvm (or this?)

how are you hooking the migration state to know whether deactivation should be
done or not?

> 2025-05-02 13:03:28.2222 [2712103] took 0.0006:activate_volume:storeid 'autotest__ec2_1', scfg {'type' => 'storpool','shared' => 1,'template' => 'autotest__ec2_1','extra-tags' => 'tier=high','content' => {'iso' => 1,'images' => 1}}, volname 'vm-101-disk-0-sp-z.b.df.raw', exclusive undef at /usr/share/perl5/PVE/Storage/Custom/StorPoolPlugin.pm line 1551.
> PVE::Storage::Custom::StorPoolPlugin::activate_volume("PVE::Storage::Custom::StorPoolPlugin", "autotest__ec2_1", HASH(0x559cd06d88a0), "vm-101-disk-0-sp-z.b.df.raw", undef, HASH(0x559cd076b9a8)) called at /usr/share/perl5/PVE/Storage.pm line 1309
> PVE::Storage::activate_volumes(HASH(0x559cc99d04e0), ARRAY(0x559cd0754558)) called at /usr/share/perl5/PVE/QemuServer.pm line 5823
> PVE::QemuServer::vm_start_nolock(HASH(0x559cc99d04e0), 101, HASH(0x559cd0730ca0), HASH(0x559ccfd6d680), HASH(0x559cc99cfe38)) called at /usr/share/perl5/PVE/QemuServer.pm line 5592
> PVE::QemuServer::__ANON__() called at /usr/share/perl5/PVE/AbstractConfig.pm line 299
> PVE::AbstractConfig::__ANON__() called at /usr/share/perl5/PVE/Tools.pm line 259
> eval {...} called at /usr/share/perl5/PVE/Tools.pm line 259
> PVE::Tools::lock_file_full("/var/lock/qemu-server/lock-101.conf", 10, 0, CODE(0x559ccf14b968)) called at /usr/share/perl5/PVE/AbstractConfig.pm line 302
> PVE::AbstractConfig::__ANON__("PVE::QemuConfig", 101, 10, 0, CODE(0x559ccf59f740)) called at /usr/share/perl5/PVE/AbstractConfig.pm line 322
> PVE::AbstractConfig::lock_config_full("PVE::QemuConfig", 101, 10, CODE(0x559ccf59f740)) called at /usr/share/perl5/PVE/AbstractConfig.pm line 330
> PVE::AbstractConfig::lock_config("PVE::QemuConfig", 101, CODE(0x559ccf59f740)) called at /usr/share/perl5/PVE/QemuServer.pm line 5593
> PVE::QemuServer::vm_start(HASH(0x559cc99d04e0), 101, HASH(0x559ccfd6d680), HASH(0x559cc99cfe38)) called at /usr/share/perl5/PVE/API2/Qemu.pm line 3259
> PVE::API2::Qemu::__ANON__("UPID:lab-dk-2:00296227:0ADF72E0:683DA11F:qmstart:101:root\@pam:") called at /usr/share/perl5/PVE/RESTEnvironment.pm line 620
> eval {...} called at /usr/share/perl5/PVE/RESTEnvironment.pm line 611
> PVE::RESTEnvironment::fork_worker(PVE::RPCEnvironment=HASH(0x559cc99d0558), "qmstart", 101, "root\@pam", CODE(0x559cd06cc160)) called at /usr/share/perl5/PVE/API2/Qemu.pm line 3263
> PVE::API2::Qemu::__ANON__(HASH(0x559cd0700df8)) called at /usr/share/perl5/PVE/RESTHandler.pm line 499
> PVE::RESTHandler::handle("PVE::API2::Qemu", HASH(0x559cd05deb98), HASH(0x559cd0700df8), 1) called at /usr/share/perl5/PVE/RESTHandler.pm line 985
> eval {...} called at /usr/share/perl5/PVE/RESTHandler.pm line 968
> PVE::RESTHandler::cli_handler("PVE::API2::Qemu", "qm start", "vm_start", ARRAY(0x559cc99cfee0), ARRAY(0x559cd0745e98), HASH(0x559cd0745ef8), CODE(0x559cd07091f8), undef) called at /usr/share/perl5/PVE/CLIHandler.pm line 594
> PVE::CLIHandler::__ANON__(ARRAY(0x559cc99d00c0), undef, CODE(0x559cd07091f8)) called at /usr/share/perl5/PVE/CLIHandler.pm line 673
> PVE::CLIHandler::run_cli_handler("PVE::CLI::qm") called at /usr/sbin/qm line 8
>
> On Mon, Jun 2, 2025 at 2:42 PM Fabian Grünbichler <f.gruenbichler@proxmox.com> wrote:
> >
> > > Denis Kanchev <denis.kanchev@storpool.com> wrote on 02.06.2025 11:18 CEST:
> > >
> > > My bad :) in terms of Proxmox it must be handing over the storage control - the
> > > storage plugin function activate_volume() is called in our case, which moves the
> > > storage to the new VM.
> > > So no data is moved across the nodes and only the volumes get re-attached.
> > > Thanks for the plentiful information
> >
> > okay!
> >
> > so you basically special-case this "volume is active on two nodes" case, which
> > should only happen during a live migration, and that somehow runs into an issue
> > if the migration is aborted because there is some suspected race somewhere?
> >
> > as part of a live migration, the sequence should be:
> >
> > node A: migration starts
> > node A: start request for target VM on node B (over SSH)
> > node B: `qm start ..` is called
> > node B: qm start will activate volumes
> > node B: qm start returns
> > node A: migration starts
> > node A/B: some fatal error
> > node A: cancel migration (via QMP/the source VM running on node A)
> > node A: request to stop target VM on node B (over SSH)
> > node B: `qm stop ..` called
> > node B: qm stop will deactivate volumes
> >
> > I am not sure where another activate_volume call after node A has started the
> > migration could happen? at that point, node A still has control over the VM (ID),
> > so nothing in PVE should operate on it other than the selective calls made as
> > part of the migration, which are basically only querying migration status and
> > error handling at that point..
> >
> > it would still be good to know what actually got OOM-killed in your case.. was
> > it the `qm start`? was it the `kvm` process itself? something entirely else?
> >
> > if you can reproduce the issue, you could also add logging in activate_volume
> > to find out the exact call path (e.g., log the call stack somewhere), maybe
> > that helps find the exact scenario that you are seeing..
* Re: [pve-devel] PVE child process behavior question
From: Denis Kanchev <denis.kanchev@storpool.com> @ 2025-06-04 12:52 UTC
To: Fabian Grünbichler <f.gruenbichler@proxmox.com>
Cc: Proxmox VE development discussion <pve-devel@lists.proxmox.com>, Wolfgang Bumiller <w.bumiller@proxmox.com>

> how are you hooking the migration state to know whether deactivation should be
> done or not?

By using the VM property "lock", which must be "migrate":
PVE::Cluster::get_guest_config_properties(['lock']);

> qm start (over SSH, is this being killed?)
> -> start_vm task worker (or this?)
> --> activate_volume
> --> fork, enter systemd scope, run_command to execute the kvm process
> ---> kvm (or this?)

The parent of the process that is executing activate_volume() is killed; in this
case it should be `qm start`.
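For context, a rough sketch of how such a lock check could look inside a storage plugin, built around the call quoted above; the helper name, the vmid parsing, and the exact return shape of get_guest_config_properties() are illustrative assumptions rather than established API:

use PVE::Cluster;

# Hypothetical helper: return false while the owning guest is locked for migration,
# since during a live migration both the source and the target node legitimately
# have the volume active and the plugin must not detach it from the other node.
sub may_detach_from_other_nodes {
    my ($volname) = @_;

    # Illustrative: derive the VM ID from a volume name like 'vm-101-disk-0-...'
    my ($vmid) = $volname =~ /^vm-(\d+)-/;
    return 1 if !defined $vmid;

    # Assumed result shape: { $vmid => { lock => '...' } } for guests with the property set
    my $props = PVE::Cluster::get_guest_config_properties(['lock']);
    my $lock = $props->{$vmid}->{lock} // '';

    return $lock ne 'migrate';
}

Keying the decision off the cluster-wide lock property avoids passing an extra migration flag through the storage API, which is the alternative mentioned earlier in the thread.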
Thread overview: 15+ messages (newest: ~2025-06-04 12:53 UTC)

2025-05-21 13:13 [pve-devel] PVE child process behavior question  Denis Kanchev via pve-devel
2025-05-22  6:30 ` Fabian Grünbichler
2025-05-22  6:55 ` Denis Kanchev via pve-devel
  [not found]    ` <857cbd6c-6866-417d-a71f-f5b5297bf09c@storpool.com>
2025-05-22  8:22 ` Fabian Grünbichler
2025-05-28  6:13 ` Denis Kanchev via pve-devel
  [not found]    ` <CAHXTzuk7tYRJV_j=88RWc3R3C7AkiEdFUXi88m5qwnDeYDEC+A@mail.gmail.com>
2025-05-28  6:33 ` Fabian Grünbichler
2025-05-29  7:33 ` Denis Kanchev via pve-devel
  [not found]    ` <CAHXTzumXeyJQQCj+45Hmy5qdU+BTFBYbHVgPy0u3VS-qS=_bDQ@mail.gmail.com>
2025-06-02  7:37 ` Fabian Grünbichler
2025-06-02  8:35 ` Denis Kanchev via pve-devel
  [not found]    ` <CAHXTzukAMG9050Ynn-KRSqhCz2Y0m6vnAQ7FEkCmEdQT3HapfQ@mail.gmail.com>
2025-06-02  8:49 ` Fabian Grünbichler
2025-06-02  9:18 ` Denis Kanchev via pve-devel
  [not found]    ` <CAHXTzu=AiNx0iTWFEUU2kdzx9-RopwLc7rqGui6f0Q=+Hy52=w@mail.gmail.com>
2025-06-02 11:42 ` Fabian Grünbichler
2025-06-02 13:23 ` Denis Kanchev via pve-devel
  [not found]    ` <CAHXTzu=qrZe2eEZro7qteR=fDjJQX13syfB9fs5VfFbG7Vy6vQ@mail.gmail.com>
2025-06-02 14:31 ` Fabian Grünbichler
2025-06-04 12:52 ` Denis Kanchev via pve-devel