From: "Fabian Grünbichler" <f.gruenbichler@proxmox.com>
To: Denis Kanchev <denis.kanchev@storpool.com>
Cc: Wolfgang Bumiller <w.bumiller@proxmox.com>,
	Proxmox VE development discussion <pve-devel@lists.proxmox.com>
Subject: Re: [pve-devel] PVE child process behavior question
Date: Mon, 2 Jun 2025 16:31:04 +0200 (CEST)
Message-ID: <2059883652.1014.1748874664476@webmail.proxmox.com>
In-Reply-To: <CAHXTzu=qrZe2eEZro7qteR=fDjJQX13syfB9fs5VfFbG7Vy6vQ@mail.gmail.com>


> Denis Kanchev <denis.kanchev@storpool.com> hat am 02.06.2025 15:23 CEST geschrieben:
> 
> 
> We try to prevent having a volume active on two nodes, as that may lead to data corruption, so we detach the volume from all nodes (except the target one) via our shared storage system.
> In the activate_volume() sub our logic is to not detach the volume from other hosts in case of a migration, because activate_volume() can also be called in other cases, where detaching is necessary.
> But in this case, where the qm start process is killed, the migration is marked as failed and activate_volume() is still called on the destination host after migration_cancel (we check that the "lock" flag is set to migrate).
> That's why I proposed that the child processes be killed when the parent one dies - it would prevent such cases.
> Not sure if passing an extra argument (marking it as a migration) to activate_volume() would solve such an issue too.
> Here is a trace log of activate_volume() in case of migration.

but that activation happens as part of starting the target VM, which happens before the (actual) migration is started in QEMU.

so in this case, we have

qm start (over SSH, is this being killed?)
-> start_vm task worker (or this?)
--> activate_volume
--> fork, enter systemd scope, run_command to execute the kvm process
---> kvm (or this?)
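
as an aside regarding your proposal to kill child processes when the parent dies: on Linux a child is not terminated automatically when its parent exits, it simply keeps running unless it explicitly opts in via prctl(PR_SET_PDEATHSIG). a minimal standalone Perl sketch of that opt-in (assuming the Linux::Prctl CPAN module is available - this is just an illustration, not something PVE sets for the chain above, since e.g. the kvm process has to outlive the start task):

    use strict;
    use warnings;
    use POSIX qw(SIGTERM);
    use Linux::Prctl qw(set_pdeathsig);

    my $pid = fork() // die "fork failed: $!";
    if ($pid == 0) {
        # child: ask the kernel to deliver SIGTERM to us when the parent
        # exits - without this call the child just keeps running on its own
        set_pdeathsig(SIGTERM);
        exec('sleep', '3600') or die "exec failed: $!";
    }
    print "parent $$ started child $pid\n";
    # once this parent process exits, the child above receives SIGTERM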

how are you hooking the migration state to know whether deactivation should be done or not?
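
i.e., is it something roughly like the following? (just a hypothetical sketch on my side so we are talking about the same thing - the VMID parsing, the config lookup and the detach helper name are made up, only the activate_volume signature and the 'migrate' lock are real)

    use strict;
    use warnings;
    use PVE::Cluster;
    use PVE::QemuConfig;

    sub activate_volume {
        my ($class, $storeid, $scfg, $volname, $snapname, $cache) = @_;

        # guess the owning VMID from the volume name, e.g. vm-101-disk-0-...
        my ($vmid) = $volname =~ /^vm-(\d+)-/;

        my $migration_in_progress = 0;
        if (defined($vmid)) {
            # the config may still be owned by the source node at this
            # point, so look up which node it currently lives on first
            my $vmlist = PVE::Cluster::get_vmlist();
            my $node = $vmlist->{ids}->{$vmid}->{node};
            # a 'migrate' lock on the VM config means a live migration is
            # running, so the volume must stay attached on the source node
            my $conf = eval { PVE::QemuConfig->load_config($vmid, $node) };
            $migration_in_progress = 1
                if $conf && ($conf->{lock} // '') eq 'migrate';
        }

        if (!$migration_in_progress) {
            # hypothetical helper: detach the volume from all other nodes
            # via the shared storage API before activating it here
            $class->detach_from_other_nodes($scfg, $volname);
        }

        return 1;
    }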

> 
> 2025-05-02 13:03:28.2222 [2712103] took 0.0006:activate_volume:storeid 'autotest__ec2_1', scfg {'type' => 'storpool','shared' => 1,'template' => 'autotest__ec2_1','extra-tags' => 'tier=high','content' => {'iso' => 1,'images' => 1}}, volname 'vm-101-disk-0-sp-z.b.df.raw', exclusive undef at /usr/share/perl5/PVE/Storage/Custom/StorPoolPlugin.pm line 1551. 
>  PVE::Storage::Custom::StorPoolPlugin::activate_volume("PVE::Storage::Custom::StorPoolPlugin", "autotest__ec2_1", HASH(0x559cd06d88a0), "vm-101-disk-0-sp-z.b.df.raw", undef, HASH(0x559cd076b9a8)) called at /usr/share/perl5/PVE/Storage.pm line 1309 
>  PVE::Storage::activate_volumes(HASH(0x559cc99d04e0), ARRAY(0x559cd0754558)) called at /usr/share/perl5/PVE/QemuServer.pm line 5823 
>  PVE::QemuServer::vm_start_nolock(HASH(0x559cc99d04e0), 101, HASH(0x559cd0730ca0), HASH(0x559ccfd6d680), HASH(0x559cc99cfe38)) called at /usr/share/perl5/PVE/QemuServer.pm line 5592 
>  PVE::QemuServer::__ANON__() called at /usr/share/perl5/PVE/AbstractConfig.pm line 299 
>  PVE::AbstractConfig::__ANON__() called at /usr/share/perl5/PVE/Tools.pm line 259 
>  eval {...} called at /usr/share/perl5/PVE/Tools.pm line 259 
>  PVE::Tools::lock_file_full("/var/lock/qemu-server/lock-101.conf", 10, 0, CODE(0x559ccf14b968)) called at /usr/share/perl5/PVE/AbstractConfig.pm line 302 
>  PVE::AbstractConfig::__ANON__("PVE::QemuConfig", 101, 10, 0, CODE(0x559ccf59f740)) called at /usr/share/perl5/PVE/AbstractConfig.pm line 322 
>  PVE::AbstractConfig::lock_config_full("PVE::QemuConfig", 101, 10, CODE(0x559ccf59f740)) called at /usr/share/perl5/PVE/AbstractConfig.pm line 330 
>  PVE::AbstractConfig::lock_config("PVE::QemuConfig", 101, CODE(0x559ccf59f740)) called at /usr/share/perl5/PVE/QemuServer.pm line 5593 
>  PVE::QemuServer::vm_start(HASH(0x559cc99d04e0), 101, HASH(0x559ccfd6d680), HASH(0x559cc99cfe38)) called at /usr/share/perl5/PVE/API2/Qemu.pm line 3259 
>  PVE::API2::Qemu::__ANON__("UPID:lab-dk-2:00296227:0ADF72E0:683DA11F:qmstart:101:root\@pam:") called at /usr/share/perl5/PVE/RESTEnvironment.pm line 620 
>  eval {...} called at /usr/share/perl5/PVE/RESTEnvironment.pm line 611 
>  PVE::RESTEnvironment::fork_worker(PVE::RPCEnvironment=HASH(0x559cc99d0558), "qmstart", 101, "root\@pam", CODE(0x559cd06cc160)) called at /usr/share/perl5/PVE/API2/Qemu.pm line 3263 
>  PVE::API2::Qemu::__ANON__(HASH(0x559cd0700df8)) called at /usr/share/perl5/PVE/RESTHandler.pm line 499 
>  PVE::RESTHandler::handle("PVE::API2::Qemu", HASH(0x559cd05deb98), HASH(0x559cd0700df8), 1) called at /usr/share/perl5/PVE/RESTHandler.pm line 985 
>  eval {...} called at /usr/share/perl5/PVE/RESTHandler.pm line 968 
>  PVE::RESTHandler::cli_handler("PVE::API2::Qemu", "qm start", "vm_start", ARRAY(0x559cc99cfee0), ARRAY(0x559cd0745e98), HASH(0x559cd0745ef8), CODE(0x559cd07091f8), undef) called at /usr/share/perl5/PVE/CLIHandler.pm line 594 
>  PVE::CLIHandler::__ANON__(ARRAY(0x559cc99d00c0), undef, CODE(0x559cd07091f8)) called at /usr/share/perl5/PVE/CLIHandler.pm line 673 
>  PVE::CLIHandler::run_cli_handler("PVE::CLI::qm") called at /usr/sbin/qm line 8 
> 
> 
> On Mon, Jun 2, 2025 at 2:42 PM Fabian Grünbichler <f.gruenbichler@proxmox.com> wrote:
> > 
> >  > Denis Kanchev <denis.kanchev@storpool.com> hat am 02.06.2025 11:18 CEST geschrieben:
> >  > 
> >  > 
> >  > My bad :) in terms of Proxmox it must be handing over control of the storage - the storage plugin function activate_volume() is called in our case, which moves the storage to the new VM.
> >  > So no data is moved across the nodes and only the volumes get re-attached.
> >  > Thanks for the plentiful information
> >  
> >  okay!
> >  
> >  so you basically special case this "volume is active on two nodes" case which should only happen during a live migration, and that somehow runs into an issue if the migration is aborted because there is some suspected race somewhere?
> >  
> >  as part of a live migration, the sequence should be:
> >  
> >  node A: migration task starts
> >  node A: start request for target VM on node B (over SSH)
> >  node B: `qm start ..` is called
> >  node B: qm start will activate volumes
> >  node B: qm start returns
> >  node A: actual migration starts
> >  node A/B: some fatal error
> >  node A: cancel migration (via QMP/the source VM running on node A)
> >  node A: request to stop target VM on node B (over SSH)
> >  node B: `qm stop ..` called
> >  node B: qm stop will deactivate volumes
> >  
> >  I am not sure where another activate_volume call could happen after node A has started the migration? at that point, node A still has control over the VM (ID), so nothing in PVE should operate on it other than the selective calls made as part of the migration, which are basically only querying migration status and error handling at that point..
> >  
> >  it would still be good to know what actually got OOM-killed in your case.. was it the `qm start`? was it the `kvm` process itself? something else entirely?
> >  
> >  if you can reproduce the issue, you could also add logging in activate_volume to find out the exact call path (e.g., log the call stack somewhere), maybe that helps find the exact scenario that you are seeing..
> >  
> > 
> 
>


