Re: [pve-devel] [PATCH manager v2] api: implement node-independent bulk actions

From: "Fabian Grünbichler" <f.gruenbichler@proxmox.com>
To: Dominik Csapak <d.csapak@proxmox.com>, pve-devel@lists.proxmox.com
Subject: Re: [pve-devel] [PATCH manager v2] api: implement node-independent bulk actions
Date: Fri, 14 Nov 2025 10:32:50 +0100	[thread overview]
Message-ID: <176311277090.161537.17167193408053318170@yuna.proxmox.com> (raw)
In-Reply-To: <c9ff0657-de5f-431a-b493-ea138bd36ad0@proxmox.com>

Quoting Dominik Csapak (2025-11-14 10:16:20)
> thanks for the review!
> 
> just a few questions inline
> 
> On 11/14/25 9:32 AM, Fabian Grünbichler wrote:
> > Quoting Dominik Csapak (2025-08-14 13:26:59)
> [snip]
> >> +sub create_client {
> >> +    my ($authuser, $request_timeout) = @_;
> >> +    my ($user, undef) = PVE::AccessControl::split_tokenid($authuser, 1);
> >> +
> >> +    # TODO: How to handle Tokens?
> > 
> > not like below for sure ;) we'd need to make it queriable using the
> > RPCEnvironment (and store it there) I guess? maybe opt-in so the storing only
> > happens for certain API handlers (e.g., these ones here for a start)?
> > 
> > this basically escalates from the token to a ticket of the user, which is a
> > nogo even if you duplicate the current set of privilege checks here, as that is
> > just waiting to get out of sync
> 
> Just to clarify: what's the worst that could happen with this?
> 
> the API call checks if the token has the correct permissions for all 
> resources, and the user can't have less permissions at that point.
> So 'upgrading' the token to a full user ticket for the context
> of these changes does not inherently break anything?
> (Except I'm missing something?)

the issue is that "the correct permissions" is only valid now. it's potentially
not valid if one of the called endpoints is changed in the future. and if that
happens, this is now suddenly a (probably small!) privilege escalation to allow
the token to call the endpoint as the user.

> There is a TOCTOU issue of course, but IMHO it would not that big
> of a deal since this can only happen if the permissions change
> during the running task, and one could argue that the token
> had permission to do these changes when starting the api call,
> so it should be able to finish it.

that as well, but yeah, this is how we normally handle such things within an
endpoint handler.

> I modeled this after we do the metrics export api call
> in PVE/API2/Cluster/MetricServer.pm where we do the same
> (or similar), so we probably should change it there as well then?

yes

> Anyway, I'll see that we extend the rpcenv to be able to do that,
> but how should we mark the api call? in a special way like we do
> with 'protected' ? (so e.g. '"save-token-info" => 1,' in the api
> call (with a better name ofc))

see Thomas reply. I think that approach is sensible, given that we don't have
too many of these calls that do "manual" proxying.

> > 
> >> +    my $ticket = PVE::AccessControl::assemble_ticket($user || $authuser);
> >> +    my $csrf_token = PVE::AccessControl::assemble_csrf_prevention_token($user || $authuser);
> >> +
> >> +    my $node = PVE::INotify::nodename();
> >> +    my $fingerprint = PVE::Cluster::get_node_fingerprint($node);
> >> +
> >> +    my $conn_args = {
> >> +        protocol => 'https',
> >> +        host => 'localhost', # always call the api locally, let pveproxy handle the proxying
> >> +        port => 8006,
> >> +        ticket => $ticket,
> >> +        timeout => $request_timeout // 25, # default slightly shorter than the proxy->daemon timeout
> >> +        cached_fingerprints => {
> >> +            $fingerprint => 1,
> >> +        },
> >> +    };
> >> +
> >> +    my $api_client = PVE::APIClient::LWP->new($conn_args->%*);
> >> +    if (defined($csrf_token)) {
> >> +        $api_client->update_csrftoken($csrf_token);
> >> +    }
> >> +
> >> +    return $api_client;
> > 
> > this client doesn't automatically refresh the ticket before it expires, so
> > bigger sets of bulk actions that take more than 2h will always fail..
> > 
> 
> true, so i'd abstract away a 'make_request' function that checks
> if we should renew the ticket and always use that?

yeah, in this case it wouldn't even help to implement a relogin in the client,
since we don't do a login in the first place ;)

checking the ticket lifetime with some slack is probably good enough.

> >> +}
> >> +
> >> +# starts and awaits a task for each guest given via $startlist.
> >> +#
> >> +# takes a vm list in the form of
> >> +# {
> >> +#     0 => {
> >> +#         100 => { .. guest info ..},
> >> +#         101 => { .. guest info ..},
> >> +#     },
> >> +#     1 => {
> >> +#         102 => { .. guest info ..},
> >> +#         103 => { .. guest info ..},
> >> +#     },
> >> +# }
> >> +#
> >> +# max_workers: how many parallel tasks should be started.
> >> +# start_task: a sub that returns eiter a upid or 1 (undef means failure)
> >> +# check_task: if start_task returned a upid, will wait for that to finish and
> >> +#    call check_task with the resulting task status
> >> +sub handle_task_foreach_guest {
> >> +    my ($startlist, $max_workers, $start_task, $check_task) = @_;
> >> +
> >> +    my $rpcenv = PVE::RPCEnvironment::get();
> >> +    my $authuser = $rpcenv->get_user();
> >> +    my $api_client = create_client($authuser);
> >> +
> >> +    my $failed = [];
> >> +    for my $order (sort { $a <=> $b } keys $startlist->%*) {
> >> +        my $vmlist = $startlist->{$order};
> >> +        my $workers = {};
> >> +
> >> +        for my $vmid (sort { $a <=> $b } keys $vmlist->%*) {
> >> +
> >> +            # wait until at least one slot is free
> >> +            while (scalar(keys($workers->%*)) >= $max_workers) {
> >> +                for my $upid (keys($workers->%*)) {
> >> +                    my $worker = $workers->{$upid};
> >> +                    my $node = $worker->{guest}->{node};
> >> +
> >> +                    my $task = eval { $api_client->get("/nodes/$node/tasks/$upid/status") };
> > 
> > this could easily fail for reasons other than the task having exited? should we
> > maybe retry a few times to avoid accidents, before giving up?
> > 
> 
> yep, i'd include such functionality (opt-in just for these calls) in the 
> 'make_request' abstraction?

yeah, that would work.

> 
> >> +                    if (my $err = $@) {
> >> +                        push $failed->@*, $worker->{vmid};
> >> +
> >> +                        $check_task->($api_client, $worker->{vmid}, $worker->{guest}, 1, undef);
> >> +
> >> +                        delete $workers->{$upid};
> >> +                    } elsif ($task->{status} ne 'running') {
> >> +                        my $is_error = PVE::Tools::upid_status_is_error($task->{exitstatus});
> >> +                        push $failed->@*, $worker->{vmid} if $is_error;
> >> +
> >> +                        $check_task->(
> >> +                            $api_client, $worker->{vmid}, $worker->{guest}, $is_error, $task,
> >> +                        );
> >> +
> >> +                        delete $workers->{$upid};
> >> +                    }
> >> +                }
> >> +                sleep(1); # How much?
> >> +            }
> >> +
> >> +            my $guest = $vmlist->{$vmid};
> >> +            my $upid = eval { $start_task->($api_client, $vmid, $guest) };
> >> +            warn $@ if $@;
> > 
> > A: here we use warn (see further similar nits below)
> 
> true, i'll use log_warn for those
> 
> > 
> 
> [snip]
> >> +    code => sub {
> >> +        my ($param) = @_;
> >> +
> >> +        my $rpcenv = PVE::RPCEnvironment::get();
> >> +        my $authuser = $rpcenv->get_user();
> >> +
> >> +        my ($vmlist, $vmlist_string) = extract_vmlist($param);
> >> +
> >> +        check_guest_permissions($rpcenv, $authuser, $vmlist, ['VM.PowerMgmt']);
> >> +
> >> +        if ($param->{'to-disk'}) {
> >> +            check_guest_permissions($rpcenv, $authuser, $vmlist, ['VM.Config.Disk']);
> >> +        }
> >> +
> >> +        if (my $statestorage = $param->{statestorage}) {
> >> +            $rpcenv->check($authuser, "/storage/$statestorage", ['Datastore.AllocateSpace']);
> >> +        } else {
> >> +            # storage access must be done in start task
> >> +        }
> > 
> > this if should be nested in the other if?
> 
> we could, but since 'statestorage' requires 'to-disk' in the api anyway,
> so changing this would just increase indentation?
> 
> i could put it int he if, but the api already does that for us

here it makes no difference other than being harder to parse

> 
> > 
> >> +
> >> +        my $code = sub {
> >> +            my $startlist =
> >> +                PVE::API2::Nodes::Nodeinfo::get_start_stop_list(undef, undef, $vmlist_string);
> >> +
> >> +            print_start_action($vmlist, "Suspending");
> >> +
> >> +            # reverse order for suspend
> >> +            for my $order (keys $startlist->%*) {
> >> +                my $list = delete $startlist->{$order};
> >> +                $order = $order * -1;
> >> +                $startlist->{$order} = $list;
> >> +            }
> >> +
> >> +            my $start_task = sub {
> >> +                my ($api_client, $vmid, $guest) = @_;
> >> +                my $node = $guest->{node};
> >> +
> >> +                if ($guest->{type} ne 'qemu') {
> >> +                    log_warn("skipping $vmid, only VMs can be suspended");
> >> +                    return 1;
> >> +                }
> >> +
> >> +                if (!$param->{statestorage}) {
> > 
> > this should again be nested inside a check for to-disk being set
> 
> again, same argument as above
> > 
> >> +                    my $conf = PVE::QemuConfig->load_config($vmid, $node);
> >> +                    my $storecfg = PVE::Storage::config();
> >> +                    my $statestorage = PVE::QemuServer::find_vmstate_storage($conf, $storecfg);
> > 
> > this does not exist, it's in QemuConfig
> 
> ah yes, this was moved betweeen my v1 and v2 IIRC and i forgot to change it
> 
> > 
> >> +                    $rpcenv->check(
> >> +                        $authuser,
> >> +                        "/storage/$statestorage",
> >> +                        ['Datastore.AllocateSpace'],
> >> +                    );

but here we actually do a permission check if neither to-disk nor statestorage is set, which is wrong..

> >> +                }
> >> +
> >> +                my $status =
> >> +                    eval { $api_client->get("/nodes/$node/qemu/$vmid/status/current") };
> >> +                if (defined($status) && $status->{status} ne 'running') {
> >> +                    print STDERR "Skipping VM $vmid, not running.\n";
> >> +                    return 1;
> >> +                }
> >> +
> >> +                my $params = {};
> >> +                $params->{'todisk'} = $param->{'to-disk'} // 0;
> >> +                $params->{statestorage} = $param->{statestorage}
> >> +                    if defined($param->{statestorage});
> > 
> > statestorage only makes sense if you set to-disk, so it should be ordered like
> > that here as well..
> 
> again same argument as above
> 
> if you insist, i'll change it, but IMHO it just increases the 
> indentation level for not much gain

here it's only a readability issue I'd say, as long as the actual called API
endpoint continues to ignore statestorage if todisk is not set/enabled

> >> +                $params->{'todisk'} = $param->{'to-disk'} // 0;
> >> +                $params->{statestorage} = $param->{statestorage}
> >> +                    if defined($param->{statestorage});

_______________________________________________
pve-devel mailing list
pve-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel