From: "Fabian Grünbichler" <f.gruenbichler@proxmox.com>
To: Proxmox VE development discussion <pve-devel@lists.proxmox.com>,
	Thomas Lamprecht <t.lamprecht@proxmox.com>,
	Dominik Csapak <d.csapak@proxmox.com>
Subject: Re: [pve-devel] [PATCH qemu-server 1/1] qemu: add offline migration from dead node
Date: Tue, 1 Apr 2025 13:13:25 +0200 (CEST)
Message-ID: <135708611.3668.1743506005916@webmail.proxmox.com>
In-Reply-To: <6b44b21c-5399-47a5-8e06-3c1d3e1eaab9@proxmox.com>


> Thomas Lamprecht <t.lamprecht@proxmox.com> wrote on 01.04.2025 12:46 CEST:
> 
>  
> On 01.04.25 at 12:19, Dominik Csapak wrote:
> > while i also agree to all said here, I have one counter point to offer:
> > 
> > In the case that such an operation is necessary (e.g. HA is not wanted/needed/possible
> > for whatever reason), the user will fall back to doing it manually (iow. 'mv source target')
> > which is at least as dangerous as exposing it over the API, since
> > 
> > * now the admins sharing the system must share root@pam credentials (ssh/console access)
> >    (alternatively setup sudo, which has its own problems)
> 
> Setups with many admins already need to handle how they can log in as root, be
> it through a jump user (`doas` is a thing if sudo is deemed too complex), some
> identity provider (LDAP, OIDC, ... with PAM configuration), as root operations
> are required for other things too.

and the same feature on the API also requires root@pam anyway ;)

> [..]

> > * link to the doc section for it from the UI with a big caveat
> >    https://pve.proxmox.com/pve-docs/pve-admin-guide.html#_recovery
> 
> As Fabian wrote, such disclaimers might be nice for shifting the blame
> but are not enough in practice for such an operation.
> 
> And Fabian's point wasn't that doing it on the CLI is less dangerous, it's
> about the same either way, but that exposing this as a well-integrated feature
> makes it seem much less dangerous to the user, especially those that are
> less experienced and should be stumped and ask some support channel for
> help.
> 
> That said, the actual first step to move this forward would IMO be to create
> an extensive documentation/how-to for how such things can be resolved and what
> one needs to watch out for, sort of check-list style might be a good format.
> As that alone should help users a lot already, and that would also make it
> much clearer what a more integrated (semi-automated) way could look like.
> Which could be a check tool that helps with assessing the recovery depending
> on config, storage (types), network, mappings, ... which would ensure that
> common issues/blockers are not missed and will even help experienced admins.
> If that cannot be first documented and then optionally transformed into a
> hands-off evaluation checker tool, or if that's deemed to not help users, I
> really do not see how an API integrated solution can do so without just
> hand-waving away all the actual and real issues for why this does not
> already exist.

(improving) such docs would be nice - we do have a little bit here:

https://pve.proxmox.com/pve-docs/pve-admin-guide.html#_recovering_moving_guests_from_failed_nodes 

the only way to technically improve what is possible IMHO would be to implement
some kind of reliable STONITH mechanism in addition to fencing, and base an
integrated "guest stealing" mechanism on that (with some additional component
that ensures that if the "shot" comes back up right away it won't do anything
with the "stolen" guest before the theft is over).

e.g., if you have a (set of) remote-manageable power strip(s) configured that
allows:
- removing all power from a node
- querying the power state of a node

you could use that to reduce HA failover times (you can shoot the other node
if you want to make it fenced, irrespective of watchdog timeouts/..), and to
implement a guest stealing mechanism:
- put a file/entry in /etc/pve marking a guest as "currently being stolen"
- shoot the other node and verify it is down
- steal config
- remove marker file/entry
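
as a rough illustration only, the four steps could be sketched like this. the
`PowerControl` interface, the `FakePowerStrip` stand-in and the `stealing/<vmid>`
marker location are all hypothetical names made up for this sketch, nothing like
that exists today. the only real part is the
`nodes/<node>/qemu-server/<vmid>.conf` layout below /etc/pve. the sketch takes
the root directory as a parameter so it can be tried against a scratch
directory:

```python
import os
from typing import Protocol


class PowerControl(Protocol):
    """Hypothetical interface to a remote-manageable power strip."""

    def power_off(self, node: str) -> None: ...
    def is_powered(self, node: str) -> bool: ...


class FakePowerStrip:
    """In-memory stand-in for trying the sketch without real hardware."""

    def __init__(self) -> None:
        self.powered: dict[str, bool] = {}

    def power_off(self, node: str) -> None:
        self.powered[node] = False

    def is_powered(self, node: str) -> bool:
        # nodes we never touched are assumed to be running
        return self.powered.get(node, True)


def steal_guest(cfs_root: str, vmid: int, source: str, target: str,
                power: PowerControl) -> None:
    conf_rel = os.path.join("qemu-server", f"{vmid}.conf")
    src = os.path.join(cfs_root, "nodes", source, conf_rel)
    dst = os.path.join(cfs_root, "nodes", target, conf_rel)
    # hypothetical marker location, not an existing /etc/pve path
    marker = os.path.join(cfs_root, "stealing", str(vmid))

    # 1. mark the guest as "currently being stolen";
    #    mode "x" fails if a marker already exists
    os.makedirs(os.path.dirname(marker), exist_ok=True)
    with open(marker, "x") as f:
        f.write(f"{source} -> {target}\n")

    # 2. shoot the other node and verify it is down
    power.power_off(source)
    if power.is_powered(source):
        # assumption: the marker is left in place here, cleaning up a
        # failed attempt is not covered by the steps above
        raise RuntimeError(f"node {source} still has power, not stealing")

    # 3. steal the config
    os.makedirs(os.path.dirname(dst), exist_ok=True)
    os.rename(src, dst)

    # 4. remove the marker
    os.remove(marker)
```

creating the marker before shooting is what makes the ordering safe: whenever
the shot node comes back, the marker is either still there or the config has
already moved.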

no matter at which point after the shooting the other node comes back up, it
must first sync up /etc/pve, which means it can check for a marker whenever it
tries to lock a VM. if a marker is found, it's not allowed to lock (checking
doesn't require locking cluster-wide, only setting the marker would). if no
marker is found, either the config is not there anymore (the guest has been
stolen), or it hasn't been stolen and can be locked and used normally.
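
a minimal sketch of that check on the returning node, assuming a marker file
per guest. the `stealing/<vmid>` location and the function name are
hypothetical and only serve the illustration, the config path follows the
usual `nodes/<node>/qemu-server/<vmid>.conf` layout below /etc/pve:

```python
import os


def may_lock_guest(cfs_root: str, node: str, vmid: int) -> bool:
    """Decide whether `node` may lock guest `vmid` after syncing /etc/pve."""
    # hypothetical marker location, not an existing /etc/pve path
    marker = os.path.join(cfs_root, "stealing", str(vmid))
    conf = os.path.join(cfs_root, "nodes", node, "qemu-server",
                        f"{vmid}.conf")

    # a marker means a theft is in progress: locking is not allowed
    if os.path.exists(marker):
        return False

    # no marker: either the config is gone (guest was stolen) or it is
    # still here and the guest can be locked and used normally
    return os.path.exists(conf)
```

the check only reads state, so it needs no cluster-wide lock. only writing the
marker would.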

if no stonith mechanism is configured, stealing is not available.

