From: Daniel Kral <d.kral@proxmox.com>
To: Fiona Ebner <f.ebner@proxmox.com>,
	Proxmox VE development discussion <pve-devel@lists.proxmox.com>
Subject: Re: [pve-devel] [RFC cluster/ha-manager 00/16] HA colocation rules
Date: Fri, 25 Apr 2025 15:25:42 +0200
Message-ID: <500c452d-581d-4fb7-81d2-fe0f46d29fd6@proxmox.com>
In-Reply-To: <d977b382-95ab-4f51-b1ce-8268630a5e24@proxmox.com>

On 4/25/25 14:25, Fiona Ebner wrote:
> Am 25.04.25 um 10:36 schrieb Daniel Kral:
>> On 4/24/25 12:12, Fiona Ebner wrote:
>> As suggested by @Lukas off-list, I'll also try to make the check
>> selective, e.g. the user has made an infeasible change to the config
>> manually by writing to the file and then wants to create another rule.
>> Here it should ignore the infeasible rules (as they'll be dropped
>> anyway) and only check if the added rule / changed rule is infeasible.
> 
> How will you select the rule to drop? Applying the rules one-by-one to
> find a first violation?

AFAICS we could use the same helpers to check whether the rules are
feasible, and only check whether the added / updated ruleid is one that
is causing these troubles. I guess this would be a reasonable option
without duplicating code, though it would still check against the whole
config. There's surely some optimization potential here, but then we
would have a larger problem with reloading the rule configuration for
the manager anyway. For the latter, I could check at what configuration
size this becomes an actual bottleneck.

For either adding or updating a rule, we would just make the change to
the configuration in-memory and run the helper. Depending on the
result, we'd store the config or error out to the API user.
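
Roughly, as a sketch (all helper names here are made up for
illustration, not the actual ha-manager API):

    my $rules = read_rules_config();       # current on-disk rules config
    $rules->{ids}->{$ruleid} = $new_rule;  # apply the change in-memory only

    # Reuse the same feasibility helpers, but only act on errors that
    # concern the rule being added/updated here.
    my $errors = check_feasibility($rules);
    if (my $err = $errors->{$ruleid}) {
        die "rule '$ruleid' is infeasible: $err\n";
    }

    write_rules_config($rules);            # persist only if feasible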

> 
>> But as you said, it must not change the user's configuration in the end
>> as that would be very confusing to the user.
> 
> Okay, so dropping dynamically. I guess we could also disable such rules
> explicitly/mark them as being in violation with other rules somehow:
> Tri-state enabled/disabled/conflict status? Explicit field?
> 
> Something like that would make such rules easily visible and have the
> configuration better reflect the actual status.
> 
> As discussed off-list now: we can try to re-enable conflicting rules
> next time the rules are loaded.

Hm, there are three options now:

- Allowing conflicts over the create / update API and auto-resolving the 
conflicts as soon as we're able to (e.g. on the load / save where the 
rule becomes feasible again).

- Not allowing conflicts over the create / update API, but setting the
state to 'conflict' if manual changes (or other circumstances) made the
rules conflict with one another.

- Having something like the SDN config, where there's a working 
configuration and a "draft" configuration that needs to be applied. So 
conflicts are allowed in drafts, but not in working configurations.

The SDN option seems like too much for me here, but I just noticed the
similarity.

I guess one of the first two makes more sense. If there are no
arguments against this, I'd choose the second option, as we can always
allow intentional conflicts later if there's user demand or we see
other reasons for it.
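
To illustrate the second option (the 'state' property and its value
are only a working idea here, and I'm abbreviating the rule syntax):
say a manual edit leaves two colocation rules that cannot both be
satisfied; the losing rule would then be marked instead of being
silently dropped:

colocation: keep-together
     services: vm:101,vm:102
     affinity: together
     strict: 1

colocation: keep-apart
     services: vm:101,vm:102
     affinity: separate
     strict: 1
     state: conflict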

> 
>>>> The only thing that I'm unsure about this, is how we would migrate the
>>>> `nofailback` option, since this operates on the group-level. If we keep
>>>> the `<node>(:<priority>)` syntax and restrict that each service can only
>>>> be part of one location rule, it'd be easy to have the same flag. If we
>>>> go with multiple location rules per service and each having a score or
>>>> weight (for the priority), then we wouldn't be able to have this flag
>>>> anymore. I think we could keep the semantic if we move this flag to the
>>>> service config, but I'm thankful for any comments on this.
>>> My gut feeling is that going for a more direct mapping, i.e. each
>>> location rule represents one HA group, is better. The nofailback flag
>>> can still apply to a given location rule I think? For a given service,
>>> if a higher-priority node is online for any location rule the service is
>>> part of, with nofailback=0, it will get migrated to that higher-priority
>>> node. It does make sense to have a given service be part of only one
>>> location rule then though, since node priorities can conflict between
>>> rules.
>>
>> Yeah, I think this is the reasonable option too.
>>
>> I briefly discussed this with @Fabian off-list and we also agreed that
>> it would be good to keep location rules as close to a 1:1 mapping of
>> HA groups as possible and keep the nofailback per location rule, as
>> the behavior of the HA group's nofailback could still be preserved -
>> at least if there's only a single location rule per service.
>>
>> ---
>>
>> On the other hand, I'll have to take a closer look at whether we can do
>> something about the blockers when creating multiple location rules where
>> e.g. one has nofailback enabled and the other does not. As you already
>> said, they could easily conflict between rules...
>>
>> My previous idea was to make location rules as flexible as possible, so
>> that it would theoretically not matter if one writes:
>>
>> location: rule1
>>      services: vm:101
>>      nodes: node1:2,node2:1
>>      strict: 1
>> or:
>>
>> location: rule1
>>      services: vm:101
>>      nodes: node1
>>      strict: 1
>>
>> location: rule2
>>      services: vm:101
>>      nodes: node2
>>      strict: 1
>>
>> The order of which one's more important could be encoded in the order
>> in which the rules are defined (if one configures this in the config
>> directly that's easy, and I'd add an API endpoint to realize this over
>> the API/WebGUI too), or, maybe even simpler to maintain: just another
>> property.
> 
> We cannot use just the order, because a user might want to give two
> nodes the same priority. I'd also like to avoid an implicit
> order-priority mapping.

Right, good point!

> 
>> But then, the
>> nofailback would have to be either moved to some other place...
> 
>> Or it is still allowed in location rules, but either the more detailed
>> rule wins (e.g. one rule has node1 without a priority and the other does
>> have node1 with a priority)
> 
> Maybe we should prohibit multiple rules with the same service-node pair?
> Otherwise, my intuition says that all rules should be considered and the
> rule with the highest node priority should win.

Yes, I think that would make the most sense, similar to how we disallow
users to put the same two or more services into multiple negative
colocation rules.
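
E.g. something like the following would then simply be rejected, as
both rules assign a priority to the same pair (vm:101 / node1):

location: rule1
     services: vm:101
     nodes: node1:2
     strict: 1

location: rule2
     services: vm:101
     nodes: node1:1,node2:1
     strict: 1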

> 
>> or the first location rule with a specific
>> node wins and the other is ignored. But this is already confusing when
>> writing it out here...
>>
>> I'd prefer users to write the former (and make this the dynamic
>> 'canonical' form when selecting nodes), but as with colocation rules it
>> could make sense to separate them for specific reasons / use cases.
> 
> Fair point.
> 
>> And another reason why it could still make sense to go that way is to
>> allow "negative" location rules at a later point, which makes sense in
>> larger environments, where it's easier to write opt-out rules than opt-
>> in rules, so I'd like to keep that path open for the future.
> 
> We also discussed this off list: Daniel convinced me that it would be
> cleaner if the nofailback property were associated with a given
> service rather than a given location rule. And if we later support pools
> as resources, the property should be associated with (certain or all)
> services in that pool and defined in the resource config for the pool.
> 
> To avoid the double-negation with nofailback=0, it could also be renamed
> to a positive property, below called "auto-elevate", just a working name.
> 
> A small concern of mine was that this makes it impossible to have a
> service that only "auto-elevates" to a specific node with a priority,
> but not others. This is already not possible right now, and honestly,
> that would be quite strange behavior and not supporting that is unlikely
> to hurt real use cases.
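
For illustration, in the resources config this could then look
something like the following ("auto-elevate" only being the working
name from above, and the exact syntax still to be worked out):

vm: 101
     state: started
     auto-elevate: 1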

