From mboxrd@z Thu Jan  1 00:00:00 1970
Date: Thu, 26 Mar 2026 15:16:17 +0100
From: "Daniel Kral" <d.kral@proxmox.com>
To: "Dominik Rusovac"
Subject: Re: [PATCH proxmox v2 04/40] resource-scheduling: introduce generic scheduler implementation
References: <20260324183029.1274972-1-d.kral@proxmox.com>
 <20260324183029.1274972-5-d.kral@proxmox.com>
List-Id: Proxmox VE development discussion

On Thu Mar 26, 2026 at 11:19 AM CET, Dominik Rusovac wrote:
> [snip]
>
> good to have:
>
> #[derive(Clone, Debug)]

ACK, thanks!
>
>> +pub struct Scheduler {
>> +    nodes: Vec<NodeUsage>,
>> +}
>> +
>
> nit: The implementation of `Scheduler` is totally fine as-is. This is
> just my two cents, as this was mentioned off-list.
>
> I believe that for the implementation of the scheduler, working with
> enum variants and a trait and then exploiting static dispatch is more
> convenient and easier to maintain, e.g.:
>
> pub enum Schedulerr {
>     Topsis(Nodes),
>     BruteForce(Nodes),
> }
>
> pub trait Decide {
>     fn node_imbalance(&self) -> f64;
>
>     fn node_imbalance_with_migration_candidate(&self, candidate: &MigrationCandidate) -> f64;
>
>     fn score_best_balancing_migration_candidates(
>         &self,
>         candidates: &[MigrationCandidate],
>         limit: usize,
>     ) -> Result<Vec<MigrationCandidate>, Error>;
> }
>
> impl Decide for Schedulerr {
>     [...]
> }
>
> pub fn score_best_balancing_migration_candidates(
>     scheduler: &impl Decide,
>     candidates: &[MigrationCandidate],
>     limit: usize,
> ) -> Result<Vec<MigrationCandidate>, Error> {
>     scheduler.score_best_balancing_migration_candidates(candidates, limit)
> }
>
> In a nutshell, this declares what a scheduler ought to be able to do to be used
> for scoring (that is, implementing the `Decide` trait); and implements
> all the functionality for all the variants in one place.
>
> Nice side-effects of this design:
> * one scoring function implements all the variants in one place, which is nice, I think
> * adding/removing a scheduler variant would become more systematic
> * modifying scheduler variants in terms of how they score or, for example, how they measure
>   imbalance, would also be more straightforward
>
> Again, just my two cents.

As discussed off-list, I like how readable the different method
alternatives are with the pattern matching, and that makes maintenance
and reading diffs easier. Though I'm not sure whether it is a good idea
to define the `Scheduler` by the algorithms that one or more of its
methods use.
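To make the alternative I have in mind concrete, here is a rough,
self-contained sketch of passing the preferred algorithm per call
instead of encoding it in the scheduler type (all names and the dummy
scoring logic here are made up, not taken from the patch):

```rust
// Hypothetical sketch: the algorithm is a per-call parameter, so the
// scheduler type stays independent of any particular scoring strategy.
#[derive(Clone, Copy, Debug)]
pub enum Algorithm {
    Topsis,
    BruteForce,
}

pub struct Scheduler {
    // Placeholder for the real per-node usage data.
    node_loads: Vec<f64>,
}

impl Scheduler {
    /// Dispatches on the requested algorithm with a plain `match`; each
    /// arm stays readable and diffs only touch the arm that changes.
    pub fn score(&self, algorithm: Algorithm) -> f64 {
        match algorithm {
            // Dummy computations standing in for the real scoring.
            Algorithm::Topsis => {
                self.node_loads.iter().sum::<f64>() / self.node_loads.len() as f64
            }
            Algorithm::BruteForce => {
                self.node_loads.iter().cloned().fold(f64::MIN, f64::max)
            }
        }
    }
}

fn main() {
    let scheduler = Scheduler {
        node_loads: vec![0.2, 0.4, 0.6],
    };
    println!("{:.1}", scheduler.score(Algorithm::Topsis));
    println!("{:.1}", scheduler.score(Algorithm::BruteForce));
}
```

Callers then pick the algorithm at the call site, and adding a variant
is a compile-checked change to the `match`.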
Choosing one algorithm might only be something we want in the short
term, e.g., while we figure out whether TOPSIS or brute force is the
right fit for users; we might drop brute force or TOPSIS for these
methods in the future, or add another method that uses yet another
algorithm. Then it might be better if the methods themselves, such as
score_nodes_to_start_resource() and
score_best_balancing_migration_candidates(), define which algorithm
they use. In that case it is only coincidental that two methods use the
same algorithm internally.

Maybe we could still go for a parameter that allows passing the
preferred algorithm to score_best_balancing_migration_candidates(),
which for now could be either brute force or TOPSIS?

Still, thanks a lot for the suggestion! I like the idea a lot, but I'm
a little unsure whether it is the right fit in this situation with
regard to future changes.

>
>> +impl Scheduler {
>> +    /// Instantiate scheduler instance from node usages.
>> +    pub fn from_nodes<I, T>(nodes: I) -> Self
>> +    where
>> +        I: IntoIterator<Item = T>,
>> +        T: Into<NodeUsage>,
>> +    {
>> +        Self {
>> +            nodes: nodes.into_iter().map(|node| node.into()).collect(),
>> +        }
>> +    }
>> +
>> +    /// Scores nodes to start a resource with the usage statistics `resource_stats` on.
>> +    ///
>> +    /// The scoring is done as if the resource is already started on each node. This assumes that
>> +    /// the already started resource consumes the maximum amount of each stat according to its
>> +    /// `resource_stats`.
>> +    ///
>> +    /// Returns a vector of (nodename, score) pairs. Scores are between 0.0 and 1.0 and a higher
>> +    /// score is better.
>> +    pub fn score_nodes_to_start_resource<T: Into<ResourceStats>>(
>> +        &self,
>> +        resource_stats: T,
>> +    ) -> Result<Vec<(String, f64)>, Error> {
>> +        let len = self.nodes.len();
>> +        let resource_stats = resource_stats.into();
>> +
>> +        let matrix = self
>> +            .nodes
>> +            .iter()
>> +            .enumerate()
>> +            .map(|(target_index, _)| {
>> +                // Base values on percentages to allow comparing nodes with different stats.
>> +                let mut highest_cpu = 0.0;
>> +                let mut squares_cpu = 0.0;
>> +                let mut highest_mem = 0.0;
>> +                let mut squares_mem = 0.0;
>> +
>> +                for (index, node) in self.nodes.iter().enumerate() {
>> +                    let mut new_stats = node.stats;
>> +
>> +                    if index == target_index {
>> +                        new_stats.add_started_resource(&resource_stats)
>> +                    };
>> +
>> +                    let new_cpu = new_stats.cpu_load();
>> +                    highest_cpu = f64::max(highest_cpu, new_cpu);
>> +                    squares_cpu += new_cpu.powi(2);
>> +
>> +                    let new_mem = new_stats.mem_load();
>> +                    highest_mem = f64::max(highest_mem, new_mem);
>> +                    squares_mem += new_mem.powi(2);
>> +                }
>> +
>> +                // Add 1.0 to avoid boosting tiny differences: e.g. 0.004 is twice as much as 0.002, but
>> +                // 1.004 is only slightly more than 1.002.
>> +                PveTopsisAlternative {
>> +                    average_cpu: 1.0 + (squares_cpu / len as f64).sqrt(),
>> +                    highest_cpu: 1.0 + highest_cpu,
>> +                    average_memory: 1.0 + (squares_mem / len as f64).sqrt(),
>> +                    highest_memory: 1.0 + highest_mem,
>> +                }
>> +                .into()
>> +            })
>> +            .collect::<Vec<_>>();
>> +
>> +        let scores =
>> +            topsis::score_alternatives(&topsis::Matrix::new(matrix)?, &PVE_HA_TOPSIS_CRITERIA)?;
>> +
>> +        Ok(scores
>> +            .into_iter()
>> +            .enumerate()
>> +            .map(|(n, score)| (self.nodes[n].name.to_string(), score))
>> +            .collect())
>> +    }
>> +}
>
> [snip]
>
> in the future, the proptest crate could come in handy for such helper
> functions.
>
> in general, I would propose to add proptests in the future.
>
> for some examples and inspiration, see [0].
>
> [0] https://lore.proxmox.com/all/20260306082046.34311-1-d.rusovac@proxmox.com/T/

+1

>
>> +fn new_homogeneous_cluster_scheduler() -> Scheduler {
>> +    let (maxcpu, maxmem) = (16, 64 * (1 << 30));
>> +
>> +    let node1 = NodeUsage {
>> +        name: String::from("node1"),
>> +        stats: NodeStats {
>> +            cpu: 1.7,
>> +            maxcpu,
>> +            mem: 12334 << 20,
>> +            maxmem,
>> +        },
>> +    };
>> +
>> +    let node2 = NodeUsage {
>> +        name: String::from("node2"),
>> +        stats: NodeStats {
>> +            cpu: 15.184,
>> +            maxcpu,
>> +            mem: 529 << 20,
>> +            maxmem,
>> +        },
>> +    };
>> +
>> +    let node3 = NodeUsage {
>> +        name: String::from("node3"),
>> +        stats: NodeStats {
>> +            cpu: 5.2,
>> +            maxcpu,
>> +            mem: 9381 << 20,
>> +            maxmem,
>> +        },
>> +    };
>> +
>> +    Scheduler::from_nodes(vec![node1, node2, node3])
>> +}
>> +
>> +fn new_heterogeneous_cluster_scheduler() -> Scheduler {
>> +    let node1 = NodeUsage {
>> +        name: String::from("node1"),
>> +        stats: NodeStats {
>> +            cpu: 1.7,
>> +            maxcpu: 16,
>> +            mem: 12334 << 20,
>> +            maxmem: 128 << 30,
>> +        },
>> +    };
>> +
>> +    let node2 = NodeUsage {
>> +        name: String::from("node2"),
>> +        stats: NodeStats {
>> +            cpu: 15.184,
>> +            maxcpu: 32,
>> +            mem: 529 << 20,
>> +            maxmem: 96 << 30,
>> +        },
>> +    };
>> +
>> +    let node3 = NodeUsage {
>> +        name: String::from("node3"),
>> +        stats: NodeStats {
>> +            cpu: 5.2,
>> +            maxcpu: 24,
>> +            mem: 9381 << 20,
>> +            maxmem: 64 << 30,
>> +        },
>> +    };
>> +
>> +    Scheduler::from_nodes(vec![node1, node2, node3])
>> +}
>
> [snip]
>
>> +#[test]
>> +fn test_score_homogeneous_nodes_to_start_resource() -> Result<(), Error> {
>> +    let scheduler = new_homogeneous_cluster_scheduler();
>> +
>> +    let heavy_memory_resource_stats = ResourceStats {
>> +        cpu: 0.0,
>> +        maxcpu: 1.0,
>> +        mem: 0,
>> +        maxmem: 12 << 30,
>> +    };
>> +
>> +    assert_eq!(
>> +        rank_nodes_to_start_resource(&scheduler, heavy_memory_resource_stats)?,
>> +        vec!["node2", "node3", "node1"]
>> +    );
>> +
>> +    let heavy_cpu_resource_stats = ResourceStats {
>> +        cpu: 0.0,
>> +        maxcpu: 12.0,
>> +        mem: 0,
>> +        maxmem: 0,
>> +    };
>> +
>> +    assert_eq!(
>> +        rank_nodes_to_start_resource(&scheduler, heavy_cpu_resource_stats)?,
>> +        vec!["node1", "node3", "node2"]
>> +    );
>> +
>> +    let unlimited_cpu_resource_stats = ResourceStats {
>> +        cpu: 0.0,
>> +        maxcpu: 0.0,
>> +        mem: 0,
>> +        maxmem: 0,
>> +    };
>> +
>> +    assert_eq!(
>> +        rank_nodes_to_start_resource(&scheduler, unlimited_cpu_resource_stats)?,
>> +        vec!["node1", "node3", "node2"]
>> +    );
>> +
>
> nit: confusing variable name

ACK this and the following, that was a copy-paste error unfortunately.

>
>> +    let unlimited_cpu_resource_stats = ResourceStats {
>> +        cpu: 0.0,
>> +        maxcpu: 12.0,
>> +        mem: 0,
>> +        maxmem: 12 << 30,
>> +    };
>> +
>> +    assert_eq!(
>
> nit: confusing variable name

ACK

>
>> +        rank_nodes_to_start_resource(&scheduler, unlimited_cpu_resource_stats)?,
>> +        vec!["node2", "node3", "node1"]
>> +    );
>> +
>> +    Ok(())
>> +}
>> +
>> +#[test]
>> +fn test_score_heterogeneous_nodes_to_start_resource() -> Result<(), Error> {
>> +    let scheduler = new_heterogeneous_cluster_scheduler();
>> +
>> +    let heavy_memory_resource_stats = ResourceStats {
>> +        cpu: 0.0,
>> +        maxcpu: 1.0,
>> +        mem: 0,
>> +        maxmem: 12 << 30,
>> +    };
>> +
>> +    assert_eq!(
>> +        rank_nodes_to_start_resource(&scheduler, heavy_memory_resource_stats)?,
>> +        vec!["node2", "node1", "node3"]
>> +    );
>> +
>> +    let heavy_cpu_resource_stats = ResourceStats {
>> +        cpu: 0.0,
>> +        maxcpu: 12.0,
>> +        mem: 0,
>> +        maxmem: 0,
>> +    };
>> +
>> +    assert_eq!(
>> +        rank_nodes_to_start_resource(&scheduler, heavy_cpu_resource_stats)?,
>> +        vec!["node3", "node2", "node1"]
>> +    );
>> +
>> +    let unlimited_cpu_resource_stats = ResourceStats {
>> +        cpu: 0.0,
>> +        maxcpu: 0.0,
>> +        mem: 0,
>> +        maxmem: 0,
>> +    };
>> +
>> +    assert_eq!(
>> +        rank_nodes_to_start_resource(&scheduler, unlimited_cpu_resource_stats)?,
>> +        vec!["node1", "node3", "node2"]
>> +    );
>> +
>
> nit: confusing variable name

ACK

>
>> +    let unlimited_cpu_resource_stats = ResourceStats {
>> +        cpu: 0.0,
>> +        maxcpu: 12.0,
>> +        mem: 0,
>> +        maxmem: 12 << 30,
>> +    };
>> +
>
> nit: confusing variable name

ACK

>
>> +    assert_eq!(
>> +        rank_nodes_to_start_resource(&scheduler, unlimited_cpu_resource_stats)?,
>> +        vec!["node2", "node1", "node3"]
>> +    );
>> +
>> +    Ok(())
>> +}
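
P.S. as a side note on the scoring hunk quoted above: the effect of the
"+ 1.0" shift and of the quadratic mean can be illustrated with a small
standalone sketch (illustrative numbers and helper name only, not patch
code):

```rust
// Illustrative sketch: why the scoring code adds 1.0 before comparing
// loads. Without the shift, near-zero loads produce exaggerated ratios.
fn quadratic_mean(loads: &[f64]) -> f64 {
    (loads.iter().map(|l| l.powi(2)).sum::<f64>() / loads.len() as f64).sqrt()
}

fn main() {
    let (a, b) = (0.004, 0.002);
    // Raw ratio: 0.004 is twice as much as 0.002 ...
    println!("{:.2}", a / b); // 2.00
    // ... but after the +1.0 shift the difference is marginal.
    println!("{:.4}", (1.0 + a) / (1.0 + b)); // 1.0020
    // The quadratic mean penalizes imbalance more than the arithmetic
    // mean would: sqrt((0.2^2 + 0.8^2) / 2) instead of 0.5.
    println!("{:.3}", quadratic_mean(&[0.2, 0.8])); // 0.583
}
```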