public inbox for pbs-devel@lists.proxmox.com
* [pbs-devel] RFC: Scheduler for PBS
From: Max Carrara @ 2024-08-09  9:31 UTC
  To: pbs-devel

RFC: Scheduler for PBS
======================

Introduction
------------

Gabriel and I have been prototyping a new scheduler for PBS, mostly in
order to address #3086 [1]. We will first summarize this bugzilla issue
and elaborate on why we think that implementing a scheduler inside PBS
is the best way to solve it.

Furthermore, this RFC shall provide a high-level overview of our plans
and what we have managed to implement thus far. Additionally, we will
outline a couple of other problems that the scheduler - specifically
with our current architecture - could solve in the future.

We are doing this mostly to gather early feedback on our design and our
plans, and on whether we should continue in this direction.

We also want to gather thoughts on a few other problems that we think
need to be addressed for the scheduler to work in an efficient and
robust manner.

Summary of #3086: Limiting the Number of Parallel Backups
---------------------------------------------------------

The RFE #3086 [1] can be summarized as follows:

Currently, there is no way to limit the number of backup jobs that may
run in parallel.

This is not necessarily a problem for smaller setups or clusters, where
the administrator can schedule all backup jobs of all PVE hosts to run
at different times.

However, trying to manually coordinate backup jobs of multiple hosts
becomes increasingly cumbersome as one's infrastructure (number of
hosts) scales up. An administrator would need to be able to accurately
estimate how long each job would take, for each VM, on each host.

Additionally, running many backups in parallel risks oversaturating the
network, causing a drop in bandwidth for the running backups - this in
turn can affect running VMs [2].

Why a Scheduler is Necessary
----------------------------

The above issue seemingly cannot be solved through bandwidth limits or
similar measures, which is why we believe that implementing a scheduler
inside PBS is the correct route to take.

This belief was reinforced when the prototypes we developed were able
to successfully queue and schedule many backups launched in parallel
(from the CLI) at once. No other limits or "workarounds" were
necessary; all backup jobs eventually completed.

This makes it much easier for administrators to ensure that their
backups actually happen (and succeed) while removing the need to
manually adjust the timing of when backups are made, so that none
overlap.

Architectural Overview
----------------------

The scheduler internally encapsulates the job queue being used, which
in our case is a simple FIFO queue. We also use HTTP long-polling [3]
to schedule backup jobs, responding to the client only once the backup
job is started.

While long-polling appears to work fine for our current intents and
purposes, we still want to test whether alternatives (e.g.
"short-polling", as in normal polling) are more robust.

The main way to communicate with the scheduler is via its event loop.
This is a plain tokio task with an inner `loop` that matches on an enum
representing the different events / messages the scheduler may handle.
Such an event would be e.g. `NewBackupRequest` or `ConfigUpdate`.

The event loop receives events via an mpsc channel and may respond to
them individually via oneshot channels which are set up when certain
events are created. The benefit of tokio's channels is that they can
also work in blocking contexts, so it is possible to completely isolate
the scheduler in a separate thread if needed, for example.
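
To illustrate the pattern, here is a heavily simplified sketch - all
names besides `NewBackupRequest` and `ConfigUpdate` are made up, and
the real events of course carry job details, auth info etc.:

    use std::collections::VecDeque;
    use tokio::sync::{mpsc, oneshot};

    enum SchedulerEvent {
        NewBackupRequest {
            // the requesting API handler awaits this before responding
            respond_to: oneshot::Sender<()>,
        },
        BackupJobFinished,
        ConfigUpdate { max_parallel: usize },
    }

    async fn event_loop(mut rx: mpsc::Receiver<SchedulerEvent>) {
        let mut max_parallel = 4;
        let mut running = 0;
        let mut queue: VecDeque<oneshot::Sender<()>> = VecDeque::new();

        while let Some(event) = rx.recv().await {
            match event {
                SchedulerEvent::NewBackupRequest { respond_to } => {
                    if running < max_parallel {
                        running += 1;
                        let _ = respond_to.send(()); // may start right away
                    } else {
                        queue.push_back(respond_to); // FIFO: wait for a slot
                    }
                }
                SchedulerEvent::BackupJobFinished => {
                    running -= 1;
                    if let Some(next) = queue.pop_front() {
                        running += 1;
                        let _ = next.send(());
                    }
                }
                SchedulerEvent::ConfigUpdate { max_parallel: new } => {
                    max_parallel = new; // applies to subsequent requests
                }
            }
        }
    }

The long-polling handler then simply awaits the oneshot receiver and
only responds to the client once the scheduler has handed out a slot.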

Because users should also be able to dynamically configure the
scheduler, configuration changes are handled via the `ConfigUpdate`
event. That way even the type of the queue can be changed on the fly,
which one prototype is able to do.

Furthermore, our prototypes currently run inside `proxmox-backup-proxy`
and are reasonably decoupled from the rest of PBS, due to the scheduler
being event-based.

Backward Compatibility Considerations
-------------------------------------

We are still in the process of adequately handling backward compat. At
the moment, HTTP long-polling lets us support older clients as well; no
issues have appeared thus far.

However, we are also considering at least one separate API endpoint for
polling purposes and overall better client support, as we feel that this
might be safer and would allow us to handle errors more gracefully.

Future Plans & Possibilities
----------------------------

1. Because the scheduler is keeping track of which jobs are currently
   running, it is relatively straightforward to check whether a job for
   the same group on the same datastore is running already. This makes
   it possible to queue the conflicting job, instead of having it fail
   immediately when trying to acquire the lock.

2. The scheduler should be in full control over when and which
   `WorkerTask`s are spawned, as that makes it much easier to handle
   errors. At the same time, the overall architecture of PBS would
   become much cleaner, by clearly separating concerns instead of having
   e.g. large API methods that do many things all at once [4].

3. The architecture of the scheduler is flexible enough to support
   different kinds of jobs in the future, so that e.g. prune, GC, sync
   jobs etc. may also be queued. This is definitely something we are
   considering implementing as well.

4. Should more types of jobs be implemented in the scheduler, separate
   limits could also be set for each job type (see the sketch after
   this list). For example, the global job limit could be set to 10,
   while allowing a maximum of 10 backup jobs but only 2 sync jobs to
   run concurrently. That way users can prioritize backup jobs over
   other jobs, or vice versa.

5. In addition to a global limit, limits could also be set for
   individual users and API tokens. This would allow for even
   finer-grained job control, but is more costly to implement.
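
To make point 4 a bit more concrete, here is a rough sketch of how such
limits could be checked - all names are made up, and this is not meant
as a definitive design:

    use std::collections::HashMap;

    #[derive(Clone, Copy, PartialEq, Eq, Hash)]
    enum JobType {
        Backup,
        Sync,
    }

    struct SchedulerLimits {
        global: usize,
        per_type: HashMap<JobType, usize>,
    }

    impl SchedulerLimits {
        /// May another job of `job_type` start, given the running jobs?
        fn may_start(
            &self,
            job_type: JobType,
            running: &HashMap<JobType, usize>,
        ) -> bool {
            if running.values().sum::<usize>() >= self.global {
                return false; // global limit reached
            }
            match self.per_type.get(&job_type) {
                Some(&limit) => {
                    running.get(&job_type).copied().unwrap_or(0) < limit
                }
                None => true, // no per-type limit configured
            }
        }
    }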

Final Thoughts
--------------

Please let us know what you think - we believe that implementing a
scheduler can potentially solve a vast number of issues for users who
aim to scale up their infrastructure.

Thank you for reading! :)

References
----------

[1]: https://bugzilla.proxmox.com/show_bug.cgi?id=3086
[2]: https://bugzilla.proxmox.com/show_bug.cgi?id=3086#c2
[3]: https://www.rfc-editor.org/rfc/rfc6202#section-2.1
[4]: https://git.proxmox.com/?p=proxmox-backup.git;a=blob;f=src/api2/backup/mod.rs;h=ea0d0292ec587382b154d436ce358e78fc723d0a;hb=refs/heads/master#l71



* Re: [pbs-devel] RFC: Scheduler for PBS
From: Christian Ebner @ 2024-08-09 11:22 UTC
  To: Proxmox Backup Server development discussion, Max Carrara

> On 09.08.2024 11:31 CEST Max Carrara <m.carrara@proxmox.com> wrote:
> Architectural Overview
> ----------------------
> 
> The scheduler internally encapsulates the job queue being used, which
> in our case is a simple FIFO queue. We also use HTTP long-polling [3]
> to schedule backup jobs, responding to the client only once the backup
> job is started.
> 
> While long-polling appears to work fine for our current intents and
> purposes, we still want to test whether alternatives (e.g.
> "short-polling", as in normal polling) are more robust.
> 
> The main way to communicate with the scheduler is via its event loop.
> This is a plain tokio task with an inner `loop` that matches on an enum
> representing the different events / messages the scheduler may handle.
> Such an event would be e.g. `NewBackupRequest` or `ConfigUpdate`.
> 
> The event loop receives events via an mpsc channel and may respond to
> them individually via oneshot channels which are set up when certain
> events are created. The benefit of tokio's channels is that they can
> also work in blocking contexts, so it is possible to completely isolate
> the scheduler in a separate thread if needed, for example.
> 
> Because users should also be able to dynamically configure the
> scheduler, configuration changes are handled via the `ConfigUpdate`
> event. That way even the type of the queue can be changed on the fly,
> which one prototype is able to do.
> 
> Furthermore, our prototypes currently run inside `proxmox-backup-proxy`
> and are reasonably decoupled from the rest of PBS, due to the scheduler
> being event-based.

Thanks for the write-up, this does sound interesting!

Do you plan to also include the notification system, e.g. by sending out notification events based on events/messages handled by the scheduler? Or will that solely be handled by the worker tasks?

What about periodic tasks that should be run at a given time, e.g. for server-side alerts/monitoring tasks [0]? From your description I suppose these would simply be a different job type, and therefore be queued/executed based on their priority?

Can you already share some code (maybe of one of the prototypes), so one can have a closer look and do some initial testing, or is it still too experimental for that?

Cheers,
Chris

[0] https://bugzilla.proxmox.com/show_bug.cgi?id=5108



* Re: [pbs-devel] RFC: Scheduler for PBS
From: Max Carrara @ 2024-08-09 12:33 UTC
  To: Christian Ebner, Proxmox Backup Server development discussion

On Fri Aug 9, 2024 at 1:22 PM CEST, Christian Ebner wrote:
> > On 09.08.2024 11:31 CEST Max Carrara <m.carrara@proxmox.com> wrote:
> > Architectural Overview
> > ----------------------
> > 
> > The scheduler internally encapsulates the job queue being used, which
> > in our case is a simple FIFO queue. We also use HTTP long-polling [3]
> > to schedule backup jobs, responding to the client only once the backup
> > job is started.
> > 
> > While long-polling appears to work fine for our current intents and
> > purposes, we still want to test whether alternatives (e.g.
> > "short-polling", as in normal polling) are more robust.
> > 
> > The main way to communicate with the scheduler is via its event loop.
> > This is a plain tokio task with an inner `loop` that matches on an enum
> > representing the different events / messages the scheduler may handle.
> > Such an event would be e.g. `NewBackupRequest` or `ConfigUpdate`.
> > 
> > The event loop receives events via an mpsc channel and may respond to
> > them individually via oneshot channels which are set up when certain
> > events are created. The benefit of tokio's channels is that they can
> > also work in blocking contexts, so it is possible to completely isolate
> > the scheduler in a separate thread if needed, for example.
> > 
> > Because users should also be able to dynamically configure the
> > scheduler, configuration changes are handled via the `ConfigUpdate`
> > event. That way even the type of the queue can be changed on the fly,
> > which one prototype is able to do.
> > 
> > Furthermore, our prototypes currently run inside `proxmox-backup-proxy`
> > and are reasonably decoupled from the rest of PBS, due to the scheduler
> > being event-based.
>
> Thanks for the write-up, this does sound interesting!

Thanks for reading! Glad you like it!

>
> Do you plan to also include the notification system, e.g. by sending out notification events based on events/messages handled by the scheduler? Or will that solely be handled by the worker tasks?

We haven't considered this yet, but that does actually sound pretty
interesting - we could probably use the tokio broadcast channel for
that. That way other components could react to whatever is going on
inside the scheduler.

Damn, I like this idea. I'll definitely keep it in mind.
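
A rough sketch of what I have in mind - the type and variant names here
are completely made up:

    use tokio::sync::broadcast;

    #[derive(Clone, Debug)]
    enum SchedulerNotification {
        BackupJobStarted { job_id: String },
        BackupJobCompleted { job_id: String },
    }

    // e.g. the notification system or a monitoring service could
    // subscribe like this (ignoring `Lagged` errors for brevity):
    async fn monitoring_service(
        mut rx: broadcast::Receiver<SchedulerNotification>,
    ) {
        while let Ok(event) = rx.recv().await {
            println!("scheduler event: {event:?}");
        }
    }

The scheduler would keep the `broadcast::Sender` and emit such a
notification whenever it handles the corresponding internal event.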

Regarding worker tasks - I assume you mean `WorkerTask` here - I would
personally like to rewrite those, as they currently don't return a
handle, which is needed to check whether a task panicked or was
cancelled somehow. (If you hit CTRL+C on the CLI, the finish-event will
never reach the scheduler, thus the job is never removed from the set
of running jobs.)

We'll probably need to change the return type of `WorkerTask::spawn` and
`WorkerTask::new_thread`, but I'd personally like to have the scheduler
do all the spawning of tasks itself and introduce a type of worker task
that's more integrated with the scheduler, so we don't need to
needlessly pass the `JoinHandle`s around (and also don't use `String`s
for everything in the universe).
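
Just to illustrate the direction - this is not a concrete proposal, and
`anyhow::Error` plus all the names here are placeholders:

    use tokio::task::JoinHandle;

    // hypothetical replacement for `WorkerTask::spawn`; the real thing
    // would of course still need to produce a UPID etc.
    fn spawn_scheduled_task<F>(fut: F) -> JoinHandle<Result<(), anyhow::Error>>
    where
        F: std::future::Future<Output = Result<(), anyhow::Error>>
            + Send
            + 'static,
    {
        tokio::spawn(fut)
    }

    // with the handle, the scheduler can detect abnormal termination:
    async fn watch(handle: JoinHandle<Result<(), anyhow::Error>>) {
        match handle.await {
            Ok(Ok(())) => { /* job finished normally */ }
            Ok(Err(_err)) => { /* job failed; still remove it from the
                                  running set */ }
            Err(err) if err.is_panic() => { /* task panicked */ }
            Err(_err) => { /* task was cancelled */ }
        }
    }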

I hope that we could perhaps gradually transition from `WorkerTask` to
`WhateverTheNewTaskIsCalled`, as that would make things much less
painful, but it would involve a lot of churn nevertheless, I think.

So yes, I think I'd prefer the scheduler to emit events itself. A worker
task should IMO just focus on what it's supposed to do, after all.

>
> What about periodic tasks that should be run at a given time, e.g. for server-side alerts/monitoring tasks [0]? From your description I suppose these would simply be a different job type, and therefore be queued/executed based on their priority?

Currently, we check every minute (IIRC) whether a periodic task (like
GC, sync, etc.) needs to be run - which again is something that the
scheduler (or rather a new component of it) could handle.

The current loop we have for that is actually pretty fine - it could
just send jobs to the scheduler instead of launching any periodic jobs
itself.

For such alerts and monitoring things this could be done in a similar
way, I believe - if we add the "broadcasting idea" from above into the
mix, we could have some kind of "monitoring service" that listens to the
stuff the scheduler does. If e.g. the scheduler hasn't emitted a
`BackupJobCompleted` event (or whatever) for a while, the monitoring
service could send out an alert to the admin. What do you think about
that?


We don't have any handling for job priorities or something of the sort
yet, as we mostly focused on getting a basic FIFO queue working (while
trying to remain backward-compatible with the current behaviour).

However, this should be fairly trivial to implement as well - each job
could e.g. get a default priority of `0`; higher values mean that a job
has a higher priority (or we use an enum to represent those prios).

The rest could be done by the queue - we could just integrate that with
the FIFO queue (or even introduce a new queue type, just because we can
now ;P /j)

Or we could add a separate queue for periodic jobs - the scheduler could
simply prefer those over "requested" jobs. Lots of possibilities,
really.
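
For instance, a minimal sketch of a priority queue with FIFO
tie-breaking, built on `std`'s `BinaryHeap` (purely illustrative):

    use std::cmp::Reverse;
    use std::collections::BinaryHeap;

    // each entry: (priority, Reverse(sequence number), job id);
    // `BinaryHeap` is a max-heap, so higher priorities pop first, and
    // `Reverse(seq)` makes equal priorities pop in insertion order
    struct PriorityQueue {
        heap: BinaryHeap<(u8, Reverse<u64>, String)>,
        next_seq: u64,
    }

    impl PriorityQueue {
        fn push(&mut self, priority: u8, job_id: String) {
            self.heap.push((priority, Reverse(self.next_seq), job_id));
            self.next_seq += 1;
        }

        fn pop(&mut self) -> Option<String> {
            self.heap.pop().map(|(_prio, _seq, job_id)| job_id)
        }
    }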

Also, because of the event loop, it's really easy to just add more
events and `match` on them. In fact, I like this pattern so much that I
think we should adopt it in other places too.

>
> Can you already share some code (maybe of one of the prototypes), so one can have a closer look and do some initial testing, or is it still too experimental for that?

Yes -- Gabriel and I both have our prototypes in our staff repos! :)

When you test things, do keep in mind that it still doesn't play too
nicely with PVE (regarding e.g. when to fs-freeze / fs-thaw and a bunch
of other things) - that in particular is one of the reasons why we think
that we'll need at least one new endpoint for the scheduling stuff.
(Probably with some kind of (long-)polling mechanism as well.)

Thanks again for reading our RFC - you've given me *lots* of new ideas.
I'm really curious what others have to say to this as well. :)

>
> Cheers,
> Chris
>
> [0] https://bugzilla.proxmox.com/show_bug.cgi?id=5108




* Re: [pbs-devel] RFC: Scheduler for PBS
From: Dominik Csapak @ 2024-08-09 12:52 UTC
  To: Proxmox Backup Server development discussion, Max Carrara

Hi,

great to see that you're tackling this!

I read through the overview, which sounds fine, but I think that it
should reflect the actual issues more, namely limitations in memory,
threads, disk I/O and network.

The actual reason people want to schedule things is to not overload the system
(because of timeouts, hangs, etc.), so any scheduling system should consider
not only the number of jobs, but also how many resources each job will/can
utilize.

E.g. when I tried to introduce multi-threaded tape backup (configurable threads
per tape job), Thomas rightfully said that it's probably not a good idea, since
running multiple tape backup jobs in parallel increases the load by much more
than before.

I generally like the approach, but I personally would like to see some
work on resource constraints. For example, one could imagine a configurable
number of available threads and a (configurable?) number of threads used
per job type.

So I could set my available threads to e.g. 10, and if my tape backup jobs
then get 4 each, I can start 2 in parallel but not more.

Such a system does not have to be included from the beginning IMO, but the
architecture should be prepared for such things.

Does that make sense?



* Re: [pbs-devel] RFC: Scheduler for PBS
From: Max Carrara @ 2024-08-09 14:20 UTC
  To: Dominik Csapak, Proxmox Backup Server development discussion

On Fri Aug 9, 2024 at 2:52 PM CEST, Dominik Csapak wrote:
> Hi,
>
> great to see that you're tackling this!
>
> I read through the overview, which sounds fine, but I think that it
> should reflect the actual issues more, namely limitations in memory,
> threads, disk I/O and network.
>
> The actual reason people want to schedule things is to not overload the system
> (because of timeouts, hangs, etc.), so any scheduling system should consider
> not only the number of jobs, but also how many resources each job will/can
> utilize.
>
> E.g. when I tried to introduce multi-threaded tape backup (configurable threads
> per tape job), Thomas rightfully said that it's probably not a good idea, since
> running multiple tape backup jobs in parallel increases the load by much more
> than before.
>
> I generally like the approach, but I personally would like to see some
> work on resource constraints. For example, one could imagine a configurable
> number of available threads and a (configurable?) number of threads used
> per job type.
>
> So I could set my available threads to e.g. 10, and if my tape backup jobs
> then get 4 each, I can start 2 in parallel but not more.
>
> Such a system does not have to be included from the beginning IMO, but the
> architecture should be prepared for such things.
>
> Does that make sense?

That does make sense, yes! Thanks for bringing this to our attention.

We've just discussed this off-list a bit and mostly agree on stuff like
e.g. the thread limit per worker - though to be sure, do you mean the
number of threads that are passed to e.g. a `ParallelHandler` and
similar?

The scheduler doesn't currently have a way to *really* enforce any
limits, though with the event-based architecture, it should be fairly
trivial to add new fields to the scheduler's config.

We want to have a kind of "top-down control", so once the scheduler can
actually spawn and manage tasks itself (not like how it's done right
now, see my response to Chris), the scheduler could give the task a
separate thread pool for the stuff it wants to run in parallel. There
could even be different "types" of thread pools depending on the
purpose.

This is much easier said than done, but I'm honestly rather confident
that we can get this to work. I would prefer to have the
resource-checking and -management decoupled and warded off, so that the
scheduler itself isn't really concerned with that. Rather, it should ask
the (e.g.) `ResourceManager` if there are enough threads available for a
`JobType::TapeBackup` or something of the sort.

Another thing we've been discussing just now was to give the spawned
task a struct representing the limits it should abide by - that would be
a soft limit, but it would probably make things a lot easier. (After
all, passing a thread pool to the task also doesn't mean the task *has*
to use that thread pool...)

One thing I just discovered is tokio's `Semaphore` [1], which we could use
to keep track of the resources we've been handing out.
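
A quick sketch of how the thread budget from your example could look
with that - the `ResourceManager` name and its API are just
placeholders:

    use std::sync::Arc;
    use tokio::sync::{OwnedSemaphorePermit, Semaphore};

    struct ResourceManager {
        threads: Arc<Semaphore>, // global thread budget
    }

    impl ResourceManager {
        fn new(available_threads: usize) -> Self {
            Self { threads: Arc::new(Semaphore::new(available_threads)) }
        }

        /// Wait until `n` threads are available; dropping the permit
        /// returns them to the pool.
        async fn acquire_threads(&self, n: u32) -> OwnedSemaphorePermit {
            self.threads
                .clone()
                .acquire_many_owned(n)
                .await
                .expect("semaphore closed")
        }
    }

With your numbers: a budget of 10 and tape backup jobs acquiring 4 each
means two can run at once, while a third waits until permits return.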

So, IMO this is a good idea and something we definitely should consider
in the future, though I have a couple questions:

1. How would you track & enforce memory limits? I think this is a much
   harder problem, to be honest.

2. In the same vein, how could one find out how much memory a given task
   will use? There's nothing that prevents tasks from just allocating
   more memory at will, obviously.

   Do you rather mean that if e.g. >90% of memory is being used (which
   could be made configurable), we don't spawn any additional tasks?

3. How would you limit disk IO? We definitely want to add a limit for
   the number of jobs that can run on a datastore at a time, so I guess
   that would also be indirectly covered there...?

   (It could probably also be done with tokio's `Semaphore` [1], but
   we'd need some kind of abstraction on top of that, because we can
   still just read / write / open / close at will etc. We would need a
   uniform way of accessing disk resources and to *not* perform disk IO
   any other way, which will be *hard*.)

4. I guess network limits (e.g. bandwidth limits for sync jobs etc.)
   could just be enforced on the TCP socket, so this shouldn't be too
   hard. That way you could enforce individual rate limits for
   individual tasks. Though, probably also easier said than done. Can
   you elaborate some more on this, too?
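
   To illustrate what I mean by "enforced on the TCP socket", here's a
   very naive per-connection sketch - a real implementation would
   rather use a proper token bucket shared between tasks:

       use std::time::Duration;
       use tokio::io::AsyncWriteExt;
       use tokio::net::TcpStream;
       use tokio::time::sleep;

       // write in chunks and sleep in between, so the average rate
       // stays at roughly `bytes_per_sec` (ignoring write time)
       async fn throttled_write(
           stream: &mut TcpStream,
           data: &[u8],
           bytes_per_sec: usize,
       ) -> std::io::Result<()> {
           for chunk in data.chunks((bytes_per_sec / 10).max(1)) {
               stream.write_all(chunk).await?;
               sleep(Duration::from_millis(100)).await;
           }
           Ok(())
       }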

Thanks a lot for your input, you've given us lots of ideas as well! :)

[1]: https://docs.rs/tokio/latest/tokio/sync/struct.Semaphore.html



