Date: Fri, 09 Aug 2024 14:33:49 +0200
Message-Id: <D3BDU91J4FA4.EZOQ28CT3O8U@proxmox.com>
To: "Christian Ebner" <c.ebner@proxmox.com>, "Proxmox Backup Server
 development discussion" <pbs-devel@lists.proxmox.com>
From: "Max Carrara" <m.carrara@proxmox.com>
X-Mailer: aerc 0.17.0-72-g6a84f1331f1c
References: <D3B9YIIHKRP2.2EX6MKWE3C0NP@proxmox.com>
 <1045470937.13120.1723202551905@webmail.proxmox.com>
In-Reply-To: <1045470937.13120.1723202551905@webmail.proxmox.com>
Subject: Re: [pbs-devel] RFC: Scheduler for PBS

On Fri Aug 9, 2024 at 1:22 PM CEST, Christian Ebner wrote:
> > On 09.08.2024 11:31 CEST Max Carrara <m.carrara@proxmox.com> wrote:
> > Architectural Overview
> > ----------------------
> > 
> > The scheduler internally holds whichever type of job queue is being
> > used, which in our case is a simple FIFO queue. We also use HTTP
> > long-polling [3] to schedule backup jobs, responding to the client
> > only once the backup job has started.
> > 
> > While long-polling appears to work fine for our current intents and
> > purposes, we still want to test if any alternatives (e.g.
> > "short-polling", as in normal polling) are more robust.
> > 
> > The main way to communicate with the scheduler is via its event loop.
> > This is a plain tokio task with an inner `loop` that matches on an enum
> > representing the different events / messages the scheduler may handle.
> > Such an event would be e.g. `NewBackupRequest` or `ConfigUpdate`.
> > 
> > The event loop receives events via an mpsc channel and may respond to
> > them individually via oneshot channels which are set up when certain
> > events are created. The benefit of tokio's channels is that they can
> > also work in blocking contexts, so it is possible to completely isolate
> > the scheduler in a separate thread if needed, for example.
> > 
> > Because users should also be able to dynamically configure the
> > scheduler, configuration changes are handled via the `ConfigUpdate`
> > event. That way even the type of the queue can be changed on the fly,
> > which one of our prototypes can already do.
> > 
> > Furthermore, our prototypes currently run inside `proxmox-backup-proxy`
> > and are reasonably decoupled from the rest of PBS, due to the scheduler
> > being event-based.
>
> Thanks for the write-up, this does sound interesting!

Thanks for reading! Glad you like it!
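
To make that part of the overview a bit more concrete, here is a
minimal sketch of the event loop pattern - all names are illustrative,
not actual prototype code:

    use tokio::sync::{mpsc, oneshot};

    struct SchedulerConfig; // placeholder

    // illustrative event type; the real prototypes have more variants
    enum SchedulerEvent {
        NewBackupRequest {
            job: String, // stand-in for the actual job description
            // reply channel, answered once the job is started
            respond_to: oneshot::Sender<Result<(), String>>,
        },
        ConfigUpdate(SchedulerConfig),
        Shutdown,
    }

    async fn run_scheduler(mut rx: mpsc::Receiver<SchedulerEvent>) {
        loop {
            match rx.recv().await {
                Some(SchedulerEvent::NewBackupRequest { job, respond_to }) => {
                    // enqueue `job`; once it is dequeued and started, the
                    // long-polling client is answered via `respond_to`
                    let _ = (job, respond_to);
                }
                Some(SchedulerEvent::ConfigUpdate(_cfg)) => {
                    // swap out queue type, limits, etc. on the fly
                }
                Some(SchedulerEvent::Shutdown) | None => break,
            }
        }
    }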

>
> Do you plan to also include the notification system, e.g. by sending out notification events based on events/messages handled by the scheduler? Or will that solely be handled by the worker tasks?

We haven't considered this yet, but that does actually sound pretty
interesting - we could probably use the tokio broadcast channel for
that. That way other components could react to whatever is going on
inside the scheduler.
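
Roughly what I have in mind - again just a sketch with made-up names:

    use tokio::sync::broadcast;

    // hypothetical event type the scheduler could broadcast
    #[derive(Clone, Debug)]
    enum SchedulerNotification {
        JobQueued(String),
        JobStarted(String),
        JobCompleted(String),
    }

    async fn wire_up() {
        let (tx, _rx) = broadcast::channel::<SchedulerNotification>(64);

        // any interested component can subscribe ...
        let mut notifications = tx.subscribe();
        tokio::spawn(async move {
            while let Ok(event) = notifications.recv().await {
                // ... e.g. to forward events to the notification system
                println!("scheduler event: {event:?}");
            }
        });

        // ... while the scheduler just emits events as it goes
        let _ = tx.send(SchedulerNotification::JobQueued("vm-100".into()));
    }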

Damn, I like this idea. I'll definitely keep it in mind.

Regarding worker tasks - I assume you mean `WorkerTask` here - I would
personally like to rewrite those, as they currently don't return a
handle, which is needed to check whether a task panicked or was
cancelled somehow. (If you hit CTRL+C on the CLI, the finish-event will
never reach the scheduler, thus the job is never removed from the
running set of jobs.)
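
With such a handle, detecting both cases becomes straightforward - a
rough sketch of what the scheduler could do per task:

    use tokio::task::JoinHandle;

    async fn watch_job(handle: JoinHandle<Result<(), String>>) {
        match handle.await {
            Ok(Ok(())) => { /* job finished normally */ }
            Ok(Err(_err)) => { /* job itself failed */ }
            Err(join_err) if join_err.is_panic() => {
                // the task panicked - drop it from the running set
            }
            Err(_join_err) => {
                // the task was cancelled / aborted
            }
        }
    }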

We'll probably need to change the return type of `WorkerTask::spawn` and
`WorkerTask::new_thread`, but I'd personally like to have the scheduler
do all the spawning of tasks itself and introduce a type of worker task
that's more integrated with the scheduler, so we don't need to
needlessly pass the `JoinHandle`s around (and also don't use `String`s
for everything in the universe).

I hope that we could perhaps gradually transition from `WorkerTask` to
`WhateverTheNewTaskIsCalled`, as that would make things much less
painful, though it would still involve a lot of churn, I think.

So yes, I think I'd prefer the scheduler to emit events itself. A worker
task should IMO just focus on what it's supposed to do, after all.

>
> What about periodic tasks that should be run at a given time, e.g. for server-side alerts/monitoring tasks [0]? From your description I suppose these would simply be a different job type, and therefore be queued/executed based on their priority?

Currently we check every minute (IIRC) whether a periodic task (like
GC, Sync, etc.) needs to be run, which again is something that the
scheduler (or rather a new component of it) could handle.

The current loop we have for that is actually pretty fine - it could
just send jobs to the scheduler instead of launching any periodic jobs
itself, along the lines of the sketch below.
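
A minimal sketch of what I mean - the message type is made up; in
practice this would just be one of the scheduler's regular events:

    use std::time::Duration;
    use tokio::sync::mpsc;

    enum PeriodicJob {
        GarbageCollect(String),
        Sync(String),
    }

    async fn periodic_dispatcher(tx: mpsc::Sender<PeriodicJob>) {
        let mut interval = tokio::time::interval(Duration::from_secs(60));
        loop {
            interval.tick().await;
            // figure out which jobs are due; here we just pretend GC is
            let job = PeriodicJob::GarbageCollect("datastore1".into());
            if tx.send(job).await.is_err() {
                break; // scheduler is gone, stop ticking
            }
        }
    }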

For such alerts and monitoring things this could be done in a similar
way, I believe - if we add the "broadcasting idea" from above into the
mix, we could have some kind of "monitoring service" that listens to the
stuff the scheduler does. If e.g. the scheduler hasn't emitted a
`BackupJobCompleted` event (or whatever) for a while, the monitoring
service could send out an alert to the admin. What do you think about
that?
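
Sketched out, such a watchdog could look like this (reusing the made-up
`SchedulerNotification` type from above; the threshold is arbitrary):

    use std::time::{Duration, Instant};
    use tokio::sync::broadcast;
    use tokio::time::timeout;

    async fn watchdog(mut events: broadcast::Receiver<SchedulerNotification>) {
        let max_quiet = Duration::from_secs(60 * 60);
        let mut last_completed = Instant::now();
        loop {
            let remaining = max_quiet.saturating_sub(last_completed.elapsed());
            match timeout(remaining, events.recv()).await {
                Ok(Ok(SchedulerNotification::JobCompleted(_))) => {
                    last_completed = Instant::now();
                }
                Ok(Err(broadcast::error::RecvError::Closed)) => break,
                Ok(_) => {} // other events (or lag): keep waiting
                Err(_elapsed) => {
                    // nothing completed within the window - alert the admin
                    last_completed = Instant::now(); // avoid alert spam
                }
            }
        }
    }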


We don't have any handling for job priorities or something of the sort
yet, as we mostly focused on getting a basic FIFO queue working (while
trying to remain backward-compatible with the current behaviour).

However, this should be fairly trivial to implement as well - each job
could e.g. get a default priority of `0`; higher values mean that a job
has a higher priority (or we use an enum to represent those prios).
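
For example - a sketch, with made-up names and levels:

    use std::cmp::Reverse;
    use std::collections::BinaryHeap;

    // later variants sort higher, so they are popped first
    #[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
    enum Priority {
        Low,
        Normal, // the default
        High,
    }

    #[derive(PartialEq, Eq)]
    struct QueuedJob {
        priority: Priority,
        seq: u64, // FIFO tie-breaker among equal priorities
        name: String,
    }

    impl Ord for QueuedJob {
        fn cmp(&self, other: &Self) -> std::cmp::Ordering {
            // higher priority first; older (lower seq) first on ties
            self.priority
                .cmp(&other.priority)
                .then_with(|| Reverse(self.seq).cmp(&Reverse(other.seq)))
        }
    }

    impl PartialOrd for QueuedJob {
        fn partial_cmp(&self, other: &Self) -> Option<std::cmp::Ordering> {
            Some(self.cmp(other))
        }
    }

    fn example() {
        let mut queue = BinaryHeap::new();
        queue.push(QueuedJob { priority: Priority::Normal, seq: 0, name: "sync".into() });
        queue.push(QueuedJob { priority: Priority::High, seq: 1, name: "restore".into() });
        // `High` wins despite being enqueued later
        assert_eq!(queue.pop().unwrap().name, "restore");
    }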

The rest could be done by the queue - we could just integrate that with
the FIFO queue (or even introduce a new queue type, just because we can
now ;P /j)

Or we could add a separate queue for periodic jobs - the scheduler could
simply prefer those over "requested" jobs. Lots of possibilities,
really.

Also, because of the event loop, it's really easy to just add more
events and `match` on them. In fact, I like this pattern so much that I
think we should adopt it in other places too.

>
> Can you already share some code (maybe of one of the prototypes), so one can have a closer look and do some initial testing, or is it still too experimental for that?

Yes -- Gabriel and I both have our prototypes in our staff repos! :)

When you test things, do keep in mind that it still doesn't play too
nicely with PVE (regarding e.g. when to fs-freeze / fs-thaw and a bunch
of other things) - that in particular is one of the reasons why we think
that we'll need at least one new endpoint for the scheduling stuff.
(Probably with some kind of (long-)polling mechanism as well.)

Thanks again for reading our RFC - you've given me *lots* of new ideas.
I'm really curious what others have to say about this as well. :)

>
> Cheers,
> Chris
>
> [0] https://bugzilla.proxmox.com/show_bug.cgi?id=5108


