From: "Max Carrara"
To: pbs-devel@lists.proxmox.com
Date: Fri, 09 Aug 2024 11:31:19 +0200
Subject: [pbs-devel] RFC: Scheduler for PBS

RFC: Scheduler for PBS
======================

Introduction
------------

Gabriel and I have been prototyping a new scheduler for PBS, mostly in
order to address #3086 [1]. We will first summarize this Bugzilla issue
and elaborate on why we think that implementing a scheduler inside PBS
is the best way to solve it. Furthermore, this RFC shall provide a
high-level overview of our plans and of what we have managed to
implement thus far. Additionally, we will outline a couple of other
problems that the scheduler, specifically with our current
architecture, could solve in the future.

We are doing this mostly because we want to gather some early feedback
on our design and our plans, and on whether we should continue going in
this direction or not. We also want to gather some thoughts on a few
other problems in particular that we think should be addressed
adequately in order for the scheduler to work in an efficient and
robust manner.

Summary of #3086: Limiting the Number of Parallel Backups
---------------------------------------------------------

The RFE #3086 [1] can be summarized as follows: Currently, there is no
way to limit the number of backup jobs that may run in parallel. This
is not necessarily a problem for smaller setups or clusters, where the
administrator can schedule all backup jobs of all PVE hosts to run at
different times. However, manually coordinating the backup jobs of
multiple hosts becomes increasingly cumbersome as one's infrastructure
(number of hosts) scales up: an administrator would need to be able to
accurately estimate how long each job takes, for each VM, on each host.
Additionally, running many backups in parallel risks oversaturating the
network, causing a drop in bandwidth for the running backups - this in
turn can affect running VMs [2].

Why a Scheduler is Necessary
----------------------------

The above issue seemingly cannot be solved through bandwidth limits or
similar measures, which is why we believe that implementing a scheduler
inside PBS is the correct route to take. This belief was reinforced
after the prototypes we developed were able to successfully queue and
schedule many backups that were launched in parallel (from the CLI) at
once. No other limits or "workarounds" were necessary; all backup jobs
eventually completed. This makes it much easier for administrators to
ensure that their backups actually happen (and succeed), while removing
the need to manually stagger backup schedules so that none overlap.

Architectural Overview
----------------------

The scheduler internally contains the type of job queue that is being
used, which in our case is a simple FIFO queue. We also use HTTP
long-polling [3] to schedule backup jobs, responding to the client only
once the backup job is started. While long-polling appears to work fine
for our current intents and purposes, we still want to test whether any
alternatives (e.g. "short-polling", as in normal polling) are more
robust.

The main way to communicate with the scheduler is via its event loop.
This is a plain tokio task with an inner `loop` that matches on an enum
representing the different events / messages the scheduler may handle,
e.g. `NewBackupRequest` or `ConfigUpdate`. The event loop receives
events via an mpsc channel and may respond to them individually via
oneshot channels which are set up when certain events are created. The
benefit of tokio's channels is that they also work in blocking
contexts, so it is possible to completely isolate the scheduler in a
separate thread if needed, for example.

Because users should also be able to configure the scheduler
dynamically, configuration changes are handled via the `ConfigUpdate`
event. That way even the type of the queue can be changed on the fly,
which one prototype is already able to do. Furthermore, our prototypes
currently run inside `proxmox-backup-proxy` and are reasonably
decoupled from the rest of PBS, due to the scheduler being event-based.
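To make the event-loop design above more concrete, here is a minimal,
self-contained sketch. Only the event names `NewBackupRequest` and
`ConfigUpdate` are taken from the description above; everything else
(the `BackupRequest` type, the `JobFinished` event, the field names,
the default limit) is made up for illustration and does not reflect the
actual prototype code:

    use std::collections::VecDeque;

    use tokio::sync::{mpsc, oneshot};

    // Hypothetical job description; the prototype's actual type differs.
    struct BackupRequest {
        datastore: String,
        backup_group: String,
    }

    // The events the scheduler's event loop matches on.
    enum SchedulerEvent {
        // The oneshot sender is how the scheduler later responds to the
        // long-polling HTTP handler once the job may start.
        NewBackupRequest(BackupRequest, oneshot::Sender<()>),
        ConfigUpdate { max_parallel: usize },
        JobFinished,
    }

    async fn event_loop(mut events: mpsc::Receiver<SchedulerEvent>) {
        // A simple FIFO queue, as in the current prototypes.
        let mut queue: VecDeque<(BackupRequest, oneshot::Sender<()>)> =
            VecDeque::new();
        let mut max_parallel = 4; // assumed default
        let mut running = 0usize;

        while let Some(event) = events.recv().await {
            match event {
                SchedulerEvent::NewBackupRequest(req, respond) => {
                    queue.push_back((req, respond));
                }
                SchedulerEvent::ConfigUpdate { max_parallel: new } => {
                    // Reconfigure on the fly, without restarting the loop.
                    max_parallel = new;
                }
                SchedulerEvent::JobFinished => {
                    running = running.saturating_sub(1);
                }
            }

            // Dispatch queued jobs (FIFO) while there is capacity. Sending
            // on the oneshot unblocks the long-polling handler, which then
            // responds to the client so the backup can start.
            while running < max_parallel {
                let Some((_req, respond)) = queue.pop_front() else { break };
                if respond.send(()).is_ok() {
                    running += 1;
                }
            }
        }
    }

Under this sketch, an API handler would send a `NewBackupRequest` into
the mpsc channel and then `.await` the corresponding oneshot receiver;
the HTTP response is only produced once the scheduler grants the slot,
which is exactly the long-polling behavior described above.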
Backward Compatibility Considerations
-------------------------------------

We are still in the process of adequately handling backward
compatibility. At the moment, HTTP long-polling lets us support older
clients as well; no issues have appeared thus far. However, we are also
considering at least one separate API endpoint for polling purposes and
overall better client support, as we feel that this might be safer and
would allow us to handle errors more gracefully.

Future Plans & Possibilities
----------------------------

1. Because the scheduler keeps track of which jobs are currently
   running, it is relatively straightforward to check whether a job for
   the same group on the same datastore is already running. This makes
   it possible to queue the conflicting job instead of having it fail
   immediately when trying to acquire the lock.

2. The scheduler should be in full control over when and which
   `WorkerTask`s are spawned, as that makes it much easier to handle
   errors. At the same time, the overall architecture of PBS would
   become much cleaner by clearly separating concerns, instead of
   having e.g. large API methods that do many things all at once [4].

3. The architecture of the scheduler is flexible enough to support
   different kinds of jobs in the future, so that e.g. prune, GC, sync
   jobs etc. may also be queued. This is definitely something we are
   considering implementing as well.

4. Should more types of jobs be implemented in the scheduler, separate
   limits could also be set for each type of job. For example, the
   global job limit could be set to 10, while allowing a maximum of 10
   backup jobs and 2 sync jobs to run concurrently. That way users can
   prefer backup jobs over other jobs, or vice versa (a small sketch of
   this follows after the list).

5. In addition to a global limit, limits could also be set for
   individual users and API tokens. This would allow for even
   finer-grained job control, but is more costly to implement.
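To illustrate point 4, here is a small sketch of what per-type limits
on top of a global limit could look like. The `JobKind` and `JobLimits`
names and all fields are hypothetical; only the numbers (global limit
10, up to 10 backup jobs, at most 2 sync jobs) are taken from the
example above:

    use std::collections::HashMap;

    // Hypothetical job categories; only backup jobs are scheduled today.
    #[derive(Clone, Copy, PartialEq, Eq, Hash)]
    enum JobKind {
        Backup,
        Sync,
    }

    // Illustrative limit configuration: a global cap plus optional
    // per-kind caps.
    struct JobLimits {
        global: usize,
        per_kind: HashMap<JobKind, usize>,
    }

    impl JobLimits {
        // A job may start only if both the global limit and the limit
        // configured for its own kind (if any) still have capacity.
        fn may_start(&self, kind: JobKind, running: &HashMap<JobKind, usize>) -> bool {
            let total: usize = running.values().sum();
            if total >= self.global {
                return false;
            }
            match self.per_kind.get(&kind) {
                Some(&limit) => running.get(&kind).copied().unwrap_or(0) < limit,
                None => true,
            }
        }
    }

    fn main() {
        // The example from point 4: global limit 10, backups up to 10,
        // sync jobs at most 2.
        let limits = JobLimits {
            global: 10,
            per_kind: HashMap::from([(JobKind::Backup, 10), (JobKind::Sync, 2)]),
        };
        let running = HashMap::from([(JobKind::Backup, 5), (JobKind::Sync, 2)]);
        assert!(limits.may_start(JobKind::Backup, &running)); // 7 < 10, 5 < 10
        assert!(!limits.may_start(JobKind::Sync, &running)); // sync cap reached
    }

Such a check would slot naturally into the dispatch step of the event
loop sketched earlier, replacing its single `max_parallel` counter.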
Final Thoughts
--------------

Please let us know what you think - we believe that implementing a
scheduler can potentially solve a vast number of issues for users that
aim to scale up their infrastructure.

Thank you for reading! :)

References
----------

[1]: https://bugzilla.proxmox.com/show_bug.cgi?id=3086
[2]: https://bugzilla.proxmox.com/show_bug.cgi?id=3086#c2
[3]: https://www.rfc-editor.org/rfc/rfc6202#section-2.1
[4]: https://git.proxmox.com/?p=proxmox-backup.git;a=blob;f=src/api2/backup/mod.rs;h=ea0d0292ec587382b154d436ce358e78fc723d0a;hb=refs/heads/master#l71

_______________________________________________
pbs-devel mailing list
pbs-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pbs-devel