From: Michael Köppl <m.koeppl@proxmox.com>
To: pbs-devel@lists.proxmox.com
Subject: [PATCH proxmox-backup v5 0/3] fix #7400: improve handling of corrupted job statefiles
Date: Mon, 13 Apr 2026 15:19:57 +0200
Message-ID: <20260413132000.49889-1-m.koeppl@proxmox.com>
This patch series fixes a problem [0] where an empty or corrupted job statefile (due to an I/O error, abrupt shutdown, ...) would cause the API endpoints for listing jobs to return an error, breaking the web UI because users could no longer view any of their configured jobs of that type. It would also cause proxmox-backup-proxy to skip the affected jobs indefinitely until a user manually triggered one, causing the statefile to be rewritten.

1/3 is a preparatory patch that centralizes job statefile loading in compute_schedule_status, instead of having every handler function open the statefile, handle potential errors, and then pass the JobState to compute_schedule_status.

2/3 introduces a new JobState, `Unknown`, representing cases in which the job state could not be determined. The patch also updates the scheduling functions so that errors while reading a statefile result in the Unknown state.

3/3 then makes use of the Unknown state and adapts the scheduling functions so that an Unknown state leads to the statefile being overwritten with a fresh Created state, letting the job run again at its next scheduled time.
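To illustrate the idea behind 2/3 and 3/3, here is a minimal, self-contained sketch. It does not use the actual PBS types or statefile format; the enum variants, the `parse_statefile` helper, and the line-based format are all hypothetical stand-ins for what src/server/jobstate.rs does: any read/parse failure maps to `Unknown` instead of an error, and the scheduler later heals `Unknown` back to `Created`.

```rust
/// Hypothetical, simplified job state (the real JobState in
/// src/server/jobstate.rs has different variants and fields).
#[derive(Debug, PartialEq)]
enum JobState {
    /// Job was (re)created and has not run yet.
    Created,
    /// Job finished; the statefile recorded an end time.
    Finished { endtime: i64 },
    /// Statefile was missing, empty, or unparsable.
    Unknown,
}

/// Parse statefile contents, mapping any failure to `Unknown`
/// instead of returning an error, so listing endpoints never fail.
/// The "created" / "finished:<endtime>" format is an assumption
/// made up for this sketch.
fn parse_statefile(contents: &str) -> JobState {
    match contents.trim() {
        "" => JobState::Unknown, // empty file (e.g. after abrupt shutdown)
        "created" => JobState::Created,
        s => match s.strip_prefix("finished:").and_then(|t| t.parse::<i64>().ok()) {
            Some(endtime) => JobState::Finished { endtime },
            None => JobState::Unknown, // corrupted contents -> Unknown, not an error
        },
    }
}

/// Sketch of the self-healing step from 3/3: a job whose state is
/// Unknown is treated as freshly created, so the scheduler will run
/// it again at its next scheduled time.
fn self_heal(state: JobState) -> JobState {
    match state {
        JobState::Unknown => JobState::Created, // would also rewrite the statefile
        other => other,
    }
}

fn main() {
    assert_eq!(parse_statefile(""), JobState::Unknown);
    assert_eq!(parse_statefile("created"), JobState::Created);
    assert_eq!(
        parse_statefile("finished:1776086327"),
        JobState::Finished { endtime: 1776086327 }
    );
    assert_eq!(parse_statefile("%%garbage%%"), JobState::Unknown);
    assert_eq!(self_heal(JobState::Unknown), JobState::Created);
    println!("ok");
}
```

The key design point, as described above, is that corruption is absorbed at the loading boundary: callers always receive a valid state value, and only the scheduler decides how to recover from `Unknown`.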
changes since v4:
- updated the docstring for the Unknown JobState, no functional changes (thanks, @Shannon)

changes since v3:
- adapted the commit message of 1/3 to mention the change in behavior regarding the handling of UPID parsing errors with garbage collection statefiles
- in 2/3, adapted JobState::load to return early with JobState::Unknown
- defined a constant for the scheduling offset used when calculating the last run time. The constant is introduced in 2/3 and also used in 3/3

Thanks for the feedback on v3, @Christian!

changes since v2:
- introduced the Unknown state in 2/3 and adapted 3/3 accordingly (thanks, @Fabian and @Christian)
- made sure the "could not open statefile" error is also printed in garbage_collection_status if status_in_memory.upid is None (thanks, @Christian)
- inlined the jobtype and err variables in error logging

changes since v1:
- added preparatory patch 1/3, centralizing the statefile loading before adapting the handling of the error case in that centralized place (compute_schedule_status). Thanks, @Christian, for the suggestion!
- adapted the error message emitted when loading a job statefile fails to make clear that the default status is returned as a fallback

proxmox-backup:

Michael Köppl (3):
  api: move statefile loading into compute_schedule_status
  fix #7400: api: gracefully handle corrupted job statefiles
  fix #7400: proxy: self-heal corrupted job statefiles

 src/api2/admin/datastore.rs     | 15 +++----
 src/api2/admin/prune.rs         |  9 ++---
 src/api2/admin/sync.rs          |  9 ++---
 src/api2/admin/verify.rs        |  9 ++---
 src/api2/tape/backup.rs         |  9 ++---
 src/bin/proxmox-backup-proxy.rs |  6 ++-
 src/server/jobstate.rs          | 67 +++++++++++++++++++++++++++++----
 7 files changed, 82 insertions(+), 42 deletions(-)

Summary over all repositories:
  7 files changed, 82 insertions(+), 42 deletions(-)

--
Generated by murpp 0.11.0