From: Michael Köppl <m.koeppl@proxmox.com>
To: pbs-devel@lists.proxmox.com
Subject: [PATCH proxmox-backup v4 0/3] fix #7400: improve handling of corrupted job statefiles
Date: Fri, 3 Apr 2026 15:26:25 +0200
Message-ID: <20260403132628.210128-1-m.koeppl@proxmox.com>
List-Id: Proxmox Backup Server development discussion

This patch series fixes a problem [0] where an empty or corrupted job
state file (due to an I/O error, abrupt shutdown, ...) would cause the
API endpoints for listing jobs to return an error. This broke the web UI
for users because they could not view any of their configured jobs of
that type. It would also cause proxmox-backup-proxy to skip the affected
jobs indefinitely until a user manually triggered them, which rewrites
the statefile.

1/3 is a preparatory patch that centralizes job statefile loading in
compute_schedule_status instead of having every handler function open
the statefile, handle potential errors, and then pass the JobState to
compute_schedule_status.

2/3 introduces a new JobState `Unknown`, representing cases in which the
job state could not be determined. The patch also updates the scheduling
functions so that errors while reading a statefile result in the Unknown
state.

3/3 then uses this Unknown state: the scheduling functions are adapted
so that an Unknown state leads to the statefile being overwritten with a
new Created state, and the job runs again at its next scheduled run.

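For readers who have not looked at the patches yet, the flow described
for 2/3 and 3/3 can be sketched roughly as below. This is a simplified,
self-contained illustration and not the code from the series: the
function names `load_state` and `heal_if_unknown`, the `time` field, and
the plain-integer statefile format are assumptions made for the example.
Only the `Unknown` and `Created` state names come from the description
above. The sketch also treats a missing file like a corrupted one, which
is a simplification.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Simplified stand-in for the job state. Only the variant names
/// `Created` and `Unknown` come from the series; the field is assumed.
#[derive(Debug, Clone, PartialEq)]
enum JobState {
    Created { time: i64 },
    Unknown,
}

/// Hypothetical loader: any read or parse failure maps to `Unknown`
/// instead of returning an error to the caller.
fn load_state(path: &Path) -> JobState {
    let raw = match fs::read_to_string(path) {
        Ok(raw) => raw,
        Err(_) => return JobState::Unknown, // return early on I/O errors
    };
    match raw.trim().parse::<i64>() {
        Ok(time) => JobState::Created { time },
        Err(_) => JobState::Unknown, // empty or garbled content
    }
}

/// Hypothetical self-healing step: if the state is `Unknown`, overwrite
/// the statefile with a fresh `Created` state so the job is scheduled
/// again; any other state is returned unchanged.
fn heal_if_unknown(path: &Path, now: i64) -> io::Result<JobState> {
    match load_state(path) {
        JobState::Unknown => {
            fs::write(path, now.to_string())?;
            Ok(JobState::Created { time: now })
        }
        state => Ok(state),
    }
}
```

In this sketch a listing endpoint would call `load_state` and report
`Unknown` instead of failing, while the scheduler would call
`heal_if_unknown` before deciding whether a job is due.
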
changes since v3:
- adapted commit message of 1/3 to mention the change in behavior
  regarding the handling of UPID parsing errors with garbage collection
  state files
- in 2/3, adapted JobState::load to return early with JobState::Unknown
- defined a constant for the scheduling offset used when calculating the
  last run time. The constant is introduced in 2/3 and also used in 3/3

Thanks for the feedback on v3, @Christian!

changes since v2:
- introduced the Unknown state in 2/3, adapted 3/3 accordingly (thanks,
  @Fabian and @Christian)
- made sure the "could not open statefile" error is also printed in
  garbage_collection_status if status_in_memory.upid is None (thanks,
  @Christian)
- inlined jobtype and err variables in error logging

changes since v1:
- added preparatory patch 1/3, centralizing the statefile loading before
  adapting the handling of the error case in that centralized place
  (compute_schedule_status). Thanks, Christian, for the suggestion!
- adapted the error message if job statefile loading fails to make clear
  that the default status will be returned as a fallback

[0] https://bugzilla.proxmox.com/show_bug.cgi?id=7400

proxmox-backup:

Michael Köppl (3):
  api: move statefile loading into compute_schedule_status
  fix #7400: api: gracefully handle corrupted job statefiles
  fix #7400: proxy: self-heal corrupted job statefiles

 src/api2/admin/datastore.rs     | 15 +++-----
 src/api2/admin/prune.rs         |  9 ++---
 src/api2/admin/sync.rs          |  9 ++---
 src/api2/admin/verify.rs        |  9 ++---
 src/api2/tape/backup.rs         |  9 ++---
 src/bin/proxmox-backup-proxy.rs |  6 ++-
 src/server/jobstate.rs          | 65 +++++++++++++++++++++++++++++----
 7 files changed, 80 insertions(+), 42 deletions(-)


Summary over all repositories:
  7 files changed, 80 insertions(+), 42 deletions(-)

-- 
Generated by murpp 0.11.0