Date: Mon, 13 Apr 2026 15:21:41 +0200
Subject: superseded: [PATCH proxmox-backup v4 0/3] fix #7400: improve handling of corrupted job statefiles
From: Michael Köppl <m.koeppl@proxmox.com>
To: Michael Köppl,
References: <20260403132628.210128-1-m.koeppl@proxmox.com>
In-Reply-To: <20260403132628.210128-1-m.koeppl@proxmox.com>
List-Id: Proxmox Backup Server development discussion

Superseded by: https://lore.proxmox.com/pbs-devel/20260413132000.49889-1-m.koeppl@proxmox.com

On Fri Apr 3,
2026 at 3:26 PM CEST, Michael Köppl wrote:
> This patch series fixes a problem [0] where an empty or corrupted job
> state file (due to an I/O error, abrupt shutdown, ...) would cause the
> API endpoints for listing jobs to return an error, breaking the web UI
> for users because they could not view any of their configured jobs of
> that type. It would also cause proxmox-backup-proxy to skip the
> affected jobs indefinitely, until a user manually triggered a rewrite
> of the statefile.
>
> 1/3 is a preparatory patch that centralizes job statefile loading in
> compute_schedule_status instead of having every handler function open
> the statefile, handle potential errors, and then pass the JobState to
> compute_schedule_status.
>
> 2/3 introduces a new JobState `Unknown`, representing cases in which
> the job state could not be determined. In addition, the patch updates
> the scheduling functions so that errors while reading the statefiles
> result in the Unknown state.
>
> 3/3 then makes use of this Unknown state and adapts the scheduling
> functions so that the Unknown state leads to the statefile being
> overwritten with a new Created state, with the job running again at
> its next scheduled time.
>
> changes since v3:
> - adapted the commit message of 1/3 to mention the change in behavior
>   regarding the handling of UPID parsing errors with garbage
>   collection state files
> - in 2/3, adapt JobState::load to return early with JobState::Unknown
> - defined a constant for the scheduling offset used when calculating
>   the last run time. The constant is introduced in 2/3 and also used
>   in 3/3
>
> Thanks for the feedback on v3, @Christian!
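[Editor's note: the load-and-self-heal flow described in the cover letter can be sketched roughly as below. This is an illustrative simplification, not the actual proxmox-backup code; the enum variants, statefile format, and function names here are assumptions.]

```rust
// Sketch of the behavior described in 2/3 and 3/3: reading a statefile
// never fails the API call; unreadable or unparsable state becomes
// Unknown, which the scheduler later heals back to Created.

#[derive(Debug, PartialEq)]
enum JobState {
    Created,
    Finished,
    /// New in 2/3: the job state could not be determined
    /// (missing, empty, or corrupted statefile).
    Unknown,
}

/// 2/3: instead of propagating I/O or parse errors to every API
/// handler, loading returns JobState::Unknown early on any failure.
fn load_state(raw: Result<String, std::io::Error>) -> JobState {
    let Ok(content) = raw else {
        return JobState::Unknown; // I/O error reading the statefile
    };
    match content.trim() {
        "created" => JobState::Created,
        "finished" => JobState::Finished,
        // empty (e.g. truncated after abrupt shutdown) or garbage
        _ => JobState::Unknown,
    }
}

/// 3/3: the scheduler self-heals an Unknown state by rewriting it as
/// Created, so the job runs again at its next scheduled time instead
/// of being skipped indefinitely.
fn heal(state: JobState) -> JobState {
    match state {
        JobState::Unknown => JobState::Created,
        other => other,
    }
}

fn main() {
    // A corrupted (empty) statefile yields Unknown instead of an error,
    // so job listings in the API keep working.
    let state = load_state(Ok(String::new()));
    assert_eq!(state, JobState::Unknown);
    // The scheduler then rewrites it so the job is scheduled again.
    assert_eq!(heal(state), JobState::Created);
}
```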
>
> changes since v2:
> - introduced the Unknown state in 2/3, adapted 3/3 accordingly
>   (thanks, @Fabian and @Christian)
> - make sure the "could not open statefile" error is also printed in
>   garbage_collection_status if status_in_memory.upid is None (thanks,
>   @Christian)
> - inline the jobtype and err variables in the error logging
>
> changes since v1:
> - added preparatory patch 1/3, centralizing the statefile loading
>   before adapting the handling of the error case in that centralized
>   place (compute_schedule_status). Thanks, Christian, for the
>   suggestion!
> - adapted the error message shown if job statefile loading fails to
>   make clear that the default status will be returned as a fallback
>
> [0] https://bugzilla.proxmox.com/show_bug.cgi?id=7400
>
> proxmox-backup:
>
> Michael Köppl (3):
>   api: move statefile loading into compute_schedule_status
>   fix #7400: api: gracefully handle corrupted job statefiles
>   fix #7400: proxy: self-heal corrupted job statefiles
>
>  src/api2/admin/datastore.rs     | 15 +++-----
>  src/api2/admin/prune.rs         |  9 ++---
>  src/api2/admin/sync.rs          |  9 ++---
>  src/api2/admin/verify.rs        |  9 ++---
>  src/api2/tape/backup.rs         |  9 ++---
>  src/bin/proxmox-backup-proxy.rs |  6 ++-
>  src/server/jobstate.rs          | 65 +++++++++++++++++++++++++++++----
>  7 files changed, 80 insertions(+), 42 deletions(-)
>
>
> Summary over all repositories:
>  7 files changed, 80 insertions(+), 42 deletions(-)