Date: Thu, 16 Apr 2026 13:07:40 +0200
From: Christian Ebner
To: Michael Köppl, pbs-devel@lists.proxmox.com
Subject: Re: [PATCH proxmox-backup v5 0/3] fix #7400: improve handling of corrupted job statefiles
References: <20260413132000.49889-1-m.koeppl@proxmox.com>
In-Reply-To: <20260413132000.49889-1-m.koeppl@proxmox.com>
List-Id: Proxmox Backup Server development discussion

On 4/13/26 3:18 PM, Michael Köppl wrote:
> This patch
> series fixes a problem [0] where an empty or corrupted job
> state file (due to I/O error, abrupt shutdown, ...) would cause API
> endpoints for listing jobs to return an error, breaking the web UI for
> users because they could not view any of their configured jobs of that
> type. It would also cause proxmox-backup-proxy to indefinitely skip the
> jobs until a user manually triggered it to rewrite the statefile.
>
> 1/3 is a preparatory patch that centralizes job statefile loading
> in compute_schedule_status instead of having every handler function
> open the statefile, handle potential errors and then passing the
> JobState to compute_schedule_status.
>
> 2/3 introduces a new JobState `Unknown`, representing cases in which the
> job state could not be determined. In addition, the patch also updates
> the scheduling functions such that errors during reading the statefiles
> will result in the Unknown state.
>
> 3/3 then utilizes this Unknown state and adapts the scheduling functions
> such that the Unknown state will then lead to the statefile being
> overwritten with a new Created state and the job running again at its
> next scheduled run.
>
> changes since v4:
> - updated docstring for Unknown JobState, no functional changes (thanks,
>   @Shannon)
>
> changes since v3:
> - adapted commit message of 1/3 to mention the change in behavior
>   regarding the handling of UPID parsing errors with garbage collection
>   state files
> - in 2/3, adapt JobState::load to return early with JobState::Unknown
> - defined a constant for the scheduling offset used when calculating
>   the last run time. The constant is introduced in 2/3 and also used in
>   3/3
> Thanks for the feedback on v3, @Christian!
>
> changes since v2:
> - introduced the Unknown state in 2/3, adapted 3/3 accordingly (thanks,
>   @Fabian and @Christian)
> - make sure the "could not open statefile" error is also printed in
>   garbage_collection_status if status_in_memory.upid is None (thanks,
>   @Christian)
> - inline jobtype and err variables in error logging
>
> changes since v1:
> - added preparatory patch 1/3, centralizing the statefile loading before
>   adapting the handling of the error case in that centralized place
>   (compute_schedule_status). Thanks, Christian for the suggestion!
> - adapted the error message if job statefile loading fails to make clear
>   that the default status will be returned as a fallback

Reviewed-by: Christian Ebner
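
For readers skimming the thread, the behavior described for 2/3 and 3/3 can
be sketched roughly as follows. This is not the code from the series, only a
minimal illustration under stated assumptions: the enum below is a simplified
stand-in for JobState, the "kind timestamp" text format and the helpers
parse_state, load_state and last_run_time are hypothetical, and the
scheduling offset constant mentioned in the changelog is left out.

```rust
use std::io;

/// Simplified stand-in for the job state discussed in the series.
#[derive(Debug, Clone, PartialEq)]
enum JobState {
    /// Job exists but has not run since `time` (epoch seconds).
    Created { time: i64 },
    /// Job last finished at `endtime` (epoch seconds).
    Finished { endtime: i64 },
    /// The statefile was unreadable, empty, or corrupted.
    Unknown,
}

/// Hypothetical parser for a toy "kind timestamp" format.
/// The real statefiles do not use this format; it only keeps the
/// sketch free of external dependencies.
fn parse_state(raw: &str) -> Option<JobState> {
    let mut fields = raw.split_whitespace();
    let kind = fields.next()?;
    let ts: i64 = fields.next()?.parse().ok()?;
    if fields.next().is_some() {
        return None; // trailing garbage counts as corruption
    }
    match kind {
        "created" => Some(JobState::Created { time: ts }),
        "finished" => Some(JobState::Finished { endtime: ts }),
        _ => None,
    }
}

/// Map any read or parse failure to `Unknown` instead of an error,
/// so that listing jobs does not fail on one bad statefile.
fn load_state(read: io::Result<String>) -> JobState {
    match read {
        Ok(raw) => parse_state(&raw).unwrap_or(JobState::Unknown),
        Err(_) => JobState::Unknown,
    }
}

/// Return the last run time used for scheduling. An `Unknown` state is
/// reset to `Created { time: now }` so the job runs again at its next
/// scheduled slot instead of being skipped indefinitely.
fn last_run_time(state: &mut JobState, now: i64) -> i64 {
    match *state {
        JobState::Created { time } => time,
        JobState::Finished { endtime } => endtime,
        JobState::Unknown => {
            *state = JobState::Created { time: now };
            now
        }
    }
}
```

Passing the result of std::fs::read_to_string(path) to load_state would give
Unknown for an empty or garbled file. A later call to last_run_time then
replaces that state with a fresh Created entry, which a caller would write
back to disk.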