* [PATCH proxmox-backup v1 0/2] fix #7400: improve handling of corrupted job statefiles @ 2026-03-17 16:07 Michael Köppl 2026-03-17 16:07 ` [PATCH proxmox-backup v1 1/2] fix #7400: api: gracefully handle " Michael Köppl 2026-03-17 16:07 ` [PATCH proxmox-backup v1 2/2] fix #7400: proxy: self-heal " Michael Köppl 0 siblings, 2 replies; 8+ messages in thread From: Michael Köppl @ 2026-03-17 16:07 UTC (permalink / raw) To: pbs-devel This patch series fixes a problem where an empty or corrupted job state file (due to I/O error, abrupt shutdown, ...) would cause API endpoints for listing jobs to return an error, breaking the web UI for users because they could not view any of their configured jobs of that type. It would also cause proxmox-backup-proxy to skip the jobs indefinitely until a user manually triggered them to rewrite the statefile. 1/2 fixes the API layer by catching the load error, logging a warning, and returning a default JobScheduleStatus. This ensures that the API returns the job list, showing the affected job without its last run status. 2/2 updates the scheduler loop inside the proxy to match the behavior of a newly created job, triggering the job at its next interval (which then rewrites the statefile). proxmox-backup: Michael Köppl (2): fix #7400: api: gracefully handle corrupted job statefiles fix #7400: proxy: self-heal corrupted job statefiles src/api2/admin/prune.rs | 17 ++++++++++------- src/api2/admin/sync.rs | 16 ++++++++++------ src/api2/admin/verify.rs | 17 ++++++++++------- src/api2/tape/backup.rs | 17 ++++++++++------- src/bin/proxmox-backup-proxy.rs | 2 +- 5 files changed, 41 insertions(+), 28 deletions(-) Summary over all repositories: 5 files changed, 41 insertions(+), 28 deletions(-) -- Generated by murpp 0.9.0 ^ permalink raw reply [flat|nested] 8+ messages in thread
* [PATCH proxmox-backup v1 1/2] fix #7400: api: gracefully handle corrupted job statefiles 2026-03-17 16:07 [PATCH proxmox-backup v1 0/2] fix #7400: improve handling of corrupted job statefiles Michael Köppl @ 2026-03-17 16:07 ` Michael Köppl 2026-03-18 16:15 ` Christian Ebner 2026-03-18 17:22 ` Michael Köppl 2026-03-17 16:07 ` [PATCH proxmox-backup v1 2/2] fix #7400: proxy: self-heal " Michael Köppl 1 sibling, 2 replies; 8+ messages in thread From: Michael Köppl @ 2026-03-17 16:07 UTC (permalink / raw) To: pbs-devel Previously, if a job statefile was empty or corrupted (e.g. due to an I/O error or an abrupt shutdown), the JobStatus::load method would return an error that would be propagated up, causing the endpoint to return an error to the user, meaning users would not see any of their jobs if a single job had a corrupted statefile. Instead, handle the error explicitly, logging a warning and returning a default JobScheduleStatus, such that jobs lists can still be fetched, displaying the affected job as configured but simply missing its last run status. Signed-off-by: Michael Köppl <m.koeppl@proxmox.com> --- src/api2/admin/prune.rs | 17 ++++++++++------- src/api2/admin/sync.rs | 16 ++++++++++------ src/api2/admin/verify.rs | 17 ++++++++++------- src/api2/tape/backup.rs | 17 ++++++++++------- 4 files changed, 40 insertions(+), 27 deletions(-) diff --git a/src/api2/admin/prune.rs b/src/api2/admin/prune.rs index a5ebf2975..f2a6445c2 100644 --- a/src/api2/admin/prune.rs +++ b/src/api2/admin/prune.rs @@ -1,6 +1,6 @@ //! 
Datastore Prune Job Management -use anyhow::{format_err, Error}; +use anyhow::Error; use serde_json::Value; use proxmox_router::{ @@ -10,8 +10,8 @@ use proxmox_schema::api; use proxmox_sortable_macro::sortable; use pbs_api_types::{ - Authid, PruneJobConfig, PruneJobStatus, DATASTORE_SCHEMA, JOB_ID_SCHEMA, PRIV_DATASTORE_AUDIT, - PRIV_DATASTORE_MODIFY, + Authid, JobScheduleStatus, PruneJobConfig, PruneJobStatus, DATASTORE_SCHEMA, JOB_ID_SCHEMA, + PRIV_DATASTORE_AUDIT, PRIV_DATASTORE_MODIFY, }; use pbs_config::prune; use pbs_config::CachedUserInfo; @@ -73,10 +73,13 @@ pub fn list_prune_jobs( let mut list = Vec::new(); for job in job_config_iter { - let last_state = JobState::load("prunejob", &job.id) - .map_err(|err| format_err!("could not open statefile for {}: {}", &job.id, err))?; - - let mut status = compute_schedule_status(&last_state, Some(&job.schedule))?; + let mut status = match JobState::load("prunejob", &job.id) { + Ok(last_state) => compute_schedule_status(&last_state, Some(&job.schedule))?, + Err(err) => { + log::error!("could not open statefile for {}: {}", &job.id, err); + JobScheduleStatus::default() + } + }; if job.disable { status.next_run = None; } diff --git a/src/api2/admin/sync.rs b/src/api2/admin/sync.rs index 6722ebea0..b0ff53ef7 100644 --- a/src/api2/admin/sync.rs +++ b/src/api2/admin/sync.rs @@ -1,6 +1,6 @@ //! 
Datastore Synchronization Job Management -use anyhow::{bail, format_err, Error}; +use anyhow::{bail, Error}; use serde::{Deserialize, Serialize}; use serde_json::Value; @@ -12,7 +12,8 @@ use proxmox_schema::api; use proxmox_sortable_macro::sortable; use pbs_api_types::{ - Authid, SyncDirection, SyncJobConfig, SyncJobStatus, DATASTORE_SCHEMA, JOB_ID_SCHEMA, + Authid, JobScheduleStatus, SyncDirection, SyncJobConfig, SyncJobStatus, DATASTORE_SCHEMA, + JOB_ID_SCHEMA, }; use pbs_config::sync; use pbs_config::CachedUserInfo; @@ -112,10 +113,13 @@ pub fn list_config_sync_jobs( continue; } - let last_state = JobState::load("syncjob", &job.id) - .map_err(|err| format_err!("could not open statefile for {}: {}", &job.id, err))?; - - let status = compute_schedule_status(&last_state, job.schedule.as_deref())?; + let status = match JobState::load("syncjob", &job.id) { + Ok(last_state) => compute_schedule_status(&last_state, job.schedule.as_deref())?, + Err(err) => { + log::error!("could not open statefile for {}: {}", &job.id, err); + JobScheduleStatus::default() + } + }; list.push(SyncJobStatus { config: job, diff --git a/src/api2/admin/verify.rs b/src/api2/admin/verify.rs index 66695236c..d2c47ce0c 100644 --- a/src/api2/admin/verify.rs +++ b/src/api2/admin/verify.rs @@ -1,6 +1,6 @@ //! 
Datastore Verify Job Management -use anyhow::{format_err, Error}; +use anyhow::Error; use serde_json::Value; use proxmox_router::{ @@ -11,8 +11,8 @@ use proxmox_schema::api; use proxmox_sortable_macro::sortable; use pbs_api_types::{ - Authid, VerificationJobConfig, VerificationJobStatus, DATASTORE_SCHEMA, JOB_ID_SCHEMA, - PRIV_DATASTORE_AUDIT, PRIV_DATASTORE_VERIFY, + Authid, JobScheduleStatus, VerificationJobConfig, VerificationJobStatus, DATASTORE_SCHEMA, + JOB_ID_SCHEMA, PRIV_DATASTORE_AUDIT, PRIV_DATASTORE_VERIFY, }; use pbs_config::verify; use pbs_config::CachedUserInfo; @@ -73,10 +73,13 @@ pub fn list_verification_jobs( let mut list = Vec::new(); for job in job_config_iter { - let last_state = JobState::load("verificationjob", &job.id) - .map_err(|err| format_err!("could not open statefile for {}: {}", &job.id, err))?; - - let status = compute_schedule_status(&last_state, job.schedule.as_deref())?; + let status = match JobState::load("verificationjob", &job.id) { + Ok(last_state) => compute_schedule_status(&last_state, job.schedule.as_deref())?, + Err(err) => { + log::error!("could not open statefile for {}: {}", &job.id, err); + JobScheduleStatus::default() + } + }; list.push(VerificationJobStatus { config: job, diff --git a/src/api2/tape/backup.rs b/src/api2/tape/backup.rs index 47e8d0209..dde65735d 100644 --- a/src/api2/tape/backup.rs +++ b/src/api2/tape/backup.rs @@ -1,6 +1,6 @@ use std::sync::{Arc, Mutex}; -use anyhow::{bail, format_err, Error}; +use anyhow::{bail, Error}; use serde_json::Value; use tracing::{info, warn}; @@ -11,8 +11,8 @@ use proxmox_schema::api; use proxmox_worker_task::WorkerTaskContext; use pbs_api_types::{ - print_ns_and_snapshot, print_store_and_ns, Authid, MediaPoolConfig, Operation, - TapeBackupJobConfig, TapeBackupJobSetup, TapeBackupJobStatus, JOB_ID_SCHEMA, + print_ns_and_snapshot, print_store_and_ns, Authid, JobScheduleStatus, MediaPoolConfig, + Operation, TapeBackupJobConfig, TapeBackupJobSetup, TapeBackupJobStatus, 
JOB_ID_SCHEMA, PRIV_DATASTORE_READ, PRIV_TAPE_AUDIT, PRIV_TAPE_WRITE, UPID_SCHEMA, }; @@ -97,10 +97,13 @@ pub fn list_tape_backup_jobs( continue; } - let last_state = JobState::load("tape-backup-job", &job.id) - .map_err(|err| format_err!("could not open statefile for {}: {}", &job.id, err))?; - - let status = compute_schedule_status(&last_state, job.schedule.as_deref())?; + let status = match JobState::load("tape-backup-job", &job.id) { + Ok(last_state) => compute_schedule_status(&last_state, job.schedule.as_deref())?, + Err(err) => { + log::error!("could not open statefile for {}: {}", &job.id, err); + JobScheduleStatus::default() + } + }; let next_run = status.next_run.unwrap_or(current_time); -- 2.47.3 ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH proxmox-backup v1 1/2] fix #7400: api: gracefully handle corrupted job statefiles 2026-03-17 16:07 ` [PATCH proxmox-backup v1 1/2] fix #7400: api: gracefully handle " Michael Köppl @ 2026-03-18 16:15 ` Christian Ebner 2026-03-18 17:04 ` Michael Köppl 2026-03-18 17:22 ` Michael Köppl 1 sibling, 1 reply; 8+ messages in thread From: Christian Ebner @ 2026-03-18 16:15 UTC (permalink / raw) To: Michael Köppl, pbs-devel On 3/17/26 5:07 PM, Michael Köppl wrote: > Previously, if a job statefile was empty or corrupted (e.g. due to an > I/O error or an abrupt shutdown), the JobStatus::load method would > return an error that would be propagated up, causing the endpoint to > return an error to the user, meaning users would not see any of their > jobs if a single job had a corrupted statefile. > > Instead, handle the error explicitly, logging a warning and returning a > default JobScheduleStatus, such that jobs lists can still be fetched, > displaying the affected job as configured but simply missing its last > run status. > > Signed-off-by: Michael Köppl <m.koeppl@proxmox.com> > --- Thanks for the patches, one comment for this patch. Instead of adapting all the call sides, loading of the state file might be moved to compute_schedule_status() in a preparatory patch, passing the job type and id as parameters. The improved error handling can then be done inside that helper. ^ permalink raw reply [flat|nested] 8+ messages in thread
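[Editorial note: the refactor suggested in the review above — moving statefile loading into compute_schedule_status() and passing the job type and id as parameters — could be sketched roughly as below. This is a simplified, std-only model: `JobScheduleStatus`, `load_job_state`, the statefile format, and the schedule arithmetic are illustrative stand-ins, not the real proxmox-backup types or signatures.]

```rust
// Sketch of the suggested refactor: the helper takes the job type and id,
// loads the statefile contents itself, and falls back to a default status
// on error, so all four call sites stay a single function call.

#[derive(Debug, Default, PartialEq)]
pub struct JobScheduleStatus {
    pub last_run_endtime: Option<i64>,
    pub next_run: Option<i64>,
}

// Stand-in for JobState::load: parse a last-run endtime out of the raw
// statefile contents; an empty or garbled file yields an error.
fn load_job_state(raw: &str) -> Result<i64, String> {
    raw.trim()
        .parse::<i64>()
        .map_err(|err| format!("corrupted statefile: {err}"))
}

pub fn compute_schedule_status(jobtype: &str, id: &str, raw_statefile: &str) -> JobScheduleStatus {
    match load_job_state(raw_statefile) {
        Ok(last_run_endtime) => JobScheduleStatus {
            last_run_endtime: Some(last_run_endtime),
            // a real implementation would evaluate the job's schedule here
            next_run: Some(last_run_endtime + 3600),
        },
        Err(err) => {
            // the improved error handling lives inside the helper now
            eprintln!("could not open statefile for {jobtype} {id}: {err}");
            JobScheduleStatus::default()
        }
    }
}
```

With this shape, each endpoint's loop would reduce to something like `let status = compute_schedule_status("prunejob", &job.id, …);` with no per-call-site match needed.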
* Re: [PATCH proxmox-backup v1 1/2] fix #7400: api: gracefully handle corrupted job statefiles 2026-03-18 16:15 ` Christian Ebner @ 2026-03-18 17:04 ` Michael Köppl 0 siblings, 0 replies; 8+ messages in thread From: Michael Köppl @ 2026-03-18 17:04 UTC (permalink / raw) To: Christian Ebner, Michael Köppl, pbs-devel On Wed Mar 18, 2026 at 5:15 PM CET, Christian Ebner wrote: > On 3/17/26 5:07 PM, Michael Köppl wrote: >> Previously, if a job statefile was empty or corrupted (e.g. due to an >> I/O error or an abrupt shutdown), the JobStatus::load method would >> return an error that would be propagated up, causing the endpoint to >> return an error to the user, meaning users would not see any of their >> jobs if a single job had a corrupted statefile. >> >> Instead, handle the error explicitly, logging a warning and returning a >> default JobScheduleStatus, such that jobs lists can still be fetched, >> displaying the affected job as configured but simply missing its last >> run status. >> >> Signed-off-by: Michael Köppl <m.koeppl@proxmox.com> >> --- > > Thanks for the patches, one comment for this patch. > > Instead of adapting all the call sides, loading of the state file might > be moved to compute_schedule_status() in a preparatory patch, passing > the job type and id as parameters. The improved error handling can then > be done inside that helper. Thanks for having a look! Yeah, true. Was so busy trying to avoid moving too much stuff around that I didn't see the cleaner solution. Will send an updated v2, thanks for the feedback. ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH proxmox-backup v1 1/2] fix #7400: api: gracefully handle corrupted job statefiles 2026-03-17 16:07 ` [PATCH proxmox-backup v1 1/2] fix #7400: api: gracefully handle " Michael Köppl 2026-03-18 16:15 ` Christian Ebner @ 2026-03-18 17:22 ` Michael Köppl 2026-03-19 8:05 ` Christian Ebner 1 sibling, 1 reply; 8+ messages in thread From: Michael Köppl @ 2026-03-18 17:22 UTC (permalink / raw) To: Michael Köppl, pbs-devel On Tue Mar 17, 2026 at 5:07 PM CET, Michael Köppl wrote: [snip] > }; > use pbs_config::prune; > use pbs_config::CachedUserInfo; > @@ -73,10 +73,13 @@ pub fn list_prune_jobs( > let mut list = Vec::new(); > > for job in job_config_iter { > - let last_state = JobState::load("prunejob", &job.id) > - .map_err(|err| format_err!("could not open statefile for {}: {}", &job.id, err))?; > - > - let mut status = compute_schedule_status(&last_state, Some(&job.schedule))?; > + let mut status = match JobState::load("prunejob", &job.id) { > + Ok(last_state) => compute_schedule_status(&last_state, Some(&job.schedule))?, > + Err(err) => { > + log::error!("could not open statefile for {}: {}", &job.id, err); Since I'm currently preparing v2, would it make sense to instead make this a warning? Not quite sure about it, but displaying an error to the user and then just continuing (and having self-healing behavior) seems a bit odd to me. > + JobScheduleStatus::default() > + } > + }; > if job.disable { > status.next_run = None; > } > diff --git a/src/api2/admin/sync.rs b/src/api2/admin/sync.rs > index 6722ebea0..b0ff53ef7 100644 > --- a/src/api2/admin/sync.rs > +++ b/src/api2/admin/sync.rs > @@ -1,6 +1,6 @@ > //! Datastore Synchronization Job Management [snip] ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH proxmox-backup v1 1/2] fix #7400: api: gracefully handle corrupted job statefiles 2026-03-18 17:22 ` Michael Köppl @ 2026-03-19 8:05 ` Christian Ebner 2026-03-19 10:19 ` Michael Köppl 0 siblings, 1 reply; 8+ messages in thread From: Christian Ebner @ 2026-03-19 8:05 UTC (permalink / raw) To: Michael Köppl, pbs-devel On 3/18/26 6:22 PM, Michael Köppl wrote: > On Tue Mar 17, 2026 at 5:07 PM CET, Michael Köppl wrote: > > [snip] > >> }; >> use pbs_config::prune; >> use pbs_config::CachedUserInfo; >> @@ -73,10 +73,13 @@ pub fn list_prune_jobs( >> let mut list = Vec::new(); >> >> for job in job_config_iter { >> - let last_state = JobState::load("prunejob", &job.id) >> - .map_err(|err| format_err!("could not open statefile for {}: {}", &job.id, err))?; >> - >> - let mut status = compute_schedule_status(&last_state, Some(&job.schedule))?; >> + let mut status = match JobState::load("prunejob", &job.id) { >> + Ok(last_state) => compute_schedule_status(&last_state, Some(&job.schedule))?, >> + Err(err) => { >> + log::error!("could not open statefile for {}: {}", &job.id, err); > > Since I'm currently preparing v2, would it make sense to instead make > this a warning? Not quite sure about it, but displaying an error to the > user and then just continuing (and having self-healing behavior) seems a > bit odd to me. IMO it makes sense as an error. There was an error reading the file after all. And you might not always be able to self-heal. E.g. what if you cannot re-write the state file (although this should be logged as well). Further, I do not expect these to show up frequently and the error message could be adapted to include the default being used as fallback? ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH proxmox-backup v1 1/2] fix #7400: api: gracefully handle corrupted job statefiles 2026-03-19 8:05 ` Christian Ebner @ 2026-03-19 10:19 ` Michael Köppl 0 siblings, 0 replies; 8+ messages in thread From: Michael Köppl @ 2026-03-19 10:19 UTC (permalink / raw) To: Christian Ebner, Michael Köppl, pbs-devel On Thu Mar 19, 2026 at 9:05 AM CET, Christian Ebner wrote: > On 3/18/26 6:22 PM, Michael Köppl wrote: >> On Tue Mar 17, 2026 at 5:07 PM CET, Michael Köppl wrote: >> >> [snip] >> >>> }; >>> use pbs_config::prune; >>> use pbs_config::CachedUserInfo; >>> @@ -73,10 +73,13 @@ pub fn list_prune_jobs( >>> let mut list = Vec::new(); >>> >>> for job in job_config_iter { >>> - let last_state = JobState::load("prunejob", &job.id) >>> - .map_err(|err| format_err!("could not open statefile for {}: {}", &job.id, err))?; >>> - >>> - let mut status = compute_schedule_status(&last_state, Some(&job.schedule))?; >>> + let mut status = match JobState::load("prunejob", &job.id) { >>> + Ok(last_state) => compute_schedule_status(&last_state, Some(&job.schedule))?, >>> + Err(err) => { >>> + log::error!("could not open statefile for {}: {}", &job.id, err); >> >> Since I'm currently preparing v2, would it make sense to instead make >> this a warning? Not quite sure about it, but displaying an error to the >> user and then just continuing (and having self-healing behavior) seems a >> bit odd to me. > > IMO it makes sense as an error. There was an error reading the file > after all. And you might not always be able to self-heal. E.g. what if > you cannot re-write the state file (although this should be logged as well). Ack, I see your point. Thanks for the feedback, I'll leave it as an error. > > Further, I do not expect these to show up frequently and the error > message could be adapted to include the default being used as fallback? Yeah, something like "could not open statefile for {}: {} - falling back to default job schedule status". I'll adapt it and send the v2. 
^ permalink raw reply [flat|nested] 8+ messages in thread
* [PATCH proxmox-backup v1 2/2] fix #7400: proxy: self-heal corrupted job statefiles 2026-03-17 16:07 [PATCH proxmox-backup v1 0/2] fix #7400: improve handling of corrupted job statefiles Michael Köppl 2026-03-17 16:07 ` [PATCH proxmox-backup v1 1/2] fix #7400: api: gracefully handle " Michael Köppl @ 2026-03-17 16:07 ` Michael Köppl 1 sibling, 0 replies; 8+ messages in thread From: Michael Köppl @ 2026-03-17 16:07 UTC (permalink / raw) To: pbs-devel Previously, if a job statefile was corrupted or missing, last_run_time() would return an error. The job scheduling loops handled this error by skipping the job indefinitely. To fix this, manual intervention was required (as in, triggering the job manually). Instead, match the behavior of a newly created job, ensuring that the job runs at its next scheduled interval. The job file will then be rewritten upon successful completion of the job. Signed-off-by: Michael Köppl <m.koeppl@proxmox.com> --- src/bin/proxmox-backup-proxy.rs | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/src/bin/proxmox-backup-proxy.rs b/src/bin/proxmox-backup-proxy.rs index c1fe3ac15..8359ac0f1 100644 --- a/src/bin/proxmox-backup-proxy.rs +++ b/src/bin/proxmox-backup-proxy.rs @@ -936,7 +936,7 @@ fn check_schedule(worker_type: &str, event_str: &str, id: &str) -> bool { Ok(time) => time, Err(err) => { eprintln!("could not get last run time of {worker_type} {id}: {err}"); - return false; + proxmox_time::epoch_i64() - 30 } }; -- 2.47.3 ^ permalink raw reply [flat|nested] 8+ messages in thread
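[Editorial note: the one-line change in patch 2/2 can be modeled in isolation. The sketch below is a simplified, std-only stand-in — `epoch_i64` mimics proxmox_time::epoch_i64(), and the surrounding check_schedule() logic is assumed — showing the idea: instead of skipping a job whose last run time cannot be read, pretend the last run happened 30 seconds ago, so the next scheduled event still fires and the statefile is rewritten when the job completes.]

```rust
use std::time::{SystemTime, UNIX_EPOCH};

// std-only stand-in for proxmox_time::epoch_i64()
fn epoch_i64() -> i64 {
    SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .expect("system clock before epoch")
        .as_secs() as i64
}

// Simplified model of the fallback in check_schedule(): a readable
// statefile yields the stored last run time; an unreadable one is treated
// like a freshly created job whose "last run" was 30 seconds ago, rather
// than being skipped indefinitely.
fn last_run_time_or_fallback(loaded: Result<i64, String>, worker_type: &str, id: &str) -> i64 {
    match loaded {
        Ok(time) => time,
        Err(err) => {
            eprintln!("could not get last run time of {worker_type} {id}: {err}");
            epoch_i64() - 30
        }
    }
}
```

The scheduler then evaluates the job's calendar event against this fallback time, so the job fires at its next regular interval and self-heals the statefile on success.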
end of thread, other threads:[~2026-03-19 10:19 UTC | newest] Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2026-03-17 16:07 [PATCH proxmox-backup v1 0/2] fix #7400: improve handling of corrupted job statefiles Michael Köppl 2026-03-17 16:07 ` [PATCH proxmox-backup v1 1/2] fix #7400: api: gracefully handle " Michael Köppl 2026-03-18 16:15 ` Christian Ebner 2026-03-18 17:04 ` Michael Köppl 2026-03-18 17:22 ` Michael Köppl 2026-03-19 8:05 ` Christian Ebner 2026-03-19 10:19 ` Michael Köppl 2026-03-17 16:07 ` [PATCH proxmox-backup v1 2/2] fix #7400: proxy: self-heal " Michael Köppl