* [PATCH proxmox-backup v3 0/3] fix #7400: improve handling of corrupted job statefiles
@ 2026-03-25 16:06 Michael Köppl
  2026-03-25 16:06 ` [PATCH proxmox-backup v3 1/3] api: move statefile loading into compute_schedule_status Michael Köppl
                   ` (2 more replies)
  0 siblings, 3 replies; 4+ messages in thread
From: Michael Köppl @ 2026-03-25 16:06 UTC (permalink / raw)
  To: pbs-devel

This patch series fixes a problem where an empty or corrupted job
statefile (due to an I/O error, abrupt shutdown, ...) would cause the API
endpoints for listing jobs to return an error, breaking the web UI
because users could no longer view any of their configured jobs of that
type. It would also cause proxmox-backup-proxy to skip the affected jobs
indefinitely, until a user manually triggered a run to rewrite the
statefile.

1/3 is a preparatory patch that centralizes job statefile loading
in compute_schedule_status, instead of having every handler function
open the statefile, handle potential errors, and then pass the
JobState to compute_schedule_status.

2/3 introduces a new JobState, `Unknown`, representing cases in which
the job state could not be determined. In addition, the patch updates
the scheduling functions so that errors while reading a statefile
result in the Unknown state.

3/3 then makes use of the Unknown state and adapts the scheduling
functions so that an Unknown state leads to the statefile being
overwritten with a fresh Created state, letting the job run again at
its next scheduled time.
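For illustration, the interplay of 2/3 and 3/3 can be sketched roughly as
follows. This is a simplified model, not the actual PBS types: the real
JobState in src/server/jobstate.rs also has Started/Finished variants
carrying UPIDs, and `load_state`/`heal` are stand-in names for the logic
in JobState::load and update_job_last_run_time.

```rust
// Simplified model of the states involved; Started/Finished omitted.
#[derive(Debug, Clone, PartialEq)]
enum JobState {
    Created { time: i64 },
    Unknown,
}

// 2/3: a statefile that cannot be parsed yields Unknown instead of an
// error that would break the job listing API.
fn load_state(parsed: Result<JobState, String>) -> JobState {
    parsed.unwrap_or_else(|err| {
        eprintln!("could not parse statefile: {err}");
        JobState::Unknown
    })
}

// 3/3: rewriting Unknown as a fresh Created state self-heals the
// corrupted statefile, so the job is picked up again by the scheduler.
fn heal(state: JobState, now: i64) -> JobState {
    match state {
        JobState::Unknown => JobState::Created { time: now },
        other => other,
    }
}
```

A corrupted statefile thus degrades to Unknown on load and is turned back
into Created on the next scheduler pass, rather than blocking the job
forever.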

changes since v2:
- introduced the Unknown state in 2/3, adapted 3/3 accordingly (thanks,
  @Fabian and @Christian)
- make sure the "could not open statefile" error is also printed in
  garbage_collection_status if status_in_memory.upid is None (thanks,
  @Christian)
- inline jobtype and err variables in error logging

changes since v1:
- added preparatory patch 1/3, which centralizes the statefile loading
  before the handling of the error case is adapted in that centralized
  place (compute_schedule_status). Thanks, Christian, for the suggestion!
- adapted the error message shown when job statefile loading fails to
  make clear that the default status is returned as a fallback

proxmox-backup:

Michael Köppl (3):
  api: move statefile loading into compute_schedule_status
  fix #7400: api: gracefully handle corrupted job statefiles
  fix #7400: proxy: self-heal corrupted job statefiles

 src/api2/admin/datastore.rs     | 15 ++++------
 src/api2/admin/prune.rs         |  9 ++----
 src/api2/admin/sync.rs          |  9 ++----
 src/api2/admin/verify.rs        |  9 ++----
 src/api2/tape/backup.rs         |  9 ++----
 src/bin/proxmox-backup-proxy.rs |  4 ++-
 src/server/jobstate.rs          | 52 +++++++++++++++++++++++++++++----
 7 files changed, 67 insertions(+), 40 deletions(-)


Summary over all repositories:
  7 files changed, 67 insertions(+), 40 deletions(-)

-- 
Generated by murpp 0.11.0





* [PATCH proxmox-backup v3 1/3] api: move statefile loading into compute_schedule_status
  2026-03-25 16:06 [PATCH proxmox-backup v3 0/3] fix #7400: improve handling of corrupted job statefiles Michael Köppl
@ 2026-03-25 16:06 ` Michael Köppl
  2026-03-25 16:06 ` [PATCH proxmox-backup v3 2/3] fix #7400: api: gracefully handle corrupted job statefiles Michael Köppl
  2026-03-25 16:06 ` [PATCH proxmox-backup v3 3/3] fix #7400: proxy: self-heal " Michael Köppl
  2 siblings, 0 replies; 4+ messages in thread
From: Michael Köppl @ 2026-03-25 16:06 UTC (permalink / raw)
  To: pbs-devel

Centralize loading of the job statefiles in compute_schedule_status,
reducing code duplication across the job management API endpoints.

Signed-off-by: Michael Köppl <m.koeppl@proxmox.com>
---
 src/api2/admin/datastore.rs | 15 ++++++---------
 src/api2/admin/prune.rs     |  9 +++------
 src/api2/admin/sync.rs      |  9 +++------
 src/api2/admin/verify.rs    |  9 +++------
 src/api2/tape/backup.rs     |  9 +++------
 src/server/jobstate.rs      |  8 ++++++--
 6 files changed, 24 insertions(+), 35 deletions(-)

diff --git a/src/api2/admin/datastore.rs b/src/api2/admin/datastore.rs
index cca340553..1a04a17ec 100644
--- a/src/api2/admin/datastore.rs
+++ b/src/api2/admin/datastore.rs
@@ -1167,19 +1167,14 @@ pub fn garbage_collection_status(
 
     let datastore = DataStore::lookup_datastore(&store, Operation::Read)?;
     let status_in_memory = datastore.last_gc_status();
-    let state_file = JobState::load("garbage_collection", &store)
-        .map_err(|err| log::error!("could not open GC statefile for {store}: {err}"))
-        .ok();
 
     let mut last = proxmox_time::epoch_i64();
 
+    let jobtype = "garbage_collection";
+
     if let Some(ref upid) = status_in_memory.upid {
-        let mut computed_schedule: JobScheduleStatus = JobScheduleStatus::default();
-        if let Some(state) = state_file {
-            if let Ok(cs) = compute_schedule_status(&state, Some(upid)) {
-                computed_schedule = cs;
-            }
-        }
+        let computed_schedule: JobScheduleStatus =
+            compute_schedule_status(jobtype, &store, Some(upid))?;
 
         if let Some(endtime) = computed_schedule.last_run_endtime {
             last = endtime;
@@ -1191,6 +1186,8 @@ pub fn garbage_collection_status(
         info.next_run = computed_schedule.next_run;
         info.last_run_endtime = computed_schedule.last_run_endtime;
         info.last_run_state = computed_schedule.last_run_state;
+    } else if let Err(err) = JobState::load(jobtype, &store) {
+        log::error!("could not open statefile for {store}: {err}");
     }
 
     info.next_run = info
diff --git a/src/api2/admin/prune.rs b/src/api2/admin/prune.rs
index a5ebf2975..1b1d2f1ba 100644
--- a/src/api2/admin/prune.rs
+++ b/src/api2/admin/prune.rs
@@ -1,6 +1,6 @@
 //! Datastore Prune Job Management
 
-use anyhow::{format_err, Error};
+use anyhow::Error;
 use serde_json::Value;
 
 use proxmox_router::{
@@ -18,7 +18,7 @@ use pbs_config::CachedUserInfo;
 
 use crate::server::{
     do_prune_job,
-    jobstate::{compute_schedule_status, Job, JobState},
+    jobstate::{compute_schedule_status, Job},
 };
 
 #[api(
@@ -73,10 +73,7 @@ pub fn list_prune_jobs(
     let mut list = Vec::new();
 
     for job in job_config_iter {
-        let last_state = JobState::load("prunejob", &job.id)
-            .map_err(|err| format_err!("could not open statefile for {}: {}", &job.id, err))?;
-
-        let mut status = compute_schedule_status(&last_state, Some(&job.schedule))?;
+        let mut status = compute_schedule_status("prunejob", &job.id, Some(&job.schedule))?;
         if job.disable {
             status.next_run = None;
         }
diff --git a/src/api2/admin/sync.rs b/src/api2/admin/sync.rs
index 6722ebea0..2384ede75 100644
--- a/src/api2/admin/sync.rs
+++ b/src/api2/admin/sync.rs
@@ -1,6 +1,6 @@
 //! Datastore Synchronization Job Management
 
-use anyhow::{bail, format_err, Error};
+use anyhow::{bail, Error};
 use serde::{Deserialize, Serialize};
 use serde_json::Value;
 
@@ -19,7 +19,7 @@ use pbs_config::CachedUserInfo;
 
 use crate::{
     api2::config::sync::{check_sync_job_modify_access, check_sync_job_read_access},
-    server::jobstate::{compute_schedule_status, Job, JobState},
+    server::jobstate::{compute_schedule_status, Job},
     server::sync::do_sync_job,
 };
 
@@ -112,10 +112,7 @@ pub fn list_config_sync_jobs(
             continue;
         }
 
-        let last_state = JobState::load("syncjob", &job.id)
-            .map_err(|err| format_err!("could not open statefile for {}: {}", &job.id, err))?;
-
-        let status = compute_schedule_status(&last_state, job.schedule.as_deref())?;
+        let status = compute_schedule_status("syncjob", &job.id, job.schedule.as_deref())?;
 
         list.push(SyncJobStatus {
             config: job,
diff --git a/src/api2/admin/verify.rs b/src/api2/admin/verify.rs
index 66695236c..af5b7fff4 100644
--- a/src/api2/admin/verify.rs
+++ b/src/api2/admin/verify.rs
@@ -1,6 +1,6 @@
 //! Datastore Verify Job Management
 
-use anyhow::{format_err, Error};
+use anyhow::Error;
 use serde_json::Value;
 
 use proxmox_router::{
@@ -19,7 +19,7 @@ use pbs_config::CachedUserInfo;
 
 use crate::server::{
     do_verification_job,
-    jobstate::{compute_schedule_status, Job, JobState},
+    jobstate::{compute_schedule_status, Job},
 };
 
 #[api(
@@ -73,10 +73,7 @@ pub fn list_verification_jobs(
     let mut list = Vec::new();
 
     for job in job_config_iter {
-        let last_state = JobState::load("verificationjob", &job.id)
-            .map_err(|err| format_err!("could not open statefile for {}: {}", &job.id, err))?;
-
-        let status = compute_schedule_status(&last_state, job.schedule.as_deref())?;
+        let status = compute_schedule_status("verificationjob", &job.id, job.schedule.as_deref())?;
 
         list.push(VerificationJobStatus {
             config: job,
diff --git a/src/api2/tape/backup.rs b/src/api2/tape/backup.rs
index 47e8d0209..2c1aa5c0a 100644
--- a/src/api2/tape/backup.rs
+++ b/src/api2/tape/backup.rs
@@ -1,6 +1,6 @@
 use std::sync::{Arc, Mutex};
 
-use anyhow::{bail, format_err, Error};
+use anyhow::{bail, Error};
 use serde_json::Value;
 use tracing::{info, warn};
 
@@ -23,7 +23,7 @@ use pbs_datastore::{DataStore, StoreProgress};
 use crate::tape::{assert_datastore_type, TapeNotificationMode};
 use crate::{
     server::{
-        jobstate::{compute_schedule_status, Job, JobState},
+        jobstate::{compute_schedule_status, Job},
         TapeBackupJobSummary,
     },
     tape::{
@@ -97,10 +97,7 @@ pub fn list_tape_backup_jobs(
             continue;
         }
 
-        let last_state = JobState::load("tape-backup-job", &job.id)
-            .map_err(|err| format_err!("could not open statefile for {}: {}", &job.id, err))?;
-
-        let status = compute_schedule_status(&last_state, job.schedule.as_deref())?;
+        let status = compute_schedule_status("tape-backup-job", &job.id, job.schedule.as_deref())?;
 
         let next_run = status.next_run.unwrap_or(current_time);
 
diff --git a/src/server/jobstate.rs b/src/server/jobstate.rs
index dc9f6c90d..ceac8dde8 100644
--- a/src/server/jobstate.rs
+++ b/src/server/jobstate.rs
@@ -301,11 +301,15 @@ impl Job {
 }
 
 pub fn compute_schedule_status(
-    job_state: &JobState,
+    jobtype: &str,
+    jobname: &str,
     schedule: Option<&str>,
 ) -> Result<JobScheduleStatus, Error> {
+    let job_state = JobState::load(jobtype, jobname)
+        .map_err(|err| format_err!("could not open statefile for {jobname}: {err}"))?;
+
     let (upid, endtime, state, last) = match job_state {
-        JobState::Created { time } => (None, None, None, *time),
+        JobState::Created { time } => (None, None, None, time),
         JobState::Started { upid } => {
             let parsed_upid: UPID = upid.parse()?;
             (Some(upid), None, None, parsed_upid.starttime)
-- 
2.47.3






* [PATCH proxmox-backup v3 2/3] fix #7400: api: gracefully handle corrupted job statefiles
  2026-03-25 16:06 [PATCH proxmox-backup v3 0/3] fix #7400: improve handling of corrupted job statefiles Michael Köppl
  2026-03-25 16:06 ` [PATCH proxmox-backup v3 1/3] api: move statefile loading into compute_schedule_status Michael Köppl
@ 2026-03-25 16:06 ` Michael Köppl
  2026-03-25 16:06 ` [PATCH proxmox-backup v3 3/3] fix #7400: proxy: self-heal " Michael Köppl
  2 siblings, 0 replies; 4+ messages in thread
From: Michael Köppl @ 2026-03-25 16:06 UTC (permalink / raw)
  To: pbs-devel

Introduce an Unknown JobState to explicitly represent cases where the
state could not be determined, e.g. because the statefile was corrupted.
Update JobState::load to handle parsing errors (both for the statefile
itself and for UPIDs) and return the Unknown state if such an error
occurs. Also update compute_schedule_status to handle the new Unknown
state, returning a default JobScheduleStatus so that API endpoints do
not return an error that would prevent users from viewing their jobs.
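The endpoint-facing effect of the fallback can be sketched like this
(hypothetical simplified types and function name; the real
JobScheduleStatus and compute_schedule_status in src/server/jobstate.rs
carry more fields and take the jobtype/jobname):

```rust
// Minimal stand-in for the real JobScheduleStatus.
#[derive(Debug, Default, PartialEq)]
struct JobScheduleStatus {
    next_run: Option<i64>,
    last_run_endtime: Option<i64>,
}

// Sketch: a statefile load error is logged and a default status is
// returned, instead of propagating the error to the API caller.
fn schedule_status(load_result: Result<i64, String>) -> JobScheduleStatus {
    match load_result {
        Ok(endtime) => JobScheduleStatus {
            next_run: None,
            last_run_endtime: Some(endtime),
        },
        Err(err) => {
            eprintln!("could not open statefile: {err} - falling back to default job schedule status");
            JobScheduleStatus::default()
        }
    }
}
```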

Signed-off-by: Michael Köppl <m.koeppl@proxmox.com>
---
 src/server/jobstate.rs | 48 ++++++++++++++++++++++++++++++++++++------
 1 file changed, 42 insertions(+), 6 deletions(-)

diff --git a/src/server/jobstate.rs b/src/server/jobstate.rs
index ceac8dde8..4163656e8 100644
--- a/src/server/jobstate.rs
+++ b/src/server/jobstate.rs
@@ -66,6 +66,7 @@ pub enum JobState {
         state: TaskState,
         updated: Option<i64>,
     },
+    Unknown,
 }
 
 /// Represents a Job and holds the correct lock
@@ -155,6 +156,7 @@ pub fn update_job_last_run_time(jobtype: &str, jobname: &str) -> Result<(), Erro
             state,
             updated: Some(time),
         },
+        JobState::Unknown => bail!("cannot update last run time for unknown job state"),
     };
     job.write_state()
 }
@@ -179,6 +181,7 @@ pub fn last_run_time(jobtype: &str, jobname: &str) -> Result<i64, Error> {
                 .map_err(|err| format_err!("could not parse upid from state: {err}"))?;
             Ok(upid.starttime)
         }
+        JobState::Unknown => bail!("statefile could not be parsed or was empty"),
     }
 }
 
@@ -191,11 +194,20 @@ impl JobState {
     /// This does not update the state in the file.
     pub fn load(jobtype: &str, jobname: &str) -> Result<Self, Error> {
         if let Some(state) = file_read_optional_string(get_path(jobtype, jobname))? {
-            match serde_json::from_str(&state)? {
+            let job_state = serde_json::from_str(&state).unwrap_or_else(|err| {
+                log::error!("could not parse statefile for {jobname}: {err}");
+                JobState::Unknown
+            });
+
+            match job_state {
                 JobState::Started { upid } => {
-                    let parsed: UPID = upid
-                        .parse()
-                        .map_err(|err| format_err!("error parsing upid: {err}"))?;
+                    let parsed: UPID = match upid.parse() {
+                        Ok(parsed) => parsed,
+                        Err(err) => {
+                            log::error!("error parsing upid for {jobname}: {err}");
+                            return Ok(JobState::Unknown);
+                        }
+                    };
 
                     if !worker_is_active_local(&parsed) {
                         let state = upid_read_status(&parsed).unwrap_or(TaskState::Unknown {
@@ -211,6 +223,21 @@ impl JobState {
                         Ok(JobState::Started { upid })
                     }
                 }
+                JobState::Finished {
+                    upid,
+                    state,
+                    updated,
+                } => {
+                    if let Err(err) = upid.parse::<UPID>() {
+                        log::error!("error parsing upid for {jobname}: {err}");
+                        return Ok(JobState::Unknown);
+                    }
+                    Ok(JobState::Finished {
+                        upid,
+                        state,
+                        updated,
+                    })
+                }
                 other => Ok(other),
             }
         } else {
@@ -263,6 +290,7 @@ impl Job {
             JobState::Created { .. } => bail!("cannot finish when not started"),
             JobState::Started { upid } => upid,
             JobState::Finished { upid, .. } => upid,
+            JobState::Unknown => bail!("cannot finish job with unknown status"),
         }
         .to_string();
 
@@ -305,8 +333,15 @@ pub fn compute_schedule_status(
     jobname: &str,
     schedule: Option<&str>,
 ) -> Result<JobScheduleStatus, Error> {
-    let job_state = JobState::load(jobtype, jobname)
-        .map_err(|err| format_err!("could not open statefile for {jobname}: {err}"))?;
+    let job_state = match JobState::load(jobtype, jobname) {
+        Ok(job_state) => job_state,
+        Err(err) => {
+            log::error!(
+                "could not open statefile for {jobname}: {err} - falling back to default job schedule status",
+            );
+            return Ok(JobScheduleStatus::default());
+        }
+    };
 
     let (upid, endtime, state, last) = match job_state {
         JobState::Created { time } => (None, None, None, time),
@@ -327,6 +362,7 @@ pub fn compute_schedule_status(
                 last,
             )
         }
+        JobState::Unknown => (None, None, None, proxmox_time::epoch_i64() - 30),
     };
 
     let mut status = JobScheduleStatus {
-- 
2.47.3






* [PATCH proxmox-backup v3 3/3] fix #7400: proxy: self-heal corrupted job statefiles
  2026-03-25 16:06 [PATCH proxmox-backup v3 0/3] fix #7400: improve handling of corrupted job statefiles Michael Köppl
  2026-03-25 16:06 ` [PATCH proxmox-backup v3 1/3] api: move statefile loading into compute_schedule_status Michael Köppl
  2026-03-25 16:06 ` [PATCH proxmox-backup v3 2/3] fix #7400: api: gracefully handle corrupted job statefiles Michael Köppl
@ 2026-03-25 16:06 ` Michael Köppl
  2 siblings, 0 replies; 4+ messages in thread
From: Michael Köppl @ 2026-03-25 16:06 UTC (permalink / raw)
  To: pbs-devel

Update update_job_last_run_time to transition JobState::Unknown into
JobState::Created so the corrupted statefile is overwritten. In
addition, update the scheduling loops to actively overwrite corrupted
statefiles and return the time for the next scheduled run of the
affected job.
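The scheduling-loop fallback can be sketched as follows. This is a
simplified stand-in for the logic in check_schedule in
src/bin/proxmox-backup-proxy.rs; the real code additionally calls
jobstate::update_job_last_run_time to rewrite the corrupted statefile.

```rust
// Sketch: when the last run time cannot be read, treat the job as
// having last run 30 seconds ago so that the next scheduled event
// still triggers it, instead of skipping the job indefinitely.
fn effective_last_run(last_run: Result<i64, String>, now: i64) -> i64 {
    match last_run {
        Ok(time) => time,
        Err(err) => {
            eprintln!("could not get last run time: {err}");
            now - 30
        }
    }
}
```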

Signed-off-by: Michael Köppl <m.koeppl@proxmox.com>
---
 src/bin/proxmox-backup-proxy.rs | 4 +++-
 src/server/jobstate.rs          | 2 +-
 2 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/src/bin/proxmox-backup-proxy.rs b/src/bin/proxmox-backup-proxy.rs
index c1fe3ac15..71d8566f4 100644
--- a/src/bin/proxmox-backup-proxy.rs
+++ b/src/bin/proxmox-backup-proxy.rs
@@ -564,6 +564,7 @@ async fn schedule_datastore_garbage_collection() {
             Ok(time) => time,
             Err(err) => {
                 eprintln!("could not get last run time of {worker_type} {store}: {err}");
+                let _ = jobstate::update_job_last_run_time(worker_type, &store);
                 continue;
             }
         };
@@ -936,7 +937,8 @@ fn check_schedule(worker_type: &str, event_str: &str, id: &str) -> bool {
         Ok(time) => time,
         Err(err) => {
             eprintln!("could not get last run time of {worker_type} {id}: {err}");
-            return false;
+            let _ = jobstate::update_job_last_run_time(worker_type, id);
+            proxmox_time::epoch_i64() - 30
         }
     };
 
diff --git a/src/server/jobstate.rs b/src/server/jobstate.rs
index 4163656e8..214691965 100644
--- a/src/server/jobstate.rs
+++ b/src/server/jobstate.rs
@@ -156,7 +156,7 @@ pub fn update_job_last_run_time(jobtype: &str, jobname: &str) -> Result<(), Erro
             state,
             updated: Some(time),
         },
-        JobState::Unknown => bail!("cannot update last run time for unknown job state"),
+        JobState::Unknown => JobState::Created { time },
     };
     job.write_state()
 }
-- 
2.47.3




