* [PATCH proxmox-backup v3 0/3] fix #7400: improve handling of corrupted job statefiles
@ 2026-03-25 16:06 Michael Köppl
2026-03-25 16:06 ` [PATCH proxmox-backup v3 1/3] api: move statefile loading into compute_schedule_status Michael Köppl
` (3 more replies)
0 siblings, 4 replies; 11+ messages in thread
From: Michael Köppl @ 2026-03-25 16:06 UTC (permalink / raw)
To: pbs-devel
This patch series fixes a problem where an empty or corrupted job state
file (due to I/O error, abrupt shutdown, ...) would cause API endpoints
for listing jobs to return an error, breaking the web UI for users
because they could not view any of their configured jobs of that type.
It would also cause proxmox-backup-proxy to skip the affected jobs
indefinitely, until a user manually triggered a run that rewrote the
statefile.
1/3 is a preparatory patch that centralizes job statefile loading
in compute_schedule_status instead of having every handler function
open the statefile, handle potential errors, and then pass the
JobState to compute_schedule_status.
2/3 introduces a new JobState `Unknown`, representing cases in which the
job state could not be determined. In addition, the patch also updates
the scheduling functions such that errors during reading the statefiles
will result in the Unknown state.
3/3 then utilizes this Unknown state and adapts the scheduling functions
such that the Unknown state will then lead to the statefile being
overwritten with a new Created state and the job running again at its
next scheduled run.
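As a rough illustration of the flow described above, here is a minimal,
self-contained sketch. It is not the code from the patches: the enum is
reduced to two variants, the statefile is assumed to hold a plain
timestamp instead of JSON, and the `load` and `heal` helpers are
hypothetical names used only for this example.

```rust
// Simplified stand-in for the job state. The real enum in
// src/server/jobstate.rs also has Started and Finished variants.
#[derive(Debug, Clone, PartialEq)]
enum JobState {
    Created { time: i64 },
    Unknown,
}

// Hypothetical loader: the real code deserializes JSON. In this sketch a
// statefile is considered valid only if it holds a decimal timestamp.
fn load(raw: Option<&str>, now: i64) -> JobState {
    match raw {
        // Assumption for this sketch: a missing file yields a fresh state.
        None => JobState::Created { time: now },
        Some(text) => match text.trim().parse::<i64>() {
            Ok(time) => JobState::Created { time },
            // Empty or corrupted content maps to Unknown instead of an error.
            Err(_) => JobState::Unknown,
        },
    }
}

// Hypothetical self-heal step: Unknown becomes a fresh Created state,
// which the caller would then write back to the statefile.
fn heal(state: JobState, now: i64) -> JobState {
    match state {
        JobState::Unknown => JobState::Created { time: now },
        other => other,
    }
}
```

With these helpers, `heal(load(Some(""), now), now)` yields a `Created`
state, so listing code never has to surface a parse error to the user.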
changes since v2:
- introduced the Unknown state in 2/3, adapted 3/3 accordingly (thanks,
@Fabian and @Christian)
- make sure the "could not open statefile" error is also printed in
garbage_collection_status if status_in_memory.upid is None (thanks,
@Christian)
- inline jobtype and err variables in error logging
changes since v1:
- added preparatory patch 1/3, centralizing the statefile loading before
adapting the handling of the error case in that centralized place
(compute_schedule_status). Thanks, Christian for the suggestion!
- adapted the error message if job statefile loading fails to make clear
that the default status will be returned as a fallback
proxmox-backup:
Michael Köppl (3):
api: move statefile loading into compute_schedule_status
fix #7400: api: gracefully handle corrupted job statefiles
fix #7400: proxy: self-heal corrupted job statefiles
src/api2/admin/datastore.rs | 15 ++++------
src/api2/admin/prune.rs | 9 ++----
src/api2/admin/sync.rs | 9 ++----
src/api2/admin/verify.rs | 9 ++----
src/api2/tape/backup.rs | 9 ++----
src/bin/proxmox-backup-proxy.rs | 4 ++-
src/server/jobstate.rs | 52 +++++++++++++++++++++++++++++----
7 files changed, 67 insertions(+), 40 deletions(-)
Summary over all repositories:
7 files changed, 67 insertions(+), 40 deletions(-)
--
Generated by murpp 0.11.0
^ permalink raw reply [flat|nested] 11+ messages in thread
* [PATCH proxmox-backup v3 1/3] api: move statefile loading into compute_schedule_status
2026-03-25 16:06 [PATCH proxmox-backup v3 0/3] fix #7400: improve handling of corrupted job statefiles Michael Köppl
@ 2026-03-25 16:06 ` Michael Köppl
2026-04-02 16:22 ` Christian Ebner
2026-03-25 16:06 ` [PATCH proxmox-backup v3 2/3] fix #7400: api: gracefully handle corrupted job statefiles Michael Köppl
` (2 subsequent siblings)
3 siblings, 1 reply; 11+ messages in thread
From: Michael Köppl @ 2026-03-25 16:06 UTC (permalink / raw)
To: pbs-devel
Centralize loading of the job statefiles in compute_schedule_status,
reducing code duplication across the job management API endpoints.
Signed-off-by: Michael Köppl <m.koeppl@proxmox.com>
---
src/api2/admin/datastore.rs | 15 ++++++---------
src/api2/admin/prune.rs | 9 +++------
src/api2/admin/sync.rs | 9 +++------
src/api2/admin/verify.rs | 9 +++------
src/api2/tape/backup.rs | 9 +++------
src/server/jobstate.rs | 8 ++++++--
6 files changed, 24 insertions(+), 35 deletions(-)
diff --git a/src/api2/admin/datastore.rs b/src/api2/admin/datastore.rs
index cca340553..1a04a17ec 100644
--- a/src/api2/admin/datastore.rs
+++ b/src/api2/admin/datastore.rs
@@ -1167,19 +1167,14 @@ pub fn garbage_collection_status(
let datastore = DataStore::lookup_datastore(&store, Operation::Read)?;
let status_in_memory = datastore.last_gc_status();
- let state_file = JobState::load("garbage_collection", &store)
- .map_err(|err| log::error!("could not open GC statefile for {store}: {err}"))
- .ok();
let mut last = proxmox_time::epoch_i64();
+ let jobtype = "garbage_collection";
+
if let Some(ref upid) = status_in_memory.upid {
- let mut computed_schedule: JobScheduleStatus = JobScheduleStatus::default();
- if let Some(state) = state_file {
- if let Ok(cs) = compute_schedule_status(&state, Some(upid)) {
- computed_schedule = cs;
- }
- }
+ let computed_schedule: JobScheduleStatus =
+ compute_schedule_status(jobtype, &store, Some(upid))?;
if let Some(endtime) = computed_schedule.last_run_endtime {
last = endtime;
@@ -1191,6 +1186,8 @@ pub fn garbage_collection_status(
info.next_run = computed_schedule.next_run;
info.last_run_endtime = computed_schedule.last_run_endtime;
info.last_run_state = computed_schedule.last_run_state;
+ } else if let Err(err) = JobState::load(jobtype, &store) {
+ log::error!("could not open statefile for {store}: {err}");
}
info.next_run = info
diff --git a/src/api2/admin/prune.rs b/src/api2/admin/prune.rs
index a5ebf2975..1b1d2f1ba 100644
--- a/src/api2/admin/prune.rs
+++ b/src/api2/admin/prune.rs
@@ -1,6 +1,6 @@
//! Datastore Prune Job Management
-use anyhow::{format_err, Error};
+use anyhow::Error;
use serde_json::Value;
use proxmox_router::{
@@ -18,7 +18,7 @@ use pbs_config::CachedUserInfo;
use crate::server::{
do_prune_job,
- jobstate::{compute_schedule_status, Job, JobState},
+ jobstate::{compute_schedule_status, Job},
};
#[api(
@@ -73,10 +73,7 @@ pub fn list_prune_jobs(
let mut list = Vec::new();
for job in job_config_iter {
- let last_state = JobState::load("prunejob", &job.id)
- .map_err(|err| format_err!("could not open statefile for {}: {}", &job.id, err))?;
-
- let mut status = compute_schedule_status(&last_state, Some(&job.schedule))?;
+ let mut status = compute_schedule_status("prunejob", &job.id, Some(&job.schedule))?;
if job.disable {
status.next_run = None;
}
diff --git a/src/api2/admin/sync.rs b/src/api2/admin/sync.rs
index 6722ebea0..2384ede75 100644
--- a/src/api2/admin/sync.rs
+++ b/src/api2/admin/sync.rs
@@ -1,6 +1,6 @@
//! Datastore Synchronization Job Management
-use anyhow::{bail, format_err, Error};
+use anyhow::{bail, Error};
use serde::{Deserialize, Serialize};
use serde_json::Value;
@@ -19,7 +19,7 @@ use pbs_config::CachedUserInfo;
use crate::{
api2::config::sync::{check_sync_job_modify_access, check_sync_job_read_access},
- server::jobstate::{compute_schedule_status, Job, JobState},
+ server::jobstate::{compute_schedule_status, Job},
server::sync::do_sync_job,
};
@@ -112,10 +112,7 @@ pub fn list_config_sync_jobs(
continue;
}
- let last_state = JobState::load("syncjob", &job.id)
- .map_err(|err| format_err!("could not open statefile for {}: {}", &job.id, err))?;
-
- let status = compute_schedule_status(&last_state, job.schedule.as_deref())?;
+ let status = compute_schedule_status("syncjob", &job.id, job.schedule.as_deref())?;
list.push(SyncJobStatus {
config: job,
diff --git a/src/api2/admin/verify.rs b/src/api2/admin/verify.rs
index 66695236c..af5b7fff4 100644
--- a/src/api2/admin/verify.rs
+++ b/src/api2/admin/verify.rs
@@ -1,6 +1,6 @@
//! Datastore Verify Job Management
-use anyhow::{format_err, Error};
+use anyhow::Error;
use serde_json::Value;
use proxmox_router::{
@@ -19,7 +19,7 @@ use pbs_config::CachedUserInfo;
use crate::server::{
do_verification_job,
- jobstate::{compute_schedule_status, Job, JobState},
+ jobstate::{compute_schedule_status, Job},
};
#[api(
@@ -73,10 +73,7 @@ pub fn list_verification_jobs(
let mut list = Vec::new();
for job in job_config_iter {
- let last_state = JobState::load("verificationjob", &job.id)
- .map_err(|err| format_err!("could not open statefile for {}: {}", &job.id, err))?;
-
- let status = compute_schedule_status(&last_state, job.schedule.as_deref())?;
+ let status = compute_schedule_status("verificationjob", &job.id, job.schedule.as_deref())?;
list.push(VerificationJobStatus {
config: job,
diff --git a/src/api2/tape/backup.rs b/src/api2/tape/backup.rs
index 47e8d0209..2c1aa5c0a 100644
--- a/src/api2/tape/backup.rs
+++ b/src/api2/tape/backup.rs
@@ -1,6 +1,6 @@
use std::sync::{Arc, Mutex};
-use anyhow::{bail, format_err, Error};
+use anyhow::{bail, Error};
use serde_json::Value;
use tracing::{info, warn};
@@ -23,7 +23,7 @@ use pbs_datastore::{DataStore, StoreProgress};
use crate::tape::{assert_datastore_type, TapeNotificationMode};
use crate::{
server::{
- jobstate::{compute_schedule_status, Job, JobState},
+ jobstate::{compute_schedule_status, Job},
TapeBackupJobSummary,
},
tape::{
@@ -97,10 +97,7 @@ pub fn list_tape_backup_jobs(
continue;
}
- let last_state = JobState::load("tape-backup-job", &job.id)
- .map_err(|err| format_err!("could not open statefile for {}: {}", &job.id, err))?;
-
- let status = compute_schedule_status(&last_state, job.schedule.as_deref())?;
+ let status = compute_schedule_status("tape-backup-job", &job.id, job.schedule.as_deref())?;
let next_run = status.next_run.unwrap_or(current_time);
diff --git a/src/server/jobstate.rs b/src/server/jobstate.rs
index dc9f6c90d..ceac8dde8 100644
--- a/src/server/jobstate.rs
+++ b/src/server/jobstate.rs
@@ -301,11 +301,15 @@ impl Job {
}
pub fn compute_schedule_status(
- job_state: &JobState,
+ jobtype: &str,
+ jobname: &str,
schedule: Option<&str>,
) -> Result<JobScheduleStatus, Error> {
+ let job_state = JobState::load(jobtype, jobname)
+ .map_err(|err| format_err!("could not open statefile for {jobname}: {err}"))?;
+
let (upid, endtime, state, last) = match job_state {
- JobState::Created { time } => (None, None, None, *time),
+ JobState::Created { time } => (None, None, None, time),
JobState::Started { upid } => {
let parsed_upid: UPID = upid.parse()?;
(Some(upid), None, None, parsed_upid.starttime)
--
2.47.3
* [PATCH proxmox-backup v3 2/3] fix #7400: api: gracefully handle corrupted job statefiles
2026-03-25 16:06 [PATCH proxmox-backup v3 0/3] fix #7400: improve handling of corrupted job statefiles Michael Köppl
2026-03-25 16:06 ` [PATCH proxmox-backup v3 1/3] api: move statefile loading into compute_schedule_status Michael Köppl
@ 2026-03-25 16:06 ` Michael Köppl
2026-04-02 16:35 ` Christian Ebner
2026-03-25 16:06 ` [PATCH proxmox-backup v3 3/3] fix #7400: proxy: self-heal " Michael Köppl
2026-04-03 13:28 ` superseded: [PATCH proxmox-backup v3 0/3] fix #7400: improve handling of " Michael Köppl
3 siblings, 1 reply; 11+ messages in thread
From: Michael Köppl @ 2026-03-25 16:06 UTC (permalink / raw)
To: pbs-devel
Introduce Unknown JobState to more explicitly represent cases where the
state could not be determined, e.g. if the statefile was corrupted or
missing. Update JobState::load to handle parsing errors (both for
statefiles themselves as well as UPIDs) and return an Unknown state if
such an error occurred. Update compute_schedule_status to also handle
the new Unknown status, returning a default JobScheduleStatus so API
endpoints don't return an error to the user, stopping them from viewing
their jobs.
Signed-off-by: Michael Köppl <m.koeppl@proxmox.com>
---
src/server/jobstate.rs | 48 ++++++++++++++++++++++++++++++++++++------
1 file changed, 42 insertions(+), 6 deletions(-)
diff --git a/src/server/jobstate.rs b/src/server/jobstate.rs
index ceac8dde8..4163656e8 100644
--- a/src/server/jobstate.rs
+++ b/src/server/jobstate.rs
@@ -66,6 +66,7 @@ pub enum JobState {
state: TaskState,
updated: Option<i64>,
},
+ Unknown,
}
/// Represents a Job and holds the correct lock
@@ -155,6 +156,7 @@ pub fn update_job_last_run_time(jobtype: &str, jobname: &str) -> Result<(), Erro
state,
updated: Some(time),
},
+ JobState::Unknown => bail!("cannot update last run time for unknown job state"),
};
job.write_state()
}
@@ -179,6 +181,7 @@ pub fn last_run_time(jobtype: &str, jobname: &str) -> Result<i64, Error> {
.map_err(|err| format_err!("could not parse upid from state: {err}"))?;
Ok(upid.starttime)
}
+ JobState::Unknown => bail!("statefile could not be parsed or was empty"),
}
}
@@ -191,11 +194,20 @@ impl JobState {
/// This does not update the state in the file.
pub fn load(jobtype: &str, jobname: &str) -> Result<Self, Error> {
if let Some(state) = file_read_optional_string(get_path(jobtype, jobname))? {
- match serde_json::from_str(&state)? {
+ let job_state = serde_json::from_str(&state).unwrap_or_else(|err| {
+ log::error!("could not parse statefile for {jobname}: {err}");
+ JobState::Unknown
+ });
+
+ match job_state {
JobState::Started { upid } => {
- let parsed: UPID = upid
- .parse()
- .map_err(|err| format_err!("error parsing upid: {err}"))?;
+ let parsed: UPID = match upid.parse() {
+ Ok(parsed) => parsed,
+ Err(err) => {
+ log::error!("error parsing upid for {jobname}: {err}");
+ return Ok(JobState::Unknown);
+ }
+ };
if !worker_is_active_local(&parsed) {
let state = upid_read_status(&parsed).unwrap_or(TaskState::Unknown {
@@ -211,6 +223,21 @@ impl JobState {
Ok(JobState::Started { upid })
}
}
+ JobState::Finished {
+ upid,
+ state,
+ updated,
+ } => {
+ if let Err(err) = upid.parse::<UPID>() {
+ log::error!("error parsing upid for {jobname}: {err}");
+ return Ok(JobState::Unknown);
+ }
+ Ok(JobState::Finished {
+ upid,
+ state,
+ updated,
+ })
+ }
other => Ok(other),
}
} else {
@@ -263,6 +290,7 @@ impl Job {
JobState::Created { .. } => bail!("cannot finish when not started"),
JobState::Started { upid } => upid,
JobState::Finished { upid, .. } => upid,
+ JobState::Unknown => bail!("cannot finish job with unknown status"),
}
.to_string();
@@ -305,8 +333,15 @@ pub fn compute_schedule_status(
jobname: &str,
schedule: Option<&str>,
) -> Result<JobScheduleStatus, Error> {
- let job_state = JobState::load(jobtype, jobname)
- .map_err(|err| format_err!("could not open statefile for {jobname}: {err}"))?;
+ let job_state = match JobState::load(jobtype, jobname) {
+ Ok(job_state) => job_state,
+ Err(err) => {
+ log::error!(
+ "could not open statefile for {jobname}: {err} - falling back to default job schedule status",
+ );
+ return Ok(JobScheduleStatus::default());
+ }
+ };
let (upid, endtime, state, last) = match job_state {
JobState::Created { time } => (None, None, None, time),
@@ -327,6 +362,7 @@ pub fn compute_schedule_status(
last,
)
}
+ JobState::Unknown => (None, None, None, proxmox_time::epoch_i64() - 30),
};
let mut status = JobScheduleStatus {
--
2.47.3
* [PATCH proxmox-backup v3 3/3] fix #7400: proxy: self-heal corrupted job statefiles
2026-03-25 16:06 [PATCH proxmox-backup v3 0/3] fix #7400: improve handling of corrupted job statefiles Michael Köppl
2026-03-25 16:06 ` [PATCH proxmox-backup v3 1/3] api: move statefile loading into compute_schedule_status Michael Köppl
2026-03-25 16:06 ` [PATCH proxmox-backup v3 2/3] fix #7400: api: gracefully handle corrupted job statefiles Michael Köppl
@ 2026-03-25 16:06 ` Michael Köppl
2026-04-02 16:50 ` Christian Ebner
2026-04-03 13:28 ` superseded: [PATCH proxmox-backup v3 0/3] fix #7400: improve handling of " Michael Köppl
3 siblings, 1 reply; 11+ messages in thread
From: Michael Köppl @ 2026-03-25 16:06 UTC (permalink / raw)
To: pbs-devel
Update update_job_last_run_time to transition JobState::Unknown into
JobState::Created so the corrupted statefile is overwritten. In
addition, update the scheduling loops to actively overwrite corrupted
statefiles and return the time for the next scheduled run of the
affected job.
Signed-off-by: Michael Köppl <m.koeppl@proxmox.com>
---
src/bin/proxmox-backup-proxy.rs | 4 +++-
src/server/jobstate.rs | 2 +-
2 files changed, 4 insertions(+), 2 deletions(-)
diff --git a/src/bin/proxmox-backup-proxy.rs b/src/bin/proxmox-backup-proxy.rs
index c1fe3ac15..71d8566f4 100644
--- a/src/bin/proxmox-backup-proxy.rs
+++ b/src/bin/proxmox-backup-proxy.rs
@@ -564,6 +564,7 @@ async fn schedule_datastore_garbage_collection() {
Ok(time) => time,
Err(err) => {
eprintln!("could not get last run time of {worker_type} {store}: {err}");
+ let _ = jobstate::update_job_last_run_time(worker_type, &store);
continue;
}
};
@@ -936,7 +937,8 @@ fn check_schedule(worker_type: &str, event_str: &str, id: &str) -> bool {
Ok(time) => time,
Err(err) => {
eprintln!("could not get last run time of {worker_type} {id}: {err}");
- return false;
+ let _ = jobstate::update_job_last_run_time(worker_type, id);
+ proxmox_time::epoch_i64() - 30
}
};
diff --git a/src/server/jobstate.rs b/src/server/jobstate.rs
index 4163656e8..214691965 100644
--- a/src/server/jobstate.rs
+++ b/src/server/jobstate.rs
@@ -156,7 +156,7 @@ pub fn update_job_last_run_time(jobtype: &str, jobname: &str) -> Result<(), Erro
state,
updated: Some(time),
},
- JobState::Unknown => bail!("cannot update last run time for unknown job state"),
+ JobState::Unknown => JobState::Created { time },
};
job.write_state()
}
--
2.47.3
* Re: [PATCH proxmox-backup v3 1/3] api: move statefile loading into compute_schedule_status
2026-03-25 16:06 ` [PATCH proxmox-backup v3 1/3] api: move statefile loading into compute_schedule_status Michael Köppl
@ 2026-04-02 16:22 ` Christian Ebner
2026-04-03 11:23 ` Michael Köppl
0 siblings, 1 reply; 11+ messages in thread
From: Christian Ebner @ 2026-04-02 16:22 UTC (permalink / raw)
To: Michael Köppl, pbs-devel
One comment inline.
On 3/25/26 5:05 PM, Michael Köppl wrote:
> Centralize loading of the job statefiles in compute_schedule_status,
> reducing code duplication across the job management API endpoints.
>
> Signed-off-by: Michael Köppl <m.koeppl@proxmox.com>
> ---
> src/api2/admin/datastore.rs | 15 ++++++---------
> src/api2/admin/prune.rs | 9 +++------
> src/api2/admin/sync.rs | 9 +++------
> src/api2/admin/verify.rs | 9 +++------
> src/api2/tape/backup.rs | 9 +++------
> src/server/jobstate.rs | 8 ++++++--
> 6 files changed, 24 insertions(+), 35 deletions(-)
>
> diff --git a/src/api2/admin/datastore.rs b/src/api2/admin/datastore.rs
> index cca340553..1a04a17ec 100644
> --- a/src/api2/admin/datastore.rs
> +++ b/src/api2/admin/datastore.rs
> @@ -1167,19 +1167,14 @@ pub fn garbage_collection_status(
>
> let datastore = DataStore::lookup_datastore(&store, Operation::Read)?;
> let status_in_memory = datastore.last_gc_status();
> - let state_file = JobState::load("garbage_collection", &store)
> - .map_err(|err| log::error!("could not open GC statefile for {store}: {err}"))
> - .ok();
>
> let mut last = proxmox_time::epoch_i64();
>
> + let jobtype = "garbage_collection";
> +
> if let Some(ref upid) = status_in_memory.upid {
> - let mut computed_schedule: JobScheduleStatus = JobScheduleStatus::default();
> - if let Some(state) = state_file {
> - if let Ok(cs) = compute_schedule_status(&state, Some(upid)) {
> - computed_schedule = cs;
> - }
> - }
> + let computed_schedule: JobScheduleStatus =
> + compute_schedule_status(jobtype, &store, Some(upid))?;
comment: While the change in behavior here when parsing the UPID fails
is fine as discussed in the previous patch series [0], this must be
mentioned in the commit message!
A short comment such as:
```
This also changes the error handling for UPID parsing errors with
garbage collection state files, aligning it to the rest of the API
handler behavior.
```
Other than that, consider
Reviewed-by: Christian Ebner <c.ebner@proxmox.com>
[0]
https://lore.proxmox.com/pbs-devel/DH6UARBSGJYA.172KMABDLD4GQ@proxmox.com/
* Re: [PATCH proxmox-backup v3 2/3] fix #7400: api: gracefully handle corrupted job statefiles
2026-03-25 16:06 ` [PATCH proxmox-backup v3 2/3] fix #7400: api: gracefully handle corrupted job statefiles Michael Köppl
@ 2026-04-02 16:35 ` Christian Ebner
0 siblings, 0 replies; 11+ messages in thread
From: Christian Ebner @ 2026-04-02 16:35 UTC (permalink / raw)
To: Michael Köppl, pbs-devel
One nit inline.
Other than that:
Reviewed-by: Christian Ebner <c.ebner@proxmox.com>
On 3/25/26 5:06 PM, Michael Köppl wrote:
> Introduce Unknown JobState to more explicitly represent cases where the
> state could not be determined, e.g. if the statefile was corrupted or
> missing. Update JobState::load to handle parsing errors (both for
> statefiles themselves as well as UPIDs) and return an Unknown state if
> such an error occurred. Update compute_schedule_status to also handle
> the new Unknown status, returning a default JobScheduleStatus so API
> endpoints don't return an error to the user, stopping them from viewing
> their jobs.
>
> Signed-off-by: Michael Köppl <m.koeppl@proxmox.com>
> ---
> src/server/jobstate.rs | 48 ++++++++++++++++++++++++++++++++++++------
> 1 file changed, 42 insertions(+), 6 deletions(-)
>
> diff --git a/src/server/jobstate.rs b/src/server/jobstate.rs
> index ceac8dde8..4163656e8 100644
> --- a/src/server/jobstate.rs
> +++ b/src/server/jobstate.rs
> @@ -66,6 +66,7 @@ pub enum JobState {
> state: TaskState,
> updated: Option<i64>,
> },
> + Unknown,
> }
>
> /// Represents a Job and holds the correct lock
> @@ -155,6 +156,7 @@ pub fn update_job_last_run_time(jobtype: &str, jobname: &str) -> Result<(), Erro
> state,
> updated: Some(time),
> },
> + JobState::Unknown => bail!("cannot update last run time for unknown job state"),
> };
> job.write_state()
> }
> @@ -179,6 +181,7 @@ pub fn last_run_time(jobtype: &str, jobname: &str) -> Result<i64, Error> {
> .map_err(|err| format_err!("could not parse upid from state: {err}"))?;
> Ok(upid.starttime)
> }
> + JobState::Unknown => bail!("statefile could not be parsed or was empty"),
> }
> }
>
> @@ -191,11 +194,20 @@ impl JobState {
> /// This does not update the state in the file.
> pub fn load(jobtype: &str, jobname: &str) -> Result<Self, Error> {
> if let Some(state) = file_read_optional_string(get_path(jobtype, jobname))? {
> - match serde_json::from_str(&state)? {
> + let job_state = serde_json::from_str(&state).unwrap_or_else(|err| {
> + log::error!("could not parse statefile for {jobname}: {err}");
> + JobState::Unknown
nit: This should early return IMHO, no need to fall through to the match
statement below.
> + });
> +
> + match job_state {
> JobState::Started { upid } => {
> - let parsed: UPID = upid
> - .parse()
> - .map_err(|err| format_err!("error parsing upid: {err}"))?;
> + let parsed: UPID = match upid.parse() {
> + Ok(parsed) => parsed,
> + Err(err) => {
> + log::error!("error parsing upid for {jobname}: {err}");
> + return Ok(JobState::Unknown);
> + }
> + };
>
> if !worker_is_active_local(&parsed) {
> let state = upid_read_status(&parsed).unwrap_or(TaskState::Unknown {
> @@ -211,6 +223,21 @@ impl JobState {
> Ok(JobState::Started { upid })
> }
> }
> + JobState::Finished {
> + upid,
> + state,
> + updated,
> + } => {
> + if let Err(err) = upid.parse::<UPID>() {
> + log::error!("error parsing upid for {jobname}: {err}");
> + return Ok(JobState::Unknown);
> + }
> + Ok(JobState::Finished {
> + upid,
> + state,
> + updated,
> + })
> + }
> other => Ok(other),
> }
> } else {
> @@ -263,6 +290,7 @@ impl Job {
> JobState::Created { .. } => bail!("cannot finish when not started"),
> JobState::Started { upid } => upid,
> JobState::Finished { upid, .. } => upid,
> + JobState::Unknown => bail!("cannot finish job with unknown status"),
> }
> .to_string();
>
> @@ -305,8 +333,15 @@ pub fn compute_schedule_status(
> jobname: &str,
> schedule: Option<&str>,
> ) -> Result<JobScheduleStatus, Error> {
> - let job_state = JobState::load(jobtype, jobname)
> - .map_err(|err| format_err!("could not open statefile for {jobname}: {err}"))?;
> + let job_state = match JobState::load(jobtype, jobname) {
> + Ok(job_state) => job_state,
> + Err(err) => {
> + log::error!(
> + "could not open statefile for {jobname}: {err} - falling back to default job schedule status",
> + );
> + return Ok(JobScheduleStatus::default());
> + }
> + };
>
> let (upid, endtime, state, last) = match job_state {
> JobState::Created { time } => (None, None, None, time),
> @@ -327,6 +362,7 @@ pub fn compute_schedule_status(
> last,
> )
> }
> + JobState::Unknown => (None, None, None, proxmox_time::epoch_i64() - 30),
> };
>
> let mut status = JobScheduleStatus {
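For illustration, the early return suggested in the nit above might look
roughly like the following sketch. It is self-contained and does not use
the real types: `parse_state` is a hypothetical stand-in for the
serde_json deserialization, and the enum is reduced to two variants.

```rust
#[derive(Debug, PartialEq)]
enum JobState {
    Started { upid: String },
    Unknown,
}

// Hypothetical stand-in for the JSON parser: accepts "started:<upid>".
fn parse_state(raw: &str) -> Result<JobState, String> {
    match raw.strip_prefix("started:") {
        Some(upid) if !upid.is_empty() => Ok(JobState::Started {
            upid: upid.to_string(),
        }),
        _ => Err(format!("unexpected statefile content: {raw:?}")),
    }
}

fn load(jobname: &str, raw: &str) -> Result<JobState, String> {
    let job_state = match parse_state(raw) {
        Ok(state) => state,
        Err(err) => {
            eprintln!("could not parse statefile for {jobname}: {err}");
            // Early return: the match below is skipped for unparsable files.
            return Ok(JobState::Unknown);
        }
    };

    match job_state {
        // The real code validates the UPID here; omitted in this sketch.
        JobState::Started { upid } => Ok(JobState::Started { upid }),
        other => Ok(other),
    }
}
```

Compared to `unwrap_or_else`, this keeps the error path in one place and
makes it obvious that `Unknown` never reaches the per-variant handling.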
* Re: [PATCH proxmox-backup v3 3/3] fix #7400: proxy: self-heal corrupted job statefiles
2026-03-25 16:06 ` [PATCH proxmox-backup v3 3/3] fix #7400: proxy: self-heal " Michael Köppl
@ 2026-04-02 16:50 ` Christian Ebner
2026-04-03 8:44 ` Michael Köppl
0 siblings, 1 reply; 11+ messages in thread
From: Christian Ebner @ 2026-04-02 16:50 UTC (permalink / raw)
To: Michael Köppl, pbs-devel
One nit inline.
Other than that:
Reviewed-by: Christian Ebner <c.ebner@proxmox.com>
On 3/25/26 5:06 PM, Michael Köppl wrote:
> Update update_job_last_run_time to transition JobState::Unknown into
> JobState::Created so the corrupted statefile is overwritten. In
> addition, update the scheduling loops to actively overwrite corrupted
> statefiles and return the time for the next scheduled run of the
> affected job.
>
> Signed-off-by: Michael Köppl <m.koeppl@proxmox.com>
> ---
> src/bin/proxmox-backup-proxy.rs | 4 +++-
> src/server/jobstate.rs | 2 +-
> 2 files changed, 4 insertions(+), 2 deletions(-)
>
> diff --git a/src/bin/proxmox-backup-proxy.rs b/src/bin/proxmox-backup-proxy.rs
> index c1fe3ac15..71d8566f4 100644
> --- a/src/bin/proxmox-backup-proxy.rs
> +++ b/src/bin/proxmox-backup-proxy.rs
> @@ -564,6 +564,7 @@ async fn schedule_datastore_garbage_collection() {
> Ok(time) => time,
> Err(err) => {
> eprintln!("could not get last run time of {worker_type} {store}: {err}");
> + let _ = jobstate::update_job_last_run_time(worker_type, &store);
> continue;
> }
> };
> @@ -936,7 +937,8 @@ fn check_schedule(worker_type: &str, event_str: &str, id: &str) -> bool {
> Ok(time) => time,
> Err(err) => {
> eprintln!("could not get last run time of {worker_type} {id}: {err}");
> - return false;
> + let _ = jobstate::update_job_last_run_time(worker_type, id);
> + proxmox_time::epoch_i64() - 30
nit: Should we define the offset from epoch here as constant? This is
used in several places now and having one common constant with a
meaningful name would help with code quality IMHO.
> }
> };
>
> diff --git a/src/server/jobstate.rs b/src/server/jobstate.rs
> index 4163656e8..214691965 100644
> --- a/src/server/jobstate.rs
> +++ b/src/server/jobstate.rs
> @@ -156,7 +156,7 @@ pub fn update_job_last_run_time(jobtype: &str, jobname: &str) -> Result<(), Erro
> state,
> updated: Some(time),
> },
> - JobState::Unknown => bail!("cannot update last run time for unknown job state"),
> + JobState::Unknown => JobState::Created { time },
> };
> job.write_state()
> }
* Re: [PATCH proxmox-backup v3 3/3] fix #7400: proxy: self-heal corrupted job statefiles
2026-04-02 16:50 ` Christian Ebner
@ 2026-04-03 8:44 ` Michael Köppl
2026-04-03 11:25 ` Christian Ebner
0 siblings, 1 reply; 11+ messages in thread
From: Michael Köppl @ 2026-04-03 8:44 UTC (permalink / raw)
To: Christian Ebner, Michael Köppl, pbs-devel
On Thu Apr 2, 2026 at 6:50 PM CEST, Christian Ebner wrote:
> One nit inline.
>
> Other than that:
>
> Reviewed-by: Christian Ebner <c.ebner@proxmox.com>
>
> On 3/25/26 5:06 PM, Michael Köppl wrote:
>> Update update_job_last_run_time to transition JobState::Unknown into
>> JobState::Created so the corrupted statefile is overwritten. In
>> addition, update the scheduling loops to actively overwrite corrupted
>> statefiles and return the time for the next scheduled run of the
>> affected job.
>>
>> Signed-off-by: Michael Köppl <m.koeppl@proxmox.com>
>> ---
>> src/bin/proxmox-backup-proxy.rs | 4 +++-
>> src/server/jobstate.rs | 2 +-
>> 2 files changed, 4 insertions(+), 2 deletions(-)
>>
>> diff --git a/src/bin/proxmox-backup-proxy.rs b/src/bin/proxmox-backup-proxy.rs
>> index c1fe3ac15..71d8566f4 100644
>> --- a/src/bin/proxmox-backup-proxy.rs
>> +++ b/src/bin/proxmox-backup-proxy.rs
>> @@ -564,6 +564,7 @@ async fn schedule_datastore_garbage_collection() {
>> Ok(time) => time,
>> Err(err) => {
>> eprintln!("could not get last run time of {worker_type} {store}: {err}");
>> + let _ = jobstate::update_job_last_run_time(worker_type, &store);
>> continue;
>> }
>> };
>> @@ -936,7 +937,8 @@ fn check_schedule(worker_type: &str, event_str: &str, id: &str) -> bool {
>> Ok(time) => time,
>> Err(err) => {
>> eprintln!("could not get last run time of {worker_type} {id}: {err}");
>> - return false;
>> + let _ = jobstate::update_job_last_run_time(worker_type, id);
>> + proxmox_time::epoch_i64() - 30
>
> nit: Should we define the offset from epoch here as constant? This is
> used in several places now and having one common constant with a
> meaningful name would help with code quality IMHO.
Thanks for having another look at this version! I agree, that's a good
idea. What do you think of SCHEDULE_FALLBACK_OFFSET?
>
>> }
>> };
>>
>> diff --git a/src/server/jobstate.rs b/src/server/jobstate.rs
>> index 4163656e8..214691965 100644
>> --- a/src/server/jobstate.rs
>> +++ b/src/server/jobstate.rs
>> @@ -156,7 +156,7 @@ pub fn update_job_last_run_time(jobtype: &str, jobname: &str) -> Result<(), Erro
>> state,
>> updated: Some(time),
>> },
>> - JobState::Unknown => bail!("cannot update last run time for unknown job state"),
>> + JobState::Unknown => JobState::Created { time },
>> };
>> job.write_state()
>> }
* Re: [PATCH proxmox-backup v3 1/3] api: move statefile loading into compute_schedule_status
2026-04-02 16:22 ` Christian Ebner
@ 2026-04-03 11:23 ` Michael Köppl
0 siblings, 0 replies; 11+ messages in thread
From: Michael Köppl @ 2026-04-03 11:23 UTC (permalink / raw)
To: Christian Ebner, Michael Köppl, pbs-devel
On Thu Apr 2, 2026 at 6:22 PM CEST, Christian Ebner wrote:
[snip]
>> let mut last = proxmox_time::epoch_i64();
>>
>> + let jobtype = "garbage_collection";
>> +
>> if let Some(ref upid) = status_in_memory.upid {
>> - let mut computed_schedule: JobScheduleStatus = JobScheduleStatus::default();
>> - if let Some(state) = state_file {
>> - if let Ok(cs) = compute_schedule_status(&state, Some(upid)) {
>> - computed_schedule = cs;
>> - }
>> - }
>> + let computed_schedule: JobScheduleStatus =
>> + compute_schedule_status(jobtype, &store, Some(upid))?;
>
> comment: While the change in behavior here when parsing the UPID fails
> is fine as discussed in the previous patch series [0], this must be
> mentioned in the commit message!
>
> A short comment such as:
> ```
> This also changes the error handling for UPID parsing errors with
> garbage collection state files, aligning it to the rest of the API
> handler behavior.
> ```
Missed that, will adapt the commit message in a v4! Thanks
>
> Other than that, consider
>
> Reviewed-by: Christian Ebner <c.ebner@proxmox.com>
>
> [0]
> https://lore.proxmox.com/pbs-devel/DH6UARBSGJYA.172KMABDLD4GQ@proxmox.com/
* Re: [PATCH proxmox-backup v3 3/3] fix #7400: proxy: self-heal corrupted job statefiles
2026-04-03 8:44 ` Michael Köppl
@ 2026-04-03 11:25 ` Christian Ebner
0 siblings, 0 replies; 11+ messages in thread
From: Christian Ebner @ 2026-04-03 11:25 UTC (permalink / raw)
To: Michael Köppl, pbs-devel
On 4/3/26 10:43 AM, Michael Köppl wrote:
> On Thu Apr 2, 2026 at 6:50 PM CEST, Christian Ebner wrote:
>> One nit inline.
>>
>> Other than that:
>>
>> Reviewed-by: Christian Ebner <c.ebner@proxmox.com>
>>
>> On 3/25/26 5:06 PM, Michael Köppl wrote:
>>> Update update_job_last_run_time to transition JobState::Unknown into
>>> JobState::Created so the corrupted statefile is overwritten. In
>>> addition, update the scheduling loops to actively overwrite corrupted
>>> statefiles and return the time for the next scheduled run of the
>>> affected job.
>>>
>>> Signed-off-by: Michael Köppl <m.koeppl@proxmox.com>
>>> ---
>>> src/bin/proxmox-backup-proxy.rs | 4 +++-
>>> src/server/jobstate.rs | 2 +-
>>> 2 files changed, 4 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/src/bin/proxmox-backup-proxy.rs b/src/bin/proxmox-backup-proxy.rs
>>> index c1fe3ac15..71d8566f4 100644
>>> --- a/src/bin/proxmox-backup-proxy.rs
>>> +++ b/src/bin/proxmox-backup-proxy.rs
>>> @@ -564,6 +564,7 @@ async fn schedule_datastore_garbage_collection() {
>>> Ok(time) => time,
>>> Err(err) => {
>>> eprintln!("could not get last run time of {worker_type} {store}: {err}");
>>> + let _ = jobstate::update_job_last_run_time(worker_type, &store);
>>> continue;
>>> }
>>> };
>>> @@ -936,7 +937,8 @@ fn check_schedule(worker_type: &str, event_str: &str, id: &str) -> bool {
>>> Ok(time) => time,
>>> Err(err) => {
>>> eprintln!("could not get last run time of {worker_type} {id}: {err}");
>>> - return false;
>>> + let _ = jobstate::update_job_last_run_time(worker_type, id);
>>> + proxmox_time::epoch_i64() - 30
>>
>> nit: Should we define the offset from epoch here as a constant? This is
>> used in several places now and having one common constant with a
>> meaningful name would help with code quality IMHO.
>
> Thanks for having another look at this version! I agree, that's a good
> idea. What do you think of SCHEDULE_FALLBACK_OFFSET?
Fine by me, yes!
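
A minimal sketch of how that could look, assuming the constant keeps the value 30 from the diff. The helper `last_run_or_fallback` and the std-only `epoch_i64` are hypothetical stand-ins for illustration, not existing functions in the code base:

```rust
use std::time::{SystemTime, UNIX_EPOCH};

/// Seconds subtracted from "now" when no last run time is available.
/// The name follows the suggestion above; the value 30 is taken from the diff.
const SCHEDULE_FALLBACK_OFFSET: i64 = 30;

/// Std-only stand-in for `proxmox_time::epoch_i64()` (assumption for this sketch).
fn epoch_i64() -> i64 {
    SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .map(|d| d.as_secs() as i64)
        .unwrap_or(0)
}

/// Hypothetical helper: returns the stored last run time, or falls back to
/// `now - SCHEDULE_FALLBACK_OFFSET` if it could not be determined.
fn last_run_or_fallback(last: Result<i64, String>, now: i64) -> i64 {
    match last {
        Ok(time) => time,
        Err(err) => {
            eprintln!("could not get last run time: {err}");
            now - SCHEDULE_FALLBACK_OFFSET
        }
    }
}
```

A call site would then pass `epoch_i64()` as `now`, so the magic number only appears once.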
>
>>
>>> }
>>> };
>>>
>>> diff --git a/src/server/jobstate.rs b/src/server/jobstate.rs
>>> index 4163656e8..214691965 100644
>>> --- a/src/server/jobstate.rs
>>> +++ b/src/server/jobstate.rs
>>> @@ -156,7 +156,7 @@ pub fn update_job_last_run_time(jobtype: &str, jobname: &str) -> Result<(), Erro
>>> state,
>>> updated: Some(time),
>>> },
>>> - JobState::Unknown => bail!("cannot update last run time for unknown job state"),
>>> + JobState::Unknown => JobState::Created { time },
>>> };
>>> job.write_state()
>>> }
>
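
The state transition in the last hunk can be sketched on its own as follows. This is a simplified model under assumptions: the enum below drops the task state carried by the real `Finished` variant, the free function `with_last_run_time` is hypothetical, and leaving `Started` untouched is an assumption of this sketch rather than a statement about the real function.

```rust
/// Simplified model of the job state (assumption: the real enum carries more data).
#[derive(Debug, Clone, PartialEq)]
enum JobState {
    Created { time: i64 },
    Started { upid: String },
    Finished { upid: String, updated: Option<i64> },
    Unknown,
}

/// Hypothetical pure version of the update: returns the state that would be
/// written back to the statefile for the given last run time.
fn with_last_run_time(state: JobState, time: i64) -> JobState {
    match state {
        JobState::Created { .. } => JobState::Created { time },
        // Assumption of this sketch: a running job is left as it is.
        JobState::Started { upid } => JobState::Started { upid },
        JobState::Finished { upid, .. } => JobState::Finished {
            upid,
            updated: Some(time),
        },
        // Self-healing step: an undeterminable state becomes a fresh Created
        // state instead of an error, so the next write replaces the bad file.
        JobState::Unknown => JobState::Created { time },
    }
}
```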
* superseded: [PATCH proxmox-backup v3 0/3] fix #7400: improve handling of corrupted job statefiles
2026-03-25 16:06 [PATCH proxmox-backup v3 0/3] fix #7400: improve handling of corrupted job statefiles Michael Köppl
` (2 preceding siblings ...)
2026-03-25 16:06 ` [PATCH proxmox-backup v3 3/3] fix #7400: proxy: self-heal " Michael Köppl
@ 2026-04-03 13:28 ` Michael Köppl
3 siblings, 0 replies; 11+ messages in thread
From: Michael Köppl @ 2026-04-03 13:28 UTC (permalink / raw)
To: Michael Köppl, pbs-devel
Superseded by https://lore.proxmox.com/pbs-devel/20260403132628.210128-1-m.koeppl@proxmox.com
On Wed Mar 25, 2026 at 5:06 PM CET, Michael Köppl wrote:
> This patch series fixes a problem where an empty or corrupted job state
> file (due to I/O error, abrupt shutdown, ...) would cause API endpoints
> for listing jobs to return an error, breaking the web UI for users
> because they could not view any of their configured jobs of that type.
> It would also cause proxmox-backup-proxy to indefinitely skip the jobs
> until a user manually triggered it to rewrite the statefile.
>
> 1/3 is a preparatory patch that centralizes job statefile loading
> in compute_schedule_status instead of having every handler function
> open the statefile, handle potential errors and then passing the
> JobState to compute_schedule_status.
>
> 2/3 introduces a new JobState `Unknown`, representing cases in which the
> job state could not be determined. In addition, the patch also updates
> the scheduling functions such that errors during reading the statefiles
> will result in the Unknown state.
>
> 3/3 then utilizes this Unknown state and adapts the scheduling functions
> such that the Unknown state will then lead to the statefile being
> overwritten with a new Created state and the job running again at its
> next scheduled run.
>
> changes since v2:
> - introduced the Unknown state in 2/3, adapted 3/3 accordingly (thanks,
> @Fabian and @Christian)
> - make sure the "could not open statefile" error is also printed in
> garbage_collection_status if status_in_memory.upid is None (thanks,
> @Christian)
> - inline jobtype and err variables in error logging
>
> changes since v1:
> - added preparatory patch 1/3, centralizing the statefile loading before
> adapting the handling of the error case in that centralized place
> (compute_schedule_status). Thanks, Christian for the suggestion!
> - adapted the error message if job statefile loading fails to make clear
> that the default status will be returned as a fallback
>
> proxmox-backup:
>
> Michael Köppl (3):
> api: move statefile loading into compute_schedule_status
> fix #7400: api: gracefully handle corrupted job statefiles
> fix #7400: proxy: self-heal corrupted job statefiles
>
> src/api2/admin/datastore.rs | 15 ++++------
> src/api2/admin/prune.rs | 9 ++----
> src/api2/admin/sync.rs | 9 ++----
> src/api2/admin/verify.rs | 9 ++----
> src/api2/tape/backup.rs | 9 ++----
> src/bin/proxmox-backup-proxy.rs | 4 ++-
> src/server/jobstate.rs | 52 +++++++++++++++++++++++++++++----
> 7 files changed, 67 insertions(+), 40 deletions(-)
>
>
> Summary over all repositories:
> 7 files changed, 67 insertions(+), 40 deletions(-)
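
As a rough illustration of the idea behind 2/3 in the quoted summary, read or parse failures can be mapped to an `Unknown` state instead of an error. This is only a sketch under assumptions: the statefile content is reduced to a plain decimal timestamp (not the real format), `state_from_read` is a hypothetical function, and treating a missing file as a fresh `Created` state is an assumption of this example.

```rust
use std::io::{self, ErrorKind};

/// Reduced job state for this sketch (assumption: only two variants modelled).
#[derive(Debug, PartialEq)]
enum JobState {
    Created { time: i64 },
    Unknown,
}

/// Hypothetical mapping from the result of reading a statefile, e.g. from
/// `std::fs::read_to_string(path)`, to a job state.
fn state_from_read(read: io::Result<String>, now: i64) -> JobState {
    match read {
        Ok(content) => match content.trim().parse::<i64>() {
            Ok(time) => JobState::Created { time },
            // Empty or corrupted content: the state cannot be determined.
            Err(_) => JobState::Unknown,
        },
        // Assumption: no statefile yet means the job was just created.
        Err(err) if err.kind() == ErrorKind::NotFound => JobState::Created { time: now },
        // Any other I/O error also leaves the state undetermined.
        Err(_) => JobState::Unknown,
    }
}
```

Callers listing jobs can then show a fallback status for `Unknown` instead of failing the whole request.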
Thread overview: 11+ messages
2026-03-25 16:06 [PATCH proxmox-backup v3 0/3] fix #7400: improve handling of corrupted job statefiles Michael Köppl
2026-03-25 16:06 ` [PATCH proxmox-backup v3 1/3] api: move statefile loading into compute_schedule_status Michael Köppl
2026-04-02 16:22 ` Christian Ebner
2026-04-03 11:23 ` Michael Köppl
2026-03-25 16:06 ` [PATCH proxmox-backup v3 2/3] fix #7400: api: gracefully handle corrupted job statefiles Michael Köppl
2026-04-02 16:35 ` Christian Ebner
2026-03-25 16:06 ` [PATCH proxmox-backup v3 3/3] fix #7400: proxy: self-heal " Michael Köppl
2026-04-02 16:50 ` Christian Ebner
2026-04-03 8:44 ` Michael Köppl
2026-04-03 11:25 ` Christian Ebner
2026-04-03 13:28 ` superseded: [PATCH proxmox-backup v3 0/3] fix #7400: improve handling of " Michael Köppl