* [PATCH proxmox-backup v1 0/2] fix #7400: improve handling of corrupted job statefiles @ 2026-03-17 16:07 Michael Köppl 2026-03-17 16:07 ` [PATCH proxmox-backup v1 1/2] fix #7400: api: gracefully handle " Michael Köppl 2026-03-17 16:07 ` [PATCH proxmox-backup v1 2/2] fix #7400: proxy: self-heal " Michael Köppl 0 siblings, 2 replies; 8+ messages in thread From: Michael Köppl @ 2026-03-17 16:07 UTC (permalink / raw) To: pbs-devel This patch series fixes a problem where an empty or corrupted job state file (due to I/O error, abrupt shutdown, ...) would cause API endpoints for listing jobs to return an error, breaking the web UI for users because they could not view any of their configured jobs of that type. It would also cause proxmox-backup-proxy to skip the jobs indefinitely until a user manually triggered them to rewrite the statefile. 1/2 fixes the API layer by catching the load error, logging a warning, and returning a default JobScheduleStatus. This ensures that the API returns the job list, showing the affected job without its last run status. 2/2 updates the scheduler loop inside the proxy to match the behavior of a newly created job, triggering the job at its next interval (which then rewrites the statefile). proxmox-backup: Michael Köppl (2): fix #7400: api: gracefully handle corrupted job statefiles fix #7400: proxy: self-heal corrupted job statefiles src/api2/admin/prune.rs | 17 ++++++++++------- src/api2/admin/sync.rs | 16 ++++++++++------ src/api2/admin/verify.rs | 17 ++++++++++------- src/api2/tape/backup.rs | 17 ++++++++++------- src/bin/proxmox-backup-proxy.rs | 2 +- 5 files changed, 41 insertions(+), 28 deletions(-) Summary over all repositories: 5 files changed, 41 insertions(+), 28 deletions(-) -- Generated by murpp 0.9.0 ^ permalink raw reply [flat|nested] 8+ messages in thread
* [PATCH proxmox-backup v1 1/2] fix #7400: api: gracefully handle corrupted job statefiles 2026-03-17 16:07 [PATCH proxmox-backup v1 0/2] fix #7400: improve handling of corrupted job statefiles Michael Köppl @ 2026-03-17 16:07 ` Michael Köppl 2026-03-18 16:15 ` Christian Ebner 2026-03-18 17:22 ` Michael Köppl 2026-03-17 16:07 ` [PATCH proxmox-backup v1 2/2] fix #7400: proxy: self-heal " Michael Köppl 1 sibling, 2 replies; 8+ messages in thread From: Michael Köppl @ 2026-03-17 16:07 UTC (permalink / raw) To: pbs-devel Previously, if a job statefile was empty or corrupted (e.g. due to an I/O error or an abrupt shutdown), the JobStatus::load method would return an error that would be propagated up, causing the endpoint to return an error to the user, meaning users would not see any of their jobs if a single job had a corrupted statefile. Instead, handle the error explicitly, logging a warning and returning a default JobScheduleStatus, such that jobs lists can still be fetched, displaying the affected job as configured but simply missing its last run status. Signed-off-by: Michael Köppl <m.koeppl@proxmox.com> --- src/api2/admin/prune.rs | 17 ++++++++++------- src/api2/admin/sync.rs | 16 ++++++++++------ src/api2/admin/verify.rs | 17 ++++++++++------- src/api2/tape/backup.rs | 17 ++++++++++------- 4 files changed, 40 insertions(+), 27 deletions(-) diff --git a/src/api2/admin/prune.rs b/src/api2/admin/prune.rs index a5ebf2975..f2a6445c2 100644 --- a/src/api2/admin/prune.rs +++ b/src/api2/admin/prune.rs @@ -1,6 +1,6 @@ //! 
Datastore Prune Job Management -use anyhow::{format_err, Error}; +use anyhow::Error; use serde_json::Value; use proxmox_router::{ @@ -10,8 +10,8 @@ use proxmox_schema::api; use proxmox_sortable_macro::sortable; use pbs_api_types::{ - Authid, PruneJobConfig, PruneJobStatus, DATASTORE_SCHEMA, JOB_ID_SCHEMA, PRIV_DATASTORE_AUDIT, - PRIV_DATASTORE_MODIFY, + Authid, JobScheduleStatus, PruneJobConfig, PruneJobStatus, DATASTORE_SCHEMA, JOB_ID_SCHEMA, + PRIV_DATASTORE_AUDIT, PRIV_DATASTORE_MODIFY, }; use pbs_config::prune; use pbs_config::CachedUserInfo; @@ -73,10 +73,13 @@ pub fn list_prune_jobs( let mut list = Vec::new(); for job in job_config_iter { - let last_state = JobState::load("prunejob", &job.id) - .map_err(|err| format_err!("could not open statefile for {}: {}", &job.id, err))?; - - let mut status = compute_schedule_status(&last_state, Some(&job.schedule))?; + let mut status = match JobState::load("prunejob", &job.id) { + Ok(last_state) => compute_schedule_status(&last_state, Some(&job.schedule))?, + Err(err) => { + log::error!("could not open statefile for {}: {}", &job.id, err); + JobScheduleStatus::default() + } + }; if job.disable { status.next_run = None; } diff --git a/src/api2/admin/sync.rs b/src/api2/admin/sync.rs index 6722ebea0..b0ff53ef7 100644 --- a/src/api2/admin/sync.rs +++ b/src/api2/admin/sync.rs @@ -1,6 +1,6 @@ //! 
Datastore Synchronization Job Management -use anyhow::{bail, format_err, Error}; +use anyhow::{bail, Error}; use serde::{Deserialize, Serialize}; use serde_json::Value; @@ -12,7 +12,8 @@ use proxmox_schema::api; use proxmox_sortable_macro::sortable; use pbs_api_types::{ - Authid, SyncDirection, SyncJobConfig, SyncJobStatus, DATASTORE_SCHEMA, JOB_ID_SCHEMA, + Authid, JobScheduleStatus, SyncDirection, SyncJobConfig, SyncJobStatus, DATASTORE_SCHEMA, + JOB_ID_SCHEMA, }; use pbs_config::sync; use pbs_config::CachedUserInfo; @@ -112,10 +113,13 @@ pub fn list_config_sync_jobs( continue; } - let last_state = JobState::load("syncjob", &job.id) - .map_err(|err| format_err!("could not open statefile for {}: {}", &job.id, err))?; - - let status = compute_schedule_status(&last_state, job.schedule.as_deref())?; + let status = match JobState::load("syncjob", &job.id) { + Ok(last_state) => compute_schedule_status(&last_state, job.schedule.as_deref())?, + Err(err) => { + log::error!("could not open statefile for {}: {}", &job.id, err); + JobScheduleStatus::default() + } + }; list.push(SyncJobStatus { config: job, diff --git a/src/api2/admin/verify.rs b/src/api2/admin/verify.rs index 66695236c..d2c47ce0c 100644 --- a/src/api2/admin/verify.rs +++ b/src/api2/admin/verify.rs @@ -1,6 +1,6 @@ //! 
Datastore Verify Job Management -use anyhow::{format_err, Error}; +use anyhow::Error; use serde_json::Value; use proxmox_router::{ @@ -11,8 +11,8 @@ use proxmox_schema::api; use proxmox_sortable_macro::sortable; use pbs_api_types::{ - Authid, VerificationJobConfig, VerificationJobStatus, DATASTORE_SCHEMA, JOB_ID_SCHEMA, - PRIV_DATASTORE_AUDIT, PRIV_DATASTORE_VERIFY, + Authid, JobScheduleStatus, VerificationJobConfig, VerificationJobStatus, DATASTORE_SCHEMA, + JOB_ID_SCHEMA, PRIV_DATASTORE_AUDIT, PRIV_DATASTORE_VERIFY, }; use pbs_config::verify; use pbs_config::CachedUserInfo; @@ -73,10 +73,13 @@ pub fn list_verification_jobs( let mut list = Vec::new(); for job in job_config_iter { - let last_state = JobState::load("verificationjob", &job.id) - .map_err(|err| format_err!("could not open statefile for {}: {}", &job.id, err))?; - - let status = compute_schedule_status(&last_state, job.schedule.as_deref())?; + let status = match JobState::load("verificationjob", &job.id) { + Ok(last_state) => compute_schedule_status(&last_state, job.schedule.as_deref())?, + Err(err) => { + log::error!("could not open statefile for {}: {}", &job.id, err); + JobScheduleStatus::default() + } + }; list.push(VerificationJobStatus { config: job, diff --git a/src/api2/tape/backup.rs b/src/api2/tape/backup.rs index 47e8d0209..dde65735d 100644 --- a/src/api2/tape/backup.rs +++ b/src/api2/tape/backup.rs @@ -1,6 +1,6 @@ use std::sync::{Arc, Mutex}; -use anyhow::{bail, format_err, Error}; +use anyhow::{bail, Error}; use serde_json::Value; use tracing::{info, warn}; @@ -11,8 +11,8 @@ use proxmox_schema::api; use proxmox_worker_task::WorkerTaskContext; use pbs_api_types::{ - print_ns_and_snapshot, print_store_and_ns, Authid, MediaPoolConfig, Operation, - TapeBackupJobConfig, TapeBackupJobSetup, TapeBackupJobStatus, JOB_ID_SCHEMA, + print_ns_and_snapshot, print_store_and_ns, Authid, JobScheduleStatus, MediaPoolConfig, + Operation, TapeBackupJobConfig, TapeBackupJobSetup, TapeBackupJobStatus, 
JOB_ID_SCHEMA, PRIV_DATASTORE_READ, PRIV_TAPE_AUDIT, PRIV_TAPE_WRITE, UPID_SCHEMA, }; @@ -97,10 +97,13 @@ pub fn list_tape_backup_jobs( continue; } - let last_state = JobState::load("tape-backup-job", &job.id) - .map_err(|err| format_err!("could not open statefile for {}: {}", &job.id, err))?; - - let status = compute_schedule_status(&last_state, job.schedule.as_deref())?; + let status = match JobState::load("tape-backup-job", &job.id) { + Ok(last_state) => compute_schedule_status(&last_state, job.schedule.as_deref())?, + Err(err) => { + log::error!("could not open statefile for {}: {}", &job.id, err); + JobScheduleStatus::default() + } + }; let next_run = status.next_run.unwrap_or(current_time); -- 2.47.3 ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH proxmox-backup v1 1/2] fix #7400: api: gracefully handle corrupted job statefiles 2026-03-17 16:07 ` [PATCH proxmox-backup v1 1/2] fix #7400: api: gracefully handle " Michael Köppl @ 2026-03-18 16:15 ` Christian Ebner 2026-03-18 17:04 ` Michael Köppl 2026-03-18 17:22 ` Michael Köppl 1 sibling, 1 reply; 8+ messages in thread From: Christian Ebner @ 2026-03-18 16:15 UTC (permalink / raw) To: Michael Köppl, pbs-devel On 3/17/26 5:07 PM, Michael Köppl wrote: > Previously, if a job statefile was empty or corrupted (e.g. due to an > I/O error or an abrupt shutdown), the JobStatus::load method would > return an error that would be propagated up, causing the endpoint to > return an error to the user, meaning users would not see any of their > jobs if a single job had a corrupted statefile. > > Instead, handle the error explicitly, logging a warning and returning a > default JobScheduleStatus, such that jobs lists can still be fetched, > displaying the affected job as configured but simply missing its last > run status. > > Signed-off-by: Michael Köppl <m.koeppl@proxmox.com> > --- Thanks for the patches, one comment for this patch. Instead of adapting all the call sides, loading of the state file might be moved to compute_schedule_status() in a preparatory patch, passing the job type and id as parameters. The improved error handling can then be done inside that helper. ^ permalink raw reply [flat|nested] 8+ messages in thread
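[Editorial note: the refactor suggested in the review above — moving statefile loading into compute_schedule_status() and passing the job type and id as parameters — could be sketched roughly as below. This is a simplified, std-only model: `JobScheduleStatus`, `load_job_state`, the statefile format, and the schedule arithmetic are illustrative stand-ins, not the real proxmox-backup types or signatures.]

```rust
// Sketch of the suggested refactor: the helper takes the job type and id,
// loads the statefile contents itself, and falls back to a default status
// on error, so all four call sites stay a single function call.

#[derive(Debug, Default, PartialEq)]
pub struct JobScheduleStatus {
    pub last_run_endtime: Option<i64>,
    pub next_run: Option<i64>,
}

// Stand-in for JobState::load: parse a last-run endtime out of the raw
// statefile contents; an empty or garbled file yields an error.
fn load_job_state(raw: &str) -> Result<i64, String> {
    raw.trim()
        .parse::<i64>()
        .map_err(|err| format!("corrupted statefile: {err}"))
}

pub fn compute_schedule_status(jobtype: &str, id: &str, raw_statefile: &str) -> JobScheduleStatus {
    match load_job_state(raw_statefile) {
        Ok(last_run_endtime) => JobScheduleStatus {
            last_run_endtime: Some(last_run_endtime),
            // a real implementation would evaluate the job's schedule here
            next_run: Some(last_run_endtime + 3600),
        },
        Err(err) => {
            // the improved error handling lives inside the helper now
            eprintln!("could not open statefile for {jobtype} {id}: {err}");
            JobScheduleStatus::default()
        }
    }
}
```

With this shape, each endpoint's loop would reduce to something like `let status = compute_schedule_status("prunejob", &job.id, …);` with no per-call-site match needed.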
* Re: [PATCH proxmox-backup v1 1/2] fix #7400: api: gracefully handle corrupted job statefiles 2026-03-18 16:15 ` Christian Ebner @ 2026-03-18 17:04 ` Michael Köppl 0 siblings, 0 replies; 8+ messages in thread From: Michael Köppl @ 2026-03-18 17:04 UTC (permalink / raw) To: Christian Ebner, Michael Köppl, pbs-devel On Wed Mar 18, 2026 at 5:15 PM CET, Christian Ebner wrote: > On 3/17/26 5:07 PM, Michael Köppl wrote: >> Previously, if a job statefile was empty or corrupted (e.g. due to an >> I/O error or an abrupt shutdown), the JobStatus::load method would >> return an error that would be propagated up, causing the endpoint to >> return an error to the user, meaning users would not see any of their >> jobs if a single job had a corrupted statefile. >> >> Instead, handle the error explicitly, logging a warning and returning a >> default JobScheduleStatus, such that jobs lists can still be fetched, >> displaying the affected job as configured but simply missing its last >> run status. >> >> Signed-off-by: Michael Köppl <m.koeppl@proxmox.com> >> --- > > Thanks for the patches, one comment for this patch. > > Instead of adapting all the call sides, loading of the state file might > be moved to compute_schedule_status() in a preparatory patch, passing > the job type and id as parameters. The improved error handling can then > be done inside that helper. Thanks for having a look! Yeah, true. Was so busy trying to avoid moving too much stuff around that I didn't see the cleaner solution. Will send an updated v2, thanks for the feedback. ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH proxmox-backup v1 1/2] fix #7400: api: gracefully handle corrupted job statefiles 2026-03-17 16:07 ` [PATCH proxmox-backup v1 1/2] fix #7400: api: gracefully handle " Michael Köppl 2026-03-18 16:15 ` Christian Ebner @ 2026-03-18 17:22 ` Michael Köppl 2026-03-19 8:05 ` Christian Ebner 1 sibling, 1 reply; 8+ messages in thread From: Michael Köppl @ 2026-03-18 17:22 UTC (permalink / raw) To: Michael Köppl, pbs-devel On Tue Mar 17, 2026 at 5:07 PM CET, Michael Köppl wrote: [snip] > }; > use pbs_config::prune; > use pbs_config::CachedUserInfo; > @@ -73,10 +73,13 @@ pub fn list_prune_jobs( > let mut list = Vec::new(); > > for job in job_config_iter { > - let last_state = JobState::load("prunejob", &job.id) > - .map_err(|err| format_err!("could not open statefile for {}: {}", &job.id, err))?; > - > - let mut status = compute_schedule_status(&last_state, Some(&job.schedule))?; > + let mut status = match JobState::load("prunejob", &job.id) { > + Ok(last_state) => compute_schedule_status(&last_state, Some(&job.schedule))?, > + Err(err) => { > + log::error!("could not open statefile for {}: {}", &job.id, err); Since I'm currently preparing v2, would it make sense to instead make this a warning? Not quite sure about it, but displaying an error to the user and then just continuing (and having self-healing behavior) seems a bit odd to me. > + JobScheduleStatus::default() > + } > + }; > if job.disable { > status.next_run = None; > } > diff --git a/src/api2/admin/sync.rs b/src/api2/admin/sync.rs > index 6722ebea0..b0ff53ef7 100644 > --- a/src/api2/admin/sync.rs > +++ b/src/api2/admin/sync.rs > @@ -1,6 +1,6 @@ > //! Datastore Synchronization Job Management [snip] ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH proxmox-backup v1 1/2] fix #7400: api: gracefully handle corrupted job statefiles 2026-03-18 17:22 ` Michael Köppl @ 2026-03-19 8:05 ` Christian Ebner 2026-03-19 10:19 ` Michael Köppl 0 siblings, 1 reply; 8+ messages in thread From: Christian Ebner @ 2026-03-19 8:05 UTC (permalink / raw) To: Michael Köppl, pbs-devel On 3/18/26 6:22 PM, Michael Köppl wrote: > On Tue Mar 17, 2026 at 5:07 PM CET, Michael Köppl wrote: > > [snip] > >> }; >> use pbs_config::prune; >> use pbs_config::CachedUserInfo; >> @@ -73,10 +73,13 @@ pub fn list_prune_jobs( >> let mut list = Vec::new(); >> >> for job in job_config_iter { >> - let last_state = JobState::load("prunejob", &job.id) >> - .map_err(|err| format_err!("could not open statefile for {}: {}", &job.id, err))?; >> - >> - let mut status = compute_schedule_status(&last_state, Some(&job.schedule))?; >> + let mut status = match JobState::load("prunejob", &job.id) { >> + Ok(last_state) => compute_schedule_status(&last_state, Some(&job.schedule))?, >> + Err(err) => { >> + log::error!("could not open statefile for {}: {}", &job.id, err); > > Since I'm currently preparing v2, would it make sense to instead make > this a warning? Not quite sure about it, but displaying an error to the > user and then just continuing (and having self-healing behavior) seems a > bit odd to me. IMO it makes sense as an error. There was an error reading the file after all. And you might not always be able to self-heal. E.g. what if you cannot re-write the state file (although this should be logged as well). Further, I do not expect these to show up frequently and the error message could be adapted to include the default being used as fallback? ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH proxmox-backup v1 1/2] fix #7400: api: gracefully handle corrupted job statefiles 2026-03-19 8:05 ` Christian Ebner @ 2026-03-19 10:19 ` Michael Köppl 0 siblings, 0 replies; 8+ messages in thread From: Michael Köppl @ 2026-03-19 10:19 UTC (permalink / raw) To: Christian Ebner, Michael Köppl, pbs-devel On Thu Mar 19, 2026 at 9:05 AM CET, Christian Ebner wrote: > On 3/18/26 6:22 PM, Michael Köppl wrote: >> On Tue Mar 17, 2026 at 5:07 PM CET, Michael Köppl wrote: >> >> [snip] >> >>> }; >>> use pbs_config::prune; >>> use pbs_config::CachedUserInfo; >>> @@ -73,10 +73,13 @@ pub fn list_prune_jobs( >>> let mut list = Vec::new(); >>> >>> for job in job_config_iter { >>> - let last_state = JobState::load("prunejob", &job.id) >>> - .map_err(|err| format_err!("could not open statefile for {}: {}", &job.id, err))?; >>> - >>> - let mut status = compute_schedule_status(&last_state, Some(&job.schedule))?; >>> + let mut status = match JobState::load("prunejob", &job.id) { >>> + Ok(last_state) => compute_schedule_status(&last_state, Some(&job.schedule))?, >>> + Err(err) => { >>> + log::error!("could not open statefile for {}: {}", &job.id, err); >> >> Since I'm currently preparing v2, would it make sense to instead make >> this a warning? Not quite sure about it, but displaying an error to the >> user and then just continuing (and having self-healing behavior) seems a >> bit odd to me. > > IMO it makes sense as an error. There was an error reading the file > after all. And you might not always be able to self-heal. E.g. what if > you cannot re-write the state file (although this should be logged as well). Ack, I see your point. Thanks for the feedback, I'll leave it as an error. > > Further, I do not expect these to show up frequently and the error > message could be adapted to include the default being used as fallback? Yeah, something like "could not open statefile for {}: {} - falling back to default job schedule status". I'll adapt it and send the v2. 
^ permalink raw reply [flat|nested] 8+ messages in thread
* [PATCH proxmox-backup v1 2/2] fix #7400: proxy: self-heal corrupted job statefiles 2026-03-17 16:07 [PATCH proxmox-backup v1 0/2] fix #7400: improve handling of corrupted job statefiles Michael Köppl 2026-03-17 16:07 ` [PATCH proxmox-backup v1 1/2] fix #7400: api: gracefully handle " Michael Köppl @ 2026-03-17 16:07 ` Michael Köppl 1 sibling, 0 replies; 8+ messages in thread From: Michael Köppl @ 2026-03-17 16:07 UTC (permalink / raw) To: pbs-devel Previously, if a job statefile was corrupted or missing, last_run_time() would return an error. The job scheduling loops handled this error by skipping the job indefinitely. To fix this, manual intervention was required (as in, triggering the job manually). Instead, match the behavior of a newly created job, ensuring that the job runs at its next scheduled interval. The job file will then be rewritten upon successful completion of the job. Signed-off-by: Michael Köppl <m.koeppl@proxmox.com> --- src/bin/proxmox-backup-proxy.rs | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/src/bin/proxmox-backup-proxy.rs b/src/bin/proxmox-backup-proxy.rs index c1fe3ac15..8359ac0f1 100644 --- a/src/bin/proxmox-backup-proxy.rs +++ b/src/bin/proxmox-backup-proxy.rs @@ -936,7 +936,7 @@ fn check_schedule(worker_type: &str, event_str: &str, id: &str) -> bool { Ok(time) => time, Err(err) => { eprintln!("could not get last run time of {worker_type} {id}: {err}"); - return false; + proxmox_time::epoch_i64() - 30 } }; -- 2.47.3 ^ permalink raw reply [flat|nested] 8+ messages in thread
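[Editorial note: the one-line change in patch 2/2 can be modeled in isolation. The sketch below is a simplified, std-only stand-in — `epoch_i64` mimics proxmox_time::epoch_i64(), and the surrounding check_schedule() logic is assumed — showing the idea: instead of skipping a job whose last run time cannot be read, pretend the last run happened 30 seconds ago, so the next scheduled event still fires and the statefile is rewritten when the job completes.]

```rust
use std::time::{SystemTime, UNIX_EPOCH};

// std-only stand-in for proxmox_time::epoch_i64()
fn epoch_i64() -> i64 {
    SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .expect("system clock before epoch")
        .as_secs() as i64
}

// Simplified model of the fallback in check_schedule(): a readable
// statefile yields the stored last run time; an unreadable one is treated
// like a freshly created job whose "last run" was 30 seconds ago, rather
// than being skipped indefinitely.
fn last_run_time_or_fallback(loaded: Result<i64, String>, worker_type: &str, id: &str) -> i64 {
    match loaded {
        Ok(time) => time,
        Err(err) => {
            eprintln!("could not get last run time of {worker_type} {id}: {err}");
            epoch_i64() - 30
        }
    }
}
```

The scheduler then evaluates the job's calendar event against this fallback time, so the job fires at its next regular interval and self-heals the statefile on success.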
end of thread, other threads:[~2026-03-19 10:19 UTC | newest] Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2026-03-17 16:07 [PATCH proxmox-backup v1 0/2] fix #7400: improve handling of corrupted job statefiles Michael Köppl 2026-03-17 16:07 ` [PATCH proxmox-backup v1 1/2] fix #7400: api: gracefully handle " Michael Köppl 2026-03-18 16:15 ` Christian Ebner 2026-03-18 17:04 ` Michael Köppl 2026-03-18 17:22 ` Michael Köppl 2026-03-19 8:05 ` Christian Ebner 2026-03-19 10:19 ` Michael Köppl 2026-03-17 16:07 ` [PATCH proxmox-backup v1 2/2] fix #7400: proxy: self-heal " Michael Köppl