public inbox for pbs-devel@lists.proxmox.com
* [pbs-devel] [PATCH v2 proxmox-backup] garbage collection: fix rare race in chunk marking phase
From: Christian Ebner @ 2025-04-15 10:54 UTC (permalink / raw)
  To: pbs-devel

During phase 1 of garbage collection, referenced chunks are marked as
in use by iterating over all index files and updating the atime of the
chunks they reference.
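
For illustration only (this helper is not part of the patch below):
"marking" a chunk essentially means bumping the atime of its file in
the chunk store, roughly as in this minimal sketch using the filetime
crate, with the real chunk store layout and error handling left out:

    use std::path::Path;

    use filetime::FileTime;

    // Touch a referenced chunk so phase 2 can distinguish it from
    // unreferenced chunks older than the GC start time.
    fn touch_chunk(chunk_path: &Path) -> std::io::Result<()> {
        filetime::set_file_atime(chunk_path, FileTime::now())
    }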

In an edge case for long-running garbage collection jobs, a newly
added snapshot (created after the start of GC) can reuse known chunks
from a previous snapshot, while the index of that previous snapshot
disappears before the marking phase reaches it (e.g. because it got
pruned, with the retention setting keeping only 1 snapshot). Chunks
referenced only by that previous index file might then never be
marked.

Since commit 74361da8 ("garbage collection: generate index file list
via datastore iterators") this is even less likely, as the iteration
now also picks up index files added during phase 1, so either the new
or the previous index file will account for these chunks (the previous
backup snapshot can only be pruned after the new one has finished,
since it is locked). A small race window remains, however, between
listing the snapshots of a backup group and reading the actual index
files for marking.

Fix this race by:
1. checking whether the last snapshot of a group disappeared, and if so
2. generating the snapshot list again, looking for new index files not
   previously accounted for,
3. locking the group if the snapshot list still changed on the 10th
   retry, to avoid possible endless looping (this will cause concurrent
   operations on this group to fail).

Signed-off-by: Christian Ebner <c.ebner@proxmox.com>
---
changes since version 1:
- rebased onto current master

 pbs-datastore/src/datastore.rs | 110 ++++++++++++++++++++++-----------
 1 file changed, 74 insertions(+), 36 deletions(-)

diff --git a/pbs-datastore/src/datastore.rs b/pbs-datastore/src/datastore.rs
index aa38e2ac1..788dbf212 100644
--- a/pbs-datastore/src/datastore.rs
+++ b/pbs-datastore/src/datastore.rs
@@ -1143,47 +1143,85 @@ impl DataStore {
             let namespace = namespace.context("iterating namespaces failed")?;
             for group in arc_self.iter_backup_groups(namespace)? {
                 let group = group.context("iterating backup groups failed")?;
-                let mut snapshots = group.list_backups().context("listing snapshots failed")?;
-                // Sort by snapshot timestamp to iterate over consecutive snapshots for each image.
-                BackupInfo::sort_list(&mut snapshots, true);
-                for snapshot in snapshots {
-                    for file in snapshot.files {
-                        worker.check_abort()?;
-                        worker.fail_on_shutdown()?;
-
-                        let mut path = snapshot.backup_dir.full_path();
-                        path.push(file);
-
-                        let index = match self.open_index_reader(&path)? {
-                            Some(index) => index,
-                            None => {
-                                unprocessed_index_list.remove(&path);
+
+                // Avoid race between listing/marking of snapshots by GC and pruning the last
+                // snapshot in the group, following a new snapshot creation. Otherwise known chunks
+                // might only be referenced by the new snapshot, so it must be read as well.
+                let mut retry_counter = 0;
+                let mut processed_group_indices = HashSet::new();
+                let mut needs_retry;
+                loop {
+                    needs_retry = false;
+                    let _lock = if retry_counter == 10 {
+                        // Last resort, exclusively lock the backup group to make sure no new
+                        // backups can be written. This will only trigger if none of the previous
+                        // retries converged successfully, which is very unlikely.
+                        Some(group.lock())
+                    } else {
+                        None
+                    };
+
+                    let mut snapshots = group.list_backups().context("listing snapshots failed")?;
+                    let snapshot_count = snapshots.len();
+                    // Sort by snapshot timestamp to iterate over consecutive snapshots for each snapshot.
+                    BackupInfo::sort_list(&mut snapshots, true);
+                    for (count, snapshot) in snapshots.into_iter().enumerate() {
+                        for file in snapshot.files {
+                            worker.check_abort()?;
+                            worker.fail_on_shutdown()?;
+
+                            let mut path = snapshot.backup_dir.full_path();
+                            path.push(file);
+
+                            // Avoid reprocessing of already seen index files on retry
+                            if retry_counter > 0 && processed_group_indices.contains(&path) {
                                 continue;
                             }
-                        };
-                        self.index_mark_used_chunks(
-                            index,
-                            &path,
-                            &mut chunk_lru_cache,
-                            status,
-                            worker,
-                        )?;
-
-                        if !unprocessed_index_list.remove(&path) {
-                            info!("Encountered new index file '{path:?}', increment total index file count");
-                            index_count += 1;
-                        }
 
-                        let percentage = (processed_index_files + 1) * 100 / index_count;
-                        if percentage > last_percentage {
-                            info!(
-                                "marked {percentage}% ({} of {index_count} index files)",
-                                processed_index_files + 1,
-                            );
-                            last_percentage = percentage;
+                            let index = match self.open_index_reader(&path)? {
+                                Some(index) => index,
+                                None => {
+                                    if count + 1 == snapshot_count
+                                        && !snapshot.backup_dir.full_path().exists()
+                                    {
+                                        needs_retry = true;
+                                    }
+                                    unprocessed_index_list.remove(&path);
+                                    continue;
+                                }
+                            };
+                            self.index_mark_used_chunks(
+                                index,
+                                &path,
+                                &mut chunk_lru_cache,
+                                status,
+                                worker,
+                            )?;
+                            processed_group_indices.insert(path.clone());
+
+                            if !unprocessed_index_list.remove(&path) {
+                                info!("Encountered new index file '{path:?}', increment total index file count");
+                                index_count += 1;
+                            }
+
+                            let percentage = (processed_index_files + 1) * 100 / index_count;
+                            if percentage > last_percentage {
+                                info!(
+                                    "marked {percentage}% ({} of {index_count} index files)",
+                                    processed_index_files + 1,
+                                );
+                                last_percentage = percentage;
+                            }
+                            processed_index_files += 1;
                         }
-                        processed_index_files += 1;
                     }
+
+                    if needs_retry {
+                        retry_counter += 1;
+                        continue;
+                    }
+
+                    break;
                 }
             }
         }
-- 
2.39.5




* Re: [pbs-devel] [PATCH v2 proxmox-backup] garbage collection: fix rare race in chunk marking phase
From: Fabian Grünbichler @ 2025-04-15 11:42 UTC (permalink / raw)
  To: Proxmox Backup Server development discussion

On April 15, 2025 12:54 pm, Christian Ebner wrote:
> During phase 1 of garbage collection referenced chunks are marked as
> in use by iterating over all index files and updating the atime on
> the chunks referenced by these.
> 
> In an edge case for long running garbage collection jobs, where a
> newly added snapshot (created after the start of GC) reused known
> chunks from a previous snapshot, but the previous snapshot index
> referencing them disappeared before the marking phase could reach
> that index (e.g. pruned because only 1 snapshot to be kept by
> retention setting), known chunks from that previous index file might
> not be marked (given that by none of the other index files it was
> marked).
> 
> Since commit 74361da8 ("garbage collection: generate index file list
> via datastore iterators") this is even less likely as now the
> iteration reads also index files added during phase 1, and
> therefore either the new or the previous index file will account for
> these chunks (the previous backup snapshot can only be prunded after
> the new one finished, since locked). There remains however a small
> race window between the reading of the snapshots in the backup group
> and the reading of the actual index files for marking.
> 
> Fix this race by:
> 1. Checking if the last snapshot of a group disappeared and if so
> 2. generate the list again, looking for new index files previously
>    not accounted for
> 3. To avoid possible endless looping, lock the group if the snapshot
>    list changed even after the 10th time (which will lead to
>    concurrent operations to this group failing).
> 
> Signed-off-by: Christian Ebner <c.ebner@proxmox.com>
> ---
> changes since version 1:
> - rebased onto current master
> 
>  pbs-datastore/src/datastore.rs | 110 ++++++++++++++++++++++-----------
>  1 file changed, 74 insertions(+), 36 deletions(-)
> 
> diff --git a/pbs-datastore/src/datastore.rs b/pbs-datastore/src/datastore.rs
> index aa38e2ac1..788dbf212 100644
> --- a/pbs-datastore/src/datastore.rs
> +++ b/pbs-datastore/src/datastore.rs
> @@ -1143,47 +1143,85 @@ impl DataStore {
>              let namespace = namespace.context("iterating namespaces failed")?;
>              for group in arc_self.iter_backup_groups(namespace)? {
>                  let group = group.context("iterating backup groups failed")?;

pre-existing/unrelated - what if the group got pruned at the right
moment here? should we maybe recheck that it exists and ignore the error
in that case?

> -                let mut snapshots = group.list_backups().context("listing snapshots failed")?;
> -                // Sort by snapshot timestamp to iterate over consecutive snapshots for each image.
> -                BackupInfo::sort_list(&mut snapshots, true);
> -                for snapshot in snapshots {
> -                    for file in snapshot.files {
> -                        worker.check_abort()?;
> -                        worker.fail_on_shutdown()?;
> -
> -                        let mut path = snapshot.backup_dir.full_path();
> -                        path.push(file);
> -
> -                        let index = match self.open_index_reader(&path)? {
> -                            Some(index) => index,
> -                            None => {
> -                                unprocessed_index_list.remove(&path);
> +
> +                // Avoid race between listing/marking of snapshots by GC and pruning the last
> +                // snapshot in the group, following a new snapshot creation. Otherwise known chunks
> +                // might only be referenced by the new snapshot, so it must be read as well.
> +                let mut retry_counter = 0;
> +                let mut processed_group_indices = HashSet::new();
> +                let mut needs_retry;

I don't think we need this variable (see below)

> +                loop {
> +                    needs_retry = false;
> +                    let _lock = if retry_counter == 10 {
> +                        // Last resort, exclusively lock the backup group to make sure no new
> +                        // backups can be written. This will only trigger if none of the previous
> +                        // retries converged successfully, which is very unlikely.
> +                        Some(group.lock())

this should check the result? this would also fail if a backup is
currently going on (very likely if we end up here?) and abort the GC
then, but we don't have a way to lock a group with a timeout at the
moment.. but maybe we can wait and see if users actually run into that,
we can always extend the locking interface then..

we should probably also abort if we ever hit a retry_counter > 10, since
that would mean we missed some bug and this could loop forever..
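
a rough sketch of what that could look like (assuming group.lock()
returns an anyhow::Result and that anyhow's Context and bail! are in
scope; not an actual patch):

    if retry_counter > 10 {
        // should be unreachable, bail instead of risking an endless loop
        bail!("snapshot list of backup group kept changing, aborting GC phase 1");
    }
    let _lock = if retry_counter == 10 {
        // last resort: surface locking failures instead of ignoring the result
        Some(group.lock().context("failed to lock backup group during phase 1")?)
    } else {
        None
    };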

> +                    } else {
> +                        None
> +                    };
> +
> +                    let mut snapshots = group.list_backups().context("listing snapshots failed")?;

same question as above applies here as well - what if the whole group vanished?

> +                    let snapshot_count = snapshots.len();
> +                    // Sort by snapshot timestamp to iterate over consecutive snapshots for each snapshot.

this comment makes no sense ;)

// iterate over sorted snapshots to benefit from chunk caching

or something like this? if we want to keep the comment at all..

> +                    BackupInfo::sort_list(&mut snapshots, true);
> +                    for (count, snapshot) in snapshots.into_iter().enumerate() {
> +                        for file in snapshot.files {

snapshot.files also includes blobs, should we filter them here to avoid
problems if we adapt below to handle partial/ongoing pruning?
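
for illustration, a filter on the usual index extensions (an
assumption, not something the patch does) would skip blobs up front:

    for file in snapshot
        .files
        .iter()
        .filter(|name| name.ends_with(".fidx") || name.ends_with(".didx"))
    {
        // only index files reach the marking code, blobs are skipped
    }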

> +                            worker.check_abort()?;
> +                            worker.fail_on_shutdown()?;
> +
> +                            let mut path = snapshot.backup_dir.full_path();
> +                            path.push(file);
> +
> +                            // Avoid reprocessing of already seen index files on retry
> +                            if retry_counter > 0 && processed_group_indices.contains(&path) {
>                                  continue;
>                              }
> -                        };
> -                        self.index_mark_used_chunks(
> -                            index,
> -                            &path,
> -                            &mut chunk_lru_cache,
> -                            status,
> -                            worker,
> -                        )?;
> -
> -                        if !unprocessed_index_list.remove(&path) {
> -                            info!("Encountered new index file '{path:?}', increment total index file count");
> -                            index_count += 1;
> -                        }
>  
> -                        let percentage = (processed_index_files + 1) * 100 / index_count;
> -                        if percentage > last_percentage {
> -                            info!(
> -                                "marked {percentage}% ({} of {index_count} index files)",
> -                                processed_index_files + 1,
> -                            );
> -                            last_percentage = percentage;
> +                            let index = match self.open_index_reader(&path)? {
> +                                Some(index) => index,
> +                                None => {
> +                                    if count + 1 == snapshot_count
> +                                        && !snapshot.backup_dir.full_path().exists()

what if the snapshot dir still exists, but the index doesn't? wouldn't
that also be problematic? I think we should probably start the loop over
as soon as something unexpected is going on with the last snapshot of
the group..

> +                                    {
> +                                        needs_retry = true;

since this is the only time we set it to true, and at this point we
could just start another iteration of the outer loop via labeled
continue?

> +                                    }
> +                                    unprocessed_index_list.remove(&path);

this could move above the if then
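
putting the last few points together, the None arm could end up
looking roughly like this ('group_retry being a hypothetical label on
the outer loop, not code from the patch):

    None => {
        // a missing index is no longer "unprocessed" either way
        unprocessed_index_list.remove(&path);
        // anything unexpected with the last snapshot of the group:
        // re-list the snapshots and retry the whole group
        if count + 1 == snapshot_count {
            retry_counter += 1;
            continue 'group_retry;
        }
        continue;
    }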

> +                                    continue;
> +                                }
> +                            };
> +                            self.index_mark_used_chunks(
> +                                index,
> +                                &path,
> +                                &mut chunk_lru_cache,
> +                                status,
> +                                worker,
> +                            )?;
> +                            processed_group_indices.insert(path.clone());
> +
> +                            if !unprocessed_index_list.remove(&path) {
> +                                info!("Encountered new index file '{path:?}', increment total index file count");
> +                                index_count += 1;
> +                            }
> +
> +                            let percentage = (processed_index_files + 1) * 100 / index_count;
> +                            if percentage > last_percentage {
> +                                info!(
> +                                    "marked {percentage}% ({} of {index_count} index files)",
> +                                    processed_index_files + 1,
> +                                );
> +                                last_percentage = percentage;
> +                            }
> +                            processed_index_files += 1;
>                          }
> -                        processed_index_files += 1;
>                      }
> +
> +                    if needs_retry {
> +                        retry_counter += 1;
> +                        continue;
> +                    }

this whole thing here can go away then..

> +
> +                    break;
>                  }
>              }
>          }
> -- 
> 2.39.5

* Re: [pbs-devel] [PATCH v2 proxmox-backup] garbage collection: fix rare race in chunk marking phase
From: Christian Ebner @ 2025-04-15 12:50 UTC (permalink / raw)
  To: Proxmox Backup Server development discussion, Fabian Grünbichler

On 4/15/25 13:42, Fabian Grünbichler wrote:
> On April 15, 2025 12:54 pm, Christian Ebner wrote:
>> During phase 1 of garbage collection referenced chunks are marked as
>> in use by iterating over all index files and updating the atime on
>> the chunks referenced by these.
>>
>> In an edge case for long running garbage collection jobs, where a
>> newly added snapshot (created after the start of GC) reused known
>> chunks from a previous snapshot, but the previous snapshot index
>> referencing them disappeared before the marking phase could reach
>> that index (e.g. pruned because only 1 snapshot to be kept by
>> retention setting), known chunks from that previous index file might
>> not be marked (given that by none of the other index files it was
>> marked).
>>
>> Since commit 74361da8 ("garbage collection: generate index file list
>> via datastore iterators") this is even less likely as now the
>> iteration reads also index files added during phase 1, and
>> therefore either the new or the previous index file will account for
>> these chunks (the previous backup snapshot can only be prunded after
>> the new one finished, since locked). There remains however a small
>> race window between the reading of the snapshots in the backup group
>> and the reading of the actual index files for marking.
>>
>> Fix this race by:
>> 1. Checking if the last snapshot of a group disappeared and if so
>> 2. generate the list again, looking for new index files previously
>>     not accounted for
>> 3. To avoid possible endless looping, lock the group if the snapshot
>>     list changed even after the 10th time (which will lead to
>>     concurrent operations to this group failing).
>>
>> Signed-off-by: Christian Ebner <c.ebner@proxmox.com>
>> ---
>> changes since version 1:
>> - rebased onto current master
>>
>>   pbs-datastore/src/datastore.rs | 110 ++++++++++++++++++++++-----------
>>   1 file changed, 74 insertions(+), 36 deletions(-)
>>
>> diff --git a/pbs-datastore/src/datastore.rs b/pbs-datastore/src/datastore.rs
>> index aa38e2ac1..788dbf212 100644
>> --- a/pbs-datastore/src/datastore.rs
>> +++ b/pbs-datastore/src/datastore.rs
>> @@ -1143,47 +1143,85 @@ impl DataStore {
>>               let namespace = namespace.context("iterating namespaces failed")?;
>>               for group in arc_self.iter_backup_groups(namespace)? {
>>                   let group = group.context("iterating backup groups failed")?;
> 
> pre-existing/unrelated - what if the group got pruned at the right
> moment here? should we maybe recheck that it exists and ignore the error
> in that case?

Hmm, that's right... Same probably applies to the namespace iteration 
above as well...
But I'm not sure we should check if the group exists and ignore? What's 
your reasoning here? As far as I see the whole state of the iterator 
boils down to the readdir_r() calls in nix::dir::Iter::next(), but not 
sure what state one is looking at at that moment.

> 
>> -                let mut snapshots = group.list_backups().context("listing snapshots failed")?;
>> -                // Sort by snapshot timestamp to iterate over consecutive snapshots for each image.
>> -                BackupInfo::sort_list(&mut snapshots, true);
>> -                for snapshot in snapshots {
>> -                    for file in snapshot.files {
>> -                        worker.check_abort()?;
>> -                        worker.fail_on_shutdown()?;
>> -
>> -                        let mut path = snapshot.backup_dir.full_path();
>> -                        path.push(file);
>> -
>> -                        let index = match self.open_index_reader(&path)? {
>> -                            Some(index) => index,
>> -                            None => {
>> -                                unprocessed_index_list.remove(&path);
>> +
>> +                // Avoid race between listing/marking of snapshots by GC and pruning the last
>> +                // snapshot in the group, following a new snapshot creation. Otherwise known chunks
>> +                // might only be referenced by the new snapshot, so it must be read as well.
>> +                let mut retry_counter = 0;
>> +                let mut processed_group_indices = HashSet::new();
>> +                let mut needs_retry;
> 
> I don't think we need this variable (see below)

Acked

> 
>> +                loop {
>> +                    needs_retry = false;
>> +                    let _lock = if retry_counter == 10 {
>> +                        // Last resort, exclusively lock the backup group to make sure no new
>> +                        // backups can be written. This will only trigger if none of the previous
>> +                        // retries converged successfully, which is very unlikely.
>> +                        Some(group.lock())
> 
> this should check the result? this would also fail if a backup is
> currently going on (very likely if we end up here?) and abort the GC
> then, but we don't have a way to lock a group with a timeout at the
> moment.. but maybe we can wait and see if users actually run into that,
> we can always extend the locking interface then..

True, but since this is very unlikely to happen, I would opt to fail and 
add an error context here so this can easily be traced back to this code 
path.

> 
> we should probably also abort if we ever hit a retry_counter > 10, since
> that would mean we missed some bug and this could loop forever..

Ah yes, good point! Will add a bail for that case.

> 
>> +                    } else {
>> +                        None
>> +                    };
>> +
>> +                    let mut snapshots = group.list_backups().context("listing snapshots failed")?;
> 
> same question as above applies here as well - what if the whole group vanished?

Acked, must check

> 
>> +                    let snapshot_count = snapshots.len();
>> +                    // Sort by snapshot timestamp to iterate over consecutive snapshots for each snapshot.
> 
> this comment makes no sense ;)
> 
> // iterate over sorted snapshots to benefit from chunk caching
> 
> or something like this? if we want to keep the comment at all..

yeah, I incorrectly adapted the pre-existing one here, but it does not 
give much insight, so let's drop it.

> 
>> +                    BackupInfo::sort_list(&mut snapshots, true);
>> +                    for (count, snapshot) in snapshots.into_iter().enumerate() {
>> +                        for file in snapshot.files {
> 
> snapshot.files also includes blobs, should we filter them here to avoid
> problems if we adapt below to handle partial/ongoing pruning?

Yes, might be better to exclude these at this point already.

> 
>> +                            worker.check_abort()?;
>> +                            worker.fail_on_shutdown()?;
>> +
>> +                            let mut path = snapshot.backup_dir.full_path();
>> +                            path.push(file);
>> +
>> +                            // Avoid reprocessing of already seen index files on retry
>> +                            if retry_counter > 0 && processed_group_indices.contains(&path) {
>>                                   continue;
>>                               }
>> -                        };
>> -                        self.index_mark_used_chunks(
>> -                            index,
>> -                            &path,
>> -                            &mut chunk_lru_cache,
>> -                            status,
>> -                            worker,
>> -                        )?;
>> -
>> -                        if !unprocessed_index_list.remove(&path) {
>> -                            info!("Encountered new index file '{path:?}', increment total index file count");
>> -                            index_count += 1;
>> -                        }
>>   
>> -                        let percentage = (processed_index_files + 1) * 100 / index_count;
>> -                        if percentage > last_percentage {
>> -                            info!(
>> -                                "marked {percentage}% ({} of {index_count} index files)",
>> -                                processed_index_files + 1,
>> -                            );
>> -                            last_percentage = percentage;
>> +                            let index = match self.open_index_reader(&path)? {
>> +                                Some(index) => index,
>> +                                None => {
>> +                                    if count + 1 == snapshot_count
>> +                                        && !snapshot.backup_dir.full_path().exists()
> 
> what if the snapshot dir still exists, but the index doesn't? wouldn't
> that also be problematic? I think we should probably start the loop over
> as soon as something unexpected is going on with the last snapshot of
> the group..

Agreed, pruning should remove the directory as a whole but let's be 
conservative here to cover inconsistent states as well.

> 
>> +                                    {
>> +                                        needs_retry = true;
> 
> since this is the only time we set it to true, and at this point we
> could just start another iteration of the outer loop via labeled
> continue?

Ah, yes that's indeed better, thx!

> 
>> +                                    }
>> +                                    unprocessed_index_list.remove(&path);
> 
> this could move above the if then

Acked

> 
>> +                                    continue;
>> +                                }
>> +                            };
>> +                            self.index_mark_used_chunks(
>> +                                index,
>> +                                &path,
>> +                                &mut chunk_lru_cache,
>> +                                status,
>> +                                worker,
>> +                            )?;
>> +                            processed_group_indices.insert(path.clone());
>> +
>> +                            if !unprocessed_index_list.remove(&path) {
>> +                                info!("Encountered new index file '{path:?}', increment total index file count");
>> +                                index_count += 1;
>> +                            }
>> +
>> +                            let percentage = (processed_index_files + 1) * 100 / index_count;
>> +                            if percentage > last_percentage {
>> +                                info!(
>> +                                    "marked {percentage}% ({} of {index_count} index files)",
>> +                                    processed_index_files + 1,
>> +                                );
>> +                                last_percentage = percentage;
>> +                            }
>> +                            processed_index_files += 1;
>>                           }
>> -                        processed_index_files += 1;
>>                       }
>> +
>> +                    if needs_retry {
>> +                        retry_counter += 1;
>> +                        continue;
>> +                    }
> 
> this whole thing here can go away then..

Acked

> 
>> +
>> +                    break;
>>                   }
>>               }
>>           }
>> -- 
>> 2.39.5

* Re: [pbs-devel] [PATCH v2 proxmox-backup] garbage collection: fix rare race in chunk marking phase
From: Fabian Grünbichler @ 2025-04-15 13:14 UTC (permalink / raw)
  To: Christian Ebner, Proxmox Backup Server development discussion

On April 15, 2025 2:50 pm, Christian Ebner wrote:
> On 4/15/25 13:42, Fabian Grünbichler wrote:
>> On April 15, 2025 12:54 pm, Christian Ebner wrote:
>>> During phase 1 of garbage collection referenced chunks are marked as
>>> in use by iterating over all index files and updating the atime on
>>> the chunks referenced by these.
>>>
>>> In an edge case for long running garbage collection jobs, where a
>>> newly added snapshot (created after the start of GC) reused known
>>> chunks from a previous snapshot, but the previous snapshot index
>>> referencing them disappeared before the marking phase could reach
>>> that index (e.g. pruned because only 1 snapshot to be kept by
>>> retention setting), known chunks from that previous index file might
>>> not be marked (given that by none of the other index files it was
>>> marked).
>>>
>>> Since commit 74361da8 ("garbage collection: generate index file list
>>> via datastore iterators") this is even less likely as now the
>>> iteration reads also index files added during phase 1, and
>>> therefore either the new or the previous index file will account for
>>> these chunks (the previous backup snapshot can only be prunded after
>>> the new one finished, since locked). There remains however a small
>>> race window between the reading of the snapshots in the backup group
>>> and the reading of the actual index files for marking.
>>>
>>> Fix this race by:
>>> 1. Checking if the last snapshot of a group disappeared and if so
>>> 2. generate the list again, looking for new index files previously
>>>     not accounted for
>>> 3. To avoid possible endless looping, lock the group if the snapshot
>>>     list changed even after the 10th time (which will lead to
>>>     concurrent operations to this group failing).
>>>
>>> Signed-off-by: Christian Ebner <c.ebner@proxmox.com>
>>> ---
>>> changes since version 1:
>>> - rebased onto current master
>>>
>>>   pbs-datastore/src/datastore.rs | 110 ++++++++++++++++++++++-----------
>>>   1 file changed, 74 insertions(+), 36 deletions(-)
>>>
>>> diff --git a/pbs-datastore/src/datastore.rs b/pbs-datastore/src/datastore.rs
>>> index aa38e2ac1..788dbf212 100644
>>> --- a/pbs-datastore/src/datastore.rs
>>> +++ b/pbs-datastore/src/datastore.rs
>>> @@ -1143,47 +1143,85 @@ impl DataStore {
>>>               let namespace = namespace.context("iterating namespaces failed")?;
>>>               for group in arc_self.iter_backup_groups(namespace)? {
>>>                   let group = group.context("iterating backup groups failed")?;
>> 
>> pre-existing/unrelated - what if the group got pruned at the right
>> moment here? should we maybe recheck that it exists and ignore the error
>> in that case?
> 
> Hmm, that's right... Same probably applies to the namespace iteration 
> above as well...
> But I'm not sure we should check if the group exists and ignore? What's 
> your reasoning here? As far as I see the whole state of the iterator 
> boils down to the readdir_r() calls in nix::dir::Iter::next(), but not 
> sure what state one is looking at at that moment.

my reasoning was that we now changed the behaviour so that groups are
completely removed if the last snapshot is removed, so if that happens
while iterating it would abort the whole GC.. but that probably applies
more to the list_backups call below, and not to the group iterator
here..

for namespaces I'd be a bit less worried, as deleting them while a GC is
running and hitting a conflict should be a fairly rare occurrence..

> 
>> 
>>> -                let mut snapshots = group.list_backups().context("listing snapshots failed")?;
>>> -                // Sort by snapshot timestamp to iterate over consecutive snapshots for each image.
>>> -                BackupInfo::sort_list(&mut snapshots, true);
>>> -                for snapshot in snapshots {
>>> -                    for file in snapshot.files {
>>> -                        worker.check_abort()?;
>>> -                        worker.fail_on_shutdown()?;
>>> -
>>> -                        let mut path = snapshot.backup_dir.full_path();
>>> -                        path.push(file);
>>> -
>>> -                        let index = match self.open_index_reader(&path)? {
>>> -                            Some(index) => index,
>>> -                            None => {
>>> -                                unprocessed_index_list.remove(&path);
>>> +
>>> +                // Avoid race between listing/marking of snapshots by GC and pruning the last
>>> +                // snapshot in the group, following a new snapshot creation. Otherwise known chunks
>>> +                // might only be referenced by the new snapshot, so it must be read as well.
>>> +                let mut retry_counter = 0;
>>> +                let mut processed_group_indices = HashSet::new();
>>> +                let mut needs_retry;
>> 
>> I don't think we need this variable (see below)
> 
> Acked
> 
>> 
>>> +                loop {
>>> +                    needs_retry = false;
>>> +                    let _lock = if retry_counter == 10 {
>>> +                        // Last resort, exclusively lock the backup group to make sure no new
>>> +                        // backups can be written. This will only trigger if none of the previous
>>> +                        // retries converged successfully, which is very unlikely.
>>> +                        Some(group.lock())
>> 
>> this should check the result? this would also fail if a backup is
>> currently going on (very likely if we end up here?) and abort the GC
>> then, but we don't have a way to lock a group with a timeout at the
>> moment.. but maybe we can wait and see if users actually run into that,
>> we can always extend the locking interface then..
> 
> True, but since this is very unlikely to happen, I would opt to fail and 
> add an error context here so this can easily be traced back to this code 
> path.

yes, for now I'd say aborting GC with a clear error here is best. we
cannot safely continue..

> 
>> 
>> we should probably also abort if we ever hit a retry_counter > 10, since
>> that would mean we missed some bug and this could loop forever..
> 
> Ah yes, good point! Will add a bail for that case.
> 
>> 
>>> +                    } else {
>>> +                        None
>>> +                    };
>>> +
>>> +                    let mut snapshots = group.list_backups().context("listing snapshots failed")?;
>> 
>> same question as above applies here as well - what if the whole group vanished?
> 
> Acked, must check

see above - this is probably the relevant part for disappearing groups.

> 
>> 
>>> +                    let snapshot_count = snapshots.len();
>>> +                    // Sort by snapshot timestamp to iterate over consecutive snapshots for each snapshot.
>> 
>> this comment makes no sense ;)
>> 
>> // iterate over sorted snapshots to benefit from chunk caching
>> 
>> or something like this? if we want to keep the comment at all..
> 
> yeah, incorrectly adapted the pre-existing one here, but this does not 
> give much insight, so lets drop it.
> 
>> 
>>> +                    BackupInfo::sort_list(&mut snapshots, true);
>>> +                    for (count, snapshot) in snapshots.into_iter().enumerate() {
>>> +                        for file in snapshot.files {
>> 
>> snapshot.files also includes blobs, should we filter them here to avoid
>> problems if we adapt below to handle partial/ongoing pruning?
> 
> Yes, might be better to exclude these at this point already.
> 
>> 
>>> +                            worker.check_abort()?;
>>> +                            worker.fail_on_shutdown()?;
>>> +
>>> +                            let mut path = snapshot.backup_dir.full_path();
>>> +                            path.push(file);
>>> +
>>> +                            // Avoid reprocessing of already seen index files on retry
>>> +                            if retry_counter > 0 && processed_group_indices.contains(&path) {
>>>                                   continue;
>>>                               }
>>> -                        };
>>> -                        self.index_mark_used_chunks(
>>> -                            index,
>>> -                            &path,
>>> -                            &mut chunk_lru_cache,
>>> -                            status,
>>> -                            worker,
>>> -                        )?;
>>> -
>>> -                        if !unprocessed_index_list.remove(&path) {
>>> -                            info!("Encountered new index file '{path:?}', increment total index file count");
>>> -                            index_count += 1;
>>> -                        }
>>>   
>>> -                        let percentage = (processed_index_files + 1) * 100 / index_count;
>>> -                        if percentage > last_percentage {
>>> -                            info!(
>>> -                                "marked {percentage}% ({} of {index_count} index files)",
>>> -                                processed_index_files + 1,
>>> -                            );
>>> -                            last_percentage = percentage;
>>> +                            let index = match self.open_index_reader(&path)? {
>>> +                                Some(index) => index,
>>> +                                None => {
>>> +                                    if count + 1 == snapshot_count
>>> +                                        && !snapshot.backup_dir.full_path().exists()
>> 
>> what if the snapshot dir still exists, but the index doesn't? wouldn't
>> that also be problematic? I think we should probably start the loop over
>> as soon as something unexpected is going on with the last snapshot of
>> the group..
> 
> Agreed, pruning should remove the directory as a whole but let's be 
> conservative here to cover inconsistent states as well.
> 
>> 
>>> +                                    {
>>> +                                        needs_retry = true;
>> 
>> since this is the only time we set it to true, and at this point we
>> could just start another iteration of the outer loop via labeled
>> continue?
> 
> Ah, yes that's indeed better, thx!
> 
>> 
>>> +                                    }
>>> +                                    unprocessed_index_list.remove(&path);
>> 
>> this could move above the if then
> 
> Acked
> 
>> 
>>> +                                    continue;
>>> +                                }
>>> +                            };
>>> +                            self.index_mark_used_chunks(
>>> +                                index,
>>> +                                &path,
>>> +                                &mut chunk_lru_cache,
>>> +                                status,
>>> +                                worker,
>>> +                            )?;
>>> +                            processed_group_indices.insert(path.clone());
>>> +
>>> +                            if !unprocessed_index_list.remove(&path) {
>>> +                                info!("Encountered new index file '{path:?}', increment total index file count");
>>> +                                index_count += 1;
>>> +                            }
>>> +
>>> +                            let percentage = (processed_index_files + 1) * 100 / index_count;
>>> +                            if percentage > last_percentage {
>>> +                                info!(
>>> +                                    "marked {percentage}% ({} of {index_count} index files)",
>>> +                                    processed_index_files + 1,
>>> +                                );
>>> +                                last_percentage = percentage;
>>> +                            }
>>> +                            processed_index_files += 1;
>>>                           }
>>> -                        processed_index_files += 1;
>>>                       }
>>> +
>>> +                    if needs_retry {
>>> +                        retry_counter += 1;
>>> +                        continue;
>>> +                    }
>> 
>> this whole thing here can go away then..
> 
> Acked



* [pbs-devel] superseded: [PATCH v2 proxmox-backup] garbage collection: fix rare race in chunk marking phase
From: Christian Ebner @ 2025-04-15 14:51 UTC (permalink / raw)
  To: pbs-devel

superseded-by version 3:
https://lore.proxmox.com/pbs-devel/20250415144930.426824-1-c.ebner@proxmox.com/T/



* Re: [pbs-devel] [PATCH v2 proxmox-backup] garbage collection: fix rare race in chunk marking phase
From: Thomas Lamprecht @ 2025-04-15 15:40 UTC (permalink / raw)
  To: Proxmox Backup Server development discussion,
	Fabian Grünbichler, Christian Ebner

On 15/04/2025 15:14, Fabian Grünbichler wrote:
>>> this should check the result? this would also fail if a backup is
>>> currently going on (very likely if we end up here?) and abort the GC
>>> then, but we don't have a way to lock a group with a timeout at the
>>> moment.. but maybe we can wait and see if users actually run into that,
>>> we can always extend the locking interface then..
>> True, but since this is very unlikely to happen, I would opt to fail and 
>> add an error context here so this can easily be traced back to this code 
>> path.
> yes, for now I'd say aborting GC with a clear error here is best. we
> cannot safely continue..

Did not check v3, but note that users often do not run GC with a high
frequency due to the load it generates and the time it needs, but they
still depend on it finishing so that space is actually freed.

So if there is any way we can take, or add, to avoid aborting
completely, it would IMO be well worth evaluating more closely.

FWIW, a completely different alternative might be to change not GC
itself but pruning while a GC job runs, e.g. (spitballing/hand waving)
move the index to a trash folder and notify GC about that.
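
Purely as hand-waving illustration (function name and path layout
invented, not an existing API): pruning would rename the index into a
trash directory on the same datastore instead of unlinking it, so a
running GC can still find and mark it:

    use std::fs;
    use std::io;
    use std::path::Path;

    // hypothetical sketch: move an index into a per-datastore trash dir
    // instead of deleting it; rename stays on the same filesystem and is
    // atomic, the flattened name avoids collisions between groups
    fn trash_index(datastore_base: &Path, index_path: &Path) -> io::Result<()> {
        let trash_dir = datastore_base.join(".trash");
        fs::create_dir_all(&trash_dir)?;
        let name = index_path
            .strip_prefix(datastore_base)
            .map(|p| p.to_string_lossy().replace('/', "_"))
            .unwrap_or_else(|_| "index".into());
        fs::rename(index_path, trash_dir.join(name))?;
        Ok(())
    }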



* Re: [pbs-devel] [PATCH v2 proxmox-backup] garbage collection: fix rare race in chunk marking phase
From: Christian Ebner @ 2025-04-16  6:31 UTC (permalink / raw)
  To: Thomas Lamprecht, Proxmox Backup Server development discussion,
	Fabian Grünbichler

Hi Thomas,

On 4/15/25 17:40, Thomas Lamprecht wrote:
> On 15/04/2025 15:14, Fabian Grünbichler wrote:
>>>> this should check the result? this would also fail if a backup is
>>>> currently going on (very likely if we end up here?) and abort the GC
>>>> then, but we don't have a way to lock a group with a timeout at the
>>>> moment.. but maybe we can wait and see if users actually run into that,
>>>> we can always extend the locking interface then..
>>> True, but since this is very unlikely to happen, I would opt to fail and
>>> add an error context here so this can easily be traced back to this code
>>> path.
>> yes, for now I'd say aborting GC with a clear error here is best. we
>> cannot safely continue..
> 
> Did not check v3, but note that users often do not run GC with a high
> frequency due to the load it generates and time it needs, but still
> depend on it to finish so space is being freed.
> 
> So if there is any way we can go or add to avoid aborting completely,
> it would be IMO quite worth to evaluate doing that more closely.
> 
> FWIW, an completely different alternative might be to not change
> GC but pruning when a GC job runs, e.g. (spitballing/hand waving)
> move the index to a trash folder and notify GC about that.

yes, having some sort of shadow copy of the index files came to mind as 
well. I did however disregard that for the GC itself, because it would 
be expensive and probably run into similar races with pruning.

Your suggested approach would however eliminate that, and it would also 
be a nice feature in itself! GC could then clean up all the trashed 
index files with some retention logic in a new phase 3, after cleaning 
up the chunks.

E.g. it has already happened that users accidentally pruned a snapshot 
they still needed. So might it make sense to add a trash can as a 
feature?

Nevertheless, I do think that the order of snapshot iteration for the 
GC run should be reversed, as that reduces the window of opportunity 
where things can go wrong even further (as stated in my self-reply to 
version 3 of the patch).
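
For illustration (an assumption about the shape of the change, not
taken from any posted patch), iterating newest-first would mean
walking the sorted snapshot list in reverse:

    // hypothetical: visit the newest snapshot first, so a snapshot that
    // may hold the only remaining reference to reused chunks is marked
    // before older snapshots can be pruned away
    for snapshot in snapshots.iter().rev() {
        // open and mark the snapshot's index files as before
    }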



* Re: [pbs-devel] [PATCH v2 proxmox-backup] garbage collection: fix rare race in chunk marking phase
From: Fabian Grünbichler @ 2025-04-16  7:11 UTC (permalink / raw)
  To: Christian Ebner, Proxmox Backup Server development discussion,
	Thomas Lamprecht

On April 16, 2025 8:31 am, Christian Ebner wrote:
> Hi Thomas,
> 
> On 4/15/25 17:40, Thomas Lamprecht wrote:
>> On 15/04/2025 15:14, Fabian Grünbichler wrote:
>>>>> this should check the result? this would also fail if a backup is
>>>>> currently going on (very likely if we end up here?) and abort the GC
>>>>> then, but we don't have a way to lock a group with a timeout at the
>>>>> moment.. but maybe we can wait and see if users actually run into that,
>>>>> we can always extend the locking interface then..
>>>> True, but since this is very unlikely to happen, I would opt to fail and
>>>> add an error context here so this can easily be traced back to this code
>>>> path.
>>> yes, for now I'd say aborting GC with a clear error here is best. we
>>> cannot safely continue..
>> 
>> Did not check v3, but note that users often do not run GC with a high
>> frequency due to the load it generates and time it needs, but still
>> depend on it to finish so space is being freed.
>> 
>> So if there is any way we can go or add to avoid aborting completely,
>> it would be IMO quite worth to evaluate doing that more closely.

this should only trigger for setups that do things like very frequent
incremental backups with --keep-last 1 and pruning immediately after
backing up. I think even without Christian's most recently proposed
improvement, hitting this 10 times in a row would practically mean GC
is impossible in such a setup anyway.

the race window is the following after all:
- list snapshots in group
- sort them
- iterate over them but skipping previously marked ones
- mark each not yet seen one
- a new snapshot based on the previously last one was made since
  listing, and that previously last one was pruned, before the pruned
  snapshot could be marked

so we'd have to repeatedly hit this sequence

- list group
- backup finished + prune finished in this group
- iteration reaches last snapshot in list

where the whole sequence is repeated in a tight loop, and the delta
between iterations should be a single snapshot so the iteration should
reach it almost instantly.

the situation with any variant of this patch is very different from what
we had before, which was:

- list all indices in the datastore
- iterate and mark
- if any last snapshot in any group was used as base for a new backup,
  and pruned before the iteration reached that group+snapshot, chunks
  could be lost

which for setups where GC took days, made the issue very much possible
to hit.

>> FWIW, an completely different alternative might be to not change
>> GC but pruning when a GC job runs, e.g. (spitballing/hand waving)
>> move the index to a trash folder and notify GC about that.
> 
> yes, having some sort of shadow copy of the index files came to mind as 
> well. I did however disregard that for the GC itself, because it would 
> be expensive and probably run into similar races with pruning.
> 
> Your suggested approach would however eliminate that, and further also 
> be a nice feature! GC could then clean up all the trashed index files 
> with some retention logic in a new phase 3, after cleaning up the chunks.
> 
> E.g. it already happened to some users that they pruned a snapshot they 
> still needed by accident. So might it make sense to add a trash can as 
> feature?

it has one downside - it's no longer possible to prune to get out of
(almost) full datastore situations, unless we also have a "skip trash
can" feature? but yes, it might be nice for non-GC-safety reasons as
well.

for GC, I think the order should be:
- clear out trash can (this doesn't race with marking, so no issues)
- mark (including anything that got added to the trash since clearing it
  out, to prevent the prune+GC race)
- sweep (like now)

else the trash can feature would effectively double the time until
garbage is actually removed, or double the run time of GC because we
have to run it twice back-to-back ;)

if we make the trash can feature unconditional, then once it is
implemented we can drop the retry logic when marking a group, as it's no
longer needed.

> Nevertheless, I do think that changing the order of snapshot iteration 
> for the GC run should be reversed, as that even further reduces the 
> window of opportunity where things can go wrong (as stated in my 
> self-reply to version 3 of the patch).

I think with this change the chances of hitting the retry counter limit
in practice should already be zero..

because if listing is slow, then doing a backup should be slow as well
(and thus the race becomes very unlikely).

but if any user reports aborted GCs because of this (or we are worried
about it), we can simply bump the counter from 10 to 100 or 1000, it
shouldn't affect regular setups in any fashion after all?

