From: Christian Ebner <c.ebner@proxmox.com>
To: "Fabian Grünbichler" <f.gruenbichler@proxmox.com>,
"Proxmox Backup Server development discussion"
<pbs-devel@lists.proxmox.com>
Subject: Re: [pbs-devel] [PATCH proxmox-backup 4/9] fix #5982: garbage collection: check atime updates are honored
Date: Mon, 3 Mar 2025 17:09:10 +0100 [thread overview]
Message-ID: <5e23c5cf-8cc6-48b5-9d92-a9828d95053e@proxmox.com> (raw)
In-Reply-To: <1741016392.vckcqyi01l.astroid@yuna.none>
On 3/3/25 16:50, Fabian Grünbichler wrote:
> On March 3, 2025 4:23 pm, Christian Ebner wrote:
>> On 3/3/25 14:51, Fabian Grünbichler wrote:
>>> On February 19, 2025 5:48 pm, Christian Ebner wrote:
>>>> Check if the filesystem backing the chunk store actually updates the
>>>> atime to avoid potential data loss in phase 2 of garbage collection
>>>> in case the atime update is not honored.
>>>>
>>>> Perform the check before phase 1 of garbage collection, as well as
>>>> on datastore creation. The latter to detect early and disallow
>>>> datastore creation on filesystem configurations which otherwise
>>>> would most likely lead to data loss.
>>>>
>>>> Enable the atime update check by default, but allow to opt-out by
>>>> setting a datastore tuning parameter flag for backwards compatibility.
>>>> This is honored by both garbage collection and datastore creation.
>>>>
>>>> The check uses a 4 MiB fixed-size, unencrypted and compressed chunk
>>>> as test marker, inserted if not present. This all-zero chunk is
>>>> likely to be present anyway for unencrypted backup contents with
>>>> large all-zero regions using fixed-size chunking (e.g. VMs).
>>>>
>>>> The check sets the chunk's atime twice, once to set it 1 second
>>>> into the past, and once to set it to now. This makes it possible to
>>>> detect whether the atime is updated at all, and whether it is also
>>>> updated if the filesystem is e.g. mounted via relatime, which is
>>>> intended to defer atime updates.
>>>
>>> that's not how relatime works? see below..
>>>
>>>>
>>>> Fixes: https://bugzilla.proxmox.com/show_bug.cgi?id=5982
>>>> Signed-off-by: Christian Ebner <c.ebner@proxmox.com>
>>>> ---
>>>> changes since version 1:
>>>> - Use all-zero fixed size chunk for atime update test
>>>> - Test on datastore creation and before garbage collection phase 1
>>>> - fail datastore creation and garbage collection jobs if check fails,
>>>> but keep opt-out
>>>>
>>>> pbs-datastore/src/chunk_store.rs | 80 ++++++++++++++++++++++++++++++--
>>>> pbs-datastore/src/datastore.rs | 10 ++++
>>>> src/api2/config/datastore.rs | 1 +
>>>> 3 files changed, 86 insertions(+), 5 deletions(-)
>>>>
>>>> diff --git a/pbs-datastore/src/chunk_store.rs b/pbs-datastore/src/chunk_store.rs
>>>> index 7bdcb0297..be57e3c87 100644
>>>> --- a/pbs-datastore/src/chunk_store.rs
>>>> +++ b/pbs-datastore/src/chunk_store.rs
>>>> @@ -1,6 +1,7 @@
>>>> use std::os::unix::io::AsRawFd;
>>>> use std::path::{Path, PathBuf};
>>>> use std::sync::{Arc, Mutex};
>>>> +use std::time::{Duration, SystemTime, UNIX_EPOCH};
>>>>
>>>> use anyhow::{bail, format_err, Error};
>>>> use tracing::info;
>>>> @@ -13,6 +14,7 @@ use proxmox_sys::process_locker::{
>>>> };
>>>> use proxmox_worker_task::WorkerTaskContext;
>>>>
>>>> +use crate::data_blob::DataChunkBuilder;
>>>> use crate::file_formats::{
>>>> COMPRESSED_BLOB_MAGIC_1_0, ENCRYPTED_BLOB_MAGIC_1_0, UNCOMPRESSED_BLOB_MAGIC_1_0,
>>>> };
>>>> @@ -96,6 +98,7 @@ impl ChunkStore {
>>>> uid: nix::unistd::Uid,
>>>> gid: nix::unistd::Gid,
>>>> sync_level: DatastoreFSyncLevel,
>>>> + atime_update_check: bool,
>>>> ) -> Result<Self, Error>
>>>> where
>>>> P: Into<PathBuf>,
>>>> @@ -150,7 +153,16 @@ impl ChunkStore {
>>>> }
>>>> }
>>>>
>>>> - Self::open(name, base, sync_level)
>>>> + let chunk_store = Self::open(name, base, sync_level)?;
>>>> + if atime_update_check {
>>>> + chunk_store
>>>> + .check_fs_atime_update()
>>>> + .map(|()| info!("atime update check successful."))?;
>>>> + } else {
>>>> + info!("atime update check skipped.");
>>>> + }
>>>> +
>>>> + Ok(chunk_store)
>>>> }
>>>>
>>>> fn lockfile_path<P: Into<PathBuf>>(base: P) -> PathBuf {
>>>> @@ -445,6 +457,51 @@ impl ChunkStore {
>>>> Ok(())
>>>> }
>>>>
>>>> + /// Check if atime updates are honored by the filesystem backing the chunk store.
>>>> + ///
>>>> +    /// Sets the atime twice: once one second into the past, checking that this was honored, and
>>>> +    /// then back to now, as otherwise the check is not guaranteed to cover `relatime` behavior.
>>>> + /// Uses a 4 MiB fixed size, compressed but unencrypted chunk to test. The chunk is inserted in
>>>> + /// the chunk store if not yet present.
>>>> + pub fn check_fs_atime_update(&self) -> Result<(), Error> {
>>>> + let (zero_chunk, digest) = DataChunkBuilder::build_zero_chunk(None, 4096 * 1024, true)?;
>>>> + self.insert_chunk(&zero_chunk, &digest)?;
>>>> +
>>>> + let past = SystemTime::now().duration_since(UNIX_EPOCH)? - Duration::from_secs(1);
>>>> + let times: [libc::timespec; 2] = [
>>>> + libc::timespec {
>>>> + tv_sec: past.as_secs().try_into()?,
>>>> + tv_nsec: past.subsec_nanos().into(),
>>>> + },
>>>> + libc::timespec {
>>>> + tv_sec: 0,
>>>> + tv_nsec: UTIME_OMIT,
>>>> + },
>>>> + ];
>>>
>>> this sets *atime* one second in the past, while not touching *mtime*.
>>> this is not something we ever need? if this fails, something is weird,
>>> but it wouldn't mean that atime handling was broken for our use case..
>>>
>>>> +
>>>> + use nix::NixPath;
>>>> + let (path, _digest) = self.chunk_path(&digest);
>>>> + path.with_nix_path(|cstr| unsafe {
>>>> +        let tmp = libc::utimensat(-1, cstr.as_ptr(), &times[0], libc::AT_SYMLINK_NOFOLLOW);
>>>> + nix::errno::Errno::result(tmp)
>>>> + })??;
>>>> +
>>>> + let metadata_past = std::fs::metadata(&path).map_err(Error::from)?;
>>>> + let atime_past = metadata_past.accessed()?;
>>>> + if past != atime_past.duration_since(UNIX_EPOCH)? {
>>>> + bail!("setting atime to the past failed, is atime support enabled on datastore backing filesystem?");
>>>> + }
>>>> +
>>>> + self.cond_touch_chunk(&digest, true)?;
>>>> +
>>>> + let metadata_now = std::fs::metadata(&path).map_err(Error::from)?;
>>>> + let atime_now = metadata_now.accessed()?;
>>>> + if atime_past >= atime_now {
>>>> + bail!("atime update check failed, is atime support enabled on datastore backing filesystem?");
>>>> + }
>>>
>>> if something implements explicit timestamp updates to work like relatime
>>> does for read-triggered updates, this would still potentially fail -
>>> relatime covers checking *mtime* (which we haven't updated above!) first
>>> to see if *atime* should be updated when doing reads. it's basically a
>>> performance optimization (skip most atime updates for read-heavy work
>>> loads) without breaking a very specific use case (that of mutt abusing
>>> atime to know when it last read a local mailbox[0], which would break if
>>> atime were never updated on reads).
>>
>> Thanks a lot for the pointer! Need to dig a bit deeper into how relatime
>> works, as my understanding of it is clearly flawed.
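For reference, after reading up on it: my (possibly still flawed) understanding of the relatime update rule, as a minimal Rust sketch. The function name and the 24-hour threshold follow the mount(8) description of relatime, not any PBS code:

```rust
use std::time::{Duration, SystemTime};

/// Decide whether a read would update atime under relatime semantics:
/// the atime is updated if it is not newer than mtime or ctime, or if
/// the current atime is more than 24 hours in the past.
fn relatime_would_update(
    atime: SystemTime,
    mtime: SystemTime,
    ctime: SystemTime,
    now: SystemTime,
) -> bool {
    const DAY: Duration = Duration::from_secs(24 * 60 * 60);
    atime <= mtime
        || atime <= ctime
        || now.duration_since(atime).map_or(false, |age| age > DAY)
}

fn main() {
    let now = SystemTime::now();
    let hour = Duration::from_secs(3600);
    // atime newer than mtime/ctime and less than a day old: no update
    assert!(!relatime_would_update(now - hour, now - 2 * hour, now - 2 * hour, now));
    // mtime newer than atime: a read does update atime
    assert!(relatime_would_update(now - 2 * hour, now - hour, now - 2 * hour, now));
    println!("relatime rule sketch ok");
}
```

If this matches, a read only updates the atime when the atime is not already newer than the last modification, which would explain why setting only the atime into the past does not exercise the relevant path.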
>>
>>> I think what we actually want to do at GC time is:
>>>
>>> (potentially) insert the chunk
>>> stat the chunk
>>> touch the chunk (regularly)
>>> stat the chunk again
>>> compare the timestamps
>>
>> That was my initial approach, which failed because of the timestamp
>> granularity. Since I didn't want to wait in-between the stat and the
>> touch, the approach of first setting the atime into the past came to
>> mind, hence my implementation. But I see that this does not work
>> based on your feedback.
>
> did you actually hit the granularity issue in practice? with which
> storage/file system?
Hmm, I had a datastore on ZFS with atime on, which triggered the check
for me on garbage collection; with an additional delay of 1 second,
however, it worked... Might have been an implementation error though...
>
> because even with relatime or noatime on ext4 a sequence of
>
> stat foobar; touch -a foobar; stat foobar; touch -a foobar; stat foobar
>
> will give me different atimes for each stat. the same is true for ZFS no
> matter which atime settings I give the dataset.
>
>>> we now have three cases:
>>>
>>> A) the old atime is before the new atime => the storage doesn't discard
>>> *all* atime updates and we can proceed
>>> B) the old atime and the new atime are identical and atime is *before*
>>> mtime and ctime => atime should have been updated even under
>>> "relatime-like" semantics, very likely this storage is discarding all
>>> atime updates, we should abort with an error
>>> C) the old atime and the new atime are identical, but atime is *after*
>>> ctime/mtime => we need to check whether we just ran into relatime-like
>>> behaviour
>>>
>>> C means we either abort as well (if we don't want to support such
>>> setups), or we try to detect if setting both mtime and atime works, and
>>> force the GC into doing that
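The three cases could be classified along these lines. A rough sketch only: the enum, the function name and the exact comparisons are my own (hypothetical) choices for illustration, not patch code:

```rust
use std::time::SystemTime;

/// Outcome of the proposed GC-time atime check (naming is mine).
#[derive(Debug, PartialEq)]
enum AtimeCheck {
    /// A: atime advanced, updates are honored.
    Honored,
    /// B: atime unchanged although it was not newer than mtime/ctime,
    /// so even relatime-like semantics should have updated it: broken.
    Broken,
    /// C: atime unchanged, but it was already newer than mtime/ctime,
    /// so this may just be relatime-like behaviour.
    MaybeRelatime,
}

fn classify(
    atime_before: SystemTime,
    atime_after: SystemTime,
    mtime: SystemTime,
    ctime: SystemTime,
) -> AtimeCheck {
    if atime_after > atime_before {
        AtimeCheck::Honored
    } else if atime_before <= mtime || atime_before <= ctime {
        AtimeCheck::Broken
    } else {
        AtimeCheck::MaybeRelatime
    }
}

fn main() {
    use std::time::Duration;
    let t0 = SystemTime::UNIX_EPOCH;
    let s = Duration::from_secs(1);
    assert_eq!(classify(t0, t0 + s, t0, t0), AtimeCheck::Honored);
    assert_eq!(classify(t0, t0, t0 + s, t0), AtimeCheck::Broken);
    assert_eq!(classify(t0 + s, t0 + s, t0, t0), AtimeCheck::MaybeRelatime);
    println!("classification sketch ok");
}
```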
>>>
>>> we might want to add a "wait a second" between the first stat and the
>>> touching to account for filesystems with more granular timestamps
>>> (inserting touches, and other tasks might touch at "the same time" as
>>> well by accident), since the requested timestamps might be modified to
>>> become "the greatest value supported by the filesystem that is not
>>> greater than the specified time" (utimensat(2)), and one second doesn't
> really hurt in the scope of a GC..
>>>
>>> note that the 24h grace period is totally independent of this - it comes
>>> from lazytime and relates to how often changes of timestamps are
>>> *persisted on disk*. that shouldn't affect us for local storages (since
>>> we talk to the kernel and always get the updated view anyway, no matter
>>> if it is in-memory or on-disk. if the PBS system crashes GC is gone as
>>> well and needs to start over, no risk of inconsistency).
>>>
>>> we might have a problem if the storage uses lazytime internally and
>>> lives somewhere else
>>> - GC starts phase1
>>> - updates a few atimes
>>> - storage crashes and loses some of the updates while I/O to it blocks
>>> - storage comes back up
>>> - GC continues not knowing that some of the atime updates have been lost
>>> in the ether
>>> - GC enters phase2
>>> - chunk timestamp makes it eligible for deletion, even though it was
>>> touched in phase1
>>> - data loss
>>>
>>> I don't think we can really prevent this other than by ditching the
>>> whole atime mechanism and instead keep a list of written chunks in each
>>> backup writer, and generate such a list in phase1, merging them all and
>>> using that for phase2 (which is really expensive as it scales with the
>>> number of touched chunks and would require syncing up writers and GC).
>>> thankfully I'd expect most such failure scenarios to actually abort the
>>> GC anyway for I/O error reasons..
>>
>> Maybe one could use `statmount` to determine if the `SB_LAZYTIME` flag
>> in `smbuf.sb_flags` is set? But I assume that you meant the issue is
>> that this is not being exposed by the filesystem...
>
> the local mount having lazytime doesn't matter (if it does, that means
> the local kernel handles the consistency and if the kernel goes
> away/crashes, so does the GC). the issue would be some storage appliance
> using it (or something with similar semantics) internally that would
> cause atime updates to "work" but then disappear later without
> interrupting the GC. I don't see how we can protect against that without
> switching to a different, more expensive mechanism.
Hmm, okay. But at least one should be able to distinguish local
filesystems from NAS by checking the /proc/mounts output, so as not to
add this penalty for users running datastores on local filesystems.
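As a rough sketch of such a heuristic: the filesystem type list below is illustrative and certainly incomplete, and the parsing assumes the usual six-field /proc/mounts line format:

```rust
/// Very rough heuristic: flag mount entries whose filesystem type
/// suggests a network/remote backing store. Illustrative list only.
fn is_network_fs(fstype: &str) -> bool {
    matches!(
        fstype,
        "nfs" | "nfs4" | "cifs" | "smb3" | "ceph" | "fuse.sshfs" | "glusterfs"
    )
}

/// Parse one /proc/mounts line ("device mountpoint fstype options dump pass")
/// and return (mountpoint, fstype) if it is well-formed.
fn parse_mount_line(line: &str) -> Option<(&str, &str)> {
    let mut fields = line.split_whitespace();
    let _device = fields.next()?;
    let mountpoint = fields.next()?;
    let fstype = fields.next()?;
    Some((mountpoint, fstype))
}

fn main() {
    // sample line standing in for a real /proc/self/mounts entry
    let sample = "filer:/export /mnt/datastore nfs4 rw,relatime 0 0";
    let (mountpoint, fstype) = parse_mount_line(sample).unwrap();
    assert_eq!(mountpoint, "/mnt/datastore");
    assert!(is_network_fs(fstype));
    assert!(!is_network_fs("ext4"));
    println!("{mountpoint} is on a network filesystem ({fstype})");
}
```

On a real system one would feed each line of /proc/self/mounts through this; a hardcoded sample line stands in here to keep the sketch self-contained.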
_______________________________________________
pbs-devel mailing list
pbs-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pbs-devel