public inbox for pbs-devel@lists.proxmox.com
* [pbs-devel] [PATCH proxmox-backup v2 0/4] improve compression throughput
@ 2024-07-31  9:36 Dominik Csapak
  2024-07-31  9:36 ` [pbs-devel] [PATCH proxmox-backup v2 1/4] remove data blob writer Dominik Csapak
                   ` (4 more replies)
  0 siblings, 5 replies; 14+ messages in thread
From: Dominik Csapak @ 2024-07-31  9:36 UTC (permalink / raw)
  To: pbs-devel

this series replaces my previous two patches [0]

in my tests (against current master) it improved the throughput if
the source/target storage is fast enough (tmpfs -> tmpfs):

Type                master (MiB/s)   with my patches (MiB/s)
.img file           ~614             ~767
pxar one big file   ~657             ~807
pxar small files    ~576             ~627

It would be great if someone else could cross-check my results here.
Note: the pxar code being faster than the .img code seems to stem
from better multithreaded pipelining in that code or in tokio (the pxar
codepath scales more directly with more cores than the .img codepath)

changes from v1:
* reorder patches so that the data blob writer removal is the first one
* add tests for DataBlob that we can decode what we encoded
  (to see that my patches don't mess up the chunk generation)
* add new patch to cleanup the `encode` function a bit

Dominik Csapak (4):
  remove data blob writer
  datastore: test DataBlob encode/decode roundtrip
  datastore: data blob: increase compression throughput
  datastore: DataBlob encode: simplify code

 pbs-datastore/src/data_blob.rs        | 158 ++++++++++++-------
 pbs-datastore/src/data_blob_writer.rs | 212 --------------------------
 pbs-datastore/src/lib.rs              |   2 -
 tests/blob_writer.rs                  | 105 -------------
 4 files changed, 107 insertions(+), 370 deletions(-)
 delete mode 100644 pbs-datastore/src/data_blob_writer.rs
 delete mode 100644 tests/blob_writer.rs

-- 
2.39.2



* [pbs-devel] [PATCH proxmox-backup v2 1/4] remove data blob writer
  2024-07-31  9:36 [pbs-devel] [PATCH proxmox-backup v2 0/4] improve compression throughput Dominik Csapak
@ 2024-07-31  9:36 ` Dominik Csapak
  2024-07-31  9:36 ` [pbs-devel] [PATCH proxmox-backup v2 2/4] datastore: test DataBlob encode/decode roundtrip Dominik Csapak
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 14+ messages in thread
From: Dominik Csapak @ 2024-07-31  9:36 UTC (permalink / raw)
  To: pbs-devel

this is leftover code that is not currently used outside of tests.
Should we need it again, we can just revert this commit.

Signed-off-by: Dominik Csapak <d.csapak@proxmox.com>
---
 pbs-datastore/src/data_blob_writer.rs | 212 --------------------------
 pbs-datastore/src/lib.rs              |   2 -
 tests/blob_writer.rs                  | 105 -------------
 3 files changed, 319 deletions(-)
 delete mode 100644 pbs-datastore/src/data_blob_writer.rs
 delete mode 100644 tests/blob_writer.rs

diff --git a/pbs-datastore/src/data_blob_writer.rs b/pbs-datastore/src/data_blob_writer.rs
deleted file mode 100644
index 30d9645f..00000000
--- a/pbs-datastore/src/data_blob_writer.rs
+++ /dev/null
@@ -1,212 +0,0 @@
-use std::io::{Seek, SeekFrom, Write};
-use std::sync::Arc;
-
-use anyhow::Error;
-
-use proxmox_io::WriteExt;
-
-use pbs_tools::crypt_config::CryptConfig;
-
-use crate::checksum_writer::ChecksumWriter;
-use crate::crypt_writer::CryptWriter;
-use crate::file_formats::{self, DataBlobHeader, EncryptedDataBlobHeader};
-
-enum BlobWriterState<'writer, W: Write> {
-    Uncompressed {
-        csum_writer: ChecksumWriter<W>,
-    },
-    Compressed {
-        compr: zstd::stream::write::Encoder<'writer, ChecksumWriter<W>>,
-    },
-    Encrypted {
-        crypt_writer: CryptWriter<ChecksumWriter<W>>,
-    },
-    EncryptedCompressed {
-        compr: zstd::stream::write::Encoder<'writer, CryptWriter<ChecksumWriter<W>>>,
-    },
-}
-
-/// Data blob writer
-pub struct DataBlobWriter<'writer, W: Write> {
-    state: BlobWriterState<'writer, W>,
-}
-
-impl<W: Write + Seek> DataBlobWriter<'_, W> {
-    pub fn new_uncompressed(mut writer: W) -> Result<Self, Error> {
-        writer.seek(SeekFrom::Start(0))?;
-        let head = DataBlobHeader {
-            magic: file_formats::UNCOMPRESSED_BLOB_MAGIC_1_0,
-            crc: [0; 4],
-        };
-        unsafe {
-            writer.write_le_value(head)?;
-        }
-        let csum_writer = ChecksumWriter::new(writer, None);
-        Ok(Self {
-            state: BlobWriterState::Uncompressed { csum_writer },
-        })
-    }
-
-    pub fn new_compressed(mut writer: W) -> Result<Self, Error> {
-        writer.seek(SeekFrom::Start(0))?;
-        let head = DataBlobHeader {
-            magic: file_formats::COMPRESSED_BLOB_MAGIC_1_0,
-            crc: [0; 4],
-        };
-        unsafe {
-            writer.write_le_value(head)?;
-        }
-        let csum_writer = ChecksumWriter::new(writer, None);
-        let compr = zstd::stream::write::Encoder::new(csum_writer, 1)?;
-        Ok(Self {
-            state: BlobWriterState::Compressed { compr },
-        })
-    }
-
-    pub fn new_encrypted(mut writer: W, config: Arc<CryptConfig>) -> Result<Self, Error> {
-        writer.seek(SeekFrom::Start(0))?;
-        let head = EncryptedDataBlobHeader {
-            head: DataBlobHeader {
-                magic: file_formats::ENCRYPTED_BLOB_MAGIC_1_0,
-                crc: [0; 4],
-            },
-            iv: [0u8; 16],
-            tag: [0u8; 16],
-        };
-        unsafe {
-            writer.write_le_value(head)?;
-        }
-
-        let csum_writer = ChecksumWriter::new(writer, None);
-        let crypt_writer = CryptWriter::new(csum_writer, config)?;
-        Ok(Self {
-            state: BlobWriterState::Encrypted { crypt_writer },
-        })
-    }
-
-    pub fn new_encrypted_compressed(
-        mut writer: W,
-        config: Arc<CryptConfig>,
-    ) -> Result<Self, Error> {
-        writer.seek(SeekFrom::Start(0))?;
-        let head = EncryptedDataBlobHeader {
-            head: DataBlobHeader {
-                magic: file_formats::ENCR_COMPR_BLOB_MAGIC_1_0,
-                crc: [0; 4],
-            },
-            iv: [0u8; 16],
-            tag: [0u8; 16],
-        };
-        unsafe {
-            writer.write_le_value(head)?;
-        }
-
-        let csum_writer = ChecksumWriter::new(writer, None);
-        let crypt_writer = CryptWriter::new(csum_writer, config)?;
-        let compr = zstd::stream::write::Encoder::new(crypt_writer, 1)?;
-        Ok(Self {
-            state: BlobWriterState::EncryptedCompressed { compr },
-        })
-    }
-
-    pub fn finish(self) -> Result<W, Error> {
-        match self.state {
-            BlobWriterState::Uncompressed { csum_writer } => {
-                // write CRC
-                let (mut writer, crc, _) = csum_writer.finish()?;
-                let head = DataBlobHeader {
-                    magic: file_formats::UNCOMPRESSED_BLOB_MAGIC_1_0,
-                    crc: crc.to_le_bytes(),
-                };
-
-                writer.seek(SeekFrom::Start(0))?;
-                unsafe {
-                    writer.write_le_value(head)?;
-                }
-
-                Ok(writer)
-            }
-            BlobWriterState::Compressed { compr } => {
-                let csum_writer = compr.finish()?;
-                let (mut writer, crc, _) = csum_writer.finish()?;
-
-                let head = DataBlobHeader {
-                    magic: file_formats::COMPRESSED_BLOB_MAGIC_1_0,
-                    crc: crc.to_le_bytes(),
-                };
-
-                writer.seek(SeekFrom::Start(0))?;
-                unsafe {
-                    writer.write_le_value(head)?;
-                }
-
-                Ok(writer)
-            }
-            BlobWriterState::Encrypted { crypt_writer } => {
-                let (csum_writer, iv, tag) = crypt_writer.finish()?;
-                let (mut writer, crc, _) = csum_writer.finish()?;
-
-                let head = EncryptedDataBlobHeader {
-                    head: DataBlobHeader {
-                        magic: file_formats::ENCRYPTED_BLOB_MAGIC_1_0,
-                        crc: crc.to_le_bytes(),
-                    },
-                    iv,
-                    tag,
-                };
-                writer.seek(SeekFrom::Start(0))?;
-                unsafe {
-                    writer.write_le_value(head)?;
-                }
-                Ok(writer)
-            }
-            BlobWriterState::EncryptedCompressed { compr } => {
-                let crypt_writer = compr.finish()?;
-                let (csum_writer, iv, tag) = crypt_writer.finish()?;
-                let (mut writer, crc, _) = csum_writer.finish()?;
-
-                let head = EncryptedDataBlobHeader {
-                    head: DataBlobHeader {
-                        magic: file_formats::ENCR_COMPR_BLOB_MAGIC_1_0,
-                        crc: crc.to_le_bytes(),
-                    },
-                    iv,
-                    tag,
-                };
-                writer.seek(SeekFrom::Start(0))?;
-                unsafe {
-                    writer.write_le_value(head)?;
-                }
-                Ok(writer)
-            }
-        }
-    }
-}
-
-impl<W: Write + Seek> Write for DataBlobWriter<'_, W> {
-    fn write(&mut self, buf: &[u8]) -> Result<usize, std::io::Error> {
-        match self.state {
-            BlobWriterState::Uncompressed {
-                ref mut csum_writer,
-            } => csum_writer.write(buf),
-            BlobWriterState::Compressed { ref mut compr } => compr.write(buf),
-            BlobWriterState::Encrypted {
-                ref mut crypt_writer,
-            } => crypt_writer.write(buf),
-            BlobWriterState::EncryptedCompressed { ref mut compr } => compr.write(buf),
-        }
-    }
-
-    fn flush(&mut self) -> Result<(), std::io::Error> {
-        match self.state {
-            BlobWriterState::Uncompressed {
-                ref mut csum_writer,
-            } => csum_writer.flush(),
-            BlobWriterState::Compressed { ref mut compr } => compr.flush(),
-            BlobWriterState::Encrypted {
-                ref mut crypt_writer,
-            } => crypt_writer.flush(),
-            BlobWriterState::EncryptedCompressed { ref mut compr } => compr.flush(),
-        }
-    }
-}
diff --git a/pbs-datastore/src/lib.rs b/pbs-datastore/src/lib.rs
index 3e4aa34c..202b0955 100644
--- a/pbs-datastore/src/lib.rs
+++ b/pbs-datastore/src/lib.rs
@@ -179,7 +179,6 @@ pub mod crypt_reader;
 pub mod crypt_writer;
 pub mod data_blob;
 pub mod data_blob_reader;
-pub mod data_blob_writer;
 pub mod file_formats;
 pub mod index;
 pub mod manifest;
@@ -201,7 +200,6 @@ pub use crypt_reader::CryptReader;
 pub use crypt_writer::CryptWriter;
 pub use data_blob::DataBlob;
 pub use data_blob_reader::DataBlobReader;
-pub use data_blob_writer::DataBlobWriter;
 pub use manifest::BackupManifest;
 pub use store_progress::StoreProgress;
 
diff --git a/tests/blob_writer.rs b/tests/blob_writer.rs
deleted file mode 100644
index 23a3283d..00000000
--- a/tests/blob_writer.rs
+++ /dev/null
@@ -1,105 +0,0 @@
-use std::io::Cursor;
-use std::io::{Read, Seek, SeekFrom, Write};
-use std::sync::Arc;
-
-use anyhow::{bail, Error};
-use lazy_static::lazy_static;
-
-use pbs_datastore::{DataBlob, DataBlobReader, DataBlobWriter};
-use pbs_tools::crypt_config::CryptConfig;
-
-lazy_static! {
-    static ref TEST_DATA: Vec<u8> = {
-        let mut data = Vec::new();
-
-        for i in 0..100_000 {
-            data.push((i % 255) as u8);
-        }
-
-        data
-    };
-    static ref CRYPT_CONFIG: Arc<CryptConfig> = {
-        let key = [1u8; 32];
-        Arc::new(CryptConfig::new(key).unwrap())
-    };
-    static ref TEST_DIGEST_PLAIN: [u8; 32] = [
-        83, 154, 96, 195, 167, 204, 38, 142, 204, 224, 130, 201, 24, 71, 2, 188, 130, 155, 177, 6,
-        162, 100, 61, 238, 38, 219, 63, 240, 191, 132, 87, 238
-    ];
-    static ref TEST_DIGEST_ENC: [u8; 32] = [
-        50, 162, 191, 93, 255, 132, 9, 14, 127, 23, 92, 39, 246, 102, 245, 204, 130, 104, 4, 106,
-        182, 239, 218, 14, 80, 17, 150, 188, 239, 253, 198, 117
-    ];
-}
-
-fn verify_test_blob(mut cursor: Cursor<Vec<u8>>, digest: &[u8; 32]) -> Result<(), Error> {
-    // run read tests with different buffer sizes
-    for size in [1, 3, 64 * 1024].iter() {
-        println!("Starting DataBlobReader test (size = {})", size);
-
-        cursor.seek(SeekFrom::Start(0))?;
-        let mut reader = DataBlobReader::new(&mut cursor, Some(CRYPT_CONFIG.clone()))?;
-        let mut buffer = Vec::<u8>::new();
-        // read the whole file
-        //reader.read_to_end(&mut buffer)?;
-        let mut buf = vec![0u8; *size];
-        loop {
-            let count = reader.read(&mut buf)?;
-            if count == 0 {
-                break;
-            }
-            buffer.extend(&buf[..count]);
-        }
-
-        reader.finish()?;
-        if buffer != *TEST_DATA {
-            bail!("blob data is wrong (read buffer size {})", size);
-        }
-    }
-
-    let raw_data = cursor.into_inner();
-
-    let blob = DataBlob::load_from_reader(&mut &raw_data[..])?;
-
-    let data = blob.decode(Some(&CRYPT_CONFIG), Some(digest))?;
-    if data != *TEST_DATA {
-        bail!("blob data is wrong (decode)");
-    }
-    Ok(())
-}
-
-#[test]
-fn test_uncompressed_blob_writer() -> Result<(), Error> {
-    let tmp = Cursor::new(Vec::<u8>::new());
-    let mut blob_writer = DataBlobWriter::new_uncompressed(tmp)?;
-    blob_writer.write_all(&TEST_DATA)?;
-
-    verify_test_blob(blob_writer.finish()?, &TEST_DIGEST_PLAIN)
-}
-
-#[test]
-fn test_compressed_blob_writer() -> Result<(), Error> {
-    let tmp = Cursor::new(Vec::<u8>::new());
-    let mut blob_writer = DataBlobWriter::new_compressed(tmp)?;
-    blob_writer.write_all(&TEST_DATA)?;
-
-    verify_test_blob(blob_writer.finish()?, &TEST_DIGEST_PLAIN)
-}
-
-#[test]
-fn test_encrypted_blob_writer() -> Result<(), Error> {
-    let tmp = Cursor::new(Vec::<u8>::new());
-    let mut blob_writer = DataBlobWriter::new_encrypted(tmp, CRYPT_CONFIG.clone())?;
-    blob_writer.write_all(&TEST_DATA)?;
-
-    verify_test_blob(blob_writer.finish()?, &TEST_DIGEST_ENC)
-}
-
-#[test]
-fn test_encrypted_compressed_blob_writer() -> Result<(), Error> {
-    let tmp = Cursor::new(Vec::<u8>::new());
-    let mut blob_writer = DataBlobWriter::new_encrypted_compressed(tmp, CRYPT_CONFIG.clone())?;
-    blob_writer.write_all(&TEST_DATA)?;
-
-    verify_test_blob(blob_writer.finish()?, &TEST_DIGEST_ENC)
-}
-- 
2.39.2



* [pbs-devel] [PATCH proxmox-backup v2 2/4] datastore: test DataBlob encode/decode roundtrip
  2024-07-31  9:36 [pbs-devel] [PATCH proxmox-backup v2 0/4] improve compression throughput Dominik Csapak
  2024-07-31  9:36 ` [pbs-devel] [PATCH proxmox-backup v2 1/4] remove data blob writer Dominik Csapak
@ 2024-07-31  9:36 ` Dominik Csapak
  2024-07-31  9:47   ` Lukas Wagner
  2024-07-31  9:36 ` [pbs-devel] [PATCH proxmox-backup v2 3/4] datastore: data blob: increase compression throughput Dominik Csapak
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 14+ messages in thread
From: Dominik Csapak @ 2024-07-31  9:36 UTC (permalink / raw)
  To: pbs-devel

so that we can be sure we can decode an encoded blob again

Signed-off-by: Dominik Csapak <d.csapak@proxmox.com>
---
new in v2
 pbs-datastore/src/data_blob.rs | 66 ++++++++++++++++++++++++++++++++++
 1 file changed, 66 insertions(+)

diff --git a/pbs-datastore/src/data_blob.rs b/pbs-datastore/src/data_blob.rs
index a7a55fb7..8715afef 100644
--- a/pbs-datastore/src/data_blob.rs
+++ b/pbs-datastore/src/data_blob.rs
@@ -562,3 +562,69 @@ impl<'a, 'b> DataChunkBuilder<'a, 'b> {
         chunk_builder.build()
     }
 }
+
+#[cfg(test)]
+mod test {
+    use pbs_tools::crypt_config::CryptConfig;
+
+    use super::DataChunkBuilder;
+
+    const TEST_DATA_LEN: usize = 50;
+
+    #[test]
+    fn test_data_blob_builder() {
+        let mut data = Vec::with_capacity(TEST_DATA_LEN);
+        for i in 0..TEST_DATA_LEN / 10 {
+            for _ in 0..10 {
+                data.push(i as u8);
+            }
+        }
+
+        // unencrypted, uncompressed
+        let (chunk, digest) = DataChunkBuilder::new(&data)
+            .compress(false)
+            .build()
+            .expect("could not create unencrypted, uncompressed chunk");
+
+        let data_decoded = chunk
+            .decode(None, Some(&digest))
+            .expect("cannot decode unencrypted, uncompressed chunk");
+        assert_eq!(data, data_decoded);
+
+        // unencrypted, compressed
+        let (chunk, digest) = DataChunkBuilder::new(&data)
+            .compress(true)
+            .build()
+            .expect("could not create unencrypted, compressed chunk");
+
+        let data_decoded = chunk
+            .decode(None, Some(&digest))
+            .expect("cannot decode unencrypted, compressed chunk");
+        assert_eq!(data, data_decoded);
+
+        // encrypted, uncompressed
+        let crypt_config = CryptConfig::new([9; 32]).expect("could not create crypt config");
+        let (chunk, digest) = DataChunkBuilder::new(&data)
+            .compress(false)
+            .crypt_config(&crypt_config)
+            .build()
+            .expect("could not create encrypted, uncompressed chunk");
+
+        let data_decoded = chunk
+            .decode(Some(&crypt_config), Some(&digest))
+            .expect("cannot decode encrypted, uncompressed chunk");
+        assert_eq!(data, data_decoded);
+
+        // encrypted, compressed
+        let (chunk, digest) = DataChunkBuilder::new(&data)
+            .compress(true)
+            .crypt_config(&crypt_config)
+            .build()
+            .expect("could not create encrypted, compressed chunk");
+
+        let data_decoded = chunk
+            .decode(Some(&crypt_config), Some(&digest))
+            .expect("cannot decode encrypted, compressed chunk");
+        assert_eq!(data, data_decoded);
+    }
+}
-- 
2.39.2



* [pbs-devel] [PATCH proxmox-backup v2 3/4] datastore: data blob: increase compression throughput
  2024-07-31  9:36 [pbs-devel] [PATCH proxmox-backup v2 0/4] improve compression throughput Dominik Csapak
  2024-07-31  9:36 ` [pbs-devel] [PATCH proxmox-backup v2 1/4] remove data blob writer Dominik Csapak
  2024-07-31  9:36 ` [pbs-devel] [PATCH proxmox-backup v2 2/4] datastore: test DataBlob encode/decode roundtrip Dominik Csapak
@ 2024-07-31  9:36 ` Dominik Csapak
  2024-07-31 14:39   ` Thomas Lamprecht
  2024-07-31  9:36 ` [pbs-devel] [PATCH proxmox-backup v2 4/4] datastore: DataBlob encode: simplify code Dominik Csapak
  2024-08-05  9:33 ` [pbs-devel] [PATCH proxmox-backup v2 0/4] improve compression throughput Dominik Csapak
  4 siblings, 1 reply; 14+ messages in thread
From: Dominik Csapak @ 2024-07-31  9:36 UTC (permalink / raw)
  To: pbs-devel

by not using `zstd::stream::copy_encode`, because that has an allocation
pattern that reduces throughput if the target/source storage and the
network are faster than the chunk creation.

instead use `zstd::bulk::compress_to_buffer` which shouldn't do any big
allocations, since we provide the target buffer.

To handle the case that the target buffer is too small, we now ignore
all zstd errors and continue with the uncompressed data, logging the
error unless the target buffer is too small.

For now, we have to parse the error string for that, as `zstd` maps all
errors to `io::ErrorKind::Other`. Until that gets changed, there is no
other way to differentiate between the different kinds of errors.

Signed-off-by: Dominik Csapak <d.csapak@proxmox.com>
---
changes from v1:
* fixed commit message
* reduced log severity to `warn`
* use vec![0; size]
* omit unnecessary buffer allocation in the unencrypted,uncompressed case
  by reusing the initial buffer that was tried for compression
 pbs-datastore/src/data_blob.rs | 37 +++++++++++++++++++---------------
 1 file changed, 21 insertions(+), 16 deletions(-)

diff --git a/pbs-datastore/src/data_blob.rs b/pbs-datastore/src/data_blob.rs
index 8715afef..2a528204 100644
--- a/pbs-datastore/src/data_blob.rs
+++ b/pbs-datastore/src/data_blob.rs
@@ -136,39 +136,44 @@ impl DataBlob {
 
             DataBlob { raw_data }
         } else {
-            let max_data_len = data.len() + std::mem::size_of::<DataBlobHeader>();
+            let header_len = std::mem::size_of::<DataBlobHeader>();
+            let max_data_len = data.len() + header_len;
+            let mut raw_data = vec![0; max_data_len];
             if compress {
-                let mut comp_data = Vec::with_capacity(max_data_len);
-
                 let head = DataBlobHeader {
                     magic: COMPRESSED_BLOB_MAGIC_1_0,
                     crc: [0; 4],
                 };
                 unsafe {
-                    comp_data.write_le_value(head)?;
+                    (&mut raw_data[0..header_len]).write_le_value(head)?;
                 }
 
-                zstd::stream::copy_encode(data, &mut comp_data, 1)?;
-
-                if comp_data.len() < max_data_len {
-                    let mut blob = DataBlob {
-                        raw_data: comp_data,
-                    };
-                    blob.set_crc(blob.compute_crc());
-                    return Ok(blob);
+                match zstd::bulk::compress_to_buffer(data, &mut raw_data[header_len..], 1) {
+                    Ok(size) if size <= data.len() => {
+                        raw_data.truncate(header_len + size);
+                        let mut blob = DataBlob { raw_data };
+                        blob.set_crc(blob.compute_crc());
+                        return Ok(blob);
+                    }
+                    // if size is bigger than the data, or any error is returned, continue with non
+                    // compressed archive but log all errors beside buffer too small
+                    Ok(_) => {}
+                    Err(err) => {
+                        if !err.to_string().contains("Destination buffer is too small") {
+                            log::warn!("zstd compression error: {err}");
+                        }
+                    }
                 }
             }
 
-            let mut raw_data = Vec::with_capacity(max_data_len);
-
             let head = DataBlobHeader {
                 magic: UNCOMPRESSED_BLOB_MAGIC_1_0,
                 crc: [0; 4],
             };
             unsafe {
-                raw_data.write_le_value(head)?;
+                (&mut raw_data[0..header_len]).write_le_value(head)?;
             }
-            raw_data.extend_from_slice(data);
+            (&mut raw_data[header_len..]).write_all(data)?;
 
             DataBlob { raw_data }
         };
-- 
2.39.2



* [pbs-devel] [PATCH proxmox-backup v2 4/4] datastore: DataBlob encode: simplify code
  2024-07-31  9:36 [pbs-devel] [PATCH proxmox-backup v2 0/4] improve compression throughput Dominik Csapak
                   ` (2 preceding siblings ...)
  2024-07-31  9:36 ` [pbs-devel] [PATCH proxmox-backup v2 3/4] datastore: data blob: increase compression throughput Dominik Csapak
@ 2024-07-31  9:36 ` Dominik Csapak
  2024-08-05  9:33 ` [pbs-devel] [PATCH proxmox-backup v2 0/4] improve compression throughput Dominik Csapak
  4 siblings, 0 replies; 14+ messages in thread
From: Dominik Csapak @ 2024-07-31  9:36 UTC (permalink / raw)
  To: pbs-devel

by combining the compression call for both the encrypted and unencrypted
paths and deciding on the header magic in a single place.

No functional changes intended, besides reusing the same buffer for
compression.

Signed-off-by: Dominik Csapak <d.csapak@proxmox.com>
---
new in v2
probably better to review the old and new code side by side, instead
of trying to decode the diff...

 pbs-datastore/src/data_blob.rs | 97 ++++++++++++++--------------------
 1 file changed, 41 insertions(+), 56 deletions(-)

diff --git a/pbs-datastore/src/data_blob.rs b/pbs-datastore/src/data_blob.rs
index 2a528204..4411bc08 100644
--- a/pbs-datastore/src/data_blob.rs
+++ b/pbs-datastore/src/data_blob.rs
@@ -93,28 +93,46 @@ impl DataBlob {
             bail!("data blob too large ({} bytes).", data.len());
         }
 
-        let mut blob = if let Some(config) = config {
-            let compr_data;
-            let (_compress, data, magic) = if compress {
-                compr_data = zstd::bulk::compress(data, 1)?;
-                // Note: We only use compression if result is shorter
-                if compr_data.len() < data.len() {
-                    (true, &compr_data[..], ENCR_COMPR_BLOB_MAGIC_1_0)
-                } else {
-                    (false, data, ENCRYPTED_BLOB_MAGIC_1_0)
+        let header_len = if config.is_some() {
+            std::mem::size_of::<EncryptedDataBlobHeader>()
+        } else {
+            std::mem::size_of::<DataBlobHeader>()
+        };
+
+        let mut compressed = false;
+        let mut data_compressed = vec![0u8; header_len + data.len()];
+        if compress {
+            match zstd::bulk::compress_to_buffer(data, &mut data_compressed[header_len..], 1) {
+                Ok(size) if size <= data.len() => {
+                    data_compressed.truncate(header_len + size);
+                    compressed = true;
                 }
-            } else {
-                (false, data, ENCRYPTED_BLOB_MAGIC_1_0)
-            };
+                // if size is bigger than the data, or any error is returned, continue with non
+                // compressed archive but log all errors beside buffer too small
+                Ok(_) => {}
+                Err(err) => {
+                    if !err.to_string().contains("Destination buffer is too small") {
+                        log::warn!("zstd compression error: {err}");
+                    }
+                }
+            }
+        }
 
-            let header_len = std::mem::size_of::<EncryptedDataBlobHeader>();
+        let (magic, encryption_source) = match (compressed, config.is_some()) {
+            (true, true) => (ENCR_COMPR_BLOB_MAGIC_1_0, &data_compressed[header_len..]),
+            (true, false) => (COMPRESSED_BLOB_MAGIC_1_0, &data_compressed[header_len..]),
+            (false, true) => (ENCRYPTED_BLOB_MAGIC_1_0, data),
+            (false, false) => {
+                (&mut data_compressed[header_len..]).write_all(data)?;
+                (UNCOMPRESSED_BLOB_MAGIC_1_0, data)
+            }
+        };
+
+        let raw_data = if let Some(config) = config {
             let mut raw_data = Vec::with_capacity(data.len() + header_len);
 
             let dummy_head = EncryptedDataBlobHeader {
-                head: DataBlobHeader {
-                    magic: [0u8; 8],
-                    crc: [0; 4],
-                },
+                head: DataBlobHeader { magic, crc: [0; 4] },
                 iv: [0u8; 16],
                 tag: [0u8; 16],
             };
@@ -122,7 +140,7 @@ impl DataBlob {
                 raw_data.write_le_value(dummy_head)?;
             }
 
-            let (iv, tag) = Self::encrypt_to(config, data, &mut raw_data)?;
+            let (iv, tag) = Self::encrypt_to(config, encryption_source, &mut raw_data)?;
 
             let head = EncryptedDataBlobHeader {
                 head: DataBlobHeader { magic, crc: [0; 4] },
@@ -134,50 +152,17 @@ impl DataBlob {
                 (&mut raw_data[0..header_len]).write_le_value(head)?;
             }
 
-            DataBlob { raw_data }
+            raw_data
         } else {
-            let header_len = std::mem::size_of::<DataBlobHeader>();
-            let max_data_len = data.len() + header_len;
-            let mut raw_data = vec![0; max_data_len];
-            if compress {
-                let head = DataBlobHeader {
-                    magic: COMPRESSED_BLOB_MAGIC_1_0,
-                    crc: [0; 4],
-                };
-                unsafe {
-                    (&mut raw_data[0..header_len]).write_le_value(head)?;
-                }
-
-                match zstd::bulk::compress_to_buffer(data, &mut raw_data[header_len..], 1) {
-                    Ok(size) if size <= data.len() => {
-                        raw_data.truncate(header_len + size);
-                        let mut blob = DataBlob { raw_data };
-                        blob.set_crc(blob.compute_crc());
-                        return Ok(blob);
-                    }
-                    // if size is bigger than the data, or any error is returned, continue with non
-                    // compressed archive but log all errors beside buffer too small
-                    Ok(_) => {}
-                    Err(err) => {
-                        if !err.to_string().contains("Destination buffer is too small") {
-                            log::warn!("zstd compression error: {err}");
-                        }
-                    }
-                }
-            }
-
-            let head = DataBlobHeader {
-                magic: UNCOMPRESSED_BLOB_MAGIC_1_0,
-                crc: [0; 4],
-            };
+            let head = DataBlobHeader { magic, crc: [0; 4] };
             unsafe {
-                (&mut raw_data[0..header_len]).write_le_value(head)?;
+                (&mut data_compressed[0..header_len]).write_le_value(head)?;
             }
-            (&mut raw_data[header_len..]).write_all(data)?;
 
-            DataBlob { raw_data }
+            data_compressed
         };
 
+        let mut blob = DataBlob { raw_data };
         blob.set_crc(blob.compute_crc());
 
         Ok(blob)
-- 
2.39.2



* Re: [pbs-devel] [PATCH proxmox-backup v2 2/4] datastore: test DataBlob encode/decode roundtrip
  2024-07-31  9:36 ` [pbs-devel] [PATCH proxmox-backup v2 2/4] datastore: test DataBlob encode/decode roundtrip Dominik Csapak
@ 2024-07-31  9:47   ` Lukas Wagner
  2024-07-31  9:50     ` Dominik Csapak
  0 siblings, 1 reply; 14+ messages in thread
From: Lukas Wagner @ 2024-07-31  9:47 UTC (permalink / raw)
  To: Proxmox Backup Server development discussion, Dominik Csapak

On  2024-07-31 11:36, Dominik Csapak wrote:
> so that we can be sure we can decode an encoded blob again
> 
> Signed-off-by: Dominik Csapak <d.csapak@proxmox.com>
> ---
> new in v2
>  pbs-datastore/src/data_blob.rs | 66 ++++++++++++++++++++++++++++++++++
>  1 file changed, 66 insertions(+)
> 
> diff --git a/pbs-datastore/src/data_blob.rs b/pbs-datastore/src/data_blob.rs
> index a7a55fb7..8715afef 100644
> --- a/pbs-datastore/src/data_blob.rs
> +++ b/pbs-datastore/src/data_blob.rs
> @@ -562,3 +562,69 @@ impl<'a, 'b> DataChunkBuilder<'a, 'b> {
>          chunk_builder.build()
>      }
>  }
> +
> +#[cfg(test)]
> +mod test {
> +    use pbs_tools::crypt_config::CryptConfig;
> +
> +    use super::DataChunkBuilder;
> +
> +    const TEST_DATA_LEN: usize = 50;
> +
> +    #[test]
> +    fn test_data_blob_builder() {
> +        let mut data = Vec::with_capacity(TEST_DATA_LEN);
> +        for i in 0..TEST_DATA_LEN / 10 {
> +            for _ in 0..10 {
> +                data.push(i as u8);
> +            }
> +        }
> +
> +        // unencrypted, uncompressed
> +        let (chunk, digest) = DataChunkBuilder::new(&data)
> +            .compress(false)
> +            .build()
> +            .expect("could not create unencrypted, uncompressed chunk");
> +
> +        let data_decoded = chunk
> +            .decode(None, Some(&digest))
> +            .expect("cannot decode unencrypted, uncompressed chunk");
> +        assert_eq!(data, data_decoded);
> +
> +        // unencrypted, compressed
> +        let (chunk, digest) = DataChunkBuilder::new(&data)
> +            .compress(true)
> +            .build()
> +            .expect("could not create unencrypted, compressed chunk");
> +
> +        let data_decoded = chunk
> +            .decode(None, Some(&digest))
> +            .expect("cannot decode unencrypted, compressed chunk");
> +        assert_eq!(data, data_decoded);
> +
> +        // encrypted, uncompressed
> +        let crypt_config = CryptConfig::new([9; 32]).expect("could not create crypt config");
> +        let (chunk, digest) = DataChunkBuilder::new(&data)
> +            .compress(false)
> +            .crypt_config(&crypt_config)
> +            .build()
> +            .expect("could not create encrypted, uncompressed chunk");
> +
> +        let data_decoded = chunk
> +            .decode(Some(&crypt_config), Some(&digest))
> +            .expect("cannot decode encrypted, uncompressed chunk");
> +        assert_eq!(data, data_decoded);
> +
> +        // encrypted, compressed
> +        let (chunk, digest) = DataChunkBuilder::new(&data)
> +            .compress(true)
> +            .crypt_config(&crypt_config)
> +            .build()
> +            .expect("could not create encrypted, compressed chunk");
> +
> +        let data_decoded = chunk
> +            .decode(Some(&crypt_config), Some(&digest))
> +            .expect("cannot decode encrypted, compressed chunk");
> +        assert_eq!(data, data_decoded);
> +    }
> +}

IMO it would be nicer to move the (sub) testcases 
(e.g. 'encrypted, compressed', 'encrypted, uncompressed') to individual test cases,
transforming the info from the comments into the test name, for instance:

  #[test]
  fn test_encrypted_uncompressed() { ... }

(the fact that you test the data_blob module is clear from the full test name, which includes
the module prefix - so I think you can drop it from the test name)

IMO that is quite nice when at some point we end up with a failing assertion, because
you see exactly from the name of the failing test what scenario did not work.
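
Fleshed out, one of those cases could look roughly like this (untested
sketch; `test_data` is a made-up helper holding the shared setup from
your test):

  fn test_data() -> Vec<u8> {
      let mut data = Vec::with_capacity(TEST_DATA_LEN);
      for i in 0..TEST_DATA_LEN / 10 {
          for _ in 0..10 {
              data.push(i as u8);
          }
      }
      data
  }

  #[test]
  fn test_encrypted_uncompressed() {
      let data = test_data();
      let crypt_config = CryptConfig::new([9; 32]).expect("could not create crypt config");
      let (chunk, digest) = DataChunkBuilder::new(&data)
          .compress(false)
          .crypt_config(&crypt_config)
          .build()
          .expect("could not create encrypted, uncompressed chunk");
      let data_decoded = chunk
          .decode(Some(&crypt_config), Some(&digest))
          .expect("cannot decode encrypted, uncompressed chunk");
      assert_eq!(data, data_decoded);
  }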

What's your take on this?

-- 
- Lukas


* Re: [pbs-devel] [PATCH proxmox-backup v2 2/4] datastore: test DataBlob encode/decode roundtrip
  2024-07-31  9:47   ` Lukas Wagner
@ 2024-07-31  9:50     ` Dominik Csapak
  0 siblings, 0 replies; 14+ messages in thread
From: Dominik Csapak @ 2024-07-31  9:50 UTC (permalink / raw)
  To: Lukas Wagner, Proxmox Backup Server development discussion

On 7/31/24 11:47, Lukas Wagner wrote:
> On  2024-07-31 11:36, Dominik Csapak wrote:
>> so that we can be sure we can decode an encoded blob again
>>
>> Signed-off-by: Dominik Csapak <d.csapak@proxmox.com>
>> ---
>> new in v2
>>   pbs-datastore/src/data_blob.rs | 66 ++++++++++++++++++++++++++++++++++
>>   1 file changed, 66 insertions(+)
>>
>> diff --git a/pbs-datastore/src/data_blob.rs b/pbs-datastore/src/data_blob.rs
>> index a7a55fb7..8715afef 100644
>> --- a/pbs-datastore/src/data_blob.rs
>> +++ b/pbs-datastore/src/data_blob.rs
>> @@ -562,3 +562,69 @@ impl<'a, 'b> DataChunkBuilder<'a, 'b> {
>>           chunk_builder.build()
>>       }
>>   }
>> +
>> +#[cfg(test)]
>> +mod test {
>> +    use pbs_tools::crypt_config::CryptConfig;
>> +
>> +    use super::DataChunkBuilder;
>> +
>> +    const TEST_DATA_LEN: usize = 50;
>> +
>> +    #[test]
>> +    fn test_data_blob_builder() {
>> +        let mut data = Vec::with_capacity(TEST_DATA_LEN);
>> +        for i in 0..TEST_DATA_LEN / 10 {
>> +            for _ in 0..10 {
>> +                data.push(i as u8);
>> +            }
>> +        }
>> +
>> +        // unencrypted, uncompressed
>> +        let (chunk, digest) = DataChunkBuilder::new(&data)
>> +            .compress(false)
>> +            .build()
>> +            .expect("could not create unencrypted, uncompressed chunk");
>> +
>> +        let data_decoded = chunk
>> +            .decode(None, Some(&digest))
>> +            .expect("cannot decode unencrypted, uncompressed chunk");
>> +        assert_eq!(data, data_decoded);
>> +
>> +        // unencrypted, compressed
>> +        let (chunk, digest) = DataChunkBuilder::new(&data)
>> +            .compress(true)
>> +            .build()
>> +            .expect("could not create unencrypted, compressed chunk");
>> +
>> +        let data_decoded = chunk
>> +            .decode(None, Some(&digest))
>> +            .expect("cannot decode unencrypted, compressed chunk");
>> +        assert_eq!(data, data_decoded);
>> +
>> +        // encrypted, uncompressed
>> +        let crypt_config = CryptConfig::new([9; 32]).expect("could not create crypt config");
>> +        let (chunk, digest) = DataChunkBuilder::new(&data)
>> +            .compress(false)
>> +            .crypt_config(&crypt_config)
>> +            .build()
>> +            .expect("could not create encrypted, uncompressed chunk");
>> +
>> +        let data_decoded = chunk
>> +            .decode(Some(&crypt_config), Some(&digest))
>> +            .expect("cannot decode encrypted, uncompressed chunk");
>> +        assert_eq!(data, data_decoded);
>> +
>> +        // encrypted, compressed
>> +        let (chunk, digest) = DataChunkBuilder::new(&data)
>> +            .compress(true)
>> +            .crypt_config(&crypt_config)
>> +            .build()
>> +            .expect("could not create encrypted, compressed chunk");
>> +
>> +        let data_decoded = chunk
>> +            .decode(Some(&crypt_config), Some(&digest))
>> +            .expect("cannot decode encrypted, compressed chunk");
>> +        assert_eq!(data, data_decoded);
>> +    }
>> +}
> 
> IMO it would be nicer to move the (sub) testcases
> (e.g. 'encrypted, compressed', 'encrypted, uncompressed') to individual test cases,
> transforming the info from the comments into the test name, for instance:
> 
>    #[test]
>    fn test_encrypted_uncompressed() { ... }
> 
> (the fact that you test the data_blob module is clear from the full test name, which includes
> the module prefix - so I think you can drop it from the test name)
> 
> IMO that is quite nice when at some point we end up with a failing assertion, because
> you see exactly from the name of the failing test what scenario did not work.
> 
> What's your take on this?
> 

sure makes sense

i'll send a v3, but i'll wait for a review on the other patches first


* Re: [pbs-devel] [PATCH proxmox-backup v2 3/4] datastore: data blob: increase compression throughput
  2024-07-31  9:36 ` [pbs-devel] [PATCH proxmox-backup v2 3/4] datastore: data blob: increase compression throughput Dominik Csapak
@ 2024-07-31 14:39   ` Thomas Lamprecht
  2024-08-01  6:55     ` Dominik Csapak
  2024-08-02 10:47     ` Dominik Csapak
  0 siblings, 2 replies; 14+ messages in thread
From: Thomas Lamprecht @ 2024-07-31 14:39 UTC (permalink / raw)
  To: Proxmox Backup Server development discussion, Dominik Csapak

On 31/07/2024 at 11:36, Dominik Csapak wrote:
> by not using `zstd::stream::copy_encode`, because that has an allocation
> pattern that reduces throughput if the target/source storage and the
> network are faster than the chunk creation.

any before/after benchmark numbers would be really great to have in the
commit message of any such patch.

> 
> instead use `zstd::bulk::compress_to_buffer` which shouldn't do any big
> allocations, since we provide the target buffer.
> 
> To handle the case that the target buffer is too small, we now ignore
> all zstd errors and continue with the uncompressed data, logging the
> error unless the target buffer is too small.

This is hard to read for me and might do better with some reasoning
added for why this is OK, even if it's clear to you; maybe something like:

In case of a compression error just return the uncompressed data,
there's nothing we can do and saving uncompressed data is better than
having none. Additionally, log any such error besides the one for the
target buffer being too small.


> For now, we have to parse the error string for that, as `zstd` maps all
> errors to `io::ErrorKind::Other`. Until that gets changed, there is no
> other way to differentiate between the different kinds of errors.

FWIW, you could also use the lower-level zstd_safe's compress2 [0] here,
compress_to_buffer is just a thin wrapper around that [1] anyway. Then you
could match the return value to see if it equals `70`, i.e., the value of
the ZSTD_error_dstSize_tooSmall [2] from the ZSTD_ErrorCode enum.

I mean, naturally it would be much better if upstream provided a saner
interface or at least a binding for the enum, but IME such error codes
are quite stable if defined in this enum way, at least more stable than
strings, so it might be a slightly better workaround.

[0]: https://docs.rs/zstd-safe/latest/zstd_safe/struct.CCtx.html#method.compress2
[1]: https://docs.rs/zstd/latest/src/zstd/bulk/compressor.rs.html#117-125
[2]: https://github.com/facebook/zstd/blob/fdfb2aff39dc498372d8c9e5f2330b692fea9794/lib/zstd_errors.h#L88
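
Roughly what I have in mind (untested sketch; `try_compress` is a made-up
helper, and that the `ErrorCode` zstd_safe returns equals the plain
ZSTD_ErrorCode value is exactly the assumption that would need verifying):

---
use zstd_safe::CCtx;

// ZSTD_error_dstSize_tooSmall from zstd's ZSTD_ErrorCode enum [2]; the
// crate does not expose the enum, so it is hard-coded here
const ZSTD_ERROR_DST_SIZE_TOO_SMALL: usize = 70;

/// Compress `src` into `dst` at level 1; `Ok(None)` means the output did
/// not fit into `dst`, i.e. the data did not compress below its own size.
fn try_compress(src: &[u8], dst: &mut [u8]) -> Result<Option<usize>, usize> {
    let mut ctx = CCtx::create();
    match ctx.compress(dst, src, 1) {
        Ok(size) => Ok(Some(size)),
        Err(code) if code == ZSTD_ERROR_DST_SIZE_TOO_SMALL => Ok(None),
        Err(code) => Err(code),
    }
}
---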

besides that and a small nit below: looks OK to me

> 
> Signed-off-by: Dominik Csapak <d.csapak@proxmox.com>
> ---
> changes from v1:
> * fixed commit message
> * reduced log severity to `warn`
> * use vec![0; size]
> * omit unnecessary buffer allocation in the unencrypted,uncompressed case
>   by reusing the initial buffer that was tried for compression
>  pbs-datastore/src/data_blob.rs | 37 +++++++++++++++++++---------------
>  1 file changed, 21 insertions(+), 16 deletions(-)
> 
> diff --git a/pbs-datastore/src/data_blob.rs b/pbs-datastore/src/data_blob.rs
> index 8715afef..2a528204 100644
> --- a/pbs-datastore/src/data_blob.rs
> +++ b/pbs-datastore/src/data_blob.rs
> @@ -136,39 +136,44 @@ impl DataBlob {
>  
>              DataBlob { raw_data }
>          } else {
> -            let max_data_len = data.len() + std::mem::size_of::<DataBlobHeader>();
> +            let header_len = std::mem::size_of::<DataBlobHeader>();
> +            let max_data_len = data.len() + header_len;
> +            let mut raw_data = vec![0; max_data_len];
>              if compress {
> -                let mut comp_data = Vec::with_capacity(max_data_len);
> -
>                  let head = DataBlobHeader {
>                      magic: COMPRESSED_BLOB_MAGIC_1_0,
>                      crc: [0; 4],
>                  };
>                  unsafe {
> -                    comp_data.write_le_value(head)?;
> +                    (&mut raw_data[0..header_len]).write_le_value(head)?;
>                  }
>  
> -                zstd::stream::copy_encode(data, &mut comp_data, 1)?;
> -
> -                if comp_data.len() < max_data_len {
> -                    let mut blob = DataBlob {
> -                        raw_data: comp_data,
> -                    };
> -                    blob.set_crc(blob.compute_crc());
> -                    return Ok(blob);
> +                match zstd::bulk::compress_to_buffer(data, &mut raw_data[header_len..], 1) {
> +                    Ok(size) if size <= data.len() => {
> +                        raw_data.truncate(header_len + size);
> +                        let mut blob = DataBlob { raw_data };
> +                        blob.set_crc(blob.compute_crc());
> +                        return Ok(blob);
> +                    }
> +                    // if size is bigger than the data, or any error is returned, continue with non
> +                    // compressed archive but log all errors beside buffer too small

this is mostly a 1:1 translation of the code into a comment, IMO not _that_
useful, at least when the code is not really complex, and something one has
to remember to update when modifying the code; but no hard feelings here.

> +                    Ok(_) => {}
> +                    Err(err) => {
> +                        if !err.to_string().contains("Destination buffer is too small") {
> +                            log::warn!("zstd compression error: {err}");
> +                        }
> +                    }
>                  }
>              }
>  
> -            let mut raw_data = Vec::with_capacity(max_data_len);
> -
>              let head = DataBlobHeader {
>                  magic: UNCOMPRESSED_BLOB_MAGIC_1_0,
>                  crc: [0; 4],
>              };
>              unsafe {
> -                raw_data.write_le_value(head)?;
> +                (&mut raw_data[0..header_len]).write_le_value(head)?;
>              }
> -            raw_data.extend_from_slice(data);
> +            (&mut raw_data[header_len..]).write_all(data)?;
>  
>              DataBlob { raw_data }
>          };



* Re: [pbs-devel] [PATCH proxmox-backup v2 3/4] datastore: data blob: increase compression throughput
  2024-07-31 14:39   ` Thomas Lamprecht
@ 2024-08-01  6:55     ` Dominik Csapak
  2024-08-02 10:47     ` Dominik Csapak
  1 sibling, 0 replies; 14+ messages in thread
From: Dominik Csapak @ 2024-08-01  6:55 UTC (permalink / raw)
  To: Thomas Lamprecht, Proxmox Backup Server development discussion

On 7/31/24 16:39, Thomas Lamprecht wrote:
> On 31/07/2024 at 11:36, Dominik Csapak wrote:
>> by not using `zstd::stream::copy_encode`, because that has an allocation
>> pattern that reduces throughput if the target/source storage and the
>> network are faster than the chunk creation.
> 
> any before/after benchmark numbers would be really great to have in the
> commit message of any such patch.
> 

ok, i put it in the cover letter but you're right, it's better to put them here

>>
>> instead use `zstd::bulk::compress_to_buffer` which shouldn't do any big
>> allocations, since we provide the target buffer.
>>
>> To handle the case that the target buffer is too small, we now ignore
>> all zstd errors and continue with the uncompressed data, logging the
>> error unless the target buffer is too small.
> 
> This is hard to read for me and might do better with some reasoning
> added for why this is OK, even if it's clear to you; maybe something like:
> 
> In case of a compression error just return the uncompressed data,
> there's nothing we can do and saving uncompressed data is better than
> having none. Additionally, log any such error besides the one for the
> target buffer being too small.

Ok

> 
> 
>> For now, we have to parse the error string for that, as `zstd` maps all
>> errors to `io::ErrorKind::Other`. Until that gets changed, there is no
>> other way to differentiate between the different kinds of errors.
> 
> FWIW, you could also use the lower-level zstd_safe's compress2 [0] here,
> compress_to_buffer is just a thin wrapper around that [1] anyway. Then you
> could match the return value to see if it equals `70`, i.e., the value of
> the ZSTD_error_dstSize_tooSmall [2] from the ZSTD_ErrorCode enum.
> 
> I mean, naturally it would be much better if upstream provided a saner
> interface or at least a binding for the enum, but IME such error codes
> are quite stable if defined in this enum way, at least more stable than
> strings, so it might be a slightly better workaround.
> 
> [0]: https://docs.rs/zstd-safe/latest/zstd_safe/struct.CCtx.html#method.compress2
> [1]: https://docs.rs/zstd/latest/src/zstd/bulk/compressor.rs.html#117-125
> [2]: https://github.com/facebook/zstd/blob/fdfb2aff39dc498372d8c9e5f2330b692fea9794/lib/zstd_errors.h#L88
> 
> besides that and a small nit below: looks OK to me

i did actually try something like that before sending v1 of this,
but i could not get it to work reliably because the returned integer
did not match what zstd had in the code; rather, the numbers were
(at least on my machine) consistent but "garbage"

i guessed at the time that this had to do with how the zstd crates are
compiled into the rust binary, but maybe I was mistaken. I'll look again
and see if i was just holding it wrong...

> 
>>
>> Signed-off-by: Dominik Csapak <d.csapak@proxmox.com>
>> ---
>> changes from v1:
>> * fixed commit message
>> * reduced log severity to `warn`
>> * use vec![0; size]
>> * omit unnecessary buffer allocation in the unencrypted,uncompressed case
>>    by reusing the initial buffer that was tried for compression
>>   pbs-datastore/src/data_blob.rs | 37 +++++++++++++++++++---------------
>>   1 file changed, 21 insertions(+), 16 deletions(-)
>>
>> diff --git a/pbs-datastore/src/data_blob.rs b/pbs-datastore/src/data_blob.rs
>> index 8715afef..2a528204 100644
>> --- a/pbs-datastore/src/data_blob.rs
>> +++ b/pbs-datastore/src/data_blob.rs
>> @@ -136,39 +136,44 @@ impl DataBlob {
>>   
>>               DataBlob { raw_data }
>>           } else {
>> -            let max_data_len = data.len() + std::mem::size_of::<DataBlobHeader>();
>> +            let header_len = std::mem::size_of::<DataBlobHeader>();
>> +            let max_data_len = data.len() + header_len;
>> +            let mut raw_data = vec![0; max_data_len];
>>               if compress {
>> -                let mut comp_data = Vec::with_capacity(max_data_len);
>> -
>>                   let head = DataBlobHeader {
>>                       magic: COMPRESSED_BLOB_MAGIC_1_0,
>>                       crc: [0; 4],
>>                   };
>>                   unsafe {
>> -                    comp_data.write_le_value(head)?;
>> +                    (&mut raw_data[0..header_len]).write_le_value(head)?;
>>                   }
>>   
>> -                zstd::stream::copy_encode(data, &mut comp_data, 1)?;
>> -
>> -                if comp_data.len() < max_data_len {
>> -                    let mut blob = DataBlob {
>> -                        raw_data: comp_data,
>> -                    };
>> -                    blob.set_crc(blob.compute_crc());
>> -                    return Ok(blob);
>> +                match zstd::bulk::compress_to_buffer(data, &mut raw_data[header_len..], 1) {
>> +                    Ok(size) if size <= data.len() => {
>> +                        raw_data.truncate(header_len + size);
>> +                        let mut blob = DataBlob { raw_data };
>> +                        blob.set_crc(blob.compute_crc());
>> +                        return Ok(blob);
>> +                    }
>> +                    // if size is bigger than the data, or any error is returned, continue with non
>> +                    // compressed archive but log all errors beside buffer too small
> 
> this is mostly a 1:1 translation of the code into a comment, IMO not _that_
> useful, at least when the code is not really complex, and something one has
> to remember to update when modifying the code; but no hard feelings here.

you're right that it doesn't tell much more than the code. I wanted to
capture the purpose of both the remaining Ok and Err paths in a comment,
but did so poorly; i'll try to find a better comment in v3

> 
>> +                    Ok(_) => {}
>> +                    Err(err) => {
>> +                        if !err.to_string().contains("Destination buffer is too small") {
>> +                            log::warn!("zstd compression error: {err}");
>> +                        }
>> +                    }
>>                   }
>>               }
>>   
>> -            let mut raw_data = Vec::with_capacity(max_data_len);
>> -
>>               let head = DataBlobHeader {
>>                   magic: UNCOMPRESSED_BLOB_MAGIC_1_0,
>>                   crc: [0; 4],
>>               };
>>               unsafe {
>> -                raw_data.write_le_value(head)?;
>> +                (&mut raw_data[0..header_len]).write_le_value(head)?;
>>               }
>> -            raw_data.extend_from_slice(data);
>> +            (&mut raw_data[header_len..]).write_all(data)?;
>>   
>>               DataBlob { raw_data }
>>           };
> 



* Re: [pbs-devel] [PATCH proxmox-backup v2 3/4] datastore: data blob: increase compression throughput
  2024-07-31 14:39   ` Thomas Lamprecht
  2024-08-01  6:55     ` Dominik Csapak
@ 2024-08-02 10:47     ` Dominik Csapak
  2024-08-02 11:59       ` Thomas Lamprecht
  1 sibling, 1 reply; 14+ messages in thread
From: Dominik Csapak @ 2024-08-02 10:47 UTC (permalink / raw)
  To: Thomas Lamprecht, Proxmox Backup Server development discussion

On 7/31/24 16:39, Thomas Lamprecht wrote:
> On 31/07/2024 at 11:36, Dominik Csapak wrote:
>> For now, we have to parse the error string for that, as `zstd` maps all
>> errors to `io::ErrorKind::Other`. Until that gets changed, there is no
>> other way to differentiate between the different kinds of errors.
> 
> FWIW, you could also use the lower-level zstd_safe's compress2 [0] here,
> compress_to_buffer is just a thin wrapper around that [1] anyway. Then you
> could match the return value to see if it equals `70`, i.e., the value of
> the ZSTD_error_dstSize_tooSmall [2] from the ZSTD_ErrorCode enum.
> 
> I mean, naturally it would be much better if upstream provided a saner
> interface or at least a binding for the enum, but IME such error codes
> are quite stable if defined in this enum way, at least more stable than
> strings, so it might be a slightly better workaround.
> 
> [0]: https://docs.rs/zstd-safe/latest/zstd_safe/struct.CCtx.html#method.compress2
> [1]: https://docs.rs/zstd/latest/src/zstd/bulk/compressor.rs.html#117-125
> [2]: https://github.com/facebook/zstd/blob/fdfb2aff39dc498372d8c9e5f2330b692fea9794/lib/zstd_errors.h#L88

sadly AFAICS this is currently not possible

i tried this and instead of error 70 i got:

'18446744073709551546'

reading your [2] link, it also states:

 > *  note 1 : this API shall be used with static linking only.
 > *           dynamic linking is not yet officially supported.

so i don't think this works, unless we'd link statically?

so how do we want to go forward with this?


* Re: [pbs-devel] [PATCH proxmox-backup v2 3/4] datastore: data blob: increase compression throughput
  2024-08-02 10:47     ` Dominik Csapak
@ 2024-08-02 11:59       ` Thomas Lamprecht
  2024-08-02 12:38         ` Dominik Csapak
  0 siblings, 1 reply; 14+ messages in thread
From: Thomas Lamprecht @ 2024-08-02 11:59 UTC (permalink / raw)
  To: Dominik Csapak, Proxmox Backup Server development discussion

On 02/08/2024 12:47, Dominik Csapak wrote:
> sadly AFAICS this is currently not possible
> 
> i tried this and instead of error 70 i got:
> 
> '18446744073709551546'
> 
> reading your [2] link, it also states:
> 
>> *  note 1 : this API shall be used with static linking only.
>> *           dynamic linking is not yet officially supported.

That just sounds wrong... but oh well.
 
> so i don't think this works, unless we'd link statically?
>

I mean, the library can obviously translate the error to some meaningful
string. That could be due to relying on internal compiled stuff that
we cannot use, but while I think that is a solid guess, I'd still evaluate
how it actually works to be sure.
 
> so how do we want to go forward with this?


If, after checking, this really seems to be infeasible, then let's go for
the string comparison with a TODO/FIXME comment pointing out why it's done
that way and that one might want to reevaluate whether it's still required
in the future.
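
I.e., keep the current logic, just with the caveat spelled out in place,
roughly (sketch against the hunk quoted above):

---
match zstd::bulk::compress_to_buffer(data, &mut raw_data[header_len..], 1) {
    Ok(size) if size <= data.len() => { /* ... use the compressed data ... */ }
    Ok(_) => {} // compressed result not smaller, fall through to uncompressed
    Err(err) => {
        // FIXME: the zstd crate maps all errors to io::ErrorKind::Other,
        // so matching on the message text is currently the only way to
        // detect "destination buffer too small"; re-evaluate once proper
        // error codes are exposed.
        if !err.to_string().contains("Destination buffer is too small") {
            log::warn!("zstd compression error: {err}");
        }
    }
}
---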


* Re: [pbs-devel] [PATCH proxmox-backup v2 3/4] datastore: data blob: increase compression throughput
  2024-08-02 11:59       ` Thomas Lamprecht
@ 2024-08-02 12:38         ` Dominik Csapak
  2024-08-07 15:01           ` Thomas Lamprecht
  0 siblings, 1 reply; 14+ messages in thread
From: Dominik Csapak @ 2024-08-02 12:38 UTC (permalink / raw)
  To: Thomas Lamprecht, Proxmox Backup Server development discussion

On 8/2/24 13:59, Thomas Lamprecht wrote:
> On 02/08/2024 12:47, Dominik Csapak wrote:
>> sadly, AFAICS this is currently not possible
>>
>> I tried this, and instead of error 70 I got:
>>
>> '18446744073709551546'
>>
>> reading your [2] link, it also states:
>>
>>> *  note 1 : this API shall be used with static linking only.
>>> *           dynamic linking is not yet officially supported.
> 
> That just sounds wrong... but oh well.
>   
>> so I don't think this works, unless we'd link statically?
>>
> 
> I mean, the library can obviously translate the error to some meaningful
> string. That could be due to relying on internal compiled stuff that
> we cannot use, but while I think that is a solid guess, I'd still evaluate
> how it actually works to be sure.
>   
>> so how do we want to go forward with this?
> 
> 
> If, after checking, this really seems to be unfeasible, then let's go for
> the string comparison, with a TODO/FIXME comment pointing out why it's done
> that way and that one might want to reevaluate whether it's still required
> (in the future).


mhmm, zstd just calls this:

---
fn map_error_code(code: usize) -> io::Error {
     let msg = zstd_safe::get_error_name(code);
     io::Error::new(io::ErrorKind::Other, msg.to_string())
}
---

which calls this:

---
pub fn get_error_name(code: usize) -> &'static str {
     unsafe {
         // Safety: assumes ZSTD returns a well-formed utf8 string.
         let name = zstd_sys::ZSTD_getErrorName(code);
         c_char_to_str(name)
     }
}
---

which is part of the zstd API, and at the end it maps the error code like this:

---
ERR_STATIC ERR_enum ERR_getErrorCode(size_t code)
{
    if (!ERR_isError(code)) return (ERR_enum)0;
    return (ERR_enum)(0-code);
}
---

with that result, it maps the code to a string...

which matches what I get, since

2^64 - 70 = 18446744073709551546 [0]


but I'm really not sure if we can rely on that, since the function is in
'error_private.c', which seems to me like it's an implementation detail only?


0: https://www.wolframalpha.com/input?i=2%5E64+-+70
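
If we did want to build on it anyway, the recovery is just a wrapping
negation; a minimal sketch (the helper name is illustrative only):

---
/// Illustrative helper, not from the series: recover the ZSTD_ErrorCode
/// enum value from the usize zstd_safe hands back. ERR_getErrorCode
/// computes (0 - code), so wrapping negation undoes it.
fn zstd_error_enum(code: usize) -> usize {
    code.wrapping_neg()
}

#[test]
fn wrapped_code_negates_back_to_dst_size_too_small() {
    // the observed value is exactly 2^64 - 70:
    assert_eq!(zstd_error_enum(18_446_744_073_709_551_546), 70);
}
---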


_______________________________________________
pbs-devel mailing list
pbs-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pbs-devel

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [pbs-devel] [PATCH proxmox-backup v2 0/4] improve compression throughput
  2024-07-31  9:36 [pbs-devel] [PATCH proxmox-backup v2 0/4] improve compression throughput Dominik Csapak
                   ` (3 preceding siblings ...)
  2024-07-31  9:36 ` [pbs-devel] [PATCH proxmox-backup v2 4/4] datastore: DataBlob encode: simplify code Dominik Csapak
@ 2024-08-05  9:33 ` Dominik Csapak
  4 siblings, 0 replies; 14+ messages in thread
From: Dominik Csapak @ 2024-08-05  9:33 UTC (permalink / raw)
  To: pbs-devel

replaced by a v3:

https://lists.proxmox.com/pipermail/pbs-devel/2024-August/010402.html


_______________________________________________
pbs-devel mailing list
pbs-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pbs-devel


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [pbs-devel] [PATCH proxmox-backup v2 3/4] datastore: data blob: increase compression throughput
  2024-08-02 12:38         ` Dominik Csapak
@ 2024-08-07 15:01           ` Thomas Lamprecht
  0 siblings, 0 replies; 14+ messages in thread
From: Thomas Lamprecht @ 2024-08-07 15:01 UTC (permalink / raw)
  To: Proxmox Backup Server development discussion, Dominik Csapak

Seems I forgot to reply-all, so while this is outdated (i.e., already implemented
by Dominik in v3), it's still nice to have the list complete, for the record so
to say.

On 02/08/2024 14:38, Dominik Csapak wrote:
> ---
> fn map_error_code(code: usize) -> io::Error {
>     let msg = zstd_safe::get_error_name(code);
>     io::Error::new(io::ErrorKind::Other, msg.to_string())
> }
> ---
>
> which calls this:
>
> ---
> pub fn get_error_name(code: usize) -> &'static str {
>     unsafe {
>         // Safety: assumes ZSTD returns a well-formed utf8 string.
>         let name = zstd_sys::ZSTD_getErrorName(code);
>         c_char_to_str(name)
>     }
> }
> ---
>
> which is part of the zstd API, and at the end it maps the error code like this:
>
> ---
> ERR_STATIC ERR_enum ERR_getErrorCode(size_t code)
> {
>     if (!ERR_isError(code)) return (ERR_enum)0;
>     return (ERR_enum)(0-code);
> }
> ---
>
> with that result, it maps the code to a string...
>
> which matches what I get, since
>
> 2^64 - 70 = 18446744073709551546 [0]
>

Thanks for looking into this and providing the explanation.

> but I'm really not sure if we can rely on that, since the function is in 'error_private.c', which seems to me like it's an implementation detail only?
>

Yeah, it's not ideal... But it could be made safe enough by adding a test that runs
on build and triggers this error explicitly by passing a way too small target buffer;
that way we can notice when this internal error changes, which is IMO not _that_
likely, at least not during the same major Debian release, as there we normally
only get critical bug and security fixes. And while I don't want to curse it,
I'd really be surprised if this particular code changed semantics, as it's hard
to envision the widely used `- code` pattern for returning errors in C ABIs
being part of such a critical flaw.

And yeah, while that is not the interface I'd wish for, it doesn't really feel
significantly worse to me than matching on the error string, as those aren't
guaranteed to be 100% stable either, I'd think.
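
Such a build-time check could be as small as this sketch (zstd_safe names
per its docs; treat the exact API as an assumption):

---
// Force ZSTD_error_dstSize_tooSmall with a deliberately undersized target
// buffer and assert that the wrapped error code still negates back to 70,
// so we notice if the internal error numbering ever changes.
#[test]
fn zstd_dst_size_too_small_is_still_70() {
    let src = [0u8; 4096];
    let mut dst = [0u8; 1]; // far too small for any compressed frame
    let err = zstd_safe::compress(&mut dst[..], &src, 3)
        .expect_err("compressing into a 1-byte buffer must fail");
    assert_eq!(err.wrapping_neg(), 70); // ZSTD_error_dstSize_tooSmall
}
---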



_______________________________________________
pbs-devel mailing list
pbs-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pbs-devel


^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2024-08-07 15:02 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-07-31  9:36 [pbs-devel] [PATCH proxmox-backup v2 0/4] improve compression throughput Dominik Csapak
2024-07-31  9:36 ` [pbs-devel] [PATCH proxmox-backup v2 1/4] remove data blob writer Dominik Csapak
2024-07-31  9:36 ` [pbs-devel] [PATCH proxmox-backup v2 2/4] datastore: test DataBlob encode/decode roundtrip Dominik Csapak
2024-07-31  9:47   ` Lukas Wagner
2024-07-31  9:50     ` Dominik Csapak
2024-07-31  9:36 ` [pbs-devel] [PATCH proxmox-backup v2 3/4] datastore: data blob: increase compression throughput Dominik Csapak
2024-07-31 14:39   ` Thomas Lamprecht
2024-08-01  6:55     ` Dominik Csapak
2024-08-02 10:47     ` Dominik Csapak
2024-08-02 11:59       ` Thomas Lamprecht
2024-08-02 12:38         ` Dominik Csapak
2024-08-07 15:01           ` Thomas Lamprecht
2024-07-31  9:36 ` [pbs-devel] [PATCH proxmox-backup v2 4/4] datastore: DataBlob encode: simplify code Dominik Csapak
2024-08-05  9:33 ` [pbs-devel] [PATCH proxmox-backup v2 0/4] improve compression throughput Dominik Csapak

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox