public inbox for pbs-devel@lists.proxmox.com
 help / color / mirror / Atom feed
From: Dominik Csapak <d.csapak@proxmox.com>
To: Thomas Lamprecht <t.lamprecht@proxmox.com>,
	Proxmox Backup Server development discussion
	<pbs-devel@lists.proxmox.com>
Subject: Re: [pbs-devel] [RFC PATCH proxmox-backup] pbs-tools: zip: add EFS flag to zip files
Date: Mon, 13 Sep 2021 09:53:51 +0200	[thread overview]
Message-ID: <7e3bcd21-b93f-eb58-ee4f-673796ea271b@proxmox.com> (raw)
In-Reply-To: <916d18cc-b4ac-ff07-e0fe-6c4dab61105d@proxmox.com>

On 9/13/21 09:25, Thomas Lamprecht wrote:
> On 13.09.21 09:14, Dominik Csapak wrote:
>> On 9/11/21 17:08, Thomas Lamprecht wrote:
>>> On 10.09.21 11:09, Dominik Csapak wrote:
>>> Also interesting, just below above quote:
>>>
>>>> D.3 Applications MAY choose to supplement this file name storage through the use
>>>> of the 0x0008 Extra Field.  Storage for this optional field is currently
>>>> undefined, however it will be used to allow storing extended information
>>>> on source or target encoding that MAY further assist applications with file
>>>> name, or file content encoding tasks.  Please contact PKWARE with any
>>>> requirements on how this field SHOULD be used.
>>
>> AFAIU, that part of the 'spec' is not open and would require a license of PKWARE
> 
> yeah that's what I figured, but observe and replicate could in theory still be OK I guess?
> 
>>
>>>
>>>
>>> So I'd like to know what standard tools like info-zip (i.e., Debian's "zip" package) or
>>> other cross-platform tools like 7zip do.
>>>
>>> It seems that at least Debian's version of info zip had some thoughts about this and can
>>> (or always does, did not checked that closely) safe utf8 filenames in an extra field, one
>>> that some other tools maybe check for?
>>>
>>> https://sources.debian.org/src/zip/3.0-12/zip.c/#L967
>>>
>>> I say Debian's version, as upstream still talks about Unicode support on their home page,
>>> which itself may be just outdated too, but it could also be that Debian patched that in.
>>>
>>> Any how, it seems to me that there'd be some more compatible options that do not plainly
>>> state that they're 100% utf-8 while actually not being so sure of that, so I'd explore that
>>> angle quite some more; data restoration is probably the most important aspect of a backup
>>> system - so every way we expose doing so should work as as good as possible - even if going
>>> outside our Linux bubble.
>>
>> I tested it with debians zip (Info-Zip), and despite was the spec says,
>> i could not get it to write an 'extra-field' with the filenames.
>>
> 
> hmm, meh...
> 
>> Instead, when it finds a filename that has a filename with any byte that
>> has the high-bit set (>127) it sets the EFS bit in the filename iff
>> the filename is valid utf-8, and does nothing if not.
>>
>> IMHO, that sounds like a reasonable thing to do, so i'd suggest
>> that we test each filename for valid UTF-8, and set the bit
>> if that's true. This marks ASCII only filenames also as UTF-8,
>> but thats technically true and simplifies the rust code a bit
>> (and should be faster for UTF-8 Filenames,
>> since we do not have to check for a high bit first and then try to
>> convert...)
>>
>> Does that sound ok to you?
> 
> Yes, sounds reasonable to me for nwo. It'd be an improvement and we basically only use
> this for on-the-fly generation, so if we find a possible "better" (hard to measure that
> with such a mess of format) approach in the future it would help all users immediately
> too then.
> 
>>
>> For non-valid UTF-8 filenames that have a high bit i'd produce
>> CP437 filenames on windows, and on linux it'd just be the
>> byte value.
> 
> just to be sure: what does "i'd produce" mean here? As we certainly do not differ ZIP
> generation semantics by user OS (thank goodness), this means that a unzip by most
> reasonable tools would produce the outcome you described, or?

oops, this should say "it'd" so "it would".

To be completely clear, such ZIPs extracts as
* Windows(built-in + 7zip): UTF-8 filenames get correctly encoded,
   all other filenames interpreted as CP437
* Debian(zip): filenames as bytes in any case
* Debian(7z): weirdly it uses valid utf-8 as utf-8 bytes,
   but other filenames, it interprets as Windows 1252 and encodes as UTF-8..

i'll send a v2 with this semantic though, as i don't think we can do
much better for now

> 
>>
>>>
>>>>    pbs-tools/src/zip.rs | 6 ++++--
>>>>    1 file changed, 4 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/pbs-tools/src/zip.rs b/pbs-tools/src/zip.rs
>>>> index 605480a8..88eea07b 100644
>>>> --- a/pbs-tools/src/zip.rs
>>>> +++ b/pbs-tools/src/zip.rs
>>>> @@ -34,6 +34,8 @@ const VERSION_MADE_BY: u16 = 0x032d;
>>>>    const ZIP64_EOCD_RECORD: u32 = 0x06064B50;
>>>>    const ZIP64_EOCD_LOCATOR: u32 = 0x07064B50;
>>>>    +const GENERAL_PURUPOSE_FLAGS: u16 = (1 << 3) | (1 << 11); // EFS + Data Descriptor
>>>> +
>>>
>>> - typo in constant name: purupose vs. purpose
>>
>> yeah thanks...
>>
>>> - comment order do not match the bits used, bit 11 is EFS and bit 3 is telling
>>>     the parser that the crc32 is not in the header but in the data descriptor after
>>>     the compressed data; your bitwise-OR+comment order suggests different.
>>
>> sorry
>>
>>> - isn't this related to BZ entry #3618, but that is neither mentioned here nor in the
>>>     bug report...
>>
>> that bug-report wasn't there when i wrote the patch.
> 
> I naturally only compared mail vs BZ time and missed the date difference of one day,
> argh, sorry.
> 
>>
>>>
>>> _If_ we'd go down this way then the following const name and formatting would make this
>>> easier to read IMO:
>>>
>>> const LFH_GENERAL_PURPOSE_FLAGS: u16 = (1 << 3) // we place crc32 in data descriptor
>>>       | (1 << 11); // EFS, mark filenames & comments as UTF-8 (not guaranteed but more often OK than CP437)
>>>
>>
>> makes, sense, though if we only set it conditionally i'd split the
>> EFS_MARK into its own constant.
>>
> 
> sounds good to me.
> 





  reply	other threads:[~2021-09-13  7:54 UTC|newest]

Thread overview: 6+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-09-10  9:09 Dominik Csapak
2021-09-11 15:08 ` Thomas Lamprecht
2021-09-13  7:14   ` Dominik Csapak
2021-09-13  7:25     ` Thomas Lamprecht
2021-09-13  7:53       ` Dominik Csapak [this message]
2021-09-13  7:57         ` Thomas Lamprecht

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=7e3bcd21-b93f-eb58-ee4f-673796ea271b@proxmox.com \
    --to=d.csapak@proxmox.com \
    --cc=pbs-devel@lists.proxmox.com \
    --cc=t.lamprecht@proxmox.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox
Service provided by Proxmox Server Solutions GmbH | Privacy | Legal