From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <d.csapak@proxmox.com>
Received: from firstgate.proxmox.com (firstgate.proxmox.com [212.224.123.68])
 (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits)
 key-exchange X25519 server-signature RSA-PSS (2048 bits))
 (No client certificate requested)
 by lists.proxmox.com (Postfix) with ESMTPS id B33FD69268
 for <pbs-devel@lists.proxmox.com>; Mon, 13 Sep 2021 09:14:48 +0200 (CEST)
Received: from firstgate.proxmox.com (localhost [127.0.0.1])
 by firstgate.proxmox.com (Proxmox) with ESMTP id A570825D02
 for <pbs-devel@lists.proxmox.com>; Mon, 13 Sep 2021 09:14:48 +0200 (CEST)
Received: from proxmox-new.maurer-it.com (proxmox-new.maurer-it.com
 [94.136.29.106])
 (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits)
 key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest SHA256)
 (No client certificate requested)
 by firstgate.proxmox.com (Proxmox) with ESMTPS id 0DED025CF5
 for <pbs-devel@lists.proxmox.com>; Mon, 13 Sep 2021 09:14:45 +0200 (CEST)
Received: from proxmox-new.maurer-it.com (localhost.localdomain [127.0.0.1])
 by proxmox-new.maurer-it.com (Proxmox) with ESMTP id C28674477E;
 Mon, 13 Sep 2021 09:14:38 +0200 (CEST)
Message-ID: <758f28ef-4c6c-ccaf-54a7-17e4c0f8b10c@proxmox.com>
Date: Mon, 13 Sep 2021 09:14:36 +0200
MIME-Version: 1.0
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:92.0) Gecko/20100101
 Thunderbird/92.0
Content-Language: en-US
To: Thomas Lamprecht <t.lamprecht@proxmox.com>,
 Proxmox Backup Server development discussion <pbs-devel@lists.proxmox.com>
References: <20210910090948.2145523-1-d.csapak@proxmox.com>
 <60f030f0-c15a-0b2f-6bf6-1a243f042f0c@proxmox.com>
From: Dominik Csapak <d.csapak@proxmox.com>
In-Reply-To: <60f030f0-c15a-0b2f-6bf6-1a243f042f0c@proxmox.com>
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
X-SPAM-LEVEL: Spam detection results:  0
 AWL 2.207 Adjusted score from AWL reputation of From: address
 BAYES_00                 -1.9 Bayes spam probability is 0 to 1%
 KAM_DMARC_STATUS 0.01 Test Rule for DKIM or SPF Failure with Strict Alignment
 NICE_REPLY_A           -3.584 Looks like a legit reply (A)
 SPF_HELO_NONE           0.001 SPF: HELO does not publish an SPF Record
 SPF_PASS               -0.001 SPF: sender matches SPF record
Subject: Re: [pbs-devel] [RFC PATCH proxmox-backup] pbs-tools: zip: add EFS
 flag to zip files
X-BeenThere: pbs-devel@lists.proxmox.com
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: Proxmox Backup Server development discussion
 <pbs-devel.lists.proxmox.com>
List-Unsubscribe: <https://lists.proxmox.com/cgi-bin/mailman/options/pbs-devel>, 
 <mailto:pbs-devel-request@lists.proxmox.com?subject=unsubscribe>
List-Archive: <http://lists.proxmox.com/pipermail/pbs-devel/>
List-Post: <mailto:pbs-devel@lists.proxmox.com>
List-Help: <mailto:pbs-devel-request@lists.proxmox.com?subject=help>
List-Subscribe: <https://lists.proxmox.com/cgi-bin/mailman/listinfo/pbs-devel>, 
 <mailto:pbs-devel-request@lists.proxmox.com?subject=subscribe>
X-List-Received-Date: Mon, 13 Sep 2021 07:14:48 -0000

On 9/11/21 17:08, Thomas Lamprecht wrote:
> On 10.09.21 11:09, Dominik Csapak wrote:
>> this flag marks the file names as 'UTF-8' encoded.
>>
>> By default, encoding of file names in zips are defined as code page 437,
>> but we save the filenames as bytes (like in linux fs).
>>
>> For linux systems this neither would be a problem since most tools
>> simply use the filenames as bytes, but for the zip utility under
>> windows it's important since NTFS uses UTF-16 for file names.
>>
>> Since we generate zips only on pxars (file based backup on linux) or
>> via file-restore-daemons (linux; ntfs mounted as UTF-8), it's a fair
>> assumption that we can mark most filenames as UTF-8.
>>
>> For zips generated from linux backups to be extracted on windows it is
>> impossible to do the correct thing anyway, since windows can not have
>> arbitrary bytes in file names, and for each encoding chosen, there is
>> some file that cannot be shown correctly.
>> so either all filenames are decoded as CP437 ('ö' -> '├╢')
>> or non UTF-8 encoded file-names have garbage characters in them (�)
>>
>> Signed-off-by: Dominik Csapak <d.csapak@proxmox.com>
>> ---
>> sending as RFC since there is no way to have it correct in all cases,
>> and we have to decide if we want CP437 or UTF-8 by default
>>
> 
> Yeah, it's not only that we may not be incorrect, the closest definition of a ZIP spec
> says "not set == should be cp437 but meh" and "set == MUST be valid UTF-8" about this
> bit:
> 
>> D.2 If general purpose bit 11 is unset, the file name and comment SHOULD conform
>> to the original ZIP character encoding.  If general purpose bit 11 is set, the
>> filename and comment MUST support The Unicode Standard, Version 4.1.0 or
>> greater using the character encoding form defined by the UTF-8 storage
>> specification.  The Unicode Standard is published by the The Unicode
>> Consortium (www.unicode.org).  UTF-8 encoded data stored within ZIP files
>> is expected to not include a byte order mark (BOM).
>>
> - Appendix D, https://pkware.cachefly.net/webdocs/casestudies/APPNOTE.TXT

Yes, you're right, i did read that part too loosely i think...

> 
> Also interesting, just below above quote:
> 
>> D.3 Applications MAY choose to supplement this file name storage through the use
>> of the 0x0008 Extra Field.  Storage for this optional field is currently
>> undefined, however it will be used to allow storing extended information
>> on source or target encoding that MAY further assist applications with file
>> name, or file content encoding tasks.  Please contact PKWARE with any
>> requirements on how this field SHOULD be used.

AFAIU, that part of the 'spec' is not open and would require a license 
of PKWARE

> 
> 
> So I'd like to know what standard tools like info-zip (i.e., Debian's "zip" package) or
> other cross-platform tools like 7zip do.
> 
> It seems that at least Debian's version of info zip had some thoughts about this and can
> (or always does, did not checked that closely) safe utf8 filenames in an extra field, one
> that some other tools maybe check for?
> 
> https://sources.debian.org/src/zip/3.0-12/zip.c/#L967
> 
> I say Debian's version, as upstream still talks about Unicode support on their home page,
> which itself may be just outdated too, but it could also be that Debian patched that in.
> 
> Any how, it seems to me that there'd be some more compatible options that do not plainly
> state that they're 100% utf-8 while actually not being so sure of that, so I'd explore that
> angle quite some more; data restoration is probably the most important aspect of a backup
> system - so every way we expose doing so should work as as good as possible - even if going
> outside our Linux bubble.

I tested it with debians zip (Info-Zip), and despite was the spec says,
i could not get it to write an 'extra-field' with the filenames.

Instead, when it finds a filename that has a filename with any byte that
has the high-bit set (>127) it sets the EFS bit in the filename iff
the filename is valid utf-8, and does nothing if not.

IMHO, that sounds like a reasonable thing to do, so i'd suggest
that we test each filename for valid UTF-8, and set the bit
if that's true. This marks ASCII only filenames also as UTF-8,
but thats technically true and simplifies the rust code a bit
(and should be faster for UTF-8 Filenames,
since we do not have to check for a high bit first and then try to
convert...)

Does that sound ok to you?

For non-valid UTF-8 filenames that have a high bit i'd produce
CP437 filenames on windows, and on linux it'd just be the
byte value.

> 
>>   pbs-tools/src/zip.rs | 6 ++++--
>>   1 file changed, 4 insertions(+), 2 deletions(-)
>>
>> diff --git a/pbs-tools/src/zip.rs b/pbs-tools/src/zip.rs
>> index 605480a8..88eea07b 100644
>> --- a/pbs-tools/src/zip.rs
>> +++ b/pbs-tools/src/zip.rs
>> @@ -34,6 +34,8 @@ const VERSION_MADE_BY: u16 = 0x032d;
>>   const ZIP64_EOCD_RECORD: u32 = 0x06064B50;
>>   const ZIP64_EOCD_LOCATOR: u32 = 0x07064B50;
>>   
>> +const GENERAL_PURUPOSE_FLAGS: u16 = (1 << 3) | (1 << 11); // EFS + Data Descriptor
>> +
> 
> - typo in constant name: purupose vs. purpose

yeah thanks...

> - comment order do not match the bits used, bit 11 is EFS and bit 3 is telling
>    the parser that the crc32 is not in the header but in the data descriptor after
>    the compressed data; your bitwise-OR+comment order suggests different.

sorry

> - isn't this related to BZ entry #3618, but that is neither mentioned here nor in the
>    bug report...

that bug-report wasn't there when i wrote the patch.

> 
> _If_ we'd go down this way then the following const name and formatting would make this
> easier to read IMO:
> 
> const LFH_GENERAL_PURPOSE_FLAGS: u16 = (1 << 3) // we place crc32 in data descriptor
>      | (1 << 11); // EFS, mark filenames & comments as UTF-8 (not guaranteed but more often OK than CP437)
> 

makes, sense, though if we only set it conditionally i'd split the
EFS_MARK into its own constant.