Message-ID: <60f030f0-c15a-0b2f-6bf6-1a243f042f0c@proxmox.com>
Date: Sat, 11 Sep 2021 17:08:47 +0200
From: Thomas Lamprecht
To: Proxmox Backup Server development discussion, Dominik Csapak
References: <20210910090948.2145523-1-d.csapak@proxmox.com>
In-Reply-To: <20210910090948.2145523-1-d.csapak@proxmox.com>
Subject: Re: [pbs-devel] [RFC PATCH proxmox-backup] pbs-tools: zip: add EFS flag to zip files

On 10.09.21 11:09, Dominik Csapak wrote:
> this flag marks the file names as 'UTF-8' encoded.
>
> By default, encoding of file names in zips are defined as code page 437,
> but we save the filenames as bytes (like in linux fs).
>
> For linux systems this neither would be a problem since most tools
> simply use the filenames as bytes, but for the zip utility under
> windows it's important since NTFS uses UTF-16 for file names.
>
> Since we generate zips only on pxars (file based backup on linux) or
> via file-restore-daemons (linux; ntfs mounted as UTF-8), it's a fair
> assumption that we can mark most filenames as UTF-8.
>
> For zips generated from linux backups to be extracted on windows it is
> impossible to do the correct thing anyway, since windows can not have
> arbitrary bytes in file names, and for each encoding chosen, there is
> some file that cannot be shown correctly.
> so either all filenames are decoded as CP437 ('=C3=B6' -> '=E2=94=9C=E2= =95=A2') > or non UTF-8 encoded file-names have garbage characters in them (=EF=BF= =BD) >=20 > Signed-off-by: Dominik Csapak > --- > sending as RFC since there is no way to have it correct in all cases, > and we have to decide if we want CP437 or UTF-8 by default >=20 Yeah, it's not only that we may not be incorrect, the closest definition = of a ZIP spec says "not set =3D=3D should be cp437 but meh" and "set =3D=3D MUST be val= id UTF-8" about this bit: > D.2 If general purpose bit 11 is unset, the file name and comment SHOUL= D conform=20 > to the original ZIP character encoding. If general purpose bit 11 is s= et, the=20 > filename and comment MUST support The Unicode Standard, Version 4.1.0 o= r=20 > greater using the character encoding form defined by the UTF-8 storage = > specification. The Unicode Standard is published by the The Unicode > Consortium (www.unicode.org). UTF-8 encoded data stored within ZIP fil= es=20 > is expected to not include a byte order mark (BOM).=20 >=20 - Appendix D, https://pkware.cachefly.net/webdocs/casestudies/APPNOTE.TXT= Also interesting, just below above quote: > D.3 Applications MAY choose to supplement this file name storage throug= h the use=20 > of the 0x0008 Extra Field. Storage for this optional field is currentl= y=20 > undefined, however it will be used to allow storing extended informatio= n=20 > on source or target encoding that MAY further assist applications with = file=20 > name, or file content encoding tasks. Please contact PKWARE with any > requirements on how this field SHOULD be used. So I'd like to know what standard tools like info-zip (i.e., Debian's "zi= p" package) or other cross-platform tools like 7zip do. It seems that at least Debian's version of info zip had some thoughts abo= ut this and can (or always does, did not checked that closely) safe utf8 filenames in an = extra field, one that some other tools maybe check for? 
https://sources.debian.org/src/zip/3.0-12/zip.c/#L967

I say Debian's version, as upstream still talks about Unicode support on their
home page, which itself may be just outdated too, but it could also be that
Debian patched that in.

Anyhow, it seems to me that there'd be some more compatible options that do not
plainly state that they're 100% UTF-8 while actually not being so sure of that,
so I'd explore that angle quite some more; data restoration is probably the most
important aspect of a backup system, so every way we expose for doing so should
work as well as possible, even when going outside our Linux bubble.

>  pbs-tools/src/zip.rs | 6 ++++--
>  1 file changed, 4 insertions(+), 2 deletions(-)
>
> diff --git a/pbs-tools/src/zip.rs b/pbs-tools/src/zip.rs
> index 605480a8..88eea07b 100644
> --- a/pbs-tools/src/zip.rs
> +++ b/pbs-tools/src/zip.rs
> @@ -34,6 +34,8 @@ const VERSION_MADE_BY: u16 = 0x032d;
>  const ZIP64_EOCD_RECORD: u32 = 0x06064B50;
>  const ZIP64_EOCD_LOCATOR: u32 = 0x07064B50;
>
> +const GENERAL_PURUPOSE_FLAGS: u16 = (1 << 3) | (1 << 11); // EFS + Data Descriptor
> +

- typo in constant name: purupose vs. purpose
- the comment order does not match the bits used: bit 11 is EFS and bit 3 tells
  the parser that the crc32 is not in the header but in the data descriptor
  after the compressed data; your bitwise-OR + comment order suggests otherwise.
- isn't this related to BZ entry #3618? That is mentioned neither here nor in
  the bug report...

_If_ we'd go down this way then the following const name and formatting would
make this easier to read IMO:

const LFH_GENERAL_PURPOSE_FLAGS: u16 = (1 << 3) // we place crc32 in data descriptor
    | (1 << 11); // EFS, mark filenames & comments as UTF-8 (not guaranteed but more often OK than CP437)
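
To make the "do not claim UTF-8 unless we are sure" angle a bit more concrete,
here is a minimal sketch of one possible direction: decide bit 11 per entry
instead of using one global constant. This is only an illustration under
assumptions; the helper `general_purpose_flags` and the two `FLAG_*` constants
are hypothetical names and do not exist in pbs-tools/src/zip.rs, and it does
not cover the extra-field approach mentioned above.

```rust
/// Bit 3: CRC-32 and sizes are stored in a data descriptor after the
/// compressed data, not in the local file header.
const FLAG_DATA_DESCRIPTOR: u16 = 1 << 3;
/// Bit 11 (EFS): file name and comment are encoded as UTF-8.
const FLAG_EFS: u16 = 1 << 11;

/// Hypothetical helper (an assumption, not existing pbs-tools code):
/// pick the general purpose flags for a single entry, setting the EFS
/// bit only if the raw file name bytes are valid UTF-8.
fn general_purpose_flags(file_name: &[u8]) -> u16 {
    let mut flags = FLAG_DATA_DESCRIPTOR;
    if std::str::from_utf8(file_name).is_ok() {
        flags |= FLAG_EFS;
    }
    flags
}

fn main() {
    // "ö" encoded as UTF-8 (0xC3 0xB6) is valid, so bit 11 is set.
    assert_eq!(general_purpose_flags(&[0xC3, 0xB6]), 0x0808);
    // A lone 0xF6 (Latin-1 "ö") is not valid UTF-8, so only bit 3 is set.
    assert_eq!(general_purpose_flags(&[0xF6]), 0x0008);
}
```

With this, names that are not valid UTF-8 would fall back to the "bit unset"
case, where a reader is free to interpret them as CP437; whether that is
actually better for Windows users would still need testing.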