From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from firstgate.proxmox.com (firstgate.proxmox.com [212.224.123.68]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (2048 bits)) (No client certificate requested) by lists.proxmox.com (Postfix) with ESMTPS id A44B2692D2 for ; Mon, 13 Sep 2021 09:26:32 +0200 (CEST) Received: from firstgate.proxmox.com (localhost [127.0.0.1]) by firstgate.proxmox.com (Proxmox) with ESMTP id 988F625DB8 for ; Mon, 13 Sep 2021 09:26:32 +0200 (CEST) Received: from proxmox-new.maurer-it.com (proxmox-new.maurer-it.com [94.136.29.106]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (2048 bits)) (No client certificate requested) by firstgate.proxmox.com (Proxmox) with ESMTPS id 8D7E325DAD for ; Mon, 13 Sep 2021 09:26:31 +0200 (CEST) Received: from proxmox-new.maurer-it.com (localhost.localdomain [127.0.0.1]) by proxmox-new.maurer-it.com (Proxmox) with ESMTP id 5E96B44790 for ; Mon, 13 Sep 2021 09:26:31 +0200 (CEST) Message-ID: <916d18cc-b4ac-ff07-e0fe-6c4dab61105d@proxmox.com> Date: Mon, 13 Sep 2021 09:25:52 +0200 MIME-Version: 1.0 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:93.0) Gecko/20100101 Thunderbird/93.0 Content-Language: en-US To: Dominik Csapak , Proxmox Backup Server development discussion References: <20210910090948.2145523-1-d.csapak@proxmox.com> <60f030f0-c15a-0b2f-6bf6-1a243f042f0c@proxmox.com> <758f28ef-4c6c-ccaf-54a7-17e4c0f8b10c@proxmox.com> From: Thomas Lamprecht In-Reply-To: <758f28ef-4c6c-ccaf-54a7-17e4c0f8b10c@proxmox.com> Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: quoted-printable X-SPAM-LEVEL: Spam detection results: 0 AWL 2.074 Adjusted score from AWL reputation of From: address BAYES_00 -1.9 Bayes spam probability is 0 to 1% KAM_DMARC_STATUS 0.01 Test Rule for DKIM or SPF Failure with Strict Alignment NICE_REPLY_A -3.584 Looks like a legit reply (A) SPF_HELO_NONE 0.001 SPF: HELO does not publish an SPF Record SPF_PASS -0.001 SPF: sender matches SPF record Subject: Re: [pbs-devel] [RFC PATCH proxmox-backup] pbs-tools: zip: add EFS flag to zip files X-BeenThere: pbs-devel@lists.proxmox.com X-Mailman-Version: 2.1.29 Precedence: list List-Id: Proxmox Backup Server development discussion List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 13 Sep 2021 07:26:32 -0000 On 13.09.21 09:14, Dominik Csapak wrote: > On 9/11/21 17:08, Thomas Lamprecht wrote: >> On 10.09.21 11:09, Dominik Csapak wrote: >> Also interesting, just below above quote: >> >>> D.3 Applications MAY choose to supplement this file name storage thro= ugh the use >>> of the 0x0008 Extra Field.=C2=A0 Storage for this optional field is c= urrently >>> undefined, however it will be used to allow storing extended informat= ion >>> on source or target encoding that MAY further assist applications wit= h file >>> name, or file content encoding tasks.=C2=A0 Please contact PKWARE wit= h any >>> requirements on how this field SHOULD be used. >=20 > AFAIU, that part of the 'spec' is not open and would require a license = of PKWARE yeah that's what I figured, but observe and replicate could in theory sti= ll be OK I guess? >=20 >> >> >> So I'd like to know what standard tools like info-zip (i.e., Debian's = "zip" package) or >> other cross-platform tools like 7zip do. >> >> It seems that at least Debian's version of info zip had some thoughts = about this and can >> (or always does, did not checked that closely) safe utf8 filenames in = an extra field, one >> that some other tools maybe check for? >> >> https://sources.debian.org/src/zip/3.0-12/zip.c/#L967 >> >> I say Debian's version, as upstream still talks about Unicode support = on their home page, >> which itself may be just outdated too, but it could also be that Debia= n patched that in. >> >> Any how, it seems to me that there'd be some more compatible options t= hat do not plainly >> state that they're 100% utf-8 while actually not being so sure of that= , so I'd explore that >> angle quite some more; data restoration is probably the most important= aspect of a backup >> system - so every way we expose doing so should work as as good as pos= sible - even if going >> outside our Linux bubble. >=20 > I tested it with debians zip (Info-Zip), and despite was the spec says,= > i could not get it to write an 'extra-field' with the filenames. >=20 hmm, meh... > Instead, when it finds a filename that has a filename with any byte tha= t > has the high-bit set (>127) it sets the EFS bit in the filename iff > the filename is valid utf-8, and does nothing if not. >=20 > IMHO, that sounds like a reasonable thing to do, so i'd suggest > that we test each filename for valid UTF-8, and set the bit > if that's true. This marks ASCII only filenames also as UTF-8, > but thats technically true and simplifies the rust code a bit > (and should be faster for UTF-8 Filenames, > since we do not have to check for a high bit first and then try to > convert...) >=20 > Does that sound ok to you? Yes, sounds reasonable to me for nwo. It'd be an improvement and we basic= ally only use this for on-the-fly generation, so if we find a possible "better" (hard t= o measure that with such a mess of format) approach in the future it would help all user= s immediately too then. >=20 > For non-valid UTF-8 filenames that have a high bit i'd produce > CP437 filenames on windows, and on linux it'd just be the > byte value. just to be sure: what does "i'd produce" mean here? As we certainly do no= t differ ZIP generation semantics by user OS (thank goodness), this means that a unzip= by most reasonable tools would produce the outcome you described, or?=20 >=20 >> >>> =C2=A0 pbs-tools/src/zip.rs | 6 ++++-- >>> =C2=A0 1 file changed, 4 insertions(+), 2 deletions(-) >>> >>> diff --git a/pbs-tools/src/zip.rs b/pbs-tools/src/zip.rs >>> index 605480a8..88eea07b 100644 >>> --- a/pbs-tools/src/zip.rs >>> +++ b/pbs-tools/src/zip.rs >>> @@ -34,6 +34,8 @@ const VERSION_MADE_BY: u16 =3D 0x032d; >>> =C2=A0 const ZIP64_EOCD_RECORD: u32 =3D 0x06064B50; >>> =C2=A0 const ZIP64_EOCD_LOCATOR: u32 =3D 0x07064B50; >>> =C2=A0 +const GENERAL_PURUPOSE_FLAGS: u16 =3D (1 << 3) | (1 << 11); /= / EFS + Data Descriptor >>> + >> >> - typo in constant name: purupose vs. purpose >=20 > yeah thanks... >=20 >> - comment order do not match the bits used, bit 11 is EFS and bit 3 is= telling >> =C2=A0=C2=A0 the parser that the crc32 is not in the header but in the= data descriptor after >> =C2=A0=C2=A0 the compressed data; your bitwise-OR+comment order sugges= ts different. >=20 > sorry >=20 >> - isn't this related to BZ entry #3618, but that is neither mentioned = here nor in the >> =C2=A0=C2=A0 bug report... >=20 > that bug-report wasn't there when i wrote the patch. I naturally only compared mail vs BZ time and missed the date difference = of one day, argh, sorry. >=20 >> >> _If_ we'd go down this way then the following const name and formattin= g would make this >> easier to read IMO: >> >> const LFH_GENERAL_PURPOSE_FLAGS: u16 =3D (1 << 3) // we place crc32 in= data descriptor >> =C2=A0=C2=A0=C2=A0=C2=A0 | (1 << 11); // EFS, mark filenames & comment= s as UTF-8 (not guaranteed but more often OK than CP437) >> >=20 > makes, sense, though if we only set it conditionally i'd split the > EFS_MARK into its own constant. >=20 sounds good to me.