From: Wolfgang Bumiller
To: Lukas Wagner
Cc: Dominik Csapak, Proxmox VE development discussion, Maximiliano Sandoval
Date: Wed, 19 Jul 2023 11:54:25 +0200
Subject: Re: [pve-devel] [PATCH v3 many 00/66] fix #4156: introduce new notification system
References: <20230717150051.710464-1-l.wagner@proxmox.com>

On Wed, Jul 19, 2023 at 10:40:09AM +0200, Lukas Wagner wrote:
> Hi again,
>
> On 7/18/23 14:34, Dominik Csapak wrote:
> > * i found one bug, but not quite sure yet where it comes from exactly,
> >   putting in emojis into a field (e.g. a comment or author) it's accepted,
> >   but editing a different entry fails with:
> >
> > --->8---
> > could not serialize configuration: writing 'notifications.cfg' failed: detected unexpected control character in section 'testgroup' key 'comment' (500)
> > ---8<---
> >
> > not sure where the utf-8 info gets lost. (or we could limit all fields to ascii?)
> > such a notification target still works AFAICT (but if set as e.g. the author it's
> > probably the wrong value)
> >
> > (i used 😀 as a test)
>
> So I investigated a bit and found a minimal reproducer. Turns out it's an encoding issue
> in the FFI interface (perl->rust).
>
> Let's assume that we have the following exported function in the pve-rs bindings:
>
> #[export]
> fn test_emoji(name: &str) {
>     dbg!(&name);
> }
>
>
>
> use PVE::RS::Notify;
> my $str = "😊";

Without `use utf8;`, this produces a "byte string":

    $ perl -MDevel::Peek -e 'my $str = "😊"; Dump($str);'
    SV = PV(0x5576f4e0cea0) at 0x5576f4e39370
      REFCNT = 1
      FLAGS = (POK,IsCOW,pPOK)
      PV = 0x5576f4e424d0 "\xF0\x9F\x98\x8A"\0
      CUR = 4
      LEN = 10
      COW_REFCNT = 1

Note that \xF0\x9F\x98\x8A.

> PVE::RS::Notify::test_emoji($str);
>
>
> root@pve:~# perl test.pl
> [src/notify.rs:562] &name = "ð\u{9f}\u{98}\u{8a}"

Note the `\u` portions here. This string contains the *UTF-8* characters
0xF0, 0x9F, 0x98, 0x8A. And how is it supposed to know any better.
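
To make the above concrete, here is a minimal standalone Rust sketch of what
the `dbg!` output suggests is happening: each of the four bytes is treated as
a character of its own and widened to the code point with the same value.
The helper name `bytes_as_latin1` is made up for illustration; this is not
taken from perlmod or pve-rs, and it only mimics the observed result.

```rust
/// Hypothetical helper: treat every byte as its own character by widening
/// it to the Unicode code point with the same value (a Latin-1 style view).
fn bytes_as_latin1(bytes: &[u8]) -> String {
    bytes.iter().map(|&b| char::from(b)).collect()
}

fn main() {
    // The UTF-8 encoding of the emoji from the reproducer: four bytes.
    let raw = "😊".as_bytes();
    assert_eq!(raw, [0xF0u8, 0x9F, 0x98, 0x8A]);

    // Without knowing that these bytes are UTF-8, each one becomes a
    // separate character: U+00F0 ('ð') plus three C1 control characters.
    let widened = bytes_as_latin1(raw);
    assert_eq!(widened.chars().count(), 4);
    assert_eq!(widened, "\u{f0}\u{9f}\u{98}\u{8a}");

    // Debug-print it the way `dbg!` would.
    println!("{:?}", widened);
}
```

The three C1 control characters in that string would also be consistent with
the "unexpected control character" error quoted at the top.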
>
> To me it looks a bit like a UTF-16/UTF-8 mixup:
>
> ð  = 0x00F0 in UTF16
> 😊 = 0xF0 0x9F 0x98 0x8A in UTF-8
>
> The issue can be fixed by doing a `$str = encode('utf-8', $str);` before calling
> `test_emoji`.

Perl and most of our perl code never cared (hence we already ran into a
bunch of utf-8 issues and for a long time did the whole "transport encoding
vs actual encoding" in HTTP vs JS vs json vs perl strings completely *wrong*
(and probably still do)), and a lot of *files* aren't even *defined* to have
a specific encoding (eg. interpreting bytes >0x80 from
`/etc/network/interfaces` as utf-8 may simply be the *wrong* thing to do).

Sure, the perlmod layer could be an issue. But I wouldn't jump to
conclusions there.

Also, note what *actually* happens if you `encode('utf-8', $str)`:

    $ perl -MEncode -MDevel::Peek -e 'my $a = encode("utf-8", "👍"); Dump($a);'
    SV = PV(0x55f62dfe6170) at 0x55f62e012430
      REFCNT = 1
      FLAGS = (POK,IsCOW,pPOK)
      PV = 0x55f62e1cbe60 "\xC3\xB0\xC2\x9F\xC2\x91\xC2\x8D"\0
      CUR = 8
      LEN = 10
      COW_REFCNT = 0

Now you have the UTF-8 encoding of each character in there explicitly.

What you really want would be for perl to acknowledge that you already have
utf-8:

    $ perl -MDevel::Peek -e 'use utf8; my $a = "👍"; Dump($a);'
    SV = PV(0x56265764cea0) at 0x5626576793e8
      REFCNT = 1
      FLAGS = (POK,IsCOW,pPOK,UTF8)
      PV = 0x5626576a2b60 "\xF0\x9F\x91\x8D"\0 [UTF8 "\x{1f44d}"]
      CUR = 4
      LEN = 10
      COW_REFCNT = 1

But we don't use `use utf8;` in our code base because it has too many side
effects.

To mark a utf-8 encoded not-as-utf-8-marked string as utf-8 in perl, you can
*decode* it:

    $ perl -MDevel::Peek -e 'use utf8; no utf8; my $a = "👍"; utf8::decode($a); Dump($a);'
    SV = PV(0x55a5c9c45ea0) at 0x55a5c9c723a8
      REFCNT = 1
      FLAGS = (POK,pPOK,UTF8)
      PV = 0x55a5c9c64280 "\xF0\x9F\x91\x8D"\0 [UTF8 "\x{1f44d}"]
      CUR = 4
      LEN = 10

All that said, I have not yet looked at the perl side (or perlmod side) and
cannot say what's going on.
But if you hand utf-8 *bytes* which aren't marked as utf-8 to perlmod, it'll
do what perl does and just encode each byte as utf-8. *Guessing* that it's
utf-8 would surely work - *this time* - but might simply be *wrong* other
times.
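
As a rough illustration of the two behaviours described above, the following
standalone Rust sketch reproduces the doubled byte sequence from the
`encode("utf-8", ...)` dump and contrasts it with a strict decode that
refuses input which is not valid UTF-8. The function names
`encode_each_byte_as_utf8` and `decode_strict` are made up for this example
and are not part of perlmod; only `std::str::from_utf8` is a real standard
library call.

```rust
/// Hypothetical helper mirroring the effect described above: every input
/// byte is taken as a code point of its own and UTF-8 encoded individually.
fn encode_each_byte_as_utf8(bytes: &[u8]) -> Vec<u8> {
    bytes
        .iter()
        .map(|&b| char::from(b))
        .collect::<String>()
        .into_bytes()
}

/// Hypothetical helper: accept the bytes as UTF-8 only if they validate.
fn decode_strict(bytes: &[u8]) -> Result<&str, std::str::Utf8Error> {
    std::str::from_utf8(bytes)
}

fn main() {
    // UTF-8 bytes of the emoji used in the Devel::Peek examples.
    let thumbs_up = "👍".as_bytes();

    // Encoding each byte separately yields the same eight bytes as the
    // `encode("utf-8", ...)` dump: C3 B0 C2 9F C2 91 C2 8D.
    let doubled = encode_each_byte_as_utf8(thumbs_up);
    assert_eq!(
        doubled,
        vec![0xC3u8, 0xB0, 0xC2, 0x9F, 0xC2, 0x91, 0xC2, 0x8D]
    );

    // Decoding works here because the input really is UTF-8 ...
    assert_eq!(decode_strict(thumbs_up).ok(), Some("👍"));

    // ... but a Latin-1 encoded "café" (0xE9 for 'é') is not valid UTF-8,
    // so assuming UTF-8 for arbitrary bytes cannot be right in general.
    assert!(decode_strict(&[0x63, 0x61, 0x66, 0xE9]).is_err());
}
```

A strict decode at least fails loudly on such input; whether failing,
guessing, or passing raw bytes through is right depends on where the data
came from, which is the point made above.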