public inbox for pve-devel@lists.proxmox.com
* [pve-devel] [RFC PATCH storage] Plugins: en/decode notes as UTF-8
@ 2022-03-08 14:41 Dominik Csapak
  2022-03-08 18:10 ` Thomas Lamprecht
  0 siblings, 1 reply; 3+ messages in thread
From: Dominik Csapak @ 2022-03-08 14:41 UTC
  To: pve-devel

When writing into the file, explicitly utf8 encode it, and then try to
utf8 decode it on read.

If the notes are not valid utf8, we assume it was an iso-8859 comment
and return it as it was.

Technically this is a breaking change, since there are iso-8859 comments
that would sucessfully decode as utf8, for example:
the byte sequence "C2 A9" would be "Â©" in iso, but would decode to "©".

From what i can tell though, this is rather unlikely to happen for
"real world" notes, because the first byte would be in the range of
C0-F7 (which are mostly language dependent characters like "Â")
and the following bytes would have to be in the range of
80-BF, which are only special characters like "£" (or undefined)

Signed-off-by: Dominik Csapak <d.csapak@proxmox.com>
---
we may want to have this 'try_decode_utf8' in PVE::Tools i guess?
i just put it here for the RFC, so it's easier to review
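
A minimal sketch of the collision described in the commit message, assuming
only the core Encode module. The helper name decode_or_keep is hypothetical,
and the LEAVE_SRC flag is an addition that is not in the patch; it only keeps
decode() from overwriting its input when a CHECK value is given.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: strict UTF-8 decode, else hand back the bytes untouched.
# LEAVE_SRC is an assumption added here so that decode() never modifies $bytes.
sub decode_or_keep {
    my ($bytes) = @_;
    return eval { decode('UTF-8', $bytes, Encode::FB_CROAK() | Encode::LEAVE_SRC()) } // $bytes;
}

my $ambiguous = "\xC2\xA9";   # "Â©" when read as ISO-8859-1, "©" when read as UTF-8
my $latin1    = "caf\xE9";    # "café" in ISO-8859-1, not valid UTF-8

my $as_utf8 = decode_or_keep($ambiguous);          # decodes, so the old meaning is lost
my $as_iso  = decode('ISO-8859-1', $ambiguous);    # what an ISO-8859-1 reader would see
my $kept    = decode_or_keep($latin1);             # strict decode fails, bytes are kept

printf "utf8: %d char(s), first is U+%04X\n", length($as_utf8), ord($as_utf8);
printf "iso:  %d char(s)\n", length($as_iso);
printf "fallback kept bytes: %s\n", $kept eq $latin1 ? "yes" : "no";
```

Only notes whose high bytes happen to form complete UTF-8 sequences change
meaning; anything else takes the fallback path and is returned as it was.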

 PVE/Storage.pm           | 17 +++++++++++++++++
 PVE/Storage/DirPlugin.pm |  9 +++++++--
 PVE/Storage/Plugin.pm    |  2 +-
 3 files changed, 25 insertions(+), 3 deletions(-)

diff --git a/PVE/Storage.pm b/PVE/Storage.pm
index b1d31bb..4335ee9 100755
--- a/PVE/Storage.pm
+++ b/PVE/Storage.pm
@@ -14,6 +14,7 @@ use File::Path;
 use Cwd 'abs_path';
 use Socket;
 use Time::Local qw(timelocal);
+use Encode qw(decode);
 
 use PVE::Tools qw(run_command file_read_firstline dir_glob_foreach $IPV6RE);
 use PVE::Cluster qw(cfs_read_file cfs_write_file cfs_lock_file);
@@ -2077,4 +2078,20 @@ sub normalize_content_filename {
     return $filename;
 }
 
+sub try_decode_utf8 {
+    my ($data) = @_;
+
+    my $decoded = eval {
+	decode('UTF-8', $data, 1);
+    };
+
+    if (!defined($decoded)) {
+	# we could not decode, it's probably iso-8859,
+	# so return original value
+	return $data;
+    }
+
+    return $decoded;
+}
+
 1;
diff --git a/PVE/Storage/DirPlugin.pm b/PVE/Storage/DirPlugin.pm
index c60818b..bc559e6 100644
--- a/PVE/Storage/DirPlugin.pm
+++ b/PVE/Storage/DirPlugin.pm
@@ -7,6 +7,7 @@ use Cwd;
 use File::Path;
 use IO::File;
 use POSIX;
+use Encode qw(encode);
 
 use PVE::Storage::Plugin;
 use PVE::JSONSchema qw(get_standard_option);
@@ -103,7 +104,10 @@ sub get_volume_notes {
     my $path = $class->filesystem_path($scfg, $volname);
     $path .= $class->SUPER::NOTES_EXT;
 
-    return PVE::Tools::file_get_contents($path) if -f $path;
+    if (-f $path) {
+	my $data = PVE::Tools::file_get_contents($path);
+	return PVE::Storage::try_decode_utf8($data);
+    }
 
     return '';
 }
@@ -120,7 +124,8 @@ sub update_volume_notes {
     $path .= $class->SUPER::NOTES_EXT;
 
     if (defined($notes) && $notes ne '') {
-	PVE::Tools::file_set_contents($path, $notes);
+	my $encoded = encode('UTF-8', $notes);
+	PVE::Tools::file_set_contents($path, $encoded);
     } else {
 	unlink $path or $! == ENOENT or die "could not delete notes - $!\n";
     }
diff --git a/PVE/Storage/Plugin.pm b/PVE/Storage/Plugin.pm
index a6b0bdd..edec516 100644
--- a/PVE/Storage/Plugin.pm
+++ b/PVE/Storage/Plugin.pm
@@ -1172,7 +1172,7 @@ my $get_subdir_files = sub {
 	    my $notes_fn = $original.NOTES_EXT;
 	    if (-f $notes_fn) {
 		my $notes = PVE::Tools::file_read_firstline($notes_fn);
-		$info->{notes} = $notes if defined($notes);
+		$info->{notes} = PVE::Storage::try_decode_utf8($notes) if defined($notes);
 	    }
 
 	    $info->{protected} = 1 if -e PVE::Storage::protection_file_path($original);
-- 
2.30.2






* Re: [pve-devel] [RFC PATCH storage] Plugins: en/decode notes as UTF-8
  2022-03-08 14:41 [pve-devel] [RFC PATCH storage] Plugins: en/decode notes as UTF-8 Dominik Csapak
@ 2022-03-08 18:10 ` Thomas Lamprecht
  2022-03-09  7:30   ` Dominik Csapak
  0 siblings, 1 reply; 3+ messages in thread
From: Thomas Lamprecht @ 2022-03-08 18:10 UTC
  To: Proxmox VE development discussion, Dominik Csapak

On 08.03.22 15:41, Dominik Csapak wrote:
> When writing into the file, explicitly utf8 encode it, and then try to
> utf8 decode it on read.
> 
> If the notes are not valid utf8, we assume it was an iso-8859 comment
> and return it as it was.
> 
> Technically this is a breaking change, since there are iso-8859 comments
> that would sucessfully decode as utf8, for example:

s/sucessfully/successfully/

> the byte sequence "C2 A9" would be "Â©" in iso, but would decode to "©".
> 
> From what i can tell though, this is rather unlikely to happen for
> "real world" notes, because the first byte would be in the range of
> C0-F7 (which are mostly language dependent characters like "Â")
> and the following bytes would have to be in the range of
> 80-BF, which are only special characters like "£" (or undefined)

IMO it's a bit strange to try to reason about free-form content that end users can
edit, that is hardly going to be right, but oh well, you made it sound like it really
is more of an edge case and I'd like to avoid versioning comment notes, so fine for me.

> 
> Signed-off-by: Dominik Csapak <d.csapak@proxmox.com>
> ---
> we may want to have this 'try_decode_utf8' in PVE::Tools i guess?
> i just put it here for the RFC, so it's easier to review

meh, it's hardly any complicated logic, just calling into Encode and falling
back, but yeah the version below makes it seem a bit bloated, you made a one
liner expand into 14 ^^

> 
>  PVE/Storage.pm           | 17 +++++++++++++++++
>  PVE/Storage/DirPlugin.pm |  9 +++++++--
>  PVE/Storage/Plugin.pm    |  2 +-
>  3 files changed, 25 insertions(+), 3 deletions(-)
> 
> diff --git a/PVE/Storage.pm b/PVE/Storage.pm
> index b1d31bb..4335ee9 100755
> --- a/PVE/Storage.pm
> +++ b/PVE/Storage.pm
> @@ -14,6 +14,7 @@ use File::Path;
>  use Cwd 'abs_path';
>  use Socket;
>  use Time::Local qw(timelocal);
> +use Encode qw(decode);
>  
>  use PVE::Tools qw(run_command file_read_firstline dir_glob_foreach $IPV6RE);
>  use PVE::Cluster qw(cfs_read_file cfs_write_file cfs_lock_file);
> @@ -2077,4 +2078,20 @@ sub normalize_content_filename {
>      return $filename;
>  }
>  
> +sub try_decode_utf8 {
> +    my ($data) = @_;
> +
> +    my $decoded = eval {
> +	decode('UTF-8', $data, 1);
> +    };

assignment evals should be on a single line if the text width allows it

> +
> +    if (!defined($decoded)) {
> +	# we could not decode, it's probably iso-8859,
> +	# so return original value

please stop always breaking up comments that early

> +	return $data;
> +    }
> +
> +    return $decoded;
> +}
> +

In general, why not just inline it? The following would be just as good as the whole
14 line method here...

my $foo = eval { decode('UTF-8', $data, 1) } // $data;


And if we want it centrally, then we want a set/get_notes helper somewhere around
that does the note-exists check + encode stuff, but as it is all very central for now
and churn is not /that/ likely, I'd slightly favor just in-lining it..
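
For illustration only, a rough sketch of what such a set/get_notes pair could
look like. The names set_notes and get_notes, the plain file I/O (instead of
the PVE::Tools helpers) and the LEAVE_SRC flag are all assumptions made for
this sketch, not existing PVE code.

```perl
use strict;
use warnings;
use Encode qw(decode encode);
use File::Temp qw(tempdir);

# Hypothetical helper: store notes as UTF-8 bytes, remove the file when empty.
sub set_notes {
    my ($path, $notes) = @_;
    if (defined($notes) && $notes ne '') {
        open(my $fh, '>:raw', $path) or die "could not write notes - $!\n";
        print {$fh} encode('UTF-8', $notes);
        close($fh) or die "could not write notes - $!\n";
    } else {
        unlink $path or $!{ENOENT} or die "could not delete notes - $!\n";
    }
}

# Hypothetical helper: note-exists check, strict decode, fall back to raw bytes.
sub get_notes {
    my ($path) = @_;
    return '' if !-f $path;
    open(my $fh, '<:raw', $path) or die "could not read notes - $!\n";
    my $data = do { local $/; <$fh> } // '';
    close($fh);
    return eval { decode('UTF-8', $data, Encode::FB_CROAK() | Encode::LEAVE_SRC()) } // $data;
}

my $path = tempdir(CLEANUP => 1) . "/backup.notes";
set_notes($path, "caf\x{e9} \x{263a}");    # character string with non-ASCII content
my $roundtrip = get_notes($path);
print "round trip ok\n" if $roundtrip eq "caf\x{e9} \x{263a}";
```

With both directions in one place, callers would only ever see character
strings, and the legacy fallback would live in a single function.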

>  1;
> diff --git a/PVE/Storage/DirPlugin.pm b/PVE/Storage/DirPlugin.pm
> index c60818b..bc559e6 100644
> --- a/PVE/Storage/DirPlugin.pm
> +++ b/PVE/Storage/DirPlugin.pm
> @@ -7,6 +7,7 @@ use Cwd;
>  use File::Path;
>  use IO::File;
>  use POSIX;
> +use Encode qw(encode);
>  
>  use PVE::Storage::Plugin;
>  use PVE::JSONSchema qw(get_standard_option);
> @@ -103,7 +104,10 @@ sub get_volume_notes {
>      my $path = $class->filesystem_path($scfg, $volname);
>      $path .= $class->SUPER::NOTES_EXT;
>  
> -    return PVE::Tools::file_get_contents($path) if -f $path;
> +    if (-f $path) {
> +	my $data = PVE::Tools::file_get_contents($path);
> +	return PVE::Storage::try_decode_utf8($data);

return eval { decode('UTF-8', $data, 1) } // $data;

> +    }
>  
>      return '';
>  }
> @@ -120,7 +124,8 @@ sub update_volume_notes {
>      $path .= $class->SUPER::NOTES_EXT;
>  
>      if (defined($notes) && $notes ne '') {
> -	PVE::Tools::file_set_contents($path, $notes);
> +	my $encoded = encode('UTF-8', $notes);
> +	PVE::Tools::file_set_contents($path, $encoded);
>      } else {
>  	unlink $path or $! == ENOENT or die "could not delete notes - $!\n";
>      }
> diff --git a/PVE/Storage/Plugin.pm b/PVE/Storage/Plugin.pm
> index a6b0bdd..edec516 100644
> --- a/PVE/Storage/Plugin.pm
> +++ b/PVE/Storage/Plugin.pm
> @@ -1172,7 +1172,7 @@ my $get_subdir_files = sub {
>  	    my $notes_fn = $original.NOTES_EXT;
>  	    if (-f $notes_fn) {
>  		my $notes = PVE::Tools::file_read_firstline($notes_fn);
> -		$info->{notes} = $notes if defined($notes);
> +		$info->{notes} = PVE::Storage::try_decode_utf8($notes) if defined($notes);

$info->{notes} = eval { decode('UTF-8', $notes, 1) } // $notes if defined($notes);

>  	    }
>  
>  	    $info->{protected} = 1 if -e PVE::Storage::protection_file_path($original);






* Re: [pve-devel] [RFC PATCH storage] Plugins: en/decode notes as UTF-8
  2022-03-08 18:10 ` Thomas Lamprecht
@ 2022-03-09  7:30   ` Dominik Csapak
  0 siblings, 0 replies; 3+ messages in thread
From: Dominik Csapak @ 2022-03-09  7:30 UTC
  To: Thomas Lamprecht, Proxmox VE development discussion

On 3/8/22 19:10, Thomas Lamprecht wrote:
> On 08.03.22 15:41, Dominik Csapak wrote:
>> When writing into the file, explicitly utf8 encode it, and then try to
>> utf8 decode it on read.
>>
>> If the notes are not valid utf8, we assume it was an iso-8859 comment
>> and return it as it was.
>>
>> Technically this is a breaking change, since there are iso-8859 comments
>> that would sucessfully decode as utf8, for example:
> 
> s/sucessfully/successfully/
> 
>> the byte sequence "C2 A9" would be "Â©" in iso, but would decode to "©".
>>
>>  From what i can tell though, this is rather unlikely to happen for
>> "real world" notes, because the first byte would be in the range of
>> C0-F7 (which are mostly language dependent characters like "Â")
>> and the following bytes would have to be in the range of
>> 80-BF, which are only special characters like "£" (or undefined)
> 
> IMO it's a bit strange to try to reason about free-form content that end users can
> edit, that is hardly going to be right, but oh well, you made it sound like it really
> is more of an edge case and I'd like to avoid versioning comment notes, so fine for me.
> 

yeah, i originally did not want this solution (because some valid input
would be wrongly decoded), but we have encountered the same problem about 5
times now (mostly in pmg), and this time i took the time to look deeper into
which iso-8859 combinations would actually decode as valid utf8. it turns
out that there are not really any sensible ones. i guess there will be
at least some people affected by this, but we can simply tell them
to enter the comment again and it will be fixed.
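
A small sketch to put a number on this for the two-byte case, assuming only
the core Encode module (LEAVE_SRC is again an addition, not part of the patch).
It tries every pair of high bytes against the strict decoder. Strict UTF-8
allows lead bytes C2-DF followed by 80-BF for two-byte sequences, so the count
should come out as 30 * 64 = 1920 of the 16384 possible pairs.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Count how many pairs of high bytes (0x80-0xFF each) are valid strict UTF-8.
my ($valid, $total) = (0, 0);
for my $first (0x80 .. 0xFF) {
    for my $second (0x80 .. 0xFF) {
        my $bytes = pack('C2', $first, $second);
        $total++;
        # FB_CROAK makes decode() die on malformed input, so eval yields undef then.
        my $decoded = eval { decode('UTF-8', $bytes, Encode::FB_CROAK() | Encode::LEAVE_SRC()) };
        $valid++ if defined($decoded);
    }
}
printf "%d of %d high-byte pairs decode as UTF-8\n", $valid, $total;
```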

>>
>> Signed-off-by: Dominik Csapak <d.csapak@proxmox.com>
>> ---
>> we may want to have this 'try_decode_utf8' in PVE::Tools i guess?
>> i just put it here for the RFC, so it's easier to review
> 
> meh, it's hardly any complicated logic, just calling into Encode and falling
> back, but yeah the version below makes it seem a bit bloated, you made a one
> liner expand into 14 ^^
> 
>>
>>   PVE/Storage.pm           | 17 +++++++++++++++++
>>   PVE/Storage/DirPlugin.pm |  9 +++++++--
>>   PVE/Storage/Plugin.pm    |  2 +-
>>   3 files changed, 25 insertions(+), 3 deletions(-)
>>
>> diff --git a/PVE/Storage.pm b/PVE/Storage.pm
>> index b1d31bb..4335ee9 100755
>> --- a/PVE/Storage.pm
>> +++ b/PVE/Storage.pm
>> @@ -14,6 +14,7 @@ use File::Path;
>>   use Cwd 'abs_path';
>>   use Socket;
>>   use Time::Local qw(timelocal);
>> +use Encode qw(decode);
>>   
>>   use PVE::Tools qw(run_command file_read_firstline dir_glob_foreach $IPV6RE);
>>   use PVE::Cluster qw(cfs_read_file cfs_write_file cfs_lock_file);
>> @@ -2077,4 +2078,20 @@ sub normalize_content_filename {
>>       return $filename;
>>   }
>>   
>> +sub try_decode_utf8 {
>> +    my ($data) = @_;
>> +
>> +    my $decoded = eval {
>> +	decode('UTF-8', $data, 1);
>> +    };
> 
> assignment evals should be on a single line if the text width allows it
> 
>> +
>> +    if (!defined($decoded)) {
>> +	# we could not decode, it's probably iso-8859,
>> +	# so return original value
> 
> please stop always breaking up comments that early
> 
>> +	return $data;
>> +    }
>> +
>> +    return $decoded;
>> +}
>> +
> 
> In general, why not just inline it? The following would be just as good as the whole
> 14 line method here...
> 
> my $foo = eval { decode('UTF-8', $data, 1) } // $data;
> 
> 
> And if we want it centrally, then we want a set/get_notes helper somewhere around
> that does the note-exists check + encode stuff, but as it is all very central for now
> and churn is not /that/ likely, I'd slightly favor just in-lining it..

sorry, i was so caught up in wanting to make this *very* explicit that
i overlooked the obvious one-liner. my intention was to have a single
place where we do this with a short explanation (since we also want
to do this in some pmg cases), but inlining it with a good commit message
is more than enough...

thanks for pointing it out!
sending a v1 shortly

> 
>>   1;
>> diff --git a/PVE/Storage/DirPlugin.pm b/PVE/Storage/DirPlugin.pm
>> index c60818b..bc559e6 100644
>> --- a/PVE/Storage/DirPlugin.pm
>> +++ b/PVE/Storage/DirPlugin.pm
>> @@ -7,6 +7,7 @@ use Cwd;
>>   use File::Path;
>>   use IO::File;
>>   use POSIX;
>> +use Encode qw(encode);
>>   
>>   use PVE::Storage::Plugin;
>>   use PVE::JSONSchema qw(get_standard_option);
>> @@ -103,7 +104,10 @@ sub get_volume_notes {
>>       my $path = $class->filesystem_path($scfg, $volname);
>>       $path .= $class->SUPER::NOTES_EXT;
>>   
>> -    return PVE::Tools::file_get_contents($path) if -f $path;
>> +    if (-f $path) {
>> +	my $data = PVE::Tools::file_get_contents($path);
>> +	return PVE::Storage::try_decode_utf8($data);
> 
> return eval { decode('UTF-8', $data, 1) } // $data;
> 
>> +    }
>>   
>>       return '';
>>   }
>> @@ -120,7 +124,8 @@ sub update_volume_notes {
>>       $path .= $class->SUPER::NOTES_EXT;
>>   
>>       if (defined($notes) && $notes ne '') {
>> -	PVE::Tools::file_set_contents($path, $notes);
>> +	my $encoded = encode('UTF-8', $notes);
>> +	PVE::Tools::file_set_contents($path, $encoded);
>>       } else {
>>   	unlink $path or $! == ENOENT or die "could not delete notes - $!\n";
>>       }
>> diff --git a/PVE/Storage/Plugin.pm b/PVE/Storage/Plugin.pm
>> index a6b0bdd..edec516 100644
>> --- a/PVE/Storage/Plugin.pm
>> +++ b/PVE/Storage/Plugin.pm
>> @@ -1172,7 +1172,7 @@ my $get_subdir_files = sub {
>>   	    my $notes_fn = $original.NOTES_EXT;
>>   	    if (-f $notes_fn) {
>>   		my $notes = PVE::Tools::file_read_firstline($notes_fn);
>> -		$info->{notes} = $notes if defined($notes);
>> +		$info->{notes} = PVE::Storage::try_decode_utf8($notes) if defined($notes);
> 
> $info->{notes} = eval { decode('UTF-8', $notes, 1) } // $notes if defined($notes);
> 
>>   	    }
>>   
>>   	    $info->{protected} = 1 if -e PVE::Storage::protection_file_path($original);
> 




