From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <t.lamprecht@proxmox.com>
Received: from firstgate.proxmox.com (firstgate.proxmox.com [212.224.123.68])
 (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits)
 key-exchange X25519 server-signature RSA-PSS (2048 bits))
 (No client certificate requested)
 by lists.proxmox.com (Postfix) with ESMTPS id 7525EE528
 for <pve-devel@lists.proxmox.com>; Fri, 22 Apr 2022 18:54:51 +0200 (CEST)
Received: from firstgate.proxmox.com (localhost [127.0.0.1])
 by firstgate.proxmox.com (Proxmox) with ESMTP id 6319729D36
 for <pve-devel@lists.proxmox.com>; Fri, 22 Apr 2022 18:54:21 +0200 (CEST)
Received: from proxmox-new.maurer-it.com (proxmox-new.maurer-it.com
 [94.136.29.106])
 (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits)
 key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest SHA256)
 (No client certificate requested)
 by firstgate.proxmox.com (Proxmox) with ESMTPS id 91A1329D28
 for <pve-devel@lists.proxmox.com>; Fri, 22 Apr 2022 18:54:19 +0200 (CEST)
Received: from proxmox-new.maurer-it.com (localhost.localdomain [127.0.0.1])
 by proxmox-new.maurer-it.com (Proxmox) with ESMTP id 6D03042763
 for <pve-devel@lists.proxmox.com>; Fri, 22 Apr 2022 18:54:19 +0200 (CEST)
Message-ID: <15d074c4-fa81-9d22-1e39-8bb04a250c42@proxmox.com>
Date: Fri, 22 Apr 2022 18:54:17 +0200
MIME-Version: 1.0
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:100.0) Gecko/20100101
 Thunderbird/100.0
Content-Language: en-US
To: Proxmox VE development discussion <pve-devel@lists.proxmox.com>,
 Matthias Heiserer <m.heiserer@proxmox.com>
References: <20220422144358.3217098-1-m.heiserer@proxmox.com>
From: Thomas Lamprecht <t.lamprecht@proxmox.com>
In-Reply-To: <20220422144358.3217098-1-m.heiserer@proxmox.com>
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 7bit
X-SPAM-LEVEL: Spam detection results:  0
 AWL 1.627 Adjusted score from AWL reputation of From: address
 BAYES_00                 -1.9 Bayes spam probability is 0 to 1%
 KAM_DMARC_STATUS 0.01 Test Rule for DKIM or SPF Failure with Strict Alignment
 NICE_REPLY_A           -3.185 Looks like a legit reply (A)
 SPF_HELO_NONE           0.001 SPF: HELO does not publish an SPF Record
 SPF_PASS               -0.001 SPF: sender matches SPF record
 URIBL_BLOCKED 0.001 ADMINISTRATOR NOTICE: The query to URIBL was blocked. See
 http://wiki.apache.org/spamassassin/DnsBlocklists#dnsbl-block for more
 information. [anyevent.pm]
Subject: Re: [pve-devel] [PATCH v1 http-server 1/2] AnyEvent: fix #3990 -
 rewrite file upload
X-BeenThere: pve-devel@lists.proxmox.com
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: Proxmox VE development discussion <pve-devel.lists.proxmox.com>
List-Unsubscribe: <https://lists.proxmox.com/cgi-bin/mailman/options/pve-devel>, 
 <mailto:pve-devel-request@lists.proxmox.com?subject=unsubscribe>
List-Archive: <http://lists.proxmox.com/pipermail/pve-devel/>
List-Post: <mailto:pve-devel@lists.proxmox.com>
List-Help: <mailto:pve-devel-request@lists.proxmox.com?subject=help>
List-Subscribe: <https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel>, 
 <mailto:pve-devel-request@lists.proxmox.com?subject=subscribe>
X-List-Received-Date: Fri, 22 Apr 2022 16:54:51 -0000

On 22.04.22 16:43, Matthias Heiserer wrote:
> Uploading of files of arbitrary size is now possible.
> 
> Some (regex) lines are long, but imo more readable than splitting them
> up into multiple lines, as that's how they look in the HTTP request.
> 

This commit message lacks:
* why was it limited before (to big, to small). I'd like to avoid the need to
  search through bug reports for the basic info.
* what is done how to improve on that
* does that have any drawbacks, if an short argument would be great, but in
  any case stating explicitly what was chosen consciously and not just missed
  by accident should be done
* what did you actually test? Is there a good upload test/fuzz suite to run this
  through? can we create 

While the regex comment is borderline on meta-info, i.e., relevant for review only.

Did only some lightweight review as I don't see why I should invest the time to research
basic info myself that you had to do for the implementation anyway, at least you should have.

So, please consider adding a bit more info to your commit message, it don't needs to
be hundreds of words, but tbh I never saw a too long commit message and I have to spend
quite some time in `git log` ;-)

> Signed-off-by: Matthias Heiserer <m.heiserer@proxmox.com>
> ---
>  src/PVE/APIServer/AnyEvent.pm | 152 +++++++++++++++++-----------------
>  1 file changed, 78 insertions(+), 74 deletions(-)
> 
> diff --git a/src/PVE/APIServer/AnyEvent.pm b/src/PVE/APIServer/AnyEvent.pm
> index 7dd7d2d..ade3c05 100644
> --- a/src/PVE/APIServer/AnyEvent.pm
> +++ b/src/PVE/APIServer/AnyEvent.pm
> @@ -1157,64 +1157,74 @@ sub handle_request {
>  
>  sub file_upload_multipart {
>      my ($self, $reqstate, $auth, $method, $path, $rstate) = @_;
> -

unnecessary noise

>      eval {
>  	my $boundary = $rstate->{boundary};
>  	my $hdl = $reqstate->{hdl};
> -
unnecessary noise

>  	my $startlen = length($hdl->{rbuf});
>  
> -	if ($rstate->{phase} == 0) { # skip everything until start
> -	    if ($hdl->{rbuf} =~ s/^.*?--\Q$boundary\E  \015?\012
> -                       ((?:[^\015]+\015\012)* ) \015?\012//xs) {
> -		my $header = $1;
> -		my ($ct, $disp, $name, $filename);
> -		foreach my $line (split(/\015?\012/, $header)) {
> -		    # assume we have single line headers
> -		    if ($line =~ m/^Content-Type\s*:\s*(.*)/i) {
> -			$ct = parse_content_type($1);
> -		    } elsif ($line =~ m/^Content-Disposition\s*:\s*(.*)/i) {
> -			($disp, $name, $filename) = parse_content_disposition($1);
> -		    }
> -		}
> +	my $newline = qr/\015?\012/;
> +	my $delimiter = qr/--\Q$boundary\E${newline}/;
> +	my $closeDelimiter = qr/--\Q$boundary\E--${newline}/;
>  
> -		if (!($disp && $disp eq 'form-data' && $name)) {
> -		    syslog('err', "wrong content disposition in multipart - abort upload");
> -		    $rstate->{phase} = -1;
> -		} else {
> +	my $check_disposition = sub {
> +	    my ($disp) = @_;
> +	    die "wrong Content-Disposition in multipart, expected `form-data` - abort upload"
> +		if $disp ne 'form-data';
> +	};
>  
> -		    $rstate->{fieldname} = $name;
> +	# Phase 0 - preserve boundary, but remove everything before
> +	if ($rstate->{phase} == 0 && $hdl->{rbuf} =~ s/^.*?($delimiter)/$1/xs) {
> +	    $rstate->{read} += $startlen - length($hdl->{rbuf});
> +	    $rstate->{phase} = 1;
> +	}
>  
> -		    if ($filename) {
> -			if ($name eq 'filename') {
> -			    # found file upload data
> -			    $rstate->{phase} = 1;
> -			    $rstate->{filename} = $filename;
> -			} else {
> -			    syslog('err', "wrong field name for file upload - abort upload");
> -			    $rstate->{phase} = -1;
> -			}
> -		    } else {
> -			# found form data for field $name
> -			$rstate->{phase} = 2;
> -		    }
> -		}
> -	    } else {
> -		my $len = length($hdl->{rbuf});
> -		substr($hdl->{rbuf}, 0, $len - $rstate->{maxheader}, '')
> -		    if $len > $rstate->{maxheader}; # skip garbage
> -	    }
> -	} elsif ($rstate->{phase} == 1) { # inside file - dump until end marker
> -	    if ($hdl->{rbuf} =~ s/^(.*?)\015?\012(--\Q$boundary\E(--)? \015?\012(.*))$/$2/xs) {
> +	# Phase 1 - parse payload without file data
> +	if ($rstate->{phase} == 1 && $hdl->{rbuf} =~
> +	    s/^${delimiter}Content-Disposition: (.*?); name="content"(.*?)($delimiter)/$3/s
> +	) {
> +	    $check_disposition->($1);
> +	    $rstate->{params}->{content} = trim($2);
> +	    syslog('info', "timeout: " . $hdl->{timeout});
> +	}
> +
> +	if ($rstate->{phase} == 1 && $hdl->{rbuf} =~
> +	    s/^${delimiter}Content-Disposition: (.*?); name="checksum-algorithm"(.*?)($delimiter)/$3/s
> +	) {
> +	    $check_disposition->($1);
> +	    $rstate->{params}->{"checksum-algorithm"} = trim($2);
> +	}
> +
> +	if ($rstate->{phase} == 1 && $hdl->{rbuf} =~
> +	    s/^${delimiter}Content-Disposition: (.*?); name="checksum"(.*?)($delimiter)/$3/s
> +	) {
> +	    $check_disposition->($1);
> +	    $rstate->{params}->{checksum} = trim($2);
> +	}
> +
> +	if ($rstate->{phase} == 1 && $hdl->{rbuf} =~
> +	    s/^${delimiter}
> +	    Content-Disposition:\ (.*?);\ name="(.*?)";\ filename="([^"]+)"${newline}
> +	    Content-Type:\ [^\s]*\s+ #remove all whitespace until begin of data
> +	    //sxx
> +	) {
> +	    $check_disposition->($1);
> +	    die "wrong field `name` for file upload, expected `filename` - abort upload"
> +		if $2 ne "filename";
> +	    $rstate->{phase} = 2;
> +	    $rstate->{params}->{filename} = trim($3);
> +	}

why explicitly check for $phase == 1 5 times and not to

if ($rstate->{phase} == 1) {
    if ($hdl->{rbuf} =~ s/^${delimiter}Content-Disposition: (.*?); name="content"(.*?)($delimiter)/$3/s) {
         ...
    }


would result in less noise and clearer code flow...

> +
> +	# Phase 2 - dump content into file
> +	if ($rstate->{phase} == 2) {
> +	    if ($hdl->{rbuf} =~ s/^(.*?)${newline}?+${closeDelimiter}.*$//s) {
>  		my ($rest, $eof) = ($1, $3);
>  		my $len = length($rest);
>  		die "write to temporary file failed - $!"
>  		    if syswrite($rstate->{outfh}, $rest) != $len;
>  		$rstate->{ctx}->add($rest);
> -		$rstate->{params}->{filename} = $rstate->{filename};
>  		$rstate->{md5sum} = $rstate->{ctx}->hexdigest;
>  		$rstate->{bytes} += $len;
> -		$rstate->{phase} =  $eof ? 100 : 0;
> +		$rstate->{phase} =  100;
>  	    } else {
>  		my $len = length($hdl->{rbuf});
>  		my $wlen = $len - $rstate->{boundlen};
> @@ -1226,42 +1236,29 @@ sub file_upload_multipart {
>  		    $rstate->{ctx}->add($data);
>  		}
>  	    }
> -	} elsif ($rstate->{phase} == 2) { # inside normal field
> +	}
>  
> -	    if ($hdl->{rbuf} =~ s/^(.*?)\015?\012(--\Q$boundary\E(--)? \015?\012(.*))$/$2/xs) {
> -		my ($rest, $eof) = ($1, $3);
> -		my $len = length($rest);
> -		$rstate->{post_size} += $len;
> -		if ($rstate->{post_size} < $limit_max_post) {
> -		    $rstate->{params}->{$rstate->{fieldname}} = $rest;
> -		    $rstate->{phase} = $eof ? 100 : 0;
> -		} else {
> -		    syslog('err', "form data to large - abort upload");
> -		    $rstate->{phase} = -1; # skip
> -		}
> -	    }
> -	} else { # skip
> -	    my $len = length($hdl->{rbuf});
> -	    substr($hdl->{rbuf}, 0, $len, ''); # empty rbuf
> +	# Phase 100 - transfer finished
> +	if ($rstate->{phase} == 100) {
> +	    my $elapsed = tv_interval($rstate->{starttime});
> +
> +	    my $rate = int($rstate->{bytes} / ($elapsed * 1024 * 1024));
> +	    syslog('info',
> +		"multipart upload complete (size: %d time: %ds rate: %.2fMiB/s md5sum: %s)",
> +		$rstate->{bytes}, $elapsed, $rate, $rstate->{md5sum}
> +	    );
> +	    $self->handle_api2_request($reqstate, $auth, $method, $path, $rstate);
>  	}
>  
> -	$rstate->{read} += ($startlen - length($hdl->{rbuf}));
> +	$rstate->{read} += $startlen - length($hdl->{rbuf});

unnecessary noise

>  
> -	if (!$rstate->{done} && ($rstate->{read} + length($hdl->{rbuf})) >= $rstate->{size}) {
> -	    $rstate->{done} = 1; # make sure we dont get called twice
> -	    if ($rstate->{phase} < 0 || !$rstate->{md5sum}) {
> -		die "upload failed\n";
> -	    } else {
> -		my $elapsed = tv_interval($rstate->{starttime});
> -
> -		my $rate = int($rstate->{bytes} / ($elapsed * 1024 * 1024));
> -		syslog('info',
> -		    "multipart upload complete (size: %d time: %ds rate: %.2fMiB/s md5sum: %s)",
> -		     $rstate->{bytes}, $elapsed, $rate, $rstate->{md5sum}
> -		);
> -		$self->handle_api2_request($reqstate, $auth, $method, $path, $rstate);
> -	    }
> +	if (
> +	    $rstate->{read} + length($hdl->{rbuf}) >= $rstate->{size}
> +	    && $rstate->{phase} != 100
> +	) {
> +	    die "upload failed";
>  	}
> +
>      };
>      if (my $err = $@) {
>  	syslog('err', $err);
> @@ -1269,6 +1266,13 @@ sub file_upload_multipart {
>      }
>  }
>  
> +sub trim {

should be a local sub (my sub {}) not public ABI of a http server module, maybe even just a local inline
closure.

nit: a bit overly generic method name, i.e., "trim <what>", better: trim_whitespace
but actually this is "trim leading whitespace and trailing everything after the first non-whitespace,
and that multiline"

> +    my ($string) = @_;
> +
> +    $string =~ s/\s*([^\s]+).*/$1/s;

this is to expensive, why replace the string reference if it's always assigned as rvalue anyway?

also: use \S not [^\s]

> +    return $string;
> +}

could be (assuming that multiline is overkill):

my sub trim_whitespace {
    $_[0] =~ s/\s(\S+)/;
    return $1;
}

that wouldn't scan the string after we don't need to anyway and avoid replacing it inline
just to override it via return + rvalue assignment anyway.

> +
>  sub parse_content_type {
>      my ($ctype) = @_;
>