From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <t.lamprecht@proxmox.com>
Received: from firstgate.proxmox.com (firstgate.proxmox.com [212.224.123.68])
 (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits)
 key-exchange X25519 server-signature RSA-PSS (2048 bits))
 (No client certificate requested)
 by lists.proxmox.com (Postfix) with ESMTPS id 16B2F69F29
 for <pve-devel@pve.proxmox.com>; Fri, 25 Mar 2022 18:59:07 +0100 (CET)
Received: from firstgate.proxmox.com (localhost [127.0.0.1])
 by firstgate.proxmox.com (Proxmox) with ESMTP id 0D0B93C1E
 for <pve-devel@pve.proxmox.com>; Fri, 25 Mar 2022 18:59:07 +0100 (CET)
Received: from proxmox-new.maurer-it.com (proxmox-new.maurer-it.com
 [94.136.29.106])
 (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits)
 key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest SHA256)
 (No client certificate requested)
 by firstgate.proxmox.com (Proxmox) with ESMTPS id E259A3C14
 for <pve-devel@pve.proxmox.com>; Fri, 25 Mar 2022 18:59:05 +0100 (CET)
Received: from proxmox-new.maurer-it.com (localhost.localdomain [127.0.0.1])
 by proxmox-new.maurer-it.com (Proxmox) with ESMTP id B84DC46FB1
 for <pve-devel@pve.proxmox.com>; Fri, 25 Mar 2022 18:59:05 +0100 (CET)
Message-ID: <416be01d-0e5f-02eb-0af8-7c6e1896d095@proxmox.com>
Date: Fri, 25 Mar 2022 18:59:00 +0100
MIME-Version: 1.0
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:99.0) Gecko/20100101
 Thunderbird/99.0
Content-Language: en-US
To: Daniel Tschlatscher <d.tschlatscher@proxmox.com>,
 PVE development discussion <pve-devel@pve.proxmox.com>
References: <20220324154443.2442430-1-d.tschlatscher@proxmox.com>
 <5a02e9a6-fe05-129a-a4bf-9a29a8efa5d4@proxmox.com>
 <8af5712b-d33d-ffb7-ccd5-5dda8973822a@proxmox.com>
 <fd056cb7-0e4b-a25a-cd45-c1089185b7b7@proxmox.com>
 <4095c73e-9aa2-41c4-21f6-8aa4fb05c591@proxmox.com>
From: Thomas Lamprecht <t.lamprecht@proxmox.com>
In-Reply-To: <4095c73e-9aa2-41c4-21f6-8aa4fb05c591@proxmox.com>
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable
X-SPAM-LEVEL: Spam detection results:  0
 AWL 0.056 Adjusted score from AWL reputation of From: address
 BAYES_00                 -1.9 Bayes spam probability is 0 to 1%
 KAM_DMARC_STATUS 0.01 Test Rule for DKIM or SPF Failure with Strict Alignment
 NICE_REPLY_A           -0.001 Looks like a legit reply (A)
 SPF_HELO_NONE           0.001 SPF: HELO does not publish an SPF Record
 SPF_PASS               -0.001 SPF: sender matches SPF record
 T_SCC_BODY_TEXT_LINE    -0.01 -
Subject: Re: [pve-devel] [PATCH pve-qemu] fix: 3865: backup restore human
 readable output
X-BeenThere: pve-devel@lists.proxmox.com
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: Proxmox VE development discussion <pve-devel.lists.proxmox.com>
List-Unsubscribe: <https://lists.proxmox.com/cgi-bin/mailman/options/pve-devel>, 
 <mailto:pve-devel-request@lists.proxmox.com?subject=unsubscribe>
List-Archive: <http://lists.proxmox.com/pipermail/pve-devel/>
List-Post: <mailto:pve-devel@lists.proxmox.com>
List-Help: <mailto:pve-devel-request@lists.proxmox.com?subject=help>
List-Subscribe: <https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel>, 
 <mailto:pve-devel-request@lists.proxmox.com?subject=subscribe>
X-List-Received-Date: Fri, 25 Mar 2022 17:59:07 -0000

On 25.03.22 13:39, Daniel Tschlatscher wrote:
> On 3/25/22 12:05, Thomas Lamprecht wrote:
>>>>> ++
>>>>> ++               if (delta != 0)
>>>>> ++                   bps /= delta;
>>>> currently, if delta would be null you'd print a size unit as throughput unit?
>>>>
>>> This is due to the underlying code. The function that is wrapping this
>>> part of the code is called multiple times every second (something like
>>> ~20 times / sec) and in the original code a print statement would only
>>> occur after the percentage of bytes transferred increased by at least 1%.
>>> The time elapsed between an increase in percent could be less than a
>>> second. In this case the output simply prints how many bytes were
>>> transferred since this second began. This means the value for throughput
>>> would, at the time of the print, always be lower than the real-world B/s
>>> and only the last print in that second would represent the actual B/s.
>> This just doesn't make sense to me, the if clause enforces that it's
>> called at most every 2s (or the 100% edge case) already, and using wrong
>> units is using wrong units, you cannot really argue basic math/physics
>> away...
>>
> The 100% edge case is the problem for the most part.

You could also ignore that case in the loop and print a "100% done" line
afterwards; rethinking this makes the current approach of handling it look
rather odd anyhow.

>
> It comes down to the inaccuracy of keeping time in seconds rather than
> something more granular. While it is true that in 0 time no actual data
> could be transferred, 0 in this case means anywhere from 0 => 0.999
> seconds. Delta could be zero while the transferred bytes, for example,
> count 20MB.

Yes, as mentioned, using something more granular improves a lot of things.

>
> Just printing 0 would be confusing because *some* data was in fact
> transferred -> reflected in the bps variable. It could also give the user
> a false impression that their disk is slow or something similar if they
> see a lot of "0 B/s" entries (which was the actual reasoning here).

It isn't really reflected in bps, as it's just wrong to print a size as a
throughput; finer granularity, or simply not spamming the log at the 100%
stage, would be a nicer, more practical solution.

Mapping delta 0 -> throughput 0 is then still the only valid way to handle
it; the minimal time resolution the program handles needs to be used as the
metric, not the physical (almost) infinitesimal one, and once that's ms or
µs it's only going to happen when the throughput would be rounded down to 0
anyway.
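
To illustrate, a rough sketch of how the rate calculation could look with
millisecond resolution and the zero-delta guard; the helper and variable
names here (now_ms, report_progress, last_ms, last_bytes) are made up for
the example and not taken from your patch:

#include <stdint.h>
#include <stdio.h>
#include <time.h>

/* monotonic time in milliseconds (illustrative helper) */
static uint64_t now_ms(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000 + (uint64_t)ts.tv_nsec / 1000000;
}

/* one reporting step: bytes_done is the total restored so far,
 * *last_ms / *last_bytes hold the state from the previous report */
static void report_progress(uint64_t bytes_done, uint64_t *last_ms,
                            uint64_t *last_bytes, int percent)
{
    uint64_t cur_ms = now_ms();
    uint64_t delta_ms = cur_ms - *last_ms;
    uint64_t delta_bytes = bytes_done - *last_bytes;

    /* zero elapsed time at this resolution -> no meaningful rate, print 0 */
    double mib_per_s = delta_ms > 0
        ? ((double)delta_bytes / (1024.0 * 1024.0)) / ((double)delta_ms / 1000.0)
        : 0.0;

    printf("progress %d%% (%.2f MiB/s)\n", percent, mib_per_s);

    *last_ms = cur_ms;
    *last_bytes = bytes_done;
}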

> The current implementation means, if delta is 0, bps = 20MB and in reality
> 0.33 seconds have elapsed that the program prints "20MB/s". In actuality
> the expected rate of transfer would be "60 MB/s" though.

bps (bytes per second) cannot be MB, you really need to get your units
straight, as otherwise it's just confusing to me. But I assume you meant a
delta of 20 MiB in bytes restored since last time, and then 20 / 0.33 = 60,
so I'm not seeing the issue.

>
> My thinking for not using milliseconds here was: even though I could
> calculate how much the transfer rate per second is, during the time of the
> print, it is still an approximation. Imagine the following scenario:

So you're arguing that, because a much more accurate solution is still not
100% accurate (which is impossible to achieve), let's go for the much less
accurate one? Sorry, but I'm not really following that logic and really not
getting why you still argue for it?! It should be clear that using
millisecond (or lower) resolution gives much more accurate results, and can
*never* be worse than a coarser one.

> In the first 0.33 secs 20MB/s are transferred (=> print goes out with 60 MB/s)

Did you mean to write "In the first 0.33s 20 MiB are transferred"? As
otherwise we're talking about the second-order derivative, i.e., throughput
acceleration.


> In the next 0.33 secs only 10MB/s are transferred
> In the last 0.33 secs only 0.5MB/s are transferred
> In this case "60MB/s" was printed but 30.5MB have been transferred.

I don't get the point/example...

>
> The current implementation might be severely understating the actual
> throughput, but will never show a value for B/s that is higher than the
> actual throughput per second. The unaccounted-for bytes here would then be
> reflected in the following output.

Exactly, but making the interval that the error "leaks" into 1000, or even
1 million, times smaller reduces the difference between reality and the
actual output quite a lot.

> The millisecond approach would have both the same and opposite problem,
> e.g. if delta is only a few milliseconds and the throughput in that time a
> lot higher than average, then the printed B/s value could be highly
> exaggerated.

No, it does not have the opposite problem, that cannot really be the case
as it's the same thing, just more accurate. It's still not perfect, and for
very, very high bandwidths (100s of GiB/s) the error may still be relevant
even with per-ms sampling, but then it'd still be 1000 times better on
average.
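
To put rough numbers on it: with a 1 s resolution, an interval measured as
2 s may actually have lasted anywhere up to almost 3 s, so the printed rate
can be off by up to ~50%; with 1 ms resolution the same mechanism is off by
at most ~0.05%, and even at 100 GiB/s that residual only misattributes on
the order of 100 MiB to the wrong reporting interval.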

>
> I let the current implementation be because, in practice, at the end of
> the backup restore there are usually a lot of sparse bytes (I hope I am
> using this term correctly). This means that the restore would go from 56%
> to 100% in less than a second. Simply writing 0 here would probably
> confuse the user more than anything because the value suddenly jumped by
> e.g. 6GiB with 0B/s.

This cannot be generalized in practice; on a long-running production VM,
holes due to sparseness can be there or not, and may not all be accumulated
at the end. Different OSes and the filesystems they use also have very
different impacts.

>
> Still, I do concede that using milliseconds just makes more sense because
> it's the more "proper" way to implement something like this and especially
> because of the unit. In the current implementation it would be something
> like B/last-output.
> Also, it seems to me that the rate of throughput plummeting or
> skyrocketing in the span of less than a second is rather unlikely, so the
> millisecond method would statistically just be more correct.

How's that unlikely? Modern HW and its caches are really not deterministic
at all, e.g., QLC SSDs with a few tens of GiB of SLC and some GiB of DDR4
cache can show very stark jumps in throughput, from 6 Gbps (maxing out the
SATA link speed) to 100 Mbps (maxing out QLC). In addition to that, the IO
scheduler handling multiple processes submitting IO concurrently doesn't
make it any more stable, then add the effects of filesystems or possibly a
ZFS RAID, ... on top. Also, simply that another process is done with IO can
make a lot of capacity available in mere microseconds.
There's a reason some counters/timing bits of the IO paths add to the
kernel's RNG entropy pool.

Further, note that the base assumption behind why I went into this wasn't
even erratic IO, but rather CPU scheduling leading to very different actual
interval durations. And that could even make a stable IO throughput flow
display as completely erratic, even if it actually wasn't. I.e., using such
a low accuracy just adds way too much distortion in any direction.

> I will change it in v3.
>
>>> Nonetheless, I later adapted this code to behave like the backup create
>>> function does, which updates every 3 seconds, eliminating this problem
>>> entirely. This is therefore dead code anyway.
>>>
>> later where?
>>
>> Actual improvement, track duration in milliseconds or less and still
>> catch the edge case for when delta is zero and just set throughput to
>> zero then (as no data can transfer in zero time).
>>
> Later in the sense of time. I rewrote the if statement above to only
> execute if delta is greater than 2 seconds (before it updated with every
> percent increase).

Ah ok, in a future patch - I encountered "later" quite often in the sense
of "later in this patch or series", so yeah, it can be clear enough;
probably just confusion on my side.

> _>> __"eliminating this problem entirely"_ is, I retract, not true thou=
gh because I forgot about the check whether progress is at 100%, in which=
 case delta could still be 0.

In that case you either skip reporting in that round and/or move the 100%
reporting out of the loop entirely, which could be the nicer approach in
general anyway; with the reporting factored out into its own function it's
also not much code overhead.
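
Roughly, and glossing over the actual restore code, something along these
lines; restore_next_chunk is just a stand-in for the real per-chunk work,
and now_ms/report_progress are the sketch helpers from above:

#include <stdint.h>
#include <stdio.h>

/* stand-in for the real per-chunk restore work (hypothetical) */
static uint64_t restore_next_chunk(void)
{
    return 4 * 1024 * 1024; /* pretend 4 MiB got restored */
}

static void restore_loop(uint64_t total_bytes)
{
    uint64_t bytes_done = 0, last_bytes = 0;
    uint64_t last_ms = now_ms();

    while (bytes_done < total_bytes) {
        bytes_done += restore_next_chunk();

        /* report at most every 2s inside the loop; a zero ms delta simply
         * means no line is printed this round */
        if (now_ms() - last_ms >= 2000) {
            report_progress(bytes_done, &last_ms, &last_bytes,
                            (int)(bytes_done * 100 / total_bytes));
        }
    }

    /* the 100% line is emitted exactly once, after the loop, without
     * deriving a rate from a possibly zero-length interval */
    printf("progress 100%% (%llu bytes restored)\n",
           (unsigned long long)bytes_done);
}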