* [pve-devel] [PATCH qemu-server] fix #6207: vm status: cache last disk read/write values
@ 2025-09-22 10:15 Fiona Ebner
2025-09-22 17:26 ` Thomas Lamprecht
0 siblings, 1 reply; 6+ messages in thread
From: Fiona Ebner @ 2025-09-22 10:15 UTC (permalink / raw)
To: pve-devel
If disk read/write cannot be queried because of QMP timeout, they
should not be reported as 0, but the last value should be re-used.
Otherwise, the difference between that reported 0 and the next value,
when the stats are queried successfully, will show up as a huge spike
in the RRD graphs.
Invalidate the cache when there is no PID or the PID changed.
Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
---
src/PVE/QemuServer.pm | 28 ++++++++++++++++++++++++++--
1 file changed, 26 insertions(+), 2 deletions(-)
diff --git a/src/PVE/QemuServer.pm b/src/PVE/QemuServer.pm
index f7f85436..5940ec38 100644
--- a/src/PVE/QemuServer.pm
+++ b/src/PVE/QemuServer.pm
@@ -2670,6 +2670,12 @@ our $vmstatus_return_properties = {
my $last_proc_pid_stat;
+# See bug #6207. If disk write/read cannot be queried because of QMP timeout, they should not be
+# reported as 0, but the last value should be re-used. Otherwise, the difference between that
+# reported 0 and the next value, when the stats are queried successfully, will show up as a huge
+# spike.
+my $last_disk_stats = {};
+
# get VM status information
# This must be fast and should not block ($full == false)
# We only query KVM using QMP if $full == true (this can be slow)
@@ -2741,6 +2747,14 @@ sub vmstatus {
$d->{tags} = $conf->{tags} if defined($conf->{tags});
$res->{$vmid} = $d;
+
+ if (
+ $last_disk_stats->{$vmid}
+ && (!$d->{pid} || $d->{pid} != $last_disk_stats->{$vmid}->{pid})
+ ) {
+ delete($last_disk_stats->{$vmid});
+ }
+
}
my $netdev = PVE::ProcFSTools::read_proc_net_dev();
@@ -2815,6 +2829,8 @@ sub vmstatus {
return $res if !$full;
+ my $disk_stats_present = {};
+
my $qmpclient = PVE::QMPClient->new();
my $ballooncb = sub {
@@ -2862,8 +2878,10 @@ sub vmstatus {
$res->{$vmid}->{blockstat}->{$drive_id} = $blockstat->{stats};
}
- $res->{$vmid}->{diskread} = $totalrdbytes;
- $res->{$vmid}->{diskwrite} = $totalwrbytes;
+ $res->{$vmid}->{diskread} = $last_disk_stats->{$vmid}->{diskread} = $totalrdbytes;
+ $res->{$vmid}->{diskwrite} = $last_disk_stats->{$vmid}->{diskwrite} = $totalwrbytes;
+ $last_disk_stats->{$vmid}->{pid} = $res->{$vmid}->{pid};
+ $disk_stats_present->{$vmid} = 1;
};
my $machinecb = sub {
@@ -2925,7 +2943,13 @@ sub vmstatus {
foreach my $vmid (keys %$list) {
next if $opt_vmid && ($vmid ne $opt_vmid);
+
$res->{$vmid}->{qmpstatus} = $res->{$vmid}->{status} if !$res->{$vmid}->{qmpstatus};
+
+ if (!$disk_stats_present->{$vmid} && $last_disk_stats->{$vmid}) {
+ $res->{$vmid}->{diskread} = $last_disk_stats->{$vmid}->{diskread};
+ $res->{$vmid}->{diskwrite} = $last_disk_stats->{$vmid}->{diskwrite};
+ }
}
return $res;
--
2.47.3
* Re: [pve-devel] [PATCH qemu-server] fix #6207: vm status: cache last disk read/write values
2025-09-22 10:15 [pve-devel] [PATCH qemu-server] fix #6207: vm status: cache last disk read/write values Fiona Ebner
@ 2025-09-22 17:26 ` Thomas Lamprecht
2025-09-25 8:27 ` Fiona Ebner
0 siblings, 1 reply; 6+ messages in thread
From: Thomas Lamprecht @ 2025-09-22 17:26 UTC (permalink / raw)
To: Proxmox VE development discussion, Fiona Ebner
On 22.09.25 at 12:18, Fiona Ebner wrote:
> If disk read/write cannot be queried because of QMP timeout, they
> should not be reported as 0, but the last value should be re-used.
> Otherwise, the difference between that reported 0 and the next value,
> when the stats are queried successfully, will show up as a huge spike
> in the RRD graphs.
Fine with the idea in general, but this is effectively only relevant
for pvestatd, though?
As of now we would also cache in the API daemon, without ever using
this. Might not be _that_ much, so not really a problem of the amount,
but feels a bit wrong to me w.r.t. "code place".
Does pvestatd have the necessary info, directly or indirectly through the
existence of some other vmstatus properties, to derive when it can
safely reuse the previous value?
Or maybe we could make this caching opt-in through some module flag
that only pvestatd sets? But not really thought that through, so
please take this with a grain of salt.
btw. what about QMP being "stuck" for a prolonged time, should we
stop using the previous value after a few times (or duration)?
* Re: [pve-devel] [PATCH qemu-server] fix #6207: vm status: cache last disk read/write values
2025-09-22 17:26 ` Thomas Lamprecht
@ 2025-09-25 8:27 ` Fiona Ebner
2025-09-25 8:52 ` Thomas Lamprecht
0 siblings, 1 reply; 6+ messages in thread
From: Fiona Ebner @ 2025-09-25 8:27 UTC (permalink / raw)
To: Thomas Lamprecht, Proxmox VE development discussion
On 22.09.25 at 7:26 PM, Thomas Lamprecht wrote:
> On 22.09.25 at 12:18, Fiona Ebner wrote:
>> If disk read/write cannot be queried because of QMP timeout, they
>> should not be reported as 0, but the last value should be re-used.
>> Otherwise, the difference between that reported 0 and the next value,
>> when the stats are queried successfully, will show up as a huge spike
>> in the RRD graphs.
>
> Fine with the idea in general, but this is effectively only relevant
> for pvestatd, though?
>
> As of now we would also cache in the API daemon, without ever using
> this. Might not be _that_ much, so not really a problem of the amount,
> but feels a bit wrong to me w.r.t. "code place".
>
> Does pvestatd have the necessary info, directly or indirectly through the
> existence of some other vmstatus properties, to derive when it can
> safely reuse the previous value?
It's safe (and sensible/required) if and only if there is no new value.
We could have the cache be only inside pvestatd, initialize the cache
with a value of 0 and properly report diskread/write values as undef if
we cannot get an actual value, and have that mean "re-use previous
value". (Aside: we cannot use 0 instead of undef to mean "re-use
previous value", because there are edge cases where a later 0 actually
means 0 again, for example, all disks unplugged).
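Roughly, what I have in mind for the pvestatd side is something like the
following sketch (the helper and variable names are made up, not existing
pvestatd code):

    # cache kept only inside pvestatd, keyed by VMID
    my $last_disk_stats = {};

    sub fixup_disk_stats {
        my ($vmid, $d) = @_; # $d is the vmstatus() entry for this VM

        for my $key (qw(diskread diskwrite)) {
            if (defined($d->{$key})) {
                # remember the last value we actually got via QMP
                $last_disk_stats->{$vmid}->{$key} = $d->{$key};
            } else {
                # no new value (QMP timed out) - re-use the cached one, 0 initially
                $d->{$key} = $last_disk_stats->{$vmid}->{$key} // 0;
            }
        }
    }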
> Or maybe we could make this caching opt-in through some module flag
> that only pvestatd sets? But not really thought that through, so
> please take this with a grain of salt.
>
> btw. what about QMP being "stuck" for a prolonged time, should we
> stop using the previous value after a few times (or duration)?
What other value could we use? Since the graph looks at the differences
of reported values, the only reasonable value we can use if we cannot
get a new one is the previous one. That holds no matter how long it takes
to get a new one, or there will be that completely wrong spike again. Or is
there an N/A kind of value that we could use, where RRD/graph would be smart
enough to know "I cannot calculate a difference now, will have to wait
for multiple good values"? Then I'd go for that instead of the current
approach.
* Re: [pve-devel] [PATCH qemu-server] fix #6207: vm status: cache last disk read/write values
2025-09-25 8:27 ` Fiona Ebner
@ 2025-09-25 8:52 ` Thomas Lamprecht
2025-09-25 9:28 ` Fiona Ebner
0 siblings, 1 reply; 6+ messages in thread
From: Thomas Lamprecht @ 2025-09-25 8:52 UTC (permalink / raw)
To: Fiona Ebner, Proxmox VE development discussion
On 25.09.25 at 10:27, Fiona Ebner wrote:
> On 22.09.25 at 7:26 PM, Thomas Lamprecht wrote:
>> On 22.09.25 at 12:18, Fiona Ebner wrote:
>>> If disk read/write cannot be queried because of QMP timeout, they
>>> should not be reported as 0, but the last value should be re-used.
>>> Otherwise, the difference between that reported 0 and the next value,
>>> when the stats are queried successfully, will show up as a huge spike
>>> in the RRD graphs.
>>
>> Fine with the idea in general, but this is effectively only relevant
>> for pvestatd, though?
>>
>> As of now we would also cache in the API daemon, without ever using
>> this. Might not be _that_ much, so not really a problem of the amount,
>> but feels a bit wrong to me w.r.t. "code place".
>>
>> Does pvestatd have the necessary info, directly or indirectly through the
>> existence of some other vmstatus properties, to derive when it can
>> safely reuse the previous value?
>
> It's safe (and sensible/required) if and only if there is no new value.
> We could have the cache be only inside pvestatd, initialize the cache
> with a value of 0 and properly report diskread/write values as undef if
> we cannot get an actual value, and have that mean "re-use previous
> value". (Aside: we cannot use 0 instead of undef to mean "re-use
> previous value", because there are edge cases where a later 0 actually
> means 0 again, for example, all disks unplugged).
Yeah, it would have to be an invalid value like -1, but even that is
not ideal; an explicit undefined or null value would naturally be
better to signal what's happening.
>> Or maybe we could make this caching opt-in through some module flag
>> that only pvestatd sets? But not really thought that through, so
>> please take this with a grain of salt.
>>
>> btw. what about QMP being "stuck" for a prolonged time, should we
>> stop using the previous value after a few times (or duration)?
>
> What other value could we use? Since the graph looks at the differences
> of reported values, the only reasonable value we can use if we cannot
> get a new one is the previous one. That holds no matter how long it takes
> to get a new one, or there will be that completely wrong spike again. Or is
> there an N/A kind of value that we could use, where RRD/graph would be smart
> enough to know "I cannot calculate a difference now, will have to wait
> for multiple good values"? Then I'd go for that instead of the current
> approach.
That should never be the problem of the metric collecting entity, but of
the one interpreting or displaying the data, as otherwise this creates a
false impression of reality.
So the more I think of this, the more I'm sure that we won't do anybody
a favor in the mid/long term here with "faking it" in the backend.
I'd need to look into RRD, but even if there wasn't a way there to
submit null-ish values, I'd rather see that as a further argument for
switching out RRD for the Rust-based proxmox-rrd crate, where we have
control over these things, compared to recording measurements that did
not happen.
That does not mean that doing this correctly in proxmox-rrd will be
trivial to do once we have migrated (which is non-trivial on its own), though.
There are also some ideas about switching to a rather different way to
encode metrics, using a more flexible format and stuff like delta
encoding, i.e. closer to how modern time series DBs like InfluxDB do it;
Lukas signaled some interest in this work here.
But that is vaporware as of now, so no need to wait on that to happen
now, just wanted to mention it to not have those ideas isolated too much.
But taking a step back, why is QMP even timing out here? Is this not
just reading some in-memory counters that QEMU has ready to go?
* Re: [pve-devel] [PATCH qemu-server] fix #6207: vm status: cache last disk read/write values
2025-09-25 8:52 ` Thomas Lamprecht
@ 2025-09-25 9:28 ` Fiona Ebner
2025-09-25 12:31 ` [pve-devel] superseded: " Fiona Ebner
0 siblings, 1 reply; 6+ messages in thread
From: Fiona Ebner @ 2025-09-25 9:28 UTC (permalink / raw)
To: Thomas Lamprecht, Proxmox VE development discussion
On 25.09.25 at 10:52 AM, Thomas Lamprecht wrote:
> On 25.09.25 at 10:27, Fiona Ebner wrote:
>> On 22.09.25 at 7:26 PM, Thomas Lamprecht wrote:
>>> On 22.09.25 at 12:18, Fiona Ebner wrote:
>>> Or maybe we could make this caching opt-in through some module flag
>>> that only pvestatd sets? But not really thought that through, so
>>> please take this with a grain of salt.
>>>
>>> btw. what about QMP being "stuck" for a prolonged time, should we
>>> stop using the previous value after a few times (or duration)?
>>
>> What other value could we use? Since the graph looks at the differences
>> of reported values, the only reasonable value we can use if we cannot
>> get a new one is the previous one. That holds no matter how long it takes
>> to get a new one, or there will be that completely wrong spike again. Or is
>> there an N/A kind of value that we could use, where RRD/graph would be smart
>> enough to know "I cannot calculate a difference now, will have to wait
>> for multiple good values"? Then I'd go for that instead of the current
>> approach.
>
> That should never be the problem of the metric collecting entity, but of
> the one interpreting or displaying the data, as otherwise this creates a
> false impression of reality.
>
> So the more I think of this, the more I'm sure that we won't do anybody
> a favor in the mid/long term here with "faking it" in the backend.
Very good point! I'll look into what happens when reporting an undef
value, because right now the interpreting entity cannot distinguish
between "0 because of no data" and "0 yes I really mean this is the
actual value".
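For illustration, the direction I have in mind is roughly this (sketch only,
not necessarily the exact change the follow-up patch ends up making): don't
pre-initialize the counters to 0 in vmstatus(), and only assign them once
query-blockstats actually returned data, so a consumer sees undef (JSON null)
instead of a fake 0 on a QMP timeout:

    # sketch: skip the unconditional '$d->{diskread} = 0; $d->{diskwrite} = 0;'
    # initialization and only set the values in the blockstats callback
    $res->{$vmid}->{diskread} = $totalrdbytes;
    $res->{$vmid}->{diskwrite} = $totalwrbytes;
    # consumers can then tell "no data" (undef/null) apart from a real 0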
> I'd need to look into RRD, but even if there wasn't a way there to
> submit null-ish values, I'd rather see that as a further argument for
> switching out RRD for the Rust-based proxmox-rrd crate, where we have
> control over these things, compared to recording measurements that did
> not happen.
>
> That does not mean that doing this correctly in proxmox-rrd will be
> trivial to do once we have migrated (which is non-trivial on its own), though.
> There are also some ideas about switching to a rather different way to
> encode metrics, using a more flexible format and stuff like delta
> encoding, i.e. closer to how modern time series DBs like InfluxDB do it;
> Lukas signaled some interest in this work here.
> But that is vaporware as of now, so no need to wait on that to happen
> now, just wanted to mention it to not have those ideas isolated too much.
>
>
> But taking a step back, why is QMP even timing out here? Is this not
> just reading some in-memory counters that QEMU has ready to go?
There can be another QMP operation going on that blocks the request (e.g.
backup), or the QEMU main thread might be busy, or the system in general
might be under too much load to handle all of the QMP commands to all
the VMs in time. The report of this issue in the enterprise support has
VMs that are not being backed up showing the spike during the backup of
other VMs.
But it seems like there is potential for improvement in how we do things.
We collect:
> my $statuscb = sub {
> my ($vmid, $resp) = @_;
>
> $qmpclient->queue_cmd($vmid, $blockstatscb, 'query-blockstats');
> $qmpclient->queue_cmd($vmid, $machinecb, 'query-machines');
> $qmpclient->queue_cmd($vmid, $versioncb, 'query-version');
> # this fails if balloon driver is not loaded, so this must be
> # the last command (following commands are aborted if this fails).
> $qmpclient->queue_cmd($vmid, $ballooncb, 'query-balloon');
>
> my $status = 'unknown';
> if (!defined($status = $resp->{'return'}->{status})) {
> warn "unable to get VM status\n";
> return;
> }
>
> $res->{$vmid}->{qmpstatus} = $resp->{'return'}->{status};
> };
>
> foreach my $vmid (keys %$list) {
> next if $opt_vmid && ($vmid ne $opt_vmid);
> next if !$res->{$vmid}->{pid}; # not running
> $qmpclient->queue_cmd($vmid, $statuscb, 'query-status');
> }
Okay, good! We collect all commands so we can issue them in parallel.
> $qmpclient->queue_execute(undef, 2);
Here we only have the default timeout of 3 seconds (i.e. the undef
argument); maybe we should bump that to something like 5 seconds? Right
now, without pvestatd parallelizing update_qemu_status() with other
update_xyz() operations, that might already be quite costly :/ But
considering it's for all VMs, it might be fair?
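Concretely, that would just be bumping the first argument here (sketch,
assuming the second argument keeps its current meaning):

    # timeout raised from the 3-second default (undef) to 5 seconds
    $qmpclient->queue_execute(5, 2);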
> foreach my $vmid (keys %$list) {
> next if $opt_vmid && ($vmid ne $opt_vmid);
> next if !$res->{$vmid}->{pid}; #not running
>
> # we can't use the $qmpclient since it might have already aborted on
> # 'query-balloon', but this might also fail for older versions...
> my $qemu_support = eval { mon_cmd($vmid, "query-proxmox-support") };
> $res->{$vmid}->{'proxmox-support'} = $qemu_support // {};
> }
This, OTOH, seems just bad: querying the info one-by-one, each with its
own timeout. I'll look into whether this can be reworked to be part of
the queue (before 'query-balloon'), see the sketch below. And/or we
should be able to even disable this for the status daemon; I think it
doesn't use that info at all.
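Something like the following is what I have in mind for the rework (sketch
only; $proxmox_support_cb would be a new callback, and older QEMU versions
without the command would still need a fallback to {}):

    my $proxmox_support_cb = sub {
        my ($vmid, $resp) = @_;
        $res->{$vmid}->{'proxmox-support'} = $resp->{'return'} // {};
    };

    # queued inside $statuscb together with the other commands, before
    # 'query-balloon' (which aborts the following commands on failure)
    $qmpclient->queue_cmd($vmid, $proxmox_support_cb, 'query-proxmox-support');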
* [pve-devel] superseded: [PATCH qemu-server] fix #6207: vm status: cache last disk read/write values
2025-09-25 9:28 ` Fiona Ebner
@ 2025-09-25 12:31 ` Fiona Ebner
0 siblings, 0 replies; 6+ messages in thread
From: Fiona Ebner @ 2025-09-25 12:31 UTC (permalink / raw)
To: Thomas Lamprecht, Proxmox VE development discussion
On 25.09.25 at 11:28 AM, Fiona Ebner wrote:
> On 25.09.25 at 10:52 AM, Thomas Lamprecht wrote:
>> On 25.09.25 at 10:27, Fiona Ebner wrote:
>>> On 22.09.25 at 7:26 PM, Thomas Lamprecht wrote:
>>>> On 22.09.25 at 12:18, Fiona Ebner wrote:
>>>> Or maybe we could make this caching opt-in through some module flag
>>>> that only pvestatd sets? But not really thought that through, so
>>>> please take this with a grain of salt.
>>>>
>>>> btw. what about QMP being "stuck" for a prolonged time, should we
>>>> stop using the previous value after a few times (or duration)?
>>>
>>> What other value could we use? Since the graph looks at the differences
>>> of reported values, the only reasonable value we can use if we cannot
>>> get a new one is the previous one. That holds no matter how long it takes
>>> to get a new one, or there will be that completely wrong spike again. Or is
>>> there an N/A kind of value that we could use, where RRD/graph would be smart
>>> enough to know "I cannot calculate a difference now, will have to wait
>>> for multiple good values"? Then I'd go for that instead of the current
>>> approach.
>>
>> That should never be the problem of the metric collecting entity, but of
>> the one interpreting or displaying the data, as otherwise this creates a
>> false impression of reality.
>>
>> So the more I think of this, the more I'm sure that we won't do anybody
>> a favor in the mid/long term here with "faking it" in the backend.
>
> Very good point! I'll look into what happens when reporting an undef
> value, because right now the interpreting entity cannot distinguish
> between "0 because of no data" and "0 yes I really mean this is the
> actual value".
Returning undef already works as intended :)
Superseded by:
https://lore.proxmox.com/pve-devel/20250925122829.70121-1-f.ebner@proxmox.com/T/