public inbox for pve-devel@lists.proxmox.com
* [pve-devel] [PATCH common/qemu-server] improve sysfs write behaviour
@ 2024-11-05  9:24 Dominik Csapak
  2024-11-05  9:24 ` [pve-devel] [PATCH common 1/1] sysfstools: file_write: log the actual error if there was one Dominik Csapak
                   ` (3 more replies)
  0 siblings, 4 replies; 7+ messages in thread
From: Dominik Csapak @ 2024-11-05  9:24 UTC (permalink / raw)
  To: pve-devel

As I feared previously in [0], making it a hard error when encountering
errors during sysfs writes uncovered some situations where our code was
too strict to keep some setups working.

One such case is resetting devices, which apparently is not always
necessary, so this series

* downgrades that error to a warning
* adds more logging to `file_write` to make debugging easier

Alternatively, we could rewrite file_write such that we can control the
error behaviour with a parameter and replace all "old" call sites so
that we ignore errors. But since the only other call sites currently are
for binding vfio-pci to the device (which AFAIK is necessary and not
optional) and setting mdev models (which is also not optional), we should
mostly be fine with this approach here.
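
A minimal sketch of that alternative, assuming a hypothetical third
parameter `$on_error`. Neither the function name nor the modes exist in
PVE::SysFSTools; they are made up here only to illustrate the idea:

```perl
use strict;
use warnings;
use IO::File;

# Hypothetical variant of file_write; $on_error is an assumption for
# illustration and not part of PVE::SysFSTools.
#   'ignore' - return 0 silently
#   'warn'   - warn and return 0 (default)
#   'die'    - raise an exception
sub file_write_with_mode {
    my ($filename, $buf, $on_error) = @_;
    $on_error //= 'warn';

    my $err;
    if (my $fh = IO::File->new($filename, "w")) {
        # stringify $! right away, before close() can change it
        $err = "$!" if !defined(syswrite($fh, $buf));
        $fh->close();
    } else {
        $err = "$!";
    }
    return 1 if !defined($err);

    my $msg = "error writing '$buf' to '$filename': $err\n";
    die $msg if $on_error eq 'die';
    warn $msg if $on_error eq 'warn';
    return 0;
}
```

With something like this, the reset call site could pass 'warn' while the
vfio-pci binding and mdev call sites pass 'die', at the cost of touching
every caller.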

In [1], some users reported breakage, at least one of them when binding
vfio-pci; I'm currently investigating there whether that binding is
really necessary.

0: https://lore.proxmox.com/pve-devel/20240723082925.934603-1-d.csapak@proxmox.com/
1: https://forum.proxmox.com/threads/156848/

pve-common:

Dominik Csapak (1):
  sysfstools: file_write: log the actual error if there was one

 src/PVE/SysFSTools.pm | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

qemu-server:

Dominik Csapak (1):
  pci: don't hard require resetting devices for passthrough

 PVE/QemuServer/PCI.pm | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

-- 
2.39.5



_______________________________________________
pve-devel mailing list
pve-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel



* [pve-devel] [PATCH common 1/1] sysfstools: file_write: log the actual error if there was one
  2024-11-05  9:24 [pve-devel] [PATCH common/qemu-server] improve sysfs write behaviour Dominik Csapak
@ 2024-11-05  9:24 ` Dominik Csapak
  2024-11-05  9:24 ` [pve-devel] [PATCH qemu-server 1/1] pci: don't hard require resetting devices for passthrough Dominik Csapak
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 7+ messages in thread
From: Dominik Csapak @ 2024-11-05  9:24 UTC (permalink / raw)
  To: pve-devel

The actual error and path are useful to know when trying to debug or
figure out what did not work, so warn here if there was an error.
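
The pattern can be tried outside of sysfs with a small standalone sketch.
The helper `try_write` and the read-only temp file are assumptions made
up for this example; the failing write only stands in for a sysfs file
that rejects the value:

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: attempt a write and return (ok, saved error).
sub try_write {
    my ($fh, $buf) = @_;
    return (1, undef) if defined(syswrite($fh, $buf));
    my $err = $!;    # copy immediately; later calls may overwrite $!
    return (0, $err);
}

# Provoke a failure: syswrite on a handle that was opened read-only.
my (undef, $path) = tempfile(UNLINK => 1);
open(my $ro, '<', $path) or die "cannot open '$path': $!\n";
my ($ok, $err) = try_write($ro, "1");
close($ro);

warn "error writing '1' to '$path': $err\n" if !$ok;
```

Copying `$!` directly after the failed call matters because any later
system call, including the `close`, may change it.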

Signed-off-by: Dominik Csapak <d.csapak@proxmox.com>
---
 src/PVE/SysFSTools.pm | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/src/PVE/SysFSTools.pm b/src/PVE/SysFSTools.pm
index 0bde6d7..d0d1bb9 100644
--- a/src/PVE/SysFSTools.pm
+++ b/src/PVE/SysFSTools.pm
@@ -217,7 +217,12 @@ sub file_write {
     my $fh = IO::File->new($filename, "w");
     return undef if !$fh;
 
-    my $res = defined(syswrite($fh, $buf)) ? 1 : 0;
+    my $res = 0;
+    if (defined(syswrite($fh, $buf))) {
+	$res = 1;
+    } elsif (my $err = $!) {
+	warn "error writing '$buf' to '$filename': $err\n";
+    }
 
     $fh->close();
 
-- 
2.39.5




* [pve-devel] [PATCH qemu-server 1/1] pci: don't hard require resetting devices for passthrough
  2024-11-05  9:24 [pve-devel] [PATCH common/qemu-server] improve sysfs write behaviour Dominik Csapak
  2024-11-05  9:24 ` [pve-devel] [PATCH common 1/1] sysfstools: file_write: log the actual error if there was one Dominik Csapak
@ 2024-11-05  9:24 ` Dominik Csapak
  2024-11-05 10:16   ` Stoiko Ivanov
  2024-11-05 10:12 ` [pve-devel] [PATCH common/qemu-server] improve sysfs write behaviour Stoiko Ivanov
  2024-11-06 10:51 ` Christoph Heiss
  3 siblings, 1 reply; 7+ messages in thread
From: Dominik Csapak @ 2024-11-05  9:24 UTC (permalink / raw)
  To: pve-devel

Since pve-common commit:

 eff5957 (sysfstools: file_write: properly catch errors)

this check here fails now when the reset does not work. It turns out
that resetting the device is not always necessary, and we previously
ignored most errors when trying to do so.

To restore that functionality, downgrade this `die` to a warning.

If the device really needs a reset to work, it will either fail later
during startup, or not work correctly in the guest, but that behavior
existed before and is AFAIK not really detectable from our side.

Signed-off-by: Dominik Csapak <d.csapak@proxmox.com>
---
 PVE/QemuServer/PCI.pm | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/PVE/QemuServer/PCI.pm b/PVE/QemuServer/PCI.pm
index 75eac134..dceb8938 100644
--- a/PVE/QemuServer/PCI.pm
+++ b/PVE/QemuServer/PCI.pm
@@ -728,7 +728,7 @@ sub prepare_pci_device {
     } else {
 	die "can't unbind/bind PCI group to VFIO '$pciid'\n"
 	    if !PVE::SysFSTools::pci_dev_group_bind_to_vfio($pciid);
-	die "can't reset PCI device '$pciid'\n"
+	warn "can't reset PCI device '$pciid'\n"
 	    if $info->{has_fl_reset} && !PVE::SysFSTools::pci_dev_reset($info);
     }
 
-- 
2.39.5




* Re: [pve-devel] [PATCH common/qemu-server] improve sysfs write behaviour
  2024-11-05  9:24 [pve-devel] [PATCH common/qemu-server] improve sysfs write behaviour Dominik Csapak
  2024-11-05  9:24 ` [pve-devel] [PATCH common 1/1] sysfstools: file_write: log the actual error if there was one Dominik Csapak
  2024-11-05  9:24 ` [pve-devel] [PATCH qemu-server 1/1] pci: don't hard require resetting devices for passthrough Dominik Csapak
@ 2024-11-05 10:12 ` Stoiko Ivanov
  2024-11-06 10:51 ` Christoph Heiss
  3 siblings, 0 replies; 7+ messages in thread
From: Stoiko Ivanov @ 2024-11-05 10:12 UTC (permalink / raw)
  To: Dominik Csapak; +Cc: Proxmox VE development discussion

Thanks big-time for the quick fix!
I encountered this on a machine at home with an older GPU (NVIDIA GT1030)
passed through to a VM, which seemingly does not handle resets too well.

With both patches applied, the guest starts again without error. The task
log contains:
```
error writing '1' to '/sys/bus/pci/devices/0000:01:00.0/reset': Inappropriate ioctl for device
can't reset PCI device '0000:01:00.0'
```
(similarly this is the output when starting on the commandline with 
`qm start <vmid>`)
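
How the two lines come about can be sketched with stand-ins for the real
functions. `fake_pci_dev_reset` and `prepare_device` below are
hypothetical stubs, not the PVE code, and the error text is simulated:

```perl
use strict;
use warnings;

# Stand-in for PVE::SysFSTools::pci_dev_reset (hypothetical stub): it
# always fails and warns the way the patched file_write would.
sub fake_pci_dev_reset {
    my ($path) = @_;
    warn "error writing '1' to '$path': simulated failure\n";
    return 0;
}

# Mirrors the patched call site: a failed reset only warns.
sub prepare_device {
    my ($pciid) = @_;
    warn "can't reset PCI device '$pciid'\n"
        if !fake_pci_dev_reset("/sys/bus/pci/devices/$pciid/reset");
    return 1;
}

# Collect the warnings in order, similar to what ends up in the task log.
my @log;
{
    local $SIG{__WARN__} = sub { push @log, $_[0] };
    prepare_device('0000:01:00.0');
}
print @log;
```

The first line comes from the low-level write, the second from the
caller, and the start continues afterwards.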

With or without the nit/idea for the qemu-server 1/1 patch, consider this:

Reviewed-by: Stoiko Ivanov <s.ivanov@proxmox.com>
Tested-by: Stoiko Ivanov <s.ivanov@proxmox.com>





* Re: [pve-devel] [PATCH qemu-server 1/1] pci: don't hard require resetting devices for passthrough
  2024-11-05  9:24 ` [pve-devel] [PATCH qemu-server 1/1] pci: don't hard require resetting devices for passthrough Dominik Csapak
@ 2024-11-05 10:16   ` Stoiko Ivanov
  2024-11-05 10:40     ` Dominik Csapak
  0 siblings, 1 reply; 7+ messages in thread
From: Stoiko Ivanov @ 2024-11-05 10:16 UTC (permalink / raw)
  To: Dominik Csapak; +Cc: Proxmox VE development discussion

On Tue,  5 Nov 2024 10:24:21 +0100
Dominik Csapak <d.csapak@proxmox.com> wrote:

> Since pve-common commit:
> 
>  eff5957 (sysfstools: file_write: properly catch errors)
> 
> this check here fails now when the reset does not work. It turns out
> that resetting the device is not always necessary, and we previously
> ignored most errors when trying to do so.
> 
> To restore that functionality, downgrade this `die` to a warning.
> 
> If the device really needs a reset to work, it will either fail later
> during startup, or not work correctly in the guest, but that behavior
> existed before and is AFAIK not really detectable from our side.
> 
> Signed-off-by: Dominik Csapak <d.csapak@proxmox.com>
> ---
>  PVE/QemuServer/PCI.pm | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/PVE/QemuServer/PCI.pm b/PVE/QemuServer/PCI.pm
> index 75eac134..dceb8938 100644
> --- a/PVE/QemuServer/PCI.pm
> +++ b/PVE/QemuServer/PCI.pm
> @@ -728,7 +728,7 @@ sub prepare_pci_device {
>      } else {
>  	die "can't unbind/bind PCI group to VFIO '$pciid'\n"
>  	    if !PVE::SysFSTools::pci_dev_group_bind_to_vfio($pciid);
> -	die "can't reset PCI device '$pciid'\n"
> +	warn "can't reset PCI device '$pciid'\n"
Maybe the issue would get more visibility if we used
PVE::RESTEnvironment::log_warn here?
I'm not really sure it makes sense to mark the start tasks specifically for
something that worked before without any log line indicating that something
was off, but since I somehow expected to see the task with a warning, I
thought I'd drop this here.



>  	    if $info->{has_fl_reset} && !PVE::SysFSTools::pci_dev_reset($info);
>      }
>  




* Re: [pve-devel] [PATCH qemu-server 1/1] pci: don't hard require resetting devices for passthrough
  2024-11-05 10:16   ` Stoiko Ivanov
@ 2024-11-05 10:40     ` Dominik Csapak
  0 siblings, 0 replies; 7+ messages in thread
From: Dominik Csapak @ 2024-11-05 10:40 UTC (permalink / raw)
  To: Stoiko Ivanov; +Cc: Proxmox VE development discussion

On 11/5/24 11:16, Stoiko Ivanov wrote:
> On Tue,  5 Nov 2024 10:24:21 +0100
> Dominik Csapak <d.csapak@proxmox.com> wrote:
> 
>> Since pve-common commit:
>>
>>   eff5957 (sysfstools: file_write: properly catch errors)
>>
>> this check here fails now when the reset does not work. It turns out
>> that resetting the device is not always necessary, and we previously
>> ignored most errors when trying to do so.
>>
>> To restore that functionality, downgrade this `die` to a warning.
>>
>> If the device really needs a reset to work, it will either fail later
>> during startup, or not work correctly in the guest, but that behavior
>> existed before and is AFAIK not really detectable from our side.
>>
>> Signed-off-by: Dominik Csapak <d.csapak@proxmox.com>
>> ---
>>   PVE/QemuServer/PCI.pm | 2 +-
>>   1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/PVE/QemuServer/PCI.pm b/PVE/QemuServer/PCI.pm
>> index 75eac134..dceb8938 100644
>> --- a/PVE/QemuServer/PCI.pm
>> +++ b/PVE/QemuServer/PCI.pm
>> @@ -728,7 +728,7 @@ sub prepare_pci_device {
>>       } else {
>>   	die "can't unbind/bind PCI group to VFIO '$pciid'\n"
>>   	    if !PVE::SysFSTools::pci_dev_group_bind_to_vfio($pciid);
>> -	die "can't reset PCI device '$pciid'\n"
>> +	warn "can't reset PCI device '$pciid'\n"
> Maybe the issue would get more visibility if we used
> PVE::RESTEnvironment::log_warn here?
> I'm not really sure it makes sense to mark the start tasks specifically for
> something that worked before without any log line indicating that something
> was off, but since I somehow expected to see the task with a warning, I
> thought I'd drop this here.
> 

Meh, I'm not a fan of this, because it implies there is something for the user to do here.
But either it works, in which case the warning just documents strange hardware behavior,
or it does not work, which the user cannot fix himself anyway (if it's because
the reset does not work).

Just having the warning in the task log is IMHO good enough.

> 
> 
>>   	    if $info->{has_fl_reset} && !PVE::SysFSTools::pci_dev_reset($info);
>>       }
>>   
> 




* Re: [pve-devel] [PATCH common/qemu-server] improve sysfs write behaviour
  2024-11-05  9:24 [pve-devel] [PATCH common/qemu-server] improve sysfs write behaviour Dominik Csapak
                   ` (2 preceding siblings ...)
  2024-11-05 10:12 ` [pve-devel] [PATCH common/qemu-server] improve sysfs write behaviour Stoiko Ivanov
@ 2024-11-06 10:51 ` Christoph Heiss
  3 siblings, 0 replies; 7+ messages in thread
From: Christoph Heiss @ 2024-11-06 10:51 UTC (permalink / raw)
  To: Proxmox VE development discussion

Reviewed-by: Christoph Heiss <c.heiss@proxmox.com>
Tested-by: Christoph Heiss <c.heiss@proxmox.com>

Tested at least to the degree that this series does not regress
anything (as I don't have any hardware to pass through that exhibits
that problem in the first place).



