public inbox for pve-devel@lists.proxmox.com
 help / color / mirror / Atom feed
* [pve-devel] [PATCH storage 1/2] disks: get: separate error path for retrieving SMART data
@ 2025-04-11 15:08 Daniel Kral
  2025-04-11 15:08 ` [pve-devel] [PATCH storage 2/2] fix #6224: disks: get: set timeout for retrieval of SMART stat data Daniel Kral
  2025-04-11 15:52 ` [pve-devel] [PATCH storage 1/2] disks: get: separate error path for retrieving SMART data Max Carrara
  0 siblings, 2 replies; 6+ messages in thread
From: Daniel Kral @ 2025-04-11 15:08 UTC (permalink / raw)
  To: pve-devel

Make the subroutine get_smart_data() die with the error message from
running the `smartctl` command before. This is in preparation for the
next patch, which makes that command fail in certain scenarios.

Signed-off-by: Daniel Kral <d.kral@proxmox.com>
---
 src/PVE/Diskmanage.pm | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/src/PVE/Diskmanage.pm b/src/PVE/Diskmanage.pm
index 4272668..059d645 100644
--- a/src/PVE/Diskmanage.pm
+++ b/src/PVE/Diskmanage.pm
@@ -150,8 +150,10 @@ sub get_smart_data {
     };
     my $err = $@;
 
+    die "Error getting S.M.A.R.T. data: $err\n" if $err;
+
     # bit 0 and 1 mark a fatal error, other bits are for disk status -> ignore (see man 8 smartctl)
-    if ((defined($returncode) && ($returncode & 0b00000011)) || $err) {
+    if ((defined($returncode) && ($returncode & 0b00000011))) {
 	die "Error getting S.M.A.R.T. data: Exit code: $returncode\n";
     }
 
-- 
2.39.5



_______________________________________________
pve-devel mailing list
pve-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel


^ permalink raw reply	[flat|nested] 6+ messages in thread

* [pve-devel] [PATCH storage 2/2] fix #6224: disks: get: set timeout for retrieval of SMART stat data
  2025-04-11 15:08 [pve-devel] [PATCH storage 1/2] disks: get: separate error path for retrieving SMART data Daniel Kral
@ 2025-04-11 15:08 ` Daniel Kral
  2025-04-11 16:04   ` Max Carrara
  2025-04-11 15:52 ` [pve-devel] [PATCH storage 1/2] disks: get: separate error path for retrieving SMART data Max Carrara
  1 sibling, 1 reply; 6+ messages in thread
From: Daniel Kral @ 2025-04-11 15:08 UTC (permalink / raw)
  To: pve-devel

In rare scenarios, `smartctl` takes up to 60 seconds to timeout for SCSI
commands to be completed, as reported in our user forum [0] and bugzilla
[1]. It seems that USB drives handled by the USB Attached SCSI (UAS)
kernel module are more likely to be affected by this [2], but is more of
a case-by-case situation.

Therefore, set a more reasonable timeout of 10 seconds, so that callers
don't have to wait too long or seem unresponsive (e.g. Node Disks view
in the WebGUI).

[0] https://forum.proxmox.com/threads/164799/
[1] https://bugzilla.proxmox.com/show_bug.cgi?id=6224
[2] https://www.smartmontools.org/wiki/SAT-with-UAS-Linux

Signed-off-by: Daniel Kral <d.kral@proxmox.com>
---
As mentioned in the Bugzilla and indicated above, I haven't found any
clear indicator for this happening besides that the most affected
devices seem to be USB devices, which use the mentioned UAS kernel
module.

I'm fine lowering the timeout further, but 10 seconds seemed reasonable
if only one disk is affected for now, so that loading takes some time
and not seemingly forever.

I was also thinking about just caching which disks have had that
behavior and just not running the command for them, but I thought this
would add more complexity than needed here.

 src/PVE/Diskmanage.pm | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/PVE/Diskmanage.pm b/src/PVE/Diskmanage.pm
index 059d645..6aa1338 100644
--- a/src/PVE/Diskmanage.pm
+++ b/src/PVE/Diskmanage.pm
@@ -98,7 +98,7 @@ sub get_smart_data {
     push @$cmd, $disk;
 
     my $returncode = eval {
-	run_command($cmd, noerr => 1, outfunc => sub {
+	run_command($cmd, noerr => 1, timeout => 10, outfunc => sub {
 	    my ($line) = @_;
 
 # ATA SMART attributes, e.g.:
-- 
2.39.5



_______________________________________________
pve-devel mailing list
pve-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [pve-devel] [PATCH storage 1/2] disks: get: separate error path for retrieving SMART data
  2025-04-11 15:08 [pve-devel] [PATCH storage 1/2] disks: get: separate error path for retrieving SMART data Daniel Kral
  2025-04-11 15:08 ` [pve-devel] [PATCH storage 2/2] fix #6224: disks: get: set timeout for retrieval of SMART stat data Daniel Kral
@ 2025-04-11 15:52 ` Max Carrara
  2025-04-15  7:15   ` Daniel Kral
  1 sibling, 1 reply; 6+ messages in thread
From: Max Carrara @ 2025-04-11 15:52 UTC (permalink / raw)
  To: Proxmox VE development discussion

On Fri Apr 11, 2025 at 5:08 PM CEST, Daniel Kral wrote:
> Make the subroutine get_smart_data() die with the error message from
> running the `smartctl` command before. This is in preparation for the
> next patch, which makes that command fail in certain scenarios.
>
> Signed-off-by: Daniel Kral <d.kral@proxmox.com>
> ---
>  src/PVE/Diskmanage.pm | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/src/PVE/Diskmanage.pm b/src/PVE/Diskmanage.pm
> index 4272668..059d645 100644
> --- a/src/PVE/Diskmanage.pm
> +++ b/src/PVE/Diskmanage.pm
> @@ -150,8 +150,10 @@ sub get_smart_data {
>      };
>      my $err = $@;
>  
> +    die "Error getting S.M.A.R.T. data: $err\n" if $err;

The above is *fine* IMO, but for future reference, I'd recommend
something like this if $@ is only used in one place:

    die "Error getting S.M.A.R.T. data: $@\n" if $@;

Also, another convenient pattern I've seen (and also been using lately)
is the following---useful if you need to do a bit more than just die:

    if (my $err = $@) {
        # [...]
    }

> +
>      # bit 0 and 1 mark a fatal error, other bits are for disk status -> ignore (see man 8 smartctl)
> -    if ((defined($returncode) && ($returncode & 0b00000011)) || $err) {
> +    if ((defined($returncode) && ($returncode & 0b00000011))) {

The inner parentheses here are redundant, now that `|| $err` is gone ;)
But can IMO just be adapted in a tiny follow-up / when applying.

>  	die "Error getting S.M.A.R.T. data: Exit code: $returncode\n";
>      }
>  



_______________________________________________
pve-devel mailing list
pve-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [pve-devel] [PATCH storage 2/2] fix #6224: disks: get: set timeout for retrieval of SMART stat data
  2025-04-11 15:08 ` [pve-devel] [PATCH storage 2/2] fix #6224: disks: get: set timeout for retrieval of SMART stat data Daniel Kral
@ 2025-04-11 16:04   ` Max Carrara
  2025-04-15  6:42     ` Daniel Kral
  0 siblings, 1 reply; 6+ messages in thread
From: Max Carrara @ 2025-04-11 16:04 UTC (permalink / raw)
  To: Proxmox VE development discussion

On Fri Apr 11, 2025 at 5:08 PM CEST, Daniel Kral wrote:
> In rare scenarios, `smartctl` takes up to 60 seconds to timeout for SCSI
> commands to be completed, as reported in our user forum [0] and bugzilla
> [1]. It seems that USB drives handled by the USB Attached SCSI (UAS)
> kernel module are more likely to be affected by this [2], but is more of
> a case-by-case situation.
>
> Therefore, set a more reasonable timeout of 10 seconds, so that callers
> don't have to wait too long or seem unresponsive (e.g. Node Disks view
> in the WebGUI).
>
> [0] https://forum.proxmox.com/threads/164799/
> [1] https://bugzilla.proxmox.com/show_bug.cgi?id=6224
> [2] https://www.smartmontools.org/wiki/SAT-with-UAS-Linux
>
> Signed-off-by: Daniel Kral <d.kral@proxmox.com>
> ---
> As mentioned in the Bugzilla and indicated above, I haven't found any
> clear indicator for this happening besides that the most affected
> devices seem to be USB devices, which use the mentioned UAS kernel
> module.

Have you perhaps found any way to test this? I could then try to
replicate this behaviour. Otherwise no hard feelings; I think setting a
shorter timeout for (usually) smaller commands is something we should do
in general.

(That being said, looking through the code of PVE::Tools::run_command---
I'm surprised we don't set a default timeout there at all. I think
introducing one there could perhaps break something unexpected, though,
so I'd rather not touch it.)

>
> I'm fine lowering the timeout further, but 10 seconds seemed reasonable
> if only one disk is affected for now, so that loading takes some time
> and not seemingly forever.

Given that I've never had a single device take longer than a split
second, I think this is quite reasonable too.

>
> I was also thinking about just caching which disks have had that
> behavior and just not running the command for them, but I thought this
> would add more complexity than needed here.

I agree that this would be a little too much; you'd also have to
invalidate cache entries after a certain time / a certain condition etc.
You'd also have to handle the case where the disk starts to magically
respond to `smartctl` again. Better to just keep the timeout here as-is.


Either way, nice work! For both patches, consider:

Reviewed-by: Max Carrara <m.carrara@proxmox.com>

(Though, I'd still like to test this somehow, if you found a way to do so)

>
>  src/PVE/Diskmanage.pm | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/src/PVE/Diskmanage.pm b/src/PVE/Diskmanage.pm
> index 059d645..6aa1338 100644
> --- a/src/PVE/Diskmanage.pm
> +++ b/src/PVE/Diskmanage.pm
> @@ -98,7 +98,7 @@ sub get_smart_data {
>      push @$cmd, $disk;
>  
>      my $returncode = eval {
> -	run_command($cmd, noerr => 1, outfunc => sub {
> +	run_command($cmd, noerr => 1, timeout => 10, outfunc => sub {
>  	    my ($line) = @_;
>  
>  # ATA SMART attributes, e.g.:



_______________________________________________
pve-devel mailing list
pve-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [pve-devel] [PATCH storage 2/2] fix #6224: disks: get: set timeout for retrieval of SMART stat data
  2025-04-11 16:04   ` Max Carrara
@ 2025-04-15  6:42     ` Daniel Kral
  0 siblings, 0 replies; 6+ messages in thread
From: Daniel Kral @ 2025-04-15  6:42 UTC (permalink / raw)
  To: Proxmox VE development discussion, Max Carrara

Thanks for the review, Max! :)

On 4/11/25 18:04, Max Carrara wrote:
> On Fri Apr 11, 2025 at 5:08 PM CEST, Daniel Kral wrote:
>> As mentioned in the Bugzilla and indicated above, I haven't found any
>> clear indicator for this happening besides that the most affected
>> devices seem to be USB devices, which use the mentioned UAS kernel
>> module.
> 
> Have you perhaps found any way to test this? I could then try to
> replicate this behaviour. Otherwise no hard feelings; I think setting a
> shorter timeout for (usually) smaller commands is something we should do
> in general.

Unfortunately not, I've tried all the (4) USB devices I had on me, but 
sadly none of them had those quirks ;). I tested only that the error 
path works correctly with simply substituting the smartctl command with 
`sleep 11` and `sh -c 'exit 3'` for the timeout + non-zero return.

It'd be sure great if someone with an affected disk could test this 
directly, I'll forward it to the Bugzilla entry and forum post so it 
might get more coverage.

> (That being said, looking through the code of PVE::Tools::run_command---
> I'm surprised we don't set a default timeout there at all. I think
> introducing one there could perhaps break something unexpected, though,
> so I'd rather not touch it.)

Yes, I'd guess that there would be some places where the $noerr is set, 
but $timeout will error anyway now AFAICS as here, so there'd be quite a 
few places which do not have error handlers setup. I hope that smartctl 
is more of an odd case here as the timeout is quite high because of reasons.

>> I'm fine lowering the timeout further, but 10 seconds seemed reasonable
>> if only one disk is affected for now, so that loading takes some time
>> and not seemingly forever.
> 
> Given that I've never had a single device take longer than a split
> second, I think this is quite reasonable too.
> 
>>
>> I was also thinking about just caching which disks have had that
>> behavior and just not running the command for them, but I thought this
>> would add more complexity than needed here.
> 
> I agree that this would be a little too much; you'd also have to
> invalidate cache entries after a certain time / a certain condition etc.
> You'd also have to handle the case where the disk starts to magically
> respond to `smartctl` again. Better to just keep the timeout here as-is.

Agreed, that would be way too much for this. And as it seems from the 
forum, it was probably a faulty controller / firmware (?) anyway [0].

[0] https://forum.proxmox.com/threads/164799/#post-763192

> Either way, nice work! For both patches, consider:
> 
> Reviewed-by: Max Carrara <m.carrara@proxmox.com>
> 
> (Though, I'd still like to test this somehow, if you found a way to do so)


_______________________________________________
pve-devel mailing list
pve-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [pve-devel] [PATCH storage 1/2] disks: get: separate error path for retrieving SMART data
  2025-04-11 15:52 ` [pve-devel] [PATCH storage 1/2] disks: get: separate error path for retrieving SMART data Max Carrara
@ 2025-04-15  7:15   ` Daniel Kral
  0 siblings, 0 replies; 6+ messages in thread
From: Daniel Kral @ 2025-04-15  7:15 UTC (permalink / raw)
  To: Proxmox VE development discussion, Max Carrara

On 4/11/25 17:52, Max Carrara wrote:
> On Fri Apr 11, 2025 at 5:08 PM CEST, Daniel Kral wrote:
>> Make the subroutine get_smart_data() die with the error message from
>> running the `smartctl` command before. This is in preparation for the
>> next patch, which makes that command fail in certain scenarios.
>>
>> Signed-off-by: Daniel Kral <d.kral@proxmox.com>
>> ---
>>   src/PVE/Diskmanage.pm | 4 +++-
>>   1 file changed, 3 insertions(+), 1 deletion(-)
>>
>> diff --git a/src/PVE/Diskmanage.pm b/src/PVE/Diskmanage.pm
>> index 4272668..059d645 100644
>> --- a/src/PVE/Diskmanage.pm
>> +++ b/src/PVE/Diskmanage.pm
>> @@ -150,8 +150,10 @@ sub get_smart_data {
>>       };
>>       my $err = $@;
>>   
>> +    die "Error getting S.M.A.R.T. data: $err\n" if $err;
> 
> The above is *fine* IMO, but for future reference, I'd recommend
> something like this if $@ is only used in one place:
> 
>      die "Error getting S.M.A.R.T. data: $@\n" if $@;
> 
> Also, another convenient pattern I've seen (and also been using lately)
> is the following---useful if you need to do a bit more than just die:
> 
>      if (my $err = $@) {
>          # [...]
>      }
> 
>> +
>>       # bit 0 and 1 mark a fatal error, other bits are for disk status -> ignore (see man 8 smartctl)
>> -    if ((defined($returncode) && ($returncode & 0b00000011)) || $err) {
>> +    if ((defined($returncode) && ($returncode & 0b00000011))) {
> 
> The inner parentheses here are redundant, now that `|| $err` is gone ;)
> But can IMO just be adapted in a tiny follow-up / when applying.

Thanks for pointing out both, should have taken a closer look here ;).

I've made those changes + a RFC with a possibly missing error handler 
and sent a v2: 
https://lore.proxmox.com/pve-devel/20250415071123.36921-1-d.kral@proxmox.com/T/#t



_______________________________________________
pve-devel mailing list
pve-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel


^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2025-04-15  7:16 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-04-11 15:08 [pve-devel] [PATCH storage 1/2] disks: get: separate error path for retrieving SMART data Daniel Kral
2025-04-11 15:08 ` [pve-devel] [PATCH storage 2/2] fix #6224: disks: get: set timeout for retrieval of SMART stat data Daniel Kral
2025-04-11 16:04   ` Max Carrara
2025-04-15  6:42     ` Daniel Kral
2025-04-11 15:52 ` [pve-devel] [PATCH storage 1/2] disks: get: separate error path for retrieving SMART data Max Carrara
2025-04-15  7:15   ` Daniel Kral

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox
Service provided by Proxmox Server Solutions GmbH | Privacy | Legal