From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <pve-devel-bounces@lists.proxmox.com>
Received: from firstgate.proxmox.com (firstgate.proxmox.com [IPv6:2a01:7e0:0:424::9])
	by lore.proxmox.com (Postfix) with ESMTPS id BF3471FF172
	for <inbox@lore.proxmox.com>; Tue, 15 Apr 2025 08:42:08 +0200 (CEST)
Received: from firstgate.proxmox.com (localhost [127.0.0.1])
	by firstgate.proxmox.com (Proxmox) with ESMTP id 07A263F47A;
	Tue, 15 Apr 2025 08:42:06 +0200 (CEST)
Message-ID: <49f9ad1e-e29c-44af-bae7-3fd437a76d1f@proxmox.com>
Date: Tue, 15 Apr 2025 08:42:02 +0200
MIME-Version: 1.0
User-Agent: Mozilla Thunderbird
To: Proxmox VE development discussion <pve-devel@lists.proxmox.com>,
 Max Carrara <m.carrara@proxmox.com>
References: <20250411150831.255017-1-d.kral@proxmox.com>
 <20250411150831.255017-2-d.kral@proxmox.com>
 <D93XRB2F7AS8.365FZ2C4M427F@proxmox.com>
Content-Language: en-US
From: Daniel Kral <d.kral@proxmox.com>
In-Reply-To: <D93XRB2F7AS8.365FZ2C4M427F@proxmox.com>
X-SPAM-LEVEL: Spam detection results:  0
 AWL 0.014 Adjusted score from AWL reputation of From: address
 BAYES_00                 -1.9 Bayes spam probability is 0 to 1%
 DMARC_MISSING             0.1 Missing DMARC policy
 KAM_DMARC_STATUS 0.01 Test Rule for DKIM or SPF Failure with Strict Alignment
 RCVD_IN_VALIDITY_CERTIFIED_BLOCKED 0.001 ADMINISTRATOR NOTICE: The query to
 Validity was blocked. See
 https://knowledge.validity.com/hc/en-us/articles/20961730681243 for more
 information.
 RCVD_IN_VALIDITY_RPBL_BLOCKED 0.001 ADMINISTRATOR NOTICE: The query to
 Validity was blocked. See
 https://knowledge.validity.com/hc/en-us/articles/20961730681243 for more
 information.
 RCVD_IN_VALIDITY_SAFE_BLOCKED 0.001 ADMINISTRATOR NOTICE: The query to
 Validity was blocked. See
 https://knowledge.validity.com/hc/en-us/articles/20961730681243 for more
 information.
 SPF_HELO_NONE           0.001 SPF: HELO does not publish an SPF Record
 SPF_PASS               -0.001 SPF: sender matches SPF record
 URIBL_BLOCKED 0.001 ADMINISTRATOR NOTICE: The query to URIBL was blocked. See
 http://wiki.apache.org/spamassassin/DnsBlocklists#dnsbl-block for more
 information. [proxmox.com]
Subject: Re: [pve-devel] [PATCH storage 2/2] fix #6224: disks: get: set
 timeout for retrieval of SMART stat data
X-BeenThere: pve-devel@lists.proxmox.com
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: Proxmox VE development discussion <pve-devel.lists.proxmox.com>
List-Unsubscribe: <https://lists.proxmox.com/cgi-bin/mailman/options/pve-devel>, 
 <mailto:pve-devel-request@lists.proxmox.com?subject=unsubscribe>
List-Archive: <http://lists.proxmox.com/pipermail/pve-devel/>
List-Post: <mailto:pve-devel@lists.proxmox.com>
List-Help: <mailto:pve-devel-request@lists.proxmox.com?subject=help>
List-Subscribe: <https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel>, 
 <mailto:pve-devel-request@lists.proxmox.com?subject=subscribe>
Reply-To: Proxmox VE development discussion <pve-devel@lists.proxmox.com>
Content-Transfer-Encoding: 7bit
Content-Type: text/plain; charset="us-ascii"; Format="flowed"
Errors-To: pve-devel-bounces@lists.proxmox.com
Sender: "pve-devel" <pve-devel-bounces@lists.proxmox.com>

Thanks for the review, Max! :)

On 4/11/25 18:04, Max Carrara wrote:
> On Fri Apr 11, 2025 at 5:08 PM CEST, Daniel Kral wrote:
>> As mentioned in the Bugzilla and indicated above, I haven't found any
>> clear indicator for this happening besides that the most affected
>> devices seem to be USB devices, which use the mentioned UAS kernel
>> module.
> 
> Have you perhaps found any way to test this? I could then try to
> replicate this behaviour. Otherwise no hard feelings; I think setting a
> shorter timeout for (usually) smaller commands is something we should do
> in general.

Unfortunately not, I've tried all the (4) USB devices I had on me, but 
sadly none of them had those quirks ;). I tested only that the error 
path works correctly with simply substituting the smartctl command with 
`sleep 11` and `sh -c 'exit 3'` for the timeout + non-zero return.

It'd be sure great if someone with an affected disk could test this 
directly, I'll forward it to the Bugzilla entry and forum post so it 
might get more coverage.

> (That being said, looking through the code of PVE::Tools::run_command---
> I'm surprised we don't set a default timeout there at all. I think
> introducing one there could perhaps break something unexpected, though,
> so I'd rather not touch it.)

Yes, I'd guess that there would be some places where the $noerr is set, 
but $timeout will error anyway now AFAICS as here, so there'd be quite a 
few places which do not have error handlers setup. I hope that smartctl 
is more of an odd case here as the timeout is quite high because of reasons.

>> I'm fine lowering the timeout further, but 10 seconds seemed reasonable
>> if only one disk is affected for now, so that loading takes some time
>> and not seemingly forever.
> 
> Given that I've never had a single device take longer than a split
> second, I think this is quite reasonable too.
> 
>>
>> I was also thinking about just caching which disks have had that
>> behavior and just not running the command for them, but I thought this
>> would add more complexity than needed here.
> 
> I agree that this would be a little too much; you'd also have to
> invalidate cache entries after a certain time / a certain condition etc.
> You'd also have to handle the case where the disk starts to magically
> respond to `smartctl` again. Better to just keep the timeout here as-is.

Agreed, that would be way too much for this. And as it seems from the 
forum, it was probably a faulty controller / firmware (?) anyway [0].

[0] https://forum.proxmox.com/threads/164799/#post-763192

> Either way, nice work! For both patches, consider:
> 
> Reviewed-by: Max Carrara <m.carrara@proxmox.com>
> 
> (Though, I'd still like to test this somehow, if you found a way to do so)


_______________________________________________
pve-devel mailing list
pve-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel