Date: Thu, 28 Sep 2023 14:57:10 +0200
From: Gabriel Goller
To: Thomas Lamprecht, Proxmox Backup Server development discussion, Dominik Csapak, Fabian Grünbichler
Message-ID: <27539eba-9e90-15bf-aed1-635e2e57aebc@proxmox.com>
In-Reply-To: <7b0ac740-096b-4850-a81d-5fa6bc9c347a@proxmox.com>
References: <20230921082157.33718-1-g.goller@proxmox.com> <1695714131.wbzdf9yjca.astroid@yuna.none> <7b0ac740-096b-4850-a81d-5fa6bc9c347a@proxmox.com>
Subject: Re: [pbs-devel] [PATCH widget-toolkit] fix #4961: disks: increase timeout to 1min

On 9/26/23 15:42, Thomas Lamprecht wrote:
> [..]
> I'd rather like to avoid that, 30s is already long and might make some
> users just close the dialogue or whatever.
>
> And general "lightweight-task" mechanism that Fabian proposed could be
> nice for some things but IMO the wrong approach here too.
>
> Here, with the end result being to show a simple disk list, I just
> refuse to believe that this has to take that long; here the API call has
> some serious architectural problems, among other things that can mean a
> mix of:
>
> 1. too much stuff read that is not relevant most of the time, we could
>    avoid querying the data that has the worst expensive/useful ratio
>    and make that only available on a per-disk info dialogue.
>
> 2. getting some info in a convoluted way (e.g., we have already lsblk
>    info, which knows if a disk is a "zfs_member", but still do a zpool
>    list, and all of those do separate stats on the device nodes to get
>    the rdevs, multiple times...)
>
> 3. missing a cache for the more expensive stuff, e.g., one that the rrd
>    stat querying code could probably fill every X'd run or so, as it
>    checks for IO metrics anyway
>
> I'd start with the 2. point above, as I saw quite a bit optimization
> potential with only a quick skimming, some of that should hopefully
> produce some actual improvements.
>
> Then I'd check where most of the time is spent, either by asking the
> reporter nicely for timings of all relevant commands and potentially
> also by just checking ourselves, we have slow HW here too, and while
> not 90 disks, even a tenth should show slowish enough behavior to
> be able to find the major culprits with some tracing or even just
> printf-timestamp-debugging
>
> The _last_ thing I'd do here for this specific endpoint is extending
> timeouts or inventing some new fancy light-weight task mechanisms.
> As optimizations, omitting expensive stuff, or moving it out from
> "for-all-disk" to a "per-single-disk" view, and caching for the
> "expensive but just really interesting to always see" stuff goes a
> long way..
>
> btw. PBS still misses the Wipe-Disk feature from PVE..

I looked into optimizing some stuff and noticed that (as you wrote in
the issue) the slowest part is smartctl. Getting the smartctl data takes
about 60-70% of the time spent on the request.

I only have 4 disks with 6 partitions, but the API call still takes
about 232.22ms on average. Disabling the smartctl check brings that down
to 157.22ms. With some other optimizations I got down to 121.27ms, which
is still not great, but an improvement. I will try a few more things and
check whether they help...

In the meantime, I thought about a way of skipping the smartctl checks
when there are a lot of disks: how about we add a "get smartctl data"
button on the frontend, which manually updates the smartctl data in the
table? We could then check server-side whether there are more than x
disks, or whether "lsblk" takes more than x seconds, and in that case
skip the smartctl check and return an unknown status. The user can then
click the "get smartctl data" button, which runs the smartctl check in a
separate API request.

So when a user doesn't have a lot of disks, we still return them all,
including the smartctl data, in the same request. But if it would take
too long, we skip the smartctl check and let the user decide whether
they need the data or not.
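
To make the idea a bit more concrete, here is a rough Rust sketch of what
the server-side behavior could look like. All names (`SMART_DISK_LIMIT`,
`SmartStatus`, `DiskInfo`, `query_smartctl`) are made up for illustration
and are not the actual PBS types or API:

```rust
// Rough sketch only: skip the expensive smartctl call when there are many
// disks and report "unknown" instead, so the client can fetch SMART data
// later via a separate request. All names here are hypothetical.

const SMART_DISK_LIMIT: usize = 16; // exact threshold up for discussion

#[derive(Debug, Clone, Copy)]
enum SmartStatus {
    Passed,
    Failed,
    Unknown, // set when the smartctl check was skipped
}

#[derive(Debug)]
struct DiskInfo {
    name: String,
    smart: SmartStatus,
}

// Stand-in for the real smartctl invocation (the expensive part,
// roughly 60-70% of the request time in my measurements).
fn query_smartctl(_disk: &str) -> SmartStatus {
    SmartStatus::Passed
}

fn list_disks(disk_names: Vec<String>, include_smart: bool) -> Vec<DiskInfo> {
    // Only run smartctl when requested and the disk count is small enough.
    let run_smart = include_smart && disk_names.len() <= SMART_DISK_LIMIT;

    disk_names
        .into_iter()
        .map(|name| {
            let smart = if run_smart {
                query_smartctl(&name)
            } else {
                SmartStatus::Unknown
            };
            DiskInfo { name, smart }
        })
        .collect()
}

fn main() {
    // With few disks the SMART data is still included in the same request.
    let few = list_disks(vec!["sda".into(), "sdb".into()], true);
    println!("{few:?}");

    // With many disks we fall back to SmartStatus::Unknown and let the user
    // trigger the smartctl check from the frontend button instead.
    let many: Vec<String> = (0..90).map(|i| format!("sd{i}")).collect();
    println!("{:?}", &list_disks(many, true)[..3]);
}
```

The frontend part would then just be a button that calls a separate
endpoint doing only the smartctl check, so the normal disk list stays
fast no matter how many disks there are.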