Date: Thu, 29 Jun 2023 16:36:40 +0200
From: Fiona Ebner
To: Thomas Lamprecht, Proxmox VE development discussion
References: <20230629135935.62588-1-f.ebner@proxmox.com>
Subject: Re: [pve-devel] [RFC cluster 1/2] pvecm: updatecerts: allow
 specifying time to wait for quorum via CLI argument

On 29.06.23 at 16:26, Thomas Lamprecht wrote:
> On 29/06/2023 at 15:59, Fiona Ebner wrote:
>> Useful for the updatecerts call triggered via the ExecStartPre hook
>> for pveproxy.service.
>>
>> When starting a node that's part of a cluster, there is a time window
>> between the start of pve-cluster.service and when quorum is reached
>> (from the node's perspective). pveproxy.service is ordered after
>> pve-cluster.service, but that does not prevent the ExecStartPre hook
>> from being executed before the node is part of the quorate partition.
>> The pvecm updatecerts command won't do anything without quorum.
>>
>> In particular, it might happen that the base directories for observed
>> files will not get created during/after the upgrade from Proxmox VE 7
>> to 8 (reported in the community forum [0] and reproduced right away in
>> a virtual test cluster).
>>
>> This parameter will allow to increase the chances for successful
>> execution of the hook.
>>
>> [0]: https://forum.proxmox.com/threads/129644/
>>
>> Signed-off-by: Fiona Ebner
>> ---
>>  src/PVE/CLI/pvecm.pm | 23 ++++++++++++++++++++++-
>>  1 file changed, 22 insertions(+), 1 deletion(-)
>>
>
>
> Hmm, I would just do something like (untested and needs importing Time::HiRes):
>
>
> @@ -576,6 +578,11 @@ __PACKAGE__->register_method ({
>          # IO (on /etc/pve) which can hang (uninterruptedly D state). That'd be
>          # no-good for ExecStartPre as it fails the whole service in this case
>          PVE::Tools::run_fork_with_timeout(30, sub {
> +            for (my $i = 0; !PVE::Cluster::check_cfs_quorum(1); $i++) {
> +                print "waiting for pmxcfs mount to appear and get quorate...\n" if $i % 50 == 0;
> +                usleep(100 * 1000);
> +                $i++;
> +            }
>              PVE::Cluster::Setup::updatecerts_and_ssh($param->@{qw(force silent)});
>              PVE::Cluster::prepare_observed_file_basedirs();
>          });
>
>
> after all any user or tooling calling this want's it to happen, so waiting until
> the timeout seems sensible enough as hard coded default to me..

The issue here is that it would delay the start of pveproxy.service by a
full 30 seconds when a node can't get quorum (e.g. after all nodes in a
cluster were down). Is that tolerable?
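
To make the trade-off concrete, here is a minimal sketch of a bounded wait
where the caller picks the limit, so a service hook could use a short wait
while other callers wait longer. The helper name wait_for_quorum and its
parameters are hypothetical and only for illustration; the quorum check is
passed in as a code reference (standing in for something like
PVE::Cluster::check_cfs_quorum) so that the snippet runs without the PVE
modules. It is not what either patch in this thread does.

```perl
use strict;
use warnings;

use Time::HiRes qw(usleep);

# Hypothetical helper (not part of PVE::Cluster): poll $check until it
# returns true or roughly $max_wait seconds have passed.
# Returns 1 if the check succeeded, 0 if the wait limit was reached.
sub wait_for_quorum {
    my ($check, $max_wait, $interval_ms) = @_;
    $interval_ms //= 100;

    # Number of sleeps allowed before giving up; 0 means "check once only".
    my $max_polls = int($max_wait * 1000 / $interval_ms);

    for (my $i = 0; ; $i++) {
        return 1 if $check->();
        return 0 if $i >= $max_polls;
        print "waiting for quorum...\n" if $i % 50 == 0;
        usleep($interval_ms * 1000);
    }
}

# Simulated check that only becomes true on the third poll.
my $polls = 0;
my $quorate = wait_for_quorum(sub { return ++$polls >= 3 }, 1, 10);
print "quorate: $quorate after $polls polls\n";
```

With a wait limit of 0 the check runs exactly once and the helper returns
right away, which is the behavior that would avoid delaying the service
start on a node that cannot get quorum.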