Message-ID: <3a376498-7d82-f26d-93c4-428fc34c930c@proxmox.com>
Date: Thu, 23 Jun 2022 13:27:51 +0200
From: Thomas Lamprecht
To: Proxmox VE development discussion, "DERUMIER, Alexandre"
Subject: Re: [pve-devel] last training week student feedback/request

Hi,

On 23/06/2022 at 10:25, DERUMIER, Alexandre wrote:
> 1)
>
> We have a use case, with an HA-enabled cluster, where a student needs to
> shut down the whole cluster cleanly through the API or a script
> (unplanned electrical shutdown, through a UPS with NUT).
>
> He wants to cleanly stop all the VMs, then all the nodes.
>
> Simply shutting the nodes down one by one doesn't work, because some
> nodes can lose quorum once half of the cluster is already shut down, so
> HA gets stuck and nodes can be fenced by the watchdog.
>
> We looked at cleanly stopping all the VMs first.
> The pve-guests service can't be used with HA.
> So we wrote a script that loops over all the VMs with "qm stop".
> The problem is that the HA state of the VMs then goes to "stopped", so
> when the servers come back up after the maintenance, we need to script
> a "qm start" of the VMs again.
>
> The student asked whether it would be possible to add some kind of
> "cluster maintenance" option, to disable HA on the whole cluster
> (pause/stop all pve-ha-crm/lrm + disable the watchdog) and temporarily
> remove all VM services from HA.
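Until something like that exists, a per-node wrapper along the lines of what
you describe is probably the pragmatic way out. Below is a rough, untested
sketch of what I mean; it assumes all guests are VMs managed as vm:<vmid> HA
resources, and the state-file path is just a made-up example:

#!/bin/bash
# Rough, untested sketch: cleanly stop all local VMs before a planned
# power cut and start them again afterwards. Run it on each node, as qm
# only manages local guests. Assumes VMs only (no containers) and that HA
# resources use the vm:<vmid> naming scheme; the file path is made up.

VMID_FILE=/root/running-vms.list

case "$1" in
stop-all)
    # remember which VMs were running so we can bring them back later
    qm list | awk '$3 == "running" {print $1}' > "$VMID_FILE"

    while read -r vmid; do
        # take the VM out of HA control first, so stopping it does not
        # flip its HA request state to "stopped" permanently
        ha-manager set "vm:$vmid" --state ignored 2>/dev/null || true
        qm shutdown "$vmid" || qm stop "$vmid"
    done < "$VMID_FILE"
    ;;
start-all)
    while read -r vmid; do
        qm start "$vmid"
        # hand the VM back to HA once it is running again
        ha-manager set "vm:$vmid" --state started 2>/dev/null || true
    done < "$VMID_FILE"
    ;;
*)
    echo "usage: $0 {stop-all|start-all}" >&2
    exit 1
    ;;
esac

Flipping the resources to "ignored" before stopping is what avoids the HA
state permanently ending up as "stopped", which is exactly the problem you
ran into with plain "qm stop".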
>
> I think it could be useful too when adding new nodes to the cluster,
> where a misbehaving new corosync node could impact the whole cluster.

We talked about something like that in our internal chat a while ago:

> For the HA it would basically be a maintenance mode the master node
> propagates, without any service daemon stop/starts or the like (just as
> dangerous too), that can then be handled live, and the status can display
> the "currently entering" vs. "maintenance active" (once all LRMs switched
> their state correctly) difference. Additionally, one could imagine having
> two different modes: an "ignore every service command" one and an
> "unsafe, redirect service commands as if there isn't any HA active" one.

Whether this then gets done automatically on cluster node join is a bit of
a separate question, but that should be relatively easy to add.

>
> Also, related to this, maybe a "node maintenance" option could be great
> too, like VMware has (automatic VM eviction with live migration), for
> when a user needs to change the network config, for example, without
> shutting down the node.
>
>
> 2)
> Another student has a need with PCI passthrough: a cluster with
> multiple nodes and multiple PCI cards.
> He's using HA and has 1 or 2 backup nodes with a lot of cards,
> to be able to fail over 10 other servers.

See Dominik's RFC:
https://lists.proxmox.com/pipermail/pve-devel/2021-June/048862.html

Should be possible to get that in for 7.3.

>
> 5)
> All my students have the Windows reboot-stuck problem since migrating to
> Proxmox VE 7. (I have the problem too, randomly; I'm currently trying to
> debug this.)

Yeah, reproducing this is really hard and that's the main issue holding up
a fix. Added to that, there seems to be more than one problem (stuck vs.
crash), which we're trying to investigate in parallel.

Our slightly questionable reproducer for the crash one showed that the
issues started with kernel 5.15 (5.14 and its stable releases seem to be
fine, albeit it's hard to tell for sure), and we can only trigger it on a
machine with an outdated BIOS (a carbon copy of that host with a newer BIOS
won't trigger it).

>
> 6)
>
> PBS: all students are using PBS, and it's working very well.
>
> Some users have fast NVMe in production, and slower HDDs for PBS on a
> remote site.
>
> A student asked whether it would be possible to add some kind of write
> cache on a local PBS with fast NVMe, forwarding to the remote, slower
> PBS (without the need for a full PBS datastore with NVMe on the local
> site).

Hmm, to understand correctly, basically: a daemon that runs locally and
sits in between the client and the remote PBS. It allows writing new chunks
locally, returning relatively quickly to QEMU/the client, and sends the
chunks to the actual remote backing store in the background. If it's full
it'd stall until a few chunks were sent out and can be removed, and it
would also stall the worker task until all chunks are flushed at the end of
the backup.

Seems like it would add quite a bit of complexity though, and would mostly
be helpful when the PBS is really remote, with low link speed and higher
latency, not just in the LAN.

IMO it's better to use a full-blown PBS with a low keep-x retention setting
and sync periodically to an archive PBS with higher retention. That needs a
bit more storage on the LAN one, but is conceptually much simpler.
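Just as a rough illustration of that last suggestion; the remote/datastore
names, auth-id, host and schedules below are made-up example values, and the
exact options should be double-checked against the proxmox-backup-manager
man page:

# On the (bigger, slower) archive PBS: pull from the fast LAN PBS.
# "lan-pbs", "fast-store", "archive", "sync@pbs" and the host/schedules
# are made-up example values.

# define the fast LAN PBS as a remote
proxmox-backup-manager remote create lan-pbs \
    --host pbs-lan.example.com \
    --auth-id 'sync@pbs' \
    --password 'SECRET' \
    --fingerprint '64:d3:...'       # use the real fingerprint here

# pull its datastore into the local archive datastore once a day
proxmox-backup-manager sync-job create lan-to-archive \
    --remote lan-pbs --remote-store fast-store \
    --store archive --schedule daily

# On the fast LAN (NVMe) PBS: keep only a few snapshots so it stays small
proxmox-backup-manager datastore update fast-store \
    --keep-last 3 --prune-schedule daily

As long as remove-vanished isn't enabled on the sync job, pruning
aggressively on the LAN side won't delete anything that was already synced
over to the archive side.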