From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <t.lamprecht@proxmox.com>
Received: from firstgate.proxmox.com (firstgate.proxmox.com [212.224.123.68])
 (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits)
 key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest SHA256)
 (No client certificate requested)
 by lists.proxmox.com (Postfix) with ESMTPS id 5483761FF2
 for <pve-devel@lists.proxmox.com>; Tue, 15 Sep 2020 15:00:07 +0200 (CEST)
Received: from firstgate.proxmox.com (localhost [127.0.0.1])
 by firstgate.proxmox.com (Proxmox) with ESMTP id 3AEF81AC5A
 for <pve-devel@lists.proxmox.com>; Tue, 15 Sep 2020 15:00:07 +0200 (CEST)
Received: from proxmox-new.maurer-it.com (proxmox-new.maurer-it.com
 [212.186.127.180])
 (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits)
 key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest SHA256)
 (No client certificate requested)
 by firstgate.proxmox.com (Proxmox) with ESMTPS id 70B131AC4A
 for <pve-devel@lists.proxmox.com>; Tue, 15 Sep 2020 15:00:05 +0200 (CEST)
Received: from proxmox-new.maurer-it.com (localhost.localdomain [127.0.0.1])
 by proxmox-new.maurer-it.com (Proxmox) with ESMTP id 20D4B44C37;
 Tue, 15 Sep 2020 15:00:05 +0200 (CEST)
To: Alexandre DERUMIER <aderumier@odiso.com>,
 Proxmox VE development discussion <pve-devel@lists.proxmox.com>
References: <216436814.339545.1599142316781.JavaMail.zimbra@odiso.com>
 <295606419.745430.1600151269212.JavaMail.zimbra@odiso.com>
 <1227203309.12.1600154034412@webmail.proxmox.com>
 <1746620611.752896.1600159335616.JavaMail.zimbra@odiso.com>
 <1464606394.823230.1600162557186.JavaMail.zimbra@odiso.com>
 <98e79e8d-9001-db77-c032-bdfcdb3698a6@proxmox.com>
 <1282130277.831843.1600164947209.JavaMail.zimbra@odiso.com>
 <1732268946.834480.1600167871823.JavaMail.zimbra@odiso.com>
 <1800811328.836757.1600174194769.JavaMail.zimbra@odiso.com>
From: Thomas Lamprecht <t.lamprecht@proxmox.com>
Message-ID: <43250fdc-55ba-03d9-2507-a2b08c5945ce@proxmox.com>
Date: Tue, 15 Sep 2020 15:00:03 +0200
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:81.0) Gecko/20100101
 Thunderbird/81.0
MIME-Version: 1.0
In-Reply-To: <1800811328.836757.1600174194769.JavaMail.zimbra@odiso.com>
Content-Type: text/plain; charset=utf-8
Content-Language: en-GB
Content-Transfer-Encoding: 7bit
X-SPAM-LEVEL: Spam detection results:  0
 AWL -0.200 Adjusted score from AWL reputation of From: address
 KAM_DMARC_STATUS 0.01 Test Rule for DKIM or SPF Failure with Strict Alignment
 NICE_REPLY_A           -0.001 Looks like a legit reply (A)
 RCVD_IN_DNSWL_MED        -2.3 Sender listed at https://www.dnswl.org/,
 medium trust
 SPF_HELO_NONE           0.001 SPF: HELO does not publish an SPF Record
 SPF_PASS               -0.001 SPF: sender matches SPF record
Subject: Re: [pve-devel] corosync bug: cluster break after 1 node clean
 shutdown
X-BeenThere: pve-devel@lists.proxmox.com
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: Proxmox VE development discussion <pve-devel.lists.proxmox.com>
List-Unsubscribe: <https://lists.proxmox.com/cgi-bin/mailman/options/pve-devel>, 
 <mailto:pve-devel-request@lists.proxmox.com?subject=unsubscribe>
List-Archive: <http://lists.proxmox.com/pipermail/pve-devel/>
List-Post: <mailto:pve-devel@lists.proxmox.com>
List-Help: <mailto:pve-devel-request@lists.proxmox.com?subject=help>
List-Subscribe: <https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel>, 
 <mailto:pve-devel-request@lists.proxmox.com?subject=subscribe>
X-List-Received-Date: Tue, 15 Sep 2020 13:00:07 -0000

On 9/15/20 2:49 PM, Alexandre DERUMIER wrote:
> Hi,
> 
> I have reproduced it again, 
> 
> now I can't write to /etc/pve/ from any node
> 

OK, so this really seems to be an issue in pmxcfs, or between corosync and
pmxcfs, not in the HA LRM or the watchdog mux itself.

Can you try giving pmxcfs real-time scheduling, e.g., by doing:

# systemctl edit pve-cluster

And then add the following snippet:


[Service]
CPUSchedulingPolicy=rr
CPUSchedulingPriority=99


And then restart pve-cluster.
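
For reference, the whole sequence would look roughly like this (the check
with chrt afterwards is just a suggestion to confirm the policy took effect;
exact output will differ):

# systemctl edit pve-cluster        # paste the [Service] snippet above into the override
# systemctl restart pve-cluster
# chrt -p $(pidof pmxcfs)           # should now report SCHED_RR with priority 99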

> I have also added some debug logs to pve-ha-lrm, and it was stuck in:
> (but if /etc/pve is locked, this is normal)
> 
>         if ($fence_request) {
>             $haenv->log('err', "node need to be fenced - releasing agent_lock\n");
>             $self->set_local_status({ state => 'lost_agent_lock'});
>         } elsif (!$self->get_protected_ha_agent_lock()) {
>             $self->set_local_status({ state => 'lost_agent_lock'});
>         } elsif ($self->{mode} eq 'maintenance') {
>             $self->set_local_status({ state => 'maintenance'});
>         }
> 
> 
> corosync quorum is currently ok
> 
> I'm currently digging the logs

Is your simplest/most stable reproducer still a periodic restart of corosync on one node?
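(I.e., something along these lines run on a single node - the interval here
is only an example, not what you necessarily used:)

while true; do
    systemctl restart corosync
    sleep 300
done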