From: Frank Thommen
To: pve-user@lists.proxmox.com
Organization: DKFZ Heidelberg, Omics IT and Data Management Core Facility (ODCF)
Date: Tue, 5 Jan 2021 21:17:11 +0100
Subject: Re: [PVE-User] After update Ceph monitor shows wrong version in UI and is down and out of quorum
On 05.01.21 21:02, Uwe Sauter wrote:
> There's a paragraph about probing mons on
>
> https://docs.ceph.com/en/latest/rados/troubleshooting/troubleshooting-mon/

I will check that (tomorrow :-); a rough sketch of what I plan to run
is below, after my answers.

> Can you connect to ports TCP/3300 and TCP/6789 from the other two hosts?
> You can use telnet to check this. (On host1 or host3 run "telnet host
> port". To quit telnet, press ctrl+], then type "quit" + enter.)

Yes, all working OK.

> Is the clock synchronized on the affected host?

Yes.

> From the package lists I assume that host1 has a GUI installed while
> host3 additionally acts as MariaDB server? I always keep the packages
> list in sync on the clusters I run…

No MariaDB server anywhere, and all three have the PVE web UI; I
wouldn't know of any other GUI. Additional packages might have come in
as prerequisites for individually installed admin tools (e.g.
wireshark). We install those ad hoc and don't usually keep them in
sync across all hosts.

> It also looks like hosts 1 and 3 are on different kernels (though that
> shouldn't be an issue here…).

1 and 3 should be the same, while 2 has been updated more recently.

Frank
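
PS, for the archive: if I read the troubleshooting-mon page correctly,
probing the monitor locally boils down to something like the
following. This is only a sketch of what I intend to try, not a known
procedure; the mon ID (odcf-pve02) and the admin socket path are my
assumptions, based on our host names and the Nautilus defaults:

-----------------------
# ask the stuck monitor for its own view of its state
ceph daemon mon.odcf-pve02 mon_status

# the same query with the admin socket path spelled out explicitly
ceph --admin-daemon /var/run/ceph/ceph-mon.odcf-pve02.asok mon_status
-----------------------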
> On 05.01.21 20:44, Frank Thommen wrote:
>>
>> On 05.01.21 20:29, Uwe Sauter wrote:
>>> Frank,
>>>
>>> On 05.01.21 20:24, Frank Thommen wrote:
>>>> Hi Uwe,
>>>>
>>>>> did you look into the log of MON and OSD?
>>>>
>>>> I can't see any specific MON and OSD logs. However, the log
>>>> available in the UI (Ceph -> Log) has lots of messages regarding
>>>> scrubbing but no messages regarding issues with starting the
>>>> monitor.
>>>
>>> On each host the logs should be in /var/log/ceph. These should be
>>> rotated (see /etc/logrotate.d/ceph-common for details).
>>
>> OK, I see lots of
>>
>> -----------------------
>> 2021-01-05 20:38:05.900 7f979e753700  1 mon.odcf-pve02@-1(probing) e4
>> handle_auth_request failed to assign global_id
>> 2021-01-05 20:38:07.208 7f979e753700  1 mon.odcf-pve02@-1(probing) e4
>> handle_auth_request failed to assign global_id
>> 2021-01-05 20:38:08.688 7f979e753700  1 mon.odcf-pve02@-1(probing) e4
>> handle_auth_request failed to assign global_id
>> 2021-01-05 20:38:08.744 7f979e753700  1 mon.odcf-pve02@-1(probing) e4
>> handle_auth_request failed to assign global_id
>> 2021-01-05 20:38:09.092 7f979e753700  1 mon.odcf-pve02@-1(probing) e4
>> handle_auth_request failed to assign global_id
>> 2021-01-05 20:38:12.268 7f979e753700  1 mon.odcf-pve02@-1(probing) e4
>> handle_auth_request failed to assign global_id
>> 2021-01-05 20:38:12.468 7f979e753700  1 mon.odcf-pve02@-1(probing) e4
>> handle_auth_request failed to assign global_id
>> 2021-01-05 20:38:12.964 7f979e753700  1 mon.odcf-pve02@-1(probing) e4
>> handle_auth_request failed to assign global_id
>> 2021-01-05 20:38:15.752 7f979e753700  1 mon.odcf-pve02@-1(probing) e4
>> handle_auth_request failed to assign global_id
>> 2021-01-05 20:38:17.440 7f979e753700  1 mon.odcf-pve02@-1(probing) e4
>> handle_auth_request failed to assign global_id
>> 2021-01-05 20:38:19.388 7f979e753700  1 mon.odcf-pve02@-1(probing) e4
>> handle_auth_request failed to assign global_id
>> 2021-01-05 20:38:19.468 7f979e753700  1 mon.odcf-pve02@-1(probing) e4
>> handle_auth_request failed to assign global_id
>> 2021-01-05 20:38:22.712 7f979e753700  1 mon.odcf-pve02@-1(probing) e4
>> handle_auth_request failed to assign global_id
>> 2021-01-05 20:38:22.828 7f979e753700  1 mon.odcf-pve02@-1(probing) e4
>> handle_auth_request failed to assign global_id
>> -----------------------
>>
>> in the mon log on the problematic host.
>>
>> When (unsuccessfully) starting the monitor through the UI, the
>> following entries appear in ceph.audit.log:
>>
>> -----------------------
>> 2021-01-05 20:40:07.635369 mon.odcf-pve03 (mon.1) 288082 : audit [DBG]
>> from='client.? 192.168.255.2:0/2418486168' entity='client.admin'
>> cmd=[{"format":"json","prefix":"mgr metadata"}]: dispatch
>> 2021-01-05 20:40:07.636592 mon.odcf-pve03 (mon.1) 288083 : audit [DBG]
>> from='client.? 192.168.255.2:0/2418486168' entity='client.admin'
>> cmd=[{"format":"json","prefix":"mgr dump"}]: dispatch
>> 2021-01-05 20:40:08.296793 mon.odcf-pve03 (mon.1) 288084 : audit [DBG]
>> from='client.? 192.168.255.2:0/778781756' entity='client.admin'
>> cmd=[{"format":"json","prefix":"mon metadata"}]: dispatch
>> 2021-01-05 20:40:08.297767 mon.odcf-pve03 (mon.1) 288085 : audit [DBG]
>> from='client.? 192.168.255.2:0/778781756' entity='client.admin'
>> cmd=[{"prefix":"quorum_status","format":"json"}]: dispatch
>> 2021-01-05 20:40:08.436982 mon.odcf-pve01 (mon.0) 389632 : audit [DBG]
>> from='client.? 192.168.255.2:0/784579843' entity='client.admin'
>> cmd=[{"format":"json","prefix":"df"}]: dispatch
>> -----------------------
>>
>> 192.168.255.2 is the IP address of the problematic host in the Ceph
>> mesh network. odcf-pve01 and odcf-pve03 are the "good" nodes.
>>
>> However, I am not sure what kind of information I should look for
>> in the logs.
>>
>> Frank
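
Side note on the log excerpt above: to see why the mon never gets past
the probing state, I intend to watch the service itself rather than
the UI, and to rule out a key mismatch. Again only a sketch; the unit
name assumes the stock ceph-mon@<hostname> naming and the keyring path
assumes the default cluster name "ceph":

-----------------------
# follow the monitor's systemd unit on the problematic host
systemctl status ceph-mon@odcf-pve02.service
journalctl -b -u ceph-mon@odcf-pve02

# compare the cluster-wide mon key, queried from a healthy node...
ceph auth get mon.
# ...with the local keyring of the stuck monitor
cat /var/lib/ceph/mon/ceph-odcf-pve02/keyring
-----------------------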
>>> Regards,
>>>
>>>    Uwe
>>>
>>>>> Can you provide the list of installed packages of the affected
>>>>> host and the rest of the cluster?
>>>>
>>>> Let me compile the lists and post them somewhere. They are quite
>>>> long.
>>>>
>>>>> Is the output of "ceph status" the same for all hosts?
>>>>
>>>> Yes.
>>>>
>>>> Frank
>>>>
>>>>> Regards,
>>>>>
>>>>>    Uwe
>>>>>
>>>>> On 05.01.21 20:01, Frank Thommen wrote:
>>>>>>
>>>>>> On 04.01.21 12:44, Frank Thommen wrote:
>>>>>>>
>>>>>>> Dear all,
>>>>>>>
>>>>>>> one of our three PVE hypervisors in the cluster crashed (it was
>>>>>>> fenced successfully) and rebooted automatically. I took the
>>>>>>> chance to do a complete dist-upgrade and rebooted again.
>>>>>>>
>>>>>>> The PVE Ceph dashboard now reports that
>>>>>>>
>>>>>>>   * the monitor on the host is down (out of quorum), and
>>>>>>>   * "A newer version was installed but old version still
>>>>>>>     running, please restart"
>>>>>>>
>>>>>>> The Ceph UI reports monitor version 14.2.11 while in fact
>>>>>>> 14.2.16 is installed. The hypervisor has been rebooted twice
>>>>>>> since the upgrade, so it should be basically impossible that
>>>>>>> the old version is still running.
>>>>>>>
>>>>>>> `systemctl restart ceph.target` and restarting the monitor
>>>>>>> through the PVE Ceph UI didn't help. The hypervisor is running
>>>>>>> PVE 6.3-3 (the other two are running 6.3-2 with monitor
>>>>>>> 14.2.15).
>>>>>>>
>>>>>>> What to do in this situation?
>>>>>>>
>>>>>>> I am happy with either UI or command-line instructions, but I
>>>>>>> have no Ceph experience beyond setting it up following the PVE
>>>>>>> instructions.
>>>>>>>
>>>>>>> Any help or hint is appreciated.
>>>>>>> Cheers, Frank
>>>>>>
>>>>>> In an attempt to fix the issue I destroyed the monitor through
>>>>>> the UI and recreated it. Unfortunately it still cannot be
>>>>>> started: a popup tells me that the monitor has been started, but
>>>>>> the overview still shows "stopped", and there is no version
>>>>>> number any more.
>>>>>>
>>>>>> Then I stopped and started Ceph on the node (`pveceph stop;
>>>>>> pveceph start`), which resulted in a degraded cluster (1 host
>>>>>> down, 7 of 21 OSDs down). The OSDs cannot be started through the
>>>>>> UI either.
>>>>>>
>>>>>> I feel extremely uncomfortable with this situation and would
>>>>>> appreciate any hint as to how I should proceed.
>>>>>>
>>>>>> Cheers, Frank
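
PS: should anyone hit the same situation in the archive, the
command-line equivalents of the UI steps above that I plan to try next
look roughly like this. A sketch, not a known fix; it assumes the
PVE 6 pveceph subcommand syntax, that the mon ID equals the host name,
and the stock systemd unit naming (the OSD id is just an example):

-----------------------
# recreate the monitor from the CLI instead of the UI
pveceph mon destroy odcf-pve02
pveceph mon create

# check which OSDs are down and try starting one directly,
# watching its unit instead of the UI popup
ceph osd tree
systemctl start ceph-osd@7.service
journalctl -b -u ceph-osd@7
-----------------------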