From: Frank Thommen <f.thommen@dkfz-heidelberg.de>
To: pve-user@lists.proxmox.com
Subject: Re: [PVE-User] After update Ceph monitor shows wrong version in UI and is down and out of quorum
Date: Tue, 5 Jan 2021 21:17:11 +0100
Message-ID: <89a1ad57-6f99-d422-08df-d110f10aa3b9@dkfz-heidelberg.de>
In-Reply-To: <f3ca5f88-5cbd-9807-02d7-d8f24fcbefdb@gmail.com>
On 05.01.21 21:02, Uwe Sauter wrote:
> There's a paragraph about probing mons on
>
> https://docs.ceph.com/en/latest/rados/troubleshooting/troubleshooting-mon/
I will check that (tomorrow :-)
> Can you connect to ports TCP/3300 and TCP/6789 from the other two hosts?
> You can use telnet to check this. (On host1 or host3 run "telnet host
> port". To quit telnet, press ctrl+], then type "quit" + enter.)
yes, all working ok
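
For anyone who prefers a scripted check over telnet, here is a minimal sketch in Python using only the standard library. The function names (`port_open`, `check_mon_ports`) are made up for this example and are not part of any Ceph or PVE tool. The two ports are the ones asked about above.

```python
import socket

# Ceph monitor ports asked about above: TCP/3300 and TCP/6789.
MON_PORTS = (3300, 6789)


def port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


def check_mon_ports(host, ports=MON_PORTS, timeout=3.0):
    """Map each port to whether it accepted a TCP connection."""
    return {port: port_open(host, port, timeout) for port in ports}
```

Run from one of the healthy hosts, `check_mon_ports("odcf-pve02")` would return a dict with one True/False entry per port. Note that this only shows that something accepts TCP connections on the port, not that the monitor behind it is healthy.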
> Is the clock synchronized on the affected host?
yes
> From the package lists I assume that host1 has a GUI installed while
> host3 additionally acts as MariaDB server? I always keep the packages
> list in sync on the clusters I run…
No MariaDB server anywhere, and all three have the PVE web UI. I wouldn't
know of any other GUI. Additional packages might come as prerequisites
for individually installed admin tools (e.g. wireshark). We install
them ad hoc and don't usually keep them on all hosts.
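
To see exactly where the package lists differ between hosts, a small comparison script can help. The following is only a sketch, assuming each host's list has been saved as plain text with one "name version" pair per line (for example from `dpkg-query -W`, which I believe prints package and version separated by a tab). The function names are hypothetical.

```python
def parse_package_list(text):
    """Parse 'name version' lines into a dict, skipping blank lines."""
    packages = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        name = fields[0]
        # Tolerate lines that carry a package name but no version.
        version = fields[1] if len(fields) > 1 else ""
        packages[name] = version
    return packages


def diff_package_lists(a, b):
    """Return (only_in_a, only_in_b, version_mismatches) for two package dicts."""
    only_a = sorted(set(a) - set(b))
    only_b = sorted(set(b) - set(a))
    mismatched = {
        name: (a[name], b[name])
        for name in sorted(set(a) & set(b))
        if a[name] != b[name]
    }
    return only_a, only_b, mismatched
```

Feeding it the parsed lists of two hosts gives the packages present on only one of them and the packages whose versions differ, which is usually easier to read than two long lists side by side.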
> It also looks like hosts 1 and 3 are on different kernels (though that
> shouldn't be an issue here…).
1 and 3 should be the same while 2 has been updated more recently
Frank
>
>
> Am 05.01.21 um 20:44 schrieb Frank Thommen:
>>
>>
>> On 05.01.21 20:29, Uwe Sauter wrote:
>>> Frank,
>>>
>>> Am 05.01.21 um 20:24 schrieb Frank Thommen:
>>>> Hi Uwe,
>>>>
>>>>> did you look into the log of MON and OSD?
>>>>
>>>> I can't see any specific MON and OSD logs. However the log available
>>>> in the UI (Ceph -> Log) has lots of messages regarding scrubbing but
>>>> no messages regarding issues with starting the monitor
>>>>
>>>
>>> On each host the logs should be in /var/log/ceph. These should be
>>> rotated (see /etc/logrotate.d/ceph-common for details).
>>
>> ok. I see lots of
>>
>> -----------------------
>> 2021-01-05 20:38:05.900 7f979e753700 1 mon.odcf-pve02@-1(probing) e4
>> handle_auth_request failed to assign global_id
>> 2021-01-05 20:38:07.208 7f979e753700 1 mon.odcf-pve02@-1(probing) e4
>> handle_auth_request failed to assign global_id
>> 2021-01-05 20:38:08.688 7f979e753700 1 mon.odcf-pve02@-1(probing) e4
>> handle_auth_request failed to assign global_id
>> 2021-01-05 20:38:08.744 7f979e753700 1 mon.odcf-pve02@-1(probing) e4
>> handle_auth_request failed to assign global_id
>> 2021-01-05 20:38:09.092 7f979e753700 1 mon.odcf-pve02@-1(probing) e4
>> handle_auth_request failed to assign global_id
>> 2021-01-05 20:38:12.268 7f979e753700 1 mon.odcf-pve02@-1(probing) e4
>> handle_auth_request failed to assign global_id
>> 2021-01-05 20:38:12.468 7f979e753700 1 mon.odcf-pve02@-1(probing) e4
>> handle_auth_request failed to assign global_id
>> 2021-01-05 20:38:12.964 7f979e753700 1 mon.odcf-pve02@-1(probing) e4
>> handle_auth_request failed to assign global_id
>> 2021-01-05 20:38:15.752 7f979e753700 1 mon.odcf-pve02@-1(probing) e4
>> handle_auth_request failed to assign global_id
>> 2021-01-05 20:38:17.440 7f979e753700 1 mon.odcf-pve02@-1(probing) e4
>> handle_auth_request failed to assign global_id
>> 2021-01-05 20:38:19.388 7f979e753700 1 mon.odcf-pve02@-1(probing) e4
>> handle_auth_request failed to assign global_id
>> 2021-01-05 20:38:19.468 7f979e753700 1 mon.odcf-pve02@-1(probing) e4
>> handle_auth_request failed to assign global_id
>> 2021-01-05 20:38:22.712 7f979e753700 1 mon.odcf-pve02@-1(probing) e4
>> handle_auth_request failed to assign global_id
>> 2021-01-05 20:38:22.828 7f979e753700 1 mon.odcf-pve02@-1(probing) e4
>> handle_auth_request failed to assign global_id
>> -----------------------
>>
>> in the mon log on the problematic host.
>>
>> When (unsuccessfully) starting the monitor through the UI, the
>> following entries appear in ceph.audit.log:
>>
>> -----------------------
>> 2021-01-05 20:40:07.635369 mon.odcf-pve03 (mon.1) 288082 : audit [DBG]
>> from='client.? 192.168.255.2:0/2418486168' entity='client.admin'
>> cmd=[{"format":"json","prefix":"mgr metadata"}]: dispatch
>> 2021-01-05 20:40:07.636592 mon.odcf-pve03 (mon.1) 288083 : audit [DBG]
>> from='client.? 192.168.255.2:0/2418486168' entity='client.admin'
>> cmd=[{"format":"json","prefix":"mgr dump"}]: dispatch
>> 2021-01-05 20:40:08.296793 mon.odcf-pve03 (mon.1) 288084 : audit [DBG]
>> from='client.? 192.168.255.2:0/778781756' entity='client.admin'
>> cmd=[{"format":"json","prefix":"mon metadata"}]: dispatch
>> 2021-01-05 20:40:08.297767 mon.odcf-pve03 (mon.1) 288085 : audit [DBG]
>> from='client.? 192.168.255.2:0/778781756' entity='client.admin'
>> cmd=[{"prefix":"quorum_status","format":"json"}]: dispatch
>> 2021-01-05 20:40:08.436982 mon.odcf-pve01 (mon.0) 389632 : audit [DBG]
>> from='client.? 192.168.255.2:0/784579843' entity='client.admin'
>> cmd=[{"format":"json","prefix":"df"}]: dispatch
>> -----------------------
>>
>> 192.168.255.2 is the IP number of the problematic host in the Ceph
>> mesh network. odcf-pve01 and odcf-pve03 are the "good" nodes.
>>
>> However, I am not sure what kind of information I should look for in
>> the logs.
>>
>> Frank
>>
>>>
>>> Regards,
>>>
>>> Uwe
>>>
>>>
>>>
>>>>
>>>>> Can you provide the list of installed packages of the affected host
>>>>> and the rest of the cluster?
>>>>
>>>> let me compile the lists and post them somewhere. They are quite long.
>>>>
>>>>>
>>>>> Is the output of "ceph status" the same for all hosts?
>>>>
>>>> yes
>>>>
>>>> Frank
>>>>
>>>>>
>>>>>
>>>>> Regards,
>>>>>
>>>>> Uwe
>>>>>
>>>>> Am 05.01.21 um 20:01 schrieb Frank Thommen:
>>>>>>
>>>>>> On 04.01.21 12:44, Frank Thommen wrote:
>>>>>>>
>>>>>>> Dear all,
>>>>>>>
>>>>>>> one of our three PVE hypervisors in the cluster crashed (it was
>>>>>>> fenced successfully) and rebooted automatically. I took the
>>>>>>> chance to do a complete dist-upgrade and rebooted again.
>>>>>>>
>>>>>>> The PVE Ceph dashboard now reports that
>>>>>>>
>>>>>>> * the monitor on the host is down (out of quorum), and
>>>>>>> * "A newer version was installed but old version still
>>>>>>> running, please restart"
>>>>>>>
>>>>>>> The Ceph UI reports monitor version 14.2.11 while in fact 14.2.16
>>>>>>> is installed. The hypervisor has been rebooted twice since the
>>>>>>> upgrade, so it should be basically impossible that the old
>>>>>>> version is still running.
>>>>>>>
>>>>>>> `systemctl restart ceph.target` and restarting the monitor
>>>>>>> through the PVE Ceph UI didn't help. The hypervisor is running
>>>>>>> PVE 6.3-3 (the other two are running 6.3-2 with monitor 14.2.15)
>>>>>>>
>>>>>>> What to do in this situation?
>>>>>>>
>>>>>>> I am happy with either UI or commandline instructions, but I have
>>>>>>> no Ceph experience besides setting it up following the PVE
>>>>>>> instructions.
>>>>>>>
>>>>>>> Any help or hint is appreciated.
>>>>>>> Cheers, Frank
>>>>>>
>>>>>> In an attempt to fix the issue I destroyed the monitor through the
>>>>>> UI and recreated it. Unfortunately, it still cannot be started.
>>>>>> A popup tells me that the monitor has been started, but the
>>>>>> overview still shows "stopped" and there is no version number any
>>>>>> more.
>>>>>>
>>>>>> Then I stopped and started Ceph on the node (`pveceph stop;
>>>>>> pveceph start`) which resulted in a degraded cluster (1 host down,
>>>>>> 7 of 21 OSDs down). OSDs cannot be started through the UI either.
>>>>>>
>>>>>> I feel extremely uncomfortable with this situation and would
>>>>>> appreciate any hint as to how I should proceed with the problem.
>>>>>>
>>>>>> Cheers, Frank
>>>>>>
>>>>>> _______________________________________________
>>>>>> pve-user mailing list
>>>>>> pve-user@lists.proxmox.com
>>>>>> https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-user