From: Frank Thommen <f.thommen@dkfz-heidelberg.de>
To: pve-user@lists.proxmox.com
Subject: Re: [PVE-User] After update Ceph monitor shows wrong version in UI and is down and out of quorum
Date: Tue, 5 Jan 2021 21:17:11 +0100	[thread overview]
Message-ID: <89a1ad57-6f99-d422-08df-d110f10aa3b9@dkfz-heidelberg.de>
In-Reply-To: <f3ca5f88-5cbd-9807-02d7-d8f24fcbefdb@gmail.com>

On 05.01.21 21:02, Uwe Sauter wrote:
> There's a paragraph about probing mons on
> 
> https://docs.ceph.com/en/latest/rados/troubleshooting/troubleshooting-mon/

I will check that (tomorrow :-)

> Can you connect to ports TCP/3300 and TCP/6789 from the other two hosts?
> You can use telnet to check this. (On host1 or host3 run "telnet host 
> port". To quit telnet, press ctrl+], then type "quit" + enter.)

yes, all working ok
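
The check can also be scripted without telnet, e.g. with bash's built-in
/dev/tcp pseudo-device (a sketch; the host name is a placeholder for the
mon being probed):

```shell
# Probe a TCP port without telnet, using bash's /dev/tcp.
# Prints "open" or "closed" for HOST PORT.
check_port() {
  if timeout 3 bash -c "exec 3<>/dev/tcp/$1/$2" 2>/dev/null; then
    echo "open"
  else
    echo "closed"
  fi
}

# Probe both Ceph monitor ports (replace odcf-pve02 with the mon host):
for p in 3300 6789; do
  echo "$p: $(check_port odcf-pve02 "$p")"
done
```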

> Is the clock synchronized on the affected host?

yes
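
For completeness, a sketch of how clock skew between two nodes could be
checked by hand (the ssh host name is hypothetical; normally chrony or
systemd-timesyncd takes care of this):

```shell
# Report whether two epoch timestamps are within an allowed skew.
# skew_check LOCAL_EPOCH REMOTE_EPOCH MAX_SKEW_SECONDS -> "ok" or "skewed"
skew_check() {
  local diff=$(( $1 - $2 ))
  [ "$diff" -lt 0 ] && diff=$(( -diff ))
  if [ "$diff" -le "$3" ]; then echo "ok"; else echo "skewed"; fi
}

# Usage against a remote node (host name is a placeholder):
#   remote=$(ssh odcf-pve02 date +%s)
#   skew_check "$(date +%s)" "$remote" 1
```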

>  From the package lists I assume that host1 has a GUI installed while 
> host3 additionally acts as MariaDB server? I always keep the packages 
> list in sync on the clusters I run…

no MariaDB server anywhere and all three have the PVE webUI.  I wouldn't 
know of any other GUI.  Additional packages might have come in as 
prerequisites for individually installed admin tools (e.g. wireshark).  
We install them ad hoc and don't usually keep them in sync on all hosts.
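
A sketch of how the per-host lists can be compared once dumped (two tiny
sample lists stand in for the real ones here; on each host one would
generate the list with: dpkg-query -W -f='${Package}\n' | sort):

```shell
# Sample stand-ins for the per-host package dumps; in practice:
#   dpkg-query -W -f='${Package}\n' | sort > /tmp/pkgs-$(hostname)
printf 'ceph-mon\npve-manager\nwireshark\n' > /tmp/pkgs-host1
printf 'ceph-mon\nmariadb-server\npve-manager\n' > /tmp/pkgs-host3

# comm -3 prints packages present on only one of the two hosts
# (column 1 = only on host1, column 2 = only on host3):
comm -3 /tmp/pkgs-host1 /tmp/pkgs-host3
```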


> It also looks like hosts 1 and 3 are on different kernels (though that 
> shouldn't be an issue here…).

1 and 3 should be the same, while 2 has been updated more recently.

Frank

> 
> 
> Am 05.01.21 um 20:44 schrieb Frank Thommen:
>>
>>
>> On 05.01.21 20:29, Uwe Sauter wrote:
>>> Frank,
>>>
>>> Am 05.01.21 um 20:24 schrieb Frank Thommen:
>>>> Hi Uwe,
>>>>
>>>>> did you look into the log of MON and OSD?
>>>>
>>>> I can't see any specific MON and OSD logs. However, the log available 
>>>> in the UI (Ceph -> Log) has lots of messages regarding scrubbing but 
>>>> no messages regarding issues with starting the monitor.
>>>>
>>>
>>> On each host the logs should be in /var/log/ceph. These should be 
>>> rotated (see /etc/logrotate.d/ceph-common for details).
>>
>> ok.  I see lots of
>>
>> -----------------------
>> 2021-01-05 20:38:05.900 7f979e753700  1 mon.odcf-pve02@-1(probing) e4 
>> handle_auth_request failed to assign global_id
>> 2021-01-05 20:38:07.208 7f979e753700  1 mon.odcf-pve02@-1(probing) e4 
>> handle_auth_request failed to assign global_id
>> 2021-01-05 20:38:08.688 7f979e753700  1 mon.odcf-pve02@-1(probing) e4 
>> handle_auth_request failed to assign global_id
>> 2021-01-05 20:38:08.744 7f979e753700  1 mon.odcf-pve02@-1(probing) e4 
>> handle_auth_request failed to assign global_id
>> 2021-01-05 20:38:09.092 7f979e753700  1 mon.odcf-pve02@-1(probing) e4 
>> handle_auth_request failed to assign global_id
>> 2021-01-05 20:38:12.268 7f979e753700  1 mon.odcf-pve02@-1(probing) e4 
>> handle_auth_request failed to assign global_id
>> 2021-01-05 20:38:12.468 7f979e753700  1 mon.odcf-pve02@-1(probing) e4 
>> handle_auth_request failed to assign global_id
>> 2021-01-05 20:38:12.964 7f979e753700  1 mon.odcf-pve02@-1(probing) e4 
>> handle_auth_request failed to assign global_id
>> 2021-01-05 20:38:15.752 7f979e753700  1 mon.odcf-pve02@-1(probing) e4 
>> handle_auth_request failed to assign global_id
>> 2021-01-05 20:38:17.440 7f979e753700  1 mon.odcf-pve02@-1(probing) e4 
>> handle_auth_request failed to assign global_id
>> 2021-01-05 20:38:19.388 7f979e753700  1 mon.odcf-pve02@-1(probing) e4 
>> handle_auth_request failed to assign global_id
>> 2021-01-05 20:38:19.468 7f979e753700  1 mon.odcf-pve02@-1(probing) e4 
>> handle_auth_request failed to assign global_id
>> 2021-01-05 20:38:22.712 7f979e753700  1 mon.odcf-pve02@-1(probing) e4 
>> handle_auth_request failed to assign global_id
>> 2021-01-05 20:38:22.828 7f979e753700  1 mon.odcf-pve02@-1(probing) e4 
>> handle_auth_request failed to assign global_id
>> -----------------------
>>
>> in the mon log on the problematic host.
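
The "@-1(probing)" in those lines says the mon is stuck at rank -1 in the
probing state, i.e. it never joins quorum, which is why it cannot assign
global IDs to connecting clients. A small helper to count the failures
and correlate them with restart attempts (a sketch; the sample input
reuses two lines from above, the real log is usually at
/var/log/ceph/ceph-mon.<id>.log):

```shell
# Count global_id failures in a mon log file.
count_global_id_failures() {
  grep -c 'handle_auth_request failed to assign global_id' "$1"
}

# Sample input taken from the log excerpt above:
cat > /tmp/mon-log-sample <<'EOF'
2021-01-05 20:38:05.900 7f979e753700  1 mon.odcf-pve02@-1(probing) e4 handle_auth_request failed to assign global_id
2021-01-05 20:38:07.208 7f979e753700  1 mon.odcf-pve02@-1(probing) e4 handle_auth_request failed to assign global_id
EOF

count_global_id_failures /tmp/mon-log-sample
```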
>>
>> When (unsuccessfully) starting the monitor through the UI, the 
>> following entries appear in ceph.audit.log:
>>
>> -----------------------
>> 2021-01-05 20:40:07.635369 mon.odcf-pve03 (mon.1) 288082 : audit [DBG] 
>> from='client.? 192.168.255.2:0/2418486168' entity='client.admin' 
>> cmd=[{"format":"json","prefix":"mgr metadata"}]: dispatch
>> 2021-01-05 20:40:07.636592 mon.odcf-pve03 (mon.1) 288083 : audit [DBG] 
>> from='client.? 192.168.255.2:0/2418486168' entity='client.admin' 
>> cmd=[{"format":"json","prefix":"mgr dump"}]: dispatch
>> 2021-01-05 20:40:08.296793 mon.odcf-pve03 (mon.1) 288084 : audit [DBG] 
>> from='client.? 192.168.255.2:0/778781756' entity='client.admin' 
>> cmd=[{"format":"json","prefix":"mon metadata"}]: dispatch
>> 2021-01-05 20:40:08.297767 mon.odcf-pve03 (mon.1) 288085 : audit [DBG] 
>> from='client.? 192.168.255.2:0/778781756' entity='client.admin' 
>> cmd=[{"prefix":"quorum_status","format":"json"}]: dispatch
>> 2021-01-05 20:40:08.436982 mon.odcf-pve01 (mon.0) 389632 : audit [DBG] 
>> from='client.? 192.168.255.2:0/784579843' entity='client.admin' 
>> cmd=[{"format":"json","prefix":"df"}]: dispatch
>> -----------------------
>>
>> 192.168.255.2 is the IP address of the problematic host in the Ceph 
>> mesh network. odcf-pve01 and odcf-pve03 are the "good" nodes.
>>
>> However, I am not sure what kind of information I should look for in 
>> the logs.
>>
>> Frank
>>
>>>
>>> Regards,
>>>
>>>      Uwe
>>>
>>>
>>>
>>>>
>>>>> Can you provide the list of installed packages of the affected host 
>>>>> and the rest of the cluster?
>>>>
>>>> let me compile the lists and post them somewhere.  They are quite long.
>>>>
>>>>>
>>>>> Is the output of "ceph status" the same for all hosts?
>>>>
>>>> yes
>>>>
>>>> Frank
>>>>
>>>>>
>>>>>
>>>>> Regards,
>>>>>
>>>>>      Uwe
>>>>>
>>>>> Am 05.01.21 um 20:01 schrieb Frank Thommen:
>>>>>>
>>>>>> On 04.01.21 12:44, Frank Thommen wrote:
>>>>>>>
>>>>>>> Dear all,
>>>>>>>
>>>>>>> one of our three PVE hypervisors in the cluster crashed (it was 
>>>>>>> fenced successfully) and rebooted automatically. I took the 
>>>>>>> chance to do a complete dist-upgrade and rebooted again.
>>>>>>>
>>>>>>> The PVE Ceph dashboard now reports that
>>>>>>>
>>>>>>>    * the monitor on the host is down (out of quorum), and
>>>>>>>    * "A newer version was installed but old version still 
>>>>>>> running, please restart"
>>>>>>>
>>>>>>> The Ceph UI reports monitor version 14.2.11 while in fact 14.2.16 
>>>>>>> is installed. The hypervisor has been rebooted twice since the 
>>>>>>> upgrade, so it should be basically impossible that the old 
>>>>>>> version is still running.
>>>>>>>
>>>>>>> `systemctl restart ceph.target` and restarting the monitor 
>>>>>>> through the PVE Ceph UI didn't help. The hypervisor is running 
>>>>>>> PVE 6.3-3 (the other two are running 6.3-2 with monitor 14.2.15)
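
(A finer-grained alternative to restarting all of ceph.target is the
per-daemon systemd unit; a sketch assuming the stock ceph-mon@<id>
naming, where <id> is normally the short hostname:)

```shell
# Build the mon's systemd unit name from the host name (stock naming:
# ceph-mon@<id>; verify the id with: ls /var/lib/ceph/mon/).
mon_unit="ceph-mon@$(hostname -s).service"

# The restart and log-inspection commands one would then run
# (printed here rather than executed):
echo "systemctl restart $mon_unit"
echo "journalctl -u $mon_unit --since -10m"
```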
>>>>>>>
>>>>>>> What to do in this situation?
>>>>>>>
>>>>>>> I am happy with either UI or command-line instructions, but I have 
>>>>>>> no Ceph experience besides setting it up following the PVE 
>>>>>>> instructions.
>>>>>>>
>>>>>>> Any help or hint is appreciated.
>>>>>>> Cheers, Frank
>>>>>>
>>>>>> In an attempt to fix the issue I destroyed the monitor through the 
>>>>>> UI and recreated it.  Unfortunately it still cannot be started.  
>>>>>> A popup tells me that the monitor has been started, but the 
>>>>>> overview still shows "stopped" and there is no version number 
>>>>>> anymore.
>>>>>>
>>>>>> Then I stopped and started Ceph on the node (`pveceph stop; 
>>>>>> pveceph start`) which resulted in a degraded cluster (1 host down, 
>>>>>> 7 of 21 OSDs down). OSDs cannot be started through the UI either.
>>>>>>
>>>>>> I feel extremely uncomfortable with this situation and would 
>>>>>> appreciate any hint as to how I should proceed with the problem.
>>>>>>
>>>>>> Cheers, Frank
>>>>>>
>>>>>> _______________________________________________
>>>>>> pve-user mailing list
>>>>>> pve-user@lists.proxmox.com
>>>>>> https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-user



