public inbox for pve-user@lists.proxmox.com
* [PVE-User] After update Ceph monitor shows wrong version in UI and is down and out of quorum
@ 2021-01-04 11:44 Frank Thommen
  2021-01-05 19:01 ` Frank Thommen
  0 siblings, 1 reply; 18+ messages in thread
From: Frank Thommen @ 2021-01-04 11:44 UTC (permalink / raw)
  To: Proxmox VE user list


Dear all,

one of our three PVE hypervisors in the cluster crashed (it was fenced 
successfully) and rebooted automatically.  I took the chance to do a 
complete dist-upgrade and rebooted again.

The PVE Ceph dashboard now reports that

   * the monitor on the host is down (out of quorum), and
   * "A newer version was installed but old version still running, 
please restart"

The Ceph UI reports monitor version 14.2.11 while in fact 14.2.16 is 
installed. The hypervisor has been rebooted twice since the upgrade, so 
it should be basically impossible that the old version is still running.
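
A quick cross-check of running vs. installed versions (a sketch using
standard Ceph/Debian commands; <node> stands for the monitor's ID):

   ceph versions                  # versions the running daemons report
   ceph tell mon.<node> version   # version of one specific monitor
   dpkg -l ceph-mon               # installed Debian package version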

`systemctl restart ceph.target` and restarting the monitor through the 
PVE Ceph UI didn't help. The hypervisor is running PVE 6.3-3 (the other 
two are running 6.3-2 with monitor 14.2.15).
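
Restarting only the monitor unit instead of the whole target can also 
be tried (a sketch, assuming the standard systemd unit naming):

   systemctl restart ceph-mon@<node>.service
   systemctl status ceph-mon@<node>.service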

What to do in this situation?

I am happy with either UI or command-line instructions, but I have no 
Ceph experience besides setting it up following the PVE instructions.

Any help or hint is appreciated.
Cheers, Frank




* Re: [PVE-User] After update Ceph monitor shows wrong version in UI and is down and out of quorum
  2021-01-04 11:44 [PVE-User] After update Ceph monitor shows wrong version in UI and is down and out of quorum Frank Thommen
@ 2021-01-05 19:01 ` Frank Thommen
  2021-01-05 19:08   ` Frank Thommen
  2021-01-05 19:10   ` Uwe Sauter
  0 siblings, 2 replies; 18+ messages in thread
From: Frank Thommen @ 2021-01-05 19:01 UTC (permalink / raw)
  To: pve-user


On 04.01.21 12:44, Frank Thommen wrote:
> [...]

In an attempt to fix the issue I destroyed the monitor through the UI 
and recreated it.  Unfortunately, it still cannot be started.  A popup 
tells me that the monitor has been started, but the overview still 
shows "stopped" and there is no version number anymore.

Then I stopped and started Ceph on the node (`pveceph stop; pveceph 
start`), which resulted in a degraded cluster (1 host down, 7 of 21 
OSDs down).  The OSDs cannot be started through the UI either.
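
The OSD states can also be checked, and individual OSDs started, from 
the CLI (a sketch; <id> is the OSD number):

   ceph osd tree                  # up/down state per OSD
   systemctl start ceph-osd@<id>.service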

I feel extremely uncomfortable with this situation and would appreciate 
any hint as to how I should proceed with the problem.

Cheers, Frank




* Re: [PVE-User] After update Ceph monitor shows wrong version in UI and is down and out of quorum
  2021-01-05 19:01 ` Frank Thommen
@ 2021-01-05 19:08   ` Frank Thommen
  2021-01-05 19:10   ` Uwe Sauter
  1 sibling, 0 replies; 18+ messages in thread
From: Frank Thommen @ 2021-01-05 19:08 UTC (permalink / raw)
  To: pve-user

On 05.01.21 20:01, Frank Thommen wrote:
> [...]
> Then I stopped and started Ceph on the node (`pveceph stop; pveceph
> start`), which resulted in a degraded cluster (1 host down, 7 of 21
> OSDs down).  The OSDs cannot be started through the UI either.

The OSDs and MDSs just took a while to start, so from that side it 
looks OK now.  But the monitor still refuses to start.

Frank




* Re: [PVE-User] After update Ceph monitor shows wrong version in UI and is down and out of quorum
  2021-01-05 19:01 ` Frank Thommen
  2021-01-05 19:08   ` Frank Thommen
@ 2021-01-05 19:10   ` Uwe Sauter
  2021-01-05 19:24     ` Frank Thommen
  1 sibling, 1 reply; 18+ messages in thread
From: Uwe Sauter @ 2021-01-05 19:10 UTC (permalink / raw)
  To: pve-user

Hi Frank,

did you look into the logs of the MON and the OSDs? Can you provide 
the list of installed packages on the affected host and on the rest of 
the cluster?

Is the output of "ceph status" the same for all hosts?
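
One way to collect comparable lists (a sketch; the output path is just 
an example):

   dpkg -l | grep -E 'ceph|pve' > /tmp/pkgs.$(hostname)

and then diff the resulting files between the hosts.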


Regards,

	Uwe

On 05.01.21 20:01, Frank Thommen wrote:
> [...]

* Re: [PVE-User] After update Ceph monitor shows wrong version in UI and is down and out of quorum
  2021-01-05 19:10   ` Uwe Sauter
@ 2021-01-05 19:24     ` Frank Thommen
  2021-01-05 19:29       ` Uwe Sauter
  2021-01-05 19:35       ` Frank Thommen
  0 siblings, 2 replies; 18+ messages in thread
From: Frank Thommen @ 2021-01-05 19:24 UTC (permalink / raw)
  To: pve-user

Hi Uwe,

> did you look into the logs of the MON and the OSDs?

I can't see any specific MON or OSD logs.  However, the log available 
in the UI (Ceph -> Log) has lots of messages about scrubbing, but none 
about problems with starting the monitor.


> Can you provide the list of installed packages on the affected host
> and on the rest of the cluster?

Let me compile the lists and post them somewhere.  They are quite long.

> 
> Is the output of "ceph status" the same for all hosts?

Yes.

Frank


* Re: [PVE-User] After update Ceph monitor shows wrong version in UI and is down and out of quorum
  2021-01-05 19:24     ` Frank Thommen
@ 2021-01-05 19:29       ` Uwe Sauter
  2021-01-05 19:44         ` Frank Thommen
  2021-01-05 19:35       ` Frank Thommen
  1 sibling, 1 reply; 18+ messages in thread
From: Uwe Sauter @ 2021-01-05 19:29 UTC (permalink / raw)
  To: pve-user

Frank,

On 05.01.21 20:24, Frank Thommen wrote:
> I can't see any specific MON or OSD logs.  However, the log available
> in the UI (Ceph -> Log) has lots of messages about scrubbing, but
> none about problems with starting the monitor.

On each host the logs should be in /var/log/ceph. These should be rotated (see /etc/logrotate.d/ceph-common for details).
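
For example (a sketch; on PVE the mon ID usually equals the node name):

   tail -n 100 /var/log/ceph/ceph-mon.$(hostname).log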

Regards,

	Uwe




* Re: [PVE-User] After update Ceph monitor shows wrong version in UI and is down and out of quorum
  2021-01-05 19:24     ` Frank Thommen
  2021-01-05 19:29       ` Uwe Sauter
@ 2021-01-05 19:35       ` Frank Thommen
  1 sibling, 0 replies; 18+ messages in thread
From: Frank Thommen @ 2021-01-05 19:35 UTC (permalink / raw)
  To: pve-user


On 05.01.21 20:24, Frank Thommen wrote:
> [...]
> Let me compile the lists and post them somewhere.  They are quite long.

dpkg -l:

   * https://pastebin.com/HacFNTDf
   * https://pastebin.com/qmya0Y2Y
   * https://pastebin.com/5CmudA6L

The second one is from the host where the monitor refuses to start.

Frank




* Re: [PVE-User] After update Ceph monitor shows wrong version in UI and is down and out of quorum
  2021-01-05 19:29       ` Uwe Sauter
@ 2021-01-05 19:44         ` Frank Thommen
       [not found]           ` <f3ca5f88-5cbd-9807-02d7-d8f24fcbefdb@gmail.com>
       [not found]           ` <058f3eca-2e6f-eead-365a-4d451fa160d3@gmail.com>
  0 siblings, 2 replies; 18+ messages in thread
From: Frank Thommen @ 2021-01-05 19:44 UTC (permalink / raw)
  To: pve-user



On 05.01.21 20:29, Uwe Sauter wrote:
> [...]
> On each host the logs should be in /var/log/ceph. These should be
> rotated (see /etc/logrotate.d/ceph-common for details).

OK, I see lots of

-----------------------
2021-01-05 20:38:05.900 7f979e753700  1 mon.odcf-pve02@-1(probing) e4 handle_auth_request failed to assign global_id
2021-01-05 20:38:07.208 7f979e753700  1 mon.odcf-pve02@-1(probing) e4 handle_auth_request failed to assign global_id
2021-01-05 20:38:08.688 7f979e753700  1 mon.odcf-pve02@-1(probing) e4 handle_auth_request failed to assign global_id
2021-01-05 20:38:08.744 7f979e753700  1 mon.odcf-pve02@-1(probing) e4 handle_auth_request failed to assign global_id
2021-01-05 20:38:09.092 7f979e753700  1 mon.odcf-pve02@-1(probing) e4 handle_auth_request failed to assign global_id
2021-01-05 20:38:12.268 7f979e753700  1 mon.odcf-pve02@-1(probing) e4 handle_auth_request failed to assign global_id
2021-01-05 20:38:12.468 7f979e753700  1 mon.odcf-pve02@-1(probing) e4 handle_auth_request failed to assign global_id
2021-01-05 20:38:12.964 7f979e753700  1 mon.odcf-pve02@-1(probing) e4 handle_auth_request failed to assign global_id
2021-01-05 20:38:15.752 7f979e753700  1 mon.odcf-pve02@-1(probing) e4 handle_auth_request failed to assign global_id
2021-01-05 20:38:17.440 7f979e753700  1 mon.odcf-pve02@-1(probing) e4 handle_auth_request failed to assign global_id
2021-01-05 20:38:19.388 7f979e753700  1 mon.odcf-pve02@-1(probing) e4 handle_auth_request failed to assign global_id
2021-01-05 20:38:19.468 7f979e753700  1 mon.odcf-pve02@-1(probing) e4 handle_auth_request failed to assign global_id
2021-01-05 20:38:22.712 7f979e753700  1 mon.odcf-pve02@-1(probing) e4 handle_auth_request failed to assign global_id
2021-01-05 20:38:22.828 7f979e753700  1 mon.odcf-pve02@-1(probing) e4 handle_auth_request failed to assign global_id
-----------------------

in the mon log on the problematic host.

When (unsuccessfully) starting the monitor through the UI, the following 
entries appear in ceph.audit.log:

-----------------------
2021-01-05 20:40:07.635369 mon.odcf-pve03 (mon.1) 288082 : audit [DBG] from='client.? 192.168.255.2:0/2418486168' entity='client.admin' cmd=[{"format":"json","prefix":"mgr metadata"}]: dispatch
2021-01-05 20:40:07.636592 mon.odcf-pve03 (mon.1) 288083 : audit [DBG] from='client.? 192.168.255.2:0/2418486168' entity='client.admin' cmd=[{"format":"json","prefix":"mgr dump"}]: dispatch
2021-01-05 20:40:08.296793 mon.odcf-pve03 (mon.1) 288084 : audit [DBG] from='client.? 192.168.255.2:0/778781756' entity='client.admin' cmd=[{"format":"json","prefix":"mon metadata"}]: dispatch
2021-01-05 20:40:08.297767 mon.odcf-pve03 (mon.1) 288085 : audit [DBG] from='client.? 192.168.255.2:0/778781756' entity='client.admin' cmd=[{"prefix":"quorum_status","format":"json"}]: dispatch
2021-01-05 20:40:08.436982 mon.odcf-pve01 (mon.0) 389632 : audit [DBG] from='client.? 192.168.255.2:0/784579843' entity='client.admin' cmd=[{"format":"json","prefix":"df"}]: dispatch
-----------------------

192.168.255.2 is the IP address of the problematic host in the Ceph 
mesh network.  odcf-pve01 and odcf-pve03 are the "good" nodes.

However, I am not sure what kind of information I should look for in 
the logs.
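
One way to get more detail out of the stuck mon is to raise its log 
verbosity via the local admin socket (a sketch; run it on the node 
itself and revert the values afterwards):

   ceph daemon mon.odcf-pve02 config set debug_mon 10
   ceph daemon mon.odcf-pve02 config set debug_ms 1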

Frank


* Re: [PVE-User] After update Ceph monitor shows wrong version in UI and is down and out of quorum
       [not found]           ` <f3ca5f88-5cbd-9807-02d7-d8f24fcbefdb@gmail.com>
@ 2021-01-05 20:17             ` Frank Thommen
  2021-01-08 10:36               ` Frank Thommen
  0 siblings, 1 reply; 18+ messages in thread
From: Frank Thommen @ 2021-01-05 20:17 UTC (permalink / raw)
  To: pve-user

On 05.01.21 21:02, Uwe Sauter wrote:
> There's a paragraph about probing mons on
> 
> https://docs.ceph.com/en/latest/rados/troubleshooting/troubleshooting-mon/

I will check that (tomorrow :-)

> Can you connect to ports TCP/3300 and TCP/6789 from the other two hosts?
> You can use telnet to check this. (On host1 or host3 run "telnet host 
> port". To quit telnet, press ctrl+], then type "quit" + enter.)

Yes, all working OK.

> Is the clock synchronized on the affected host?

Yes.

> From the package lists I assume that host1 has a GUI installed while
> host3 additionally acts as a MariaDB server? I always keep the package
> lists in sync on the clusters I run…

No MariaDB server anywhere, and all three have the PVE web UI.  I 
wouldn't know of any other GUI.  Additional packages might have come 
in as prerequisites for individually installed admin tools (e.g. 
wireshark).  We install them ad hoc and don't usually keep them on all 
hosts.


> It also looks like hosts 1 and 3 are on different kernels (though
> that shouldn't be an issue here…).

1 and 3 should be the same, while 2 has been updated more recently.

Frank


* Re: [PVE-User] After update Ceph monitor shows wrong version in UI and is down and out of quorum
       [not found]           ` <058f3eca-2e6f-eead-365a-4d451fa160d3@gmail.com>
@ 2021-01-05 20:18             ` Frank Thommen
  0 siblings, 0 replies; 18+ messages in thread
From: Frank Thommen @ 2021-01-05 20:18 UTC (permalink / raw)
  To: pve-user

On 05.01.21 21:05, Uwe Sauter wrote:
> Also, is there still disk space available? It seems that the monitor
> refuses to start if it can't write to the log files.

There is plenty of free disk space on all partitions and filesystems :-)
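(checked with something like `df -h / /var/lib/ceph /var/log`)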

Frank






* Re: [PVE-User] After update Ceph monitor shows wrong version in UI and is down and out of quorum
  2021-01-05 20:17             ` Frank Thommen
@ 2021-01-08 10:36               ` Frank Thommen
  2021-01-08 10:45                 ` Uwe Sauter
  0 siblings, 1 reply; 18+ messages in thread
From: Frank Thommen @ 2021-01-08 10:36 UTC (permalink / raw)
  To: Proxmox VE user list


On 05.01.21 21:17, Frank Thommen wrote:
> On 05.01.21 21:02, Uwe Sauter wrote:
>> There's a paragraph about probing mons on
>> https://docs.ceph.com/en/latest/rados/troubleshooting/troubleshooting-mon/
>
> I will check that (tomorrow :-)


Following that page, I can query the monitors of 01 and 03 (the good 
ones) from any of the three nodes, but not the one on 02 (the 
problematic one):

root@odcf-pve01:~# ceph tell mon.odcf-pve02 mon_status
Error ENOENT: problem getting command descriptions from mon.odcf-pve02
root@odcf-pve01:~#

The monitor daemon is running on all three and the ports are open.

Any other ideas?

Cheers, Frank




* Re: [PVE-User] After update Ceph monitor shows wrong version in UI and is down and out of quorum
  2021-01-08 10:36               ` Frank Thommen
@ 2021-01-08 10:45                 ` Uwe Sauter
  2021-01-08 11:05                   ` Frank Thommen
  0 siblings, 1 reply; 18+ messages in thread
From: Uwe Sauter @ 2021-01-08 10:45 UTC (permalink / raw)
  To: pve-user



On 08.01.21 11:36, Frank Thommen wrote:
> [...]
> root@odcf-pve01:~# ceph tell mon.odcf-pve02 mon_status
> Error ENOENT: problem getting command descriptions from mon.odcf-pve02
> root@odcf-pve01:~#
>
> The monitor daemon is running on all three and the ports are open.
>
> Any other ideas?

You could check the permissions on the socket:

ss -xln | grep ceph-mon
SOCK=$(ss -xln | awk '/ceph-mon/ {print $5}')
ls -la ${SOCK}

On my host, this shows

srwxr-xr-x 1 ceph ceph 0 Dec 20 23:47 /var/run/ceph/ceph-mon.px-alpha-cluster.asok
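
If the permissions look right, the mon can also be queried directly 
through that socket, bypassing the network path (sketch):

   ceph --admin-daemon ${SOCK} mon_status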




* Re: [PVE-User] After update Ceph monitor shows wrong version in UI and is down and out of quorum
  2021-01-08 10:45                 ` Uwe Sauter
@ 2021-01-08 11:05                   ` Frank Thommen
  2021-01-08 11:27                     ` Peter Simon
  0 siblings, 1 reply; 18+ messages in thread
From: Frank Thommen @ 2021-01-08 11:05 UTC (permalink / raw)
  To: Proxmox VE user list



On 08.01.21 11:45, Uwe Sauter wrote:
> [...]
> 
> You could check the permissions on the socket:
> 
> ss -xln | grep ceph-mon
> SOCK=$(ss -xln | awk '/ceph-mon/ {print $5}')
> ls -la ${SOCK}
> 
> On my host, this shows
> 
> srwxr-xr-x 1 ceph ceph 0 Dec 20 23:47 
> /var/run/ceph/ceph-mon.px-alpha-cluster.asok

Same here.




* Re: [PVE-User] After update Ceph monitor shows wrong version in UI and is down and out of quorum
  2021-01-08 11:05                   ` Frank Thommen
@ 2021-01-08 11:27                     ` Peter Simon
  2021-01-08 11:44                       ` Frank Thommen
  0 siblings, 1 reply; 18+ messages in thread
From: Peter Simon @ 2021-01-08 11:27 UTC (permalink / raw)
  To: pve-user

Hi Frank,

is your /etc/ceph/ceph.conf the same on all hosts?

Is there a "mon host = ip1, ip2, ip3" entry, and separate sections like

[mon.x]
     host = hostname
     mon addr = ip:6789

Cheers
Peter


* Re: [PVE-User] After update Ceph monitor shows wrong version in UI and is down and out of quorum
  2021-01-08 11:27                     ` Peter Simon
@ 2021-01-08 11:44                       ` Frank Thommen
  2021-01-08 11:57                         ` Peter Simon
  2021-01-08 12:01                         ` Frank Thommen
  0 siblings, 2 replies; 18+ messages in thread
From: Frank Thommen @ 2021-01-08 11:44 UTC (permalink / raw)
  To: pve-user

Yes, /etc/ceph/ceph.conf is identical on all three hosts, and there is 
a mon_host line with the correct IPs.  Interestingly, there is a 
special section for odcf-pve02:

-----------
[mon.odcf-pve02]
	 public_addr = 192.168.255.2
-----------

This is the same IP as in the mon_host line.  However, there is no 
equivalent section for the other two nodes.

Frank


On 08.01.21 12:27, Peter Simon wrote:
> Hi Frank,
>
> is your /etc/ceph/ceph.conf the same on all hosts?
> [...]

* Re: [PVE-User] After update Ceph monitor shows wrong version in UI and is down and out of quorum
  2021-01-08 11:44                       ` Frank Thommen
@ 2021-01-08 11:57                         ` Peter Simon
  2021-01-08 12:01                         ` Frank Thommen
  1 sibling, 0 replies; 18+ messages in thread
From: Peter Simon @ 2021-01-08 11:57 UTC (permalink / raw)
  To: pve-user

Hi,

please try:

[mon.odcf-pve0X]
     host = hostname
     mon addr = 192.168.255.x:6789

with a separate entry for each monitor.
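
For example (illustrative only; 192.168.255.2 is the only address 
confirmed in this thread, the .1/.3 addresses for the other two nodes 
are assumptions):

[mon.odcf-pve01]
     host = odcf-pve01
     mon addr = 192.168.255.1:6789

[mon.odcf-pve02]
     host = odcf-pve02
     mon addr = 192.168.255.2:6789

[mon.odcf-pve03]
     host = odcf-pve03
     mon addr = 192.168.255.3:6789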

Regards,
Peter


* Re: [PVE-User] After update Ceph monitor shows wrong version in UI and is down and out of quorum
  2021-01-08 11:44                       ` Frank Thommen
  2021-01-08 11:57                         ` Peter Simon
@ 2021-01-08 12:01                         ` Frank Thommen
  2021-01-16 12:26                           ` Frank Thommen
  1 sibling, 1 reply; 18+ messages in thread
From: Frank Thommen @ 2021-01-08 12:01 UTC (permalink / raw)
  To: pve-user

Could this entry be the result of the fencing which happened when the 
host initially crashed?  I assumed that it would automatically be 
unfenced when it came up again.  I never ran any manual "unfencing" (I 
wouldn't know how).

Frank



On 08.01.21 12:44, Frank Thommen wrote:
> [...] Interestingly, there is a special section for odcf-pve02:
>
> -----------
> [mon.odcf-pve02]
>      public_addr = 192.168.255.2
> -----------
> [...]

* Re: [PVE-User] After update Ceph monitor shows wrong version in UI and is down and out of quorum
  2021-01-08 12:01                         ` Frank Thommen
@ 2021-01-16 12:26                           ` Frank Thommen
  0 siblings, 0 replies; 18+ messages in thread
From: Frank Thommen @ 2021-01-16 12:26 UTC (permalink / raw)
  To: pve-user

Just to close this thread on the mailing list: I finally turned this 
into a support request with Proxmox, and we are still working on it.  
It's not an easy case to solve :-)

Frank



