public inbox for pve-user@lists.proxmox.com
* [PVE-User] Caution: ceph-mon service does not start after today's updates
@ 2020-11-26 12:18 Uwe Sauter
  2020-11-26 14:10 ` Lindsay Mathieson
  2020-11-26 19:35 ` [PVE-User] [update available] " Thomas Lamprecht
  0 siblings, 2 replies; 15+ messages in thread
From: Uwe Sauter @ 2020-11-26 12:18 UTC (permalink / raw)
  To: pve-user

Hi all,

this is a warning for everyone who is eager to apply today's updates.
In my case the ceph-mon@<host> service did not start after a reboot, which left Ceph hanging once a second monitor service
went offline.

Make sure that ceph-mon@<host>.service is running on freshly rebooted hosts before rebooting the next host that runs a monitor
service.
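
A minimal shell sketch of that check, assuming it runs on the freshly rebooted node and that the monitor unit is named after the short hostname. The helper name `inactive_units` is made up for illustration; it relies only on `systemctl is-active --quiet`.

```shell
#!/bin/sh
# Hypothetical helper: print every unit from the argument list that
# systemd does not report as "active". Prints nothing if all are fine.
inactive_units() {
    for unit in "$@"; do
        # `systemctl is-active --quiet` exits 0 only for active units.
        systemctl is-active --quiet "$unit" || printf '%s\n' "$unit"
    done
}

# Example call (assumption: units are named after the short hostname):
#   inactive_units "ceph-mon@$(hostname -s).service" pve-cluster.service
```

Empty output would mean it is probably safe to move on to the next host; any unit name printed deserves a look first.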

Regards,

	Uwe




* Re: [PVE-User] Caution: ceph-mon service does not start after today's updates
  2020-11-26 12:18 [PVE-User] Caution: ceph-mon service does not start after today's updates Uwe Sauter
@ 2020-11-26 14:10 ` Lindsay Mathieson
  2020-11-26 14:15   ` Uwe Sauter
  2020-11-26 19:35 ` [PVE-User] [update available] " Thomas Lamprecht
  1 sibling, 1 reply; 15+ messages in thread
From: Lindsay Mathieson @ 2020-11-26 14:10 UTC (permalink / raw)
  To: pve-user

On 26/11/2020 10:18 pm, Uwe Sauter wrote:
> this is a warning for all that are eager to apply today's updates.
> In my case the ceph-mon@<host> service did not start after a reboot which caused a hanging Ceph once the second monitoring service
> went offline.

I ran into that too; the node also failed to rejoin the cluster quorum.
syslog had errors relating to the pve-ssl key.

Manually starting the pve-cluster service and a second reboot solved both issues.

-- 
Lindsay





* Re: [PVE-User] Caution: ceph-mon service does not start after today's updates
  2020-11-26 14:10 ` Lindsay Mathieson
@ 2020-11-26 14:15   ` Uwe Sauter
  2020-11-26 14:46     ` Thomas Lamprecht
  0 siblings, 1 reply; 15+ messages in thread
From: Uwe Sauter @ 2020-11-26 14:15 UTC (permalink / raw)
  To: pve-user

On 26.11.20 at 15:10, Lindsay Mathieson wrote:
> On 26/11/2020 10:18 pm, Uwe Sauter wrote:
>> this is a warning for all that are eager to apply today's updates.
>> In my case the ceph-mon@<host> service did not start after a reboot which caused a hanging Ceph once the second monitoring service
>> went offline.
> 
> I ran into that, also the node failed to rejoin the cluster quorum. syslog had errors relating to the pem-ssl key.
> 
> Manually start the pve cluster service and a 2nd reboot solved both issues.
> 

Yes, rebooting might help, but not reliably. I had nodes that needed several reboots before pvestatd stopped failing.

I also had failed ceph-mgr@<host> services (with Nautilus).

My current suspicion is that my network takes too long to become available.


Regards,

	Uwe




* Re: [PVE-User] Caution: ceph-mon service does not start after today's updates
  2020-11-26 14:15   ` Uwe Sauter
@ 2020-11-26 14:46     ` Thomas Lamprecht
  2020-11-26 15:03       ` Lindsay Mathieson
  0 siblings, 1 reply; 15+ messages in thread
From: Thomas Lamprecht @ 2020-11-26 14:46 UTC (permalink / raw)
  To: uwe.sauter.de, Proxmox VE user list

On 26.11.20 15:15, Uwe Sauter wrote:
> On 26.11.20 at 15:10, Lindsay Mathieson wrote:
>> On 26/11/2020 10:18 pm, Uwe Sauter wrote:
>>> this is a warning for all that are eager to apply today's updates.
>>> In my case the ceph-mon@<host> service did not start after a reboot which caused a hanging Ceph once the second monitoring service
>>> went offline.
>>
>> I ran into that, also the node failed to rejoin the cluster quorum. syslog had errors relating to the pem-ssl key.
>>
>> Manually start the pve cluster service and a 2nd reboot solved both issues.
>>
> 
> Yes, rebooting might help, but not reliably. I had nodes that needed several reboots until pvestatd did not fail.
> 
> I also had failed ceph-mgr@<host> services (with Nautilus).
> 
> My current suspicion is that my network takes too long to become available.
> 

Note, it's always a good idea to check that all services are running OK again before
continuing with upgrading the next host, not just for this update :-)

Also, ceph monitors can be conveniently restarted via the web interface; there's a
visible status showing which services run outdated versions/need a restart.


Anyway, do you have any logs which could give more details for possible issues?

regards,
Thomas






* Re: [PVE-User] Caution: ceph-mon service does not start after today's updates
  2020-11-26 14:46     ` Thomas Lamprecht
@ 2020-11-26 15:03       ` Lindsay Mathieson
  2020-11-26 16:14         ` Thomas Lamprecht
  2020-11-26 18:56         ` Thomas Lamprecht
  0 siblings, 2 replies; 15+ messages in thread
From: Lindsay Mathieson @ 2020-11-26 15:03 UTC (permalink / raw)
  To: pve-user

On 27/11/2020 12:46 am, Thomas Lamprecht wrote:
> Note, it's always good idea to check if all services are running OK again before
> continuing with upgrading the next host, not just on this update:-)
>
> Also, ceph monitors can be nicely restarted over the web interface, there's a
> visible status about which services run outdated versions/need a restart.
>
>
> Anyway, do you have any logs which could give more details for possible issues?

I have a node that is just failing to rejoin the cluster and the ceph 
mon & mgr fail to start.


Seeing this repeated in syslog

    Nov 27 00:58:23 vnh pveproxy[2903]: /etc/pve/local/pve-ssl.key: failed to load local private key (key_file or key) at /usr/share/perl5/PVE/APIServer/AnyEvent.pm line 1737.
    Nov 27 00:58:23 vnh pveproxy[2904]: /etc/pve/local/pve-ssl.key: failed to load local private key (key_file or key) at /usr/share/perl5/PVE/APIServer/AnyEvent.pm line 1737.
    Nov 27 00:58:23 vnh pveproxy[2905]: /etc/pve/local/pve-ssl.key: failed to load local private key (key_file or key) at /usr/share/perl5/PVE/APIServer/AnyEvent.pm line 1737.
    Nov 27 00:58:26 vnh ceph-mon[2073]: 2020-11-27 00:58:26.378 7fb182935700 -1 mon.vnh@0(probing) e9 handle_auth_bad_method hmm, they didn't like 2 result (95) Operation not supported
    Nov 27 00:58:26 vnh ceph-mon[2073]: 2020-11-27 00:58:26.390 7fb17d92b700 -1 mon.vnh@0(probing) e9 handle_auth_bad_method hmm, they didn't like 2 result (95) Operation not supported
    Nov 27 00:58:26 vnh ceph-mon[2073]: 2020-11-27 00:58:26.526 7fb183136700 -1 mon.vnh@0(probing) e9 handle_auth_bad_method hmm, they didn't like 2 result (95) Operation not supported
    Nov 27 00:58:27 vnh ceph-mon[2073]: 2020-11-27 00:58:27.702 7fb182935700 -1 mon.vnh@0(probing) e9 handle_auth_request no AuthAuthorizeHandler found for auth method 1


The following gets the node back on the cluster:

systemctl start pve-cluster.service
systemctl restart pvestatd.service


But I can't get the mon, mgr or osd services to start.

-- 
Lindsay




* Re: [PVE-User] Caution: ceph-mon service does not start after today's updates
  2020-11-26 15:03       ` Lindsay Mathieson
@ 2020-11-26 16:14         ` Thomas Lamprecht
  2020-11-26 16:35           ` Thomas Lamprecht
  2020-11-26 18:56         ` Thomas Lamprecht
  1 sibling, 1 reply; 15+ messages in thread
From: Thomas Lamprecht @ 2020-11-26 16:14 UTC (permalink / raw)
  To: Proxmox VE user list, Lindsay Mathieson

On 26.11.20 16:03, Lindsay Mathieson wrote:
> On 27/11/2020 12:46 am, Thomas Lamprecht wrote:
>> Note, it's always good idea to check if all services are running OK again before
>> continuing with upgrading the next host, not just on this update:-)
>>
>> Also, ceph monitors can be nicely restarted over the web interface, there's a
>> visible status about which services run outdated versions/need a restart.
>>
>>
>> Anyway, do you have any logs which could give more details for possible issues?
> 
> I have a node that is just failing to rejoin the cluster and the ceph mon & mgr fail to start.
> 
> 
> Seeing this repeated in syslog
> 
>    Nov 27 00:58:23 vnh pveproxy[2903]: /etc/pve/local/pve-ssl.key:
>    failed to load local private key (key_file or key) at
>    /usr/share/perl5/PVE/APIServer/AnyEvent.pm line 1737.
>    Nov 27 00:58:23 vnh pveproxy[2904]: /etc/pve/local/pve-ssl.key:
>    failed to load local private key (key_file or key) at
>    /usr/share/perl5/PVE/APIServer/AnyEvent.pm line 1737.
>    Nov 27 00:58:23 vnh pveproxy[2905]: /etc/pve/local/pve-ssl.key:
>    failed to load local private key (key_file or key) at
>    /usr/share/perl5/PVE/APIServer/AnyEvent.pm line 1737.
>    Nov 27 00:58:26 vnh ceph-mon[2073]: 2020-11-27 00:58:26.378
>    7fb182935700 -1 mon.vnh@0(probing) e9 handle_auth_bad_method hmm,
>    they didn't like 2 result (95) Operation not supported
>    Nov 27 00:58:26 vnh ceph-mon[2073]: 2020-11-27 00:58:26.390
>    7fb17d92b700 -1 mon.vnh@0(probing) e9 handle_auth_bad_method hmm,
>    they didn't like 2 result (95) Operation not supported
>    Nov 27 00:58:26 vnh ceph-mon[2073]: 2020-11-27 00:58:26.526
>    7fb183136700 -1 mon.vnh@0(probing) e9 handle_auth_bad_method hmm,
>    they didn't like 2 result (95) Operation not supported
>    Nov 27 00:58:27 vnh ceph-mon[2073]: 2020-11-27 00:58:27.702
>    7fb182935700 -1 mon.vnh@0(probing) e9 handle_auth_request no
>    AuthAuthorizeHandler found for auth method 1
> 

The errors seem to be the result of pve-cluster not coming up,
which seems to be the actual problem.

> 
> The following gets the node back on the cluster:
> 
> systemctl start pve-cluster.service

Anything from the pve-cluster service in the log?


What does
# systemd-analyze verify default.target

output?

cheers,
Thomas






* Re: [PVE-User] Caution: ceph-mon service does not start after today's updates
  2020-11-26 16:14         ` Thomas Lamprecht
@ 2020-11-26 16:35           ` Thomas Lamprecht
  0 siblings, 0 replies; 15+ messages in thread
From: Thomas Lamprecht @ 2020-11-26 16:35 UTC (permalink / raw)
  To: Proxmox VE user list, Lindsay Mathieson

On 26.11.20 17:14, Thomas Lamprecht wrote:
> What does:
> # systemd-analyze verify default.target
> 
> outputs?

I may have found an issue with systemd service ordering: there seems to be a
cycle which gets broken up arbitrarily, so it works sometimes, but sometimes
not.
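
When systemd breaks an ordering cycle at boot, it normally records that in the journal with messages that mention an "ordering cycle". A small sketch for pulling those lines out of the current boot's log; the function name is an assumption, and the exact message wording may differ between systemd versions.

```shell
#!/bin/sh
# Hypothetical filter: keep only log lines that mention an ordering cycle.
ordering_cycle_lines() {
    grep -i 'ordering cycle'
}

# Example call on an affected node:
#   journalctl -b --no-pager | ordering_cycle_lines
```

If this prints nothing on a node that failed to start its services, the cause is likely something other than a cycle.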






* Re: [PVE-User] Caution: ceph-mon service does not start after today's updates
  2020-11-26 15:03       ` Lindsay Mathieson
  2020-11-26 16:14         ` Thomas Lamprecht
@ 2020-11-26 18:56         ` Thomas Lamprecht
  1 sibling, 0 replies; 15+ messages in thread
From: Thomas Lamprecht @ 2020-11-26 18:56 UTC (permalink / raw)
  To: Proxmox VE user list

Some news.

There are a few things at play, but it boils down to two:
* an update of various service orderings in ceph with 14.2.12 (released a while
  ago), which introduced a `Before=remote-fs-pre.target` order enforcement
  pretty much everywhere.

* rrdcached, a service used by pve-cluster.service (pmxcfs). It has no native
  systemd service file, so systemd auto-generates one, with a `Before=remote-pre.target`
  order enforcement, which in turn is ordered against the aforementioned
  `Before=remote-fs-pre.target`


Thus you get the cycle (-> means an after ordering; all befores were transformed
to afters by reversing them (systemd does that too)):


.> pve-cluster -> rrdcached -> remote-pre -> remote-fs-pre -> ceph-mgr@ -.
|                                                                        |
`------------------------------------------------------------------------'
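
The loop can be reproduced on paper with `tsort` from coreutils, which cannot fully sort input that contains a cycle and exits non-zero in that case. This is only an illustration of the graph above, not of how systemd resolves ordering; the edge list is transcribed from the diagram and the function names are made up.

```shell
#!/bin/sh
# Hypothetical helper: read "earlier later" pairs on stdin and succeed
# if they contain a loop. tsort exits non-zero when it finds one.
has_ordering_cycle() {
    ! tsort >/dev/null 2>&1
}

# Edges from the diagram, written as "starts-first starts-later".
# "X -> Y" above means X is ordered after Y, so each pair is "Y X".
cycle_edges() {
    cat <<'EOF'
rrdcached pve-cluster
remote-pre rrdcached
remote-fs-pre remote-pre
ceph-mgr@ remote-fs-pre
pve-cluster ceph-mgr@
EOF
}

# Example:
#   cycle_edges | has_ordering_cycle && echo "cycle"
#   cycle_edges | grep -v '^ceph-mgr@ ' | has_ordering_cycle || echo "no cycle"
```

Dropping the `ceph-mgr@ remote-fs-pre` pair in the second example corresponds to removing the `Before=remote-fs-pre.target` line from the ceph unit, after which the remaining edges sort cleanly.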

We're building a new ceph version with the Before=remote-fs-pre.target removed,
it is bogus for the ceph mgr, mds, mon, .. services as is.

As you probably guessed, one can also fix this by adapting rrdcached; as
a workaround you can do the following:

1. copy over the generated ephemeral service file from /run to /etc, which
   has higher priority.

# cp /run/systemd/generator.late/rrdcached.service /etc/systemd/system/

2. Drop the after ordering for remote-fs.target
# sed -i '/^After=remote-fs.target/d' /etc/systemd/system/rrdcached.service

3. reboot 
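
Step 2 can be tried on a scratch copy before touching /etc. The sketch below wraps the same sed expression in a function (the name is made up) so it can be pointed at any file. Note that it removes whole lines, and only those that begin with `After=remote-fs.target`; a unit that lists that target later on a combined `After=` line would not match.

```shell
#!/bin/sh
# Hypothetical wrapper around the sed call from step 2: delete every
# line beginning with "After=remote-fs.target" from the given unit file.
drop_remote_fs_after() {
    sed -i '/^After=remote-fs.target/d' "$1"
}

# Example on a scratch copy (contents below are made up, not the real
# generated unit):
#   printf '[Unit]\nAfter=remote-fs.target\nAfter=network.target\n' > /tmp/test.service
#   drop_remote_fs_after /tmp/test.service
#   cat /tmp/test.service
```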

A ceph 14.2.15-pve2 package will be available soon. We'll also see if we can
improve the rrdcached situation in the future. It is not at fault itself,
naturally; the heuristic of the systemd auto-generator is to blame. But maybe
upstream or Debian has interest in adding a hand-crafted systemd
unit file, avoiding auto-generation. Optionally we could maintain it for PVE,
or do as in Proxmox Backup Server and use our own Rust-based RRD implementation.

regards,
Thomas






* [PVE-User] [update available] Re: Caution: ceph-mon service does not start after today's updates
  2020-11-26 12:18 [PVE-User] Caution: ceph-mon service does not start after today's updates Uwe Sauter
  2020-11-26 14:10 ` Lindsay Mathieson
@ 2020-11-26 19:35 ` Thomas Lamprecht
  2020-11-26 19:54   ` Uwe Sauter
                     ` (3 more replies)
  1 sibling, 4 replies; 15+ messages in thread
From: Thomas Lamprecht @ 2020-11-26 19:35 UTC (permalink / raw)
  To: Proxmox VE user list

Hi all,
Hi Uwe and Lindsay,

ceph 14.2.15-pve2 is now available. It includes fixes for the ordering
constraints of the various ceph service unit files[0]; besides that there
were zero actual code changes.

Due to that, and Stoiko's tests (many thanks for all the help!) confirming
my own positive results, we uploaded it for general availability.

Thanks also to you two for reporting!

regards,
Thomas

[0]: https://git.proxmox.com/?p=ceph.git;a=blob;f=patches/0009-fix-service-ordering-avoid-Before-remote-fs-pre.targ.patch;h=8fe5a35c385f7d4007b4965d297fc0daa9091be3;hb=99b3812832d5d5a8ac48a2e3b084c0ecf1fd087c






* Re: [PVE-User] [update available] Re: Caution: ceph-mon service does not start after today's updates
  2020-11-26 19:35 ` [PVE-User] [update available] " Thomas Lamprecht
@ 2020-11-26 19:54   ` Uwe Sauter
  2020-11-27  0:53   ` Lindsay Mathieson
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 15+ messages in thread
From: Uwe Sauter @ 2020-11-26 19:54 UTC (permalink / raw)
  To: Thomas Lamprecht, Proxmox VE user list

Thomas,

thank you and your team for that prompt response.

I'll try it out first thing tomorrow morning.

Regards,

	Uwe

On 26.11.20 at 20:35, Thomas Lamprecht wrote:
> Hi all,
> Hi Uwe and Lindsay,
> 
> ceph 14.2.15-pve2 is now available, it includes fixes for the ordering
> restraints of the various ceph service unit files[0], besides that there
> where zero actual code changes.
> 
> Due to that and Stoiko's tests (much thanks for all the help!) confirming
> the positive result of mine, we uploaded it for general availability.
> 
> Thanks also to you two for reporting!
> 
> regards,
> Thomas
> 
> [0]: https://git.proxmox.com/?p=ceph.git;a=blob;f=patches/0009-fix-service-ordering-avoid-Before-remote-fs-pre.targ.patch;h=8fe5a35c385f7d4007b4965d297fc0daa9091be3;hb=99b3812832d5d5a8ac48a2e3b084c0ecf1fd087c
> 
> 




* Re: [PVE-User] [update available] Re: Caution: ceph-mon service does not start after today's updates
  2020-11-26 19:35 ` [PVE-User] [update available] " Thomas Lamprecht
  2020-11-26 19:54   ` Uwe Sauter
@ 2020-11-27  0:53   ` Lindsay Mathieson
  2020-11-27  8:19   ` Uwe Sauter
  2020-11-27 13:18   ` Lindsay Mathieson
  3 siblings, 0 replies; 15+ messages in thread
From: Lindsay Mathieson @ 2020-11-27  0:53 UTC (permalink / raw)
  To: Proxmox VE user list

On 27/11/2020 5:35 am, Thomas Lamprecht wrote:
> Hi Uwe and Lindsay,
>
> ceph 14.2.15-pve2 is now available, it includes fixes for the ordering
> restraints of the various ceph service unit files[0], besides that there
> where zero actual code changes.
>
> Due to that and Stoiko's tests (much thanks for all the help!) confirming
> the positive result of mine, we uploaded it for general availability.
>
> Thanks also to you two for reporting!


Thanks for the really quick turnaround Thomas, and sorry for not getting 
back to you before, but it was 3am my time and once I got the servers 
back up, I really needed to crash :)


Will test tonight, cheers.

-- 
Lindsay





* Re: [PVE-User] [update available] Re: Caution: ceph-mon service does not start after today's updates
  2020-11-26 19:35 ` [PVE-User] [update available] " Thomas Lamprecht
  2020-11-26 19:54   ` Uwe Sauter
  2020-11-27  0:53   ` Lindsay Mathieson
@ 2020-11-27  8:19   ` Uwe Sauter
  2020-11-27 13:18   ` Lindsay Mathieson
  3 siblings, 0 replies; 15+ messages in thread
From: Uwe Sauter @ 2020-11-27  8:19 UTC (permalink / raw)
  To: Thomas Lamprecht, Proxmox VE user list

Good morning Thomas,

the updated packages do the trick. I updated my ten hosts and not one needed a second reboot.

Next will be the update from Nautilus to Octopus, but the instructions in the wiki are well written, so I don't expect any issues
there.

Thanks again,

	Uwe

On 26.11.20 at 20:35, Thomas Lamprecht wrote:
> Hi all,
> Hi Uwe and Lindsay,
> 
> ceph 14.2.15-pve2 is now available, it includes fixes for the ordering
> restraints of the various ceph service unit files[0], besides that there
> where zero actual code changes.
> 
> Due to that and Stoiko's tests (much thanks for all the help!) confirming
> the positive result of mine, we uploaded it for general availability.
> 
> Thanks also to you two for reporting!
> 
> regards,
> Thomas
> 
> [0]: https://git.proxmox.com/?p=ceph.git;a=blob;f=patches/0009-fix-service-ordering-avoid-Before-remote-fs-pre.targ.patch;h=8fe5a35c385f7d4007b4965d297fc0daa9091be3;hb=99b3812832d5d5a8ac48a2e3b084c0ecf1fd087c
> 
> 





* Re: [PVE-User] [update available] Re: Caution: ceph-mon service does not start after today's updates
  2020-11-26 19:35 ` [PVE-User] [update available] " Thomas Lamprecht
                     ` (2 preceding siblings ...)
  2020-11-27  8:19   ` Uwe Sauter
@ 2020-11-27 13:18   ` Lindsay Mathieson
  2020-11-27 13:29     ` Jean-Luc Oms
  3 siblings, 1 reply; 15+ messages in thread
From: Lindsay Mathieson @ 2020-11-27 13:18 UTC (permalink / raw)
  To: Proxmox VE user list

On 27/11/2020 5:35 am, Thomas Lamprecht wrote:
> Hi Uwe and Lindsay,
>
> ceph 14.2.15-pve2 is now available, it includes fixes for the ordering
> restraints of the various ceph service unit files[0], besides that there
> where zero actual code changes.


Thanks! Just updated and did a rolling reboot on our cluster. All 
servers/services came up perfectly, no issues.


Gonna test that Octopus upgrade next :)


Cheers,

-- 
Lindsay





* Re: [PVE-User] [update available] Re: Caution: ceph-mon service does not start after today's updates
  2020-11-27 13:18   ` Lindsay Mathieson
@ 2020-11-27 13:29     ` Jean-Luc Oms
  2020-11-27 13:31       ` Lindsay Mathieson
  0 siblings, 1 reply; 15+ messages in thread
From: Jean-Luc Oms @ 2020-11-27 13:29 UTC (permalink / raw)
  To: pve-user


On 27/11/2020 at 14:18, Lindsay Mathieson wrote:
> On 27/11/2020 5:35 am, Thomas Lamprecht wrote:
>> Hi Uwe and Lindsay,
>>
>> ceph 14.2.15-pve2 is now available, it includes fixes for the ordering
>> restraints of the various ceph service unit files[0], besides that there
>> where zero actual code changes.
>
>
> Thanks! Just updated and did a rolling reboot on our cluster. All
> servers/services came up perfectly, no issues.
Have you cleared the ceph health error?
>
>
> Gonna test that Octopus upgrade next :)
>
>
> CXheers,
>
-- 
Jean-Luc Oms
STI-ReseauX <https://rx.lirmm.fr> - LIRMM - CNRS/UM
+33 4 67 41 85 93 / +33 6 32 01 04 17



* Re: [PVE-User] [update available] Re: Caution: ceph-mon service does not start after today's updates
  2020-11-27 13:29     ` Jean-Luc Oms
@ 2020-11-27 13:31       ` Lindsay Mathieson
  0 siblings, 0 replies; 15+ messages in thread
From: Lindsay Mathieson @ 2020-11-27 13:31 UTC (permalink / raw)
  To: pve-user

On 27/11/2020 11:29 pm, Jean-Luc Oms wrote:
> You have cleared the ceph health error  ?

That dashboard one? no.


This was in reference to some cluster & ceph service issues.

-- 
Lindsay





Service provided by Proxmox Server Solutions GmbH