* [PVE-User] Caution: ceph-mon service does not start after today's updates
@ 2020-11-26 12:18 Uwe Sauter
2020-11-26 14:10 ` Lindsay Mathieson
2020-11-26 19:35 ` [PVE-User] [update available] " Thomas Lamprecht
0 siblings, 2 replies; 15+ messages in thread
From: Uwe Sauter @ 2020-11-26 12:18 UTC (permalink / raw)
To: pve-user
Hi all,
this is a warning for everyone who is eager to apply today's updates.
In my case the ceph-mon@<host> service did not start after a reboot, which caused Ceph to hang once the second monitor service
went offline.
Make sure to check that ceph-mon@<host>.service is running on freshly started hosts before rebooting the next host that runs a monitor
service.
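A minimal sketch of such a check, assuming the monitor instance is named after the short hostname (common on Proxmox VE, but verify this on your own cluster). The names `units_all_active` and `UNIT_STATE_CMD` are hypothetical and only used for illustration; `systemctl is-active` is the real command doing the work.

```shell
# Return 0 only if every unit given as an argument reports "active".
# UNIT_STATE_CMD is a hypothetical hook (not a systemd feature) that lets the
# check run against a stand-in command; by default it uses `systemctl is-active`.
units_all_active() {
    state_cmd=${UNIT_STATE_CMD:-systemctl is-active}
    rc=0
    for unit in "$@"; do
        # is-active prints the state and exits non-zero unless it is "active"
        state=$($state_cmd "$unit" 2>/dev/null) || true
        if [ "$state" != "active" ]; then
            echo "NOT READY: $unit is '${state:-unknown}'"
            rc=1
        fi
    done
    return "$rc"
}

# Example (run on the freshly rebooted host before touching the next one):
#   units_all_active "ceph-mon@$(hostname -s).service" pve-cluster.service \
#       && echo "monitor is up, safe to continue"
```

The function only reports; it deliberately does not try to start anything, so a failure stops a rolling reboot instead of hiding the problem.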
Regards,
Uwe
* Re: [PVE-User] Caution: ceph-mon service does not start after today's updates
2020-11-26 12:18 [PVE-User] Caution: ceph-mon service does not start after today's updates Uwe Sauter
@ 2020-11-26 14:10 ` Lindsay Mathieson
2020-11-26 14:15 ` Uwe Sauter
2020-11-26 19:35 ` [PVE-User] [update available] " Thomas Lamprecht
1 sibling, 1 reply; 15+ messages in thread
From: Lindsay Mathieson @ 2020-11-26 14:10 UTC (permalink / raw)
To: pve-user
On 26/11/2020 10:18 pm, Uwe Sauter wrote:
> this is a warning for all that are eager to apply today's updates.
> In my case the ceph-mon@<host> service did not start after a reboot which caused a hanging Ceph once the second monitoring service
> went offline.
I ran into that too; the node also failed to rejoin the cluster quorum.
syslog had errors relating to the pve-ssl key.
Manually starting the pve-cluster service and a second reboot solved both issues.
--
Lindsay
* Re: [PVE-User] Caution: ceph-mon service does not start after today's updates
2020-11-26 14:10 ` Lindsay Mathieson
@ 2020-11-26 14:15 ` Uwe Sauter
2020-11-26 14:46 ` Thomas Lamprecht
0 siblings, 1 reply; 15+ messages in thread
From: Uwe Sauter @ 2020-11-26 14:15 UTC (permalink / raw)
To: pve-user
Am 26.11.20 um 15:10 schrieb Lindsay Mathieson:
> On 26/11/2020 10:18 pm, Uwe Sauter wrote:
>> this is a warning for all that are eager to apply today's updates.
>> In my case the ceph-mon@<host> service did not start after a reboot which caused a hanging Ceph once the second monitoring service
>> went offline.
>
> I ran into that, also the node failed to rejoin the cluster quorum. syslog had errors relating to the pem-ssl key.
>
> Manually start the pve cluster service and a 2nd reboot solved both issues.
>
Yes, rebooting might help, but not reliably. I had nodes that needed several reboots until pvestatd stopped failing.
I also had failed ceph-mgr@<host> services (with Nautilus).
My current suspicion is that my network takes too long to become available.
Regards,
Uwe
* Re: [PVE-User] Caution: ceph-mon service does not start after today's updates
2020-11-26 14:15 ` Uwe Sauter
@ 2020-11-26 14:46 ` Thomas Lamprecht
2020-11-26 15:03 ` Lindsay Mathieson
0 siblings, 1 reply; 15+ messages in thread
From: Thomas Lamprecht @ 2020-11-26 14:46 UTC (permalink / raw)
To: uwe.sauter.de, Proxmox VE user list
On 26.11.20 15:15, Uwe Sauter wrote:
> Am 26.11.20 um 15:10 schrieb Lindsay Mathieson:
>> On 26/11/2020 10:18 pm, Uwe Sauter wrote:
>>> this is a warning for all that are eager to apply today's updates.
>>> In my case the ceph-mon@<host> service did not start after a reboot which caused a hanging Ceph once the second monitoring service
>>> went offline.
>>
>> I ran into that, also the node failed to rejoin the cluster quorum. syslog had errors relating to the pem-ssl key.
>>
>> Manually start the pve cluster service and a 2nd reboot solved both issues.
>>
>
> Yes, rebooting might help, but not reliably. I had nodes that needed several reboots until pvestatd did not fail.
>
> I also had failed ceph-mgr@<host> services (with Nautilus).
>
> My current suspicion is that my network takes too long to become available.
>
Note, it's always a good idea to check that all services are running OK again before
continuing with upgrading the next host, not just for this update :-)
Also, ceph monitors can be nicely restarted over the web interface; there's a
visible status showing which services run outdated versions or need a restart.
Anyway, do you have any logs which could give more details about possible issues?
regards,
Thomas
* Re: [PVE-User] Caution: ceph-mon service does not start after today's updates
2020-11-26 14:46 ` Thomas Lamprecht
@ 2020-11-26 15:03 ` Lindsay Mathieson
2020-11-26 16:14 ` Thomas Lamprecht
2020-11-26 18:56 ` Thomas Lamprecht
0 siblings, 2 replies; 15+ messages in thread
From: Lindsay Mathieson @ 2020-11-26 15:03 UTC (permalink / raw)
To: pve-user
On 27/11/2020 12:46 am, Thomas Lamprecht wrote:
> Note, it's always good idea to check if all services are running OK again before
> continuing with upgrading the next host, not just on this update:-)
>
> Also, ceph monitors can be nicely restarted over the web interface, there's a
> visible status about which services run outdated versions/need a restart.
>
>
> Anyway, do you have any logs which could give more details for possible issues?
I have a node that is just failing to rejoin the cluster and the ceph
mon & mgr fail to start.
Seeing this repeated in syslog:
Nov 27 00:58:23 vnh pveproxy[2903]: /etc/pve/local/pve-ssl.key: failed to load local private key (key_file or key) at /usr/share/perl5/PVE/APIServer/AnyEvent.pm line 1737.
Nov 27 00:58:23 vnh pveproxy[2904]: /etc/pve/local/pve-ssl.key: failed to load local private key (key_file or key) at /usr/share/perl5/PVE/APIServer/AnyEvent.pm line 1737.
Nov 27 00:58:23 vnh pveproxy[2905]: /etc/pve/local/pve-ssl.key: failed to load local private key (key_file or key) at /usr/share/perl5/PVE/APIServer/AnyEvent.pm line 1737.
Nov 27 00:58:26 vnh ceph-mon[2073]: 2020-11-27 00:58:26.378 7fb182935700 -1 mon.vnh@0(probing) e9 handle_auth_bad_method hmm, they didn't like 2 result (95) Operation not supported
Nov 27 00:58:26 vnh ceph-mon[2073]: 2020-11-27 00:58:26.390 7fb17d92b700 -1 mon.vnh@0(probing) e9 handle_auth_bad_method hmm, they didn't like 2 result (95) Operation not supported
Nov 27 00:58:26 vnh ceph-mon[2073]: 2020-11-27 00:58:26.526 7fb183136700 -1 mon.vnh@0(probing) e9 handle_auth_bad_method hmm, they didn't like 2 result (95) Operation not supported
Nov 27 00:58:27 vnh ceph-mon[2073]: 2020-11-27 00:58:27.702 7fb182935700 -1 mon.vnh@0(probing) e9 handle_auth_request no AuthAuthorizeHandler found for auth method 1
The following gets the node back on the cluster:
systemctl start pve-cluster.service
systemctl restart pvestatd.service
But I can't get the mon, mgr or osd services to start.
--
Lindsay
* Re: [PVE-User] Caution: ceph-mon service does not start after today's updates
2020-11-26 15:03 ` Lindsay Mathieson
@ 2020-11-26 16:14 ` Thomas Lamprecht
2020-11-26 16:35 ` Thomas Lamprecht
2020-11-26 18:56 ` Thomas Lamprecht
1 sibling, 1 reply; 15+ messages in thread
From: Thomas Lamprecht @ 2020-11-26 16:14 UTC (permalink / raw)
To: Proxmox VE user list, Lindsay Mathieson
On 26.11.20 16:03, Lindsay Mathieson wrote:
> On 27/11/2020 12:46 am, Thomas Lamprecht wrote:
>> Note, it's always good idea to check if all services are running OK again before
>> continuing with upgrading the next host, not just on this update:-)
>>
>> Also, ceph monitors can be nicely restarted over the web interface, there's a
>> visible status about which services run outdated versions/need a restart.
>>
>>
>> Anyway, do you have any logs which could give more details for possible issues?
>
> I have a node that is just failing to rejoin the cluster and the ceph mon & mgr fail to start.
>
>
> Seeing this repeated in syslog
>
> Nov 27 00:58:23 vnh pveproxy[2903]: /etc/pve/local/pve-ssl.key:
> failed to load local private key (key_file or key) at
> /usr/share/perl5/PVE/APIServer/AnyEvent.pm line 1737.
> Nov 27 00:58:23 vnh pveproxy[2904]: /etc/pve/local/pve-ssl.key:
> failed to load local private key (key_file or key) at
> /usr/share/perl5/PVE/APIServer/AnyEvent.pm line 1737.
> Nov 27 00:58:23 vnh pveproxy[2905]: /etc/pve/local/pve-ssl.key:
> failed to load local private key (key_file or key) at
> /usr/share/perl5/PVE/APIServer/AnyEvent.pm line 1737.
> Nov 27 00:58:26 vnh ceph-mon[2073]: 2020-11-27 00:58:26.378
> 7fb182935700 -1 mon.vnh@0(probing) e9 handle_auth_bad_method hmm,
> they didn't like 2 result (95) Operation not supported
> Nov 27 00:58:26 vnh ceph-mon[2073]: 2020-11-27 00:58:26.390
> 7fb17d92b700 -1 mon.vnh@0(probing) e9 handle_auth_bad_method hmm,
> they didn't like 2 result (95) Operation not supported
> Nov 27 00:58:26 vnh ceph-mon[2073]: 2020-11-27 00:58:26.526
> 7fb183136700 -1 mon.vnh@0(probing) e9 handle_auth_bad_method hmm,
> they didn't like 2 result (95) Operation not supported
> Nov 27 00:58:27 vnh ceph-mon[2073]: 2020-11-27 00:58:27.702
> 7fb182935700 -1 mon.vnh@0(probing) e9 handle_auth_request no
> AuthAuthorizeHandler found for auth method 1
>
The errors seem to be the result of pve-cluster not coming up,
which seems to be the actual problem.
>
> The following gets the node back on the cluster:
>
> systemctl start pve-cluster.service
Anything from the pve-cluster service in the log?
What does:
# systemd-analyze verify default.target
output?
cheers,
Thomas
* Re: [PVE-User] Caution: ceph-mon service does not start after today's updates
2020-11-26 16:14 ` Thomas Lamprecht
@ 2020-11-26 16:35 ` Thomas Lamprecht
0 siblings, 0 replies; 15+ messages in thread
From: Thomas Lamprecht @ 2020-11-26 16:35 UTC (permalink / raw)
To: Proxmox VE user list, Lindsay Mathieson
On 26.11.20 17:14, Thomas Lamprecht wrote:
> What does:
> # systemd-analyze verify default.target
>
> outputs?
I may have found an issue with systemd service ordering: it seems there's a
cycle which gets broken up arbitrarily, so it works sometimes, but sometimes
not.
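To see which job was sacrificed on a given boot, one can filter the boot log for the message systemd writes when it breaks a cycle. This is only a sketch: `cycle_victims` is a hypothetical helper name, and the exact log wording is assumed from current systemd and may differ between versions.

```shell
# Read log text on stdin and print each job that systemd deleted in order to
# break an ordering cycle, one per line, without duplicates.
cycle_victims() {
    sed -n 's/.*Job \([^ ]*\) deleted to break ordering cycle.*/\1/p' | sort -u
}

# Example (current boot only):
#   journalctl -b | cycle_victims
```

If the list changes from one boot to the next, that matches the "works sometimes, sometimes not" behaviour described above.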
* Re: [PVE-User] Caution: ceph-mon service does not start after today's updates
2020-11-26 15:03 ` Lindsay Mathieson
2020-11-26 16:14 ` Thomas Lamprecht
@ 2020-11-26 18:56 ` Thomas Lamprecht
1 sibling, 0 replies; 15+ messages in thread
From: Thomas Lamprecht @ 2020-11-26 18:56 UTC (permalink / raw)
To: Proxmox VE user list
Some news.
There are a few things at play, but it boils down to two:
* an update of various service orderings in ceph with 14.2.12 (released a while
ago); they introduced a `Before=remote-fs-pre.target` order enforcement pretty
much everywhere.
* rrdcached, a service used by pve-cluster.service (pmxcfs). It has no native
systemd service file, so systemd auto-generates one, with a `Before=remote-pre.target`
order enforcement, which in turn is ordered relative to the aforementioned
`Before=remote-fs-pre.target`.
Thus you get the cycle (-> means an after ordering; all befores were transformed
to afters by reversing them, which systemd does too):
.> pve-cluster -> rrdcached -> remote-pre -> remote-fs-pre -> ceph-mgr@ -.
|                                                                        |
`------------------------------------------------------------------------'
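As an illustration only, the loop in the diagram can be reproduced with the standard `tsort` utility, which refuses to sort input containing a cycle. The helper names `has_ordering_cycle` and `diagram_edges` are hypothetical, and the unit names are copied from the diagram as written.

```shell
# Succeeds (exit 0) if the ordering pairs read from stdin contain a loop.
# Each input line must be a pair "A B"; the direction of the relation does not
# matter for loop detection as long as it is used consistently.
has_ordering_cycle() {
    ! tsort >/dev/null 2>&1
}

# The chain from the diagram, including the closing edge back to pve-cluster.
diagram_edges() {
    cat <<'EOF'
pve-cluster rrdcached
rrdcached remote-pre
remote-pre remote-fs-pre
remote-fs-pre ceph-mgr@
ceph-mgr@ pve-cluster
EOF
}

# Example:
#   diagram_edges | has_ordering_cycle && echo "ordering cycle present"
```

Removing any single edge makes the input sortable again, which is why dropping one ordering on either the ceph side or the rrdcached side is enough.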
We're building a new ceph version with the Before=remote-fs-pre.target removed;
it is bogus for the ceph mgr, mds, mon, ... services as is.
As you probably guessed, one can also fix this by adapting rrdcached, and as
a workaround you can do the following:
1. copy over the generated ephemeral service file from /run to /etc, which
has higher priority.
# cp /run/systemd/generator.late/rrdcached.service /etc/systemd/system/
2. Drop the after ordering for remote-fs.target
# sed -i '/^After=remote-fs.target/d' /etc/systemd/system/rrdcached.service
3. reboot
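Steps 1 and 2 can be combined into one small helper that also confirms the ordering line is really gone before you reboot. This is a sketch under assumptions: `drop_remote_fs_after` is a hypothetical name, and it writes the filtered copy directly instead of using `cp` followed by `sed -i`.

```shell
# Copy a unit file while dropping its "After=remote-fs.target" line, then
# verify that the ordering no longer appears in the copy.
# Usage: drop_remote_fs_after SRC DST
drop_remote_fs_after() {
    src=$1
    dst=$2
    [ -r "$src" ] || { echo "cannot read $src" >&2; return 1; }
    sed '/^After=remote-fs.target/d' "$src" > "$dst" || return 1
    if grep -q '^After=remote-fs\.target' "$dst"; then
        echo "ordering still present in $dst" >&2
        return 1
    fi
}

# Example with the paths from the steps above (run as root, then reboot):
#   drop_remote_fs_after /run/systemd/generator.late/rrdcached.service \
#       /etc/systemd/system/rrdcached.service
```

Once the fixed ceph packages are installed, the copy in /etc/systemd/system can presumably be removed again so the generated unit is used as before.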
A ceph 14.2.15-pve2 package will soon be available. We'll also see if we can
improve the rrdcached situation in the future. It is not at fault itself,
naturally; the systemd auto-generator's heuristic is to blame. But maybe we
can see if upstream or Debian has interest in adding a hand-crafted systemd
unit file, avoiding auto-generation. Optionally, we could maintain it for PVE,
or do as in Proxmox Backup Server and use our own Rust-based RRD implementation.
regards,
Thomas
* [PVE-User] [update available] Re: Caution: ceph-mon service does not start after today's updates
2020-11-26 12:18 [PVE-User] Caution: ceph-mon service does not start after today's updates Uwe Sauter
2020-11-26 14:10 ` Lindsay Mathieson
@ 2020-11-26 19:35 ` Thomas Lamprecht
2020-11-26 19:54 ` Uwe Sauter
` (3 more replies)
1 sibling, 4 replies; 15+ messages in thread
From: Thomas Lamprecht @ 2020-11-26 19:35 UTC (permalink / raw)
To: Proxmox VE user list
Hi all,
Hi Uwe and Lindsay,
ceph 14.2.15-pve2 is now available. It includes fixes for the ordering
constraints of the various ceph service unit files[0]; besides that, there
were zero actual code changes.
Due to that, and Stoiko's tests (many thanks for all the help!) confirming
my own positive results, we uploaded it for general availability.
Thanks also to you two for reporting!
regards,
Thomas
[0]: https://git.proxmox.com/?p=ceph.git;a=blob;f=patches/0009-fix-service-ordering-avoid-Before-remote-fs-pre.targ.patch;h=8fe5a35c385f7d4007b4965d297fc0daa9091be3;hb=99b3812832d5d5a8ac48a2e3b084c0ecf1fd087c
* Re: [PVE-User] [update available] Re: Caution: ceph-mon service does not start after today's updates
2020-11-26 19:35 ` [PVE-User] [update available] " Thomas Lamprecht
@ 2020-11-26 19:54 ` Uwe Sauter
2020-11-27 0:53 ` Lindsay Mathieson
` (2 subsequent siblings)
3 siblings, 0 replies; 15+ messages in thread
From: Uwe Sauter @ 2020-11-26 19:54 UTC (permalink / raw)
To: Thomas Lamprecht, Proxmox VE user list
Thomas,
thank you and your team for that prompt response.
I'll try it out first thing tomorrow morning.
Regards,
Uwe
Am 26.11.20 um 20:35 schrieb Thomas Lamprecht:
> Hi all,
> Hi Uwe and Lindsay,
>
> ceph 14.2.15-pve2 is now available, it includes fixes for the ordering
> restraints of the various ceph service unit files[0], besides that there
> where zero actual code changes.
>
> Due to that and Stoiko's tests (much thanks for all the help!) confirming
> the positive result of mine, we uploaded it for general availability.
>
> Thanks also to you two for reporting!
>
> regards,
> Thomas
>
> [0]: https://git.proxmox.com/?p=ceph.git;a=blob;f=patches/0009-fix-service-ordering-avoid-Before-remote-fs-pre.targ.patch;h=8fe5a35c385f7d4007b4965d297fc0daa9091be3;hb=99b3812832d5d5a8ac48a2e3b084c0ecf1fd087c
>
>
* Re: [PVE-User] [update available] Re: Caution: ceph-mon service does not start after today's updates
2020-11-26 19:35 ` [PVE-User] [update available] " Thomas Lamprecht
2020-11-26 19:54 ` Uwe Sauter
@ 2020-11-27 0:53 ` Lindsay Mathieson
2020-11-27 8:19 ` Uwe Sauter
2020-11-27 13:18 ` Lindsay Mathieson
3 siblings, 0 replies; 15+ messages in thread
From: Lindsay Mathieson @ 2020-11-27 0:53 UTC (permalink / raw)
To: Proxmox VE user list
On 27/11/2020 5:35 am, Thomas Lamprecht wrote:
> Hi Uwe and Lindsay,
>
> ceph 14.2.15-pve2 is now available, it includes fixes for the ordering
> restraints of the various ceph service unit files[0], besides that there
> where zero actual code changes.
>
> Due to that and Stoiko's tests (much thanks for all the help!) confirming
> the positive result of mine, we uploaded it for general availability.
>
> Thanks also to you two for reporting!
Thanks for the really quick turnaround Thomas, and sorry for not getting
back to you before, but it was 3am my time and once I got the servers
back up, I really needed to crash :)
Will test tonight, cheers.
--
Lindsay
* Re: [PVE-User] [update available] Re: Caution: ceph-mon service does not start after today's updates
2020-11-26 19:35 ` [PVE-User] [update available] " Thomas Lamprecht
2020-11-26 19:54 ` Uwe Sauter
2020-11-27 0:53 ` Lindsay Mathieson
@ 2020-11-27 8:19 ` Uwe Sauter
2020-11-27 13:18 ` Lindsay Mathieson
3 siblings, 0 replies; 15+ messages in thread
From: Uwe Sauter @ 2020-11-27 8:19 UTC (permalink / raw)
To: Thomas Lamprecht, Proxmox VE user list
Good morning Thomas,
the updated packages do the trick. I updated my ten hosts and not one needed a second reboot.
Next will be the upgrade from Nautilus to Octopus… but the instructions in the wiki are well written, so I don't expect any issues
there.
Thanks again,
Uwe
Am 26.11.20 um 20:35 schrieb Thomas Lamprecht:
> Hi all,
> Hi Uwe and Lindsay,
>
> ceph 14.2.15-pve2 is now available, it includes fixes for the ordering
> restraints of the various ceph service unit files[0], besides that there
> where zero actual code changes.
>
> Due to that and Stoiko's tests (much thanks for all the help!) confirming
> the positive result of mine, we uploaded it for general availability.
>
> Thanks also to you two for reporting!
>
> regards,
> Thomas
>
> [0]: https://git.proxmox.com/?p=ceph.git;a=blob;f=patches/0009-fix-service-ordering-avoid-Before-remote-fs-pre.targ.patch;h=8fe5a35c385f7d4007b4965d297fc0daa9091be3;hb=99b3812832d5d5a8ac48a2e3b084c0ecf1fd087c
>
>
* Re: [PVE-User] [update available] Re: Caution: ceph-mon service does not start after today's updates
2020-11-26 19:35 ` [PVE-User] [update available] " Thomas Lamprecht
` (2 preceding siblings ...)
2020-11-27 8:19 ` Uwe Sauter
@ 2020-11-27 13:18 ` Lindsay Mathieson
2020-11-27 13:29 ` Jean-Luc Oms
3 siblings, 1 reply; 15+ messages in thread
From: Lindsay Mathieson @ 2020-11-27 13:18 UTC (permalink / raw)
To: Proxmox VE user list
On 27/11/2020 5:35 am, Thomas Lamprecht wrote:
> Hi Uwe and Lindsay,
>
> ceph 14.2.15-pve2 is now available, it includes fixes for the ordering
> restraints of the various ceph service unit files[0], besides that there
> where zero actual code changes.
Thanks! Just updated and did a rolling reboot on our cluster. All
servers/services came up perfectly, no issues.
Gonna test that Octopus upgrade next :)
Cheers,
--
Lindsay
* Re: [PVE-User] [update available] Re: Caution: ceph-mon service does not start after today's updates
2020-11-27 13:18 ` Lindsay Mathieson
@ 2020-11-27 13:29 ` Jean-Luc Oms
2020-11-27 13:31 ` Lindsay Mathieson
0 siblings, 1 reply; 15+ messages in thread
From: Jean-Luc Oms @ 2020-11-27 13:29 UTC (permalink / raw)
To: pve-user
Le 27/11/2020 à 14:18, Lindsay Mathieson a écrit :
> On 27/11/2020 5:35 am, Thomas Lamprecht wrote:
>> Hi Uwe and Lindsay,
>>
>> ceph 14.2.15-pve2 is now available, it includes fixes for the ordering
>> restraints of the various ceph service unit files[0], besides that there
>> where zero actual code changes.
>
>
> Thanks! Just updated and did a rolling reboot on our cluster. All
> servers/services came up perfectly, no issues.
Have you cleared the ceph health error?
>
>
> Gonna test that Octopus upgrade next :)
>
>
> CXheers,
>
--
Jean-Luc Oms
/STI-ReseauX <https://rx.lirmm.fr>- LIRMM - CNRS/UM/
+33 4 67 41 85 93 <tel:+33-467-41-85-93> / +33 6 32 01 04 17
<tel:+33-632-01-04-17>
* Re: [PVE-User] [update available] Re: Caution: ceph-mon service does not start after today's updates
2020-11-27 13:29 ` Jean-Luc Oms
@ 2020-11-27 13:31 ` Lindsay Mathieson
0 siblings, 0 replies; 15+ messages in thread
From: Lindsay Mathieson @ 2020-11-27 13:31 UTC (permalink / raw)
To: pve-user
On 27/11/2020 11:29 pm, Jean-Luc Oms wrote:
> You have cleared the ceph health error ?
That dashboard one? No.
This was in reference to some cluster & ceph service issues.
--
Lindsay