From mboxrd@z Thu Jan 1 00:00:00 1970
To: pve-user@lists.proxmox.com
References: <21dec802-c6e8-d395-1444-7b30df5620cd@dkfz-heidelberg.de>
 <255b8af8-8834-0f24-d9a6-819f2d2cf8c8@dkfz-heidelberg.de>
 <9811d98a-ebf2-8590-ddd0-3b707ede4a4e@dkfz-heidelberg.de>
From: Frank Thommen
Organization: DKFZ Heidelberg, Omics IT and Data Management Core Facility (ODCF)
Date: Tue, 5 Jan 2021 20:44:36 +0100
Subject: Re: [PVE-User] After update Ceph monitor shows wrong version in UI
 and is down and out of quorum
List-Id: Proxmox VE user list

On 05.01.21 20:29, Uwe Sauter wrote:
> Frank,
>
> On 05.01.21 at 20:24, Frank Thommen wrote:
>> Hi Uwe,
>>
>>> did you look into the log of MON and OSD?
>>
>> I can't see any specific MON and OSD logs. However the log available
>> in the UI (Ceph -> Log) has lots of messages regarding scrubbing but
>> no messages regarding issues with starting the monitor
>>
>
> On each host the logs should be in /var/log/ceph. These should be
> rotated (see /etc/logrotate.d/ceph-common for details).

ok.
I see lots of

-----------------------
2021-01-05 20:38:05.900 7f979e753700 1 mon.odcf-pve02@-1(probing) e4 handle_auth_request failed to assign global_id
2021-01-05 20:38:07.208 7f979e753700 1 mon.odcf-pve02@-1(probing) e4 handle_auth_request failed to assign global_id
2021-01-05 20:38:08.688 7f979e753700 1 mon.odcf-pve02@-1(probing) e4 handle_auth_request failed to assign global_id
2021-01-05 20:38:08.744 7f979e753700 1 mon.odcf-pve02@-1(probing) e4 handle_auth_request failed to assign global_id
2021-01-05 20:38:09.092 7f979e753700 1 mon.odcf-pve02@-1(probing) e4 handle_auth_request failed to assign global_id
2021-01-05 20:38:12.268 7f979e753700 1 mon.odcf-pve02@-1(probing) e4 handle_auth_request failed to assign global_id
2021-01-05 20:38:12.468 7f979e753700 1 mon.odcf-pve02@-1(probing) e4 handle_auth_request failed to assign global_id
2021-01-05 20:38:12.964 7f979e753700 1 mon.odcf-pve02@-1(probing) e4 handle_auth_request failed to assign global_id
2021-01-05 20:38:15.752 7f979e753700 1 mon.odcf-pve02@-1(probing) e4 handle_auth_request failed to assign global_id
2021-01-05 20:38:17.440 7f979e753700 1 mon.odcf-pve02@-1(probing) e4 handle_auth_request failed to assign global_id
2021-01-05 20:38:19.388 7f979e753700 1 mon.odcf-pve02@-1(probing) e4 handle_auth_request failed to assign global_id
2021-01-05 20:38:19.468 7f979e753700 1 mon.odcf-pve02@-1(probing) e4 handle_auth_request failed to assign global_id
2021-01-05 20:38:22.712 7f979e753700 1 mon.odcf-pve02@-1(probing) e4 handle_auth_request failed to assign global_id
2021-01-05 20:38:22.828 7f979e753700 1 mon.odcf-pve02@-1(probing) e4 handle_auth_request failed to assign global_id
-----------------------

in the mon log on the problematic host.

When (unsuccessfully) starting the monitor through the UI, the following
entries appear in ceph.audit.log:

-----------------------
2021-01-05 20:40:07.635369 mon.odcf-pve03 (mon.1) 288082 : audit [DBG] from='client.? 192.168.255.2:0/2418486168' entity='client.admin' cmd=[{"format":"json","prefix":"mgr metadata"}]: dispatch
2021-01-05 20:40:07.636592 mon.odcf-pve03 (mon.1) 288083 : audit [DBG] from='client.? 192.168.255.2:0/2418486168' entity='client.admin' cmd=[{"format":"json","prefix":"mgr dump"}]: dispatch
2021-01-05 20:40:08.296793 mon.odcf-pve03 (mon.1) 288084 : audit [DBG] from='client.? 192.168.255.2:0/778781756' entity='client.admin' cmd=[{"format":"json","prefix":"mon metadata"}]: dispatch
2021-01-05 20:40:08.297767 mon.odcf-pve03 (mon.1) 288085 : audit [DBG] from='client.? 192.168.255.2:0/778781756' entity='client.admin' cmd=[{"prefix":"quorum_status","format":"json"}]: dispatch
2021-01-05 20:40:08.436982 mon.odcf-pve01 (mon.0) 389632 : audit [DBG] from='client.? 192.168.255.2:0/784579843' entity='client.admin' cmd=[{"format":"json","prefix":"df"}]: dispatch
-----------------------

192.168.255.2 is the IP address of the problematic host in the Ceph mesh
network. odcf-pve01 and odcf-pve03 are the "good" nodes.

However, I am not sure what kind of information I should look for in the
logs.

Frank

>
> Regards,
>
>     Uwe
>
>
>
>>
>>> Can you provide the list of installed packages of the affected host
>>> and the rest of the cluster?
>>
>> let me compile the lists and post them somewhere.  They are quite long.
>>
>>>
>>> Is the output of "ceph status" the same for all hosts?
>>
>> yes
>>
>> Frank
>>
>>>
>>>
>>> Regards,
>>>
>>>      Uwe
>>>
>>> On 05.01.21 at 20:01, Frank Thommen wrote:
>>>>
>>>> On 04.01.21 12:44, Frank Thommen wrote:
>>>>>
>>>>> Dear all,
>>>>>
>>>>> one of our three PVE hypervisors in the cluster crashed (it was
>>>>> fenced successfully) and rebooted automatically.  I took the chance
>>>>> to do a complete dist-upgrade and rebooted again.
>>>>>
>>>>> The PVE Ceph dashboard now reports that
>>>>>
>>>>>    * the monitor on the host is down (out of quorum), and
>>>>>    * "A newer version was installed but old version still running,
>>>>> please restart"
>>>>>
>>>>> The Ceph UI reports monitor version 14.2.11 while in fact 14.2.16
>>>>> is installed. The hypervisor has been rebooted twice since the
>>>>> upgrade, so it should be basically impossible that the old version
>>>>> is still running.
>>>>>
>>>>> `systemctl restart ceph.target` and restarting the monitor through
>>>>> the PVE Ceph UI didn't help. The hypervisor is running PVE 6.3-3
>>>>> (the other two are running 6.3-2 with monitor 14.2.15).
>>>>>
>>>>> What to do in this situation?
>>>>>
>>>>> I am happy with either UI or commandline instructions, but I have
>>>>> no Ceph experience besides setting it up following the PVE
>>>>> instructions.
>>>>>
>>>>> Any help or hint is appreciated.
>>>>> Cheers, Frank
>>>>
>>>> In an attempt to fix the issue I destroyed the monitor through the
>>>> UI and recreated it.  Unfortunately it can still not be started.  A
>>>> popup tells me that the monitor has been started, but the overview
>>>> still shows "stopped" and there is no version number any more.
>>>>
>>>> Then I stopped and started Ceph on the node (`pveceph stop; pveceph
>>>> start`) which resulted in a degraded cluster (1 host down, 7 of 21
>>>> OSDs down). OSDs cannot be started through the UI either.
>>>>
>>>> I feel extremely uncomfortable with this situation and would
>>>> appreciate any hint as to how I should proceed with the problem.
>>>>
>>>> Cheers, Frank
>>>>
>>>> _______________________________________________
>>>> pve-user mailing list
>>>> pve-user@lists.proxmox.com
>>>> https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-user