public inbox for pve-devel@lists.proxmox.com
 help / color / mirror / Atom feed
* [pve-devel] [PATCH manager v3 0/3] Add Ceph safety checks
@ 2022-11-17 14:09 Aaron Lauterer
  2022-11-17 14:10 ` [pve-devel] [PATCH manager v3 1/3] api: ceph: add cmd-safety endpoint Aaron Lauterer
                   ` (3 more replies)
  0 siblings, 4 replies; 5+ messages in thread
From: Aaron Lauterer @ 2022-11-17 14:09 UTC (permalink / raw)
  To: pve-devel

The main motivation behind this series is to leverage several safety
checks that Ceph has to make sure it is ok to stop or destroy a service.

A new cmd-safety endpoint is added which is called from the GUI wherever
possible to show a warning.

This series needs commit 80deebd or newer from the librados2-perl
repo/package to work properly. Therefore we need to make sure to update
the min version in the dependencies.

changes:
- drop applied patches (librados2-perl)
- drop patches that adapt mon_command usage since with 80deebd
mon_command is a compat wrapper to provide the old behavior
- adapt api endpoint to the new mon_cmd

Aaron Lauterer (3):
  api: ceph: add cmd-safety endpoint
  ui: osd: warn if removal could be problematic
  ui: osd: mon: mds: warn if stop/destroy actions are problematic

 PVE/API2/Ceph.pm                 |  96 +++++++++++++++++++++
 www/manager6/ceph/OSD.js         | 140 ++++++++++++++++++++++++++-----
 www/manager6/ceph/ServiceList.js | 104 ++++++++++++++++++++---
 3 files changed, 308 insertions(+), 32 deletions(-)

-- 
2.30.2





^ permalink raw reply	[flat|nested] 5+ messages in thread

* [pve-devel] [PATCH manager v3 1/3] api: ceph: add cmd-safety endpoint
  2022-11-17 14:09 [pve-devel] [PATCH manager v3 0/3] Add Ceph safety checks Aaron Lauterer
@ 2022-11-17 14:10 ` Aaron Lauterer
  2022-11-17 14:10 ` [pve-devel] [PATCH manager v3 2/3] ui: osd: warn if removal could be problematic Aaron Lauterer
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 5+ messages in thread
From: Aaron Lauterer @ 2022-11-17 14:10 UTC (permalink / raw)
  To: pve-devel

Ceph provides us with several safety checks to verify that an action is
safe to perform. This endpoint provides means to acces them.
The actual mon commands are not exposed directly. Instead the two
actions "stop" and "destroy" are offered.

In case it is not okay to perform an action, Ceph provides a status
message explaining why. This message is part of the returned values.

For now there are the following checks for these services:

MON:
  - ok-to-stop
  - ok-to-rm
OSD:
  - ok-to-stop
  - safe-to-destroy
MDS:
  - ok-to-stop

Even though OSDs have a check if it is okay to destroy them, it is for
now not really usable in our workflow because it needs the OSD to be up
and running to return useful information. Our workflow in the GUI
currently is that the OSD needs to be stopped in order to destroy it.

Signed-off-by: Aaron Lauterer <a.lauterer@proxmox.com>
---
Needs to have the librados2-perl version that contains commit 80deebd as
min version dependency to work.

changes since:
v2:
* use mon_cmd instead of mon_command which now is used as a compat
wrapper

v1:
* remove repetitive endpoints for each service type in favor for a
  central one

 PVE/API2/Ceph.pm | 96 ++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 96 insertions(+)

diff --git a/PVE/API2/Ceph.pm b/PVE/API2/Ceph.pm
index 3bbcfe4c..f3442408 100644
--- a/PVE/API2/Ceph.pm
+++ b/PVE/API2/Ceph.pm
@@ -641,4 +641,100 @@ __PACKAGE__->register_method ({
 	return $res;
     }});
 
+__PACKAGE__->register_method ({
+    name => 'cmd_safety',
+    path => 'cmd-safety',
+    method => 'GET',
+    description => "Heuristical check if it is safe to perform an action.",
+    proxyto => 'node',
+    protected => 1,
+    permissions => {
+	check => ['perm', '/', [ 'Sys.audit' ]],
+    },
+    parameters => {
+	additionalProperties => 0,
+	properties => {
+	    node => get_standard_option('pve-node'),
+	    service => {
+		description => 'Service type',
+		type => 'string',
+		enum => ['osd', 'mon', 'mds'],
+	    },
+	    id => {
+		description => 'ID of the service',
+		type => 'string',
+	    },
+	    action => {
+		description => 'Action to check',
+		type => 'string',
+		enum => ['stop', 'destroy'],
+	    },
+	},
+    },
+    returns => {
+	type => 'object',
+	properties => {
+	   safe  => {
+		type => 'boolean',
+		description => 'If it is safe to run the command.',
+	    },
+	    status => {
+		type => 'string',
+		optional => 1,
+		description => 'Status message given by Ceph.'
+	    },
+	},
+    },
+    code => sub {
+	my ($param) = @_;
+
+	PVE::Ceph::Tools::check_ceph_inited();
+
+	my $id = $param->{id};
+	my $service = $param->{service};
+	my $action = $param->{action};
+
+	my $rados = PVE::RADOS->new();
+
+	my $supported_actions = {
+	    osd => {
+		stop => 'ok-to-stop',
+		destroy => 'safe-to-destroy',
+	    },
+	    mon => {
+		stop => 'ok-to-stop',
+		destroy => 'ok-to-rm',
+	    },
+	    mds => {
+		stop => 'ok-to-stop',
+	    },
+	};
+
+	die "Service does not support this action: ${service}: ${action}\n"
+	    if !$supported_actions->{$service}->{$action};
+
+	my $result = {
+	    safe => 0,
+	    status => '',
+	};
+
+	my $params = {
+	    prefix => "${service} $supported_actions->{$service}->{$action}",
+	    format => 'plain',
+	};
+	if ($service eq 'mon' && $action eq 'destroy') {
+	    $params->{id} = $id;
+	} else {
+	    $params->{ids} = [ $id ];
+	}
+
+	$result = $rados->mon_cmd($params, 1);
+	die $@ if $@;
+
+	$result->{safe} = $result->{return_code} == 0 ? 1 : 0;
+	$result->{status} = $result->{status_message};
+
+	return $result;
+    }});
+
 1;
-- 
2.30.2





^ permalink raw reply	[flat|nested] 5+ messages in thread

* [pve-devel] [PATCH manager v3 2/3] ui: osd: warn if removal could be problematic
  2022-11-17 14:09 [pve-devel] [PATCH manager v3 0/3] Add Ceph safety checks Aaron Lauterer
  2022-11-17 14:10 ` [pve-devel] [PATCH manager v3 1/3] api: ceph: add cmd-safety endpoint Aaron Lauterer
@ 2022-11-17 14:10 ` Aaron Lauterer
  2022-11-17 14:10 ` [pve-devel] [PATCH manager v3 3/3] ui: osd: mon: mds: warn if stop/destroy actions are problematic Aaron Lauterer
  2022-11-17 17:44 ` [pve-devel] applied: [PATCH manager v3 0/3] Add Ceph safety checks Thomas Lamprecht
  3 siblings, 0 replies; 5+ messages in thread
From: Aaron Lauterer @ 2022-11-17 14:10 UTC (permalink / raw)
  To: pve-devel

If an OSD is removed during the wrong conditions, it could lead to
blocked IO or worst case data loss.

Check against global flags that limit the capabilities of Ceph to heal
itself (norebalance, norecover, noout) and if there are degraded
objects.

Unfortunately, the 'safe-to-destroy' Ceph API endpoint will not help
here as it only works as long as the OSD is still running. By the time
the destroy button is enabled, the OSD will already be stopped.

Signed-off-by: Aaron Lauterer <a.lauterer@proxmox.com>
---
 www/manager6/ceph/OSD.js | 69 ++++++++++++++++++++++++++++++++++++++--
 1 file changed, 67 insertions(+), 2 deletions(-)

diff --git a/www/manager6/ceph/OSD.js b/www/manager6/ceph/OSD.js
index 78f226ff..b36787f5 100644
--- a/www/manager6/ceph/OSD.js
+++ b/www/manager6/ceph/OSD.js
@@ -178,6 +178,20 @@ Ext.define('PVE.CephRemoveOsd', {
 	    labelWidth: 130,
 	    fieldLabel: gettext('Cleanup Disks'),
 	},
+	{
+	    xtype: 'displayfield',
+	    name: 'osd-flag-hint',
+	    userCls: 'pmx-hint',
+	    value: gettext('Global flags limiting the self healing of Ceph are enabled.'),
+	    hidden: true,
+	},
+	{
+	    xtype: 'displayfield',
+	    name: 'degraded-objects-hint',
+	    userCls: 'pmx-hint',
+	    value: gettext('Objects are degraded. Consider waiting until the cluster is healthy.'),
+	    hidden: true,
+	},
     ],
     initComponent: function() {
         let me = this;
@@ -198,6 +212,13 @@ Ext.define('PVE.CephRemoveOsd', {
         });
 
         me.callParent();
+
+	if (me.warnings.flags) {
+	    me.down('field[name=osd-flag-hint]').setHidden(false);
+	}
+	if (me.warnings.degraded) {
+	    me.down('field[name=degraded-objects-hint]').setHidden(false);
+	}
     },
 });
 
@@ -439,14 +460,58 @@ Ext.define('PVE.node.CephOsdTree', {
 	    }).show();
 	},
 
-	destroy_osd: function() {
+	destroy_osd: async function() {
 	    let me = this;
 	    let vm = this.getViewModel();
+
+	    let warnings = {
+		flags: false,
+		degraded: false,
+	    };
+
+	    let flagsPromise = Proxmox.Async.api2({
+		url: `/cluster/ceph/flags`,
+		method: 'GET',
+	    });
+
+	    let statusPromise = Proxmox.Async.api2({
+		url: `/cluster/ceph/status`,
+		method: 'GET',
+	    });
+
+	    me.getView().mask(gettext('Loading...'));
+
+	    try {
+		let result = await Promise.all([flagsPromise, statusPromise]);
+
+		let flagsData = result[0].result.data;
+		let statusData = result[1].result.data;
+
+		let flags = Array.from(
+		    flagsData.filter(v => v.value),
+		    v => v.name,
+		).filter(v => ['norebalance', 'norecover', 'noout'].includes(v));
+
+		if (flags.length) {
+		    warnings.flags = true;
+		}
+		if (Object.keys(statusData.pgmap).includes('degraded_objects')) {
+		    warnings.degraded = true;
+		}
+	    } catch (error) {
+		Ext.Msg.alert(gettext('Error'), error.htmlStatus);
+		me.getView().unmask();
+		return;
+	    }
+
+	    me.getView().unmask();
 	    Ext.create('PVE.CephRemoveOsd', {
 		nodename: vm.get('osdhost'),
 		osdid: vm.get('osdid'),
+		warnings: warnings,
 		taskDone: () => { me.reload(); },
-	    }).show();
+		autoShow: true,
+	    });
 	},
 
 	set_flags: function() {
-- 
2.30.2





^ permalink raw reply	[flat|nested] 5+ messages in thread

* [pve-devel] [PATCH manager v3 3/3] ui: osd: mon: mds: warn if stop/destroy actions are problematic
  2022-11-17 14:09 [pve-devel] [PATCH manager v3 0/3] Add Ceph safety checks Aaron Lauterer
  2022-11-17 14:10 ` [pve-devel] [PATCH manager v3 1/3] api: ceph: add cmd-safety endpoint Aaron Lauterer
  2022-11-17 14:10 ` [pve-devel] [PATCH manager v3 2/3] ui: osd: warn if removal could be problematic Aaron Lauterer
@ 2022-11-17 14:10 ` Aaron Lauterer
  2022-11-17 17:44 ` [pve-devel] applied: [PATCH manager v3 0/3] Add Ceph safety checks Thomas Lamprecht
  3 siblings, 0 replies; 5+ messages in thread
From: Aaron Lauterer @ 2022-11-17 14:10 UTC (permalink / raw)
  To: pve-devel

Check if stopping of a service (OSD, MON, MDS) will be problematic for
Ceph. The warning still allows the user to proceed.

Ceph also has a check if the destruction of a MON is okay, so let's use
it.

Instead of the common OK button, label it with `Stop OSD` and so forth
to hopefully reduce the "click OK by habit" incidents.

This will not catch it every time as Ceph can need a few moments after a
change to establish its current status. For example, stopping one of 3
MONs and then right away destroying one of the last two running MONs
will most likely not trigger the warning. Doing so after a few seconds
should show the warning though.

Signed-off-by: Aaron Lauterer <a.lauterer@proxmox.com>
---
 www/manager6/ceph/OSD.js         |  71 ++++++++++++++++-----
 www/manager6/ceph/ServiceList.js | 104 +++++++++++++++++++++++++++----
 2 files changed, 145 insertions(+), 30 deletions(-)

diff --git a/www/manager6/ceph/OSD.js b/www/manager6/ceph/OSD.js
index b36787f5..6f7e2159 100644
--- a/www/manager6/ceph/OSD.js
+++ b/www/manager6/ceph/OSD.js
@@ -527,23 +527,60 @@ Ext.define('PVE.node.CephOsdTree', {
 	    let me = this;
 	    let vm = this.getViewModel();
 	    let cmd = comp.cmd || comp;
-	    Proxmox.Utils.API2Request({
-                url: "/nodes/" + vm.get('osdhost') + "/ceph/" + cmd,
-		params: { service: "osd." + vm.get('osdid') },
-		waitMsgTarget: me.getView(),
-		method: 'POST',
-		success: function(response, options) {
-		    let upid = response.result.data;
-		    let win = Ext.create('Proxmox.window.TaskProgress', {
-			upid: upid,
-			taskDone: () => { me.reload(); },
-		    });
-		    win.show();
-		},
-		failure: function(response, opts) {
-		    Ext.Msg.alert(gettext('Error'), response.htmlStatus);
-		},
-	    });
+
+	    let doRequest = function() {
+		Proxmox.Utils.API2Request({
+		    url: `/nodes/${vm.get('osdhost')}/ceph/${cmd}`,
+		    params: { service: "osd." + vm.get('osdid') },
+		    waitMsgTarget: me.getView(),
+		    method: 'POST',
+		    success: function(response, options) {
+			let upid = response.result.data;
+			let win = Ext.create('Proxmox.window.TaskProgress', {
+			    upid: upid,
+			    taskDone: () => { me.reload(); },
+			});
+			win.show();
+		    },
+		    failure: function(response, opts) {
+			Ext.Msg.alert(gettext('Error'), response.htmlStatus);
+		    },
+		});
+	    };
+
+	    if (cmd === "stop") {
+		Proxmox.Utils.API2Request({
+		    url: `/nodes/${vm.get('osdhost')}/ceph/cmd-safety`,
+		    params: {
+			service: 'osd',
+			id: vm.get('osdid'),
+			action: 'stop',
+		    },
+		    waitMsgTarget: me.getView(),
+		    method: 'GET',
+		    success: function({ result: { data } }) {
+			if (!data.safe) {
+			    Ext.Msg.show({
+				title: gettext('Warning'),
+				message: data.status,
+				icon: Ext.Msg.WARNING,
+				buttons: Ext.Msg.OKCANCEL,
+				buttonText: { ok: gettext('Stop OSD') },
+				fn: function(selection) {
+				    if (selection === 'ok') {
+					doRequest();
+				    }
+				},
+			    });
+			} else {
+			    doRequest();
+			}
+		    },
+		    failure: response => Ext.Msg.alert(gettext('Error'), response.htmlStatus),
+		});
+	    } else {
+		doRequest();
+	    }
 	},
 
 	set_selection_status: function(tp, selection) {
diff --git a/www/manager6/ceph/ServiceList.js b/www/manager6/ceph/ServiceList.js
index 9298974e..8b51fe6a 100644
--- a/www/manager6/ceph/ServiceList.js
+++ b/www/manager6/ceph/ServiceList.js
@@ -148,19 +148,57 @@ Ext.define('PVE.node.CephServiceController', {
 	    Ext.Msg.alert(gettext('Error'), "entry has no host");
 	    return;
 	}
-	Proxmox.Utils.API2Request({
-				  url: `/nodes/${rec.data.host}/ceph/${cmd}`,
-				  method: 'POST',
-				  params: { service: view.type + '.' + rec.data.name },
-				  success: function(response, options) {
-				      Ext.create('Proxmox.window.TaskProgress', {
-					  autoShow: true,
-					  upid: response.result.data,
-					  taskDone: () => view.rstore.load(),
-				      });
-				  },
-				  failure: (response, _opts) => Ext.Msg.alert(gettext('Error'), response.htmlStatus),
-	});
+	let doRequest = function() {
+	    Proxmox.Utils.API2Request({
+		url: `/nodes/${rec.data.host}/ceph/${cmd}`,
+		method: 'POST',
+		params: { service: view.type + '.' + rec.data.name },
+		success: function(response, options) {
+		    Ext.create('Proxmox.window.TaskProgress', {
+			autoShow: true,
+			upid: response.result.data,
+			taskDone: () => view.rstore.load(),
+		    });
+		},
+		failure: (response, _opts) => Ext.Msg.alert(gettext('Error'), response.htmlStatus),
+	    });
+	};
+	if (cmd === "stop" && ['mon', 'mds'].includes(view.type)) {
+	    Proxmox.Utils.API2Request({
+		url: `/nodes/${rec.data.host}/ceph/cmd-safety`,
+		params: {
+		    service: view.type,
+		    id: rec.data.name,
+		    action: 'stop',
+		},
+		method: 'GET',
+		success: function({ result: { data } }) {
+		    let stopText = {
+			mon: gettext('Stop MON'),
+			mds: gettext('Stop MDS'),
+		    };
+		    if (!data.safe) {
+			Ext.Msg.show({
+			    title: gettext('Warning'),
+			    message: data.status,
+			    icon: Ext.Msg.WARNING,
+			    buttons: Ext.Msg.OKCANCEL,
+			    buttonText: { ok: stopText[view.type] },
+			    fn: function(selection) {
+				if (selection === 'ok') {
+				    doRequest();
+				}
+			    },
+			});
+		    } else {
+			doRequest();
+		    }
+		},
+		failure: (response, _opts) => Ext.Msg.alert(gettext('Error'), response.htmlStatus),
+	    });
+	} else {
+	    doRequest();
+	}
     },
     onChangeService: function(button) {
 	let me = this;
@@ -273,6 +311,46 @@ Ext.define('PVE.node.CephServiceList', {
 		    taskDone: () => view.rstore.load(),
 		});
 	    },
+	    handler: function(btn, event, rec) {
+		let me = this;
+		let view = me.up('grid');
+		let doRequest = function() {
+		    Proxmox.button.StdRemoveButton.prototype.handler.call(me, btn, event, rec);
+		};
+		if (view.type === 'mon') {
+		    Proxmox.Utils.API2Request({
+			url: `/nodes/${rec.data.host}/ceph/cmd-safety`,
+			params: {
+			    service: view.type,
+			    id: rec.data.name,
+			    action: 'destroy',
+			},
+			method: 'GET',
+			success: function({ result: { data } }) {
+			    if (!data.safe) {
+				Ext.Msg.show({
+				    title: gettext('Warning'),
+				    message: data.status,
+				    icon: Ext.Msg.WARNING,
+				    buttons: Ext.Msg.OKCANCEL,
+				    buttonText: { ok: gettext('Destroy MON') },
+				    fn: function(selection) {
+					if (selection === 'ok') {
+					    doRequest();
+					}
+				    },
+				});
+			    } else {
+				doRequest();
+			    }
+			},
+			failure: (response, _opts) => Ext.Msg.alert(gettext('Error'), response.htmlStatus),
+		    });
+		} else {
+		    doRequest();
+		}
+	    },
+
 	},
 	'-',
 	{
-- 
2.30.2





^ permalink raw reply	[flat|nested] 5+ messages in thread

* [pve-devel] applied: [PATCH manager v3 0/3] Add Ceph safety checks
  2022-11-17 14:09 [pve-devel] [PATCH manager v3 0/3] Add Ceph safety checks Aaron Lauterer
                   ` (2 preceding siblings ...)
  2022-11-17 14:10 ` [pve-devel] [PATCH manager v3 3/3] ui: osd: mon: mds: warn if stop/destroy actions are problematic Aaron Lauterer
@ 2022-11-17 17:44 ` Thomas Lamprecht
  3 siblings, 0 replies; 5+ messages in thread
From: Thomas Lamprecht @ 2022-11-17 17:44 UTC (permalink / raw)
  To: Proxmox VE development discussion, Aaron Lauterer

Am 17/11/2022 um 15:09 schrieb Aaron Lauterer:
> The main motivation behind this series is to leverage several safety
> checks that Ceph has to make sure it is ok to stop or destroy a service.
> 
> A new cmd-safety endpoint is added which is called from the GUI wherever
> possible to show a warning.
> 
> This series needs commit 80deebd or newer from the librados2-perl
> repo/package to work properly. Therefore we need to make sure to update
> the min version in the dependencies.
> 
> changes:
> - drop applied patches (librados2-perl)
> - drop patches that adapt mon_command usage since with 80deebd
> mon_command is a compat wrapper to provide the old behavior
> - adapt api endpoint to the new mon_cmd
> 
> Aaron Lauterer (3):
>   api: ceph: add cmd-safety endpoint
>   ui: osd: warn if removal could be problematic
>   ui: osd: mon: mds: warn if stop/destroy actions are problematic
> 
>  PVE/API2/Ceph.pm                 |  96 +++++++++++++++++++++
>  www/manager6/ceph/OSD.js         | 140 ++++++++++++++++++++++++++-----
>  www/manager6/ceph/ServiceList.js | 104 ++++++++++++++++++++---
>  3 files changed, 308 insertions(+), 32 deletions(-)
> 


applied, thanks!




^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2022-11-17 17:45 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-11-17 14:09 [pve-devel] [PATCH manager v3 0/3] Add Ceph safety checks Aaron Lauterer
2022-11-17 14:10 ` [pve-devel] [PATCH manager v3 1/3] api: ceph: add cmd-safety endpoint Aaron Lauterer
2022-11-17 14:10 ` [pve-devel] [PATCH manager v3 2/3] ui: osd: warn if removal could be problematic Aaron Lauterer
2022-11-17 14:10 ` [pve-devel] [PATCH manager v3 3/3] ui: osd: mon: mds: warn if stop/destroy actions are problematic Aaron Lauterer
2022-11-17 17:44 ` [pve-devel] applied: [PATCH manager v3 0/3] Add Ceph safety checks Thomas Lamprecht

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox
Service provided by Proxmox Server Solutions GmbH | Privacy | Legal