* [pve-devel] [PATCH manager v3 1/3] api: ceph: add cmd-safety endpoint
2022-11-17 14:09 [pve-devel] [PATCH manager v3 0/3] Add Ceph safety checks Aaron Lauterer
@ 2022-11-17 14:10 ` Aaron Lauterer
2022-11-17 14:10 ` [pve-devel] [PATCH manager v3 2/3] ui: osd: warn if removal could be problematic Aaron Lauterer
` (2 subsequent siblings)
3 siblings, 0 replies; 5+ messages in thread
From: Aaron Lauterer @ 2022-11-17 14:10 UTC (permalink / raw)
To: pve-devel
Ceph provides us with several safety checks to verify that an action is
safe to perform. This endpoint provides means to acces them.
The actual mon commands are not exposed directly. Instead the two
actions "stop" and "destroy" are offered.
In case it is not okay to perform an action, Ceph provides a status
message explaining why. This message is part of the returned values.
For now there are the following checks for these services:
MON:
- ok-to-stop
- ok-to-rm
OSD:
- ok-to-stop
- safe-to-destroy
MDS:
- ok-to-stop
Even though OSDs have a check if it is okay to destroy them, it is for
now not really usable in our workflow because it needs the OSD to be up
and running to return useful information. Our workflow in the GUI
currently is that the OSD needs to be stopped in order to destroy it.
Signed-off-by: Aaron Lauterer <a.lauterer@proxmox.com>
---
Needs to have the librados2-perl version that contains commit 80deebd as
min version dependency to work.
changes since:
v2:
* use mon_cmd instead of mon_command which now is used as a compat
wrapper
v1:
* remove repetitive endpoints for each service type in favor for a
central one
PVE/API2/Ceph.pm | 96 ++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 96 insertions(+)
diff --git a/PVE/API2/Ceph.pm b/PVE/API2/Ceph.pm
index 3bbcfe4c..f3442408 100644
--- a/PVE/API2/Ceph.pm
+++ b/PVE/API2/Ceph.pm
@@ -641,4 +641,100 @@ __PACKAGE__->register_method ({
return $res;
}});
+__PACKAGE__->register_method ({
+ name => 'cmd_safety',
+ path => 'cmd-safety',
+ method => 'GET',
+ description => "Heuristical check if it is safe to perform an action.",
+ proxyto => 'node',
+ protected => 1,
+ permissions => {
+ check => ['perm', '/', [ 'Sys.audit' ]],
+ },
+ parameters => {
+ additionalProperties => 0,
+ properties => {
+ node => get_standard_option('pve-node'),
+ service => {
+ description => 'Service type',
+ type => 'string',
+ enum => ['osd', 'mon', 'mds'],
+ },
+ id => {
+ description => 'ID of the service',
+ type => 'string',
+ },
+ action => {
+ description => 'Action to check',
+ type => 'string',
+ enum => ['stop', 'destroy'],
+ },
+ },
+ },
+ returns => {
+ type => 'object',
+ properties => {
+ safe => {
+ type => 'boolean',
+ description => 'If it is safe to run the command.',
+ },
+ status => {
+ type => 'string',
+ optional => 1,
+ description => 'Status message given by Ceph.'
+ },
+ },
+ },
+ code => sub {
+ my ($param) = @_;
+
+ PVE::Ceph::Tools::check_ceph_inited();
+
+ my $id = $param->{id};
+ my $service = $param->{service};
+ my $action = $param->{action};
+
+ my $rados = PVE::RADOS->new();
+
+ my $supported_actions = {
+ osd => {
+ stop => 'ok-to-stop',
+ destroy => 'safe-to-destroy',
+ },
+ mon => {
+ stop => 'ok-to-stop',
+ destroy => 'ok-to-rm',
+ },
+ mds => {
+ stop => 'ok-to-stop',
+ },
+ };
+
+ die "Service does not support this action: ${service}: ${action}\n"
+ if !$supported_actions->{$service}->{$action};
+
+ my $result = {
+ safe => 0,
+ status => '',
+ };
+
+ my $params = {
+ prefix => "${service} $supported_actions->{$service}->{$action}",
+ format => 'plain',
+ };
+ if ($service eq 'mon' && $action eq 'destroy') {
+ $params->{id} = $id;
+ } else {
+ $params->{ids} = [ $id ];
+ }
+
+ $result = $rados->mon_cmd($params, 1);
+ die $@ if $@;
+
+ $result->{safe} = $result->{return_code} == 0 ? 1 : 0;
+ $result->{status} = $result->{status_message};
+
+ return $result;
+ }});
+
1;
--
2.30.2
^ permalink raw reply [flat|nested] 5+ messages in thread
* [pve-devel] [PATCH manager v3 2/3] ui: osd: warn if removal could be problematic
2022-11-17 14:09 [pve-devel] [PATCH manager v3 0/3] Add Ceph safety checks Aaron Lauterer
2022-11-17 14:10 ` [pve-devel] [PATCH manager v3 1/3] api: ceph: add cmd-safety endpoint Aaron Lauterer
@ 2022-11-17 14:10 ` Aaron Lauterer
2022-11-17 14:10 ` [pve-devel] [PATCH manager v3 3/3] ui: osd: mon: mds: warn if stop/destroy actions are problematic Aaron Lauterer
2022-11-17 17:44 ` [pve-devel] applied: [PATCH manager v3 0/3] Add Ceph safety checks Thomas Lamprecht
3 siblings, 0 replies; 5+ messages in thread
From: Aaron Lauterer @ 2022-11-17 14:10 UTC (permalink / raw)
To: pve-devel
If an OSD is removed during the wrong conditions, it could lead to
blocked IO or worst case data loss.
Check against global flags that limit the capabilities of Ceph to heal
itself (norebalance, norecover, noout) and if there are degraded
objects.
Unfortunately, the 'safe-to-destroy' Ceph API endpoint will not help
here as it only works as long as the OSD is still running. By the time
the destroy button is enabled, the OSD will already be stopped.
Signed-off-by: Aaron Lauterer <a.lauterer@proxmox.com>
---
www/manager6/ceph/OSD.js | 69 ++++++++++++++++++++++++++++++++++++++--
1 file changed, 67 insertions(+), 2 deletions(-)
diff --git a/www/manager6/ceph/OSD.js b/www/manager6/ceph/OSD.js
index 78f226ff..b36787f5 100644
--- a/www/manager6/ceph/OSD.js
+++ b/www/manager6/ceph/OSD.js
@@ -178,6 +178,20 @@ Ext.define('PVE.CephRemoveOsd', {
labelWidth: 130,
fieldLabel: gettext('Cleanup Disks'),
},
+ {
+ xtype: 'displayfield',
+ name: 'osd-flag-hint',
+ userCls: 'pmx-hint',
+ value: gettext('Global flags limiting the self healing of Ceph are enabled.'),
+ hidden: true,
+ },
+ {
+ xtype: 'displayfield',
+ name: 'degraded-objects-hint',
+ userCls: 'pmx-hint',
+ value: gettext('Objects are degraded. Consider waiting until the cluster is healthy.'),
+ hidden: true,
+ },
],
initComponent: function() {
let me = this;
@@ -198,6 +212,13 @@ Ext.define('PVE.CephRemoveOsd', {
});
me.callParent();
+
+ if (me.warnings.flags) {
+ me.down('field[name=osd-flag-hint]').setHidden(false);
+ }
+ if (me.warnings.degraded) {
+ me.down('field[name=degraded-objects-hint]').setHidden(false);
+ }
},
});
@@ -439,14 +460,58 @@ Ext.define('PVE.node.CephOsdTree', {
}).show();
},
- destroy_osd: function() {
+ destroy_osd: async function() {
let me = this;
let vm = this.getViewModel();
+
+ let warnings = {
+ flags: false,
+ degraded: false,
+ };
+
+ let flagsPromise = Proxmox.Async.api2({
+ url: `/cluster/ceph/flags`,
+ method: 'GET',
+ });
+
+ let statusPromise = Proxmox.Async.api2({
+ url: `/cluster/ceph/status`,
+ method: 'GET',
+ });
+
+ me.getView().mask(gettext('Loading...'));
+
+ try {
+ let result = await Promise.all([flagsPromise, statusPromise]);
+
+ let flagsData = result[0].result.data;
+ let statusData = result[1].result.data;
+
+ let flags = Array.from(
+ flagsData.filter(v => v.value),
+ v => v.name,
+ ).filter(v => ['norebalance', 'norecover', 'noout'].includes(v));
+
+ if (flags.length) {
+ warnings.flags = true;
+ }
+ if (Object.keys(statusData.pgmap).includes('degraded_objects')) {
+ warnings.degraded = true;
+ }
+ } catch (error) {
+ Ext.Msg.alert(gettext('Error'), error.htmlStatus);
+ me.getView().unmask();
+ return;
+ }
+
+ me.getView().unmask();
Ext.create('PVE.CephRemoveOsd', {
nodename: vm.get('osdhost'),
osdid: vm.get('osdid'),
+ warnings: warnings,
taskDone: () => { me.reload(); },
- }).show();
+ autoShow: true,
+ });
},
set_flags: function() {
--
2.30.2
^ permalink raw reply [flat|nested] 5+ messages in thread
* [pve-devel] [PATCH manager v3 3/3] ui: osd: mon: mds: warn if stop/destroy actions are problematic
2022-11-17 14:09 [pve-devel] [PATCH manager v3 0/3] Add Ceph safety checks Aaron Lauterer
2022-11-17 14:10 ` [pve-devel] [PATCH manager v3 1/3] api: ceph: add cmd-safety endpoint Aaron Lauterer
2022-11-17 14:10 ` [pve-devel] [PATCH manager v3 2/3] ui: osd: warn if removal could be problematic Aaron Lauterer
@ 2022-11-17 14:10 ` Aaron Lauterer
2022-11-17 17:44 ` [pve-devel] applied: [PATCH manager v3 0/3] Add Ceph safety checks Thomas Lamprecht
3 siblings, 0 replies; 5+ messages in thread
From: Aaron Lauterer @ 2022-11-17 14:10 UTC (permalink / raw)
To: pve-devel
Check if stopping of a service (OSD, MON, MDS) will be problematic for
Ceph. The warning still allows the user to proceed.
Ceph also has a check if the destruction of a MON is okay, so let's use
it.
Instead of the common OK button, label it with `Stop OSD` and so forth
to hopefully reduce the "click OK by habit" incidents.
This will not catch it every time as Ceph can need a few moments after a
change to establish its current status. For example, stopping one of 3
MONs and then right away destroying one of the last two running MONs
will most likely not trigger the warning. Doing so after a few seconds
should show the warning though.
Signed-off-by: Aaron Lauterer <a.lauterer@proxmox.com>
---
www/manager6/ceph/OSD.js | 71 ++++++++++++++++-----
www/manager6/ceph/ServiceList.js | 104 +++++++++++++++++++++++++++----
2 files changed, 145 insertions(+), 30 deletions(-)
diff --git a/www/manager6/ceph/OSD.js b/www/manager6/ceph/OSD.js
index b36787f5..6f7e2159 100644
--- a/www/manager6/ceph/OSD.js
+++ b/www/manager6/ceph/OSD.js
@@ -527,23 +527,60 @@ Ext.define('PVE.node.CephOsdTree', {
let me = this;
let vm = this.getViewModel();
let cmd = comp.cmd || comp;
- Proxmox.Utils.API2Request({
- url: "/nodes/" + vm.get('osdhost') + "/ceph/" + cmd,
- params: { service: "osd." + vm.get('osdid') },
- waitMsgTarget: me.getView(),
- method: 'POST',
- success: function(response, options) {
- let upid = response.result.data;
- let win = Ext.create('Proxmox.window.TaskProgress', {
- upid: upid,
- taskDone: () => { me.reload(); },
- });
- win.show();
- },
- failure: function(response, opts) {
- Ext.Msg.alert(gettext('Error'), response.htmlStatus);
- },
- });
+
+ let doRequest = function() {
+ Proxmox.Utils.API2Request({
+ url: `/nodes/${vm.get('osdhost')}/ceph/${cmd}`,
+ params: { service: "osd." + vm.get('osdid') },
+ waitMsgTarget: me.getView(),
+ method: 'POST',
+ success: function(response, options) {
+ let upid = response.result.data;
+ let win = Ext.create('Proxmox.window.TaskProgress', {
+ upid: upid,
+ taskDone: () => { me.reload(); },
+ });
+ win.show();
+ },
+ failure: function(response, opts) {
+ Ext.Msg.alert(gettext('Error'), response.htmlStatus);
+ },
+ });
+ };
+
+ if (cmd === "stop") {
+ Proxmox.Utils.API2Request({
+ url: `/nodes/${vm.get('osdhost')}/ceph/cmd-safety`,
+ params: {
+ service: 'osd',
+ id: vm.get('osdid'),
+ action: 'stop',
+ },
+ waitMsgTarget: me.getView(),
+ method: 'GET',
+ success: function({ result: { data } }) {
+ if (!data.safe) {
+ Ext.Msg.show({
+ title: gettext('Warning'),
+ message: data.status,
+ icon: Ext.Msg.WARNING,
+ buttons: Ext.Msg.OKCANCEL,
+ buttonText: { ok: gettext('Stop OSD') },
+ fn: function(selection) {
+ if (selection === 'ok') {
+ doRequest();
+ }
+ },
+ });
+ } else {
+ doRequest();
+ }
+ },
+ failure: response => Ext.Msg.alert(gettext('Error'), response.htmlStatus),
+ });
+ } else {
+ doRequest();
+ }
},
set_selection_status: function(tp, selection) {
diff --git a/www/manager6/ceph/ServiceList.js b/www/manager6/ceph/ServiceList.js
index 9298974e..8b51fe6a 100644
--- a/www/manager6/ceph/ServiceList.js
+++ b/www/manager6/ceph/ServiceList.js
@@ -148,19 +148,57 @@ Ext.define('PVE.node.CephServiceController', {
Ext.Msg.alert(gettext('Error'), "entry has no host");
return;
}
- Proxmox.Utils.API2Request({
- url: `/nodes/${rec.data.host}/ceph/${cmd}`,
- method: 'POST',
- params: { service: view.type + '.' + rec.data.name },
- success: function(response, options) {
- Ext.create('Proxmox.window.TaskProgress', {
- autoShow: true,
- upid: response.result.data,
- taskDone: () => view.rstore.load(),
- });
- },
- failure: (response, _opts) => Ext.Msg.alert(gettext('Error'), response.htmlStatus),
- });
+ let doRequest = function() {
+ Proxmox.Utils.API2Request({
+ url: `/nodes/${rec.data.host}/ceph/${cmd}`,
+ method: 'POST',
+ params: { service: view.type + '.' + rec.data.name },
+ success: function(response, options) {
+ Ext.create('Proxmox.window.TaskProgress', {
+ autoShow: true,
+ upid: response.result.data,
+ taskDone: () => view.rstore.load(),
+ });
+ },
+ failure: (response, _opts) => Ext.Msg.alert(gettext('Error'), response.htmlStatus),
+ });
+ };
+ if (cmd === "stop" && ['mon', 'mds'].includes(view.type)) {
+ Proxmox.Utils.API2Request({
+ url: `/nodes/${rec.data.host}/ceph/cmd-safety`,
+ params: {
+ service: view.type,
+ id: rec.data.name,
+ action: 'stop',
+ },
+ method: 'GET',
+ success: function({ result: { data } }) {
+ let stopText = {
+ mon: gettext('Stop MON'),
+ mds: gettext('Stop MDS'),
+ };
+ if (!data.safe) {
+ Ext.Msg.show({
+ title: gettext('Warning'),
+ message: data.status,
+ icon: Ext.Msg.WARNING,
+ buttons: Ext.Msg.OKCANCEL,
+ buttonText: { ok: stopText[view.type] },
+ fn: function(selection) {
+ if (selection === 'ok') {
+ doRequest();
+ }
+ },
+ });
+ } else {
+ doRequest();
+ }
+ },
+ failure: (response, _opts) => Ext.Msg.alert(gettext('Error'), response.htmlStatus),
+ });
+ } else {
+ doRequest();
+ }
},
onChangeService: function(button) {
let me = this;
@@ -273,6 +311,46 @@ Ext.define('PVE.node.CephServiceList', {
taskDone: () => view.rstore.load(),
});
},
+ handler: function(btn, event, rec) {
+ let me = this;
+ let view = me.up('grid');
+ let doRequest = function() {
+ Proxmox.button.StdRemoveButton.prototype.handler.call(me, btn, event, rec);
+ };
+ if (view.type === 'mon') {
+ Proxmox.Utils.API2Request({
+ url: `/nodes/${rec.data.host}/ceph/cmd-safety`,
+ params: {
+ service: view.type,
+ id: rec.data.name,
+ action: 'destroy',
+ },
+ method: 'GET',
+ success: function({ result: { data } }) {
+ if (!data.safe) {
+ Ext.Msg.show({
+ title: gettext('Warning'),
+ message: data.status,
+ icon: Ext.Msg.WARNING,
+ buttons: Ext.Msg.OKCANCEL,
+ buttonText: { ok: gettext('Destroy MON') },
+ fn: function(selection) {
+ if (selection === 'ok') {
+ doRequest();
+ }
+ },
+ });
+ } else {
+ doRequest();
+ }
+ },
+ failure: (response, _opts) => Ext.Msg.alert(gettext('Error'), response.htmlStatus),
+ });
+ } else {
+ doRequest();
+ }
+ },
+
},
'-',
{
--
2.30.2
^ permalink raw reply [flat|nested] 5+ messages in thread