[pve-devel] [PATCH librados2-perl manager 0/6] Add Ceph safety checks

public inbox for pve-devel@lists.proxmox.com
 help / color / mirror / Atom feed

* [pve-devel] [PATCH librados2-perl manager 0/6] Add Ceph safety checks
@ 2022-02-18 11:38 Aaron Lauterer
  2022-02-18 11:38 ` [pve-devel] [PATCH librados2-perl 1/6] mon_command: free outs buffer Aaron Lauterer
                   ` (5 more replies)
  0 siblings, 6 replies; 14+ messages in thread
From: Aaron Lauterer @ 2022-02-18 11:38 UTC (permalink / raw)
  To: pve-devel

The main motivation behind this series is to leverage several safety
checks that Ceph has to make sure it is ok to stop or destroy a service.

For this to work, the librados2-perl needs to be adapted as the Ceph API
will return a non-zero return value and will also present a human readable
explanation in the status message.
Therefore, when querying such data, we do not want to die when we
receive a non-zero return value but return everything we got, return
value, status message and data to then present that to the user.

librados2-perl: Aaron Lauterer (2):
  mon_command: free outs buffer
  mon_command: optionally ignore errors and return hashmap

 PVE/RADOS.pm | 37 ++++++++++++++++++++++---------------
 RADOS.xs     | 30 ++++++++++++++++++++++++------
 2 files changed, 46 insertions(+), 21 deletions(-)

manager: Aaron Lauterer (4):
  api: osd: force mon_command to scalar context
  api: mon: mds: osd: add safety check endpoints
  ui: osd: warn if removal could be problematic
  ui: osd: mon: mds: warn if stop/destroy actions are problematic

 PVE/API2/Ceph/MDS.pm             |  50 ++++++++++++
 PVE/API2/Ceph/MON.pm             | 100 +++++++++++++++++++++++
 PVE/API2/Ceph/OSD.pm             | 102 +++++++++++++++++++++++-
 www/manager6/ceph/OSD.js         | 132 ++++++++++++++++++++++++++-----
 www/manager6/ceph/ServiceList.js |  94 +++++++++++++++++++---
 5 files changed, 446 insertions(+), 32 deletions(-)
-- 
2.30.2





^ permalink raw reply	[flat|nested] 14+ messages in thread

* [pve-devel] [PATCH librados2-perl 1/6] mon_command: free outs buffer
  2022-02-18 11:38 [pve-devel] [PATCH librados2-perl manager 0/6] Add Ceph safety checks Aaron Lauterer
@ 2022-02-18 11:38 ` Aaron Lauterer
  2022-02-21 15:35   ` Thomas Lamprecht
  2022-02-18 11:38 ` [pve-devel] [PATCH librados2-perl 2/6] mon_command: optionally ignore errors and return hashmap Aaron Lauterer
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 14+ messages in thread
From: Aaron Lauterer @ 2022-02-18 11:38 UTC (permalink / raw)
  To: pve-devel

Signed-off-by: Aaron Lauterer <a.lauterer@proxmox.com>
---

thanks @Dominik who realized that we did not free this buffer in all
situations.

 RADOS.xs | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/RADOS.xs b/RADOS.xs
index 7eca024..1eb0b5a 100644
--- a/RADOS.xs
+++ b/RADOS.xs
@@ -145,6 +145,10 @@ CODE:
     RETVAL = newSVpv(outbuf, outbuflen);
 
     rados_buffer_free(outbuf);
+
+    if (outs != NULL) {
+	rados_buffer_free(outs);
+    }
 }
 OUTPUT: RETVAL
 
-- 
2.30.2





^ permalink raw reply	[flat|nested] 14+ messages in thread

* [pve-devel] [PATCH librados2-perl 2/6] mon_command: optionally ignore errors and return hashmap
  2022-02-18 11:38 [pve-devel] [PATCH librados2-perl manager 0/6] Add Ceph safety checks Aaron Lauterer
  2022-02-18 11:38 ` [pve-devel] [PATCH librados2-perl 1/6] mon_command: free outs buffer Aaron Lauterer
@ 2022-02-18 11:38 ` Aaron Lauterer
  2022-02-21 15:44   ` Thomas Lamprecht
  2022-02-18 11:38 ` [pve-devel] [PATCH manager 3/6] api: osd: force mon_command to scalar context Aaron Lauterer
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 14+ messages in thread
From: Aaron Lauterer @ 2022-02-18 11:38 UTC (permalink / raw)
  To: pve-devel

In some situations, we do not want to abort if the Ceph API returns an
error (ret != 0).  We also want to be able to access all the retured
values from the Ceph API (return code, status, data).

One such situation can be the 'osd ok-to-stop' call where Ceph will
return a non zero code if it is not okay to stop the OSD, but will also
have useful information in the status buffer that we can use to show the
user.

For this, let's always return a hashmap with the return code, the status
message and the returned data from RADOS.xs::mon_command. Then decide on
the Perl side (RADOS.pm::mon_command) if we return the scalar data as we
used to, or the contents of the hashmap in an array.

The new parameter 'noerr' is used to indicate that we want to proceed on
non-zero return values. Omitting it gives us the old behavior to die
with the status message.

The new parameter needs to be passed to the child process, which causes
some changes there and the resulting hashmaps gets JSON encoded to be
passed back up to the parent process.

Signed-off-by: Aaron Lauterer <a.lauterer@proxmox.com>
---
This patch requires patch 3 of the series to not break OSD removal!
Therefore releasing a new version of librados2-perl and pve-manager
needs to be coordinated.

Please have a closer look at the C code that I wrote in RADOS.xs as I
have never written much C and am not sure if I introduced some nasty bug
/ security issue.

 PVE/RADOS.pm | 37 ++++++++++++++++++++++---------------
 RADOS.xs     | 26 ++++++++++++++++++++------
 2 files changed, 42 insertions(+), 21 deletions(-)

diff --git a/PVE/RADOS.pm b/PVE/RADOS.pm
index 463abc7..03712cb 100644
--- a/PVE/RADOS.pm
+++ b/PVE/RADOS.pm
@@ -42,11 +42,11 @@ require XSLoader;
 XSLoader::load('PVE::RADOS', $VERSION);
 
 my $writedata = sub {
-    my ($fh, $cmd, $data) = @_;
+    my ($fh, $cmd, $data, $noerr) = @_;
 
     local $SIG{PIPE} = 'IGNORE';
 
-    my $bin = pack "a L/a*", $cmd, $data || '';
+    my $bin = pack "a a L/a*", $cmd, $noerr || 0, $data || '';
     my $res = syswrite $fh, $bin;
 
     die "write data failed - $!\n" if !defined($res);
@@ -59,14 +59,14 @@ my $readdata = sub {
 
     local $SIG{PIPE} = 'IGNORE';
 
-    while (length($head) < 5) {
-	last if !sysread $fh, $head, 5 - length($head), length($head);
+    while (length($head) < 6) {
+	last if !sysread $fh, $head, 6 - length($head), length($head);
     }
     return undef if $allow_eof && length($head) == 0;
 
-    die "partial read\n" if length($head) < 5;
+    die "partial read\n" if length($head) < 6;
 
-    my ($cmd, $len) = unpack "a L", $head;
+    my ($cmd, $noerr, $len) = unpack "a a L", $head;
 
     my $data = '';
     while (length($data) < $len) {
@@ -74,7 +74,7 @@ my $readdata = sub {
     }
     die "partial data read\n" if length($data) < $len;
 
-    return wantarray ? ($cmd, $data) : $data;
+    return wantarray ? ($cmd, $data, $noerr) : $data;
 };
 
 my $kill_worker = sub {
@@ -95,7 +95,7 @@ my $kill_worker = sub {
 };
 
 my $sendcmd = sub {
-    my ($self, $cmd, $data, $expect_tag) = @_;
+    my ($self, $cmd, $data, $expect_tag, $noerr) = @_;
 
     $expect_tag = '>' if !$expect_tag;
 
@@ -103,7 +103,7 @@ my $sendcmd = sub {
 
     my ($restag, $raw);
     my $code = sub {
-	&$writedata($self->{child}, $cmd, $data) if $expect_tag ne 'S';
+	&$writedata($self->{child}, $cmd, $data, $noerr) if $expect_tag ne 'S';
 	($restag, $raw) = &$readdata($self->{child});
     };
     eval { PVE::Tools::run_with_timeout($self->{timeout}, $code); };
@@ -194,14 +194,14 @@ sub new {
 	$self->{conn} = $conn;
 
 	for (;;) {
-	    my ($cmd, $data) = &$readdata($parent, 1);
+	    my ($cmd, $data, $noerr) = &$readdata($parent, 1);
 
 	    last if !$cmd || $cmd eq 'Q';
 
 	    my $res;
 	    eval {
 		if ($cmd eq 'M') { # rados monitor commands
-		    $res = pve_rados_mon_command($self->{conn}, [ $data ]);
+		    $res = encode_json(pve_rados_mon_command($self->{conn}, [ $data ], $noerr));
 		} elsif ($cmd eq 'C') { # class methods
 		    my $aref = decode_json($data);
 		    my $method = shift @$aref;
@@ -259,19 +259,26 @@ sub cluster_stat {
 # example1: { prefix => 'get_command_descriptions'})
 # example2: { prefix => 'mon dump', format => 'json' }
 sub mon_command {
-    my ($self, $cmd) = @_;
+    my ($self, $cmd, $noerr) = @_;
+
+    $noerr = 0 if !$noerr;
 
     $cmd->{format} = 'json' if !$cmd->{format};
 
     my $json = encode_json($cmd);
 
-    my $raw = eval { $sendcmd->($self, 'M', $json) };
+    my $ret = eval { $sendcmd->($self, 'M', $json, undef, $noerr) };
     die "error with '$cmd->{prefix}': $@" if $@;
 
+    my $raw = decode_json($ret);
+
+    my $data = '';
     if ($cmd->{format} && $cmd->{format} eq 'json') {
-	return length($raw) ? decode_json($raw) : undef;
+	$data = length($raw->{data}) ? decode_json($raw->{data}) : undef;
+    } else {
+	$data = $raw->{data};
     }
-    return $raw;
+    return wantarray ? ($raw->{code}, $raw->{status}, $data) : $data;
 }
 
 
diff --git a/RADOS.xs b/RADOS.xs
index 1eb0b5a..3d828e1 100644
--- a/RADOS.xs
+++ b/RADOS.xs
@@ -98,11 +98,12 @@ CODE:
     rados_shutdown(cluster);
 }
 
-SV *
-pve_rados_mon_command(cluster, cmds)
+HV *
+pve_rados_mon_command(cluster, cmds, noerr=false)
 rados_t cluster
 AV *cmds
-PROTOTYPE: $$
+bool noerr
+PROTOTYPE: $$;$
 CODE:
 {
     const char *cmd[64];
@@ -129,7 +130,7 @@ CODE:
                                 &outbuf, &outbuflen,
                                 &outs, &outslen);
 
-    if (ret < 0) {
+    if (ret < 0 && noerr == false) {
         char msg[4096];
         if (outslen > sizeof(msg)) {
             outslen = sizeof(msg);
@@ -142,9 +143,22 @@ CODE:
         die(msg);
     }
 
-    RETVAL = newSVpv(outbuf, outbuflen);
+    char status[(int)outslen + 1];
+    if (outslen > sizeof(status)) {
+	outslen = sizeof(status);
+    }
+    snprintf(status, sizeof(status), "%.*s\n", (int)outslen, outs);
+
+    HV * rh = (HV *)sv_2mortal((SV *)newHV());
+
+    (void)hv_store(rh, "code", 4, newSViv(ret), 0);
+    (void)hv_store(rh, "data", 4, newSVpv(outbuf, outbuflen), 0);
+    (void)hv_store(rh, "status", 6, newSVpv(status, sizeof(status) - 1), 0);
+    RETVAL = rh;
 
-    rados_buffer_free(outbuf);
+    if (outbuf != NULL) {
+	rados_buffer_free(outbuf);
+    }
 
     if (outs != NULL) {
 	rados_buffer_free(outs);
-- 
2.30.2





^ permalink raw reply	[flat|nested] 14+ messages in thread

* [pve-devel] [PATCH manager 3/6] api: osd: force mon_command to scalar context
  2022-02-18 11:38 [pve-devel] [PATCH librados2-perl manager 0/6] Add Ceph safety checks Aaron Lauterer
  2022-02-18 11:38 ` [pve-devel] [PATCH librados2-perl 1/6] mon_command: free outs buffer Aaron Lauterer
  2022-02-18 11:38 ` [pve-devel] [PATCH librados2-perl 2/6] mon_command: optionally ignore errors and return hashmap Aaron Lauterer
@ 2022-02-18 11:38 ` Aaron Lauterer
  2022-02-18 11:38 ` [pve-devel] [PATCH manager 4/6] api: mon: mds: osd: add safety check endpoints Aaron Lauterer
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 14+ messages in thread
From: Aaron Lauterer @ 2022-02-18 11:38 UTC (permalink / raw)
  To: pve-devel

With the changes in librados2-perl, we need to make sure to call it in
scalar context.

Signed-off-by: Aaron Lauterer <a.lauterer@proxmox.com>
---
This needs to be released in combination with the previous changes (2/6)
in librados2-perl to not break OSD removal!

 PVE/API2/Ceph/OSD.pm | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/PVE/API2/Ceph/OSD.pm b/PVE/API2/Ceph/OSD.pm
index 93433b3a..26d61e3b 100644
--- a/PVE/API2/Ceph/OSD.pm
+++ b/PVE/API2/Ceph/OSD.pm
@@ -574,7 +574,7 @@ __PACKAGE__->register_method ({
 	my $rados = PVE::RADOS->new();
 
 	my $osd_belongs_to_node = osd_belongs_to_node(
-	    $rados->mon_command({ prefix => 'osd tree' }),
+	    scalar($rados->mon_command({ prefix => 'osd tree' })),
 	    $param->{node},
 	    $osdid,
 	);
-- 
2.30.2





^ permalink raw reply	[flat|nested] 14+ messages in thread

* [pve-devel] [PATCH manager 4/6] api: mon: mds: osd: add safety check endpoints
  2022-02-18 11:38 [pve-devel] [PATCH librados2-perl manager 0/6] Add Ceph safety checks Aaron Lauterer
                   ` (2 preceding siblings ...)
  2022-02-18 11:38 ` [pve-devel] [PATCH manager 3/6] api: osd: force mon_command to scalar context Aaron Lauterer
@ 2022-02-18 11:38 ` Aaron Lauterer
  2022-02-22  8:44   ` Thomas Lamprecht
  2022-02-18 11:38 ` [pve-devel] [PATCH manager 5/6] ui: osd: warn if removal could be problematic Aaron Lauterer
  2022-02-18 11:38 ` [pve-devel] [PATCH manager 6/6] ui: osd: mon: mds: warn if stop/destroy actions are problematic Aaron Lauterer
  5 siblings, 1 reply; 14+ messages in thread
From: Aaron Lauterer @ 2022-02-18 11:38 UTC (permalink / raw)
  To: pve-devel

for mon, mds and osd: ok-to-stop
mon: ok-to-rm
osd: safe-to-destroy

Signed-off-by: Aaron Lauterer <a.lauterer@proxmox.com>
---

I added the OSD safe-to-destroy endpoint even though I have no immediate
use for it as of now. But plan to include it in the OSD panel once I
have an idea how to do so in a sensible manner.

The main problem with safe-to-destroy is, that it only works if the OSD
is still running and by the time we enable the destroy button, the OSD
needs to be stopped.

 PVE/API2/Ceph/MDS.pm |  50 ++++++++++++++++++++++
 PVE/API2/Ceph/MON.pm | 100 +++++++++++++++++++++++++++++++++++++++++++
 PVE/API2/Ceph/OSD.pm | 100 +++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 250 insertions(+)

diff --git a/PVE/API2/Ceph/MDS.pm b/PVE/API2/Ceph/MDS.pm
index 1cb0b74f..2ed3cf5a 100644
--- a/PVE/API2/Ceph/MDS.pm
+++ b/PVE/API2/Ceph/MDS.pm
@@ -230,4 +230,54 @@ __PACKAGE__->register_method ({
     }
 });
 
+__PACKAGE__->register_method ({
+    name => 'oktostop',
+    path => '{name}/ok-to-stop',
+    method => 'GET',
+    description => "Check if it is ok to stop the MON",
+    proxyto => 'node',
+    protected => 1,
+    permissions => {
+	check => ['perm', '/', [ 'Sys.Modify' ]],
+    },
+    parameters => {
+	additionalProperties => 0,
+	properties => {
+	    node => get_standard_option('pve-node'),
+	    name => {
+		type => 'string',
+		pattern => PVE::Ceph::Services::SERVICE_REGEX,
+		description => 'The name (ID) of the mds',
+	    },
+	},
+    },
+    returns => {
+	type => "object",
+    },
+    code => sub {
+	my ($param) = @_;
+
+	PVE::Ceph::Tools::check_ceph_inited();
+
+	my $name = $param->{name};
+	my $rados = PVE::RADOS->new();
+
+	my $result = {
+	    ok_to_stop => 0,
+	    status => '',
+	};
+
+	my $code;
+	my $status;
+	eval {
+	    ($code, $status) = $rados->mon_command({ prefix => 'mds ok-to-stop', format => 'plain', ids => [ $name ] }, 1);
+	};
+	die $@ if $@;
+
+	$result->{ok_to_stop} = $code == 0 ? 1 : 0;
+	$result->{status} = $status;
+
+	return $result;
+    }});
+
 1;
diff --git a/PVE/API2/Ceph/MON.pm b/PVE/API2/Ceph/MON.pm
index 12c9caf0..49d15f12 100644
--- a/PVE/API2/Ceph/MON.pm
+++ b/PVE/API2/Ceph/MON.pm
@@ -591,4 +591,104 @@ __PACKAGE__->register_method ({
 	return $rpcenv->fork_worker('cephdestroymon', $monsection,  $authuser, $worker);
     }});
 
+__PACKAGE__->register_method ({
+    name => 'oktostop',
+    path => '{monid}/ok-to-stop',
+    method => 'GET',
+    description => "Check if it is ok to stop the MON",
+    proxyto => 'node',
+    protected => 1,
+    permissions => {
+	check => ['perm', '/', [ 'Sys.Modify' ]],
+    },
+    parameters => {
+	additionalProperties => 0,
+	properties => {
+	    node => get_standard_option('pve-node'),
+	    monid => {
+		type => 'string',
+		pattern => PVE::Ceph::Services::SERVICE_REGEX,
+		description => "The ID for the monitor",
+	    },
+	},
+    },
+    returns => {
+	type => "object",
+    },
+    code => sub {
+	my ($param) = @_;
+
+	PVE::Ceph::Tools::check_ceph_inited();
+
+	my $monid = $param->{monid};
+	my $rados = PVE::RADOS->new();
+
+	my $result = {
+	    ok_to_stop => 0,
+	    status => '',
+	};
+
+	my $code;
+	my $status;
+	eval {
+	    ($code, $status) = $rados->mon_command({ prefix => 'mon ok-to-stop', format => 'plain', ids => [ $monid ] }, 1);
+	};
+	die $@ if $@;
+
+	$result->{ok_to_stop} = $code == 0 ? 1 : 0;
+	$result->{status} = $status;
+
+	return $result;
+    }});
+
+__PACKAGE__->register_method ({
+    name => 'oktorm',
+    path => '{monid}/ok-to-rm',
+    method => 'GET',
+    description => "Check if it is ok to remove the MON",
+    proxyto => 'node',
+    protected => 1,
+    permissions => {
+	check => ['perm', '/', [ 'Sys.Modify' ]],
+    },
+    parameters => {
+	additionalProperties => 0,
+	properties => {
+	    node => get_standard_option('pve-node'),
+	    monid => {
+		type => 'string',
+		pattern => PVE::Ceph::Services::SERVICE_REGEX,
+		description => "The ID for the monitor",
+	    },
+	},
+    },
+    returns => {
+	type => "object",
+    },
+    code => sub {
+	my ($param) = @_;
+
+	PVE::Ceph::Tools::check_ceph_inited();
+
+	my $monid = $param->{monid};
+	my $rados = PVE::RADOS->new();
+
+	my $result = {
+	    ok_to_rm => 0,
+	    status => '',
+	};
+
+	my $code;
+	my $status;
+	eval {
+	    ($code, $status) = $rados->mon_command({ prefix => 'mon ok-to-rm', format => 'plain', id => $monid }, 1);
+	};
+	die $@ if $@;
+
+	$result->{ok_to_rm} = $code == 0 ? 1 : 0;
+	$result->{status} = $status;
+
+	return $result;
+    }});
+
 1;
diff --git a/PVE/API2/Ceph/OSD.pm b/PVE/API2/Ceph/OSD.pm
index 26d61e3b..7a7599e2 100644
--- a/PVE/API2/Ceph/OSD.pm
+++ b/PVE/API2/Ceph/OSD.pm
@@ -825,4 +825,104 @@ __PACKAGE__->register_method ({
 	return undef;
     }});
 
+__PACKAGE__->register_method ({
+    name => 'oktostop',
+    path => '{osdid}/ok-to-stop',
+    method => 'GET',
+    description => "Check if it is ok to stop the OSD",
+    proxyto => 'node',
+    protected => 1,
+    permissions => {
+	check => ['perm', '/', [ 'Sys.Modify' ]],
+    },
+    parameters => {
+	additionalProperties => 0,
+	properties => {
+	    node => get_standard_option('pve-node'),
+	    osdid => {
+		description => 'OSD ID',
+		type => 'integer',
+	    },
+	},
+    },
+    returns => {
+	type => "object",
+    },
+    code => sub {
+	my ($param) = @_;
+
+	PVE::Ceph::Tools::check_ceph_inited();
+
+	my $osdid = $param->{osdid};
+	my $rados = PVE::RADOS->new();
+	$get_osd_status->($rados, $osdid); # osd exists?
+
+	my $result = {
+	    ok_to_stop => 0,
+	    status => '',
+	};
+
+	my $code;
+	my $status;
+	eval {
+	    ($code, $status) = $rados->mon_command({ prefix => 'osd ok-to-stop', format => 'plain', ids => [ $osdid ] }, 1);
+	};
+	die $@ if $@;
+
+	$result->{ok_to_stop} = $code == 0 ? 1 : 0;
+	$result->{status} = $status;
+
+	return $result;
+    }});
+
+__PACKAGE__->register_method ({
+    name => 'safetodestroy',
+    path => '{osdid}/safe-to-destroy',
+    method => 'GET',
+    description => "Check if it is safe to destroy the OSD",
+    proxyto => 'node',
+    protected => 1,
+    permissions => {
+	check => ['perm', '/', [ 'Sys.Modify' ]],
+    },
+    parameters => {
+	additionalProperties => 0,
+	properties => {
+	    node => get_standard_option('pve-node'),
+	    osdid => {
+		description => 'OSD ID',
+		type => 'integer',
+	    },
+	},
+    },
+    returns => {
+	type => "object",
+    },
+    code => sub {
+	my ($param) = @_;
+
+	PVE::Ceph::Tools::check_ceph_inited();
+
+	my $osdid = $param->{osdid};
+	my $rados = PVE::RADOS->new();
+	$get_osd_status->($rados, $osdid); # osd exists?
+
+	my $result = {
+	    safe_to_destroy => 0,
+	    status => '',
+	};
+
+	my $code;
+	my $status;
+	eval {
+	    ($code, $status) = $rados->mon_command({ prefix => 'osd safe-to-destroy', format => 'plain', ids => [ $osdid ] }, 1);
+	};
+	die $@ if $@;
+
+	$result->{safe_to_destroy} = $code == 0 ? 1 : 0;
+	$result->{status} = $status;
+
+	return $result;
+    }});
+
 1;
-- 
2.30.2





^ permalink raw reply	[flat|nested] 14+ messages in thread

* [pve-devel] [PATCH manager 5/6] ui: osd: warn if removal could be problematic
  2022-02-18 11:38 [pve-devel] [PATCH librados2-perl manager 0/6] Add Ceph safety checks Aaron Lauterer
                   ` (3 preceding siblings ...)
  2022-02-18 11:38 ` [pve-devel] [PATCH manager 4/6] api: mon: mds: osd: add safety check endpoints Aaron Lauterer
@ 2022-02-18 11:38 ` Aaron Lauterer
  2022-02-24 12:46   ` Thomas Lamprecht
  2022-02-18 11:38 ` [pve-devel] [PATCH manager 6/6] ui: osd: mon: mds: warn if stop/destroy actions are problematic Aaron Lauterer
  5 siblings, 1 reply; 14+ messages in thread
From: Aaron Lauterer @ 2022-02-18 11:38 UTC (permalink / raw)
  To: pve-devel

If an OSD is removed during the wrong conditions, it could lead to
blocked IO or worst case data loss.

Check against global flags that limit the capabilities of Ceph to heal
itself (norebalance, norecover, noout) and if there are degraded
objects.

Unfortunately, the 'safe-to-destroy' API endpoint will not help here as
it only works as long as the OSD is still running. By the time the
destroy button is enabled, the OSD needs to be stopped.

Signed-off-by: Aaron Lauterer <a.lauterer@proxmox.com>
---
After the short discussion on the previous version [0] and the hints to
ok-to-stop & safe-to-destroy I kept the original approach with the
reason given in the commit message.

But now the checks are run before opening the window and showing a
loading screen.

[0] https://lists.proxmox.com/pipermail/pve-devel/2022-February/051597.html

 www/manager6/ceph/OSD.js | 66 +++++++++++++++++++++++++++++++++++++++-
 1 file changed, 65 insertions(+), 1 deletion(-)

diff --git a/www/manager6/ceph/OSD.js b/www/manager6/ceph/OSD.js
index e126f8d0..27f56760 100644
--- a/www/manager6/ceph/OSD.js
+++ b/www/manager6/ceph/OSD.js
@@ -178,6 +178,20 @@ Ext.define('PVE.CephRemoveOsd', {
 	    labelWidth: 130,
 	    fieldLabel: gettext('Cleanup Disks'),
 	},
+	{
+	    xtype: 'displayfield',
+	    name: 'osd-flag-hint',
+	    userCls: 'pmx-hint',
+	    value: gettext('Global flags limiting the self healing of Ceph are enabled.'),
+	    hidden: true,
+	},
+	{
+	    xtype: 'displayfield',
+	    name: 'degraded-objects-hint',
+	    userCls: 'pmx-hint',
+	    value: gettext('Objects are degraded. Consider waiting until the cluster is healthy.'),
+	    hidden: true,
+	},
     ],
     initComponent: function() {
         let me = this;
@@ -198,6 +212,13 @@ Ext.define('PVE.CephRemoveOsd', {
         });
 
         me.callParent();
+
+	if (me.warnings.flags) {
+	    me.down('field[name=osd-flag-hint]').setHidden(false);
+	}
+	if (me.warnings.degraded) {
+	    me.down('field[name=degraded-objects-hint]').setHidden(false);
+	}
     },
 });
 
@@ -434,12 +455,55 @@ Ext.define('PVE.node.CephOsdTree', {
 	    }).show();
 	},
 
-	destroy_osd: function() {
+	destroy_osd: async function() {
 	    let me = this;
 	    let vm = this.getViewModel();
+
+	    let warnings = {
+		flags: false,
+		degraded: false,
+	    };
+
+	    let flagsPromise = Proxmox.Async.api2({
+		url: `/cluster/ceph/flags`,
+		method: 'GET',
+	    });
+
+	    let statusPromise = Proxmox.Async.api2({
+		url: `/cluster/ceph/status`,
+		method: 'GET',
+	    });
+
+	    me.getView().mask(gettext('Please wait...'));
+
+	    try {
+		let result = await Promise.all([flagsPromise, statusPromise]);
+
+		let flagsData = result[0].result.data;
+		let statusData = result[1].result.data;
+
+		let flags = Array.from(
+		    flagsData.filter(v => v.value),
+		    v => v.name,
+		).filter(v => ['norebalance', 'norecover', 'noout'].includes(v));
+
+		if (flags.length) {
+		    warnings.flags = true;
+		}
+		if (Object.keys(statusData.pgmap).includes('degraded_objects')) {
+		    warnings.degraded = true;
+		}
+	    } catch (error) {
+		Ext.Msg.alert(gettext('Error'), error.htmlStatus);
+		me.getView().unmask();
+		return;
+	    }
+
+	    me.getView().unmask();
 	    Ext.create('PVE.CephRemoveOsd', {
 		nodename: vm.get('osdhost'),
 		osdid: vm.get('osdid'),
+		warnings: warnings,
 		taskDone: () => { me.reload(); },
 	    }).show();
 	},
-- 
2.30.2





^ permalink raw reply	[flat|nested] 14+ messages in thread

* [pve-devel] [PATCH manager 6/6] ui: osd: mon: mds: warn if stop/destroy actions are problematic
  2022-02-18 11:38 [pve-devel] [PATCH librados2-perl manager 0/6] Add Ceph safety checks Aaron Lauterer
                   ` (4 preceding siblings ...)
  2022-02-18 11:38 ` [pve-devel] [PATCH manager 5/6] ui: osd: warn if removal could be problematic Aaron Lauterer
@ 2022-02-18 11:38 ` Aaron Lauterer
  5 siblings, 0 replies; 14+ messages in thread
From: Aaron Lauterer @ 2022-02-18 11:38 UTC (permalink / raw)
  To: pve-devel

Check if stopping of a service (OSD, MON, MDS) will be problematic for
Ceph. The warning still allows the user to proceed.

Ceph also has a check if the destruction of a MON is okay, so let's use
it.

Instead of the common OK button, label it with `Stop OSD` and so forth
to hopefully reduce the "click OK by habit" incidents.

This will not catch it every time as Ceph can need a few moments after a
change to establish its current status. For example, stopping one of 3
MONs and then right away destroying one of the last two running MONs
will most likely not trigger the warning. Doing so after a few seconds
should show the warning though.

Signed-off-by: Aaron Lauterer <a.lauterer@proxmox.com>
---
 www/manager6/ceph/OSD.js         | 66 ++++++++++++++++------
 www/manager6/ceph/ServiceList.js | 94 +++++++++++++++++++++++++++-----
 2 files changed, 130 insertions(+), 30 deletions(-)

diff --git a/www/manager6/ceph/OSD.js b/www/manager6/ceph/OSD.js
index 27f56760..99791a80 100644
--- a/www/manager6/ceph/OSD.js
+++ b/www/manager6/ceph/OSD.js
@@ -521,23 +521,55 @@ Ext.define('PVE.node.CephOsdTree', {
 	    let me = this;
 	    let vm = this.getViewModel();
 	    let cmd = comp.cmd || comp;
-	    Proxmox.Utils.API2Request({
-                url: "/nodes/" + vm.get('osdhost') + "/ceph/" + cmd,
-		params: { service: "osd." + vm.get('osdid') },
-		waitMsgTarget: me.getView(),
-		method: 'POST',
-		success: function(response, options) {
-		    let upid = response.result.data;
-		    let win = Ext.create('Proxmox.window.TaskProgress', {
-			upid: upid,
-			taskDone: () => { me.reload(); },
-		    });
-		    win.show();
-		},
-		failure: function(response, opts) {
-		    Ext.Msg.alert(gettext('Error'), response.htmlStatus);
-		},
-	    });
+
+	    let doRequest = function() {
+		Proxmox.Utils.API2Request({
+		    url: `/nodes/${vm.get('osdhost')}/ceph/${cmd}`,
+		    params: { service: "osd." + vm.get('osdid') },
+		    waitMsgTarget: me.getView(),
+		    method: 'POST',
+		    success: function(response, options) {
+			let upid = response.result.data;
+			let win = Ext.create('Proxmox.window.TaskProgress', {
+			    upid: upid,
+			    taskDone: () => { me.reload(); },
+			});
+			win.show();
+		    },
+		    failure: function(response, opts) {
+			Ext.Msg.alert(gettext('Error'), response.htmlStatus);
+		    },
+		});
+	    };
+
+	    if (cmd === "stop") {
+		Proxmox.Utils.API2Request({
+		    url: `/nodes/${vm.get('osdhost')}/ceph/osd/${vm.get('osdid')}/ok-to-stop`,
+		    waitMsgTarget: me.getView(),
+		    method: 'GET',
+		    success: function({ result: { data } }) {
+			if (!data.ok_to_stop) {
+			    Ext.Msg.show({
+				title: gettext('Warning'),
+				message: data.status,
+				icon: Ext.Msg.WARNING,
+				buttons: Ext.Msg.OKCANCEL,
+				buttonText: { ok: gettext('Stop OSD') },
+				fn: function(selection) {
+				    if (selection === 'ok') {
+					doRequest();
+				    }
+				},
+			    });
+			} else {
+			    doRequest();
+			}
+		    },
+		    failure: response => Ext.Msg.alert(gettext('Error'), response.htmlStatus),
+		});
+	    } else {
+		doRequest();
+	    }
 	},
 
 	set_selection_status: function(tp, selection) {
diff --git a/www/manager6/ceph/ServiceList.js b/www/manager6/ceph/ServiceList.js
index 9298974e..b1550d6d 100644
--- a/www/manager6/ceph/ServiceList.js
+++ b/www/manager6/ceph/ServiceList.js
@@ -148,19 +148,52 @@ Ext.define('PVE.node.CephServiceController', {
 	    Ext.Msg.alert(gettext('Error'), "entry has no host");
 	    return;
 	}
-	Proxmox.Utils.API2Request({
-				  url: `/nodes/${rec.data.host}/ceph/${cmd}`,
-				  method: 'POST',
-				  params: { service: view.type + '.' + rec.data.name },
-				  success: function(response, options) {
-				      Ext.create('Proxmox.window.TaskProgress', {
-					  autoShow: true,
-					  upid: response.result.data,
-					  taskDone: () => view.rstore.load(),
-				      });
-				  },
-				  failure: (response, _opts) => Ext.Msg.alert(gettext('Error'), response.htmlStatus),
-	});
+	let doRequest = function() {
+	    Proxmox.Utils.API2Request({
+		url: `/nodes/${rec.data.host}/ceph/${cmd}`,
+		method: 'POST',
+		params: { service: view.type + '.' + rec.data.name },
+		success: function(response, options) {
+		    Ext.create('Proxmox.window.TaskProgress', {
+			autoShow: true,
+			upid: response.result.data,
+			taskDone: () => view.rstore.load(),
+		    });
+		},
+		failure: (response, _opts) => Ext.Msg.alert(gettext('Error'), response.htmlStatus),
+	    });
+	};
+	if (cmd === "stop" && ['mon', 'mds'].includes(view.type)) {
+	    Proxmox.Utils.API2Request({
+		url: `/nodes/${rec.data.host}/ceph/${view.type}/${rec.data.name}/ok-to-stop`,
+		method: 'GET',
+		success: function({ result: { data } }) {
+		    let stopText = {
+			mon: gettext('Stop MON'),
+			mds: gettext('Stop MDS'),
+		    };
+		    if (!data.ok_to_stop) {
+			Ext.Msg.show({
+			    title: gettext('Warning'),
+			    message: data.status,
+			    icon: Ext.Msg.WARNING,
+			    buttons: Ext.Msg.OKCANCEL,
+			    buttonText: { ok: stopText[view.type] },
+			    fn: function(selection) {
+				if (selection === 'ok') {
+				    doRequest();
+				}
+			    },
+			});
+		    } else {
+			doRequest();
+		    }
+		},
+		failure: (response, _opts) => Ext.Msg.alert(gettext('Error'), response.htmlStatus),
+	    });
+	} else {
+	    doRequest();
+	}
     },
     onChangeService: function(button) {
 	let me = this;
@@ -273,6 +306,41 @@ Ext.define('PVE.node.CephServiceList', {
 		    taskDone: () => view.rstore.load(),
 		});
 	    },
+	    handler: function(btn, event, rec) {
+		let me = this;
+		let view = me.up('grid');
+		let doRequest = function() {
+		    Proxmox.button.StdRemoveButton.prototype.handler.call(me, btn, event, rec);
+		};
+		if (view.type === 'mon') {
+		    Proxmox.Utils.API2Request({
+			url: `/nodes/${rec.data.host}/ceph/${view.type}/${rec.data.name}/ok-to-rm`,
+			method: 'GET',
+			success: function({ result: { data } }) {
+			    if (!data.ok_to_rm) {
+				Ext.Msg.show({
+				    title: gettext('Warning'),
+				    message: data.status,
+				    icon: Ext.Msg.WARNING,
+				    buttons: Ext.Msg.OKCANCEL,
+				    buttonText: { ok: gettext('Destroy MON') },
+				    fn: function(selection) {
+					if (selection === 'ok') {
+					    doRequest();
+					}
+				    },
+				});
+			    } else {
+				doRequest();
+			    }
+			},
+			failure: (response, _opts) => Ext.Msg.alert(gettext('Error'), response.htmlStatus),
+		    });
+		} else {
+		    doRequest();
+		}
+	    },
+
 	},
 	'-',
 	{
-- 
2.30.2





^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [pve-devel] [PATCH librados2-perl 1/6] mon_command: free outs buffer
  2022-02-18 11:38 ` [pve-devel] [PATCH librados2-perl 1/6] mon_command: free outs buffer Aaron Lauterer
@ 2022-02-21 15:35   ` Thomas Lamprecht
  0 siblings, 0 replies; 14+ messages in thread
From: Thomas Lamprecht @ 2022-02-21 15:35 UTC (permalink / raw)
  To: Proxmox VE development discussion, Aaron Lauterer

On 18.02.22 12:38, Aaron Lauterer wrote:
> Signed-off-by: Aaron Lauterer <a.lauterer@proxmox.com>
> ---
> 
> thanks @Dominik who realized that we did not free this buffer in all
> situations.
> 

note that the status string is normally only allocated in the error case,
where we freed it already, so actual impact shouldn't be that big; still
definitively more correct this way ;)

>  RADOS.xs | 4 ++++
>  1 file changed, 4 insertions(+)
> 
> diff --git a/RADOS.xs b/RADOS.xs
> index 7eca024..1eb0b5a 100644
> --- a/RADOS.xs
> +++ b/RADOS.xs
> @@ -145,6 +145,10 @@ CODE:
>      RETVAL = newSVpv(outbuf, outbuflen);
>  
>      rados_buffer_free(outbuf);
> +
> +    if (outs != NULL) {

fyi: I made all calls to rados_buffer_free unconditional as the code there
checks already for null-ness (not that it'd matter much, free(NULL) is a no-op)
and it makes librados' built-in tracing more complete.

> +	rados_buffer_free(outs);

fyi: above had a tab in a file that uses space only indentation.

> +    }
>  }
>  OUTPUT: RETVAL
>  

applied, thanks!





^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [pve-devel] [PATCH librados2-perl 2/6] mon_command: optionally ignore errors and return hashmap
  2022-02-18 11:38 ` [pve-devel] [PATCH librados2-perl 2/6] mon_command: optionally ignore errors and return hashmap Aaron Lauterer
@ 2022-02-21 15:44   ` Thomas Lamprecht
  2022-02-22 12:42     ` Aaron Lauterer
  0 siblings, 1 reply; 14+ messages in thread
From: Thomas Lamprecht @ 2022-02-21 15:44 UTC (permalink / raw)
  To: Proxmox VE development discussion, Aaron Lauterer

On 18.02.22 12:38, Aaron Lauterer wrote:
> In some situations, we do not want to abort if the Ceph API returns an
> error (ret != 0).  We also want to be able to access all the retured
> values from the Ceph API (return code, status, data).
> 
> One such situation can be the 'osd ok-to-stop' call where Ceph will
> return a non zero code if it is not okay to stop the OSD, but will also
> have useful information in the status buffer that we can use to show the
> user.
> 
> For this, let's always return a hashmap with the return code, the status
> message and the returned data from RADOS.xs::mon_command. Then decide on
> the Perl side (RADOS.pm::mon_command) if we return the scalar data as we
> used to, or the contents of the hashmap in an array.
> 
> The new parameter 'noerr' is used to indicate that we want to proceed on
> non-zero return values. Omitting it gives us the old behavior to die
> with the status message.
> 
> The new parameter needs to be passed to the child process, which causes
> some changes there and the resulting hashmaps gets JSON encoded to be
> passed back up to the parent process.

should be avoidable, the child can always pass through the new info and
the actual perl code can do the filtering.

> 
> Signed-off-by: Aaron Lauterer <a.lauterer@proxmox.com>
> ---
> This patch requires patch 3 of the series to not break OSD removal!
> Therefore releasing a new version of librados2-perl and pve-manager
> needs to be coordinated.

I don't like that and think it can be avoided.

> 
> Please have a closer look at the C code that I wrote in RADOS.xs as I
> have never written much C and am not sure if I introduced some nasty bug
> / security issue.
> 
>  PVE/RADOS.pm | 37 ++++++++++++++++++++++---------------
>  RADOS.xs     | 26 ++++++++++++++++++++------
>  2 files changed, 42 insertions(+), 21 deletions(-)
> 

> @@ -259,19 +259,26 @@ sub cluster_stat {
>  # example1: { prefix => 'get_command_descriptions'})
>  # example2: { prefix => 'mon dump', format => 'json' }
>  sub mon_command {
> -    my ($self, $cmd) = @_;
> +    my ($self, $cmd, $noerr) = @_;
> +
> +    $noerr = 0 if !$noerr;
>  
>      $cmd->{format} = 'json' if !$cmd->{format};
>  
>      my $json = encode_json($cmd);
>  
> -    my $raw = eval { $sendcmd->($self, 'M', $json) };
> +    my $ret = eval { $sendcmd->($self, 'M', $json, undef, $noerr) };

I'd rather like to avoid chaining through that $noerr every where, rather pass all the
info via the die (errors can be references to structured data too, like PVE::Exception is),
or just avoid the die at the lower level completely and map the error inside the struct
too, it can then be thrown here, depending on $noerr parameters or what not.

>      die "error with '$cmd->{prefix}': $@" if $@;
>  
> +    my $raw = decode_json($ret);
> +
> +    my $data = '';
>      if ($cmd->{format} && $cmd->{format} eq 'json') {
> -	return length($raw) ? decode_json($raw) : undef;
> +	$data = length($raw->{data}) ? decode_json($raw->{data}) : undef;
> +    } else {
> +	$data = $raw->{data};
>      }
> -    return $raw;
> +    return wantarray ? ($raw->{code}, $raw->{status}, $data) : $data;


>  }
>  
>  
> diff --git a/RADOS.xs b/RADOS.xs
> index 1eb0b5a..3d828e1 100644
> --- a/RADOS.xs
> +++ b/RADOS.xs
> @@ -98,11 +98,12 @@ CODE:
>      rados_shutdown(cluster);
>  }
>  
> -SV *
> -pve_rados_mon_command(cluster, cmds)
> +HV *
> +pve_rados_mon_command(cluster, cmds, noerr=false)
>  rados_t cluster
>  AV *cmds
> -PROTOTYPE: $$
> +bool noerr
> +PROTOTYPE: $$;$
>  CODE:
>  {
>      const char *cmd[64];
> @@ -129,7 +130,7 @@ CODE:
>                                  &outbuf, &outbuflen,
>                                  &outs, &outslen);
>  
> -    if (ret < 0) {
> +    if (ret < 0 && noerr == false) {
>          char msg[4096];
>          if (outslen > sizeof(msg)) {
>              outslen = sizeof(msg);
> @@ -142,9 +143,22 @@ CODE:
>          die(msg);
>      }
>  
> -    RETVAL = newSVpv(outbuf, outbuflen);
> +    char status[(int)outslen + 1];

this is on the stack and could be to large to guarantee it always fits, but...

> +    if (outslen > sizeof(status)) {
> +	outslen = sizeof(status);
> +    }
> +    snprintf(status, sizeof(status), "%.*s\n", (int)outslen, outs);

...why not just chain outs through instead of re-allocating it and writing it out in a
relatively expensive way?

> +
> +    HV * rh = (HV *)sv_2mortal((SV *)newHV());
> +
> +    (void)hv_store(rh, "code", 4, newSViv(ret), 0);
> +    (void)hv_store(rh, "data", 4, newSVpv(outbuf, outbuflen), 0);
> +    (void)hv_store(rh, "status", 6, newSVpv(status, sizeof(status) - 1), 0);
> +    RETVAL = rh;
>  
> -    rados_buffer_free(outbuf);
> +    if (outbuf != NULL) {
> +	rados_buffer_free(outbuf);
> +    }
>  
>      if (outs != NULL) {
>  	rados_buffer_free(outs);





^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [pve-devel] [PATCH manager 4/6] api: mon: mds: osd: add safety check endpoints
  2022-02-18 11:38 ` [pve-devel] [PATCH manager 4/6] api: mon: mds: osd: add safety check endpoints Aaron Lauterer
@ 2022-02-22  8:44   ` Thomas Lamprecht
  2022-03-14 16:49     ` Aaron Lauterer
  0 siblings, 1 reply; 14+ messages in thread
From: Thomas Lamprecht @ 2022-02-22  8:44 UTC (permalink / raw)
  To: Proxmox VE development discussion, Aaron Lauterer

On 18.02.22 12:38, Aaron Lauterer wrote:
> for mon, mds and osd: ok-to-stop
> mon: ok-to-rm
> osd: safe-to-destroy
> 
> Signed-off-by: Aaron Lauterer <a.lauterer@proxmox.com>
> ---
> 
> I added the OSD safe-to-destroy endpoint even though I have no immediate
> use for it as of now. But plan to include it in the OSD panel once I
> have an idea how to do so in a sensible manner.
> 
> The main problem with safe-to-destroy is, that it only works if the OSD
> is still running and by the time we enable the destroy button, the OSD
> needs to be stopped.


above isn't really meta info for review but rather important for the commit
message itself.


In general I see lots of repetition, and in this case I'd rather have a single
enpoint that accepts one (or maybe better a list of) service-type(s), and an
action (stop/destroy) let's encode in the name (or at least description) that
it's a heuristical check, besides things that we possible miss to observe we
could never make it 100% safe as we cannot lock the whole ceph cluster between
checking and doing an operation, so this will always be a TOCTOU race that 
expects the admins to have some change management so that they do not interfere
with each others maintenance work.

So either `/nodes/<nodename>/ceph/cmd-safety-heuristic` or drop the heuristic
from the path and just refer to that detail in the description (which shows up
in the api viewer, so should be good enough) `/nodes/<nodename>/ceph/cmd-safety`

params could be: node, type, id and command

> 
>  PVE/API2/Ceph/MDS.pm |  50 ++++++++++++++++++++++
>  PVE/API2/Ceph/MON.pm | 100 +++++++++++++++++++++++++++++++++++++++++++
>  PVE/API2/Ceph/OSD.pm | 100 +++++++++++++++++++++++++++++++++++++++++++
>  3 files changed, 250 insertions(+)
> 
> diff --git a/PVE/API2/Ceph/MDS.pm b/PVE/API2/Ceph/MDS.pm
> index 1cb0b74f..2ed3cf5a 100644
> --- a/PVE/API2/Ceph/MDS.pm
> +++ b/PVE/API2/Ceph/MDS.pm
> @@ -230,4 +230,54 @@ __PACKAGE__->register_method ({
>      }
>  });
>  
> +__PACKAGE__->register_method ({
> +    name => 'oktostop',

FYI: the name is usable to call this directly via code API2::Ceph::MDS->oktostop(),
so it would be good to use the perl method naming convention.

> +    path => '{name}/ok-to-stop',
> +    method => 'GET',
> +    description => "Check if it is ok to stop the MON",

"heuristic check if..."

> +    proxyto => 'node',
> +    protected => 1,
> +    permissions => {
> +	check => ['perm', '/', [ 'Sys.Modify' ]],
> +    },

hmm, why Sys.Modify, I mean this doesn't do anything so Sys.Audit could be argued
too, or did you have something other in mind here (e.g., consistency with some other,
similarly categorized api endpoint)? 

> +    parameters => {
> +	additionalProperties => 0,
> +	properties => {
> +	    node => get_standard_option('pve-node'),
> +	    name => {
> +		type => 'string',
> +		pattern => PVE::Ceph::Services::SERVICE_REGEX,
> +		description => 'The name (ID) of the mds',
> +	    },
> +	},
> +    },
> +    returns => {
> +	type => "object",

please actually describe the return format for such simple api calls

> +    },
> +    code => sub {
> +	my ($param) = @_;
> +
> +	PVE::Ceph::Tools::check_ceph_inited();
> +
> +	my $name = $param->{name};
> +	my $rados = PVE::RADOS->new();
> +
> +	my $result = {
> +	    ok_to_stop => 0,
> +	    status => '',
> +	};
> +
> +	my $code;
> +	my $status;
> +	eval {
> +	    ($code, $status) = $rados->mon_command({ prefix => 'mds ok-to-stop', format => 'plain', ids => [ $name ] }, 1);
> +	};


eval's return their last expression (which can but doesn has to use a trailing semicolon)
so you could do:

my ($code, $status) = eval {
    $rados->mon_command(...)
};

> +	die $@ if $@;

ok, why bother with eval'ing if you die 1:1 in any error case anyway??

> +
> +	$result->{ok_to_stop} = $code == 0 ? 1 : 0;
> +	$result->{status} = $status;

for such simple logic we can avoid the intermediate $result and return directly,
that shaves off about 6 lines from this api call and please also use the kebab-case
for new properties, both in input parameters or output result.

return {
    'ok-to-stop' => $code == 0 ? 1 : 0,
    status => $status,
};

> +
> +	return $result;
> +    }});
> +
>  1;





^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [pve-devel] [PATCH librados2-perl 2/6] mon_command: optionally ignore errors and return hashmap
  2022-02-21 15:44   ` Thomas Lamprecht
@ 2022-02-22 12:42     ` Aaron Lauterer
  0 siblings, 0 replies; 14+ messages in thread
From: Aaron Lauterer @ 2022-02-22 12:42 UTC (permalink / raw)
  To: Thomas Lamprecht, Proxmox VE development discussion



On 2/21/22 16:44, Thomas Lamprecht wrote:
> On 18.02.22 12:38, Aaron Lauterer wrote:
>> This patch requires patch 3 of the series to not break OSD removal!
>> Therefore releasing a new version of librados2-perl and pve-manager
>> needs to be coordinated.
> 
> I don't like that and think it can be avoided.

I'll look into it again.

>> [...]
>>   sub mon_command {
>> -    my ($self, $cmd) = @_;
>> +    my ($self, $cmd, $noerr) = @_;
>> +
>> +    $noerr = 0 if !$noerr;
>>   
>>       $cmd->{format} = 'json' if !$cmd->{format};
>>   
>>       my $json = encode_json($cmd);
>>   
>> -    my $raw = eval { $sendcmd->($self, 'M', $json) };
>> +    my $ret = eval { $sendcmd->($self, 'M', $json, undef, $noerr) };
> 
> I'd rather like to avoid chaining through that $noerr every where, rather pass all the
> info via the die (errors can be references to structured data too, like PVE::Exception is),
> or just avoid the die at the lower level completely and map the error inside the struct
> too, it can then be thrown here, depending on $noerr parameters or what not.

Okay, definitely sounds like the cleaner approach.

>> [...]
>> -    if (ret < 0) {
>> +    if (ret < 0 && noerr == false) {
>>           char msg[4096];
>>           if (outslen > sizeof(msg)) {
>>               outslen = sizeof(msg);
>> @@ -142,9 +143,22 @@ CODE:
>>           die(msg);
>>       }
>>   
>> -    RETVAL = newSVpv(outbuf, outbuflen);
>> +    char status[(int)outslen + 1];
> 
> this is on the stack and could be to large to guarantee it always fits, but...
> 
>> +    if (outslen > sizeof(status)) {
>> +	outslen = sizeof(status);
>> +    }
>> +    snprintf(status, sizeof(status), "%.*s\n", (int)outslen, outs);
> 
> ...why not just chain outs through instead of re-allocating it and writing it out in a
> relatively expensive way?
> 

Yeah, will do that.




^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [pve-devel] [PATCH manager 5/6] ui: osd: warn if removal could be problematic
  2022-02-18 11:38 ` [pve-devel] [PATCH manager 5/6] ui: osd: warn if removal could be problematic Aaron Lauterer
@ 2022-02-24 12:46   ` Thomas Lamprecht
  0 siblings, 0 replies; 14+ messages in thread
From: Thomas Lamprecht @ 2022-02-24 12:46 UTC (permalink / raw)
  To: Proxmox VE development discussion, Aaron Lauterer

On 18.02.22 12:38, Aaron Lauterer wrote:
> If an OSD is removed during the wrong conditions, it could lead to
> blocked IO or worst case data loss.
> 
> Check against global flags that limit the capabilities of Ceph to heal
> itself (norebalance, norecover, noout) and if there are degraded
> objects.
> 
> Unfortunately, the 'safe-to-destroy' API endpoint will not help here as
> it only works as long as the OSD is still running. By the time the
> destroy button is enabled, the OSD needs to be stopped.
> 
> Signed-off-by: Aaron Lauterer <a.lauterer@proxmox.com>
> ---
> After the short discussion on the previous version [0] and the hints to
> ok-to-stop & safe-to-destroy I kept the original approach with the
> reason given in the commit message.
> 
> But now the checks are run before opening the window and showing a
> loading screen.
> 
> [0] https://lists.proxmox.com/pipermail/pve-devel/2022-February/051597.html
> 
>  www/manager6/ceph/OSD.js | 66 +++++++++++++++++++++++++++++++++++++++-
>  1 file changed, 65 insertions(+), 1 deletion(-)
> 
> diff --git a/www/manager6/ceph/OSD.js b/www/manager6/ceph/OSD.js
> index e126f8d0..27f56760 100644
> --- a/www/manager6/ceph/OSD.js
> +++ b/www/manager6/ceph/OSD.js
> @@ -178,6 +178,20 @@ Ext.define('PVE.CephRemoveOsd', {
>  	    labelWidth: 130,
>  	    fieldLabel: gettext('Cleanup Disks'),
>  	},
> +	{
> +	    xtype: 'displayfield',
> +	    name: 'osd-flag-hint',
> +	    userCls: 'pmx-hint',
> +	    value: gettext('Global flags limiting the self healing of Ceph are enabled.'),
> +	    hidden: true,
> +	},
> +	{
> +	    xtype: 'displayfield',
> +	    name: 'degraded-objects-hint',
> +	    userCls: 'pmx-hint',
> +	    value: gettext('Objects are degraded. Consider waiting until the cluster is healthy.'),
> +	    hidden: true,
> +	},
>      ],
>      initComponent: function() {
>          let me = this;
> @@ -198,6 +212,13 @@ Ext.define('PVE.CephRemoveOsd', {
>          });
>  
>          me.callParent();
> +
> +	if (me.warnings.flags) {
> +	    me.down('field[name=osd-flag-hint]').setHidden(false);
> +	}
> +	if (me.warnings.degraded) {
> +	    me.down('field[name=degraded-objects-hint]').setHidden(false);
> +	}
>      },
>  });
>  
> @@ -434,12 +455,55 @@ Ext.define('PVE.node.CephOsdTree', {
>  	    }).show();
>  	},
>  
> -	destroy_osd: function() {
> +	destroy_osd: async function() {
>  	    let me = this;
>  	    let vm = this.getViewModel();
> +
> +	    let warnings = {
> +		flags: false,
> +		degraded: false,
> +	    };
> +
> +	    let flagsPromise = Proxmox.Async.api2({
> +		url: `/cluster/ceph/flags`,
> +		method: 'GET',
> +	    });
> +
> +	    let statusPromise = Proxmox.Async.api2({
> +		url: `/cluster/ceph/status`,
> +		method: 'GET',
> +	    });
> +
> +	    me.getView().mask(gettext('Please wait...'));

'Loading...' is used by ExtJS, well, loading mask; I'd use the same here


> +
> +	    try {
> +		let result = await Promise.all([flagsPromise, statusPromise]);
> +
> +		let flagsData = result[0].result.data;
> +		let statusData = result[1].result.data;

one could deconstruct this directly, but it gets rather noisy:

let [ { result: { data: flagsData } }, { result: { data: statusData } } ] = await Promise.all(...);

meh...

> +
> +		let flags = Array.from(
> +		    flagsData.filter(v => v.value),
> +		    v => v.name,
> +		).filter(v => ['norebalance', 'norecover', 'noout'].includes(v));
> +
> +		if (flags.length) {
> +		    warnings.flags = true;
> +		}
> +		if (Object.keys(statusData.pgmap).includes('degraded_objects')) {
> +		    warnings.degraded = true;
> +		}
> +	    } catch (error) {
> +		Ext.Msg.alert(gettext('Error'), error.htmlStatus);
> +		me.getView().unmask();
> +		return;
> +	    }
> +
> +	    me.getView().unmask();
>  	    Ext.create('PVE.CephRemoveOsd', {
>  		nodename: vm.get('osdhost'),
>  		osdid: vm.get('osdid'),
> +		warnings: warnings,
>  		taskDone: () => { me.reload(); },
>  	    }).show();

we normally set `autoShow: true` instead for new additions of such window instances.

>  	},





^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [pve-devel] [PATCH manager 4/6] api: mon: mds: osd: add safety check endpoints
  2022-02-22  8:44   ` Thomas Lamprecht
@ 2022-03-14 16:49     ` Aaron Lauterer
  2022-03-14 17:02       ` Thomas Lamprecht
  0 siblings, 1 reply; 14+ messages in thread
From: Aaron Lauterer @ 2022-03-14 16:49 UTC (permalink / raw)
  To: Thomas Lamprecht, Proxmox VE development discussion



On 2/22/22 09:44, Thomas Lamprecht wrote:
> On 18.02.22 12:38, Aaron Lauterer wrote:
[...]

> 
> In general I see lots of repetition, and in this case I'd rather have a single
> enpoint that accepts one (or maybe better a list of) service-type(s), and an
> action (stop/destroy) let's encode in the name (or at least description) that
> it's a heuristical check, besides things that we possible miss to observe we
> could never make it 100% safe as we cannot lock the whole ceph cluster between
> checking and doing an operation, so this will always be a TOCTOU race that
> expects the admins to have some change management so that they do not interfere
> with each others maintenance work.
> 
> So either `/nodes/<nodename>/ceph/cmd-safety-heuristic` or drop the heuristic
> from the path and just refer to that detail in the description (which shows up
> in the api viewer, so should be good enough) `/nodes/<nodename>/ceph/cmd-safety`
> 
> params could be: node, type, id and command

So IIUC, you prefer to not use the Ceph names transparently?
'ok-to-stop', 'ok-to-rm', 'safe-to-destroy'; yes, for Mons it is 'ok-to-rm' and for OSDs 'safe-to-destroy'...

But rather to have our own with the list of services (mon, mds, osd), its ID and then the action of either "stop" or "destroy"?

And ideally, the option to pass a list of IDs?




^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [pve-devel] [PATCH manager 4/6] api: mon: mds: osd: add safety check endpoints
  2022-03-14 16:49     ` Aaron Lauterer
@ 2022-03-14 17:02       ` Thomas Lamprecht
  0 siblings, 0 replies; 14+ messages in thread
From: Thomas Lamprecht @ 2022-03-14 17:02 UTC (permalink / raw)
  To: Aaron Lauterer, Proxmox VE development discussion

On 14.03.22 17:49, Aaron Lauterer wrote:
> On 2/22/22 09:44, Thomas Lamprecht wrote:
>> On 18.02.22 12:38, Aaron Lauterer wrote:
> [...]
> 
>>
>> In general I see lots of repetition, and in this case I'd rather have a single
>> enpoint that accepts one (or maybe better a list of) service-type(s), and an
>> action (stop/destroy) let's encode in the name (or at least description) that
>> it's a heuristical check, besides things that we possible miss to observe we
>> could never make it 100% safe as we cannot lock the whole ceph cluster between
>> checking and doing an operation, so this will always be a TOCTOU race that
>> expects the admins to have some change management so that they do not interfere
>> with each others maintenance work.
>>
>> So either `/nodes/<nodename>/ceph/cmd-safety-heuristic` or drop the heuristic
>> from the path and just refer to that detail in the description (which shows up
>> in the api viewer, so should be good enough) `/nodes/<nodename>/ceph/cmd-safety`
>>
>> params could be: node, type, id and command
> 
> So IIUC, you prefer to not use the Ceph names transparently?
> 'ok-to-stop', 'ok-to-rm', 'safe-to-destroy'; yes, for Mons it is 'ok-to-rm' and for OSDs 'safe-to-destroy'...

yes, I'd abstract those details away.

> 
> But rather to have our own with the list of services (mon, mds, osd), its ID and then the action of either "stop" or "destroy"?
> 
Yes, one endpoint with type, and (abstract) action.

> And ideally, the option to pass a list of IDs?

the list is really optional, we don't have batch actions for destroy so it would
be a bit superfluous, and we can always adapt the endpoint to take a new ids-list
parameter in the future, if we really require it.





^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2022-03-14 17:03 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-02-18 11:38 [pve-devel] [PATCH librados2-perl manager 0/6] Add Ceph safety checks Aaron Lauterer
2022-02-18 11:38 ` [pve-devel] [PATCH librados2-perl 1/6] mon_command: free outs buffer Aaron Lauterer
2022-02-21 15:35   ` Thomas Lamprecht
2022-02-18 11:38 ` [pve-devel] [PATCH librados2-perl 2/6] mon_command: optionally ignore errors and return hashmap Aaron Lauterer
2022-02-21 15:44   ` Thomas Lamprecht
2022-02-22 12:42     ` Aaron Lauterer
2022-02-18 11:38 ` [pve-devel] [PATCH manager 3/6] api: osd: force mon_command to scalar context Aaron Lauterer
2022-02-18 11:38 ` [pve-devel] [PATCH manager 4/6] api: mon: mds: osd: add safety check endpoints Aaron Lauterer
2022-02-22  8:44   ` Thomas Lamprecht
2022-03-14 16:49     ` Aaron Lauterer
2022-03-14 17:02       ` Thomas Lamprecht
2022-02-18 11:38 ` [pve-devel] [PATCH manager 5/6] ui: osd: warn if removal could be problematic Aaron Lauterer
2022-02-24 12:46   ` Thomas Lamprecht
2022-02-18 11:38 ` [pve-devel] [PATCH manager 6/6] ui: osd: mon: mds: warn if stop/destroy actions are problematic Aaron Lauterer

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox

Service provided by Proxmox Server Solutions GmbH | Privacy | Legal