public inbox for pve-devel@lists.proxmox.com
* [pve-devel] [RFC container] alloc disk: fix #3970 avoid ambiguous rbd image path
@ 2022-04-01 15:24 Aaron Lauterer
  2022-04-01 15:24 ` [pve-devel] [RFC qemu-server] clone disk: fix #3970 catch same source and destination Aaron Lauterer
  2022-04-04 15:44 ` [pve-devel] [RFC container] alloc disk: fix #3970 avoid ambiguous rbd image path Fabian Grünbichler
  0 siblings, 2 replies; 5+ messages in thread
From: Aaron Lauterer @ 2022-04-01 15:24 UTC (permalink / raw)
  To: pve-devel

If two RBD storages use the same pool but connect to different
clusters, we cannot tell to which cluster a mapped RBD image belongs.
To avoid potential data loss, we need to verify that no other storage
is configured that could have a volume mapped under the same path
before we format anything.

The ambiguous mapping is /dev/rbd/<pool>/<ns>/<image>, where the
namespace <ns> is optional.

Once we can tell the clusters apart in the mapping, we can remove these
checks again.

See bug #3969 for more information on the root cause.
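
To make the ambiguity concrete, here is a small standalone sketch
(hypothetical storage IDs, pool name and monitor addresses) of how two
storages on different clusters end up behind the identical mapped
device path:

# sketch only: the mapped device path contains just pool, optional
# namespace and image name, so it carries no cluster information
use strict;
use warnings;

my $storecfg = {
    ids => {
        'ceph-a' => { type => 'rbd', pool => 'tank', monhost => '10.0.0.1', krbd => 1 },
        'ceph-b' => { type => 'rbd', pool => 'tank', monhost => '10.0.1.1', krbd => 1 },
    },
};

# how the kernel exposes a mapped image: /dev/rbd/<pool>/<ns>/<image>
my $mapped_path = sub {
    my ($scfg, $image) = @_;
    my @parts = grep { defined($_) && length($_) } $scfg->{pool}, $scfg->{namespace}, $image;
    return '/dev/rbd/' . join('/', @parts);
};

for my $storeid (sort keys %{$storecfg->{ids}}) {
    print "$storeid: ", $mapped_path->($storecfg->{ids}->{$storeid}, 'vm-100-disk-0'), "\n";
}
# both lines print /dev/rbd/tank/vm-100-disk-0, so formatting "the"
# device for one storage can hit an image belonging to the other cluster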

Signed-off-by: Aaron Lauterer <a.lauterer@proxmox.com>
---
RFC because I would like someone else to take a look at the logic and I
wasn't sure how to format the grouping of the conditions in a way that
is easy to read

 src/PVE/LXC.pm | 45 +++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 45 insertions(+)

diff --git a/src/PVE/LXC.pm b/src/PVE/LXC.pm
index fe63087..b048ce0 100644
--- a/src/PVE/LXC.pm
+++ b/src/PVE/LXC.pm
@@ -1970,6 +1970,51 @@ sub alloc_disk {
     my $scfg = PVE::Storage::storage_config($storecfg, $storage);
     # fixme: use better naming ct-$vmid-disk-X.raw?
 
+    # Check whether another rbd storage exists with the same pool name but a
+    # different cluster. If so, allocating a new volume is potentially
+    # dangerous because the RBD mapping exposes it in an ambiguous way under
+    # /dev/rbd/<pool>/<ns>/<image>. Without any information about which
+    # cluster it belongs to, we cannot clearly determine which image we
+    # access and might format an already used image. See
+    # https://bugzilla.proxmox.com/show_bug.cgi?id=3969 and
+    # https://bugzilla.proxmox.com/show_bug.cgi?id=3970
+    # TODO: remove these checks once #3969 is fixed and we can clearly tell
+    # which cluster an image belongs to
+    if ($scfg->{type} eq 'rbd') {
+	my $pool = $storecfg->{ids}->{$storage}->{pool};
+	foreach my $stor  (keys %{$storecfg->{ids}}) {
+	    next if $stor eq $storage;
+
+	    my $ccfg = PVE::Storage::storage_config($storecfg, $stor);
+	    next if $ccfg->{type} ne 'rbd';
+
+	    if ($scfg->{pool} eq $ccfg->{pool}) {
+		if (
+		    (
+			defined($scfg->{monhost})
+			&& defined($ccfg->{monhost})
+			&& $scfg->{monhost} eq $ccfg->{monhost}
+		    )
+		    || (
+			!defined($scfg->{monhost})
+			&& !defined($ccfg->{monhost})
+		    )
+		) {
+		    # both external ones same or both hyperconverged
+		    next;
+		}
+		# different cluster here
+		# we are ok if namespaces are not the same or only one storage has one
+		if (defined($scfg->{namespace}) && defined($ccfg->{namespace})) {
+		    next if $scfg->{namespace} ne $ccfg->{namespace};
+		} elsif (defined($scfg->{namespace}) || defined($ccfg->{namespace})) {
+		    next;
+		}
+		die "Cannot determine which Ceph cluster the volume mapping belongs to. Abort!\n";
+	    }
+	}
+    }
+
     eval {
 	my $do_format = 0;
 	if ($scfg->{content}->{rootdir} && $scfg->{path}) {
-- 
2.30.2






* [pve-devel] [RFC qemu-server] clone disk: fix #3970 catch same source and destination
  2022-04-01 15:24 [pve-devel] [RFC container] alloc disk: fix #3970 avoid ambiguous rbd image path Aaron Lauterer
@ 2022-04-01 15:24 ` Aaron Lauterer
  2022-04-04 15:26   ` Fabian Grünbichler
  2022-04-04 15:44 ` [pve-devel] [RFC container] alloc disk: fix #3970 avoid ambiguous rbd image path Fabian Grünbichler
  1 sibling, 1 reply; 5+ messages in thread
From: Aaron Lauterer @ 2022-04-01 15:24 UTC (permalink / raw)
  To: pve-devel

In rare situations, it can happen that the source and target paths
are the same, for example if the disk image is to be copied from one
RBD storage to another one on a different Ceph cluster while both
pools have the same name.

In this situation, the clone operation will clone the image onto
itself, and one ends up with an empty destination volume.

This patch does not solve the underlying issue, but it is a first
step to avoid potential data loss, for example when the 'delete
source' option is enabled as well.

We also need to delete the newly created image right away, because
the regular cleanup gets confused and tries to remove the source
image. That removal will fail and leave an orphaned image which
cannot be removed easily, because the same underlying root cause
(identical paths) falsely triggers the "Drive::is_volume_in_use" check.

Signed-off-by: Aaron Lauterer <a.lauterer@proxmox.com>
---
We could think about leaving this as a permanent safety check as there
should not be a situation where both paths are the same AFAICT.

 PVE/QemuServer.pm | 13 +++++++++++--
 1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/PVE/QemuServer.pm b/PVE/QemuServer.pm
index 6c80c0c..a1aa4f2 100644
--- a/PVE/QemuServer.pm
+++ b/PVE/QemuServer.pm
@@ -7633,6 +7633,17 @@ sub clone_disk {
 
 	PVE::Storage::activate_volumes($storecfg, [$newvolid]);
 
+	my $src_path = PVE::Storage::path($storecfg, $drive->{file});
+	my $dst_path = PVE::Storage::path($storecfg, $newvolid);
+
+	if ($src_path eq $dst_path) {
+	    # Delete the new volume right away. Doing it later would try to remove the source volume
+	    # and fail. Currently the only known case is RBD. See bugs 3969 and 3970 for more details.
+	    PVE::Storage::vdisk_free($storecfg, $newvolid);
+	    pop @$newvollist;
+	    die "Source and destination are the same!\n";
+	}
+
 	if (drive_is_cloudinit($drive)) {
 	    # when cloning multiple disks (e.g. during clone_vm) it might be the last disk
 	    # if this is the case, we have to complete any block-jobs still there from
@@ -7650,8 +7661,6 @@ sub clone_disk {
 		# the relevant data on the efidisk may be smaller than the source
 		# e.g. on RBD/ZFS, so we use dd to copy only the amount
 		# that is given by the OVMF_VARS.fd
-		my $src_path = PVE::Storage::path($storecfg, $drive->{file});
-		my $dst_path = PVE::Storage::path($storecfg, $newvolid);
 
 		# better for Ceph if block size is not too small, see bug #3324
 		my $bs = 1024*1024;
-- 
2.30.2






* Re: [pve-devel] [RFC qemu-server] clone disk: fix #3970 catch same source and destination
  2022-04-01 15:24 ` [pve-devel] [RFC qemu-server] clone disk: fix #3970 catch same source and destination Aaron Lauterer
@ 2022-04-04 15:26   ` Fabian Grünbichler
  2022-04-05  7:28     ` Aaron Lauterer
  0 siblings, 1 reply; 5+ messages in thread
From: Fabian Grünbichler @ 2022-04-04 15:26 UTC (permalink / raw)
  To: Proxmox VE development discussion

On April 1, 2022 5:24 pm, Aaron Lauterer wrote:
> In rare situations, it can happen that the source and target paths
> are the same, for example if the disk image is to be copied from one
> RBD storage to another one on a different Ceph cluster while both
> pools have the same name.
> 
> In this situation, the clone operation will clone the image onto
> itself, and one ends up with an empty destination volume.
> 
> This patch does not solve the underlying issue, but it is a first
> step to avoid potential data loss, for example when the 'delete
> source' option is enabled as well.
> 
> We also need to delete the newly created image right away, because
> the regular cleanup gets confused and tries to remove the source
> image. That removal will fail and leave an orphaned image which
> cannot be removed easily, because the same underlying root cause
> (identical paths) falsely triggers the "Drive::is_volume_in_use" check.

isn't this technically - just like for the container case - a problem in 
general, not just for cloning a disk? I haven't tested this in practice, 
but since you already have the reproducing setup ;)

e.g., given the following:
- storage A, krbd, cluster A, pool foo
- storage B, krbd, cluster B, pool foo
- VM 123, with scsi0: A:vm-123-disk-0 and no volumes on B
- qm set 123 -scsi1: B:1

next free slot on B is 'vm-123-disk-0', which will be allocated. mapping 
will skip the map part, since the RBD path already exists (provided 
scsi0's volume is already activated). the returned path will point to 
the mapped blockdev corresponding to A:vm-123-disk-0, not B:.. 

guest writes to scsi1, likely corrupting whatever is on scsi0, since 
most things that tend to get put on guest disks are not 
multi-writer-safe (or something along the way notices it?)

if the above is the case, it might actually be prudent to just put the 
check from your other patch into RBDPlugin.pm's alloc method (and 
clone and rename?) since we'd want to block any allocations on affected 
systems?
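
a rough sketch of what that could look like (untested, and how the 
plugin obtains the full storage configuration here is just an 
assumption), mirroring the logic from the container patch:

# guard for PVE::Storage::RBDPlugin, to be called at the start of
# alloc_image (and possibly clone_image/rename_volume)
sub assert_unambiguous_rbd_mapping {
    my ($storeid, $scfg) = @_;

    # plugin methods only receive their own $scfg, so fetch the full
    # storage configuration to find potentially clashing rbd storages
    my $storecfg = PVE::Cluster::cfs_read_file('storage.cfg');

    my $pool = $scfg->{pool} // 'rbd'; # 'rbd' is the plugin default

    for my $other_id (sort keys %{$storecfg->{ids}}) {
        next if $other_id eq $storeid;
        my $other = $storecfg->{ids}->{$other_id};
        next if ($other->{type} // '') ne 'rbd';
        next if ($other->{pool} // 'rbd') ne $pool;

        # same monitors, or both hyperconverged -> same cluster, no clash
        next if (defined($scfg->{monhost}) && defined($other->{monhost})
            && $scfg->{monhost} eq $other->{monhost})
            || (!defined($scfg->{monhost}) && !defined($other->{monhost}));

        # different namespaces (or only one side has one) -> paths differ
        if (defined($scfg->{namespace}) && defined($other->{namespace})) {
            next if $scfg->{namespace} ne $other->{namespace};
        } elsif (defined($scfg->{namespace}) || defined($other->{namespace})) {
            next;
        }

        die "RBD mapping for pool '$pool' would be ambiguous with storage"
            . " '$other_id' - refusing to allocate\n";
    }
}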

> 
> Signed-off-by: Aaron Lauterer <a.lauterer@proxmox.com>
> ---
> We could think about leaving this as a permanent safety check as there
> should not be a situation where both paths are the same AFAICT.
> 
>  PVE/QemuServer.pm | 13 +++++++++++--
>  1 file changed, 11 insertions(+), 2 deletions(-)
> 
> diff --git a/PVE/QemuServer.pm b/PVE/QemuServer.pm
> index 6c80c0c..a1aa4f2 100644
> --- a/PVE/QemuServer.pm
> +++ b/PVE/QemuServer.pm
> @@ -7633,6 +7633,17 @@ sub clone_disk {
>  
>  	PVE::Storage::activate_volumes($storecfg, [$newvolid]);
>  
> +	my $src_path = PVE::Storage::path($storecfg, $drive->{file});
> +	my $dst_path = PVE::Storage::path($storecfg, $newvolid);
> +
> +	if ($src_path eq $dst_path) {
> +	    # Delete the new volume right away. Doing it later would try to remove the source volume
> +	    # and fail. Currently the only known case is RBD. See bugs 3969 and 3970 for more details.
> +	    PVE::Storage::vdisk_free($storecfg, $newvolid);
> +	    pop @$newvollist;
> +	    die "Source and destination are the same!\n";
> +	}
> +
>  	if (drive_is_cloudinit($drive)) {
>  	    # when cloning multiple disks (e.g. during clone_vm) it might be the last disk
>  	    # if this is the case, we have to complete any block-jobs still there from
> @@ -7650,8 +7661,6 @@ sub clone_disk {
>  		# the relevant data on the efidisk may be smaller than the source
>  		# e.g. on RBD/ZFS, so we use dd to copy only the amount
>  		# that is given by the OVMF_VARS.fd
> -		my $src_path = PVE::Storage::path($storecfg, $drive->{file});
> -		my $dst_path = PVE::Storage::path($storecfg, $newvolid);
>  
>  		# better for Ceph if block size is not too small, see bug #3324
>  		my $bs = 1024*1024;
> -- 
> 2.30.2
> 
> 
> 





* Re: [pve-devel] [RFC container] alloc disk: fix #3970 avoid ambiguous rbd image path
  2022-04-01 15:24 [pve-devel] [RFC container] alloc disk: fix #3970 avoid ambiguous rbd image path Aaron Lauterer
  2022-04-01 15:24 ` [pve-devel] [RFC qemu-server] clone disk: fix #3970 catch same source and destination Aaron Lauterer
@ 2022-04-04 15:44 ` Fabian Grünbichler
  1 sibling, 0 replies; 5+ messages in thread
From: Fabian Grünbichler @ 2022-04-04 15:44 UTC (permalink / raw)
  To: Proxmox VE development discussion

On April 1, 2022 5:24 pm, Aaron Lauterer wrote:
> If two RBD storages use the same pool but connect to different
> clusters, we cannot tell to which cluster a mapped RBD image belongs.
> To avoid potential data loss, we need to verify that no other storage
> is configured that could have a volume mapped under the same path
> before we format anything.
> 
> The ambiguous mapping is /dev/rbd/<pool>/<ns>/<image>, where the
> namespace <ns> is optional.
> 
> Once we can tell the clusters apart in the mapping, we can remove these
> checks again.
> 
> See bug #3969 for more information on the root cause.
> 
> Signed-off-by: Aaron Lauterer <a.lauterer@proxmox.com>
> ---
> RFC because I would like someone else to take a look at the logic and I
> wasn't sure how to format the grouping of the conditions in a way that
> is easy to read
> 
>  src/PVE/LXC.pm | 45 +++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 45 insertions(+)
> 
> diff --git a/src/PVE/LXC.pm b/src/PVE/LXC.pm
> index fe63087..b048ce0 100644
> --- a/src/PVE/LXC.pm
> +++ b/src/PVE/LXC.pm
> @@ -1970,6 +1970,51 @@ sub alloc_disk {
>      my $scfg = PVE::Storage::storage_config($storecfg, $storage);
>      # fixme: use better naming ct-$vmid-disk-X.raw?
>  
> +    # Check whether another rbd storage exists with the same pool name but a
> +    # different cluster. If so, allocating a new volume is potentially
> +    # dangerous because the RBD mapping exposes it in an ambiguous way under
> +    # /dev/rbd/<pool>/<ns>/<image>. Without any information about which
> +    # cluster it belongs to, we cannot clearly determine which image we
> +    # access and might format an already used image. See
> +    # https://bugzilla.proxmox.com/show_bug.cgi?id=3969 and
> +    # https://bugzilla.proxmox.com/show_bug.cgi?id=3970
> +    # TODO: remove these checks once #3969 is fixed and we can clearly tell
> +    # which cluster an image belongs to
> +    if ($scfg->{type} eq 'rbd') {
> +	my $pool = $storecfg->{ids}->{$storage}->{pool};

not used anywhere?

> +	foreach my $stor  (keys %{$storecfg->{ids}}) {
> +	    next if $stor eq $storage;
> +
> +	    my $ccfg = PVE::Storage::storage_config($storecfg, $stor);

that's actually not needed if you are iterating over the keys ;) just 
use

my $checked_scfg = $storecfg->{ids}->{$stor};

> +	    next if $ccfg->{type} ne 'rbd';
> +
> +	    if ($scfg->{pool} eq $ccfg->{pool}) {

why not simply

# different pools -> no clash possible
next if $scfg->{pool} ne $checked_scfg->{pool};

> +		if (
> +		    (
> +			defined($scfg->{monhost})
> +			&& defined($ccfg->{monhost})
> +			&& $scfg->{monhost} eq $ccfg->{monhost}
> +		    )
> +		    || (
> +			!defined($scfg->{monhost})
> +			&& !defined($ccfg->{monhost})
> +		    )
> +		) {
> +		    # both external ones same or both hyperconverged
> +		    next;

untested, but something like the following is probably more readable ;) 
could also be adapted to check for monhost overlap instead if desired 
(to catch storage A and B not having identical 
strings/formatting/elements - if any single mon is listed for both, they 
ARE the same cluster for all intents and purposes?)

my $safe_compare = sub {
  my ($key) = shift;
  my $cmp = sub { $_[0] eq $_[1] };
  return PVE::Tools::safe_compare($scfg->{$key}, $checked_scfg->{$key}, $cmp);
};

# internal and internal or external and external with identical monitors 
# => same cluster
next if $safe_compare->("monhost");

# different namespaces => no clash possible
next if !$safe_compare->("namespace");

die ..
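
the overlap variant could then look something like this (again 
untested, just a sketch):

# treat the two storages as the same cluster as soon as any single
# monitor address is listed for both sides
my $mons_overlap = sub {
    my ($mons_a, $mons_b) = @_;
    return 1 if !defined($mons_a) && !defined($mons_b); # both hyperconverged
    return 0 if !defined($mons_a) || !defined($mons_b); # one external, one not
    my %seen = map { $_ => 1 } PVE::Tools::split_list($mons_a);
    return scalar(grep { $seen{$_} } PVE::Tools::split_list($mons_b));
};

# same cluster for all intents and purposes -> no ambiguity possible
# next if $mons_overlap->($scfg->{monhost}, $checked_scfg->{monhost});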

> +		}
> +		# different cluster here
> +		# we are ok if namespaces are not the same or only one storage has one
> +		if (defined($scfg->{namespace}) && defined($ccfg->{namespace})) {
> +		    next if $scfg->{namespace} ne $ccfg->{namespace};
> +		} elsif (defined($scfg->{namespace}) || defined($ccfg->{namespace})) {
> +		    next;
> +		}
> +		die "Cannot determine which Ceph cluster the volume mapping belongs to. Abort!\n";
> +	    }
> +	}
> +    }
> +
>      eval {
>  	my $do_format = 0;
>  	if ($scfg->{content}->{rootdir} && $scfg->{path}) {
> -- 
> 2.30.2
> 
> 
> 





* Re: [pve-devel] [RFC qemu-server] clone disk: fix #3970 catch same source and destination
  2022-04-04 15:26   ` Fabian Grünbichler
@ 2022-04-05  7:28     ` Aaron Lauterer
  0 siblings, 0 replies; 5+ messages in thread
From: Aaron Lauterer @ 2022-04-05  7:28 UTC (permalink / raw)
  To: Proxmox VE development discussion, Fabian Grünbichler



On 4/4/22 17:26, Fabian Grünbichler wrote:
> On April 1, 2022 5:24 pm, Aaron Lauterer wrote:
>> In rare situations, it can happen that the source and target paths
>> are the same, for example if the disk image is to be copied from one
>> RBD storage to another one on a different Ceph cluster while both
>> pools have the same name.
>>
>> In this situation, the clone operation will clone the image onto
>> itself, and one ends up with an empty destination volume.
>>
>> This patch does not solve the underlying issue, but it is a first
>> step to avoid potential data loss, for example when the 'delete
>> source' option is enabled as well.
>>
>> We also need to delete the newly created image right away, because
>> the regular cleanup gets confused and tries to remove the source
>> image. That removal will fail and leave an orphaned image which
>> cannot be removed easily, because the same underlying root cause
>> (identical paths) falsely triggers the "Drive::is_volume_in_use" check.
> 
> isn't this technically - just like for the container case - a problem in
> general, not just for cloning a disk? I haven't tested this in practice,
> but since you already have the reproducing setup ;)
> 
> e.g., given the following:
> - storage A, krbd, cluster A, pool foo
> - storage B, krbd, cluster B, pool foo
> - VM 123, with scsi0: A:vm-123-disk-0 and no volumes on B
> - qm set 123 -scsi1: B:1
> 
> next free slot on B is 'vm-123-disk-0', which will be allocated. mapping
> will skip the map part, since the RBD path already exists (provided
> scsi0's volume is already activated). the returned path will point to
> the mapped blockdev corresponding to A:vm-123-disk-0, not B:..
> 
> guest writes to scsi1, likely corrupting whatever is on scsi0, since
> most things that tend to get put on guest disks are not
> multi-writer-safe (or something along the way notices it?)
> 
> if the above is the case, it might actually be prudent to just put the
> check from your other patch into RBDPlugin.pm's alloc method (and
> clone and rename?) since we'd want to block any allocations on affected
> systems?

Tested it and yep... unfortunately, the wrong disk is attached. I am going to implement the check in RBDPlugin.pm.





