* [pve-devel] [RFC container] alloc disk: fix #3970 avoid ambiguous rbd image path @ 2022-04-01 15:24 Aaron Lauterer 2022-04-01 15:24 ` [pve-devel] [RFC qemu-server] clone disk: fix #3970 catch same source and destination Aaron Lauterer 2022-04-04 15:44 ` [pve-devel] [RFC container] alloc disk: fix #3970 avoid ambiguous rbd image path Fabian Grünbichler 0 siblings, 2 replies; 5+ messages in thread From: Aaron Lauterer @ 2022-04-01 15:24 UTC (permalink / raw) To: pve-devel If two RBD storages use the same pool, but connect to different clusters, we cannot tell to which cluster the mapped RBD image belongs. To avoid potential data loss, we need to verify that no other storage is configured that could have a volume mapped under the same path before we format anything. The ambiguous mapping is in /dev/rbd/<pool>/<ns>/<image> where the namespace <ns> is optional. Once we can tell the clusters apart in the mapping, we can remove these checks again. See bug #3969 for more information on the root cause. Signed-off-by: Aaron Lauterer <a.lauterer@proxmox.com> --- RFC because I would like someone else to take a look at the logic and I wasn't sure how to format the grouping of the conditions in a way that is easy to read src/PVE/LXC.pm | 45 +++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 45 insertions(+) diff --git a/src/PVE/LXC.pm b/src/PVE/LXC.pm index fe63087..b048ce0 100644 --- a/src/PVE/LXC.pm +++ b/src/PVE/LXC.pm @@ -1970,6 +1970,51 @@ sub alloc_disk { my $scfg = PVE::Storage::storage_config($storecfg, $storage); # fixme: use better naming ct-$vmid-disk-X.raw? + # check if another rbd storage with the same pool name but different + # cluster exists. If so, allocating a new volume can potentially be + # dangerous because the RBD mapping exposes it in an ambiguous way under + # /dev/rbd/<pool>/<ns>/<image>. 
Without any information to which cluster it + # belongs, we cannot clearly determine which image we access and + # potentially format an already used image. See + # https://bugzilla.proxmox.com/show_bug.cgi?id=3969 and + # https://bugzilla.proxmox.com/show_bug.cgi?id=3970 + # TODO: remove these checks once #3969 is fixed and we can clearly tell to + # which cluster an image belongs to + if ($scfg->{type} eq 'rbd') { + my $pool = $storecfg->{ids}->{$storage}->{pool}; + foreach my $stor (keys %{$storecfg->{ids}}) { + next if $stor eq $storage; + + my $ccfg = PVE::Storage::storage_config($storecfg, $stor); + next if $ccfg->{type} ne 'rbd'; + + if ($scfg->{pool} eq $ccfg->{pool}) { + if ( + ( + defined($scfg->{monhost}) + && defined($ccfg->{monhost}) + && $scfg->{monhost} eq $ccfg->{monhost} + ) + || ( + !defined($scfg->{monhost}) + && !defined($ccfg->{monhost}) + ) + ) { + # both external ones same or both hyperconverged + next; + } + # different cluster here + # we are ok if namespaces are not the same or only one storage has one + if (defined($scfg->{namespace}) && defined($ccfg->{namespace})) { + next if $scfg->{namespace} ne $ccfg->{namespace}; + } elsif (defined($scfg->{namespace}) || defined($ccfg->{namespace})) { + next; + } + die "Cannot determine which Ceph cluster the volume mapping belongs to. Abort!\n"; + } + } + } + eval { my $do_format = 0; if ($scfg->{content}->{rootdir} && $scfg->{path}) { -- 2.30.2 ^ permalink raw reply [flat|nested] 5+ messages in thread
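For readers trying to follow the nested conditions in the patch above, a minimal Python sketch of the same decision logic may help. This is a hypothetical helper for illustration only (the actual code is Perl inside `PVE::LXC::alloc_disk`); the dict keys mirror the storage config fields used by the patch:

```python
def may_clash(scfg, ccfg):
    """Return True if two RBD storage configs (dicts with optional
    'pool', 'monhost' and 'namespace' keys) could expose different
    images under the same /dev/rbd/<pool>/<ns>/<image> path."""
    if scfg.get("pool") != ccfg.get("pool"):
        return False  # different pools -> different mapping paths
    if scfg.get("monhost") == ccfg.get("monhost"):
        # same cluster: both hyperconverged (no monhost), or both
        # external with identical monitor strings
        return False
    # different clusters sharing a pool name: only distinct namespaces
    # (or a namespace on exactly one side) keep the mapped paths apart
    return scfg.get("namespace") == ccfg.get("namespace")
```

In this shape, the patch's `die` fires exactly when `may_clash` is true for some other configured RBD storage.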
* [pve-devel] [RFC qemu-server] clone disk: fix #3970 catch same source and destination 2022-04-01 15:24 [pve-devel] [RFC container] alloc disk: fix #3970 avoid ambiguous rbd image path Aaron Lauterer @ 2022-04-01 15:24 ` Aaron Lauterer 2022-04-04 15:26 ` Fabian Grünbichler 2022-04-04 15:44 ` [pve-devel] [RFC container] alloc disk: fix #3970 avoid ambiguous rbd image path Fabian Grünbichler 1 sibling, 1 reply; 5+ messages in thread From: Aaron Lauterer @ 2022-04-01 15:24 UTC (permalink / raw) To: pve-devel In rare situations, it could happen that the source and target path is the same. For example, if the disk image is to be copied from one RBD storage to another one on different Ceph clusters but the pools have the same name. In this situation, the clone operation will clone it to the same image and one will end up with an empty destination volume. This patch does not solve the underlying issue, but is a first step to avoid potential data loss, for example when the 'delete source' option is enabled as well. We also need to delete the newly created image right away because the regular cleanup gets confused and tries to remove the source image. This will fail and we have an orphaned image which cannot be removed easily because the same underlying root cause (same path) will falsely trigger the "Drive::is_volume_in_use" check. Signed-off-by: Aaron Lauterer <a.lauterer@proxmox.com> --- We could think about leaving this as a permanent safety check as there should not be a situation where both paths are the same AFAICT. 
PVE/QemuServer.pm | 13 +++++++++++-- 1 file changed, 11 insertions(+), 2 deletions(-) diff --git a/PVE/QemuServer.pm b/PVE/QemuServer.pm index 6c80c0c..a1aa4f2 100644 --- a/PVE/QemuServer.pm +++ b/PVE/QemuServer.pm @@ -7633,6 +7633,17 @@ sub clone_disk { PVE::Storage::activate_volumes($storecfg, [$newvolid]); + my $src_path = PVE::Storage::path($storecfg, $drive->{file}); + my $dst_path = PVE::Storage::path($storecfg, $newvolid); + + if ($src_path eq $dst_path) { + # Delete the new volume right away. Doing it later will try to remove the old volume and + # fail. Currently the only known case is RBD. See bugs 3969 and 3970 for more details. + PVE::Storage::vdisk_free($storecfg, $newvolid); + pop @$newvollist; + die "Source and destination are the same!\n"; + } + if (drive_is_cloudinit($drive)) { # when cloning multiple disks (e.g. during clone_vm) it might be the last disk # if this is the case, we have to complete any block-jobs still there from @@ -7650,8 +7661,6 @@ sub clone_disk { # the relevant data on the efidisk may be smaller than the source # e.g. on RBD/ZFS, so we use dd to copy only the amount # that is given by the OVMF_VARS.fd - my $src_path = PVE::Storage::path($storecfg, $drive->{file}); - my $dst_path = PVE::Storage::path($storecfg, $newvolid); # better for Ceph if block size is not too small, see bug #3324 my $bs = 1024*1024; -- 2.30.2 ^ permalink raw reply [flat|nested] 5+ messages in thread
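The early-abort pattern this patch adds (check the resolved paths, roll back the fresh allocation immediately, then fail) can be sketched language-agnostically. A small Python illustration with hypothetical names, not Proxmox code:

```python
def guard_clone(src_path, dst_path, free_new_volume):
    """Abort a clone when source and destination resolve to the same
    device path.  free_new_volume is a callback that deletes the freshly
    allocated target image; it must run *now*, because a later cleanup
    would resolve the shared path to the source image and fail."""
    if src_path == dst_path:
        free_new_volume()  # roll back before anything is written
        raise RuntimeError("source and destination are the same!")
```

The key design point, per the commit message, is ordering: freeing the new volume before raising avoids leaving an orphaned image that the usual cleanup can no longer remove.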
* Re: [pve-devel] [RFC qemu-server] clone disk: fix #3970 catch same source and destination 2022-04-01 15:24 ` [pve-devel] [RFC qemu-server] clone disk: fix #3970 catch same source and destination Aaron Lauterer @ 2022-04-04 15:26 ` Fabian Grünbichler 2022-04-05 7:28 ` Aaron Lauterer 0 siblings, 1 reply; 5+ messages in thread From: Fabian Grünbichler @ 2022-04-04 15:26 UTC (permalink / raw) To: Proxmox VE development discussion On April 1, 2022 5:24 pm, Aaron Lauterer wrote: > In rare situations, it could happen that the source and target path is > the same. For example, if the disk image is to be copied from one RBD > storage to another one on different Ceph clusters but the pools have the > same name. > > In this situation, the clone operation will clone it to the same image and > one will end up with an empty destination volume. > > This patch does not solve the underlying issue, but is a first step to > avoid potential data loss, for example when the 'delete source' option > is enabled as well. > > We also need to delete the newly created image right away because the > regular cleanup gets confused and tries to remove the source image. This > will fail and we have an orphaned image which cannot be removed easily > because the same underlying root cause (same path) will falsely trigger > the "Drive::is_volume_in_use" check. isn't this technically - just like for the container case - a problem in general, not just for cloning a disk? I haven't tested this in practice, but since you already have the reproducing setup ;) e.g., given the following: - storage A, krbd, cluster A, pool foo - storage B, krbd, cluster B, pool foo - VM 123, with scsi0: A:vm-123-disk-0 and no volumes on B - qm set 123 -scsi1: B:1 next free slot on B is 'vm-123-disk-0', which will be allocated. mapping will skip the map part, since the RBD path already exists (provided scsi0's volume is already activated). 
the returned path will point to the mapped blockdev corresponding to A:vm-123-disk-0, not B:.. guest writes to scsi1, likely corrupting whatever is on scsi0, since most things that tend to get put on guest disks are not multi-writer-safe (or something along the way notices it?) if the above is the case, it might actually be prudent to just put the check from your other patch into RBDPlugin.pm 's alloc method (and clone and rename?) since we'd want to block any allocations on affected systems? > > Signed-off-by: Aaron Lauterer <a.lauterer@proxmox.com> > --- > We could think about leaving this as a permanent safety check as there > should not be a situation where both paths are the same AFAICT. > > PVE/QemuServer.pm | 13 +++++++++++-- > 1 file changed, 11 insertions(+), 2 deletions(-) > > diff --git a/PVE/QemuServer.pm b/PVE/QemuServer.pm > index 6c80c0c..a1aa4f2 100644 > --- a/PVE/QemuServer.pm > +++ b/PVE/QemuServer.pm > @@ -7633,6 +7633,17 @@ sub clone_disk { > > PVE::Storage::activate_volumes($storecfg, [$newvolid]); > > + my $src_path = PVE::Storage::path($storecfg, $drive->{file}); > + my $dst_path = PVE::Storage::path($storecfg, $newvolid); > + > + if ($src_path eq $dst_path) { > + # Delete the new volume right away. Doing it later will try to remove the old volume and > + # fail. Currenlty the only known case is RBD. See bugs 3969 and 3970 for more details. > + PVE::Storage::vdisk_free($storecfg, $newvolid); > + pop @$newvollist; > + die "Source and destination are the same!\n"; > + } > + > if (drive_is_cloudinit($drive)) { > # when cloning multiple disks (e.g. during clone_vm) it might be the last disk > # if this is the case, we have to complete any block-jobs still there from > @@ -7650,8 +7661,6 @@ sub clone_disk { > # the relevant data on the efidisk may be smaller than the source > # e.g. 
on RBD/ZFS, so we use dd to copy only the amount > # that is given by the OVMF_VARS.fd > - my $src_path = PVE::Storage::path($storecfg, $drive->{file}); > - my $dst_path = PVE::Storage::path($storecfg, $newvolid); > > # better for Ceph if block size is not too small, see bug #3324 > my $bs = 1024*1024; > -- > 2.30.2 > > > > _______________________________________________ > pve-devel mailing list > pve-devel@lists.proxmox.com > https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel > > > ^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [pve-devel] [RFC qemu-server] clone disk: fix #3970 catch same source and destination 2022-04-04 15:26 ` Fabian Grünbichler @ 2022-04-05 7:28 ` Aaron Lauterer 0 siblings, 0 replies; 5+ messages in thread From: Aaron Lauterer @ 2022-04-05 7:28 UTC (permalink / raw) To: Proxmox VE development discussion, Fabian Grünbichler On 4/4/22 17:26, Fabian Grünbichler wrote: > On April 1, 2022 5:24 pm, Aaron Lauterer wrote: >> In rare situations, it could happen that the source and target path is >> the same. For example, if the disk image is to be copied from one RBD >> storage to another one on different Ceph clusters but the pools have the >> same name. >> >> In this situation, the clone operation will clone it to the same image and >> one will end up with an empty destination volume. >> >> This patch does not solve the underlying issue, but is a first step to >> avoid potential data loss, for example when the 'delete source' option >> is enabled as well. >> >> We also need to delete the newly created image right away because the >> regular cleanup gets confused and tries to remove the source image. This >> will fail and we have an orphaned image which cannot be removed easily >> because the same underlying root cause (same path) will falsely trigger >> the "Drive::is_volume_in_use" check. > > isn't this technically - just like for the container case - a problem in > general, not just for cloning a disk? I haven't tested this in practice, > but since you already have the reproducing setup ;) > > e.g., given the following: > - storage A, krbd, cluster A, pool foo > - storage B, krbd, cluster B, pool foo > - VM 123, with scsi0: A:vm-123-disk-0 and no volumes on B > - qm set 123 -scsi1: B:1 > > next free slot on B is 'vm-123-disk-0', which will be allocated. mapping > will skip the map part, since the RBD path already exists (provided > scsi0's volume is already activated). the returned path will point to > the mapped blockdev corresponding to A:vm-123-disk-0, not B:.. 
> > guest writes to scsi1, likely corrupting whatever is on scsi0, since > most things that tend to get put on guest disks are not > multi-writer-safe (or something along the way notices it?) > > if the above is the case, it might actually be prudent to just put the > check from your other patch into RBDPlugin.pm 's alloc method (and > clone and rename?) since we'd want to block any allocations on affected > systems? Tested it and yep... unfortunately the wrong disk is attached. I am going to implement the check in the RBDPlugin.pm. ^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [pve-devel] [RFC container] alloc disk: fix #3970 avoid ambiguous rbd image path 2022-04-01 15:24 [pve-devel] [RFC container] alloc disk: fix #3970 avoid ambiguous rbd image path Aaron Lauterer 2022-04-01 15:24 ` [pve-devel] [RFC qemu-server] clone disk: fix #3970 catch same source and destination Aaron Lauterer @ 2022-04-04 15:44 ` Fabian Grünbichler 1 sibling, 0 replies; 5+ messages in thread From: Fabian Grünbichler @ 2022-04-04 15:44 UTC (permalink / raw) To: Proxmox VE development discussion On April 1, 2022 5:24 pm, Aaron Lauterer wrote: > If two RBD storages use the same pool, but connect to different > clusters, we cannot say to which cluster the mapped RBD image belongs > to. To avoid potential data loss, we need to verify that no other > storage is configured that could have a volume mapped under the same > path before we format anything. > > The ambiguous mapping is in > /dev/rbd/<pool>/<ns>/<image> where the namespace <ns> is optional. > > Once we can tell the clusters apart in the mapping, we can remove these > checks again. > > See bug #3969 for more information on the root cause. > > Signed-off-by: Aaron Lauterer <a.lauterer@proxmox.com> > --- > RFC because I would like someone else to take a look at the logic and I > wasn't sure how to format the grouping of the conditions in a way that > is easy to read > > src/PVE/LXC.pm | 45 +++++++++++++++++++++++++++++++++++++++++++++ > 1 file changed, 45 insertions(+) > > diff --git a/src/PVE/LXC.pm b/src/PVE/LXC.pm > index fe63087..b048ce0 100644 > --- a/src/PVE/LXC.pm > +++ b/src/PVE/LXC.pm > @@ -1970,6 +1970,51 @@ sub alloc_disk { > my $scfg = PVE::Storage::storage_config($storecfg, $storage); > # fixme: use better naming ct-$vmid-disk-X.raw? > > + # check if another rbd storage with the same pool name but different > + # cluster exists. If so, allocating a new volume can potentially be > + # dangerous because the RBD mapping, exposes it in an ambiguous way under > + # /dev/rbd/<pool>/<ns>/<image>. 
Without any information to which cluster it > + # belongs, we cannot clearly determine which image we access and > + # potentially format an already used image. See > + # https://bugzilla.proxmox.com/show_bug.cgi?id=3969 and > + # https://bugzilla.proxmox.com/show_bug.cgi?id=3970 > + # TODO: remove these checks once #3969 is fixed and we can clearly tell to > + # which cluster an image belongs to > + if ($scfg->{type} eq 'rbd') { > + my $pool = $storecfg->{ids}->{$storage}->{pool}; not used anywhere? > + foreach my $stor (keys %{$storecfg->{ids}}) { > + next if $stor eq $storage; > + > + my $ccfg = PVE::Storage::storage_config($storecfg, $stor); that's actually not needed if you are iterating over the keys ;) just use my $checked_scfg = $storecfg->{ids}->{$store}; > + next if $ccfg->{type} ne 'rbd'; > + > + if ($scfg->{pool} eq $ccfg->{pool}) { why not simply # different pools -> no clash possible next if $scfg->{pool} ne $checked_scfg->{pool}; > + if ( > + ( > + defined($scfg->{monhost}) > + && defined($ccfg->{monhost}) > + && $scfg->{monhost} eq $ccfg->{monhost} > + ) > + || ( > + !defined($scfg->{monhost}) > + && !defined($ccfg->{monhost}) > + ) > + ) { > + # both external ones same or both hyperconverged > + next; untested, but something like the following is probably more readable ;) could also be adapted to check for monhost overlap instead if desired (to catch storage A and B not having identical strings/formatting/elements - if any single mon is listed for both, they ARE the same cluster for all intents and purposes?) my $safe_compare = sub { my ($key) = shift; my $cmp = sub { $_[0] eq $_[1] }; return PVE::Tools::safe_compare($scfg->{$key}, $checked_scfg->{$key}, $cmp); }; # internal and internal or external and external with identical monitors # => same cluster next if $safe_compare->("monhost"); # different namespaces => no clash possible next if !$safe_compare->("namespace"); die .. 
> + } > + # different cluster here > + # we are ok if namespaces are not the same or only one storage has one > + if (defined($scfg->{namespace}) && defined($ccfg->{namespace})) { > + next if $scfg->{namespace} ne $ccfg->{namespace}; > + } elsif (defined($scfg->{namespace}) || defined($ccfg->{namespace})) { > + next; > + } > + die "Cannot determine which Ceph cluster the volume mapping belongs to. Abort!\n"; > + } > + } > + } > + > eval { > my $do_format = 0; > if ($scfg->{content}->{rootdir} && $scfg->{path}) { > -- > 2.30.2 > > > > _______________________________________________ > pve-devel mailing list > pve-devel@lists.proxmox.com > https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel > > > ^ permalink raw reply [flat|nested] 5+ messages in thread
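Fabian's aside above — that two storages sharing even a single monitor are, for all intents and purposes, the same cluster — could be expressed as an overlap check. A Python sketch of that heuristic (hypothetical helper; the separator handling is an assumption about the `monhost` list format, not a confirmed Proxmox behavior):

```python
import re

def same_cluster(monhost_a, monhost_b):
    """Heuristic: two RBD storages point at the same Ceph cluster if
    both are hyperconverged (no monhost set) or their monitor lists
    overlap.  Assumes monitors are separated by ';', ',' or whitespace."""
    if monhost_a is None and monhost_b is None:
        return True  # both use the local hyperconverged cluster
    if monhost_a is None or monhost_b is None:
        return False  # internal vs. external -> different clusters
    split = lambda s: {m for m in re.split(r"[;,\s]+", s.strip()) if m}
    return bool(split(monhost_a) & split(monhost_b))
```

Compared with a plain string equality on `monhost`, this also catches configs that list the same monitors in a different order or formatting.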