* [pve-devel] migrate local -> drbd fails with vanished job @ 2021-11-26 13:03 Roland Kammerer 2021-11-26 16:06 ` Fabian Grünbichler 2021-11-29 8:20 ` Fabian Ebner 0 siblings, 2 replies; 5+ messages in thread From: Roland Kammerer @ 2021-11-26 13:03 UTC (permalink / raw) To: pve-devel Dear PVE devs, While most of our users start with fresh VMs on DRBD storage, from time to time people try to migrate a local VM to DRBD storage. This currently fails. Migrating VMs from DRBD to DRBD works. I added some debug code to PVE/QemuServer.pm, which looks like the location things go wrong, or at least where I saw them going wrong: root@pve:/usr/share/perl5/PVE# diff -Nur QemuServer.pm{.orig,} --- QemuServer.pm.orig 2021-11-26 11:27:28.879989894 +0100 +++ QemuServer.pm 2021-11-26 11:26:30.490988789 +0100 @@ -7390,6 +7390,8 @@ $completion //= 'complete'; $op //= "mirror"; + print "$vmid, $vmiddst, $jobs, $completion, $qga, $op \n"; + { use Data::Dumper; print Dumper($jobs); }; eval { my $err_complete = 0; @@ -7419,6 +7421,7 @@ next; } + print "vanished: $vanished\n"; # same as !defined($jobs) die "$job_id: '$op' has been cancelled\n" if !defined($job); my $busy = $job->{busy}; With that in place, I try to live migrate the running VM from node "pve" to "pvf": 2021-11-26 11:29:10 starting migration of VM 100 to node 'pvf' (xx.xx.xx.xx) 2021-11-26 11:29:10 found local disk 'local-lvm:vm-100-disk-0' (in current VM config) 2021-11-26 11:29:10 starting VM 100 on remote node 'pvf' 2021-11-26 11:29:18 volume 'local-lvm:vm-100-disk-0' is 'drbdstorage:vm-100-disk-1' on the target 2021-11-26 11:29:18 start remote tunnel 2021-11-26 11:29:19 ssh tunnel ver 1 2021-11-26 11:29:19 starting storage migration 2021-11-26 11:29:19 scsi0: start migration to nbd:unix:/run/qemu-server/100_nbd.migrate:exportname=drive-scsi0 drive mirror is starting for drive-scsi0 Use of uninitialized value $qga in concatenation (.) or string at /usr/share/perl5/PVE/QemuServer.pm line 7393. 
100, 100, HASH(0x557b44474a80), skip, , mirror $VAR1 = { 'drive-scsi0' => {} }; vanished: 1 drive-scsi0: Cancelling block job drive-scsi0: Done. 2021-11-26 11:29:19 ERROR: online migrate failure - block job (mirror) error: drive-scsi0: 'mirror' has been cancelled 2021-11-26 11:29:19 aborting phase 2 - cleanup resources 2021-11-26 11:29:19 migrate_cancel 2021-11-26 11:29:22 ERROR: migration finished with problems (duration 00:00:12) TASK ERROR: migration problems What I also see on "pvf" is that the plugin actually creates the DRBD block device, and "something" even tries to write data to it, as the DRBD device auto-promotes to Primary. Any hints on how I can debug this further? The block device should be ready at that point. What is going on in the background here? FWIW, the plugin can be found here: https://github.com/linbit/linstor-proxmox Regards, rck ^ permalink raw reply [flat|nested] 5+ messages in thread
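For context on what the failing step actually does: the "start migration to nbd:..." line in the log corresponds to a QMP drive-mirror command on the source VM. The sketch below is reconstructed from the log above, not taken from the PVE source, and only shows the core documented drive-mirror fields; PVE passes further arguments in practice.

```shell
# Sketch of the QMP command behind "scsi0: start migration to nbd:...".
# Reconstructed from the task log; PVE's real call includes additional
# arguments (e.g. format, granularity), so treat this as illustrative.
qmp_mirror='{ "execute": "drive-mirror",
  "arguments": {
    "device": "drive-scsi0",
    "target": "nbd:unix:/run/qemu-server/100_nbd.migrate:exportname=drive-scsi0",
    "sync": "full" } }'
printf '%s\n' "$qmp_mirror"
```

If this command never manages to register a job, a later query-block-jobs comes back empty, which is exactly the "vanished" case the debug output shows.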
* Re: [pve-devel] migrate local -> drbd fails with vanished job 2021-11-26 13:03 [pve-devel] migrate local -> drbd fails with vanished job Roland Kammerer @ 2021-11-26 16:06 ` Fabian Grünbichler 2021-11-29 8:48 ` Roland Kammerer 2021-11-29 8:20 ` Fabian Ebner 1 sibling, 1 reply; 5+ messages in thread From: Fabian Grünbichler @ 2021-11-26 16:06 UTC (permalink / raw) To: Proxmox VE development discussion On November 26, 2021 2:03 pm, Roland Kammerer wrote: > Dear PVE devs, I'll add some notes inline to tell you other places where to continue debugging > > While most of our users start with fresh VMs on DRBD storage, from time > to time people try to migrate a local VM to DRBD storage. This currently fails. > Migrating VMs from DRBD to DRBD works. > > I added some debug code to PVE/QemuServer.pm, which looks like the location > things go wrong, or at least where I saw them going wrong: > > root@pve:/usr/share/perl5/PVE# diff -Nur QemuServer.pm{.orig,} > --- QemuServer.pm.orig 2021-11-26 11:27:28.879989894 +0100 > +++ QemuServer.pm 2021-11-26 11:26:30.490988789 +0100 > @@ -7390,6 +7390,8 @@ > $completion //= 'complete'; > $op //= "mirror"; > > + print "$vmid, $vmiddst, $jobs, $completion, $qga, $op \n"; > + { use Data::Dumper; print Dumper($jobs); }; > eval { > my $err_complete = 0; > > @@ -7419,6 +7421,7 @@ > next; > } > > + print "vanished: $vanished\n"; # same as !defined($jobs) > die "$job_id: '$op' has been cancelled\n" if !defined($job); > > my $busy = $job->{busy}; > > > With that in place, I try to live migrate the running VM from node "pve" to > "pvf": > > 2021-11-26 11:29:10 starting migration of VM 100 to node 'pvf' (xx.xx.xx.xx) > 2021-11-26 11:29:10 found local disk 'local-lvm:vm-100-disk-0' (in current VM config) > 2021-11-26 11:29:10 starting VM 100 on remote node 'pvf' > 2021-11-26 11:29:18 volume 'local-lvm:vm-100-disk-0' is 'drbdstorage:vm-100-disk-1' on the target so at this point the VM is started on the remote node (with some special parameters for
migration), and it gave us the info that it allocated a volume for the storage migration. for that volume an NBD server/export has been started, the target VM itself is suspended at this point and will only resume once all the data (disk and RAM/device state) has been transferred and the source VM has been suspended in turn. > 2021-11-26 11:29:18 start remote tunnel > 2021-11-26 11:29:19 ssh tunnel ver 1 > 2021-11-26 11:29:19 starting storage migration connected over SSH, forwarding the sockets/ports for storage and RAM/state migration > 2021-11-26 11:29:19 scsi0: start migration to nbd:unix:/run/qemu-server/100_nbd.migrate:exportname=drive-scsi0 > drive mirror is starting for drive-scsi0 drive-mirror started for the drive that changes storage > Use of uninitialized value $qga in concatenation (.) or string at /usr/share/perl5/PVE/QemuServer.pm line 7393. expected > 100, 100, HASH(0x557b44474a80), skip, , mirror > $VAR1 = { > 'drive-scsi0' => {} > }; this looks okay (note: you could just pass @_ to Dumper to print all args, handling undef as well ;)) > vanished: 1 but the job is not there when we ask qemu about it (query-block-jobs). this should normally return either a still-running job, or one that has converged so we can keep it in that state, proceed to the next disk (or VM state if it was the last one), and then finally tell all those converged mirror jobs to switch over to the target node when we have finished transferring the state (PVE::QemuMigrate). so the question is whether the drive-mirror even starts successfully? > drive-scsi0: Cancelling block job > drive-scsi0: Done. > 2021-11-26 11:29:19 ERROR: online migrate failure - block job (mirror) error: drive-scsi0: 'mirror' has been cancelled here the log order is a bit confusing - the error handler cancelling the block job is run in any case if an error occurs, and then the original error (has been cancelled, which actually is 'has vanished' here) is printed last.
so just ignore the Cancelling part, we don't actively cancel in this case, because there is no job to cancel ;) > 2021-11-26 11:29:19 aborting phase 2 - cleanup resources > 2021-11-26 11:29:19 migrate_cancel > 2021-11-26 11:29:22 ERROR: migration finished with problems (duration 00:00:12) > TASK ERROR: migration problems > > What I also see on "pvf" is that the plugin actually creates the DRBD block > device, and "something" even tries to write data to it, as the DRBD devices > auto-promotes to Primary. that might just be the start codepath allocating the volume and then starting the VM suspended with an NBD server. > > Any hints how I can debug that further? The block device should be ready at > that point. What is going on in the background here? are there any messages on either end in the journal? you can of course add a sleep at the start of qemu_drive_mirror_monitor, and interact with the VM via QMP yourself to see what state the disks and block jobs are in. in case there is nothing (or nothing meaningful) there, it's also possible to use Qemu's trace mechanism to get a feeling for how far it gets (you can set it up via the 'args' config option in the VM config) > > FWIW the plugin can be found here: > https://github.com/linbit/linstor-proxmox > > Regards, rck > > _______________________________________________ > pve-devel mailing list > pve-devel@lists.proxmox.com > https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel > > > ^ permalink raw reply [flat|nested] 5+ messages in thread
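The manual QMP interaction suggested here can look like the following sketch. The monitor socket path assumes the standard PVE layout for VM 100, and socat must be installed; both are assumptions, adjust for your setup.

```shell
# Queries to inspect disk and block-job state by hand while the migration
# is paused (e.g. via a sleep at the start of qemu_drive_mirror_monitor).
# /var/run/qemu-server/100.qmp is the assumed PVE monitor socket for VM 100.
qmp_socket=/var/run/qemu-server/100.qmp
qmp_inspect='{"execute":"qmp_capabilities"}
{"execute":"query-block-jobs"}
{"execute":"query-block"}'
# To actually send them (only while the monitor is not held by pvedaemon):
#   printf '%s\n' "$qmp_inspect" | socat - "UNIX-CONNECT:$qmp_socket"
printf '%s\n' "$qmp_inspect"
```

An empty "return" array from query-block-jobs right after drive-mirror was issued would confirm the job really is gone rather than merely unmonitored.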
* Re: [pve-devel] migrate local -> drbd fails with vanished job 2021-11-26 16:06 ` Fabian Grünbichler @ 2021-11-29 8:48 ` Roland Kammerer 0 siblings, 0 replies; 5+ messages in thread From: Roland Kammerer @ 2021-11-29 8:48 UTC (permalink / raw) To: pve-devel Hi Fabian, thanks for the walk-through, that clears things up a bit. On Fri, Nov 26, 2021 at 05:06:05PM +0100, Fabian Grünbichler wrote: > are there any messages on either end in the journal? I'm afraid they are not too useful, but maybe you see something that rings a bell. Some notes: - The first DRBD Primary is us forcing one node to be Primary (and switching it back to Secondary immediately), just so that we kick off the initial sync. The second Primary then might very well be starting the VM as you explained. - The destination disk is about 500 MB larger than the source, which was just me adding some extra space in my debug code in the plugin to avoid any rounding errors / "disk slightly too small" errors. What could be a problem here? Different sizes? Different block sizes? If it does not fail somewhere else/earlier, what should make NBD writes special? Strange. Regards, rck source node (.119/pve): Nov 29 09:06:01 pve Controller[1213]: 09:06:01.628 [MainWorkerPool-1] INFO LINSTOR/Controller - SYSTEM - Drbd-auto-verify-Algo for vm-100-disk-1 automatically set to crct10dif-pclmul Nov 29 09:06:12 pve pmxcfs[982]: [status] notice: received log Nov 29 09:06:12 pve pvedaemon[3660]: Use of uninitialized value $qga in concatenation (.) or string at /usr/share/perl5/PVE/QemuServer.pm line 7393.
Nov 29 09:06:12 pve pvedaemon[3660]: VM 100 qmp command failed - VM 100 qmp command 'block-job-cancel' failed - Block job 'drive-scsi0' not found Nov 29 09:06:13 pve pmxcfs[982]: [status] notice: received log Nov 29 09:06:13 pve pmxcfs[982]: [status] notice: received log Nov 29 09:06:14 pve pmxcfs[982]: [status] notice: received log Nov 29 09:06:15 pve pmxcfs[982]: [status] notice: received log Nov 29 09:06:16 pve pvedaemon[3660]: migration problems Nov 29 09:06:16 pve pvedaemon[1198]: <root@pam> end task UPID:pve:00000E4C:00008799:61A489E7:qmigrate:100:root@pam: migration problems Nov 29 09:10:14 pve pvedaemon[1198]: <root@pam> starting task UPID:pve:000013D6:0000EAFA:61A48AE6:qmigrate:100:root@pam: Nov 29 09:10:15 pve pmxcfs[982]: [status] notice: received log Nov 29 09:10:15 pve Controller[1213]: 09:10:15.579 [MainWorkerPool-1] INFO LINSTOR/Controller - SYSTEM - Volume definition with number '0' successfully created in resource definition 'vm-100-disk-1'. Nov 29 09:10:15 pve Controller[1213]: 09:10:15.651 [MainWorkerPool-1] INFO LINSTOR/Controller - SYSTEM - Drbd-auto-verify-Algo for vm-100-disk-1 automatically set to crct10dif-pclmul Nov 29 09:10:25 pve pmxcfs[982]: [status] notice: received log Nov 29 09:10:26 pve pvedaemon[5078]: Use of uninitialized value $qga in concatenation (.) or string at /usr/share/perl5/PVE/QemuServer.pm line 7393. 
Nov 29 09:10:26 pve pvedaemon[5078]: VM 100 qmp command failed - VM 100 qmp command 'block-job-cancel' failed - Block job 'drive-scsi0' not found Nov 29 09:10:27 pve pmxcfs[982]: [status] notice: received log Nov 29 09:10:27 pve pmxcfs[982]: [status] notice: received log Nov 29 09:10:28 pve pmxcfs[982]: [status] notice: received log Nov 29 09:10:28 pve pmxcfs[982]: [status] notice: received log Nov 29 09:10:29 pve pvedaemon[5078]: migration problems Nov 29 09:10:29 pve pvedaemon[1198]: <root@pam> end task UPID:pve:000013D6:0000EAFA:61A48AE6:qmigrate:100:root@pam: migration problems target node (.120/pvf): Nov 29 09:06:16 pvf sshd[2645]: pam_unix(sshd:session): session opened for user root(uid=0) by (uid=0) Nov 29 09:06:16 pvf systemd[1]: Started Session 8 of user root. Nov 29 09:06:16 pvf systemd-logind[711]: New session 8 of user root. Nov 29 09:06:16 pvf sshd[2645]: Received disconnect from 10.43.8.119 port 60698:11: disconnected by user Nov 29 09:06:16 pvf sshd[2645]: Disconnected from user root 10.43.8.119 port 60698 Nov 29 09:06:16 pvf sshd[2645]: pam_unix(sshd:session): session closed for user root Nov 29 09:06:16 pvf systemd-logind[711]: Session 8 logged out. Waiting for processes to exit. Nov 29 09:06:16 pvf systemd[1]: session-8.scope: Succeeded. Nov 29 09:06:16 pvf systemd-logind[711]: Removed session 8. Nov 29 09:06:16 pvf pmxcfs[897]: [status] notice: received log Nov 29 09:10:14 pvf pmxcfs[897]: [status] notice: received log Nov 29 09:10:14 pvf sshd[3188]: Accepted publickey for root from 10.43.8.119 port 60700 ssh2: RSA SHA256:3YNHg8CC3M6IPFUA9PlP5B7wnm2nbFLpnXd09j6k5yU Nov 29 09:10:14 pvf sshd[3188]: pam_unix(sshd:session): session opened for user root(uid=0) by (uid=0) Nov 29 09:10:14 pvf systemd[1]: Started Session 9 of user root. Nov 29 09:10:14 pvf systemd-logind[711]: New session 9 of user root. 
Nov 29 09:10:14 pvf sshd[3188]: Received disconnect from 10.43.8.119 port 60700:11: disconnected by user Nov 29 09:10:14 pvf sshd[3188]: Disconnected from user root 10.43.8.119 port 60700 Nov 29 09:10:14 pvf sshd[3188]: pam_unix(sshd:session): session closed for user root Nov 29 09:10:14 pvf systemd-logind[711]: Session 9 logged out. Waiting for processes to exit. Nov 29 09:10:14 pvf systemd[1]: session-9.scope: Succeeded. Nov 29 09:10:14 pvf systemd-logind[711]: Removed session 9. Nov 29 09:10:14 pvf sshd[3195]: Accepted publickey for root from 10.43.8.119 port 60702 ssh2: RSA SHA256:3YNHg8CC3M6IPFUA9PlP5B7wnm2nbFLpnXd09j6k5yU Nov 29 09:10:14 pvf sshd[3195]: pam_unix(sshd:session): session opened for user root(uid=0) by (uid=0) Nov 29 09:10:14 pvf systemd[1]: Started Session 10 of user root. Nov 29 09:10:14 pvf systemd-logind[711]: New session 10 of user root. Nov 29 09:10:15 pvf qm[3201]: <root@pam> starting task UPID:pvf:00000C82:0000EA8E:61A48AE7:qmstart:100:root@pam: Nov 29 09:10:15 pvf qm[3202]: start VM 100: UPID:pvf:00000C82:0000EA8E:61A48AE7:qmstart:100:root@pam: Nov 29 09:10:15 pvf Satellite[812]: 09:10:15.686 [MainWorkerPool-1] INFO LINSTOR/Satellite - SYSTEM - Resource 'vm-100-disk-1' created for node 'pvf'. 
Nov 29 09:10:19 pvf kernel: drbd vm-100-disk-1: Starting worker thread (from drbdsetup [3254]) Nov 29 09:10:19 pvf kernel: drbd vm-100-disk-1/0 drbd1001: meta-data IO uses: blk-bio Nov 29 09:10:19 pvf kernel: drbd vm-100-disk-1/0 drbd1001: rs_discard_granularity feature disabled Nov 29 09:10:19 pvf kernel: drbd vm-100-disk-1/0 drbd1001: disk( Diskless -> Attaching ) Nov 29 09:10:19 pvf kernel: drbd vm-100-disk-1/0 drbd1001: Maximum number of peer devices = 7 Nov 29 09:10:19 pvf kernel: drbd vm-100-disk-1: Method to ensure write ordering: flush Nov 29 09:10:19 pvf kernel: drbd vm-100-disk-1/0 drbd1001: drbd_bm_resize called with capacity == 3128600 Nov 29 09:10:19 pvf kernel: drbd vm-100-disk-1/0 drbd1001: resync bitmap: bits=391075 words=42777 pages=84 Nov 29 09:10:19 pvf kernel: drbd1001: detected capacity change from 0 to 3128600 Nov 29 09:10:19 pvf kernel: drbd vm-100-disk-1/0 drbd1001: size = 1528 MB (1564300 KB) Nov 29 09:10:19 pvf kernel: drbd vm-100-disk-1/0 drbd1001: recounting of set bits took additional 0ms Nov 29 09:10:19 pvf kernel: drbd vm-100-disk-1/0 drbd1001: disk( Attaching -> Inconsistent ) Nov 29 09:10:19 pvf kernel: drbd vm-100-disk-1/0 drbd1001: attached to current UUID: 0000000000000004 Nov 29 09:10:19 pvf Satellite[812]: 09:10:19.270 [MainWorkerPool-1] INFO LINSTOR/Satellite - SYSTEM - Primary Resource vm-100-disk-1 Nov 29 09:10:19 pvf Satellite[812]: 09:10:19.271 [MainWorkerPool-1] INFO LINSTOR/Satellite - SYSTEM - Primary bool set on Resource vm-100-disk-1 Nov 29 09:10:19 pvf Satellite[812]: 09:10:19.293 [MainWorkerPool-1] INFO LINSTOR/Satellite - SYSTEM - Resource 'vm-100-disk-1' updated for node 'pvf'. 
Nov 29 09:10:19 pvf kernel: drbd vm-100-disk-1/0 drbd1001: rs_discard_granularity feature disabled Nov 29 09:10:19 pvf kernel: drbd vm-100-disk-1: Preparing cluster-wide state change 126926591 (0->-1 7683/4609) Nov 29 09:10:19 pvf kernel: drbd vm-100-disk-1: Committing cluster-wide state change 126926591 (0ms) Nov 29 09:10:19 pvf kernel: drbd vm-100-disk-1: role( Secondary -> Primary ) Nov 29 09:10:19 pvf kernel: drbd vm-100-disk-1/0 drbd1001: disk( Inconsistent -> UpToDate ) Nov 29 09:10:19 pvf kernel: drbd vm-100-disk-1/0 drbd1001: size = 1528 MB (1564300 KB) Nov 29 09:10:19 pvf kernel: drbd vm-100-disk-1: Forced to consider local data as UpToDate! Nov 29 09:10:19 pvf kernel: drbd vm-100-disk-1/0 drbd1001: new current UUID: E1541A8E1BAC0761 weak: FFFFFFFFFFFFFFFE Nov 29 09:10:19 pvf kernel: drbd vm-100-disk-1: role( Primary -> Secondary ) Nov 29 09:10:19 pvf kernel: drbd vm-100-disk-1: Preparing cluster-wide state change 594297135 (0->-1 3/1) Nov 29 09:10:19 pvf kernel: drbd vm-100-disk-1: Committing cluster-wide state change 594297135 (0ms) Nov 29 09:10:19 pvf kernel: drbd vm-100-disk-1: role( Secondary -> Primary ) Nov 29 09:10:19 pvf kernel: drbd vm-100-disk-1: role( Primary -> Secondary ) Nov 29 09:10:19 pvf Satellite[812]: 09:10:19.589 [MainWorkerPool-1] INFO LINSTOR/Satellite - SYSTEM - Resource 'vm-100-disk-1' updated for node 'pvf'. Nov 29 09:10:19 pvf kernel: drbd vm-100-disk-1/0 drbd1001: rs_discard_granularity feature disabled Nov 29 09:10:24 pvf systemd[1]: Started 100.scope. Nov 29 09:10:24 pvf systemd-udevd[3410]: Using default interface naming scheme 'v247'. Nov 29 09:10:24 pvf systemd-udevd[3410]: ethtool: autonegotiation is unset or enabled, the speed and duplex are not writable. Nov 29 09:10:25 pvf kernel: device tap100i0 entered promiscuous mode Nov 29 09:10:25 pvf systemd-udevd[3419]: Using default interface naming scheme 'v247'. 
Nov 29 09:10:25 pvf systemd-udevd[3419]: ethtool: autonegotiation is unset or enabled, the speed and duplex are not writable. Nov 29 09:10:25 pvf systemd-udevd[3410]: ethtool: autonegotiation is unset or enabled, the speed and duplex are not writable. Nov 29 09:10:25 pvf systemd-udevd[3434]: ethtool: autonegotiation is unset or enabled, the speed and duplex are not writable. Nov 29 09:10:25 pvf systemd-udevd[3434]: Using default interface naming scheme 'v247'. Nov 29 09:10:25 pvf kernel: fwbr100i0: port 1(fwln100i0) entered blocking state Nov 29 09:10:25 pvf kernel: fwbr100i0: port 1(fwln100i0) entered disabled state Nov 29 09:10:25 pvf kernel: device fwln100i0 entered promiscuous mode Nov 29 09:10:25 pvf kernel: fwbr100i0: port 1(fwln100i0) entered blocking state Nov 29 09:10:25 pvf kernel: fwbr100i0: port 1(fwln100i0) entered forwarding state Nov 29 09:10:25 pvf kernel: vmbr0: port 2(fwpr100p0) entered blocking state Nov 29 09:10:25 pvf kernel: vmbr0: port 2(fwpr100p0) entered disabled state Nov 29 09:10:25 pvf kernel: device fwpr100p0 entered promiscuous mode Nov 29 09:10:25 pvf kernel: vmbr0: port 2(fwpr100p0) entered blocking state Nov 29 09:10:25 pvf kernel: vmbr0: port 2(fwpr100p0) entered forwarding state Nov 29 09:10:25 pvf kernel: fwbr100i0: port 2(tap100i0) entered blocking state Nov 29 09:10:25 pvf kernel: fwbr100i0: port 2(tap100i0) entered disabled state Nov 29 09:10:25 pvf kernel: fwbr100i0: port 2(tap100i0) entered blocking state Nov 29 09:10:25 pvf kernel: fwbr100i0: port 2(tap100i0) entered forwarding state Nov 29 09:10:25 pvf kernel: drbd vm-100-disk-1: Preparing cluster-wide state change 523588126 (0->-1 3/1) Nov 29 09:10:25 pvf kernel: drbd vm-100-disk-1: Committing cluster-wide state change 523588126 (0ms) Nov 29 09:10:25 pvf kernel: drbd vm-100-disk-1: role( Secondary -> Primary ) Nov 29 09:10:25 pvf qm[3201]: <root@pam> end task UPID:pvf:00000C82:0000EA8E:61A48AE7:qmstart:100:root@pam: OK Nov 29 09:10:25 pvf sshd[3195]: Received disconnect 
from 10.43.8.119 port 60702:11: disconnected by user Nov 29 09:10:25 pvf sshd[3195]: Disconnected from user root 10.43.8.119 port 60702 Nov 29 09:10:25 pvf sshd[3195]: pam_unix(sshd:session): session closed for user root Nov 29 09:10:25 pvf systemd-logind[711]: Session 10 logged out. Waiting for processes to exit. Nov 29 09:10:25 pvf systemd[1]: session-10.scope: Succeeded. Nov 29 09:10:25 pvf systemd-logind[711]: Removed session 10. Nov 29 09:10:25 pvf sshd[3463]: Accepted publickey for root from 10.43.8.119 port 60706 ssh2: RSA SHA256:3YNHg8CC3M6IPFUA9PlP5B7wnm2nbFLpnXd09j6k5yU Nov 29 09:10:25 pvf sshd[3463]: pam_unix(sshd:session): session opened for user root(uid=0) by (uid=0) Nov 29 09:10:25 pvf systemd[1]: Started Session 11 of user root. Nov 29 09:10:25 pvf systemd-logind[711]: New session 11 of user root. Nov 29 09:10:26 pvf sshd[3470]: Accepted publickey for root from 10.43.8.119 port 60708 ssh2: RSA SHA256:3YNHg8CC3M6IPFUA9PlP5B7wnm2nbFLpnXd09j6k5yU Nov 29 09:10:26 pvf sshd[3470]: pam_unix(sshd:session): session opened for user root(uid=0) by (uid=0) Nov 29 09:10:26 pvf systemd[1]: Started Session 12 of user root. Nov 29 09:10:26 pvf systemd-logind[711]: New session 12 of user root. 
Nov 29 09:10:27 pvf qm[3476]: <root@pam> starting task UPID:pvf:00000D95:0000EF22:61A48AF3:qmstop:100:root@pam: Nov 29 09:10:27 pvf qm[3477]: stop VM 100: UPID:pvf:00000D95:0000EF22:61A48AF3:qmstop:100:root@pam: Nov 29 09:10:27 pvf QEMU[3406]: kvm: terminating on signal 15 from pid 3477 (task UPID:pvf:00000D95:0000EF22:61A48AF3:qmstop:100:root@pam:) Nov 29 09:10:27 pvf kernel: drbd vm-100-disk-1: role( Primary -> Secondary ) Nov 29 09:10:27 pvf qm[3476]: <root@pam> end task UPID:pvf:00000D95:0000EF22:61A48AF3:qmstop:100:root@pam: OK Nov 29 09:10:27 pvf sshd[3470]: Received disconnect from 10.43.8.119 port 60708:11: disconnected by user Nov 29 09:10:27 pvf sshd[3470]: Disconnected from user root 10.43.8.119 port 60708 Nov 29 09:10:27 pvf sshd[3470]: pam_unix(sshd:session): session closed for user root Nov 29 09:10:27 pvf systemd[1]: session-12.scope: Succeeded. Nov 29 09:10:27 pvf systemd-logind[711]: Session 12 logged out. Waiting for processes to exit. Nov 29 09:10:27 pvf systemd-logind[711]: Removed session 12. Nov 29 09:10:27 pvf sshd[3479]: Accepted publickey for root from 10.43.8.119 port 60710 ssh2: RSA SHA256:3YNHg8CC3M6IPFUA9PlP5B7wnm2nbFLpnXd09j6k5yU Nov 29 09:10:27 pvf sshd[3479]: pam_unix(sshd:session): session opened for user root(uid=0) by (uid=0) Nov 29 09:10:27 pvf systemd[1]: Started Session 13 of user root. Nov 29 09:10:27 pvf systemd-logind[711]: New session 13 of user root. 
Nov 29 09:10:27 pvf kernel: fwbr100i0: port 2(tap100i0) entered disabled state Nov 29 09:10:27 pvf kernel: fwbr100i0: port 1(fwln100i0) entered disabled state Nov 29 09:10:27 pvf kernel: vmbr0: port 2(fwpr100p0) entered disabled state Nov 29 09:10:27 pvf kernel: device fwln100i0 left promiscuous mode Nov 29 09:10:27 pvf kernel: fwbr100i0: port 1(fwln100i0) entered disabled state Nov 29 09:10:27 pvf kernel: device fwpr100p0 left promiscuous mode Nov 29 09:10:27 pvf kernel: vmbr0: port 2(fwpr100p0) entered disabled state Nov 29 09:10:27 pvf qmeventd[719]: read: Connection reset by peer Nov 29 09:10:27 pvf systemd[1]: 100.scope: Succeeded. Nov 29 09:10:28 pvf pvesm[3485]: <root@pam> starting task UPID:pvf:00000DA8:0000EF87:61A48AF4:imgdel:100@drbdstorage:root@pam: Nov 29 09:10:28 pvf Satellite[812]: 09:10:28.144 [MainWorkerPool-1] INFO LINSTOR/Satellite - SYSTEM - Resource 'vm-100-disk-1' updated for node 'pvf'. Nov 29 09:10:28 pvf kernel: drbd vm-100-disk-1/0 drbd1001: disk( UpToDate -> Detaching ) Nov 29 09:10:28 pvf kernel: drbd vm-100-disk-1/0 drbd1001: disk( Detaching -> Diskless ) Nov 29 09:10:28 pvf kernel: drbd vm-100-disk-1/0 drbd1001: drbd_bm_resize called with capacity == 0 Nov 29 09:10:28 pvf kernel: drbd vm-100-disk-1: Terminating worker thread Nov 29 09:10:28 pvf pvesm[3485]: <root@pam> end task UPID:pvf:00000DA8:0000EF87:61A48AF4:imgdel:100@drbdstorage:root@pam: OK Nov 29 09:10:28 pvf sshd[3479]: Received disconnect from 10.43.8.119 port 60710:11: disconnected by user Nov 29 09:10:28 pvf sshd[3479]: Disconnected from user root 10.43.8.119 port 60710 Nov 29 09:10:28 pvf sshd[3479]: pam_unix(sshd:session): session closed for user root Nov 29 09:10:28 pvf systemd-logind[711]: Session 13 logged out. Waiting for processes to exit. Nov 29 09:10:28 pvf systemd[1]: session-13.scope: Succeeded. Nov 29 09:10:28 pvf systemd-logind[711]: Removed session 13. 
Nov 29 09:10:28 pvf sshd[3463]: Received disconnect from 10.43.8.119 port 60706:11: disconnected by user Nov 29 09:10:28 pvf sshd[3463]: Disconnected from user root 10.43.8.119 port 60706 Nov 29 09:10:28 pvf sshd[3463]: pam_unix(sshd:session): session closed for user root Nov 29 09:10:28 pvf systemd-logind[711]: Session 11 logged out. Waiting for processes to exit. Nov 29 09:10:28 pvf systemd[1]: session-11.scope: Succeeded. Nov 29 09:10:28 pvf systemd-logind[711]: Removed session 11. Nov 29 09:10:29 pvf sshd[3534]: Accepted publickey for root from 10.43.8.119 port 60712 ssh2: RSA SHA256:3YNHg8CC3M6IPFUA9PlP5B7wnm2nbFLpnXd09j6k5yU Nov 29 09:10:29 pvf sshd[3534]: pam_unix(sshd:session): session opened for user root(uid=0) by (uid=0) Nov 29 09:10:29 pvf systemd[1]: Started Session 14 of user root. Nov 29 09:10:29 pvf systemd-logind[711]: New session 14 of user root. Nov 29 09:10:29 pvf sshd[3534]: Received disconnect from 10.43.8.119 port 60712:11: disconnected by user Nov 29 09:10:29 pvf sshd[3534]: Disconnected from user root 10.43.8.119 port 60712 Nov 29 09:10:29 pvf sshd[3534]: pam_unix(sshd:session): session closed for user root Nov 29 09:10:29 pvf systemd-logind[711]: Session 14 logged out. Waiting for processes to exit. Nov 29 09:10:29 pvf systemd[1]: session-14.scope: Succeeded. Nov 29 09:10:29 pvf systemd-logind[711]: Removed session 14. Nov 29 09:10:29 pvf pmxcfs[897]: [status] notice: received log ^ permalink raw reply [flat|nested] 5+ messages in thread
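Regarding the earlier question of what should make NBD writes special: while the suspended target VM still has its NBD server up, the export can be exercised directly with qemu-io. The invocation below is a debugging suggestion, not something PVE runs itself; the socket path and export name are taken from the migration log.

```shell
# Probe the NBD export the target node publishes for the migration.
# Run on the source node while the target VM's NBD server still exists
# (e.g. during a sleep added to qemu_drive_mirror_monitor).
nbd_target='nbd:unix:/run/qemu-server/100_nbd.migrate:exportname=drive-scsi0'
# Report the export's length, then attempt a small patterned write:
#   qemu-io -f raw -c 'length' -c 'write -P 0xab 0 4k' "$nbd_target"
printf '%s\n' "$nbd_target"
```

If the write succeeds by hand, the export itself is healthy and the problem is more likely on the source side, e.g. drive-mirror rejecting the target.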
* Re: [pve-devel] migrate local -> drbd fails with vanished job 2021-11-26 13:03 [pve-devel] migrate local -> drbd fails with vanished job Roland Kammerer 2021-11-26 16:06 ` Fabian Grünbichler @ 2021-11-29 8:20 ` Fabian Ebner 2021-11-29 9:03 ` Roland Kammerer 1 sibling, 1 reply; 5+ messages in thread From: Fabian Ebner @ 2021-11-29 8:20 UTC (permalink / raw) To: pve-devel, Fabian Grünbichler, roland.kammerer Am 26.11.21 um 14:03 schrieb Roland Kammerer: > Dear PVE devs, > > While most of our users start with fresh VMs on DRBD storage, from time > to time people try to migrate a local VM to DRBD storage. This currently fails. > Migrating VMs from DRBD to DRBD works. Hi, there is also a forum thread about this [0]. It seems like the newly allocated disks are slightly bigger on the target DRBD storage. Just a hunch, but it does sound similar to [1]. [0]: https://forum.proxmox.com/threads/online-migration-disk-move-problem.100171 [1]: https://bugzilla.proxmox.com/show_bug.cgi?id=3227#c8 > > I added some debug code to PVE/QemuServer.pm, which looks like the location > things go wrong, or at least where I saw them going wrong: > > root@pve:/usr/share/perl5/PVE# diff -Nur QemuServer.pm{.orig,} > --- QemuServer.pm.orig 2021-11-26 11:27:28.879989894 +0100 > +++ QemuServer.pm 2021-11-26 11:26:30.490988789 +0100 > @@ -7390,6 +7390,8 @@ > $completion //= 'complete'; > $op //= "mirror"; > > + print "$vmid, $vmiddst, $jobs, $completion, $qga, $op \n"; > + { use Data::Dumper; print Dumper($jobs); }; > eval { > my $err_complete = 0; > > @@ -7419,6 +7421,7 @@ > next; > } > > + print "vanished: $vanished\n"; # same as !defined($jobs) > die "$job_id: '$op' has been cancelled\n" if !defined($job); > > my $busy = $job->{busy}; > > > With that in place, I try to live migrate the running VM from node "pve" to > "pvf": > > 2021-11-26 11:29:10 starting migration of VM 100 to node 'pvf' (xx.xx.xx.xx) > 2021-11-26 11:29:10 found local disk 'local-lvm:vm-100-disk-0' (in current VM config) >
2021-11-26 11:29:10 starting VM 100 on remote node 'pvf' > 2021-11-26 11:29:18 volume 'local-lvm:vm-100-disk-0' is 'drbdstorage:vm-100-disk-1' on the target > 2021-11-26 11:29:18 start remote tunnel > 2021-11-26 11:29:19 ssh tunnel ver 1 > 2021-11-26 11:29:19 starting storage migration > 2021-11-26 11:29:19 scsi0: start migration to nbd:unix:/run/qemu-server/100_nbd.migrate:exportname=drive-scsi0 > drive mirror is starting for drive-scsi0 > Use of uninitialized value $qga in concatenation (.) or string at /usr/share/perl5/PVE/QemuServer.pm line 7393. > 100, 100, HASH(0x557b44474a80), skip, , mirror > $VAR1 = { > 'drive-scsi0' => {} > }; > vanished: 1 > drive-scsi0: Cancelling block job > drive-scsi0: Done. > 2021-11-26 11:29:19 ERROR: online migrate failure - block job (mirror) error: drive-scsi0: 'mirror' has been cancelled > 2021-11-26 11:29:19 aborting phase 2 - cleanup resources > 2021-11-26 11:29:19 migrate_cancel > 2021-11-26 11:29:22 ERROR: migration finished with problems (duration 00:00:12) > TASK ERROR: migration problems > > What I also see on "pvf" is that the plugin actually creates the DRBD block > device, and "something" even tries to write data to it, as the DRBD devices > auto-promotes to Primary. > > Any hints how I can debug that further? The block device should be ready at > that point. What is going on in the background here? > > FWIW the plugin can be found here: > https://github.com/linbit/linstor-proxmox > > Regards, rck > > _______________________________________________ > pve-devel mailing list > pve-devel@lists.proxmox.com > https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel > > ^ permalink raw reply [flat|nested] 5+ messages in thread
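A quick way to test the size hunch above is to compare the exact byte sizes of the source and target volumes. The device paths in the comments are assumptions based on the default storage names; the arithmetic below uses the 1564300 KiB size from the DRBD kernel log plus the ~500 MB debug padding mentioned in the thread, with sparse files standing in for the real devices.

```shell
# On the real nodes one would compare (paths are assumptions):
#   blockdev --getsize64 /dev/pve/vm-100-disk-0   # source volume on "pve"
#   blockdev --getsize64 /dev/drbd1001            # target volume on "pvf"
# Demonstration with sparse files of the sizes from this thread:
truncate -s 1564300K /tmp/src.img                 # source size from the log
truncate -s $((1564300 + 512000))K /tmp/dst.img   # target ~500 MiB larger
src_bytes=$(stat -c%s /tmp/src.img)
dst_bytes=$(stat -c%s /tmp/dst.img)
echo "src=$src_bytes dst=$dst_bytes diff=$((dst_bytes - src_bytes))"
rm -f /tmp/src.img /tmp/dst.img
```

Any nonzero diff here is suspect: a mirror target that does not match the source size exactly is precisely the situation described in [1].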
* Re: [pve-devel] migrate local -> drbd fails with vanished job 2021-11-29 8:20 ` Fabian Ebner @ 2021-11-29 9:03 ` Roland Kammerer 0 siblings, 0 replies; 5+ messages in thread From: Roland Kammerer @ 2021-11-29 9:03 UTC (permalink / raw) To: pve-devel On Mon, Nov 29, 2021 at 09:20:04AM +0100, Fabian Ebner wrote: > Am 26.11.21 um 14:03 schrieb Roland Kammerer: > > Dear PVE devs, > > > > While most of our users start with fresh VMs on DRBD storage, from time > > to time people try to migrate a local VM to DRBD storage. This currently fails. > > Migrating VMs from DRBD to DRBD works. > > Hi, > there is also a forum thread about this [0]. It seems like the newly > allocated disks are slightly bigger on the target DRBD storage. Just a > hunch, but it does sound similar to [1]. > > [0]: > https://forum.proxmox.com/threads/online-migration-disk-move-problem.100171 > > [1]: https://bugzilla.proxmox.com/show_bug.cgi?id=3227#c8 Sorry Fabian, I saw your reply too late, but it looks like we had the same intention. It is always sizes and/or block sizes with these things :). And me adding some extra space for debugging might even have made it worse (okay, it was broken to begin with). Thanks, rck ^ permalink raw reply [flat|nested] 5+ messages in thread
end of thread, other threads:[~2021-11-29 9:03 UTC | newest] Thread overview: 5+ messages -- 2021-11-26 13:03 [pve-devel] migrate local -> drbd fails with vanished job Roland Kammerer 2021-11-26 16:06 ` Fabian Grünbichler 2021-11-29 8:48 ` Roland Kammerer 2021-11-29 8:20 ` Fabian Ebner 2021-11-29 9:03 ` Roland Kammerer