* [pve-devel] migrate local -> drbd fails with vanished job @ 2021-11-26 13:03 Roland Kammerer 2021-11-26 16:06 ` Fabian Grünbichler 2021-11-29 8:20 ` Fabian Ebner 0 siblings, 2 replies; 5+ messages in thread From: Roland Kammerer @ 2021-11-26 13:03 UTC (permalink / raw) To: pve-devel Dear PVE devs, While most of our users start with fresh VMs on DRBD storage, from time to time people try to migrate a local VM to DRBD storage. This currently fails. Migrating VMs from DRBD to DRBD works. I added some debug code to PVE/QemuServer.pm, which looks like the location things go wrong, or at least where I saw them going wrong: root@pve:/usr/share/perl5/PVE# diff -Nur QemuServer.pm{.orig,} --- QemuServer.pm.orig 2021-11-26 11:27:28.879989894 +0100 +++ QemuServer.pm 2021-11-26 11:26:30.490988789 +0100 @@ -7390,6 +7390,8 @@ $completion //= 'complete'; $op //= "mirror"; + print "$vmid, $vmiddst, $jobs, $completion, $qga, $op \n"; + { use Data::Dumper; print Dumper($jobs); }; eval { my $err_complete = 0; @@ -7419,6 +7421,7 @@ next; } + print "vanished: $vanished\n"; # same as !defined($jobs) die "$job_id: '$op' has been cancelled\n" if !defined($job); my $busy = $job->{busy}; With that in place, I try to live migrate the running VM from node "pve" to "pvf": 2021-11-26 11:29:10 starting migration of VM 100 to node 'pvf' (xx.xx.xx.xx) 2021-11-26 11:29:10 found local disk 'local-lvm:vm-100-disk-0' (in current VM config) 2021-11-26 11:29:10 starting VM 100 on remote node 'pvf' 2021-11-26 11:29:18 volume 'local-lvm:vm-100-disk-0' is 'drbdstorage:vm-100-disk-1' on the target 2021-11-26 11:29:18 start remote tunnel 2021-11-26 11:29:19 ssh tunnel ver 1 2021-11-26 11:29:19 starting storage migration 2021-11-26 11:29:19 scsi0: start migration to nbd:unix:/run/qemu-server/100_nbd.migrate:exportname=drive-scsi0 drive mirror is starting for drive-scsi0 Use of uninitialized value $qga in concatenation (.) or string at /usr/share/perl5/PVE/QemuServer.pm line 7393. 
100, 100, HASH(0x557b44474a80), skip, , mirror $VAR1 = { 'drive-scsi0' => {} }; vanished: 1 drive-scsi0: Cancelling block job drive-scsi0: Done. 2021-11-26 11:29:19 ERROR: online migrate failure - block job (mirror) error: drive-scsi0: 'mirror' has been cancelled 2021-11-26 11:29:19 aborting phase 2 - cleanup resources 2021-11-26 11:29:19 migrate_cancel 2021-11-26 11:29:22 ERROR: migration finished with problems (duration 00:00:12) TASK ERROR: migration problems What I also see on "pvf" is that the plugin actually creates the DRBD block device, and "something" even tries to write data to it, as the DRBD device auto-promotes to Primary. Any hints on how I can debug this further? The block device should be ready at that point. What is going on in the background here? FWIW, the plugin can be found here: https://github.com/linbit/linstor-proxmox Regards, rck ^ permalink raw reply [flat|nested] 5+ messages in thread
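For context on what the failing step actually does: the "start migration to nbd:..." line in the log corresponds to a QMP drive-mirror command on the source VM. The sketch below is reconstructed from the log above, not taken from the PVE source, and only shows the core documented drive-mirror fields; PVE passes further arguments in practice.

```shell
# Sketch of the QMP command behind "scsi0: start migration to nbd:...".
# Reconstructed from the task log; PVE's real call includes additional
# arguments (e.g. format, granularity), so treat this as illustrative.
qmp_mirror='{ "execute": "drive-mirror",
  "arguments": {
    "device": "drive-scsi0",
    "target": "nbd:unix:/run/qemu-server/100_nbd.migrate:exportname=drive-scsi0",
    "sync": "full" } }'
printf '%s\n' "$qmp_mirror"
```

If this command never manages to register a job, a later query-block-jobs comes back empty, which is exactly the "vanished" case the debug output shows.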
* Re: [pve-devel] migrate local -> drbd fails with vanished job 2021-11-26 13:03 [pve-devel] migrate local -> drbd fails with vanished job Roland Kammerer @ 2021-11-26 16:06 ` Fabian Grünbichler 2021-11-29 8:48 ` Roland Kammerer 2021-11-29 8:20 ` Fabian Ebner 1 sibling, 1 reply; 5+ messages in thread From: Fabian Grünbichler @ 2021-11-26 16:06 UTC (permalink / raw) To: Proxmox VE development discussion On November 26, 2021 2:03 pm, Roland Kammerer wrote: > Dear PVE devs, I'll add some notes inline to tell you other places where to continue debugging > > While most of our users start with fresh VMs on DRBD storage, from time > to time people try to migrate a local VM to DRBD storage. This currently fails. > Migrating VMs from DRBD to DRBD works. > > I added some debug code to PVE/QemuServer.pm, which looks like the location > things go wrong, or at least where I saw them going wrong: > > root@pve:/usr/share/perl5/PVE# diff -Nur QemuServer.pm{.orig,} > --- QemuServer.pm.orig 2021-11-26 11:27:28.879989894 +0100 > +++ QemuServer.pm 2021-11-26 11:26:30.490988789 +0100 > @@ -7390,6 +7390,8 @@ > $completion //= 'complete'; > $op //= "mirror"; > > + print "$vmid, $vmiddst, $jobs, $completion, $qga, $op \n"; > + { use Data::Dumper; print Dumper($jobs); }; > eval { > my $err_complete = 0; > > @@ -7419,6 +7421,7 @@ > next; > } > > + print "vanished: $vanished\n"; # same as !defined($jobs) > die "$job_id: '$op' has been cancelled\n" if !defined($job); > > my $busy = $job->{busy}; > > > With that in place, I try to live migrate the running VM from node "pve" to > "pvf": > > 2021-11-26 11:29:10 starting migration of VM 100 to node 'pvf' (xx.xx.xx.xx) > 2021-11-26 11:29:10 found local disk 'local-lvm:vm-100-disk-0' (in current VM config) > 2021-11-26 11:29:10 starting VM 100 on remote node 'pvf' > 2021-11-26 11:29:18 volume 'local-lvm:vm-100-disk-0' is 'drbdstorage:vm-100-disk-1' on the target so at this point the VM is started on the remote node (with some special parameters for
migration), and it gave us the info that it allocated a volume for the storage migration. for that volume an NBD server/export has been started, the target VM itself is suspended at this point and will only resume once all the data (disk and RAM/device state) has been transferred and the source VM has been suspended in turn. > 2021-11-26 11:29:18 start remote tunnel > 2021-11-26 11:29:19 ssh tunnel ver 1 > 2021-11-26 11:29:19 starting storage migration connected over SSH, forwarding the sockets/ports for storage and RAM/state migration > 2021-11-26 11:29:19 scsi0: start migration to nbd:unix:/run/qemu-server/100_nbd.migrate:exportname=drive-scsi0 > drive mirror is starting for drive-scsi0 drive-mirror started for the drive that changes storage > Use of uninitialized value $qga in concatenation (.) or string at /usr/share/perl5/PVE/QemuServer.pm line 7393. expected > 100, 100, HASH(0x557b44474a80), skip, , mirror > $VAR1 = { > 'drive-scsi0' => {} > }; this looks okay (note: you could just pass @_ to Dumper to print all args, handling undef as well ;)) > vanished: 1 but the job is not there when we ask qemu about it (query-block-jobs). this should normally return either a still-running job, or one that has converged so we can keep it in that state, proceed to the next disk (or VM state if it was the last one), and then finally tell all those converged mirror jobs to switch over to the target node when we have finished transferring the state (PVE::QemuMigrate). so the question is whether the drive-mirror even starts successfully? > drive-scsi0: Cancelling block job > drive-scsi0: Done. > 2021-11-26 11:29:19 ERROR: online migrate failure - block job (mirror) error: drive-scsi0: 'mirror' has been cancelled here the log order is a bit confusing - the error handler cancelling the block job is run in any case if an error occurs, and then the original error (has been cancelled, which actually is 'has vanished' here) is printed last.
so just ignore the Cancelling part, we don't actively cancel in this case, because there is no job to cancel ;) > 2021-11-26 11:29:19 aborting phase 2 - cleanup resources > 2021-11-26 11:29:19 migrate_cancel > 2021-11-26 11:29:22 ERROR: migration finished with problems (duration 00:00:12) > TASK ERROR: migration problems > > What I also see on "pvf" is that the plugin actually creates the DRBD block > device, and "something" even tries to write data to it, as the DRBD devices > auto-promotes to Primary. that might just be the start codepath allocating the volume and then starting the VM suspended with an NBD server. > > Any hints how I can debug that further? The block device should be ready at > that point. What is going on in the background here? are there any messages on either end in the journal? you can of course add a sleep at the start of qemu_drive_mirror_monitor, and interact with the VM via QMP yourself to see what state the disks and block jobs are in. in case there is nothing (or nothing meaningful) there, it's also possible to use Qemu's trace mechanism to get a feeling for how far it gets (you can set it up via the 'args' config option in the VM config) > > FWIW the plugin can be found here: > https://github.com/linbit/linstor-proxmox > > Regards, rck > > _______________________________________________ > pve-devel mailing list > pve-devel@lists.proxmox.com > https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel > > > ^ permalink raw reply [flat|nested] 5+ messages in thread
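The manual QMP interaction suggested here can look like the following sketch. The monitor socket path assumes the standard PVE layout for VM 100, and socat must be installed; both are assumptions, adjust for your setup.

```shell
# Queries to inspect disk and block-job state by hand while the migration
# is paused (e.g. via a sleep at the start of qemu_drive_mirror_monitor).
# /var/run/qemu-server/100.qmp is the assumed PVE monitor socket for VM 100.
qmp_socket=/var/run/qemu-server/100.qmp
qmp_inspect='{"execute":"qmp_capabilities"}
{"execute":"query-block-jobs"}
{"execute":"query-block"}'
# To actually send them (only while the monitor is not held by pvedaemon):
#   printf '%s\n' "$qmp_inspect" | socat - "UNIX-CONNECT:$qmp_socket"
printf '%s\n' "$qmp_inspect"
```

An empty "return" array from query-block-jobs right after drive-mirror was issued would confirm the job really is gone rather than merely unmonitored.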
* Re: [pve-devel] migrate local -> drbd fails with vanished job 2021-11-26 16:06 ` Fabian Grünbichler @ 2021-11-29 8:48 ` Roland Kammerer 0 siblings, 0 replies; 5+ messages in thread From: Roland Kammerer @ 2021-11-29 8:48 UTC (permalink / raw) To: pve-devel Hi Fabian, thanks for the walk-through, that clears things up a bit. On Fri, Nov 26, 2021 at 05:06:05PM +0100, Fabian Grünbichler wrote: > are there any messages on either end in the journal? I'm afraid they are not too useful, but maybe you see something that rings a bell. Some notes: - The first DRBD Primary is us forcing one node to be Primary (and switching it back to Secondary immediately), just so that we kick off the initial sync. The second Primary then might very well be starting the VM as you explained. - The destination disk is about 500 MB larger than the source, which was just me adding some extra space in my debug code in the plugin to avoid any rounding errors / "disk slightly too small" errors. What could be a problem here? Different sizes? Different block sizes? If it does not fail somewhere else/earlier, what should make NBD writes special? Strange. Regards, rck source node (.119/pve): Nov 29 09:06:01 pve Controller[1213]: 09:06:01.628 [MainWorkerPool-1] INFO LINSTOR/Controller - SYSTEM - Drbd-auto-verify-Algo for vm-100-disk-1 automatically set to crct10dif-pclmul Nov 29 09:06:12 pve pmxcfs[982]: [status] notice: received log Nov 29 09:06:12 pve pvedaemon[3660]: Use of uninitialized value $qga in concatenation (.) or string at /usr/share/perl5/PVE/QemuServer.pm line 7393.
Nov 29 09:06:12 pve pvedaemon[3660]: VM 100 qmp command failed - VM 100 qmp command 'block-job-cancel' failed - Block job 'drive-scsi0' not found Nov 29 09:06:13 pve pmxcfs[982]: [status] notice: received log Nov 29 09:06:13 pve pmxcfs[982]: [status] notice: received log Nov 29 09:06:14 pve pmxcfs[982]: [status] notice: received log Nov 29 09:06:15 pve pmxcfs[982]: [status] notice: received log Nov 29 09:06:16 pve pvedaemon[3660]: migration problems Nov 29 09:06:16 pve pvedaemon[1198]: <root@pam> end task UPID:pve:00000E4C:00008799:61A489E7:qmigrate:100:root@pam: migration problems Nov 29 09:10:14 pve pvedaemon[1198]: <root@pam> starting task UPID:pve:000013D6:0000EAFA:61A48AE6:qmigrate:100:root@pam: Nov 29 09:10:15 pve pmxcfs[982]: [status] notice: received log Nov 29 09:10:15 pve Controller[1213]: 09:10:15.579 [MainWorkerPool-1] INFO LINSTOR/Controller - SYSTEM - Volume definition with number '0' successfully created in resource definition 'vm-100-disk-1'. Nov 29 09:10:15 pve Controller[1213]: 09:10:15.651 [MainWorkerPool-1] INFO LINSTOR/Controller - SYSTEM - Drbd-auto-verify-Algo for vm-100-disk-1 automatically set to crct10dif-pclmul Nov 29 09:10:25 pve pmxcfs[982]: [status] notice: received log Nov 29 09:10:26 pve pvedaemon[5078]: Use of uninitialized value $qga in concatenation (.) or string at /usr/share/perl5/PVE/QemuServer.pm line 7393. 
Nov 29 09:10:26 pve pvedaemon[5078]: VM 100 qmp command failed - VM 100 qmp command 'block-job-cancel' failed - Block job 'drive-scsi0' not found Nov 29 09:10:27 pve pmxcfs[982]: [status] notice: received log Nov 29 09:10:27 pve pmxcfs[982]: [status] notice: received log Nov 29 09:10:28 pve pmxcfs[982]: [status] notice: received log Nov 29 09:10:28 pve pmxcfs[982]: [status] notice: received log Nov 29 09:10:29 pve pvedaemon[5078]: migration problems Nov 29 09:10:29 pve pvedaemon[1198]: <root@pam> end task UPID:pve:000013D6:0000EAFA:61A48AE6:qmigrate:100:root@pam: migration problems target node (.120/pvf): Nov 29 09:06:16 pvf sshd[2645]: pam_unix(sshd:session): session opened for user root(uid=0) by (uid=0) Nov 29 09:06:16 pvf systemd[1]: Started Session 8 of user root. Nov 29 09:06:16 pvf systemd-logind[711]: New session 8 of user root. Nov 29 09:06:16 pvf sshd[2645]: Received disconnect from 10.43.8.119 port 60698:11: disconnected by user Nov 29 09:06:16 pvf sshd[2645]: Disconnected from user root 10.43.8.119 port 60698 Nov 29 09:06:16 pvf sshd[2645]: pam_unix(sshd:session): session closed for user root Nov 29 09:06:16 pvf systemd-logind[711]: Session 8 logged out. Waiting for processes to exit. Nov 29 09:06:16 pvf systemd[1]: session-8.scope: Succeeded. Nov 29 09:06:16 pvf systemd-logind[711]: Removed session 8. Nov 29 09:06:16 pvf pmxcfs[897]: [status] notice: received log Nov 29 09:10:14 pvf pmxcfs[897]: [status] notice: received log Nov 29 09:10:14 pvf sshd[3188]: Accepted publickey for root from 10.43.8.119 port 60700 ssh2: RSA SHA256:3YNHg8CC3M6IPFUA9PlP5B7wnm2nbFLpnXd09j6k5yU Nov 29 09:10:14 pvf sshd[3188]: pam_unix(sshd:session): session opened for user root(uid=0) by (uid=0) Nov 29 09:10:14 pvf systemd[1]: Started Session 9 of user root. Nov 29 09:10:14 pvf systemd-logind[711]: New session 9 of user root. 
Nov 29 09:10:14 pvf sshd[3188]: Received disconnect from 10.43.8.119 port 60700:11: disconnected by user Nov 29 09:10:14 pvf sshd[3188]: Disconnected from user root 10.43.8.119 port 60700 Nov 29 09:10:14 pvf sshd[3188]: pam_unix(sshd:session): session closed for user root Nov 29 09:10:14 pvf systemd-logind[711]: Session 9 logged out. Waiting for processes to exit. Nov 29 09:10:14 pvf systemd[1]: session-9.scope: Succeeded. Nov 29 09:10:14 pvf systemd-logind[711]: Removed session 9. Nov 29 09:10:14 pvf sshd[3195]: Accepted publickey for root from 10.43.8.119 port 60702 ssh2: RSA SHA256:3YNHg8CC3M6IPFUA9PlP5B7wnm2nbFLpnXd09j6k5yU Nov 29 09:10:14 pvf sshd[3195]: pam_unix(sshd:session): session opened for user root(uid=0) by (uid=0) Nov 29 09:10:14 pvf systemd[1]: Started Session 10 of user root. Nov 29 09:10:14 pvf systemd-logind[711]: New session 10 of user root. Nov 29 09:10:15 pvf qm[3201]: <root@pam> starting task UPID:pvf:00000C82:0000EA8E:61A48AE7:qmstart:100:root@pam: Nov 29 09:10:15 pvf qm[3202]: start VM 100: UPID:pvf:00000C82:0000EA8E:61A48AE7:qmstart:100:root@pam: Nov 29 09:10:15 pvf Satellite[812]: 09:10:15.686 [MainWorkerPool-1] INFO LINSTOR/Satellite - SYSTEM - Resource 'vm-100-disk-1' created for node 'pvf'. 
Nov 29 09:10:19 pvf kernel: drbd vm-100-disk-1: Starting worker thread (from drbdsetup [3254]) Nov 29 09:10:19 pvf kernel: drbd vm-100-disk-1/0 drbd1001: meta-data IO uses: blk-bio Nov 29 09:10:19 pvf kernel: drbd vm-100-disk-1/0 drbd1001: rs_discard_granularity feature disabled Nov 29 09:10:19 pvf kernel: drbd vm-100-disk-1/0 drbd1001: disk( Diskless -> Attaching ) Nov 29 09:10:19 pvf kernel: drbd vm-100-disk-1/0 drbd1001: Maximum number of peer devices = 7 Nov 29 09:10:19 pvf kernel: drbd vm-100-disk-1: Method to ensure write ordering: flush Nov 29 09:10:19 pvf kernel: drbd vm-100-disk-1/0 drbd1001: drbd_bm_resize called with capacity == 3128600 Nov 29 09:10:19 pvf kernel: drbd vm-100-disk-1/0 drbd1001: resync bitmap: bits=391075 words=42777 pages=84 Nov 29 09:10:19 pvf kernel: drbd1001: detected capacity change from 0 to 3128600 Nov 29 09:10:19 pvf kernel: drbd vm-100-disk-1/0 drbd1001: size = 1528 MB (1564300 KB) Nov 29 09:10:19 pvf kernel: drbd vm-100-disk-1/0 drbd1001: recounting of set bits took additional 0ms Nov 29 09:10:19 pvf kernel: drbd vm-100-disk-1/0 drbd1001: disk( Attaching -> Inconsistent ) Nov 29 09:10:19 pvf kernel: drbd vm-100-disk-1/0 drbd1001: attached to current UUID: 0000000000000004 Nov 29 09:10:19 pvf Satellite[812]: 09:10:19.270 [MainWorkerPool-1] INFO LINSTOR/Satellite - SYSTEM - Primary Resource vm-100-disk-1 Nov 29 09:10:19 pvf Satellite[812]: 09:10:19.271 [MainWorkerPool-1] INFO LINSTOR/Satellite - SYSTEM - Primary bool set on Resource vm-100-disk-1 Nov 29 09:10:19 pvf Satellite[812]: 09:10:19.293 [MainWorkerPool-1] INFO LINSTOR/Satellite - SYSTEM - Resource 'vm-100-disk-1' updated for node 'pvf'. 
Nov 29 09:10:19 pvf kernel: drbd vm-100-disk-1/0 drbd1001: rs_discard_granularity feature disabled Nov 29 09:10:19 pvf kernel: drbd vm-100-disk-1: Preparing cluster-wide state change 126926591 (0->-1 7683/4609) Nov 29 09:10:19 pvf kernel: drbd vm-100-disk-1: Committing cluster-wide state change 126926591 (0ms) Nov 29 09:10:19 pvf kernel: drbd vm-100-disk-1: role( Secondary -> Primary ) Nov 29 09:10:19 pvf kernel: drbd vm-100-disk-1/0 drbd1001: disk( Inconsistent -> UpToDate ) Nov 29 09:10:19 pvf kernel: drbd vm-100-disk-1/0 drbd1001: size = 1528 MB (1564300 KB) Nov 29 09:10:19 pvf kernel: drbd vm-100-disk-1: Forced to consider local data as UpToDate! Nov 29 09:10:19 pvf kernel: drbd vm-100-disk-1/0 drbd1001: new current UUID: E1541A8E1BAC0761 weak: FFFFFFFFFFFFFFFE Nov 29 09:10:19 pvf kernel: drbd vm-100-disk-1: role( Primary -> Secondary ) Nov 29 09:10:19 pvf kernel: drbd vm-100-disk-1: Preparing cluster-wide state change 594297135 (0->-1 3/1) Nov 29 09:10:19 pvf kernel: drbd vm-100-disk-1: Committing cluster-wide state change 594297135 (0ms) Nov 29 09:10:19 pvf kernel: drbd vm-100-disk-1: role( Secondary -> Primary ) Nov 29 09:10:19 pvf kernel: drbd vm-100-disk-1: role( Primary -> Secondary ) Nov 29 09:10:19 pvf Satellite[812]: 09:10:19.589 [MainWorkerPool-1] INFO LINSTOR/Satellite - SYSTEM - Resource 'vm-100-disk-1' updated for node 'pvf'. Nov 29 09:10:19 pvf kernel: drbd vm-100-disk-1/0 drbd1001: rs_discard_granularity feature disabled Nov 29 09:10:24 pvf systemd[1]: Started 100.scope. Nov 29 09:10:24 pvf systemd-udevd[3410]: Using default interface naming scheme 'v247'. Nov 29 09:10:24 pvf systemd-udevd[3410]: ethtool: autonegotiation is unset or enabled, the speed and duplex are not writable. Nov 29 09:10:25 pvf kernel: device tap100i0 entered promiscuous mode Nov 29 09:10:25 pvf systemd-udevd[3419]: Using default interface naming scheme 'v247'. 
Nov 29 09:10:25 pvf systemd-udevd[3419]: ethtool: autonegotiation is unset or enabled, the speed and duplex are not writable. Nov 29 09:10:25 pvf systemd-udevd[3410]: ethtool: autonegotiation is unset or enabled, the speed and duplex are not writable. Nov 29 09:10:25 pvf systemd-udevd[3434]: ethtool: autonegotiation is unset or enabled, the speed and duplex are not writable. Nov 29 09:10:25 pvf systemd-udevd[3434]: Using default interface naming scheme 'v247'. Nov 29 09:10:25 pvf kernel: fwbr100i0: port 1(fwln100i0) entered blocking state Nov 29 09:10:25 pvf kernel: fwbr100i0: port 1(fwln100i0) entered disabled state Nov 29 09:10:25 pvf kernel: device fwln100i0 entered promiscuous mode Nov 29 09:10:25 pvf kernel: fwbr100i0: port 1(fwln100i0) entered blocking state Nov 29 09:10:25 pvf kernel: fwbr100i0: port 1(fwln100i0) entered forwarding state Nov 29 09:10:25 pvf kernel: vmbr0: port 2(fwpr100p0) entered blocking state Nov 29 09:10:25 pvf kernel: vmbr0: port 2(fwpr100p0) entered disabled state Nov 29 09:10:25 pvf kernel: device fwpr100p0 entered promiscuous mode Nov 29 09:10:25 pvf kernel: vmbr0: port 2(fwpr100p0) entered blocking state Nov 29 09:10:25 pvf kernel: vmbr0: port 2(fwpr100p0) entered forwarding state Nov 29 09:10:25 pvf kernel: fwbr100i0: port 2(tap100i0) entered blocking state Nov 29 09:10:25 pvf kernel: fwbr100i0: port 2(tap100i0) entered disabled state Nov 29 09:10:25 pvf kernel: fwbr100i0: port 2(tap100i0) entered blocking state Nov 29 09:10:25 pvf kernel: fwbr100i0: port 2(tap100i0) entered forwarding state Nov 29 09:10:25 pvf kernel: drbd vm-100-disk-1: Preparing cluster-wide state change 523588126 (0->-1 3/1) Nov 29 09:10:25 pvf kernel: drbd vm-100-disk-1: Committing cluster-wide state change 523588126 (0ms) Nov 29 09:10:25 pvf kernel: drbd vm-100-disk-1: role( Secondary -> Primary ) Nov 29 09:10:25 pvf qm[3201]: <root@pam> end task UPID:pvf:00000C82:0000EA8E:61A48AE7:qmstart:100:root@pam: OK Nov 29 09:10:25 pvf sshd[3195]: Received disconnect 
from 10.43.8.119 port 60702:11: disconnected by user Nov 29 09:10:25 pvf sshd[3195]: Disconnected from user root 10.43.8.119 port 60702 Nov 29 09:10:25 pvf sshd[3195]: pam_unix(sshd:session): session closed for user root Nov 29 09:10:25 pvf systemd-logind[711]: Session 10 logged out. Waiting for processes to exit. Nov 29 09:10:25 pvf systemd[1]: session-10.scope: Succeeded. Nov 29 09:10:25 pvf systemd-logind[711]: Removed session 10. Nov 29 09:10:25 pvf sshd[3463]: Accepted publickey for root from 10.43.8.119 port 60706 ssh2: RSA SHA256:3YNHg8CC3M6IPFUA9PlP5B7wnm2nbFLpnXd09j6k5yU Nov 29 09:10:25 pvf sshd[3463]: pam_unix(sshd:session): session opened for user root(uid=0) by (uid=0) Nov 29 09:10:25 pvf systemd[1]: Started Session 11 of user root. Nov 29 09:10:25 pvf systemd-logind[711]: New session 11 of user root. Nov 29 09:10:26 pvf sshd[3470]: Accepted publickey for root from 10.43.8.119 port 60708 ssh2: RSA SHA256:3YNHg8CC3M6IPFUA9PlP5B7wnm2nbFLpnXd09j6k5yU Nov 29 09:10:26 pvf sshd[3470]: pam_unix(sshd:session): session opened for user root(uid=0) by (uid=0) Nov 29 09:10:26 pvf systemd[1]: Started Session 12 of user root. Nov 29 09:10:26 pvf systemd-logind[711]: New session 12 of user root. 
Nov 29 09:10:27 pvf qm[3476]: <root@pam> starting task UPID:pvf:00000D95:0000EF22:61A48AF3:qmstop:100:root@pam: Nov 29 09:10:27 pvf qm[3477]: stop VM 100: UPID:pvf:00000D95:0000EF22:61A48AF3:qmstop:100:root@pam: Nov 29 09:10:27 pvf QEMU[3406]: kvm: terminating on signal 15 from pid 3477 (task UPID:pvf:00000D95:0000EF22:61A48AF3:qmstop:100:root@pam:) Nov 29 09:10:27 pvf kernel: drbd vm-100-disk-1: role( Primary -> Secondary ) Nov 29 09:10:27 pvf qm[3476]: <root@pam> end task UPID:pvf:00000D95:0000EF22:61A48AF3:qmstop:100:root@pam: OK Nov 29 09:10:27 pvf sshd[3470]: Received disconnect from 10.43.8.119 port 60708:11: disconnected by user Nov 29 09:10:27 pvf sshd[3470]: Disconnected from user root 10.43.8.119 port 60708 Nov 29 09:10:27 pvf sshd[3470]: pam_unix(sshd:session): session closed for user root Nov 29 09:10:27 pvf systemd[1]: session-12.scope: Succeeded. Nov 29 09:10:27 pvf systemd-logind[711]: Session 12 logged out. Waiting for processes to exit. Nov 29 09:10:27 pvf systemd-logind[711]: Removed session 12. Nov 29 09:10:27 pvf sshd[3479]: Accepted publickey for root from 10.43.8.119 port 60710 ssh2: RSA SHA256:3YNHg8CC3M6IPFUA9PlP5B7wnm2nbFLpnXd09j6k5yU Nov 29 09:10:27 pvf sshd[3479]: pam_unix(sshd:session): session opened for user root(uid=0) by (uid=0) Nov 29 09:10:27 pvf systemd[1]: Started Session 13 of user root. Nov 29 09:10:27 pvf systemd-logind[711]: New session 13 of user root. 
Nov 29 09:10:27 pvf kernel: fwbr100i0: port 2(tap100i0) entered disabled state Nov 29 09:10:27 pvf kernel: fwbr100i0: port 1(fwln100i0) entered disabled state Nov 29 09:10:27 pvf kernel: vmbr0: port 2(fwpr100p0) entered disabled state Nov 29 09:10:27 pvf kernel: device fwln100i0 left promiscuous mode Nov 29 09:10:27 pvf kernel: fwbr100i0: port 1(fwln100i0) entered disabled state Nov 29 09:10:27 pvf kernel: device fwpr100p0 left promiscuous mode Nov 29 09:10:27 pvf kernel: vmbr0: port 2(fwpr100p0) entered disabled state Nov 29 09:10:27 pvf qmeventd[719]: read: Connection reset by peer Nov 29 09:10:27 pvf systemd[1]: 100.scope: Succeeded. Nov 29 09:10:28 pvf pvesm[3485]: <root@pam> starting task UPID:pvf:00000DA8:0000EF87:61A48AF4:imgdel:100@drbdstorage:root@pam: Nov 29 09:10:28 pvf Satellite[812]: 09:10:28.144 [MainWorkerPool-1] INFO LINSTOR/Satellite - SYSTEM - Resource 'vm-100-disk-1' updated for node 'pvf'. Nov 29 09:10:28 pvf kernel: drbd vm-100-disk-1/0 drbd1001: disk( UpToDate -> Detaching ) Nov 29 09:10:28 pvf kernel: drbd vm-100-disk-1/0 drbd1001: disk( Detaching -> Diskless ) Nov 29 09:10:28 pvf kernel: drbd vm-100-disk-1/0 drbd1001: drbd_bm_resize called with capacity == 0 Nov 29 09:10:28 pvf kernel: drbd vm-100-disk-1: Terminating worker thread Nov 29 09:10:28 pvf pvesm[3485]: <root@pam> end task UPID:pvf:00000DA8:0000EF87:61A48AF4:imgdel:100@drbdstorage:root@pam: OK Nov 29 09:10:28 pvf sshd[3479]: Received disconnect from 10.43.8.119 port 60710:11: disconnected by user Nov 29 09:10:28 pvf sshd[3479]: Disconnected from user root 10.43.8.119 port 60710 Nov 29 09:10:28 pvf sshd[3479]: pam_unix(sshd:session): session closed for user root Nov 29 09:10:28 pvf systemd-logind[711]: Session 13 logged out. Waiting for processes to exit. Nov 29 09:10:28 pvf systemd[1]: session-13.scope: Succeeded. Nov 29 09:10:28 pvf systemd-logind[711]: Removed session 13. 
Nov 29 09:10:28 pvf sshd[3463]: Received disconnect from 10.43.8.119 port 60706:11: disconnected by user Nov 29 09:10:28 pvf sshd[3463]: Disconnected from user root 10.43.8.119 port 60706 Nov 29 09:10:28 pvf sshd[3463]: pam_unix(sshd:session): session closed for user root Nov 29 09:10:28 pvf systemd-logind[711]: Session 11 logged out. Waiting for processes to exit. Nov 29 09:10:28 pvf systemd[1]: session-11.scope: Succeeded. Nov 29 09:10:28 pvf systemd-logind[711]: Removed session 11. Nov 29 09:10:29 pvf sshd[3534]: Accepted publickey for root from 10.43.8.119 port 60712 ssh2: RSA SHA256:3YNHg8CC3M6IPFUA9PlP5B7wnm2nbFLpnXd09j6k5yU Nov 29 09:10:29 pvf sshd[3534]: pam_unix(sshd:session): session opened for user root(uid=0) by (uid=0) Nov 29 09:10:29 pvf systemd[1]: Started Session 14 of user root. Nov 29 09:10:29 pvf systemd-logind[711]: New session 14 of user root. Nov 29 09:10:29 pvf sshd[3534]: Received disconnect from 10.43.8.119 port 60712:11: disconnected by user Nov 29 09:10:29 pvf sshd[3534]: Disconnected from user root 10.43.8.119 port 60712 Nov 29 09:10:29 pvf sshd[3534]: pam_unix(sshd:session): session closed for user root Nov 29 09:10:29 pvf systemd-logind[711]: Session 14 logged out. Waiting for processes to exit. Nov 29 09:10:29 pvf systemd[1]: session-14.scope: Succeeded. Nov 29 09:10:29 pvf systemd-logind[711]: Removed session 14. Nov 29 09:10:29 pvf pmxcfs[897]: [status] notice: received log ^ permalink raw reply [flat|nested] 5+ messages in thread
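Regarding the earlier question of what should make NBD writes special: while the suspended target VM still has its NBD server up, the export can be exercised directly with qemu-io. The invocation below is a debugging suggestion, not something PVE runs itself; the socket path and export name are taken from the migration log.

```shell
# Probe the NBD export the target node publishes for the migration.
# Run on the source node while the target VM's NBD server still exists
# (e.g. during a sleep added to qemu_drive_mirror_monitor).
nbd_target='nbd:unix:/run/qemu-server/100_nbd.migrate:exportname=drive-scsi0'
# Report the export's length, then attempt a small patterned write:
#   qemu-io -f raw -c 'length' -c 'write -P 0xab 0 4k' "$nbd_target"
printf '%s\n' "$nbd_target"
```

If the write succeeds by hand, the export itself is healthy and the problem is more likely on the source side, e.g. drive-mirror rejecting the target.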
* Re: [pve-devel] migrate local -> drbd fails with vanished job 2021-11-26 13:03 [pve-devel] migrate local -> drbd fails with vanished job Roland Kammerer 2021-11-26 16:06 ` Fabian Grünbichler @ 2021-11-29 8:20 ` Fabian Ebner 2021-11-29 9:03 ` Roland Kammerer 1 sibling, 1 reply; 5+ messages in thread From: Fabian Ebner @ 2021-11-29 8:20 UTC (permalink / raw) To: pve-devel, Fabian Grünbichler, roland.kammerer Am 26.11.21 um 14:03 schrieb Roland Kammerer: > Dear PVE devs, > > While most of our users start with fresh VMs on DRBD storage, from time > to time people try to migrate a local VM to DRBD storage. This currently fails. > Migrating VMs from DRBD to DRBD works. Hi, there is also a forum thread about this [0]. It seems like the newly allocated disks are slightly bigger on the target DRBD storage. Just a hunch, but it does sound similar to [1]. [0]: https://forum.proxmox.com/threads/online-migration-disk-move-problem.100171 [1]: https://bugzilla.proxmox.com/show_bug.cgi?id=3227#c8 > > I added some debug code to PVE/QemuServer.pm, which looks like the location > things go wrong, or at least where I saw them going wrong: > > root@pve:/usr/share/perl5/PVE# diff -Nur QemuServer.pm{.orig,} > --- QemuServer.pm.orig 2021-11-26 11:27:28.879989894 +0100 > +++ QemuServer.pm 2021-11-26 11:26:30.490988789 +0100 > @@ -7390,6 +7390,8 @@ > $completion //= 'complete'; > $op //= "mirror"; > > + print "$vmid, $vmiddst, $jobs, $completion, $qga, $op \n"; > + { use Data::Dumper; print Dumper($jobs); }; > eval { > my $err_complete = 0; > > @@ -7419,6 +7421,7 @@ > next; > } > > + print "vanished: $vanished\n"; # same as !defined($jobs) > die "$job_id: '$op' has been cancelled\n" if !defined($job); > > my $busy = $job->{busy}; > > > With that in place, I try to live migrate the running VM from node "pve" to > "pvf": > > 2021-11-26 11:29:10 starting migration of VM 100 to node 'pvf' (xx.xx.xx.xx) > 2021-11-26 11:29:10 found local disk 'local-lvm:vm-100-disk-0' (in current VM config) >
2021-11-26 11:29:10 starting VM 100 on remote node 'pvf' > 2021-11-26 11:29:18 volume 'local-lvm:vm-100-disk-0' is 'drbdstorage:vm-100-disk-1' on the target > 2021-11-26 11:29:18 start remote tunnel > 2021-11-26 11:29:19 ssh tunnel ver 1 > 2021-11-26 11:29:19 starting storage migration > 2021-11-26 11:29:19 scsi0: start migration to nbd:unix:/run/qemu-server/100_nbd.migrate:exportname=drive-scsi0 > drive mirror is starting for drive-scsi0 > Use of uninitialized value $qga in concatenation (.) or string at /usr/share/perl5/PVE/QemuServer.pm line 7393. > 100, 100, HASH(0x557b44474a80), skip, , mirror > $VAR1 = { > 'drive-scsi0' => {} > }; > vanished: 1 > drive-scsi0: Cancelling block job > drive-scsi0: Done. > 2021-11-26 11:29:19 ERROR: online migrate failure - block job (mirror) error: drive-scsi0: 'mirror' has been cancelled > 2021-11-26 11:29:19 aborting phase 2 - cleanup resources > 2021-11-26 11:29:19 migrate_cancel > 2021-11-26 11:29:22 ERROR: migration finished with problems (duration 00:00:12) > TASK ERROR: migration problems > > What I also see on "pvf" is that the plugin actually creates the DRBD block > device, and "something" even tries to write data to it, as the DRBD devices > auto-promotes to Primary. > > Any hints how I can debug that further? The block device should be ready at > that point. What is going on in the background here? > > FWIW the plugin can be found here: > https://github.com/linbit/linstor-proxmox > > Regards, rck > > _______________________________________________ > pve-devel mailing list > pve-devel@lists.proxmox.com > https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel > > ^ permalink raw reply [flat|nested] 5+ messages in thread
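A quick way to test the size hunch above is to compare the exact byte sizes of the source and target volumes. The device paths in the comments are assumptions based on the default storage names; the arithmetic below uses the 1564300 KiB size from the DRBD kernel log plus the ~500 MB debug padding mentioned in the thread, with sparse files standing in for the real devices.

```shell
# On the real nodes one would compare (paths are assumptions):
#   blockdev --getsize64 /dev/pve/vm-100-disk-0   # source volume on "pve"
#   blockdev --getsize64 /dev/drbd1001            # target volume on "pvf"
# Demonstration with sparse files of the sizes from this thread:
truncate -s 1564300K /tmp/src.img                 # source size from the log
truncate -s $((1564300 + 512000))K /tmp/dst.img   # target ~500 MiB larger
src_bytes=$(stat -c%s /tmp/src.img)
dst_bytes=$(stat -c%s /tmp/dst.img)
echo "src=$src_bytes dst=$dst_bytes diff=$((dst_bytes - src_bytes))"
rm -f /tmp/src.img /tmp/dst.img
```

Any nonzero diff here is suspect: a mirror target that does not match the source size exactly is precisely the situation described in [1].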
* Re: [pve-devel] migrate local -> drbd fails with vanished job 2021-11-29 8:20 ` Fabian Ebner @ 2021-11-29 9:03 ` Roland Kammerer 0 siblings, 0 replies; 5+ messages in thread From: Roland Kammerer @ 2021-11-29 9:03 UTC (permalink / raw) To: pve-devel On Mon, Nov 29, 2021 at 09:20:04AM +0100, Fabian Ebner wrote: > Am 26.11.21 um 14:03 schrieb Roland Kammerer: > > Dear PVE devs, > > > > While most of our users start with fresh VMs on DRBD storage, from time > > to time people try to migrate a local VM to DRBD storage. This currently fails. > > Migrating VMs from DRBD to DRBD works. > > Hi, > there is also a forum thread about this [0]. It seems like the newly > allocated disks are slightly bigger on the target DRBD storage. Just a > hunch, but it does sound similar to [1]. > > [0]: > https://forum.proxmox.com/threads/online-migration-disk-move-problem.100171 > > [1]: https://bugzilla.proxmox.com/show_bug.cgi?id=3227#c8 Sorry Fabian, I saw your reply too late, but it looks like we had the same intention. It is always sizes and/or block sizes with these things :). And me adding some extra space for debugging might even have made it worse (okay, it was broken to begin with). Thanks, rck ^ permalink raw reply [flat|nested] 5+ messages in thread
end of thread, other threads:[~2021-11-29 9:03 UTC | newest] Thread overview: 5+ messages -- 2021-11-26 13:03 [pve-devel] migrate local -> drbd fails with vanished job Roland Kammerer 2021-11-26 16:06 ` Fabian Grünbichler 2021-11-29 8:48 ` Roland Kammerer 2021-11-29 8:20 ` Fabian Ebner 2021-11-29 9:03 ` Roland Kammerer