From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from firstgate.proxmox.com (firstgate.proxmox.com [212.224.123.68]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (2048 bits)) (No client certificate requested) by lists.proxmox.com (Postfix) with ESMTPS id 777E184F5 for ; Mon, 25 Apr 2022 12:48:41 +0200 (CEST) Received: from firstgate.proxmox.com (localhost [127.0.0.1]) by firstgate.proxmox.com (Proxmox) with ESMTP id 6E25A6870 for ; Mon, 25 Apr 2022 12:48:41 +0200 (CEST) Received: from proxmox-new.maurer-it.com (proxmox-new.maurer-it.com [94.136.29.106]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest SHA256) (No client certificate requested) by firstgate.proxmox.com (Proxmox) with ESMTPS id B0B2B6867 for ; Mon, 25 Apr 2022 12:48:40 +0200 (CEST) Received: from proxmox-new.maurer-it.com (localhost.localdomain [127.0.0.1]) by proxmox-new.maurer-it.com (Proxmox) with ESMTP id 7EFAB428F9 for ; Mon, 25 Apr 2022 12:48:40 +0200 (CEST) Message-ID: <597ba769-9cbd-4ab0-9d9a-815859d83df8@proxmox.com> Date: Mon, 25 Apr 2022 12:48:39 +0200 MIME-Version: 1.0 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:91.0) Gecko/20100101 Thunderbird/91.8.0 Content-Language: en-US To: =?UTF-8?Q?Fabian_Gr=c3=bcnbichler?= , pve-devel@lists.proxmox.com References: <20220318075123.5445-1-f.ebner@proxmox.com> <1650457969.m5c6lqzh1t.astroid@nora.none> <29ca4b42-5260-ecb2-0a5d-858dbf6a4b93@proxmox.com> <1650527352.zckk7hhjxo.astroid@nora.none> From: Fabian Ebner In-Reply-To: <1650527352.zckk7hhjxo.astroid@nora.none> Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit X-SPAM-LEVEL: Spam detection results: 0 AWL 1.013 Adjusted score from AWL reputation of From: address BAYES_00 -1.9 Bayes spam probability is 0 to 1% KAM_DMARC_STATUS 0.01 Test Rule for DKIM or SPF Failure with Strict Alignment NICE_REPLY_A -1.857 Looks like a legit reply (A) SPF_HELO_NONE 0.001 SPF: HELO does not publish an SPF Record SPF_PASS -0.001 SPF: sender matches SPF record Subject: Re: [pve-devel] [PATCH qemu-server] migrate: keep VM paused after migration if it was before X-BeenThere: pve-devel@lists.proxmox.com X-Mailman-Version: 2.1.29 Precedence: list List-Id: Proxmox VE development discussion List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 25 Apr 2022 10:48:41 -0000 Am 21.04.22 um 11:15 schrieb Fabian Grünbichler: > > this confused me, so I tried to look at different code paths - a > transition from phase2 (after migration has started, any error before > that is irrelevant since it couldn't have triggered a pause) to > phase2_cleanup with $err / $self->{errors} set can only happen if > > A) migration reached 'completed' state, VM state transitioned to > FINISH_MIGRATE, but block-job completing code after the query-migrate > loop died > B) query-migrate failed too often, VM state unknown > C) query-migrate returned no or an invalid status, VM state unknown > D) query-migrate returned failed or cancelled status, VM state unknown > E) migration was interrupted > > in case A (which was my original sequence of events), the VM is now > paused for sure (by the migration), and we only want to resume it (via > migrate_cancel?) if it wasn't previously paused. so does migrate_cancel > always trigger a resume? if so, do we want to (re-)pause the VM > depending on whether it was originally paused? > > in case B or C we don't know the state or what it means. we can attempt > to migrate_cancel (which might resume the VM?), which might give us the > same situation as A > > in case D or E the VM might have been paused (and resumed again) by the > migration, or maybe not. so we might still want to do the same as with > situation A. > > looking at the qemu source it should keep track of whether the VM was > running before it paused during the migration completion > (MigrationState->vm_was_running). > > this gets set when the migration converges (in migration_completion), > before the VM transitions to FINISH_MIGRATE (i.e., paused while waiting > for completion). from there on we have two options - either the > migration status switches to failed, or to completed, and we BREAK the > migration iteration (migration_iteration_run), and we end up in > migration_iteration_finish. > > migration_iteration_finish will transition the VM to POSTMIGRATE if the > migration was completed. otherwise, if the migration got cancelled, it > will start the VM again if it was marked as previously running but > paused by migration (see above), or else if it entered convergence but > didn't complete yet it will be switched to POSTMIGRATE (without > vm_start). cancelling before the migration converged simple does nothing > except cleaning up the migration - the VM is still running or paused > depending on its original state, since the migration never paused it it > also had no need to resume it. > > so there is case A, where we should end up in POSTMIGRATE and want to > return to RUNNING or PAUSED, depending on original state. there is case > B, where cancelling before convergence is fine, cancelling after > convergence but before completion should resume iff originally running > (fine), but end up in POSTMIGRATE instead of PAUSED if not (wrong), with > cases C, D and E being similar but likely not resumed/completed. If the VM was running before, I think we can just check if the state is POSTMIGRATE (after migrate_cancel, both cases where we know for sure we want to resume should end up there; if there is some weird corner case we missed, I'd say adding it once it pops up is the way to go) and try to go back to RUNNING by issuing a QMP 'cont'. We can't actually go from POSTMIGRATE back to PAUSED directly though: typedef struct { RunState from; RunState to; } RunStateTransition; static const RunStateTransition runstate_transitions_def[] = { (...) { RUN_STATE_POSTMIGRATE, RUN_STATE_RUNNING }, { RUN_STATE_POSTMIGRATE, RUN_STATE_FINISH_MIGRATE }, { RUN_STATE_POSTMIGRATE, RUN_STATE_PRELAUNCH }, (...) Not sure what we can do besides keeping it in POSTMIGRATE there? It shouldn't be as relevant as in the "was running" case IMHO. Well, enabling the start button in the UI would help ;)