public inbox for pve-user@lists.proxmox.com
From: Adam Weremczuk <adamw@matrixscience.com>
To: Roland <devzero@web.de>,
	Proxmox VE user list <pve-user@lists.proxmox.com>
Subject: Re: [PVE-User] Proxmox VM hard resets
Date: Tue, 17 Jan 2023 18:59:10 +0000	[thread overview]
Message-ID: <65cc3f98-4eab-0c22-1571-0f72eacc6449@matrixscience.com> (raw)
In-Reply-To: <8d7aca90-efda-2a8f-9ca2-68792fe258cc@web.de>

I have dozens of Debian VMs (including Debian 11) on VMware 7.0.2 being 
backed up with Altaro daily and have never seen this anywhere else. Even 
on Proxmox it happens only once in a while (3 times so far).

I'm now down to 4 containers with low/moderate load and no extreme 
spikes. What doesn't feel right is that they take up to 5 minutes to 
reboot to a login prompt and start responding to pings; the consoles 
remain pitch black and unresponsive until then. It was the same when I 
ran Proxmox on bare metal.

Maybe these two issues are related?
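
As an aside, the stun durations in the vmware.log snippet quoted below 
are in microseconds, which makes them easy to misread. A quick sketch to 
convert them to seconds and flag anything past the ~30 s mark (log lines 
copied from the quoted message; the 30 s threshold is only my guess at 
where the watchdog fires):

```python
import re

# vmware.log lines from the quoted message below
log_lines = [
    "2023-01-12T23:00:55.407Z| vcpu-0| | I005: CPT: vm was stunned for 32142467 us",
    "2023-01-12T23:01:12.848Z| vcpu-0| | I005: CPT: vm was stunned for 14942070 us",
    "2023-01-12T23:11:35.984Z| vcpu-0| opID=1487b0d5| I005: CPT: vm was stunned for 277986 us",
    "2023-01-12T23:11:39.431Z| vcpu-0| | I005: CPT: vm was stunned for 122089 us",
]

stun_re = re.compile(r"vm was stunned for (\d+) us")

for line in log_lines:
    match = stun_re.search(line)
    if match:
        seconds = int(match.group(1)) / 1_000_000  # microseconds -> seconds
        marker = "  <-- over 30 s" if seconds > 30 else ""
        print(f"{seconds:9.3f} s{marker}")
```

Only the 32 s stun from the first backup window crosses that threshold, 
and it is the one immediately followed by the hard-reset line in the log.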
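
On the watchdog question in my earlier mail quoted below: if the reset 
comes from the Linux lockup detectors inside the Proxmox guest (an 
assumption on my part; it could equally be a /dev/watchdog device or 
something else entirely), the relevant sysctl knobs can be dumped like 
this. The knob list is from my reading of the kernel sysctl docs, so 
treat it as a starting point, not a diagnosis:

```python
from pathlib import Path

# Sketch: dump the Linux lockup-detector knobs inside the Proxmox guest.
# Assumption: the hard reset might come from the kernel's soft/hard lockup
# detectors panicking and then rebooting via kernel.panic, which would look
# like a guest-requested reset to VMware.
KNOBS = [
    "watchdog",          # master switch for the lockup detectors
    "watchdog_thresh",   # seconds of freeze before a lockup is declared
    "softlockup_panic",  # 1 = panic on soft lockup
    "hardlockup_panic",  # 1 = panic on hard lockup
    "panic",             # seconds until reboot after a panic; 0 = stay down
]

def read_knob(name: str) -> str:
    """Read one sysctl value from /proc, or 'n/a' if the knob is absent."""
    path = Path("/proc/sys/kernel") / name
    return path.read_text().strip() if path.exists() else "n/a"

for knob in KNOBS:
    print(f"kernel.{knob} = {read_knob(knob)}")
```

If hardlockup_panic (or softlockup_panic) is 1 and kernel.panic is 
non-zero, a long enough stun could plausibly end in a reboot; raising 
watchdog_thresh or disabling panic-on-lockup would be the knobs to try.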

On 17/01/2023 17:45, Roland wrote:
> can you reproduce this with a Debian 11 or Ubuntu 22 VM (create some load
> there)? I think this is not a Proxmox problem and cannot be solved at the
> Proxmox/VM-guest level
>
> see
> https://www.theregister.com/2017/11/28/stunning_antistun_vm_stun_problem_fix/ 
>
> for example
>
> roland
>
> Am 17.01.23 um 16:04 schrieb Adam Weremczuk:
>> Hi all,
>>
>> My environment is quite unusual as I run PVE 7.2-11 as a VM on VMware
>> 7.0.2. It runs several LXC containers and generally things are working
>> fine.
>>
>> Recently the Proxmox VM (called "jaguar") started resetting itself
>> (and all containers) shortly after Altaro VM Backup kicked off a
>> scheduled VM backup over the network.
>> Each time a hard reset was requested by the OS itself (Proxmox
>> hypervisor).
>>
>> The duration of the "stun/unstun" operation seems to be causing the
>> issue here: usually a stun/unstun operation takes a very short amount
>> of time, but in my case, depending on the load on both the hypervisor
>> and the guest VM (nested hypervisor), that time can vary and take
>> considerably longer. Snippet below from various stun/unstun
>> operations:
>>
>> 2023-01-12T23:00:55.407Z| vcpu-0| | I005: CPT: vm was stunned for
>> 32142467 us
>> 2023-01-12T23:01:12.848Z| vcpu-0| | I005: CPT: vm was stunned for
>> 14942070 us
>> 2023-01-12T23:11:35.984Z| vcpu-0| opID=1487b0d5| I005: CPT: vm was
>> stunned for 277986 us
>> 2023-01-12T23:11:39.431Z| vcpu-0| | I005: CPT: vm was stunned for
>> 122089 us
>>
>> As you can see, the stun time differs between disks. What I think is
>> happening here is that, depending on the stun/unstun time of the VM
>> (virtualized hypervisor), the virtualized hypervisor's watchdog
>> notices that the OS has been frozen for X amount of time and issues a
>> hard reset. I guess the guest OS issues the hard reset when the stun
>> time exceeds 30 seconds.
>>
>> 2023-01-12T23:00:55.407Z| vcpu-0| | I005: CPT: vm was stunned for
>> 32142467 us
>> 2023-01-12T23:00:55.407Z| vcpu-0| | I005: SnapshotVMXTakeSnapshotWork:
>> Transition to mode 1.
>> 2023-01-12T23:00:55.407Z| vcpu-0| | I005:
>> SnapshotVMXTakeSnapshotComplete: Done with snapshot
>> 'ALTAROTEMPSNAPSHOTDONOTDELETE463b73a7-f363-4daf-acf3-b0322fe84429': 95
>> 2023-01-12T23:00:55.407Z| vcpu-0| | I005:
>> VigorTransport_ServerSendResponse opID=1487b008 seq=887616: Completed
>> Snapshot request.
>> 2023-01-12T23:00:55.409Z| vcpu-8| | I005: HBACommon: First write on
>> scsi0:0.fileName='/vmfs/volumes/61364720-e494cfe4-6cff-b083fed97d91/jaguar/jaguar-000001.vmdk' 
>>
>> 2023-01-12T23:00:55.409Z| vcpu-8| | I005: DDB: "longContentID" =
>> "08bf301ae8e75c151d2f273571a4ea9f" (was
>> "2a6fd4c33a60f8d724ccc100a666f0d7")
>> 2023-01-12T23:00:57.906Z| vcpu-8| | I005: DISKLIB-CHAIN :
>> DiskChainUpdateContentID: old=0xa666f0d7, new=0x71a4ea9f
>> (08bf301ae8e75c151d2f273571a4ea9f)
>> 2023-01-12T23:00:57.906Z| vcpu-9| | I005: Chipset: The guest has
>> requested that the virtual machine be hard reset.
>>
>> I'm struggling to establish how the watchdog timer (or equivalent) is
>> configured :( Maybe increasing its timeout would solve the issue?
>>
>> Any other ideas / similar experiences?
>>
>> Regards,
>> Adam
>>
>>
>> _______________________________________________
>> pve-user mailing list
>> pve-user@lists.proxmox.com
>> https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-user
>>




Thread overview: 3+ messages
2023-01-17 15:04 Adam Weremczuk
2023-01-17 17:45 ` Roland
2023-01-17 18:59   ` Adam Weremczuk [this message]
