* [PVE-User] Proxmox VM hard resets
From: Adam Weremczuk @ 2023-01-17 15:04 UTC
To: pve-user
Hi all,
My environment is quite unusual as I run PVE 7.2-11 as a VM on VMware
7.0.2. It runs several LXC containers and generally things are working fine.
Recently the Proxmox VM (called "jaguar") started resetting itself (and
all of its containers) shortly after Altaro VM Backup kicked off a
scheduled VM backup over the network. Each time, the hard reset was
requested by the guest OS itself (the Proxmox hypervisor).
The duration of the "stun/unstun" operation seems to be causing the
issue here. Usually a stun/unstun operation takes a very short amount
of time; in my case, however, depending on the load on both the
hypervisor and the guest VM (the nested hypervisor), it can vary and
take a fair bit longer. Snippet below from various stun/unstun
operations:
2023-01-12T23:00:55.407Z| vcpu-0| | I005: CPT: vm was stunned for 32142467 us
2023-01-12T23:01:12.848Z| vcpu-0| | I005: CPT: vm was stunned for 14942070 us
2023-01-12T23:11:35.984Z| vcpu-0| opID=1487b0d5| I005: CPT: vm was stunned for 277986 us
2023-01-12T23:11:39.431Z| vcpu-0| | I005: CPT: vm was stunned for 122089 us
As you can see, the stun time differs for each disk. What I think is
happening here: depending on the stun/unstun time of the VM (the
virtualized hypervisor), a watchdog inside the virtualized hypervisor
notices that the OS has been frozen for a certain amount of time and
issues a hard reset. My guess is that when the stun time exceeds 30
seconds (the first entry above is 32142467 us, i.e. about 32 s), the
guest OS issues a hard reset.
2023-01-12T23:00:55.407Z| vcpu-0| | I005: CPT: vm was stunned for 32142467 us
2023-01-12T23:00:55.407Z| vcpu-0| | I005: SnapshotVMXTakeSnapshotWork: Transition to mode 1.
2023-01-12T23:00:55.407Z| vcpu-0| | I005: SnapshotVMXTakeSnapshotComplete: Done with snapshot 'ALTAROTEMPSNAPSHOTDONOTDELETE463b73a7-f363-4daf-acf3-b0322fe84429': 95
2023-01-12T23:00:55.407Z| vcpu-0| | I005: VigorTransport_ServerSendResponse opID=1487b008 seq=887616: Completed Snapshot request.
2023-01-12T23:00:55.409Z| vcpu-8| | I005: HBACommon: First write on scsi0:0.fileName='/vmfs/volumes/61364720-e494cfe4-6cff-b083fed97d91/jaguar/jaguar-000001.vmdk'
2023-01-12T23:00:55.409Z| vcpu-8| | I005: DDB: "longContentID" = "08bf301ae8e75c151d2f273571a4ea9f" (was "2a6fd4c33a60f8d724ccc100a666f0d7")
2023-01-12T23:00:57.906Z| vcpu-8| | I005: DISKLIB-CHAIN : DiskChainUpdateContentID: old=0xa666f0d7, new=0x71a4ea9f (08bf301ae8e75c151d2f273571a4ea9f)
2023-01-12T23:00:57.906Z| vcpu-9| | I005: Chipset: The guest has requested that the virtual machine be hard reset.
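For reference, here is a rough Python sketch (just post-processing of
vmware.log; the 30-second threshold is my assumption, not a confirmed
setting) that pulls the stun durations out of the log and flags
anything over that limit:

#!/usr/bin/env python3
# Scan a vmware.log on stdin for "CPT: vm was stunned for N us" entries
# and flag any stun longer than an assumed watchdog limit.
import re
import sys

THRESHOLD_S = 30.0  # assumed limit; the real watchdog timeout is what I'm trying to find

stun_re = re.compile(r"vm was stunned for\s+(\d+)\s+us")

for line in sys.stdin:
    match = stun_re.search(line)
    if match:
        seconds = int(match.group(1)) / 1_000_000  # the log reports microseconds
        flag = "  <-- over assumed 30 s limit" if seconds > THRESHOLD_S else ""
        print(f"{line[:24]}  stunned for {seconds:8.3f} s{flag}")

Saved as, say, stun_times.py and run as "python3 stun_times.py <
vmware.log", it reports roughly 32.1 s, 14.9 s, 0.28 s and 0.12 s for
the four entries above, so only the first stun crosses 30 seconds,
which lines up with the hard reset that followed it.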
I'm struggling to establish how the watchdog timer (or its equivalent)
is configured :( Maybe increasing its trigger time would solve the issue?
Any other ideas / similar experiences?
Regards,
Adam
* Re: [PVE-User] Proxmox VM hard resets
From: Roland @ 2023-01-17 17:45 UTC
To: Proxmox VE user list, Adam Weremczuk
Can you reproduce this with a Debian 11 or Ubuntu 22 VM (create some
load there)? I think this is not a Proxmox problem and not one that can
be solved at the Proxmox/VM-guest level.

See for example:
https://www.theregister.com/2017/11/28/stunning_antistun_vm_stun_problem_fix/

Roland
On 17.01.23 at 16:04, Adam Weremczuk wrote:
> [...]
* Re: [PVE-User] Proxmox VM hard resets
From: Adam Weremczuk @ 2023-01-17 18:59 UTC
To: Roland, Proxmox VE user list
I have dozens of Debian VMs (including Debian 11) on VMware 7.0.2 being
backed up with Altaro daily and have never seen this anywhere else.
Even with Proxmox it happens only once in a while (3 times so far).

I'm now down to 4 containers with low/moderate load and no extreme spikes.

What doesn't feel right is that the containers take up to 5 minutes to
reboot to a login prompt and start responding to pings; the consoles
remain pitch black and unresponsive until then. It was the same when I
ran Proxmox on bare metal.

Maybe these two issues are related?
On 17/01/2023 17:45, Roland wrote:
> Can you reproduce this with a Debian 11 or Ubuntu 22 VM (create some
> load there)? I think this is not a Proxmox problem and not one that
> can be solved at the Proxmox/VM-guest level.
>
> [...]