From mboxrd@z Thu Jan 1 00:00:00 1970
Message-ID: <4ac6cb85-bf4e-ff81-e120-7365be0f1c10@matrixscience.com>
Date: Tue, 17 Jan 2023 15:04:36 +0000
To: pve-user@lists.proxmox.com
From: Adam Weremczuk
Organization: Matrix Science Ltd
Subject: [PVE-User] Proxmox VM hard resets
List-Id: Proxmox VE user list

Hi all,

My environment is quite unusual, as I run PVE 7.2-11 as a VM on VMware 7.0.2.
It runs several LXC containers and generally works fine.

Recently the Proxmox VM (called "jaguar") started resetting itself (and all its containers) shortly after Altaro VM Backup kicked off a scheduled VM backup over the network. Each time, the hard reset was requested by the guest OS itself (the Proxmox hypervisor).

The duration of the "stun/unstun" operation seems to be the cause. Normally a stun/unstun takes a very short time, but in my case, depending on the load on both the VMware hypervisor and the guest VM (the nested hypervisor), it can take considerably longer. Here is a snippet from various stun/unstun operations:

2023-01-12T23:00:55.407Z| vcpu-0| | I005: CPT: vm was stunned for 32142467 us
2023-01-12T23:01:12.848Z| vcpu-0| | I005: CPT: vm was stunned for 14942070 us
2023-01-12T23:11:35.984Z| vcpu-0| opID=1487b0d5| I005: CPT: vm was stunned for 277986 us
2023-01-12T23:11:39.431Z| vcpu-0| | I005: CPT: vm was stunned for 122089 us

As you can see, the stun time differs for each disk. What I think is happening is that, when the VM (the virtualized hypervisor) is stunned for long enough, its watchdog notices that the OS has been frozen for some amount of time and issues a hard reset. My guess is that the guest OS issues a hard reset when the stun time exceeds 30 seconds:

2023-01-12T23:00:55.407Z| vcpu-0| | I005: CPT: vm was stunned for 32142467 us
2023-01-12T23:00:55.407Z| vcpu-0| | I005: SnapshotVMXTakeSnapshotWork: Transition to mode 1.
2023-01-12T23:00:55.407Z| vcpu-0| | I005: SnapshotVMXTakeSnapshotComplete: Done with snapshot 'ALTAROTEMPSNAPSHOTDONOTDELETE463b73a7-f363-4daf-acf3-b0322fe84429': 95
2023-01-12T23:00:55.407Z| vcpu-0| | I005: VigorTransport_ServerSendResponse opID=1487b008 seq=887616: Completed Snapshot request.
2023-01-12T23:00:55.409Z| vcpu-8| | I005: HBACommon: First write on scsi0:0.fileName='/vmfs/volumes/61364720-e494cfe4-6cff-b083fed97d91/jaguar/jaguar-000001.vmdk'
2023-01-12T23:00:55.409Z| vcpu-8| | I005: DDB: "longContentID" = "08bf301ae8e75c151d2f273571a4ea9f" (was "2a6fd4c33a60f8d724ccc100a666f0d7")
2023-01-12T23:00:57.906Z| vcpu-8| | I005: DISKLIB-CHAIN : DiskChainUpdateContentID: old=0xa666f0d7, new=0x71a4ea9f (08bf301ae8e75c151d2f273571a4ea9f)
2023-01-12T23:00:57.906Z| vcpu-9| | I005: Chipset: The guest has requested that the virtual machine be hard reset.

I'm struggling to establish how the watchdog timer (or equivalent) is configured :(
Maybe increasing its trigger time would solve the issue?

Any other ideas or similar experiences?

Regards,
Adam
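
P.S. In case it helps anyone compare against their own vmware.log, below is a minimal Python sketch that pulls the "vm was stunned for N us" entries out of a log and flags the long ones. The function names are made up for illustration, and the 30-second threshold is only my guess from the log above, not a confirmed watchdog setting. The sample lines are the ones quoted in this message.

```python
import re

# Matches lines such as:
# 2023-01-12T23:00:55.407Z| vcpu-0| | I005: CPT: vm was stunned for 32142467 us
STUN_RE = re.compile(r"^(?P<ts>\S+?)\|.*CPT: vm was stunned for (?P<us>\d+) us")


def parse_stun_events(lines):
    """Return (timestamp, stun duration in seconds) for every stun entry."""
    events = []
    for line in lines:
        match = STUN_RE.search(line.strip())
        if match:
            events.append((match.group("ts"), int(match.group("us")) / 1_000_000))
    return events


def long_stuns(events, threshold_s=30.0):
    """Keep only the stuns longer than threshold_s (30 s is an assumption)."""
    return [(ts, secs) for ts, secs in events if secs > threshold_s]


SAMPLE = """\
2023-01-12T23:00:55.407Z| vcpu-0| | I005: CPT: vm was stunned for 32142467 us
2023-01-12T23:01:12.848Z| vcpu-0| | I005: CPT: vm was stunned for 14942070 us
2023-01-12T23:11:35.984Z| vcpu-0| opID=1487b0d5| I005: CPT: vm was stunned for 277986 us
2023-01-12T23:11:39.431Z| vcpu-0| | I005: CPT: vm was stunned for 122089 us
"""

if __name__ == "__main__":
    events = parse_stun_events(SAMPLE.splitlines())
    flagged = {ts for ts, _ in long_stuns(events)}
    for ts, secs in events:
        marker = "  <-- over threshold" if ts in flagged else ""
        print(f"{ts}  {secs:10.3f} s{marker}")
```

To run it against a real log, replace SAMPLE.splitlines() with the lines read from the VM's vmware.log. With the sample above, only the 23:00:55 stun is over the assumed threshold, and that is the one followed by the hard reset.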