From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <adamw@matrixscience.com>
Received: from firstgate.proxmox.com (firstgate.proxmox.com [212.224.123.68])
 (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits)
 key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest SHA256)
 (No client certificate requested)
 by lists.proxmox.com (Postfix) with ESMTPS id AE6AA95235
 for <pve-user@lists.proxmox.com>; Tue, 17 Jan 2023 19:59:45 +0100 (CET)
Received: from firstgate.proxmox.com (localhost [127.0.0.1])
 by firstgate.proxmox.com (Proxmox) with ESMTP id 9013F180AE
 for <pve-user@lists.proxmox.com>; Tue, 17 Jan 2023 19:59:15 +0100 (CET)
Received: from mx0.matrixscience.co.uk (mx0.matrixscience.co.uk
 [188.215.17.82]) by firstgate.proxmox.com (Proxmox) with ESMTP
 for <pve-user@lists.proxmox.com>; Tue, 17 Jan 2023 19:59:13 +0100 (CET)
Received: from [10.200.20.3] (roadies-10-200-20-3.matrixscience.co.uk
 [10.200.20.3])
 by mx0.matrixscience.co.uk (Postfix) with ESMTP id 591F72C022F;
 Tue, 17 Jan 2023 18:59:13 +0000 (GMT)
Message-ID: <65cc3f98-4eab-0c22-1571-0f72eacc6449@matrixscience.com>
Date: Tue, 17 Jan 2023 18:59:10 +0000
MIME-Version: 1.0
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:102.0) Gecko/20100101
 Thunderbird/102.6.1
Content-Language: en-US
To: Roland <devzero@web.de>, Proxmox VE user list <pve-user@lists.proxmox.com>
References: <4ac6cb85-bf4e-ff81-e120-7365be0f1c10@matrixscience.com>
 <8d7aca90-efda-2a8f-9ca2-68792fe258cc@web.de>
From: Adam Weremczuk <adamw@matrixscience.com>
Organization: Matrix Science Ltd
In-Reply-To: <8d7aca90-efda-2a8f-9ca2-68792fe258cc@web.de>
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
X-SPAM-LEVEL: Spam detection results:  0
 AWL 0.049 Adjusted score from AWL reputation of From: address
 BAYES_00                 -1.9 Bayes spam probability is 0 to 1%
 KAM_DMARC_STATUS 0.01 Test Rule for DKIM or SPF Failure with Strict Alignment
 NICE_REPLY_A           -0.097 Looks like a legit reply (A)
 SPF_HELO_NONE           0.001 SPF: HELO does not publish an SPF Record
 SPF_PASS               -0.001 SPF: sender matches SPF record
 URIBL_BLOCKED 0.001 ADMINISTRATOR NOTICE: The query to URIBL was blocked. See
 http://wiki.apache.org/spamassassin/DnsBlocklists#dnsbl-block for more
 information. [theregister.com, proxmox.com]
Subject: Re: [PVE-User] Proxmox VM hard resets
X-BeenThere: pve-user@lists.proxmox.com
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: Proxmox VE user list <pve-user.lists.proxmox.com>
List-Unsubscribe: <https://lists.proxmox.com/cgi-bin/mailman/options/pve-user>, 
 <mailto:pve-user-request@lists.proxmox.com?subject=unsubscribe>
List-Archive: <http://lists.proxmox.com/pipermail/pve-user/>
List-Post: <mailto:pve-user@lists.proxmox.com>
List-Help: <mailto:pve-user-request@lists.proxmox.com?subject=help>
List-Subscribe: <https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-user>, 
 <mailto:pve-user-request@lists.proxmox.com?subject=subscribe>
X-List-Received-Date: Tue, 17 Jan 2023 18:59:45 -0000

I have dozens of Debian VMs (including 11) on VMware 7.0.2 being backed 
up with Altaro daily and have never seen this anywhere else. Even on the 
Proxmox VM it happens only once in a while (3 times so far).

I'm now down to 4 containers with low/moderate load and no extreme 
spikes. What doesn't feel right is that they take up to 5 minutes to 
reboot to a login prompt and start responding to pings. The consoles 
remain pitch black and unresponsive until then. It was the same when I 
ran Proxmox on bare metal.
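
In case it helps anyone comparing notes, the stun durations can be 
pulled out of vmware.log with something like the below. The datastore 
path is assumed from the log snippets quoted further down, and the awk 
field positions match that log format; adjust both for your setup.

```shell
# Pull "vm was stunned for N us" events out of vmware.log and convert
# the microsecond values to seconds. Path is an assumption based on the
# quoted log lines; point it at your VM's directory.
grep 'vm was stunned for' /vmfs/volumes/*/jaguar/vmware.log \
  | awk '{ printf "%s  %.2f s\n", $1, $(NF-1) / 1e6 }'
```

That makes it easy to spot which stuns crossed the ~30 s mark at which 
timestamps.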

Maybe these 2 issues are related?
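
On the watchdog question from the earlier mail: if no hardware watchdog 
is configured, Proxmox falls back to the kernel's softdog module, whose 
margin defaults to 60 s. The active device and timeout can be inspected 
with `cat /sys/class/watchdog/watchdog0/timeout` and `modinfo -p 
softdog`. If that is the watchdog firing, a longer margin can be set via 
a module option; this is a hypothetical sketch, not something I've 
verified fixes the resets:

```shell
# /etc/modprobe.d/softdog.conf  -- hypothetical example, untested here.
# softdog's reboot margin defaults to 60 s; raising it gives long
# snapshot stuns more headroom before the watchdog triggers a reset.
options softdog soft_margin=120
```

It takes effect on the next module load (or reboot).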

On 17/01/2023 17:45, Roland wrote:
> Can you reproduce this with a Debian 11 or Ubuntu 22 VM (create some
> load there)? I don't think this is a Proxmox problem that can be
> solved at the Proxmox/VM-guest level.
>
> see
> https://www.theregister.com/2017/11/28/stunning_antistun_vm_stun_problem_fix/ 
>
> for example
>
> roland
>
> Am 17.01.23 um 16:04 schrieb Adam Weremczuk:
>> Hi all,
>>
>> My environment is quite unusual as I run PVE 7.2-11 as a VM on VMware
>> 7.0.2. It runs several LXC containers and generally things are working
>> fine.
>>
>> Recently the Proxmox VM (called "jaguar") started resetting itself
>> (and all containers) shortly after Altaro VM Backup kicked off a
>> scheduled VM backup over the network.
>> Each time a hard reset was requested by the OS itself (Proxmox
>> hypervisor).
>>
>> The duration of the stun/unstun operation seems to be causing the
>> issue here. Usually a stun/unstun operation takes a very short amount
>> of time; in my case, however, depending on the load on both the
>> hypervisor and the guest VM (nested hypervisor), it can vary and take
>> considerably longer. Snippet below from various stun/unstun
>> operations:
>>
>> 2023-01-12T23:00:55.407Z| vcpu-0| | I005: CPT: vm was stunned for
>> 32142467 us
>> 2023-01-12T23:01:12.848Z| vcpu-0| | I005: CPT: vm was stunned for
>> 14942070 us
>> 2023-01-12T23:11:35.984Z| vcpu-0| opID=1487b0d5| I005: CPT: vm was
>> stunned for 277986 us
>> 2023-01-12T23:11:39.431Z| vcpu-0| | I005: CPT: vm was stunned for
>> 122089 us
>>
>> As you can see, the stun time is different for each disk. What I
>> think is happening is that, depending on the stun/unstun time of the
>> VM (the virtualized hypervisor), its watchdog notices the OS has been
>> frozen for some amount of time and issues a hard reset. My guess is
>> that when the stun time exceeds 30 seconds, the guest OS issues the
>> hard reset.
>>
>> 2023-01-12T23:00:55.407Z| vcpu-0| | I005: CPT: vm was stunned for
>> 32142467 us
>> 2023-01-12T23:00:55.407Z| vcpu-0| | I005: SnapshotVMXTakeSnapshotWork:
>> Transition to mode 1.
>> 2023-01-12T23:00:55.407Z| vcpu-0| | I005:
>> SnapshotVMXTakeSnapshotComplete: Done with snapshot
>> 'ALTAROTEMPSNAPSHOTDONOTDELETE463b73a7-f363-4daf-acf3-b0322fe84429': 95
>> 2023-01-12T23:00:55.407Z| vcpu-0| | I005:
>> VigorTransport_ServerSendResponse opID=1487b008 seq=887616: Completed
>> Snapshot request.
>> 2023-01-12T23:00:55.409Z| vcpu-8| | I005: HBACommon: First write on
>> scsi0:0.fileName='/vmfs/volumes/61364720-e494cfe4-6cff-b083fed97d91/jaguar/jaguar-000001.vmdk' 
>>
>> 2023-01-12T23:00:55.409Z| vcpu-8| | I005: DDB: "longContentID" =
>> "08bf301ae8e75c151d2f273571a4ea9f" (was
>> "2a6fd4c33a60f8d724ccc100a666f0d7")
>> 2023-01-12T23:00:57.906Z| vcpu-8| | I005: DISKLIB-CHAIN :
>> DiskChainUpdateContentID: old=0xa666f0d7, new=0x71a4ea9f
>> (08bf301ae8e75c151d2f273571a4ea9f)
>> 2023-01-12T23:00:57.906Z| vcpu-9| | I005: Chipset: The guest has
>> requested that the virtual machine be hard reset.
>>
>> I'm struggling to establish how the watchdog timer (or equivalent) is
>> configured :( Maybe increasing its trigger time would solve the issue?
>>
>> Any other ideas / similar experiences?
>>
>> Regards,
>> Adam
>>
>>
>> _______________________________________________
>> pve-user mailing list
>> pve-user@lists.proxmox.com
>> https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-user
>>