From: JR Richardson
Date: Mon, 25 Nov 2024 09:08:17 -0600
To: Proxmox VE user list
Subject: Re: [PVE-User] VMs With Multiple Interfaces Rebooting
> > Super stable environment for many years through software and hardware
> > upgrades, few issues to speak of, then without warning one of my
> > hypervisors in a 3-node group crashed with a memory DIMM error. Cluster
> > HA took over and restarted the VMs on the other two nodes in the group,
> > as expected. The problem quickly materialized: the VMs started
> > rebooting repeatedly, with a lot of network issues and notices of
> > migration pending. I could not lock down exactly what the root cause
> > was. Notable

> This sounds like it wanted to balance the load. Do you have CRS active
> and/or static load scheduling?

CRS option is set to basic, not dynamic.

> > was that these particular VMs all have multiple network interfaces. After
> > several hours of not being able to get the current VMs stable, I tried
> > spinning up new VMs, to no avail; the reboots persisted on the new VMs.
> > This seemed to affect only the VMs that were on the hypervisor that
> > failed; all other VMs across the cluster were fine.
> >
> > I have not installed any third-party monitoring software; I found a few
> > posts in the forum about that, but it was not my issue.
> >
> > In an act of desperation, I performed a dist-upgrade and this solved
> > the issue straight away.
> > Kernel Version: Linux 6.8.12-4-pve (2024-11-06T15:04Z)
> > Manager Version: pve-manager/8.3.0/c1689ccb1065a83b

> The upgrade likely restarted the pve-ha-lrm service, which could break
> the migration cycle.
>
> The systemd logs should give you a clue as to what was happening; the HA
> stack logs its actions on the given node.

I don't see anything in particular in the LRM logs, just the VMs being
started over and over. Here are the relevant syslog entries, from the end
of one reboot cycle to the start of the next:

2024-11-21T18:36:59.023578-06:00 vvepve13 qmeventd[3838]: Starting cleanup for 13101
2024-11-21T18:36:59.105435-06:00 vvepve13 qmeventd[3838]: Finished cleanup for 13101
2024-11-21T18:37:30.758618-06:00 vvepve13 pve-ha-lrm[1608]: successfully acquired lock 'ha_agent_vvepve13_lock'
2024-11-21T18:37:30.758861-06:00 vvepve13 pve-ha-lrm[1608]: watchdog active
2024-11-21T18:37:30.758977-06:00 vvepve13 pve-ha-lrm[1608]: status change wait_for_agent_lock => active
2024-11-21T18:37:30.789271-06:00 vvepve13 pve-ha-lrm[4337]: starting service vm:13101
2024-11-21T18:37:30.808204-06:00 vvepve13 pve-ha-lrm[4338]: start VM 13101: UPID:vvepve13:000010F2:00007AEA:673FD24A:qmstart:13101:root@pam:
2024-11-21T18:37:30.808383-06:00 vvepve13 pve-ha-lrm[4337]: starting task UPID:vvepve13:000010F2:00007AEA:673FD24A:qmstart:13101:root@pam:
2024-11-21T18:37:31.112154-06:00 vvepve13 systemd[1]: Started 13101.scope.
2024-11-21T18:37:32.802414-06:00 vvepve13 kernel: [ 316.379944] tap13101i0: entered promiscuous mode
2024-11-21T18:37:32.846352-06:00 vvepve13 kernel: [ 316.423935] vmbr0: port 10(tap13101i0) entered blocking state
2024-11-21T18:37:32.846372-06:00 vvepve13 kernel: [ 316.423946] vmbr0: port 10(tap13101i0) entered disabled state
2024-11-21T18:37:32.846375-06:00 vvepve13 kernel: [ 316.423990] tap13101i0: entered allmulticast mode
2024-11-21T18:37:32.847377-06:00 vvepve13 kernel: [ 316.424825] vmbr0: port 10(tap13101i0) entered blocking state
2024-11-21T18:37:32.847391-06:00 vvepve13 kernel: [ 316.424832] vmbr0: port 10(tap13101i0) entered forwarding state
2024-11-21T18:37:34.594397-06:00 vvepve13 kernel: [ 318.172029] tap13101i1: entered promiscuous mode
2024-11-21T18:37:34.640376-06:00 vvepve13 kernel: [ 318.217302] vmbr0: port 11(tap13101i1) entered blocking state
2024-11-21T18:37:34.640393-06:00 vvepve13 kernel: [ 318.217310] vmbr0: port 11(tap13101i1) entered disabled state
2024-11-21T18:37:34.640396-06:00 vvepve13 kernel: [ 318.217341] tap13101i1: entered allmulticast mode
2024-11-21T18:37:34.640398-06:00 vvepve13 kernel: [ 318.218073] vmbr0: port 11(tap13101i1) entered blocking state
2024-11-21T18:37:34.640400-06:00 vvepve13 kernel: [ 318.218077] vmbr0: port 11(tap13101i1) entered forwarding state
2024-11-21T18:37:35.819630-06:00 vvepve13 pve-ha-lrm[4337]: Task 'UPID:vvepve13:000010F2:00007AEA:673FD24A:qmstart:13101:root@pam:' still active, waiting
2024-11-21T18:37:36.249349-06:00 vvepve13 kernel: [ 319.827024] tap13101i2: entered promiscuous mode
2024-11-21T18:37:36.291346-06:00 vvepve13 kernel: [ 319.868406] vmbr0: port 12(tap13101i2) entered blocking state
2024-11-21T18:37:36.291365-06:00 vvepve13 kernel: [ 319.868417] vmbr0: port 12(tap13101i2) entered disabled state
2024-11-21T18:37:36.291367-06:00 vvepve13 kernel: [ 319.868443] tap13101i2: entered allmulticast mode
2024-11-21T18:37:36.291368-06:00 vvepve13 kernel: [ 319.869185] vmbr0: port 12(tap13101i2) entered blocking state
2024-11-21T18:37:36.291369-06:00 vvepve13 kernel: [ 319.869191] vmbr0: port 12(tap13101i2) entered forwarding state
2024-11-21T18:37:37.997394-06:00 vvepve13 kernel: [ 321.575034] tap13101i3: entered promiscuous mode
2024-11-21T18:37:38.040384-06:00 vvepve13 kernel: [ 321.617225] vmbr0: port 13(tap13101i3) entered blocking state
2024-11-21T18:37:38.040396-06:00 vvepve13 kernel: [ 321.617236] vmbr0: port 13(tap13101i3) entered disabled state
2024-11-21T18:37:38.040400-06:00 vvepve13 kernel: [ 321.617278] tap13101i3: entered allmulticast mode
2024-11-21T18:37:38.040402-06:00 vvepve13 kernel: [ 321.618070] vmbr0: port 13(tap13101i3) entered blocking state
2024-11-21T18:37:38.040403-06:00 vvepve13 kernel: [ 321.618077] vmbr0: port 13(tap13101i3) entered forwarding state
2024-11-21T18:37:38.248094-06:00 vvepve13 pve-ha-lrm[4337]: end task UPID:vvepve13:000010F2:00007AEA:673FD24A:qmstart:13101:root@pam: OK
2024-11-21T18:37:38.254144-06:00 vvepve13 pve-ha-lrm[4337]: service status vm:13101 started
2024-11-21T18:37:44.256824-06:00 vvepve13 QEMU[3794]: kvm: ../accel/kvm/kvm-all.c:1836: kvm_irqchip_commit_routes: Assertion `ret == 0' failed.
2024-11-21T18:38:17.486394-06:00 vvepve13 kernel: [ 361.063298] vmbr0: port 10(tap13101i0) entered disabled state
2024-11-21T18:38:17.486423-06:00 vvepve13 kernel: [ 361.064099] tap13101i0 (unregistering): left allmulticast mode
2024-11-21T18:38:17.486426-06:00 vvepve13 kernel: [ 361.064110] vmbr0: port 10(tap13101i0) entered disabled state
2024-11-21T18:38:17.510386-06:00 vvepve13 kernel: [ 361.087517] vmbr0: port 11(tap13101i1) entered disabled state
2024-11-21T18:38:17.510400-06:00 vvepve13 kernel: [ 361.087796] tap13101i1 (unregistering): left allmulticast mode
2024-11-21T18:38:17.510403-06:00 vvepve13 kernel: [ 361.087805] vmbr0: port 11(tap13101i1) entered disabled state
2024-11-21T18:38:17.540386-06:00 vvepve13 kernel: [ 361.117511] vmbr0: port 12(tap13101i2) entered disabled state
2024-11-21T18:38:17.540402-06:00 vvepve13 kernel: [ 361.117817] tap13101i2 (unregistering): left allmulticast mode
2024-11-21T18:38:17.540404-06:00 vvepve13 kernel: [ 361.117827] vmbr0: port 12(tap13101i2) entered disabled state
2024-11-21T18:38:17.561380-06:00 vvepve13 kernel: [ 361.138518] vmbr0: port 13(tap13101i3) entered disabled state
2024-11-21T18:38:17.561394-06:00 vvepve13 kernel: [ 361.138965] tap13101i3 (unregistering): left allmulticast mode
2024-11-21T18:38:17.561399-06:00 vvepve13 kernel: [ 361.138977] vmbr0: port 13(tap13101i3) entered disabled state
2024-11-21T18:38:17.584412-06:00 vvepve13 systemd[1]: 13101.scope: Deactivated successfully.
2024-11-21T18:38:17.584619-06:00 vvepve13 systemd[1]: 13101.scope: Consumed 51.122s CPU time.
2024-11-21T18:38:18.522886-06:00 vvepve13 pvestatd[1476]: VM 13101 qmp command failed - VM 13101 not running
2024-11-21T18:38:18.523725-06:00 vvepve13 pve-ha-lrm[4889]: end task UPID:vvepve13:0000131A:00008A78:673FD272:qmstart:13104:root@pam: OK
2024-11-21T18:38:18.945142-06:00 vvepve13 qmeventd[4990]: Starting cleanup for 13101
2024-11-21T18:38:19.022405-06:00 vvepve13 qmeventd[4990]: Finished cleanup for 13101

Thanks JR
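
For reference, the CRS mode and the HA manager activity discussed above can be inspected with standard PVE tooling. A minimal sketch, assuming a stock PVE 8 node; the journalctl time window is only an illustrative example, and the 'crs' line appears in datacenter.cfg only when the option has been changed from its default:

  # CRS is a cluster-wide datacenter option; read the config file directly
  # or query it over the API
  cat /etc/pve/datacenter.cfg
  pvesh get /cluster/options

  # HA local resource manager (and CRM) activity on the node that keeps
  # restarting the VMs, limited to the window around one reboot cycle
  journalctl -u pve-ha-lrm -u pve-ha-crm --since "2024-11-21 18:30" --until "2024-11-21 18:40"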