* Re: [PVE-User] VMs With Multiple Interfaces Rebooting
[not found] <CA+U74VPYtp8uS2sC515wMHc5qc6tfjzRnRtWbxMyVtRdNTD4SQ@mail.gmail.com>
@ 2024-11-22 7:53 ` Mark Schouten via pve-user
2024-11-25 5:32 ` Alwin Antreich via pve-user
2 siblings, 0 replies; 5+ messages in thread
From: Mark Schouten via pve-user @ 2024-11-22 7:53 UTC (permalink / raw)
To: pve-user; +Cc: Mark Schouten, PVE User List
From: Mark Schouten <mark@tuxis.nl>
To: pve-user@lists.proxmox.com
Cc: PVE User List <pve-user@pve.proxmox.com>
Subject: Re: [PVE-User] VMs With Multiple Interfaces Rebooting
Date: Fri, 22 Nov 2024 08:53:29 +0100
Message-ID: <EB2131A9-6A4D-447C-A61E-A6367B542A13@tuxis.nl>
Hi JR,
What do you mean by ‘reboot’? Does the VM crash, so that it is powered down from an HA point of view and started back up? Or does the VM OS reboot cleanly?
Mark Schouten
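A rough way to tell those two cases apart on the node hosting the VM, using the standard PVE CLI tools (the VMID 13101 and the pidfile path are only examples taken from later in this thread and may need adjusting):

    # HA state of the resource as the cluster sees it (started/stopped/error)
    ha-manager status | grep vm:13101

    # Is it still the same QEMU process, or was a new one started?
    qm status 13101
    ps -o lstart= -p "$(cat /var/run/qemu-server/13101.pid)"

    # Actions the HA local resource manager took on this node
    journalctl -u pve-ha-lrm --since "2024-11-21" | grep 13101

If the QEMU process start time never changes but the guest uptime resets, the OS rebooted on its own; if the LRM logs a stop/start and a new process appears, the VM was recovered by the HA stack.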
> On 22 Nov 2024, at 07:18, JR Richardson <jmr.richardson@gmail.com> wrote:
>
> Hey Folks,
>
> Just wanted to share an experience I recently had, Cluster parameters:
> 7 nodes, 2 HA Groups (3 nodes and 4 nodes), shared storage.
> Server Specs:
> CPU(s) 40 x Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz (2 Sockets)
> Kernel Version Linux 6.8.12-1-pve (2024-08-05T16:17Z)
> Manager Version pve-manager/8.2.4/faa83925c9641325
>
> Super stable environment for many years through software and hardware
> upgrades, with few issues to speak of. Then, without warning, one of my
> hypervisors in the 3-node group crashed with a memory DIMM error, and
> cluster HA took over and restarted the VMs on the other two nodes in the
> group as expected. The problem quickly materialized: the VMs started
> rebooting repeatedly, with a lot of network issues and notices of pending
> migration. I could not lock down exactly what the root cause was. Notably,
> these particular VMs all have multiple network interfaces. After several
> hours of not being able to get the current VMs stable, I tried spinning up
> new VMs, to no avail; the reboots persisted on the new VMs. This seemed to
> only affect the VMs that were on the hypervisor that failed; all other VMs
> across the cluster were fine.
>
> I have not installed any third-party monitoring software; I found a few
> posts in the forum about this, but that was not my issue.
>
> In an act of desperation, I performed a dist-upgrade and this solved
> the issue straight away.
> Kernel Version Linux 6.8.12-4-pve (2024-11-06T15:04Z)
> Manager Version pve-manager/8.3.0/c1689ccb1065a83b
>
> Hope this was helpful and if there are any ideas on why this
> happened, I welcome any responses.
>
> Thanks.
>
> JR
>
* Re: [PVE-User] VMs With Multiple Interfaces Rebooting
[not found] <CA+U74VPYtp8uS2sC515wMHc5qc6tfjzRnRtWbxMyVtRdNTD4SQ@mail.gmail.com>
2024-11-22 7:53 ` [PVE-User] VMs With Multiple Interfaces Rebooting Mark Schouten via pve-user
@ 2024-11-25 5:32 ` Alwin Antreich via pve-user
2 siblings, 0 replies; 5+ messages in thread
From: Alwin Antreich via pve-user @ 2024-11-25 5:32 UTC (permalink / raw)
To: Proxmox VE user list; +Cc: Alwin Antreich
From: Alwin Antreich <alwin@antreich.com>
To: Proxmox VE user list <pve-user@lists.proxmox.com>
Subject: Re: [PVE-User] VMs With Multiple Interfaces Rebooting
Date: Mon, 25 Nov 2024 06:32:16 +0100
Message-ID: <254CB7A1-E72D-442B-9956-721A4D66BEAE@antreich.com>
On November 22, 2024 7:16:53 AM GMT+01:00, JR Richardson <jmr.richardson@gmail.com> wrote:
>Hey Folks,
>
>Just wanted to share an experience I recently had, Cluster parameters:
>7 nodes, 2 HA Groups (3 nodes and 4 nodes), shared storage.
>Server Specs:
>CPU(s) 40 x Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz (2 Sockets)
>Kernel Version Linux 6.8.12-1-pve (2024-08-05T16:17Z)
>Manager Version pve-manager/8.2.4/faa83925c9641325
>
>Super stable environment for many years through software and hardware
>upgrades, with few issues to speak of. Then, without warning, one of my
>hypervisors in the 3-node group crashed with a memory DIMM error, and
>cluster HA took over and restarted the VMs on the other two nodes in the
>group as expected. The problem quickly materialized: the VMs started
>rebooting repeatedly, with a lot of network issues and notices of pending
>migration. I could not lock down exactly what the root cause was. Notably,
This sounds like it wanted to balance the load. Do you have CRS active and/or static load scheduling?
>these particular VMs all have multiple network interfaces. After several
>hours of not being able to get the current VMs stable, I tried spinning up
>new VMs, to no avail; the reboots persisted on the new VMs. This seemed to
>only affect the VMs that were on the hypervisor that failed; all other VMs
>across the cluster were fine.
>
>I have not installed any third-party monitoring software; I found a few
>posts in the forum about this, but that was not my issue.
>
>In an act of desperation, I performed a dist-upgrade and this solved
>the issue straight away.
>Kernel Version Linux 6.8.12-4-pve (2024-11-06T15:04Z)
>Manager Version pve-manager/8.3.0/c1689ccb1065a83b
The upgrade likely restarted the pve-ha-lrm service, which could break the migration cycle.
The systemd logs should give you a clue as to what was happening; the HA stack logs its actions on the given node.
Cheers,
Alwin
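A minimal sketch of the log inspection suggested above (the unit names are the standard PVE HA services; the node and time window are placeholders):

    # What the HA stack did on the affected node around the incident
    journalctl -u pve-ha-lrm -u pve-ha-crm \
        --since "2024-11-21 18:00" --until "2024-11-21 20:00"

    # Kernel messages (NIC/bridge state changes, KVM errors) in the same window
    journalctl -k --since "2024-11-21 18:00" --until "2024-11-21 20:00"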
[parent not found: <mailman.5.1732532402.36715.pve-user@lists.proxmox.com>]
* Re: [PVE-User] VMs With Multiple Interfaces Rebooting
[not found] <mailman.5.1732532402.36715.pve-user@lists.proxmox.com>
@ 2024-11-25 15:08 ` JR Richardson
0 siblings, 0 replies; 5+ messages in thread
From: JR Richardson @ 2024-11-25 15:08 UTC (permalink / raw)
To: Proxmox VE user list
> >Super stable environment for many years through software and hardware
> >upgrades, with few issues to speak of. Then, without warning, one of my
> >hypervisors in the 3-node group crashed with a memory DIMM error, and
> >cluster HA took over and restarted the VMs on the other two nodes in the
> >group as expected. The problem quickly materialized: the VMs started
> >rebooting repeatedly, with a lot of network issues and notices of pending
> >migration. I could not lock down exactly what the root cause was. Notably,
> This sounds like it wanted to balance the load. Do you have CRS active and/or static load scheduling?
CRS option is set to basic, not dynamic.
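For reference, a rough way to check that setting (it lives in the datacenter options; a line like 'crs: ha=static' would enable the static-load scheduler, and 'basic' is the default when nothing is set):

    # Cluster-wide options, including the CRS setting
    pvesh get /cluster/options

    # Or read the configuration file directly
    grep crs /etc/pve/datacenter.cfg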
>
> >these particular VMs all have multiple network interfaces. After several
> >hours of not being able to get the current VMs stable, I tried spinning up
> >new VMs, to no avail; the reboots persisted on the new VMs. This seemed to
> >only affect the VMs that were on the hypervisor that failed; all other VMs
> >across the cluster were fine.
> >
> >I have not installed any third-party monitoring software; I found a few
> >posts in the forum about this, but that was not my issue.
> >
> >In an act of desperation, I performed a dist-upgrade and this solved
> >the issue straight away.
> >Kernel Version Linux 6.8.12-4-pve (2024-11-06T15:04Z)
> >Manager Version pve-manager/8.3.0/c1689ccb1065a83b
> The upgrade likely restarted the pve-ha-lrm service, which could break the migration cycle.
>
> The systemd logs should give you a clue as to what was happening; the HA stack logs its actions on the given node.
I don't see anything in particular in the lrm logs, just the VMs being
started over and over.
Here are the relevant syslog entries, from the end of one reboot cycle to
the beginning of the next startup.
2024-11-21T18:36:59.023578-06:00 vvepve13 qmeventd[3838]: Starting cleanup for 13101
2024-11-21T18:36:59.105435-06:00 vvepve13 qmeventd[3838]: Finished cleanup for 13101
2024-11-21T18:37:30.758618-06:00 vvepve13 pve-ha-lrm[1608]: successfully acquired lock 'ha_agent_vvepve13_lock'
2024-11-21T18:37:30.758861-06:00 vvepve13 pve-ha-lrm[1608]: watchdog active
2024-11-21T18:37:30.758977-06:00 vvepve13 pve-ha-lrm[1608]: status change wait_for_agent_lock => active
2024-11-21T18:37:30.789271-06:00 vvepve13 pve-ha-lrm[4337]: starting service vm:13101
2024-11-21T18:37:30.808204-06:00 vvepve13 pve-ha-lrm[4338]: start VM 13101: UPID:vvepve13:000010F2:00007AEA:673FD24A:qmstart:13101:root@pam:
2024-11-21T18:37:30.808383-06:00 vvepve13 pve-ha-lrm[4337]: <root@pam> starting task UPID:vvepve13:000010F2:00007AEA:673FD24A:qmstart:13101:root@pam:
2024-11-21T18:37:31.112154-06:00 vvepve13 systemd[1]: Started 13101.scope.
2024-11-21T18:37:32.802414-06:00 vvepve13 kernel: [ 316.379944] tap13101i0: entered promiscuous mode
2024-11-21T18:37:32.846352-06:00 vvepve13 kernel: [ 316.423935] vmbr0: port 10(tap13101i0) entered blocking state
2024-11-21T18:37:32.846372-06:00 vvepve13 kernel: [ 316.423946] vmbr0: port 10(tap13101i0) entered disabled state
2024-11-21T18:37:32.846375-06:00 vvepve13 kernel: [ 316.423990] tap13101i0: entered allmulticast mode
2024-11-21T18:37:32.847377-06:00 vvepve13 kernel: [ 316.424825] vmbr0: port 10(tap13101i0) entered blocking state
2024-11-21T18:37:32.847391-06:00 vvepve13 kernel: [ 316.424832] vmbr0: port 10(tap13101i0) entered forwarding state
2024-11-21T18:37:34.594397-06:00 vvepve13 kernel: [ 318.172029] tap13101i1: entered promiscuous mode
2024-11-21T18:37:34.640376-06:00 vvepve13 kernel: [ 318.217302] vmbr0: port 11(tap13101i1) entered blocking state
2024-11-21T18:37:34.640393-06:00 vvepve13 kernel: [ 318.217310] vmbr0: port 11(tap13101i1) entered disabled state
2024-11-21T18:37:34.640396-06:00 vvepve13 kernel: [ 318.217341] tap13101i1: entered allmulticast mode
2024-11-21T18:37:34.640398-06:00 vvepve13 kernel: [ 318.218073] vmbr0: port 11(tap13101i1) entered blocking state
2024-11-21T18:37:34.640400-06:00 vvepve13 kernel: [ 318.218077] vmbr0: port 11(tap13101i1) entered forwarding state
2024-11-21T18:37:35.819630-06:00 vvepve13 pve-ha-lrm[4337]: Task 'UPID:vvepve13:000010F2:00007AEA:673FD24A:qmstart:13101:root@pam:' still active, waiting
2024-11-21T18:37:36.249349-06:00 vvepve13 kernel: [ 319.827024] tap13101i2: entered promiscuous mode
2024-11-21T18:37:36.291346-06:00 vvepve13 kernel: [ 319.868406] vmbr0: port 12(tap13101i2) entered blocking state
2024-11-21T18:37:36.291365-06:00 vvepve13 kernel: [ 319.868417] vmbr0: port 12(tap13101i2) entered disabled state
2024-11-21T18:37:36.291367-06:00 vvepve13 kernel: [ 319.868443] tap13101i2: entered allmulticast mode
2024-11-21T18:37:36.291368-06:00 vvepve13 kernel: [ 319.869185] vmbr0: port 12(tap13101i2) entered blocking state
2024-11-21T18:37:36.291369-06:00 vvepve13 kernel: [ 319.869191] vmbr0: port 12(tap13101i2) entered forwarding state
2024-11-21T18:37:37.997394-06:00 vvepve13 kernel: [ 321.575034] tap13101i3: entered promiscuous mode
2024-11-21T18:37:38.040384-06:00 vvepve13 kernel: [ 321.617225] vmbr0: port 13(tap13101i3) entered blocking state
2024-11-21T18:37:38.040396-06:00 vvepve13 kernel: [ 321.617236] vmbr0: port 13(tap13101i3) entered disabled state
2024-11-21T18:37:38.040400-06:00 vvepve13 kernel: [ 321.617278] tap13101i3: entered allmulticast mode
2024-11-21T18:37:38.040402-06:00 vvepve13 kernel: [ 321.618070] vmbr0: port 13(tap13101i3) entered blocking state
2024-11-21T18:37:38.040403-06:00 vvepve13 kernel: [ 321.618077] vmbr0: port 13(tap13101i3) entered forwarding state
2024-11-21T18:37:38.248094-06:00 vvepve13 pve-ha-lrm[4337]: <root@pam> end task UPID:vvepve13:000010F2:00007AEA:673FD24A:qmstart:13101:root@pam: OK
2024-11-21T18:37:38.254144-06:00 vvepve13 pve-ha-lrm[4337]: service status vm:13101 started
2024-11-21T18:37:44.256824-06:00 vvepve13 QEMU[3794]: kvm: ../accel/kvm/kvm-all.c:1836: kvm_irqchip_commit_routes: Assertion `ret == 0' failed.
2024-11-21T18:38:17.486394-06:00 vvepve13 kernel: [ 361.063298] vmbr0: port 10(tap13101i0) entered disabled state
2024-11-21T18:38:17.486423-06:00 vvepve13 kernel: [ 361.064099] tap13101i0 (unregistering): left allmulticast mode
2024-11-21T18:38:17.486426-06:00 vvepve13 kernel: [ 361.064110] vmbr0: port 10(tap13101i0) entered disabled state
2024-11-21T18:38:17.510386-06:00 vvepve13 kernel: [ 361.087517] vmbr0: port 11(tap13101i1) entered disabled state
2024-11-21T18:38:17.510400-06:00 vvepve13 kernel: [ 361.087796] tap13101i1 (unregistering): left allmulticast mode
2024-11-21T18:38:17.510403-06:00 vvepve13 kernel: [ 361.087805] vmbr0: port 11(tap13101i1) entered disabled state
2024-11-21T18:38:17.540386-06:00 vvepve13 kernel: [ 361.117511] vmbr0: port 12(tap13101i2) entered disabled state
2024-11-21T18:38:17.540402-06:00 vvepve13 kernel: [ 361.117817] tap13101i2 (unregistering): left allmulticast mode
2024-11-21T18:38:17.540404-06:00 vvepve13 kernel: [ 361.117827] vmbr0: port 12(tap13101i2) entered disabled state
2024-11-21T18:38:17.561380-06:00 vvepve13 kernel: [ 361.138518] vmbr0: port 13(tap13101i3) entered disabled state
2024-11-21T18:38:17.561394-06:00 vvepve13 kernel: [ 361.138965] tap13101i3 (unregistering): left allmulticast mode
2024-11-21T18:38:17.561399-06:00 vvepve13 kernel: [ 361.138977] vmbr0: port 13(tap13101i3) entered disabled state
2024-11-21T18:38:17.584412-06:00 vvepve13 systemd[1]: 13101.scope: Deactivated successfully.
2024-11-21T18:38:17.584619-06:00 vvepve13 systemd[1]: 13101.scope: Consumed 51.122s CPU time.
2024-11-21T18:38:18.522886-06:00 vvepve13 pvestatd[1476]: VM 13101 qmp command failed - VM 13101 not running
2024-11-21T18:38:18.523725-06:00 vvepve13 pve-ha-lrm[4889]: <root@pam> end task UPID:vvepve13:0000131A:00008A78:673FD272:qmstart:13104:root@pam: OK
2024-11-21T18:38:18.945142-06:00 vvepve13 qmeventd[4990]: Starting cleanup for 13101
2024-11-21T18:38:19.022405-06:00 vvepve13 qmeventd[4990]: Finished cleanup for 13101
Thanks
JR
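The QEMU abort in the log above, 'kvm_irqchip_commit_routes: Assertion `ret == 0' failed', appears to be what actually kills the guest a few seconds after the service is reported started; HA then sees the service gone and starts it again, which would explain the reboot loop. A rough way to confirm the pattern on a node (time window and VMID are examples):

    # All IRQ-routing aborts logged by QEMU processes on this node
    journalctl --since "2024-11-21" | grep 'kvm_irqchip_commit_routes'

    # HA start/stop actions for the affected VM in the same window
    journalctl -u pve-ha-lrm --since "2024-11-21" | grep 'vm:13101'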
* Re: [PVE-User] VMs With Multiple Interfaces Rebooting
@ 2024-11-22 16:59 JR Richardson
0 siblings, 0 replies; 5+ messages in thread
From: JR Richardson @ 2024-11-22 16:59 UTC (permalink / raw)
To: pve-user
Hi Mark,
Found this error during log review:
" vvepve13 pvestatd[1468]: VM 13113 qmp command failed - VM 13113 qmp
command 'query-proxmox-support' failed - unable to connect to VM 13113 qmp
socket - timeout after 51 retries"
HA was sending a shutdown to the VM after not being able to verify that the
VM was running. I initially thought this was networking related, but as I
investigated further, it looks like a bug in 'qm'. So strange; we have been
running on this version for months, doing migrations and spinning up new
VMs without any issues.
Thanks
JR
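A rough way to poke at that QMP path by hand while a VM is in this state (the VMID is an example; on a stock PVE install the QMP socket is expected under /var/run/qemu-server/):

    # Full status via QMP; hangs or times out if the socket is unresponsive
    qm status 13113 --verbose

    # Does the socket exist, and is the QEMU process still listed?
    ls -l /var/run/qemu-server/13113.qmp
    qm list | grep 13113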
Hi JR,
What do you mean by ‘reboot’? Does the VM crash, so that it is powered down
from an HA point of view and started back up? Or does the VM OS reboot
cleanly?
Mark Schouten
> On 22 Nov 2024, at 07:18, JR Richardson <jmr.richardson@gmail.com> wrote:
>
> Hey Folks,
>
> Just wanted to share an experience I recently had, Cluster parameters:
> 7 nodes, 2 HA Groups (3 nodes and 4 nodes), shared storage.
> Server Specs:
> CPU(s) 40 x Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz (2 Sockets)
> Kernel Version Linux 6.8.12-1-pve (2024-08-05T16:17Z)
> Manager Version pve-manager/8.2.4/faa83925c9641325
>
> Super stable environment for many years through software and hardware
> upgrades, with few issues to speak of. Then, without warning, one of my
> hypervisors in the 3-node group crashed with a memory DIMM error, and
> cluster HA took over and restarted the VMs on the other two nodes in the
> group as expected. The problem quickly materialized: the VMs started
> rebooting repeatedly, with a lot of network issues and notices of pending
> migration. I could not lock down exactly what the root cause was. Notably,
> these particular VMs all have multiple network interfaces. After several
> hours of not being able to get the current VMs stable, I tried spinning up
> new VMs, to no avail; the reboots persisted on the new VMs. This seemed to
> only affect the VMs that were on the hypervisor that failed; all other VMs
> across the cluster were fine.
>
> I have not installed any third-party monitoring software; I found a few
> posts in the forum about this, but that was not my issue.
>
> In an act of desperation, I performed a dist-upgrade and this solved
> the issue straight away.
> Kernel Version Linux 6.8.12-4-pve (2024-11-06T15:04Z)
> Manager Version pve-manager/8.3.0/c1689ccb1065a83b
>
> Hope this was helpful and if there are any ideas on why this happened,
> I welcome any responses.
>
> Thanks.
>
> JR