* Re: [PVE-User] Proxmox node hang
[not found] <mailman.463.1626338446.464.pve-user@lists.proxmox.com>
@ 2021-07-15 9:31 ` Luke Thompson
0 siblings, 0 replies; only message in thread
From: Luke Thompson @ 2021-07-15 9:31 UTC (permalink / raw)
To: pve-user, Eneko Lacunza
Hi Eneko,
Consumer RAM is always a tricky starting point. When under heavy load,
the fault rates tend to be quite surprising (hence why ECC is preferable
in enterprise/etc settings).
Your Gigabyte B450 AORUS M motherboard has many new BIOS iterations
available - they're definitely worth reading into, and applying after
your own research.
https://www.gigabyte.com/Motherboard/B450-AORUS-M-rev-1x/support#support-dl-bios
(F60/F61c in 2021 compared to F50 in 2019)
Where there's been a hardware fault downstream, we tend to see kernel
taint flags similar to the ones you are seeing (P O, P D O, etc.).
https://www.kernel.org/doc/html/latest/admin-guide/tainted-kernels.html
(We've had hardware like OOB cards trigger module flags)
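
For reference, the three taint letters in the quoted logs can be decoded with a small sketch like the one below. The letter meanings and bit positions are taken from the kernel's tainted-kernels guide linked above; the function names are made up for illustration, and only the three flags seen here are covered, not the full table.

```python
# Taint letters seen in the quoted logs, with the bit positions and meanings
# listed in the kernel's tainted-kernels admin guide. Only a subset is covered.
TAINT_FLAGS = {
    "P": (0, "proprietary module was loaded"),
    "D": (7, "kernel died recently (an OOPS or BUG occurred)"),
    "O": (12, "externally built (out-of-tree) module was loaded"),
}


def decode_taint_letters(field):
    """Explain each known letter in a 'Tainted:' field such as 'P D O'."""
    return [f"{c}: {TAINT_FLAGS[c][1]}" for c in field if c in TAINT_FLAGS]


def decode_taint_mask(mask):
    """Return the known letters set in a /proc/sys/kernel/tainted value."""
    return [c for c, (bit, _) in TAINT_FLAGS.items() if mask & (1 << bit)]


if __name__ == "__main__":
    for line in decode_taint_letters("P D O"):
        print(line)
```

Read this way, the D flag is the one that records the fault itself; P and O only say which kinds of modules were loaded.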
Have you run the system through extended memory testing outside of PVE?
(memtest is available at boot, at least from the PVE ISO.)
Your BIOS version is dated one month after this article, which suggests
that a BIOS update may help avoid the RNG bug.
https://arstechnica.com/gadgets/2019/10/how-a-months-old-amd-microcode-bug-destroyed-my-weekend
With the kernel taint lines, do you have the procs/calls that marry up
to the PIDs listed? What were they doing at the time?
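
As a starting point for that, a rough sketch like the following could pull the CPU, PID, command name and taint flags out of each fault header in the syslog. It assumes the wrapped log lines have been re-joined into single lines first; the regular expression and the function name are my own and not part of any PVE tooling. It only collects the pairs, it cannot tell you what each process was doing at the time.

```python
import re

# Matches the oops header line once wrapped syslog lines have been re-joined.
# The taint letters are followed by the kernel version, which starts with a digit.
HEADER = re.compile(
    r"CPU: (\d+) PID: (\d+) Comm: (\S+) Tainted: ([A-Z ]+?)\s+\d"
)


def fault_headers(log_text):
    """Yield (cpu, pid, comm, taint) for each fault header found in the text."""
    for match in HEADER.finditer(log_text):
        cpu, pid, comm, taint = match.groups()
        # Collapse runs of spaces between taint letters to a single space.
        yield int(cpu), int(pid), comm, " ".join(taint.split())


# Two header lines copied from the quoted syslog, joined onto single lines.
sample = (
    "Jul 15 06:45:10 sanmarko kernel: [145747.429175] CPU: 11 PID: 1914237 "
    "Comm: ceph Tainted: P O 5.4.124-1-pve #1\n"
    "Jul 15 06:45:12 sanmarko kernel: [145749.037695] CPU: 11 PID: 2433 "
    "Comm: tp_fstore_op Tainted: P D O 5.4.124-1-pve #1\n"
)

if __name__ == "__main__":
    for row in fault_headers(sample):
        print(row)
```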
From the brief logs you've included, the faulting processes look varied,
implying that the problem is likely hardware-based. I'd guess RAM.
As it's only faulted once, I'd say a decent course of action would be to
test the memory extensively, and go from there.
If it can identify a faulting module, then you can remove that DIMM and
swap it for a known-good one instead, etc.
The testing can take a while, and in our experience it can be worth
leaving it to cycle through, esp. with non-ECC.
A running kernel never "loses its taint", so the flags stay set until a
reboot. If you remove the underlying cause, though, the D flag should not
come back afterwards.
It'll be good to hear about how you get on with it all. Best of luck
with it!
Cheers,
Luke Thompson
Operations Manager
luke.t@tncrew.com.au
PO Box 111, West Wallsend
On 15/7/21 6:40 pm, Eneko Lacunza via pve-user wrote:
> Hi all,
>
> Tonight a node of our 5-node Proxmox 6.4+Ceph cluster froze at
> ~6:45. A reset brought it back online later in the morning and it has
> been working well for 2 hours now.
>
> HA worked like a charm and Ceph has recovered in some minutes.
>
> A fantastic success story really, thanks for your excellent work, Proxmox
> developers and contributors!
>
> Now for the "post-mortem", I see node's 8 cores "general protection
> fault"ing one after another in a minute, with different processes.
>
> I suspect a memory module or main board fault (Ryzen 3700X 8-core,
> 4x32GB non-ECC RAM and gigabyte mainboard, all "consumer" parts, it has
> been working well since dec 2019). What do you think?
>
> Here is a shortened syslog (I can provide all 437 lines if necessary):
>
> ---
>
> Jul 15 06:45:00 sanmarko systemd[1]: Starting Proxmox VE replication
> runner...
>
> Jul 15 06:45:00 sanmarko systemd[1]: pvesr.service: Succeeded.
>
> Jul 15 06:45:00 sanmarko systemd[1]: Started Proxmox VE replication runner.
>
> Jul 15 06:45:01 sanmarko CRON[1913457]: (root) CMD (command -v
> debian-sa1 > /dev/null && debian-sa1 1 1)
>
> Jul 15 06:45:01 sanmarko CRON[1913458]: (root) CMD (if [ -x
> /etc/munin/plugins/apt_all ]; then /etc/munin/plugins/apt_all update
> 7200 12 >/dev/null; elif [ -x /etc/munin/plugins/apt ]; then
> /etc/munin/plugins/apt update 7200 12 >/dev/null; fi)
>
> Jul 15 06:45:10 sanmarko kernel: [145747.429110] general protection
> fault: 0000 [#1] SMP NOPTI
>
> Jul 15 06:45:10 sanmarko kernel: [145747.429175] CPU: 11 PID: 1914237
> Comm: ceph Tainted: P O 5.4.124-1-pve #1
>
> Jul 15 06:45:10 sanmarko kernel: [145747.429245] Hardware name: Gigabyte
> Technology Co., Ltd. B450 AORUS M/B450 AORUS M, BIOS F50 11/27/2019
>
> Jul 15 06:45:10 sanmarko kernel: [145747.429322] RIP:
> 0010:kmem_cache_alloc+0x89/0x240
>
> Jul 15 06:45:10 sanmarko kernel: [145747.429382] Code: 08 65 4c 03 05 30
> e4 57 5b 49 83 78 10 00 4d 8b 20 0f 84 94 01 00 00 4d 85 e4 0f 84 8b 01
> 00 00 41 8b 47 20 49 8b 3f 4c 01 e0 <48> 8b 18 48 89 c1 49 33 9f 70 01
> 00 00 4c 89 e0 48 0f c9 48 31 cb
>
> [...]
>
> Jul 15 06:45:10 sanmarko kernel: [145747.430245] Call Trace:
>
> Jul 15 06:45:10 sanmarko kernel: [145747.430314] ?
> security_file_alloc+0x29/0x90
>
> [...]
>
> Jul 15 06:45:10 sanmarko kernel: [145747.431244]
> entry_SYSCALL_64_after_hwframe+0x44/0xa9
>
> [...]
>
> Jul 15 06:45:12 sanmarko kernel: [145749.037616] general protection
> fault: 0000 [#2] SMP NOPTI
>
> Jul 15 06:45:12 sanmarko kernel: [145749.037695] CPU: 11 PID: 2433 Comm:
> tp_fstore_op Tainted: P D O 5.4.124-1-pve #1
>
> Jul 15 06:45:12 sanmarko kernel: [145749.037793] Hardware name: Gigabyte
> Technology Co., Ltd. B450 AORUS M/B450 AORUS M, BIOS F50 11/27/2019
>
> Jul 15 06:45:12 sanmarko kernel: [145749.037898] RIP:
> 0010:apparmor_file_free_security+0x22/0x40
>
> Jul 15 06:45:12 sanmarko kernel: [145749.037975] Code: 2c ff ff eb a2 0f
> 1f 00 0f 1f 44 00 00 48 63 05 28 fb fc 00 48 03 87 c0 00 00 00 74 1a 48
> 8b 78 08 48 85 ff 74 11 55 48 89 e5 <f0> ff 0f 0f 88 dc 96 62 00 74 03
> 5d c3 c3 e8 db 55 00 00 5d c3 66
>
> [...]
>
> Jul 15 06:45:12 sanmarko kernel: [145749.038942] Call Trace:
>
> Jul 15 06:45:12 sanmarko kernel: [145749.039015]
> security_file_free+0x27/0x60
>
> [...]
>
> Jul 15 06:45:12 sanmarko kernel: [145749.039441]
> entry_SYSCALL_64_after_hwframe+0x44/0xa9
>
> [...]
>
> Jul 15 06:45:29 sanmarko kernel: [145765.573841] general protection
> fault: 0000 [#3] SMP NOPTI
>
> Jul 15 06:45:29 sanmarko kernel: [145765.573922] CPU: 11 PID: 1733 Comm:
> pve-firewall Tainted: P D O 5.4.124-1-pve #1
>
> Jul 15 06:45:29 sanmarko kernel: [145765.574021] Hardware name: Gigabyte
> Technology Co., Ltd. B450 AORUS M/B450 AORUS M, BIOS F50 11/27/2019
>
> Jul 15 06:45:29 sanmarko kernel: [145765.574127] RIP:
> 0010:kmem_cache_alloc+0x89/0x240
>
> Jul 15 06:45:29 sanmarko kernel: [145765.574201] Code: 08 65 4c 03 05 30
> e4 57 5b 49 83 78 10 00 4d 8b 20 0f 84 94 01 00 00 4d 85 e4 0f 84 8b 01
> 00 00 41 8b 47 20 49 8b 3f 4c 01 e0 <48> 8b 18 48 89 c1 49 33 9f 70 01
> 00 00 4c 89 e0 48 0f c9 48 31 cb
>
> [...]
>
> Jul 15 06:45:29 sanmarko kernel: [145765.576321] Call Trace:
>
> Jul 15 06:45:29 sanmarko kernel: [145765.576391] ?
> security_file_alloc+0x29/0x90
>
> [...]
>
> Jul 15 06:45:29 sanmarko kernel: [145765.577258]
> entry_SYSCALL_64_after_hwframe+0x44/0xa9
>
> [...]
>
> Jul 15 06:45:29 sanmarko systemd[1]: pve-firewall.service: Main process
> exited, code=killed, status=11/SEGV
>
> Jul 15 06:45:29 sanmarko systemd[1]: pve-firewall.service: Failed with
> result 'signal'.
>
> Jul 15 06:45:35 sanmarko kernel: [145772.194438] general protection
> fault: 0000 [#4] SMP NOPTI
>
> Jul 15 06:45:35 sanmarko kernel: [145772.194516] CPU: 11 PID: 1776 Comm:
> ms_dispatch Tainted: P D O 5.4.124-1-pve #1
>
> Jul 15 06:45:35 sanmarko kernel: [145772.194614] Hardware name: Gigabyte
> Technology Co., Ltd. B450 AORUS M/B450 AORUS M, BIOS F50 11/27/2019
>
> Jul 15 06:45:35 sanmarko kernel: [145772.194718] RIP:
> 0010:kmem_cache_alloc+0x89/0x240
>
> Jul 15 06:45:35 sanmarko kernel: [145772.194792] Code: 08 65 4c 03 05 30
> e4 57 5b 49 83 78 10 00 4d 8b 20 0f 84 94 01 00 00 4d 85 e4 0f 84 8b 01
> 00 00 41 8b 47 20 49 8b 3f 4c 01 e0 <48> 8b 18 48 89 c1 49 33 9f 70 01
> 00 00 4c 89 e0 48 0f c9 48 31 cb
>
> [...]
>
> Jul 15 06:45:35 sanmarko kernel: [145772.195750] Call Trace:
>
> Jul 15 06:45:35 sanmarko kernel: [145772.195819] ?
> security_file_alloc+0x29/0x90
>
> [...]
>
> Jul 15 06:45:35 sanmarko kernel: [145772.197176]
> entry_SYSCALL_64_after_hwframe+0x44/0xa9
>
> [...]
>
> Jul 15 06:45:37 sanmarko kernel: [145774.137506] general protection
> fault: 0000 [#5] SMP NOPTI
>
> Jul 15 06:45:37 sanmarko kernel: [145774.137586] CPU: 11 PID: 2466 Comm:
> tp_fstore_op Tainted: P D O 5.4.124-1-pve #1
>
> Jul 15 06:45:37 sanmarko kernel: [145774.137687] Hardware name: Gigabyte
> Technology Co., Ltd. B450 AORUS M/B450 AORUS M, BIOS F50 11/27/2019
>
> Jul 15 06:45:37 sanmarko kernel: [145774.137791] RIP:
> 0010:kmem_cache_alloc+0x89/0x240
>
> Jul 15 06:45:37 sanmarko kernel: [145774.137865] Code: 08 65 4c 03 05 30
> e4 57 5b 49 83 78 10 00 4d 8b 20 0f 84 94 01 00 00 4d 85 e4 0f 84 8b 01
> 00 00 41 8b 47 20 49 8b 3f 4c 01 e0 <48> 8b 18 48 89 c1 49 33 9f 70 01
> 00 00 4c 89 e0 48 0f c9 48 31 cb
>
> [...]
>
> Jul 15 06:45:37 sanmarko kernel: [145774.139990] Call Trace:
>
> Jul 15 06:45:37 sanmarko kernel: [145774.140059] ?
> security_file_alloc+0x29/0x90
>
> [...]
>
> Jul 15 06:45:37 sanmarko kernel: [145774.140991]
> entry_SYSCALL_64_after_hwframe+0x44/0xa9
>
> [...]
>
> Jul 15 06:45:40 sanmarko kernel: [145776.830930] general protection
> fault: 0000 [#6] SMP NOPTI
>
> Jul 15 06:45:40 sanmarko kernel: [145776.831010] CPU: 11 PID: 7234 Comm:
> kvm Tainted: P D O 5.4.124-1-pve #1
>
> Jul 15 06:45:40 sanmarko kernel: [145776.831109] Hardware name: Gigabyte
> Technology Co., Ltd. B450 AORUS M/B450 AORUS M, BIOS F50 11/27/2019
>
> Jul 15 06:45:40 sanmarko kernel: [145776.831217] RIP:
> 0010:kmem_cache_alloc+0x89/0x240
>
> Jul 15 06:45:40 sanmarko kernel: [145776.831294] Code: 08 65 4c 03 05 30
> e4 57 5b 49 83 78 10 00 4d 8b 20 0f 84 94 01 00 00 4d 85 e4 0f 84 8b 01
> 00 00 41 8b 47 20 49 8b 3f 4c 01 e0 <48> 8b 18 48 89 c1 49 33 9f 70 01
> 00 00 4c 89 e0 48 0f c9 48 31 cb
>
> [...]
>
> Jul 15 06:45:40 sanmarko kernel: [145776.832264] Call Trace:
>
> Jul 15 06:45:40 sanmarko kernel: [145776.832334] ?
> security_file_alloc+0x29/0x90
>
> [...]
>
> Jul 15 06:45:40 sanmarko kernel: [145776.833336]
> entry_SYSCALL_64_after_hwframe+0x44/0xa9
>
> Jul 15 06:45:40 sanmarko pve-ha-lrm[1914439]: starting service vm:149
>
> Jul 15 06:45:40 sanmarko pve-ha-lrm[1914439]: <root@pam> starting task
> UPID:sanmarko:001D3649:00DE7138:60EFBD74:qmstart:149:root@pam:
>
> Jul 15 06:45:40 sanmarko pve-ha-lrm[1914441]: start VM 149:
> UPID:sanmarko:001D3649:00DE7138:60EFBD74:qmstart:149:root@pam:
>
> Jul 15 06:45:40 sanmarko systemd[1]: 149.scope: Succeeded.
>
> Jul 15 06:45:40 sanmarko systemd[1]: Stopped 149.scope.
>
> Jul 15 06:45:43 sanmarko kernel: [145779.840863] general protection
> fault: 0000 [#7] SMP NOPTI
>
> Jul 15 06:45:43 sanmarko kernel: [145779.840942] CPU: 11 PID: 1740 Comm:
> pvestatd Tainted: P D O 5.4.124-1-pve #1
>
> Jul 15 06:45:43 sanmarko kernel: [145779.842207] Hardware name: Gigabyte
> Technology Co., Ltd. B450 AORUS M/B450 AORUS M, BIOS F50 11/27/2019
>
> Jul 15 06:45:43 sanmarko kernel: [145779.842310] RIP:
> 0010:kmem_cache_alloc+0x89/0x240
>
> Jul 15 06:45:43 sanmarko kernel: [145779.842383] Code: 08 65 4c 03 05 30
> e4 57 5b 49 83 78 10 00 4d 8b 20 0f 84 94 01 00 00 4d 85 e4 0f 84 8b 01
> 00 00 41 8b 47 20 49 8b 3f 4c 01 e0 <48> 8b 18 48 89 c1 49 33 9f 70 01
> 00 00 4c 89 e0 48 0f c9 48 31 cb
>
> [...]
>
> Jul 15 06:45:43 sanmarko kernel: [145779.843337] Call Trace:
>
> Jul 15 06:45:43 sanmarko kernel: [145779.843406] ?
> security_file_alloc+0x29/0x90
>
> [...]
>
> Jul 15 06:45:43 sanmarko kernel: [145779.844545]
> entry_SYSCALL_64_after_hwframe+0x44/0xa9
>
> [...]
>
> Jul 15 06:45:43 sanmarko systemd[1]: pvestatd.service: Main process
> exited, code=killed, status=11/SEGV
>
> Jul 15 06:45:43 sanmarko systemd[1]: pvestatd.service: Failed with
> result 'signal'.
>
> Jul 15 06:45:45 sanmarko pve-ha-lrm[1914439]: Task
> 'UPID:sanmarko:001D3649:00DE7138:60EFBD74:qmstart:149:root@pam:' still
> active, waiting
>
> Jul 15 06:45:45 sanmarko pve-ha-lrm[1914441]: timeout waiting on systemd
>
> Jul 15 06:45:45 sanmarko pve-ha-lrm[1914439]: <root@pam> end task
> UPID:sanmarko:001D3649:00DE7138:60EFBD74:qmstart:149:root@pam: timeout
> waiting on systemd
>
> Jul 15 06:45:45 sanmarko pve-ha-lrm[1914439]: unable to start service vm:149
>
> Jul 15 06:45:50 sanmarko pve-ha-lrm[1804]: restart policy: retry number
> 1 for service 'vm:149'
>
> Jul 15 06:45:56 sanmarko kernel: [145792.695695] general protection
> fault: 0000 [#8] SMP NOPTI
>
> Jul 15 06:45:56 sanmarko kernel: [145792.695777] CPU: 11 PID: 1783 Comm:
> pve-ha-crm Tainted: P D O 5.4.124-1-pve #1
>
> Jul 15 06:45:56 sanmarko kernel: [145792.695876] Hardware name: Gigabyte
> Technology Co., Ltd. B450 AORUS M/B450 AORUS M, BIOS F50 11/27/2019
>
> Jul 15 06:45:56 sanmarko kernel: [145792.695980] RIP:
> 0010:kmem_cache_alloc+0x89/0x240
>
> Jul 15 06:45:56 sanmarko kernel: [145792.696054] Code: 08 65 4c 03 05 30
> e4 57 5b 49 83 78 10 00 4d 8b 20 0f 84 94 01 00 00 4d 85 e4 0f 84 8b 01
> 00 00 41 8b 47 20 49 8b 3f 4c 01 e0 <48> 8b 18 48 89 c1 49 33 9f 70 01
> 00 00 4c 89 e0 48 0f c9 48 31 cb
>
> [...]
>
> Jul 15 06:45:56 sanmarko kernel: [145792.697012] Call Trace:
>
> Jul 15 06:45:56 sanmarko kernel: [145792.697081] ?
> security_file_alloc+0x29/0x90
>
> [...]
>
> Jul 15 06:45:56 sanmarko kernel: [145792.698225]
> entry_SYSCALL_64_after_hwframe+0x44/0xa9
>
> [...]
>
> Jul 15 06:45:56 sanmarko watchdog-mux[895]: client did not stop watchdog
> - disable watchdog updates
>
> Jul 15 06:45:56 sanmarko systemd[1]: pve-ha-crm.service: Main process
> exited, code=killed, status=11/SEGV
>
> Jul 15 06:45:56 sanmarko systemd[1]: pve-ha-crm.service: Failed with
> result 'signal'.
>
> Jul 15 06:45:56 sanmarko kernel: [145792.701730] FS:
> 00007fae7b4141c0(0000) GS:ffff96b69eac0000(0000) knlGS:0000000000000000
>
> Jul 15 06:45:56 sanmarko kernel: [145792.701826] CS: 0010 DS: 0000 ES:
> 0000 CR0: 0000000080050033
>
> Jul 15 06:45:56 sanmarko kernel: [145792.701902] CR2: 00007fa6bbf73008
> CR3: 0000001f29900000 CR4: 0000000000340ee0
>
> [... no more logs until reset ...]
>
>
> # pveversion -v
>
> proxmox-ve: 6.4-1 (running kernel: 5.4.124-1-pve)
> pve-manager: 6.4-13 (running version: 6.4-13/9f411e79)
> pve-kernel-5.4: 6.4-4
> pve-kernel-helper: 6.4-4
> pve-kernel-5.3: 6.1-6
> pve-kernel-5.4.124-1-pve: 5.4.124-1
> pve-kernel-5.4.119-1-pve: 5.4.119-1
> pve-kernel-5.3.18-3-pve: 5.3.18-3
> ceph: 15.2.13-pve1~bpo10
> ceph-fuse: 15.2.13-pve1~bpo10
> corosync: 3.1.2-pve1
> criu: 3.11-3
> glusterfs-client: 5.5-3
> ifupdown: residual config
> ifupdown2: 3.0.0-1+pve4~bpo10
> libjs-extjs: 6.0.1-10
> libknet1: 1.20-pve1
> libproxmox-acme-perl: 1.1.0
> libproxmox-backup-qemu0: 1.1.0-1
> libpve-access-control: 6.4-3
> libpve-apiclient-perl: 3.1-3
> libpve-common-perl: 6.4-3
> libpve-guest-common-perl: 3.1-5
> libpve-http-server-perl: 3.2-3
> libpve-storage-perl: 6.4-1
> libqb0: 1.0.5-1
> libspice-server1: 0.14.2-4~pve6+1
> lvm2: 2.03.02-pve4
> lxc-pve: 4.0.6-2
> lxcfs: 4.0.6-pve1
> novnc-pve: 1.1.0-1
> proxmox-backup-client: 1.1.10-1
> proxmox-mini-journalreader: 1.1-1
> proxmox-widget-toolkit: 2.6-1
> pve-cluster: 6.4-1
> pve-container: 3.3-6
> pve-docs: 6.4-2
> pve-edk2-firmware: 2.20200531-1
> pve-firewall: 4.1-4
> pve-firmware: 3.2-4
> pve-ha-manager: 3.1-1
> pve-i18n: 2.3-1
> pve-qemu-kvm: 5.2.0-6
> pve-xtermjs: 4.7.0-3
> qemu-server: 6.4-2
> smartmontools: 7.2-pve2
> spiceterm: 3.1-1
> vncterm: 1.6-2
> zfsutils-linux: 2.0.4-pve1
>
>
> Thanks a lot
>
>
> Eneko Lacunza
> Zuzendari teknikoa | Director técnico
> Binovo IT Human Project
>
> Tel. +34 943 569 206 |https://www.binovo.es
> Astigarragako Bidea, 2 - 2º izda. Oficina 10-11, 20180 Oiartzun
>
> https://www.youtube.com/user/CANALBINOVO
> https://www.linkedin.com/company/37269706/
> _______________________________________________
> pve-user mailing list
> pve-user@lists.proxmox.com
> https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-user