public inbox for pve-user@lists.proxmox.com
* Re: [PVE-User] Server freezing randomly with Proxmox 6.2-4 on AMD Ryzen system
       [not found] <0e58d1d5-384b-55d2-9042-ae8c1e2ade6c@qwer.tk>
@ 2020-09-07 11:21 ` Wolfgang Link
  2020-09-07 18:44   ` Hermann Himmelbauer
  2020-09-07 11:29 ` Chris Sutcliff
  2020-11-16 17:21 ` Hermann Himmelbauer
  2 siblings, 1 reply; 5+ messages in thread
From: Wolfgang Link @ 2020-09-07 11:21 UTC (permalink / raw)
  To: Proxmox VE user list, Hermann Himmelbauer

Hi Hermann,

This board, with this BIOS version and a Ryzen 9 3900X, has been running perfectly for over 4 months, even with very high load in the VMs.

What have you set in the BIOS?

Regards

Wolfgang
> On 09/04/2020 4:45 PM Hermann Himmelbauer <hermann@qwer.tk> wrote:
> 
>  
> Dear Proxmox users,
> 
> I'm trying to install a 3-node cluster (latest proxmox/ceph) and
> experience random freezes. The node can either be completely frozen (no
> blinking cursor on console, no ping) or can get somewhat blocked / slow etc.
> 
> This happens most often on node 2 (approx. 3-4 times / day), node 3
> never got stuck within 14 days runtime, node 1 once.
> 
> Unfortunately I have not found a way to trigger this behaviour. However,
> I *think* it happens most often when I stress the machine in some
> way (performance test within a virtual machine) and then let it idle.
> 
> When the machine freezes completely, there is no logfile. However, if it
> is partially frozen, some info can be acquired via dmesg. (See attached
> file). ("device=2b:00.0" is an Intel 10GBit Ethernet adapter (X550T). So
> perhaps there is some driver issue regarding this Ethernet adapter?)
> 
> The system consists of the following components:
> 
> - AMD Ryzen 3 3200G, 4x 3.60GHz, boxed (YD3200C5FHBOX)
> - ASRock Rack X470D4U2-2T (Mainboard)
> - Samsung SSD 970 EVO Plus 250GB, M.2 (MZ-V7S250BW) (builtin SSD for OS)
> - 2 * Kingston Server Premier DIMM 16GB, DDR4-2666, CL19-19-19, ECC (BOM
> Number: 9965745-002.A00G, Part Number: KSM26ED8/16ME)
> - be quiet! Pure Power 11 CM 400W ATX 2.4 (BN296) (Power supply)
> - 2 * Micron 5300 PRO - Read Intensive 960GB, SATA
> (MTFDDAK960TDS-1AW1Z6) (SSD for Ceph)
> - LogiLink PC0075, 2x RJ-45, PCIe 2.0 x1 (second NIC with two ports)
> 
> The system is Linux Debian 10.4 (Proxmox 6.2-4) with kernel 5.4.34-1-pve
> #1 SMP PVE 5.4.34-2 (Thu, 07 May 2020 10:02:02 +0200) x86_64 GNU/Linux.
> 
> What I did so far (without success):
> 
> - Disabled C6, as I read that this CPU state can lead to unstable systems
> (via "python zenstates.py --c6-disable" -> still errors).
> - Updated my BIOS to the latest version (3.30)
> - Checked that the CPU + RAM are compatible with the mainboard (they are
> listed as compatible on the ASRock website)
> - Checked logs in IPMI (undervoltage, temperature etc., nothing is logged)
> - Memory test (memtest86, no errors)
> 
> Do you have any clue what could be the reason for these freezes? Should
> I think of some hardware error? Or is this some known Linux bug that can
> be fixed?
> 
> Best Regards,
> Hermann
> 
> -- 
> hermann@qwer.tk
> PGP/GPG: 299893C7 (on keyservers)
> _______________________________________________
> pve-user mailing list
> pve-user@lists.proxmox.com
> https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-user
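
The report above mentions kernel messages that reference "device=2b:00.0". One way to pull such lines out of a saved dmesg capture is a small filter like the one below. This is only a sketch: the function name lines_for_pci_device is made up for illustration, and since the attached log is not part of this thread, the message format is an assumption (the filter simply looks for the PCI address, with or without a "0000:" domain prefix).

```python
import re
from typing import Iterable, List


def lines_for_pci_device(log_lines: Iterable[str], pci_addr: str) -> List[str]:
    """Return the log lines that mention pci_addr (for example "2b:00.0")."""
    # Accept the bare address or one with a 4-digit PCI domain prefix
    # ("0000:2b:00.0"), but reject longer look-alikes such as "12b:00.0".
    pattern = re.compile(
        r"(?<![0-9a-f:])(?:[0-9a-f]{4}:)?" + re.escape(pci_addr) + r"(?![0-9a-f])",
        re.IGNORECASE,
    )
    return [line.rstrip("\n") for line in log_lines if pattern.search(line)]
```

For example, after saving the kernel log with "dmesg > dmesg.txt", calling lines_for_pci_device(open("dmesg.txt"), "2b:00.0") would return only the lines that mention that adapter.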





* Re: [PVE-User] Server freezing randomly with Proxmox 6.2-4 on AMD Ryzen system
       [not found] <0e58d1d5-384b-55d2-9042-ae8c1e2ade6c@qwer.tk>
  2020-09-07 11:21 ` [PVE-User] Server freezing randomly with Proxmox 6.2-4 on AMD Ryzen system Wolfgang Link
@ 2020-09-07 11:29 ` Chris Sutcliff
  2020-11-16 17:21 ` Hermann Himmelbauer
  2 siblings, 0 replies; 5+ messages in thread
From: Chris Sutcliff @ 2020-09-07 11:29 UTC (permalink / raw)
  To: Proxmox VE user list

Hi,

I'm using the 10G LAN variant of this board with a 3700X and haven't had any issues.

There is a "beta" BIOS version available from ASRock that updates AGESA to 1.0.0.6 (https://download.asrock.com/BIOS/Server/X470D4U(L3.37)ROM.zip), which might be worth trying. I'm using the equivalent version on my board.


Kind Regards

Chris Sutcliff
Sutcliff Limited







* Re: [PVE-User] Server freezing randomly with Proxmox 6.2-4 on AMD Ryzen system
  2020-09-07 11:21 ` [PVE-User] Server freezing randomly with Proxmox 6.2-4 on AMD Ryzen system Wolfgang Link
@ 2020-09-07 18:44   ` Hermann Himmelbauer
  2020-09-08  4:25     ` Wolfgang Link
  0 siblings, 1 reply; 5+ messages in thread
From: Hermann Himmelbauer @ 2020-09-07 18:44 UTC (permalink / raw)
  To: Wolfgang Link, Proxmox VE user list

Dear Wolfgang,
Thank you for your reply. Glad to hear that the board is stable for you.

My BIOS is at its default values, so no overclocking or the like. Did you
change any settings? Did you disable C6 in some way?

Maybe this really is a hardware defect (mainboard, RAM, CPU, power supply...):
since my posting I have managed to crash node 2, whereas node 1 and node 3
are stable.

BTW - did you manage to get ECC running? I do have ECC memory but it
does not seem to be detected. Maybe this is due to the AMD Ryzen 3 3200G
- I read somewhere that the CPUs with integrated graphics do not report ECC?

Could you perhaps tell me what other components your system has?

The board itself + the AMD CPUs are a very cost-efficient combination.
The onboard 10GBit Ethernet is great for Ceph; I get quite good I/O
speeds. Once things are stable, it's a perfect combination for a
cost-efficient HA cluster, I think.

Best Regards,
Hermann


-- 
hermann@qwer.tk
PGP/GPG: 299893C7 (on keyservers)




* Re: [PVE-User] Server freezing randomly with Proxmox 6.2-4 on AMD Ryzen system
  2020-09-07 18:44   ` Hermann Himmelbauer
@ 2020-09-08  4:25     ` Wolfgang Link
  0 siblings, 0 replies; 5+ messages in thread
From: Wolfgang Link @ 2020-09-08  4:25 UTC (permalink / raw)
  To: Hermann Himmelbauer, Proxmox VE user list


> On 09/07/2020 8:44 PM Hermann Himmelbauer <hermann@qwer.tk> wrote:
> 
>  
> Dear Wolfgang,
> Thank you for your reply. Glad to hear that the board is stable for you.
> 
> My BIOS has the default values, so no overclocking or the like. Did you
> do any alterations? Did you in some way disable C6?

As you noticed, you should turn off all power-saving features. These are somewhat hidden in the AMD BIOS.
You can find them in the expanded section "AMD CBS"; don't be afraid of the warnings in some of the submenus. You can ignore them; they don't make sense, because options like ECO mode are hidden there.
I also disable boost to reduce time drift in the KVM guests.
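
To see what the running kernel currently reports for idle states and boost, the sysfs files under /sys/devices/system/cpu can be read directly. The following is a minimal sketch, not a tuning tool: the helper names are made up, the "cpufreq/boost" file is only present with some cpufreq drivers (acpi-cpufreq, for example), and the idle states listed by cpuidle do not necessarily reflect the C6 setting that zenstates.py or the BIOS changes.

```python
from pathlib import Path
from typing import Dict, Optional


def read_text(path: Path) -> Optional[str]:
    """Return the stripped file content, or None if it cannot be read."""
    try:
        return path.read_text().strip()
    except OSError:
        return None


def idle_states(cpu_dir: Path) -> Dict[str, bool]:
    """Map idle-state name -> True if the state is enabled for this CPU."""
    states = {}
    for state_dir in sorted(cpu_dir.glob("cpuidle/state[0-9]*")):
        name = read_text(state_dir / "name")
        disabled = read_text(state_dir / "disable")
        if name is not None and disabled is not None:
            states[name] = disabled == "0"  # "1" means the state is disabled
    return states


def boost_enabled(cpu_root: Path) -> Optional[bool]:
    """Return the global boost flag, or None if the driver does not expose it."""
    value = read_text(cpu_root / "cpufreq" / "boost")
    return None if value is None else value == "1"


if __name__ == "__main__":
    cpu_root = Path("/sys/devices/system/cpu")
    print("idle states (cpu0):", idle_states(cpu_root / "cpu0"))
    print("boost enabled:", boost_enabled(cpu_root))
```

The script only reads; it changes nothing, so it is safe to run on a node before and after changing BIOS settings to compare the two results.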
> 
> Maybe this is really some defect (mainboard, RAM, cpu, power supply...)
> - since my posting I managed to crash node 2, however, node 1 + node 3
> are stable.
> 
> BTW - did you manage to get ECC running? I do have ECC memory but it
> does not seem to be detected. Maybe this is due to the AMD Ryzen 3 3200G
> - I read somewhere that the CPUs with integrated graphic do not report ECC?

According to its WikiChip entry, the Ryzen 3 3200G does not support ECC:
https://en.wikichip.org/wiki/amd/ryzen_3/3200g

Why didn't you buy a Ryzen 3 3100? It costs nearly the same, is 30% faster, and supports SMT and ECC.
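
Whether the kernel actually sees working ECC can be checked through the EDAC subsystem: when an EDAC driver has registered a memory controller, it shows up as /sys/devices/system/edac/mc/mc0 (mc1, ...) with counters for corrected and uncorrected errors. A rough sketch follows; the function name is hypothetical, and an empty result only means that no memory controller is registered with EDAC, which can also happen when the driver is simply not loaded.

```python
from pathlib import Path
from typing import Dict, Tuple


def edac_error_counts(edac_mc: Path) -> Dict[str, Tuple[int, int]]:
    """Map controller name -> (corrected, uncorrected) error counts."""
    counts = {}
    for mc_dir in sorted(edac_mc.glob("mc[0-9]*")):
        try:
            corrected = int((mc_dir / "ce_count").read_text())
            uncorrected = int((mc_dir / "ue_count").read_text())
        except (OSError, ValueError):
            continue  # skip controllers whose counters cannot be read
        counts[mc_dir.name] = (corrected, uncorrected)
    return counts


if __name__ == "__main__":
    found = edac_error_counts(Path("/sys/devices/system/edac/mc"))
    if not found:
        print("no EDAC memory controller registered")
    for name, (corrected, uncorrected) in found.items():
        print(f"{name}: corrected={corrected} uncorrected={uncorrected}")
```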

> 
> Can you perhaps send me the other components of your system?

4 x Samsung M391A4G43MB1-CTDQ 32GB DIMM at 2666
Lenovo 430-8i HBA

> 
> The board itself + the AMD CPUs are a very price-efficient combination.
> The onboard 10GBit ethernet is great for ceph, I get quite good I/O
> speeds. If things get stable, it's a perfect combination for a cost
> efficient HA cluster, I think.
> 
> Best Regards,
> Hermann
> 





* Re: [PVE-User] Server freezing randomly with Proxmox 6.2-4 on AMD Ryzen system
       [not found] <0e58d1d5-384b-55d2-9042-ae8c1e2ade6c@qwer.tk>
  2020-09-07 11:21 ` [PVE-User] Server freezing randomly with Proxmox 6.2-4 on AMD Ryzen system Wolfgang Link
  2020-09-07 11:29 ` Chris Sutcliff
@ 2020-11-16 17:21 ` Hermann Himmelbauer
  2 siblings, 0 replies; 5+ messages in thread
From: Hermann Himmelbauer @ 2020-11-16 17:21 UTC (permalink / raw)
  To: pve-user

Hi,
In case someone is interested: the problem is now solved, and the system
seems to be rock solid after ~2 months of testing.

I replaced the AMD Ryzen 3 3200G with an AMD Ryzen 5 3600 on one node and with
an AMD Ryzen 3 3100 on the two other nodes, and now the problem is gone.

I don't really know why, but I can think of two possible reasons:

1) The 3200G does not support ECC, but I use ECC RAM. Maybe this led to
errors (although intensive memory testing with memtest86 did not report
anything).
2) The new CPUs do not have integrated graphics. I noticed
that the two onboard 10GBit Ethernet adapters now have different PCI
addresses with the new CPUs, and with the old CPUs these 10G adapters
had problems and malfunctioned.
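
The change in PCI addresses can be checked by saving the output of "lspci" before and after the CPU swap and comparing where the Ethernet controllers sit. A small sketch, assuming the default lspci line format of "<address> <class>: <device>" and using made-up function names:

```python
from typing import List, Set, Tuple


def ethernet_addresses(lspci_output: str) -> List[str]:
    """Return the PCI addresses of all lines whose class is "Ethernet controller"."""
    addresses = []
    for line in lspci_output.splitlines():
        address, _, rest = line.partition(" ")
        if rest.startswith("Ethernet controller"):
            addresses.append(address)
    return addresses


def address_changes(before: str, after: str) -> Tuple[Set[str], Set[str]]:
    """Return (addresses only in 'before', addresses only in 'after')."""
    old = set(ethernet_addresses(before))
    new = set(ethernet_addresses(after))
    return old - new, new - old
```

Two empty sets mean the Ethernet controllers kept their addresses; anything else shows which addresses disappeared and which ones are new.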

Many thanks for your input and your help.

The ASRock Rack X470D4U2-2T is definitely stable now.

Best Regards,
Hermann




