public inbox for pve-user@lists.proxmox.com
* [PVE-User] ZFS-8000-8A on a non-system disk. How to do?
@ 2023-11-12 15:21 Marco Gaiarin
  2023-11-12 18:32 ` Stefan
From: Marco Gaiarin @ 2023-11-12 15:21 UTC (permalink / raw)
  To: pve-user


I've got:

root@lisei:~# zpool status -xv
  pool: rpool-hdd
 state: ONLINE
status: One or more devices has experienced an error resulting in data
	corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
	entire pool from backup.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
  scan: scrub repaired 0B in 00:37:50 with 1 errors on Sun Nov 12 01:01:53 2023
config:

	NAME                                            STATE     READ WRITE CKSUM
	rpool-hdd                                       ONLINE       0     0     0
	  raidz1-0                                      ONLINE       0     0     0
	    ata-WDC_WD2003FZEX-00SRLA0_WD-WMC6N0D6J2LN  ONLINE       0     0     2
	    ata-WDC_WD2003FZEX-00SRLA0_WD-WMC6N0D7Z60F  ONLINE       0     0     2
	    ata-WDC_WD2003FZEX-00SRLA0_WD-WMC6N0D2JSHZ  ONLINE       0     0     2

errors: Permanent errors have been detected in the following files:

        rpool-hdd/vm-401-disk-0:<0x1>

The disk is attached to a VM used as a mere repository for rsnapshot backups,
so it contains many copies of the same files, with different and abundant
retention. It is an add-on disk for the VM, i.e. if needed I can safely
unmount it, or even detach it.


Is there something I can do to repair the volume, possibly online? Do I really
have to back it up, destroy it and restore from backup?!
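
For reference, a rough sketch of what I would try first, using only stock
OpenZFS commands (nothing Proxmox-specific); I am not sure it is enough,
hence the question:

    # rewrite or discard the damaged blocks inside the zvol first, then:
    zpool scrub rpool-hdd        # re-verify every checksum in the pool
    zpool status -xv rpool-hdd   # confirm the permanent-error list is empty
    zpool clear rpool-hdd        # reset the READ/WRITE/CKSUM counters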


Thanks.

-- 
  Whoever does not vote YES in the referendum cannot feel worthy of being Italian
				(Silvio Berlusconi, 21 June 2006)






* Re: [PVE-User] ZFS-8000-8A on a non-system disk. How to do?
  2023-11-12 15:21 [PVE-User] ZFS-8000-8A on a non-system disk. How to do? Marco Gaiarin
@ 2023-11-12 18:32 ` Stefan
  2023-11-12 19:47   ` Jan Vlach
  2023-11-13 20:28   ` Marco Gaiarin
From: Stefan @ 2023-11-12 18:32 UTC (permalink / raw)
  To: Proxmox VE user list

I assume you have already ruled out flaky hardware (bad cable, RAM)? If so, repairing is not possible. You could theoretically bypass the backup/destroy/restore route, but why?
You have three faulty drives that need to be replaced anyway. That operation, plus identifying the failed file(s), takes much longer than just copying back from backup.
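
If you do end up swapping the drives, a sketch of the usual per-drive cycle
(the new-disk id below is a placeholder; do one drive at a time so the raidz1
never loses more than one member):

    zpool replace rpool-hdd \
        ata-WDC_WD2003FZEX-00SRLA0_WD-WMC6N0D6J2LN \
        /dev/disk/by-id/ata-NEW_DISK_ID   # placeholder for the replacement
    zpool status rpool-hdd                # wait for the resilver to complete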



On 12 November 2023 16:21:17 CET, Marco Gaiarin <gaio@lilliput.linux.it> wrote:
>
>I've got:
>
>root@lisei:~# zpool status -xv
>  pool: rpool-hdd
> state: ONLINE
>status: One or more devices has experienced an error resulting in data
>	corruption.  Applications may be affected.
>action: Restore the file in question if possible.  Otherwise restore the
>	entire pool from backup.
>   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
>  scan: scrub repaired 0B in 00:37:50 with 1 errors on Sun Nov 12 01:01:53 2023
>config:
>
>	NAME                                            STATE     READ WRITE CKSUM
>	rpool-hdd                                       ONLINE       0     0     0
>	  raidz1-0                                      ONLINE       0     0     0
>	    ata-WDC_WD2003FZEX-00SRLA0_WD-WMC6N0D6J2LN  ONLINE       0     0     2
>	    ata-WDC_WD2003FZEX-00SRLA0_WD-WMC6N0D7Z60F  ONLINE       0     0     2
>	    ata-WDC_WD2003FZEX-00SRLA0_WD-WMC6N0D2JSHZ  ONLINE       0     0     2
>
>errors: Permanent errors have been detected in the following files:
>
>        rpool-hdd/vm-401-disk-0:<0x1>
>
>The disk is attached to a VM used as a mere repository for rsnapshot backups,
>so it contains many copies of the same files, with different and abundant
>retention. It is an add-on disk for the VM, i.e. if needed I can safely
>unmount it, or even detach it.
>
>
>Is there something I can do to repair the volume, possibly online? Do I really
>have to back it up, destroy it and restore from backup?!
>
>
>Thanks.
>
>-- 
>  Whoever does not vote YES in the referendum cannot feel worthy of being Italian
>				(Silvio Berlusconi, 21 June 2006)
>
>
>



* Re: [PVE-User] ZFS-8000-8A on a non-system disk. How to do?
  2023-11-12 18:32 ` Stefan
@ 2023-11-12 19:47   ` Jan Vlach
  2023-11-13 20:39     ` Marco Gaiarin
  2023-11-13 20:28   ` Marco Gaiarin
From: Jan Vlach @ 2023-11-12 19:47 UTC (permalink / raw)
  To: Proxmox VE user list

Hi,

Having the same number of checksum errors on all drives really points to bad
cabling or bad RAM.

- If you have ECC RAM, check for errors with ipmitool, e.g. ipmitool sel elist
  (see the combined sketch after this list).
- You could see something in dmesg too.
- If you don't have ECC RAM, get memtest in UEFI mode from
  https://www.memtest86.com/, take the host offline and let it run for a day
  or two.

- I've seen this with a Supermicro server where the cable for the last two
  slots out of 10 was bent and touching the case lid; those two slots kept
  resetting the bus, showing increasing errors on all drives. Scrubs just
  changed the affected files and metadata, so I no longer trusted the host or
  the consistency of its data; I restored everything from a good backup to a
  different host and only then debugged.

- If at this point you want to back up and restore but you don't have backups,
  it's game over for you.
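
A quick combined check, assuming ipmitool and (optionally) rasdaemon are
installed on the host:

    ipmitool sel elist | grep -i ecc      # ECC events in the BMC event log
    dmesg | grep -iE 'edac|mce|ata[0-9]'  # kernel-side memory/SATA complaints
    ras-mc-ctl --errors                   # corrected-error counts (rasdaemon)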

JV

> On 12. 11. 2023, at 19:32, Stefan <proxmox@qwertz1.com> wrote:
> 
> I assume you have already ruled out flaky hardware (bad cable, RAM)? If so, repairing is not possible. You could theoretically bypass the backup/destroy/restore route, but why?
> You have three faulty drives that need to be replaced anyway. That operation, plus identifying the failed file(s), takes much longer than just copying back from backup.
> 
> 
> 
> On 12 November 2023 16:21:17 CET, Marco Gaiarin <gaio@lilliput.linux.it> wrote:
>> 
>> I've got:
>> 
>> root@lisei:~# zpool status -xv
>> pool: rpool-hdd
>> state: ONLINE
>> status: One or more devices has experienced an error resulting in data
>> 	corruption.  Applications may be affected.
>> action: Restore the file in question if possible.  Otherwise restore the
>> 	entire pool from backup.
>>  see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
>> scan: scrub repaired 0B in 00:37:50 with 1 errors on Sun Nov 12 01:01:53 2023
>> config:
>> 
>> 	NAME                                            STATE     READ WRITE CKSUM
>> 	rpool-hdd                                       ONLINE       0     0     0
>> 	  raidz1-0                                      ONLINE       0     0     0
>> 	    ata-WDC_WD2003FZEX-00SRLA0_WD-WMC6N0D6J2LN  ONLINE       0     0     2
>> 	    ata-WDC_WD2003FZEX-00SRLA0_WD-WMC6N0D7Z60F  ONLINE       0     0     2
>> 	    ata-WDC_WD2003FZEX-00SRLA0_WD-WMC6N0D2JSHZ  ONLINE       0     0     2
>> 
>> errors: Permanent errors have been detected in the following files:
>> 
>>       rpool-hdd/vm-401-disk-0:<0x1>
>> 
>> The disk is attached to a VM used as a mere repository for rsnapshot backups,
>> so it contains many copies of the same files, with different and abundant
>> retention. It is an add-on disk for the VM, i.e. if needed I can safely
>> unmount it, or even detach it.
>> 
>> 
>> Is there something I can do to repair the volume, possibly online? Do I really
>> have to back it up, destroy it and restore from backup?!
>> 
>> 
>> Thanks.
>> 
>> -- 
>> Whoever does not vote YES in the referendum cannot feel worthy of being Italian
>> 				(Silvio Berlusconi, 21 June 2006)
>> 
>> 
>> 




* Re: [PVE-User] ZFS-8000-8A on a non-system disk. How to do?
  2023-11-12 18:32 ` Stefan
  2023-11-12 19:47   ` Jan Vlach
@ 2023-11-13 20:28   ` Marco Gaiarin
From: Marco Gaiarin @ 2023-11-13 20:28 UTC (permalink / raw)
  To: Stefan; +Cc: pve-user

Hello, Stefan!
  On that day you wrote...

> I assume you have already ruled out flaky hardware (bad cable, RAM)? If so, repairing is not possible. You could theoretically bypass the backup/destroy/restore route, but why?
> You have three faulty drives that need to be replaced anyway. That operation, plus identifying the failed file(s), takes much longer than just copying back from backup.

...the strange thing is that the hardware does not seem 'flaky' at all...
I've received NO warnings about disk/RAM/... errors in the logs, which I
constantly monitor via logcheck.

More in my next answer.

-- 
  Take comfort! On the site http://www.sorryeverybody.com thousands of Americans
  are apologizing to the world for Bush's re-election.	(from Cacao Elefante)






* Re: [PVE-User] ZFS-8000-8A on a non-system disk. How to do?
  2023-11-12 19:47   ` Jan Vlach
@ 2023-11-13 20:39     ` Marco Gaiarin
From: Marco Gaiarin @ 2023-11-13 20:39 UTC (permalink / raw)
  To: Jan Vlach; +Cc: pve-user

Hello, Jan Vlach!
  On that day you wrote...

> Having the same number of checksum errors on all drives really points to bad cabling or bad RAM.

By bad cabling you mean bad SATA cabling, right? It seems not, to me... I have
not received any errors from the SATA kernel subsystem.


> - If you have ECC RAM, check for errors with ipmitool, e.g.
> ipmitool sel elist

I have ECC RAM, but the server (an old HP ProLiant MicroServer N40L) does not
seem to have an IPMI interface...


The server has 140+ days of uptime, and dmesg is perfectly free of memory and
SATA errors... Who knows...
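
What I can still do is query SMART directly; a rough check, assuming
smartmontools is installed:

    for d in /dev/disk/by-id/ata-WDC_WD2003FZEX-00SRLA0_WD-*; do
        case "$d" in *-part*) continue ;; esac  # skip partition links
        smartctl -H "$d"        # overall health self-assessment
        smartctl -l error "$d"  # ATA error log, if the drive kept one
    done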

-- 
  Take comfort! On the site http://www.sorryeverybody.com thousands of Americans
  are apologizing to the world for Bush's re-election.	(from Cacao Elefante)





