* [PVE-User] ZFS-8000-8A on a non-system disk. How to do?
@ 2023-11-12 15:21 Marco Gaiarin
2023-11-12 18:32 ` Stefan
0 siblings, 1 reply; 5+ messages in thread
From: Marco Gaiarin @ 2023-11-12 15:21 UTC (permalink / raw)
To: pve-user
I've got:
root@lisei:~# zpool status -xv
pool: rpool-hdd
state: ONLINE
status: One or more devices has experienced an error resulting in data
corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
entire pool from backup.
see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
scan: scrub repaired 0B in 00:37:50 with 1 errors on Sun Nov 12 01:01:53 2023
config:
NAME STATE READ WRITE CKSUM
rpool-hdd ONLINE 0 0 0
raidz1-0 ONLINE 0 0 0
ata-WDC_WD2003FZEX-00SRLA0_WD-WMC6N0D6J2LN ONLINE 0 0 2
ata-WDC_WD2003FZEX-00SRLA0_WD-WMC6N0D7Z60F ONLINE 0 0 2
ata-WDC_WD2003FZEX-00SRLA0_WD-WMC6N0D2JSHZ ONLINE 0 0 2
errors: Permanent errors have been detected in the following files:
rpool-hdd/vm-401-disk-0:<0x1>
The disk belongs to a VM used as a mere repository for rsnapshot backups, so it
contains many copies of the same files, with different and abundant retention.
It is an add-on disk for the VM, i.e. if needed I can safely unmount it, or
even detach it.
Is there something I can do to repair the volume, possibly online? Do I really
have to back it up, destroy it and restore from backup?!
Thanks.
--
Non può sentirsi degno di essere italiano chi non vota SI al referendum
(Silvio Berlusconi, 21 giugno 2006)
^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [PVE-User] ZFS-8000-8A on a non-system disk. How to do?
2023-11-12 15:21 [PVE-User] ZFS-8000-8A on a non-system disk. How to do? Marco Gaiarin
@ 2023-11-12 18:32 ` Stefan
2023-11-12 19:47 ` Jan Vlach
2023-11-13 20:28 ` Marco Gaiarin
0 siblings, 2 replies; 5+ messages in thread
From: Stefan @ 2023-11-12 18:32 UTC (permalink / raw)
To: Proxmox VE user list
I assume you have already ruled out flaky hardware (bad cable, RAM)? If so, repairing is not possible. You could theoretically bypass the backup/destroy/restore route, but why?
You have three faulty drives that need to be replaced anyway. That operation, plus identifying the failed file(s), takes much longer than simply copying back from backup.
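For reference, a minimal sketch of the verify-after-restore cycle (pool name taken from the zpool status output in the thread; the commands need a live ZFS pool, so the sketch guards on `zpool` being available):

```shell
# Sketch only: after restoring the corrupted zvol, clear the error
# counters and re-scrub to confirm the pool is healthy again.
if command -v zpool >/dev/null 2>&1; then
  zpool clear rpool-hdd        # reset READ/WRITE/CKSUM counters
  zpool scrub rpool-hdd        # re-read and re-verify every block
  zpool status -x rpool-hdd    # -x prints only unhealthy pools
else
  echo "zpool not available; commands shown for reference only"
fi
```

Note that the permanent-error list typically only disappears after the affected blocks have been rewritten or freed and a subsequent scrub has completed.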
On 12 November 2023 16:21:17 CET, Marco Gaiarin <gaio@lilliput.linux.it> wrote:
>
>I've got:
>
>root@lisei:~# zpool status -xv
> pool: rpool-hdd
> state: ONLINE
>status: One or more devices has experienced an error resulting in data
> corruption. Applications may be affected.
>action: Restore the file in question if possible. Otherwise restore the
> entire pool from backup.
> see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
> scan: scrub repaired 0B in 00:37:50 with 1 errors on Sun Nov 12 01:01:53 2023
>config:
>
> NAME STATE READ WRITE CKSUM
> rpool-hdd ONLINE 0 0 0
> raidz1-0 ONLINE 0 0 0
> ata-WDC_WD2003FZEX-00SRLA0_WD-WMC6N0D6J2LN ONLINE 0 0 2
> ata-WDC_WD2003FZEX-00SRLA0_WD-WMC6N0D7Z60F ONLINE 0 0 2
> ata-WDC_WD2003FZEX-00SRLA0_WD-WMC6N0D2JSHZ ONLINE 0 0 2
>
>errors: Permanent errors have been detected in the following files:
>
> rpool-hdd/vm-401-disk-0:<0x1>
>
>The disk belongs to a VM used as a mere repository for rsnapshot backups, so it
>contains many copies of the same files, with different and abundant retention.
>It is an add-on disk for the VM, i.e. if needed I can safely unmount it, or
>even detach it.
>
>
>Is there something I can do to repair the volume, possibly online? Do I really
>have to back it up, destroy it and restore from backup?!
>
>
>Thanks.
>
>--
> Non può sentirsi degno di essere italiano chi non vota SI al referendum
> (Silvio Berlusconi, 21 giugno 2006)
>
>
>
>_______________________________________________
>pve-user mailing list
>pve-user@lists.proxmox.com
>https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-user
>
* Re: [PVE-User] ZFS-8000-8A on a non-system disk. How to do?
2023-11-12 18:32 ` Stefan
@ 2023-11-12 19:47 ` Jan Vlach
2023-11-13 20:39 ` Marco Gaiarin
2023-11-13 20:28 ` Marco Gaiarin
1 sibling, 1 reply; 5+ messages in thread
From: Jan Vlach @ 2023-11-12 19:47 UTC (permalink / raw)
To: Proxmox VE user list
Hi,
Having the same number of checksum errors on all drives really points to bad cabling or bad RAM.
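A small illustration of that point, using the device rows copied from the zpool status output earlier in the thread (written to a temp file just for the example):

```shell
# Equal, non-zero CKSUM counts on every raidz member point at a shared
# path (cable, HBA, RAM) rather than at one failing disk.
cat > /tmp/cksum_sample.txt <<'EOF'
ata-WDC_WD2003FZEX-00SRLA0_WD-WMC6N0D6J2LN  ONLINE  0  0  2
ata-WDC_WD2003FZEX-00SRLA0_WD-WMC6N0D7Z60F  ONLINE  0  0  2
ata-WDC_WD2003FZEX-00SRLA0_WD-WMC6N0D2JSHZ  ONLINE  0  0  2
EOF
# Count the distinct CKSUM values (last column); 1 means "all equal".
distinct=$(awk '{print $NF}' /tmp/cksum_sample.txt | sort -u | wc -l)
if [ "$distinct" -eq 1 ]; then
  echo "identical CKSUM counts on all drives: suspect a shared cause"
fi
```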
- If you have ECC RAM, check for errors with ipmitool, e.g.:
  ipmitool sel elist
- You might see something in dmesg too.
- If you don't have ECC RAM, get memtest in UEFI mode from https://www.memtest86.com/, take the host offline, and let it run for a day or two.
- I've seen this with a Supermicro server where the cable for the last two slots (out of 10) was bent and touching the case lid; those two slots kept resetting the bus, showing increasing errors on all drives. Scrubs just changed which files and metadata were affected, so I no longer trusted the host or the consistency of its data; I restored everything from a good backup to a different host and only then debugged.
- If at this point you want to back up and restore but you don't have backups, it's game over.
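The dmesg check above can be scripted; here is a sketch using a fabricated sample log (the `ata3` lines are invented for illustration):

```shell
# Scan a kernel log for SATA link errors and ECC/EDAC reports.
scan_log() {
  grep -E 'ata[0-9].*(SErr|error|reset)|EDAC|ECC' "$1" \
    || echo "no SATA/ECC errors found"
}

# Sample log; the ata3 lines are fabricated for this example.
cat > /tmp/sample_dmesg.txt <<'EOF'
[100.0] ata3.00: exception Emask 0x10 SAct 0x0 SErr 0x4010000 action 0xe
[100.1] ata3: SError: { PHYRdyChg DevExch }
[200.0] usb 1-1: new high-speed USB device number 2
EOF

scan_log /tmp/sample_dmesg.txt
```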
JV
> On 12. 11. 2023, at 19:32, Stefan <proxmox@qwertz1.com> wrote:
>
> I assume you have already ruled out flaky hardware (bad cable, RAM)? If so, repairing is not possible. You could theoretically bypass the backup/destroy/restore route, but why?
> You have three faulty drives that need to be replaced anyway. That operation, plus identifying the failed file(s), takes much longer than simply copying back from backup.
>
>
>
> On 12 November 2023 16:21:17 CET, Marco Gaiarin <gaio@lilliput.linux.it> wrote:
>>
>> I've got:
>>
>> root@lisei:~# zpool status -xv
>> pool: rpool-hdd
>> state: ONLINE
>> status: One or more devices has experienced an error resulting in data
>> corruption. Applications may be affected.
>> action: Restore the file in question if possible. Otherwise restore the
>> entire pool from backup.
>> see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
>> scan: scrub repaired 0B in 00:37:50 with 1 errors on Sun Nov 12 01:01:53 2023
>> config:
>>
>> NAME STATE READ WRITE CKSUM
>> rpool-hdd ONLINE 0 0 0
>> raidz1-0 ONLINE 0 0 0
>> ata-WDC_WD2003FZEX-00SRLA0_WD-WMC6N0D6J2LN ONLINE 0 0 2
>> ata-WDC_WD2003FZEX-00SRLA0_WD-WMC6N0D7Z60F ONLINE 0 0 2
>> ata-WDC_WD2003FZEX-00SRLA0_WD-WMC6N0D2JSHZ ONLINE 0 0 2
>>
>> errors: Permanent errors have been detected in the following files:
>>
>> rpool-hdd/vm-401-disk-0:<0x1>
>>
>> The disk belongs to a VM used as a mere repository for rsnapshot backups, so it
>> contains many copies of the same files, with different and abundant retention.
>> It is an add-on disk for the VM, i.e. if needed I can safely unmount it, or
>> even detach it.
>>
>>
>> Is there something I can do to repair the volume, possibly online? Do I really
>> have to back it up, destroy it and restore from backup?!
>>
>>
>> Thanks.
>>
>> --
>> Non può sentirsi degno di essere italiano chi non vota SI al referendum
>> (Silvio Berlusconi, 21 giugno 2006)
>>
>>
>>
>> _______________________________________________
>> pve-user mailing list
>> pve-user@lists.proxmox.com
>> https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-user
>>
> _______________________________________________
> pve-user mailing list
> pve-user@lists.proxmox.com
> https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-user
* Re: [PVE-User] ZFS-8000-8A on a non-system disk. How to do?
2023-11-12 19:47 ` Jan Vlach
@ 2023-11-13 20:39 ` Marco Gaiarin
0 siblings, 0 replies; 5+ messages in thread
From: Marco Gaiarin @ 2023-11-13 20:39 UTC (permalink / raw)
To: Jan Vlach; +Cc: pve-user
Mandi! Jan Vlach
In chel di` si favelave...
> having same number of checksum errors on all drives really means bad cabling or bad RAM.
By bad cabling you mean bad SATA cabling, right? It seems not, to me... I've
received no errors from the SATA kernel subsystem.
> - if you have ECC RAM, check for errors in ipmitool i.e.
> ipmitool sel elist
I have ECC RAM, but the server (an old HP ProLiant MicroServer N40L) doesn't
seem to have an IPMI interface...
The server has 140+ days of uptime, and dmesg is perfectly free of memory and
SATA errors... no idea...
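A hedged aside: on hosts with ECC but no IPMI/BMC, the kernel's EDAC driver (when loaded) exposes ECC error counters in sysfs, which gives a dmesg-independent check:

```shell
# Sketch: read corrected (ce) and uncorrected (ue) ECC error counters
# from the EDAC sysfs tree; absent when no EDAC driver is loaded.
found=0
for f in /sys/devices/system/edac/mc/mc*/ce_count \
         /sys/devices/system/edac/mc/mc*/ue_count; do
  if [ -e "$f" ]; then
    found=1
    echo "$f: $(cat "$f")"
  fi
done
[ "$found" -eq 1 ] || echo "no EDAC counters exposed (driver not loaded?)"
```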
--
Consolatevi! Sul sito http://www.sorryeverybody.com migliaia di americani
chiedono scusa al mondo per la rielezione di Bush. (da Cacao Elefante)
* Re: [PVE-User] ZFS-8000-8A on a non-system disk. How to do?
2023-11-12 18:32 ` Stefan
2023-11-12 19:47 ` Jan Vlach
@ 2023-11-13 20:28 ` Marco Gaiarin
1 sibling, 0 replies; 5+ messages in thread
From: Marco Gaiarin @ 2023-11-13 20:28 UTC (permalink / raw)
To: Stefan; +Cc: pve-user
Mandi! Stefan
In chel di` si favelave...
> I assume you have already ruled out flaky hardware (bad cable, RAM)? If so, repairing is not possible. You could theoretically bypass the backup/destroy/restore route, but why?
> You have three faulty drives that need to be replaced anyway. That operation, plus identifying the failed file(s), takes much longer than simply copying back from backup.
...the strange thing is that the hardware does not seem 'flaky' at all...
I've received NO warnings about disk/RAM/... errors in the logs, which I
constantly monitor via logcheck.
More in my next reply.
--
Consolatevi! Sul sito http://www.sorryeverybody.com migliaia di americani
chiedono scusa al mondo per la rielezione di Bush. (da Cacao Elefante)
end of thread, other threads:[~2023-11-13 21:10 UTC | newest]
Thread overview: 5+ messages
2023-11-12 15:21 [PVE-User] ZFS-8000-8A on a non-system disk. How to do? Marco Gaiarin
2023-11-12 18:32 ` Stefan
2023-11-12 19:47 ` Jan Vlach
2023-11-13 20:39 ` Marco Gaiarin
2023-11-13 20:28 ` Marco Gaiarin