From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Wed, 18 May 2022 18:20:40 +0200
From: nada <nada@verdnatura.es>
To: Proxmox VE user list <pve-user@lists.proxmox.com>
Subject: Re: [PVE-User] Severe disk corruption: PBS, SATA
User-Agent: Roundcube Webmail/1.4.11

hi Marco
you used some local ZFS filesystem according to your info,
so you may try

zfs list
zpool list -v
zpool history
zpool import ...
zpool replace ...

all the best
Nada

On 2022-05-18 10:04, Marco Gaiarin wrote:
> We are seeing some very severe disk corruption on one of our
> installations, which is indeed a bit 'niche', but...
>
> PVE 6.4 host on a Dell PowerEdge T340:
> root@sdpve1:~# uname -a
> Linux sdpve1 5.4.106-1-pve #1 SMP PVE 5.4.106-1 (Fri, 19 Mar 2021 11:08:47 +0100) x86_64 GNU/Linux
>
> Debian squeeze i386 on the guest:
> sdinny:~# uname -a
> Linux sdinny 2.6.32-5-686 #1 SMP Mon Feb 29 00:51:35 UTC 2016 i686 GNU/Linux
>
> boot disk defined as:
> sata0: local-zfs:vm-120-disk-0,discard=on,size=100G
>
> After enabling PBS, every time the backup of the VM starts:
>
> root@sdpve1:~# grep vzdump /var/log/syslog.1
> May 17 20:27:17 sdpve1 pvedaemon[24825]: starting task UPID:sdpve1:00005132:36BE6E40:6283E905:vzdump:120:root@pam:
> May 17 20:27:17 sdpve1 pvedaemon[20786]: INFO: starting new backup job: vzdump 120 --node sdpve1 --storage nfs-scratch --compress zstd --remove 0 --mode snapshot
> May 17 20:36:50 sdpve1 pvedaemon[24825]: end task UPID:sdpve1:00005132:36BE6E40:6283E905:vzdump:120:root@pam: OK
> May 17 22:00:01 sdpve1 CRON[1734]: (root) CMD (vzdump 100 101 120 --mode snapshot --mailto sys@admin --quiet 1 --mailnotification failure --storage pbs-BP)
> May 17 22:00:02 sdpve1 vzdump[1738]: starting task UPID:sdpve1:00000AE6:36C6F7D7:6283FEC2:vzdump::root@pam:
> May 17 22:00:02 sdpve1 vzdump[2790]: INFO: starting new backup job: vzdump 100 101 120 --mailnotification failure --quiet 1 --mode snapshot --storage pbs-BP --mailto sys@admin
> May 17 22:00:02 sdpve1 vzdump[2790]: INFO: Starting Backup of VM 100 (qemu)
> May 17 22:00:52 sdpve1 vzdump[2790]: INFO: Finished Backup of VM 100 (00:00:50)
> May 17 22:00:52 sdpve1 vzdump[2790]: INFO: Starting Backup of VM 101 (qemu)
> May 17 22:02:09 sdpve1 vzdump[2790]: INFO: Finished Backup of VM 101 (00:01:17)
> May 17 22:02:10 sdpve1 vzdump[2790]: INFO: Starting Backup of VM 120 (qemu)
> May 17 23:31:02 sdpve1 vzdump[2790]: INFO: Finished Backup of VM 120 (01:28:52)
> May 17 23:31:02 sdpve1 vzdump[2790]: INFO: Backup job finished successfully
> May 17 23:31:02 sdpve1 vzdump[1738]: end task UPID:sdpve1:00000AE6:36C6F7D7:6283FEC2:vzdump::root@pam: OK
>
> The VM showed some massive and severe I/O trouble:
>
> May 17 22:40:48 sdinny kernel: [124793.000045] ata3.00: exception Emask 0x0 SAct 0xf43d2c SErr 0x0 action 0x6 frozen
> May 17 22:40:48 sdinny kernel: [124793.000493] ata3.00: failed command: WRITE FPDMA QUEUED
> May 17 22:40:48 sdinny kernel: [124793.000749] ata3.00: cmd 61/10:10:58:e3:01/00:00:05:00:00/40 tag 2 ncq 8192 out
> May 17 22:40:48 sdinny kernel: [124793.000749]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
> May 17 22:40:48 sdinny kernel: [124793.001628] ata3.00: status: { DRDY }
> May 17 22:40:48 sdinny kernel: [124793.001850] ata3.00: failed command: WRITE FPDMA QUEUED
> May 17 22:40:48 sdinny kernel: [124793.002175] ata3.00: cmd 61/10:18:70:79:09/00:00:05:00:00/40 tag 3 ncq 8192 out
> May 17 22:40:48 sdinny kernel: [124793.002175]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
> May 17 22:40:48 sdinny kernel: [124793.003052] ata3.00: status: { DRDY }
> May 17 22:40:48 sdinny kernel: [124793.003273] ata3.00: failed command: WRITE FPDMA QUEUED
> May 17 22:40:48 sdinny kernel: [124793.003527] ata3.00: cmd 61/10:28:98:31:11/00:00:05:00:00/40 tag 5 ncq 8192 out
> May 17 22:40:48 sdinny kernel: [124793.003559]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
> May 17 22:40:48 sdinny kernel: [124793.004420] ata3.00: status: { DRDY }
> May 17 22:40:48 sdinny kernel: [124793.004640] ata3.00: failed command: WRITE FPDMA QUEUED
> May 17 22:40:48 sdinny kernel: [124793.004893] ata3.00: cmd 61/10:40:d8:4a:20/00:00:05:00:00/40 tag 8 ncq 8192 out
> May 17 22:40:48 sdinny kernel: [124793.004894]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
> May 17 22:40:48 sdinny kernel: [124793.005769] ata3.00: status: { DRDY }
> [...]
> May 17 22:40:48 sdinny kernel: [124793.020296] ata3: hard resetting link
> May 17 22:41:12 sdinny kernel: [124817.132126] ata3: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
> May 17 22:41:12 sdinny kernel: [124817.132275] ata3.00: configured for UDMA/100
> May 17 22:41:12 sdinny kernel: [124817.132277] ata3.00: device reported invalid CHS sector 0
> May 17 22:41:12 sdinny kernel: [124817.132279] ata3.00: device reported invalid CHS sector 0
> May 17 22:41:12 sdinny kernel: [124817.132280] ata3.00: device reported invalid CHS sector 0
> May 17 22:41:12 sdinny kernel: [124817.132281] ata3.00: device reported invalid CHS sector 0
> May 17 22:41:12 sdinny kernel: [124817.132281] ata3.00: device reported invalid CHS sector 0
> May 17 22:41:12 sdinny kernel: [124817.132282] ata3.00: device reported invalid CHS sector 0
> May 17 22:41:12 sdinny kernel: [124817.132283] ata3.00: device reported invalid CHS sector 0
> May 17 22:41:12 sdinny kernel: [124817.132284] ata3.00: device reported invalid CHS sector 0
> May 17 22:41:12 sdinny kernel: [124817.132285] ata3.00: device reported invalid CHS sector 0
> May 17 22:41:12 sdinny kernel: [124817.132286] ata3.00: device reported invalid CHS sector 0
> May 17 22:41:12 sdinny kernel: [124817.132287] ata3.00: device reported invalid CHS sector 0
> May 17 22:41:12 sdinny kernel: [124817.132288] ata3.00: device reported invalid CHS sector 0
> May 17 22:41:12 sdinny kernel: [124817.132289] ata3.00: device reported invalid CHS sector 0
> May 17 22:41:12 sdinny kernel: [124817.132295] ata3: EH complete
>
> The VM is still 'alive' and works.
> But we were forced to reboot (power outage), and after that all the
> partitions of the disk disappeared; we had to restore them with
> tools like 'testdisk'.
> The partitions on the backups were the same: disappeared.
>
> Note that there is also a 'plain' local backup that runs on Sunday,
> and this backup task does not seem to generate trouble (but the
> partitions still seem to have disappeared, as it was done after the
> I/O error).
>
> Have we hit a kernel/QEMU bug?
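
The checks Nada suggests could be wrapped in a small health gate before reaching for 'zpool replace'. This is only a sketch: `check_pools` is a hypothetical helper, and it keys off the "all pools are healthy" message that `zpool status -x` prints when no pool has problems.

```shell
#!/bin/sh
# Sketch: summarize ZFS pool health from the output of 'zpool status -x',
# which prints "all pools are healthy" when no pool reports errors.

check_pools() {
    status="$1"    # on a real host: status="$(zpool status -x)"
    case "$status" in
        *"all pools are healthy"*)
            echo "OK" ;;
        *)
            echo "ATTENTION: inspect 'zpool status -v' and 'zpool history'" ;;
    esac
}

# Commands from the reply above, for context (run on the PVE host):
#   zfs list          # datasets, including the zvol behind vm-120-disk-0
#   zpool list -v     # per-vdev capacity and health
#   zpool history     # past pool operations, e.g. earlier replaces
check_pools "all pools are healthy"    # prints "OK"
```

If the helper reports ATTENTION, 'zpool status -v' names the affected vdev, which is the device a subsequent 'zpool replace' would target.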