From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Wed, 18 May 2022 18:20:40 +0200
From: nada <nada@verdnatura.es>
To: Proxmox VE user list <pve-user@lists.proxmox.com>
Subject: Re: [PVE-User] Severe disk corruption: PBS, SATA
User-Agent: Roundcube Webmail/1.4.11

hi Marco
you used some local ZFS filesystem according to your info,
so you may try

zfs list
zpool list -v
zpool history
zpool import ...
zpool replace ...

all the best
Nada

On 2022-05-18 10:04, Marco Gaiarin wrote:
> We are seeing some very severe disk corruption on one of our
> installations, which is indeed a bit 'niche', but...
>
> PVE 6.4 host on a Dell PowerEdge T340:
> root@sdpve1:~# uname -a
> Linux sdpve1 5.4.106-1-pve #1 SMP PVE 5.4.106-1 (Fri, 19 Mar 2021 11:08:47 +0100) x86_64 GNU/Linux
>
> Debian squeeze i386 on the guest:
> sdinny:~# uname -a
> Linux sdinny 2.6.32-5-686 #1 SMP Mon Feb 29 00:51:35 UTC 2016 i686 GNU/Linux
>
> boot disk defined as:
> sata0: local-zfs:vm-120-disk-0,discard=on,size=100G
>
> After enabling PBS, every time the backup of the VM starts:
>
> root@sdpve1:~# grep vzdump /var/log/syslog.1
> May 17 20:27:17 sdpve1 pvedaemon[24825]: starting task UPID:sdpve1:00005132:36BE6E40:6283E905:vzdump:120:root@pam:
> May 17 20:27:17 sdpve1 pvedaemon[20786]: INFO: starting new backup job: vzdump 120 --node sdpve1 --storage nfs-scratch --compress zstd --remove 0 --mode snapshot
> May 17 20:36:50 sdpve1 pvedaemon[24825]: end task UPID:sdpve1:00005132:36BE6E40:6283E905:vzdump:120:root@pam: OK
> May 17 22:00:01 sdpve1 CRON[1734]: (root) CMD (vzdump 100 101 120 --mode snapshot --mailto sys@admin --quiet 1 --mailnotification failure --storage pbs-BP)
> May 17 22:00:02 sdpve1 vzdump[1738]: starting task UPID:sdpve1:00000AE6:36C6F7D7:6283FEC2:vzdump::root@pam:
> May 17 22:00:02 sdpve1 vzdump[2790]: INFO: starting new backup job: vzdump 100 101 120 --mailnotification failure --quiet 1 --mode snapshot --storage pbs-BP --mailto sys@admin
> May 17 22:00:02 sdpve1 vzdump[2790]: INFO: Starting Backup of VM 100 (qemu)
> May 17 22:00:52 sdpve1 vzdump[2790]: INFO: Finished Backup of VM 100 (00:00:50)
> May 17 22:00:52 sdpve1 vzdump[2790]: INFO: Starting Backup of VM 101 (qemu)
> May 17 22:02:09 sdpve1 vzdump[2790]: INFO: Finished Backup of VM 101 (00:01:17)
> May 17 22:02:10 sdpve1 vzdump[2790]: INFO: Starting Backup of VM 120 (qemu)
> May 17 23:31:02 sdpve1 vzdump[2790]: INFO: Finished Backup of VM 120 (01:28:52)
> May 17 23:31:02 sdpve1 vzdump[2790]: INFO: Backup job finished successfully
> May 17 23:31:02 sdpve1 vzdump[1738]: end task UPID:sdpve1:00000AE6:36C6F7D7:6283FEC2:vzdump::root@pam: OK
>
> The VM showed some massive and severe I/O trouble:
>
> May 17 22:40:48 sdinny kernel: [124793.000045] ata3.00: exception Emask 0x0 SAct 0xf43d2c SErr 0x0 action 0x6 frozen
> May 17 22:40:48 sdinny kernel: [124793.000493] ata3.00: failed command: WRITE FPDMA QUEUED
> May 17 22:40:48 sdinny kernel: [124793.000749] ata3.00: cmd 61/10:10:58:e3:01/00:00:05:00:00/40 tag 2 ncq 8192 out
> May 17 22:40:48 sdinny kernel: [124793.000749]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
> May 17 22:40:48 sdinny kernel: [124793.001628] ata3.00: status: { DRDY }
> May 17 22:40:48 sdinny kernel: [124793.001850] ata3.00: failed command: WRITE FPDMA QUEUED
> May 17 22:40:48 sdinny kernel: [124793.002175] ata3.00: cmd 61/10:18:70:79:09/00:00:05:00:00/40 tag 3 ncq 8192 out
> May 17 22:40:48 sdinny kernel: [124793.002175]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
> May 17 22:40:48 sdinny kernel: [124793.003052] ata3.00: status: { DRDY }
> May 17 22:40:48 sdinny kernel: [124793.003273] ata3.00: failed command: WRITE FPDMA QUEUED
> May 17 22:40:48 sdinny kernel: [124793.003527] ata3.00: cmd 61/10:28:98:31:11/00:00:05:00:00/40 tag 5 ncq 8192 out
> May 17 22:40:48 sdinny kernel: [124793.003559]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
> May 17 22:40:48 sdinny kernel: [124793.004420] ata3.00: status: { DRDY }
> May 17 22:40:48 sdinny kernel: [124793.004640] ata3.00: failed command: WRITE FPDMA QUEUED
> May 17 22:40:48 sdinny kernel: [124793.004893] ata3.00: cmd 61/10:40:d8:4a:20/00:00:05:00:00/40 tag 8 ncq 8192 out
> May 17 22:40:48 sdinny kernel: [124793.004894]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
> May 17 22:40:48 sdinny kernel: [124793.005769] ata3.00: status: { DRDY }
> [...]
> May 17 22:40:48 sdinny kernel: [124793.020296] ata3: hard resetting link
> May 17 22:41:12 sdinny kernel: [124817.132126] ata3: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
> May 17 22:41:12 sdinny kernel: [124817.132275] ata3.00: configured for UDMA/100
> May 17 22:41:12 sdinny kernel: [124817.132277] ata3.00: device reported invalid CHS sector 0
> May 17 22:41:12 sdinny kernel: [124817.132279] ata3.00: device reported invalid CHS sector 0
> May 17 22:41:12 sdinny kernel: [124817.132280] ata3.00: device reported invalid CHS sector 0
> May 17 22:41:12 sdinny kernel: [124817.132281] ata3.00: device reported invalid CHS sector 0
> May 17 22:41:12 sdinny kernel: [124817.132281] ata3.00: device reported invalid CHS sector 0
> May 17 22:41:12 sdinny kernel: [124817.132282] ata3.00: device reported invalid CHS sector 0
> May 17 22:41:12 sdinny kernel: [124817.132283] ata3.00: device reported invalid CHS sector 0
> May 17 22:41:12 sdinny kernel: [124817.132284] ata3.00: device reported invalid CHS sector 0
> May 17 22:41:12 sdinny kernel: [124817.132285] ata3.00: device reported invalid CHS sector 0
> May 17 22:41:12 sdinny kernel: [124817.132286] ata3.00: device reported invalid CHS sector 0
> May 17 22:41:12 sdinny kernel: [124817.132287] ata3.00: device reported invalid CHS sector 0
> May 17 22:41:12 sdinny kernel: [124817.132288] ata3.00: device reported invalid CHS sector 0
> May 17 22:41:12 sdinny kernel: [124817.132289] ata3.00: device reported invalid CHS sector 0
> May 17 22:41:12 sdinny kernel: [124817.132295] ata3: EH complete
>
> The VM is still 'alive' and works.
> But we were forced to reboot (power outage), and after that all the
> partitions of the disk disappeared; we had to restore them with
> tools like 'testdisk'.
> The partitions on the backups were the same: disappeared.
>
> Note that there is also a 'plain' local backup that runs on Sunday,
> and this backup task does not seem to generate trouble (but the
> partitions still seem to have disappeared, as it was done after the
> I/O error).
>
> Have we hit a kernel/QEMU bug?
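
The checks Nada suggests could be wrapped in a small health gate before reaching for 'zpool replace'. This is only a sketch: `check_pools` is a hypothetical helper, and it keys off the "all pools are healthy" message that `zpool status -x` prints when no pool has problems.

```shell
#!/bin/sh
# Sketch: summarize ZFS pool health from the output of 'zpool status -x',
# which prints "all pools are healthy" when no pool reports errors.

check_pools() {
    status="$1"    # on a real host: status="$(zpool status -x)"
    case "$status" in
        *"all pools are healthy"*)
            echo "OK" ;;
        *)
            echo "ATTENTION: inspect 'zpool status -v' and 'zpool history'" ;;
    esac
}

# Commands from the reply above, for context (run on the PVE host):
#   zfs list          # datasets, including the zvol behind vm-120-disk-0
#   zpool list -v     # per-vdev capacity and health
#   zpool history     # past pool operations, e.g. earlier replaces
check_pools "all pools are healthy"    # prints "OK"
```

If the helper reports ATTENTION, 'zpool status -v' names the affected vdev, which is the device a subsequent 'zpool replace' would target.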