From: Marco Gaiarin
Date: Wed, 18 May 2022 10:04:33 +0200
To: pve-user@lists.proxmox.com
Subject: [PVE-User] Severe disk corruption: PBS, SATA

We are seeing some very severe disk corruption on one of our installations, which is admittedly a bit 'niche', but...
PVE 6.4 host on a Dell PowerEdge T340:

root@sdpve1:~# uname -a
Linux sdpve1 5.4.106-1-pve #1 SMP PVE 5.4.106-1 (Fri, 19 Mar 2021 11:08:47 +0100) x86_64 GNU/Linux

Debian squeeze i386 on the guest:

sdinny:~# uname -a
Linux sdinny 2.6.32-5-686 #1 SMP Mon Feb 29 00:51:35 UTC 2016 i686 GNU/Linux

Boot disk defined as:

sata0: local-zfs:vm-120-disk-0,discard=on,size=100G

After enabling PBS, this is what we see every time the backup of the VM runs:

root@sdpve1:~# grep vzdump /var/log/syslog.1
May 17 20:27:17 sdpve1 pvedaemon[24825]: starting task UPID:sdpve1:00005132:36BE6E40:6283E905:vzdump:120:root@pam:
May 17 20:27:17 sdpve1 pvedaemon[20786]: INFO: starting new backup job: vzdump 120 --node sdpve1 --storage nfs-scratch --compress zstd --remove 0 --mode snapshot
May 17 20:36:50 sdpve1 pvedaemon[24825]: end task UPID:sdpve1:00005132:36BE6E40:6283E905:vzdump:120:root@pam: OK
May 17 22:00:01 sdpve1 CRON[1734]: (root) CMD (vzdump 100 101 120 --mode snapshot --mailto sys@admin --quiet 1 --mailnotification failure --storage pbs-BP)
May 17 22:00:02 sdpve1 vzdump[1738]: starting task UPID:sdpve1:00000AE6:36C6F7D7:6283FEC2:vzdump::root@pam:
May 17 22:00:02 sdpve1 vzdump[2790]: INFO: starting new backup job: vzdump 100 101 120 --mailnotification failure --quiet 1 --mode snapshot --storage pbs-BP --mailto sys@admin
May 17 22:00:02 sdpve1 vzdump[2790]: INFO: Starting Backup of VM 100 (qemu)
May 17 22:00:52 sdpve1 vzdump[2790]: INFO: Finished Backup of VM 100 (00:00:50)
May 17 22:00:52 sdpve1 vzdump[2790]: INFO: Starting Backup of VM 101 (qemu)
May 17 22:02:09 sdpve1 vzdump[2790]: INFO: Finished Backup of VM 101 (00:01:17)
May 17 22:02:10 sdpve1 vzdump[2790]: INFO: Starting Backup of VM 120 (qemu)
May 17 23:31:02 sdpve1 vzdump[2790]: INFO: Finished Backup of VM 120 (01:28:52)
May 17 23:31:02 sdpve1 vzdump[2790]: INFO: Backup job finished successfully
May 17 23:31:02 sdpve1 vzdump[1738]: end task UPID:sdpve1:00000AE6:36C6F7D7:6283FEC2:vzdump::root@pam: OK
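For reference, the 22:00 CRON line above comes from the scheduled backup job; the corresponding entry in /etc/pve/vzdump.cron should look roughly like this (the schedule fields are inferred from the timestamps above, only the vzdump command itself is verbatim from the syslog):

# /etc/pve/vzdump.cron (sketch, not copied from the host)
PATH="/usr/sbin:/usr/bin:/sbin:/bin"

0 22 * * *           root vzdump 100 101 120 --mode snapshot --mailto sys@admin --quiet 1 --mailnotification failure --storage pbs-BP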
During that backup window the VM shows massive and severe I/O trouble:

May 17 22:40:48 sdinny kernel: [124793.000045] ata3.00: exception Emask 0x0 SAct 0xf43d2c SErr 0x0 action 0x6 frozen
May 17 22:40:48 sdinny kernel: [124793.000493] ata3.00: failed command: WRITE FPDMA QUEUED
May 17 22:40:48 sdinny kernel: [124793.000749] ata3.00: cmd 61/10:10:58:e3:01/00:00:05:00:00/40 tag 2 ncq 8192 out
May 17 22:40:48 sdinny kernel: [124793.000749]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
May 17 22:40:48 sdinny kernel: [124793.001628] ata3.00: status: { DRDY }
May 17 22:40:48 sdinny kernel: [124793.001850] ata3.00: failed command: WRITE FPDMA QUEUED
May 17 22:40:48 sdinny kernel: [124793.002175] ata3.00: cmd 61/10:18:70:79:09/00:00:05:00:00/40 tag 3 ncq 8192 out
May 17 22:40:48 sdinny kernel: [124793.002175]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
May 17 22:40:48 sdinny kernel: [124793.003052] ata3.00: status: { DRDY }
May 17 22:40:48 sdinny kernel: [124793.003273] ata3.00: failed command: WRITE FPDMA QUEUED
May 17 22:40:48 sdinny kernel: [124793.003527] ata3.00: cmd 61/10:28:98:31:11/00:00:05:00:00/40 tag 5 ncq 8192 out
May 17 22:40:48 sdinny kernel: [124793.003559]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
May 17 22:40:48 sdinny kernel: [124793.004420] ata3.00: status: { DRDY }
May 17 22:40:48 sdinny kernel: [124793.004640] ata3.00: failed command: WRITE FPDMA QUEUED
May 17 22:40:48 sdinny kernel: [124793.004893] ata3.00: cmd 61/10:40:d8:4a:20/00:00:05:00:00/40 tag 8 ncq 8192 out
May 17 22:40:48 sdinny kernel: [124793.004894]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
May 17 22:40:48 sdinny kernel: [124793.005769] ata3.00: status: { DRDY }
[...]
May 17 22:40:48 sdinny kernel: [124793.020296] ata3: hard resetting link
May 17 22:41:12 sdinny kernel: [124817.132126] ata3: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
May 17 22:41:12 sdinny kernel: [124817.132275] ata3.00: configured for UDMA/100
May 17 22:41:12 sdinny kernel: [124817.132277] ata3.00: device reported invalid CHS sector 0
May 17 22:41:12 sdinny kernel: [124817.132279] ata3.00: device reported invalid CHS sector 0
May 17 22:41:12 sdinny kernel: [124817.132280] ata3.00: device reported invalid CHS sector 0
May 17 22:41:12 sdinny kernel: [124817.132281] ata3.00: device reported invalid CHS sector 0
May 17 22:41:12 sdinny kernel: [124817.132281] ata3.00: device reported invalid CHS sector 0
May 17 22:41:12 sdinny kernel: [124817.132282] ata3.00: device reported invalid CHS sector 0
May 17 22:41:12 sdinny kernel: [124817.132283] ata3.00: device reported invalid CHS sector 0
May 17 22:41:12 sdinny kernel: [124817.132284] ata3.00: device reported invalid CHS sector 0
May 17 22:41:12 sdinny kernel: [124817.132285] ata3.00: device reported invalid CHS sector 0
May 17 22:41:12 sdinny kernel: [124817.132286] ata3.00: device reported invalid CHS sector 0
May 17 22:41:12 sdinny kernel: [124817.132287] ata3.00: device reported invalid CHS sector 0
May 17 22:41:12 sdinny kernel: [124817.132288] ata3.00: device reported invalid CHS sector 0
May 17 22:41:12 sdinny kernel: [124817.132289] ata3.00: device reported invalid CHS sector 0
May 17 22:41:12 sdinny kernel: [124817.132295] ata3: EH complete

The VM is still 'alive' and works. But we were forced to reboot it (power outage), and after that all the partitions on the disk had disappeared; we had to restore them with tools like 'testdisk'. The partitions in the backups were gone in the same way.

Note that there is also a 'plain' local backup that runs on Sunday, and that backup task does not seem to cause trouble (but it, too, seems to have the partitions missing, as it was taken after the I/O errors).

Have we hit a kernel/QEMU bug?

-- 
  E sempre allegri bisogna stare, che il nostro piangere fa male al Re
  fa male al ricco, al Cardinale, diventan tristi se noi piangiam...
                                                        (Fo, Jannacci)
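PS: if we read the 'Emask 0x4 (timeout)' lines right, it is the guest's own per-command disk timeout (30 seconds by default) that expires while the backup slows the virtual disk down. A minimal sketch of what can be checked inside the guest, assuming the SATA disk shows up there as /dev/sda (paths and values from memory, not from this box):

sdinny:~# cat /sys/block/sda/device/timeout
30
sdinny:~# echo 180 > /sys/block/sda/device/timeout
sdinny:~# sfdisk -d /dev/sda > /root/sda-partition-table.txt

The first two commands only loosen the per-command timeout for the running system (the change is lost at reboot); the sfdisk dump is a plain-text copy of the partition table that could be fed back with 'sfdisk /dev/sda < /root/sda-partition-table.txt' if the partitions disappear again, instead of reconstructing them with testdisk.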