From: Marco Gaiarin <gaio@lilliput.linux.it>
To: pve-user@lists.proxmox.com
Subject: [PVE-User] Severe disk corruption: PBS, SATA
Date: Wed, 18 May 2022 10:04:33 +0200
Message-ID: <ftleli-4kd.ln1@hermione.lilliput.linux.it>


We are seeing some very severe disk corruption on one of our
installations, which is admittedly a bit 'niche', but...

PVE 6.4 host on a Dell PowerEdge T340:
	root@sdpve1:~# uname -a
	Linux sdpve1 5.4.106-1-pve #1 SMP PVE 5.4.106-1 (Fri, 19 Mar 2021 11:08:47 +0100) x86_64 GNU/Linux

Debian squeeze i386 in the guest:
	sdinny:~# uname -a
	Linux sdinny 2.6.32-5-686 #1 SMP Mon Feb 29 00:51:35 UTC 2016 i686 GNU/Linux

boot disk defined as:
	sata0: local-zfs:vm-120-disk-0,discard=on,size=100G
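
For reference, the line above is exactly what the VM configuration
reports; it can be pulled out again with qm config (a minimal sketch,
assuming the same VM ID 120):

	root@sdpve1:~# qm config 120 | grep '^sata'
	sata0: local-zfs:vm-120-disk-0,discard=on,size=100G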


After enabling PBS, every time the backup of the VM starts:

 root@sdpve1:~# grep vzdump /var/log/syslog.1 
 May 17 20:27:17 sdpve1 pvedaemon[24825]: <root@pam> starting task UPID:sdpve1:00005132:36BE6E40:6283E905:vzdump:120:root@pam:
 May 17 20:27:17 sdpve1 pvedaemon[20786]: INFO: starting new backup job: vzdump 120 --node sdpve1 --storage nfs-scratch --compress zstd --remove 0 --mode snapshot
 May 17 20:36:50 sdpve1 pvedaemon[24825]: <root@pam> end task UPID:sdpve1:00005132:36BE6E40:6283E905:vzdump:120:root@pam: OK
 May 17 22:00:01 sdpve1 CRON[1734]: (root) CMD (vzdump 100 101 120 --mode snapshot --mailto sys@admin --quiet 1 --mailnotification failure --storage pbs-BP)
 May 17 22:00:02 sdpve1 vzdump[1738]: <root@pam> starting task UPID:sdpve1:00000AE6:36C6F7D7:6283FEC2:vzdump::root@pam:
 May 17 22:00:02 sdpve1 vzdump[2790]: INFO: starting new backup job: vzdump 100 101 120 --mailnotification failure --quiet 1 --mode snapshot --storage pbs-BP --mailto sys@admin
 May 17 22:00:02 sdpve1 vzdump[2790]: INFO: Starting Backup of VM 100 (qemu)
 May 17 22:00:52 sdpve1 vzdump[2790]: INFO: Finished Backup of VM 100 (00:00:50)
 May 17 22:00:52 sdpve1 vzdump[2790]: INFO: Starting Backup of VM 101 (qemu)
 May 17 22:02:09 sdpve1 vzdump[2790]: INFO: Finished Backup of VM 101 (00:01:17)
 May 17 22:02:10 sdpve1 vzdump[2790]: INFO: Starting Backup of VM 120 (qemu)
 May 17 23:31:02 sdpve1 vzdump[2790]: INFO: Finished Backup of VM 120 (01:28:52)
 May 17 23:31:02 sdpve1 vzdump[2790]: INFO: Backup job finished successfully
 May 17 23:31:02 sdpve1 vzdump[1738]: <root@pam> end task UPID:sdpve1:00000AE6:36C6F7D7:6283FEC2:vzdump::root@pam: OK

The VM runs into massive and severe I/O trouble:

 May 17 22:40:48 sdinny kernel: [124793.000045] ata3.00: exception Emask 0x0 SAct 0xf43d2c SErr 0x0 action 0x6 frozen
 May 17 22:40:48 sdinny kernel: [124793.000493] ata3.00: failed command: WRITE FPDMA QUEUED
 May 17 22:40:48 sdinny kernel: [124793.000749] ata3.00: cmd 61/10:10:58:e3:01/00:00:05:00:00/40 tag 2 ncq 8192 out
 May 17 22:40:48 sdinny kernel: [124793.000749]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
 May 17 22:40:48 sdinny kernel: [124793.001628] ata3.00: status: { DRDY }
 May 17 22:40:48 sdinny kernel: [124793.001850] ata3.00: failed command: WRITE FPDMA QUEUED
 May 17 22:40:48 sdinny kernel: [124793.002175] ata3.00: cmd 61/10:18:70:79:09/00:00:05:00:00/40 tag 3 ncq 8192 out
 May 17 22:40:48 sdinny kernel: [124793.002175]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
 May 17 22:40:48 sdinny kernel: [124793.003052] ata3.00: status: { DRDY }
 May 17 22:40:48 sdinny kernel: [124793.003273] ata3.00: failed command: WRITE FPDMA QUEUED
 May 17 22:40:48 sdinny kernel: [124793.003527] ata3.00: cmd 61/10:28:98:31:11/00:00:05:00:00/40 tag 5 ncq 8192 out
 May 17 22:40:48 sdinny kernel: [124793.003559]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
 May 17 22:40:48 sdinny kernel: [124793.004420] ata3.00: status: { DRDY }
 May 17 22:40:48 sdinny kernel: [124793.004640] ata3.00: failed command: WRITE FPDMA QUEUED
 May 17 22:40:48 sdinny kernel: [124793.004893] ata3.00: cmd 61/10:40:d8:4a:20/00:00:05:00:00/40 tag 8 ncq 8192 out
 May 17 22:40:48 sdinny kernel: [124793.004894]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
 May 17 22:40:48 sdinny kernel: [124793.005769] ata3.00: status: { DRDY }
 [...]
 May 17 22:40:48 sdinny kernel: [124793.020296] ata3: hard resetting link
 May 17 22:41:12 sdinny kernel: [124817.132126] ata3: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
 May 17 22:41:12 sdinny kernel: [124817.132275] ata3.00: configured for UDMA/100
 May 17 22:41:12 sdinny kernel: [124817.132277] ata3.00: device reported invalid CHS sector 0
 May 17 22:41:12 sdinny kernel: [124817.132279] ata3.00: device reported invalid CHS sector 0
 May 17 22:41:12 sdinny kernel: [124817.132280] ata3.00: device reported invalid CHS sector 0
 May 17 22:41:12 sdinny kernel: [124817.132281] ata3.00: device reported invalid CHS sector 0
 May 17 22:41:12 sdinny kernel: [124817.132281] ata3.00: device reported invalid CHS sector 0
 May 17 22:41:12 sdinny kernel: [124817.132282] ata3.00: device reported invalid CHS sector 0
 May 17 22:41:12 sdinny kernel: [124817.132283] ata3.00: device reported invalid CHS sector 0
 May 17 22:41:12 sdinny kernel: [124817.132284] ata3.00: device reported invalid CHS sector 0
 May 17 22:41:12 sdinny kernel: [124817.132285] ata3.00: device reported invalid CHS sector 0
 May 17 22:41:12 sdinny kernel: [124817.132286] ata3.00: device reported invalid CHS sector 0
 May 17 22:41:12 sdinny kernel: [124817.132287] ata3.00: device reported invalid CHS sector 0
 May 17 22:41:12 sdinny kernel: [124817.132288] ata3.00: device reported invalid CHS sector 0
 May 17 22:41:12 sdinny kernel: [124817.132289] ata3.00: device reported invalid CHS sector 0
 May 17 22:41:12 sdinny kernel: [124817.132295] ata3: EH complete
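
A read-only filesystem check from inside the guest at least shows
whether those errors left visible damage behind; a minimal sketch,
assuming the root filesystem is ext3 on /dev/sda1 (the usual squeeze
layout, not verified here):

	sdinny:~# dumpe2fs -h /dev/sda1 | grep -i 'state\|error'   # 'clean' vs. 'not clean with errors'
	sdinny:~# fsck.ext3 -n -f /dev/sda1                        # -n: never write; only indicative on a mounted fs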

The VM stays 'alive' and keeps working.
But we were forced to reboot it (power outage), and after that all the
partitions of the disk had disappeared; we had to restore them with
tools like 'testdisk'.
The partitions in the backups were in the same state: gone.
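
Roughly the kind of testdisk run we mean (interactive tool; the menu
path below is the standard one, and /dev/sda as the guest boot disk is
an assumption):

	sdinny:~# testdisk /dev/sda
	# [Proceed] -> partition table type (Intel/MSDOS here)
	#           -> Analyse -> Quick Search -> confirm found partitions -> Write
	# then reboot so the kernel re-reads the restored partition table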


Note that there is also a 'plain' local backup that runs on Sunday, and
this backup task does not seem to cause trouble (but its backup also
has the partitions missing, as it was taken after an I/O error).
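
To check whether a given local vzdump archive also has the partition
table gone, the image can be inspected without a full restore; a
sketch, assuming a zstd-compressed VMA archive and that the extracted
file is named after the sata0 drive (both assumptions on our side, and
it needs enough scratch space for the full 100G image):

	root@sdpve1:~# zstd -dc vzdump-qemu-120-<timestamp>.vma.zst | vma extract - /tmp/vma-120
	root@sdpve1:~# fdisk -l /tmp/vma-120/disk-drive-sata0.raw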


Have we hit a kernel/QEMU bug?
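
If so, the exact package versions will matter; a quick way to collect
them on the host (a sketch using the standard pveversion tool):

	root@sdpve1:~# pveversion -v | grep -iE 'pve-qemu|pve-kernel'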

-- 
  And always cheerful we must stay, for our crying hurts the King,
  it hurts the rich man, the Cardinal,
  they turn sad if we weep...			(Fo, Jannacci)





