From: Wolf Noble
To: Proxmox VE user list
Date: Wed, 18 May 2022 23:07:05 -0500
Subject: Re: [PVE-User] Severe disk corruption: PBS, SATA

from over here in the cheap seats, another potential strangeness injector:

zfs + any sort of raid controller which plays the abstraction game between raw disk and the OS can cause any number of weird and painful scenarios. ZFS believes it has an accurate idea of the underlying disks; it does its voodoo wholly believing that it's solely responsible for data durability. with a raid controller in between playing the shell game with IO, things USUALLY work... RIGHT UNTIL THEY DON'T.

i'm sure you're well aware of this, and have probably already mitigated this concern with a JBOD controller, or something that isn't preventing the OS (and thus ZFS) from talking directly to the disks... but it felt worth pointing out on the off chance that it got overlooked (a rough sanity check is sketched below).
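if it helps, a quick way to check that ZFS really is seeing raw disks might look something like this (just a sketch; /dev/sda is an example device name, adjust for your hardware):

  zpool status -v                       # vdevs should be the actual disks/partitions, not a controller's virtual volume
  lsblk -o NAME,MODEL,SIZE,TRAN,TYPE    # MODEL should show the real drive models, not e.g. a PERC/MegaRAID virtual disk
  smartctl -a /dev/sda                  # SMART answering directly is a good sign

if smartctl only responds through something like '-d megaraid,N', that's usually a hint there is still a RAID layer sitting between ZFS and the disks.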
hope you are well and the gremlins are promptly discovered and put back into their comfortable chairs so they can resume their harmless heckling.

🐺W

[= The contents of this message have been written, read, processed, erased, sorted, sniffed, compressed, rewritten, misspelled, overcompensated, lost, found, and most importantly delivered entirely with recycled electrons =]

> On May 18, 2022, at 11:21, nada wrote:
> 
> hi Marco
> you used some local ZFS filesystem according to your info, so you may try
> 
> zfs list
> zpool list -v
> zpool history
> zpool import ...
> zpool replace ...
> 
> all the best
> Nada
> 
>> On 2022-05-18 10:04, Marco Gaiarin wrote:
>> We are seeing some very severe disk corruption on one of our
>> installations, which is admittedly a bit of a 'niche' setup, but...
>> 
>> PVE 6.4 host on a Dell PowerEdge T340:
>> 
>> root@sdpve1:~# uname -a
>> Linux sdpve1 5.4.106-1-pve #1 SMP PVE 5.4.106-1 (Fri, 19 Mar 2021 11:08:47 +0100) x86_64 GNU/Linux
>> 
>> Debian squeeze i386 on the guest:
>> 
>> sdinny:~# uname -a
>> Linux sdinny 2.6.32-5-686 #1 SMP Mon Feb 29 00:51:35 UTC 2016 i686 GNU/Linux
>> 
>> Boot disk defined as:
>> 
>> sata0: local-zfs:vm-120-disk-0,discard=on,size=100G
>> 
>> After enabling PBS, every time the backup of the VM starts:
>> 
>> root@sdpve1:~# grep vzdump /var/log/syslog.1
>> May 17 20:27:17 sdpve1 pvedaemon[24825]: starting task UPID:sdpve1:00005132:36BE6E40:6283E905:vzdump:120:root@pam:
>> May 17 20:27:17 sdpve1 pvedaemon[20786]: INFO: starting new backup job: vzdump 120 --node sdpve1 --storage nfs-scratch --compress zstd --remove 0 --mode snapshot
>> May 17 20:36:50 sdpve1 pvedaemon[24825]: end task UPID:sdpve1:00005132:36BE6E40:6283E905:vzdump:120:root@pam: OK
>> May 17 22:00:01 sdpve1 CRON[1734]: (root) CMD (vzdump 100 101 120 --mode snapshot --mailto sys@admin --quiet 1 --mailnotification failure --storage pbs-BP)
>> May 17 22:00:02 sdpve1 vzdump[1738]: starting task UPID:sdpve1:00000AE6:36C6F7D7:6283FEC2:vzdump::root@pam:
>> May 17 22:00:02 sdpve1 vzdump[2790]: INFO: starting new backup job: vzdump 100 101 120 --mailnotification failure --quiet 1 --mode snapshot --storage pbs-BP --mailto sys@admin
>> May 17 22:00:02 sdpve1 vzdump[2790]: INFO: Starting Backup of VM 100 (qemu)
>> May 17 22:00:52 sdpve1 vzdump[2790]: INFO: Finished Backup of VM 100 (00:00:50)
>> May 17 22:00:52 sdpve1 vzdump[2790]: INFO: Starting Backup of VM 101 (qemu)
>> May 17 22:02:09 sdpve1 vzdump[2790]: INFO: Finished Backup of VM 101 (00:01:17)
>> May 17 22:02:10 sdpve1 vzdump[2790]: INFO: Starting Backup of VM 120 (qemu)
>> May 17 23:31:02 sdpve1 vzdump[2790]: INFO: Finished Backup of VM 120 (01:28:52)
>> May 17 23:31:02 sdpve1 vzdump[2790]: INFO: Backup job finished successfully
>> May 17 23:31:02 sdpve1 vzdump[1738]: end task UPID:sdpve1:00000AE6:36C6F7D7:6283FEC2:vzdump::root@pam: OK
>> 
>> The VM showed some massive and severe IO trouble:
>> 
>> May 17 22:40:48 sdinny kernel: [124793.000045] ata3.00: exception Emask 0x0 SAct 0xf43d2c SErr 0x0 action 0x6 frozen
>> May 17 22:40:48 sdinny kernel: [124793.000493] ata3.00: failed command: WRITE FPDMA QUEUED
>> May 17 22:40:48 sdinny kernel: [124793.000749] ata3.00: cmd 61/10:10:58:e3:01/00:00:05:00:00/40 tag 2 ncq 8192 out
>> May 17 22:40:48 sdinny kernel: [124793.000749]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
>> May 17 22:40:48 sdinny kernel: [124793.001628] ata3.00: status: { DRDY }
>> May 17 22:40:48 sdinny kernel: [124793.001850] ata3.00: failed command: WRITE FPDMA QUEUED
>> May 17 22:40:48 sdinny kernel: [124793.002175] ata3.00: cmd 61/10:18:70:79:09/00:00:05:00:00/40 tag 3 ncq 8192 out
>> May 17 22:40:48 sdinny kernel: [124793.002175]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
>> May 17 22:40:48 sdinny kernel: [124793.003052] ata3.00: status: { DRDY }
>> May 17 22:40:48 sdinny kernel: [124793.003273] ata3.00: failed command: WRITE FPDMA QUEUED
>> May 17 22:40:48 sdinny kernel: [124793.003527] ata3.00: cmd 61/10:28:98:31:11/00:00:05:00:00/40 tag 5 ncq 8192 out
>> May 17 22:40:48 sdinny kernel: [124793.003559]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
>> May 17 22:40:48 sdinny kernel: [124793.004420] ata3.00: status: { DRDY }
>> May 17 22:40:48 sdinny kernel: [124793.004640] ata3.00: failed command: WRITE FPDMA QUEUED
>> May 17 22:40:48 sdinny kernel: [124793.004893] ata3.00: cmd 61/10:40:d8:4a:20/00:00:05:00:00/40 tag 8 ncq 8192 out
>> May 17 22:40:48 sdinny kernel: [124793.004894]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
>> May 17 22:40:48 sdinny kernel: [124793.005769] ata3.00: status: { DRDY }
>> [...]
>> May 17 22:40:48 sdinny kernel: [124793.020296] ata3: hard resetting link
>> May 17 22:41:12 sdinny kernel: [124817.132126] ata3: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
>> May 17 22:41:12 sdinny kernel: [124817.132275] ata3.00: configured for UDMA/100
>> May 17 22:41:12 sdinny kernel: [124817.132277] ata3.00: device reported invalid CHS sector 0
>> [... the same "device reported invalid CHS sector 0" message repeated 12 more times ...]
>> May 17 22:41:12 sdinny kernel: [124817.132295] ata3: EH complete
>> 
>> The VM is still 'alive' and works.
>> But we were forced to reboot (power outage), and after that all the partitions of the disk disappeared; we had to restore them with tools like 'testdisk'.
>> The partitions in the backups had likewise disappeared.
>> Note that there is also a 'plain' local backup that runs on Sunday; this backup task does not seem to generate trouble (but it also shows the partitions as disappeared, since it was taken after an I/O error).
>> Have we hit a kernel/QEMU bug?
> 
> _______________________________________________
> pve-user mailing list
> pve-user@lists.proxmox.com
> https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-user