From: Lorne Guse via pve-devel <pve-devel@lists.proxmox.com>
To: "Max R. Carrara" <m.carrara@proxmox.com>,
Proxmox VE development discussion <pve-devel@lists.proxmox.com>
Cc: Lorne Guse <boomshankerx@hotmail.com>
Subject: Re: [pve-devel] How does proxmox handle loss of connection / reboot of iSCSI storage
Date: Mon, 29 Sep 2025 17:00:18 +0000 [thread overview]
Message-ID: <mailman.499.1759165261.390.pve-devel@lists.proxmox.com> (raw)
In-Reply-To: <DD59RVVYHV7W.32HCKEOIF3WWF@proxmox.com>
[-- Attachment #1: Type: message/rfc822, Size: 21478 bytes --]
From: Lorne Guse <boomshankerx@hotmail.com>
To: "Max R. Carrara" <m.carrara@proxmox.com>, Proxmox VE development discussion <pve-devel@lists.proxmox.com>
Subject: Re: How does proxmox handle loss of connection / reboot of iSCSI storage
Date: Mon, 29 Sep 2025 17:00:18 +0000
Message-ID: <DM6PR17MB3466CB51B17264D74D43A8F0D01BA@DM6PR17MB3466.namprd17.prod.outlook.com>
TrueNAS has indicated that being able to reboot the storage device is "a basic requirement in the Enterprise." I've now done several TrueNAS upgrades and reboots without any issues. I don't have any Windows VMs on my cluster ATM, but I intend to build a few for testing purposes.
Thanks again for your input.
________________________________
From: Max R. Carrara <m.carrara@proxmox.com>
Sent: Monday, September 29, 2025 6:06 AM
To: Lorne Guse <boomshankerx@hotmail.com>; Proxmox VE development discussion <pve-devel@lists.proxmox.com>
Subject: Re: How does proxmox handle loss of connection / reboot of iSCSI storage
On Fri Sep 26, 2025 at 6:41 PM CEST, Lorne Guse wrote:
> TIL what nerd-sniping is. I was worried that I broke some kind of rule at first. LOL
Hahaha, oh no, it's all good 😄
>
> Thank you for your response. I will do some more extensive testing to see if there is a limit. Some TrueNAS updates can take longer than 3 min.
You're welcome!
>
> I imagine it might be guest-dependent.
>
> I always assumed that I had to shut down my VMs before updating TrueNAS. On the next update I'll run some backups and update while my proxmox cluster is online.
Yeah, I would say it's the combination of storage and guest; that is,
it depends on what's running inside the guest and on what kind of
storage the guest's disks are residing on.
Also, not sure if I've expressed this properly in my previous response,
but I definitely wouldn't *rely* on things being fine if some storage is
down for a bit. The safe option naturally is to shut down any guests
using that storage before updating (unless the VMs can be migrated to
a different node that's using a different storage).
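If you want to script that part, something along these lines should do
it on a PVE node (rough, untested sketch; "tank-iscsi" is just a
placeholder for the storage ID):

# shut down every running VM that has a disk on the given storage
STORAGE="tank-iscsi"
for vmid in $(qm list | awk 'NR>1 && $3=="running" {print $1}'); do
    if qm config "$vmid" | grep -q ": ${STORAGE}:"; then
        qm shutdown "$vmid" --timeout 180
    fi
done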
> ________________________________
> From: Max R. Carrara <m.carrara@proxmox.com>
> Sent: Friday, September 26, 2025 7:32 AM
> To: Lorne Guse <boomshankerx@hotmail.com>; Proxmox VE development discussion <pve-devel@lists.proxmox.com>
> Subject: Re: How does proxmox handle loss of connection / reboot of iSCSI storage
>
> On Fri Sep 26, 2025 at 4:06 AM CEST, Lorne Guse wrote:
> > RE: TrueNAS over iSCSI Custom Storage Plugin
> >
> > TrueNAS has asked me to investigate how Proxmox reacts to a reboot of the storage server while VMs and the cluster are active. This is especially relevant for updates to TrueNAS.
> >
> > >The one test we'd like to see work is reboot of TrueNAS node while VMs and cluster are operational… does it "resume" cleanly? A TrueNAS software update will be similar.
> >
> > I don't think the storage plugin is responsible for this level of interaction with the storage server. Is there anything that can be done at the storage plugin level to facilitate graceful recovery when the storage server goes down?
> >
> >
> > --
> > Lorne Guse
>
> From what I have experienced, it depends entirely on the underlying
> storage implementation. Since you nerd-sniped me a little here, I
> decided to do some testing.
>
> On ZFS over iSCSI (using LIO), the downtime does not affect the VM at
> all, except that I/O is stalled while the remote storage is rebooting.
> So while I/O operations might take a little while to go through from the
> VM's perspective, nothing broke here (in my Debian VM at least).
>
> Note that by "nothing broke" I mean that the VM kept on running, the OS and
> its parts didn't throw any errors, no systemd units failed, etc.
> Of course, if an application running inside the VM, for example, sets a
> timeout on some disk operation and throws an error because of that,
> that's an "issue" with the application.
>
> I even shut down the ZFS-over-iSCSI-via-LIO remote for a couple of minutes
> to see if it would throw any errors eventually, but nope, it doesn't;
> things just take a while:
>
> Starting: Fri Sep 26 02:32:52 PM CEST 2025
> d5ae75665497b917c70216497a480104b0395e0b53c6256b1f1e3de96c29eb87 foo
> Done: Fri Sep 26 02:32:58 PM CEST 2025
> Starting: Fri Sep 26 02:32:59 PM CEST 2025
> d5ae75665497b917c70216497a480104b0395e0b53c6256b1f1e3de96c29eb87 foo
> Done: Fri Sep 26 02:33:04 PM CEST 2025
> Starting: Fri Sep 26 02:33:05 PM CEST 2025
> d5ae75665497b917c70216497a480104b0395e0b53c6256b1f1e3de96c29eb87 foo
> Done: Fri Sep 26 02:36:16 PM CEST 2025
> Starting: Fri Sep 26 02:36:17 PM CEST 2025
> d5ae75665497b917c70216497a480104b0395e0b53c6256b1f1e3de96c29eb87 foo
> Done: Fri Sep 26 02:36:23 PM CEST 2025
> Starting: Fri Sep 26 02:36:24 PM CEST 2025
> d5ae75665497b917c70216497a480104b0395e0b53c6256b1f1e3de96c29eb87 foo
> Done: Fri Sep 26 02:36:29 PM CEST 2025
>
> The timestamps there show that the storage was down for ~3 minutes,
> which is a *lot*, but nevertheless everything kept on running.
>
> The above is the output of the following:
>
> while sleep 1; do echo "Starting: $(date)"; sha256sum foo; echo "Done: $(date)"; done
>
> ... where "foo" is a roughly 4 GiB file I had created with:
>
> dd if=/dev/urandom of=./foo bs=1M count=4000
>
> With the TrueNAS legacy plugin (also ZFS over iSCSI, as you know),
> reboots of TrueNAS are also handled "gracefully" in this way; I was able
> to observe the same behavior as with the LIO iSCSI provider. So if you
> keep using iSCSI for the new plugin (which I think you do, IIRC),
> everything should be fine. But as I said, it's up to the applications
> inside the guest whether long disk I/O latencies are a problem or not.
>
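> A quick way to check how much slack the guest itself gives its disks is
> the SCSI command timeout (assuming the disk is attached via virtio-scsi;
> virtio-blk has no such knob), e.g. inside a Linux guest:
>
> # show / raise the SCSI disk timeout, in seconds (default is 30)
> cat /sys/block/sda/device/timeout
> echo 180 > /sys/block/sda/device/timeout
>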
> On a side note, I'm not too familiar with how QEMU handles iSCSI
> sessions in particular, but from what I can tell it just waits until the
> iSCSI session resumes; at least that's what I'm assuming here.
>
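> If you want to peek at what QEMU itself sees while the storage is down,
> the QEMU monitor can be used for that (sketch; 100 is a placeholder
> VMID). If I'm not mistaken, the iSCSI session is handled by QEMU's own
> initiator (libiscsi) here, so iscsiadm on the host won't list it:
>
> qm monitor 100
> qm> info block
>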
> For curiosity's sake I also tested this with my SSHFS plugin [0], and
> in that case the VM remained online, but threw I/O errors immediately
> and remained in an unusable state even once the storage was up again.
> (I'll actually see if I can prevent that from happening; IIRC there's
> an option for reconnecting, unless I'm mistaken.)
>
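> For reference, the option I have in mind is sshfs's reconnect mode,
> roughly along these lines (untested with the plugin so far; the remote
> and the mount point are just placeholders):
>
> sshfs -o reconnect,ServerAliveInterval=15,ServerAliveCountMax=3 \
>     user@host:/export /mnt/pve/sshfs-example
>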
> Regarding your question about what the plugin can do to facilitate
> graceful recovery: In your case, things should be fine "out of the box"
> because of the magic intricacies of iSCSI + QEMU; with other plugins &
> storage implementations it really depends.
>
> Hope that helps clear some things up!
>
> [0]: https://git.proxmox.com/?p=pve-storage-plugin-examples.git;a=blob;f=plugin-sshfs/src/PVE/Storage/Custom/SSHFSPlugin.pm;h=2d1612b139a3342e7a91b9d2809c2cf209ed9b05;hb=refs/heads/master
[-- Attachment #2: Type: text/plain, Size: 160 bytes --]
_______________________________________________
pve-devel mailing list
pve-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel