From: "Max R. Carrara" <m.carrara@proxmox.com>
To: "Lorne Guse" <boomshankerx@hotmail.com>,
"Proxmox VE development discussion" <pve-devel@lists.proxmox.com>
Subject: Re: [pve-devel] How does proxmox handle loss of connection / reboot of iSCSI storage
Date: Fri, 26 Sep 2025 15:32:21 +0200 [thread overview]
Message-ID: <DD2RQ25PEUQJ.3UCAOIFZYCS0I@proxmox.com> (raw)
In-Reply-To: <DM6PR17MB34665CA6AB7E651E1675C791D01EA@DM6PR17MB3466.namprd17.prod.outlook.com>
On Fri Sep 26, 2025 at 4:06 AM CEST, Lorne Guse wrote:
> RE: TrueNAS over iSCSI Custom Storage Plugin
>
> TrueNAS has asked me to investigate how Proxmox reacts to reboot of the storage server while VMs and cluster are active. This is especially relevant for updates to TrueNAS.
>
> >The one test we'd like to see work is reboot of TrueNAS node while VMs and cluster are operational… does it "resume" cleanly? A TrueNAS software update will be similar.
>
> I don't think the storage plugin is responsible for this level of interaction with the storage server. Is there anything that can be done at the storage plugin level to facilitate graceful recovery when the storage server goes down?
>
>
> --
> Lorne Guse
From what I have experienced, it depends entirely on the underlying
storage implementation. Since you nerd-sniped me a little here, I
decided to do some testing.
On ZFS over iSCSI (using LIO), the downtime does not affect the VM at
all, except that I/O is stalled while the remote storage is rebooting.
So while I/O operations might take a little while to go through from the
VM's perspective, nothing broke here (in my Debian VM at least).
Note that by "nothing broke" I mean that the VM kept on running, the OS
and its components didn't throw any errors, no systemd units failed, etc.
Of course, if an application running inside the VM sets a timeout on
some disk operation, for example, and throws an error because of that,
that's an "issue" with the application itself.
I even shut down the ZFS-over-iSCSI-via-LIO remote for a couple of
minutes to see if it would throw any errors eventually, but nope, it
didn't; things just took a while:
Starting: Fri Sep 26 02:32:52 PM CEST 2025
d5ae75665497b917c70216497a480104b0395e0b53c6256b1f1e3de96c29eb87 foo
Done: Fri Sep 26 02:32:58 PM CEST 2025
Starting: Fri Sep 26 02:32:59 PM CEST 2025
d5ae75665497b917c70216497a480104b0395e0b53c6256b1f1e3de96c29eb87 foo
Done: Fri Sep 26 02:33:04 PM CEST 2025
Starting: Fri Sep 26 02:33:05 PM CEST 2025
d5ae75665497b917c70216497a480104b0395e0b53c6256b1f1e3de96c29eb87 foo
Done: Fri Sep 26 02:36:16 PM CEST 2025
Starting: Fri Sep 26 02:36:17 PM CEST 2025
d5ae75665497b917c70216497a480104b0395e0b53c6256b1f1e3de96c29eb87 foo
Done: Fri Sep 26 02:36:23 PM CEST 2025
Starting: Fri Sep 26 02:36:24 PM CEST 2025
d5ae75665497b917c70216497a480104b0395e0b53c6256b1f1e3de96c29eb87 foo
Done: Fri Sep 26 02:36:29 PM CEST 2025
The timestamps there show that the storage was down for ~3 minutes,
which is a *lot*, but nevertheless everything kept on running.
The above is the output of the following:
while sleep 1; do echo "Starting: $(date)"; sha256sum foo; echo "Done: $(date)"; done
... where "foo" is a ~4 GiB file I had created with:
dd if=/dev/urandom of=./foo bs=1M count=4000
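(In case anyone wants to reproduce this: a quick sanity check I'd run
inside the guest afterwards to confirm that no disk errors were actually
logged; the grep patterns below are just my assumption of what such
errors would look like.)

# inside the guest, once the storage is back up
dmesg -T | grep -iE 'blk_update_request|i/o error|timed out' || echo "no block-layer errors in dmesg"
journalctl -p err -b --no-pager | grep -iE 'scsi|i/o error' || echo "no storage-related errors in the journal"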
With the TrueNAS legacy plugin (also ZFS over iSCSI, as you know),
reboots of TrueNAS are also handled "gracefully" in this way; I was able
to observe the same behavior as with the LIO iSCSI provider. So if you
keep using iSCSI for the new plugin (which I think you do, IIRC),
everything should be fine. But as I said, it's up to the applications
inside the guest whether long disk I/O latencies are a problem or not.
On a side note, I'm not too familiar with how QEMU handles iSCSI
sessions in particular, but from what I can tell it just waits until the
iSCSI session resumes; at least that's what I'm assuming here.
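(Also purely as an assumption on my part, since this is more QEMU than
plugin territory: the per-drive werror/rerror policies should influence
what the guest sees if a request ever does fail outright instead of just
stalling. Something along these lines, where the VMID and volume name
are placeholders; note that "qm set" re-specifies the whole drive entry.)

# hypothetical example: pause the VM on I/O errors instead of forwarding them to the guest
# (VMID 100 and tank:vm-100-disk-0 are placeholders)
qm set 100 --scsi0 tank:vm-100-disk-0,werror=stop,rerror=stop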
For curiosity's sake I also tested this with my SSHFS plugin [0]; in
that case the VM stayed online, but threw I/O errors immediately and
remained in an unusable state even once the storage was back up.
(I'll see if I can prevent that from happening; IIRC sshfs has an
option for reconnecting.)
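(If it's the option I'm thinking of, it would be something like the
following; the host, export path and mountpoint are placeholders, and I
haven't verified this yet.)

# hypothetical mount: -o reconnect re-establishes the SSH connection after it drops,
# the keep-alive options make the client notice the outage in the first place
sshfs -o reconnect,ServerAliveInterval=15,ServerAliveCountMax=3 \
    root@storage-host:/export/vmdata /mnt/pve/sshfs-example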
Regarding your question about what the plugin can do to facilitate
graceful recovery: In your case, things should be fine "out of the box"
because of the magic intricacies of iSCSI + QEMU; with other plugins &
storage implementations it really depends.
Hope that helps clear some things up!
[0]: https://git.proxmox.com/?p=pve-storage-plugin-examples.git;a=blob;f=plugin-sshfs/src/PVE/Storage/Custom/SSHFSPlugin.pm;h=2d1612b139a3342e7a91b9d2809c2cf209ed9b05;hb=refs/heads/master