public inbox for pve-devel@lists.proxmox.com
* Re: [pve-devel] orphaned cfs lock when interrupting qm disk import
From: Thomas Lamprecht
To: Joshua Huber, PVE Devel
Date: 2024-09-06 17:38 UTC
In-Reply-To: <CA+uqL00=49RuhJjjsJZfwZURWYBqf_TdkqCGtB6PM1055pO0yQ@mail.gmail.com>

Hi,

On 26/08/2024 at 22:46, Joshua Huber wrote:
> Say you've just kicked off a long-running "qm disk import ..." command and
> notice that an incorrect flag was specified. Ok, cancel the operation with
> control-C, fix the flag and re-execute the import command...
> 
> However, when using shared storage you'll (at least for the next 120
> seconds) bump up against an orphaned cfs lock directory. One could manually
> remove the directory from /etc/pve/priv/locks, but it seems a little risky.
> (and a bad habit to get into)
> 
> So this got me thinking... could we trap SIGINT and more gracefully fail
> the operation? It seems like this would allow the cfs_lock function to
> clean up after itself. This seems like it'd be a nice CLI QOL improvement.

Yeah, that sounds sensible, albeit I'd have to look more closely into the
code, because it might not be that trivial if we execute another command,
e.g. qemu-img, that then controls the terminal and would receive the
SIGINT directly. There are options for that too, but they are not so nice.
Anyhow, as long as this all happens in a worker it probably should not
be an issue and one could just install a handler with
`$SIG{INT} = sub { ... cleanup ...; die "interrupted" };`
and be done.
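
For illustration, a rough sketch of that idea (just a sketch, the helper
name and its placement are made up here, not the actual qm code path):

#!/usr/bin/perl
use strict;
use warnings;

# Rough sketch (not the actual qm code path): scope a SIGINT handler to
# the long-running part so an interrupt becomes a normal Perl exception
# instead of killing the process before any cleanup could run.
sub run_interruptible {
    my ($code) = @_;

    local $SIG{INT} = sub {
        die "interrupted\n"; # unwinds through the eval below, so cleanup still runs
    };

    return $code->();
}

eval {
    run_interruptible(sub {
        print "importing disk...\n"; # stand-in for the real import work
        sleep 30; # press Ctrl-C here to trigger the handler
    });
};
if (my $err = $@) {
    # ... cleanup would go here (release the cfs lock, remove partial volumes, ...)
    print "aborted: $err";
}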

> However, I'm not familiar with the history of the cfs-lock mechanism, why
> it's used for shared storage backends, and what other invalid PVE states
> might be avoided as a side-effect of serializing storage operations.
> Allowing concurrent operations could result in disk numbering collisions,
> but I'm not sure what else. (apart from storage-specific limitations.)

In general the shared lock is there to avoid concurrent access to the same
volume, but it's currently much coarser than it needs to be, i.e., it could be
on just the volume, or at least per VMID for volumes that are owned by guests.
But that's a rather different topic.

Anyhow, once interrupted, keeping the lock active won't protect us from
anything, especially as it will become free after 120s anyway, as you noticed
yourself, so actively removing it immediately should not cause any (more)
problems, from what I can tell.
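
To make that concrete, here is a self-contained sketch of the general
pattern (a plain mkdir/rmdir stand-in, not the real cfs lock code):
without such a handler, the interrupt kills the process between taking
and releasing the lock, which is exactly the orphaned directory one then
has to wait out.

#!/usr/bin/perl
use strict;
use warnings;

# Stand-in for a cfs-style lock: a directory created before the guarded
# operation and removed afterwards. The real lock additionally expires
# on its own after roughly 120 seconds.
my $lockdir = "/tmp/demo-import-lock";

sub with_lock {
    my ($code) = @_;

    mkdir($lockdir) or die "can't acquire lock '$lockdir': $!\n";

    # turn Ctrl-C into a die() so the rmdir below still executes; without
    # this the process dies mid-callback and the directory is orphaned
    local $SIG{INT} = sub { die "interrupted\n" };

    my $res = eval { $code->() };
    my $err = $@;

    rmdir($lockdir); # always release, on success and on error

    die $err if $err;
    return $res;
}

with_lock(sub {
    print "long-running import...\n";
    sleep 30; # Ctrl-C here: the lock directory is still removed
    print "done\n";
});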

Are you willing to look into this? Otherwise, a bugzilla entry would be fine
too.

cheers
 Thomas



* [pve-devel] orphaned cfs lock when interrupting qm disk import
From: Joshua Huber <jhuber@blockbridge.com>
To: PVE Devel <pve-devel@lists.proxmox.com>
Date: Mon, 26 Aug 2024 16:46:10 -0400
Message-ID: <CA+uqL00=49RuhJjjsJZfwZURWYBqf_TdkqCGtB6PM1055pO0yQ@mail.gmail.com>

Hi everyone,

Say you've just kicked off a long-running "qm disk import ..." command and
notice that an incorrect flag was specified. Ok, cancel the operation with
control-C, fix the flag and re-execute the import command...

However, when using shared storage you'll (at least for the next 120
seconds) bump up against an orphaned cfs lock directory. One could manually
remove the directory from /etc/pve/priv/locks, but it seems a little risky.
(and a bad habit to get into)

So this got me thinking... could we trap SIGINT and more gracefully fail
the operation? It seems like this would allow the cfs_lock function to
clean up after itself. This seems like it'd be a nice CLI QOL improvement.

However, I'm not familiar with the history of the cfs-lock mechanism, why
it's used for shared storage backends, and what other invalid PVE states
might be avoided as a side-effect of serializing storage operations.
Allowing concurrent operations could result in disk numbering collisions,
but I'm not sure what else. (apart from storage-specific limitations.)
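
As a toy illustration of the collision I have in mind (made-up allocation
logic, not the real code): two unsynchronized workers scanning the same
snapshot of existing volumes can both settle on the same "free" disk name.

#!/usr/bin/perl
use strict;
use warnings;

# Toy illustration (made-up allocation logic, not the real code): two
# concurrent importers that scan the same snapshot of existing volumes
# will both pick the same "free" disk name.
my @existing = ('vm-100-disk-0', 'vm-100-disk-1');

sub next_free_name {
    my ($vmid, $existing) = @_;
    my %used = map { $_ => 1 } @$existing;
    for (my $i = 0; ; $i++) {
        my $name = "vm-$vmid-disk-$i";
        return $name if !$used{$name};
    }
}

# both "workers" see the same state of the storage ...
my $a = next_free_name(100, \@existing);
my $b = next_free_name(100, \@existing);

# ... and collide on the same name
print "worker A picks $a, worker B picks $b\n";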

The README in the pve-cluster repo was helpful but a bit limited in scope.
Could anyone shed some more light on this for me?

Thanks in advance,
Josh

