public inbox for pve-devel@lists.proxmox.com
* Re: [pve-devel] orphaned cfs lock when interrupting qm disk import
From: Thomas Lamprecht
To: Joshua Huber, PVE Devel
Date: 2024-09-06 17:38 UTC
In-Reply-To: <CA+uqL00=49RuhJjjsJZfwZURWYBqf_TdkqCGtB6PM1055pO0yQ@mail.gmail.com>

Hi,

On 26/08/2024 at 22:46, Joshua Huber wrote:
> Say you've just kicked off a long-running "qm disk import ..." command and
> notice that an incorrect flag was specified. Ok, cancel the operation with
> control-C, fix the flag and re-execute the import command...
> 
> However, when using shared storage you'll (at least for the next 120
> seconds) bump up against an orphaned cfs lock directory. One could manually
> remove the directory from /etc/pve/priv/locks, but it seems a little risky.
> (and a bad habit to get into)
> 
> So this got me thinking... could we trap SIGINT and more gracefully fail
> the operation? It seems like this would allow the cfs_lock function to
> clean up after itself. This seems like it'd be a nice CLI QOL improvement.

Yeah, that sounds sensible, albeit I'd have to look more closely into the
code, because it might not be that trivial if we execute another command,
e.g. qemu-img, that then controls the terminal and would receive the
SIGINT directly. There are options for that too, but they are not so nice.
Anyhow, as long as this all happens in a worker it probably should not
be an issue and one could just install a handler with
`$SIG{INT} = sub { ... cleanup ...; die "interrupted" };`
and be done.
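
For illustration, a rough sketch of that idea (just a sketch, the helper
name and its placement are made up here, not the actual qm code path):

#!/usr/bin/perl
use strict;
use warnings;

# Rough sketch (not the actual qm code path): scope a SIGINT handler to
# the long-running part so an interrupt becomes a normal Perl exception
# instead of killing the process before any cleanup could run.
sub run_interruptible {
    my ($code) = @_;

    local $SIG{INT} = sub {
        die "interrupted\n"; # unwinds through the eval below, so cleanup still runs
    };

    return $code->();
}

eval {
    run_interruptible(sub {
        print "importing disk...\n"; # stand-in for the real import work
        sleep 30; # press Ctrl-C here to trigger the handler
    });
};
if (my $err = $@) {
    # ... cleanup would go here (release the cfs lock, remove partial volumes, ...)
    print "aborted: $err";
}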

> However, I'm not familiar with the history of the cfs-lock mechanism, why
> it's used for shared storage backends, and what other invalid PVE states
> might be avoided as a side-effect of serializing storage operations.
> Allowing concurrent operations could result in disk numbering collisions,
> but I'm not sure what else. (apart from storage-specific limitations.)

In general the shared lock is there to avoid concurrent access to the same
volume, but it's currently much coarser than it needs to be, i.e., it could be
on just the volume, or at least per VMID for volumes that are owned by guests.
But that's a rather different topic.

Anyhow, once interrupted, keeping the lock active won't protect us from
anything, especially as it will become free after 120s anyway, as you noticed
yourself, so actively removing it immediately should not cause any (more)
problems, from what I can tell.
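
To make that concrete, here is a self-contained sketch of the general
pattern (a plain mkdir/rmdir stand-in, not the real cfs lock code):
without such a handler, the interrupt kills the process between taking
and releasing the lock, which is exactly the orphaned directory one then
has to wait out.

#!/usr/bin/perl
use strict;
use warnings;

# Stand-in for a cfs-style lock: a directory created before the guarded
# operation and removed afterwards. The real lock additionally expires
# on its own after roughly 120 seconds.
my $lockdir = "/tmp/demo-import-lock";

sub with_lock {
    my ($code) = @_;

    mkdir($lockdir) or die "can't acquire lock '$lockdir': $!\n";

    # turn Ctrl-C into a die() so the rmdir below still executes; without
    # this the process dies mid-callback and the directory is orphaned
    local $SIG{INT} = sub { die "interrupted\n" };

    my $res = eval { $code->() };
    my $err = $@;

    rmdir($lockdir); # always release, on success and on error

    die $err if $err;
    return $res;
}

with_lock(sub {
    print "long-running import...\n";
    sleep 30; # Ctrl-C here: the lock directory is still removed
    print "done\n";
});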

Are you willing to look into this? Otherwise, a bugzilla entry would be fine
too.

cheers
 Thomas



* [pve-devel] orphaned cfs lock when interrupting qm disk import
From: Joshua Huber <jhuber@blockbridge.com>
To: PVE Devel <pve-devel@lists.proxmox.com>
Date: Mon, 26 Aug 2024 16:46:10 -0400
Message-ID: <CA+uqL00=49RuhJjjsJZfwZURWYBqf_TdkqCGtB6PM1055pO0yQ@mail.gmail.com>

Hi everyone,

Say you've just kicked off a long-running "qm disk import ..." command and
notice that an incorrect flag was specified. Ok, cancel the operation with
control-C, fix the flag and re-execute the import command...

However, when using shared storage you'll (at least for the next 120
seconds) bump up against an orphaned cfs lock directory. One could manually
remove the directory from /etc/pve/priv/locks, but it seems a little risky.
(and a bad habit to get into)

So this got me thinking... could we trap SIGINT and more gracefully fail
the operation? It seems like this would allow the cfs_lock function to
clean up after itself. This seems like it'd be a nice CLI QOL improvement.

However, I'm not familiar with the history of the cfs-lock mechanism, why
it's used for shared storage backends, and what other invalid PVE states
might be avoided as a side-effect of serializing storage operations.
Allowing concurrent operations could result in disk numbering collisions,
but I'm not sure what else. (apart from storage-specific limitations.)
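
As a toy illustration of the collision I have in mind (made-up allocation
logic, not the real code): two unsynchronized workers scanning the same
snapshot of existing volumes can both settle on the same "free" disk name.

#!/usr/bin/perl
use strict;
use warnings;

# Toy illustration (made-up allocation logic, not the real code): two
# concurrent importers that scan the same snapshot of existing volumes
# will both pick the same "free" disk name.
my @existing = ('vm-100-disk-0', 'vm-100-disk-1');

sub next_free_name {
    my ($vmid, $existing) = @_;
    my %used = map { $_ => 1 } @$existing;
    for (my $i = 0; ; $i++) {
        my $name = "vm-$vmid-disk-$i";
        return $name if !$used{$name};
    }
}

# both "workers" see the same state of the storage ...
my $a = next_free_name(100, \@existing);
my $b = next_free_name(100, \@existing);

# ... and collide on the same name
print "worker A picks $a, worker B picks $b\n";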

The README in the pve-cluster repo was helpful but a bit limited in scope.
Could anyone shed some more light on this for me?

Thanks in advance,
Josh

