From: Maximilien Cuony <maximilien.cuony@arcanite.ch>
To: pve-devel@lists.proxmox.com
Subject: [pve-devel] [PATCH pve-kernel] Backport rbd patches to fix lock issues
Date: Mon, 6 Sep 2021 11:22:03 +0200	[thread overview]
Message-ID: <743b37f2-3b84-edac-7027-e07c9b3efe6e@arcanite.ch> (raw)

Hello,

We had a strange locking issue, described in 
https://tracker.ceph.com/issues/42757

We rebuilt the pve-kernel with the two patches referenced there:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=8798d070d416d18a75770fc19787e96705073f43
and
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=ed9eb71085ecb7ded9a5118cec2ab70667cc7350


This fixed our issue, and we no longer saw virtual machines getting
randomly stuck.

We think it would be useful to apply these patches for everybody, to
avoid locking issues that can happen under bad network conditions.
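For context, the second commit fixes a reader/writer semaphore deadlock:
rbd_quiesce_lock() sits on lock_rwsem for read while waiting for
in-flight I/O, a watch error (e.g. from a network hiccup) queues a
writer, and because rw_semaphores admit no new readers once a writer is
waiting, the I/O completions that would end the wait can never take the
lock again. Below is a minimal userspace sketch of that pattern -- our
own illustration with pthreads, not kernel code; glibc's
writer-preferring rwlock stands in for the kernel rw_semaphore, and all
names are ours. The program deadlocks by design:

/* deadlock-sketch.c -- illustration only, not kernel code.
 * Build: gcc -pthread deadlock-sketch.c
 * It hangs forever; that hang is the bug being demonstrated. */
#define _GNU_SOURCE
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_rwlock_t lock_rwsem;               /* rbd_dev->lock_rwsem */
static pthread_mutex_t mtx = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t releasing_wait = PTHREAD_COND_INITIALIZER;
static int running = 1;                  /* stands in for running_list */

static void *watch_errcb(void *arg)      /* the watch error handler */
{
    (void)arg;
    sleep(1);                  /* arrive while quiesce holds the read lock */
    pthread_rwlock_wrlock(&lock_rwsem);  /* queues a writer -> blocks ... */
    pthread_rwlock_unlock(&lock_rwsem);  /* ... and is never granted */
    return NULL;
}

static void *io_completion(void *arg)    /* an in-flight image request */
{
    (void)arg;
    sleep(2);                  /* arrive after the writer is queued */
    pthread_rwlock_rdlock(&lock_rwsem);  /* DEADLOCK: new readers blocked */
    pthread_mutex_lock(&mtx);
    running = 0;                         /* would empty running_list ... */
    pthread_cond_signal(&releasing_wait);/* ... and wake the quiesce path */
    pthread_mutex_unlock(&mtx);
    pthread_rwlock_unlock(&lock_rwsem);
    return NULL;
}

int main(void)                           /* plays rbd_quiesce_lock() */
{
    pthread_rwlockattr_t attr;
    pthread_t t1, t2;

    /* Writer-preferring rwlock: like a kernel rw_semaphore, rdlock()
     * blocks as soon as a writer is waiting, even for new readers. */
    pthread_rwlockattr_init(&attr);
    pthread_rwlockattr_setkind_np(&attr,
            PTHREAD_RWLOCK_PREFER_WRITER_NONRECURSIVE_NP);
    pthread_rwlock_init(&lock_rwsem, &attr);

    pthread_create(&t1, NULL, watch_errcb, NULL);
    pthread_create(&t2, NULL, io_completion, NULL);

    pthread_rwlock_rdlock(&lock_rwsem);  /* downgraded lock, held for read */
    pthread_mutex_lock(&mtx);
    while (running)                      /* waits for I/O that can't finish */
        pthread_cond_wait(&releasing_wait, &mtx);
    pthread_mutex_unlock(&mtx);
    pthread_rwlock_unlock(&lock_rwsem);

    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    puts("not reached");
    return 0;
}

The second patch avoids exactly this by dropping lock_rwsem before
blocking on releasing_wait.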

Below is a patch for pve-kernel that adds the two backported commits.

I hope this is the correct way to submit the change :)

Have a nice day,

---
  patches/kernel/0010-rbd-always-kick.patch | 76 +++++++++++++++++++++
  patches/kernel/0011-rbd-dont-hold.patch   | 80 +++++++++++++++++++++++
  2 files changed, 156 insertions(+)
  create mode 100644 patches/kernel/0010-rbd-always-kick.patch
  create mode 100644 patches/kernel/0011-rbd-dont-hold.patch

diff --git a/patches/kernel/0010-rbd-always-kick.patch b/patches/kernel/0010-rbd-always-kick.patch
new file mode 100644
index 0000000..798bc70
--- /dev/null
+++ b/patches/kernel/0010-rbd-always-kick.patch
@@ -0,0 +1,76 @@
+From ba378b796088f5660050d2f2d1e0dcf555a080e6 Mon Sep 17 00:00:00 2001
+From: Ilya Dryomov <idryomov@gmail.com>
+Date: Sat, 3 Jul 2021 11:56:55 +0200
+Subject: rbd: always kick acquire on "acquired" and "released" notifications
+
+commit 8798d070d416d18a75770fc19787e96705073f43 upstream.
+
+Skipping the "lock has been released" notification if the lock owner
+is not what we expect based on owner_cid can lead to I/O hangs.
+One example is our own notifications: because owner_cid is cleared
+in rbd_unlock(), when we get our own notification it is processed as
+unexpected/duplicate and maybe_kick_acquire() isn't called.  If a peer
+that requested the lock then doesn't go through with acquiring it,
+I/O requests that came in while the lock was being quiesced would
+be stalled until another I/O request is submitted and kicks acquire
+from rbd_img_exclusive_lock().
+
+This makes the comment in rbd_release_lock() actually true: prior to
+this change the canceled work was being requeued in response to the
+"lock has been acquired" notification from rbd_handle_acquired_lock().
+
+Cc: stable@vger.kernel.org # 5.3+
+Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
+Tested-by: Robin Geuze <robin.geuze@nl.team.blue>
+Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
+---
+ drivers/block/rbd.c | 20 +++++++-------------
+ 1 file changed, 7 insertions(+), 13 deletions(-)
+
+(limited to 'drivers/block/rbd.c')
+
+diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
+index ca509a7a164ab..55745d6953f02 100644
+--- a/drivers/block/rbd.c
++++ b/drivers/block/rbd.c
+@@ -4340,15 +4340,11 @@ static void rbd_handle_acquired_lock(struct rbd_device *rbd_dev, u8 struct_v,
+       if (!rbd_cid_equal(&cid, &rbd_empty_cid)) {
+               down_write(&rbd_dev->lock_rwsem);
+               if (rbd_cid_equal(&cid, &rbd_dev->owner_cid)) {
+-                      /*
+-                       * we already know that the remote client is
+-                       * the owner
+-                       */
+-                      up_write(&rbd_dev->lock_rwsem);
+-                      return;
++                      dout("%s rbd_dev %p cid %llu-%llu == owner_cid\n",
++                           __func__, rbd_dev, cid.gid, cid.handle);
++              } else {
++                      rbd_set_owner_cid(rbd_dev, &cid);
+               }
+-
+-              rbd_set_owner_cid(rbd_dev, &cid);
+               downgrade_write(&rbd_dev->lock_rwsem);
+       } else {
+               down_read(&rbd_dev->lock_rwsem);
+@@ -4373,14 +4369,12 @@ static void rbd_handle_released_lock(struct rbd_device *rbd_dev, u8 struct_v,
+       if (!rbd_cid_equal(&cid, &rbd_empty_cid)) {
+               down_write(&rbd_dev->lock_rwsem);
+               if (!rbd_cid_equal(&cid, &rbd_dev->owner_cid)) {
+-                      dout("%s rbd_dev %p unexpected owner, cid %llu-%llu != owner_cid %llu-%llu\n",
++                      dout("%s rbd_dev %p cid %llu-%llu != owner_cid %llu-%llu\n",
+                            __func__, rbd_dev, cid.gid, cid.handle,
+                            rbd_dev->owner_cid.gid, rbd_dev->owner_cid.handle);
+-                      up_write(&rbd_dev->lock_rwsem);
+-                      return;
++              } else {
++                      rbd_set_owner_cid(rbd_dev, &rbd_empty_cid);
+               }
+-
+-              rbd_set_owner_cid(rbd_dev, &rbd_empty_cid);
+               downgrade_write(&rbd_dev->lock_rwsem);
+       } else {
+               down_read(&rbd_dev->lock_rwsem);
+--
+cgit 1.2.3-1.el7
+
diff --git a/patches/kernel/0011-rbd-dont-hold.patch b/patches/kernel/0011-rbd-dont-hold.patch
new file mode 100644
index 0000000..5421ef8
--- /dev/null
+++ b/patches/kernel/0011-rbd-dont-hold.patch
@@ -0,0 +1,80 @@
+From 13066d6628f04194fea7c2b6a497ebc9d1d0df5f Mon Sep 17 00:00:00 2001
+From: Ilya Dryomov <idryomov@gmail.com>
+Date: Sat, 3 Jul 2021 11:31:26 +0200
+Subject: rbd: don't hold lock_rwsem while running_list is being drained
+
+commit ed9eb71085ecb7ded9a5118cec2ab70667cc7350 upstream.
+
+Currently rbd_quiesce_lock() holds lock_rwsem for read while blocking
+on releasing_wait completion.  On the I/O completion side, each image
+request also needs to take lock_rwsem for read.  Because rw_semaphore
+implementation doesn't allow new readers after a writer has indicated
+interest in the lock, this can result in a deadlock if something that
+needs to take lock_rwsem for write gets involved.  For example:
+
+1. watch error occurs
+2. rbd_watch_errcb() takes lock_rwsem for write, clears owner_cid and
+   releases lock_rwsem
+3. after reestablishing the watch, rbd_reregister_watch() takes
+   lock_rwsem for write and calls rbd_reacquire_lock()
+4. rbd_quiesce_lock() downgrades lock_rwsem to for read and blocks on
+   releasing_wait until running_list becomes empty
+5. another watch error occurs
+6. rbd_watch_errcb() blocks trying to take lock_rwsem for write
+7. no in-flight image request can complete and delete itself from
+   running_list because lock_rwsem won't be granted anymore
+
+A similar scenario can occur with "lock has been acquired" and "lock
+has been released" notification handlers which also take lock_rwsem for
+write to update owner_cid.
+
+We don't actually get anything useful from sitting on lock_rwsem in
+rbd_quiesce_lock() -- owner_cid updates certainly don't need to be
+synchronized with.  In fact the whole owner_cid tracking logic could
+probably be removed from the kernel client because we don't support
+proxied maintenance operations.
+
+Cc: stable@vger.kernel.org # 5.3+
+URL: https://tracker.ceph.com/issues/42757
+Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
+Tested-by: Robin Geuze <robin.geuze@nl.team.blue>
+Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
+---
+ drivers/block/rbd.c | 12 +++++-------
+ 1 file changed, 5 insertions(+), 7 deletions(-)
+
+(limited to 'drivers/block/rbd.c')
+
+diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
+index bf2f0373a3b2f..ca509a7a164ab 100644
+--- a/drivers/block/rbd.c
++++ b/drivers/block/rbd.c
+@@ -4239,8 +4239,6 @@ again:
+
+ static bool rbd_quiesce_lock(struct rbd_device *rbd_dev)
+ {
+-      bool need_wait;
+-
+       dout("%s rbd_dev %p\n", __func__, rbd_dev);
+       lockdep_assert_held_write(&rbd_dev->lock_rwsem);
+
+@@ -4252,11 +4250,11 @@ static bool rbd_quiesce_lock(struct rbd_device *rbd_dev)
+        */
+       rbd_dev->lock_state = RBD_LOCK_STATE_RELEASING;
+       rbd_assert(!completion_done(&rbd_dev->releasing_wait));
+-      need_wait = !list_empty(&rbd_dev->running_list);
+-      downgrade_write(&rbd_dev->lock_rwsem);
+-      if (need_wait)
+-              wait_for_completion(&rbd_dev->releasing_wait);
+-      up_read(&rbd_dev->lock_rwsem);
++      if (list_empty(&rbd_dev->running_list))
++              return true;
++
++      up_write(&rbd_dev->lock_rwsem);
++      wait_for_completion(&rbd_dev->releasing_wait);
+
+       down_write(&rbd_dev->lock_rwsem);
+       if (rbd_dev->lock_state != RBD_LOCK_STATE_RELEASING)
+--
+cgit 1.2.3-1.el7
+
-- 
2.20.1



