From: Maximilien Cuony
Organization: Arcanite Solutions SARL
To: pve-devel@lists.proxmox.com
Date: Mon, 6 Sep 2021 11:22:03 +0200
Message-ID: <743b37f2-3b84-edac-7027-e07c9b3efe6e@arcanite.ch>
Subject: [pve-devel] [PATCH pve-kernel] Backport rbd patches to fix lock issues

Hello,

We hit a strange locking issue, described in https://tracker.ceph.com/issues/42757

We rebuilt pve-kernel with the two patches referenced there:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=8798d070d416d18a75770fc19787e96705073f43

and

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=ed9eb71085ecb7ded9a5118cec2ab70667cc7350

This fixed the issue for us: we no longer see virtual machines getting randomly stuck. We think it would be useful to apply these patches for everybody, to avoid the locking issues that can occur under bad network conditions.
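As background for the second backported commit below: the deadlock hinges on the rw_semaphore property that new readers are not admitted once a writer is waiting. The following is a minimal userspace sketch of that behaviour — an illustration only, not kernel or Proxmox code. It assumes glibc's writer-preferring rwlock kind (PTHREAD_RWLOCK_PREFER_WRITER_NONRECURSIVE_NP), which mimics the rw_semaphore semantics here; the thread roles are hypothetical stand-ins for the rbd code paths named in the comments.

/*
 * Sketch of the rw_semaphore deadlock fixed by the second patch below.
 * Assumes glibc's writer-preferring rwlock kind, which, like the
 * kernel's rw_semaphore, refuses new readers once a writer is queued.
 * Build with: gcc -pthread sketch.c
 */
#define _GNU_SOURCE
#include <errno.h>
#include <pthread.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

static pthread_rwlock_t lock;

/* Plays the role of rbd_watch_errcb(): wants the lock for write while
 * the quiesce path still holds it for read. */
static void *writer(void *arg)
{
    (void)arg;
    pthread_rwlock_wrlock(&lock);   /* blocks behind the reader */
    pthread_rwlock_unlock(&lock);
    return NULL;
}

/* Plays the role of an I/O completion: needs the lock for read, but a
 * writer is already queued, so it is not admitted. */
static void *late_reader(void *arg)
{
    struct timespec ts;

    (void)arg;
    clock_gettime(CLOCK_REALTIME, &ts);
    ts.tv_sec += 2;
    if (pthread_rwlock_timedrdlock(&lock, &ts) == ETIMEDOUT)
        printf("late reader blocked behind queued writer -> deadlock\n");
    else
        pthread_rwlock_unlock(&lock);
    return NULL;
}

int main(void)
{
    pthread_rwlockattr_t attr;
    pthread_t w, r;

    pthread_rwlockattr_init(&attr);
    /* Writer preference: once a writer waits, new readers wait too. */
    pthread_rwlockattr_setkind_np(&attr,
            PTHREAD_RWLOCK_PREFER_WRITER_NONRECURSIVE_NP);
    pthread_rwlock_init(&lock, &attr);

    pthread_rwlock_rdlock(&lock);   /* quiesce path: holds lock for read */
    pthread_create(&w, NULL, writer, NULL);
    sleep(1);                       /* let the writer queue up */
    pthread_create(&r, NULL, late_reader, NULL);
    pthread_join(r, NULL);          /* times out: the deadlock in miniature */

    pthread_rwlock_unlock(&lock);   /* what the fix does: drop the lock */
    pthread_join(w, NULL);
    return 0;
}

The second patch resolves this the same way the last step does: rbd_quiesce_lock() now drops lock_rwsem entirely before waiting for in-flight requests to drain, so completions (readers) and error handlers (writers) can make progress.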
This patch adds those two fixes to pve-kernel. I hope this is the correct way to submit the change :)

Have a nice day,

---
 patches/kernel/0010-rbd-always-kick.patch | 76 +++++++++++++++++++++
 patches/kernel/0011-rbd-dont-hold.patch   | 80 +++++++++++++++++++++++
 2 files changed, 156 insertions(+)
 create mode 100644 patches/kernel/0010-rbd-always-kick.patch
 create mode 100644 patches/kernel/0011-rbd-dont-hold.patch

diff --git a/patches/kernel/0010-rbd-always-kick.patch b/patches/kernel/0010-rbd-always-kick.patch
new file mode 100644
index 0000000..798bc70
--- /dev/null
+++ b/patches/kernel/0010-rbd-always-kick.patch
@@ -0,0 +1,76 @@
+From ba378b796088f5660050d2f2d1e0dcf555a080e6 Mon Sep 17 00:00:00 2001
+From: Ilya Dryomov
+Date: Sat, 3 Jul 2021 11:56:55 +0200
+Subject: rbd: always kick acquire on "acquired" and "released" notifications
+
+commit 8798d070d416d18a75770fc19787e96705073f43 upstream.
+
+Skipping the "lock has been released" notification if the lock owner
+is not what we expect based on owner_cid can lead to I/O hangs.
+One example is our own notifications: because owner_cid is cleared
+in rbd_unlock(), when we get our own notification it is processed as
+unexpected/duplicate and maybe_kick_acquire() isn't called.  If a peer
+that requested the lock then doesn't go through with acquiring it,
+I/O requests that came in while the lock was being quiesced would
+be stalled until another I/O request is submitted and kicks acquire
+from rbd_img_exclusive_lock().
+
+This makes the comment in rbd_release_lock() actually true: prior to
+this change the canceled work was being requeued in response to the
+"lock has been acquired" notification from rbd_handle_acquired_lock().
+
+Cc: stable@vger.kernel.org # 5.3+
+Signed-off-by: Ilya Dryomov
+Tested-by: Robin Geuze
+Signed-off-by: Greg Kroah-Hartman
+---
+ drivers/block/rbd.c | 20 +++++++-------------
+ 1 file changed, 7 insertions(+), 13 deletions(-)
+
+(limited to 'drivers/block/rbd.c')
+
+diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
+index ca509a7a164ab..55745d6953f02 100644
+--- a/drivers/block/rbd.c
++++ b/drivers/block/rbd.c
+@@ -4340,15 +4340,11 @@ static void rbd_handle_acquired_lock(struct rbd_device *rbd_dev, u8 struct_v,
+       if (!rbd_cid_equal(&cid, &rbd_empty_cid)) {
+               down_write(&rbd_dev->lock_rwsem);
+               if (rbd_cid_equal(&cid, &rbd_dev->owner_cid)) {
+-                      /*
+-                       * we already know that the remote client is
+-                       * the owner
+-                       */
+-                      up_write(&rbd_dev->lock_rwsem);
+-                      return;
++                      dout("%s rbd_dev %p cid %llu-%llu == owner_cid\n",
++                           __func__, rbd_dev, cid.gid, cid.handle);
++              } else {
++                      rbd_set_owner_cid(rbd_dev, &cid);
+               }
+-
+-              rbd_set_owner_cid(rbd_dev, &cid);
+               downgrade_write(&rbd_dev->lock_rwsem);
+       } else {
+               down_read(&rbd_dev->lock_rwsem);
+@@ -4373,14 +4369,12 @@ static void rbd_handle_released_lock(struct rbd_device *rbd_dev, u8 struct_v,
+       if (!rbd_cid_equal(&cid, &rbd_empty_cid)) {
+               down_write(&rbd_dev->lock_rwsem);
+               if (!rbd_cid_equal(&cid, &rbd_dev->owner_cid)) {
+-                      dout("%s rbd_dev %p unexpected owner, cid %llu-%llu != owner_cid %llu-%llu\n",
++                      dout("%s rbd_dev %p cid %llu-%llu != owner_cid %llu-%llu\n",
+                            __func__, rbd_dev, cid.gid, cid.handle,
+                            rbd_dev->owner_cid.gid, rbd_dev->owner_cid.handle);
+-                      up_write(&rbd_dev->lock_rwsem);
+-                      return;
++              } else {
++                      rbd_set_owner_cid(rbd_dev, &rbd_empty_cid);
+               }
+-
+-              rbd_set_owner_cid(rbd_dev, &rbd_empty_cid);
+               downgrade_write(&rbd_dev->lock_rwsem);
+       } else {
+               down_read(&rbd_dev->lock_rwsem);
+--
+cgit 1.2.3-1.el7
+
diff --git a/patches/kernel/0011-rbd-dont-hold.patch b/patches/kernel/0011-rbd-dont-hold.patch
new file mode 100644
index 0000000..5421ef8
--- /dev/null
+++ b/patches/kernel/0011-rbd-dont-hold.patch
@@ -0,0 +1,80 @@
+From 13066d6628f04194fea7c2b6a497ebc9d1d0df5f Mon Sep 17 00:00:00 2001
+From: Ilya Dryomov
+Date: Sat, 3 Jul 2021 11:31:26 +0200
+Subject: rbd: don't hold lock_rwsem while running_list is being drained
+
+commit ed9eb71085ecb7ded9a5118cec2ab70667cc7350 upstream.
+
+Currently rbd_quiesce_lock() holds lock_rwsem for read while blocking
+on releasing_wait completion.  On the I/O completion side, each image
+request also needs to take lock_rwsem for read.  Because rw_semaphore
+implementation doesn't allow new readers after a writer has indicated
+interest in the lock, this can result in a deadlock if something that
+needs to take lock_rwsem for write gets involved.  For example:
+
+1. watch error occurs
+2. rbd_watch_errcb() takes lock_rwsem for write, clears owner_cid and
+   releases lock_rwsem
+3. after reestablishing the watch, rbd_reregister_watch() takes
+   lock_rwsem for write and calls rbd_reacquire_lock()
+4. rbd_quiesce_lock() downgrades lock_rwsem to for read and blocks on
+   releasing_wait until running_list becomes empty
+5. another watch error occurs
+6. rbd_watch_errcb() blocks trying to take lock_rwsem for write
+7. no in-flight image request can complete and delete itself from
+   running_list because lock_rwsem won't be granted anymore
+
+A similar scenario can occur with "lock has been acquired" and "lock
+has been released" notification handlers which also take lock_rwsem for
+write to update owner_cid.
+
+We don't actually get anything useful from sitting on lock_rwsem in
+rbd_quiesce_lock() -- owner_cid updates certainly don't need to be
+synchronized with.  In fact the whole owner_cid tracking logic could
+probably be removed from the kernel client because we don't support
+proxied maintenance operations.
+
+Cc: stable@vger.kernel.org # 5.3+
+URL: https://tracker.ceph.com/issues/42757
+Signed-off-by: Ilya Dryomov
+Tested-by: Robin Geuze
+Signed-off-by: Greg Kroah-Hartman
+---
+ drivers/block/rbd.c | 12 +++++-------
+ 1 file changed, 5 insertions(+), 7 deletions(-)
+
+(limited to 'drivers/block/rbd.c')
+
+diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
+index bf2f0373a3b2f..ca509a7a164ab 100644
+--- a/drivers/block/rbd.c
++++ b/drivers/block/rbd.c
+@@ -4239,8 +4239,6 @@ again:
+
+ static bool rbd_quiesce_lock(struct rbd_device *rbd_dev)
+ {
+-      bool need_wait;
+-
+       dout("%s rbd_dev %p\n", __func__, rbd_dev);
+       lockdep_assert_held_write(&rbd_dev->lock_rwsem);
+
+@@ -4252,11 +4250,11 @@ static bool rbd_quiesce_lock(struct rbd_device *rbd_dev)
+        */
+       rbd_dev->lock_state = RBD_LOCK_STATE_RELEASING;
+       rbd_assert(!completion_done(&rbd_dev->releasing_wait));
+-      need_wait = !list_empty(&rbd_dev->running_list);
+-      downgrade_write(&rbd_dev->lock_rwsem);
+-      if (need_wait)
+-              wait_for_completion(&rbd_dev->releasing_wait);
+-      up_read(&rbd_dev->lock_rwsem);
++      if (list_empty(&rbd_dev->running_list))
++              return true;
++
++      up_write(&rbd_dev->lock_rwsem);
++      wait_for_completion(&rbd_dev->releasing_wait);
+
+       down_write(&rbd_dev->lock_rwsem);
+       if (rbd_dev->lock_state != RBD_LOCK_STATE_RELEASING)
+--
+cgit 1.2.3-1.el7
+
--
2.20.1