public inbox for pve-user@lists.proxmox.com
From: Stefan Hanreich <s.hanreich@proxmox.com>
To: Proxmox VE user list <pve-user@lists.proxmox.com>,
	Benjamin Hofer <benjamin@gridscale.io>
Subject: Re: [PVE-User] Proxmox HCI Ceph: "osd_max_backfills" is overridden and set to 1000
Date: Tue, 30 May 2023 14:14:05 +0200	[thread overview]
Message-ID: <a8554390-7f0f-94c1-e0a2-1a3088bf8d4a@proxmox.com> (raw)
In-Reply-To: <CAD=jCXOpYPu2Hz+krGssLBkMvtMdiWyXSbH=+VwX76PTXyF_9A@mail.gmail.com>

Hi Benjamin

This behavior was introduced in Ceph together with the new mClock
scheduler [1]. When the mClock scheduler is in use, osd_max_backfills,
among other options, gets internally overridden (to 1000).
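
You can verify this directly on one of your nodes. For example (using
osd.1 from your output below, any OSD ID works), check which scheduler
is active and where the value comes from:

  # ceph config get osd.1 osd_op_queue
  # ceph config show osd.1 | grep osd_max_backfills

If the first command prints "mclock_scheduler" and the second shows
1000 with source "override", you are seeing exactly this behavior.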

This is very likely what is causing the issues in your cluster during
rebalancing. With the mClock scheduler, the parameters for tuning
rebalancing have changed; our wiki describes the new parameters and how
to use them [2].
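
Just as an illustration of what that tuning looks like (please consult
the wiki for the options available in your exact Ceph release before
changing anything), it mostly boils down to picking one of the built-in
mClock profiles, e.g.:

  # ceph config set osd osd_mclock_profile high_client_ops

to favor client I/O over recovery/backfill traffic, or "balanced" /
"high_recovery_ops" for the other trade-offs.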

This should be fixed in the newer Ceph version 17.2.6 [3] [4], which is
already available via our repositories (no-subscription as well as
enterprise) and overrides osd_max_backfills with a more reasonable
value. Nevertheless, you should still take a look at the new mClock
tuning options.
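
To check what you are currently running and pick up the fixed packages,
something along these lines should work on each node (just a sketch,
adapt it to your usual update procedure):

  # ceph versions
  # apt update && apt full-upgrade
  # systemctl restart ceph-osd@<id>.service    # one OSD at a time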

Kind Regards
Stefan

[1] https://github.com/ceph/ceph/pull/38920
[2] https://pve.proxmox.com/wiki/Ceph_mclock_tuning
[3] https://github.com/ceph/ceph/pull/48226/files
[4] https://github.com/ceph/ceph/commit/89e48395f8b1329066a1d7e05a4e9e083c88c1a6

On 5/30/23 12:00, Benjamin Hofer wrote:
> Dear community,
> 
> We've set up a Proxmox hyper-converged Ceph cluster in production.
> After syncing in one new OSD using the "pveceph osd create" command,
> we got massive network performance issues and outages. We then found
> that "osd_max_backfills" is set to 1000 (the Ceph default is 1) and
> that this (along with some other values) has been overridden.
> 
> Does anyone know the root cause? I can't imagine that this is the
> Proxmox default behaviour, and I'm very sure that we didn't change
> anything (actually, I didn't even know about this value before
> researching and talking to colleagues with deeper Ceph knowledge).
> 
> System:
> 
> PVE version output: pve-manager/7.3-6/723bb6ec (running kernel: 5.15.102-1-pve)
> ceph version 17.2.5 (e04241aa9b639588fa6c864845287d2824cb6b55) quincy (stable)
> 
> # ceph config get osd.1
> WHO    MASK  LEVEL  OPTION                            VALUE         RO
> osd.1        basic  osd_mclock_max_capacity_iops_ssd  17080.220753
> 
> # ceph config show osd.1
> NAME                                             VALUE               SOURCE
> auth_client_required                             cephx               file
> auth_cluster_required                            cephx               file
> auth_service_required                            cephx               file
> cluster_network                                  10.0.18.0/24        file
> daemonize                                        false               override
> keyring                                          $osd_data/keyring   default
> leveldb_log                                                          default
> mon_allow_pool_delete                            true                file
> mon_host                                         10.0.18.30 10.0.18.10 10.0.18.20  file
> ms_bind_ipv4                                     true                file
> ms_bind_ipv6                                     false               file
> no_config_file                                   false               override
> osd_delete_sleep                                 0.000000            override
> osd_delete_sleep_hdd                             0.000000            override
> osd_delete_sleep_hybrid                          0.000000            override
> osd_delete_sleep_ssd                             0.000000            override
> osd_max_backfills                                1000                override
> osd_mclock_max_capacity_iops_ssd                 17080.220753        mon
> osd_mclock_scheduler_background_best_effort_lim  999999              default
> osd_mclock_scheduler_background_best_effort_res  534                 default
> osd_mclock_scheduler_background_best_effort_wgt  2                   default
> osd_mclock_scheduler_background_recovery_lim     2135                default
> osd_mclock_scheduler_background_recovery_res     534                 default
> osd_mclock_scheduler_background_recovery_wgt     1                   default
> osd_mclock_scheduler_client_lim                  999999              default
> osd_mclock_scheduler_client_res                  1068                default
> osd_mclock_scheduler_client_wgt                  2                   default
> osd_pool_default_min_size                        2                   file
> osd_pool_default_size                            3                   file
> osd_recovery_max_active                          1000                override
> osd_recovery_max_active_hdd                      1000                override
> osd_recovery_max_active_ssd                      1000                override
> osd_recovery_sleep                               0.000000            override
> osd_recovery_sleep_hdd                           0.000000            override
> osd_recovery_sleep_hybrid                        0.000000            override
> osd_recovery_sleep_ssd                           0.000000            override
> osd_scrub_sleep                                  0.000000            override
> osd_snap_trim_sleep                              0.000000            override
> osd_snap_trim_sleep_hdd                          0.000000            override
> osd_snap_trim_sleep_hybrid                       0.000000            override
> osd_snap_trim_sleep_ssd                          0.000000            override
> public_network                                   10.0.18.0/24        file
> rbd_default_features                             61                  default
> rbd_qos_exclude_ops                              0                   default
> setgroup                                         ceph                cmdline
> setuser                                          ceph                cmdline
> 
> Thanks a lot in advance.
> 
> Best
> Benjamin
> 