public inbox for pbs-devel@lists.proxmox.com
* [pbs-devel] Scheduler causing connectivity issues?
@ 2022-07-07 15:49 Mark Schouten
  2022-07-08  7:09 ` Thomas Lamprecht
  0 siblings, 1 reply; 23+ messages in thread
From: Mark Schouten @ 2022-07-07 15:49 UTC
  To: pbs-devel

Hi,

We’re getting complaints that one of our PBS instances is periodically
unreachable. After investigating whether the network might be at fault
(even though it’s handling about 5.5 Gbit at night), we found that PBS is
piling up waiting connections every minute, on the minute, as you can
see below. The output combines `date` with `ss -np | grep -c 8007`,
i.e. the number of active connections.
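Note that `grep -c 8007` counts every line that contains those four digits anywhere, so a PID, an inode number or a client port such as 48007 would also match. A slightly stricter count, sketched in Python under the assumption that `ss` prints addresses as whitespace-separated `host:port` fields, only counts sockets whose local or peer address ends in `:8007` (the helper names are illustrative, not part of any PBS tooling):

```python
import subprocess

PORT = 8007  # PBS API/proxy port, as in the measurement above

def grep_style_count(ss_output: str, port: int = PORT) -> int:
    """Mimic `grep -c 8007`: count lines containing the digits anywhere."""
    return sum(1 for line in ss_output.splitlines() if str(port) in line)

def port_count(ss_output: str, port: int = PORT) -> int:
    """Count sockets with a field ending in ':<port>' (local or peer address)."""
    suffix = f":{port}"
    return sum(
        1
        for line in ss_output.splitlines()
        if any(field.endswith(suffix) for field in line.split())
    )

def poll_once(port: int = PORT) -> int:
    """Run `ss -np` once and return the stricter count (requires iproute2's ss)."""
    out = subprocess.run(["ss", "-np"], capture_output=True, text=True, check=True).stdout
    return port_count(out, port)
```

Calling `poll_once()` in a loop next to a timestamp reproduces the kind of log shown below; the stricter match only rules out false positives, it does not change what is being measured.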

At first I thought that pvestatd was DDoSing PBS, but pvestatd seems to
run more often than once a minute.

Stracing the API process, I found that it is also just waiting for
something; presumably the proxy process.

Grepping for ‘minute’ in the code, I stumbled upon the function
`next_minute` in ./src/bin/proxmox-backup-proxy.rs. I’m not quite sure
I understand it correctly, but it seems that every minute, the
scheduler checks whether it should be doing something.
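Assuming `next_minute` does what its name suggests, a minute-aligned scheduler computes the next hh:mm:00 boundary, sleeps until then and checks all jobs at once, which would line up with the spikes at :00 in the table below. A Python sketch of that idea only (the real code is Rust and may differ; `check_jobs` is a hypothetical callback):

```python
import time
from datetime import datetime, timedelta

def next_minute(now: datetime) -> datetime:
    """Start of the next full minute strictly after `now` (hh:mm:00.000000)."""
    return now.replace(second=0, microsecond=0) + timedelta(minutes=1)

def seconds_until_next_minute(now: datetime) -> float:
    """How long a minute-aligned loop would sleep before its next check."""
    return (next_minute(now) - now).total_seconds()

def scheduler_loop(check_jobs, iterations: int) -> None:
    # Every iteration wakes at hh:mm:00, so all job checks fire together.
    for _ in range(iterations):
        time.sleep(seconds_until_next_minute(datetime.now()))
        check_jobs()
```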

Drilling down on that with strace, I see quite a lot of read/write/rename
operations on jobstate files, which leads me to conclude that the proxy
process is waiting for the scheduler.

This is just guesswork, but you guys can surely find out better than me
what’s going on.

This PBS is running with 45 users and 67 datastores.

Hope you guys can find something. If I need to debug anything, let me
know!


============
Timestamp                        | `ss -np|grep -c 8007`
Thu 07 Jul 2022 05:38:00 PM CEST | 13
Thu 07 Jul 2022 05:38:00 PM CEST | 24
Thu 07 Jul 2022 05:38:01 PM CEST | 32
Thu 07 Jul 2022 05:38:01 PM CEST | 45
Thu 07 Jul 2022 05:38:02 PM CEST | 58
Thu 07 Jul 2022 05:38:02 PM CEST | 65
Thu 07 Jul 2022 05:38:03 PM CEST | 68
Thu 07 Jul 2022 05:38:03 PM CEST | 76
Thu 07 Jul 2022 05:38:04 PM CEST | 79
Thu 07 Jul 2022 05:38:05 PM CEST | 82
Thu 07 Jul 2022 05:38:05 PM CEST | 88
Thu 07 Jul 2022 05:38:06 PM CEST | 96
Thu 07 Jul 2022 05:38:06 PM CEST | 102
Thu 07 Jul 2022 05:38:07 PM CEST | 104
Thu 07 Jul 2022 05:38:07 PM CEST | 111
Thu 07 Jul 2022 05:38:08 PM CEST | 124
Thu 07 Jul 2022 05:38:08 PM CEST | 130
Thu 07 Jul 2022 05:38:09 PM CEST | 133
Thu 07 Jul 2022 05:38:09 PM CEST | 137
Thu 07 Jul 2022 05:38:10 PM CEST | 22
Thu 07 Jul 2022 05:38:11 PM CEST | 23
Thu 07 Jul 2022 05:39:00 PM CEST | 20
Thu 07 Jul 2022 05:39:01 PM CEST | 36
Thu 07 Jul 2022 05:39:01 PM CEST | 48
Thu 07 Jul 2022 05:39:02 PM CEST | 57
Thu 07 Jul 2022 05:39:02 PM CEST | 64
Thu 07 Jul 2022 05:39:03 PM CEST | 69
Thu 07 Jul 2022 05:39:03 PM CEST | 76
Thu 07 Jul 2022 05:39:04 PM CEST | 78
Thu 07 Jul 2022 05:39:04 PM CEST | 84
Thu 07 Jul 2022 05:39:05 PM CEST | 88
Thu 07 Jul 2022 05:39:06 PM CEST | 96
Thu 07 Jul 2022 05:39:06 PM CEST | 102
Thu 07 Jul 2022 05:39:07 PM CEST | 104
Thu 07 Jul 2022 05:39:07 PM CEST | 111
Thu 07 Jul 2022 05:39:08 PM CEST | 120
Thu 07 Jul 2022 05:39:08 PM CEST | 127
Thu 07 Jul 2022 05:39:09 PM CEST | 131
Thu 07 Jul 2022 05:39:09 PM CEST | 133
Thu 07 Jul 2022 05:39:10 PM CEST | 29
Thu 07 Jul 2022 05:39:10 PM CEST | 24
Thu 07 Jul 2022 05:40:00 PM CEST | 21
Thu 07 Jul 2022 05:40:01 PM CEST | 33
Thu 07 Jul 2022 05:40:01 PM CEST | 45
Thu 07 Jul 2022 05:40:02 PM CEST | 58
Thu 07 Jul 2022 05:40:02 PM CEST | 64
Thu 07 Jul 2022 05:40:03 PM CEST | 70
Thu 07 Jul 2022 05:40:03 PM CEST | 75
Thu 07 Jul 2022 05:40:04 PM CEST | 79
Thu 07 Jul 2022 05:40:04 PM CEST | 83
Thu 07 Jul 2022 05:40:05 PM CEST | 88
Thu 07 Jul 2022 05:40:05 PM CEST | 96
Thu 07 Jul 2022 05:40:06 PM CEST | 102
Thu 07 Jul 2022 05:40:07 PM CEST | 105
Thu 07 Jul 2022 05:40:07 PM CEST | 113
Thu 07 Jul 2022 05:40:08 PM CEST | 122
Thu 07 Jul 2022 05:40:08 PM CEST | 129
Thu 07 Jul 2022 05:40:09 PM CEST | 134
Thu 07 Jul 2022 05:40:09 PM CEST | 135
Thu 07 Jul 2022 05:40:10 PM CEST | 27
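Read per minute, the table shows the count climbing from roughly 13 to 21 at hh:mm:00 to roughly 133 to 137 within ten seconds, then collapsing back to the twenties at about hh:mm:10. A small Python helper, assuming exactly the `date | count` row format above, that extracts that pattern from such a log:

```python
def summarize(table: str) -> dict:
    """Per minute: (peak count, time of peak, first time the count fell below half the peak)."""
    samples = []
    for line in table.strip().splitlines():
        if "|" not in line:
            continue  # skip the "====" rule
        stamp, count = line.rsplit("|", 1)
        if not count.strip().isdigit():
            continue  # skip the header row
        samples.append((stamp.split()[4], int(count)))  # e.g. ("05:38:09", 137)
    summary = {}
    for minute in sorted({t[:5] for t, _ in samples}):
        rows = [(t, c) for t, c in samples if t[:5] == minute]
        peak_idx = max(range(len(rows)), key=lambda i: rows[i][1])
        peak_time, peak = rows[peak_idx]
        drained = next((t for t, c in rows[peak_idx + 1:] if c < peak / 2), None)
        summary[minute] = (peak, peak_time, drained)
    return summary
```

Applied to the rows above, each of the three minutes peaks at hh:mm:09 (137, 133 and 135 connections) and drains at hh:mm:10.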


—
Mark Schouten, CTO
Tuxis B.V.
mark@tuxis.nl

* Re: [pbs-devel] Scheduler causing connectivity issues?
  2022-07-07 15:49 [pbs-devel] Scheduler causing connectivity issues? Mark Schouten
@ 2022-07-08  7:09 ` Thomas Lamprecht
  2022-07-08  9:36   ` Jorge Boncompte
  2022-07-11  8:44   ` Mark Schouten
  0 siblings, 2 replies; 23+ messages in thread
From: Thomas Lamprecht @ 2022-07-08  7:09 UTC
  To: Mark Schouten, Proxmox Backup Server development discussion

Hi,

On 07/07/2022 17:49, Mark Schouten wrote:
> We’re getting complaints that one of our PBS’es is periodically unreachable. After investigation if the network might be at fault (even though it’s handling about 5.5Gbit at night), we found that PBS is piling up waiting connections every minute, on the minute, as you can see below. You see the output of `date`, combined with `ss -np | grep -c 8007`, the number of active connections.
> 
> At first I thought that pvestatd was ddossing PBS, but pvestatd seems to run more often than once in a minute.
> 
> So stracing the API process, I found that that process is also just waiting for something; must be the proxy-process.
> 
> grepping for ‘minute’ in the code, I stumbled upon the function `next_minute` in ./src/bin/proxmox-backup-proxy.rs. I’m not quite sure if I understand it correctly, but it seems that every minute, the scheduler is going to try and find out if it should be doing something.
> 
> Drilling down on that in my strace-foo, I think I see quite some read/write/rename actions on jobstate-files. Which leads me to conclude that the proxy process is waiting for the scheduler..
> 
> This is just guess-work, but you guys can surely find out better what’s going on than me.
> 
> This PBS is running with 45 users and 67 datastores.
> 
> Hope you guys can find something.. If I need to debug anything, let me know!

Thanks for the info, this already helps quite a bit. We'll look into it and re-check
with you if we need more info.

cheers,
Thomas

* Re: [pbs-devel] Scheduler causing connectivity issues?
  2022-07-08  7:09 ` Thomas Lamprecht
@ 2022-07-08  9:36   ` Jorge Boncompte
  2022-07-08  9:40     ` dea
  2022-07-11  8:44   ` Mark Schouten
  1 sibling, 1 reply; 23+ messages in thread
From: Jorge Boncompte @ 2022-07-08  9:36 UTC
  To: Proxmox Backup Server development discussion, Mark Schouten

El 8/7/22 a las 9:09, Thomas Lamprecht escribió:
> Hi,
> 
> On 07/07/2022 17:49, Mark Schouten wrote:
>> We’re getting complaints that one of our PBS’es is periodically unreachable. After investigation if the network might be at fault (even though it’s handling about 5.5Gbit at night), we found that PBS is piling up waiting connections every minute, on the minute, as you can see below. You see the output of `date`, combined with `ss -np | grep -c 8007`, the number of active connections.
>>
>> At first I thought that pvestatd was ddossing PBS, but pvestatd seems to run more often than once in a minute.
>>
>> So stracing the API process, I found that that process is also just waiting for something; must be the proxy-process.
>>
>> grepping for ‘minute’ in the code, I stumbled upon the function `next_minute` in ./src/bin/proxmox-backup-proxy.rs. I’m not quite sure if I understand it correctly, but it seems that every minute, the scheduler is going to try and find out if it should be doing something.
>>
>> Drilling down on that in my strace-foo, I think I see quite some read/write/rename actions on jobstate-files. Which leads me to conclude that the proxy process is waiting for the scheduler..
>>
>> This is just guess-work, but you guys can surely find out better what’s going on than me.
>>
>> This PBS is running with 45 users and 67 datastores.
>>
>> Hope you guys can find something.. If I need to debug anything, let me know!
> 
> Thanks for the info, this already helps quite a bit. We'll look into it and re-check
> with you if we need more info.

	Hi, we've been having a problem that resembles this one with several
proxmox-backup-server 2.2.x versions. Our PBS sometimes stopped accepting
backup jobs, but if we retried manually they started fine. The only
message I could find was:

  backup failed: could not activate storage 'XXXXXXX': XXXXXX: error
fetching datastores - 500 Can't connect to XXXXXXXX:8007

	Restarting the proxy seemed to keep it working for one or two more
days. We have reverted proxmox-backup-server to 2.1.8-1 and everything is
fine again.

	Regards.

> 
> cheers,
> Thomas
> 
> 
> _______________________________________________
> pbs-devel mailing list
> pbs-devel@lists.proxmox.com
> https://lists.proxmox.com/cgi-bin/mailman/listinfo/pbs-devel

* Re: [pbs-devel] Scheduler causing connectivity issues?
  2022-07-08  9:36   ` Jorge Boncompte
@ 2022-07-08  9:40     ` dea
  0 siblings, 0 replies; 23+ messages in thread
From: dea @ 2022-07-08  9:40 UTC
  To: pbs-devel

Yes, same behavior for me too.

Release 2.2-3 is problematic (NFS storage connected to PBS).

Previous releases work fine.


Thanks

Luca

* Re: [pbs-devel] Scheduler causing connectivity issues?
  2022-07-08  7:09 ` Thomas Lamprecht
  2022-07-08  9:36   ` Jorge Boncompte
@ 2022-07-11  8:44   ` Mark Schouten
  2022-07-13  7:55     ` dea
  2022-07-13  8:14     ` Thomas Lamprecht
  1 sibling, 2 replies; 23+ messages in thread
From: Mark Schouten @ 2022-07-11  8:44 UTC
  To: Thomas Lamprecht, Proxmox Backup Server development discussion

Hi,

>>Hope you guys can find something.. If I need to debug anything, let me know!
>
>Thanks for the info, this already helps quite a bit. We'll look into it and re-check
>with you if we need more info.
>

Any chance you can think of a workaround for the time being, other than 
downgrading?

—
Mark Schouten, CTO
Tuxis B.V.
mark@tuxis.nl

* Re: [pbs-devel] Scheduler causing connectivity issues?
  2022-07-11  8:44   ` Mark Schouten
@ 2022-07-13  7:55     ` dea
  2022-07-13  8:14     ` Thomas Lamprecht
  1 sibling, 0 replies; 23+ messages in thread
From: dea @ 2022-07-13  7:55 UTC
  To: pbs-devel

Hi,

I can confirm serious problems with the 2.2-3 release: continuous
disconnections, and backups never started because the PBS was unavailable.

Downgraded to release 2.2-1: all fixed, works perfectly.


Luca

* Re: [pbs-devel] Scheduler causing connectivity issues?
  2022-07-11  8:44   ` Mark Schouten
  2022-07-13  7:55     ` dea
@ 2022-07-13  8:14     ` Thomas Lamprecht
  2022-07-13 10:41       ` Mark Schouten
  1 sibling, 1 reply; 23+ messages in thread
From: Thomas Lamprecht @ 2022-07-13  8:14 UTC
  To: Mark Schouten, Proxmox Backup Server development discussion

Hi,

Am 11/07/2022 um 10:44 schrieb Mark Schouten:
>>> Hope you guys can find something.. If I need to debug anything, let me know!
>>
>> Thanks for the info, this already helps quite a bit. We'll look into it and re-check
>> with you if we need more info.
>>
> 
> Any chance you can think of a workaround for the time being, other than downgrading?


We identified a few places with blocking operations (e.g., the list-snapshots API
endpoint) that may starve the network request accept loop in some realistic scenarios;
we'll move those out to their own separate thread to avoid such blockage.
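The failure mode described here, a synchronous IO call running on the same thread as the accept loop, can be reproduced in miniature with Python's asyncio. This is an analogy, not PBS code: `blocking_io` is a hypothetical stand-in for a slow operation such as a snapshot listing, a periodic `ticker` task stands in for the accept loop, and `asyncio.to_thread` plays the role of "moving it to its own thread":

```python
import asyncio
import time

def blocking_io(duration: float) -> str:
    # Hypothetical stand-in for a slow synchronous operation,
    # e.g. walking many snapshot directories on a busy disk.
    time.sleep(duration)
    return "done"

async def ticker(stop: asyncio.Event, gaps: list) -> None:
    # Plays the role of the accept loop: it wants to wake up every ~10 ms.
    last = time.monotonic()
    while not stop.is_set():
        await asyncio.sleep(0.01)
        now = time.monotonic()
        gaps.append(now - last)
        last = now

async def run(offload: bool, duration: float = 0.3) -> float:
    """Return the longest gap the ticker saw while the slow operation ran."""
    stop = asyncio.Event()
    gaps: list = []
    task = asyncio.create_task(ticker(stop, gaps))
    await asyncio.sleep(0.05)  # let the ticker get going
    if offload:
        # Runs in a worker thread; the event loop keeps serving other tasks.
        await asyncio.to_thread(blocking_io, duration)
    else:
        # Runs on the event-loop thread itself; nothing else can run meanwhile.
        blocking_io(duration)
    await asyncio.sleep(0.05)
    stop.set()
    await task
    return max(gaps)
```

`asyncio.run(run(offload=False))` reports a worst-case gap of at least the full 0.3 s, while the offloaded variant stays close to the 10 ms tick. In tokio, the async runtime PBS is built on, the analogous tool for running blocking work off the async worker threads is `tokio::task::spawn_blocking`.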

FWIW, it would still be interesting to see a few metrics/CLI outputs from an affected
system, to hopefully ensure that there's nothing more to that problem.

- `head /proc/pressure/*` for some general info if the system is overloaded
- `ls /var/lib/proxmox-backup/jobstates` just out of interest
- `ss -tlpn` to see if another (old) proxy process is still accepting
- `uptime`
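For reference, `head /proc/pressure/*` prints a `==> file <==` header per resource followed by the kernel's pressure-stall information (PSI) lines. The `some` line is the share of time in which at least one task was stalled on that resource, the `full` line the share in which all non-idle tasks were stalled at once; `avg10`/`avg60`/`avg300` are percentages over 10 s, 60 s and 300 s windows, and `total` is the cumulative stall time in microseconds. A small parser for that output, a sketch that assumes exactly this `head` format:

```python
def parse_psi(text: str) -> dict:
    """Parse `head /proc/pressure/*` output into {resource: {kind: {field: value}}}."""
    result: dict = {}
    resource = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("==>") and line.endswith("<=="):
            # e.g. "==> /proc/pressure/io <==" -> "io"
            resource = line[3:-3].strip().rsplit("/", 1)[-1]
            result[resource] = {}
        elif line and resource is not None:
            kind, *pairs = line.split()  # kind is "some" or "full"
            result[resource][kind] = {
                key: float(value) for key, value in (p.split("=", 1) for p in pairs)
            }
    return result
```

`parse_psi(output)["io"]["some"]["avg10"]` then gives the recent share of time with at least one task waiting on IO.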

* Re: [pbs-devel] Scheduler causing connectivity issues?
  2022-07-13  8:14     ` Thomas Lamprecht
@ 2022-07-13 10:41       ` Mark Schouten
  2022-07-15 11:49         ` Thomas Lamprecht
  0 siblings, 1 reply; 23+ messages in thread
From: Mark Schouten @ 2022-07-13 10:41 UTC
  To: Thomas Lamprecht, Proxmox Backup Server development discussion

Requested files sent offlist.

—
Mark Schouten, CTO
Tuxis B.V.
mark@tuxis.nl

* Re: [pbs-devel] Scheduler causing connectivity issues?
  2022-07-13 10:41       ` Mark Schouten
@ 2022-07-15 11:49         ` Thomas Lamprecht
  2022-07-15 12:01           ` dea
  2022-07-18  7:31           ` Mark Schouten
  0 siblings, 2 replies; 23+ messages in thread
From: Thomas Lamprecht @ 2022-07-15 11:49 UTC
  To: Mark Schouten, Proxmox Backup Server development discussion

Am 13/07/2022 um 12:41 schrieb Mark Schouten:
> Requested files sent offlist.

Thanks!

You have 30% of runnable processes getting stalled waiting for IO. That
naturally should not cause the request accept future to get starved, but it
is the reason why it happened with the current (or rather, old)
architecture. Increasing available memory, so that the page cache can hold
more entries, could already relieve that system a bit.

We improved the reproducer we have locally by simulating a higher-latency
disk using dm-delay on a small single-core VM.

For one, we made libpve-storage-perl issue more efficient list-snapshot
requests when they can be filtered by VMID; on the PBS side, we moved most
operations that cause IO (and are related to backup groups/snapshots) to a
separate thread pool, so that the main thread should be less
congested/blocked.
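The first change is about where filtering happens: in a naive setup, if a client fetches every snapshot and filters by VMID afterwards, the server still has to read every backup group, whereas passing the filter along lets it read just one. A toy cost model of that difference, counting directory reads only (`ToyDatastore` is hypothetical and only loosely mirrors a `<type>/<id>/<timestamp>` datastore layout; it is not the PBS or libpve-storage-perl code):

```python
from dataclasses import dataclass

@dataclass
class ToyDatastore:
    # Hypothetical model: (backup_type, backup_id) -> snapshot timestamps,
    # standing in for <datastore>/<type>/<id>/<timestamp> directories.
    groups: dict
    dir_reads: int = 0

    def _read_group(self, key):
        self.dir_reads += 1  # one directory listing per backup group
        return [(key[0], key[1], ts) for ts in sorted(self.groups[key])]

    def list_snapshots(self, backup_type=None, backup_id=None):
        if backup_type is not None and backup_id is not None:
            # Fully qualified filter: open that one group directly.
            key = (backup_type, backup_id)
            keys = [key] if key in self.groups else []
        else:
            self.dir_reads += 1  # enumerate all groups first
            keys = [k for k in sorted(self.groups)
                    if backup_type in (None, k[0]) and backup_id in (None, k[1])]
        snapshots = []
        for key in keys:
            snapshots.extend(self._read_group(key))
        return snapshots
```

With 100 VM groups of three snapshots each, the unfiltered listing costs 101 directory reads versus 1 for the filtered one, while both yield the same three snapshots for VMID 150.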

The results got packaged and uploaded to our test repositories and are
available with:

- proxmox-backup-server version 2.2.4-1
- libpve-storage-perl version 7.2-7

It'd be great if you could try out those and report back if they actually
helped in your setup(s) too.

cheers,
Thomas

* Re: [pbs-devel] Scheduler causing connectivity issues?
  2022-07-15 11:49         ` Thomas Lamprecht
@ 2022-07-15 12:01           ` dea
  2022-07-15 12:57             ` Thomas Lamprecht
  2022-07-18  7:31           ` Mark Schouten
  1 sibling, 1 reply; 23+ messages in thread
From: dea @ 2022-07-15 12:01 UTC
  To: Proxmox Backup Server development discussion, Thomas Lamprecht,
	Mark Schouten


Mmmmm, no, same problem for me.

Rolled back 2.2.4 -> 2.2.1 and it works great.

[-- Attachment #2.2: baySqp65HDEysz54.jpeg --]
[-- Type: image/jpeg, Size: 44243 bytes --]

* Re: [pbs-devel] Scheduler causing connectivity issues?
  2022-07-15 12:01           ` dea
@ 2022-07-15 12:57             ` Thomas Lamprecht
  2022-07-15 13:02               ` dea
  0 siblings, 1 reply; 23+ messages in thread
From: Thomas Lamprecht @ 2022-07-15 12:57 UTC
  To: dea, Proxmox Backup Server development discussion, Mark Schouten

Am 15/07/2022 um 14:01 schrieb dea:
> Mmmmm no, for me the same problem.
> 
> Rollback 2.2.4 -> 2.2.1 and works great.
> 
> 

How do you test?

Did you reboot in between or drop the page cache?

Also, what is your output for the requested info:

Am 13/07/2022 um 10:14 schrieb Thomas Lamprecht:
> FWIW, it would be still interesting to see a few metrics/CLI outputs from an affected
> system to hopefully ensure that there's nothing additional to that problem.
> 
> - `head /proc/pressure/*` for some general info if the system is overloaded
> - `ls /var/lib/proxmox-backup/jobstates` just out of interest
> - `ss -tlpn` to see if another (old) proxy process is still accepting
> - `uptime`

`lscpu` and `free` would be good to know too.

* Re: [pbs-devel] Scheduler causing connectivity issues?
  2022-07-15 12:57             ` Thomas Lamprecht
@ 2022-07-15 13:02               ` dea
  2022-07-15 13:24                 ` dea
  0 siblings, 1 reply; 23+ messages in thread
From: dea @ 2022-07-15 13:02 UTC
  To: Thomas Lamprecht, Proxmox Backup Server development discussion,
	Mark Schouten

Retrying the upgrade 2.2-1 -> 2.2-4, this time with a reboot...

* Re: [pbs-devel] Scheduler causing connectivity issues?
  2022-07-15 13:02               ` dea
@ 2022-07-15 13:24                 ` dea
  2022-07-15 14:43                   ` dea
  2022-07-17 15:04                   ` dea
  0 siblings, 2 replies; 23+ messages in thread
From: dea @ 2022-07-15 13:24 UTC
  To: Thomas Lamprecht, Proxmox Backup Server development discussion,
	Mark Schouten


After a reboot with version 2.2-4, everything seems OK. I'll keep you updated
in case there are any problems.

THANK YOU

* Re: [pbs-devel] Scheduler causing connectivity issues?
  2022-07-15 13:24                 ` dea
@ 2022-07-15 14:43                   ` dea
  2022-07-17 15:04                   ` dea
  1 sibling, 0 replies; 23+ messages in thread
From: dea @ 2022-07-15 14:43 UTC
  To: Thomas Lamprecht, Proxmox Backup Server development discussion,
	Mark Schouten


... no... better than before, much better, but I still have network
dropouts. With sync traffic at 250 Mbps, I see it collapse to zero; I try
to log into the console and get nothing; then the traffic resumes and
it authenticates me immediately. There are still some dropouts in the
network.

* Re: [pbs-devel] Scheduler causing connectivity issues?
  2022-07-15 13:24                 ` dea
  2022-07-15 14:43                   ` dea
@ 2022-07-17 15:04                   ` dea
  2022-07-18 13:30                     ` Thomas Lamprecht
  1 sibling, 1 reply; 23+ messages in thread
From: dea @ 2022-07-17 15:04 UTC
  To: Thomas Lamprecht, Proxmox Backup Server development discussion,
	Mark Schouten

I can confirm...

After upgrading to 2.2-4, the connection problem persists:


Downgrading to 2.2-1 works fine.


Thanks

Luca

[-- Attachment #2.2: pRAZjLDyn8ZN0YYc.jpeg --]
[-- Type: image/jpeg, Size: 39164 bytes --]

* Re: [pbs-devel] Scheduler causing connectivity issues?
  2022-07-15 11:49         ` Thomas Lamprecht
  2022-07-15 12:01           ` dea
@ 2022-07-18  7:31           ` Mark Schouten
  2022-07-18 11:03             ` Thomas Lamprecht
  1 sibling, 1 reply; 23+ messages in thread
From: Mark Schouten @ 2022-07-18  7:31 UTC
  To: Thomas Lamprecht, Proxmox Backup Server development discussion

Hi,

>You have 30% of runnable process getting stalled due waiting for IO, that
>naturally should not cause the request accept future to get starved but is
>the reason for why it happened with the current (or better old)
>architecture. Increasing available memory, so that the page cache can hold
>more entries, could already relieve that system a bit.

Thanks. Please note that /var/lib/proxmox is on a different set of disks
than the datastores. The root pool is on two PM883s; the datastore is lots
of spinning disks with NVMe special devices. Not sure if that’s relevant to
your findings, but there you have it :)

Memory upgrade is somewhere on our roadmap.

>We improved on the reproducer we got locally by simulating a higher latency
>disk using dm-delay on a small single core VM.
>
>For one we made the libpve-storage-perl do more efficient list-snapshot
>requests if they can be filtered by VMID, and on the PBS side we moved most
>operations that cause IO (and are related to backup groups/snapshots) to a
>separate thread pool so that the main thread should be less
>congested/blocked.

Given the other responses in this thread, I’m not going to upgrade
production to a testing version yet. Please let me know if there is any
other info you need from me.

—
Mark Schouten, CTO
Tuxis B.V.
mark@tuxis.nl

* Re: [pbs-devel] Scheduler causing connectivity issues?
  2022-07-18  7:31           ` Mark Schouten
@ 2022-07-18 11:03             ` Thomas Lamprecht
  2022-07-20  7:30               ` Mark Schouten
  0 siblings, 1 reply; 23+ messages in thread
From: Thomas Lamprecht @ 2022-07-18 11:03 UTC
  To: Mark Schouten, Proxmox Backup Server development discussion

Am 18/07/2022 um 09:31 schrieb Mark Schouten:

>> For one we made the libpve-storage-perl do more efficient list-snapshot
>> requests if they can be filtered by VMID, and on the PBS side we moved most
>> operations that cause IO (and are related to backup groups/snapshots) to a
>> separate thread pool so that the main thread should be less
>> congested/blocked.
> Given the other responses in this thread, I’m not going to upgrade yet to a testing-version in production. Please let me know if there is any other info you need from me.

Note that the responses were all from just one user, so not exactly broad feedback,
and their setup had almost no IO pressure going on, so it's rather a different issue (for
which we have an idea in the pipeline).

FWIW, I just moved that PBS version to no-subscription, so it isn't a testing version
anymore.

I'll also ping this thread once the stopgap fix for the other issue (the one we suspect
dea is hitting) is available in a packaged version; it'd just be interesting to know
whether the general move of IO operations to separate threads would have helped in your
big and (relatively) high-IO-pressure setup.

* Re: [pbs-devel] Scheduler causing connectivity issues?
  2022-07-17 15:04                   ` dea
@ 2022-07-18 13:30                     ` Thomas Lamprecht
  2022-07-19 11:01                       ` David Lawley
  0 siblings, 1 reply; 23+ messages in thread
From: Thomas Lamprecht @ 2022-07-18 13:30 UTC
  To: Proxmox Backup Server development discussion, dea, Mark Schouten

Am 17/07/2022 um 17:04 schrieb dea:
> 
> after upgrade to 2.2-4 connection problem persist:

FYI, there's proxmox-backup-server version 2.2.5-1, which includes a stopgap [0] for
an issue that could be the one affecting your system. Unlike the issue targeted by
the previous improvements, this one doesn't necessarily stem from a system
with high load (IO pressure).

It's sadly hard to create a synthetic reproducer and thus also hard to know for sure
that this one addresses the regression in the various environments PBS runs in, so
we'd appreciate feedback on that.

That package is available on pbstest as of now.

[0]: https://git.proxmox.com/?p=proxmox-backup.git;a=commitdiff;h=c2206e21e0f27fbb7610180fc9dc3cb2fe1c4c16

cheers,
Thomas





* Re: [pbs-devel] Scheduler causing connectivity issues?
  2022-07-18 13:30                     ` Thomas Lamprecht
@ 2022-07-19 11:01                       ` David Lawley
  2022-07-19 13:04                         ` Thomas Lamprecht
  0 siblings, 1 reply; 23+ messages in thread
From: David Lawley @ 2022-07-19 11:01 UTC (permalink / raw)
  To: pbs-devel

[-- Attachment #1: Type: text/html, Size: 3702 bytes --]


* Re: [pbs-devel] Scheduler causing connectivity issues?
  2022-07-19 11:01                       ` David Lawley
@ 2022-07-19 13:04                         ` Thomas Lamprecht
  2022-07-19 15:08                           ` dea
  0 siblings, 1 reply; 23+ messages in thread
From: Thomas Lamprecht @ 2022-07-19 13:04 UTC (permalink / raw)
  To: Proxmox Backup Server development discussion, David Lawley

On 19/07/2022 at 13:01, David Lawley wrote:
> Just a note, most if not all of my timeout issues seem to have "disappeared"
> 
> proxmox-backup: 2.2-1 (running kernel: 5.15.39-1-pve)
> proxmox-backup-server: 2.2.5-1 (running version: 2.2.5)
> pve-kernel-5.15: 7.2-6
> pve-kernel-helper: 7.2-6
> pve-kernel-5.13: 7.1-9
> pve-kernel-5.11: 7.0-10
> pve-kernel-5.4: 6.4-4
> pve-kernel-5.15.39-1-pve: 5.15.39-1
> pve-kernel-5.15.35-3-pve: 5.15.35-6
> pve-kernel-5.13.19-6-pve: 5.13.19-15
> pve-kernel-5.13.19-2-pve: 5.13.19-4
> pve-kernel-5.11.22-7-pve: 5.11.22-12
> pve-kernel-5.4.124-1-pve: 5.4.124-1
> pve-kernel-5.4.65-1-pve: 5.4.65-1
> ifupdown2: 3.1.0-1+pmx3
> libjs-extjs: 7.0.0-1
> proxmox-backup-docs: 2.2.5-1
> proxmox-backup-client: 2.2.5-1
> proxmox-mini-journalreader: 1.2-1
> proxmox-widget-toolkit: 3.5.1
> pve-xtermjs: 4.16.0-1
> smartmontools: 7.2-pve3
> zfsutils-linux: 2.1.4-pve1
> 

thank you very much, such notes are already helping a lot - silence is hard
to parse after all :-)





* Re: [pbs-devel] Scheduler causing connectivity issues?
  2022-07-19 13:04                         ` Thomas Lamprecht
@ 2022-07-19 15:08                           ` dea
  0 siblings, 0 replies; 23+ messages in thread
From: dea @ 2022-07-19 15:08 UTC (permalink / raw)
  To: Proxmox Backup Server development discussion, Thomas Lamprecht,
	David Lawley

Sorry, but I can't upgrade from 2.2-1 to 2.2-5 for testing right now.

It is a production server and a garbage collection run is in progress, which
takes several days to finish. I have to free up space, and if I update, the
process is interrupted and I have to start it over, losing days.


Sorry


On 19/07/22 15:04, Thomas Lamprecht wrote:
> On 19/07/2022 at 13:01, David Lawley wrote:
>> Just a note, most if not all of my timeout issues seem to have "disappeared"
>>
>> proxmox-backup: 2.2-1 (running kernel: 5.15.39-1-pve)
>> proxmox-backup-server: 2.2.5-1 (running version: 2.2.5)
>> pve-kernel-5.15: 7.2-6
>> pve-kernel-helper: 7.2-6
>> pve-kernel-5.13: 7.1-9
>> pve-kernel-5.11: 7.0-10
>> pve-kernel-5.4: 6.4-4
>> pve-kernel-5.15.39-1-pve: 5.15.39-1
>> pve-kernel-5.15.35-3-pve: 5.15.35-6
>> pve-kernel-5.13.19-6-pve: 5.13.19-15
>> pve-kernel-5.13.19-2-pve: 5.13.19-4
>> pve-kernel-5.11.22-7-pve: 5.11.22-12
>> pve-kernel-5.4.124-1-pve: 5.4.124-1
>> pve-kernel-5.4.65-1-pve: 5.4.65-1
>> ifupdown2: 3.1.0-1+pmx3
>> libjs-extjs: 7.0.0-1
>> proxmox-backup-docs: 2.2.5-1
>> proxmox-backup-client: 2.2.5-1
>> proxmox-mini-journalreader: 1.2-1
>> proxmox-widget-toolkit: 3.5.1
>> pve-xtermjs: 4.16.0-1
>> smartmontools: 7.2-pve3
>> zfsutils-linux: 2.1.4-pve1
>>
> thank you very much, such notes are already helping a lot - silence is hard
> to parse after all :-)
>




* Re: [pbs-devel] Scheduler causing connectivity issues?
  2022-07-18 11:03             ` Thomas Lamprecht
@ 2022-07-20  7:30               ` Mark Schouten
  2022-07-21 13:10                 ` Thomas Lamprecht
  0 siblings, 1 reply; 23+ messages in thread
From: Mark Schouten @ 2022-07-20  7:30 UTC (permalink / raw)
  To: Thomas Lamprecht, Proxmox Backup Server development discussion

Hi,

>FWIW, I just moved that PBS version to no-subscription, so it wouldn't be testing
>anymore.
>
>I'll also ping this thread once the stop gap fix for the other issue, that we suspect
>dea is hitting, is available in a packaged version; it'd just be interesting to know
>if the general movement of IO operations to separate threads would have helped in your
>big and (relatively) high IO pressure setup.

Upgraded to the no-subscription version and rebooting as we speak. I’ll 
keep you updated.

—
Mark Schouten, CTO
Tuxis B.V.
mark@tuxis.nl






* Re: [pbs-devel] Scheduler causing connectivity issues?
  2022-07-20  7:30               ` Mark Schouten
@ 2022-07-21 13:10                 ` Thomas Lamprecht
  0 siblings, 0 replies; 23+ messages in thread
From: Thomas Lamprecht @ 2022-07-21 13:10 UTC (permalink / raw)
  To: Mark Schouten, Proxmox Backup Server development discussion

Hi!

On 20/07/2022 at 09:30, Mark Schouten wrote:
>> FWIW, I just moved that PBS version to no-subscription, so it wouldn't be testing
>> anymore.
>>
>> I'll also ping this thread once the stop gap fix for the other issue, that we suspect
>> dea is hitting, is available in a packaged version; it'd just be interesting to know
>> if the general movement of IO operations to separate threads would have helped in your
>> big and (relatively) high IO pressure setup.
> 
> Upgraded to the no-subscription version and rebooting as we speak. I’ll keep you updated.

How's the new version working out for you? It should already have been 2.2.5-1 on
no-subscription at the time you wrote this mail.

cheers,
Thomas


end of thread, other threads:[~2022-07-21 13:10 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
2022-07-07 15:49 [pbs-devel] Scheduler causing connectivity issues? Mark Schouten
2022-07-08  7:09 ` Thomas Lamprecht
2022-07-08  9:36   ` Jorge Boncompte
2022-07-08  9:40     ` dea
2022-07-11  8:44   ` Mark Schouten
2022-07-13  7:55     ` dea
2022-07-13  8:14     ` Thomas Lamprecht
2022-07-13 10:41       ` Mark Schouten
2022-07-15 11:49         ` Thomas Lamprecht
2022-07-15 12:01           ` dea
2022-07-15 12:57             ` Thomas Lamprecht
2022-07-15 13:02               ` dea
2022-07-15 13:24                 ` dea
2022-07-15 14:43                   ` dea
2022-07-17 15:04                   ` dea
2022-07-18 13:30                     ` Thomas Lamprecht
2022-07-19 11:01                       ` David Lawley
2022-07-19 13:04                         ` Thomas Lamprecht
2022-07-19 15:08                           ` dea
2022-07-18  7:31           ` Mark Schouten
2022-07-18 11:03             ` Thomas Lamprecht
2022-07-20  7:30               ` Mark Schouten
2022-07-21 13:10                 ` Thomas Lamprecht

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox
Service provided by Proxmox Server Solutions GmbH | Privacy | Legal