From: Dominik Csapak <d.csapak@proxmox.com>
To: pve-devel@lists.proxmox.com
Subject: Re: [pve-devel] [RFC PATCH v2 proxmox-backup-qemu] restore: make chunk loading more parallel
Date: Fri, 11 Jul 2025 10:21:45 +0200
Message-ID: <82e82fdf-5010-4a70-af5d-35c0a0826c0c@proxmox.com>
In-Reply-To: <650a97d5-d8e8-4afb-8450-83254f398bb2@proxmox.com>

On 7/10/25 14:48, Dominik Csapak wrote:
[snip]
> 
> Just for the record, I also benchmarked a slower system here:
> 6x 16 TiB spinners in RAID-10 with NVMe special devices,
> over a 2.5 Gbit/s link:
> 
> The current approach reaches ~61 MiB/s restore speed;
> with my patch it's ~160 MiB/s restore speed, with not much increase
> in CPU time (both stayed under 30% of a single core).
> 
> I also ran perf stat on both to compare how much overhead the additional
> futures/async/await machinery brings:
> 
> 
> first restore:
> 
>          62,871.24 msec task-clock                       #    0.115 CPUs utilized
>            878,151      context-switches                 #   13.967 K/sec
>             28,205      cpu-migrations                   #  448.615 /sec
>            519,396      page-faults                      #    8.261 K/sec
>    277,239,999,474      cpu_core/cycles/                 #    4.410 G/sec (89.20%)
>    190,782,860,504      cpu_atom/cycles/                 #    3.035 G/sec (10.80%)
>    482,534,267,606      cpu_core/instructions/           #    7.675 G/sec (89.20%)
>    188,659,352,613      cpu_atom/instructions/           #    3.001 G/sec (10.80%)
>     46,913,925,346      cpu_core/branches/               #  746.191 M/sec (89.20%)
>     19,251,496,445      cpu_atom/branches/               #  306.205 M/sec (10.80%)
>        904,032,529      cpu_core/branch-misses/          #   14.379 M/sec (89.20%)
>        621,228,739      cpu_atom/branch-misses/          #    9.881 M/sec (10.80%)
> 1,633,142,624,469      cpu_core/slots/                  #   25.976 G/sec                    (89.20%)
>    489,311,603,992      cpu_core/topdown-retiring/       #     29.7% Retiring (89.20%)
>     97,617,585,755      cpu_core/topdown-bad-spec/       #      5.9% Bad Speculation (89.20%)
>    317,074,236,582      cpu_core/topdown-fe-bound/       #     19.2% Frontend Bound (89.20%)
>    745,485,954,022      cpu_core/topdown-be-bound/       #     45.2% Backend Bound (89.20%)
>     57,463,995,650      cpu_core/topdown-heavy-ops/      #      3.5% Heavy Operations       # 26.2% Light Operations        (89.20%)
>     88,333,173,745      cpu_core/topdown-br-mispredict/  #      5.4% Branch Mispredict      # 0.6% Machine Clears          (89.20%)
>    217,424,427,912      cpu_core/topdown-fetch-lat/      #     13.2% Fetch Latency          # 6.0% Fetch Bandwidth         (89.20%)
>    354,250,103,398      cpu_core/topdown-mem-bound/      #     21.5% Memory Bound           # 23.7% Core Bound              (89.20%)
> 
> 
>      548.195368256 seconds time elapsed
> 
> 
>       44.493218000 seconds user
>       21.315124000 seconds sys
> 
> second restore:
> 
>          67,908.11 msec task-clock                       #    0.297 CPUs utilized
>            856,402      context-switches                 #   12.611 K/sec
>             46,539      cpu-migrations                   #  685.323 /sec
>            942,002      page-faults                      #   13.872 K/sec
>    300,757,558,837      cpu_core/cycles/                 #    4.429 G/sec (75.93%)
>    234,595,451,063      cpu_atom/cycles/                 #    3.455 G/sec (24.07%)
>    511,747,593,432      cpu_core/instructions/           #    7.536 G/sec (75.93%)
>    289,348,171,298      cpu_atom/instructions/           #    4.261 G/sec (24.07%)
>     49,993,266,992      cpu_core/branches/               #  736.190 M/sec (75.93%)
>     29,624,743,216      cpu_atom/branches/               #  436.248 M/sec (24.07%)
>        911,770,988      cpu_core/branch-misses/          #   13.427 M/sec (75.93%)
>        811,321,806      cpu_atom/branch-misses/          #   11.947 M/sec (24.07%)
> 1,788,660,631,633      cpu_core/slots/                  #   26.339 G/sec                    (75.93%)
>    569,029,214,725      cpu_core/topdown-retiring/       #     31.4% Retiring (75.93%)
>    125,815,987,213      cpu_core/topdown-bad-spec/       #      6.9% Bad Speculation (75.93%)
>    234,249,755,030      cpu_core/topdown-fe-bound/       #     12.9% Frontend Bound (75.93%)
>    885,539,445,254      cpu_core/topdown-be-bound/       #     48.8% Backend Bound (75.93%)
>     86,825,030,719      cpu_core/topdown-heavy-ops/      #      4.8% Heavy Operations       # 26.6% Light Operations        (75.93%)
>    116,566,866,551      cpu_core/topdown-br-mispredict/  #      6.4% Branch Mispredict      # 0.5% Machine Clears          (75.93%)
>    135,276,276,904      cpu_core/topdown-fetch-lat/      #      7.5% Fetch Latency          # 5.5% Fetch Bandwidth         (75.93%)
>    409,898,741,185      cpu_core/topdown-mem-bound/      #     22.6% Memory Bound           # 26.2% Core Bound              (75.93%)
> 
> 
>      228.528573197 seconds time elapsed
> 
> 
>       48.379229000 seconds user
>       21.779166000 seconds sys
> 
> 
> So the overhead for the additional futures was ~8% in cycles and ~6% in instructions,
> which does not seem too bad.
> 
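
(For reference, the ~8% / ~6% figures above follow directly from the cpu_core
counters: 300,757,558,837 / 277,239,999,474 ≈ 1.085 for cycles, and
511,747,593,432 / 482,534,267,606 ≈ 1.061 for instructions.)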

Addendum:

Sadly, the tests above ran into a network limit of ~600 MBit/s (I'm still
trying to figure out where the bottleneck in the network is...).

I tested again from a different machine that has a 10G link to the PBS mentioned above.
This time I restored into QEMU's 'null-co' block driver, since the target storage was too slow...

Anyway, the results are:

current code: ~75 MiB/s restore speed
16-way parallel: ~528 MiB/s (7x!)

CPU usage went up from <50% of one core to ~350% (as in my initial tests with a different setup).

perf stat output below:

current:

        183,534.85 msec task-clock                       #    0.409 CPUs utilized
           117,267      context-switches                 #  638.936 /sec
               700      cpu-migrations                   #    3.814 /sec
           462,432      page-faults                      #    2.520 K/sec
   468,609,612,840      cycles                           #    2.553 GHz
1,286,188,699,253      instructions                     #    2.74  insn per cycle
    41,342,312,275      branches                         #  225.256 M/sec
       846,432,249      branch-misses                    #    2.05% of all branches

     448.965517535 seconds time elapsed

     152.007611000 seconds user
      32.189942000 seconds sys

16-way parallel:

        228,583.26 msec task-clock                       #    3.545 CPUs utilized
           114,575      context-switches                 #  501.240 /sec
             6,028      cpu-migrations                   #   26.371 /sec
         1,561,179      page-faults                      #    6.830 K/sec
   510,861,534,387      cycles                           #    2.235 GHz
1,296,819,542,686      instructions                     #    2.54  insn per cycle
    43,202,234,699      branches                         #  189.000 M/sec
       828,196,795      branch-misses                    #    1.92% of all branches

      64.482868654 seconds time elapsed

     184.172759000 seconds user
      44.560342000 seconds sys

So still about ~8% more cycles and about the same number of instructions, but in much less time.
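
For context, the 16-way parallel variant boils down to the following pattern of
bounded concurrent chunk fetches. This is only a minimal sketch of the idea, not
the actual proxmox-backup-qemu code: it assumes the futures and anyhow crates,
and fetch_chunk/write_chunk are hypothetical placeholders for the real download
and write paths.

// Minimal sketch of the bounded-parallel chunk loading pattern discussed above.
// fetch_chunk/write_chunk are hypothetical placeholders, not the real API.
use anyhow::Result;
use futures::stream::{self, StreamExt, TryStreamExt};

const PARALLEL_CHUNKS: usize = 16;

#[derive(Clone, Copy)]
struct ChunkInfo {
    offset: u64,
}

// hypothetical: download, verify and decode one chunk from the PBS server
async fn fetch_chunk(info: ChunkInfo) -> Result<Vec<u8>> {
    let _ = info;
    Ok(Vec::new())
}

// hypothetical: hand the decoded chunk to the image write callback
async fn write_chunk(offset: u64, data: Vec<u8>) -> Result<()> {
    let _ = (offset, data);
    Ok(())
}

async fn restore_all(chunks: Vec<ChunkInfo>) -> Result<()> {
    stream::iter(chunks)
        // turn each chunk into a download future ...
        .map(|info| async move {
            let data = fetch_chunk(info).await?;
            Ok::<_, anyhow::Error>((info.offset, data))
        })
        // ... and keep up to PARALLEL_CHUNKS of them in flight at once,
        // yielding results in completion order
        .buffer_unordered(PARALLEL_CHUNKS)
        // writes are still handled one result at a time
        .try_for_each(|(offset, data)| write_chunk(offset, data))
        .await
}

Bounding the concurrency (here at 16) overlaps the per-chunk network round trips
while never buffering more than 16 chunks at once, which fits the numbers above:
much higher throughput for a comparatively small increase in CPU time.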


