Maxim Masiutin

June 15, 2020

Here is a single-threaded performance comparison between FastMM5 (FastMM v5.01, dated Jun 12, 2020) and FastMM4-AVX (v1.03, dated Jun 14, 2020). The test was run on Jun 16, 2020, on an Intel Core i7-1065G7 CPU (base frequency: 1.3 GHz, 4 cores, 8 threads). Compiled with Delphi 10.3 Update 3, 64-bit target.

                                               FastMM5  FastMM4-AVX   Ratio
                                               -------  -----------  ------
    ReallocMem Small (1-555b) benchmark           9285         7013  24.47%
    ReallocMem Medium (1-4039b) benchmark        12002        10186  15.13%
    Block downsize                               12463         9474  23.98%
    VerySmall downsize benchmark                 12025        11012   8.42%
    Address space creep benchmark                14212        10845  23.69%
    Address space creep (larger blocks)          16237        13629  16.06%
    Single-threaded reallocate and use           15462        13750  11.07%
    Single-threaded tiny reallocate and use       9263         7203  22.24%
    Single-threaded allocate, use and free       14885        14211   4.53%
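
The last column of the table matches the relative reduction of the FastMM4-AVX score against the FastMM5 score, i.e. (FastMM5 - AVX) / FastMM5. Below is a small C sketch that recomputes it, assuming lower scores are better (which is consistent with the conclusion of this post). The function names are made up for illustration and are not part of the benchmark program.

```c
#include <stdio.h>

/* Relative reduction of the FastMM4-AVX score versus the FastMM5 score,
   in percent. Assumption: lower scores are better. */
double reduction_percent(double fastmm5, double avx)
{
    return (fastmm5 - avx) / fastmm5 * 100.0;
}

/* Print one table row together with the recomputed percentage. */
void print_row(const char *name, double fastmm5, double avx)
{
    printf("%-40s %6.0f %6.0f %6.2f%%\n",
           name, fastmm5, avx, reduction_percent(fastmm5, avx));
}
```

For example, print_row("ReallocMem Small (1-555b) benchmark", 9285, 7013) recomputes the first row of the table.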

You can find the program used to generate the benchmark data at https://github.com/maximmasiutin/FastCodeBenchmark

You can find the FastMM4-AVX branch at https://github.com/maximmasiutin/FastMM4-AVX

In the tests shown above, the FastMM4-AVX branch is faster than FastMM5.

Besides that, in multi-threaded use FastMM5 calls "Winapi.Windows.SwitchToThread" while trying to obtain the lock of a block manager. Calling "SwitchToThread" in a spin-lock loop is not very efficient. A better way, recommended even by Intel, is to execute the "pause" instruction first, e.g. 5000 times, and to call "SwitchToThread" only if that does not help. Usually "pause" is enough: the lock is released before 5000 iterations are reached, so no "SwitchToThread" call is needed.
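
To illustrate the pause-then-yield idea, here is a minimal sketch in C. It is not the actual FastMM4-AVX code (which is Delphi with assembly); the type and function names are made up for illustration. It assumes a C11 compiler with <stdatomic.h>, and outside Windows it uses sched_yield() as a stand-in for SwitchToThread().

```c
#include <stdatomic.h>
#include <stdbool.h>

#if defined(_WIN32)
#include <windows.h>
#define yield_thread() SwitchToThread()
#else
#include <sched.h>
#define yield_thread() sched_yield()   /* stand-in outside Windows */
#endif

#if defined(__x86_64__) || defined(__i386__) || defined(_M_X64) || defined(_M_IX86)
#include <immintrin.h>
#define cpu_pause() _mm_pause()        /* emits the "pause" instruction */
#else
#define cpu_pause() ((void)0)          /* no "pause" on this architecture */
#endif

enum { SPIN_LIMIT = 5000 };            /* iteration count taken from the post */

/* Hypothetical lock type, for illustration only. */
typedef struct { atomic_bool locked; } spin_lock_t;

void spin_lock_init(spin_lock_t *l)
{
    atomic_init(&l->locked, false);
}

/* One attempt: read first, and do the atomic exchange only if the lock looks free. */
bool spin_lock_try_acquire(spin_lock_t *l)
{
    if (atomic_load_explicit(&l->locked, memory_order_relaxed))
        return false;
    return !atomic_exchange_explicit(&l->locked, true, memory_order_acquire);
}

/* Spin with "pause" up to SPIN_LIMIT times, then yield and start over.
   Returns how many times the thread had to yield (0 in the usual case). */
unsigned spin_lock_acquire(spin_lock_t *l)
{
    unsigned yields = 0;
    for (;;) {
        for (int i = 0; i < SPIN_LIMIT; i++) {
            if (spin_lock_try_acquire(l))
                return yields;
            cpu_pause();
        }
        yield_thread();
        yields++;
    }
}

void spin_lock_release(spin_lock_t *l)
{
    atomic_store_explicit(&l->locked, false, memory_order_release);
}
```

When the lock is held only briefly, the loop exits during the "pause" phase and the yield call is never reached.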

The following should also be taken into consideration: (1) each call to SwitchToThread() incurs the cost of a context switch, which can be 10000+ cycles; (2) it also incurs the cost of a ring 3 to ring 0 transition, which can be 1000+ cycles; (3) SwitchToThread() may be of no use if no other threads are in the ready state.

The FastMM4-AVX branch checks whether the CPU supports SSE2 and thus the "pause" instruction. If it does, the branch uses a "pause" spin-loop for 5000 iterations before calling "SwitchToThread". If the CPU doesn't have the "pause" instruction or Windows doesn't have the SwitchToThread() API function, it uses EnterCriticalSection/LeaveCriticalSection instead.
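
The selection logic described above could look roughly like this in C. This is only a sketch under assumptions, not the code of the branch: the enum and function names are hypothetical, SSE2 support is read from CPUID leaf 1 (EDX bit 26), and the caller is assumed to determine by its own means whether SwitchToThread() is available.

```c
#include <stdbool.h>

#if defined(_MSC_VER) && (defined(_M_X64) || defined(_M_IX86))
#include <intrin.h>
/* MSVC-style CPUID: regs = {EAX, EBX, ECX, EDX} for leaf 1. */
bool cpu_has_sse2(void)
{
    int regs[4];
    __cpuid(regs, 1);
    return ((regs[3] >> 26) & 1) != 0;     /* EDX bit 26 = SSE2 */
}
#elif defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>
/* GCC/Clang-style CPUID helper; returns 0 if leaf 1 is not supported. */
bool cpu_has_sse2(void)
{
    unsigned eax, ebx, ecx, edx;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return false;
    return ((edx >> 26) & 1u) != 0;        /* EDX bit 26 = SSE2 */
}
#else
/* Not an x86 CPU: there is no SSE2 and no "pause" instruction. */
bool cpu_has_sse2(void) { return false; }
#endif

/* Hypothetical names for the two locking strategies mentioned in the post. */
typedef enum {
    LOCK_PAUSE_THEN_YIELD,    /* "pause" spin-loop, then SwitchToThread */
    LOCK_CRITICAL_SECTION     /* EnterCriticalSection/LeaveCriticalSection */
} lock_strategy;

/* Use the spin-loop only if both "pause" and SwitchToThread are available. */
lock_strategy choose_lock_strategy(bool has_sse2, bool has_switch_to_thread)
{
    return (has_sse2 && has_switch_to_thread) ? LOCK_PAUSE_THEN_YIELD
                                              : LOCK_CRITICAL_SECTION;
}
```

The decision would be made once at startup, e.g. choose_lock_strategy(cpu_has_sse2(), have_switch_to_thread), where have_switch_to_thread is a flag the caller sets after checking for the API function.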
