Maxim Masiutin

Posts
1
Joined
June 15, 2020
Last visited
June 16, 2020

Everything posted by Maxim Masiutin

Newly released FastMM5

Maxim Masiutin replied to Pep's topic in General

Here is a single-threaded performance comparison between FastMM5 (FastMM v5.01, dated Jun 12, 2020) and FastMM4-AVX (v1.03, dated Jun 14, 2020). The test was run on Jun 16, 2020, on an Intel Core i7-1065G7 CPU (base frequency: 1.3 GHz, 4 cores, 8 threads), compiled with Delphi 10.3 Update 3, 64-bit target.

                                          FastMM5  AVX-br.   Ratio
                                           ------   ------  ------
ReallocMem Small (1-555b) benchmark          9285     7013  24.47%
ReallocMem Medium (1-4039b) benchmark       12002    10186  15.13%
Block downsize                              12463     9474  23.98%
VerySmall downsize benchmark                12025    11012   8.42%
Address space creep benchmark               14212    10845  23.69%
Address space creep (larger blocks)         16237    13629  16.06%
Single-threaded reallocate and use          15462    13750  11.07%
Single-threaded tiny reallocate and use      9263     7203  22.24%
Single-threaded allocate, use and free      14885    14211   4.53%

The Ratio column is the reduction relative to FastMM5: (FastMM5 - AVX-br.) / FastMM5.

You can find the program used to generate the benchmark data at https://github.com/maximmasiutin/FastCodeBenchmark

You can find the FastMM4-AVX branch at https://github.com/maximmasiutin/FastMM4-AVX

In all of the tests shown above, the FastMM4-AVX branch is faster than FastMM5.

Besides that, in multi-threaded code FastMM5 calls "Winapi.Windows.SwitchToThread" while trying to obtain the lock of a block manager. Calling "SwitchToThread" is not an efficient way to wait in a spin-lock loop. A better way, recommended even by Intel, is to execute the "pause" instruction, e.g. 5000 times, and to call "SwitchToThread" only if that does not help. Usually "pause" is enough: the lock is released before 5000 iterations are reached, so no "SwitchToThread" call is needed.

The following should also be taken into consideration:
(1) Each call to SwitchToThread() incurs the expensive cost of a context switch, which can be 10000+ cycles;
(2) It also incurs the cost of a ring 3 to ring 0 transition, which can be 1000+ cycles;
(3) SwitchToThread() may be of no use if no other threads are in the ready state.
The FastMM4-AVX branch checks whether the CPU supports SSE2 and thus the "pause" instruction. If it does, the branch spins on "pause" for up to 5000 iterations before calling "SwitchToThread". If the CPU doesn't have the "pause" instruction, or Windows doesn't have the SwitchToThread() API function, it uses EnterCriticalSection/LeaveCriticalSection instead.
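The spin-then-yield idea described above can be sketched roughly as follows. This is an illustration only, not the FastMM4-AVX source: it is written in Rust rather than Delphi so that it stays portable. The names `SpinThenYieldLock`, `SPIN_LIMIT` and `contended_count` are made up for the example. `std::hint::spin_loop()` stands in for the "pause" instruction (on x86 targets it compiles to `pause`), and `std::thread::yield_now()` stands in for SwitchToThread(). The critical-section fallback for CPUs without "pause" is not modelled.

```rust
use std::hint;
use std::sync::atomic::{AtomicBool, AtomicU64, Ordering};
use std::thread;

/// Spin iterations before giving up the time slice.
/// 5000 is the figure quoted in the post; treat it as a tunable assumption.
const SPIN_LIMIT: u32 = 5000;

/// Hypothetical lock type used only for this illustration.
pub struct SpinThenYieldLock {
    locked: AtomicBool,
}

impl SpinThenYieldLock {
    pub const fn new() -> Self {
        Self { locked: AtomicBool::new(false) }
    }

    /// One attempt to take the lock; returns true on success.
    pub fn try_lock(&self) -> bool {
        self.locked
            .compare_exchange(false, true, Ordering::Acquire, Ordering::Relaxed)
            .is_ok()
    }

    /// Spin with a pause hint first; yield to the scheduler only after
    /// SPIN_LIMIT failed attempts, then start spinning again.
    pub fn lock(&self) {
        loop {
            for _ in 0..SPIN_LIMIT {
                if self.try_lock() {
                    return;
                }
                hint::spin_loop(); // "pause" on x86; cheap, stays in user mode
            }
            thread::yield_now(); // stand-in for SwitchToThread(); the costly path
        }
    }

    pub fn unlock(&self) {
        self.locked.store(false, Ordering::Release);
    }
}

/// Demo helper: several threads increment a counter with a deliberately
/// non-atomic load-then-store, so updates would be lost without the lock.
pub fn contended_count(threads: usize, iters: u64) -> u64 {
    let lock = SpinThenYieldLock::new();
    let counter = AtomicU64::new(0);
    thread::scope(|s| {
        for _ in 0..threads {
            s.spawn(|| {
                for _ in 0..iters {
                    lock.lock();
                    let v = counter.load(Ordering::Relaxed);
                    counter.store(v + 1, Ordering::Relaxed);
                    lock.unlock();
                }
            });
        }
    });
    counter.load(Ordering::Relaxed)
}
```

Calling `contended_count(4, 10_000)` runs four threads that each take the lock 10000 times. With short critical sections like this one, most acquisitions should succeed during the spin phase and never reach the yield call, which is the behaviour the post argues for.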
- June 15, 2020
- 20 replies
- - fastmm5
