GPU FP64, AVX-512, AffinityWatcher and the Lasso process.


ahorek's team
Volunteer developer
Volunteer tester

Joined: 1 Jan 13
Posts: 90
Credit: 10,400,371
RAC: 8,100
Message 8567 - Posted: 6 Sep 2024, 17:07:32 UTC
With 65 W ECO mode, my clock speeds are:
Zen 5: 2662 MHz
Zen 4: 3417 MHz

Zen 5 has a lower clock speed but can process more work per clock thanks to 1x512b (Zen 5) vs 2x256b (Zen 4). Power is the limiting factor here. Without that restriction, Zen 5 is faster, since it can execute 1x512b operations while still maintaining relatively high frequencies.

In theory, a vector that is 2 times wider should let the CPU process tasks 2 times faster, but that is achievable only in very synthetic benchmarks. With real apps, more factors come into play, such as data structures, memory, and the type of work. This is why some applications gain significant performance boosts from AVX-512 while others see no improvement, even if they are optimized for it. If the loss from lower clock speeds outweighs the gain from AVX-512, overall performance may even be worse.
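
To put rough numbers on this trade-off, here is a minimal back-of-the-envelope sketch in Python. It uses the two ECO-mode clocks quoted above; the per-clock gain values are hypothetical inputs, not measurements, and the model (throughput = clock x work per clock) deliberately ignores memory, data layout and the other real-world factors.

```python
# Back-of-the-envelope model: throughput ~ clock * work done per clock.
# Clocks are the 65 W ECO-mode values quoted in this post; the
# per-clock gains below are hypothetical, not measured.

ZEN5_MHZ = 2662
ZEN4_MHZ = 3417


def relative_speed(per_clock_gain: float) -> float:
    """Zen 5 speed relative to Zen 4 (1.0 = equal) for a given
    per-clock gain of Zen 5 over Zen 4."""
    return (ZEN5_MHZ / ZEN4_MHZ) * per_clock_gain


def break_even_gain() -> float:
    """Per-clock gain Zen 5 needs just to match Zen 4 at these clocks."""
    return ZEN4_MHZ / ZEN5_MHZ


if __name__ == "__main__":
    print(f"break-even per-clock gain: {break_even_gain():.2f}x")
    for gain in (1.0, 1.3, 2.0):  # hypothetical per-clock gains
        print(f"gain {gain:.1f}x -> relative speed {relative_speed(gain):.2f}x")
```

At these clocks, Zen 5 needs roughly 1.28x more work per clock just to break even, and even the ideal 2x case would come out only about 1.56x faster under this simplified model.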

Btw, there's an upcoming patch for GCC 15/14 (https://www.phoronix.com/news/GCC-15-Lands-More-Zen-5-Tuning) that could help extract slightly more performance from Zen 5 compared to the stock app.
chr80

Joined: 24 Jun 24
Posts: 12
Credit: 34,542
RAC: 20
Message 8601 - Posted: 18 Oct 2024, 23:52:14 UTC
What is the optimal block size ("Block dim: 128") for Direct GMA on the GPU?
ahorek's team
Volunteer developer
Volunteer tester

Joined: 1 Jan 13
Posts: 90
Credit: 10,400,371
RAC: 8,100
Message 8602 - Posted: 19 Oct 2024, 16:05:36 UTC - in response to Message 8601.  
Block dimension is a fixed value (currently hardcoded to 128) that determines how tasks are allocated across the GPU's resources. It has nothing to do with dGMA.
Maybe Direct GMA could enhance performance, but the code would have to be optimized for it, and it appears to be limited to professional cards like AMD FirePro. That restricts the number of users who could benefit from it (and I don't have such a card).
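
As a rough illustration of what a block dimension controls, the sketch below splits a number of work items into blocks of a given size, which is how GPU launches are typically organised. This shows the general technique only and is not the app's actual launch code; `total_items`, `blocks_needed` and `idle_slots` are hypothetical names, and the item count is made up.

```python
# Generic illustration only: how a block dimension partitions work.
# Not taken from the application's source; names and the item count
# are hypothetical.


def blocks_needed(total_items: int, block_dim: int) -> int:
    """Number of blocks needed to cover total_items (ceiling division)."""
    if total_items < 0 or block_dim <= 0:
        raise ValueError("total_items must be >= 0 and block_dim > 0")
    return (total_items + block_dim - 1) // block_dim


def idle_slots(total_items: int, block_dim: int) -> int:
    """Threads in the last block that have no item to process."""
    return blocks_needed(total_items, block_dim) * block_dim - total_items


if __name__ == "__main__":
    total_items = 1000  # hypothetical amount of work
    for block_dim in (64, 128):
        print(block_dim, blocks_needed(total_items, block_dim),
              idle_slots(total_items, block_dim))
```

A smaller block dimension means more, smaller blocks covering the same work. Whether that helps or hurts depends on the GPU, which is why the value may be worth testing per card.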
chr80

Joined: 24 Jun 24
Posts: 12
Credit: 34,542
RAC: 20
Message 8603 - Posted: 20 Oct 2024, 19:43:02 UTC - in response to Message 8602.  
I have one like that, but setting the value to 64 makes the tasks compute the fastest. https://asteroidsathome.net/boinc/show_host_detail.php?hostid=783761
FirePro W8100 8GB: you can buy a used one on eBay for $130-350.
ahorek's team
Volunteer developer
Volunteer tester

Joined: 1 Jan 13
Posts: 90
Credit: 10,400,371
RAC: 8,100
Message 8604 - Posted: 20 Oct 2024, 21:22:19 UTC - in response to Message 8603.  
Hmm, how much faster? I've tested the same WU on a Radeon 7900 XTX (@ 2 GHz), and the difference is within the margin of error.
Block dim: 64 - 7m 27s
Block dim: 128 - 7m 28s (default)
Note that asteroid tasks aren't all the same, so to get a valid comparison you have to test with the same WU.

Indeed, tuning the parameter could have an effect on certain GPUs (better or worse), but I don't expect it to lead to any substantial improvement.
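
For anyone repeating this comparison on their own card, a minimal Python sketch for turning two run times of the same WU into a relative difference might look like this. The helper names are hypothetical, and it assumes durations are written in the "7m 27s" form used above.

```python
import re


def parse_duration(text: str) -> int:
    """Convert a string like '7m 27s' to whole seconds."""
    match = re.fullmatch(r"\s*(\d+)m\s*(\d+)s\s*", text)
    if match is None:
        raise ValueError(f"unrecognised duration: {text!r}")
    minutes, seconds = map(int, match.groups())
    return minutes * 60 + seconds


def relative_difference(baseline: str, candidate: str) -> float:
    """Fractional change of candidate vs baseline (negative = faster)."""
    base = parse_duration(baseline)
    return (parse_duration(candidate) - base) / base


if __name__ == "__main__":
    # Block dim 128 (default) as baseline, block dim 64 as candidate.
    diff = relative_difference("7m 28s", "7m 27s")
    print(f"{diff:+.2%}")
```

For the two runs above, the difference is 1 second out of 448, or about 0.2%, which is well within normal run-to-run variation.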
