GPU FP64, AVX-512, AffinityWatcher and the Lasso process.


Message boards : Number crunching : GPU FP64, AVX-512, AffinityWatcher and the Lasso process.

chr80

Joined: 24 Jun 24
Posts: 12
Credit: 34,542
RAC: 27
Message 8515 - Posted: 6 Aug 2024, 16:45:06 UTC
1. Are there GPU applications in Asteroids@home or other BOINC projects that use FP64 (double precision)? I ask because I have 3 x Tesla K80, a FirePro W8100, and 2 x Instinct MI8.

2. Are there Asteroids@home applications using AVX-512 in the CPU?

And is it better if the processor supports more of the AVX-512 instruction subsets, e.g. the mobile i7-11850H CPU?

3. Are AffinityWatcher and the Lasso process helpful in Asteroids@home?
ID: 8515 · Rating: 0
ahorek's team
Volunteer developer
Volunteer tester

Joined: 1 Jan 13
Posts: 90
Credit: 10,373,103
RAC: 7,349
Message 8516 - Posted: 6 Aug 2024, 18:28:22 UTC
1/ yes, the app relies heavily on FP64, and I believe Einstein also uses FP64, but to a lesser extent.

2/ yes, the app leverages SIMD instructions on compatible processors, including AVX-512 for x86 and ASIMD for ARM architectures.

> And is it better if the processor supports more of the AVX-512 instruction subsets

We utilize the AVX-512F and AVX-512DQ subsets. If the CPU lacks support for these, we fall back to using FMA (AVX). While AVX-512 includes additional subsets for specialized tasks such as AI or encryption, the app does not gain any advantages from them.

https://en.wikipedia.org/wiki/AVX-512
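
The fallback described above can be pictured with a small sketch. This is not the app's actual dispatch code; the function names and the "generic" tier are hypothetical, and the flag names are the ones Linux reports in /proc/cpuinfo.

```python
def pick_code_path(cpu_flags):
    """Return the fastest code path the given CPU flags allow (illustrative only)."""
    flags = set(cpu_flags)
    # Both subsets named in the post are required for the AVX-512 path.
    if {"avx512f", "avx512dq"} <= flags:
        return "avx512"
    # Otherwise fall back to FMA on top of AVX.
    if {"avx", "fma"} <= flags:
        return "fma"
    # Hypothetical last resort; the post does not describe further tiers.
    return "generic"


def read_linux_cpu_flags(path="/proc/cpuinfo"):
    """Read the flag list of the first CPU from /proc/cpuinfo (x86 Linux only)."""
    try:
        with open(path) as f:
            for line in f:
                if line.startswith("flags"):
                    return line.split(":", 1)[1].split()
    except OSError:
        pass
    return []
```

Calling `pick_code_path(read_linux_cpu_flags())` on an x86 Linux machine shows which tier that CPU would land in under these assumptions. A CPU that reports `avx512f` but not `avx512dq` lands on the FMA path.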

3/ PrimeGrid apps depend heavily on cache, and setting up affinity can help reduce overhead between chiplets on AMD CPUs or between P/E cores on Intel CPUs. However, I don't anticipate significant improvements for the Asteroids app.
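
For readers who want to experiment with affinity anyway, here is a minimal sketch of what tools like AffinityWatcher or Process Lasso do, assuming Linux and assuming that logical core IDs are numbered consecutively within a chiplet (this is not true on every system, so check your topology first). The helper names are hypothetical.

```python
import os


def chunk_cores(core_ids, cores_per_group):
    """Split logical core IDs into consecutive groups, e.g. one group per chiplet."""
    ids = sorted(core_ids)
    return [ids[i:i + cores_per_group] for i in range(0, len(ids), cores_per_group)]


def pin_to_group(pid, group):
    """Pin a process to one group of cores; returns False where unsupported."""
    if not hasattr(os, "sched_setaffinity"):
        return False  # os.sched_setaffinity is not available on Windows or macOS
    os.sched_setaffinity(pid, set(group))
    return True
```

For a 16-thread CPU with two assumed groups of 8, `chunk_cores(range(16), 8)` gives two lists, and `pin_to_group(0, groups[0])` would keep the current process on the first group.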
ID: 8516 · Rating: 0
chr80

Joined: 24 Jun 24
Posts: 12
Credit: 34,542
RAC: 27
Message 8517 - Posted: 6 Aug 2024, 21:01:06 UTC - in response to Message 8516.  
Thank you for your quick and extensive response. This kind of knowledge is hidden from me and hard to come by. I am an amateur interested in computers and the resources of the BOINC platform. Best regards.
ID: 8517 · Rating: 0
Ian&Steve C.
Volunteer developer
Volunteer tester

Joined: 23 Apr 21
Posts: 85
Credit: 115,140,164
RAC: 204,066
Message 8518 - Posted: 7 Aug 2024, 0:26:11 UTC - in response to Message 8517.  
to add on, even though the app makes heavy use of FP64 for GPU, it is not the determining factor for performance. i.e., a GPU with twice the FP64 FLOPS will not necessarily be twice as fast in processing Asteroids tasks. 4090s are still faster than Titan Vs, despite the Titan V being many times faster in FP64 compute.

memory speed/bandwidth seems to be the biggest limiting factor.
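
One common way to reason about this is the roofline model, which is brought in here as an illustration and is not something the posters used. It says attainable throughput is the lower of the compute ceiling and the memory ceiling. The device numbers and the 0.5 FLOP-per-byte intensity below are round, made-up values, not measurements of any real card or of the Asteroids app.

```python
def attainable_gflops(peak_gflops, bandwidth_gbs, flops_per_byte):
    """Roofline estimate: the lower of the compute ceiling and the memory ceiling."""
    return min(peak_gflops, bandwidth_gbs * flops_per_byte)


# Illustrative round numbers only (hypothetical devices).
high_fp64 = attainable_gflops(peak_gflops=7000, bandwidth_gbs=650, flops_per_byte=0.5)
high_bw = attainable_gflops(peak_gflops=1300, bandwidth_gbs=1000, flops_per_byte=0.5)

print(high_fp64, high_bw)  # 325.0 500.0
```

Under these assumptions the device with the lower FP64 peak still comes out ahead, because both are capped by memory bandwidth long before they reach their compute ceiling.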

ID: 8518 · Rating: 0
chr80

Joined: 24 Jun 24
Posts: 12
Credit: 34,542
RAC: 27
Message 8519 - Posted: 7 Aug 2024, 6:18:43 UTC - in response to Message 8518.  
Any additions are welcome.
Are you talking about VRAM or RAM speed?
ID: 8519 · Rating: 0
Ian&Steve C.
Volunteer developer
Volunteer tester

Joined: 23 Apr 21
Posts: 85
Credit: 115,140,164
RAC: 204,066
Message 8520 - Posted: 7 Aug 2024, 13:44:06 UTC - in response to Message 8519.  
> Any additions are welcome.
> Are you talking about VRAM or RAM speed?


for the GPU app I'm talking about the VRAM.

but the same concept applies to the CPU app too, with RAM speed/bandwidth having a big impact on the performance.

ID: 8520 · Rating: 0
chr80

Joined: 24 Jun 24
Posts: 12
Credit: 34,542
RAC: 27
Message 8522 - Posted: 7 Aug 2024, 21:39:16 UTC
Does the L4 cache in, e.g., the Intel Xeon E3-1505M v5 processor provide any advantage in BOINC calculations?
ID: 8522 · Rating: 0
ahorek's team
Volunteer developer
Volunteer tester

Joined: 1 Jan 13
Posts: 90
Credit: 10,373,103
RAC: 7,349
Message 8523 - Posted: 8 Aug 2024, 11:01:34 UTC - in response to Message 8522.  
Some specific BOINC apps involve large datasets that do not fit entirely within the smaller (and faster) L1, L2, or L3 caches. They may slightly benefit from an additional L4 or X3D cache.
However, the primary advantage of the L4 cache (Skylake) is that it allows shared access between the CPU and the integrated GPU, offering faster data access than the main memory (RAM).
ID: 8523 · Rating: 0
chr80

Joined: 24 Jun 24
Posts: 12
Credit: 34,542
RAC: 27
Message 8532 - Posted: 17 Aug 2024, 14:30:46 UTC
Can the Xeon Phi 7120A or other models from its family be used here or in other BOINC projects?
ID: 8532 · Rating: 0
ahorek's team
Volunteer developer
Volunteer tester

Joined: 1 Jan 13
Posts: 90
Credit: 10,373,103
RAC: 7,349
Message 8533 - Posted: 17 Aug 2024, 15:45:58 UTC - in response to Message 8532.  
see https://asteroidsathome.net/boinc/forum_thread.php?id=228

It's possible, but no BOINC projects currently offer apps directly supporting this architecture. The code has to be recompiled for this platform, which shouldn't take much effort, but I don't have any Xeon Phi HW to test it on. If any volunteer is willing to add support, let me know.
ID: 8533 · Rating: 0
Ian&Steve C.
Volunteer developer
Volunteer tester

Joined: 23 Apr 21
Posts: 85
Credit: 115,140,164
RAC: 204,066
Message 8534 - Posted: 18 Aug 2024, 0:52:29 UTC - in response to Message 8533.  

Last modified: 18 Aug 2024, 0:57:37 UTC
> see https://asteroidsathome.net/boinc/forum_thread.php?id=228
>
> It's possible, but no BOINC projects currently offer apps directly supporting this architecture. The code has to be recompiled for this platform, which shouldn't take much effort, but I don't have any Xeon Phi HW to test it on. If any volunteer is willing to add support, let me know.


I know someone who actually already ported it for Xeon Phi. let me point them to this thread. but I don't know how to get BOINC to detect it and use it properly. but they were able to run the app offline at least.

spoiler: it's pretty slow. i think a more modern CPU will still be a few times more productive.

ID: 8534 · Rating: 0
ahorek's team
Volunteer developer
Volunteer tester

Joined: 1 Jan 13
Posts: 90
Credit: 10,373,103
RAC: 7,349
Message 8535 - Posted: 18 Aug 2024, 1:12:51 UTC
at least BOINC does support the platform:
x86_64_phi-pc-linux-gnu Linux running on a Xeon Phi

There is a GCC cross-compiler available at https://github.com/apc-llc/gcc-5.1.1-knc, but it appears to be outdated. Most people suggest using an Intel compiler instead.

> spoiler: it's pretty slow. i think a more modern CPU will still be a few times more productive.
they have many cores, but these are weak Atoms with low clock speeds, consuming 300W of power. In most scenarios, a traditional GPU would be a better choice when considering both cost and efficiency. But if someone already has an unused piece lying around, why not put it to work crunching Asteroids? :)
ID: 8535 · Rating: 0
chr80

Joined: 24 Jun 24
Posts: 12
Credit: 34,542
RAC: 27
Message 8536 - Posted: 18 Aug 2024, 10:34:25 UTC

Last modified: 18 Aug 2024, 11:26:06 UTC
I haven't found any projects with applications that support these cards. The Phi does support OpenCL, though, so you could rewrite the OpenCL application to offload to the Phi card and provide the necessary driver translations. I suspect that such an endeavor would probably be unique.

Now, the x200 versions in the LGA3467 socket can be run as the main system processor, just like any other computer. They can run the OS and BOINC natively, and even support AVX512. Reports seem to indicate that performance is pretty poor, but you can get a board and a 7250 for very little money, and you may not even need to add RAM (16GB in the processor).
The slowness/poor performance may be due to the small amount of L2 cache - 34MB, and the turbo mode only works when two cores are running.
https://www.cpu-world.com/CPUs/Xeon_Phi/Intel-Xeon%20Phi%207250.html

Could code optimization, and potentially adapting existing applications to the specific features of an architecture such as Xeon Phi, be done using e.g. ChatGPT?
ID: 8536 · Rating: 0
ahorek's team
Volunteer developer
Volunteer tester

Joined: 1 Jan 13
Posts: 90
Credit: 10,373,103
RAC: 7,349
Message 8537 - Posted: 18 Aug 2024, 13:23:23 UTC
The key advantage of these devices is their ability to run regular x86 code, even though recompilation is still necessary, making them likely a better choice than OpenCL. Regarding AVX512, their initial implementation isn't fully compatible with later processors and lacks support for AVX512DQ, which the Asteroids app depends on, but it's not a big deal...
ID: 8537 · Rating: 0
ahorek's team
Volunteer developer
Volunteer tester

Joined: 1 Jan 13
Posts: 90
Credit: 10,373,103
RAC: 7,349
Message 8539 - Posted: 18 Aug 2024, 13:46:29 UTC
> adapting existing applications to the specific features of this architecture such as Xeon Phi be done using e.g. chatGPT?
further optimizations will offer minimal gains compared to the current FMA implementation. No one is going to invest time optimizing the code for a niche and abandoned platform just to extract a small additional performance increase.

ChatGPT can provide some guidance, but you still need to know what you're doing. It can't magically rewrite the code for a different platform. The results are unlikely to work.

If you have access to the actual hardware and are familiar with the platform, I can either attempt to build a binary for you or assist you in doing it yourself.
ID: 8539 · Rating: 0
chr80

Joined: 24 Jun 24
Posts: 12
Credit: 34,542
RAC: 27
Message 8553 - Posted: 31 Aug 2024, 11:34:52 UTC
What is the FFT size for a task using L3 cache for CPU and GPU in Asteroids@Home?
ID: 8553 · Rating: 0
ahorek's team
Volunteer developer
Volunteer tester

Joined: 1 Jan 13
Posts: 90
Credit: 10,373,103
RAC: 7,349
Message 8554 - Posted: 31 Aug 2024, 13:28:33 UTC - in response to Message 8553.  
For applications that use the FFT algorithm, like Einstein@Home or PrimeGrid, it's easier to calculate cache requirements mathematically because the workload size is fixed.
If you want to know more about what it does, I would recommend watching https://www.youtube.com/watch?v=nmgFG7PUHfo

However, the Asteroids app does not use the FFT algorithm. Its cache requirements can vary due to several factors, such as the size of the tasks and the varying data it processes during computation. My rough estimate is that 0.5 MB per CPU task should be enough. Overall, the Asteroids app is definitely much less demanding on the L3 cache compared to, for example, PrimeGrid apps.
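
As a quick back-of-the-envelope check, the rough 0.5 MB per task figure can be multiplied by the number of concurrent tasks and compared with the L3 size. The sketch below does only that arithmetic; the 32-task, 64 MB machine is a hypothetical example, and the per-task figure is the rough estimate from the post, not a measured value.

```python
def l3_demand_mb(n_tasks, mb_per_task=0.5):
    """Estimated combined cache footprint of n_tasks concurrent CPU tasks, in MB."""
    return n_tasks * mb_per_task


def fits_in_l3(n_tasks, l3_mb, mb_per_task=0.5):
    """True if the estimated footprint does not exceed the given L3 size."""
    return l3_demand_mb(n_tasks, mb_per_task) <= l3_mb


# Hypothetical machine: 32 concurrent tasks sharing 64 MB of L3.
print(l3_demand_mb(32))    # 16.0
print(fits_in_l3(32, 64))  # True
```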
ID: 8554 · Rating: 0
FritzB

Joined: 9 May 13
Posts: 6
Credit: 9,233,789
RAC: 10,539
Message 8564 - Posted: 6 Sep 2024, 7:37:25 UTC - in response to Message 8554.  
The AVX512 unit in ZEN5 is supposed to be significantly more powerful than in ZEN4. In my test of the 7950X vs. the 9950X, both in 105W ECO mode, the 9950X was only as fast as the 7950X, and often even 1-2 minutes slower. The result was the same on Windows 11 (with the new AMD patch) and on Linux Mint 22. Does the app require special optimization for ZEN5, or do the improvements simply not work for BOINC projects?

NFS@Home also has an AVX512 app. It behaves the same there.
ID: 8564 · Rating: 0
Ian&Steve C.
Volunteer developer
Volunteer tester

Joined: 23 Apr 21
Posts: 85
Credit: 115,140,164
RAC: 204,066
Message 8565 - Posted: 6 Sep 2024, 14:26:48 UTC - in response to Message 8564.  
> The AVX512 unit in ZEN5 is supposed to be significantly more powerful than in ZEN4. In my test of the 7950X vs. the 9950X, both in 105W ECO mode, the 9950X was only as fast as the 7950X, and often even 1-2 minutes slower. The result was the same on Windows 11 (with the new AMD patch) and on Linux Mint 22. Does the app require special optimization for ZEN5, or do the improvements simply not work for BOINC projects?
>
> NFS@Home also has an AVX512 app. It behaves the same there.


just having the app compiled to use the avx512 pipeline won't automatically mean faster processing. the data needs to be packaged appropriately to truly increase throughput. the avx512 app here is only a little bit faster than the avx/fma app, which tells you that the data isn't structured in a way that takes much advantage of it.

imagine you have a conveyor of buckets moving water from one place to another. the conveyor moves at a constant speed.
at first your buckets are too small, and some water overflows from them, restricting how much water is moved at a time.
then you increase the bucket size. now the buckets are not overflowing and you move water at the max rate, since you are limited only by how much you put in each bucket.
then you double the size of the bucket. this does nothing for overall throughput, since you are still only putting the same amount in each bucket.

that's essentially what's happening here.
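
The bucket analogy can be put into numbers with a deliberately simplified model: the useful work per vector instruction is capped both by the number of FP64 lanes (the bucket size) and by how many independent values the data layout can supply at once (the water put into each bucket). The function name and the `ready_elements` parameter are hypothetical, and the model ignores clock speeds, execution ports and memory effects.

```python
def elements_per_instruction(vector_lanes, ready_elements):
    """Values processed per vector instruction in this toy model."""
    return min(vector_lanes, ready_elements)


AVX_FP64_LANES = 4     # 256-bit register / 64-bit doubles
AVX512_FP64_LANES = 8  # 512-bit register / 64-bit doubles

for ready in (2, 4, 8):
    avx = elements_per_instruction(AVX_FP64_LANES, ready)
    avx512 = elements_per_instruction(AVX512_FP64_LANES, ready)
    print(ready, avx, avx512)
```

In this model the wider registers only pay off in the last case, where the data supplies 8 values at a time; with 4 or fewer, both widths process the same amount per instruction.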

ID: 8565 · Rating: 0
FritzB

Joined: 9 May 13
Posts: 6
Credit: 9,233,789
RAC: 10,539
Message 8566 - Posted: 6 Sep 2024, 15:32:01 UTC - in response to Message 8565.  
Thanks for the explanation!
ID: 8566 · Rating: 0