GPU FP64, AVX-512, AffinityWatcher and the Lasso process.
Author | Message |
---|---|
Send message Joined: 24 Jun 24 Posts: 12 Credit: 34,542 RAC: 27 |
1. Are there GPU applications in Asteroids@home or other BOINC projects that use FP64 (double precision)? I ask because I have 3x Tesla K80, a FirePro W8100, and 2x Instinct MI8. 2. Are there Asteroids@home applications that use AVX-512 on the CPU? And is a processor better the more AVX-512 subsets it supports, e.g. the mobile i7-11850H? 3. Are AffinityWatcher and the Lasso process helpful for Asteroids@home? |
Send message Joined: 1 Jan 13 Posts: 90 Credit: 10,372,737 RAC: 7,332 |
1/ Yes, the app relies heavily on FP64, and I believe Einstein also uses FP64, but to a lesser extent. 2/ Yes, the app leverages SIMD instructions on compatible processors, including AVX-512 on x86 and ASIMD on ARM. > And is a processor better the more AVX-512 subsets it supports? We use the AVX-512F and AVX-512DQ subsets. If the CPU lacks support for these, we fall back to FMA (AVX). While AVX-512 includes additional subsets for specialized tasks such as AI or encryption, the app does not gain any advantage from them. https://en.wikipedia.org/wiki/AVX-512 3/ PrimeGrid apps depend heavily on cache, and setting affinity can help reduce overhead between chiplets on AMD CPUs or between P/E cores on Intel CPUs. However, I don't anticipate significant improvements for the Asteroids app. |
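The fallback order described in point 2/ can be sketched roughly as follows. This is only an illustration, not the app's actual dispatch code: the function names, the path labels ("avx512", "fma", "generic") and the final generic fallback are assumptions. The flag names avx512f, avx512dq and fma are the ones Linux reports in /proc/cpuinfo on x86.

```python
def select_code_path(cpu_flags):
    """Pick a code path from a collection of CPU feature flags.

    Hypothetical helper: the labels it returns are illustrative only.
    """
    flags = {f.lower() for f in cpu_flags}
    # Both subsets are required, as stated in the post above.
    if {"avx512f", "avx512dq"} <= flags:
        return "avx512"
    # Otherwise fall back to the FMA (AVX) path.
    if "fma" in flags:
        return "fma"
    # Assumption: some plain non-SIMD path exists as a last resort.
    return "generic"


def read_linux_cpu_flags(path="/proc/cpuinfo"):
    """Return the flag set from the first 'flags' line, or an empty set."""
    try:
        with open(path) as fh:
            for line in fh:
                if line.startswith("flags"):
                    return set(line.split(":", 1)[1].split())
    except OSError:
        pass
    return set()


if __name__ == "__main__":
    print(select_code_path(read_linux_cpu_flags()))
```

A CPU that reports avx512f but not avx512dq ends up on the "fma" path in this sketch, which is why extra AVX-512 subsets beyond F and DQ make no difference here.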
Send message Joined: 24 Jun 24 Posts: 12 Credit: 34,542 RAC: 27 |
|
Send message Joined: 23 Apr 21 Posts: 85 Credit: 115,134,649 RAC: 204,024 |
To add on: even though the app makes heavy use of FP64 on the GPU, it is not the determining factor for performance. I.e., a GPU with twice the FP64 FLOPS will not necessarily be twice as fast at processing Asteroids tasks. 4090s are still faster than Titan Vs, despite the Titan V being many times faster in FP64 compute. Memory speed/bandwidth seems to be the biggest limiting factor. |
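A simplified roofline-style estimate shows why doubling FP64 throughput may not help when memory bandwidth is the limit. All numbers below are made up for illustration and are not measurements of any real GPU or of the Asteroids app. The model itself (run time is the larger of compute time and memory time) is an assumption.

```python
def estimated_time_s(fp64_flop, bytes_moved, peak_fp64_flops, mem_bandwidth_bps):
    """Rough lower bound on run time: the slower of compute and memory."""
    compute_time = fp64_flop / peak_fp64_flops
    memory_time = bytes_moved / mem_bandwidth_bps
    return max(compute_time, memory_time)


# Hypothetical workload: 1e12 FP64 operations, 4e12 bytes of memory traffic.
WORK_FLOP = 1e12
WORK_BYTES = 4e12

# Hypothetical GPUs (FP64 FLOPS, bytes per second).
baseline = estimated_time_s(WORK_FLOP, WORK_BYTES, 1e12, 1e12)
double_fp64 = estimated_time_s(WORK_FLOP, WORK_BYTES, 2e12, 1e12)
double_bandwidth = estimated_time_s(WORK_FLOP, WORK_BYTES, 1e12, 2e12)

print(baseline, double_fp64, double_bandwidth)
```

In this toy case doubling the FP64 rate leaves the estimate unchanged, while doubling the bandwidth halves it.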
Send message Joined: 24 Jun 24 Posts: 12 Credit: 34,542 RAC: 27 |
|
Send message Joined: 23 Apr 21 Posts: 85 Credit: 115,134,649 RAC: 204,024 |
|
Send message Joined: 24 Jun 24 Posts: 12 Credit: 34,542 RAC: 27 |
|
Send message Joined: 1 Jan 13 Posts: 90 Credit: 10,372,737 RAC: 7,332 |
Some specific BOINC apps involve large datasets that do not fit entirely within the smaller (and faster) L1, L2, or L3 caches. They may slightly benefit from an additional L4 or X3D cache. However, the primary advantage of the L4 cache (Skylake) is that it allows shared access between the CPU and the integrated GPU, offering faster data access than the main memory (RAM). |
Send message Joined: 24 Jun 24 Posts: 12 Credit: 34,542 RAC: 27 |
|
Send message Joined: 1 Jan 13 Posts: 90 Credit: 10,372,737 RAC: 7,332 |
see https://asteroidsathome.net/boinc/forum_thread.php?id=228 It's possible, but no BOINC projects currently offer apps directly supporting this architecture. The code has to be recompiled for this platform, which shouldn't take much effort, but I don't have any Xeon Phi HW to test it on. If any volunteer is willing to add support, let me know. |
Send message Joined: 23 Apr 21 Posts: 85 Credit: 115,134,649 RAC: 204,024 |
Last modified: 18 Aug 2024, 0:57:37 UTC > see https://asteroidsathome.net/boinc/forum_thread.php?id=228 I know someone who actually already ported it for Xeon Phi. Let me point them to this thread. I don't know how to get BOINC to detect it and use it properly, but they were able to run the app offline at least. spoiler: it's pretty slow. i think a more modern CPU will still be a few times more productive. |
Send message Joined: 1 Jan 13 Posts: 90 Credit: 10,372,737 RAC: 7,332 |
At least BOINC does support the platform x86_64_phi-pc-linux-gnu (Linux running on a Xeon Phi). There is a GCC cross-compiler available at https://github.com/apc-llc/gcc-5.1.1-knc, but it appears to be outdated. Most people suggest using an Intel compiler instead. > spoiler: it's pretty slow. i think a more modern CPU will still be a few times more productive. They have many cores, but these are weak Atoms with low clock speeds, consuming 300W of power. In most scenarios, a traditional GPU would be a better choice when considering both cost and efficiency. But if someone already has an unused piece lying around, why not put it to work crunching Asteroids? :) |
Send message Joined: 24 Jun 24 Posts: 12 Credit: 34,542 RAC: 27 |
Last modified: 18 Aug 2024, 11:26:06 UTC I haven't found any projects with applications that support these cards, although the Phi supports OpenCL, so you could rewrite the OCL application to offload to the Phi card and provide the necessary driver translations. I suspect that such an endeavor would probably be unique. Now, the x200 versions in the LGA3467 socket can be run as the main system processor, just like any other computer. They can run the OS and BOINC natively, and even support AVX512. Reports seem to indicate that performance is pretty poor, but you can get a board and a 7250 for very little money, and you may not even need to add RAM (16GB in the processor). The poor performance may be due to the small amount of L3 cache (34MB) and to turbo mode only working when two cores are running. https://www.cpu-world.com/CPUs/Xeon_Phi/Intel-Xeon%20Phi%207250.html Could code optimization, and potentially adapting existing applications to the specific features of an architecture such as Xeon Phi, be done using e.g. ChatGPT? |
Send message Joined: 1 Jan 13 Posts: 90 Credit: 10,372,737 RAC: 7,332 |
The key advantage of these devices is their ability to run regular x86 code, even though recompilation is still necessary, making them likely a better choice than OpenCL. Regarding AVX512, their initial implementation isn't fully compatible with later processors and lacks support for AVX512DQ, which the Asteroids app depends on, but it's not a big deal...
|
Send message Joined: 1 Jan 13 Posts: 90 Credit: 10,372,737 RAC: 7,332 |
> adapting existing applications to the specific features of an architecture such as Xeon Phi, be done using e.g. ChatGPT? Further optimizations will offer minimal gains compared to the current FMA implementation. No one is going to invest time optimizing the code for a niche and abandoned platform just to extract a small additional performance increase. ChatGPT can provide some guidance, but you still need to know what you're doing. It can't magically rewrite the code for a different platform, and the results are unlikely to work. If you have access to the actual hardware and are familiar with the platform, I can either attempt to build a binary for you or assist you in doing it yourself. |
Send message Joined: 24 Jun 24 Posts: 12 Credit: 34,542 RAC: 27 |
|
Send message Joined: 1 Jan 13 Posts: 90 Credit: 10,372,737 RAC: 7,332 |
For applications that use the FFT algorithm, like Einstein@Home or PrimeGrid, it's easier to calculate cache requirements mathematically because the workload size is fixed. If you want to know more about what it does, I would recommend watching https://www.youtube.com/watch?v=nmgFG7PUHfo However, the Asteroids app does not use the FFT algorithm. Its cache requirements can vary due to several factors, such as the size of the tasks and the varying data it processes during computation. My rough estimate is that 0.5 MB per CPU task should be enough. Overall, the Asteroids app is definitely much less demanding on the L3 cache than, for example, PG apps. |
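Taking the rough 0.5 MB per CPU task figure at face value, the L3 budget for a host is simple arithmetic. The sketch below assumes the per-task footprints simply add up, which is a simplification. The task counts and cache sizes in the example are hypothetical.

```python
def l3_budget_mb(n_tasks, per_task_mb=0.5):
    """Estimated L3 footprint in MB for n_tasks concurrent CPU tasks."""
    return n_tasks * per_task_mb


def fits_in_l3(n_tasks, l3_size_mb, per_task_mb=0.5):
    """True if the estimated footprint fits within the given L3 size."""
    return l3_budget_mb(n_tasks, per_task_mb) <= l3_size_mb


# Hypothetical host: 32 concurrent tasks, 64 MB of L3.
print(l3_budget_mb(32), fits_in_l3(32, 64))
```

With these example numbers the estimate is 16 MB, well within the hypothetical 64 MB cache.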
Send message Joined: 9 May 13 Posts: 6 Credit: 9,233,049 RAC: 10,492 |
The AVX512 unit in ZEN5 is supposed to be significantly more powerful than in ZEN4. In my test of 7950X vs. 9950X, both in 105W ECO mode, the 9950X was only as fast as the 7950X, often even 1-2 minutes slower. The result is the same on Windows 11 (with the new AMD patch) as on Linux Mint 22. Does the app require special optimization for ZEN5, or do the improvements simply not work for BOINC projects? NFS@Home also has an AVX512 app. It behaves the same there. |
Send message Joined: 23 Apr 21 Posts: 85 Credit: 115,134,649 RAC: 204,024 |
> The AVX512 unit in ZEN5 is supposed to be significantly more powerful than in ZEN4. In my test of 7950X vs. 9950X, both in 105W ECO mode, the 9950X was only as fast as the 7950X, often even 1-2 minutes slower. The result is the same on Windows 11 (with the new AMD patch) as on Linux Mint 22. Does the app require special optimization for ZEN5, or do the improvements simply not work for BOINC projects? Just having the app compiled to use the avx512 pipeline won't automatically mean faster processing. The data needs to be packaged appropriately to truly increase throughput. The avx512 app here is only a little bit faster than the avx/fma app, which tells you that the data isn't structured in a way that takes much advantage of it. Imagine you have a conveyor of buckets moving water from one place to another. The conveyor moves at a constant speed. At first your buckets are too small, and some water overflows from them, restricting how much water is moved at a time. Then you increase the bucket size, and now the buckets are not overflowing and you can move the max rate of water, since you are limited by how much you fill each bucket. Then you double the size of the bucket again; this does nothing for overall throughput, since you are still only putting the same amount in each bucket. That's essentially what's happening here. |
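The bucket analogy can be put into numbers with a toy model. It assumes, purely for illustration, that the data comes in independent groups of 4 doubles and that one vector instruction can only work on one group at a time. The real data layout of the app is not described in this thread, so the group size is hypothetical. The lane counts are standard: a 128-bit register holds 2 doubles, a 256-bit register 4, and a 512-bit register 8.

```python
import math


def vector_ops_needed(n_groups, group_size, lanes):
    """Vector instructions needed when each group is processed separately."""
    return n_groups * math.ceil(group_size / lanes)


N_GROUPS = 1000   # hypothetical amount of work
GROUP_SIZE = 4    # hypothetical: doubles available per group ("water per bucket")

ops_128 = vector_ops_needed(N_GROUPS, GROUP_SIZE, lanes=2)  # bucket too small
ops_256 = vector_ops_needed(N_GROUPS, GROUP_SIZE, lanes=4)  # bucket just right
ops_512 = vector_ops_needed(N_GROUPS, GROUP_SIZE, lanes=8)  # bucket half empty

print(ops_128, ops_256, ops_512)
```

In this idealized model the 512-bit path needs exactly as many instructions as the 256-bit one, because half of each register stays empty. The real app is reported to be a little faster with avx512, so the model only captures the general effect.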
Send message Joined: 9 May 13 Posts: 6 Credit: 9,233,049 RAC: 10,492 |
|