GPU FP64, AVX-512, AffinityWatcher and the Lasso process.


chr80

Joined: 24 Jun 24
Posts: 12
Credit: 34,542
RAC: 27
Message 8515 - Posted: 6 Aug 2024, 16:45:06 UTC
1. Are there GPU applications in Asteroids@home or other BOINC projects that use FP64 (double precision)? I ask because I have 3x Tesla K80, a FirePro W8100 and 2x Instinct MI8.

2. Are there Asteroids@home applications using AVX-512 in the CPU?

And is it better if the processor supports more instructions from the AVX-512 set, e.g. the mobile i7-11850H?

3. Are AffinityWatcher and the Lasso process helpful in Asteroids@home?
ahorek's team
Volunteer developer
Volunteer tester

Joined: 1 Jan 13
Posts: 90
Credit: 10,372,245
RAC: 7,292
Message 8516 - Posted: 6 Aug 2024, 18:28:22 UTC
1/ yes, the app relies heavily on FP64, and I believe Einstein also uses FP64, but to a lesser extent.

2/ yes, the app leverages SIMD instructions on compatible processors, including AVX-512 for x86 and ASIMD for ARM architectures.

> And is it better if the processor supports more instructions from the AVX-512 set?

We utilize the AVX-512F and AVX-512DQ subsets. If the CPU lacks support for these, we fall back to using FMA (AVX). While AVX-512 includes additional subsets for specialized tasks such as AI or encryption, the app does not gain any advantage from them.

https://en.wikipedia.org/wiki/AVX-512
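In case you're curious what such a runtime fallback typically looks like, here is a minimal sketch (not the project's actual dispatch code) using the GCC/Clang builtin __builtin_cpu_supports(); the kernel functions are just illustrative stubs:

// Minimal sketch of runtime SIMD dispatch: prefer an AVX-512F+DQ kernel,
// otherwise fall back to an FMA (AVX) kernel, otherwise scalar code.
// Not the project's actual code; the kernel names are placeholders.
#include <cstdio>

static void kernel_avx512fdq() { std::puts("using the AVX-512F/DQ path"); }
static void kernel_fma()       { std::puts("using the FMA (AVX) path"); }
static void kernel_scalar()    { std::puts("using the scalar path"); }

int main() {
    if (__builtin_cpu_supports("avx512f") && __builtin_cpu_supports("avx512dq"))
        kernel_avx512fdq();
    else if (__builtin_cpu_supports("fma"))
        kernel_fma();
    else
        kernel_scalar();
    return 0;
}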

3/ PrimeGrid apps depend heavily on cache, and setting up affinity can help reduce overhead between chiplets on AMD CPUs or between P/E cores on Intel CPUs. However, I don't anticipate significant improvements for the Asteroids app.
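If you want to experiment with affinity anyway: such tools essentially just set the process affinity mask, and on Linux the same can be done with taskset or the sched_setaffinity() call. A minimal sketch, with purely illustrative core numbers (logical CPUs 0-7, e.g. one CCD of a dual-CCD Ryzen); build with g++ on Linux:

// Minimal sketch: pin the calling process to logical CPUs 0-7.
// The core numbers are illustrative only.
#include <sched.h>
#include <cstdio>

int main() {
    cpu_set_t set;
    CPU_ZERO(&set);
    for (int cpu = 0; cpu < 8; ++cpu)
        CPU_SET(cpu, &set);                         // allow only CPUs 0-7
    if (sched_setaffinity(0, sizeof(set), &set)) {  // pid 0 = this process
        std::perror("sched_setaffinity");
        return 1;
    }
    std::puts("affinity restricted to CPUs 0-7");
    return 0;
}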
chr80

Joined: 24 Jun 24
Posts: 12
Credit: 34,542
RAC: 27
Message 8517 - Posted: 6 Aug 2024, 21:01:06 UTC - in response to Message 8516.  
Thank you for your quick and extensive response. This kind of knowledge is otherwise hard for me to come by; I am an amateur interested in computers and the resources of the BOINC platform. Best regards.
Ian&Steve C.
Volunteer developer
Volunteer tester
Joined: 23 Apr 21
Posts: 85
Credit: 115,132,100
RAC: 204,030
Message 8518 - Posted: 7 Aug 2024, 0:26:11 UTC - in response to Message 8517.  
To add on: even though the GPU app makes heavy use of FP64, it is not the determining factor for performance. That is, a GPU with twice the FP64 FLOPS will not necessarily be twice as fast at processing Asteroids tasks. 4090s are still faster than Titan Vs, despite the Titan V being many times faster in FP64 compute.

memory speed/bandwidth seems to be the biggest limiting factor.

chr80

Joined: 24 Jun 24
Posts: 12
Credit: 34,542
RAC: 27
Message 8519 - Posted: 7 Aug 2024, 6:18:43 UTC - in response to Message 8518.  
Any additions are welcome.
Are you talking about VRAM or RAM speed?
Ian&Steve C.
Volunteer developer
Volunteer tester
Joined: 23 Apr 21
Posts: 85
Credit: 115,132,100
RAC: 204,030
Message 8520 - Posted: 7 Aug 2024, 13:44:06 UTC - in response to Message 8519.  
> Any additions are welcome.
> Are you talking about VRAM or RAM speed?


for the GPU app I'm talking about the VRAM.

but the same concept applies to the CPU app too, with RAM speed/bandwidth having a big impact on the performance.
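To illustrate why raw FLOPS alone don't decide it, here is a generic sketch (not the Asteroids code) of a memory-bound loop in the STREAM "triad" style: each iteration does one multiply-add but streams 24 bytes of doubles through memory, so bandwidth, not the FP64 units, sets the pace.

// Generic memory-bandwidth-bound loop, not Asteroids code:
// one fused multiply-add per element, but 2 loads + 1 store of 8-byte
// doubles, so memory bandwidth limits throughput long before FP64 does.
#include <cstddef>
#include <cstdio>
#include <vector>

int main() {
    const std::size_t n = std::size_t(1) << 24;     // 16M doubles per array (~128 MB each)
    std::vector<double> a(n, 1.0), b(n, 2.0), c(n);
    const double s = 3.0;
    for (std::size_t i = 0; i < n; ++i)
        c[i] = a[i] + s * b[i];                     // 24 bytes moved per FMA
    std::printf("c[0] = %f\n", c[0]);               // keep the result observable
    return 0;
}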

chr80

Joined: 24 Jun 24
Posts: 12
Credit: 34,542
RAC: 27
Message 8522 - Posted: 7 Aug 2024, 21:39:16 UTC
Does the L4 cache in, e.g., the Intel Xeon E3-1505M v5 processor provide any advantage in BOINC calculations?
ahorek's team
Volunteer developer
Volunteer tester

Joined: 1 Jan 13
Posts: 90
Credit: 10,372,245
RAC: 7,292
Message 8523 - Posted: 8 Aug 2024, 11:01:34 UTC - in response to Message 8522.  
Some specific BOINC apps involve large datasets that do not fit entirely within the smaller (and faster) L1, L2, or L3 caches. They may slightly benefit from an additional L4 or X3D cache.
However, the primary advantage of the L4 cache (Skylake) is that it allows shared access between the CPU and the integrated GPU, offering faster data access than the main memory (RAM).
chr80

Joined: 24 Jun 24
Posts: 12
Credit: 34,542
RAC: 27
Message 8532 - Posted: 17 Aug 2024, 14:30:46 UTC
Can the Xeon Phi 7120A or other models from its family be used here or in other BOINC projects?
ahorek's team
Volunteer developer
Volunteer tester

Joined: 1 Jan 13
Posts: 90
Credit: 10,372,245
RAC: 7,292
Message 8533 - Posted: 17 Aug 2024, 15:45:58 UTC - in response to Message 8532.  
see https://asteroidsathome.net/boinc/forum_thread.php?id=228

It's possible, but no BOINC projects currently offer apps directly supporting this architecture. The code has to be recompiled for this platform, which shouldn't take much effort, but I don't have any Xeon Phi HW to test it on. If any volunteer is willing to add support, let me know.
Ian&Steve C.
Volunteer developer
Volunteer tester
Joined: 23 Apr 21
Posts: 85
Credit: 115,132,100
RAC: 204,030
Message 8534 - Posted: 18 Aug 2024, 0:52:29 UTC - in response to Message 8533.  

Last modified: 18 Aug 2024, 0:57:37 UTC
> see https://asteroidsathome.net/boinc/forum_thread.php?id=228
>
> It's possible, but no BOINC projects currently offer apps directly supporting this architecture. The code has to be recompiled for this platform, which shouldn't take much effort, but I don't have any Xeon Phi HW to test it on. If any volunteer is willing to add support, let me know.


I know someone who has actually already ported it for Xeon Phi; let me point them to this thread. I don't know how to get BOINC to detect it and use it properly, but they were able to run the app offline at least.

Spoiler: it's pretty slow. I think a more modern CPU will still be a few times more productive.

ahorek's team
Volunteer developer
Volunteer tester

Joined: 1 Jan 13
Posts: 90
Credit: 10,372,245
RAC: 7,292
Message 8535 - Posted: 18 Aug 2024, 1:12:51 UTC
At least BOINC does support the platform:
x86_64_phi-pc-linux-gnu (Linux running on a Xeon Phi)
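If the stock client does not report that platform string on its own, it should also be possible (if I remember the cc_config.xml options correctly, so treat this as a hint rather than a recipe) to make it advertise the platform via <alt_platform>:

<cc_config>
  <options>
    <!-- illustrative: ask the client to also report the Xeon Phi platform -->
    <alt_platform>x86_64_phi-pc-linux-gnu</alt_platform>
  </options>
</cc_config>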

There is a GCC cross-compiler available at https://github.com/apc-llc/gcc-5.1.1-knc, but it appears to be outdated. Most people suggest using an Intel compiler instead.

> spoiler: it's pretty slow. i think a more modern CPU will still be a few times more productive.
They have many cores, but these are weak Atom-class cores with low clock speeds, and the card draws around 300 W. In most scenarios, a traditional GPU would be a better choice when considering both cost and efficiency. But if someone already has an unused piece lying around, why not put it to work crunching Asteroids? :)
chr80

Joined: 24 Jun 24
Posts: 12
Credit: 34,542
RAC: 27
Message 8536 - Posted: 18 Aug 2024, 10:34:25 UTC

Last modified: 18 Aug 2024, 11:26:06 UTC
I haven't found any projects with applications that support these cards, although the Phi supports OpenCL, so one could rewrite an OpenCL application to offload work to the Phi card and provide the necessary driver translations. I suspect that such an endeavor would probably be unique.

Now, the x200 versions in the LGA 3647 socket can be run as the main system processor, just like in any other computer. They can run the OS and BOINC natively, and even support AVX-512. Reports seem to indicate that performance is pretty poor, but you can get a board and a 7250 for very little money, and you may not even need to add RAM (16 GB is in the processor package).
The slowness/poor performance may be due to the relatively small 34 MB of cache, and the fact that turbo mode only works when just two cores are running.
https://www.cpu-world.com/CPUs/Xeon_Phi/Intel-Xeon%20Phi%207250.html

Could code optimization, and potentially the adaptation of existing applications to the specific features of an architecture such as Xeon Phi, be done using e.g. ChatGPT?
ahorek's team
Volunteer developer
Volunteer tester

Joined: 1 Jan 13
Posts: 90
Credit: 10,372,245
RAC: 7,292
Message 8537 - Posted: 18 Aug 2024, 13:23:23 UTC
The key advantage of these devices is their ability to run regular x86 code, even though recompilation is still necessary, making them likely a better choice than OpenCL. Regarding AVX512, their initial implementation isn't fully compatible with later processors and lacks support for AVX512DQ, which the Asteroids app depends on, but it's not a big deal...
ahorek's team
Volunteer developer
Volunteer tester

Joined: 1 Jan 13
Posts: 90
Credit: 10,372,245
RAC: 7,292
Message 8539 - Posted: 18 Aug 2024, 13:46:29 UTC
> adapting existing applications to the specific features of this architecture such as Xeon Phi be done using e.g. chatGPT?
further optimizations will offer minimal gains compared to the current FMA implementation. No one is going to invest time optimizing the code for a niche and abandoned platform just to extract a small additional performance increase.

ChatGPT can provide some guidance, but you still need to know what you're doing. It can't magically rewrite the code for a different platform. The results are unlikely to work.

If you have access to the actual hardware and are familiar with the platform, I can either attempt to build a binary for you or assist you in doing it yourself.
chr80

Joined: 24 Jun 24
Posts: 12
Credit: 34,542
RAC: 27
Message 8553 - Posted: 31 Aug 2024, 11:34:52 UTC
What is the FFT size for a task using L3 cache for CPU and GPU in Asteroids@Home?
ahorek's team
Volunteer developer
Volunteer tester

Joined: 1 Jan 13
Posts: 90
Credit: 10,372,245
RAC: 7,292
Message 8554 - Posted: 31 Aug 2024, 13:28:33 UTC - in response to Message 8553.  
For applications that use the FFT algorithm, like Einstein@Home or PrimeGrid, it's easier to calculate cache requirements mathematically because the workload size is fixed.
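As a rough illustration (made-up numbers, not tied to any particular app): a double-precision complex FFT over N = 2^20 points keeps N x 16 bytes = 16 MB of working data live, so two such tasks running side by side already spill out of a typical 32 MB L3.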
If you want to know more about what it does, I would recommend watching https://www.youtube.com/watch?v=nmgFG7PUHfo

However, the Asteroids app does not use the FFT algorithm. Its cache requirements can vary due to several factors, such as the size of the tasks and the varying data processed during computation. My rough estimate is that 0.5 MB per CPU task should be enough. Overall, the Asteroids app is definitely much less demanding on the L3 cache than, for example, the PrimeGrid apps.
FritzB

Joined: 9 May 13
Posts: 6
Credit: 9,232,673
RAC: 10,467
Message 8564 - Posted: 6 Sep 2024, 7:37:25 UTC - in response to Message 8554.  
The AVX-512 unit in Zen 5 is supposed to be significantly more powerful than in Zen 4. In my test of the 7950X vs. the 9950X, both in 105 W ECO mode, the 9950X was only about as fast as the 7950X, often even 1-2 minutes slower. It is the same on Windows 11 (with the new AMD patch) as on Linux Mint 22. Does the app require special optimization for Zen 5, or do the improvements simply not show up in BOINC projects?

NFS@Home also has an AVX512 app. It behaves the same there.
Ian&Steve C.
Volunteer developer
Volunteer tester
Joined: 23 Apr 21
Posts: 85
Credit: 115,132,100
RAC: 204,030
Message 8565 - Posted: 6 Sep 2024, 14:26:48 UTC - in response to Message 8564.  
> The AVX-512 unit in Zen 5 is supposed to be significantly more powerful than in Zen 4. In my test of the 7950X vs. the 9950X, both in 105 W ECO mode, the 9950X was only about as fast as the 7950X, often even 1-2 minutes slower. It is the same on Windows 11 (with the new AMD patch) as on Linux Mint 22. Does the app require special optimization for Zen 5, or do the improvements simply not show up in BOINC projects?
>
> NFS@Home also has an AVX512 app. It behaves the same there.


Just having the app compiled to use the AVX-512 pipeline won't automatically mean faster processing. The data needs to be packaged appropriately to truly increase throughput. The AVX-512 app here is only a little bit faster than the AVX/FMA app, which tells you that the data isn't structured in a way that takes much advantage of it.

Imagine you have a conveyor of buckets moving water from one place to another, and the conveyor moves at a constant speed.
At first your buckets are too small and some water overflows, restricting how much water is moved at a time.
Then you increase the bucket size; now the buckets no longer overflow, and you move water at the maximum rate at which you can fill each bucket.
Then you double the size of the bucket again; this does nothing for overall throughput, since you are still only putting the same amount in each bucket.

That's essentially what's happening here.

FritzB

Joined: 9 May 13
Posts: 6
Credit: 9,232,673
RAC: 10,467
Message 8566 - Posted: 6 Sep 2024, 15:32:01 UTC - in response to Message 8565.  
Thanks for the explanation!