AMD Bulldozer FMA4 app


alexander
Joined: 28 Apr 13
Posts: 87
Credit: 26,716,176
RAC: 151
Message 3012 - Posted: 10 May 2014, 21:46:27 UTC
Crunch3r has PM'ed me the link; the first fma4 wu is running.

alexander
Joined: 28 Apr 13
Posts: 87
Credit: 26,716,176
RAC: 151
Message 3013 - Posted: 11 May 2014, 7:04:52 UTC
First fma4 wu validated.
A10-7700 8,764.78, i7-3770 11,695.72
https://asteroidsathome.net/boinc//workunit.php?wuid=16225158

Looks like the app has no checkpointing; after an NVIDIA lockup I had to restart the PC, and the wu started from 0% but with the time already used.

5 more waiting for validation.

Falconet
Joined: 23 Oct 12
Posts: 18
Credit: 60,508
RAC: 77
Message 3014 - Posted: 11 May 2014, 8:45:59 UTC - in response to Message 3013.  
That is strange. I am fairly sure that mine checkpointed...

alexander
Joined: 28 Apr 13
Posts: 87
Credit: 26,716,176
RAC: 151
Message 3015 - Posted: 11 May 2014, 9:23:08 UTC
Some more validated:
A10-7700   Wingman   Wingman host
8,728      5,830     i5 @ 3.4 GHz, Win7, sse3
8,285      8,817     i7 2600 @ 3.4 GHz, sse2
7,629      16,054    E5-2650 @ 2 GHz, sse2

Earlier wu's were running together with onboard-gpu wu's.
My earlier avx wu's:
9,231
11,216
10,582
10,292
4,062
8,310

my earlier sse2 wu's
9,154
10,370
9,117
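
As a rough aid for reading these figures, the sketch below divides each A10-7700 time by its wingman's time and averages the earlier avx and sse2 runs. The numbers are copied from this post; the helper name `ratio_to_wingman` is made up for illustration, and since the runs cover different WUs the result is only a loose indication.

```python
from statistics import mean

# Run times copied from the post (same units as on the results page).
# Each entry: (A10-7700 time, wingman time, wingman host description).
# The third A10-7700 time is read as 7,629.
pairs = [
    (8728, 5830, "i5 @ 3.4 GHz, Win7, sse3"),
    (8285, 8817, "i7 2600 @ 3.4 GHz, sse2"),
    (7629, 16054, "E5-2650 @ 2 GHz, sse2"),
]

earlier_avx = [9231, 11216, 10582, 10292, 4062, 8310]
earlier_sse2 = [9154, 10370, 9117]


def ratio_to_wingman(own, wingman):
    """Own time divided by wingman time; below 1.0 means the A10 was faster."""
    return own / wingman


ratios = [ratio_to_wingman(own, wing) for own, wing, _ in pairs]
for (own, wing, host), r in zip(pairs, ratios):
    print(f"{own:>6} vs {wing:>6} ({host}): ratio {r:.2f}")

mean_avx = mean(earlier_avx)
mean_sse2 = mean(earlier_sse2)
print(f"mean of earlier avx runs:  {mean_avx:.0f}")
print(f"mean of earlier sse2 runs: {mean_sse2:.0f}")
```

A ratio says nothing about clock speed, thread count or other load on the wingman's machine, so it should be read with that in mind.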

Falconet
Joined: 23 Oct 12
Posts: 18
Credit: 60,508
RAC: 77
Message 3016 - Posted: 11 May 2014, 9:39:25 UTC - in response to Message 3015.  
Not bad. Your tasks are running a little faster than the i7's.

alexander
Joined: 28 Apr 13
Posts: 87
Credit: 26,716,176
RAC: 151
Message 3017 - Posted: 11 May 2014, 10:16:46 UTC - in response to Message 3016.  
Not really. If I compare the results against my i7 (hostid=88982), I see the avx app finishing faster there (5,671 .. 6,332), but fma4 seems to be faster than sse or avx on FM2+ APUs. And this is the smaller of the two available A10s.

All in all I would say it's an advance. Thanks to Crunch3r!

Falconet
Joined: 23 Oct 12
Posts: 18
Credit: 60,508
RAC: 77
Message 3021 - Posted: 11 May 2014, 13:04:27 UTC - in response to Message 3017.  
I see. By the way, the app is checkpointing with me.

alexander
Joined: 28 Apr 13
Posts: 87
Credit: 26,716,176
RAC: 151
Message 3026 - Posted: 15 May 2014, 8:24:02 UTC - in response to Message 3021.  

Yes, you are right. The problem I had was caused by a faulty GTX430, which made my system shut down without saving anything.

I have now switched between avx and fma4 wu's under the same circumstances (one einstein gpu wu also running); the fma4 wu's seem to be faster.

Alexander

Crunch3r
Joined: 19 Jun 12
Posts: 21
Credit: 107,293,560
RAC: 0
Message 3045 - Posted: 16 May 2014, 17:51:13 UTC
OK, for anyone who wants to try the BD fma4 app, here is the link:

http://www.boincunited.org/period_search_10210_windows_x86_64_bd_fma4_gcc.zip

It uses the anonymous platform mechanism, and the only thing to do is to copy it to the project directory.

I won't go into specifics on how to install the app, since only experienced boinc users should have a go at it.

Join BOINC United now!

BilBg
Joined: 19 Jun 12
Posts: 221
Credit: 623,640
RAC: 0
Message 3052 - Posted: 18 May 2014, 3:03:00 UTC - in response to Message 3045.  
For me the link gives no_hotlink.gif
http://www.boincunited.org/period_search_10210_windows_x86_64_bd_fma4_gcc.zip



- ALF - "Find out what you don't do well ..... then don't do it!" :)

Falconet
Joined: 23 Oct 12
Posts: 18
Credit: 60,508
RAC: 77
Message 3053 - Posted: 18 May 2014, 9:05:55 UTC
http://www.boincunited.org/period_search_10210_windows_x86_64_bd_fma4_gcc.zip

Copy that and paste it into the address bar.

BilBg
Joined: 19 Jun 12
Posts: 221
Credit: 623,640
RAC: 0
Message 3081 - Posted: 21 May 2014, 19:57:28 UTC


I wanted to get this app to include it in a Benchmark package:
http://asteroidsathome.net/boinc/forum_thread.php?id=306

Use it to determine the relative speed of different applications on the same WUs
(I don't have a CPU with FMA4 nor 64 bit Windows to test it but it's included in the Benchmark package)


I noticed people here try to look at the CPU time of completed tasks to 'measure' the speed of the app, but this is hard because different WUs can have very different CPU times on the same hardware using the same app:
http://asteroidsathome.net/boinc/results.php?hostid=110&offset=0&show_names=0&state=4&appid=

alexander
Joined: 28 Apr 13
Posts: 87
Credit: 26,716,176
RAC: 151
Message 3082 - Posted: 21 May 2014, 21:27:49 UTC - in response to Message 3081.  

What one can do is compare his fma4 results against the wingmen; this gives a better impression of the performance.
My computers are not hidden; feel free to check the results.

(retired account)
Joined: 3 Jan 13
Posts: 30
Credit: 1,705,200
RAC: 0
Message 3087 - Posted: 22 May 2014, 3:09:43 UTC
After receiving the link to the FMA4 app from Crunch3r beginning of last week by pm - thank you! - and after the end of the 2014 Pentathlon :) I have run a few dozen workunits on my FX-8350 and they all validated ok. Compared to wingmen there is some indication of speedup, but as BilBg already pointed out it's hard to compare due to the differences between WUs and due to the fact that you don't know the exact settings of the wingmen computer (such as clock speed, number of threads used per cpu, throttling, hyperthreading on/off, other programs running etc.). Hence I'll also try to get some results with BilBg's bench package posted yesterday.

One question concerning AVX and FMA4 on the Bulldozer: Do these instruction sets benefit from using both 128bit FPUs of one module exclusively? In that case there should be a difference between running one thread per core (i.e. 8 threads on a FX-8xxx) and one thread per module (i.e. 4 threads on a FX-8xxx), right?

(retired account)
Joined: 3 Jan 13
Posts: 30
Credit: 1,705,200
RAC: 0
Message 3091 - Posted: 22 May 2014, 8:22:36 UTC

Last modified: 22 May 2014, 9:13:06 UTC
(oops)

mikey
Joined: 1 Jan 14
Posts: 300
Credit: 32,053,292
RAC: 14,551
Message 3092 - Posted: 22 May 2014, 10:06:19 UTC - in response to Message 3087.  

Last modified: 22 May 2014, 10:09:49 UTC
The idea of benchmark units has been around for a very long time. The plan was to get everyone to run one the first time they sign up, but since it would give no credits, or minimal ones at best, it never caught on. If it did give credits, people could cheat by just returning it over and over again, like they used to do in the bad old days when cheating was rampant!

All that being said, I think BilBg's idea is a good one, as it is only for those interested in running it and not mandatory for everyone. It can provide valuable data from those who wish to run it.

(retired account)
Joined: 3 Jan 13
Posts: 30
Credit: 1,705,200
RAC: 0
Message 3097 - Posted: 22 May 2014, 23:01:41 UTC - in response to Message 3092.  
A first run of benchmarks is finished.

- I used two of the unshortened WUs from BilBg's bench package (input_22147_73.wu and input_22152_83.wu) and calculated the average of the two elapsed-time speedups.
- the cpu is an AMD FX-8350 (Piledriver) running at 4.0 GHz (no turbo, no throttling)
- no other cpu-intense tasks were running
- the reference app (baseline) was period_search_10210_windows_intelx86__sse2.exe

Results:

32bit plain: -99.8%
32bit SSE2: +2.8% (same as reference, only for control)
32bit SSE3: +8.4%
32bit AVX: -105.0%
64bit SSE2: +16.6%
64bit SSE3: +16.0%
64bit AVX: -19.3%
64bit FMA4: +22.9%

This confirms again that the AVX app is not suited at all for the AMD FX and that the SSE3 app has no or little advantage over SSE2 for that processor. But it shows that Crunch3r's FMA4 app has a significant speedup over the fastest stock app (64bit SSE2).

Quite surprising to me is the result for the 32bit AVX app. It's as slow as the plain app and much slower than the 64bit variant. Can anybody confirm this?
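
The post does not say how the speedup percentages were computed. A minimal sketch follows, assuming speedup = (t_ref - t_app) / t_ref * 100, which is the reading that allows values at or below -100% (the alternative t_ref / t_app - 1 can never fall below -100%). The function name and the elapsed times are hypothetical and only illustrate the averaging over two WUs.

```python
from statistics import mean


def speedup_percent(t_ref, t_app):
    """Assumed definition: time saved relative to the reference app, in percent."""
    return (t_ref - t_app) / t_ref * 100.0


# Hypothetical elapsed times in seconds for two WUs; NOT taken from the post.
reference = {"wu_a": 1000.0, "wu_b": 2000.0}
candidate = {"wu_a": 800.0, "wu_b": 1500.0}

# Speedup per WU, then the plain average over both WUs.
per_wu = {wu: speedup_percent(reference[wu], candidate[wu]) for wu in reference}
average = mean(per_wu.values())
print(per_wu, average)

# Under this definition an app needing twice the reference time scores -100%.
twice_as_slow = speedup_percent(1000.0, 2000.0)
print(twice_as_slow)
```

If this reading is right, the roughly -100% figures for the 32bit plain and 32bit AVX apps mean they took about twice as long as the reference.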

(retired account)
Joined: 3 Jan 13
Posts: 30
Credit: 1,705,200
RAC: 0
Message 3128 - Posted: 31 May 2014, 16:49:14 UTC - in response to Message 3097.  
During the last week I made another benchmark, this time under a more realistic setting:

- I used ten WUs from the current batch (well, last week's batch to be more specific: 150893_1, 150893_12, 150893_2, 150893_28, 150893_29, 150893_3, 150893_30, 150893_31, 150894_4 and 150894_5) and again calculated the average of the elapsed-time speedups. The minimum and maximum speedups are also included below.
- the FX-8350 was again running at 4.0 GHz (no turbo, no throttling)
- within BOINC another three tasks with Crunch3r's FMA4 app were running concurrently (using an app_config.xml and the 'mode noBS' switch of the benchmark package)
- this time the reference app was period_search_10210_windows_x86_64__sse2.exe, the fastest stock app from the first run, so the baseline was 'higher' than in the first benchmark run

By using ten test WUs and running three BOINC WUs concurrently I guess I got some more realistic figures here. One thing to be noted is that some workunits tend to run a bit faster with the SSE3 app while others are faster with the SSE2 app. I noticed the same during another benchmark with one of my intels (Ivy Bridge i7). However, in both cases the differences are minimal, so it doesn't matter much if you run 64bit SSE2 or SSE3. YMMV.
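
The app_config.xml used to keep three BOINC tasks running is not shown in the post. As a hedged sketch, the snippet below builds a file of that kind with BOINC's `max_concurrent` setting. The app name `period_search` is an assumption derived from the executable name, and the value 3 simply mirrors the three concurrent tasks mentioned above.

```python
import xml.etree.ElementTree as ET


def build_app_config(app_name, max_concurrent):
    """Build an app_config.xml tree that caps concurrent tasks of one app."""
    root = ET.Element("app_config")
    app = ET.SubElement(root, "app")
    ET.SubElement(app, "name").text = app_name
    ET.SubElement(app, "max_concurrent").text = str(max_concurrent)
    return root


# Both values are assumptions for illustration, not the author's settings.
config = build_app_config("period_search", 3)
xml_text = ET.tostring(config, encoding="unicode")
print(xml_text)
```

The resulting text would go into app_config.xml in the project directory; the app name should be checked against the one the client actually reports.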

Results:

32bit plain: -130.33% avg. (max. -124.79%; min. -135.50%)
64bit SSE3: +0.13% avg. (max. +1.82%; min. -1.56%)
64bit AVX: -32.97% avg. (max. -31.17%; min. -34.81%)
64bit FMA4: +10.91% avg. (max. +12.35%; min. +9.68%)

Again a significant speedup of approx. 10% with the FMA4 app and no big difference between 64bit SSE2 and 64bit SSE3. AVX is out of the game again.
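
To make the percentages easier to picture, the sketch below converts the reported averages into run time relative to the 64bit SSE2 reference. It assumes the speedup was computed as (t_ref - t_app) / t_ref * 100; the post does not confirm this, and the function name is made up for illustration.

```python
# Average speedups as reported in the post, relative to 64bit SSE2.
reported = {
    "32bit plain": -130.33,
    "64bit SSE3": 0.13,
    "64bit AVX": -32.97,
    "64bit FMA4": 10.91,
}


def relative_runtime(speedup_pct):
    """Run time as a multiple of the reference run time.

    Assumes speedup = (t_ref - t_app) / t_ref * 100, so
    t_app / t_ref = 1 - speedup / 100.
    """
    return 1.0 - speedup_pct / 100.0


for app, pct in reported.items():
    print(f"{app}: {relative_runtime(pct):.2f}x the reference run time")
```

Read this way, the FMA4 app would need about nine tenths of the reference time and the 32bit plain app a bit more than twice the reference time.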

Falconet
Joined: 23 Oct 12
Posts: 18
Credit: 60,508
RAC: 77
Message 3131 - Posted: 31 May 2014, 17:14:35 UTC - in response to Message 3128.  
Nice, hope the project's admins test and make the app official.

Falconet
Joined: 23 Oct 12
Posts: 18
Credit: 60,508
RAC: 77
Message 3169 - Posted: 7 Jun 2014, 22:01:00 UTC - in response to Message 3131.  
Any news on making the app official?