All tasks immediately failing on some of my machines
Author | Message |
---|---|
Send message Joined: 2 Mar 24 Posts: 5 Credit: 1,972,716 RAC: 5,520 |
Hi, I have several machines, and since this week (approx. 8 May) on some of them (which should all have the same CPU) all tasks immediately fail with an illegal instruction error. On an older Intel machine and an AMD Opteron it runs OK. All machines are running Rocky 9 or Debian 12 Linux, and the problem occurs on both systems. Can somebody please take a look at this? <result> <name>ps_240506_input_76344_31_0</name> <final_cpu_time>0.000000</final_cpu_time> <final_elapsed_time>0.000000</final_elapsed_time> <exit_status>193</exit_status> <state>3</state> <platform>x86_64-pc-linux-gnu</platform> <version_num>10220</version_num> <app_version_num>10220</app_version_num> <stderr_out> <core_client_version>7.20.2</core_client_version> <![CDATA[ <message> process exited with code 193 (0xc1, -63)</message> <stderr_txt> BOINC client version 7.20.2 Application: ../../projects/asteroidsathome.net_boinc/period_search_10220_x86_64-pc-linux-gnu Version: 102.20.1.1 CPU: Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz RAM: 109.6 GB Using AVX SIMD optimizations. SIGILL: illegal instruction Stack trace (8 frames): ../../projects/asteroidsathome.net_boinc/period_search_10220_x86_64-pc-linux-gnu(+0xf4863)[0x5566c9a98863] /lib64/libc.so.6(+0x54db0)[0x7f4ec5654db0] ../../projects/asteroidsathome.net_boinc/period_search_10220_x86_64-pc-linux-gnu(+0x2d888)[0x5566c99d1888] ../../projects/asteroidsathome.net_boinc/period_search_10220_x86_64-pc-linux-gnu(+0x3686c)[0x5566c99da86c] ../../projects/asteroidsathome.net_boinc/period_search_10220_x86_64-pc-linux-gnu(+0x1faf9)[0x5566c99c3af9] /lib64/libc.so.6(+0x3feb0)[0x7f4ec563feb0] /lib64/libc.so.6(__libc_start_main+0x80)[0x7f4ec563ff60] ../../projects/asteroidsathome.net_boinc/period_search_10220_x86_64-pc-linux-gnu(+0x21d9e)[0x5566c99c5d9e] Exiting... 
</stderr_txt> ]]> </stderr_out> <file_info> <name>ps_240506_input_76344_31_0_r246115810_0</name> <nbytes>0.000000</nbytes> <max_nbytes>500000000.000000</max_nbytes> <md5_cksum>d41d8cd98f00b204e9800998ecf8427e</md5_cksum> <upload_url>https://asteroidsathome.net/boinc_cgi/file_upload_handler</upload_url> </file_info> </result> |
Send message Joined: 2 May 13 Posts: 10 Credit: 1,672,074 RAC: 350 |
|
Send message Joined: 1 Jan 13 Posts: 90 Credit: 10,398,217 RAC: 8,356 |
It's indeed a bug in the AVX implementation that mostly affects hosts with legacy Intel CPUs like Sandy Bridge on Linux. We already have a fix, but releasing it will take some time (about a week). Could you test this version? It would be helpful if you could double-check that it works as expected on these systems. You can also run regular tasks with it using the anonymous platform. https://fileport.io/wWq7w3weQMBq Thanks, and sorry for the inconvenience! |
Send message Joined: 2 May 13 Posts: 10 Credit: 1,672,074 RAC: 350 |
|
Send message Joined: 1 Jan 13 Posts: 90 Credit: 10,398,217 RAC: 8,356 |
You can test it with the sample work unit "period_search_in". To use the anonymous platform (https://boinc.berkeley.edu/wiki/Anonymous_platform), put the binary in the project's folder, create app_info.xml, and restart the BOINC client. /var/lib/boinc/projects/asteroidsathome.net_boinc/app_info.xml: <app_info> <app> <name>period_search</name> </app> <file_info> <name>period_search_BOINC_linux_10220_x64_universal_linux_Release</name> <executable/> </file_info> <app_version> <app_name>period_search</app_name> <version_num>10220</version_num> <avg_ncpus>1.000000</avg_ncpus> <plan_class>avx_linux</plan_class> <file_ref> <file_name>period_search_BOINC_linux_10220_x64_universal_linux_Release</file_name> <main_program/> </file_ref> </app_version> </app_info> |
Send message Joined: 2 May 13 Posts: 10 Credit: 1,672,074 RAC: 350 |
OK, I created the XML file in /var/lib/boinc/projects/asteroidsathome.net_boinc/ and in the projects directory. I put the binary in the projects directory along with the period_search_in file, restarted the BOINC client, and found this line: 13-May-2024 02:26:23 [Asteroids@home] Found app_info.xml; using anonymous platform However, the single w/u is not being processed. I tried rebooting the machine, but there is still no processing of the w/u. Have I missed something again? Ta, Nairb |
Send message Joined: 1 Jan 13 Posts: 90 Credit: 10,398,217 RAC: 8,356 |
Hmm, it should work. Alternatively, you can force SSE3 optimizations with app_config.xml: <app_config> <app_version> <app_name>period_search</app_name> <plan_class></plan_class> <cmdline>--optimization 3</cmdline> </app_version> </app_config> However, expect slightly worse performance compared to the AVX version. |
Send message Joined: 2 May 13 Posts: 10 Credit: 1,672,074 RAC: 350 |
Well, to be sure: I created the XML file and put it into the /var/lib/boinc/projects/asteroidsathome.net_boinc/ directory. I also put the XML file into the projects directory, along with the customized application and the period_search_in file. I renamed the existing application to something_old. But the single w/u is still not being processed, even after a reboot and suspending all other jobs. The machine will have finished all the other projects today, so it should have only the asteroid w/u to do. I am really only an end user. If it gets fixed in the next week or so, then all is fine. |
Send message Joined: 1 Jan 13 Posts: 90 Credit: 10,398,217 RAC: 8,356 |
|
Send message Joined: 2 Mar 24 Posts: 5 Credit: 1,972,716 RAC: 5,520 |
Sadly, the new binary still doesn't work for me. I've created app_info.xml and BOINC found it: May 13 15:37:13 xxxxx boinc[4016501]: 13-May-2024 15:37:13 [Asteroids@home] Found app_info.xml; using anonymous platform But when I get new tasks, they also fail. E.g.: https://asteroidsathome.net/boinc/workunit.php?wuid=210788624 I did not catch the scheduler request to see the error message (and I do not know whether it is also logged somewhere else). I have tried the app_config and it seems to be running OK. BTW, if it would help, I can also offer a web meeting ;) |
Send message Joined: 2 Mar 24 Posts: 5 Credit: 1,972,716 RAC: 5,520 |
|
Send message Joined: 2 May 13 Posts: 10 Credit: 1,672,074 RAC: 350 |
I thought I would get some w/u and suspend the lot. I got 8 w/u and allowed 1 to proceed. It failed with this stderr output: <core_client_version>7.16.11</core_client_version> <![CDATA[ <message> process exited with code 193 (0xc1, -63)</message> <stderr_txt> BOINC client version 7.16.11 Application: ../../projects/asteroidsathome.net_boinc/period_search_BOINC_linux_10220_x64_universal_linux_Release Version: 102.20.1.1 CPU: Intel(R) Core(TM) i7-2600S CPU @ 2.80GHz RAM: 23.4 GB Using AVX SIMD optimizations. SIGILL: illegal instruction Stack trace (6 frames): [0x517f90] [0x570510] [0x51464b] [0x5154a2] [0x59ad2e] [0x5e7618] Exiting... I will leave the other 7 w/u suspended. |
Send message Joined: 16 Nov 22 Posts: 19 Credit: 7,503,644 RAC: 0 |
Last modified: 13 May 2024, 15:53:51 UTC Dear all, Could you please elaborate on that new 102.20 version? I have There's no plan_class, so how do I know which SIMD instructions it's using? The CPU is a Ryzen 9 7950X, so it should pick AVX-512 by default, if I understand correctly. Best regards, Samuel |
Send message Joined: 1 Jan 13 Posts: 90 Credit: 10,398,217 RAC: 8,356 |
Thanks, it turns out I made a mistake while building the app. Here's an updated version: https://fileport.io/qCXSSnT4xCJQ Could you please check? |
Send message Joined: 1 Jan 13 Posts: 90 Credit: 10,398,217 RAC: 8,356 |
The new version has all optimizations in a single application and chooses between them dynamically based on your hardware capabilities. You can check which optimizations were used in the stderr log: BOINC client version 8.0.1 Application: ../../projects/asteroidsathome.net_boinc/period_search_10220_x86_64-pc-linux-gnu Version: 102.20.1.1 CPU: AMD Ryzen 9 7950X 16-Core Processor RAM: 15.35 GB Using AVX512 SIMD optimizations. |
Send message Joined: 2 May 13 Posts: 10 Credit: 1,672,074 RAC: 350 |
Last modified: 13 May 2024, 17:25:50 UTC OK, downloaded, and it seems to be working so far. The remaining 7 w/u are all running. BOINC seems to think they will take 4.5 days to complete. At 4+ minutes all are still running. I will report back if there is an error. Edit: the w/u seem to be progressing at 30%/hr and are still running. |
Send message Joined: 2 Mar 24 Posts: 5 Credit: 1,972,716 RAC: 5,520 |
|
Send message Joined: 1 Jan 13 Posts: 90 Credit: 10,398,217 RAC: 8,356 |
|
Send message Joined: 2 Mar 24 Posts: 5 Credit: 1,972,716 RAC: 5,520 |
I have tried it on another (identical) machine for comparison. Current binary with the forced, slower optimization: ~158 minutes avg per task. New corrected binary with the working optimization: ~138 minutes avg per task. Thanks for the fix. I assume we will need to remove the app_info.xml and the binary when you roll out the new release, so please let us know when that happens. |
Send message Joined: 22 Jul 12 Posts: 4 Credit: 684,340 RAC: 0 |
Last modified: 7 Jun 2024, 8:39:24 UTC |
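
Editor's sketch for the anonymous-platform steps above, where app_info.xml was found but no work was processed. This is only a minimal consistency check, assuming the layout posted in the thread: it verifies that app_info.xml parses, that every `<file_ref>` has a matching `<file_info>`, and that the named binary exists in the project directory and is executable. The function name `check_app_info` is hypothetical, and the check says nothing about why the BOINC client itself might skip a task.

```python
import os
import xml.etree.ElementTree as ET


def check_app_info(project_dir):
    """Return a list of human-readable problems found in app_info.xml."""
    path = os.path.join(project_dir, "app_info.xml")
    try:
        root = ET.parse(path).getroot()
    except (OSError, ET.ParseError) as exc:
        return [f"cannot read {path}: {exc}"]

    problems = []
    # Names declared in <file_info> blocks.
    declared = {fi.findtext("name") for fi in root.iter("file_info")}
    for ref in root.iter("file_ref"):
        name = ref.findtext("file_name")
        if name not in declared:
            problems.append(f"{name}: referenced but has no <file_info> entry")
        binary = os.path.join(project_dir, name or "")
        if not os.path.isfile(binary):
            problems.append(f"{name}: not found in {project_dir}")
        elif not os.access(binary, os.X_OK):
            problems.append(f"{name}: not executable (try chmod +x)")
    return problems


if __name__ == "__main__":
    # Path taken from the thread; adjust it for your installation.
    project = "/var/lib/boinc/projects/asteroidsathome.net_boinc"
    for line in check_app_info(project) or ["no problems found"]:
        print(line)
```

An empty result only means the file and binary are consistent with each other; the client's own event log remains the place to look for scheduler or task errors.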
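
Editor's sketch for reading the stderr logs quoted in this thread. The application prints its CPU model and the SIMD path it selected ("Using AVX SIMD optimizations.", "Using AVX512 SIMD optimizations."), so a short parser can pull those out along with whether the run ended in SIGILL. The patterns are based only on the log excerpts posted above and may not match other application versions; `summarize_stderr` is a hypothetical helper name.

```python
import re

# Patterns derived from the log excerpts in this thread (an assumption,
# not a documented log format).
SIMD_RE = re.compile(r"Using (\S+) SIMD optimizations\.")
CPU_RE = re.compile(r"CPU: (.+?)\s+RAM:")


def summarize_stderr(text):
    """Extract CPU model, selected SIMD path, and SIGILL status from stderr."""
    simd = SIMD_RE.search(text)
    cpu = CPU_RE.search(text)
    return {
        "cpu": cpu.group(1) if cpu else None,
        "simd": simd.group(1) if simd else None,
        "sigill": "SIGILL" in text,
    }


if __name__ == "__main__":
    sample = (
        "BOINC client version 7.16.11 Version: 102.20.1.1 "
        "CPU: Intel(R) Core(TM) i7-2600S CPU @ 2.80GHz RAM: 23.4 GB "
        "Using AVX SIMD optimizations. SIGILL: illegal instruction"
    )
    print(summarize_stderr(sample))
```

Running this over the stderr of several hosts makes it easy to see which CPU models crash on which SIMD path.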
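
Editor's sketch for comparing the affected and unaffected machines. The thread does not say which instruction triggered the SIGILL, so this only lists which SIMD-related flags the Linux kernel reports in /proc/cpuinfo, where SSE3 appears as `pni` and AVX-512 Foundation as `avx512f`. The selection of flags and the helper name `simd_flags` are the editor's choices, not something the project's developers specified.

```python
# Flags as they appear in the "flags" line of /proc/cpuinfo on x86 Linux.
INTERESTING = ("pni", "avx", "fma", "avx2", "avx512f")


def simd_flags(cpuinfo_text):
    """Map each interesting flag to True/False based on the first flags line."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            present = set(value.split())
            return {flag: flag in present for flag in INTERESTING}
    return {}  # no flags line found (for example, a non-x86 system)


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as fh:
            print(simd_flags(fh.read()))
    except OSError as exc:
        print(f"cannot read /proc/cpuinfo: {exc}")
```

Comparing the output from a failing host and a working host shows whether their instruction-set support actually differs.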