All tasks immediately failing on some of my machines


prazape

Joined: 2 Mar 24
Posts: 5
Credit: 1,154,069
RAC: 6,676
Message 8366 - Posted: 11 May 2024, 16:30:33 UTC
Hi,
I have several machines, and since this week (around 8 May) all tasks on some of them (which should have the same CPU) fail immediately with an illegal instruction error. On other, older Intel and AMD Opteron machines they run OK. All machines run Rocky Linux 9 or Debian 12, and the problem occurs on both systems.
Can somebody please take a look at that?

<result>
    <name>ps_240506_input_76344_31_0</name>
    <final_cpu_time>0.000000</final_cpu_time>
    <final_elapsed_time>0.000000</final_elapsed_time>
    <exit_status>193</exit_status>
    <state>3</state>
    <platform>x86_64-pc-linux-gnu</platform>
    <version_num>10220</version_num>
    <app_version_num>10220</app_version_num>
<stderr_out>
<core_client_version>7.20.2</core_client_version>
<![CDATA[
<message>
process exited with code 193 (0xc1, -63)</message>
<stderr_txt>
BOINC client version 7.20.2
Application: ../../projects/asteroidsathome.net_boinc/period_search_10220_x86_64-pc-linux-gnu
Version: 102.20.1.1
CPU:       Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz
RAM: 109.6 GB
Using AVX SIMD optimizations.
SIGILL: illegal instruction
Stack trace (8 frames):
../../projects/asteroidsathome.net_boinc/period_search_10220_x86_64-pc-linux-gnu(+0xf4863)[0x5566c9a98863]
/lib64/libc.so.6(+0x54db0)[0x7f4ec5654db0]
../../projects/asteroidsathome.net_boinc/period_search_10220_x86_64-pc-linux-gnu(+0x2d888)[0x5566c99d1888]
../../projects/asteroidsathome.net_boinc/period_search_10220_x86_64-pc-linux-gnu(+0x3686c)[0x5566c99da86c]
../../projects/asteroidsathome.net_boinc/period_search_10220_x86_64-pc-linux-gnu(+0x1faf9)[0x5566c99c3af9]
/lib64/libc.so.6(+0x3feb0)[0x7f4ec563feb0]
/lib64/libc.so.6(__libc_start_main+0x80)[0x7f4ec563ff60]
../../projects/asteroidsathome.net_boinc/period_search_10220_x86_64-pc-linux-gnu(+0x21d9e)[0x5566c99c5d9e]

Exiting...

</stderr_txt>
]]>
</stderr_out>
<file_info>
    <name>ps_240506_input_76344_31_0_r246115810_0</name>
    <nbytes>0.000000</nbytes>
    <max_nbytes>500000000.000000</max_nbytes>
    <md5_cksum>d41d8cd98f00b204e9800998ecf8427e</md5_cksum>
    <upload_url>https://asteroidsathome.net/boinc_cgi/file_upload_handler</upload_url>
</file_info>
</result>
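
Since the log reports that the AVX code path was selected and then hits SIGILL, it can help to compare which SIMD extensions the kernel reports on a failing host and on a working one. The following is a minimal sketch, assuming a Linux host with /proc/cpuinfo; the function names and the list of flags are illustrative choices and not part of the project.

```python
from pathlib import Path

# Extensions to report. This list is an illustrative assumption.
# Note that Linux reports SSE3 under the flag name "pni".
SIMD_FLAGS = ("sse2", "pni", "ssse3", "avx", "fma", "avx2", "avx512f")


def parse_cpu_flags(cpuinfo_text: str) -> set[str]:
    """Return the flag names from the first 'flags' line of /proc/cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            return set(value.split())
    return set()


def simd_support(cpuinfo_text: str) -> dict[str, bool]:
    """Map each extension in SIMD_FLAGS to whether the CPU advertises it."""
    flags = parse_cpu_flags(cpuinfo_text)
    return {name: name in flags for name in SIMD_FLAGS}


if __name__ == "__main__":
    cpuinfo = Path("/proc/cpuinfo")
    if cpuinfo.exists():  # only present on Linux
        for name, present in simd_support(cpuinfo.read_text()).items():
            print(f"{name:8s} {'yes' if present else 'no'}")
```

Running it on both kinds of host and comparing the output shows whether the machines really advertise the same instruction sets.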

nairb
Joined: 2 May 13
Posts: 10
Credit: 1,654,306
RAC: 149
Message 8367 - Posted: 11 May 2024, 22:03:00 UTC
Had exactly the same on one of my machines. The machine runs other projects fine.
ALL Asteroids@home w/u's failed with "SIGILL: illegal instruction".

It's been fine before with the project.

ahorek's team
Volunteer developer
Volunteer tester
Joined: 1 Jan 13
Posts: 66
Credit: 8,335,519
RAC: 56,906
Message 8368 - Posted: 12 May 2024, 11:48:24 UTC
It's indeed a bug in the AVX implementation that mostly affects hosts with legacy Intel CPUs, like Sandy Bridge, on Linux.

We already have a fix, but releasing it will take some time (about a week).

Could you test this version? It would be helpful if you could double-check that it works as expected on these systems. You can also run regular tasks with it using the anonymous platform.
https://fileport.io/wWq7w3weQMBq

thanks and sorry for the inconvenience!

nairb
Joined: 2 May 13
Posts: 10
Credit: 1,654,306
RAC: 149
Message 8369 - Posted: 12 May 2024, 21:51:59 UTC - in response to Message 8368.  
Thanks for letting us know.

I downloaded the new application and put it in the projects folder. Maybe I have to do something else as well? The new jobs failed.

I can wait for the general rollout, so it's not an issue.
Ta
Nairb

ahorek's team
Volunteer developer
Volunteer tester
Joined: 1 Jan 13
Posts: 66
Credit: 8,335,519
RAC: 56,906
Message 8370 - Posted: 12 May 2024, 23:14:30 UTC
You can test it with the sample work unit "period_search_in".

To use the anonymous platform (https://boinc.berkeley.edu/wiki/Anonymous_platform), put the binary in the project's folder, create app_info.xml, and restart the BOINC client.

/var/lib/boinc/projects/asteroidsathome.net_boinc/app_info.xml
<app_info>
    <app>
        <name>period_search</name>
    </app>
    <file_info>
        <name>period_search_BOINC_linux_10220_x64_universal_linux_Release</name>
        <executable/>
    </file_info>
    <app_version>
        <app_name>period_search</app_name>
        <version_num>10220</version_num>
        <avg_ncpus>1.000000</avg_ncpus>
        <plan_class>avx_linux</plan_class>
        <file_ref>
            <file_name>period_search_BOINC_linux_10220_x64_universal_linux_Release</file_name>
            <main_program/>
        </file_ref>
    </app_version>
</app_info>
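
A quick way to rule out the usual setup problems (malformed XML, a misnamed file, a missing execute bit) is to check the project directory against app_info.xml. The sketch below is a minimal check in Python, assuming the file layout shown above; `check_app_info` is a hypothetical helper and not a BOINC tool.

```python
import os
import xml.etree.ElementTree as ET
from pathlib import Path


def check_app_info(project_dir: str) -> list[str]:
    """Return a list of problems found with app_info.xml in project_dir."""
    project = Path(project_dir)
    app_info = project / "app_info.xml"
    if not app_info.is_file():
        return [f"missing {app_info}"]
    try:
        root = ET.parse(str(app_info)).getroot()
    except ET.ParseError as exc:
        return [f"app_info.xml is not well-formed: {exc}"]

    problems = []
    # Every <file_info> names a file that must exist next to app_info.xml.
    for file_info in root.iter("file_info"):
        name = (file_info.findtext("name") or "").strip()
        path = project / name
        if not name or not path.is_file():
            problems.append(f"file not found: {name}")
        elif file_info.find("executable") is not None and not os.access(path, os.X_OK):
            # Files marked <executable/> need the execute bit set.
            problems.append(f"not executable: {name}")
    return problems
```

For example, `check_app_info("/var/lib/boinc/projects/asteroidsathome.net_boinc")` returns an empty list when the file parses and every referenced binary is present and executable.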

nairb
Joined: 2 May 13
Posts: 10
Credit: 1,654,306
RAC: 149
Message 8371 - Posted: 13 May 2024, 1:37:17 UTC - in response to Message 8370.  
OK, I created the XML file in /var/lib/boinc/projects/asteroidsathome.net_boinc/ and in the projects directory.

Put the binary in the projects directory along with the period_search_in file

Restarted the boinc_client
and found the line :-
13-May-2024 02:26:23 [Asteroids@home] Found app_info.xml; using anonymous platform

But no processing of the single w/u.

Tried re-booting the machine but no processing of w/u.

Have I missed something again?

Ta
Nairb

ahorek's team
Volunteer developer
Volunteer tester
Joined: 1 Jan 13
Posts: 66
Credit: 8,335,519
RAC: 56,906
Message 8372 - Posted: 13 May 2024, 13:23:15 UTC
Hmmm, it should work. Alternatively, you can force SSE3 optimizations with app_config.xml:

<app_config>
  <app_version>
    <app_name>period_search</app_name>
    <plan_class></plan_class>
    <cmdline>--optimization 3</cmdline>
  </app_version>
</app_config>

However, expect slightly worse performance compared to the AVX version.
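
For anyone who prefers to generate the file rather than type it, the same app_config.xml can be produced with a short script. This is a minimal sketch using Python's standard library; `build_app_config` is a hypothetical helper, and the only values used are the ones from the snippet above. The result is meant to be saved as app_config.xml in the project directory.

```python
import xml.etree.ElementTree as ET


def build_app_config(app_name: str, cmdline: str, plan_class: str = "") -> str:
    """Build an app_config.xml document with a single <app_version> entry."""
    root = ET.Element("app_config")
    version = ET.SubElement(root, "app_version")
    ET.SubElement(version, "app_name").text = app_name
    ET.SubElement(version, "plan_class").text = plan_class
    ET.SubElement(version, "cmdline").text = cmdline
    ET.indent(root, space="  ")  # requires Python 3.9 or newer
    # Keep the empty plan_class as an explicit open/close pair, as in the post.
    return ET.tostring(root, encoding="unicode", short_empty_elements=False)


if __name__ == "__main__":
    print(build_app_config("period_search", "--optimization 3"))
```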

nairb
Joined: 2 May 13
Posts: 10
Credit: 1,654,306
RAC: 149
Message 8373 - Posted: 13 May 2024, 13:49:54 UTC - in response to Message 8372.  
Well to be sure....
I created the XML file and put it into the
/var/lib/boinc/projects/asteroidsathome.net_boinc/
directory.
I also put the XML file into the projects directory, along with the customized application and the period_search_in file.
I renamed the existing application to something_old.

But the single w/u is still not processed, even after a reboot and suspending all other jobs.
The machine will have finished all the other projects today, so it should have only the Asteroids w/u to do.

I am really only an end user. If it gets fixed in the next week or so, then all is fine.

ahorek's team
Volunteer developer
Volunteer tester
Joined: 1 Jan 13
Posts: 66
Credit: 8,335,519
RAC: 56,906
Message 8374 - Posted: 13 May 2024, 13:54:19 UTC
Thanks for your patience. Hopefully, we will be able to release a patched version this week.

prazape
Joined: 2 Mar 24
Posts: 5
Credit: 1,154,069
RAC: 6,676
Message 8375 - Posted: 13 May 2024, 14:00:20 UTC
Sadly the new binary still doesn't work for me.

I've created app_info.xml and boinc found it:
May 13 15:37:13 xxxxx boinc[4016501]: 13-May-2024 15:37:13 [Asteroids@home] Found app_info.xml; using anonymous platform

But when I get new tasks, they also fail. E.g.:
https://asteroidsathome.net/boinc/workunit.php?wuid=210788624

I did not catch the scheduler request to see the error message (and I do not know whether it is also recorded somewhere else).

I have tried the app_config and it seems to be running OK.

BTW, if it would help, I can also offer a web meeting ;)

prazape
Joined: 2 Mar 24
Posts: 5
Credit: 1,154,069
RAC: 6,676
Message 8376 - Posted: 13 May 2024, 14:02:15 UTC - in response to Message 8373.  
Did you add execute permission on the binary?
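
For reference, setting the execute bit can be scripted as well. The sketch below is a minimal Python equivalent of adding execute permission for user, group and others; `add_execute_permission` is a hypothetical helper name.

```python
import os
import stat


def add_execute_permission(path: str) -> int:
    """Set the execute bits for user, group and others; return the new mode."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    new_mode = mode | stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH
    os.chmod(path, new_mode)
    return new_mode
```

It would be called with the full path of the downloaded binary inside the project directory.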

nairb
Joined: 2 May 13
Posts: 10
Credit: 1,654,306
RAC: 149
Message 8377 - Posted: 13 May 2024, 14:30:26 UTC
I thought I would get some w/u and suspend the lot.
I got 8 w/u
I allowed 1 to proceed
It failed with
Stderr output

<core_client_version>7.16.11</core_client_version>
<![CDATA[
<message>
process exited with code 193 (0xc1, -63)</message>
<stderr_txt>
BOINC client version 7.16.11
Application: ../../projects/asteroidsathome.net_boinc/period_search_BOINC_linux_10220_x64_universal_linux_Release
Version: 102.20.1.1
CPU: Intel(R) Core(TM) i7-2600S CPU @ 2.80GHz
RAM: 23.4 GB
Using AVX SIMD optimizations.
SIGILL: illegal instruction
Stack trace (6 frames):
[0x517f90]
[0x570510]
[0x51464b]
[0x5154a2]
[0x59ad2e]
[0x5e7618]

Exiting...

I will leave the other 7 w/u suspended
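
The three numbers in "process exited with code 193 (0xc1, -63)" appear to be the same value shown in three ways: decimal, hexadecimal, and as a signed 8-bit integer. The sketch below reproduces that formatting in Python; the signed 8-bit reading is an assumption that happens to match the log, and `describe_exit_code` is a hypothetical helper.

```python
def describe_exit_code(code: int) -> str:
    """Format an exit code as decimal, hex, and signed 8-bit value."""
    unsigned = code & 0xFF
    # Values of 128 and above wrap around to negative numbers in 8 bits.
    signed = unsigned - 256 if unsigned >= 128 else unsigned
    return f"{unsigned} (0x{unsigned:x}, {signed})"


print(describe_exit_code(193))  # prints "193 (0xc1, -63)"
```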

magic_sam
Joined: 16 Nov 22
Posts: 19
Credit: 7,481,025
RAC: 2,607
Message 8378 - Posted: 13 May 2024, 15:39:21 UTC

Last modified: 13 May 2024, 15:53:51 UTC
Dear all,

Could you please elaborate on the new 102.20 version?

I have 26 jobs currently running on BOINC v8.0.1 x86_64.

There's no plan_class, so how do I know which SIMD instructions it's using? The CPU is a Ryzen 9 7950X, so it should pick AVX-512 by default, if I understand correctly.

Best regards,

Samuel

ahorek's team
Volunteer developer
Volunteer tester
Joined: 1 Jan 13
Posts: 66
Credit: 8,335,519
RAC: 56,906
Message 8379 - Posted: 13 May 2024, 16:10:16 UTC - in response to Message 8377.  
Thanks, it turns out I made a mistake while building the app. Here's an updated version: https://fileport.io/qCXSSnT4xCJQ

Could you please check?

ahorek's team
Volunteer developer
Volunteer tester
Joined: 1 Jan 13
Posts: 66
Credit: 8,335,519
RAC: 56,906
Message 8380 - Posted: 13 May 2024, 16:19:49 UTC - in response to Message 8378.  
The new version has all optimizations in a single application and chooses between them dynamically based on your hardware capabilities.

You can check which optimizations were used in the stderr log:
BOINC client version 8.0.1
Application: ../../projects/asteroidsathome.net_boinc/period_search_10220_x86_64-pc-linux-gnu
Version: 102.20.1.1
CPU: AMD Ryzen 9 7950X 16-Core Processor            
RAM: 15.35 GB
Using AVX512 SIMD optimizations.
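
When checking many tasks, the selected code path can be pulled out of the stderr text automatically. This is a minimal sketch in Python; the line format is taken from the logs quoted in this thread, and `detect_simd` is a hypothetical helper.

```python
import re
from typing import Optional

# Matches lines such as "Using AVX512 SIMD optimizations."
SIMD_LINE = re.compile(r"^Using (\S+) SIMD optimizations\.", re.MULTILINE)


def detect_simd(stderr_text: str) -> Optional[str]:
    """Return the optimization name reported in a task's stderr, if any."""
    match = SIMD_LINE.search(stderr_text)
    return match.group(1) if match else None
```

Fed the stderr of a failing task from earlier in the thread, it returns "AVX"; for the log above it returns "AVX512".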

nairb
Joined: 2 May 13
Posts: 10
Credit: 1,654,306
RAC: 149
Message 8381 - Posted: 13 May 2024, 16:29:01 UTC - in response to Message 8379.  

Last modified: 13 May 2024, 17:25:50 UTC
OK, downloaded, and it seems to be working so far. The remaining 7 w/u are all running.

BOINC seems to think they will take 4.5 days to complete.

At 4+ minutes all are still running.
I will report back if there is an error.

edit:
the w/u seem to be doing 30%/hr. And still running.

prazape
Joined: 2 Mar 24
Posts: 5
Credit: 1,154,069
RAC: 6,676
Message 8385 - Posted: 13 May 2024, 20:08:18 UTC
I confirm nairb's observation: when using the app_info.xml method with the new binary it now runs, but the ETA is 4 days.

So for now I went back to the app_config.xml method of forcing the slower optimization, which gives a time of only ~90 minutes.

ahorek's team
Volunteer developer
Volunteer tester
Joined: 1 Jan 13
Posts: 66
Credit: 8,335,519
RAC: 56,906
Message 8386 - Posted: 13 May 2024, 20:50:12 UTC
The ETA is inaccurate because BOINC hasn't benchmarked the new app yet. It'll settle after finishing a few WUs. Anyway, I see nairb's tasks were successful, which is a good sign. Thank you both!

prazape
Joined: 2 Mar 24
Posts: 5
Credit: 1,154,069
RAC: 6,676
Message 8388 - Posted: 15 May 2024, 20:07:27 UTC
I have tried it on another (identical) machine to compare it.

current binary with the slower optimization forced: ~158 minutes avg per task
new corrected binary with working optimization: ~138 minutes avg per task

Thanks for the fix. I assume we will need to remove the app_info.xml and the binary when you roll out the new binary, so please let us know when that happens.
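
To put the two averages in perspective, the difference can be expressed as time saved per task and as extra throughput. The short Python sketch below does the arithmetic on the 158 and 138 minute figures from this post; `compare_runtimes` is a hypothetical helper.

```python
def compare_runtimes(baseline_min: float, new_min: float) -> tuple[float, float]:
    """Return (percent time saved per task, percent throughput gain)."""
    time_saved_pct = (baseline_min - new_min) / baseline_min * 100
    throughput_gain_pct = (baseline_min / new_min - 1) * 100
    return time_saved_pct, throughput_gain_pct


saved, gain = compare_runtimes(158, 138)
print(f"time saved: {saved:.1f}%, throughput gain: {gain:.1f}%")
# prints "time saved: 12.7%, throughput gain: 14.5%"
```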

Tex1954
Joined: 22 Jul 12
Posts: 4
Credit: 684,340
RAC: 2
Message 8438 - Posted: 7 Jun 2024, 8:35:29 UTC

Last modified: 7 Jun 2024, 8:39:24 UTC
I have the same problem. I did a new install of the latest Ubuntu from LinuxMint and got the same thing.

RAM: 15.51 GB
Using AVX SIMD optimizations.
SIGILL: illegal instruction
Stack trace (8 frames):

This is on an older i5-3570K setup. It has AVX, I think.

8-)