Download and Processing Errors


Conan
Joined: 19 Jun 12
Posts: 32
Credit: 5,670,494
RAC: 1,815
Message 102 - Posted: 13 Jul 2012, 0:26:42 UTC
My Host (311) has thrown 2 errors recently

<![CDATA[
<message>
WU download error: couldn't get input files:
<file_xfer_error>
<file_name>input_150_55</file_name>
<error_code>-224</error_code>
<error_message>file not found</error_message>
</file_xfer_error>

On results 16722
16827
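
For anyone collecting these from several results, the <file_xfer_error> block quoted above can be pulled apart with a few lines of Python. This is only a sketch: the function name parse_file_xfer_errors is made up for illustration, and it assumes the pasted text contains complete <file_xfer_error> blocks with the three child tags shown above.

```python
import re
import xml.etree.ElementTree as ET


def parse_file_xfer_errors(stderr_text):
    """Return one dict per complete <file_xfer_error> block in the text."""
    errors = []
    # The surrounding output is not well-formed XML (unclosed <message>),
    # so each error block is cut out first and parsed on its own.
    blocks = re.findall(
        r"<file_xfer_error>.*?</file_xfer_error>", stderr_text, re.DOTALL
    )
    for block in blocks:
        node = ET.fromstring(block)
        errors.append(
            {
                "file_name": node.findtext("file_name"),
                "error_code": int(node.findtext("error_code")),
                "error_message": node.findtext("error_message"),
            }
        )
    return errors
```

Grouping the returned dicts by file_name would show whether the same input file is missing for several results.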

Also, my other Host (309) has thrown 3 errors, but they are different from those on the first host:

"Process got signal 11"

On results 14099
15945
16030

These errors waste many processing hours as they don't happen right at the beginning but a long way into the calculation.

I would also like to know what is causing them.

Conan

Conan
Joined: 19 Jun 12
Posts: 32
Credit: 5,670,494
RAC: 1,815
Message 103 - Posted: 14 Jul 2012, 8:05:50 UTC
Got another one with "Process Got Signal 11"

WU 16761

Anyone know what causes this error?

It happens after the WU has been running for many hours (up to 6 hours).

It's starting to get annoying as I am wasting a lot of CPU hours.

Thanks
Conan

Saenger
Joined: 18 Jun 12
Posts: 23
Credit: 6,217,675
RAC: 2,885
Message 104 - Posted: 14 Jul 2012, 11:28:34 UTC

Last modified: 14 Jul 2012, 11:31:02 UTC
Same here, and that error is not described very well, to say the least.

The surrounding messages from the BOINC manager for the last error are these (times are UTC+2):
Fr 13 Jul 2012 22:34:49 CEST | Asteroids@home | [task] ACTIVE_TASK::start(): forked process: pid 29915
Fr 13 Jul 2012 22:34:49 CEST | Asteroids@home | [task] task_state=EXECUTING for ps_120622b_272_24_2 from start
Fr 13 Jul 2012 22:34:49 CEST | Asteroids@home | Starting task ps_120622b_272_24_2 using period_search version 10000
Fr 13 Jul 2012 22:55:38 CEST | Asteroids@home | [checkpoint] result ps_120622b_272_24_2 checkpointed
Fr 13 Jul 2012 23:06:04 CEST | Asteroids@home | [checkpoint] result ps_120622b_272_24_2 checkpointed
Fr 13 Jul 2012 23:16:26 CEST | Asteroids@home | [checkpoint] result ps_120622b_272_24_2 checkpointed
Fr 13 Jul 2012 23:26:56 CEST | Asteroids@home | [checkpoint] result ps_120622b_272_24_2 checkpointed
Fr 13 Jul 2012 23:37:23 CEST | Asteroids@home | [checkpoint] result ps_120622b_272_24_2 checkpointed
Fr 13 Jul 2012 23:48:13 CEST | Asteroids@home | [checkpoint] result ps_120622b_272_24_2 checkpointed
Fr 13 Jul 2012 23:59:06 CEST | Asteroids@home | [checkpoint] result ps_120622b_272_24_2 checkpointed
Sa 14 Jul 2012 00:09:54 CEST | Asteroids@home | [checkpoint] result ps_120622b_272_24_2 checkpointed
Sa 14 Jul 2012 00:09:54 CEST | Asteroids@home | [cpu_sched] Preempting ps_120622b_272_24_2 (left in memory)
Sa 14 Jul 2012 00:09:54 CEST | Asteroids@home | [task] task_state=SUSPENDED for ps_120622b_272_24_2 from suspend
Sa 14 Jul 2012 06:50:32 CEST | Asteroids@home | [task] Process for ps_120622b_272_24_2 exited
Sa 14 Jul 2012 06:50:32 CEST | Asteroids@home | [task] process got signal 0
Sa 14 Jul 2012 06:50:32 CEST | Asteroids@home | [task] task_state=WAS_SIGNALED for ps_120622b_272_24_2 from handle_exited_app
Sa 14 Jul 2012 06:50:32 CEST | Asteroids@home | [sched_op] Deferring communication for 1 min 53 sec
Sa 14 Jul 2012 06:50:32 CEST | Asteroids@home | [sched_op] Reason: Unrecoverable error for task ps_120622b_272_24_2 (process got signal 11)
Sa 14 Jul 2012 06:50:32 CEST | Asteroids@home | [task] result state=COMPUTE_ERROR for ps_120622b_272_24_2 from CS::report_result_error
Sa 14 Jul 2012 06:50:32 CEST | Asteroids@home | Computation for task ps_120622b_272_24_2 finished
Sa 14 Jul 2012 06:50:32 CEST | Asteroids@home | [task] result state=COMPUTE_ERROR for ps_120622b_272_24_2 from CS::app_finished
Sa 14 Jul 2012 06:50:34 CEST | Asteroids@home | [fxd] starting upload, upload_offset 0
Sa 14 Jul 2012 06:50:34 CEST | Asteroids@home | Started upload of ps_120622b_272_24_2_0
Sa 14 Jul 2012 06:50:34 CEST | Asteroids@home | [file_xfer] URL: http://asteroidsathome.net/boinc_cgi/file_upload_handler
Sa 14 Jul 2012 06:50:35 CEST | Asteroids@home | [file_xfer] http op done; retval 0 (Success)
Sa 14 Jul 2012 06:50:35 CEST | Asteroids@home | [file_xfer] parsing upload response: <data_server_reply>    <status>0</status></data_server_reply>
Sa 14 Jul 2012 06:50:35 CEST | Asteroids@home | [file_xfer] parsing status: 0
Sa 14 Jul 2012 06:50:35 CEST | Asteroids@home | [file_xfer] file transfer status 0 (Success)
Sa 14 Jul 2012 06:50:35 CEST | Asteroids@home | Finished upload of ps_120622b_272_24_2_0
Sa 14 Jul 2012 06:50:35 CEST | Asteroids@home | [file_xfer] Throughput 11806 bytes/sec
Sa 14 Jul 2012 06:52:37 CEST | Asteroids@home | [sched_op] Starting scheduler request
Sa 14 Jul 2012 06:52:37 CEST | Asteroids@home | Sending scheduler request: To report completed tasks.
Sa 14 Jul 2012 06:52:37 CEST | Asteroids@home | Reporting 1 completed tasks, requesting new tasks for CPU
Sa 14 Jul 2012 06:52:37 CEST | Asteroids@home | [sched_op] CPU work request: 1.00 seconds; 0.19 CPUs
Sa 14 Jul 2012 06:52:37 CEST | Asteroids@home | [sched_op] NVIDIA GPU work request: 0.00 seconds; 0.00 GPUs
Sa 14 Jul 2012 06:52:45 CEST | Asteroids@home | Scheduler request completed: got 1 new tasks
Sa 14 Jul 2012 06:52:45 CEST | Asteroids@home | [sched_op] Server version 701
Sa 14 Jul 2012 06:52:45 CEST | Asteroids@home | Project requested delay of 7 seconds
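
To find every task that ended this way without reading the whole event log, the "Unrecoverable error" lines can be filtered out. A minimal sketch, assuming the log has been saved as text and that the failure lines have the same wording as in the excerpt above; find_signal_failures is a made-up name.

```python
import re

# Matches the wording of the [sched_op] "Reason:" line in the log above.
SIGNAL_RE = re.compile(
    r"Unrecoverable error for task (\S+) \(process got signal (\d+)\)"
)


def find_signal_failures(log_lines):
    """Return (task_name, signal_number) for each matching log line."""
    failures = []
    for line in log_lines:
        match = SIGNAL_RE.search(line)
        if match:
            failures.append((match.group(1), int(match.group(2))))
    return failures
```

Feeding it the lines of a saved log gives the list of affected tasks, which can then be compared with the timestamps of other projects' failures.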


I looked around a bit and saw that the first erroneous WU around that time was probably one from T4T, which failed to start up its virtual machine correctly, and some other projects followed (WUProp, Asteroids, malaria, yoyo (recovered from it), Albert (recovered as well)). I don't really know if T4T is the cause, but it happened quite directly after their failed start-up.
Greetings from Sänger

BobCat13
Joined: 18 Jun 12
Posts: 5
Credit: 1,007,218
RAC: 0
Message 105 - Posted: 14 Jul 2012, 13:39:02 UTC
My machine has had 8 tasks error due to the signal 11. All of them were due to DNS resolution problems from my provider, resulting in the dreaded no heartbeat error between client and sciences.

The T4T VM not starting correctly may have caused the no heartbeat as well. Asteroids@home tasks don't recover gracefully from no heartbeat errors, it would seem.

AMDave
Joined: 19 Jun 12
Posts: 11
Credit: 100,197
RAC: 0
Message 106 - Posted: 17 Jul 2012, 10:37:58 UTC

Last modified: 17 Jul 2012, 10:57:37 UTC
I do not have any BOINC signal 11's because I follow this advice on my 64-bit systems that are running 32-bit clients (or 64-bit clients that still have 32-bit libs hooked to them):
http://boincfaq.mundayweb.com/index.php?view=459&language=1

If you install the ia32 libs on your 64-bit system, this problem most often goes away.
If it does not, then you need to follow the initial advice on that page and start swap-out testing your RAM, un-overclocking your rig, etc. to locate the source of the issue.

But most often it is from running 32-bit apps on a 64-bit system without providing the 32-bit libs that the 32-bit client was compiled under.

YAY! for the maintainers of the Ia32 libs! Good job lads.

PS - another thanks to Jorden (wherever he is now) who tested and posted that back in 2008. It is intriguing that the problem continues to the present day. I think someone really should put a catch() in front of that throw() and tell the developer / user that this is what the problem is instead of expecting folks to interpret the error ID. But that's just me.
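
As a rough illustration of that idea, a wrapper could translate the bare signal number into a name and a hint. This is a sketch only, not how the BOINC client reports errors: describe_signal and the HINTS table are invented here, and the hint text just paraphrases the suggestions in this thread.

```python
import signal

# Illustrative hints only, paraphrased from suggestions made in this thread.
HINTS = {
    "SIGSEGV": "segmentation fault; check for missing 32-bit libs on "
    "64-bit Linux, then test RAM and remove any overclock",
}


def describe_signal(number):
    """Turn a raw signal number into a name plus an optional hint."""
    try:
        name = signal.Signals(number).name
    except ValueError:
        # The number is not a signal defined on this platform.
        return f"signal {number} (not a known signal on this platform)"
    hint = HINTS.get(name)
    return f"signal {number} ({name})" + (f": {hint}" if hint else "")
```

On Linux, signal 11 is SIGSEGV, so describe_signal(11) names the segmentation fault and appends the hint.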

HTH
AMDave

-ShEm-
Joined: 25 Jun 12
Posts: 2
Credit: 207,934
RAC: 0
Message 107 - Posted: 22 Jul 2012, 9:33:59 UTC - in response to Message 105.  
My machine has had 8 tasks error due to the signal 11. All of them were due to DNS resolution problems from my provider, resulting in the dreaded no heartbeat error between client and sciences.

The T4T VM not starting correctly may have caused the no heartbeat as well. Asteroids@home tasks don't recover gracefully from no heartbeat errors, it would seem.

^This^

Conan
Joined: 19 Jun 12
Posts: 32
Credit: 5,670,494
RAC: 1,815
Message 120 - Posted: 29 Jul 2012, 23:59:23 UTC
The number of errors seems to have reduced as of late, which is a good thing.
However, I have noticed that a large percentage of Work Units are going into an "Inconclusive" state, requiring a third or even fourth WU before validation is given.
This is causing a lot of WUs to be given 0.00 points. After 12 hours of processing that gets annoying.
I have had 11 Work Units so far be given Zero points (that's 132 hours of work that I don't get any credit for).
I still have at least a dozen of these waiting for a result.

If the first two results go to my AMD and an Intel CPU and don't appear to match, and the third result that is sent out also goes to an Intel CPU, I lose out with my AMD (all 11 failures have been like this).
The BOINC version does not seem to make a difference, as it is a mix of ver 7.0.x and ver 6.10.x.

So is there a compilation problem between the Intel Application and the AMD Application?
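
The pattern described above (two results disagree, a third breaks the tie, the odd one out gets nothing) can be sketched as a small quorum check. This is purely an illustration: the thread does not say how the project's validator compares results, so the tolerance, the quorum of 2, the host labels and the numbers in the usage note are all assumptions.

```python
import math


def find_agreeing_hosts(results, quorum=2, rel_tol=1e-4):
    """results is a list of (host_label, value) pairs.

    Return the labels of the first group of at least `quorum` results
    that agree with a reference result within `rel_tol`, or an empty
    list if no such group exists yet (the "inconclusive" case).
    """
    for _, reference in results:
        group = [
            host
            for host, value in results
            if math.isclose(value, reference, rel_tol=rel_tol)
        ]
        if len(group) >= quorum:
            return group
    return []
```

With made-up values such as [("AMD", 1.0), ("Intel-1", 1.00200)] nothing agrees, so the WU stays inconclusive; adding ("Intel-2", 1.00201) lets the two Intel results form the quorum and leaves the AMD result without credit.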

My computers are running at stock speeds; both are AMD Phenom II @ 3.2GHz.

Conan


cykodennis
Joined: 10 Jul 12
Posts: 69
Credit: 9,086,498
RAC: 0
Message 121 - Posted: 30 Jul 2012, 14:02:41 UTC
Although this seems to occur very often with an AMD/Intel mix, the problem also occurs with Intel/Intel.
But I often notice different BOINC versions in such situations!
All in all you are right. There seems to be a pattern, with a few exceptions.

I hope the project crew investigates that soon. It would be a tragic waste if some machines/CPUs/BOINC versions produced false data.

Kyong
Project administrator
Project developer
Project tester
Project scientist
Joined: 9 Jun 12
Posts: 584
Credit: 52,667,664
RAC: 0
Message 123 - Posted: 30 Jul 2012, 18:53:36 UTC - in response to Message 121.  
The next version of the app should solve some of these problems. There are also some minor changes to output accuracy, so the app will be released with a new batch of WUs.

TLSI2000
Joined: 5 Jul 12
Posts: 6
Credit: 184,769,195
RAC: 303
Message 125 - Posted: 1 Aug 2012, 21:34:32 UTC
I will have to agree with the post above.
Running 12+ hours and actually finishing, and then to get an 'inconclusive' status is a bit defeating.

It's not like we didn't do all that was asked of us.

Still, it is all for the 'greater good'.

cykodennis
Joined: 10 Jul 12
Posts: 69
Credit: 9,086,498
RAC: 0
Message 127 - Posted: 2 Aug 2012, 9:22:22 UTC
Sure, zero credit because of "inconclusive" is a little annoying.
But on the other hand, I wouldn't want credits for false results.

As long as most results are not inconclusive, everything is okay - for me.

Kyong
Project administrator
Project developer
Project tester
Project scientist
Joined: 9 Jun 12
Posts: 584
Credit: 52,667,664
RAC: 0
Message 129 - Posted: 3 Aug 2012, 12:49:10 UTC
I am sorry; I hope that most of the problems have been solved. The next batch will also contain shorter units.

cykodennis
Joined: 10 Jul 12
Posts: 69
Credit: 9,086,498
RAC: 0
Message 130 - Posted: 3 Aug 2012, 14:35:52 UTC - in response to Message 129.  
\o/

candido
Joined: 28 Jul 12
Posts: 21
Credit: 2,513,800
RAC: 0
Message 324 - Posted: 24 Oct 2012, 0:45:56 UTC
Hi, I am having this problem.
About 2/3 of all downloaded WUs fail in the first second after starting.
This is not too big a problem, as almost no computing time is wasted. However, it does increase data traffic, which is not a problem for me but might be for the server if other people are having the same problem.

I have an i7 2630 with Win 7 64-bit as the host, running Debian 64-bit (Linux 2.6.32-5-amd64) in an Oracle VBox. BOINC manager is v. 6.10.58. (Machine no. 2950)

The errors are:
Dom 21 Out 2012 00:13:48 WEST Asteroids@home Starting ps_120928_86_31_2
Dom 21 Out 2012 00:13:48 WEST Asteroids@home Starting task ps_120928_86_31_2 using period_search version 10100
Dom 21 Out 2012 00:13:48 WEST Asteroids@home Starting ps_120928_83_221_2
Dom 21 Out 2012 00:13:48 WEST Asteroids@home Starting task ps_120928_83_221_2 using period_search version 10100
Dom 21 Out 2012 00:13:49 WEST Asteroids@home Computation for task ps_120928_86_31_2 finished
Dom 21 Out 2012 00:13:49 WEST Asteroids@home Output file ps_120928_86_31_2_0 for task ps_120928_86_31_2 absent
Dom 21 Out 2012 00:13:49 WEST Asteroids@home Resuming task ps_120928_89_287_2 using period_search version 10100
Dom 21 Out 2012 00:13:50 WEST Asteroids@home Computation for task ps_120928_83_221_2 finished
Dom 21 Out 2012 00:13:50 WEST Asteroids@home Output file ps_120928_83_221_2_0 for task ps_120928_83_221_2 absent
Dom 21 Out 2012 00:13:50 WEST Asteroids@home Resuming task ps_120928_83_214_2 using period_search version 10100
Dom 21 Out 2012 00:14:03 WEST Asteroids@home task ps_120928_87_47_2 resumed by user
Dom 21 Out 2012 00:14:03 WEST Asteroids@home Starting ps_120928_87_47_2
Dom 21 Out 2012 00:14:03 WEST Asteroids@home Starting task ps_120928_87_47_2 using period_search version 10100
Dom 21 Out 2012 00:14:04 WEST Asteroids@home Computation for task ps_120928_87_47_2 finished
Dom 21 Out 2012 00:14:04 WEST Asteroids@home Output file ps_120928_87_47_2_0 for task ps_120928_87_47_2 absent

Is there anything I can do to solve this?
thanks

jujube
Joined: 5 Sep 12
Posts: 30
Credit: 24,320
RAC: 0
Message 325 - Posted: 24 Oct 2012, 5:10:06 UTC - in response to Message 324.  

Last modified: 24 Oct 2012, 5:20:14 UTC
ccandido,

Your computers are hidden, which makes it difficult for others to view your results and help you diagnose the problem. Please unhide them. There is NOTHING in the information we can see about your computers that helps us attack your computer. Many believe that knowing your IP address helps others attack, but nobody except you can see your IP address; it is hidden from us. I have put in the extra effort required to view your results, but many others are not willing to do so. Please unhide your computers.

This stderr output from one of your failed tasks indicates the application is unable to access a shared lib named libgcc_s.so.1. A shared lib is the Linux equivalent of a Windows DLL. That's very puzzling to me because the ldd command indicates the application is a static build, which means it should not need to access any shared libs. Also, it seems you installed the Dotsch/UX based virtual machine I mentioned in the Easy Windows solution thread. If that is correct, then your VM already has libgcc_s.so.1 installed with permissions set to allow everybody to access it. Does anybody else have any insight into this?

edit added:

Sorry, my mistake: you did not install the Dotsch/UX based VM, you created your own Debian VM. You said your Debian is 64-bit, but the stderr report says the task is using the 32-bit Asteroids application. It's late and I'm sleepy, so I'm stumped for now. I'll have another look at it in the morning, though I suspect someone else will have it figured out by then.

Kyong
Project administrator
Project developer
Project tester
Project scientist
Joined: 9 Jun 12
Posts: 584
Credit: 52,667,664
RAC: 0
Message 326 - Posted: 24 Oct 2012, 8:38:07 UTC
ccandido: It's like jujube wrote. You have a 64-bit version of Linux there, but the crashed units were computed with the 32-bit version of the application. For some reason the BOINC manager decided to use the 32-bit version instead of the 64-bit version of the app. Just install ia32-libs and there will be no problem with it.
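
To confirm which variant of the app was actually downloaded, the ELF header of the executable can be inspected: after the 4-byte magic, the fifth byte (EI_CLASS) is 1 for a 32-bit file and 2 for a 64-bit file. A minimal sketch; elf_bitness is a made-up helper name, and the path to the application binary depends on your BOINC installation.

```python
def elf_bitness(path):
    """Return 32 or 64 for an ELF file, or None if it is not ELF."""
    with open(path, "rb") as handle:
        header = handle.read(5)
    # Every ELF file starts with the magic bytes 0x7f 'E' 'L' 'F'.
    if len(header) < 5 or header[:4] != b"\x7fELF":
        return None
    # EI_CLASS: 1 means ELFCLASS32, 2 means ELFCLASS64.
    return {1: 32, 2: 64}.get(header[4])
```

If this returns 32 for the period_search executable on a 64-bit Debian, the client has been given the 32-bit app.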

mickydl*
Joined: 7 Oct 12
Posts: 1
Credit: 2,057,800
RAC: 0
Message 327 - Posted: 24 Oct 2012, 12:33:16 UTC
You have most likely run into the same "feature" of the newer BOINC server software that many other projects already have. If the project provides 32-bit and 64-bit applications, the BOINC server software will ignore the "bitness" of the OS and send 32-bit applications to 64-bit operating systems because they sometimes run faster. On Windows machines this is usually no problem; on Linux it is.

There are two solutions to the problem:
1) Install the 32 bit libraries on the client side
or
2) Disable this behavior on the server (as far as I know there is a configuration option somewhere)

mickydl*

candido
Joined: 28 Jul 12
Posts: 21
Credit: 2,513,800
RAC: 0
Message 328 - Posted: 24 Oct 2012, 18:50:15 UTC
Thanks for the quick replies from all of you.
I guess I will try to install the 32-bit libs (ia32-libs).
I have no idea how to do that at the moment, but I will search the net.
I have also downloaded jujube's virtual machine and might try that as well.
Thanks for your help
ccandido

Kyong
Project administrator
Project developer
Project tester
Project scientist
Joined: 9 Jun 12
Posts: 584
Credit: 52,667,664
RAC: 0
Message 329 - Posted: 24 Oct 2012, 19:06:15 UTC
The installation is easy. In a terminal, become root with 'su', then run 'apt-get update' and 'apt-get install ia32-libs'. That is how to do it on Debian.

candido
Joined: 28 Jul 12
Posts: 21
Credit: 2,513,800
RAC: 0
Message 330 - Posted: 24 Oct 2012, 19:25:02 UTC - in response to Message 329.  
I'll do that right away.
Many thanks Kyong