Download and Processing Errors
Author | Message |
---|---|
Send message Joined: 19 Jun 12 Posts: 32 Credit: 5,677,115 RAC: 1,803 |
My host (311) has thrown 2 errors recently:
<![CDATA[ <message> WU download error: couldn't get input files: <file_xfer_error> <file_name>input_150_55</file_name> <error_code>-224</error_code> <error_message>file not found</error_message> </file_xfer_error>
on results 16722 and 16827.
My other host (309) has thrown 3 errors of a different kind, "Process got signal 11", on results 14099, 15945 and 16030.
These errors waste many processing hours, because they happen not at the beginning but a long way into the calculation. I would also like to know what is causing them.
Conan |
Send message Joined: 19 Jun 12 Posts: 32 Credit: 5,677,115 RAC: 1,803 |
|
Send message Joined: 18 Jun 12 Posts: 23 Credit: 6,218,812 RAC: 2,116 |
Last modified: 14 Jul 2012, 11:31:02 UTC
Same here, and that error is not described very well, to say the least. These are the surrounding messages from the BOINC manager for the last error (times are UTC+2):
Fr 13 Jul 2012 22:34:49 CEST | Asteroids@home | [task] ACTIVE_TASK::start(): forked process: pid 29915
Fr 13 Jul 2012 22:34:49 CEST | Asteroids@home | [task] task_state=EXECUTING for ps_120622b_272_24_2 from start
Fr 13 Jul 2012 22:34:49 CEST | Asteroids@home | Starting task ps_120622b_272_24_2 using period_search version 10000
Fr 13 Jul 2012 22:55:38 CEST | Asteroids@home | [checkpoint] result ps_120622b_272_24_2 checkpointed
Fr 13 Jul 2012 23:06:04 CEST | Asteroids@home | [checkpoint] result ps_120622b_272_24_2 checkpointed
Fr 13 Jul 2012 23:16:26 CEST | Asteroids@home | [checkpoint] result ps_120622b_272_24_2 checkpointed
Fr 13 Jul 2012 23:26:56 CEST | Asteroids@home | [checkpoint] result ps_120622b_272_24_2 checkpointed
Fr 13 Jul 2012 23:37:23 CEST | Asteroids@home | [checkpoint] result ps_120622b_272_24_2 checkpointed
Fr 13 Jul 2012 23:48:13 CEST | Asteroids@home | [checkpoint] result ps_120622b_272_24_2 checkpointed
Fr 13 Jul 2012 23:59:06 CEST | Asteroids@home | [checkpoint] result ps_120622b_272_24_2 checkpointed
Sa 14 Jul 2012 00:09:54 CEST | Asteroids@home | [checkpoint] result ps_120622b_272_24_2 checkpointed
Sa 14 Jul 2012 00:09:54 CEST | Asteroids@home | [cpu_sched] Preempting ps_120622b_272_24_2 (left in memory)
Sa 14 Jul 2012 00:09:54 CEST | Asteroids@home | [task] task_state=SUSPENDED for ps_120622b_272_24_2 from suspend
Sa 14 Jul 2012 06:50:32 CEST | Asteroids@home | [task] Process for ps_120622b_272_24_2 exited
Sa 14 Jul 2012 06:50:32 CEST | Asteroids@home | [task] process got signal 0
Sa 14 Jul 2012 06:50:32 CEST | Asteroids@home | [task] task_state=WAS_SIGNALED for ps_120622b_272_24_2 from handle_exited_app
Sa 14 Jul 2012 06:50:32 CEST | Asteroids@home | [sched_op] Deferring communication for 1 min 53 sec
Sa 14 Jul 2012 06:50:32 CEST | Asteroids@home | [sched_op] Reason: Unrecoverable error for task ps_120622b_272_24_2 (process got signal 11)
Sa 14 Jul 2012 06:50:32 CEST | Asteroids@home | [task] result state=COMPUTE_ERROR for ps_120622b_272_24_2 from CS::report_result_error
Sa 14 Jul 2012 06:50:32 CEST | Asteroids@home | Computation for task ps_120622b_272_24_2 finished
Sa 14 Jul 2012 06:50:32 CEST | Asteroids@home | [task] result state=COMPUTE_ERROR for ps_120622b_272_24_2 from CS::app_finished
Sa 14 Jul 2012 06:50:34 CEST | Asteroids@home | [fxd] starting upload, upload_offset 0
Sa 14 Jul 2012 06:50:34 CEST | Asteroids@home | Started upload of ps_120622b_272_24_2_0
Sa 14 Jul 2012 06:50:34 CEST | Asteroids@home | [file_xfer] URL: http://asteroidsathome.net/boinc_cgi/file_upload_handler
Sa 14 Jul 2012 06:50:35 CEST | Asteroids@home | [file_xfer] http op done; retval 0 (Success)
Sa 14 Jul 2012 06:50:35 CEST | Asteroids@home | [file_xfer] parsing upload response: <data_server_reply> <status>0</status></data_server_reply>
Sa 14 Jul 2012 06:50:35 CEST | Asteroids@home | [file_xfer] parsing status: 0
Sa 14 Jul 2012 06:50:35 CEST | Asteroids@home | [file_xfer] file transfer status 0 (Success)
Sa 14 Jul 2012 06:50:35 CEST | Asteroids@home | Finished upload of ps_120622b_272_24_2_0
Sa 14 Jul 2012 06:50:35 CEST | Asteroids@home | [file_xfer] Throughput 11806 bytes/sec
Sa 14 Jul 2012 06:52:37 CEST | Asteroids@home | [sched_op] Starting scheduler request
Sa 14 Jul 2012 06:52:37 CEST | Asteroids@home | Sending scheduler request: To report completed tasks.
Sa 14 Jul 2012 06:52:37 CEST | Asteroids@home | Reporting 1 completed tasks, requesting new tasks for CPU
Sa 14 Jul 2012 06:52:37 CEST | Asteroids@home | [sched_op] CPU work request: 1.00 seconds; 0.19 CPUs
Sa 14 Jul 2012 06:52:37 CEST | Asteroids@home | [sched_op] NVIDIA GPU work request: 0.00 seconds; 0.00 GPUs
Sa 14 Jul 2012 06:52:45 CEST | Asteroids@home | Scheduler request completed: got 1 new tasks
Sa 14 Jul 2012 06:52:45 CEST | Asteroids@home | [sched_op] Server version 701
Sa 14 Jul 2012 06:52:45 CEST | Asteroids@home | Project requested delay of 7 seconds
I looked around a bit and saw that the first WU to fail around that time was probably one from T4T, which failed to start up its virtual machine correctly; some other projects followed (WUProp, Asteroids, malaria, yoyo (recovered from it), Albert (recovered as well)). I don't know whether T4T is really the cause, but it happened right after their failed start-up.
Greetings from Sänger |
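For anyone who wants to pull these failures out of a long event log, here is a minimal Python sketch. It assumes the "Unrecoverable error ... (process got signal N)" wording exactly as quoted in the log above; other client versions may phrase the line differently. The function name `signals_by_task` is made up for illustration and is not part of BOINC.

```python
import re
from collections import defaultdict

# Pattern for the "Unrecoverable error" lines quoted in this thread.
# Assumption: the wording matches the log above; it may differ elsewhere.
ERROR_RE = re.compile(
    r"Unrecoverable error for task (?P<task>\S+) "
    r"\(process got signal (?P<sig>\d+)\)"
)


def signals_by_task(log_lines):
    """Return {task_name: [signal numbers]} for every matching log line."""
    found = defaultdict(list)
    for line in log_lines:
        match = ERROR_RE.search(line)
        if match:
            found[match.group("task")].append(int(match.group("sig")))
    return dict(found)


sample = [
    "Sa 14 Jul 2012 06:50:32 CEST | Asteroids@home | [task] process got signal 0",
    "Sa 14 Jul 2012 06:50:32 CEST | Asteroids@home | [sched_op] Reason: "
    "Unrecoverable error for task ps_120622b_272_24_2 (process got signal 11)",
]
print(signals_by_task(sample))
```

Only the scheduler's "Reason:" line is matched, because the bare "[task] process got signal 0" line carries no task name and reports a different number.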
Send message Joined: 18 Jun 12 Posts: 5 Credit: 1,007,218 RAC: 0 |
My machine has had 8 tasks error out due to signal 11. All of them were due to DNS resolution problems at my provider, resulting in the dreaded no-heartbeat error between the client and the science apps. The T4T VM not starting correctly may have caused the no-heartbeat error as well. Asteroids@home tasks don't recover gracefully from no-heartbeat errors, it would seem. |
Send message Joined: 19 Jun 12 Posts: 11 Credit: 100,197 RAC: 0 |
Last modified: 17 Jul 2012, 10:57:37 UTC
I do not get any BOINC signal 11s, because on my 64-bit systems that run 32-bit clients (or 64-bit clients that still have 32-bit libs hooked to them) I follow this advice: http://boincfaq.mundayweb.com/index.php?view=459&language=1
If you install the ia32 libs on your 64-bit system, this problem most often goes away. If it does not, then you need to follow the initial advice on that page and start swap-out testing your RAM, un-overclocking your rig, etc. to locate the source of the issue. But most often it comes from running 32-bit apps on a 64-bit system without providing the 32-bit libs that the 32-bit client was compiled against.
YAY for the maintainers of the ia32 libs! Good job, lads.
PS: another thanks to Jorden (wherever he is now), who tested and posted that back in 2008. It is intriguing that the problem continues to the present day. I think someone really should put a catch() in front of that throw() and tell the developer or user what the problem is, instead of expecting folks to interpret the error ID. But that's just me.
HTH
AMDave |
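As a rough illustration of the "tell the user what the problem is" idea, the Python sketch below turns a raw exit code into a readable message. It is not how the BOINC client (which is written in C++) reports errors; it only assumes the POSIX convention used by Python's `subprocess` module, where a return code of -N means the child was killed by signal N. The helper name `describe_exit` is hypothetical.

```python
import signal


def describe_exit(returncode):
    """Turn a subprocess-style return code into a readable message.

    Assumes the convention of Python's subprocess module on POSIX:
    a negative return code -N means the child was killed by signal N.
    """
    if returncode == 0:
        return "exited normally"
    if returncode > 0:
        return f"exited with status {returncode}"
    number = -returncode
    try:
        name = signal.Signals(number).name
    except ValueError:
        # The number is not a signal known on this platform.
        name = "unknown signal"
    return f"killed by signal {number} ({name})"


print(describe_exit(-11))
```

On Linux, signal 11 is SIGSEGV (a segmentation fault), which is why it can point at a broken or missing library as well as at faulty RAM or an unstable overclock.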
Send message Joined: 25 Jun 12 Posts: 2 Credit: 207,934 RAC: 0 |
Quote: "My machine has had 8 tasks error out due to signal 11. All of them were due to DNS resolution problems at my provider, resulting in the dreaded no-heartbeat error between the client and the science apps."
^This^ |
Send message Joined: 19 Jun 12 Posts: 32 Credit: 5,677,115 RAC: 1,803 |
The number of errors seems to have dropped lately, which is a good thing. However, I have noticed that a large percentage of Work Units are going into an "Inconclusive" state, requiring a third or even fourth WU before validation is given. This is causing a lot of WUs to be given 0.00 points. After 12 hours of processing, that gets annoying. So far I have had 11 Work Units given zero points (that's 132 hours of work that I don't get any credit for), and I still have at least a dozen more awaiting a result. If the first two results come from my AMD and an Intel CPU and don't appear to match, and the third result is also sent out to an Intel CPU, I lose out with my AMD (all 11 failures have been like this). The BOINC version does not seem to make a difference, as it is a mix of 7.0.x and 6.10.x. So is there a compilation problem between the Intel application and the AMD application? Both my computers are AMD Phenom II @ 3.2 GHz running at stock speeds. Conan |
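To make the "inconclusive" situation concrete, here is a small Python sketch of quorum validation with a numeric tolerance. It is an assumption about how such a check could work, not the project's actual validator: the tolerance, the quorum size, the function names and the sample numbers are all invented for illustration.

```python
import math


def results_agree(a, b, rel_tol=1e-6):
    """True if two result vectors match element-wise within rel_tol."""
    return len(a) == len(b) and all(
        math.isclose(x, y, rel_tol=rel_tol) for x, y in zip(a, b)
    )


def find_quorum(results, min_quorum=2, rel_tol=1e-6):
    """Return indices of the first group of >= min_quorum agreeing results.

    An empty list means no such group exists, which corresponds to the
    "inconclusive" state: another copy of the WU has to be sent out.
    """
    for candidate in results:
        group = [
            j for j, other in enumerate(results)
            if results_agree(candidate, other, rel_tol)
        ]
        if len(group) >= min_quorum:
            return group
    return []


# Hypothetical values, not real project output.
amd = [7.274001, 0.5120]
intel_1 = [7.274913, 0.5120]
intel_2 = [7.274913, 0.5120]

print(find_quorum([amd, intel_1]))           # no agreement yet
print(find_quorum([amd, intel_1, intel_2]))  # the two Intel results agree
```

With a check like this, whichever result falls outside the agreeing group gets no credit, even when the difference is small. That would match the pattern described above if the AMD and Intel builds round slightly differently, though the thread does not establish that this is the cause.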
Send message Joined: 10 Jul 12 Posts: 69 Credit: 9,086,498 RAC: 0 |
Although this seems to occur very often with an AMD/Intel mix, the problem also occurs with Intel/Intel. But I often notice different BOINC versions in such situations! All in all you are right: there seems to be a pattern, with a few exceptions. I hope the project crew investigates that soon. It would be a tragic waste if some machines, CPUs or BOINC versions were producing false data. |
Send message Joined: 9 Jun 12 Posts: 584 Credit: 52,667,664 RAC: 0 |
|
Send message Joined: 5 Jul 12 Posts: 6 Credit: 184,769,195 RAC: 225 |
|
Send message Joined: 10 Jul 12 Posts: 69 Credit: 9,086,498 RAC: 0 |
|
Send message Joined: 9 Jun 12 Posts: 584 Credit: 52,667,664 RAC: 0 |
|
Send message Joined: 10 Jul 12 Posts: 69 Credit: 9,086,498 RAC: 0 |
|
Send message Joined: 28 Jul 12 Posts: 21 Credit: 2,513,800 RAC: 0 |
Hi, I am having this problem: about 2/3 of all downloaded WUs fail in the first second after starting. This is not too big a problem, as almost no computing time is wasted. However, it does increase data traffic, which is not a problem for me but might be for the server if other people are having the same problem.
My host is an i7 2630 with Win 7, 64 bit, running Debian 64 (Linux 2.6.32-5-amd64) in an Oracle VBox. BOINC manager is v. 6.10.58. (Machine no. 2950)
The errors are:
Dom 21 Out 2012 00:13:48 WEST Asteroids@home Starting ps_120928_86_31_2
Dom 21 Out 2012 00:13:48 WEST Asteroids@home Starting task ps_120928_86_31_2 using period_search version 10100
Dom 21 Out 2012 00:13:48 WEST Asteroids@home Starting ps_120928_83_221_2
Dom 21 Out 2012 00:13:48 WEST Asteroids@home Starting task ps_120928_83_221_2 using period_search version 10100
Dom 21 Out 2012 00:13:49 WEST Asteroids@home Computation for task ps_120928_86_31_2 finished
Dom 21 Out 2012 00:13:49 WEST Asteroids@home Output file ps_120928_86_31_2_0 for task ps_120928_86_31_2 absent
Dom 21 Out 2012 00:13:49 WEST Asteroids@home Resuming task ps_120928_89_287_2 using period_search version 10100
Dom 21 Out 2012 00:13:50 WEST Asteroids@home Computation for task ps_120928_83_221_2 finished
Dom 21 Out 2012 00:13:50 WEST Asteroids@home Output file ps_120928_83_221_2_0 for task ps_120928_83_221_2 absent
Dom 21 Out 2012 00:13:50 WEST Asteroids@home Resuming task ps_120928_83_214_2 using period_search version 10100
Dom 21 Out 2012 00:14:03 WEST Asteroids@home task ps_120928_87_47_2 resumed by user
Dom 21 Out 2012 00:14:03 WEST Asteroids@home Starting ps_120928_87_47_2
Dom 21 Out 2012 00:14:03 WEST Asteroids@home Starting task ps_120928_87_47_2 using period_search version 10100
Dom 21 Out 2012 00:14:04 WEST Asteroids@home Computation for task ps_120928_87_47_2 finished
Dom 21 Out 2012 00:14:04 WEST Asteroids@home Output file ps_120928_87_47_2_0 for task ps_120928_87_47_2 absent
Is there anything I can do to solve this? Thanks |
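A quick way to count how many tasks die right after starting is to pair the "Starting task" and "Computation for task ... finished" lines in a log like the one above. The Python sketch below assumes exactly that wording and reads only the HH:MM:SS part of each line, so the localized day and month names do not matter. The function name and the five-second threshold are arbitrary choices for illustration.

```python
import re
from datetime import datetime

TIME_RE = re.compile(r"\b(\d{2}:\d{2}:\d{2})\b")
START_RE = re.compile(r"Starting task (\S+) using")
FINISH_RE = re.compile(r"Computation for task (\S+) finished")


def short_lived_tasks(log_lines, max_seconds=5):
    """Return {task: runtime_seconds} for tasks that ended within max_seconds.

    Only the time of day is parsed, so runs that cross midnight are
    not handled by this sketch.
    """
    started = {}
    short = {}
    for line in log_lines:
        stamp = TIME_RE.search(line)
        if not stamp:
            continue
        when = datetime.strptime(stamp.group(1), "%H:%M:%S")
        start = START_RE.search(line)
        finish = FINISH_RE.search(line)
        if start:
            started[start.group(1)] = when
        elif finish and finish.group(1) in started:
            runtime = (when - started[finish.group(1)]).total_seconds()
            if 0 <= runtime <= max_seconds:
                short[finish.group(1)] = runtime
    return short


sample = [
    "Dom 21 Out 2012 00:13:48 WEST Asteroids@home Starting task "
    "ps_120928_86_31_2 using period_search version 10100",
    "Dom 21 Out 2012 00:13:49 WEST Asteroids@home Computation for task "
    "ps_120928_86_31_2 finished",
]
print(short_lived_tasks(sample))
```

Tasks that appear only as "Resuming task" are ignored, since their original start time is not in the excerpt.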
Send message Joined: 5 Sep 12 Posts: 30 Credit: 24,320 RAC: 0 |
Last modified: 24 Oct 2012, 5:20:14 UTC
ccandido,
Your computers are hidden, which makes it difficult for others to view your results and help you diagnose the problem. Please unhide them. There is NOTHING in the information we can see about your computers that would help anyone attack them. Many believe that knowing your IP address helps others attack, but nobody except you can see your IP address; it is hidden from us. I have put in the extra effort required to view your results, but many others are not willing to do so. Please unhide your computers.
This stderr output from one of your failed tasks indicates the application is unable to access a shared lib named libgcc_s.so.1. A shared lib is the Linux equivalent of a Windows DLL. That's very puzzling to me, because the ldd command indicates the application is a static build, which means it should not need to access any shared libs. Also, it seems you installed the Dotsch/UX based virtual machine I mentioned in the Easy Windows solution thread. If that is correct, then your VM already has libgcc_s.so.1 installed, with permissions set to allow everybody to access it.
Anybody else have any insight into this?
Edit added: Sorry, my mistake, you did not install the Dotsch/UX based VM, you created your own Debian VM. You said your Debian is 64 bit, but the stderr report says the task is using the 32 bit Asteroids application.
It's late and I'm sleepy, and I'm stumped for now. I'll have another look at it in the morning, though I suspect someone else will have it figured out by then. |
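Whether a downloaded application is a 32-bit or 64-bit build can also be checked directly from the file, without relying on the stderr report. The Python sketch below reads the ELF header: byte 4 (EI_CLASS) is 1 for 32-bit and 2 for 64-bit objects. The helper names are invented, and the location of the application binary is not given in this thread, so the path has to be supplied by whoever runs it.

```python
ELF_MAGIC = b"\x7fELF"
ELF_CLASSES = {1: "32-bit", 2: "64-bit"}


def elf_class(header):
    """Classify an executable from the first bytes of its ELF header.

    Byte 4 of an ELF file (EI_CLASS) is 1 for 32-bit and 2 for 64-bit.
    """
    if len(header) < 5 or header[:4] != ELF_MAGIC:
        return "not an ELF file"
    return ELF_CLASSES.get(header[4], "unknown ELF class")


def elf_class_of_file(path):
    """Read just enough of the file at `path` to classify it."""
    with open(path, "rb") as handle:
        return elf_class(handle.read(5))


print(elf_class(b"\x7fELF\x01"))  # header of a 32-bit build
print(elf_class(b"\x7fELF\x02"))  # header of a 64-bit build
```

Calling `elf_class_of_file` on the period_search executable should then show which build the client was actually sent.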
Send message Joined: 9 Jun 12 Posts: 584 Credit: 52,667,664 RAC: 0 |
ccandido: It is as jujube wrote. You have a 64-bit version of Linux there, but the crashed units were computed with the 32-bit version of the application. For some reason the BOINC manager decided to use the 32-bit version of the app instead of the 64-bit one. Just install ia32-libs and the problem will go away.
|
Send message Joined: 7 Oct 12 Posts: 1 Credit: 2,057,800 RAC: 0 |
You have most likely run into the same "feature" of the newer BOINC server software that many other projects already have. If the project provides 32 bit and 64 bit applications, the BOINC server software will ignore the "bitness" of the OS and send 32 bit applications to 64 bit operating systems because they sometimes run faster. On Windows machines this is usually no problem; on Linux it is.
There are two solutions to the problem:
1) Install the 32 bit libraries on the client side, or
2) Disable this behavior on the server (as far as I know there is a configuration option somewhere)
mickydl* |
Send message Joined: 28 Jul 12 Posts: 21 Credit: 2,513,800 RAC: 0 |
Thanks for the quick replies from all of you. I guess I will try to install the 32 bit libs (ia32-libs). I have no idea how to do that at the moment, but I will search the net. I have also downloaded jujube's virtual machine and might try that as well. Thanks for your help. ccandido |
Send message Joined: 9 Jun 12 Posts: 584 Credit: 52,667,664 RAC: 0 |
|
Send message Joined: 28 Jul 12 Posts: 21 Credit: 2,513,800 RAC: 0 |
|