Task 88762 not making progress, aborted
Author | Message |
---|---|
Joined: 29 Jun 12 Posts: 4 Credit: 612,808 RAC: 0 |
Task 88762 was sleeping and using no CPU time after it had used about 3 hours of CPU. I looked at it with the system monitor, and it had no open files and no memory map. Perhaps it got stuck just as it was about to exit. I will suspend the project for now, in case the other tasks I have queued are similar. Let me know if some tasks are known to be bad, or if there is anything I can do to help solve the problem. If it appears to be just my system, I will try the others and see what happens. Task details: Name: ps_121018_9005_232_0; Workunit: 36604; Created: 18 Oct 2012, 5:16:13 UTC; Sent: 21 Oct 2012, 6:01:13 UTC; Received: 24 Oct 2012, 19:50:31 UTC; Server state: Over; Outcome: Computation error; Client state: Aborted by user; Exit status: 203 (0xcb) Unknown error number; Computer ID: 362; Report deadline: 31 Oct 2012, 18:01:13 UTC; Run time: 63,090.56; CPU time: 10,696.53; Validate state: Invalid; Credit: 0.00; Application version: Period Search Application v101.00. Stderr output: <core_client_version>7.0.27</core_client_version> <![CDATA[ <message> aborted by user </message> <stderr_txt> 15:49:49 (17224): No heartbeat from core client for 30 sec - exiting </stderr_txt> ]]> |
Joined: 5 Sep 12 Posts: 30 Credit: 24,320 RAC: 0 |
Last modified: 24 Oct 2012, 22:58:30 UTC <stderr_txt> Perhaps it was a problem with the client rather than the project application. There is only one "No heartbeat from core client" message in the stderr output, but I wonder whether there are many more such messages in client_state.xml prior to 15:49:49. Remember the app uses UTC whereas the client uses local time. If there are, that might explain why the application was doing nothing. IIRC, most science apps will tolerate 100 missed heartbeats, at which point they exit with an error, but there is no guarantee this project's app behaves the same way, and no guarantee it sits idle between attempts to detect a client heartbeat. Hmmm. Actually, now that I think about it, I would say the application sent the client the "end task" signal but didn't get a heartbeat or response, so it sat through 100 or more no-heartbeat cycles (the long dormant time you noticed), and then you killed it. Yes, that would explain why there was no memory map, no open files, etc. It doesn't explain why there was no heartbeat from the client, but we don't want to solve a good mystery all at once. |
Joined: 29 Jun 12 Posts: 4 Credit: 612,808 RAC: 0 |
I'm quite certain the task was in the process of finishing up, as you suggest. I would have thought the client would send an abort request to the task, but all we see is the no-heartbeat message. That message occurred less than a minute before the task was reported (4 pm EDT, 8 pm UTC). I suggest that either the client terminates tasks by failing to send a heartbeat, or (more likely?) the task failed to get the abort request but was immediately aware that no heartbeat was present. It is also possible that the client sent a signal, which the task caught, but the task reported the missing heartbeat instead. If so, the no-heartbeat condition might have persisted for a long time, as you suggest. I don't know the IPC design of BOINC, but it probably doesn't matter. The task appears to have got stuck while exiting for unknown reasons, possibly a race condition. I did stop and continue the task before aborting it, hoping that might shake something up. In any case, I suspect the error is random and not likely to happen again. I will release the others one by one just in case. |
Joined: 29 Jun 12 Posts: 4 Credit: 612,808 RAC: 0 |
Other tasks are running fine. I think the errant task was probably not complete but failed during processing, because its time is significantly shorter than that of any of the other tasks. I have no idea why no files or memory map showed up, but that must have been a false appearance anyway, because the task printed to stderr when it was aborted.
|
Joined: 19 Jun 12 Posts: 32 Credit: 5,670,175 RAC: 1,808 |
|
Joined: 9 Jun 12 Posts: 584 Credit: 52,667,664 RAC: 0 |
|
Joined: 27 Jun 12 Posts: 129 Credit: 62,725,780 RAC: 0 |
I've seen this on various projects over the years, but fortunately it doesn't happen very often. The actions I take, in order, are: 1. Suspend and then un-suspend the task. 2. If that doesn't get it going, shut down BOINC and start it up again. 3. If that doesn't work, shut down BOINC and reboot the computer. 4. If none of the above works, abort the task. |
Joined: 19 Jun 12 Posts: 32 Credit: 5,670,175 RAC: 1,808 |
"Did you try to stop the client and start again?" Yes, I did, but it did not change the work unit's behaviour. I noticed the problem when it had run for 12 hours, while other work units were taking around 3 to 4 hours. I let it run to see if it would finish, as I thought it was like the work released many months ago that ran for 12 to 24 hours, but when progress had not moved for a number of hours I decided to kill it. Conan |
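
For anyone wanting to sanity-check the timing figures quoted in the first post, here is a rough Python sketch. It assumes the run time and CPU time on the task report are in seconds, and that the stderr timestamp 15:49:49 is local time in EDT (UTC-4), which is what the "4pm edt, 8pm utc" remark suggests. The function name `summarize` is made up for illustration.

```python
from datetime import datetime, timedelta, timezone

# Figures copied from the task report in the first post (assumed to be seconds).
RUN_TIME_S = 63_090.56   # elapsed run time
CPU_TIME_S = 10_696.53   # CPU time actually consumed

# Assumption: the stderr timestamp is local time in EDT (UTC-4).
EDT = timezone(timedelta(hours=-4), "EDT")
heartbeat_exit = datetime(2012, 10, 24, 15, 49, 49, tzinfo=EDT)
received = datetime(2012, 10, 24, 19, 50, 31, tzinfo=timezone.utc)


def summarize():
    """Return derived timing figures for the stuck task."""
    idle_s = RUN_TIME_S - CPU_TIME_S
    return {
        "cpu_hours": CPU_TIME_S / 3600,
        "idle_hours": idle_s / 3600,
        "cpu_fraction": CPU_TIME_S / RUN_TIME_S,
        # Gap between the "no heartbeat" exit line and the server receipt time.
        "exit_to_report_s": (received - heartbeat_exit).total_seconds(),
    }


if __name__ == "__main__":
    for key, value in summarize().items():
        print(f"{key}: {value:.2f}")
```

Under those assumptions the task used just under 3 hours of CPU, sat idle for roughly 14.5 hours, and the "no heartbeat" line was written 42 seconds before the result was received, which matches the "less than a minute" estimate in the thread.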
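
The stderr line in the first post ("No heartbeat from core client for 30 sec - exiting") describes a watchdog timeout. The toy model below illustrates that idea only; it is not BOINC's actual code. The 30-second figure comes from the stderr message, while the function name `exit_time` and its parameters are hypothetical.

```python
HEARTBEAT_TIMEOUT_S = 30  # from the stderr line "for 30 sec"


def exit_time(heartbeat_times, observe_until, start=0.0,
              timeout=HEARTBEAT_TIMEOUT_S):
    """Return when a watchdog would give up, or None if it never does.

    heartbeat_times: sorted times (in seconds) at which a heartbeat arrived.
    observe_until:   time at which we stop watching.
    The watchdog exits once `timeout` seconds pass without a heartbeat.
    """
    last_seen = start
    for t in heartbeat_times:
        if t - last_seen >= timeout:
            # The gap before this heartbeat was already too long.
            return last_seen + timeout
        last_seen = t
    if observe_until - last_seen >= timeout:
        # Heartbeats stopped and the silence outlasted the timeout.
        return last_seen + timeout
    return None


if __name__ == "__main__":
    # Heartbeats at 1 s, 2 s and 3 s, then silence: exit 30 s after the last one.
    print(exit_time([1, 2, 3], observe_until=100))
    # A heartbeat every second: the watchdog never fires.
    print(exit_time(range(100), observe_until=100))
```

In this model a single exit line in stderr is what you would expect from one uninterrupted 30-second silence. It says nothing about why the client went quiet, which is the part of the mystery the thread leaves open.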