Task 88762 not making progress, aborted


Message boards : Number crunching : Task 88762 not making progress, aborted

rhb

Joined: 29 Jun 12
Posts: 4
Credit: 612,808
RAC: 0
Message 331 - Posted: 24 Oct 2012, 20:08:02 UTC
Task 88762 was sleeping and not using any CPU time after using about 3 hours of CPU. I looked at it with the system monitor, and it had no open files and no memory map. Perhaps it got stuck just as it was about to exit.

I will suspend the project for now, in case the others I have queued are similar. Let me know if some tasks are known to be bad, or if there is anything I can do to help in solving the problem. If it appears to be just my system, I will try the others and see what happens.

Name	ps_121018_9005_232_0
Workunit	36604
Created	18 Oct 2012 | 5:16:13 UTC
Sent	21 Oct 2012 | 6:01:13 UTC
Received	24 Oct 2012 | 19:50:31 UTC
Server state	Over
Outcome	Computation error
Client state	Aborted by user
Exit status	203 (0xcb) Unknown error number
Computer ID	362
Report deadline	31 Oct 2012 | 18:01:13 UTC
Run time	63,090.56
CPU time	10,696.53
Validate state	Invalid
Credit	0.00
Application version	Period Search Application v101.00 

Stderr output
<core_client_version>7.0.27</core_client_version>
<![CDATA[
<message>
aborted by user
</message>
<stderr_txt>
15:49:49 (17224): No heartbeat from core client for 30 sec - exiting

</stderr_txt>
]]>
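For anyone comparing the two timing fields in the task report above, here is a small sketch of the arithmetic. The function name `cpu_efficiency` is made up for illustration and is not part of BOINC, and the Run time and CPU time values are assumed to be in seconds.

```python
def cpu_efficiency(cpu_seconds: float, run_seconds: float) -> float:
    """Fraction of wall-clock run time that was spent using the CPU."""
    if run_seconds <= 0:
        raise ValueError("run_seconds must be positive")
    return cpu_seconds / run_seconds


run_time = 63_090.56  # "Run time" from the task report (assumed seconds)
cpu_time = 10_696.53  # "CPU time" from the task report (assumed seconds)

print(f"CPU hours:  {cpu_time / 3600:.2f}")
print(f"Run hours:  {run_time / 3600:.2f}")
print(f"Efficiency: {cpu_efficiency(cpu_time, run_time):.1%}")
```

That works out to roughly 3 CPU hours spread over about 17.5 hours of run time, which fits a task that sat idle for most of its life.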
jujube

Joined: 5 Sep 12
Posts: 30
Credit: 24,320
RAC: 0
Message 333 - Posted: 24 Oct 2012, 22:49:01 UTC - in response to Message 331.  

Last modified: 24 Oct 2012, 22:58:30 UTC
<stderr_txt>
15:49:49 (17224): No heartbeat from core client for 30 sec - exiting

</stderr_txt>
]]>


Perhaps it was a problem with the client rather than the project application. There is apparently only one "no heartbeat from core client" message in the stderr output, but I wonder if there are many more such messages in client_state.xml prior to 15:49:49? Remember the app uses UTC whereas the client uses local time. If there are, that might explain why the application was doing nothing. IIRC, most science apps will tolerate 100 missed heartbeats, at which point they'll exit with an error. There is no guarantee this project's app behaves the same way, and no guarantee it sits idle between attempts to detect a client heartbeat. Hmmm. Actually, now that I think about it, I would say the application sent the client the "end task" signal but didn't get a heartbeat or response, so it sat through 100 or more no-heartbeat cycles (the long dormant time you noticed), and then you killed it. Yes, that would explain why there was no memory map, no open files, etc. It doesn't explain why there was no heartbeat from the client, but we don't want to solve a good mystery all at once.
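To make the heartbeat idea concrete, here is a minimal sketch of a timeout-style watchdog. It is not BOINC's actual implementation: the class and method names are hypothetical, and the only detail taken from this thread is the 30 second limit that appears in the stderr line.

```python
from typing import Optional


class HeartbeatWatchdog:
    """Hypothetical watchdog: decides when an app should give up on its client."""

    def __init__(self, timeout_s: float = 30.0) -> None:
        self.timeout_s = timeout_s          # 30 s, as in the stderr message
        self.last_beat: Optional[float] = None

    def beat(self, now: float) -> None:
        """Record that a heartbeat arrived at time `now` (seconds)."""
        self.last_beat = now

    def should_exit(self, now: float) -> bool:
        """True once more than timeout_s has passed since the last heartbeat."""
        if self.last_beat is None:
            return False  # nothing received yet, so nothing to time out
        return now - self.last_beat > self.timeout_s


watchdog = HeartbeatWatchdog()
watchdog.beat(now=0.0)
print(watchdog.should_exit(now=10.0))  # still within the timeout
print(watchdog.should_exit(now=31.0))  # past the timeout
```

The clock is passed in as a plain number so the logic can be checked without waiting in real time.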
rhb

Joined: 29 Jun 12
Posts: 4
Credit: 612,808
RAC: 0
Message 334 - Posted: 25 Oct 2012, 3:39:43 UTC
I'm quite certain the task was in the process of finishing up, as you suggest. I would have thought the client would send an abort request to the task, but all we see is the no heartbeat message. It occurred less than a minute before the task was reported (4pm EDT, 8pm UTC). I suggest either the client terminates tasks by failing to send a heartbeat, or (more likely?) the task failed to get the request to abort but was aware immediately that no heartbeat was present. It is also possible that the client sent a signal, which the task caught but reported as a missing heartbeat instead. If so, the no heartbeat condition might have persisted for a long time, as you suggest.

I don't know the IPC design of BOINC, but it probably doesn't matter. The task appears to have got stuck while exiting for unknown reasons, possibly a race condition. I did stop and continue the task before aborting it, hoping that might shake something up. In any case, I suspect the error is random and not likely to happen again. I will release the others one by one just in case.
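The timing mentioned in this post can be checked with a short sketch. It assumes the stderr timestamp 15:49:49 is local time on 24 Oct 2012 (the stderr line itself carries no date) and that the host was on EDT, which is UTC-4; the variable names are illustrative only.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the host clock was on EDT (UTC-4), as stated in the post.
EDT = timezone(timedelta(hours=-4), "EDT")

# "No heartbeat" line from stderr, read as local time on 24 Oct 2012 (assumed date).
heartbeat_local = datetime(2012, 10, 24, 15, 49, 49, tzinfo=EDT)

# "Received" field from the task report, which is given in UTC.
received_utc = datetime(2012, 10, 24, 19, 50, 31, tzinfo=timezone.utc)

heartbeat_utc = heartbeat_local.astimezone(timezone.utc)
gap = received_utc - heartbeat_utc

print(heartbeat_utc.isoformat())
print(f"{gap.total_seconds():.0f} seconds between the message and the report")
```

Under those assumptions the message was written 42 seconds before the server received the report, consistent with "less than a minute".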
rhb

Joined: 29 Jun 12
Posts: 4
Credit: 612,808
RAC: 0
Message 335 - Posted: 26 Oct 2012, 14:46:35 UTC
Other tasks are running fine. I think the errant task was likely not complete, but failed during processing, because the time is significantly shorter than any of the other tasks. I have no idea why no files or memory map showed up, but that must have been a false appearance anyway because it printed to stderr when aborted.
Conan

Joined: 19 Jun 12
Posts: 32
Credit: 5,190,203
RAC: 1,822
Message 351 - Posted: 3 Nov 2012, 1:32:55 UTC
I aborted this work unit as it had been running for a full day (24 hours) and had only just got to 54.5% with 17 hours still to go. It hadn't moved for a while and as most work units only take half this time I killed it.

WU 103661

Conan
Kyong
Project administrator
Project developer
Project tester
Project scientist

Joined: 9 Jun 12
Posts: 584
Credit: 52,667,664
RAC: 0
Message 352 - Posted: 3 Nov 2012, 8:13:12 UTC
Did you try to stop the client and start again?
MarkJ
Joined: 27 Jun 12
Posts: 129
Credit: 62,716,918
RAC: 95
Message 355 - Posted: 4 Nov 2012, 0:18:16 UTC
I've seen this on various projects over the years, but fortunately it doesn't happen very often. The actions I take in order are:

1. Suspend then un-suspend the task
2. If that didn't get it going, shut down BOINC and start it up again
3. If that didn't work, shut down BOINC and reboot the computer
4. If none of the above work, abort it
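
The ordering of the steps above can be sketched as a simple escalation loop. This is only an illustration: `escalate` and the placeholder actions are hypothetical, nothing here talks to BOINC, and the real actions would be performed by hand or through the BOINC Manager.

```python
from typing import Callable, Iterable, List, Optional, Tuple

Step = Tuple[str, Callable[[], None]]


def escalate(steps: Iterable[Step], is_progressing: Callable[[], bool]) -> Optional[str]:
    """Try each step in order; return the name of the first step after which
    the task is progressing again, or None if none of them helped."""
    for name, action in steps:
        action()
        if is_progressing():
            return name
    return None


# Placeholder actions that only record what would have been done.
log: List[str] = []
steps: List[Step] = [
    ("suspend/resume task", lambda: log.append("suspend/resume task")),
    ("restart BOINC", lambda: log.append("restart BOINC")),
    ("reboot computer", lambda: log.append("reboot computer")),
]

# Pretend the task only recovers once BOINC has been restarted.
fixed_by = escalate(steps, lambda: "restart BOINC" in log)
print(fixed_by or "abort the task")
```

A return value of `None` corresponds to step 4: nothing cheaper worked, so the task gets aborted.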
Conan

Joined: 19 Jun 12
Posts: 32
Credit: 5,190,203
RAC: 1,822
Message 356 - Posted: 4 Nov 2012, 0:38:25 UTC - in response to Message 352.  
Did you try to stop the client and start again?


Yes I did but it did not change work unit behaviour.

I noticed the problem when it had run for 12 hours while other work units were taking around 3 to 4 hours.
I let it run to see if it would finish, as I thought it was like the work released many months ago that ran for 12 to 24 hours, but when progress had not moved for a number of hours I decided to kill it.

Conan
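
The "progress had not moved for a number of hours" test described above can be written down as a small check. This is a sketch only: `is_stalled` is a made-up helper, and apart from the 54.5% at 24 hours figure reported earlier in the thread, the sample readings are invented for illustration.

```python
from typing import List, Tuple


def is_stalled(samples: List[Tuple[float, float]], window_hours: float) -> bool:
    """samples are (elapsed_hours, percent_done) readings, oldest first.

    Returns True if percent_done has not increased over the last
    window_hours, judged against the most recent reading taken at or
    before the start of that window."""
    if not samples:
        return False
    latest_time, latest_pct = samples[-1]
    cutoff = latest_time - window_hours
    older = [pct for t, pct in samples if t <= cutoff]
    if not older:
        return False  # not enough history to cover the window
    return latest_pct <= older[-1]


# Hypothetical readings; only the final 54.5% at 24 h comes from the thread.
readings = [(12.0, 50.0), (20.0, 54.5), (24.0, 54.5)]

print(is_stalled(readings, window_hours=3))  # no change since the 20 h reading
print(is_stalled(readings, window_hours=6))  # progress was made since 12 h
```

How long a window to use is a judgment call; it depends on how often the application normally updates its progress figure.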