all downloaded tasks starting at the same time - and crash


erich56

Joined: 12 Jan 24
Posts: 17
Credit: 1,592,428
RAC: 0
Message 8196 - Posted: 13 Jan 2024, 14:01:33 UTC
Yesterday I attached the Asteroids@home project to 3 of my computers. In the settings I specified that only the GPU should crunch tasks, which works perfectly on all three machines.
Today, I tried to attach a fourth PC, but I ran into the following problem:
All downloaded tasks seem to start at the same time - only the first one proceeds and finishes normally after a few minutes, all others produce a "computation error" right at the moment they want to start. Which is clear, because for sure some 20 or 30 tasks cannot be crunched at the same time.
What's going wrong?

FYI: the CPU is an Intel Xeon E5-2667 v4, the GPU is an NVIDIA Quadro P5000.
The stderr of these failed tasks looks like:
https://asteroidsathome.net/boinc/result.php?resultid=434995002
ahorek's team
Volunteer developer
Volunteer tester

Joined: 1 Jan 13
Posts: 90
Credit: 10,406,061
RAC: 6,503
Message 8197 - Posted: 13 Jan 2024, 16:54:08 UTC
are you running 20/30 gpu tasks in parallel? that's not a good idea and it won't be any better.

perhaps try with more recent drivers? 516.94 is a bit old
https://www.nvidia.com/Download/driverResults.aspx/216860/en-us/
Ian&Steve C.
Volunteer developer
Volunteer tester
Joined: 23 Apr 21
Posts: 85
Credit: 115,950,169
RAC: 177,173
Message 8198 - Posted: 13 Jan 2024, 16:58:44 UTC
sounds like something wrong with your boinc configuration.

did you put 0.01 in the gpu_usage section of your app_config by mistake?
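
For context, the setting in question lives in an app_config.xml placed in the project folder. Below is a minimal sketch of what such an accidental entry could look like. It is hypothetical and not taken from the poster's machine, and the app name period_search is an assumption (the real name is listed in client_state.xml). gpu_usage is the fraction of one GPU that BOINC reserves per task, so a value of 0.01 would in principle let the client start many tasks on a single GPU at once.

```xml
<app_config>
  <!-- Hypothetical example of the suspected mistake; values are illustrative only. -->
  <app>
    <!-- Assumed app name; check client_state.xml for the real one. -->
    <name>period_search</name>
    <gpu_versions>
      <!-- Fraction of one GPU reserved per task: 0.01 permits many tasks per GPU. -->
      <gpu_usage>0.01</gpu_usage>
      <!-- Fraction of one CPU core reserved per task. -->
      <cpu_usage>1.0</cpu_usage>
    </gpu_versions>
  </app>
</app_config>
```

With gpu_usage set to 1.0 the client runs one task per GPU, and with 0.5 it runs two.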

erich56

Joined: 12 Jan 24
Posts: 17
Credit: 1,592,428
RAC: 0
Message 8199 - Posted: 13 Jan 2024, 18:06:20 UTC - in response to Message 8197.  
are you running 20/30 gpu tasks in parallel? that's not a good idea
no, it was NOT my intention to run more than 1 task at a time. I just meant to say that when I pushed the UPDATE button in the BOINC manager after attaching to the project, some 20/30 tasks were downloaded - with the expectation that one after the other will be processed.
erich56

Joined: 12 Jan 24
Posts: 17
Credit: 1,592,428
RAC: 0
Message 8200 - Posted: 13 Jan 2024, 18:12:54 UTC - in response to Message 8198.  
sounds like something wrong with your boinc configuration.

did you put 0.01 in the gpu_usage section of your app_config by mistake?
no, on this PC I did not even put in an app_config.xml.
I guess you refer to my other posting from yesterday where I asked for the correct wording in an app_config.xml in order to run 2 tasks in parallel. I finally found out, but it turned out that the runtime for each task had doubled, so it did not make any sense to run 2 such tasks in parallel, and I finally removed the app_config.xml.
And for exactly this reason I did not even install one in the other PC from today.

BTW, any other GPU tasks which I crunch on this PC, e.g. WCG, GPUGRID, Primegrid, Einstein ... work well. So everything seems to be all right with the BOINC configuration.
No idea why all of a sudden Asteroid wants to crunch all downloaded tasks together
Ian&Steve C.
Volunteer developer
Volunteer tester
Joined: 23 Apr 21
Posts: 85
Credit: 115,950,169
RAC: 177,173
Message 8201 - Posted: 13 Jan 2024, 18:49:58 UTC - in response to Message 8200.  
still sounds like something wrong with the BOINC configuration. since it's BOINC that controls how many tasks run at a time, not the application itself.

you might want to actually inspect the project folder to make sure there isn't some app_config file that you're not aware of. if there is indeed no app_config, then I would uninstall BOINC and reinstall the latest version you can.

erich56

Joined: 12 Jan 24
Posts: 17
Credit: 1,592,428
RAC: 0
Message 8202 - Posted: 13 Jan 2024, 18:58:54 UTC - in response to Message 8201.  

Last modified: 13 Jan 2024, 19:08:21 UTC
still sounds like something wrong with the BOINC configuration. since it's BOINC that controls how many tasks run at a time, not the application itself.

you might want to actually inspect the project folder to make sure there isn't some app_config file that you're not aware of. if there is indeed no app_config, then I would uninstall BOINC and reinstall the latest version you can.
I double-checked: no app_config.xml in the project folder.
So it might not be a bad idea to install the latest BOINC version.
Still strange though: the problem just occurs with Asteroid but with no other project.

Also, I will update the driver. 516.94 is pretty old.
erich56

Joined: 12 Jan 24
Posts: 17
Credit: 1,592,428
RAC: 0
Message 8204 - Posted: 14 Jan 2024, 12:10:22 UTC - in response to Message 8202.  
...So it might not be a bad idea to install the latest BOINC version.

Also, I will update the driver. 516.94 is pretty old.

I now updated both BOINC and the NVIDIA driver.

Unfortunately, I cannot make a testrun for Asteroid now, since there are no new tasks available :-(

P.S. As mentioned before, I am new to this project - so my question: does it happen frequently that no tasks are available?
I have been participating in projects where this happens quite often, and others where this happens almost never.
Ian&Steve C.
Volunteer developer
Volunteer tester
Joined: 23 Apr 21
Posts: 85
Credit: 115,950,169
RAC: 177,173
Message 8205 - Posted: 14 Jan 2024, 15:26:59 UTC - in response to Message 8204.  
it happens about once a month or so. the tasks run down to 0, and it takes a day or two for the Admins to load up more work. then they load a large amount of work again that lasts another month or so.

erich56

Joined: 12 Jan 24
Posts: 17
Credit: 1,592,428
RAC: 0
Message 8206 - Posted: 14 Jan 2024, 17:28:07 UTC - in response to Message 8205.  
it happens about once a month or so. the tasks run down to 0, and it takes a day or two for the Admins to load up more work. then they load a large amount of work again that lasts another month or so.
many thanks for the information :-)
erich56

Joined: 12 Jan 24
Posts: 17
Credit: 1,592,428
RAC: 0
Message 8207 - Posted: 15 Jan 2024, 17:18:16 UTC - in response to Message 8204.  
...So it might not be a bad idea to install the latest BOINC version.

Also, I will update the driver. 516.94 is pretty old.

I now updated both BOINC and the NVIDIA driver.

Unfortunately, I cannot make a testrun for Asteroid now, since there are no new tasks available :-(
Now new tasks are available, so I downloaded a few. And again, all downloaded tasks started at the same time, and except for one, all failed immediately, of course.

So neither the latest version of BOINC nor the latest version of the GPU driver could solve the problem.
As said before, this happens with no other GPU project. So it's bound to have something to do with Asteroid specifically.
Ian&Steve C.
Volunteer developer
Volunteer tester
Joined: 23 Apr 21
Posts: 85
Credit: 115,950,169
RAC: 177,173
Message 8208 - Posted: 15 Jan 2024, 20:37:15 UTC - in response to Message 8207.  
it's some problem with your boinc configuration. it's not possible for the app itself to behave the way you are describing. it only runs once. if you have multiple copies running it's because BOINC told it to.

remove the asteroids project completely from BOINC, and re-add it.

erich56

Joined: 12 Jan 24
Posts: 17
Credit: 1,592,428
RAC: 0
Message 8209 - Posted: 16 Jan 2024, 9:39:01 UTC - in response to Message 8208.  
remove the asteroids project completely from BOINC, and re-add it.
I did this now, same problem as before.
Something is wrong, and obviously no way to get it repaired.
So, unfortunately, no chance to process Asteroid with this machine :-(
erich56

Joined: 12 Jan 24
Posts: 17
Credit: 1,592,428
RAC: 0
Message 8220 - Posted: 18 Jan 2024, 8:05:19 UTC
good news:
a friend gave me the advice to try the app_config.xml

<app_config>
<project_max_concurrent>1</project_max_concurrent>
</app_config>


which I have known for a long time and have used with some other projects now and then. I simply did not think about it.
And it works - now only 1 task is being processed at a time :-)

Sometimes solutions to problems are simpler than one might think.
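
A similar cap can also be written per application rather than per project. The following is only a sketch under assumptions: the app name period_search is not confirmed anywhere in this thread (the real name is listed in client_state.xml), and the thread only confirms that the project-wide project_max_concurrent setting worked on this machine.

```xml
<app_config>
  <app>
    <!-- Assumed app name; replace with the name found in client_state.xml. -->
    <name>period_search</name>
    <!-- Run at most one task of this app at a time. -->
    <max_concurrent>1</max_concurrent>
  </app>
</app_config>
```

The file goes into the Asteroids@home project folder. After editing it, the client has to re-read its config files (in BOINC Manager's Advanced view this should be Options > Read config files) or be restarted before the change takes effect.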
Ian&Steve C.
Volunteer developer
Volunteer tester
Joined: 23 Apr 21
Posts: 85
Credit: 115,950,169
RAC: 177,173
Message 8221 - Posted: 18 Jan 2024, 12:49:13 UTC - in response to Message 8220.  
when the task is running, what does BOINC say about the resource allocation?

usually it's something like "0.993 CPUs + 1 NVIDIA GPU" or "1 CPUs + 0.5 NVIDIA GPU" or something like that. can you report back the specific values listed here for this system?

erich56

Joined: 12 Jan 24
Posts: 17
Credit: 1,592,428
RAC: 0
Message 8222 - Posted: 18 Jan 2024, 14:26:03 UTC - in response to Message 8221.  
it says: "0,01 CPUs + 1 NVIDIA GPU"