Troubleshooting host with multiple NVIDIA devices but with different Compute Capabilities (CC)


Zarck

Joined: 21 Jun 12
Posts: 24
Credit: 19,989,257
RAC: 134
Message 7573 - Posted: 15 Dec 2022, 16:28:27 UTC
Error while computing 0.00 0.00
Zarck

Joined: 21 Jun 12
Posts: 24
Credit: 19,989,257
RAC: 134
Message 7574 - Posted: 15 Dec 2022, 22:43:09 UTC
Georgi Vidinski
Volunteer moderator
Project administrator
Project developer
Project tester
Joined: 22 Nov 17
Posts: 159
Credit: 13,180,518
RAC: 0
Message 7580 - Posted: 18 Dec 2022, 5:29:42 UTC
Almost every week there is a new question along the lines of "Why am I getting errors while computing?" related to one NVIDIA device or another. I am posting this here because I believe it will shed light on the situation, the cause of the problem, and the workaround.

Why am I getting invalid results and errors while computing on one of my cards?
When more than one NVIDIA card is installed in the same host and the cards have different Compute Capabilities (CC), especially when the CCs are too far apart to be compatible, the cards with the lower CC keep receiving the same application as the card with the highest CC.
There is nothing we can do about this on the server side. It is a BOINC issue: it comes from how the BOINC client works and how it reports your configuration to the server.
As Ian&Steve C. stated in his post here:
symptoms are entirely due to limitations in the BOINC client. These kinds of issues happen at every BOINC project, not just here. the client is only setup to transmit the "best" GPU. this is fact. that means the server scheduler MUST be setup to act on this information only. it cannot differentiate between two different nvidia GPUs that require different apps because it only knows about the "best" one. it can only act on different GPUs if they are from different vendors like AMD or Intel.


How to deal with the problem?
For the moment, the only way is to restrict the use of the troublesome card(s) at a specific level using the client configuration files. The options are described in detail here: Client configuration
It is a workaround rather than a solution, but we all have to live with the capabilities of the BOINC software.
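
A minimal sketch of such a restriction, assuming the card to be kept away from this project is listed as device 1 in the BOINC event log at client startup (the device number is an assumption for illustration; check your own log). The `exclude_gpu` option with its `url`, `device_num`, and optional `type` elements is documented on the Client configuration page and goes into `cc_config.xml` in the BOINC data directory.

```xml
<cc_config>
  <options>
    <!-- Keep Asteroids@home tasks off one specific GPU. -->
    <exclude_gpu>
      <!-- Project URL as shown by the client for this project -->
      <url>https://asteroidsathome.net/boinc/</url>
      <!-- Assumed device number; use the one from your own event log -->
      <device_num>1</device_num>
      <!-- Optional: limit the exclusion to NVIDIA devices -->
      <type>NVIDIA</type>
    </exclude_gpu>
  </options>
</cc_config>
```

Every opening tag needs its matching closing tag, otherwise the client may ignore the file. After editing, it is safest to restart the BOINC client so that the exclusion is picked up.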

Of course, there is always one more option: remove the troublesome card(s) from the host and install them in a separate computer (a new host). That is not always possible, because these hosts have a primary purpose and are used in exactly the configuration they have, while running BOINC projects is only a spare-time task.
“The good thing about science is that it's true whether or not you believe in it.” ― Neil deGrasse Tyson
Zarck

Joined: 21 Jun 12
Posts: 24
Credit: 19,989,257
RAC: 134
Message 7653 - Posted: 1 Jan 2023, 22:30:28 UTC - in response to Message 7580.  

Last modified: 1 Jan 2023, 23:13:23 UTC
I have a GTX Titan, a GTX 1070, a GT 1030, and a Quadro K5000, and Milkyway for GPU has no problems with these cards.

https://milkyway.cs.rpi.edu/milkyway/results.php?userid=5706&offset=100&show_names=0&state=0&appid=

https://milkyway.cs.rpi.edu/milkyway/hosts_user.php

And the old version of Asteroids also worked fine.

https://asteroidsathome.net/boinc/results.php?userid=1690&offset=0&show_names=0&state=4&appid=
San-Fernando-Valley

Joined: 12 Apr 17
Posts: 31
Credit: 5,360,264
RAC: 0
Message 7662 - Posted: 2 Jan 2023, 10:40:25 UTC - in response to Message 7653.  
Zarck:
I have a GTX Titan, a GTX 1070, a GT 1030, and a Quadro K5000, and Milkyway for GPU has no problems with these cards.
...

Are you saying that one host/computer has all 4 GPUs installed and running fine?
Because the post title says "... host with multiple NVIDIA devices ..."

For the Quadro K5000 the stderr says:
CUDA Device: Quadro K5000 4096MB
CUDA Device driver: 462.96
Compute capability: 3.0
Shared memory per Block | per SM: 49152 | 49152
Multiprocessors: 8
Unsupported Compute Capability (CC) detected (3.0). Supported Compute Capabilities are between 5.3 and 8.9.

So maybe MW has different CC requirements?

Check the last post from Georgi again ...
Ian&Steve C.
Volunteer developer
Volunteer tester
Joined: 23 Apr 21
Posts: 85
Credit: 111,886,184
RAC: 202,512
Message 7693 - Posted: 12 Jan 2023, 23:25:02 UTC - in response to Message 7662.  

Last modified: 12 Jan 2023, 23:27:06 UTC
Zarck:
I have a GTX Titan, a GTX 1070, a GT 1030, and a Quadro K5000, and Milkyway for GPU has no problems with these cards.
...

Are you saying that one host/computer has all 4 GPUs installed and running fine?
Because the post title says "... host with multiple NVIDIA devices ..."


The problem isn't simply having multiple GPUs. That's no problem.

The problem is when you have multiple Nvidia GPUs and they need different apps.

Say you have a Kepler card (CC 3.5) and an Ampere card (CC 8.6) in the same host.

The Ampere needs at least a CUDA 11.1 app, so it will use the CUDA 11.8 app available here. But that app doesn't support the Kepler card; it will error if run on that card. Conversely, the Ampere card can't use the CUDA 5.5 or 10.2 apps that the Kepler can use. This is due to the limits placed on the applications when they were compiled: the 11.8 app was compiled with support for CC 5.0-8.9 only.

These kinds of restrictions exist only because of how CUDA support is segmented in the toolkits and drivers. Milkyway works fine because it's a legacy OpenCL application that supports most devices, though maybe not as optimized or as fast as it could be if it were coded in CUDA or even later versions of OpenCL.

OpenCL does not know or care about anything related to CC and cannot have requirements for it. CC is an Nvidia/CUDA-only thing.

Zarck

Joined: 21 Jun 12
Posts: 24
Credit: 19,989,257
RAC: 134
Message 7697 - Posted: 13 Jan 2023, 18:10:32 UTC
On my HP Xeon Z620 (GTX Titan + GTX 1070), even when reserving one project per card (Asteroids + Milkyway), Asteroids continues to produce errors, regardless of the GPU card used.

<exclude_gpu>
<url>https://asteroidsathome.net/boinc/</url>
<device_num>1</device_num>
</exclude_gpu>
<exclude_gpu>
<url>http://milkyway.cs.rpi.edu/milkyway/</url>
<device_num>0</device_num>
</exclude_gpu>
Ian&Steve C.
Volunteer developer
Volunteer tester
Joined: 23 Apr 21
Posts: 85
Credit: 111,886,184
RAC: 202,512
Message 7699 - Posted: 13 Jan 2023, 22:06:08 UTC - in response to Message 7697.  
On my HP Xeon Z620 (GTX Titan + GTX 1070), even when reserving one project per card (Asteroids + Milkyway), Asteroids continues to produce errors, regardless of the GPU card used.

<exclude_gpu>
<url>https://asteroidsathome.net/boinc/</url>
<device_num>1</device_num>
</exclude_gpu>
<exclude_gpu>
<url>http://milkyway.cs.rpi.edu/milkyway/</url>
<device_num>0</device_num>
</exclude_gpu>


I think you've excluded the wrong GPU, or made some mistake in the cc_config file so that it did not take effect. All of your errors come from tasks trying to use the GTX Titan (Kepler) card, and it's failing for the same reason I mentioned: an unsupported CC version with the CUDA 11.8 application.

You did process at least one task without issue on your GTX 1070: https://asteroidsathome.net/boinc/result.php?resultid=353155970

In your case, I would recommend reverting to the older 440 branch of drivers. This driver should support both your GTX Titan and GTX 1070. It will prevent the project from sending you the CUDA 11.8 app, and you should instead receive the CUDA 10.2 app, which will work on both of your GPUs.

Try this driver: https://www.nvidia.com/Download/driverResults.aspx/155056/en-us/
I'm not sure if there is any major difference between Win10 and Win11 drivers, as I don't think drivers as old as this were ever available for Win11, but it's worth a shot.

If it doesn't work, then you might need to break the Titan out into its own system, or reconfigure (and lock) your coproc_info.xml file to reflect the capabilities of your Titan (CC 3.5) so that the project can see that and send you the CUDA 10.2 app. Right now all it sees is your 1070, and it's sending you an app compatible with that card, not knowing that your second GPU is not compatible.

Zarck

Joined: 21 Jun 12
Posts: 24
Credit: 19,989,257
RAC: 134
Message 7700 - Posted: 14 Jan 2023, 0:28:29 UTC

Last modified: 14 Jan 2023, 0:30:03 UTC
Thanks for the answer, it works on my two machines: HP Z600 Xeon + Quadro K5000 + GT 1030, and HP Z620 Xeon + GTX Titan + GTX 1070. Both machines are running Windows 11 thanks to Winpass11.
Zarck

Joined: 21 Jun 12
Posts: 24
Credit: 19,989,257
RAC: 134
Message 7701 - Posted: 14 Jan 2023, 17:05:03 UTC - in response to Message 7700.  
Zarck

Joined: 21 Jun 12
Posts: 24
Credit: 19,989,257
RAC: 134
Message 7702 - Posted: 15 Jan 2023, 13:28:26 UTC - in response to Message 7701.  
I get the following message in the notifications tab of BOINC even though my GPUs are computing very well. Why?

"Asteroids@home: Notice from server
NVIDIA GPU: Please update your system with the latest drivers to be able to compute with the GPU
01/15/2023 10:57:02"