Please write out the GPU model to make debugging easier
Author | Message |
---|---|
Joined: 16 Jan 14 Posts: 17 Credit: 30,330,133 RAC: 9,168 |
Last modified: 8 Feb 2023, 18:58:19 UTC On two systems I have several Nvidia boards (GTX 1060 3 GB and 6 GB, 1660, 1070, P102-100), and occasionally a work unit fails to run with a message like this one: <core_client_version>7.21.0</core_client_version> <![CDATA[ <message> The system cannot find the file specified. (0x2) - exit code 2 (0x2)</message> <stderr_txt> Error: Number of lc points is greater than POINTS_MAX = 1000 </stderr_txt> ]]> I have no idea which board had the problem. |
Joined: 16 Nov 22 Posts: 131 Credit: 143,316,656 RAC: 486,534 |
Is this just for errored tasks? Don't you get the board identified in the stderr.txt output for validated tasks? Like this: <core_client_version>7.19.0</core_client_version> <![CDATA[ <stderr_txt> BOINC client version 7.19.0 BOINC GPU type 'NVIDIA', deviceId=0, slot=14 CUDA version: 11080 CUDA Device number: 0 CUDA Device: NVIDIA GeForce RTX 3080 12037MB CUDA Device driver: 525.78.01 Compute capability: 8.6 Shared memory per Block | per SM: 49152 | 102400 Multiprocessors: 70 Resident blocks per multiprocessor: 16 Grid dim: 1120 = 70*16 Block dim: 128 11:17:52 (229427): called boinc_finish(0) </stderr_txt> ]]> And this is what I get for an errored task: <core_client_version>7.19.0</core_client_version> <![CDATA[ <message> exceeded elapsed time limit 48606.68 (138002300.00G/2839.16G)</message> <stderr_txt> BOINC client version 7.19.0 BOINC GPU type 'NVIDIA', deviceId=1, slot=2 malloc(): invalid size (unsorted) SIGABRT: abort called BOINC client version 7.19.0 BOINC GPU type 'NVIDIA', deviceId=0, slot=2 malloc(): invalid size (unsorted) SIGABRT: abort called </stderr_txt> ]]> A proud member of the OFA (Old Farts Association) |
Joined: 16 Jan 14 Posts: 17 Credit: 30,330,133 RAC: 9,168 |
Hi Keith! I am making some progress. I found the Windows version of nvidia-smi, and it shows that one of my GPUs is "lost": C:\Program Files\NVIDIA Corporation>nvidia-smi Unable to determine the device handle for GPU0000:02:00.0: GPU is lost. Reboot the system to recover this GPU I have removed that GPU. The Windows Device Manager gave no indication of any problem. |
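
When the stderr output does contain the `BOINC GPU type ... deviceId=N` line, as in the tasks quoted above, the device id (and the model, when a `CUDA Device:` line is present) can be pulled out with a small script. This is only a minimal sketch: the function name `gpu_runs` is hypothetical and not part of BOINC, the patterns are modelled solely on the stderr text quoted in this thread, and whether a given `deviceId` corresponds to a particular physical card on your system is an assumption you would need to check yourself.

```python
import re

# Patterns modelled on the stderr lines quoted in this thread (assumption:
# other application versions may print different text).
GPU_LINE = re.compile(
    r"BOINC GPU type '(?P<vendor>[^']+)', deviceId=(?P<device>\d+), slot=(?P<slot>\d+)"
)
MODEL_LINE = re.compile(r"CUDA Device: (?P<model>.+?) \d+MB")


def gpu_runs(stderr_text):
    """Return one dict per 'BOINC GPU type' line found in the stderr text.

    'model' is None when no 'CUDA Device:' line follows before the next
    'BOINC GPU type' line, as in the errored task quoted above.
    """
    matches = list(GPU_LINE.finditer(stderr_text))
    runs = []
    for i, m in enumerate(matches):
        # Only look for the model between this start-up line and the next one.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(stderr_text)
        model = MODEL_LINE.search(stderr_text, m.end(), end)
        runs.append({
            "vendor": m["vendor"],
            "device_id": int(m["device"]),
            "slot": int(m["slot"]),
            "model": model["model"] if model else None,
        })
    return runs


if __name__ == "__main__":
    # Excerpt of the errored task quoted above.
    errored = (
        "BOINC GPU type 'NVIDIA', deviceId=1, slot=2 malloc(): invalid size (unsorted) "
        "SIGABRT: abort called BOINC client version 7.19.0 "
        "BOINC GPU type 'NVIDIA', deviceId=0, slot=2 malloc(): invalid size (unsorted)"
    )
    for run in gpu_runs(errored):
        print(run)
```

This does not help with the original failure in this thread, where the stderr holds nothing but the `POINTS_MAX` error, which is the reason for the wish in the first place.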