Computation error when BOINC swaps GPU tasks between different AMD GPUs after resuming from GPU task suspension

ST240
Joined: 7 Jan 19
Posts: 8
Credit: 5,041,096
RAC: 29,283
Message 8521 - Posted: 7 Aug 2024, 21:06:56 UTC
Hi,
I have two AMD GPUs (RX 7600 XT and RX 6600). Sometimes, when GPU computing is suspended and then resumed, BOINC swaps which task runs on which GPU. This causes an immediate computation error. The stderr output from one task is below.

Stderr output:
<core_client_version>8.0.4</core_client_version>
<![CDATA[
<message>
The system cannot find the file specified.
(0x2) - exit code 2 (0x2)</message>
<stderr_txt>
BOINC client version 8.0.4
BOINC GPU type 'ATI', deviceId=1, slot=0
Application: period_search_10220_windows_x86_64__opencl_102_amd_win.exe
Version: 102.20.0.0
Platform name: AMD Accelerated Parallel Processing
Platform vendor: Advanced Micro Devices, Inc.
OpenCL device C version: OpenCL C 2.0 | OpenCL 2.0 AMD-APP (3617.0)
OpenCL device Id: 1
OpenCL device name: AMD Radeon RX 6600 7GB
Device driver version: 3617.0 (PAL,LC)
Multiprocessors: 14
Max Samplers: 16
Max work item dimensions: 3
Resident blocks per multiprocessor: 16
Grid dim: 448 = 2 * 14 * 16
Block dim: 128
Binary build log for AMD Radeon RX 6600:
OK (0)
Program build log for AMD Radeon RX 6600:
OK (0)
Prefered kernel work group size multiple: 32
Setting Grid Dim to 256
Platform name: AMD Accelerated Parallel Processing
Platform vendor: Advanced Micro Devices, Inc.
OpenCL device C version: OpenCL C 2.0 | OpenCL 2.0 AMD-APP (3617.0)
OpenCL device Id: 0
OpenCL device name: AMD Radeon RX 7600 XT 15GB
Device driver version: 3617.0 (PAL,LC)
Multiprocessors: 16
Max Samplers: 16
Max work item dimensions: 3
Resident blocks per multiprocessor: 16
Grid dim: 512 = 2 * 16 * 16
Block dim: 128
Build log: AMD Accelerated Parallel Processing | AMD Radeon RX 7600 XT:
Error: The program ISA amdgcn-amd-amdhsa--gfx1032 is not compatible with the device ISA amdgcn-amd-amdhsa--gfx1102
Error: create kernel metadata map using COMgr
Error: Cannot Find Global Var Sizes
Error: Cannot create kernels.

Error creating queue: build program failure(-11)

</stderr_txt>
]]>
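
Note on the log: the kernel binary appears to have been built for the RX 6600 (ISA gfx1032), and the failing build on the RX 7600 XT (ISA gfx1102) reports that the two ISAs are not compatible. A minimal Python sketch for pulling that pair out of a saved stderr file follows. The function name `find_isa_mismatch` is hypothetical and not part of BOINC or the application; the pattern is taken only from the error line shown above.

```python
import re

# Pattern copied from the build-log error line in the stderr output above.
ISA_MISMATCH = re.compile(
    r"The program ISA (?P<program>\S+) is not compatible "
    r"with the device ISA (?P<device>\S+)"
)


def find_isa_mismatch(stderr_text):
    """Return (program_isa, device_isa) if the log reports a mismatch, else None.

    Hypothetical helper for reading task logs; not part of BOINC.
    """
    match = ISA_MISMATCH.search(stderr_text)
    if match is None:
        return None
    return match.group("program"), match.group("device")


log_line = (
    "Error: The program ISA amdgcn-amd-amdhsa--gfx1032 is not compatible "
    "with the device ISA amdgcn-amd-amdhsa--gfx1102"
)
print(find_isa_mismatch(log_line))
```

If the two values differ, the task was most likely resumed on a different GPU model than the one it started on.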
ST240
Joined: 7 Jan 19
Posts: 8
Credit: 5,041,096
RAC: 29,283
Message 8524 - Posted: 8 Aug 2024, 14:18:04 UTC - in response to Message 8521.  
It seems like it might only happen when using the anonymous platform. I was using the anonymous platform to force the fastest app version for my processors, which required specifying both the CPU and GPU versions. I switched back to the regular mode and have not had an error yet.
Ian&Steve C.
Volunteer developer
Volunteer tester
Joined: 23 Apr 21
Posts: 88
Credit: 119,526,037
RAC: 139,321
Message 8525 - Posted: 8 Aug 2024, 14:42:15 UTC - in response to Message 8524.  
It is probably because of the inherent differences between the GPUs.

You can see that the app tailors the grid size to the GPU architecture, and I don't think it can restart from a checkpoint when the grid size changes. The work completed so far is not compatible with the new grid size.

The problem will only happen when the task restarts on a different GPU. If it restarts on the same GPU, or never has to pause and restart, you won't see the problem.

You can increase the task-switching interval to try to prevent this. Also, avoid stopping BOINC in the middle of task execution if possible.
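
To illustrate the idea, here is a rough Python sketch of a resume guard that only allows a restart when the device and grid size match what the checkpoint recorded. This is an assumption about how such a check could look, not the application's actual checkpoint format: `Checkpoint`, its fields, and `can_resume` are all hypothetical names, and the grid sizes are just the values printed in the stderr output above.

```python
from dataclasses import dataclass


@dataclass
class Checkpoint:
    """Hypothetical checkpoint record; the real app's format is not known here."""
    device_name: str  # GPU the task was running on when it was checkpointed
    grid_dim: int     # grid size the partial results were computed with
    progress: float   # fraction of the task completed


def can_resume(checkpoint, device_name, grid_dim):
    """Allow a resume only on the same device model with the same grid size."""
    return (
        checkpoint.device_name == device_name
        and checkpoint.grid_dim == grid_dim
    )


# Values taken from the log: the RX 6600 ended up with a grid size of 256,
# while the RX 7600 XT reported a grid size of 512.
saved = Checkpoint(device_name="AMD Radeon RX 6600", grid_dim=256, progress=0.4)

print(can_resume(saved, "AMD Radeon RX 6600", 256))
print(can_resume(saved, "AMD Radeon RX 7600 XT", 512))
```

With a guard like this, a task moved to the other GPU would start over instead of failing, at the cost of losing the work done so far.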
ST240
Joined: 7 Jan 19
Posts: 8
Credit: 5,041,096
RAC: 29,283
Message 8526 - Posted: 9 Aug 2024, 4:16:21 UTC - in response to Message 8525.  
Thanks. I might just use the second GPU only, since graphics rendering becomes a little slow when tasks are running on the first GPU. (NVIDIA 10-series and newer GPUs seem to render smoothly even with tasks running at the same time, possibly because they support accelerated GPU scheduling.) If there were a way to exclude the second GPU from suspension while the computer is in use, that would also work.