Computation error when BOINC swaps GPU tasks between different AMD GPUs after resuming from GPU task suspension
Author | Message |
---|---|
Send message Joined: 7 Jan 19 Posts: 8 Credit: 5,041,096 RAC: 29,283 |
Hi, I have 2 AMD GPUs (RX 7600 XT and RX 6600) and sometimes BOINC switches which task is on a particular GPU if GPU computing is suspended and then resumed. This causes an immediate computation error. The stderr output from one task is below.

Stderr output:
<core_client_version>8.0.4</core_client_version>
<![CDATA[
<message> The system cannot find the file specified. (0x2) - exit code 2 (0x2)</message>
<stderr_txt>
BOINC client version 8.0.4
BOINC GPU type 'ATI', deviceId=1, slot=0
Application: period_search_10220_windows_x86_64__opencl_102_amd_win.exe
Version: 102.20.0.0
Platform name: AMD Accelerated Parallel Processing
Platform vendor: Advanced Micro Devices, Inc.
OpenCL device C version: OpenCL C 2.0 | OpenCL 2.0 AMD-APP (3617.0)
OpenCL device Id: 1
OpenCL device name: AMD Radeon RX 6600 7GB
Device driver version: 3617.0 (PAL,LC)
Multiprocessors: 14
Max Samplers: 16
Max work item dimensions: 3
Resident blocks per multiprocessor: 16
Grid dim: 448 = 2 * 14 * 16
Block dim: 128
Binary build log for AMD Radeon RX 6600: OK (0)
Program build log for AMD Radeon RX 6600: OK (0)
Prefered kernel work group size multiple: 32
Setting Grid Dim to 256
Platform name: AMD Accelerated Parallel Processing
Platform vendor: Advanced Micro Devices, Inc.
OpenCL device C version: OpenCL C 2.0 | OpenCL 2.0 AMD-APP (3617.0)
OpenCL device Id: 0
OpenCL device name: AMD Radeon RX 7600 XT 15GB
Device driver version: 3617.0 (PAL,LC)
Multiprocessors: 16
Max Samplers: 16
Max work item dimensions: 3
Resident blocks per multiprocessor: 16
Grid dim: 512 = 2 * 16 * 16
Block dim: 128
Build log: AMD Accelerated Parallel Processing | AMD Radeon RX 7600 XT:
Error: The program ISA amdgcn-amd-amdhsa--gfx1032 is not compatible with the device ISA amdgcn-amd-amdhsa--gfx1102
Error: create kernel metadata map using COMgr
Error: Cannot Find Global Var Sizes
Error: Cannot create kernels.
Error creating queue: build program failure(-11)
</stderr_txt>
]]> |
Send message Joined: 7 Jan 19 Posts: 8 Credit: 5,041,096 RAC: 29,283 |
It seems like it might only happen when using the anonymous platform. I was using the anonymous platform to force the fastest app version for my processors, which required specifying both the CPU and GPU versions. I switched back to the regular mode and have not had an error since.
|
Send message Joined: 23 Apr 21 Posts: 88 Credit: 119,526,037 RAC: 139,321 |
It's probably because of the inherent differences between the GPUs. You can see that the app tailors the grid size to the GPU architecture, and I don't think it can restart from a checkpoint when the grid size changes: the work completed so far becomes incompatible with the new configuration. The problem only happens when the task restarts on a different GPU. If it restarts on the same GPU, or never has to pause and restart, you won't see it. You can increase the task-switching interval to try to prevent this, and avoid stopping BOINC in the middle of task execution if possible. |
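To check whether a failed task really did restart on a different GPU, the stderr from the first post can be scanned for its device blocks and for the ISA mismatch line. The following is only a rough sketch: the function name `diagnose` and the `SAMPLE` excerpt are made up for illustration, and the patterns are taken from the log format shown in this thread, which other app versions may not share.

```python
import re

# Patterns based on the stderr layout quoted in this thread (an assumption:
# other applications or app versions may print these lines differently).
DEVICE_NAME = re.compile(
    r"OpenCL device name:\s*(.+?)\s*(?:\n|Device driver version:|$)"
)
ISA_MISMATCH = re.compile(
    r"The program ISA (\S+) is not compatible with the device ISA (\S+)"
)

# Shortened excerpt of the log from the first post.
SAMPLE = "\n".join([
    "OpenCL device Id: 1",
    "OpenCL device name: AMD Radeon RX 6600 7GB",
    "Device driver version: 3617.0 (PAL,LC)",
    "Setting Grid Dim to 256",
    "OpenCL device Id: 0",
    "OpenCL device name: AMD Radeon RX 7600 XT 15GB",
    "Device driver version: 3617.0 (PAL,LC)",
    "Error: The program ISA amdgcn-amd-amdhsa--gfx1032 is not compatible "
    "with the device ISA amdgcn-amd-amdhsa--gfx1102",
])


def diagnose(stderr_text):
    """Summarise which devices a task started on and any ISA mismatch."""
    names = DEVICE_NAME.findall(stderr_text)
    mismatch = ISA_MISMATCH.search(stderr_text)
    return {
        "devices": names,                   # in the order they appear
        "switched": len(set(names)) > 1,    # more than one distinct GPU seen
        "isa_mismatch": mismatch.groups() if mismatch else None,
    }


if __name__ == "__main__":
    print(diagnose(SAMPLE))
```

If `switched` is true and `isa_mismatch` is set, the task was started on one GPU and later resumed on the other, which matches the failure described above.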
Send message Joined: 7 Jan 19 Posts: 8 Credit: 5,041,096 RAC: 29,283 |
Thanks. I might just use the second GPU only, since graphics rendering becomes a little slow when tasks are running on the first GPU. (NVIDIA 10-series and newer GPUs seem to render smoothly even with tasks running at the same time, possibly because they support hardware-accelerated GPU scheduling.) If there were a way to exclude the second GPU from suspension while the computer is in use, that would also work.
|
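One way to run tasks on the second GPU only is the `<ignore_ati_dev>` option in BOINC's `cc_config.xml`. The sketch below just prints such a file so the structure is visible. It assumes the "first" GPU is BOINC device 0 (the RX 7600 XT in the log above), and the helper name `build_cc_config` is made up for illustration. Check the device numbers in the BOINC event log before using it, and merge the option into any existing `cc_config.xml` rather than overwriting it.

```python
import xml.etree.ElementTree as ET


def build_cc_config(ignored_ati_devices):
    """Return cc_config.xml text that tells BOINC to ignore the given AMD devices."""
    root = ET.Element("cc_config")
    options = ET.SubElement(root, "options")
    for device_num in ignored_ati_devices:
        # One <ignore_ati_dev> element per device number to skip.
        ET.SubElement(options, "ignore_ati_dev").text = str(device_num)
    ET.indent(root)  # pretty-print; requires Python 3.9 or newer
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Assumption: device 0 is the GPU that drives the display.
    print(build_cc_config([0]))
```

The printed file goes in the BOINC data directory, and the client likely needs a restart before the change takes effect. With only one GPU in use, a suspended task can no longer be resumed on the other card.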