Sometimes in the middle of a simulation, I get one of these error messages:
"CUDA error in file [XX.cpp], line YY : the launch timed out and was terminated",
"CUDA error in file [XX.cpp], line YY : unknown error".
What's wrong?
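This error message appears if a single GPU kernel launch takes longer to execute than the watchdog timeout of the operating system allows (a few seconds; 2 seconds by default on Windows). As a safety mechanism, the OS terminates the kernel in order to prevent the screen from freezing.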
If you are running Linux, the likely reason is that the same GPU also drives the display and runs the X server. Try using a separate GPU for the display and use your CUDA-enabled GPU only as an accelerator.
If you are running Windows, there are two possible reasons why this error message appears: you are using either a GeForce card as the CUDA accelerator or a Tesla card with TCC (Tesla Compute Cluster) mode turned off. In the first case, the only way to solve this is to increase the watchdog timer. In the second case, try enabling TCC mode for your Tesla card.
Increase watchdog timer
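On Windows the watchdog is the Timeout Detection and Recovery (TDR) mechanism, and its timeout is the TdrDelay value (a REG_DWORD, in seconds) under HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\GraphicsDrivers. The following is only a minimal sketch, assuming Python is available on the machine; the helper names (check_delay, read_tdr_delay, write_tdr_delay) and the accepted range are illustrative choices, not part of any NVIDIA or Microsoft tool. Writing the value requires an administrator process, and the change only takes effect after a reboot.

```python
import sys

# Registry key that holds the Windows TDR (Timeout Detection and Recovery) settings.
TDR_KEY = r"SYSTEM\CurrentControlSet\Control\GraphicsDrivers"
TDR_VALUE = "TdrDelay"  # timeout in seconds, stored as a REG_DWORD


def check_delay(seconds):
    """Validate a timeout in seconds (assumed range: 1 .. 2**32 - 1) and return it."""
    if isinstance(seconds, bool) or not isinstance(seconds, int):
        raise TypeError("TdrDelay must be an integer number of seconds")
    if not 1 <= seconds <= 0xFFFFFFFF:
        raise ValueError("TdrDelay must fit in a DWORD and be at least 1")
    return seconds


def read_tdr_delay():
    """Return the configured TdrDelay, or None if the value is not set."""
    import winreg  # Windows-only standard library module

    try:
        with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, TDR_KEY) as key:
            value, _value_type = winreg.QueryValueEx(key, TDR_VALUE)
            return value
    except FileNotFoundError:
        # Key or value missing: Windows then uses its built-in default.
        return None


def write_tdr_delay(seconds):
    """Set TdrDelay. Needs administrator rights; reboot afterwards."""
    import winreg  # Windows-only standard library module

    seconds = check_delay(seconds)
    with winreg.CreateKeyEx(
        winreg.HKEY_LOCAL_MACHINE, TDR_KEY, 0, winreg.KEY_SET_VALUE
    ) as key:
        winreg.SetValueEx(key, TDR_VALUE, 0, winreg.REG_DWORD, seconds)


if __name__ == "__main__":
    if sys.platform == "win32":
        # Read-only by default, so running the script changes nothing.
        print("Current TdrDelay:", read_tdr_delay())
    else:
        print("TdrDelay only exists on Windows.")
```

To raise the timeout to, say, 60 seconds, call write_tdr_delay(60) from an elevated prompt and reboot. Choose a value comfortably larger than the longest kernel in your simulation; a very large value means a genuinely hung kernel will freeze the display for that long.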
Enable TCC mode
The following steps describe how to enable TCC mode:
- Open a command prompt (cmd or PowerShell) as administrator
- Navigate to "C:\Program Files\NVIDIA Corporation\NVSMI"
- Type: "nvidia-smi.exe -dm 1" (in PowerShell: ".\nvidia-smi.exe -dm 1"); the value 1 selects TCC, 0 selects WDDM
- Reboot the machine
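After the reboot it can be useful to confirm which driver model each GPU is using and whether it is driving a display. The sketch below is one possible way to do that, assuming Python is installed and nvidia-smi is on the PATH; it relies on the nvidia-smi query fields index, name, driver_model.current and display_active, while the function names (parse_gpu_rows, query_gpus) are made up for this example. On Linux the driver model is reported as not applicable, but display_active still shows whether the GPU is tied to the display.

```python
import csv
import shutil
import subprocess

# Fields requested from nvidia-smi, in this order.
QUERY = "index,name,driver_model.current,display_active"


def parse_gpu_rows(text):
    """Parse 'nvidia-smi --format=csv,noheader' output into a list of dicts."""
    rows = []
    for fields in csv.reader(text.splitlines(), skipinitialspace=True):
        if len(fields) != 4:
            continue  # skip blank or unexpected lines
        index, name, driver_model, display_active = (f.strip() for f in fields)
        rows.append(
            {
                "index": index,
                "name": name,
                "driver_model": driver_model,      # e.g. TCC or WDDM on Windows
                "display_active": display_active,  # Enabled or Disabled
            }
        )
    return rows


def query_gpus():
    """Run nvidia-smi and return one dict per GPU."""
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_gpu_rows(result.stdout)


if __name__ == "__main__":
    if shutil.which("nvidia-smi") is None:
        print("nvidia-smi not found on PATH; run this from the NVSMI folder.")
    else:
        for gpu in query_gpus():
            print(
                f"GPU {gpu['index']} ({gpu['name']}): "
                f"driver model {gpu['driver_model']}, "
                f"display {gpu['display_active']}"
            )
```

A GPU that reports driver model TCC, or one with the display reported as Disabled, should no longer be subject to the display watchdog.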