GPUs

General info

  • Our Linux machine is equipped with two NVIDIA RTX 6000 ‘Ada Lovelace’ GPUs. The compute capability of each of these GPUs is 8.9 (see here).
  • CUDA is a closed-source, proprietary parallel computing platform and application programming interface (API) that allows certain software to utilize GPUs. We currently run CUDA v12.2, which is installed in /usr/local/cuda on our Linux machine. You can always check the version by running nvcc --version
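The version check above can be scripted as a minimal sketch like the following. The helper name parse_cuda_release is made up for illustration, and the fallback path /usr/local/cuda/bin/nvcc assumes the standard toolkit layout under the install directory mentioned above.

```shell
# Extract the CUDA release number (e.g. "12.2") from `nvcc --version` output on stdin.
# parse_cuda_release is a hypothetical helper, not part of the CUDA toolkit.
parse_cuda_release() {
    sed -n 's/.*release \([0-9][0-9]*\.[0-9][0-9]*\).*/\1/p'
}

# Prefer nvcc on PATH; otherwise try the install location (assumed standard layout).
if command -v nvcc >/dev/null 2>&1; then
    nvcc_bin=nvcc
elif [ -x /usr/local/cuda/bin/nvcc ]; then
    nvcc_bin=/usr/local/cuda/bin/nvcc
else
    nvcc_bin=""
fi

if [ -n "$nvcc_bin" ]; then
    echo "CUDA toolkit release: $("$nvcc_bin" --version | parse_cuda_release)"
else
    echo "nvcc not found; try adding /usr/local/cuda/bin to your PATH" >&2
fi
```

If nvcc is installed but not found, the most likely cause is that /usr/local/cuda/bin is missing from your PATH.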

Monitoring GPU usage

  • nvidia-smi, which is installed along with the NVIDIA driver, provides a very basic output that is useful for checking the status and usage of the GPUs. Several other tools, including nvtop and nvitop, offer a better UI with the same or more info.
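For a compact, scriptable view, nvidia-smi can also print selected fields as CSV. The sketch below is one possible way to turn that into a one-line summary per GPU; the helper name format_gpu_row is made up for illustration, and the choice of fields is just an example.

```shell
# Turn CSV rows from nvidia-smi into readable lines.
# Expected fields: index, name, utilization (%), memory used (MiB), memory total (MiB).
# format_gpu_row is a hypothetical helper, not part of any NVIDIA tool.
format_gpu_row() {
    while IFS=',' read -r idx name util used total; do
        # nvidia-smi separates fields with ", ", so strip the leading space.
        printf 'GPU %s (%s): %s%% util, %s/%s MiB\n' \
            "${idx# }" "${name# }" "${util# }" "${used# }" "${total# }"
    done
}

if ! command -v nvidia-smi >/dev/null 2>&1; then
    echo "nvidia-smi not found; is the NVIDIA driver installed?" >&2
elif rows=$(nvidia-smi --query-gpu=index,name,utilization.gpu,memory.used,memory.total \
                       --format=csv,noheader,nounits); then
    printf '%s\n' "$rows" | format_gpu_row
else
    echo "nvidia-smi failed; see Troubleshooting" >&2
fi
```

For a continuously refreshing view without installing anything extra, watch -n 1 nvidia-smi is a simple alternative.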

Troubleshooting

  • Occasionally, the commands above return an error complaining that no GPU is detected. This can happen after our Ubuntu OS is updated, which can break the GPU drivers (even a minor update can do this). To fix this issue:
    1. First, run ubuntu-drivers devices to produce an output showing the GPUs currently recognized on the server and the drivers available for them. Here’s an example of what you should see on our server.
    2. Next, automatically install the recommended drivers with sudo ubuntu-drivers autoinstall. This often won’t solve the problem, but it’s still a good idea to do this step.
    3. Finally, install the specific driver you need, based on the output from the first step above. In this case, it would be sudo apt install nvidia-driver-525
    4. Restart the computer for your changes to take effect. Now you should be good to go!
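The steps above can be collected into a small script. This is only a sketch: the names APPLY, DRIVER_PACKAGE, run, and repair_gpu_driver are made up for illustration. By default it only prints the commands; nothing is executed unless you set APPLY=1. The driver package defaults to the one named in step 3 and should be changed to match whatever step 1 reports.

```shell
# Dry-run by default: commands are printed, not executed.
# Set APPLY=1 to actually run them (requires sudo rights).
APPLY="${APPLY:-0}"
# Driver package from step 3; adjust to match the output of step 1.
DRIVER_PACKAGE="${DRIVER_PACKAGE:-nvidia-driver-525}"

# Run a command, or just print it when in dry-run mode.
run() {
    if [ "$APPLY" = "1" ]; then
        "$@"
    else
        echo "[dry-run] $*"
    fi
}

repair_gpu_driver() {
    run ubuntu-drivers devices              # step 1: list detected GPUs and drivers
    run sudo ubuntu-drivers autoinstall     # step 2: install the recommended drivers
    run sudo apt install "$DRIVER_PACKAGE"  # step 3: install the specific driver
    # Step 4 is deliberately left as a manual action.
    echo "Done. Restart the computer to finish, e.g. with: sudo reboot"
}

repair_gpu_driver
```

Saved to a file (the name fix_gpu_driver.sh is just an example), it would be run as APPLY=1 DRIVER_PACKAGE=nvidia-driver-525 bash fix_gpu_driver.sh. Leaving the reboot out of the script avoids restarting the machine while other people have jobs running.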