Server GPU Configuration
-
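1. Install the NVIDIA driver
1.1 Installation

    # Add the NVIDIA graphics drivers PPA
    sudo add-apt-repository ppa:graphics-drivers/ppa
    sudo apt update

    # List the available driver versions
    ubuntu-drivers devices

    # Pick a driver version from the list
    # (named nvidia-xxx on some machines, nvidia-driver-xxx on others)
    sudo apt install nvidia-driver-440

    # Note: if Secure Boot is enabled in the BIOS, the installation may prompt
    # you to set a Secure Boot password. If your security requirements are not
    # strict, consider turning Secure Boot off.

    # Reboot, then check under "Additional Drivers" in the system settings
    # that the NVIDIA driver is listed

    # smi = System Management Interface; on success it prints the driver version
    nvidia-smi

1.2 Errors and fixes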
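The error below means that the loaded NVIDIA kernel module and the driver installed on the system are different versions. Uninstall the installed driver and reinstall the version that matches the NVIDIA kernel module.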

    # Checking the current version fails
    anneng@anneng01:~$ nvidia-smi
    Failed to initialize NVML: Driver/library version mismatch

    # Show the version of the kernel module used by the GPU driver
    anneng@anneng01:~$ cat /proc/driver/nvidia/version
    NVRM version: NVIDIA UNIX x86_64 Kernel Module 440.100 Fri May 29 08:45:51 UTC 2020
    GCC version: gcc version 7.5.0 (Ubuntu 7.5.0-3ubuntu1~18.04)
    # 440.100 is the version of the currently loaded driver

    # Show the driver packages installed on the system
    anneng@anneng01:~$ cat /var/log/dpkg.log | grep nvidia

If the error is instead the one below, one possible fix is to install a different driver version, preferably a newer one, and then reboot.

    NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

2. Install CUDA
Download
On the official website, choose the CUDA version to download.
This guide uses CUDA Toolkit 10.2 (Nov 2019) -> Linux, Ubuntu 18.04, x86_64, deb [local]; choose whichever version suits your setup. The installation commands given on the official website are:

    wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/cuda-ubuntu1804.pin
    sudo mv cuda-ubuntu1804.pin /etc/apt/preferences.d/cuda-repository-pin-600
    wget http://developer.download.nvidia.com/compute/cuda/10.2/Prod/local_installers/cuda-repo-ubuntu1804-10-2-local-10.2.89-440.33.01_1.0-1_amd64.deb
    sudo dpkg -i cuda-repo-ubuntu1804-10-2-local-10.2.89-440.33.01_1.0-1_amd64.deb
    sudo apt-key add /var/cuda-repo-10-2-local-10.2.89-440.33.01/7fa2af80.pub
    sudo apt-get update
    sudo apt-get -y install cuda

Add environment variables
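
    vim ~/.bashrc

    # Append the following to the end of the file
    export CUDA_HOME=/usr/local/cuda-10.2
    export LD_LIBRARY_PATH=/usr/local/cuda-10.2/lib64:$LD_LIBRARY_PATH
    export PATH=/usr/local/cuda-10.2/bin:$PATH

    # Reload the environment variables
    source ~/.bashrc

nvcc only works once these environment variables are set. The nvcc binary should be located in /usr/local/cuda-{xx}/bin.

Verify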
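One way to check the result is to compare the release that nvcc reports with the toolkit directory named in CUDA_HOME. The sketch below does this on text input only; `nvcc_release` and `cuda_home_version` are hypothetical helper names, not part of CUDA, and the parsing assumes the `nvcc --version` output format shown in this guide.

```shell
# Hypothetical helper: print the release number found in
# `nvcc --version` output read from stdin.
nvcc_release() {
    sed -n 's/.*release \([0-9][0-9]*\.[0-9][0-9]*\).*/\1/p' | head -n 1
}

# Hypothetical helper: print the version suffix of a toolkit
# directory such as /usr/local/cuda-10.2.
cuda_home_version() {
    basename "$1" | sed -n 's/^cuda-\([0-9][0-9]*\.[0-9][0-9]*\)$/\1/p'
}

# Demo on a sample line; on a real machine, pipe `nvcc --version`
# into nvcc_release and pass "$CUDA_HOME" to cuda_home_version.
sample='Cuda compilation tools, release 9.1, V9.1.85'
reported="$(printf '%s\n' "$sample" | nvcc_release)"
wanted="$(cuda_home_version /usr/local/cuda-10.2)"
if [ "$reported" = "$wanted" ]; then
    echo "nvcc matches CUDA_HOME ($reported)"
else
    echo "nvcc reports $reported but CUDA_HOME is $wanted"
fi
```

If the two differ, a likely cause is that another nvcc comes earlier on PATH than /usr/local/cuda-10.2/bin.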
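
    anneng@anneng01:~$ nvcc --version
    nvcc: NVIDIA (R) Cuda compiler driver
    Copyright (c) 2005-2017 NVIDIA Corporation
    Built on Fri_Nov__3_21:07:56_CDT_2017
    Cuda compilation tools, release 9.1, V9.1.85
    anneng@anneng01:~$ nvidia-smi

Note that this sample output reports release 9.1; once /usr/local/cuda-10.2/bin comes first on PATH, nvcc should report release 10.2.

3. Install cuDNN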
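3.1 Download
On the official website, choose the cuDNN version to download; it must match the installed CUDA version.
This guide uses Download cuDNN v8.0.1 RC2 (June 26th, 2020), for CUDA 10.2; choose whichever version suits your setup. Download all three of the following packages:
- cuDNN Runtime Library for Ubuntu18.04 (Deb)
- cuDNN Developer Library for Ubuntu18.04 (Deb)
- cuDNN Code Samples and User Guide for Ubuntu18.04 (Deb)
This gives you the following three files:
- libcudnn8_8.0.1.13-1+cuda10.2_amd64.deb
- libcudnn8-dev_8.0.1.13-1+cuda10.2_amd64.deb
- libcudnn8-doc_8.0.1.13-1+cuda10.2_amd64.deb
[Note:] Registration is required before you can download.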
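Because the cuDNN packages must match the CUDA version, it can help to read the CUDA version out of each file name before installing. The following is a minimal sketch; `cudnn_deb_cuda_version` is a hypothetical helper name, and it assumes the `+cudaX.Y_` naming pattern seen in the three file names above.

```shell
# Hypothetical helper: print the CUDA version embedded in a
# cuDNN .deb file name, or nothing if the pattern is absent.
cudnn_deb_cuda_version() {
    printf '%s\n' "$1" | sed -n 's/.*+cuda\([0-9][0-9]*\.[0-9][0-9]*\)_.*\.deb$/\1/p'
}

# Report the CUDA version each downloaded package was built for.
for f in \
    libcudnn8_8.0.1.13-1+cuda10.2_amd64.deb \
    libcudnn8-dev_8.0.1.13-1+cuda10.2_amd64.deb \
    libcudnn8-doc_8.0.1.13-1+cuda10.2_amd64.deb
do
    printf '%s -> CUDA %s\n' "$f" "$(cudnn_deb_cuda_version "$f")"
done
```

Each line should name the same CUDA version as the toolkit installed in section 2.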
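3.2 Installation

    sudo dpkg -i libcudnn8_8.0.1.13-1+cuda10.2_amd64.deb
    sudo dpkg -i libcudnn8-dev_8.0.1.13-1+cuda10.2_amd64.deb
    sudo dpkg -i libcudnn8-doc_8.0.1.13-1+cuda10.2_amd64.deb

    # List the NVIDIA devices on the PCI bus
    lspci | grep -i nvidia
    # Show the version of the loaded NVIDIA kernel module
    cat /proc/driver/nvidia/version
    # Check whether the nouveau driver is loaded
    lsmod | grep nouveau

4. Install Docker CE and nvidia-docker2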
4.1 Environment
- OS: Ubuntu 18.04 64 bit
- GPU: NVIDIA Tesla T4
- CUDA: 10.2
- cuDNN: 7.5
4.2 Configure the Docker repository

    # Update the package index
    sudo apt update

    # Enable HTTPS transport
    sudo apt install -y \
        apt-transport-https \
        ca-certificates \
        curl \
        gnupg-agent \
        software-properties-common

    # Add the GPG key
    curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -

    # Add the stable repository
    sudo add-apt-repository \
        "deb [arch=amd64] https://download.docker.com/linux/ubuntu \
        $(lsb_release -cs) \
        stable"

4.3 Install Docker CE

    sudo apt update

    # Install Docker CE
    sudo apt install -y docker-ce

    # Verify: the installation succeeded if the output contains "Hello from Docker!"
    sudo docker run hello-world

4.4 Configure the nvidia-docker repository and install

    # Add the repository
    distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
    curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
    curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list

    # Install and restart Docker
    sudo apt update && sudo apt install -y nvidia-container-toolkit
    sudo systemctl restart docker

    # Install docker-compose
    sudo curl -L https://github.com/docker/compose/releases/download/1.17.0/docker-compose-`uname -s`-`uname -m` -o /usr/local/bin/docker-compose
    sudo chmod +x /usr/local/bin/docker-compose
    docker-compose --version
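The `distribution` variable in the nvidia-docker step is built by sourcing /etc/os-release and concatenating its ID and VERSION_ID fields, which selects the repository list that gets downloaded. The sketch below wraps the same idea in a function so the value can be inspected first; `repo_distribution` is a hypothetical name, and the optional file argument exists only so the function can be tried on a sample file.

```shell
# Hypothetical helper: print ID and VERSION_ID from an os-release
# file, joined without a separator. The subshell keeps the sourced
# variables out of the calling shell.
repo_distribution() {
    ( . "${1:-/etc/os-release}" && printf '%s%s\n' "$ID" "$VERSION_ID" )
}

# On the target machine (not run here):
# echo "https://nvidia.github.io/nvidia-docker/$(repo_distribution)/nvidia-docker.list"
```

On Ubuntu 18.04 the fields are ID=ubuntu and VERSION_ID="18.04", so the function prints ubuntu18.04.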
Reference: "What exactly are the GPU, the GPU driver, nvcc, the CUDA driver, cudatoolkit, and cuDNN?" https://zhuanlan.zhihu.com/p/91334380
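After an automatic system upgrade, nvidia-smi failed with the mismatch error:

    nvidia-smi
    Failed to initialize NVML: Driver/library version mismatch
    cat /var/log/dpkg.log | grep nvidia
    2021-07-26 06:50:34 status installed nvidia-driver-460:amd64 460.91.03-0ubuntu0.18.04.1

dpkg.log shows that the system upgraded the driver package on its own. Rebooting fixes the problem.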
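The comparison behind this diagnosis can be sketched as a small script: read the version of the loaded kernel module, read the version of the last installed driver package, and compare them. The function names are hypothetical, and the parsing assumes the /proc/driver/nvidia/version and dpkg.log line formats shown in this guide; other driver builds may format these lines differently.

```shell
# Hypothetical helper: print the kernel module version from
# /proc/driver/nvidia/version text read on stdin.
kernel_module_version() {
    sed -n 's/.*Kernel Module  *\([0-9][0-9.]*\).*/\1/p' | head -n 1
}

# Hypothetical helper: print the upstream version of the last
# nvidia-driver-NNN package marked "installed" in dpkg.log text
# read on stdin (the part of the version before the first "-").
installed_driver_version() {
    awk '$3 == "status" && $4 == "installed" && $5 ~ /^nvidia-driver-[0-9]+:/ { v = $6 }
         END { if (v != "") { sub(/-.*/, "", v); print v } }'
}

# Compare the two version strings; assumes both use the same format.
check_driver_match() {
    if [ "$1" = "$2" ]; then
        echo "match: $1"
    else
        echo "mismatch: kernel module $1, installed package $2"
    fi
}

# On a live system (not run here):
# check_driver_match "$(kernel_module_version < /proc/driver/nvidia/version)" \
#                    "$(installed_driver_version < /var/log/dpkg.log)"
```

A mismatch here is consistent with the situation above: the package was upgraded on disk while the older kernel module was still loaded.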
The method from this post avoids the reboot:
https://stackoverflow.com/questions/43022843/nvidia-nvml-driver-library-version-mismatch#comment73133147_43022843

How can we do that?
First, we should know which driver modules are loaded:

    lsmod | grep nvidia

You may get:

    nvidia_uvm 634880 8
    nvidia_drm 53248 0
    nvidia_modeset 790528 1 nvidia_drm
    nvidia 12312576 86 nvidia_modeset,nvidia_uvm

Our final goal is to unload the nvidia module, so we should first unload the modules that depend on it:

    sudo rmmod nvidia_drm
    sudo rmmod nvidia_modeset
    sudo rmmod nvidia_uvm

Then, unload nvidia:

    sudo rmmod nvidia

Troubleshooting
If you get an error like rmmod: ERROR: Module nvidia is in use, the kernel module is still in use. Find the processes that are using it:

    sudo lsof /dev/nvidia*

Kill those processes, then continue unloading the modules.

Test
Confirm that the modules were unloaded:

    lsmod | grep nvidia

You should get nothing. Then confirm that the correct driver loads:

    nvidia-smi

You should get the correct output.
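
The unload order above can be turned into a dry-run helper that prints the rmmod commands for whichever of the four modules are currently loaded, leaving you to review them before running anything. This is a sketch under assumptions: `nvidia_unload_plan` is a hypothetical name, and the fixed order (nvidia_drm, nvidia_modeset, nvidia_uvm, then nvidia) is taken from the steps above rather than computed from the "Used by" column.

```shell
# Hypothetical helper: read `lsmod` output on stdin and print the
# rmmod commands for the loaded NVIDIA modules, dependents first
# and the base `nvidia` module last. Nothing is executed.
nvidia_unload_plan() {
    _loaded="$(awk '{ print $1 }')"
    for _m in nvidia_drm nvidia_modeset nvidia_uvm nvidia; do
        if printf '%s\n' "$_loaded" | grep -qx "$_m"; then
            echo "sudo rmmod $_m"
        fi
    done
}

# Demo on the sample lsmod output from above; on a real machine use
# `lsmod | nvidia_unload_plan`.
nvidia_unload_plan <<'EOF'
nvidia_uvm 634880 8
nvidia_drm 53248 0
nvidia_modeset 790528 1 nvidia_drm
nvidia 12312576 86 nvidia_modeset,nvidia_uvm
EOF
```

Printing the plan instead of executing it keeps the step safe to try on a machine where GPU processes may still be running.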