暗能星系


    Server GPU Configuration

    • ice-melt

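      1. Install the NVIDIA driver

      1.1 Installation

      # Add the NVIDIA graphics-drivers PPA
      sudo add-apt-repository ppa:graphics-drivers/ppa
      sudo apt update
      
      # List the detected GPU and the driver packages available for it
      ubuntu-drivers devices
      
      # Pick a version from that list and install it
      # (the package is named nvidia-xxx on some systems and nvidia-driver-xxx on others)
      sudo apt install nvidia-driver-440
      
      # Note: if Secure Boot is enabled in the BIOS, the installation may prompt you
      # to set a Secure Boot password. If your security requirements are not strict,
      # consider turning Secure Boot off.
      
      
      # Reboot, then check that the NVIDIA driver appears under Additional Drivers in system settings.
      # nvidia-smi (SMI = System Management Interface) prints the driver version if the installation succeeded.
      nvidia-smi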
      
      
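Picking the version by eye from the `ubuntu-drivers devices` listing can be scripted. The following is a minimal sketch, assuming the listing contains lines shaped like `driver   : nvidia-driver-440 - distro non-free recommended`; the helper name `recommended_driver` is hypothetical and not part of any tool.

```shell
# Print the package that `ubuntu-drivers devices` marks as recommended.
# Reads the command's output on stdin. Assumes each candidate is listed as
#   driver   : <package> - <origin> ... recommended
recommended_driver() {
    awk '$1 == "driver" && $NF == "recommended" { print $3; exit }'
}

# Typical use on the target machine (not executed here):
#   pkg=$(ubuntu-drivers devices | recommended_driver)
#   [ -n "$pkg" ] && sudo apt install "$pkg"
```

If no line is marked as recommended the function prints nothing, so the guard on `$pkg` keeps `apt install` from running with an empty argument.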

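      1.2 Errors and fixes

      The error below means that the loaded NVIDIA kernel module and the installed user-space driver are at different versions. Uninstall the installed driver and reinstall the version that matches the kernel module.

      # Check the current version; this fails
      anneng@anneng01:~$ nvidia-smi
      Failed to initialize NVML: Driver/library version mismatch
      
      # Show the version of the loaded kernel module
      anneng@anneng01:~$ cat /proc/driver/nvidia/version 
      NVRM version: NVIDIA UNIX x86_64 Kernel Module  440.100  Fri May 29 08:45:51 UTC 2020
      GCC version:  gcc version 7.5.0 (Ubuntu 7.5.0-3ubuntu1~18.04) 
      
      # 440.100 is the version of the kernel module currently loaded
      
      # Show which driver packages dpkg has installed
      anneng@anneng01:~$ cat /var/log/dpkg.log | grep nvidia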
      
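The comparison between the loaded module and the installed package can be done mechanically. This is a rough sketch, not an official tool: the function names are hypothetical, and the package name `nvidia-driver-440` in the usage comment is taken from the install step and may differ on your machine.

```shell
# Extract the driver version from /proc/driver/nvidia/version text (stdin).
# It takes the first dotted number on the "NVRM version:" line.
kernel_module_version() {
    awk '/^NVRM version:/ {
        for (i = 1; i <= NF; i++)
            if ($i ~ /^[0-9]+\.[0-9]+(\.[0-9]+)?$/) { print $i; exit }
    }'
}

# Strip the packaging suffix from a dpkg version string,
# e.g. 440.100-0ubuntu0.18.04.1 becomes 440.100.
upstream_version() {
    printf '%s\n' "${1%%-*}"
}

# Succeed when the loaded module ($1) matches the installed package ($2).
versions_match() {
    [ "$1" = "$(upstream_version "$2")" ]
}

# Typical use on the affected machine (not executed here):
#   loaded=$(kernel_module_version < /proc/driver/nvidia/version)
#   installed=$(dpkg-query -W -f='${Version}' nvidia-driver-440)
#   versions_match "$loaded" "$installed" || echo "mismatch: $loaded vs $installed"
```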

      If you see the error below instead, one possible fix is to pick a different driver version, preferably a newer one, install it, and reboot.

      NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
      

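      2. Install CUDA

      Download

      Choose a CUDA version to download from the official NVIDIA website.

      This guide uses CUDA Toolkit 10.2 (Nov 2019) -> Linux Ubuntu 18.04 x86_64 deb [local]; choose whichever version suits your setup.

      The installation commands given on the official site are: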

      wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/cuda-ubuntu1804.pin
      sudo mv cuda-ubuntu1804.pin /etc/apt/preferences.d/cuda-repository-pin-600
      wget http://developer.download.nvidia.com/compute/cuda/10.2/Prod/local_installers/cuda-repo-ubuntu1804-10-2-local-10.2.89-440.33.01_1.0-1_amd64.deb
      sudo dpkg -i cuda-repo-ubuntu1804-10-2-local-10.2.89-440.33.01_1.0-1_amd64.deb
      sudo apt-key add /var/cuda-repo-10-2-local-10.2.89-440.33.01/7fa2af80.pub
      sudo apt-get update
      sudo apt-get -y install cuda
      

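      Add environment variables

      vim ~/.bashrc
      
      # Append the following to the end of the file
      export CUDA_HOME=/usr/local/cuda-10.2
      export LD_LIBRARY_PATH=/usr/local/cuda-10.2/lib64:$LD_LIBRARY_PATH
      export PATH=/usr/local/cuda-10.2/bin:$PATH
      
      
      # Reload the environment variables
      source ~/.bashrc
      

      nvcc is only found once these environment variables are in effect; the nvcc binary should be in /usr/local/cuda-{xx}/bin.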
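Editing `~/.bashrc` by hand works, but rerunning a setup script should not append the same exports twice. One possible way to make the step repeatable is sketched below; `add_cuda_env` is a hypothetical helper, and deriving the other two variables from `$CUDA_HOME` is a small variation on the lines shown above.

```shell
# Append the CUDA exports to a shell startup file unless already present.
# $1: startup file (for example ~/.bashrc)   $2: CUDA install prefix
add_cuda_env() {
    rc_file=$1
    cuda_home=$2
    marker="export CUDA_HOME=$cuda_home"
    # -x matches whole lines only, -F treats the marker as a literal string
    if [ -f "$rc_file" ] && grep -qxF "$marker" "$rc_file"; then
        return 0
    fi
    {
        printf '%s\n' "$marker"
        printf '%s\n' 'export LD_LIBRARY_PATH=$CUDA_HOME/lib64:$LD_LIBRARY_PATH'
        printf '%s\n' 'export PATH=$CUDA_HOME/bin:$PATH'
    } >> "$rc_file"
}

# Typical use (not executed here):
#   add_cuda_env "$HOME/.bashrc" /usr/local/cuda-10.2 && . "$HOME/.bashrc"
```

The single quotes around the last two lines keep `$CUDA_HOME`, `$LD_LIBRARY_PATH` and `$PATH` unexpanded, so they are evaluated when the startup file is sourced rather than when the helper runs.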

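      Verification

      The output recorded below reports release 9.1 rather than 10.2, which suggests that a different nvcc was first on PATH when it was captured; with the variables above in effect, expect release 10.2.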

      anneng@anneng01:~$ nvcc --version
      nvcc: NVIDIA (R) Cuda compiler driver
      Copyright (c) 2005-2017 NVIDIA Corporation
      Built on Fri_Nov__3_21:07:56_CDT_2017
      Cuda compilation tools, release 9.1, V9.1.85
      
      anneng@anneng01:~$ nvidia-smi
      
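To check the reported release without reading the banner by eye, the version can be parsed out of the `nvcc --version` text. This is a small sketch; `nvcc_release` is a hypothetical helper name, and the expected value `10.2` in the usage comment assumes the toolkit version chosen in this guide.

```shell
# Extract the "release X.Y" number from `nvcc --version` output (stdin).
nvcc_release() {
    sed -n 's/.*release \([0-9][0-9]*\.[0-9][0-9]*\).*/\1/p'
}

# Typical use (not executed here): on a mismatch, show which nvcc was picked up
#   [ "$(nvcc --version | nvcc_release)" = "10.2" ] || command -v nvcc
```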

      3. Install cuDNN

      3.1 Download

      Choose a cuDNN version to download from the official NVIDIA website; it must match your CUDA version.

      This guide uses Download cuDNN v8.0.1 RC2 (June 26th, 2020), for CUDA 10.2. Download all three of the following packages:

      • cuDNN Runtime Library for Ubuntu18.04 (Deb)
      • cuDNN Developer Library for Ubuntu18.04 (Deb)
      • cuDNN Code Samples and User Guide for Ubuntu18.04 (Deb)

      This gives you the following three files:

      • libcudnn8_8.0.1.13-1+cuda10.2_amd64.deb
      • libcudnn8-dev_8.0.1.13-1+cuda10.2_amd64.deb
      • libcudnn8-doc_8.0.1.13-1+cuda10.2_amd64.deb

      Choose whichever version suits your setup.

      [Note:] You must register on the site before you can download.

      3.2 Installation

      sudo dpkg -i libcudnn8_8.0.1.13-1+cuda10.2_amd64.deb
      sudo dpkg -i libcudnn8-dev_8.0.1.13-1+cuda10.2_amd64.deb
      sudo dpkg -i libcudnn8-doc_8.0.1.13-1+cuda10.2_amd64.deb
      
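The post does not show how to confirm which cuDNN version ended up installed. One way, sketched here under assumptions, is to read the version macros from the `cudnn_version.h` header that the developer package ships for cuDNN 8. The header path in the usage comment is an assumption (`dpkg -L libcudnn8-dev` lists where the files actually landed), and `cudnn_header_version` is a hypothetical helper.

```shell
# Read cudnn_version.h text on stdin and print MAJOR.MINOR.PATCHLEVEL.
cudnn_header_version() {
    awk '$1 == "#define" && $2 == "CUDNN_MAJOR"      { major = $3 }
         $1 == "#define" && $2 == "CUDNN_MINOR"      { minor = $3 }
         $1 == "#define" && $2 == "CUDNN_PATCHLEVEL" { patch = $3 }
         END { if (major != "") print major "." minor "." patch }'
}

# Typical use (not executed here; the path is an assumption):
#   cudnn_header_version < /usr/include/cudnn_version.h
```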

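      # Confirm the GPU is visible on the PCI bus
      lspci | grep -i nvidia
      # Show the version of the loaded NVIDIA kernel module
      cat /proc/driver/nvidia/version
      # Check whether the open-source nouveau driver is loaded
      lsmod | grep nouveau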

      4. Install Docker CE and nvidia-docker2

      4.1 Environment

      • OS: Ubuntu 18.04, 64-bit
      • GPU: NVIDIA Tesla T4
      • CUDA: 10.2
      • cuDNN: 7.5

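      4.2 Configure the Docker repository

      # Refresh the package index
      sudo apt update
      
      # Install the packages apt needs to use a repository over HTTPS
      sudo apt install -y \
          apt-transport-https \
          ca-certificates \
          curl \
          gnupg-agent \
          software-properties-common
      
      # Add Docker's GPG key
      curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -
      
      # Add the stable repository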
      sudo add-apt-repository \
         "deb [arch=amd64] https://download.docker.com/linux/ubuntu \
         $(lsb_release -cs) \
         stable"
      

      4.3 Install Docker CE

      sudo apt update
      # Install Docker CE
      sudo apt install -y docker-ce
      
      # Verify: the installation works if the output contains "Hello from Docker!"
      sudo docker run hello-world
      

      4.4 Configure the nvidia-docker repository and install

      # Add the repository
      distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
      
      curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
      
      curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
      
      # Install the NVIDIA container toolkit and restart Docker
      sudo apt update && sudo apt install -y nvidia-container-toolkit
      
      sudo systemctl restart docker
      
      # Download docker-compose 1.17.0 and make it executable
      sudo curl -L https://github.com/docker/compose/releases/download/1.17.0/docker-compose-$(uname -s)-$(uname -m) -o /usr/local/bin/docker-compose
       
      sudo chmod +x /usr/local/bin/docker-compose
      docker-compose --version
      
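The steps above stop before confirming that a container can actually see the GPU. A common smoke test is to run `nvidia-smi` inside a CUDA image with the `--gpus` flag, sketched below. The image tag `nvidia/cuda:10.2-base` is an assumption and may no longer be published, so substitute a tag that matches your CUDA version; `gpu_smoke_test` and the `DOCKER` override are hypothetical conveniences for this sketch.

```shell
# Run nvidia-smi inside a CUDA container to check that Docker can reach the GPU.
# $1: image to use (the default tag is an assumption, replace as needed)
# DOCKER: command used to invoke Docker; set DOCKER=echo for a dry run.
gpu_smoke_test() {
    image=${1:-nvidia/cuda:10.2-base}
    ${DOCKER:-docker} run --rm --gpus all "$image" nvidia-smi
}

# Typical use (not executed here); sudo is needed unless the user can access Docker directly:
#   DOCKER="sudo docker" gpu_smoke_test
```

If the toolkit is wired up correctly, the container should print the same table that `nvidia-smi` prints on the host.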
      • anneng

        https://zhuanlan.zhihu.com/p/91334380
        What exactly are the GPU, the GPU driver, nvcc, the CUDA driver, cudatoolkit, and cuDNN?

        • anneng

          nvidia-smi
          Failed to initialize NVML: Driver/library version mismatch
          cat /var/log/dpkg.log | grep nvidia
          2021-07-26 06:50:34 status installed nvidia-driver-460:amd64 460.91.03-0ubuntu0.18.04.1

          The system had upgraded the driver automatically; a reboot fixes it.

          The method in this post avoids the reboot:
          https://stackoverflow.com/questions/43022843/nvidia-nvml-driver-library-version-mismatch#comment73133147_43022843

          How can we do that?
          First, we should know which drivers are loaded.

          lsmod | grep nvidia
          You may get

          nvidia_uvm 634880 8
          nvidia_drm 53248 0
          nvidia_modeset 790528 1 nvidia_drm
          nvidia 12312576 86 nvidia_modeset,nvidia_uvm
          Our final goal is to unload the nvidia module, so we must first unload the modules that depend on it:

          sudo rmmod nvidia_drm
          sudo rmmod nvidia_modeset
          sudo rmmod nvidia_uvm
          Then unload nvidia itself:

          sudo rmmod nvidia
          Troubleshooting
          If you get an error like rmmod: ERROR: Module nvidia is in use, the kernel module is still in use. Find the processes that are using it:

          sudo lsof /dev/nvidia*
          Kill those processes, then continue unloading the modules.

          Test
          Confirm that the modules were unloaded:

          lsmod | grep nvidia
          You should get nothing. Then confirm you can load the correct driver:

          nvidia-smi
          You should get the correct output.
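The unload sequence above can be wrapped so that only modules that are actually loaded get removed, dependents first. This is a sketch of that idea rather than part of the quoted answer: `unload_order` is a hypothetical helper, and the fixed module list simply mirrors the four modules shown in the `lsmod` output above.

```shell
# Read `lsmod` output on stdin and print the loaded NVIDIA modules in the
# order they should be removed: dependents first, the core module last.
unload_order() {
    snapshot=$(cat)
    for mod in nvidia_drm nvidia_modeset nvidia_uvm nvidia; do
        # Compare the first column exactly so "nvidia" does not match "nvidia_drm"
        if printf '%s\n' "$snapshot" |
            awk -v m="$mod" '$1 == m { found = 1 } END { exit !found }'; then
            printf '%s\n' "$mod"
        fi
    done
}

# Typical use on the affected machine (not executed here):
#   lsmod | unload_order | while read -r mod; do sudo rmmod "$mod" || break; done
```

Stopping at the first failed `rmmod` avoids trying to remove `nvidia` while a dependent module is still in use.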
