暗能星系

    Slurm compute node state is down after reboot

    • zhanglu

      Problem

      [root@fda-0d01-ai01-cv4 slurm]# sinfo
      PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST 
      debug*       up   infinite     17   idle wd[12-15,21,23-25,31-34,75],we[14-15,21,63] 
      debug*       up   infinite      2   down wd22,we85
      

      Check the node details

      scontrol show node
      
      NodeName=wd13 Arch=x86_64 CoresPerSocket=18 
         CPUAlloc=0 CPUTot=36 CPULoad=8.01
         AvailableFeatures=(null)
         ActiveFeatures=(null)
         Gres=(null)
         NodeAddr=wd13 NodeHostName=wd13 Version=20.02.7
         OS=Linux 3.10.0-693.el7.x86_64 #1 SMP Tue Aug 22 21:09:27 UTC 2017 
         RealMemory=1 AllocMem=0 FreeMem=253045 Sockets=2 Boards=1
         State=DOWN ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
         Partitions=debug 
         BootTime=2021-07-20T14:52:43 SlurmdStartTime=2021-07-29T11:58:56
         CfgTRES=cpu=36,mem=1M,billing=36
         AllocTRES=
         CapWatts=n/a
         CurrentWatts=0 AveWatts=0
         ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
         Reason=Node unexpectedly rebooted [slurm@2021-07-20T14:54:26]
      
      NodeName=wd22 Arch=x86_64 CoresPerSocket=18 
         CPUAlloc=0 CPUTot=36 CPULoad=8.01
         AvailableFeatures=(null)
         ActiveFeatures=(null)
         Gres=(null)
         NodeAddr=wd22 NodeHostName=wd22 Version=20.02.7
         OS=Linux 3.10.0-693.el7.x86_64 #1 SMP Tue Aug 22 21:09:27 UTC 2017 
         RealMemory=1 AllocMem=0 FreeMem=252883 Sockets=2 Boards=1
         State=DOWN ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
         Partitions=debug 
         BootTime=2021-07-20T15:19:00 SlurmdStartTime=2021-07-20T15:20:24
         CfgTRES=cpu=36,mem=1M,billing=36
         AllocTRES=
         CapWatts=n/a
         CurrentWatts=0 AveWatts=0
         ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
         Reason=Node unexpectedly rebooted [slurm@2021-07-20T15:20:40]
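
On a larger cluster it can help to pick the affected nodes out of the `scontrol show node` output automatically, by looking for State=DOWN together with the "Node unexpectedly rebooted" reason. The following is only a sketch: the function name `down_rebooted_nodes` is hypothetical, and the awk patterns assume the multi-line output layout shown above.

```shell
# Print the names of nodes that are DOWN with the reason
# "Node unexpectedly rebooted".
# Reads `scontrol show node` output on stdin.
# Assumption: the multi-line layout shown in this post (NodeName=, State=
# and Reason= each start a line, possibly indented).
down_rebooted_nodes() {
    awk '
        # A new node record starts: remember its name, reset the flag.
        $1 ~ /^NodeName=/ { name = substr($1, 10); down = 0 }
        # Matches DOWN and variants such as DOWN* or DOWN+DRAIN.
        $1 ~ /^State=DOWN/ { down = 1 }
        # Report the node only if it is down for this specific reason.
        /Reason=Node unexpectedly rebooted/ { if (down) print name }
    '
}

# Usage on a live cluster (not run here):
#   scontrol show node | down_rebooted_nodes
```

For a quick manual check, `sinfo -R` also lists the reason recorded for each down or drained node.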
      

      Solution

      scontrol update NodeName=wd22 State=RESUME
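
The `sinfo` output above lists two down nodes (wd22 and we85), so the same command may need to be repeated for each of them. A small wrapper along these lines could do that; `resume_nodes` and the `DRY_RUN` variable are hypothetical names introduced here for illustration, not part of Slurm.

```shell
# Resume every node named on the command line.
# With DRY_RUN=1 the scontrol commands are only printed, not executed.
# resume_nodes and DRY_RUN are illustrative names, not Slurm features.
resume_nodes() {
    for node in "$@"; do
        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo "scontrol update NodeName=$node State=RESUME"
        else
            scontrol update NodeName="$node" State=RESUME
        fi
    done
}

# Usage (run as root or as the Slurm admin user):
#   DRY_RUN=1 resume_nodes wd22 we85   # preview the commands
#   resume_nodes wd22 we85             # actually resume the nodes
```

`scontrol` also accepts a comma-separated node list in `NodeName`, so `scontrol update NodeName=wd22,we85 State=RESUME` should have the same effect. If nodes end up DOWN after every reboot, the `ReturnToService` parameter in `slurm.conf` is worth checking: with the default value of 0 a DOWN node stays down until an administrator resumes it, while a value of 2 lets a node return to service once it registers with a valid configuration.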
      