Slurm compute nodes stay in the down state after a reboot
-
Problem state

    [root@fda-0d01-ai01-cv4 slurm]# sinfo
    PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
    debug*    up    infinite     17 idle  wd[12-15,21,23-25,31-34,75],we[14-15,21,63]
    debug*    up    infinite      2 down  wd22,we85

View node details

    scontrol show node
    NodeName=wd13 Arch=x86_64 CoresPerSocket=18
       CPUAlloc=0 CPUTot=36 CPULoad=8.01
       AvailableFeatures=(null)
       ActiveFeatures=(null)
       Gres=(null)
       NodeAddr=wd13 NodeHostName=wd13 Version=20.02.7
       OS=Linux 3.10.0-693.el7.x86_64 #1 SMP Tue Aug 22 21:09:27 UTC 2017
       RealMemory=1 AllocMem=0 FreeMem=253045 Sockets=2 Boards=1
       State=DOWN ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
       Partitions=debug
       BootTime=2021-07-20T14:52:43 SlurmdStartTime=2021-07-29T11:58:56
       CfgTRES=cpu=36,mem=1M,billing=36
       AllocTRES=
       CapWatts=n/a
       CurrentWatts=0 AveWatts=0
       ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
       Reason=Node unexpectedly rebooted [slurm@2021-07-20T14:54:26]

    NodeName=wd22 Arch=x86_64 CoresPerSocket=18
       CPUAlloc=0 CPUTot=36 CPULoad=8.01
       AvailableFeatures=(null)
       ActiveFeatures=(null)
       Gres=(null)
       NodeAddr=wd22 NodeHostName=wd22 Version=20.02.7
       OS=Linux 3.10.0-693.el7.x86_64 #1 SMP Tue Aug 22 21:09:27 UTC 2017
       RealMemory=1 AllocMem=0 FreeMem=252883 Sockets=2 Boards=1
       State=DOWN ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
       Partitions=debug
       BootTime=2021-07-20T15:19:00 SlurmdStartTime=2021-07-20T15:20:24
       CfgTRES=cpu=36,mem=1M,billing=36
       AllocTRES=
       CapWatts=n/a
       CurrentWatts=0 AveWatts=0
       ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
       Reason=Node unexpectedly rebooted [slurm@2021-07-20T15:20:40]

Solution
scontrol update NodeName=wd22 State=RESUME
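
When several nodes are down for the same reason, the command above can be applied to each of them in a loop. The following is a minimal sketch, not a tested production script. The function names `filter_rebooted_nodes` and `resume_rebooted_nodes` and the `DRY_RUN` variable are made up for illustration, and it assumes that `sinfo -N -h -t down -o '%N|%E'` prints one `node|reason` line per down node on this Slurm version.

```shell
#!/bin/sh
# Hypothetical helper: read "NODE|REASON" lines on stdin and print the
# names of nodes whose reason starts with "Node unexpectedly rebooted".
filter_rebooted_nodes() {
    awk -F'|' '$2 ~ /^Node unexpectedly rebooted/ { print $1 }' | sort -u
}

# Hypothetical helper: resume every node reported by filter_rebooted_nodes.
# With DRY_RUN=1 the scontrol commands are only printed, not executed.
resume_rebooted_nodes() {
    # Assumed sinfo format: %N = node name, %E = down reason.
    sinfo -N -h -t down -o '%N|%E' | filter_rebooted_nodes |
    while read -r node; do
        if [ "${DRY_RUN:-0}" = 1 ]; then
            echo "scontrol update NodeName=$node State=RESUME"
        else
            scontrol update NodeName="$node" State=RESUME
        fi
    done
}
```

After sourcing the file, running `DRY_RUN=1 resume_rebooted_nodes` first shows which nodes would be resumed. Nodes that are down for any other reason are left alone, so they can still be investigated by hand.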