暗能星系

    tmp

    张渌
    This topic has been deleted. Only users with topic management privileges can view it.
    • zhanglu

      ceph status
        cluster:
          id:     807d820b-5c5b-451c-9f52-41b93d5d905a
          health: HEALTH_ERR
                  1 MDSs report oversized cache
                  1 MDSs report slow metadata IOs
                  2 MDSs behind on trimming
                  mon bh is low on available space
                  10 backfillfull osd(s)
                  1 OSDs or CRUSH {nodes, device-classes} have {NOUP,NODOWN,NOIN,NOOUT} flags set
                  full ratio(s) out of order
                  Low space hindering backfill (add storage if this doesn't resolve itself): 22 pgs backfill_toofull
                  Degraded data redundancy: 715/452014992 objects degraded (0.000%), 150 pgs degraded, 2 pgs undersized
                  206 pgs not deep-scrubbed in time
                  128 pgs not scrubbed in time
                  4 pool(s) backfillfull
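
The degraded-object line rounds its percentage to 0.000%, which hides how small the fraction actually is. Below is a minimal sketch for pulling the two counts out of a pasted health line and printing the ratio with more digits. The function name `degraded_ratio` is hypothetical, and the regex only assumes the line format shown in the output above.

```python
import re


def degraded_ratio(health_line):
    """Parse 'N/M objects degraded' and return (degraded, total, fraction), or None."""
    m = re.search(r'(\d+)/(\d+) objects degraded', health_line)
    if m is None:
        return None
    degraded, total = int(m.group(1)), int(m.group(2))
    # Guard against a zero total so the division cannot fail
    fraction = degraded / total if total else 0.0
    return degraded, total, fraction


line = ("Degraded data redundancy: 715/452014992 objects degraded (0.000%), "
        "150 pgs degraded, 2 pgs undersized")
degraded, total, fraction = degraded_ratio(line)
print(f"{degraded}/{total} objects degraded = {fraction:.6%}")
```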

      • zhanglu

        [WARN] [09/06/2025 06:54:11.355] [cromwell-system-akka.actor.default-dispatcher-3] [cromwell-system/Pool(shared->http://tesk-api.default.svc.cluster.local:8080)] [1 (WaitingForResponseEntitySubscription)] Response entity was not subscribed after 1 second. Make sure to read the response entity body or call discardBytes() on it. POST /ga4gh/tes/v1/tasks/task-68f2dfc5:cancel Empty -> 400 Bad Request Chunked
        [WARN] [09/06/2025 06:54:16.926] [cromwell-system-akka.actor.default-dispatcher-24] [cromwell-system/Pool(shared->http://tesk-api.default.svc.cluster.local:8080)] [0 (WaitingForResponseEntitySubscription)] Response entity was not subscribed after 1 second. Make sure to read the response entity body or call discardBytes() on it. POST /ga4gh/tes/v1/tasks/task-3a62617f:cancel Empty -> 400 Bad Request Chunked
        [WARN] [09/06/2025 06:54:17.945] [cromwell-system-akka.actor.default-dispatcher-26] [cromwell-system/Pool(shared->http://tesk-api.default.svc.cluster.local:8080)] [1 (WaitingForResponseEntitySubscription)] Response entity was not subscribed after 1 second. Make sure to read the response entity body or call discardBytes() on it. POST /ga4gh/tes/v1/tasks/task-d38cd4a6:cancel Empty -> 400 Bad Request Chunked
        [WARN] [09/06/2025 06:54:18.095] [cromwell-system-akka.actor.default-dispatcher-24] [cromwell-system/Pool(shared->http://tesk-api.default.svc.cluster.local:8080)] [3 (WaitingForResponseEntitySubscription)] Response entity was not subscribed after 1 second. Make sure to read the response entity body or call discardBytes() on it. POST /ga4gh/tes/v1/tasks/task-d38cd4a6:cancel Empty -> 400 Bad Request Chunked
        [WARN] [09/06/2025 06:54:22.325] [cromwell-system-akka.actor.default-dispatcher-4] [cromwell-system/Pool(shared->http://tesk-api.default.svc.cluster.local:8080)] [2 (WaitingForResponseEntitySubscription)] Response entity was not subscribed after 1 second. Make sure to read the response entity body or call discardBytes() on it. POST /ga4gh/tes/v1/tasks/task-3299d1e5:cancel Empty -> 400 Bad Request Chunked
        [WARN] [09/06/2025 06:54:25.545] [cromwell-system-akka.actor.default-dispatcher-2] [cromwell-system/Pool(shared->http://tesk-api.default.svc.cluster.local:8080)] [1 (WaitingForResponseEntitySubscription)] Response entity was not subscribed after 1 second. Make sure to read the response entity body or call discardBytes() on it. POST /ga4gh/tes/v1/tasks/task-68f2dfc5:cancel Empty -> 400 Bad Request Chunked
        [WARN] [09/06/2025 06:54:25.575] [cromwell-system-akka.actor.default-dispatcher-3] [cromwell-system/Pool(shared->http://tesk-api.default.svc.cluster.local:8080)] [0 (WaitingForResponseEntitySubscription)] Response entity was not subscribed after 1 second. Make sure to read the response entity body or call discardBytes() on it. POST /ga4gh/tes/v1/tasks/task-3299d1e5:cancel Empty -> 400 Bad Request Chunked
        [WARN] [09/06/2025 06:54:30.316] [cromwell-system-akka.actor.default-dispatcher-4] [cromwell-system/Pool(shared->http://tesk-api.default.svc.cluster.local:8080)] [3 (WaitingForResponseEntitySubscription)] Response entity was not subscribed after 1 second. Make sure to read the response entity body or call discardBytes() on it. POST /ga4gh/tes/v1/tasks/task-3a62617f:cancel Empty -> 400 Bad Request Chunked
        [WARN] [09/06/2025 06:54:32.715] [cromwell-system-akka.actor.default-dispatcher-4] [cromwell-system/Pool(shared->http://tesk-api.default.svc.cluster.local:8080)] [0 (WaitingForResponseEntitySubscription)] Response entity was not subscribed after 1 second. Make sure to read the response entity body or call discardBytes() on it. POST /ga4gh/tes/v1/tasks/task-d38cd4a6:cancel Empty -> 400 Bad Request Chunked
        [WARN] [09/06/2025 06:54:32.865] [cromwell-system-akka.actor.default-dispatcher-24] [cromwell-system/Pool(shared->http://tesk-api.default.svc.cluster.local:8080)] [1 (WaitingForResponseEntitySubscription)] Response entity was not subscribed after 1 second. Make sure to read the response entity body or call discardBytes() on it. POST /ga4gh/tes/v1/tasks/task-d38cd4a6:cancel Empty -> 400 Bad Request Chunked
        [WARN] [09/06/2025 06:54:37.295] [cromwell-system-akka.actor.default-dispatcher-2] [cromwell-system/Pool(shared->http://tesk-api.default.svc.cluster.local:8080)] [3 (WaitingForResponseEntitySubscription)] Response entity was not subscribed after 1 second. Make sure to read the response entity body or call discardBytes() on it. POST /ga4gh/tes/v1/tasks/task-3299d1e5:cancel Empty -> 400 Bad Request Chunked

        • zhanglu

          import os
          import re
          import time


          def follow_log(file_path):
              """Follow new content appended to a log file, similar to `tail -f`."""
              with open(file_path, 'r') as f:
                  # Jump to the end of the file so only new lines are reported
                  f.seek(0, os.SEEK_END)

                  while True:
                      line = f.readline()
                      if not line:
                          # No new content yet: sleep briefly, then poll again
                          time.sleep(0.1)
                          continue
                      yield line


          def extract_task_id(line):
              """Extract the task ID from a log line, or return None if there is none."""
              # Match IDs of the form task-<hex digits>
              pattern = r'task-[0-9a-f]+'
              match = re.search(pattern, line)
              if match:
                  return match.group()
              return None


          def main(log_file_path):
              print(f"Monitoring log file: {log_file_path}")
              print("Extracting task IDs... (press Ctrl+C to stop)")

              try:
                  for line in follow_log(log_file_path):
                      # Only look at the POST requests that cancel a task
                      if 'POST /ga4gh/tes/v1/tasks/' in line and ':cancel' in line:
                          task_id = extract_task_id(line)
                          if task_id:
                              print(f"Extracted task ID: {task_id}")
              except KeyboardInterrupt:
                  print("\nStopped")
              except FileNotFoundError:
                  print(f"Error: log file not found: {log_file_path}")
              except Exception as e:
                  print(f"Error: {e}")


          if __name__ == "__main__":
              # Log file path; adjust as needed
              log_file = "nohup.out"
              main(log_file)
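
The WARN lines earlier in the thread repeat the cancel request for the same few tasks, so a per-task count can be more useful than a running print. The following is a minimal batch sketch, assuming the lines are already in memory; `count_cancel_requests` is a hypothetical helper name, and the sample lines are abbreviated tails of the log lines posted above.

```python
import re
from collections import Counter

# Same request pattern the script above filters on, with the task ID captured
CANCEL_RE = re.compile(r'POST /ga4gh/tes/v1/tasks/(task-[0-9a-f]+):cancel')


def count_cancel_requests(lines):
    """Count how many cancel requests were logged for each task ID."""
    counts = Counter()
    for line in lines:
        m = CANCEL_RE.search(line)
        if m:
            counts[m.group(1)] += 1
    return counts


# Abbreviated tails of the first four WARN lines in the thread
sample = [
    "POST /ga4gh/tes/v1/tasks/task-68f2dfc5:cancel Empty -> 400 Bad Request Chunked",
    "POST /ga4gh/tes/v1/tasks/task-3a62617f:cancel Empty -> 400 Bad Request Chunked",
    "POST /ga4gh/tes/v1/tasks/task-d38cd4a6:cancel Empty -> 400 Bad Request Chunked",
    "POST /ga4gh/tes/v1/tasks/task-d38cd4a6:cancel Empty -> 400 Bad Request Chunked",
]

for task_id, n in count_cancel_requests(sample).most_common():
    print(task_id, n)
```

To run it over a whole file, pass an open file object instead of `sample`, since the function only iterates over lines.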

          • zhanglu

            delete from "JOB_KEY_VALUE_ENTRY" where "STORE_VALUE" = '195'

            • zhanglu

              cde89041-fe8a-4712-8385-0cb8afc0efcc, 7cf42b54-9888-486e-9460-9d5b607c02fd, 3178dfab-75f8-4ab3-95a4-408af8e3d7c1, 170c1ac2-e068-456b-b5ea-c948f432e87d, fbafa7ff-408b-4869-b9c5-c66a45d9843f, 9c35008e-0f1a-433d-a974-266d0ac5038d, 0caaa4b6-6a32-4973-8e4f-9f65fb0ece2b, 1b3b877d-cd72-4694-bfad-07fde451e6ff, cd08e0b4-a0b7-4ce3-ae67-79650a6a6bb2, 601d6df9-ab99-4a65-b790-0a2b6cc5c191, 971df97e-e631-4f4a-9d78-4704eb0a7591, c712c011-afd1-4e6c-b1d1-ef6532ff48f4, a66c9c15-22c8-4bb3-920b-9354b408862b, 58d69c6a-df71-4780-8b18-b3020532997f, 1cc9b767-573a-447d-b257-968f2c91df17, dab528d2-1eda-46e3-a232-622cb188a7c5, 2026b96c-019b-4a6c-bc13-e4cc3bb78e98, edabc69e-72d1-48df-942e-528cc60b835d, 846a9eb7-5c60-4171-98db-c7d4b370cf27, d33d18c7-293c-4d0b-ab3b-38f049a305c2, 351735e1-31cd-4577-82d6-e3a4d10d679c, 93acc752-31bf-4538-b3ce-59c0762d00f6, b0599234-046e-4914-9941-00aadf0a292e, 524ddb2b-e921-4483-b6c1-bc449047b6a4, f09dd720-5851-47d5-92d0-a36aa7d2be72, 35d9125b-cee8-4cea-abd9-717e7ba75ce2, 81df380e-50a9-434d-ba6e-c74500038d8d, 2f853868-6d6c-4b51-88b7-506dffd0e4ae, c1eb19f2-9ce4-4ccd-986e-a8abb7c5b737, 3e21be2c-b074-4a4c-9802-173be8ea45eb, 0cf1a30b-4031-4789-819e-f5a3e6f58b50, 042e5576-988e-4bdc-b782-2e212ad88e8c, 4766114f-efee-4200-a3fa-dd109a4ee9df, 3f84c54c-0e03-4ea3-a258-ab71691cee60, 41278df2-f761-4463-8df9-0c9223585348, f47e22e1-a046-4f93-ac9d-65e2801ab9ac, a1611293-04c2-4b81-847a-6a53defa7399, a164dc0d-388f-4b3e-ba7a-5f9543b34185, 98b1d8ba-757d-4599-ad01-b703d58b5d0d, 1ac95950-d62c-4fe8-ba7d-7df645dcebbf, bb3fad51-a03e-4a47-95b3-f74711577c64, beb7ba8d-0909-4f04-9223-1ec72ae7fac4, 057ed284-0b60-4ede-ae87-d7db476c1882, c9c98451-9773-4245-a636-e7dcb3c19282, d7b55685-cdd6-483b-8ea9-38131ad5ea92, 486875ec-76a0-45c5-a734-fd63bc599af3, fcd11583-245f-42a9-94c9-1c4524217923, ecd92665-0550-48d6-99b0-0588b3367469, 071e9c51-931b-4c05-b3e8-a227449f3b9e

              • zhanglu

                curl -X POST "http://192.168.30.202:31237/api/workflows/v1/edabc69e-72d1-48df-942e-528cc60b835d/abort" -H "accept: application/json"

                • zhanglu

                  curl -X POST "http://192.168.30.202:31237/api/workflows/v1/cde89041-fe8a-4712-8385-0cb8afc0efcc/abort" -H "accept: application/json"
                  curl -X POST "http://192.168.30.202:31237/api/workflows/v1/7cf42b54-9888-486e-9460-9d5b607c02fd/abort" -H "accept: application/json"
                  curl -X POST "http://192.168.30.202:31237/api/workflows/v1/3178dfab-75f8-4ab3-95a4-408af8e3d7c1/abort" -H "accept: application/json"
                  curl -X POST "http://192.168.30.202:31237/api/workflows/v1/170c1ac2-e068-456b-b5ea-c948f432e87d/abort" -H "accept: application/json"
                  curl -X POST "http://192.168.30.202:31237/api/workflows/v1/fbafa7ff-408b-4869-b9c5-c66a45d9843f/abort" -H "accept: application/json"
                  curl -X POST "http://192.168.30.202:31237/api/workflows/v1/9c35008e-0f1a-433d-a974-266d0ac5038d/abort" -H "accept: application/json"
                  curl -X POST "http://192.168.30.202:31237/api/workflows/v1/0caaa4b6-6a32-4973-8e4f-9f65fb0ece2b/abort" -H "accept: application/json"
                  curl -X POST "http://192.168.30.202:31237/api/workflows/v1/1b3b877d-cd72-4694-bfad-07fde451e6ff/abort" -H "accept: application/json"
                  curl -X POST "http://192.168.30.202:31237/api/workflows/v1/cd08e0b4-a0b7-4ce3-ae67-79650a6a6bb2/abort" -H "accept: application/json"
                  curl -X POST "http://192.168.30.202:31237/api/workflows/v1/601d6df9-ab99-4a65-b790-0a2b6cc5c191/abort" -H "accept: application/json"
                  curl -X POST "http://192.168.30.202:31237/api/workflows/v1/971df97e-e631-4f4a-9d78-4704eb0a7591/abort" -H "accept: application/json"
                  curl -X POST "http://192.168.30.202:31237/api/workflows/v1/c712c011-afd1-4e6c-b1d1-ef6532ff48f4/abort" -H "accept: application/json"
                  curl -X POST "http://192.168.30.202:31237/api/workflows/v1/a66c9c15-22c8-4bb3-920b-9354b408862b/abort" -H "accept: application/json"
                  curl -X POST "http://192.168.30.202:31237/api/workflows/v1/58d69c6a-df71-4780-8b18-b3020532997f/abort" -H "accept: application/json"
                  curl -X POST "http://192.168.30.202:31237/api/workflows/v1/1cc9b767-573a-447d-b257-968f2c91df17/abort" -H "accept: application/json"
                  curl -X POST "http://192.168.30.202:31237/api/workflows/v1/dab528d2-1eda-46e3-a232-622cb188a7c5/abort" -H "accept: application/json"
                  curl -X POST "http://192.168.30.202:31237/api/workflows/v1/2026b96c-019b-4a6c-bc13-e4cc3bb78e98/abort" -H "accept: application/json"
                  curl -X POST "http://192.168.30.202:31237/api/workflows/v1/edabc69e-72d1-48df-942e-528cc60b835d/abort" -H "accept: application/json"
                  curl -X POST "http://192.168.30.202:31237/api/workflows/v1/846a9eb7-5c60-4171-98db-c7d4b370cf27/abort" -H "accept: application/json"
                  curl -X POST "http://192.168.30.202:31237/api/workflows/v1/d33d18c7-293c-4d0b-ab3b-38f049a305c2/abort" -H "accept: application/json"
                  curl -X POST "http://192.168.30.202:31237/api/workflows/v1/351735e1-31cd-4577-82d6-e3a4d10d679c/abort" -H "accept: application/json"
                  curl -X POST "http://192.168.30.202:31237/api/workflows/v1/93acc752-31bf-4538-b3ce-59c0762d00f6/abort" -H "accept: application/json"
                  curl -X POST "http://192.168.30.202:31237/api/workflows/v1/b0599234-046e-4914-9941-00aadf0a292e/abort" -H "accept: application/json"
                  curl -X POST "http://192.168.30.202:31237/api/workflows/v1/524ddb2b-e921-4483-b6c1-bc449047b6a4/abort" -H "accept: application/json"
                  curl -X POST "http://192.168.30.202:31237/api/workflows/v1/f09dd720-5851-47d5-92d0-a36aa7d2be72/abort" -H "accept: application/json"
                  curl -X POST "http://192.168.30.202:31237/api/workflows/v1/35d9125b-cee8-4cea-abd9-717e7ba75ce2/abort" -H "accept: application/json"
                  curl -X POST "http://192.168.30.202:31237/api/workflows/v1/81df380e-50a9-434d-ba6e-c74500038d8d/abort" -H "accept: application/json"
                  curl -X POST "http://192.168.30.202:31237/api/workflows/v1/2f853868-6d6c-4b51-88b7-506dffd0e4ae/abort" -H "accept: application/json"
                  curl -X POST "http://192.168.30.202:31237/api/workflows/v1/c1eb19f2-9ce4-4ccd-986e-a8abb7c5b737/abort" -H "accept: application/json"
                  curl -X POST "http://192.168.30.202:31237/api/workflows/v1/3e21be2c-b074-4a4c-9802-173be8ea45eb/abort" -H "accept: application/json"
                  curl -X POST "http://192.168.30.202:31237/api/workflows/v1/0cf1a30b-4031-4789-819e-f5a3e6f58b50/abort" -H "accept: application/json"
                  curl -X POST "http://192.168.30.202:31237/api/workflows/v1/042e5576-988e-4bdc-b782-2e212ad88e8c/abort" -H "accept: application/json"
                  curl -X POST "http://192.168.30.202:31237/api/workflows/v1/4766114f-efee-4200-a3fa-dd109a4ee9df/abort" -H "accept: application/json"
                  curl -X POST "http://192.168.30.202:31237/api/workflows/v1/3f84c54c-0e03-4ea3-a258-ab71691cee60/abort" -H "accept: application/json"
                  curl -X POST "http://192.168.30.202:31237/api/workflows/v1/41278df2-f761-4463-8df9-0c9223585348/abort" -H "accept: application/json"
                  curl -X POST "http://192.168.30.202:31237/api/workflows/v1/f47e22e1-a046-4f93-ac9d-65e2801ab9ac/abort" -H "accept: application/json"
                  curl -X POST "http://192.168.30.202:31237/api/workflows/v1/a1611293-04c2-4b81-847a-6a53defa7399/abort" -H "accept: application/json"
                  curl -X POST "http://192.168.30.202:31237/api/workflows/v1/a164dc0d-388f-4b3e-ba7a-5f9543b34185/abort" -H "accept: application/json"
                  curl -X POST "http://192.168.30.202:31237/api/workflows/v1/98b1d8ba-757d-4599-ad01-b703d58b5d0d/abort" -H "accept: application/json"
                  curl -X POST "http://192.168.30.202:31237/api/workflows/v1/1ac95950-d62c-4fe8-ba7d-7df645dcebbf/abort" -H "accept: application/json"
                  curl -X POST "http://192.168.30.202:31237/api/workflows/v1/bb3fad51-a03e-4a47-95b3-f74711577c64/abort" -H "accept: application/json"
                  curl -X POST "http://192.168.30.202:31237/api/workflows/v1/beb7ba8d-0909-4f04-9223-1ec72ae7fac4/abort" -H "accept: application/json"
                  curl -X POST "http://192.168.30.202:31237/api/workflows/v1/057ed284-0b60-4ede-ae87-d7db476c1882/abort" -H "accept: application/json"
                  curl -X POST "http://192.168.30.202:31237/api/workflows/v1/c9c98451-9773-4245-a636-e7dcb3c19282/abort" -H "accept: application/json"
                  curl -X POST "http://192.168.30.202:31237/api/workflows/v1/d7b55685-cdd6-483b-8ea9-38131ad5ea92/abort" -H "accept: application/json"
                  curl -X POST "http://192.168.30.202:31237/api/workflows/v1/486875ec-76a0-45c5-a734-fd63bc599af3/abort" -H "accept: application/json"
                  curl -X POST "http://192.168.30.202:31237/api/workflows/v1/fcd11583-245f-42a9-94c9-1c4524217923/abort" -H "accept: application/json"
                  curl -X POST "http://192.168.30.202:31237/api/workflows/v1/ecd92665-0550-48d6-99b0-0588b3367469/abort" -H "accept: application/json"
                  curl -X POST "http://192.168.30.202:31237/api/workflows/v1/071e9c51-931b-4c05-b3e8-a227449f3b9e/abort" -H "accept: application/json"
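
The curl commands above differ only in the workflow ID, so they could be generated from the comma-separated ID list instead of being written out by hand. This is a rough sketch under that assumption: the host, port, and path are copied from the commands above, while `parse_ids`, `abort_url`, and `abort_all` are hypothetical helper names. Running the block only prints the URLs; `abort_all` sends requests only when called explicitly.

```python
import urllib.request

BASE_URL = "http://192.168.30.202:31237"  # host and port from the curl commands above


def parse_ids(text):
    """Split a comma-separated string of workflow IDs, dropping empty entries."""
    return [part.strip() for part in text.split(",") if part.strip()]


def abort_url(workflow_id, base_url=BASE_URL):
    """Build the abort URL for one workflow ID."""
    return f"{base_url}/api/workflows/v1/{workflow_id}/abort"


def abort_all(workflow_ids, base_url=BASE_URL, timeout=10):
    """POST an abort request per workflow; return {id: HTTP status or error text}."""
    results = {}
    for wf_id in workflow_ids:
        req = urllib.request.Request(
            abort_url(wf_id, base_url),
            method="POST",
            headers={"accept": "application/json"},
        )
        try:
            with urllib.request.urlopen(req, timeout=timeout) as resp:
                results[wf_id] = resp.status
        except OSError as exc:  # URLError and HTTPError are OSError subclasses
            results[wf_id] = str(exc)
    return results


if __name__ == "__main__":
    # Dry run with the first two IDs from the list above: print URLs, send nothing
    ids = parse_ids("cde89041-fe8a-4712-8385-0cb8afc0efcc, "
                    "7cf42b54-9888-486e-9460-9d5b607c02fd")
    for wf_id in ids:
        print(abort_url(wf_id))
```

Collecting the status per ID makes it easier to see afterwards which aborts were rejected, which the one-off curl calls do not record.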

                  • zhanglu

                    ################# retry : Some(9999) ################################## retry : Some(9998) ################################## retry : Some(9999) ################################## retry : Some(9999) ################################## retry : Some(9999) ################################## retry : Some(9999) ################################## retry : Some(9999) ################################## retry : Some(9999) ################################## retry : Some(9999) ################################## retry : Some(9999) ################################## retry : Some(9999) ################################## retry : Some(9999) ################################## retry : Some(9998) #################[INFO] [09/10/2025 13:24:08.451] [cromwell-system-akka.dispatchers.backend-dispatcher-195] [akka://cromwell-system/user/cromwell-service/WorkflowManagerActor/WorkflowActor-830dc691-9767-475f-a4c2-e65543225903/WorkflowExecutionActor-830dc691-9767-475f-a4c2-e65543225903/830dc691-9767-475f-a4c2-e65543225903-EngineJobExecutionActor-meta_workflow.Bar:NA:1/830dc691-9767-475f-a4c2-e65543225903-BackendJobExecutionActor-meta_workflow.Bar:NA:1/TesAsyncBackendJobExecutionActor] TesAsyncBackendJobExecutionActor [UUID(830dc691)meta_workflow.Bar:NA:1]: Status change from - to Running
                    ################# retry : Some(9999) ################################## retry : Some(9999) ################################## retry : Some(9999) ################################## retry : Some(9999) ################################## retry : Some(9999) ################################## retry : Some(9999) ################################## retry : Some(9999) ################################## retry : Some(9999) ################################## retry : Some(9999) ################################## retry : Some(9999) ################################## retry : Some(9999) ################################## retry : Some(9998) ################################## retry : Some(9998) ################################## retry : Some(9999) ################################## retry : Some(9999) #################[INFO] [09/10/2025 13:24:09.672] [cromwell-system-akka.dispatchers.backend-dispatcher-195] [akka://cromwell-system/user/cromwell-service/WorkflowManagerActor/WorkflowActor-830dc691-9767-475f-a4c2-e65543225903/WorkflowExecutionActor-830dc691-9767-475f-a4c2-e65543225903/830dc691-9767-475f-a4c2-e65543225903-EngineJobExecutionActor-meta_workflow.metacor:NA:1/830dc691-9767-475f-a4c2-e65543225903-BackendJobExecutionActor-meta_workflow.metacor:NA:1/TesAsyncBackendJobExecutionActor] TesAsyncBackendJobExecutionActor [UUID(830dc691)meta_workflow.metacor:NA:1]: Status change from - to Running
                    ################# retry : Some(9998) ################################## retry : Some(9998) ################################## retry : Some(9998) ################################## retry : Some(9998) ################################## retry : Some(9999) ################################## retry : Some(9998) ################################## retry : Some(9999) ################################## retry : Some(9998) ################################## retry : Some(9997) ################################## retry : Some(9998) ################################## retry : Some(9997) ################################## retry : Some(9998) ################################## retry : Some(9998) #################[INFO] [09/10/2025 13:24:09.730] [cromwell-system-akka.dispatchers.backend-dispatcher-195] [akka://cromwell-system/user/cromwell-service/WorkflowManagerActor/WorkflowActor-efce8835-18f0-4b86-8783-d9080d75b68f/WorkflowExecutionActor-efce8835-18f0-4b86-8783-d9080d75b68f/efce8835-18f0-4b86-8783-d9080d75b68f-EngineJobExecutionActor-meta_workflow.TICstdredeal:NA:1/efce8835-18f0-4b86-8783-d9080d75b68f-BackendJobExecutionActor-meta_workflow.TICstdredeal:NA:1/TesAsyncBackendJobExecutionActor] TesAsyncBackendJobExecutionActor [UUID(efce8835)meta_workflow.TICstdredeal:NA:1]: Status change from - to Running
                    ################# retry : Some(9999) ################################## retry : Some(9999) ################################## retry : Some(9998) ################################## retry : Some(9998) ################################## retry : Some(9998) ################################## retry : Some(9998) ################################## retry : Some(9998) ################################## retry : Some(9999) ################################## retry : Some(9999) ################################## retry : Some(9999) ################################## retry : Some(9999) ################################## retry : Some(9999) ################################## retry : Some(9999) ################################## retry : Some(9999) #################[INFO] [09/10/2025 13:24:12.450] [cromwell-system-akka.dispatchers.backend-dispatcher-188] [akka://cromwell-system/user/cromwell-service/WorkflowManagerActor/WorkflowActor-830dc691-9767-475f-a4c2-e65543225903/WorkflowExecutionActor-830dc691-9767-475f-a4c2-e65543225903/830dc691-9767-475f-a4c2-e65543225903-EngineJobExecutionActor-meta_workflow.roplsplsda:NA:1/830dc691-9767-475f-a4c2-e65543225903-BackendJobExecutionActor-meta_workflow.roplsplsda:NA:1/TesAsyncBackendJobExecutionActor] TesAsyncBackendJobExecutionActor [UUID(830dc691)meta_workflow.roplsplsda:NA:1]: Status change from - to Running
                    ################# retry : Some(9999) ################################## retry : Some(9999) ################################## retry : Some(9999) ################################## retry : Some(9999) ################################## retry : Some(9999) ################################## retry : Some(9999) ################################## retry : Some(9999) ################################## retry : Some(9999) ################################## retry : Some(9999) ################################## retry : Some(9999) ################################## retry : Some(9999) ################################## retry : Some(9999) ################################## retry : Some(9999) ################################## retry : Some(9999) ################################## retry : Some(9999) ################################## retry : Some(9999) #################[INFO] [09/10/2025 13:24:15.939] [cromwell-system-akka.dispatchers.backend-dispatcher-211] [akka://cromwell-system/user/cromwell-service/WorkflowManagerActor/WorkflowActor-830dc691-9767-475f-a4c2-e65543225903/WorkflowExecutionActor-830dc691-9767-475f-a4c2-e65543225903/830dc691-9767-475f-a4c2-e65543225903-EngineJobExecutionActor-meta_workflow.TICsampleredeal:NA:1/830dc691-9767-475f-a4c2-e65543225903-BackendJobExecutionActor-meta_workflow.TICsampleredeal:NA:1/TesAsyncBackendJobExecutionActor] TesAsyncBackendJobExecutionActor [UUID(830dc691)meta_workflow.TICsampleredeal:NA:1]: Status change from - to Running
                    ################# retry : Some(9998) ################################## retry : Some(9998) ################################## retry : Some(9998) ################################## retry : Some(9997) ################################## retry : Some(9998) ################################## retry : Some(9999) ################################## retry : Some(9998) ################################## retry : Some(9999) ################################## retry : Some(9998) ################################## retry : Some(9998) ################################## retry : Some(9998) #################[INFO] [09/10/2025 13:24:17.438] [cromwell-system-akka.dispatchers.backend-dispatcher-211] [akka://cromwell-system/user/cromwell-service/WorkflowManagerActor/WorkflowActor-830dc691-9767-475f-a4c2-e65543225903/WorkflowExecutionActor-830dc691-9767-475f-a4c2-e65543225903/830dc691-9767-475f-a4c2-e65543225903-EngineJobExecutionActor-meta_workflow.kmeans:NA:1/830dc691-9767-475f-a4c2-e65543225903-BackendJobExecutionActor-meta_workflow.kmeans:NA:1/TesAsyncBackendJobExecutionActor] TesAsyncBackendJobExecutionActor [UUID(830dc691)meta_workflow.kmeans:NA:1]: Status change from - to Running
                    ################# retry : Some(9998) ################################## retry : Some(9998) ################################## retry : Some(9997) ################################## retry : Some(9998) ################################## retry : Some(9998) #################[INFO] [09/10/2025 13:24:17.632] [cromwell-system-akka.dispatchers.backend-dispatcher-211] [akka://cromwell-system/user/cromwell-service/WorkflowManagerActor/WorkflowActor-830dc691-9767-475f-a4c2-e65543225903/WorkflowExecutionActor-830dc691-9767-475f-a4c2-e65543225903/830dc691-9767-475f-a4c2-e65543225903-EngineJobExecutionActor-meta_workflow.all_sample_map:NA:1/830dc691-9767-475f-a4c2-e65543225903-BackendJobExecutionActor-meta_workflow.all_sample_map:NA:1/TesAsyncBackendJobExecutionActor] TesAsyncBackendJobExecutionActor [UUID(830dc691)meta_workflow.all_sample_map:NA:1]: Status change from - to Running
                    ################# retry : Some(9997) ################################## retry : Some(9998) ################################## retry : Some(9998) ################################## retry : Some(9998) ################################## retry : Some(9998) ################################## retry : Some(9997) ################################## retry : Some(9998) #################[INFO] [09/10/2025 13:24:19.638] [cromwell-system-akka.dispatchers.backend-dispatcher-211] [akka://cromwell-system/user/cromwell-service/WorkflowManagerActor/WorkflowActor-830dc691-9767-475f-a4c2-e65543225903/WorkflowExecutionActor-830dc691-9767-475f-a4c2-e65543225903/830dc691-9767-475f-a4c2-e65543225903-EngineJobExecutionActor-meta_workflow.KEGG:NA:1/830dc691-9767-475f-a4c2-e65543225903-BackendJobExecutionActor-meta_workflow.KEGG:NA:1/TesAsyncBackendJobExecutionActor] TesAsyncBackendJobExecutionActor [UUID(830dc691)meta_workflow.KEGG:NA:1]: Status change from - to Running
                    ################# retry : Some(9996) ################################## retry : Some(9997) ################################## retry : Some(9997) ################################## retry : Some(9998) ################################## retry : Some(9997) #################[INFO] [09/10/2025 13:24:22.193] [cromwell-system-akka.dispatchers.backend-dispatcher-211] [akka://cromwell-system/user/cromwell-service/WorkflowManagerActor/WorkflowActor-efce8835-18f0-4b86-8783-d9080d75b68f/WorkflowExecutionActor-efce8835-18f0-4b86-8783-d9080d75b68f/efce8835-18f0-4b86-8783-d9080d75b68f-EngineJobExecutionActor-meta_workflow.TICsampleredeal:NA:1/efce8835-18f0-4b86-8783-d9080d75b68f-BackendJobExecutionActor-meta_workflow.TICsampleredeal:NA:1/TesAsyncBackendJobExecutionActor] TesAsyncBackendJobExecutionActor [UUID(efce8835)meta_workflow.TICsampleredeal:NA:1]: Status change from - to Running
                    ################# retry : Some(9999) ################################## retry : Some(9997) ################################## retry : Some(9997) ################################## retry : Some(9997) ################################## retry : Some(9998) ################################## retry : Some(9998) ################################## retry : Some(9997) #################[INFO] [09/10/2025 13:24:25.386] [cromwell-system-akka.dispatchers.backend-dispatcher-211] [akka://cromwell-system/user/cromwell-service/WorkflowManagerActor/WorkflowActor-830dc691-9767-475f-a4c2-e65543225903/WorkflowExecutionActor-830dc691-9767-475f-a4c2-e65543225903/830dc691-9767-475f-a4c2-e65543225903-EngineJobExecutionActor-meta_workflow.heatmap:NA:1/830dc691-9767-475f-a4c2-e65543225903-BackendJobExecutionActor-meta_workflow.heatmap:NA:1/TesAsyncBackendJobExecutionActor] TesAsyncBackendJobExecutionActor [UUID(830dc691)meta_workflow.heatmap:NA:1]: Status change from - to Running
                    ################# retry : Some(9997) ################################## retry : Some(9997) ################################## retry : Some(9997) ################################## retry : Some(9997) ################################## retry : Some(9998) ################################## retry : Some(9998) ################################## retry : Some(9997) ################################## retry : Some(9998) ################################## retry : Some(9996) ################################## retry : Some(9998) ################################## retry : Some(9999) ################################## retry : Some(9998) ################################## retry : Some(9998) ################################## retry : Some(9998) ################################## retry : Some(9998) #################[INFO] [09/10/2025 13:24:25.501] [cromwell-system-akka.dispatchers.backend-dispatcher-211] [akka://cromwell-system/user/cromwell-service/WorkflowManagerActor/WorkflowActor-830dc691-9767-475f-a4c2-e65543225903/WorkflowExecutionActor-830dc691-9767-475f-a4c2-e65543225903/830dc691-9767-475f-a4c2-e65543225903-EngineJobExecutionActor-meta_workflow.TICstdredeal:NA:1/830dc691-9767-475f-a4c2-e65543225903-BackendJobExecutionActor-meta_workflow.TICstdredeal:NA:1/TesAsyncBackendJobExecutionActor] TesAsyncBackendJobExecutionActor [UUID(830dc691)meta_workflow.TICstdredeal:NA:1]: Status change from - to Running
                    ################# retry : Some(9998) ################################## retry : Some(9998) ################################## retry : Some(9998) ################################## retry : Some(9997) ################################## retry : Some(9998) #################[INFO] [09/10/2025 13:24:26.411] [cromwell-system-akka.dispatchers.backend-dispatcher-211] [akka://cromwell-system/user/cromwell-service/WorkflowManagerActor/WorkflowActor-830dc691-9767-475f-a4c2-e65543225903/WorkflowExecutionActor-830dc691-9767-475f-a4c2-e65543225903/830dc691-9767-475f-a4c2-e65543225903-EngineJobExecutionActor-meta_workflow.roplsoplsda:NA:1/830dc691-9767-475f-a4c2-e65543225903-BackendJobExecutionActor-meta_workflow.roplsoplsda:NA:1/TesAsyncBackendJobExecutionActor] TesAsyncBackendJobExecutionActor [UUID(830dc691)meta_workflow.roplsoplsda:NA:1]: Status change from - to Running
                    ################# retry : Some(9998) ################################## retry : Some(9997) ################################## retry : Some(9997) ################################## retry : Some(9998) ################################## retry : Some(9996) ################################## retry : Some(9997) ################################## retry : Some(9998) #################[INFO] [09/10/2025 13:24:27.693] [cromwell-system-akka.dispatchers.backend-dispatcher-194] [akka://cromwell-system/user/cromwell-service/WorkflowManagerActor/WorkflowActor-efce8835-18f0-4b86-8783-d9080d75b68f/WorkflowExecutionActor-efce8835-18f0-4b86-8783-d9080d75b68f/efce8835-18f0-4b86-8783-d9080d75b68f-EngineJobExecutionActor-meta_workflow.roplspca:NA:1/efce8835-18f0-4b86-8783-d9080d75b68f-BackendJobExecutionActor-meta_workflow.roplspca:NA:1/TesAsyncBackendJobExecutionActor] TesAsyncBackendJobExecutionActor [UUID(efce8835)meta_workflow.roplspca:NA:1]: job id: task-9d37d05a
                    ################# retry : Some(9997) ################################## retry : Some(9997) ################################## retry : Some(9997) ################################## retry : Some(9997) ################################## retry : Some(9997) #################[INFO] [09/10/2025 13:24:27.693] [cromwell-system-akka.dispatchers.backend-dispatcher-211] [akka://cromwell-system/user/cromwell-service/WorkflowManagerActor/WorkflowActor-efce8835-18f0-4b86-8783-d9080d75b68f/WorkflowExecutionActor-efce8835-18f0-4b86-8783-d9080d75b68f/efce8835-18f0-4b86-8783-d9080d75b68f-EngineJobExecutionActor-meta_workflow.all_sample_map:NA:1/efce8835-18f0-4b86-8783-d9080d75b68f-BackendJobExecutionActor-meta_workflow.all_sample_map:NA:1/TesAsyncBackendJobExecutionActor] TesAsyncBackendJobExecutionActor [UUID(efce8835)meta_workflow.all_sample_map:NA:1]: job id: task-893eb7df
                    ################# retry : Some(9997) ################################## retry : Some(9997) ################################## retry : Some(9997) ################################## retry : Some(9997) ################################## retry : Some(9996) ################################## retry : Some(9997) ################################## retry : Some(9997) #################[INFO] [09/10/2025 13:24:27.693] [cromwell-system-akka.dispatchers.backend-dispatcher-189] [akka://cromwell-system/user/cromwell-service/WorkflowManagerActor/WorkflowActor-efce8835-18f0-4b86-8783-d9080d75b68f/WorkflowExecutionActor-efce8835-18f0-4b86-8783-d9080d75b68f/efce8835-18f0-4b86-8783-d9080d75b68f-EngineJobExecutionActor-meta_workflow.roplsoplsda:NA:1/efce8835-18f0-4b86-8783-d9080d75b68f-BackendJobExecutionActor-meta_workflow.roplsoplsda:NA:1/TesAsyncBackendJobExecutionActor] TesAsyncBackendJobExecutionActor [UUID(efce8835)meta_workflow.roplsoplsda:NA:1]: job id: task-7a6a121a
                    ################# retry : Some(9996) ################################## retry : Some(9997) ################################## retry : Some(9996) ################################## retry : Some(9996) ################################## retry : Some(9998) #################[INFO] [09/10/2025 13:24:27.693] [cromwell-system-akka.dispatchers.backend-dispatcher-229] [akka://cromwell-system/user/cromwell-service/WorkflowManagerActor/WorkflowActor-efce8835-18f0-4b86-8783-d9080d75b68f/WorkflowExecutionActor-efce8835-18f0-4b86-8783-d9080d75b68f/efce8835-18f0-4b86-8783-d9080d75b68f-EngineJobExecutionActor-meta_workflow.kmeans:NA:1/efce8835-18f0-4b86-8783-d9080d75b68f-BackendJobExecutionActor-meta_workflow.kmeans:NA:1/TesAsyncBackendJobExecutionActor] TesAsyncBackendJobExecutionActor [UUID(efce8835)meta_workflow.kmeans:NA:1]: job id: task-e853a7d3
                    ################# retry : Some(9997) ################################## retry : Some(9995) ################################## retry : Some(9996) ################################## retry : Some(9997) ################################## retry : Some(9996) ################################## retry : Some(9997) ################################## retry : Some(9996) ################################## retry : Some(9997) #################[INFO] [09/10/2025 13:24:27.693] [cromwell-system-akka.dispatchers.backend-dispatcher-204] [akka://cromwell-system/user/cromwell-service/WorkflowManagerActor/WorkflowActor-efce8835-18f0-4b86-8783-d9080d75b68f/WorkflowExecutionActor-efce8835-18f0-4b86-8783-d9080d75b68f/efce8835-18f0-4b86-8783-d9080d75b68f-EngineJobExecutionActor-meta_workflow.roplsplsda:NA:1/efce8835-18f0-4b86-8783-d9080d75b68f-BackendJobExecutionActor-meta_workflow.roplsplsda:NA:1/TesAsyncBackendJobExecutionActor] TesAsyncBackendJobExecutionActor [UUID(efce8835)meta_workflow.roplsplsda:NA:1]: job id: task-fdfbad9a
                    ################# retry : Some(9997) ################################## retry : Some(9996) ################################## retry : Some(9996) ################################## retry : Some(9997) ################################## retry : Some(9996) #################[INFO] [09/10/2025 13:24:27.693] [cromwell-system-akka.dispatchers.backend-dispatcher-192] [akka://cromwell-system/user/cromwell-service/WorkflowManagerActor/WorkflowActor-efce8835-18f0-4b86-8783-d9080d75b68f/WorkflowExecutionActor-efce8835-18f0-4b86-8783-d9080d75b68f/efce8835-18f0-4b86-8783-d9080d75b68f-EngineJobExecutionActor-meta_workflow.heatmap:NA:1/efce8835-18f0-4b86-8783-d9080d75b68f-BackendJobExecutionActor-meta_workflow.heatmap:NA:1/TesAsyncBackendJobExecutionActor] TesAsyncBackendJobExecutionActor [UUID(efce8835)meta_workflow.heatmap:NA:1]: job id: task-5c817f97
                    ################# retry : Some(9996) ################################## retry : Some(9997) ################################## retry : Some(9996) ################################## retry : Some(9996) ################################## retry : Some(9995) ################################## retry : Some(9997) ################################## retry : Some(9997) #################[INFO] [09/10/2025 13:24:27.693] [cromwell-system-akka.dispatchers.backend-dispatcher-195] [akka://cromwell-system/user/cromwell-service/WorkflowManagerActor/WorkflowActor-efce8835-18f0-4b86-8783-d9080d75b68f/WorkflowExecutionActor-efce8835-18f0-4b86-8783-d9080d75b68f/efce8835-18f0-4b86-8783-d9080d75b68f-EngineJobExecutionActor-meta_workflow.metacor:NA:1/efce8835-18f0-4b86-8783-d9080d75b68f-BackendJobExecutionActor-meta_workflow.metacor:NA:1/TesAsyncBackendJobExecutionActor] TesAsyncBackendJobExecutionActor [UUID(efce8835)meta_workflow.metacor:NA:1]: job id: task-2a5ebaf9
                    ################# retry : Some(9997) ################################## retry : Some(9997) ################################## retry : Some(9996) ################################## retry : Some(9997) ################################## retry : Some(9998) #################[INFO] [09/10/2025 13:24:27.693] [cromwell-system-akka.dispatchers.backend-dispatcher-233] [akka://cromwell-system/user/cromwell-service/WorkflowManagerActor/WorkflowActor-efce8835-18f0-4b86-8783-d9080d75b68f/WorkflowExecutionActor-efce8835-18f0-4b86-8783-d9080d75b68f/efce8835-18f0-4b86-8783-d9080d75b68f-EngineJobExecutionActor-meta_workflow.KEGG:NA:1/efce8835-18f0-4b86-8783-d9080d75b68f-BackendJobExecutionActor-meta_workflow.KEGG:NA:1/TesAsyncBackendJobExecutionActor] TesAsyncBackendJobExecutionActor [UUID(efce8835)meta_workflow.KEGG:NA:1]: job id: task-0b91a00d
                    ################# retry : Some(9996) ################################## retry : Some(9996) ################################## retry : Some(9997) ################################## retry : Some(9997) ##############

                    • Z
                      zhanglu (last edited by zhanglu)

                      [INFO] [09/11/2025 06:46:32.437] [cromwell-system-akka.dispatchers.engine-dispatcher-30] [akka://cromwell-system/user/cromwell-service/WorkflowManagerActor/WorkflowActor-2f372e06-2cd8-424f-8e6c-062e0b506e40/WorkflowExecutionActor-2f372e06-2cd8-424f-8e6c-062e0b506e40] WorkflowExecutionActor-2f372e06-2cd8-424f-8e6c-062e0b506e40 [UUID(2f372e06)]: Restarting blood_meta.check_file, blood_meta.predeal
                      [INFO] [09/11/2025 06:46:32.438] [cromwell-system-akka.dispatchers.engine-dispatcher-27] [akka://cromwell-system/user/cromwell-service/WorkflowManagerActor/WorkflowActor-a5686468-95da-46ca-8498-50187928d6d6/WorkflowExecutionActor-a5686468-95da-46ca-8498-50187928d6d6] WorkflowExecutionActor-a5686468-95da-46ca-8498-50187928d6d6 [UUID(a5686468)]: Restarting metage_megahit.kneaddata
                      [INFO] [09/11/2025 06:46:32.438] [cromwell-system-akka.dispatchers.engine-dispatcher-6] [akka://cromwell-system/user/cromwell-service/WorkflowManagerActor/WorkflowActor-f95d7ecd-be71-428b-8195-9c121ad27007/WorkflowExecutionActor-f95d7ecd-be71-428b-8195-9c121ad27007] WorkflowExecutionActor-f95d7ecd-be71-428b-8195-9c121ad27007 [UUID(f95d7ecd)]: Restarting RNASeq_eukaryon.predeal, RNASeq_eukaryon.getkeggtype

                      • Z
                        zhanglu

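                        #!/bin/bash

                        # 1. Path to the log file (defaults to nohup.out in the current directory; adjust as needed)
                        LOG_FILE="./nohup.out"

                        # 2. Check that the log file exists
                        if [ ! -f "$LOG_FILE" ]; then
                            echo "Error: log file $LOG_FILE does not exist. Check that the path is correct." >&2
                            exit 1
                        fi

                        # 3. Follow the log in real time and extract the target workflow IDs (UUIDs)
                        echo "=== Watching $LOG_FILE for workflow IDs on lines containing Restarting ==="
                        echo "=== Press Ctrl+C to stop ==="
                        echo "=========================="

                        # Core logic:
                        # - tail -f: follow new content appended to the log
                        # - grep --line-buffered "Restarting": keep only lines containing "Restarting"
                        # - sed -E: extract the 36-character UUID after "WorkflowActor-" (8-4-4-4-12 hex digits);
                        #   -u is the GNU sed flag for unbuffered output
                        # - awk '!seen[$0]++': print each UUID only once, as soon as it first appears
                        #   (sort -u cannot be used on a tail -f stream: it waits for end of input, which never arrives)
                        tail -f "$LOG_FILE" |
                            grep --line-buffered "Restarting" |
                            sed -u -n -E 's/.*WorkflowActor-([0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}).*/\1/p' |
                            awk '!seen[$0]++ { print; fflush() }'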
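
The extraction step described above can be checked offline, without tailing a live log. The following is only a sketch, not part of the original script: it feeds one sample line (copied from the Cromwell log earlier in this thread), a duplicate of it, and a non-matching line through the same grep and sed stages, then de-duplicates with awk. The function name `extract_workflow_ids` is hypothetical, and `sed -E` is assumed to be available (both GNU and BSD sed accept it).

```shell
#!/bin/bash

# Sample line copied from the Cromwell log earlier in this thread.
line='[INFO] [09/11/2025 06:46:32.438] [cromwell-system-akka.dispatchers.engine-dispatcher-27] [akka://cromwell-system/user/cromwell-service/WorkflowManagerActor/WorkflowActor-a5686468-95da-46ca-8498-50187928d6d6/WorkflowExecutionActor-a5686468-95da-46ca-8498-50187928d6d6] WorkflowExecutionActor-a5686468-95da-46ca-8498-50187928d6d6 [UUID(a5686468)]: Restarting metage_megahit.kneaddata'

# Hypothetical helper: reads log lines on stdin and prints each
# restarted workflow UUID once, in order of first appearance.
extract_workflow_ids() {
    grep "Restarting" |
        sed -n -E 's/.*WorkflowActor-([0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}).*/\1/p' |
        awk '!seen[$0]++'
}

# Two identical matching lines and one non-matching line go in;
# a single UUID comes out.
printf '%s\n%s\n%s\n' "$line" "$line" "no match here" | extract_workflow_ids
```

If this prints a single UUID, the regular expression and the de-duplication behave as intended, and any remaining problem with the live pipeline is more likely to be output buffering between the stages.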

                        • Z
                          zhanglu

                          estarting micro_dy_gro.upstream
                          [INFO] [09/23/2025 06:46:37.396] [cromwell-system-akka.dispatchers.engine-dispatcher-9] [akka://cromwell-system/user/cromwell-service/WorkflowManagerActor/WorkflowActor-fa3a52b6-19db-4435-ac0f-a5c1fbeec385/WorkflowExecutionActor-fa3a52b6-19db-4435-ac0f-a5c1fbeec385] WorkflowExecutionActor-fa3a52b6-19db-4435-ac0f-a5c1fbeec385 [UUID(fa3a52b6)]: Restarting blood_meta.jsonFile, blood_meta.reportNoFile, blood_meta.resFile
                          ################# retry : Some(9999) ################################## retry : Some(9999) ################################## retry : Some(9999) ################################## retry : Some(9999) ################################## retry : Some(9999) ################################## retry : Some(9999) ################################## retry : Some(9999) ################################## retry : Some(9998) ################################## retry : Some(9998) ################################## retry : Some(9999) ################################## retry : Some(9999) ################################## retry : Some(9999) ###############

                          • Z
                            zhanglu

                            Type    Reason                   Age                   From     Message
                            ----    ------                   ----                  ----     -------
                            Normal  NodeReady                47m (x11 over 6h56m)  kubelet  Node node1 status is now: NodeReady
                            Normal  NodeNotReady             44m (x12 over 7h3m)   kubelet  Node node1 status is now: NodeNotReady
                            Normal  Starting                 37m                   kubelet  Starting kubelet.
                            Normal  NodeHasSufficientMemory  37m                   kubelet  Node node1 status is now: NodeHasSufficientMemory
                            Normal  NodeHasNoDiskPressure    37m                   kubelet  Node node1 status is now: NodeHasNoDiskPressure
                            Normal  NodeHasSufficientPID     37m                   kubelet  Node node1 status is now: NodeHasSufficientPID
                            Normal  NodeAllocatableEnforced  37m                   kubelet  Updated Node Allocatable limit across pods
                            Normal  NodeReady                37m                   kubelet  Node node1 status is now: NodeReady
                            Normal  NodeNotReady             34m                   kubelet  Node node1 status is now: NodeNotReady
                            Normal  Starting                 31m                   kubelet  Starting kubelet.
                            Normal  NodeHasSufficientMemory  31m                   kubelet  Node node1 status is now: NodeHasSufficientMemory
                            Normal  NodeHasNoDiskPressure    31m                   kubelet  Node node1 status is now: NodeHasNoDiskPressure
                            Normal  NodeHasSufficientPID     31m                   kubelet  Node node1 status is now: NodeHasSufficientPID
                            Normal  NodeAllocatableEnforced  31m                   kubelet  Updated Node Allocatable limit across pods
                            Normal  NodeReady                9m29s (x2 over 31m)   kubelet  Node node1 status is now: NodeReady
                            Normal  NodeNotReady             6m28s (x2 over 28m)   kubelet  Node node1 status is now: NodeNotReady

                            • Z
                              zhanglu

                              Install the NVIDIA repository configuration package (for CentOS 8)

                              distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
                              curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.repo | sudo tee /etc/yum.repos.d/nvidia-docker.repo

                              sudo dnf install -y nvidia-container-toolkit

                              • Z
                                zhanglu

                                nvidia-ctk runtime configure --runtime=docker

                                • Z
                                  zhanglu

                                  kubectl get pods --all-namespaces -o json | jq -r '.items[] | select(.spec.containers[].resources.limits["<gpu-resource-name>"] != null) | .metadata.namespace + " " + .metadata.name'

                                  • Z
                                    zhanglu

                                    2025/10/13 02:31:43 Starting FS watcher.
                                    2025/10/13 02:31:43 Starting OS watcher.
                                    2025/10/13 02:31:43 Starting Plugins.
                                    2025/10/13 02:31:43 Loading configuration.
                                    2025/10/13 02:31:43 Initializing NVML.
                                    2025/10/13 02:31:43 Failed to initialize NVML: could not load NVML library.
                                    2025/10/13 02:31:43 If this is a GPU node, did you set the docker default runtime to nvidia?
                                    2025/10/13 02:31:43 You can check the prerequisites at: https://github.com/NVIDIA/k8s-device-plugin#prerequisites
                                    2025/10/13 02:31:43 You can learn how to set the runtime at: https://github.com/NVIDIA/k8s-device-plugin#quick-start
                                    2025/10/13 02:31:43 If this is not a GPU node, you should set up a toleration or nodeSelector to only deploy this plugin on GPU nodes

                                    • Z
                                      zhanglu

                                      https://cn.download.nvidia.com/XFree86/Linux-x86_64/580.95.05/NVIDIA-Linux-x86_64-580.95.05.run

                                      • Z
                                        zhanglu

                                        2025/10/13 07:23:31 Failed to initialize NVML: could not load NVML library.
                                        2025/10/13 07:23:31 If this is a GPU node, did you set the docker default runtime to nvidia?

                                        • Z
                                          zhanglu

                                          wget https://developer.download.nvidia.com/compute/cuda/13.0.2/local_installers/cuda_13.0.2_580.95.05_linux.run

                                          • Z
                                            zhanglu

                                            wget https://developer.download.nvidia.com/compute/cuda/11.6.2/local_installers/cuda_11.6.2_510.47.03_linux.run
