暗能星系

    tmp

    张渌
    This topic has been deleted. Only users with topic management privileges can see it.
      zhanglu (last edited by zhanglu)

      [INFO] [09/11/2025 06:46:32.437] [cromwell-system-akka.dispatchers.engine-dispatcher-30] [akka://cromwell-system/user/cromwell-service/WorkflowManagerActor/WorkflowActor-2f372e06-2cd8-424f-8e6c-062e0b506e40/WorkflowExecutionActor-2f372e06-2cd8-424f-8e6c-062e0b506e40] WorkflowExecutionActor-2f372e06-2cd8-424f-8e6c-062e0b506e40 [UUID(2f372e06)]: Restarting blood_meta.check_file, blood_meta.predeal
      [INFO] [09/11/2025 06:46:32.438] [cromwell-system-akka.dispatchers.engine-dispatcher-27] [akka://cromwell-system/user/cromwell-service/WorkflowManagerActor/WorkflowActor-a5686468-95da-46ca-8498-50187928d6d6/WorkflowExecutionActor-a5686468-95da-46ca-8498-50187928d6d6] WorkflowExecutionActor-a5686468-95da-46ca-8498-50187928d6d6 [UUID(a5686468)]: Restarting metage_megahit.kneaddata
      [INFO] [09/11/2025 06:46:32.438] [cromwell-system-akka.dispatchers.engine-dispatcher-6] [akka://cromwell-system/user/cromwell-service/WorkflowManagerActor/WorkflowActor-f95d7ecd-be71-428b-8195-9c121ad27007/WorkflowExecutionActor-f95d7ecd-be71-428b-8195-9c121ad27007] WorkflowExecutionActor-f95d7ecd-be71-428b-8195-9c121ad27007 [UUID(f95d7ecd)]: Restarting RNASeq_eukaryon.predeal, RNASeq_eukaryon.getkeggtype

        zhanglu

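        #!/bin/bash

        # 1. Path to the log file (defaults to nohup.out in the current directory; adjust as needed)
        LOG_FILE="./nohup.out"

        # 2. Check that the log file exists
        if [ ! -f "$LOG_FILE" ]; then
            echo "Error: log file $LOG_FILE does not exist. Check that the path is correct."
            exit 1
        fi

        # 3. Follow the log in real time and extract the workflow IDs (UUIDs)
        echo "=== Watching $LOG_FILE for workflow IDs on lines containing Restarting ==="
        echo "=== Press Ctrl+C to stop ==="
        echo "=========================="

        # Core logic:
        # - tail -f: follow new log lines as they are appended
        # - grep "Restarting": keep only lines containing "Restarting"
        # - sed: extract the 36-character UUID (8-4-4-4-12 hex digits) after "WorkflowActor-"
        # - awk '!seen[$0]++': print each UUID only once (a workflow may restart several tasks);
        #   sort -u cannot be used here because it waits for end of input, which tail -f never sends
        tail -f "$LOG_FILE" |
        grep --line-buffered "Restarting" |
        sed -u -n -E 's/.*WorkflowActor-([0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}).*/\1/p' |
        awk '!seen[$0]++ { print; fflush() }'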
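For checking the extraction against a log that has already been written, the same grep/sed/dedup steps can be wrapped in a function that reads from stdin instead of following the file. This is only a sketch: the function name `extract_restarting_ids` is made up for illustration, and the regular expression assumes the extended-regex (`-E`) form of the pattern posted above.

```shell
#!/bin/bash

# Hypothetical helper (not part of the original post): reads Cromwell log
# lines on stdin and prints each restarting workflow UUID once, in the
# order it is first seen.
extract_restarting_ids() {
    grep "Restarting" |
    sed -n -E 's/.*WorkflowActor-([0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}).*/\1/p' |
    awk '!seen[$0]++'
}

# Usage on an existing log file:
#   extract_restarting_ids < ./nohup.out
```

Because the input ends, `sort -u` would also work here; the awk filter is kept so the output order matches the order of the restarts in the log.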

          zhanglu

          estarting micro_dy_gro.upstream
          [INFO] [09/23/2025 06:46:37.396] [cromwell-system-akka.dispatchers.engine-dispatcher-9] [akka://cromwell-system/user/cromwell-service/WorkflowManagerActor/WorkflowActor-fa3a52b6-19db-4435-ac0f-a5c1fbeec385/WorkflowExecutionActor-fa3a52b6-19db-4435-ac0f-a5c1fbeec385] WorkflowExecutionActor-fa3a52b6-19db-4435-ac0f-a5c1fbeec385 [UUID(fa3a52b6)]: Restarting blood_meta.jsonFile, blood_meta.reportNoFile, blood_meta.resFile
          ################# retry : Some(9999) #################
          ################# retry : Some(9999) #################
          ################# retry : Some(9999) #################
          ################# retry : Some(9999) #################
          ################# retry : Some(9999) #################
          ################# retry : Some(9999) #################
          ################# retry : Some(9999) #################
          ################# retry : Some(9998) #################
          ################# retry : Some(9998) #################
          ################# retry : Some(9999) #################
          ################# retry : Some(9999) #################
          ################# retry : Some(9999) ###############

            zhanglu

            Type    Reason                   Age                   From     Message
            ----    ------                   ----                  ----     -------
            Normal  NodeReady                47m (x11 over 6h56m)  kubelet  Node node1 status is now: NodeReady
            Normal  NodeNotReady             44m (x12 over 7h3m)   kubelet  Node node1 status is now: NodeNotReady
            Normal  Starting                 37m                   kubelet  Starting kubelet.
            Normal  NodeHasSufficientMemory  37m                   kubelet  Node node1 status is now: NodeHasSufficientMemory
            Normal  NodeHasNoDiskPressure    37m                   kubelet  Node node1 status is now: NodeHasNoDiskPressure
            Normal  NodeHasSufficientPID     37m                   kubelet  Node node1 status is now: NodeHasSufficientPID
            Normal  NodeAllocatableEnforced  37m                   kubelet  Updated Node Allocatable limit across pods
            Normal  NodeReady                37m                   kubelet  Node node1 status is now: NodeReady
            Normal  NodeNotReady             34m                   kubelet  Node node1 status is now: NodeNotReady
            Normal  Starting                 31m                   kubelet  Starting kubelet.
            Normal  NodeHasSufficientMemory  31m                   kubelet  Node node1 status is now: NodeHasSufficientMemory
            Normal  NodeHasNoDiskPressure    31m                   kubelet  Node node1 status is now: NodeHasNoDiskPressure
            Normal  NodeHasSufficientPID     31m                   kubelet  Node node1 status is now: NodeHasSufficientPID
            Normal  NodeAllocatableEnforced  31m                   kubelet  Updated Node Allocatable limit across pods
            Normal  NodeReady                9m29s (x2 over 31m)   kubelet  Node node1 status is now: NodeReady
            Normal  NodeNotReady             6m28s (x2 over 28m)   kubelet  Node node1 status is now: NodeNotReady

              zhanglu

              # Install the NVIDIA repository configuration (for CentOS 8)

              distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
              curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.repo | sudo tee /etc/yum.repos.d/nvidia-docker.repo

              sudo dnf install -y nvidia-container-toolkit

                zhanglu

                nvidia-ctk runtime configure --runtime=docker

                  zhanglu

                  kubectl get pods --all-namespaces -o json | jq -r '.items[] | select(.spec.containers[].resources.limits["<gpu-resource-name>"] != null) | .metadata.namespace + " " + .metadata.name'

                    zhanglu

                    2025/10/13 02:31:43 Starting FS watcher.
                    2025/10/13 02:31:43 Starting OS watcher.
                    2025/10/13 02:31:43 Starting Plugins.
                    2025/10/13 02:31:43 Loading configuration.
                    2025/10/13 02:31:43 Initializing NVML.
                    2025/10/13 02:31:43 Failed to initialize NVML: could not load NVML library.
                    2025/10/13 02:31:43 If this is a GPU node, did you set the docker default runtime to nvidia?
                    2025/10/13 02:31:43 You can check the prerequisites at: https://github.com/NVIDIA/k8s-device-plugin#prerequisites
                    2025/10/13 02:31:43 You can learn how to set the runtime at: https://github.com/NVIDIA/k8s-device-plugin#quick-start
                    2025/10/13 02:31:43 If this is not a GPU node, you should set up a toleration or nodeSelector to only deploy this plugin on GPU nodes
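The plugin log above raises two questions: whether the NVML library can be loaded at all, and whether Docker's default runtime is set to nvidia. A rough way to check both on the node might look like the sketch below. The function names are hypothetical, the config path assumes Docker's default `/etc/docker/daemon.json`, and the grep only matches the common `"default-runtime": "nvidia"` form rather than parsing the JSON.

```shell
#!/bin/bash

# Hypothetical check: returns 0 if the given Docker daemon.json sets
# "default-runtime" to "nvidia". Assumes the default config path when no
# argument is given; uses a simple text match, not a real JSON parser.
docker_default_runtime_is_nvidia() {
    local cfg="${1:-/etc/docker/daemon.json}"
    [ -f "$cfg" ] && grep -Eq '"default-runtime"[[:space:]]*:[[:space:]]*"nvidia"' "$cfg"
}

# Hypothetical check: returns 0 if the dynamic linker cache on the host
# lists the NVML library (libnvidia-ml.so), which ships with the driver.
nvml_library_present() {
    ldconfig -p 2>/dev/null | grep -q 'libnvidia-ml\.so'
}

# Usage on the GPU node:
#   nvml_library_present || echo "NVML library not found: is the NVIDIA driver installed?"
#   docker_default_runtime_is_nvidia || echo "Docker default runtime is not nvidia"
```

If the second check fails, running `nvidia-ctk runtime configure --runtime=docker --set-as-default` and then restarting Docker is one likely fix, though that has not been verified on this cluster.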

                      zhanglu

                      https://cn.download.nvidia.com/XFree86/Linux-x86_64/580.95.05/NVIDIA-Linux-x86_64-580.95.05.run

                        zhanglu

                        2025/10/13 07:23:31 Failed to initialize NVML: could not load NVML library.
                        2025/10/13 07:23:31 If this is a GPU node, did you set the docker default runtime to nvidia?

                          zhanglu

                          wget https://developer.download.nvidia.com/compute/cuda/13.0.2/local_installers/cuda_13.0.2_580.95.05_linux.run

                            zhanglu

                            wget https://developer.download.nvidia.com/compute/cuda/11.6.2/local_installers/cuda_11.6.2_510.47.03_linux.run

                              zhanglu

                              Failed to initialize NVML: could not load NVML library

                                zhanglu

                                https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.17.1/deployments/static/nvidia-device-plugin.yml

                                  zhanglu

                                  https://www.nvidia.cn/drivers/details/252785/

                                    zhanglu

                                    $ curl -s -L https://nvidia.github.io/nvidia-docker/centos8/nvidia-docker.repo | sudo tee /etc/yum.repos.d/nvidia-docker.repo
                                    yum install -y nvidia-container-toolkit

                                      zhanglu

                                      d?????????? ? ? ? ? ? cephfs_data

                                        zhanglu

                                        root
                                        Tzzs@2025*
                                        220.185.228.106
                                        30001

                                          zhanglu

                                          http://192.168.30.202:31237/api/workflows/v1/8a6aa1f8-0ab1-4518-bc22-06390d1c7494/abort

                                            zhanglu

                                            curl -X POST "http://192.168.30.202:31237/api/workflows/v1/8a6aa1f8-0ab1-4518-bc22-06390d1c7494/abort" -H "accept: application/json"
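When several workflows need to be aborted, the curl call above can be wrapped in a loop that reads workflow IDs from stdin, for example the UUIDs extracted from the Restarting log lines. The sketch below is an assumption-laden illustration: `abort_url`, `abort_all`, `CROMWELL_URL` and `DRY_RUN` are names invented here, and only the `/api/workflows/v1/<id>/abort` endpoint is taken from the post.

```shell
#!/bin/bash

# Assumption: base URL of the Cromwell server, defaulting to the address
# used in the post. Override by exporting CROMWELL_URL.
CROMWELL_URL="${CROMWELL_URL:-http://192.168.30.202:31237}"

# Hypothetical helper: build the abort endpoint URL for one workflow ID.
abort_url() {
    printf '%s/api/workflows/v1/%s/abort\n' "$CROMWELL_URL" "$1"
}

# Hypothetical helper: abort every workflow ID read from stdin (one per
# line, blank lines skipped). With DRY_RUN=1 the curl commands are printed
# instead of executed.
abort_all() {
    local id
    while IFS= read -r id; do
        [ -n "$id" ] || continue
        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo "curl -X POST \"$(abort_url "$id")\" -H \"accept: application/json\""
        else
            curl -s -X POST "$(abort_url "$id")" -H "accept: application/json"
            echo
        fi
    done
}

# Usage (dry run first, to review what would be sent):
#   DRY_RUN=1 abort_all < workflow_ids.txt
```

Running with `DRY_RUN=1` first is a cheap safeguard, since an abort cannot be undone once the request is sent.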
