Toil validation notes
-
Goal: verify that Toil, CWL and k8s run together, with Toil running in server mode.
1. Environment setup
1.1 Installing k8s (minikube)
googleapis is not directly reachable, so download the binary through a browser proxy:
curl -LO https://storage.googleapis.com/minikube/releases/latest/minikube-linux-amd64
sudo install minikube-linux-amd64 /usr/local/bin/minikube
Use the cn image mirror. The newest Kubernetes version had download problems, so pick a slightly older one:
minikube start --image-mirror-country='cn' --kubernetes-version=v1.23.8 --extra-config=kubelet.housekeeping-interval=10s
The --extra-config=kubelet.housekeeping-interval=10s option is what makes metrics-server usable.
1.2 Basic minikube usage
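Check status:
minikube status
List all nodes (machines) in the cluster:
minikube kubectl -- get nodes
List all namespaces in the cluster:
minikube kubectl -- get namespaces
List all Pods in the cluster:
minikube kubectl -- get pods -A
Open a shell on the node:
minikube ssh
Run a command on the node, for example docker info:
minikube ssh -- docker info
Delete the cluster (add --purge to also remove the files cached under ~/.minikube):
minikube delete
Stop the cluster:
minikube stop
Destroy the cluster:
minikube stop && minikube delete
minikube dashboard reports an error:
libva error: /usr/lib/x86_64-linux-gnu/dri/iHD_drv_video.so init failed
Reinstalling the driver did not fix it, so the error is ignored for now:
sudo apt-get install --reinstall intel-media-va-driver:amd64
Running a service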
minikube kubectl -- create deployment hello-minikube --image=kicbase/echo-server:1.0
minikube kubectl -- expose deployment hello-minikube --type=NodePort --port=8080
minikube service hello-minikube
Toil needs metrics-server
https://www.mls-tech.info/microservice/k8s/minikube-use-metrics-server/
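By default this command always pulls the image from the upstream k8s registry:
minikube addons enable metrics-server
Pull the image from the Aliyun mirror instead and tag it with the upstream name: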
sudo docker pull registry.cn-hangzhou.aliyuncs.com/google_containers/metrics-server:v0.6.2
sudo docker tag registry.cn-hangzhou.aliyuncs.com/google_containers/metrics-server:v0.6.2 registry.k8s.io/metrics-server/metrics-server:v0.6.2
Enable metrics-server, making sure to point it at the domestic (China) mirror:
minikube addons enable metrics-server --images="MetricsServer=metrics-server:v0.6.2" --registries="MetricsServer=registry.cn-hangzhou.aliyuncs.com/google_containers"
2. Installing Toil
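sudo pip install virtualenv
virtualenv ~/venv
source ~/venv/bin/activate
pip install toil
pip install 'toil[cwl,wdl,kubernetes,server]'
The second pip install adds the optional extras (CWL, WDL, Kubernetes, server); the quotes keep the shell from interpreting the brackets.
Toil architecture: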

the leader:
The leader is responsible for deciding which jobs should be run. To do this it traverses the job graph. Currently this is a single threaded process, but we make aggressive steps to prevent it becoming a bottleneck.
There are two main ways to run Toil workflows on Kubernetes. You can either run the Toil leader on a machine outside the cluster, with jobs submitted to and run on the cluster, or you can submit the Toil leader itself as a job and have it run inside the cluster.
the job-store:
Handles all files shared between the components. Files in the job-store are the means by which the state of the workflow is maintained. Each job is backed by a file in the job store, and atomic updates to this state are used to ensure the workflow can always be resumed upon failure. The job-store can also store all user files, allowing them to be shared between jobs. The job-store is defined by the AbstractJobStore class. Multiple implementations of this class allow Toil to support different back-end file stores, e.g.: S3, network file systems, Google file store, etc.
workers:
The workers are temporary processes responsible for running jobs, one at a time per worker. Each worker process is invoked with a job argument that it is responsible for running. The worker monitors this job and reports back success or failure to the leader by editing the job's state in the file-store. If the job defines successor jobs the worker may choose to immediately run them.
the batch-system:
Responsible for scheduling the jobs given to it by the leader, creating a worker command for each job. The batch-system is defined by the AbstractBatchSystem class. Toil uses multiple existing batch systems to schedule jobs, including Apache Mesos, GridEngine and a multi-process single node implementation that allows workflows to be run without any of these frameworks. Toil can therefore fairly easily be made to run a workflow using an existing cluster.
the node provisioner:
Creates worker nodes in which the batch system schedules workers. It is defined by the AbstractProvisioner class.
the statistics and logging monitor:
Monitors logging and statistics produced by the workers and reports them. Uses the job-store to gather this information.
3. Test cases
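Before the individual test cases, a toy model may help fix how the components from the architecture section interact: the leader walks the job graph, the batch system turns each issued job into a worker invocation, and the worker records the result in the job store. This is only a sketch under simplifying assumptions. Every class and function name below (ToyJobStore, ToyBatchSystem, toy_worker, toy_leader) is hypothetical, nothing here is Toil's API, and the graph is assumed to be tree-shaped so each job has a single predecessor.

```python
from collections import deque


class ToyJobStore:
    """Holds per-job state; in Toil this role is played by files in the job store."""

    def __init__(self):
        self.state = {}

    def mark_done(self, job_name, result):
        self.state[job_name] = result


def toy_worker(job_name, func, job_store):
    """A worker runs exactly one job and records the outcome in the job store."""
    job_store.mark_done(job_name, func())


class ToyBatchSystem:
    """Turns each issued job into a worker invocation (here simply a direct call)."""

    def __init__(self, job_store):
        self.job_store = job_store

    def issue(self, job_name, func):
        toy_worker(job_name, func, self.job_store)


def toy_leader(graph, funcs, root, batch_system):
    """Walk a tree-shaped job graph from the root, issuing each job before its successors."""
    order = []
    queue = deque([root])
    while queue:
        job_name = queue.popleft()
        batch_system.issue(job_name, funcs[job_name])
        order.append(job_name)
        queue.extend(graph.get(job_name, []))
    return order


# Hypothetical two-job workflow: "hello" has one successor, "report".
store = ToyJobStore()
graph = {"hello": ["report"]}
funcs = {"hello": lambda: "Hello, world!", "report": lambda: "done"}
order = toy_leader(graph, funcs, "hello", ToyBatchSystem(store))
print(order)
print(store.state)
```

In real Toil the batch system hands jobs to a scheduler such as Kubernetes instead of calling the worker directly, and the job store persists state so that a failed workflow can be resumed.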
3.1 Toil's native hello world workflow
A native Toil workflow is written as a Python script. helloWorld.py:
from toil.common import Toil
from toil.job import Job


def helloWorld(message, memory="1G", cores=1, disk="1G"):
    return f"Hello, world!, here's a message: {message}"


if __name__ == "__main__":
    parser = Job.Runner.getDefaultArgumentParser()
    options = parser.parse_args()
    options.clean = "always"
    with Toil(options) as toil:
        output = toil.start(Job.wrapFn(helloWorld, "You did it!"))
    print(output)
Run it with a file job store:
(venv) $ python helloWorld.py file:my-job-store
3.2 Testing a CWL workflow
example.cwl:
cwlVersion: v1.0
class: CommandLineTool
baseCommand: echo
stdout: output.txt
inputs:
  message:
    type: string
    inputBinding:
      position: 1
outputs:
  output:
    type: stdout
Run the CWL file (example-job.yaml supplies the message input):
toil-cwl-runner example.cwl example-job.yaml
cat output.txt
3.3 Testing WDL
wdl-helloworld.wdl:
workflow write_simple_file {
  call write_file
}
task write_file {
  String message
  command { echo ${message} > wdl-helloworld-output.txt }
  output { File test = "wdl-helloworld-output.txt" }
}
wdl-helloworld.json:
{
  "write_simple_file.write_file.message": "Hello world!"
}
Run it:
toil-wdl-runner wdl-helloworld.wdl wdl-helloworld.json
3.4 Testing CWL on k8s
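The job store is handed to Toil as a locator string. The hello world run above uses file:my-job-store, and the k8s examples in these notes use aws:us-west-2:demouser-toil-test-jobstore. As a rough illustration of how such strings break down, here is a small sketch. parse_job_store_locator and aws_region_and_name are hypothetical helpers, not Toil functions, and treating a bare path as a file job store is an assumption made for this example.

```python
def parse_job_store_locator(locator):
    """Split a job-store locator into (scheme, rest).

    Hypothetical helper for illustration only. A locator that starts with
    "/" or ".", or that has no "scheme:" prefix, is assumed to be a file path.
    """
    if locator.startswith(("/", ".")) or ":" not in locator:
        return ("file", locator)
    scheme, rest = locator.split(":", 1)
    return (scheme, rest)


def aws_region_and_name(rest):
    """For the body of an aws locator, "region:name", return (region, name)."""
    region, name = rest.split(":", 1)
    return (region, name)


# The first two locators appear in these notes; "./job-store" is a made-up bare path.
for locator in [
    "file:my-job-store",
    "aws:us-west-2:demouser-toil-test-jobstore",
    "./job-store",
]:
    print(locator, "->", parse_job_store_locator(locator))
```

A file job store has to be visible at the same path to the leader and to every worker, which is why the storage question matters once the workers run inside k8s.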
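Toil requires AWS storage as the job-store when it is connected to k8s; the later sections of these notes work around this with a file job-store on a shared path.
-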
Other minikube references
https://www.zhaowenyu.com/minikube-doc/ops/minikube.html
-
Toil k8s deployment option 1: run the Toil leader inside k8s
leader.yaml:
apiVersion: batch/v1
kind: Job
metadata:
  # It is good practice to include your username in your job name.
  # Also specify it in TOIL_KUBERNETES_OWNER
  name: demo-user-toil-test
# Do not try and rerun the leader job if it fails
spec:
  backoffLimit: 0
  template:
    spec:
      # Do not restart the pod when the job fails, but keep it around so the
      # log can be retrieved
      restartPolicy: Never
      volumes:
      - name: aws-credentials-vol
        secret:
          # Make sure the AWS credentials are available as a volume.
          # This should match TOIL_AWS_SECRET_NAME
          secretName: aws-credentials
      # You may need to replace this with a different service account name as
      # appropriate for your cluster.
      serviceAccountName: default
      containers:
      - name: main
        image: quay.io/ucsc_cgl/toil:5.5.0
        env:
        # Specify your username for inclusion in job names
        - name: TOIL_KUBERNETES_OWNER
          value: demo-user
        # Specify where to find the AWS credentials to access the job store with
        - name: TOIL_AWS_SECRET_NAME
          value: aws-credentials
        # Specify where per-host caches should be stored, on the Kubernetes hosts.
        # Needs to be set for Toil's caching to be efficient.
        - name: TOIL_KUBERNETES_HOST_PATH
          value: /data/scratch
        volumeMounts:
        # Mount the AWS credentials volume
        - mountPath: /root/.aws
          name: aws-credentials-vol
        resources:
          # Make sure to set these resource limits to values large enough
          # to accommodate the work your workflow does in the leader
          # process, but small enough to fit on your cluster.
          #
          # Since no request values are specified, the limits are also used
          # for the requests.
          limits:
            cpu: 2
            memory: "4Gi"
            ephemeral-storage: "10Gi"
        command:
        - /bin/bash
        - -c
        - |
          # This Bash script will set up Toil and the workflow to run, and run them.
          set -e
          # We make sure to create a work directory; Toil can't hot-deploy a
          # script from the root of the filesystem, which is where we start.
          mkdir /tmp/work
          cd /tmp/work
          # We make a virtual environment to allow workflow dependencies to be
          # hot-deployed.
          #
          # We don't really make use of it in this example, but for workflows
          # that depend on PyPI packages we will need this.
          #
          # We use --system-site-packages so that the Toil installed in the
          # appliance image is still available.
          virtualenv --python python3 --system-site-packages venv
          . venv/bin/activate
          # Now we install the workflow. Here we're using a demo workflow
          # script from Toil itself.
          wget https://raw.githubusercontent.com/DataBiosphere/toil/releases/4.1.0/src/toil/test/docs/scripts/tutorial_helloworld.py
          # Now we run the workflow. We make sure to use the Kubernetes batch
          # system and an AWS job store, and we set some generally useful
          # logging options. We also make sure to enable caching.
          python3 tutorial_helloworld.py \
              aws:us-west-2:demouser-toil-test-jobstore \
              --batchSystem kubernetes \
              --realTimeLogging \
              --logInfo
Apply it:
kubectl apply -f leader.yaml
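The manifest above repeats the user name in the job name and in TOIL_KUBERNETES_OWNER, and the secret name in the volume and in TOIL_AWS_SECRET_NAME. One way to keep those in sync is to generate the manifest. The sketch below is only an illustration: build_leader_job is a hypothetical helper, and it is trimmed to the fields that depend on its parameters, so the resource limits, TOIL_KUBERNETES_HOST_PATH and the virtualenv setup are left out. It prints JSON, which kubectl apply -f also accepts.

```python
import json


def build_leader_job(owner, image, job_store, secret_name="aws-credentials"):
    """Return a trimmed Kubernetes Job manifest (as a dict) for a Toil leader.

    Hypothetical helper; it mirrors the structure of leader.yaml but omits
    the resource limits, host path and virtualenv steps for brevity.
    """
    script = "\n".join([
        "set -e",
        "mkdir /tmp/work",
        "cd /tmp/work",
        "wget https://raw.githubusercontent.com/DataBiosphere/toil/releases/4.1.0"
        "/src/toil/test/docs/scripts/tutorial_helloworld.py",
        f"python3 tutorial_helloworld.py {job_store}"
        " --batchSystem kubernetes --realTimeLogging --logInfo",
    ])
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": f"{owner}-toil-test"},
        "spec": {
            "backoffLimit": 0,  # do not rerun the leader job if it fails
            "template": {
                "spec": {
                    "restartPolicy": "Never",
                    "serviceAccountName": "default",
                    "volumes": [
                        {"name": "aws-credentials-vol",
                         "secret": {"secretName": secret_name}},
                    ],
                    "containers": [{
                        "name": "main",
                        "image": image,
                        "env": [
                            {"name": "TOIL_KUBERNETES_OWNER", "value": owner},
                            {"name": "TOIL_AWS_SECRET_NAME", "value": secret_name},
                        ],
                        "volumeMounts": [
                            {"mountPath": "/root/.aws",
                             "name": "aws-credentials-vol"},
                        ],
                        "command": ["/bin/bash", "-c", script],
                    }],
                },
            },
        },
    }


manifest = build_leader_job(
    owner="demo-user",
    image="quay.io/ucsc_cgl/toil:5.5.0",
    job_store="aws:us-west-2:demouser-toil-test-jobstore",
)
print(json.dumps(manifest, indent=2))
```

Saving the printed output to a file and passing it to kubectl apply -f would submit the same kind of leader job as the YAML version.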
Note:
Note that the leader pod will need your workflow script, its other dependencies, and Toil all installed. An easy way to get Toil installed is to start with the Toil appliance image for the version of Toil you want to use. In this example, we use quay.io/ucsc_cgl/toil:5.5.0.
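In other words, this mode requires the workflow script and Toil itself to be packaged into the image.
-
Toil k8s deployment option 2: run the Toil leader outside k8s
This is mainly for development and testing, and it currently requires that the local machine can reach AWS.
Real-time logging will not work unless the local machine has an externally reachable network address: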
Real time logging will not work unless your local machine is able to listen for incoming UDP packets on arbitrary ports on the address it uses to contact the IPv4 Internet; Toil does no NAT traversal or detection.
$ export TOIL_KUBERNETES_OWNER=demo-user # This defaults to your local username if not set
$ export TOIL_AWS_SECRET_NAME=aws-credentials
$ export TOIL_KUBERNETES_HOST_PATH=/data/scratch
$ virtualenv --python python3 --system-site-packages venv
$ . venv/bin/activate
$ wget https://raw.githubusercontent.com/DataBiosphere/toil/releases/4.1.0/src/toil/test/docs/scripts/tutorial_helloworld.py
$ python3 tutorial_helloworld.py \
    aws:us-west-2:demouser-toil-test-jobstore \
    --batchSystem kubernetes \
    --realTimeLogging \
    --logInfo
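Running this fails with:
ModuleNotFoundError: No module named 'boto'
Install the AWS client libraries:
pip install boto botocore boto3 mypy_boto3_s3
Next, try launching the job on k8s with the file job store. The job is submitted to minikube but does not start properly, because Toil mounts the AWS credentials volume by default: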
python3 tutorial_helloworld.py file:job-store --batchSystem kubernetes --realTimeLogging --logInfo
MountVolume.SetUp failed for volume "s3-credentials" : secret "aws-credentials" not found
Fix: do not set TOIL_AWS_SECRET_NAME (skip export TOIL_AWS_SECRET_NAME=aws-credentials).
Validation result:
export TOIL_KUBERNETES_HOST_PATH=/home/jynlix/Downloads/src/toil/data
export TOIL_WORKDIR=/home/jynlix/Downloads/src/toil/data
minikube mount /home/jynlix/Downloads/src/toil/data:/home/jynlix/Downloads/src/toil/data
python3 -m pdb tutorial_helloworld.py file:job-store --batchSystem kubernetes --realTimeLogging --logInfo
-
A case study: Rapid and efficient analysis of 20,000 RNA-seq samples with Toil
https://www.researchgate.net/publication/345904527_Rapid_and_efficient_analysis_of_20000_RNA-seq_samples_with_Toil
-
Toil server mode
docker run -d --name wes-rabbitmq -p 5672:5672 rabbitmq:3.9.5
celery -A toil.server.celery_app worker --loglevel=INFO
toil server
Submit a workflow through the WES API:
curl --location --request POST 'http://localhost:8000/ga4gh/wes/v1/runs' \
    --user test:test \
    --form 'workflow_url="example.cwl"' \
    --form 'workflow_type="cwl"' \
    --form 'workflow_type_version="v1.0"' \
    --form 'workflow_params="{\"message\": \"Hello world!\"}"' \
    --form 'workflow_attachment=@"./example.cwl"'
=========== metrics-server is required ===========
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml
Uninstall metrics-server:
kubectl delete -f components.yaml
Then edit components.yaml and add --kubelet-insecure-tls to the metrics-server container args:
      labels:
        k8s-app: metrics-server
    spec:
      containers:
      - args:
        - --cert-dir=/tmp
        - --secure-port=443
        - --kubelet-preferred-address-types=InternalIP,ExternalIP,Hostname
        - --kubelet-use-node-status-port
        - --metric-resolution=15s
        - --kubelet-insecure-tls   # add this line
Re-apply it:
kubectl apply -f components.yaml
Otherwise the following error occurs:
kubectl get deployment/metrics-server -n kube-system
v1beta1.metrics.k8s.io kube-system/metrics-server False (MissingEndpoints) 44m
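The manual edit to components.yaml described above, adding --kubelet-insecure-tls to the metrics-server container args, can also be scripted. The sketch below is only an illustration: add_kubelet_insecure_tls is a hypothetical helper that works on a Deployment manifest already parsed into plain dicts and lists. Loading the multi-document components.yaml would need a YAML parser such as PyYAML, which is an assumption and is not shown, so a hand-written dict stands in for the parsed Deployment.

```python
def add_kubelet_insecure_tls(deployment, flag="--kubelet-insecure-tls"):
    """Append the flag to every container's args in a Deployment manifest dict.

    Hypothetical helper. Returns True if the manifest was changed, so that
    running it twice does not add the flag a second time.
    """
    changed = False
    for container in deployment["spec"]["template"]["spec"]["containers"]:
        args = container.setdefault("args", [])
        if flag not in args:
            args.append(flag)
            changed = True
    return changed


# Stand-in for the parsed metrics-server Deployment, with the args listed in these notes.
deployment = {
    "kind": "Deployment",
    "spec": {
        "template": {
            "metadata": {"labels": {"k8s-app": "metrics-server"}},
            "spec": {
                "containers": [{
                    "args": [
                        "--cert-dir=/tmp",
                        "--secure-port=443",
                        "--kubelet-preferred-address-types=InternalIP,ExternalIP,Hostname",
                        "--kubelet-use-node-status-port",
                        "--metric-resolution=15s",
                    ],
                }],
            },
        },
    },
}

first_run = add_kubelet_insecure_tls(deployment)
second_run = add_kubelet_insecure_tls(deployment)
print(first_run, second_run)
print(deployment["spec"]["template"]["spec"]["containers"][0]["args"])
```

The flag makes metrics-server skip verification of the kubelet serving certificates, which is why it suits a test cluster better than a production one.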
Test:
kubectl top nodes
NAME CPU(cores) CPU% MEMORY(bytes) MEMORY%
node1 1076m 13% 17670Mi 75%
node2 1295m 16% 10048Mi 65%
node3 1168m 14% 14871Mi 63%
============ AppArmor may interfere, so remove the service (for production, configure it according to https://github.com/adamnovak/gi-kubernetes-autoscaling-config/blob/e1350ac9ad17d94b5073b20db3c75620957926e3/kubenode.ubuntu.cloud-config.yaml#L27-L67) ============
sudo systemctl stop apparmor.service
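sudo systemctl disable apparmor.service
When Toil server launches toil-cwl-runner, the run fails, possibly because the environment variables are not passed along to it. Submitting directly with the commands below succeeds: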
export TOIL_WORKDIR=/cephfs_data/toil
export TOIL_KUBERNETES_HOST_PATH=/cephfs_data/toil
toil-cwl-runner \
    --writeMessages=/cephfs_data/toil/run-6aef556521e1460e94b0557ce848f49e/bus_messages \
    --batchSystem=kubernetes \
    --workDir=/cephfs_data/toil \
    --clean=always \
    --outdir=/cephfs_data/toil/run-6aef556521e1460e94b0557ce848f49e/outputs \
    --jobStore=/cephfs_data/toil/run-6aef556521e1460e94b0557ce848f49e/toil_job_store \
    /cephfs_data/toil/run-6aef556521e1460e94b0557ce848f49e/execution/example.cwl \
    /cephfs_data/toil/run-6aef556521e1460e94b0557ce848f49e/execution/wes_inputs.json
cat /cephfs_data/toil/run-6aef556521e1460e94b0557ce848f49e/outputs/output.txt
Hello world!
To do for the production environment: decide whether to use hostPath or a PV.
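The direct toil-cwl-runner submission above repeats the same run directory in several arguments and relies on two environment variables being set. As a sketch of how that invocation could be assembled in one place, the hypothetical helpers below rebuild the argument list and the environment from a base directory and a run id. The function names and the default file names are assumptions taken from the command in these notes, not from Toil.

```python
import os
import shlex


def build_cwl_runner_command(base_dir, run_id,
                             workflow="example.cwl", inputs="wes_inputs.json"):
    """Rebuild the toil-cwl-runner argument list used in these notes.

    Hypothetical helper: the directory layout (run-<id>/execution, outputs,
    toil_job_store, bus_messages) is copied from the command shown above.
    """
    run_dir = f"{base_dir}/run-{run_id}"
    return [
        "toil-cwl-runner",
        f"--writeMessages={run_dir}/bus_messages",
        "--batchSystem=kubernetes",
        f"--workDir={base_dir}",
        "--clean=always",
        f"--outdir={run_dir}/outputs",
        f"--jobStore={run_dir}/toil_job_store",
        f"{run_dir}/execution/{workflow}",
        f"{run_dir}/execution/{inputs}",
    ]


def build_runner_env(base_dir):
    """Copy the current environment and add the two variables exported above."""
    env = dict(os.environ)
    env["TOIL_WORKDIR"] = base_dir
    env["TOIL_KUBERNETES_HOST_PATH"] = base_dir
    return env


cmd = build_cwl_runner_command("/cephfs_data/toil",
                               "6aef556521e1460e94b0557ce848f49e")
env = build_runner_env("/cephfs_data/toil")
print(shlex.join(cmd))
```

Passing cmd and env to subprocess.run(cmd, env=env, check=True) would launch the run with the variables set explicitly, which sidesteps the question of whether the parent process passes them along.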
