SegAlign: using the GPU to accelerate alignment
-
https://github.com/gsneha26/SegAlign
SegAlign: A Scalable GPU-Based Whole Genome Aligner
1. Build
git clone https://github.com/gsneha26/SegAlign.git
export PROJECT_DIR=$PWD/SegAlign
cd $PROJECT_DIR
// The large server already has CUDA installed, so the -c option can be used to avoid installing CUDA again
./scripts/installUbuntu.sh -c
An error occurred:
CMake Error at CMakeLists.txt:3 (project):
No CMAKE_CUDA_COMPILER could be found.
Tell CMake where to find the compiler by setting either the environment
variable "CUDACXX" or the CMake cache entry CMAKE_CUDA_COMPILER to the full
path to the compiler, or to the compiler name if it is in the PATH.
Add the CUDA-related environment variables:
export CUDA_HOME=/usr/local/cuda
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda/lib64:/usr/local/cuda/extras/CUPTI/lib64
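export PATH=$PATH:$CUDA_HOME/bin
// The system kept reporting munge errors while installing software; the cause was the permissions on /etc
sudo chmod -R u=rwx,g=rx,o=rx /etc
// The log below complains only about the /etc directory itself being group-writable, so the non-recursive `sudo chmod g-w /etc` is a narrower fix that leaves the files under /etc untouched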
● munge.service - MUNGE authentication service
Loaded: loaded (/lib/systemd/system/munge.service; enabled; vendor preset: enabled)
Active: failed (Result: exit-code) since Sun 2021-07-25 17:23:17 CST; 15ms ago
Docs: man:munged(8)
Process: 36517 ExecStart=/usr/sbin/munged (code=exited, status=1/FAILURE)
Jul 25 17:23:17 anneng01 systemd[1]: Starting MUNGE authentication service...
Jul 25 17:23:17 anneng01 munged[36517]: munged: Error: Keyfile is insecure: group-writable permissions without sticky bit set on "/etc"
Jul 25 17:23:17 anneng01 systemd[1]: munge.service: Control process exited, code=exited status=1
Jul 25 17:23:17 anneng01 systemd[1]: munge.service: Failed with result 'exit-code'.
Jul 25 17:23:17 anneng01 systemd[1]: Failed to start MUNGE authentication service.
// Running faToTwoBit kept failing with a segmentation fault
Running gdb /usr/local/bin/faToTwoBit showed that the binary itself was likely damaged:
"/usr/local/bin/faToTwoBit": not in executable format: File truncated
(gdb) q
After reinstalling, it worked normally.
-
A comparison using the bundled files:
// GPU time: a little over 6 minutes
run_segalign ce11.fa cb4.fa --output=ce11.cb4.maf
// CPU time: more than 1 hour
time lastz ce11.fa[multiple] cb4.fa[multiple] --format=maf > ce11.cb4.lastz.maf
-
How it works:
1. Algorithm
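The Smith-Waterman (SW) algorithm is the classic local alignment algorithm. Its complexity is O(Lr * Lq), where Lr and Lq are the lengths of the two sequences being aligned, so it cannot meet the performance requirements of whole genome alignment. The whole genome alignment algorithm LASTZ is based on the heuristic seed-filter-extend approach of BLAST, adapted specifically for whole genome alignment.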
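To make the O(Lr * Lq) cost concrete, here is a minimal Python sketch of the Smith-Waterman score computation. The scoring values (match=2, mismatch=-1, linear gap=-2) and the function name are assumptions chosen for illustration, not the scoring scheme LASTZ or SegAlign actually uses. The nested loops fill one cell per pair of positions, which is where the quadratic cost comes from.

```python
def smith_waterman(ref, query, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between ref and query.

    Assumed scoring: flat match/mismatch values and a linear gap penalty.
    """
    rows, cols = len(ref) + 1, len(query) + 1
    # score[i][j] = best score of a local alignment ending at ref[i-1], query[j-1]
    score = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):          # Lr iterations
        for j in range(1, cols):      # Lq iterations per row
            diag = score[i - 1][j - 1] + (match if ref[i - 1] == query[j - 1] else mismatch)
            up = score[i - 1][j] + gap      # gap in the query
            left = score[i][j - 1] + gap    # gap in the reference
            # A local alignment may restart anywhere, hence the floor at 0.
            score[i][j] = max(0, diag, up, left)
            best = max(best, score[i][j])
    return best


if __name__ == "__main__":
    print(smith_waterman("TTACGTTT", "GGACGTGG"))
```

For two genomes of around 10^8 bases each, the table would need on the order of 10^16 cells, which is why a heuristic is used instead.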
The LASTZ algorithm has the following three stages:

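(1) Seeding
Fragments that match as exactly as possible (k-mers) are used as seeds, and all seeds are stored in a lookup table. Seeds are usually only a dozen or so bases long, so the false-positive rate is high and the hits have to be filtered in the next stage.
(2) Filter
The filter stage accounts for more than 98% of the total runtime. Each seed is extended in both directions while the alignment score is computed, and extension stops once the score drops below a threshold H_x. Extensions that pass the threshold are high-scoring segment pairs (HSPs), which are handed to the next stage for further analysis. An HSP is about 100 bases long.
(3) Dynamic programming is used to extend each HSP to about 1,000 bases.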
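The seeding and filter stages can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the LASTZ or SegAlign implementation: seeds are exact contiguous k-mers, scoring is a flat match/mismatch value, extension stops with an X-drop style rule (give up once the running score falls more than x_drop below the best score seen so far), and the names build_seed_table, find_seed_hits, ungapped_hsp, x_drop and hsp_threshold are all hypothetical.

```python
from collections import defaultdict


def build_seed_table(ref, k):
    """Seeding: map every k-mer of the reference to its start positions."""
    table = defaultdict(list)
    for i in range(len(ref) - k + 1):
        table[ref[i:i + k]].append(i)
    return table


def find_seed_hits(table, query, k):
    """Return (ref_pos, query_pos) pairs whose k-mers match exactly."""
    hits = []
    for j in range(len(query) - k + 1):
        for i in table.get(query[j:j + k], ()):
            hits.append((i, j))
    return hits


def extend_one_direction(ref, query, i, j, step, match, mismatch, x_drop):
    """Extend without gaps from (i, j); step is +1 (right) or -1 (left).

    Returns (best_score, bases_kept) for the best-scoring prefix seen.
    """
    score = best = 0
    length = best_len = 0
    while 0 <= i < len(ref) and 0 <= j < len(query):
        score += match if ref[i] == query[j] else mismatch
        length += 1
        if score > best:
            best, best_len = score, length
        elif best - score > x_drop:   # assumed X-drop stopping rule
            break
        i += step
        j += step
    return best, best_len


def ungapped_hsp(ref, query, i, j, k, match=1, mismatch=-1, x_drop=3, hsp_threshold=8):
    """Filter: extend the seed hit at (i, j) both ways; keep it only if it scores enough.

    Returns (ref_start, query_start, length, score), or None if filtered out.
    """
    right, right_len = extend_one_direction(ref, query, i + k, j + k, +1, match, mismatch, x_drop)
    left, left_len = extend_one_direction(ref, query, i - 1, j - 1, -1, match, mismatch, x_drop)
    total = k * match + left + right
    if total < hsp_threshold:
        return None
    return (i - left_len, j - left_len, left_len + k + right_len, total)


if __name__ == "__main__":
    ref, query, k = "TTTTACGTACGTCCCC", "GGGGACGTACGTAAAA", 4
    table = build_seed_table(ref, k)
    for i, j in find_seed_hits(table, query, k):
        print((i, j), ungapped_hsp(ref, query, i, j, k))
```

Every seed hit is extended independently of the others, which is what makes this stage a natural fit for running many extensions in parallel on a GPU.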
2. Single-node GPU acceleration
Streaming Multiprocessors (SMs)
3. Multi-node acceleration with Spark