Species analysis with BLAST + MEGAN
-
1. Align the reads with BLAST (this step is very slow and needs our big-data solution)
nohup blastn -query barcode05.fasta -db /ceph_disk1/gene_data/MetaDatabase/NCBI_blast_db_FASTA/nt/ntdata/nt -num_threads 20 -out barcode05.m8 -outfmt 6 -evalue 0.001 &
2. Preparation: build the mapping file from sequence IDs to taxids by converting NCBI's mapping file into MEGAN's own format
Set the maximum Java heap size in the MEGAN.vmoptions file:
-Xmx80000M
Otherwise the tool fails with an out-of-memory error.
./make-acc2ncbi -i nucl_gb.accession2taxid.gz
Version MEGAN Ultimate Edition (version 6.21.2, built 14 Mar 2021)
Author(s) Daniel H. Huson
Copyright (C) 2020 Daniel H. Huson. This program comes with ABSOLUTELY NO WARRANTY.
Computing map:
Processing file: nucl_gb.accession2taxid.gz
10% 20% 30% 40% 100% (193.8s)
Building table:
(Bits: 26, buckets: 67,108,864, bucket size: 4)
Sorting map...
10% 20% 30% 40% 50% 60% 70% 80% 90% 100% (210.1s)
Writing table...
10% 20% 100% (38.7s)
(Bucket avg size: 3.9, max size: 19, used: 98%)
(Index size: 536,870,916, data size: 3,603,383,706)
Merging files
10% 100% (42.0s)
Opening file: acc2tax.abin
Size: 259,629,219
Total in: 259,629,220
Total out: 259,629,219
Total time: 497s
Peak memory: 60.2 of 78.1G
The resulting m8 file can be imported directly into MEGAN for analysis. These files are large, though, and analysis on a desktop PC is slow, so we still recommend that users upload their data to our server for analysis.
3. Annotate taxa with MEGAN's blast2lca
/opt/megan-ce/tools/blast2lca -i barcode05.m8.2 -f BlastTab -m BlastN -o barcode05.lca.norank -a2t /opt/megan-ue/tools/ncbi/acc2tax.abin -sr false
Example output (we need to present this format as charts in the UI):
f3d2e450-007a-4c9e-b43b-602e6302df32; ; Eukaryota; 100; Metazoa; 100; Chordata; 100; Mammalia; 100; Primates; 100; Hominidae; 50; Homo; 50; Homo sapiens; 50;
Common errors:
Error: the input file format is rejected
/opt/megan-ce/tools/blast2lca -i barcode05.m8 -f BlastTab -m BlastN -o barcode05.lca -a2t ~/.basta/taxonomy/nucl_gb.accession2taxid
Warning: Might not be a BLAST file in TAB format: barcode05.m8
Error parsing file near line: 12028904: String index out of range: 53
Error parsing file near line: 12028905: String index out of range: 53
Error parsing file near line: 17258173: String index out of range: 53
-
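To chart the blast2lca output from step 3 in the UI, each line first has to be split into the read ID and its (taxon, number) pairs. A minimal Python sketch, assuming the layout of the sample line in step 3: read ID, one empty field, then alternating taxon name and number, all separated by ";". The function names and the cutoff parameter are illustrative and not part of MEGAN; treating the number as a support percentage is also an assumption.

```python
from collections import Counter

# Sample line copied from step 3 of these notes.
SAMPLE = ("f3d2e450-007a-4c9e-b43b-602e6302df32; ; Eukaryota; 100; Metazoa; 100; "
          "Chordata; 100; Mammalia; 100; Primates; 100; Hominidae; 50; Homo; 50; "
          "Homo sapiens; 50;")


def parse_blast2lca_line(line):
    """Return (read_id, [(taxon, number), ...]) for one blast2lca line."""
    fields = [f.strip() for f in line.strip().split(";")]
    # The trailing ';' leaves one empty field at the end; drop it.
    if fields and fields[-1] == "":
        fields.pop()
    read_id = fields[0]
    # fields[1] is empty in the sample, so the pairs start at index 2.
    rest = fields[2:]
    lineage = [(rest[i], float(rest[i + 1])) for i in range(0, len(rest) - 1, 2)]
    return read_id, lineage


def deepest_taxon(lineage, min_percent=50.0):
    """Pick the last taxon whose number is at least min_percent (hypothetical cutoff)."""
    kept = [name for name, pct in lineage if pct >= min_percent]
    return kept[-1] if kept else "unassigned"


def summarize(lines, min_percent=50.0):
    """Count reads per deepest taxon, ready to feed a bar or pie chart."""
    counts = Counter()
    for line in lines:
        if line.strip():
            _, lineage = parse_blast2lca_line(line)
            counts[deepest_taxon(lineage, min_percent)] += 1
    return counts


if __name__ == "__main__":
    print(parse_blast2lca_line(SAMPLE))
    print(summarize([SAMPLE]))
```

Raising min_percent makes the summary more conservative: reads are then counted at a higher level of the lineage.
-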
BLAST can report species names directly, but an environment variable has to be set first.
ad1bd439-6b2b-4178-91d5-8c6f36fd8d6f,N/A,58.4
If the species shows up as N/A, add an environment variable to .bashrc:
export BLASTDB=/ceph_disk1/gene_data/MetaDatabase/NCBI_blast_db_FASTA/nt/ntdata/
nohup blastn -query barcode05.fasta -db /ceph_disk1/gene_data/MetaDatabase/NCBI_blast_db_FASTA/nt/ntdata/nt -num_threads 20 -out barcode05.m8 -outfmt "6 qseqid,sscinames,bitscore" -evalue 0.001 &
-
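The three-column output above (qseqid, sscinames, bitscore) can be reduced to one species per read by keeping the highest-bitscore hit. A rough Python sketch, assuming comma-separated lines like the sample shown above; the function names and the extra demo rows are made up for illustration. Counting the N/A entries is a quick check that BLASTDB was picked up.

```python
from collections import Counter

# First row is the sample from these notes; the other two are invented demo rows.
DEMO_LINES = [
    "ad1bd439-6b2b-4178-91d5-8c6f36fd8d6f,N/A,58.4",
    "read-2,Homo sapiens,120.0",
    "read-2,Pan troglodytes,95.5",
]


def parse_hit(line):
    """Split 'qseqid,sscinames,bitscore' into its three parts."""
    # Split from both ends so a comma inside the species name survives.
    qseqid, rest = line.rstrip("\n").split(",", 1)
    sscinames, bitscore = rest.rsplit(",", 1)
    return qseqid, sscinames, float(bitscore)


def best_species_per_read(lines):
    """Keep the hit with the highest bitscore for every read."""
    best = {}
    for line in lines:
        if not line.strip():
            continue
        qseqid, name, score = parse_hit(line)
        if qseqid not in best or score > best[qseqid][1]:
            best[qseqid] = (name, score)
    return best


def count_species(best):
    """Count reads per species name, including 'N/A'."""
    return Counter(name for name, _ in best.values())


if __name__ == "__main__":
    best = best_species_per_read(DEMO_LINES)
    print(count_species(best))
```
-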
The following tools also implement MEGAN-style LCA algorithms and are worth a close look:
https://github.com/fungs/taxator-tk/
https://github.com/emepyc/Blast2lca (outdated)
https://github.com/etheleon/pymegan
https://etheleon.github.io/articles/pythonMEGAN/
https://www.biostars.org/p/362985/ analysis of blast2lca
https://www.jianshu.com/p/3d3253c59545 2022-03-18 Summarizing MEGAN blast2lca results with Python pandas
-
The program make-acc2ncbi can be used to create a new accession to taxonomy mapping file from files downloaded from ftp://ftp.ncbi.nih.gov/pub/taxonomy/accession2taxid.
This tool is only available in the MEGAN UE (Ultimate Edition) release.
-
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5978398/
BLAST-based validation of metagenomic sequence assignments
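Taxonomic annotation with BLAST
In forensic settings accuracy is critical. This paper uses BLAST as a second validation pass over the results of lower-accuracy software.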

-
NCBI BLAST tabular output header for -outfmt 6, or -m 8 (newer versions no longer have that option):
query_id subject_id pct_identity aln_length n_of_mismatches gap_openings q_start q_end s_start s_end e_value bit_score
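The twelve columns above can be attached to each row of an m8 file as named fields. A small Python sketch, assuming tab-separated rows in the default -outfmt 6 layout; the sample row is invented and the helper names are illustrative. The same check can point at malformed rows, which may help when blast2lca reports errors such as "Error parsing file near line".

```python
M8_COLUMNS = [
    "query_id", "subject_id", "pct_identity", "aln_length",
    "n_of_mismatches", "gap_openings", "q_start", "q_end",
    "s_start", "s_end", "e_value", "bit_score",
]
INT_COLUMNS = ["aln_length", "n_of_mismatches", "gap_openings",
               "q_start", "q_end", "s_start", "s_end"]
FLOAT_COLUMNS = ["pct_identity", "e_value", "bit_score"]

# Invented row, only to show the shape of the data.
SAMPLE_ROW = "read1\tsubject1\t98.5\t200\t3\t0\t1\t200\t1000\t1199\t2e-50\t350"


def parse_m8_line(line):
    """Turn one tab-separated row into a dict keyed by column name."""
    parts = line.rstrip("\n").split("\t")
    if len(parts) != len(M8_COLUMNS):
        raise ValueError(f"expected {len(M8_COLUMNS)} fields, got {len(parts)}")
    row = dict(zip(M8_COLUMNS, parts))
    for key in INT_COLUMNS:
        row[key] = int(row[key])      # raises ValueError on non-numeric text
    for key in FLOAT_COLUMNS:
        row[key] = float(row[key])
    return row


def find_malformed(lines):
    """Return the 1-based line numbers that do not parse as m8 rows."""
    bad = []
    for number, line in enumerate(lines, start=1):
        try:
            parse_m8_line(line)
        except ValueError:
            bad.append(number)
    return bad


if __name__ == "__main__":
    print(parse_m8_line(SAMPLE_ROW))
    print(find_malformed([SAMPLE_ROW, "read2\ttruncated"]))
```

Because find_malformed takes any iterable of lines, an open file handle can be passed in, so a large m8 file is scanned without loading it into memory.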
-
-
Preparation: building the mapping file from sequence IDs to taxids
Download location:
https://ftp.ncbi.nih.gov/pub/taxonomy/accession2taxid/
- prot.accession2taxid.gz
- nucl_gb.accession2taxid.gz
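
For quick spot checks the downloaded file can also be read directly, without converting it for MEGAN. A minimal Python sketch, assuming the NCBI layout of one header line followed by tab-separated columns accession, accession.version, taxid, gi. The demo rows and function names are invented for illustration, and the wanted set is assumed to hold accession.version strings exactly as they appear in column 2.

```python
import gzip

# Invented demo rows in the assumed four-column layout.
DEMO_LINES = [
    "accession\taccession.version\ttaxid\tgi",
    "ACC0001\tACC0001.1\t9606\t0",
    "ACC0002\tACC0002.1\t10090\t0",
]


def taxids_from_lines(lines, wanted):
    """Return {accession.version: taxid} for the accessions in `wanted`."""
    found = {}
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        # Skip the header and short lines: the taxid column must be numeric.
        if len(parts) < 3 or not parts[2].isdigit():
            continue
        if parts[1] in wanted:
            found[parts[1]] = int(parts[2])
    return found


def load_taxids(path, wanted):
    """Stream a .accession2taxid.gz file so it never has to fit in memory."""
    with gzip.open(path, "rt") as handle:
        return taxids_from_lines(handle, wanted)


if __name__ == "__main__":
    print(taxids_from_lines(DEMO_LINES, {"ACC0001.1"}))
```

Streaming keeps memory use proportional to the number of requested accessions rather than to the size of the mapping file.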