Metagenome assembly
https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0169662
This article evaluates metagenome assembly software. Instead of the mock data from the "Critical Assessment of Metagenomic Interpretation" (CAMI), it uses real NGS data.

1. Why assemble?
Read lengths of modern sequencing technologies are increasing as well (S1 Table), making a large depth of phylogenetic and community-based functional analyses already possible by directly examining the unassembled sequencing reads. However, the assembly of overlapping reads into continuous or semi-continuous genome fragments (so-called contigs or scaffolds) allows an even more detailed view of different aspects within a genomic context. This allows the reconstruction of full-length gene sequences (and even better gene clusters), which can be much more reliably assigned to specific functions or taxa compared to partial gene fragments found on unassembled reads. Longer assembled sequences also enable a more sensitive detection of larger complex genomic features such as Clustered Regularly Interspaced Short Palindromic Repeats (CRISPR), polyketide synthase (PKS) or non-ribosomal peptide synthase (NRPS) gene clusters encoding for secondary metabolites.

2. Why bin?
In addition, the broader genomic context of interesting features may be further elucidated by sorting (or “binning”) partially assembled genome fragments into categories (so-called “bins”). The aim of this approach is to separate fragments that likely originate from different species while grouping those together that likely belong to the same species, leading to partial or even complete reconstruction of genomes from metagenomic datasets.

The article's result: SPAdes performed comparatively well.
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4053804/
The article also mentions a pipeline, MetAMOS.


https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5502489/
Assembling metagenomes, one community at a time
This article recommends a selection process (not reproduced in these notes).
https://blog.csdn.net/weixin_39633455/article/details/116951189
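Tool 1: cutadapt
input=test.fq.gz
mkdir -p cutadapt
cutadapt_input=$input
cutadapt_out=cutadapt/trimed.fastq.gz
interleaved=--interleaved   # input and output hold interleaved paired-end reads
# -a/-A: 3' adapter for read 1/read 2; -q 30: quality-trim 3' ends; -m 20: drop reads shorter than 20 nt
# --trim-n: remove flanking N bases; -O 10: require at least 10 nt of adapter overlap
cutadapt $interleaved -a AGATCGGAAGAGC -A AGATCGGAAGAGC -q 30 -m 20 --trim-n -O 10 -o "$cutadapt_out" "$cutadapt_input"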
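To make the `-q 30 -m 20` options less opaque, here is a simplified Python sketch of 3' quality trimming using the running-sum approach described in the cutadapt documentation. It is an illustration only, not cutadapt's actual code: the function names are made up for this note, and adapter removal and `--trim-n` are not modeled.

```python
def quality_trim_index(qualities, cutoff):
    """Return the index at which to cut the 3' end of a read.

    qualities: list of Phred scores (ints), one per base.
    cutoff: quality threshold (the value given to -q).
    """
    running = 0
    best = 0
    cut = len(qualities)  # default: keep the whole read
    # Walk from the 3' end, summing (cutoff - quality) for each base.
    for i in range(len(qualities) - 1, -1, -1):
        running += cutoff - qualities[i]
        if running < 0:
            break  # good bases now outweigh the bad tail; stop looking
        if running > best:
            best = running
            cut = i
    return cut


def trim_read(seq, qualities, cutoff=30, min_len=20):
    """Quality-trim the 3' end, then discard the read if it is too short (-m)."""
    trimmed = seq[:quality_trim_index(qualities, cutoff)]
    return trimmed if len(trimmed) >= min_len else None


# Worked example: with a cutoff of 10 the low-quality tail (8, 7, 11, 4, 2, 3) is removed.
example_quals = [42, 40, 26, 27, 8, 7, 11, 4, 2, 3]
print(trim_read("ACGTACGTAC", example_quals, cutoff=10, min_len=4))  # ACGT
```

Note that the base with quality 11 is trimmed even though it is above the cutoff, because the bases around it are poor; that is the point of summing instead of cutting at the first bad base.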
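Tool 2: megahit
input_fa=$cutadapt_out   # interleaved FASTQ from cutadapt (despite the variable name)
assembly_out=assembly_out   # output directory; megahit creates it
# --12: interleaved paired-end input; --min-contig-len 300: discard contigs shorter than 300 bp
megahit --12 "$input_fa" --k-max 149 --max-tip-len 200 --min-contig-len 300 -o "$assembly_out"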
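After assembly it is common to summarize the contigs before moving on. The sketch below is not part of the original pipeline; it computes the contig count, total length and N50 from FASTA lines in plain Python. The function names and the demo records are made up for illustration.

```python
def read_fasta_lengths(lines):
    """Yield the length of each sequence in an iterable of FASTA lines."""
    length = None
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if length is not None:
                yield length
            length = 0
        elif line and length is not None:
            length += len(line)
    if length is not None:
        yield length


def n50(lengths):
    """Return N50: the contig length at which the running total, taken
    from the longest contig downward, first reaches half of all bases."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    return 0  # empty assembly


# Made-up records standing in for assembly_out/final.contigs.fa
demo = [">contig_1", "ACGTAC", ">contig_2", "ACGT", "A",
        ">contig_3", "ACG", ">contig_4", "ACGT", ">contig_5", "AC"]
lengths = list(read_fasta_lengths(demo))
print(len(lengths), sum(lengths), n50(lengths))  # 5 20 5
```

For a real run, the same functions can be fed an open file handle, for example `open("assembly_out/final.contigs.fa")`, in place of the demo list.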
Tool 3: MetaGeneMark
mkdir -p predict_gene
input_dir=assembly_out
predict_gene_out=predict_gene
model_file=../MetaGeneMark_linux_64/mgm/MetaGeneMark_v1.mod
cp ../MetaGeneMark_linux_64/gm_key ~/.gm_key   # license key, read by gmhmmp from the home directory
# -f G: GFF output; -A / -D: write predicted protein / nucleotide sequences
gmhmmp -d -f G -m "$model_file" -o "$predict_gene_out/out.gff" -A "$predict_gene_out/final.prot.fa" -D "$predict_gene_out/final.nucl.fa" "$input_dir/final.contigs.fa"
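Tool 4: cd-hit
mkdir -p unigene_set
python filter_predict_nucl.py $predict_gene_out/final.nucl.fa $predict_gene_out/filter_final.nucl.fa   # custom script (not shown in the source)
# The input is nucleotide sequence, so cd-hit-est (the nucleotide variant of cd-hit) is used here.
# -c 0.95: identity threshold; -aS 0.9: alignment must cover 90% of the shorter sequence; -T 0: use all cores
cd-hit-est -i $predict_gene_out/filter_final.nucl.fa -o unigene_set/unigene.fa -c 0.95 -aS 0.9 -d 0 -M 10000 -T 0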
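The source does not show `filter_predict_nucl.py`, so what it filters on is unknown. As a purely hypothetical stand-in, the sketch below assumes it drops predicted genes shorter than a minimum length; the 100 nt cutoff and all function names are assumptions made for this note, not taken from the original script.

```python
import sys

MIN_LEN = 100  # hypothetical cutoff in nucleotides; NOT taken from the source


def parse_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line, []
        elif header is not None:
            chunks.append(line.strip())
    if header is not None:
        yield header, "".join(chunks)


def filter_by_length(records, min_len=MIN_LEN):
    """Keep only records whose sequence has at least min_len bases."""
    return [(h, s) for h, s in records if len(s) >= min_len]


def main(in_path, out_path):
    """Read a FASTA file, filter it, and write one sequence per line."""
    with open(in_path) as src, open(out_path, "w") as dst:
        for header, seq in filter_by_length(parse_fasta(src)):
            dst.write(f"{header}\n{seq}\n")


# Same call shape as in the pipeline: python filter_predict_nucl.py IN.fa OUT.fa
if __name__ == "__main__" and len(sys.argv) == 3:
    main(sys.argv[1], sys.argv[2])
```

If the real script applies other criteria, only `filter_by_length` would need to change.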
Tool 5: diamond
mkdir -p function_anno
# Database files must be downloaded separately
database_eggNOG=.../metagenomics/function/database/e5.proteomes
diamond_eggNOG=function_anno/unigene.e5
database_CARD=.../metagenomics/function/database/CARD/CARD.protein
diamond_CARD=function_anno/unigene.CARD
database_CAZy=.../metagenomics/function/database/CAZy/CAZyDB.07202017
diamond_CAZy=function_anno/unigene.CAZyDB
database_PHI=.../metagenomics/function/database/PHI/phi-base_current
diamond_PHI=function_anno/unigene.phi
diamond blastx -d $database_eggNOG -q unigene_set/unigene.fa -o $diamond_eggNOG --evalue 0.00001
diamond blastx -d $database_CARD -q unigene_set/unigene.fa -o $diamond_CARD --evalue 0.00001
diamond blastx -d $database_CAZy -q unigene_set/unigene.fa -o $diamond_CAZy --evalue 0.00001
diamond blastx -d $database_PHI -q unigene_set/unigene.fa -o $diamond_PHI --evalue 0.00001
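DIAMOND's default output is BLAST tabular format (12 tab-separated columns), and a query can have several hits. A typical next step, not shown in the source, is to keep one hit per unigene. The sketch below picks the hit with the highest bit score; the function name and the demo rows are made up for illustration.

```python
from io import StringIO

# Column order of the default BLAST tabular output
COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
           "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def best_hits(handle):
    """Return {query: row dict}, keeping the highest-bitscore hit per query."""
    best = {}
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < len(COLUMNS):
            continue  # skip blank or malformed lines
        row = dict(zip(COLUMNS, fields))
        current = best.get(row["qseqid"])
        if current is None or float(row["bitscore"]) > float(current["bitscore"]):
            best[row["qseqid"]] = row
    return best


# Made-up rows standing in for a file such as function_anno/unigene.CARD
demo = StringIO(
    "gene_1\tprotA\t90.0\t100\t10\t0\t1\t300\t1\t100\t1e-50\t180.0\n"
    "gene_1\tprotB\t95.0\t100\t5\t0\t1\t300\t1\t100\t1e-60\t210.5\n"
    "gene_2\tprotC\t80.0\t50\t10\t0\t1\t150\t1\t50\t1e-10\t60.2\n"
)
hits = best_hits(demo)
for query, row in hits.items():
    print(query, row["sseqid"], row["bitscore"])
```

For a real run, pass an open file handle for one of the DIAMOND output files in place of the `StringIO` demo.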







