Metagenome assembly

anneng

https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0169662
This article evaluates metagenome assembly software. It does not use the mock data from the "Critical Assessment of Metagenomic Interpretation" (CAMI); it uses real NGS data.

1. Why assemble?
Read lengths of modern sequencing technologies are increasing as well (S1 Table), making a large depth of phylogenetic and community-based functional analyses already possible by directly examining the unassembled sequencing reads. However, the assembly of overlapping reads into continuous or semi-continuous genome fragments–so called contigs or scaffolds—allows an even more detailed view of different aspects within a genomic context. This allows the reconstruction of full-length gene sequences (and even better gene clusters), which can be much more reliably assigned to specific functions or taxa compared to partial gene fragments found on unassembled reads. Longer assembled sequences also enable a more sensitive detection of larger complex genomic features such as Clustered Regularly Interspaced Short Palindromic Repeats (CRISPR), polyketide synthase (PKS) or non-ribosomal peptide synthase (NRPS) gene clusters encoding for secondary metabolites.

2. Why bin?
In addition, the broader genomic context of interesting features may be further elucidated by sorting (or “binning”) partially assembled genome fragments into categories (so-called “bins”). The aim of this approach is to separate fragments that likely originate from different species while grouping those together that likely belong to the same species, leading to partial or even complete reconstruction of genomes from metagenomic datasets.
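
The passage above does not say how fragments are assigned to bins. As a rough illustration only: many binning tools use sequence composition as one signal, on the assumption that contigs from the same genome have similar k-mer usage. The sketch below computes a normalized tetranucleotide profile for a contig and compares two profiles by Euclidean distance. The names `kmer_profile` and `euclidean` and the example sequences are hypothetical and are not taken from the article.

```python
from itertools import product


def kmer_profile(sequence, k=4):
    """Return normalized k-mer frequencies in a fixed (lexicographic) order."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    seq = sequence.upper()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if kmer in counts:          # windows containing N or other symbols are skipped
            counts[kmer] += 1
    total = sum(counts.values())
    return [counts[km] / total if total else 0.0 for km in kmers]


def euclidean(a, b):
    """Euclidean distance between two equal-length profiles."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


# Made-up contigs for illustration: two AT-rich fragments and one GC-rich fragment.
contig_a = "ATATTTAAATATTAAATTTATATAAATTT"
contig_b = "TTTAAATATATTTAATATAAATTTATATA"
contig_c = "GCGCCGGCGCGGCCGCGCCGGGCGCGCCG"

pa, pb, pc = (kmer_profile(s) for s in (contig_a, contig_b, contig_c))
print(euclidean(pa, pb), euclidean(pa, pc))
```

Real binners typically combine composition with per-sample coverage and apply a clustering step on top; this sketch covers only the composition feature.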

The article's result: SPAdes performed comparatively well.

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4053804/
The article also mentions a pipeline, MetAMOS.

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5502489/
Assembling metagenomes, one community at a time
This article recommends a process for selecting an assembler.

https://www.lcsciences.com/documents/sample_data/metagenomics/Metagenomics_html_report_DEMO.html

Sample metagenomics analysis report

https://github.com/metagenome-atlas/atlas

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8611953/

https://blog.csdn.net/weixin_39633455/article/details/116951189
Tool 1: cutadapt

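# Interleaved paired-end reads (R1 and R2 alternate in a single file).
input=test.fq.gz
mkdir -p cutadapt
cutadapt_input=$input
cutadapt_out=cutadapt/trimed.fastq.gz
interleaved=--interleaved
# Trim the adapter from both mates (-a/-A, at least 10 nt overlap), quality-trim 3' ends at Q30,
# trim flanking N bases, and discard reads shorter than 20 nt.
cutadapt $interleaved -a AGATCGGAAGAGC -A AGATCGGAAGAGC -q 30 -m 20 --trim-n -O 10 -o "$cutadapt_out" "$cutadapt_input"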
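
To make the `-q 30` option above more concrete, here is a simplified sketch of 3' quality trimming as the cutadapt documentation describes it (a BWA-style running sum). It is an illustration written for these notes, not cutadapt's actual code; the function names and the example qualities are assumptions.

```python
def quality_trim_3prime(qualities, cutoff):
    """Return how many bases to keep after trimming low-quality bases from the 3' end.

    Walk from the 3' end, accumulate (cutoff - quality), and cut where that
    running sum reaches its maximum. Scanning stops once the sum turns
    negative, i.e. once good bases outweigh the low-quality tail.
    """
    running = 0
    best = 0
    keep = len(qualities)
    for i in range(len(qualities) - 1, -1, -1):
        running += cutoff - qualities[i]
        if running < 0:
            break
        if running > best:
            best = running
            keep = i
    return keep


def decode_phred33(quality_string):
    """Convert a FASTQ quality string (Phred+33 encoding) to integer scores."""
    return [ord(ch) - 33 for ch in quality_string]


# Hypothetical read: quality drops sharply after the fourth base.
example = [42, 40, 26, 27, 8, 7, 11, 4, 2, 3]
print(quality_trim_3prime(example, 10))
```

With a cutoff of 10, the example keeps the first four bases. A single acceptable base inside a poor tail (the 11 here) does not stop the trimming, which is the reason for using a running sum instead of cutting at the first base below the cutoff.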

Tool 2: megahit

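# Despite the variable name, the input is the trimmed, interleaved FASTQ written by cutadapt.
input_fa=$cutadapt_out
assembly_out=assembly_out
# --12 marks the input as interleaved paired-end; contigs shorter than 300 bp are discarded.
megahit --12 "$input_fa" --k-max 149 --max-tip-len 200 --min-contig-len 300 -o "$assembly_out"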
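
After assembly it is common to summarize contiguity before moving on. The sketch below computes contig lengths and N50 from FASTA lines. It is a minimal illustration: the function names and demo records are made up, and in practice the lines would come from `assembly_out/final.contigs.fa`.

```python
def contig_lengths(fasta_lines):
    """Return the length of every record in FASTA-formatted lines."""
    lengths = []
    for line in fasta_lines:
        line = line.strip()
        if line.startswith(">"):
            lengths.append(0)           # a header starts a new contig
        elif line and lengths:
            lengths[-1] += len(line)    # sequence may span several lines
    return lengths


def n50(lengths):
    """Length L such that contigs of length >= L contain at least half of all bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0  # no contigs


# Made-up records for illustration only.
demo = [">contig_1", "ACGTACGTAC", ">contig_2", "ACGT", "AC", ">contig_3", "ACG"]
lengths = contig_lengths(demo)
print(lengths, n50(lengths))
```

To run it on a real assembly, pass an open file handle instead of the demo list, for example `contig_lengths(open("assembly_out/final.contigs.fa"))`.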

Tool 3: MetaGeneMark

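mkdir -p predict_gene
# Directory written by megahit in the previous step.
input_dir=assembly_out
predict_gene_out=predict_gene
model_file=../MetaGeneMark_linux_64/mgm/MetaGeneMark_v1.mod
# Install the license key where gmhmmp looks for it.
cp ../MetaGeneMark_linux_64/gm_key ~/.gm_key
# Predict genes on the contigs: GFF output (-f G) plus protein (-A) and nucleotide (-D) FASTA files.
gmhmmp -d -f G -m "$model_file" -o "$predict_gene_out/out.gff" -A "$predict_gene_out/final.prot.fa" -D "$predict_gene_out/final.nucl.fa" "$input_dir/final.contigs.fa"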

Tool 4: cd-hit

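mkdir -p unigene_set
# filter_predict_nucl.py is a custom script (not shown here).
python filter_predict_nucl.py "$predict_gene_out/final.nucl.fa" "$predict_gene_out/filter_final.nucl.fa"
# The input is nucleotide FASTA, so use cd-hit-est (the nucleotide program in the CD-HIT suite);
# plain cd-hit is meant for protein sequences.
cd-hit-est -i "$predict_gene_out/filter_final.nucl.fa" -o unigene_set/unigene.fa -c 0.95 -aS 0.9 -d 0 -M 10000 -T 0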
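
The contents of `filter_predict_nucl.py` are not given in these notes. Purely as a guess at what such a filter might do, the sketch below drops predicted genes shorter than a minimum length before clustering. The 100 nt threshold, the function names, and the assumption that length is the only criterion are all hypothetical.

```python
import sys


def parse_fasta(lines):
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line.strip())   # blank lines contribute nothing
    if header is not None:
        yield header, "".join(chunks)


def filter_by_length(records, min_len=100):
    """Keep records with at least min_len bases (assumed filtering criterion)."""
    return [(h, s) for h, s in records if len(s) >= min_len]


def main(in_path, out_path, min_len=100):
    """Read a FASTA file, filter it, and write the surviving records."""
    with open(in_path) as src, open(out_path, "w") as dst:
        for header, seq in filter_by_length(parse_fasta(src), min_len):
            dst.write(f">{header}\n{seq}\n")


# Same calling convention as in the notes: <input.fa> <output.fa>
if __name__ == "__main__" and len(sys.argv) == 3:
    main(sys.argv[1], sys.argv[2])
```

The original script may apply other criteria as well, such as removing incomplete genes, so treat this only as a placeholder with the same two-argument interface.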

Tool 5: diamond

mkdir -p function_anno
# Database files must be downloaded separately.
database_eggNOG=.../metagenomics/function/database/e5.proteomes
diamond_eggNOG=function_anno/unigene.e5
database_CARD=.../metagenomics/function/database/CARD/CARD.protein
diamond_CARD=function_anno/unigene.CARD
database_CAZy=.../metagenomics/function/database/CAZy/CAZyDB.07202017
diamond_CAZy=function_anno/unigene.CAZyDB
database_PHI=.../metagenomics/function/database/PHI/phi-base_current
diamond_PHI=function_anno/unigene.phi
# blastx translates each nucleotide unigene and searches it against a protein database (E-value cutoff 1e-5).
diamond blastx -d "$database_eggNOG" -q unigene_set/unigene.fa -o "$diamond_eggNOG" --evalue 0.00001
diamond blastx -d "$database_CARD" -q unigene_set/unigene.fa -o "$diamond_CARD" --evalue 0.00001
diamond blastx -d "$database_CAZy" -q unigene_set/unigene.fa -o "$diamond_CAZy" --evalue 0.00001
diamond blastx -d "$database_PHI" -q unigene_set/unigene.fa -o "$diamond_PHI" --evalue 0.00001
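
Each diamond run above can report several hits per unigene, so a typical next step is to keep one hit per query. The sketch below assumes the output is in the 12-column BLAST tabular layout (which should be the case when no `--outfmt` is given) and keeps the highest-bitscore hit for each query. The function name and demo rows are made up for illustration.

```python
import csv
import io


def best_hits(rows):
    """Keep the highest-bitscore hit for each query.

    Assumed column order:
    qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore
    """
    best = {}
    for row in rows:
        if len(row) < 12:
            continue  # skip blank or malformed lines
        query, subject = row[0], row[1]
        evalue, bitscore = float(row[10]), float(row[11])
        if query not in best or bitscore > best[query][2]:
            best[query] = (subject, evalue, bitscore)
    return best


# Made-up rows; real input would be a file such as function_anno/unigene.CARD.
demo = (
    "gene_1\tARO_A\t98.0\t120\t2\t0\t1\t360\t1\t120\t1e-60\t210.0\n"
    "gene_1\tARO_B\t75.0\t118\t29\t1\t4\t357\t3\t120\t1e-30\t110.0\n"
    "gene_2\tARO_C\t60.0\t90\t36\t0\t1\t270\t5\t94\t1e-12\t62.5\n"
)
hits = best_hits(csv.reader(io.StringIO(demo), delimiter="\t"))
for query, (subject, evalue, bitscore) in hits.items():
    print(query, subject, evalue, bitscore)
```

For a real file, replace the `io.StringIO(demo)` argument with an open file handle.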

https://teaching.healthtech.dtu.dk/22126/index.php/Metagenomic_assembly_exercise

biomolecules-11-00530-v2.pdf

https://link.springer.com/article/10.1186/s12864-019-6289-6

https://www.mdpi.com/2076-2607/8/5/669/htm#fig_body_display_microorganisms-08-00669-f001

https://bioinformaticsworkbook.org/dataAnalysis/Metagenomics/MetagenomicsP1.html#gsc.tab=0

https://blogs.iu.edu/ncgas/2021/01/26/scaffold-length-histograms/