使用KrakenUniq进行病原分析
-
1.软件安装
conda create -n tax_classifier //后续的物种分类软件都安装到这个目录下面
conda activate tax_classifier
conda install krakenuniq
2.构建数据库
krakenuniq兼容kraken数据库 不兼容kraken2数据库
krakenuniq-download --db DBDIR --dust microbial-nt//如果只涉及病原的话 建议只做微生物相关的数据库 -R表示用rsync下载 建议使用这种方式
krakenuniq-download --db . --taxa "archaea,bacteria,fungi,parasitic_worms,protozoa,viral" --dust --exclude-environmental-taxa microbial-nt -R//构建数据库
krakenuniq-build --db DBDIR --kmer-len 31 --threads 10 --taxids-for-genomes --taxids-for-sequences//序列分析
krakenuniq --report-file res_archaea.tsv --db archaea_db --threads 10 test_archaea.fna//注意 krakenuniq依赖 jellyfish 1
Found jellyfish v2.3.0
Kraken requires jellyfish version 1 -
/home/bioinfo/miniconda2/envs/tax_classifier/libexec/build_db.sh: line 46: 38880 Killed jellyfish count -m 31 -s 166082697099 -C -t 10 -o database /dev/fd/63
Command exited with non-zero status 137目录中有一个nt.fna为370GB多 因此把library-files.txt删掉 重新build 试试 build 命令会自动找library目录下面的文件
如果library中的数据比较多 krakenuniq非常耗费内存 中间在jellyfish的步骤会被kill掉 因此需要一个一个的处理library中的数据 ,例如先处理细菌的 再处理病毒的 正常的打印如下:
krakenuniq-build --db . --kmer-len 31 --threads 15 --taxids-for-genomes --taxids-for-sequences --jellyfish-hash-size 5000M
Found jellyfish v1.1.11
Kraken build set to minimize disk writes.
Found 2 sequence files (*.{fna,fa,ffn,fasta,fsa}) in the library directory.
Creating k-mer set (step 1 of 6)...
Using jellyfish
K-mer set created. [2m30.142s]
Skipping step 2, no database reduction requested.
Sorting k-mer set (step 3 of 6)...
db_sort: Getting database into memory ...Loaded database with 801475581 keys with k of 31 [val_len 4, key_len 8].
Loaded database with 801475581 keys with k of 31 [val_len 4, key_len 8].
db_sort: Sorting ...db_sort: Sorting complete - writing database to disk ...
K-mer set sorted. [8m26.865s]
Creating seqID to taxID map (step 4 of 6)..
18949 sequences mapped to taxa. [0.023s]
Creating taxDB (step 5 of 6)...
Building taxonomy index from taxonomy//nodes.dmp and taxonomy//names.dmp. Done, got 2312242 taxa
taxDB construction finished. [11.991s]
Building KrakenUniq LCA database (step 6 of 6)...
Adding taxonomy IDs for sequences
Adding taxonomy IDs for genomes
Reading taxonomy index from taxDB. Done.
Getting database0.kdb into memory (8.957 GB) ... Done
Loaded database with 801475581 keys with k of 31 [val_len 4, key_len 8].
Reading sequence ID to taxonomy ID mapping ... [starting new taxonomy IDs with 1000000001] got 18949 mappings.
Finished processing 37898 sequences (skipping 0 empty sequences, and 0 sequences with no taxonomy mapping)
Writing kmer counts to database.kdb.counts...
Writing database from RAM back to database.kdb ...
Writing new TaxDB ...
LCA database created. [5m10.060s]
Creating database summary report database.report.tsv ...
/home/bioinfo/miniconda2/envs/tax_classifier/libexec/classify -d ././database.kdb -i ././database.idx -t 15 -r database.report.tsv -a ././taxDB -p 12 /dev/fd/63
Database ././database.kdb
Loaded database with 801475581 keys with k of 31 [val_len 4, key_len 8].
Reading taxonomy index from ././taxDB. Done.
37898 sequences (2276.87 Mbp) processed in 42.084s (54.0 Kseq/m, 3246.14 Mbp/m).
37869 sequences classified (99.92%)
29 sequences unclassified (0.08%)
Writing report file to database.report.tsv ..
Reading genome sizes from ././database.kdb.counts ... done
Setting values in the taxonomy tree ... done
Printing classification report ... done
Report finished in 0.247 seconds.
Finishing up ...Database construction complete. [Total: 17m6.972s]
You can delete all files but database.{kdb,idx} and taxDB now, if you want -
krakenuniq 在从nt库提取微生物库(或者任何一个自库)的时候 完全基于taxnomy 和nt 文件来解析 而我们把这些数据导入数据库后 一条sql就完成了
-
krakenuniq --db HOST --db PROK --db EUK_DRAFT
Hierarchical read classification with multiple databases
krakenuniq 支持多个数据库的比对 利用这种方式可以分别单独构建数据库 -
add genomes to existing database
kraken类的软件 都不支持给已经编译好的数据库加新序列 需要重建 这个对比较大的序列源的情况几乎很难实现 需要大量的内存
https://github.com/DerrickWood/kraken2/issues/45
I would delete the *.k2d files and the seqid2taxid.map file and use this:
kraken2-build --add-to-library $NEW_FILE --db .
and then rebuild. You cannot add in new genomes once the database is built but you can just add more to the library before building.There is no current way to combine builds due to the way that the database is stored in memory.
-
总结:经过技术验证,要顺利使用krakenuniq,建议分别构建物种库,例如细菌,病毒,真菌,古菌等,然后在比对的时候同时使用这些数据库进行比对。
-
-
jellyfish的hash太大 还是容易被kill 可以将其设置小一点 这样不会影响后续Kraken的分析精度
If you encounter problems with Jellyfish not being able to allocate enough memory on your system to run the build process, you can supply a smaller hash size to Jellyfish using kraken-build's --jellyfish-hash-size switch. Each space in the hash table uses approximately 6.9 bytes, so using "--jellyfish-hash-size 6400M" will use a hash table size of 6.4 billion spaces and require 44.3 GB of RAM.https://github.com/DerrickWood/kraken/issues/65
http://gensoft.pasteur.fr/docs/kraken/0.10.5-beta/MANUAL.html
http://gensoft.pasteur.fr/docs/kraken/0.10.5-beta/MANUAL.html -
@anneng 在 使用KrakenUniq进行病原分析 中说:
krakenuniq --report-file res_archaea.tsv --db archaea_db --threads 10 test_archaea.fna
krakenuniq构建数据库参数说明
构建过程中需要6步,可以重复运行,每次运行会检查之前的文件确定是否需要运行当前步骤
不指定--jellyfish-hash-size会使用全部内存,内存不够会报错,建议根据服务器内存大小指定该参数Usage: krakenuniq-build [task option] [options] Task options (exactly one can be selected -- default is build): --download-taxonomy Download NCBI taxonomic information --download-library TYPE Download partial library (TYPE = one of "refseq/bacteria", "refseq/archaea", "refseq/viral"). Use krakenuniq-download for more options. --add-to-library FILE Add FILE to library --build Create DB from library (requires taxonomy d/l'ed and at least one file in library) --rebuild Create DB from library like --build, but remove existing non-library/taxonomy files before build --clean Remove unneeded files from a built database --shrink NEW_CT Shrink an existing DB to have only NEW_CT k-mers --standard Download and create default database, which contains complete genomes for archaea, bacteria and viruses from RefSeq, as well as viral strains from NCBI. Specify --taxids-for-genomes and --taxids-for-sequences separately, if desired. --help Print this message --version Print version information Options: --db DBDIR Kraken DB directory (mandatory except for --help/--version) --threads # Number of threads (def: 1) --new-db NAME New Kraken DB name (shrink task only; mandatory for shrink task) --kmer-len NUM K-mer length in bp (build/shrink tasks only; def: 31) --minimizer-len NUM Minimizer length in bp (build/shrink tasks only; def: 15) --jellyfish-hash-size STR Pass a specific hash size argument to jellyfish when building database (build task only) --jellyfish-bin STR Use STR as Jellyfish 1 binary. --max-db-size SIZE Shrink the DB before full build, making sure database and index together use <= SIZE gigabytes (build task only) --shrink-block-offset NUM When shrinking, select the k-mer that is NUM positions from the end of a block of k-mers (default: 1) --work-on-disk Perform most operations on disk rather than in RAM (will slow down build in most cases) --taxids-for-genomes Add taxonomy IDs (starting with 1 billion) for genomes. Only works with 3-column seqid2taxid map with third column being the name --taxids-for-sequences Add taxonomy IDs for sequences, starting with 1 billion. Can be useful to resolve classifications with multiple genomes for one taxonomy ID. --min-contig-size NUM Minimum contig size for inclusion in database. Use with draft genomes to reduce contamination, e.g. with values between 1000 and 10000. --library-dir DIR Use DIR for reference sequences instead of DBDIR/library. --taxonomy-dir DIR Use DIR for taxonomy instead of DBDIR/taxonomy. Experimental: --uid-database Build a UID database (default no) --lca-database Build a LCA database (default yes) --no-lca-database Do not build a LCA database --lca-order DIR1 Impose a hierarchical order for setting LCAs. --lca-order DIR2 The directories must be specified relative to the libary directory ... (DBDIR/library). When setting the LCAs, k-mers from sequences in DIR1 will be set first, and only unset k-mers will be set from DIR2, etc, and final from the whole library. Use this option when including low-confidence draft genomes, e.g use --lca-order Complete_Genome --lca-order Chromosome to prioritize more complete assemblies. Keep in mind that this option takes considerably longer.使用krakenuniq分析数据命令参数说明
Usage: krakenuniq --report-file FILENAME [options] <filename(s)> Options: --db NAME Name for Kraken DB (default: none) --threads NUM Number of threads (default: 1) --fasta-input Input is FASTA format --fastq-input Input is FASTQ format --gzip-compressed Input is gzip compressed --bzip2-compressed Input is bzip2 compressed --hll-precision INT Precision for HyperLogLog k-mer cardinality estimation, between 10 and 18 (default: 12) --exact Compute exact cardinality instead of estimate (slower, requires memory proportional to cardinality!) --quick Quick operation (use first hit or hits) --min-hits NUM In quick op., number of hits req'd for classification NOTE: this is ignored if --quick is not specified --unclassified-out FILENAME Print unclassified sequences to filename --classified-out FILENAME Print classified sequences to filename --output FILENAME Print output to filename (default: stdout); "off" will suppress normal output --only-classified-output Print no Kraken output for unclassified sequences --preload Loads DB into memory before classification --paired The two filenames provided are paired-end reads --check-names Ensure each pair of reads have names that agree with each other; ignored if --paired is not specified --help Print this message --version Print version information Experimental: --uid-mapping Map using UID database If none of the *-input or *-compressed flags are specified, and the file is a regular file, automatic format detection is attempted.