使用KrakenUniq进行病原分析

anneng

1.软件安装
conda create -n tax_classifier　//后续的物种分类软件都安装到这个目录下面
conda activate tax_classifier
conda install krakenuniq
2.构建数据库
krakenuniq兼容kraken数据库　不兼容kraken2数据库
krakenuniq-download --db DBDIR --dust microbial-nt

//如果只涉及病原的话建议只做微生物相关的数据库 -R表示用rsync下载建议使用这种方式
krakenuniq-download --db . --taxa "archaea,bacteria,fungi,parasitic_worms,protozoa,viral" --dust --exclude-environmental-taxa microbial-nt -R

//构建数据库
krakenuniq-build --db DBDIR --kmer-len 31 --threads 10 --taxids-for-genomes --taxids-for-sequences

//序列分析
krakenuniq --report-file res_archaea.tsv --db archaea_db --threads 10 test_archaea.fna

//注意 krakenuniq依赖 jellyfish 1
Found jellyfish v2.3.0
Kraken requires jellyfish version 1

anneng

/home/bioinfo/miniconda2/envs/tax_classifier/libexec/build_db.sh: line 46: 38880 Killed jellyfish count -m 31 -s 166082697099 -C -t 10 -o database /dev/fd/63
Command exited with non-zero status 137

目录中有一个nt.fna为370GB多因此把library-files.txt删掉重新build 试试 build 命令会自动找library目录下面的文件

如果library中的数据比较多 krakenuniq非常耗费内存中间在jellyfish的步骤会被kill掉因此需要一个一个的处理library中的数据，例如先处理细菌的再处理病毒的正常的打印如下：

krakenuniq-build --db . --kmer-len 31 --threads 15 --taxids-for-genomes --taxids-for-sequences --jellyfish-hash-size 5000M
Found jellyfish v1.1.11
Kraken build set to minimize disk writes.
Found 2 sequence files (*.{fna,fa,ffn,fasta,fsa}) in the library directory.
Creating k-mer set (step 1 of 6)...
Using jellyfish
K-mer set created. [2m30.142s]
Skipping step 2, no database reduction requested.
Sorting k-mer set (step 3 of 6)...
db_sort: Getting database into memory ...Loaded database with 801475581 keys with k of 31 [val_len 4, key_len 8].
Loaded database with 801475581 keys with k of 31 [val_len 4, key_len 8].
db_sort: Sorting ...db_sort: Sorting complete - writing database to disk ...
K-mer set sorted. [8m26.865s]
Creating seqID to taxID map (step 4 of 6)..
18949 sequences mapped to taxa. [0.023s]
Creating taxDB (step 5 of 6)...
Building taxonomy index from taxonomy//nodes.dmp and taxonomy//names.dmp. Done, got 2312242 taxa
taxDB construction finished. [11.991s]
Building KrakenUniq LCA database (step 6 of 6)...
Adding taxonomy IDs for sequences
Adding taxonomy IDs for genomes
Reading taxonomy index from taxDB. Done.
Getting database0.kdb into memory (8.957 GB) ... Done
Loaded database with 801475581 keys with k of 31 [val_len 4, key_len 8].
Reading sequence ID to taxonomy ID mapping ... [starting new taxonomy IDs with 1000000001] got 18949 mappings.
Finished processing 37898 sequences (skipping 0 empty sequences, and 0 sequences with no taxonomy mapping)
Writing kmer counts to database.kdb.counts...
Writing database from RAM back to database.kdb ...
Writing new TaxDB ...
LCA database created. [5m10.060s]
Creating database summary report database.report.tsv ...
/home/bioinfo/miniconda2/envs/tax_classifier/libexec/classify -d ././database.kdb -i ././database.idx -t 15 -r database.report.tsv -a ././taxDB -p 12 /dev/fd/63
Database ././database.kdb
Loaded database with 801475581 keys with k of 31 [val_len 4, key_len 8].
Reading taxonomy index from ././taxDB. Done.
37898 sequences (2276.87 Mbp) processed in 42.084s (54.0 Kseq/m, 3246.14 Mbp/m).
37869 sequences classified (99.92%)
29 sequences unclassified (0.08%)
Writing report file to database.report.tsv ..
Reading genome sizes from ././database.kdb.counts ... done
Setting values in the taxonomy tree ... done
Printing classification report ... done
Report finished in 0.247 seconds.
Finishing up ...Database construction complete. [Total: 17m6.972s]
You can delete all files but database.{kdb,idx} and taxDB now, if you want

anneng

krakenuniq 在从nt库提取微生物库（或者任何一个自库）的时候完全基于taxnomy 和nt 文件来解析而我们把这些数据导入数据库后一条sql就完成了

anneng

krakenuniq --db HOST --db PROK --db EUK_DRAFT
Hierarchical read classification with multiple databases
krakenuniq 支持多个数据库的比对利用这种方式可以分别单独构建数据库

anneng

add genomes to existing database
kraken类的软件都不支持给已经编译好的数据库加新序列需要重建这个对比较大的序列源的情况几乎很难实现需要大量的内存
https://github.com/DerrickWood/kraken2/issues/45
I would delete the *.k2d files and the seqid2taxid.map file and use this:
kraken2-build --add-to-library $NEW_FILE --db .
and then rebuild. You cannot add in new genomes once the database is built but you can just add more to the library before building.

There is no current way to combine builds due to the way that the database is stored in memory.

anneng

总结：经过技术验证，要顺利使用krakenuniq，建议分别构建物种库，例如细菌，病毒，真菌，古菌等，然后在比对的时候同时使用这些数据库进行比对。

anneng

一个例子
https://hpc.nih.gov/apps/KrakenUniq.html

anneng

jellyfish的hash太大还是容易被kill 可以将其设置小一点这样不会影响后续Kraken的分析精度
If you encounter problems with Jellyfish not being able to allocate enough memory on your system to run the build process, you can supply a smaller hash size to Jellyfish using kraken-build's --jellyfish-hash-size switch. Each space in the hash table uses approximately 6.9 bytes, so using "--jellyfish-hash-size 6400M" will use a hash table size of 6.4 billion spaces and require 44.3 GB of RAM.

https://github.com/DerrickWood/kraken/issues/65

http://gensoft.pasteur.fr/docs/kraken/0.10.5-beta/MANUAL.html
http://gensoft.pasteur.fr/docs/kraken/0.10.5-beta/MANUAL.html

ice-melt

@anneng 在使用KrakenUniq进行病原分析中说：

krakenuniq --report-file res_archaea.tsv --db archaea_db --threads 10 test_archaea.fna

krakenuniq构建数据库参数说明

构建过程中需要6步，可以重复运行，每次运行会检查之前的文件确定是否需要运行当前步骤
不指定--jellyfish-hash-size会使用全部内存，内存不够会报错，建议根据服务器内存大小指定该参数

Usage: krakenuniq-build [task option] [options]

Task options (exactly one can be selected -- default is build):
  --download-taxonomy        Download NCBI taxonomic information
  --download-library TYPE    Download partial library (TYPE = one of "refseq/bacteria", "refseq/archaea", "refseq/viral"). 
                             Use krakenuniq-download for more options.
  --add-to-library FILE      Add FILE to library
  --build                    Create DB from library (requires taxonomy d/l'ed and at 
                             least one file in library)
  --rebuild                  Create DB from library like --build, but remove
                             existing non-library/taxonomy files before build
  --clean                    Remove unneeded files from a built database
  --shrink NEW_CT            Shrink an existing DB to have only NEW_CT k-mers
  --standard                 Download and create default database, which contains complete genomes 
                             for archaea, bacteria and viruses from RefSeq, as well as viral strains 
                             from NCBI. Specify --taxids-for-genomes and --taxids-for-sequences
                             separately, if desired.

  --help                     Print this message
  --version                  Print version information

Options:
  --db DBDIR                 Kraken DB directory (mandatory except for --help/--version)
  --threads #                Number of threads (def: 1)
  --new-db NAME              New Kraken DB name (shrink task only; mandatory
                             for shrink task)
  --kmer-len NUM             K-mer length in bp (build/shrink tasks only;
                             def: 31)
  --minimizer-len NUM        Minimizer length in bp (build/shrink tasks only;
                             def: 15)
  --jellyfish-hash-size STR  Pass a specific hash size argument to jellyfish
                             when building database (build task only)
  --jellyfish-bin STR        Use STR as Jellyfish 1 binary.
  --max-db-size SIZE         Shrink the DB before full build, making sure
                             database and index together use <= SIZE gigabytes
                             (build task only)
  --shrink-block-offset NUM  When shrinking, select the k-mer that is NUM
                             positions from the end of a block of k-mers
                             (default: 1)
  --work-on-disk             Perform most operations on disk rather than in
                             RAM (will slow down build in most cases)
  --taxids-for-genomes       Add taxonomy IDs (starting with 1 billion) for genomes.
                             Only works with 3-column seqid2taxid map with third 
                             column being the name
  --taxids-for-sequences     Add taxonomy IDs for sequences, starting with 1 billion.
                             Can be useful to resolve classifications with multiple genomes
                             for one taxonomy ID.
  --min-contig-size NUM      Minimum contig size for inclusion in database.
                             Use with draft genomes to reduce contamination, e.g. with values between 1000 and 10000.
  --library-dir DIR          Use DIR for reference sequences instead of DBDIR/library.
  --taxonomy-dir DIR         Use DIR for taxonomy instead of DBDIR/taxonomy.

Experimental:
  --uid-database             Build a UID database (default no)
  --lca-database             Build a LCA database (default yes)
  --no-lca-database          Do not build a LCA database
  --lca-order DIR1           Impose a hierarchical order for setting LCAs.
  --lca-order DIR2           The directories must be specified relative to the libary directory
  ...                        (DBDIR/library). When setting the LCAs, k-mers from sequences in
                             DIR1 will be set first, and only unset k-mers will be set from
                             DIR2, etc, and final from the whole library.
							 Use this option when including low-confidence draft genomes,
                             e.g use --lca-order Complete_Genome --lca-order Chromosome to
                             prioritize more complete assemblies.
                             Keep in mind that this option takes considerably longer.

使用krakenuniq分析数据命令参数说明

Usage: krakenuniq --report-file FILENAME [options] <filename(s)>

Options:
  --db NAME               Name for Kraken DB (default: none)
  --threads NUM           Number of threads (default: 1)
  --fasta-input           Input is FASTA format
  --fastq-input           Input is FASTQ format
  --gzip-compressed       Input is gzip compressed
  --bzip2-compressed      Input is bzip2 compressed
  --hll-precision INT     Precision for HyperLogLog k-mer cardinality estimation, between 10 and 18 (default: 12)
  --exact                 Compute exact cardinality instead of estimate (slower, requires memory proportional to cardinality!)
  --quick                 Quick operation (use first hit or hits)
  --min-hits NUM          In quick op., number of hits req'd for classification
                          NOTE: this is ignored if --quick is not specified
  --unclassified-out FILENAME
                          Print unclassified sequences to filename
  --classified-out FILENAME
                          Print classified sequences to filename
  --output FILENAME       Print output to filename (default: stdout); "off" will
                          suppress normal output
  --only-classified-output
                          Print no Kraken output for unclassified sequences
  --preload               Loads DB into memory before classification
  --paired                The two filenames provided are paired-end reads
  --check-names           Ensure each pair of reads have names that agree
                          with each other; ignored if --paired is not specified
  --help                  Print this message
  --version               Print version information

Experimental:
  --uid-mapping           Map using UID database

If none of the *-input or *-compressed flags are specified, and the 
file is a regular file, automatic format detection is attempted.