暗能星系

    • 登录
    • 搜索

    使用KrakenUniq进行病原分析

    生物信息分析
    2
    9
    36
    正在加载更多帖子
    • 从旧到新
    • 从新到旧
    • 最多赞同
    回复
    • 在新帖中回复
    登录后回复
    此主题已被删除。只有拥有主题管理权限的用户可以查看。
    • A
      anneng 最后由 anneng 编辑

      1.软件安装
      conda create -n tax_classifier //后续的物种分类软件都安装到这个目录下面
      conda activate tax_classifier
      conda install krakenuniq
      2.构建数据库
      krakenuniq兼容kraken数据库 不兼容kraken2数据库
      krakenuniq-download --db DBDIR --dust microbial-nt

      //如果只涉及病原的话 建议只做微生物相关的数据库 -R表示用rsync下载 建议使用这种方式
      krakenuniq-download --db . --taxa "archaea,bacteria,fungi,parasitic_worms,protozoa,viral" --dust --exclude-environmental-taxa microbial-nt -R

      //构建数据库
      krakenuniq-build --db DBDIR --kmer-len 31 --threads 10 --taxids-for-genomes --taxids-for-sequences

      //序列分析
      krakenuniq --report-file res_archaea.tsv --db archaea_db --threads 10 test_archaea.fna

      //注意 krakenuniq依赖 jellyfish 1
      Found jellyfish v2.3.0
      Kraken requires jellyfish version 1

      I 1 条回复 最后回复 回复 引用 0
      • A
        anneng 最后由 anneng 编辑

        /home/bioinfo/miniconda2/envs/tax_classifier/libexec/build_db.sh: line 46: 38880 Killed jellyfish count -m 31 -s 166082697099 -C -t 10 -o database /dev/fd/63
        Command exited with non-zero status 137

        目录中有一个nt.fna为370GB多 因此把library-files.txt删掉 重新build 试试 build 命令会自动找library目录下面的文件

        如果library中的数据比较多 krakenuniq非常耗费内存 中间在jellyfish的步骤会被kill掉 因此需要一个一个的处理library中的数据 ,例如先处理细菌的 再处理病毒的 正常的打印如下:

        krakenuniq-build --db . --kmer-len 31 --threads 15 --taxids-for-genomes --taxids-for-sequences --jellyfish-hash-size 5000M
        Found jellyfish v1.1.11
        Kraken build set to minimize disk writes.
        Found 2 sequence files (*.{fna,fa,ffn,fasta,fsa}) in the library directory.
        Creating k-mer set (step 1 of 6)...
        Using jellyfish
        K-mer set created. [2m30.142s]
        Skipping step 2, no database reduction requested.
        Sorting k-mer set (step 3 of 6)...
        db_sort: Getting database into memory ...Loaded database with 801475581 keys with k of 31 [val_len 4, key_len 8].
        Loaded database with 801475581 keys with k of 31 [val_len 4, key_len 8].
        db_sort: Sorting ...db_sort: Sorting complete - writing database to disk ...
        K-mer set sorted. [8m26.865s]
        Creating seqID to taxID map (step 4 of 6)..
        18949 sequences mapped to taxa. [0.023s]
        Creating taxDB (step 5 of 6)...
        Building taxonomy index from taxonomy//nodes.dmp and taxonomy//names.dmp. Done, got 2312242 taxa
        taxDB construction finished. [11.991s]
        Building KrakenUniq LCA database (step 6 of 6)...
        Adding taxonomy IDs for sequences
        Adding taxonomy IDs for genomes
        Reading taxonomy index from taxDB. Done.
        Getting database0.kdb into memory (8.957 GB) ... Done
        Loaded database with 801475581 keys with k of 31 [val_len 4, key_len 8].
        Reading sequence ID to taxonomy ID mapping ... [starting new taxonomy IDs with 1000000001] got 18949 mappings.
        Finished processing 37898 sequences (skipping 0 empty sequences, and 0 sequences with no taxonomy mapping)
        Writing kmer counts to database.kdb.counts...
        Writing database from RAM back to database.kdb ...
        Writing new TaxDB ...
        LCA database created. [5m10.060s]
        Creating database summary report database.report.tsv ...
        /home/bioinfo/miniconda2/envs/tax_classifier/libexec/classify -d ././database.kdb -i ././database.idx -t 15 -r database.report.tsv -a ././taxDB -p 12 /dev/fd/63
        Database ././database.kdb
        Loaded database with 801475581 keys with k of 31 [val_len 4, key_len 8].
        Reading taxonomy index from ././taxDB. Done.
        37898 sequences (2276.87 Mbp) processed in 42.084s (54.0 Kseq/m, 3246.14 Mbp/m).
        37869 sequences classified (99.92%)
        29 sequences unclassified (0.08%)
        Writing report file to database.report.tsv ..
        Reading genome sizes from ././database.kdb.counts ... done
        Setting values in the taxonomy tree ... done
        Printing classification report ... done
        Report finished in 0.247 seconds.
        Finishing up ...Database construction complete. [Total: 17m6.972s]
        You can delete all files but database.{kdb,idx} and taxDB now, if you want

        1 条回复 最后回复 回复 引用 0
        • A
          anneng 最后由 编辑

          krakenuniq 在从nt库提取微生物库(或者任何一个自库)的时候 完全基于taxnomy 和nt 文件来解析 而我们把这些数据导入数据库后 一条sql就完成了

          1 条回复 最后回复 回复 引用 0
          • A
            anneng 最后由 编辑

            krakenuniq --db HOST --db PROK --db EUK_DRAFT
            Hierarchical read classification with multiple databases
            krakenuniq 支持多个数据库的比对 利用这种方式可以分别单独构建数据库

            1 条回复 最后回复 回复 引用 0
            • A
              anneng 最后由 编辑

              add genomes to existing database
              kraken类的软件 都不支持给已经编译好的数据库加新序列 需要重建 这个对比较大的序列源的情况几乎很难实现 需要大量的内存
              https://github.com/DerrickWood/kraken2/issues/45
              I would delete the *.k2d files and the seqid2taxid.map file and use this:
              kraken2-build --add-to-library $NEW_FILE --db .
              and then rebuild. You cannot add in new genomes once the database is built but you can just add more to the library before building.

              There is no current way to combine builds due to the way that the database is stored in memory.

              1 条回复 最后回复 回复 引用 0
              • A
                anneng 最后由 编辑

                总结:经过技术验证,要顺利使用krakenuniq,建议分别构建物种库,例如细菌,病毒,真菌,古菌等,然后在比对的时候同时使用这些数据库进行比对。

                1 条回复 最后回复 回复 引用 0
                • A
                  anneng 最后由 编辑

                  一个例子
                  https://hpc.nih.gov/apps/KrakenUniq.html

                  1 条回复 最后回复 回复 引用 0
                  • A
                    anneng 最后由 编辑

                    jellyfish的hash太大 还是容易被kill 可以将其设置小一点 这样不会影响后续Kraken的分析精度
                    If you encounter problems with Jellyfish not being able to allocate enough memory on your system to run the build process, you can supply a smaller hash size to Jellyfish using kraken-build's --jellyfish-hash-size switch. Each space in the hash table uses approximately 6.9 bytes, so using "--jellyfish-hash-size 6400M" will use a hash table size of 6.4 billion spaces and require 44.3 GB of RAM.

                    https://github.com/DerrickWood/kraken/issues/65

                    http://gensoft.pasteur.fr/docs/kraken/0.10.5-beta/MANUAL.html
                    http://gensoft.pasteur.fr/docs/kraken/0.10.5-beta/MANUAL.html

                    1 条回复 最后回复 回复 引用 0
                    • I
                      ice-melt @anneng 最后由 编辑

                      @anneng 在 使用KrakenUniq进行病原分析 中说:

                      krakenuniq --report-file res_archaea.tsv --db archaea_db --threads 10 test_archaea.fna

                      krakenuniq构建数据库参数说明

                      构建过程中需要6步,可以重复运行,每次运行会检查之前的文件确定是否需要运行当前步骤
                      不指定--jellyfish-hash-size会使用全部内存,内存不够会报错,建议根据服务器内存大小指定该参数

                      Usage: krakenuniq-build [task option] [options]
                      
                      Task options (exactly one can be selected -- default is build):
                        --download-taxonomy        Download NCBI taxonomic information
                        --download-library TYPE    Download partial library (TYPE = one of "refseq/bacteria", "refseq/archaea", "refseq/viral"). 
                                                   Use krakenuniq-download for more options.
                        --add-to-library FILE      Add FILE to library
                        --build                    Create DB from library (requires taxonomy d/l'ed and at 
                                                   least one file in library)
                        --rebuild                  Create DB from library like --build, but remove
                                                   existing non-library/taxonomy files before build
                        --clean                    Remove unneeded files from a built database
                        --shrink NEW_CT            Shrink an existing DB to have only NEW_CT k-mers
                        --standard                 Download and create default database, which contains complete genomes 
                                                   for archaea, bacteria and viruses from RefSeq, as well as viral strains 
                                                   from NCBI. Specify --taxids-for-genomes and --taxids-for-sequences
                                                   separately, if desired.
                      
                        --help                     Print this message
                        --version                  Print version information
                      
                      Options:
                        --db DBDIR                 Kraken DB directory (mandatory except for --help/--version)
                        --threads #                Number of threads (def: 1)
                        --new-db NAME              New Kraken DB name (shrink task only; mandatory
                                                   for shrink task)
                        --kmer-len NUM             K-mer length in bp (build/shrink tasks only;
                                                   def: 31)
                        --minimizer-len NUM        Minimizer length in bp (build/shrink tasks only;
                                                   def: 15)
                        --jellyfish-hash-size STR  Pass a specific hash size argument to jellyfish
                                                   when building database (build task only)
                        --jellyfish-bin STR        Use STR as Jellyfish 1 binary.
                        --max-db-size SIZE         Shrink the DB before full build, making sure
                                                   database and index together use <= SIZE gigabytes
                                                   (build task only)
                        --shrink-block-offset NUM  When shrinking, select the k-mer that is NUM
                                                   positions from the end of a block of k-mers
                                                   (default: 1)
                        --work-on-disk             Perform most operations on disk rather than in
                                                   RAM (will slow down build in most cases)
                        --taxids-for-genomes       Add taxonomy IDs (starting with 1 billion) for genomes.
                                                   Only works with 3-column seqid2taxid map with third 
                                                   column being the name
                        --taxids-for-sequences     Add taxonomy IDs for sequences, starting with 1 billion.
                                                   Can be useful to resolve classifications with multiple genomes
                                                   for one taxonomy ID.
                        --min-contig-size NUM      Minimum contig size for inclusion in database.
                                                   Use with draft genomes to reduce contamination, e.g. with values between 1000 and 10000.
                        --library-dir DIR          Use DIR for reference sequences instead of DBDIR/library.
                        --taxonomy-dir DIR         Use DIR for taxonomy instead of DBDIR/taxonomy.
                      
                      Experimental:
                        --uid-database             Build a UID database (default no)
                        --lca-database             Build a LCA database (default yes)
                        --no-lca-database          Do not build a LCA database
                        --lca-order DIR1           Impose a hierarchical order for setting LCAs.
                        --lca-order DIR2           The directories must be specified relative to the libary directory
                        ...                        (DBDIR/library). When setting the LCAs, k-mers from sequences in
                                                   DIR1 will be set first, and only unset k-mers will be set from
                                                   DIR2, etc, and final from the whole library.
                      							 Use this option when including low-confidence draft genomes,
                                                   e.g use --lca-order Complete_Genome --lca-order Chromosome to
                                                   prioritize more complete assemblies.
                                                   Keep in mind that this option takes considerably longer.
                      

                      使用krakenuniq分析数据命令参数说明

                      Usage: krakenuniq --report-file FILENAME [options] <filename(s)>
                      
                      Options:
                        --db NAME               Name for Kraken DB (default: none)
                        --threads NUM           Number of threads (default: 1)
                        --fasta-input           Input is FASTA format
                        --fastq-input           Input is FASTQ format
                        --gzip-compressed       Input is gzip compressed
                        --bzip2-compressed      Input is bzip2 compressed
                        --hll-precision INT     Precision for HyperLogLog k-mer cardinality estimation, between 10 and 18 (default: 12)
                        --exact                 Compute exact cardinality instead of estimate (slower, requires memory proportional to cardinality!)
                        --quick                 Quick operation (use first hit or hits)
                        --min-hits NUM          In quick op., number of hits req'd for classification
                                                NOTE: this is ignored if --quick is not specified
                        --unclassified-out FILENAME
                                                Print unclassified sequences to filename
                        --classified-out FILENAME
                                                Print classified sequences to filename
                        --output FILENAME       Print output to filename (default: stdout); "off" will
                                                suppress normal output
                        --only-classified-output
                                                Print no Kraken output for unclassified sequences
                        --preload               Loads DB into memory before classification
                        --paired                The two filenames provided are paired-end reads
                        --check-names           Ensure each pair of reads have names that agree
                                                with each other; ignored if --paired is not specified
                        --help                  Print this message
                        --version               Print version information
                      
                      Experimental:
                        --uid-mapping           Map using UID database
                      
                      If none of the *-input or *-compressed flags are specified, and the 
                      file is a regular file, automatic format detection is attempted.
                      
                      1 条回复 最后回复 回复 引用 0
                      • First post
                        Last post
                      Powered by 暗能星系