暗能星系

    • 登录
    • 搜索

    blast NT NR库readme介绍

    生物信息分析
    1
    1
    10
    正在加载更多帖子
    • 从旧到新
    • 从新到旧
    • 最多赞同
    回复
    • 在新帖中回复
    登录后回复
    此主题已被删除。只有拥有主题管理权限的用户可以查看。
    • A
      anneng 最后由 编辑

                           The BLAST Databases
                      Last updated on September 29, 2020
      

      IMPORTANT: As of February 4, 2020, the BLAST databases on the FTP site are version 5 (v5).
      At the same time, the databases offered has been changed. This document reflects those changes.
      Information on newly enabled features with the v5 databases at
      https://ftp.ncbi.nlm.nih.gov/blast/db/blastdbv5.pdf

      This document describes the BLAST databases available on the NCBI FTP site under
      the /blast/db directory. The direct URL is ftp://ftp.ncbi.nlm.nih.gov/blast/db

      1. Quick Start

        • Get all numbered files for a database with the same base name:
          Each of these files represents a subset (volume) of that database,
          and all of them are needed to reconstitute the database.
        • After extraction, there is no need to concatenate the resulting files:
          Call the database with the base name, for nr database files, use "-db nr".
        • For easy download, use the update_blastdb.pl script from the blast+ package.
        • Incremental update is not available.
      2. General Introduction

      BLAST search pages under the Basic BLAST section of the NCBI BLAST home page
      (http://blast.ncbi.nlm.nih.gov/) use a standard set of BLAST databases for
      nucleotide, protein, and translated BLAST searches. These databases are made
      available as compressed archives of pre-formatted form) and can be downloaded from
      the /db directory of the BLAST ftp site (ftp://ftp.ncbi.nlm.nih.gov/blast/db/).
      The FASTA files reside under the /FASTA subdirectory.

      The pre-formatted databases offer the following advantages:
      * Pre-formatting removes the need to run makeblastdb;
      * Species-level taxonomy ids are included for each database entry;
      * Databases are broken into smaller-sized volumes and are therefore easier
      to download;
      * Sequences in FASTA format can be generated from the pre-formatted databases
      by using the blastdbcmd utility;
      * A convenient script (update_blastdb.pl) is available in the blast+ package
      to download the pre-formatted databases.

      Pre-formatted databases must be downloaded using the update_blastdb.pl script or
      via FTP in binary mode. Documentation for this script can be obtained by running
      the script without any arguments; Perl installation is required.

      The compressed files downloaded must be inflated with gzip or other decompress
      tools. The BLAST database files can then be extracted out of the resulting tar
      file using the tar utility on Unix/Linux, or WinZip and StuffIt Expander on
      Windows and Macintosh platforms, respectively.

      Large databases are formatted in multiple one-gigabyte volumes, which are named
      using the basename.##.tar.gz convention. All volumes with the same base name are
      required. An alias file is provided to tie individual volumes together so that
      the database can be called using the base name (without the .nal or .pal
      extension). For example, to call the est database, simply use "-db est" option
      in the command line (without the quotes).

      For other genomic BLAST databases, please check the genomes ftp directory at:
      ftp://ftp.ncbi.nlm.nih.gov/genomes/

      1. Contents of the /blast/db/ directory

      The pre-formatted BLAST databases are archived in this directory. The names of
      the available databases are listed at https://github.com/ncbi/blast_plus_docs#blast-databases
      It is recommended to use update_blastdb.pl to download databases from the FTP site in order to make
      sure that all volumes are downloaded.

      1. Contents of the /blast/db/FASTA directory

      This directory contains FASTA formatted sequence files. The file names
      and database contents are listed below. These files must be unpacked before
      use. They are provided as a convenience for users needing these sets in
      FASTA format. For use with BLAST, it is preferable to use the BLAST database
      on the FTP site.

      +-----------------------+-----------------------------------------------------+
      |File Name | Content Description |
      +-----------------------+-----------------------------------------------------+
      nr.gz* | non-redundant protein sequence database with entries
      from GenPept, Swissprot, PIR, PDF, PDB, and RefSeq
      nt.gz* | nucleotide sequence database, with entries from all
      traditional divisions of GenBank, EMBL, and DDBJ;
      excluding bulk divisions (gss, sts, pat, est, htg)
      and wgs entries. Partially non-redundant.
      pdbaa.gz* | protein sequences from pdb protein structures
      swissprot.gz* | swiss-prot database (last major release)
      +-----------------------+---------------------------------------------------+
      NOTE:
      (1) For screening for vector contamination, use the UniVec database:
      ftp://ftp.ncbi.nlm.nih.gov/pub/UniVec/

      • marked files have pre-formatted counterparts.
      1. Database updates

      The BLAST databases are updated regularly. There is no established incremental
      update scheme. We recommend downloading the complete databases regularly to
      keep their content current.

      1. Non-redundant defline syntax

      The non-redundant databases are nr, nt and pataa. Identical sequences are
      merged into one entry in these databases. To be merged two sequences must
      have identical lengths and every residue at every position must be the
      same. The FASTA deflines for the different entries that belong to one
      record are separated by control-A characters invisible to most
      programs. In the example below both entries Q57293.1 and AAB05030.1
      have the same sequence, in every respect:

      Q57293.1 RecName: Full=Fe(3+) ions import ATP-binding protein FbpC ^AAAB05030.1 afuC
      [Actinobacillus pleuropneumoniae] ^AAAB17216.1 afuC [Actinobacillus pleuropneumoniae]
      MNNDFLVLKNITKSFGKATVIDNLDLVIKRGTMVTLLGPSGCGKTTVLRLVAGLENPTSGQIFIDGEDVTKSSIQNRDIC
      IVFQSYALFPHMSIGDNVGYGLRMQGVSNEERKQRVKEALELVDLAGFADRFVDQISGGQQQRVALARALVLKPKVLILD
      EPLSNLDANLRRSMREKIRELQQRLGITSLYVTHDQTEAFAVSDEVIVMNKGTIMQKARQKIFIYDRILYSLRNFMGEST
      ICDGNLNQGTVSIGDYRFPLHNAADFSVADGACLVGVRPEAIRLTATGETSQRCQIKSAVYMGNHWEIVANWNGKDVLIN
      ANPDQFDPDATKAFIHFTEQGIFLLNKE

      Individual sequences are now identified simply by their accession.version.

      For databases whose entries are not from official NCBI sequence databases,
      such as Trace database, the gnl| convention is used. For custom databases,
      this convention should be followed and the id for each sequence must be
      unique, if one would like to take the advantage of indexed database,
      which enables specific sequence retrieval using blastdbcmd program included
      in the blast executable package. One should refer to documents
      distributed in the standalone BLAST package for more details.

      1. Formatting a FASTA file into a BLASTable database

      FASTA files need to be formatted with makeblastdb before they can be used in local
      blast search. For those from NCBI, the following makeblastdb commands are
      recommended:

      For nucleotide fasta file: makeblastdb -in input_db -dbtype nucl -parse_seqids
      For protein fasta file: makeblastdb -in input_db -dbtype prot -parse_seqids

      In general, if the database is available as BLAST database, it is better to use the
      preformatted database.

      1. Technical Support

      Questions and comments on this document and NCBI BLAST related questions
      should be sent to the blast-help group at:
      blast-help@ncbi.nlm.nih.gov

      For information about other NCBI resources/services, please send email to
      NCBI User Service at:
      info@ncbi.nlm.nih.gov

      1 条回复 最后回复 回复 引用 0
      • First post
        Last post
      Powered by 暗能星系