Building a local nt/nr database

anneng

http://arep.med.harvard.edu/seqanal/db.html
What is Redundancy?
A key concept in comparing databases is the issue of redundancy. Many databases try to be "non-redundant". Unfortunately, biological data is too complex to fit a simple definition of redundancy. Are two alleles of the same locus redundant? Two isozymes in the same organism? The same locus in two closely related organisms? Hence, each "non-redundant" database has its own definition of redundancy. Some use automated measures, while others use manual culling; the former are amenable to large projects, the latter give higher quality. Other databases don't attempt to be non-redundant, but rather sacrifice this goal in favor of ensuring completeness.
Databases
Nucleotide (DNA & RNA)
nr (NCBI)
The nr nucleotide database maintained by NCBI as a target for their BLAST search services is a composite of GenBank, GenBank updates, and EMBL updates.
Non-redundant: Entries with absolutely identical sequences have been merged.
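
The merging rule described for nr ("absolutely identical sequences have been merged") can be illustrated with a minimal Python sketch. This is not NCBI's actual pipeline; the function name, the sample identifiers, and the choice to ignore letter case are all assumptions made for illustration.

```python
def merge_identical(records):
    """Collapse records whose sequences are exactly identical.

    records: iterable of (identifier, sequence) pairs.
    Returns a list of (identifiers, sequence) pairs, one per distinct
    sequence, in first-seen order.
    """
    merged = {}
    for identifier, sequence in records:
        key = sequence.upper()  # assumption: letter case is not significant
        merged.setdefault(key, []).append(identifier)
    return [(ids, seq) for seq, ids in merged.items()]


# Made-up records: seqA and seqC carry the same sequence.
records = [
    ("seqA", "ATGGCC"),
    ("seqB", "ATGGCA"),
    ("seqC", "atggcc"),
]
for ids, seq in merge_identical(records):
    print(",".join(ids), seq)
```

Only exact matches are merged here, so a single-base difference (seqB) keeps its own entry, which is why such a database can still look redundant to a biologist.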
GenBank / EMBL / DDBJ
In theory, GenBank, the EMBL Datalibrary, and the DNA Databank of Japan (DDBJ) are just names for the same database. In reality, small timelags in propagating data between the database centers cause minor differences between these databases. However, if one of these libraries is merged with the updates to all of them, a complete set of sequences is formed.
Redundant: Little or no attempt is made to reduce redundancy
dbEST (Boguski, Lowe, & Tolstoshev. Nature Genetics 4:332 1993) is a library of Expressed Sequence Tags (Science 252:1651), single-pass cDNA sequences generated from automated sequencers.
CAUTION: ESTs are blindly sequenced from cDNA libraries with little or no human intervention; they are therefore likely to contain sequencing errors and are frequently contaminated with heterologous sequences and transcribed repetitive elements.

Redundant: no attempts made to reduce redundancy
Protein
nr (NCBI)
The nr protein database maintained by NCBI as a target for their BLAST search services is a composite of SwissProt, SwissProt updates, PIR, and PDB. Entries with absolutely identical sequences have been merged.
SwissProt
SwissProt is maintained by Amos Bairoch at the University of Geneva. SwissProt is a highly-curated, highly-crossreferenced, non-redundant database. Unfortunately, the cost of this labor-intensive quality enhancement process is that not every sequence is in SwissProt. If you wish to look up information about a sequence, SwissProt is the first place to look.
Non-redundant: manual curation used to provide only one entry per protein product; variants are annotated in entry.
Highly-cross-referenced to other databases.
PIR
The Protein Identification Resource was originated by the late Margaret Dayhoff. It attempts to combine the advantages of a complete database with those of a non-redundant one.
Non-redundant: PIR1 section contains only one entry per protein product.
Redundant: Complete database (PIR1+PIR2+PIR3) has many redundancies
PDB
The Protein Data Bank, maintained by Brookhaven National Laboratory (Long Island, New York, USA), contains all publicly available solved protein structures. Searches against the PDB can be used to ask whether any known 3D structures are similar to your query protein.
Non-redundant: Only the "best" determination of a given structure is left in the database; however, multiple structures for one molecule may exist due to other components (e.g. one entry uncomplexed, one complexed).
OWL
Prot. Eng. 3:153
Non-redundant: Automatically generated from component databases (see reference for further info).
Protein Motifs
Prosite
Prosite is a database of protein motifs maintained by Amos Bairoch at the University of Geneva (NAR 19:2241, 1991). Each motif (defined by either a regular expression or a profile) is accompanied by a description of the motif and what is known about its biology, as well as a listing of the true positive, false negative, and false positive SwissProt entries for the pattern.
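
Because Prosite patterns are essentially regular expressions in their own notation, a short Python sketch can show how one maps onto a standard regex. The converter below is an illustration only and covers just the common elements (single residues, `x`, `[...]`, `{...}`, repeat counts, and the `<` / `>` anchors). The function name and the test peptide are made up; the pattern used is the N-glycosylation site pattern `N-{P}-[ST]-{P}`.

```python
import re


def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression string."""
    regex_parts = []
    for element in pattern.rstrip(".").split("-"):
        # Optional "<" anchor, the residue element, an optional repeat
        # such as "(2)" or "(2,4)", and an optional ">" anchor.
        match = re.fullmatch(r"(<?)([^()<>]+)(?:\((\d+(?:,\d+)?)\))?(>?)", element)
        if match is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        start, core, repeat, end = match.groups()
        if core == "x":
            core = "."                       # x means any residue
        elif core.startswith("{") and core.endswith("}"):
            core = "[^" + core[1:-1] + "]"   # {P} means anything except P
        # Single letters and [..] classes are already valid regex syntax.
        part = core + ("{" + repeat + "}" if repeat else "")
        regex_parts.append(("^" if start else "") + part + ("$" if end else ""))
    return "".join(regex_parts)


regex = prosite_to_regex("N-{P}-[ST]-{P}.")
print(regex)
# Positions (0-based) where the motif starts in a made-up peptide.
print([m.start() for m in re.finditer(regex, "MKNASLLNPTQ")])
```

Note that `re.finditer` reports non-overlapping matches only, so a real motif scanner would need extra care with overlapping sites.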
BLOCKS
BLOCKS is a database developed by Steve Henikoff and colleagues. A block is a gap-free multiple alignment of sequences based on Prosite (Henikoff & Henikoff, NAR 19:6565 1991).

anneng

https://www.ncbi.nlm.nih.gov/books/NBK279670/

anneng

blastdbv5.pdf

anneng

https://docs.oracle.com/cd/B19306_01/datamine.102/b14340/blast.htm
Oracle's support for BLAST

anneng

Bioinformatics_ introduction to using BLAST with Ubuntu.pdf Bioinformatics_ managing BLAST data sources.pdf

anneng

https://dbsloan.github.io/TS2019/exercises/local_blast.html
Running Local BLAST and Parsing Output

makeblastdb -in Ecoli.proteins.fas -dbtype prot

makeblastdb -in Ecoli.genome.fas -dbtype nucl

blastn -task blastn -query Salmonella.genome.fas -db Ecoli.genome.fas -evalue 1e-20 -num_threads 4 -out blastn.txt

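# blastnData is assumed to be a data frame of parsed BLAST hits, loaded beforehand
# (for example with read.table), with Query_Start and Hit_Start columns.
pdf("my_dotplot.pdf")
plot(blastnData$Query_Start, blastnData$Hit_Start, cex = 0.25)
dev.off()
quit()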
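
The dot plot above needs query and hit start coordinates. As a rough Python sketch of one way to obtain them, the parser below assumes the blastn command was rerun with `-outfmt 6` (tabular output, not used in the command above) and that Query_Start and Hit_Start correspond to the standard `qstart` and `sstart` columns. The function name and the two sample hit lines are made up.

```python
import csv
import io

# Default column order of BLAST tabular output (-outfmt 6).
OUTFMT6_COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]


def read_hit_starts(handle):
    """Return (query_start, hit_start) pairs from tabular BLAST output."""
    points = []
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        hit = dict(zip(OUTFMT6_COLUMNS, row))
        points.append((int(hit["qstart"]), int(hit["sstart"])))
    return points


# Two made-up hit lines standing in for a real tabular blastn.txt.
example = io.StringIO(
    "Salmonella_chr\tEcoli_chr\t85.2\t1200\t150\t10\t1001\t2200\t5001\t6200\t0.0\t1100\n"
    "Salmonella_chr\tEcoli_chr\t90.0\t800\t80\t0\t30001\t30800\t41000\t40201\t0.0\t900\n"
)
print(read_hit_starts(example))
```

With a real file, `open("blastn.txt")` would replace the `io.StringIO` object.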

anneng

Extracting data from BLAST databases with blastdbcmd
https://www.ncbi.nlm.nih.gov/books/NBK279689/

anneng

Preformatted BLAST vs Fasta
https://www.ncbi.nlm.nih.gov/books/NBK62345/
Getting the preformatted database files
Preformatted BLAST database files offer several advantages over the FASTA files:

The preformatted databases are broken into smaller volumes and therefore can be downloaded more readily with fewer errors
A convenient Perl script (update_blastdb.pl found in the bin directory of a locally installed blast+ package) is available to simplify the download of these preformatted databases
Preformatted database files remove the need for the makeblastdb formatting step, which saves valuable processing time and disk space
Taxonomic information is encoded within the preformatted databases and can be used to limit the scope of a BLAST search or a sequence retrieval, and to add scientific names through the included taxdb files
Sequences in FASTA format can be generated easily from the preformatted databases using the blastdbcmd utility when needed

anneng

Annotating BLAST Reports with Taxonomy Information
https://www.mathworks.com/matlabcentral/mlc-downloads/downloads/submissions/15970/versions/2/previews/taxoblastdemo/html/taxoblastdemo.html?access_key=

anneng

https://github.com/lskatz/taxdb
A tool that can import the NCBI taxdump into SQLite

anneng

The BLAST taxonomy database is required in order to print the scientific name, common name, blast name, or super kingdom as part of the BLAST report or in a report with blastdbcmd. The BLAST database contains only the taxid (an integer) for each entry, and the taxonomy database allows BLAST to retrieve the scientific name etc. from a taxid. The BLAST taxonomy database consists of a pair of files (taxdb.bti and taxdb.btd) that are available as a compressed archive from the NCBI BLAST FTP site (ftp://ftp.ncbi.nlm.nih.gov/blast/db/taxdb.tar.gz). The update_blastdb.pl script can be used to download and update this archive; it is recommended that the uncompressed contents of the archive be installed in the same directory where the BLAST databases reside. Assuming proper file permissions and that the BLASTDB environment variable contains the path to the installation directory of the BLAST databases, the following commands accomplish that:

Download the taxdb archive

perl update_blastdb.pl taxdb

Install it in the BLASTDB directory

gunzip -cd taxdb.tar.gz | (cd $BLASTDB; tar xvf - )
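
For scripted setups, the same install step can be sketched in Python with the standard `tarfile` module. This is only an equivalent of the gunzip/tar pipeline above, under the same assumptions (taxdb.tar.gz already downloaded, BLASTDB pointing at a single writable directory); the function name is made up.

```python
import os
import tarfile


def install_taxdb(archive_path, blastdb_dir):
    """Extract a taxdb archive into blastdb_dir and return the extracted file names."""
    with tarfile.open(archive_path, "r:gz") as archive:
        names = archive.getnames()
        # Only extract archives from a trusted source such as the NCBI FTP site.
        archive.extractall(blastdb_dir)
    return sorted(names)


if __name__ == "__main__":
    blastdb_dir = os.environ.get("BLASTDB")
    # Do nothing unless BLASTDB is set and the archive is in the current directory.
    if blastdb_dir and os.path.exists("taxdb.tar.gz"):
        print(install_taxdb("taxdb.tar.gz", blastdb_dir))
```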

zhangfanglin

nr/nt export specification

**%T %a %i %t %s**
1. Species ID (taxid)
2. Accession (sequence ID)
3. Sequence title
4. Description
5. Sequence

In the export format above, %i should be the sequence ID; we can leave that field out:
%a %t %T %s
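
A possible way to use the four-field export from a script is sketched below in Python. Sequence titles contain spaces, so the sketch separates the specifiers with tab characters instead of spaces; that delimiter choice is an assumption, and it is worth checking on a small database that your blastdbcmd version passes literal tabs through unchanged. The function names and the sample line (including its accession) are made up.

```python
EXPORT_FIELDS = ("accession", "title", "taxid", "sequence")


def export_command(db_name):
    """Build a blastdbcmd argument list that dumps every entry as %a, %t, %T, %s."""
    return [
        "blastdbcmd",
        "-db", db_name,
        "-entry", "all",
        "-outfmt", "%a\t%t\t%T\t%s",  # real tab characters between the specifiers
    ]


def parse_export_line(line):
    """Split one exported line into a dict keyed by EXPORT_FIELDS (all values stay strings)."""
    values = line.rstrip("\n").split("\t")
    if len(values) != len(EXPORT_FIELDS):
        raise ValueError(f"expected {len(EXPORT_FIELDS)} fields, got {len(values)}")
    return dict(zip(EXPORT_FIELDS, values))


# Made-up line in the assumed tab-separated layout.
sample = "X00001.1\tExample sequence title, partial cds\t562\tATGGCCATTG\n"
print(parse_export_line(sample))
```

The argument list from `export_command("nt")` could be passed to `subprocess.run(..., stdout=...)` on a machine where BLAST+ and the database are installed.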

anneng

-outfmt <String>
Output format, where the available format specifiers are:
%f means sequence in FASTA format
%s means sequence data (without defline)
%a means accession
%g means gi
%o means ordinal id (OID)
%i means sequence id
%t means sequence title
%l means sequence length
%h means sequence hash value
%T means taxid
%X means leaf-node taxids
%e means membership integer
%L means common taxonomic name
%C means common taxonomic names for leaf-node taxids
%S means scientific name
%N means scientific names for leaf-node taxids
%B means BLAST name
%K means taxonomic super kingdom
%P means PIG
%m means sequence masking data.
Masking data will be displayed as a series of 'N-M' values
separated by ';' or the word 'none' if none are available.
If '%f' is specified, all other format specifiers are ignored.
For every format except '%f', each line of output will correspond
to a sequence.
Default = `%f'
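
Since the single-letter specifiers are easy to mix up (for example %i versus %t), a small Python lookup can spell out what a given -outfmt string will print. The table below is copied from the help text above; the function name is made up and the sketch does not call blastdbcmd at all.

```python
import re

# Specifier meanings as listed in the blastdbcmd -outfmt help text.
SPECIFIERS = {
    "f": "sequence in FASTA format",
    "s": "sequence data (without defline)",
    "a": "accession",
    "g": "gi",
    "o": "ordinal id (OID)",
    "i": "sequence id",
    "t": "sequence title",
    "l": "sequence length",
    "h": "sequence hash value",
    "T": "taxid",
    "X": "leaf-node taxids",
    "e": "membership integer",
    "L": "common taxonomic name",
    "C": "common taxonomic names for leaf-node taxids",
    "S": "scientific name",
    "N": "scientific names for leaf-node taxids",
    "B": "BLAST name",
    "K": "taxonomic super kingdom",
    "P": "PIG",
    "m": "sequence masking data",
}


def describe_outfmt(fmt):
    """Return the meaning of each %-specifier in fmt, in order of appearance."""
    names = []
    for letter in re.findall(r"%(.)", fmt):
        if letter not in SPECIFIERS:
            raise ValueError(f"unknown format specifier: %{letter}")
        names.append(SPECIFIERS[letter])
    return names


print(describe_outfmt("%T %a %i %t %s"))
```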

anneng

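To treat every sequence in a FASTA file as belonging to a single species, use makeblastdb's -taxid option.
Merging two databases:
makeblastdb -in mysequences.fna -dbtype nucl -title "some sequences I found" -out mysequences -parse_seqids
blastdb_aliastool -dblist "nt mysequences" -dbtype nucl -title "nt database + my own sequences" -out ntandmore
If there are several FASTA files, one per species, build a database from each file first and then merge them with blastdb_aliastool.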
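
The one-database-per-species workflow can be sketched in Python as a function that only builds the command lines, so they can be inspected before anything runs. The file names, the combined database name, and the function name are hypothetical; 562 and 28901 are the NCBI taxids for Escherichia coli and Salmonella enterica.

```python
import os


def per_species_commands(fasta_to_taxid, dbtype, combined_name, combined_title):
    """Return one makeblastdb command per FASTA file plus one blastdb_aliastool command."""
    commands = []
    db_names = []
    for fasta_path, taxid in fasta_to_taxid.items():
        db_name = os.path.splitext(fasta_path)[0]  # database name = path minus extension
        db_names.append(db_name)
        commands.append([
            "makeblastdb", "-in", fasta_path, "-dbtype", dbtype,
            "-out", db_name, "-parse_seqids", "-taxid", str(taxid),
        ])
    commands.append([
        "blastdb_aliastool",
        "-dblist", " ".join(db_names),  # one space-separated argument
        "-dbtype", dbtype, "-title", combined_title, "-out", combined_name,
    ])
    return commands


# Hypothetical input files, one species each.
inputs = {"ecoli.fna": 562, "salmonella.fna": 28901}
for command in per_species_commands(inputs, "nucl", "enterics", "E. coli + Salmonella"):
    print(command)
```

On a machine with BLAST+ installed, each list could be handed to `subprocess.run(command, check=True)` in order.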