<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[构建本地nt&#x2F;nr数据库]]></title><description><![CDATA[<p dir="auto"><a href="https://blast.ncbi.nlm.nih.gov/Blast.cgi?PAGE_TYPE=BlastDocs&amp;DOC_TYPE=Download" rel="nofollow ugc">https://blast.ncbi.nlm.nih.gov/Blast.cgi?PAGE_TYPE=BlastDocs&amp;DOC_TYPE=Download</a><br />
1. Why build a local nt/nr database?<br />
Do you have difficulties running high volume BLAST searches?<br />
Do you have proprietary sequence data to search and cannot use the NCBI BLAST web site?<br />
Do you have access to your own server?<br />
Do you have your own research pipeline?<br />
Have security or IP concerns about sending searches outside of your organization?</p>
<p dir="auto">2. Download the BLAST+ software<br />
<a href="ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/" rel="nofollow ugc">ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/</a><br />
Software manual:<br />
<a href="https://www.ncbi.nlm.nih.gov/books/NBK279690/" rel="nofollow ugc">https://www.ncbi.nlm.nih.gov/books/NBK279690/</a><br />
Compile and install from source:<br />
cd c++<br />
./configure<br />
cd ReleaseMT/build<br />
make all_r</p>
<p dir="auto">Programs included:<br />
<img src="/assets/uploads/files/1609245459487-3f1bc485-3c7b-4cb8-8470-6192a613324f-image.png" alt="3f1bc485-3c7b-4cb8-8470-6192a613324f-image.png" class=" img-responsive img-markdown" /><br />
<img src="/assets/uploads/files/1609245492779-df9d57cf-4f21-4f1f-ad38-76239b78a0e5-image.png" alt="df9d57cf-4f21-4f1f-ad38-76239b78a0e5-image.png" class=" img-responsive img-markdown" /><br />
<img src="/assets/uploads/files/1609332525014-8307aea4-0cfc-47a5-98ab-4a8e7e6951d7-image.png" alt="8307aea4-0cfc-47a5-98ab-4a8e7e6951d7-image.png" class=" img-responsive img-markdown" /><br />
<a href="https://open.oregonstate.education/computationalbiology/chapter/command-line-blast/" rel="nofollow ugc">https://open.oregonstate.education/computationalbiology/chapter/command-line-blast/</a><br />
Configuration:<br />
Path to the programs:<br />
export PATH=$PATH:$HOME/ncbi-blast-2.10.1+/bin<br />
Path to the databases:<br />
export BLASTDB=$HOME/blastdb</p>
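<p dir="auto">A quick way to confirm that the configuration took effect is sketched below. It only assumes the two paths from the export lines above, which should be adjusted to wherever BLAST+ was actually unpacked.</p>

```shell
# Paths are taken from the export lines above; adjust to your own install.
export PATH="$PATH:$HOME/ncbi-blast-2.10.1+/bin"
export BLASTDB="$HOME/blastdb"

# Report whether each program can be found on PATH.
for prog in blastn blastp makeblastdb blastdbcmd; do
    if command -v "$prog" >/dev/null 2>&1; then
        echo "found:   $prog"
    else
        echo "missing: $prog"
    fi
done

echo "BLASTDB=$BLASTDB"
```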
<p dir="auto">3. Download a database<br />
<a href="ftp://ftp.ncbi.nlm.nih.gov/blast/db/" rel="nofollow ugc">ftp://ftp.ncbi.nlm.nih.gov/blast/db/</a><br />
cd $HOME/blastdb<br />
perl ../bin/update_blastdb.pl --passive --decompress 16S_ribosomal_RNA<br />
To update the database, run the same command again:<br />
perl ../bin/update_blastdb.pl --passive --decompress 16S_ribosomal_RNA</p>
<p dir="auto">An example:<br />
blastdbcmd -db 16S_ribosomal_RNA -entry nr_025000 -out 16S_query.fa<br />
blastn -db 16S_ribosomal_RNA -query 16S_query.fa -task blastn -dust no -outfmt "7 delim=, qacc sacc evalue bitscore qcovus pident" -max_target_seqs 5</p>
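<p dir="auto">The comma-delimited report requested above is easy to post-process. The sketch below keeps the highest-bitscore hit per query. The sample rows are invented for illustration and only mimic the column order qacc, sacc, evalue, bitscore, qcovus, pident.</p>

```shell
# Sample lines in the layout requested above (qacc,sacc,evalue,bitscore,qcovus,pident).
# The values are made up for illustration, not real BLAST results.
hits_file=$(mktemp)
cat > "$hits_file" <<'EOF'
# BLASTN (illustrative comment line)
# Fields: query acc., subject acc., evalue, bit score, query coverage, identity
q1,s1,0.0,2000,100,100.000
q1,s2,0.0,1800,99,98.500
q2,s3,1e-50,200,80,95.000
EOF

# Skip comment lines, then keep the highest-bitscore hit for each query.
best_hits=$(grep -v '^#' "$hits_file" |
    awk -F, '!($1 in best) || $4 + 0 > best[$1] { best[$1] = $4 + 0; line[$1] = $0 }
             END { for (q in line) print line[q] }' |
    sort)
printf '%s\n' "$best_hits"
```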
<pre><code>BLAST Database error: No alias or index file found for nucleotide database [./16S_ribosomal_RNA] in search path [/home/bioinfo/gene_data/blastdb::]
# Setting the BLASTDB environment variable to the directory that contains the database files fixes this:
export BLASTDB=/home/bioinfo/gene_data/blastdb/16S_ribosomal_RNA/
</code></pre>
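<p dir="auto">The error above means that no alias or index file for the database was found in any directory of the search path. A minimal sketch for checking this by hand follows. The helper name find_nucl_db is hypothetical, and it only looks for the .nal (alias) and .nin (index) files of a nucleotide database in the directories it is given, which is a simplification of what BLAST itself does.</p>

```shell
# Look for a nucleotide database by base name in each directory of a
# colon-separated search path, as in the error message above.
# A database counts as present if an alias (.nal) or index (.nin) file exists.
find_nucl_db() {
    db_name=$1
    search_path=$2
    old_ifs=$IFS
    IFS=:
    for dir in $search_path; do
        [ -n "$dir" ] || continue
        if [ -f "$dir/$db_name.nal" ] || [ -f "$dir/$db_name.nin" ]; then
            IFS=$old_ifs
            echo "$dir"
            return 0
        fi
    done
    IFS=$old_ifs
    return 1
}

# Demo with a throwaway directory standing in for a real database location.
demo_root=$(mktemp -d)
mkdir -p "$demo_root/blastdb/16S_ribosomal_RNA"
touch "$demo_root/blastdb/16S_ribosomal_RNA/16S_ribosomal_RNA.nin"

find_nucl_db 16S_ribosomal_RNA "$demo_root/blastdb" || echo "not found in parent directory"
find_nucl_db 16S_ribosomal_RNA "$demo_root/blastdb/16S_ribosomal_RNA"
```

<p dir="auto">In the demo the files sit one level below the directory named in the search path, which is the same situation as in the error message, so only the second call succeeds.</p>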
<p dir="auto">4. Add your own sequences<br />
Sequences must be in FASTA format, and each one should be given a unique identifier.<br />
$ cat test.fsa</p>
<blockquote>
<p dir="auto">&gt;seq1<br />
MSFSTKPLDMATWPDFAALVERHNGVWGGCWCMAFHAKGSGAVGNREAKEARVREGSTHAALVFDGSACVGWCQFGPTGE<br />
LPRIKHLRAYEDGQAVLPDWRITCFFSDKAFRGKGVAAAALAGALAEIGRLGGGTVESYPEDAQGRTVAGAFLHNGTLAM<br />
Which species does this sequence belong to? That is described in the file below,<br />
which maps sequence identifiers to taxids:<br />
$ cat test_map.txt<br />
seq1 68287<br />
$ makeblastdb -in test.fsa -parse_seqids -blastdb_version 5 -taxid_map test_map.txt -title "Cookbook demo" -dbtype prot</p>
</blockquote>
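<p dir="auto">When every sequence in a FASTA file belongs to the same taxon, a map file like test_map.txt above can be generated from the headers. The sketch below assumes that the identifier is the first word after the "&gt;" on each header line. The second record is invented for the demo.</p>

```shell
# Build a taxid map for a FASTA file in which every sequence belongs to one taxon.
# File names and the taxid reuse the example above; the second record is invented.
work=$(mktemp -d)
cat > "$work/test.fsa" <<'EOF'
>seq1 first demo protein
MSFSTKPLDMATWPDFAALVERHNGVWGG
>seq2 second demo protein
MNNDFLVLKNITKSFGKATVIDNLDLVIK
EOF

taxid=68287
# For each header line, strip the leading ">" and print "identifier taxid".
awk -v taxid="$taxid" '/^>/ { sub(/^>/, "", $1); print $1, taxid }' \
    "$work/test.fsa" > "$work/test_map.txt"
cat "$work/test_map.txt"
```

<p dir="auto">The resulting file can be passed to makeblastdb with -taxid_map, as in the command above.</p>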
<p dir="auto"><a href="https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/accession2taxid/" rel="nofollow ugc">https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/accession2taxid/</a><br />
<a href="https://ncbiinsights.ncbi.nlm.nih.gov/2018/02/22/new-taxonomy-files-available-with-lineage-type-and-host-information/" rel="nofollow ugc">https://ncbiinsights.ncbi.nlm.nih.gov/2018/02/22/new-taxonomy-files-available-with-lineage-type-and-host-information/</a><br />
<a href="https://www.ncbi.nlm.nih.gov/Taxonomy/TaxIdentifier/tax_identifier.cgi" rel="nofollow ugc">https://www.ncbi.nlm.nih.gov/Taxonomy/TaxIdentifier/tax_identifier.cgi</a><br />
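A FASTA file can also serve as the database to search, once makeblastdb has been run on it (the FASTA file name is then used as the database name):<br />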
blastn -task megablast -db Dmel_transcripts_Ensembl/Dmel_genes_all.fa -query blast_query.txt -dust no -max_target_seqs 1 -outfmt "6 qseqid sseqid evalue pident stitle" -out outputfile.txt<br />
5. Format of the nt/nr databases</p>
<p dir="auto">6. Query by taxonomy ID<br />
blastn -db nt -query QUERY -taxids 9606 -outfmt 7 -out OUTPUT.tab<br />
This reduces the amount of data that has to be searched, and it also shows the benefit of using an SQL database.</p>
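<p dir="auto">For more than a handful of taxids, blastn can read them from a file with -taxidlist instead of -taxids. The sketch below only prints the command unless RUN is set to an empty string. QUERY and OUTPUT.tab are the same placeholders as above, and the second taxid (mouse) is an arbitrary choice for the demo.</p>

```shell
# Taxids to search, one per line (9606 = human, 10090 = mouse).
taxid_file=$(mktemp)
printf '%s\n' 9606 10090 > "$taxid_file"

# Print the command by default; set RUN="" to execute it for real.
RUN=${RUN-echo}
$RUN blastn -db nt -query QUERY -taxidlist "$taxid_file" -outfmt 7 -out OUTPUT.tab
```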
<p dir="auto">7. Export data<br />
blastdbcmd -db 16S_ribosomal_RNA -entry all -out 16S_query.fa<br />
The -outfmt option selects what is exported, for example -outfmt "%f":<br />
%f - Output all sequence data and metadata in FASTA format (this is the default behavior of the command)<br />
%s - Sequence data<br />
%a - Database specific accession ID<br />
%g - NCBI sequence ID (i.e. gi)<br />
%i - Sequence ID<br />
%l - Sequence length<br />
More format specifiers:<br />
/home/bioinfo/software/blast_debug/ncbi-blast-2.11.0+-src/c++/ReleaseMT/bin/blastdbcmd -db ./16S_ribosomal_RNA -entry all -out 16S_query.fa -outfmt "%T %a %g %i %l %t"</p>
<p dir="auto">-outfmt &lt;String&gt;<br />
Output format, where the available format specifiers are:<br />
%f means sequence in FASTA format<br />
%s means sequence data (without defline)<br />
%a means accession<br />
%g means gi<br />
%o means ordinal id (OID)<br />
%i means sequence id<br />
%t means sequence title<br />
%l means sequence length<br />
%h means sequence hash value<br />
%T means taxid<br />
%X means leaf-node taxids<br />
%e means membership integer<br />
%L means common taxonomic name<br />
%C means common taxonomic names for leaf-node taxids<br />
%S means scientific name<br />
%N means scientific names for leaf-node taxids<br />
%B means BLAST name<br />
%n means a list of links integers separated by ';'<br />
%K means taxonomic super kingdom<br />
%P means PIG<br />
%d means defline in text ASN.1 format<br />
%b means Bioseq in text ASN.1 format<br />
%m means sequence masking data.<br />
Masking data will be displayed as a series of 'N-M' values<br />
separated by ';' or the word 'none' if none are available.<br />
If '%f' or '%d' are specified, all other format specifiers are ignored.<br />
For every format except '%f' and '%d', each line of output will correspond<br />
to a sequence.</p>
]]></description><link>http://an.forum.genostack.com/topic/149/构建本地nt-nr数据库</link><generator>RSS for Node</generator><lastBuildDate>Sat, 13 Jun 2026 13:46:09 GMT</lastBuildDate><atom:link href="http://an.forum.genostack.com/topic/149.rss" rel="self" type="application/rss+xml"/><pubDate>Tue, 29 Dec 2020 06:36:20 GMT</pubDate><ttl>60</ttl><item><title><![CDATA[Reply to 构建本地nt&#x2F;nr数据库 on Wed, 09 Jun 2021 02:37:31 GMT]]></title><description><![CDATA[<p dir="auto">To treat every sequence in a FASTA file as belonging to one species, use the -taxid option of makeblastdb.<br />
Merging two databases:<br />
makeblastdb -in mysequences.fna -dbtype nucl -title "some sequences I found" -out mysequences -parse_seqids<br />
blastdb_aliastool -dblist "nt mysequences" -dbtype nucl -title "nt database + my own sequences" -out ntandmore<br />
If there are several FASTA files, one per species, build a database from each file first and then combine them with blastdb_aliastool.</p>
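<p dir="auto">The per-species workflow just described could look roughly like the sketch below. The FASTA file names and their taxids are hypothetical, and the commands are only printed unless RUN is set to an empty string.</p>

```shell
# Dry-run sketch: one database per FASTA file, each tagged with a single taxid,
# then an alias that combines them. File names and taxids are hypothetical.
# Set RUN="" to execute the commands instead of printing them.
RUN=${RUN-echo}

dblist=""
for pair in ecoli.fna:562 human.fna:9606; do
    fasta=${pair%%:*}      # part before the colon
    taxid=${pair##*:}      # part after the colon
    name=${fasta%.fna}     # database name = file name without extension
    $RUN makeblastdb -in "$fasta" -dbtype nucl -parse_seqids -taxid "$taxid" -out "$name"
    dblist="${dblist:+$dblist }$name"
done

# -dblist takes the database names as one space-separated string
# (the quotes are not visible in the dry-run output printed by echo).
$RUN blastdb_aliastool -dblist "$dblist" -dbtype nucl -title "combined species databases" -out combined
```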
]]></description><link>http://an.forum.genostack.com/post/638</link><guid isPermaLink="true">http://an.forum.genostack.com/post/638</guid><dc:creator><![CDATA[anneng]]></dc:creator><pubDate>Wed, 09 Jun 2021 02:37:31 GMT</pubDate></item><item><title><![CDATA[Reply to 构建本地nt&#x2F;nr数据库 on Tue, 30 Mar 2021 06:18:26 GMT]]></title><description><![CDATA[<p dir="auto">-outfmt &lt;String&gt;<br />
Output format, where the available format specifiers are:<br />
%f means sequence in FASTA format<br />
%s means sequence data (without defline)<br />
%a means accession<br />
%g means gi<br />
%o means ordinal id (OID)<br />
%i means sequence id<br />
%t means sequence title<br />
%l means sequence length<br />
%h means sequence hash value<br />
%T means taxid<br />
%X means leaf-node taxids<br />
%e means membership integer<br />
%L means common taxonomic name<br />
%C means common taxonomic names for leaf-node taxids<br />
%S means scientific name<br />
%N means scientific names for leaf-node taxids<br />
%B means BLAST name<br />
%K means taxonomic super kingdom<br />
%P means PIG<br />
%m means sequence masking data.<br />
Masking data will be displayed as a series of 'N-M' values<br />
separated by ';' or the word 'none' if none are available.<br />
If '%f' is specified, all other format specifiers are ignored.<br />
For every format except '%f', each line of output will correspond<br />
to a sequence.<br />
Default = `%f'</p>
]]></description><link>http://an.forum.genostack.com/post/523</link><guid isPermaLink="true">http://an.forum.genostack.com/post/523</guid><dc:creator><![CDATA[anneng]]></dc:creator><pubDate>Tue, 30 Mar 2021 06:18:26 GMT</pubDate></item><item><title><![CDATA[Reply to 构建本地nt&#x2F;nr数据库 on Tue, 30 Mar 2021 06:27:29 GMT]]></title><description><![CDATA[<p dir="auto">Export convention for nr and nt</p>
<pre><code>%T %a %i %t %s
1. Species ID (taxid)
2. accession (sequence ID)
3. sequence title
4. description
5. sequence
</code></pre>
<p dir="auto">The %i field in the export above is actually the sequence ID, and we can leave it out:<br />
<strong>%a %t %T %s</strong></p>
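<p dir="auto">Because the title (%t) can contain spaces, a line exported as "%a %t %T %s" cannot simply be split on whitespace. One way around this, sketched below with made-up records, is to count fields from both ends of the line: the first field is the accession, the last is the sequence, the one before it is the taxid, and everything in between is the title. This assumes the title is never empty.</p>

```shell
# Made-up lines in the "%a %t %T %s" layout: accession, title, taxid, sequence.
sample=$(mktemp)
cat > "$sample" <<'EOF'
ACC0001.1 hypothetical demo protein [Example organism] 68287 MSFSTKPLDMATW
ACC0002.1 short title 9606 MNNDFLVLKNITK
EOF

# Emit tab-separated accession, taxid, title, sequence.
parsed=$(awk '{
    acc = $1; seq = $NF; taxid = $(NF - 1)
    title = $2
    for (i = 3; i <= NF - 2; i++) title = title " " $i
    printf "%s\t%s\t%s\t%s\n", acc, taxid, title, seq
}' "$sample")
printf '%s\n' "$parsed"
```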
]]></description><link>http://an.forum.genostack.com/post/451</link><guid isPermaLink="true">http://an.forum.genostack.com/post/451</guid><dc:creator><![CDATA[zhangfanglin]]></dc:creator><pubDate>Tue, 30 Mar 2021 06:27:29 GMT</pubDate></item><item><title><![CDATA[Reply to 构建本地nt&#x2F;nr数据库 on Sat, 20 Feb 2021 07:29:22 GMT]]></title><description><![CDATA[<p dir="auto">The BLAST taxonomy database is required in order to print the scientific name, common name, blast name, or super kingdom as part of the BLAST report or in a report with blastdbcmd. The BLAST database contains only the taxid (an integer) for each entry, and the taxonomy database allows BLAST to retrieve the scientific name etc. from a taxid. The BLAST taxonomy database consists of a pair of files (taxdb.bti and taxdb.btd) that are available as a compressed archive from the NCBI BLAST FTP site (<a href="ftp://ftp.ncbi.nlm.nih.gov/blast/db/taxdb.tar.gz" rel="nofollow ugc">ftp://ftp.ncbi.nlm.nih.gov/blast/db/taxdb.tar.gz</a>). The update_blastdb.pl script can be used to download and update this archive; it is recommended that the uncompressed contents of the archive be installed in the same directory where the BLAST databases reside. Assuming proper file permissions and that the BLASTDB environment variable contains the path to the installation directory of the BLAST databases, the following commands accomplish that:</p>
<pre><code># Download the taxdb archive
perl update_blastdb.pl taxdb
# Install it in the BLASTDB directory
gunzip -cd taxdb.tar.gz | (cd $BLASTDB; tar xvf - )
</code></pre>
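<p dir="auto">After installing, it can be worth confirming that both files named above ended up in the database directory. The helper name taxdb_installed below is hypothetical, and the demo runs against a throwaway directory rather than a real $BLASTDB.</p>

```shell
# Check that both taxonomy files named above are present in a database directory.
taxdb_installed() {
    [ -f "$1/taxdb.bti" ] && [ -f "$1/taxdb.btd" ]
}

# Demo against a throwaway directory rather than a real $BLASTDB.
demo_db=$(mktemp -d)
taxdb_installed "$demo_db" || echo "taxdb missing: scientific names cannot be printed"
touch "$demo_db/taxdb.bti" "$demo_db/taxdb.btd"
taxdb_installed "$demo_db" && echo "taxdb present"
```

<p dir="auto">On a real system the call would be taxdb_installed "$BLASTDB".</p>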
]]></description><link>http://an.forum.genostack.com/post/446</link><guid isPermaLink="true">http://an.forum.genostack.com/post/446</guid><dc:creator><![CDATA[anneng]]></dc:creator><pubDate>Sat, 20 Feb 2021 07:29:22 GMT</pubDate></item><item><title><![CDATA[Reply to 构建本地nt&#x2F;nr数据库 on Sat, 20 Feb 2021 07:13:51 GMT]]></title><description><![CDATA[<p dir="auto"><a href="https://github.com/lskatz/taxdb" rel="nofollow ugc">https://github.com/lskatz/taxdb</a><br />
A tool that can import taxdump into SQLite</p>
]]></description><link>http://an.forum.genostack.com/post/445</link><guid isPermaLink="true">http://an.forum.genostack.com/post/445</guid><dc:creator><![CDATA[anneng]]></dc:creator><pubDate>Sat, 20 Feb 2021 07:13:51 GMT</pubDate></item><item><title><![CDATA[Reply to 构建本地nt&#x2F;nr数据库 on Sat, 20 Feb 2021 07:08:49 GMT]]></title><description><![CDATA[<p dir="auto">Annotating BLAST Reports with Taxonomy Information<br />
<a href="https://www.mathworks.com/matlabcentral/mlc-downloads/downloads/submissions/15970/versions/2/previews/taxoblastdemo/html/taxoblastdemo.html?access_key=" rel="nofollow ugc">https://www.mathworks.com/matlabcentral/mlc-downloads/downloads/submissions/15970/versions/2/previews/taxoblastdemo/html/taxoblastdemo.html?access_key=</a></p>
]]></description><link>http://an.forum.genostack.com/post/444</link><guid isPermaLink="true">http://an.forum.genostack.com/post/444</guid><dc:creator><![CDATA[anneng]]></dc:creator><pubDate>Sat, 20 Feb 2021 07:08:49 GMT</pubDate></item><item><title><![CDATA[Reply to 构建本地nt&#x2F;nr数据库 on Sat, 20 Feb 2021 06:48:47 GMT]]></title><description><![CDATA[<p dir="auto">Preformatted BLAST vs Fasta<br />
<a href="https://www.ncbi.nlm.nih.gov/books/NBK62345/" rel="nofollow ugc">https://www.ncbi.nlm.nih.gov/books/NBK62345/</a><br />
Getting the preformatted database files<br />
Preformatted BLAST database files offer several advantages over the FASTA files:</p>
<p dir="auto">The preformatted databases are broken into smaller volumes and therefore can be downloaded more readily with fewer errors<br />
A convenient Perl script (update_blastdb.pl found in the bin directory of a locally installed blast+ package) is available to simplify the download of these preformatted databases<br />
Preformatted database files remove the makeblastdb formatting steps and save valuable processing time and disk space<br />
Taxonomic information is encoded within the preformatted databases and can be used to limit the scope of a blast search, and sequence retrieval, and scientific name addition through the included taxdb files<br />
Sequences in FASTA format can be generated easily from the preformatted databases using the blastdbcmd utility when needed</p>
]]></description><link>http://an.forum.genostack.com/post/443</link><guid isPermaLink="true">http://an.forum.genostack.com/post/443</guid><dc:creator><![CDATA[anneng]]></dc:creator><pubDate>Sat, 20 Feb 2021 06:48:47 GMT</pubDate></item><item><title><![CDATA[Reply to 构建本地nt&#x2F;nr数据库 on Sat, 20 Feb 2021 06:34:02 GMT]]></title><description><![CDATA[<p dir="auto">Extracting data from BLAST databases with blastdbcmd<br />
<a href="https://www.ncbi.nlm.nih.gov/books/NBK279689/" rel="nofollow ugc">https://www.ncbi.nlm.nih.gov/books/NBK279689/</a></p>
]]></description><link>http://an.forum.genostack.com/post/442</link><guid isPermaLink="true">http://an.forum.genostack.com/post/442</guid><dc:creator><![CDATA[anneng]]></dc:creator><pubDate>Sat, 20 Feb 2021 06:34:02 GMT</pubDate></item><item><title><![CDATA[Reply to 构建本地nt&#x2F;nr数据库 on Sat, 20 Feb 2021 03:45:25 GMT]]></title><description><![CDATA[<p dir="auto"><a href="https://dbsloan.github.io/TS2019/exercises/local_blast.html" rel="nofollow ugc">https://dbsloan.github.io/TS2019/exercises/local_blast.html</a><br />
Running Local BLAST and Parsing Output</p>
<pre><code>makeblastdb -in Ecoli.proteins.fas -dbtype prot

makeblastdb -in Ecoli.genome.fas -dbtype nucl
</code></pre>
<pre><code>blastn -task blastn  -query Salmonella.genome.fas -db Ecoli.genome.fas -evalue 1e-20 -num_threads 4 -out blastn.txt
</code></pre>
<pre><code># blastnData must already hold the parsed blastn hits, with Query_Start and Hit_Start columns
pdf ("my_dotplot.pdf")
plot (blastnData$Query_Start, blastnData$Hit_Start, cex = .25)
dev.off()
quit()
</code></pre>
<p dir="auto"><img src="/assets/uploads/files/1613792705108-46d8543b-55c2-4680-a439-a6c4129e7baa-image.png" alt="46d8543b-55c2-4680-a439-a6c4129e7baa-image.png" class=" img-responsive img-markdown" /></p>
]]></description><link>http://an.forum.genostack.com/post/437</link><guid isPermaLink="true">http://an.forum.genostack.com/post/437</guid><dc:creator><![CDATA[anneng]]></dc:creator><pubDate>Sat, 20 Feb 2021 03:45:25 GMT</pubDate></item><item><title><![CDATA[Reply to 构建本地nt&#x2F;nr数据库 on Wed, 30 Dec 2020 13:27:09 GMT]]></title><description><![CDATA[<p dir="auto"><a href="/assets/uploads/files/1609334754760-bioinformatics_-introduction-to-using-blast-with-ubuntu.pdf">Bioinformatics_ introduction to using BLAST with Ubuntu.pdf</a><a href="/assets/uploads/files/1609334828346-bioinformatics_-managing-blast-data-sources.pdf">Bioinformatics_ managing BLAST data sources.pdf</a></p>
]]></description><link>http://an.forum.genostack.com/post/286</link><guid isPermaLink="true">http://an.forum.genostack.com/post/286</guid><dc:creator><![CDATA[anneng]]></dc:creator><pubDate>Wed, 30 Dec 2020 13:27:09 GMT</pubDate></item><item><title><![CDATA[Reply to 构建本地nt&#x2F;nr数据库 on Wed, 30 Dec 2020 08:38:38 GMT]]></title><description><![CDATA[<p dir="auto"><a href="https://docs.oracle.com/cd/B19306_01/datamine.102/b14340/blast.htm" rel="nofollow ugc">https://docs.oracle.com/cd/B19306_01/datamine.102/b14340/blast.htm</a><br />
Oracle's support for BLAST</p>
]]></description><link>http://an.forum.genostack.com/post/285</link><guid isPermaLink="true">http://an.forum.genostack.com/post/285</guid><dc:creator><![CDATA[anneng]]></dc:creator><pubDate>Wed, 30 Dec 2020 08:38:38 GMT</pubDate></item><item><title><![CDATA[Reply to 构建本地nt&#x2F;nr数据库 on Wed, 30 Dec 2020 07:28:11 GMT]]></title><description><![CDATA[<p dir="auto"><a href="/assets/uploads/files/1609313290226-blastdbv5.pdf">blastdbv5.pdf</a></p>
]]></description><link>http://an.forum.genostack.com/post/283</link><guid isPermaLink="true">http://an.forum.genostack.com/post/283</guid><dc:creator><![CDATA[anneng]]></dc:creator><pubDate>Wed, 30 Dec 2020 07:28:11 GMT</pubDate></item><item><title><![CDATA[Reply to 构建本地nt&#x2F;nr数据库 on Tue, 29 Dec 2020 07:30:36 GMT]]></title><description><![CDATA[<p dir="auto"><a href="https://www.ncbi.nlm.nih.gov/books/NBK279670/" rel="nofollow ugc">https://www.ncbi.nlm.nih.gov/books/NBK279670/</a></p>
]]></description><link>http://an.forum.genostack.com/post/281</link><guid isPermaLink="true">http://an.forum.genostack.com/post/281</guid><dc:creator><![CDATA[anneng]]></dc:creator><pubDate>Tue, 29 Dec 2020 07:30:36 GMT</pubDate></item><item><title><![CDATA[Reply to 构建本地nt&#x2F;nr数据库 on Tue, 29 Dec 2020 07:30:17 GMT]]></title><description><![CDATA[<p dir="auto"><a href="http://arep.med.harvard.edu/seqanal/db.html" rel="nofollow ugc">http://arep.med.harvard.edu/seqanal/db.html</a><br />
What is Redundancy?<br />
A key concept in comparing databases is the issue of redundancy. Many databases try to be "non-redundant". Unfortunately, biological data is too complex to fit a simple definition of redundancy. Are two alleles of the same locus redundant? Two isozymes in the same organism? The same locus in two closely related organisms? Hence, each "non-redundant" database has its own definition of redundancy. Some use automated measures, while others use manual culling; the former are amenable to large projects, the latter give higher quality. Other databases don't attempt to be non-redundant, but rather sacrifice this goal in favor of ensuring completeness.<br />
Databases<br />
Nucleotide (DNA &amp; RNA)<br />
nr (NCBI)<br />
The nr nucleotide database maintained by NCBI as a target for their BLAST search services is a composite of GenBank, GenBank updates, and EMBL updates.<br />
Non-redundant: Entries with absolutely identical sequences have been merged.<br />
GenBank / EMBL / DDBJ<br />
In theory, GenBank, the EMBL Datalibrary, and the DNA Databank of Japan (DDBJ) are just names for the same database. In reality, small time lags in propagating data between the database centers cause minor differences in these databases. However, if one of these libraries is merged with the updates to all of these databases, a complete set of sequences is formed.<br />
Redundant: Little to no attempts to reduce redundancy<br />
dbEST (Boguski, Lowe, &amp; Tolstoshev. Nature Genetics 4:332 1993) is a library of Expressed Sequence Tags (Science 252:1651), single-pass cDNA sequences generated from automated sequencers.<br />
CAUTION: ESTs are blindly sequenced from cDNA libraries with little or no human intervention; they are therefore likely to contain sequencing errors and are frequently contaminated with heterologous sequences and transcribed repetitive elements.</p>
<p dir="auto">Redundant: no attempts made to reduce redundancy<br />
Protein<br />
nr (NCBI)<br />
The nr protein database maintained by NCBI as a target for their BLAST search services is a composite of SwissProt, SwissProt updates, PIR, PDB. Entries with absolutely identical sequences have been merged.<br />
SwissProt<br />
SwissProt is maintained by Amos Bairoch at the University of Geneva. SwissProt is a highly-curated, highly-crossreferenced, non-redundant database. Unfortunately, the cost of this labor-intensive quality enhancement process is that not every sequence is in SwissProt. If you wish to look up information about a sequence, SwissProt is the first place to look.<br />
Non-redundant: manual curation used to provide only one entry per protein product; variants are annotated in entry.<br />
Highly-cross-referenced to other databases.<br />
PIR<br />
The Protein Identification Resource was originated by the late Margaret Dayhoff. It attempts to enjoy the advantages of a complete and a non-redundant database.<br />
Non-redundant: PIR1 section contains only one entry per protein product.<br />
Redundant: Complete database (PIR1+PIR2+PIR3) has many redundancies<br />
PDB<br />
The Protein Data Bank, maintained by Brookhaven National Laboratory (Long Island, New York, USA), contains all publicly available solved protein structures. Searches against the pdb can be used to ask whether any known 3D structures are similar to your query protein.<br />
Non-redundant: Only the "best" determination of a given structure is left in the database; however, multiple structures for one molecule may exist due to other components (i.e. one entry uncomplexed, one complexed).<br />
OWL<br />
Prot. Eng. 3:153<br />
Non-redundant: Automatically generated from component databases (see reference for further info).<br />
Protein Motifs<br />
Prosite<br />
Prosite is a database of protein motifs maintained by Amos Bairoch at the University of Geneva (NAR 19:2241, 1991). Each motif (defined by either a regular expression or a profile) is accompanied by a description of the motif and what is known about its biology, as well as a listing of the true positive, false negative, and false positive SwissProt entries for the pattern.<br />
BLOCKS<br />
BLOCKS is a database developed by Steve Henikoff and colleagues. A block is a gap-free multiple alignment of sequences based on Prosite (Henikoff &amp; Henikoff, NAR 19:6565 1991).</p>
]]></description><link>http://an.forum.genostack.com/post/280</link><guid isPermaLink="true">http://an.forum.genostack.com/post/280</guid><dc:creator><![CDATA[anneng]]></dc:creator><pubDate>Tue, 29 Dec 2020 07:30:17 GMT</pubDate></item><item><title><![CDATA[Reply to 构建本地nt&#x2F;nr数据库 on Tue, 29 Dec 2020 07:09:44 GMT]]></title><description><![CDATA[<pre><code>                     The BLAST Databases
                Last updated on February 3, 2020
</code></pre>
<p dir="auto">IMPORTANT: As of February 4, 2020, the BLAST databases on the FTP site are version 5 (v5).<br />
At the same time, the set of databases offered has changed. This document reflects those changes.<br />
Information on newly enabled features with the v5 databases at<br />
<a href="https://ftp.ncbi.nlm.nih.gov/blast/db/blastdbv5.pdf" rel="nofollow ugc">https://ftp.ncbi.nlm.nih.gov/blast/db/blastdbv5.pdf</a></p>
<p dir="auto">This document describes the BLAST databases available on the NCBI FTP site under<br />
the /blast/db directory. The direct URL is <a href="ftp://ftp.ncbi.nlm.nih.gov/blast/db" rel="nofollow ugc">ftp://ftp.ncbi.nlm.nih.gov/blast/db</a></p>
<ol>
<li>
<p dir="auto">Quick Start</p>
<ul>
<li>Get all numbered files for a database with the same base name:<br />
Each of these files represents a subset (volume) of that database,<br />
and all of them are needed to reconstitute the database.</li>
<li>After extraction, there is no need to concatenate the resulting files:<br />
Call the database by its base name; for the nr database files, use "-db nr".</li>
<li>For easy download, use the update_blastdb.pl script from the blast+ package.</li>
<li>Incremental update is not available.</li>
</ul>
</li>
<li>
<p dir="auto">General Introduction</p>
</li>
</ol>
<p dir="auto">BLAST search pages under the Basic BLAST section of the NCBI BLAST home page<br />
(<a href="http://blast.ncbi.nlm.nih.gov/" rel="nofollow ugc">http://blast.ncbi.nlm.nih.gov/</a>) use a standard set of BLAST databases for<br />
nucleotide, protein, and translated BLAST searches.  These databases are made<br />
available as compressed archives (in pre-formatted form) and can be downloaded from<br />
the /db directory of the BLAST ftp site (<a href="ftp://ftp.ncbi.nlm.nih.gov/blast/db/" rel="nofollow ugc">ftp://ftp.ncbi.nlm.nih.gov/blast/db/</a>).<br />
The FASTA files reside under the /FASTA subdirectory.</p>
<p dir="auto">The pre-formatted databases offer the following advantages:<br />
* Pre-formatting removes the need to run makeblastdb;<br />
* Species-level taxonomy ids are included for each database entry;<br />
* Databases are broken into smaller-sized volumes and are therefore easier<br />
to download;<br />
* Sequences in FASTA format can be generated from the pre-formatted databases<br />
by using the blastdbcmd utility;<br />
* A convenient script (update_blastdb.pl) is available in the blast+ package<br />
to download the pre-formatted databases.</p>
<p dir="auto">Pre-formatted databases must be downloaded using the update_blastdb.pl script or<br />
via FTP in binary mode. Documentation for this script can be obtained by running<br />
the script without any arguments; Perl installation is required.</p>
<p dir="auto">The compressed files downloaded must be inflated with gzip or other decompress<br />
tools. The BLAST database files can then be extracted out of the resulting tar<br />
file using the tar utility on Unix/Linux, or WinZip and StuffIt Expander on<br />
Windows and Macintosh platforms, respectively.</p>
<p dir="auto">Large databases are formatted in multiple one-gigabyte volumes, which are named<br />
using the basename.##.tar.gz convention. All volumes with the same base name are<br />
required. An alias file is provided to tie individual volumes together so that<br />
the database can be called using the base name (without the .nal or .pal<br />
extension). For example, to call the est database, simply use "-db est" option<br />
in the command line (without the quotes).</p>
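<p dir="auto">When the volumes are fetched by plain FTP rather than with update_blastdb.pl --decompress, each archive has to be unpacked by hand. The sketch below loops over all archives that share a base name. The base name "demo" and the tiny archives it creates are stand-ins so that the loop can be run without downloading anything.</p>

```shell
# Create two tiny stand-in volumes named with the basename.##.tar.gz convention.
work=$(mktemp -d)
for vol in 00 01; do
    echo "volume $vol" > "$work/demo.$vol.nsq"
    tar -czf "$work/demo.$vol.tar.gz" -C "$work" "demo.$vol.nsq"
    rm "$work/demo.$vol.nsq"
done

# All volumes sharing the base name are needed, so unpack every archive.
for archive in "$work"/demo.*.tar.gz; do
    tar -xzf "$archive" -C "$work"
done
ls "$work"
```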
<p dir="auto">For other genomic BLAST databases, please check the genomes ftp directory at:<br />
<a href="ftp://ftp.ncbi.nlm.nih.gov/genomes/" rel="nofollow ugc">ftp://ftp.ncbi.nlm.nih.gov/genomes/</a></p>
<ol start="3">
<li>Contents of the /blast/db/ directory</li>
</ol>
<p dir="auto">The pre-formatted BLAST databases are archived in this directory. The names of<br />
these databases and their contents are listed below.</p>
<pre><code>+-------------------------------+---------------------------------------------------+
File Name                       | Content Description
+-------------------------------+---------------------------------------------------+
README                          | README for this subdirectory (this file)
nr.*tar.gz                      | Non-redundant protein sequences from GenPept,
                                | Swissprot, PIR, PRF, PDB, and NCBI RefSeq
nt.*tar.gz                      | Partially non-redundant nucleotide sequences from
                                | all traditional divisions of GenBank, EMBL, and DDBJ
                                | excluding GSS, STS, PAT, EST, HTG, and WGS.
landmark.tar.gz                 | Proteome of 27 model organisms, see
                                | https://blast.ncbi.nlm.nih.gov/smartblast/smartBlast.cgi?CMD=Web&amp;PAGE_TYPE=BlastDocs#searchSets
16S_ribosomal_RNA               | 16S ribosomal RNA (Bacteria and Archaea type strains)
18S_fungal_sequences.tar.gz     | 18S ribosomal RNA sequences (SSU) from Fungi type and reference material (BioProject PRJNA39195)
28S_fungal_sequences.tar.gz     | 28S ribosomal RNA sequences (LSU) from Fungi type and reference material (BioProject PRJNA51803)
ITS_RefSeq_Fungi.tar.gz         | Internal transcribed spacer region (ITS) from Fungi type and reference material (BioProject PRJNA177353)
ITS_eukaryote_sequences.tar.gz  | Internal transcribed spacer region (ITS) for eukaryotic sequences
LSU_eukaryote_rRNA.tar.gz       | Large subunit ribosomal RNA sequences for eukaryotic sequences
LSU_prokaryote_rRNA.tar.gz      | Large subunit ribosomal RNA sequences for prokaryotic sequences
SSU_eukaryote_rRNA.tar.gz       | Small subunit ribosomal RNA sequences for eukaryotic sequences
ref_euk_rep_genomes*tar.gz      | Refseq Representative Eukaryotic genomes (1000+ organisms)
ref_prok_rep_genomes*tar.gz     | Refseq Representative Prokaryotic genomes (5700+ organisms)
ref_viruses_rep_genomes*tar.gz  | Refseq Representative Virus genomes (9000+ organisms)
ref_viroids_rep_genomes*tar.gz  | Refseq Representative Viroid genomes (46 organisms)
refseq_protein.*tar.gz          | NCBI protein reference sequences
refseq_rna.*tar.gz              | NCBI Transcript reference sequences
swissprot.tar.gz                | Swiss-Prot sequence database (last major update)
pataa.*tar.gz                   | Patent protein sequences
patnt.*tar.gz                   | Patent nucleotide sequences. Both patent databases
                                | are directly from the USPTO, or from the EPO/JPO
                                | via EMBL/DDBJ
pdbaa.*tar.gz                   | Sequences for the protein structure from the
                                | Protein Data Bank
pdbnt.*tar.gz                   | Sequences for the nucleotide structure from the
                                | Protein Data Bank. They are NOT the protein coding
                                | sequences for the corresponding pdbaa entries.
taxdb.tar.gz                    | Additional taxonomy information for the databases
                                | listed here, providing common and scientific names
FASTA/                          | Subdirectory for FASTA formatted sequences
v4/                             | BLAST databases in version 4 (v4). These files are no
                                | longer being updated.
cloud/                          | Subdirectory of databases for BLAST AMI; see
                                | http://1.usa.gov/TJAnEt
+-------------------------------+---------------------------------------------------+
</code></pre>
<ol start="4">
<li>Contents of the /blast/db/FASTA directory</li>
</ol>
<p dir="auto">This directory contains FASTA formatted sequence files. The file names<br />
and database contents are listed below. These files must be unpacked before<br />
use.  They are provided as a convenience for users needing these sets in<br />
FASTA format.  For use with BLAST, it is preferable to use the BLAST database<br />
on the FTP site.</p>
<pre><code>+-----------------------+-----------------------------------------------------+
File Name               | Content Description
+-----------------------+-----------------------------------------------------+
nr.gz*                  | non-redundant protein sequence database with entries
                        | from GenPept, Swissprot, PIR, PRF, PDB, and RefSeq
nt.gz*                  | nucleotide sequence database, with entries from all
                        | traditional divisions of GenBank, EMBL, and DDBJ;
                        | excluding bulk divisions (gss, sts, pat, est, htg)
                        | and wgs entries. Partially non-redundant.
pdbaa.gz*               | protein sequences from pdb protein structures
swissprot.gz*           | swiss-prot database (last major release)
+-----------------------+-----------------------------------------------------+
</code></pre>
<p dir="auto">NOTE:<br />
(1) For screening for vector contamination, use the UniVec database:<br />
<a href="ftp://ftp.ncbi.nlm.nih.gov/pub/UniVec/" rel="nofollow ugc">ftp://ftp.ncbi.nlm.nih.gov/pub/UniVec/</a></p>
<p dir="auto">* Files marked with an asterisk have pre-formatted counterparts.</p>
<ol start="5">
<li>Database updates</li>
</ol>
<p dir="auto">The BLAST databases are updated regularly. There is no established incremental<br />
update scheme. We recommend downloading the complete databases regularly to<br />
keep their content current.</p>
<ol start="6">
<li>Non-redundant defline syntax</li>
</ol>
<p dir="auto">The non-redundant databases are nr, nt and pataa. Identical sequences are<br />
merged into one entry in these databases. To be merged, two sequences must<br />
have identical lengths and every residue at every position must be the<br />
same. The FASTA deflines for the different entries that belong to one<br />
record are separated by control-A characters, which are invisible to most<br />
programs. In the example below, entries Q57293.1 and AAB05030.1<br />
have exactly the same sequence:</p>
<p dir="auto">&gt;Q57293.1 RecName: Full=Fe(3+) ions import ATP-binding protein FbpC ^AAAB05030.1 afuC<br />
[Actinobacillus pleuropneumoniae] ^AAAB17216.1 afuC [Actinobacillus pleuropneumoniae]<br />
MNNDFLVLKNITKSFGKATVIDNLDLVIKRGTMVTLLGPSGCGKTTVLRLVAGLENPTSGQIFIDGEDVTKSSIQNRDIC<br />
IVFQSYALFPHMSIGDNVGYGLRMQGVSNEERKQRVKEALELVDLAGFADRFVDQISGGQQQRVALARALVLKPKVLILD<br />
EPLSNLDANLRRSMREKIRELQQRLGITSLYVTHDQTEAFAVSDEVIVMNKGTIMQKARQKIFIYDRILYSLRNFMGEST<br />
ICDGNLNQGTVSIGDYRFPLHNAADFSVADGACLVGVRPEAIRLTATGETSQRCQIKSAVYMGNHWEIVANWNGKDVLIN<br />
ANPDQFDPDATKAFIHFTEQGIFLLNKE</p>
<p dir="auto">Individual sequences are now identified simply by their accession.version.</p>
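<p dir="auto">A merged defline can be split back into its individual entries by breaking it on the control-A character (byte 0x01, written ^A in the README). The sketch below is illustrative only; split_defline is a hypothetical helper, and it assumes each entry has the form "accession.version description".</p>

```python
def split_defline(defline):
    """Split a merged nr/nt defline into (accession.version, description) pairs."""
    entries = []
    # Control-A (0x01) separates the deflines that were merged into one record.
    for part in defline.lstrip(">").split("\x01"):
        part = part.strip()
        if not part:
            continue
        # The first whitespace-delimited token is the sequence identifier.
        seq_id, _, description = part.partition(" ")
        entries.append((seq_id, description.strip()))
    return entries


# The defline from the README example, with ^A written as "\x01".
example = (
    ">Q57293.1 RecName: Full=Fe(3+) ions import ATP-binding protein FbpC"
    "\x01AAB05030.1 afuC [Actinobacillus pleuropneumoniae]"
    "\x01AAB17216.1 afuC [Actinobacillus pleuropneumoniae]"
)

for seq_id, description in split_defline(example):
    print(seq_id, description, sep="\t")
```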
<p dir="auto">For databases whose entries are not from official NCBI sequence databases,<br />
such as the Trace database, the gnl| convention is used. Custom databases<br />
should follow this convention, and the ID of each sequence must be<br />
unique in order to take advantage of an indexed database,<br />
which enables retrieval of specific sequences with the blastdbcmd program included<br />
in the BLAST executable package. Refer to the documents<br />
distributed in the standalone BLAST package for more details.</p>
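<p dir="auto">For a custom database, it can help to build the gnl| identifiers programmatically and to check that they are unique before formatting. The following is a rough sketch under the assumption that identifiers take the general form gnl|database|identifier; the helper names and the database name "mydb" are hypothetical.</p>

```python
from collections import Counter


def gnl_id(database, identifier):
    """Build a general identifier of the assumed form gnl|database|identifier."""
    return "gnl|{}|{}".format(database, identifier)


def sequence_id(defline):
    """Return the first whitespace-delimited token of a non-empty defline."""
    return defline.lstrip(">").split(None, 1)[0]


def duplicate_ids(deflines):
    """Return, in sorted order, every sequence ID that occurs more than once."""
    counts = Counter(sequence_id(d) for d in deflines)
    return sorted(seq_id for seq_id, n in counts.items() if n > 1)


# "mydb" is a made-up database name used only for illustration.
deflines = [
    ">" + gnl_id("mydb", "seq1") + " first sequence",
    ">" + gnl_id("mydb", "seq2") + " second sequence",
    ">" + gnl_id("mydb", "seq1") + " accidental duplicate",
]
print(duplicate_ids(deflines))
```

<p dir="auto">An empty result means every ID is unique; any ID that is reported should be renamed before the database is built.</p>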
<ol start="7">
<li>Formatting a FASTA file into a BLASTable database</li>
</ol>
<p dir="auto">FASTA files need to be formatted with makeblastdb before they can be used in a local<br />
BLAST search. For files from NCBI, the following makeblastdb commands are<br />
recommended:</p>
<p dir="auto">For nucleotide fasta file:   makeblastdb -in input_db -dbtype nucl -parse_seqids<br />
For protein fasta file:      makeblastdb -in input_db -dbtype prot -parse_seqids</p>
<p dir="auto">In general, if the database is available as a preformatted BLAST database, it is better to use<br />
the preformatted version than to format the FASTA file yourself.</p>
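<p dir="auto">In a pipeline, the makeblastdb commands above can be wrapped in a small script. This is only a sketch: the function names are hypothetical, it uses just the three options shown in the README (-in, -dbtype, -parse_seqids), and it assumes makeblastdb is installed and on the PATH.</p>

```python
import shutil
import subprocess


def makeblastdb_args(fasta_path, dbtype):
    """Build the argument list for the makeblastdb command shown in the README."""
    if dbtype not in ("nucl", "prot"):
        raise ValueError("dbtype must be 'nucl' or 'prot'")
    return ["makeblastdb", "-in", fasta_path, "-dbtype", dbtype, "-parse_seqids"]


def format_database(fasta_path, dbtype):
    """Run makeblastdb on an unpacked FASTA file; raise if the tool is missing or fails."""
    args = makeblastdb_args(fasta_path, dbtype)
    if shutil.which(args[0]) is None:
        raise FileNotFoundError("makeblastdb was not found on PATH")
    # check=True raises CalledProcessError on a non-zero exit status.
    subprocess.run(args, check=True)
```

<p dir="auto">For example, format_database("nt", "nucl") would format an unpacked nt FASTA file as a nucleotide database, and format_database("nr", "prot") would do the same for nr as a protein database.</p>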
<ol start="8">
<li>Technical Support</li>
</ol>
<p dir="auto">Questions and comments on this document and NCBI BLAST related questions<br />
should be sent to the blast-help group at:<br />
<a href="mailto:blast-help@ncbi.nlm.nih.gov" rel="nofollow ugc">blast-help@ncbi.nlm.nih.gov</a></p>
<p dir="auto">For information about other NCBI resources/services, please send email to<br />
NCBI User Service at:<br />
<a href="mailto:info@ncbi.nlm.nih.gov" rel="nofollow ugc">info@ncbi.nlm.nih.gov</a></p>
]]></description><link>http://an.forum.genostack.com/post/279</link><guid isPermaLink="true">http://an.forum.genostack.com/post/279</guid><dc:creator><![CDATA[anneng]]></dc:creator><pubDate>Tue, 29 Dec 2020 07:09:44 GMT</pubDate></item><item><title><![CDATA[Reply to 构建本地nt&#x2F;nr数据库 on Tue, 29 Dec 2020 07:01:19 GMT]]></title><description><![CDATA[<p dir="auto"><a href="https://ftp.ncbi.nlm.nih.gov/blast/db/blastdbv5.pdf" rel="nofollow ugc">https://ftp.ncbi.nlm.nih.gov/blast/db/blastdbv5.pdf</a></p>
]]></description><link>http://an.forum.genostack.com/post/278</link><guid isPermaLink="true">http://an.forum.genostack.com/post/278</guid><dc:creator><![CDATA[anneng]]></dc:creator><pubDate>Tue, 29 Dec 2020 07:01:19 GMT</pubDate></item><item><title><![CDATA[Reply to 构建本地nt&#x2F;nr数据库 on Tue, 29 Dec 2020 06:50:58 GMT]]></title><description><![CDATA[<p dir="auto">Nr database encompasses sequences from both non-curated and curated databases:</p>
<p dir="auto">Non-curated databases (low quality):</p>
<p dir="auto">GenBank/GenPept - unreviewed sequences submitted from individual laboratories and large-scale sequencing projects. Since these sequence records are owned by the original submitters and cannot be altered, GenBank might contain many low-quality sequences.<br />
TrEMBL - unreviewed section of UniProt. This section is a computer-annotated supplement of SwissProt that contains all the translations of EMBL nucleotide sequence entries not yet integrated in SwissProt.</p>
<p dir="auto">Curated databases (high quality):</p>
<p dir="auto">RefSeq - GenBank sequences that are manually curated by the NCBI staff. RefSeq records are owned by NCBI and can be updated as needed to maintain current annotation or to incorporate additional information.<br />
SwissProt - manually annotated and reviewed protein sequences<br />
PIR - non-redundant annotated protein sequence database<br />
PDB - experimentally-determined structures of proteins, nucleic acids, and complex assemblies</p>
<p dir="auto"><a href="https://www.biostars.org/p/164641/" rel="nofollow ugc">https://www.biostars.org/p/164641/</a></p>
]]></description><link>http://an.forum.genostack.com/post/277</link><guid isPermaLink="true">http://an.forum.genostack.com/post/277</guid><dc:creator><![CDATA[anneng]]></dc:creator><pubDate>Tue, 29 Dec 2020 06:50:58 GMT</pubDate></item><item><title><![CDATA[Reply to 构建本地nt&#x2F;nr数据库 on Tue, 29 Dec 2020 06:49:16 GMT]]></title><description><![CDATA[<p dir="auto">Database resources of the National Center for Biotechnology Information<br />
<a href="https://academic.oup.com/nar/article/42/D1/D7/1054454" rel="nofollow ugc">https://academic.oup.com/nar/article/42/D1/D7/1054454</a></p>
]]></description><link>http://an.forum.genostack.com/post/276</link><guid isPermaLink="true">http://an.forum.genostack.com/post/276</guid><dc:creator><![CDATA[anneng]]></dc:creator><pubDate>Tue, 29 Dec 2020 06:49:16 GMT</pubDate></item></channel></rss>