使用Diamond basta进行物种分类

anneng

一，diamond blastx -d /data_raid1/tt/diamond_result/nr -q dengu.fastq -o dengu.m8
m8文件是标准的blast格式输出，从左到右的列的含义为：
Query accession: the accession of the sequence that was the search query against the database, as specified in the input FASTA file after the > character until the first blank.
Target accession: the accession of the target database sequence (also called subject) that the query was aligned against.
Sequence identity: The percentage of identical amino acid residues that were aligned against each other in the local alignment.
Length: The total length of the local alignment, which including matching and mismatching positions of query and subject, as well as gap positions in the query and subject.
Mismatches: The number of non-identical amino acid residues aligned against each other.
Gap openings: The number of gap openings.
Query start: The starting coordinate of the local alignment in the query (1-based).
Query end: The ending coordinate of the local alignment in the query (1-based).
Target start: The starting coordinate of the local alignment in the target (1-based).
Target end: The ending coordinate of the local alignment in the target (1-based).
E-value: The expected value of the hit quantifies the number of alignments of similar or better quality that you expect to find searching this query against a database of random sequences the same size as the actual target database. This number is most useful for measuring the significance of a hit. By default, DIAMOND will report all alignments with e-value < 0.001, meaning that a hit of this quality will be found by chance on average once per 1,000 queries.
Bit score: The bit score is a scoring matrix independent measure of the (local) similarity of the two aligned sequences, with higher numbers meaning more similar. It is always >= 0 for local Smith Waterman alignments.

二，basta安装和准备
2.1 下载和初始化物种分类文件
./bin/basta taxonomy
会下载taxdump.tar.gz并导入本地leveldb数据库，basta内部用的是leveldb　和sqlite类似　这种把文件导入数据库的方式我们后面可以借鉴　可以考虑把这些数据导入hbase进行大规模保存和查询
2.2 下载映射文件
./bin/basta download MAPPING_FILE_TYPE
MAPPING_FILE_TYPE can be one of
prot - protein-to-taxonID mapping file (protein sequences hosted at the NCBI)
uni - uniprot-to-taxonID mapping file (complete uniprot, will also be imported into database prot)
gb - genbank-to-taxonID mapping file (for most nucleotide databases)
wgs - whole-genome-sequence-to-taxonid mapping file
pdb - protein-database-to-taxonID mapping file
est - est-to-taxonID mapping file
gss - gss-to-taxonID mapping file
一般我们用gb(nt的映射)和prot(nr的映射)两个即可。
2.3 进行物种LCA
conda activate Diamond
basta sequence dengu.m8 dengu.lca prot -e 0.005 -l 10 -i 50
dengu.m8 是用blastx查的蛋白库　所以这个命令对应的查prot

最后的结果：
3d6008db-4c38-416d-9519-9ce9912fc43c Viruses;Kitrinoviricota;Flasuviricetes;Amarillovirales;Flaviviridae;Flavivirus;Dengue_virus;

anneng

https://cran.r-project.org/web/packages/taxonomizr/readme/README.html
Convert accession numbers to taxonomy

https://github.com/timkahlke/BASTA

anneng

anneng

m8格式
qseqid means Query Seq-id
sseqid means Subject Seq-id
pident means Percentage of identical matches
length means Alignment length
mismatch means Number of mismatches
gapopen means Number of gap openings
qstart means Start of alignment in query
qend means End of alignment in query
sstart means Start of alignment in subject
send means End of alignment in subject
evalue means Expect value
bitscore means Bit score

注意：blast以前的版本有个-m 8 选项 diamond m8 就是来自这里现在是 outfmt 6
it should be '-outfmt 6' for blast+ and '-m 8' for legacy blast
https://www.biostars.org/p/166013/

anneng

diamond 比较蛋白库可能有问题直接用blastn比较nt库
//转换fastq为fasta
seqtk seq -a all.fastq > all.fasta
blastn -query all.fasta -db /ceph_disk2/MetaDatabase/NCBI_blast_db_FASTA/nt/ntdata/nt -num_threads 10 -out all.m8 -outfmt 6 -evalue 0.001

anneng

basta有个很大的问题是一次只能处理一个样本
https://github.com/husonlab/megan-ce
megan-ce　社区版本有个blast2lca的工具　可以单独使用

https://www.seqomics.hu/index.php?option=com_content&view=article&id=58&catid=2&Itemid=137
LCA-assignment algorithm
The main problem addressed by MEGAN is to compute a “species profile” by assigning the reads from a metagenomics sequencing experiment to appropriate taxa in the NCBI taxonomy. At present, this program implements the following naive approach to this problem:

Compare a given set of DNA reads to a database of known sequences, such as NCBI-NR or NCBI-NT, using a sequence comparison tool such as BLAST.
Process this data to determine all hits of taxa by reads.
For each read r, let H be the set of all taxa that r hits.
Find the lowest node v in the NCBI taxonomy that encompasses the set of hit taxa H and assign the read r to the taxon represented by v.

We call this the naive LCA-assignment algorithm (LCA = “lowest common ancestor”). In this approach, every read is assigned to some taxon. If the read aligns very specifically only to a single taxon, then it is assigned to that taxon. The less specifically a read hits taxa, the higher up in the taxonomy it is placed. Reads that hit ubiquitously may even be assigned to the root node of the NCBI taxonomy.

If a read has significant matches to two different taxa a and b, where a is an ancestor of b in the NCBI taxonomy, then the match to the ancestor a is discarded and only the more specific match to b is used.

The program provides a threshold for the bit disjointScore of hits. Any hit that falls below the threshold is discarded. Secondly, a threshold can be set to discard any hit whose disjointScore falls below a given percentage of the best hit. Finally, a third threshold is used to report only taxa that are hit by a minimal number of reads or minimal percent of all assigned reads. By default, the program requires at least 0:1% of all assigned reads to hit a taxon, before that taxon is deemed present. All reads that are initially assigned to a taxon that is not deemed present are pushed up the taxonomy until a node is reached that has enough reads. This is set using the Min Support Percent or Min Support item.

Taxa in the NCBI taxonomy can be excluded from the analysis. For example, taxa listed under root - unclassified sequences - metagenomes may give rise to matches that force the algorithm to place reads on the root node of the taxonomy. This feature is controlled by Preferences!Taxon Disabling menu. At present, the set of disabled taxa is saved as a program property and not as part of a MEGAN document.

Note that the LCA-assignment algorithm is already used on a smaller scale when parsing individual blast matches. This is because an entry in a reference database may have more than one taxon associated with it. For example, in the NCBI-NR database, an entry may be associated with up to 1000 different taxa. This implies, in particular, that a read that may be assigned to a high level node (even the root node), even though it only has one significant hit, if the corresponding reference sequence is associated with a number of very different species.

Note that the list of disabled taxa is also taken into consideration when parsing a BLAST file. Any taxa that are disabled are ignored when attempting to determine the taxon associated with a match, unless all recognized names are disabled, in which case the disabled names are used.

Weighted LCA Algorithm
The weighted LCA algorithm is identical to the weighted LCA algorithm used in Metascope. It operates as follows: In a first round of analysis, each reference sequence is given a weight. This is the number of reads that align to the given reference and that have the property that all the significant alignments for the read are to the same species as the reference sequence (but can also be to a strain or sub-species below the species node). In a second round of analysis, each read is placed on the node that is above 75% of the total weight of all references for which the read has a significant alignment.
The Weighted LCA algorithm will assign reads more specifically than the naive LCA algorithm. Because it performs two rounds of read and match analysis, it takes twice as long as the naive algorithm.

anneng

https://github.com/etheleon/pymegan
对blast2lca做了python封装

anneng

https://currentprotocols.onlinelibrary.wiley.com/doi/10.1002/cpz1.59
这个文章写的很好推荐了 Diamond+megan LCA 里面也提到了为什么要对比蛋白库（但是感觉有点牵强）主要的论点是核酸库毕竟已知的物种很少