Reply to "Species classification with Diamond and BASTA" on Thu, 07 Apr 2022 07:49:26 GMT

anneng — Thu, 07 Apr 2022 07:49:26 GMT

https://currentprotocols.onlinelibrary.wiley.com/doi/10.1002/cpz1.59
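This article is well written. It recommends Diamond + MEGAN LCA, and it also explains why reads should be aligned against a protein database (though the argument feels a bit forced). The main point is that nucleotide databases, after all, cover very few known species.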

Reply to "Species classification with Diamond and BASTA" on Thu, 18 Mar 2021 09:12:35 GMT

anneng — Thu, 18 Mar 2021 09:12:35 GMT

https://github.com/etheleon/pymegan
A Python wrapper around blast2lca.

Reply to "Species classification with Diamond and BASTA" on Thu, 18 Mar 2021 09:08:42 GMT

anneng — Thu, 18 Mar 2021 09:08:42 GMT

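A major limitation of BASTA is that it can only process one sample at a time.
https://github.com/husonlab/megan-ce
megan-ce (the community edition of MEGAN) includes a blast2lca tool that can be used on its own.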

https://www.seqomics.hu/index.php?option=com_content&view=article&id=58&catid=2&Itemid=137
LCA-assignment algorithm
The main problem addressed by MEGAN is to compute a “species profile” by assigning the reads from a metagenomics sequencing experiment to appropriate taxa in the NCBI taxonomy. At present, this program implements the following naive approach to this problem:

1. Compare a given set of DNA reads to a database of known sequences, such as NCBI-NR or NCBI-NT, using a sequence comparison tool such as BLAST.
2. Process this data to determine all hits of taxa by reads.
3. For each read r, let H be the set of all taxa that r hits.
4. Find the lowest node v in the NCBI taxonomy that encompasses the set of hit taxa H and assign the read r to the taxon represented by v.

We call this the naive LCA-assignment algorithm (LCA = “lowest common ancestor”). In this approach, every read is assigned to some taxon. If the read aligns very specifically only to a single taxon, then it is assigned to that taxon. The less specifically a read hits taxa, the higher up in the taxonomy it is placed. Reads that hit ubiquitously may even be assigned to the root node of the NCBI taxonomy.

If a read has significant matches to two different taxa a and b, where a is an ancestor of b in the NCBI taxonomy, then the match to the ancestor a is discarded and only the more specific match to b is used.
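
As a rough illustration of the naive LCA assignment described above, here is a minimal Python sketch. It is not MEGAN's code: the toy taxonomy, the taxon names in it, and the function names (lineage, drop_ancestors, naive_lca) are all made up for the example, and reads without hits are simply left unassigned.

```python
# Toy taxonomy as a child -> parent map. The names are hypothetical
# stand-ins for NCBI taxonomy nodes, not a real taxonomy dump.
PARENT = {
    "Bacteria": "root",
    "Proteobacteria": "Bacteria",
    "Escherichia": "Proteobacteria",
    "Escherichia coli": "Escherichia",
    "Escherichia albertii": "Escherichia",
    "Firmicutes": "Bacteria",
    "Bacillus subtilis": "Firmicutes",
}


def lineage(taxon):
    """Return the path from taxon up to the root, taxon first."""
    path = [taxon]
    while path[-1] != "root":
        path.append(PARENT[path[-1]])
    return path


def drop_ancestors(hits):
    """Discard any hit taxon that is an ancestor of another hit taxon."""
    hits = set(hits)
    return {
        t for t in hits
        if not any(t != other and t in lineage(other) for other in hits)
    }


def naive_lca(hits):
    """Assign a read to the lowest node that encompasses all its hit taxa."""
    hits = drop_ancestors(hits)
    if not hits:
        return None  # assumption of this sketch: no hits means unassigned
    paths = [lineage(t) for t in hits]
    common = set(paths[0]).intersection(*paths[1:])
    # Walking up one lineage, the first node shared by all lineages is the LCA.
    return next(node for node in paths[0] if node in common)


print(naive_lca({"Escherichia coli", "Escherichia albertii"}))
print(naive_lca({"Escherichia coli", "Bacillus subtilis"}))
```

A read that hits only one taxon stays on that taxon, while a read that hits taxa from different phyla ends up on a high-level node, matching the behaviour described in the quoted text.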

The program provides a threshold for the bit score of hits. Any hit that falls below the threshold is discarded. Secondly, a threshold can be set to discard any hit whose score falls below a given percentage of the best hit. Finally, a third threshold is used to report only taxa that are hit by a minimal number of reads or minimal percent of all assigned reads. By default, the program requires at least 0.1% of all assigned reads to hit a taxon, before that taxon is deemed present. All reads that are initially assigned to a taxon that is not deemed present are pushed up the taxonomy until a node is reached that has enough reads. This is set using the Min Support Percent or Min Support item.
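
The three thresholds can be sketched in Python roughly as follows. This is only one reading of the paragraph above, not MEGAN's implementation: the parameter names (min_score, min_percent_of_best, min_support_percent), the default values for the first two, the toy taxonomy, and the rule of pushing the deepest under-supported node first are all assumptions made for the example.

```python
# Toy taxonomy (child -> parent); all names are hypothetical stand-ins.
PARENT = {
    "Bacteria": "root",
    "Proteobacteria": "Bacteria",
    "Escherichia": "Proteobacteria",
    "Escherichia coli": "Escherichia",
    "Escherichia albertii": "Escherichia",
    "Firmicutes": "Bacteria",
    "Bacillus subtilis": "Firmicutes",
}


def filter_hits(hits, min_score=50.0, min_percent_of_best=90.0):
    """Filter a read's hits, given as (taxon, bit_score) pairs.

    Drops hits below min_score, then hits scoring less than
    min_percent_of_best percent of the best remaining hit.
    """
    kept = [(t, s) for t, s in hits if s >= min_score]
    if not kept:
        return []
    cutoff = max(s for _, s in kept) * min_percent_of_best / 100.0
    return [(t, s) for t, s in kept if s >= cutoff]


def depth(taxon):
    """Number of steps from taxon up to the root."""
    d = 0
    while taxon != "root":
        taxon = PARENT[taxon]
        d += 1
    return d


def apply_min_support(counts, min_support_percent=0.1):
    """Push reads from under-supported taxa up to their parents.

    counts maps taxon -> number of reads assigned to it.
    """
    threshold = sum(counts.values()) * min_support_percent / 100.0
    result = dict(counts)
    while True:
        weak = [t for t, n in result.items()
                if t != "root" and 0 < n < threshold]
        if not weak:
            return result
        node = max(weak, key=depth)  # deepest under-supported node first
        result[PARENT[node]] = result.get(PARENT[node], 0) + result.pop(node)


counts = {"Escherichia coli": 990, "Escherichia albertii": 5,
          "Bacillus subtilis": 5}
print(apply_min_support(counts, min_support_percent=1.0))
```

In the usage example the two rare taxa do not reach 1% of the assigned reads on their own, so their reads travel up the tree until they meet at a node with enough support.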

Taxa in the NCBI taxonomy can be excluded from the analysis. For example, taxa listed under root - unclassified sequences - metagenomes may give rise to matches that force the algorithm to place reads on the root node of the taxonomy. This feature is controlled by the Preferences→Taxon Disabling menu. At present, the set of disabled taxa is saved as a program property and not as part of a MEGAN document.

Note that the LCA-assignment algorithm is already used on a smaller scale when parsing individual BLAST matches. This is because an entry in a reference database may have more than one taxon associated with it. For example, in the NCBI-NR database, an entry may be associated with up to 1000 different taxa. This implies, in particular, that a read may be assigned to a high-level node (even the root node), even though it only has one significant hit, if the corresponding reference sequence is associated with a number of very different species.

Note that the list of disabled taxa is also taken into consideration when parsing a BLAST file. Any taxa that are disabled are ignored when attempting to determine the taxon associated with a match, unless all recognized names are disabled, in which case the disabled names are used.

Weighted LCA Algorithm
The weighted LCA algorithm is identical to the weighted LCA algorithm used in Metascope. It operates as follows: In a first round of analysis, each reference sequence is given a weight. This is the number of reads that align to the given reference and that have the property that all the significant alignments for the read are to the same species as the reference sequence (but can also be to a strain or sub-species below the species node). In a second round of analysis, each read is placed on the node that is above 75% of the total weight of all references for which the read has a significant alignment.
The Weighted LCA algorithm will assign reads more specifically than the naive LCA algorithm. Because it performs two rounds of read and match analysis, it takes twice as long as the naive algorithm.
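
To make the two rounds more concrete, here is a simplified Python sketch of the weighted LCA idea. It is an interpretation of the description above, not MEGAN's or Metascope's code: the toy taxonomy, the SPECIES set, the reference names, reading "above 75%" as "at least 75%", and the fallback to equal weights when no hit reference has any weight are all assumptions of the example.

```python
from collections import defaultdict

# Toy taxonomy (child -> parent) and its species-rank nodes; hypothetical names.
PARENT = {
    "Bacteria": "root",
    "Escherichia": "Bacteria",
    "Escherichia coli": "Escherichia",
    "E. coli K-12": "Escherichia coli",
    "Escherichia albertii": "Escherichia",
}
SPECIES = {"Escherichia coli", "Escherichia albertii"}


def lineage(taxon):
    """Return the path from taxon up to the root, taxon first."""
    path = [taxon]
    while path[-1] != "root":
        path.append(PARENT[path[-1]])
    return path


def species_of(taxon):
    """Species node at or above taxon, or None if taxon is above species."""
    return next((t for t in lineage(taxon) if t in SPECIES), None)


def reference_weights(read_hits, ref_taxon):
    """Round 1: count, per reference, the reads whose hits all share one species."""
    weights = defaultdict(int)
    for refs in read_hits.values():
        species = {species_of(ref_taxon[r]) for r in refs}
        if len(species) == 1 and None not in species:
            for r in set(refs):
                weights[r] += 1
    return dict(weights)


def weighted_lca(read_hits, ref_taxon, percent=75.0):
    """Round 2: place each read on the lowest node covering percent of its weight."""
    weights = reference_weights(read_hits, ref_taxon)
    placement = {}
    for read, refs in read_hits.items():
        if not refs:
            continue  # reads without hits are skipped in this sketch
        w = {r: weights.get(r, 0) for r in set(refs)}
        if sum(w.values()) == 0:
            w = {r: 1 for r in w}  # assumption: fall back to equal weights
        total = sum(w.values())
        covered = defaultdict(float)
        for r, x in w.items():
            for node in lineage(ref_taxon[r]):
                covered[node] += x
        enough = [n for n, c in covered.items() if c >= total * percent / 100.0]
        placement[read] = max(enough, key=lambda n: len(lineage(n)))
    return placement


ref_taxon = {"refA": "E. coli K-12", "refB": "Escherichia coli",
             "refC": "Escherichia albertii"}
read_hits = {"r1": ["refA"], "r2": ["refA", "refB"], "r3": ["refA"],
             "r4": ["refB"], "r5": ["refA", "refC"], "r6": ["refC"]}
print(weighted_lca(read_hits, ref_taxon))
```

In this toy data, read r5 hits references from two species, so the naive LCA would place it on the genus node. Because refA carries most of the weight, the weighted version places it more specifically, which is the effect the quoted text describes.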

Reply to "Species classification with Diamond and BASTA" on Thu, 01 Jul 2021 07:47:05 GMT

anneng — Thu, 01 Jul 2021 07:47:05 GMT

Aligning against a protein database with diamond may be problematic; use blastn directly against the nt database instead.
# Convert FASTQ to FASTA
seqtk seq -a all.fastq > all.fasta
blastn -query all.fasta -db /ceph_disk2/MetaDatabase/NCBI_blast_db_FASTA/nt/ntdata/nt -num_threads 10 -out all.m8 -outfmt 6 -evalue 0.001
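
For anyone curious what the FASTQ to FASTA step amounts to, here is a small Python sketch of the same conversion. It is not how seqtk is implemented, and it assumes plain four-line FASTQ records (no wrapped sequences); the function name fastq_to_fasta is made up for the example.

```python
import io


def fastq_to_fasta(fastq_handle, fasta_handle):
    """Write one FASTA record per 4-line FASTQ record; return the record count."""
    count = 0
    while True:
        header = fastq_handle.readline()
        if not header:
            break  # end of input
        if not header.strip():
            continue  # tolerate blank lines between records
        if not header.startswith("@"):
            raise ValueError("expected a FASTQ header line starting with '@'")
        seq = fastq_handle.readline().rstrip("\n")
        fastq_handle.readline()  # '+' separator line, ignored
        fastq_handle.readline()  # quality string, ignored
        fasta_handle.write(">" + header[1:].rstrip("\n") + "\n" + seq + "\n")
        count += 1
    return count


fastq = io.StringIO("@r1 desc\nACGT\n+\nIIII\n@r2\nGG\n+\nII\n")
fasta = io.StringIO()
n_records = fastq_to_fasta(fastq, fasta)
print(n_records)
print(fasta.getvalue(), end="")
```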

Reply to "Species classification with Diamond and BASTA" on Sat, 27 Feb 2021 13:07:02 GMT

anneng — Sat, 27 Feb 2021 13:07:02 GMT

m8 format
qseqid means Query Seq-id
sseqid means Subject Seq-id
pident means Percentage of identical matches
length means Alignment length
mismatch means Number of mismatches
gapopen means Number of gap openings
qstart means Start of alignment in query
qend means End of alignment in query
sstart means Start of alignment in subject
send means End of alignment in subject
evalue means Expect value
bitscore means Bit score
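
A minimal Python sketch for reading these twelve columns from a tab-separated m8 file could look like this. The names M8Hit and parse_m8 are made up for the example, and the sample line is invented test data, not real BLAST output.

```python
import io
from typing import NamedTuple


class M8Hit(NamedTuple):
    """One alignment row with the twelve columns listed above."""
    qseqid: str
    sseqid: str
    pident: float
    length: int
    mismatch: int
    gapopen: int
    qstart: int
    qend: int
    sstart: int
    send: int
    evalue: float
    bitscore: float


# Converters for each column, in file order.
CASTS = (str, str, float, int, int, int, int, int, int, int, float, float)


def parse_m8(handle):
    """Yield an M8Hit for every non-empty, non-comment line."""
    for line in handle:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) < len(CASTS):
            raise ValueError("expected at least 12 tab-separated columns")
        yield M8Hit(*(cast(value) for cast, value in zip(CASTS, fields)))


sample = io.StringIO(
    "read1\tsubjectX\t98.5\t200\t3\t0\t1\t200\t50\t249\t1e-100\t360\n"
)
hits = list(parse_m8(sample))
print(hits[0].sseqid, hits[0].pident, hits[0].bitscore)
```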

Note: older versions of blast had an -m 8 option, which is where the diamond "m8" name comes from; the equivalent is now -outfmt 6.
it should be '-outfmt 6' for blast+ and '-m 8' for legacy blast
https://www.biostars.org/p/166013/

Reply to "Species classification with Diamond and BASTA" on Thu, 03 Dec 2020 06:18:34 GMT

anneng — Thu, 03 Dec 2020 06:18:34 GMT

Reply to "Species classification with Diamond and BASTA" on Thu, 03 Dec 2020 06:18:11 GMT

anneng — Thu, 03 Dec 2020 06:18:11 GMT

https://cran.r-project.org/web/packages/taxonomizr/readme/README.html
Convert accession numbers to taxonomy

https://github.com/timkahlke/BASTA