Alignment Software
Common Alignment Tools
A small summary of common alignment tools used by bioinformaticians and how to get started
Alignment tools are central to bioinformatics. From assemblers to database search and similarity calculations, at some point in your work you have probably come across some kind of aligner. In this article, I will talk about the common alignment tools that I have been using.
BLAST
BLAST stands for Basic Local Alignment Search Tool. As the name suggests, it is mostly used for search. Local alignment is useful when we want to find where one sequence is contained in another, hence the search use case.
Applications
BLAST is mostly used as a database search tool because it is fast and its sensitivity can be tuned through parameters. The most common service built on BLAST is the NCBI search (https://blast.ncbi.nlm.nih.gov/Blast.cgi). In research work, BLAST also comes in handy when we want to perform taxonomic annotation or to label sequences with annotations from database sequences, such as plasmid origin or coding and non-coding regions of known strains. In the context of a database search, BLAST is extremely fast. However, when precise, complete base-wise alignments of reads are needed, it is better to switch to a dedicated read aligner such as BWA-MEM or Minimap2.
Algorithm Overview
1. Remove low-complexity regions (tandem repeats and runs of N bases for DNA) from the query sequence
2. Obtain the list of k-mers (words) of the query sequence (k=11 for nucleotides, k=3 for proteins). For protein queries, list the possible matching words for each k-mer and score them using the BLOSUM62 matrix
3. Keep the high-scoring k-mers from step 2, as decided by a specified threshold
4. Scan the database for exact matches to these high-scoring k-mers
5. Extend each match outwards in both directions until the accumulated score starts to drop, which yields the high-scoring segment pairs (HSPs)
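The seed-and-extend idea in the steps above can be sketched in a few lines of Python. This is only a toy illustration for nucleotide sequences, not BLAST's actual implementation: it skips low-complexity filtering, neighbouring words, gapped extension and statistics, and the function names and the `match`, `mismatch` and `x_drop` values are my own illustrative assumptions.

```python
from collections import defaultdict


def build_kmer_index(subject, k):
    """Map every k-mer of the subject sequence to its start positions."""
    index = defaultdict(list)
    for i in range(len(subject) - k + 1):
        index[subject[i:i + k]].append(i)
    return index


def extend(query, subject, q_pos, s_pos, k, match=1, mismatch=-3, x_drop=5):
    """Ungapped extension of an exact k-mer hit in both directions.

    Extension stops once the running score falls more than x_drop
    below the best score seen so far (illustrative values only).
    Returns (score, q_start, q_end, s_start) for the best segment.
    """
    score = best = k * match
    # Extend to the right of the seed.
    q_end = q_pos + k
    i, j = q_pos + k, s_pos + k
    while i < len(query) and j < len(subject) and best - score <= x_drop:
        score += match if query[i] == subject[j] else mismatch
        i += 1
        j += 1
        if score > best:
            best, q_end = score, i
    # Extend to the left, continuing from the best score so far.
    score = best
    q_start = q_pos
    i, j = q_pos - 1, s_pos - 1
    while i >= 0 and j >= 0 and best - score <= x_drop:
        score += match if query[i] == subject[j] else mismatch
        if score > best:
            best, q_start = score, i
        i -= 1
        j -= 1
    s_start = s_pos - (q_pos - q_start)
    return best, q_start, q_end, s_start


def seed_and_extend(query, subject, k=11):
    """Return the best-scoring ungapped segment pair, or None if no seed hits."""
    index = build_kmer_index(subject, k)
    best_hit = None
    for q_pos in range(len(query) - k + 1):
        for s_pos in index.get(query[q_pos:q_pos + k], []):
            hit = extend(query, subject, q_pos, s_pos, k)
            if best_hit is None or hit[0] > best_hit[0]:
                best_hit = hit
    return best_hit


if __name__ == "__main__":
    subject = "TTTTTACGTACGTTAGCCATGGAAAAA"
    query = "ACGTACGTTAGCCATGG"
    print(seed_and_extend(query, subject))
```

The index lookup is what makes the search fast: only positions that share an exact k-mer with the query are ever extended.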
The algorithm is simple to explain and fast for a large search space. The database index usually contains k-mers with k=11 for nucleotide sequences; k=3 is used for protein sequences.
Installation
Compiled binaries or source files can be downloaded from here. Compilation can be done as follows:
cd c++
./configure
cd ReleaseMT/build
make all_r
More information can be found here. You can refer here for how to build your own database from sequence files.
BWA
BWA stands for Burrows-Wheeler Aligner, named after the Burrows-Wheeler Transform (BWT). The BWT rearranges a text in a way that makes it easy to compress and to search. This is the key idea behind the popular alignment tool BWA-MEM, which indexes the reference with an FM-index built on the BWT and aligns reads against it. You can read more on Heng Li’s GitHub.
Applications
BWA-MEM is commonly used for aligning short reads to reference genomes. This is a key step in the reference-based assembly of the human genome.
Installation
git clone https://github.com/lh3/bwa.git
cd bwa; make
Using BWA-MEM takes several steps. First, you need to index the reference genome. The tool is designed for short reads, and you can use either paired-end or single-end reads. The following commands are taken from the GitHub page for your reference.
./bwa index ref.fa
./bwa mem ref.fa read-se.fq.gz | gzip -3 > aln-se.sam.gz
./bwa mem ref.fa read1.fq read2.fq | gzip -3 > aln-pe.sam.gz
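To make the Burrows-Wheeler Transform behind BWA a bit more concrete, here is a small Python sketch of the transform, its inverse, and a backward search that counts pattern occurrences using only the transformed text. It is a toy version under my own assumptions (sorted rotations, a `$` terminator, naive rank counting) and is nowhere near how BWA builds or queries its index on a real genome.

```python
def bwt(text, terminator="$"):
    """Burrows-Wheeler Transform via sorted rotations (toy sizes only)."""
    assert terminator not in text
    s = text + terminator
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


def inverse_bwt(transformed, terminator="$"):
    """Recover the original text by repeatedly prepending and sorting."""
    n = len(transformed)
    table = [""] * n
    for _ in range(n):
        table = sorted(transformed[i] + table[i] for i in range(n))
    original = next(row for row in table if row.endswith(terminator))
    return original[:-1]


def count_occurrences(transformed, pattern):
    """Count occurrences of pattern using backward search over the BWT."""
    # first_row[c] = index of the first sorted rotation starting with c.
    counts = {}
    for ch in transformed:
        counts[ch] = counts.get(ch, 0) + 1
    first_row, total = {}, 0
    for ch in sorted(counts):
        first_row[ch] = total
        total += counts[ch]
    lo, hi = 0, len(transformed)  # half-open range of matching rotations
    for ch in reversed(pattern):
        if ch not in first_row:
            return 0
        # Naive rank: number of ch in the BWT before lo / hi.
        lo = first_row[ch] + transformed[:lo].count(ch)
        hi = first_row[ch] + transformed[:hi].count(ch)
        if lo >= hi:
            return 0
    return hi - lo


if __name__ == "__main__":
    transformed = bwt("banana")
    print(transformed)
    print(inverse_bwt(transformed))
    print(count_occurrences(transformed, "ana"))
```

For example, `bwt("banana")` gives `annb$aa`: equal characters end up next to each other, which is what helps compression, and the backward search finds both occurrences of `ana` without ever scanning the original text.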
Minimap2
Minimap2 is my favourite alignment tool; it is very fast and versatile, and it copes well with much longer, noisy sequences. A few of the common use cases are as follows.
Applications
Aligning long noisy reads to reference genomes
Aligning reads to contigs to compute base-wise coverage
Aligning all reads against themselves as a preliminary step for assembly and read correction
Aligning reads to the assembly graph
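As an illustration of the coverage use case, the following Python sketch computes per-base coverage of each contig from alignments in PAF, the default output format of Minimap2. It assumes the standard PAF column layout (target name, length, start and end in columns 6 to 9, with 0-based starts and exclusive ends) and counts the whole aligned span of a read as covered, ignoring gaps inside the alignment. The function name is hypothetical.

```python
def coverage_from_paf(lines):
    """Per-base coverage of each target sequence from PAF alignment lines."""
    deltas = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12:
            continue  # skip blank or malformed lines
        target, target_len = fields[5], int(fields[6])
        start, end = int(fields[7]), int(fields[8])
        # Difference array: +1 where an alignment starts, -1 where it ends.
        diff = deltas.setdefault(target, [0] * (target_len + 1))
        diff[start] += 1
        diff[end] -= 1
    coverage = {}
    for target, diff in deltas.items():
        depth, running = [], 0
        for change in diff[:-1]:
            running += change
            depth.append(running)
        coverage[target] = depth
    return coverage


if __name__ == "__main__":
    paf = [
        "read1\t6\t0\t6\t+\tcontig1\t10\t0\t6\t6\t6\t60",
        "read2\t6\t0\t6\t+\tcontig1\t10\t4\t10\t6\t6\t60",
    ]
    print(coverage_from_paf(paf))
```

The difference array keeps the cost at one update per alignment plus one pass per contig, instead of touching every covered base for every read.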
Algorithm Overview
The algorithm is based on the idea of minimizers. A minimizer is the minimum k-mer in a window of w consecutive k-mers (lexicographically in the simplest formulation; Minimap2 orders k-mers by a hash value instead). Because only the minimizers are indexed and compared, the algorithm is fast, at the cost of some sensitivity.
Minimap2 obtains (k, w) minimizers for all the reference and query sequences. Matching minimizers that occur below a certain frequency in the set of references are called seeds and are used for alignment.
In my experience, the alignment may not be sensitive enough in certain scenarios, for example when I tried to align reads against reads. That is reasonable and is mentioned on the GitHub page. It is wise to use another aligner for such scenarios.
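The minimizer definition above can be sketched in Python as follows. This is a simplified illustration under my own assumptions: it uses plain lexicographic ordering and a single strand, whereas Minimap2 orders k-mers by a hash and also considers the reverse complement, and it applies no frequency filter. The function names are hypothetical.

```python
def minimizers(seq, k, w):
    """Return the sorted list of (position, k-mer) minimizers of seq.

    For every window of w consecutive k-mers, the lexicographically
    smallest k-mer is selected (leftmost on ties).
    """
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    selected = set()
    for start in range(len(kmers) - w + 1):
        window = range(start, start + w)
        best = min(window, key=lambda i: (kmers[i], i))
        selected.add((best, kmers[best]))
    return sorted(selected)


def shared_minimizers(a, b, k, w):
    """Pairs of positions (in a, in b) whose minimizer k-mers are identical."""
    by_kmer = {}
    for pos, kmer in minimizers(b, k, w):
        by_kmer.setdefault(kmer, []).append(pos)
    return [(pos_a, pos_b)
            for pos_a, kmer in minimizers(a, k, w)
            for pos_b in by_kmer.get(kmer, [])]


if __name__ == "__main__":
    reference = "GATTACAGATTACA"
    read = "TTACAGATT"
    print(minimizers(reference, 3, 4))
    print(shared_minimizers(reference, read, 3, 4))
```

In this small example only 5 of the 12 k-mers of the reference are kept as minimizers, which is where the speed comes from. The shared pairs that fall on the same diagonal (same difference between the two positions) are the ones an aligner would chain together as seeds.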
Installation
git clone https://github.com/lh3/minimap2
cd minimap2 && make
For extended use cases, you can refer to the original GitHub page.
I hope this helps someone who is new to the field of bioinformatics. Thanks for reading.
I will introduce a few multiple sequence alignment and visualization tools in a future article.