fasta fastq

anneng

https://bioinformatics.stackexchange.com/questions/14/what-is-the-difference-between-fasta-fastq-and-sam-file-formats

Let’s start with what they have in common: all three formats store sequence data and sequence metadata. Furthermore, all three formats are text-based.

However, beyond that all three formats are different and serve different purposes.

Let’s start with the simplest format:

FASTA
FASTA stores a variable number of sequence records, and for each record it stores the sequence itself, and a sequence ID. Each record starts with a header line whose first character is >, followed by the sequence ID. The next lines of a record contain the actual sequence.

The Wikipedia article gives several examples for peptide sequences, but since FASTQ and SAM are used almost exclusively for nucleotide sequences, here’s a nucleotide example:

>Mus_musculus_tRNA-Ala-AGC-1-1 (chr13.trna34-AlaAGC)
GGGGGTGTAGCTCAGTGGTAGAGCGCGTGCTTAGCATGCACGAGGcCCTGGGTTCGATCC
CCAGCACCTCCA
>Mus_musculus_tRNA-Ala-AGC-10-1 (chr13.trna457-AlaAGC)
GGGGGATTAGCTCAAATGGTAGAGCGCTCGCTTAGCATGCAAGAGGtAGTGGGATCGATG
CCCACATCCTCCA
The ID can be in any arbitrary format, although several conventions exist.

In the context of nucleotide sequences, FASTA is mostly used to store reference data, that is, data extracted from a curated database. The example above is adapted from GtRNAdb (a database of tRNA sequences).
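
To make the record structure concrete, here is a minimal sketch of a FASTA reader in Python. It is not a reference implementation: the function name parse_fasta is made up for this illustration, and the sketch assumes that every header line starts with > and that blank lines can be ignored.

```python
from typing import Iterable, Iterator, Tuple


def parse_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from the lines of a FASTA file.

    The header is everything after the leading '>'; sequence lines that
    belong to one record are joined into a single string.
    """
    header = None
    chunks = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # assumption: blank lines carry no meaning
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)  # finish the previous record
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)  # finish the last record


example = """\
>Mus_musculus_tRNA-Ala-AGC-1-1 (chr13.trna34-AlaAGC)
GGGGGTGTAGCTCAGTGGTAGAGCGCGTGCTTAGCATGCACGAGGcCCTGGGTTCGATCC
CCAGCACCTCCA
>Mus_musculus_tRNA-Ala-AGC-10-1 (chr13.trna457-AlaAGC)
GGGGGATTAGCTCAAATGGTAGAGCGCTCGCTTAGCATGCAAGAGGtAGTGGGATCGATG
CCCACATCCTCCA
"""

records = list(parse_fasta(example.splitlines()))
for header, seq in records:
    # The first whitespace-separated token is conventionally used as the ID.
    print(header.split()[0], len(seq))
```

Splitting the header at the first whitespace to get an ID is only one of the conventions mentioned above; the format itself does not mandate it.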

FASTQ
FASTQ was conceived to solve a specific problem arising during sequencing: due to how different sequencing technologies work, the confidence in each base call (that is, the estimated probability of having correctly identified a given nucleotide) varies. This confidence is expressed as a Phred quality score. FASTA has no standardised way of encoding this. By contrast, a FASTQ record stores one quality score for each nucleotide of its sequence.
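
As a rough illustration of what a Phred score means: a score Q corresponds to an estimated error probability of 10^(-Q/10), so Q = 40 means roughly one wrong call in 10,000. The sketch below assumes the common "Phred+33" encoding, in which each score is stored as the ASCII character with code Q + 33. The text above does not say which encoding is in use, and older files may use a different offset. The function names are made up for this example.

```python
import math

PHRED_OFFSET = 33  # assumption: Sanger-style "Phred+33" ASCII encoding


def char_to_phred(ch: str, offset: int = PHRED_OFFSET) -> int:
    """Decode one quality character into its Phred score."""
    return ord(ch) - offset


def phred_to_error_prob(q: float) -> float:
    """Estimated probability that the base call is wrong."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> float:
    """Inverse of phred_to_error_prob."""
    return -10 * math.log10(p)


for ch in "I96":
    q = char_to_phred(ch)
    print(ch, q, phred_to_error_prob(q))
```

Under this assumption the character I decodes to a score of 40, while 9 and 6 decode to 24 and 21, that is, noticeably less confident calls.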

A FASTQ record has the following format:

A line starting with @, containing the sequence ID.
One or more lines that contain the sequence.
A line starting with +, optionally followed by a repeat of the sequence ID.
One or more lines that contain the quality scores.
Here’s an example of a FASTQ file with two records:

@071112_SLXA-EAS1_s_7:5:1:817:345
GGGTGATGGCCGCTGCCGATGGCGTC
AAATCCCACC
+
IIIIIIIIIIIIIIIIIIIIIIIIII
IIII9IG9IC
@071112_SLXA-EAS1_s_7:5:1:801:338
GTTCAGGGATACGACGTTTGTATTTTAAGAATCTGA
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII6IBI
FASTQ files are mostly used to store short-read data from high-throughput sequencing experiments. The sequence and quality scores are usually put into a single line each, and indeed many tools assume that each record in a FASTQ file is exactly four lines long, even though this isn’t guaranteed.

As with FASTA, the format of the sequence ID isn’t standardised, but different producers of FASTQ use fixed notations that follow strict conventions.
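
Because records are not guaranteed to be four lines long, a robust reader cannot simply count lines, and it cannot look for @ either, since @ is also a valid quality character. One way around this, sketched below, is to read quality lines until they are as long as the sequence. This is an illustration only, not how any particular tool does it; parse_fastq is a made-up name, and the quality strings of the example are written as "I" * n purely to keep their lengths easy to check.

```python
from typing import Iterable, Iterator, Tuple


def parse_fastq(lines: Iterable[str]) -> Iterator[Tuple[str, str, str]]:
    """Yield (sequence ID, sequence, quality string) from FASTQ lines.

    Handles sequences and qualities that are wrapped over several lines.
    """
    it = (line.rstrip("\r\n") for line in lines)
    for line in it:
        if not line:
            continue  # assumption: blank lines between records are ignored
        if not line.startswith("@"):
            raise ValueError(f"expected '@' header, got {line!r}")
        seq_id = line[1:]

        # Sequence lines run until the '+' separator line.
        seq_chunks = []
        for line in it:
            if line.startswith("+"):
                break
            seq_chunks.append(line)
        else:
            raise ValueError(f"record {seq_id!r} has no '+' line")
        seq = "".join(seq_chunks)

        # Quality lines run until they cover the whole sequence.
        qual_chunks = []
        qual_len = 0
        while qual_len < len(seq):
            try:
                line = next(it)
            except StopIteration:
                raise ValueError(f"record {seq_id!r} is truncated") from None
            qual_chunks.append(line)
            qual_len += len(line)
        if qual_len != len(seq):
            raise ValueError(f"record {seq_id!r}: quality/sequence length mismatch")
        yield seq_id, seq, "".join(qual_chunks)


example = [
    "@071112_SLXA-EAS1_s_7:5:1:817:345",
    "GGGTGATGGCCGCTGCCGATGGCGTC",
    "AAATCCCACC",
    "+",
    "I" * 26,
    "IIII9IG9IC",
    "@071112_SLXA-EAS1_s_7:5:1:801:338",
    "GTTCAGGGATACGACGTTTGTATTTTAAGAATCTGA",
    "+",
    "I" * 32 + "6IBI",
]

records = list(parse_fastq(example))
for seq_id, seq, qual in records:
    print(seq_id, len(seq), len(qual))
```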

SAM
SAM files are so complex that a complete description (the PDF specification) takes 15 pages. So here’s the short version.

The original purpose of SAM files is to store mapping information for sequences from high-throughput sequencing. As a consequence, a SAM record needs to store more than just the sequence and its quality: it also needs to store information about where and how a sequence maps to the reference.

Unlike the previous formats, SAM is tab-delimited, and each record fills exactly one line. A record consists of 11 mandatory fields, optionally followed by further fields in TAG:TYPE:VALUE form. Here’s an example (tabs replaced by fixed-width spacing):

r001  99  chr1  7  30  17M         =  37  39  TTAGATAAAGGATACTG  IIIIIIIIIIIIIIIII
r002   0  chrX  9  30  3S6M1P1I4M  *   0   0  AAAAGATAAGGATA     IIIIIIIIII6IBI     NM:i:1
For a description of the individual fields, refer to the documentation. The relevant bit is this: SAM can express exactly the same information as FASTQ, plus, as mentioned, the mapping information. However, SAM is also used to store read data without mapping information.
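
For orientation, the sketch below splits one SAM line into its 11 mandatory fields and any optional TAG:TYPE:VALUE fields that follow. The field names follow the SAM specification, but the SamRecord class and parse_sam_line function are made up for this illustration, and values of optional fields are left as strings rather than converted by type.

```python
from typing import Dict, NamedTuple, Tuple


class SamRecord(NamedTuple):
    qname: str   # read (query) name
    flag: int    # bitwise flags
    rname: str   # reference sequence name, '*' if unavailable
    pos: int     # 1-based leftmost mapping position, 0 if unavailable
    mapq: int    # mapping quality
    cigar: str   # alignment description, '*' if unavailable
    rnext: str   # reference name of the mate / next read
    pnext: int   # position of the mate / next read
    tlen: int    # observed template length
    seq: str     # read sequence
    qual: str    # per-base qualities, same encoding idea as FASTQ
    tags: Dict[str, Tuple[str, str]]  # optional fields: TAG -> (TYPE, VALUE)

    def is_unmapped(self) -> bool:
        return bool(self.flag & 0x4)  # bit 0x4 marks an unmapped read


def parse_sam_line(line: str) -> SamRecord:
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) < 11:
        raise ValueError(f"expected at least 11 fields, got {len(fields)}")
    qname, flag, rname, pos, mapq, cigar, rnext, pnext, tlen, seq, qual = fields[:11]
    tags = {}
    for field in fields[11:]:
        tag, typ, value = field.split(":", 2)
        tags[tag] = (typ, value)
    return SamRecord(qname, int(flag), rname, int(pos), int(mapq), cigar,
                     rnext, int(pnext), int(tlen), seq, qual, tags)


# The second example record, with real tabs between the fields.
line = "\t".join(["r002", "0", "chrX", "9", "30", "3S6M1P1I4M", "*", "0", "0",
                  "AAAAGATAAGGATA", "IIIIIIIIII6IBI", "NM:i:1"])
rec = parse_sam_line(line)
print(rec.qname, rec.rname, rec.pos, rec.cigar, rec.tags)
```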

In addition to sequence records, SAM files can also contain a header, which stores information about the reference that the sequences were mapped to, and about the tool used to create the SAM file. Header lines precede the sequence records, and each of them starts with @.
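
Since header lines are recognisable by their leading @, separating them from the records is straightforward. The following is a small sketch under that assumption; the function names are invented, and the header content in the example (reference length, tool name) is made up purely for illustration.

```python
from typing import Dict, Iterable, List, Tuple


def split_sam(lines: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Separate header lines (starting with '@') from alignment records."""
    header, records = [], []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line:
            continue
        (header if line.startswith("@") else records).append(line)
    return header, records


def parse_header_line(line: str) -> Tuple[str, Dict[str, str]]:
    """Split a header line into its record type and TAG:VALUE pairs."""
    record_type, *fields = line.split("\t")
    if record_type == "@CO":
        # Comment lines hold free text rather than TAG:VALUE pairs.
        return record_type, {"text": "\t".join(fields)}
    return record_type, dict(field.split(":", 1) for field in fields)


sam_lines = [
    "@HD\tVN:1.6\tSO:coordinate",
    "@SQ\tSN:chr1\tLN:1000",            # hypothetical reference length
    "@PG\tID:aligner\tPN:aligner",      # hypothetical tool name
    "r001\t99\tchr1\t7\t30\t17M\t=\t37\t39\tTTAGATAAAGGATACTG\t" + "I" * 17,
]

header, records = split_sam(sam_lines)
for h in header:
    print(parse_header_line(h))
print(len(records), "record(s)")
```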

SAM itself is almost never used as a storage format; instead, files are stored in BAM format, which is a compressed binary representation of SAM. It stores the same information, just more efficiently, and, in conjunction with a search index, allows fast retrieval of individual records from the middle of the file (= fast random access). Because BAM is compressed (it uses BGZF, a gzip-compatible block compression), BAM files are much more compact than the equivalent plain-text SAM.

The above implies a hierarchy in what the formats can store: FASTA ⊂ FASTQ ⊂ SAM.
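
One way to see this hierarchy is that converting "upwards" loses nothing, while converting "downwards" discards information. The sketch below illustrates both directions for a single read: FASTQ to FASTA drops the qualities, and FASTQ to SAM keeps everything and fills the mapping fields with the placeholder values used for unmapped reads. The function names are made up, and the sketch assumes a non-empty sequence ID and Phred+33 qualities in the FASTQ input.

```python
def fastq_to_fasta(seq_id: str, seq: str, qual: str) -> str:
    """FASTQ -> FASTA: keep ID and sequence, discard the qualities."""
    return f">{seq_id}\n{seq}"


def fastq_to_unmapped_sam(seq_id: str, seq: str, qual: str) -> str:
    """FASTQ -> SAM: keep everything, mark the read as unmapped."""
    # SAM read names cannot contain whitespace, so keep only the first token.
    qname = seq_id.split()[0]
    fields = [
        qname,
        "4",     # FLAG: bit 0x4 set = read is unmapped
        "*",     # RNAME: no reference
        "0",     # POS: no position
        "0",     # MAPQ
        "*",     # CIGAR: no alignment
        "*",     # RNEXT
        "0",     # PNEXT
        "0",     # TLEN
        seq,
        qual,
    ]
    return "\t".join(fields)


read = ("071112_SLXA-EAS1_s_7:5:1:801:338",
        "GTTCAGGGATACGACGTTTGTATTTTAAGAATCTGA",
        "I" * 32 + "6IBI")

print(fastq_to_fasta(*read))
print(fastq_to_unmapped_sam(*read))
```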

In a typical high-throughput analysis workflow, you will encounter all three file types:

FASTA to store the reference genome/transcriptome that the sequence fragments will be mapped to.
FASTQ to store the sequence fragments before mapping.
SAM/BAM to store the sequence fragments after mapping.