暗能星系


    fasta fastq

      Posted by anneng

      https://bioinformatics.stackexchange.com/questions/14/what-is-the-difference-between-fasta-fastq-and-sam-file-formats#:~:text=FASTA to store the reference,the sequence fragments after mapping.

      Let’s start with what they have in common: all three formats store sequence data and sequence metadata. Furthermore, all three formats are text-based.

      However, beyond that all three formats are different and serve different purposes.

      Let’s start with the simplest format:

      FASTA
      FASTA stores a variable number of sequence records, and for each record it stores the sequence itself, and a sequence ID. Each record starts with a header line whose first character is >, followed by the sequence ID. The next lines of a record contain the actual sequence.

      The Wikipedia article gives several examples for peptide sequences, but since FASTQ and SAM are used exclusively (?) for nucleotide sequences, here’s a nucleotide example:

      >Mus_musculus_tRNA-Ala-AGC-1-1 (chr13.trna34-AlaAGC)
      GGGGGTGTAGCTCAGTGGTAGAGCGCGTGCTTAGCATGCACGAGGcCCTGGGTTCGATCC
      CCAGCACCTCCA
      >Mus_musculus_tRNA-Ala-AGC-10-1 (chr13.trna457-AlaAGC)
      GGGGGATTAGCTCAAATGGTAGAGCGCTCGCTTAGCATGCAAGAGGtAGTGGGATCGATG
      CCCACATCCTCCA
      The ID can be in any arbitrary format, although several conventions exist.

      In the context of nucleotide sequences, FASTA is mostly used to store reference data; that is, data extracted from a curated database; the above is adapted from GtRNAdb (a database of tRNA sequences).
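The FASTA record layout described above (a header line starting with >, followed by one or more sequence lines) can be read with a few lines of Python. This is a minimal sketch, not a reference parser: the function name read_fasta is made up for illustration, and treating the first whitespace-delimited word of the header as the ID is one common convention, not a rule of the format.

```python
import io
from typing import Iterator, TextIO, Tuple


def read_fasta(handle: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header = None
    chunks = []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        if line.startswith(">"):
            if header is not None:
                # A new header closes the previous record.
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            # Sequence lines of one record are simply concatenated.
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


example = """\
>Mus_musculus_tRNA-Ala-AGC-1-1 (chr13.trna34-AlaAGC)
GGGGGTGTAGCTCAGTGGTAGAGCGCGTGCTTAGCATGCACGAGGcCCTGGGTTCGATCC
CCAGCACCTCCA
>Mus_musculus_tRNA-Ala-AGC-10-1 (chr13.trna457-AlaAGC)
GGGGGATTAGCTCAAATGGTAGAGCGCTCGCTTAGCATGCAAGAGGtAGTGGGATCGATG
CCCACATCCTCCA
"""

records = list(read_fasta(io.StringIO(example)))
for header, sequence in records:
    seq_id = header.split()[0]  # assumed convention: ID is the first word
    print(seq_id, len(sequence))
```

Because the sequence may be wrapped over any number of lines, the parser only emits a record once it sees the next header or the end of the input.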

      FASTQ
      FASTQ was conceived to solve a specific problem arising during sequencing: Due to how different sequencing technologies work, the confidence in each base call (that is, the estimated probability of having correctly identified a given nucleotide) varies. This is expressed in the Phred quality score. FASTA had no standardised way of encoding this. By contrast, a FASTQ record contains, alongside the sequence, a string of quality scores, one per nucleotide.
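To make the Phred score concrete: it is defined from the estimated error probability P of a base call as Q = -10 · log10(P), so Q = 30 corresponds to a 1 in 1000 chance that the call is wrong. The sketch below converts between the two and decodes a quality string. The ASCII offset of 33 is an assumption (it is the Sanger-style encoding used by most current FASTQ files; older files may use a different offset), and the function names are made up for illustration.

```python
import math
from typing import List

PHRED_OFFSET = 33  # assumption: Sanger-style "Phred+33" encoding


def phred_to_error_prob(q: float) -> float:
    """Error probability implied by a Phred score: P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> float:
    """Phred score for an error probability: Q = -10 * log10(P)."""
    return -10 * math.log10(p)


def decode_quality(quality: str, offset: int = PHRED_OFFSET) -> List[int]:
    """Turn a FASTQ quality string into one integer Phred score per base."""
    return [ord(char) - offset for char in quality]


scores = decode_quality("I9G6")
for q in scores:
    print(q, phred_to_error_prob(q))
```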

      A FASTQ record has the following format:

      1. A line starting with @, containing the sequence ID.
      2. One or more lines that contain the sequence.
      3. A new line starting with the character +, and being either empty or repeating the sequence ID.
      4. One or more lines that contain the quality scores.
      Here’s an example of a FASTQ file with two records:

      @071112_SLXA-EAS1_s_7:5:1:817:345
      GGGTGATGGCCGCTGCCGATGGCGTC
      AAATCCCACC
      +
      IIIIIIIIIIIIIIIIIIIIIIIIII
      IIII9IG9IC
      @071112_SLXA-EAS1_s_7:5:1:801:338
      GTTCAGGGATACGACGTTTGTATTTTAAGAATCTGA
      +
      IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII6IBI
      FASTQ files are mostly used to store short-read data from high-throughput sequencing experiments. The sequence and quality scores are usually put into a single line each, and indeed many tools assume that each record in a FASTQ file is exactly four lines long, even though this isn’t guaranteed.

      As with FASTA, the format of the sequence ID isn’t standardised, but different producers of FASTQ use fixed notations that follow strict conventions.
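Since a record is not guaranteed to be four lines long, a reader that wants to be safe cannot just count lines, and it cannot look for @ to find the next record either, because @ is itself a valid quality character. One way around this, sketched below, is to keep reading quality lines until as many quality characters as bases have been collected. The function name read_fastq and the two tiny records are made up for illustration.

```python
import io
from typing import Iterator, TextIO, Tuple


def read_fastq(handle: TextIO) -> Iterator[Tuple[str, str, str]]:
    """Yield (id, sequence, quality) and accept multi-line records."""
    lines = (line.rstrip("\r\n") for line in handle)
    for line in lines:
        if not line:
            continue  # tolerate blank lines between records
        if not line.startswith("@"):
            raise ValueError(f"expected '@' header line, got {line!r}")
        seq_id = line[1:]
        # Sequence lines run until the '+' separator line.
        seq_chunks = []
        for line in lines:
            if line.startswith("+"):
                break
            seq_chunks.append(line)
        else:
            raise ValueError(f"record {seq_id!r} has no '+' separator")
        sequence = "".join(seq_chunks)
        # Quality lines run until there is one character per base.
        quality = ""
        while len(quality) < len(sequence):
            try:
                quality += next(lines)
            except StopIteration:
                raise ValueError(f"record {seq_id!r} is truncated") from None
        if len(quality) != len(sequence):
            raise ValueError(f"record {seq_id!r}: length mismatch")
        yield seq_id, sequence, quality


# Made-up data: the second quality line of read1 starts with '@'.
example = """\
@read1
ACGTAC
GTAC
+
IIIIII
@III
@read2
ACGT
+read2
II9I
"""

records = list(read_fastq(io.StringIO(example)))
for seq_id, sequence, quality in records:
    print(seq_id, sequence, quality)
```

A parser that assumed four lines per record, or that treated every line starting with @ as a header, would misread the first record here.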

      SAM
      SAM files are so complex that a complete description [PDF] takes 15 pages. So here’s the short version.

      The original purpose of SAM files is to store mapping information for sequences from high-throughput sequencing. As a consequence, a SAM record needs to store more than just the sequence and its quality; it also needs to store information about where and how a sequence maps to the reference.

      Unlike the previous formats, SAM is tab-delimited, and each record fills exactly one line. A record consists of 11 mandatory fields, optionally followed by any number of tagged fields. Here’s an example (tabs replaced by fixed-width spacing):

      r001 99 chr1 7 30 17M = 37 39 TTAGATAAAGGATACTG IIIIIIIIIIIIIIIII
      r002 0 chrX 9 30 3S6M1P1I4M * 0 0 AAAAGATAAGGATA IIIIIIIIII6IBI NM:i:1
      For a description of the individual fields, refer to the documentation. The relevant bit is this: SAM can express exactly the same information as FASTQ, plus, as mentioned, the mapping information. However, SAM is also used to store read data without mapping information.

      In addition to sequence records, SAM files can also contain a header, which stores information about the reference that the sequences were mapped to, and the tool used to create the SAM file. Header lines precede the sequence records, and each of them starts with @.
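As a rough illustration of this layout, the sketch below splits SAM text into header lines and records, and splits each record on tabs into the 11 mandatory fields plus optional TAG:TYPE:VALUE fields. The names SamRecord, parse_sam_line and parse_sam are made up for illustration, the header values in the example are hypothetical, and real data is better handled with an established library.

```python
from typing import Dict, List, NamedTuple, Tuple


class SamRecord(NamedTuple):
    qname: str   # read ID
    flag: int    # bitwise flags
    rname: str   # reference sequence name
    pos: int     # 1-based leftmost mapping position
    mapq: int    # mapping quality
    cigar: str   # alignment description
    rnext: str   # reference name of the mate
    pnext: int   # position of the mate
    tlen: int    # observed template length
    seq: str     # read sequence
    qual: str    # per-base quality string
    tags: Dict[str, Tuple[str, str]]  # optional fields: TAG -> (TYPE, VALUE)


def parse_sam_line(line: str) -> SamRecord:
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError(f"expected at least 11 fields, got {len(fields)}")
    tags = {}
    for field in fields[11:]:
        tag, type_code, value = field.split(":", 2)
        tags[tag] = (type_code, value)
    return SamRecord(
        fields[0], int(fields[1]), fields[2], int(fields[3]), int(fields[4]),
        fields[5], fields[6], int(fields[7]), int(fields[8]),
        fields[9], fields[10], tags,
    )


def parse_sam(text: str) -> Tuple[List[str], List[SamRecord]]:
    """Separate header lines (starting with '@') from alignment records."""
    headers, records = [], []
    for line in text.splitlines():
        if not line:
            continue
        if line.startswith("@"):
            headers.append(line)
        else:
            records.append(parse_sam_line(line))
    return headers, records


example = "\n".join([
    "@HD\tVN:1.6\tSO:coordinate",   # hypothetical header values
    "@SQ\tSN:chr1\tLN:1000",        # hypothetical reference length
    "\t".join(["r001", "99", "chr1", "7", "30", "17M", "=", "37", "39",
               "TTAGATAAAGGATACTG", "I" * 17]),
    "\t".join(["r002", "0", "chrX", "9", "30", "3S6M1P1I4M", "*", "0", "0",
               "AAAAGATAAGGATA", "IIIIIIIIII6IBI", "NM:i:1"]),
])

headers, records = parse_sam(example)
for record in records:
    print(record.qname, record.rname, record.pos, record.cigar, record.tags)
```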

      SAM itself is almost never used as a storage format; instead, files are stored in BAM format, which is a compressed binary representation of SAM. It stores the same information, just more efficiently and, in conjunction with a search index, allows fast retrieval of individual records from the middle of the file (= fast random access). BAM files are much smaller than the equivalent SAM text; how they compare in size with compressed FASTQ or FASTA files depends on the data and on the compression used.

      The above implies a hierarchy in what the formats can store: FASTA ⊂ FASTQ ⊂ SAM.
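The hierarchy can be illustrated by converting downwards and upwards: dropping the quality string from a FASTQ record leaves a FASTA record, and adding placeholder mapping fields turns it into an unmapped SAM record. The sketch below uses made-up function names and a made-up read. The placeholder values (flag 4 for "unmapped", * for unknown names and CIGAR, 0 for unknown positions) follow the SAM conventions for unmapped reads, and the quality string is assumed to be Phred+33 encoded, as SAM expects.

```python
def fastq_to_fasta(seq_id: str, sequence: str, quality: str) -> str:
    """FASTQ -> FASTA: keep ID and sequence, discard the quality scores."""
    return f">{seq_id}\n{sequence}\n"


def fastq_to_unmapped_sam(seq_id: str, sequence: str, quality: str) -> str:
    """FASTQ -> SAM: keep everything, fill mapping fields with placeholders."""
    qname = seq_id.split()[0]  # SAM read names cannot contain whitespace
    fields = [
        qname,
        "4",       # FLAG: segment unmapped
        "*",       # RNAME: no reference
        "0",       # POS: no position
        "0",       # MAPQ
        "*",       # CIGAR: no alignment
        "*",       # RNEXT
        "0",       # PNEXT
        "0",       # TLEN
        sequence,
        quality,   # assumption: already Phred+33 encoded
    ]
    return "\t".join(fields)


# Made-up read used only for illustration.
read = ("read1", "ACGT", "II9I")
fasta_text = fastq_to_fasta(*read)
sam_line = fastq_to_unmapped_sam(*read)
print(fasta_text, end="")
print(sam_line)
```

Going the other way is lossy: a FASTA record has no quality scores to put into FASTQ, and a FASTQ record has no mapping information to put into SAM.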

      In a typical high-throughput analysis workflow, you will encounter all three file types:

      FASTA to store the reference genome/transcriptome that the sequence fragments will be mapped to.
      FASTQ to store the sequence fragments before mapping.
      SAM/BAM to store the sequence fragments after mapping.
