简化基因组流程开发笔记

ice-melt

process_radtags

process_radtags -p in_dir [--paired [--interleaved]] [-b barcode_file] -o out_dir -e enz [-c] [-q] [-r] [-t len]
process_radtags -f in_file [-b barcode_file] -o out_dir -e enz [-c] [-q] [-r] [-t len]
process_radtags -1 pair_1 -2 pair_2 [-b barcode_file] -o out_dir -e enz [-c] [-q] [-r] [-t len]

  p: path to a directory of files.
  P,--paired: files contained within the directory are paired.
  I,--interleaved: specify that the paired-end reads are interleaved in single files.
  i: input file type, either 'fastq', 'gzfastq' (gzipped fastq), 'bam', or 'bustard' (default: guess, or gzfastq if unable to).
  b: path to a file containing barcodes for this run, omit to ignore any barcoding.
  o: path to output the processed files.
  f: path to the input file if processing single-end sequences.
  1: first input file in a set of paired-end sequences.
  2: second input file in a set of paired-end sequences.
  c,--clean: clean data, remove any read with an uncalled base.
  q,--quality: discard reads with low quality scores.
  r,--rescue: rescue barcodes and RAD-Tags.
  t: truncate final read length to this value.
  D: capture discarded reads to a file.
  E: specify how quality scores are encoded, 'phred33' (Illumina 1.8+/Sanger, default) or 'phred64' (Illumina 1.3-1.5).
  w: set the size of the sliding window as a fraction of the read length, between 0 and 1 (default 0.15).
  s: set the score limit. If the average score within the sliding window drops below this value, the read is discarded (default 10).
  y: output type, either 'fastq', 'gzfastq', 'fasta', or 'gzfasta' (default: match input type).

  Barcode options:
    --inline-null:   barcode is inline with sequence, occurs only on single-end read (default).
    --index-null:    barcode is provded in FASTQ header (Illumina i5 or i7 read).
    --null-index:    barcode is provded in FASTQ header (Illumina i7 read if both i5 and i7 read are provided).
    --inline-inline: barcode is inline with sequence, occurs on single and paired-end read.
    --index-index:   barcode is provded in FASTQ header (Illumina i5 and i7 reads).
    --inline-index:  barcode is inline with sequence on single-end read and occurs in FASTQ header (from either i5 or i7 read).
    --index-inline:  barcode occurs in FASTQ header (Illumina i5 or i7 read) and is inline with single-end sequence (for single-end data) on paired-end read (for paired-end data).

  Restriction enzyme options:
    -e <enz>, --renz-1 <enz>: provide the restriction enzyme used (cut site occurs on single-end read)
    --renz-2 <enz>: if a double digest was used, provide the second restriction enzyme used (cut site occurs on the paired-end read).
    Currently supported enzymes include:
      'aciI', 'ageI', 'aluI', 'apaLI', 'apeKI', 'apoI', 'aseI', 'bamHI', 
      'bbvCI', 'bfaI', 'bfuCI', 'bgIII', 'bsaHI', 'bspDI', 'bstYI', 'btgI', 
      'cac8I', 'claI', 'csp6I', 'ddeI', 'dpnII', 'eaeI', 'ecoRI', 'ecoRV', 
      'ecoT22I', 'haeIII', 'hinP1I', 'hindIII', 'hpaII', 'hpyCH4IV', 'kpnI', 'mluCI', 
      'mseI', 'mslI', 'mspI', 'ncoI', 'ndeI', 'nheI', 'nlaIII', 'notI', 
      'nsiI', 'nspI', 'pacI', 'pspXI', 'pstI', 'rsaI', 'sacI', 'sau3AI', 
      'sbfI', 'sexAI', 'sgrAI', 'speI', 'sphI', 'taqI', 'xbaI', or 'xhoI'
      

  Protocol-specific options:
    --bestrad: library was generated using BestRAD, check for restriction enzyme on either read and potentially tranpose reads.

  Adapter options:
    --adapter-1 <sequence>: provide adaptor sequence that may occur on the single-end read for filtering.
    --adapter-2 <sequence>: provide adaptor sequence that may occur on the paired-read for filtering.
      --adapter-mm <mismatches>: number of mismatches allowed in the adapter sequence.

  Output options:
    --retain-header: retain unmodified FASTQ headers in the output.
    --merge: if no barcodes are specified, merge all input files into a single output file.

  Advanced options:
    --filter-illumina: discard reads that have been marked by Illumina's chastity/purity filter as failing.
    --disable-rad-check: disable checking if the RAD site is intact.
    --len-limit <limit>: specify a minimum sequence length (useful if your data has already been trimmed).
    --barcode-dist-1: the number of allowed mismatches when rescuing single-end barcodes (default 1).
    --barcode-dist-2: the number of allowed mismatches when rescuing paired-end barcodes (defaults to --barcode-dist-1).

分析：

输入支持目录和文件两种方式
- 目录：通过-P/--paired 指定是否双端数据
- 文件：单端-f指定文件；双端-1\-2指定R1和R2文件
可选参数 [-c] [-q] [-r] [-t len]
- -c 表示数据清洗时去除表示为N的碱基，
- -q 表示数据清理时要去除低质量碱基 ,
- -r 表示要抢救下barcode和RAD-tag。
单选参数：
- barcode 类型 -b
- 建库所用的限制性内切酶 -e, --renz-1\--renz-2
- 质量分数编码 -E,有默认值 phred33

ice-melt

参考基因组创建索引

烟草的参考基因组下载链接：
https://www.ncbi.nlm.nih.gov/assembly/GCA_000715135.1
三刺鱼参考基因组：
http://asia.ensembl.org/Gasterosteus_aculeatus/Info/Index

wget -q ftp://ftp.ensembl.org/pub/release-91/fasta/gasterosteus_aculeatus/dna/Gasterosteus_aculeatus.BROADS1.dna.toplevel.fa.gz

gzip -d Gasterosteus_aculeatus.BROADS1.dna.toplevel.fa.gz

创建索引

stacks不直接参与比对，而是处理不同比对软件得到的BAM文件
这里建立bwa的索引

genome_fa=genome/Gasterosteus_aculeatus.BROADS1.dna.toplevel.fa

bwa index -p index/bwa/gac $genome_fa &> index/bwa/bwa_index.oe

# 结果如下
|-- genome
|   |-- Gasterosteus_aculeatus.BROADS1.dna.toplevel.fa
|-- index
    |-- bwa
        |-- bwa_index.oe
        |-- gac.amb
        |-- gac.ann
        |-- gac.bwt
        |-- gac.pac
        |-- gac.sa

ice-melt

@ice-melt

这一步需要留意自己的单端测序，还是双端测序，barcode是在read中，还是在FASTQ的header中，是否还需要去接头序列，是否是双酶切建库等。

另外这一步比较耗时，尽量脱机运行或者提交到计算节点中，不然突然断网导致运行终止就更浪费时间了。
将运行结果记录到日志文件中，方便后期检查报错。

process_radtags -p $raw_dir -b $barcode_file \
    -o $processed_data/ -e sbfI --inline_null \
    -c -q -r &> $processed_data/process_radtags.lane1.oe&

ice-melt

@ice-melt 在简化基因组流程开发笔记中说：

参考基因组创建索引

烟草的参考基因组下载链接：
https://www.ncbi.nlm.nih.gov/assembly/GCA_000715135.1
三刺鱼参考基因组：
http://asia.ensembl.org/Gasterosteus_aculeatus/Info/Index
wget -q ftp://ftp.ensembl.org/pub/release-91/fasta/gasterosteus_aculeatus/dna/Gasterosteus_aculeatus.BROADS1.dna.toplevel.fa.gz

gzip -d Gasterosteus_aculeatus.BROADS1.dna.toplevel.fa.gz
创建索引

stacks不直接参与比对，而是处理不同比对软件得到的BAM文件
这里建立bwa的索引
genome_fa=genome/Gasterosteus_aculeatus.BROADS1.dna.toplevel.fa

bwa index -p index/bwa/gac $genome_fa &> index/bwa/bwa_index.oe

# 结果如下
|-- genome
|   |-- Gasterosteus_aculeatus.BROADS1.dna.toplevel.fa
|-- index
    |-- bwa
        |-- bwa_index.oe
        |-- gac.amb
        |-- gac.ann
        |-- gac.bwt
        |-- gac.pac
        |-- gac.sa

生成 bam/sam 文件

# sam 文件
bwa mem -M ../genome/index/bwa/gcf T5_R1.fq.gz T5_R2.fq.gz > T5_R2.sam &

# bam 文件
bwa mem -M ../genome/index/bwa/gcf T5_R1.fq.gz T5_R2.fq.gz |samtools view -b > T5_R2.bam &

如果创建参考基因组索引没有指定索引文件夹， bwa mem 不需要指定-M参数
samtools 可以把bam转成sam格式
可以指定 -t 增加线程数
比对的结果可以查看一下质量
```
samtools flagstat T5_R2.bam 
```

ice-melt

@ice-melt

process_radtags 双端处理格式


    // Parse a file name that looks like: lane6_NoIndex_L006_R1_003.fastq
    // but exclude the paired-end files:  lane6_NoIndex_L006_R2_003.fastq
    //
    // Another example could be:          GfddRAD1_001_ATCACG_L008_R1_001.fastq.gz
    // and excluding the paired-end file: GfddRAD1_001_ATCACG_L008_R2_001.fastq.gz