简化基因组流程开发笔记
-
process_radtags
process_radtags -p in_dir [--paired [--interleaved]] [-b barcode_file] -o out_dir -e enz [-c] [-q] [-r] [-t len] process_radtags -f in_file [-b barcode_file] -o out_dir -e enz [-c] [-q] [-r] [-t len] process_radtags -1 pair_1 -2 pair_2 [-b barcode_file] -o out_dir -e enz [-c] [-q] [-r] [-t len] p: path to a directory of files. P,--paired: files contained within the directory are paired. I,--interleaved: specify that the paired-end reads are interleaved in single files. i: input file type, either 'fastq', 'gzfastq' (gzipped fastq), 'bam', or 'bustard' (default: guess, or gzfastq if unable to). b: path to a file containing barcodes for this run, omit to ignore any barcoding. o: path to output the processed files. f: path to the input file if processing single-end sequences. 1: first input file in a set of paired-end sequences. 2: second input file in a set of paired-end sequences. c,--clean: clean data, remove any read with an uncalled base. q,--quality: discard reads with low quality scores. r,--rescue: rescue barcodes and RAD-Tags. t: truncate final read length to this value. D: capture discarded reads to a file. E: specify how quality scores are encoded, 'phred33' (Illumina 1.8+/Sanger, default) or 'phred64' (Illumina 1.3-1.5). w: set the size of the sliding window as a fraction of the read length, between 0 and 1 (default 0.15). s: set the score limit. If the average score within the sliding window drops below this value, the read is discarded (default 10). y: output type, either 'fastq', 'gzfastq', 'fasta', or 'gzfasta' (default: match input type). Barcode options: --inline-null: barcode is inline with sequence, occurs only on single-end read (default). --index-null: barcode is provded in FASTQ header (Illumina i5 or i7 read). --null-index: barcode is provded in FASTQ header (Illumina i7 read if both i5 and i7 read are provided). --inline-inline: barcode is inline with sequence, occurs on single and paired-end read. --index-index: barcode is provded in FASTQ header (Illumina i5 and i7 reads). --inline-index: barcode is inline with sequence on single-end read and occurs in FASTQ header (from either i5 or i7 read). --index-inline: barcode occurs in FASTQ header (Illumina i5 or i7 read) and is inline with single-end sequence (for single-end data) on paired-end read (for paired-end data). Restriction enzyme options: -e <enz>, --renz-1 <enz>: provide the restriction enzyme used (cut site occurs on single-end read) --renz-2 <enz>: if a double digest was used, provide the second restriction enzyme used (cut site occurs on the paired-end read). Currently supported enzymes include: 'aciI', 'ageI', 'aluI', 'apaLI', 'apeKI', 'apoI', 'aseI', 'bamHI', 'bbvCI', 'bfaI', 'bfuCI', 'bgIII', 'bsaHI', 'bspDI', 'bstYI', 'btgI', 'cac8I', 'claI', 'csp6I', 'ddeI', 'dpnII', 'eaeI', 'ecoRI', 'ecoRV', 'ecoT22I', 'haeIII', 'hinP1I', 'hindIII', 'hpaII', 'hpyCH4IV', 'kpnI', 'mluCI', 'mseI', 'mslI', 'mspI', 'ncoI', 'ndeI', 'nheI', 'nlaIII', 'notI', 'nsiI', 'nspI', 'pacI', 'pspXI', 'pstI', 'rsaI', 'sacI', 'sau3AI', 'sbfI', 'sexAI', 'sgrAI', 'speI', 'sphI', 'taqI', 'xbaI', or 'xhoI' Protocol-specific options: --bestrad: library was generated using BestRAD, check for restriction enzyme on either read and potentially tranpose reads. Adapter options: --adapter-1 <sequence>: provide adaptor sequence that may occur on the single-end read for filtering. --adapter-2 <sequence>: provide adaptor sequence that may occur on the paired-read for filtering. --adapter-mm <mismatches>: number of mismatches allowed in the adapter sequence. Output options: --retain-header: retain unmodified FASTQ headers in the output. --merge: if no barcodes are specified, merge all input files into a single output file. Advanced options: --filter-illumina: discard reads that have been marked by Illumina's chastity/purity filter as failing. --disable-rad-check: disable checking if the RAD site is intact. --len-limit <limit>: specify a minimum sequence length (useful if your data has already been trimmed). --barcode-dist-1: the number of allowed mismatches when rescuing single-end barcodes (default 1). --barcode-dist-2: the number of allowed mismatches when rescuing paired-end barcodes (defaults to --barcode-dist-1).分析:
- 输入支持目录和文件两种方式
- 目录:通过
-P/--paired指定是否双端数据 - 文件:单端
-f指定文件;双端-1\-2指定R1和R2文件
- 目录:通过
- 可选参数
[-c] [-q] [-r] [-t len]-c表示数据清洗时去除表示为N的碱基,-q表示数据清理时要去除低质量碱基 ,-r表示要抢救下barcode和RAD-tag。
- 单选参数:
- barcode 类型
-b - 建库所用的限制性内切酶
-e, --renz-1\--renz-2 - 质量分数编码
-E,有默认值 phred33
- barcode 类型
- 输入支持目录和文件两种方式
-
参考基因组创建索引
烟草的参考基因组下载链接:
https://www.ncbi.nlm.nih.gov/assembly/GCA_000715135.1
三刺鱼参考基因组:
http://asia.ensembl.org/Gasterosteus_aculeatus/Info/Indexwget -q ftp://ftp.ensembl.org/pub/release-91/fasta/gasterosteus_aculeatus/dna/Gasterosteus_aculeatus.BROADS1.dna.toplevel.fa.gz gzip -d Gasterosteus_aculeatus.BROADS1.dna.toplevel.fa.gz创建索引
stacks不直接参与比对,而是处理不同比对软件得到的BAM文件
这里建立bwa的索引genome_fa=genome/Gasterosteus_aculeatus.BROADS1.dna.toplevel.fa bwa index -p index/bwa/gac $genome_fa &> index/bwa/bwa_index.oe # 结果如下 |-- genome | |-- Gasterosteus_aculeatus.BROADS1.dna.toplevel.fa |-- index |-- bwa |-- bwa_index.oe |-- gac.amb |-- gac.ann |-- gac.bwt |-- gac.pac |-- gac.sa -
这一步需要留意自己的单端测序,还是双端测序,barcode是在read中,还是在FASTQ的header中,是否还需要去接头序列,是否是双酶切建库等。
另外这一步比较耗时,尽量脱机运行或者提交到计算节点中,不然突然断网导致运行终止就更浪费时间了。
将运行结果记录到日志文件中,方便后期检查报错。process_radtags -p $raw_dir -b $barcode_file \ -o $processed_data/ -e sbfI --inline_null \ -c -q -r &> $processed_data/process_radtags.lane1.oe& -
@ice-melt 在 简化基因组流程开发笔记 中说:
参考基因组创建索引
烟草的参考基因组下载链接:
https://www.ncbi.nlm.nih.gov/assembly/GCA_000715135.1
三刺鱼参考基因组:
http://asia.ensembl.org/Gasterosteus_aculeatus/Info/Indexwget -q ftp://ftp.ensembl.org/pub/release-91/fasta/gasterosteus_aculeatus/dna/Gasterosteus_aculeatus.BROADS1.dna.toplevel.fa.gz gzip -d Gasterosteus_aculeatus.BROADS1.dna.toplevel.fa.gz创建索引
stacks不直接参与比对,而是处理不同比对软件得到的BAM文件
这里建立bwa的索引genome_fa=genome/Gasterosteus_aculeatus.BROADS1.dna.toplevel.fa bwa index -p index/bwa/gac $genome_fa &> index/bwa/bwa_index.oe # 结果如下 |-- genome | |-- Gasterosteus_aculeatus.BROADS1.dna.toplevel.fa |-- index |-- bwa |-- bwa_index.oe |-- gac.amb |-- gac.ann |-- gac.bwt |-- gac.pac |-- gac.sa生成 bam/sam 文件
# sam 文件 bwa mem -M ../genome/index/bwa/gcf T5_R1.fq.gz T5_R2.fq.gz > T5_R2.sam & # bam 文件 bwa mem -M ../genome/index/bwa/gcf T5_R1.fq.gz T5_R2.fq.gz |samtools view -b > T5_R2.bam &- 如果创建参考基因组索引没有指定索引文件夹,
bwa mem不需要指定-M参数 - samtools 可以把bam转成sam格式
- 可以指定
-t增加线程数 - 比对的结果可以查看一下质量
samtools flagstat T5_R2.bam
- 如果创建参考基因组索引没有指定索引文件夹,
-
process_radtags 双端处理格式
// Parse a file name that looks like: lane6_NoIndex_L006_R1_003.fastq // but exclude the paired-end files: lane6_NoIndex_L006_R2_003.fastq // // Another example could be: GfddRAD1_001_ATCACG_L008_R1_001.fastq.gz // and excluding the paired-end file: GfddRAD1_001_ATCACG_L008_R2_001.fastq.gz