简化基因组分析
-
一,生物学预备知识

Restriction enzymes are DNA-cutting enzymes. Each enzyme recognizes one or a few target sequences and cuts DNA at or near those sequences.
DNA ligase is a DNA-joining enzyme. If two pieces of DNA have matching ends, ligase can link them to form a single, unbroken molecule of DNA.


enzymes that leave single-stranded overhangs are said to produce sticky ends. Sticky ends are helpful in cloning because they hold two pieces of DNA together so they can be linked by DNA ligase.
Some are “blunt cutters,” which cut straight down the middle of a target sequence and leave no overhang. The restriction enzyme SmaI is an example of a blunt cutter:


一个例子:



二,实验方法
CRoPS
RAD-Seq
GBS
double-digest RAD-Seq
2bRAD
restriction-enzyme anchored positions
https://www.researchgate.net/figure/An-overview-of-the-RAD-seq-library-creation-protocol-and-initial-analysis-steps_fig1_235794002

三,分析方法
Stack2流程
官网:https://catchenlab.life.illinois.edu/stacks/

原理:maximum likelihood statistical model
实现语言:C++ with wrapper programs written in Perl,基于OpenMP的多线程技术
无参考序列分析(de novo):
process_radtags:demultiplexed and cleaned
ustacks:building loci
cstacks:creating the catalog of loci
sstacks:matching against the catalog
tsv2bam: transpose data from being store per-sample to be stored per-locus
gstacks:assemble and merge paired-end contigs, call variant sites in the population and genotypes in each sample
populations:filter data, calculate population genetics statistics, and export a variety of data formats.
有参考序列分析:
process_radtags:demultiplexed and cleaned
gstacks:assemble and merge paired-end contigs, call variant sites in the population and genotypes in each sample
populations:filter data, calculate population genetics statistics, and export a variety of data formats.
四,安装tar xfvz stacks-2.xx.tar.gz cd stacks-2.xx ./configure make (become root) make install (or, use sudo) sudo make install五,详细过程
5.1.Clean the data
process_radtags 样本拆分和过滤低质量序列
样本是否需要拆分,以及barcode是什么,需要和实验人员确定。
参考手册4.1.1 指定 --inline_index参数
https://catchenlab.life.illinois.edu/stacks/manual/#install
指定barcode列表cat barcodes_lane3 CGATA<tab>sample_01 CGGCG sample_02 GAAGC sample_03 GAGAT sample_04 TAATG sample_05 TAGCA sample_06 AAGGG sample_07 ACACG sample_08 ACGTA sample_09process_radtags -p ./raw/ -o ./samples/ -b ./barcodes/barcodes_lane3 \ -e sbfI -r -c -q process_radtags -P -p ./raw -b ./barcodes/barcodes -o ./samples/ \ -c -q -r --inline_index --renz_1 nlaIII --renz_2 mluCI-p 待处理fataq文件所在的目录
-o 输出结果目录
-b barcode列表文件
-e restrction enzyme类型
5.2 和参考序列进行比对
可以使用常用的比对软件 GSnap, BWA, or Bowtie2.
可以用samtools或者picard查看比对的质量。
建议使用bwa mem进行比对。
烟草有参考基因组吗?
5.3
无参流程denovo_map.pl 或者单独执行命令
有参流程ref_map.pl或者单独执行命令
大部分stack2的组件需要一个population map文件,这个文件其实是根据样本特征进行分组,打标签
population mapcat popmap_both sample_01<tab>red<tab>high sample_02 red high sample_03 red high sample_04 red high sample_05 yellow high sample_06 yellow high sample_07 yellow high sample_08 yellow high sample_09 blue low sample_10 blue low sample_11 blue low sample_12 blue low sample_13 orange low sample_14 orange low sample_15 orange low sample_16 orange lowdenovo_map.pl -T 8 -M 6 -o ./stacks/ --samples ./samples --popmap ./popmaps/popmap --paired或者单独执行命令:
#!/bin/bash src=$HOME/research/project files=”sample_01 sample_02 sample_03” # # Build loci de novo in each sample for the single-end reads only. If paired-end reads are available, # they will be integrated in a later stage (tsv2bam stage). # This loop will run ustacks on each sample, e.g. # ustacks -f ./samples/sample_01.1.fq.gz -o ./stacks -i 1 --name sample_01 -M 4 -p 8 # id=1 for sample in $files do ustacks -f $src/samples/${sample}.1.fq.gz -o $src/stacks -i $id --name $sample -M 4 -p 8 let "id+=1" done # # Build the catalog of loci available in the metapopulation from the samples contained # in the population map. To build the catalog from a subset of individuals, supply # a separate population map only containing those samples. # cstacks -n 6 -P $src/stacks/ -M $src/popmaps/popmap -p 8 # # Run sstacks. Match all samples supplied in the population map against the catalog. # sstacks -P $src/stacks/ -M $src/popmaps/popmap -p 8 # # Run tsv2bam to transpose the data so it is stored by locus, instead of by sample. We will include # paired-end reads using tsv2bam. tsv2bam expects the paired read files to be in the samples # directory and they should be named consistently with the single-end reads, # e.g. sample_01.1.fq.gz and sample_01.2.fq.gz, which is how process_radtags will output them. # tsv2bam -P $src/stacks/ -M $src/popmaps/popmap --pe-reads-dir $src/samples -t 8 # # Run gstacks: build a paired-end contig from the metapopulation data (if paired-reads provided), # align reads per sample, call variant sites in the population, genotypes in each individual. # gstacks -P $src/stacks/ -M $src/popmaps/popmap -t 8 # # Run populations. Calculate Hardy-Weinberg deviation, population statistics, f-statistics # export several output files. # populations -P $src/stacks/ -M $src/popmaps/popmap -r 0.65 --vcf --genepop --structure --fstats --hwe -t 8ref_map.pl -T 8 --popmap ./popmaps/popmap -o ./stacks/ --samples ./aligned或者单独执行命令:
#!/bin/bash src=$HOME/research/project bwa_db=$src/bwa_db/my_bwa_db_prefix files=”sample_01 sample_02 sample_03” # # Align paired-end data with BWA, convert to BAM and SORT. # for sample in $files do bwa mem -t 8 $bwa_db $src/samples/${sample}.1.fq.gz $src/samples/${sample}.2.fq.gz | samtools view -b | samtools sort --threads 4 > $src/aligned/${sample}.bam done # # Run gstacks to build loci from the aligned paired-end data. We have instructed # gstacks to remove any PCR duplicates that it finds. # gstacks -I $src/aligned/ -M $src/popmaps/popmap --rm-pcr-duplicates -O $src/stacks/ -t 8 # # Run populations. Calculate Hardy-Weinberg deviation, population statistics, f-statistics and # smooth the statistics across the genome. Export several output files. # populations -P $src/stacks/ -M $src/popmaps/popmap -r 0.65 --vcf --genepop --fstats --smooth --hwe -t 85.4 结果数据分析
genetic map
population analysis
STRUCTURE
Adegenet
coverage -
https://www.yn-tobacco.com/gzfw/rqfw/xfz/jyzwjb/201801/t20180112_215188.html
烟草有参考基因组吗?
普通烟草(Nicotiana tabacum)为异源四倍体植物,基因组大小约 4.5G,基因组结构高度复杂。美国烟草基因组计划对四倍体烤烟品种采用甲基过滤法测序只获得了部分基因序列信息。我国烟草基因组计划完成了对普通烟草的两个祖先种绒毛状烟草(N. tomentosiformis)和林烟草(N. sylvestris)全基因组序列图谱的绘制,更充分地验证了烟草基因组结构高度复杂这一事实。在林烟草和绒毛状烟草7万多个基因中,有1万多个属于高度同源基因,就像孪生兄弟一样十分相像,常规的表达谱芯片探针设计难以有效区分鉴别。
N.tomentosiformis 绒毛状烟草
N.sylvestris 林烟草
N.tabacum 普通烟草
用tobacco搜索NCBI 有参考基因组 到时侯也可以和客户沟通下用哪个 -
实际测试:
鸿元服务器 /home/bioinfo/radseq
process_radtags -p . -o 1-cleandata/ -e sbfI -c -q -r
zgrep TGCAGG --color=always SRR828261.1.fq.gz |head -n 20
//比对
bwa mem ../Gasterosteus_aculeatus.BROADS1.dna.toplevel.fa ../SRR828261.1.fastq |gzip -3 > SRR828261.1.sam.gz
//查看比对率
samtools flagstat SRR828303.1.sam.gz
//构建loci
gstacks -I ../2-align/ -M popmap -O output -t 8
//populations分析
populations -P ../3-build-loci/output/ -M ./popmap -r 0.65 --vcf --genepop --fstats --smooth --hwe -t 8 -
-
-
-
-
-
-
-
python tagdigger_script.py -w . -b samples.csv -o mycounts.csv -k markerstokeep.txt --StacksTags ../ANCHA180152/ustack/catalog.tags.tsv.gz --StacksSNPs ../ANCHA180152/ustack/catalog.snps.tsv.gz --StacksAlleles ../ANCHA180152/ustack/catalog.alleles.tsv.gz
-

https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0201254
A RAD-sequencing approach to genome-wide marker discovery, genotyping, and phylogenetic inference in a diverse radiation of primates -
https://groups.google.com/g/stacks-users/c/1n6O98oWO2o/m/oRm3FYZTs1IJ?pli=1
Stacks不太适合多倍体(大于2)