简化基因组分析

anneng

一，生物学预备知识

Restriction enzymes are DNA-cutting enzymes. Each enzyme recognizes one or a few target sequences and cuts DNA at or near those sequences.
DNA ligase is a DNA-joining enzyme. If two pieces of DNA have matching ends, ligase can link them to form a single, unbroken molecule of DNA.

enzymes that leave single-stranded overhangs are said to produce sticky ends. Sticky ends are helpful in cloning because they hold two pieces of DNA together so they can be linked by DNA ligase.
Some are “blunt cutters,” which cut straight down the middle of a target sequence and leave no overhang. The restriction enzyme SmaI is an example of a blunt cutter:

一个例子：

二，实验方法
CRoPS
RAD-Seq
GBS
double-digest RAD-Seq
2bRAD
restriction-enzyme anchored positions
https://www.researchgate.net/figure/An-overview-of-the-RAD-seq-library-creation-protocol-and-initial-analysis-steps_fig1_235794002

三，分析方法
Stack2流程
官网：https://catchenlab.life.illinois.edu/stacks/

原理：maximum likelihood statistical model
实现语言：C++ with wrapper programs written in Perl，基于OpenMP的多线程技术
无参考序列分析（de novo）：
process_radtags：demultiplexed and cleaned
ustacks：building loci
cstacks：creating the catalog of loci
sstacks：matching against the catalog
tsv2bam: transpose data from being store per-sample to be stored per-locus
gstacks：assemble and merge paired-end contigs, call variant sites in the population and genotypes in each sample
populations：filter data, calculate population genetics statistics, and export a variety of data formats.
有参考序列分析：
process_radtags：demultiplexed and cleaned
gstacks：assemble and merge paired-end contigs, call variant sites in the population and genotypes in each sample
populations：filter data, calculate population genetics statistics, and export a variety of data formats.
四，安装

tar xfvz stacks-2.xx.tar.gz
cd stacks-2.xx
./configure
make
(become root)
make install
(or, use sudo)
sudo make install

五，详细过程
5.1.Clean the data
process_radtags 样本拆分和过滤低质量序列
样本是否需要拆分，以及barcode是什么，需要和实验人员确定。
参考手册4.1.1　指定 --inline_index参数
https://catchenlab.life.illinois.edu/stacks/manual/#install
指定barcode列表

cat barcodes_lane3
CGATA<tab>sample_01
CGGCG     sample_02
GAAGC     sample_03
GAGAT     sample_04
TAATG     sample_05
TAGCA     sample_06
AAGGG     sample_07
ACACG     sample_08
ACGTA     sample_09

process_radtags -p ./raw/ -o ./samples/ -b ./barcodes/barcodes_lane3 \
                  -e sbfI -r -c -q
process_radtags -P -p ./raw  -b ./barcodes/barcodes -o ./samples/ \
                  -c -q -r --inline_index --renz_1 nlaIII --renz_2 mluCI

-p 待处理fataq文件所在的目录
-o 输出结果目录
-b barcode列表文件
-e restrction enzyme类型
5.2 和参考序列进行比对
可以使用常用的比对软件　GSnap, BWA, or Bowtie2.
可以用samtools或者picard查看比对的质量。
建议使用bwa mem进行比对。
烟草有参考基因组吗？
5.3
无参流程denovo_map.pl 或者单独执行命令
有参流程ref_map.pl或者单独执行命令
大部分stack2的组件需要一个population map文件，这个文件其实是根据样本特征进行分组，打标签
population map

cat popmap_both
sample_01<tab>red<tab>high
sample_02     red      high
sample_03     red      high
sample_04     red      high
sample_05     yellow   high
sample_06     yellow   high
sample_07     yellow   high
sample_08     yellow   high
sample_09     blue     low
sample_10     blue     low
sample_11     blue     low
sample_12     blue     low
sample_13     orange   low
sample_14     orange   low
sample_15     orange   low
sample_16     orange   low

denovo_map.pl -T 8 -M 6 -o ./stacks/ --samples ./samples --popmap ./popmaps/popmap --paired

或者单独执行命令：

#!/bin/bash

src=$HOME/research/project

files=”sample_01
sample_02
sample_03”

#
# Build loci de novo in each sample for the single-end reads only. If paired-end reads are available, 
# they will be integrated in a later stage (tsv2bam stage).
# This loop will run ustacks on each sample, e.g.
#   ustacks -f ./samples/sample_01.1.fq.gz -o ./stacks -i 1 --name sample_01 -M 4 -p 8
#
id=1
for sample in $files
do
    ustacks -f $src/samples/${sample}.1.fq.gz -o $src/stacks -i $id --name $sample -M 4 -p 8
    let "id+=1"
done

# 
# Build the catalog of loci available in the metapopulation from the samples contained
# in the population map. To build the catalog from a subset of individuals, supply
# a separate population map only containing those samples.
#
cstacks -n 6 -P $src/stacks/ -M $src/popmaps/popmap -p 8

#
# Run sstacks. Match all samples supplied in the population map against the catalog.
#
sstacks -P $src/stacks/ -M $src/popmaps/popmap -p 8

#
# Run tsv2bam to transpose the data so it is stored by locus, instead of by sample. We will include
# paired-end reads using tsv2bam. tsv2bam expects the paired read files to be in the samples
# directory and they should be named consistently with the single-end reads,
# e.g. sample_01.1.fq.gz and sample_01.2.fq.gz, which is how process_radtags will output them.
#
tsv2bam -P $src/stacks/ -M $src/popmaps/popmap --pe-reads-dir $src/samples -t 8

#
# Run gstacks: build a paired-end contig from the metapopulation data (if paired-reads provided),
# align reads per sample, call variant sites in the population, genotypes in each individual.
#
gstacks -P $src/stacks/ -M $src/popmaps/popmap -t 8

#
# Run populations. Calculate Hardy-Weinberg deviation, population statistics, f-statistics
# export several output files.
#
populations -P $src/stacks/ -M $src/popmaps/popmap -r 0.65 --vcf --genepop --structure --fstats --hwe -t 8

ref_map.pl -T 8 --popmap ./popmaps/popmap -o ./stacks/ --samples ./aligned

或者单独执行命令：

#!/bin/bash

src=$HOME/research/project
bwa_db=$src/bwa_db/my_bwa_db_prefix
    
files=”sample_01
sample_02
sample_03”

#
# Align paired-end data with BWA, convert to BAM and SORT.
#
for sample in $files
do 
    bwa mem -t 8 $bwa_db $src/samples/${sample}.1.fq.gz $src/samples/${sample}.2.fq.gz |
      samtools view -b |
      samtools sort --threads 4 > $src/aligned/${sample}.bam
done

#
# Run gstacks to build loci from the aligned paired-end data. We have instructed
# gstacks to remove any PCR duplicates that it finds.
#
gstacks -I $src/aligned/ -M $src/popmaps/popmap --rm-pcr-duplicates -O $src/stacks/ -t 8

#
# Run populations. Calculate Hardy-Weinberg deviation, population statistics, f-statistics and 
# smooth the statistics across the genome. Export several output files.
#
populations -P $src/stacks/ -M $src/popmaps/popmap -r 0.65 --vcf --genepop --fstats --smooth --hwe -t 8

5.4 结果数据分析
genetic map
population analysis
STRUCTURE
Adegenet
coverage

anneng

https://www.yn-tobacco.com/gzfw/rqfw/xfz/jyzwjb/201801/t20180112_215188.html
烟草有参考基因组吗？
普通烟草（Nicotiana tabacum）为异源四倍体植物，基因组大小约 4.5G，基因组结构高度复杂。美国烟草基因组计划对四倍体烤烟品种采用甲基过滤法测序只获得了部分基因序列信息。我国烟草基因组计划完成了对普通烟草的两个祖先种绒毛状烟草（N. tomentosiformis）和林烟草（N. sylvestris）全基因组序列图谱的绘制，更充分地验证了烟草基因组结构高度复杂这一事实。在林烟草和绒毛状烟草7万多个基因中，有1万多个属于高度同源基因，就像孪生兄弟一样十分相像，常规的表达谱芯片探针设计难以有效区分鉴别。
N.tomentosiformis　绒毛状烟草
N.sylvestris　　　　林烟草
N.tabacum　　　　普通烟草
用tobacco搜索NCBI　有参考基因组　到时侯也可以和客户沟通下用哪个

anneng

实际测试:
鸿元服务器　/home/bioinfo/radseq
process_radtags -p . -o 1-cleandata/ -e sbfI -c -q -r
zgrep TGCAGG --color=always SRR828261.1.fq.gz |head -n 20
//比对
bwa mem ../Gasterosteus_aculeatus.BROADS1.dna.toplevel.fa ../SRR828261.1.fastq |gzip -3 > SRR828261.1.sam.gz
//查看比对率
samtools flagstat SRR828303.1.sam.gz
//构建loci
gstacks -I ../2-align/ -M popmap -O output -t 8
//populations分析
populations -P ../3-build-loci/output/ -M ./popmap -r 0.65 --vcf --genepop --fstats --smooth --hwe -t 8

anneng

https://pearg.github.io/pearg_documentation/tutorials/intro-rad-seq/

anneng

https://simison.com/brian/Structure_notes.html

anneng

https://evolution.genetics.washington.edu/phylip/

anneng

https://github.com/evansbenj/Reduced-Representation-Workshop

anneng

OneMap应用指南
https://mran.microsoft.com/snapshot/2017-02-04/web/packages/onemap/vignettes/Tutorial_Onemap_reduced_version.pdf

anneng

https://evolution.genetics.washington.edu/phylip/tuimala3.pdf

anneng

https://www.researchgate.net/post/Which-program-is-best-to-use-for-phylogeny-analysis

anneng

python tagdigger_script.py -w . -b samples.csv -o mycounts.csv -k markerstokeep.txt --StacksTags ../ANCHA180152/ustack/catalog.tags.tsv.gz --StacksSNPs ../ANCHA180152/ustack/catalog.snps.tsv.gz --StacksAlleles ../ANCHA180152/ustack/catalog.alleles.tsv.gz

anneng

https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0201254
A RAD-sequencing approach to genome-wide marker discovery, genotyping, and phylogenetic inference in a diverse radiation of primates

anneng

https://groups.google.com/g/stacks-users/c/1n6O98oWO2o/m/oRm3FYZTs1IJ?pli=1

Stacks不太适合多倍体（大于2）