烟草数据分析
-
1.构建索引
bwa index GCF_000715135.1_Ntab-TN90_genomic.fna.gz
samtools faidx GCF_000715135.1_Ntab-TN90_genomic.fna.gz
注意:不能对gzip压缩包构建索引
Cannot index files compressed with gzip, please use bgzip2.比对
bwa mem -M -t 10 GCF_000715135.1_Ntab-TN90_genomic.fna.gz raw/T3_R1.fq.gz raw/T3_R2.fq.gz
Bwa Mem -M Option
https://www.biostars.org/p/97323/
生成bam并对bam进行排序
samtools view T3.sam -b | samtools sort -o T3.sorted.bam添加Read Group标识
samtools addreplacerg -r 'ID:tabacco' -r 'LB:1334' -r 'SM:T3' -r 'PL:ILLUMINA' -r 'PU:AAA.2.xxxx' -o T3.sorted.RG.bam T3.sorted.bam什么是Read Group?
https://gatk.broadinstitute.org/hc/en-us/articles/360035890671-Read-groups
构建BAM索引:
samtools index T3.sorted.RG.bam -
3.java -jar picard.jar MarkDuplicates
I=input.bam
O=marked_duplicates.bam
M=marked_dup_metrics.txt
去重的工具对比 samtools vs picard
https://www.biostars.org/p/390305/ -
4.RBQS
GATK4以前的版本 还需要执行 IndelRealigner 现在不需要了
https://www.biostars.org/p/305123/
https://sites.google.com/a/broadinstitute.org/legacy-gatk-forum-discussions/2018-08-10-2018-04-11/11826-BQSR-without-IndelRealigner
According to the bestPractice of GATK4,Mark Duplicates -> RBQS -> IndelRealigning by Mutect2
which was below in previous version of GATK
Mark Duplicates -> IndelRealigner -> RBQS -> Mutect2
RBQS without INDEL realigning wouldn’t be affected by false alignments ?
/home/admin/software/gatk-4.1.0.0/gatk BaseRecalibrator
-I my_reads.bam
-R reference.fasta
--known-sites dbsnp_138.hg19.vcf
--known-sites Mills_and_1000G_gold_standard.indels.hg19.sites.vcf
--known-sites 1000G_phase1.indels.hg19.sites.vcf
--known-sites 1000G_phase1.snps.high_confidence.hg19.sites.vcf
-L ../in/Illumina.bed
-O recal_data.table/home/admin/software/gatk-4.1.0.0/gatk ApplyBQSR
-R reference.fasta
-I input.bam
--bqsr-recal-file recal_data.table
-L ../in/Illumina.bed
-O output.bam对于非人类的物种怎么处理?
https://gencore.bio.nyu.edu/variant-calling-pipeline-gatk4/
博得推荐的方法是bootstrapping 即先call SNP 然后用这些数据再做BQSR
Base Quality Score Recalibration (BQSR) is an important step for accurate variant detection that aims to minimize the effect of technical variation on base quality scores (measured as Phred scores). As with the original pipeline (link), this pipeline assumes that a ‘gold standard’ set of SNPS and indels are not available for BQSR. In the absence of a gold standard the pipeline performs an initial step detecting variants without performing BQSR, and then uses the identified SNPs as input for BQSR before calling variants again. This process is referred to as bootstrapping and is the procedure recommended by the Broad Institute’s best practices for variant discovery analysis when a gold standard is not available. -
此回复已被删除! -
5.获取突变
gatk HaplotypeCaller
-R ref.fa
-I sorted_dedup_reads.bam
-o raw_variants.vcfgatk --java-options "-Xmx8G" Mutect2
-R /home/admin/database/reference/hg19/ucsc.hg19.fasta
-I ../out/normal_recal.bam
-I ../out/cancer_recal.bam
-tumor cancer
-normal normal
--germline-resource af-only-gnomad.raw.sites.hg19.vcf.gz
-L ../in/Illumina.bed
-O ../out/somatic.vcfgatk --java-options "-Xmx8G" FilterMutectCalls
-V ../out/somatic.vcf
-O ../out/somatic_filtered.vcf.gz获取SNP和Indels
gatk SelectVariants
-R ref.fa
-V raw_variants.vcf
-selectType SNP
-o raw_snps.vcf
gatk SelectVariants
-R ref.fa
-V raw_variants.vcf
-selectType INDEL
-o raw_indels.vcf过滤SNP和Indels
gatk VariantFiltration
-R ref.fa
-V raw_snps.vcf
-O filtered_snps.vcf
-filter-name "QD_filter" -filter "QD < 2.0"
-filter-name "FS_filter" -filter "FS > 60.0"
-filter-name "MQ_filter" -filter "MQ < 40.0"
-filter-name "SOR_filter" -filter "SOR > 4.0"
-filter-name "MQRankSum_filter" -filter "MQRankSum < -12.5"
-filter-name "ReadPosRankSum_filter" -filter "ReadPosRankSum < -8.0"gatk VariantFiltration
-R ref.fa
-V raw_indels.vcf
-O filtered_indels.vcf
-filter-name "QD_filter" -filter "QD < 2.0"
-filter-name "FS_filter" -filter "FS > 200.0"
-filter-name "SOR_filter" -filter "SOR > 10.0"gatk SelectVariants
--exclude-filtered
-V filtered_snps.vcf
-O bqsr_snps.vcf
gatk SelectVariants
--exclude-filtered
-V filtered_indels.vcf
-O bqsr_indels.vcf -
此回复已被删除! -
第一轮 Base Quality Score Recalibration (BQSR)
gatk BaseRecalibrator
-R ref.fa
-I sorted_dedup_reads.bam
--known-sites bqsr_snps.vcf
--known-sites bqsr_indels.vcf
-O recal_data.tablegatk ApplyBQSR
-R ref.fa
-I sorted_dedup_reads.bam
-bqsr recal_data.table
-O recal_reads.bam \第二轮(可选)
gatk BaseRecalibrator
-R ref.fa
-I recal_reads.bam
--known-sites bqsr_snps.vcf
--known-sites bqsr_indels.vcf
-O post_recal_data.table
分析数据:
gatk AnalyzeCovariates \
-before recal_data.table
-after post_recal_data.table
-plots recalibration_plots.pdf -
基于优化的结果进行第二轮突变分析:
gatk HaplotypeCaller
-R ref.fa
-I recal_reads.bam
-o raw_variants_recal.vcfgatk SelectVariants
-R ref.fa
-V raw_variants_recal.vcf
-selectType SNP
-o raw_snps_recal.vcf
gatk SelectVariants
-R ref.fa
-V raw_variants.vcf
-selectType INDEL
-o raw_indels_recal.vcfgatk VariantFiltration \
-R ref.fa
-V raw_snps_recal.vcf
-O filtered_snps_final.vcf
-filter-name "QD_filter" -filter "QD < 2.0"
-filter-name "FS_filter" -filter "FS > 60.0"
-filter-name "MQ_filter" -filter "MQ < 40.0"
-filter-name "SOR_filter" -filter "SOR > 4.0"
-filter-name "MQRankSum_filter" -filter "MQRankSum < -12.5"
-filter-name "ReadPosRankSum_filter" -filter "ReadPosRankSum < -8.0"gatk VariantFiltration \
-R ref.fa
-V raw_indels_recal.fa
-O filtered_indels_final.vcf
-filter-name "QD_filter" -filter "QD < 2.0"
-filter-name "FS_filter" -filter "FS > 200.0"
-filter-name "SOR_filter" -filter "SOR > 10.0" -
注释:
java -jar snpEff.jar -v \
<snpeff_db>
filtered_snps_final.vcf > $filtered_snps_final.ann.vcf或者
perl /home/admin/software/annovar/table_annovar.pl $INPUT
/home/admin/software/annovar/humandb -buildver hg19
-out $OUT_DIR/$PREFIX.annovar -remove
-protocol refGene,clinvar_20170905
-operation g,f
-nastring .
-vcfinput -
-
@anneng picard 在GATK4.0以后已经集成到了 GATK 本身
-
https://ftp.ncbi.nlm.nih.gov/genomes/all/annotation_releases/4097/100/GCF_000715135.1_Ntab-TN90/
注释的话 从这里可以下载 gff文件 用 snpeff或者 ANNOVAR 进行注释
