GATK usage notes

anneng

View the help documentation for a specific tool

gatk HaplotypeCaller --help

# Variant calling
gatk HaplotypeCaller \
    -R ref/ref.fasta \
    -I bams/mother.bam \
    -O sandbox/mother_variants.vcf

"Genomics in the Cloud" notes that GATK is better than mpileup at detecting indels:
It also makes the HaplotypeCaller much better at calling indels than traditional position-based callers like the old UnifiedGenotyper and Samtools mpileup.

https://www.nature.com/scitable/definition/haplotype-haplotypes-142/#:~:text=A haplotype is a group,genetic makeup of an organism.
haplotype / haplotypes
A haplotype is a group of genes within an organism that was inherited together from a single parent. The word "haplotype" is derived from the word "haploid," which describes cells with only one set of chromosomes, and from the word "genotype," which refers to the genetic makeup of an organism. A haplotype can describe a pair of genes inherited together from one parent on one chromosome, or it can describe all of the genes on a chromosome that were inherited together from a single parent. This group of genes was inherited together because of genetic linkage, or the phenomenon by which genes that are close to each other on the same chromosome are often inherited together. In addition, the term "haplotype" can also refer to the inheritance of a cluster of single nucleotide polymorphisms (SNPs), which are variations at single positions in the DNA sequence among individuals.

By examining haplotypes, scientists can identify patterns of genetic variation that are associated with health and disease states. For instance, if a haplotype is associated with a certain disease, then scientists can examine stretches of DNA near the SNP cluster to try to identify the gene or genes responsible for causing the disease.

Picard tools can be called directly from GATK; Picard is now part of GATK.
gatk ValidateSamFile \
    -R ref/ref.fasta \
    -I bams/mother.bam \
    -O sandbox/mother_validation.txt

Use -bamout to write out debugging information from gatk HaplotypeCaller
gatk HaplotypeCaller \
    -R ref/ref.fasta \
    -I bams/mother.bam \
    -O sandbox/mother_variants.snippet.debug.vcf \
    -bamout sandbox/mother_variants.snippet.debug.bam \
    -L 20:10,002,000-10,003,000

At this point, we have a callset of potential variants, but we know from the earlier overview of variant filtering that this callset is likely to contain many false-positive calls; that is, calls that are caused by technical artifacts and do not actually correspond to real biological variation. We need to get rid of as many of those as we can without losing real variants. How do we do that?
A commonly used approach for filtering germline short variants is to use variant context annotations, which are statistics captured during the variant calling process that summarize the quality and quantity of evidence that was observed for each variant. For example, some variant context annotations describe what the sequence context was like around the variant site (were there a lot of repeated bases? more GC or more AT?), how many reads covered it, how many reads covered each allele, what proportion of reads were in forward versus reverse orientation, and so on. We can choose thresholds for each annotation and set a hard filtering policy that says, for example, “For any given variant, if the value of this annotation is greater than a threshold value of X, we consider the variant to be real; otherwise, we filter it out.”

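gatk VariantFiltration \
    -R ref/ref.fasta \
    -V vcfs/motherSNP.vcf.gz \
    --filter-expression "QD < 2.0 || DP > 100.0" \
    --filter-name "lowQD_highDP" \
    -O sandbox/motherSNP.QD2.DP100.vcf.gz
This filters by QD and DP: records that match the expression are tagged lowQD_highDP in the FILTER column.
Thought: if every filtering pass means running another command, how many VCFs does that generate? If the filtering could be done against a database instead, it could be made interactive, which would be a better experience.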
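
One way to try thresholds without writing a new VCF each time is to evaluate the same rule on a stream. The function below is only a rough sketch of that idea, not a replacement for VariantFiltration: the name flag_qd_dp is made up for these notes, and it assumes uncompressed VCF text on stdin with QD and DP stored as key=value pairs in the INFO column.

```shell
# flag_qd_dp MIN_QD MAX_DP   (hypothetical helper, not part of GATK)
# Reads uncompressed VCF text on stdin and prints, per record:
#   CHROM  POS  QD  DP  PASS|FAIL
# A record is FAIL when QD < MIN_QD or DP > MAX_DP, mirroring the
# expression "QD < 2.0 || DP > 100.0" used above.
# Assumption of this sketch: a record missing an annotation is not
# failed on that annotation.
flag_qd_dp() {
    awk -F'\t' -v min_qd="$1" -v max_dp="$2" '
        /^#/ { next }                       # skip all header lines
        {
            qd = ""; dp = ""
            n = split($8, kv, ";")          # column 8 is INFO
            for (i = 1; i <= n; i++) {
                if (kv[i] ~ /^QD=/) qd = substr(kv[i], 4)
                if (kv[i] ~ /^DP=/) dp = substr(kv[i], 4)
            }
            verdict = "PASS"
            if (qd != "" && qd + 0 < min_qd + 0) verdict = "FAIL"
            if (dp != "" && dp + 0 > max_dp + 0) verdict = "FAIL"
            printf "%s\t%s\t%s\t%s\t%s\n", $1, $2, qd, dp, verdict
        }'
}
```

Possible usage, assuming gzip is available: gzip -dc vcfs/motherSNP.vcf.gz | flag_qd_dp 2.0 100. Changing the two numbers re-evaluates the rule without producing another VCF.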

GATK use cases

As well as the following scenarios:
Germline copy number variation

Structural variation

Mitochondrial variation

Blood biopsy

Pathogen/contaminant identification

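GVCF mode for analyzing multiple samples together (cohort studies)
The "consolidate cohort" step in the figure uses the GenomicsDB database.

GATK takes this approach mainly to solve the N+1 problem in cohort studies: without it, every time one sample is added to the cohort, the whole cohort has to be re-analyzed.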

# Generate a GVCF
gatk HaplotypeCaller \
    -R ref/ref.fasta \
    -I bams/mother.bam \
    -O sandbox/mother_variants.200k.g.vcf.gz \
    -L 20:10,000,000-10,200,000 \
    -ERC GVCF

# Import the GVCFs of multiple samples
gatk GenomicsDBImport \
    -V gvcfs/mother.g.vcf.gz \
    -V gvcfs/father.g.vcf.gz \
    --genomicsdb-workspace-path sandbox/trio-gdb \
    --intervals 20:10,000,000-10,200,000

# Export from GenomicsDB
gatk SelectVariants \
    -R ref/ref.fasta \
    -V gendb://sandbox/trio-gdb \
    -O sandbox/duo_selectvariants.g.vcf.gz
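
For the N+1 case, recent GATK 4 releases let GenomicsDBImport append samples to an existing workspace with --genomicsdb-update-workspace-path. The helper below is a dry-run sketch that only prints such a command: the function name is made up for these notes, and gvcfs/son.g.vcf.gz is an assumed file that does not appear in the commands above. Check the flag against the installed GATK version before relying on it.

```shell
# gdb_add_sample_cmd WORKSPACE GVCF   (hypothetical helper)
# Prints, without running it, a GenomicsDBImport command that would add
# one more GVCF to an already existing GenomicsDB workspace.
# Assumption: the installed GATK supports
# --genomicsdb-update-workspace-path.
gdb_add_sample_cmd() {
    workspace="$1"
    gvcf="$2"
    printf 'gatk GenomicsDBImport -V %s --genomicsdb-update-workspace-path %s\n' \
        "$gvcf" "$workspace"
}
```

Possible usage: gdb_add_sample_cmd sandbox/trio-gdb gvcfs/son.g.vcf.gz prints the command so it can be reviewed before running. Backing up the workspace first seems prudent, since the update modifies it in place.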

# Joint genotyping from the GenomicsDB workspace
gatk GenotypeGVCFs \
    -R ref/ref.fasta \
    -V gendb://sandbox/trio-gdb \
    -O sandbox/trio-jointcalls.vcf.gz \
    -L 20:10,000,000-10,200,000

# Multi-sample calling directly with HaplotypeCaller
gatk HaplotypeCaller \
    -R ref/ref.fasta \
    -I bams/mother.bam \
    -I bams/father.bam \
    -I bams/son.bam \
    -O sandbox/trio_jointcalls_hc.vcf.gz \
    -L 20:10,000,000-10,200,000

intervals.list
https://www.biostars.org/p/364917/
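For plants and other organisms whose reference genome is not fully assembled, the reference may contain many contigs. The contig list can be given in a file, which is then used when importing into GenomicsDB.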
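
Such a contig list can be derived from the FASTA index. The sketch below assumes the reference has been indexed (for example with samtools faidx) so that a .fai file exists, with the contig name in column 1 and its length in column 2. The function name and the optional minimum-length cutoff are made up for these notes.

```shell
# contig_intervals FAI [MIN_LEN]   (hypothetical helper)
# Prints one contig name per line from a FASTA index (.fai), keeping
# only contigs whose length (column 2) is at least MIN_LEN (default 0).
# Redirect the output to a file such as intervals.list.
contig_intervals() {
    fai="$1"
    min_len="${2:-0}"
    awk -F'\t' -v min_len="$min_len" '$2 + 0 >= min_len + 0 { print $1 }' "$fai"
}
```

Possible usage: contig_intervals ref/ref.fasta.fai > intervals.list, then pass -L intervals.list to GenomicsDBImport in place of a single interval.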

https://bmcresnotes.biomedcentral.com/articles/10.1186/1756-0500-7-747
We implemented the best practices of GATK pipeline to call SNPs and Indels. We have used GATK 2.4 version and GATK-UnifiedGenotyper as SNP caller in this study. We have used multi-sample variant calling by GATK-UnifiedGenotyper. The reason of using multi-sample calling is to distinguish non-variant genotypes between homozygous reference genotype and missing genotype in cohort analysis. With single sample calling genotype called only for variants we can’t be sure if the non-variants have missing genotype or same as reference. Also, big projects like 1000 genomes have preferred multi-sample calling over single sample calling[18]. We used GATK-UnifiedGenotyper instead of GATK-HaplotypeCaller, a similar or better variant caller by GATK, in this study because of similar accuracy in calling SNPs and computational feasibility to run for large number of samples. For more than 100 samples, according to GATK website, GATK-UnifiedGenotyper is advised over GATK-HaplotypeCaller. The real advantage of Haplotypecaller over UnifiedGenotyper is in calling Indels but in this paper we are focusing on SNPs only.
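This article makes two points:
1. Multi-sample calling is used to discover more variants, which makes it possible to tell homozygous-reference genotypes apart from missing ones?
2. Although UnifiedGenotyper is no longer recommended, the authors still chose it because they had many samples and only cared about SNPs.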
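
To see point 1 in a joint-called VCF, the homozygous-reference and missing genotypes can be tallied per sample. This is a rough sketch: the function name is made up for these notes, and it assumes uncompressed VCF text on stdin with GT as the first FORMAT field.

```shell
# count_homref_missing   (hypothetical helper)
# Reads uncompressed multi-sample VCF text on stdin and prints, per
# sample: NAME  HOM_REF_COUNT  MISSING_COUNT
# Assumption: GT is the first colon-separated field of each sample column.
count_homref_missing() {
    awk -F'\t' '
        /^##/ { next }                      # skip meta-information lines
        /^#CHROM/ {                         # sample names start at column 10
            for (i = 10; i <= NF; i++) name[i] = $i
            ncol = NF
            next
        }
        {
            for (i = 10; i <= NF; i++) {
                split($i, f, ":")
                gt = f[1]
                if (gt == "0/0" || gt == "0|0") homref[i]++
                else if (gt == "./." || gt == ".|." || gt == ".") missing[i]++
            }
        }
        END {
            for (i = 10; i <= ncol; i++)
                printf "%s\t%d\t%d\n", name[i], homref[i], missing[i]
        }'
}
```

Possible usage, assuming gzip is available: gzip -dc sandbox/trio-jointcalls.vcf.gz | count_homref_missing. A single-sample VCF that lists only variant sites has no record for the other positions, so these two cases cannot be separated there.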

https://www.biostars.org/p/54673/
GATK consensus