GATK

anneng

https://gatk.broadinstitute.org/hc/en-us/community/posts/360066050311-BQSR-bootstrapping-for-multiple-sample-dataset-with-no-known-variants-non-human-

Thanks for the follow up question on this post so that we can address it!

We don't have any current BQSR bootstrapping methods or recommendations for when there is no known sites file.

If you don't have a known sites file, you can still use GATK: skip the BQSR step and use hard filtering instead of VQSR. The BQSR and VQSR machine learning steps are preferable when they are available, but they are not possible without a known sites file.

Hope this helps!

Genevieve

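GATK's BQSR method depends on known sites such as dbSNP, so it is mainly useful for well-studied model organisms such as human. For other species, the BQSR step can be dropped.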

anneng

https://eriqande.github.io/eca-bioinf-handbook/

zhanglu

GATK pipeline notes

Reference sequence preparation:

1. Build the index with bwa index

Image: 11918067/genomes-in-the-cloud:2.4.2-1552931386
Command: /usr/gitc/bwa index *.fa

2. Create the .dict file with picard

Image: broadinstitute/picard:2.27.3
Command: java -jar /usr/picard/picard.jar CreateSequenceDictionary R=Nitab-v4.5_genome_Chr_Edwards2017.fasta O=Nitab-v4.5_genome_Chr_Edwards2017.fasta.dict

3. Create the .fai file with samtools

Command: samtools faidx Nitab-v4.5_genome_Chr_Edwards2017.fasta
Image: quay.io/biocontainers/samtools:1.15.1--h1170115_0

4. SnpEff reference database:

Command: snp_build -name ${Name} -ann ${Gff} -fa ${Fa}
Image: docker: "anneng01:8090/library/angs_snpeff:1.0.0"
Output path: result
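
A quick way to confirm that steps 1 to 3 left every expected file next to the reference is sketched below. It assumes default output naming: bwa index writes .amb/.ann/.bwt/.pac/.sa using the FASTA name as prefix, samtools faidx writes .fai, and the dictionary is named <fasta>.dict as in the Picard command above. The function name missing_ref_indexes is made up for this note.

```shell
# Print every index file that is missing next to a reference FASTA.
# Prints nothing when all of them are present.
# Assumption: default output names of bwa index, samtools faidx and
# the <fasta>.dict naming used in the Picard command above.
missing_ref_indexes() {
    ref=$1
    # bwa index writes these five files with the FASTA name as prefix
    for ext in amb ann bwt pac sa; do
        [ -e "$ref.$ext" ] || echo "$ref.$ext"
    done
    # samtools faidx writes <ref>.fai
    [ -e "$ref.fai" ] || echo "$ref.fai"
    # Picard CreateSequenceDictionary output, named <ref>.dict here
    [ -e "$ref.dict" ] || echo "$ref.dict"
    return 0
}
```

For example, `missing_ref_indexes Nitab-v4.5_genome_Chr_Edwards2017.fasta` lists whatever still has to be generated before the workflow is started.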

Notes

The VCF on the SnpEff input path must be gz-compressed and come with its index file, for example: reference.vcf.gz
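
The requirement in the note above can be checked before the workflow is submitted. The sketch below only inspects file names. It assumes the index is a tabix-style .tbi or .csi file sitting next to the VCF, which this thread does not state explicitly. The function name is hypothetical.

```shell
# Check that a VCF looks ready for the SnpEff step:
# the name ends in .vcf.gz and an index file sits next to it.
# Assumption: the index is <vcf>.tbi or <vcf>.csi.
vcf_ready_for_snpeff() {
    vcf=$1
    case "$vcf" in
        *.vcf.gz) ;;
        *) echo "not gz-compressed: $vcf"; return 1 ;;
    esac
    if [ ! -e "$vcf.tbi" ] && [ ! -e "$vcf.csi" ]; then
        echo "index missing for: $vcf"
        return 1
    fi
    echo "ok: $vcf"
}
```

If the check fails on a plain reference.vcf, one common way to produce both files is `bgzip reference.vcf` followed by `tabix -p vcf reference.vcf.gz`.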

WDL workflow

gatk_merge.wdl

Test data

{
    "GATK.fastq_1":"/ceph_disk3/file_server/tmp/lanzhou/data/H06HDADXX130110.1.ATCACGAT.20k_reads_1.fastq",
    "GATK.fastq_2":"/ceph_disk3/file_server/tmp/lanzhou/data/H06HDADXX130110.1.ATCACGAT.20k_reads_2.fastq",
    "GATK.ref_fasta":"/ceph_disk2/data/hongyuan/mnt/data/public_data/tobacco_1656467890/Nitab-v4.5_genome_Chr_Edwards2017.fasta",
    "GATK.snp_ref":"/ceph_disk2/data/hongyuan/mnt/data/public_data/tobacco_1656467890/result"
}

anneng

https://qcb.ucla.edu/wp-content/uploads/sites/14/2016/03/GATK_Discovery_Tutorial-Worksheet-AUS2016.pdf

A fairly detailed GATK guide.

anneng

https://yulijia.net/slides/bioinfomatcis_for_medical_students/2019-07-31-A_beginners_guide_to_Call_SNPs_and_indels_Part_II.html#1

anneng

https://wikis.univ-lille.fr/bilille/_media/ngs2019_dna_duplicates.pdf

anneng

https://paleomix.readthedocs.io/en/stable/other_tools.html#paleomix-rmdup-collapsed
Deduplicating collapsed (merged) reads

https://www.biostars.org/p/347514/
Deduplicate first, then merge

anneng

GATK_Discovery_Tutorial-Worksheet-AUS2016.pdf

anneng

https://qcb.ucla.edu/wp-content/uploads/sites/14/2016/03/GATKwr12-5-Variant_calling_joint_genotyping.pdf
GATKwr12-5-Variant_calling_joint_genotyping.pdf
An explanation of GVCF

anneng

https://gatk.broadinstitute.org/hc/en-us/articles/360035890471-Hard-filtering-germline-short-variants
VCF filtering

anneng

Add a new SNP filtering and annotation workflow, consisting of:
Filtering:

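   gatk VariantFiltration \
   -R reference.fasta \
   -V input.vcf.gz \
   -O output.vcf.gz \
   -filter "QD < 4.0 || FS > 60.0 || MQ < 40.0" \
   --filter-name "tobacco" \
   -window 4 \
   -G-filter "GQ < 20.0" \
   -G-filter-name "lowGQ"

(GATK4 spells the option --filter-name, and every line except the last needs a trailing backslash. GATK expects a name for each genotype filter expression; "lowGQ" is a placeholder, pick any name.)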

These filter parameters should be exposed so that they can be configured.
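
One possible way to expose the thresholds, sketched as a shell function that builds the command from environment variables. The variable names (MIN_QD, MAX_FS, MIN_MQ, FILTER_NAME, CLUSTER_WINDOW, MIN_GQ, GQ_FILTER_NAME) are invented for this sketch, and the defaults are the values from the command above. The genotype filter name lowGQ is a placeholder. Option spellings follow GATK4. In the WDL these would presumably become workflow inputs instead.

```shell
# Print a VariantFiltration command whose thresholds can be overridden
# through environment variables. All variable names are hypothetical;
# the defaults are the thresholds from the notes above.
variant_filtration_cmd() {
    ref=$1; in_vcf=$2; out_vcf=$3
    printf 'gatk VariantFiltration -R %s -V %s -O %s' "$ref" "$in_vcf" "$out_vcf"
    # site-level hard filter on QD, FS and MQ
    printf ' -filter "QD < %s || FS > %s || MQ < %s"' \
        "${MIN_QD:-4.0}" "${MAX_FS:-60.0}" "${MIN_MQ:-40.0}"
    printf ' --filter-name "%s" -window %s' \
        "${FILTER_NAME:-tobacco}" "${CLUSTER_WINDOW:-4}"
    # genotype-level filter on GQ; the filter name is a placeholder
    printf ' -G-filter "GQ < %s" -G-filter-name "%s"\n' \
        "${MIN_GQ:-20.0}" "${GQ_FILTER_NAME:-lowGQ}"
}
```

Calling `variant_filtration_cmd reference.fasta input.vcf.gz output.vcf.gz` prints the command for review. Assuming paths without spaces, it can be run with `eval "$(variant_filtration_cmd reference.fasta input.vcf.gz output.vcf.gz)"`. Setting MIN_QD=2.0 beforehand changes only the QD threshold.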
SnpEff: annotation
The VCF produced by the GATK step is the complete, unfiltered one.
After filtering, also run SnpEff to look at the results.

anneng

https://nbisweden.github.io/workshop-ngsintro/2005/lab_vc.html#5_Variant_calling_in_cohort

anneng

https://www.clinbioinfosspa.es/files/pipelines/germline_workflow_diagram.pdf

anneng

https://haplotypecaller1.rssing.com/chan-10646605/all_p46.html

Merging VCF files
There are three main reasons why you might want to combine variants from different files into one, and the tool to use depends on what you are trying to achieve.

The most common case is when you have been parallelizing your variant calling analyses, e.g. running HaplotypeCaller per-chromosome, producing separate VCF files (or GVCF files) per-chromosome. For that case, you can use the Picard tool MergeVcfs to merge the files. See the relevant Tool Doc page for usage details.

The second case is when you have been using HaplotypeCaller in -ERC GVCF or -ERC BP_RESOLUTION to call variants on a large cohort, producing many GVCF files. You then need to consolidate them before joint-calling variants with GenotypeGVCFs (for performance reasons). This can be done with either the CombineGVCFs or the GenomicsDBImport tool, both of which are specifically designed to handle GVCFs in this way. See the relevant Tool Doc pages for usage details and the Best Practices workflow documentation to learn more about the logic of this workflow.

The third case is when you want to compare variant calls that were produced from the same samples but using different methods. For example, you may be evaluating variant calls produced by different variant callers, different workflows, or the same workflow with different parameters. For this case, we recommend taking a different approach; rather than merging the VCF files (which can have all sorts of complicated consequences), you can use the VariantAnnotator tool to annotate one of the VCFs with the other treated as a resource. See the relevant Tool Doc page for usage details.
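
For the first two cases, the command lines look roughly like the sketch below. The functions only print the commands, so they can be reviewed before anything is run. All file names in the usage are placeholders. CombineGVCFs is shown for the consolidation step; GenomicsDBImport is the alternative named above and is not covered here.

```shell
# Case 1: per-chromosome VCFs of the same samples -> one VCF.
# Usage: merge_vcfs_cmd OUT.vcf.gz IN1.vcf.gz IN2.vcf.gz ...
merge_vcfs_cmd() {
    out=$1; shift
    printf 'gatk MergeVcfs'
    for vcf in "$@"; do printf ' -I %s' "$vcf"; done
    printf ' -O %s\n' "$out"
}

# Case 2: per-sample GVCFs -> one cohort GVCF, then joint genotyping.
# Usage: joint_call_cmds REF COHORT.g.vcf.gz OUT.vcf.gz S1.g.vcf.gz S2.g.vcf.gz ...
joint_call_cmds() {
    ref=$1; cohort=$2; out=$3; shift 3
    printf 'gatk CombineGVCFs -R %s' "$ref"
    for gvcf in "$@"; do printf ' -V %s' "$gvcf"; done
    printf ' -O %s\n' "$cohort"
    printf 'gatk GenotypeGVCFs -R %s -V %s -O %s\n' "$ref" "$cohort" "$out"
}
```

For example, `merge_vcfs_cmd merged.vcf.gz chr1.vcf.gz chr2.vcf.gz` prints a single MergeVcfs command with one -I per input file.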

anneng

https://gatk.broadinstitute.org/hc/en-us/community/posts/360071192131-Merge-different-individual-VCF

MergeVcfs requires all input files to have the same sample list.
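
A small pre-check for this, sketched with standard tools only. It reads the sample names from the #CHROM header line of each VCF and compares them with the first file. Order is treated as significant here, which is the stricter reading of "same sample list". The function names are made up for this note.

```shell
# Print the sample names from the #CHROM header line, one per line.
# Works on plain and gz/bgzip-compressed VCFs.
vcf_samples() {
    case "$1" in
        *.gz) gzip -cd "$1" ;;
        *)    cat "$1" ;;
    esac | awk -F '\t' '/^#CHROM/ { for (i = 10; i <= NF; i++) print $i; exit }'
}

# Succeed only when every VCF lists the same samples,
# in the same order, as the first one.
same_samples() {
    first=$(vcf_samples "$1"); shift
    for vcf in "$@"; do
        if [ "$(vcf_samples "$vcf")" != "$first" ]; then
            echo "sample list differs: $vcf"
            return 1
        fi
    done
    echo "sample lists match"
}
```

Running `same_samples chr1.vcf.gz chr2.vcf.gz chr3.vcf.gz` before MergeVcfs reports the first file whose samples do not match.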

anneng

https://www.melbournebioinformatics.org.au/tutorials/tutorials/variant_calling_gatk1/files/variant_calling_gatk1.pdf

gatk VariantsToTable \
-R reference/hg38/Homo_sapiens_assembly38.fasta \
-V output/output.vqsr.varfilter.pass.vcf.gz \
-F CHROM -F POS -F FILTER -F TYPE -GF AD -GF DP \
--show-filtered \
-O output/output.vqsr.varfilter.pass.tsv

GATK can convert a VCF into a table.
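
Once the table exists, ordinary text tools are enough for quick summaries. The sketch below counts rows per value of a named column. It assumes a tab-separated file whose first line holds the column names, such as the CHROM, POS, FILTER and TYPE fields requested above. The function name is hypothetical.

```shell
# Count rows per distinct value of one column in a tab-separated table
# whose first line is the header.
# Usage: count_by_column TABLE.tsv COLUMN_NAME
count_by_column() {
    awk -F '\t' -v col="$2" '
        NR == 1 {
            # locate the requested column by its header name
            for (i = 1; i <= NF; i++) if ($i == col) c = i
            if (!c) { print "no such column: " col > "/dev/stderr"; exit 1 }
            next
        }
        { n[$c]++ }
        END { for (k in n) print k "\t" n[k] }
    ' "$1" | sort
}
```

For example, `count_by_column output/output.vqsr.varfilter.pass.tsv TYPE` gives one line per variant type with its count, and the same call with FILTER shows how many records each filter caught.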

anneng

https://hpc.nih.gov/training/gatk_tutorial/
A practical introduction to GATK 4 on Biowulf (NIH HPC)

anneng

https://support.terra.bio/hc/en-us/articles/360037493811--4-howto-Use-scatter-gather-to-joint-call-genotypes

anneng

https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-020-07013-y
Optimizing performance of GATK workflows using Apache Arrow In-Memory data framework