暗能星系

    GATK usage notes

    Bioinformatics analysis
    • anneng

      Use -bamout to output debugging information from GATK HaplotypeCaller:
      gatk HaplotypeCaller \
        -R ref/ref.fasta \
        -I bams/mother.bam \
        -O sandbox/mother_variants.snippet.debug.vcf \
        -bamout sandbox/mother_variants.snippet.debug.bam \
        -L 20:10,002,000-10,003,000

      • anneng

        At this point, we have a callset of potential variants, but we know from the earlier overview of variant filtering that this callset is likely to contain many false-positive calls; that is, calls that are caused by technical artifacts and do not actually correspond to real biological variation. We need to get rid of as many of those as we can without losing real variants. How do we do that?
        A commonly used approach for filtering germline short variants is to use variant context annotations, which are statistics captured during the variant calling process that summarize the quality and quantity of evidence that was observed for each variant. For example, some variant context annotations describe what the sequence context was like around the variant site (were there a lot of repeated bases? more GC or more AT?), how many reads covered it, how many reads covered each allele, what proportion of reads were in forward versus reverse orientation, and so on. We can choose thresholds for each annotation and set a hard filtering policy that says, for example, “For any given variant, if the value of this annotation is greater than a threshold value of X, we consider the variant to be real; otherwise, we filter it out.”

        • anneng

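          gatk VariantFiltration \
            -R ref/ref.fasta \
            -V vcfs/motherSNP.vcf.gz \
            --filter-expression "QD < 2.0 || DP > 100.0" \
            --filter-name "lowQD_highDP" \
            -O sandbox/motherSNP.QD2.DP100.vcf.gz
          This filters on QD and DP.
          Thought: if every filtering pass means running another command, how many VCFs would that generate? If the filtering could be done from a database instead, it could be made interactive, which would be a better experience.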
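The thought above can be explored without writing a new VCF for every threshold by streaming the records through a small filter whose cut-offs are parameters. What follows is only a minimal sketch and not GATK functionality: plain awk is swapped in for VariantFiltration, the function name `soft_filter` is hypothetical, and the three records are made up for illustration. It mirrors the logic of the expression "QD < 2.0 || DP > 100.0".

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of GATK): label records that fail QD/DP thresholds.
# Usage: soft_filter MIN_QD MAX_DP < records.tsv
# Input columns (tab-separated): CHROM, POS, INFO, where INFO holds "QD=...;DP=...".
soft_filter() {
  awk -F'\t' -v min_qd="$1" -v max_dp="$2" '
    {
      qd = ""; dp = ""
      n = split($3, kv, ";")
      for (i = 1; i <= n; i++) {
        split(kv[i], pair, "=")
        if (pair[1] == "QD") qd = pair[2]
        if (pair[1] == "DP") dp = pair[2]
      }
      # Same logic as "QD < min_qd || DP > max_dp".
      # Missing annotations are treated as 0 in this sketch.
      failed = ((qd + 0) < (min_qd + 0) || (dp + 0) > (max_dp + 0))
      status = failed ? "lowQD_highDP" : "PASS"
      print $1 "\t" $2 "\t" status
    }'
}

# Made-up example records (not real data)
records=$(printf '20\t10001019\tQD=1.5;DP=40\n20\t10001298\tQD=25.0;DP=60\n20\t10001436\tQD=30.1;DP=150\n')

# Try two threshold pairs without creating any intermediate file
printf '%s\n' "$records" | soft_filter 2.0 100
printf '%s\n' "$records" | soft_filter 2.0 200
```

Because the thresholds are arguments and the result goes to standard output, changing a cut-off means re-running one function call rather than producing another file. An interactive tool backed by a database would follow the same idea with a query in place of the awk pass.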

          • anneng

            GATK use cases
            [image: 5cde2488-2eb2-40c5-a91a-45990bbc869d-image.png]
            As well as the following scenarios:
            Germline copy number variation

            Structural variation

            Mitochondrial variation

            Blood biopsy

            Pathogen/contaminant identification

            • anneng

              [image: 2454b187-8a25-4433-8b2f-428f3e008d81-image.png]

              • anneng

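                [image: c1f7c323-9b8f-41bc-bfe7-bc31da3eb9f0-image.png]

                GVCF mode is used to analyze multiple samples together (cohort studies).
                The "consolidate cohort" step in the figure uses a GenomicsDB database.

                GATK takes this approach mainly to solve the N+1 problem in cohort studies: every time a sample is added to the cohort, the whole cohort has to be re-analyzed.
                [image: 62a95a72-cae1-45ce-aaab-7f0a182aac2a-image.png]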

                # Generate a GVCF
                gatk HaplotypeCaller \
                  -R ref/ref.fasta \
                  -I bams/mother.bam \
                  -O sandbox/mother_variants.200k.g.vcf.gz \
                  -L 20:10,000,000-10,200,000 \
                  -ERC GVCF

                # Import the GVCFs of multiple samples
                gatk GenomicsDBImport \
                  -V gvcfs/mother.g.vcf.gz \
                  -V gvcfs/father.g.vcf.gz \
                  --genomicsdb-workspace-path sandbox/trio-gdb \
                  --intervals 20:10,000,000-10,200,000

                # Export from GenomicsDB
                gatk SelectVariants \
                  -R ref/ref.fasta \
                  -V gendb://sandbox/trio-gdb \
                  -O sandbox/duo_selectvariants.g.vcf.gz
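For a larger cohort, repeating -V for every GVCF in the import step gets unwieldy. GenomicsDBImport also accepts a tab-separated sample map through `--sample-name-map`. The sketch below builds such a map from a directory listing. The helper name `make_sample_map`, the file name `cohort.sample_map`, and the choice to derive sample names from file names are assumptions made for illustration, and the demo uses empty placeholder files rather than real GVCFs.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print "sample<TAB>path" for every *.g.vcf.gz in a directory.
# Assumption: the sample name is the file name without the .g.vcf.gz suffix.
make_sample_map() {
  local dir="$1"
  local f
  for f in "$dir"/*.g.vcf.gz; do
    [ -e "$f" ] || continue   # skip when the glob matches nothing
    printf '%s\t%s\n' "$(basename "$f" .g.vcf.gz)" "$f"
  done
}

# Demo with empty placeholder files in a temporary directory (not real GVCFs)
demo_dir=$(mktemp -d)
touch "$demo_dir/mother.g.vcf.gz" "$demo_dir/father.g.vcf.gz"
make_sample_map "$demo_dir" > "$demo_dir/cohort.sample_map"
cat "$demo_dir/cohort.sample_map"

# The map could then stand in for the repeated -V arguments, for example:
#   gatk GenomicsDBImport \
#     --sample-name-map cohort.sample_map \
#     --genomicsdb-workspace-path sandbox/trio-gdb \
#     --intervals 20:10,000,000-10,200,000
```

The gatk call is left as a comment because it needs real GVCF files and a GATK installation to run.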

                • anneng

                  # Joint genotyping from the GenomicsDB workspace
                  gatk GenotypeGVCFs \
                    -R ref/ref.fasta \
                    -V gendb://sandbox/trio-gdb \
                    -O sandbox/trio-jointcalls.vcf.gz \
                    -L 20:10,000,000-10,200,000

                  # Multi-sample calling directly from the BAM files
                  gatk HaplotypeCaller \
                    -R ref/ref.fasta \
                    -I bams/mother.bam \
                    -I bams/father.bam \
                    -I bams/son.bam \
                    -O sandbox/trio_jointcalls_hc.vcf.gz \
                    -L 20:10,000,000-10,200,000

                  • anneng

                    intervals.list
                    https://www.biostars.org/p/364917/
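                    For reference genomes that are not fully assembled, as is common for plants, the reference may contain many contigs. The contig list can be given in a file, which is then passed when importing into GenomicsDB.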
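One possible way to produce such a contig list is to take the first column of the FASTA index, which `samtools faidx ref.fasta` writes as `ref.fasta.fai`. This is a minimal sketch under stated assumptions: the index below is a made-up stand-in with hypothetical contig names, and the output file name `intervals.list` simply follows the name used in this post.

```shell
#!/usr/bin/env bash
# Made-up .fai index for the demo (columns: name, length, offset, linebases, linewidth)
demo_dir=$(mktemp -d)
printf 'contig_1\t5000\t10\t60\t61\ncontig_2\t3200\t5100\t60\t61\nscaffold_7\t800\t8400\t60\t61\n' \
  > "$demo_dir/ref.fasta.fai"

# The first column of a .fai index is the sequence name;
# one name per line is a valid interval list for -L
cut -f1 "$demo_dir/ref.fasta.fai" > "$demo_dir/intervals.list"
cat "$demo_dir/intervals.list"

# Possible use (paths follow the earlier commands in this thread):
#   gatk GenomicsDBImport \
#     -V gvcfs/mother.g.vcf.gz \
#     -V gvcfs/father.g.vcf.gz \
#     --genomicsdb-workspace-path sandbox/trio-gdb \
#     -L intervals.list
```

With a real reference, only the `cut` line is needed once the index exists.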

                    • anneng

                      https://bmcresnotes.biomedcentral.com/articles/10.1186/1756-0500-7-747
                      We implemented the best practices of GATK pipeline to call SNPs and Indels. We have used GATK 2.4 version and GATK-UnifiedGenotyper as SNP caller in this study. We have used multi-sample variant calling by GATK-UnifiedGenotyper. The reason of using multi-sample calling is to distinguish non-variant genotypes between homozygous reference genotype and missing genotype in cohort analysis. With single sample calling genotype called only for variants we can’t be sure if the non-variants have missing genotype or same as reference. Also, big projects like 1000 genomes have preferred multi-sample calling over single sample calling[18]. We used GATK-UnifiedGenotyper instead of GATK-HaplotypeCaller, a similar or better variant caller by GATK, in this study because of similar accuracy in calling SNPs and computational feasibility to run for large number of samples. For more than 100 samples, according to GATK website, GATK-UnifiedGenotyper is advised over GATK-HaplotypeCaller. The real advantage of Haplotypecaller over UnifiedGenotyper is in calling Indels but in this paper we are focusing on SNPs only.
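                      This article makes two points:
                      1. Multiple samples are used in order to discover more variants, which makes it possible to distinguish homozygous (reference) genotypes from missing ones?
                      2. Although UnifiedGenotyper is no longer recommended, it was still the method chosen here because there were many samples and only SNPs were of interest.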

                      • anneng

                        https://www.biostars.org/p/54673/
                        GATK consensus
