暗能星系

    GATK usage notes

    Bioinformatics analysis
    • anneng

      View the help documentation for a specific tool:

      gatk HaplotypeCaller --help

      • anneng

        # Call variants
        gatk HaplotypeCaller \
            -R ref/ref.fasta \
            -I bams/mother.bam \
            -O sandbox/mother_variants.vcf

        "Genomics in the Cloud" notes that HaplotypeCaller is better at calling indels than mpileup:
        It also makes the HaplotypeCaller much better at calling indels than traditional position-based callers like the old UnifiedGenotyper and Samtools mpileup.

        https://www.nature.com/scitable/definition/haplotype-haplotypes-142/
        haplotype / haplotypes
        A haplotype is a group of genes within an organism that was inherited together from a single parent. The word "haplotype" is derived from the word "haploid," which describes cells with only one set of chromosomes, and from the word "genotype," which refers to the genetic makeup of an organism. A haplotype can describe a pair of genes inherited together from one parent on one chromosome, or it can describe all of the genes on a chromosome that were inherited together from a single parent. This group of genes was inherited together because of genetic linkage, or the phenomenon by which genes that are close to each other on the same chromosome are often inherited together. In addition, the term "haplotype" can also refer to the inheritance of a cluster of single nucleotide polymorphisms (SNPs), which are variations at single positions in the DNA sequence among individuals.

        By examining haplotypes, scientists can identify patterns of genetic variation that are associated with health and disease states. For instance, if a haplotype is associated with a certain disease, then scientists can examine stretches of DNA near the SNP cluster to try to identify the gene or genes responsible for causing the disease.

        657c48e8-6abe-499c-871b-7772c92f65e6-image.png

        • anneng

          Picard tools can be called directly from GATK; Picard is now bundled as part of GATK.
          gatk ValidateSamFile \
              -R ref/ref.fasta \
              -I bams/mother.bam \
              -O sandbox/mother_validation.txt

          • anneng

            Use -bamout to write HaplotypeCaller's debugging information to a BAM file:
            gatk HaplotypeCaller \
                -R ref/ref.fasta \
                -I bams/mother.bam \
                -O sandbox/mother_variants.snippet.debug.vcf \
                -bamout sandbox/mother_variants.snippet.debug.bam \
                -L 20:10,002,000-10,003,000

            • anneng

              At this point, we have a callset of potential variants, but we know from the earlier overview of variant filtering that this callset is likely to contain many false-positive calls; that is, calls that are caused by technical artifacts and do not actually correspond to real biological variation. We need to get rid of as many of those as we can without losing real variants. How do we do that?
              A commonly used approach for filtering germline short variants is to use variant context annotations, which are statistics captured during the variant calling process that summarize the quality and quantity of evidence that was observed for each variant. For example, some variant context annotations describe what the sequence context was like around the variant site (were there a lot of repeated bases? more GC or more AT?), how many reads covered it, how many reads covered each allele, what proportion of reads were in forward versus reverse orientation, and so on. We can choose thresholds for each annotation and set a hard filtering policy that says, for example, “For any given variant, if the value of this annotation is greater than a threshold value of X, we consider the variant to be real; otherwise, we filter it out.”

              • anneng

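                gatk VariantFiltration \
                    -R ref/ref.fasta \
                    -V vcfs/motherSNP.vcf.gz \
                    --filter-expression "QD < 2.0 || DP > 100.0" \
                    --filter-name "lowQD_highDP" \
                    -O sandbox/motherSNP.QD2.DP100.vcf.gz
                This filters on QD and DP: records with QD < 2.0 or DP > 100.0 are tagged with the filter name lowQD_highDP.
                Thought: if every filtering attempt means running another command, how many VCFs does that end up generating? If the filtering could be done against a database instead, it could be made interactive, which would be a much better experience.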
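
On the thought above: short of a database, a threshold can at least be tried against the INFO column without writing a new VCF each time. A minimal sketch, assuming an uncompressed VCF with QD and DP present as INFO keys. The function name count_flagged and the sample records are made up for illustration, and a record missing an annotation is simply not flagged on it here, which may differ from how GATK evaluates the expression.

```shell
# count_flagged VCF QD_MIN DP_MAX
# Prints "<flagged> <total>" for records where QD < QD_MIN or DP > DP_MAX.
count_flagged() {
    awk -F'\t' -v qd_min="$2" -v dp_max="$3" '
        /^#/ { next }                       # skip header lines
        {
            total++
            qd = ""; dp = ""
            n = split($8, kv, ";")          # INFO is column 8
            for (i = 1; i <= n; i++) {
                if (kv[i] ~ /^QD=/) qd = substr(kv[i], 4)
                if (kv[i] ~ /^DP=/) dp = substr(kv[i], 4)
            }
            if ((qd != "" && qd + 0 < qd_min + 0) ||
                (dp != "" && dp + 0 > dp_max + 0)) flagged++
        }
        END { print flagged + 0, total + 0 }
    ' "$1"
}

# Made-up records for illustration only.
demo_vcf=$(mktemp)
printf '##fileformat=VCFv4.2\n' > "$demo_vcf"
printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n' >> "$demo_vcf"
printf '20\t10000117\t.\tC\tT\t50\t.\tDP=40;QD=12.5\n' >> "$demo_vcf"
printf '20\t10000211\t.\tC\tT\t30\t.\tDP=35;QD=1.2\n' >> "$demo_vcf"
printf '20\t10000439\t.\tT\tG\t90\t.\tDP=180;QD=20.1\n' >> "$demo_vcf"

count_flagged "$demo_vcf" 2.0 100.0
```

Re-running count_flagged with different thresholds shows how many records each choice would tag before any of them is committed to a VariantFiltration run.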

                • anneng

                  GATK use cases
                  5cde2488-2eb2-40c5-a91a-45990bbc869d-image.png
                  as well as the following:
                  Germline copy number variation

                  Structural variation

                  Mitochondrial variation

                  Blood biopsy

                  Pathogen/contaminant identification

                  • anneng

                    2454b187-8a25-4433-8b2f-428f3e008d81-image.png

                    • anneng

                      c1f7c323-9b8f-41bc-bfe7-bc31da3eb9f0-image.png

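                      GVCF mode is used to analyze multiple samples together (cohort studies).
                      The "consolidate cohort" step in the figure uses a GenomicsDB datastore.

                      GATK takes this approach mainly to solve the N+1 problem in cohort studies: otherwise, every time a sample is added to the cohort, the whole cohort has to be reanalyzed.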
                      62a95a72-cae1-45ce-aaab-7f0a182aac2a-image.png

                      # Generate a GVCF
                      gatk HaplotypeCaller \
                          -R ref/ref.fasta \
                          -I bams/mother.bam \
                          -O sandbox/mother_variants.200k.g.vcf.gz \
                          -L 20:10,000,000-10,200,000 \
                          -ERC GVCF

                      # Import the GVCFs of multiple samples
                      gatk GenomicsDBImport \
                          -V gvcfs/mother.g.vcf.gz \
                          -V gvcfs/father.g.vcf.gz \
                          --genomicsdb-workspace-path sandbox/trio-gdb \
                          --intervals 20:10,000,000-10,200,000

                      # Export from GenomicsDB
                      gatk SelectVariants \
                          -R ref/ref.fasta \
                          -V gendb://sandbox/trio-gdb \
                          -O sandbox/duo_selectvariants.g.vcf.gz
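
With more than a handful of samples, listing every -V by hand gets tedious. GenomicsDBImport also accepts a tab-separated sample map (sample name, then path) through --sample-name-map. A sketch that builds one from a directory of GVCFs, assuming each file is named <sample>.g.vcf.gz; the helper name make_sample_map and the demo files are hypothetical.

```shell
# make_sample_map DIR: print "<sample><TAB><path>" for each DIR/*.g.vcf.gz
make_sample_map() {
    for gvcf in "$1"/*.g.vcf.gz; do
        [ -e "$gvcf" ] || continue          # no matches: skip the literal glob
        sample=$(basename "$gvcf" .g.vcf.gz)
        printf '%s\t%s\n' "$sample" "$gvcf"
    done
}

# Demo with empty placeholder files (not real GVCFs).
demo_dir=$(mktemp -d)
touch "$demo_dir/mother.g.vcf.gz" "$demo_dir/father.g.vcf.gz"
make_sample_map "$demo_dir" > "$demo_dir/cohort.sample_map"
cat "$demo_dir/cohort.sample_map"
```

The map would then stand in for the -V lines, along the lines of:

gatk GenomicsDBImport \
    --sample-name-map cohort.sample_map \
    --genomicsdb-workspace-path sandbox/trio-gdb \
    --intervals 20:10,000,000-10,200,000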

                      • anneng

                        # Joint-call the cohort from the GenomicsDB workspace
                        gatk GenotypeGVCFs \
                            -R ref/ref.fasta \
                            -V gendb://sandbox/trio-gdb \
                            -O sandbox/trio-jointcalls.vcf.gz \
                            -L 20:10,000,000-10,200,000

                        # Alternative: multi-sample calling directly with HaplotypeCaller
                        gatk HaplotypeCaller \
                            -R ref/ref.fasta \
                            -I bams/mother.bam \
                            -I bams/father.bam \
                            -I bams/son.bam \
                            -O sandbox/trio_jointcalls_hc.vcf.gz \
                            -L 20:10,000,000-10,200,000

                        • anneng

                          intervals.list
                          https://www.biostars.org/p/364917/
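                          For reference genomes that are not fully assembled, as is common in plants, the reference may contain a large number of contigs. The contig list can be put in a file and that file used when importing into GenomicsDB.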
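
One way to produce such a file is to take the contig names from the FASTA index. A sketch, assuming the index was written by samtools faidx and that one contig name per line is an acceptable intervals file for -L; the helper name contigs_from_fai and the contig names and lengths below are invented.

```shell
# contigs_from_fai FAI: print one contig name per line (column 1 of the index)
contigs_from_fai() {
    cut -f1 "$1"
}

# Invented index for illustration: name, length, offset, line bases, line width.
demo_fai=$(mktemp)
printf 'scaffold_1\t52000\t12\t60\t61\n'    >  "$demo_fai"
printf 'scaffold_2\t8100\t52891\t60\t61\n'  >> "$demo_fai"
printf 'contig_317\t950\t61139\t60\t61\n'   >> "$demo_fai"

intervals_list=$(mktemp)
contigs_from_fai "$demo_fai" > "$intervals_list"
cat "$intervals_list"
```

In real use the output would be saved as intervals.list and passed with -L intervals.list; as far as I know GATK recognizes interval files by extension, so the .list suffix matters.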

                          • anneng

                            https://bmcresnotes.biomedcentral.com/articles/10.1186/1756-0500-7-747
                            We implemented the best practices of GATK pipeline to call SNPs and Indels. We have used GATK 2.4 version and GATK-UnifiedGenotyper as SNP caller in this study. We have used multi-sample variant calling by GATK-UnifiedGenotyper. The reason of using multi-sample calling is to distinguish non-variant genotypes between homozygous reference genotype and missing genotype in cohort analysis. With single sample calling genotype called only for variants we can’t be sure if the non-variants have missing genotype or same as reference. Also, big projects like 1000 genomes have preferred multi-sample calling over single sample calling[18]. We used GATK-UnifiedGenotyper instead of GATK-HaplotypeCaller, a similar or better variant caller by GATK, in this study because of similar accuracy in calling SNPs and computational feasibility to run for large number of samples. For more than 100 samples, according to GATK website, GATK-UnifiedGenotyper is advised over GATK-HaplotypeCaller. The real advantage of Haplotypecaller over UnifiedGenotyper is in calling Indels but in this paper we are focusing on SNPs only.
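                            This article makes two points:
                            1. Multiple samples are called together to discover more variants, which makes it possible to tell a homozygous-reference genotype from a missing one?
                            2. UnifiedGenotyper is no longer recommended, but for a large number of samples where only SNPs matter, the authors still chose it.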
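
Point 1 is easier to see on a multi-sample record, where a sample that matches the reference (0/0) can be told apart from one with no call (./.). A sketch that tallies both per site, assuming an uncompressed VCF with GT as the first FORMAT field; gt_summary is a hypothetical helper and the records are invented.

```shell
# gt_summary VCF: per record, print "CHROM:POS hom_ref=<n> missing=<n> other=<n>"
gt_summary() {
    awk -F'\t' '
        /^#/ { next }                           # skip header lines
        {
            hom_ref = 0; missing = 0; other = 0
            for (i = 10; i <= NF; i++) {        # sample columns start at 10
                split($i, f, ":")               # GT is assumed to come first
                gt = f[1]
                gsub(/[|]/, "/", gt)            # treat phased like unphased
                if (gt == "0/0") hom_ref++
                else if (gt == "./." || gt == ".") missing++
                else other++
            }
            printf "%s:%s hom_ref=%d missing=%d other=%d\n", $1, $2, hom_ref, missing, other
        }
    ' "$1"
}

# Invented multi-sample records for illustration only.
demo_multi=$(mktemp)
printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tmother\tfather\tson\n' > "$demo_multi"
printf '20\t10000117\t.\tC\tT\t99\t.\t.\tGT:DP\t0/1:30\t0/0:28\t./.:0\n' >> "$demo_multi"
printf '20\t10000211\t.\tC\tT\t99\t.\t.\tGT:DP\t0/0:31\t0/0:25\t1/1:27\n' >> "$demo_multi"

gt_summary "$demo_multi"
```

In a single-sample callset, neither the 0/0 nor the ./. sample would appear at these sites at all, which is the ambiguity the paper describes.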

                            • anneng

                              https://www.biostars.org/p/54673/
                              GATK consensus
