暗能星系

    GATK

    Bioinformatics analysis
    • zhanglu (last edited by zhanglu)

      GATK pipeline notes

      Reference sequence preparation:

      1. Build the BWA index with bwa index

      Image: 11918067/genomes-in-the-cloud:2.4.2-1552931386
      Command: /usr/gitc/bwa index *.fa

      2. Create the .dict file with Picard

      Image: broadinstitute/picard:2.27.3
      Command: java -jar /usr/picard/picard.jar CreateSequenceDictionary R=Nitab-v4.5_genome_Chr_Edwards2017.fasta O=Nitab-v4.5_genome_Chr_Edwards2017.fasta.dict

      3. Create the .fai index with samtools

      Image: quay.io/biocontainers/samtools:1.15.1--h1170115_0
      Command: samtools faidx Nitab-v4.5_genome_Chr_Edwards2017.fasta

      4. Build the SnpEff reference database:

      Image: anneng01:8090/library/angs_snpeff:1.0.0
      Command: snp_build -name ${Name} -ann ${Gff} -fa ${Fa}
      Output path: result

      Notes

      The VCF in the SnpEff input path, together with its index file, must be gz-compressed, for example: reference.vcf.gz
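A quick pre-flight check for that requirement might look like the sketch below. It assumes the index sits next to the VCF as `.tbi` or `.csi` (the usual tabix naming, which the post does not state), and the function name `check_snpeff_vcf` is invented for the example.

```shell
# Succeed only if the file is named *.vcf.gz, is really gzip data,
# and has an index next to it. The .tbi/.csi naming is an assumption.
check_snpeff_vcf() (
    vcf="$1"
    case "$vcf" in
        *.vcf.gz) ;;
        *) echo "expected a .vcf.gz file: $vcf" >&2; exit 1 ;;
    esac
    # gzip -t verifies the compressed stream (bgzip output passes too)
    gzip -t "$vcf" 2>/dev/null || { echo "not gzip-compressed: $vcf" >&2; exit 1; }
    [ -f "$vcf.tbi" ] || [ -f "$vcf.csi" ] || { echo "no index found for: $vcf" >&2; exit 1; }
)
```

For example, `check_snpeff_vcf reference.vcf.gz && echo ok` prints `ok` only when all three conditions hold.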

      WDL workflow

      gatk_merge.wdl

      Test data

      {
          "GATK.fastq_1":"/ceph_disk3/file_server/tmp/lanzhou/data/H06HDADXX130110.1.ATCACGAT.20k_reads_1.fastq",
          "GATK.fastq_2":"/ceph_disk3/file_server/tmp/lanzhou/data/H06HDADXX130110.1.ATCACGAT.20k_reads_2.fastq",
          "GATK.ref_fasta":"/ceph_disk2/data/hongyuan/mnt/data/public_data/tobacco_1656467890/Nitab-v4.5_genome_Chr_Edwards2017.fasta",
          "GATK.snp_ref":"/ceph_disk2/data/hongyuan/mnt/data/public_data/tobacco_1656467890/result"
      }
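Since the test inputs point at cluster-specific paths, it may help to check them before submitting the workflow. The sketch below assumes a flat inputs JSON like the one above, where every path is an absolute string value; `missing_inputs` is a hypothetical helper name and `inputs.json` a hypothetical file name.

```shell
# List every absolute path in a flat WDL inputs JSON that does not exist.
missing_inputs() (
    json="$1"
    # Pull each string value that starts with "/" and strip the quotes.
    grep -o '"/[^"]*"' "$json" | tr -d '"' | while IFS= read -r path; do
        [ -e "$path" ] || printf '%s\n' "$path"
    done
)
```

Running `missing_inputs inputs.json` prints nothing when all FASTQ, reference, and SnpEff paths are present.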
      
      • anneng

        https://qcb.ucla.edu/wp-content/uploads/sites/14/2016/03/GATK_Discovery_Tutorial-Worksheet-AUS2016.pdf

        A GATK guide that explains things in considerable detail.

        • anneng

          https://yulijia.net/slides/bioinfomatcis_for_medical_students/2019-07-31-A_beginners_guide_to_Call_SNPs_and_indels_Part_II.html#1

          • anneng

            https://wikis.univ-lille.fr/bilille/_media/ngs2019_dna_duplicates.pdf

            • anneng

              https://paleomix.readthedocs.io/en/stable/other_tools.html#paleomix-rmdup-collapsed
              Removing duplicates from merged (collapsed) reads

              https://www.biostars.org/p/347514/
              Deduplicate first, then merge

              • anneng

                GATK_Discovery_Tutorial-Worksheet-AUS2016.pdf

                • anneng

                  https://qcb.ucla.edu/wp-content/uploads/sites/14/2016/03/GATKwr12-5-Variant_calling_joint_genotyping.pdf
                  GATKwr12-5-Variant_calling_joint_genotyping.pdf
                  An explanation of GVCF

                  • anneng

                    https://gatk.broadinstitute.org/hc/en-us/articles/360035890471-Hard-filtering-germline-short-variants
                    VCF filtering

                    • anneng (last edited by anneng)

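                      Add a new SNP filtering and annotation workflow, consisting of:
                      Filtering:

                         gatk VariantFiltration \
                         -R reference.fasta \
                         -V input.vcf.gz \
                         -O output.vcf.gz \
                         -filter "QD < 4.0 || FS > 60.0 || MQ < 40.0" \
                         --filter-name "tobacco" \
                         -window "4" \
                         -G-filter "GQ < 20.0" \
                         -G-filter-name "lowGQ"

                      (lowGQ is a placeholder name; GATK4 expects one -G-filter-name for each -G-filter expression.)

                      These filter parameters should be exposed so that they can be configured.
                      SnpEff: annotation
                      What GATK produces is the complete, unfiltered VCF.
                      After filtering, also run SnpEff to look at the results.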
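One way to expose the filter thresholds is to build the command from variables with defaults, as in the sketch below. The variable names (`QD_MIN`, `FS_MAX`, `MQ_MIN`, `GQ_MIN`, `CLUSTER_WINDOW`, `FILTER_NAME`, `GT_FILTER_NAME`) and the `lowGQ` genotype filter name are invented for this example; the default values are the ones from the post. The command is printed, not executed, and uses the long GATK4 option names.

```shell
# Build a VariantFiltration command line from configurable thresholds.
# All variable names are hypothetical; defaults mirror the post.
variant_filtration_cmd() (
    ref="$1"; in_vcf="$2"; out_vcf="$3"
    qd="${QD_MIN:-4.0}"
    fs="${FS_MAX:-60.0}"
    mq="${MQ_MIN:-40.0}"
    gq="${GQ_MIN:-20.0}"
    window="${CLUSTER_WINDOW:-4}"
    name="${FILTER_NAME:-tobacco}"
    gt_name="${GT_FILTER_NAME:-lowGQ}"   # placeholder, not from the post

    printf 'gatk VariantFiltration -R "%s" -V "%s" -O "%s"' "$ref" "$in_vcf" "$out_vcf"
    printf ' --filter-expression "QD < %s || FS > %s || MQ < %s"' "$qd" "$fs" "$mq"
    printf ' --filter-name "%s"' "$name"
    printf ' --cluster-window-size %s' "$window"
    printf ' --genotype-filter-expression "GQ < %s"' "$gq"
    printf ' --genotype-filter-name "%s"\n' "$gt_name"
)
```

In a WDL task the same idea would become task inputs with default values; here, `QD_MIN=2.0; variant_filtration_cmd reference.fasta input.vcf.gz output.vcf.gz` prints the command with a looser QD cutoff.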

                      • anneng

                        https://nbisweden.github.io/workshop-ngsintro/2005/lab_vc.html#5_Variant_calling_in_cohort

                        • anneng

                          https://www.clinbioinfosspa.es/files/pipelines/germline_workflow_diagram.pdf

                          • anneng

                            https://haplotypecaller1.rssing.com/chan-10646605/all_p46.html

                            1. Merging VCF files

                            There are three main reasons why you might want to combine variants from different files into one, and the tool to use depends on what you are trying to achieve.

                            The most common case is when you have been parallelizing your variant calling analyses, e.g. running HaplotypeCaller per-chromosome, producing separate VCF files (or GVCF files) per-chromosome. For that case, you can use the Picard tool MergeVcfs to merge the files. See the relevant Tool Doc page for usage details.

                            The second case is when you have been using HaplotypeCaller in -ERC GVCF or -ERC BP_RESOLUTION to call variants on a large cohort, producing many GVCF files. You then need to consolidate them before joint-calling variants with GenotypeGVCFs (for performance reasons). This can be done with either the CombineGVCFs or the GenomicsDBImport tool, both of which are specifically designed to handle GVCFs in this way. See the relevant Tool Doc pages for usage details and the Best Practices workflow documentation to learn more about the logic of this workflow.

                            The third case is when you want to compare variant calls that were produced from the same samples but with different methods. For example, you may be evaluating variant calls produced by different variant callers, different workflows, or the same workflow with different parameters. For this case, we recommend taking a different approach; rather than merging the VCF files (which can have all sorts of complicated consequences), you can use the VariantAnnotator tool to annotate one of the VCFs with the other treated as a resource. See the relevant Tool Doc page for usage details.
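The first two cases could be sketched as command builders like the ones below. The function names and all file names in the usage line are placeholders for illustration; the commands are printed rather than executed, and GenomicsDBImport (the alternative for the second case) is not shown.

```shell
# Case 1: per-chromosome VCFs of the same samples -> one VCF (MergeVcfs).
merge_vcfs_cmd() (
    out="$1"; shift
    printf 'gatk MergeVcfs'
    for vcf in "$@"; do printf ' -I "%s"' "$vcf"; done
    printf ' -O "%s"\n' "$out"
)

# Case 2: per-sample GVCFs -> one cohort GVCF, to be passed to GenotypeGVCFs.
combine_gvcfs_cmd() (
    ref="$1"; out="$2"; shift 2
    printf 'gatk CombineGVCFs -R "%s"' "$ref"
    for gvcf in "$@"; do printf ' -V "%s"' "$gvcf"; done
    printf ' -O "%s"\n' "$out"
)
```

For instance, `merge_vcfs_cmd merged.vcf.gz chr1.vcf.gz chr2.vcf.gz` prints a MergeVcfs call with one `-I` per input file.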

                            • anneng

                              https://gatk.broadinstitute.org/hc/en-us/community/posts/360071192131-Merge-different-individual-VCF

                              MergeVcfs requires all input files to have the same sample list.
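That requirement can be checked before merging by comparing the `#CHROM` header lines. The sketch below reads plain or gzip-compressed VCFs using only standard tools; the function names are made up, and it applies the stricter assumption that the samples must also appear in the same order.

```shell
# Print the sample names (tab-separated) from a VCF header.
# gzip -cdf passes uncompressed input through unchanged.
vcf_samples() {
    gzip -cdf "$1" | awk '/^#CHROM/ { print; exit }' | cut -f 10-
}

# Succeed only if every VCF has the same samples, in the same order,
# as the first one.
same_samples() (
    first="$(vcf_samples "$1")"; shift
    for vcf in "$@"; do
        [ "$(vcf_samples "$vcf")" = "$first" ] || {
            echo "sample list differs: $vcf" >&2
            exit 1
        }
    done
)
```

A call such as `same_samples chr1.vcf.gz chr2.vcf.gz chr3.vcf.gz` returns non-zero and names the first file whose samples differ.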

                              • anneng

                                https://www.melbournebioinformatics.org.au/tutorials/tutorials/variant_calling_gatk1/files/variant_calling_gatk1.pdf

                                gatk VariantsToTable \
                                -R reference/hg38/Homo_sapiens_assembly38.fasta \
                                -V output/output.vqsr.varfilter.pass.vcf.gz \
                                -F CHROM -F POS -F FILTER -F TYPE -GF AD -GF DP \
                                --show-filtered \
                                -O output/output.vqsr.varfilter.pass.tsv
                                

                                GATK can convert a VCF into a table.
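Once the VCF is a tab-separated table, ordinary text tools can summarize it. As a small sketch, the hypothetical helper below counts rows per FILTER value; it finds the FILTER column by its header name, so it does not depend on the order of the `-F` options.

```shell
# Count rows per FILTER value in a VariantsToTable TSV.
count_by_filter() {
    awk -F '\t' '
        NR == 1 { for (i = 1; i <= NF; i++) if ($i == "FILTER") col = i; next }
        col     { n[$col]++ }
        END     { for (f in n) printf "%s\t%d\n", f, n[f] }
    ' "$1" | sort
}
```

With the table from the command above, `count_by_filter output/output.vqsr.varfilter.pass.tsv` would print one line per filter status with its record count.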

                                • anneng

                                  https://hpc.nih.gov/training/gatk_tutorial/
                                  A practical introduction to GATK 4 on Biowulf (NIH HPC)

                                  • anneng

                                    https://support.terra.bio/hc/en-us/articles/360037493811--4-howto-Use-scatter-gather-to-joint-call-genotypes

                                    • anneng

                                      https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-020-07013-y
                                      Optimizing performance of GATK workflows using Apache Arrow In-Memory data framework
