<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[GATK usage notes]]></title><description><![CDATA[<p dir="auto">Invoking a GATK tool (generic java form first, then the equivalent gatk wrapper form)<br />
java -jar program.jar [program arguments]<br />
gatk ToolName [tool arguments]</p>
<p dir="auto">Passing JVM options<br />
java -Xmx4G -XX:+PrintGCDetails -jar program.jar [program arguments]<br />
gatk --java-options "-Xmx4G -XX:+PrintGCDetails" ToolName [tool arguments]</p>
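<p dir="auto">// Run on local Spark; 4 is the number of cores<br />
gatk MySparkTool \<br />
-R data/reference.fasta \<br />
-I data/sample1.bam \<br />
-O data/variants.vcf \<br />
-- \<br />
--spark-master 'local[4]'<br />
To run on a remote Spark cluster, use these Spark arguments after the -- separator instead:<br />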
--spark-runner SPARK --spark-master spark://23.195.26.187:7077</p>
]]></description><link>http://an.forum.genostack.com/topic/194/gatk使用笔记</link><generator>RSS for Node</generator><lastBuildDate>Sat, 13 Jun 2026 09:38:56 GMT</lastBuildDate><atom:link href="http://an.forum.genostack.com/topic/194.rss" rel="self" type="application/rss+xml"/><pubDate>Mon, 01 Feb 2021 11:41:22 GMT</pubDate><ttl>60</ttl><item><title><![CDATA[Reply to GATK usage notes on Fri, 01 Jul 2022 12:12:59 GMT]]></title><description><![CDATA[<p dir="auto"><a href="https://www.biostars.org/p/54673/" rel="nofollow ugc">https://www.biostars.org/p/54673/</a><br />
GATK consensus</p>
]]></description><link>http://an.forum.genostack.com/post/1643</link><guid isPermaLink="true">http://an.forum.genostack.com/post/1643</guid><dc:creator><![CDATA[anneng]]></dc:creator><pubDate>Fri, 01 Jul 2022 12:12:59 GMT</pubDate></item><item><title><![CDATA[Reply to GATK usage notes on Thu, 30 Jun 2022 09:40:14 GMT]]></title><description><![CDATA[<p dir="auto"><a href="https://bmcresnotes.biomedcentral.com/articles/10.1186/1756-0500-7-747" rel="nofollow ugc">https://bmcresnotes.biomedcentral.com/articles/10.1186/1756-0500-7-747</a><br />
We implemented the best practices of GATK pipeline to call SNPs and Indels. We have used GATK 2.4 version and GATK-UnifiedGenotyper as SNP caller in this study. We have used multi-sample variant calling by GATK-UnifiedGenotyper. The reason of using multi-sample calling is to distinguish non-variant genotypes between homozygous reference genotype and missing genotype in cohort analysis. With single sample calling genotype called only for variants we can’t be sure if the non-variants have missing genotype or same as reference. Also, big projects like 1000 genomes have preferred multi-sample calling over single sample calling[18]. We used GATK-UnifiedGenotyper instead of GATK-HaplotypeCaller, a similar or better variant caller by GATK, in this study because of similar accuracy in calling SNPs and computational feasibility to run for large number of samples. For more than 100 samples, according to GATK website, GATK-UnifiedGenotyper is advised over GATK-HaplotypeCaller. The real advantage of Haplotypecaller over UnifiedGenotyper is in calling Indels but in this paper we are focusing on SNPs only.<br />
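This article makes two points:<br />
1. Multi-sample calling is used to get more out of the cohort: at a site where a sample shows no variant, it can tell a homozygous-reference genotype apart from a missing genotype (if I read this correctly).<br />
2. UnifiedGenotyper is no longer recommended, but the authors still chose it because they had many samples and were only interested in SNPs.</p>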
]]></description><link>http://an.forum.genostack.com/post/1628</link><guid isPermaLink="true">http://an.forum.genostack.com/post/1628</guid><dc:creator><![CDATA[anneng]]></dc:creator><pubDate>Thu, 30 Jun 2022 09:40:14 GMT</pubDate></item><item><title><![CDATA[Reply to GATK usage notes on Mon, 08 Feb 2021 09:01:22 GMT]]></title><description><![CDATA[<p dir="auto">intervals.list<br />
<a href="https://www.biostars.org/p/364917/" rel="nofollow ugc">https://www.biostars.org/p/364917/</a><br />
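For reference genomes that are not fully assembled, as is common for plants, the reference may contain a large number of contigs. The contig list can be given in a file (intervals.list) and passed when importing into GenomicsDB.</p>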
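<p dir="auto">A minimal shell sketch of building such a contig list. It assumes the reference has a samtools-style FASTA index (a .fai file whose first tab-separated column is the contig name); the function name make_intervals_list and the file names are hypothetical, not part of GATK:</p>

```shell
# Hypothetical helper: write one contig name per line, taken from
# column 1 of a FASTA index (.fai), into an intervals list file.
make_intervals_list() {
  fai="$1"   # path to the .fai index (assumed to exist)
  out="$2"   # path of the intervals file to create
  cut -f1 "$fai" > "$out"
}

# Example (not run here):
#   make_intervals_list ref/ref.fasta.fai intervals.list
```

<p dir="auto">The resulting file could then be passed to GenomicsDBImport as -L intervals.list; GATK reads a file with a .list extension as one interval per line.</p>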
]]></description><link>http://an.forum.genostack.com/post/414</link><guid isPermaLink="true">http://an.forum.genostack.com/post/414</guid><dc:creator><![CDATA[anneng]]></dc:creator><pubDate>Mon, 08 Feb 2021 09:01:22 GMT</pubDate></item><item><title><![CDATA[Reply to GATK usage notes on Sat, 06 Feb 2021 09:21:46 GMT]]></title><description><![CDATA[<p dir="auto">gatk GenotypeGVCFs \<br />
-R ref/ref.fasta \<br />
-V gendb://sandbox/trio-gdb \<br />
-O sandbox/trio-jointcalls.vcf.gz \<br />
-L 20:10,000,000-10,200,000</p>
<p dir="auto">gatk HaplotypeCaller \<br />
-R ref/ref.fasta \<br />
-I bams/mother.bam \<br />
-I bams/father.bam \<br />
-I bams/son.bam \<br />
-O sandbox/trio_jointcalls_hc.vcf.gz \<br />
-L 20:10,000,000-10,200,000</p>
]]></description><link>http://an.forum.genostack.com/post/402</link><guid isPermaLink="true">http://an.forum.genostack.com/post/402</guid><dc:creator><![CDATA[anneng]]></dc:creator><pubDate>Sat, 06 Feb 2021 09:21:46 GMT</pubDate></item><item><title><![CDATA[Reply to GATK usage notes on Sat, 06 Feb 2021 09:18:35 GMT]]></title><description><![CDATA[<p dir="auto"><img src="/assets/uploads/files/1612601624213-c1f7c323-9b8f-41bc-bfe7-bc31da3eb9f0-image.png" alt="c1f7c323-9b8f-41bc-bfe7-bc31da3eb9f0-image.png" class=" img-responsive img-markdown" /></p>
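<p dir="auto">Use GVCF mode to analyze multiple samples together (cohort studies).<br />
The "consolidate cohort" step in the figure uses a GenomicsDB datastore.</p>
<p dir="auto">GATK adopted this approach mainly to solve the N+1 problem in cohort studies: without it, every time a sample is added to the cohort, the whole cohort has to be re-analyzed.<br />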
<img src="/assets/uploads/files/1612601991717-62a95a72-cae1-45ce-aaab-7f0a182aac2a-image.png" alt="62a95a72-cae1-45ce-aaab-7f0a182aac2a-image.png" class=" img-responsive img-markdown" /></p>
<p dir="auto">// Command to generate a GVCF<br />
gatk HaplotypeCaller \<br />
-R ref/ref.fasta \<br />
-I bams/mother.bam \<br />
-O sandbox/mother_variants.200k.g.vcf.gz \<br />
-L 20:10,000,000-10,200,000 \<br />
-ERC GVCF</p>
<p dir="auto">// Import the GVCFs of multiple samples<br />
gatk GenomicsDBImport \<br />
-V gvcfs/mother.g.vcf.gz \<br />
-V gvcfs/father.g.vcf.gz \<br />
--genomicsdb-workspace-path sandbox/trio-gdb \<br />
--intervals 20:10,000,000-10,200,000</p>
<p dir="auto">// Export from GenomicsDB<br />
gatk SelectVariants \<br />
-R ref/ref.fasta \<br />
-V gendb://sandbox/trio-gdb \<br />
-O sandbox/duo_selectvariants.g.vcf.gz</p>
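<p dir="auto">As the cohort grows, the import command gains one -V argument per sample. A rough dry-run sketch of assembling that command in shell: build_import_cmd is a hypothetical helper that only prints the command, it does not run GATK, and it assumes paths contain no spaces:</p>

```shell
# Hypothetical dry-run helper: print (do not run) a GenomicsDBImport command
# with one -V argument per GVCF path passed after the first two arguments.
# Assumes paths contain no whitespace.
build_import_cmd() {
  workspace="$1"   # GenomicsDB workspace path
  interval="$2"    # interval to import
  shift 2
  cmd="gatk GenomicsDBImport"
  for gvcf in "$@"; do
    cmd="$cmd -V $gvcf"
  done
  printf '%s --genomicsdb-workspace-path %s --intervals %s\n' "$cmd" "$workspace" "$interval"
}

# Example (prints the command only):
#   build_import_cmd sandbox/trio-gdb 20:10,000,000-10,200,000 gvcfs/*.g.vcf.gz
```

<p dir="auto">Printing the command first makes it easy to check the sample list before starting a long import.</p>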
]]></description><link>http://an.forum.genostack.com/post/401</link><guid isPermaLink="true">http://an.forum.genostack.com/post/401</guid><dc:creator><![CDATA[anneng]]></dc:creator><pubDate>Sat, 06 Feb 2021 09:18:35 GMT</pubDate></item><item><title><![CDATA[Reply to GATK usage notes on Sat, 06 Feb 2021 06:43:17 GMT]]></title><description><![CDATA[<p dir="auto"><img src="/assets/uploads/files/1612593794474-2454b187-8a25-4433-8b2f-428f3e008d81-image.png" alt="2454b187-8a25-4433-8b2f-428f3e008d81-image.png" class=" img-responsive img-markdown" /></p>
]]></description><link>http://an.forum.genostack.com/post/399</link><guid isPermaLink="true">http://an.forum.genostack.com/post/399</guid><dc:creator><![CDATA[anneng]]></dc:creator><pubDate>Sat, 06 Feb 2021 06:43:17 GMT</pubDate></item><item><title><![CDATA[Reply to GATK usage notes on Sat, 06 Feb 2021 03:16:09 GMT]]></title><description><![CDATA[<p dir="auto">Use cases for GATK<br />
<img src="/assets/uploads/files/1612580749565-5cde2488-2eb2-40c5-a91a-45990bbc869d-image.png" alt="5cde2488-2eb2-40c5-a91a-45990bbc869d-image.png" class=" img-responsive img-markdown" /><br />
as well as the following use cases:<br />
Germline copy number variation</p>
<p dir="auto">Structural variation</p>
<p dir="auto">Mitochondrial variation</p>
<p dir="auto">Blood biopsy</p>
<p dir="auto">Pathogen/contaminant identification</p>
]]></description><link>http://an.forum.genostack.com/post/398</link><guid isPermaLink="true">http://an.forum.genostack.com/post/398</guid><dc:creator><![CDATA[anneng]]></dc:creator><pubDate>Sat, 06 Feb 2021 03:16:09 GMT</pubDate></item><item><title><![CDATA[Reply to GATK usage notes on Sat, 06 Feb 2021 02:58:25 GMT]]></title><description><![CDATA[<p dir="auto">gatk VariantFiltration \<br />
-R ref/ref.fasta \<br />
-V vcfs/motherSNP.vcf.gz \<br />
--filter-expression "QD &lt; 2.0 || DP &gt; 100.0" \<br />
--filter-name "lowQD_highDP" \<br />
-O sandbox/motherSNP.QD2.DP100.vcf.gz<br />
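Filter on QD and DP (records matching the expression are tagged lowQD_highDP in the FILTER column).<br />
Thought: if every filtering step requires running a separate command, how many VCFs will that generate? If filtering could be done from a database instead, it could be made interactive, which would be a better experience.</p>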
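<p dir="auto">One way to cut down on intermediate VCFs: VariantFiltration accepts repeated --filter-name / --filter-expression pairs, so several hard filters can be tagged in a single output file. A dry-run sketch in shell; build_filter_cmd is a hypothetical helper that only prints the command, and the filter names in the example are illustrative:</p>

```shell
# Hypothetical dry-run helper: print (do not run) one VariantFiltration command
# that applies several named hard filters in a single pass.
# Arguments: reference, input VCF, output VCF, then pairs of
# (filter name, filter expression). Assumes paths contain no whitespace.
build_filter_cmd() {
  ref="$1"
  in_vcf="$2"
  out_vcf="$3"
  shift 3
  cmd="gatk VariantFiltration -R $ref -V $in_vcf"
  while [ "$#" -ge 2 ]; do
    cmd="$cmd --filter-name \"$1\" --filter-expression \"$2\""
    shift 2
  done
  printf '%s -O %s\n' "$cmd" "$out_vcf"
}

# Example (prints the command only):
#   build_filter_cmd ref/ref.fasta in.vcf.gz out.vcf.gz lowQD "QD < 2.0" highDP "DP > 100.0"
```

<p dir="auto">With separate names per filter, the FILTER column of the single output VCF shows which threshold each record failed.</p>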
]]></description><link>http://an.forum.genostack.com/post/397</link><guid isPermaLink="true">http://an.forum.genostack.com/post/397</guid><dc:creator><![CDATA[anneng]]></dc:creator><pubDate>Sat, 06 Feb 2021 02:58:25 GMT</pubDate></item><item><title><![CDATA[Reply to GATK usage notes on Fri, 05 Feb 2021 11:56:57 GMT]]></title><description><![CDATA[<p dir="auto">At this point, we have a callset of potential variants, but we know from the earlier overview of variant filtering that this callset is likely to contain many false-positive calls; that is, calls that are caused by technical artifacts and do not actually correspond to real biological variation. We need to get rid of as many of those as we can without losing real variants. How do we do that?<br />
A commonly used approach for filtering germline short variants is to use <strong>variant context annotations</strong>, which are statistics captured during the variant calling process that summarize the quality and quantity of evidence that was observed for each variant. For example, some variant context annotations describe what the sequence context was like around the variant site (were there a lot of repeated bases? more GC or more AT?), how many reads covered it, how many reads covered each allele, what proportion of reads were in forward versus reverse orientation, and so on. We can choose thresholds for each annotation and set a hard filtering policy that says, for example, “For any given variant, if the value of this annotation is greater than a threshold value of X, we consider the variant to be real; otherwise, we filter it out.”</p>
]]></description><link>http://an.forum.genostack.com/post/396</link><guid isPermaLink="true">http://an.forum.genostack.com/post/396</guid><dc:creator><![CDATA[anneng]]></dc:creator><pubDate>Fri, 05 Feb 2021 11:56:57 GMT</pubDate></item><item><title><![CDATA[Reply to GATK usage notes on Fri, 05 Feb 2021 11:46:52 GMT]]></title><description><![CDATA[<p dir="auto">Use -bamout to write out HaplotypeCaller debugging information<br />
gatk HaplotypeCaller \<br />
-R ref/ref.fasta \<br />
-I bams/mother.bam \<br />
-O sandbox/mother_variants.snippet.debug.vcf \<br />
-bamout sandbox/mother_variants.snippet.debug.bam \<br />
-L 20:10,002,000-10,003,000</p>
]]></description><link>http://an.forum.genostack.com/post/395</link><guid isPermaLink="true">http://an.forum.genostack.com/post/395</guid><dc:creator><![CDATA[anneng]]></dc:creator><pubDate>Fri, 05 Feb 2021 11:46:52 GMT</pubDate></item><item><title><![CDATA[Reply to GATK usage notes on Thu, 04 Feb 2021 07:33:06 GMT]]></title><description><![CDATA[<p dir="auto">Picard tools can be called directly from GATK; Picard is now bundled as part of GATK<br />
gatk ValidateSamFile \<br />
-R ref/ref.fasta \<br />
-I bams/mother.bam \<br />
-O sandbox/mother_validation.txt</p>
]]></description><link>http://an.forum.genostack.com/post/386</link><guid isPermaLink="true">http://an.forum.genostack.com/post/386</guid><dc:creator><![CDATA[anneng]]></dc:creator><pubDate>Thu, 04 Feb 2021 07:33:06 GMT</pubDate></item><item><title><![CDATA[Reply to GATK usage notes on Thu, 04 Feb 2021 12:08:42 GMT]]></title><description><![CDATA[<p dir="auto">// Variant calling<br />
gatk HaplotypeCaller \<br />
-R ref/ref.fasta \<br />
-I bams/mother.bam \<br />
-O sandbox/mother_variants.vcf</p>
<p dir="auto">"Genomics in the Cloud" notes that GATK (HaplotypeCaller) is better than mpileup at detecting indels:<br />
It also makes the HaplotypeCaller much better at calling indels than traditional position-based callers like the old UnifiedGenotyper and Samtools mpileup.</p>
<p dir="auto"><a href="https://www.nature.com/scitable/definition/haplotype-haplotypes-142/#:~:text=A%20haplotype%20is%20a%20group,genetic%20makeup%20of%20an%20organism" rel="nofollow ugc">https://www.nature.com/scitable/definition/haplotype-haplotypes-142/#:~:text=A haplotype is a group,genetic makeup of an organism</a>.<br />
haplotype / haplotypes<br />
A haplotype is a group of genes within an organism that was inherited together from a single parent. The word "haplotype" is derived from the word "haploid," which describes cells with only one set of chromosomes, and from the word "genotype," which refers to the genetic makeup of an organism. A haplotype can describe a pair of genes inherited together from one parent on one chromosome, or it can describe all of the genes on a chromosome that were inherited together from a single parent. This group of genes was inherited together because of genetic linkage, or the phenomenon by which genes that are close to each other on the same chromosome are often inherited together. In addition, the term "haplotype" can also refer to the inheritance of a cluster of single nucleotide polymorphisms (SNPs), which are variations at single positions in the DNA sequence among individuals.</p>
<p dir="auto">By examining haplotypes, scientists can identify patterns of genetic variation that are associated with health and disease states. For instance, if a haplotype is associated with a certain disease, then scientists can examine stretches of DNA near the SNP cluster to try to identify the gene or genes responsible for causing the disease.</p>
<p dir="auto"><img src="/assets/uploads/files/1612440519733-657c48e8-6abe-499c-871b-7772c92f65e6-image.png" alt="657c48e8-6abe-499c-871b-7772c92f65e6-image.png" class=" img-responsive img-markdown" /></p>
]]></description><link>http://an.forum.genostack.com/post/379</link><guid isPermaLink="true">http://an.forum.genostack.com/post/379</guid><dc:creator><![CDATA[anneng]]></dc:creator><pubDate>Thu, 04 Feb 2021 12:08:42 GMT</pubDate></item><item><title><![CDATA[Reply to GATK usage notes on Wed, 03 Feb 2021 03:18:41 GMT]]></title><description><![CDATA[<p dir="auto">View the help documentation for a specific tool</p>
<p dir="auto">gatk HaplotypeCaller --help</p>
]]></description><link>http://an.forum.genostack.com/post/378</link><guid isPermaLink="true">http://an.forum.genostack.com/post/378</guid><dc:creator><![CDATA[anneng]]></dc:creator><pubDate>Wed, 03 Feb 2021 03:18:41 GMT</pubDate></item></channel></rss>