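<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[Applications of Spark in Bioinformatics]]></title><description><![CDATA[<p dir="auto">Background: population-scale studies and million-cell single-cell studies keep producing larger datasets, so demand for big-data analysis will keep growing. How should we position ourselves for the future?<br />
For research groups with relatively small data volumes we can sell solutions; for scenarios with larger data volumes, should we build our own cloud platform?</p>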
<p dir="auto">I. Project survey<br />
1.Glow<br />
<a href="https://projectglow.io/" rel="nofollow ugc">https://projectglow.io/</a><br />
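Glow is an open-source toolkit for working with genomic data at biobank scale and beyond. The toolkit is built natively on Apache Spark, the leading unified engine for big data processing and machine learning, which brings genomics workflows to cloud scale.<br />
Glow works with datasets in common file formats such as VCF and BGEN, as well as common big-data standards. You can write queries using the native Spark SQL APIs in Python, SQL, R, Java, and Scala. The same APIs let you combine genomic data with other datasets such as electronic health records, real-world evidence, and medical images. Glow also makes it easy to parallelize existing tools and libraries that are implemented as command-line tools or functions.<br />
Glow is built on Apache Spark and Delta Lake, and supports distributed computation and distributed storage of genotype data. The library is backward compatible with the genomics file formats and bioinformatics tools developed in academia, so users can easily share data with collaborators.<br />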
<img src="/assets/uploads/files/1634183016453-a4be0055-b4ad-4122-a63a-043ede57a80f-image-resized.png" alt="a4be0055-b4ad-4122-a63a-043ede57a80f-image.png" class=" img-responsive img-markdown" /></p>
<p dir="auto"><img src="/assets/uploads/files/1635822428080-747a4a86-53d9-450c-8dcb-267adc33e385-image-resized.png" alt="747a4a86-53d9-450c-8dcb-267adc33e385-image.png" class=" img-responsive img-markdown" /><br />
2.DNAnexus<br />
<a href="https://documentation.dnanexus.com/user/spark" rel="nofollow ugc">https://documentation.dnanexus.com/user/spark</a></p>
<p dir="auto">3.Hail<br />
<a href="https://hail.is/index.html" rel="nofollow ugc">https://hail.is/index.html</a></p>
<p dir="auto">4.Databricks genomics<br />
<a href="https://docs.databricks.com/runtime/genomicsruntime.html#" rel="nofollow ugc">https://docs.databricks.com/runtime/genomicsruntime.html#</a><br />
The projects above appear to be promoted mainly by Databricks, but this project has been discontinued.<br />
<img src="/assets/uploads/files/1634191922370-a5c882e7-762d-4b5a-8f89-685ea4713411-image.png" alt="a5c882e7-762d-4b5a-8f89-685ea4713411-image.png" class=" img-responsive img-markdown" /><br />
5.regenie<br />
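regenie is a C++ program for whole-genome regression modelling in genome-wide association studies. It is developed and supported by a team of scientists at the Regeneron Genetics Center.<br />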
<a href="https://rgcgithub.github.io/regenie/" rel="nofollow ugc">https://rgcgithub.github.io/regenie/</a></p>
<p dir="auto">6.Adam<br />
<a href="https://github.com/bigdatagenomics/adam" rel="nofollow ugc">https://github.com/bigdatagenomics/adam</a><br />
ADAM is a genomics analysis platform with specialized file formats built using Apache Avro, Apache Spark, and Apache Parquet.</p>
<p dir="auto">7.Related projects<br />
<a href="https://docs.tiledb.com/main/" rel="nofollow ugc">https://docs.tiledb.com/main/</a><br />
<a href="https://bioinformatics.csiro.au/variantspark/" rel="nofollow ugc">https://bioinformatics.csiro.au/variantspark/</a></p>
<p dir="auto">II. Literature<br />
<a href="https://academic.oup.com/bioinformatics/article-abstract/37/12/1658/6104815?redirectedFrom=fulltext" rel="nofollow ugc">https://academic.oup.com/bioinformatics/article-abstract/37/12/1658/6104815?redirectedFrom=fulltext</a><br />
Alignment-free Genomic Analysis via a Big Data Spark Platform</p>
<p dir="auto"><a href="https://pubmed.ncbi.nlm.nih.gov/33022000/" rel="nofollow ugc">https://pubmed.ncbi.nlm.nih.gov/33022000/</a><br />
Big Data in metagenomics: Apache Spark vs MPI</p>
<p dir="auto"><a href="https://par.nsf.gov/servlets/purl/10107144" rel="nofollow ugc">https://par.nsf.gov/servlets/purl/10107144</a><br />
Using Apache Spark on Genome Assembly for Scalable Overlap-graph Reduction</p>
<p dir="auto"><a href="https://chanzuckerberg.com/human-cell-atlas/accelerating-cross-sample-analysis-of-single-cell-genomic-data-with-adam-and-apache-spark/" rel="nofollow ugc">https://chanzuckerberg.com/human-cell-atlas/accelerating-cross-sample-analysis-of-single-cell-genomic-data-with-adam-and-apache-spark/</a><br />
Accelerating Cross-Sample Analysis of Single-Cell Genomic Data with Adam and Apache Spark</p>
<p dir="auto"><a href="https://databricks.com/session/practical-genomics-with-apache-spark" rel="nofollow ugc">https://databricks.com/session/practical-genomics-with-apache-spark</a></p>
]]></description><link>http://an.forum.genostack.com/topic/423/spark在生物信息领域的应用</link><generator>RSS for Node</generator><lastBuildDate>Sat, 13 Jun 2026 12:36:59 GMT</lastBuildDate><atom:link href="http://an.forum.genostack.com/topic/423.rss" rel="self" type="application/rss+xml"/><pubDate>Thu, 14 Oct 2021 02:45:09 GMT</pubDate><ttl>60</ttl><item><title><![CDATA[Reply to Applications of Spark in Bioinformatics on Wed, 02 Nov 2022 06:05:12 GMT]]></title><description><![CDATA[<p dir="auto"><a href="/assets/uploads/files/1667369110857-nih-genomics-workshop-11_4_2019.pdf">NIH-Genomics-Workshop-11_4_2019.pdf</a></p>
]]></description><link>http://an.forum.genostack.com/post/1876</link><guid isPermaLink="true">http://an.forum.genostack.com/post/1876</guid><dc:creator><![CDATA[anneng]]></dc:creator><pubDate>Wed, 02 Nov 2022 06:05:12 GMT</pubDate></item><item><title><![CDATA[Reply to Applications of Spark in Bioinformatics on Wed, 02 Nov 2022 05:54:41 GMT]]></title><description><![CDATA[<p dir="auto"><img src="/assets/uploads/files/1667368456066-690d02d5-c25d-4891-a62e-be78cbe568ef-image-resized.png" alt="690d02d5-c25d-4891-a62e-be78cbe568ef-image.png" class=" img-responsive img-markdown" /><br />
<a href="https://www.databricks.com/blog/2021/03/09/glow-v1-0-0-next-generation-genome-wide-analytics.html" rel="nofollow ugc">https://www.databricks.com/blog/2021/03/09/glow-v1-0-0-next-generation-genome-wide-analytics.html</a></p>
]]></description><link>http://an.forum.genostack.com/post/1875</link><guid isPermaLink="true">http://an.forum.genostack.com/post/1875</guid><dc:creator><![CDATA[anneng]]></dc:creator><pubDate>Wed, 02 Nov 2022 05:54:41 GMT</pubDate></item><item><title><![CDATA[Reply to Applications of Spark in Bioinformatics on Wed, 17 Aug 2022 12:16:29 GMT]]></title><description><![CDATA[<p dir="auto"><a href="https://arxiv.org/pdf/2007.10498.pdf" rel="nofollow ugc">https://arxiv.org/pdf/2007.10498.pdf</a><br />
<img src="/assets/uploads/files/1660738588133-d6c14c52-4f92-4c79-a3d9-145c1a5e8e2e-image.png" alt="d6c14c52-4f92-4c79-a3d9-145c1a5e8e2e-image.png" class=" img-responsive img-markdown" /></p>
]]></description><link>http://an.forum.genostack.com/post/1801</link><guid isPermaLink="true">http://an.forum.genostack.com/post/1801</guid><dc:creator><![CDATA[anneng]]></dc:creator><pubDate>Wed, 17 Aug 2022 12:16:29 GMT</pubDate></item><item><title><![CDATA[Reply to Applications of Spark in Bioinformatics on Tue, 16 Aug 2022 10:46:32 GMT]]></title><description><![CDATA[<p dir="auto"><a href="https://app.genebass.org/" rel="nofollow ugc">https://app.genebass.org/</a><br />
gnomAD</p>
<p dir="auto">Both of these projects use Hail.</p>
]]></description><link>http://an.forum.genostack.com/post/1800</link><guid isPermaLink="true">http://an.forum.genostack.com/post/1800</guid><dc:creator><![CDATA[anneng]]></dc:creator><pubDate>Tue, 16 Aug 2022 10:46:32 GMT</pubDate></item><item><title><![CDATA[Reply to Applications of Spark in Bioinformatics on Sat, 29 Jan 2022 16:03:49 GMT]]></title><description><![CDATA[<p dir="auto"><a href="https://www.researchgate.net/publication/357798075_Halvade_somatic_Somatic_variant_calling_with_Apache_Spark" rel="nofollow ugc">https://www.researchgate.net/publication/357798075_Halvade_somatic_Somatic_variant_calling_with_Apache_Spark</a></p>
]]></description><link>http://an.forum.genostack.com/post/1164</link><guid isPermaLink="true">http://an.forum.genostack.com/post/1164</guid><dc:creator><![CDATA[anneng]]></dc:creator><pubDate>Sat, 29 Jan 2022 16:03:49 GMT</pubDate></item><item><title><![CDATA[Reply to Applications of Spark in Bioinformatics on Wed, 03 Nov 2021 08:48:14 GMT]]></title><description><![CDATA[<p dir="auto">Using Spark to accelerate scanpy:<br />
<a href="https://github.com/lasersonlab/single-cell-experiments" rel="nofollow ugc">https://github.com/lasersonlab/single-cell-experiments</a><br />
<a href="https://chanzuckerberg.com/human-cell-atlas/scalable-interactive-analysis-of-single-cell-data-with-apache-spark/" rel="nofollow ugc">https://chanzuckerberg.com/human-cell-atlas/scalable-interactive-analysis-of-single-cell-data-with-apache-spark/</a><br />
<a href="https://github.com/tomwhite/single-cell-spark-demo" rel="nofollow ugc">https://github.com/tomwhite/single-cell-spark-demo</a></p>
]]></description><link>http://an.forum.genostack.com/post/849</link><guid isPermaLink="true">http://an.forum.genostack.com/post/849</guid><dc:creator><![CDATA[anneng]]></dc:creator><pubDate>Wed, 03 Nov 2021 08:48:14 GMT</pubDate></item><item><title><![CDATA[Reply to Applications of Spark in Bioinformatics on Thu, 21 Oct 2021 09:58:37 GMT]]></title><description><![CDATA[<p dir="auto"><img src="/assets/uploads/files/1634810316066-8d42a9b6-1888-4111-8472-7ba280f047d5-image.png" alt="8d42a9b6-1888-4111-8472-7ba280f047d5-image.png" class=" img-responsive img-markdown" /></p>
]]></description><link>http://an.forum.genostack.com/post/835</link><guid isPermaLink="true">http://an.forum.genostack.com/post/835</guid><dc:creator><![CDATA[anneng]]></dc:creator><pubDate>Thu, 21 Oct 2021 09:58:37 GMT</pubDate></item><item><title><![CDATA[Reply to Applications of Spark in Bioinformatics on Thu, 21 Oct 2021 09:55:31 GMT]]></title><description><![CDATA[<p dir="auto"><img src="/assets/uploads/files/1634810130376-46403960-b72d-4a16-9e25-1bed8ac5d2fa-image.png" alt="46403960-b72d-4a16-9e25-1bed8ac5d2fa-image.png" class=" img-responsive img-markdown" /></p>
]]></description><link>http://an.forum.genostack.com/post/834</link><guid isPermaLink="true">http://an.forum.genostack.com/post/834</guid><dc:creator><![CDATA[anneng]]></dc:creator><pubDate>Thu, 21 Oct 2021 09:55:31 GMT</pubDate></item><item><title><![CDATA[Reply to Applications of Spark in Bioinformatics on Thu, 21 Oct 2021 09:52:46 GMT]]></title><description><![CDATA[<p dir="auto"><img src="/assets/uploads/files/1634809964905-c28eef16-7923-492b-bcf1-1406b28fc5ba-image.png" alt="c28eef16-7923-492b-bcf1-1406b28fc5ba-image.png" class=" img-responsive img-markdown" /></p>
]]></description><link>http://an.forum.genostack.com/post/833</link><guid isPermaLink="true">http://an.forum.genostack.com/post/833</guid><dc:creator><![CDATA[anneng]]></dc:creator><pubDate>Thu, 21 Oct 2021 09:52:46 GMT</pubDate></item></channel></rss>