Applications of Spark in Bioinformatics

anneng

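Background: With population-scale studies and million-cell single-cell studies, data volumes keep growing, and the demand for big-data analysis will keep increasing. How should we position ourselves going forward?
For research groups with relatively small data volumes, we can sell solutions. For scenarios with larger data volumes, should we build our own cloud platform?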

I. Project survey
1.Glow
https://projectglow.io/
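Glow is an open-source toolkit for working with genomic data at biobank scale and beyond. The toolkit is built natively on Apache Spark, the leading unified engine for big-data processing and machine learning, and brings genomics workflows to cloud-computing scale.
Glow works with datasets in common file formats such as VCF and BGEN, as well as common big-data standards. You can write queries using the native Spark SQL APIs in Python, SQL, R, Java, and Scala. The same APIs let you combine genomic data with other datasets such as electronic health records, real-world evidence, and medical images. Glow makes it easy to parallelize existing tools and libraries implemented as command-line tools or functions.
Glow is built on Apache Spark and Delta Lake, supporting distributed computation over, and distributed storage of, genotype data. The library is backward compatible with the genomics file formats and bioinformatics tools developed in academia, so users can easily share data with collaborators.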
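
To make the kind of per-variant query described above concrete, here is a minimal sketch in plain Python. It does not use Spark or Glow; in Glow the same aggregation would be expressed as a Spark SQL or DataFrame query over the loaded VCF. The sample records, the `Variant` tuple, and the helper names `parse_vcf` and `alt_allele_frequency` are all invented for illustration.

```python
# Plain-Python stand-in for a per-variant query (no Spark/Glow involved).
# All sample records and helper names below are hypothetical.
from collections import namedtuple

Variant = namedtuple("Variant", "contig pos ref alt genotypes")

# A tiny invented VCF with three samples and GT-only FORMAT fields.
VCF_TEXT = """\
##fileformat=VCFv4.2
#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2\tS3
chr1\t100\t.\tA\tG\t.\tPASS\t.\tGT\t0/1\t1/1\t0/0
chr1\t200\t.\tC\tT\t.\tPASS\t.\tGT\t0/0\t0/1\t./.
"""


def parse_vcf(text):
    """Parse VCF data lines into Variant records, skipping header lines."""
    variants = []
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        contig, pos, ref, alt = fields[0], int(fields[1]), fields[3], fields[4]
        # Keep only the GT sub-field and treat phased and unphased calls alike.
        genotypes = [
            sample.split(":")[0].replace("|", "/").split("/")
            for sample in fields[9:]
        ]
        variants.append(Variant(contig, pos, ref, alt, genotypes))
    return variants


def alt_allele_frequency(variant):
    """Fraction of called alleles that are non-reference; None if no calls."""
    called = [a for gt in variant.genotypes for a in gt if a != "."]
    if not called:
        return None
    return sum(1 for a in called if a != "0") / len(called)


if __name__ == "__main__":
    for v in parse_vcf(VCF_TEXT):
        print(v.contig, v.pos, v.ref, v.alt, alt_allele_frequency(v))
```

On a single machine this loop is enough. The case for a Spark-based toolkit is that the same per-variant logic can be distributed across a cluster when the variant table no longer fits on one node.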

2.DNAnexus
https://documentation.dnanexus.com/user/spark

3.Hail
https://hail.is/index.html

4.Databricks genomics
https://docs.databricks.com/runtime/genomicsruntime.html#
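The projects above appear to be promoted mainly by Databricks, but this one (the Databricks genomics runtime) has been discontinued.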

5.regenie
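regenie is a C++ program for whole-genome regression modelling in genome-wide association studies. It is developed and supported by a team of scientists at the Regeneron Genetics Center.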
https://rgcgithub.github.io/regenie/

6.Adam
https://github.com/bigdatagenomics/adam
ADAM is a genomics analysis platform with specialized file formats built using Apache Avro, Apache Spark, and Apache Parquet.

7.Related projects
https://docs.tiledb.com/main/
https://bioinformatics.csiro.au/variantspark/

II. Literature
https://academic.oup.com/bioinformatics/article-abstract/37/12/1658/6104815?redirectedFrom=fulltext
Alignment-free Genomic Analysis via a Big Data Spark Platform

https://pubmed.ncbi.nlm.nih.gov/33022000/
Big Data in metagenomics: Apache Spark vs MPI

https://par.nsf.gov/servlets/purl/10107144
Using Apache Spark on Genome Assembly for Scalable Overlap-graph Reduction

https://chanzuckerberg.com/human-cell-atlas/accelerating-cross-sample-analysis-of-single-cell-genomic-data-with-adam-and-apache-spark/
Accelerating Cross-Sample Analysis of Single-Cell Genomic Data with Adam and Apache Spark

https://databricks.com/session/practical-genomics-with-apache-spark

Accelerating scanpy with Spark:
https://github.com/lasersonlab/single-cell-experiments
https://chanzuckerberg.com/human-cell-atlas/scalable-interactive-analysis-of-single-cell-data-with-apache-spark/
https://github.com/tomwhite/single-cell-spark-demo

https://www.researchgate.net/publication/357798075_Halvade_somatic_Somatic_variant_calling_with_Apache_Spark

https://app.genebass.org/
gnomAD

Both of these projects use Hail.

https://arxiv.org/pdf/2007.10498.pdf

https://www.databricks.com/blog/2021/03/09/glow-v1-0-0-next-generation-genome-wide-analytics.html

NIH-Genomics-Workshop-11_4_2019.pdf