<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[GenoStack support for big data analytics]]></title><description><![CDATA[<p dir="auto">Overall architecture:<br />
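1. The base layer is k8s + Ceph.<br />
2. Deploy Spark on k8s and use Apache Livy to provide a Spark REST API (whether Livy is actually necessary still needs to be verified).<br />
Use Argo to run workflows? (to be decided)<br />
3. Deploy and validate big data analysis workflows such as ADAM, Glow, Hail, and GATK, with particular attention to Parquet and Delta support.<br />
Key point to validate: whether Cromwell can automatically schedule jobs onto Spark.<br />
4. Jupyter support for the PySpark, SparkR, and Spark SQL APIs.<br />
Processing single-cell datasets with millions of cells: <a href="https://github.com/asif7adil/scSparkl" rel="nofollow ugc">https://github.com/asif7adil/scSparkl</a><br />
<a href="https://github.com/martaccmoreno/gexp-ml-dask" rel="nofollow ugc">https://github.com/martaccmoreno/gexp-ml-dask</a><br />
5. UI support for visual analytics tools such as Mango, Superset, Redash, and RATH.<br />
6. GPU support: use RAPIDS to accelerate Spark.</p>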
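<p dir="auto">To help judge whether Livy is needed (item 2), here is a minimal sketch of what a job submission through Livy's batch endpoint (POST /batches) could look like. The Livy host name is a placeholder, and whether a given Livy build can target a k8s master is an assumption that still has to be checked; the request is only built here, not sent.</p>

```python
import json
import urllib.request

# Assumption: a Livy server is reachable at this placeholder URL
# (8998 is Livy's default port).
LIVY_URL = "http://livy.example.internal:8998"


def build_batch_payload(jar, class_name, args=(), conf=None):
    """Build the JSON body for Livy's POST /batches endpoint."""
    return {
        "file": jar,
        "className": class_name,
        "args": list(args),
        "conf": dict(conf or {}),
    }


def build_batch_request(livy_url, payload):
    """Prepare (but do not send) the HTTP request that submits the batch."""
    return urllib.request.Request(
        url=livy_url.rstrip("/") + "/batches",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


payload = build_batch_payload(
    jar="local:///opt/spark/examples/jars/spark-examples_2.12-3.4.1.jar",
    class_name="org.apache.spark.examples.SparkPi",
    args=["100"],
    conf={"spark.kubernetes.namespace": "spark-demo"},
)
request = build_batch_request(LIVY_URL, payload)
print(request.get_method(), request.full_url)
print(json.dumps(payload, indent=2))
# To actually submit the job: urllib.request.urlopen(request)
```

<p dir="auto">If apps, Jupyter, and workflows all end up submitting through one endpoint like this, Livy may be worth keeping; if everything already goes through spark-submit or an operator, it is probably redundant.</p>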
]]></description><link>http://an.forum.genostack.com/topic/972/genostack对大数据分析的支持</link><generator>RSS for Node</generator><lastBuildDate>Sat, 13 Jun 2026 12:33:24 GMT</lastBuildDate><atom:link href="http://an.forum.genostack.com/topic/972.rss" rel="self" type="application/rss+xml"/><pubDate>Thu, 20 Jul 2023 09:54:29 GMT</pubDate><ttl>60</ttl><item><title><![CDATA[Reply to GenoStack support for big data analytics on Fri, 21 Jul 2023 11:57:30 GMT]]></title><description><![CDATA[<p dir="auto">Running spark-shell:<br />
./bin/spark-shell   --master k8s://192.168.39.6:8443   --conf spark.kubernetes.container.image=spark   --conf spark.kubernetes.context=minikube   --conf spark.kubernetes.namespace=spark-demo   --verbose</p>
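<p dir="auto">For the Jupyter/PySpark side, the same settings could be applied to a client-mode SparkSession. This is a rough sketch: the application name is made up, and it assumes the pyspark package is installed and that the container image includes Python (the image name "spark" is simply copied from the command above).</p>

```python
# Values copied from the spark-shell command above.
MASTER = "k8s://192.168.39.6:8443"
K8S_CONF = {
    "spark.kubernetes.container.image": "spark",
    "spark.kubernetes.context": "minikube",
    "spark.kubernetes.namespace": "spark-demo",
}


def as_cli_flags(master, conf):
    """Render the settings as spark-shell / spark-submit flags."""
    flags = ["--master", master]
    for key, value in conf.items():
        flags += ["--conf", f"{key}={value}"]
    return flags


def make_session(master, conf, app_name="pyspark-on-k8s"):
    """Create a client-mode SparkSession (app_name is a hypothetical default)."""
    # Imported lazily so the configuration can be inspected without pyspark.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.master(master).appName(app_name)
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()


# Show that the dictionary carries the same settings as the command line.
print(" ".join(as_cli_flags(MASTER, K8S_CONF)))
```

<p dir="auto">In a notebook, make_session(MASTER, K8S_CONF) would start the session. In client mode the executors have to be able to connect back to the driver, so the notebook host must be reachable from the pods.</p>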
]]></description><link>http://an.forum.genostack.com/post/2313</link><guid isPermaLink="true">http://an.forum.genostack.com/post/2313</guid><dc:creator><![CDATA[anneng]]></dc:creator><pubDate>Fri, 21 Jul 2023 11:57:30 GMT</pubDate></item><item><title><![CDATA[Reply to GenoStack support for big data analytics on Fri, 21 Jul 2023 09:58:05 GMT]]></title><description><![CDATA[<p dir="auto"><a href="https://dev.to/stack-labs/my-journey-with-spark-on-kubernetes-in-python-1-3-4nl3" rel="nofollow ugc">https://dev.to/stack-labs/my-journey-with-spark-on-kubernetes-in-python-1-3-4nl3</a><br />
A Spark + k8s + Python tutorial.</p>
]]></description><link>http://an.forum.genostack.com/post/2312</link><guid isPermaLink="true">http://an.forum.genostack.com/post/2312</guid><dc:creator><![CDATA[anneng]]></dc:creator><pubDate>Fri, 21 Jul 2023 09:58:05 GMT</pubDate></item><item><title><![CDATA[Reply to GenoStack support for big data analytics on Fri, 21 Jul 2023 09:44:17 GMT]]></title><description><![CDATA[<p dir="auto"><img src="/assets/uploads/files/1689928615605-a3d39aff-297b-48af-a309-4cab543e7a66-image.png" alt="a3d39aff-297b-48af-a309-4cab543e7a66-image.png" class=" img-responsive img-markdown" /><br />
<a href="https://medium.com/empathyco/running-apache-spark-on-kubernetes-2e64c73d0bb2" rel="nofollow ugc">https://medium.com/empathyco/running-apache-spark-on-kubernetes-2e64c73d0bb2</a><br />
Spark Submit is sent from a client to the Kubernetes API server in the master node.<br />
Kubernetes will schedule a new Spark Driver pod.<br />
Spark Driver pod will communicate with Kubernetes to request Spark executor pods.<br />
The new executor pods will be scheduled by Kubernetes.<br />
Once the new executor pods are running, Kubernetes will notify Spark Driver pod that new Spark executor pods are ready.<br />
Spark Driver pod will schedule tasks on the new Spark executor pods.</p>
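<p dir="auto">The steps above can be observed from the outside by watching the pods Spark creates. Spark labels them with spark-role=driver or spark-role=executor; the sketch below only builds the kubectl commands (the namespace is the one used elsewhere in this thread) and does not run them.</p>

```python
import shlex

# Label key that Spark sets on the driver and executor pods it creates.
ROLE_LABEL = "spark-role"


def pods_command(namespace, role, watch=False):
    """Return the kubectl argv that lists Spark pods with the given role."""
    if role not in ("driver", "executor"):
        raise ValueError("role must be 'driver' or 'executor'")
    argv = ["kubectl", "get", "pods", "-n", namespace,
            "-l", f"{ROLE_LABEL}={role}"]
    if watch:
        argv.append("--watch")
    return argv


# Print one command per role; run each in its own terminal during a submit.
for role in ("driver", "executor"):
    print(shlex.join(pods_command("spark-demo", role, watch=True)))
```

<p dir="auto">During a submission the driver pod should show up first, followed by the executor pods once the driver has requested them. To run a command from Python, pass the list to subprocess.run(..., check=True).</p>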
<p dir="auto">The other option is the Spark Operator project.<br />
A comparison of the different ways to run Spark on k8s:<br />
<a href="https://blog.cellenza.com/en/data/using-spark-with-kubernetes-k8s/" rel="nofollow ugc">https://blog.cellenza.com/en/data/using-spark-with-kubernetes-k8s/</a><br />
<img src="/assets/uploads/files/1689929085053-c24f80a3-ffdd-441e-8ac7-60ab0ce81c27-image.png" alt="c24f80a3-ffdd-441e-8ac7-60ab0ce81c27-image.png" class=" img-responsive img-markdown" /><br />
<a href="https://aws.amazon.com/cn/about-aws/whats-new/2023/06/amazon-emr-eks-spark-operator-submit/" rel="nofollow ugc">https://aws.amazon.com/cn/about-aws/whats-new/2023/06/amazon-emr-eks-spark-operator-submit/</a> Amazon supports both spark-submit and the Spark Operator, so each approach apparently has its own use cases.<br />
Running workflows with Argo:<br />
<img src="/assets/uploads/files/1689928899774-e83edabe-2eb0-4900-8e87-d3e3cffa2ca1-image.png" alt="e83edabe-2eb0-4900-8e87-d3e3cffa2ca1-image.png" class=" img-responsive img-markdown" /></p>
]]></description><link>http://an.forum.genostack.com/post/2311</link><guid isPermaLink="true">http://an.forum.genostack.com/post/2311</guid><dc:creator><![CDATA[anneng]]></dc:creator><pubDate>Fri, 21 Jul 2023 09:44:17 GMT</pubDate></item><item><title><![CDATA[Reply to GenoStack support for big data analytics on Sat, 22 Jul 2023 07:23:17 GMT]]></title><description><![CDATA[<p dir="auto"><a href="https://spark.apache.org/docs/latest/running-on-kubernetes.html" rel="nofollow ugc">https://spark.apache.org/docs/latest/running-on-kubernetes.html</a><br />
Running Spark on k8s<br />
I. Security<br />
II. Prerequisites<br />
Kubernetes &gt;= 1.22<br />
kubectl is available<br />
Kubernetes DNS is available<br />
III. How it works<br />
<img src="/assets/uploads/files/1689911118309-28c41dac-24cf-4c79-8b77-863999d92465-image.png" alt="28c41dac-24cf-4c79-8b77-863999d92465-image.png" class=" img-responsive img-markdown" /><br />
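spark-submit can be used directly to submit a Spark application to a Kubernetes cluster. The submission mechanism works as follows: Spark creates a Spark driver that runs in a Kubernetes pod. The driver creates executors, which also run in Kubernetes pods, connects to them, and executes the application code. When the application completes, the executor pods terminate and are cleaned up, but the driver pod keeps its logs and remains in the "completed" state in the Kubernetes API until it is eventually garbage collected or manually cleaned up.</p>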
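<p dir="auto">Because completed driver pods stay around, something eventually has to read their logs and remove them. A minimal sketch of the kubectl calls that could do this follows; the driver pod name is a placeholder, and deletion defaults to a client-side dry run so nothing is removed by accident.</p>

```python
def driver_logs_command(namespace, pod_name):
    """kubectl argv that prints the logs kept by a finished driver pod."""
    return ["kubectl", "logs", "-n", namespace, pod_name]


def cleanup_command(namespace, dry_run=True):
    """kubectl argv that deletes driver pods whose phase is Succeeded."""
    argv = [
        "kubectl", "delete", "pods", "-n", namespace,
        "-l", "spark-role=driver",                     # label Spark puts on driver pods
        "--field-selector", "status.phase=Succeeded",  # only pods that finished successfully
    ]
    if dry_run:
        argv.append("--dry-run=client")  # report what would be deleted, delete nothing
    return argv


# "spark-pi-driver" is a placeholder; real driver pod names are generated by Spark.
print(" ".join(driver_logs_command("spark-demo", "spark-pi-driver")))
print(" ".join(cleanup_command("spark-demo")))
```

<p dir="auto">Failed drivers are deliberately left out of the selector here, on the assumption that their logs are still needed for debugging.</p>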
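<p dir="auto">IV. Application image<br />
Before running a Spark job, a Spark image has to be built. Spark ships a script that automates the build:<br />
kubernetes/dockerfiles/<br />
bin/docker-image-tool.sh (builds the JVM application image by default)<br />
Submitting a job:</p>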
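<pre><code># local:// refers to a path inside the image. An upload path can be set with --conf spark.kubernetes.file.upload.path, after which local files can be submitted via file://.
# Open question: how do these files tie in with the UI layer? Apps, Jupyter, and workflows will all submit files to Spark later on.
# Only namespaces such as kube-system have the needed permissions; in spark-demo the job fails with: cannot get resource "pods" in API group "" in the namespace "spark-demo"
# Open question: how should permissions be designed?
./bin/spark-submit --master k8s://192.168.39.6:8443 --deploy-mode cluster --name spark-pi --class org.apache.spark.examples.SparkPi --conf spark.executor.instances=5 --conf spark.kubernetes.container.image=spark --conf spark.kubernetes.namespace=kube-system local:///opt/spark/examples/jars/spark-examples_2.12-3.4.1.jar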
</code></pre>
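<p dir="auto">One possible answer to the permissions question: the "cannot get resource pods" error usually means the service account the driver runs under has no rights on pods in that namespace. The sketch below generates a ServiceAccount, Role, and RoleBinding for spark-demo plus the matching Spark setting. The object names and the exact rule list are assumptions to be tightened later; spark.kubernetes.authenticate.driver.serviceAccountName is the Spark property that selects the driver's service account.</p>

```python
import json


def spark_rbac_manifests(namespace, service_account="spark"):
    """ServiceAccount, Role and RoleBinding letting the driver manage its pods."""
    role_name = "spark-driver"  # hypothetical name
    account = {
        "apiVersion": "v1",
        "kind": "ServiceAccount",
        "metadata": {"name": service_account, "namespace": namespace},
    }
    role = {
        "apiVersion": "rbac.authorization.k8s.io/v1",
        "kind": "Role",
        "metadata": {"name": role_name, "namespace": namespace},
        # Assumed rule set: broad enough for executors, services and config maps.
        "rules": [{
            "apiGroups": [""],
            "resources": ["pods", "services", "configmaps", "persistentvolumeclaims"],
            "verbs": ["get", "list", "watch", "create", "update", "patch",
                      "delete", "deletecollection"],
        }],
    }
    binding = {
        "apiVersion": "rbac.authorization.k8s.io/v1",
        "kind": "RoleBinding",
        "metadata": {"name": role_name, "namespace": namespace},
        "subjects": [{"kind": "ServiceAccount", "name": service_account,
                      "namespace": namespace}],
        "roleRef": {"apiGroup": "rbac.authorization.k8s.io",
                    "kind": "Role", "name": role_name},
    }
    return {"apiVersion": "v1", "kind": "List", "items": [account, role, binding]}


def submit_conf(namespace, service_account="spark"):
    """Spark settings that make the driver run under the new service account."""
    return {
        "spark.kubernetes.namespace": namespace,
        "spark.kubernetes.authenticate.driver.serviceAccountName": service_account,
    }


# kubectl accepts JSON, so the output can be piped into "kubectl apply -f -".
print(json.dumps(spark_rbac_manifests("spark-demo"), indent=2))
for key, value in submit_conf("spark-demo").items():
    print(f"--conf {key}={value}")
```

<p dir="auto">With these objects applied, the spark-submit command above could target spark-demo instead of kube-system by adding the two printed --conf flags. A per-namespace Role like this keeps each tenant's jobs confined to its own namespace, which seems to fit the multi-user UI better than granting cluster-wide rights.</p>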
<p dir="auto">Client mode versus cluster mode:<br />
<a href="https://sparkbyexamples.com/spark/spark-submit-command/?expand_article=1" rel="nofollow ugc">https://sparkbyexamples.com/spark/spark-submit-command/?expand_article=1</a><br />
V. Storage<br />
hostPath: mounts a file or directory from the host node’s filesystem into a pod.<br />
emptyDir: an initially empty volume created when a pod is assigned to a node.<br />
nfs: mounts an existing NFS (Network File System) share into a pod.<br />
persistentVolumeClaim: mounts a PersistentVolume into a pod.</p>
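<p dir="auto">Spark exposes these volume types through properties of the form spark.kubernetes.{driver,executor}.volumes.[type].[name].mount.path and .options.[option]. A small sketch that generates the flags for mounting one PVC into both driver and executors follows; the volume name, mount path, and claim name are made up for illustration.</p>

```python
VOLUME_TYPES = ("hostPath", "emptyDir", "nfs", "persistentVolumeClaim")


def volume_conf(role, volume_type, name, mount_path, read_only=False, **options):
    """Spark properties that mount one Kubernetes volume into driver or executor pods."""
    if role not in ("driver", "executor"):
        raise ValueError("role must be 'driver' or 'executor'")
    if volume_type not in VOLUME_TYPES:
        raise ValueError(f"unsupported volume type: {volume_type}")
    prefix = f"spark.kubernetes.{role}.volumes.{volume_type}.{name}"
    conf = {
        f"{prefix}.mount.path": mount_path,
        f"{prefix}.mount.readOnly": str(read_only).lower(),
    }
    # Type-specific options, e.g. claimName for a PVC or path for hostPath.
    for option, value in options.items():
        conf[f"{prefix}.options.{option}"] = str(value)
    return conf


conf = {}
for role in ("driver", "executor"):
    # "data", "/data" and "genomics-data" are hypothetical names.
    conf.update(volume_conf(role, "persistentVolumeClaim", "data", "/data",
                            claimName="genomics-data"))

for key, value in sorted(conf.items()):
    print(f"--conf {key}={value}")
```

<p dir="auto">Since the base layer is Ceph, a PVC is probably the most natural choice for shared input and output data, while emptyDir is better suited to scratch space.</p>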
]]></description><link>http://an.forum.genostack.com/post/2310</link><guid isPermaLink="true">http://an.forum.genostack.com/post/2310</guid><dc:creator><![CDATA[anneng]]></dc:creator><pubDate>Sat, 22 Jul 2023 07:23:17 GMT</pubDate></item></channel></rss>