GenoStack support for big data analysis

Reply to GenoStack support for big data analysis on Fri, 21 Jul 2023 11:57:30 GMT

anneng — Fri, 21 Jul 2023 11:57:30 GMT

Running spark-shell:
./bin/spark-shell --master k8s://192.168.39.6:8443 --conf spark.kubernetes.container.image=spark --conf spark.kubernetes.context=minikube --conf spark.kubernetes.namespace=spark-demo --verbose

Reply to GenoStack support for big data analysis on Fri, 21 Jul 2023 09:58:05 GMT

anneng — Fri, 21 Jul 2023 09:58:05 GMT

https://dev.to/stack-labs/my-journey-with-spark-on-kubernetes-in-python-1-3-4nl3
A Spark + Kubernetes + Python tutorial.

Reply to GenoStack support for big data analysis on Fri, 21 Jul 2023 09:44:17 GMT

anneng — Fri, 21 Jul 2023 09:44:17 GMT

https://medium.com/empathyco/running-apache-spark-on-kubernetes-2e64c73d0bb2
Spark Submit is sent from a client to the Kubernetes API server in the master node.
Kubernetes will schedule a new Spark Driver pod.
Spark Driver pod will communicate with Kubernetes to request Spark executor pods.
The new executor pods will be scheduled by Kubernetes.
Once the new executor pods are running, Kubernetes will notify Spark Driver pod that new Spark executor pods are ready.
Spark Driver pod will schedule tasks on the new Spark executor pods.
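
One way to watch the steps above happen is to list pods by role while a job runs. The following is a minimal sketch, assuming the `spark-role=driver` and `spark-role=executor` labels that Spark puts on its pods and a working `kubectl` context; the helper name `pods_by_role_cmd` is made up for illustration, and it only builds the command rather than running it.

```python
import shlex


def pods_by_role_cmd(namespace, role, watch=False):
    """Build a kubectl command that lists Spark pods with the given role label."""
    if role not in ("driver", "executor"):
        raise ValueError(f"unknown Spark role: {role!r}")
    cmd = [
        "kubectl", "get", "pods",
        "--namespace", namespace,
        # Assumption: Spark labels its pods with spark-role=<driver|executor>.
        "--selector", f"spark-role={role}",
    ]
    if watch:
        # Keep streaming changes so newly scheduled executors show up as they start.
        cmd.append("--watch")
    return cmd


if __name__ == "__main__":
    # Print the commands instead of executing them, so this runs without a cluster.
    for role in ("driver", "executor"):
        print(shlex.join(pods_by_role_cmd("spark-demo", role, watch=True)))
```

Running the printed executor command in a second terminal right after a submit should show the executor pods appear once the driver has requested them.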

Another option is the Spark Operator project.
A comparison of the different ways to run Spark on Kubernetes:
https://blog.cellenza.com/en/data/using-spark-with-kubernetes-k8s/

https://aws.amazon.com/cn/about-aws/whats-new/2023/06/amazon-emr-eks-spark-operator-submit/ Amazon supports both, so it looks like each one has its own use cases.
Use Argo to run the workflows.

Reply to GenoStack support for big data analysis on Sat, 22 Jul 2023 07:23:17 GMT

anneng — Sat, 22 Jul 2023 07:23:17 GMT

https://spark.apache.org/docs/latest/running-on-kubernetes.html
Running Spark on Kubernetes
1. Security
2. Prerequisites
Kubernetes >= 1.22
kubectl is available
Kubernetes DNS is available
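
The version prerequisite can be checked mechanically. This is a small sketch, assuming the server version string has the usual `v1.27.3` shape (for example the `gitVersion` field reported by `kubectl version`); the function names are hypothetical and the 1.22 minimum is simply the number quoted above.

```python
import re

# Minimum (major, minor) taken from the prerequisite list above.
MIN_VERSION = (1, 22)


def parse_k8s_version(git_version):
    """Extract (major, minor) from a string such as 'v1.27.3' or '1.26.3+k3s1'."""
    match = re.match(r"v?(\d+)\.(\d+)", git_version)
    if match is None:
        raise ValueError(f"unrecognized Kubernetes version: {git_version!r}")
    return int(match.group(1)), int(match.group(2))


def meets_minimum(git_version, minimum=MIN_VERSION):
    """Return True when the cluster version is at least the required minimum."""
    return parse_k8s_version(git_version) >= minimum


if __name__ == "__main__":
    for version in ("v1.21.0", "v1.22.0", "v1.27.3"):
        print(version, meets_minimum(version))
```

Comparing tuples rather than strings avoids the classic mistake where "1.9" sorts after "1.22".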
3. How it works

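spark-submit can be used directly to submit a Spark application to a Kubernetes cluster. The submission mechanism works as follows: Spark creates a Spark driver that runs in a Kubernetes pod. The driver creates executors, which also run in Kubernetes pods, connects to them, and executes the application code. When the application completes, the executor pods terminate and are cleaned up, but the driver pod keeps its logs and stays in the "Completed" state in the Kubernetes API until it is eventually garbage collected or manually cleaned up.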
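
Because finished driver pods stay around, something eventually has to find them for log collection or cleanup. Below is a rough sketch that picks them out of the parsed output of `kubectl get pods -o json`. It assumes the `spark-role=driver` label and that kubectl's "Completed" status corresponds to the pod phase `Succeeded`; the function name and the sample pod names are invented for the example.

```python
def finished_driver_pods(pod_list):
    """Return names of Spark driver pods that have finished (Succeeded or Failed).

    `pod_list` is the parsed JSON of `kubectl get pods -o json`.
    """
    names = []
    for pod in pod_list.get("items", []):
        metadata = pod.get("metadata", {})
        labels = metadata.get("labels", {})
        phase = pod.get("status", {}).get("phase")
        # Assumption: driver pods carry the label spark-role=driver.
        if labels.get("spark-role") == "driver" and phase in ("Succeeded", "Failed"):
            names.append(metadata["name"])
    return names


if __name__ == "__main__":
    # Hypothetical sample data shaped like the Kubernetes API response.
    sample = {
        "items": [
            {"metadata": {"name": "spark-pi-driver", "labels": {"spark-role": "driver"}},
             "status": {"phase": "Succeeded"}},
            {"metadata": {"name": "spark-etl-driver", "labels": {"spark-role": "driver"}},
             "status": {"phase": "Running"}},
            {"metadata": {"name": "spark-pi-exec-1", "labels": {"spark-role": "executor"}},
             "status": {"phase": "Succeeded"}},
        ]
    }
    print(finished_driver_pods(sample))
```

The returned names could then be fed to `kubectl logs` before the pods are deleted.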

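4. Application image
Before running a Spark job you first need to build a Spark image. Spark ships a script that builds it automatically:
kubernetes/dockerfiles/
bin/docker-image-tool.sh (builds the JVM application image by default)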
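
For reference, here is a sketch of how the image tool is typically invoked, wrapped in a helper that only assembles the arguments. The `-r` (repository), `-t` (tag) and `-p` (Python bindings Dockerfile) options are the ones shown in the Spark documentation; the repository name `registry.example.com/genostack` and the helper name are placeholders, not anything from this thread.

```python
import shlex

# Path of the PySpark bindings Dockerfile inside a Spark distribution.
PYSPARK_DOCKERFILE = "./kubernetes/dockerfiles/spark/bindings/python/Dockerfile"


def image_tool_cmd(repo, tag, action, python_dockerfile=None):
    """Build the argument list for bin/docker-image-tool.sh (action: build or push)."""
    if action not in ("build", "push"):
        raise ValueError(f"unsupported action: {action!r}")
    cmd = ["./bin/docker-image-tool.sh", "-r", repo, "-t", tag]
    if python_dockerfile:
        # Also build the Python image on top of the default JVM image.
        cmd += ["-p", python_dockerfile]
    cmd.append(action)
    return cmd


if __name__ == "__main__":
    repo = "registry.example.com/genostack"  # placeholder repository
    print(shlex.join(image_tool_cmd(repo, "3.4.1", "build", PYSPARK_DOCKERFILE)))
    print(shlex.join(image_tool_cmd(repo, "3.4.1", "push")))
```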
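Submitting a job:

local:// refers to a path inside the image. You can configure an upload path with --conf spark.kubernetes.file.upload.path and then use file:// to upload local files. How do these files fit together with the UI layer? Apps, Jupyter, and workflows will all be submitting files to Spark later on. The other issue is that only namespaces such as kube-system have the required permission; elsewhere the submit fails with: cannot get resource "pods" in API group "" in the namespace "spark-demo". How should permissions be designed?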
./bin/spark-submit \
    --master k8s://192.168.39.6:8443 \
    --deploy-mode cluster \
    --name spark-pi \
    --class org.apache.spark.examples.SparkPi \
    --conf spark.executor.instances=5 \
    --conf spark.kubernetes.container.image=spark \
    --conf spark.kubernetes.namespace=kube-system \
    local:///opt/spark/examples/jars/spark-examples_2.12-3.4.1.jar
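
Since apps, Jupyter, and workflows would all end up submitting jobs, it may help to generate the command from a settings dictionary. The sketch below rebuilds the submit command above for the spark-demo namespace and adds `spark.kubernetes.authenticate.driver.serviceAccountName`, which is one documented way around the "cannot get resource pods" error. It assumes a service account named `spark` with rights to manage pods has already been created in that namespace; that name and the helper `spark_submit_cmd` are assumptions, not something set up in this thread.

```python
import shlex


def spark_submit_cmd(master, name, main_class, app_resource, conf, deploy_mode="cluster"):
    """Assemble a spark-submit argument list from a dict of Spark settings."""
    cmd = [
        "./bin/spark-submit",
        "--master", master,
        "--deploy-mode", deploy_mode,
        "--name", name,
        "--class", main_class,
    ]
    for key, value in conf.items():
        cmd += ["--conf", f"{key}={value}"]
    # The application jar (or script) always comes last.
    cmd.append(app_resource)
    return cmd


if __name__ == "__main__":
    conf = {
        "spark.executor.instances": 5,
        "spark.kubernetes.container.image": "spark",
        "spark.kubernetes.namespace": "spark-demo",
        # Assumption: a service account "spark" exists in spark-demo and is
        # allowed to create and watch pods there.
        "spark.kubernetes.authenticate.driver.serviceAccountName": "spark",
    }
    cmd = spark_submit_cmd(
        master="k8s://192.168.39.6:8443",
        name="spark-pi",
        main_class="org.apache.spark.examples.SparkPi",
        app_resource="local:///opt/spark/examples/jars/spark-examples_2.12-3.4.1.jar",
        conf=conf,
    )
    print(shlex.join(cmd))
```

Keeping the settings in a dictionary means a UI layer could merge per-user or per-namespace defaults before the command is built.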

Client mode versus cluster mode:
https://sparkbyexamples.com/spark/spark-submit-command/?expand_article=1
5. Storage
hostPath: mounts a file or directory from the host node’s filesystem into a pod.
emptyDir: an initially empty volume created when a pod is assigned to a node.
nfs: mounts an existing NFS (Network File System) share into a pod.
persistentVolumeClaim: mounts a PersistentVolume into a pod.
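
These volume types are mounted through Spark settings of the form `spark.kubernetes.<role>.volumes.<type>.<name>.mount.path` plus type-specific `options.*` keys. Here is a minimal sketch that generates those keys; the volume name `data`, the claim name `spark-data-pvc`, and the mount path are made-up values, and the helper name is hypothetical.

```python
VOLUME_TYPES = ("hostPath", "emptyDir", "nfs", "persistentVolumeClaim")


def volume_conf(role, volume_type, volume_name, mount_path, read_only=False, options=None):
    """Return the Spark settings that mount one volume into driver or executor pods."""
    if role not in ("driver", "executor"):
        raise ValueError(f"unknown role: {role!r}")
    if volume_type not in VOLUME_TYPES:
        raise ValueError(f"unsupported volume type: {volume_type!r}")
    prefix = f"spark.kubernetes.{role}.volumes.{volume_type}.{volume_name}"
    conf = {
        f"{prefix}.mount.path": mount_path,
        f"{prefix}.mount.readOnly": str(read_only).lower(),
    }
    # Type-specific options, e.g. claimName for a PVC or path for hostPath.
    for key, value in (options or {}).items():
        conf[f"{prefix}.options.{key}"] = value
    return conf


if __name__ == "__main__":
    conf = volume_conf(
        "executor", "persistentVolumeClaim", "data", "/data",
        options={"claimName": "spark-data-pvc"},  # hypothetical claim name
    )
    for key, value in conf.items():
        print(f"--conf {key}={value}")
```

The same call with `role="driver"` would be needed if the driver must see the data as well, since driver and executor volumes are configured separately.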