#apache-spark #spark-operator #Prometheus #telemetry
The spark-operator is a Kubernetes operator that enables defining, executing, and managing the whole lifecycle of Apache Spark applications in a Kubernetes-idiomatic way, just like any other workload. Additionally, deploying through the spark-operator allows the resources of the driver and the executors to be configured separately, and Spark-specific configuration settings to be specified. Previously, the project lived under the umbrella of Google Cloud Platform (GCP) under the name spark-on-k8s-operator; recently, however, it has been entrusted to the maintainers of the Kubeflow project. This move will strengthen the integration of Apache Spark into Kubeflow's machine-learning-focused ecosystem. Apache Spark is not only a well-established distributed computing engine; it also ships with a suite of ML algorithms (MLlib) out of the box.
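To give a concrete flavor, here is a minimal, illustrative SparkApplication manifest; the image tag, jar path, and resource values are placeholders rather than taken from this article's repo. Note how the driver, the executors, and the Spark settings each get their own block:

apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: spark-pi
spec:
  type: Scala
  mode: cluster
  image: apache/spark:3.5.0   # illustrative image tag
  mainClass: org.apache.spark.examples.SparkPi
  mainApplicationFile: local:///opt/spark/examples/jars/spark-examples_2.12-3.5.0.jar
  sparkVersion: 3.5.0
  sparkConf:
    spark.ui.showConsoleProgress: "true"   # arbitrary example of a Spark setting
  driver:        # resources for the driver, configured separately
    cores: 1
    memory: 512m
  executor:      # resources for the executors
    instances: 2
    cores: 1
    memory: 512m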
In this article, we'll walk through setting up the spark-operator locally on minikube, and building & launching a sample Spark job.
Prerequisites:
☸️ minikube and Helm installed.
🐳 Docker installed, and an active Docker Hub account (although alternative registries exist).
👩‍💻 A clone of https://github.com/makism/spark-on-k8s.
First things first, start a cluster as follows; the command is pretty self-explanatory.
minikube start --memory 8192 --cpus 4
If needed, force minikube to download kubectl, a CLI for communicating with the Kubernetes (minikube, in our case) control plane. We will use it extensively throughout the rest of the article.
minikube kubectl -- get pods -A
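For convenience, you can alias this invocation (a tip from the minikube documentation) so that plain kubectl commands target the minikube cluster; the snippets later in this article assume such an alias, or a standalone kubectl pointed at minikube.

alias kubectl="minikube kubectl --"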
Next, we need to deploy Prometheus and Grafana. Prometheus “collects and stores its metrics as time series data, i.e. metrics information is stored with the timestamp at which it was recorded, alongside optional key-value pairs called labels.” 🔗. It is particularly well-suited for Kubernetes because it can scrape metrics from the various cluster components. Grafana, on the other hand, is an open-source platform for monitoring, observability, and visualization of time-series data. It can connect to and integrate with multiple data sources, including, of course, Prometheus. With it, one can build intuitive dashboards to monitor the health and performance metrics of a Kubernetes cluster 🔗.
We will use a Helm Chart to deploy everything required on our cluster.
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
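Then refresh the local chart index so the repository's charts become available:

helm repo update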
helm install prometheus-comm prometheus-community/kube-prometheus-stack -f helms/prometheus.yaml
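Give the stack a minute to come up; you can check that the release's pods reach the Running state with:

kubectl get pods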
To access Grafana, we need two things: the admin credentials from the secret generated by the deployment, and a port-forward to the Grafana service.
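As a sketch, assuming kube-prometheus-stack's default naming convention of <release>-grafana for both the secret and the service (so prometheus-comm-grafana for our release), that boils down to:

# Decode the Grafana admin password from the generated secret
# (the default username is admin).
kubectl get secret prometheus-comm-grafana -o jsonpath="{.data.admin-password}" | base64 --decode

# Forward the Grafana service to http://localhost:3000.
kubectl port-forward svc/prometheus-comm-grafana 3000:80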