Configure the Spark History Server
For related topics, see Spark Operations.
Recommended configuration
For Fusion, configure the Spark History Server to store and read Spark logs in cloud storage. For installations on Google Kubernetes Engine, set these keys in the values.yaml file:
gcs:
  enableGCS: true
  secret: history-secrets (1)
  key: sparkhistory.json (1)
  logDirectory: gs://[BUCKET_NAME]
service: (2)
  type: ClusterIP
  port:
    number: 18080
pvc:
  enablePVC: false
nfs:
  enableExampleNFS: false (3)
(1) The key and secret fields tell the Spark History Server where to find an account with access to the Google Cloud Storage bucket given in logDirectory. Later examples show how to set up a new service account that’s shared between the Spark History Server and the Spark driver/executors for both viewing and writing logs.
(2) By default, the Spark History Server Helm chart creates an external LoadBalancer, exposing the service to outside access. In this example, the service key overrides that default: the Spark History Server is available only on an internal IP within your cluster and is not exposed externally. Later examples show how to access it.
(3) The nfs.enableExampleNFS option turns off the default NFS server set up by the Spark History Server, which is not needed here.
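The steps below assume the Google Cloud Storage bucket named in logDirectory already exists. If it does not, gsutil mb can create it first. This is a sketch; the region is an assumption to adapt:

# Create the log bucket if it does not already exist (example region)
gsutil mb -l us-central1 gs://[BUCKET_NAME]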
To give the Spark History Server access to the Google Cloud Storage bucket where the logs are kept:
- Use gcloud to create a new service account:

  export ACCOUNT_NAME=sparkhistory
  export GCP_PROJECT_ID=[PROJECT_ID]
  gcloud iam service-accounts create ${ACCOUNT_NAME} --display-name "${ACCOUNT_NAME}"

  If you have an existing service account you wish to use instead, you can skip the create command, though you will still need to create the JSON key pair and ensure that the existing account can read and write to the log bucket.
- Use keys create to create a JSON key pair; a later step uploads it to your cluster as a Kubernetes secret:

  gcloud iam service-accounts keys create "${ACCOUNT_NAME}.json" --iam-account "${ACCOUNT_NAME}@${GCP_PROJECT_ID}.iam.gserviceaccount.com"
- Give the service account the roles/storage.admin role, allowing it to perform "create" and "view" operations:

  gcloud projects add-iam-policy-binding ${GCP_PROJECT_ID} --member "serviceAccount:${ACCOUNT_NAME}@${GCP_PROJECT_ID}.iam.gserviceaccount.com" --role roles/storage.admin
- Run the gsutil command to apply the service account to your chosen bucket:

  gsutil iam ch serviceAccount:${ACCOUNT_NAME}@${GCP_PROJECT_ID}.iam.gserviceaccount.com:objectAdmin gs://[BUCKET_NAME]
- Upload the JSON key pair into the cluster as a secret:

  kubectl -n [NAMESPACE] create secret generic history-secrets --from-file=sparkhistory.json
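Before installing the chart, you can spot-check that the bucket policy and the secret are in place. These verification commands are a sketch; adjust names to your environment:

# Confirm the service account was granted objectAdmin on the bucket
gsutil iam get gs://[BUCKET_NAME]
# Confirm the secret exists and contains the sparkhistory.json key
kubectl -n [NAMESPACE] describe secret history-secrets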
The Spark History Server can now be installed with:

helm install [namespace]-spark-history-server stable/spark-history-server --values values.yaml
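Because the service is ClusterIP-only, the UI is not reachable from outside the cluster. One way to view it from a workstation is kubectl port-forward. This is a sketch; the service name below assumes it matches the Helm release name above, so check kubectl get svc for the actual name:

# Forward local port 18080 to the Spark History Server service
kubectl -n [NAMESPACE] port-forward service/[namespace]-spark-history-server 18080:18080
# Then browse to http://localhost:18080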
Other configurations
Azure
The Azure configuration process is similar to the one for Google Kubernetes Engine. However, logs are stored in Azure Blob Storage, and you can authenticate with either a SAS token or a storage account key.
echo "your-storage-account-name" >> azure-storage-account-name
echo "your-container-name" >> azure-blob-container-name
echo "your-azure-blob-sas-key" >> azure-blob-sas-key (1)
kubectl create secret generic azure-secrets --from-file=azure-storage-account-name --from-file=azure-blob-container-name [--from-file=azure-blob-sas-key | --from-file=azure-storage-account-key]
(1) This line is used to authenticate with a SAS token. To use a storage account key instead, replace the line with echo "your-azure-storage-account-key" >> azure-storage-account-key.
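If you need to generate a SAS token for the container, the Azure CLI can produce one. This is a sketch; the account, container, permissions, and expiry values are assumptions to adapt (broaden the permissions if the same token must also write logs):

# Generate a read/list SAS token for the blob container (example expiry)
az storage container generate-sas \
  --account-name your-storage-account-name \
  --name your-container-name \
  --permissions rl \
  --expiry 2026-01-01 \
  --output tsv >> azure-blob-sas-key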
To use SAS token access, the values.yaml file resembles the following:
wasbs:
  enableWASBS: true
  secret: azure-secrets
  sasKeyName: azure-blob-sas-key
  storageAccountNameKeyName: azure-storage-account-name
  containerKeyName: azure-blob-container-name
  logDirectory: [BUCKET_NAME]
For non-SAS access, the values.yaml file resembles the following:
wasbs:
  enableWASBS: true
  secret: azure-secrets
  sasKeyMode: false
  storageAccountKeyName: azure-storage-account-key
  storageAccountNameKeyName: azure-storage-account-name
  containerKeyName: azure-blob-container-name
  logDirectory: [BUCKET_NAME]
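In both variants, the *KeyName values must match the data keys stored inside the azure-secrets secret. A quick way to check, as a sketch:

# List the data keys in the secret; they should match the *KeyName values above
kubectl describe secret azure-secrets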
Amazon Web Services
In AWS, you can authenticate with IAM roles or with an access/secret key pair. IAM roles are preferred, but both options are described below.
aws iam list-access-keys --user-name your-user-name --output text | awk '{print $2}' >> aws-access-key
echo "your-aws-secret-key" >> aws-secret-key
kubectl create secret generic aws-secrets --from-file=aws-access-key --from-file=aws-secret-key
For IAM, the values.yaml file resembles the following:
s3:
  enableS3: true
  logDirectory: s3a://[BUCKET_NAME]
Note that the values.yaml file uses the Hadoop s3a:// scheme instead of s3://.
For an access/secret pair, add the secret:
s3:
  enableS3: true
  enableIAM: false
  accessKeyName: aws-access-key
  secretKeyName: aws-secret-key
  logDirectory: s3a://[BUCKET_NAME]
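As a quick sanity check that the credentials can reach the log bucket, you can list it with the AWS CLI. This is a sketch and assumes the key pair created above is active in your shell profile:

# Verify the access/secret pair can list the Spark log bucket
aws s3 ls s3://[BUCKET_NAME]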
Configuring Spark
After starting the Spark History Server, update the config map for Fusion’s job-launcher service so that it writes logs to the same location the Spark History Server reads from.
In this example, Fusion is installed into a namespace called sparkhistory.
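In this example the config map is named sparkhistory-job-launcher, matching the namespace. To confirm the exact name in your cluster, a quick check (a sketch):

# List config maps and find the job-launcher entry
kubectl get cm -n [NAMESPACE] | grep job-launcher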
- Before editing the config map, make a copy of the existing settings in case you need to revert the changes:

  kubectl get cm -n [NAMESPACE] sparkhistory-job-launcher -o yaml > sparkhistory-job-launcher.yaml
- Edit the config map to write the logs to the same Google Cloud Storage bucket we configured the Spark History Server to read from:

  kubectl edit cm -n [NAMESPACE] sparkhistory-job-launcher
- Update the spark key with the new YAML settings below:

  spark:
    hadoop:
      fs:
        AbstractFileSystem:
          gs:
            impl: com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS
        gs:
          impl: com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem
      google:
        cloud:
          auth:
            service:
              account:
                json:
                  keyfile: /etc/history-secrets/[ACCOUNT_NAME].json
    eventLog:
      enabled: true
      compress: true
      dir: gs://[BUCKET_NAME]
    …
    kubernetes:
      driver:
        secrets:
          history-secrets: /etc/history-secrets
        container:
          …
      executor:
        secrets:
          history-secrets: /etc/history-secrets
        container:
          …
    …
The YAML settings tell Spark where to find the secret and where to write the Spark eventLog. They also tell Spark how to access GCS, through the spark.hadoop.fs.AbstractFileSystem.gs.impl and spark.hadoop.fs.gs.impl keys.
- Delete the job-launcher pod, as shown in the sketch below. The new job-launcher pod will apply the new configuration to later jobs.
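A sketch of finding and deleting the pod; the job-launcher name is an assumption, so match on your actual pod name. Assuming the pod is managed by a deployment, it is recreated automatically with the new configuration:

# Find the job-launcher pod, then delete it so it is recreated
kubectl -n [NAMESPACE] get pods | grep job-launcher
kubectl -n [NAMESPACE] delete pod [JOB_LAUNCHER_POD_NAME]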