Fusion 5.10

    Configure the Spark History Server

    For related topics, see Spark Operations.

    Recommended configuration

    For Fusion, configure the Spark History Server to store and read Spark logs in cloud storage. For installations on Google Kubernetes Engine, set these keys in the values.yaml file:

    gcs:
      enableGCS: true
      secret: history-secrets (1)
      key: sparkhistory.json (1)
      logDirectory: gs://[BUCKET_NAME]
    service: (2)
      type: ClusterIP
      port:
         number: 18080
    
    pvc:
      enablePVC: false
    nfs:
      enableExampleNFS: false (3)
    1 The key and secret fields tell the Spark History Server where to find the credentials of an account with access to the Google Cloud Storage bucket given in logDirectory. The steps below show how to set up a new service account that’s shared between the Spark History Server and the Spark driver/executors for both viewing and writing logs.
    2 By default, the Spark History Server Helm chart creates an external LoadBalancer, exposing it to outside access. In this example, the service key overrides the default. The Spark History Server is set up on an internal IP within your cluster only and is not exposed externally. Later examples show how to access the Spark History Server.
    3 The nfs.enableExampleNFS option turns off the unneeded default NFS server set up by the Spark History Server.
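
    Because the service is ClusterIP only, one simple way to reach the Spark History Server UI is kubectl port-forward. This is only a sketch; the service name depends on your Helm release name, so check it with kubectl get svc first:

    kubectl -n [NAMESPACE] get svc
    kubectl -n [NAMESPACE] port-forward svc/[SPARK_HISTORY_SERVICE_NAME] 18080:18080
    # The UI is then reachable at http://localhost:18080 while the port-forward is running.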

    To give the Spark History Server access to the Google Cloud Storage bucket where the logs are kept:

    1. Use gcloud to create a new service account:

      export ACCOUNT_NAME=sparkhistory
      export GCP_PROJECT_ID=[PROJECT_ID]
      gcloud iam service-accounts create ${ACCOUNT_NAME} --display-name "${ACCOUNT_NAME}"
      If you have an existing service account you wish to use instead, you can skip the create command, though you will still need to create the JSON key pair and ensure that the existing account can read and write to the log bucket.
    2. Use keys create to create a JSON key pair for the service account (you will upload it to the cluster as a Kubernetes secret in step 5):

      gcloud iam service-accounts keys create "${ACCOUNT_NAME}.json" --iam-account "${ACCOUNT_NAME}@${GCP_PROJECT_ID}.iam.gserviceaccount.com"
    3. Give the service account the roles/storage.admin role, allowing it to perform "create" and "view" operations:

      gcloud projects add-iam-policy-binding ${GCP_PROJECT_ID} --member "serviceAccount:${ACCOUNT_NAME}@${GCP_PROJECT_ID}.iam.gserviceaccount.com" --role roles/storage.admin
    4. Run gsutil to grant the service account objectAdmin access to your chosen bucket:

      gsutil iam ch serviceAccount:${ACCOUNT_NAME}@${GCP_PROJECT_ID}.iam.gserviceaccount.com:objectAdmin gs://[BUCKET_NAME]
    5. Upload the JSON key pair into the cluster as a secret:

      kubectl -n [NAMESPACE] create secret generic history-secrets --from-file=sparkhistory.json
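
      Optionally, confirm that the secret exists and contains the expected key before installing the chart:

      kubectl -n [NAMESPACE] describe secret history-secrets
      # The Data section should list sparkhistory.json with a non-zero size.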

    The Spark History Server can now be installed with helm install [namespace]-spark-history-server stable/spark-history-server --values values.yaml.
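
    For example, assuming the release name shown above and the namespace used elsewhere in this guide (adjust both for your environment):

    helm install [namespace]-spark-history-server stable/spark-history-server --namespace [NAMESPACE] --values values.yaml
    kubectl -n [NAMESPACE] get pods
    # Once the spark-history-server pod reports Running, the UI listens on port 18080 inside the cluster.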

    Other configurations

    Azure

    The Azure configuration process is similar to the process for Google Kubernetes Engine. However, logs are stored in Azure Blob Storage, and you can authenticate with either a SAS token or a storage account key.

    echo "your-storage-account-name" >> azure-storage-account-name
    echo "your-container-name" >> azure-blob-container-name
    echo "your-azure-blob-sas-key" >> azure-blob-sas-key (1)
    kubectl create secret generic azure-secrets --from-file=azure-storage-account-name --from-file=azure-blob-container-name [--from-file=azure-blob-sas-key | --from-file=azure-storage-account-key]
    1 This line is used to authenticate with a SAS token. Replace the line with echo "your-azure-storage-account-key" >> azure-storage-account-key to use a storage account key instead.

    To use SAS token access, the values.yaml file resembles the following:

    wasbs:
      enableWASBS: true
      secret: azure-secrets
      sasKeyName: azure-blob-sas-key
      storageAccountNameKeyName: azure-storage-account-name
      containerKeyName: azure-blob-container-name
      logDirectory: [BUCKET_NAME]

    For non-SAS access, the values.yaml file resembles the following:

    wasbs:
      enableWASBS: true
      secret: azure-secrets
      sasKeyMode: false
      storageAccountKeyName: azure-storage-account-key
      storageAccountNameKeyName: azure-storage-account-name
      containerKeyName: azure-blob-container-name
      logDirectory: [BUCKET_NAME]

    Amazon Web Services

    In AWS, you can use IAM roles or an access/secret key pair. IAM roles are preferred, but both options are described here. If you use an access/secret key pair, store the keys in a Kubernetes secret:

    aws iam list-access-keys --user-name your-user-name --output text | awk '{print $2}' >> aws-access-key
    echo "your-aws-secret-key" >> aws-secret-key
    kubectl create secret generic aws-secrets --from-file=aws-access-key --from-file=aws-secret-key

    For IAM, the values.yaml file resembles the following:

    s3:
      enableS3: true
      logDirectory: s3a://[BUCKET_NAME]
    The values.yaml file uses the Hadoop s3a:// scheme instead of s3://.

    For an access/secret pair, add the secret:

    s3:
      enableS3: true
      enableIAM: false
      accessKeyName: aws-access-key
      secretKeyName: aws-secret-key
      logDirectory: s3a://[BUCKET_NAME]

    Configuring Spark

    After starting the Spark History Server, update the config map for Fusion’s job-launcher service so it can write logs to the same location that the Spark History Server reads from.

    In this example, Fusion is installed into a namespace called sparkhistory.

    1. Before editing the config map, make a copy of the existing settings in case you need to revert the changes.

      kubectl get cm -n [NAMESPACE] sparkhistory-job-launcher -o yaml > sparkhistory-job-launcher.yaml
    2. Edit the config map to write the logs to the same Google Cloud Storage bucket that the Spark History Server was configured to read from:

      kubectl edit cm -n [NAMESPACE] sparkhistory-job-launcher
    3. Update the spark key with the new YAML settings below:

      spark:
        hadoop:
          fs:
            AbstractFileSystem:
              gs:
                impl: com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS
            gs:
              impl: com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem
          google:
            cloud:
              auth:
                service:
                  account:
                    json:
                      keyfile: /etc/history-secrets/[ACCOUNT_NAME].json
        eventLog:
          enabled: true
          compress: true
          dir: gs://[BUCKET_NAME]
        
        kubernetes:
          driver:
            secrets:
              history-secrets: /etc/history-secrets
          executor:
            secrets:
              history-secrets: /etc/history-secrets

      These YAML settings tell Spark where to find the secret, where to write the Spark eventLog, and how to access GCS via the spark.hadoop.fs.AbstractFileSystem.gs.impl and spark.hadoop.fs.gs.impl keys.

    4. Delete the job-launcher pod. The new job-launcher pod will apply the new configuration to later jobs.
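
      For example, find the pod and delete it by name (a sketch; substitute the pod name shown by the first command):

      kubectl -n [NAMESPACE] get pods | grep job-launcher
      kubectl -n [NAMESPACE] delete pod [JOB_LAUNCHER_POD_NAME]
      # Kubernetes recreates the pod, and the replacement picks up the updated config map.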