Configure Spark Jobs to Access Cloud Storage
For related topics, see Spark Operations.
Supported jobs
This procedure applies to the following Spark-based jobs:
- ALS Recommender
- Cluster Labeling
- Co-occurrence Similarity
- Collection Analysis
- Create Seldon Core Model Deployment Job
- Delete Seldon Core Model Deployment Job
- Document Clustering
- Ground Truth
- Head/Tail Analysis
- Item Similarity Recommender
- Legacy Item Recommender
- Legacy Item Similarity
- Levenshtein Spell Checking
- Logistic Regression Classifier Training
- Matrix Decomposition-Based Query-Query Similarity
- Outlier Detection
- Parallel Bulk Loader
- Parameterized SQL Aggregation
- Phrase Extraction
- Query-to-Query Session-Based Similarity
- Query-to-Query Similarity
- Random Forest Classifier Training
- Ranking Metrics
- SQL Aggregation
- SQL-Based Experiment Metric (deprecated)
- Statistically Interesting Phrases
- Synonym Detection Jobs
- Synonym and Similar Queries Detection Jobs
- Token and Phrase Spell Correction
- Word2Vec Model Training
For Argo-based jobs, see Configure An Argo-Based Job to Access GCS and Configure An Argo-Based Job to Access S3.
Amazon Web Services (AWS) and Google Cloud Storage (GCS) credentials can be configured per job or per cluster.
Configuring credentials per cluster
GCS
The examples in this subsection use placeholder values. See the table below for descriptions of the placeholders:
| Placeholder | Description |
|---|---|
| `<key name>` | Name of the Solr GCS service account key. |
| `<key file path>` | Path to the Solr GCS service account key. |
- Create a secret containing the credentials JSON file:

  ```bash
  kubectl create secret generic <key name> --from-file=/<key file path>/<key name>.json
  ```

  For more information, see Creating and managing service account keys. That topic explains how to generate your organization's GOOGLE_APPLICATION_CREDENTIALS, which are needed for the extra config map created in the next step.
- Create an extra config map in Kubernetes setting the required properties for GCP:

  - Create a properties file with the GCP properties:

    ```bash
    cat gcp-launcher.properties
    spark.kubernetes.driverEnv.GOOGLE_APPLICATION_CREDENTIALS = /mnt/gcp-secrets/<key name>.json
    spark.kubernetes.driver.secrets.<key name> = /mnt/gcp-secrets
    spark.kubernetes.executor.secrets.<key name> = /mnt/gcp-secrets
    spark.executorEnv.GOOGLE_APPLICATION_CREDENTIALS = /mnt/gcp-secrets/<key name>.json
    spark.hadoop.google.cloud.auth.service.account.json.keyfile = /mnt/gcp-secrets/<key name>.json
    ```

  - Create a config map based on the properties file:

    ```bash
    kubectl create configmap gcp-launcher --from-file=gcp-launcher.properties
    ```

- Add the `gcp-launcher` config map to `values.yaml` under `job-launcher`:

  ```yaml
  configSources:
    - gcp-launcher
  ```
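To confirm that these objects were created as expected, you can inspect them with `kubectl`. This is a minimal verification sketch; `solr-gcs-key` is a hypothetical value for `<key name>`, and any namespace flag is omitted:

```bash
# Confirm the service account key secret exists
# (solr-gcs-key is a hypothetical <key name>; use your own)
kubectl get secret solr-gcs-key

# Confirm the config map contains the properties from gcp-launcher.properties
kubectl get configmap gcp-launcher -o yaml
```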
AWS S3
AWS credentials can't be set with a single file. Instead, set two environment variables that refer to the access key and secret key, using the instructions below:
- Create a secret pointing to the credentials:

  ```bash
  kubectl create secret generic aws-secret --from-literal=key='<access key>' --from-literal=secret='<secret key>'
  ```
- Create an extra config map in Kubernetes setting the required properties for AWS:

  - Create a properties file with the AWS properties:

    ```bash
    cat aws-launcher.properties
    spark.kubernetes.driver.secretKeyRef.AWS_ACCESS_KEY_ID=aws-secret:key
    spark.kubernetes.driver.secretKeyRef.AWS_SECRET_ACCESS_KEY=aws-secret:secret
    spark.kubernetes.executor.secretKeyRef.AWS_ACCESS_KEY_ID=aws-secret:key
    spark.kubernetes.executor.secretKeyRef.AWS_SECRET_ACCESS_KEY=aws-secret:secret
    ```

  - Create a config map based on the properties file:

    ```bash
    kubectl create configmap aws-launcher --from-file=aws-launcher.properties
    ```

- Add the `aws-launcher` config map to `values.yaml` under `job-launcher`:

  ```yaml
  configSources:
    - aws-launcher
  ```
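As with GCS, a quick `kubectl` check can confirm the secret and config map exist. A minimal sketch (namespace flag omitted); the secret values are displayed base64-encoded:

```bash
# Confirm the secret holds both entries, "key" and "secret"
kubectl get secret aws-secret -o jsonpath='{.data}'; echo

# Confirm the launcher config map was created from the properties file
kubectl describe configmap aws-launcher
```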
Azure Data Lake
Configuring Azure through environment variables or config maps isn't possible yet. Instead, manually upload the `core-site.xml` file into the `job-launcher` pod at `/app/spark-dist/conf` (a `kubectl cp` sketch follows the example file). An example `core-site.xml` file:
```xml
<configuration>
  <property>
    <name>dfs.adls.oauth2.access.token.provider.type</name>
    <value>ClientCredential</value>
  </property>
  <property>
    <name>dfs.adls.oauth2.refresh.url</name>
    <value>Insert Your OAuth 2.0 Endpoint URL Value Here</value>
  </property>
  <property>
    <name>dfs.adls.oauth2.client.id</name>
    <value>Insert Your Application ID Here</value>
  </property>
  <property>
    <name>dfs.adls.oauth2.credential</name>
    <value>Insert the Secret Key Value Here</value>
  </property>
  <property>
    <name>fs.adl.impl</name>
    <value>org.apache.hadoop.fs.adl.AdlFileSystem</value>
  </property>
  <property>
    <name>fs.AbstractFileSystem.adl.impl</name>
    <value>org.apache.hadoop.fs.adl.Adl</value>
  </property>
</configuration>
```
At this time, only Data Lake Gen 1 is supported.
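One way to perform the manual upload described above is with `kubectl cp`. This is a minimal sketch; the `app=job-launcher` label selector is an assumption, so adjust it and the pod name for your deployment, and note that a file copied this way does not persist across pod restarts unless the directory is backed by a persistent volume:

```bash
# Find the job-launcher pod (label selector is an assumption; adjust for your deployment)
kubectl get pods -l app=job-launcher

# Copy the locally prepared core-site.xml into the pod's Spark conf directory
kubectl cp core-site.xml <job-launcher pod>:/app/spark-dist/conf/core-site.xml
```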
Configuring credentials per job
- Create a Kubernetes secret with the GCP or AWS credentials.
- Add Spark configuration to the job that exposes the secrets to the Spark driver and executors.
GCS
The examples in this subsection use placeholder values. See the table below for descriptions of the placeholders:
| Placeholder | Description |
|---|---|
| `<key name>` | Name of the Solr GCS service account key. |
| `<key file path>` | Path to the Solr GCS service account key. |
- Create a secret containing the credentials JSON file:

  ```bash
  kubectl create secret generic <key name> --from-file=/<key file path>/<key name>.json
  ```

  See Creating and managing service account keys for more details.
- Toggle the Advanced configuration in the job UI, and add the following to the Spark configuration:

  ```properties
  spark.kubernetes.driver.secrets.<key name> = /mnt/gcp-secrets
  spark.kubernetes.executor.secrets.<key name> = /mnt/gcp-secrets
  spark.kubernetes.driverEnv.GOOGLE_APPLICATION_CREDENTIALS = /mnt/gcp-secrets/<key name>.json
  spark.executorEnv.GOOGLE_APPLICATION_CREDENTIALS = /mnt/gcp-secrets/<key name>.json
  spark.hadoop.google.cloud.auth.service.account.json.keyfile = /mnt/gcp-secrets/<key name>.json
  ```
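Once the job has been submitted, you can check that the secret was mounted and the environment variable set in the Spark driver pod. This is a minimal sketch, assuming the driver container image provides standard shell utilities; `<driver pod>` is a placeholder you would look up first:

```bash
# List pods to find the Spark driver for the job (naming varies by job)
kubectl get pods | grep driver

# Confirm the credentials path is set in the driver environment
kubectl exec <driver pod> -- env | grep GOOGLE_APPLICATION_CREDENTIALS

# Confirm the key file was mounted from the secret
kubectl exec <driver pod> -- ls /mnt/gcp-secrets
```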
AWS S3
AWS credentials can't be set with a single file. Instead, set two environment variables that refer to the access key and secret key, using the instructions below:
- Create a secret pointing to the credentials:

  ```bash
  kubectl create secret generic aws-secret --from-literal=key='<access key>' --from-literal=secret='<secret key>'
  ```
- Toggle the Advanced configuration in the job UI, and add the following to the Spark configuration:

  ```properties
  spark.kubernetes.driver.secretKeyRef.AWS_ACCESS_KEY_ID=aws-secret:key
  spark.kubernetes.driver.secretKeyRef.AWS_SECRET_ACCESS_KEY=aws-secret:secret
  spark.kubernetes.executor.secretKeyRef.AWS_ACCESS_KEY_ID=aws-secret:key
  spark.kubernetes.executor.secretKeyRef.AWS_SECRET_ACCESS_KEY=aws-secret:secret
  ```
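As with GCS, you can quickly check that the credentials reached the driver. This minimal sketch prints only variable names so the secret value is not echoed; `<driver pod>` is a placeholder:

```bash
# Confirm the secret exists and holds both entries before submitting the job
kubectl get secret aws-secret -o jsonpath='{.data}'; echo

# After submission, confirm the variables were injected into the driver environment
# (prints variable names only, not values)
kubectl exec <driver pod> -- env | cut -d= -f1 | grep AWS_
```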