Run On
You can also add connections that run on a Hadoop cluster or on Databricks.
A new cluster is created for each job and is terminated after completion of the job.
Add a Hadoop Cluster connection.
Select the "On Demand Cluster" check box in the top-right corner.
Type a connection name, which you will assign to this connection for internal use.
Provide the details as shown below.
Configuring with On Demand Cluster enables jobs to leverage an EMR cluster on demand. The EMR cluster is launched on the fly and terminated when the job completes.
Descriptions of the properties to be provided are given below. Only five properties need to be updated.
The following properties can be copied as-is and require no changes:
AWS_SERVICE_NAME=elasticmapreduce
AWS_EMR_SERVICE_ROLE=EMR_DefaultRole (Service role for Amazon EMR)
AWS_EMR_EC2_SERVICE_ROLE=EMR_EC2_DefaultRole (Service role for cluster EC2 instances)
AWS_EMR_CLUSTER_APPLICATION_NAMES=Spark,Livy,Hive
AWS_EMR_RELEASE_LABEL=emr-5.31.0
AWS_EMR_CLUSTER_CREATE_JOBNAME=EMRSparkCluster
AWS_EMR_EC2_MASTER_INSTANCE_TYPE=m5.2xlarge
AWS_EMR_EC2_SLAVE_INSTANCE_TYPE=m5.2xlarge
EC2_INSTANCE_COST_TYPE=SPOT (or ON_DEMAND)
The following five properties need to be provided:
AWS_REGION=us-east-1 (AWS region. It is recommended to use the same region where the Vexdata server is running; otherwise network connectivity needs to be established.)
AWS_EMR_CLUSTER_TAGS=Name:XYZ,Project:Vexdata (Tags for auditing)
AWS_EMR_S3_LOG_URI=s3://aws-logs-XYZ-us-east-1/elasticmapreduce/ (The S3 bucket must be writable from the EMR cluster.)
AWS_EC2_INSTANCE_SUBNET=subnet-XYZ (Private subnet ID where the EMR cluster will be launched. It is recommended to use the same subnet where the Vexdata server is running; otherwise network connectivity needs to be established.)
AWS_NETWORK_VPC=vpc-XYZ (VPC ID)
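For orientation, the sketch below (an illustrative mapping, not Vexdata's actual implementation) shows how these connection properties correspond to parameters of the EMR `RunJobFlow` API that an on-demand launch ultimately drives, for example via boto3's `run_job_flow`:

```python
# Illustrative only: translate the connection properties above into
# RunJobFlow keyword arguments. No AWS call is made here.
# Notes: AWS_REGION and AWS_NETWORK_VPC configure the client and network
# rather than RunJobFlow itself, and EC2_INSTANCE_COST_TYPE would map to
# the instance group "Market" field (ON_DEMAND or SPOT).

def properties_to_run_job_flow(props: dict) -> dict:
    """Build RunJobFlow kwargs from the connection property map."""
    return {
        "Name": props["AWS_EMR_CLUSTER_CREATE_JOBNAME"],
        "ReleaseLabel": props["AWS_EMR_RELEASE_LABEL"],
        "LogUri": props["AWS_EMR_S3_LOG_URI"],
        "ServiceRole": props["AWS_EMR_SERVICE_ROLE"],
        "JobFlowRole": props["AWS_EMR_EC2_SERVICE_ROLE"],
        "Applications": [
            {"Name": app}
            for app in props["AWS_EMR_CLUSTER_APPLICATION_NAMES"].split(",")
        ],
        "Instances": {
            "MasterInstanceType": props["AWS_EMR_EC2_MASTER_INSTANCE_TYPE"],
            "SlaveInstanceType": props["AWS_EMR_EC2_SLAVE_INSTANCE_TYPE"],
            "Ec2SubnetId": props["AWS_EC2_INSTANCE_SUBNET"],
            # On-demand clusters terminate when the job completes
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        "Tags": [
            {"Key": key, "Value": value}
            for key, value in (
                tag.split(":", 1)
                for tag in props["AWS_EMR_CLUSTER_TAGS"].split(",")
            )
        ],
    }
```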
You can copy and paste the content above; only the last five properties need to be changed. The descriptions for those five properties are given above.
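Putting the two groups together, a complete set of connection properties (with the XYZ placeholders still to be replaced with your own values) looks like this:

```properties
AWS_SERVICE_NAME=elasticmapreduce
AWS_EMR_SERVICE_ROLE=EMR_DefaultRole
AWS_EMR_EC2_SERVICE_ROLE=EMR_EC2_DefaultRole
AWS_EMR_CLUSTER_APPLICATION_NAMES=Spark,Livy,Hive
AWS_EMR_RELEASE_LABEL=emr-5.31.0
AWS_EMR_CLUSTER_CREATE_JOBNAME=EMRSparkCluster
AWS_EMR_EC2_MASTER_INSTANCE_TYPE=m5.2xlarge
AWS_EMR_EC2_SLAVE_INSTANCE_TYPE=m5.2xlarge
EC2_INSTANCE_COST_TYPE=SPOT
AWS_REGION=us-east-1
AWS_EMR_CLUSTER_TAGS=Name:XYZ,Project:Vexdata
AWS_EMR_S3_LOG_URI=s3://aws-logs-XYZ-us-east-1/elasticmapreduce/
AWS_EC2_INSTANCE_SUBNET=subnet-XYZ
AWS_NETWORK_VPC=vpc-XYZ
```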
There are primarily two ways Databricks can be used:
Use an existing Databricks cluster.
Launch an on-demand cluster and terminate it when the job completes.
To add a Databricks Cluster as a connection:
Type a connection name, which you will assign to this connection for internal use.
Provide cluster details, as shown below.
DATABRICKS_TOKEN=<token>
DATABRICKS_INSTANCE=<XXXXXXXXXXXXXX.cloud.databricks.com>
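As a quick sanity check for these two values, the Databricks REST API can be queried directly. The sketch below is a hypothetical helper (not part of the product) that builds an authenticated request against the documented `/api/2.0/clusters/list` endpoint from the same instance host and token:

```python
# Hypothetical helper: build an authenticated Databricks REST API request
# from the DATABRICKS_INSTANCE and DATABRICKS_TOKEN values configured above.
# No network call is made here; pass the Request to urllib.request.urlopen()
# to actually verify connectivity and token validity.
import urllib.request

def build_clusters_list_request(instance: str, token: str) -> urllib.request.Request:
    # Databricks personal access tokens are sent as a Bearer token header
    url = f"https://{instance}/api/2.0/clusters/list"
    return urllib.request.Request(url, headers={"Authorization": f"Bearer {token}"})
```

A 200 response from this endpoint confirms that the instance hostname resolves and the token is accepted; a 403 usually means the token is invalid or expired.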