Run On

You can also add connections that run on a Hadoop cluster or on Databricks.

Static Hadoop Cluster

Upcoming section

AWS EMR Cluster On Demand

A new cluster is created for each job and terminated when the job completes.

  • Add a Hadoop Cluster connection.

  • Select the "On Demand Cluster" check box in the top-right corner.

  • Type a connection name, which you will assign to this connection for internal use.

  • Provide the details as shown below.

With On Demand Cluster enabled, jobs run on an EMR cluster that is launched on the fly and terminated when the job completes.

The properties to provide are described below. Only five of them need to be updated.

The following properties can be copied as-is and require no change:

  • AWS_SERVICE_NAME=elasticmapreduce

  • AWS_EMR_SERVICE_ROLE=EMR_DefaultRole (Service role for Amazon EMR)

  • AWS_EMR_EC2_SERVICE_ROLE=EMR_EC2_DefaultRole (Service role for cluster EC2 instances)

  • AWS_EMR_CLUSTER_APPLICATION_NAMES=Spark,Livy,Hive

  • AWS_EMR_RELEASE_LABEL=emr-5.31.0

  • AWS_EMR_CLUSTER_CREATE_JOBNAME=EMRSparkCluster

  • AWS_EMR_EC2_MASTER_INSTANCE_TYPE=m5.2xlarge

  • AWS_EMR_EC2_SLAVE_INSTANCE_TYPE=m5.2xlarge

  • EC2_INSTANCE_COST_TYPE=ON_DEMAND (either SPOT or ON_DEMAND)

The following five properties must be provided:

  • AWS_REGION=us-east-1 (AWS region. It is recommended to use the same region in which the Vexdata server is running; otherwise, network connectivity needs to be established.)

  • AWS_EMR_CLUSTER_TAGS=Name:XYZ,Project:Vexdata (Tags for auditing)

  • AWS_EMR_S3_LOG_URI=s3://aws-logs-XYZ-us-east-1/elasticmapreduce/ (The EMR cluster must have write access to this S3 bucket.)

  • AWS_EC2_INSTANCE_SUBNET=subnet-XYZ (ID of the private subnet where the EMR cluster will be launched. It is recommended to use the same subnet in which the Vexdata server is running; otherwise, network connectivity needs to be established.)

  • AWS_NETWORK_VPC=vpc-XYZ (VPC id)
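Before pasting the properties into the connection form, it can help to check that none of the five required values is missing or still a placeholder. The following is a minimal sketch of such a check, not part of Vexdata itself; the helper names `parse_properties` and `find_unset` are hypothetical, and the sketch assumes the properties are plain KEY=VALUE lines as shown on this page.

```python
# Hypothetical pre-flight check; the key names come from the list above.
REQUIRED_KEYS = [
    "AWS_REGION",
    "AWS_EMR_CLUSTER_TAGS",
    "AWS_EMR_S3_LOG_URI",
    "AWS_EC2_INSTANCE_SUBNET",
    "AWS_NETWORK_VPC",
]


def parse_properties(text):
    """Parse KEY=VALUE lines into a dict, skipping blank lines and lines without '='."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or "=" not in line:
            continue
        key, value = line.split("=", 1)
        props[key.strip()] = value.strip()
    return props


def find_unset(props):
    """Return the required keys that are missing, empty, or still a <placeholder>."""
    unset = []
    for key in REQUIRED_KEYS:
        value = props.get(key, "")
        if not value or (value.startswith("<") and value.endswith(">")):
            unset.append(key)
    return unset
```

An empty list from `find_unset(parse_properties(text))` suggests that all five values have been filled in; it does not verify that the region, subnet, VPC, or bucket actually exist.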

You can copy and paste the content below; only the last five properties need to be changed. These five properties are described above.

AWS_EMR_SERVICE_ROLE=EMR_DefaultRole
AWS_EMR_EC2_SERVICE_ROLE=EMR_EC2_DefaultRole
AWS_SERVICE_NAME=elasticmapreduce
AWS_EMR_CLUSTER_APPLICATION_NAMES=Spark,Livy,Hive
AWS_EMR_RELEASE_LABEL=emr-5.31.0
AWS_EMR_CLUSTER_CREATE_JOBNAME=VexdataSparkCluster
AWS_EMR_EC2_MASTER_INSTANCE_TYPE=m5a.2xlarge
AWS_EMR_EC2_SLAVE_INSTANCE_TYPE=m5a.2xlarge
EC2_INSTANCE_COST_TYPE=ON_DEMAND


AWS_REGION=us-east-1
AWS_EMR_CLUSTER_TAGS=<Name:customer_name,Project:Vexdata>
AWS_NETWORK_VPC=<vpc_id>
AWS_EC2_INSTANCE_SUBNET=<subnet-072956387c3bc1383>
AWS_EMR_S3_LOG_URI=<s3://aws-logs-865515016503-us-east-1/elasticmapreduce/>

EC2_INSTANCE_COST_TYPE can be set to SPOT when cost is the priority; however, ON_DEMAND is recommended for time-critical jobs.
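To make the role of each property more concrete, the sketch below shows one plausible way these values could map onto the parameters of the Amazon EMR `RunJobFlow` API (exposed in boto3 as `run_job_flow`). This mapping is an assumption made for illustration; the page does not describe how Vexdata creates the cluster internally. The function name `build_run_job_flow_request` and the `core_instance_count` parameter are hypothetical, since no instance count appears among the properties.

```python
def build_run_job_flow_request(props, core_instance_count=2):
    """Build a dict of EMR RunJobFlow parameters from the connection properties.

    Assumed mapping for illustration only; core_instance_count is hypothetical.
    """
    market = props["EC2_INSTANCE_COST_TYPE"]  # "SPOT" or "ON_DEMAND"

    # "Name:XYZ,Project:Vexdata" -> [{"Key": "Name", "Value": "XYZ"}, ...]
    tags = []
    for pair in props["AWS_EMR_CLUSTER_TAGS"].split(","):
        key, value = pair.split(":", 1)
        tags.append({"Key": key.strip(), "Value": value.strip()})

    # "Spark,Livy,Hive" -> [{"Name": "Spark"}, {"Name": "Livy"}, {"Name": "Hive"}]
    app_names = props["AWS_EMR_CLUSTER_APPLICATION_NAMES"].split(",")
    applications = [{"Name": name.strip()} for name in app_names]

    return {
        "Name": props["AWS_EMR_CLUSTER_CREATE_JOBNAME"],
        "LogUri": props["AWS_EMR_S3_LOG_URI"],
        "ReleaseLabel": props["AWS_EMR_RELEASE_LABEL"],
        "Applications": applications,
        "ServiceRole": props["AWS_EMR_SERVICE_ROLE"],
        "JobFlowRole": props["AWS_EMR_EC2_SERVICE_ROLE"],
        "Tags": tags,
        "Instances": {
            "Ec2SubnetId": props["AWS_EC2_INSTANCE_SUBNET"],
            "InstanceGroups": [
                {
                    "Name": "Master",
                    "InstanceRole": "MASTER",
                    "Market": market,
                    "InstanceType": props["AWS_EMR_EC2_MASTER_INSTANCE_TYPE"],
                    "InstanceCount": 1,
                },
                {
                    "Name": "Core",
                    "InstanceRole": "CORE",
                    "Market": market,
                    "InstanceType": props["AWS_EMR_EC2_SLAVE_INSTANCE_TYPE"],
                    "InstanceCount": core_instance_count,
                },
            ],
        },
    }
```

Under these assumptions, the resulting dict could be passed to `boto3.client("emr", region_name=props["AWS_REGION"]).run_job_flow(**request)`. Job submission and cluster termination are not shown here.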

Databricks

Databricks can be used in two main ways:

  • Use an existing Databricks cluster.

  • Launch an on-demand cluster and terminate it when the job completes.

To add a Databricks Cluster as a connection:

  • Type a connection name, which you will assign to this connection for internal use.

  • Provide cluster details, as shown below.

DATABRICKS_TOKEN=<token>

DATABRICKS_INSTANCE=<XXXXXXXXXXXXXX.cloud.databricks.com>
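Before saving the connection, you may want to confirm that the token and instance name work together. The sketch below is one way to do that outside Vexdata, assuming the workspace exposes the Databricks REST API endpoint `/api/2.0/clusters/list` and accepts the token as a Bearer credential. The function names are hypothetical, and the instance value is expected without the `https://` prefix, as in the example above.

```python
import json
import urllib.request


def build_clusters_list_request(instance, token):
    """Build a GET request for the Databricks clusters list endpoint."""
    url = f"https://{instance}/api/2.0/clusters/list"
    return urllib.request.Request(url, headers={"Authorization": f"Bearer {token}"})


def list_cluster_names(instance, token, timeout=10):
    """Return the names of the clusters visible to this token.

    Raises urllib.error.HTTPError if the token or instance is rejected.
    """
    request = build_clusters_list_request(instance, token)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        payload = json.load(response)
    return [cluster.get("cluster_name") for cluster in payload.get("clusters", [])]
```

A successful call to `list_cluster_names` with your DATABRICKS_INSTANCE and DATABRICKS_TOKEN values indicates that the credentials are accepted; an HTTP 401 or 403 error usually points to an invalid or expired token.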
