Run On

You can also add connections that run on a Hadoop cluster or on Databricks.

Static Hadoop Cluster

Upcoming section

AWS EMR Cluster On Demand

A new cluster is created for each job and is terminated after completion of the job.

  • Add a Hadoop Cluster connection.

  • Select the "On Demand Cluster" check box in the top-right corner.

  • Type a connection name; this name identifies the connection internally.

  • Provide the details as shown below.

Hadoop Configuration

Configuring a connection with On Demand Cluster enables jobs to use an EMR cluster on demand: the cluster is launched on the fly and terminated when the job completes.
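As a rough mental model, the on-demand properties described below map onto a single `aws emr create-cluster` call. The sketch below is an assumption for illustration, not Vexdata's actual internals; the instance count and the `XYZ` placeholder IDs are made up, and the command is printed rather than executed, so nothing is launched.

```shell
# Illustrative sketch (assumption): an on-demand EMR launch expressed as one
# AWS CLI call. Stored and printed, not executed, so no cluster is created.
EMR_CREATE_CMD='aws emr create-cluster \
  --name EMRSparkCluster \
  --release-label emr-5.31.0 \
  --applications Name=Spark Name=Livy Name=Hive \
  --service-role EMR_DefaultRole \
  --ec2-attributes InstanceProfile=EMR_EC2_DefaultRole,SubnetId=subnet-XYZ \
  --instance-type m5.2xlarge \
  --instance-count 3 \
  --log-uri s3://aws-logs-XYZ-us-east-1/elasticmapreduce/ \
  --tags Name=XYZ Project=Vexdata \
  --region us-east-1 \
  --auto-terminate'
echo "$EMR_CREATE_CMD"
```

`--auto-terminate` is what gives the "terminated on completing the job" behavior; the service role and EC2 instance profile correspond to the two default-role properties below.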

The properties to be provided are described below. Only five of them need to be updated; the rest can be copied unchanged.

The following properties can be copied and require no change:

  • AWS_SERVICE_NAME=elasticmapreduce

  • AWS_EMR_SERVICE_ROLE=EMR_DefaultRole (Service role for Amazon EMR)

  • AWS_EMR_EC2_SERVICE_ROLE=EMR_EC2_DefaultRole (Service role for cluster EC2 instances)

  • AWS_EMR_CLUSTER_APPLICATION_NAMES=Spark,Livy,Hive

  • AWS_EMR_RELEASE_LABEL=emr-5.31.0

  • AWS_EMR_CLUSTER_CREATE_JOBNAME=EMRSparkCluster

  • AWS_EMR_EC2_MASTER_INSTANCE_TYPE=m5.2xlarge

  • AWS_EMR_EC2_SLAVE_INSTANCE_TYPE=m5.2xlarge

  • EC2_INSTANCE_COST_TYPE=SPOT (set to SPOT or ON_DEMAND)

The following five properties need to be provided:

  • AWS_REGION=us-east-1 (AWS region. It is recommended to use the same region as the Vexdata server; otherwise, network connectivity must be established.)

  • AWS_EMR_CLUSTER_TAGS=Name:XYZ,Project:Vexdata (Tags for auditing)

  • AWS_EMR_S3_LOG_URI=s3://aws-logs-XYZ-us-east-1/elasticmapreduce/ (The S3 bucket must be writable by the EMR cluster.)

  • AWS_EC2_INSTANCE_SUBNET=subnet-XYZ (Private subnet ID where the EMR cluster will be launched. It is recommended to use the same subnet as the Vexdata server; otherwise, network connectivity must be established.)

  • AWS_NETWORK_VPC=vpc-XYZ (VPC ID)

You can copy and paste these properties and change only the last five; those five are described above.
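For convenience, the full set of properties from the two lists above can be kept together as one block. This is simply the values already listed, with the five site-specific entries kept as `XYZ` placeholders to replace:

```
AWS_SERVICE_NAME=elasticmapreduce
AWS_EMR_SERVICE_ROLE=EMR_DefaultRole
AWS_EMR_EC2_SERVICE_ROLE=EMR_EC2_DefaultRole
AWS_EMR_CLUSTER_APPLICATION_NAMES=Spark,Livy,Hive
AWS_EMR_RELEASE_LABEL=emr-5.31.0
AWS_EMR_CLUSTER_CREATE_JOBNAME=EMRSparkCluster
AWS_EMR_EC2_MASTER_INSTANCE_TYPE=m5.2xlarge
AWS_EMR_EC2_SLAVE_INSTANCE_TYPE=m5.2xlarge
EC2_INSTANCE_COST_TYPE=SPOT
AWS_REGION=us-east-1
AWS_EMR_CLUSTER_TAGS=Name:XYZ,Project:Vexdata
AWS_EMR_S3_LOG_URI=s3://aws-logs-XYZ-us-east-1/elasticmapreduce/
AWS_EC2_INSTANCE_SUBNET=subnet-XYZ
AWS_NETWORK_VPC=vpc-XYZ
```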

Databricks

There are primarily two ways Databricks can be used:

  • Use an existing Databricks cluster.

  • Launch an on-demand cluster and terminate it when the job completes.

To add a Databricks Cluster as a connection:

  • Type a connection name; this name identifies the connection internally.

  • Provide cluster details, as shown below.

DATABRICKS_TOKEN=<token>

DATABRICKS_INSTANCE=<XXXXXXXXXXXXXX.cloud.databricks.com>
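A quick way to sanity-check the two values above is to list clusters via the Databricks REST API (`/api/2.0/clusters/list`) with the token as a bearer credential. The sketch below uses assumed placeholder values and prints the `curl` command rather than executing it:

```shell
# Smoke-test sketch for the Databricks connection details.
# Placeholder values below are assumptions; substitute your own.
DATABRICKS_INSTANCE="XXXXXXXXXXXXXX.cloud.databricks.com"
DATABRICKS_TOKEN="dapi-example-token"   # hypothetical token value
# Listing clusters is a harmless read-only call that confirms the token
# and instance URL are valid. Printed here, not executed.
LIST_CLUSTERS_CMD="curl -s -H 'Authorization: Bearer ${DATABRICKS_TOKEN}' https://${DATABRICKS_INSTANCE}/api/2.0/clusters/list"
echo "$LIST_CLUSTERS_CMD"
```

A `401` response from the real call would indicate a bad or expired token; a name-resolution failure would indicate a wrong instance URL.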

Databricks Configuration
