
DataPull internally uses Spark to move data across platforms. DataPull jobs can be monitored using the Spark UI.

The Spark UI is available only while the EMR cluster is running and has a step in running status.

Prerequisites

Prerequisite for environments outside of Vrbo and EG Data Platform

The Spark UI is a website hosted on the master node of the EMR cluster running DataPull. In most cases, the IP/DNS name of the EMR cluster's master node is inaccessible from the user's client machine, because there is no peering/VPN/Direct Connect set up between the user's network and the VPC of the EMR cluster. Hence, SSH tunnelling is needed to access this website from the user's client machine, using either of the two options below.

Use an SSH client to connect to the master node, configure SSH tunneling with local port forwarding, and use an Internet browser to open web interfaces hosted on the master node. This method allows you to configure web interface access without using a SOCKS proxy. For more information, refer to https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-ssh-tunnel-local.html
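As a rough illustration of local port forwarding, the sketch below assembles the `ssh` command line in Python. The key file path, the master DNS name and the local port 8157 are placeholder assumptions, not values from DataPull; `hadoop` is the default SSH user on EMR, and 8088 is the YARN ResourceManager port used later in this page.

```python
import shlex


def local_forward_command(key_file, master_dns, local_port=8157,
                          remote_port=8088, user="hadoop"):
    """Build an ssh argv that forwards local_port to remote_port on the master node."""
    return [
        "ssh",
        "-i", key_file,  # private key of the EC2 key pair (placeholder path)
        "-N",            # do not run a remote command, only forward ports
        "-L", f"{local_port}:{master_dns}:{remote_port}",
        f"{user}@{master_dns}",
    ]


# Hypothetical values, for illustration only.
cmd = local_forward_command(
    "/path/to/mykeypair.pem",
    "ec2-203-0-113-10.compute-1.amazonaws.com",
)
print(shlex.join(cmd))  # paste the printed command into a terminal
```

While that command is running, the forwarded web interface should be reachable in the browser at http://localhost:8157 (or whichever local port was chosen).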

Use an SSH client to connect to the master node, configure SSH tunneling with dynamic port forwarding, and configure your Internet browser to use an add-on such as FoxyProxy or SwitchySharp to manage your SOCKS proxy settings. This method allows you to automatically filter URLs based on text patterns and to limit the proxy settings to domains that match the form of the master node's DNS name. The browser add-on automatically handles turning the proxy on and off when you switch between viewing websites hosted on the master node and those on the Internet. For more information, refer to https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-ssh-tunnel.html
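For the dynamic port forwarding option, a minimal sketch of the corresponding `ssh` command line follows. The key file path, master DNS name and SOCKS port 8157 are again placeholder assumptions; the browser add-on would then be pointed at a SOCKS proxy on localhost and that port.

```python
import shlex


def dynamic_forward_command(key_file, master_dns, socks_port=8157, user="hadoop"):
    """Build an ssh argv that opens a local SOCKS proxy through the master node."""
    return [
        "ssh",
        "-i", key_file,         # private key of the EC2 key pair (placeholder path)
        "-N",                   # do not run a remote command, only forward
        "-D", str(socks_port),  # dynamic (SOCKS) forwarding on this local port
        f"{user}@{master_dns}",
    ]


# Hypothetical values, for illustration only.
socks_cmd = dynamic_forward_command(
    "/path/to/mykeypair.pem",
    "ec2-203-0-113-10.compute-1.amazonaws.com",
)
print(shlex.join(socks_cmd))
```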

Steps to access Spark UI

  1. Open the AWS console for the environment in which the DataPull job is running, and select EMR under the Services tab.
  2. Select the EMR cluster for the corresponding DataPull job. The EMR cluster will be named <environment>-emr-<DataPull Pipeline Name>-pipeline.
  3. Click the Hardware tab, then click the ID of the master node, which can be recognised under Node Type & name.
  4. Copy the <Master Node Private IP Address> to your clipboard.
  5. Open a new browser window/tab and browse to http://<Master Node Private IP Address>:8088, which opens the YARN ResourceManager.
  6. Click the ApplicationMaster link under Tracking UI.
  7. If the page redirection fails with "This page isn’t working" or a DNS error, replace the internal host name in the URL with the master node's private IP address. For example, if the <Master Node Private IP Address> is 1.2.3.4, replace ip-1-2-3-4.ec2.internal with 1.2.3.4 in the URL.
  8. You are now at the Spark UI. To learn more about analyzing the Spark UI, refer to Understanding your Apache Spark Application Through Visualization.
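The string handling in steps 2, 5 and 7 can be sketched in a few lines of Python. The function names are hypothetical helpers, the environment and pipeline names are made up, and the sample redirect URL (port and path) is invented purely for illustration. The rewrite covers only the ip-1-2-3-4.ec2.internal form mentioned in step 7.

```python
import re

# Matches internal host names of the form ip-1-2-3-4.ec2.internal (step 7).
_EC2_INTERNAL = re.compile(r"ip-(\d+)-(\d+)-(\d+)-(\d+)\.ec2\.internal")


def cluster_name(environment, pipeline_name):
    """Cluster name as described in step 2."""
    return f"{environment}-emr-{pipeline_name}-pipeline"


def resource_manager_url(master_private_ip):
    """YARN ResourceManager URL from step 5."""
    return f"http://{master_private_ip}:8088"


def use_private_ip(url):
    """Replace an ip-a-b-c-d.ec2.internal host with a.b.c.d, as in step 7."""
    return _EC2_INTERNAL.sub(r"\1.\2.\3.\4", url)


print(cluster_name("test", "mypipeline"))  # test-emr-mypipeline-pipeline
print(resource_manager_url("1.2.3.4"))     # http://1.2.3.4:8088
# Hypothetical redirect URL; the real port and path will differ.
print(use_private_ip("http://ip-1-2-3-4.ec2.internal:20888/proxy/app/"))
# http://1.2.3.4:20888/proxy/app/
```

URLs that do not contain a host name of that form are returned unchanged.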

For the full list of web interfaces that can be accessed through <Master Node Private IP Address>, refer to this page.