Set up VPC etc. in AWS Account for DataPull install
The instructions for deploying DataPull on AWS Fargate and AWS EMR assume that you already have an S3 bucket, a VPC, subnets, etc. available. If you do not have these, or if you want to install DataPull in a new VPC dedicated to DataPull, follow the instructions below.
It is generally recommended to use your existing VPC, subnets, etc., since they are most likely already set up to access the data you want DataPull to work on and to reach other services like S3.
When creating a VPC for DataPull, you need at least two subnets in different availability zones, since the AWS Application Load Balancer requires a minimum of two availability zones. You can create a VPC with two public subnets (approach modeled on https://docs.aws.amazon.com/vpc/latest/userguide/VPC_Scenario1.html) or a VPC with two private subnets and a public subnet (modeled on https://docs.aws.amazon.com/vpc/latest/userguide/VPC_Scenario2.html).
VPC with two public subnets
- create VPC `datatools` with IPv4 range `10.0.0.0/24`
    - this creates a security group named `default` for this VPC
    - this also creates a main route table with no name
- add a rule to security group `default`: to allow inbound traffic to tcp port 22 (SSH) from your client
    - This assumes that you will have a Bastion node or equivalent to access DataPull's API and Spark UI. If you trust your client network's public IP to be static or not be spoofed, you can allow all inbound traffic from your client, and thus obviate the need for a Bastion node or equivalent
- create two subnets in the VPC, each one in a different availability zone in your region.
    - subnet `datatools-external-1` with IPv4 range `10.0.0.0/25`
    - subnet `datatools-external-2` with IPv4 range `10.0.0.128/25`
- create internet gateway `datatools` for vpc `datatools`
- create route for destination `0.0.0.0/0` to internet gateway `datatools` for the default route table for the VPC
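As a quick sanity check on the addressing in the two-public-subnet layout, the following sketch uses Python's standard `ipaddress` module to confirm that the /24 VPC range splits exactly into the two /25 subnet ranges. The variable names are illustrative only, and nothing here calls AWS.

```python
import ipaddress

# VPC range and subnet ranges taken from the steps above.
vpc = ipaddress.ip_network("10.0.0.0/24")
subnets = {
    "datatools-external-1": ipaddress.ip_network("10.0.0.0/25"),
    "datatools-external-2": ipaddress.ip_network("10.0.0.128/25"),
}

# Splitting the /24 by one more prefix bit yields exactly two /25 halves.
halves = list(vpc.subnets(prefixlen_diff=1))

# Each subnet must sit inside the VPC range, and the two must not overlap.
all_inside = all(net.subnet_of(vpc) for net in subnets.values())
a, b = subnets.values()
no_overlap = not a.overlaps(b)

for name, net in subnets.items():
    print(name, net, "addresses:", net.num_addresses)
```

Each /25 spans 128 addresses; AWS reserves five addresses in every subnet, so slightly fewer are usable.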
VPC with two private subnets and a public subnet
- create VPC `datatools` with 2 IPv4 ranges `10.0.0.0/24`, `10.1.0.0/24`
    - this creates a security group named `default` for this VPC
    - this also creates a main route table with no name
- add a rule to security group `default`: to allow inbound traffic to tcp port 22 (SSH) from your client network (or from your public IP as a last resort)
- create three subnets in the VPC, each one in a different availability zone in your region.
    - subnet `datatools-internal-1` with IPv4 range `10.0.0.0/25`
    - subnet `datatools-internal-2` with IPv4 range `10.0.0.128/25`
    - subnet `datatools-external-1` with IPv4 range `10.1.0.0/24`
- create internet gateway `datatools` for vpc `datatools`
- create route table `datatools-external` in vpc `datatools`
- create route for destination `0.0.0.0/0` to internet gateway `datatools` for route table `datatools-external`
- associate subnet `datatools-external-1` to route table `datatools-external`
    - subnets `datatools-internal-[1-2]` will remain associated to the main/default route table (not route table `datatools-external`) for the VPC `datatools`
- create NAT gateway in the subnet `datatools-external-1` and with a new/existing elastic IP.
- add an entry to the default route table with destination as `0.0.0.0/0` and Target as the NAT gateway
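To see how the two route tables in this layout direct traffic, here is a simplified model, assuming that the most specific matching route wins. The target labels (`local`, `nat-gateway`, `internet-gateway`) and the `next_hop` helper are placeholders for illustration, not AWS identifiers or APIs.

```python
import ipaddress

def next_hop(route_table, destination):
    """Return the target of the most specific route that matches destination."""
    ip = ipaddress.ip_address(destination)
    matches = [
        (ipaddress.ip_network(cidr), target)
        for cidr, target in route_table.items()
        if ip in ipaddress.ip_network(cidr)
    ]
    if not matches:
        return None
    # The route with the longest prefix is the most specific one.
    return max(matches, key=lambda m: m[0].prefixlen)[1]

# Local routes cover both VPC ranges in every route table of the VPC.
LOCAL = {"10.0.0.0/24": "local", "10.1.0.0/24": "local"}

# Main route table: used by datatools-internal-[1-2].
main_table = {**LOCAL, "0.0.0.0/0": "nat-gateway"}

# Route table datatools-external: used by datatools-external-1.
external_table = {**LOCAL, "0.0.0.0/0": "internet-gateway"}

print(next_hop(main_table, "203.0.113.10"))
print(next_hop(external_table, "203.0.113.10"))
print(next_hop(main_table, "10.1.0.10"))
```

In this model, outbound traffic from the internal subnets leaves through the NAT gateway, traffic from the external subnet leaves through the internet gateway, and traffic between subnets stays local.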
Additional Steps (common to both VPC types above)
- create S3 bucket `datatools-datapull` with SSE-S3 encryption in the same region as the VPC
- create gateway endpoint for S3, associated with VPC `datatools` and the default route table (not route table `datatools-external`). The policy should be `Full Access`
- create a bastion host by following https://aws.amazon.com/blogs/security/how-to-record-ssh-sessions-established-through-a-bastion-host/ with
    - existing keypair, else no SSH possible
    - security groups `default` (that allows SSH connections from your client network) and `ElasticMapReduce-Master-Private` (to allow bastion node to connect to EMR master)

For simplicity (at the cost of auditing and security features like 2FA), you can spin up a `t2.micro` EC2 instance with the Amazon Linux 2 AMI as an alternative to a bastion host; it will allow ssh tunneling. If you are using a VPC with only public subnets, and if you trust your client network's public IP to be static or not be spoofed, you can allow all inbound traffic from your client, and thus obviate the need for a Bastion node or equivalent.
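The SSH tunneling mentioned above can be sketched as a local port forward through the bastion (or the plain EC2 instance). The helper below only builds the `ssh` argument list and does not run it. The host name, private IP, key path, port number and the `ec2-user` login are all placeholder assumptions to replace with your own values.

```python
import shlex

def tunnel_command(bastion_host, target_host, target_port,
                   local_port, key_path, user="ec2-user"):
    """Build an ssh command forwarding local_port to target_host:target_port via the bastion."""
    return [
        "ssh",
        "-i", key_path,   # private key of the key pair used for the bastion
        "-N",             # run no remote command, only forward ports
        "-L", f"{local_port}:{target_host}:{target_port}",
        f"{user}@{bastion_host}",
    ]

cmd = tunnel_command(
    bastion_host="bastion.example.com",  # hypothetical bastion address
    target_host="10.0.0.10",             # hypothetical private IP inside the VPC
    target_port=8088,                    # hypothetical port of the UI or API to reach
    local_port=8088,
    key_path="/path/to/datatools.pem",   # hypothetical key file
)
print(shlex.join(cmd))
```

Once such a tunnel is open, the forwarded service would be reachable on `localhost` at the chosen local port.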