Set up VPC etc. in AWS Account for DataPull install
The instructions for deploying DataPull on AWS Fargate and AWS EMR assume that you already have an S3 bucket, a VPC, subnets, etc. If you do not have these, or if you want to install DataPull in a new VPC dedicated to DataPull, follow the instructions below.
It is generally recommended to use your existing VPC, subnets, etc., since they are most likely already set up to access the data you want DataPull to work on and to reach other services like S3.
When creating a VPC for DataPull, you need at least two subnets in different availability zones, since AWS Application Load Balancer requires a minimum of two availability zones. You can create a VPC with two public subnets (approach modeled on https://docs.aws.amazon.com/vpc/latest/userguide/VPC_Scenario1.html) or a VPC with two private subnets and a public subnet (modeled on https://docs.aws.amazon.com/vpc/latest/userguide/VPC_Scenario2.html)
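Before creating anything, it can help to sanity-check a subnet plan against the constraints above. The following is a minimal sketch using only Python's standard `ipaddress` module; the helper name `check_subnet_plan` and the availability zone names are illustrative assumptions, while the CIDR ranges and subnet names are the ones used in the two-public-subnet layout described in this document.

```python
import ipaddress


def check_subnet_plan(vpc_cidrs, subnets):
    """Return a list of problems found in a subnet plan (empty list means none found).

    vpc_cidrs: list of CIDR strings assigned to the VPC.
    subnets:   list of (name, cidr, availability_zone) tuples.
    """
    problems = []
    vpc_nets = [ipaddress.ip_network(c) for c in vpc_cidrs]
    parsed = [(name, ipaddress.ip_network(cidr), az) for name, cidr, az in subnets]

    # Every subnet must sit inside one of the VPC's ranges.
    for name, net, _ in parsed:
        if not any(net.subnet_of(v) for v in vpc_nets):
            problems.append(f"{name} ({net}) is outside the VPC ranges")

    # Subnets must not overlap each other.
    for i, (name_a, net_a, _) in enumerate(parsed):
        for name_b, net_b, _ in parsed[i + 1:]:
            if net_a.overlaps(net_b):
                problems.append(f"{name_a} overlaps {name_b}")

    # The Application Load Balancer needs subnets in at least two zones.
    if len({az for _, _, az in parsed}) < 2:
        problems.append("subnets span fewer than two availability zones")
    return problems


# Two-public-subnet layout; the zone names are placeholders for your region.
public_plan = [
    ("datatools-external-1", "10.0.0.0/25", "us-east-1a"),
    ("datatools-external-2", "10.0.0.128/25", "us-east-1b"),
]
print(check_subnet_plan(["10.0.0.0/24"], public_plan))  # prints []
```

The same helper can be called with both VPC ranges and all three subnets of the private-subnet layout.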
VPC with two public subnets
- create VPC datatools with IPv4 range 10.0.0.0/24
  - this creates a security group named default for this VPC
    - add a rule to security group default to allow inbound traffic to tcp port 22 (SSH) from your client
      - This assumes that you will have a Bastion node or equivalent to access DataPull's API and Spark UI. If you trust your client network's public IP to be static or not be spoofed, you can allow all inbound traffic from your client, and thus obviate the need for a Bastion node or equivalent
  - this also creates a main route table with no name
- create two subnets in the VPC, each one in a different availability zone in your region.
  - subnet datatools-external-1 with IPv4 range 10.0.0.0/25
  - subnet datatools-external-2 with IPv4 range 10.0.0.128/25
- create internet gateway datatools for VPC datatools
- create route for destination 0.0.0.0/0 to internet gateway datatools in the default (main) route table for the VPC
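The console steps for the two-public-subnet VPC can also be scripted. Below is a hedged sketch, not an official DataPull script: the function name `create_public_vpc` is hypothetical, `ec2` is assumed to be a boto3 EC2 client (for example `boto3.client("ec2")`), and the two availability zone names are parameters you must supply for your region. The SSH rule on the default security group is left out to keep the sketch short.

```python
def name_tag(resource_type, name):
    """Build a TagSpecifications value that sets the Name tag."""
    return [{"ResourceType": resource_type, "Tags": [{"Key": "Name", "Value": name}]}]


def create_public_vpc(ec2, az_1, az_2):
    """Create VPC datatools with two public subnets and return the new IDs."""
    vpc_id = ec2.create_vpc(
        CidrBlock="10.0.0.0/24",
        TagSpecifications=name_tag("vpc", "datatools"),
    )["Vpc"]["VpcId"]

    # One subnet per availability zone, each taking half of the VPC range.
    subnet_ids = []
    for name, cidr, az in [
        ("datatools-external-1", "10.0.0.0/25", az_1),
        ("datatools-external-2", "10.0.0.128/25", az_2),
    ]:
        subnet = ec2.create_subnet(
            VpcId=vpc_id,
            CidrBlock=cidr,
            AvailabilityZone=az,
            TagSpecifications=name_tag("subnet", name),
        )
        subnet_ids.append(subnet["Subnet"]["SubnetId"])

    # Internet gateway datatools, attached to the VPC.
    igw_id = ec2.create_internet_gateway(
        TagSpecifications=name_tag("internet-gateway", "datatools"),
    )["InternetGateway"]["InternetGatewayId"]
    ec2.attach_internet_gateway(InternetGatewayId=igw_id, VpcId=vpc_id)

    # The unnamed main route table is created together with the VPC.
    main_rt_id = ec2.describe_route_tables(
        Filters=[
            {"Name": "vpc-id", "Values": [vpc_id]},
            {"Name": "association.main", "Values": ["true"]},
        ]
    )["RouteTables"][0]["RouteTableId"]
    ec2.create_route(
        RouteTableId=main_rt_id,
        DestinationCidrBlock="0.0.0.0/0",
        GatewayId=igw_id,
    )
    return {"vpc_id": vpc_id, "subnet_ids": subnet_ids, "igw_id": igw_id}
```

A call might look like `create_public_vpc(boto3.client("ec2"), "us-east-1a", "us-east-1b")`, with zone names adjusted to your region.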
VPC with two private subnets and a public subnet
- create VPC datatools with 2 IPv4 ranges 10.0.0.0/24, 10.1.0.0/24
  - this creates a security group named default for this VPC
    - add a rule to security group default to allow inbound traffic to tcp port 22 (SSH) from your client network (or from your public IP as a last resort)
  - this also creates a main route table with no name
- create three subnets in the VPC, each one in a different availability zone in your region.
  - subnet datatools-internal-1 with IPv4 range 10.0.0.0/25
  - subnet datatools-internal-2 with IPv4 range 10.0.0.128/25
  - subnet datatools-external-1 with IPv4 range 10.1.0.0/24
- create internet gateway datatools for VPC datatools
- create route table datatools-external in VPC datatools
- create route for destination 0.0.0.0/0 to internet gateway datatools for route table datatools-external
- associate subnet datatools-external-1 to route table datatools-external
  - subnets datatools-internal-[1-2] will remain associated to the main/default route table (not route table datatools-external) for the VPC datatools
- create NAT gateway in the subnet datatools-external-1 and with a new/existing elastic IP.
  - add an entry to the default route table with destination as 0.0.0.0/0 and Target as the NAT gateway
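The routing steps for the private-subnet layout can be sketched in code as well. This is a minimal illustration under stated assumptions: `route_private_vpc` is a hypothetical function name, `ec2` is assumed to be a boto3 EC2 client, the VPC, internet gateway and external subnet are assumed to exist already with their IDs passed in, and a new Elastic IP is always allocated (the steps above also allow reusing an existing one).

```python
def route_private_vpc(ec2, vpc_id, igw_id, external_subnet_id):
    """Send the external subnet through the internet gateway and the
    main route table (used by the internal subnets) through a NAT gateway."""
    # Route table datatools-external, used only by the public subnet.
    external_rt_id = ec2.create_route_table(
        VpcId=vpc_id,
        TagSpecifications=[{
            "ResourceType": "route-table",
            "Tags": [{"Key": "Name", "Value": "datatools-external"}],
        }],
    )["RouteTable"]["RouteTableId"]
    ec2.create_route(
        RouteTableId=external_rt_id,
        DestinationCidrBlock="0.0.0.0/0",
        GatewayId=igw_id,
    )
    ec2.associate_route_table(RouteTableId=external_rt_id, SubnetId=external_subnet_id)

    # NAT gateway in the public subnet, backed by a new Elastic IP.
    allocation_id = ec2.allocate_address(Domain="vpc")["AllocationId"]
    nat_id = ec2.create_nat_gateway(
        SubnetId=external_subnet_id,
        AllocationId=allocation_id,
    )["NatGateway"]["NatGatewayId"]
    # Wait so the route below does not point at a gateway that is still pending.
    ec2.get_waiter("nat_gateway_available").wait(NatGatewayIds=[nat_id])

    # The internal subnets stay on the main route table; send their
    # outbound traffic to the NAT gateway.
    main_rt_id = ec2.describe_route_tables(
        Filters=[
            {"Name": "vpc-id", "Values": [vpc_id]},
            {"Name": "association.main", "Values": ["true"]},
        ]
    )["RouteTables"][0]["RouteTableId"]
    ec2.create_route(
        RouteTableId=main_rt_id,
        DestinationCidrBlock="0.0.0.0/0",
        NatGatewayId=nat_id,
    )
    return {"external_route_table_id": external_rt_id, "nat_gateway_id": nat_id}
```

Keeping the NAT route on the main route table means any subnet that is not explicitly associated with datatools-external is treated as private by default.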
Additional Steps (common to both VPC types above)
- create S3 bucket datatools-datapull with SSE-S3 encryption in the same region as the VPC
- create gateway endpoint for S3, associated with VPC datatools and the default route table (not route table datatools-external). The policy should be Full Access
- create a bastion host by following https://aws.amazon.com/blogs/security/how-to-record-ssh-sessions-established-through-a-bastion-host/ with
  - existing keypair, else no SSH possible
  - security groups default (that allows SSH connections from your client network) and ElasticMapReduce-Master-Private (to allow bastion node to connect to EMR master)

For simplicity (at the cost of auditing and security features like 2FA), you can spin up a t2.micro EC2 instance with the Amazon Linux 2 AMI as an alternative to a bastion host; it will allow SSH tunneling. If you are using a VPC with only public subnets, and if you trust your client network's public IP to be static or not be spoofed, you can allow all inbound traffic from your client, and thus obviate the need for a Bastion node or equivalent.
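The S3 bucket and gateway endpoint steps above can be sketched as follows. This is an illustration under assumptions rather than a definitive script: `create_bucket_and_s3_endpoint` is a hypothetical name, `s3` and `ec2` are assumed to be boto3 S3 and EC2 clients for the same region as the VPC, and the bucket name datatools-datapull may need changing because S3 bucket names are globally unique. No endpoint policy is passed, which leaves the default policy that allows full access.

```python
def create_bucket_and_s3_endpoint(s3, ec2, region, vpc_id, bucket="datatools-datapull"):
    """Create the SSE-S3 encrypted bucket and an S3 gateway endpoint on the
    VPC's main route table. Returns the endpoint ID."""
    # us-east-1 is the default location and must not be given as a LocationConstraint.
    if region == "us-east-1":
        s3.create_bucket(Bucket=bucket)
    else:
        s3.create_bucket(
            Bucket=bucket,
            CreateBucketConfiguration={"LocationConstraint": region},
        )

    # SSE-S3 corresponds to the AES256 algorithm in the S3 API.
    s3.put_bucket_encryption(
        Bucket=bucket,
        ServerSideEncryptionConfiguration={
            "Rules": [{"ApplyServerSideEncryptionByDefault": {"SSEAlgorithm": "AES256"}}]
        },
    )

    # Attach the endpoint to the main route table, not datatools-external.
    main_rt_id = ec2.describe_route_tables(
        Filters=[
            {"Name": "vpc-id", "Values": [vpc_id]},
            {"Name": "association.main", "Values": ["true"]},
        ]
    )["RouteTables"][0]["RouteTableId"]
    endpoint = ec2.create_vpc_endpoint(
        VpcEndpointType="Gateway",
        VpcId=vpc_id,
        ServiceName=f"com.amazonaws.{region}.s3",
        RouteTableIds=[main_rt_id],
    )
    return endpoint["VpcEndpoint"]["VpcEndpointId"]
```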