Distributed Computing 4 | AWS IAM Configurations, Amazon EMR and Spark Clusters

Series: Distributed Computing

Note: If you run into any problems while following the instructions, my recommendation is to delete what you have created and go through the steps again from the beginning.
  1. AWS IAM Configurations

(1) The Overview of AWS Pipeline

  • Upload the data to AWS S3
  • Manage collaboration and access with AWS IAM
  • Transfer the data to AWS EMR for processing
  • Transfer the results to third-party software like Plotly

(2) The Definition of IAM

AWS Identity and Access Management (IAM) enables you to control access to AWS services by managing AWS users and user groups.
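For readers who prefer working programmatically, the same entities can be inspected with the AWS SDK. The following is a minimal sketch using boto3 (not part of the console walkthrough; it assumes boto3 is installed and that locally configured credentials have IAM read permissions) that lists each IAM user together with the groups it belongs to,

# A minimal boto3 sketch (assumes credentials with IAM read permissions):
# list every IAM user and the user groups it belongs to.
import boto3

iam = boto3.client("iam")
for user in iam.list_users()["Users"]:
    name = user["UserName"]
    groups = [g["GroupName"] for g in iam.list_groups_for_user(UserName=name)["Groups"]]
    print(name, "->", groups)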

(3) Creating IAM

The procedure for creating an IAM user is as follows (an equivalent boto3 sketch appears after this list),

  • Go to AWS.com
  • Sign in to the console
  • Select Root User
  • Enter the email address
  • Enter the password
  • Search for IAM and redirect to the IAM dashboard
  • Select Access management > Users in the left navigation bar (this should be the default page)
  • Select Add users
  • Enter the User name you want to add to your project (e.g. Adam)
  • Check both Access key (which means the IAM user can use the AWS CLI) and Password (which means the IAM user can access the AWS console/website)
  • We can either autogenerate a password for this user or enter a custom one. To keep things simple, we keep the default settings: autogenerate the password and require the IAM user to reset it after the first time they log in.
  • Select Next: Permissions to continue.
  • Select Create group to create a new IAM user group. From this group management, we can manage the accessibility to the resources of this IAM user.
  • Next, we should enter a Group name (e.g. TestingGroup) that will be used to authorize a series of resource accessibility.
  • First, let’s authorize access to S3. Search for S3 among the policies. If we want to allow all operations, we can check AmazonS3FullAccess (which means the IAM users can read and write). We are going to check this for simplicity.
  • Second, we also want to authorize access to EMR. Search for elasticmap and, from the results, check AmazonElasticMapReduceFullAccess, again for simplicity.
  • Click on Create group to continue.
  • Select the group (e.g. TestingGroup) we have created.
  • Click on Next: Tags to continue.
  • The tag section is optional, but it does provide some user-specific management. We can add up to 50 tags to a specific user and then use these tags to organize, track, or control access for this IAM user. We are going to skip this part for simplicity.
  • Click on Next: Review to continue.
  • On the review page, check again that all the settings for this user are correct, because assigning the wrong permissions can cause serious problems. Make sure everything is correct before you click on Create user.
  • Now we have created the user, and we can download the user security credentials as a CSV file. We can also send the sign-in instructions by email. We don’t necessarily need to download the CSV file, though, because we can retrieve the IAM user information later. Let’s see how it works.
  • Go back to the IAM dashboard, select Access management > Users.
  • Click on the name of the new user (e.g. Adam) we have created just now.
  • Select Security Credentials.
  • From this section, we can get the Console sign-in link, which can be used to sign in as an IAM user. We can also reset the autogenerated password in case we did not save the CSV file. Click on Manage next to Console password, then check Set password > Autogenerated password, check the box for Require password reset, and click on Apply.
  • Then click on Show next to Console password. Note that this password should be stored somewhere: once you close this window, if the password is lost, you must create a new one. For example, my autogenerated password is,
)%|vYR-vji5cN!s
  • Then go to the Console sign-in link shown in the Sign-in credentials section,
Go to: https://<Numbers>.signin.aws.amazon.com/console

Note that your current root session will be logged out if you open this sign-in page in the same browser.

  • Keep the Account ID, then enter the IAM user login information,
IAM user name: Adam
Password: )%|vYR-vji5cN!s
  • Click on Sign in to log into this IAM account.
  • Then, because we, as the root user, checked the box for Require password reset during configuration, we have to reset the password. Enter a new password and confirm it.
  • Now, check if we can access EMR and S3.
  • Now, check if we can create a new RDS database. You should not be able to, because this IAM user is not authorized to perform RDS operations.
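As mentioned above, the same setup can be reproduced with boto3. The sketch below is only an illustration (it assumes it runs as a principal allowed to administer IAM, it reuses the example names Adam and TestingGroup, and the temporary password is a placeholder),

# A boto3 sketch of the console walkthrough above (example names; placeholder password).
import boto3

iam = boto3.client("iam")

# Create the group and attach the two managed policies used above.
iam.create_group(GroupName="TestingGroup")
iam.attach_group_policy(GroupName="TestingGroup",
                        PolicyArn="arn:aws:iam::aws:policy/AmazonS3FullAccess")
iam.attach_group_policy(GroupName="TestingGroup",
                        PolicyArn="arn:aws:iam::aws:policy/AmazonElasticMapReduceFullAccess")

# Create the user, give it a console password that must be reset on first login,
# and add it to the group.
iam.create_user(UserName="Adam")
iam.create_login_profile(UserName="Adam", Password="ChangeMe-123!", PasswordResetRequired=True)
iam.add_user_to_group(GroupName="TestingGroup", UserName="Adam")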

(4) Add IAM User to AWS CLI

If you haven’t installed the AWS CLI before, it’s time to install it. On macOS, you can use the following commands to install it,

$ curl "https://awscli.amazonaws.com/AWSCLIV2.pkg" -o "AWSCLIV2.pkg"
$ sudo installer -pkg AWSCLIV2.pkg -target /

Then check if we installed it successfully,

$ aws --version

Then, let’s add the IAM user’s credentials to the AWS CLI (a quick programmatic check is sketched at the end of this subsection).

  • First, go back to the root user’s IAM dashboard
  • Find the IAM user we have created (e.g. Adam), select Security credentials
  • From the Access keys section, click on Create access key. Click on Show to reveal the Secret access key, and record both keys (i.e. the Access key ID and the Secret access key) somewhere safe
  • Second, type in the following command,
$ aws configure
  • Then enter the Access key ID and Secret access key
AWS Access Key ID [****************K3MZ]: ...
AWS Secret Access Key [****************o8yJ]: ...
Default region name [us-east-1]:
Default output format [None]:
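
To confirm that the stored credentials are picked up correctly, here is a quick check (a sketch; it assumes boto3 is installed and reads the same credentials that aws configure just wrote),

# Sanity-check the configured credentials: print the caller's ARN and the
# buckets visible to this IAM user (assumes boto3 reads ~/.aws/credentials).
import boto3

print(boto3.client("sts").get_caller_identity()["Arn"])
print([b["Name"] for b in boto3.client("s3").list_buckets()["Buckets"]])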

(5) Log in as Root User by SSH

Now that we have discussed the AWS CLI, let’s discuss SSH access. For SSH login, we have to generate a PEM file. This kind of SSH login is extremely useful when we need to open a shell on an EC2 instance. Now, let’s see how it works (a boto3 alternative is sketched after the list below).

  • Log in as the root user
  • Let’s first search for EC2.
  • Select Network & Security > Key Pairs in the left navigation bar.
  • Click on Create key pair
  • Enter a name (e.g. Adam) and keep the other settings.
  • Click on Create key pair to download a PEM file
  • Finally, let’s store it somewhere so that we can use it for logging into the EC2 service in the future.
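As mentioned above, the key pair can also be created programmatically. Here is a hedged boto3 sketch (the key name matches the example above; the region is an assumption) that saves the private key as a PEM file,

# Create an EC2 key pair with boto3 and save the private key locally
# (example key name; the region is an assumption).
import os
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")
key = ec2.create_key_pair(KeyName="Adam")

with open("Adam.pem", "w") as f:
    f.write(key["KeyMaterial"])
os.chmod("Adam.pem", 0o400)  # SSH requires the private key to be non-world-readable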

(6) Authorizing an S3 Bucket to an IAM User

  • Go to the IAM dashboard, select Users
  • Click on the username (e.g. Adam) we have created
  • Copy the Summary > User ARN information
  • Then, redirect to the S3 dashboard
  • Click on the name of the non-public bucket we would like to grant this IAM user access to
  • Select Permissions
  • Find Bucket policy and then click on Edit
  • Click on Policy generator
  • Select S3 Bucket Policy
  • Add the IAM user’s User ARN as the Principal
  • Select GetObject in the actions to allow read-only access
  • The Amazon Resource Name (ARN) should have the following pattern,
arn:aws:s3:::${BucketName}/${KeyName}

where ${BucketName} is the bucket we want to authorize and ${KeyName} is the set of files we want to allow access to. If we want to allow access to all the files, we can put a * sign here. An example of applying such a policy with boto3 is sketched after this list.

  • Click on Add Statement to add this record
  • Then click Generate Policy to generate this policy
  • Copy the JSON document generated
  • Paste it into the Bucket policy editor
  • Click on Save changes to save this permission
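For reference, the generated policy is a plain JSON document, and it can also be applied without the console editor. Below is a hedged boto3 sketch (the bucket name and account ID are hypothetical placeholders; the Principal should be the User ARN copied from the IAM dashboard),

# Apply a read-only bucket policy with boto3 (hypothetical bucket name and
# account ID; replace the Principal with the IAM user's actual User ARN).
import json
import boto3

policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"AWS": "arn:aws:iam::123456789012:user/Adam"},
        "Action": "s3:GetObject",
        "Resource": "arn:aws:s3:::my-example-bucket/*",
    }],
}

boto3.client("s3").put_bucket_policy(Bucket="my-example-bucket",
                                     Policy=json.dumps(policy))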

2. Amazon EMR and Spark Clusters

(1) Spark Cluster

So far we have basically been running our programs in local mode, and we have to run them in cluster mode if we want higher performance on larger-scale data. As we have mentioned before, a Spark cluster is a set of interconnected processes running in a distributed manner on different machines. Basically, there are three types of Spark cluster managers (a short sketch of how the cluster manager is selected follows this list),

  • Spark standalone
  • Hadoop YARN
  • Apache Mesos
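As a small illustration (a sketch, not tied to any particular cluster), the cluster manager is selected through the master URL when the Spark session is created; the application code itself stays the same,

# The master URL chooses between local mode and a cluster manager:
# "local[*]" runs everything on one machine, "yarn" submits to a Hadoop YARN
# cluster (which is what EMR provides).
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("testing")
         .master("local[*]")  # replace with "yarn" when running on a cluster
         .getOrCreate())
print(spark.sparkContext.master)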

(2) Components of the Spark Runtime System

In the Spark architecture, we have three major components,

  • Client: the client is the physical or virtual machine from which we start the driver program via spark-submit, pyspark, the spark-shell scripts, etc.
  • Driver: the driver is the physical or virtual machine that orchestrates and monitors all the executors of a Spark application. The driver is in charge of checking the resources, distributing the jobs/tasks to executors, receiving the computed outputs from the executors, and sending them back to the client. Note that we have only one driver per Spark application.
  • Executor: executors are the execution units that run Spark tasks with a configurable number of cores. Each executor stores and caches data partitions only in its own memory.

(3) The Definition of Elastic MapReduce (EMR)

EMR (Elastic MapReduce) is a connected structure of EC2 instances, which lets us leverage a dynamically scalable EC2 network through this service. Because we are going to use AWS EMR for our Spark clusters, we are going to use Hadoop’s YARN (Yet Another Resource Negotiator) as the cluster manager. It can handle HDFS, while Spark standalone cannot.

(4) Creating an EMR Cluster

  • First, let’s log in as the IAM user and open the EMR dashboard
  • From the left navigation bar, select Clusters
  • Click on Create cluster
  • Select Go to advanced options
  • Select Release version as emr-6.3.0
  • Check Hadoop 3.2.1 and Spark 3.1.1 to support the cluster
  • Check JupyterEnterpriseGateway 2.1.0 to enable the Jupyter notebook service
  • Check Livy 0.7.0, which is the API we need for the Spark cluster
  • We can also change the After last step completes setting to Cluster auto-terminates to save resources, but we will just leave the default here for simplicity
  • Click on Next to continue
  • On this page, we can check the cluster nodes we are going to have. Currently, we have three types of nodes: Master (the driver instance that orchestrates the jobs; note that we can only have 1 master instance), Core (a kind of executor instance for data processing and storage), and Task (another kind of executor instance for data processing only). Depending on the data, we can also change the machine type and the instance count. For simplicity, we just leave the default settings here.
  • Click on Next to continue
  • On this page, we can rename the cluster by entering a new Cluster name. We can also change some other general settings but we will just leave them there.
  • Click on Next to continue
  • On this page, we have to select the EC2 key pair we have set (e.g. Adam)
  • Click on Create cluster to create the instance
  • Then, from the Clusters tab, we can find the cluster named My cluster that we have created (an equivalent programmatic request with boto3 is sketched below)
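The same cluster can also be requested programmatically. The following is a hedged boto3 sketch rather than a transcript of the console steps above (the instance types and the two role names are common EMR defaults, which are assumptions here),

# Request an EMR cluster with boto3 (instance types and IAM role names are
# assumed defaults; the key pair name matches the example above).
import boto3

emr = boto3.client("emr", region_name="us-east-1")
response = emr.run_job_flow(
    Name="My cluster",
    ReleaseLabel="emr-6.3.0",
    Applications=[{"Name": "Hadoop"}, {"Name": "Spark"},
                  {"Name": "JupyterEnterpriseGateway"}, {"Name": "Livy"}],
    Instances={
        "InstanceGroups": [
            {"Name": "Master", "InstanceRole": "MASTER",
             "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"Name": "Core", "InstanceRole": "CORE",
             "InstanceType": "m5.xlarge", "InstanceCount": 2},
        ],
        "Ec2KeyName": "Adam",
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print(response["JobFlowId"])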

(5) Connecting to the EMR Cluster Using a Notebook

  • Select Notebooks in the left navigation bar
  • Click on Create notebook
  • Enter the Notebook name (e.g. MyNotebook)
  • Select Choose an existing cluster we have started for Cluster (e.g. My cluster)
  • Select Use the default S3 location
  • Click on Create notebook to generate a notebook
  • When the notebook is no longer Pending (i.e. Ready), click on Open in JupyterLab
  • Select PySpark Notebook

(6) The Definition of Sparkmagic

Sparkmagic is a set of tools for interactively working with remote Spark clusters through Livy, a Spark REST server, in Jupyter notebooks. The Sparkmagic project includes a set of magics for interactively running Spark code in multiple languages, as well as some kernels that you can use to turn Jupyter into an integrated Spark environment.

(7) Basic Sparkmagic Operations

Now that we have opened a Jupyter notebook with Sparkmagic, let’s interact with it.

  • Open the user manual
%%help
  • Outputs session information for the current Livy endpoint.
%%info
  • Outputs the current session’s Livy logs.
%%logs
  • Change the executor memory size to 1000M
%%configure -f
{"executorMemory": "1000M"}
  • Change the number of cores used for an executor
%%configure -f
{"executorCores": 4}
  • Execute the code in subsequent lines locally
%%local
a = 1
print(a)
  • The result of the SQL query will be available in the %%local Python context as a Pandas dataframe named VAR_NAME
%%sql -o VAR_NAME

(8) Start with Sparkmagic

Now, let’s actually configure and work with the cluster through Sparkmagic.

  • Start a Spark application with the following configurations
%%configure -f
{
    "conf": {
        "spark.pyspark.python": "python3",
        "spark.pyspark.virtualenv.enabled": "true",
        "spark.pyspark.virtualenv.type": "native",
        "spark.pyspark.virtualenv.bin.path": "/usr/bin/virtualenv",

        "spark.executor.heartbeatInterval": "10800s",
        "spark.network.timeout": "24h",

        "spark.driver.memory": "1G",
        "spark.executor.memory": "1G",
        "spark.executor.cores": "2",
        "spark.app.name": "testing"
    }
}
  • Creates SparkContext
sc
  • Print out the session information
%%info
  • Install extra packages (e.g. Plotly) on the cluster
sc.install_pypi_package("plotly")
  • Check all the packages installed
sc.list_packages()
  • Create an RDD from a file in our S3 bucket
rdd = sc.textFile("s3://${Bucket Name}/${File Name}")
  • Try to print the first line of this RDD
rdd.take(1)
  • Let’s download the supervisor_sf.tsv file from this link
  • Upload this file to our S3 bucket
  • Then, create an RDD with the content of this file
rdd = sc.textFile("s3://${Bucket Name}/supervisor_sf.tsv")
  • Transform the data with RDD transformations,
zip_id = rdd.map(lambda x : x.split("\t")).map(lambda x : (int(x[0]), int(x[1])))
zip_id_count = zip_id.groupByKey().map(lambda x : (x[0], len(x[1]))).sortByKey()
  • Register the DataFrame as a SQL temporary view named zip_id_count_view
zip_id_count_df = zip_id_count.toDF()
zip_id_count_df.createOrReplaceTempView("zip_id_count_view")
  • View the data from the SQL temporary view zip_id_count_view, and store it as a pandas DataFrame named zip_id_count_tab in the %%local Python context
%%sql -o zip_id_count_tab
SELECT * FROM zip_id_count_view
  • Use Plotly and zip_id_count_tab for visualization; execute the code in the %%local context
%%local
import plotly
import plotly.graph_objects as go
x = zip_id_count_tab['_1']
y = zip_id_count_tab['_2']
# Use the hovertext kw argument for hover text
fig = go.Figure(data=[go.Bar(x=x, y=y,
                             marker_color='lightsalmon',
                             hovertext=x)])
fig.update_layout(xaxis=dict(tickformat="digit"))
fig.show()
  • Stop the SparkContext
sc.stop()