
Series: Distributed Computing

Distributed Computing 1 | Environment Setup and Configuration, Spark Installation, and Autogenerate PEP8 Pattern

1. Environment Setup

(1) Create an Environment

$ conda create --name DistributedComputing python=3 -y

(2) Activating the Environment

$ conda activate DistributedComputing
$ conda info --envs
...
DistributedComputing * ...
...

(3) Deactivating the Environment

$ conda deactivate
$ conda info --envs
...
base * ...
DistributedComputing ...
...

(4) Export the Environment

$ conda activate DistributedComputing
$ conda env export > DistributedComputing_environment.yml
$ ls *.yml

(5) Remove an Environment

$ conda deactivate
$ conda remove --name DistributedComputing --all
...
Proceed ([y]/n)? y
...
$ conda info --envs
...
base *

(6) Create an Environment by YML File

$ conda env create -f DistributedComputing_environment.yml -n DistributedComputing

(7) Update an Environment by YML File

$ conda env update -f DistributedComputing_environment.yml -n DistributedComputing

2. Basic Concepts

(1) The Definition of Big Data

Big data is data whose size and structure are beyond what traditional data-processing application software can adequately handle.

(2) The Definition of Distributed Computing

To process large volumes of data quickly, we have to scale out rather than scale up.

  • Scaling up: using a single, more powerful machine, such as the high-performance computing (HPC) systems developed by IBM. This is fast but expensive, and a single machine is a single point of failure.
  • Scaling out: distributing the data across many ordinary computers. Performance improves as we keep adding machines. This approach is cheaper and more reliable, but individual machines are slower and transferring data between them takes time.

This means the data processing system should have the following features:

  • Cheap: able to process large datasets on clusters of many smaller, cheaper machines
  • Reliable and fault tolerant: if one node or process fails, its workload should be taken over by other components in the system
  • Fast: it parallelizes and distributes computations (see the sketch after this list)
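
As a tiny illustration of the "fast" point, the sketch below uses PySpark (installed in Section 3 below) to split a toy dataset into partitions and process them in parallel on local CPU cores; on a real cluster the same code would spread the partitions across machines. The app name, the data, and the four-partition split are arbitrary choices for illustration.

from pyspark.sql import SparkSession

# Start a local Spark session; "local[*]" uses all available CPU cores.
spark = SparkSession.builder.master("local[*]").appName("ScaleOutSketch").getOrCreate()
sc = spark.sparkContext

# Distribute a toy dataset across 4 partitions; on a cluster these
# partitions would live on different machines.
rdd = sc.parallelize(range(1_000_000), numSlices=4)

# Each partition is squared and summed in parallel; Spark then combines
# the partial results on the driver.
print(rdd.map(lambda x: x * x).sum())

spark.stop()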

3. Spark Installation

(1) The History of Spark

Apache Spark started as a research project at the UC Berkeley AMPLab in 2009 and was open-sourced in early 2010. Many of the ideas behind the system were presented in various research papers over the years. Spark runs on:

  • Java 8+
  • Python 2.7+/3.4+
  • R 3.1+

(2) Update Homebrew

$ brew update

(3) Install the Java Development Kit (JDK), Version 8+

$ brew tap adoptopenjdk/openjdk
$ brew search java
$ brew install adoptopenjdk/openjdk/adoptopenjdk8 --cask

(4) Install Jupyter in the New Environment

$ pip3 install jupyter

(5) Install Spark

$ pip install pyspark
$ pyspark
...
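
Besides the pyspark shell, the installation can also be checked from a Python script or notebook. The following is a minimal sketch, assuming nothing beyond the pyspark package installed above; the app name and sample rows are arbitrary.

from pyspark.sql import SparkSession

# Create (or reuse) a local Spark session and print its version.
spark = SparkSession.builder.master("local[*]").appName("InstallCheck").getOrCreate()
print(spark.version)

# Build a tiny DataFrame and run a simple query to confirm Spark works end to end.
df = spark.createDataFrame([("alice", 1), ("bob", 2)], ["name", "value"])
df.filter(df.value > 1).show()

spark.stop()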

(6) Open the Jupyter Notebook

$ jupyter notebook 

(7) Install the Autopep8 Nbextension

$ pip install jupyter_contrib_nbextensions
$ jupyter contrib nbextension install --user
$ pip install jupyter_nbextensions_configurator
$ jupyter nbextensions_configurator enable --user
$ pip install autopep8
$ jupyter notebook

Then, inside Jupyter Notebook, go to the Nbextensions tab and check Autopep8. Open a notebook file and click the hammer icon to automatically reformat the code to the PEP 8 style.
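
For a sense of what the hammer icon does under the hood, the autopep8 package installed above can also be called directly. The snippet below is a small sketch with a deliberately badly formatted function made up for illustration.

import autopep8

# Deliberately badly formatted source: extra spaces inside the parentheses
# and a missing space after the comma.
messy = "def add( a,b ):\n    return a + b\n"

# fix_code() returns a PEP 8-compliant version of the source string,
# the same kind of transformation the Autopep8 hammer applies to notebook cells.
print(autopep8.fix_code(messy))
# Expected output (roughly):
# def add(a, b):
#     return a + b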