What is covered?
- Installing Python on the Windows Subsystem for Linux (Ubuntu 16.04)
- Installing Apache Spark on the Windows Subsystem for Linux (Ubuntu 16.04)
- Master/slave configuration
- Start the Spark cluster
- Stop the Spark cluster
- Test the cluster setup
Windows 10 offers the Windows Subsystem for Linux (WSL), which lets you run a Linux distribution alongside Windows. To install WSL you can follow the tutorial here. You can download Apache Spark from here. I am using Ubuntu 16.04 (highly recommended, as it is the most stable version and I did not run into any compatibility issues), which comes with Python 3.5.2; you can check this with the following command.
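A standard way to check the version, assuming the default python3 link is present, is:

```shell
python3 --version
```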
By default Spark uses the system's Python 2; however, for distributed deep learning development I prefer Python 3.6.x (because of compatibility issues with other libraries). You can choose any Python version you want, but we then need to install that version in the subsystem and link it with Spark. Make sure that all nodes (master and slaves) run the same Python version, or you will get errors.
You can follow the steps below to install Python in your Ubuntu 16.04 Windows subsystem.
sudo apt-get update
sudo apt-get install libreadline-gplv2-dev libncursesw5-dev libssl-dev libsqlite3-dev tk-dev libgdbm-dev libc6-dev libbz2-dev build-essential zlib1g-dev
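The commands above install only the build dependencies, not the Python source itself. Assuming the standard python.org download location for the 3.6.9 release, the tarball extracted below can be fetched with:

```shell
wget https://www.python.org/ftp/python/3.6.9/Python-3.6.9.tgz
```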
tar -xvf Python-3.6.9.tgz
sudo rm Python-3.6.9.tgz
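Before the final install step, the extracted source tree has to be configured and compiled. The usual upstream build steps, run from the extracted directory, are shown below (this is a minimal configuration; flags such as --enable-optimizations are optional):

```shell
cd Python-3.6.9
./configure
make
```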
If there were no errors, run the following command to complete the installation of Python 3.6.9:
sudo make install
To test the installation, run the following command; it should report the Python version as 3.6.9.
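Assuming the build installed to the default prefix, the version check looks like this:

```shell
python3.6 -V
```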
Installing Apache Spark
The following setup must be performed on the master node and on every slave node.
Download Java from here
Move the Java tar file to the /usr/ directory.
sudo mv /YOUR_DOWNLOAD_PATH /usr/
Navigate to the /usr/ directory.
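The corresponding command is:

```shell
cd /usr/
```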
Extract the Java tar file.
sudo tar zxvf jdk-8u291-linux-aarch64.tar.gz
Rename the extracted folder to java.
sudo mv jdk-8u291-linux-aarch64 java
Remove the Java tar file.
sudo rm jdk-8u291-linux-aarch64.tar.gz
Add the Java path in the /etc/profile file as follows:
sudo nano /etc/profile
Use the arrow keys to navigate to the end of the file and add the following lines:
JAVA_HOME=/usr/java
PATH=$PATH:$HOME/bin:$JAVA_HOME/bin
export JAVA_HOME
export PATH
Press Control + X, then Y, then Enter to save the file. To reload the environment path, execute the following command:
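The standard way to reload /etc/profile in the current shell is:

```shell
source /etc/profile
```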
To test the installation, run the following command; it should show the Java version:
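The usual check is:

```shell
java -version
```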
Install Scala:
sudo apt-get install scala
To check if scala is installed, run the following command:
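The common check is:

```shell
scala -version
```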
Download, extract, and install Spark:
sudo wget https://apachemirror.wuchna.com/spark/spark-3.1.1/spark-3.1.1-bin-hadoop2.7.tgz
sudo tar zxvf spark-3.1.1-bin-hadoop2.7.tgz
sudo mv spark-3.1.1-bin-hadoop2.7 /usr/spark/
sudo chmod -R a+rwX /usr/spark
Set up the environment for Spark.
sudo nano /etc/profile
Edit the last line (export PATH) as follows (note that the shell does not allow spaces around the equals sign):
export PATH=$PATH:/usr/spark/bin
To test the installed Spark, execute the following command:
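One simple check, assuming /usr/spark/bin is now on your PATH, is to launch the interactive shell, which prints the Spark version banner on startup (type :quit to exit):

```shell
spark-shell
```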
The whole Spark installation procedure must be carried out on the master as well as on every slave.
This completes our base installation. Now we can move to configure master and slave nodes.
Add the IP addresses of the master and slave nodes to the hosts file on every node (master and slaves).
sudo nano /etc/hosts
Now add entries for the master and slaves to the hosts file. The names (masterslave1 and masterslave2) can be anything; they are chosen only for easy recall. You can add as many IP addresses as you like.
<IP-Address1> masterslave1
<IP-Address2> masterslave2
...
Spark master configuration
Edit the Spark environment file. Move to the Spark conf folder, create a copy of the spark-env.sh template, and rename it:
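Given the install location used above, that folder is:

```shell
cd /usr/spark/conf
```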
sudo cp spark-env.sh.template spark-env.sh
sudo nano spark-env.sh
Add the following lines at the end of the file and save it:
export SPARK_MASTER_HOST=<MASTER-IP>
export PYTHONPATH=/usr/local/lib/python3.6:$PYTHONPATH
export PYSPARK_DRIVER_PYTHON=python3.6
export PYSPARK_PYTHON=python3.6
Add worker nodes on the master node
sudo cp workers.template workers
sudo nano workers
Remove the localhost entry and add the slave nodes, one per line. You can add as many slave nodes as you like. You can also run master and slave on a single machine by entering the master's IP address as a slave.
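Using the hostnames added to /etc/hosts earlier, the workers file might look like this:

```
masterslave1
masterslave2
```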
You can also verify the connectivity of the slave nodes from the master node using the ssh command as follows:
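For example, using one of the hostnames from /etc/hosts (substitute your own slave name or IP):

```shell
ssh masterslave2
```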
It will ask for the slave node's password; once entered, you will be logged in.
If you receive an error while connecting to the slave node (ssh: connect to host masterslave port 22: Connection refused), reinstall the ssh server as follows:
sudo apt remove openssh-server
sudo apt install openssh-server
sudo service ssh start
Now run the ssh command again to log in to the slave node. Once that works, we can move on to starting the cluster.
Start the Spark cluster
To start the Spark cluster, run the following command on the master:
sudo bash /usr/spark/sbin/start-all.sh
It will ask for each slave's password; after that, your Spark cluster should be up and running.
If you receive the error Permission denied (publickey), do the following and then run the above command again.
sudo nano /etc/ssh/sshd_config
Update the following values:
Change PermitRootLogin prohibit-password to PermitRootLogin yes
Change PasswordAuthentication no to PasswordAuthentication yes
Restart the ssh service:
sudo service ssh restart
After this, your error will be resolved and you can run start-all.sh command again.
You can view the running spark cluster in this URL: http://<Master_IP>:8080/
Stop the Spark cluster
To stop the Spark cluster, run the following command on the master:
sudo bash /usr/spark/sbin/stop-all.sh
Test the cluster setup
Below is a sample Python program to run on the Spark cluster. Update MASTER_IP and save the file as test.py.
from pyspark import SparkContext, SparkConf

master_url = "spark://<MASTER_IP>:7077"

conf = SparkConf()
conf.setAppName("Hello Spark")
conf.setMaster(master_url)
sc = SparkContext(conf=conf)

nums = range(10000)  # renamed from `list` so the built-in is not shadowed
rdd = sc.parallelize(nums, 4)
even = rdd.filter(lambda x: x % 2 == 0)
print(even.take(5))
Once your cluster is up, you can run the command below to execute the test.py file.
sudo bash /usr/spark/bin/spark-submit --master spark://<MASTER_IP>:7077 test.py
You will see a bunch of INFO statements in your WSL terminal along with the output of your program.
You can also view the status of your job at the master URL: http://<Master_IP>:8080/