What is covered?
- Installing Python in the Windows subsystem (Ubuntu 16.04)
- Installing Apache Spark in the Windows subsystem (Ubuntu 16.04)
- Master/Slave configuration
- Start Spark cluster
- Stop Spark cluster
- Test the cluster setup
Windows 10 includes the Windows Subsystem for Linux (WSL), which lets you run a Linux distribution inside Windows. To install it, you can follow the tutorial here. You can download Apache Spark from here. I am using Ubuntu 16.04 (highly recommended: it was the most stable version for me and I did not find any compatibility issues), which comes with Python 3.5.2. You can check the version with the following command:
python3 --version
Spark uses the system's default Python interpreter unless told otherwise. For distributed deep learning development I prefer Python 3.6.x because of compatibility issues with other libraries, but you can choose any Python version that your Spark release supports. So we also need to install the required Python version in the subsystem and link it with Spark. Make sure that every node (master and slaves) runs the same version of Python, or you will get errors.
Follow the steps below to install Python in your Ubuntu 16.04 subsystem.
sudo apt-get update
sudo apt-get install libreadline-gplv2-dev libncursesw5-dev libssl-dev libsqlite3-dev tk-dev libgdbm-dev libc6-dev libbz2-dev build-essential zlib1g-dev
wget https://www.python.org/ftp/python/3.6.9/Python-3.6.9.tgz
tar -xvf Python-3.6.9.tgz
sudo rm Python-3.6.9.tgz
cd Python-3.6.9
./configure
If there are no errors, run the following commands to complete the installation of Python 3.6.9:
make
sudo make install
To test the installation, run the following command; it should report the Python version as 3.6.9.
python3.6 --version
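Because every node has to run the same Python version, a small script run on each node can make mismatches obvious. The following is a minimal sketch, assuming Python 3.6 is the target; the names EXPECTED and version_matches are illustrative and not part of the original setup.

```python
import sys

# Assumption: the version installed above (3.6.x) is the one every node should run.
EXPECTED = (3, 6)

def version_matches(version_info=sys.version_info, expected=EXPECTED):
    """Return True when the interpreter's major.minor equals the expected pair."""
    return tuple(version_info[:2]) == tuple(expected)

if __name__ == "__main__":
    found = "{}.{}.{}".format(*sys.version_info[:3])
    status = "OK" if version_matches() else "MISMATCH"
    print("Python {} on this node: {}".format(found, status))
```

Running the script with the interpreter that Spark will use on each node shows at a glance whether the master and the slaves agree.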
Installing Apache Spark
The following setup must be performed on the master and on every slave node.
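Download Java from here. Pick the archive that matches your machine's architecture: the commands below use the aarch64 (ARM) archive name, while on a typical 64-bit Intel/AMD PC the file is named jdk-8u291-linux-x64.tar.gz.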
Move the Java tar file to the /usr/ directory.
sudo mv /YOUR_DOWNLOAD_PATH /usr/
Navigate to the /usr directory.
cd /usr
Extract the java tar file.
sudo tar zxvf jdk-8u291-linux-aarch64.tar.gz
Rename the extracted folder to java (the JDK 8u291 archive extracts to a folder named jdk1.8.0_291).
sudo mv jdk1.8.0_291 java
Remove the java tar file.
sudo rm jdk-8u291-linux-aarch64.tar.gz
Add the Java path in the /etc/profile file as follows:
sudo nano /etc/profile
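Use the arrow keys to navigate to the end of the file and add the Java path there. With the folder renamed to /usr/java as above, the entries would be along these lines:
export JAVA_HOME=/usr/java
export PATH=$PATH:$JAVA_HOME/bin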
Press Control + X, then Y, and then Enter to save the file. To reload the environment path, execute the following command:
source /etc/profile
To test the installation, run the following command; it should show the Java version:
java -version
sudo apt-get install scala
To check if Scala is installed, run the following command:
scala -version
sudo wget https://apachemirror.wuchna.com/spark/spark-3.1.1/spark-3.1.1-bin-hadoop2.7.tgz
sudo tar zxvf spark-3.1.1-bin-hadoop2.7.tgz
sudo mv spark-3.1.1-bin-hadoop2.7 /usr/spark/
sudo chmod -R a+rwX /usr/spark
Set up the environment for Spark.
sudo nano /etc/profile
Extend the PATH export at the end of the file so that it also includes the Spark bin directory (there must be no spaces around the = sign):
export PATH=$PATH:/usr/spark/bin
To test the Spark installation, reload the profile and start the Spark shell:
source /etc/profile
spark-shell
The whole Spark installation procedure must be done on the master as well as on all slaves.
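Since the same installation has to be repeated on every node, a quick check that the expected tools are reachable can save debugging time later. The following is a minimal sketch, assuming the commands java, scala, spark-shell and spark-submit should all be on PATH after the steps above; the helper name missing_tools is hypothetical.

```python
import shutil

# Assumption: these are the commands the installation steps put on PATH.
REQUIRED_TOOLS = ["java", "scala", "spark-shell", "spark-submit"]

def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the tools that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]

if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH: " + ", ".join(missing))
    else:
        print("All required tools are on PATH.")
```

The lookup function is passed in as a parameter only so that the check can be exercised without the tools installed; on a real node the default shutil.which is used.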
This completes our base installation. Now we can move on to configuring the master and slave nodes.
Add the IP addresses of the master and slave nodes to the hosts file on every node (master and slaves).
sudo nano /etc/hosts
Now add one entry per node to the hosts file, each consisting of an IP address followed by a name. The names (masterslave1 and masterslave2) can be anything; they only make the nodes easier to remember. You can add as many IP addresses as you like.
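The entries themselves are not reproduced above. Each line of /etc/hosts pairs an IP address with a name, and a small helper can render the same lines for every node. The following is a minimal sketch; the names master, slave1 and slave2 and the 192.0.2.x addresses are placeholders (assumptions), not values from this setup.

```python
# Hypothetical node list: replace the names and addresses with your own.
NODES = [
    ("192.0.2.10", "master"),
    ("192.0.2.11", "slave1"),
    ("192.0.2.12", "slave2"),
]

def hosts_lines(nodes):
    """Render (ip, name) pairs as /etc/hosts lines of the form '<ip> <name>'."""
    return ["{} {}".format(ip, name) for ip, name in nodes]

if __name__ == "__main__":
    # Print the lines so they can be pasted into /etc/hosts on each node.
    print("\n".join(hosts_lines(NODES)))
```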
Spark master configuration
Edit the Spark environment file. Move to the Spark conf folder, create a copy of the spark-env.sh template, and rename the copy to spark-env.sh.
cd /usr/spark/conf
sudo cp spark-env.sh.template spark-env.sh
sudo nano spark-env.sh
Add the following lines at the end of the file and save it:
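The lines to add are not reproduced above. The following is a minimal sketch of what they might look like, assuming the master's address goes into SPARK_MASTER_HOST, Java lives in /usr/java as set up earlier, and PySpark should use the python3.6 interpreter installed above. SPARK_MASTER_HOST, JAVA_HOME and PYSPARK_PYTHON are existing variable names, but which of them the original file set is an assumption, and the helper spark_env_lines is hypothetical.

```python
def spark_env_lines(master_ip, java_home="/usr/java", python="python3.6"):
    """Build export lines for spark-env.sh (an assumed selection of settings)."""
    settings = [
        ("SPARK_MASTER_HOST", master_ip),  # address the master binds to
        ("JAVA_HOME", java_home),          # Java location from the steps above
        ("PYSPARK_PYTHON", python),        # interpreter used by PySpark
    ]
    return ["export {}={}".format(key, value) for key, value in settings]

if __name__ == "__main__":
    # <MASTER_IP> is a placeholder, as elsewhere in this guide.
    print("\n".join(spark_env_lines("<MASTER_IP>")))
```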
Add worker nodes in master node
sudo cp workers.template workers
sudo nano workers
Remove the localhost entry and add the slave nodes, one host name or IP address per line. You can add as many slave nodes as you need. You can also run the master and a slave on a single machine by listing the master's own IP address.
You can verify connectivity from the master to a slave node with ssh, replacing <SLAVE_HOST> with the name or IP address of the slave:
ssh <SLAVE_HOST>
It should now ask for the password of the slave node; once you enter it, you will be logged in.
If you receive a connection error for the slave node (ssh: connect to host masterslave port 22: Connection refused), reinstall the SSH server as follows:
sudo apt remove openssh-server
sudo apt install openssh-server
sudo service ssh start
Now run the ssh command again to log in to the slave node. Once that works, we can move on to starting the cluster.
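Before starting the cluster, it can help to confirm that each slave accepts connections on the SSH port at all. The following is a minimal sketch using only the standard library; port 22 is assumed to be the SSH port, as in the error message above, and the helper name port_open is hypothetical.

```python
import socket

def port_open(host, port=22, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unresolvable host names.
        return False
```

For example, port_open("slave1") checks SSH on a node named slave1 (a placeholder name). The same function could be pointed at the master's ports 7077 and 8080 once the cluster is running.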
Start Spark cluster
To start the Spark cluster, run the following command on the master:
sudo bash /usr/spark/sbin/start-all.sh
It should now ask for the slave's password, after which the Spark cluster should start successfully.
If you receive the error Permission denied (publickey), do the following and then run the above command again:
sudo nano /etc/ssh/sshd_config
Update the following values:
PermitRootLogin prohibit-password to PermitRootLogin yes
PasswordAuthentication no to PasswordAuthentication yes
Restart the ssh services
sudo service ssh restart
This should resolve the error, and you can run the start-all.sh command again.
You can view the running Spark cluster at this URL: http://<Master_IP>:8080/
Stop Spark cluster
To stop the Spark cluster, run the following command on the master:
sudo bash /usr/spark/sbin/stop-all.sh
Test the cluster setup
Below is a sample Python program to run on the Spark cluster. Replace <MASTER_IP> with your master's IP address and save the file as test.py.
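from pyspark import SparkContext, SparkConf

# Point the application at the standalone master (replace <MASTER_IP>).
master_url = "spark://<MASTER_IP>:7077"
conf = SparkConf().setMaster(master_url).setAppName("test")
sc = SparkContext(conf=conf)

# Distribute the numbers 0..9999 over 4 partitions and keep the even ones.
numbers = range(10000)
rdd = sc.parallelize(numbers, 4)
even = rdd.filter(lambda x: x % 2 == 0)
# take() is an action, so it triggers the computation on the cluster.
print(even.take(5))
sc.stop()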
Once your cluster is up, run the following command to execute test.py:
sudo bash /usr/spark/bin/spark-submit --master spark://<MASTER_IP>:7077 test.py
You will see a bunch of INFO statements in your WSL terminal along with the output of your program.
You can also view the status of your job at the master URL: http://<Master_IP>:8080/
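To judge whether the cluster returned the right values, the same filter can be reproduced in plain Python without Spark. The following is a minimal sketch, assuming the job keeps the even numbers from range(10000) as in test.py.

```python
# Same computation as the RDD filter in test.py, evaluated locally.
numbers = range(10000)
even = [x for x in numbers if x % 2 == 0]

print(len(even))   # how many even numbers the cluster job should find
print(even[:5])    # the first few values, for comparison with the job output
```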