Project that details the creation of a Spark Cluster using Raspberry Pi 4 and Ubuntu Server LTS 20.04
This project was last updated 9/10/20!
This project is my own documentation of building a Spark Cluster Computer. I explain from start to finish how to setup a physical Raspberry Pi 4 Cluster Computer and install Apache Hadoop and Apache Spark on the cluster.
In this tutorial, we cover:
- Physical Cluster Setup
- Individual Pi Setup - Ubuntu Server LTS 20.04 Installation
- Cluster Setup - Public Key SSH Authentication, Static IP, Host/Hostnames Configuration
- Hadoop Installation - Single Node and Multi-Node; Hadoop 3.2.1
- Spark Installation - Spark Jobs via YARN and the Spark Shell; Spark 3.0.1
This tutorial would not have been possible without knowledge from the following articles:
- A Data Science/Big Data Laboratory by Pier Taranti
- Build Raspberry Pi Hadoop/Spark Cluster from Scratch by Henry Liang
- How to Install and Set Up a 3-Node Hadoop Cluster by Linode (Contributions from Florent Houbart)
- Install, Configure, and Run Spark on Top of a Hadoop YARN Cluster
- Apache Hadoop Documentation
- Apache Spark Documentation
Without further ado, let's get on to it!
To start the setup, install your Raspberry Pis on the server rack casing. You can find these cases for cheap on Amazon.
For my setup, I used a wired ethernet connection from my router which was then routed to my Netgear 4-port 100 Mbps Switch. Then, I connected each Raspberry Pi to the switch via a 1-foot ethernet cable.
I used a RAVPower 4-Port USB Power Supply to power the Raspberry Pis. As they are Raspberry Pi 4s, they all take power over USB-C.
Use the Raspberry Pi Imager to write the Ubuntu Server LTS 20.04 64-bit image to each Pi.
In this stage, we'll SSH into each Pi and setup some basic configuration.
Repeat the following steps for each Pi.
Plug in one Pi at a time; finish the setup configuration completely before moving on to the next Pi.
Use this net-tools command to find the IP address of the Pi (dc:a6:32 is a MAC address prefix assigned to the Raspberry Pi Foundation):
$ arp -na | grep -i "dc:a6:32"
If that doesn't return anything, you can also use your router admin interface to see a list of connected devices.
The name of the device should be ubuntu.
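For reference, the SSH command looks like this (substituting the IP address you found; ubuntu is the default user on the Ubuntu Server image, and the default password is also ubuntu):
$ ssh ubuntu@{IP Address}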
After you're connected you'll be prompted to change the password. Make the password the same on each Pi; something secure but easy to remember.
Ensure that the Pi has the time synchronized using the following command:
$ timedatectl status
You should see output like this:
Local time: Mon 2020-08-10 12:24:54 EDT
Universal time: Mon 2020-08-10 16:24:54 UTC
RTC time: Mon 2020-08-10 16:24:54
Time zone: America/New_York (EDT, -0400)
System clock synchronized: yes
NTP service: active
RTC in local TZ: no
If the system clock is synchronized and NTP service is active, you're good to go.
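If the NTP service shows as inactive, you can usually enable time synchronization with:
$ sudo timedatectl set-ntp true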
Lastly, run the following commands to finish individual configuration:
$ sudo apt update
$ sudo apt upgrade
$ sudo reboot
If you get a cache lock error after the update command, reboot the Pi and try again.
In this stage, we'll setup static IPs, hosts/hostname, public-key SSH authentication, and ease-of-use functions.
The following steps will need to be done on each Pi.
Ubuntu Server LTS 20.04 uses Netplan for network configuration, which means editing a few YAML files.
First, SSH into the Pi and find the name of your network interface by running:
$ ip a
The returned information should look like so:
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
    link/ether 7c:97:0d:a6:27:53 brd ff:ff:ff:ff:ff:ff
    inet 192.168.1.1/24 brd 192.168.1.255 scope global dynamic noprefixroute eth0
...
The network interface name is eth0, shown at the start of the second entry (entry 2) in the output above.
Keep this in mind, we'll need this for the network configuration file.
Next, you'll need to use nano or vim to edit the configuration files; in this tutorial, I'll be using nano. Use the following command to create a configuration file that disables cloud-init's automatic network configuration.
$ sudo nano /etc/cloud/cloud.cfg.d/99-disable-config.cfg
All you need to add is the following code:
network: {config: disabled}
Then, you'll setup the static IP by editing the 50-cloud-init.yaml file. Use the following command:
$ sudo nano /etc/netplan/50-cloud-init.yaml
The basic template to set a static IP is:
network:
  ethernets:
    {Network Interface Name}:
      dhcp4: false
      addresses: [{Specific IP Address}/24]
      gateway4: {Gateway Address}
      nameservers:
        addresses: [{Gateway Address}, 8.8.8.8]
  version: 2
My configuration file looked like so: (X being the last digit of the specific IP address for each Pi; 10.1.2.121, 10.1.2.122, 10.1.2.123, etc.)
network:
  ethernets:
    eth0:
      dhcp4: false
      addresses: [10.1.2.X/24]
      gateway4: 10.1.2.1
      nameservers:
        addresses: [10.1.2.1, 8.8.8.8]
  version: 2
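Optionally, before applying the settings, you can have Netplan parse the file and report any YAML or syntax errors without touching the running network:
$ sudo netplan generate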
After editing the file, apply the settings using the following command:
$ sudo netplan apply
Note, sudo netplan apply will make your SSH session hang because the IP address changes.
Kill the terminal session and SSH back into the Pi using the new IP address.
Then reboot the Pi and confirm the static IP address is set correctly.
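For example, to confirm the change on the first Pi (assuming the 10.1.2.121 address used above), SSH back in and check eth0:
$ ssh ubuntu@10.1.2.121
$ ip -4 addr show eth0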
The following steps will need to be done on each Pi.
Now it's time to configure the hosts and hostnames files to the specific Pi information.
First we'll SSH into the Pi and update the hostname file using the following command:
$ sudo nano /etc/hostname
The hostname file should look like so (X being the last digit of the specific IP address for each Pi):
pi0X
For example, the hostname file for pi01 will look like:
pi01
Next, we'll have to update the hosts file using the following command:
$ sudo nano /etc/hosts
The hosts file should look like so after editing:
# The following lines are desirable for IPv6 capable hosts
::1 ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
ff02::3 ip6-allhosts
10.1.2.121 pi01
10.1.2.122 pi02
10.1.2.123 pi03
While editing the hosts file, make sure to delete the localhost 127.0.0.1 line from the file.
The template for adding additional nodes to the hosts file is:
{IP Address} {hostname}
Note, the hostname file will be different for each node, but the hosts file should be exactly the same.
Now reboot the Pi and move on to the next Pi until all are configured.
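Once every Pi has been configured, a quick sanity check (assuming the three hostnames above) is to ping the other nodes by name from any Pi:
$ ping -c 1 pi02
$ ping -c 1 pi03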
Perform these steps only on the master Pi until directed to do otherwise.
First, edit the ssh config file on the Pi using the following command:
$ nano ~/.ssh/config
Add all of the nodes to the config file, including the Host, User, and Hostname for each Pi.
The template for adding nodes to the config file is:
Host piXX
    User ubuntu
    Hostname {IP Address}
My config file looked like this after adding all of my nodes to the file:
Host pi01
    User ubuntu
    Hostname 10.1.2.121

Host pi02
    User ubuntu
    Hostname 10.1.2.122

Host pi03
    User ubuntu
    Hostname 10.1.2.123
Next, create an SSH key pair on the Pi using:
$ ssh-keygen -t rsa -b 4096
You'll be prompted for a directory in which to save the key pair and for a passphrase.
Press Enter through the prompts so the key pair is saved in the default .ssh directory and is passwordless.
The output should look similar to this:
Your identification has been saved in /your_home/.ssh/id_rsa
Your public key has been saved in /your_home/.ssh/id_rsa.pub
The key fingerprint is:
SHA256:/hk7MJ5n5aiqdfTVUZr+2Qt+qCiS7BIm5Iv0dxrc3ks user@host
The key's randomart image is:
+---[RSA 4096]----+
| .|
| + |
| + |
| . o . |
|o S . o |
| + o. .oo. .. .o|
|o = oooooEo+ ...o|
|.. o *o+=.*+o....|
| =+=ooB=o.... |
+----[SHA256]-----+
Repeat the SSH keygen on Pi 2 and 3.
Then use the following command on all Pis (including Pi 1) to copy the public keys into Pi 1's authorized key list:
$ ssh-copy-id pi01
Finally, you can copy Pi 1's configuration files to the rest of the Pis using the following commands:
$ scp ~/.ssh/authorized_keys piXX:~/.ssh/authorized_keys
$ scp ~/.ssh/config piXX:~/.ssh/config
Note, XX stands for the specific digit identifier for the Pi (pi02, pi03).
Now you should be able to ssh into any Pi from any of the Pis without providing a password.
Make sure to test it out from all Pis to ensure everything is working properly.
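A quick way to test from any one of the Pis is to loop over all of the hostnames; each should print its hostname without asking for a password (you may have to accept each host key fingerprint the first time):
$ for node in pi01 pi02 pi03; do ssh $node hostname; done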
This step entails editing the .bashrc
file in order to create some custom functions for ease of use.
On the master Pi, we'll first edit the ~/.bashrc file:
$ nano ~/.bashrc
Within this file, add the following code to the bottom of the file:
# cluster management functions

# list what other nodes are in the cluster
function cluster-other-nodes {
    grep "pi" /etc/hosts | awk '{print $2}' | grep -v $(hostname)
}

# execute a command on all nodes in the cluster
function cluster-cmd {
    for node in $(cluster-other-nodes);
    do
        echo $node;
        ssh $node "$@";
    done
    cat /etc/hostname; "$@"
}

# reboot all nodes in the cluster
function cluster-reboot {
    cluster-cmd sudo reboot now
}

# shutdown all nodes in the cluster
function cluster-shutdown {
    cluster-cmd sudo shutdown now
}

# copy a file to the same path on all other nodes in the cluster
function cluster-scp {
    for node in $(cluster-other-nodes);
    do
        echo "${node} copying...";
        cat $1 | ssh $node "sudo tee $1" > /dev/null 2>&1;
    done
    echo 'all files copied successfully'
}

# start yarn and dfs on cluster
function cluster-start-hadoop {
    start-dfs.sh; start-yarn.sh
}

# stop yarn and dfs on cluster
function cluster-stop-hadoop {
    stop-dfs.sh; stop-yarn.sh
}
Source the .bashrc
file on the master Pi:
source ~/.bashrc
Now use the following command to copy the .bashrc to all the worker nodes:
cluster-scp ~/.bashrc
Lastly, run the following command to source the .bashrc file on all nodes:
cluster-cmd source ~/.bashrc
You now have a functioning cluster computer.
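As a quick sanity check of the helper functions, run a simple command across the cluster; each node name should be echoed followed by its hostname:
$ cluster-cmd hostname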
In order to start running parallel processing tasks, we'll install Hadoop first and then Spark.
To install Java 8 on each node, use the following command:
$ cluster-cmd sudo apt install -y openjdk-8-jdk
Then, because we haven't done it in a while, use the following command to reboot the Pis:
$ cluster-reboot
After everything is rebooted, SSH into the master Pi and run the following command to verify Java was installed correctly:
$ cluster-cmd java -version
Perform the following steps only on the master Pi until directed to do otherwise.
Next, download the Hadoop 3.2.1 binary onto your machine. You can get the binary from the Apache Hadoop website and use wget to download it on to the Pi.
$ wget https://downloads.apache.org/hadoop/common/hadoop-3.2.1/hadoop-3.2.1.tar.gz
Note, the link in the bash command might not work if Hadoop has been updated.
Next, extract the tar and move the binary to the /opt directory using the following command:
$ sudo tar -xvf hadoop-3.2.1.tar.gz -C /opt/ && cd /opt
Change the name of the directory from hadoop-3.2.1 to hadoop:
$ sudo mv hadoop-3.2.1 hadoop
Change the ownership of the directory:
$ sudo chown ubuntu:ubuntu -R /opt/hadoop
Edit .profile
to add Hadoop binaries to PATH:
$ nano /opt/hadoop/.profile
Add the following line:
PATH=/opt/hadoop/bin:/opt/hadoop/sbin:$PATH
Edit the .bashrc
file to include the following environment variables at the bottom of the file.
# path and options for java
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-arm64
# path and options for Hadoop
export HADOOP_HOME=/opt/hadoop
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib/native"
Then source the .bashrc
file to ensure it updates.
source ~/.bashrc
Next, set the value of JAVA_HOME
in /opt/hadoop/etc/hadoop/hadoop-env.sh. You'll have to scroll down to find the correct line.
The line will be commented out. Uncomment the line and add the correct path to the variable.
It should look like this:
...
# The java implementation to use. By default, this environment
# variable is REQUIRED on ALL platforms except OS X!
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-arm64
...
Use the following command to edit the core-site.xml file.
$ nano /opt/hadoop/etc/hadoop/core-site.xml
It should look like so after editing:
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://pi01:9000</value>
</property>
</configuration>
Then edit the hdfs-site.xml file:
$ nano /opt/hadoop/etc/hadoop/hdfs-site.xml
After editing:
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>
Format the NameNode (warning: THIS WILL DELETE ALL DATA IN THE HDFS!):
$ hdfs namenode -format
Start the NameNode and DataNode:
$ start-dfs.sh
Make the required directories to run MapReduce jobs:
$ hdfs dfs -mkdir /user
$ hdfs dfs -mkdir /user/ubuntu
Copy input files into the distributed filesystem:
$ hdfs dfs -mkdir input
$ hdfs dfs -put /opt/hadoop/etc/hadoop/*.xml input
Run the test example:
$ hadoop jar /opt/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.2.1.jar grep input output 'dfs[a-z.]+'
View the output on the distributed file system:
$ hdfs dfs -cat output/*
When done, stop the NameNode and DataNode:
$ stop-dfs.sh
Configure the following parameters in the configuration files:
/opt/hadoop/etc/hadoop/mapred-site.xml:
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<property>
<name>mapreduce.application.classpath</name>
<value>$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/*:$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/lib/*</value>
</property>
</configuration>
/opt/hadoop/etc/hadoop/yarn-site.xml:
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.env-whitelist</name>
<value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME</value>
</property>
</configuration>
Start ResourceManager and NodeManager:
$ start-yarn.sh
Start NameNode and DataNode:
$ start-dfs.sh
Test if all daemons are running:
$ jps
You should see this output after running the jps
command:
5616 SecondaryNameNode
5760 Jps
5233 NameNode
4674 NodeManager
5387 DataNode
4524 ResourceManager
If you see all six daemons, then it's time to move on to the Multi-Node Installation of Hadoop.
Stop all daemons:
$ stop-yarn.sh
$ stop-dfs.sh
Update the following configuration files:
/opt/hadoop/etc/hadoop/core-site.xml:
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://pi01:9000</value>
</property>
</configuration>
/opt/hadoop/etc/hadoop/hdfs-site.xml:
<configuration>
<property>
<name>dfs.namenode.name.dir</name>
<value>/opt/hadoop/data/nameNode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>/opt/hadoop/data/dataNode</value>
</property>
<property>
<name>dfs.namenode.checkpoint.dir</name>
<value>/opt/hadoop/hdfs/namenodesecondary</value>
</property>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>
/opt/hadoop/etc/hadoop/mapred-site.xml:
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<property>
<name>yarn.app.mapreduce.am.env</name>
<value>HADOOP_MAPRED_HOME=$HADOOP_HOME</value>
</property>
<property>
<name>mapreduce.map.env</name>
<value>HADOOP_MAPRED_HOME=$HADOOP_HOME</value>
</property>
<property>
<name>mapreduce.reduce.env</name>
<value>HADOOP_MAPRED_HOME=$HADOOP_HOME</value>
</property>
<property>
<name>yarn.app.mapreduce.am.resource.memory-mb</name>
<value>512</value>
</property>
<property>
<name>mapreduce.map.resource.memory-mb</name>
<value>256</value>
</property>
<property>
<name>mapreduce.reduce.resource.memory-mb</name>
<value>256</value>
</property>
</configuration>
/opt/hadoop/etc/hadoop/yarn-site.xml:
<configuration>
<property>
<name>yarn.acl.enable</name>
<value>0</value>
</property>
<property>
<name>yarn.resourcemanager.hostname</name>
<value>pi01</value>
</property>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.resource.memory-mb</name>
<value>1536</value>
</property>
<property>
<name>yarn.scheduler.maximum-allocation-mb</name>
<value>1536</value>
</property>
<property>
<name>yarn.scheduler.minimum-allocation-mb</name>
<value>128</value>
</property>
<property>
<name>yarn.nodemanager.vmem-check-enabled</name>
<value>false</value>
</property>
</configuration>
For the Master:
$ nano /opt/hadoop/etc/hadoop/master
It should look like this:
pi01
For the Workers:
$ nano /opt/hadoop/etc/hadoop/workers
It should look like this:
pi02
pi03
SSH into Pi 2 and then Pi 3 and run the following commands:
sudo mkdir /opt/hadoop
sudo chown ubuntu:ubuntu -R /opt/hadoop
After creating the /opt/hadoop
directory in Pi 2 and Pi 3 and changing the owner of the directory, SSH back into Pi 1.
Use the following command to copy all files from the master node to the worker nodes:
$ for pi in $(cluster-other-nodes); do rsync -avxP /opt/hadoop $pi:/opt; done
Then copy the .bashrc
file from the master node to the worker nodes and source it on all nodes:
$ cluster-scp ~/.bashrc
$ cluster-cmd source ~/.bashrc
After everything is copied, you can verify that Hadoop was installed correctly using the following command on each node:
$ hadoop version | grep Hadoop
Run the following commands only on the master node unless directed otherwise.
To run for the first time, the HDFS needs to be formatted:
$ hdfs namenode -format
To start the HDFS, run the following command on the master node:
$ start-dfs.sh
To start YARN, run the following command:
$ start-yarn.sh
Ensure that everything is working by running the following command on all nodes:
$ jps
On the master node, jps
should return a similar output:
7202 SecondaryNameNode
6954 NameNode
7498 Jps
7389 ResourceManager
On the worker nodes, jps
should return a similar output:
3889 Jps
3684 NodeManager
3514 DataNode
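Another way to confirm that both workers joined the cluster is to run the following reports from the master node; the first should list two live DataNodes and the second two running NodeManagers:
$ hdfs dfsadmin -report
$ yarn node -list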
First create the user home directory in the HDFS:
$ hdfs dfs -mkdir -p /user/ubuntu
Create a directory called books
in the HDFS:
$ hdfs dfs -mkdir books
Grab a few books from the Gutenberg Project:
$ cd ~
$ wget -O alice.txt https://www.gutenberg.org/files/11/11-0.txt
$ wget -O holmes.txt https://www.gutenberg.org/files/1661/1661-0.txt
$ wget -O frankenstein.txt https://www.gutenberg.org/files/84/84-0.txt
Put the books on the HDFS in the books
directory:
$ hdfs dfs -put alice.txt holmes.txt frankenstein.txt books
List the contents of the books
directory:
hdfs dfs -ls books
Move one of the books back into the local filesystem:
hdfs dfs -get books/holmes.txt
List the current directory to ensure that you successfully retrieved the file:
ls
Let's test that everything is working with the "Hello, world!" of MapReduce: the WordCount.
First, submit a job with the sample WordCount jar
to YARN:
yarn jar /opt/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.2.1.jar wordcount "books/*" output
After the job is finished (it could take a few minutes), get the results with the following command:
hdfs dfs -ls output
You should be returned the following information:
Found 2 items
-rw-r--r-- 1 ubuntu supergroup 0 2020-09-10 18:09 output/_SUCCESS
-rw-r--r-- 1 ubuntu supergroup 272490 2020-09-10 18:09 output/part-r-00000
Then, print the result with the following command (press q to exit the pager):
hdfs dfs -cat output/part-r-00000 | less
Congratulations, you now have a working YARN cluster!
The following steps only need to be performed on the master Pi.
First, download the Spark 3.0.1 (pre-built for Hadoop 3.2 and later) binary onto the master Pi. You can get the binary from the Apache Spark website and use wget to download it.
$ wget https://downloads.apache.org/spark/spark-3.0.1/spark-3.0.1-bin-hadoop3.2.tgz
Note, the link in the bash command might not work if Spark has been updated.
Next, extract the tar and move the binary to the /opt/hadoop/
directory using the following command:
$ sudo tar -xvf spark-3.0.1-bin-hadoop3.2.tgz -C /opt/hadoop && cd /opt/hadoop
Change the name of the directory in /opt/hadoop/
to spark:
$ sudo mv spark-3.0.1-bin-hadoop3.2 spark
Change the ownership of the directory:
$ sudo chown ubuntu:ubuntu -R /opt/hadoop/spark
Add the following code to the end of the .bashrc
file:
# path and options for Spark
export SPARK_HOME=/opt/hadoop/spark
export PATH=$PATH:$SPARK_HOME/bin
export LD_LIBRARY_PATH=/opt/hadoop/lib/native:$LD_LIBRARY_PATH
Source the .bashrc
file:
source ~/.bashrc
Add the Spark directory to the PATH by editing the Hadoop user profile:
$ nano /opt/hadoop/.profile
Add the following lines:
PATH=/opt/hadoop/spark/bin:$PATH
export HADOOP_CONF_DIR=/opt/hadoop/etc/hadoop
export SPARK_HOME=/opt/hadoop/spark
export LD_LIBRARY_PATH=/opt/hadoop/lib/native:$LD_LIBRARY_PATH
Exit the terminal session and SSH back again to restart the session.
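As a quick check that the Spark binaries are now on your PATH (assuming the paths above), print the Spark version:
$ spark-submit --version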
Rename the Spark defaults template config file:
mv /opt/hadoop/spark/conf/spark-defaults.conf.template /opt/hadoop/spark/conf/spark-defaults.conf
Edit $SPARK_HOME/conf/spark-defaults.conf and set spark.master to yarn:
sudo nano /opt/hadoop/spark/conf/spark-defaults.conf
Set the following configurations:
spark.master yarn
spark.driver.memory 512m
spark.yarn.am.memory 512m
spark.executor.memory 512m
Run the following command using the spark-submit
command to test the parallel processing calculation of Pi:
spark-submit --deploy-mode client --class org.apache.spark.examples.SparkPi /opt/hadoop/spark/examples/jars/spark-examples_2.12-3.0.1.jar 10
You should see the following line in the output after submitting the job:
...
Pi is roughly 3.1401351401351403
...
Enable the Spark Web UI to track your submitted jobs by editing the spark-defaults.conf
file:
sudo nano /opt/hadoop/spark/conf/spark-defaults.conf
Add the following lines to the configuration file:
spark.eventLog.enabled true
spark.eventLog.dir hdfs://pi01:9000/spark-logs
spark.history.provider org.apache.spark.deploy.history.FsHistoryProvider
spark.history.fs.logDirectory hdfs://pi01:9000/spark-logs
spark.history.fs.update.interval 10s
spark.history.ui.port 18080
Create the log directory in the HDFS:
hdfs dfs -mkdir /spark-logs
Run the history server:
/opt/hadoop/spark/sbin/start-history-server.sh
Run the Pi parallel processing job again to generate logs in the HDFS.
spark-submit --deploy-mode client --class org.apache.spark.examples.SparkPi /opt/hadoop/spark/examples/jars/spark-examples_2.12-3.0.1.jar 10
Access the Spark History Server through the Web UI at http://{Master Node IP}:18080 in a web browser.
Let's take a look at some of the existing data on our cluster in an interactive manner via the Spark Shell.
To start the Spark Shell:
spark-shell
Then let's do a small Scala example to show the power of the Spark Shell.
First, let's read the alice.txt file into a Dataset of strings named input:
val input = spark.read.textFile("books/alice.txt")
Then, let's count the number of non-blank lines using the filter function:
input.filter(line => line.length()>0).count()
The returned output should look like so:
res0: Long = 2813
Congratulations, you now have a working Spark Cluster Computer.
Go forth and do great things!