- Download and install VirtualBox (pick the package that corresponds to your system and prefer the latest 5.2.x version): https://www.virtualbox.org/wiki/Downloads
- For Ubuntu (Debian) users see: https://github.com/mnmami/Training/blob/master/VirtualBox_Ubuntu.md
- Download and install Vagrant (pick the one that corresponds to your system & prefer the latest version 2.0.x): https://www.vagrantup.com/downloads.html
- For Linux (Debian) users see: https://github.com/mnmami/Training/blob/master/Vagrant_Linux.md
- Open a terminal and create a folder for your Vagrant project then navigate to it:
mkdir myvagrant
cd myvagrant
- Create a file called Vagrantfile and put inside it:
Vagrant.configure("2") do |config|
  config.vm.provision "shell", inline: "echo Hello there"
  # config.ssh.insert_key = false
  config.vm.define "master" do |master|
    master.vm.box = "ubuntu/xenial64"
    master.vm.network "public_network", ip: "192.168.0.10"
    master.vm.network "forwarded_port", guest: 4040, host: 4040
    master.vm.network "forwarded_port", guest: 8080, host: 8080
    master.vm.hostname = "ubuntu1"
  end
  config.vm.define "slave" do |slave|
    slave.vm.box = "ubuntu/xenial64"
    slave.vm.network "public_network", ip: "192.168.0.11"
    slave.vm.network "forwarded_port", guest: 8081, host: 8081
    slave.vm.hostname = "ubuntu2"
  end
end
- Windows users:
  - Uncomment the third line of the Vagrantfile:
# config.ssh.insert_key = false
  - Do not use sudo in any of the command lines of this step.
- Then run:
sudo vagrant up
and wait a few minutes.
- If you are asked which network interface to use, select the one you are connected through. For example, on a wired Ethernet connection, common names are eth0 or em0. To be sure, run
ifconfig
and pick the interface showing your current IP address.
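On Linux hosts where ifconfig is not installed, the same check can be done with the ip tool from iproute2. The sketch below is an alternative to ifconfig, not part of the original steps, and the function name list_ipv4_interfaces is only illustrative. It reads the one-line-per-address output of ip and prints each non-loopback interface next to its IPv4 address.

```shell
# Read the output of `ip -o -4 addr show` on stdin and print
# "<interface> <address/prefix>" for every non-loopback IPv4 address.
list_ipv4_interfaces() {
  awk '$3 == "inet" && $2 != "lo" { print $2, $4 }'
}

# Usage on a Linux host (assumes iproute2 is installed):
#   ip -o -4 addr show | list_ipv4_interfaces
```

The interface printed next to the address of your current connection is the one to select when Vagrant asks.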
Once STEP 2 is done successfully, we obtain two Ubuntu 16.04 boxes (guest virtual machines) connected to each other over a (public) network. One will be used as the Apache Spark master, the other as the slave. We also exposed the ports 4040, 8080 and 8081 to the host machine (the one that runs Vagrant). We use those ports to open the web interfaces of the master and the slave.
- Now, ssh to the master using
sudo vagrant ssh master
and open another terminal and ssh to the slave using
sudo vagrant ssh slave
You are now working inside an Ubuntu system.
- In both boxes, run the following to update the package index so that the missing packages can be installed:
sudo apt-get update
- If the command hangs with the message
[Connecting to archive.ubuntu.com (2001:67c:1360:8c01::1a)]
solve it by disabling IPv6, following the steps here: https://askubuntu.com/questions/440649/how-to-disable-ipv6-in-ubuntu-14-04
- Run the following 2 lines:
sudo apt-get install openjdk-8-jre
sudo apt-get update
- We will install Spark version 2.1.0, so run:
sudo wget https://archive.apache.org/dist/spark/spark-2.1.0/spark-2.1.0-bin-hadoop2.7.tgz
sudo tar -xzvf spark-2.1.0-bin-hadoop2.7.tgz
cd spark-2.1.0-bin-hadoop2.7
- Navigate to the conf folder and create the Spark configuration file:
cd conf
sudo cp spark-env.sh.template spark-env.sh
- Open spark-env.sh for editing and add the following line:
export SPARK_MASTER_HOST=192.168.0.10
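The same edit can be made without opening an editor. The helper below is a minimal sketch: append_once and the SUDO variable are hypothetical names, not part of Spark or of the original steps. It appends a line to a file only if that exact line is not already there, so running it twice does not duplicate the setting.

```shell
# Append line $2 to file $1 unless the exact line is already present.
# Set SUDO=sudo when the file is owned by root (as it is after `sudo cp`).
append_once() {
  grep -qxF -- "$2" "$1" 2>/dev/null ||
    printf '%s\n' "$2" | ${SUDO:-} tee -a "$1" >/dev/null
}

# Usage inside the conf folder:
#   SUDO=sudo append_once spark-env.sh 'export SPARK_MASTER_HOST=192.168.0.10'
```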
- In the Master box, navigate to the sbin folder and execute the start-master.sh script:
cd ../sbin
sudo ./start-master.sh
- This will return a message mentioning a log file; open it to obtain the master URL. You should find
spark://192.168.0.10:7077
- In the Slave box, also navigate to the sbin folder and execute the start-slave.sh script, passing the Spark master URL as an argument:
cd ../sbin
sudo ./start-slave.sh spark://192.168.0.10:7077
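The master URL passed to start-slave.sh can also be pulled out of the master's log file instead of being read by eye. This is a small sketch, assuming the log contains the URL in the spark://host:port form shown above; master_url_from_log is an illustrative name, and the log path is whatever start-master.sh printed on the Master box.

```shell
# Print the first spark:// URL found in the log file given as $1.
master_url_from_log() {
  grep -o 'spark://[^[:space:]]*' "$1" | head -n 1
}

# Usage on the Master box, with LOG_FILE set to the path that
# start-master.sh reported:
#   master_url_from_log "$LOG_FILE"
```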
- Navigate to the bin folder and run the spark-shell script, passing the Spark master URL as an argument:
cd ../bin
sudo ./spark-shell --master spark://192.168.0.10:7077
- Here is an example: https://github.com/mnmami/Training/blob/master/Example.scala
- Transform a SQL dump into comma-separated values.
- Create DataFrames from those values and query them.
- Save DataFrame to a CSV file.
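To give a feel for the first item, turning a SQL dump into comma-separated values, here is a rough command-line sketch. It is independent of the linked Example.scala and makes strong assumptions: every row is its own single-line INSERT statement, and no value contains a comma or a quote. The function name sql_inserts_to_csv and the persons table in the usage line are made up for illustration.

```shell
# Read a SQL dump on stdin and print one CSV row per single-row
# "INSERT INTO <table> VALUES (...);" statement, dropping single quotes.
sql_inserts_to_csv() {
  sed -n 's/^INSERT INTO [^ ]* VALUES (\(.*\));$/\1/p' | tr -d "'"
}

# Usage (dump.sql and persons.csv are hypothetical file names):
#   sql_inserts_to_csv < dump.sql > persons.csv
```

Lines that are not INSERT statements, such as CREATE TABLE, are skipped. A real dump with multi-row inserts or quoted commas needs a proper parser.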
- Use vagrant snapshot push and vagrant snapshot pop (see https://www.vagrantup.com/docs/cli/snapshot.html for more) to create a snapshot (version) of your machine at any time, so you can roll back to that version when things go wrong, with no need to destroy the machine and start anew.
- Use vagrant suspend and vagrant resume (see https://www.vagrantup.com/docs/cli/suspend.html for more) to save the state of the machine and pick up where you left off last time, instead of starting from scratch.
Have a question about the above? Don't panic, shoot me an email at: [email protected]