Copyright (C) 2010-2014 Think Big Analytics, Inc. All Rights Reserved.
StrangeLoop 2012
Dean Wampler, Think Big Analytics
[email protected]
@deanwampler
Hire Us!
This workshop is a half-day tutorial on Scalding and its place in the Hadoop ecosystem. Scalding is a Scala API developed at Twitter for distributed data programming that uses the Cascading Java API, which in turn sits on top of Hadoop's Java API. However, Scalding, through Cascading, also offers a local mode that makes it easy to run jobs without using the Hadoop libraries, for simpler testing and learning. We'll use this feature for most of this workshop.
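To give a flavor of what a Scalding script looks like before we dive in, here is a minimal word-count sketch written against Scalding's fields-based API. It is not one of the workshop scripts; the class name, the --input and --output arguments, and the run command in the comments are illustrative assumptions.

import com.twitter.scalding._

// A minimal word-count job in Scalding's fields-based API.
// Assuming the workshop's run.rb forwards extra arguments to the job
// (as Scalding's scald.rb does), it could be run in local mode with:
//   ./run.rb scripts/WordCount.scala --input somefile.txt --output output/wordcount.tsv
class WordCount(args: Args) extends Job(args) {
  TextLine(args("input"))                       // read the input file line by line
    .flatMap('line -> 'word) { line: String =>  // split each line into words
      line.toLowerCase.split("""\s+""")
    }
    .groupBy('word) { _.size }                  // count occurrences of each word
    .write(Tsv(args("output")))                 // write tab-separated (word, count) pairs
}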
To keep the setup process as simple as possible, the workshop git repo contains a pre-built jar that bundles Scalding v0.7.3 for Scala v2.9.2 and other required jars, such as Cascading, Hadoop core, and Log4J. So, all you need to install is Java, Scala, Ruby, and this workshop.
It helps to pick a work directory where you will install some of the packages. In what follows, we'll assume you're using $HOME/fun on Linux, Mac OS X, or Cygwin for Windows with the bash shell (or a similar shell), or C:\fun on Windows.
You'll need git to clone the workshop repository and optionally for other installs. See here for details. As an alternative, you can download a workshop release from its GitHub repo rather than clone it.
Download or clone this workshop from GitHub.
To clone this workshop from GitHub using bash:
cd $HOME/fun
git clone https://github.com/thinkbiganalytics/scalding-workshop
On Windows:
cd C:\fun
git clone https://github.com/thinkbiganalytics/scalding-workshop
Or, simply download a release.
Install Java if necessary from here.
Scalding uses Scala v2.9.2. Install it from here.
Scalding uses Ruby as a platform-independent language for its driver scripts, and we've followed the same convention. See ruby-lang.org for details on installing Ruby. Either version 1.8.7 or 1.9.X will work.
Once you've completed these steps, run the following commands as a sanity check to ensure that everything is set up properly. Using bash:
cd $HOME/fun/scalding-workshop
./run.rb scripts/SanityCheck0.scala
On Windows:
cd C:\fun\scalding-workshop
ruby run.rb scripts/SanityCheck0.scala
The commands should run without error. Note that it takes a moment to compile the Scala script and run to completion. The output is written to output/SanityCheck0.txt. What's in that file?
If you're serious about using Scalding, you should clone and build the Scalding repo. We'll talk briefly about it in the workshop, but it isn't required.
SBT is the de facto build tool for Scala. You'll need it to build Scalding. Follow these installation instructions.
Clone Scalding from GitHub. Using bash:
cd $HOME/fun
git clone https://github.com/twitter/scalding.git
On Windows:
cd C:\fun
git clone https://github.com/twitter/scalding.git
Build Scalding according to its Getting Started page. Here is a synopsis of the steps. Using bash:
cd $HOME/fun/scalding
sbt update
sbt assembly
On Windows:
cd C:\fun\scalding
sbt update
sbt assembly
(The Getting Started page says to build the test target between update and assembly, but the latter builds test itself.)
Once you've built Scalding, run the following command as a sanity check to ensure everything is set up properly. Using bash:
cd $HOME/fun/scalding
scripts/scald.rb --local tutorial/Tutorial0.scala
On Windows:
cd C:\fun\scalding
ruby scripts\scald.rb --local tutorial/Tutorial0.scala
The Workshop/Tutorial proper is described in the companion Workshop document.
Added missing file to distribution. Refined the run scripts to work better with different Java versions.
Refined several exercises and fixed bugs. Added Makefile for building releases.
First release for StrangeLoop 2012 workshop.
See the Scalding GitHub page for more information about Scalding. The wiki is very useful.
Dean Wampler from Think Big Analytics prepared this workshop. Contact Dean with questions about the workshop. For information about consulting and training on Scalding and other Hadoop-related topics, send us an email.
Some of the data used in these exercises was obtained from InfoChimps.
Dean Wampler
[email protected]
@deanwampler