This workshop demonstrates some techniques for pre/post processing large data sets, which do not fit onto a single machine. While there are many methods to accomplish such a task, here we focus on the map/reduce (MR) paradigm through the use of the Spark framework. Spark is a "successor" to Hadoop with a rich API and properties attractive for data exploration.
A brief outline of the material here:
- Structure of the workshop material
- Overview of hands-on tutorials
- Background knowledge
- Instructions to set up your local machine for the tutorial
The course is split into two hands-on components:
- The first part consists of two sets of exercises demonstrating some features of the Python language useful for Spark applications and a brief introduction to using Spark.
- The second part is a series of "mini-projects" analyzing a text corpus, in this case the Gutenberg books project, and a Twitter dataset. This second part is meant to be run remotely on a Hadoop cluster with an underlying Hadoop Distributed Filesystem (HDFS).
I recommend that you do the introductory tutorial exercises first before attempting the Gutenberg corpus or Twitter analysis notebooks!
All of the hands-on instruction of the course takes place in IPython notebooks. They are organized into several topics:
- Python concepts introduction - basic introduction to Big Data and python concepts
- Introduction to Spark - essential Spark skills
- Analyzing the Gutenberg Corpus - "project" notebook building up an analysis of the Gutenberg books corpus
- Using DataFrames to analyze Twitter
You should start by having a look at the python introduction notebook (where you can also execute cells) which introduces some essential python concepts. When you open the notebooks running on the notebook server (i.e. in your browser at localhost:8889
), you can execute any cell in the notebook by entering Shift+Enter
. You can also modify any of the cells to experiment. Once you're happy with the python introduction, continue on to the notebook marked "EMPTY" in the same directory and complete the exercises.
The two introductory tutorials cover the basics to get you started with Spark using Python. The more advanced content begins with the Gutenberg tutorials. Part 1 focuses on basic data manipulation in RDDs. Part 2 and Part 3 cover the following:
- Part 2: Gutenberg N-gram Viewer:
- using broadcast variables as lookup tables
- use of
mapPartitions
to reduce memory pressure - using
reduceByKey
to gain insight of the data content - efficient use of pre-partitioning to reduce shuffle and lookup times
- using vectors in
(key, value)
pair RDDs - aggregation using custom aggregators
- Part 3: Building a language prediction model from English and German books:
- identifying and correcting data skew
- using generator functions to lower the memory footprint
- using broadcast variables as lookup tables
- preparing data for use with MLlib
- training a classification model and assessing its performance
In addition, there is a tutorial covering Spark DataFrame
functionality, which makes using Spark for structured data a bit easier. The
Twitter DataFrames covers the following:
- using the
DataFrame
API to process twitter data - contrasting
DataFrame
groupBy
with RDDreduceByKey
methods - using simple and not-so-simple window functions on time-series data
- switching between
RDD
andDataFrame
methods
We assume fluency with basic programming constructs such as loops and functions. In the first session of the workshop, basics of functional programming and other python-specific constructs are introduced, as they are useful for writing Spark applications. Knowing how to use the command line (or terminal) is also expected.
The language used in the workshop is python. If you are not familiar with python at all, we recommend finding a basic tutorial and working through some examples before participating in the workshop. In particular, make sure that you understand the various python data types, such as strings, lists, tuples, dictionaries, and sets. We will go over some of these in the workshop, but you should at least try to use them in simple ways before going through the workshop material.
I strongly recommend that you install git if you don't have it installed already. To check if you have it on your machine, open up a terminal and type
$ git --version
If this returns something other than an error message, you are all set!
There are quite a few requirements for all the software used in this workshop to function properly. You can either install it all by hand (or perhaps you have it installed already), but if you don't want to bother with the set up, you can simply use the virtual machine option.
- Note for Windows users: While ssh is a standard feature of linux and mac os x operating systems, it is not so on Windows. If you are running windows, you will therefore need to install a PuTTY client or something similar that supports ssh tunneling*
We will be using the firefox browser to work interactively with the cluster in the last part of the workshop. We use Firefox because it allows us to set up a proxy without needing to change the system-wide proxy settings, i.e. this lets you continue using all other applications with the local connection instead of through the proxy. If you don't have Firefox already, you should install it - if you have it already installed, please make sure you update it to the latest version.
If you want to simply get up and running, then we recommend you use the workshop materials through the Virtual Machine (VM) environment set up specifically for this purpose. This way you will only have to deal with the minimal set of installations.
- install Vagrant (please upgrade to the latest version, if you have a previous installation already)
- install VirtualBox - must be v4.3.x (4.3.28 is strongly recommended)
If you have git
on your machine (see above), then simply open up a terminal and clone this repository
$ git clone https://github.com/rokroskar/spark_workshop.git
$ cd spark_workshop
If you don't have git, click to download the zip file and uncompress it somewhere on your hard-drive, e.g.
$ unzip spark_workshop-master.zip
$ cd spark_workshop-master
Go to the spark_workshop
directory and spin up the VM
$ vagrant up
This will first download, configure, and boot the VM. The first time you do this, it has to download everything off the internet so it will take a few minutes. You'll see lots of messages about packages being installed. Once it's finished, you should see something like
==> spark_wkshp: Machine 'spark_wkshp' has a post `vagrant up` message. This is a message
==> spark_wkshp: from the creator of the Vagrantfile, and not from Vagrant itself:
==> spark_wkshp:
==> spark_wkshp: VM for Spark workshop is running -- login in with 'vagrant ssh'
To get started, you will need to log in to the VM:
$ vagrant ssh
To exit the VM
vagrant@spark-wkshp> logout
Then, to shut down the VM temporarily (to free up your machine's resources), you can do
$ vagrant halt
Next time you call vagrant up
it will resume from where you stopped and will not need to download all the packages again so it should only take ~30 seconds.
To get rid of the VM completely
$ vagrant destroy
If you want to run Spark from your own machine and you really want to do the setup yourself, or if you are an experienced user, you can follow the instructions below. These are useful for Mac OSX and Linux, but perhaps less so for Windows.
The notebooks used for the tutorials have been tested with Python 2.7. The python installation should have the following packages: numpy, scipy, scikit-learn, pip, ipython, jupyter, matplotlib
Please note that we will be using the latest version of IPython (4.0) and the Jupyter notebooks. The notebooks developed for the workshop will not work with IPython < 3.0.
If you are not sure how to meet these dependency requirements, we recommend that you download the miniconda installer appropriate for your system. Once it's installed, run the following commands:
$ conda update conda
$ conda install -y numpy scipy scikit-learn pip ipython jupyter matplotlib
We will also use a little package called findspark
to simplify the setup a bit:
$ pip install findspark
Spark runs in a Java virtual machine, so you need to install a Java distribution. You might already have one on your machine, so first check here.
Once you have Java installed, grab a spark hadoop 2.6 compiled binary package from http://spark.apache.org/downloads.html and extract it somewhere. Then create a symbolic link in your home directory to this location:
> ln -s /path/to/spark/package ~/spark
To make your life easier later on, add this line to your .bashrc
(or whatever startup file is appropriate for the shell you are using):
export SPARK_HOME=~/spark
Now source the file you just added this line to, e.g.
> source .bashrc
and try to run a basic spark example to see if everything is working:
> $SPARK_HOME/bin/spark-submit $SPARK_HOME/examples/src/main/python/pi.py 10
Notebooks can be downloaded and run on a local machine or any spark cluster. The notebooks require IPython > 3.0.
First, clone this repository on your local disk (if you set up the VM above, you've already done this):
$ git clone https://github.com/rokroskar/spark_workshop.git
Here, the instructions diverge briefly depending on whether you are running in a VM or your own installation:
If the VM is not running, start it with vagrant up
.
Now we will launch the notebook server inside the VM
$ vagrant ssh -c "notebooks/start_notebook.py --setup --launch"
Simply setup/launch the notebook server from the notebooks
directory
$ notebooks/start_notebook.py --setup --launch
From here the process is identical. The start_notebook.py
script first sets up a secure notebook for you, so you will be prompted to enter a password (this is just for you, so pick anything you want). Finally, when it's finished you will see a message with something like this
[I 15:32:11.906 NotebookApp] The IPython Notebook is running at: https://[all ip addresses on your system]:8889/
[I 15:32:11.906 NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
The number at the end of https://[all ip addresses on your system]:8889/
is the port number on which the notebook is running. You can access it by simply going to https://localhost:8889 in your local browser. Make sure you specify https and not just http! This will bring up a menacing message like this
about an untrusted certificate (we created a self-signed certificate during the notebook setup). You can safely ignore this and either click "Advanced -> Proceed to localhost" (in Chrome) or "I understand the risks -> Add Exception" (in Firefox) or simply "Continue" (in Safari).
Note that the --setup
option is only needed the first time you start the
notebook (to set up your SSL certificate and password) -- you can leave it out
when starting the notebook server up again at some later point.