Big Data/Storm

Howto: Setting up a multi-node Storm cluster under Ubuntu
This tutorial explains how to set up a Storm cluster running on several Ubuntu machines. The Storm framework allows to process unbounded data streams in a distributed manner in real-time.

The basic idea behind the distributed real-time processing of Storm is splitting the overall task into several smaller tasks which can be executed quickly. Each of these tasks is executed by a so called bolt which is one type of vertex in a directed processing graph named topology. Those bolts consume one or more data input streams and produce an output data stream. Beside bolts, there exists a second type of vertices in a topology named spouts which transform an input stream into a data stream which can be processed by bolts. The directed edges in the topology represent the data flow from spouts to bolts or from bolts to bolts.

In order to execute a topology, it is send to the master node of the Storm cluster which runs a daemon called "Nimbus". This daemon distributes the single tasks across the cluster. On all machines of the cluster runs a special daemon called "Supervisor" which receives the tasks assigned to him and executes them in one or more worker processes. The coordination between the Nimbus and the Supervisors is done through a ZooKeeper cluster.



Figure 1 gives an overview of the cluster which will be set up during this tutorial. It consists of four Ubuntu machines. Each of them is running a Supervisor daemon. The Nimbus daemon is running on the first machine and the ZooKeeper server on the second.

This tutorial was tested with the following software versions:
 * Ubuntu Linux 12.04.3
 * Java version 1.8.0-ea
 * ZooKeeper 3.4.5
 * ZeroMQ 2.1.7
 * Storm 0.8.2

Prerequisites
The following required software has to be installed on all machines.

Java 8
In order to install Java, you have to install the two further packages, first. Therefore, execute: sudo apt-get install software-properties-common sudo apt-get install python-software-properties

Then, the repository containing a built of the Oracle Java 8 has to be registered and the local list of the contained software packages has to be updated: sudo add-apt-repository ppa:webupd8team/java sudo apt-get update

Finally, Java 8 can be installed: sudo apt-get install oracle-java8-installer

SSH Server
SSH is not required for running Storm but it simplifies the installation and running of a Storm cluster because you get remote access. Therefore, an SSH server should run on all machines.

First check if SSH is already running by executing ssh address.of.machine for each machine which should be involved in the cluster. is the address of a machine. If the login on one of the machines fails, you have to install an ssh server on that machine by following the instructions on http://www.cyberciti.biz/faq/ubuntu-linux-openssh-server-installation-and-configuration/

Install ZooKeeper
In the current setting the ZooKeeper server should run on Machine02. Therefore, we first connect to this machine by executing: ssh userName@address.of.machine02 is the name of the user which should be used to log in on machine02. If the user name is the same as on the machine which wants to get remote access, then the following command would be enough: ssh address.of.machine02

First, a folder has to be created which will be used from ZooKeeper to store data and create log files: mkdir -p /path/to/zookeeper/data In this this tutorial, this path is referred to by the following place holder.

Now, ZooKeeper can be downloaded wget http://mirror.synyx.de/apache/zookeeper/stable/zookeeper-3.4.5.tar.gz and extracted tar -xzf zookeeper-3.4.5.tar.gz

In order to configure the ZooKeeper server we first create a configuration file by copying a sample configuration. cp zookeeper-3.4.5/conf/zoo_sample.cfg zookeeper-3.4.5/conf/zoo.cfg

This newly created configuration file zoo.cfg is edited by changing the value of  to the new path for the ZooKeeper data. dataDir=/path/to/zookeeper/data

Now, the installation of ZooKeeper is complete. The ZooKeeper server can be started by executing ./zookeeper-3.4.5/bin/zkServer.sh start

In order to check, if the server is running correctly, the following commands can be executed: echo ruok | nc addressOfServer 2181 echo stat | nc addressOfServer 2181

To shut down the server again simply type ./zookeeper-3.4.5/bin/zkServer.sh stop

Install ZeroMQ
The communication between the spouts and bolts of Storm is done with ZeroMQ. ZeroMQ is a socket library that transports messages between processes on the same or on different machines. First the native C-library of ZerMQ has to be built and installed. Thereafter, the Java library JMQ has to be built and installed which calls the native ZerMQ library. These steps have to be performed on all machines.

Build and install native ZeroMQ
In order to build ZeroMQ the uuid-dev development environment has to be installed. sudo apt-get install uuid-dev

Now, the source code of ZeroMQ can be downloaded wget http://download.zeromq.org/zeromq-2.1.7.tar.gz and extracted tar -xzf zeromq-2.1.7.tar.gz

After extracting ZeroMQ, it can be built and installed. Therefore, the previously extracted folder has to be entered cd zeromq-2.1.7 the build environment has to be configured ./configure the library has to be build make and finally installed sudo make install

Now, the current folder can be left. cd ..

Build and install native JZMQ
JZMQ requires several further programms: sudo apt-get install git sudo apt-get install pkg-config sudo apt-get install libtool
 * a program to download the source code from a remote repository
 * a program which provides a common interface to query installed libraries
 * a program for creating portable compiled libraries
 * and a program which creates portable makefiles: automake

The current version of the latter one could not be used, because it produces several errors when building JZMQ. In the setting described in this tutorial the version 1.11.1 works well. This version has to be downloaded wget http://launchpadlibrarian.net/60345882/automake_1.11.1-1ubuntu1_all.deb and installed sudo dpkg --install automake_1.11.1-1ubuntu1_all.deb The latter command produces an error caused by missing dependencies. These missing libraries can be installed by executing sudo apt-get -f install In order to prevent automake from automated updates, simply type echo "package hold" | sudo dpkg --set-selections

Now, JMQ can be build and installed. Therefore, a remote repository has to be cloned which contains a version of JMQ that is compatible with Storm: git clone https://github.com/nathanmarz/jzmq.git The newly created folder is entered cd jzmq Thereafter, JZMQ can be built ./autogen.sh ./configure make and installed sudo make install

Even if there is no error, the installation is incomplete. In order to make JZMQ work properly with Storm, a missing jar has to be created cd /src jar -cvf zmq.jar ./org/ sudo cp zmq.jar /usr/share/java/zmq.jar cd ../..

Furthermore, some libraries have been stored at a position where Ubuntu does not find them. Theses libraries have to be copied to the correct position, now. sudo cp jzmq/perf/zmq-perf.jar /usr/share/java/zmq-perf.jar sudo cp /usr/local/lib/. /usr/lib -R

Finally, Ubuntu has to reload all libraries. sudo ldconfig

Install Storm
After finishing the installation of ZeroMQ, Storm can be installed on all machines.

Storm is distributed in zip archives. In order to extract them, unzip has to be installed. sudo apt-get install unzip

Similar to the installation of ZooKeeper, a folder has to be created which will be used from Storm to store data: mkdir -p /path/to/storm/data In this this tutorial, this path is referred to by the following place holder.

Now, Storm can be downloaded wget https://dl.dropboxusercontent.com/s/fl4kr7w0oc8ihdw/storm-0.8.2.zip and extracted unzip -q storm-0.8.2.zip

The extracted folder  contains the folder logs in which the log files will be written, bin which contains the scripts to run Storm, and conf which contains configuration files for Storm. In order to set up Storm correctly, the file  has to be modified: storm.zookeeper.servers: - "address.of.machine02" nimbus.host: "address.of.machine01" storm.local.dir: "/path/to/storm/data" In the first two lines the address of the used ZooKeeper server is defined, which is run on Machine02 in this tutorial. In the third line the address of the Nimbus is set. Finally, the position where Storm should store its data is defined in the last line. Now, the Storm cluster is ready to start.

Run Storm
Before the Storm cluster can be started, the ZooKeeper server has to be run. Therefore, the following command has to be executed on Machine02 : ./zookeeper-3.4.5/bin/zkServer.sh start

The following commands, which will start the parts of the Storm cluster, are not executed in the background. This means that for each of them a new terminal is required. The programs can be stopped by pressing Ctrl + C.

The Storm cluster is started, by running the Nimbus on Machine01, first: ./storm-0.8.2/bin/storm nimbus Thereafter, a web server can be started on the same machine as the Nimbus. It provides a user interface for the storm cluster, which can be accessed under the URL  by a web browser. ./storm-0.8.2/bin/storm ui Finally, the Supervisors have to be started on all machines. ./storm-0.8.2/bin/storm supervisor

Execute a topology
When a Storm topology has been created (see ) it has to be packed into a jar. All depending Java classes have to be included in this jar. This means, that other jars have to be unjared and the extracted classed integrated into the jar. Classes which are already part of Storm, must not be included in the jar.

The previously created topology can be executed by performing the following command on the same machine as the Nimbus: ./storm-0.8.2/bin/storm jar path/to/topology.jar qualified.name.of.MainClass aUniqueName

The topology can be stopped by executing ./storm-0.8.2/bin/storm kill aUniqueName or using the user interface.

Further links
http://storm-project.net/

https://github.com/nathanmarz/storm/wiki

http://www.michael-noll.com/tutorials/running-multi-node-storm-cluster/

I. Bedini and S. Sakr and B. Theeten and A. Sala and P. Cogan "Modeling Performance of a Parallel Streaming Engine: Bridging Theory and Costs" Proceedings of the 4th ACM/SPEC International Conference on Performance Engineering (ICPE '13), 2013, Pages 173-184, ACM New York, NY, USA