setupZookeeperHadoopHbaseTomcatSolrNutch

Sylvain Rocheleau edited this page Aug 20, 2014 · 20 revisions

Overview

  • ZooKeeper is a distributed coordination service for distributed applications.
  • Hadoop is a framework that supports data-intensive distributed applications.
  • HBase is a non-relational, distributed database that uses both ZooKeeper and Hadoop.
  • Tomcat is a Java web server and servlet container.
  • Solr is a Java servlet on top of Apache Lucene, providing distributed search and index replication.
  • Nutch is a web crawler that stores data in HBase and feeds data into Solr.

Since HBase requires ZooKeeper and Hadoop, make sure to start those before starting HBase, and make sure you close HBase before you close ZooKeeper or Hadoop.
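The ordering above can be sketched end to end like this (a sketch, assuming the directory layout used later in this tutorial):

```shell
# Start: ZooKeeper and Hadoop first, HBase last
$ zookeeper-3.4.5/bin/zkServer.sh start
$ hadoop-0.20.204/bin/start-dfs.sh
$ hadoop-0.20.204/bin/start-mapred.sh
$ hbase-0.90.4/bin/start-hbase.sh

# Stop: reverse order, HBase first
$ hbase-0.90.4/bin/stop-hbase.sh
$ hadoop-0.20.204/bin/stop-mapred.sh
$ hadoop-0.20.204/bin/stop-dfs.sh
$ zookeeper-3.4.5/bin/zkServer.sh stop
```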

Download

Create a new directory, download these files and extract them. We will call this directory techStackSearch in this tutorial.

  • ZooKeeper: ZooKeeper 3.4.5 in this tutorial.
  • Hadoop: Hadoop 0.20.204 in this tutorial. Make sure to grab a binary release.
  • HBase: HBase 0.90.4 in this tutorial.
  • Solr: Solr 4.3.1 in this tutorial. (Solr 4.4.0 should be installable as well with the steps described in this tutorial.)
  • Nutch: Nutch 2.2.1 in this tutorial.

Be careful, only certain versions of these programs work together! As of this writing, Nutch 2.2.1 is the most recent stable release; it requires Solr 4.3.1 and HBase 0.90.4, which in turn only seems to work with Hadoop 0.20.204.

ZooKeeper

Create a copy of the sample configuration:

$ cd /path/to/techStackSearch/zookeeper-3.4.5/conf
$ cp zoo_sample.cfg zoo.cfg

You can leave the configuration as is; you only need to specify a dataDir. Edit zoo.cfg:

dataDir=/path/to/techStackSearch/data/zookeeper/

Make sure to create that directory:

$ mkdir -p /path/to/techStackSearch/data/zookeeper

After this you should be able to start ZooKeeper:

$ cd /path/to/techStackSearch/zookeeper-3.4.5/bin
$ ./zkServer.sh start

You can stop ZooKeeper again with ./zkServer.sh stop.

To check if ZooKeeper is running, you can do simple, file-like operations with

$ ./zkCli.sh localhost:2181
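Inside the CLI you can browse and manipulate the znode tree much like a filesystem. An illustrative session (the znode /test and its data are arbitrary examples; a fresh install only contains the /zookeeper znode):

```shell
[zk: localhost:2181(CONNECTED) 0] ls /
[zookeeper]
[zk: localhost:2181(CONNECTED) 1] create /test hi
[zk: localhost:2181(CONNECTED) 2] get /test
[zk: localhost:2181(CONNECTED) 3] delete /test
[zk: localhost:2181(CONNECTED) 4] quit
```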

Logging seems to happen in /path/to/techStackSearch/zookeeper-3.4.5/bin/zookeeper.out.

todo: how do we interact with zookeeper?

Hadoop

We are going to set up Hadoop in pseudo-distributed operation.

ssh needs to be installed and sshd needs to be running for the Hadoop scripts to access remote Hadoop daemons. Make sure your SSH server is running and try to SSH into your local machine:

$ ssh localhost

If you are running an Ubuntu system, you should be able to install an SSH server by executing sudo apt-get install openssh-server, following http://www.cyberciti.biz/faq/ubuntu-linux-openssh-server-installation-and-configuration/

To ssh into localhost without a passphrase, execute (maybe not a good idea for production servers):

$ ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
$ cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys

Set JAVA_HOME in /path/to/techStackSearch/hadoop-0.20.204/conf/hadoop-env.sh. This is what it looks like for me:

export JAVA_HOME=/usr/lib/jvm/java-7-openjdk

Go to /path/to/techStackSearch/hadoop-0.20.204/conf and set the following configuration (make sure not to use ~ (tilde) in any configuration file; replace it with /home/yourusername):

core-site.xml:

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/path/to/techStackSearch/data/hadoop/tmp</value>
  </property>
  <property>
    <name>fs.checkpoint.dir</name>
    <value>/path/to/techStackSearch/data/hadoop/snn</value>
  </property>
</configuration>

hdfs-site.xml:

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.name.dir</name>
    <value>/path/to/techStackSearch/data/hadoop/nn</value>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>/path/to/techStackSearch/data/hadoop/dn</value>
  </property>
  <property>
    <name>dfs.permissions.supergroup</name>
    <value>hadoop</value>
  </property>
</configuration>

mapred-site.xml:

<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
  <property>
    <name>mapred.local.dir</name>
    <value>/path/to/techStackSearch/data/hadoop/cache/mapred-local</value>
  </property>
  <property>
    <name>mapred.temp.dir</name>
    <value>/path/to/techStackSearch/data/hadoop/cache/mapred-tmp</value>
  </property>
  <property>
    <name>mapred.system.dir</name>
    <value>/path/to/techStackSearch/data/hadoop/mapred-system</value>
  </property>
</configuration>
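The configuration files above reference several local directories. Hadoop does not always create these itself, so it can help to create them up front. A minimal sketch; the BASE path is an assumption, adjust it to wherever your techStackSearch directory actually lives:

```shell
# Base directory of this tutorial's setup -- an assumed location, adjust as needed.
BASE="$HOME/techStackSearch"

# Create every local directory referenced in core-site.xml, hdfs-site.xml
# and mapred-site.xml above.
mkdir -p "$BASE/data/hadoop/tmp" \
         "$BASE/data/hadoop/snn" \
         "$BASE/data/hadoop/nn" \
         "$BASE/data/hadoop/dn" \
         "$BASE/data/hadoop/cache/mapred-local" \
         "$BASE/data/hadoop/cache/mapred-tmp" \
         "$BASE/data/hadoop/mapred-system"
```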

Format a new distributed filesystem:

$ cd /path/to/techStackSearch/hadoop-0.20.204/bin
$ ./hadoop namenode -format

Attention: If you format an existing namenode, you will be asked whether you are sure you want to override it. Answer with a capital Y or N; otherwise formatting will fail.

Start the Hadoop daemons:

$ ./start-dfs.sh
$ ./start-mapred.sh
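After starting, you can check which daemons came up with jps (part of the JDK). In pseudo-distributed mode you should see something like the following (the PIDs will differ):

```shell
$ jps
4823 NameNode
4912 DataNode
5001 SecondaryNameNode
5090 JobTracker
5179 TaskTracker
5268 Jps
```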

Stop them with:

$ ./stop-mapred.sh
$ ./stop-dfs.sh

You should be able to browse the web interfaces: the NameNode at http://localhost:50070/ and the JobTracker at http://localhost:50030/ (the default ports for this Hadoop version).

HBase

We are going to set up HBase in pseudo-distributed operation.

HBase expects the loopback IP address to be 127.0.0.1. Ubuntu and some other distributions default to 127.0.1.1, and this will cause problems for you. For more information see Why does HBase care about /etc/hosts?.

Edit /path/to/techStackSearch/hbase-0.90.4/conf/hbase-site.xml to tell HBase that it's running in pseudo-distributed mode:

<configuration>
  <property>
    <name>hbase.rootdir</name>
    <value>hdfs://localhost:9000/hbase</value>
  </property>
  <property>
    <name>hbase.cluster.distributed</name>
    <value>true</value>
  </property>
  <property>
    <name>hbase.zookeeper.quorum</name>
    <value>localhost</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>

We also need to tell HBase that we'd like to manage ZooKeeper ourselves. Edit /path/to/techStackSearch/hbase-0.90.4/conf/hbase-env.sh:

export HBASE_MANAGES_ZK=false

Similar to Hadoop earlier, we have to set JAVA_HOME for HBase. In the same file:

export JAVA_HOME=/usr/lib/jvm/java-7-openjdk

Make sure ZooKeeper and Hadoop are running. Start HBase like this:

$ cd /path/to/techStackSearch/hbase-0.90.4/bin
$ ./start-hbase.sh

To check if HBase is running, run ./hbase shell and type list. If HBase is up you should get a response pretty quickly (0 row(s) in 0.2860 seconds). If not, you might find this message in your logs:

2013-08-09 12:24:32,042 FATAL org.apache.hadoop.hbase.master.HMaster: Unhandled exception. Starting shutdown.
java.io.IOException: Call to localhost/127.0.0.1:9000 failed on local exception: java.io.EOFException

This is most likely because HBase ships with old Hadoop client libraries. Replace them with the ones from your Hadoop installation:

$ cd /path/to/techStackSearch/hbase-0.90.4/lib/
$ rm hadoop-core*
$ cp /path/to/techStackSearch/hadoop-0.20.204.0/hadoop-core* /path/to/techStackSearch/hadoop-0.20.204.0/lib/commons-configuration* ./
$ chmod +x hadoop-core* commons-configuration*
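After restarting HBase you can verify that writes actually go through by creating a throwaway table. An illustrative shell session (the table and column family names are arbitrary):

```shell
$ ./hbase shell
hbase> create 'testtable', 'cf'
hbase> put 'testtable', 'row1', 'cf:greeting', 'hello'
hbase> scan 'testtable'
hbase> disable 'testtable'
hbase> drop 'testtable'
```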

You can consult technologyHBase for troubleshooting, especially if you have trouble shutting down HBase.

You can stop HBase again with ./stop-hbase.sh.

ATTENTION: Often ./stop-hbase.sh will fail to stop HBase. If this happens, consider editing stop-hbase.sh so that

while kill -0 `cat $pid` > /dev/null 2>&1; do

becomes

while kill -9 `cat $pid` > /dev/null 2>&1; do

Tomcat

Set up Tomcat however is appropriate for your system.

You might want to take a look at Sebastian's Guide.

Solr

We need to set up logging for our Solr servlet:

$ cp /path/to/techStackSearch/solr-4.3.1/example/lib/ext/* /usr/share/tomcat7/lib
$ cp /path/to/techStackSearch/solr-4.3.1/example/resources/log4j.properties /usr/share/tomcat7/lib

You also need to correct the logging path in the log4j.properties file. Change log4j.appender.file.File to a file you want Solr to log to (Note: on my system only absolute paths seem to work):

log4j.appender.file.File=/usr/share/tomcat7/logs/solr.log

We need to create a new directory to contain our Solr servlet.

$ mkdir /somewhere/solr
$ cp -r /path/to/techStackSearch/solr-4.3.1/example/multicore /somewhere/solr/solr
$ cp /path/to/techStackSearch/solr-4.3.1/example/webapps/solr.war /somewhere/solr

Make sure this directory is accessible by the user running your Tomcat server. So if that user is called tomcat7 and resides in user group tomcat7, you would do:

sudo chown tomcat7:tomcat7 /somewhere/solr

Also, /somewhere/solr shouldn't be under your home directory, because even if you give the Tomcat user read access there, it won't be able to open those files.

Finally we need to tell Tomcat where our Servlet resides. Create /usr/share/tomcat7/conf/Catalina/localhost/solr.xml (depending on your system this might also be /var/lib/tomcat7/conf/Catalina/localhost/solr.xml or similar):

<Context docBase="/somewhere/solr/solr.war" debug="0" crossContext="true">
  <Environment name="solr/home" type="java.lang.String" value="/somewhere/solr/solr" override="true" />
</Context>

Before we continue we need to prepare Solr for Nutch storage. Download the Nutch schema.xml and overwrite /somewhere/solr/solr/core0/conf/schema.xml. Lastly, create a few files this schema needs for configuration:

$ cd /somewhere/solr/solr/core0/conf
$ touch stopwords.txt
$ touch synonyms.txt
$ touch protwords.txt

You should now be able to start Tomcat and access Solr at http://localhost:8080/solr/.
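You can also query the core from the command line; an empty index should report numFound 0. This assumes the multicore layout with core0 copied above:

```shell
$ curl 'http://localhost:8080/solr/core0/select?q=*:*&wt=json'
```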

TODO: Check how Solr and Zookeeper interact.

TODO: Set up Solr in cloud mode.

Nutch

We need to set up a name for our web crawler. We also need to tell Nutch to use HBase as its Gora datastore backend. Edit /path/to/techStackSearch/apache-nutch-2.2.1/conf/nutch-site.xml:

<configuration>
  <property>
    <name>http.agent.name</name>
    <value>your-crawler-name</value>
  </property>
  <property>
    <name>storage.data.store.class</name>
    <value>org.apache.gora.hbase.store.HBaseStore</value>
    <description>Default class for storing data</description>
  </property>
</configuration>

Change this line in /path/to/techStackSearch/apache-nutch-2.2.1/conf/gora.properties:

gora.datastore.default=org.apache.gora.hbase.store.HBaseStore

Open /path/to/techStackSearch/apache-nutch-2.2.1/ivy/ivy.xml. Scroll down to the section "Gora artifacts" and uncomment this line:

<dependency org="org.apache.gora" name="gora-hbase" rev="0.3" conf="*->default" />

Now we need to compile Nutch (since 2.x only source archives are available).

$ cd /path/to/techStackSearch/apache-nutch-2.2.1/
$ ant runtime

(This might take a long time the first time. On my machine it took 25 minutes.)
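After the build, the runnable scripts live under runtime/local. A first crawl cycle looks roughly like the following sketch; the seeds directory and its contents are assumptions, and with the multicore Solr setup above the Solr URL may need the core path (e.g. .../solr/core0/):

```shell
$ cd runtime/local
$ mkdir seeds
$ echo 'http://nutch.apache.org/' > seeds/urls.txt
$ bin/nutch inject seeds
$ bin/nutch generate -topN 10
$ bin/nutch fetch -all
$ bin/nutch parse -all
$ bin/nutch updatedb
$ bin/nutch solrindex http://localhost:8080/solr/ -all
```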
