setupZookeeperHadoopHbaseTomcatSolrNutch
- ZooKeeper is a distributed coordination service for distributed applications.
- Hadoop is a framework that supports data-intensive distributed applications.
- HBase is a non-relational, distributed database that runs on top of both ZooKeeper and Hadoop.
- Tomcat is a Java web server and servlet container.
- Solr is a Java servlet built on top of Apache Lucene, providing distributed search and index replication.
- Nutch is a web crawler that stores data in HBase and feeds data into Solr.
Since HBase requires ZooKeeper and Hadoop, make sure to start those before starting HBase, and make sure to shut HBase down before you shut down ZooKeeper or Hadoop.
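The start/stop ordering above can be sketched as a pair of helper functions. The root path and the versioned directory names are assumptions based on the versions used in this tutorial; adjust them to your install root.

```shell
# Sketch of helpers honoring the dependency order: ZooKeeper and Hadoop
# before HBase on the way up, the reverse on the way down.
# TS and the directory names below are assumptions; adjust as needed.
TS="${TS:-/path/to/techStackSearch}"

start_stack() {
  "$TS/zookeeper-3.4.5/bin/zkServer.sh" start     # ZooKeeper first
  "$TS/hadoop-0.20.204.0/bin/start-dfs.sh"        # then HDFS
  "$TS/hadoop-0.20.204.0/bin/start-mapred.sh"     # then MapReduce
  "$TS/hbase-0.90.4/bin/start-hbase.sh"           # HBase last
}

stop_stack() {
  "$TS/hbase-0.90.4/bin/stop-hbase.sh"            # HBase first
  "$TS/hadoop-0.20.204.0/bin/stop-mapred.sh"
  "$TS/hadoop-0.20.204.0/bin/stop-dfs.sh"
  "$TS/zookeeper-3.4.5/bin/zkServer.sh" stop      # ZooKeeper last
}
```

Call `start_stack` only after all of the configuration steps below are done.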
Create a new directory, download the following files and extract them there. We will call this directory techStackSearch in this tutorial.
- ZooKeeper: we use ZooKeeper 3.4.5 in this tutorial.
- Hadoop: we use Hadoop 0.20.204 in this tutorial. Make sure to grab a binary release.
- HBase: we use HBase 0.90.4 in this tutorial.
- Solr: we use Solr 4.3.1 in this tutorial. (Solr 4.4.0 should be installable as well with the steps described here.)
- Nutch: we use Nutch 2.2.1 in this tutorial.
Be careful, only certain versions of these programs work together! As of this writing Nutch 2.2.1 is the most recent stable release; it requires Solr 4.3.1 and HBase 0.90.4, which itself only seems to work with Hadoop 0.20.204.
Create a copy of the sample configuration:
$ cd /path/to/techStackSearch/zookeeper-3.4.5/conf
$ cp zoo_sample.cfg zoo.cfg
You can leave the configuration as is, you just need to specify a dataDir. Edit zoo.cfg:
dataDir=/path/to/techStackSearch/data/zookeeper/
Make sure to create that directory:
$ mkdir -p /path/to/techStackSearch/data/zookeeper
After this you should be able to start ZooKeeper:
$ cd /path/to/techStackSearch/zookeeper-3.4.5/bin
$ ./zkServer.sh start
You can stop ZooKeeper with ./zkServer.sh stop.
To check if ZooKeeper is running you can do simple, file-like operations with
$ ./zkCli.sh localhost:2181
Logging seems to happen in /path/to/techStackSearch/zookeeper-3.4.5/bin/zookeeper.out.
todo: how do we interact with zookeeper?
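As a partial answer to the todo above: zkCli.sh lets you create, read, list and delete znodes interactively. The snippet below sketches one round trip; the zkCli.sh path is an assumption, and the guard makes it a no-op when that path does not exist.

```shell
# Basic znode operations via the ZooKeeper CLI: create a node, read it
# back, list the root znodes, then delete it again. ZooKeeper must be
# running on localhost:2181; the path below is an assumption.
ZK_CLI=/path/to/techStackSearch/zookeeper-3.4.5/bin/zkCli.sh
if [ -x "$ZK_CLI" ]; then
  "$ZK_CLI" -server localhost:2181 <<'EOF'
create /demo hello
get /demo
ls /
delete /demo
EOF
fi
```

You can also send ZooKeeper's four-letter health check with `echo ruok | nc localhost 2181`; a healthy server answers `imok`.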
We are going to setup Hadoop in pseudo-distributed operation.
ssh needs to be installed and sshd needs to be running for the Hadoop scripts to access remote Hadoop daemons. Make sure your SSH server is running and try to SSH into your local machine:
$ ssh localhost
If you are running an Ubuntu system, you should be able to install an SSH server by executing sudo apt-get install openssh-server, following http://www.cyberciti.biz/faq/ubuntu-linux-openssh-server-installation-and-configuration/
To ssh into localhost without a passphrase, execute (maybe not a good idea for production servers):
$ ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
$ cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
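You can verify the key setup worked with a quick check; BatchMode makes ssh fail instead of prompting for a passphrase, so the result is unambiguous:

```shell
# Check whether passwordless ssh to localhost works. Guarded so it
# degrades gracefully when ssh is not installed.
if command -v ssh >/dev/null 2>&1; then
  if ssh -o BatchMode=yes -o ConnectTimeout=3 localhost true 2>/dev/null; then
    echo "passwordless ssh to localhost: OK"
  else
    echo "passwordless ssh to localhost: NOT working yet"
  fi
fi
```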
Set JAVA_HOME
in /path/to/techStackSearch/hadoop-0.20.204.0/conf/hadoop-env.sh, this is what it looks like for me:
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk
Go to /path/to/techStackSearch/hadoop-0.20.204.0/conf and set the following configuration. (Make sure not to use ~ (tilde) in any configuration file; replace it with /home/yourusername.)
core-site.xml:
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9000</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/path/to/techStackSearch/data/hadoop/tmp</value>
</property>
<property>
<name>fs.checkpoint.dir</name>
<value>/path/to/techStackSearch/data/hadoop/snn</value>
</property>
</configuration>
hdfs-site.xml:
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.name.dir</name>
<value>/path/to/techStackSearch/data/hadoop/nn</value>
</property>
<property>
<name>dfs.data.dir</name>
<value>/path/to/techStackSearch/data/hadoop/dn</value>
</property>
<property>
<name>dfs.permissions.supergroup</name>
<value>hadoop</value>
</property>
</configuration>
mapred-site.xml:
<configuration>
<property>
<name>mapred.job.tracker</name>
<value>localhost:9001</value>
</property>
<property>
<name>mapred.local.dir</name>
<value>/path/to/techStackSearch/data/hadoop/cache/mapred-local</value>
</property>
<property>
<name>mapred.temp.dir</name>
<value>/path/to/techStackSearch/data/hadoop/cache/mapred-tmp</value>
</property>
<property>
<name>mapred.system.dir</name>
<value>/path/to/techStackSearch/data/hadoop/mapred-system</value>
</property>
</configuration>
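The local directories referenced in the three files above can be created up front in one go. The root path below is a placeholder default for demonstration; substitute your own techStackSearch path.

```shell
# Create the local directories referenced in core-site.xml,
# hdfs-site.xml and mapred-site.xml. TS defaults to a scratch
# location; point it at your real install root instead.
TS="${TS:-/tmp/techStackSearch}"
mkdir -p \
  "$TS/data/hadoop/tmp" \
  "$TS/data/hadoop/snn" \
  "$TS/data/hadoop/nn" \
  "$TS/data/hadoop/dn" \
  "$TS/data/hadoop/cache/mapred-local" \
  "$TS/data/hadoop/cache/mapred-tmp" \
  "$TS/data/hadoop/mapred-system"
```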
Format a new distributed filesystem:
$ cd /path/to/techStackSearch/hadoop-0.20.204.0/bin
$ ./hadoop namenode -format
Attention: If you format an existing namenode, you will be asked whether you are sure you want to re-format. Answer with a capital Y or N; otherwise formatting will fail.
Start the Hadoop daemons:
$ ./start-dfs.sh
$ ./start-mapred.sh
Stop them with:
$ ./stop-mapred.sh
$ ./stop-dfs.sh
You should be able to browse the web interfaces:
- NameNode: http://localhost:50070/
- JobTracker: http://localhost:50030/
- TaskTracker: http://localhost:50060/
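Another quick sanity check is jps, which ships with the JDK and lists running Java processes; with everything up you should see processes such as NameNode, DataNode, JobTracker and TaskTracker:

```shell
# List the running Hadoop daemons, if any. Guarded in case jps is
# not on the PATH.
if command -v jps >/dev/null 2>&1; then
  jps | grep -E 'NameNode|DataNode|JobTracker|TaskTracker' \
    || echo "no Hadoop daemons found"
fi
```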
We are going to setup HBase in pseudo-distributed operation.
HBase expects the loopback IP address to be 127.0.0.1. Ubuntu and some other distributions default to 127.0.1.1, and this will cause problems for you. For more information see "Why does HBase care about /etc/hosts?".
Edit /path/to/techStackSearch/hbase-0.90.4/conf/hbase-site.xml to tell it that it's running in pseudo-distributed mode:
<configuration>
<property>
<name>hbase.rootdir</name>
<value>hdfs://localhost:9000/hbase</value>
</property>
<property>
<name>hbase.cluster.distributed</name>
<value>true</value>
</property>
<property>
<name>hbase.zookeeper.quorum</name>
<value>localhost</value>
</property>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>
Also we need to tell HBase that we'd like to manage ZooKeeper ourselves. Edit /path/to/techStackSearch/hbase-0.90.4/conf/hbase-env.sh
:
export HBASE_MANAGES_ZK=false
Similar to earlier with Hadoop, we have to set JAVA_HOME
for HBase. In the same file:
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk
Make sure ZooKeeper and Hadoop are running. Start HBase like this:
$ cd /path/to/techStackSearch/hbase-0.90.4/bin
$ ./start-hbase.sh
To check if HBase is running, run ./hbase shell and type list. If HBase is up you should get a response fairly quickly (0 row(s) in 0.2860 seconds). If not, you might find this message in your logs:
2013-08-09 12:24:32,042 FATAL org.apache.hadoop.hbase.master.HMaster: Unhandled exception. Starting shutdown.
java.io.IOException: Call to localhost/127.0.0.1:9000 failed on local exception: java.io.EOFException
This is most likely caused by HBase shipping with older Hadoop client libraries than the Hadoop version you are running. Replace them with the jars from your Hadoop installation:
$ cd /path/to/techStackSearch/hbase-0.90.4/lib/
$ rm hadoop-core*
$ cp /path/to/techStackSearch/hadoop-0.20.204.0/hadoop-core* /path/to/techStackSearch/hadoop-0.20.204.0/lib/commons-configuration* ./
$ chmod +x hadoop-core* commons-configuration*
You can consult technologyHBase for troubleshooting, especially if you have trouble shutting down HBase.
You can stop HBase again with ./stop-hbase.sh
.
ATTENTION: Often ./stop-hbase.sh will fail to stop HBase. If this happens, consider editing stop-hbase.sh so that the line
while kill -0 `cat $pid` > /dev/null 2>&1; do
becomes
while kill -9 `cat $pid` > /dev/null 2>&1; do
Set up Tomcat however is appropriate for your system.
You might want to take a look at Sebastian's Guide.
We need to set up logging for our Solr servlet:
$ cp /path/to/techStackSearch/solr-4.3.1/example/lib/ext/* /usr/share/tomcat7/lib
$ cp /path/to/techStackSearch/solr-4.3.1/example/resources/log4j.properties /usr/share/tomcat7/lib
You also need to correct the logging path in the log4j.properties
file. Change log4j.appender.file.File
to a file you want Solr to log to (Note: on my system only absolute paths seem to work):
log4j.appender.file.File=/usr/share/tomcat7/logs/solr.log
We need to create a new directory to contain our Solr servlet.
$ mkdir /somewhere/solr
$ cp -r /path/to/techStackSearch/solr-4.3.1/example/multicore /somewhere/solr/solr
$ cp /path/to/techStackSearch/solr-4.3.1/example/webapps/solr.war /somewhere/solr
Make sure this directory is accessible by the user running your Tomcat server. So if that user is called tomcat7 and resides in the group tomcat7, you would do:
sudo chown tomcat7:tomcat7 /somewhere/solr
Also, /somewhere/solr shouldn't be under your home directory, because even if you give the Tomcat user read access there, it won't be able to open those files.
Finally we need to tell Tomcat where our Servlet resides. Create /usr/share/tomcat7/conf/Catalina/localhost/solr.xml
(depending on your system this might also be /var/lib/tomcat7/conf/Catalina/localhost/solr.xml
or similar):
<Context docBase="/somewhere/solr/solr.war" debug="0" crossContext="true">
<Environment name="solr/home" type="java.lang.String" value="/somewhere/solr/solr" override="true" />
</Context>
Before we continue we need to prepare Solr for Nutch storage. Download the Nutch schema.xml and overwrite /somewhere/solr/solr/core0/conf/schema.xml. Lastly, create a few files this schema needs for configuration.
$ cd /somewhere/solr/solr/core0/conf
$ touch stopwords.txt
$ touch synonyms.txt
$ touch protwords.txt
You should now be able to start Tomcat and access Solr under http://localhost:8080/solr/.
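A quick way to verify the servlet is up without a browser, assuming curl is installed and Tomcat listens on the port above:

```shell
# Ping the Solr endpoint and print the HTTP status code (200 means up).
# Guarded: nothing happens if curl is missing; assumes Tomcat on 8080.
if command -v curl >/dev/null 2>&1; then
  curl -s -o /dev/null -w '%{http_code}\n' http://localhost:8080/solr/ \
    || echo "Solr not reachable"
fi
```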
TODO: Check how Solr and Zookeeper interact.
TODO: Set up Solr in cloud mode.
We need to set a name for our web crawler. We also need to tell Nutch to use HBase as its Gora datastore backend. Edit /path/to/techStackSearch/apache-nutch-2.2.1/conf/nutch-site.xml:
<configuration>
<property>
<name>http.agent.name</name>
<value>your-crawler-name</value>
</property>
<property>
<name>storage.data.store.class</name>
<value>org.apache.gora.hbase.store.HBaseStore</value>
<description>Default class for storing data</description>
</property>
</configuration>
Change this line in /path/to/techStackSearch/apache-nutch-2.2.1/conf/gora.properties
:
gora.datastore.default=org.apache.gora.hbase.store.HBaseStore
Open /path/to/techStackSearch/apache-nutch-2.2.1/ivy/ivy.xml, scroll down to the section Gora artifacts, and uncomment this line:
<dependency org="org.apache.gora" name="gora-hbase" rev="0.3" conf="*->default" />
Now we need to compile Nutch (since 2.x only source archives are available).
$ cd /path/to/techStackSearch/apache-nutch-2.2.1/
$ ant runtime
(This might take a long time the first time. On my machine it took 25 minutes.)
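After the build finishes, the runtime ends up in runtime/local. One crawl round looks roughly like the following sketch; the command names are from Nutch 2.x, but exact flags vary between versions (check bin/nutch for usage), and the seed URL and Solr URL are assumptions. The guard makes the snippet a no-op if the assumed runtime path is absent.

```shell
# Sketch of one Nutch 2.x crawl round, storing into HBase (via Gora)
# and indexing into Solr. Paths, seed URL and flags are assumptions.
NUTCH=/path/to/techStackSearch/apache-nutch-2.2.1/runtime/local/bin/nutch
if [ -x "$NUTCH" ]; then
  mkdir -p urls && echo "http://nutch.apache.org/" > urls/seed.txt
  "$NUTCH" inject urls                        # seed the crawl database
  "$NUTCH" generate -topN 50                  # select URLs to fetch
  "$NUTCH" fetch -all                         # fetch the selected URLs
  "$NUTCH" parse -all                         # parse fetched content
  "$NUTCH" updatedb                           # fold new links into the db
  "$NUTCH" solrindex http://localhost:8080/solr/ -all   # push to Solr
fi
```

ZooKeeper, Hadoop, HBase and Tomcat/Solr must all be running before starting a crawl.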