Monday, August 07, 2006

Nutch 0.8.1 howto

The following tutorial explains how to set up Nutch 0.8.1 in a clustered environment.

Assumptions
  • Machines Layout:
    • we have two machines:
      1. searchCluster1: acts as both namenode and datanode.
      2. searchCluster2: acts as a datanode.
  • Linux FC5.
  • User: root
  • Filesystem layout:
    • /home/search: nutch home
    • /home/filesystem: nutch Distributed File System Home.
  • Key-based SSH authentication is enabled from the namenode to all datanodes, using a key with an empty passphrase, so that you can log in to the other machines without being asked for a password.
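
The key-based login assumed above can be set up roughly as follows. This is only a sketch: the function names make_key and push_key and the KEY_FILE variable are made up for illustration, and it assumes a standard OpenSSH client on the namenode.

```shell
# Sketch: passphrase-less key on the namenode, pushed to each datanode.
# KEY_FILE, make_key and push_key are illustrative names, not part of Nutch.
KEY_FILE="${KEY_FILE:-$HOME/.ssh/id_rsa}"

# Create the key pair only if it does not exist yet (-N "" = empty passphrase).
make_key() {
    mkdir -p "$(dirname "$KEY_FILE")"
    [ -f "$KEY_FILE" ] || ssh-keygen -q -t rsa -N "" -f "$KEY_FILE"
}

# Append the public key to authorized_keys on one host.
# This asks for the root password once; later logins should not.
push_key() {
    cat "$KEY_FILE.pub" | ssh "root@$1" \
        'mkdir -p ~/.ssh && chmod 700 ~/.ssh && cat >> ~/.ssh/authorized_keys && chmod 600 ~/.ssh/authorized_keys'
}
```

On searchCluster1 you would run make_key, then push_key searchCluster2, and check the result with ssh root@searchCluster2 hostname. Since searchCluster1 is also a datanode, it probably needs push_key searchCluster1 as well.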
Requirements
The following are assumed to be installed:
  • JDK 1.5
  • ANT 1.6.5
Step By Step
Follow along :)

  • Get the latest release of Nutch from http://lucene.apache.org/nutch
  • Export JAVA_HOME, ANT_HOME, and NUTCH_HOME (in our case /home/search) in your environment - you know how, right? :)
  • Untar Nutch under /home/ and rename the extracted directory to "search".
  • Configuration: your configuration files should look like this:
    • conf/nutch-site.xml:
      • Override properties from conf/nutch-default.xml to suit your needs; don't forget to set http.agent.name, since it cannot be empty.
    • conf/hadoop-site.xml: it will look like the following:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
<property>
<name>fs.default.name</name>
<value>searchCluster1:9000</value>
<description>
The name of the default file system. Either the literal string
"local" or a host:port for NDFS.
</description>
</property>

<property>
<name>mapred.job.tracker</name>
<value>searchCluster1:9001</value>
<description>
The host and port that the MapReduce job tracker runs at. If
"local", then jobs are run in-process as a single map and
reduce task.
</description>
</property>

<property>
<name>mapred.map.tasks</name>
<value>4</value>
<description>
define mapred.map tasks to be number of slave hosts
</description>
</property>

<property>
<name>mapred.reduce.tasks</name>
<value>4</value>
<description>
define mapred.reduce tasks to be number of slave hosts
</description>
</property>

<property>
<name>dfs.name.dir</name>
<value>/home/filesystem/name</value>
</property>

<property>
<name>dfs.data.dir</name>
<value>/home/filesystem/data</value>
</property>
<property>
<name>mapred.system.dir</name>
<value>/home/filesystem/mapreduce/system</value>
</property>

<property>
<name>mapred.local.dir</name>
<value>/home/filesystem/mapreduce/local</value>
</property>

<property>
<name>dfs.replication</name>
<value>3</value>
</property>

</configuration>

    • conf/hadoop-env.sh: it will look like the following:
export HADOOP_HOME=/home/search
export JAVA_HOME=/home/apps/jdk5
export HADOOP_LOG_DIR=${HADOOP_HOME}/logs
export HADOOP_SLAVES=${HADOOP_HOME}/conf/slaves

    • In conf/slaves, add the datanode hostnames, one host per line.
    • Edit conf/crawl-urlfilter.txt to suit your needs; all the necessary instructions are in the file itself.
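
For the two-machine layout assumed in this tutorial, conf/slaves could be written like this. It is a small sketch meant to be run from NUTCH_HOME; both hosts are listed because both act as datanodes.

```shell
# Run from NUTCH_HOME (/home/search in this tutorial).
mkdir -p conf

# One datanode hostname per line; both machines run a datanode here.
cat > conf/slaves <<'EOF'
searchCluster1
searchCluster2
EOF

# Count the non-empty lines, i.e. the number of slave hosts.
grep -c . conf/slaves   # prints 2
```

The descriptions in the hadoop-site listing above tie mapred.map.tasks and mapred.reduce.tasks to the number of slave hosts, so this count is the figure to compare those values against.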
  • Create the logs directory: [NUTCH_HOME]$ mkdir logs
  • In NUTCH_HOME, type: ant
  • When the build completes without errors, copy /home/search and /home/apps from the namenode to all datanodes, keeping the same directory structure.
  • Type: ./bin/start-dfs.sh
  • Format your DFS by typing: ./bin/hadoop namenode -format
  • If you get a confirmation that the namenode has been formatted, you are ready to start.
  • Add a root text file to your DFS containing the URLs Nutch should start crawling from:
    • mkdir urls
    • echo http://yoursiteaddress/ > urls/urls.txt
    • ./bin/hadoop dfs -put urls urls
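
If you want more than one start page, the three commands above can be replaced by something like the sketch below. The seed file holds one URL per line; the second URL is a made-up example, and the upload only runs where the Nutch scripts are actually present.

```shell
# Run from NUTCH_HOME. Build the seed list locally, one URL per line.
mkdir -p urls
cat > urls/urls.txt <<'EOF'
http://yoursiteaddress/
http://yoursiteaddress/docs/
EOF

# Upload the directory to DFS and list it to confirm it arrived.
if [ -x ./bin/hadoop ]; then
    ./bin/hadoop dfs -put urls urls
    ./bin/hadoop dfs -ls urls
fi
```

A relative DFS path such as urls should resolve under the current user's DFS home directory, so for root it ends up as /user/root/urls. That is the same place the crawl output goes, which is why searcher.dir later points at /user/root/crawled.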
  • Restart everything: type ./bin/stop-dfs.sh, then ./bin/start-all.sh
  • Watch the logs; if there are no exceptions, everything is OK.
  • Start the crawler as a test: ./bin/nutch crawl urls -dir crawled -depth 3 -topN 10
  • When it completes successfully, you can move on to search.
  • Build the Nutch war file by typing: [NUTCH_HOME]$ ant war
  • Move nutch-0.8.1.war to [TOMCAT_HOME]/webapps/
  • Rename it to search.war, so that it is deployed as /search
  • Start Tomcat.
  • Open webapps/search/WEB-INF/classes/nutch-site.xml and add the following inside the <configuration> element:
<property>
<name>fs.default.name</name>
<value>searchCluster1:9000</value>
<description>
The name of the default file system. Either the literal string
"local" or a host:port for NDFS.
</description>
</property>

<property>
<name>searcher.dir</name>
<value>/user/root/crawled</value>
</property>

  • restart tomcat and go to http://searchCluster1:8080/search/
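
If you prefer checking from a terminal before opening a browser, a quick smoke test could look like this. It is a sketch that assumes curl is installed; check_search is just an illustrative helper name.

```shell
# Print the HTTP status code of the search front page.
# The host defaults to searchCluster1; pass another host as the first argument.
check_search() {
    curl -s -o /dev/null -w '%{http_code}\n' "http://${1:-searchCluster1}:8080/search/"
}
```

Calling check_search should print 200 once Tomcat has deployed the webapp; curl prints 000 when it cannot connect at all.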

- Nice dreams ;)