Monday, August 07, 2006

Nutch 0.8.1 howto

The following tutorial explains how to set up Nutch 0.8.1 in a clustered environment.

Assumptions
  • Machine layout:
    • we have two machines:
      1. searchCluster1: acts as namenode and datanode
      2. searchCluster2: acts as datanode.
  • Linux FC5.
  • User: root
  • Filesystem layout:
    • /home/search: Nutch home
    • /home/filesystem: Nutch Distributed File System home.
  • enable key-based SSH authentication between the namenode and all datanodes, using an empty passphrase, so you can reach the other machines without being asked for a password (see the example commands after this list).
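A minimal sketch of the key setup, run as root on searchCluster1 (repeat the copy step for every datanode; if ssh-copy-id is not available on your system, append the public key to ~/.ssh/authorized_keys by hand):

# generate a key pair with an empty passphrase
ssh-keygen -t rsa -P "" -f ~/.ssh/id_rsa
# push the public key to each datanode (and to the namenode itself)
ssh-copy-id root@searchCluster2
ssh-copy-id root@searchCluster1
# verify: this should log you in without a password prompt
ssh root@searchCluster2 hostname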
Requirements
It is assumed that you have the following installed:
  • JDK 1.5
  • ANT 1.6.5
Step By Step
Follow along :)

  • get the latest release of Nutch from http://lucene.apache.org/nutch
  • export JAVA_HOME, ANT_HOME, and NUTCH_HOME (which in our case is /home/search) and add them to your path - you know how, right? :)
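For example (the JDK and Ant locations below are only illustrative; point them at your own installations):

export JAVA_HOME=/home/apps/jdk5
export ANT_HOME=/home/apps/ant
export NUTCH_HOME=/home/search
export PATH=$PATH:$JAVA_HOME/bin:$ANT_HOME/bin:$NUTCH_HOME/bin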
  • untar Nutch under /home/ and rename the resulting directory to "search"
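A sketch of that step, assuming the release tarball is named nutch-0.8.1.tar.gz (adjust to the file you actually downloaded):

cd /home
tar xzf nutch-0.8.1.tar.gz
mv nutch-0.8.1 search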
  • Configuration: your configuration files should look like this
    • conf/nutch-site.xml:
      • override properties from conf/nutch-default.xml to suit your needs; don't forget to set http.agent.name, since it cannot be empty.
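A minimal example of such an override (the agent name is just a placeholder - pick something that identifies your crawler):

<property>
<name>http.agent.name</name>
<value>MyNutchCrawler</value>
<description>
Identifies the crawler in the HTTP User-Agent header; must not be empty.
</description>
</property>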
    • conf/hadoop-site.xml: it will look like the following
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
<property>
<name>fs.default.name</name>
<value>searchCluster1:9000</value>
<description>
The name of the default file system. Either the literal string
"local" or a host:port for NDFS.
</description>
</property>

<property>
<name>mapred.job.tracker</name>
<value>searchCluster1:9001</value>
<description>
The host and port that the MapReduce job tracker runs at. If
"local", then jobs are run in-process as a single map and
reduce task.
</description>
</property>

<property>
<name>mapred.map.tasks</name>
<value>4</value>
<description>
define mapred.map.tasks to be the number of slave hosts
</description>
</property>

<property>
<name>mapred.reduce.tasks</name>
<value>4</value>
<description>
define mapred.reduce.tasks to be the number of slave hosts
</description>
</property>

<property>
<name>dfs.name.dir</name>
<value>/home/filesystem/name</value>
</property>

<property>
<name>dfs.data.dir</name>
<value>/home/filesystem/data</value>
</property>
<property>
<name>mapred.system.dir</name>
<value>/home/filesystem/mapreduce/system</value>
</property>

<property>
<name>mapred.local.dir</name>
<value>/home/filesystem/mapreduce/local</value>
</property>

<property>
<name>dfs.replication</name>
<value>3</value>
</property>

</configuration>

    • conf/hadoop-env.sh: it will look like the following:
export HADOOP_HOME=/home/search
export JAVA_HOME=/home/apps/jdk5
export HADOOP_LOG_DIR=${HADOOP_HOME}/logs
export HADOOP_SLAVES=${HADOOP_HOME}/conf/slaves

    • in conf/slaves, add the datanode hostnames, one host per line.
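For the two-machine layout described in the assumptions, conf/slaves would simply contain:

searchCluster1
searchCluster2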
    • edit conf/crawl-urlfilter.txt to suit your needs; all the necessary instructions are in the file.
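For example, to restrict the crawl to your own site you would add an accept rule along these lines (yoursiteaddress is a placeholder for your domain):

+^http://([a-z0-9]*\.)*yoursiteaddress/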
  • create the logs directory: [NUTCH_HOME]$ mkdir logs
  • in NUTCH_HOME, run ant to build Nutch
  • when the build completes successfully, copy /home/search and /home/apps from the namenode to all datanodes, keeping the same directory structure.
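For our two-machine setup that boils down to something like this (key-based SSH from the assumptions makes it non-interactive):

scp -r /home/search /home/apps root@searchCluster2:/home/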
  • format your DFS by typing: ./bin/hadoop namenode -format
  • if you get a confirmation that the namenode has been formatted, then you are ready to start.
  • start the DFS by typing: ./bin/start-dfs.sh
  • add a seed URL file to your DFS for Nutch to start crawling from. OK, follow:
    • mkdir urls
    • echo http://yoursiteaddress/ > urls/urls.txt
    • ./bin/hadoop dfs -put urls urls
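A quick way to confirm the seed directory landed in DFS (since we run as root, it ends up under /user/root/urls):

./bin/hadoop dfs -ls urls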
  • restart everything by typing: ./bin/stop-dfs.sh and then: ./bin/start-all.sh
  • watch the logs; if you see no exceptions, then everything is OK.
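One way to keep an eye on them (the exact filenames include the user, daemon, and hostname, so yours may differ):

tail -f logs/hadoop-root-namenode-searchCluster1.log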
  • start the crawler as a test: ./bin/nutch crawl urls -dir crawled -depth 3 -topN 10
  • when it completes successfully, you can move on to searching.
  • build the Nutch war file by typing: [NUTCH_HOME]$ ant war
  • move nutch-0.8.1.war to [TOMCAT_HOME]/webapps/
  • rename it to search.war so that Tomcat deploys it under webapps/search/ (see the sketch below)
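In shell terms, assuming ant left the war under build/ (adjust the path if your build puts it elsewhere):

cp build/nutch-0.8.1.war [TOMCAT_HOME]/webapps/search.war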
  • start Tomcat.
  • go to webapps/search/WEB-INF/classes/nutch-site.xml and add the following
<property>
<name>fs.default.name</name>
<value>searchCluster1:9000</value>
<description>
The name of the default file system. Either the literal string
"local" or a host:port for NDFS.
</description>
</property>

<property>
<name>searcher.dir</name>
<value>/user/root/crawled</value>
</property>

  • restart Tomcat and go to http://searchCluster1:8080/search/

- Nice dreams ;)
