Assumptions
- Machines layout:
- We have two machines:
- searchCluster1: acts as both namenode and datanode
- searchCluster2: acts as datanode
- OS: Linux Fedora Core 5 (FC5)
- User: root
- Filesystem layout:
- /home/search: Nutch home
- /home/filesystem: Nutch Distributed File System (DFS) home
- Enable key-based SSH authentication between the namenode and all datanodes, with a blank passphrase, so you can reach the other machines without being asked for a password (see the sketch after this list).
It is assumed that you have the following installed:
- JDK 1.5
- Ant 1.6.5
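A minimal sketch of the key setup, assuming OpenSSH (run as root on the namenode; repeat ssh-copy-id for every host in conf/slaves, including searchCluster1 itself, since the start scripts ssh into each slave):

ssh-keygen -t rsa -P ""          # generate a key pair with an empty passphrase
ssh-copy-id root@searchCluster2  # or append ~/.ssh/id_rsa.pub to the remote authorized_keys
ssh root@searchCluster2          # should log you in with no password prompt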
Follow up :)
- Get the latest release of Nutch from http://lucene.apache.org/nutch
- Export JAVA_HOME, ANT_HOME, and NUTCH_HOME (which is in our case /home/search) and add them to your PATH - you know how, right? :) (see the sketch below)
- Untar Nutch at /home/ and rename the directory "search"
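A sketch of these steps, assuming bash; the Ant location and the exact tarball name are assumptions, so adjust them to your box:

tar xzf nutch-0.8.1.tar.gz -C /home/   # tarball name assumed from the release used later
mv /home/nutch-0.8.1 /home/search
export JAVA_HOME=/home/apps/jdk5
export ANT_HOME=/home/apps/ant         # assumed Ant install path
export NUTCH_HOME=/home/search
export PATH=$PATH:$JAVA_HOME/bin:$ANT_HOME/bin:$NUTCH_HOME/bin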
- Configuration: your configuration files should look like this:
- conf/nutch-site.xml:
- Override properties from conf/nutch-default.xml to suit your needs; don't forget to set http.agent.name, since it cannot be empty.
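A minimal conf/nutch-site.xml, where the agent name "mySearchAgent" is just a placeholder (pick your own):

<?xml version="1.0"?>
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>mySearchAgent</value>
  </property>
</configuration>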
- conf/hadoop-site.xml: it will look like the following:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>searchCluster1:9000</value>
    <description>
      The name of the default file system. Either the literal string
      "local" or a host:port for NDFS.
    </description>
  </property>
  <property>
    <name>mapred.job.tracker</name>
    <value>searchCluster1:9001</value>
    <description>
      The host and port that the MapReduce job tracker runs at. If
      "local", then jobs are run in-process as a single map and
      reduce task.
    </description>
  </property>
  <property>
    <name>mapred.map.tasks</name>
    <value>4</value>
    <description>
      Number of map tasks; usually a small multiple of the number
      of slave hosts.
    </description>
  </property>
  <property>
    <name>mapred.reduce.tasks</name>
    <value>4</value>
    <description>
      Number of reduce tasks; usually a small multiple of the number
      of slave hosts.
    </description>
  </property>
  <property>
    <name>dfs.name.dir</name>
    <value>/home/filesystem/name</value>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>/home/filesystem/data</value>
  </property>
  <property>
    <name>mapred.system.dir</name>
    <value>/home/filesystem/mapreduce/system</value>
  </property>
  <property>
    <name>mapred.local.dir</name>
    <value>/home/filesystem/mapreduce/local</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>2</value>
    <description>
      Block replication; it should not exceed the number of
      datanodes (two in this setup).
    </description>
  </property>
</configuration>
- conf/hadoop-env.sh: it will look like the following:
export JAVA_HOME=/home/apps/jdk5
export HADOOP_LOG_DIR=${HADOOP_HOME}/logs
export HADOOP_SLAVES=${HADOOP_HOME}/conf/slaves
- In conf/slaves, add the datanode hostnames, one host per line.
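Given the layout above, where both machines run datanodes, the file would be:

searchCluster1
searchCluster2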
- Edit conf/crawl-urlfilter.txt to suit your needs; all the necessary instructions are there.
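For example, to keep the crawl inside the seed site used later in this guide (yoursiteaddress is a placeholder), the accept rule would look like:

+^http://([a-z0-9]*\.)*yoursiteaddress/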
- Create the logs directory: [NUTCH_HOME]$ mkdir logs
- At NUTCH_HOME, type ant to build Nutch.
- When everything builds fine, copy /home/search and /home/apps from the namenode to all datanodes, keeping the same directory structure.
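A sketch using scp, run from the namenode and repeated for each datanode:

scp -r /home/search /home/apps root@searchCluster2:/home/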
- Format your DFS by typing: ./bin/hadoop namenode -format (the namenode must be formatted before the DFS daemons are started, so this comes first)
- Once you get confirmation that the namenode is formatted, start DFS by typing: ./bin/start-dfs.sh
- Add a seed URL file to your DFS for Nutch to start crawling from. OK, follow:
- mkdir urls
- echo http://yoursiteaddress/ > urls/urls.txt
- ./bin/hadoop dfs -put urls urls
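You can verify the upload by listing the directory in DFS:

./bin/hadoop dfs -ls urls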
- Restart them all: type ./bin/stop-dfs.sh, then ./bin/start-all.sh (start-all.sh brings up both the DFS and MapReduce daemons).
- Watch the logs; if you have no exceptions, then everything is OK.
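For example, on the namenode (log file names follow the pattern hadoop-<user>-<daemon>-<hostname>.log):

tail -f logs/hadoop-root-namenode-searchCluster1.log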
- Start the crawler as a test: ./bin/nutch crawl urls -dir crawled -depth 3 -topN 10
- When it completes successfully, you can move forward to search.
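To check that the crawl wrote its output to DFS, list the crawl directory:

./bin/hadoop dfs -ls crawled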
- Build the Nutch WAR file by typing: [NUTCH_HOME]$ ant war
- Move nutch-0.8.1.war to [TOMCAT_HOME]/webapps/
- Rename it to search.war (Tomcat will expand it into webapps/search)
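For example, assuming ant put the WAR under build/ and TOMCAT_HOME is set:

cp build/nutch-0.8.1.war $TOMCAT_HOME/webapps/search.war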
- Start Tomcat.
- Go to webapps/search/WEB-INF/classes/nutch-site.xml and add the following:
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>searchCluster1:9000</value>
    <description>
      The name of the default file system. Either the literal string
      "local" or a host:port for NDFS.
    </description>
  </property>
  <property>
    <name>searcher.dir</name>
    <value>/user/root/crawled</value>
  </property>
</configuration>
- Restart Tomcat and go to http://searchCluster1:8080/search/
- Nice dreams ;)