- Machines Layout:
- we have two machines:
- searchCluster1: acts as both namenode and datanode
- searchCluster2: acts as a datanode.
- Linux FC5.
- User: root
- Filesystem layout:
- /home/search: nutch home
- /home/filesystem: nutch Distributed File System Home.
- enable key-based authentication between the namenode and all datanodes, with a blank passphrase, so you can log in to the other machines without being asked for a password.
It is assumed that you have the following installed:
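A minimal sketch of that key setup, assuming OpenSSH is installed on both machines. The helper names make_key and push_key and the KEY_FILE variable are made up for illustration; the host names are the two machines from the layout above.

```shell
#!/bin/sh
# Sketch: passwordless SSH from the namenode to each datanode.
# KEY_FILE, make_key and push_key are illustrative names, not from the post.

KEY_FILE="${KEY_FILE:-$HOME/.ssh/id_rsa}"

# Generate an RSA key pair with an empty passphrase, unless one already exists.
make_key() {
    mkdir -p "$(dirname "$KEY_FILE")"
    [ -f "$KEY_FILE" ] || ssh-keygen -q -t rsa -N "" -f "$KEY_FILE"
}

# Append the public key to root's authorized_keys on one host.
# You are asked for the password this one time only.
push_key() {
    ssh "root@$1" 'mkdir -p ~/.ssh && chmod 700 ~/.ssh &&
        cat >> ~/.ssh/authorized_keys &&
        chmod 600 ~/.ssh/authorized_keys' < "$KEY_FILE.pub"
}

# On searchCluster1 (the namenode), as root:
#   make_key
#   push_key searchCluster1   # the namenode is also a datanode
#   push_key searchCluster2
#   ssh root@searchCluster2 hostname   # should no longer prompt
```

Nothing runs until you call the two functions yourself, so you can paste this into a shell and try it one host at a time.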
- JDK 1.5
- ANT 1.6.5
follow up :)
- get the latest release of Nutch from
- export JAVA_HOME, ANT_HOME, and NUTCH_HOME (in our case /home/search) and add them to your path - you know how, right? :)
- untar Nutch in /home/ and rename the directory to "search"
- Configuration: your configuration files should look like this
- conf/nutch-site.xml:
- override properties from conf/nutch-default.xml to suit your needs; don't forget to set since it cannot be empty.
- conf/hadoop-site.xml: it will look like the following
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
The name of the default file system. Either the literal string
"local" or a host:port for NDFS.
The host and port that the MapReduce job tracker runs at. If
"local", then jobs are run in-process as a single map and
reduce task.
define tasks to be number of slave hosts
define mapred.reduce tasks to be number of slave hosts
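The descriptions above read like those of the stock Hadoop properties fs.default.name, mapred.job.tracker, mapred.map.tasks and mapred.reduce.tasks, so a rough sketch of the whole file could be generated like this. The ports 9000 and 9001, the task counts, and the helper name write_hadoop_site are assumptions for illustration; use whatever values fit your cluster.

```shell
#!/bin/sh
# Sketch only: writes a hadoop-site.xml with the four properties described
# above. Ports 9000/9001 and the function name are assumptions.

write_hadoop_site() {
    conf_dir="$1"   # e.g. /home/search/conf
    namenode="$2"   # e.g. searchCluster1
    slaves="$3"     # number of slave hosts, e.g. 2
    cat > "$conf_dir/hadoop-site.xml" <<EOF
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>$namenode:9000</value>
  </property>
  <property>
    <name>mapred.job.tracker</name>
    <value>$namenode:9001</value>
  </property>
  <property>
    <name>mapred.map.tasks</name>
    <value>$slaves</value>
  </property>
  <property>
    <name>mapred.reduce.tasks</name>
    <value>$slaves</value>
  </property>
</configuration>
EOF
}

# Example (overwrites any existing file):
#   write_hadoop_site /home/search/conf searchCluster1 2
```

The function overwrites the target file, so keep a copy if you already have your own overrides in it.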
- conf/ it will look like the following:
export JAVA_HOME=/home/apps/jdk5
export HADOOP_SLAVES=${HADOOP_HOME}/conf/slaves
- in conf/slaves, add the datanode hostnames, one host per line.
- edit conf/crawl-urlfilter.txt to suit your needs; all the necessary instructions are in the file.
- create the logs directory: [NUTCH_HOME]$ mkdir logs
- in NUTCH_HOME, type: ant
- when the build works fine, copy /home/search and /home/apps from the namenode to all datanodes, keeping the same directory structure.
- type: ./bin/
- format your DFS by typing: ./bin/hadoop namenode -format
- if you get a confirmation that the namenode is formatted, then you are ready to start.
- add a seed URL file to your DFS for Nutch to start crawling from, as follows:
- mkdir urls
- echo http://yoursiteaddress/ > urls/urls.txt
- ./bin/hadoop dfs -put urls urls
- restart them all by typing: ./bin/ then type: ./bin/
- watch the logs; if there are no exceptions then everything is OK.
- start the crawler as a test: ./bin/nutch crawl urls -dir crawled -depth 3 -topN 10
- when it completes successfully you can move on to search.
- build the Nutch war file by typing: [NUTCH_HOME]$ ant war
- move nutch-0.8.1.war to [TOMCAT_HOME]/webapps/
- rename it to search
- start tomcat.
- go to webapps/search/WEB-INF/classes/nutch-site.xml and add the following
The name of the default file system. Either the literal string
"local" or a host:port for NDFS.
- restart tomcat and go to http://searchCluster1:8080/search/
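For the nutch-site.xml step above, the quoted description appears to belong to Hadoop's fs.default.name property, which tells the web application to read the index from the DFS instead of the local disk. A hedged sketch of that file follows; port 9000 and the helper name write_search_site are assumptions, and the value should match whatever you put in your hadoop-site file.

```shell
#!/bin/sh
# Sketch only: writes a nutch-site.xml for the search webapp that points
# at the namenode. Port 9000 and the function name are assumptions.

write_search_site() {
    classes_dir="$1"   # e.g. [TOMCAT_HOME]/webapps/search/WEB-INF/classes
    namenode="$2"      # e.g. searchCluster1
    cat > "$classes_dir/nutch-site.xml" <<EOF
<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>$namenode:9000</value>
  </property>
</configuration>
EOF
}

# Example (overwrites the file; merge by hand if it already has overrides):
#   write_search_site "$TOMCAT_HOME/webapps/search/WEB-INF/classes" searchCluster1
```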
- Nice dreams ;)