Fetch and Index Web Pages with Nutch, MySQL and Solr

OS
Ubuntu 13.04
Download
Apache Nutch 2.1: http://apache.etoak.com/nutch/2.1/apache-nutch-2.1-src.tar.gz
MySQL 5.5:

apt-get install -y mysql-server-5.5

Apache Solr 4.3: http://mirror.reverse.net/pub/apache/lucene/solr/4.3.0/solr-4.3.0-src.tgz

Create database in MySQL
The database stores the data fetched by Nutch. Create it and its table with the following statements.

Create Database

CREATE DATABASE nutch DEFAULT CHARACTER SET utf8 DEFAULT COLLATE utf8_general_ci;

Create Table

CREATE TABLE `webpage` (
  `id` varchar(767) CHARACTER SET latin1 NOT NULL,
  `headers` blob,
  `text` mediumtext DEFAULT NULL,
  `status` int(11) DEFAULT NULL,
  `markers` blob,
  `parseStatus` blob,
  `modifiedTime` bigint(20) DEFAULT NULL,
  `score` float DEFAULT NULL,
  `typ` varchar(32) CHARACTER SET latin1 DEFAULT NULL,
  `baseUrl` varchar(512) CHARACTER SET latin1 DEFAULT NULL,
  `content` mediumblob,
  `title` varchar(2048) DEFAULT NULL,
  `reprUrl` varchar(512) CHARACTER SET latin1 DEFAULT NULL,
  `fetchInterval` int(11) DEFAULT NULL,
  `prevFetchTime` bigint(20) DEFAULT NULL,
  `inlinks` mediumblob,
  `prevSignature` blob,
  `outlinks` mediumblob,
  `fetchTime` bigint(20) DEFAULT NULL,
  `retriesSinceFetch` int(11) DEFAULT NULL,
  `protocolStatus` blob,
  `signature` blob,
  `metadata` blob,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
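
One way to run the two statements above is to save them in a file and pipe that file into the mysql client. The sketch below is only an illustration: the file name nutch-schema.sql, the helper name load_schema and the MYSQL_CMD variable are my own assumptions, and it presumes the file contains a "USE nutch;" line before the CREATE TABLE statement so that the table lands in the right database.

```shell
# Hypothetical helper: feed a SQL file to the mysql command-line client.
# MYSQL_CMD may override the client invocation (assumed default: root with a password prompt).
load_schema() {
    sql_file="$1"
    if [ ! -r "$sql_file" ]; then
        echo "cannot read SQL file: $sql_file" >&2
        return 1
    fi
    # Left unquoted on purpose so the command and its options split into words.
    ${MYSQL_CMD:-mysql -u root -p} < "$sql_file"
}
```

Calling load_schema nutch-schema.sql should then create the database and the table; running mysql -u root -p -e 'SHOW TABLES IN nutch;' afterwards is a quick way to confirm that webpage is listed.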

Install Nutch
Decompress the tarball to a preferred directory (e.g. /opt) and make the extracted directory your $NUTCH_HOME (e.g. by renaming it to /opt/nutch).

tar zxvf apache-nutch-2.1-src.tar.gz -C /opt/

Edit $NUTCH_HOME/ivy/ivy.xml so that Nutch supports MySQL

vi /opt/nutch/ivy/ivy.xml
#########################
# Uncomment this line #
<dependency org="mysql" name="mysql-connector-java" rev="5.1.18" conf="*->default"/>

Configure Gora to have it support MySQL

vi /opt/nutch/conf/gora.properties
##################################
# Comment out the following lines #
#gora.sqlstore.jdbc.driver=org.hsqldb.jdbc.JDBCDriver
#gora.sqlstore.jdbc.url=jdbc:hsqldb:hsql://localhost/nutchtest
#gora.sqlstore.jdbc.user=sa
#gora.sqlstore.jdbc.password=
...
# Add following lines
###############################
# MySQL properties           
################################

gora.sqlstore.jdbc.driver=com.mysql.jdbc.Driver
gora.sqlstore.jdbc.url=jdbc:mysql://localhost:3306/nutch?createDatabaseIfNotExist=true&useUnicode=true&characterEncoding=utf8&autoReconnect=true
gora.sqlstore.jdbc.user=yourmysqlaccount
gora.sqlstore.jdbc.password=yourpassword
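
After this edit, exactly one gora.sqlstore.jdbc.driver line should remain active in the file. A minimal sketch for checking that, assuming the properties file uses plain key=value lines as shown above (the helper name active_gora_driver is hypothetical):

```shell
# Hypothetical helper: print the value of every active (uncommented)
# gora.sqlstore.jdbc.driver line in a properties file.
active_gora_driver() {
    # Lines starting with '#' do not match, so commented-out drivers are ignored.
    grep -E '^[[:space:]]*gora\.sqlstore\.jdbc\.driver=' "$1" | cut -d= -f2-
}
```

Running active_gora_driver /opt/nutch/conf/gora.properties should print com.mysql.jdbc.Driver and nothing else. If it prints more than one line, or still prints org.hsqldb.jdbc.JDBCDriver, the HSQLDB lines were not fully commented out.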

Define your crawler.

vi /opt/nutch/conf/nutch-site.xml
#################################
# Add the following lines inside the <configuration> tag #

    <property>
      <name>http.agent.name</name>
      <value>Your Nutch Spider</value>
    </property>

    <property>
      <name>http.accept.language</name>
      <value>zh-cn, ja-jp, en-us,en-gb,en;q=0.7,*;q=0.3</value>
      <description>Value of the "Accept-Language" request header field.
      This allows selecting a non-English language as the default one to retrieve.
      It is a useful setting for search engines built for a certain national group.
      </description>
    </property>

    <property>
      <name>parser.character.encoding.default</name>
      <value>utf-8</value>
      <description>The character encoding to fall back to when no other information
      is available</description>
    </property>

    <property>
      <name>storage.data.store.class</name>
      <value>org.apache.gora.sql.store.SqlStore</value>
      <description>The Gora DataStore class for storing and retrieving data.
      Currently the following stores are available: ….
      </description>
    </property>
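
Before compiling, it can help to read a property back from nutch-site.xml to make sure the edit is in place. The function below is a rough sketch, not a real XML parser: it assumes each <value> element sits on a single line after its <name> element, as in the snippet above, and the name nutch_property is made up for this example.

```shell
# Hypothetical helper: print the <value> that follows a given <name>
# in a Nutch/Hadoop-style configuration file.
nutch_property() {
    awk -v name="$2" '
        # Remember when the requested <name> element has been seen.
        index($0, "<name>" name "</name>") { found = 1 }
        # Print the text of the first <value> element at or after that line.
        found && /<value>/ {
            sub(/.*<value>/, "")
            sub(/<\/value>.*/, "")
            print
            exit
        }
    ' "$1"
}
```

For instance, nutch_property /opt/nutch/conf/nutch-site.xml storage.data.store.class should print org.apache.gora.sql.store.SqlStore, and an empty result means the property is missing.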

Then compile Nutch with Ant.

cd /opt/nutch
ant

Once Nutch has built successfully, list the sites you would like to fetch in seed.txt and start the crawl. The fetched data ends up in the database created earlier.

cd /opt/nutch/runtime/local
mkdir urls
echo 'http://your.site' > urls/seed.txt
bin/nutch crawl urls -depth 3 -topN 5
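
If you have more than one site to crawl, seed.txt takes one URL per line. The helper below is a small sketch for writing such a file; the name write_seeds and the rule of skipping anything that does not start with http:// or https:// are my own choices, not something Nutch requires.

```shell
# Hypothetical helper: write the given URLs to <dir>/seed.txt, one per line.
# Arguments that do not start with http:// or https:// are skipped with a warning.
write_seeds() {
    seed_dir="$1"
    shift
    mkdir -p "$seed_dir" || return 1
    : > "$seed_dir/seed.txt"    # start from an empty seed file
    for url in "$@"; do
        case "$url" in
            http://*|https://*) printf '%s\n' "$url" >> "$seed_dir/seed.txt" ;;
            *) echo "skipping invalid URL: $url" >&2 ;;
        esac
    done
}
```

From /opt/nutch/runtime/local, write_seeds urls http://your.site followed by the crawl command above gives the same result as the echo line. As far as I understand the crawl options, -depth limits how many rounds of links are followed from the seeds and -topN caps the number of pages fetched in each round.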

Install Solr
Follow steps in this post: https://dcvan24.wordpress.com/2013/05/16/how-to-deploy-solr-4-3-on-jetty-9/

Index Web Pages with Solr
After installing Solr, download the schema from the following link: http://nlp.solutions.asia/wp-content/uploads/2012/08/schema.xml. Replace the schema in your collection with it, then restart Jetty so that the new schema takes effect.

sudo service jetty restart
cd /opt/nutch/runtime/local
bin/nutch solrindex http://localhost:[solr-port]/[solr-dir] -reindex
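
The [solr-port] and [solr-dir] placeholders depend on how you deployed Solr. As a sketch, the URL can be assembled with a tiny helper; the name solr_url is hypothetical, and the values 8983 and solr used below are only assumed examples, so substitute the port and path of your own Jetty setup.

```shell
# Hypothetical helper: build the Solr base URL passed to "bin/nutch solrindex".
solr_url() {
    port="$1"
    dir="${2#/}"    # drop a leading slash so the URL does not contain "//"
    printf 'http://localhost:%s/%s\n' "$port" "$dir"
}
```

With those example values the indexing call becomes bin/nutch solrindex "$(solr_url 8983 solr)" -reindex.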

2 Responses to Fetch and Index Web Pages with Nutch, MySQL and Solr

  1. Pingback: [FE LOG]5.13-5.18 | Java Notes

  2. Before indexing we first invert all of the links, so that we may index incoming anchor text with the pages.
