# Sparkler 0.1
## Quick Start Guide
Apache Solr (tested on 6.4.0, which is recommended; older versions have bugs that affect the functionality of this system)
```bash
# A place to keep all the files organized
mkdir -p ~/work/sparkler/
cd ~/work/sparkler/

# Download the Solr binary
# For Mac:
curl -O http://archive.apache.org/dist/lucene/solr/6.4.0/solr-6.4.0.tgz
# For others:
wget "http://archive.apache.org/dist/lucene/solr/6.4.0/solr-6.4.0.tgz"   # pick your version and mirror

# Extract Solr
tar xvzf solr-6.4.0.tgz

# Add crawldb config sets
cd solr-6.4.0/
cp -rv ${SPARKLER_GIT_SOURCE_PATH}/conf/solr/crawldb server/solr/configsets/
```
Solr can be started in local mode or in cloud mode. Note: you must follow exactly one of the two modes below.
There are many ways to do this; here is a relatively easy way to start Solr with the crawldb core:
```bash
# from the extracted Solr directory
cp -r server/solr/configsets/crawldb server/solr/
./bin/solr start
```

Wait a moment for Solr to start, then open http://localhost:8983/solr/#/~cores/ in your browser. Follow Add Core, fill in 'crawldb' for both the name and instanceDir form fields, and click Add Core.
After the above steps you should have a core named "crawldb" in Solr. You can verify it by opening http://localhost:8983/solr/crawldb/select?q=* in your browser. This link should return a valid Solr response with 0 documents.
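Equivalently, you can run a quick sanity check from the shell (assuming Solr is on the default port 8983):

```bash
# Query the empty crawldb core; expect a JSON response with "numFound":0
curl "http://localhost:8983/solr/crawldb/select?q=*:*&rows=0&wt=json"
```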
Now that the crawldb core is ready, skip ahead to the Inject Seed URLs phase.
Once the crawldb configs are copied to the server/solr/configsets/ folder as described above, use the interactive shell to launch a SolrCloud cluster.
The section below shows the steps to create a cloud of 3 Solr nodes with a crawldb collection of 2 shards and 2 replicas per shard. Hit enter to accept the default values.
```
$ bin/solr -e cloud

Welcome to the SolrCloud example!

This interactive session will help you launch a SolrCloud cluster on your local workstation.
To begin, how many Solr nodes would you like to run in your local cluster? (specify 1-4 nodes) [2]:
3
Ok, let's start up 3 Solr nodes for your example SolrCloud cluster.
Please enter the port for node1 [8983]:
Please enter the port for node2 [7574]:
Please enter the port for node3 [8984]:
Creating Solr home directory /Users/tg/work/irds/sparkler/workspace/solr-6.4.0/example/cloud/node1/solr
Cloning /Users/tg/work/irds/sparkler/workspace/solr-6.4.0/example/cloud/node1 into
   /Users/tg/work/irds/sparkler/workspace/solr-6.4.0/example/cloud/node2
Cloning /Users/tg/work/irds/sparkler/workspace/solr-6.4.0/example/cloud/node1 into
   /Users/tg/work/irds/sparkler/workspace/solr-6.4.0/example/cloud/node3
...
Now let's create a new collection for indexing documents in your 3-node cluster.
Please provide a name for your new collection: [gettingstarted]
crawldb
How many shards would you like to split crawldb into? [2]
How many replicas per shard would you like to create? [2]
Please choose a configuration for the crawldb collection, available options are:
  basic_configs, data_driven_schema_configs, or sample_techproducts_configs [data_driven_schema_configs]
crawldb
Connecting to ZooKeeper at localhost:9983 ...
....
SolrCloud example running, please visit: http://localhost:8983/solr
```
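Optionally, confirm that the crawldb collection exists via Solr's Collections API (assuming a node on the default port 8983):

```bash
# List collections; the output should include "crawldb"
curl "http://localhost:8983/solr/admin/collections?action=LIST&wt=json"
```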
Once the cloud is running, open conf/sparkler-default.yaml and set crawldb.uri: crawldb::localhost:9983 so Sparkler connects to the cloud through ZooKeeper.
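For reference, the relevant line in the file would look like this (only this key changes; the rest of the file stays as shipped):

```yaml
# conf/sparkler-default.yaml: point Sparkler's crawldb at the SolrCloud ZooKeeper (localhost:9983)
crawldb.uri: crawldb::localhost:9983
```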
Create a file called seed.txt and enter your seed URLs. Example:
```
http://nutch.apache.org/
http://tika.apache.org/
```
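Equivalently, you can create the file from the shell:

```bash
# Write the two example seed URLs, one per line
printf 'http://nutch.apache.org/\nhttp://tika.apache.org/\n' > seed.txt
```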
If you have not already done so, build the `sparkler-app` jar by following the Build and Deploy instructions.
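For a rough idea, a typical Maven build looks like the sketch below; treat the Build and Deploy page as authoritative for the exact goals and module layout:

```bash
# Sketch of a standard Maven build (assumes Maven is installed and
# ${SPARKLER_GIT_SOURCE_PATH} points at your clone, as in the setup step above)
cd ${SPARKLER_GIT_SOURCE_PATH}
mvn clean package -DskipTests
```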
To inject URLs, run the following command.
```bash
$ java -jar sparkler-app-0.1.jar inject -sf seed.txt
2016-06-07 19:22:49 INFO Injector$:70 [main] - Injecting 2 seeds
>>jobId = sparkler-job-1465352569649
```
This step injected 2 URLs and returned a jobId, `sparkler-job-1465352569649`. To inject more seeds into the crawldb in a later phase, reuse this job id. Usage:
```bash
$ java -jar sparkler-app-0.1.jar inject
 -id (--job-id) VAL        : Id of an existing Job to which the urls are to be
                             injected. No argument will create a new job
 -sf (--seed-file) FILE    : path to seed file
 -su (--seed-url) STRING[] : Seed Url(s)
```
For example:
```bash
bin/sparkler.sh inject -id sparkler-job-1465352569649 \
    -su http://www.bbc.com/news -su http://espn.go.com/
```
To see these URLs in the crawldb: http://localhost:8983/solr/crawldb/query?q=*:*&facet=true&facet.field=status&facet.field=depth&facet.field=group
Note: the Solr URL can be updated in the `sparkler-[default|site].properties` file.
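The same query can be run from the shell (quoting keeps the & characters from being interpreted by the shell):

```bash
# Facet the crawldb documents by status, depth, and group
curl "http://localhost:8983/solr/crawldb/query?q=*:*&facet=true&facet.field=status&facet.field=depth&facet.field=group"
```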
To run a crawl:
```bash
$ java -jar sparkler-app-0.1.jar crawl
 -i (--iterations) N  : Number of iterations to run
 -id (--id) VAL       : Job id. When not sure, get the job id from injector command
 -m (--master) VAL    : Spark Master URI. Ignore this if job is started by spark-submit
 -o (--out) VAL       : Output path, default is job id
 -tg (--top-groups) N : Max Groups to be selected for fetch
 -tn (--top-n) N      : Top urls per domain to be selected for a round
```
Example:
```bash
bin/sparkler.sh crawl -id sparkler-job-1465352569649 -m local[*] -i 1
```
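After an iteration completes, you can re-run the facet query above to watch the status counts change. A minimal check from the shell (assuming the local core and default port):

```bash
# Count crawldb documents per crawl status after the run
curl "http://localhost:8983/solr/crawldb/select?q=*:*&rows=0&facet=true&facet.field=status&wt=json"
```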