Home
Chris Mattmann edited this page Oct 13, 2015 · 3 revisions
Welcome to the nutch-python wiki!
Right now the API is evolving rapidly, but here is some code that should get you running.
Download and build the latest version of Nutch trunk, then start the Nutch server (in a separate terminal):
- git clone https://github.com/apache/nutch.git
- cd nutch
- ant runtime
- cd runtime/local
- ./bin/nutch startserver
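Before running the Python client, it can help to confirm that something is actually listening on the server port. The following is a minimal sketch using only the Python standard library. The helper name `server_is_listening` is hypothetical (it is not part of nutch-python), and the default URL is simply the address used in the client example on this page. It checks only that a TCP connection succeeds, not that the process behind the port is Nutch.

```python
import socket
from urllib.parse import urlparse


def server_is_listening(url="http://localhost:8081", timeout=2.0):
    """Return True if a TCP connection to the URL's host and port succeeds.

    Hypothetical helper, not part of nutch-python. It only verifies that
    a process accepts connections on the port, nothing more.
    """
    parsed = urlparse(url)
    host = parsed.hostname or "localhost"
    # Fall back to the scheme's usual port when the URL does not name one.
    port = parsed.port or (443 if parsed.scheme == "https" else 80)
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or host not resolvable.
        return False


if __name__ == "__main__":
    print("listening" if server_is_listening() else "not reachable")
```

If the check reports that the port is not reachable, the server may still be starting up, or it may be bound to a different port than the one assumed here.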
With the server running, start a crawl from Python:
from nutch.nutch import JobClient, Nutch, SeedClient, Server

# Connect to the Nutch server started above
sv = Server('http://localhost:8081')

# Create a named seed list on the server
sc = SeedClient(sv)
seed_urls = ('http://espn.go.com', 'http://www.espn.com')
sd = sc.create('espn-seed', seed_urls)

# Set up the crawl from the seed list, seed client, and job client
nt = Nutch('default')
jc = JobClient(sv, 'test', 'default')
cc = nt.Crawl(sd, sc, jc)
while True:
    # Returns the current job if there is no progress; otherwise iterates and makes progress
    job = cc.progress()
    if job is None:
        break
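The loop above calls `progress()` as fast as it can. A gentler variant, sketched below, pauses between polls and can stop after a fixed number of attempts. This is only an illustration under stated assumptions: `run_crawl`, `poll_interval`, `max_polls`, and the `FakeCrawl` stand-in are hypothetical names that are not part of nutch-python, and the only behavior assumed of the crawl object is what the snippet above shows, namely that `progress()` returns a job while work remains and `None` when the crawl is finished.

```python
import time


def run_crawl(crawl, poll_interval=1.0, max_polls=None):
    """Poll crawl.progress() until it returns None; return the distinct jobs seen.

    Hypothetical helper: `crawl` is any object with a progress() method
    that returns None once there is nothing left to do.
    """
    jobs = []
    polls = 0
    while max_polls is None or polls < max_polls:
        job = crawl.progress()
        polls += 1
        if job is None:
            break
        # Record a job only once, even if it is returned on several polls.
        if not jobs or jobs[-1] is not job:
            jobs.append(job)
        time.sleep(poll_interval)
    return jobs


class FakeCrawl:
    """Stand-in for a crawl object, for trying the helper without a server."""

    def __init__(self, jobs):
        self._jobs = list(jobs)

    def progress(self):
        # Hand out the queued jobs one per call, then None.
        return self._jobs.pop(0) if self._jobs else None


if __name__ == "__main__":
    finished = run_crawl(FakeCrawl(["inject", "generate", "fetch"]), poll_interval=0)
    print(finished)
```

With a real crawl, the same call would be `run_crawl(cc, poll_interval=5.0)`, passing the `cc` object created earlier. The interval is a judgment call and depends on how long each job takes.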