You have to run before you can crawl
This is just a small, incomplete web crawler written as a coding test. Given a
starting URL, it will store link and image URLs per page in an SQLite database
as (page URL, link type, link URL) triples. The link types are `page`, `image`,
`stylesheet`, `script`, `object`, `embed`, `iframe`, `media` and `form`.
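
The README doesn't show the actual schema, but a minimal sketch of how such
triples might be stored looks like this (Python is assumed; the `links` table
and its column names are illustrative, not cosmo's real schema):

```python
import sqlite3

# Hypothetical schema for the (page URL, link type, link URL) triples;
# cosmo's real table and column names may differ.
SCHEMA = """
CREATE TABLE IF NOT EXISTS links (
    page_url  TEXT NOT NULL,  -- URL of the page the link was found on
    link_type TEXT NOT NULL,  -- page, image, stylesheet, script, ...
    link_url  TEXT NOT NULL,  -- the linked URL itself
    PRIMARY KEY (page_url, link_type, link_url)
);
"""

def store_triple(db: sqlite3.Connection,
                 page_url: str, link_type: str, link_url: str) -> None:
    """Record one triple, silently ignoring exact duplicates."""
    db.execute(
        "INSERT OR IGNORE INTO links (page_url, link_type, link_url) VALUES (?, ?, ?)",
        (page_url, link_type, link_url),
    )

if __name__ == "__main__":
    db = sqlite3.connect("cosmo.db")
    db.executescript(SCHEMA)
    store_triple(db, "http://example.com/", "image", "http://example.com/logo.png")
    db.commit()
```
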
For a gathered link to be valid for crawling, it must (see the sketch after this list):
- be of type `page` or `iframe`
- not have been crawled already
- have the same host and port as the page it was found on
- be permitted by robots.txt
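
A sketch of how those four rules could be checked, again assuming Python; the
function name, the in-memory `seen` set, and the pre-loaded `RobotFileParser`
are illustrative assumptions rather than cosmo's actual internals:

```python
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

def is_crawlable(link_type: str, link_url: str, page_url: str,
                 seen: set[str], robots: RobotFileParser) -> bool:
    """Apply the four validity rules to a gathered link."""
    # Only pages and iframes are followed.
    if link_type not in ("page", "iframe"):
        return False
    # Skip anything that has already been crawled.
    if link_url in seen:
        return False
    # Stay on the same host and port as the page the link was found on.
    link, page = urlparse(link_url), urlparse(page_url)
    if (link.hostname, link.port) != (page.hostname, page.port):
        return False
    # Respect robots.txt.
    return robots.can_fetch("*", link_url)
```
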
Usage:

    cosmo [options] <url>
    cosmo -h|--help
    cosmo --version

Options:

    -h --help             Show this screen.
    --version             Show version.
    -v --verbose          Output each URL on stderr as it is fetched.
    -d --database=<file>  Database file.
    -F --flush            Flush the database before crawling.
    -f --format=<format>  Select output format (see below).

Output formats:

    nice  Hierarchical by page URL and link type
    raw   Raw triples
    dot   GraphViz DOT format graph (sketched below)
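
The README only says the `dot` format is a GraphViz DOT graph; one plausible
shape is an edge per stored triple, labelled with the link type. The sketch
below assumes the hypothetical `links` table from the earlier example and may
not match cosmo's actual output:

```python
import sqlite3

def dump_dot(db: sqlite3.Connection) -> str:
    """Render the stored triples as a directed GraphViz graph, one edge per triple."""
    lines = ["digraph cosmo {"]
    for page_url, link_type, link_url in db.execute(
        "SELECT page_url, link_type, link_url FROM links"
    ):
        lines.append(f'  "{page_url}" -> "{link_url}" [label="{link_type}"];')
    lines.append("}")
    return "\n".join(lines)
```
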
Known limitations:

- There's no throttling
- Pages that fail to be retrieved are not retried
- Only HTML files are parsed, but it would be useful to parse CSS files too
- It's not at all concurrent