Crawling quotestoscrape #304

Open
PicoRadia opened this issue Apr 25, 2022 · 0 comments

PicoRadia commented Apr 25, 2022

Hello,
I want to achieve multi-level recursive site crawling, and I have a question about scraping the website quotes.toscrape.com as an example.
I am able to scrape across the pagination.
What I can't get working is the crawl itself: for each page, scrape the quotes and authors, then follow each author's (about) link and scrape, for example, the date of birth and place of birth, and finally handle the pagination.

Here's my code:

var base_url = 'https://quotes.toscrape.com';

// empty list init
var my_list = []

// define the logic of the first scraper
var scraper1 = {
  iterator: 'div.quote',
  data: {
    'quotes': {
      sel: 'span.text' // only the quote text, not the "by ... (about)" span
    },
    'author': {
      sel: 'small.author'
    },
    'link': {
      sel: 'a',
      attr: 'href'
    }
  }
};

// define the logic of the second scraper

var scraper2 = {
  iterator: 'div.author-details',
  data: {
    'dob': {
      sel: 'span.author-born-date'
    },
    'pob': {
      sel: 'span.author-born-location'
    }
  }
}


// pagination
function nextUrl($page) {
  return $page.find('li.next > a').attr('href');
}

artoo.log.debug('Starting the scraper...');
var frontpage = artoo.scrape(scraper1);

// artoo spider: follow the pagination and collect the author links
// (my_list is already declared above)

function pagination() {

  artoo.ajaxSpider(
    function(i, $data) {
      //console.log($data.innerHTML);
      return nextUrl(!i ? artoo.$(document) : $data);
    }, {
      limit: 1, // number of pages to scrape
      scrape: scraper1,
      concat: true,
      done: function(data) {
        artoo.log.debug('Finished retrieving data. Downloading...');
        console.log(data);
        // Loop over the scraped rows (my_list is still empty at this point)
        for (var i = 0; i < data.length; i++) {
          my_list.push(base_url + data[i].link)
        }
        console.log(my_list)
      }
    })
  return my_list; // note: still empty here, since done() runs asynchronously
}
// Append links in a list
//my_list.push(base_url + data[0].link);


function crawl(links) {
  artoo.ajaxSpider(
    links, {
      limit: 1, // number of pages to scrape
      scrape: scraper2,
      concat: true,
      done: function(data) {
        console.log(data);
        artoo.log.debug('Finished retrieving data. Downloading...');
      }
    })
}

//var ll = null;
let links = pagination();
crawl(links)
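
The code above calls crawl(links) as soon as pagination() returns, but the spider's done callback only fires later, once the AJAX requests have finished, so the list is still empty at that point. One possible way around this is to start the second stage from inside the first stage's done callback. The following is a minimal sketch of that idea, not a confirmed artoo recipe. The helpers `authorUrls` and `crawlInTwoStages`, the `spider` argument and the `opts` field names are all hypothetical and only assume a function that accepts `(list, params)` with `scrape`, `concat`, `limit` and `done`, as `artoo.ajaxSpider` is used above.

```javascript
// Hypothetical helper (not part of artoo): turn the rows produced by the
// first scraper into a de-duplicated list of absolute author-page URLs.
function authorUrls(rows, baseUrl) {
  var seen = {};
  var urls = [];
  rows.forEach(function (row) {
    if (!row || !row.link) return; // skip rows without an (about) link
    var url = new URL(row.link, baseUrl).href; // resolve relative hrefs
    if (!seen[url]) {
      seen[url] = true;
      urls.push(url);
    }
  });
  return urls;
}

// Hypothetical helper: run the two stages one after the other.
// `spider` is assumed to behave like artoo.ajaxSpider as used above:
// spider(listOrIterator, { limit, scrape, concat, done }).
function crawlInTwoStages(spider, pages, opts, finished) {
  // Stage 1: walk the paginated quote pages.
  spider(pages, {
    limit: opts.pageLimit,
    scrape: opts.quoteScraper,
    concat: true,
    done: function (quotes) {
      // Stage 2 starts only once stage 1 has delivered its rows.
      var urls = authorUrls(quotes, opts.baseUrl);
      spider(urls, {
        scrape: opts.authorScraper,
        concat: true,
        done: function (authors) {
          finished(quotes, authors);
        }
      });
    }
  });
}
```

With artoo loaded on the page, `spider` could be a thin wrapper such as `function (list, params) { artoo.ajaxSpider(list, params); }`, `pages` the pagination iterator from the code above, and `opts` an object carrying `base_url`, `scraper1` and `scraper2`. De-duplicating the links avoids fetching the same author page once per quote.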
