Skip to content

Web Scraping Booking.com to obtain hotel information. Using Python implement multiprocess to split the workload and reduce run time to obtain data.

Notifications You must be signed in to change notification settings

Chalermdej-l/Portfolio-Project-Web_Scraping

Repository files navigation

Portfolio-Profect-Web_Scraping

Table of contents

Project Overview

This project is aim to create a web-scraping python script. To scrape Booking.com to get the information for the hotel like review score, loaation, Id etc.

Checking the Robots.txt for Booking.com there is no disallow of scaping the hotel page data there is also a XML page with all the hotel in different language for the clawer to use.

We will use this script GetUrl to get the hotel list in the XML page and pass this to this (template)[/template.txt] to scpate the data.

There are many hotel just TH hotel alone have around 20,000 hotels. So to reduce the script run time I have try to split the workload into multiple run by using this Scrapebooking to populate the (template)[/template.txt] into multiple file base on the node varible provide.

Then use this [.bat] (/Run.bat) file to run the code at the same time this will create multikple instance of python to run this script. Once the script done running the file will be paste in this folder Output then we will use this Combine_load to combine all the file into one and insert into a local SQL database.

In the future I want to implement Multiprocessing and Multithreading to reduce the run time for the script.

Prerequisite

This project use Python and below package

  1. Pandas
  2. Pyodbc
  3. Requests
  4. Bs4
  5. Regex

Please clone this project

git clone https://github.com/Chalermdej-l/Portfolio-Project-Web_Scraping

And navigate to the clone directory

cd Portfolio-Project-Web_Scraping

You can install the above package in the Requirement.txt file provide in this project

pip install -r Requirements.txt

Project Detail

1 Get all the hotel list

The goal is to get data of the hotel on Booking.com. Thankfully Booking.com don't forbid cawlers Robots.txt and they provide a list of useful information in xml format amoung them is all the hotel list this XML page.

xml

Which we will use to scrape the data. We will use this script to save all the hotel list into local directory. The script will access the xml page and download each file into one list and then sepearte them into each country in a csv file.

csv

2 Scrape the data using the url list

After we get the url list of the hotel we will then scrape the page using this template this code will scrape the data for location name hotel id and dest id and the rview of the hotels.

hotel

as there are many url in the list depend on the country there are about 20,000 hotel in TH alone so we can't run the script with 1 intance this will take too much time to solve this issue I have create this script this script will open the csv file we get in step 1 and calculate the workload into different node we provide then it will create a script for each node using the template then run the created code with this .bat file to run all the code at the same time with as the task is an I/O bound tpye.

splitfile

3 Combine all the file and insert into database

After we done with the script we will then run this script this script will combine all the created file into one and insert this data into the local database. We can also save this as a CSV file instead

scrape

Further Improvements

Implement a Multiprocessing and Multithreading to reduce the workload intead of split the task into multiple file and run them. The current code need to be run manaully and not continuous as a pipeline with Multiprocessing and Multithreading we can setup a pipeline to wait for all the task to finish then combine all the file and insert them into a data storage.

About

Web Scraping Booking.com to obtain hotel information. Using Python implement multiprocess to split the workload and reduce run time to obtain data.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published