This repository contains a 🐍 Python script that scrapes job listings from Jobinja. The script is designed to extract detailed information from job ads, such as job title 📋, job type ⏱️, location 📍, and other relevant attributes, and output the data for further 📊 analysis.
- Scrape Job Listings: Extracts information ℹ️ from job listings available on the Jobinja 🌐 website.
- Detailed Data Extraction: Collects various attributes including job title 📜, company name 🏢, location 📍, work experience requirements 💼, contract type 📃, gender 🚻, minimum salary 💰, and education level 🎓.
- Data Sorting and Display: Organizes the extracted data based on specified attributes and displays it in a tabular format 🧮 for easy analysis.
- Save Extracted Data: Saves the sorted job listings as individual text files 📄 in a specified directory for later review.
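As a rough illustration of the link-collection step behind these features, the sketch below pulls job links out of a page with BeautifulSoup. The function name `collect_links`, the `/jobs/` URL filter, and the sample markup are assumptions made for this example; they are not taken from the script or from Jobinja's actual HTML.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def collect_links(html, base_url, must_contain="/jobs/"):
    """Return absolute, de-duplicated links whose URL contains `must_contain`."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for anchor in soup.find_all("a", href=True):
        # Resolve relative hrefs against the base URL.
        absolute = urljoin(base_url, anchor["href"])
        if must_contain in absolute and absolute not in links:
            links.append(absolute)
    return links


# Hypothetical markup, not copied from the real site.
sample_html = """
<a href="/companies/example/jobs/abc">Software Developer</a>
<a href="/about">About</a>
<a href="https://jobinja.ir/companies/example/jobs/abc">Duplicate</a>
"""
job_links = collect_links(sample_html, "https://jobinja.ir/")
print(job_links)
```

The filter string is a parameter so it can be adjusted to whatever pattern the real job URLs follow.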
- 🐍 Python 3.x
- The following Python libraries are required:
  - requests
  - BeautifulSoup (from bs4)
  - pandas
  - os (part of the Python standard library, no installation needed)
To install the dependencies, run:
pip install requests beautifulsoup4 pandas
To get the code, clone the repository and change into its directory:
git clone https://github.com/yourusername/jobinja-job-scraper.git
cd jobinja-job-scraper
Update the base URL 🔗 or headers 📋 if necessary:
url = "https://jobinja.ir/"
headers = {
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3"
}
scraper = JobinjaScraper(url, headers)
scraper.scrape()
scraper.descriptive_statistics()
scraper.sort_data()
Execute the script by running:
python jobinja_scraper.py
The script will scrape data from Jobinja 🌐, generate descriptive statistics 📊, and display sorted job data 📋. It will also save each job listing as a .txt file 📁 in `C:/Users/negin/jobinja_sorted_data`.
- `__init__(self, base_url, headers)`: Initializes the scraper with the base URL 🔗 and headers 📋.
- `get_links(self)`: Retrieves all relevant links 🔗 from the Jobinja base page for further processing.
- `extract_subpage_text(self)`: Extracts job attributes such as job title 📋, type ⏱️, location 📍, company 🏢, and other relevant details from each subpage.
- `scrape(self)`: Executes the process by calling `get_links` 🔗 and `extract_subpage_text` 📜 to gather job data 🗂️.
- `descriptive_statistics(self)`: Uses `pandas` to generate descriptive statistics 📊 for the dataset.
- `sort_data(self)`: Sorts the job data based on attributes like job title 📜, job type ⏱️, location 📍, etc., and displays the data in a structured matrix 🧮. It also saves the sorted data into text files 📁 for easy access.
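The statistics and sorting steps can be approximated with a few pandas calls. This is a minimal sketch on made-up records, not the script's own implementation; the column names follow the fields this README lists, and the choice of sort keys is an assumption.

```python
import pandas as pd

# Made-up records for illustration only.
records = [
    {"Job Title": "Software Developer", "Job Location": "Tehran", "Min Salary": 40_000_000},
    {"Job Title": "Data Analyst", "Job Location": "Shiraz", "Min Salary": 30_000_000},
    {"Job Title": "Backend Engineer", "Job Location": "Tehran", "Min Salary": 55_000_000},
]
df = pd.DataFrame(records)

# include="all" summarises text columns (count, unique, top, freq)
# as well as numeric ones (mean, std, min, quartiles, max).
stats = df.describe(include="all")
print(stats)

# Sort by location first, then by title within each location.
sorted_df = df.sort_values(by=["Job Location", "Job Title"]).reset_index(drop=True)
print(sorted_df.to_string(index=False))
```

Sorting on a list of columns keeps the output stable when several listings share the same location.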
- Console Output: Displays scraped job data 📋, descriptive statistics 📊, and a sorted data matrix 🧮.
- Text Files: Each job listing is saved as an individual text file 📄 in `C:/Users/negin/jobinja_sorted_data` with detailed job information.
URL: https://jobinja.ir/job/listing-url
Content Snippet: [Snippet of the job description]
Job Title: Software Developer
Job Type: Full-Time
Job Location: 📍 Tehran
Company Name: Example Co.
Contract Type: Permanent 📃
Work Experience: 3-5 Years 💼
Min Salary: 💰 40,000,000 IRR
Gender: 🚻 Female
Education Level: 🎓 Bachelor's Degree
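A file in the format above could be produced along the following lines. This is only a sketch: the helper `save_listing` and the title-based file naming are hypothetical, and a temporary directory stands in for the real output folder.

```python
import re
import tempfile
from pathlib import Path


def save_listing(record, out_dir):
    """Write one listing as 'Key: value' lines and return the file path."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    # Build a filesystem-safe name from the job title (hypothetical naming scheme).
    safe_title = re.sub(r"[^\w\-]+", "_", record.get("Job Title", "")).strip("_") or "job"
    path = out_dir / f"{safe_title}.txt"
    lines = [f"{key}: {value}" for key, value in record.items()]
    path.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return path


record = {
    "URL": "https://jobinja.ir/job/listing-url",
    "Job Title": "Software Developer",
    "Job Type": "Full-Time",
    "Job Location": "Tehran",
}
# A temporary directory is used here so the example leaves nothing behind.
with tempfile.TemporaryDirectory() as tmp:
    saved = save_listing(record, tmp)
    saved_name = saved.name
    saved_text = saved.read_text(encoding="utf-8")
print(saved_name)
print(saved_text)
```

Writing with an explicit `encoding="utf-8"` matters here because listings may contain Persian text or emoji.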
- The script includes error handling for SSL errors 🔒 and generic request errors ❗ to manage connectivity issues smoothly.
- Requests to the server are spaced out with a time delay ⏳ (`time.sleep(1)`) to avoid overwhelming the server.
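The error handling and delay described in these notes might be combined roughly as follows. The function name `polite_get`, its parameters, and the choice to return `None` on failure are assumptions for illustration; the script itself may be organized differently.

```python
import time

import requests


def polite_get(session, url, headers=None, delay=1.0, timeout=10):
    """Fetch `url` and return the page text, or None if the request fails."""
    try:
        response = session.get(url, headers=headers, timeout=timeout)
        response.raise_for_status()
        return response.text
    except requests.exceptions.SSLError as exc:
        # SSLError is caught first because it is a subclass of RequestException.
        print(f"SSL error for {url}: {exc}")
    except requests.exceptions.RequestException as exc:
        print(f"Request failed for {url}: {exc}")
    finally:
        # Pause after every attempt so consecutive requests are spaced out.
        time.sleep(delay)
    return None
```

A caller would pass a `requests.Session()` and loop over the collected links, skipping any page for which the function returns `None`.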
This project is licensed under the MIT License.
Feel free to submit a pull request 📥 if you have any improvements ✨ or bug fixes 🐛. All contributions are welcome 🤗.
Created by Negin Faal.