Web Scraper + Excel Report Generator using BeautifulSoup and openpyxl
Advanced Python Automation with BeautifulSoup, pandas, and openpyxl
1. Project Overview
This project automates the process of extracting structured information from websites (like job listings, news headlines, or product data) and generating well-formatted Excel reports.
You’ll build a web scraper that:
- Extracts live data from a public website (we’ll use job postings from “https://realpython.github.io/fake-jobs/” for demonstration).
- Cleans and organizes the data.
- Generates an Excel report using openpyxl with proper formatting (headers, styling, etc.).
🎯 Real-World Use Case:
- Automate job data collection for HR analytics.
- Scrape e-commerce sites for product prices or reviews.
- Extract and track competitor updates for business intelligence.
🧠 Technical Goals:
- Learn robust web scraping with BeautifulSoup.
- Handle HTML parsing, exception handling, and data cleaning.
- Build Excel reports programmatically using openpyxl with dynamic formatting.
2. Key Technologies & Libraries
| Category | Library | Purpose |
|---|---|---|
| Web Scraping | requests, BeautifulSoup | To fetch and parse website HTML content |
| Data Handling | pandas | For clean data manipulation |
| Excel Report | openpyxl | To create and style Excel files |
| Error Handling | logging | For runtime logging and debugging |
| Automation | os, datetime | For report naming and folder management |
Install all dependencies:
pip install requests beautifulsoup4 pandas openpyxl

3. Learning Outcomes
By completing this project, you will learn:
- ✅ Advanced web scraping techniques using BeautifulSoup.
- ✅ Data cleaning and structuring using pandas.
- ✅ Creating professional Excel reports using openpyxl.
- ✅ Adding timestamps and logs for automation pipelines.
- ✅ Following modular programming and clean code practices.
4. Step-by-Step Explanation
- Step 1 – Setup Environment: Create a project folder named web_scraper_excel_report/ and, inside it, a file named main.py.
- Step 2 – Import Required Libraries: Import modules for HTTP requests, HTML parsing, data handling, and Excel generation.
- Step 3 – Fetch HTML Data: Use requests to get the webpage content.
- Step 4 – Parse HTML with BeautifulSoup: Extract job titles, companies, and locations.
- Step 5 – Clean and Structure Data: Store the extracted data in a pandas DataFrame (a minimal sketch of Steps 3–5 follows this list).
- Step 6 – Export to Excel: Use openpyxl to create an Excel sheet, format headers, and save the report with a timestamp.
- Step 7 – Add Logging: Log success or errors for traceability.
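Before the complete program in the next section, here is the core fetch-and-parse pattern from Steps 3–5 in isolation. This is a minimal sketch using the same demo site and CSS classes as the full code below; it prints a count instead of building a report.

import requests
from bs4 import BeautifulSoup

# Fetch the demo page and parse each job card into a small dictionary.
response = requests.get("https://realpython.github.io/fake-jobs/", timeout=10)
response.raise_for_status()
soup = BeautifulSoup(response.text, "html.parser")

rows = []
for card in soup.find_all("div", class_="card-content"):
    rows.append({
        "Title": card.find("h2", class_="title").text.strip(),
        "Company": card.find("h3", class_="company").text.strip(),
        "Location": card.find("p", class_="location").text.strip(),
    })
print(f"Parsed {len(rows)} job cards")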
5. Full Working and Verified Python Code
"""
Project: Web Scraper + Excel Report Generator
Author: Tajammul Hussain
Level: Advanced Python Automation
"""
import requests
from bs4 import BeautifulSoup
import pandas as pd
from openpyxl import Workbook
from openpyxl.styles import Font, Alignment, PatternFill
from datetime import datetime
import os
import logging
# ================== LOGGING CONFIGURATION ==================
logging.basicConfig(
    filename="scraper_log.txt",
    level=logging.INFO,
    format="%(asctime)s - %(levelname)s - %(message)s"
)
# ================== SCRAPER FUNCTION ==================
def scrape_jobs(url):
    """Scrape job listings from the given URL and return structured data."""
    try:
        response = requests.get(url)
        response.raise_for_status()
        soup = BeautifulSoup(response.text, 'html.parser')
        jobs = soup.find_all('div', class_='card-content')
        job_list = []
        for job in jobs:
            title = job.find('h2', class_='title').text.strip()
            company = job.find('h3', class_='company').text.strip()
            location = job.find('p', class_='location').text.strip()
            job_list.append({
                "Title": title,
                "Company": company,
                "Location": location
            })
        logging.info(f"Successfully scraped {len(job_list)} job listings.")
        return job_list
    except Exception as e:
        logging.error(f"Error during scraping: {e}")
        return []
# ================== EXCEL REPORT FUNCTION ==================
def generate_excel_report(data, filename):
    """Generate a styled Excel report from scraped data."""
    if not data:
        logging.warning("No data available to write into Excel.")
        return
    df = pd.DataFrame(data)
    wb = Workbook()
    ws = wb.active
    ws.title = "Job Listings"
    # Write header
    for col_num, column_title in enumerate(df.columns, 1):
        cell = ws.cell(row=1, column=col_num, value=column_title)
        cell.font = Font(bold=True, color="FFFFFF")
        cell.alignment = Alignment(horizontal="center")
        cell.fill = PatternFill(start_color="4F81BD", end_color="4F81BD", fill_type="solid")
    # Write data
    for row in df.itertuples(index=False):
        ws.append(row)
    # Auto-adjust column width
    for col in ws.columns:
        max_length = 0
        column = col[0].column_letter
        for cell in col:
            try:
                if len(str(cell.value)) > max_length:
                    max_length = len(str(cell.value))
            except:
                pass
        ws.column_dimensions[column].width = max_length + 2
    wb.save(filename)
    logging.info(f"Excel report saved successfully as {filename}")
# ================== MAIN FUNCTION ==================
def main():
    url = "https://realpython.github.io/fake-jobs/"
    logging.info("Starting job scraping process...")
    data = scrape_jobs(url)
    if not data:
        print("No data scraped. Please check logs.")
        return
    # Create reports folder
    if not os.path.exists("reports"):
        os.makedirs("reports")
    timestamp = datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
    filename = f"reports/job_report_{timestamp}.xlsx"
    generate_excel_report(data, filename)
    print(f"✅ Excel report generated: {filename}")
if __name__ == "__main__":
    main()
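If you want to sanity-check the saved workbook programmatically, pandas can read it back. This verification snippet is an optional extra, not part of the project code above; it assumes at least one report already exists in the reports/ folder.

import glob
import os
import pandas as pd

# Pick the most recently created report and load the "Job Listings" sheet.
latest_report = max(glob.glob("reports/job_report_*.xlsx"), key=os.path.getmtime)
df = pd.read_excel(latest_report, sheet_name="Job Listings")
print(df.head())
print(f"{len(df)} rows read from {latest_report}")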
6. Sample Output or Results
Console Output:
✅ Excel report generated: reports/job_report_2025-10-28_12-35-47.xlsx
Excel Output Preview:
| Title | Company | Location |
|---|---|---|
| Python Developer | Real Python | Remote |
| Data Scientist | Data Corp | New York |
| Backend Engineer | Tech Solutions | San Francisco |

Log File (scraper_log.txt):
2025-10-28 12:35:47 - INFO - Successfully scraped 100 job listings.
2025-10-28 12:35:47 - INFO - Excel report saved successfully as reports/job_report_2025-10-28_12-35-47.xlsx

7. Possible Enhancements
🚀 Add more features for real-world applications:
- Add filtering (e.g., only “Python Developer” jobs) – a sketch follows this list.
- Integrate email automation to send the Excel report daily.
- Use Flask or Streamlit for a simple GUI dashboard.
- Scrape multiple pages automatically.
- Store data in a database (SQLite or PostgreSQL) instead of Excel – a SQLite sketch also follows this list.
- Deploy to cloud (AWS Lambda / Azure Function) as an automated data pipeline.
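The filtering idea only needs a small helper between scraping and report generation. A minimal sketch, where the keyword and the case-insensitive match are assumptions rather than part of the original project:

def filter_jobs(job_list, keyword="python developer"):
    """Sketch: keep only listings whose title contains the keyword (case-insensitive)."""
    return [job for job in job_list if keyword.lower() in job["Title"].lower()]

# In main(), after data = scrape_jobs(url):
# data = filter_jobs(data, keyword="python developer")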
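For the database enhancement, the standard-library sqlite3 module is enough for a first version. A minimal sketch, assuming a jobs.db filename and a simple three-column schema (neither is specified by the original project):

import sqlite3

def save_jobs_to_sqlite(job_list, db_path="jobs.db"):
    """Sketch: persist scraped rows into a local SQLite database."""
    conn = sqlite3.connect(db_path)
    try:
        conn.execute(
            "CREATE TABLE IF NOT EXISTS jobs (title TEXT, company TEXT, location TEXT)"
        )
        conn.executemany(
            "INSERT INTO jobs (title, company, location) VALUES (?, ?, ?)",
            [(j["Title"], j["Company"], j["Location"]) for j in job_list],
        )
        conn.commit()
    finally:
        conn.close()

# In main(), this could complement or replace generate_excel_report(data, filename):
# save_jobs_to_sqlite(data)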