Web Scraper + Excel Report Generator using BeautifulSoup and openpyxl
Advanced Python Automation with BeautifulSoup, pandas, and openpyxl
1. Project Overview
This project automates the process of extracting structured information from websites (like job listings, news headlines, or product data) and generating well-formatted Excel reports.
You’ll build a web scraper that:
- Extracts live data from a public website (we’ll use job postings from “https://realpython.github.io/fake-jobs/” for demonstration).
- Cleans and organizes the data.
- Generates an Excel report using openpyxl with proper formatting (headers, styling, etc.).
🎯 Real-World Use Case:
- Automate job data collection for HR analytics.
- Scrape e-commerce sites for product prices or reviews.
- Extract and track competitor updates for business intelligence.
🧠 Technical Goals:
- Learn robust web scraping with BeautifulSoup.
- Handle HTML parsing, exception handling, and data cleaning.
- Build Excel reports programmatically using openpyxl with dynamic formatting.
2. Key Technologies & Libraries
| Category | Library | Purpose |
|---|---|---|
| Web Scraping | requests, BeautifulSoup | To fetch and parse website HTML content |
| Data Handling | pandas | For clean data manipulation |
| Excel Report | openpyxl | To create and style Excel files |
| Error Handling | logging | For runtime logging and debugging |
| Automation | os, datetime | For report naming and folder management |
Install all dependencies:
pip install requests beautifulsoup4 pandas openpyxl

3. Learning Outcomes
By completing this project, you will learn:
- ✅ Advanced web scraping techniques using BeautifulSoup.
- ✅ Data cleaning and structuring using pandas.
- ✅ Creating professional Excel reports using openpyxl.
- ✅ Adding timestamps and logs for automation pipelines.
- ✅ Following modular programming and clean code practices.
4. Step-by-Step Explanation
- Step 1 – Setup Environment: Create a project folder named web_scraper_excel_report/ and, inside it, a file named main.py.
- Step 2 – Import Required Libraries: Import modules for HTTP requests, HTML parsing, data handling, and Excel generation.
- Step 3 – Fetch HTML Data: Use requests to get the webpage content.
- Step 4 – Parse HTML with BeautifulSoup: Extract job titles, companies, and locations.
- Step 5 – Clean and Structure Data: Store the extracted data in a pandas DataFrame (a minimal sketch of Steps 3–5 follows this list).
- Step 6 – Export to Excel: Use openpyxl to create an Excel sheet, format headers, and save the report with a timestamp.
- Step 7 – Add Logging: Log success or errors for traceability.
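Before the complete program in the next section, here is the core fetch-and-parse pattern from Steps 3–5 in isolation. This is a minimal sketch using the same demo site and CSS classes as the full code below; it prints a count instead of building a report.

import requests
from bs4 import BeautifulSoup

# Fetch the demo page and parse each job card into a small dictionary.
response = requests.get("https://realpython.github.io/fake-jobs/", timeout=10)
response.raise_for_status()
soup = BeautifulSoup(response.text, "html.parser")

rows = []
for card in soup.find_all("div", class_="card-content"):
    rows.append({
        "Title": card.find("h2", class_="title").text.strip(),
        "Company": card.find("h3", class_="company").text.strip(),
        "Location": card.find("p", class_="location").text.strip(),
    })
print(f"Parsed {len(rows)} job cards")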
5. Full Working and Verified Python Code
"""
Project: Web Scraper + Excel Report Generator
Author: Tajammul Hussain
Level: Advanced Python Automation
"""
import requests
from bs4 import BeautifulSoup
import pandas as pd
from openpyxl import Workbook
from openpyxl.styles import Font, Alignment, PatternFill
from datetime import datetime
import os
import logging
# ================== LOGGING CONFIGURATION ==================
logging.basicConfig(
    filename="scraper_log.txt",
    level=logging.INFO,
    format="%(asctime)s - %(levelname)s - %(message)s"
)
# ================== SCRAPER FUNCTION ==================
def scrape_jobs(url):
    """Scrape job listings from the given URL and return structured data."""
    try:
        response = requests.get(url)
        response.raise_for_status()
        soup = BeautifulSoup(response.text, 'html.parser')
        jobs = soup.find_all('div', class_='card-content')
        job_list = []
        for job in jobs:
            title = job.find('h2', class_='title').text.strip()
            company = job.find('h3', class_='company').text.strip()
            location = job.find('p', class_='location').text.strip()
            job_list.append({
                "Title": title,
                "Company": company,
                "Location": location
            })
        logging.info(f"Successfully scraped {len(job_list)} job listings.")
        return job_list
    except Exception as e:
        logging.error(f"Error during scraping: {e}")
        return []
# ================== EXCEL REPORT FUNCTION ==================
def generate_excel_report(data, filename):
    """Generate a styled Excel report from scraped data."""
    if not data:
        logging.warning("No data available to write into Excel.")
        return
    df = pd.DataFrame(data)
    wb = Workbook()
    ws = wb.active
    ws.title = "Job Listings"
    # Write header
    for col_num, column_title in enumerate(df.columns, 1):
        cell = ws.cell(row=1, column=col_num, value=column_title)
        cell.font = Font(bold=True, color="FFFFFF")
        cell.alignment = Alignment(horizontal="center")
        cell.fill = PatternFill(start_color="4F81BD", end_color="4F81BD", fill_type="solid")
    # Write data
    for row in df.itertuples(index=False):
        ws.append(row)
    # Auto-adjust column width
    for col in ws.columns:
        max_length = 0
        column = col[0].column_letter
        for cell in col:
            try:
                if len(str(cell.value)) > max_length:
                    max_length = len(str(cell.value))
            except:
                pass
        ws.column_dimensions[column].width = max_length + 2
    wb.save(filename)
    logging.info(f"Excel report saved successfully as {filename}")
# ================== MAIN FUNCTION ==================
def main():
    url = "https://realpython.github.io/fake-jobs/"
    logging.info("Starting job scraping process...")
    data = scrape_jobs(url)
    if not data:
        print("No data scraped. Please check logs.")
        return
    # Create reports folder
    if not os.path.exists("reports"):
        os.makedirs("reports")
    timestamp = datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
    filename = f"reports/job_report_{timestamp}.xlsx"
    generate_excel_report(data, filename)
    print(f"✅ Excel report generated: {filename}")
if __name__ == "__main__":
    main()
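If you want to sanity-check the saved workbook programmatically, pandas can read it back. This verification snippet is an optional extra, not part of the project code above; it assumes at least one report already exists in the reports/ folder.

import glob
import os
import pandas as pd

# Pick the most recently created report and load the "Job Listings" sheet.
latest_report = max(glob.glob("reports/job_report_*.xlsx"), key=os.path.getmtime)
df = pd.read_excel(latest_report, sheet_name="Job Listings")
print(df.head())
print(f"{len(df)} rows read from {latest_report}")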
6. Sample Output or Results
Console Output:
✅ Excel report generated: reports/job_report_2025-10-28_12-35-47.xlsx
Excel Output Preview:
| Title | Company | Location |
|---|---|---|
| Python Developer | Real Python | Remote |
| Data Scientist | Data Corp | New York |
| Backend Engineer | Tech Solutions | San Francisco |

Log File (scraper_log.txt):
2025-10-28 12:35:47 - INFO - Successfully scraped 100 job listings.
2025-10-28 12:35:47 - INFO - Excel report saved successfully as reports/job_report_2025-10-28_12-35-47.xlsx

7. Possible Enhancements
🚀 Add more features for real-world applications:
- Add filtering (e.g., only “Python Developer” jobs) – a sketch follows this list.
- Integrate email automation to send the Excel report daily.
- Use Flask or Streamlit for a simple GUI dashboard.
- Scrape multiple pages automatically.
- Store data in a database (SQLite or PostgreSQL) instead of Excel – a SQLite sketch also follows this list.
- Deploy to cloud (AWS Lambda / Azure Function) as an automated data pipeline.
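The filtering idea only needs a small helper between scraping and report generation. A minimal sketch, where the keyword and the case-insensitive match are assumptions rather than part of the original project:

def filter_jobs(job_list, keyword="python developer"):
    """Sketch: keep only listings whose title contains the keyword (case-insensitive)."""
    return [job for job in job_list if keyword.lower() in job["Title"].lower()]

# In main(), after data = scrape_jobs(url):
# data = filter_jobs(data, keyword="python developer")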
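For the database enhancement, the standard-library sqlite3 module is enough for a first version. A minimal sketch, assuming a jobs.db filename and a simple three-column schema (neither is specified by the original project):

import sqlite3

def save_jobs_to_sqlite(job_list, db_path="jobs.db"):
    """Sketch: persist scraped rows into a local SQLite database."""
    conn = sqlite3.connect(db_path)
    try:
        conn.execute(
            "CREATE TABLE IF NOT EXISTS jobs (title TEXT, company TEXT, location TEXT)"
        )
        conn.executemany(
            "INSERT INTO jobs (title, company, location) VALUES (?, ?, ?)",
            [(j["Title"], j["Company"], j["Location"]) for j in job_list],
        )
        conn.commit()
    finally:
        conn.close()

# In main(), this could complement or replace generate_excel_report(data, filename):
# save_jobs_to_sqlite(data)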