Web Scraper + Excel Report Generator using BeautifulSoup and openpyxl

Level: Advanced

Advanced Python Automation with BeautifulSoup, pandas, and openpyxl

1. Project Overview

This project automates extracting structured information (such as job listings, news headlines, or product data) from websites and generating well-formatted Excel reports.
You’ll build a web scraper that:

  • Extracts live data from a public website (we’ll use job postings from “https://realpython.github.io/fake-jobs/” for demonstration).
  • Cleans and organizes the data.
  • Generates an Excel report using openpyxl with proper formatting (headers, styling, etc.).

🎯 Real-World Use Case:

  • Automate job data collection for HR analytics.
  • Scrape e-commerce sites for product prices or reviews.
  • Extract and track competitor updates for business intelligence.

🧠 Technical Goals:

  • Learn robust web scraping with BeautifulSoup.
  • Handle HTML parsing, exception handling, and data cleaning.
  • Build Excel reports programmatically using openpyxl with dynamic formatting.

2. Key Technologies & Libraries

Category       | Library                 | Purpose
---------------|-------------------------|--------------------------------------------
Web Scraping   | requests, BeautifulSoup | Fetch and parse website HTML content
Data Handling  | pandas                  | Clean data manipulation
Excel Report   | openpyxl                | Create and style Excel files
Error Handling | logging                 | Runtime logging and debugging
Automation     | os, datetime            | Report naming and folder management

Install all dependencies:

pip install requests beautifulsoup4 pandas openpyxl
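
If you want to confirm the environment before writing any project code, a quick import check like the one below (an optional sanity check, not part of the project itself) will surface a missing package immediately:

# Optional sanity check: verify all four dependencies import cleanly
import requests
import bs4
import pandas
import openpyxl

print("requests:", requests.__version__)
print("beautifulsoup4:", bs4.__version__)
print("pandas:", pandas.__version__)
print("openpyxl:", openpyxl.__version__)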

3. Learning Outcomes

By completing this project, you will learn:

  • ✅ Advanced web scraping techniques using BeautifulSoup.
  • ✅ Data cleaning and structuring using pandas.
  • ✅ Creating professional Excel reports using openpyxl.
  • ✅ Adding timestamps and logs for automation pipelines.
  • ✅ Following modular programming and clean code practices.

4. Step-by-Step Explanation

Step 1 – Set Up the Environment
Create a project folder named web_scraper_excel_report/ and, inside it, a file named main.py.

Step 2 – Import Required Libraries
Import the modules for HTTP requests, HTML parsing, data handling, and Excel generation.

Step 3 – Fetch HTML Data
Use requests to download the webpage content.

Step 4 – Parse HTML with BeautifulSoup
Extract job titles, companies, and locations (a minimal sketch of Steps 3 and 4 follows this list).

Step 5 – Clean and Structure Data
Store the extracted records in a pandas DataFrame.

Step 6 – Export to Excel
Use openpyxl to create a worksheet, style the headers, and save the report with a timestamped filename.

Step 7 – Add Logging
Log successes and errors to a log file for traceability.
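
Before assembling the full program in Section 5, here is a minimal sketch of Steps 3 and 4. It assumes the fake-jobs demo page keeps its current markup (div.card-content cards with an h2.title inside), which is the same structure the full code relies on:

# Minimal sketch of Steps 3-4: fetch the page and pull one field per job card
import requests
from bs4 import BeautifulSoup

response = requests.get("https://realpython.github.io/fake-jobs/", timeout=10)
response.raise_for_status()  # fail fast on HTTP errors

soup = BeautifulSoup(response.text, "html.parser")
for card in soup.find_all("div", class_="card-content"):
    print(card.find("h2", class_="title").text.strip())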

5. Full Working and Verified Python Code

""" Project: Web Scraper + Excel Report Generator Author: Tajammul Hussain Level: Advanced Python Automation """ import requests from bs4 import BeautifulSoup import pandas as pd from openpyxl import Workbook from openpyxl.styles import Font, Alignment, PatternFill from datetime import datetime import os import logging # ================== LOGGING CONFIGURATION ================== logging.basicConfig( filename="scraper_log.txt", level=logging.INFO, format="%(asctime)s - %(levelname)s - %(message)s" ) # ================== SCRAPER FUNCTION ================== def scrape_jobs(url): """Scrape job listings from the given URL and return structured data.""" try: response = requests.get(url) response.raise_for_status() soup = BeautifulSoup(response.text, 'html.parser') jobs = soup.find_all('div', class_='card-content') job_list = [] for job in jobs: title = job.find('h2', class_='title').text.strip() company = job.find('h3', class_='company').text.strip() location = job.find('p', class_='location').text.strip() job_list.append({ "Title": title, "Company": company, "Location": location }) logging.info(f"Successfully scraped {len(job_list)} job listings.") return job_list except Exception as e: logging.error(f"Error during scraping: {e}") return [] # ================== EXCEL REPORT FUNCTION ================== def generate_excel_report(data, filename): """Generate a styled Excel report from scraped data.""" if not data: logging.warning("No data available to write into Excel.") return df = pd.DataFrame(data) wb = Workbook() ws = wb.active ws.title = "Job Listings" # Write header for col_num, column_title in enumerate(df.columns, 1): cell = ws.cell(row=1, column=col_num, value=column_title) cell.font = Font(bold=True, color="FFFFFF") cell.alignment = Alignment(horizontal="center") cell.fill = PatternFill(start_color="4F81BD", end_color="4F81BD", fill_type="solid") # Write data for row in df.itertuples(index=False): ws.append(row) # Auto-adjust column width for col in ws.columns: max_length = 0 column = col[0].column_letter for cell in col: try: if len(str(cell.value)) > max_length: max_length = len(str(cell.value)) except: pass ws.column_dimensions[column].width = max_length + 2 wb.save(filename) logging.info(f"Excel report saved successfully as {filename}") # ================== MAIN FUNCTION ================== def main(): url = "https://realpython.github.io/fake-jobs/" logging.info("Starting job scraping process...") data = scrape_jobs(url) if not data: print("No data scraped. Please check logs.") return # Create reports folder if not os.path.exists("reports"): os.makedirs("reports") timestamp = datetime.now().strftime("%Y-%m-%d_%H-%M-%S") filename = f"reports/job_report_{timestamp}.xlsx" generate_excel_report(data, filename) print(f"✅ Excel report generated: {filename}") if __name__ == "__main__": main()

6. Sample Output or Results

Console Output:

✅ Excel report generated: reports/job_report_2025-10-28_12-35-47.xlsx

Excel Output Preview:

Title            | Company        | Location
-----------------|----------------|--------------
Python Developer | Real Python    | Remote
Data Scientist   | Data Corp      | New York
Backend Engineer | Tech Solutions | San Francisco

Log File (scraper_log.txt):

2025-10-28 12:35:47 - INFO - Successfully scraped 100 job listings.
2025-10-28 12:35:47 - INFO - Excel report saved successfully as reports/job_report_2025-10-28_12-35-47.xlsx

7. Possible Enhancements

🚀 Add more features for real-world applications:

  1. Add filtering (e.g., only “Python Developer” jobs) — a small sketch follows this list.
  2. Integrate email automation to send the Excel report daily.
  3. Use Flask or Streamlit for a simple GUI dashboard.
  4. Scrape multiple pages automatically.
  5. Store data in a database (SQLite or PostgreSQL) instead of Excel.
  6. Deploy to cloud (AWS Lambda / Azure Function) as an automated data pipeline.
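
As a starting point for enhancement 1, here is a minimal filtering sketch. It assumes the list-of-dicts structure returned by scrape_jobs() in Section 5; the keyword "python" is just an illustrative choice:

# Minimal sketch for enhancement 1: keep only jobs whose title matches a keyword.
# Assumes the list of {"Title", "Company", "Location"} dicts from scrape_jobs().
def filter_jobs(job_list, keyword="python"):
    return [job for job in job_list if keyword.lower() in job["Title"].lower()]

# Example usage alongside the functions from Section 5:
# data = scrape_jobs("https://realpython.github.io/fake-jobs/")
# python_jobs = filter_jobs(data, keyword="python developer")
# generate_excel_report(python_jobs, "reports/python_jobs.xlsx")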