Automated Resume Screener using NLP and spaCy
Advanced
Build an intelligent resume screening system with natural language processing
1) Project Overview
What it does:
The Automated Resume Screener ingests a set of candidate resumes (TXT, PDF-to-text, or plain text), extracts structured information (name, email, phone, skills, education, years of experience), compares each resume to a job description using keyword matching and semantic similarity, scores and ranks candidates, and exports a short report (ranked list, per-resume scores, and extracted fields).
Real-world use case:
Recruiting teams receive many resumes. Automating initial screening saves time by surfacing the most relevant candidates for a role based on required skills, experience, and education. The system can also be integrated into an applicant tracking system (ATS) or used as a pre-filter for recruiters.
Technical goals:
- Use spaCy for NER (names, organizations, education patterns) and for semantic similarity
- Build a robust pipeline for extracting key resume fields using regex, rule-based matching, and semantic matching
- Compute a composite score combining skill matches, experience years, education level, and semantic similarity
- Provide a reproducible, tested script that runs locally and outputs CSV/JSON reports
2) Key Technologies & Libraries
| Technology | Purpose |
|---|---|
| Python 3.9+ | Core programming language (3.10+ recommended) |
| spaCy | NLP: en_core_web_md model (for word vectors & similarity) |
| re | Built-in regex module for phone/email/date extraction |
| pandas | Optional, for neat CSV output; recommended |
| python-docx or pdfminer.six | Optional; to parse DOCX / PDF files if you add file import |
Install required packages:
python -m pip install spacy pandas
python -m spacy download en_core_web_md
(If you won't use pandas, the script still runs; pandas just makes CSV export simpler.)
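A quick sanity check that the model and its word vectors installed correctly (a minimal sketch; any two related phrases will do):

```python
# Verify the md model loads and ships with vectors: similarity between
# related phrases should print a meaningful value rather than a warning.
import spacy

nlp = spacy.load("en_core_web_md")
print(nlp("python developer").similarity(nlp("software engineer")))
```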
3) Learning Outcomes
By completing this project you will learn:
- Practical NLP using spaCy: NER, rule-based matching, and semantic similarity
- Extracting structured data from semi-structured documents using regex + rules
- Building scoring functions that combine symbolic (keyword) matching with semantic similarity
- Evaluating and debugging a small NLP pipeline (test cases, sample data)
- Exporting results for downstream use (CSV/JSON)
- Engineering considerations: token limits, model choice, performance, and extensibility
4) Step-by-Step Explanation
- Project scaffold: create a project folder and a virtualenv, install dependencies, and add subfolders resumes/ (input) and outputs/ (reports)
- Model and data: download en_core_web_md for vectors (python -m spacy download en_core_web_md); prepare a job_description.txt and a few resume_*.txt sample files
- Extraction components: email and phone extraction via regex; name extraction via spaCy PERSON entities (fallback: first line); education extraction via keyword matching (BSc, MSc, PhD, BS, BA, Bachelor, Master, MBA); skills extraction against a curated skill list using word-boundary matching (spaCy's PhraseMatcher is a natural upgrade; see the sketch after this list); experience estimation by detecting year ranges or phrases like "5 years" / "3 yrs"
- Scoring: Skill Match Score = fraction of required skills present; Experience Score = normalized years of experience; Education Score = mapping (PhD > Master > Bachelor > Other); Semantic Score = Doc.similarity() between resume text and job description (spaCy vectors); a composite weighted score ranks candidates (a worked example follows this list)
- Pipeline: read resumes → preprocess → extract → score → rank → save outputs
- Testing: use the sample resumes included in the script to validate extraction and ranking
- Export: generate CSV and JSON reports with extracted fields and ranking
- Extend: add PDF/DOCX parsing, a UI, a database, or integrate with a web front-end
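The script in section 5 matches skills with word-boundary regex over a curated list. If you prefer token-aware matching, a minimal sketch using spaCy's PhraseMatcher looks like this (the skills subset shown is illustrative):

```python
# Sketch: skill extraction with spaCy's PhraseMatcher (token-level, case-insensitive)
# as an alternative to the regex matching used in the main script.
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.load("en_core_web_md")
skills = ["python", "machine learning", "docker", "spacy"]  # illustrative subset

matcher = PhraseMatcher(nlp.vocab, attr="LOWER")  # match on lowercased tokens
matcher.add("SKILLS", [nlp.make_doc(s) for s in skills])

doc = nlp("Built Machine Learning pipelines in Python; deployed with Docker.")
found = sorted({doc[start:end].text.lower() for _, start, end in matcher(doc)})
print(found)  # ['docker', 'machine learning', 'python']
```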
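For intuition about the scoring step: with the default weights (skills 0.5, experience 0.2, education 0.15, semantic 0.15), a hypothetical candidate who matches 5 of 7 required skills (≈ 0.714), has 6 years of experience against a 3-year target (capped at 1.0), holds a Master's degree (3/4 = 0.75), and reaches 0.90 semantic similarity scores 0.5 × 0.714 + 0.2 × 1.0 + 0.15 × 0.75 + 0.15 × 0.90 ≈ 0.805.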
5) Full Working and Verified Python Code
Save this as resume_screener.py. It is self-contained and includes realistic sample resumes (so you can run immediately). If you have your own resume files, place them in a resumes/ folder or adapt the script.
"""
resume_screener.py
Automated Resume Screener using spaCy (en_core_web_md).
Self-contained example with sample resume texts and a sample job description.
Requirements:
pip install spacy pandas
python -m spacy download en_core_web_md
Run:
python resume_screener.py
Outputs:
outputs/ranked_candidates.csv
outputs/ranked_candidates.json
"""
from __future__ import annotations
import re
import os
import json
import math
import logging
from pathlib import Path
from datetime import datetime
from typing import List, Dict, Any, Tuple, Optional
import spacy
# Optional: pandas for nicer CSV export; fallback to csv writer if pandas missing
try:
import pandas as pd
except Exception:
pd = None
# ----------------------- Configuration -----------------------
MODEL_NAME = "en_core_web_md"
# Weights for scoring (tweakable)
WEIGHT_SKILLS = 0.5
WEIGHT_EXPERIENCE = 0.2
WEIGHT_EDUCATION = 0.15
WEIGHT_SEMANTIC = 0.15
# Directories
BASE_DIR = Path(__file__).parent.resolve()
RESUMES_DIR = BASE_DIR / "resumes" # optional external resumes
OUTPUT_DIR = BASE_DIR / "outputs"
OUTPUT_DIR.mkdir(exist_ok=True)
logging.basicConfig(level=logging.INFO, format="%(levelname)s: %(message)s")
# ----------------------- Utilities -----------------------
EMAIL_RE = re.compile(r"[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+")
PHONE_RE = re.compile(
r"(\+?\d{1,3}[\s-]?)?(?:\(?\d{2,4}\)?[\s-]?)?\d{3,4}[\s-]?\d{3,4}"
)
YEAR_RANGE_RE = re.compile(r"(\b(19|20)\d{2})\s*(?:[-–—]|to)\s*(\b(19|20)\d{2})", re.IGNORECASE)
YEARS_RE = re.compile(r"(\d+)\s*(?:\+?\s*)?(?:years|yrs|y)\b", re.IGNORECASE)
# A curated skills list (expand as needed)
COMMON_SKILLS = [
"python", "java", "c++", "c#", "sql", "javascript", "react", "node", "tensorflow",
"pytorch", "scikit-learn", "pandas", "numpy", "matplotlib", "seaborn", "docker",
"kubernetes", "aws", "azure", "gcp", "nlp", "computer vision", "opencv", "spaCy",
"keras", "flask", "django", "git", "linux", "bash", "excel", "powerbi"
]
# Normalize to lowercase
COMMON_SKILLS = [s.lower() for s in COMMON_SKILLS]
# ----------------------- Load spaCy -----------------------
def load_nlp(model_name: str = MODEL_NAME):
try:
nlp = spacy.load(model_name)
logging.info(f"Loaded spaCy model: {model_name}")
except OSError as e:
raise RuntimeError(
f"spaCy model '{model_name}' not found. Install with:\n"
f" python -m spacy download {model_name}"
) from e
return nlp
# ----------------------- Extraction Functions -----------------------
def extract_emails(text: str) -> List[str]:
return list({m.group(0) for m in EMAIL_RE.finditer(text)})
def extract_phones(text: str) -> List[str]:
    # Filter out year ranges and short numeric sequences that aren't phones
    phones = []
    for m in PHONE_RE.finditer(text):
        p = m.group(0).strip()
        if YEAR_RANGE_RE.fullmatch(p):  # skip year spans like "2018-2024"
            continue
        digits = re.sub(r"[^\d]", "", p)
        if 7 <= len(digits) <= 15:
            phones.append(p)
    return list(dict.fromkeys(phones))
def extract_name(doc: spacy.tokens.Doc, text: str) -> Optional[str]:
# Prefer PERSON entity; else first non-empty line
for ent in doc.ents:
if ent.label_ == "PERSON":
return ent.text.strip()
# Fallback: first non-empty line with letters
for line in text.splitlines():
line = line.strip()
if len(line) > 1 and re.search(r"[A-Za-z]", line):
return line.split("|")[0].strip()
return None
EDUCATION_KEYWORDS = {
    "phd": 4,
    "ph.d": 4,
    "doctor": 4,
    "master": 3,
    "msc": 3,
    "m.sc": 3,
    "ms": 3,
    "m.s": 3,
    "mba": 3,
    "bachelor": 2,
    "bsc": 2,
    "b.sc": 2,
    "bs": 2,
    "b.s": 2,
    "ba": 2,
    "b.a": 2,
    "b.e": 2,
    "high school": 1,
    "diploma": 1,
}
def extract_education(text: str) -> Tuple[Optional[str], int]:
    """
    Return the best matching education line and a numeric rank (higher is better).
    Keywords are matched on word boundaries (so "ba" does not match "backend"),
    and the full line containing the best match is returned as context.
    """
    best = None
    best_score = 0
    for line in text.splitlines():
        line_lower = line.lower()
        for kw, score in EDUCATION_KEYWORDS.items():
            if score > best_score and re.search(r"\b" + re.escape(kw) + r"\b", line_lower):
                best_score = score
                best = line.strip()
    return best, best_score  # best_score == 0 => no degree found
def extract_skills(text: str, nlp) -> List[str]:
"""
Extract skills by exact matching common skills and by noun chunk similarity.
"""
found = set()
text_lower = text.lower()
# Exact/substring matching
for skill in COMMON_SKILLS:
if re.search(r"\b" + re.escape(skill) + r"\b", text_lower):
found.add(skill)
# Additional: try to extract noun chunks that look like skills (simple heuristics)
doc = nlp(text)
for chunk in doc.noun_chunks:
ch = chunk.text.lower().strip()
if len(ch.split()) <= 3 and any(c.isalpha() for c in ch):
            # treat it as a skill candidate if it contains known multi-word skill terms
if "machine learning" in ch or "deep learning" in ch or "natural language" in ch:
found.add(ch)
return sorted(found)
def estimate_experience_years(text: str) -> float:
"""
Heuristic estimate of years of experience:
- Parse explicit 'X years' mentions and use the max
- Parse year ranges like 2018-2022 to count years
- Fallback to 0
"""
years = []
for m in YEARS_RE.finditer(text):
try:
years.append(int(m.group(1)))
except Exception:
continue
for m in YEAR_RANGE_RE.finditer(text):
try:
start = int(m.group(1))
end = int(m.group(3))
if end >= start:
years.append(end - start)
except Exception:
continue
# If explicit year counts found, take max
if years:
return float(max(years))
    # No explicit year mentions or ranges found; fall back to 0
return 0.0
# ----------------------- Scoring -----------------------
def compute_skill_score(required_skills: List[str], candidate_skills: List[str]) -> float:
if not required_skills:
return 1.0
req = set(s.lower() for s in required_skills)
cand = set(s.lower() for s in candidate_skills)
matched = req.intersection(cand)
return len(matched) / len(req)
def compute_experience_score(years: float, target_years: float = 3.0) -> float:
"""
Normalize years into 0..1. If candidate has >= 2 * target_years => score 1
"""
if years <= 0:
return 0.0
return min(1.0, years / (2 * target_years))
def compute_education_score(edu_rank: int) -> float:
"""
Map education rank (0..4) to 0..1
"""
if edu_rank <= 0:
return 0.0
return edu_rank / max(EDUCATION_KEYWORDS.values())
def compute_semantic_score(nlp, resume_text: str, job_text: str) -> float:
try:
r_doc = nlp(resume_text[:20000]) # limit length to avoid extreme compute
j_doc = nlp(job_text[:20000])
        # Cosine similarity can fall outside 0..1 (it may be negative); clamp.
sim = max(0.0, min(1.0, r_doc.similarity(j_doc)))
return sim
except Exception:
return 0.0
def composite_score(skill_s: float, exp_s: float, edu_s: float, sem_s: float) -> float:
return (
WEIGHT_SKILLS * skill_s
+ WEIGHT_EXPERIENCE * exp_s
+ WEIGHT_EDUCATION * edu_s
+ WEIGHT_SEMANTIC * sem_s
)
# ----------------------- Main pipeline -----------------------
def analyze_resume(text: str, nlp, job_req_skills: List[str], job_text: str) -> Dict[str, Any]:
    # Basic preprocessing: trim surrounding whitespace
raw_text = text.strip()
doc = nlp(raw_text[:20000])
name = extract_name(doc, raw_text) or "Unknown"
emails = extract_emails(raw_text)
phones = extract_phones(raw_text)
education_snippet, edu_rank = extract_education(raw_text)
skills = extract_skills(raw_text, nlp)
years = estimate_experience_years(raw_text)
skill_score = compute_skill_score(job_req_skills, skills)
exp_score = compute_experience_score(years)
edu_score = compute_education_score(edu_rank)
sem_score = compute_semantic_score(nlp, raw_text, job_text)
total = composite_score(skill_score, exp_score, edu_score, sem_score)
return {
"name": name,
"emails": emails,
"phones": phones,
"education": education_snippet or "",
"education_rank": edu_rank,
"skills": skills,
"years_experience": years,
"skill_score": round(skill_score, 3),
"experience_score": round(exp_score, 3),
"education_score": round(edu_score, 3),
"semantic_score": round(sem_score, 3),
"composite_score": round(total, 4),
}
def rank_candidates(resume_texts: Dict[str, str], job_description: str, required_skills: List[str], nlp) -> List[Dict[str, Any]]:
results = []
for uid, text in resume_texts.items():
logging.info(f"Processing resume: {uid}")
info = analyze_resume(text, nlp, required_skills, job_description)
info["candidate_id"] = uid
results.append(info)
# Sort by composite_score descending
results.sort(key=lambda x: x["composite_score"], reverse=True)
return results
def save_results(results: List[Dict[str, Any]]):
# Save JSON
json_path = OUTPUT_DIR / "ranked_candidates.json"
with open(json_path, "w", encoding="utf-8") as f:
json.dump(results, f, indent=2, ensure_ascii=False)
logging.info(f"Saved JSON results to {json_path}")
# Save CSV (if pandas available)
csv_path = OUTPUT_DIR / "ranked_candidates.csv"
if pd:
df = pd.DataFrame(results)
df.to_csv(csv_path, index=False)
logging.info(f"Saved CSV results to {csv_path}")
else:
# fallback to manual CSV
import csv
keys = list(results[0].keys()) if results else []
with open(csv_path, "w", newline="", encoding="utf-8") as f:
writer = csv.DictWriter(f, fieldnames=keys)
writer.writeheader()
for r in results:
writer.writerow(r)
logging.info(f"Saved CSV results to {csv_path}")
# ----------------------- Sample data (for immediate testing) -----------------------
SAMPLE_JOB_DESCRIPTION = """
We are hiring a Senior NLP Engineer to work on conversational AI and document understanding.
Required skills: Python, spaCy, NLP, machine learning, deep learning, PyTorch or TensorFlow, experience with production pipelines.
Responsibilities include building NER/QA pipelines, creating training data, and deploying models in Docker/AWS.
"""
SAMPLE_REQUIRED_SKILLS = ["python", "nlp", "spacy", "pytorch", "tensorflow", "machine learning", "docker"]
SAMPLE_RESUMES = {
"cand_001": """
John A. Smith
Email: john.smith@example.com | Phone: +1-555-123-4567
Profile:
Senior Machine Learning Engineer with 6 years of experience in NLP and deep learning. Worked on transformer-based models for document classification and question answering.
Experience:
2018-2024: Senior ML Engineer at ExampleAI Inc.
- Built spaCy pipelines and custom NER models.
- Trained transformers with PyTorch and fine-tuned BERT variants.
- Deployed services in Docker and Kubernetes.
Education:
M.S. in Computer Science, University of Somewhere, 2016
B.S. in Computer Science, 2014
Skills: Python, PyTorch, spaCy, NLP, machine learning, Docker, Kubernetes, AWS, SQL
""",
"cand_002": """
Maria Lopez
maria.lopez@example.com
Phone: (555) 987-6543
Summary:
Data Scientist with 3 years of industry experience focusing on classical ML, feature engineering, and deployments.
Experience:
2019-2022: Data Scientist at DataCorp
- Built ML pipelines using scikit-learn and pandas.
- Some NLP tasks using NLTK and spaCy for tokenization.
Education:
B.Sc. in Statistics, 2017
Skills: Python, pandas, scikit-learn, SQL, matplotlib
""",
"cand_003": """
Ahmed Khan
Email: ahmed.khan@samplemail.com
Contact: +44 7700 900123
Professional Summary:
Experienced Software Engineer with 8+ years of experience building backend services in Python and Java. Moderate exposure to ML; completed online courses in deep learning.
Experience:
2016-2024: Senior Backend Engineer at WebServices Ltd.
- Built microservices and CI/CD pipelines. Integrated ML models into services.
Education:
B.E. in Software Engineering, 2012
MBA, 2018
Skills: Python, Java, Docker, AWS, REST APIs, SQL
""",
}
def main():
nlp = load_nlp(MODEL_NAME)
resume_texts = dict(SAMPLE_RESUMES)
if RESUMES_DIR.exists() and any(RESUMES_DIR.iterdir()):
logging.info("Found resumes/ directory; reading .txt files from it.")
for p in sorted(RESUMES_DIR.glob("*.txt")):
text = p.read_text(encoding="utf-8")
resume_texts[p.stem] = text
results = rank_candidates(resume_texts, SAMPLE_JOB_DESCRIPTION, SAMPLE_REQUIRED_SKILLS, nlp)
for i, r in enumerate(results, 1):
print(f"#{i} Candidate ID: {r['candidate_id']} Name: {r['name']} Score: {r['composite_score']}")
print(f" Skills Found: {', '.join(r['skills'])}")
print(f" Years Experience (estimate): {r['years_experience']}")
print(f" Education: {r['education']}")
print(f" Emails: {', '.join(r['emails']) if r['emails'] else 'N/A'}")
print("")
save_results(results)
logging.info("Done.")
if __name__ == "__main__":
    main()

Notes:
- The script uses en_core_web_md to compute semantic similarity; the model must be installed
- Three realistic sample resumes are included so you can run it immediately
- If you create a resumes/ directory and drop .txt files there, the script includes them in the ranking
- The scoring is heuristic and customizable via the weights and scoring functions
6) Sample Output or Results
When you run python resume_screener.py, the console prints a ranked list similar to the following (the extracted fields are deterministic; the similarity-dependent scores are illustrative and will vary slightly with the model version):
INFO: Loaded spaCy model: en_core_web_md
INFO: Processing resume: cand_001
INFO: Processing resume: cand_002
INFO: Processing resume: cand_003
#1 Candidate ID: cand_001 Name: John A. Smith Score: 0.8791
  Skills Found: aws, docker, kubernetes, machine learning, nlp, python, pytorch, spacy, sql
  Years Experience (estimate): 6.0
  Education: M.S. in Computer Science, University of Somewhere, 2016
  Emails: john.smith@example.com
#2 Candidate ID: cand_003 Name: Ahmed Khan Score: 0.5874
  Skills Found: aws, docker, java, python, sql
  Years Experience (estimate): 8.0
  Education: MBA, 2018
  Emails: ahmed.khan@samplemail.com
#3 Candidate ID: cand_002 Name: Maria Lopez Score: 0.5243
  Skills Found: matplotlib, nlp, pandas, python, scikit-learn, spacy, sql
  Years Experience (estimate): 3.0
  Education: B.Sc. in Statistics, 2017
  Emails: maria.lopez@example.com
INFO: Saved JSON results to /path/to/project/outputs/ranked_candidates.json
INFO: Saved CSV results to /path/to/project/outputs/ranked_candidates.csv
The outputs/ folder will contain ranked_candidates.json and ranked_candidates.csv (written with pandas when available, otherwise via Python's built-in csv module). Each candidate entry includes the extracted fields and the component scores.
7) Possible Enhancements
To make this project more advanced:
- PDF / DOCX parsing: integrate pdfminer.six or textract and python-docx to read resumes in native formats; pre-process to clean formatting noise (see the first sketch after this list)
- Advanced skill matching: use a curated ontology and fuzzy matching (e.g., thefuzz, formerly fuzzywuzzy, or token-set similarity) to match variant phrases ("deep-learning" vs "deep learning"); a stdlib sketch follows this list
- Trainable classifier: collect labeled historical data and train a supervised model (XGBoost, logistic regression, or a small neural net) to predict suitability
- RAG / knowledge augmentation: incorporate company-specific documents or role-specific corpora to compute relevance beyond general semantic similarity
- Human-in-the-loop: provide a web UI so recruiters can accept/reject candidates and store those labels; use them to iteratively improve the model
- Explainability: add explanations of why a candidate scored high or low, e.g., top matched skills and missing skills
- Deployment: wrap the screener as a Flask/FastAPI microservice and deploy it to the cloud (AWS/GCP/Azure) with authentication and persistence
- Parallel processing: for large batches, use multiprocessing or async I/O and batch the similarity computations (see the nlp.pipe sketch after this list)
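A minimal sketch of the PDF-parsing enhancement, assuming pdfminer.six is installed (pip install pdfminer.six) and reusing RESUMES_DIR from the main script:

```python
# Sketch: load PDF resumes into the same {candidate_id: text} mapping the
# main script builds from .txt files. Assumes pdfminer.six is installed.
from pathlib import Path
from pdfminer.high_level import extract_text

def load_pdf_resumes(folder: Path) -> dict:
    texts = {}
    for pdf in sorted(folder.glob("*.pdf")):
        texts[pdf.stem] = extract_text(str(pdf))  # whole document as one string
    return texts

# Usage inside main(): resume_texts.update(load_pdf_resumes(RESUMES_DIR))
```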
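For fuzzy skill matching, a library like thefuzz adds more scoring options, but the standard library is enough for a sketch:

```python
# Sketch: tolerate variant spellings ("deep-learning" vs "deep learning")
# using only the standard library; cutoff is a tunable similarity threshold.
import difflib
from typing import List, Optional

def fuzzy_skill_match(phrase: str, skills: List[str], cutoff: float = 0.85) -> Optional[str]:
    norm = phrase.lower().replace("-", " ").strip()  # normalize separators
    hits = difflib.get_close_matches(norm, skills, n=1, cutoff=cutoff)
    return hits[0] if hits else None

print(fuzzy_skill_match("Deep-Learning", ["deep learning", "machine learning"]))
# -> deep learning
```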
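And for batching, spaCy's nlp.pipe amortizes pipeline overhead across documents; a sketch assuming the nlp object and resume_texts dict from the main script:

```python
# Sketch: compute all semantic scores in one batched pass with nlp.pipe
# instead of calling nlp() once per resume.
def batched_semantic_scores(nlp, resume_texts, job_text, batch_size=16):
    job_doc = nlp(job_text[:20000])
    ids = list(resume_texts.keys())
    docs = nlp.pipe((resume_texts[i][:20000] for i in ids), batch_size=batch_size)
    # Clamp each similarity into 0..1, mirroring compute_semantic_score().
    return {i: max(0.0, min(1.0, d.similarity(job_doc))) for i, d in zip(ids, docs)}
```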