
Introduction to NumPy and Pandas

What you'll learn

  • Why NumPy arrays are faster and more powerful than Python lists for math
  • How to build, slice, and compute with NumPy arrays (including broadcasting)
  • How to use Pandas Series and DataFrames to clean, filter, and summarize real data
  • How NumPy and Pandas work together for analysis
  • A mini project that analyzes a small dataset with both libraries

Setup

Make sure you have Python 3.9+ installed.

Install the libraries:

pip install numpy pandas

Part 1 — NumPy: Fast math with arrays

Plain Python lists are great for storing values, but they're slow for math. NumPy arrays store numbers in a compact block of memory and use fast C code under the hood.

Key ideas

  • Array: like a list, but typed and fast. Has shape (size of each dimension) and dtype (data type).
  • Vectorization: do math on whole arrays without loops.
  • Broadcasting: NumPy automatically stretches arrays with size 1 to match shapes during operations.
  • Axes: for 2D arrays, axis=0 runs down the rows (one result per column), and axis=1 runs across the columns (one result per row).

Code: Creating arrays and basic info

import numpy as np

# 1D and 2D arrays
a = np.array([1, 2, 3], dtype=np.int32)
b = np.array([[1, 2, 3], [4, 5, 6]], dtype=float)
print("a:", a, "dtype:", a.dtype, "shape:", a.shape)
print("b:\n", b, "\ndtype:", b.dtype, "shape:", b.shape, "ndim:", b.ndim)

# Quick arrays
zeros = np.zeros((2, 3))
ones = np.ones((3, 2))
r = np.arange(0, 10, 2)     # 0, 2, 4, 6, 8
lin = np.linspace(0, 1, 5)  # 5 evenly spaced numbers from 0 to 1
print("zeros:\n", zeros)
print("r:", r)
print("lin:", lin)

Vectorized math and ufuncs

x = np.array([10, 20, 30, 40], dtype=np.float64)
print("x * 0.1:", x * 0.1)  # Multiply every element
print("x + 5:", x + 5)      # Add to every element
print("sqrt:", np.sqrt(x))  # Universal function
print("mean:", np.mean(x))  # Aggregate

Aggregations and axes

M = np.array([[1, 2, 3], [4, 5, 6]])
print("sum all:", M.sum())
print("sum by column (axis=0):", M.sum(axis=0))  # [1+4, 2+5, 3+6]
print("sum by row (axis=1):", M.sum(axis=1))

Broadcasting (automatic shape matching)

# Column vector (3x1) + row vector (1x4) -> full 3x4 grid
col = np.array([[1], [2], [3]])
row = np.array([10, 20, 30, 40])
print("Broadcast result:\n", col + row)

Indexing, slicing, boolean masks

arr = np.arange(1, 13).reshape(3, 4)  # 3 rows, 4 cols
print("arr:\n", arr)
print("Element at row 1, col 2:", arr[1, 2])  # zero-based indexing
print("First row:", arr[0, :])
print("Last two columns:\n", arr[:, -2:])

# Boolean mask: pick even numbers
mask = (arr % 2 == 0)
print("Mask:\n", mask)
print("Even numbers:", arr[mask])

Random numbers (reproducible)

np.random.seed(42)  # same random values every run
rand_ints = np.random.randint(0, 100, size=(5, 3))
rand_norm = np.random.normal(loc=0, scale=1, size=5)
print("Random ints:\n", rand_ints)
print("Random normal:", rand_norm)

Part 2 — Pandas: Data tables you can filter and summarize

Pandas sits on top of NumPy and adds labels (row index, column names) and lots of data tools. Think of a DataFrame like a spreadsheet you can control with Python.

Key ideas

  • Series: one column (1D, labeled).
  • DataFrame: many columns (2D table), each column is a Series.
  • loc selects by labels; iloc selects by position.
  • GroupBy summarizes data by categories.
  • Missing values: NaN. Use isna, fillna, dropna.
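
As a quick preview of the missing-value tools in the last bullet, here is a minimal sketch using a throwaway Series:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])
print(s.isna())       # True where values are missing
print(s.fillna(0.0))  # replace NaN with a chosen value
print(s.dropna())     # drop the missing rows entirely
```

fillna and dropna both return new objects; the original Series is unchanged unless you reassign it.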

Create a DataFrame and explore

import pandas as pd
import numpy as np

df = pd.DataFrame({
    "name": ["Ava", "Ben", "Cara", "Dan", "Elle"],
    "class": ["Red", "Blue", "Red", "Blue", "Red"],
    "math": [88, 92, np.nan, 75, 85],
    "science": [91, 84, 89, 90, 88]
})
print(df.head())      # first rows
df.info()             # column types, non-null counts (prints directly)
print(df.describe())  # numeric summary

Selecting and filtering

# Select columns
print(df["math"])               # Series
print(df[["name", "science"]])  # DataFrame

# loc: filter rows with a condition, and choose columns
smart = df.loc[df["math"] >= 85, ["name", "math"]]
print("Math >= 85:\n", smart)

# iloc: select by position [rows, cols]
print("First 3 rows, cols 1..2:\n", df.iloc[0:3, 1:3])

New columns and vectorized operations

df["average"] = df[["math", "science"]].mean(axis=1)

# Grade letters using NumPy where
df["grade"] = np.where(df["average"] >= 90, "A",
              np.where(df["average"] >= 80, "B",
              np.where(df["average"] >= 70, "C", "D")))
print(df[["name", "average", "grade"]])

GroupBy and aggregation

grouped = df.groupby("class").agg(
    math_mean=("math", "mean"),
    sci_mean=("science", "mean"),
    count=("name", "count")
)
print(grouped)

Handle missing values (NaN)

print("Missing per column:\n", df.isna().sum())

# Fill math NaN with the column mean
df["math"] = df["math"].fillna(df["math"].mean())
print("After fill:\n", df)

Reading from and writing to CSV

df.to_csv("students.csv", index=False)
loaded = pd.read_csv("students.csv")
print("Loaded:\n", loaded.head())

NumPy + Pandas together

  • Pandas columns are typically backed by NumPy arrays, so NumPy functions work on them directly.
  • You can convert DataFrame parts to NumPy with .to_numpy() when needed.

Examples

# Use NumPy ufuncs directly on Series
loaded["log_math"] = np.log(loaded["math"])
print(loaded[["math", "log_math"]].head())

# Convert to NumPy for custom computation (row-wise norm)
scores = loaded[["math", "science"]].to_numpy()
row_norm = np.sqrt((scores**2).sum(axis=1))
loaded["score_norm"] = row_norm
print(loaded[["name", "score_norm"]])

Practical project — Student Scores Analyst

Goal: Use Pandas to load, clean, and summarize a small dataset, and NumPy to compute z-scores to spot unusually high or low scores.

What you'll build

  • A CSV of student test scores across subjects and dates
  • A cleaned DataFrame with missing values filled by subject mean
  • Summaries: top students, subject stats
  • Z-scores for each score to find outliers
  • Saved results as CSVs

Step 0: Create a sample dataset (or replace with your own CSV)

import numpy as np
import pandas as pd

np.random.seed(7)
students = ["Ava", "Ben", "Cara", "Dan", "Elle", "Finn", "Gia", "Hugo"]
subjects = ["Math", "Science", "History"]
dates = pd.date_range("2025-01-01", periods=10, freq="W")

rows = []
for d in dates:
    for s in subjects:
        for st in students:
            score = np.random.randint(50, 101)  # 50..100
            # Randomly make ~8% of scores missing
            if np.random.rand() < 0.08:
                score = np.nan
            rows.append({"date": d, "student": st, "subject": s, "score": score})

data = pd.DataFrame(rows)
data.to_csv("scores.csv", index=False)
print("Saved scores.csv with", len(data), "rows")

Step 1: Load and inspect

df = pd.read_csv("scores.csv", parse_dates=["date"])
print(df.head())
df.info()  # prints column types and non-null counts
print("Missing per column:\n", df.isna().sum())

Step 2: Clean missing scores (fill with subject mean)

# Compute mean score per subject, aligned to each row
subject_means = df.groupby("subject")["score"].transform("mean")
df["score_filled"] = df["score"].fillna(subject_means)

# Check no missing values remain in score_filled
print("Missing in score_filled:", df["score_filled"].isna().sum())

Step 3: Summaries

# Top 5 students by average score
student_avg = df.groupby("student")["score_filled"].mean().sort_values(ascending=False)
print("Top students:\n", student_avg.head(5))

# Subject-level stats
subject_stats = df.groupby("subject")["score_filled"].agg(["count", "mean", "median", "std", "min", "max"])
print("Subject stats:\n", subject_stats)

Step 4: Z-scores per subject (using NumPy)

Explanation: z = (value - mean) / std. This compares a score to others in the same subject.

mean_by_subj = df.groupby("subject")["score_filled"].transform("mean")
std_by_subj = df.groupby("subject")["score_filled"].transform("std")
df["zscore"] = (df["score_filled"] - mean_by_subj) / std_by_subj

# Find unusually high/low scores (|z| >= 2)
outliers = df.loc[df["zscore"].abs() >= 2, ["date", "student", "subject", "score_filled", "zscore"]]
print("Outliers:\n", outliers.sort_values("zscore"))

Step 5: Pivot table for a dashboard-like view

pivot = pd.pivot_table(
    df,
    index="student",
    columns="subject",
    values="score_filled",
    aggfunc="mean"
)
print("Average score per student per subject:\n", pivot.round(1))

Step 6: Save results

student_avg.to_csv("student_averages.csv", header=["avg_score"])
subject_stats.to_csv("subject_stats.csv")
outliers.to_csv("outliers.csv", index=False)
pivot.to_csv("student_subject_matrix.csv")
print("Saved analysis CSVs.")

Optional challenges

  • Filter by a date range to compare early vs. late performance.
  • Add a new feature: consistency = std per student (lower is steadier).
  • Use NumPy to compute the 90th percentile score overall:
    p90 = np.percentile(df["score_filled"], 90)
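
One way to sketch these challenges, using a tiny stand-in DataFrame with the same column names (date, student, score_filled) as the project above:

```python
import numpy as np
import pandas as pd

# Tiny made-up stand-in for the project's cleaned DataFrame
df = pd.DataFrame({
    "date": pd.to_datetime(["2025-01-05", "2025-01-05", "2025-02-23", "2025-02-23"]),
    "student": ["Ava", "Ben", "Ava", "Ben"],
    "score_filled": [80.0, 70.0, 90.0, 75.0],
})

# Challenge 1: compare early vs. late performance with a date cutoff
cutoff = pd.Timestamp("2025-02-01")
early_mean = df.loc[df["date"] < cutoff, "score_filled"].mean()
late_mean = df.loc[df["date"] >= cutoff, "score_filled"].mean()
print("early:", early_mean, "late:", late_mean)

# Challenge 2: consistency = std of each student's scores (lower is steadier)
consistency = df.groupby("student")["score_filled"].std().sort_values()
print("Consistency:\n", consistency)

# Challenge 3: 90th percentile of all filled scores
p90 = np.percentile(df["score_filled"], 90)
print("p90:", p90)
```

With the full scores.csv from the project, only the DataFrame construction changes; the three computations stay the same.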

Common tips and gotchas

  • Use df.loc[condition, columns] to filter and update safely.
  • When combining conditions, use parentheses and & (and), | (or).
    Example: df.loc[(df.math > 80) & (df.science > 80)]
  • Pay attention to shapes in NumPy; shape mismatches cause errors. Use reshape or keep dimensions aligned.
  • Set a random seed for reproducible results when using np.random.
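
The condition-combining and shape tips above can be sketched with a tiny made-up DataFrame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"math": [88, 75, 92], "science": [91, 90, 84]})

# Parentheses matter: without them, & is applied before the comparisons
both_high = df.loc[(df["math"] > 80) & (df["science"] > 80)]
print(both_high)

# Shapes: a length-3 array vs. a (3, 1) column vector broadcast to (3, 3)
a = np.array([1, 2, 3])  # shape (3,)
b = a.reshape(3, 1)      # shape (3, 1)
print((a + b).shape)
```

Dropping the parentheses in the loc condition raises an error, because `df["math"] > 80 & df["science"] > 80` tries to compute `80 & df["science"]` first.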

Summary

  • NumPy gives you fast, vectorized arrays with powerful math, broadcasting, and aggregation tools.
  • Pandas builds labeled tables (DataFrames) that make real-world data cleaning, filtering, and summarizing easy.
  • They work beautifully together: Pandas columns are NumPy arrays, so you can use NumPy functions directly.
  • You practiced loading data, fixing missing values, grouping and aggregating, computing z-scores, and saving results.

Next step ideas

  • Add plots with matplotlib or pandas.plot() to visualize your results.
  • Try merging two DataFrames (pd.merge) like "scores" with "student_info" to enrich your analysis.
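
A minimal sketch of that merge idea, with a hypothetical student_info lookup table (both table names and columns here are made up for illustration):

```python
import pandas as pd

scores = pd.DataFrame({"student": ["Ava", "Ben"], "score": [88, 75]})
student_info = pd.DataFrame({"student": ["Ava", "Ben"], "class": ["Red", "Blue"]})

# Left merge keeps every scores row and attaches matching info columns
merged = pd.merge(scores, student_info, on="student", how="left")
print(merged)
```

With how="left", students missing from student_info still appear in the result, just with NaN in the added columns.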
