Intermediate · 25 min

The Problem with “I Got 95% Accuracy”

You train a model. You get 95% accuracy. You’re done, right?

Not quite. That single number tells you surprisingly little. It doesn’t tell you:

  • Whether your model will work on new data
  • What types of errors it makes
  • If it’s biased toward certain classes
  • How stable its performance is

A single train/test split can be misleading. You might get lucky with one split and unlucky with another. That’s why we need better evaluation methods.

What You’ll Learn

In this tutorial, we’ll work with a real classification problem and see how different evaluation methods can change the story of your model’s performance. You’ll learn:

1. Train/Test Split

  • How to split data correctly
  • Why stratification matters
  • What a single split can and can’t tell you
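As a quick preview, here's a minimal sketch of a stratified 80/20 split. It assumes scikit-learn and its built-in copy of the Breast Cancer Wisconsin dataset, the same data we'll load properly later in the tutorial:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

# Load the dataset we'll use throughout this tutorial (569 samples, 30 features)
X, y = load_breast_cancer(return_X_y=True)

# 80/20 split; stratify=y keeps the malignant/benign ratio
# roughly the same in both the train and the test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

print(X_train.shape, X_test.shape)  # (455, 30) (114, 30)
```

With stratify=y, the class ratio in the test set mirrors the full dataset, so the rarer class can't end up underrepresented purely by chance.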

2. Cross-Validation

  • Why one split isn’t enough
  • How k-fold cross-validation works
  • Getting more stable performance estimates
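Here's a sketch of the mechanics, using scikit-learn's StratifiedKFold as a placeholder for the exact setup we'll build later. Each sample lands in the test fold exactly once:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import StratifiedKFold

X, y = load_breast_cancer(return_X_y=True)

# 5 folds: each round trains on 4 folds and tests on the remaining one
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

for fold, (train_idx, test_idx) in enumerate(skf.split(X, y), start=1):
    print(f"Fold {fold}: train on {len(train_idx)} samples, test on {len(test_idx)}")
```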

3. Multiple Metrics

  • Accuracy, precision, recall, F1-score
  • ROC AUC for binary classification
  • When to use which metric
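The sketch below previews how these metrics are computed with scikit-learn. The scaler-plus-logistic-regression pipeline is just a placeholder model for illustration, not the model we'll settle on:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)[:, 1]  # probability of the positive class

print("Accuracy :", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall   :", recall_score(y_test, y_pred))
print("F1-score :", f1_score(y_test, y_pred))
print("ROC AUC  :", roc_auc_score(y_test, y_prob))  # uses probabilities, not labels
```

Note that ROC AUC is computed from predicted probabilities rather than hard labels, which is why the model's predict_proba output is used for it.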

4. Model Comparison

  • Comparing models fairly
  • Using cross-validation for comparison
  • Avoiding common mistakes
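As a preview of what a fair comparison looks like, the sketch below scores two placeholder models (a logistic regression pipeline and a random forest) on exactly the same cross-validation folds, so neither one benefits from a luckier split:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Both models are evaluated on exactly the same folds
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

models = {
    "Logistic Regression": make_pipeline(StandardScaler(),
                                         LogisticRegression(max_iter=1000)),
    "Random Forest": RandomForestClassifier(random_state=42),
}

for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
    print(f"{name}: {scores.mean():.3f} (+/- {scores.std():.3f})")
```

Because both models see identical folds, any difference in mean score reflects the models themselves, not the luck of the split.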

The Evaluation Workflow

Here’s how proper evaluation works:

[Diagram: Full Dataset → Train/Test Split (80% train / 20% test) → Train Set and Test Set → K-Fold Cross-Validation → Scores → Final Metrics]

Why Train/Test Split Isn’t Enough

A single train/test split has problems:

Problem 1: Random Variation

  • One random split might give you 95% accuracy
  • Another split might give you 92% accuracy
  • Which one is right? You don’t know.
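You can see this variation with a small experiment. The sketch below trains the same placeholder pipeline on five different random splits of the same data; only random_state changes between runs, yet the reported accuracy will typically differ from split to split:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Same model, same data; only the random split changes
for seed in range(5):
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=seed
    )
    model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    model.fit(X_train, y_train)
    acc = accuracy_score(y_test, model.predict(X_test))
    print(f"random_state={seed}: accuracy = {acc:.3f}")
```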

Problem 2: Small Test Sets

  • If your test set is small, one misclassified sample changes your accuracy a lot
  • A 100-sample test set: one error = 1% accuracy drop
  • Not very stable

Problem 3: No Stability Measure

  • You get one number
  • No idea if that number is typical or unusual
  • Can’t see the variance in performance

Solution: Cross-Validation

  • Train and test multiple times on different splits
  • Get multiple scores
  • See the mean and standard deviation
  • Much more reliable
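In scikit-learn this is essentially a one-liner with cross_val_score. The sketch below (again using a placeholder pipeline) reports the per-fold scores plus their mean and standard deviation, and that standard deviation is exactly the stability measure a single split can't give you:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Five train/test rounds instead of one
scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")

print("Fold scores:", scores.round(3))
print(f"Mean accuracy: {scores.mean():.3f}, std: {scores.std():.3f}")
```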

What We’ll Build

We’ll work with the Breast Cancer Wisconsin dataset - a real medical classification problem. This makes the evaluation more meaningful because:

  • It’s a binary classification problem (malignant vs benign)
  • Class imbalance matters (we’ll see why)
  • Different types of errors have different costs
  • It’s a real-world scenario where evaluation matters
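If you want to peek at the class balance ahead of time, here's a minimal sketch using scikit-learn's built-in copy of the dataset; in that copy, label 0 is malignant and label 1 is benign:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
values, counts = np.unique(data.target, return_counts=True)

# In scikit-learn's copy of this dataset, 0 = malignant and 1 = benign
for value, count in zip(values, counts):
    print(f"{data.target_names[value]}: {count} samples")
```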

Key Takeaways

Before moving to the next page, remember:

  1. Single splits are unreliable - One split can be lucky or unlucky
  2. Cross-validation is better - Multiple splits give more stable estimates
  3. Metrics matter - Accuracy alone doesn’t tell the full story
  4. Context matters - Different problems need different metrics

What’s Next?

On the next page, we'll set up our environment and load the dataset. You'll see how to import the necessary libraries and explore the data before we start evaluating models.
