Intermediate · 25 min

The Problem with “I Got 95% Accuracy”

You train a model. You get 95% accuracy. You’re done, right?

Not quite. That single number tells you surprisingly little. It doesn’t tell you:

  • Whether your model will work on new data
  • What types of errors it makes
  • If it’s biased toward certain classes
  • How stable its performance is

A single train/test split can be misleading. You might get lucky with one split and unlucky with another. That’s why we need better evaluation methods.

What You’ll Learn

In this tutorial, we’ll work with a real classification problem and see how different evaluation methods can change the story of your model’s performance. You’ll learn:

1. Train/Test Split

  • How to split data correctly
  • Why stratification matters
  • What a single split can and can’t tell you
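As a quick preview, here's a minimal sketch of a stratified 80/20 split. It assumes scikit-learn and its built-in copy of the Breast Cancer Wisconsin dataset, the same data we'll load properly later in the tutorial:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

# Load the dataset we'll use throughout this tutorial (569 samples, 30 features)
X, y = load_breast_cancer(return_X_y=True)

# 80/20 split; stratify=y keeps the malignant/benign ratio
# roughly the same in both the train and the test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

print(X_train.shape, X_test.shape)  # (455, 30) (114, 30)
```

With stratify=y, the class ratio in the test set mirrors the full dataset, so the rarer class can't end up underrepresented purely by chance.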

2. Cross-Validation

  • Why one split isn’t enough
  • How k-fold cross-validation works
  • Getting more stable performance estimates
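Here's a sketch of the mechanics, using scikit-learn's StratifiedKFold as a placeholder for the exact setup we'll build later. Each sample lands in the test fold exactly once:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import StratifiedKFold

X, y = load_breast_cancer(return_X_y=True)

# 5 folds: each round trains on 4 folds and tests on the remaining one
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

for fold, (train_idx, test_idx) in enumerate(skf.split(X, y), start=1):
    print(f"Fold {fold}: train on {len(train_idx)} samples, test on {len(test_idx)}")
```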

3. Multiple Metrics

  • Accuracy, precision, recall, F1-score
  • ROC AUC for binary classification
  • When to use which metric
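The sketch below previews how these metrics are computed with scikit-learn. The scaler-plus-logistic-regression pipeline is just a placeholder model for illustration, not the model we'll settle on:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)[:, 1]  # probability of the positive class

print("Accuracy :", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall   :", recall_score(y_test, y_pred))
print("F1-score :", f1_score(y_test, y_pred))
print("ROC AUC  :", roc_auc_score(y_test, y_prob))  # uses probabilities, not labels
```

Note that ROC AUC is computed from predicted probabilities rather than hard labels, which is why the model's predict_proba output is used for it.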

4. Model Comparison

  • Comparing models fairly
  • Using cross-validation for comparison
  • Avoiding common mistakes
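As a preview of what a fair comparison looks like, the sketch below scores two placeholder models (a logistic regression pipeline and a random forest) on exactly the same cross-validation folds, so neither one benefits from a luckier split:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Both models are evaluated on exactly the same folds
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

models = {
    "Logistic Regression": make_pipeline(StandardScaler(),
                                         LogisticRegression(max_iter=1000)),
    "Random Forest": RandomForestClassifier(random_state=42),
}

for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
    print(f"{name}: {scores.mean():.3f} (+/- {scores.std():.3f})")
```

Because both models see identical folds, any difference in mean score reflects the models themselves, not the luck of the split.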

The Evaluation Workflow

Here’s how proper evaluation works:

[Diagram: Full Dataset → Train/Test Split (80% train / 20% test) → Train Set and Test Set → K-Fold Cross-Validation → Scores → Final Metrics]

Why Train/Test Split Isn’t Enough

A single train/test split has problems:

Problem 1: Random Variation

  • One random split might give you 95% accuracy
  • Another split might give you 92% accuracy
  • Which one is right? You don’t know.
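You can see this variation with a small experiment. The sketch below trains the same placeholder pipeline on five different random splits of the same data; only random_state changes between runs, yet the reported accuracy will typically differ from split to split:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Same model, same data; only the random split changes
for seed in range(5):
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=seed
    )
    model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    model.fit(X_train, y_train)
    acc = accuracy_score(y_test, model.predict(X_test))
    print(f"random_state={seed}: accuracy = {acc:.3f}")
```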

Problem 2: Small Test Sets

  • If your test set is small, one misclassified sample changes your accuracy a lot
  • A 100-sample test set: one error = 1% accuracy drop
  • Not very stable

Problem 3: No Stability Measure

  • You get one number
  • No idea if that number is typical or unusual
  • Can’t see the variance in performance

Solution: Cross-Validation

  • Train and test multiple times on different splits
  • Get multiple scores
  • See the mean and standard deviation
  • Much more reliable
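In scikit-learn this is essentially a one-liner with cross_val_score. The sketch below (again using a placeholder pipeline) reports the per-fold scores plus their mean and standard deviation, and that standard deviation is exactly the stability measure a single split can't give you:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Five train/test rounds instead of one
scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")

print("Fold scores:", scores.round(3))
print(f"Mean accuracy: {scores.mean():.3f}, std: {scores.std():.3f}")
```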

What We’ll Build

We’ll work with the Breast Cancer Wisconsin dataset - a real medical classification problem. This makes the evaluation more meaningful because:

  • It’s a binary classification problem (malignant vs benign)
  • Class imbalance matters (we’ll see why)
  • Different types of errors have different costs
  • It’s a real-world scenario where evaluation matters
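If you want to peek at the class balance ahead of time, here's a minimal sketch using scikit-learn's built-in copy of the dataset; in that copy, label 0 is malignant and label 1 is benign:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
values, counts = np.unique(data.target, return_counts=True)

# In scikit-learn's copy of this dataset, 0 = malignant and 1 = benign
for value, count in zip(values, counts):
    print(f"{data.target_names[value]}: {count} samples")
```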

Key Takeaways

Before moving to the next page, remember:

  1. Single splits are unreliable - One split can be lucky or unlucky
  2. Cross-validation is better - Multiple splits give more stable estimates
  3. Metrics matter - Accuracy alone doesn’t tell the full story
  4. Context matters - Different problems need different metrics

What’s Next?

On the next page, we'll set up our environment and load the dataset. You'll see how to import the necessary libraries and explore the data before we start evaluating models.
