The Problem with “I Got 95% Accuracy”
You train a model. You get 95% accuracy. You’re done, right?
Not quite. On its own, that single number doesn’t tell you:
- Whether your model will work on new data
- What types of errors it makes
- If it’s biased toward certain classes
- How stable its performance is
A single train/test split can be misleading. You might get lucky with one split and unlucky with another. That’s why we need better evaluation methods.
What You’ll Learn
In this tutorial, we’ll work with a real classification problem and see how different evaluation methods can change the story of your model’s performance. You’ll learn:
1. Train/Test Split
- How to split data correctly
- Why stratification matters
- What a single split can and can’t tell you
2. Cross-Validation
- Why one split isn’t enough
- How k-fold cross-validation works
- Getting more stable performance estimates
3. Multiple Metrics
- Accuracy, precision, recall, F1-score
- ROC AUC for binary classification
- When to use which metric
4. Model Comparison
- Comparing models fairly
- Using cross-validation for comparison
- Avoiding common mistakes
The Evaluation Workflow
Here’s how proper evaluation works, at a high level:
1. Split the data into a training set and a held-out test set (stratified, so class proportions are preserved)
2. Use cross-validation on the training set to get a stable performance estimate
3. Evaluate with several metrics, not just accuracy
4. Compare candidate models on the same folds with the same metrics
5. Check the final model on the held-out test set only once, at the end
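The snippet below is a minimal sketch of that workflow, assuming scikit-learn and the Breast Cancer Wisconsin dataset we use throughout this tutorial; the logistic regression pipeline is just a placeholder model, not the one we’ll necessarily settle on.

```python
# Minimal sketch of the evaluation workflow (placeholder model).
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# 1. Hold out a stratified test set so class proportions are preserved.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# 2. Estimate performance on the training data with 5-fold cross-validation.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scores = cross_val_score(model, X_train, y_train, cv=5)
print(f"CV accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")

# 3. Fit on the full training set and check the held-out test set once.
model.fit(X_train, y_train)
print(f"Test accuracy: {model.score(X_test, y_test):.3f}")
```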
Why Train/Test Split Isn’t Enough
A single train/test split has problems:
Problem 1: Random Variation
- One random split might give you 95% accuracy
- Another split might give you 92% accuracy
- Which one is right? You don’t know.
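You can see this for yourself with a small sketch (again assuming scikit-learn and the breast cancer dataset): the same model, evaluated on splits that differ only in their random seed, produces noticeably different accuracies.

```python
# Same model, same data - only the random split changes.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

for seed in range(5):
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=seed
    )
    acc = model.fit(X_tr, y_tr).score(X_te, y_te)
    print(f"random_state={seed}: accuracy={acc:.3f}")
```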
Problem 2: Small Test Sets
- If your test set is small, one misclassified sample changes your accuracy a lot
- A 100-sample test set: one error = 1% accuracy drop
- Not very stable
Problem 3: No Stability Measure
- You get one number
- No idea if that number is typical or unusual
- Can’t see the variance in performance
Solution: Cross-Validation
- Train and test multiple times on different splits
- Get multiple scores
- See the mean and standard deviation
- Much more reliable
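Here is a short sketch of that idea using scikit-learn’s `cross_val_score` with 5 folds (the pipeline is again a placeholder model): you get one score per fold, plus a mean and standard deviation that tell you how stable the estimate is.

```python
# Five folds -> five scores, plus a mean and standard deviation.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

scores = cross_val_score(model, X, y, cv=5)
print("Fold scores:", [f"{s:.3f}" for s in scores])
print(f"Mean: {scores.mean():.3f}, Std: {scores.std():.3f}")
```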
What We’ll Build
We’ll work with the Breast Cancer Wisconsin dataset - a real medical classification problem. This makes the evaluation more meaningful because:
- It’s a binary classification problem (malignant vs benign)
- Class imbalance matters (we’ll see why)
- Different types of errors have different costs
- It’s a real-world scenario where evaluation matters
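As a quick preview (the full setup comes on the next page), here is a sketch of how you might check the class balance with scikit-learn and NumPy before doing any modeling:

```python
# Quick look at the class balance of the dataset.
import numpy as np
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
counts = np.bincount(data.target)
for name, count in zip(data.target_names, counts):
    print(f"{name}: {count} samples ({count / len(data.target):.1%})")
```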
Key Takeaways
Before moving to the next page, remember:
- Single splits are unreliable - One split can be lucky or unlucky
- Cross-validation is better - Multiple splits give more stable estimates
- Metrics matter - Accuracy alone doesn’t tell the full story
- Context matters - Different problems need different metrics
What’s Next?
On the next page, we’ll set up our environment and load the dataset. You’ll see how to import the necessary libraries and explore the data before we start evaluating models.