Why We Split Data
We split data for a simple reason: we want to estimate how well our model will work on data it hasn’t seen before.
The idea:
- Train on one part of the data
- Test on another part the model hasn’t seen
- This gives us an estimate of real-world performance
The problem: One split can be lucky or unlucky. But let’s start with the basics.
Visualizing the Split
Here’s what happens when we split data:
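Roughly, something like this - a minimal sketch with a made-up 100-row dataset, where each row is randomly assigned to the training set (80%) or the test set (20%); the exact ratio is just an example:

```python
# Minimal sketch: which rows of a pretend 100-row dataset land in train vs. test.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

indices = np.arange(100)  # stand-in for 100 data rows
train_idx, test_idx = train_test_split(indices, test_size=0.2, random_state=0)

plt.scatter(train_idx, np.zeros_like(train_idx), marker="|", s=400, label="train (80%)")
plt.scatter(test_idx, np.zeros_like(test_idx), marker="|", s=400, color="red", label="test (20%)")
plt.yticks([])
plt.xlabel("row index")
plt.title("Random assignment of rows to train and test")
plt.legend()
plt.show()
```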
Split the Data
Let’s split our data. Notice we use stratify=y to keep class proportions the same in train and test sets:
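A minimal sketch of the split, assuming a feature matrix X and labels y are already loaded (scikit-learn's built-in breast cancer dataset is used here as a stand-in, and test_size=0.2 is just an example choice):

```python
# Sketch: stratified 80/20 train/test split.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

# Stand-in dataset; replace with your own X and y.
X, y = load_breast_cancer(return_X_y=True)

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,    # hold out 20% for testing (example choice)
    stratify=y,       # keep class proportions the same in train and test
    random_state=42,  # make the split reproducible
)

print("train samples:", len(y_train), "| test samples:", len(y_test))
```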
Why stratify? Without stratification, an unlucky split can leave the test set with a very different class balance than the training set - for example, most samples of a rare class could end up on one side. Stratification prevents that.
Train a Model
Now let’s train a simple logistic regression model:
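Continuing from the split above, something along these lines (max_iter=1000 is only an assumption, to avoid convergence warnings on unscaled features):

```python
# Sketch: fit a plain logistic regression on the training data.
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Predict on the held-out test set for evaluation below.
y_pred = model.predict(X_test)
```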
Calculate Basic Metrics
Now let’s see how well our model did:
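A sketch using scikit-learn's built-in metric functions on the test-set predictions from above:

```python
# Sketch: the four basic classification metrics on the test set.
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

print("Accuracy: ", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall:   ", recall_score(y_test, y_pred))
print("F1-score: ", f1_score(y_test, y_pred))
```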
What Do These Metrics Mean?
Accuracy: Share of correct predictions
- Simple: (correct predictions) / (total predictions)
- Good when classes are balanced
- Can be misleading with imbalanced classes
Precision: When we say “positive”, how often are we right?
- Formula: True Positives / (True Positives + False Positives)
- High precision = few false alarms
- Important when false positives are costly
Recall: How many real positives did we catch?
- Formula: True Positives / (True Positives + False Negatives)
- High recall = we catch most positives
- Important when missing positives is costly
F1-Score: Balance between precision and recall
- Formula: 2 × (Precision × Recall) / (Precision + Recall)
- Harmonic mean of precision and recall
- Good when you need both precision and recall
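To make the formulas concrete, here is a tiny worked example with made-up counts (TP=40, FP=10, FN=20 - purely illustrative numbers):

```python
# Made-up counts, purely for illustration.
tp, fp, fn = 40, 10, 20

precision = tp / (tp + fp)                           # 40 / 50 = 0.80
recall = tp / (tp + fn)                              # 40 / 60 ≈ 0.67
f1 = 2 * precision * recall / (precision + recall)   # ≈ 0.73

print(precision, recall, f1)
```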
Try It Yourself
Quick question: Print y_test.value_counts() and see if classes are balanced. How might imbalanced classes make accuracy misleading?
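If y_test is a pandas Series, y_test.value_counts() works directly; with a NumPy array (as in the sketches above), np.unique does the same job:

```python
import numpy as np

# Count how many test samples belong to each class.
classes, counts = np.unique(y_test, return_counts=True)
print(dict(zip(classes, counts)))
```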
The Problem with One Split
We got some numbers. But here’s the thing: if we split the data differently, we’d get different numbers. One split might give us 95% accuracy, another might give us 92%. Which one is right?
We don’t know. That’s why we need cross-validation - evaluating the model on several different splits and averaging the results gives a more stable estimate.
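As a quick preview (cross-validation gets its own treatment later), a sketch of what evaluating several splits looks like with scikit-learn's cross_val_score:

```python
# Sketch: 5-fold cross-validation gives five scores instead of one.
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print("fold accuracies:", scores)
print("mean:", scores.mean(), " std:", scores.std())
```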
Key Takeaways
Before moving forward:
- Stratification matters - Keeps class proportions consistent between train and test sets
- Multiple metrics - Accuracy alone doesn’t tell the full story
- One split is unstable - Different splits give different results
- Context matters - In medical problems, recall (catching every actual cancer case) might matter more than precision
What’s Next?
In the next page, we’ll look at the confusion matrix. This shows us exactly what types of errors our model makes - much more detail than a single accuracy number.