
Why We Split Data

We split data for a simple reason: we want to estimate how well our model will work on data it hasn’t seen before.

The idea:

  • Train on one part of the data
  • Test on another part the model hasn’t seen
  • This gives us an estimate of real-world performance

The problem: One split can be lucky or unlucky. But let’s start with the basics.

Visualizing the Split

Here’s what happens when we split data:

[Diagram: Full Dataset (569 samples) → 80/20 split → Train Set (455 samples) + Test Set (114 samples) → Train Model → Evaluate]

Split the Data

Let’s split our data. Notice we use stratify=y to keep class proportions the same in train and test sets:

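A minimal sketch of this step, assuming scikit-learn's built-in breast cancer dataset (the 569 samples in the diagram above) and an arbitrary random_state=42 for reproducibility:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

# Breast cancer dataset: 569 samples, 2 classes (malignant / benign)
X, y = load_breast_cancer(return_X_y=True, as_frame=True)

# 80/20 split; stratify=y keeps the class proportions the same
# in the train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

print(f"Train set: {len(X_train)} samples")  # 455
print(f"Test set:  {len(X_test)} samples")   # 114
```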

Why stratify? Without stratification, an unlucky random split can put far more of one class in the test set than in the training set (or, with a small dataset, nearly all of a rare class), which distorts your metrics. Stratification keeps the class proportions consistent across both sets.

Train a Model

Now let’s train a simple logistic regression model:

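A sketch of the training step, reusing X_train and y_train from the split above. The max_iter value is an assumption here, raised so the default solver converges on these unscaled features:

```python
from sklearn.linear_model import LogisticRegression

# A simple baseline classifier; max_iter raised so the lbfgs solver
# converges without feature scaling
model = LogisticRegression(max_iter=5000)
model.fit(X_train, y_train)

# Predictions on the held-out test set
y_pred = model.predict(X_test)
```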

Calculate Basic Metrics

Now let’s see how well our model did:

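A sketch of the metric calculations, using y_test and the y_pred predictions from the previous step (your exact numbers will depend on the split):

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

print(f"Accuracy:  {accuracy_score(y_test, y_pred):.3f}")
print(f"Precision: {precision_score(y_test, y_pred):.3f}")
print(f"Recall:    {recall_score(y_test, y_pred):.3f}")
print(f"F1-score:  {f1_score(y_test, y_pred):.3f}")
```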

What Do These Metrics Mean?

Accuracy: Share of correct predictions

  • Simple: (correct predictions) / (total predictions)
  • Good when classes are balanced
  • Can be misleading with imbalanced classes

Precision: When we say “positive”, how often are we right?

  • Formula: True Positives / (True Positives + False Positives)
  • High precision = few false alarms
  • Important when false positives are costly

Recall: How many real positives did we catch?

  • Formula: True Positives / (True Positives + False Negatives)
  • High recall = we catch most positives
  • Important when missing positives is costly

F1-Score: Balance between precision and recall

  • Formula: 2 × (Precision × Recall) / (Precision + Recall)
  • Harmonic mean of precision and recall
  • Good when you need both precision and recall
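To make the formulas concrete, here's a small hand-worked example with made-up counts (90 true positives, 10 false positives, 5 false negatives):

```python
# Hypothetical counts, just to exercise the formulas
tp, fp, fn = 90, 10, 5

precision = tp / (tp + fp)                          # 90/100 = 0.900
recall = tp / (tp + fn)                             # 90/95  ≈ 0.947
f1 = 2 * precision * recall / (precision + recall)  # ≈ 0.923

print(f"Precision: {precision:.3f}, Recall: {recall:.3f}, F1: {f1:.3f}")
```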

Try It Yourself

Quick question: Print y_test.value_counts() and see if classes are balanced. How might imbalanced classes make accuracy misleading?

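A sketch of the check, assuming y_test is the pandas Series produced by the split above. In this dataset roughly 63% of samples are benign and 37% malignant, so a model that always predicted "benign" would already score about 63% accuracy - one way accuracy can flatter a useless model:

```python
# Class counts in the test set; with stratify=y these mirror the
# proportions of the full dataset
print(y_test.value_counts())
print(y_test.value_counts(normalize=True))  # as fractions
```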

The Problem with One Split

We got some numbers. But here’s the thing: if we split the data differently, we’d get different numbers. One split might give us 95% accuracy, another might give us 92%. Which one is right?

We don’t know. That’s why we need cross-validation - to test multiple splits and get a more stable estimate.
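You can see this instability yourself by retraining on a few different random splits (a quick sketch reusing the data, imports, and model settings from the earlier cells; the exact numbers will vary):

```python
# Same model, same data - only the random split changes
for seed in range(5):
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=seed
    )
    acc = LogisticRegression(max_iter=5000).fit(X_tr, y_tr).score(X_te, y_te)
    print(f"seed={seed}: test accuracy = {acc:.3f}")
```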

Key Takeaways

Before moving forward:

  1. Stratification matters - Keeps class proportions in train/test
  2. Multiple metrics - Accuracy alone doesn’t tell the full story
  3. One split is unstable - Different splits give different results
  4. Context matters - In medical problems, recall (catching cancer) might matter more than precision

What’s Next?

On the next page, we'll look at the confusion matrix, which shows exactly what types of errors our model makes - much more detail than a single accuracy number.