
Why We Split Data

We split data for a simple reason: we want to estimate how well our model will work on data it hasn’t seen before.

The idea:

  • Train on one part of the data
  • Test on another part the model hasn’t seen
  • This gives us an estimate of real-world performance

The problem: One split can be lucky or unlucky. But let’s start with the basics.

Visualizing the Split

Here’s what happens when we split data:

[Diagram: Full Dataset (569 samples) → 80/20 split → Train Set (455 samples) + Test Set (114 samples) → Train Model → Evaluate]

Split the Data

Let’s split our data. Notice we use stratify=y to keep class proportions the same in train and test sets:

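A minimal sketch of this step, assuming scikit-learn's built-in breast cancer dataset (the 569 samples in the diagram above) and an arbitrary random_state=42 for reproducibility:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

# Breast cancer dataset: 569 samples, 2 classes (malignant / benign)
X, y = load_breast_cancer(return_X_y=True, as_frame=True)

# 80/20 split; stratify=y keeps the class proportions the same
# in the train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

print(f"Train set: {len(X_train)} samples")  # 455
print(f"Test set:  {len(X_test)} samples")   # 114
```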

Why stratify? Without stratification, an unlucky random split can put far more of one class in the test set than in the training set (or, with a small dataset, nearly all of a rare class), which distorts your metrics. Stratification keeps the class proportions consistent across both sets.

Train a Model

Now let’s train a simple logistic regression model:

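A sketch of the training step, reusing X_train and y_train from the split above. The max_iter value is an assumption here, raised so the default solver converges on these unscaled features:

```python
from sklearn.linear_model import LogisticRegression

# A simple baseline classifier; max_iter raised so the lbfgs solver
# converges without feature scaling
model = LogisticRegression(max_iter=5000)
model.fit(X_train, y_train)

# Predictions on the held-out test set
y_pred = model.predict(X_test)
```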

Calculate Basic Metrics

Now let’s see how well our model did:

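A sketch of the metric calculations, using y_test and the y_pred predictions from the previous step (your exact numbers will depend on the split):

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

print(f"Accuracy:  {accuracy_score(y_test, y_pred):.3f}")
print(f"Precision: {precision_score(y_test, y_pred):.3f}")
print(f"Recall:    {recall_score(y_test, y_pred):.3f}")
print(f"F1-score:  {f1_score(y_test, y_pred):.3f}")
```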

What Do These Metrics Mean?

Accuracy: Share of correct predictions

  • Simple: (correct predictions) / (total predictions)
  • Good when classes are balanced
  • Can be misleading with imbalanced classes

Precision: When we say “positive”, how often are we right?

  • Formula: True Positives / (True Positives + False Positives)
  • High precision = few false alarms
  • Important when false positives are costly

Recall: How many real positives did we catch?

  • Formula: True Positives / (True Positives + False Negatives)
  • High recall = we catch most positives
  • Important when missing positives is costly

F1-Score: Balance between precision and recall

  • Formula: 2 × (Precision × Recall) / (Precision + Recall)
  • Harmonic mean of precision and recall
  • Good when you need both precision and recall
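To make the formulas concrete, here's a small hand-worked example with made-up counts (90 true positives, 10 false positives, 5 false negatives):

```python
# Hypothetical counts, just to exercise the formulas
tp, fp, fn = 90, 10, 5

precision = tp / (tp + fp)                          # 90/100 = 0.900
recall = tp / (tp + fn)                             # 90/95  ≈ 0.947
f1 = 2 * precision * recall / (precision + recall)  # ≈ 0.923

print(f"Precision: {precision:.3f}, Recall: {recall:.3f}, F1: {f1:.3f}")
```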

Try It Yourself

Quick question: Print y_test.value_counts() and see if classes are balanced. How might imbalanced classes make accuracy misleading?

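A sketch of the check, assuming y_test is the pandas Series produced by the split above. In this dataset roughly 63% of samples are benign and 37% malignant, so a model that always predicted "benign" would already score about 63% accuracy - one way accuracy can flatter a useless model:

```python
# Class counts in the test set; with stratify=y these mirror the
# proportions of the full dataset
print(y_test.value_counts())
print(y_test.value_counts(normalize=True))  # as fractions
```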

The Problem with One Split

We got some numbers. But here’s the thing: if we split the data differently, we’d get different numbers. One split might give us 95% accuracy, another might give us 92%. Which one is right?

We don’t know. That’s why we need cross-validation - to test multiple splits and get a more stable estimate.
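You can see this instability yourself by retraining on a few different random splits (a quick sketch reusing the data, imports, and model settings from the earlier cells; the exact numbers will vary):

```python
# Same model, same data - only the random split changes
for seed in range(5):
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=seed
    )
    acc = LogisticRegression(max_iter=5000).fit(X_tr, y_tr).score(X_te, y_te)
    print(f"seed={seed}: test accuracy = {acc:.3f}")
```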

Key Takeaways

Before moving forward:

  1. Stratification matters - Keeps class proportions in train/test
  2. Multiple metrics - Accuracy alone doesn’t tell the full story
  3. One split is unstable - Different splits give different results
  4. Context matters - In medical problems, recall (catching cancer) might matter more than precision

What’s Next?

On the next page, we'll look at the confusion matrix, which shows exactly what types of errors our model makes - much more detail than a single accuracy number.