Intermediate 25 min

Cross-Validation with cross_val_score

A single train/test split gives us one performance estimate. But that estimate might be lucky or unlucky depending on how the data was split. Cross-validation gives us multiple estimates, making our evaluation more reliable.

What is Cross-Validation?

Cross-validation splits the data into multiple folds (typically 5). It trains on 4 folds and tests on 1, repeating this 5 times. Each fold gets a turn as the test set.

This gives us 5 performance scores instead of 1. We can see the mean (average performance) and standard deviation (how consistent the performance is).

[Diagram: all the data is split into 5 folds. In each round, one fold (e.g. Fold 1) is held out as the test set while the model trains on the other four (Folds 2-5). The five resulting scores are averaged into a mean ± standard deviation.]
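The fold rotation described above can be sketched with scikit-learn's KFold. The 10-sample toy dataset here is purely illustrative:

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(10).reshape(10, 1)  # 10 samples, 1 feature (toy data)

# 5 folds: each sample lands in the test set exactly once
kf = KFold(n_splits=5, shuffle=True, random_state=42)

for i, (train_idx, test_idx) in enumerate(kf.split(X), start=1):
    print(f"Fold {i}: train on {len(train_idx)} samples, test on {sorted(test_idx)}")
```

Across the five rounds, every sample appears in exactly one test fold, which is what lets each fold "get a turn" as the test set.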

Using cross_val_score

Let’s use cross-validation with our pipeline:

🐍 Python Cross-Validation with Pipeline
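Since the interactive example isn't reproduced here, this is a minimal sketch of the pattern, assuming a scaler + logistic regression pipeline on scikit-learn's built-in breast cancer dataset (your course pipeline may differ):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])

# cv=5 returns one accuracy score per fold
scores = cross_val_score(pipe, X, y, cv=5)
print(f"Fold scores: {scores.round(3)}")
print(f"Mean: {scores.mean():.3f}  Std: {scores.std():.3f}")
```

Note that the whole pipeline is passed to `cross_val_score`, so scaling is re-fit inside each fold rather than once on all the data.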

Why cv=5?

Five folds is a common choice because:

  • Reliability - Averaging five scores smooths out an unusually lucky or unlucky split
  • Cost - Only five model fits, so training time stays manageable
  • Convention - Widely used in practice, which makes results easy to compare

You can drop to 3 folds for faster computation or go up to 10 folds for more stable estimates. The trade-off is always computation time vs. reliability of the estimate.
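To make the trade-off concrete, here is a quick sketch comparing fold counts (the dataset and pipeline are illustrative assumptions):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# More folds = more model fits (slower) but more scores to average
for k in (3, 5, 10):
    scores = cross_val_score(pipe, X, y, cv=k)
    print(f"cv={k:>2}: mean={scores.mean():.3f}, std={scores.std():.3f}  ({k} fits)")
```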

Comparing Single Split vs Cross-Validation

Let’s see the difference:

🐍 Python Single Split vs Cross-Validation
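A sketch of that comparison (the dataset and pipeline are assumptions, not necessarily the course's actual example):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# One split: the score depends on which rows happen to land in the test set
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
single_score = pipe.fit(X_tr, y_tr).score(X_te, y_te)

# Cross-validation: five scores, reported as an average plus a spread
cv_scores = cross_val_score(pipe, X, y, cv=5)
print(f"Single split:     {single_score:.3f}")
print(f"Cross-validation: {cv_scores.mean():.3f} ± {cv_scores.std():.3f}")
```

Try changing `random_state` in `train_test_split`: the single-split score moves around, while the cross-validation mean stays comparatively stable.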

Why Pipelines Make Cross-Validation Easy

Without a pipeline, cross-validation is tricky. You’d need to:

  1. Split data into folds
  2. For each fold:
    • Fit preprocessor on training fold
    • Transform training fold
    • Transform test fold
    • Train model
    • Evaluate

With a pipeline, cross_val_score handles all of this automatically. The preprocessor is fit on each training fold separately, preventing data leakage.
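The manual steps above could be sketched like this; `cross_val_score` with a pipeline does the equivalent internally (the scaler/model choices are illustrative):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
kf = KFold(n_splits=5, shuffle=True, random_state=0)
scores = []

for train_idx, test_idx in kf.split(X):
    # Fit the preprocessor on the training fold only -- no peeking at the test fold
    scaler = StandardScaler().fit(X[train_idx])
    model = LogisticRegression(max_iter=1000)
    model.fit(scaler.transform(X[train_idx]), y[train_idx])
    # Transform (not re-fit!) the test fold, then evaluate
    scores.append(model.score(scaler.transform(X[test_idx]), y[test_idx]))

print(f"Manual CV: {np.mean(scores):.3f} ± {np.std(scores):.3f}")
```

The leakage-prone mistake would be calling `StandardScaler().fit(X)` on all the data before splitting; fitting inside the loop (or letting a pipeline do it) keeps test-fold statistics out of training.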

Interpreting Results

When you see cross-validation results:

  • High mean, low std - Model performs well and consistently
  • High mean, high std - Model performs well on average but inconsistently (scores swing with the split; watch for overfitting or too little data)
  • Low mean, low std - Model performs poorly but consistently (might need better features)
  • Low mean, high std - Model is unreliable (might need more data or simpler model)

Key Takeaways

Before moving on:

  1. Cross-validation - Multiple train/test splits for more reliable estimates
  2. cv=5 - Common choice, good balance of time and reliability
  3. Mean and std - Show average performance and consistency
  4. Pipelines make it easy - cross_val_score handles everything automatically
  5. No data leakage - Preprocessor fit separately on each fold

What’s Next?

In the next page, we’ll tune hyperparameters using GridSearchCV. This finds the best combination of model parameters automatically.