Intermediate 25 min

Cross-Validation with cross_val_score

A single train/test split gives us one performance estimate. But that estimate might be lucky or unlucky depending on how the data was split. Cross-validation gives us multiple estimates, making our evaluation more reliable.

What is Cross-Validation?

Cross-validation splits the data into multiple folds (typically 5). It trains on 4 folds and tests on 1, repeating this 5 times. Each fold gets a turn as the test set.

This gives us 5 performance scores instead of 1. We can see the mean (average performance) and standard deviation (how consistent the performance is).

[Diagram: all the data is split into 5 folds. In each round, one fold (e.g. Fold 1) is held out as the test set while the model trains on the other four (Folds 2-5). The five resulting scores are averaged into a mean ± standard deviation.]
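The fold rotation described above can be sketched with scikit-learn's KFold. The 10-sample toy dataset here is purely illustrative:

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(10).reshape(10, 1)  # 10 samples, 1 feature (toy data)

# 5 folds: each sample lands in the test set exactly once
kf = KFold(n_splits=5, shuffle=True, random_state=42)

for i, (train_idx, test_idx) in enumerate(kf.split(X), start=1):
    print(f"Fold {i}: train on {len(train_idx)} samples, test on {sorted(test_idx)}")
```

Across the five rounds, every sample appears in exactly one test fold, which is what lets each fold "get a turn" as the test set.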

Using cross_val_score

Let’s use cross-validation with our pipeline:

🐍 Python Cross-Validation with Pipeline
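Since the interactive example isn't reproduced here, this is a minimal sketch of the pattern, assuming a scaler + logistic regression pipeline on scikit-learn's built-in breast cancer dataset (your course pipeline may differ):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])

# cv=5 returns one accuracy score per fold
scores = cross_val_score(pipe, X, y, cv=5)
print(f"Fold scores: {scores.round(3)}")
print(f"Mean: {scores.mean():.3f}  Std: {scores.std():.3f}")
```

Note that the whole pipeline is passed to `cross_val_score`, so scaling is re-fit inside each fold rather than once on all the data.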

Why cv=5?

Five folds is a common choice because:

  • Reliability - Averaging five scores smooths out an unusually lucky or unlucky split
  • Cost - Only five model fits, so training time stays manageable
  • Convention - Widely used in practice, which makes results easy to compare

You can drop to 3 folds for faster computation or go up to 10 folds for more stable estimates. The trade-off is always computation time vs. reliability of the estimate.
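To make the trade-off concrete, here is a quick sketch comparing fold counts (the dataset and pipeline are illustrative assumptions):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# More folds = more model fits (slower) but more scores to average
for k in (3, 5, 10):
    scores = cross_val_score(pipe, X, y, cv=k)
    print(f"cv={k:>2}: mean={scores.mean():.3f}, std={scores.std():.3f}  ({k} fits)")
```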

Comparing Single Split vs Cross-Validation

Let’s see the difference:

🐍 Python Single Split vs Cross-Validation
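A sketch of that comparison (the dataset and pipeline are assumptions, not necessarily the course's actual example):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# One split: the score depends on which rows happen to land in the test set
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
single_score = pipe.fit(X_tr, y_tr).score(X_te, y_te)

# Cross-validation: five scores, reported as an average plus a spread
cv_scores = cross_val_score(pipe, X, y, cv=5)
print(f"Single split:     {single_score:.3f}")
print(f"Cross-validation: {cv_scores.mean():.3f} ± {cv_scores.std():.3f}")
```

Try changing `random_state` in `train_test_split`: the single-split score moves around, while the cross-validation mean stays comparatively stable.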

Why Pipelines Make Cross-Validation Easy

Without a pipeline, cross-validation is tricky. You’d need to:

  1. Split data into folds
  2. For each fold:
    • Fit preprocessor on training fold
    • Transform training fold
    • Transform test fold
    • Train model
    • Evaluate

With a pipeline, cross_val_score handles all of this automatically. The preprocessor is fit on each training fold separately, preventing data leakage.
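The manual steps above could be sketched like this; `cross_val_score` with a pipeline does the equivalent internally (the scaler/model choices are illustrative):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
kf = KFold(n_splits=5, shuffle=True, random_state=0)
scores = []

for train_idx, test_idx in kf.split(X):
    # Fit the preprocessor on the training fold only -- no peeking at the test fold
    scaler = StandardScaler().fit(X[train_idx])
    model = LogisticRegression(max_iter=1000)
    model.fit(scaler.transform(X[train_idx]), y[train_idx])
    # Transform (not re-fit!) the test fold, then evaluate
    scores.append(model.score(scaler.transform(X[test_idx]), y[test_idx]))

print(f"Manual CV: {np.mean(scores):.3f} ± {np.std(scores):.3f}")
```

The leakage-prone mistake would be calling `StandardScaler().fit(X)` on all the data before splitting; fitting inside the loop (or letting a pipeline do it) keeps test-fold statistics out of training.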

Interpreting Results

When you see cross-validation results:

  • High mean, low std - Model performs well and consistently
  • High mean, high std - Model performs well on average but inconsistently (scores swing with the split; watch for overfitting or too little data)
  • Low mean, low std - Model performs poorly but consistently (might need better features)
  • Low mean, high std - Model is unreliable (might need more data or simpler model)

Key Takeaways

Before moving on:

  1. Cross-validation - Multiple train/test splits for more reliable estimates
  2. cv=5 - Common choice, good balance of time and reliability
  3. Mean and std - Show average performance and consistency
  4. Pipelines make it easy - cross_val_score handles everything automatically
  5. No data leakage - Preprocessor fit separately on each fold

What’s Next?

In the next page, we’ll tune hyperparameters using GridSearchCV. This finds the best combination of model parameters automatically.