Cross-Validation with cross_val_score
A single train/test split gives us one performance estimate. But that estimate might be lucky or unlucky depending on how the data was split. Cross-validation gives us multiple estimates, making our evaluation more reliable.
What is Cross-Validation?
Cross-validation splits the data into multiple folds (typically 5). With 5 folds, the model trains on 4 folds and tests on the remaining 1, repeating the process 5 times so that each fold gets exactly one turn as the test set.
This gives us 5 performance scores instead of 1. From these we can compute the mean (average performance) and the standard deviation (how consistent that performance is).
Using cross_val_score
Let’s use cross-validation with our pipeline:
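Since the exact pipeline from the earlier pages isn't reproduced here, the sketch below assumes a simple StandardScaler + LogisticRegression pipeline on scikit-learn's built-in breast cancer dataset; swap in your own pipeline and data:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in data and pipeline (assumptions; use your own from earlier pages)
X, y = load_breast_cancer(return_X_y=True)
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])

# Five-fold cross-validation: returns one score per fold
scores = cross_val_score(pipeline, X, y, cv=5)

print("Fold scores:", scores.round(3))
print(f"Mean: {scores.mean():.3f}, Std: {scores.std():.3f}")
```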
Why cv=5?
Five folds is a common choice because:
- Balance - Reasonable trade-off between computation time and reliability
- Standard - Widely used in practice, so results are easy to compare
- Stable - Enough folds that the mean score is a dependable estimate
You can use 3 folds for faster computation or 10 folds for more stable estimates. The trade-off is computation time vs. reliability.
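Switching fold counts is just a matter of changing the cv argument (continuing the sketch above):

```python
# Fewer folds: faster, but a noisier estimate
scores_3 = cross_val_score(pipeline, X, y, cv=3)

# More folds: slower, but a more stable estimate
scores_10 = cross_val_score(pipeline, X, y, cv=10)

print(f"cv=3:  mean {scores_3.mean():.3f}, std {scores_3.std():.3f}")
print(f"cv=10: mean {scores_10.mean():.3f}, std {scores_10.std():.3f}")
```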
Comparing Single Split vs Cross-Validation
Let’s see the difference:
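A sketch, again reusing the pipeline and data from the example above:

```python
from sklearn.model_selection import train_test_split

# Single split: one number, and it depends on which rows land in the test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
pipeline.fit(X_train, y_train)
print(f"Single split: {pipeline.score(X_test, y_test):.3f}")

# Cross-validation: five numbers, summarized by their mean and spread
cv_scores = cross_val_score(pipeline, X, y, cv=5)
print(f"CV: {cv_scores.mean():.3f} +/- {cv_scores.std():.3f}")
```

Rerunning the single split with different random_state values will move its score around; the cross-validation mean is far less sensitive to any one split.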
Why Pipelines Make Cross-Validation Easy
Without a pipeline, cross-validation is tricky. You’d need to:
- Split the data into folds
- For each fold:
  - Fit the preprocessor on the training fold only
  - Transform the training fold
  - Transform the test fold
  - Train the model
  - Evaluate on the test fold
With a pipeline, cross_val_score handles all of this automatically. The preprocessor is fit on each training fold separately, preventing data leakage.
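For intuition, here is roughly what cross_val_score does for you (a simplified sketch; the real implementation also handles stratification for classifiers, scoring options, and parallelism):

```python
import numpy as np
from sklearn.base import clone
from sklearn.model_selection import KFold

kf = KFold(n_splits=5, shuffle=True, random_state=42)
manual_scores = []
for train_idx, test_idx in kf.split(X):
    # A fresh, unfitted copy per fold, so the preprocessor is fit on
    # the training fold only and never sees the test rows
    fold_pipe = clone(pipeline)
    fold_pipe.fit(X[train_idx], y[train_idx])
    manual_scores.append(fold_pipe.score(X[test_idx], y[test_idx]))

print("Manual fold scores:", np.round(manual_scores, 3))
```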
Interpreting Results
When you see cross-validation results:
- High mean, low std - Model performs well and consistently
- High mean, high std - Model performs well but inconsistently (might be overfitting)
- Low mean, low std - Model performs poorly but consistently (might need better features)
- Low mean, high std - Model is unreliable (might need more data or simpler model)
Key Takeaways
Before moving on:
- Cross-validation - Multiple train/test splits for more reliable estimates
- cv=5 - Common choice, good balance of time and reliability
- Mean and std - Show average performance and consistency
- Pipelines make it easy - cross_val_score handles everything automatically
- No data leakage - Preprocessor fit separately on each fold
What’s Next?
On the next page, we'll tune hyperparameters with GridSearchCV, which searches over a grid of parameter values and picks the best combination automatically.