Intermediate · 25 min

Comparing Two Models

Let’s compare Logistic Regression with Random Forest. The key is to use the same cross-validation splits for both models.

🐍 Python Compare Models
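
A minimal sketch of the comparison, assuming scikit-learn’s built-in breast cancer dataset as example data. The key detail is creating a single StratifiedKFold object and passing it as cv to both cross_val_score calls, so both models are scored on identical folds:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# One CV object, reused for both models, so every model is scored on identical folds
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

log_reg = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
rf = RandomForestClassifier(n_estimators=200, random_state=42)

log_reg_scores = cross_val_score(log_reg, X, y, cv=cv, scoring="f1")
rf_scores = cross_val_score(rf, X, y, cv=cv, scoring="f1")

print(f"Logistic Regression F1: {log_reg_scores.mean():.3f} +/- {log_reg_scores.std():.3f}")
print(f"Random Forest F1:       {rf_scores.mean():.3f} +/- {rf_scores.std():.3f}")
```

Because the StratifiedKFold object has a fixed random_state, every call to it yields the same folds, which is what makes the two score arrays directly comparable.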

Why Same CV Splits Matter

Using the same splits ensures:

  • Fair comparison - Both models see the same data splits
  • Reduced variance - Differences are due to models, not data splits
  • Statistical validity - Scores line up fold by fold, so they can be compared as paired samples (see the paired-comparison sketch below)

Without the same splits, one model might get “easier” test folds, making the comparison unfair.
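
Because the folds line up one-to-one, you can also inspect fold-by-fold differences instead of only comparing two averages. A small sketch that continues from the comparison code above (it reuses log_reg_scores and rf_scores) and uses scipy’s paired t-test:

```python
import numpy as np
from scipy import stats

# Positive differences mean Random Forest won that fold
diffs = rf_scores - log_reg_scores
print("Per-fold F1 differences:", np.round(diffs, 3))

# A paired t-test is only meaningful because both arrays come from the same folds.
# CV folds share training data, so treat the p-value as a rough guide, not gospel.
t_stat, p_value = stats.ttest_rel(rf_scores, log_reg_scores)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")
```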

Try Different Hyperparameters

Try it yourself: change the Random Forest hyperparameters and see how the metrics change:

🐍 Python Try Different Hyperparameters
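
One way to experiment is to loop over a few candidate settings (the values below are only illustrative) and score each with the same cv object as before:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = load_breast_cancer(return_X_y=True)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Illustrative settings only -- edit these and watch the scores move
settings = [
    {"n_estimators": 50, "max_depth": 3},
    {"n_estimators": 200, "max_depth": None},
    {"n_estimators": 500, "max_depth": 10, "min_samples_leaf": 5},
]

for params in settings:
    model = RandomForestClassifier(random_state=42, **params)
    scores = cross_val_score(model, X, y, cv=cv, scoring="f1")
    print(f"{params}: F1 = {scores.mean():.3f} +/- {scores.std():.3f}")
```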

Common Pitfalls and Best Practices

❌ Don’t Do This

1. Look at the test set many times

  • Every time you check the test set, you’re “using” it
  • Eventually, you’re overfitting to the test set
  • Solution: Only look at the test set once, at the very end

2. Tune hyperparameters on the test set

  • The test set is for final evaluation only
  • Tuning on the test set = data leakage
  • Solution: Use cross-validation for tuning and the test set only for the final check (see the GridSearchCV sketch after this list)

3. Ignore class imbalance

  • Accuracy can be misleading with imbalanced classes
  • Solution: Use stratified splits and multiple metrics

4. Report only accuracy

  • Accuracy doesn’t tell the full story
  • Solution: Report precision, recall, F1, and the confusion matrix

5. Forget random seeds

  • Results won’t be reproducible
  • Solution: Always set random_state for reproducibility
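
For pitfall 2, the usual pattern is to hold the test set back from the start, tune with cross-validation on the training portion only (for example via GridSearchCV), and look at the test set exactly once at the end. A sketch under those assumptions, again using the breast cancer dataset:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split

X, y = load_breast_cancer(return_X_y=True)

# Hold the test set back from the start; it plays no role in tuning
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

param_grid = {"n_estimators": [100, 300], "max_depth": [5, None]}
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# All tuning happens via cross-validation on the training data only
search = GridSearchCV(
    RandomForestClassifier(random_state=42), param_grid, cv=cv, scoring="f1"
)
search.fit(X_train, y_train)

print("Best params:", search.best_params_)
print(f"CV F1 of best model: {search.best_score_:.3f}")

# The single, final look at the test set
print(f"Test-set F1: {search.score(X_test, y_test):.3f}")
```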

✅ Do This Instead

1. Use cross-validation for model selection

  • More reliable than single split
  • Use same CV splits for fair comparison

2. Keep the test set untouched until the end

  • The test set is for final evaluation only
  • Use a validation set or CV for tuning

3. Use stratified splits for classification

  • Preserves class proportions in every fold
  • Important for imbalanced datasets

4. Report multiple metrics (see the cross_validate sketch after this list)

  • At minimum: accuracy, precision, recall, F1
  • Add ROC AUC for binary classification
  • Include a confusion matrix

5. Set random seeds

  • Makes results reproducible
  • Use random_state=42 (or any number)
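
For point 4, cross_validate accepts several scoring names at once, so multiple metrics come out of a single cross-validation run. A sketch, again assuming the breast cancer dataset:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Several metrics in one pass instead of accuracy alone
scoring = ["accuracy", "precision", "recall", "f1", "roc_auc"]
results = cross_validate(model, X, y, cv=cv, scoring=scoring)

for name in scoring:
    scores = results[f"test_{name}"]
    print(f"{name:>9}: {scores.mean():.3f} +/- {scores.std():.3f}")
```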

Evaluation Checklist

Before deploying a model, check:

  • Used cross-validation (not just one split)
  • Used stratified splits (for classification)
  • Reported multiple metrics (not just accuracy)
  • Compared models with same CV splits
  • Didn’t tune on the test set
  • Set random seeds for reproducibility
  • Checked the confusion matrix (see the sketch after this checklist)
  • Considered problem-specific metrics (recall for medical, precision for spam)
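
For the confusion-matrix item, a short sketch assuming a stratified held-out test set and the same example dataset; rows of the matrix are true classes, columns are predicted classes:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Rows are true classes, columns are predicted classes
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred, target_names=["malignant", "benign"]))
```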

Summary

You’ve learned:

  1. Train/test split - Basic evaluation, but can be unstable
  2. Confusion matrix - Shows what types of errors your model makes
  3. Cross-validation - More stable than a single split; the model is tested on multiple folds
  4. Multiple metrics - Accuracy, precision, recall, F1, ROC AUC all tell different stories
  5. Model comparison - Use same CV splits for fair comparison
  6. Best practices - Avoid common pitfalls, use proper evaluation methods

Next Steps

Now that you know proper evaluation:

  1. Hyperparameter tuning - Use GridSearchCV with cross-validation
  2. Imbalanced datasets - Learn about class weights and sampling
  3. Preprocessing pipelines - Combine preprocessing with evaluation
  4. Production evaluation - Monitor models in production

Congratulations! 🎉

You’ve completed the tutorial on Model Evaluation and Cross-Validation. You now know how to:

  • Split data correctly
  • Use cross-validation for stable estimates
  • Choose and interpret multiple metrics
  • Compare models fairly
  • Avoid common evaluation pitfalls

Keep practicing with different datasets and problems. The more you evaluate models, the better you’ll get at understanding their performance.