Intermediate · 25 min

Comparing Two Models

Let’s compare Logistic Regression with Random Forest. The key is to use the same cross-validation splits for both models.

🐍 Python Compare Models
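
A minimal sketch of the comparison, assuming scikit-learn’s built-in breast cancer dataset as example data. The key detail is creating a single StratifiedKFold object and passing it as cv to both cross_val_score calls, so both models are scored on identical folds:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# One CV object, reused for both models, so every model is scored on identical folds
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

log_reg = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
rf = RandomForestClassifier(n_estimators=200, random_state=42)

log_reg_scores = cross_val_score(log_reg, X, y, cv=cv, scoring="f1")
rf_scores = cross_val_score(rf, X, y, cv=cv, scoring="f1")

print(f"Logistic Regression F1: {log_reg_scores.mean():.3f} +/- {log_reg_scores.std():.3f}")
print(f"Random Forest F1:       {rf_scores.mean():.3f} +/- {rf_scores.std():.3f}")
```

Because the StratifiedKFold object has a fixed random_state, every call to it yields the same folds, which is what makes the two score arrays directly comparable.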

Why Same CV Splits Matter

Using the same splits ensures:

  • Fair comparison - Both models see the same data splits
  • Reduced variance - Differences are due to models, not data splits
  • Statistical validity - Scores line up fold by fold, so they can be compared as paired samples (see the paired-comparison sketch below)

Without the same splits, one model might get “easier” test folds, making the comparison unfair.
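
Because the folds line up one-to-one, you can also inspect fold-by-fold differences instead of only comparing two averages. A small sketch that continues from the comparison code above (it reuses log_reg_scores and rf_scores) and uses scipy’s paired t-test:

```python
import numpy as np
from scipy import stats

# Positive differences mean Random Forest won that fold
diffs = rf_scores - log_reg_scores
print("Per-fold F1 differences:", np.round(diffs, 3))

# A paired t-test is only meaningful because both arrays come from the same folds.
# CV folds share training data, so treat the p-value as a rough guide, not gospel.
t_stat, p_value = stats.ttest_rel(rf_scores, log_reg_scores)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")
```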

Try Different Hyperparameters

Try it yourself: change the Random Forest hyperparameters and see how the metrics change:

🐍 Python Try Different Hyperparameters
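
One way to experiment is to loop over a few candidate settings (the values below are only illustrative) and score each with the same cv object as before:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = load_breast_cancer(return_X_y=True)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Illustrative settings only -- edit these and watch the scores move
settings = [
    {"n_estimators": 50, "max_depth": 3},
    {"n_estimators": 200, "max_depth": None},
    {"n_estimators": 500, "max_depth": 10, "min_samples_leaf": 5},
]

for params in settings:
    model = RandomForestClassifier(random_state=42, **params)
    scores = cross_val_score(model, X, y, cv=cv, scoring="f1")
    print(f"{params}: F1 = {scores.mean():.3f} +/- {scores.std():.3f}")
```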

Common Pitfalls and Best Practices

❌ Don’t Do This

1. Look at the test set many times

  • Every time you check the test set, you’re “using” it
  • Eventually, you’re overfitting to the test set
  • Solution: Only look at the test set once, at the very end

2. Tune hyperparameters on the test set

  • The test set is for final evaluation only
  • Tuning on the test set = data leakage
  • Solution: Use cross-validation for tuning and the test set only for the final check (see the GridSearchCV sketch after this list)

3. Ignore class imbalance

  • Accuracy can be misleading with imbalanced classes
  • Solution: Use stratified splits and multiple metrics

4. Report only accuracy

  • Accuracy doesn’t tell the full story
  • Solution: Report precision, recall, F1, and the confusion matrix

5. Forget random seeds

  • Results won’t be reproducible
  • Solution: Always set random_state for reproducibility
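
For pitfall 2, the usual pattern is to hold the test set back from the start, tune with cross-validation on the training portion only (for example via GridSearchCV), and look at the test set exactly once at the end. A sketch under those assumptions, again using the breast cancer dataset:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split

X, y = load_breast_cancer(return_X_y=True)

# Hold the test set back from the start; it plays no role in tuning
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

param_grid = {"n_estimators": [100, 300], "max_depth": [5, None]}
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# All tuning happens via cross-validation on the training data only
search = GridSearchCV(
    RandomForestClassifier(random_state=42), param_grid, cv=cv, scoring="f1"
)
search.fit(X_train, y_train)

print("Best params:", search.best_params_)
print(f"CV F1 of best model: {search.best_score_:.3f}")

# The single, final look at the test set
print(f"Test-set F1: {search.score(X_test, y_test):.3f}")
```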

✅ Do This Instead

1. Use cross-validation for model selection

  • More reliable than single split
  • Use same CV splits for fair comparison

2. Keep the test set untouched until the end

  • The test set is for final evaluation only
  • Use a validation set or CV for tuning

3. Use stratified splits for classification

  • Preserves class proportions in every fold
  • Important for imbalanced datasets

4. Report multiple metrics (see the cross_validate sketch after this list)

  • At minimum: accuracy, precision, recall, F1
  • Add ROC AUC for binary classification
  • Include a confusion matrix

5. Set random seeds

  • Makes results reproducible
  • Use random_state=42 (or any number)
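
For point 4, cross_validate accepts several scoring names at once, so multiple metrics come out of a single cross-validation run. A sketch, again assuming the breast cancer dataset:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Several metrics in one pass instead of accuracy alone
scoring = ["accuracy", "precision", "recall", "f1", "roc_auc"]
results = cross_validate(model, X, y, cv=cv, scoring=scoring)

for name in scoring:
    scores = results[f"test_{name}"]
    print(f"{name:>9}: {scores.mean():.3f} +/- {scores.std():.3f}")
```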

Evaluation Checklist

Before deploying a model, check:

  • Used cross-validation (not just one split)
  • Used stratified splits (for classification)
  • Reported multiple metrics (not just accuracy)
  • Compared models with same CV splits
  • Didn’t tune on the test set
  • Set random seeds for reproducibility
  • Checked the confusion matrix (see the sketch after this checklist)
  • Considered problem-specific metrics (recall for medical, precision for spam)
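
For the confusion-matrix item, a short sketch assuming a stratified held-out test set and the same example dataset; rows of the matrix are true classes, columns are predicted classes:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Rows are true classes, columns are predicted classes
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred, target_names=["malignant", "benign"]))
```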

Summary

You’ve learned:

  1. Train/test split - Basic evaluation, but can be unstable
  2. Confusion matrix - Shows what types of errors your model makes
  3. Cross-validation - More stable than a single split; the model is tested on multiple folds
  4. Multiple metrics - Accuracy, precision, recall, F1, ROC AUC all tell different stories
  5. Model comparison - Use same CV splits for fair comparison
  6. Best practices - Avoid common pitfalls, use proper evaluation methods

Next Steps

Now that you know proper evaluation:

  1. Hyperparameter tuning - Use GridSearchCV with cross-validation
  2. Imbalanced datasets - Learn about class weights and sampling
  3. Preprocessing pipelines - Combine preprocessing with evaluation
  4. Production evaluation - Monitor models in production

Congratulations! 🎉

You’ve completed the tutorial on Model Evaluation and Cross-Validation. You now know how to:

  • Split data correctly
  • Use cross-validation for stable estimates
  • Choose and interpret multiple metrics
  • Compare models fairly
  • Avoid common evaluation pitfalls

Keep practicing with different datasets and problems. The more you evaluate models, the better you’ll get at understanding their performance.