Intermediate · 25 min

Hyperparameter Tuning with GridSearchCV

Hyperparameters are settings that control how a model learns. Unlike model parameters (learned from data), hyperparameters are set before training. Examples include the number of trees in a random forest and the learning rate in gradient descent.

Finding good hyperparameters manually is tedious. GridSearchCV automates this by trying different combinations and picking the best one.

What Are Hyperparameters?

  • Model parameters are learned from data (e.g., weights in a neural network).
  • Hyperparameters are set before training (e.g., number of trees, max depth).

For Random Forest:

  • n_estimators - Number of trees
  • max_depth - Maximum depth of trees
  • min_samples_split - Minimum samples to split a node

These affect model performance, but we don’t know the best values ahead of time.
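
For a concrete picture, here is a minimal sketch of setting these hyperparameters by hand; the specific values and the X_train / y_train names are illustrative, not part of the exercise:

from sklearn.ensemble import RandomForestClassifier

# Hyperparameters are chosen up front, before the model sees any data
model = RandomForestClassifier(
    n_estimators=100,      # number of trees
    max_depth=10,          # maximum depth of each tree
    min_samples_split=5,   # minimum samples required to split a node
    random_state=42,
)

# The model parameters (the trees themselves) are only learned during fit()
# model.fit(X_train, y_train)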

Defining a Parameter Grid

We specify which hyperparameters to try and what values to test:

🐍 Python Defining Parameter Grid
📟 Console Output
Run code to see output...
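
As a rough sketch of what this step could look like (the values are illustrative, and the model__ prefix assumes the estimator step of a pipeline is named "model"):

# Each key is a hyperparameter name, each value is the list of settings to try
param_grid = {
    "model__n_estimators": [50, 100, 200],
    "model__max_depth": [None, 10, 20],
    "model__min_samples_split": [2, 5],
}
# Total combinations: 3 × 3 × 2 = 18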

Using GridSearchCV

GridSearchCV tries all combinations in the grid and picks the best one using cross-validation:

🐍 Python Using GridSearchCV
📟 Console Output
Run code to see output...
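
A minimal sketch of how this might look, assuming a pipeline with a scaling step named "scaler" and a random forest step named "model", the param_grid from above, and training data in the placeholder variables X_train / y_train:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("model", RandomForestClassifier(random_state=42)),
])

# Every combination in param_grid is evaluated with 5-fold cross-validation
grid_search = GridSearchCV(
    pipeline,
    param_grid,
    cv=5,
    scoring="accuracy",
    n_jobs=-1,  # use all available CPU cores
)
grid_search.fit(X_train, y_train)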

Understanding the Results

After fitting, GridSearchCV exposes the following attributes (usage sketch below):

  • best_params_ - Best hyperparameter values
  • best_score_ - Best cross-validation score
  • best_estimator_ - The pipeline with best parameters (already fitted)
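
A short usage sketch, assuming grid_search was fitted as in the previous step (X_test is a placeholder for held-out data):

# Best hyperparameter values found by the search
print(grid_search.best_params_)

# Mean cross-validation score of that combination
print(f"Best CV accuracy: {grid_search.best_score_:.3f}")

# The best pipeline, already refit on the full training data
best_pipeline = grid_search.best_estimator_
# predictions = best_pipeline.predict(X_test)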

Why NOT Tune on the Test Set?

Important: Never use the test set for hyperparameter tuning. Here’s why:

  1. Data leakage - You’d be using test data to make decisions
  2. Overfitting - The chosen hyperparameters would overfit to the test set
  3. Unreliable evaluation - The test set should only be used for the final evaluation

GridSearchCV uses cross-validation on the training data, so it never touches your test set. The test set should be held out completely until the very end.
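
An illustrative sketch of this workflow (X, y, and grid_search are placeholder names carried over from the examples above):

from sklearn.model_selection import train_test_split

# Hold out the test set before any tuning happens
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# The search only ever sees the training data;
# cross-validation folds are created inside X_train / y_train
grid_search.fit(X_train, y_train)

# The test set is used exactly once, for the final evaluation
test_score = grid_search.best_estimator_.score(X_test, y_test)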

Comparing Grid Sizes

Smaller grids are faster but might miss good combinations. Larger grids are slower but more thorough:


# Quick search - fewer combinations
param_grid = {
    "model__n_estimators": [50, 100],
    "model__max_depth": [None, 10],
}
# Total: 2 × 2 = 4 combinations
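
For comparison, a larger grid (values here are purely illustrative) grows quickly:

# Thorough search - many more combinations
param_grid = {
    "model__n_estimators": [50, 100, 200, 500],
    "model__max_depth": [None, 5, 10, 20],
    "model__min_samples_split": [2, 5, 10],
}
# Total: 4 × 4 × 3 = 48 combinations
# With 5-fold cross-validation, that means 48 × 5 = 240 model fits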

RandomizedSearchCV

When grids get large, RandomizedSearchCV is faster. Instead of trying all combinations, it randomly samples a specified number of combinations:

🐍 Python RandomizedSearchCV Alternative
📟 Console Output
Run code to see output...
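
A sketch of what this could look like, reusing the pipeline and training data placeholders from above; the distributions and the n_iter value are illustrative:

from scipy.stats import randint
from sklearn.model_selection import RandomizedSearchCV

# Distributions (or lists) to sample from, instead of exhaustive value lists
param_distributions = {
    "model__n_estimators": randint(50, 500),
    "model__max_depth": [None, 5, 10, 20, 30],
    "model__min_samples_split": randint(2, 20),
}

random_search = RandomizedSearchCV(
    pipeline,
    param_distributions,
    n_iter=20,           # try only 20 randomly sampled combinations
    cv=5,
    scoring="accuracy",
    random_state=42,
    n_jobs=-1,
)
random_search.fit(X_train, y_train)
print(random_search.best_params_)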

Accessing Results

You can see all the results, not just the best:

🐍 Python Exploring All Results
📟 Console Output
Run code to see output...
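
One way to inspect the full results is through the cv_results_ attribute, sketched here assuming a fitted grid_search and that pandas is available:

import pandas as pd

# cv_results_ is a dict of arrays with one entry per parameter combination
results = pd.DataFrame(grid_search.cv_results_)

# Keep the most informative columns and sort by rank
columns = ["params", "mean_test_score", "std_test_score", "rank_test_score"]
print(results[columns].sort_values("rank_test_score").head())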

Key Takeaways

Before moving on:

  1. Hyperparameters - Settings that control model learning
  2. GridSearchCV - Tries all combinations, uses CV to evaluate
  3. Never tune on test set - Use CV, hold out test set for final evaluation
  4. RandomizedSearchCV - Faster alternative for large parameter spaces
  5. Pipeline naming - Use the step__param format (e.g., model__n_estimators) in the parameter grid

Quick Knowledge Check

What’s Next?

On the final page, we'll evaluate our tuned model, interpret the results, and save the pipeline for reuse.