Intermediate · 25 min

Hyperparameter Tuning with GridSearchCV

Hyperparameters are settings that control how a model learns. Unlike model parameters (learned from data), hyperparameters are set before training. Examples include the number of trees in a random forest and the learning rate in gradient descent.

Finding good hyperparameters manually is tedious. GridSearchCV automates this by trying different combinations and picking the best one.

What Are Hyperparameters?

  • Model parameters are learned from data (e.g., weights in a neural network).
  • Hyperparameters are set before training (e.g., number of trees, max depth).

For Random Forest:

  • n_estimators - Number of trees
  • max_depth - Maximum depth of trees
  • min_samples_split - Minimum samples to split a node

These affect model performance, but we don’t know the best values ahead of time.
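
For a concrete picture, here is a minimal sketch of setting these hyperparameters by hand; the specific values and the X_train / y_train names are illustrative, not part of the exercise:

from sklearn.ensemble import RandomForestClassifier

# Hyperparameters are chosen up front, before the model sees any data
model = RandomForestClassifier(
    n_estimators=100,      # number of trees
    max_depth=10,          # maximum depth of each tree
    min_samples_split=5,   # minimum samples required to split a node
    random_state=42,
)

# The model parameters (the trees themselves) are only learned during fit()
# model.fit(X_train, y_train)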

Defining a Parameter Grid

We specify which hyperparameters to try and what values to test:

🐍 Python Defining Parameter Grid
📟 Console Output
Run code to see output...
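
As a rough sketch of what this step could look like (the values are illustrative, and the model__ prefix assumes the estimator step of a pipeline is named "model"):

# Each key is a hyperparameter name, each value is the list of settings to try
param_grid = {
    "model__n_estimators": [50, 100, 200],
    "model__max_depth": [None, 10, 20],
    "model__min_samples_split": [2, 5],
}
# Total combinations: 3 × 3 × 2 = 18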

Using GridSearchCV

GridSearchCV tries all combinations in the grid and picks the best one using cross-validation:

🐍 Python Using GridSearchCV
📟 Console Output
Run code to see output...
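
A minimal sketch of how this might look, assuming a pipeline with a scaling step named "scaler" and a random forest step named "model", the param_grid from above, and training data in the placeholder variables X_train / y_train:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("model", RandomForestClassifier(random_state=42)),
])

# Every combination in param_grid is evaluated with 5-fold cross-validation
grid_search = GridSearchCV(
    pipeline,
    param_grid,
    cv=5,
    scoring="accuracy",
    n_jobs=-1,  # use all available CPU cores
)
grid_search.fit(X_train, y_train)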

Understanding the Results

After fitting, GridSearchCV exposes the following attributes (usage sketch below):

  • best_params_ - Best hyperparameter values
  • best_score_ - Best cross-validation score
  • best_estimator_ - The pipeline with best parameters (already fitted)
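
A short usage sketch, assuming grid_search was fitted as in the previous step (X_test is a placeholder for held-out data):

# Best hyperparameter values found by the search
print(grid_search.best_params_)

# Mean cross-validation score of that combination
print(f"Best CV accuracy: {grid_search.best_score_:.3f}")

# The best pipeline, already refit on the full training data
best_pipeline = grid_search.best_estimator_
# predictions = best_pipeline.predict(X_test)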

Why NOT Tune on the Test Set?

Important: Never use the test set for hyperparameter tuning. Here’s why:

  1. Data leakage - You’d be using test data to make decisions
  2. Overfitting - The chosen hyperparameters would overfit to the test set
  3. Unreliable evaluation - The test set should only be used for the final evaluation

GridSearchCV uses cross-validation on the training data, so it never touches your test set. The test set should be held out completely until the very end.
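
An illustrative sketch of this workflow (X, y, and grid_search are placeholder names carried over from the examples above):

from sklearn.model_selection import train_test_split

# Hold out the test set before any tuning happens
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# The search only ever sees the training data;
# cross-validation folds are created inside X_train / y_train
grid_search.fit(X_train, y_train)

# The test set is used exactly once, for the final evaluation
test_score = grid_search.best_estimator_.score(X_test, y_test)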

Comparing Grid Sizes

Smaller grids are faster but might miss good combinations. Larger grids are slower but more thorough:


# Quick search - fewer combinations
param_grid = {
    "model__n_estimators": [50, 100],
    "model__max_depth": [None, 10],
}
# Total: 2 × 2 = 4 combinations
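
For comparison, a larger grid (values here are purely illustrative) grows quickly:

# Thorough search - many more combinations
param_grid = {
    "model__n_estimators": [50, 100, 200, 500],
    "model__max_depth": [None, 5, 10, 20],
    "model__min_samples_split": [2, 5, 10],
}
# Total: 4 × 4 × 3 = 48 combinations
# With 5-fold cross-validation, that means 48 × 5 = 240 model fits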

RandomizedSearchCV

When grids get large, RandomizedSearchCV is faster. Instead of trying all combinations, it randomly samples a specified number of combinations:

🐍 Python RandomizedSearchCV Alternative
📟 Console Output
Run code to see output...
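
A sketch of what this could look like, reusing the pipeline and training data placeholders from above; the distributions and the n_iter value are illustrative:

from scipy.stats import randint
from sklearn.model_selection import RandomizedSearchCV

# Distributions (or lists) to sample from, instead of exhaustive value lists
param_distributions = {
    "model__n_estimators": randint(50, 500),
    "model__max_depth": [None, 5, 10, 20, 30],
    "model__min_samples_split": randint(2, 20),
}

random_search = RandomizedSearchCV(
    pipeline,
    param_distributions,
    n_iter=20,           # try only 20 randomly sampled combinations
    cv=5,
    scoring="accuracy",
    random_state=42,
    n_jobs=-1,
)
random_search.fit(X_train, y_train)
print(random_search.best_params_)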

Accessing Results

You can see all the results, not just the best:

🐍 Python Exploring All Results
📟 Console Output
Run code to see output...
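
One way to inspect the full results is through the cv_results_ attribute, sketched here assuming a fitted grid_search and that pandas is available:

import pandas as pd

# cv_results_ is a dict of arrays with one entry per parameter combination
results = pd.DataFrame(grid_search.cv_results_)

# Keep the most informative columns and sort by rank
columns = ["params", "mean_test_score", "std_test_score", "rank_test_score"]
print(results[columns].sort_values("rank_test_score").head())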

Key Takeaways

Before moving on:

  1. Hyperparameters - Settings that control model learning
  2. GridSearchCV - Tries all combinations, uses CV to evaluate
  3. Never tune on test set - Use CV, hold out test set for final evaluation
  4. RandomizedSearchCV - Faster alternative for large parameter spaces
  5. Pipeline naming - Use the step__param format (e.g., model__n_estimators) in the parameter grid

Quick Knowledge Check

What’s Next?

On the final page, we'll evaluate our tuned model, interpret the results, and save the pipeline for reuse.