Intermediate 25 min

Building the Full Pipeline

Now we’ll combine preprocessing and the model into a single Pipeline. This ensures preprocessing is applied consistently and makes the code much cleaner.

Why Pipelines Matter

Pipelines solve several problems:

  1. Consistency - Same preprocessing applied at train and predict time
  2. Simplicity - One object handles everything
  3. Integration - Works seamlessly with cross-validation and hyperparameter tuning
  4. Deployment - Easy to save and load the entire pipeline

Without a pipeline, you have to remember to:

  • Apply preprocessing to training data
  • Apply the same preprocessing to test data
  • Apply the same preprocessing to new predictions
  • Keep track of which transformations were used

It’s easy to make mistakes. Pipelines prevent that.

Creating a Pipeline

Let’s build our first pipeline:

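A minimal sketch of what that might look like, assuming the preprocessor (the ColumnTransformer built on the previous page) and the train/test split already exist, and using logistic regression as a placeholder model:

from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# "preprocessor" is assumed to be the ColumnTransformer from the previous page
model = LogisticRegression(max_iter=1000)

clf = Pipeline([
    ("preprocessor", preprocessor),  # step 1: scale/encode the raw features
    ("model", model),                # step 2: the estimator trained on transformed data
])

print(clf)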

Training the Pipeline

Training a pipeline is just like training a model. The pipeline handles all the preprocessing automatically:

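A sketch of that workflow, assuming the clf pipeline above plus the X_train/X_test split and labels from the earlier pages:

from sklearn.metrics import accuracy_score

# one call fits the preprocessor and then the model on the training data
clf.fit(X_train, y_train)

# the same fitted preprocessing is applied automatically before predicting
y_pred = clf.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")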

Comparing Different Models

Because the model is just a named step, we can swap in a different estimator, for example a random forest, without touching the preprocessing:


from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.pipeline import Pipeline

# swap in a random forest; the preprocessing step stays exactly the same
model = RandomForestClassifier(random_state=42, n_estimators=100)
clf = Pipeline([
    ("preprocessor", preprocessor),
    ("model", model),
])
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print(f"Random Forest Accuracy: {accuracy_score(y_test, y_pred):.4f}")

Pipeline Flow Visualization

Here’s how data flows through our pipeline:

Raw Input → Preprocessor → Scaled Data → Model → Prediction

Pipeline Naming Convention

When you need to access or modify the parameters of a pipeline step, you use the naming convention:

step_name__parameter_name

For example:

  • model__n_estimators - The n_estimators parameter of the model step
  • preprocessor__num__with_mean - The with_mean parameter of the numeric transformer

This becomes important when we do hyperparameter tuning on the upcoming pages.

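For example (assuming the clf pipeline defined above, with the random forest as its model step), get_params() and set_params() use the same convention:

# every nested parameter is exposed under a step__param name
print("model__n_estimators" in clf.get_params())    # True

# read and update a nested parameter without rebuilding the pipeline
print(clf.get_params()["model__n_estimators"])      # 100
clf.set_params(model__n_estimators=200)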

Benefits of Pipelines

Let’s compare the manual approach with the pipeline approach:

Manual approach:

# Fit preprocessor
X_train_scaled = preprocessor.fit_transform(X_train)
X_test_scaled = preprocessor.transform(X_test)

# Train model
model.fit(X_train_scaled, y_train)

# Predict
y_pred = model.predict(X_test_scaled)

Pipeline approach:

# Everything in one step
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

The pipeline approach is:

  • Shorter - Less code to write
  • Safer - Can’t forget to preprocess
  • Cleaner - One object to manage
  • Better for CV - Works seamlessly with cross-validation (a quick preview follows below)
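
As a quick preview (a minimal sketch using the clf pipeline from above), the entire pipeline can be passed to cross_val_score, so the preprocessor is re-fit inside each fold and nothing leaks from the validation split:

from sklearn.model_selection import cross_val_score

# each fold fits the preprocessor and model on its own training portion only
scores = cross_val_score(clf, X_train, y_train, cv=5)
print(f"CV accuracy: {scores.mean():.4f} (+/- {scores.std():.4f})")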

Key Takeaways

Before moving on:

  1. Pipelines combine preprocessing and model into one object
  2. Same interface - Use fit() and predict() like a regular model
  3. Automatic preprocessing - Applied consistently at train and predict time
  4. Naming convention - Use step__param to access parameters
  5. Production ready - Easy to save and deploy (a sketch follows below)
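
As an illustration of point 5 (a minimal sketch; the filename is arbitrary), the fitted pipeline can be saved with joblib and reloaded later to predict on raw, unprocessed rows:

import joblib

# persist preprocessing and model together as a single artifact
joblib.dump(clf, "pipeline.joblib")

# later, e.g. in a serving script: reload and predict directly on raw features
loaded_clf = joblib.load("pipeline.joblib")
print(loaded_clf.predict(X_test[:5]))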

What’s Next?

On the next page, we’ll use cross-validation to get more reliable performance estimates. Pipelines make this especially easy.