Building the Full Pipeline
Now we’ll combine preprocessing and the model into a single Pipeline. This ensures preprocessing is applied consistently and makes the code much cleaner.
Why Pipelines Matter
Pipelines solve several problems:
- Consistency - Same preprocessing applied at train and predict time
- Simplicity - One object handles everything
- Integration - Works seamlessly with cross-validation and hyperparameter tuning
- Deployment - Easy to save and load the entire pipeline
Without a pipeline, you have to remember to:
- Apply preprocessing to training data
- Apply the same preprocessing to test data
- Apply the same preprocessing to new predictions
- Keep track of which transformations were used
It’s easy to make mistakes. Pipelines prevent that.
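For example, here is the kind of mistake that slips in without a pipeline. A minimal sketch using a standalone StandardScaler:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)

# Easy mistake: fit_transform on the test set refits the scaler on
# test-set statistics instead of reusing the training-set statistics
X_test_scaled = scaler.fit_transform(X_test)  # wrong: leaks test statistics
X_test_scaled = scaler.transform(X_test)      # correct: reuses the training fit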
Creating a Pipeline
Let’s build our first pipeline:
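Below is a minimal sketch, assuming the preprocessor (the ColumnTransformer from the previous section) is already defined:
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier

clf = Pipeline([
    # Step 1: the preprocessing built in the previous section
    ("preprocessor", preprocessor),
    # Step 2: the model, trained on the transformed features
    ("model", RandomForestClassifier(n_estimators=100, random_state=42)),
])
Each step is a (name, estimator) tuple. The names matter: we’ll use them below to reach into the pipeline’s parameters.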
Training the Pipeline
Training a pipeline is just like training a model. The pipeline handles all the preprocessing automatically:
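A sketch, assuming X_train, X_test, y_train, y_test come from the earlier train/test split:
from sklearn.metrics import accuracy_score

# fit() fits the preprocessor on X_train, transforms it, then fits the model
clf.fit(X_train, y_train)

# predict() reuses the already-fitted preprocessing before predicting
y_pred = clf.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")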
Comparing Different Models
Let’s see how different models perform in the pipeline:
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# Only the model step changes; the preprocessing stays identical
models = {
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "Logistic Regression": LogisticRegression(max_iter=1000, random_state=42),
    "SVM": SVC(random_state=42),
}

for name, model in models.items():
    clf = Pipeline([
        ("preprocessor", preprocessor),
        ("model", model),
    ])
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print(f"{name} Accuracy: {accuracy_score(y_test, y_pred):.4f}")
Pipeline Flow Visualization
Here’s how data flows through our pipeline:
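In text form:
X (raw features)
  -> preprocessor (transforms numeric and categorical columns)
  -> model (fits or predicts on the transformed features)
  -> predictions
If you work in a notebook, scikit-learn can also render the pipeline as an interactive diagram:
from sklearn import set_config
set_config(display="diagram")
clf  # displaying the pipeline now shows each step as a box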
Pipeline Naming Convention
When you need to access or modify pipeline components, you use the naming convention:
step_name__parameter_name
For example:
- model__n_estimators - The n_estimators parameter of the model step
- preprocessor__num__with_mean - The with_mean parameter of the numeric transformer
This becomes important when we do hyperparameter tuning in the next pages.
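A quick sketch of reading and writing those nested parameters (this assumes clf holds the random forest pipeline from earlier, so the model step has an n_estimators parameter):
# List every parameter name the pipeline exposes
print(sorted(clf.get_params().keys()))

# Update a nested parameter without rebuilding the pipeline
clf.set_params(model__n_estimators=200)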
Benefits of Pipelines
Let’s compare the manual approach with the pipeline approach:
Manual approach:
# Fit preprocessor
X_train_scaled = preprocessor.fit_transform(X_train)
X_test_scaled = preprocessor.transform(X_test)
# Train model
model.fit(X_train_scaled, y_train)
# Predict
y_pred = model.predict(X_test_scaled)
Pipeline approach:
# Everything in one step
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
The pipeline approach is:
- Shorter - Less code to write
- Safer - Can’t forget to preprocess
- Cleaner - One object to manage
- Better for CV - Works seamlessly with cross-validation
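As a preview of that last point, the whole pipeline drops straight into cross_val_score, which we’ll cover properly in the next page:
from sklearn.model_selection import cross_val_score

# The pipeline is refitted inside every fold, so preprocessing
# statistics never leak from the validation fold into training
scores = cross_val_score(clf, X_train, y_train, cv=5)
print(f"CV accuracy: {scores.mean():.4f} +/- {scores.std():.4f}")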
Key Takeaways
Before moving on:
- Pipelines combine preprocessing and model into one object
- Same interface - Use fit() and predict() like a regular model
- Automatic preprocessing - Applied consistently at train and predict time
- Naming convention - Use step__param to access parameters
- Production ready - Easy to save and deploy (see the sketch below)
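Here is what save-and-deploy can look like, as a minimal sketch using joblib (the filename is just an example):
import joblib

# Persist the fitted pipeline: preprocessing and model in one artifact
joblib.dump(clf, "pipeline.joblib")

# In the serving process: load once, then predict on raw, unpreprocessed rows
loaded = joblib.load("pipeline.joblib")
# predictions = loaded.predict(new_rows)  # new_rows: hypothetical raw input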
What’s Next?
In the next page, we’ll use cross-validation to get more reliable performance estimates. Pipelines make this especially easy.