Building the Full Pipeline
Now we’ll combine preprocessing and the model into a single Pipeline. This ensures preprocessing is applied consistently and makes the code much cleaner.
Why Pipelines Matter
Pipelines solve several problems:
- Consistency - Same preprocessing applied at train and predict time
- Simplicity - One object handles everything
- Integration - Works seamlessly with cross-validation and hyperparameter tuning
- Deployment - Easy to save and load the entire pipeline
Without a pipeline, you have to remember to:
- Apply preprocessing to training data
- Apply the same preprocessing to test data
- Apply the same preprocessing to new predictions
- Keep track of which transformations were used
It’s easy to make mistakes. Pipelines prevent that.
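For example, here is the kind of mistake that slips in without a pipeline. A minimal sketch using a standalone StandardScaler:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)

# Easy mistake: fit_transform on the test set refits the scaler on
# test-set statistics instead of reusing the training-set statistics
X_test_scaled = scaler.fit_transform(X_test)  # wrong: leaks test statistics
X_test_scaled = scaler.transform(X_test)      # correct: reuses the training fit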
Creating a Pipeline
Let’s build our first pipeline:
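Below is a minimal sketch, assuming the preprocessor (the ColumnTransformer from the previous section) is already defined:
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier

clf = Pipeline([
    # Step 1: the preprocessing built in the previous section
    ("preprocessor", preprocessor),
    # Step 2: the model, trained on the transformed features
    ("model", RandomForestClassifier(n_estimators=100, random_state=42)),
])
Each step is a (name, estimator) tuple. The names matter: we’ll use them below to reach into the pipeline’s parameters.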
Training the Pipeline
Training a pipeline is just like training a model. The pipeline handles all the preprocessing automatically:
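A sketch, assuming X_train, X_test, y_train, y_test come from the earlier train/test split:
from sklearn.metrics import accuracy_score

# fit() fits the preprocessor on X_train, transforms it, then fits the model
clf.fit(X_train, y_train)

# predict() reuses the already-fitted preprocessing before predicting
y_pred = clf.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")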
Comparing Different Models
Let’s see how different models perform in the pipeline:
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# Only the model step changes; the preprocessing stays identical
models = {
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "Logistic Regression": LogisticRegression(max_iter=1000, random_state=42),
    "SVM": SVC(random_state=42),
}

for name, model in models.items():
    clf = Pipeline([
        ("preprocessor", preprocessor),
        ("model", model),
    ])
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print(f"{name} Accuracy: {accuracy_score(y_test, y_pred):.4f}")
Pipeline Flow Visualization
Here’s how data flows through our pipeline:
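In text form:
X (raw features)
  -> preprocessor (transforms numeric and categorical columns)
  -> model (fits or predicts on the transformed features)
  -> predictions
If you work in a notebook, scikit-learn can also render the pipeline as an interactive diagram:
from sklearn import set_config
set_config(display="diagram")
clf  # displaying the pipeline now shows each step as a box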
Pipeline Naming Convention
When you need to access or modify pipeline components, you use the naming convention:
step_name__parameter_name
For example:
- model__n_estimators - The n_estimators parameter of the model step
- preprocessor__num__with_mean - The with_mean parameter of the numeric transformer
This becomes important when we do hyperparameter tuning in the next pages.
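A quick sketch of reading and writing those nested parameters (this assumes clf holds the random forest pipeline from earlier, so the model step has an n_estimators parameter):
# List every parameter name the pipeline exposes
print(sorted(clf.get_params().keys()))

# Update a nested parameter without rebuilding the pipeline
clf.set_params(model__n_estimators=200)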
Benefits of Pipelines
Let’s compare the manual approach with the pipeline approach:
Manual approach:
# Fit preprocessor
X_train_scaled = preprocessor.fit_transform(X_train)
X_test_scaled = preprocessor.transform(X_test)
# Train model
model.fit(X_train_scaled, y_train)
# Predict
y_pred = model.predict(X_test_scaled)
Pipeline approach:
# Everything in one step
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
The pipeline approach is:
- Shorter - Less code to write
- Safer - Can’t forget to preprocess
- Cleaner - One object to manage
- Better for CV - Works seamlessly with cross-validation
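As a preview of that last point, the whole pipeline drops straight into cross_val_score, which we’ll cover properly in the next page:
from sklearn.model_selection import cross_val_score

# The pipeline is refitted inside every fold, so preprocessing
# statistics never leak from the validation fold into training
scores = cross_val_score(clf, X_train, y_train, cv=5)
print(f"CV accuracy: {scores.mean():.4f} +/- {scores.std():.4f}")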
Key Takeaways
Before moving on:
- Pipelines combine preprocessing and model into one object
- Same interface - Use fit() and predict() like a regular model
- Automatic preprocessing - Applied consistently at train and predict time
- Naming convention - Use step__param to access parameters
- Production ready - Easy to save and deploy (see the sketch below)
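Here is what save-and-deploy can look like, as a minimal sketch using joblib (the filename is just an example):
import joblib

# Persist the fitted pipeline: preprocessing and model in one artifact
joblib.dump(clf, "pipeline.joblib")

# In the serving process: load once, then predict on raw, unpreprocessed rows
loaded = joblib.load("pipeline.joblib")
# predictions = loaded.predict(new_rows)  # new_rows: hypothetical raw input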
What’s Next?
In the next page, we’ll use cross-validation to get more reliable performance estimates. Pipelines make this especially easy.