Intermediate 25 min

Adding Preprocessing with ColumnTransformer

Real-world data often has mixed feature types. Some columns are numeric, others are categorical. Even if all features are numeric, they might be on different scales. Let’s see how to handle this properly.

Why Preprocessing Matters

Different features can have vastly different scales. For example:

  • Alcohol content: 11-15
  • Proline: 278-1680

Models like Logistic Regression and SVM are sensitive to feature scales. Features with larger values can dominate the model, even if they’re not more important.

Scaling puts all features on a comparable scale, typically either between 0 and 1 (min-max scaling) or with mean 0 and standard deviation 1 (standardization).

StandardScaler

StandardScaler transforms features so they have:

  • Mean = 0
  • Standard deviation = 1

This is called “standardization” or “z-score normalization”: each value x becomes z = (x - mean) / standard deviation.

🐍 Python Understanding StandardScaler
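A minimal sketch of what this step looks like, assuming scikit-learn and the Wine dataset used throughout this lesson:

```python
# Standardizing the Wine features with StandardScaler.
import numpy as np
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Raw columns sit on very different scales (alcohol ~13, proline ~750)
print("raw means:   ", X.mean(axis=0)[[0, 12]].round(1))
# After scaling, every column has mean ~0 and standard deviation ~1
print("scaled means:", X_scaled.mean(axis=0)[[0, 12]].round(2))
print("scaled stds: ", X_scaled.std(axis=0)[[0, 12]].round(2))
```

Indices 0 and 12 are the alcohol and proline columns; after scaling, both contribute on equal footing.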

ColumnTransformer Basics

ColumnTransformer lets us apply different transformations to different columns. This is useful when you have:

  • Numeric features that need scaling
  • Categorical features that need encoding

Even though our Wine dataset is all numeric, let’s set up ColumnTransformer properly. This pattern works for any dataset.

🐍 Python Using ColumnTransformer
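A minimal sketch of that setup, assuming the Wine data is loaded as a DataFrame with all 13 columns treated as numeric:

```python
# ColumnTransformer with a single numeric branch (the Wine case).
from sklearn.compose import ColumnTransformer
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler

X = load_wine(as_frame=True).data  # DataFrame with 13 numeric columns

numeric_features = X.columns.tolist()

preprocessor = ColumnTransformer(
    transformers=[
        # (name, transformer, columns): scale every numeric column
        ("num", StandardScaler(), numeric_features),
    ],
    remainder="drop",  # drop any column not listed above (none here)
)

X_processed = preprocessor.fit_transform(X)
print(X_processed.shape)  # 178 rows, 13 scaled columns
```

The transformer list looks redundant with only one entry, but it gives you a slot to add a categorical branch later without restructuring the code.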

Handling Categorical Features

Even though our dataset doesn’t have categorical features, let’s see how you’d handle them. This is important for real-world datasets.

[Diagram: raw data is split into numeric and categorical features; the numeric features pass through StandardScaler and the categorical features through OneHotEncoder, then the two outputs are merged into one combined feature matrix.]

Here’s how you’d set it up with categorical features:

🐍 Python ColumnTransformer with Multiple Feature Types
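A sketch of the mixed-type setup. Since the Wine dataset has no categorical columns, this uses a small hypothetical DataFrame (the column names `alcohol`, `proline`, and `region` are invented for illustration):

```python
# ColumnTransformer with separate numeric and categorical branches.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical mixed-type data (not the Wine dataset)
df = pd.DataFrame({
    "alcohol": [12.1, 13.4, 11.8, 14.0],
    "proline": [520, 1280, 730, 1450],
    "region":  ["north", "south", "north", "east"],
})

numeric_features = ["alcohol", "proline"]
categorical_features = ["region"]

preprocessor = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), numeric_features),
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_features),
    ]
)

X_processed = preprocessor.fit_transform(df)
print(X_processed.shape)  # (4, 5): 2 scaled columns + 3 one-hot columns
```

Each branch only ever sees its own columns, and ColumnTransformer concatenates the outputs side by side.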

Why handle_unknown="ignore"?

When using OneHotEncoder, handle_unknown="ignore" tells the encoder what to do if it encounters a category it hasn’t seen during training.

Without it, the model would crash if a new category appears in test data. With ignore, it simply creates a zero vector for that category, allowing the model to continue.

Testing the Preprocessor

Let’s see how preprocessing affects our model performance:

🐍 Python Testing Preprocessing Impact

Important: fit vs transform

Notice the difference:

  • fit_transform(X_train) - Learn the scaling parameters from training data, then transform it
  • transform(X_test) - Apply the learned scaling to test data

Never call fit or fit_transform on test data. This would leak information from the test set into your model, making your evaluation unreliable.
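A quick sketch of the correct pattern on synthetic data (the random features here are purely illustrative):

```python
# fit_transform on train, transform only on test.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(loc=50, scale=10, size=(100, 2))
X_train, X_test = train_test_split(X, random_state=0)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std from train
X_test_scaled = scaler.transform(X_test)        # reuse those parameters

# Train columns are exactly mean 0; test columns are only approximately 0,
# because they were scaled with the *training* statistics.
print(X_train_scaled.mean(axis=0).round(4))
print(X_test_scaled.mean(axis=0).round(4))
```

The small nonzero test means are expected and correct: they show no test information leaked into the scaler.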

Key Takeaways

Before moving on:

  1. Scaling matters - Features on different scales can bias models
  2. ColumnTransformer - Apply different transformations to different columns
  3. fit on train, transform on test - Never fit on test data
  4. handle_unknown - Important for categorical features in production

What’s Next?

In the next page, we’ll wrap preprocessing and the model into a single Pipeline. This ensures preprocessing is always applied correctly and makes the code much cleaner.