Intermediate 25 min

Quick Baseline Model

Before we build a full pipeline, let’s create a simple baseline model. This gives us something to compare against and helps us understand what we’re working with.

Why a Baseline Model?

A baseline model is a simple, often naive approach that gives us a performance benchmark. It answers the question: “What’s the minimum performance we should expect?”

Think of it like this: if a complex model performs worse than a simple baseline, something’s wrong. The baseline sets the floor for our expectations.

Splitting the Data

First, we need to split our data into training and testing sets. The training set is used to teach the model, and the test set is used to evaluate how well it learned.

🐍 Python Splitting the Data
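A minimal sketch of this split, assuming the features and labels come from scikit-learn's wine dataset (suggested by the alcohol and proline ranges mentioned later on this page):

from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split

# Load features (X) and labels (y); the wine dataset is an assumption
# based on the feature ranges mentioned later in this lesson
X, y = load_wine(return_X_y=True, as_frame=True)

# Hold out 20% of the rows for testing; stratify keeps class proportions
# similar in both sets, and random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(f"Training samples: {X_train.shape[0]}")
print(f"Test samples: {X_test.shape[0]}")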

Understanding random_state

The random_state parameter controls the randomness of the split. Setting it to a fixed number (like 42) ensures you get the same split every time you run the code. This is important for reproducibility.

Without random_state, you’d get different train/test splits each run, making it hard to compare results.
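A quick illustrative check: two splits made with the same random_state are identical, while a split made without one differs from run to run.

from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split

X, y = load_wine(return_X_y=True, as_frame=True)

# Same random_state -> identical split on every run
a_train, _, _, _ = train_test_split(X, y, test_size=0.2, random_state=42)
b_train, _, _, _ = train_test_split(X, y, test_size=0.2, random_state=42)
print(a_train.index.equals(b_train.index))  # True

# No random_state -> a different split each run, so results are hard to compare
c_train, _, _, _ = train_test_split(X, y, test_size=0.2)
print(a_train.index.equals(c_train.index))  # almost certainly False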

Training a Simple Model

Let’s train a Logistic Regression model as our baseline. It’s simple, fast, and often works well for classification tasks.

🐍 Python Training and Evaluating Baseline Model
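A minimal sketch of training and evaluating the baseline, reusing the X_train/X_test split from above (the variable name baseline is just for illustration):

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Fit the baseline model on the training set only
baseline = LogisticRegression(random_state=42, max_iter=1000)
baseline.fit(X_train, y_train)

# Evaluate on the held-out test set
y_pred = baseline.predict(X_test)
print(f"Baseline accuracy: {accuracy_score(y_test, y_pred):.4f}")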

Try Different Models

Let’s compare a few different models to see how they perform. The Logistic Regression entry is shown below; a sketch extending the comparison to a couple of other classifiers follows it:


# Logistic Regression
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

model = LogisticRegression(random_state=42, max_iter=1000)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Logistic Regression Accuracy: {accuracy:.4f}")
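One way to sketch the rest of the comparison is a simple loop; the extra classifiers here (k-nearest neighbors and a decision tree) are illustrative choices, not prescribed by the lesson:

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Candidate models to compare against the baseline (illustrative choices)
models = {
    "Logistic Regression": LogisticRegression(random_state=42, max_iter=1000),
    "K-Nearest Neighbors": KNeighborsClassifier(),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
}

# Train each candidate on the same split and report test accuracy
for name, clf in models.items():
    clf.fit(X_train, y_train)
    accuracy = accuracy_score(y_test, clf.predict(X_test))
    print(f"{name} Accuracy: {accuracy:.4f}")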

Understanding Overfitting

Overfitting happens when a model learns the training data too well. It memorizes patterns that don’t generalize to new data.

Think of it like studying for a test by memorizing specific questions instead of understanding the concepts. You might ace the practice test, but fail the real exam.

Our train/test split helps us detect overfitting. If the model performs much better on training data than test data, it’s likely overfitting.

🐍 Python Checking for Overfitting
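A minimal sketch of this check, assuming the baseline model fitted in the earlier sketch:

from sklearn.metrics import accuracy_score

# Compare accuracy on data the model was trained on vs. held-out data
train_accuracy = accuracy_score(y_train, baseline.predict(X_train))
test_accuracy = accuracy_score(y_test, baseline.predict(X_test))

print(f"Training accuracy: {train_accuracy:.4f}")
print(f"Test accuracy:     {test_accuracy:.4f}")
print(f"Gap:               {train_accuracy - test_accuracy:.4f}")

# A large gap (training accuracy near 1.0 while test accuracy lags behind)
# suggests the model is memorizing the training data rather than generalizing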

The Problem with This Approach

Right now, we’re training models directly on the raw data. But real-world data often needs preprocessing:

  • Scaling - Features might be on different scales (e.g., alcohol content 0-15, proline 278-1680)
  • Normalization - Some models work better with normalized features
  • Feature engineering - Creating new features from existing ones

If we add preprocessing later, we have to remember to fit it on the training data only and then apply the exact same transformation to the test data. Doing this by hand is error-prone and easy to mess up.
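To make that concern concrete, here is a sketch of the manual version: the scaler has to be fit on the training data only and then applied to both sets, every single time the data is prepared.

from sklearn.preprocessing import StandardScaler

# Fit the scaler on the training data only (to avoid leaking test information),
# then apply the same transformation to both sets
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Every model now has to be trained and evaluated on the *_scaled versions;
# forgetting one of these steps is the kind of mistake pipelines prevent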

That’s where pipelines come in. They ensure preprocessing is applied consistently.

Key Takeaways

Before moving on, remember:

  1. Baseline models give us a performance benchmark
  2. Train/test split helps evaluate generalization
  3. random_state ensures reproducible results
  4. Overfitting is when models memorize training data
  5. Manual preprocessing is error-prone - pipelines solve this

What’s Next?

On the next page, we’ll add preprocessing using ColumnTransformer. This prepares our data properly and sets us up for building a full pipeline.