Quick Baseline Model
Before we build a full pipeline, let’s create a simple baseline model. This gives us something to compare against and helps us understand what we’re working with.
Why a Baseline Model?
A baseline model is a simple, often naive approach that gives us a performance benchmark. It answers the question: “What’s the minimum performance we should expect?”
Think of it like this: if a complex model performs worse than a simple baseline, something’s wrong. The baseline sets the floor for our expectations.
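To make "naive" concrete, here is a minimal sketch of a majority-class baseline using scikit-learn's DummyClassifier. This is only an illustration of what a naive benchmark looks like (our actual baseline in this lesson is Logistic Regression), and it assumes X_train, X_test, y_train, and y_test already exist from the split described in the next section.

# A naive baseline: always predict the most frequent class in the training data.
# Assumes X_train, X_test, y_train, y_test come from the train/test split below.
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

dummy = DummyClassifier(strategy="most_frequent")
dummy.fit(X_train, y_train)
baseline_accuracy = accuracy_score(y_test, dummy.predict(X_test))
print(f"Majority-class baseline accuracy: {baseline_accuracy:.4f}")

Any model we train should comfortably beat this number.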
Splitting the Data
First, we need to split our data into training and testing sets. The training set is used to teach the model, and the test set is used to evaluate how well it learned.
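As a concrete sketch, the split below assumes the features and labels prepared earlier in this tutorial are loaded into X and y; the 80/20 split ratio is a common choice, not a requirement.

# Split features X and labels y into training and test sets.
# Assumes X and y were created earlier in this tutorial.
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,        # hold out 20% of the rows for evaluation
    random_state=42,      # fixed seed so the split is reproducible
    stratify=y,           # keep class proportions similar in both sets
)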
Understanding random_state
The random_state parameter controls the randomness of the split. Setting it to a fixed number (like 42) ensures you get the same split every time you run the code. This is important for reproducibility.
Without random_state, you’d get different train/test splits each run, making it hard to compare results.
Training a Simple Model
Let’s train a Logistic Regression model as our baseline. It’s simple, fast, and often works well for classification tasks.
Try Different Models
Let’s compare a few different models to see how they perform:
# Shared evaluation helper used by all three models below
from sklearn.metrics import accuracy_score

# Logistic Regression
from sklearn.linear_model import LogisticRegression
model = LogisticRegression(random_state=42, max_iter=1000)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Logistic Regression Accuracy: {accuracy:.4f}")

# Random Forest
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(random_state=42, n_estimators=100)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Random Forest Accuracy: {accuracy:.4f}")

# Support Vector Machine
from sklearn.svm import SVC
model = SVC(random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"SVM Accuracy: {accuracy:.4f}")
Understanding Overfitting
Overfitting happens when a model learns the training data too well. It memorizes patterns that don’t generalize to new data.
Think of it like studying for a test by memorizing specific questions instead of understanding the concepts. You might ace the practice test, but fail the real exam.
Our train/test split helps us detect overfitting. If the model performs much better on training data than test data, it’s likely overfitting.
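A quick way to check is to score the same fitted model on both sets, as in the sketch below. It assumes one of the fitted models from above (here, whichever is still bound to model) is in scope.

# Compare accuracy on the data the model was trained on vs. held-out data.
# A large gap (e.g. near-perfect train accuracy but much lower test accuracy)
# suggests the model is overfitting.
train_accuracy = accuracy_score(y_train, model.predict(X_train))
test_accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"Train accuracy: {train_accuracy:.4f}")
print(f"Test accuracy:  {test_accuracy:.4f}")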
The Problem with This Approach
Right now, we’re training models directly on the raw data. But real-world data often needs preprocessing:
- Scaling - Features might be on different scales (e.g., alcohol content 0-15, proline 278-1680)
- Normalization - Some models work better with normalized features
- Feature engineering - Creating new features from existing ones
If we add preprocessing later, we need to remember to apply the same preprocessing to both training and test data. This is error-prone and easy to mess up.
That’s where pipelines come in. They ensure preprocessing is applied consistently.
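To see why the manual route is error-prone, here is a sketch of what "the same preprocessing on both sets" looks like by hand, assuming StandardScaler as the preprocessing step (the actual preprocessing we build is covered on the next page):

# Manual preprocessing: fit the scaler on the training data only,
# then apply the SAME fitted scaler to the test data.
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)   # learn mean/std from training data
X_test_scaled = scaler.transform(X_test)         # reuse those statistics; do NOT fit again

# Every model above would now need the *_scaled arrays passed to fit/predict.
# Forgetting one of these steps, or accidentally fitting on the test set,
# is exactly the kind of mistake pipelines are designed to prevent.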
Key Takeaways
Before moving on, remember:
- Baseline models give us a performance benchmark
- Train/test split helps evaluate generalization
- random_state ensures reproducible results
- Overfitting is when models memorize training data
- Manual preprocessing is error-prone - pipelines solve this
What’s Next?
In the next page, we’ll add preprocessing using ColumnTransformer. This prepares our data properly and sets us up for building a full pipeline.