Intermediate 25 min

Setup and Dataset

Let’s start by setting up our environment and getting familiar with the data we’ll be working with.

Installing Dependencies

First, make sure you have the required packages installed. You can install or upgrade them using pip:

pip install -U scikit-learn pandas numpy

Note: This tutorial works best in a Jupyter Notebook or similar interactive environment, but you can also run it as a Python script.
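
To confirm the installation worked, a quick check like the following can help. It is only a sketch: it imports the three packages and prints whichever versions happen to be installed, and the `versions` dictionary is a name chosen here for illustration.

```python
import numpy
import pandas
import sklearn

# Collect the installed version of each package the tutorial relies on.
versions = {
    module.__name__: module.__version__
    for module in (sklearn, pandas, numpy)
}

for name, version in versions.items():
    print(f"{name}: {version}")
```

If any of the imports fails, rerun the pip command above in the same environment that runs your notebook or script.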

Introducing the Dataset

We’ll use Scikit-Learn’s built-in Wine dataset. This dataset contains chemical analysis results of wines from three different cultivars (classes). Our task is to predict the wine class based on its chemical properties.

Why use a built-in dataset? It’s clean, well-documented, and doesn’t require downloading external files. This lets us focus on learning pipeline concepts without dealing with CSV issues or missing file paths.

[Python example: Loading the Wine Dataset]
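
The original interactive code cell is not reproduced here, so the following is a minimal sketch of how the dataset can be loaded with scikit-learn’s `load_wine` function. The variable names `wine`, `X`, and `y` are choices made for this example.

```python
from sklearn.datasets import load_wine

# load_wine() returns a Bunch, a dict-like container with the data and metadata.
wine = load_wine()

X = wine.data    # feature matrix: one row per wine sample
y = wine.target  # class labels: 0, 1, or 2

print("Feature matrix shape:", X.shape)
print("Target vector shape:", y.shape)
print("Feature names:", wine.feature_names)
print("Class names:", list(wine.target_names))
```

The dataset ships with scikit-learn, so this call reads from the installed package and needs no network access.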

Understanding the Data

The Wine dataset has:

  • 178 samples - Each row is a wine sample
  • 13 features - Chemical properties like alcohol content, malic acid, etc.
  • 3 classes - Three different wine cultivars (0, 1, 2)

Let’s take a closer look at the actual data:

[Python example: Exploring the Dataset]
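
One way to explore the data is to load it as a pandas DataFrame. This is a sketch rather than the page’s original cell: it assumes a scikit-learn version that supports the `as_frame=True` option, and the name `df` is chosen for illustration.

```python
from sklearn.datasets import load_wine

# as_frame=True returns pandas objects instead of NumPy arrays.
wine = load_wine(as_frame=True)
df = wine.frame  # the 13 feature columns plus a "target" column

# First few rows of the table.
print(df.head())

# Summary statistics per feature (transposed so each feature is a row).
print(df.describe().T[["mean", "std", "min", "max"]])

# Number of samples in each class.
class_counts = df["target"].value_counts().sort_index()
print(class_counts)
```

The summary statistics show that the features sit on very different scales, which is worth keeping in mind when preprocessing comes up later.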

The Task

Our goal is to build a model that can predict the wine class (0, 1, or 2) given the 13 chemical features. This is a classification problem with three classes.

The features are all numeric (floating-point numbers), which makes preprocessing straightforward. In later sections, we’ll see how to handle both numeric and categorical features.

Data Flow Overview

Here’s how data flows through our ML pipeline:

  Raw Data --(Load)--> Preprocessing
  Preprocessing --(Split)--> Train Set --(Train)--> Model
  Preprocessing --(Split)--> Test Set --(Evaluate)--> Model
  Model --(Predict)--> Predictions
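
The flow above can be sketched in a few lines of code. This is an illustration under assumptions, not the tutorial’s own implementation: `StandardScaler` and `LogisticRegression` stand in for the preprocessing and model steps, and the 80/20 split and `random_state=42` are arbitrary choices. In this sketch the scaler is fitted on the training set only, so the split happens before scaling.

```python
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Load: features and labels as NumPy arrays.
X, y = load_wine(return_X_y=True)

# Split: hold out 20% for testing, keeping class proportions similar.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Preprocess: learn scaling parameters from the training set only.
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train: fit the (assumed) classifier on the scaled training data.
model = LogisticRegression(max_iter=1000).fit(X_train_scaled, y_train)

# Predict and evaluate on the held-out test set.
predictions = model.predict(X_test_scaled)
accuracy = accuracy_score(y_test, predictions)
print(f"Test accuracy: {accuracy:.3f}")
```

Each step is wired together by hand here, which is exactly the kind of bookkeeping that pipelines are meant to reduce.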

Key Points to Remember

Before moving on, keep these points in mind:

  1. All features are numeric - No categorical encoding needed for this dataset
  2. No missing values - The dataset is clean and complete
  3. Reasonably balanced classes - The three wine classes have 59, 71, and 48 samples
  4. 13 features - We’ll use all of them for prediction
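
These points can be checked directly rather than taken on faith. The snippet below is a small sketch, again assuming `as_frame=True` is available; `all_numeric`, `missing_count`, and `class_counts` are names introduced here for illustration.

```python
import numpy as np
from sklearn.datasets import load_wine

wine = load_wine(as_frame=True)
X, y = wine.data, wine.target  # DataFrame of features, Series of labels

# 1. Every feature column should have a numeric dtype.
all_numeric = all(np.issubdtype(dtype, np.number) for dtype in X.dtypes)

# 2. Total number of missing values across all features.
missing_count = int(X.isna().sum().sum())

# 3. Samples per class, ordered by class label.
class_counts = y.value_counts().sort_index()

print("All features numeric:", all_numeric)
print("Missing values:", missing_count)
print("Samples per class:", class_counts.to_dict())
print("Number of features:", X.shape[1])
```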

What’s Next?

On the next page, we’ll build a quick baseline model without using pipelines. This gives us something to compare against when we add preprocessing and pipeline structure later.
