Intermediate 25 min

Install and Import Libraries

First, let’s install the packages we need:

pip install scikit-learn pandas matplotlib numpy

Now let’s import everything we’ll use:

🐍 Python Import Libraries
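A minimal sketch of the import cell, assuming only the four packages installed above are needed on this page. The exact set of imports and the aliases (`np`, `pd`, `plt`) are conventional choices, not something this page confirms.

```python
# Numerical arrays
import numpy as np
# Tabular data (DataFrame / Series)
import pandas as pd
# Plotting
import matplotlib.pyplot as plt
# scikit-learn itself, plus the loader for the dataset used on this page
import sklearn
from sklearn.datasets import load_breast_cancer

# Printing versions is a quick way to confirm the install worked
print("numpy:", np.__version__)
print("pandas:", pd.__version__)
print("scikit-learn:", sklearn.__version__)
```

If any of these imports fails, re-run the `pip install` command above in the same environment your Python interpreter uses.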

Load the Dataset

We’ll use the Breast Cancer Wisconsin dataset. It’s a classic binary classification problem: predicting whether a tumor is malignant (cancerous) or benign (not cancerous).

🐍 Python Load Dataset
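One way to load the data is scikit-learn's built-in `load_breast_cancer` loader, sketched below. Wrapping the arrays in pandas objects is an assumption made for readability. The names `X` and `y` match the ones used later on this page, while `data` is just an illustrative name.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

# Load the bundled Breast Cancer Wisconsin dataset
data = load_breast_cancer()

# Features as a DataFrame with named columns, target as a Series
X = pd.DataFrame(data.data, columns=data.feature_names)
y = pd.Series(data.target, name="target")

print("Features shape:", X.shape)                 # (569, 30)
print("Target shape:", y.shape)                   # (569,)
print("Class names:", list(data.target_names))    # malignant, benign
```

Each row is one tumor sample, and each of the 30 columns is a numeric measurement of it.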

Explore the Dataset

Let’s check the class distribution and get a feel for the data:

🐍 Python Explore Dataset
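A sketch of a first look at the data, assuming the same `X` and `y` layout as in the loading step (reloaded here so the snippet runs on its own). The variable name `class_counts` is illustrative.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = pd.Series(data.target, name="target")

# First five rows of the feature table
print(X.head())

# Number of samples per class, labeled with the class names
class_counts = y.value_counts().sort_index()
for label, count in class_counts.items():
    name = data.target_names[label]
    print(f"Class {label} ({name}): {count} samples ({count / len(y):.1%})")
# Class 0 (malignant): 212 samples (37.3%)
# Class 1 (benign): 357 samples (62.7%)
```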

Try it yourself: Change head() to sample(5) in the first code block to see random rows from the dataset.

Understanding the Classes

This dataset has two classes:

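  • Class 0 (Malignant): Cancerous tumors - these are dangerous
  • Class 1 (Benign): Non-cancerous tumors - these are safe

Note the encoding: scikit-learn labels malignant as 0 and benign as 1, the reverse of the common "1 = positive case" convention. Keep this in mind when reading metrics later.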

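The classes are somewhat balanced, but not perfectly: 357 of the 569 samples (about 63%) are benign and 212 (about 37%) are malignant. This matters for evaluation:

  • If 99% of samples belonged to one class, a model that always predicted that class would reach 99% accuracy while learning nothing
  • Even with roughly balanced classes, we need to look at more than accuracy
  • In medical problems, false negatives (missing a cancer) are usually more costly than false positives (a false alarm)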
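To make the first point concrete, here is a small sketch using made-up labels, not the breast cancer data. It assumes a hypothetical encoding where 1 is the rare positive case and 0 is the common negative case, and a "model" that always predicts the majority class.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical labels for illustration only: 10 positives, 990 negatives
y_true = np.array([1] * 10 + [0] * 990)

# A useless "model" that always predicts the majority class (0)
y_pred = np.zeros_like(y_true)

accuracy = accuracy_score(y_true, y_pred)
recall = recall_score(y_true, y_pred, pos_label=1)

# False negatives: positive cases predicted as negative
false_negatives = int(((y_true == 1) & (y_pred == 0)).sum())

print(f"Accuracy: {accuracy:.2f}")             # Accuracy: 0.99
print(f"Recall: {recall:.2f}")                 # Recall: 0.00
print(f"False negatives: {false_negatives}")   # False negatives: 10
```

The accuracy looks excellent, yet every positive case is missed. Recall on the positive class exposes the problem that accuracy hides.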

Quick Data Check

Let’s see some basic statistics:

🐍 Python Data Statistics
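A sketch of a basic statistics check, again reloading the data so it runs standalone. Which summary columns to show is an assumption. `stats` and `missing_total` are illustrative names.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)

# Summary statistics, transposed so each row is one feature
stats = X.describe().T[["mean", "std", "min", "max"]]
print(stats.head(10))

# Total number of missing values across all features
missing_total = int(X.isnull().sum().sum())
print("Missing values:", missing_total)   # Missing values: 0
```

The features sit on very different scales (compare the `mean` column across rows), which becomes relevant for models that are sensitive to feature scaling.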

Why This Dataset?

We chose this dataset because:

  1. Real-world relevance - Medical diagnosis is a real problem
  2. Binary classification - Simple enough to understand, complex enough to be interesting
  3. Class imbalance considerations - Not perfectly balanced, so we’ll see why metrics matter
  4. Feature-rich - 30 features give us plenty to work with
  5. Well-known - You can find lots of resources if you want to learn more

Key Takeaways

Before moving forward:

  1. Data is loaded - We have features (X) and target (y)
  2. Classes are somewhat balanced - But not perfectly, which matters
  3. No missing values - Clean dataset, ready for modeling
  4. 30 features - Plenty of information to work with

What’s Next?

On the next page, we’ll do a simple train/test split and train our first model. You’ll see how to split data correctly and get your first evaluation metrics.