Install and Import Libraries

First, let’s install the packages we need:

pip install scikit-learn pandas matplotlib numpy

Now let’s import everything we’ll use:

🐍 Python Import Libraries
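A minimal sketch of the imports this page relies on, assuming the dataset is loaded through scikit-learn's `load_breast_cancer` helper (an assumption based on the packages installed above; the tutorial's own list may include more):

```python
# Core numerical and tabular tools
import numpy as np
import pandas as pd

# Plotting (only needed if you want to visualise the data)
import matplotlib.pyplot as plt

# Assumed data source: scikit-learn's bundled copy of the dataset
import sklearn
from sklearn.datasets import load_breast_cancer

# Printing versions makes it easier to reproduce results later
print("numpy:", np.__version__)
print("pandas:", pd.__version__)
print("scikit-learn:", sklearn.__version__)
```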


Load the Dataset

We’ll use the Breast Cancer Wisconsin dataset. It’s a classic binary classification problem: predicting whether a tumor is malignant (cancerous) or benign (not cancerous).

🐍 Python Load Dataset
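One way to load the data, sketched under the assumption that scikit-learn's bundled copy is used. The names `X` and `y` follow the Key Takeaways below; wrapping the arrays in pandas objects is an illustrative choice, not necessarily what the original cell does:

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

# Load scikit-learn's bundled copy of the dataset
data = load_breast_cancer()

# Features as a DataFrame with named columns, target as a Series
X = pd.DataFrame(data.data, columns=data.feature_names)
y = pd.Series(data.target, name="target")

print("Feature matrix shape:", X.shape)                # (569, 30)
print("Target names:", data.target_names.tolist())     # ['malignant', 'benign']
```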


Explore the Dataset

Let’s check the class distribution and get a feel for the data:

🐍 Python Explore Dataset
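A short sketch of this exploration step, again assuming scikit-learn's copy of the data. It prints the first rows with `head()` and then counts how many samples fall into each class:

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = pd.Series(data.target, name="target")

# First five rows of the feature table
print(X.head())

# Class counts, relabelled with readable names (0 = malignant, 1 = benign)
counts = y.value_counts().sort_index()
counts.index = [data.target_names[i] for i in counts.index]
print(counts)          # malignant: 212, benign: 357

# Same information as proportions
proportions = (counts / counts.sum()).round(3)
print(proportions)     # malignant: 0.373, benign: 0.627
```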


Try it yourself: Change head() to sample(5) in the exploration code to see five random rows from the dataset instead of the first five.

Understanding the Classes

This dataset has two classes:

Class 0 (Malignant): Cancerous tumors - these can spread and are dangerous
Class 1 (Benign): Non-cancerous tumors - these are generally not life-threatening

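Notice the classes are only moderately balanced: in scikit-learn's copy of the dataset, 212 of the 569 samples are malignant (about 37%) and 357 are benign (about 63%). This matters for evaluation: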

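If one class made up 99% of the data, accuracy would be misleading: a model that always predicts the majority class would score 99% without learning anything
Even with roughly balanced classes, we need to look at more than accuracy
In medical problems, false negatives (missing a cancer) are usually worse than false positives (a false alarm)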
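The 99% case can be illustrated with a tiny synthetic example. The labels below are made up for illustration (they are not drawn from the breast cancer data), and the "model" simply predicts the majority class every time:

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical labels: 1 = has the disease (10 cases), 0 = healthy (990 cases)
y_true = np.array([1] * 10 + [0] * 990)

# A "model" that always predicts the majority class (healthy)
y_pred = np.zeros_like(y_true)

acc = accuracy_score(y_true, y_pred)
rec = recall_score(y_true, y_pred, pos_label=1)

print(f"Accuracy: {acc:.2f}")              # 0.99
print(f"Recall on sick cases: {rec:.2f}")  # 0.00, every sick case is missed
```

Accuracy looks excellent while every single positive case is a false negative. Note that in the breast cancer dataset the malignant class is label 0, so to measure how many cancers are caught you would likely pass `pos_label=0` to `recall_score` rather than rely on the default of 1.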

Quick Data Check

Let’s see some basic statistics:

🐍 Python Data Statistics
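A possible version of this check, assuming the same scikit-learn loader as before. It summarises each feature with `describe()` and counts missing values across the whole table:

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)

# Per-feature summary: one row per feature, a few key statistics as columns
summary = X.describe().T[["mean", "std", "min", "max"]]
print(summary.head())

# Total number of missing values in the feature table
missing = int(X.isnull().sum().sum())
print("Missing values:", missing)  # 0
```

The features sit on very different scales (compare the `mean` column for area-based and smoothness-based features), which is worth remembering for models that are sensitive to feature scaling.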


Why This Dataset?

We chose this dataset because:

Real-world relevance - Medical diagnosis is a real problem
Binary classification - Simple enough to understand, complex enough to be interesting
Class imbalance considerations - Not perfectly balanced, so we’ll see why metrics matter
Feature-rich - 30 features give us plenty to work with
Well-known - You can find lots of resources if you want to learn more

Key Takeaways

Before moving forward:

Data is loaded - We have features (X) and target (y)
Classes are somewhat balanced - But not perfectly, which matters
No missing values - Clean dataset, ready for modeling
30 features - Plenty of information to work with

What’s Next?

On the next page, we’ll do a simple train/test split and train our first model. You’ll see how to split data correctly and get your first evaluation metrics.
