Install and Import Libraries
First, let’s install the packages we need:
pip install scikit-learn pandas matplotlib numpy
Now let’s import everything we’ll use:
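The original import list is not shown on this page. The block below is a minimal sketch that assumes only the four packages installed above are needed at this stage; later pages may add more.

```python
# Core numerical and tabular tools
import numpy as np
import pandas as pd

# Plotting (used for quick visual checks)
import matplotlib.pyplot as plt

# The dataset loader that ships with scikit-learn
from sklearn.datasets import load_breast_cancer
```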
Load the Dataset
We’ll use the Breast Cancer Wisconsin dataset. It’s a classic binary classification problem: predicting whether a tumor is malignant (cancerous) or benign (not cancerous).
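One way to load the data is through scikit-learn's built-in `load_breast_cancer` loader. The sketch below assumes the features go into a DataFrame named `X` and the labels into a Series named `y`; the exact loading code used by this tutorial may differ.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

# Load the bundled copy of the dataset (no download required)
data = load_breast_cancer()

# Features as a DataFrame with named columns, target as a Series
X = pd.DataFrame(data.data, columns=data.feature_names)
y = pd.Series(data.target, name="target")

print("Feature matrix shape:", X.shape)
print("Class names:", list(data.target_names))
```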
Explore the Dataset
Let’s check the class distribution and get a feel for the data:
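A possible version of this exploration step is sketched below. It reloads the data so the block runs on its own; the variable name `class_counts` is illustrative, not taken from the original code.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = pd.Series(data.target, name="target")

# First five rows of the feature table
print(X.head())

# Samples per class, labeled with the dataset's own class names
class_counts = y.value_counts().sort_index()
class_counts.index = [str(data.target_names[i]) for i in class_counts.index]
print(class_counts)

# The same counts as fractions of the whole dataset
print((class_counts / len(y)).round(3))
```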
Try it yourself: Change head() to sample(5) in the first code block to see random rows from the dataset.
Understanding the Classes
This dataset has two classes:
- Class 0 (Malignant): Cancerous tumors - these are dangerous
- Class 1 (Benign): Non-cancerous tumors - these are safe
Notice that the classes are only moderately imbalanced: about 37% of the samples are malignant (212) and 63% are benign (357). This matters for evaluation:
- If classes were 99% one class, accuracy would be misleading
- Even with balanced classes, we need to look at more than accuracy
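- In medical problems, false negatives (missing a cancer) are usually worse than false positives (a false alarm). Note that malignant is encoded as class 0 here, so scikit-learn's default positive label (1) is benign; metrics such as recall need pos_label=0 to measure missed cancers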
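To make the first point concrete, here is a small sketch using made-up labels (not the breast cancer data): 1% of the samples belong to the rare class 0, and a "model" that always predicts the majority class still scores 99% accuracy while catching none of the rare cases.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Synthetic labels for illustration only: 10 rare cases (0), 990 common cases (1)
y_true = np.array([0] * 10 + [1] * 990)

# A trivial "model" that always predicts the majority class
y_pred = np.ones_like(y_true)

accuracy = accuracy_score(y_true, y_pred)

# Recall for the rare class: what fraction of class-0 samples were found?
rare_recall = recall_score(y_true, y_pred, pos_label=0)

print(f"Accuracy: {accuracy:.2f}")
print(f"Recall for rare class: {rare_recall:.2f}")
```

Accuracy looks excellent here, yet every rare case is missed, which is why recall for the class you care about has to be checked separately.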
Quick Data Check
Let’s see some basic statistics:
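A quick check along these lines might look like the sketch below, which summarizes each feature and counts missing values. The names `summary` and `n_missing` are illustrative.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)

# Per-feature summary: one row per feature, a few key statistics as columns
summary = X.describe().T[["mean", "std", "min", "max"]]
print(summary.head(10))

# Total number of missing entries across all features
n_missing = int(X.isna().sum().sum())
print("Missing values:", n_missing)
```

The features sit on very different scales (compare the area columns with the smoothness columns), which is worth remembering for models that are sensitive to feature scaling.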
Why This Dataset?
We chose this dataset because:
- Real-world relevance - Medical diagnosis is a real problem
- Binary classification - Simple enough to understand, complex enough to be interesting
- Class imbalance considerations - Not perfectly balanced, so we’ll see why metrics matter
- Feature-rich - 30 features give us plenty to work with
- Well-known - You can find lots of resources if you want to learn more
Key Takeaways
Before moving forward:
- Data is loaded - We have features (X) and target (y)
- Classes are moderately imbalanced - Roughly 37% malignant and 63% benign, enough that accuracy alone can mislead
- No missing values - Clean dataset, ready for modeling
- 30 features - Plenty of information to work with
What’s Next?
On the next page, we’ll do a simple train/test split and train our first model. You’ll see how to split data correctly and get your first evaluation metrics.