Walkthroughs & Exercises — Machine Learning for Data Analytics with Python

Interactive, in-browser edition — fill in the code as we go. Nothing to install.

Author

Dr. Chester Ismay

How to use this page

Everything runs in your browser — there’s no Python or Jupyter to install.

Click Run Code on the import cell and the data-loading cells near the top first.
Then work through the page top to bottom, filling in each cell as we live-code together.
Your edits are saved automatically in this browser. Use Start Over on a cell to reset it.
The very first run takes a few seconds while Python loads in the background.

Stuck? Open the completed solutions in another tab.

Intro: Getting Started with Machine Learning for Data-Driven Decisions

Walkthrough: Setting Up the Python Environment for ML

If you haven’t already installed Python, Jupyter, and the necessary packages, follow the instructions in the README on the course repo [here](https://github.com/ismayc/oreilly-ml-for-data-analytics-with-python/blob/main/README.md).

You can also install the packages directly in a Jupyter notebook with

Run the following code to check that each of the needed packages is installed. If you get an error, you may need to install the package(s) again.
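A minimal sketch of such a check, assuming the package list below (the names are a guess at what the course uses, so adjust them to match the README). It tries to import each package and reports the version, or flags the package as missing.

```python
import importlib

# Assumed package list (not taken from the course materials); edit as needed.
packages = ["numpy", "pandas", "sklearn", "seaborn", "matplotlib", "mlxtend"]


def check_packages(names):
    """Map each package name to its version string, or None if it cannot be imported."""
    status = {}
    for name in names:
        try:
            module = importlib.import_module(name)
            status[name] = getattr(module, "__version__", "unknown")
        except ImportError:
            status[name] = None
    return status


status = check_packages(packages)
for name, version in status.items():
    print(f"{name}: {version if version else 'NOT INSTALLED'}")
```

Any package reported as NOT INSTALLED needs to be installed before the later cells will run.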

Exercise: Setting Up the Python Environment

By completing this exercise, you will be able to

Import necessary Python packages
Check for successful package loading
Load datasets into Python

Follow the instructions in the Walkthrough above to check that the necessary packages are installed correctly.

Module 1: Data Understanding and Preprocessing for Machine Learning

Walkthrough 1.1: Exploring and Preprocessing Data with Pandas & Seaborn

Inspect a dataset using Pandas

Handle missing values and clean data

Create visualizations to identify key business trends

Exercise 1.1: Exploring and Preprocessing Data with Pandas & Seaborn

Inspect a dataset using Pandas

Handle missing values and clean data

Common Pitfall: The Income column has missing values. If you skip this step and try to build models later, you’ll get errors. Always run isnull().sum() after cleaning to confirm that every count is zero.
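As a rough illustration of that check, here is a sketch on a tiny made-up DataFrame (only the Income and Response column names come from this page; the values are invented). Filling with the median is one common choice, not necessarily the one used in the live session.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the course data; the values are invented.
df = pd.DataFrame({
    "Income": [52000.0, np.nan, 61000.0, 48000.0, np.nan],
    "Response": [0, 1, 0, 0, 1],
})

# Count missing values per column before cleaning.
print(df.isnull().sum())

# One option: fill missing incomes with the median, which is robust to outliers.
median_income = df["Income"].median()
df["Income"] = df["Income"].fillna(median_income)

# Verify that every column now reports zero missing values.
missing_after = df.isnull().sum()
print(missing_after)
```

Dropping the affected rows with dropna() is the other common option; which is better depends on how many rows are missing and why.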

Create visualizations to identify key business trends

Interpretation Questions

Looking at the violin plot of Income by Response, which group (responders or non-responders) shows more variability in income?
Based on your scatter plot, do higher-income customers tend to spend more? Is this relationship strong or weak?
Which education level shows the highest response rate? What marketing implications might this have?

Self-Check

By the end of this module, you should be able to:

Load a dataset and inspect its structure with .info() and .describe()
Identify and handle missing values before modeling
Create at least 3 types of visualizations (histogram, scatter, violin/box)
Articulate one business insight from your exploratory analysis

Module 2: Supervised Learning for Business Decisions

Walkthrough 2.1: Build a Regression Model for Pricing Optimization

Split the data into training and validation sets

Common Pitfall: Fitting StandardScaler on the full dataset before the train/test split leaks information from the validation set into training. In a strict workflow, fit the scaler on the training data only (or wrap it in a Pipeline) so the validation set stays truly unseen.
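One way to keep the scaler honest is to put it inside a scikit-learn Pipeline, so that calling fit on the training data fits the scaler and the model together. The sketch below uses synthetic data (not the course dataset) purely to show the mechanics.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic features and target, invented for illustration.
rng = np.random.default_rng(42)
X = rng.normal(loc=50, scale=10, size=(200, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=1.0, size=200)

# Split first, so the validation rows never influence the scaler.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# The pipeline fits StandardScaler on X_train only, then fits the regression.
model = make_pipeline(StandardScaler(), LinearRegression())
model.fit(X_train, y_train)

# Scoring applies the training-set scaling to the validation rows.
r2_val = model.score(X_val, y_val)
scaler = model.named_steps["standardscaler"]
print(f"Validation R-squared: {r2_val:.3f}")
```

The scaler's stored mean matches the training-set mean, not the full-data mean, which is exactly the property the pitfall is about.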

Train a linear regression model

Evaluate model performance on the validation set

Exercise 2.1: Build a Regression Model for Pricing Optimization

Split the data into training and validation sets

Common Pitfall: Linear regression still fits without feature scaling, but the coefficients are then in each feature’s own units and are hard to compare with one another. Scale the features whenever you use coefficients to compare feature importance.
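To see why units matter, the sketch below fits the same regression on raw and on standardized features. The data are synthetic and the spending target is hypothetical; Income and TotalChildren are borrowed only as column names. No validation score is computed here, so scaling the full array is acceptable for this illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data: income in dollars (large range), children as a small count.
rng = np.random.default_rng(42)
income = rng.normal(60000, 15000, 300)
children = rng.integers(0, 4, 300).astype(float)
X = np.column_stack([income, children])

# Hypothetical target built from both features plus noise.
spending = 0.01 * income + 100.0 * children + rng.normal(0, 20, 300)

# Raw coefficients are "per dollar" and "per child", so they are not comparable.
raw_coefs = LinearRegression().fit(X, spending).coef_

# Standardized coefficients are "per standard deviation" for every feature.
X_scaled = StandardScaler().fit_transform(X)
scaled_coefs = LinearRegression().fit(X_scaled, spending).coef_

print("raw:   ", raw_coefs)
print("scaled:", scaled_coefs)
```

On the raw scale the income coefficient looks tiny next to the children coefficient; after standardizing, income has the larger coefficient in this made-up example, because a one-standard-deviation change in income moves spending more.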

Train a linear regression model

Evaluate model performance on the validation set

Interpretation Questions

Is your R-squared higher or lower than the telco churn model? What might explain the difference?
If MAE is $200, what does that mean in practical terms for predicting customer spending?
Would you trust this model for making budget decisions? Why or why not?

Walkthrough 2.2: Implement a Classification Model for Customer Churn

Split the data into training and validation sets

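Scaling matters less for tree-based models: each split compares a single feature against a threshold, so rescaling a feature does not change which splits are possible.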

Train a Random Forest classification model

Evaluate model performance on the validation set

Quick Reference: When to Use Which Metric

Situation	Metric	Why
Predicting continuous values	R-squared, MAE	Variance explained and average size of errors
Balanced classes	Accuracy	Overall correctness
Cost of false positives is high	Precision	Minimize wrong positive predictions
Cost of false negatives is high	Recall	Catch as many actual positives as possible
Need balance	F1-Score	Harmonic mean of precision/recall
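The classification rows of the table can be checked on a small hand-made example. The labels and predictions below are invented so the counts are easy to verify by eye: 2 true positives, 1 false negative, 1 false positive, and 6 true negatives.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
)

# Invented labels: 3 actual positives, 7 actual negatives.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
# Invented predictions: one positive missed, one negative wrongly flagged.
y_pred = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]

# Rows are actual classes, columns are predicted classes.
print(confusion_matrix(y_true, y_pred))

accuracy = accuracy_score(y_true, y_pred)     # (2 + 6) / 10
precision = precision_score(y_true, y_pred)   # 2 / (2 + 1)
recall = recall_score(y_true, y_pred)         # 2 / (2 + 1)
f1 = f1_score(y_true, y_pred)                 # harmonic mean of the two

print(f"accuracy={accuracy:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} f1={f1:.2f}")
```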

Exercise 2.2: Implement a Classification Model for Customer Churn

Split the data into training and validation sets

Common Pitfall: The Response column is imbalanced (many more 0s than 1s). This is why accuracy alone can be misleading. A model predicting “No Response” every time would still get ~85% accuracy!
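A quick way to see this is a baseline that always predicts the majority class. The sketch below uses invented labels with an 85/15 split to mirror the figure above; it is not the course data.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Invented labels: 85 non-responders (0) and 15 responders (1).
y = np.array([0] * 85 + [1] * 15)
X = np.zeros((len(y), 1))  # the dummy model ignores the features

# Always predict the most frequent class, i.e. "No Response".
baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
y_pred = baseline.predict(X)

baseline_accuracy = accuracy_score(y, y_pred)
baseline_recall = recall_score(y, y_pred)
print(f"accuracy={baseline_accuracy:.2f} recall={baseline_recall:.2f}")
```

The baseline reaches 85% accuracy while finding none of the responders, so any real model should be judged against it on recall and precision as well.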

Train a Random Forest classification model

Evaluate model performance on the validation set

Interpretation Questions

Compare your precision and recall. Which is higher, and what does that imply about the model’s tendencies?
Looking at your confusion matrix, is the model better at identifying responders or non-responders?
If running a marketing campaign costs $50 per contact, which metric matters more: precision or recall?

Self-Check

By the end of this module, you should be able to:

Explain the difference between regression and classification
Split data into training and validation sets with a fixed random_state
Interpret R-squared, MAE, accuracy, precision, and recall
Explain why we don’t evaluate a model on its training data
Recognize when accuracy is misleading on imbalanced classes

Module 3: Unsupervised Learning and Pattern Discovery in Business

Walkthrough 3.1: Exploring K-Means Clustering for Customer Segmentation

Apply K-Means clustering to segment customers

Common Pitfall: K-means uses Euclidean distance, so an unscaled feature with a large range (like MonthlyCharges) will dominate the clusters. Always standardize features before clustering, and remember the cluster labels are arbitrary integers with no inherent order.
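A minimal sketch of scale-then-cluster, using synthetic customers rather than the course data. The first column is a hypothetical small-range feature (tenure in years); the second stands in for MonthlyCharges with a much larger range.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Synthetic customers: three tenure groups, charges spread widely.
tenure = np.concatenate([
    rng.normal(1, 0.3, 50),
    rng.normal(3, 0.3, 50),
    rng.normal(5, 0.3, 50),
])
charges = rng.normal(70, 25, 150)
X = np.column_stack([tenure, charges])

# Standardize so both features contribute comparably to Euclidean distance.
X_scaled = StandardScaler().fit_transform(X)

# Fixed random_state keeps the cluster assignment reproducible.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X_scaled)

# Labels are arbitrary integers 0..k-1; their order carries no meaning.
print(np.bincount(labels))
```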

Determine the optimal number of clusters using the Elbow Method

Verify using the silhouette score (optional but recommended)

Fit K-means and assign cluster labels to each customer

Visualize customer segments using a 2D plot

Exercise 3.1: Exploring K-Means Clustering for Customer Segmentation

Apply K-Means clustering to segment customers

Determine the optimal number of clusters using the Elbow Method

Verify using the silhouette score (optional but recommended)

Choosing Your Optimal k

The “right” answer here is somewhat subjective. Look for:

Where the elbow curve bends (typically k=3 to 5 for this data)
The highest silhouette score
If they disagree, silhouette score is often more reliable

Pick a k and justify your choice. There’s no single correct answer.
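The two checks can be run side by side in one loop. The sketch below uses synthetic blobs from make_blobs, not the marketing data, and the range of k values is an arbitrary choice for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic data with four generated groups (illustration only).
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.8, random_state=42)
X_scaled = StandardScaler().fit_transform(X)

inertias = {}
silhouettes = {}
for k in range(2, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X_scaled)
    inertias[k] = km.inertia_  # within-cluster sum of squares (elbow curve)
    silhouettes[k] = silhouette_score(X_scaled, km.labels_)

# Candidate k with the highest silhouette score.
best_k = max(silhouettes, key=silhouettes.get)

for k in inertias:
    print(f"k={k}  inertia={inertias[k]:.1f}  silhouette={silhouettes[k]:.3f}")
print("highest silhouette at k =", best_k)
```

Plotting inertias against k gives the elbow curve; the silhouette values give a second opinion when the bend is hard to read.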

Fit K-means and assign cluster labels to each customer

Visualize customer segments using a 2D plot

Interpretation Questions

How did you decide on the optimal k? Did the elbow method and silhouette scores agree?
Looking at your 2D visualization, do the clusters seem well-separated or do they overlap?
Can you describe what “type” of customer each cluster might represent based on their TotalChildren and TotalSpent values?

Walkthrough 3.2: Market Basket Analysis with Apriori Algorithm

Prepare transactional data (services as items)

Common Pitfall: Setting min_support too low floods you with rules (many spurious), while setting it too high can return no itemsets at all. Also, a rule can have high confidence simply because the consequent is popular – always check lift (> 1) to confirm a real association rather than a coincidence.

Apply the Apriori algorithm to identify frequent itemsets

Generate association rules from frequent itemsets

Interpret insights

Key Metrics:

Support: How often items appear together (0.25 = 25% of customers)
Confidence: If A, how likely B? (0.8 = 80% chance)
Lift: How much more likely than random? (>1 = positive association)

Example interpretation: If a rule shows {PhoneService} -> {InternetService} with confidence=0.75 and lift=1.3: “Customers with phone service have a 75% chance of also having internet service, which is 30% more likely than if the two services were independent.”
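The three metrics can also be computed by hand, which helps when reading rule tables. The ten customers below are invented so the arithmetic is easy to follow (they give a lift of 1.25, not the 1.3 in the example above), and this plain pandas calculation is only a sketch of the definitions, not the method used in the walkthrough.

```python
import pandas as pd

# Invented one-hot basket: 8 of 10 customers have phone service,
# 6 of 10 have internet service, and all 6 of those also have phone.
basket = pd.DataFrame({
    "PhoneService": [True] * 8 + [False] * 2,
    "InternetService": [True] * 6 + [False] * 4,
})

# Support: share of customers with the item(s).
support_a = basket["PhoneService"].mean()
support_b = basket["InternetService"].mean()
support_ab = (basket["PhoneService"] & basket["InternetService"]).mean()

# Confidence of {PhoneService} -> {InternetService}: P(B | A).
confidence = support_ab / support_a

# Lift: confidence relative to how common B is overall.
lift = confidence / support_b

print(f"support={support_ab:.2f} confidence={confidence:.2f} lift={lift:.2f}")
# prints: support=0.60 confidence=0.75 lift=1.25
```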

Exercise 3.2: Market Basket Analysis with Apriori Algorithm

Prepare transactional data (product categories as items)

Note on Thresholds

The min_support=0.2 and min_threshold=0.6 are starting points. If you get:

Too few rules: lower the thresholds
Too many rules: raise them
All trivial rules: look for higher lift values

Feel free to experiment with different values.

Apply the Apriori algorithm to identify frequent itemsets

Generate association rules from frequent itemsets

Interpretation Questions

Which product category appears in the most frequent itemsets? What does this suggest about purchasing patterns?
Find a rule with high confidence but low lift. Why might this rule be less useful despite high confidence?
Identify one actionable insight: what product bundle would you recommend based on these rules?

Self-Check

By the end of this module, you should be able to:

Explain the difference between supervised and unsupervised learning
Scale features before clustering and explain why it matters
Use the elbow method and silhouette score to choose k
Interpret support, confidence, and lift in association rules
Describe a business application for clustering and market basket analysis

Module 4: Implementing and Evaluating ML Models

Walkthrough 4.1: Exploring Cross-Validation for Model Evaluation

Split data into training and validation sets

Common Pitfall: A score from a single train-test split can be lucky or unlucky depending on which rows land where. Cross-validation averages over several folds for a more honest estimate. If the folds are shuffled, forgetting random_state makes them (and your results) non-reproducible.

Train a classification model using logistic regression

Apply k-fold cross-validation to evaluate model performance

Compare metrics across folds

Interpretation

Accuracy -> Overall correctness of predictions

Precision -> How many predicted churns were actual churns

Recall -> How many actual churns were correctly identified

F1-Score -> Balances precision & recall

Cross-validation does not change the model itself. It gives a more reliable estimate of how well the model generalizes to unseen data, because the reported score no longer depends on a single, possibly lucky, split.
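A sketch of k-fold evaluation with several metrics at once, using scikit-learn's cross_validate on synthetic data from make_classification (not the churn data). The five folds and the class balance are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, mildly imbalanced binary problem (illustration only).
X, y = make_classification(
    n_samples=500, n_features=6, weights=[0.75, 0.25], random_state=42
)

# Scaling lives inside the pipeline, so it is refit on each training fold.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Stratified folds keep the class ratio similar in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
metrics = ["accuracy", "precision", "recall", "f1"]
results = cross_validate(model, X, y, cv=cv, scoring=metrics)

# Mean and standard deviation across folds for each metric.
summary = {
    m: (results[f"test_{m}"].mean(), results[f"test_{m}"].std())
    for m in metrics
}
for m, (mean, std) in summary.items():
    print(f"{m:>9}: {mean:.3f} +/- {std:.3f}")
```

A large standard deviation for a metric is a sign that a single split would have given an unreliable picture of it.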

Exercise 4.1: Exploring Cross-Validation for Model Evaluation

Split data into training and validation sets

Train a classification model using logistic regression

Apply k-fold cross-validation to evaluate model performance

Compare metrics across folds

Interpretation

Accuracy -> Overall correctness of predictions

Precision -> How many predicted responders were actual responders

Recall -> How many actual responders were correctly identified

F1-Score -> Harmonic mean of precision and recall

Interpretation Questions

How consistent are your metrics across folds? (Look at the standard deviation values.)
Which metric shows the most variability? What might cause this?
Compare your cross-validation results to the single train-test split in Exercise 2.2. Are they similar?

Walkthrough 4.2: Hyperparameter Tuning with Grid Search

Train a Random Forest classifier

Common Pitfall: Tuning hyperparameters and then evaluating on the same data leaks the answer and overstates performance. The honest grid-search workflow selects parameters using cross-validation on the training set, then reports the final score on a held-out test set the search never saw.
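The workflow described above might look like the sketch below. The data are synthetic and the parameter grid is deliberately tiny and arbitrary; it only shows where the held-out test set enters.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic binary classification data (illustration only).
X, y = make_classification(n_samples=400, n_features=8, random_state=42)

# Hold out a test set that the search never sees.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Small, arbitrary grid; a real search would usually cover more values.
param_grid = {"n_estimators": [25, 50], "max_depth": [3, None]}

# Parameters are chosen by 3-fold cross-validation on the training set only.
search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=3,
    scoring="recall",
)
search.fit(X_train, y_train)

# The final score is reported once, on the untouched test set.
test_recall = recall_score(y_test, search.predict(X_test))
print("best params:", search.best_params_)
print(f"cross-validated recall: {search.best_score_:.3f}")
print(f"held-out test recall:   {test_recall:.3f}")
```

The cross-validated score guides the choice of parameters; the test score is the number to report.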

Apply grid search to find optimal hyperparameters

Evaluate model improvement using accuracy and recall

Interpret the best hyperparameter combination

Exercise 4.2: Hyperparameter Tuning with Grid Search

Train a Random Forest classifier

Apply grid search to find optimal hyperparameters

Evaluate model improvement using accuracy and recall

Interpret the best hyperparameter combination

Interpretation Questions

Did hyperparameter tuning improve recall compared to the default model in Exercise 2.2?
Look at the best parameters found. Are they at the edges of your grid (suggesting you should expand the search)?
Was the computational cost of grid search worth the performance improvement?

Self-Check

By the end of this module, you should be able to:

Explain why cross-validation is better than a single train-test split
Interpret the mean and standard deviation of cross-validation scores
Define what hyperparameters are and why we tune them
Use GridSearchCV to find optimal model settings
Explain why tuning and reporting on the same data inflates performance

Bonus Challenge: End-to-End ML Pipeline

If you finish early or want additional practice, try this integration challenge using the marketing_campaign data:

Goal: Build the best model to predict Response using everything you’ve learned.

Feature Engineering: Create at least one new feature beyond TotalChildren and TotalSpent (e.g., spending per child, years as customer from Dt_Customer)
Model Comparison: Train both Logistic Regression and Random Forest, use cross-validation to compare them fairly
Optimization: Use GridSearchCV on your better-performing model
Interpretation: Write 2-3 sentences explaining which model you’d recommend and why

This is open-ended. There’s no single right answer. The goal is to practice the full workflow.