Walkthroughs & Exercises — Machine Learning for Data Analytics with Python

Interactive, in-browser edition — fill in the code as we go. Nothing to install.

Author

Dr. Chester Ismay

Tip: How to use this page

Everything runs in your browser — there’s no Python or Jupyter to install.

  1. First, click Run Code on the import cell and the data-loading cells near the top.
  2. Then work through the page top to bottom, filling in each cell as we live-code together.
  3. Your edits are saved automatically in this browser. Use Start Over on a cell to reset it.
  4. The very first run takes a few seconds while Python loads in the background.

Stuck? Open the completed solutions in another tab.

Intro: Getting Started with Machine Learning for Data-Driven Decisions

Walkthrough: Setting Up the Python Environment for ML

If you haven’t already installed Python, Jupyter, and the necessary packages, follow the instructions in the course repo’s [README](https://github.com/ismayc/oreilly-ml-for-data-analytics-with-python/blob/main/README.md).

You can also install the packages directly from a Jupyter notebook cell (for example, with the %pip install magic command).

If you aren’t able to do this on your machine, you may want to check out Google Colab. It’s a free service that allows you to run Jupyter notebooks in the cloud. Alternatively, I’ve set up some temporary notebooks on Binder here that you can work with online as well.

Run the following code to check that each of the needed packages is installed. If you get an error, you may need to install the package(s) again.
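As a rough sketch of such a check, the snippet below uses only the standard library to see whether each package can be found. The package list is an assumption based on the topics covered on this page (Pandas, Seaborn, scikit-learn, Apriori), and the helper name find_missing is made up for illustration; the course README is the authority on what is actually required.

```python
import importlib.util

# Assumed package list; adjust it to match the course README.
REQUIRED_PACKAGES = ["pandas", "numpy", "seaborn", "matplotlib", "sklearn", "mlxtend"]


def find_missing(packages):
    """Return the names in `packages` that cannot be located for import."""
    return [name for name in packages if importlib.util.find_spec(name) is None]


missing = find_missing(REQUIRED_PACKAGES)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are installed.")
```

Note that scikit-learn is imported as sklearn, so the import name and the name you pass to the installer can differ.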

Exercise: Setting Up the Python Environment

By completing this exercise, you will be able to

  • Import necessary Python packages
  • Check for successful package loading
  • Load datasets into Python

Follow the instructions in the Walkthrough above to check that the necessary packages are installed correctly.


Module 1: Data Understanding and Preprocessing for Machine Learning

Walkthrough 1.1: Exploring and Preprocessing Data with Pandas & Seaborn

Inspect a dataset using Pandas

Handle missing values and clean data

Exercise 1.1: Exploring and Preprocessing Data with Pandas & Seaborn

Inspect a dataset using Pandas

Handle missing values and clean data

Common Pitfall: The Income column has missing values. If you skip this step and try to build models later, you’ll get errors. Always check isnull().sum() after cleaning to confirm that every column reports zero missing values.
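Here is a minimal sketch of that check on a tiny invented DataFrame. The values are made up, and filling with the median is only one possible strategy (dropping the affected rows is another); it is not necessarily the approach taken in the live session.

```python
import numpy as np
import pandas as pd

# Tiny stand-in for the marketing data; the values are invented.
df = pd.DataFrame({
    "Income": [52000.0, np.nan, 61000.0, 48000.0, np.nan],
    "Response": [0, 1, 0, 0, 1],
})

print(df.isnull().sum())          # Income shows 2 missing values

# One possible strategy: fill with the median of the observed incomes.
median_income = df["Income"].median()
df["Income"] = df["Income"].fillna(median_income)

# Verify that the cleaning worked before moving on to modeling.
missing_after = df.isnull().sum()
print(missing_after)
```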

Interpretation Questions

  1. Looking at the violin plot of Income by Response, which group (responders or non-responders) shows more variability in income?
  2. Based on your scatter plot, do higher-income customers tend to spend more? Is this relationship strong or weak?
  3. Which education level shows the highest response rate? What marketing implications might this have?

Self-Check

By the end of this module, you should be able to:


Module 2: Supervised Learning for Business Decisions

Walkthrough 2.1: Build a Regression Model for Pricing Optimization

Split the data into training and validation sets

Common Pitfall: Fitting StandardScaler on the full dataset before the train/test split leaks information from the validation set into training. In a strict workflow, fit the scaler on the training data only (or wrap it in a Pipeline) so the validation set stays truly unseen.
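One way to keep the scaler from seeing validation rows is to split first and let a scikit-learn Pipeline handle the fitting. The sketch below uses synthetic data (the features and coefficients are invented), so treat it as an illustration of the pattern rather than the walkthrough's actual code.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: two features on very different scales.
rng = np.random.default_rng(42)
X = np.column_stack([rng.normal(50, 10, 200), rng.normal(5000, 1500, 200)])
y = 3.0 * X[:, 0] + 0.01 * X[:, 1] + rng.normal(0, 1, 200)

# Split first, so the validation rows are never seen during fitting.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# The pipeline fits the scaler on X_train only, then reuses those
# training means and standard deviations to transform X_val.
model = make_pipeline(StandardScaler(), LinearRegression())
model.fit(X_train, y_train)

scaler = model.named_steps["standardscaler"]
print("Scaler means come from training rows only:", scaler.mean_)
print("Validation R-squared:", model.score(X_val, y_val))
```

Because the scaler lives inside the pipeline, the same object can later be passed to cross-validation and the scaling will be refit inside each fold.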

Train a linear regression model

Evaluate model performance on the validation set


Exercise 2.1: Build a Regression Model for Pricing Optimization

Split the data into training and validation sets

Common Pitfall: Linear regression still works if you forget to scale the features, but the coefficients become harder to interpret because each one is expressed in its feature’s original units. Always scale when you use coefficients to compare feature importance.
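The sketch below illustrates the point with invented data: an income-like feature with a large range and a children-like feature with a small range. The numbers are assumptions chosen for illustration. There is no validation set here, so the scaler is fit on all rows purely to interpret the coefficients.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
income = rng.normal(50000, 15000, 500)              # large-range feature
children = rng.integers(0, 4, 500).astype(float)    # small-range feature
X = np.column_stack([income, children])
y = 0.01 * income - 100 * children + rng.normal(0, 50, 500)

# Same model, fit once on raw features and once on standardized features.
raw = LinearRegression().fit(X, y)
X_scaled = StandardScaler().fit_transform(X)
scaled = LinearRegression().fit(X_scaled, y)

# Raw coefficients are per unit (per dollar vs. per child), so their sizes
# are not comparable. Scaled coefficients are per standard deviation.
print("Raw coefficients:   ", raw.coef_)
print("Scaled coefficients:", scaled.coef_)
```

On the raw scale the children coefficient looks far larger, yet per standard deviation the income feature moves the prediction more. The fitted predictions are the same either way; only the units of the coefficients change.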

Train a linear regression model

Evaluate model performance on the validation set

Interpretation Questions

  1. Is your R-squared higher or lower than the telco churn model? What might explain the difference?
  2. If MAE is $200, what does that mean in practical terms for predicting customer spending?
  3. Would you trust this model for making budget decisions? Why or why not?

Walkthrough 2.2: Implement a Classification Model for Customer Churn

Split the data into training and validation sets

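Scaling is not as important for tree-based models, because each split compares a single feature against a threshold and rescaling that feature does not change the order of its values.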

Train a Random Forest classification model

Evaluate model performance on the validation set

Quick Reference: When to Use Which Metric

| Situation | Metric | Why |
| --- | --- | --- |
| Predicting continuous values | R-squared, MAE | Measures prediction error |
| Balanced classes | Accuracy | Overall correctness |
| Cost of false positives is high | Precision | Minimize wrong positive predictions |
| Cost of false negatives is high | Recall | Catch all actual positives |
| Need balance | F1-Score | Harmonic mean of precision/recall |
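To make the classification metrics concrete, here is a small worked example that computes them by hand from confusion-matrix counts. The counts are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
# Hypothetical confusion-matrix counts for a binary classifier.
tp, fp, fn, tn = 30, 10, 20, 140

accuracy = (tp + tn) / (tp + fp + fn + tn)
precision = tp / (tp + fp)          # of predicted positives, how many were right
recall = tp / (tp + fn)             # of actual positives, how many were caught
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
# accuracy=0.850 precision=0.750 recall=0.600 f1=0.667
```

Accuracy looks healthy at 0.85, yet the model misses 20 of the 50 actual positives, which is what the recall of 0.60 reveals.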

Exercise 2.2: Implement a Classification Model for Customer Churn

Split the data into training and validation sets

Common Pitfall: The Response column is imbalanced (many more 0s than 1s). This is why accuracy alone can be misleading. A model predicting “No Response” every time would still get ~85% accuracy!
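The effect is easy to reproduce with a baseline that always predicts the majority class. The labels below are invented to mimic an 85/15 imbalance; they are not the actual Response column.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Invented labels with the same kind of imbalance: 85 zeros and 15 ones.
y = np.array([0] * 85 + [1] * 15)
X = np.zeros((len(y), 1))            # features are irrelevant to this baseline

baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X, y)
y_pred = baseline.predict(X)

baseline_accuracy = accuracy_score(y, y_pred)
baseline_recall = recall_score(y, y_pred, zero_division=0)
print("Accuracy:", baseline_accuracy)   # 0.85, despite never predicting a 1
print("Recall:", baseline_recall)       # 0.0: no responders are found
```

Any real model should be judged against this kind of baseline, and on metrics such as recall or F1 that the baseline cannot score well on.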

Train a Random Forest classification model

Evaluate model performance on the validation set

Interpretation Questions

  1. Compare your precision and recall. Which is higher, and what does that imply about the model’s tendencies?
  2. Looking at your confusion matrix, is the model better at identifying responders or non-responders?
  3. If running a marketing campaign costs $50 per contact, which metric matters more: precision or recall?

Self-Check

By the end of this module, you should be able to:


Module 3: Unsupervised Learning and Pattern Discovery in Business

Walkthrough 3.1: Exploring K-Means Clustering for Customer Segmentation

Apply K-Means clustering to segment customers

Common Pitfall: K-means uses Euclidean distance, so an unscaled feature with a large range (like MonthlyCharges) will dominate the clusters. Always standardize features before clustering, and remember the cluster labels are arbitrary integers with no inherent order.
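A minimal sketch of standardizing before clustering follows. The two features are synthetic, and the tenure_years name is an assumption introduced for illustration alongside MonthlyCharges.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic customers: a small-range feature and a large-range feature.
rng = np.random.default_rng(0)
tenure_years = rng.uniform(0, 6, 300)
monthly_charges = rng.uniform(20, 120, 300)
X = np.column_stack([tenure_years, monthly_charges])

# Standardize so each feature has mean 0 and standard deviation 1.
X_scaled = StandardScaler().fit_transform(X)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X_scaled)

# Labels are just identifiers: cluster 2 is not "more" than cluster 0.
print(np.unique(labels, return_counts=True))
```

Without the scaling step, distances would be driven almost entirely by the monthly charges column, since its range is roughly twenty times wider.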

Determine the optimal number of clusters using the Elbow Method

Fit K-means and assign cluster labels to each customer

Visualize customer segments using a 2D plot


Exercise 3.1: Exploring K-Means Clustering for Customer Segmentation

Apply K-Means clustering to segment customers

Determine the optimal number of clusters using the Elbow Method

Choosing Your Optimal k

The “right” answer here is somewhat subjective. Look for:

  • Where the elbow curve bends (typically k=3 to 5 for this data)
  • The highest silhouette score
  • If they disagree, silhouette score is often more reliable

Pick a k and justify your choice. There’s no single correct answer.
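The two criteria above can be computed side by side in one loop. This sketch runs on synthetic blobs rather than the exercise data, so the k it suggests says nothing about the right k for your customers; it only shows the mechanics.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic data with 4 well-separated groups, standing in for customers.
X, _ = make_blobs(n_samples=400, centers=4, cluster_std=0.6, random_state=42)
X_scaled = StandardScaler().fit_transform(X)

inertias, silhouettes = {}, {}
for k in range(2, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X_scaled)
    inertias[k] = km.inertia_                       # elbow-curve value
    silhouettes[k] = silhouette_score(X_scaled, km.labels_)

best_k = max(silhouettes, key=silhouettes.get)
for k in inertias:
    print(f"k={k}  inertia={inertias[k]:8.2f}  silhouette={silhouettes[k]:.3f}")
print("k with the highest silhouette score:", best_k)
```

Inertia keeps falling as k grows, which is why you look for the bend rather than the minimum. The silhouette score, by contrast, can be compared directly across values of k.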

Fit K-means and assign cluster labels to each customer

Visualize customer segments using a 2D plot

Interpretation Questions

  1. How did you decide on the optimal k? Did the elbow method and silhouette scores agree?
  2. Looking at your 2D visualization, do the clusters seem well-separated or do they overlap?
  3. Can you describe what “type” of customer each cluster might represent based on their TotalChildren and TotalSpent values?

Walkthrough 3.2: Market Basket Analysis with Apriori Algorithm

Prepare transactional data (services as items)

Common Pitfall: Setting min_support too low floods you with rules (many spurious), while setting it too high can return no itemsets at all. Also, a rule can have high confidence simply because the consequent is popular – always check lift (> 1) to confirm a real association rather than a coincidence.

Apply the Apriori algorithm to identify frequent itemsets

Generate association rules from frequent itemsets

Interpret insights

Key Metrics:

  • Support: How often items appear together (0.25 = 25% of customers)
  • Confidence: If A, how likely B? (0.8 = 80% chance)
  • Lift: How much more likely than random? (>1 = positive association)

Example interpretation: If a rule shows {PhoneService} -> {InternetService} with confidence=0.75 and lift=1.3: “Customers with phone service are 75% likely to also have internet service, and this is 30% more likely than if the purchases were independent.”
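To see where these three numbers come from, the sketch below computes them by hand for one rule. The eight baskets are hypothetical and the support helper is written for illustration; in the walkthrough the Apriori implementation does this counting for you.

```python
# Hypothetical baskets: each set lists the services one customer has.
transactions = [
    {"PhoneService", "InternetService"},
    {"PhoneService", "InternetService"},
    {"PhoneService", "InternetService"},
    {"PhoneService"},
    {"InternetService"},
    {"PhoneService", "InternetService", "StreamingTV"},
    {"StreamingTV"},
    {"PhoneService"},
]


def support(itemset, baskets):
    """Fraction of baskets that contain every item in `itemset`."""
    return sum(itemset <= basket for basket in baskets) / len(baskets)


antecedent, consequent = {"PhoneService"}, {"InternetService"}

support_both = support(antecedent | consequent, transactions)
confidence = support_both / support(antecedent, transactions)
lift = confidence / support(consequent, transactions)

print(f"support={support_both:.3f} confidence={confidence:.3f} lift={lift:.3f}")
# support=0.500 confidence=0.667 lift=1.067
```

Here 4 of 8 baskets contain both services (support 0.5), 4 of the 6 phone customers also have internet (confidence 0.667), and internet appears in 5 of 8 baskets overall (0.625), so the lift is 0.667 / 0.625, or about 1.07: only a slight positive association.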


Exercise 3.2: Market Basket Analysis with Apriori Algorithm

Prepare transactional data (product categories as items)

Note on Thresholds

The values min_support=0.2 and min_threshold=0.6 are starting points. If you get:

  • Too few rules: lower the thresholds
  • Too many rules: raise them
  • All trivial rules: look for higher lift values

Feel free to experiment with different values.

Apply the Apriori algorithm to identify frequent itemsets

Generate association rules from frequent itemsets

Interpretation Questions

  1. Which product category appears in the most frequent itemsets? What does this suggest about purchasing patterns?
  2. Find a rule with high confidence but low lift. Why might this rule be less useful despite high confidence?
  3. Identify one actionable insight: what product bundle would you recommend based on these rules?

Self-Check

By the end of this module, you should be able to:


Module 4: Implementing and Evaluating ML Models

Walkthrough 4.1: Exploring Cross-Validation for Model Evaluation

Split data into training and validation sets

Common Pitfall: Reporting a single train-test split can be lucky or unlucky depending on which rows land where. Cross-validation averages over several folds for a more honest estimate – and forgetting random_state makes those folds (and your results) non-reproducible.
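A reproducible k-fold setup might look like the sketch below. The data is synthetic and imbalanced to loosely resemble a churn problem; the fold count and the list of metrics are assumptions, not the walkthrough's exact settings.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic, imbalanced stand-in for a churn or response dataset.
X, y = make_classification(
    n_samples=500, n_features=6, weights=[0.8, 0.2], random_state=42
)

# shuffle=True plus a fixed random_state keeps the folds reproducible.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scoring = ["accuracy", "precision", "recall", "f1"]

results = cross_validate(
    LogisticRegression(max_iter=1000), X, y, cv=cv, scoring=scoring
)

# Report the mean and spread of each metric across the five folds.
for metric in scoring:
    scores = results[f"test_{metric}"]
    print(f"{metric:9s} mean={scores.mean():.3f} std={scores.std():.3f}")
```

The standard deviation across folds is the part a single split cannot give you: a large spread is a warning that any one split could have been misleading.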

Train a classification model using logistic regression

Apply k-fold cross-validation to evaluate model performance

Compare metrics across folds

Interpretation

Accuracy -> Overall correctness of predictions

Precision -> How many predicted churns were actual churns

Recall -> How many churns were correctly identified

F1-Score -> Balances precision & recall

Cross-validation does not change the model itself. It gives a more reliable estimate of how well the model generalizes to unseen data, because the reported score no longer depends on a single, possibly lucky or unlucky, split.


Exercise 4.1: Exploring Cross-Validation for Model Evaluation

Split data into training and validation sets

Train a classification model using logistic regression

Apply k-fold cross-validation to evaluate model performance

Compare metrics across folds

Interpretation

Accuracy -> Overall correctness of predictions

Precision -> How many predicted responders were actual responders

Recall -> How many actual responders were correctly identified

F1-Score -> Harmonic mean of precision and recall

Interpretation Questions

  1. How consistent are your metrics across folds? (Look at the standard deviation values.)
  2. Which metric shows the most variability? What might cause this?
  3. Compare your cross-validation results to the single train-test split in Exercise 2.2. Are they similar?

Bonus Challenge: End-to-End ML Pipeline

If you finish early or want additional practice, try this integration challenge using the marketing_campaign data:

Goal: Build the best model to predict Response using everything you’ve learned.

  1. Feature Engineering: Create at least one new feature beyond TotalChildren and TotalSpent (e.g., spending per child, years as customer from Dt_Customer)

  2. Model Comparison: Train both Logistic Regression and Random Forest, use cross-validation to compare them fairly

  3. Optimization: Use GridSearchCV on your better-performing model

  4. Interpretation: Write 2-3 sentences explaining which model you’d recommend and why

This is open-ended. There’s no single right answer. The goal is to practice the full workflow.
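If you have not used GridSearchCV before, the sketch below shows its general shape on synthetic data. The parameter grid, the F1 scoring choice, and the fold count are all assumptions picked to keep the example fast; it is a starting template, not a solution to the challenge.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic, imbalanced stand-in for the marketing_campaign features.
X, y = make_classification(
    n_samples=300, n_features=6, weights=[0.85, 0.15], random_state=42
)

cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
param_grid = {"n_estimators": [50, 100], "max_depth": [3, None]}

# Every combination in the grid is scored with cross-validated F1.
search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid=param_grid,
    scoring="f1",
    cv=cv,
)
search.fit(X, y)

print("Best parameters:", search.best_params_)
print("Best cross-validated F1:", round(search.best_score_, 3))
```

Scoring on F1 rather than accuracy is a deliberate choice for an imbalanced target; swap in whichever metric matches the business cost you care about.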