Walkthroughs & Exercises — Machine Learning for Data Analytics with Python
Interactive, in-browser edition: fill in the code as we go.
Everything runs in your browser, so there’s no Python or Jupyter to install.
- Click Run Code on the import cell and the data-loading cells near the top first.
- Then work through the page top to bottom, filling in each cell as we live-code together.
- Your edits are saved automatically in this browser. Use Start Over on a cell to reset it.
- The very first run takes a few seconds while Python loads in the background.
Stuck? Open the completed solutions in another tab.
Intro: Getting Started with Machine Learning for Data-Driven Decisions
Walkthrough: Setting Up the Python Environment for ML
If you haven’t already installed Python, Jupyter, and the necessary packages, the README in the course repo has instructions [here](https://github.com/ismayc/oreilly-ml-for-data-analytics-with-python/blob/main/README.md).
You can also install the packages directly in a Jupyter notebook with
If you aren’t able to do this on your machine, you may want to check out Google Colab. It’s a free service that allows you to run Jupyter notebooks in the cloud. Alternatively, I’ve set up some temporary notebooks on Binder here that you can work with online as well.
Run the following code to check that each of the needed packages is installed. If you get an error, you may need to install the package(s) again.
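A minimal sketch of such a check, assuming the course relies on pandas, NumPy, seaborn, matplotlib, scikit-learn, and mlxtend. That package list is an assumption based on the topics covered below; the README has the authoritative list. The helper name missing_packages is hypothetical.

```python
from importlib.util import find_spec

# Assumed package list (import names); adjust to match the course README.
REQUIRED = ["pandas", "numpy", "seaborn", "matplotlib", "sklearn", "mlxtend"]


def missing_packages(names):
    """Return the import names that cannot be found in this environment."""
    return [name for name in names if find_spec(name) is None]


missing = missing_packages(REQUIRED)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All packages found.")
```

If anything is listed as missing, reinstall that package and run the check again.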
Exercise: Setting Up the Python Environment
By completing this exercise, you will be able to:
- Import necessary Python packages
- Check for successful package loading
- Load datasets into Python
Follow the instructions in the Walkthrough above to check that the necessary packages are installed correctly.
Module 1: Data Understanding and Preprocessing for Machine Learning
Walkthrough 1.1: Exploring and Preprocessing Data with Pandas & Seaborn
Inspect a dataset using Pandas
Handle missing values and clean data
Create visualizations to identify key business trends
Exercise 1.1: Exploring and Preprocessing Data with Pandas & Seaborn
Inspect a dataset using Pandas
Handle missing values and clean data
Common Pitfall: The Income column has missing values. If you skip this step and try to build models later, you’ll get errors. Always check isnull().sum() after cleaning to verify zeros.
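As a rough illustration of that check, here is a sketch on a made-up five-row DataFrame. The values are invented, and filling with the median is one common choice rather than necessarily the approach used in the live session.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the marketing data; values are made up for illustration.
df = pd.DataFrame({
    "Income": [52000.0, np.nan, 61000.0, 48000.0, np.nan],
    "Response": [0, 1, 0, 0, 1],
})

print(df.isnull().sum())  # Income shows 2 missing values

# One common choice: fill missing incomes with the median of observed values.
median_income = df["Income"].median()
df["Income"] = df["Income"].fillna(median_income)

# Verify that no missing values remain in any column.
remaining = df.isnull().sum()
print(remaining)
```

Dropping the affected rows with dropna() is the other usual option; it is simpler but shrinks the dataset.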
Create visualizations to identify key business trends
Interpretation Questions
- Looking at the violin plot of Income by Response, which group (responders or non-responders) shows more variability in income?
- Based on your scatter plot, do higher-income customers tend to spend more? Is this relationship strong or weak?
- Which education level shows the highest response rate? What marketing implications might this have?
Self-Check
By the end of this module, you should be able to:
Module 2: Supervised Learning for Business Decisions
Walkthrough 2.1: Build a Regression Model for Pricing Optimization
Split the data into training and validation sets
Common Pitfall: Fitting StandardScaler on the full dataset before the train/test split leaks information from the validation set into training. In a strict workflow, fit the scaler on the training data only (or wrap it in a Pipeline) so the validation set stays truly unseen.
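The leak-free ordering can be sketched as follows, using synthetic data in place of the course dataset. The feature values and coefficients are invented for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for the course dataset.
rng = np.random.default_rng(42)
X = rng.normal(loc=50, scale=10, size=(200, 3))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=1.0, size=200)

# Split first, so the validation rows never influence the scaler.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# The pipeline fits the scaler on X_train only, then reuses those
# training means and standard deviations when scoring X_val.
model = make_pipeline(StandardScaler(), LinearRegression())
model.fit(X_train, y_train)
r2_val = model.score(X_val, y_val)
print(f"Validation R-squared: {r2_val:.3f}")
```

Because the scaler lives inside the pipeline, the same object can later be passed to cross-validation or grid search without reintroducing the leak.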
Train a linear regression model
Evaluate model performance on the validation set
Exercise 2.1: Build a Regression Model for Pricing Optimization
Split the data into training and validation sets
Common Pitfall: Linear regression still fits if you forget to scale the features, but the coefficients become harder to interpret because each one depends on the units of its feature. Always scale when comparing coefficients as a measure of feature importance.
Train a linear regression model
Evaluate model performance on the validation set
Interpretation Questions
- Is your R-squared higher or lower than the telco churn model? What might explain the difference?
- If MAE is $200, what does that mean in practical terms for predicting customer spending?
- Would you trust this model for making budget decisions? Why or why not?
Walkthrough 2.2: Implement a Classification Model for Customer Churn
Split the data into training and validation sets
Scaling is not as important for tree-based models, because each split compares a single feature against a threshold.
Train a Random Forest classification model
Evaluate model performance on the validation set
Quick Reference: When to Use Which Metric
| Situation | Metric | Why |
|---|---|---|
| Predicting continuous values | R-squared, MAE | Measures prediction error |
| Balanced classes | Accuracy | Overall correctness |
| Cost of false positives is high | Precision | Minimize wrong positive predictions |
| Cost of false negatives is high | Recall | Catch all actual positives |
| Need balance | F1-Score | Harmonic mean of precision/recall |
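To make the classification rows of the table concrete, here is a small sketch that computes the metrics from confusion-matrix counts. The counts are made up, and the function name classification_metrics is hypothetical.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Made-up counts: 30 true positives, 10 false positives,
# 20 false negatives, 140 true negatives.
metrics = classification_metrics(tp=30, fp=10, fn=20, tn=140)
for name, value in metrics.items():
    print(f"{name:<9} {value:.3f}")
```

With these counts accuracy is 0.85 while recall is only 0.60, which is the kind of gap the table is meant to help you notice.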
Exercise 2.2: Implement a Classification Model for Customer Churn
Split the data into training and validation sets
Common Pitfall: The Response column is imbalanced (many more 0s than 1s). This is why accuracy alone can be misleading. A model predicting “No Response” every time would still get ~85% accuracy!
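The arithmetic behind that pitfall can be sketched with made-up labels, assuming an 85/15 split that mirrors the approximate figure quoted above.

```python
# Made-up labels: 85 non-responders (0) and 15 responders (1).
y_true = [0] * 85 + [1] * 15

# A "model" that predicts No Response for everyone.
y_pred = [0] * len(y_true)

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
recall = true_positives / sum(t == 1 for t in y_true)

print(f"Accuracy: {accuracy:.2f}")  # 0.85
print(f"Recall:   {recall:.2f}")    # 0.00
```

The baseline looks accurate yet finds no responders at all, so compare any real model against it on recall and precision as well.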
Train a Random Forest classification model
Evaluate model performance on the validation set
Interpretation Questions
- Compare your precision and recall. Which is higher, and what does that imply about the model’s tendencies?
- Looking at your confusion matrix, is the model better at identifying responders or non-responders?
- If running a marketing campaign costs $50 per contact, which metric matters more: precision or recall?
Self-Check
By the end of this module, you should be able to:
Module 3: Unsupervised Learning and Pattern Discovery in Business
Walkthrough 3.1: Exploring K-Means Clustering for Customer Segmentation
Apply K-Means clustering to segment customers
Common Pitfall: K-means uses Euclidean distance, so an unscaled feature with a large range (like MonthlyCharges) will dominate the clusters. Always standardize features before clustering, and remember the cluster labels are arbitrary integers with no inherent order.
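A sketch of this effect on synthetic data: one feature with a tiny range carries the real group structure, and a MonthlyCharges-like feature is pure noise with a large range. The data, the feature names, and the agreement helper are all invented for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 100  # customers per group

# Synthetic data: the first feature separates two groups but has a tiny range;
# the MonthlyCharges-like feature is pure noise with a large range.
group = np.repeat([0, 1], n)
small_feature = np.where(group == 0, 0.2, 0.8) + rng.normal(0, 0.05, 2 * n)
monthly_charges = rng.uniform(20, 120, 2 * n)
X = np.column_stack([small_feature, monthly_charges])


def agreement(labels, truth):
    """Share of points matching the true groups, ignoring which integer
    K-means happened to assign to which cluster."""
    match = np.mean(labels == truth)
    return max(match, 1 - match)


raw_labels = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X)
X_scaled = StandardScaler().fit_transform(X)
scaled_labels = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X_scaled)

print("Agreement without scaling:", agreement(raw_labels, group))
print("Agreement with scaling:   ", agreement(scaled_labels, group))
```

The agreement helper checks both labelings because cluster 0 in one run may be cluster 1 in another.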
Determine the optimal number of clusters using the Elbow Method
Verify using the silhouette score (optional but recommended)
Fit K-means and assign cluster labels to each customer
Visualize customer segments using a 2D plot
Exercise 3.1: Exploring K-Means Clustering for Customer Segmentation
Apply K-Means clustering to segment customers
Determine the optimal number of clusters using the Elbow Method
Verify using the silhouette score (optional but recommended)
Choosing Your Optimal k
The “right” answer here is somewhat subjective. Look for:
- Where the elbow curve bends (typically k=3 to 5 for this data)
- The highest silhouette score
- If they disagree, the silhouette score is often more reliable
Pick a k and justify your choice. There’s no single correct answer.
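One way to collect both signals in a single loop is sketched below. It uses synthetic blobs with three well-separated groups rather than the exercise data, so the clean answer it produces will not carry over to your own plot.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with three well-separated groups (not the course data).
X, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=1.0,
    random_state=42,
)

inertias = {}
silhouettes = {}
for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias[k] = km.inertia_                         # elbow method input
    silhouettes[k] = silhouette_score(X, km.labels_)  # higher is better

best_k = max(silhouettes, key=silhouettes.get)
for k in inertias:
    print(f"k={k}  inertia={inertias[k]:.1f}  silhouette={silhouettes[k]:.3f}")
print("k with the highest silhouette score:", best_k)
```

On real customer data the two curves are usually less decisive, which is why the exercise asks you to justify your choice.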
Fit K-means and assign cluster labels to each customer
Visualize customer segments using a 2D plot
Interpretation Questions
- How did you decide on the optimal k? Did the elbow method and silhouette scores agree?
- Looking at your 2D visualization, do the clusters seem well-separated or do they overlap?
- Can you describe what “type” of customer each cluster might represent based on their TotalChildren and TotalSpent values?
Walkthrough 3.2: Market Basket Analysis with Apriori Algorithm
Prepare transactional data (services as items)
Common Pitfall: Setting min_support too low floods you with rules (many spurious), while setting it too high can return no itemsets at all. Also, a rule can have high confidence simply because the consequent is popular, so always check lift (> 1) to confirm a real association rather than a coincidence.
Apply the Apriori algorithm to identify frequent itemsets
Generate association rules from frequent itemsets
Interpret insights
Key Metrics:
- Support: How often items appear together (0.25 = 25% of customers)
- Confidence: If A, how likely B? (0.8 = 80% chance)
- Lift: How much more likely than random? (>1 = positive association)
Example interpretation: If a rule shows {PhoneService} -> {InternetService} with confidence=0.75 and lift=1.3: “Customers with phone service are 75% likely to also have internet service, and this is 30% more likely than if the purchases were independent.”
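The three metrics are connected by lift = confidence / support(consequent), so the example figures above imply that roughly 58% of customers have internet service (0.75 / 1.3). The sketch below computes all three by hand on ten made-up transactions; the data and the support helper are illustrative and do not come from the course dataset.

```python
# Ten made-up customers and the services they hold.
transactions = (
    [{"PhoneService", "InternetService"}] * 6
    + [{"PhoneService"}] * 2
    + [{"StreamingTV"}] * 2
)


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


antecedent, consequent = {"PhoneService"}, {"InternetService"}

supp_both = support(antecedent | consequent, transactions)
confidence = supp_both / support(antecedent, transactions)
lift = confidence / support(consequent, transactions)

print(f"support:    {supp_both:.2f}")   # 0.60
print(f"confidence: {confidence:.2f}")  # 0.75
print(f"lift:       {lift:.2f}")        # 1.25
```

Here the lift of 1.25 means phone customers are 25% more likely to have internet service than a randomly chosen customer.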
Exercise 3.2: Market Basket Analysis with Apriori Algorithm
Prepare transactional data (product categories as items)
Note on Thresholds
The values min_support=0.2 and min_threshold=0.6 are starting points. If you get:
- Too few rules: lower the thresholds
- Too many rules: raise them
- All trivial rules: look for higher lift values
Feel free to experiment with different values.
Apply the Apriori algorithm to identify frequent itemsets
Generate association rules from frequent itemsets
Interpretation Questions
- Which product category appears in the most frequent itemsets? What does this suggest about purchasing patterns?
- Find a rule with high confidence but low lift. Why might this rule be less useful despite high confidence?
- Identify one actionable insight: what product bundle would you recommend based on these rules?
Self-Check
By the end of this module, you should be able to:
Module 4: Implementing and Evaluating ML Models
Walkthrough 4.1: Exploring Cross-Validation for Model Evaluation
Split data into training and validation sets
Common Pitfall: Reporting a single train-test split can be lucky or unlucky depending on which rows land where. Cross-validation averages over several folds for a more honest estimate, and forgetting random_state makes those folds (and your results) non-reproducible.
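The reproducibility point can be sketched with scikit-learn's KFold on ten toy rows. The helper name validation_folds is hypothetical.

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(20).reshape(10, 2)  # ten toy rows


def validation_folds(random_state):
    """Return the validation-row indices of each fold as a list of lists."""
    kf = KFold(n_splits=5, shuffle=True, random_state=random_state)
    return [val_idx.tolist() for _, val_idx in kf.split(X)]


folds_a = validation_folds(random_state=42)
folds_b = validation_folds(random_state=42)
print(folds_a == folds_b)  # True: same seed, same folds
```

With shuffle=True and no random_state, the rows are reshuffled on each run, so fold membership and the resulting scores can change between runs.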
Train a classification model using logistic regression
Apply k-fold cross-validation to evaluate model performance
Compare metrics across folds
Interpretation
Accuracy -> Overall correctness of predictions
Precision -> How many predicted churns were actual churns
Recall -> How many churns were correctly identified
F1-Score -> Balances precision & recall
Cross-validation gives a more reliable estimate of how your model will perform on unseen data, because the score no longer depends on a single, possibly lucky, split.
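A sketch of collecting these four metrics per fold with scikit-learn's cross_validate. It uses synthetic data from make_classification in place of the churn data, and five folds is an assumed choice.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_validate

# Synthetic binary classification data standing in for the churn data.
X, y = make_classification(n_samples=500, n_features=8, random_state=42)

cv = KFold(n_splits=5, shuffle=True, random_state=42)
scoring = ["accuracy", "precision", "recall", "f1"]
results = cross_validate(
    LogisticRegression(max_iter=1000), X, y, cv=cv, scoring=scoring
)

# Each metric has one score per fold; the std shows how much folds differ.
for metric in scoring:
    scores = results[f"test_{metric}"]
    print(f"{metric:<9} mean={scores.mean():.3f}  std={scores.std():.3f}")
```

A large standard deviation on one metric suggests that the estimate depends heavily on which rows fall in each fold.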
Exercise 4.1: Exploring Cross-Validation for Model Evaluation
Split data into training and validation sets
Train a classification model using logistic regression
Apply k-fold cross-validation to evaluate model performance
Compare metrics across folds
Interpretation
Accuracy -> Overall correctness of predictions
Precision -> How many predicted responders were actual responders
Recall -> How many actual responders were correctly identified
F1-Score -> Harmonic mean of precision and recall
Interpretation Questions
- How consistent are your metrics across folds? (Look at the standard deviation values.)
- Which metric shows the most variability? What might cause this?
- Compare your cross-validation results to the single train-test split in Exercise 2.2. Are they similar?
Walkthrough 4.2: Hyperparameter Tuning with Grid Search
Train a Random Forest classifier
Common Pitfall: Tuning hyperparameters and then evaluating on the same data leaks the answer and overstates performance. The honest grid-search workflow selects parameters using cross-validation on the training set, then reports the final score on a held-out test set the search never saw.
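That workflow can be sketched as follows on synthetic data. The grid values, the three folds, and the choice of recall as the scoring metric are assumptions for illustration, not the settings used in the session.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_classification(n_samples=400, n_features=8, random_state=42)

# Hold out a test set that the search never sees.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Illustrative grid; the values are assumptions, not the course's grid.
param_grid = {"n_estimators": [25, 50], "max_depth": [3, None]}
search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=3,
    scoring="recall",
)
search.fit(X_train, y_train)  # cross-validation happens inside the training set

print("Best parameters:", search.best_params_)
print(f"Mean CV recall: {search.best_score_:.3f}")

# score() applies the scoring chosen above (recall) to the refit best model.
test_recall = search.score(X_test, y_test)
print(f"Held-out test recall: {test_recall:.3f}")
```

Report the held-out figure as the final result; the cross-validated score was used to pick the parameters, so it tends to be optimistic.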
Apply grid search to find optimal hyperparameters
Evaluate model improvement using accuracy and recall
Interpret the best hyperparameter combination
Exercise 4.2: Hyperparameter Tuning with Grid Search
Train a Random Forest classifier
Apply grid search to find optimal hyperparameters
Evaluate model improvement using accuracy and recall
Interpret the best hyperparameter combination
Interpretation Questions
- Did hyperparameter tuning improve recall compared to the default model in Exercise 2.2?
- Look at the best parameters found. Are they at the edges of your grid (suggesting you should expand the search)?
- Was the computational cost of grid search worth the performance improvement?
Self-Check
By the end of this module, you should be able to:
Bonus Challenge: End-to-End ML Pipeline
If you finish early or want additional practice, try this integration challenge using the marketing_campaign data:
Goal: Build the best model to predict Response using everything you’ve learned.
Feature Engineering: Create at least one new feature beyond TotalChildren and TotalSpent (e.g., spending per child, years as customer from Dt_Customer)
Model Comparison: Train both Logistic Regression and Random Forest, use cross-validation to compare them fairly
Optimization: Use GridSearchCV on your better-performing model
Interpretation: Write 2-3 sentences explaining which model you’d recommend and why
This is open-ended. There’s no single right answer. The goal is to practice the full workflow.