Solutions — Machine Learning for Data Analytics with Python
Interactive, in-browser edition — completed code you can run and edit.
These are the completed solutions, runnable in your browser — no install needed. Click Run Code on any cell (run the import and data-loading cells near the top first). Edit any cell to experiment.
Looking for the blanks to fill in yourself? See the exercises page.
Intro: Getting Started with Machine Learning for Data-Driven Decisions
Walkthrough: Setting Up the Python Environment for ML
If you haven’t already installed Python, Jupyter, and the necessary packages, follow the setup instructions in the README on the course repo.
You can also install the packages directly from within a Jupyter notebook.
If you aren’t able to do this on your machine, you may want to check out Google Colab. It’s a free service that allows you to run Jupyter notebooks in the cloud. Alternatively, I’ve set up some temporary notebooks on Binder that you can work with online as well.
Run the following code to check that each of the needed packages is installed. If you get an error, you may need to install the package(s) again.
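The check cell itself is not reproduced on this page, so here is a minimal sketch of one way to do it. The package list is an assumption inferred from the topics covered below (adjust it to match the course README), and `check_packages` is a hypothetical helper name.

```python
import importlib.util

# Assumed package list, inferred from the topics on this page.
# These are import names (e.g. scikit-learn is imported as "sklearn").
REQUIRED = ["pandas", "numpy", "matplotlib", "seaborn", "sklearn", "mlxtend"]

def check_packages(names):
    """Return {import_name: True/False} by looking each package up without importing it."""
    return {name: importlib.util.find_spec(name) is not None for name in names}

status = check_packages(REQUIRED)
for name, ok in status.items():
    print(f"{name:12s} {'OK' if ok else 'MISSING'}")
```

Anything reported as MISSING needs to be installed before the later cells will run.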
Exercise: Setting Up the Python Environment
By completing this exercise, you will be able to:
- Import necessary Python packages
- Check for successful package loading
- Load datasets into Python
Follow the instructions in the Walkthrough above to check that the necessary packages are installed correctly.
Module 1: Data Understanding and Preprocessing for Machine Learning
Walkthrough 1.1: Exploring and Preprocessing Data with Pandas & Seaborn
Inspect a dataset using Pandas
Handle missing values and clean data
Create visualizations to identify key business trends
Exercise 1.1: Exploring and Preprocessing Data with Pandas & Seaborn
Inspect a dataset using Pandas
Handle missing values and clean data
Common Pitfall: The `Income` column has missing values. If you skip this step and try to build models later, you’ll get errors. Always re-check `isnull().sum()` after cleaning to verify it returns zeros.
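As a rough illustration of that check-clean-recheck habit, here is a sketch on a tiny made-up DataFrame. Only the `Income` column name comes from the exercise; the values are invented, and filling with the median is just one reasonable choice (dropping the rows is another).

```python
import numpy as np
import pandas as pd

# Toy stand-in for the exercise data; the values are invented for illustration.
df = pd.DataFrame({
    "Income": [52000.0, np.nan, 61000.0, 48000.0, np.nan],
    "Response": [0, 1, 0, 0, 1],
})

# Step 1: count missing values per column before cleaning.
print(df.isnull().sum())

# Step 2: fill missing Income with the median of the observed values.
df["Income"] = df["Income"].fillna(df["Income"].median())

# Step 3: re-check; every count should now be zero.
missing_after = df.isnull().sum()
print(missing_after)
```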
Create visualizations to identify key business trends
Interpretation Questions
- Looking at the violin plot of Income by Response, which group (responders or non-responders) shows more variability in income?
- Based on your scatter plot, do higher-income customers tend to spend more? Is this relationship strong or weak?
- Which education level shows the highest response rate? What marketing implications might this have?
Self-Check
By the end of this module, you should be able to:
Module 2: Supervised Learning for Business Decisions
Walkthrough 2.1: Build a Regression Model for Pricing Optimization
Split the data into training and validation sets
Common Pitfall: Fitting `StandardScaler` on the full dataset before the train/test split leaks information from the validation set into training. In a strict workflow, fit the scaler on the training data only (or wrap it in a `Pipeline`) so the validation set stays truly unseen.
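A minimal sketch of the strict workflow, using synthetic data in place of the course dataset (the "usage" and "price" values here are generated, not real). Wrapping the scaler in a `Pipeline` means it is fit only on whatever data you pass to `fit`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: one predictor ("usage") and a noisy linear target ("price").
rng = np.random.default_rng(42)
X = rng.uniform(0, 100, size=(200, 1))
y = 20 + 0.5 * X[:, 0] + rng.normal(0, 2, size=200)

# Split first, so the validation rows never influence the scaler.
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# The pipeline fits the scaler on X_train only, then reuses those statistics on X_val.
model = Pipeline([("scale", StandardScaler()), ("reg", LinearRegression())])
model.fit(X_train, y_train)

r2_val = model.score(X_val, y_val)
print(f"Validation R-squared: {r2_val:.3f}")
```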
Train a linear regression model
Common Pitfall: Because the predictor was standardized before fitting, this coefficient is the change in price per one-standard-deviation change in usage, not per raw unit. Keep that in mind when explaining the number to stakeholders, and convert back to raw units if you need a real-world-unit effect.
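To convert a standardized coefficient back to raw units, divide it by the standard deviation the scaler used for that feature. The sketch below assumes synthetic training data with a true slope of 0.5 per raw unit; the variable names are illustrative, not taken from the course notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic training data: price rises by 0.5 per raw unit of usage.
rng = np.random.default_rng(0)
usage_train = rng.uniform(0, 100, size=(300, 1))
price_train = 20 + 0.5 * usage_train[:, 0] + rng.normal(0, 1, size=300)

scaler = StandardScaler().fit(usage_train)
reg = LinearRegression().fit(scaler.transform(usage_train), price_train)

# Coefficient on the standardized feature: change in price per one standard deviation of usage.
coef_per_sd = reg.coef_[0]

# Divide by the feature's standard deviation to get the change in price per raw unit.
coef_per_unit = coef_per_sd / scaler.scale_[0]

print(f"Per standard deviation: {coef_per_sd:.2f}")
print(f"Per raw unit:           {coef_per_unit:.3f}")
```

The per-unit number is the one to quote when a stakeholder asks what one extra unit of usage does to price.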
Evaluate model performance on the validation set
Exercise 2.1: Build a Regression Model for Pricing Optimization
Split the data into training and validation sets
Train a linear regression model
Evaluate model performance on the validation set
Interpretation Questions
- Is your R-squared higher or lower than the telco model’s in Walkthrough 2.1? What might explain the difference?
- If MAE is $200, what does that mean in practical terms for predicting customer spending?
- Would you trust this model for making budget decisions? Why or why not?
Walkthrough 2.2: Implement a Classification Model for Customer Churn
Split the data into training and validation sets
Common Pitfall: Churn and campaign response are imbalanced (far more 0s than 1s), so accuracy alone can be misleading – a model that always predicts the majority class can still look “accurate.” Report precision and recall alongside accuracy to see how well the model catches the minority class.
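To see why accuracy alone misleads, consider a baseline that always predicts the majority class. This sketch uses made-up labels with a 90/10 split (an assumption for illustration, not the actual class balance in the course data).

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Made-up labels: 90 negatives (0) and 10 positives (1).
y_true = np.array([0] * 90 + [1] * 10)
X_dummy = np.zeros((100, 1))  # features are irrelevant for this baseline

# A "model" that always predicts the most frequent class.
baseline = DummyClassifier(strategy="most_frequent").fit(X_dummy, y_true)
y_pred = baseline.predict(X_dummy)

acc = accuracy_score(y_true, y_pred)
prec = precision_score(y_true, y_pred, zero_division=0)  # no positive predictions at all
rec = recall_score(y_true, y_pred)

print(f"accuracy={acc:.2f} precision={prec:.2f} recall={rec:.2f}")
```

The baseline scores 90% accuracy while catching none of the positives, which is exactly what precision and recall expose.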
Train a Random Forest classification model
Evaluate model performance on the validation set
Quick Reference: When to Use Which Metric
| Situation | Metric | Why |
|---|---|---|
| Predicting continuous values | R-squared, MAE | R-squared measures variance explained; MAE measures average error size |
| Balanced classes | Accuracy | Overall correctness |
| Cost of false positives is high | Precision | Minimize wrong positive predictions |
| Cost of false negatives is high | Recall | Catch as many actual positives as possible |
| Need balance | F1-Score | Harmonic mean of precision/recall |
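The classification metrics in the table all come from the four confusion-matrix counts. Here is a small worked example with hypothetical counts (the numbers are invented for illustration).

```python
# Hypothetical confusion-matrix counts for 200 customers.
tp, fp, fn, tn = 30, 10, 20, 140

accuracy = (tp + tn) / (tp + fp + fn + tn)            # 170 / 200
precision = tp / (tp + fp)                            # 30 / 40
recall = tp / (tp + fn)                               # 30 / 50
f1 = 2 * precision * recall / (precision + recall)    # harmonic mean of the two

print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
# prints: accuracy=0.85 precision=0.75 recall=0.60 f1=0.67
```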
Exercise 2.2: Implement a Classification Model for Customer Churn
Split the data into training and validation sets
Train a Random Forest classification model
Evaluate model performance on the validation set
Interpretation Questions
- Compare your precision and recall. Which is higher, and what does that imply about the model’s tendencies?
- Looking at your confusion matrix, is the model better at identifying responders or non-responders?
- If running a marketing campaign costs $50 per contact, which metric matters more: precision or recall?
Self-Check
By the end of this module, you should be able to:
Module 3: Unsupervised Learning and Pattern Discovery in Business
Walkthrough 3.1: Exploring K-Means Clustering for Customer Segmentation
Apply K-Means clustering to segment customers
Common Pitfall: K-means uses Euclidean distance, so an unscaled feature with a large range (like `MonthlyCharges`) will dominate the clusters. Always standardize features before clustering, and remember the cluster labels are arbitrary integers with no inherent order.
Determine the optimal number of clusters using the Elbow Method
Verify using the silhouette score (optional but recommended)
Fit K-means and assign cluster labels to each customer
Visualize customer segments using a 2D plot
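The steps above can be sketched roughly as follows. The data here is synthetic (three generated groups standing in for customer features), so treat the numbers as illustrative; the plotting step is left out to keep the example short.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic customers: three groups with two features on different scales.
rng = np.random.default_rng(42)
centers = np.array([[10.0, 30.0], [40.0, 70.0], [65.0, 110.0]])
X = np.vstack([c + rng.normal(0, [3.0, 5.0], size=(60, 2)) for c in centers])

# Standardize first so no single feature dominates the Euclidean distances.
X_scaled = StandardScaler().fit_transform(X)

# Elbow method (inertia) and silhouette score for a range of k.
inertias, silhouettes = {}, {}
for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X_scaled)
    inertias[k] = km.inertia_
    silhouettes[k] = silhouette_score(X_scaled, km.labels_)
    print(f"k={k}: inertia={inertias[k]:.1f}, silhouette={silhouettes[k]:.3f}")

# Pick the k with the highest silhouette score, then assign a label to each customer.
best_k = max(silhouettes, key=silhouettes.get)
labels = KMeans(n_clusters=best_k, n_init=10, random_state=42).fit_predict(X_scaled)
print("Chosen k:", best_k)
```

On real data the elbow and the silhouette score will not always agree this cleanly, so look at both before settling on k.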
Exercise 3.1: Exploring K-Means Clustering for Customer Segmentation
Apply K-Means clustering to segment customers
Determine the optimal number of clusters using the Elbow Method
Verify using the silhouette score (optional but recommended)
Fit K-means and assign cluster labels to each customer
Visualize customer segments using a 2D plot
Interpretation Questions
- How did you decide on the optimal k? Did the elbow method and silhouette scores agree?
- Looking at your 2D visualization, do the clusters seem well-separated or do they overlap?
- Can you describe what “type” of customer each cluster might represent based on their TotalChildren and TotalSpent values?
Walkthrough 3.2: Market Basket Analysis with Apriori Algorithm
Prepare transactional data (services as items)
Common Pitfall: Setting `min_support` too low floods you with rules (many spurious), while setting it too high can return no itemsets at all. Also, a rule can have high confidence simply because the consequent is popular – always check `lift` (> 1) to confirm a real association rather than a coincidence.
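To make the confidence-versus-lift point concrete, here is a hand computation on ten invented transactions. This is not the Apriori algorithm itself, only the rule metrics it reports; the item names and the `support` and `rule_metrics` helpers are hypothetical.

```python
# Ten invented transactions; PhoneService is deliberately very popular.
transactions = [
    {"PhoneService", "StreamingTV", "StreamingMovies"},
    {"PhoneService", "StreamingTV", "StreamingMovies"},
    {"PhoneService", "StreamingTV", "StreamingMovies"},
    {"PhoneService", "StreamingTV"},
    {"PhoneService"},
    {"PhoneService"},
    {"PhoneService"},
    {"PhoneService", "StreamingMovies"},
    {"PhoneService"},
    {"StreamingTV"},
]

def support(items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)

def rule_metrics(antecedent, consequent):
    """Return (confidence, lift) for the rule antecedent -> consequent."""
    confidence = support(antecedent | consequent) / support(antecedent)
    lift = confidence / support(consequent)
    return confidence, lift

conf_phone, lift_phone = rule_metrics({"StreamingTV"}, {"PhoneService"})
conf_movies, lift_movies = rule_metrics({"StreamingTV"}, {"StreamingMovies"})

print(f"StreamingTV -> PhoneService:    confidence={conf_phone:.2f}, lift={lift_phone:.2f}")
print(f"StreamingTV -> StreamingMovies: confidence={conf_movies:.2f}, lift={lift_movies:.2f}")
```

The first rule has the higher confidence (0.80) but a lift below 1, because nearly everyone has PhoneService anyway. The second rule has lower confidence (0.60) but a lift of 1.5, so it is the one that reflects a real association.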
Apply the Apriori algorithm to identify frequent itemsets
Generate association rules from frequent itemsets
Exercise 3.2: Market Basket Analysis with Apriori Algorithm
Prepare transactional data (product categories as items)
Apply the Apriori algorithm to identify frequent itemsets
Generate association rules from frequent itemsets
Interpretation Questions
- Which product category appears in the most frequent itemsets? What does this suggest about purchasing patterns?
- Find a rule with high confidence but low lift. Why might this rule be less useful despite high confidence?
- Identify one actionable insight: what product bundle would you recommend based on these rules?
Self-Check
By the end of this module, you should be able to:
Module 4: Implementing and Evaluating ML Models
Walkthrough 4.1: Exploring Cross-Validation for Model Evaluation
Split data into training and validation sets
Common Pitfall: Reporting a single train-test split can be lucky or unlucky depending on which rows land where. Cross-validation averages over several folds for a more honest estimate – and forgetting `random_state` makes those folds (and your results) non-reproducible.
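A short sketch of reproducible folds, assuming synthetic data from `make_classification` in place of the course dataset. `cv_accuracy` is a hypothetical helper; the key detail is passing `random_state` to a shuffled splitter.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic, mildly imbalanced classification data.
X, y = make_classification(n_samples=500, n_features=8, weights=[0.8, 0.2], random_state=42)

def cv_accuracy(seed):
    """Run 5-fold stratified CV with shuffled folds fixed by `seed`."""
    folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    model = LogisticRegression(max_iter=1000)
    return cross_val_score(model, X, y, cv=folds, scoring="accuracy")

scores_a = cv_accuracy(seed=42)
scores_b = cv_accuracy(seed=42)  # same seed, so the same folds and the same scores

print("Fold accuracies:", np.round(scores_a, 3))
print(f"Mean={scores_a.mean():.3f}, std={scores_a.std():.3f}")
print("Reproducible:", np.array_equal(scores_a, scores_b))
```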
Train a classification model using logistic regression
Apply k-fold cross-validation to evaluate model performance
Compare metrics across folds
Interpretation
Accuracy -> Overall correctness of predictions
Precision -> How many predicted churns were actual churns
Recall -> How many churns were correctly identified
F1-Score -> Balances precision & recall
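Cross-validation does not make the model itself generalize better. It gives a more reliable estimate of how the model will perform on unseen data, because the score no longer depends on a single, possibly lucky or unlucky, split.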
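One way to compare these four metrics across folds is `cross_validate` with a list of scorers. The sketch below again assumes synthetic data rather than the churn dataset, so the numbers are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic stand-in for the churn data.
X, y = make_classification(n_samples=500, n_features=8, weights=[0.8, 0.2], random_state=42)

folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scoring = ["accuracy", "precision", "recall", "f1"]

# cross_validate returns one array per metric, with one entry per fold.
results = cross_validate(LogisticRegression(max_iter=1000), X, y, cv=folds, scoring=scoring)

for metric in scoring:
    fold_scores = results[f"test_{metric}"]
    print(f"{metric:10s} mean={fold_scores.mean():.3f} std={fold_scores.std():.3f}")
```

A large standard deviation for a metric means the estimate is sensitive to which rows land in which fold.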
Exercise 4.1: Exploring Cross-Validation for Model Evaluation
Split data into training and validation sets
Train a classification model using logistic regression
Apply k-fold cross-validation to evaluate model performance
Compare metrics across folds
Interpretation
Accuracy -> Overall correctness of predictions
Precision -> How many predicted responders were actual responders
Recall -> How many actual responders were correctly identified
F1-Score -> Harmonic mean of precision and recall
Interpretation Questions
- How consistent are your metrics across folds? (Look at the standard deviation values.)
- Which metric shows the most variability? What might cause this?
- Compare your cross-validation results to the single train-test split in Exercise 2.2. Are they similar?
Walkthrough 4.2: Hyperparameter Tuning with Grid Search
Train a Random Forest classifier
Common Pitfall: Tuning hyperparameters and then evaluating on the same data leaks the answer and overstates performance. The honest grid-search workflow selects parameters using cross-validation on the training set, then reports the final score on a held-out test set the search never saw.
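A minimal sketch of that workflow, assuming synthetic data and a deliberately small grid so it runs quickly. The parameter values are illustrative choices, not the grid used in the course notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, recall_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data.
X, y = make_classification(n_samples=400, n_features=8, weights=[0.8, 0.2], random_state=42)

# Hold out a test set that the search never sees.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Small illustrative grid; expand it if the best values sit at the edges.
param_grid = {"n_estimators": [50, 100], "max_depth": [3, None]}

# Parameters are selected by cross-validation on the training set only.
search = GridSearchCV(
    RandomForestClassifier(random_state=42), param_grid, cv=3, scoring="recall"
)
search.fit(X_train, y_train)

# The final score is reported once, on the held-out test set.
y_pred = search.best_estimator_.predict(X_test)
test_accuracy = accuracy_score(y_test, y_pred)
test_recall = recall_score(y_test, y_pred)

print("Best parameters:", search.best_params_)
print(f"CV recall (training folds): {search.best_score_:.3f}")
print(f"Test accuracy={test_accuracy:.3f}, test recall={test_recall:.3f}")
```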
Apply grid search to find optimal hyperparameters
Evaluate model improvement using accuracy and recall
Interpret the best hyperparameter combination
Exercise 4.2: Hyperparameter Tuning with Grid Search
Train a Random Forest classifier
Apply grid search to find optimal hyperparameters
Evaluate model improvement using accuracy and recall
Interpret the best hyperparameter combination
Interpretation Questions
- Did hyperparameter tuning improve recall compared to the default model in Exercise 2.2?
- Look at the best parameters found. Are they at the edges of your grid (suggesting you should expand the search)?
- Was the computational cost of grid search worth the performance improvement?
Self-Check
By the end of this module, you should be able to:
Bonus Challenge: End-to-End ML Pipeline
If you finish early or want additional practice, try this integration challenge using the marketing_campaign data:
Goal: Build the best model to predict Response using everything you’ve learned.
Feature Engineering: Create at least one new feature beyond TotalChildren and TotalSpent (e.g., spending per child, years as customer from Dt_Customer)
Model Comparison: Train both Logistic Regression and Random Forest, use cross-validation to compare them fairly
Optimization: Use GridSearchCV on your better-performing model
Interpretation: Write 2-3 sentences explaining which model you’d recommend and why
This is open-ended. There’s no single right answer. The goal is to practice the full workflow.
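If you want a starting point for the feature engineering step, here is one possible sketch on a three-row toy DataFrame. The column names `TotalChildren`, `TotalSpent`, and `Dt_Customer` come from the challenge text; the values, the ISO date format, the reference date, and the new feature names are all assumptions made for illustration.

```python
import pandas as pd

# Toy rows standing in for the marketing_campaign data (values invented).
df = pd.DataFrame({
    "TotalChildren": [0, 1, 2],
    "TotalSpent": [1200, 600, 900],
    "Dt_Customer": ["2012-09-04", "2014-03-08", "2013-08-21"],
})

# Years as customer, measured from an assumed reference date.
reference_date = pd.Timestamp("2015-01-01")
df["Dt_Customer"] = pd.to_datetime(df["Dt_Customer"])
df["YearsAsCustomer"] = (reference_date - df["Dt_Customer"]).dt.days / 365.25

# Spending relative to children; the +1 avoids dividing by zero for childless customers.
df["SpentPerChildPlusOne"] = df["TotalSpent"] / (df["TotalChildren"] + 1)

print(df[["YearsAsCustomer", "SpentPerChildPlusOne"]])
```

Check the date format in the real file before parsing, since a day-first format would need to be handled explicitly.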