ML for Prod

Overview of Machine Learning for Production

ML code is just the beginning.

Figure: Hidden technical debt in ML systems
Source: D. Sculley et al. NIPS 2015: Hidden Technical Debt in Machine Learning Systems

ML Lifecycle

Here, we will take an example of an ML speech recognition model and go through its various phases in the ML lifecycle.





ML System = Code + Data + Hyperparameters



There are two kinds of challenges:

  1. Statistical issues
  2. Software issues

Concept and data drift (Statistical issues)

Software Engineering Issues

Checklist of questions


End-to-end ML system = deploying the ML system in production for the first time + maintaining the ML system.

Deployment patterns

Common deployment cases:

  1. New product/capability
  2. Automate/assist with manual task
  3. Replace previous ML system

All should have:

Different modes of deployment

  1. Shadow mode -> The model runs in parallel with the human and its output is not used for any decision
  2. Canary mode -> Small fraction of traffic
  3. Blue-Green deployment -> Easy to rollback
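As a rough illustration of canary mode, the sketch below sends a small, configurable fraction of requests to the new model and the rest to the existing one. The names `pick_model`, `route_requests`, `"stable"`, and `"canary"` are hypothetical and chosen only for this example; a real setup would usually do this in a load balancer or serving layer rather than in application code.

```python
import random


def pick_model(canary_fraction, rng=random):
    """Return "canary" for roughly canary_fraction of calls, else "stable"."""
    if not 0.0 <= canary_fraction <= 1.0:
        raise ValueError("canary_fraction must be between 0 and 1")
    return "canary" if rng.random() < canary_fraction else "stable"


def route_requests(n_requests, canary_fraction, seed=0):
    """Count how many of n_requests would go to each model version."""
    rng = random.Random(seed)  # seeded so the simulation is repeatable
    counts = {"stable": 0, "canary": 0}
    for _ in range(n_requests):
        counts[pick_model(canary_fraction, rng)] += 1
    return counts


if __name__ == "__main__":
    # Simulate 10,000 requests with about 5% going to the new model.
    print(route_requests(10_000, canary_fraction=0.05))
```

Raising `canary_fraction` step by step while watching the monitoring metrics is one way to ramp up traffic gradually; setting it back to 0 acts as a rollback.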

Degrees of automation

  1. Human only
  2. Shadow mode -> The model's output is compared with the human's to check its quality, but it is not used.
  3. AI assistance -> A human uses the model's output to reach a decision.
  4. Partial automation -> When the model is not sure, a human comes into the flow.
  5. Full automation -> The AI model makes the decision itself.
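Partial automation can be sketched as a confidence threshold: confident predictions are accepted automatically, and the rest are handed to a human. The function name `decide`, the returned field names, and the threshold of 0.9 are assumptions made for this example, not values from the course.

```python
def decide(prediction, confidence, threshold=0.9):
    """Accept a prediction automatically or hand the case to a human.

    `threshold` is a hypothetical cut-off; in practice it would be tuned
    on held-out data against the cost of errors.
    """
    if confidence >= threshold:
        return {"decision": prediction, "decided_by": "model"}
    # The model is not sure enough, so a human makes the call.
    return {"decision": None, "decided_by": "human_review"}


if __name__ == "__main__":
    print(decide("defect", confidence=0.97))
    print(decide("defect", confidence=0.55))
```

Setting the threshold above 1.0 sends every case to a human, and setting it to 0.0 gives full automation, so the same switch covers several points on the list above.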

Simple Model Monitoring


Example of metrics to track

There are three kinds of metrics to track:



Pipeline Monitoring



Challenges in model development

  1. Doing well on the training set (usually measured by average training error).
  2. Doing well on dev/test sets.
  3. Doing well on business metrics/project goals.

Why lower test error is not good enough

Establish a baseline

Ways to establish a baseline
• Human-level performance (HLP)
• Literature search for state-of-the-art/open source
• Quick-and-dirty implementation
• Performance of older system


Getting started on modeling

Deployment constraints when picking a model

Should you take into account deployment constraints when picking a model?

Sanity Check

Sanity-check for code and algorithm

Error Analysis and Performance Auditing

To check the suggestions from an inaccurate model.

Useful metrics for each tag

Work prioritization

Figure: Work prioritization

Decide on the most important categories to work on based on the following:

Once a category has been finalized to improve upon, then we can:

Skewed Dataset

Figure: Precision and recall


Figure: F1 score

The F1 score combines precision and recall into a single number, which makes it easier to compare models and to see which category hurts model performance the most.
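To make the definitions concrete, here is a minimal sketch that computes precision, recall, and F1 from lists of true and predicted labels using the standard formulas. The function name `precision_recall_f1` and the toy data are made up for illustration.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall, and F1 for the `positive` class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    # Fall back to 0.0 when a denominator is zero (no predicted/actual positives).
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


if __name__ == "__main__":
    # Skewed toy dataset: 2 positives out of 100 examples.
    y_true = [1, 1] + [0] * 98
    always_negative = [0] * 100
    # Accuracy would be 98%, yet precision, recall, and F1 are all 0.
    print(precision_recall_f1(y_true, always_negative))
```

The toy example shows why accuracy alone is misleading on a skewed dataset: a model that never predicts the rare class still looks accurate.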

Performance Auditing

Auditing framework

Check for accuracy, fairness/bias, and other problems.

  1. Brainstorm the ways the system might go wrong.
    • Performance on subsets of data (e.g., ethnicity, gender).
    • How common are specific errors (e.g., FP, FN)?
    • Performance in rare classes.
  2. Establish metrics to assess performance against these issues on appropriate data slices.
  3. Get business/product owner buy-in.

Speech recognition example

  1. Brainstorm how the system might go wrong -> This is use case specific.
    • Accuracy of different genders and ethnicities.
    • Accuracy on different devices.
    • Prevalence of rude mis-transcriptions.
  2. Establish metrics to assess performance against these issues on appropriate data slices.
    • Mean accuracy for different genders and significant accents.
    • Mean accuracy on different devices.
    • Check for the prevalence of offensive words in the output.
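The "mean accuracy per slice" metrics above can be sketched as a simple group-by. The record layout (a dict with a boolean `"correct"` field plus slice attributes such as `"device"`) and the function name `accuracy_by_slice` are assumptions made for this example.

```python
from collections import defaultdict


def accuracy_by_slice(records, slice_key):
    """Return the mean accuracy for each value of `slice_key`.

    Each record is a dict with a boolean "correct" field and the slice
    attribute (hypothetical layout for this sketch).
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for rec in records:
        group = rec[slice_key]
        totals[group] += 1
        if rec["correct"]:
            hits[group] += 1
    return {group: hits[group] / totals[group] for group in totals}


if __name__ == "__main__":
    records = [
        {"device": "phone", "correct": True},
        {"device": "phone", "correct": False},
        {"device": "laptop", "correct": True},
    ]
    print(accuracy_by_slice(records, "device"))
```

Running the same function with a different `slice_key` (for example an accent or gender field, if the data has one) gives the other slices from the list.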

Data Iteration

Data-centric AI development

Model-centric view

Data-centric view

Data Augmentation


Create realistic examples that
(i) the algorithm does poorly on, but
(ii) humans (or other baseline) do well on
• Does it sound realistic?
• Is the x -> y mapping clear? (e.g., can humans recognize speech?)
• Is the algorithm currently doing poorly on it?
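For the speech example, one common augmentation is to mix background noise into a clean clip. The sketch below assumes audio is represented as a plain list of float samples in [-1.0, 1.0]; the function name `mix_with_noise` and the default `noise_gain` are hypothetical. Whether the result sounds realistic still has to be checked by listening, as the checklist above says.

```python
def mix_with_noise(speech, noise, noise_gain=0.3):
    """Add scaled background noise to a speech clip, sample by sample.

    Both inputs are lists of floats in [-1.0, 1.0]. The noise is looped if
    it is shorter than the speech, and the result is clipped to [-1.0, 1.0].
    """
    if not noise:
        raise ValueError("noise clip must not be empty")
    mixed = []
    for i, sample in enumerate(speech):
        value = sample + noise_gain * noise[i % len(noise)]
        mixed.append(max(-1.0, min(1.0, value)))
    return mixed


if __name__ == "__main__":
    speech = [0.0, 0.5, -0.5, 0.25]
    noise = [0.5, -0.5]
    print(mix_with_noise(speech, noise, noise_gain=0.5))
```

A larger `noise_gain` makes the example harder; if humans can no longer recognize the speech, the x -> y mapping is no longer clear and the example is not useful.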

Product recommendation has shifted from a collaborative to a content-based filtering approach.

In Collaborative filtering

In Content-based filtering

For structured data, feature engineering is still required.

Experiment Tracking

What to track?

Tracking tools

Desirable features

Big data to good data

Ensure consistently high-quality data in all phases of the ML project lifecycle.
Good data:

Data Definition and Baseline

Input data can come in different formats, even in the case of structured data.

Major types of Data Problems

Unstructured vs. structured data

Unstructured data

Structured data

Small data vs. big data

Small data

Big data

Emphasis on the data process.
Big data problems can have small data challenges too.
Problems with a large dataset but where there's a long tail of rare events in the input will have small data challenges, too.
• Web search
• Self-driving cars
• Product recommendation systems

Improving label consistency

Small data vs. big data (unstructured data)

Small data

Big data

Human-Level Performance (HLP)

HLP on structured data
Structured data problems are less likely to involve human labelers, so HLP is used less frequently.
Some exceptions:

Label and organize data

How long should you spend obtaining data?

Labeling data


Balanced train/dev/test sets
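One way to keep the splits balanced is a stratified split, which divides each label group separately so that every split has a similar share of each class. The sketch below is one possible implementation under that assumption; the function name `stratified_split`, the `label_of` callback, and the 60/20/20 fractions are choices made for this example.

```python
import random
from collections import defaultdict


def stratified_split(examples, label_of, fractions=(0.6, 0.2, 0.2), seed=0):
    """Split examples into train/dev/test with similar label proportions.

    `label_of` maps an example to its label. Only the first two entries of
    `fractions` are used; the remainder of each group goes to the test set.
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for ex in examples:
        by_label[label_of(ex)].append(ex)

    train, dev, test = [], [], []
    for group in by_label.values():
        rng.shuffle(group)  # shuffle within each label group
        n_train = round(fractions[0] * len(group))
        n_dev = round(fractions[1] * len(group))
        train += group[:n_train]
        dev += group[n_train:n_train + n_dev]
        test += group[n_train + n_dev:]
    return train, dev, test


if __name__ == "__main__":
    # 100 toy examples, 30% positive.
    data = [(f"x{i}", 1 if i < 30 else 0) for i in range(100)]
    train, dev, test = stratified_split(data, label_of=lambda ex: ex[1])
    for name, split in (("train", train), ("dev", dev), ("test", test)):
        positives = sum(label for _, label in split)
        print(name, len(split), positives / len(split))
```

This matters most for small datasets, where a purely random split can leave the dev or test set with a noticeably different share of positive examples than the training set.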


Scoping Process

  1. Brainstorm business problems, not AI problems.

    • What are the top 3 things you wish were working better?
      • Increase conversion
      • Reduce inventory
      • Increase margin (profit per item)
  2. Brainstorm AI solutions

  3. Assess the feasibility and value of potential solutions

    • Use an external benchmark (literature, other companies, competitors) -> Has anyone else done it?
    • Why use HLP to benchmark?
      • People are very good at unstructured data tasks.
      • Criteria: Can a human perform the task given the same data? -> Traffic light change detection
    • Do we have predictive features?
      • Given past purchases, predict future purchases. ✅
      • Given the weather, predict shopping mall foot traffic. ✅
      • Given DNA info, predict heart disease 🤔
      • Given social media chatter, predict demand for a clothing style? 🤔
      • Given the history of a stock's price, predict the future price of that stock ❌
    • History of project
    • Value of a project
      • ML engineers need to move toward business metrics, and business leaders need to move toward ML metrics, so that both sides agree on what to measure.
    • Ethical considerations
      • Is this project creating net positive societal value?
      • Is this project reasonably fair and free from bias?
      • Have any ethical concerns been openly aired and debated?
  4. Determine milestones

    • Key specifications:
      • ML metrics (accuracy, precision/recall, etc.)
      • Software metrics (latency, throughput, etc. given compute resources)
      • Business metrics (revenue, etc.)
      • Resources needed (data, personnel, help from other teams)
      • Timeline
        If unsure, consider benchmarking against other projects or building a POC (Proof of Concept) first.
  5. Budget for resources

I completed the course and got the certificate.


I am planning to go for the specialization.



Thoughts 🤔 by Soumendra Kumar Sahoo is licensed under CC BY 4.0