Overview of Machine Learning for Production

ML code is just the beginning.

Hidden technical debt in ML systems
Source: D. Sculley et al. NIPS 2015: Hidden Technical Debt in Machine Learning Systems

ML Lifecycle

Here, we will take an example of an ML speech recognition model and go through its various phases in the ML lifecycle.

Scoping

Decide to work on speech recognition for voice search.
Decide on key metrics:
- Accuracy
- Latency
- Throughput
Estimate resources and timeline

Data

Is the data labeled consistently?
How much silence before/after each clip?
How to perform volume normalization?
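
One possible answer to the silence and volume questions above, as a minimal sketch. It assumes a clip is a list of float samples in [-1.0, 1.0]; the function names, the silence threshold, and the target RMS level are illustrative choices, not values from the course.

```python
import math

def rms(samples):
    """Root-mean-square level of a clip given as floats in [-1.0, 1.0]."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def trim_silence(samples, threshold=0.01):
    """Drop leading and trailing samples whose magnitude is below threshold."""
    loud = [i for i, s in enumerate(samples) if abs(s) >= threshold]
    if not loud:
        return []  # the whole clip is below the threshold
    return samples[loud[0]:loud[-1] + 1]

def normalize_volume(samples, target_rms=0.1):
    """Scale a clip so its RMS level matches target_rms, clipping to [-1, 1]."""
    level = rms(samples)
    if level == 0.0:
        return list(samples)  # pure silence: nothing to scale
    gain = target_rms / level
    return [max(-1.0, min(1.0, s * gain)) for s in samples]

clip = [0.0, 0.0, 0.02, -0.04, 0.03, 0.0]
trimmed = trim_silence(clip)
normalized = normalize_volume(trimmed)
print(trimmed, round(rms(normalized), 3))
```

Whatever rule is chosen, applying the same rule to every clip is what keeps the data consistent.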

Modeling

There are three pillars for improving an ML Model's performance:
1. Data
2. Code (Algo/Model)
3. Hyperparameters
In research, people mostly change the code and hyperparameters.
In products, people mostly change the data and hyperparameters.

Quote

ML System = Code + Data + Hyperparameters

Deployment

Production server
Data drift

Challenges

There are two kinds of challenges:

Statistical issues
Software issues

Concept and data drift (Statistical issues)

Here's an analogy: Imagine you train a model to predict if someone will buy a particular product based on age and income.
- Data drift: Your data source starts to include people from different regions with different buying habits. The model still looks at age and income, but the data it was trained on is no longer representative of the inputs it sees.
- Concept drift: People's buying habits change due to a recession. Even with the same age and income, people are less likely to buy.
In essence, data drift is a change in the distribution of the inputs x, while concept drift is a change in the mapping from x to y.
Drift can happen at different speeds:
- User data generally drifts slowly, while enterprise data (B2B applications) can shift fast. There are two kinds of shifts:
  1. Sudden drift: During COVID-19, credit card fraud models started flagging purchases from users who had previously used their cards sparingly.
  2. Gradual drift: House prices increase gradually over time.
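
Data drift on a single input feature can be checked by comparing its distribution in training data against recent production data. The sketch below uses the Population Stability Index (PSI), which is one common choice and not something the course prescribes. The bin edges, the sample ages, and the 0.25 alert level are assumptions for illustration. Note that this only looks at inputs, so it cannot detect concept drift.

```python
import math

def bin_fractions(values, edges):
    """Fraction of values in each bin [edges[i], edges[i+1])."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        idx = 0
        # Out-of-range values are clamped into the first or last bin.
        while idx < len(counts) - 1 and v >= edges[idx + 1]:
            idx += 1
        counts[idx] += 1
    eps = 1e-6  # avoids log(0) for empty bins
    return [max(c / len(values), eps) for c in counts]

def psi(reference, current, edges):
    """Population Stability Index between two samples of one feature."""
    ref = bin_fractions(reference, edges)
    cur = bin_fractions(current, edges)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))

edges = [0, 30, 45, 60, 120]                      # hypothetical age bins
training_ages = [22, 25, 31, 38, 45, 52, 58, 63]  # made-up reference sample
recent_ages = [51, 55, 58, 60, 62, 64, 66, 70]    # made-up production sample

score = psi(training_ages, recent_ages, edges)
drifted = score > 0.25  # rule-of-thumb alert level, treated here as an assumption
print(round(score, 2), drifted)
```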

Software Engineering Issues

Checklist of questions

Realtime or Batch
Cloud vs. Edge/Browser
Compute resources (CPU/GPU/memory)
Latency, throughput (QPS)
Logging
Security and privacy

Tip

End-to-end ML system = First time deploying an ML system in Production + Maintaining the ML System.

Deployment patterns

Common deployment cases:

New product/capability
Automate/assist with manual task
Replace previous ML system

All should have:

Gradual ramp-up with monitoring
Rollback

Different modes of deployment

Shadow mode -> The model serves no live traffic; it runs in parallel with the human and its output is not used for decisions
Canary mode -> Small fraction of traffic
Blue-Green deployment -> Easy to rollback
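
Canary mode can be sketched as a deterministic traffic split. In this minimal example the hashing scheme, the function names, and the model labels are all hypothetical. Hashing the user ID keeps each user on the same model between requests, and setting the percentage back to 0 acts as a rollback.

```python
import hashlib

def bucket(user_id: str, buckets: int = 100) -> int:
    """Map a user ID to a stable bucket in [0, buckets)."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % buckets

def route(user_id: str, canary_percent: int) -> str:
    """Send roughly canary_percent of users to the new model."""
    return "new_model" if bucket(user_id) < canary_percent else "old_model"

# Gradual ramp-up: raise the percentage while the monitoring stays healthy.
for percent in (0, 5, 50, 100):
    served = sum(route(f"user-{i}", percent) == "new_model" for i in range(1000))
    print(percent, served)
```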

Degrees of automation

Human only
Shadow mode -> This is to check the output quality from the model compared to the humans.
AI assistance -> Human will take the input from the model to reach a decision
Partial Automation -> In case the model is not sure, humans will come into the flow
Full automation -> AI Model takes the decision itself.
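
Partial automation comes down to a confidence threshold. The sketch below assumes the model exposes a confidence score between 0 and 1; the 0.9 threshold and the route names are illustrative. Low-confidence cases go to a human, who still sees the model's suggestion.

```python
def decide(prediction: str, confidence: float, threshold: float = 0.9):
    """Return (who decides, the model's prediction)."""
    if confidence >= threshold:
        return ("model", prediction)
    # The model is not sure: hand over to a human reviewer.
    return ("human_review", prediction)

print(decide("defect", 0.97))
print(decide("defect", 0.55))
```

Setting the threshold above 1.0 sends everything to a human (AI assistance), and setting it to 0.0 gives full automation.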

Simple Model Monitoring

Dashboards

Brainstorm the things that could go wrong.
Brainstorm a few statistics/metrics that will detect the problem.
It is ok to use many metrics initially and gradually remove the ones you find unhelpful.

Example of metrics to track

There are three kinds of metrics to track:

Infra Metrics/Software metrics: Memory, compute, latency, throughput, server load
Input data metrics: To detect any change in the input data like:
- Audio length changes.
- Pictures arrive with higher contrast.
- The text to summarize gets longer.
Output data metrics:
- Number of times the system returns " " (null).
- Number of times the user redoes the search.
- Number of times the user switches to typing.
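
A metric such as the rate of empty outputs can be tracked over a sliding window and compared against a threshold. This is a minimal sketch; the class name, window size, and threshold are hypothetical and would need tuning for a real system.

```python
from collections import deque

class RateMonitor:
    """Tracks how often a flagged event occurs over the last `window` requests."""

    def __init__(self, window: int, threshold: float):
        self.events = deque(maxlen=window)
        self.threshold = threshold

    def record(self, flagged: bool) -> None:
        self.events.append(flagged)

    def rate(self) -> float:
        return sum(self.events) / len(self.events) if self.events else 0.0

    def alarm(self) -> bool:
        # Only raise an alarm once the window is full, to avoid noisy early alerts.
        return len(self.events) == self.events.maxlen and self.rate() > self.threshold

null_outputs = RateMonitor(window=4, threshold=0.25)
for transcript in ["hello", "", " ", "ok"]:
    null_outputs.record(transcript.strip() == "")
print(null_outputs.rate(), null_outputs.alarm())
```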

Misc

If an issue is detected while monitoring the model, we can manually or automatically retrain the model based on the use case.

Pipeline Monitoring

Multiple ML models can run in series, where one model's output is the input to the next.
In this case, changes in the first ML model may affect the second one.

Modeling

Challenges

Challenges in model development

Doing well on the training set (usually measured by average training error).
Doing well on dev/test sets.
Doing well on business metrics/project goals.

Why lower test error is not good enough

Handling bias or discrimination in the model output.
Handling skewed datasets.
- Accuracy in rare classes
Error Analysis can help.

Establish a baseline

For performance measurement, 100% is not the target. The right baseline depends on the use case. Sometimes human-level performance (HLP) is the appropriate baseline.
The choice also depends on whether the data is structured or unstructured.
HLP is a good baseline for unstructured data such as images and audio.

Ways to establish a baseline
• Human-level performance (HLP)
• Literature search for state-of-the-art/open source
• Quick-and-dirty implementation
• Performance of older system

The baseline helps to indicate what might be possible. In some cases (such as HLP), it also gives a sense of the irreducible error (Bayes error).

FAQs

Getting started on modeling

Literature search to see what's possible (courses, blogs, open-source projects).
Find open-source implementations if available.
A reasonable algorithm with good data often outperforms a great algorithm with not-so-good data.

Deployment constraints when picking a model

Should you take into account deployment constraints when picking a model?

Yes, if the baseline is already established and the goal is to build and deploy.
Not necessarily if the purpose is to establish a baseline and determine what is possible and might be worth pursuing.

Sanity Check

Sanity-check for code and algorithm

Try to overfit a small training dataset before training on a large one.
In this way, you will know whether the approach will work.
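
As a minimal sketch of this sanity check, the example below fits a tiny logistic regression to four hand-made points. The data, learning rate, and step count are made up for illustration. If the training code cannot drive the loss down and classify these four points correctly, there is probably a bug, and training on a large dataset would only hide it.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def log_loss(w, b, xs, ys):
    """Average cross-entropy of a 1-D logistic model on (xs, ys)."""
    eps = 1e-12  # keeps log() away from zero
    total = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(w * x + b)
        total -= y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
    return total / len(xs)

def train(xs, ys, lr=0.5, steps=2000):
    """Plain batch gradient descent on the logistic loss."""
    w, b = 0.0, 0.0
    for _ in range(steps):
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        w -= lr * sum(e * x for e, x in zip(errors, xs)) / len(xs)
        b -= lr * sum(errors) / len(xs)
    return w, b

xs, ys = [0.0, 1.0, 2.0, 3.0], [0, 0, 1, 1]  # tiny, easily separable set
initial_loss = log_loss(0.0, 0.0, xs, ys)
w, b = train(xs, ys)
final_loss = log_loss(w, b, xs, ys)
accuracy = sum((sigmoid(w * x + b) >= 0.5) == bool(y) for x, y in zip(xs, ys)) / len(xs)
print(initial_loss > final_loss, accuracy)
```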

Error Analysis and Performance Auditing

Error analysis means examining the examples that the model gets wrong.

Inspect the misclassified inputs and tag them with attributes such as non-ASCII characters, low bandwidth, and people noise.

Useful metrics for each tag

What fraction of errors has that tag?
Of all the data with that tag, what fraction is misclassified?
What fraction of all the data has that tag?
How much room for improvement is there in the data with that tag?
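
The first three questions above are simple ratios once each example carries its tags and a correct/incorrect flag. The sketch below assumes each example is a dict with a set of tags and a `correct` field; that layout, the function name, and the eight examples are hypothetical.

```python
def tag_metrics(examples, tag):
    """Compute the per-tag error-analysis ratios for one tag."""
    tagged = [e for e in examples if tag in e["tags"]]
    errors = [e for e in examples if not e["correct"]]
    tagged_errors = [e for e in tagged if not e["correct"]]
    return {
        "fraction_of_errors_with_tag": len(tagged_errors) / len(errors) if errors else 0.0,
        "error_rate_within_tag": len(tagged_errors) / len(tagged) if tagged else 0.0,
        "fraction_of_data_with_tag": len(tagged) / len(examples),
    }

examples = [
    {"tags": {"car_noise"}, "correct": False},
    {"tags": {"car_noise"}, "correct": True},
    {"tags": {"people_noise"}, "correct": False},
    {"tags": {"people_noise"}, "correct": False},
    {"tags": {"people_noise"}, "correct": True},
    {"tags": set(), "correct": True},
    {"tags": set(), "correct": True},
    {"tags": {"low_bandwidth"}, "correct": False},
]
people_noise = tag_metrics(examples, "people_noise")
print(people_noise)
```

The fourth question (room for improvement) needs a baseline such as HLP on the same tagged data, which this sketch does not cover.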

Work prioritization

Decide on the most important categories to work on based on the following:

How much room for improvement there is.
How frequently that category appears.
How easy it is to improve accuracy in that category.
How important it is to improve in that category.
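
The first two criteria can be combined into one number: the gap to the baseline multiplied by the share of data in that category. The sketch below uses made-up accuracy, HLP, and data-share figures for a speech system; ease and importance remain judgment calls and are not modeled.

```python
categories = [
    # (name, accuracy %, baseline (HLP) %, share of data %) -- illustrative numbers
    ("clean speech", 94, 95, 60),
    ("car noise", 89, 93, 4),
    ("people noise", 87, 89, 30),
    ("low bandwidth", 70, 70, 6),
]

def potential_gain(accuracy, baseline, share):
    """Overall accuracy gain, in percentage points, if the category reached the baseline."""
    return max(baseline - accuracy, 0) * share / 100

ranking = sorted(
    ((name, potential_gain(acc, hlp, share)) for name, acc, hlp, share in categories),
    key=lambda item: item[1],
    reverse=True,
)
for name, gain in ranking:
    print(f"{name}: {gain:.2f}")
```

In these numbers, car noise has the largest gap (4 points) but covers so little data that closing it would help overall accuracy less than a 1-point gain on clean speech.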

Once a category has been chosen for improvement, we can:

Collect more data
Use data augmentation to get more data.
Improve label accuracy/data quality.

Skewed Dataset

Precision and recall

F1 score

The F1 score combines precision and recall into a single number, which makes it easier to compare models on a skewed dataset and to see which category hurts model performance the most.
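
Precision is the fraction of predicted positives that are truly positive, recall is the fraction of true positives that were found, and F1 is their harmonic mean. A minimal sketch, with made-up labels, shows why accuracy alone misleads on a skewed dataset: a model that always predicts the majority class scores 90% accuracy but has an F1 of zero.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision, recall, and F1 for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

y_true = [0] * 9 + [1]          # skewed: only one positive in ten
always_zero = [0] * 10          # predicts the majority class every time
second_model = [0] * 8 + [1, 1] # finds the positive, with one false alarm

accuracy_zero = sum(t == p for t, p in zip(y_true, always_zero)) / len(y_true)
print(accuracy_zero, precision_recall_f1(y_true, always_zero))
print(precision_recall_f1(y_true, second_model))
```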

Performance Auditing

Auditing framework

Check for accuracy, fairness/bias, and other problems.

Brainstorm the ways the system might go wrong.
- Performance on subsets of data (e.g., ethnicity, gender).
- How common are specific errors (e.g., FP, FN)?
- Performance in rare classes.
Establish metrics to assess performance against these issues on appropriate data slices.
Get business/product owner buy-in.

Speech recognition example

Brainstorm how the system might go wrong -> This is use case specific.
- Accuracy of different genders and ethnicities.
- Accuracy on different devices.
- Prevalence of rude mis-transcriptions.
Establish metrics to assess performance against these issues on appropriate data slices.
- Mean accuracy for different genders and significant accents.
- Mean accuracy on different devices.
- Check for the prevalence of offensive words in the output.

Data Iteration

Data-centric AI development

Model-centric view

Take the data you have and develop a model that does as well as possible on it.
Hold the data fixed and iteratively improve the code/model.
This is mainly used across Research when a benchmark dataset is fixed and the model is optimized.

Data-centric view

The quality of the data is paramount.
Use tools to improve the data quality, allowing multiple models to do well.
Hold the code fixed and iteratively improve the data.

Data Augmentation

Identify the categories where the gap between the model and the baseline is largest.
Train the model with more data from such a category and see whether performance improves on that category and on nearby related categories.

Goal:
Create realistic examples that
(i) the algorithm does poorly on, but
(ii) humans (or other baseline) do well on
Checklist:
• Does it sound realistic?
• Is the x → y mapping clear? (e.g., can humans recognize the speech?)
• Is the algorithm currently doing poorly on it?
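
For speech, one simple form of augmentation is mixing a clean clip with background noise. The sketch below assumes clips are lists of float samples in [-1.0, 1.0]; the function name and the noise gain are hypothetical. The gain should stay low enough that a human can still recognize the speech, in line with the checklist above.

```python
def add_noise(speech, noise, noise_gain=0.2):
    """Mix background noise into a speech clip, looping the noise if it is shorter."""
    if not noise:
        return list(speech)
    mixed = []
    for i, sample in enumerate(speech):
        value = sample + noise_gain * noise[i % len(noise)]
        mixed.append(max(-1.0, min(1.0, value)))  # clip to the valid range
    return mixed

speech = [0.5, -0.5, 0.9]  # made-up clean samples
cafe_noise = [1.0, -1.0]   # made-up noise samples
augmented = add_noise(speech, cafe_noise)
print(augmented)
```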

Product recommendation has shifted from a collaborative to a content-based filtering approach.

In Collaborative filtering

Similar users are clustered together, and users in the same cluster are recommended the same items.

In Content-based filtering

Restaurants similar to the ones a user already likes are recommended.
New restaurants can be recommended easily, even though no one has liked them yet.

For structured data, feature engineering is still required.

Experiment Tracking

What to track?

Algorithm/code versioning
Dataset used
Hyperparameters
Results

Tracking tools

Text files
Spreadsheet
Experiment tracking system
- Weights and Biases, etc.

Desirable features

Information needed to replicate results
Experiment results, ideally with summary metrics/analysis
Perhaps also: Resource monitoring, visualization, model error analysis
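
Even the text-file option can capture what is needed to replicate a result. The sketch below appends one JSON record per run to a file; the field names, the file name, and the example values (including the code version strings) are hypothetical.

```python
import json
import os
import tempfile

def log_experiment(path, code_version, dataset, hyperparameters, results):
    """Append one experiment record as a JSON line and return it."""
    record = {
        "code_version": code_version,  # e.g. a commit hash
        "dataset": dataset,
        "hyperparameters": hyperparameters,
        "results": results,
    }
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, sort_keys=True) + "\n")
    return record

def load_experiments(path):
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

log_path = os.path.join(tempfile.mkdtemp(), "experiments.jsonl")
log_experiment(log_path, "abc123", "speech-v1", {"learning_rate": 0.01}, {"dev_accuracy": 0.91})
log_experiment(log_path, "abc124", "speech-v2", {"learning_rate": 0.003}, {"dev_accuracy": 0.94})

best = max(load_experiments(log_path), key=lambda r: r["results"]["dev_accuracy"])
print(best["code_version"], best["dataset"])
```

A dedicated experiment tracking system adds search, visualization, and resource monitoring on top of the same idea.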

Big data to good data

Ensure consistently high-quality data in all phases of the ML project lifecycle.
Good data:

Covers important cases (good coverage of inputs x)
Is defined consistently (definition of labels y is unambiguous)
Has timely feedback from production data (distribution covers data drift and concept drift)
Is sized appropriately

Data Definition and Baseline

Input data can come in different formats, even in the case of structured data.

Major types of Data Problems

Unstructured vs. structured data

Unstructured data

It may or may not have a huge collection of unlabeled examples x.
Humans can label more data.
Data augmentation is more likely to be helpful.

Structured data

It may be more difficult to obtain more data.
Human labeling may not be possible (with some exceptions).

Small data vs. big data

Small data

Clean labels are critical.
Can manually look through the dataset and fix labels.
Can get all the labelers to talk to each other.

Big data

The emphasis is on the data process.
Big data problems can have small data challenges too.
A problem with a large dataset but a long tail of rare events in the input still has small data challenges for those rare events. Examples:
• Web search
• Self-driving cars
• Product recommendation systems

Improving label consistency

Have multiple labelers label the same example.
When there is disagreement, have the ML Engineer (MLE), subject matter expert (SME), and labelers discuss the definition of y to reach an agreement.
If labelers believe that x doesn't contain enough information, consider changing x.
Iterate until it is hard to increase agreement significantly.

Small data vs. big data (unstructured data)

Small data

Usually, a small number of labelers.
Can ask labelers to discuss specific labels.

Big data

Get a consistent definition with a small group.
Then, send labeling instructions to labelers.
Consider having multiple labelers label every example and using voting or consensus labels to increase accuracy.
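
Voting over multiple labels can be sketched as a majority vote with an agreement score. The function name and the agreement threshold are hypothetical. Examples without a clear majority return no label, so they can be sent back for discussion instead of being resolved arbitrarily.

```python
from collections import Counter

def consensus_label(votes, min_agreement=0.5):
    """Return (label, agreement); label is None unless agreement exceeds min_agreement."""
    counts = Counter(votes)
    label, count = counts.most_common(1)[0]
    agreement = count / len(votes)
    if agreement > min_agreement:
        return label, agreement
    return None, agreement

print(consensus_label(["um, today's weather", "um, today's weather", "um... today's weather"]))
print(consensus_label(["scratch", "no defect"]))
```

Tracking the agreement score over time also shows whether the labeling instructions are getting clearer.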

Human-Level Performance (HLP)

HLP on structured data
Structured data problems are less likely to involve human labelers, so HLP is used less frequently.
Some exceptions:

User ID merging: Same person?
Based on network traffic, is the computer hacked?
Is the transaction fraudulent?
Spam account? Bot?
From GPS, what is the mode of transportation - on foot, bike, car, bus?

Label and Organize data.

How long should you spend obtaining data?

Get into this iteration loop as quickly as possible.
Instead of asking how long it would take to obtain a given number of examples, ask: how much data can we get in k days?
Exception: If you have worked on the problem before and know from experience how many examples you need.

Labeling data

Options: In-house vs. outsourced vs. crowdsourced
Having MLE label data is expensive. But doing this for just a few days is usually fine.
Who is qualified to label?
- Speech recognition - any reasonably fluent speaker
- Factory inspection, medical image diagnosis - SME (subject matter expert)
- Recommender systems - may need help to label well.
Don't increase data by more than 10x at a time.

Metadata

Keep track of data provenance (the origin of the data) and lineage (sequence of steps).
Store metadata alongside the data from the start, because it is difficult to track down or add later.
Metadata is useful for:
- Error analysis. Spotting unexpected effects.
- Keeping track of data provenance.

Balanced train/dev/test sets

Each set should contain roughly the same proportion of each label as the full dataset.
This matters mainly for small datasets. With a large dataset, a random split is usually representative.
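
A balanced split can be produced by splitting each label separately and then combining the pieces. The sketch below assumes a 60/20/20 split and a made-up dataset of 100 examples with 30 positives; the function name and the fixed seed are illustrative.

```python
import random
from collections import defaultdict

def stratified_split(labels, fractions=(0.6, 0.2, 0.2), seed=0):
    """Return (train, dev, test) index lists that preserve label proportions."""
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    train, dev, test = [], [], []
    for indices in by_label.values():
        rng.shuffle(indices)
        n_train = round(len(indices) * fractions[0])
        n_dev = round(len(indices) * fractions[1])
        train += indices[:n_train]
        dev += indices[n_train:n_train + n_dev]
        test += indices[n_train + n_dev:]  # the remainder goes to test
    return train, dev, test

labels = [1] * 30 + [0] * 70
train, dev, test = stratified_split(labels)
positives = [sum(labels[i] for i in part) for part in (train, dev, test)]
print(len(train), len(dev), len(test), positives)
```

With a purely random split of such a small dataset, the dev set could easily end up with far fewer or far more positives than the test set, which makes the two hard to compare.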

Scoping

Scoping Process

Brainstorm business problems, not AI problems.
- What are the top 3 things you wish were working better?
  - Increase conversion
  - Reduce inventory
  - Increase margin (profit per item)
Brainstorm AI solutions
Assess the feasibility and value of potential solutions
- Use an external benchmark (literature, other companies, competitors) -> Has anyone else done it?
- Why use HLP to benchmark?
  - People are very good at unstructured data tasks.
  - Criteria: Can a human perform the task given the same data? -> Traffic light change detection
- Do we have predictive features?
  - Given past purchases, predict future purchases. ✅
  - Given the weather, predict shopping mall foot traffic. ✅
  - Given DNA info, predict heart disease 🤔
  - Given social media chatter, predict demand for a clothing style? 🤔
  - Given the history of a stock's price, predict the future price of that stock ❌
- History of project
- Value of a project
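  - ML engineers tend to think in ML metrics and business leaders in business metrics. Each side needs to move toward the other until they agree on a metric that both are comfortable with.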
- Ethical considerations
  - Is this project creating net positive societal value?
  - Is this project reasonably fair and free from bias?
  - Have any ethical concerns been openly aired and debated?
Determine milestones
- Key specifications:
  - ML metrics (accuracy, precision/ recall, etc.)
  - Software metrics (latency, throughput, etc. given compute resources)
  - Business metrics (revenue, etc.)
  - Resources needed (data, personnel, help from other teams)
  - Timeline
    If unsure, consider benchmarking to other projects or building a POC (Proof of Concept) first.
Budget for resources

I completed the course and got the certificate.

I am planning to go for the specialization.
