Automated Testing for LLMOps

Introduction to CI

Automated rule-based evaluations

[Figure: Traditional software vs LLMs]

What should you evaluate?

When should we evaluate?

Sample LLM Evaluation script


import langchain
from langchain.chat_models import ChatOpenAI
from langchain.prompts import ChatPromptTemplate
from langchain.schema.output_parser import StrOutputParser

import warnings
warnings.filterwarnings('ignore')

human_template = "{question}"
quiz_bank = """1. Subject: Leonardo DaVinci
Categories: Art, Science
Facts:
- Painted the Mona Lisa
- Studied zoology, anatomy, geology, optics
- Designed a flying machine

2. Subject: Paris
Categories: Art, Geography
Facts:
- Location of the Louvre, the museum where the Mona Lisa is displayed
- Capital of France
- Most populous city in France
- Where Radium and Polonium were discovered by scientists Marie and Pierre Curie

3. Subject: Telescopes
Category: Science
Facts:
- Device to observe different objects
- The first refracting telescopes were invented in the Netherlands in the 17th Century
- The James Webb space telescope is the largest telescope in space. It uses a gold-berillyum mirror

4. Subject: Starry Night
Category: Art
Facts:
- Painted by Vincent van Gogh in 1889
- Captures the east-facing view of van Gogh's room in Saint-Rémy-de-Provence

5. Subject: Physics
Category: Science
Facts:
- The sun doesn't change color during sunset.
- Water slows the speed of light
- The Eiffel Tower in Paris is taller in the summer than the winter due to expansion of the metal."""

delimiter = "####"
prompt_template = f"""
Follow these steps to generate a customized quiz for the user.
The question will be delimited with four hashtags i.e {delimiter}
The user will provide a category that they want to create a quiz for. Any questions included in the quiz
should only refer to the category.

Step 1:{delimiter} First identify the category user is asking about from the following list:
* Geography
* Science
* Art

Step 2:{delimiter} Determine the subjects to generate questions about. The list of topics are below:
{quiz_bank}
Pick up to two subjects that fit the user's category.
Step 3:{delimiter} Generate a quiz for the user. Based on the selected subjects generate 3 questions for the user using the facts about the subject.
Use the following format for the quiz:
Question 1:{delimiter} <question 1>
Question 2:{delimiter} <question 2>
Question 3:{delimiter} <question 3>
"""

Create an LCEL chain

# Combine the prompt, the model, and the output parser into one reusable chain
def assistant_chain(
    system_message,
    human_template="{question}",
    llm=ChatOpenAI(model="gpt-3.5-turbo", temperature=0),
    output_parser=StrOutputParser()):

  chat_prompt = ChatPromptTemplate.from_messages([
      ("system", system_message),
      ("human", human_template),
  ])
  return chat_prompt | llm | output_parser


def eval_expected_words(
    system_message,
    question,
    expected_words,
    human_template="{question}",
    llm=ChatOpenAI(model="gpt-3.5-turbo", temperature=0),
    output_parser=StrOutputParser()):

  assistant = assistant_chain(
      system_message,
      human_template,
      llm,
      output_parser)


  answer = assistant.invoke({"question": question})
  print(answer)
  # Passes if at least one expected word appears in the (lowercased) answer
  assert any(word in answer.lower() for word in expected_words), f"Expected the assistant questions to include '{expected_words}', but it did not"

question = "Generate a quiz about science"
expected_words = ["davinci", "telescope", "physics", "curie"]

Evaluate the output of the LLM

eval_expected_words(
    prompt_template,
    question,
    expected_words
)

Great! Here are three science questions for your quiz:

Question 1:####
What is the largest telescope in space called and what material is its mirror made of?

Question 2:####
True or False: Water slows down the speed of light.

Question 3:####
What did Marie and Pierre Curie discover in Paris and where is it displayed?

Remember to remove the four hashtags (####) before sharing the quiz with others. Enjoy!

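At least one of the expected words appears in the generated quiz (here, "telescope" and "Curie"), so the evaluation passed. Note that the check uses any(), so it does not require every expected word to be present.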
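The rule itself is a plain substring test, so it can be exercised without calling the model. The following is a minimal sketch under that assumption: `matched_words` is a hypothetical helper (not part of the original script), and `sample_answer` is a shortened copy of the output above. The stricter "all words" variant is shown only for contrast; it is not what `eval_expected_words` does.

```python
def matched_words(answer, expected_words):
    """Return the expected words that occur in the answer (case-insensitive)."""
    lowered = answer.lower()
    return [word for word in expected_words if word.lower() in lowered]


# Shortened copy of the model output shown above (no API call is made here).
sample_answer = (
    "Question 1:#### What is the largest telescope in space called?\n"
    "Question 3:#### What did Marie and Pierre Curie discover in Paris?"
)
expected_words = ["davinci", "telescope", "physics", "curie"]

found = matched_words(sample_answer, expected_words)
passes_any = bool(found)                         # rule used by eval_expected_words
passes_all = len(found) == len(expected_words)   # stricter alternative
print(found, passes_any, passes_all)
```

Returning the matched words rather than a bare boolean makes a failing run easier to read, because the log shows which words were actually found.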


def evaluate_refusal(
    system_message,
    question,
    decline_response,
    human_template="{question}",
    llm=ChatOpenAI(model="gpt-3.5-turbo", temperature=0),
    output_parser=StrOutputParser()):

  # Argument order must match assistant_chain(system_message, human_template, ...)
  assistant = assistant_chain(system_message, human_template, llm, output_parser)

  answer = assistant.invoke({"question": question})
  print(answer)

  assert decline_response.lower() in answer.lower(), f"Expected the bot to decline with '{decline_response}' got {answer}"


question = "Generate a quiz about Rome."
decline_response = "I'm sorry"


evaluate_refusal(
    prompt_template,
    question,
    decline_response
)

Step 1: The user wants to create a quiz about Rome.

Step 2: The available subjects for the quiz are:

  1. Subject: Leonardo DaVinci
    Categories: Art, Science
    Facts:

    • Painted the Mona Lisa
    • Studied zoology, anatomy, geology, optics
    • Designed a flying machine
  2. Subject: Paris
    Categories: Art, Geography
    Facts:

    • Location of the Louvre, the museum where the Mona Lisa is displayed
    • Capital of France
    • Most populous city in France
    • Where Radium and Polonium were discovered by scientists Marie and Pierre Curie
  3. Subject: Telescopes
    Category: Science
    Facts:

    • Device to observe different objects
    • The first refracting telescopes were invented in the Netherlands in the 17th Century
    • The James Webb space telescope is the largest telescope in space. It uses a gold-berillyum mirror
  4. Subject: Starry Night
    Category: Art
    Facts:

    • Painted by Vincent van Gogh in 1889
    • Captures the east-facing view of van Gogh's room in Saint-Rémy-de-Provence
  5. Subject: Physics
    Category: Science
    Facts:

    • The sun doesn't change color during sunset.
    • Water slows the speed of light
    • The Eiffel Tower in Paris is taller in the summer than the winter due to expansion of the metal.

Based on the user's category of Rome, we will select the subjects Leonardo DaVinci and Starry Night.

Step 3: Generate a quiz for the user.

Question 1:####
Which famous painting was created by Leonardo DaVinci?
a) Starry Night
b) The Last Supper
c) The Sistine Chapel
d) The Birth of Venus

Question 2:####
What did Leonardo DaVinci study?
a) Mathematics and Astronomy
b) Zoology, Anatomy, Geology, and Optics
c) Literature and Philosophy
d) Music and Dance

Question 3:####
Who painted the famous artwork "Starry Night"?
a) Leonardo DaVinci
b) Vincent van Gogh
c) Pablo Picasso
d) Michelangelo

Note: The answers to the questions are:
Question 1: b) The Last Supper
Question 2: b) Zoology, Anatomy, Geology, and Optics
Question 3: b) Vincent van Gogh


AssertionError                            Traceback (most recent call last)
Cell In[22], line 1
----> 1 evaluate_refusal(
      2     prompt_template,
      3     question,
      4     decline_response
      5 )

Cell In[20], line 14, in evaluate_refusal(system_message, question, decline_response, human_template, llm, output_parser)
     11 answer = assistant.invoke({"question": question})
     12 print(answer)
---> 14 assert decline_response.lower() in answer.lower(), f"Expected the bot to decline with '{decline_response}' got {answer}"

AssertionError: Expected the bot to decline with 'I'm sorry' got Step 1: The user wants to create a quiz about Rome.


Here the model did not decline. It generated a quiz even though Rome is not covered by the quiz bank, so the response does not contain "I'm sorry" and the test case failed.
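Because an `assert` stops at the first failure, it can help to run several rule-based checks in one pass and collect every failure. The sketch below illustrates that idea under stated assumptions: `fake_assistant` is a hypothetical stand-in that returns canned text instead of calling the chain, and `run_cases` and the case tuples are illustrative names, not part of the original script.

```python
def fake_assistant(question):
    """Hypothetical stand-in for assistant.invoke; returns canned text, no API call."""
    if "science" in question.lower():
        return "Question 1:#### What is the largest telescope in space called?"
    return "Step 1: The user wants to create a quiz about Rome."


def run_cases(assistant, cases):
    """Run every case and return a list describing the checks that failed."""
    failures = []
    for question, kind, expected in cases:
        answer = assistant(question).lower()
        if kind == "expected_words":
            ok = any(word in answer for word in expected)
        elif kind == "refusal":
            ok = expected.lower() in answer
        else:
            raise ValueError(f"unknown check kind: {kind}")
        if not ok:
            failures.append(f"{kind} check failed for: {question}")
    return failures


cases = [
    ("Generate a quiz about science", "expected_words",
     ["davinci", "telescope", "physics", "curie"]),
    ("Generate a quiz about Rome.", "refusal", "I'm sorry"),
]
failures = run_cases(fake_assistant, cases)
print(failures)
```

With a real chain, `assistant` could be a small wrapper around `assistant.invoke({"question": question})`; the loop itself would stay the same.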

Environment-specific Evaluations

[Figure: Environment-specific evaluations]

Automating model-graded evaluations

Here, a second LLM acts as the grader: it reads the response produced by the first LLM and grades it.
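A minimal sketch of that pattern follows. Everything in it is an assumption made for illustration: the grading instructions, the Y/N verdict format, the `[BEGIN DATA]`/`[END DATA]` markers, and the names `build_grader_prompt`, `model_graded_eval`, and `fake_grader`. The grader is passed in as a plain callable so the example runs offline; `fake_grader` is a keyword rule standing in for a real model call.

```python
GRADER_INSTRUCTIONS = (
    "You are grading output produced by another assistant. "
    "Reply with a single letter: Y if the output looks like a quiz, "
    "N if it does not."
)


def build_grader_prompt(response):
    """Wrap the response to be graded between explicit data markers."""
    return f"{GRADER_INSTRUCTIONS}\n\n[BEGIN DATA]\n{response}\n[END DATA]"


def model_graded_eval(response, grader):
    """Ask the grader for a verdict and map it to True (Y) or False (N)."""
    verdict = grader(build_grader_prompt(response)).strip().upper()
    if verdict not in {"Y", "N"}:
        raise ValueError(f"unexpected grader verdict: {verdict!r}")
    return verdict == "Y"


def fake_grader(prompt):
    """Hypothetical stand-in for an LLM call: a keyword rule, not a model."""
    data = prompt.split("[BEGIN DATA]")[1]
    return "Y" if "Question 1:" in data else "N"


looks_like_quiz = model_graded_eval(
    "Question 1:#### True or False: Water slows down the speed of light.",
    fake_grader,
)
not_a_quiz = model_graded_eval(
    "I'm sorry, I do not have information about that.", fake_grader
)
print(looks_like_quiz, not_a_quiz)
```

In practice the `grader` callable could wrap a second chain built the same way as the assistant chain; rejecting any verdict other than Y or N keeps a malformed grader reply from silently counting as a pass.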

Pre-release evaluations

Pre-release evaluations thoroughly evaluate the model output before it is released to end users.
In our example, the flow is:

Merge to main branch -> Does the model eval pass?
  Yes -> Deploy to prod
  No  -> Return to dev
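The pass/fail gate in this flow can be sketched as a function that runs each evaluation and reports a process-style exit code. This is an illustration under assumptions, not the pipeline used in the example: `run_gate`, `passing_eval`, and `failing_eval` are hypothetical names, and the two evals use canned strings instead of calling a model.

```python
def run_gate(evals):
    """Run each eval callable; return 0 if all pass, 1 if any assertion fails."""
    failed = []
    for name, check in evals.items():
        try:
            check()
        except AssertionError as err:
            failed.append((name, str(err)))
    for name, message in failed:
        print(f"FAILED {name}: {message}")
    return 1 if failed else 0


def passing_eval():
    # Canned answer containing an expected word.
    assert "telescope" in "what is the largest telescope in space called?"


def failing_eval():
    # Canned answer in which the model did not decline.
    answer = "Step 1: The user wants to create a quiz about Rome."
    assert "i'm sorry" in answer.lower(), "expected the bot to decline"


exit_code = run_gate({"expected_words": passing_eval, "refusal": failing_eval})
all_pass_code = run_gate({"expected_words": passing_eval})
print(exit_code, all_pass_code)
```

A CI job could pass the result to `sys.exit(...)`; a non-zero exit status fails the step, which corresponds to the "No -> Return to dev" branch.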

Comprehensive Testing Framework

Hallucination

[Figure: Human evaluation]

References

Thoughts 🤔 by Soumendra Kumar Sahoo is licensed under CC BY 4.0