Takeaways
Prompting
- Fundamental prompting techniques
- Start with prompting when developing new applications.
- Best prompting techniques:
- Few-shot prompting with in-context learning: To provide the LLM with a few examples that demonstrate the task and align outputs with our expectations.
- Chain-of-thought prompting: To have the LLM explain its thought process before returning the final answer (see the sketch after this list).
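A minimal sketch of both techniques together, assuming a hypothetical ticket-classification task and the OpenAI-style chat message format; the examples and labels are made up.

```python
# Minimal sketch: few-shot examples plus a chain-of-thought instruction.
FEW_SHOT_EXAMPLES = [
    {"ticket": "I was charged twice this month.", "label": "billing"},
    {"ticket": "The app crashes when I upload a photo.", "label": "bug"},
    {"ticket": "How do I export my data to CSV?", "label": "how-to"},
]

def build_messages(ticket: str) -> list[dict]:
    """Build a chat-style message list: instructions, worked examples, then the new input."""
    system = (
        "You classify support tickets into one of: billing, bug, how-to.\n"
        "First think step by step about the ticket, then give the final label "
        "on the last line as 'Label: <label>'."        # chain-of-thought instruction
    )
    messages = [{"role": "system", "content": system}]
    for ex in FEW_SHOT_EXAMPLES:                        # few-shot, in-context examples
        messages.append({"role": "user", "content": ex["ticket"]})
        messages.append({"role": "assistant", "content": f"Label: {ex['label']}"})
    messages.append({"role": "user", "content": ticket})
    return messages

print(build_messages("My invoice shows the wrong company name."))
```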
- Structured inputs and outputs
- Structured input and output help models better understand the input as well as return output that can reliably integrate with downstream systems. Adding serialization formatting to your inputs can help provide more clues to the model as to the relationships between tokens in the context, additional metadata to specific tokens (like types), or relate the request to similar examples in the model's training data.
- Structured output serves a similar purpose, but it also simplifies integration into downstream components of your system. Instructor and Outlines work well for structured output. (If you're importing an LLM API SDK, use Instructor; if you're importing Huggingface for a self-hosted model, use Outlines; see the sketch below.) Structured input expresses tasks clearly and resembles how the training data is formatted, increasing the probability of better output.
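A small sketch of structured output with Instructor and Pydantic, per the note above about API-based models; the schema, model name, and prompt are illustrative assumptions.

```python
import instructor
from openai import OpenAI
from pydantic import BaseModel

class ProductSummary(BaseModel):       # illustrative schema for the desired output
    title: str
    pros: list[str]
    cons: list[str]

# Instructor patches the OpenAI client so responses are parsed and validated into the schema.
client = instructor.from_openai(OpenAI())

summary = client.chat.completions.create(
    model="gpt-4o-mini",               # placeholder model name
    response_model=ProductSummary,     # output must conform to this schema
    messages=[{"role": "user", "content": "Summarize the reviews for this product: ..."}],
)
print(summary.model_dump_json(indent=2))
```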
- Have small prompts that do one thing, and only one thing, well
- A single catch-all prompt can quickly turn into a God Object that tries to do everything.
- Use separate prompts that each do one and only one task.
- Craft your context tokens
- If the structure of the context is hard for a human to follow, it is hard for the agent too.
- Manually review exactly what you are sending to the LLM: can you understand it yourself? (See the rendering sketch below.)
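One way to do that review, sketched below: render the final context exactly as it will be sent and read it yourself. The XML-ish template and the fields are hypothetical.

```python
# Render the full context before sending it, so a human can sanity-check it.
TEMPLATE = """\
<task>Answer the user's question using only the documents below.</task>
<documents>
{documents}
</documents>
<question>{question}</question>
"""

def render_context(question: str, docs: list[str]) -> str:
    documents = "\n".join(f"<doc id={i}>{d}</doc>" for i, d in enumerate(docs))
    return TEMPLATE.format(documents=documents, question=question)

prompt = render_context(
    "What is the refund window?",
    ["Refunds are accepted within 30 days.", "Shipping takes 3-5 business days."],
)
print(prompt)  # if this is hard for a human to parse, it is hard for the model too
```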
RAG
- The quality of your RAG's output is dependent on the quality of retrieved documents. This can be evaluated through ranking metrics such as Mean Reciprocal Rank (MRR) or Normalized Discounted Cumulative Gain (NDCG).
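For reference, a tiny sketch of MRR; the per-query ranks below are made-up data.

```python
def mean_reciprocal_rank(first_relevant_ranks: list[int | None]) -> float:
    """MRR: average of 1/rank of the first relevant document per query (0 if none retrieved)."""
    reciprocals = [0.0 if rank is None else 1.0 / rank for rank in first_relevant_ranks]
    return sum(reciprocals) / len(reciprocals)

# Rank of the first relevant document for four queries; None = nothing relevant retrieved.
print(mean_reciprocal_rank([1, 3, None, 2]))  # (1 + 1/3 + 0 + 1/2) / 4 ≈ 0.458
```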
- The quality of retrieved documents can be judged by three criteria:
- Relevance (rank of the retrieved documents)
- Information density
- Level of detail provided to the LLM
- Use keyword search as a baseline
- Use a keyword search algorithm like BM25 for exact-term matching.
- Keyword search is also more explainable than embedding-based retrieval.
- A combination of keyword search and vector embeddings works best (see the fusion sketch below).
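A sketch of one common way to combine the two, reciprocal rank fusion (RRF); the document IDs and the two input rankings stand in for real BM25 and embedding results.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of doc IDs; k=60 is the commonly used RRF constant."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_ranking = ["doc_7", "doc_2", "doc_9"]    # exact-term matches (e.g. BM25)
vector_ranking = ["doc_2", "doc_4", "doc_7"]  # semantic matches (embeddings)
print(reciprocal_rank_fusion([bm25_ranking, vector_ranking]))
```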
- For new knowledge, prefer RAG over fine-tuning.
- It is easier and cheaper.
- Documents can easily be added, updated, or removed to change the RAG output.
- Multi-tenancy is easier to manage.
- Long context windows in LLMs do not remove the need for retrieval.
- We still need a way to select the information to feed into the model (see the budget sketch below).
- The model may get distracted by irrelevant context and hallucinate.
- Cost grows with the number of context tokens.
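A minimal sketch of that selection step, assuming documents arrive already ranked and using a crude characters/4 token estimate rather than a real tokenizer.

```python
def select_context(ranked_docs: list[str], max_tokens: int = 2000) -> list[str]:
    """Greedily keep the highest-ranked documents until the token budget is spent."""
    selected, used = [], 0
    for doc in ranked_docs:          # ranked_docs: best-first from retrieval
        est_tokens = len(doc) // 4   # rough estimate; swap in a real tokenizer in practice
        if used + est_tokens > max_tokens:
            break
        selected.append(doc)
        used += est_tokens
    return selected

print(select_context(["short doc", "another doc", "a much longer document " * 500]))
```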
Tuning and optimizing workflows
- Step-by-step, multi-turn "flows" can give large boosts.
- Each step an agent takes has a chance of failing, and the chances of recovering from the error are poor.
- Have agent systems that produce deterministic plans which are then executed in a structured, reproducible way (see the sketch after this list). It's like giving instructions to a junior engineer: have clear goals and concrete plans.
- The workflow can be:
- In the first step, given a high-level goal or prompt, the agent generates a plan.
- Then, the plan is executed deterministically.
- This allows each step to be more predictable and reliable.
- Generated plans can be represented as directed acyclic graphs (DAGs) which are easier, relative to a static prompt, to understand and adapt to new situations.
- The key to reliable, working agents will likely be found in adopting more structured, deterministic approaches, as well as collecting data to refine prompts and finetune models.
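A minimal sketch of the plan-then-execute pattern, assuming the plan is a simple linear list of step names; generate_plan() stands in for the LLM call and the step registry is hypothetical.

```python
import json

# Hypothetical deterministic steps; in a real system these are plain functions/tools.
STEP_REGISTRY = {
    "fetch_issue": lambda ctx: {**ctx, "issue": f"issue text for #{ctx['issue_id']}"},
    "summarize":   lambda ctx: {**ctx, "summary": ctx["issue"][:80]},
    "draft_reply": lambda ctx: {**ctx, "reply": f"Re: {ctx['summary']}"},
}

def generate_plan(goal: str) -> list[str]:
    # In practice this is one LLM call that returns a structured plan (e.g. JSON); stubbed here.
    return ["fetch_issue", "summarize", "draft_reply"]

def execute_plan(plan: list[str], context: dict) -> dict:
    """Execute the generated plan with ordinary code: predictable, loggable, reproducible."""
    for step in plan:
        context = STEP_REGISTRY[step](context)
    return context

print(json.dumps(execute_plan(generate_plan("reply to issue 42"), {"issue_id": 42}), indent=2))
```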
- Getting more diverse outputs beyond temperature
- Problem:
- Increasing the temperature parameter alone is not enough to get diverse outputs from the LLM.
- Some outputs that could be a good fit may never be produced by the LLM. The same handful of tokens might be overrepresented in outputs if they are highly likely to follow the prompt based on what the LLM learned during training.
- If the temperature is too high, you may get outputs that reference nonexistent products (or gibberish!)
- Solutions:
- Keep a short list of recent outputs and instruct the LLM to avoid suggesting items from it, or reject and resample outputs that are too similar to recent suggestions (see the sketch below).
- Vary the phrasing used in the prompts.
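A sketch combining the two ideas: vary the prompt phrasing and reject-and-resample against a recent list. generate() stands in for the LLM call; the phrasings and the exact-match similarity check are simplifications.

```python
import random

PHRASINGS = [  # hypothetical alternative phrasings of the same request
    "Suggest one product the user might like.",
    "Pick a single item this shopper would enjoy.",
    "What is one good recommendation for this customer?",
]

def diverse_suggestion(generate, recent: list[str], max_tries: int = 5) -> str:
    """Ask with varied phrasing and resample if the output repeats a recent suggestion."""
    for _ in range(max_tries):
        prompt = random.choice(PHRASINGS)                      # vary the phrasing
        if recent:
            prompt += " Avoid suggesting any of: " + ", ".join(recent)
        candidate = generate(prompt)
        if candidate not in recent:                            # reject-and-resample
            recent.append(candidate)
            return candidate
    return candidate                                           # give up after max_tries

# Toy usage: the lambda stands in for a real LLM call.
print(diverse_suggestion(lambda p: random.choice(["tent", "yoga mat", "water bottle"]),
                         recent=["yoga mat"]))
```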
- Caching is underrated
- One straightforward approach to caching is to use unique IDs for the items being processed, such as if we're summarizing new articles or product reviews. When a request comes in, we can check to see if a summary already exists in the cache. If so, we can return it immediately; if not, we generate, guardrail, and serve it, and then store it in the cache for future requests (sketched below).
- #TODO For more open-ended queries, we can borrow techniques from the field of search, which also leverages caching for open-ended inputs.
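A sketch of the ID-keyed flow described above; summarize() and passes_guardrails() are stand-ins, and the dict would be a real cache (Redis, a DB table, etc.) in production.

```python
cache: dict[str, str] = {}                # stand-in for Redis, a DB table, etc.

def summarize(text: str) -> str:
    return text[:100]                     # placeholder for the real LLM call

def passes_guardrails(summary: str) -> bool:
    return len(summary) > 0               # placeholder guardrail check

def get_summary(item_id: str, text: str) -> str:
    if item_id in cache:                  # cache hit: return immediately
        return cache[item_id]
    summary = summarize(text)             # generate
    if not passes_guardrails(summary):    # guardrail before serving/storing
        raise ValueError(f"guardrail failed for {item_id}")
    cache[item_id] = summary              # store for future requests
    return summary

print(get_summary("review-123", "Great battery life, mediocre camera. " * 10))
```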
- Fine-Tuning. When?
- We may have some tasks where even the most cleverly designed prompts fall short. For example, even after significant prompt engineering, our system may still be a ways from returning reliable, high-quality output. If so, then it may be necessary to finetune a model for your specific task.
- If prompting gets you 90% of the way there, then fine-tuning may not be worth the investment. However, if we do decide to fine-tune, to reduce the cost of collecting human annotated data, we can generate and finetune on synthetic data, or bootstrap on open-source data.
Evaluation & Monitoring
Evaluation can be as simple as unit testing, or it can look more like observability, or maybe it's just data science.
- Assertion-based unit tests
- Create unit tests (i.e., assertions) consisting of samples of inputs and outputs from production, with expectations for outputs based on at least three criteria.
- While three criteria might seem arbitrary, it's a practical number to start with; fewer might indicate that your task isn't sufficiently defined or is too open-ended, like a general-purpose chatbot.
- These unit tests, or assertions, should be triggered by any changes to the pipeline, whether it's editing a prompt, adding new context via RAG, or other modifications. This write-up has an example of an assertion-based test for an actual use case.
- Consider beginning with assertions that specify phrases or ideas to either include or exclude in all responses. Also consider checks to ensure that word, item, or sentence counts lie within a range (see the pytest sketch below).
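A sketch of such assertions with pytest; run_pipeline() is a stub for the real prompt/RAG pipeline, and the sample and thresholds are made up.

```python
import pytest

def run_pipeline(prompt: str) -> str:
    # Stand-in for the real prompt/RAG pipeline under test.
    return "Refunds are accepted within 30 days of purchase."

SAMPLES = [  # sampled production inputs with expectations
    {"input": "Summarize the refund policy.",
     "must_include": ["30 days"],
     "must_exclude": ["As an AI"]},
]

@pytest.mark.parametrize("sample", SAMPLES)
def test_output_assertions(sample):
    output = run_pipeline(sample["input"])
    for phrase in sample["must_include"]:
        assert phrase in output                 # criterion 1: required phrases/ideas
    for phrase in sample["must_exclude"]:
        assert phrase not in output             # criterion 2: excluded phrases
    assert 1 <= len(output.split()) <= 120      # criterion 3: word count within range
```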
- LLM-as-Judge
- Use pairwise comparisons: Instead of asking the LLM to score a single output on a Likert scale, present it with two options and ask it to select the better one. This tends to lead to more stable results (see the sketch below).
- Control for position bias: The order of options presented can bias the LLM's decision. To mitigate this, do each pairwise comparison twice, swapping the order of pairs each time. Just be sure to attribute wins to the right option after swapping!
- Allow for ties: In some cases, both options may be equally good. Thus, allow the LLM to declare a tie so it doesn't have to arbitrarily pick a winner.
- Use Chain-of-Thought: Asking the LLM to explain its decision before giving a final preference can increase eval reliability. As a bonus, this allows you to use a weaker but faster LLM and still achieve similar results. Because this part of the pipeline frequently runs in batch mode, the extra latency from CoT isn't a problem.
- Control for response length: LLMs tend to bias toward longer responses. To mitigate this, ensure response pairs are similar in length.
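A sketch pulling the pairwise-judge controls together (both orderings, ties allowed, reasoning before the verdict); judge() stands in for the LLM call and is assumed to return "A", "B", or "TIE" after its reasoning has been parsed out.

```python
JUDGE_PROMPT = """Compare the two responses to the question below.
Think step by step about which better answers the question, then end with exactly one line:
'Winner: A', 'Winner: B', or 'Winner: TIE'.

Question: {question}
Response A: {a}
Response B: {b}"""

def pairwise_judgement(judge, question: str, first: str, second: str) -> str:
    """Return 'first', 'second', or 'tie', judging both presentation orders to control position bias."""
    verdict_1 = judge(JUDGE_PROMPT.format(question=question, a=first, b=second))
    verdict_2 = judge(JUDGE_PROMPT.format(question=question, a=second, b=first))
    # Attribute wins back to the underlying responses after the swap.
    wins_first = (verdict_1 == "A") + (verdict_2 == "B")
    wins_second = (verdict_1 == "B") + (verdict_2 == "A")
    if wins_first > wins_second:
        return "first"
    if wins_second > wins_first:
        return "second"
    return "tie"
```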
- Intern test
- We like to use the following "intern test" when evaluating generations: If you took the exact input to the language model, including the context, and gave it to an average college student in the relevant major as a task, could they succeed? How long would it take?
- #TODO Create a flow chart
- If the answer is no because the LLM lacks the required knowledge, consider ways to enrich the context.
- If the answer is no and we simply can't improve the context to fix it, then we may have hit a task that's too hard for contemporary LLMs.
- If the answer is yes, but it would take a while, we can try to reduce the complexity of the task. Is it decomposable? Are there aspects of the task that can be made more templatized?
- If the answer is yes, and they would get it quickly, then it's time to dig into the data. What's the model doing wrong? Can we find a pattern of failures? Try asking the model to explain itself before or after it responds, to help you build a theory of mind.
- Overemphasizing certain evals can hurt overall performance
- Example: the Needle-in-a-Haystack (NIAH) eval; overemphasizing it can come at the cost of performance on other real-world tasks.
- Simplify annotation to binary tasks or pairwise comparisons
- It's easier for humans to say "A is better than B" than to assign an individual score to either A or B.