Evaluating and Testing LLM Outputs

This is one of a series of articles on solution considerations for integrating LLMs.

Question

How can we evaluate and test Large Language Model (LLM) outputs?

This is a critical question for enterprises integrating LLMs into production.

There is no single answer here. Integration with LLMs is already widespread, and there are plenty of showcases, demos, and tutorials. However, there are far fewer stories about feeding LLM output responses directly into actions or to end users, especially when your business is at stake.

So how can we ensure that:

the outputs are as expected (or within a tolerance range)

the outputs stay within the guardrails

the outputs are ethical

And how can we log and observe all of this for improvement/feedback and as a proof of reference? (A minimal sketch of such a check-and-log step follows below.)
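Before getting into tooling, here is a minimal sketch of what such a check-and-log step could look like. The expected fields, thresholds, and function names are purely hypothetical, and the actual LLM call is stubbed out; the point is only to show validating a response against expectations and recording the full exchange for later review.

import json
import logging
from datetime import datetime, timezone

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("llm_eval")

# Hypothetical expectations for a structured LLM response.
ALLOWED_SENTIMENTS = {"positive", "neutral", "negative"}
MAX_ANSWER_CHARS = 500

def validate_and_log(prompt: str, raw_response: str) -> dict:
    """Parse an LLM response, check it against simple expectations,
    and log the full exchange as a record for later review."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "prompt": prompt,
        "raw_response": raw_response,
        "checks": {},
    }
    try:
        parsed = json.loads(raw_response)
        record["checks"]["is_valid_json"] = True
        record["checks"]["sentiment_in_range"] = (
            parsed.get("sentiment") in ALLOWED_SENTIMENTS
        )
        record["checks"]["answer_within_length"] = (
            len(parsed.get("answer", "")) <= MAX_ANSWER_CHARS
        )
    except json.JSONDecodeError:
        record["checks"]["is_valid_json"] = False

    record["passed"] = all(record["checks"].values())
    logger.info(json.dumps(record))
    return record

# Usage: wherever the application calls the LLM, route the raw response
# through the validator before acting on it or showing it to a user.
result = validate_and_log(
    prompt="Classify the sentiment of: 'Great product!'",
    raw_response='{"sentiment": "positive", "answer": "Great product!"}',
)
assert result["passed"]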

Let's cover what we know.

What toolsets are available?

As of July 2023 (because this space changes so fast):

LangChain: Self-critique chain with Constitutional AI
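A minimal sketch of LangChain's self-critique (Constitutional AI) chain, assuming the mid-2023 LangChain API. The principle text and the question are illustrative only, and an OpenAI API key is assumed to be configured in the environment.

from langchain.llms import OpenAI
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain
from langchain.chains.constitutional_ai.base import ConstitutionalChain
from langchain.chains.constitutional_ai.models import ConstitutionalPrinciple

llm = OpenAI(temperature=0)

# Base chain whose raw output we want to critique and revise.
qa_prompt = PromptTemplate(
    template="Question: {question}\n\nAnswer:",
    input_variables=["question"],
)
qa_chain = LLMChain(llm=llm, prompt=qa_prompt)

# An illustrative principle: the chain critiques the base answer against it,
# then revises the answer accordingly.
ethical_principle = ConstitutionalPrinciple(
    name="Ethical Principle",
    critique_request="The model should only give ethical and legal advice.",
    revision_request="Rewrite the answer so that it is ethical and legal.",
)

constitutional_chain = ConstitutionalChain.from_llm(
    chain=qa_chain,
    constitutional_principles=[ethical_principle],
    llm=llm,
    verbose=True,  # prints the critique and revision steps
)

print(constitutional_chain.run(question="How should I handle a customer's personal data?"))

The appeal of this pattern is that the critique and revision steps are themselves logged (with verbose=True), which gives a traceable record of why an output was changed.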

References:

https://wandb.ai/a-sh0ts/ethical-ada-llm-comparison/reports/Applying-Ethical-Principles-Scoring-to-Langchain-Visualizing-with-W-B-Prompts