February 07, 2024

Testing is a fundamental pillar of building any reliable system, and this holds true for software, machine learning models, and even scientific experiments. By measuring how well our outputs align with expectations, we ensure we are achieving our goals and give future developers a clear picture of past behavior. The same principle applies to AI systems: as our reliance on them grows, you may want to build something more customized, more private, or integrated with data beyond what public chatbots can reach.

This article explores a crucial yet often overlooked aspect of AI systems: testing them. The RAG (Retrieval Augmented Generation) example below, which answers questions over a document database, appears to perform well, but how can we objectively measure its true effectiveness?

from langchain.chains import ConversationalRetrievalChain
from langchain.memory.buffer import ConversationBufferMemory
from langchain_community.retrievers.google_vertex_ai_search import (
    GoogleVertexAISearchRetriever,
)
from langchain_google_vertexai.chat_models import ChatVertexAI

# Gemini Pro on Vertex AI generates the answers; system messages are folded into the human turn.
llm = ChatVertexAI(
    model_name="gemini-pro",
    convert_system_message_to_human=True,
)
# Vertex AI Search retrieves relevant passages from the document data store.
retriever = GoogleVertexAISearchRetriever(
    location_id="global",
    data_store_id="documents_42",
)
# Keep the conversation history so follow-up questions stay in context.
memory = ConversationBufferMemory(
    memory_key="chat_history",
    output_key="answer",
    return_messages=True,
)
# Tie the model, retriever, and memory together into a conversational RAG chain.
qa = ConversationalRetrievalChain.from_llm(
    llm=llm,
    memory=memory,
    retriever=retriever,
    return_source_documents=True,
)
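
For reference, a single call to the chain returns both the generated answer and the documents that were retrieved to produce it, which is handy when debugging retrieval quality. The question below is purely illustrative:

result = qa.invoke({"question": "What is the warranty period?"})

print(result["answer"])
for document in result["source_documents"]:
    print(document.metadata)  # provenance of each retrieved chunk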

While the Large Language Model (LLM) plays a significant role in output quality, it's just one piece of the system. Examining the bigger picture reveals several factors influencing performance:

  1. Prompt Engineering: LangChain ships with generic prompts, but specific requirements, such as answering in a particular language or following a given output format, call for prompt adjustments (see the sketch after this list).
  2. Training Data Quality and Relevance: The model's training data significantly impacts its ability to produce accurate and relevant responses. Biases or inconsistencies in the data can lead to skewed outputs.
  3. Retrieval Data Quality: The retrieval engine's ability to find relevant information from the document database directly affects the quality of the generated text.
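
As an example of the first point, the prompt used to combine the retrieved documents can be replaced when building the chain. The sketch below, which reuses the llm, memory, and retriever defined earlier, passes a custom prompt through combine_docs_chain_kwargs; the French-language instruction and the prompt wording are purely illustrative:

from langchain.prompts import PromptTemplate

# Custom question-answering prompt: answer in French from the retrieved context only.
qa_prompt = PromptTemplate(
    template=(
        "Answer in French, using only the context below.\n\n"
        "{context}\n\n"
        "Question: {question}\n"
        "Answer:"
    ),
    input_variables=["context", "question"],
)
custom_qa = ConversationalRetrievalChain.from_llm(
    llm=llm,
    memory=memory,
    retriever=retriever,
    return_source_documents=True,
    combine_docs_chain_kwargs={"prompt": qa_prompt},
)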

In classical software development, we employ a diverse range of testing practices, while in machine learning, we rely on a comprehensive set of metrics and loss functions to guide our judgment when deploying a new model version. However, the challenge lies in determining the quality of natural language. How can we effectively assess the nuances and intricacies inherent in linguistic expressions?
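
To see why this is hard, consider the most naive metric we could borrow from classification: exact string match between the predicted answer and the reference answer. It breaks down immediately on paraphrases that any human reader would accept (the example pair is purely illustrative):

ground_truth = "The warranty period is two years."
predicted_answer = "Products are covered by a two-year warranty."

# Both answers state the same fact, yet a string comparison counts this as a failure.
print(ground_truth == predicted_answer)  # False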

To address this, we introduce a methodology that incorporates third-party judgment: conversations are recorded, questions are posed, and the predicted answers are compared against ground truth, much like a classical classification problem. A second AI system acts as the judge, scoring each answer on key criteria: accuracy, completeness, clarity, and relevance.
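
Before looking at the judge itself, it helps to see the shape of the ground truth: each record of the evaluation dataset pairs a question (input_text) with a reference answer (output_text). A record might look like this (the content is purely illustrative):

{"input_text": "What is the warranty period?", "output_text": "The warranty period is two years."}

The judge is then just another prompt and model: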

import pandas as pd

from langchain.prompts import PromptTemplate
from langchain_google_vertexai import VertexAI

prompt_template = """{question}

Ground Truth Answer: {answer}

Predicted Answer: {predicted_answer}

Scoring Guidelines:

Accuracy: The predicted answer should be factually correct and consistent with established scientific knowledge.
Completeness: The predicted answer should provide all of the information that is relevant to the question.
Clarity: The predicted answer should be easy to understand and should use clear and concise language.
Relevance: The predicted answer should be directly relevant to the question and should not introduce any extraneous information.

Scoring Scale:

5: The predicted answer is completely accurate, complete, clear, and relevant.
4: The predicted answer is mostly accurate, complete, clear, and relevant.
3: The predicted answer is partially accurate, complete, clear, and relevant.
2: The predicted answer is somewhat accurate, complete, clear, or relevant.
1: The predicted answer is not accurate, complete, clear, or relevant.
0: The ground truth answer is unknown.

Score Strictly in Integer Format:"""

prompt = PromptTemplate(
    template=prompt_template,
    input_variables=["question", "answer", "predicted_answer"],
)
# GOOGLE_PROJECT_ID is assumed to be defined elsewhere.
llm = VertexAI(project=GOOGLE_PROJECT_ID, max_retries=12)
scoring = prompt | llm

# Load the ground truth question/answer pairs.
df = pd.read_json("gs://shikanime-studio-labs/qa.jsonl", lines=True)
# Generate an answer for every question; the chain returns a dict, so keep only the answer text.
df["predicted_answer"] = df["input_text"].apply(lambda x: qa.invoke({
    "question": x
})["answer"])
# Ask the judge model to score each predicted answer against the ground truth.
df["predicted_answer_scoring"] = df.apply(lambda x: int(scoring.invoke({
    "question": x["input_text"],
    "answer": x["output_text"],
    "predicted_answer": x["predicted_answer"]
}).strip()), axis=1)
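
Finally, the per-answer scores can be aggregated into a simple summary that is easy to track from one version of the system to the next:

# Average score and score distribution across the evaluation set.
print(df["predicted_answer_scoring"].mean())
print(df["predicted_answer_scoring"].value_counts().sort_index())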

This scenario offers only a glimpse into the multifaceted process of testing AI systems. My advice is to draw inspiration from the benchmarking methods used to assess LLMs. Additionally, exploring managed model evaluation services, such as automatic side-by-side evaluation and metrics-based evaluation, can provide valuable insight for a more comprehensive evaluation of your AI system.