JUNE 26, 2023

Large language models (LLMs) have been a topic of intrigue for me for quite some time. Initially skeptical, I eventually felt the pull to dive into the technology myself rather than just reading articles or examining shiny examples. I wanted to create a small-scale reproduction of what I would deploy for my clients in a production environment. While OpenAI is the most widely known option, I discovered several alternatives such as PaLM on Google Vertex AI, a platform I was already familiar with and had worked on previously. It's worth mentioning that there are other alternatives available as well, such as Hugging Face models, Falcon, and StableLM. Ultimately, I decided to use PaLM as it had recently been released and offered a viable serverless infrastructure. However, I relied on Sentence Transformer for document embeddings, which I'll explain in more detail later.

My primary focus was to develop a proof of concept that would initially work in a notebook for rapid experimentation with the technology. That's why I chose Langchain, a popular library with numerous integrations, prompt templates, and ingestion mechanisms, such as loading PDFs from Google Drive into a vector store. It has been truly impressive to see so many people outside the tech sphere embracing Python and ML tools. As someone who has experienced nightmares with dependencies, unresolved pip installations, and system dependency issues, I can appreciate the ease of use that Langchain provides.


Returning to Langchain, it can be likened to what Keras is to TensorFlow, but for conversing with LLMs or enabling them to carry out various tasks. The library is divided into different modules:

  1. LLMs: This module enables interaction with different types of LLMs using a common interface.
  2. Chains: Here we reach the core concept of Langchain, the one that makes the library truly magical. Chains provide the abstraction users interact with, and they can be serialized to JSON, much like Hugging Face's pipeline (a short sketch follows after this list).
  3. Memory: The preservation of previous messages and context is vital for reproducing a ChatGPT-like interface. Memory allows for a discussion with greater context.
  4. Agent: This is the most advanced concept within Langchain. While it still relies on conversation as its medium, an agent can interact with multiple conversations or APIs. Since LLMs have no built-in notion of ground truth, it becomes necessary to bring external logic and information into the system, for example to fact-check news.


Langchain components illustration by Syed Hyder Ali Zaidi
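
To make these modules a little more concrete, here is a minimal sketch combining the LLMs, Chains, and Memory modules. It is not taken from my notebook: the prompt wording and the question are purely illustrative, and it assumes Vertex AI credentials are already configured.

from langchain.llms import VertexAI
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain
from langchain.memory import ConversationBufferMemory

# Prompt template with a slot for the running conversation history.
prompt = PromptTemplate(
    input_variables=["history", "question"],
    template="Previous conversation:\n{history}\n\nQuestion: {question}\nAnswer:",
)

# The chain ties together the LLM, the prompt template and the memory.
chain = LLMChain(
    llm=VertexAI(),
    prompt=prompt,
    memory=ConversationBufferMemory(memory_key="history"),
)

print(chain.run(question="What does the Memory module do?"))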

For my experiment, I used all of those modules except Agents in my notebook, despite encountering several dependency and authentication issues. The library is enjoyable to use, and I wanted to share my experience with others, including colleagues at SFEIR and friends, to show that we can also have our own GPT-like models at home. My goal was therefore to build a cost-efficient, completely serverless service while maintaining control over the entire stack and avoiding reliance on magical SaaS solutions. This service would act as my librarian, ingesting all the books, papers, and documentation I had collected in a Google Drive and testing how well I remembered them.

from langchain.llms import VertexAI
from langchain.vectorstores import Chroma
from langchain.embeddings import VertexAIEmbeddings
from langchain.document_loaders import GoogleDriveLoader
from langchain.chains import ConversationalRetrievalChain

# Load every document from the Google Drive folder, including subfolders.
loader = GoogleDriveLoader(
    folder_id="1daABjn2QXHMFUK_LUvVRlbUdTTc8nOWe",
    recursive=True,
)

# Embed the documents with Vertex AI and index them in Chroma.
embeddings = VertexAIEmbeddings()
vectorstore = Chroma.from_documents(loader.load(), embeddings)

# PaLM on Vertex AI answers questions over the retrieved passages.
llm = VertexAI()
qa = ConversationalRetrievalChain.from_llm(
    llm,
    vectorstore.as_retriever(),
    return_source_documents=True,
)
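
Querying the chain is then just a matter of passing a question together with the running chat history. The question below is only a placeholder.

# Ask a question with an empty history; the answer comes back with its sources.
result = qa({"question": "What is this folder's documentation about?", "chat_history": []})
print(result["answer"])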

The first challenge I needed to address was where to store the vector representations of the embeddings. Initially, I used Chroma, a fantastic database for experimentation. However, I eventually discovered Deep Lake, an open-source library that, as the name suggests, works much like a data lake format such as Iceberg by separating storage from compute. This was incredibly convenient, as it eliminated the need for a costly vector database cluster.

During the ingestion process, which involved approximately 9,000 pages, I encountered a limitation with PaLM: its quota allowed a bucket of only 60 calls per minute, and given the pricing, it could become expensive if I scaled the system to a company-wide dataset. To address this, I opted for a "lighter" solution and loaded Sentence Transformer, an open-source pretrained embedding model that fits comfortably within a memory footprint of less than 4 GB, file system included (a ballpark figure).

from langchain.vectorstores import DeepLake
from langchain.embeddings import HuggingFaceEmbeddings

# Sentence Transformer embeddings run locally, avoiding the PaLM quota.
embeddings = HuggingFaceEmbeddings()

# Deep Lake stores the vectors directly in a GCS bucket, no cluster needed.
vectorstore = DeepLake(
    dataset_path="gcs://shikanime-studio-labs-sfeir-hivemind-deep-lake-datasets/books/",
    embedding_function=embeddings,
)
vectorstore.add_documents(loader.load())
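
Because ingestion and question answering run as separate invocations, the query side only has to reattach to the same Deep Lake dataset. Here is a minimal sketch of that step, assuming the same GCS dataset path; the read_only flag is my assumption about how to open the dataset without taking a write lock.

from langchain.llms import VertexAI
from langchain.chains import ConversationalRetrievalChain
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import DeepLake

# Reattach to the dataset populated during ingestion (no re-embedding needed).
embeddings = HuggingFaceEmbeddings()
vectorstore = DeepLake(
    dataset_path="gcs://shikanime-studio-labs-sfeir-hivemind-deep-lake-datasets/books/",
    embedding_function=embeddings,
    read_only=True,  # assumption: open the dataset without a write lock
)

# Same retrieval chain as before, now backed by Deep Lake instead of Chroma.
qa = ConversationalRetrievalChain.from_llm(
    VertexAI(),
    vectorstore.as_retriever(),
    return_source_documents=True,
)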

One of the significant challenges I encountered during my experimentation was how to manage stateful conversations. Since serverless environments are designed to be stateless, preserving the conversation state across multiple function invocations can be quite tricky. However, I found a simple yet effective solution to this problem by leveraging Firestore, a serverless document database provided by Google Cloud.

To implement this solution, I created a Firestore collection for storing conversation documents. Each document represented a conversation and contained relevant metadata and the messages exchanged during the conversation. Firestore's document-centric approach made it easy to manage and query conversations based on various parameters such as user ID and session ID.

When a new message arrived, I would retrieve the conversation document from Firestore, update it with the new message, and then store it back into Firestore. This approach ensured that the conversation state was preserved across function invocations, enabling a seamless and context-aware conversation experience.

from langchain.memory.chat_message_histories import FirestoreChatMessageHistory

# session_id, user_id and question come from the incoming request.
chat_message_history = FirestoreChatMessageHistory(
    collection_name="chat_history",
    session_id=session_id,
    user_id=user_id,
)

# Answer with the stored history as context, then persist both messages.
result = qa(
    {"question": question, "chat_history": chat_message_history.messages}
)
chat_message_history.add_user_message(question)
chat_message_history.add_ai_message(result["answer"])
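
To show where this last snippet would live in the serverless deployment, here is a rough sketch of an HTTP Cloud Function wrapping the same logic. The use of functions_framework, the function name, and the request payload fields are my assumptions rather than a description of the actual deployment.

import functions_framework
from langchain.memory.chat_message_histories import FirestoreChatMessageHistory

# Assumes `qa` (the retrieval chain built above) is created at module load,
# so it is reused across invocations of the same function instance.

@functions_framework.http
def ask(request):
    # Hypothetical payload: {"question": ..., "session_id": ..., "user_id": ...}
    payload = request.get_json()

    # Rehydrate the conversation from Firestore for this user and session.
    history = FirestoreChatMessageHistory(
        collection_name="chat_history",
        session_id=payload["session_id"],
        user_id=payload["user_id"],
    )

    result = qa({"question": payload["question"], "chat_history": history.messages})

    # Persist both sides of the exchange before returning the answer.
    history.add_user_message(payload["question"])
    history.add_ai_message(result["answer"])
    return {"answer": result["answer"]}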