January 22, 2024

From collection to dashboard insights, from model design to training, data reigns supreme. But what if you possess the data yet are not allowed to use it? This is a prevalent challenge in the professional world, often driven by data regulations like the GDPR, which we may love to hate but are ultimately grateful exists.

Data security regulations mandate that production data must not leak into other environments, including those used during the development cycle, a principle that is particularly crucial for external consultants. So, how can we address this challenge? The most common approach is to anonymize the data, either with irreversible cryptographic techniques such as hashing, or with reversible ones such as format-preserving encryption.
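
To make the irreversible route concrete, here is a minimal sketch of salted hashing applied to an identifier; the helper name and salt value are purely illustrative, not part of any standard pipeline.

import hashlib

def pseudonymize(value: str, salt: str = "not-a-real-salt") -> str:
    # One-way, salted hash: the original identifier cannot be recovered,
    # yet identical inputs always map to the same token, so joins still work.
    return hashlib.sha256(f"{salt}:{value}".encode()).hexdigest()[:12]

anonymized_client_id = pseudonymize("991")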

All of these methods are tedious and time-consuming, often requiring extensive preparation and data-decontamination procedures to keep the data both safe and useful. LLMs, on the other hand, are a rapidly evolving type of machine learning model that can produce “new” data, something conventional anonymization methods cannot do.

Well, this generative property is really interesting! What if we could generate new data for our PoC experiments, similar to traditional data augmentation but with more dimensions than random cropping, rotation, or shifting? To illustrate the concept, let's imagine a furniture store with a limited collection. I'm provided with just a few lines of data in a CSV file, which is insufficient for comprehensive analysis. Moreover, I'm told that additional niche categories exist, such as dinosaur plushies and rice cookers, which are not represented in the dataset.

date,order_id,client_id,product_id,product_price,product_quantity
01/01/21,1234,991,490756,50,1
01/01/21,1234,991,389728,3.56,4
01/01/21,3456,420,490756,50,2
01/12/20,3456,420,549380,300,1
01/12/20,3456,420,293718,10,6

I have a fairly good understanding of the store's range of products. It sells a wide variety of furniture and goods, from kitchenware to Christmas trees and dinosaur plushies. The IDs appear to be plain integers. The dates can be assumed to be in European day-first notation (DD/MM/YY), also known as the right way. We can also assume that there are more than two product types. While these assumptions may not be completely accurate, they should be sufficient for our purposes. The key point is that we need a basic understanding of our data in order to imagine new data we might plausibly encounter in reality.
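
Before prompting anything, it helps to have that small CSV loaded as a DataFrame. A minimal sketch, assuming the file is named fct_transaction.csv (the name is my own choice for illustration); the resulting fct_transaction DataFrame is reused in the code below.

import pandas as pd

# File name assumed for illustration; the date column is kept as a plain string
# in the day-first notation discussed above.
fct_transaction = pd.read_csv("fct_transaction.csv", dtype={"date": str})
print(fct_transaction.head())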

Now that we have a fundamental grasp of our imaginary data, we can embark on crafting our ideal magical realm. The renowned library LangChain may immediately come to mind; although primarily employed for building interactive chats, it offers a wealth of components that extend far beyond conversational interactions.

from langchain.prompts import FewShotPromptTemplate, PromptTemplate
from langchain_experimental.tabular_synthetic_data.openai import (
    OPENAI_TEMPLATE,
)
from langchain_experimental.tabular_synthetic_data.prompts import (
    SYNTHETIC_FEW_SHOT_PREFIX,
    SYNTHETIC_FEW_SHOT_SUFFIX,
)

example_prompt = PromptTemplate(
    input_variables=[
        "order_id",
        "client_id",
        "product_id",
        "product_price",
        "product_quantity",
        "date",
    ],
    template=(
        "Order ID: {order_id}, "
        "Client ID: {client_id}, "
        "Product ID: {product_id}, "
        "Product Price: {product_price}, "
        "Product Quantity: {product_quantity}, "
        "Date: {date}"
    ),
)

# Format every real transaction as a one-line example string for the few-shot prompt
example_df = fct_transaction.apply(
    lambda row: example_prompt.format(
        order_id=row["order_id"],
        client_id=row["client_id"],
        product_id=row["product_id"],
        product_price=row["product_price"],
        product_quantity=row["product_quantity"],
        date=row["date"],
    ),
    axis=1,
).to_frame("example")

# Assemble the few-shot prompt: generic prefix, our formatted examples, then the suffix
prompt = FewShotPromptTemplate(
    prefix=SYNTHETIC_FEW_SHOT_PREFIX,
    examples=example_df.to_dict("records"),
    suffix=SYNTHETIC_FEW_SHOT_SUFFIX,
    input_variables=["subject", "extra"],
    example_prompt=OPENAI_TEMPLATE,
)

The first step is to craft a prompt whose surrounding context is populated with a few examples, giving the model a grasp of the data structure. This technique, known as Few-Shot Learning, lies between Zero-Shot Learning and Chain-of-Thought prompting. It's similar to teaching a child: "The dog is really cute, Mom is beautiful... What about Dad?"

While there may be no strict science to predict the exact outcome of a given situation, we often have a general sense of how it will unfold. Similarly, Few-Shot Learning allows LLMs to generalize from limited examples, achieving a level of performance surprisingly close to that of models fine-tuned on the task.
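
If you are curious about what the model will actually see, you can render the assembled few-shot prompt yourself; the values passed here are mere placeholders for inspection.

# Render the full prompt: prefix, the formatted examples, then the suffix
# carrying the subject and the extra generation hints.
print(
    prompt.format(
        subject="sale_transaction",
        extra="placeholder instructions, for inspection only",
    )
)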

The next step is to set up our generator. I chose OpenAI as the LLM over the alternatives because, in my own benchmarking, it was the only one that showed decent output for relatively complex reasoning while remaining consistent over time.

from langchain.chat_models.openai import ChatOpenAI
from langchain_experimental.tabular_synthetic_data.openai import (
    create_openai_data_generator,
)
from langchain_experimental.pydantic_v1 import BaseModel

class Transaction(BaseModel):
    order_id: int
    client_id: int
    product_id: int
    product_price: float
    product_quantity: int
    date: str  # kept as a string, matching the day-first format of the source CSV

llm = ChatOpenAI(temperature=1)

# Wire the schema, the LLM and the few-shot prompt together into a data generator
synthetic_data_generator = create_openai_data_generator(
    output_schema=Transaction,
    llm=llm,
    prompt=prompt,
)

As the exciting phase commences, pay close attention to the temperature setting, which should be set to 1 to ensure maximum variance in the generated outputs. At this juncture, I define the subject of the generation, while the additional instructions outline precisely how I want the model to generate my imaginary data. There are no hard-and-fast rules governing the writing of these instructions, but my experience dictates that succinct, precise, and authoritative language proves most effective. Avoid the temptation to engage in politesse; it's not sentience... for the moment at least. If you do, the model will take far more liberties with the generation than you'd appreciate.

synthetic_results = synthetic_data_generator.generate(
    subject="sale_transaction",
    extra=(
        "the client_id, product_price, product_quantity and date must be "
        "chosen at random. Make it something you wouldn't normally choose."
    ),
    runs=10,
)
fct_transaction_synthetic_df = pd.DataFrame(
    [result.dict() for result in synthetic_results]
)

That's just the beginning! You've now unlocked the ability to generate your own data. However, it's important to remember that this dataset should not be considered the ultimate representation of reality, as it may not perfectly mirror the true world. Instead, it serves as a valuable tool for augmenting your imbalanced dataset or adding diversity to your data pool.
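
For instance, here is a minimal sketch of folding the synthetic rows back into the original DataFrame, with a flag so downstream analyses can always tell the two apart; the is_synthetic column is my own convention, not part of the generator's output.

# Tag each source before combining so provenance is never lost
fct_transaction["is_synthetic"] = False
fct_transaction_synthetic_df["is_synthetic"] = True

fct_transaction_augmented = pd.concat(
    [fct_transaction, fct_transaction_synthetic_df],
    ignore_index=True,
)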