AUGUST 3, 2023

In the world of natural language processing (NLP), one of the fundamental tasks is to measure the similarity between pieces of text. Whether it's finding similar product names, recommending related articles, or matching user queries, text similarity search plays a crucial role in various applications. In this article, we will explore how to create a simple text similarity search system using a pre-trained TensorFlow text encoder model and BigQuery ML.

Introduction to Text Embeddings

Text embeddings are dense vector representations of words or sentences that capture the semantic relationships between them: words or sentences with similar meanings map to vectors that lie close together in a high-dimensional space. These embeddings are produced by language models such as Google's Universal Sentence Encoder (USE), which is pre-trained on a large corpus of text.
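To make "close to each other" concrete, here is a minimal Python sketch that compares toy vectors. It assumes cosine similarity as the closeness measure, which is a common choice but not one this article has specified at this point. The three vectors and their names are invented for illustration and do not come from any real model.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy 3-dimensional "embeddings", made up for illustration only.
# Real sentence embeddings have hundreds of dimensions.
cat = [0.9, 0.1, 0.0]
kitten = [0.8, 0.2, 0.1]
invoice = [0.0, 0.2, 0.9]

print(cosine_similarity(cat, kitten))   # close to 1: vectors point the same way
print(cosine_similarity(cat, invoice))  # close to 0: vectors are nearly orthogonal
```

A score near 1 means two texts would be treated as similar, and a score near 0 means they are unrelated. A similarity search ranks candidates by this kind of score.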

The Universal Sentence Encoder encodes sentences into fixed-length vectors, which makes it a good fit for text similarity search. A pre-trained version of the model is available on TensorFlow Hub, and that is the one we will use in this article.

Importing your ML model into BigQuery

In this section, we will walk through downloading a SavedModel from TensorFlow Hub and importing it into BigQuery.

Download the Universal Sentence Encoder SavedModel using the "wget" command-line tool and save it to the desired location:

wget \
    -O universal-sentence-encoder-multilingual-large-v3.tar.gz \
    "https://tfhub.dev/google/universal-sentence-encoder-multilingual-large/3?tf-hub-format=compressed"

Extract the downloaded model files:

mkdir -p universal-sentence-encoder-multilingual-large-v3

tar -xf \
    universal-sentence-encoder-multilingual-large-v3.tar.gz \
    -C universal-sentence-encoder-multilingual-large-v3
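Before uploading, it can be worth confirming that the archive unpacked into something that looks like a SavedModel. The sketch below is a minimal Python check, assuming the usual SavedModel layout of a `saved_model.pb` file next to a `variables/` directory. The helper name `missing_saved_model_parts` is hypothetical and not part of any library.

```python
from pathlib import Path

def missing_saved_model_parts(model_dir):
    """Return the expected SavedModel entries that are absent from model_dir."""
    root = Path(model_dir)
    missing = []
    # The serialized graph and signatures (assumed binary .pb form).
    if not (root / "saved_model.pb").is_file():
        missing.append("saved_model.pb")
    # The directory holding the trained weights.
    if not (root / "variables").is_dir():
        missing.append("variables/")
    return missing

if __name__ == "__main__":
    problems = missing_saved_model_parts(
        "universal-sentence-encoder-multilingual-large-v3"
    )
    print("layout looks complete" if not problems else f"missing: {problems}")
```

An empty list means both entries were found. It does not prove the model itself is valid, only that the extraction produced the expected top-level structure.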

Upload the model files to your Google Cloud Storage (GCS) bucket using the gsutil command-line tool. The -r flag is needed so that subdirectories such as variables/ are copied along with the top-level files:

gsutil cp -r \
    universal-sentence-encoder-multilingual-large-v3/* \
    gs://shikanime-studio-labs/universal-sentence-encoder-multilingual-large-v3

Now, with the Universal Sentence Encoder SavedModel available in your GCS bucket, you can import it into BigQuery. The statement below assumes that a dataset named search already exists in your project:

CREATE OR REPLACE MODEL search.universal_sentence_encoder_large
OPTIONS(
    model_type='tensorflow',
    model_path='gs://shikanime-studio-labs/universal-sentence-encoder-multilingual-large-v3/*'
);

If everything is set up correctly, you should see information about the model's details and I/O schema.

BigQuery model details


In this section, we downloaded a TensorFlow SavedModel from TensorFlow Hub and imported it into BigQuery. With the Universal Sentence Encoder model imported, we can generate text embeddings directly within BigQuery and use them for text similarity search and other natural language processing tasks. The next step is to build our simple text similarity search system on top of these embeddings.

Creating the Text Embeddings in BigQuery

For demonstration purposes, we will use an example dataset of BBC news.