Semantic Search on Documents with OpenAI and Pinecone

Introduction

As AI-based tools continue to grow in popularity, it is becoming easier to quickly sift through many types of documents and extract valuable content. Semantic search, a technique that focuses on understanding the meaning of words, phrases, and concepts rather than merely matching keywords, offers a powerful solution here. In this blog post, we'll walk you through the process of performing semantic search on documents, including PDFs, using OpenAI and Pinecone. By applying these techniques, you can make your information retrieval tasks across multiple document formats easier and more efficient. We will cover the following topics:

  1. How to create vector embeddings using an OpenAI model.
  2. How to store those embeddings into Pinecone.
  3. How to search with a query using Pinecone.

You can check out the code from this post on GitHub.

Creating Vector Embeddings with OpenAI Model

Semantic search requires the creation of vector embeddings, which are numerical representations of text that facilitate comparison and mathematical analysis. OpenAI provides dedicated embedding models, such as text-embedding-ada-002, for generating these embeddings. To create embeddings using OpenAI, follow these steps:

a. Install the required packages:
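A minimal install covering the steps that follow (this assumes the pre-1.0 openai SDK and the classic pinecone-client package):

```shell
pip install openai pinecone-client pdfplumber
```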

b. Set up OpenAI:
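As a sketch, assuming the pre-1.0 openai SDK and an OPENAI_API_KEY environment variable, setup can be as simple as:

```python
import os
import openai

# Read the key from the environment rather than hard-coding it in the script.
openai.api_key = os.environ.get("OPENAI_API_KEY")
```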

c. Convert the PDF's text into a string format using a library like PyPDF2 or pdfplumber. For this example, we'll use pdfplumber:

In this example, we're using this PDF, an essay on essay writing.
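One way to pull the text out with pdfplumber (the filename here is a placeholder for whichever PDF you're using):

```python
import pdfplumber

def extract_text(pdf_path):
    """Concatenate the text of every page into a single string."""
    pages = []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            page_text = page.extract_text()
            if page_text:  # extract_text() returns None for pages with no text layer
                pages.append(page_text)
    return "\n".join(pages)

text = extract_text("essay.pdf")
```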

d. Split the text into smaller chunks (e.g., sentences or paragraphs) and create embeddings:

Storing Embeddings into Pinecone

With the embeddings created, we need to store them in a vector database to enable efficient searching. Pinecone, a managed vector database service, is perfect for this task.

To store embeddings in Pinecone, follow these steps:

a. Initialize Pinecone:

Please note that the dimension parameter must match the output dimensionality of the embedding model you're using, not your chunk size: OpenAI's text-embedding-ada-002, for example, produces 1536-dimensional vectors regardless of input length.
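Assuming the classic pinecone-client (pinecone.init / pinecone.create_index), with a placeholder index name and environment, initialization might look like:

```python
import os
import pinecone

pinecone.init(api_key=os.environ.get("PINECONE_API_KEY"), environment="us-west1-gcp")

# 1536 matches the output dimension of text-embedding-ada-002.
pinecone.create_index("semantic-search-demo", dimension=1536, metric="cosine")
index = pinecone.Index("semantic-search-demo")
```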

b. Upload the previously created embeddings to the Pinecone index created in the previous step:
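A sketch of the upload step; the helper name, the "essay" namespace, and the per-chunk metadata layout are choices made here for illustration:

```python
def upsert_chunks(index, chunks, embeddings, namespace="essay"):
    """Upsert (id, vector, metadata) tuples; storing the chunk text as metadata
    lets each search result map back to the passage it came from."""
    vectors = [
        (str(i), embedding, {"text": chunk})
        for i, (chunk, embedding) in enumerate(zip(chunks, embeddings))
    ]
    index.upsert(vectors=vectors, namespace=namespace)
```

With the objects from the earlier steps, this is just upsert_chunks(index, chunks, embeddings).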

Searching with a Query using Pinecone

Now that the embeddings are stored in Pinecone, we can perform semantic search using a query. To search for a query in Pinecone, follow these steps:

a. Create an embedding for the query using the OpenAI model:
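The query must be embedded with the same model used for the documents; a small wrapper (again assuming the pre-1.0 openai SDK) might be:

```python
def embed_query(query):
    """Embed the search query with the same model used for the document chunks."""
    import openai  # assumes the setup step above has already run
    response = openai.Embedding.create(model="text-embedding-ada-002", input=[query])
    return response["data"][0]["embedding"]
```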

b. Query Pinecone for similar embeddings and display results:
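A sketch of the query step; top_k, the namespace, and the printed format are illustrative, and dict-style access to the response assumes the classic pinecone-client:

```python
def search(index, query_embedding, top_k=3, namespace="essay"):
    """Return the top_k nearest chunks and print a score/preview line for each."""
    results = index.query(
        vector=query_embedding,
        top_k=top_k,
        include_metadata=True,
        namespace=namespace,
    )
    for match in results["matches"]:
        print(f"{match['score']:.3f}  {match['metadata']['text'][:80]}")
    return results
```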

c. Optionally, once you're done, clean up by deleting the namespace and deinitializing Pinecone:
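Cleanup under the same assumptions (classic pinecone-client, placeholder namespace and index name):

```python
# Remove the stored vectors, then delete the index itself.
index.delete(delete_all=True, namespace="essay")
pinecone.delete_index("semantic-search-demo")
```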

Conclusion

In this blog post, we have demonstrated how to perform semantic search on documents, including PDFs, using OpenAI and Pinecone. We showed you how to create vector embeddings using an OpenAI model, store those embeddings in Pinecone, and search for a query using Pinecone. This approach allows you to quickly and efficiently find relevant information in various document formats by considering the meaning of the text rather than just matching keywords.

It's worth noting that you can also achieve semantic search using open-source language models and vector search databases. For instance, you can use Hugging Face's Transformers library with models like BERT, GPT-2, or RoBERTa for generating embeddings, and then use open-source vector search databases like FAISS or Annoy to store and search those embeddings efficiently.

With these powerful tools at your disposal, you can tackle information retrieval tasks with ease and precision, whether you choose to use the OpenAI API and Pinecone or opt for open-source alternatives.
