From Beginner to Advanced

Lowdown on Vector Databases

Part 2: Build a Semantic Application to Find the Right Questions

Arup Nanda

--

A mirror with the word ASK reflected on it
Photo by Brett Jordan on Unsplash

In Part 1 of this series you learned the fundamentals of vectors and how they enable “similarity” search, which differs from the typical value-centric search in databases. You also saw how we created a vector database using an open-source product called ChromaDB and used it to search for semantically similar values in textual data. In this part you will learn how to build a more complex, real-world example.

Please note, the objective is to demonstrate the concepts of a vector database; ChromaDB is merely used as an example. There are many vendors in this space, and it is not the objective of this article to discuss them.

Constructing the Right Question

I am sure you are aware of Wikipedia, which holds many interesting pieces of information in the form of questions and answers. You can get many of your queries answered from Wikipedia, assuming you know the specific question to ask. But what if your specific question does not exist in Wikipedia, though something similar in meaning does? For example, you want to ask:

“Why did Americans fight their own”

Unfortunately there is no such question in the question bank we have. You are probably referring to the American Civil War, but at this moment you don’t remember the exact term. You want to see which available questions are semantically similar to the one you have in mind, so that you can pick one and look up the answers to that particular question.

Let’s build a vector database of all possible questions and use the vector technology to find questions nearest in meaning to the one we are asking.

As you learned earlier, vector databases are specialized databases that store language data as vectors representing that data. Vectors allow similarity between data items to be determined by computing the distance between them in a very large multi-dimensional space. The shorter the distance, the more similar the data values are to each other. This is crucial in semantic, language-based searches, where what matters is the meaning of the words in a sentence, not the actual values.
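To make the distance idea concrete, here is a minimal sketch using made-up 3-dimensional vectors (real sentence embeddings have hundreds of dimensions, but the distance math is identical):

```python
import numpy as np

# Toy 3-dimensional "embeddings" — invented for illustration only,
# not real sentence-transformer output.
civil_war = np.array([0.9, 0.1, 0.2])
pioneers  = np.array([0.1, 0.8, 0.3])
query     = np.array([0.8, 0.2, 0.1])

# Euclidean (L2) distance: a shorter distance means a closer meaning.
d_civil    = np.linalg.norm(query - civil_war)
d_pioneers = np.linalg.norm(query - pioneers)

print(f"query vs civil_war: {d_civil:.4f}")
print(f"query vs pioneers:  {d_pioneers:.4f}")
# The civil_war vector lies nearer to the query, so it ranks first.
```

The same ranking-by-distance is what the vector database does at scale, over thousands of stored vectors at once.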

A complete Jupyter Notebook implementing the application appears below. Here is a summary of its activities:

  1. Get the questions dataset called wiki_qa from Hugging Face.
  2. Extract only the questions. There will be duplicates, since each question has multiple answers.
  3. Remove the duplicate questions.
  4. Create a language-based vector for each question using the sentence-transformer utility we saw in Part 1.
  5. Use your specific question as the query input and search for semantically similar questions in the question bank.
  6. Return the 3 most similar questions from the question bank (3 is just a limit; it is configurable).
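Steps 2 and 3 — extracting the questions and removing duplicates — can be sketched as follows. The `rows` list here is a hypothetical stand-in for wiki_qa records (in the notebook the real data comes from `load_dataset("wiki_qa")` in the datasets package), but the deduplication logic is the same:

```python
# Hypothetical stand-in for wiki_qa rows: one row per answer,
# so the same question appears multiple times.
rows = [
    {"question_id": "Q779", "question": "what made the civil war different from others"},
    {"question_id": "Q779", "question": "what made the civil war different from others"},
    {"question_id": "Q80",  "question": "when was america pioneered"},
]

# Keep only the first occurrence of each question id, preserving order.
seen = set()
questions = []
for row in rows:
    if row["question_id"] not in seen:
        seen.add(row["question_id"])
        questions.append((row["question_id"], row["question"]))

print(len(questions))  # 2 unique questions remain
```

Deduplicating before embedding matters: otherwise the same question would be vectorized and stored several times, wasting space and cluttering the search results.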

Please make sure you install the following packages from PyPI:

  • datasets (the package to pull the data directly from Hugging Face)
  • chromadb (the vector technology we will be using for the example)
  • sentence_transformers (to create the vectors from the language data)
  • tqdm (to display a progress bar while updating the vector database)

Now, here is the Jupyter Notebook:

Notebook containing the entire application to find semantically similar questions

Note how we got the three semantically similar questions from the dataset:

Distance   ID  Question
0.986254  779  what made the civil war different from others
1.121846   80  when was america pioneered
1.144610  826  what triggered the civil war

My search question was:

why did Americans fight their own

The vector database searched for and found these 3 semantically similar questions. The stress here is on “semantically similar,” not “alike.” For instance, look at the closest question:

0.986254  779 what made the civil war different from others

There is no mention of “Americans” or “fight their own”; yet the database correctly inferred the semantic similarity to the original search text and found the question to have a similar meaning. This is the power of semantic search. Of course, once you see the actual question in the question bank, you can search for its answers in the wiki_qa dataset using normal search methods.

Conclusion

In this part you saw a fully featured application that searches a large database using vector techniques to find semantically similar data. In this case the data was a collection of questions from Wikipedia. The question you had in mind was not exactly in the question bank, and you wanted to find the nearest ones in meaning, not in actual words. You saw how vector search allowed you to do that.

In the third and final part of the series, you will learn about some other uses of vector databases and how they differ from other database technologies such as graph databases.

--

Arup Nanda

Award winning data/analytics/ML and engineering leader, raspberry pi junkie, dad and husband.