From beginner to advanced

Lowdown on Vector Databases

Part 3: Other Uses and Comparisons

Arup Nanda

--

Photo by Pietro Jeng on Unsplash

In the first two parts of this article series, you learned what a vector database is (Part 1) and how to create and use one with a free, open-source tool named ChromaDB, building a full application for a question recommender system (Part 2). In this part, you will learn where vector databases can be used, as well as how they differ from other technologies such as graph databases.

Extending Large Language Models

Since vector databases excel at finding similarities between data points by comparing their distances from each other in a multi-dimensional space, rather than comparing the actual values, they are perfect for finding similarity in meaning. Where does “meaning” matter most? You guessed it: in Large Language Models (LLMs), where the data is text-based and the meaning of the terms, the relationships between them, and the interplay of various words are all important.

You are probably familiar with several LLMs such as ChatGPT, Bard, etc. These models are trained on a variety of datasets, such as Wikipedia, Reddit, and so on, but not on your dataset. For instance, if you want an LLM to evaluate multiple writing samples of yours for quality, how do you suppose that will work? The model does not have your data, i.e., your writing samples. One option is to train the model with your specific writings (the “training” data), but that is simply not practical. Training an LLM usually involves massive computing resources, which is expensive and takes a lot of time, so LLMs are retrained very infrequently. Adding your own training data to retrain the model is not usually an option unless the model is very small and can be cheaply and quickly retrained. Most LLMs are not small.

So, without retraining, how can the model know about your data? This is where vector databases come in: you can create a vector database and add your data as vectors, which can then be used to compare meaning, not just values. The original model remains intact, but the vector database provides the additional data it needs to be effective.
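To make this concrete, here is a minimal sketch of the pattern using ChromaDB, as in Part 2. The collection name, the documents, and the final LLM call are hypothetical placeholders; the point is that snippets retrieved by meaning are passed to the model along with the question.

```python
import chromadb

# In-memory client; ChromaDB's default embedding function turns
# each document into a vector automatically.
client = chromadb.Client()
collection = client.create_collection(name="my_writings")  # hypothetical name

# Store your own data (here, writing samples) as vectors.
collection.add(
    ids=["w1", "w2"],
    documents=[
        "My essay on the economics of remote work...",
        "My short story draft about a lighthouse keeper...",
    ],
)

# At question time, fetch the most relevant samples by meaning.
question = "How strong is my argumentative writing?"
results = collection.query(query_texts=[question], n_results=2)
context = "\n".join(results["documents"][0])

# The retrieved context rides along in the prompt, so the model can
# answer about data it was never trained on.
prompt = f"Context:\n{context}\n\nQuestion: {question}"
# answer = call_your_llm(prompt)  # hypothetical LLM call
```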

Long Term Memory of LLMs

The fact that LLMs are trained infrequently has another consequence. Recall that ChatGPT was trained on data only up to 2021. What if you ask a question whose answer lies after that point in time? For instance, if you ask it about the financial picture of Twitter, it will spit out data and an assessment assuming that Twitter is still a public company (to be fair, it does give a warning that its data is dated and could be wrong). However, as you know, Twitter used to be a public company but is private now, so all that financial assessment is irrelevant. ChatGPT does not know that and will provide lucid, clearly articulated content without realizing its inaccuracy. This is often referred to as hallucination, much like how humans hallucinate and assume they are looking at something that does not actually exist; to them it is reality. Hallucination is a serious problem in infrequently trained LLMs. One way to reduce it is to store additional data to help in the model’s execution of the query, akin to bolstering its long-term memory. This long-term memory uses a vector database.
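A sketch of that idea follows: facts that post-date the model’s training cutoff are added to a collection as they arrive, then recalled by meaning at question time. The collection name, the fact, and the date metadata are all made up for illustration.

```python
import chromadb

client = chromadb.Client()
memory = client.get_or_create_collection(name="llm_long_term_memory")  # hypothetical

# Keep adding facts that post-date the model's training cutoff.
memory.add(
    ids=["fact-001"],
    documents=["Twitter was taken private in October 2022."],
    metadatas=[{"as_of": "2022-10-27"}],
)

# Before answering, pull the freshest relevant facts into the prompt.
recalled = memory.query(
    query_texts=["What is Twitter's financial picture?"],
    n_results=1,
)
print(recalled["documents"][0])  # the stored fact, found by meaning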

Data Taxonomy

But that’s not all. Consider any database where semantic information is needed to judge similarity of meaning, not actual values. One example is comparing metadata for similar information, such as recognizing that “biological age” and “chronological number” are similar, or that “person” and “people” are similar in meaning. Using a large language model to extract the semantic relationships and then querying with vector search can be quite helpful for developing an ontology of data models.
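As a minimal sketch, the sentence-transformers library (the same family of models ChromaDB uses for its default embeddings) can score how close two field names are in meaning; the column names below are hypothetical.

```python
from sentence_transformers import SentenceTransformer, util

# A small, widely used sentence-embedding model.
model = SentenceTransformer("all-MiniLM-L6-v2")

# Hypothetical metadata field names from two data models.
columns = ["person", "people", "biological age", "dress size"]
vectors = model.encode(columns)

# Cosine similarity: higher means closer in meaning.
scores = util.cos_sim(vectors, vectors)
print(f"person vs people:     {scores[0][1]:.2f}")  # high
print(f"person vs dress size: {scores[0][3]:.2f}")  # low
```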

Recommendation Engine

Recommendation engines need to know the likeness of text in meaning, not in actual value. You saw an example of one in Part 2 of this series, where you passed a question that does not actually exist in the question bank and the system recommended the three questions closest to what you were looking for. The same principle can be applied to many other recommender systems.
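At its heart, such a recommender is a single similarity query. Here is a minimal sketch with a tiny, made-up stand-in for the question bank built in Part 2:

```python
import chromadb

client = chromadb.Client()
questions = client.create_collection(name="question_bank")  # hypothetical name

# A tiny stand-in for the question bank built in Part 2.
questions.add(
    ids=["q1", "q2", "q3", "q4"],
    documents=[
        "what made the civil war different from others",
        "what caused the american revolution",
        "how did the cold war end",
        "who won the world cup in 1998",
    ],
)

# Recommend the closest questions by meaning, not by matching words.
results = questions.query(
    query_texts=["why did Americans fight their own?"],
    n_results=3,
)
for q, d in zip(results["documents"][0], results["distances"][0]):
    print(f"{d:.3f}  {q}")
```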

Outliers and Duplicate Search

In a data engineering application, particularly in data quality, we generally look for clusters of data that defy the normal pattern. Well, what is a normal pattern? It may not be just values; it could be items that are not semantically similar to the rest. For instance, suppose you are ingesting a large amount of free-format user responses to a customer survey for an iPhone app, but want to weed out unrelated content such as advertisements. Semantic search works very well to identify those outliers, and vector databases can be used for that.
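A minimal sketch of that idea: embed the responses, compare each one to the average response, and flag anything too far away. The responses and the 0.2 similarity threshold are assumptions for illustration.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

# Hypothetical survey responses; the last one is an unrelated advertisement.
responses = [
    "The app crashes when I rotate my phone.",
    "Love the new dark mode, very easy on the eyes.",
    "Battery drains fast when notifications are on.",
    "BUY CHEAP WATCHES NOW!!! Visit our site for deals.",
]

vectors = model.encode(responses)
centroid = vectors.mean(axis=0)  # the "average" response, semantically

# Responses far from the centroid are semantic outliers.
for text, vec in zip(responses, vectors):
    similarity = util.cos_sim(vec, centroid).item()
    flag = "OUTLIER" if similarity < 0.2 else "ok"  # assumed threshold
    print(f"{similarity:.2f}  {flag:7s}  {text}")
```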

You can think of many other such uses now that you know the power of vector search.

Difference from Graph Databases

Now, you may ask, how is a vector database different from a graph database? Graph databases also store data points, and each data point can be stored as a vector in a multi-dimensional space as well. But graphs are used to represent the relationships between data points, where such relationships exist. For instance, look at the following diagram.

A graph of relationships among four data points (a typical graph database)

Here we see four data points and the relationships between some of them. For example, Alice is the spouse of Bob, and Charlie is the child of Bob. But there is no relationship between Alice and Charlie. Shouldn’t Charlie be the child of Alice as well? Possibly, but not definitively; Charlie could be a stepchild of Alice. This is why graphs allow us to define a derived relationship with a degree of confidence in that relationship. If it is a fact, the confidence is 100%; otherwise, it is a lesser value. This degree of confidence is an attribute of the relationship.
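Here is a small sketch of how such a graph might be expressed with the networkx Python library; the relationship types and confidence values mirror the diagram above and are illustrative only.

```python
import networkx as nx

G = nx.DiGraph()

# Known facts carry full confidence.
G.add_edge("Alice", "Bob", relation="spouse_of", confidence=1.0)
G.add_edge("Charlie", "Bob", relation="child_of", confidence=1.0)

# A derived relationship: possibly true, so it gets lower confidence.
G.add_edge("Charlie", "Alice", relation="child_of", confidence=0.6)

# Attributes such as confidence live on the relationship (edge) itself.
for u, v, attrs in G.edges(data=True):
    print(f"{u} -[{attrs['relation']}, {attrs['confidence']:.0%}]-> {v}")
```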

You can define any number of attributes on the relationships between points in a graph. Using a relationship and its attributes, we can derive other relationships that may exist. This may sound like the similarity between data points we have seen before, but imagine several billion data points among which you are trying to find the ones similar to a single data point named Lisa. How would you go about it?

The only possible option is to compute the distance between each pair of data points across all the defined attributes of the relationships and use that as a comparison mechanism. This would be extremely time- and resource-consuming, and may be impractical. It is not really what graphs are designed for. Graphs are designed for finding connections that may not be obvious, by traversing from one data point (called a node in the graph) to another, as we did going from Alice to Bob and then to Charlie, which showed us that Charlie could be either a child or a stepchild (assuming Alice adopted him). Finding similarities is hard in a graph database. Consider the language model example shown above, where we were trying to find semantically similar language constructs for “why did Americans fight their own?”. In a graph database, we would have had to create semantic links to other data points such as “what made the civil war different from others”. Even assuming we could somehow establish a semantic relationship that way, it would be very hard, almost impossible, to derive the similarities. Vector databases are made for exactly that purpose and make the comparison easy. Therefore, graph and vector databases have distinct uses and are not interchangeable.

Summary

I hope this article series gave you a solid foundational understanding of vector databases through the examples. This was meant to be an introductory series only, one that guides you to seek additional, targeted information on other vector database concepts. Here is a quick recap of what you learned:

  1. Vectors are numerical representations of data. The data could be anything, but with large language models it is text data that matters most.
  2. A sentence model may be used to encode each text item into a vector that represents the meaning of the data, not its value. A sentence model has been trained on actual text datasets.
  3. Vector databases allow querying for similar data, not the same data. They do so by deriving the distance between data points in a multidimensional space; the smaller the distance, the closer the values (see the short sketch after this list).
  4. When the vectors are built on the meaning of language-based data, the database can compute the closeness of two data points by meaning, not value. For instance, it knows that “meal” is closer in meaning to “food” than to “dress”. Note that the words themselves are very different, so a value-based search would not have found them similar.
  5. Large Language Models can produce coherent answers from the training they received earlier. Retraining is expensive and often impractical, which makes it hard for LLMs to use new data or your data in their responses. We can pass the additional data to LLMs, or enhance their long-term memory, by supplying data from a vector database.
  6. In addition to LLMs, vector databases can also be used for other “meaning” searches, such as in a data catalog or a recommendation system.
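As referenced in item 3, here is a tiny sketch of the distance idea using made-up three-dimensional vectors; real embeddings have hundreds of dimensions, but the arithmetic is the same.

```python
import numpy as np

def cosine_distance(a: np.ndarray, b: np.ndarray) -> float:
    """1 minus cosine similarity: 0 means pointing the same way."""
    return 1 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Made-up 3-dimensional "embeddings" for illustration only.
meal = np.array([0.9, 0.8, 0.1])
food = np.array([0.8, 0.9, 0.2])
dress = np.array([0.1, 0.2, 0.9])

print(cosine_distance(meal, food))   # small distance: close in meaning
print(cosine_distance(meal, dress))  # large distance: far apart
```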

I hope this was informative and useful in your journey to explore this exciting new world of database technology. Bon voyage!

In Part 4, you will learn how to use the metadata feature to combine vector and definitive searches.
