From Beginner to Advanced

Lowdown on Vector Databases

Part 1: Introduction to Vector and Semantic Search

Arup Nanda

--

Understand through working examples what vector databases are, how they enable the “similarity” search and where they can be used beyond the obvious LLM space.

A human looking at many points in a multi-dimensional space
Photo by Savannah Boller on Unsplash

Unless you have been living under a rock, you have heard terms such as Generative AI and Large Language Models (LLMs). Along with that, there is a good chance you have heard about vector databases, which provide context to the queries sent to LLMs. Ever wondered what they are and how they are useful beyond the obvious LLM space? Well, read on to learn about this exciting technology, build your own vector database, and think of ways to leverage it in your projects, including but not limited to LLMs.

The Limitations of Value-centric Search

First, let’s see what was lacking that created the need for a different type of database technology in the first place. It had to do with searching for data. When you hear the word “searching” in the context of databases, you likely think of ordinary value-centric searches such as:

Equality: where the customer_id = 123

Comparative: where age is greater than 25

Wildcard: where customer name starts with “Mc”, e.g. “McDonald”

Sometimes these value-centric searches are also combined, e.g.

where age > 25 and zipcode = ‘12345’

Modern database technology has evolved over the last several decades to improve the efficiency of this type of search, which I refer to as “value-centric search”, where specific values are evaluated to filter a query. While such searches work in many cases, arguably in almost all business-related applications, consider something like this:

Find me a customer like Lisa

Note the filter used: it is not asking for a customer whose name is “Lisa”; just someone like her, i.e. similar to Lisa. What does similar mean? It’s a hard question to answer. It’s not the name, because a similar customer could be named Alice, Bob or Chris. Could it be their age? Possibly. Suppose Lisa’s age is 40. Customers aged 40 are most similar. A customer aged 25 will be less similar, as will one aged 55; both are 15 years away and therefore equally dissimilar.

Let’s ponder on that a little. Consider these three customers with their respective ages.

Customer   Age
Alice       20
Bob         40
Charlie     80

If we draw a graph with Lisa’s age right in the middle and plot the others, it will look like the following diagram. The distances of their ages from 40 (Lisa’s age) show how far they are from that target. In this case Bob is the most similar, Charlie is the most dissimilar, and Alice falls somewhere in between.

Ages of the customers along a single axis
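
A quick snippet makes the same point, using the ages from the table above:

ages = {"Alice": 20, "Bob": 40, "Charlie": 80}

# absolute age difference from Lisa (40); smaller means more similar
print(sorted((abs(age - 40), name) for name, age in ages.items()))
# [(0, 'Bob'), (20, 'Alice'), (40, 'Charlie')]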

Age is merely one aspect of a customer. While searching for someone “like Lisa” we likely had more than one attribute in mind. One such attribute could be the customer’s Net Worth, shown below, added to the original table:

Customer   Age   Net Worth
Alice       20     100,000
Bob         40     200,000
Charlie     80      50,000

If Lisa’s net worth is 100,000, what is the new similarity between these customers? We can create a two-dimensional graph with Age and Net Worth as the two axes, as shown in the figure below.

Customers’ Age and Net Worth along two dimensions

However, since Net Worth is in the tens or hundreds of thousands while Age is a two-digit number, the graph will be badly out of proportion. To put the two on the same footing, we need to convert these absolute values into relative values for comparison. Age varies from 20 to 80, i.e. a spread of 60, so the age distance of Alice from Lisa is (40−20)/60 = 0.33. Similarly, Net Worth (in thousands) ranges from 50 to 200, a spread of 150, so Bob’s Net Worth distance from Lisa is (200−100)/150 = 0.67.

The relative distances of the customers from Lisa

We discover that Bob no longer looks so “similar” to Lisa in profile. To find a composite distance, we can compute the distance between two points on the two-dimensional graph as:

Composite Distance = Square Root of (Square of (Age Distance) + Square of (Net Worth Distance))

Using that formula, we compute the composite distance from Lisa.

Composite distances of the customers from Lisa
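
For the curious, here is a minimal Python sketch of the same calculation, using the ages and net worths from the tables above:

import math

lisa = (40, 100_000)
customers = {"Alice": (20, 100_000), "Bob": (40, 200_000), "Charlie": (80, 50_000)}

age_spread = 80 - 20                    # ages range from 20 to 80
net_worth_spread = 200_000 - 50_000     # net worth ranges from 50,000 to 200,000

for name, (age, net_worth) in customers.items():
    age_dist = abs(age - lisa[0]) / age_spread
    nw_dist = abs(net_worth - lisa[1]) / net_worth_spread
    composite = math.sqrt(age_dist**2 + nw_dist**2)
    print(f"{name}: age={age_dist:.2f}, net worth={nw_dist:.2f}, composite={composite:.2f}")

# Alice:   age=0.33, net worth=0.00, composite=0.33
# Bob:     age=0.00, net worth=0.67, composite=0.67
# Charlie: age=0.67, net worth=0.33, composite=0.75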

We discover that Alice is now closer to Lisa than Bob is, and Charlie is the farthest. Just adding a dimension changes the similarity dramatically. Consider adding another dimension, e.g. “number of children”, making it a three-dimensional plot, which may alter the distances of the objects from Lisa even more. In reality, objects will have hundreds of attributes for comparison, which is impossible to put on paper. But hopefully you get the point of what the distance between two points in a multi-dimensional space means: the smaller the distance, the more similar the points, with 0 meaning exactly the same in all dimensions.

The attributes of the points are captured as a vector. In the above example, the dimensions of the vector will be [Age, Net Worth]; so we will represent the values as follows.

The vector representing Lisa is [40, 100000]. The distance between two points is generally measured as the Euclidean distance, as depicted in the function d() below for a two-dimensional space. Source: Wikipedia.

Formula for Euclidean distance in a two-dimensional plane:

d(p, q) = √((q₁ − p₁)² + (q₂ − p₂)²)

There is a built-in method in the scipy package we can use in Python to compute the Euclidean distance.

>>> from scipy.spatial import distance
>>> lisa = (40,100000)
>>> charlie = (80,50000)
>>> d = distance.euclidean(charlie, lisa)
>>> print(d)
50000.01599999744

These vectors are stored in a vector database, which offers tools to compare the vectors for closeness, i.e. the distances between them, rather than value comparisons or wildcard searches. The database also makes that distance calculation far more efficient than a regular database would. This is what a vector database is.

Let’s Build a Database

Let’s examine the features of a vector database with an example, for which we will use an open-source vector database called ChromaDB and Python. First, we install chromadb:

pip install chromadb

Then we will create a vector database.

import chromadb

# create an in-memory Chroma client and a collection to hold the vectors
client = chromadb.Client()
coll = client.create_collection(name='my_collection')

In the last line above we created a “collection”, to which we will add the vectors of the data we want to compare, i.e. those for Alice, Bob and Charlie. In vector databases these vectors are called “embeddings”, a term that comes from large language models and which I will describe later; we pass the vectors through the embeddings parameter. We also add a label to each data item to identify it, passed as documents; these labels are just the names of the customers, i.e. Alice, Bob and Charlie. Finally, we have to provide a unique identifier for each item, passed as the ids parameter.

coll.add(
    embeddings=[[20, 100000], [40, 200000], [80, 50000]],  # [Age, Net Worth] for Alice, Bob, Charlie
    documents=["Alice", "Bob", "Charlie"],
    ids=["1", "2", "3"]
)

Now our collection has the data with the respective vectors, and we can query the collection with the vector for Lisa, which, you may recall, is [40,100000]. You can perform the query as:

coll.query(
    query_embeddings=[40, 100000]   # Lisa's vector
)

Here is the result:

{'ids': [['1', '3', '2']],
'embeddings': None,
'documents': [['Alice', 'Charlie', 'Bob']],
'metadatas': [[None, None, None]],
'distances': [[400.0, 2500001536.0, 10000000000.0]]}

It shows that the data item “Alice” with id='1' is the closest, with a value of 400 (shown in the distances array); the next one is Charlie, although a distant second at a score of 2500001536. (By default ChromaDB reports the squared Euclidean distance, which is why Alice’s 20-year age gap shows up as 400.) This is how the vector database shows you the vectors similar to another vector; not an equality or wildcard search as in a normal database.
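
You can verify those numbers yourself; they are simply the squared Euclidean distances from Lisa’s vector:

# reproduce the reported distances as squared Euclidean distances from Lisa [40, 100000]
print((40 - 20)**2 + (100000 - 100000)**2)   # Alice:   400
print((40 - 80)**2 + (100000 - 50000)**2)    # Charlie: 2500001600 (shown as 2500001536 due to float32 rounding)
print((40 - 40)**2 + (100000 - 200000)**2)   # Bob:     10000000000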

The database or the “collection” you created earlier is not static. You can of course insert more items and update the existing values as needed using the upsert() method. In the following example, we add a new value with the label “Dave”.

coll.upsert(
    ids=["4"],
    documents=["Dave"],
    embeddings=[[40, 150000]]   # Dave: Age 40, Net Worth 150,000
)

Let’s perform the same query as before. To limit the results to only three values (and not all possible values), we set a new parameter in the method, n_results, to 3.

coll.query(
    query_embeddings=[40, 100000],
    n_results=3
)

It returns:

{'ids': [['1', '4', '3']],
'embeddings': None,
'documents': [['Alice', 'Dave', 'Charlie']],
'metadatas': [[None, None, None]],
'distances': [[400.0, 2500000000.0, 2500001536.0]]}

You can see that it returned a different set of values: Dave now takes second place, displacing Charlie. Using this approach you can load hundreds of millions of data elements, each with hundreds of attributes, into the vector database and compare them against a query vector to find the most similar ones.
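
As a rough sketch of what a bulk load could look like (the randomly generated [Age, Net Worth] vectors, ids and labels below are only placeholders for illustration):

import random

# generate 1,000 hypothetical [Age, Net Worth] vectors; in practice these
# would come from your own records
batch = [[random.randint(18, 90), random.randint(10_000, 500_000)] for _ in range(1_000)]

coll.upsert(
    ids=[str(i) for i in range(100, 100 + len(batch))],       # start at 100 to avoid the ids already used
    documents=[f"customer_{i}" for i in range(len(batch))],   # placeholder labels
    embeddings=batch
)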

Vectorization

Now that we understand how vectors are compared in a vector database, let’s see what makes up the vectors. The actual data has to be decomposed into a vector; but how do we do that? For a defined set of specific values such as Age, Net Worth, etc., as seen in the example above, it’s easy; but what about words, sentences or pictures? We also need to create the vector in such a way that it represents the meaning of the sentence or text, and not just some set of random numbers. For example, the vector for “flower” should come out closer to that of “rose” than to, say, “toaster”. To vectorize the meaning, we have to use a language transformer model.

One such tool is a sentence-transformer model, a compact language model that transforms words and sentences into vectors. You can download such a model as a package from Hugging Face. Let’s see the model in action. First, install the module if it’s not installed already:

pip install sentence-transformers

We will import the module and create a model object for the class.

from sentence_transformers import SentenceTransformer
model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

We are ready to use a method called encode() in this object to create a vector from any language element, such as this sentence: “This is an awesome tool”

outvector = model.encode("This is an awesome tool")

If you display the output outvector, you will see the following:

array([-4.86320592e-02,  3.47757190e-02, -2.03912966e-02, -1.41970646e-02,
1.77401770e-02, -2.35676132e-02, 3.43585275e-02, -8.87791067e-03,
8.61549564e-03, -5.38554648e-03, 5.00661805e-02, 5.87902553e-02,
… output truncated …

The output is a vector of 384 values that represents the sentence “This is an awesome tool”. This can be used to compute the distance from other vectors to find similarities. But first, you may be wondering how the tool managed to come up with the attributes to create the vector. It did so via a language model called all-MiniLM-L6-v2, which has been trained on various language datasets from sources such as Reddit, Trivia Q&A, the crowdsourced Natural Language Inference corpus, etc. If you want to examine the model card of that model, here it is: https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2.
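
You can check the dimensionality yourself, and also sanity-check the earlier claim that “flower” should land closer to “rose” than to “toaster”, reusing the scipy distance function from before (a quick illustrative check; exact distances will vary with the model):

from scipy.spatial import distance

print(outvector.shape)   # (384,) - one 384-dimensional vector for the whole sentence

flower = model.encode("flower")
rose = model.encode("rose")
toaster = model.encode("toaster")

# the distance to "rose" should come out smaller than the distance to "toaster"
print(distance.euclidean(flower, rose))
print(distance.euclidean(flower, toaster))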

Creating Meaningful Vectors

Let’s explore using vectors in similarity search in more detail with a specific example. Suppose Alice, Bob and Charlie are an Engineer, an Accountant and an Artist respectively. We will use this sentence-transformer language model to create the vectors for these terms, rather than putting them together by hand as we did before.

from sentence_transformers import SentenceTransformer
model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
alice_vector=model.encode('Engineer').tolist()
bob_vector=model.encode('Accountant').tolist()
charlie_vector=model.encode('Artist').tolist()

Now we have three vectors for the three data items, created from the profession attribute of each. We will use these vectors to build our vector database collection, just like we did earlier.

# the collection 'my_collection' already exists from the earlier example, so drop it before recreating
client.delete_collection(name='my_collection')
coll = client.create_collection(name='my_collection')
coll.add(
    embeddings=[alice_vector, bob_vector, charlie_vector],
    documents=["Alice", "Bob", "Charlie"],
    ids=["1", "2", "3"]
)

Note how we used the generated vectors as the embeddings, rather than typing them in by hand as in the previous example.

Now, suppose we want to find out whose vocation is most similar to another one, say “Painter”. First we need to convert the word “Painter” into a vector as well, and then find its similarity (i.e. its distance) to the other vectors using the query() method we saw earlier.

coll.query(model.encode('Painter').tolist())

This returns:

{'ids': [['3', '1', '2']],
'embeddings': None,
'documents': [['Charlie', 'Alice', 'Bob']],
'metadatas': [[None, None, None]],
'distances': [[0.6380528211593628, 1.2819364070892334, 1.337925672531128]]}

Voila! It did come back with the results as expected. Charlie is most similar. Remember, Charlie is an “Artist”, which is semantically closest to a “Painter”. The others are farther away and less similar. Now, run the same query for an “Actuary”:

coll.query(model.encode('Actuary').tolist())

It comes back with:

{'ids': [['2', '1', '3']],
'embeddings': None,
'documents': [['Bob', 'Alice', 'Charlie']],
'metadatas': [[None, None, None]],
'distances': [[0.9443958401679993, 1.3902302980422974, 1.566853642463684]]}

As expected, it figured out correctly that “Actuary” is similar in meaning to “Accountant”, which is what Bob’s profession is.

Pause for a second and consider this. You were searching for similarities, not comparing values. Using the power of a language model, even a small one like all-MiniLM-L6-v2, you could build meaningful vectors and, using a vector database query, see which vectors are similar to another term. This is the power of semantic search and vector databases.

In the second part of this article series you will learn how to use this technique to find semantically similar terms in a very large database using vector database tools.

--

Arup Nanda

Award-winning data/analytics/ML and engineering leader, raspberry pi junkie, dad and husband.