Lowdown on Vector Databases Part 4

Tying up the loose ends to complete the series on vector databases. Learn about adding metadata, filtering on metadata and documents, and combining definitive and vector searches.

Arup Nanda
10 min read · Feb 15, 2024


Lots of vector points and the filters applied to them.
Image by the author using Midjourney

In the previous three articles of this series, you learned the basics of vector databases and vector search, and saw a practical example of using semantic search to retrieve text semantically similar to your query string. Many readers reached out to me about other interesting uses of vector search. In this article, you will explore the capabilities of vector search even further.

Links to the previous three articles in the series:

Part 1, where you learned the basics of vector databases and how to construct semantic meaning from large language models

Part 2, where you learned how to build a semantic application for asking the right questions of a knowledge bank

Part 3, where you learned about other uses and comparisons with other technologies such as graph databases

Passing Metadata

As you learned, vectors are specially designed data items that represent other data as a collection of numbers, which can then be subjected to vector mathematics to compute the distance between two vectors. The distance tells how close the two vectors are to each other. Unless all the dimensions of two vectors are precisely the same, the vectors are not identical.

However, sometimes the data has specific attributes that can be searched as in a typical database. Let’s revisit the example we used earlier, where we used the following attributes of the customers: Age, Networth (in ‘000), Number of Children, and Zipcode.

These attributes are used to create the vectors representing those customers. When we want to find the customers most like another customer named Lisa, we get the same attributes for Lisa, vectorize them, and then find the distance between Lisa’s vector and these three vectors. The smaller the distance, the closer the customer is to Lisa. Let’s recap the code.

# Install chromadb if needed
#!pip install chromadb
import chromadb

# Create a client object from the chromadb library
client = chromadb.Client()

# Create a collection in ChromaDB
coll = client.create_collection(name='my_collection')

# Add the embeddings as [Age, Networth, No of Children, Zipcode].
# There are three records, or "documents", with IDs 11, 12 and 13.
# The documents are named after the customers they represent.
coll.add(
    ids=["11", "12", "13"],
    documents=["Alice", "Bob", "Charlie"],
    embeddings=[
        [20, 100, 0, 12345],
        [40, 200, 3, 23456],
        [80, 50, 2, 34567]
    ]
)

Again, as a recap, if you want to see some of the records in the collection, you can use the peek() function.

>>> coll.peek()

{
    'ids': ['11', '12', '13'],
    'embeddings': [
        [20.0, 100.0, 0.0, 12345.0],
        [40.0, 200.0, 3.0, 23456.0],
        [80.0, 50.0, 2.0, 34567.0]
    ],
    'metadatas': [None, None, None],
    'documents': ['Alice', 'Bob', 'Charlie'],
    'uris': None,
    'data': None
}
If you want to display the details on specific items in the collection, you can use the get() function. In the following example, you are getting the details for IDs “11” and “12”:

>>> results = coll.get(
...     ids=["11", "12"]
... )
>>> results
{
    'ids': ['11', '12'],
    'embeddings': None,
    'metadatas': [None, None],
    'documents': ['Alice', 'Bob'],
    'uris': None,
    'data': None
}

Suppose the attributes of Lisa are as follows:

Age: 40

Networth: 100

Number of Children: 2

Zipcode: 12345

To find the distance between the vector representing Lisa and those in the collection, you use the following query, as you learned in the previously mentioned articles:

results = coll.query(
    query_embeddings=[40, 100, 2, 12345]
)

However, suppose you also have some definitive attributes you can search on with normal database-type queries. For instance, you may have an attribute called “gender”, which can be searched definitively as opposed to by meaning, similar to how traditional databases work. This is passed as a parameter named “metadatas”. Recall that in our previous example we did not pass any metadata for the data in the collection; now we will.

# Add the embeddings as [Age, Networth, No of Children, Zipcode],
# this time with a metadata attribute for each record.
# upsert() (rather than add()) replaces the existing records
# with the same IDs, avoiding duplicate-ID errors.
coll.upsert(
    ids=["11", "12", "13"],
    documents=["Alice", "Bob", "Charlie"],
    metadatas=[{"gender": "woman"}, {"gender": "man"}, {"gender": "man"}],
    embeddings=[
        [20, 100, 0, 12345],
        [40, 200, 3, 23456],
        [80, 50, 2, 34567]
    ]
)

With this in place, let’s see how we can add more intelligence to our queries. First, we query the vectors only, as we did earlier.

# Query without any metadata filter
results = coll.query(
    query_embeddings=[40, 100, 2, 12345]
)
results

The output:

{'ids': [['11', '12', '13']],
'distances': [[404.0, 123464320.0, 493821376.0]],
'metadatas': [[{'gender': 'woman'}, {'gender': 'man'}, {'gender': 'man'}]],
'embeddings': None,
'documents': [['Alice', 'Bob', 'Charlie']],
'uris': None,
'data': None}
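The distances reported here are squared Euclidean (L2) distances, Chroma’s default metric, so you can verify the first value by hand:

```python
# Verify the first reported distance: squared L2 between
# Lisa's vector and Alice's vector (ID "11")
lisa = [40, 100, 2, 12345]
alice = [20, 100, 0, 12345]

dist = sum((a - b) ** 2 for a, b in zip(lisa, alice))
print(dist)  # 404, matching the 404.0 reported above for ID "11"
```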

It showed us all the data with the distances. Now, let’s run the same query but with a metadata filter. We will search only where “gender” = “man”. This is done by passing a new parameter named where:

# Search with a definitive query on metadata
results = coll.query(
    query_embeddings=[40, 100, 2, 12345],
    where={"gender": {"$eq": "man"}}
)
results

The output:

{'ids': [['12', '13']],
'distances': [[123464320.0, 493821376.0]],
'metadatas': [[{'gender': 'man'}, {'gender': 'man'}]],
'embeddings': None,
'documents': [['Bob', 'Charlie']],
'uris': None,
'data': None}

Notice how the output is limited to only those records with gender=man in the metadatas attribute. Remember that the distance between two points in the vector space determines how close they are? If we can eliminate a lot of data definitively before performing the vector computation, we get a great performance improvement, not to mention only the relevant results.

Here are the operators you can use in the query on the metadata:

  • $eq : equal to
  • $ne : not equal to
  • $gt : greater than
  • $gte : greater than or equal to
  • $lt : less than
  • $lte : less than or equal to

Only $eq and $ne can be applied to strings; the other operators apply to numbers only.

You can also use another operator, $in, to check against a list of values, e.g.:

where={"gender":{"$in":["man","woman"]}}

There is also an operator $nin that represents “not in the list”.

Filtering on the Document

In addition to a definitive query on the metadatas attribute, you can filter on the document itself. For example, if you want to limit the query results to only those documents that contain the letter “C”, you can use the $contains operator. Note the use of a new parameter, where_document:

results = coll.query(
    query_embeddings=[40, 100, 2, 12345],
    where_document={"$contains": "C"}
)
results

Output:

{'ids': [['13']],
'distances': [[493821376.0]],
'metadatas': [[{'gender': 'man'}]],
'embeddings': None,
'documents': [['Charlie']],
'uris': None,
'data': None}

Note how only the document that matched the query condition, i.e. had the letter “C” in it (“Charlie”), came back.

Limiting Displayed Attributes

In the previous examples, notice how the query returns all the attributes of the data, which may be a nuisance in many cases. For instance, the “metadatas” attribute does not actually add any value and should not be present unless explicitly asked for. To return only specific attributes from a query, you can use the optional “include” parameter. Here is an example where we ask for only two attributes, “documents” and “distances”:

results = coll.query(
    query_embeddings=[40, 100, 2, 12345],
    include=["documents", "distances"]
)
results

The output:

{'ids': [['11', '12', '13']],
'distances': [[404.0, 123464320.0, 493821376.0]],
'metadatas': None,
'embeddings': None,
'documents': [['Alice', 'Bob', 'Charlie']],
'uris': None,
'data': None}

A Practical Example

Remember how in Part 2 you learned to use the power of vectors and large language models to identify a question from Wikipedia when you knew the spirit of your question, but not the precise question contained in the data? There you asked a question:

Why did Americans fight their own?

But no question with that precise phrasing existed in the data. You were not looking for the precise question; all you wanted were the questions in the data with a meaning similar to the question you were asking. Using the power of vectors and LLMs, you got the answers shown below:

Distance   ID Question
1.027662 101 what made the civil war different from others
1.059752 960 when was america pioneered
1.102381 1353 what date did the american civil war start
1.102810 481 how many native Americans did the United States kill or deport?
1.126004 1650 what triggered the civil war
1.138259 469 when did the civil war start and where
1.167491 1180 Who controlled Alaska before US?
1.183807 1168 what two empires fought to control afghanistan
1.229329 1008 what is colonial americans day in usa
1.237927 1999 how did bleeding sumner lead to the civil war

Here you got a list of questions and how similar they are to the question you asked, represented by the value of “Distance”. The smaller the distance, the closer it is in meaning. Once you have the question, you can look up the answers. But how? You can query the original dataset wiki_qa to look up answers to the question, e.g. “what made the civil war different from others”; but that is less efficient. You are better off using the question_id to look up the answers.

What is the question_id? Is it the “ID” column shown before the question? No. The ID column is merely the identifier in the collection of vectors, which is pretty much meaningless to a user. What we really want is to capture the question_id from the original dataset in the collection as well. How can we do that?

I hope you guessed it: metadata. We can capture the question_id as metadata. Let’s see how, fast-tracking some of the steps.

# For loading data directly from Hugging Face
from datasets import load_dataset
ds = load_dataset('wiki_qa', split='train')

# collect only the questions
questions = []
for i in ds['question']:
    questions.append(i)

# remove duplicates
questions = list(set(questions))

We did all this in the previous article. But here are the new lines we add here.

# Collect the question_ids, keeping them aligned with the
# de-duplicated questions list. (Building the ids with a separate
# set() would scramble the question-to-id correspondence, since
# set ordering is arbitrary.)
qid_by_question = dict(zip(ds['question'], ds['question_id']))
qids = [qid_by_question[q] for q in questions]

We need to represent each metadata item as a dict in the format “metadata_name”:”value”, e.g. {'question_id': 'Q1234'}:

# Form dict objects for the metadata
question_ids = []
for i in qids:
    j = {'question_id': i}
    question_ids.append(j)

Then we continue building the vector collection and search as before:

import chromadb
client = chromadb.Client()
coll = client.create_collection(name='my_collection')
from tqdm.auto import tqdm
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('sentence-transformers/all-MiniLM-L12-v2')

Now we will upsert all the values as we did earlier in Part 3, with one very important difference: we will add the metadatas parameter to the upsert() call. We also need a variable to hold the metadata, named “metadatas”, which is a slice of the list of dicts we created, question_ids.

# Upsert in batches of 128
# Total questions: 2118
batch_size = 128
total_size = 2118
for ctr in tqdm(range(0, total_size, batch_size)):
    ctr_end = min(ctr + batch_size, total_size)
    IDs = [str(i) for i in range(ctr, ctr_end)]
    documents = [text for text in questions[ctr:ctr_end]]
    embeddings = model.encode(questions[ctr:ctr_end]).tolist()
    metadatas = question_ids[ctr:ctr_end]
    coll.upsert(documents=documents, ids=IDs, embeddings=embeddings, metadatas=metadatas)

Now our collection has not only the questions but also the question_id as metadata.

Let’s find the questions similar in meaning to what I want to ask:

question = 'why did Americans fight their own'
ques_vector = model.encode(question).tolist()
# Get similar vectors
similar_vectors = coll.query(ques_vector, include = ["distances","documents","metadatas"], n_results = 10)

You can print similar_vectors to see the raw output; but I want to print it in a readable format, drop the “ID” column as it’s internal to the collection and not meaningful to me, and print the question_id captured in the metadata. Here is the code:

# pretty output
print(f'{"Distance":>8} {"ID":>4} {"Question"}')
for i in range(len(similar_vectors['ids'][0])):
    print(f"{round(similar_vectors['distances'][0][i],6):1.6f} {similar_vectors['metadatas'][0][i]['question_id']} {similar_vectors['documents'][0][i]}")

And here is the output:

Distance   ID Question
1.027662 Q1403 what made the civil war different from others
1.059752 Q2251 when was america pioneered
1.102381 Q1818 what date did the american civil war start
1.102810 Q2110 how many native Americans did the United States kill or deport?
1.126004 Q303 what triggered the civil war
1.138259 Q1530 when did the civil war start and where
1.167491 Q2090 Who controlled Alaska before US?
1.183807 Q365 what two empires fought to control afghanistan
1.229329 Q1635 what is colonial americans day in usa
1.237927 Q1783 how did bleeding sumner lead to the civil war

Voila! I got the question_ids as well. Once I have them, I can use them to look up the questions and answers directly from the wiki_qa dataset.

I can add as many elements of metadata as I want. For instance, I may want to add the column “document_title” from the wiki_qa dataset as metadata, which will allow me to filter records based on the title. Other datasets may have more attributes you can capture as metadata and either present or filter on as needed.

Multiple Metadata Filters

In the previous section you saw how powerful the metadatas attribute of the data passed to a collection is, and how it can be used as a primary filter. There we used only one metadata attribute: “gender”. We can define as many as we want. Here is an example where we define “level” as an additional attribute. The “level” represents the status level of the customer as a number, 1 being the lowest.

We can use the update() method of the collection object to update existing data instead of creating a brand new collection.

coll.update(
    ids=["11", "12", "13"],
    documents=["Alice", "Bob", "Charlie"],
    metadatas=[
        {"gender": "woman", "level": 3},
        {"gender": "man", "level": 2},
        {"gender": "man", "level": 1}
    ],
    embeddings=[
        [20, 100, 0, 12345],
        [40, 200, 3, 23456],
        [80, 50, 2, 34567]
    ]
)

Now, let’s query customers with level greater than or equal to 2:

results = coll.query(
    query_embeddings=[40, 100, 2, 12345],
    where={"level": {"$gte": 2}}
)
results

The output shows only those with the level 2 and above:

{'ids': [['11', '12']],
'distances': [[404.0, 123464320.0]],
'metadatas': [[{'gender': 'woman', 'level': 3},
{'gender': 'man', 'level': 2}]],
'embeddings': None,
'documents': [['Alice', 'Bob']],
'uris': None,
'data': None}

If multiple metadata attributes are present, all of them can be queried with AND or OR clauses. Here is an example of a search where we look for “gender” == “man” and “level” >= 2:

results = coll.query(
    query_embeddings=[40, 100, 2, 12345],
    where={
        "$and": [
            {"level": {"$gte": 2}},
            {"gender": {"$eq": "man"}}
        ]
    }
)

The output shows only the record that satisfies all the conditions:

{'ids': [['12']],
'distances': [[123464320.0]],
'metadatas': [[{'gender': 'man', 'level': 2}]],
'embeddings': None,
'documents': [['Bob']],
'uris': None,
'data': None}

You can also use the “$or” operator to introduce an OR condition where $and exists. You can combine as many metadata attributes as you want and query on them in any combination. It is a powerful way to filter out definitively queryable data, leaving the vector computations to be done on fewer data items. This is a huge boost for performance, as well as for the size of the result set your application needs to work on. Remember, filtering on metadata is optional; you can define metadata (if you can) but not use it to filter during a query. This adds a tad bit to the storage, but the benefits far outweigh that cost.

I hope you liked this series on vector databases and the capabilities shown in the code samples. I would appreciate it if you could drop me a line with your feedback: the good, the bad and the ugly.

