In the rapidly evolving field of data science, the ability to efficiently search and manipulate high-dimensional vectors is a critical skill. However, when dealing with smaller datasets, the conventional wisdom of using resource-intensive vector databases may not always be the most efficient approach. In this article, titled “In-Memory Vector Search with C#: A Practical Approach for Small Datasets,” we delve into an alternative method that challenges the status quo.
We explore the intriguing possibility of storing vectors directly in the memory of a C# program, bypassing the need for a separate vector database system. This approach, while seemingly unconventional, can offer surprising benefits in terms of speed and resource utilization for specific use cases.
Through a small case study, we'll demonstrate how this method can perform similarity searches in a split second, even with 100,000 records of 1536-dimensional vectors. If you're curious about pushing the boundaries of vector search efficiency in C#, or if you're simply interested in exploring new ways to handle high-dimensional data, this article is for you.
Why and when do we want to replace vector databases in the first place? Well, after writing about the great potential of enhancing the limited context of the ChatGPT API with vector embeddings, I realized that installing vector databases involves a fair bit of work and a good amount of resources. For many small projects, this is simply not economical.
So, let’s go through some background information first.
If you want to jump directly to the code, go to the implementation part.
Understanding Vector Databases
In the realm of data science and machine learning, vector databases have emerged as a powerful tool for handling high-dimensional data. These databases are designed to store vectors — arrays of numbers that represent data in multi-dimensional space. But what makes them so special, and why are they often the go-to solution for dealing with such data?
Vector databases, such as Milvus, Faiss, and Annoy, are engineered to perform efficient similarity searches. They use indexing algorithms to organize the vectors in a way that allows for quick retrieval of the most similar vectors to a given query vector. This is a crucial feature when dealing with high-dimensional data, as brute force comparison of a query vector with every vector in the database can be computationally expensive and time-consuming.
However, these databases are not without their drawbacks. They often require a significant amount of resources to run effectively. For instance, hosting a vector database system like Milvus may require at least 4GB of RAM. This can be a limiting factor, especially for smaller projects or for developers working with constrained resources.
Moreover, setting up and maintaining a vector database can be a complex task. It involves understanding the underlying algorithms, tuning parameters for optimal performance, and managing the database’s lifecycle. This complexity can be a barrier for developers who are new to the field or who are working on projects with tight timelines.
In the next section, we will explore an alternative approach that can bypass some of these challenges: in-memory implementations with indexes. However, as we will see, this approach also has its own set of considerations.
Exploring In-Memory Implementations with Indexes
As we venture into the world of high-dimensional data, it’s important to explore all possible avenues for efficient data handling. One such approach is the use of in-memory storage coupled with indexing algorithms. This method offers a unique blend of speed and efficiency, making it a compelling alternative to traditional vector databases.
In-memory storage, as the name suggests, involves storing data directly in the memory of a program. This method can offer significant speed advantages, as accessing data in memory is typically much faster than retrieving it from a database. When combined with an indexing algorithm, in-memory storage can provide a powerful tool for handling high-dimensional vectors.
Indexing algorithms, such as KD-trees, Ball trees, or Annoy, organize data in a way that allows for quick retrieval of the most similar vectors to a given query vector. These algorithms can significantly speed up similarity searches in high-dimensional spaces, making them a crucial component of this approach.
However, implementing an in-memory solution with indexes is not without its challenges. These algorithms can be complex to implement and may require a deep understanding of data structures and algorithms. Moreover, some indexing algorithms do not scale well with the dimensionality of the data, which can limit their effectiveness for very high-dimensional vectors.
Additionally, while in-memory storage can offer speed advantages, it can also lead to increased memory usage. This can be a concern for applications that need to handle large amounts of data or run on systems with limited memory.
Given these considerations, an in-memory implementation with indexes may not always be the best solution, especially for smaller datasets or simpler use cases. In the next section, we will explore a simpler approach that can be surprisingly effective for these scenarios: full scans in memory.
The Simplicity of Vector Search with C#: A Viable Option
Understanding Full Scans in Memory
As we delve deeper into the world of high-dimensional vector searches, we encounter a surprisingly simple yet effective approach: full scans in memory. This method, while seemingly straightforward, can offer significant benefits for certain use cases, particularly when dealing with smaller datasets.
A full scan, in the context of vector searches, involves comparing a query vector with every vector in the dataset to find the most similar vectors. This is essentially a brute force approach, where every possible match is evaluated. While this may sound inefficient, it can actually be quite fast for smaller datasets, especially when the data is stored in memory.
The beauty of this approach lies in its simplicity. There’s no need for complex indexing algorithms or resource-intensive database systems. All you need is your dataset and a similarity metric, such as cosine similarity or Euclidean distance. The data is stored directly in the memory of your C# program, and the full scan is performed using simple loops and mathematical operations.
However, it’s important to note that full scans are not a one-size-fits-all solution. As the size of your dataset grows, the time required for a full scan can increase significantly, making this approach less efficient. Moreover, storing large amounts of data in memory can lead to increased memory usage, which can be a concern for systems with limited memory.
Personal Test Results: A Case Study
To put this theory into practice, let’s consider a real-world example from my own experience. In this scenario, we assume that we extract roughly 20 vectors per page from an average PDF file using the OpenAI embeddings API. Each of these vectors has a dimensionality of 1536.
In my tests, I found that storing 100,000 such vectors directly in memory consumed only a few hundred megabytes of RAM (100,000 vectors × 1536 dimensions × 4 bytes per float is roughly 600 MB). This means that we can store information about approximately 5,000 PDF pages in memory without significantly impacting system resources.
This is a compelling finding. It shows that for smaller datasets, full scans in memory can be a viable and efficient approach. In the next section, we will dive into the details of implementing full scans in C#, providing code examples and discussing key considerations. If you’re looking for a simple, efficient way to handle high-dimensional vectors in smaller datasets, this section is for you.
Diving into the Code: Implementing Full Scans in C#
Storing and searching
Since storing the vectors alone without any additional information won't do us any good, we'll take the approach of creating a collection class with generics and interfaces.
Since the vector collection only cares about vectors, we can keep the interface for the records that should be stored inside very small:
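A minimal sketch of such an interface, using the `IVectorObject` name the article refers to later (the exact shape is an assumption):

```csharp
// Anything that can hand out its embedding vector can be stored
// in the collection. Nothing else is required of a record.
public interface IVectorObject
{
    float[] GetVector();
}
```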
For the moment we’ll go with float arrays. For a more general approach one might want to make that part generic as well.
Having this interface, we can now use it in our class that represents the actual data items we care about.
Since we are extracting parts from a PDF document, we want to store some more information. This will include the text that was used to create the vectors as well as some meta information about where the text came from.
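A sketch of such a record type might look like the following; the class and property names (`TextChunk`, `SourceFile`, and so on) are illustrative assumptions, not the article's original code:

```csharp
// A record for one chunk of PDF text, implementing the
// IVectorObject interface so it can live in the collection.
public class TextChunk : IVectorObject
{
    public string Text { get; set; } = "";        // the text the vector was created from
    public string SourceFile { get; set; } = "";  // which PDF the text came from
    public int PageNumber { get; set; }           // page within that PDF
    public float[] Vector { get; set; } = Array.Empty<float>();

    public float[] GetVector() => Vector;
}
```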
The collection can then be implemented as this:
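A minimal sketch of the collection, assuming the `IVectorObject` constraint and a plain `List<T>` as the backing store:

```csharp
using System.Collections.Generic;

// Generic collection for anything that exposes an embedding vector.
public class VectorCollection<T> where T : IVectorObject
{
    private readonly List<T> _objects = new();

    public int Count => _objects.Count;

    public void Add(T item) => _objects.Add(item);

    public T GetItem(int index) => _objects[index];
}
```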
For now, this collection can only store and retrieve objects that implement the IVectorObject interface.
To enable searching, we'll use the dot product of vectors.
Important: This will only work with normalized vectors. Luckily, this is what we get from OpenAI's embedding API. If your vectors are not yet normalized, take a look at the vector math section below.
The dot product of two normalized vectors gives us a measure of their similarity. If we multiply a vector with itself, we get 1 as the result. More generally, the closer the dot product of two vectors is to 1, the more similar the two vectors are.
With that knowledge, we can now implement the simple full scan to find the closest vector. To do so, we simply take our query vector (the vector that represents what we are searching for) and multiply it with every record in our collection. We keep track of the biggest product and its index in the collection and finally return the vector with the biggest product:
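A sketch of that method, as it could appear inside the `VectorCollection<T>` class (the method and helper names are assumptions):

```csharp
// Full scan: compute the dot product of the query with every
// stored vector and remember the index of the biggest one.
public T FindClosest(float[] query)
{
    var bestIndex = 0;
    var bestDot = float.MinValue;
    for (var i = 0; i < _objects.Count; i++)
    {
        var dot = DotProduct(query, _objects[i].GetVector());
        if (dot > bestDot)
        {
            bestDot = dot;
            bestIndex = i;
        }
    }
    return _objects[bestIndex];
}

// Element-wise multiply, then sum.
private static float DotProduct(float[] a, float[] b)
{
    var sum = 0f;
    for (var i = 0; i < a.Length; i++)
        sum += a[i] * b[i];
    return sum;
}
```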
Just to have something in place for storing and retrieving a collection to / from disk, we add two more methods to our collection:
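One possible persistence sketch uses `System.Text.Json`; the original article may well use a different serializer or file format:

```csharp
using System.Collections.Generic;
using System.IO;
using System.Text.Json;

// Writes the whole collection to a JSON file.
public void SaveToDisk(string path)
{
    File.WriteAllText(path, JsonSerializer.Serialize(_objects));
}

// Reads a collection back from a JSON file.
public static VectorCollection<T> LoadFromDisk(string path)
{
    var collection = new VectorCollection<T>();
    var items = JsonSerializer.Deserialize<List<T>>(File.ReadAllText(path));
    if (items != null)
        foreach (var item in items)
            collection.Add(item);
    return collection;
}
```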
Finally let’s come to the fun part of seeing the vector search in action.
To get some vectors, I downloaded a free ePaper with 59 pages of text on Robinson Crusoe and split it into approximately 500 chunks of text. The approach is the same as discussed in my previous article about chatting with documents.
Next, we need something to search for, and we must convert the question into a vector as well.
My example question is
Where are the dead cannibals hidden?
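A hedged sketch of fetching the question's embedding from OpenAI's REST endpoint (`/v1/embeddings`, as documented by OpenAI); the helper's shape and the model choice are assumptions, not the article's original code:

```csharp
using System;
using System.Linq;
using System.Net.Http;
using System.Net.Http.Headers;
using System.Text;
using System.Text.Json;
using System.Threading.Tasks;

public static class Embeddings
{
    // Sends one text to the OpenAI embeddings endpoint and returns
    // the resulting vector (1536 dimensions for this model).
    public static async Task<float[]> GetEmbeddingAsync(string text, string apiKey)
    {
        using var http = new HttpClient();
        http.DefaultRequestHeaders.Authorization =
            new AuthenticationHeaderValue("Bearer", apiKey);

        var payload = JsonSerializer.Serialize(new
        {
            input = text,
            model = "text-embedding-ada-002"
        });
        var response = await http.PostAsync(
            "https://api.openai.com/v1/embeddings",
            new StringContent(payload, Encoding.UTF8, "application/json"));
        response.EnsureSuccessStatusCode();

        using var doc = JsonDocument.Parse(await response.Content.ReadAsStringAsync());
        return doc.RootElement
                  .GetProperty("data")[0]
                  .GetProperty("embedding")
                  .EnumerateArray()
                  .Select(e => e.GetSingle())
                  .ToArray();
    }
}
```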
Putting everything together, we can now search for that vector and get top 3 closest vectors from our collection:
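One way to sketch the top-3 search is to score every record and sort, rather than tracking only a single maximum. The names (`collection`, `queryVector`, `TextChunk`, `DotProduct`) are assumptions carried over from the sketches above:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Score every record by its dot product with the query,
// then keep the three highest-scoring chunks.
var scored = new List<(float Score, TextChunk Chunk)>();
for (var i = 0; i < collection.Count; i++)
{
    var chunk = collection.GetItem(i);
    scored.Add((DotProduct(queryVector, chunk.GetVector()), chunk));
}
var top3 = scored.OrderByDescending(s => s.Score).Take(3).ToList();

foreach (var (score, chunk) in top3)
    Console.WriteLine($"{score:F4}  page {chunk.PageNumber}: {chunk.Text}");
```

For 500 records this sort-everything approach is effectively free; for much larger collections you would keep a small running top-k instead.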
All three results are closely related to our question, and the second closest holds the key information that we're actually looking for in order to answer the question:
We hide the bodies in the forest near
Rows 16 to 22 are just for verification. We're calculating the product of the query with itself and expect to see something very close to 1 (only approximately 1 because of the limited accuracy of float values).
Then we test three random products to get a feel for the numbers we can expect from unrelated records.
Lastly we calculate the products for our top 3 results. Here’s what we got:
We can clearly see that the closest vectors have a greater product than the unrelated text blocks (first, second, tenth).
And how long does it take to scan through all vectors? As mentioned above, searching through 100,000 records can be done in a split second, so this example with 500 records doesn't really take much of your CPU's resources.
Normalization of vectors means that we bring all vectors to a length of 1. After that, we still have the directional information of the vector (pointing in 1536 dimensions – totally easy to imagine, that one) but we lose the length. For our use case, that's fine.
We first need a small function to calculate the length for a vector:
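A sketch of that helper, computing the Euclidean length as the square root of the sum of squares:

```csharp
using System;

// Euclidean length (magnitude) of a vector.
public static float Length(float[] vector)
{
    var sum = 0f;
    foreach (var value in vector)
        sum += value * value;
    return (float)Math.Sqrt(sum);
}
```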
With that we can do the actual normalization by dividing each value in the vector by the length:
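A sketch of the normalization step, assuming the `Length` helper described just above:

```csharp
// Divide each component by the vector's length so the
// resulting vector has a length of exactly 1.
public static float[] Normalize(float[] vector)
{
    var length = Length(vector);
    var result = new float[vector.Length];
    for (var i = 0; i < vector.Length; i++)
        result[i] = vector[i] / length;
    return result;
}
```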
Vector Dot Product
Calculating the dot product of two vectors is done by multiplying each element of the first vector with the corresponding element of the second vector and then summing the results:
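As a standalone function, that could look like this (for normalized vectors, this value is the cosine similarity):

```csharp
// Element-wise multiply, then sum. Assumes both vectors
// have the same number of dimensions.
public static float DotProduct(float[] a, float[] b)
{
    var sum = 0f;
    for (var i = 0; i < a.Length; i++)
        sum += a[i] * b[i];
    return sum;
}
```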
In the realm of high-dimensional vector searches, there is no one-size-fits-all solution. The best approach depends on the specific requirements of your project, including the size of your dataset, the resources available, and the level of complexity you’re willing to handle.
In this article, we explored three potential approaches: using a complete vector database, implementing an in-memory solution with indexes, and performing a simple full scan in memory. While each of these methods has its own advantages and drawbacks, our focus was on the latter approach due to its simplicity and efficiency for smaller datasets.
Through a practical example, we demonstrated how to implement a full scan in C# and showed that this method can be surprisingly effective for handling up to 100,000 vectors, or about 5,000 PDF pages worth of data. This approach requires only a few hundred megabytes of RAM and can perform searches in a split second, making it a viable option for many use cases.
However, it’s important to remember that as your dataset grows, the time required for a full scan can increase significantly, and storing large amounts of data in memory can lead to increased memory usage. Therefore, it’s crucial to consider your specific needs and constraints when choosing an approach.
In the end, the world of high-dimensional vector searches is vast and complex, but with the right tools and techniques, it can be navigated effectively. Whether you’re a seasoned data scientist or a developer just starting out in the field, I hope this article has provided you with valuable insights and practical knowledge that you can apply in your own projects.