Optimizing Proximity Search: Upfront Analysis for Enhanced Efficiency and Cost-Effectiveness
Encoding and Indexing Source Code
Proximity search, also known as nearest neighbors search, is at the heart of many automation systems companies are building today. It drives innovative features in the tech products and services that nearly every business relies on, and it has been foundational to advances across data science and machine learning, particularly in search and recommendation systems. Despite its widespread application, implementing proximity search presents each business with a unique set of challenges, demanding tailored solutions to harness its potential effectively.
With AI-supported tools today, we can optimize proximity search, making it not only more efficient but also more cost-effective, ensuring that businesses can thrive in an increasingly complex world.

Optimizing proximity search involves two critical components: encoding the data (such as source code) and indexing the embedding results. In this blog, we'll explore both aspects using practical examples from the domain of source code intelligence. We'll utilize the open-source Encoder repository—a sophisticated tool designed to facilitate efficient encoding and evaluate embedding models. These capabilities are crucial for enhancing downstream functionalities, including semantic search and AI pipeline optimization. Our previous discussions highlighted its utility in selecting the optimal embedding model to ensure superior encoding performance. This article aims to provide valuable insights for a broad audience, including software engineers, machine learning engineers, data scientists, AI practitioners, and the general developer community.
Encoding
Encoding transforms the input data into vectors in an embedding space that is easier to handle in downstream processing. Thanks to the fast-paced advances in AI and machine learning over the past several years, in 2024 we have easy access to a wide range of embedding models. Specifically, we can move beyond exact substring pattern matching (e.g., n-gram methods) and start capturing the deep semantic meaning of code and text. As discussed previously, Encoder is designed so that teams anywhere can test different embedding models for source code.
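To make the encoding step concrete, here is a minimal sketch in Python using the sentence-transformers package. The repository path and file filter are illustrative assumptions, not part of Encoder itself:

```python
# A minimal sketch of the encoding step, assuming the sentence-transformers
# package and access to the Hugging Face model hub. The repository path and
# file filter are illustrative, not part of Encoder.
from pathlib import Path

from sentence_transformers import SentenceTransformer

# The jina-embeddings-v2 models ship custom modeling code,
# hence trust_remote_code=True.
model = SentenceTransformer(
    "jinaai/jina-embeddings-v2-base-code", trust_remote_code=True
)

files = list(Path("my_repo").rglob("*.py"))  # hypothetical repository path
texts = [f.read_text(encoding="utf-8") for f in files]

# One fixed-length vector per file; semantically similar files land close
# together in the embedding space.
embeddings = model.encode(texts, show_progress_bar=True)
print(embeddings.shape)  # (num_files, embedding_dim)
```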
In this exploration, we examine two embedding models: jina-embeddings-v2-base-code and jina-embeddings-v2-base-en. Despite their similar names, these models are trained on different datasets tailored to their specific applications. The base-code model employs the RoBERTa tokenizer, which is optimized for handling code by encoding data at the byte level. In contrast, the base-en model uses the BERT tokenizer, which is better suited to natural language, encoding data at the word-piece (subword) level. This differentiation in tokenization aligns with the nature of the input data, ensuring that code and text are each processed in the most effective manner.
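You can observe this difference directly by tokenizing the same snippet with both tokenizers. A small sketch, assuming the transformers package and the public Hugging Face model IDs:

```python
# A quick way to see the tokenizer difference yourself (assumes the
# transformers package; the model IDs are the public Hugging Face ones).
from transformers import AutoTokenizer

code_tok = AutoTokenizer.from_pretrained("jinaai/jina-embeddings-v2-base-code")
en_tok = AutoTokenizer.from_pretrained("jinaai/jina-embeddings-v2-base-en")

snippet = "def top_k(scores, k=5): return sorted(scores)[-k:]"

# Byte-level BPE (RoBERTa-style) keeps symbols and identifier chunks intact.
print(code_tok.tokenize(snippet))
# WordPiece (BERT-style) splits identifiers into natural-language word pieces.
print(en_tok.tokenize(snippet))
```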
Because the two embedding models are constructed differently, they produce different embeddings for the same input. Figure 2 shows the results color-coded and plotted in a lower-dimensional space via t-SNE. This means you need to take care that all data inputs are embedded with the model most appropriate for your business needs. Think of this as arranging books by similarity within a library. If you change embedding models, the same books move to another library with a different room arrangement.
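A projection like the one in Figure 2 can be produced with scikit-learn's t-SNE. The sketch below uses random stand-in matrices where the real per-model embeddings would go:

```python
# A sketch of the projection behind Figure 2 using scikit-learn's t-SNE.
# The random matrices below are stand-ins for the embeddings each model
# produces for the same set of files.
import matplotlib.pyplot as plt
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
embeddings = {
    "base-code": rng.normal(size=(200, 768)),  # placeholder vectors
    "base-en": rng.normal(size=(200, 768)),    # placeholder vectors
}

for name, emb in embeddings.items():
    xy = TSNE(n_components=2, random_state=0).fit_transform(emb)
    plt.scatter(xy[:, 0], xy[:, 1], s=10, alpha=0.6, label=name)
plt.legend()
plt.title("t-SNE projection of file embeddings (illustrative)")
plt.show()
```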

The embedding model transforms the input data into a high-dimensional space where similar items sit close to each other, and this closeness can be measured with metrics like cosine or Euclidean distance. In Table 1 we look at the top 5 files most similar to cmd/modeldeployer/main.py. As you can see, the two embedding models produce slightly different rankings based on Euclidean distance. While we might conclude that the base-code model is better, the important message is that you need to do the analysis to know whether an embedding model is working as expected and whether a better model is available.
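A ranking like the one in Table 1 boils down to a distance computation like the sketch below. The file names and vectors here are stand-ins; in practice the matrix holds the real file embeddings:

```python
# A sketch of a Table 1 style query: rank every file by its distance to a
# query file's embedding. File names and vectors are illustrative stand-ins.
import numpy as np

def top_k_neighbors(query_vec, matrix, names, k=5, metric="euclidean"):
    if metric == "euclidean":
        dists = np.linalg.norm(matrix - query_vec, axis=1)
    else:  # cosine distance = 1 - cosine similarity
        sims = (matrix @ query_vec) / (
            np.linalg.norm(matrix, axis=1) * np.linalg.norm(query_vec)
        )
        dists = 1.0 - sims
    order = np.argsort(dists)[: k + 1]  # the query matches itself at distance 0
    return [(names[i], float(dists[i])) for i in order if dists[i] > 0][:k]

rng = np.random.default_rng(1)
emb = rng.normal(size=(100, 768))
names = [f"file_{i}.py" for i in range(100)]
print(top_k_neighbors(emb[0], emb, names, k=5))
```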
While the embedding space gives us a structured way to represent the input data, a big challenge arises when we need to efficiently find nearest neighbors in this high-dimensional space. Furthermore, as the data source grows, this challenge becomes more prominent.
Indexing
How do we ensure that proximity search will be scalable and cost-efficient? For instance, to get the top N nearest neighbors of a query code file (Table 1), we would have to calculate the pairwise distance between this query and every other file in the embedding space: a full scan! We do not want this. The solution to this problem is called indexing, and it has been around in computer science for a long time.
Indexing structures the data in the embedding space to significantly reduce the number of comparisons needed to complete a proximity search. How does it do this? Several indexing algorithms are widely used: Flat, HNSW, and IVFFlat (Table 2). These algorithms are implemented in most database tools, and packages are available in many programming languages. There are also indexing algorithms specific to code intelligence, but for the purposes of this blog we will discuss the general-purpose ones.

The Flat method, also known as the brute force approach, performs exhaustive comparisons between the user query and all data points. It guarantees the highest accuracy but at high computational expense, making it suitable for smaller datasets. Hierarchical Navigable Small World (HNSW) builds multiple layers of connected graphs, allowing faster searches at the expense of some accuracy; its balance of speed and precision makes it effective for medium to large datasets. Inverted File with Flat Search (IVFFlat), on the other hand, clusters the data vectors and only searches within the most relevant clusters, significantly speeding up the search while sacrificing some accuracy, which makes it ideal for very large datasets where response time is critical.
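The faiss library implements all three, which makes it easy to compare them on your own embeddings. A minimal sketch with random stand-in vectors:

```python
# A sketch of the three index types using the faiss library
# (pip install faiss-cpu); the vectors are random stand-ins.
import faiss
import numpy as np

d, n = 768, 10_000
xb = np.random.default_rng(0).normal(size=(n, d)).astype("float32")

flat = faiss.IndexFlatL2(d)          # exhaustive and exact
hnsw = faiss.IndexHNSWFlat(d, 32)    # layered graph, approximate but fast

quantizer = faiss.IndexFlatL2(d)     # assigns vectors to clusters
ivf = faiss.IndexIVFFlat(quantizer, d, 100)  # 100 clusters
ivf.train(xb)                        # IVF needs a training pass
ivf.nprobe = 8                       # clusters scanned per query

for index in (flat, hnsw, ivf):
    index.add(xb)
    dists, ids = index.search(xb[:1], 5)  # top-5 neighbors of one query
    print(type(index).__name__, ids[0])
```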
Going back to the library analogy, you can think of indexing as follows. In a library, books are often categorized and shelved by genre, author, or topic, similar to the IVFFlat indexing method where data is clustered into groups. This system allows you to go directly to one section of shelves rather than searching through every book in the library. For a more precise but time-consuming approach, akin to the Flat indexing method, you might scan every book in a specific section to find exactly what you need, ensuring you don't miss anything, but at the expense of time.
On the other hand, a method like HNSW can be compared to asking a knowledgeable librarian for help. The librarian uses their understanding of the layout and connections between different topics to guide you efficiently through different sections, finding the book that best matches your query through a series of educated guesses and shortcuts.
Optimization for Source Code Management
Investing analysis resources in encoding and indexing strategies (Figure 1) is crucial for optimal source code management, which is central to many businesses' operations and products. Tailoring how developers interact with code repositories can further enhance this process.
While businesses rely on common tools for infrastructure and data management, their source code, written in diverse languages and formats, is unique to them and demands tailored development and maintenance. For example, optimizing code search involves understanding the repository's metadata, like size, file types, and commit history, to choose suitable encoding and indexing strategies.

For mixed-content repositories containing both code and natural language documents, selecting or designing embedding strategies that perform well across most, if not all, file types is essential to avoid degrading search performance. Testing various models (Figure 3) helps identify which embedding model yields the best results. Perhaps a model specializing in code embedding could still serve the whole repository if natural language documents represent only a small percentage of it.
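One rough diagnostic, sketched below with stand-in data, is to compare average within-group similarity per file type under a candidate model; a markedly lower score for documentation files would be a signal to test a natural-language or hybrid model instead:

```python
# A rough per-file-type sanity check with stand-in data: compare average
# within-group cosine similarity for code versus documentation embeddings
# under one model. A much lower score for one group can flag a poor fit.
import numpy as np

rng = np.random.default_rng(2)
groups = {
    ".py": rng.normal(size=(80, 768)),  # placeholder code embeddings
    ".md": rng.normal(size=(20, 768)),  # placeholder doc embeddings
}

for ext, emb in groups.items():
    unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sims = unit @ unit.T
    upper = sims[np.triu_indices(len(emb), k=1)]  # unique pairs only
    print(ext, round(float(upper.mean()), 3))
```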
In addition, indexing strategies must be tailored to repository size and performance requirements. Smaller repositories might use the brute force (Flat) approach for accuracy, while larger ones benefit from faster, albeit less precise, methods like HNSW or IVFFlat. A mixed strategy may be necessary when different repositories have nuanced accuracy demands.
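As a starting point, such a policy can be written down explicitly. The thresholds in this hypothetical sketch are illustrative, not benchmarked recommendations:

```python
# A hypothetical starting-point heuristic for picking an index type from
# repository scale; the thresholds are illustrative and should be tuned
# against your own latency and recall benchmarks.
def choose_index(num_vectors: int, recall_critical: bool = False) -> str:
    if recall_critical or num_vectors < 10_000:
        return "Flat"     # exact results, affordable at small scale
    if num_vectors < 1_000_000:
        return "HNSW"     # strong speed/recall trade-off at medium scale
    return "IVFFlat"      # scales to very large corpora, tunable via nprobe

print(choose_index(5_000))       # -> Flat
print(choose_index(250_000))     # -> HNSW
print(choose_index(50_000_000))  # -> IVFFlat
```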
In this blog, we cover two major steps in developing an optimized code search experience. Choosing the right encoding and indexing pipeline not only minimizes errors but also enhances result relevance and accuracy. High accuracy also improves developers' quality of life, allowing them to focus more on creating impactful applications and products.
About
This is a technical blog in an ongoing series [encoder.run on substack] that captures features and use cases of Encoder, an open-source application for deploying various models and creating pipelines that produce embedding data in an efficient and scalable way.