Show HN: Fast vector similarity using Rust and Python (github.com/dicklesworthstone)
141 points by eigenvalue on Aug 23, 2023 | 40 comments
I recently found myself computing the similarity between lots of very high dimensional vectors (i.e., sentence embedding vectors from LLMs), and I wanted to try some more powerful measures of similarity/dependency than just Cosine similarity, which seems to be the default for everything nowadays because of its computational efficiency.

There are many other more involved measures that can detect more subtle relationships, but the problem is that some of them are quite slow to compute, especially if you're trying to do it in Python. For my favorite measure of statistical dependency, Hoeffding's D, that's true even if you use Numpy. Since I recently learned Rust and wanted to learn how to make Python packages using Rust, I put together this new library that I call Fast Vector Similarity.

I was blown away by the performance of Rust and the quality of the tooling while making this. And even though it required a lot of fussing with GitHub Actions, I was also really impressed with just how easy it was to make a Python library using Rust that could be automatically compiled into wheels for every combination of platform (Linux, Windows, Mac) and Python version (3.8 through 3.11) and uploaded to PyPI, all triggered by a commit to the repo and handled by GitHub's servers-- and all for free if you're working on a public repo!

Anyway, this library can easily be installed to try out using `pip install fast_vector_similarity`, and you can see some simple demo Python code in the readme to show how to use it.

Aside from exposing some very high performance implementations of some very nice similarity measures, I also included the ability to get robust estimates of these measures using the bootstrap method. Basically, if you have two very high dimensional vectors, instead of using the entire vector to measure similarity, you can take the same random subset of indices from both vectors and compute the similarity of just those elements. Then you repeat the process hundreds or thousands of times and look at the robust average (i.e., throw away the results outside the 25th to 75th percentile and average the remaining ones, to reduce the impact of outliers) and standard deviation of the results. Obviously this is computationally demanding, but it's still reasonable if you're not trying to compute it for too many vectors.
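
To make that concrete, here is a rough Rust sketch of the bootstrap scheme (illustrative only, not the library's actual code; cosine similarity stands in for whichever measure you care about, and it assumes the rand crate):

    use rand::seq::index::sample;

    // Stand-in similarity measure for the sketch below.
    fn cosine(a: &[f64], b: &[f64]) -> f64 {
        let dot: f64 = a.iter().zip(b).map(|(x, y)| x * y).sum();
        let na = a.iter().map(|x| x * x).sum::<f64>().sqrt();
        let nb = b.iter().map(|x| x * x).sum::<f64>().sqrt();
        dot / (na * nb)
    }

    fn bootstrap_similarity(a: &[f64], b: &[f64], subset: usize, iters: usize) -> f64 {
        let mut rng = rand::thread_rng();
        let mut results: Vec<f64> = (0..iters)
            .map(|_| {
                // Draw the SAME random indices from both vectors
                let idx = sample(&mut rng, a.len(), subset);
                let sa: Vec<f64> = idx.iter().map(|i| a[i]).collect();
                let sb: Vec<f64> = idx.iter().map(|i| b[i]).collect();
                cosine(&sa, &sb)
            })
            .collect();
        // Robust average: keep only the 25th-75th percentile band, then average
        results.sort_by(|x, y| x.partial_cmp(y).unwrap());
        let (lo, hi) = (results.len() / 4, results.len() * 3 / 4);
        results[lo..hi].iter().sum::<f64>() / (hi - lo) as f64
    }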

Everything is designed to fully saturate the performance of multi-core machines through extensive use of broadcasting/vectorization and parallel processing via the Rayon library. I was really impressed with how easy and low-overhead it is to write highly parallelized code in Rust, especially coming from Python, where you have to jump through a lot of hoops to use multiprocessing and there is a ton of overhead.
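
For a sense of how little ceremony that takes, this is roughly the whole trick with Rayon (a toy sketch, not the library's code): swap iter() for par_iter() and the work fans out across every core.

    use rayon::prelude::*;

    // Score one query vector against many stored vectors in parallel.
    fn pairwise_dots(query: &[f64], corpus: &[Vec<f64>]) -> Vec<f64> {
        corpus
            .par_iter()
            .map(|v| query.iter().zip(v).map(|(a, b)| a * b).sum::<f64>())
            .collect()
    }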

Anyway, please let me know what you think. I'm looking to add more measures of similarity if I can find ones that can be efficiently computed (I already gave up on including HSIC because I couldn't get it to go fast enough, even using BLAS/LAPACK).




I realize you have a different use case in mind, so the following feedback might not be that useful to you. I also haven’t reviewed your API design, so perhaps you have already considered this angle.

I picked up a habit in Go package design which I now apply everywhere: one of my immediate first thoughts when evaluating a library is how it creates or uses threads.

When I see that a library wants to consume all of the available cores of my system, or uses an internal task scheduler or threading mechanism, I worry that it will compete for CPU time with the other processes, threads, etc. that my system and process are using.

On the other hand, I appreciate when a library exposes an interface for me to offer it threads or a scheduler, or even better if it offers primitives that I can use directly in my own threads or tasks.

Perhaps this is mainly a problem for high-concurrency servers and UIs, but it’s always one of my first thoughts, and Rust is an ecosystem that needs to support those use cases. Many threaded libraries, even otherwise very good ones, are not designed with this in mind, and it is a problem.


It's a good point, and the way my library is designed now, it would be a big resource hog. I'm not sure of the best way to handle that, given that it's supposed to be a super easy-to-use library for Python. Perhaps I could add an optional parameter like "no_resource_hogging", and if enabled, it would ratchet down the amount of parallelism, or reduce the CPU priority of the threads to a very low level so they won't use up all the system resources. The problem with the latter approach is that it starts being system-specific. For example, you could do something like this:

  use rayon::ThreadPoolBuilder;

  fn create_low_priority_pool() -> rayon::ThreadPool {
      ThreadPoolBuilder::new()
          // Lower each worker thread's priority as soon as Rayon starts it
          .start_handler(|_| {
              #[cfg(target_os = "linux")]
              {
                  use libc::{sched_param, sched_setscheduler, SCHED_IDLE};
                  unsafe {
                      // pid 0 means the calling thread; SCHED_IDLE only gets CPU
                      // time when nothing else wants it
                      let mut param: sched_param = std::mem::zeroed();
                      param.sched_priority = 0;
                      sched_setscheduler(0, SCHED_IDLE, &param);
                  }
              }
              #[cfg(target_os = "windows")]
              {
                  use winapi::um::processthreadsapi::{GetCurrentThread, SetThreadPriority};
                  use winapi::um::winbase::THREAD_PRIORITY_LOWEST;
                  unsafe {
                      // Needs the current thread's handle, not a null pointer
                      SetThreadPriority(GetCurrentThread(), THREAD_PRIORITY_LOWEST as i32);
                  }
              }
          })
          .build()
          .unwrap()
  }
  
But I don't know how that would work for macOS. Anyway, it's something to think about.


I would say default to min(affinity, system threads, 16) and let the user input a specific number to override it.

I had a problem recently where a library started a process per CPU thread; I suspect the author didn't intend for it to start 256...

Affinity is only easy to query on some platforms, but you should use it instead of the system thread count when it's available.
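
Something along these lines as a sketch; I believe std's available_parallelism already accounts for the affinity mask on Linux, so it's a reasonable stand-in where querying affinity directly is awkward:

    use rayon::ThreadPoolBuilder;

    fn build_pool(user_threads: Option<usize>) -> rayon::ThreadPool {
        // Default: min(detected parallelism, 16); any user-supplied value wins.
        let detected = std::thread::available_parallelism()
            .map(|n| n.get())
            .unwrap_or(1);
        let threads = user_threads.unwrap_or_else(|| detected.min(16));
        ThreadPoolBuilder::new()
            .num_threads(threads)
            .build()
            .unwrap()
    }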


One approach I’ve used in the past is to offer an API which takes threads or a worker pool as a parameter, and a separate API which does that for you and calls the first one.

Then in your Python wrapper you can use the wrapper, but your users who care can provide the pool they’ve already created.

Bonus points if it’s an interface or trait that the user can adapt their own scheduler to, if they don’t use rayon. But, I can see how that could be tricky if the design of your library is tightly coupled with the threading solution.

One way to frame it is that thread creation is a significant side effect.
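
In Rust/rayon terms, the shape I mean is roughly this (hypothetical names, and the dot product is just a placeholder for the real work):

    use rayon::prelude::*;

    // Core API: the caller supplies (and therefore controls) the thread pool.
    pub fn similarities_in(pool: &rayon::ThreadPool, query: &[f64], corpus: &[Vec<f64>]) -> Vec<f64> {
        pool.install(|| {
            corpus
                .par_iter()
                .map(|v| query.iter().zip(v).map(|(a, b)| a * b).sum::<f64>())
                .collect()
        })
    }

    // Convenience API: builds a default pool and delegates to the core one.
    pub fn similarities(query: &[f64], corpus: &[Vec<f64>]) -> Vec<f64> {
        let pool = rayon::ThreadPoolBuilder::new().build().unwrap();
        similarities_in(&pool, query, corpus)
    }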


Author here. I just wanted to add that I struggled mightily to include normalized mutual information in this library but simply couldn’t get it to run quickly enough with 15,000 dimensional vectors to include it (it caused the “all” option to take 20x longer to execute).

I was able to get non-normalized MI to run fast, but then you end up with a measure that doesn’t have any intuitive meaning (i.e., it’s not a value in the range [0.0, 1.0] or anything like that— it’s something like 5.2, where this depends on the length of the vectors and the magnitude of the elements).

I tried many different approaches, but computing the joint entropy was just too computationally expensive, even using BLAS/LAPACK and lots of vectorization. I tried using various kernel density estimation techniques but couldn’t get anything to work well and quickly. I suppose I could just approximate it by taking a random subset of indices like I did for the distance correlation measure, but I find that unsatisfying.

So if there are any math/Rust wizards in here who have any suggestions, please let me know!


If you don't mind, how is this calculated? I found some references, but I'm not sure I understand.


There are lots of different ways of doing it (for a reference, see https://www.mdpi.com/1099-4300/19/11/631 ), but this was the basic approach I was trying that I gave up on because of performance concerns:

    // Note: assumes the rustfft, ndarray, and rayon crates, and that grid_size
    // is a power of two (a requirement of Radix4).
    use ndarray::Array1;
    use rayon::prelude::*;
    use rustfft::{algorithm::Radix4, num_complex::Complex, Fft, FftDirection};

    fn fast_kde(data: &[f64], grid_size: usize, bandwidth_opt: Option<f64>) -> Vec<f64> {
        let n = data.len() as f64;
        let min = *data.iter().min_by(|x, y| x.partial_cmp(y).unwrap()).unwrap();
        let max = *data.iter().max_by(|x, y| x.partial_cmp(y).unwrap()).unwrap();
        let std_dev = (data.iter().map(|&x| x.powi(2)).sum::<f64>() / n - (data.iter().sum::<f64>() / n).powi(2)).sqrt();
        // Optimal bandwidth using Silverman's rule or use provided bandwidth
        let bandwidth = bandwidth_opt.unwrap_or(1.06 * std_dev * n.powf(-0.2));
        // Reflecting data at boundaries
        let range = max - min;
        let mut reflected_data = data.to_vec();
        reflected_data.extend(data.iter().map(|&x| 2.0 * min - x));
        reflected_data.extend(data.iter().map(|&x| 2.0 * max - x));
        let fft = Radix4::new(grid_size, FftDirection::Forward); // Use the enum
        // Create a grid and kernel
        let mut kernel = vec![Complex::new(0.0, 0.0); grid_size];
        for i in 0..grid_size {
            let x = (i as f64 / grid_size as f64) * 2.0 * std::f64::consts::PI;
            kernel[i] = Complex::new((-0.5 * (x / bandwidth).powi(2)).exp(), 0.0);
        }
        // Compute FFT of the kernel
        fft.process(&mut kernel);
        // Compute the FFT of the data (with kernel applied)
        let mut data_fft = vec![Complex::new(0.0, 0.0); grid_size];
        for value in reflected_data {
            let index = ((value - min) / range * grid_size as f64).floor() as usize % grid_size;
            data_fft[index].re += 1.0;
        }
        fft.process(&mut data_fft);
        // Convolve in frequency space
        let convolved: Vec<Complex<f64>> = kernel.iter().zip(data_fft.iter()).map(|(k, d)| k * d).collect();
        // Inverse FFT
        let mut inverse_fft = convolved;
        let ifft = Radix4::new(grid_size, FftDirection::Inverse); // Use the enum
        ifft.process(&mut inverse_fft);
        // Normalize
        let sum: f64 = inverse_fft.iter().map(|c| c.re).sum();
        inverse_fft.iter().map(|c| c.re / sum).collect()
    }

    fn entropy(data: &Array1<f64>, bandwidth: f64) -> f64 {
        assert!(bandwidth > 0.0, "Bandwidth must be greater than 0");
        let n = data.len() as f64;
        let normalizing_constant = 0.5 * (2.0 * std::f64::consts::PI * std::f64::consts::E * bandwidth * bandwidth).ln();
        let entropy_estimate: f64 = data.iter().map(|&x| {
            let p_x = kernel_density(data.as_slice().unwrap(), x, bandwidth);
            if p_x > 0.0 {
                -p_x * (p_x.ln())
            } else {
                0.0
            }
        }).sum();
        // Divide by the number of data points and add the normalizing constant
        (entropy_estimate / n) + normalizing_constant
    }

    fn mutual_information_kde(x: &Array1<f64>, y: &Array1<f64>, bandwidth: f64) -> f64 {
        assert_eq!(x.len(), y.len());
        let n = x.len() as f64;
        let x_slice = x.as_slice().unwrap();
        let y_slice = y.as_slice().unwrap();
        let mi: f64 = x_slice.par_iter().enumerate().map(|(i, &xi)| {
            let p_x = kernel_density(x_slice, xi, bandwidth);
            let p_y = kernel_density(y_slice, y_slice[i], bandwidth);
            let p_xy = joint_kernel_density(x_slice, y_slice, xi, y_slice[i], bandwidth);
            if p_xy > 0.0 {
                (p_xy / (p_x * p_y)).ln() / n
            } else {
                0.0
            }
        }).sum();
        let h_x = entropy(x, bandwidth);
        let h_y = entropy(y, bandwidth);
        let nmi = 2.0 * mi / (h_x + h_y);
        nmi
    }


Oops, I left out this part:

    // Assumes the statrs crate for the Gaussian pdf; takes slices to match how
    // it's called from mutual_information_kde above.
    use rayon::prelude::*;
    use statrs::distribution::{Continuous, Normal};

    fn joint_kernel_density(x_data: &[f64], y_data: &[f64], x: f64, y: f64, bandwidth: f64) -> f64 {
        let normal = Normal::new(0.0, bandwidth).unwrap();
        let density: f64 = x_data.par_iter().zip(y_data.par_iter()).map(|(&x_val, &y_val)| {
            normal.pdf((x - x_val) / bandwidth) * normal.pdf((y - y_val) / bandwidth)
        }).sum();
        density / ((x_data.len() as f64) * (bandwidth * bandwidth))
    }
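
And kernel_density is just the 1-D analogue of that; roughly something like this (the exact version is not shown above):

    // Rough 1-D analogue of joint_kernel_density (illustrative reconstruction)
    fn kernel_density(data: &[f64], x: f64, bandwidth: f64) -> f64 {
        let normal = Normal::new(0.0, bandwidth).unwrap();
        let density: f64 = data
            .iter()
            .map(|&x_val| normal.pdf((x - x_val) / bandwidth))
            .sum();
        density / (data.len() as f64 * bandwidth)
    }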


Always great to see projects like this that distribute wheels for a whole ton of different platforms - makes installation so much less painful if you don't need to ensure you have a full build toolchain on the machine where you run "pip install".

https://pypi.org/project/fast-vector-similarity/#files


Totally agree, and it’s actually trivially easy to do now if you use Maturin and GitHub Actions, whereas I think it was much more arduous a few years ago (although I can’t say I tried back then, so I’m not sure).


Could you do a future write-up on the whole process of how you got GitHub Actions to cross-compile to different platforms?

That sounds very fascinating.


Yeah, like the other commenter said, everything is in this file here:

https://github.com/Dicklesworthstone/fast_vector_similarity/...

If you also make your project using Rust and Maturin, you can literally just copy and paste that into your project because it's totally generic, and if the repo is public, GitHub will just run it all for you for free.

The only thing is you need to create an account on PyPI (pip) and add two-factor auth so you can generate an API key. Then you go into the repo settings, go to the secrets section, and create a GitHub Actions secret with the name PYPI_API_TOKEN and make the value your PyPI token. That's it! It will not only compile all the wheels for you but even upload the project to PyPI for you using the settings found in your pyproject.toml file, like this:

https://github.com/Dicklesworthstone/fast_vector_similarity/...


Looks pretty inspectable in the .github folder? I’m not 100% sure how Actions work though, so maybe there’s something critical missing.


Awesome work!

At Qdrant we do this at scale, storing billions of vectors in a cluster of any size. It's also in Rust, which turned out to be an amazing choice, and fully open source. It uses various techniques to keep things performant, such as vectorization (for multiple architectures), quantization (a form of compression), and more.

https://github.com/qdrant/qdrant


If anyone is interested in how to use something besides OpenAI for embeddings (the ada-002 model) consider checking out Instructor Large: https://huggingface.co/hkunlp/instructor-large

There is some reference code in the DoctorGPT project that uses this approach: https://github.com/FeatureBaseDB/DoctorGPT. That project is designed to render PDFs to images and then run OCR on them (because not all PDFs have embedded text), and it uses FeatureBase as the storage/vector engine for prompt tuning and assembly.


Cool, I made a similar kind of tool recently that I also shared on HN a couple of weeks ago. You might find it useful for generating and managing LLM embeddings locally:

https://github.com/Dicklesworthstone/llama_embeddings_fastap...


Nice work!

How does it compare against FAISS?


It's really quite different from FAISS in its goals. FAISS is for when you have millions of stored vectors and want to do quick semantic similarity for retrieval. It manages to do this with some very clever approximations that narrow the search space.

My library is very different: it's for when you don't have quite so many vectors, or when simpler measures like Cosine similarity aren't quite cutting it for whatever reason and you want to use more powerful techniques.

I have another project that does near-duplicate image detection (robust to all sorts of transformations), and it uses a similar approach of turning the images into high dimensional vectors. Using cosine similarity with FAISS is great for narrowing the field, but once you have done that, I find that you can get much higher quality results by augmenting the analysis with measures like Hoeffding's D. But until now, there were no high performance libraries for computing that in Python, and my library using Rust is orders of magnitude faster than using NumPy like this:

    import numpy as np
    import scipy.stats

    def hoeffd_inner_loop_func(i, R, S):
        # See slow_exact_hoeffdings_d_func for definition of R, S
        Q_i = 1 + sum(np.logical_and(R<R[i], S<S[i]))
        Q_i = Q_i + (1/4)*(sum(np.logical_and(R==R[i], S==S[i])) - 1)
        Q_i = Q_i + (1/2)*sum(np.logical_and(R==R[i], S<S[i]))
        Q_i = Q_i + (1/2)*sum(np.logical_and(R<R[i], S==S[i]))
        return Q_i

    def slow_exact_hoeffdings_d_func(x, y):
        #Based on code from here: https://stackoverflow.com/a/9322657/1006379
        #For background see: https://projecteuclid.org/download/pdf_1/euclid.aoms/1177730150
        x = np.array(x)
        y = np.array(y)
        N = x.shape[0]
        R = scipy.stats.rankdata(x, method='average')
        S = scipy.stats.rankdata(y, method='average')
        print('Computing Q with list comprehension...')
        with MyTimer():  # MyTimer is just a simple timing context manager (not shown)
            Q = [hoeffd_inner_loop_func(i, R, S) for i in range(N)]
            Q = np.array(Q)
            D1 = sum(((Q-1)*(Q-2)))
            D2 = sum((R-1)*(R-2)*(S-1)*(S-2))
            D3 = sum((R-2)*(S-2)*(Q-1))
            D = 30*((N-2)*(N-3)*D1 + D2 - 2*(N-2)*D3) / (N*(N-1)*(N-2)*(N-3)*(N-4))
        print('Exact Hoeffding D: '+ str(round(D,8)))
        return D
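
For comparison, a straight Rust translation of that same brute-force computation looks roughly like this (not the exact code in the library, but it shows where Rayon helps: the O(n^2) loop over Q parallelizes with a single par_iter):

    use rayon::prelude::*;

    // Average ranks with ties, like scipy.stats.rankdata(..., method='average')
    fn average_ranks(xs: &[f64]) -> Vec<f64> {
        let n = xs.len();
        let mut order: Vec<usize> = (0..n).collect();
        order.sort_by(|&a, &b| xs[a].partial_cmp(&xs[b]).unwrap());
        let mut ranks = vec![0.0; n];
        let mut i = 0;
        while i < n {
            let mut j = i;
            while j + 1 < n && xs[order[j + 1]] == xs[order[i]] {
                j += 1;
            }
            let avg_rank = (i + j) as f64 / 2.0 + 1.0; // ranks are 1-based
            for k in i..=j {
                ranks[order[k]] = avg_rank;
            }
            i = j + 1;
        }
        ranks
    }

    fn hoeffdings_d(x: &[f64], y: &[f64]) -> f64 {
        let n = x.len() as f64;
        let r = average_ranks(x);
        let s = average_ranks(y);
        // Q_i as in the NumPy version above; this O(n^2) loop dominates the runtime
        let q: Vec<f64> = (0..x.len())
            .into_par_iter()
            .map(|i| {
                let mut q_i = 1.0;
                for j in 0..x.len() {
                    if r[j] < r[i] && s[j] < s[i] { q_i += 1.0; }
                    if r[j] == r[i] && s[j] == s[i] { q_i += 0.25; }
                    if r[j] == r[i] && s[j] < s[i] { q_i += 0.5; }
                    if r[j] < r[i] && s[j] == s[i] { q_i += 0.5; }
                }
                q_i - 0.25 // drop the j == i term picked up by the equality case
            })
            .collect();
        let d1: f64 = q.iter().map(|&qi| (qi - 1.0) * (qi - 2.0)).sum();
        let d2: f64 = r.iter().zip(&s)
            .map(|(&ri, &si)| (ri - 1.0) * (ri - 2.0) * (si - 1.0) * (si - 2.0))
            .sum();
        let d3: f64 = r.iter().zip(&s).zip(&q)
            .map(|((&ri, &si), &qi)| (ri - 2.0) * (si - 2.0) * (qi - 1.0))
            .sum();
        30.0 * ((n - 2.0) * (n - 3.0) * d1 + d2 - 2.0 * (n - 2.0) * d3)
            / (n * (n - 1.0) * (n - 2.0) * (n - 3.0) * (n - 4.0))
    }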


It sounds like it's a brute-force comparison, so I imagine any approximate solution will be a lot faster. I have a similar brute-force search library in Rust, though I didn't find Rayon to be great (at the time).


It seems to really depend on how you use Rayon, and on the size of the data: basically, the overhead of setting up the parallelism can't exceed the savings. Wherever it's possible to use ndarray broadcasting (vectorization), I do that instead. But I do think Rayon and the Rust compiler have gotten better about being smart, and I'm super impressed with the speed.


The library looks great, thank you for building this!

Am I understanding it correctly that you use it to detect hidden relationships in your data to build better ML models? If so, have you tried it on non-NLP use cases?


Yes. If you look at some of my other replies here, I am using this approach successfully in another project to do robust near-duplicate image detection. So it’s also using high dimensional vector embeddings, but they are embeddings of images from ResNet-style vision models rather than from text-based LLMs.


Is there any resource that compares cosine similarity with other alternatives?


Yeah, there are some good survey articles that review various methods. Here are some links you can check out:

https://cloud.r-project.org/web/packages/correlation/vignett...

https://juniperpublishers.com/bboaj/pdf/BBOAJ.MS.ID.555795.p...

https://www.stats.ox.ac.uk/~cucuring/Lecture_2_Correlations_...

https://blogs.sas.com/content/iml/2021/04/28/compute-hoeffdi...

Essentially, cosine similarity just measures the extent to which the vectors point in the same direction in N dimensional space. That turns out to be quite discriminative when N is very high (i.e., over 1000). But it is not as good for finding complex non-linear relationships.


The objective function is chosen with retrieval in mind, since training is done once and retrieval is performed many times. You are expected to use a simple similarity measure.


In the example it looks like this needs one JSON encoding/decoding step per call for parameters & results which seems a bit inefficient, but maybe my intuition is wrong there? Or does it allow dumping a whole list of vectors (I only see vector1 and vector2 though)?


Yes, you're right, and it's a good point. I'll add an additional python function that allows you to supply a pandas dataframe or numpy array for one of the arguments so you can specify a query vector and a matrix/table of vectors to search across.


Is this similar to other probabilistic vector similarity search like LSH?


It's not so much for search (although it can obviously be used that way) as it is for just computing various kinds of similarity/dependency for arbitrary vectors of equal length.

To the extent you would want to use it for vector/semantic search applications, I would advise doing a first pass with something like FAISS to narrow down the list of potential matches from the entire universe of stored vectors, and THEN trying to compute these other measures of similarity/dependency only on the "most likely" vectors using my library. Hopefully that makes sense.


It says here that you work on NFTs?


It’s a good start, but you can’t generally get even remotely close to hardware potential in Rust, let alone Python.

I had to implement a separate C99 library to always use the newest SIMD intrinsics (occasionally leveraging SVE on more recent Arm CPUs) that compilers don’t know how to generate on their own.

That library is in turn used in USearch, which is designed for Approximate Search, but some users recently reported that they use it for brute force as well… where it performed 20x faster than FAISS.

https://github.com/ashvardanian/simsimd


> you can’t generally get even remotely close to hardware potential in Rust

Can you describe the gaps you’re aware of with regard to Rust’s SIMD capabilities? There has been a lot of work in Rust and LLVM to make these things work. I’m mostly just curious, given your experience in this area. Here’s some of the detection available in Rust in this area, and I know there is more out there: https://doc.rust-lang.org/core/arch/


Cool. I considered trying to use SIMD tricks, but then it starts becoming very platform-dependent, and I wanted to keep it general. My version is already by far the fastest Hoeffding's D implementation available for use in Python, so it's quite useful even without leveraging every possible optimization. I'll check out your library though; it sounds cool. The link you shared gives a 404, by the way.


Updated, thanks :)


Rust supports SIMD intrinsics


Yes, and I believe the latest Rust toolchain is even able to automatically vectorize certain operations using SIMD primitives without explicitly specifying them in the Rust code. There has been a lot of recent work in Rust and LLVM to support that sort of thing, although that stuff is a bit over my pay grade.


That's not "the latest" Rust toolchain, LLVM has been able to do this for a long time. Of course, it gets better and better over time.
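
For instance, a plain element-wise loop like this is the kind of thing LLVM will usually vectorize on its own in a release build (float reductions are harder for it, since reordering the additions changes the result):

    // Toy example: scale a slice in place. Built with --release (and optionally
    // RUSTFLAGS="-C target-cpu=native"), LLVM typically turns this into SIMD code
    // without any explicit intrinsics.
    pub fn scale_in_place(xs: &mut [f64], k: f64) {
        for x in xs.iter_mut() {
            *x *= k;
        }
    }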


Yes, you're right of course. I meant more that it's getting better recently, and it's an active area of work in the Rust and LLVM dev communities. So if your last experience with this was a few years ago, it's worth checking it out again because it's actively improving.


Nice!

I recently implemented a C-based NumPy implementation of LSH to compress / recover cosine similarity[1]. It was my first time writing NumPy C code, and it was a lot of fun to massively improve the performance over pure Python[2].

1- https://github.com/softwaredoug/np-sims

2- https://softwaredoug.com/blog/2023/08/22/rand-projections-in...


Cool. I looked into using the CFFI approach in Python, and I found that using PyO3/Maturin was much simpler and easier, especially when it came to automatic compilation of wheels. If you're looking to make high performance functions for use in Python, I can't recommend that approach enough, and then you get to use Rust, which is amazing.

My experience from years ago when I was starting out in Python was that, as soon as you make people compile stuff to use your Python library, you've already lost, since most people don't have the tooling/compilers installed and configured correctly, and even if they do, it's just not reasonable to expect people to wait so long for everything to compile just to use a library from pip. It's really nice to be able to start from a fresh Ubuntu install, make a venv, and instantly pip install -r requirements.txt using only wheels-- the whole process takes seconds and there's no fuss.




