Show HN: Fast vector similarity using Rust and Python (github.com/dicklesworthstone)
141 points by eigenvalue on Aug 23, 2023 | 40 comments
I recently found myself computing the similarity between lots of very high dimensional vectors (i.e., sentence embedding vectors from LLMs), and I wanted to try some more powerful measures of similarity/dependency than just Cosine similarity, which seems to be the default for everything nowadays because of its computational efficiency.

There are many other more involved measures that can detect more subtle relationships, but the problem is that some of them are quite slow to compute, especially if you're trying to do it in Python. For my favorite measure of statistical dependency, Hoeffding's D, that's true even if you use Numpy. Since I recently learned Rust and wanted to learn how to make Python packages using Rust, I put together this new library that I call Fast Vector Similarity.

I was blown away by the performance of Rust and the quality of the tooling while making this. And even though it required a lot of fussing with GitHub Actions, I was also really impressed with just how easy it was to make a Python library using Rust that could be automatically compiled into wheels for every combination of platform (Linux, Windows, Mac) and Python version (3.8 through 3.11) and uploaded to PyPI, all triggered by a commit to the repo and handled by GitHub's servers-- and all for free if you're working on a public repo!

Anyway, this library can easily be installed to try out using `pip install fast_vector_similarity`, and you can see some simple demo Python code in the readme to show how to use it.

Aside from exposing some very high performance implementations of some very nice similarity measures, I also included the ability to get robust estimates of these measures using the bootstrap method. Basically, if you have two very high dimensional vectors, instead of using the entire vector to measure similarity, you can take the same random subset of indices from both vectors and compute the similarity of just those elements. Then you repeat the process hundreds or thousands of times and look at the robust average (i.e., throw away the results outside the 25th to 75th percentile and average the remaining ones, to reduce the impact of outliers) and standard deviation of the results. Obviously this is computationally demanding, but it's still reasonable if you're not trying to compute it for too many vectors.
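
To make that concrete, here is a rough Rust sketch of the bootstrap scheme (illustrative only, not the library's actual code; cosine similarity stands in for whichever measure you care about, and it assumes the rand crate):

    use rand::seq::index::sample;

    // Stand-in similarity measure for the sketch below.
    fn cosine(a: &[f64], b: &[f64]) -> f64 {
        let dot: f64 = a.iter().zip(b).map(|(x, y)| x * y).sum();
        let na = a.iter().map(|x| x * x).sum::<f64>().sqrt();
        let nb = b.iter().map(|x| x * x).sum::<f64>().sqrt();
        dot / (na * nb)
    }

    fn bootstrap_similarity(a: &[f64], b: &[f64], subset: usize, iters: usize) -> f64 {
        let mut rng = rand::thread_rng();
        let mut results: Vec<f64> = (0..iters)
            .map(|_| {
                // Draw the SAME random indices from both vectors
                let idx = sample(&mut rng, a.len(), subset);
                let sa: Vec<f64> = idx.iter().map(|i| a[i]).collect();
                let sb: Vec<f64> = idx.iter().map(|i| b[i]).collect();
                cosine(&sa, &sb)
            })
            .collect();
        // Robust average: keep only the 25th-75th percentile band, then average
        results.sort_by(|x, y| x.partial_cmp(y).unwrap());
        let (lo, hi) = (results.len() / 4, results.len() * 3 / 4);
        results[lo..hi].iter().sum::<f64>() / (hi - lo) as f64
    }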

Everything is designed to fully saturate the performance of multi-core machines through extensive use of broadcasting/vectorization and parallel processing via the Rayon library. I was really impressed with how easy and low-overhead it is to write highly parallelized code in Rust, especially coming from Python, where you have to jump through a lot of hoops to use multiprocessing and there is a ton of overhead.
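
For a sense of how little ceremony that takes, this is roughly the whole trick with Rayon (a toy sketch, not the library's code): swap iter() for par_iter() and the work fans out across every core.

    use rayon::prelude::*;

    // Score one query vector against many stored vectors in parallel.
    fn pairwise_dots(query: &[f64], corpus: &[Vec<f64>]) -> Vec<f64> {
        corpus
            .par_iter()
            .map(|v| query.iter().zip(v).map(|(a, b)| a * b).sum::<f64>())
            .collect()
    }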

Anyway, please let me know what you think. I'm looking to add more measures of similarity if I can find ones that can be efficiently computed (I already gave up on including HSIC because I couldn't get it to go fast enough, even using BLAS/LAPACK).




I realize you have a different use case in mind, so the following feedback might not be that useful to you. I also haven’t reviewed your API design, so perhaps you have already considered this angle.

I picked up a habit in Go package design which I now apply everywhere: one of my immediate first thoughts when evaluating a library is how it creates or uses threads.

When I see that a library wants to consume all of the available cores of my system, or uses an internal task scheduler or threading mechanism, I worry that it will compete for CPU time with the other processes, threads, etc. that my system and process are using.

On the other hand, I appreciate when a library exposes an interface for me to offer it threads or a scheduler, or even better if it offers primitives that I can use directly in my own threads or tasks.

Perhaps this is mainly a problem for high-concurrency servers and UIs, but it’s always one of my first thoughts, and Rust is an ecosystem that needs to support those use cases. Many threaded libraries, even otherwise very good ones, are not designed with this in mind, and it is a problem.


It's a good point, and the way my library is designed now, it would be a big resource hog. I'm not sure of the best way to handle that, given that it's supposed to be a super easy-to-use library for Python. Perhaps I could add an optional parameter like "no_resource_hogging", and if enabled, it would ratchet down the amount of parallelism, or reduce the CPU priority of the threads to a very low level so they won't use up all the system resources. The problem with the latter approach is that it starts being system-specific. For example, you could do something like this:

  use rayon::ThreadPoolBuilder;

  fn create_low_priority_pool() -> rayon::ThreadPool {
      ThreadPoolBuilder::new()
          // Lower each worker thread's priority as soon as Rayon starts it
          .start_handler(|_| {
              #[cfg(target_os = "linux")]
              {
                  use libc::{sched_param, sched_setscheduler, SCHED_IDLE};
                  unsafe {
                      // pid 0 means the calling thread; SCHED_IDLE only gets CPU
                      // time when nothing else wants it
                      let mut param: sched_param = std::mem::zeroed();
                      param.sched_priority = 0;
                      sched_setscheduler(0, SCHED_IDLE, &param);
                  }
              }
              #[cfg(target_os = "windows")]
              {
                  use winapi::um::processthreadsapi::{GetCurrentThread, SetThreadPriority};
                  use winapi::um::winbase::THREAD_PRIORITY_LOWEST;
                  unsafe {
                      // Needs the current thread's handle, not a null pointer
                      SetThreadPriority(GetCurrentThread(), THREAD_PRIORITY_LOWEST as i32);
                  }
              }
          })
          .build()
          .unwrap()
  }
  
But I don't know how that would work for macOS. Anyway, it's something to think about.


I would say default to min(affinity, system threads, 16) and let the user input a specific number to override it.

I had a problem recently where a library started a process per CPU thread; I suspect the author didn't intend for it to start 256...

Affinity is only easy to query on some platforms, but you should use it instead of the system thread count when it's available.
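
Something along these lines as a sketch; I believe std's available_parallelism already accounts for the affinity mask on Linux, so it's a reasonable stand-in where querying affinity directly is awkward:

    use rayon::ThreadPoolBuilder;

    fn build_pool(user_threads: Option<usize>) -> rayon::ThreadPool {
        // Default: min(detected parallelism, 16); any user-supplied value wins.
        let detected = std::thread::available_parallelism()
            .map(|n| n.get())
            .unwrap_or(1);
        let threads = user_threads.unwrap_or_else(|| detected.min(16));
        ThreadPoolBuilder::new()
            .num_threads(threads)
            .build()
            .unwrap()
    }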


One approach I’ve used in the past is to offer an API which takes threads or a worker pool as a parameter, and a separate API which does that for you and calls the first one.

Then in your Python wrapper you can use the wrapper, but your users who care can provide the pool they’ve already created.

Bonus points if it’s an interface or trait that the user can adapt their own scheduler to, if they don’t use rayon. But, I can see how that could be tricky if the design of your library is tightly coupled with the threading solution.

One way to frame it is that thread creation is a significant side effect.
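
In Rust/rayon terms, the shape I mean is roughly this (hypothetical names, and the dot product is just a placeholder for the real work):

    use rayon::prelude::*;

    // Core API: the caller supplies (and therefore controls) the thread pool.
    pub fn similarities_in(pool: &rayon::ThreadPool, query: &[f64], corpus: &[Vec<f64>]) -> Vec<f64> {
        pool.install(|| {
            corpus
                .par_iter()
                .map(|v| query.iter().zip(v).map(|(a, b)| a * b).sum::<f64>())
                .collect()
        })
    }

    // Convenience API: builds a default pool and delegates to the core one.
    pub fn similarities(query: &[f64], corpus: &[Vec<f64>]) -> Vec<f64> {
        let pool = rayon::ThreadPoolBuilder::new().build().unwrap();
        similarities_in(&pool, query, corpus)
    }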


Author here. I just wanted to add that I struggled mightily to include normalized mutual information in this library but simply couldn’t get it to run quickly enough with 15,000 dimensional vectors to include it (it caused the “all” option to take 20x longer to execute).

I was able to get non-normalized MI to run fast, but then you end up with a measure that doesn’t have any intuitive meaning (i.e., it’s not a value in the range [0.0, 1.0] or anything like that— it’s something like 5.2, where this depends on the length of the vectors and the magnitude of the elements).

I tried many different approaches, but computing the joint entropy was just too computationally expensive, even using BLAS/LAPACK and lots of vectorization. I tried using various kernel density estimation techniques but couldn’t get anything to work well and quickly. I suppose I could just approximate it by taking a random subset of indices like I did for the distance correlation measure, but I find that unsatisfying.

So if there are any math/Rust wizards in here who have any suggestions, please let me know!


If you don't mind, how is this calculated? I found some references, but I'm not sure I understand.


There are lots of different ways of doing it (for a reference, see https://www.mdpi.com/1099-4300/19/11/631 ), but this was the basic approach I was trying that I gave up on because of performance concerns:

    // Note: assumes the rustfft, ndarray, and rayon crates, and that grid_size
    // is a power of two (a requirement of Radix4).
    use ndarray::Array1;
    use rayon::prelude::*;
    use rustfft::{algorithm::Radix4, num_complex::Complex, Fft, FftDirection};

    fn fast_kde(data: &[f64], grid_size: usize, bandwidth_opt: Option<f64>) -> Vec<f64> {
        let n = data.len() as f64;
        let min = *data.iter().min_by(|x, y| x.partial_cmp(y).unwrap()).unwrap();
        let max = *data.iter().max_by(|x, y| x.partial_cmp(y).unwrap()).unwrap();
        let std_dev = (data.iter().map(|&x| x.powi(2)).sum::<f64>() / n - (data.iter().sum::<f64>() / n).powi(2)).sqrt();
        // Optimal bandwidth using Silverman's rule or use provided bandwidth
        let bandwidth = bandwidth_opt.unwrap_or(1.06 * std_dev * n.powf(-0.2));
        // Reflecting data at boundaries
        let range = max - min;
        let mut reflected_data = data.to_vec();
        reflected_data.extend(data.iter().map(|&x| 2.0 * min - x));
        reflected_data.extend(data.iter().map(|&x| 2.0 * max - x));
        let fft = Radix4::new(grid_size, FftDirection::Forward); // Use the enum
        // Create a grid and kernel
        let mut kernel = vec![Complex::new(0.0, 0.0); grid_size];
        for i in 0..grid_size {
            let x = (i as f64 / grid_size as f64) * 2.0 * std::f64::consts::PI;
            kernel[i] = Complex::new((-0.5 * (x / bandwidth).powi(2)).exp(), 0.0);
        }
        // Compute FFT of the kernel
        fft.process(&mut kernel);
        // Compute the FFT of the data (with kernel applied)
        let mut data_fft = vec![Complex::new(0.0, 0.0); grid_size];
        for value in reflected_data {
            let index = ((value - min) / range * grid_size as f64).floor() as usize % grid_size;
            data_fft[index].re += 1.0;
        }
        fft.process(&mut data_fft);
        // Convolve in frequency space
        let convolved: Vec<Complex<f64>> = kernel.iter().zip(data_fft.iter()).map(|(k, d)| k * d).collect();
        // Inverse FFT
        let mut inverse_fft = convolved;
        let ifft = Radix4::new(grid_size, FftDirection::Inverse); // Use the enum
        ifft.process(&mut inverse_fft);
        // Normalize
        let sum: f64 = inverse_fft.iter().map(|c| c.re).sum();
        inverse_fft.iter().map(|c| c.re / sum).collect()
    }

    fn entropy(data: &Array1<f64>, bandwidth: f64) -> f64 {
        assert!(bandwidth > 0.0, "Bandwidth must be greater than 0");
        let n = data.len() as f64;
        let normalizing_constant = 0.5 * (2.0 * std::f64::consts::PI * std::f64::consts::E * bandwidth * bandwidth).ln();
        let entropy_estimate: f64 = data.iter().map(|&x| {
            let p_x = kernel_density(data.as_slice().unwrap(), x, bandwidth);
            if p_x > 0.0 {
                -p_x * (p_x.ln())
            } else {
                0.0
            }
        }).sum();
        // Divide by the number of data points and add the normalizing constant
        (entropy_estimate / n) + normalizing_constant
    }

    fn mutual_information_kde(x: &Array1<f64>, y: &Array1<f64>, bandwidth: f64) -> f64 {
        assert_eq!(x.len(), y.len());
        let n = x.len() as f64;
        let x_slice = x.as_slice().unwrap();
        let y_slice = y.as_slice().unwrap();
        let mi: f64 = x_slice.par_iter().enumerate().map(|(i, &xi)| {
            let p_x = kernel_density(x_slice, xi, bandwidth);
            let p_y = kernel_density(y_slice, y_slice[i], bandwidth);
            let p_xy = joint_kernel_density(x_slice, y_slice, xi, y_slice[i], bandwidth);
            if p_xy > 0.0 {
                (p_xy / (p_x * p_y)).ln() / n
            } else {
                0.0
            }
        }).sum();
        let h_x = entropy(x, bandwidth);
        let h_y = entropy(y, bandwidth);
        let nmi = 2.0 * mi / (h_x + h_y);
        nmi
    }


Oops, I left out this part:

    // Assumes the statrs crate for the Gaussian pdf; takes slices to match how
    // it's called from mutual_information_kde above.
    use rayon::prelude::*;
    use statrs::distribution::{Continuous, Normal};

    fn joint_kernel_density(x_data: &[f64], y_data: &[f64], x: f64, y: f64, bandwidth: f64) -> f64 {
        let normal = Normal::new(0.0, bandwidth).unwrap();
        let density: f64 = x_data.par_iter().zip(y_data.par_iter()).map(|(&x_val, &y_val)| {
            normal.pdf((x - x_val) / bandwidth) * normal.pdf((y - y_val) / bandwidth)
        }).sum();
        density / ((x_data.len() as f64) * (bandwidth * bandwidth))
    }
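
And kernel_density is just the 1-D analogue of that; roughly something like this (the exact version is not shown above):

    // Rough 1-D analogue of joint_kernel_density (illustrative reconstruction)
    fn kernel_density(data: &[f64], x: f64, bandwidth: f64) -> f64 {
        let normal = Normal::new(0.0, bandwidth).unwrap();
        let density: f64 = data
            .iter()
            .map(|&x_val| normal.pdf((x - x_val) / bandwidth))
            .sum();
        density / (data.len() as f64 * bandwidth)
    }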


Always great to see projects like this that distribute wheels for a whole ton of different platforms - makes installation so much less painful if you don't need to ensure you have a full build toolchain on the machine where you run "pip install".

https://pypi.org/project/fast-vector-similarity/#files


Totally agree, and it’s actually trivially easy to do now if you use Maturin and GitHub Actions, whereas I think it was much more arduous a few years ago (although I can’t say I tried back then, so I’m not sure).


Could you do a future write-up on the whole process of how you got GitHub Actions to cross-compile to different platforms?

That sounds very fascinating.


Yeah, like the other commenter said, everything is in this file here:

https://github.com/Dicklesworthstone/fast_vector_similarity/...

If you also make your project using Rust and Maturin, you can literally just copy and paste that into your project because it's totally generic, and if the repo is public, GitHub will just run it all for you for free.

The only thing is you need to create an account on PyPI (pip) and add two-factor auth so you can generate an API key. Then you go into the repo settings, go to the secrets section, and create a GitHub Actions secret with the name PYPI_API_TOKEN and make the value your PyPI token. That's it! It will not only compile all the wheels for you but even upload the project to PyPI for you using the settings found in your pyproject.toml file, like this:

https://github.com/Dicklesworthstone/fast_vector_similarity/...


Looks pretty inspectable in the .github folder? I’m not 100% sure how Actions work though, so maybe there’s something critical missing.


Awesome work!

At Qdrant we do this at scale, storing billions of vectors in a cluster of any size. It's also in Rust, which turned out to be an amazing choice, and fully open source. It uses various techniques to keep things performant, such as vectorization (for multiple architectures), quantization (a form of compression), and more.

https://github.com/qdrant/qdrant


If anyone is interested in how to use something besides OpenAI for embeddings (the ada-002 model) consider checking out Instructor Large: https://huggingface.co/hkunlp/instructor-large

There is some reference code in the DoctorGPT project that uses this approach: https://github.com/FeatureBaseDB/DoctorGPT. That project is designed to render PDFs to images and then run OCR on them (because not all PDFs have embedded text), and it uses FeatureBase as the storage/vector engine for prompt tuning and assembly.


Cool, I made a similar kind of tool recently that I also shared on HN a couple of weeks ago. You might find it useful for generating and managing LLM embeddings locally:

https://github.com/Dicklesworthstone/llama_embeddings_fastap...


Nice work!

How does it compare against FAISS?


It's really quite different from FAISS in its goals. FAISS is for when you have millions of stored vectors and want to do quick semantic similarity for retrieval. It manages to do this with some very clever approximations that narrow the search space.

My library is very different: it's for when you don't have quite so many vectors, or when simpler measures like Cosine similarity aren't quite cutting it for whatever reason and you want to use more powerful techniques.

I have another project that does near-duplicate image detection (robust to all sorts of transformations), and it uses a similar approach of turning the images into high dimensional vectors. Using cosine similarity with FAISS is great for narrowing the field, but once you have done that, I find that you can get much higher quality results by augmenting the analysis with measures like Hoeffding's D. But until now, there were no high performance libraries for computing that in Python, and my library using Rust is orders of magnitude faster than using NumPy like this:

    import numpy as np
    import scipy.stats

    def hoeffd_inner_loop_func(i, R, S):
        # See slow_exact_hoeffdings_d_func for definition of R, S
        Q_i = 1 + sum(np.logical_and(R<R[i], S<S[i]))
        Q_i = Q_i + (1/4)*(sum(np.logical_and(R==R[i], S==S[i])) - 1)
        Q_i = Q_i + (1/2)*sum(np.logical_and(R==R[i], S<S[i]))
        Q_i = Q_i + (1/2)*sum(np.logical_and(R<R[i], S==S[i]))
        return Q_i

    def slow_exact_hoeffdings_d_func(x, y):
        #Based on code from here: https://stackoverflow.com/a/9322657/1006379
        #For background see: https://projecteuclid.org/download/pdf_1/euclid.aoms/1177730150
        x = np.array(x)
        y = np.array(y)
        N = x.shape[0]
        R = scipy.stats.rankdata(x, method='average')
        S = scipy.stats.rankdata(y, method='average')
        print('Computing Q with list comprehension...')
        with MyTimer():  # MyTimer is just a simple timing context manager (not shown)
            Q = [hoeffd_inner_loop_func(i, R, S) for i in range(N)]
            Q = np.array(Q)
            D1 = sum(((Q-1)*(Q-2)))
            D2 = sum((R-1)*(R-2)*(S-1)*(S-2))
            D3 = sum((R-2)*(S-2)*(Q-1))
            D = 30*((N-2)*(N-3)*D1 + D2 - 2*(N-2)*D3) / (N*(N-1)*(N-2)*(N-3)*(N-4))
        print('Exact Hoeffding D: '+ str(round(D,8)))
        return D
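
For comparison, a straight Rust translation of that same brute-force computation looks roughly like this (not the exact code in the library, but it shows where Rayon helps: the O(n^2) loop over Q parallelizes with a single par_iter):

    use rayon::prelude::*;

    // Average ranks with ties, like scipy.stats.rankdata(..., method='average')
    fn average_ranks(xs: &[f64]) -> Vec<f64> {
        let n = xs.len();
        let mut order: Vec<usize> = (0..n).collect();
        order.sort_by(|&a, &b| xs[a].partial_cmp(&xs[b]).unwrap());
        let mut ranks = vec![0.0; n];
        let mut i = 0;
        while i < n {
            let mut j = i;
            while j + 1 < n && xs[order[j + 1]] == xs[order[i]] {
                j += 1;
            }
            let avg_rank = (i + j) as f64 / 2.0 + 1.0; // ranks are 1-based
            for k in i..=j {
                ranks[order[k]] = avg_rank;
            }
            i = j + 1;
        }
        ranks
    }

    fn hoeffdings_d(x: &[f64], y: &[f64]) -> f64 {
        let n = x.len() as f64;
        let r = average_ranks(x);
        let s = average_ranks(y);
        // Q_i as in the NumPy version above; this O(n^2) loop dominates the runtime
        let q: Vec<f64> = (0..x.len())
            .into_par_iter()
            .map(|i| {
                let mut q_i = 1.0;
                for j in 0..x.len() {
                    if r[j] < r[i] && s[j] < s[i] { q_i += 1.0; }
                    if r[j] == r[i] && s[j] == s[i] { q_i += 0.25; }
                    if r[j] == r[i] && s[j] < s[i] { q_i += 0.5; }
                    if r[j] < r[i] && s[j] == s[i] { q_i += 0.5; }
                }
                q_i - 0.25 // drop the j == i term picked up by the equality case
            })
            .collect();
        let d1: f64 = q.iter().map(|&qi| (qi - 1.0) * (qi - 2.0)).sum();
        let d2: f64 = r.iter().zip(&s)
            .map(|(&ri, &si)| (ri - 1.0) * (ri - 2.0) * (si - 1.0) * (si - 2.0))
            .sum();
        let d3: f64 = r.iter().zip(&s).zip(&q)
            .map(|((&ri, &si), &qi)| (ri - 2.0) * (si - 2.0) * (qi - 1.0))
            .sum();
        30.0 * ((n - 2.0) * (n - 3.0) * d1 + d2 - 2.0 * (n - 2.0) * d3)
            / (n * (n - 1.0) * (n - 2.0) * (n - 3.0) * (n - 4.0))
    }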


It sounds like it's a brute-force comparison, so I imagine any approximate solution will be a lot faster. I have a similar brute-force search library in Rust, though I didn't find Rayon to be great (at the time).


It seems to really depend on how you use Rayon, and on the size of the data: basically, the overhead of setting up the parallelism can't exceed the savings. Wherever it's possible to use ndarray broadcasting (vectorization), I do that instead. But I do think Rayon and the Rust compiler have gotten better about being smart, and I'm super impressed with the speed.


The library looks great, thank you for building this!

Am I understanding it correctly that you use it to detect hidden relationships in your data to build better ML models? If so, have you tried it on non-NLP use cases?


Yes. If you look at some of my other replies here, I am using this approach successfully in another project to do robust near-duplicate image detection. So it’s also using high dimensional vector embeddings, but they are embeddings of images from ResNet-style vision models rather than from text-based LLMs.


Is there any resource that compares cosine similarity with other alternatives?


Yeah, there are some good survey articles that review various methods. Here are some links you can check out:

https://cloud.r-project.org/web/packages/correlation/vignett...

https://juniperpublishers.com/bboaj/pdf/BBOAJ.MS.ID.555795.p...

https://www.stats.ox.ac.uk/~cucuring/Lecture_2_Correlations_...

https://blogs.sas.com/content/iml/2021/04/28/compute-hoeffdi...

Essentially, cosine similarity just measures the extent to which the vectors point in the same direction in N dimensional space. That turns out to be quite discriminative when N is very high (i.e., over 1000). But it is not as good for finding complex non-linear relationships.


The objective function is chosen with retrieval in mind, since training is done once and retrieval is performed many times. You are expected to use a simple similarity measure.


In the example it looks like this needs one JSON encoding/decoding step per call for parameters & results which seems a bit inefficient, but maybe my intuition is wrong there? Or does it allow dumping a whole list of vectors (I only see vector1 and vector2 though)?


Yes, you're right, and it's a good point. I'll add an additional python function that allows you to supply a pandas dataframe or numpy array for one of the arguments so you can specify a query vector and a matrix/table of vectors to search across.


Is this similar to other probabilistic vector similarity search like LSH?


It's not so much for search (although it can obviously be used that way) as it is for just computing various kinds of similarity/dependency for arbitrary vectors of equal length.

To the extent you would want to use it for vector/semantic search applications, I would advise doing a first pass with something like FAISS to narrow down the list of potential matches from the entire universe of stored vectors, and THEN trying to compute these other measures of similarity/dependency only on the "most likely" vectors using my library. Hopefully that makes sense.


It says here that you work on NFTs?


It’s a good start, but you can’t generally get even remotely close to hardware potential in Rust, let alone Python.

I had to implement a separate C99 library to always use the newest SIMD intrinsics (occasionally leveraging SVE on more recent Arm CPUs) that compilers don’t know how to generate on their own.

That library is in turn used in USearch, which is designed for Approximate Search, but some users recently reported that they use it for brute force as well… where it performed 20x faster than FAISS.

https://github.com/ashvardanian/simsimd


> you can’t generally get even remotely close to hardware potential in Rust

Can you describe the gaps you’re aware of with regard to Rust’s SIMD capabilities? There has been a lot of work in Rust and LLVM to make these things work. I’m mostly just curious, given your experience in this area. Here’s some of the detection available in Rust in this area, and I know there is more out there: https://doc.rust-lang.org/core/arch/


Cool. I considered trying to use SIMD tricks, but then it starts becoming very platform-dependent, and I wanted to keep it general. My version is already by far the fastest Hoeffding's D implementation available for use in Python, so it's quite useful even without leveraging every possible optimization. I'll check out your library though; it sounds cool. The link you shared gives a 404, by the way.


Updated, thanks :)


Rust supports SIMD intrinsics


Yes, and I believe the latest Rust toolchain is even able to automatically vectorize certain operations using SIMD primitives without explicitly specifying them in the Rust code. There has been a lot of recent work in Rust and LLVM to support that sort of thing, although that stuff is a bit over my pay grade.


That's not "the latest" Rust toolchain, LLVM has been able to do this for a long time. Of course, it gets better and better over time.
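
For instance, a plain element-wise loop like this is the kind of thing LLVM will usually vectorize on its own in a release build (float reductions are harder for it, since reordering the additions changes the result):

    // Toy example: scale a slice in place. Built with --release (and optionally
    // RUSTFLAGS="-C target-cpu=native"), LLVM typically turns this into SIMD code
    // without any explicit intrinsics.
    pub fn scale_in_place(xs: &mut [f64], k: f64) {
        for x in xs.iter_mut() {
            *x *= k;
        }
    }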


Yes, you're right of course. I meant more that it's getting better recently, and it's an active area of work in the Rust and LLVM dev communities. So if your last experience with this was a few years ago, it's worth checking it out again because it's actively improving.


Nice!

I recently implemented a C-based NumPy implementation of LSH to compress / recover cosine similarity[1]. It was my first time writing NumPy C code, and it was a lot of fun to massively improve the performance over pure Python[2].

1- https://github.com/softwaredoug/np-sims

2- https://softwaredoug.com/blog/2023/08/22/rand-projections-in...


Cool. I looked into using the CFFI approach in Python, and I found that using PyO3/Maturin was much simpler and easier, especially when it came to automatic compilation of wheels. If you're looking to make high performance functions for use in Python, I can't recommend that approach enough, and then you get to use Rust, which is amazing.

My experience from years ago when I was starting out in Python was that, as soon as you make people compile stuff to use your Python library, you've already lost, since most people don't have the tooling/compilers installed and configured correctly, and even if they do, it's just not reasonable to expect people to wait so long for everything to compile just to use a library from pip. It's really nice to be able to start from a fresh Ubuntu install, make a venv, and instantly pip install -r requirements.txt using only wheels-- the whole process takes seconds and there's no fuss.




