How long does it take to linear search 200MB in memory? (jonathanotto.com)
70 points by jotto on April 27, 2018 | 72 comments



Not wanting to be the 'Old Man Yells At Cloud' figure, but this is a perfect illustration of how not to do it, particularly multithreading in Go before using a reasonable algorithm.

Searching for a long string like this is preposterously easy. You can use a classical algorithm, or, given you are looking for a 64-byte-long string, you could do a strided inspection: every 32 bytes, check whether you have a 4-gram from the first 35 bytes of the string. More or less anything works, as almost any reasonable string-searching mechanism can outrun memory bandwidth.
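
Roughly, here's a sketch of that strided idea in Go (my own scalar approximation; the map of 4-grams stands in for what would really be a SIMD comparison, and it assumes the fixed 64-byte needle from the post):

    package search

    import "bytes"

    // stridedFind collects every 4-gram from the first 35 bytes of a 64-byte
    // needle, probes the haystack every 32 bytes, and only does full
    // comparisons around positions whose 4-gram is in the set.
    func stridedFind(haystack, needle []byte) int {
        if len(needle) != 64 {
            return bytes.Index(haystack, needle) // fall back for other sizes
        }
        grams := map[[4]byte]bool{}
        for i := 0; i+4 <= 35; i++ {
            var g [4]byte
            copy(g[:], needle[i:i+4])
            grams[g] = true
        }
        for q := 0; q+4 <= len(haystack); q += 32 {
            var g [4]byte
            copy(g[:], haystack[q:q+4])
            if !grams[g] {
                continue
            }
            // Any match covering position q must start within the previous
            // 31 bytes, so verify each candidate start, leftmost first.
            lo := q - 31
            if lo < 0 {
                lo = 0
            }
            for p := lo; p <= q; p++ {
                if p+64 > len(haystack) {
                    break
                }
                if bytes.Equal(haystack[p:p+64], needle) {
                    return p
                }
            }
        }
        return -1
    }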

Or you could go use my old project (https://github.com/intel/hyperscan) and then you could look for regexes too. :-)


Depends on how often you need to search. The implementation of the parallel linear search in Go is a lot easier to get right than what you’re recommending.

Though I would certainly agree that paying for more VM cores just so you can do a linear search on each of them is a bit silly.


The linear search code should probably be in a library, and in fact often is, if you have the GNU extension memmem().

I'm also not convinced that it's that hard. This seems to be the new cult; just because someone buggered up binary search with an integer overflow once, we're not going to write non-pathological algorithms anymore?


I’m not saying it’s a bad idea to implement it, but there’s a greater time cost, and if the performance gains aren’t worth the effort, then there’s no reason to make the improvement


Fair enough. I guess you would figure out how long it takes to either clag the code from somewhere else or write it yourself and plug in the math at the bottom of the page (as they are trying to quantify the value of doing this).

I strongly suspect dragging in memmem() would be the most cost-effective way of doing this and would run in the most places, even if it's a non-standard GNU libc extension.


0.25s for 200MB (in memory) seems pretty slow for a modern CPU (800MB/s), which is more than an order of magnitude below what you would expect from main memory.

There can be a lot of performance left on the table even in standard library implementations: http://0x80.pl/articles/simd-strfind.html


This isn't super surprising to me. String comparisons are super slow if not handled with care, and even then they may be surprisingly slow. This does a LOT of string comparisons.

Depending on the requirements, sliding a window through and doing byte comparisons may be better. https://golang.org/src/bytes/bytes_decl.go?s=522:550#L7

EDIT: I believe Papertrail does or did have a limitation around case-sensitive searching. You can imagine why they may be doing this if you consider the performance implications.


The ruby version looked a bit odd to me. It would be unusual to not use regex in this context.

I'm not sure if this is in the spirit of the test, but using regex reduced the time from 27s to 0.04s on my machine.

    require 'benchmark'

    contents = File.read("/tmp/200mb.txt")
    find = "07123E1F482356C415F684407A3B8723E10B2CBBC0B8FCD6282C49D37C9C1ABC"
    result = nil
    elapsed = Benchmark.realtime do
      result = (contents =~ /#{find}/)
    end
    puts "Found at index #{result} in #{elapsed}s"

(The search string was placed at the end of a 200mb file)


You can reduce the time to 0.02s by just using `String#index`.


The fastest, Golang at slightly over 2.8GB/s, is still surprisingly slow for modern (as in, under 10 years old) hardware. I just wrote a similar linear search in Asm using the not-so-fast REPNZ SCASB/CMPSB instructions and got 6GB/s on an i7 860 from 2009.

If you're just scanning for the string then algorithms like BMH can do up to x times memory bandwidth, where x is the length of the string, since they start matching from the end and can skip over chunks of x bytes at once.
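
For reference, here's what a bare-bones Boyer-Moore-Horspool skip loop looks like in Go (a sketch with my own names, not a tuned implementation):

    package search

    // horspoolFind compares from the end of the pattern and uses a
    // bad-character table to skip up to len(pattern) bytes per mismatch.
    func horspoolFind(haystack, pattern []byte) int {
        m := len(pattern)
        if m == 0 {
            return 0
        }
        if m > len(haystack) {
            return -1
        }
        // For each byte value: how far to shift when that byte lines up
        // with the last position of the pattern.
        var skip [256]int
        for i := range skip {
            skip[i] = m
        }
        for i := 0; i < m-1; i++ {
            skip[pattern[i]] = m - 1 - i
        }
        for pos := 0; pos+m <= len(haystack); {
            i := m - 1
            for i >= 0 && haystack[pos+i] == pattern[i] {
                i--
            }
            if i < 0 {
                return pos
            }
            pos += skip[haystack[pos+m-1]]
        }
        return -1
    }

With a 64-byte needle the typical shift is close to 64 bytes, which is where the "up to x times memory bandwidth" figure comes from.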

This article from someone with a rather... distinctive style shows >10GB/s on 5-year-old hardware with target strings only slightly over 50 bytes:

https://www.codeproject.com/Articles/250566/Fastest-strstr-l...


I recall repne scasb was significantly slower than the equivalent loop unrolled a few times. Not sure if that's still the case but it was something like 2x slower.


I don’t have a computer at hand, but I imagine it would be a lot shorter (and likely faster due to algorithmic optimisations) to use bytes.Index. As others have pointed out, 1 GB/s seems slow. I would expect that performance when reading from disk.

(Also, iterating over indexes is not too idiomatic in Go. You can iterate both the index and value using `range`, just like enumerate in Python.)

https://golang.org/pkg/bytes/#Index
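
For example, a trivial illustration of the range form mentioned above:

    package main

    import "fmt"

    func main() {
        data := []byte("abc")
        // range gives both the index and the value, like enumerate in Python
        for i, b := range data {
            fmt.Println(i, b)
        }
    }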


It turned out bytes.Index is initially similar in speed or slower, but search time goes down to 40ms as soon as you lop off the last byte.

For strings over (on my machine) 63 bytes, bytes.Index uses a rolling Rabin-Karp hash. For this data it's ~25% slower than the loop to search. However, that evens out if you make the whole file hex, which makes the search loop have some branch mispredictions. (Also, the simple loop search isn't complete! More below.) The R-K search is more reasonable than this one comparison makes it sound; it avoids bad worst-case behavior when there are a lot of partial matches. (Also, it may be less optimized because authors think most bytes.Index searches will be for short bits of data.)

Relatedly, the loop search doesn't back up its position when it discovers a partial match isn't a complete match, which means it's incorrect; it won't find a real match that overlaps with a false match (a prefix of a match). Like, when I made a file that contained "07123E1F482356C415F684407123E1F482356C415F684407A3B8723E10B2CBBC0B8FCD6282C49D37C9C1ABC", it didn't find the match starting at the second 071... because it overlaps with a prefix of a match (at the start of the string) that didn't turn out to be a full match. (The overlap is two digits here, but same thing would come up if it were one digit.)

Searching for shorter strings, Index on Intel platforms uses AVX/SSE search. If you're going to search for long strings and feel lucky, you could search for a prefix of whatever your platform's max SIMD-searchable length is, then check if it's a full match. If so, you're done; if not, keep searching starting at the next byte. In case of pathological data you could bail out to the safe R-K approach after n false matches, like one of Index's existing strategies does. It might be something for someone who wants a stdlib patch to try submitting to Go, though it might also be that they see it as too much of a corner case to merit complicating things more.
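
A rough sketch of that "search for a prefix, then verify" idea (the 16-byte cutoff is an arbitrary stand-in for whatever length the SIMD path handles, and there's no pathological-input fallback here):

    package search

    import "bytes"

    // findByPrefix looks for a short prefix of the needle with bytes.Index,
    // then verifies the full needle at each hit.
    func findByPrefix(haystack, needle []byte) int {
        prefixLen := 16 // hypothetical cutoff
        if len(needle) < prefixLen {
            prefixLen = len(needle)
        }
        prefix := needle[:prefixLen]
        for off := 0; ; {
            i := bytes.Index(haystack[off:], prefix)
            if i < 0 {
                return -1
            }
            pos := off + i
            if bytes.HasPrefix(haystack[pos:], needle) {
                return pos
            }
            off = pos + 1
        }
    }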


That search doesn't handle head/tail overlaps, but otherwise it's a good approximation. I'm surprised it isn't faster; I think I got a few hundred MB/s on trie-based matching.

Protip: bring benchmarks like this to your meeting with the sales rep.


It's pretty bad: Say it's searching for "AB". It won't find it in this string: "AAB"


I think the point was just to test each string against the query for equality. The benchmark could be slightly tweaked to test each string against the query for substring inclusion or regular expression match, but I suspect the results would only change by the cost of the test predicate.


Not necessarily when comparing different languages, as they might allow different optimizations. For example, you'd be able to compare several bytes at a time in C if you pad the strings, while Node.js's V8 is unlikely to optimize that far.


    console.log(bulk.length); console.log(new Date()); pos= bulk.indexOf("788bbd6b12d79a4ee18e36a8a264fb496465c71b7932a965ed731397b7c97d14"); console.log(pos); console.log(new Date());
    209715266                           debugger eval code:1:1
    Date 2018-04-28T00:08:30.231Z       debugger eval code:1:27
    209715202                           debugger eval code:1:139
    Date 2018-04-28T00:08:30.252Z       debugger eval code:1:157 
That's about 0.021 seconds for however Firefox implements String.indexOf().

    $ dd if=/dev/urandom of=200mb.dat bs=1M count=200
    200+0 records in
    200+0 records out
    209715200 bytes (210 MB, 200 MiB) copied, 2.98279 s, 70.3 MB/s
    
    $ echo 788bbd6b12d79a4ee18e36a8a264fb496465c71b7932a965ed731397b7c97d14 >>200mb.dat 
    $ time grep -o 788bbd6b12d79a4ee18e36a8a264fb496465c71b7932a965ed731397b7c97d14 <200mb.dat 
    Binary file (standard input) matches
    
    real    0m0.056s
    user    0m0.028s
    sys     0m0.028s
That's not too terribly different.


Something feels off about this, but I can't put my finger on it.


The multi-core Go benchmark is spawning the goroutines before calling `start := time.Now()`.


Go's concurrency model defines that memory writes in one goroutine won't be observable in another until a context switch can happen (e.g. by send/recv on a channel). In practice, newly spawned goroutines often don't get the chance to run at all until that context switch. But yeah, it's not a good idea to rely on this behavior here.


I'm not sure where you got that information?

Go's goroutines launch on threads when multiple CPUs are available. They have all the race conditions of C code. Memory writes certainly are visible on other goroutines.


> I'm not sure where you got that information?

From the very official spec? https://golang.org/ref/mem

Goroutines do not map to OS threads 1:1. Whether the scheduler decides to spawn a thread or not is its own, entirely private decision. No action is guaranteed to be taken until scheduling occurs.


Well, from that spec: "When multiple goroutines access a shared variable v, they must use synchronization events to establish happens-before conditions that ensure reads observe the desired writes."

However, that very specifically does NOT say that writes may not be observed BEFORE synchronization events. They MAY.

Under "Incorrect Synchronization": "Note that a read r may observe the value written by a write w that happens concurrently with r. Even if this occurs, it does not imply that reads happening after r will observe writes that happened before w."

These memory ordering rules are the same as C, C++ or machine code. Go makes no special effort to avoid the natural machine behavior of x86, ARM or MIPS.

And on goroutine spawning and execution order, I have observed both behaviors. Sometimes Go will continue on the current goroutine before launching other goroutines. Other times the new goroutines begin executing immediately. It's possible that some other, background goroutine is causing a scheduling event, but if so it's so random that no one could predict it.


A single frame of Quake is 800x600 ≈ 1MB of memory, and Quake ran at 30fps on a Pentium 233 in software. That's 30MB per second, circa 1995. So your times all seem extremely slow to me. You should be getting a couple of orders of magnitude better speeds. Try it in Rust!


I'm actually somewhat surprised it wasn't faster. I may play with this and see how fast I can get it to go.

Edit: After a number of attempts (such as using Golang's regexp library or strings.Contains()), I wasn't able to speed it up much. Interestingly, grep performs the same search in about 5ms on my machine.


grep is the product of a lot of thought and engineering. There's a great rundown by GNU grep's author, if you're interested [1].

One big takeaway is that grep doesn't simply traverse the string linearly, instead using the Boyer-Moore algorithm [2].

1: https://lists.freebsd.org/pipermail/freebsd-current/2010-Aug...

2: https://en.wikipedia.org/wiki/Boyer%E2%80%93Moore_string_sea...

EDIT: Interestingly, there seems to be an internal Boyer-Moore implementation in Golang's strings package[3], though it only appears to be used in strings.Replace("")[4]

3: https://golang.org/src/strings/search.go

4: https://golang.org/src/strings/replace.go?h=stringFinder


And if you want an even faster version of grep, check out ripgrep (which happens to be written in Rust).

https://github.com/BurntSushi/ripgrep


That's a nice overview by the author. I've been fond of this blog post, which goes into the main trick grep uses, beyond Boyer-Moore and loop unrolling, that lets it beat even a by-the-book implementation with unrolling: http://ridiculousfish.com/blog/posts/old-age-and-treachery.h...


Ignoring the somewhat low performance, unless I'm misreading things, this version won't be able to find the string:

   07123E1F482356C415F684407A3B8723E10B2CBBC0B8FCD6282C49D37C9C1ABC
in the string:

    07123E1F482356C415F684407123E1F482356C415F684407A3B8723E10B2CBBC0B8FCD6282C49D37C9C1ABC
or to make this a bit easier to read, it won't find:

    AXAY
inside the string:

    AXAXAY
At the start it will match all the way up to "AXA", then fail to match "AXAY" against "AXAX", and then it will try to search for "AXAY" in the remainder "XAY", which fails.

The correct way of handling this is either restarting the search at every possible position, which will usually fail quickly so it's not that inefficient, or taking into account that partial matches might overlap with an actual match, using something like the KMP algorithm.
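
For the first option, a corrected version of the naive loop looks something like this in Go (a sketch; KMP avoids the re-scanning entirely):

    package search

    // naiveFind restarts one byte after each failed partial match, so
    // overlapping prefixes (the AXAXAY case above) are handled correctly.
    func naiveFind(haystack, needle []byte) int {
        for start := 0; start+len(needle) <= len(haystack); start++ {
            matched := 0
            for matched < len(needle) && haystack[start+matched] == needle[matched] {
                matched++
            }
            if matched == len(needle) {
                return start
            }
        }
        return -1
    }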


Maybe I'm blind, but I'm not seeing where the benchmark results are? Also, is this code the same code that you actually ran? For Ruby, I notice you are outputting 'contens' not 'contents', so I imagine you would get a runtime error?


I also had a hard time finding the performance numbers. They're listed in the grey bar at the top of each language's section.

I'd love to see this comparison expanded to a few other languages (particularly C/C++ with custom memory allocation).


I once wrote a scanner similar to this for Symantec in C++. As I recall, it used a hybrid of the Wu-Manber and Boyer-Moore search algorithms to search a million lines of text for thousands of signatures simultaneously in approximately 0.004s.


A C++ implementation gives 5.7 ms for searching through 200MB of vector<string> on a Ryzen 1700X.

EDIT: Reworking the code.


- 3276800 entries (to match the author's 200 MB of ASCII)

Sounds like you implicitly assumed the match will be aligned at a 64-byte boundary, which makes the task much easier and makes 4 GiB/s a pretty slow result.


I asked the compiler to align what it can on 16 bytes and updated the results.

Unsure why the previous result was way off, but I am lacking time for fun tonight :)


I meant something different: your implementation is unable to find the pattern if it does not start at an index divisible by 64. You are effectively only searching through 3.125 MiB instead of 200 MiB.


You have a vector of 64-char strings, while the others have just a 200 MB byte array.

IOW, your code is not doing remotely the same task as the ones in the blog post.

Edit: "Typoed" 1 GB instead of 200 MB. Oops. The point holds.


I did some benchmarking for a dictionary app a while back, and the fastest I could get was strstr(), if I recall correctly. It substantially beat std::string::find(). Using a loop is bound to give terrible performance; for large data you should be using one of the string-search algorithms (which strstr() surely does). And given the very large search string, those string-search algorithms should be able to skip large chunks at a time.

Here's a small version that allocates 200 MB of uninitialized memory.

  #include <stdlib.h>
  #include <string.h>
  int main(int argc, char *argv[])
  {
	int size = 200 * 1024 * 1024;
	char *bytes = malloc(size);
	bytes[size - 1] = '\0';
	strstr(bytes, "788bbd6b12d79a4ee18e36a8a264fb496465c71b7932a965ed731397b7c97d14");
	return 0;
  }
  
  > gcc test.c
  > time ./a.out
  real	0m0.006s
  user	0m0.002s
  sys	0m0.002s
(Since the author ignores time to read off disk, I figure uninitialized memory should be comparable. The chances of strstr() actually finding anything are pretty slim, but I ran it a number of times just to be sure, with the same results.)

This is on a 2012 MacBook Pro.


Note that strstr will stop as soon as it hits a 0 byte, so this is not a fair comparison.

Uninitialized (practically often zeroed) also produces different results from random and explicitly zeroed, since I assume it affects how often it can fail fast.

    #define _GNU_SOURCE
    #include <stdlib.h>
    #include <stdio.h>
    #include <string.h>
    #include <time.h>

    int main(int nargs, char ** args)
    {
      struct timespec start;
      struct timespec end;
      int mode = atoi(args[1]);
      const char * needle = "788bbd6b12d79a4ee18e36a8a264fb496465c71b7932a965ed731397b7c97d14";
      const size_t needle_len = strlen(needle); /* length of the string, not sizeof the pointer */
      const int N = 200*1024*1024;
      char * haystack = malloc(N + needle_len + 1);
      strcpy(haystack + N, needle);

      printf("Mode is %s\n", mode == 0 ? "zeroed" :
                             mode == 1 ? "random" :
                             "uninitialized");

      if (mode == 0)
      {
        memset(haystack, 0, N);
      }

      if (mode == 1)
      {
        FILE * urandom = fopen("/dev/urandom", "r");
        int nread = fread(haystack, 1, N, urandom);
        fclose(urandom);
        if (nread < N)
        {
          fprintf(stderr, "read returned %d\n", nread);
        }
      }

      clock_gettime(CLOCK_REALTIME, &start);
      void * found = memmem(haystack, N + needle_len, needle, needle_len);
      clock_gettime(CLOCK_REALTIME, &end);

      printf("Took %g secs to find string at offset of %llu\n",
             end.tv_sec - start.tv_sec + 1e-9*(end.tv_nsec - start.tv_nsec),
             (unsigned long long)((char *)found - haystack));
      return 0;
    }



    $ gcc test_find.c -o test_find
    $ ./test_find 0
      Mode is zeroed
      Took 0.0147023 secs to find string at offset of 209715200
    $ ./test_find 1
      Mode is random
      Took 0.129951 secs to find string at offset of 209715200
    $ ./test_find 2
      Mode is uninitialized
      Took 0.0371922 secs to find string at offset of 209715200


Note that if you malloc that big of a block, it's going to come directly from a MAP_ANONYMOUS mmap, and so it will be zero initialized!

Your uninitialized mode is slower than zeroed, but nowhere near as slow as random.

The difference between initialized and uninitialized may be due to page faults: in the uninitialized case, the search is touching the memory for the first time. Maybe we're seeing TLB misses?


Yeah I was expecting the uninitialized to be the same as the zeroed case for that reason, but was surprised to see a (slightly) different result.

Edit: you are almost certainly right. I added a getrusage call before and after, and we have ~50k soft page faults in the uninitialized case and ~2 in the zeroed case during the memmem.


Reading an uninitialized allocation triggers undefined behavior. The compiler is free to do whatever it pleases. Don't do that.


Incredible! Will you post a code snippet?


Results are in the black rectangles above the code



I made a program to add spaces into Chinese text, which requires a lot of string comparisons.

https://pingtype.github.io

The algorithm works as follows: Read 5 characters. Is that in the dictionary? No. What about the first 4? No. 3? No. 2? Yes! Copy to the output string, add a space, slide along 2 characters. Repeat.

Searching a 5 MB dictionary repeatedly like this was too slow. I made it a little faster by splitting the dictionary based on the number of characters in the word (fiveWords, fourWords, threeWords, twoWords). It's still slow, though. I have to run it in local JavaScript, because I don't have a server. Any suggestions?

On a totally different extreme, the iTunes Store's EPF data (freely available!) is 55 GB of text. I'd like to sort out Artist names by Producer. To grep the file takes 30 mins. Is there a faster way?

Third, my algorithm for offline StackOverflow was optimised for disk space. I search the index of titles as 200MB of plain text to get the post ID. Then I use dd to get a 4KB chunk of the tar.bz2 compressed file. I can read it using bzip2recover. Then I check if the post ID is there, and binary search like that. It's slow (5 seconds to load) but doesn't waste my precious disk space, and I can be patient when I need offline StackOverflow.


I would put your dictionary into a prefix tree of some sort, and the first implementation I would reach for is a ternary tree [1]. As you scan your input, walk the tree. Each time you get to the end of the tree, insert a space and go back to the root. Once the tree is built, it's O(length of input) [2] regardless of dictionary size.

1: https://en.m.wikipedia.org/wiki/Ternary_search_tree

2: for a fixed alphabet size. It’s really closer to O(dictionary size * log(alphabet size)) if that increases with dictionary size.
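
Roughly, in Go for illustration (a plain map-based trie rather than a ternary tree, doing greedy longest-match and copying unknown characters through; the two-word dictionary here is just a toy):

    package main

    import (
        "fmt"
        "strings"
    )

    // node is a minimal prefix-tree node keyed by rune.
    type node struct {
        children map[rune]*node
        isWord   bool
    }

    func newNode() *node { return &node{children: map[rune]*node{}} }

    func (n *node) insert(word string) {
        cur := n
        for _, r := range word {
            next, ok := cur.children[r]
            if !ok {
                next = newNode()
                cur.children[r] = next
            }
            cur = next
        }
        cur.isWord = true
    }

    // segment walks the trie from each position and takes the longest
    // dictionary word it can find, falling back to a single character.
    func segment(root *node, text string) string {
        runes := []rune(text)
        var out []string
        for i := 0; i < len(runes); {
            cur, best := root, 0
            for j := i; j < len(runes); j++ {
                next, ok := cur.children[runes[j]]
                if !ok {
                    break
                }
                cur = next
                if cur.isWord {
                    best = j - i + 1
                }
            }
            if best == 0 {
                best = 1 // unknown character: emit as-is
            }
            out = append(out, string(runes[i:i+best]))
            i += best
        }
        return strings.Join(out, " ")
    }

    func main() {
        root := newNode()
        for _, w := range []string{"你好", "世界"} { // toy dictionary
            root.insert(w)
        }
        fmt.Println(segment(root, "你好世界"))
    }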


Instead of storing the dictionary as a list, why not use a Map or Set?

https://developer.mozilla.org/en-US/docs/Web/JavaScript/Refe...

https://developer.mozilla.org/en-US/docs/Web/JavaScript/Refe...

As for offline StackOverflow, you should use a database engine like SQLite or Redis to manage indexing into smaller blocks of compressed post data on-disk, or store the posts directly in the database and keep the database files in a filesystem like BTRFS that supports online compression.


This is actually something I'm trying to figure out. I currently use (and pay for) Scalyr, which is the best of the bunch (super fast, flexible alerting) but I hate shipping my log data there. It just seems like a potential attack surface I am not really happy about.

Is there anything open source that I can run in a docker container somewhere on the internets that allows me to create custom alerts, extract data, create charts and dashboards?


Elasticsearch + Kibana/Grafana. The next version will open-source X-Pack [1], which has some support for alerting [2]. You can also use Grafana with ES and set up alerts there (I think).

[1] https://www.elastic.co/products/x-pack/open

[2] https://www.elastic.co/products/x-pack/alerting


As far as I've been able to tell, ELK and Graylog are still pretty much the only answers here. Either one will set you back at least three Docker containers and take a bit of configuration, and while they both have dashboarding capabilities I'm not sure if they're any good. You can plug Elasticsearch into Grafana though so there's that.


Scalyr CEO here. I'd love to hear more about how you're thinking about attack surfaces – drop me a line? My email address is in my profile.


For example, I am wary about who has access to my log data. I know that theoretically, I "invite" your support team, and can revoke it, but without an understanding of the architecture behind Scalyr, I am afraid this is just lip service.


Hmmm, the Boyer-Moore indexing in Go takes 2x as long as the naive linear search.

Now I feel bad for wasting an hour wrestling with Go weirdness to get my program to compile.


Not regarding the time it takes to search stuff in an array: you really don't want any sort of naive string search to comb through your logfiles.

It might depend on your use case, but if you have some constantly running web app, a service to digest all the logs is very nice to have.

It is true that a lot of those logging services are absurdly expensive, and I would even raise privacy concerns regarding sending all the internal data to someone else.

I really don't get why so many people buy overpriced services, if all it takes is half an hour and the skill to follow an installation guide.

At our company we use Graylog; I guess there are other good alternatives too. It's open source and we slapped it on some cheapo v-server which costs a few bucks per month. That thing swallows something like 150 messages a second, and finding things takes no time at all. And if need be, we can scale it to whatever is needed.


You’d be much better off with an FSA or (honestly) Boyer-Moore implementation. For a baseline, a simple fgrep should show you FSA speed.

You might want to read about classical string search algorithms. Your naive algorithm is straightforward but far from ideal.


"But isn't transient log storage a popular use case, and if so why don't any services optimize for it?"

Well I could start with "define transient"! I also note your unhappiness with third party logging facilities. I also note you went very DIY.

A three node ES cluster with a Graylog on the front can ingest quite a lot of logs for a reasonable time outlay in learning how it hangs together. A bit of tuning might be required.

I suspect I'm not your target market.


The Ruby implementation looks wrong. He's shoving the file into an array:

  contents = contents.read.each_byte.to_a
When really he should be streaming each byte from the file. I suppose he should do that for all of the implementations (or just use grep & cat?), but Ruby objects are annoyingly expensive since there's so much bloat.


So, if this is to measure the relative speed of a naive algorithm across multiple languages, then great. If this is an attempt to measure how fast a single CPU can scan memory for a string, then there are better algorithms. This is just one: [1]

  [1] https://en.wikipedia.org/wiki/Rabin%E2%80%93Karp_algorithm
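
For reference, a bare-bones Rabin-Karp rolling hash in Go (constants and names are mine; as noted elsewhere in the thread, Go's own bytes.Index falls back to a Rabin-Karp variant for long needles):

    package search

    import "bytes"

    // rabinKarp slides a rolling hash across the haystack and only compares
    // bytes when the window's hash matches the needle's hash.
    func rabinKarp(haystack, needle []byte) int {
        n, m := len(haystack), len(needle)
        if m == 0 {
            return 0
        }
        if m > n {
            return -1
        }
        const base = 16777619 // arbitrary multiplier
        var hNeedle, hWindow, pow uint32 = 0, 0, 1
        for i := 0; i < m; i++ {
            hNeedle = hNeedle*base + uint32(needle[i])
            hWindow = hWindow*base + uint32(haystack[i])
            if i > 0 {
                pow *= base // ends up as base^(m-1)
            }
        }
        for i := 0; ; i++ {
            if hWindow == hNeedle && bytes.Equal(haystack[i:i+m], needle) {
                return i
            }
            if i+m >= n {
                return -1
            }
            // roll the window: drop haystack[i], append haystack[i+m]
            hWindow = (hWindow-uint32(haystack[i])*pow)*base + uint32(haystack[i+m])
        }
    }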


Golang single core version:

    if (b[n]) == (find[found]) {
Isn't indexing a UTF-8 string "find" O(n)? To get the 10,000th char, you need to iterate over all 9,999 previous characters first.

Edit: Clarified that I'm talking about the string "find" and not the byte array "b".


No, that's not how it works: indexing a string returns a byte, and len() of a string is its length in bytes (UTF-8 encoded).

Ranging over a string does yield Unicode runes, however, but ranging is obviously O(n) anyway.

https://play.golang.org/p/oyaChPTb576
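
For instance (an illustrative snippet, not the linked playground):

    package main

    import "fmt"

    func main() {
        s := "héllo"
        fmt.Println(len(s)) // 6: length in bytes, not characters
        fmt.Println(s[1])   // 195: a raw byte (first byte of 'é'), not a rune
        for i, r := range s {
            fmt.Println(i, r) // range decodes runes and yields byte offsets
        }
    }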


I don't know Go, but b and found appear to be byte arrays, not strings, which would imply O(1) indexing.


Single core version:

  find := "07123E1F482356C415F684407A3B8723E10B2CBBC0B8FCD6282C49D37C9C1ABC"
That's definitely a string.

Otherwise, the (single core) golang code seems to be just a test of whether the compiler is smart enough to remove the branch in this code:

    if (b[n]) == (find[found]) {
      found += 1
    } else {
      found = 0
    }
I wouldn't be too surprised if the golang compiler generates superfluous range checks, costing even more performance. At least golang 1.4 seemed to generate a lot of unnecessary checks in code I optimized a few years ago. Hopefully newer versions are better.


yep, my mistake - was looking at multicore



Using the same algorithm I got 0.175 or so seconds in Java.

Sample code: https://pastebin.com/duYF2QfE

This was on a "Early 2015" rMBP with a 3.1 GHz i7.


  if (data[i] == strArr[found]) {
Shouldn't that be:

  if (data[i] == stringToFind.codePointAt(found)) {
... if you claim to use the same algorithm. That's what golang needs to do to index a UTF-8 string, right? (Comparing to golang single core version.)

That way both golang and Java will be performing proper unicode code point indexing.

Of course the optimization you made makes sense, but I think it'd be fair for the golang version to do same.


I wonder how long a simple grep over a 200MB file takes on Linux. I guess if you put the 200MB file on a ramdisk then you could easily do it from memory.

That would be my quick-and-dirty first approach.


I'm kinda surprised this task isn't limited by RAM or I/O. There should be no benefit to multi-threading except for blocking I/O.


For such a simple problem you should be shooting for about 25GB/s on a modern Intel CPU. Anything less is a disgrace.


We use Elasticsearch and it is pretty fast.



