The Boyer-Moore Fast String Searching Algorithm (utexas.edu)
228 points by js2 on April 23, 2021 | 81 comments



I remember encountering this wonderful page/website about ten years ago, and I was struck by two things:

1. How effective it was (and is). Both the demonstrations here (of Boyer-Moore algorithm, and of Knuth-Morris-Pratt) illustrate the algorithms very clearly (the same for the linear-time majority vote algorithm); the interactive (web) medium really helps (just hyperlinks and maybe the back button); and the technology is nothing more than a hand-coded HTML file (with abundant "<BR>"s), already over a decade old when I saw it. (The DOCTYPE line in the source mentioning "HTML 3.2" lets us date this to 1997 I guess, as HTML 3.2 was released in January 1997 and HTML 4 in December 1997.)

2. The page one level up from this (https://www.cs.utexas.edu/users/moore/best-ideas/index.html), titled ``My'' Best Ideas, which opens [Incidentally, this usage of `` and '' reflects ancient fonts—you can see the same usage in Emacs, TeX, etc—explained well by Markus Kuhn here: https://www.cl.cam.ac.uk/~mgk25/ucs/quotes.html] with:

> Here are the technical achievements of which I am most proud. I think such a list gives more insight into my research than, say, the traditional ``Honors and Awards'' section of a resume.

And it lists eight ideas. Firstly, what a great idea to make such a list. And secondly, what an impressive research career: it's a successful life that ends with a portfolio of 8 ideas of which one can be proud. One doesn't need much more, and to have a handful of good ideas is something to hope for. I'm reminded of this, from Proofs from the Book (saw it here: https://sites.math.rutgers.edu/~zeilberg/Opinion82.html ):

> The essence of mathematics is proving theorems – and so, that is what mathematicians do: they prove theorems. But to tell the truth, what they really want to prove once in their lifetime, is a Lemma…


This website was a great help to me too: https://www-igm.univ-mlv.fr/~lecroq/string/


> The DOCTYPE line in the source mentioning "HTML 3.2" lets us date this to 1997 I guess, as HTML 3.2 was released in January 1997 and HTML 4 in December 1997.

Not necessarily. HTML 4 was a big change from 3.2, more complicated, and stricter about separating structure from presentation. Lots of people stayed with the more familiar 3.2 for a long time.


> Lots of people stayed with the more familiar 3.2 for a long time.

In fact recently I was wondering why markupsafe generates numeric entities for the quote characters.

That was added to Jinja (before markupsafe was moved out), for HTML 3.2 compatibility, because 3.2 only had 3 named entities (amp, gt, and lt).

According to jinja’s log, this was fixed in 2008. Sadly no link to anything was recorded, but it means either mitsuhiko was feeling playful (and happened to remember this issue) or someone was generating HTML 3.2, with jinja, in 2008.


The Boyer–Moore algorithm is very well known. It's in the standard library of many languages such as C++: https://en.cppreference.com/w/cpp/utility/functional/boyer_m...

But TIL after reading this page that there's a new variation optimized for small alphabets such as DNA: https://www.cs.utexas.edu/users/moore/publications/sustik-mo...


I know the algorithm is well known, but I never really knew how it worked other than in a very hand-wavy way.

This page also links to a demonstration of how the search works which I found easier to follow than the wiki page on the algorithm.


There are a lot of different approaches to searching small alphabets.

The one you linked to maintains state during the search loop to get bigger shifts, and complexity in search loops always slows them down, sometimes to the point that there is a net loss.

If the needle is reasonably long, which is not uncommon for low-alphabet needles, I suspect a better approach would be to use a hash-based Horspool-style algorithm working on n-grams of, say, 4 bytes.

This works by calculating shifts based on ngrams, rather than individual bytes. Since there are many more permutations, you effectively increase the alphabet and hence the shifts also increase. At the cost of always having to read an entire ngram before you can shift.

Edit: you also hash the ngram into a smaller hash table of course, you don't maintain a shift position for each ngram. You can tune the size of the hash table for lower memory or higher performance. This is similar to the multi pattern Wu Manber search, but applied to a single needle.

Edit 2: I have found this algorithm also beats Horspool most of the time, say for an ngram size of 2 or 3, on any alphabet, not just low ones.
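
A rough C sketch of the hashed q-gram shift table (the hash, table size and q-gram size here are arbitrary choices, not a reference implementation):

    #include <stddef.h>
    #include <string.h>

    #define Q 4               /* q-gram size assumed for this sketch */
    #define TABLE_SIZE 4096   /* power-of-two hash table size, tunable */

    /* Arbitrary hash of Q bytes into the shift table. */
    static size_t hash_qgram(const unsigned char *p)
    {
        size_t h = 0;
        for (int i = 0; i < Q; i++)
            h = h * 31 + p[i];
        return h & (TABLE_SIZE - 1);
    }

    /* Horspool-style search that shifts on hashed q-grams instead of single bytes. */
    const char *qgram_horspool(const char *hay, size_t n,
                               const char *needle, size_t m)
    {
        if (m < Q || n < m)
            return NULL;                    /* sketch only handles needles >= Q */

        size_t shift[TABLE_SIZE];
        for (size_t i = 0; i < TABLE_SIZE; i++)
            shift[i] = m - Q + 1;           /* q-gram not in needle: maximum shift */
        for (size_t i = 0; i + Q < m; i++)  /* q-grams before the needle's end */
            shift[hash_qgram((const unsigned char *)needle + i)] = m - Q - i;
        shift[hash_qgram((const unsigned char *)needle + m - Q)] = 0;

        size_t pos = 0;
        while (pos + m <= n) {
            size_t s = shift[hash_qgram((const unsigned char *)hay + pos + m - Q)];
            if (s == 0) {                   /* trailing q-gram may match: verify */
                if (memcmp(hay + pos, needle, m) == 0)
                    return hay + pos;
                s = 1;
            }
            pos += s;
        }
        return NULL;
    }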


A few years ago, I wrote another substring search algorithm that outperforms this one on a modern computer: https://github.com/RaphaelJ/fast_strstr/tree/master/benchmar...

While the algorithm is linear time (Boyer-Moore is sub-linear), it ends up being significantly faster on textual content, as the per-character operations are much simpler.


BM taught me that cleverness doesn't pay for fast, similar iterations. Performance is all about cache misses. I hated that lesson.


Also that thinking about string matching problems isn't good for my mental health. I had horrible nightmares on the road to my O(m) good suffix rule. Very hard not to go a little mad over iterating countless ideas and string matching scenarios in your head... on the train, on the pot, laying in bed trying to sleep.

Btw. a digital tablet with pen input is gold for thinking about algorithms, as you can modify your scenarios in place endlessly and copy and paste, unlike with paper + pencil.


Same. I implemented it, IIRC, on my Amiga 2000 and compared it to the Manx C library builtin. Which wiped the floor with my BM implementation. I don't remember the exact numbers, but it was so much faster that I didn't even bother trying to optimize mine, I just threw up my hands.

It may have been on an HP9000s2xx machine instead of Amiga.

Taught me that hand optimized assembly can beat clever algorithms.


I should note that this algorithm is not for untrusted inputs because one can relatively easily fabricate inputs that trigger the worst case (this is even possible when the attacker can control only the haystack or only the needle). The two-way algorithm, used by glibc and many others, is linear at the worst and cache-friendly enough. Maybe Rabin-Karp is fast enough for short needles even at the worst case; a hybrid algorithm might be interesting (like, two-way algorithm has two variants for long needles and short needles; RK can replace the latter).
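
For reference, the short-needle part could be something like Rabin-Karp with a rolling hash and explicit verification; this is only a sketch with an arbitrary base, wrapping mod 2^64, not glibc's actual code:

    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    /* Rabin-Karp sketch: roll a multiplicative hash over the haystack and
       verify hash hits with memcmp. */
    const char *rabin_karp(const char *hay, size_t n, const char *needle, size_t m)
    {
        if (m == 0 || n < m)
            return m == 0 ? hay : NULL;
        const uint64_t B = 257;              /* hash base, arbitrary choice */
        uint64_t hn = 0, hh = 0, pow = 1;
        for (size_t i = 0; i < m; i++) {
            hn = hn * B + (unsigned char)needle[i];
            hh = hh * B + (unsigned char)hay[i];
            if (i > 0)
                pow *= B;                    /* B^(m-1), wrapping mod 2^64 */
        }
        for (size_t i = 0; ; i++) {
            if (hh == hn && memcmp(hay + i, needle, m) == 0)
                return hay + i;
            if (i + m >= n)
                break;
            /* slide the window: drop hay[i], append hay[i + m] */
            hh = (hh - (unsigned char)hay[i] * pow) * B + (unsigned char)hay[i + m];
        }
        return NULL;
    }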


Speaking of modern computers, have you compared it to these SIMD string find algorithms by Wojciech Muła? They're a bit more recent:

http://0x80.pl/articles/simd-strfind.html


Interesting, but I’m not sure the algorithm aged well.

Memory is a block device now. DDR delivers 8 bytes per transaction per channel, e.g. 16 bytes per transaction for a dual-channel memory configuration. Skipping bytes delivers little profit in terms of memory bandwidth (though it saves other things; CPUs have a very small limit on how many memory load instructions can run per cycle).

SIMD instructions can compare vectors for about the same cost as scalars, e.g. _mm_cmpeq_epi8 (SSE2) and vceqq_u8 (NEON) compare 16 bytes with another 16 bytes.

If I wanted a super-fast substring search, I would try to leverage these things.
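
Something in that direction, as a rough SSE2 sketch (illustrative only, GCC/clang builtins assumed; the SIMD write-ups linked elsewhere in the thread additionally compare a second probe byte to cut down false positives):

    #include <emmintrin.h>   /* SSE2 intrinsics */
    #include <stddef.h>
    #include <string.h>

    /* Compare 16 haystack bytes at a time against the needle's first byte,
       then verify each candidate position with memcmp. */
    const char *find_sse2(const char *hay, size_t n, const char *needle, size_t m)
    {
        if (m == 0 || n < m)
            return m == 0 ? hay : NULL;
        __m128i first = _mm_set1_epi8(needle[0]);
        size_t i = 0;
        for (; i + 16 + m - 1 <= n; i += 16) {
            __m128i block = _mm_loadu_si128((const __m128i *)(hay + i));
            int mask = _mm_movemask_epi8(_mm_cmpeq_epi8(block, first));
            while (mask) {
                int bit = __builtin_ctz(mask);   /* lowest candidate offset */
                if (memcmp(hay + i + bit, needle, m) == 0)
                    return hay + i + bit;
                mask &= mask - 1;                /* clear that candidate */
            }
        }
        for (; i + m <= n; i++)                  /* scalar tail */
            if (memcmp(hay + i, needle, m) == 0)
                return hay + i;
        return NULL;
    }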


It has not aged well for ASCII characters, especially for texts like English, where the letter "e" is abundant enough to make it likely that your average jump would be about 6, less than a single 64-bit fetch.

It can still be used to find a substring of 100 64-bit values in a much longer 64-bit string, but ... it's a problem I never encountered (and only encountered a 16-bit version of it once in my life). Then again, if you use 32 bits for each Unicode character and have a lot of emojis, it might still be somewhat competitive with brute force.


Harder to argue about character frequency in DNA/RNA sequences.


Naively, I'd expect all letters to occur equally frequently. Do you know if this is the case?


It's not the case.

There are consequences to the proportions of the nucleotides, the most well-known being the GC-content[1], the proportion of bases in a sequence that are G or C (G pairs with C and A pairs with T). There are benefits to a higher GC-content, like the sequence being more stable.

From this the obvious question would be why this matters, since you obviously can't replace an A-T pair with a G-C pair just to increase this proportion. Except that there are ways to do it… kind of. If you look at the codon table[2], you'll see that most amino acids can be encoded by more than one triplet of nucleotides.

[1] https://en.wikipedia.org/wiki/GC-content

[2] https://en.wikipedia.org/wiki/DNA_and_RNA_codon_tables


Thanks for jumping in with way more info than I could provide.

I love how the DNA language, much unlike spoken words, is closely tied to physical properties of the world.

There really is some exciting magic in the intersection of biology and computation.


Most BM implementations do this implicitly by using `memchr` in their skip loops. A lot of folks attribute the "skipping bytes" aspect of BM as its main reason for being fast. But nowadays, it's really just because most implementations run memchr on the last byte. At least for common cases and fairly short needles. (I'm sure there are inputs where the byte skipping plays a larger role.)
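
The shape of such a skip loop is roughly this (an illustrative sketch, not any particular library's code): memchr hunts for the last needle byte, and only then is the rest of the window verified.

    #include <stddef.h>
    #include <string.h>

    const char *skiploop_find(const char *hay, size_t n,
                              const char *needle, size_t m)
    {
        if (m == 0 || n < m)
            return m == 0 ? hay : NULL;
        int last = (unsigned char)needle[m - 1];
        const char *p = hay + m - 1;        /* earliest possible last-byte position */
        const char *end = hay + n;
        while ((p = memchr(p, last, (size_t)(end - p))) != NULL) {
            const char *start = p - (m - 1);
            if (memcmp(start, needle, m - 1) == 0)
                return start;               /* last byte already matched via memchr */
            p++;                            /* skip past this occurrence */
        }
        return NULL;
    }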


I tried to argue once that longer needles were better for searching; but I didn't find any literature to back that up. And I suspect most searches are done with line endings being tracked, such that most have an implicit tiny needle they are tracking.


To a first approximation, I would agree with the statement that "longer needles are better for searching." They just tend to have more useful properties: better chance of containing a rarer byte (or bytes), better chance of skipping characters, less of a chance of occurring at all, and so on. Of course, longer needles can sometimes result in a much slower search for pathological cases, depending on the substring algorithm you use.

The grep tools do track line endings, yes. ripgrep does it by default, whereas GNU grep does not. Line endings are only tracked to print line numbers. If you don't need to print line numbers, then you don't need to track all line endings.

The thing with tiny needles is that they aren't all created equal. If your tiny needle is a single byte and that byte doesn't occur too frequently, then it's going to be very fast to find occurrences of that byte.


I'm curious how you can speed up finding a single byte? Don't you have to touch all bytes at that point? Leaning heavily on simd instructions?

And I realize newlines is only to track line counts. Am I wrong that that slows things down?


> I'm curious how you can speed up finding a single byte? Don't you have to touch all bytes at that point? Leaning heavily on simd instructions?

Yes. That's what memchr does: https://github.com/BurntSushi/rust-memchr/blob/427fdc384007d...

> And I realize newlines is only to track line counts. Am I wrong that that slows things down?

It does, but only a little if it's implemented using vector instructions. You can try with ripgrep using -N/-n. GNU grep's line counting isn't vectorized IIRC, so you might notice bigger differences there.


On a typical input, where there's probably a newline every 30–80 characters on average [1], it seems like many of the bytes are not in that algorithm's happy 64-bytes-at-a-time 16-bytes-aligned main loop because each line would be a separate memchr call which has to redo the unaligned prefix bit, and/or are processed twice because they're in the 64 bytes but after the first newline. Have you ever considered an alternate interface which returns all the newline positions in a single call on a whole buffer? [edit: or next N with N>1.] Is it worth pursuing?

[1] bytes/lines on "wc -l *.[ch]" in the ffmpeg/libavformat directory I happened to have open says there's a newline every ~34 bytes.

edit: I suppose if ripgrep -N and ripgrep -n aren't that far apart, it's fast enough already. But I have to think it's not getting much mileage out of that 64-bytes-at-a-time loop.


> each line would be a separate memchr call which has to redo the unaligned prefix bit

I'm not sure what GNU grep/wc does, but assuming you're getting line numbers for relatively rare matches, you probably shouldn't use memchr. Instead, load chunks of bytes into a SIMD/vector register, compare each one for equality to 0A, take the popcnt, and add it to a line number counter. You don't actually care where (most of) the newlines are located, just how many of them there are.
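
A sketch of that counting loop with SSE2 (GCC/clang builtins assumed; real implementations unroll and use wider vectors):

    #include <emmintrin.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Count '\n' bytes: compare 16 bytes at a time, popcount the match mask. */
    uint64_t count_newlines(const unsigned char *buf, size_t n)
    {
        uint64_t count = 0;
        __m128i nl = _mm_set1_epi8('\n');
        size_t i = 0;
        for (; i + 16 <= n; i += 16) {
            __m128i block = _mm_loadu_si128((const __m128i *)(buf + i));
            int mask = _mm_movemask_epi8(_mm_cmpeq_epi8(block, nl));
            count += (uint64_t)__builtin_popcount(mask);
        }
        for (; i < n; i++)                   /* scalar tail */
            count += (buf[i] == '\n');
        return count;
    }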


Yes, that's what ripgrep does, or similarish. In my initial implementation, I did what you suggested on a standard u64. But bytecount adds SIMD to it (which I did not write): https://docs.rs/bytecount

I believe GNU grep just uses memchr for this. But I might be mis-remembering.


For line counting specifically, memchr isn't used, but rather, bytecount is: https://docs.rs/bytecount/0.6.2/bytecount/

However, memchr is still used for finding the bounds of a line when a match occurs. memchr still does pretty well on shorter haystacks too. The main loop is indeed quite large, but before that, it does unaligned loads of 32 bytes: https://github.com/BurntSushi/rust-memchr/blob/427fdc384007d...


I think in the general case (random strings), long patterns do not improve average performance.

Or rather, nothing will change once the pattern length exceeds the expected shift for the alphabet used. E.g. for a simplified DNA alphabet, extended BC shifts on average 4 characters. On the fly I can't do the stochastics on GS, but I assume substring match length and chance of reoccurrence run against each other in a similar manner.

On the other hand long patterns increase preprocessing/access time: Extended BC and GS have O(m) for preprocessing, and BC may have non-constant access, depending on space tradeoff.


That is all true. You can however increase the effective alphabet size by working with n-grams of the needle rather than bytes. Even a 2 byte n-gram gives you an alphabet of 65536. You wouldn't get all of that as the hash table would probably be smaller, but you can increase the shifts a lot for longer patterns in this manner.


My gut is that "long string" just isn't that long nowadays. Even searching for a full UUID in a log doesn't do much to change the size of the needle relative to cache sizes. Though, maybe it is enough to help with some of the wide instruction choices?


(I am no expert at all!)

I mean yes, of course, in specific scenarios certain assumptions can be made. E.g. for UUIDs you're probably better off exploiting the fixed-length pattern too.

But for the general/theoretical case these intuitions usually do not pan out. I mean even "memory hacks" make assumptions about the data indirectly, via common hardware architecture.

That is, most data isn't random strings of course XD

However, when BM is used in e.g. bioinformatics on DNA, where you are still stuck with first finding data against evolutionary noise and your algorithms must adapt to your hypotheses, theory becomes more relevant, I assume.

I think DNA focused "string search problems" are really inspiring and have a lot of potential for mingling with philosophical fundamentals of informatics. There is something about the evolutionary emergence of "data" AFAIK no other field offers. E.g. the overlapping, extending, or contextualizing information meta-layers upon meta-layers in DNA translation and structure "specified" to no more than merely exist. Throwing Boyer-Moore at ASCII encoded sequences in FASTA files almost feels blasphemous, or the arrogantly fallacious human essence.


I think using an alphabet of length 4 is probably a specialized enough use case that it deserves its own analysis from whole cloth if your goal is maximal performance.

I work on text search where the alphabet size is 256. If someone told me to work on text search where the alphabet size is 4, a lot of what I'd do would change, starting with my benchmark inputs.

So I kind of think responding to general claims about substring performance with, "well in DNA searching..." is kind of reframing the discussion to a specialized use case with very different priors. That is, I'm not sure anyone ever said running BM on DNA was a good idea. But maybe it's convenient, so that's what folks do. :-)


I don't think your comment is a fitting response, as I clearly stated this is about the general, theoretical stance, where you can't make assumptions about either the alphabet or the text.

As I recognized, introducing assumptions of course opens possibilities for doing things differently.

I was using the DNA example because bioinformatics is where a lot of people learn about BMA: the math is easy there and we are not tempted to make assumptions about the text, even tho it's not random. For most bioinfo applications the alphabet is not four letters, but e.g. the amino acid alphabet, or one extended with additional sequence information.

In bioinfo practice the naive algorithm often performs competitively against BMA if you optimize for hardware assumptions. Branch prediction is a bitch for BMA.


If you know the maximum width of the substring you are searching for, you can easily calculate the maximal forward jump, and hence you'll know when to issue a prefetch.


Having had a quick look at the AMD 17h processor family optimization guide, it would seem that the PREFETCHNTA opcode is the one you are looking for; it hints to the processor to pull the memory at that address close to the processor while avoiding polluting the cache.

For GCC, it would seem that this built-in pseudo function call does the trick (the second argument 0 marks the access as a read, and the third argument 0 means no temporal locality, which is what maps to PREFETCHNTA):

    __builtin_prefetch((void *)p, 0, 0);


Yes ofc. I was making a theoretical argument. Assumptions about the data or hardware matter in practice, but they are not relevant for the theoretical exploration.


I imagine Boyer-Moore might still make sense for some applications, where the searched-for string is very long, such that whole blocks can be skipped.

Related TIL: ripgrep doesn't (edit: mostly doesn't) use Boyer-Moore. [0]

[0] https://github.com/BurntSushi/ripgrep/issues/493#issuecommen...


It does actually! But only in some cases. See https://github.com/BurntSushi/ripgrep/issues/617#issuecommen... and https://github.com/rust-lang/regex/issues/408 and https://github.com/rust-lang/regex/pull/419

Note though that it is going away soon. It's going to be replaced with a combination of Two-Way, Rabin-Karp and a fast vectorized prefilter based on this: http://0x80.pl/articles/simd-strfind.html#algorithm-1-generi...

The "secret sauce" is that in the vectorized prefilter, we don't just choose the first and last bytes. Instead, we pick bytes that we think will be rare based on a priori assumption of relative byte frequencies.


It's exciting that we're still seeing such interesting progress on such a fundamental problem. What kind of performance boost do you expect?


For queries that are already fast, I expect them to remain similarly fast. Or hope that they do. :-)

But there are many queries that are not so fast today. For example:

    $ time rg-master -Nc 'strengths' OpenSubtitles2018.raw.sample.en
    242

    real    0.255
    user    0.204
    sys     0.050
    maxmem  903 MB
    faults  0

    $ time rg -Nc 'strengths' OpenSubtitles2018.raw.sample.en
    242

    real    0.092
    user    0.068
    sys     0.024
    maxmem  904 MB
    faults  0
Where the second command has the new substring algorithm. Nothing earth shattering to be honest. ripgrep already did a good job of dealing with common cases.

This is ripgrep's current work-horse: https://github.com/rust-lang/regex/blob/3db8722d0b204a85380f...

The revised algorithm is based on the same principles (heuristics derived from an a priori frequency distribution), but shouldn't fall as easily into cases where it runs very slowly. Here's another, more pointed example (where the needle is two ASCII space characters):

    $ time rg-master -Nc '  ' OpenSubtitles2018.raw.sample.en

    real    0.887
    user    0.859
    sys     0.027
    maxmem  903 MB
    faults  0

    $ time rg -Nc '  ' OpenSubtitles2018.raw.sample.en

    real    0.084
    user    0.043
    sys     0.040
    maxmem  903 MB
    faults  0


> heuristics derived from an a priori frequency distribution

How did you come up with the frequency distribution? English text? Code samples? Languages other than English?



Wow, cool!

Since rg (which is awesome) understands which programming languages map to which file suffixes, could it make sense to do per-language frequency analysis? {}[] are common in C-style languages, but much less so in lisp, python, etc.


It could. But there's a lot of code complexity there because ripgrep is made from de-coupled components. Implementing your idea would require making that frequency distribution available as an overridable element all the way down into where substring search is actually implemented (a transitive dependency of ripgrep).

On top of that, it's unlikely to make much of a difference. Not only would the frequency distributions between languages look pretty similar, but it's still just a guess at the end of the day. There is such a thing as overfitting too. :-)


The "Boyer–Moore majority vote algorithm" is also very neat.

It's the optimal solution to https://leetcode.com/problems/majority-element/ – as in it's required to solve the problem in linear time and in O(1) space.
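
The whole thing is just a candidate and a counter; a minimal sketch (a second pass is still needed to confirm the candidate unless a majority is guaranteed to exist):

    #include <stddef.h>

    /* Boyer-Moore majority vote: pair off differing elements; a true majority
       element, if one exists, survives the cancelling. */
    int majority_candidate(const int *a, size_t n)
    {
        int candidate = 0;
        size_t count = 0;
        for (size_t i = 0; i < n; i++) {
            if (count == 0) {
                candidate = a[i];
                count = 1;
            } else if (a[i] == candidate) {
                count++;
            } else {
                count--;
            }
        }
        return candidate;
    }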


I don't know if this applies to that problem, but something worth noting about Leetcode O(1) space problems: the input is mutable and does not count toward your space usage.

I like to use Leetcode hard or medium problems as interesting puzzles to work on in my head when I've got some time to kill, such as waiting in line for some service or lying in bed trying to fall asleep.

One of those problems was: given an input array of integers, find the smallest positive integer that is not in the array. They wanted this in O(N) time and O(1) space.

I spent something like two months trying to solve that. I even spent a fair bit of that time trying to prove that it could not be done, hoping that maybe if I could not prove it impossible I might get some insights from why I could not do so that would help me figure out how to do it.

That "try to disprove it to gain insight" approach didn't work, because my attempts to disprove it actually almost convinced me that it was in fact impossible. I say almost, because my argument using Turing machines actually seemed like it wa right, but it has been around 40 years since I last studied or used Turing machines, and so I wasn't quite sure that I wasn't missing something.

I finally gave up, and took a quick glance at a published solution, trying to just see enough to get unstuck. The first thing I saw was that they were writing to the input array. As soon as I saw that you can overwrite the input, and so by O(1) space what they really mean is O(1) additional space besides this array of length N that we give you, the solution came quickly.
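
For the curious, the in-place trick looks roughly like this (a sketch of the standard approach, not the exact published solution): park each value v in 1..n at index v-1 by swapping, then scan for the first slot that doesn't hold index+1.

    /* O(n) time, O(1) extra space beyond the mutable input array. */
    int first_missing_positive(int *a, int n)
    {
        for (int i = 0; i < n; i++) {
            /* keep swapping a[i] into its home slot until it can't move */
            while (a[i] >= 1 && a[i] <= n && a[a[i] - 1] != a[i]) {
                int t = a[a[i] - 1];
                a[a[i] - 1] = a[i];
                a[i] = t;
            }
        }
        for (int i = 0; i < n; i++)
            if (a[i] != i + 1)
                return i + 1;
        return n + 1;
    }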


Does it have practical applications tho? Last I read about it, I figured it was just neat. (On the other hand it pollutes the namespace as a "BMA" search term collision XD)


These days? Perhaps some niche area.

From the paper's abstract:

> the algorithm uses storage in a way that permits an efficient use of magnetic tape


If you’d like to read a canonical, well-commented implementation, here is one from OpenJDK: http://hg.openjdk.java.net/jdk9/jdk9/jdk/file/37a05a11f281/s...


I have always implemented the Horspool algorithm instead. It is simpler to implement and generally performs just as well.

When researching string search algorithms some time ago, I came across max-shift Boyer-Moore, which seemed to be a very nice incremental improvement.
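
For comparison, Horspool really is only a handful of lines; a minimal sketch:

    #include <stddef.h>
    #include <string.h>

    /* Build one 256-entry shift table from the needle, then shift by the
       entry for the haystack byte aligned with the needle's last position. */
    const char *horspool(const char *hay, size_t n, const char *needle, size_t m)
    {
        if (m == 0 || n < m)
            return m == 0 ? hay : NULL;
        size_t shift[256];
        for (size_t i = 0; i < 256; i++)
            shift[i] = m;
        for (size_t i = 0; i + 1 < m; i++)
            shift[(unsigned char)needle[i]] = m - 1 - i;

        size_t pos = 0;
        while (pos + m <= n) {
            if (memcmp(hay + pos, needle, m) == 0)
                return hay + pos;
            pos += shift[(unsigned char)hay[pos + m - 1]];
        }
        return NULL;
    }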


On the topic of string research, one excellent reference that someone recommended to me is the book "Flexible Pattern Matching in Strings" by Navarro and Raffinot.

One of my favorite things about it is that it has a "phase change" graph at the end of each chapter, saying for each combination of alphabet size and pattern size, which algorithm fares better. It also favors algorithms that are simple to implement (such as Horspool instead of Boyer-Moore), because those also tend to be the fastest.


It is indeed an excellent book, marrying theory with an eye for implementation and real world trade offs.


I've actually found Boost's Boyer-Moore-Horspool to be faster than its Boyer-Moore.


For long needles it is, I agree. I am just too lazy to build a good heuristic for choosing between a naive search (which is surprisingly fast on modern hardware) and Horspool. Whatever implementations I have written have a sweet spot between naive and Horspool where BM is actually sliiightly faster.

I think it is just that Horspool works better on modern hardware. BM isn't nearly as much of a gain over the naive search as it was 25 years ago.

Edit: the max shift change I mentioned above can be done to Horspool as well!


Oh, I actually meant for (super) short needles. I'm not sure if this is a general statement or if it's just applicable to the Boost implementation though. And yeah, I should really try out the max-shift change; I came across it not too long ago but never got around to implementing it.


I find Horspool is almost always faster than Boyer-Moore.

For very short needles though (e.g. fewer than 8 bytes) it is hard to beat the Shift-Or algorithm.

Even though it is not sublinear, looking at every position, it tracks matches using a bitmask. In practice this is very fast with very good cache behaviour, and it easily beats BM or Horspool by quite a margin. They rely on skips for speed, and you don't get great skips with a short needle.
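
Shift-Or in a nutshell, for needles up to one machine word (a sketch): one precomputed bitmask per byte value, then a single shift and OR per haystack byte tracks every partial match at once.

    #include <stddef.h>
    #include <stdint.h>

    const char *shift_or(const char *hay, size_t n, const char *needle, size_t m)
    {
        if (m == 0 || m > 64 || n < m)
            return m == 0 ? hay : NULL;
        uint64_t mask[256];
        for (int i = 0; i < 256; i++)
            mask[i] = ~0ULL;
        for (size_t i = 0; i < m; i++)        /* bit i cleared where needle[i] == byte */
            mask[(unsigned char)needle[i]] &= ~(1ULL << i);

        uint64_t state = ~0ULL;               /* bit j == 0: needle[0..j] matches here */
        uint64_t match_bit = 1ULL << (m - 1);
        for (size_t i = 0; i < n; i++) {
            state = (state << 1) | mask[(unsigned char)hay[i]];
            if ((state & match_bit) == 0)
                return hay + i - (m - 1);
        }
        return NULL;
    }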


They make it look like nothing has happened in optimal strstr algorithms in the last decades. In the libcs, indeed, nothing has: they are still in the stone age of string search.

But the fastest is currently EPSM, by S. Faro and O. M. Kulekci. See https://smart-tool.github.io/smart/

"Exact Packed String Matching" optimized for SIMD SSE4.2/AVX (x86_64 and aarch64). It performs stable and best on all sizes.

The site I linked to compares 199 fast string search algorithms, with the usual ones (BM, KMP, BMH) being pretty slow. EPSM outperforms all the others being mentioned here on these platforms. It's also the latest.


It is a great site, I've discovered many interesting algorithms on it. And benchmarked them with their tool, along with some of my own.

And yes, EPSM outperforms the others by a mile.

If you are comfortable with your search performance being bound to particular CPU architectures, and that may be entirely reasonable, then you can't beat that approach.


Back in the days of MS-DOS I wrote what I thought was a fairly clever search algorithm making use of the XLAT instruction. It could search for N = 1-8 characters in a block of text with a loop 7 instructions long, which I'm trying to reproduce from memory:

  start:
    xor ah,ah
    lea bx, masktable
    mov cx, textlength
    lea si, texttosearch
  test_character:
    or    ah, startbit  (set bit 8-N of the ah register)
    lodsb               (get a character into AL)
    xlat                (AL gets [BX+AL])
    and   ah,al         (masks off non-matching characters)
    shl   ah,1          (AH=AH*2, carry goes into Carry Flag)
    jc    match_found        ; we found a match, break out
    loopnz  test_character
  no_match_found:            ; get here if there isn't a match

In a 64-bit environment, you should be able to search for up to 64 characters fairly quickly using a table that is 8*256 bytes. You can also do case-insensitive searches, along with a '?' for any character to do some wildcarding. The nice thing is that everything stays in the cache until you jump out on a match, and not before.


I went through the work and re-implemented it as an assembly language routine inside a 64-bit Free Pascal program.

It searches about 1 gigabyte of text in 2 seconds.

https://github.com/mikewarot/fastsearch/blob/master/fastsear...


Work has been done applying the Boyer-Moore principle to regex scanning.

* https://doi.org/10.1016/S0167-6423(03)00013-3 (PDF is freely available)

* https://nothings.org/computer/source/sgrep.html


Interesting, thanks for the link, I wasn't aware of that one.

I've been mulling how to apply fast search algorithms to regular expressions for a while. There isn't really one technique that works well for all expressions, so the more options right now the better. Eventually I'll settle on a few that offer the best overall coverage and performance.


I implemented it in Go for fun. I expected Go's standard library to beat it, but for long substring searches it performs up to 10 times faster than `strings.Index`.

https://github.com/sarpdag/boyermoore

Go's strings package tries different approaches depending on the substring size, such as the Rabin-Karp algorithm (https://en.wikipedia.org/wiki/Rabin%E2%80%93Karp_algorithm).


To me, this is a classic and timeless piece of real computer science.

Such an eye opener when I learned about it the first time! And a reminder that we really stand on the shoulders of giants.


These algorithms are such a joy for me to code. Much more fun than the whole enterprise application work most of us 'developers' are stuck with.


Have you looked into the Z-algorithm? It's related to BM via Good Suffix rule, and I think it's very neat and handy.



It solves a slightly different problem, searching for one substring instead of searching for one of many, so it would not be fair to compare the two.


Yes, and if you need the best of both there is the Commentz-Walter algorithm from 1979:

> Like the Aho–Corasick string matching algorithm, it can search for multiple patterns at once. It combines ideas from Aho–Corasick with the fast matching of the Boyer–Moore string search algorithm. For a text of length n and maximum pattern length of m, its worst-case running time is O(m·n), though the average case is often much better.

> GNU grep implements a string matching algorithm very similar to Commentz-Walter.

https://en.wikipedia.org/wiki/Commentz-Walter_algorithm


Commentz-Walter doesn't work that well in practice. As you add more patterns, the likelihood that there is any meaningful byte skipping goes way down. So it might be useful for a small number of patterns, but at that point, there are vectorized algorithms that will do (a lot) better. (Like Teddy from Hyperscan.)


Thank you very much for this reply. I was considering implementing it for a program I’m writing, but in light of that it’s clearly not worth it.


Does this actually matter in the real world?

I assume that even with a naive search, a modern CPU can search at the memory read speed.

Without any kind of indexing beforehand it is impossible to search faster than the memory read speed.

The approach shown of only checking some characters doesn't help remove memory bandwidth constraints unless the thing you are looking for is very long (because cache lines are very wide), which I suspect is not a common use case.


String searching is interesting in that the naive algorithm is quadratic-ish: imagine searching for “aaaab” in the string “aaaaaaaaaaaa” by comparing the characters 1-5, then 2-6, and so on. So at least some kind of clever insight needs to be used so as to not hit quadratic behaviour - some mechanical way of deciding that a chunk of the input can be skipped after being read. The value of extra skips (not examining some of the input) is debatable, but these algorithms are still very much relevant when wanting to avoid accidentally-quadratic behaviour.


Naive BM goes quadratic at

T: aaaaaaaaaaaa

P: xaaaaa

Not easy to come up with an improvement, there will always be that exception you didn't think of XD

I wonder if you could derive meta information about the DNA's "lyrical" composition by observing the BM throughput. After all, the algorithm assumes random text.


Straightforward Boyer-Moore is worst case O(nm), but (I believe) there are modifications that make it linear time. There are other string searching algorithms like Knuth-Morris-Pratt which are linear in the size of the two inputs. My point was that non-naive string searching algorithms absolutely do matter, and that no amount of cache-efficiency is going to save an implementation from hitting the quadratic behaviour intrinsic to a naive algorithm.


I thought it was implicit that you build a jump table based on the string. That is, BM isn't just "search from the end." It includes the push down automata.


Yes, I am not contesting or anything, but my example shows how the original BC and GS rules don't protect you from quadratic worst case complexity either. That's my point, it's not easy to do sub-quadratic worst case search.

I haven't recently looked into the sub-quadratic worst case improved version of BM, but IIRC it's far from trivial and not what most people think of for this algorithm. I assume it's also slower IRL, for cache misses, or for expensive operations like modulo in the main loop, and prolly has a space penalty.

BM just has a better average case than naive string search.


Burntsushi has plenty of data showing it isn't the best nowadays. Pretty sure any version that goes quadratic is bite BM, though. Not that you are wrong, but just highlighting that most folks don't think of BM when they think they are, though.


ROM hackers frequently use this algorithm to search binary ROMs for encoded strings.


The real world is larger than your desktop PC.



