A regular expression matcher in 19 lines of Clojure

gjm11 · on Oct 16, 2011

Repeating, mutatis mutandis, what I said in an earlier thread (http://news.ycombinator.com/item?id=2723366) about the C code on which this is based:

It's very nice pattern-matching code, but it isn't a regular expression matcher. The class of patterns it's able to match is much smaller than the class of all regular expressions -- not because of missing "sugar" (character classes, +, non-greedy wildcards, ...) but because it doesn't support groups or the | operator.

As a result, the language supported by this matcher has, e.g., no way to match the pattern (ab)* or the pattern a|b. It's much, much less powerful than an actual regular expression matcher would be, and much of what makes it possible to do it in 19 lines of Clojure is that loss of power.

(I've written extremely similar code before: this level of functionality -- basically, glob or DOS wildcards -- is pretty useful. I'm not Rob Pike, and my code for similar functionality would probably be longer than his. But my code, or even Rob Pike's code, for even the simplest thing that could honestly be called a regular expression matcher, would be longer than this by a bigger factor.)

alexgartrell · on Oct 16, 2011

I'm no haskell hacker (by a long, long, long shot[0]), and I've played a lot more with Clojure than with Haskell, but I still found this easier to write and read in haskell than in clojure.

    matches :: String -> String -> Bool
    matches rs = dm ((fixHead . fixTail) rs) where
      fixHead ('^':rs) = rs
      fixHead rs = '.':'*':rs
      fixTail rs = case (reverse rs) of
        ('$':rrs) -> reverse rrs
        rrs -> reverse ('*':'.':rrs)
      dm "" "" = True
      dm ('.':'*':rs) (c:cs) = (dm ('.':'*':rs) cs) || (dm rs (c:cs))
      dm (r:'*':rs) (c:cs) = (dm rs (c:cs)) || (c == r && (dm (r:'*':rs) cs))
      dm ('.':rs) (c:cs) = dm rs cs
      dm (r:rs) (c:cs) = r == c && dm rs cs
      dm (_:'*':[]) "" = True
      dm _ _ = False

[0] I'm particularly ashamed of fixHead and fixTail and the way I special cased matching a star and nothing else against the empty string

edit: couldn't resist doing the golf thing

    matches :: String -> String -> Bool
    matches rs = dm (fixHead rs) where
      fixHead ('^':rs) = rs
      fixHead rs = '.':'*':rs
      cm r c = r == '.' || r == c
      dm (r:'*':rs) (c:cs) = (dm rs (c:cs)) || ((cm r c) && (dm (r:'*':rs) cs))
      dm (_:'*':"") "" = True
      dm (r:rs) (c:cs) = cm r c && dm rs cs
      dm "" _ = True
      dm "$" "" = True
      dm _ _ = False

alexgartrell · on Oct 16, 2011

Ok, now I'm done

    matches :: String -> String -> Bool
    matches ('^':rs) = dm rs where
      cm r c = r == '.' || r == c
      dm (r:'*':rs) (c:cs) = (dm rs (c:cs)) || ((cm r c) && (dm (r:'*':rs) cs))
      dm (_:'*':"") "" = True
      dm (r:rs) (c:cs) = cm r c && dm rs cs
      dm "" _ = True
      dm "$" "" = True
      dm _ _ = False
    matches rs = matches ("^.*" ++ rs)

drallison · on Oct 16, 2011

LOC (Lines of Code) is not a good characterization of the quality of code; clarity and understandability is. The Clojure implementation is pretty nice on those counts.

The LOC measure may be important when, for example, the goal is to show the number of operations to be as small as possible.

gruseom · on Oct 16, 2011

LOC (Lines of Code) is not a good characterization of the quality of code

Actually it is. Or at least it's the best we have. When it comes to anything you can measure, the evidence points to program length as the best predictor of quality of code. And LOC is, for better or worse, the standard metric for code size.

We had a whole thread on HN about this a month ago. Specific studies are cited there. http://news.ycombinator.com/item?id=3037293

Speaking of which, someone later told me that Chapter 27 of McConnell's Code Complete cites studies according to which bug count grows as a mild exponential function of code size. I don't have that book handy. If anyone does, can they look at Chapter 27 and tell us what it says?

arohner · on Oct 17, 2011

My edition is copyright 1993. In Chapter 21.3, they cite Capers Jones 1977 "Program Quality and Programmer Productivity". (Couldn't find a handy link online, appears to be http://www.amazon.com/Program-quality-programmer-productivit...)

They report the following table

    Project Size (LoC)       Error Density (Bugs / kLoC)
    0-2k                     0-25
    2k-16k                   0-40
    16k-64k                  0.5-50
    64k-512k                 2-70
    512k or more             4-100

So yes, the error density increases with LoC.

dewitt · on Oct 16, 2011

Thanks, drallison. Generally speaking I agree, LOC alone is not a great metric of code quality, though it does roughly map to fewer places for bugs to be introduced when coding in the large, which is rarely a bad thing (though irrelevant for something like this).

In this case, I decided to implement it in a lisp after noticing that Pike's algorithm worked by consuming a character at a time off of both the pattern and the input text, lending itself naturally to a functional recursive implementation. Pike's version in C does this by advancing pointers, but it spoke equally well to an idiomatic car/cdr approach in a lisp (the recur/first/rest pattern you see in matchhere).

ramiyer · on Oct 16, 2011

I so understand this http://www.cs.princeton.edu/courses/archive/spr09/cos333/bea... way better than clojure. What matters? number of lines or readability? and I am sure the Clojure implementation would be better understood by people who understand lisp. Ofcourse I haven't played with clojure or any lisp dialects and the main reason for that is when I see a lisp code I see only "(" and ")". I still don't know how to get out of that.

dewitt · on Oct 16, 2011

The best way I found to see past the parens is to simply start writing some lisp. I recently started teaching myself by working through the problems on 4clojure.com, reading pg's _On Lisp_, and books like _Land of Lisp_, _Programming Clojure_, _Joy of Clojure_, etc., and even within a day or two of learning I found I no longer saw the syntax at all. In fact, one of the great beauties of Lisp and s-expressions is that the syntax is so so minimal and so consistent that it all but disappears. (Clojure is a bit more complicated, in that it introduces additional special forms for java interop, metadata, vectors and maps, but you get used to that quickly, too.)

A much trickier thing for me personally has been memorizing what all of the standard functions do, particularly bouncing back and forth between common lisp and clojure like I've been. Clojuredocs.org and the clojure source code itself have been essential companions in this journey.

So the only thing holding you back is the parens, then have no fear, you'll actually grow to love them (especially if you're an emacs user, thanks to show-paren-mode and electric-pair-mode and the like).

0x12 · on Oct 16, 2011

Both versions are really nice but I find the C version to be lots easier to understand (I come from a C background, so that's probably not a surprise).

What does this look like to an experienced lisp/clojure programmer, do they have the same feeling but reversed?

What about programmers familiar with other languages?