Rob Pike on regular expressions in lexing and parsing (commandcenter.blogspot.com)
128 points by jeffreymcmanus on Aug 23, 2011 | 39 comments



Regular expression systems are designed for a very wide variety of text processing tasks. Lexing and parsing systems are optimized for a much more limited range of tasks--lexing and parsing. When you are writing software that is intended as a production system, not a one-off quickie, you are almost always better off using a more specialized system optimized for your specific task, if one is available. The unoptimized, general purpose system would be Plan B. I think that was Pike's point.

Unfortunately, it will probably be misinterpreted as a general criticism of regular expression systems. There are so many cases where my knowledge of a powerful, general-purpose system has saved me from having to learn a new software library for each random task that I intend to keep using such things as regular expressions, scripting languages, unix command line pipelines, Excel, etc. Writing my own code or learning to use a new, specialized system is Plan B for simple, non-production tasks.

Only when I can amortize the cost (time/effort) over a lot of use (as in a production system), or when the advantage of the specialized system is noticeable to the user (including me), does the specialized system become my Plan A.

General purpose tools, special purpose tools, and custom coding all have different costs and benefits, and there are circumstances under which each one is the best choice.


One thing I've found is that a small subset of regular expressions works very nicely for many lexing tasks. Something like ^[a-zA-Z][a-zA-Z0-9]+ (or whatever shorthand your RE system has for those classes) can make a lexer nicer to work with. I wouldn't try to do much with most of the other regex features in a lexer, but it ends up being a compact yet simple-to-understand way of describing the input the lexer is looking for when pulling tokens out.
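
Something like this rough Python sketch, for instance -- the token names and patterns here are made up for illustration, but it's the usual named-group-alternation trick:

    import re

    # One named group per token kind; only simple classes and repetition,
    # nothing fancy like backreferences or lookaround.
    TOKEN_RE = re.compile(r"""
          (?P<IDENT>  [a-zA-Z][a-zA-Z0-9]* )
        | (?P<NUMBER> [0-9]+ )
        | (?P<OP>     [+\-*/=] )
        | (?P<SKIP>   [ \t]+ )
    """, re.VERBOSE)

    def tokenize(text):
        pos = 0
        while pos < len(text):
            m = TOKEN_RE.match(text, pos)
            if m is None:
                raise SyntaxError('unexpected character %r at %d' % (text[pos], pos))
            if m.lastgroup != 'SKIP':
                yield m.lastgroup, m.group()
            pos = m.end()

    # list(tokenize('x1 = y + 42')) ->
    # [('IDENT', 'x1'), ('OP', '='), ('IDENT', 'y'), ('OP', '+'), ('NUMBER', '42')]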


I've found Ragel ( http://www.complang.org/ragel/ ) to be a good compromise: a Ragel grammar is less error-prone and easier to maintain than a handwritten lexer, yet Ragel lets you use regular expressions for all the little places near the leaves of a grammar where token rules are easy to represent that way. In contrast to most regex APIs, it does the state-machine compilation at build time rather than runtime, and the generated code can be quite fast (although you have to make a speed-vs-code-size tradeoff).


Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.

~jwz


I thought you were just taking a cheap shot, but then I read the story, and that's exactly his sentiment.


I'll pop this quote here too

"If you're willing to restrict the flexibility of your approach, you can almost always do something better." — John Carmack


Interesting that he claims that it's difficult to change a regular expression to cope with new requirements but easy to change a hand-coded loop.

In my experience, the opposite is true (up to the point when your grammar becomes non-regular).


Many regular expression implementations don't deal properly with Unicode character classes and collation. So adding something like Unicode identifiers to Go may be impossible or impractical with regexes, whereas with a loop you can rely on more standard collation support.
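
For instance, a rough Python sketch of the loop approach, leaning on unicodedata for the character categories (the function name and the deliberately simplified notion of "identifier" are mine, and it assumes the input has already been decoded into a unicode string):

    import unicodedata

    def is_identifier(s):
        # First char: a Unicode letter or underscore; the rest: letters,
        # digits, or underscores, going by general category.
        if not s:
            return False
        if not (unicodedata.category(s[0]).startswith('L') or s[0] == '_'):
            return False
        for ch in s[1:]:
            cat = unicodedata.category(ch)
            if not (cat.startswith('L') or cat.startswith('N') or ch == '_'):
                return False
        return True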


Some libs, such as PCRE, can match by character properties; for example, \pL matches any Unicode letter.

Doing that in a loop seems much more difficult. And unless you work on UTF-32 data, you also have to handle the variable-size encoding of characters.


I'm a bit confused: aren't lexers implemented using techniques similar to regular expressions (i.e., NFA-to-DFA conversion and DFA minimization)?

Or maybe I'm also confusing lexers with lexer generators.


The term "regex" usually means not "regular expressions" in the formal languages sense but rather the trickier operational notion that makes use of backtracking.

Look at http://cstheory.stackexchange.com/questions/448/regular-expr...


Amen. Stop putting regexes everywhere; they are way too error-prone and difficult to read!

Use parser combinators where you can. With them you can put sub-expressions in variables, give your grammar meaningful identifiers, get parsing that is explorable and debuggable, and so on.
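
For the record, a bare-bones sketch of what I mean in Python (real combinator libraries give you much more; this is just the flavor, and all the names here are my own):

    import re

    # Each parser is a function: (text, pos) -> (value, new_pos), or None on failure.

    def regex(pattern):
        rx = re.compile(pattern)
        def parse(text, pos):
            m = rx.match(text, pos)
            return (m.group(), m.end()) if m else None
        return parse

    def seq(*parsers):
        def parse(text, pos):
            values = []
            for p in parsers:
                result = p(text, pos)
                if result is None:
                    return None
                value, pos = result
                values.append(value)
            return values, pos
        return parse

    def alt(*parsers):
        def parse(text, pos):
            for p in parsers:
                result = p(text, pos)
                if result is not None:
                    return result
            return None
        return parse

    # The grammar gets named pieces instead of one opaque pattern.
    number     = regex(r'[0-9]+')
    identifier = regex(r'[a-zA-Z_]\w*')
    equals     = regex(r'\s*=\s*')
    assignment = seq(identifier, equals, alt(number, identifier))

    # assignment('x = 42', 0) -> (['x', ' = ', '42'], 6)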


Does anyone else have a little trouble with the following sentence: "A regular expression library is a big thing. Using one to parse identifiers is like using a Ferrari to go to the store for milk." I for one would rather read that as "A regular expression library is freaking awesome!" But I think the author meant to say it was overkill! ;)


> And when we want to adjust our lexer to admit other character types, such as Unicode identifiers

Actually, regex libraries such as PCRE have good Unicode support and are better than I am when it comes to things like matching character properties (letters, uppercase letters, numbers, punctuation, etc.).


I understand this point about lexers and parsers, and I think it makes a lot of sense. And regular expressions definitely have a sweet spot in short, off-the-cuff applications (sed, vim, grep). But there's a large number of potential use cases in between. I wouldn't want to write a lexer/parser to deal with validating random user-supplied data in a web application (phone numbers, email addresses).

So I guess the point is that the whole point of a parser/lexer is to look at and validate text, so you shouldn't 'outsource' it. But maybe in applications where minor parsing/text-validation tasks are more peripheral, regexes are more appropriate?


To be fair, I wouldn't want to write a regex to validate an email address either. http://www.ex-parrot.com/pdw/Mail-RFC822-Address.html


> maybe in applications where minor parsing/text-validation tasks are more peripheral, regexes are more appropriate?

Definitely. There's a 3-line regex-based tokenizer in one of my side projects (in Python). Today, for kicks, I tried writing the same function without regexes using two different approaches: one uses list comprehensions and makes liberal use of Python string, set, and list methods; the other is a single-pass, ad hoc state machine.

The regex function is 3 lines and parses 1,000,000 characters in 0.3 seconds on my desktop.

The split-strip function is 20 lines, imo very hard to read, and far slower than the regex (1.77 seconds on the same benchmark).

The ad hoc state machine is 89 lines and easily the slowest (5.3 seconds on the million-byte benchmark).

I'm pretty sure that even if I moved this app towards production I would polish the regex function rather than swap it out entirely with one of the other approaches.

    import re

    key_sections_regex = re.compile(r'\(([\-+]?[0-9]+,\s*\S+)\):([\s\-0-9]+)', re.S)
    def split_keyed_line(keyedline):
        return [[k, n.strip()] for k, n in re.findall(key_sections_regex, keyedline)]
    
    def split_keyed_line_splits(keyedline):
        whitespace = [' ', '\t', '\n', '\r', '\f', '\v']
        notechars=set(list('0123456789-') + whitespace)
        offsetchars=set('0123456789-+')
        def splitka(ka):
            off, mode = ka.split(',')
            if set(off).issubset(offsetchars):
                if len(set(mode.strip()).intersection(whitespace)) == 0:
                    return ','.join([off, mode])
            raise Exception
        def validnotes(notestr):
            if notestr.startswith(':'):
                if set(notestr[1:]).issubset(notechars):
                    return notestr[1:].strip()
            else:
                raise Exception(notestr)
        return [[splitka(keyarea), validnotes(notearea.strip())]
                for keyarea, notearea in
                [x.split(')') for x in keyedline.split('(') if x != '']]
    
    def split_keyed_line_adhoc(keyedline):
        whitespace = [' ', '\t', '\n', '\r', '\f', '\v']
        digits=list('0123456789')
        keyarea_begin = '('
        keyarea_end = ')'
        keyarea_split = ','
        keyarea_offset_signs = ['-', '+']
        keyarea_offset_values = digits
        notearea_values = digits + whitespace + ['-']
    
        tokens=[]
        states=['start', 'offsetsign', 'offset', 'premode', 
                'mode', 'transition', 'notes', 'accept']
        state='start'
        current_token=''
        subtoken=[]
        for ch in keyedline:
            if state == 'start':
                if ch in whitespace:
                    continue
                if ch == '(':
                    state = 'offsetsign'
                    continue
                if ch not in whitespace:
                    print "Syntax Error"
                    raise Exception
            elif state == 'offsetsign':
                if ch in keyarea_offset_signs:
                    current_token += ch
                elif ch in keyarea_offset_values:
                    current_token += ch
                else:
                    print "Syntax Error"
                    raise Exception
                state='offset'
                continue
            elif state == 'offset':
                if ch in keyarea_offset_values:
                    current_token += ch
                elif ch == keyarea_split:
                    current_token += ch
                    state='premode'
                continue
            elif state == 'premode':
                if ch in whitespace:
                    continue
                elif ch not in whitespace:
                    current_token += ch
                    state = 'mode'
                continue
            elif state == 'mode':
                if ch == keyarea_end:
                    subtoken.append(current_token)
                    current_token=''
                    state='transition'
                    continue
                elif ch in whitespace:
                    raise Exception
                elif ch not in whitespace:
                    current_token += ch
                continue
            elif state == 'transition':
                if ch == ':':
                    state='notes'
                    continue
                else:
                    raise Exception
            elif state == 'notes':
                if ch in notearea_values:
                    current_token += ch
                    continue
                elif ch == keyarea_begin:
                    subtoken.append(current_token.strip())
                    tokens.append(subtoken)
                    current_token=''
                    subtoken=[]
                    state='offsetsign'
                    continue
        if state == 'notes':
            subtoken.append(current_token.strip())
            tokens.append(subtoken)
            current_token=''
            subtoken=[]        
            current_token=''
            state = 'accept'
        elif state == 'start':
            state = 'accept'
        if state == 'accept':
            return tokens
        else:
            return ('error', state, tokens)


Maybe I'm thinking of something different, but I've never considered regular expressions hard to write or hard to write well, and considering that they may be compiled to machine code (either AOT or JIT), they're not _necessarily_ inefficient; a good implementation is also likely to handle Unicode better than your average programmer. So, while I agree that they aren't a panacea, they certainly are a powerful and useful tool, and for simple pattern matching and/or extraction, they are often easier to visually verify than the hand-written equivalent, in my experience.


Have you read Friedl's book? I didn't think they were hard to write or hard to write well until I did, and then I regretted my past life of sin.


Then again, reading said book made me believe they’re fairly easy to write well. You need to keep in mind what a quantifier really does (“this will gobble up the whole string and then yield bits until the pattern matches”), but in the end I find it not fundamentally any more taxing than having in my head a rough idea of the behaviour of a few nested loops or recursions.
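
A quick illustration of that gobble-and-yield behaviour:

    import re

    # Greedy .* grabs as much as it can, then backs off until the rest matches:
    re.search(r'"(.*)"', 'say "foo" and "bar"').group(1)    # 'foo" and "bar'
    # The non-greedy form stops at the first closing quote instead:
    re.search(r'"(.*?)"', 'say "foo" and "bar"').group(1)   # 'foo'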


I feel like it's more error-prone. This regexp lexer took me several minutes to get right, and I'm still not totally sure it's bug-free:

    >>> replace = lambda text, env: ''.join(env[item[1:]] if item.startswith('$') else item[1:] if item.startswith('\\') else item for item in re.findall(r'[^\\$]+|\\.|\$\w+|\\$', text))
    >>> print replace(r'This $line has \stuff \\in it that costs \$50 and some $variables.', {'line': 'LINE', 'variables': 'apples'})
    This LINE has stuff \in it that costs $50 and some apples.
If I were to write out an explicit loop over the characters of the string, I would be a lot more sure that I wasn't accidentally dropping characters due to an inadvertent failure to make the regexp exhaustive (I originally forgot the \\$ case! Although that reduces to the empty string anyway) and I wouldn't have to forget and rediscover which lexical category each token belonged to.

And, although it's not present in this case or in all regexp engines, it's a lot easier to accidentally write an exponential-time algorithm in a regexp than in a nested loop. And my experience has been that it's harder to debug it, too.
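
(For anyone who hasn't run into this: the classic shape is a nested quantifier plus an input that almost matches. A contrived sketch -- Python's re backtracks, so it shows the blow-up; exact timings are engine-dependent:)

    import re, time

    # Nested quantifiers over the same characters: (a+)+ has exponentially
    # many ways to carve up a run of a's once the trailing b forces backtracking.
    pattern = re.compile(r'^(a+)+b$')

    def time_match(n):
        s = 'a' * n + 'c'    # almost matches; the final 'c' guarantees failure
        start = time.time()
        pattern.match(s)
        return time.time() - start

    # time_match(18), time_match(22), time_match(26): the time roughly
    # multiplies with each step, while a character loop would stay linear.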


I agree and disagree:

"If I were to write out an explicit loop over the characters of the string, I would be a lot more sure that I wasn't accidentally dropping characters due to an inadvertent failure to make the regexp exhaustive"

This is why regex comments exist. For any non-trivial regex (more than two or three characters), you should break it down and document it. Otherwise, it's worse than a 1,000-character Perl one-liner.
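
For example, the regexp from the grandparent could be spelled out with re.VERBOSE, one comment per alternative (this is my reading of it, anyway):

    import re

    token = re.compile(r"""
          [^\\$]+    # a run of plain characters (no backslash, no dollar)
        | \\.        # a backslash escape: backslash plus any one character
        | \$\w+      # a variable reference such as $line
        | \\$        # a lone backslash at the very end of the string
    """, re.VERBOSE)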

"it's a lot easier to accidentally write an exponential-time algorithm in a regexp than in a nested loop"

So true. I did not realize it was possible until I made that mistake. Debugging these cases is extremely difficult: for two strings that look similar, the same regex can go crazy on just one of them. But this has happened to me only once in the past three years (the period since I started relying heavily on regular expressions for a project).


> This is why regex comments exist.

Regex comments don't help much with inadvertently writing a non-exhaustive regex (i.e. one for which some possible input could fail to match), or a few other kinds of regexp bugs. Or, how would you write the regexp in the above code with comments so that it would be obvious if you left out the \\$ case?



You haven't lived until you've written a hand-rolled lexer and recursive descent parser.


... or at least, you haven't done second-year CS. Pick one.


Oh, that I could have been so lucky! I was stuck using lex and yacc. I had to work on an actual production parser before I saw the light.


I wish. There are multiple educated juniors at work whom I've hand-held. None of them has a course in compilers/parsing under their belt. Sigh, they probably have lives...


> They also result in much faster, safer, and compact implementations.

If you are using a scripting language, regular expressions are often faster than a hand-written parser, because the regular expression library itself works at a lower level.
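
A rough way to see this for yourself in Python -- the numbers are obviously machine-dependent, and the task (pulling out digit runs) is made up, but it compares the C-backed re module against a pure-Python character loop:

    import re, timeit

    text = 'abc123def4567 ' * 20000
    digit_run = re.compile(r'[0-9]+')

    def with_regex(s):
        return digit_run.findall(s)

    def with_loop(s):
        runs, current = [], []
        for ch in s:
            if '0' <= ch <= '9':
                current.append(ch)
            elif current:
                runs.append(''.join(current))
                current = []
        if current:
            runs.append(''.join(current))
        return runs

    assert with_regex(text) == with_loop(text)
    # timeit.timeit(lambda: with_regex(text), number=10) should come out
    # several times smaller than the same call with with_loop, since
    # findall does its scanning in C.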


I don't understand: "Standard lexing and parsing techniques are so easy to write, so general, and so adaptable there's no reason to use regular expressions".

Isn't using the lex family of generators pretty 'standard' in terms of writing lexers? Aren't they generally considered fast and in most cases better than hand-written code?

Aren't bad regular expressions just a consequence of someone not properly learning regular expressions? Couldn't the same argument be applied to every programming language?

Also: http://swtch.com/~rsc/regexp/regexp1.html


Ten years ago I wrote a lexer myself for my own language, Q, because I didn't know about regular expressions.

It was actually quite powerful. But I literally just implemented a state machine. To be honest, I think there were a couple things it did that a regular expression wouldn't be able to do.

However, for the most part, the "efficiency" argument is off base. You can COMPILE regular expressions into tight code.


> However, for the most part, the "efficiency" argument is off base. You can COMPILE regular expressions into tight code.

Rob Pike has written multiple regular expression engines and at least 4 programming languages. I think he understands the efficiency part pretty well.


> However, for the most part, the "efficiency" argument is off base. You can COMPILE regular expressions into tight code.

When I started programming in OCaml, I was surprised to find no string primitives for things like split, find, etc.; there was only regex. I didn't want to write a regex for a simple find or split, so I wrote a string library for my programs. My string library was consistently 4x faster than the regex equivalent and did exactly what I wanted.


I was in the same boat just recently, except that I knew about regular expressions but was intimidated by them. My first attempt used regular expressions, but I got to a point where I made faster progress rolling my own lexer. I was surprised at how fast simply looping over characters was: about twice as fast as my regular expression implementation.


You don't have to implement a state machine yourself. You can use a lexer generator such as flex.


The lex program and its descendants are generally regarded poorly by Ken and Rob and others from their tribe of Unix systems programmers. The blog post suggests why. It is often faster and easier to write the lexer in C by hand. But they do tend to put great value on yacc. Once you wrap your head around LALR shift-reduce parsing, yacc is expressive in ways that are hard to replicate with a hand-written parser.


Pike's argument is relevant for compiled, systems-level languages (C, C++, Java, etc.).

The regular expression libraries used in scripting languages today (Perl's and PCRE) are optimized C code. A lexer in interpreted code is hardly going to beat them in speed.

Note that a large part of the reason to use scripting languages is the development speed. Regexps FTW, etc.

Edit: I should add -- if I have a problem where I need to parse something complex like a "real" programming language, I go to the LALR libs of course. The right tool for the job. (On second thought -- this is probably what Pike talks about; he doesn't go around solving simple problems, like I do. But no complaints, I've got a job interview tomorrow... :-) )


He has several arguments; performance is only one of them.


>>He has several arguments; performance is only one of them.

First -- I noted that I limited my comment to scripting languages.

I did discuss most of the relevant arguments in the article, so I really don't see what your point is?

It doesn't matter for the scripting languages that a regexp lib is large (it is linked in and used anyway), so I ignored that. I also ignored the Unicode point, since it is generally supported in scripting languages' regexp engines.

I did touch on speed and development speed. Shorter code is also generally easier to understand; a simple regexp can often replace 10-20 lines.

In my edit, 10++ hours before your comment, I discussed when full grammars were a better alternative -- so I handled Pike's accusation of treating regexps as a "panacea for all text processing". (I have used regexps as lexers for grammars, but only for parsers of less than a few hundred lines.)

(I am not going to touch Pike's argument about whether I and others really grok regular expressions, because I thought I did understand them until I read "Mastering Regular Expressions". Maybe there are some more satori insights waiting? I do consider myself a little bit familiar with them from automata theory, from implementing state machines, and from usage [Edit: and my considered opinion is that REs are often a good solution for scripting languages.])



