Syntax extensions and regular expressions for Rust

sixbrx · on April 27, 2014

Very impressive. Especially the native regexs which will fail at compile time if they are invalid and which compile to a custom matcher machine (without having to express them in nasty template syntax). Rust has more macro-chops than I suspected.

pohl · on April 27, 2014

Aside from Servo itself, this is the coolest and most ambitious piece of Rust code I've yet seen.

I haven't had the pleasure of using a language with a good macro facility. What other sorts of applications do macros have, aside from this and type safe println are there? Could one make a json or XML parsing library with this technique? Or a driver for a database? How does one go about acquiring an intuition for when to use or not use them?

burntsushi · on April 27, 2014

Thanks for your kind comments. :)

> How does one go about acquiring an intuition for when to use or not use them?

Truthfully, I don't know. In this particular case, the benefits seem really clear: we get safety and performance increases.

But here's a caveat: the particular reason why this works well for regexes is because the common use case is to write the string literal corresponding to the regex into your program. This is what allows the macro facilities to kick in. For example, if the regex is derived from user input at runtime, then the `regex!` macro can't be used.

i.e., you can't do this:

    let restr = "a*";
    let re = regex!(restr);

The `regex!` macro must accept a string literal (or another macro invocation that produces a string literal).

pohl · on April 27, 2014

I see. So maybe if you had a fixed XSD or XSLT at compile time you could emit a faster but specialized parser or transformer. But maybe that wouldn't work out so well in practice.

I did find an interesting slide deck from the scala universe that mentions something like F# type providers (among other things) as possible applications.

http://scalamacros.org/paperstalks/2014-02-04-WhatAreMacrosG...

dom96 · on April 27, 2014

A JSON or XML parsing library is certainly possible with procedural macros. I haven't had the pleasure of trying out Rust macros but if you are interested in macros in general then I suggest taking a look at the Nimrod programming language (http://nimrod-lang.org) as well as the talk on metaprogramming by the language's author: http://www.infoq.com/presentations/nimrod

The Nimrod standard library has many examples of what macros can do. For example, I have been working on an async macro (now part of the latest Nimrod release) which allows asynchronous function calls to be made in a similar fashion to ordinary blocking function calls. It's very similar to C#'s await. The way it works is that it transforms functions marked as 'async' into an iterator, while at the same time transforming any occurrences of 'await' into a yield.

rca · on April 27, 2014

I think you would ask yourself : can I compute this at compile time? (And is it useful to do so?) I don't really see how you could use that for a database driver, but for example you could create a library that uses macros to generate custom serialization/deserialization code to/from json (or whatever format) for your data structures.

(Edit: after a quick check this use case seems to be already covered by #[deriving(Encodable, Decodable)] in rust.)

lilyball · on April 28, 2014

You could use macros to perform compile-time checking of SQL query strings (and checking of any arguments bound to the query, as long as they're bound immediately instead of being bound later).

rcxdude · on April 27, 2014

I guess a good rule of thumb is 'could I make this safer and faster by writing more repetitive code?'. If the problem can be solved by writing code explicitly but that quickly becomes too verbose (or error-prone), macros are a good technique.

CyberShadow · on April 27, 2014

Would be interesting to see a benchmark between D's ctRegex and Rust's "regex!". IIRC, when ctRegex was announced, it beat all other contestants (including V8's JIT-ed REs) in the author's benchmarks.

There will be a talk at this year's D conference on D's implementation of regular expressions:

http://dconf.org/2014/talks/olshansky.html

jeffdavis · on April 27, 2014

Impressive.

I like how the rust community is offering so many good examples of real utility for the language. Some languages are a little heavy on theory but have trouble connecting with potential users; and other languages are popular but have a lot of problems due to fundamental weaknesses.

Rust, and its community, seems to have a good balance. Good theoretical foundation, but a lot of people always working to show some practical reason to care about the theory.

thristian · on April 27, 2014

The author asks about other languages with "native" regex implementations, and I'm not entirely sure whether PyPy counts. Like regular Python, it compiles regexes into a bytecode-based virtual machine, but like just sprintf-style string formatting and the rest of Python in general, the regex VM is JIT-compiled into native code.

burntsushi · on April 27, 2014

Author here. I don't know enough about PyPy to really answer your question with any certainty, but I have some guesses.

Firstly, it's my understanding that JITing pays a runtime cost. That's not the case with Rust compiled regexes. (Admittedly, this cost is likely negligible.)

Secondly, wouldn't PyPy JIT the general matching algorithm? This seems like a substantively different process than doing code generation specifically for a regex whose structure you know at compile time. (Whether I'm currently taking advantage of any optimizations that PyPy's JIT isn't is of course a different story.)

thristian · on April 27, 2014

PyPy uses a tracing JIT, so it observes the general matching algorithm for a few iterations and then emits code that implements what the general algorithm actually did, i.e. the specific matching steps for that specific regex.

JITting a regex isn't really the same thing as your Rust native regexes; for example, if somebody wrote a regex library with a JIT and linked it to a C program, I woudln't count it. However, in PyPy's case regexes are exactly as native as the rest of the codebase rather than being a nested VM, so in that sense they're 'native'.

gsnedders · on April 27, 2014

Plenty of implementations have JITing regexp implementations. PCRE nowadays has a JIT mode, so any VM embedding PCRE for their regexp has such a thing, likewise all the JS engines have their own custom JIT implementations.

Ygg2 · on April 27, 2014

How does this compare to other regex parsing libraries, performance wise?

charlieflowers · on April 27, 2014

He gives a link to this at the bottom: https://github.com/BurntSushi/regexp/tree/master/benchmark/r...

(You have to scroll down to the code block sections to see the numbers).

Rust beats Go pretty soundly (on this specific benchmark), but C beats Rust pretty soundly.

Why is C so much faster?

burntsushi · on April 27, 2014

Yeah, this is really the only comparison I have so far. It was included with a PR for adding regexes to Rust, so hopefully it will make it into the shootout proper in the next release.

I also have some more specific benchmarks against Go: https://github.com/BurntSushi/regexp/tree/master/benchmark

(For any curious readers out there, the benchmark is multithreaded in Rust[1], with a particularly nice use of futures.)

> Rust beats Go pretty soundly (on this specific benchmark), but C beats Rust pretty soundly.

I believe the dynamic regexes also beat Go on this specific benchmark, but the optimizing compiler is likely to blame there. (And to be clear, Go still beats Rust native regexes in some benchmarks, primarily because it has more clever optimizations. It was written by Russ Cox, after all!)

> Why is C so much faster?

The C program is pretty elaborate, but here are my guesses.

* There are many ways to implement regex matching. For example, RE/C++ uses both a very very quick DFA and a NFA simulation. My library only has the (slower) NFA simulation. I don't know what Tcl uses for an implementation strategy, but whatever it is, it's likely a factor.

* If string replacement is done in place, then it's going to avoid a good chunk of allocation. (In this particular benchmark, it would be perfectly legal to do string replacement in place.)

I'm sure there are other things I'm missing, but these seem like good candidates for a good chunk of the difference.

[1] - https://github.com/BurntSushi/regexp/blob/master/benchmark/r...

innguest · on April 27, 2014

> Why is C so much faster?

Because processors are made with C in mind, as it's the lingua franca, and so they end up being optimized for it.

pcwalton · on April 27, 2014

> Because processors are made with C in mind, as it's the lingua franca, and so they end up being optimized for it.

No, Rust exposes the C memory model pretty directly, and it uses the backend of an industrial-strength C++ compiler (clang/LLVM). The difference here is algorithmic.

(Rust does have some codegen bugs that affect performance, however, though I suspect they won't affect this code.)

Ygg2 · on April 27, 2014

Ok, but why is Python faster than Rust implementation? (And yes, I assume those are C libs behind python).

NOTE: Rust is supposed to have zero cost abstractions. I think this is more to lack of optimization on regexp part, than some fault of language.

dbaupp · on April 28, 2014

The Rust implementation hasn't been extensively microoptimised.

Dewie · on April 27, 2014

Is this implementation close to the theoretical concept of regular expressions? I have read that some implementations of regular expressions are quite a bit more powerful than regular expressions from CS, in that they can recognize languages that are not regular.

burntsushi · on April 27, 2014

Yes. It's a reasonably faithful port of RE2, which elides features like backreferences.

pjscott · on April 28, 2014

This is usually a small price to pay for guaranteed O(n) matching speed. (Kudos to the author for doing the Right Thing here.)

burntsushi · on April 28, 2014

Thanks :-) The implementation is basically the Pike VM as described by Russ Cox. Recursion depth in particular has an upper bound corresponding to the number of instructions in the regex. In practice, this means it's safe to run a regex on untrusted data.

(Creating a regex from untrusted data still needs a bit of work, but is fixable!)