Learn Regex The Hard Way

james2vegas · on Feb 25, 2012

This being zedshaw, no surprise at the lack of Perl in this book, but what is funny is at http://regex.learncodethehardway.org/book/learn-regex-the-ha... "Imagine if you could write your regular expressions like this: " I don't have to imagine since Perl's regular expression engine has the x modifier to do just that.

zedshaw · on Feb 25, 2012

I tried to keep it generic, so most of the regex in the book will work in Perl, Ruby, Python, and the libraries since they all originate from Perl's ideas of PCRE. The choice of using Python was mostly because people had read my other book and probably already had Python.

Also lots of engines have the verbose form, problem is there's been too many Perl hackers writing those god awful huge regex so everyone thinks dense and succinct is the only way to write regex.

masklinn · on Feb 25, 2012

Python has the VERBOSE (re.X/re.VERBOSE) modifier as well, it works nicely with raw triple-quoted strings. And I'm guessing it also exists in Ruby.
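
For readers who have not seen it, here is a minimal sketch of the verbose form masklinn mentions, using Python's `re.VERBOSE` with a raw triple-quoted string. The pattern itself (a simple signed decimal number) and the group names are made up for illustration and are not taken from the book.

```python
import re

# Hypothetical pattern for illustration: a signed decimal number.
# Under re.VERBOSE, whitespace in the pattern is ignored and "#" starts a comment.
NUMBER = re.compile(r"""
    ^\s*
    (?P<sign>[-+])?        # optional sign
    (?P<int>\d+)           # integer part
    (?:\.(?P<frac>\d+))?   # optional fractional part
    \s*$
""", re.VERBOSE)

m = NUMBER.match("  -12.50 ")
print(m.group("sign"), m.group("int"), m.group("frac"))
```

The same pattern on one line would be `^\s*([-+])?(\d+)(?:\.(\d+))?\s*$`, which is the dense style being complained about above.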

urbanautomaton · on Feb 25, 2012

Perhaps a footnote could mention that this style really is possible in modern regexp implementations? I see chapter 20 is inked in to cover verbose regexps with comments (which is great!), but the current text could give the impression that it's a purely imaginary feature.

lninyo · on Feb 25, 2012

I think some examples of actual strings that the RegEx (in section 0.2) actually does match would be quite instructive.

Also, I think word problems develop an essential skill, namely mapping new problems (described usually in natural language) to mathematical concepts using the symbols. Problems come first, the neat symbolic form comes way after and generally the main problem is mapping a vague problem description to a concrete description using the symbolic language provided by mathematical notation.

bermanoid · on Feb 25, 2012

It's funny...when I read that bit, all I could think was, how much nicer would it be if regexes could just be written without the special characters, using sane word-based self-documenting tokens the way we do the rest of our programming, following scoping and quotation rules that actually mesh with the languages we're using them in?

Then it occurred to me that plenty of people have probably written libraries to do exactly that, but nobody uses them because we all already have regexes built-in almost everywhere that we want to use them. Hell, I've never even looked for one, even though I choke back a little vomit every time I introduce a new regex into a codebase because of how much future debugging pain I know they can cause (all but the shortest ones force what's essentially a full context-shift in order to parse, and in reality what usually happens is people scan a regex as one chunk and say "Eh, a regex, it's probably right, hopefully my bug is somewhere else..." until they have some concrete reason to think otherwise).

Sort of a shame, really, that such a problematically condensed syntax won the prize so early on, and now even those of us that hate it are too comfortable with it to look for something better.

zedshaw · on Feb 25, 2012

Regex aren't difficult to learn, it's just nobody teaches them as a language with a base syntax and words to use. If you just sit down and memorize the names of a few symbols, then learn what each does, then it becomes fairly clear.

It's my belief (totally unfounded) that learning a simple symbolic language like regular expressions teaches you how to handle other symbolic languages like mathematics, chemistry, and programming. That's one of the reasons I'm teaching it and trying to get other people to use it.

More importantly though, they are damn handy. As long as you don't abuse them in places where a lexer+parser is better, you can get a lot done with very little regex in very short time.

jacobolus · on Feb 25, 2012

Unfortunately, the actual syntax of regexps is far from ideal. As an example, it’s completely stupid that non-capturing groups must be written as (?:…) when they are by far the common case.

Larry Wall’s writings on this general topic are fairly convincing. http://www.perl.com/pub/2002/06/04/apo5.html?page=2

zedshaw · on Feb 25, 2012

I don't consider that a failing of regular expressions (which predate perl), but a failure of implementation. I've thought that instead of syntax it should be an API option that says "this is a matcher" vs. "this is a capture". Then the same regex works for both, it's just how you run it.

eurleif · on Feb 26, 2012

So would it not be possible to mix capturing and non-capturing groups in the same regex? That's useful to do if you have code that expects specific things to be captured at specific group indices, and you need to add grouping somewhere else in the regex without messing it up.
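
A small sketch of the situation eurleif describes, in Python. The date pattern and the assumption that calling code reads the year from group 1 and the month from group 2 are hypothetical.

```python
import re

# Hypothetical: existing code expects group 1 = year, group 2 = month.
before = re.compile(r"(\d{4})-(\d{2})")

# Adding grouping with (?:...) for an alternation leaves those indices alone.
after = re.compile(r"(?:date|on):\s*(\d{4})-(\d{2})")

# The same grouping written as a capturing group shifts every index by one.
shifted = re.compile(r"(date|on):\s*(\d{4})-(\d{2})")

text = "date: 2012-02"
print(before.search(text).groups())
print(after.search(text).groups())
print(shifted.search(text).groups())
```

With a single "matcher vs. capture" switch on the whole regex, the second and third patterns could not be told apart, which seems to be the point of the question.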

gghh · on Feb 25, 2012

::Regex aren't difficult to learn, it's just nobody teaches them as a language with a base syntax and words to use::

I couldn't agree more. Because of the lack of a "language" approach to them, their weird syntax, and the fact that the "verbose" mode for their definition is almost unknown, they come out as a sort of voodoo that only gurus can handle. Moreover, this results in tons of broken code in production. They're simple, beautiful and handy, but have an unfortunate historical load.

ufo · on Feb 26, 2012

What I miss the most about regexes (and I think is kind of what bermanoid was hinting at) is that we don't have access to much of the expressiveness we usually have available in a programming language. For example, I never saw a widely used regex library that takes advantage of the algebraic structure of regular expressions and that would let me do things like incrementally building regexes or creating named constants:

    var regex1 = /some_regex/,
        regex2 = /other_regex/;

    var regex3 = alternative(regex1, regex2);
    var regex4 = kleene_star( sequence(regex1, regex2) );
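
A rough sketch of how such combinators could be layered on top of an ordinary regex library, here Python's `re`. The function names follow ufo's example; the `DIGITS` and `WORD` constants are hypothetical. It works on pattern strings and wraps every piece in a non-capturing group, so it is only an illustration: it assumes the pieces contain no numbered backreferences or inline flags.

```python
import re

def sequence(*parts: str) -> str:
    """Concatenate sub-patterns, isolating each in a non-capturing group."""
    return "".join(f"(?:{p})" for p in parts)

def alternative(*parts: str) -> str:
    """Match any one of the sub-patterns."""
    return "(?:" + "|".join(f"(?:{p})" for p in parts) + ")"

def kleene_star(part: str) -> str:
    """Match zero or more repetitions of the sub-pattern."""
    return f"(?:{part})*"

# Hypothetical named constants.
DIGITS = r"[0-9]+"
WORD = r"[a-z]+"

regex3 = alternative(DIGITS, WORD)
regex4 = kleene_star(sequence(DIGITS, WORD))

print(bool(re.fullmatch(regex3, "abc")))
print(bool(re.fullmatch(regex4, "12ab34cd")))
print(bool(re.fullmatch(regex4, "12ab34")))
```

Wrapping each piece in `(?:...)` is what keeps precedence from leaking between pieces when they are combined.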

riffraff · on Feb 25, 2012

you would _love_ perl6 rules.

dmnd · on Feb 25, 2012

Ruby and Python (the two languages used at the end of this book) both have expanded mode regexes too.

bstar77 · on Feb 25, 2012

This is the "Learn X the Hard Way" book I've really been waiting for. Thanks, Zed, for the time and effort on this one. It's a topic that schools seem to completely ignore for some reason, but a skill I find I use on nearly a daily basis.

redslazer · on Feb 25, 2012

It's not done yet; it's in draft stage and has been available for a while. I'm waiting for it to come out fully as a PDF so I can throw my money at him.

mkramlich · on Feb 25, 2012

I'm pretty sure there have been plenty of books and documentation on regexps at least as far back as the 90's. I distinctly remember O'Reilly books on the topic, that far back, to be specific. For a certain generation at least, we haven't been waiting to be able to learn them. It's out there. Though having additional angles of approach is not bad.

zedshaw · on Feb 25, 2012

Mastering Regular Expressions is the book you're thinking of.

goodweeds · on Feb 25, 2012

I never liked the O'Reilly approach. At one point I had read 20 of their books and never felt like I learned anything. Some people can learn from references, but I really prefer Zed's "hard way" method, which feels a lot like the Programmed Learning methodology (http://en.wikipedia.org/wiki/Programmed_learning).

mkramlich · on Feb 26, 2012

Understood. Technically there has always been a harder way available to all of us. You acquire a book or some other documentation. Then you sit down at a keyboard and write the bare minimum code/syntax to test if you understood how some things work, or just play around with it. The terminals on Mac, Windows, Linux have allowed us to do this for decades, as far back as my own childhood, to give one example. So yeah, I think the "hard way" is good, and iterative feedback and experiment is good, and that's been available for a long time even without Zed's things. I'm still learning new things even in the last few days, using this approach, except without some third-party playbook I have to follow. Learn Redis the hard way? Download it. Install it. Start it. Enter client. Type things. See what happens. Repeat. This is fairly obvious.

Agreed O'Reilly is probably more famous for their "completeness" rather than effectiveness at teaching.

alexjgough · on Feb 25, 2012

Here one should not forget the past, and look to Jeffrey Friedl's excellent "Mastering Regular Expressions".

theneb · on Feb 25, 2012

It's actually a good complement to "Mastering Regular Expressions", as Python isn't covered in that book.

gghh · on Feb 25, 2012

I am looking forward to item 24 "debugging Regex", and I hope it will cover Perl's (use re 'debug') or Python's re.DEBUG flag. I still haven't found a satisfactory document on the web on those features. I'd also like to see something on the eternal diatribe "regexp VS regular grammars", i.e. how the former substantially differ from the latter. But maybe that's not the right book, having the (sane) "get it done" approach.
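
As a pointer for the Python side, a minimal sketch of the `re.DEBUG` flag. Passing it to `re.compile` prints a dump of the parsed pattern to standard output at compile time. The pattern is the one discussed elsewhere in this thread; the dump format varies between Python versions, so none is reproduced here.

```python
import re

# Compiling with re.DEBUG prints the parsed structure of the pattern
# (branches, repeats, anchors) to stdout as a side effect.
pattern = re.compile(r"^[0-9]+|[a-z]+$", re.DEBUG)

# The compiled object is otherwise an ordinary pattern.
print(pattern.search("abc") is not None)
```

Reading the dump is one way to see how the engine grouped an alternation without guessing at precedence.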

leeoniya · on Feb 25, 2012

"Exercise 11: This Or That" sadly still suffers from using examples which necessitate mentioning caveats I pointed out months ago.

https://news.ycombinator.com/item?id=3297996

Even considering the fact that alteration is introduced after the needed concepts of NL and EOL assertions in "Exercise 7: The Beginning And End"

no me gusta :(

zedshaw · on Feb 25, 2012

I already told you that your reason it was busted is wrong. It's not because of NL or EOL, it's because of the order of precedence of the | operator. If you're going to correct me, be right. In fact, someone also posted a solution in the comments, which you also didn't do.

leeoniya · on Feb 25, 2012

i did not say that you were wrong. you are correct. it is the order of precedence that causes the NL/EOL to be treated as part of the alteration.

you are wrong, however, not to mention this alongside examples that specifically display this in an unrelated context, as this is something that is far from obvious. you can

1. easily use different examples or

2. add one line of text that clears this up

on the other hand, the title of the book is to learn regex the hard way, so i'm quite possibly the one who is wrong here. i am not trying to troll, trust me, i respect the work and knowledge you have put into this. but as someone who has been confused by this in the past during my own learning, i see no reason why you think i'm an exception and continue to argue against any clarification or change.

zedshaw · on Feb 25, 2012

No, you're wrong still. NL/EOL doesn't factor into it at all. You also are one of those people who thinks the following:

1. "I have been bit in the past by not knowing an obscure fact."

2. "To protect myself, I must learn every obscure fact to prevent this from happening again."

3. "This here has a missing obscure fact and is therefore going to hurt me and everyone around me."

This attitude that you have to teach someone everything about something right away is the reason most educational tech books suck. You do not need to teach someone everything right away. You don't even need to teach them everything as long as what you've taught is the foundational elements and those are correct.

This attitude is also hyperbolic. The book is not going to destroy the world because you have a problem with one small portion of one exercise that you can't even fix yourself.

Finally, you keep saying these things, and you keep asserting you're correct, but I don't see a solution from you. It's in a git repository:

http://gitorious.org/learn-regex-the-hard-way/

So put your money where your mouth is and send me a patch. If it's soooooo easy to fix and explain then prove me wrong.

Until you offer up your supposedly superior world saving solution I have to assume you're just wrong but can't admit it.

leeoniya · on Feb 25, 2012

i'm not sure why you insist i have no solution or cannot admit that i am wrong, when i have outlined exactly what the solution is.

i would be more than happy to send a patch if i wasn't 100% sure that it would be rejected or ignored on the stated grounds. these convos along with the only 2 outstanding, uncommented, unmerged, 3.5-month-old merge requests have not exactly instilled a great vote of confidence that you are open to taking people's code or advice.

i will leave you with this - after reading through your alteration section (and perhaps the whole book), beginning regex students will be unable to correctly answer the following question (using your own example) which appears not to use any concepts not previously discussed:

Circle all strings below which will be matched by ^[0-9]+|[a-z]+$

a. abc

b. 123

c. abc123

d. 123abc

e. abc_45&{`!123

f. 123_xk&{`!abc

g. _xk&{`!abc

h. 123_xk&{`!

if you believe this is some edge-case gotcha and as a teacher, you're okay with this, that's fine with me, i've spent an order of magnitude more energy than i should have trying to help.
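
For anyone who wants to check the quiz above mechanically, here is a small Python sketch. It assumes unanchored searching (`re.search`); with `re.match` the results would differ. The `GROUPED` pattern is added only for comparison and is not from the book.

```python
import re

# "|" has the lowest precedence, so this reads as (^[0-9]+) OR ([a-z]+$).
PATTERN = re.compile(r"^[0-9]+|[a-z]+$")

# For comparison: grouping the alternation so both anchors apply to both arms.
GROUPED = re.compile(r"^(?:[0-9]+|[a-z]+)$")

candidates = {
    "a": "abc",
    "b": "123",
    "c": "abc123",
    "d": "123abc",
    "e": "abc_45&{`!123",
    "f": "123_xk&{`!abc",
    "g": "_xk&{`!abc",
    "h": "123_xk&{`!",
}

matched = [key for key, s in candidates.items() if PATTERN.search(s)]
grouped = [key for key, s in candidates.items() if GROUPED.search(s)]

print(matched)  # ['a', 'b', 'd', 'f', 'g', 'h']
print(grouped)  # ['a', 'b']
```

The ungrouped pattern accepts any string that starts with digits or ends with lowercase letters, which is why only c and e are rejected.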

lninyo · on Feb 25, 2012

Just out of sheer curiosity, what regexp tutorial would YOU recommend?

leeoniya · on Feb 26, 2012

it depends on where you're coming from. if you've never touched programming, regular expressions is probably not a good place to start in general. otherwise this one is top-notch:

http://www.regular-expressions.info/tutorialcnt.html

jongraehl · on Feb 26, 2012

When you guys say "NL" you mean "beginning of line", right?

leeoniya · on Feb 26, 2012

jongraehl · on Feb 26, 2012

"alternation"

leeoniya · on Feb 26, 2012

hehe, yeah, oops. thx.

riffraff · on Feb 25, 2012

I'm curious to see what ends up in "21: Extensions To Avoid".

By the way, from a quick glance, I believe there is no chapter devoted to discuss how the regexp engines work (which is ok) but will there be a section on "when things explode in your face due to exponential behaviour"?

I remember when I first noticed a regex I wrote had a behaviour like that and it was kind of enlightening, because earlier I had always assumed all the talk about these things was rather academic :)
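
A minimal sketch of the kind of blow-up riffraff is describing, using a textbook nested-quantifier pattern rather than anything from the book. On a backtracking engine such as CPython's `re`, the time for a failed match is expected to grow roughly exponentially with the input length; the sizes are kept small so the script finishes quickly, and actual timings depend on machine and Python version.

```python
import re
import time

# Nested quantifiers: many ways to split a run of "a" between the inner and outer "+".
EVIL = re.compile(r"(a+)+$")

def time_failed_match(n: int) -> float:
    """Return seconds spent failing to match n 'a' characters followed by 'b'."""
    text = "a" * n + "b"  # the trailing "b" guarantees the match fails
    start = time.perf_counter()
    result = EVIL.match(text)
    elapsed = time.perf_counter() - start
    assert result is None
    return elapsed

for n in (12, 14, 16, 18):
    print(n, f"{time_failed_match(n):.4f}s")
```

Rewriting the pattern as `a+$` accepts the same strings without the nested repetition, so the engine has far fewer ways to retry a failed match.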

zedshaw · on Feb 25, 2012

Yeah, so I thought about making the 2nd half a walk through building a regex engine by first building a lexer, then a parser, then the engine, but I wasn't sure if that'd work to teach regular expression. I think it'd be a great way to really understand how they work (it's how I figured them out), but I'm not sure if other folks would get the same understanding out of it.
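
To give a flavour of the "engine" stage only, here is a tiny backtracking matcher in Python, in the style of the well-known small matcher by Rob Pike. It skips the lexer and parser steps entirely and interprets the pattern string directly, supporting only literals, `.`, `*`, `^` and `$`. It is a sketch for illustration, not the design zedshaw has in mind.

```python
def match(regex: str, text: str) -> bool:
    """Search for regex anywhere in text."""
    if regex.startswith("^"):
        return match_here(regex[1:], text)
    # Try every starting position, including the empty tail.
    return any(match_here(regex, text[i:]) for i in range(len(text) + 1))

def match_here(regex: str, text: str) -> bool:
    """Match regex at the beginning of text."""
    if not regex:
        return True
    if len(regex) >= 2 and regex[1] == "*":
        return match_star(regex[0], regex[2:], text)
    if regex == "$":
        return text == ""
    if text and regex[0] in (".", text[0]):
        return match_here(regex[1:], text[1:])
    return False

def match_star(c: str, regex: str, text: str) -> bool:
    """Match c* followed by regex: try zero repetitions first, then one more each time."""
    i = 0
    while True:
        if match_here(regex, text[i:]):
            return True
        if i < len(text) and c in (".", text[i]):
            i += 1
        else:
            return False

print(match("a*b", "aaab"))
print(match("^ab$", "abc"))
```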

tsliwkan · on Feb 25, 2012

I think this would be awesome. The mini database and the simple object system were really interesting exercises in Learn C the Hard Way and this would be another great exercise/project for one of your books.

cschep · on Feb 26, 2012

Along these lines are there any plans for "learn compilers the hard way" ? Missed that big time in school.

jimmytucson · on Feb 25, 2012

Thanks for this. I learned Perl about 1.5 years ago and recently took up learning Python. The one thing I find less intuitive about Python than Perl is its syntax around regular expressions.

jackfoxy · on Feb 25, 2012

Zed, Do you have a timeline in mind for filling out the rest of the exercises in the alpha release? Do you call it a beta at that point? I think you're doing great work, btw. Thanks.