Learn Regex The Hard Way

jdthomas · on Dec 1, 2011

Not to sound like a pretensions CS guy ... but how can you have a book on regular expressions without even mentioning DFA/NFA? Maybe its just me, but (back in my day) I found learning about finite automata made regex just click.

Edit: ahh, I see now mentioned in intro that "I'm going to be very practical and straight forward about it. No NFA to DFA conversion. No crazy explanations of push down finite state automata." :) I still argue that thinking of regex as state machines is a very useful tool.

mekoka · on Dec 1, 2011

I'm not a CS graduate. I first learned of state machine from an article posted here on HN a few months ago.

I learned regular expression from the regular-expression.info site and from the first 4 chapters of Jeffrey Friedl's book over 5 years ago. I don't recall an introduction to NFA/DFA from them either. I'm pretty comfortable with regex and use them regularly (no pun intended) and so far, I don't feel that I've missed much.

I'm not disagreeing with your view, but I think it's worth pointing out from someone having developed an understanding of regex from a different perspective, that NFA/DFA might not be such a strong prerequisite and people should not feel like they're going in at a disadvantage.

bad_user · on Dec 1, 2011

An understanding of finite-state machines is useful, because that way you'll understand the limitations you have when using regular expressions, which have been grossly misused.

It also helps you write optimal regular expressions - for instance if your regular expression clearly describes a DFA, then it will probably perform well. Otherwise in many cases you can optimize it - just as you can turn any NFA into a DFA.

It isn't so clear and cut these days however, as the regexp capabilities provided by modern libraries exceed the capabilities of finite state machines - for instance the engine in Perl 5.10 has support for recursion, which means a regular expression in Perl 5.10 can describe a Pushdown Automaton (PDA) instead, so things get way more complicated.

Either way, I suggest you do some reading on Automata Theory, because it's fun and not that hard.

spp · on Dec 1, 2011

Chapter 4 of Jeffrey Friedl's "Mastering Regular Expressions" goes in depth about NFA/POSIX NFA/DFA. It even has car analogies!

mekoka · on Dec 2, 2011

It must have been the 3 first chapters then. At the time, they were enough for me to understand and "get going" with regex. Various other online resources have filled little gaps here and there over time. This thread makes me want to revisit the book and resume my reading of the next 3 chapters, a long overdue promise to myself. I'm glad that the book will cover the topic of FA then.

zedshaw · on Dec 1, 2011

Yep, I'm going at it from another direction where I teach the symbols like a language, and then it makes it easier to explain how they work later. I agree that state machines are a great way to understand them, but I don't think they're practical for using them and they just complicate getting capable in it.

furyg3 · on Dec 1, 2011

Just wanted to chime in, thanks a lot for making the guide with this approach. Sysadmins, specifically, will find this to be a great resource.

Many people start out tweaking config files, editing other's scripts, manipulating logs and output data, and then eventually sit down and say "Damn, it's about time I learned how to really write good [bash/perl/python/regex]". I find the theory behind it interesting, but it's a bit of a distraction for many.

burke · on Dec 1, 2011

I'm a CS graduate, and I read the Friedl book on regular expressions in high school. I don't seem to remember learning anything about automata from it, and I know for certain that my automata course in university didn't make me any more competent at understanding or applying regular expressions.

irahul · on Dec 1, 2011

Personally I don't find NFA/DFA helpful with learning regular expressions.

To a newcomer, even a naive implementation of NFA/DFA doesn't make sense. Consider this small gist https://gist.github.com/1414061

I have a feeling a newcomer will come out none the wiser even if he were shown this. And this is just a naive implementation - no nfa to dfa conversions, inefficient implementation etc.

I would root for state diagrams though. Generally diagrams help people see patterns mere text doesn't.

marchdown · on Dec 1, 2011

I know the theory pretty well, but writing actual regexes always felt so slow and frustrating that I really welcome this kind of rote hand-holding tutorial. I might finally learn the arbitrary DSL that everyone seems to have settled on.

kd0amg · on Dec 1, 2011

I imagine that sort of implementation issue is only relevant to the typical user if that user comes across a case where the potential inefficiency of NFA->DFA conversion (or NFA simulation) actually does happen. I'm not sure how common it is to encounter this.

I'd be more concerned about making sure there is some discussion of nonregular languages; this appears to be in the works (I'm curious to see what exactly it turns out to be).

dfc · on Dec 1, 2011

Do not worry, in order to sound "pretensions" you need to be able to spell pretentious correctly;)

leeoniya · on Dec 1, 2011

i learned regex inside-out before ever learning or implementing any sort of FSM - just sayin'.

scrame · on Dec 1, 2011

Thats pretty awesome. Regex's the tool (vs. the concept of "regular expressions" that comes up in interpreters/compilers) is a supremely useful utility when used sparingly, and surprisingly portable between languages.

Regex's are in a (not-)sweet spot of practical languages that aren't taught in school. For all the hype of ruby, et al. being able to make "DSL's", its surprising how mystical an _actual_ DSL, that is portable between languages (through PCRE) is held.

Unfortunately, like many other tools, can be used sparingly after a lot of practice writing crappy regex's, and having to eat your own dogfood (so you learn its not an "ultra-fast parser", or that you shouldn't chain substitutions on user-generated input).

A rote tutorial -- in the style of LPTHW -- is excellent.

itmag · on Dec 1, 2011

What are some other examples of DSL's that are useful in the same way as regexes?

I am thinking LINQ might qualify. Also, JQuery is strictly not a DSL, but it kinda feels DSLy.

lmkg · on Dec 1, 2011

Personally, I classify any sort of sequence comprehension as a DSL, including LINQ and python's list and generator comprehensions. JQuery is basically as close to a LINQ-like DSL as you can get in a language like JavaScript that doesn't have metaprogramming support. Related to sequence comprehensions, there is Common Lisp's controversial LOOP macro, which is basically an iteration comprehension DSL.

I'm somewhat on the fence about whether format control strings qualify as DSL's. They don't as much, but they tend to have very complicated and specialized syntax.

I feel that Monads are a DSL in Haskell, but whenever I see "Monads in Blub!" I don't feel the same way about them. Part of it is probably that Haskell provides not only infix operators, but do-notation for supporting monads. Part of it is also that monads don't do as much in other languages, because their type systems aren't rich enough to fully express them, nor restrictive enough that their capabilities are useful.

itmag · on Dec 1, 2011

A DSL doesn't have to be Turing-complete IIRC so format strings could probably qualify.

delwin · on Dec 1, 2011

What do you guys think of Zed Shaw's teaching style? As I'm not a beginner programmer, I can't really evaluate correctly — to me it seems like a odd approach to programming pedagogy, but perhaps it works. Any non-programmers want to shed light on why his style works?

joelhooks · on Dec 1, 2011

My 13 year old son is working through Learn Python the Hard way. He likes it a lot and is building things that interest him. He will spend the whole day fiddling with Python working on a text based choose your own adventure game that one of the chapters had him build.

He's approached programming a couple of times before, but LPTHW is the first time he's really stuck with it.

In that regard, I think Zed is doing an awesome job. I'm looking forward to the REGEX book and want to work through the C book as well.

zedshaw · on Dec 1, 2011

That's awesome. It's funny because I didn't even try to write the book for kids, but happy it seems to work well for them.

mkopinsky · on Dec 2, 2011

Kids tend to be, in general, kinda like adults in a lot of ways. Whodathunk? :-)

capkutay · on Dec 1, 2011

I went through LPTHW without losing motivation, wanting to quit, or being too confused to go on. I'd say this is what really sets it apart from other programming books. Also, very functional. You know what you're doing every step of the way and he tries to speak to you like a person while helping you graspt the material. There were only a few instances doing the extra credit where I really felt stuck. His books force you to write a lot of code. After 1/2 way, I'd genuinely be excited to go to the next chapter. In terms of simulating your own personal CS tutor, LPTHW was awesome.

chrisguitarguy · on Dec 1, 2011

It's just like teaching anything: nothing is really fun until you can do something.

I teach guitar, so let me use that as an example: it's my job as a teacher to get students to love the guitar so they (1) continue with it, and (2) come to the next lesson. the only way I can make sure students get excited is if they can do some thing. So I could bog them down with a bunch of theory and stuff, or explain the mechanics of hand movements in very technical detail.

But those things really don't matter. What does matter is reducing the essence of a few things into a simple exercise that the student can do and that sounds good. If you took a guitar lesson from me, you'd probably leave the first lesson able to strum a few chords in time. Of course, you just learned a ton of stuff: how to use your fingers on the fretboard, how to keep the beat, how to strum, posture, hand positions -- the list goes on.

All you care about, as the student, is that you can do something. You feel successful. You have something to practice. As you progress, you become more interested in the finer details (or you quit) and we go deeper -- explaining music theory, patterns, technique and mechanics, etc.

Zed teaches programming the same way. Give the student something to do where they can see immediate results. The ones that stick with it are going to be naturally curious about how things work and will keep going further.

I used "Learn Python the Hard Way" as my introduction to programming. It was awesome.

TWAndrews · on Dec 1, 2011

Learn Guitar the Hard Way would be an awesome addition to the mix.

nikcub · on Dec 1, 2011

as a Python dev I scanned the python book and I thought it dwelled too much on string formatting, focused on teaching syntax and not programming, and the language was a bit terse.

but then a friend of mine told me that he and a few other people he knew learnt python from that book, so either later revisions got better or my opinion is shit.

zedshaw · on Dec 1, 2011

I don't think your opinion is shit, I just think your perspective has changed because you actually can code. The hardest thing for a beginner is just the syntax. If you've never written code before then the symbols in programming are alien as hell. By focusing on seemingly repetitive syntax drills I get them skilled progressively until syntax isn't a problem.

Another thing you might be missing is under each exercise is a real piece of programming they're learning. For example, with string formatting I'm actually teaching the concept of named variables. Later on I use the fact they've actually been using named variables to then teach the concept explicitly and then they get it quicker.

YuriNiyazov · on Dec 1, 2011

I agree that syntax is the hardest thing. Based on my experience tutoring, I would add the following (which was surprising to discover at first): the second hardest thing is the order of evaluation. I've seen students write down

    a = do_something1(do_something2())

and be unable to explain that do_something2() actually happens before do_something1(), and to see that the code is loosely equivalent to

    b = do_something2() 
    a = do_something1(b)

Locke1689 · on Dec 1, 2011

The worst is when you're teaching C and you have to explain that

    void * a = do_something1(do_something2(), do_something3());

is not equivalent to

    int b = do_something2();
    int c = do_something3();
    void * a = do_something1(b, c);

YuriNiyazov · on Dec 1, 2011

Because the order of evaluation of arguments is not defined?

drmohundro · on Dec 1, 2011

I'm a CS grad, but I had zero knowledge of programming when I entered college - I knew some basics about HTML and I could fix some basic computer issues but that is it. I remember trying to open an executable in Notepad and thinking to myself, "uh oh, this is going to be really hard."

I was still struggling with syntax issues into my second year of school. It's not a good feeling to be debugging null pointer issues and data structures when you're not confident with the syntax.

VonLipwig · on Dec 1, 2011

I like Zed Shaw's style. I feel it really helps things stick with the way it is taught. I also feel the 'go find out yourself' methodology sets you up well going forward as you learn to depend on yourself to solve problems instead of the author of the book.

_kotv · on Dec 1, 2011

My brother is working through Learn Ruby The Hard Way, which is not written by Zed, but is based on his Python book. He has no programming experience and seems to be enjoying the work and learning Ruby pretty quickly.

ramkalari · on Dec 1, 2011

I started doing Ruby yesterday (have done about 8 exercises) and it's been fun so far. The approach is new and it seems very well thought out.

xatax · on Dec 1, 2011

I'm about 75% through his Python book, and I have to say I like it. He assumes you're an absolute beginner, and I'm not, so I can't really give it a fair appraisal from the perspective of its target audience.

That said, the book definitely makes learning fun. For the first half of the book, you have a really quick learn-reward cycle. The exercises are short and to the point, and he makes you type in the lesson and run it. This means that you type a few lines, then immediately get to see what it did, on your computer. It very much reminds me of when I was in grade 10 and we were doing simple stuff in QBasic. It would definitely grab and keep the interest of an absolute beginner dipping his or her toes into the programming pool for the first time.

From the perspective of an experienced programmer - a term I apply to myself with some trepidation - things move a bit slow, but the experienced programmer is definitely not Zed's target with the book. Even being on the slow side, the exercises are quick enough that you can knock out a bunch of lessons quickly if you're so inclined, and I've definitely skipped over a bunch of the "Extra Credit" stuff because of that.

I actually did spend the "recommended" week doing Exercise 37, which is sort of a "learn at your own pace" exercise. You're given a list of operators, string formats, keywords, etc. and asked to spend some time looking into them, defining them for yourself, and playing with them. I spent a good amount of time reading into the way Python's floor/integer division and modulo operations work / differ from other languages and why (if you're interested, Python floors towards negative infinity, not zero, and performing a modulo on negative numbers takes the sign of the divisor, not the dividend, in contrast to some other languages), and spent time toying around with decorators and lambdas.. I kept a spreadsheet of everything I was supposed to learn and my description of it. My interpretation of that lesson is probably a lot different from a beginners, but I think it's a nice example of "you get out of it what you put in."

Zed also likes to shove you into the deep end and let you learn to swim. The extra credit assignments don't have answers, they're all about making you play around and learn on your own terms. The lesson above is a good example of "here's a bunch of terms, now go play on your own for a while", which was an interesting change of pace. Also, he gives a solid emphasis on reading other peoples code. There have been at least two lessons so far dealing with that, and I've read over a bunch of reddits code and learned a good amount from that - this was during the first "go find some code" exercise, where you only know the bare minimum of conditionals and looping, so there were a lot more scary "i don't know what that code is doing, but I'm going to find out!" moments, which I found helpful.

zedshaw · on Dec 1, 2011

> It very much reminds me of when I was in grade 10 and we were doing simple stuff in QBasic.

That's pretty much the feel I was going for when I wrote it.

xatax · on Dec 1, 2011

It really shows. A friend of mine asked me about it, and my abbreviated review was just "it makes programming fun again."

I hesitated to use that in my earlier comment because some might use it to paint a bad picture of me, contrasting it with the idea that all good programmers love their craft and are always writing code because it's always fun for them. I just think that as you get into meatier works and start architecting larger applications, there's always going to be some hair pulling and you're going to fight wars with your compiler/interpreter/whatever. LPTHW isn't like that. You just sort of cut through the bullshit and get on with the fun stuff, and it makes the learning process easier.

So thanks. I look forward to reading your other works as they finish.

plytheman · on Dec 1, 2011

Speaking as someone who isn't a programmer at all and needed (still needs, actually) to learn a language for dealing with a massive amount of data for a school project, I decided to have a stab at Python and have been working my way through his book. Overall I'd say it's pretty great. The good is that, as you mentioned, the early chapters (really the whole first half) are quick, repetitive exercises which I feel gave me a good understanding of the dirt basics of how Python works. Now I'm getting into the more tricky chapters and still really enjoying them.

I sometimes feel a bit more explanation from him on how or why certain things work would be nice, or maybe a few hints for where to start looking for answers in the extra credit would be good, but for the most part I think he's effectively covered enough of the basics needed to land me on the ground running. Between the exercises I've done so far and the basic script someone at reddit was kind enough to throw together for me I've managed to modify and make my own scripts to start processing the data for this project. Also it feels pretty awesome to write a script for an hour then actually see it work in less than a second and do what would have taken me days otherwise to do.

xatax · on Dec 1, 2011

> I sometimes feel a bit more explanation from him on how or why certain things work would be nice, or maybe a few hints for where to start looking for answers in the extra credit would be good, but for the most part I think he's effectively covered enough of the basics needed to land me on the ground running.

There's the Python documentation[1], but coming from working mostly with PHP, I've had a hard time adjusting to them. PHPs documentation seems a lot more straightforward, so I've had to lean on StackOverflow[2] a lot more when I run into roadblocks (again, almost entirely as part of the "long list of things to research" exercise).

[1] http://docs.python.org

[2] http://stackoverflow.com

mattwilsoninc · on Dec 1, 2011

I'm on the first legs of the Learn Python the hard way. I like the examples and explanation. Doing it is really the best way to learn. The great aspect about the book/course is that Zed provides examples of pitfalls that come with experience, so you learn the language at a faster rate.

mturmon · on Dec 1, 2011

The regex shell is a neat idea.

In many situations I'd like to generate strings accepted by the regex, to check that my regex is tight enough.

I know this is hard in general (for lots of reasons, basically all reducing to "there are a lot of strings and no easy way to iterate through them in a satisfying way").

But, has anyone made any progress on this for the "easy" cases?

zedshaw · on Dec 1, 2011

I thought about trying but it's hard to dip into the regex data structure (if there is one) and also display it in some sort of meaningful way.

Other than that, Regetron is kind of just enough of a tool to get you through the book and have no friction. I think for more advanced regex exploration there's http://kodos.sourceforge.net/ and I believe there's a few javascript ones in the browser and even a few very advanced ones in Java.

tbrooks · on Dec 1, 2011

I'm sure this has made the rounds before: http://rubular.com/

scorpion032 · on Dec 1, 2011

Your wish has been granted: http://txt2re.com/

I use this all the time. No wonder I don't know to hand write a Regex.

haraball · on Dec 1, 2011

Here's another tool I use when testing my regex: http://regexpal.com/

Nice tool for testing the regex on a given data set.

mturmon · on Dec 1, 2011

That's a neat page, which does help construction of regex's.

But, it does not address the question I posed.

The regex R generates a language L of strings accepted by R. I want a facility for generating example elements of L.

Given that L often is infinite or has immense cardinality, this is hard to do. I think I want a breadth-first descent of the tokens in the regex, but it's not always that simple.

leeoniya · on Dec 1, 2011

just decided to randomly click a chapter. in 12.1 wouldn't

^[0-9]+|[A-Z]+$ in fact need to be written as

^([0-9]+|[A-Z]+)$ or non-capturing

^(?:[0-9]+|[A-Z]+)$

i doubt the intent was to alternate NL/EOL assertion. seems like a novice oversight and does not instill confidence in the rest of the material.

or am i missing something?

zedshaw · on Dec 1, 2011

Hmm, well the book is being written so there's potential for some errors, but I believe you are wrong here. You're confusing match with search semantics. Your above works because by default Regetron searches. If you turn on match it doesn't match your proposed test below. Try this:

http://codepad.org/uPpwKPJS

Notice when you turn on !match it doesn't find your test.

Also, keep in mind that this is just introducing the concept of alternating. Captures are covered later.

sophiebits · on Dec 1, 2011

I agree with the GP --

http://codepad.org/YmsaEtJS

zedshaw · on Dec 1, 2011

Great, another example of regex engines not doing anything you tell them. I'll look at changing it but () isn't going to work there it has to be taught later.

leeoniya · on Dec 1, 2011

the solution seems pretty easy. simply leave out the beginning/end-of-line assertions. [0-9]+|[A-Z]+ would suffice for that example. i agree that introducing capture groups IS too early, but you can add the parentheses without mentioning capture groups, even if they do capture. grouping still has great value as a precedence indicator which is taught in gradeschool arithmetic.

it's rather unfortunate that the syntax to capture is "(" while the less complex no-capture is "(?:", i'm not sure who thought that through, but here we are.

zedshaw · on Dec 1, 2011

Yes, I hate that (?:) syntax. Who the hell thought that crap up.

But, I will point out that you attributed the error to NL/EOL assertion, when actually it was order of precedence of | being greater than $ and ^. It's a simple nearly 1-2 character mistake, not a "novice" mistake that discredits the entire book.

leeoniya · on Dec 1, 2011

yeah i apologize, perhaps i was a bit harsh. i did understand what the issue was and never claimed that it was the newline/EOL assertion itself. i said that your newline/EOL assertion is being treated as PART of the alteration, which is exactly what happens - yes because of pipe's higher operator precedence when no explicit grouping is defined.

novice or not though, 2 misplaced chars in a regex can make the diff between a security feature and a security hole, in this case matching a plethora of inject-able, potentially malicious characters which appear nowhere in the regex, not even as a wildcard ".", i would never make the claim that a 2-char mistake which matches this far beyond its intent is a minor oversight in a real-world, public facing application - and isn't that the ultimate goal?

the only issue in the context of a book for a regex beginner is that the regex gives the appearance of newline/EOL assertion hugging an alteration, but does something quite different. removing those assertions would clear things up for the better.

zedshaw · on Dec 1, 2011

Ok, both of you are wrong. It's got nothing to do with EOL, it's the parse order of precedence of | which makes it parse as (or (^[0-9]+) ([A-Z]+$)), but I was reading it as ($ (^ (or ([0-9]+) ([A-Z]+)))). Now I know why I should change it rather than I just suck.

sophiebits · on Dec 1, 2011

I'm not exactly sure how you thought I was interpreting it -- that's exactly how I understood the original comment and I still think that was leeoniya's point.

mekoka · on Dec 1, 2011

The point is not to have the best or most efficient regex expression, but rather to do the exercises and to highlight by this process misconceptions you may have about how the regex engine operates.

You enter what is given, you run it, see what the console returns, ponder about the implications for a few minutes and then scream "aha!"

What do you expect ^[0-9]+|[A-Z]+$ to return when matched against the various lines? What does it actually return? What conclusions do you draw from that little experiment?

leeoniya · on Dec 1, 2011

unfortunately this is not an efficiency optimization. the intent appears to be to match EITHER all-alpha OR all-numeric strings.

the regex, as given, would match both "&()ABC" and "5673%$^#" and "123ABC" but NOT "ABC123"

what should be a simple concept is defeated and confusion ensues. the correct formulation should have been ^[0-9]+$|^[A-Z]+$ or (judging from the example input and output samples) [0-9]+|[A-Z]+

zedshaw · on Dec 1, 2011

Uh, no, the intent is not to match all-alpha or all-numeric strings. It's to search for all alpha or all numeric. As I mentioned above, you're confusing search with match semantics.

suivix · on Dec 1, 2011

The (?i) is a confusing part of that example expression, and I don't see why it's necessary.

walrus · on Dec 1, 2011

The (? ... ) syntax is for "extended patterns". In this case, it makes the rest of the regular expression case insensitive. Read through http://perldoc.perl.org/perlre.html#Extended-Patterns for more information on extended patterns.

irahul · on Dec 1, 2011

It's good to know about it. Earlier at an employer we had a system where a processor in the pipeline took regular expressions as input - you can pass regular expressions but not the flags. (?i) is the only way you can indicate case-insensitive in these cases.

zedshaw · on Dec 1, 2011

Which one? That might be an error.

rcthompson · on Dec 1, 2011

I think it's the one on the "cover page" (the one linked to from HN.