Stackoverflow, HTML by Regex, topmost answer

jgrahamc · on July 5, 2010

First: please don't editorialize in the title. The "Got to see this" smacks of other places I will not mention.

Second: this is not at all interesting. The person asks a sensible question and then gets some ridiculous replies.

Third: it made me remember my spat with ESR about HTML parsing: http://news.ycombinator.com/item?id=923775 and now I feel sad.

pclark · on July 5, 2010

I really dislike all this arm chair moderation of hacker news, I like the occasional commentary in submission titles, and I also like the fact the community decides what is or isn't interesting. Like the ~60 people who found this article interesting.

jgrahamc · on July 5, 2010

The trouble is "Got to see this" is purely the judgement of the submitter and not of the community. The 60 upvotes are the community judgement.

CodeMage · on July 5, 2010

The trouble is "Got to see this" is purely the judgement of the submitter

Yes, just like "Second: this is not at all interesting" is your judgment in your comment. I know we all want to keep HN different from reddit, but a little tolerance is good.

Groxx · on July 5, 2010

... on the content, not the title.

euroclydon · on July 5, 2010

Some person explained that HTML is a Chomsky type 2 grammar and regular expressions are a Chomsky type 3 grammar, and provided this link: http://en.wikipedia.org/wiki/Chomsky_hierarchy

Can anyone here provide a link that makes the discussion of these typed grammars available to laymen?

bad_user · on July 5, 2010

Oh, I don't know if reducing the grammars to Chomsky is really necessary.

Regular expressions, in their original version, are equivalent to Finite-stage Machines (i.e. ... regular grammars, no recursion, no stack, no memory further than keeping the current state). You can't describe the rules of HTML with a FSM.

Perl's regular expressions contain various enhancements. Newer versions of Perl's regexes also contain direct support for recursion (but frankly, you can't call those "regular expressions" anymore).

So ... if your regex library has recursion support, then you can parse HTML (since with recursion you can parse context-free / Chomsky type-2 grammars). If it doesn't support recursion, then you can't.

Btw ... the equivalent for a context-free grammar would be a Push-down Automaton ... http://en.wikipedia.org/wiki/Pushdown_automaton , which is a FSM + a stack.

steveklabnik · on July 5, 2010

This answer on Reddit is pretty solid: http://www.reddit.com/r/programming/comments/cm02a/you_cant_...

Depends on your definition of 'laymen,' I guess.

Also note that PCREs actually are recursively enumerable (I think).

friism · on July 5, 2010

For more discussion (of a blog-post about the answer by Atwood), see here: http://news.ycombinator.com/item?id=944673

man1sh · on July 5, 2010

I think this is pretty old and has been discussed everywhere many times. Check the date of the question/answer too.

albertzeyer · on July 5, 2010

Ehm, I wonder a bit, the discussions always goes that HTML is not regular. The poster though asked to just match any open tags. The language of HTML tags clearly is regular, isn't it?

jauco · on July 5, 2010

For one thing: <br> is valid html, and this too:

    <!DOCTYPE html>
    <html>
    <head>
       <title>I AM YOUR DOCUMENT TITLE REPLACE ME</title>
    </head>
    <body>
       <div>
    <br id="<bl>">
       </div>
    </body>
    </html>

albertzeyer · on July 5, 2010

All kind of ways you could write the br-tag are regular.

jauco · on July 5, 2010

No, because as the example shows I can put regular html in the id tag. Once you do nesting, it's not regular anymore.

albertzeyer · on July 5, 2010

You cannot nest the quotes. I.e. this is invalid:

    <br id="<br id="<br>">">

I.e. it doesn't matter what is inside the quotes.

megaduck · on July 5, 2010

The language of individual HTML tags is certainly regular, and trivially easy. However, the language of "matched HTML tags with junk between them" is NOT regular.

Anything that requires balanced matching is NOT parseable with standard Regular Expressions, and by not parseable I mean that you will literally have an infinite amount of bugs. Shoot me an email and I can show you the math.

Even with Perl's whiz-bang recursive not-really-regexes-regexes, it's strongly not recommended to tackle balanced matching problems like HTML or XML. It might be theoretically possible (I haven't actually checked), but your brain will leak from your ears and you probably won't get it right, no matter how smart you are.

albertzeyer · on July 5, 2010

But the OP asked for parsing just for a specific type of tag (which is regular). Not an area of text between open/close tags (which is not regular).

vog · on July 5, 2010

That's what I'm wondering, too.

If it was XML, one might get in trouble with "<[CDATA[" sections, but regarding HTML, I don't see a real issue here.

... especially not from the pragmatic point of view. Depending on the use case and the quality of the HTML source, a "dirty" regex hack might be a far better solution and using a DOM parser.

jvdh · on July 5, 2010

> HTML tags lea͠ki̧n͘g fr̶ǫm ̡yo͟ur eye͢s̸ ̛l̕ik͏e liquid pain

Just reading that makes me wince.

arethuza · on July 5, 2010

I'm tempted to change our bug tracker to have a bug status of "it is too late it is too late we cannot be saved".

gvb · on July 5, 2010

The terse term is "overcome by events" (OBE), applied literally.

buro9 · on July 5, 2010

The revisions for it are pretty cool too: http://stackoverflow.com/posts/1732454/revisions

Someone slightly not getting the joke edited it out on the basis of it being troll/rambling, then someone put it all back. The nice bit... the actual point is emphasised as a result.

ars · on July 5, 2010

What is necessary to make regex turing complete?

steveklabnik · on July 5, 2010

You need to be able to provide 'context.' I linked this above, but I'll link it to you, too: http://www.reddit.com/r/programming/comments/cm02a/you_cant_...

Estragon · on July 5, 2010

I had heard that perl regular expressions are already Turing complete. Never verified it, though. E.g.

http://books.slashdot.org/comments.pl?sid=143298&cid=120...

earcar · on July 5, 2010

Got to read that in a Cylon Hybrid or GLaDOS voice.

ilkhd2 · on July 5, 2010

Anybody can suggests fiction authors who write in this style - sane text morphing into gibberish and probably back to sanity?

friism · on July 5, 2010

The prose is almost certainly inspired by H. P. Lovecraft. "At the Mountains of Madness" may be a good start, if you like this sort of stuff: http://www.dustylibrary.com/horror/5-at-the-mountains-of-mad...

steveklabnik · on July 5, 2010

Yes, the entire Zalgo meme is vaguely inspired by Lovecraft.

bigiain · on July 5, 2010

Try a book called "Random Acts of Senseless Violence" by Jack Womack. The text doesn't morph into gibberish, but the lead character transforms from entitled middle class to borderline psychopath street kid, and the way the language changes as the story progresses is _wonderful!_

ivank · on July 5, 2010

Mark Z. Danielewski's House of Leaves

djacobs · on July 5, 2010

I definitely got HoL vibes reading the post.

Russelldb · on July 5, 2010

William S. Burroughs

wlievens · on July 5, 2010

It reminded me mostly of the gibberish from the Cylon hybrids in Battlestar Galatica.

nuxi · on July 5, 2010

Hmm, Joyce's Finnegans Wake? You did say "probably back".

kurumo · on July 5, 2010

Robert Anton Wilson

Schrödinger's Cat trilogy especially

arethuza · on July 5, 2010

Thomas Pynchon?