Hacker News new | past | comments | ask | show | jobs | submit login
Stackoverflow, HTML by Regex, topmost answer (stackoverflow.com)
105 points by s2r2 on July 5, 2010 | hide | past | favorite | 37 comments



First: please don't editorialize in the title. The "Got to see this" smacks of other places I will not mention.

Second: this is not at all interesting. The person asks a sensible question and then gets some ridiculous replies.

Third: it made me remember my spat with ESR about HTML parsing: http://news.ycombinator.com/item?id=923775 and now I feel sad.


I really dislike all this arm chair moderation of hacker news, I like the occasional commentary in submission titles, and I also like the fact the community decides what is or isn't interesting. Like the ~60 people who found this article interesting.


The trouble is "Got to see this" is purely the judgement of the submitter and not of the community. The 60 upvotes are the community judgement.


The trouble is "Got to see this" is purely the judgement of the submitter

Yes, just like "Second: this is not at all interesting" is your judgment in your comment. I know we all want to keep HN different from reddit, but a little tolerance is good.


... on the content, not the title.


Some person explained that HTML is a Chomsky type 2 grammar and regular expressions are a Chomsky type 3 grammar, and provided this link: http://en.wikipedia.org/wiki/Chomsky_hierarchy

Can anyone here provide a link that makes the discussion of these typed grammars available to laymen?


Oh, I don't know if reducing the grammars to Chomsky is really necessary.

Regular expressions, in their original version, are equivalent to Finite-stage Machines (i.e. ... regular grammars, no recursion, no stack, no memory further than keeping the current state). You can't describe the rules of HTML with a FSM.

Perl's regular expressions contain various enhancements. Newer versions of Perl's regexes also contain direct support for recursion (but frankly, you can't call those "regular expressions" anymore).

So ... if your regex library has recursion support, then you can parse HTML (since with recursion you can parse context-free / Chomsky type-2 grammars). If it doesn't support recursion, then you can't.

Btw ... the equivalent for a context-free grammar would be a Push-down Automaton ... http://en.wikipedia.org/wiki/Pushdown_automaton , which is a FSM + a stack.


This answer on Reddit is pretty solid: http://www.reddit.com/r/programming/comments/cm02a/you_cant_...

Depends on your definition of 'laymen,' I guess.

Also note that PCREs actually are recursively enumerable (I think).


For more discussion (of a blog-post about the answer by Atwood), see here: http://news.ycombinator.com/item?id=944673


I think this is pretty old and has been discussed everywhere many times. Check the date of the question/answer too.


Ehm, I wonder a bit, the discussions always goes that HTML is not regular. The poster though asked to just match any open tags. The language of HTML tags clearly is regular, isn't it?


For one thing: <br> is valid html, and this too:

    <!DOCTYPE html>
    <html>
    <head>
       <title>I AM YOUR DOCUMENT TITLE REPLACE ME</title>
    </head>
    <body>
       <div>
    <br id="<bl>">
       </div>
    </body>
    </html>


All kind of ways you could write the br-tag are regular.


No, because as the example shows I can put regular html in the id tag. Once you do nesting, it's not regular anymore.


You cannot nest the quotes. I.e. this is invalid:

    <br id="<br id="<br>">">
I.e. it doesn't matter what is inside the quotes.


The language of individual HTML tags is certainly regular, and trivially easy. However, the language of "matched HTML tags with junk between them" is NOT regular.

Anything that requires balanced matching is NOT parseable with standard Regular Expressions, and by not parseable I mean that you will literally have an infinite amount of bugs. Shoot me an email and I can show you the math.

Even with Perl's whiz-bang recursive not-really-regexes-regexes, it's strongly not recommended to tackle balanced matching problems like HTML or XML. It might be theoretically possible (I haven't actually checked), but your brain will leak from your ears and you probably won't get it right, no matter how smart you are.


But the OP asked for parsing just for a specific type of tag (which is regular). Not an area of text between open/close tags (which is not regular).


That's what I'm wondering, too.

If it was XML, one might get in trouble with "<[CDATA[" sections, but regarding HTML, I don't see a real issue here.

... especially not from the pragmatic point of view. Depending on the use case and the quality of the HTML source, a "dirty" regex hack might be a far better solution and using a DOM parser.


> HTML tags lea͠ki̧n͘g fr̶ǫm ̡yo​͟ur eye͢s̸ ̛l̕ik͏e liq​uid pain

Just reading that makes me wince.


I'm tempted to change our bug tracker to have a bug status of "it is too late it is too late we cannot be saved".


The terse term is "overcome by events" (OBE), applied literally.


The revisions for it are pretty cool too: http://stackoverflow.com/posts/1732454/revisions

Someone slightly not getting the joke edited it out on the basis of it being troll/rambling, then someone put it all back. The nice bit... the actual point is emphasised as a result.


What is necessary to make regex turing complete?


You need to be able to provide 'context.' I linked this above, but I'll link it to you, too: http://www.reddit.com/r/programming/comments/cm02a/you_cant_...


I had heard that perl regular expressions are already Turing complete. Never verified it, though. E.g.

http://books.slashdot.org/comments.pl?sid=143298&cid=120...


Got to read that in a Cylon Hybrid or GLaDOS voice.


Anybody can suggests fiction authors who write in this style - sane text morphing into gibberish and probably back to sanity?


The prose is almost certainly inspired by H. P. Lovecraft. "At the Mountains of Madness" may be a good start, if you like this sort of stuff: http://www.dustylibrary.com/horror/5-at-the-mountains-of-mad...


Yes, the entire Zalgo meme is vaguely inspired by Lovecraft.


Try a book called "Random Acts of Senseless Violence" by Jack Womack. The text doesn't morph into gibberish, but the lead character transforms from entitled middle class to borderline psychopath street kid, and the way the language changes as the story progresses is _wonderful!_


Mark Z. Danielewski's House of Leaves


I definitely got HoL vibes reading the post.


William S. Burroughs


It reminded me mostly of the gibberish from the Cylon hybrids in Battlestar Galatica.


Hmm, Joyce's Finnegans Wake? You did say "probably back".


Robert Anton Wilson

Schrödinger's Cat trilogy especially


Thomas Pynchon?




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: