I really dislike all this armchair moderation of Hacker News. I like the occasional commentary in submission titles, and I also like the fact that the community decides what is or isn't interesting, like the ~60 people who found this article interesting.
The trouble is "Got to see this" is purely the judgement of the submitter.
Yes, just like "Second: this is not at all interesting" is your judgment in your comment. I know we all want to keep HN different from reddit, but a little tolerance is good.
Some person explained that HTML is a Chomsky type 2 grammar and regular expressions are a Chomsky type 3 grammar, and provided this link: http://en.wikipedia.org/wiki/Chomsky_hierarchy
Can anyone here provide a link that makes the discussion of these typed grammars available to laymen?
Oh, I don't know if reducing the grammars to Chomsky is really necessary.
Regular expressions, in their original version, are equivalent to finite-state machines (i.e. regular grammars: no recursion, no stack, no memory beyond the current state). You can't describe the rules of HTML with an FSM.
Perl's regular expressions contain various enhancements. Newer versions of Perl's regexes also contain direct support for recursion (but frankly, you can't call those "regular expressions" anymore).
So ... if your regex library has recursion support, then you can parse HTML (since with recursion you can parse context-free / Chomsky type-2 grammars). If it doesn't support recursion, then you can't.
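To make the recursion point concrete, here is a minimal sketch in Python. Python's built-in `re` has no recursion support, so the nesting is handled by an ordinary recursive function and the regex is used only to split the input into tokens. The names `TOKEN` and `parse` are made up for this example, and it assumes tidy input: every tag is explicitly closed, and there are no void elements like `<br>`, no comments, and no `>` inside attribute values.

```python
import re

# Tokenizer: an opening tag, a closing tag, or a run of text.
# Splitting into tokens is a regular problem, so a plain regex is enough here.
TOKEN = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)[^<>]*>|[^<]+")

def parse(html):
    """Parse properly nested tags into (name, children) tuples; text stays a str."""
    tokens = [(m.group(1), m.group(2), m.group(0)) for m in TOKEN.finditer(html)]
    pos = 0

    def parse_nodes(open_name):
        # The call stack remembers which tags are still open. This is
        # exactly the memory a finite-state machine does not have.
        nonlocal pos
        nodes = []
        while pos < len(tokens):
            closing, name, raw = tokens[pos]
            pos += 1
            if name is None:            # text token
                nodes.append(raw)
            elif closing:               # closing tag must match the open one
                if name != open_name:
                    raise ValueError(f"unexpected </{name}>")
                return nodes
            else:                       # opening tag: recurse for its children
                nodes.append((name, parse_nodes(name)))
        if open_name is not None:
            raise ValueError(f"missing </{open_name}>")
        return nodes

    return parse_nodes(None)
```

For example, `parse("<div><p>hi</p><p>yo</p></div>")` returns `[("div", [("p", ["hi"]), ("p", ["yo"])])]`, while `parse("<b><i>x</b></i>")` raises `ValueError`. Real-world HTML needs far more than this, which is why a proper parser is usually the safer choice.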
Ehm, I wonder a bit: the discussion always goes that HTML is not regular. The poster, though, asked just to match any open tags. The language of HTML tags clearly is regular, isn't it?
The language of individual HTML tags is certainly regular, and trivially easy. However, the language of "matched HTML tags with junk between them" is NOT regular.
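As a rough illustration of the "individual tags" half, here is a sketch in Python that finds opening tags with a plain regex. The pattern and the name `find_open_tags` are hypothetical, and the pattern is deliberately simplified: it assumes no `>` inside attribute values and ignores comments, CDATA, and script contents.

```python
import re

# '<', a tag name, optional attributes, an optional '/', then '>'.
# Closing tags such as </p> never match because '<' must be followed by a letter.
OPEN_TAG = re.compile(r"<([A-Za-z][A-Za-z0-9]*)\b[^<>]*?(/?)>")

def find_open_tags(html):
    """Return the names of opening tags, skipping self-closing ones like <br/>."""
    return [m.group(1) for m in OPEN_TAG.finditer(html) if m.group(2) != "/"]
```

`find_open_tags('<p class="x">hi <br/> <a href="y">z</a></p>')` returns `['p', 'a']`. Note that this only lists tags; it says nothing about whether they are correctly matched, which is the part a regular expression cannot do.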
Anything that requires balanced matching is NOT parseable with standard Regular Expressions, and by not parseable I mean that you will literally have an infinite number of bugs. Shoot me an email and I can show you the math.
Even with Perl's whiz-bang recursive not-really-regexes-regexes, it's strongly not recommended to tackle balanced matching problems like HTML or XML. It might be theoretically possible (I haven't actually checked), but your brain will leak from your ears and you probably won't get it right, no matter how smart you are.
If it was XML, one might get in trouble with "<![CDATA[" sections, but regarding HTML, I don't see a real issue here.
... especially not from the pragmatic point of view. Depending on the use case and the quality of the HTML source, a "dirty" regex hack might be a far better solution than using a DOM parser.
Someone slightly not getting the joke edited it out on the basis of it being troll/rambling, then someone put it all back. The nice bit... the actual point is emphasised as a result.
Try a book called "Random Acts of Senseless Violence" by Jack Womack. The text doesn't morph into gibberish, but the lead character transforms from entitled middle class to borderline psychopath street kid, and the way the language changes as the story progresses is _wonderful!_
Second: this is not at all interesting. The person asks a sensible question and then gets some ridiculous replies.
Third: it made me remember my spat with ESR about HTML parsing: http://news.ycombinator.com/item?id=923775 and now I feel sad.