Maybe not very constructive, but I think it's a technically fair answer given th...

goto11 · on Jan 31, 2019

How is it technically fair? The answer is objectively wrong - you can tokenize XHTML using regexes. You cannot use a parser, since a parser does not emit tokens but emits the element tree and abstracts away syntactic details like the difference between <x></x> and <x />.

A technically fair answer would be to point out that the regex would have to take other tokens like comments, CDATA sections etc. into consideration, so it is more like a five-line regex than a one-line regex. If someone recommended an XHTML tokenizer or other tool which could solve the OP's task, that would also be a great answer.
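
To make the "five-line regex" idea concrete, here is a rough sketch in Python. The names `NAME`, `ATTRS`, `TOKEN_RE` and `tokenize` are made up for illustration, and the pattern is an assumption about what such a tokenizer could look like, not a complete XHTML grammar (no processing instructions, doctype or entity handling). The point is only that comments and CDATA sections are tried first, so anything tag-like inside them is never seen by the tag alternatives.

```python
import re

# Hypothetical building blocks, simplified from the real XML name rules.
NAME = r"[A-Za-z_][\w:.-]*"
ATTRS = r"""(?:\s+%s\s*=\s*(?:"[^"]*"|'[^']*'))*""" % NAME

# Alternatives are tried left to right at each position, so a comment or
# CDATA section is consumed whole before the tag patterns get a chance.
TOKEN_RE = re.compile(
    "|".join([
        r"(?P<comment><!--.*?-->)",
        r"(?P<cdata><!\[CDATA\[.*?\]\]>)",
        r"(?P<close></" + NAME + r"\s*>)",
        r"(?P<empty><" + NAME + ATTRS + r"\s*/>)",
        r"(?P<open><" + NAME + ATTRS + r"\s*>)",
        r"(?P<text>[^<]+)",
    ]),
    re.DOTALL,
)

def tokenize(source):
    """Return (kind, text) pairs; input that matches no alternative is skipped."""
    return [(m.lastgroup, m.group()) for m in TOKEN_RE.finditer(source)]

tokens = tokenize('<p><br /><!-- <b> --><![CDATA[<i>]]>x</p>')
open_tags = [text for kind, text in tokens if kind == "open"]
```

Because `<x />` and `<x></x>` come out as different token kinds (`empty` versus `open` plus `close`), the distinction the OP cares about is preserved.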

boomlinde · on Jan 31, 2019

> How is it technically fair? The answer is objectively wrong - you can tokenize XHTML using regexes.

Yes, but being able to tokenize XHTML using regular expressions is not the same thing as being able to use a single regular expression to extract XHTML tokens. Remember that context-free languages are a strict superset of regular languages. I don't personally know enough about the XHTML syntax to say off the bat whether its token syntax can be described with a regular expression, but in general a recursively defined syntax (arbitrarily deep nesting, for example) cannot be expressed with a regular expression.

> You cannot use a parser, since a parser does not emit tokens but emits the element tree and abstracts away syntactic details like the difference between <x></x> and <x />.

You can use a parser, just not any XHTML parser. The parser would need to be constructed with the objectives in mind, to parse into a data structure that doesn't abstract these details away.

That said, maybe an even simpler solution exists, such as using several regular expressions to first remove comments and CDATA sections before matching. I'm not immediately aware of any other cases that would cause problems for the trivial match suggested in the question post.
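
A minimal sketch of that two-step idea in Python, under some loud assumptions: `SKIP_RE`, `OPEN_TAG_RE` and `opening_tags` are hypothetical names, and `OPEN_TAG_RE` is a stand-in of my own for the simple match, not the pattern from the question post. It also assumes attribute values never contain `<` or `>`.

```python
import re

# Step 1: comments and CDATA sections, removed in a single pass so that a
# "<!--" inside CDATA (or "<![CDATA[" inside a comment) is not misread.
SKIP_RE = re.compile(r"<!--.*?-->|<!\[CDATA\[.*?\]\]>", re.DOTALL)

# Step 2: stand-in "simple" match for an opening tag that does not end in "/>".
# Assumes no "<" or ">" inside attribute values.
OPEN_TAG_RE = re.compile(r"<[A-Za-z][^<>]*(?<!/)>")

def opening_tags(source):
    """Strip comments and CDATA, then return opening tags that are not self-closing."""
    stripped = SKIP_RE.sub("", source)
    return OPEN_TAG_RE.findall(stripped)

found = opening_tags('<a href="x/y"><br /><!-- <b> --><![CDATA[<i>]]></a>')
```

Here the `<b>` in the comment and the `<i>` in the CDATA section are gone before the second regex runs, and `<br />` is rejected by the lookbehind, so only the `<a ...>` tag should be reported.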