Maybe not very constructive, but I think it's a technically fair answer given the question. The person asking is not intending to match individual tokens one by one to feed into a parser, but simply to use a regular expression to extract all instances of a set of opening tags in a whole document. The trivial solution he proposes, while perfectly sufficient for some subset of documents, quickly breaks in the general case when you consider comments and CDATA sections. For that you need to maintain an understanding of the whole document.
That said, this answer frequently gets linked in discussions even where using regular expressions is an entirely valid approach.
How is it technically fair? The answer is objectively wrong - you can tokenize XHTML using regexes. You cannot use a parser, since a parser does not emit tokens but emits the element tree and abstracts away syntactic details like the difference between <x></x> and <x />.
A technically fair answer would be to point out that the regex would have to take other tokens like comments, CDATA sections etc. into consideration, so it is more like a five-line regex than a one-line regex. If someone recommended an XHTML tokenizer or other tool which could solve the OP's task, that would also be a great answer.
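To make the "five-line regex" idea concrete, here is a rough Python sketch of such a tokenizer. The pattern, the group names and the `opening_tags` helper are my own assumptions for illustration, not anything from the original question, and it only covers a simplified subset of XHTML (no DOCTYPE handling, no entity checks). The point is only that comments and CDATA sections are consumed as whole tokens, so tag-like text inside them is never reported as a tag.

```python
import re

# One alternative per token type. Order matters: comments and CDATA are
# tried before tags so that "<b>" inside them is swallowed, not matched.
TOKEN = re.compile(
    r"""
    (?P<comment><!--.*?-->)
  | (?P<cdata><!\[CDATA\[.*?\]\]>)
  | (?P<pi><\?.*?\?>)
  | (?P<close></[A-Za-z_][\w:.-]*\s*>)
  | (?P<open><[A-Za-z_][\w:.-]*(?:\s+[\w:.-]+\s*=\s*(?:"[^"]*"|'[^']*'))*\s*/?>)
  | (?P<text>[^<]+)
    """,
    re.VERBOSE | re.DOTALL,
)


def opening_tags(doc):
    """Return opening tags that are not self-closing (hypothetical helper)."""
    tags = []
    for match in TOKEN.finditer(doc):
        # lastgroup is the name of the alternative that matched.
        if match.lastgroup == "open" and not match.group("open").endswith("/>"):
            tags.append(match.group("open"))
    return tags


sample = '<p>a<!-- <b> --><br /><![CDATA[<i>]]><a href="x">t</a></p>'
print(opening_tags(sample))
```

Note that `finditer` silently skips input that matches none of the alternatives (a stray `<`, for instance), so a real tokenizer would want to detect and report those gaps.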
> How is it technically fair? The answer is objectively wrong - you can tokenize XHTML using regexes.
Yes, but being able to tokenize XHTML using regular expressions is not the same thing as being able to use a single regular expression to extract XHTML tokens. Remember that context-free languages are a strict superset of regular languages. I don't personally know enough about the XHTML syntax to say off the bat whether the syntax can be described with a regular expression, but generally a recursive definition of valid syntax is not possible to express with regular expressions.
> You cannot use a parser, since a parser does not emit tokens but emit the element tree and abstracts away syntactic details like the difference between <x></x> and <x />.
You can use a parser, just not any XHTML parser. The parser would need to be constructed with the objectives in mind, to parse into a data structure that doesn't abstract these details away.
That said, maybe an even simpler solution exists, such as to use several regular expressions to first remove comments and CDATA sections before matching. I'm not immediately aware of any other cases that would cause problems for the trivial match suggested in the question post.
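A minimal Python sketch of that two-step idea, under some assumptions: `OPEN_TAG` is a stand-in I made up for a simple opening-tag match (it is not the exact pattern from the question), and it still assumes attribute values never contain a literal `>`. Comments and CDATA sections are removed in a single pass so that whichever one starts first wins.

```python
import re

# Step 1: one pattern that removes comments and CDATA sections together.
COMMENT_OR_CDATA = re.compile(r"<!--.*?-->|<!\[CDATA\[.*?\]\]>", re.DOTALL)

# Step 2: a simple opening-tag match (hypothetical stand-in pattern).
# The lookbehind rejects self-closing tags such as <br />.
OPEN_TAG = re.compile(r"<([A-Za-z][\w:.-]*)\b[^<>]*?(?<!/)>")


def opening_tag_names(doc):
    """Strip comments/CDATA first, then collect opening tag names."""
    stripped = COMMENT_OR_CDATA.sub("", doc)
    return OPEN_TAG.findall(stripped)


sample = '<p>a<!-- <b> --><br /><![CDATA[<i>]]><a href="x">t</a></p>'
print(OPEN_TAG.findall(sample))   # simple match alone, on the raw document
print(opening_tag_names(sample))  # after stripping comments and CDATA
```

On this sample the simple match alone also reports the `b` and `i` that only appear inside the comment and the CDATA section, while the stripped version does not.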