> My favorite is how they work to make it permissible to have a THEAD with an implicit TBODY that follows it. Just, why!?
Are you asking why the HTML4 spec initially allowed this? There were likely several reasons (I wasn't involved in the working group at the time). Some of the reasons:
1) Tables without <tbody> or <thead> at all, with <tr> directly in the <table>, were already all over the place before HTML4 appeared. They needed to keep those allowed, both as a practical matter and to make authoring less verbose in cases when there is no thead/tfoot.
2) Their syntax definition method (SGML) allowed for this by making <tbody> start and end tags optional, which they therefore did.
The outcome is that <tbody> has optional start/end tags in HTML 4, and <table><thead></thead><tr></tr></table> ends up with an implicit <tbody>.
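To make the implicit <tbody> concrete, here is a toy sketch in Python. It is not the real HTML5 tree-construction algorithm, just an illustration of the one rule discussed here; `imply_tbody` and `SECTIONS` are made-up names, and the input is a simplified list of tag names rather than real markup.

```python
# Toy model only: tags are plain strings, "/name" marks an end tag.
SECTIONS = {"thead", "tbody", "tfoot"}


def imply_tbody(tokens):
    """Return the token list with <tbody> start/end tags made explicit."""
    out, stack = [], []
    for tok in tokens:
        closing = tok.startswith("/")
        name = tok.lstrip("/")
        top = stack[-1] if stack else None
        if not closing:
            if name == "tr" and top == "table":
                # A <tr> directly inside <table> implies a <tbody>.
                out.append("tbody")
                stack.append("tbody")
            elif name in SECTIONS and top == "tbody":
                # A new table section ends the currently open <tbody>.
                out.append("/tbody")
                stack.pop()
            out.append(name)
            stack.append(name)
        else:
            if name == "table" and top == "tbody":
                # </table> ends a still-open <tbody>.
                out.append("/tbody")
                stack.pop()
            out.append(tok)
            if stack and stack[-1] == name:
                stack.pop()
    return out


# <table><thead></thead><tr></tr></table>
print(imply_tbody(["table", "thead", "/thead", "tr", "/tr", "/table"]))
```

A table that already spells out its <tbody> passes through this sketch unchanged.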
Now we come to HTML5. We're not using SGML anymore, so we _could_ disallow missing tbody when there's a thead/tfoot, but still allow it if there are no headers/footers. But in the intervening 10 years, there's a ton of content that was created that relies on the HTML4 behavior, and browsers all implement the HTML4 behavior. What is the argument for changing that behavior?
> I think this is important for browser vendors, to an extent.
What was important to browser vendors, with HTML5, was having a standard that actually specified the behaviors needed for de-facto compat with existing content, so they could stop (buggily) reverse-engineering each other to figure out how to handle the corner cases. This was the stated intent of the spec. Is this the goal you consider "not sane"?
Note that some of the authoring behaviors involved are still considered incorrect in HTML5 (though leaving out <tbody> is not one of them): misnesting your tags will cause your HTML to not be valid HTML5, and a validator will flag that. It's just that HTML5 specifies what browsers should do even in the face of incorrect behavior like misnesting tags, because it turned out that people were doing that even though it was invalid and depending on the resulting behavior of browsers.
> I also think we can try to do better.
What do you think should be the goal of a spec for "HTML"?
has an implicit tbody. Sure, there are some sane reasons to have implicit values. And in some cases I think it is actually obvious what those tags would be. This case, however, does not appear to be obvious to me. It is just as likely that this was a table that has a header, but no body.
I don't fully understand why "existing documents" are relevant at all. Since you basically have to "opt in" to the new version by declaring the doctype, we could have had much cleaner semantics on a new doctype. This seemed to be the goal of the xhtml push a few years prior. I am not privy to all of the history of why that failed.
To directly answer, my goal for the spec of HTML would have been a spec with fewer special cases. Preferably, one that held fewer surprises for people who know both XML and HTML.
In HTML4 it does because the DTD has "TBODY+" instead of "TBODY*", and yeah, I have no idea why someone thought that was a good idea, apart from the theoretical purity of "a table with no body makes no sense".
> It is just as likely that this was a table that has a header, but no body.
That's exactly what it has.
> Since you basically have to "opt in" to the new version by declaring the doctype,
Er.... you don't. The "new version" is the only version. The doctype affects a very small number of quirks but that's it, and that part way predates HTML5.
> I am not privy to all of the history of why that failed.
There were a few reasons. First, it turned out that neither authors nor users wanted the hard-fail behavior of an XML parser. Users, because it would mean they couldn't read the page they wanted to read. Authors, because they did not sufficiently control all the markup ending up on the page (multiple people authoring snippets, CMS templates, random bits of markup pulled from databases provided by other companies, etc).
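For anyone who hasn't run into the hard-fail behavior, here is a minimal sketch using Python's standard-library XML parser as a stand-in for a browser's XML parser (that substitution is my assumption for illustration; `parses_as_xml` is a made-up helper name). A single misnested or unclosed tag makes the whole document unparseable, rather than being recovered from the way an HTML parser would.

```python
import xml.etree.ElementTree as ET


def parses_as_xml(markup: str) -> bool:
    """True if the markup is well-formed XML, False if the parser rejects it."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


well_formed = "<p><b>bold <i>both</i></b></p>"
misnested = "<p><b>bold <i>both</b></i></p>"   # tags closed in the wrong order
unclosed = "<p>one<br>two</p>"                 # HTML-style <br>, never closed

for sample in (well_formed, misnested, unclosed):
    print(parses_as_xml(sample), sample)
```

With an XML parser there is no partial result to show the user in the last two cases, which is the behavior described above.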
Second, because there was no sane migration path. Suppose an author wanted to switch some page over to XHTML. But not all browsers support XHTML (and in particular the browser with 95%+ market share does not), so they need to provide an HTML version too. The normal answer to that was to make use of XHTML 1.0 Appendix C to provide a document that could be parsed as either HTML or XHTML, and to use HTTP content negotiation to send either the text/html or application/xhtml+xml MIME type. But then the problem was a tendency to only test the text/html case and have the application/xhtml+xml case not end up as well-formed XML. There were tons of documents all over the place that had an XHTML doctype and were attempting to comply with Appendix C, but were not actually well-formed; luckily most of them were only served up as text/html. All of this was a strong disincentive for browsers to advertise application/xhtml+xml support, because they would get broken pages. Even the browsers that had started off advertising such support ended up removing it in the face of user complaints; see first reason above.
Note that all this would have been _much_ worse if the switching had been on doctype, not MIME type; as I noted above, there were tons of documents around that had the XHTML doctype but were not well-formed.
I should note that the actual semantics of XHTML1 were not that different from HTML4; apart from parsing there were no significant differences. And the parsing semantics turned out to be something no one wanted in practice, per above.
As for XHTML2, which did attempt new semantics of various sorts, it suffered from several problems as well. Most glaring, again, was complete lack of migration path. Unlike XHTML1 there was no way to create a document that would work with a UA that didn't implement XHTML2 _and_ one that did. The XML parsing semantics were still not wanted in the market. The new semantics XHTML2 introduced were not that wanted either, because the working group decided to not talk to any actual authors or browser vendors or anyone else who would be involved in creating or consuming XHTML2, pretty much. The result was a spec that was solving problems people didn't have, not solving problems they did have, and with no clear way to deploy it in the market.
All of the above is why when WHATWG started working on an evolution of HTML the priority of constituencies (now captured at https://www.w3.org/TR/html-design-principles/#priority-of-co... ) was users, authors, implementors, specifiers, theoretical purity. Because the approach of putting theoretical purity first had been tried and failed spectacularly...
Note that a large part of the failure was in fact due to the "existing documents" problem, because the lack of a migration path was one of the most significant barriers to XHTML adoption. Of course the lack of strong reasons to adopt it didn't help either.
> my goal for the spec of HTML would have been a spec with fewer special cases.
This is not an unreasonable goal, sure. I should note that in terms of priority of constituencies this is a "theoretical purity" goal. Getting rid of specific special cases that are confusing people could be a goal in terms of the "authors" or "implementors" or "specifiers" constituency, of course.
Note that HTML5 did in fact remove various special cases HTML4 had that were due to its SGML heritage, most of which had never actually gotten widely implemented in browsers. For example, comment parsing was simplified significantly, such that "<!-- Reader -- take note! -->" is actually a closed comment (which it's not in HTML 4, and wasn't in Firefox, which actually implemented the HTML 4 semantics for comments, until the switch to the HTML5 parser). The special cases that remained were the ones that were needed to actually render existing web pages correctly.
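A rough sketch of the difference between the two comment rules, in Python. Both functions are deliberately simplified toy models with made-up names (`html5_comment_end`, `sgml_comment_end`); neither is the full algorithm from either spec. The assumption is that the HTML5-style rule ends a comment at the first "-->", while the SGML-style rule treats each "--" inside "<!" ... ">" as toggling comment state, so ">" only closes the declaration when no comment is open.

```python
def html5_comment_end(text: str, start: int = 0):
    """Index just past the comment, or None if it is never closed.

    Simplified HTML5-style rule: the comment ends at the first "-->".
    """
    if not text.startswith("<!--", start):
        return None
    close = text.find("-->", start + 4)
    return None if close == -1 else close + 3


def sgml_comment_end(text: str, start: int = 0):
    """Index just past the declaration, or None if it is never closed.

    Toy SGML-style rule: each "--" toggles comment state, and ">" closes
    the declaration only while outside a comment.
    """
    if not text.startswith("<!--", start):
        return None
    i = start + 2          # skip "<!", then start scanning for "--" pairs
    in_comment = False
    while i < len(text):
        if text.startswith("--", i):
            in_comment = not in_comment
            i += 2
        elif text[i] == ">" and not in_comment:
            return i + 1
        else:
            i += 1
    return None


sample = "<!-- Reader -- take note! -->"
print(html5_comment_end(sample))  # closed under the simplified HTML5 rule
print(sgml_comment_end(sample))   # None: the last "--" reopens a comment
```

Under the toy SGML rule the final ">" lands inside a freshly opened comment, so everything after it would be swallowed until another "--" and ">" show up.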
Hmm, I have to confess I was cribbing this example from a link above. I'll dive further on it and see where I got lost.
I am a bit befuddled by the claim that HTML5 was determined not to be an opt-in schema. My view is probably colored by the fact that, by the time I actually cared about this, most of my docs were using the xhtml doctype. So, for me it definitely was a sort of "opt-in" and a migration. Which, frankly, is logical and makes the most sense.
So, I grant that the "existing documents" problem presented a ton of not well formatted documents. But, a large chunk of existing code presents with excessive warnings. The solution there is not to just give up, but to come up with better tools and guide people to the higher quality paths.
In the end, I fully accept this as something I will just have to agree to disagree on. My assertion is that contortions to not raise the bar on the creation of documents did little to advance the state of the web. I do not have a clear path on how to test this assertion. And have since moved on from web development.