It doesn't treat all whitespace as insignificant. It does, by default (for conte...

bryanrasmussen · on July 12, 2023

So to clarify then HTML in older versions treats all whitespace as insignificant, but that can be overridden by combinations of newer versions of HTML and CSS when interpreted by a browser that understands the styling decisions overriding the default behavior?

CaptainNegative · on July 12, 2023

The <pre> tag has existed since HTML 2 for displaying preformatted/whitespace-sensitive text, and HTML 1 had the mostly similar <LISTING> tag (plus <PLAINTEXT> which is a little different).

cxr · on July 12, 2023

> then HTML in older versions treats all whitespace as insignificant[...]?

No. It works the way I described.

bryanrasmussen · on July 12, 2023

ok well I forgot the PRE tag as CaptainNegative pointed out but when you say

>No. It works the way I described.

what you described made reference to white-space: normal which is a CSS property that I don't believe is available as part of the HTML standard itself (although I don't really keep up anymore so I could be wrong) but certainly wasn't part of older versions of the spec.

cxr · on July 12, 2023

You are putting undue focus on a parenthetical (that I only even put in as a hedge[1] in the first place).

Copy and paste my comment somewhere, delete the parenthetical, and then read the result to yourself.

"HTML [...] treats all whitespace as insignificant" is simply inaccurate, no matter how you want to constrain it (e.g. "in older versions" or not). Whitespace is not insignificant.

1. <https://pchiusano.github.io/2014-10-11/defensive-writing.htm...>

slowwriter · on July 13, 2023

Let me be clear about what I meant by whitespace insignificance.

When you put plain text into an element, that is equivalent to a string in typical programming terms. No, whitespace is not entirely insignificant within a text node. But almost. If we leave out <pre> and other special cases here, HTML specifies to ignore any extraneous whitespace and simply collapse it into a single space. So it is “extraneous whitespace insignificant” in a sense. It doesn’t ignore whitespace entirely, but no one would expect that in the context of a string in any language, even a whitespace insignificant one.

In a text node HTML goes out of its way to minimize the meaning of whitespace, but it does do the minimum of respecting that words have spaces between them. You can put spaces some places and have it break or change stuff, like in the middle of an attribute name or value, in the middle of an element name, etc. But you would expect that to happen in any whitespace insignificant language. Outside of that and a few special cases, the default behavior is to ignore whitespace (for example whitespace between the beginning or ending tag of an element and the text node it contains), and as such HTML is very much whitespace insignificant in my opinion.

The reason why I commented that this design was absolutely the right call is basically cases like building a website in PHP, where you mix the two languages together. Here you end up adding a lot of whitespace from indenting your code, etc., and it would be a nightmare if HTML didn’t treat whitespace as it does.

cxr · on July 13, 2023

I understood what you meant.

> HTML specifies to ignore any extraneous whitespace and simply collapse it into a single space[...] Outside of that and a few special cases, the default behavior is to ignore whitespace

No it doesn't, and it's not. What you're describing is how the browser displays the content. (And a few other things—like interactions when you select text to drag and drop or copy it to the clipboard.)

> building a website in PHP[...] you end up adding a lot of whitespace from indenting your code, etc., and it would be a nightmare if HTML didn’t treat whitespace as it does

You keep saying "HTML" when you mean something else. In almost every instance if you just said "the browser" (broadly) instead, then you'd be good, but you keep saying "HTML".

There are absolutely parts of the browser that don't care whether they're seeing one space or a thousand varied whitespace characters (tabs, carriage returns, linefeeds, etc), because based on what style properties are in effect at that place the browser will be presenting that content to the user as if there's one space character when laying it out and putting it on screen. But the only whitespace that gets ignored in HTML, really, is the whitespace inside angle brackets around attributes and element names.
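A minimal sketch of that last point, assuming Python's standard-library `html.parser` as a stand-in for a browser's HTML parser (it is not a spec-compliant HTML5 parser, and `TagRecorder` and `start_tags` are names made up for illustration). Extra whitespace inside the angle brackets makes no difference to the tag name and attributes that come out:

```python
from html.parser import HTMLParser


class TagRecorder(HTMLParser):
    """Records each start tag as (name, attributes); hypothetical helper."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append((tag, attrs))


def start_tags(markup):
    """Return the start tags found in `markup`."""
    parser = TagRecorder()
    parser.feed(markup)
    parser.close()
    return parser.tags


# Same element, with and without extra whitespace inside the angle brackets.
compact = start_tags('<p class="a" id="b">text</p>')
spaced = start_tags('<p   class="a"\n\t id="b"   >text</p>')

print(compact)
print(compact == spaced)
```

Whitespace outside the angle brackets is a different matter: it is character data and ends up in the text nodes.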

Your string metaphor is a good one. Content marked up with HTML is like one big string, and as you say, no one would expect whitespace in a string to be insignificant. It's not insignificant in HTML, either; it does, by default, get painted as if sequences of multiple whitespace characters were a single space, in most contexts. But again, that's a separate thing entirely.

slowwriter · on July 13, 2023

I don’t understand your distinction between “the browser” and “HTML” in this context. The browser is merely the interpreter of the language, but the HTML specification lays out how the language should be interpreted.

Also, this is an example of whitespace that is ignored:

[whitespace here]I’m a text node[more whitespace here]

I don’t believe that is what you referred to when you said “inside angle brackets around attributes and element names”.

Here the whitespace or sequence of spacelike characters is not collapsed into a single space. It is simply ignored, and the text node (string) begins at the first non-whitespace character.

That is actually what I referred to when I said that you end up adding a lot of extra whitespace when building a website in, say, PHP. Because that is where it typically ends up in the generated output.

cxr · on July 13, 2023

Nope. Try it out:

  $ dump ./scratch/p.html
  3c 70 3e 20 20 0a 20 20 49 27 6d 20 61 20 74 65
   <  p  >        .        I  '  m     a     t  e
  78 74 20 6e 6f 64 65 20 5b 20 20 20 20 5d 20 20
   x  t     n  o  d  e     [              ]      
  20 20 3c 2f 70 3e 0a
         <  /  p  >  .

(I replaced your first square bracket sequence with two spaces followed by a newline (U+000A) followed by two more spaces, and I replaced the second square bracket sequence with a space followed by a literal left square bracket, followed by four space characters, followed by a literal right square bracket, followed by four more spaces.)

The text node's value is exactly the sequence of characters between the closing angle bracket in `<p>` and the opening angle bracket in `</p>`:

  "  \n  I'm a text node [    ]    "

> The browser is merely the interpreter of the language, but the HTML specification lays out how the language should be interpreted.

You're right about the second half, but you're wrong in thinking that it says extra whitespace should be ignored. It doesn't. The bigger problem, though, is in the first half.

I think you have an oversimplified understanding of what's going on in a browser and of the relationship that HTML has to what you see when the browser paints the content on the screen and lets you interact with it; a fundamental misunderstanding seems to exist on your part regarding the pipeline that you do or don't think of as existing between the markup and what you actually get when you open the page in a browser—there's a lot more to it than the browser being "merely the interpreter" for HTML.
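The text node value from the dump above can be checked outside a browser as well. This is only a sketch: it assumes Python's standard-library `html.parser` as an approximation of a real HTML parser, and `TextCollector` and `text_inside_p` are hypothetical names. The character data between `<p>` and `</p>` comes out of the parser with all of its whitespace intact:

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collects the raw character data found inside <p> elements."""

    def __init__(self):
        super().__init__()
        self.in_p = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.in_p = True

    def handle_endtag(self, tag):
        if tag == "p":
            self.in_p = False

    def handle_data(self, data):
        # The parser may deliver text in several pieces, so gather them all.
        if self.in_p:
            self.chunks.append(data)


def text_inside_p(markup):
    """Return the text between <p> and </p>, exactly as the parser saw it."""
    parser = TextCollector()
    parser.feed(markup)
    parser.close()
    return "".join(parser.chunks)


# The same bytes as in the hex dump of p.html.
source = "<p>  \n  I'm a text node [    ]    </p>\n"
parsed_text = text_inside_p(source)

print(repr(parsed_text))
```

Nothing is trimmed at the start or end, and the run of four spaces inside the brackets is still four spaces.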

slowwriter · on July 17, 2023

I see. Don’t know if you’re still checking for replies on this thread. Livin’ up to my name. Thanks for taking the time to explain, though.

I’m going to have to look further into this to get a better understanding, but I suppose the rules for collapsing whitespace in a text node exist somewhere in the HTML specification, but not at the “interpretation” stage as I assumed.

To be clear what I imagined was that at the interpretation stage a text node would be marked to begin at the first non-whitespace character and end at the last non-whitespace character. And then within the text node there might be additional whitespace that would need to be collapsed into a single space.

Since the first type is not rendered at all and the second type is collapsed to a single space I assumed the rules could exist at two different points in the process/pipeline.

So what I gather here is that both types exist at a later stage than “interpretation” (basically what you see when you open Developer Tools and inspect individual nodes).

But I guess the subtlety here is that at whichever stage the whitespace collapsing/removal happens, the rules for it would still have to be defined by the HTML specification somehow.

And another subtlety to counteract that is that HTML is a markup language and not a programming language. One is executed, one is rendered. So any comparison between say Python and HTML needs to take that into account.

So even though there is some whitespace ignoring going on at some point from:

[whitespace]This textnode has extraneous whitespace[whitespace]

To the point where [whitespace] is not rendered in the viewport, the fact that the ignoring does not happen at the “interpretation” stage is important because that’s as far as the comparison between say Python and HTML can go before the two veer off in different directions.

I’m mainly typing this out for my own understanding, but again, will have to look into it myself to validate or correct my current framework of thinking about this. Thanks for an interesting discussion.

cxr · on July 18, 2023

Pretty much. HTML parsing produces a content model, where the model's whitespace matches pretty faithfully what's in the source document. At some later point, that model is massaged into the thing that you see and interact with, but the model itself retains everything. This is like a filter, if it helps to think of it that way, or a projection of a complex object (e.g. 3D) onto a lesser substrate (e.g. a 2D plane).

Offhand, and after a few glasses of wine, there are a couple points where the whitespace collapse will occur:

- at the display level—when it's time for the browser to actually put the thing on the screen—for CSS contexts where the white-space property is "normal" or something similar, at least, or

- at the interaction level, when something like text selection happens, and the browser computes essentially the equivalent of node.innerText (versus node.textContent; alternatively: node.nodeValue, in cases where the node in question is a text node)
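The "model retains everything, display is a projection" idea can be sketched roughly as follows. This is not the actual CSS white-space algorithm: `collapse_for_display` is a made-up helper that treats a single text node as if it were the whole line, collapses runs of spaces, tabs, and line breaks into one space, and drops the space at either edge. Real layout also depends on neighbouring inline content and on the `white-space` value in effect.

```python
import re

# Runs of spaces, tabs, and line breaks (a simplification of what CSS collapses).
_WHITESPACE_RUN = re.compile(r"[ \t\n\r]+")


def collapse_for_display(text):
    """Rough approximation of how `white-space: normal` presents text."""
    collapsed = _WHITESPACE_RUN.sub(" ", text)
    # A space at the start or end of a line is not shown.
    return collapsed.strip(" ")


# The text node value from the earlier dump, kept as the parser produced it.
node_value = "  \n  I'm a text node [    ]    "

# What a reader would roughly see on screen (compare innerText vs textContent).
displayed = collapse_for_display(node_value)

print(repr(node_value))
print(repr(displayed))
```

The stored value is never modified; the collapsed string is computed from it on demand, which is the sense in which display acts as a filter over the model.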