Without. UTF-8 is such a distinctive pattern that if text with high bits set mat...

umanwizard · on June 19, 2016

Huh? What would a BOM in UTF-8 even do? 1-byte objects can't have an internal byte ordering.

jcranmer · on June 19, 2016

MS popularized the idea of adding the UTF-16 BOM into UTF-8 to distinguish between UTF-8 text files and Windows code page files, or what they called "Unicode" and "ANSI." There's (nearly?) unanimous agreement among everyone else that BOMs in UTF-8 text are really stupid.

Note that the "BOM" in this case means storing the U+FEFF character in UTF-8 form (just as UTF-16 stores it in the appropriate endianness). This means that the result would be EF BB BF.

Const-me · on June 20, 2016

Not everything is a web request or response that has a “content-encoding” header transmitted somewhere out of band.

The BOM makes it possible to tell whether a byte stream is non-Unicode, UTF-8, UTF-16 or UTF-32.

Like it or not, it's part of the standard:

http://unicode.org/faq/utf_bom.html#BOM
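
To make the "distinguish a byte stream" idea concrete, here is a minimal Python sketch of BOM sniffing. The function name `sniff_bom` and the table `_BOMS` are made up for illustration; the BOM constants come from Python's standard `codecs` module. A missing BOM tells you nothing, so the function reports "unknown" rather than guessing.

```python
import codecs

# Order matters: the UTF-32 LE BOM (FF FE 00 00) starts with the
# UTF-16 LE BOM (FF FE), so the 4-byte BOMs must be checked first.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(data: bytes):
    """Return (encoding, bom_length), or (None, 0) if no BOM is present."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0
```

Note that UTF-16 LE text that begins with a BOM followed by U+0000 would be misread as UTF-32 LE here; that corner case is inherent to BOM sniffing, not specific to this sketch.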

Avernar · on June 20, 2016

From section 2.6 in the standard: "Use of a BOM is neither required nor recommended for UTF-8, but may be encountered in contexts where UTF-8 data is converted from other encoding forms that use a BOM or where the BOM is used as a UTF-8 signature."

Yes, it can be used to distinguish a UTF-8 stream but it's not recommended. One issue is that you can't tell whether the BOM bytes are actually valid text in some other non-Unicode encoding.
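
A small Python illustration of that ambiguity, using Windows-1252 as an assumed example of an "ANSI" code page: the three bytes EF BB BF are perfectly legal text there, so their presence alone doesn't prove a file is UTF-8.

```python
bom_bytes = b"\xef\xbb\xbf"

# Read as UTF-8, the three bytes are the single code point U+FEFF.
as_utf8 = bom_bytes.decode("utf-8")

# Read as Windows-1252, the same bytes are three ordinary printable characters.
as_cp1252 = bom_bytes.decode("cp1252")

print(repr(as_utf8))    # '\ufeff'
print(repr(as_cp1252))  # 'ï»¿'
```

This is also the garbage a tool shows when it reads a BOM-prefixed UTF-8 file as a single-byte code page.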

I'm curious where you've encountered missing content-encoding headers or other OOB indicators where it wasn't because of programmer error or laziness.

Const-me · on June 20, 2016

> it can be used to distinguish a UTF-8 stream but it's not recommended

If a specification says “something may be encountered”, for me, when I write my software, it means I must support that thing. Otherwise, the software won’t conform to the spec.

> I'm curious where you've encountered missing content-encoding headers or other OOB indicators where it wasn't because of programmer error or laziness.

Everywhere.

Most filesystems don’t have encoding headers for their text files. Most databases don’t have headers for their blob columns.

Only the web has encoding headers.

Avernar · on June 20, 2016

There's a difference between what a program should accept as input and what it should generate as output. The standard just says to expect a BOM on input and suggests not generating one on output. In other words, "A UTF-8 BOM is a bad idea but some yutz out there started doing it so we should ignore it on input". Someone else mentioned that the yutz was Microsoft.
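
A minimal Python sketch of that "tolerate on input, omit on output" policy. The helper names `read_text` and `write_text` are hypothetical; `utf-8-sig` is a standard Python codec that drops a leading BOM when decoding if one is present and otherwise behaves like plain `utf-8`.

```python
def read_text(data: bytes) -> str:
    # Accept input with or without a leading EF BB BF.
    return data.decode("utf-8-sig")


def write_text(text: str) -> bytes:
    # Plain "utf-8" never emits a BOM on output.
    return text.encode("utf-8")
```

For interop with a consumer that insists on the signature, encoding with `"utf-8-sig"` instead would prepend the BOM.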

I misread what you wrote about where you saw no indication that it was UTF-8. You were talking about places other than the web.

BOM for UTF-8 text files seems to be a Microsoft thing. Everyone else just defaults to UTF-8. But you can't be sure whether it's a UTF-8 BOM or text in some other encoding. Most editors let the user override what it is.

Why would you store text in a blob column? If a database can't handle UTF-8 in its text column it needs to be fixed (or taken out back and shot).

Const-me · on June 20, 2016

> There's a difference between what a program should accept as input and what it should generate as output.

I’m a Windows developer. In my world, a program should generate its output in whatever format user wants it to be.

When I press “File/Save as” in Visual Studio and click on the down arrow icon, I see a choice of more than 100 different encodings (including all flavors of Unicode with and without the BOM), and an independent choice of 3 line endings (Windows, Mac, Unix).

> BOM for UTF-8 text files seems to be a Microsoft thing

Practically — maybe, most Microsoft apps tend to understand those BOMs, and most *nix tools don’t, even on input.

Officially — definitely no, we both saw the spec on unicode.org.

Avernar · on June 20, 2016

> I’m a Windows developer. In my world, a program should generate its output in whatever format user wants it to be.

When generating output for a user, letting them choose is a good idea. But for interop with other programs I leave it off unless the program needs it.

> Officially — definitely no, we both saw the spec on unicode.org.

The spec says the BOM is optional. Some Microsoft programs however require it.

Const-me · on June 20, 2016

> for interop with other programs I leave it off unless the program needs it.

Plain text isn’t exactly a machine-friendly format.

If you want to interop with other programs, the good choice is e.g. XML. That has this encoding problem fixed as a part of the standard.

> The spec says the BOM is optional. Some Microsoft programs however require it.

Could you please name a Microsoft program that you think requires a BOM?

I’m asking because I have completely different experience. For me, Microsoft programs open text files just fine, with or without the BOM. But most *nix and osx programs show me garbage instead of BOM.

Avernar · on June 20, 2016

> Plain text isn’t exactly a machine-friendly format.

Works fine for unix. :D

> Could you please name a Microsoft program that you think requires a BOM?

Visual C++ off the top of my head. It mangles UTF-8 string literals without the BOM in the source code.

> For me, Microsoft programs open text files just fine, with or without the BOM. But most *nix and osx programs show me garbage instead of BOM.

That's what I was trying to say about the BOM being prevalent on the Windows side of the fence. Some programs require it, some always generate it, so most programs now accept it.

On the unix/osx side everyone switched to UTF-8 so the BOM is redundant. Everything is UTF-8, so the silliness of "this needs a BOM, that doesn't need a BOM" doesn't exist. Good example of what the "UTF-8 Everywhere" site is trying to promote.

Personally I really wish Microsoft would eventually fix their UTF-8 codepage. Would be so nice not having to convert to/from UTF-16 at the Win32 API boundary.

Const-me · on June 20, 2016

> Works fine for unix. :D

The trend towards higher-level data formats is universal across all OSes.

Even on Unix, users typically read HTML, write ODF or DOCX (both XML-based), print PostScript, etc.

Plain text is friendly towards developers. But it’s neither interop-friendly nor user friendly.

> Visual C++ off the top of my head

Only the C++ compiler. MS can’t change the compiler because backward compatibility. The IDE however works fine with such files.

> "UTF-8 Everywhere" site is trying to promote.

The transition is going to be expensive, because most languages and frameworks (C++/MFC/ATL/Qt, .NET languages, JVM languages, Python, etc.) have used Unicode (UCS-2 or UTF-16) strings for decades already.

To justify the costs, the benefits of the transition must be substantial.

And there aren’t any.

Avernar · on June 20, 2016

> Plain text is friendly towards developers. But it’s neither interop-friendly nor user friendly.

Kind of got off track here. You can process a lot of formats as text (HTML, CSS, XML, etc.). So a BOM there is unnecessary and sometimes detrimental. On the unix side there are a lot of text utilities that do useful things with these formats. That's probably why BOMs are nonexistent there.

> MS can’t change the compiler because backward compatibility.

You care to tell MS that? Every single time I've done a major VS upgrade my code had to be changed because something that was valid before stopped being valid.

> And there aren’t any.

If you can't see any benefit of using UTF-8 then I'm done debating with you.

the_mitsuhiko · on June 19, 2016

In UTF-8 a BOM can be placed to support round tripping the information with UTF-16.

umanwizard · on June 19, 2016

So what order do you put the BOM in? Does it not even matter?

Avernar · on June 19, 2016

The Unicode BOM is code point U+FEFF. The process of encoding it determines the byte order.

Encoded to UTF-8 it becomes EF BB BF. Encoding to UTF-16 big endian it will become FE FF. Encoding it to UTF-16 little endian it becomes FF FE.

Converting it back from UTF-8 always gives you U+FEFF since UTF-8 doesn't care about endianness. Converting it back from UTF-16 using the correct endianness gives you U+FEFF. Converting it using the wrong endianness gives you U+FFFE, which is defined by Unicode as a "noncharacter" that should never appear in text.
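
The byte sequences above can be checked with a few lines of Python (variable names are just for illustration). The explicit `-be`/`-le` codecs are used because Python's plain `utf-16` codec adds and consumes a BOM on its own.

```python
BOM = "\ufeff"

utf8 = BOM.encode("utf-8")          # EF BB BF
utf16_be = BOM.encode("utf-16-be")  # FE FF
utf16_le = BOM.encode("utf-16-le")  # FF FE

# Decoding little-endian bytes as big-endian swaps the two bytes,
# which yields the noncharacter U+FFFE instead of U+FEFF.
wrong_order = utf16_le.decode("utf-16-be")

print(utf8.hex(" "), utf16_be.hex(" "), utf16_le.hex(" "))
print(f"U+{ord(wrong_order):04X}")
```

Seeing U+FFFE at the start of a UTF-16 stream is therefore a reliable hint that the decoder picked the wrong byte order.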

umanwizard · on June 20, 2016

Makes sense, thanks :)