
There's a difference between what a program should accept as input and what it should generate as output. The standard just says to expect a BOM on input and suggests not generating one on output. In other words "A UTF-8 BOM is a bad idea but some yutz out there started doing it so we should ignore it on input". Someone else mentioned that the yutz was Microsoft.
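A minimal sketch of that accept-on-input, omit-on-output policy, assuming Python and its standard `utf-8-sig` codec. The helper names `read_text` and `write_text` are made up for illustration:

```python
import codecs


def read_text(data: bytes) -> str:
    # "utf-8-sig" drops one leading UTF-8 BOM if it is there
    # and otherwise decodes exactly like plain "utf-8".
    return data.decode("utf-8-sig")


def write_text(text: str) -> bytes:
    # Plain "utf-8" never writes a BOM.
    return text.encode("utf-8")


with_bom = codecs.BOM_UTF8 + "héllo".encode("utf-8")
without_bom = "héllo".encode("utf-8")

# Both inputs decode to the same string, and the output carries no BOM.
print(read_text(with_bom) == read_text(without_bom))
print(write_text("héllo").startswith(codecs.BOM_UTF8))
```

The same function handles both kinds of input, so callers never need to know whether the producer wrote a BOM.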

I misread what you wrote about where you saw no indication that it was UTF-8. You were talking about places other than the web.

BOM for UTF-8 text files seems to be a Microsoft thing. Everyone else just defaults to UTF-8. But you can't be sure that it's a UTF-8 BOM or some other encoding. Most editors let the user override what it is.
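To illustrate why a BOM is only a hint, here is a rough sketch of BOM sniffing in Python. The `sniff_bom` helper and its table are hypothetical; only the `codecs.BOM_*` constants come from the standard library. The result is a guess, not a guarantee: bytes that look like a BOM could also be legitimate data in some other encoding, which is presumably why editors offer a manual override.

```python
import codecs

# Longest BOMs first: the UTF-32 LE BOM (FF FE 00 00) begins with
# the same two bytes as the UTF-16 LE BOM (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(data: bytes, default: str = "utf-8"):
    """Return (encoding guess, BOM length in bytes)."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    # No BOM found: fall back to the caller's default.
    return default, 0


encoding, skip = sniff_bom(codecs.BOM_UTF16_LE + "hi".encode("utf-16-le"))
print(encoding, skip)
```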

Why would you store text in a blob column? If a database can't handle UTF-8 in its text column it needs to be fixed (or taken out back and shot).




> There's a difference between what a program should accept as input and what it should generate as output.

I’m a Windows developer. In my world, a program should generate its output in whatever format the user wants it to be.

When I press “File/Save as” in Visual Studio and click on the down arrow icon, I see a choice of more than 100 different encodings (including all flavors of Unicode with and without the BOM), and an independent choice of 3 line endings (Windows, Mac, Unix).

> BOM for UTF-8 text files seems to be a Microsoft thing

Practically — maybe, most Microsoft apps tend to understand those BOMs, and most *nix tools don’t, even on input.

Officially — definitely no, we both saw the spec on unicode.org.


> I’m a Windows developer. In my world, a program should generate its output in whatever format the user wants it to be.

When generating output for a user, letting them choose is a good idea. But for interop with other programs I leave it off unless the program needs it.

> Officially — definitely no, we both saw the spec on unicode.org.

The spec says the BOM is optional. Some Microsoft programs however require it.


> for interop with other programs I leave it off unless the program needs it.

Plain text isn’t exactly a machine-friendly format.

If you want to interop with other programs, the good choice is e.g. XML. That has this encoding problem fixed as a part of the standard.
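As a small illustration of the XML point, assuming Python's standard `xml.etree.ElementTree`: the parser reads the encoding from the declaration inside the document, so the consumer does not have to guess. The `parse_text` helper and the sample documents are made up for this sketch.

```python
import xml.etree.ElementTree as ET


def parse_text(raw: bytes) -> str:
    # The parser picks the encoding from the XML declaration in `raw`.
    return ET.fromstring(raw).text


# The same content serialized in two different encodings,
# each one declaring its own encoding in the first line.
latin1_doc = '<?xml version="1.0" encoding="ISO-8859-1"?><note>café</note>'.encode("iso-8859-1")
utf8_doc = '<?xml version="1.0" encoding="UTF-8"?><note>café</note>'.encode("utf-8")

print(latin1_doc == utf8_doc)                          # the bytes differ
print(parse_text(latin1_doc) == parse_text(utf8_doc))  # the parsed text does not
```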

> The spec says the BOM is optional. Some Microsoft programs however require it.

Could you please name a Microsoft program that you think requires a BOM?

I’m asking because I have a completely different experience. For me, Microsoft programs open text files just fine, with or without the BOM. But most *nix and osx programs show me garbage instead of BOM.
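For anyone wondering what that garbage looks like, here is a quick Python sketch. It assumes the receiving tool either decodes the bytes as Windows-1252 or decodes as UTF-8 without stripping the BOM; which of these a given program does is an assumption, not something stated in this thread.

```python
import codecs

data = codecs.BOM_UTF8 + b"hello"

# A tool that assumes Windows-1252 turns the three BOM bytes into visible junk.
as_cp1252 = data.decode("cp1252")

# A tool that decodes UTF-8 but does not strip the BOM keeps an
# invisible U+FEFF character at the start of the text.
as_utf8 = data.decode("utf-8")

# A BOM-aware decode drops it.
as_utf8_sig = data.decode("utf-8-sig")

print(as_cp1252)      # ï»¿hello
print(repr(as_utf8))  # '\ufeffhello'
print(as_utf8_sig)    # hello
```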


> Plain text isn’t exactly a machine-friendly format.

Works fine for unix. :D

> Could you please name a Microsoft program that you think requires a BOM?

Visual C++ off the top of my head. It mangles UTF-8 string literals without the BOM in the source code.

> For me, Microsoft programs open text files just fine, with or without the BOM. But most *nix and osx programs show me garbage instead of BOM.

That's what I was trying to say about the BOM being prevalent on the Windows side of the fence. Some programs require it, some always generate it, so most programs now accept it.

On the unix/osx side everyone switched to UTF-8 so the BOM is redundant. Everything is UTF-8, so the silliness of "this needs a BOM, that doesn't need a BOM" doesn't exist. Good example of what the "UTF-8 Everywhere" site is trying to promote.

Personally I really wish Microsoft would eventually fix their UTF-8 codepage. Would be so nice not having to convert to/from UTF-16 at the Win32 API boundary.


> Works fine for unix. :D

The trend towards higher-level data formats is universal across all OSes.

Even on Unix, users typically read HTML, write ODF or DOCX (both being XML), print PostScript, etc.

Plain text is friendly towards developers. But it’s neither interop-friendly nor user friendly.

> Visual C++ off the top of my head

Only the C++ compiler. MS can’t change the compiler because backward compatibility. The IDE however works fine with such files.

> "UTF-8 Everywhere" site is trying to promote.

The transition is going to be expensive, because most languages and frameworks (C++/MFC/ATL/Qt, .NET languages, JVM languages, Python, etc) have used Unicode (UCS-2 or UTF-16) strings for decades already.

To justify the costs, the benefits of the transition must be substantial.

And there aren’t any.
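On the UCS-2 versus UTF-16 distinction mentioned above, a short Python sketch may help. UCS-2 is fixed at one 16-bit unit per character, while UTF-16 uses a surrogate pair for anything outside the Basic Multilingual Plane. The `utf16_units` helper is hypothetical.

```python
def utf16_units(text: str) -> int:
    # Each UTF-16 code unit is two bytes; "utf-16-le" adds no BOM.
    return len(text.encode("utf-16-le")) // 2


bmp_char = "\u20ac"         # EURO SIGN, inside the BMP
astral_char = "\U0001F600"  # GRINNING FACE, outside the BMP

print(utf16_units(bmp_char))             # 1
print(utf16_units(astral_char))          # 2 (a surrogate pair)
print(len(astral_char.encode("utf-8")))  # 4 bytes in UTF-8
```

Code that indexes such strings by 16-bit unit, as if they were UCS-2, can split the pair in the middle.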


> Plain text is friendly towards developers. But it’s neither interop-friendly nor user friendly.

Kind of got off track here. You can process a lot of formats as text (html, css, xml, etc). So a BOM there is unnecessary and sometimes detrimental. On the unix side there are a lot of text utilities that do useful things that you can do on these formats. That's probably why BOMs are nonexistent there.
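Two ways a BOM can get in the way of byte-oriented text tools, sketched in Python under the assumption that the tool does a plain prefix check or a `cat`-style concatenation. The sample document is made up.

```python
import codecs

doc = codecs.BOM_UTF8 + b'<?xml version="1.0"?><a/>'

# 1. A naive "does this look like XML?" prefix check fails,
#    because the file really starts with EF BB BF.
looks_like_xml = doc.startswith(b"<?xml")

# 2. Concatenating two such files leaves the second BOM in the
#    middle of the stream, where no decoder will strip it.
joined = doc + doc
text = joined.decode("utf-8-sig")  # removes only the leading BOM
stray_boms = text.count("\ufeff")

print(looks_like_xml)  # False
print(stray_boms)      # 1
```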

> MS can’t change the compiler because backward compatibility.

You care to tell MS that? Every single time I've done a major VS upgrade my code had to be changed because something that was valid before stopped being valid.

> And there aren’t any.

If you can't see any benefit of using UTF-8 then I'm done debating with you.



