
>That HTML you just fetched? How do you know it's Unicode?

The headers contain information about the charset. If the charset isn't specified, then only god knows which encoding was used. This applies to all encodings: if the encoding isn't specified, you can't interpret the bytes.
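
For what it's worth, a minimal sketch of that first step in Python 3, assuming the server actually declares a charset in its Content-Type header (the URL and the UTF-8 fallback here are just placeholders):

    from urllib.request import urlopen
    resp = urlopen("https://example.com/")                    # hypothetical page
    charset = resp.headers.get_content_charset() or "utf-8"   # charset from the header, or a guess
    html = resp.read().decode(charset, errors="replace")      # only now is it a string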

>That .txt file the user just asked to load? How do you know that's Unicode?

If you don't know which encoding was used, then you simply cannot interpret the file as a string; if the encoding isn't specified, you can't interpret the file.

>For heaven's sake, when can you actually guarantee that even sys.stdin.read() is going to read Unicode?

Again, if the encoding isn't specified then all bets are off. This is an inherent problem with Unix pipes. Text isn't any different from, say, a protobuf message: you have to know how to interpret it, otherwise it's just a raw byte array without any meaning.

>What do you do when your fundamentally invalid assumptions break? Do you just not care and simply present a stack trace to the user and tell them to get lost?

I don't understand you at all. Just load it as a byte array if you don't care about the encoding. If you do care about the encoding, then tough luck: you're never going to understand the meaning of that text unless it's in an agreed-upon encoding like UTF-8, and in that case the assumption of always choosing UTF-8 is part of the value proposition.

Let me tell you why reading a text file as a byte array and pretending that character encodings don't exist is a bad idea. There are lots of Asian character encodings that don't even contain the Latin alphabet. Now imagine you are running source.replace("Donut", "Bagel"). What meaning does that operation have on a byte array? It doesn't have any.

That operation simply cannot be implemented at all if you don't know the encoding. So if you were to choose the Python 2 way, you would have to either remove all string operations from the language or force the user to specify the encoding on every operation.

A string literal like "Donut" isn't just a string literal. It has a representation, and you first have to convert the logical string into a byte array that matches the representation of the source string. Let's say your Python program is loading UTF-16 text. Instead of simply specifying the encoding, you just load the text without any encoding. If you wanted to run the replace operation, it would have to look something like this: source.replace("Donut".getBytes("UTF-16"), "Bagel".getBytes("UTF-16")). This is because you need to convert all string literals to match the encoding of the text that you want to replace.
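
To make that concrete, here is a minimal Python sketch of what that byte-level replace has to look like (utf-16-le is used so no BOM sneaks into the literals; the sample text is made up):

    data = "I bought a Donut today".encode("utf-16-le")   # pretend this arrived with no declared encoding
    # The ASCII/UTF-8 bytes b"Donut" never occur in the UTF-16 stream,
    # because every character is interleaved with NUL bytes:
    assert data.replace(b"Donut", b"Bagel") == data
    # The getBytes("UTF-16")-style dance from above, spelled in Python:
    fixed = data.replace("Donut".encode("utf-16-le"), "Bagel".encode("utf-16-le"))
    assert fixed.decode("utf-16-le") == "I bought a Bagel today"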

Well, doesn't this cause a pretty huge problem? You now need to have a special type just for string literals because the runtime string type can use any encoding and therefore isn't guaranteed to be able to represent the logical value of a literal. Isn't that extremely weird?




I'm too tired of these to reply to everything, so I'll just reply to the first bit and rest my case. It's like you're completely ignoring the fact that <meta charset="UTF-8"> and <?xml encoding="UTF-8"...?> and all that are actually things in the real world. You can't just treat them as strings until you read their bytes, was my point. The notion that the user can or should always provide you out-of-band encoding info or otherwise let you assume UTF-8 everywhere every time you read a file or stdin is just a fantasy and not how so many of our tools work.
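
That chicken-and-egg problem looks roughly like this in practice. A rough sketch, assuming a local page.html whose charset is declared in a <meta> tag (the regex and the UTF-8 fallback are simplifications, not how a real HTML parser sniffs it):

    import re
    raw = open("page.html", "rb").read()                      # bytes; encoding still unknown
    m = re.search(b'charset=["\']?([-A-Za-z0-9_]+)', raw[:1024])
    encoding = m.group(1).decode("ascii") if m else "utf-8"   # the answer was inside the bytes themselves
    text = raw.decode(encoding, errors="replace")             # only now can it become a string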


So treat them as bytes. It's not like Python 3 removed that type. It just made it impossible to inadvertently treat bytes as a string in a certain encoding - unlike Python 2, which would happily implicitly decode assuming ASCII.


> So treat them as bytes.

Which was my entire point!! You have to go to bytes to get correct behavior. They didn't fix the nonsense by changing the default data type to a string, they just made it even more roundabout to write correct code.

> It just made it impossible to inadvertently treat bytes as a string in a certain encoding

It most certainly did not! It's like you completely ignored what I just told you. I already gave you an example: sys.stdin.read(). It assumes some encoding when you really can't ever guarantee any encoding, and the case where the encoding info is itself embedded in the byte stream is the normal case. How can you know a priori what the user piped in? Are you sure users magically know every stream's encoding and are just neglecting to provide it to you? At least if streams were bytes by default, you'd maintain correct state and only have to worry about encoding/decoding at the I/O boundary. (And to top off the insanity, it's not even UTF-8 everywhere; on Windows it's CP-1252 or something, so you can't even rely on the default I/O being portable across platforms, even for text, let alone arbitrary bytes. This insanity was there in Python 2, but they sure didn't make it better by moving from bytes to text as the default...)
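
For reference, this is roughly how you can see what Python 3 has silently decided for you; the exact values depend on platform and locale, and 'cp1252' is just one possible Windows answer:

    import locale
    import sys
    print(sys.stdin.encoding)                  # whatever Python picked for stdin, e.g. 'utf-8'
    print(sys.stdout.encoding)                 # can differ again when output is redirected to a file
    print(locale.getpreferredencoding(False))  # the locale default these are usually derived from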


Sure it did. Here's an easy test, using your own test case with stdin:

   Python 2.7.17 (v2.7.17:c2f86d86e6, Oct 19 2019, 21:01:17) [MSC v.1500 64 bit (AMD64)] on win32
   Type "help", "copyright", "credits" or "license" for more information.
   >>> s = raw_input()
   abc
   >>> s
   'abc'
   >>> s + u"!"
   u'abc!'

So it was bytes after reading, and it became Unicode implicitly as soon as it was mixed with a Unicode string. And guess what encoding it used to implicitly decode those bytes? Not the locale encoding - ASCII. Which is why there's tons of code like this that works on ASCII inputs, fails as soon as it sees something different - and the people who wrote it have no idea that it's broken.
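
And here is roughly what the failure mode looks like the moment the input isn't ASCII - same kind of session, but feeding it the two UTF-8 bytes of 'é' by hand:

    >>> s = '\xc3\xa9'   # what raw_input() hands you for an é typed in a UTF-8 terminal
    >>> s + u"!"
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)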

Python 2 did this implicit conversion because it allowed APIs to return either bytes or unicode objects, and the API client could basically pretend there's no difference (again, only for ASCII in practice). By removing the conversion, Python 3 forced developers to think about whether the data they're working with is text or binary, and to apply the correct encoding if it's binary data that is actually encoded text. This is exactly encoding/decoding at the I/O boundary!

The fact that sys.stdout encoding varies between platforms is a feature, not a bug. For text data, locale defines the encoding; so if you are treating stdin and stdout as text, then Python 3 will use locale encoding to encode/decode at the aforementioned I/O boundary, as other apps expect it to do (e.g. if you pipe the output). This is exactly how every other library or framework that deals with Unicode text works; how is that "insanity"?

Now, if you actually want to work with binary data for stdio, then you need to use the underlying binary buffer objects: sys.stdin.buffer and sys.stdout.buffer. Those have read() and write() that deal with raw bytes. The point, again, is that you are forced to consider your choices and their consequences. It's not the same API that tries to cover both binary and text input, and ends up with unsafe implicit conversions because that's the only way to make it look even remotely sane.
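
A minimal sketch of the two paths, assuming you have decided (by whatever means) that the input really is UTF-8-encoded text - the encoding choice in the binary path is yours, not Python's:

    import sys
    # Text path: Python 3 decodes for you at the boundary, using its default encoding.
    text = sys.stdin.read()                           # -> str, already decoded
    # Binary path (an alternative to the above): take raw bytes and decide yourself.
    raw = sys.stdin.buffer.read()                     # -> bytes, nothing decoded
    decoded = raw.decode("utf-8", errors="replace")   # the encoding choice is now explicit
    sys.stdout.buffer.write(decoded.encode("utf-8"))  # and bytes back out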

The only thing I could blame Python 3 for here is that sys.stdin is implicitly text. It would be better to force API clients to be fully explicit - e.g. requiring people to use either sys.stdin.text or sys.stdin.binary. But either way, this is strictly better than Python 2.


> The fact that sys.stdout encoding varies between platforms is a feature, not a bug. [...] This is exactly how every other library or framework that deals with Unicode text works; how is that "insanity"?

No, it's utterly false that every other framework does it. Where do you even get this idea? Possibly the closest language to Python is Ruby. Have you tried to see what it does? Run ruby -e "$stdout.write(\"\u2713\")" > temp.txt in the Command Prompt and then tell me you face the same nonsensical Unicode error that you get from Python (python -c "import sys; sys.stdout.write(u\"\u2713\")" > temp.txt). The notion that writing text on one platform and reading it back on another should produce complete garbage is absolute insanity. You're literally saying that even if I write some text to a file on Windows and then read it back on Linux with the same program, on the same machine, from the same file system, it's somehow the right thing to behave inconsistently and interpret it as complete garbage?? This means that if you install Linux for your grandma and have her open a note she saved on Windows, she's supposed to actively want to read mojibake?? I mean, I guess people are weird, so maybe you or your grandma find that to be a sane state of affairs, but neither me, nor my grandma, nor my programs (...are they my babies in this analogy?) would expect to see gibberish when reading the same file with the same program...
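
To spell out the grandma scenario as a single-process simulation (a sketch with made-up text, where cp1252 stands in for the Windows default and UTF-8 for the Linux one):

    note = u"Grandma's caf\u00e9 notes \u2713"
    # Saved on a machine whose default text encoding happened to be cp1252
    # (the check mark isn't even representable there, so it's already lossy):
    blob = note.encode("cp1252", errors="replace")
    # Opened later on a machine whose default is UTF-8:
    print(blob.decode("utf-8", errors="replace"))     # the accented character comes back as mojibake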

As for "Python 3 forced developers to think whether the data that they're working with is text or binary", well, it made them think even more than they already had to, alright. That happens as a result of breaking stuff even more than it happens as a result of fixing stuff. And what I've been trying to tell you repeatedly is that this puristic distinction between "text" and "binary" is a fantasy and utterly wrong in most of the scenarios where it's actually made, and that your "well then just use bytes" argument is literally what I've been pointing out is the only solution, and it's much closer to what Python 2 was doing. This isn't even something that's somehow tricky. If you write binary files at all, you know there's absolutely no reason why you can't mix and match encodings in a single stream. You also know it's entirely reasonable to record the encoding inside the file itself. But regardless, just in case this was a foreign notion, I gave you multiple examples of this that are incredibly common—HTML, XML, stdio, text files... and you just dodged my point. I'll repeat myself: when you read text—if you can even guarantee it's text in the first place (which you absolutely cannot do everywhere Python 3 does)—it is likely to have an encoding that neither you nor the user can know a priori until after you've read it and examined its bytes. XML/HTML/BOM/you name it. You have to deal with bytes until you make that determination. The fact that you might read complete garbage if you read back the same file your own program wrote on another platform just adds insult to the injury.

But anyway. You know full well that I never suggested everything was fine in Python 2 and that everything broke in Python 3. I was extremely clear that a lot of this was already a problem, and that some stuff did in fact improve. It's the other stuff, which got worse and became even harder to address, that's the problem I've been talking about. So it's a pretty illegitimate counterargument to cherry-pick some random bit about an implicit conversion that actually happened to improve. At best you'll derail the argument into a discussion about alternative approaches to solving those problems (which BTW actually do exist) and distract me. But I'm not about to waste my energy like this, so I'm going to have to leave this as my last comment.


Every other language and framework, as in Java, C#, everything Apple, and most popular C++ UI frameworks.

Ruby is actually the odd one out with its "string is bytes + encoding" approach, and that's mostly because its author is Japanese - Japan is not entirely sold on Unicode, for some legitimate reasons. This approach also has some interesting consequences - e.g. it's possible for string concatenation to fail, because there's no unified representation for both operands.


> Possibly the closest language to Python is Ruby.

Not really; they are similar in that they are dynamic scripting languages, but philosophically and in terms of almost every implementation decision, they are pretty radically opposed.



