ASCII by default is only an accident of history. It's going to be a slow, painful process, but all human-readable text is going to be Unicode at some point. For historical reasons you'll still have to encode character data into a vector of bytes to send it down the pipe, but there's no reason why we shouldn't be explicit about it.
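For instance, in Python 3 the encoding step is spelled out instead of implied (a minimal sketch; UTF-8 here is just the obvious choice, not a requirement):

    import sys

    message = "héllo"                  # str: Unicode code points, no encoding yet
    payload = message.encode("utf-8")  # bytes: the explicit step down to the wire

    sys.stdout.buffer.write(payload)   # byte-level interfaces accept only bytes
    sys.stdout.buffer.write(b"\n")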
The pain is painful [in Python 3] primarily for library authors and only at the extremities. If you author your libraries properly your users won't even notice the difference. And in the end as more protocols and operating systems adopt better encodings for Unicode support that pain will fade (I'm looking at you, surrogateescape).
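To make the surrogateescape jab concrete: it's the error handler Python 3 uses to smuggle undecodable bytes through str and back out unchanged (a small sketch):

    raw = b"caf\xe9"  # Latin-1 bytes; invalid as UTF-8

    text = raw.decode("utf-8", errors="surrogateescape")
    print(ascii(text))  # 'caf\udce9' -- the bad byte rides along as a lone surrogate

    assert text.encode("utf-8", errors="surrogateescape") == raw  # lossless round trip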
It's better to be ahead of the curve on this transition so that users of the language and our libraries won't get stuck with it. Python 2 made users have to think (or forget) about Unicode and get it wrong every time... the sheer amount of work I've put into fixing codebases that mixed bytes and unicode objects without thinking about it made me a lot of money, but I'm sure it cost me a few years of my life.
I was careful to say "Unix strings", not "ASCII". A Unix string contains no nul byte, but that's about the only rule. It's certainly not necessarily human-readable.
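Concretely, any byte value except nul is fair game:

    name = b"\xff\x01 not text \x80"  # a perfectly legal Unix string
    assert b"\x00" not in name        # no nul byte: that's the only rule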
I don't think a programming language can take the position that an OS needs to "adopt better encodings". Python must live in the environment that the OS actually provides. It's probably a vain hope that Unix strings will vanish in anything less than decades (if ever), given the ubiquity of Unix-like systems and their 40 years of history.
I understand that Python2 does not handle Unicode well. I point out that Python3 does not handle Unix strings well. It would be good to have both.
This is the first time I've encountered the idiom "Unix strings". I'll map it to "array of bytes" in my table of idioms.
> I don't think a programming language can take the position that an OS needs to "adopt better encodings".
I do think that programming languages should take a position on things, including, but not limited to, how data is represented and interpreted within the language. A language is expected to provide some abstractions, and whether a string is an array of bytes or an array of characters is a consideration for the language designer, who will end up designing a language that takes one side or the other.
Python has taken the side of the language user: it enabled Unicode names, defaulted to Unicode strings, defaulted to classes being subclasses of the 'object' class... Unix has taken the side of the machine (which was the side to take at the time of Unix's inception).
> [...] probably a vain hope that Unix strings will vanish [...]
It's a vain hope only if we merely wait for them to vanish, doing nothing to improve things.
> Python must live in the environment that the OS actually provides.
Yes, Python must indeed live in the OS's environment. Regardless, one need not become a farmer just because one lives among farmers, need one?
> This is the first time I've encountered the idiom "Unix strings"
The usual idiom is C-strings, but I wanted to emphasize the OS, not the language C.
>> [...] probably a vain hope that Unix strings will vanish [...]
>If only we wait for them to vanish, doing nothing to improve.
The article is about the lack of Python3 adoption. In my case, Python3's poor handling of Unix/C strings is friction. It sounds like you believe that Unix/C strings can be made to go away in the near future. I do not believe this. (I'm not even certain that it's a good idea.)
I do not insist that C strings must die; I insist that C strings are indeed arrays of bytes, and we cannot use them to represent text correctly at present. I fully support strings being Unicode by default in Python, as most people will put text between double quotes, not a bunch of bytes represented by textual characters.
I do not expect C or Unix interpretations of strings to change, but I believe they must be considered low-level, and that the user of a higher-level language should have to explicitly ask the compiler to interpret a piece of data in that fashion.
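That is, in fact, how Python 3 behaves: bytes and str are distinct types, and the textual interpretation has to be requested explicitly (a quick sketch):

    data = b"caf\xc3\xa9"        # low-level: an array of bytes, nothing more

    # data + "!"                 # TypeError: Python 3 refuses to mix bytes and str
    text = data.decode("utf-8")  # the explicit request to interpret the bytes as text
    print(text)                  # café
    print(data[3])               # 195 -- indexing bytes yields raw integers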
My first name is "Göktuğ". Honestly, which of the following do you think is more desirable for me?
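A sketch of the contrast in Python 3, assuming UTF-8 (the byte string is roughly what Python 2 would hand you by default):

    name = "Göktuğ"
    print(len(name))           # 6 -- six characters, as a human would count
    raw = name.encode("utf-8")
    print(raw)                 # b'G\xc3\xb6ktu\xc4\x9f'
    print(len(raw))            # 8 -- ö and ğ take two bytes each
    # raw[:2].decode("utf-8")  # UnicodeDecodeError: the slice cuts ö in half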
I'm not arguing against you. I just don't write any code that has to deal with people's names, so that's just not a problem that I face. I fully acknowledge that lack of Unicode is a big problem of Python2, but it's not my problem.
A Unix filename, on the other hand, might be any sort of C string. This sort of thing is all over Unix, not just filenames. (When I first installed Python3 at work, back when 3.0 (3.1?) came out, one of the self-tests failed when it tried to read an unkosher string in our /etc/passwd file.) When I code in Python2, or Perl, or C, or Emacs Lisp, I don't need to worry about these C strings. They just work.
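For the record, Python 3 can be made to read such a file without choking, but you have to ask explicitly (a sketch; surrogateescape carries the unkosher bytes through instead of raising):

    # Plain open() decodes with the locale's encoding and raises
    # UnicodeDecodeError on bytes it can't make sense of. This doesn't:
    with open("/etc/passwd", encoding="utf-8", errors="surrogateescape") as f:
        for line in f:
            print(line.split(":", 1)[0])  # usernames survive, bad bytes and all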
My inquiry, somewhere up this thread, is whether or not it would be possible to solve both problems. (Perhaps by defaulting to utf-8 instead of ASCII. I don't know, I'm not a language designer.)
> I insist that C strings are indeed arrays of bytes, and we cannot use them to represent text correctly at present
OK, maybe I do see one small point to argue. A C string, such as one that might be used in Unix, is not necessarily text. But text, represented as utf-8, is a C string.
It seems like there's something to leverage here, at least for those points at which Python3 interacts with the OS.
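Python 3 does leverage exactly that at the OS boundary: os.fsdecode() turns any Unix string into a str (undecodable bytes become surrogates) and os.fsencode() reverses it losslessly (a sketch):

    import os

    raw_name = b"report-\xff.txt"         # a legal Unix filename, not valid UTF-8

    name = os.fsdecode(raw_name)          # str, with \xff carried as a surrogate
    assert os.fsencode(name) == raw_name  # round-trips to the exact original bytes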