ASCII by default is only an accident of history. It's going to be a slow, painful process, but all human-readable text is going to be Unicode at some point. For historical reasons you'll still have to encode character data into a vector of bytes to send it down the pipe, but there's no reason why we shouldn't be explicit about it.
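For instance, in Python 3 the encoding step is spelled out instead of implied (a minimal sketch; UTF-8 here is just the obvious choice, not a requirement):

    import sys

    message = "héllo"                  # str: Unicode code points, no encoding yet
    payload = message.encode("utf-8")  # bytes: the explicit step down to the wire

    sys.stdout.buffer.write(payload)   # byte-level interfaces accept only bytes
    sys.stdout.buffer.write(b"\n")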
The pain is painful [in Python 3] primarily for library authors and only at the extremities. If you author your libraries properly your users won't even notice the difference. And in the end as more protocols and operating systems adopt better encodings for Unicode support that pain will fade (I'm looking at you, surrogateescape).
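To make the surrogateescape jab concrete: it's the error handler Python 3 uses to smuggle undecodable bytes through str and back out unchanged (a small sketch):

    raw = b"caf\xe9"  # Latin-1 bytes; invalid as UTF-8

    text = raw.decode("utf-8", errors="surrogateescape")
    print(ascii(text))  # 'caf\udce9' -- the bad byte rides along as a lone surrogate

    assert text.encode("utf-8", errors="surrogateescape") == raw  # lossless round trip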
It's better to be ahead of the curve on this transition so that users of the language and our libraries won't get stuck with it. Python 2 made users have to think (or forget) about Unicode and get it wrong every time... the sheer amount of work I've put into fixing codebases that mixed bytes and unicode objects without thinking about it made me a lot of money, but I'm sure it cost me a few years of my life.
I was careful to say "Unix strings", not "ASCII". A Unix string contains no nul byte, but that's about the only rule. It's certainly not necessarily human-readable.
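Concretely, any byte value except nul is fair game:

    name = b"\xff\x01 not text \x80"  # a perfectly legal Unix string
    assert b"\x00" not in name        # no nul byte: that's the only rule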
I don't think a programming language can take the position that an OS needs to "adopt better encodings". Python must live in the environment that the OS actually provides. It's probably a vain hope that Unix strings will vanish in anything less than decades (if ever), given the ubiquity of Unix-like systems and their 40 years of history.
I understand that Python2 does not handle Unicode well. I point out that Python3 does not handle Unix strings well. It would be good to have both.
This is the first time I've encountered the idiom "Unix strings". I'll map it to "array of bytes" in my table of idioms.
> I don't think a programming language can take the position that an OS needs to "adopt better encodings".
I do think that programming languages should take a position on things, including, but not limited to, how data is represented and interpreted within the language. A language is expected to provide some abstractions, and whether a string is an array of bytes or an array of characters is a consideration for the language designer, who will end up designing a language that takes one side or the other.
Python has taken the side of the language user: it enabled Unicode names, defaulted to Unicode strings, defaulted to classes being subclasses of the 'object' class... Unix has taken the side of the machine (which was the side to take at the time of Unix's inception).
> [...] probably a vain hope that Unix strings will vanish [...]
It's a vain hope only if we merely wait for them to vanish, doing nothing to improve things.
> Python must live in the environment that the OS actually provides.
Yes, Python must indeed live in the OS's environment. Regardless, one need not become a farmer just because one lives among farmers, need one?
> This is the first time I've encountered the idiom "Unix strings"
The usual idiom is C-strings, but I wanted to emphasize the OS, not the language C.
>> [...] probably a vain hope that Unix strings will vanish [...]
>If only we wait for them to vanish, doing nothing to improve.
The article is about the lack of Python3 adoption. In my case, Python3's poor handling of Unix/C strings is friction. It sounds like you believe that Unix/C strings can be made to go away in the near future. I do not believe this. (I'm not even certain that it's a good idea.)
I do not insist that C strings must die; I insist that C strings are indeed arrays of bytes, and we cannot use them to represent text correctly at present. I fully support strings being Unicode by default in Python, as most people will put text between double quotes, not a bunch of bytes represented by textual characters.
I do not expect C or Unix interpretations of strings to change, but I believe they must be considered low-level, and that the user of a higher-level language should have to explicitly ask the compiler to interpret a piece of data in that fashion.
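That is, in fact, how Python 3 behaves: bytes and str are distinct types, and the textual interpretation has to be requested explicitly (a quick sketch):

    data = b"caf\xc3\xa9"        # low-level: an array of bytes, nothing more

    # data + "!"                 # TypeError: Python 3 refuses to mix bytes and str
    text = data.decode("utf-8")  # the explicit request to interpret the bytes as text
    print(text)                  # café
    print(data[3])               # 195 -- indexing bytes yields raw integers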
My first name is "Göktuğ". Honestly, which of the following do you think is more desirable for me?
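A sketch of the contrast in Python 3, assuming UTF-8 (the byte string is roughly what Python 2 would hand you by default):

    name = "Göktuğ"
    print(len(name))           # 6 -- six characters, as a human would count
    raw = name.encode("utf-8")
    print(raw)                 # b'G\xc3\xb6ktu\xc4\x9f'
    print(len(raw))            # 8 -- ö and ğ take two bytes each
    # raw[:2].decode("utf-8")  # UnicodeDecodeError: the slice cuts ö in half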
I'm not arguing against you. I just don't write any code that has to deal with people's names, so that's just not a problem that I face. I fully acknowledge that lack of Unicode is a big problem of Python2, but it's not my problem.
A Unix filename, on the other hand, might be any sort of C string. This sort of thing is all over Unix, not just filenames. (When I first installed Python3 at work, back when 3.0 (3.1?) came out, one of the self-tests failed when it tried to read an unkosher string in our /etc/passwd file.) When I code in Python2, or Perl, or C, or Emacs Lisp, I don't need to worry about these C strings. They just work.
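For the record, Python 3 can be made to read such a file without choking, but you have to ask explicitly (a sketch; surrogateescape carries the unkosher bytes through instead of raising):

    # Plain open() decodes with the locale's encoding and raises
    # UnicodeDecodeError on bytes it can't make sense of. This doesn't:
    with open("/etc/passwd", encoding="utf-8", errors="surrogateescape") as f:
        for line in f:
            print(line.split(":", 1)[0])  # usernames survive, bad bytes and all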
My inquiry, somewhere up this thread, is whether or not it would be possible to solve both problems. (Perhaps by defaulting to utf-8 instead of ASCII. I don't know, I'm not a language designer.)
> I insist that C strings are indeed arrays of bytes, and we cannot use them to represent text correctly at present
OK, maybe I do see one small point to argue. A C string, such as one that might be used in Unix, is not necessarily text. But text, represented as utf-8, is a C string.
It seems like there's something to leverage here, at least for those points at which Python3 interacts with the OS.
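Python 3 does leverage exactly that at the OS boundary: os.fsdecode() turns any Unix string into a str (undecodable bytes become surrogates) and os.fsencode() reverses it losslessly (a sketch):

    import os

    raw_name = b"report-\xff.txt"         # a legal Unix filename, not valid UTF-8

    name = os.fsdecode(raw_name)          # str, with \xff carried as a surrogate
    assert os.fsencode(name) == raw_name  # round-trips to the exact original bytes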