
The problem isn't that Python 2 is bad. Python 2 is a fantastic language. The problem is that the maintainers of Python decided to break backwards compatibility and force library developers to support what are essentially two different programming languages.



I don’t think there was any clear path around that, though. The single biggest change was that Python 2 pretended that text and binary data were the same datatype, where Python 3 correctly makes you distinguish between the two. There’s not really a great way to roll out that major change without breaking tons of stuff along the way. And, well, if you’re already making a backward-incompatible version, here’s this checklist of other breaking changes you might as well bring along for the ride.
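
A minimal Python 3 sketch of that distinction (Python 2 would have coerced the two types silently):

    # Python 3: text (str) and binary data (bytes) are separate types
    text = 'naïve'
    data = text.encode('utf-8')   # an explicit encoding step

    try:
        text + data               # mixing the two is a hard error
    except TypeError as exc:
        print(exc)                # str and bytes never mix implicitly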


And that raises an obvious question: why didn’t every other programming language immediately break backwards compatibility when UTF-8 became a de facto standard?

> And, well, if you’re already making a backward-incompatible version, here’s this checklist of other breaking changes you might as well bring along for the ride.

Sorry, that doesn’t track. Treating quoted strings as UTF-8 by default instead of ASCII-or-arbitrary-bytes would have been a small migration that would not have taken over a decade to complete.


The way Perl dealt with this was to have you declare when you are using UTF-8.

    # declare that the code itself is written in utf8
    use utf8;

    my $ā = 'ā';
If you need unicode strings to work, you turn on the unicode strings feature.

    use feature 'unicode_strings';
Another way to turn it on:

    use v5.12;
(Declaring which version of the language you need is something you should do anyway.)

Really mostly what you have to do is declare the encodings of the file handles.

    # change it for all files
    use open ':encoding(UTF-8)';
To change it per file handle, you would use `binmode`. (Which was originally added to allow binary files to be handled correctly on Windows.)

    open my $fh, '<', 'example.txt'
        or die "Can't open example.txt: $!";
    binmode $fh, ':utf8';
(Declaring the encoding of an opened file is something you should do anyway.)

---

Basically Perl just defaults to its original behavior. If you need a new feature that would break old code, you just declare that you need it.

Because of that, most code that was written for an earlier form of Perl still works on the latest version.


Because many of these languages were created when Unicode already existed. Someone listed Java and JavaScript; both of them started from the point that Python 3 is trying to reach.

When Python was created in 1989, Unicode didn't exist yet.

As for your second argument, many people bring up Go, which had the amazing idea of using UTF-8 for everything, and it works great. They don't realize that Go is pretty much doing the same thing that Python does (ignoring how the string is represented internally, since that shouldn't really be the programmer's concern).

Go clearly distinguishes between strings (the string type) and bytes (the []byte type): to use a string as bytes you have to cast it to []byte, and to convert bytes to a string you need to cast them to string.

That's the equivalent of doing variable.encode() to get bytes and variable.decode() to get a string.
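
A minimal Python sketch of that equivalence:

    # the Python 3 counterpart of Go's []byte(s) / string(b) casts
    s = 'żółć'                # str, like Go's string
    b = s.encode('utf-8')     # str -> bytes, like []byte(s)
    s2 = b.decode('utf-8')    # bytes -> str, like string(b)
    assert s == s2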

What Python 3 introduced is two types, str and bytes, and it blocked any implicit casting between them. That's exactly the same thing Go does.

The only difference is an implementation detail: Go stores strings as UTF-8, so casting doesn't require any work; the casts are just there so the compiler can catch errors. Go also ignores environment variables and always uses UTF-8. Python has an internal[1] representation and does do conversion. It respects LANG and other locale variables and uses them for stdin/stdout/stderr. Initially, when those variables were undefined, it assumed US-ASCII, which created some issues, but I believe that has since been fixed and UTF-8 is now the default.

[1] Python 3 actually tries to be smart and uses UCS-1 (Latin-1), UCS-2, or UCS-4 depending on what characters the string contains. If a UTF-8 conversion is requested, it will also cache that representation (as a C string) so it won't redo the conversion next time.
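
You can glimpse this in CPython with sys.getsizeof (exact sizes vary by version; a minimal sketch):

    import sys

    # PEP 393: CPython picks the narrowest fixed-width representation
    # that fits every character in the string.
    print(sys.getsizeof('aaaa'))    # Latin-1 storage, 1 byte per char
    print(sys.getsizeof('āāāā'))    # UCS-2 storage, 2 bytes per char
    print(sys.getsizeof('😀😀😀😀'))    # UCS-4 storage, 4 bytes per char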


> Because many of these languages were created when Unicode already existed. Someone listed Java and JavaScript; both of them started from the point that Python 3 is trying to reach.

That was me in a parallel thread. Java and JavaScript internally use UTF-16 encoding. I also mentioned C, which treats strings as byte arrays, and C++, which supports C strings as well as introducing a string class that is still just byte arrays.

> As for your second argument, many people bring up Go, which had the amazing idea of using UTF-8 for everything, and it works great.

Has Go ever broken backwards compatibility? Let me clarify my second argument: if you are going to break backwards compatibility, you should do so in a minimal way that eases the pain of migration. The Python maintainers decided that breaking backwards compatibility meant throwing in the kitchen sink, succumbing to second system effect, and essentially forking the language for over a decade. The migration from Ruby 1.8 to 1.9 was less painful, though in fairness I suppose the migration from Perl 5 to Perl 6 was even more painful.


Actually migrating from Perl5 to Raku may be less painful than migrating from Python2 to Python3 for some codebases.

That is because you can easily use Perl5 modules in Raku.

    use v6;

    use Scalar::Util:from<Perl5> <looks_like_number>;

    say ?looks_like_number( '5.0' ); # True
Which means that all you have to do to start migrating is make sure that the majority of your Perl codebase is in modules and not in scripts.

Then you can migrate one module at a time.

You can even subclass Perl classes using this technology.

Basically you can use the old codebase to fill in the parts of the new codebase that you haven't transferred over yet.

---

By that same token you can transition from Python to Raku in much the same way. The module that handles that for Python isn't as featureful as the one for Perl yet.

    use v6;

    {
        # load the interface module
        use Inline::Python;

        use base64:from<Python>;

        my $b64 = base64::b64encode('ABCD');

        say $b64;
        # Buf:0x<51 55 4A 44 52 41 3D 3D>

        say $b64.decode;
        # QUJDRA==
    }

    {
        # Raku wrapper around a native library
        use Base64::Native;

        my $b64 = base64-encode('ABCD');

        say $b64;
        # Buf[uint8]:0x<51 55 4A 44 52 41 3D 3D>

        say $b64.decode;
        # QUJDRA==
    }

    { 
        use MIME::Base64:from<Perl5>;

        my $b64 = encode_base64('ABCD');

        say $b64;
        # QUJDRA==
    }

    {
        use Inline::Ruby;
        use base64:from<Ruby>;

        # workaround for apparent missing feature in Inline::Ruby
        my \Base64 = EVAL 「Base64」, :lang<Ruby>;

        my $b64 = Base64.encode64('ABCD');

        say $b64;
        # «QUJDRA==
        # »:rb

        say ~$b64;
        # QUJDRA==
    }
I just used four different modules from four different languages, and for the most part it was fairly seamless. (Updates to the various `Inline` modules could make it even more seamless.)

So if I had to I could transition from any of those other languages above to Raku at my leisure.

Not like Python2 to Python3, where it mostly has to be all or nothing.


> That was me in a parallel thread. Java and JavaScript internally use UTF-16 encoding. I also mentioned C, which treats strings as byte arrays, and C++, which supports C strings as well as introducing a string class that is still just byte arrays.

C and C++ don't really have Unicode support, and most C and C++ applications don't support Unicode. There are libraries that you need to use to get that kind of support.

> Has Go ever broken backwards compatibility? Let me clarify my second argument: if you are going to break backwards compatibility, you should do so in a minimal way that eases the pain of migration. The Python maintainers decided that breaking backwards compatibility meant throwing in the kitchen sink, succumbing to second system effect, and essentially forking the language for over a decade. The migration from Ruby 1.8 to 1.9 was less painful, though in fairness I suppose the migration from Perl 5 to Perl 6 was even more painful.

Go is only 10 years old; Python is 31. And in fact Go has had some breaking changes, for example in 1.4 and 1.12. Those are easy to fix since they show up during compilation. Python is a dynamic language, and unless you use something like mypy you don't have that luxury.

Going back to Python: what was broken in Python 2 is that the str type could represent both text and bytes, and the difficulty was that most Python 2 applications were broken (yes, they worked fine with ASCII text, but broke in interesting ways whenever Unicode was used). You might say: so what, why should I care if I don't use Unicode? The problem was that the mixing of these two types and the implicit casting that Python 2 did made it extremely hard to write correct code even when you knew what you were doing. With Python 3 it takes no effort.

There is a good write-up by one of the Python developers on why Python 3 was necessary[1].

[1] https://snarky.ca/why-python-3-exists/


> Going back to Python: what was broken in Python 2 is that the str type could represent both text and bytes...

You know, it’s astounding to me that you managed to quote my entire point and still didn’t even bother to acknowledge it, let alone respond to it.

If they had to break backwards compatibility to fix string encoding, that’s fine and I get it. That doesn’t explain or justify breaking backwards compatibility in a dozen additional ways that have nothing to do with string encoding.

Are you going to address that point or just go on another irrelevant tangent?


There is no migration from Perl 5 to Perl 6, mainly because Perl 6 has been renamed to Raku (https://raku.org using the #rakulang tag on social media).

That being said, you can integrate Perl code in Raku (using the Inline::Perl5 module), and vice versa.


Yes, that was the joke :)


Some would say that the distinction was made in the wrong places, like assuming that command line arguments or file paths were UTF-8.


Fundamentally, the "right place" here differs between Windows and Linux. On Windows, command line arguments really are unicode (UTF-16 actually). On Linux, they're just bytes. In Python 2, on Linux you got the bytes as-is; but on Windows you got the command line arguments converted to the system codepage. Note that the Windows system codepage generally isn't a Unicode encoding, so there was unavoidable data loss even before the first line of your code started running (AFAIK neither sys.argv nor sys.environ had a unicode-supporting alternative in Python 2). However, on Linux, Python 2 was just fine.

Now with Python 3 it's the other way around -- Windows is fine but Linux has issues. However, the problems for linux are less severe: often you can get away with assuming that everything is UTF-8. And you can still work with bytes if you absolutely need to.
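
For what it's worth, Python 3 decodes command line arguments with the filesystem encoding and the surrogateescape error handler, so on Linux the original bytes remain recoverable; a minimal sketch:

    import os
    import sys

    # Undecodable bytes survive in sys.argv as lone surrogates;
    # os.fsencode() round-trips them back to the original bytes.
    for arg in sys.argv[1:]:
        print(os.fsencode(arg))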


> On Windows, command line arguments really are unicode (UTF-16 actually)

No, they're not. Windows can't magically send your program Unicode. It sends your program strings of bytes, which your program interprets as Unicode with the UTF-16 encoding. The actual raw data your program is being sent by Windows is still strings of bytes.

> you can still work with bytes if you absolutely need to

In your own code, yes, you can, but you can't tell the Standard Library to treat sys.std{in|out|err} as bytes, or fix their encodings (at least, not until Python 3.7, when you can do the latter), when it incorrectly detects the encoding of whatever Unicode the system is sending/receiving to/from them.
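
For reference, the Python 3.7 mechanism alluded to is TextIOWrapper.reconfigure(); a minimal sketch:

    import sys

    # New in Python 3.7: re-encode a standard text stream in place.
    sys.stdout.reconfigure(encoding='utf-8', errors='replace')
    print('written as UTF-8 regardless of the detected locale')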

> AFAIK neither sys.argv nor sys.environ had a unicode-supporting alternative in Python 2

That's because none was needed. You got strings of bytes and you could decode them to whatever you wanted, if you knew the encoding and wanted to work with them as Unicode. That's exactly what a language/library should do when it can't rely on a particular encoding or on detecting the encoding--work with the lowest common denominator, which is strings of bytes.


> In your own code, yes, you can, but you can't tell the Standard Library to treat sys.std{in|out|err} as bytes,

Actually you can: you should use sys.std{in,out,err}.buffer, which will be binary.[1]
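
For example:

    import sys

    # .buffer is the underlying binary stream: bytes in, bytes out,
    # no text encoding involved.
    sys.stdout.buffer.write(b'\xde\xad\xbe\xef\n')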

> or fix their encodings (at least, not until Python 3.7, when you can do the latter), when it incorrectly detects the encoding of whatever Unicode the system is sending/receiving to/from them.

I'm assuming you're talking about the scenario where LANG/LC_* were not defined, in which case Python assumed US-ASCII encoding. I think in 3.7 they changed the default to UTF-8.

[1] https://docs.python.org/3/library/sys.html#sys.stdin


> Actually you can: you should use sys.std{in,out,err}.buffer,

That's fine for your own code, as I said. It doesn't help at all for code in standard library modules that uses the standard streams, which is what I was referring to.

> I think in 3.7 they changed the default to UTF-8

Yes, they did, which is certainly a saner default in today's world than ASCII, but it still doesn't cover all use cases. It would have been better to not have a default at all and make application programs explicitly do encoding/decoding wherever it made the most sense for the application.


> That's fine for your own code, as I said. It doesn't help at all for code in standard library modules that uses the standard streams, which is what I was referring to.

I'm not aware of what code you're talking about. All the functions I can think of expect streams to be provided explicitly.

> Yes, they did, which is certainly a saner default in today's world than ASCII, but it still doesn't cover all use cases. It would have been better to not have a default at all and make application programs explicitly do encoding/decoding wherever it made the most sense for the application.

I disagree; it would be far more confusing if stdin/stdout/stderr were sometimes text and sometimes binary. If you meant that they should always be binary, that's also suboptimal. In most use cases a user works with text.


> I'm not aware of what code you're talking about.

All the places in the standard library that explicitly write output or error messages to sys.stdout or sys.stderr. (There are far fewer places that explicitly take input from sys.stdin, so there's that, I suppose.)

> it would be far more confusing if stdin/stdout/stderr were sometimes text and sometimes binary

I am not suggesting that. They should always be binary, i.e., streams of bytes. That's the lowest common denominator for all use cases, so that's what a language runtime and a library should be doing.

> If you meant that they should always be binary, that's also suboptimal. In most use cases a user works with text.

Users who work with text can easily wrap binary streams in a TextIOWrapper (or an appropriate alternative) if the basic streams are always binary.
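
A minimal sketch of that wrapping:

    import io
    import sys

    # Wrap the binary stream in a text layer with an explicit encoding.
    out = io.TextIOWrapper(sys.stdout.buffer, encoding='utf-8')
    out.write('text goes through an explicitly chosen encoding\n')
    out.flush()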

Users who work with binary but can't control library code that insists on treating things as text are SOL if the basic streams are text, with buffer attributes that let user code use the binary version but only in code the user explicitly controls.


Linux had issues whenever the LANG/LC_* variables weren't defined: Python assumed US-ASCII. I believe that was changed recently to just assume UTF-8.


Python 3 does not make such assumptions; it uses the appropriate locale encoding.

(IIRC it used to do that in 3.0, but they backtracked very quickly - and 3.0 was effectively treated as a beta by the community at large, anyway.)


> correctly

Sometimes it’s better to be correct and also yield to the common good.



