Grokking Python 3’s str (sircmpwn.github.io)
213 points by type0 on Jan 14, 2017 | 144 comments



Python 3 has arguably one of the best built-in string implementations around.

In Python 2, "unicode" was a type whose codepoint width depended on the interpreter build - 2 bytes on "narrow" builds and 4 bytes on "wide" builds. Since most builds were "narrow", in practice non-BMP codepoints were a real challenge to use.

Furthermore, most languages with a "wide" string type use a fixed 2 bytes per code unit, i.e. UTF-16 (Java, C#, C++, ...), which wastes space when you're dealing with ASCII a lot, and is a pain to work with if you have non-BMP codepoints (which need surrogate pairs), since now indexing requires scanning the string from start to end.

In Python 3, with PEP 393, strings now have a flexible internal representation which can use 1, 2, or 4 bytes per codepoint depending on the largest codepoint in the string. This saves both space and processing time (in most common situations) since common string operations like indexing and slicing are constant time. This representation also allows it to scale from ASCII to astral codepoints with ease.
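A rough way to see the flexible representation at work, assuming CPython 3.3+ (exact byte counts vary by version and platform):

    import sys
    ascii_s = 'a' * 1000           # all codepoints < 256: 1 byte each
    bmp_s   = '\u4e2d' * 1000      # BMP codepoint: 2 bytes each
    astral  = '\U0001F600' * 1000  # astral codepoint: 4 bytes each
    for s in (ascii_s, bmp_s, astral):
        print(len(s), sys.getsizeof(s))  # same length, growing memory footprint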

Python 3 is worth the switch. Correct, strict separation of 'bytes' (bucket of octets) and 'str' (sequence of codepoints) is really the only way to preserve sanity and interoperability with today's encoding-rife reality.


I would say Swift has the best implementation. It actually conceptually treats strings as a group of grapheme clusters (the Character type is a grapheme cluster). This is more in line with how humans conceptually segment strings, and how most application logic should segment a string.

> since now indexing requires scanning the string from start to end.

This is not a good thing. O(1) code point indexing is a completely useless operation. Code points have no intrinsic meaning. They do not map to any natural language concept. They are just a tool to encode strings. The only time code points matter are when writing algorithms defined by the unicode spec, like casefolding. These operations usually require iteration anyway.

If you're regularly indexing and slicing strings by code points, your code is likely broken when it comes across decomposed accents in European languages, most Asian languages, and emoji.
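In Python terms, a minimal sketch of that failure mode with a decomposed accent:

    import unicodedata
    decomposed  = 'Zu\u0308rich'                          # 'u' + COMBINING DIAERESIS
    precomposed = unicodedata.normalize('NFC', decomposed)
    print(len(decomposed), len(precomposed))   # 7 vs 6 codepoints for the "same" text
    print(decomposed[:2], precomposed[:2])     # 'Zu' vs 'Zü': the slice split the ü apart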

FWIW you can have O(1) slice indexing in UTF8 as well; you just need to use byte position as your indexing type (and handle the error case when it's not aligned). Rust does this. It also has methods for working with char indices (and a .char_indices() iterator that lets you juggle the two), but byte indexing works. Generally these indices come from other parts of the application so you can design it to not have errors.


I end up having to explain this often, and have been meaning to write about this for a long time, and I finally did:

http://manishearth.github.io/blog/2017/01/14/stop-ascribing-...


I'm not disagreeing, but the situation is a bit more complex. For example, ISO 10646 ("Unicode") has multi-code-point sequences that still denote a single character (such as variation sequences, which are actually used in standard HTML entities).

Moreover, there are languages/scripts challenging the notion that a particular byte-unit corresponds to a single character.

Treating strings as byte sequences (with optional UTF-8 interpretation/checking, uppercase/lowercase conversion if the concept even applies, trim functions based on all Unicode spaces not just those characters in US-ASCII etc.) is entirely a defensible choice for a programming language.


> For example ISO 10646 ("Unicode") has multi-code point sequences denoting still a single character

I only know two programming languages that deal correctly with that: Swift and Perl 6. If you know more, please tell me.


Rust separates this out; the standard library gives you bytes and Unicode Scalar Values. Grapheme cluster stuff is in a package on Cargo, maintained by the Servo team.


You can iterate over Grapheme clusters using the standard library: https://doc.rust-lang.org/1.3.0/std/str/struct.Graphemes.htm...


Those are unstable docs for 1.3; they don't exist in today's Rust.


If you rely on libraries, there are ICU libraries for C and Java too, see for example http://userguide.icu-project.org/boundaryanalysis


Sounds like Rust follows in the tradition of C. Just as you now always hear about how C has no string type, just a bunch of chars, will we hear in the future that, alas, Rust has no native support for grapheme clustering?


The issue is, there's multiple ways to do the clustering, in my understanding. We picked the most core thing to be our most core thing. Text is hard.


Oh, I get it; I didn't want to come across as flippant. Indeed, text is hard.


I was sloppy in using the word "character" - strings are really sequences of codepoints. I've edited my post to reflect this. Of course, multiple codepoints may be needed to construct a single character, and that's by design.


But does Python 3 have Unicode-aware string-length rather than byte-length functions (plus substring, case-folding, character classification functions, etc.) for dealing with this? If it doesn't, then nothing really is achieved for the general case by switching to a 4-byte or any other fixed-byte unit, including the case of dealing with HTML 5 text.

See also perlgeek's question.


Yes, Python 3 does have Unicode-aware functions, as demonstrated in the post (via slices).
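For reference, a few of the codepoint-level ones in the stdlib (a sketch; these operate on codepoints, not grapheme clusters):

    import unicodedata
    s = 'Größe'
    print(len(s), len(s.encode('utf-8')))  # 5 codepoints vs 7 bytes
    print(s.upper(), s.casefold())         # 'GRÖSSE', 'grösse'
    print(s.isalpha(), '½'.isnumeric())    # classification is Unicode-aware
    print(unicodedata.category('ß'))       # 'Ll' (lowercase letter)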

Go and Rust also have very good Unicode support, in addition to Swift and (apparently) Perl 6 as mentioned above.


Java hasn't used 2 bytes per code point since many releases ago. It does unfortunately use 2 bytes per char, but the standard library makes it reasonably easy to deal with cases where you need 2 chars for a code point.


The fact that Python 2 is so permissive when mixing bytes and strings I think is the fundamental reason why people think Python 3 strings are broken.

In practice what happens is that Python 3 won't allow you to mix bytes and strings at all, and it will force you to decode and encode properly. Python 2 will happily try to implicitly convert to and from ascii when needed. Now, this might seem simpler at the beginning because in Python 2 you need less boilerplate code, but later, when you have to deal with a different encoding or when you use non-ascii characters, you get really weird, hard-to-debug problems that cause real pain. Python 3's strictness helps you prevent these kinds of cases.
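Roughly, the difference looks like this (a sketch; the Python 2 half is shown as comments since it needs a 2.x interpreter):

    # Python 2: implicit ASCII coercion, only fails once non-ASCII data shows up
    #   >>> 'id-' + u'42'       -> u'id-42'
    #   >>> '\xc3\xa4' + u'x'   -> UnicodeDecodeError ('ascii' codec can't decode)
    # Python 3: mixing the types is always an error, regardless of content
    try:
        'id-' + b'42'
    except TypeError as exc:
        print(exc)                          # always a TypeError (message varies by version)
    print('id-' + b'42'.decode('ascii'))    # explicit decode states the intent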


Keeping bytes and strings separated is actually the simpler situation. What it is not, is easy, because it takes a little bit of learning. Sadly many people prefer easy over simple.


It is only hard if you used it like that before. I'm betting people who come directly to Python 3.5 will find the difference blatantly obvious. I remember many years ago, coming from C++ and Boost's string handling, being taken aback by Python's laissez-faire approach to strings. Python 3 fixes that.


5 years ago I started learning programming with Python 3.0 and Python 2 just seems like Python 3 + many warts and lack of new features.

EDIT: changed phrasing


Could you recommend a learning resource for Python 5? I also tried learning it years ago but eventually went back to 3.0.


Whoops, the phrasing was a bit ambiguous. See the new edited comment.


Taking away permissiveness from a programming language just leads to developers not upgrading. No surprises there.

Yes, Python 3's stricter model is less error prone (vs python 2 u').

The more annoying thing for me personally is the "print" function rather than the previous "print" statement... Bugs the hell out of me.


> Taking away permissiveness from a programming language just leads to developers not upgrading.

The longer I do this, the more I see non-permissiveness in data validation as a feature and am actively happy to upgrade to get it. Ruby fixed their strings in 1.9 and it was one of the main reasons I was excited to upgrade our codebase to it. It was a pain to upgrade, but we found some real bugs, and were very happy with it after the fact. That the Python community has not embraced this yet completely mystifies me.

Maybe you're right about permissiveness if you're talking about syntax, but even then, I hate permissive syntax that increases ambiguity.


Whatever the other pros and cons of the print statement might be, one potential advantage of the function version of print is that you can pass it to another function as an argument (call it the "output function" argument), and on other calls substitute some other function in its place, which could do anything - maybe output to a network connection, or to a file instead. Basically, a polymorphic parameter which, based on the actual function value passed to it, might do one of a number of different but related things (all being some form of "output"). Like Python's file-like objects protocol, where you can pass any object that implements file-like functions/methods in a place where an actual file object arg is expected. Duck typing, IOW.

I realize this could also be done with the statement version of print, just by wrapping it in a function using *args and **kwargs, but it seems like the function version would be easier to use for this.

Of course, you would have to take care that all the functions you plan to pass to it, such as print and others, can do whatever is wanted to be done with it in the called code.
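A sketch of that pattern (the `emit` parameter name is made up for illustration):

    import sys
    def report(lines, emit=print):
        # emit can be any callable that accepts a string
        for line in lines:
            emit(line)
    report(['one', 'two'])                                            # stdout, via print
    report(['one', 'two'], emit=lambda s: print(s, file=sys.stderr))  # stderr
    with open('report.txt', 'w') as f:
        report(['one', 'two'], emit=f.write)                          # a file (no newlines added)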


I hear this complaint about print all the time but I have never understood it. Why is print so special that it needs dedicated syntax? Most languages don't do this and people don't seem to mind. Also, I almost never physically type print, but I do type a lot of function calls, so if anything I am more accustomed to typing it as a call.


The print statement is a little more convenient in the REPL. Otherwise there is little to complain about having a more consistent, extensible print without any magic. Every Py2 coder should be sticking in "from __future__ import print_function" to train themselves for the transition.


> The print statement is a little more convenient in the REPL.

IIRC ipython has a feature where you can prepend slash to any function (at the beginning of a line) to have parens inserted automatically, ie.

    In [0]: /print "foo"
is executed as print("foo"). There - problem solved, and for all functions, not just one special case.

Honestly, the crazy `print >>fileobject, "foo"` syntax is enough of a reason for removing it (print statement) from the language. It was unpythonic and I'm surprised it lasted this long in the language.
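For comparison, the Python 3 spellings of those cases (a sketch):

    import sys
    with open('log.txt', 'w') as f:
        print('foo', file=f)                   # replaces: print >>f, "foo"
    print('error!', file=sys.stderr)           # replaces: print >>sys.stderr, "error!"
    print('a', 'b', 'c', sep=', ', end='.\n')  # keyword arguments instead of more syntax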


Yes, I don't like that last usage either. Non-orthogonal syntax is slightly harder to learn / remember, and the more of it there is, the greater the cognitive / memory load. Of course, no language can be perfectly orthogonal or regular, though I guess Lisp comes close. (Not an expert on language design or theory.)


> Why is print so special that it needs dedicated syntax?

parent answered the question:

> Taking away permissiveness from a programming language just leads to developers not upgrading. No surprises there.

Some people used the print statement because at the time this was the recommended approach. In fact the 2.7 tutorial uses the statement: https://docs.python.org/2.7/tutorial/interpreter.html#intera... The print behavior was changed, and combined with the other changes and incompatible modules, a lot of people decided not to jump to 3.0.

Since languages like C never had a print statement to begin with -- printf was always a function call -- this issue would never have appeared in the first place.


The print function is so superior to the print statement. I used to (long ago, like v1.5 or 2.0) create stupid little functions that put print into a function (also raise and some other statements, cause it's super fucking annoying that some language elements can't be used in expression context) or use sys.stdout.write() and get pissed off cause that doesn't auto-append a newline.

print (and I'd argue raise) aren't special enough to be statements, to have special syntactical rules, to be so special that they work differently from everything else.


Raise I might argue in favor of (though I find it annoying too) solely because it's a language feature to change where you return to. It's not a thing you can express via other code. (though a callback-only language sounds kinda interesting / terrifying)

Print statements are horrific though, totally agreed. It's just normal IO with a common default, but making it as a statement causes problems, e.g. `lambda: print 'yay'` isn't allowed.
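i.e. (sketch):

    # Python 2: SyntaxError, a statement can't appear inside an expression
    #   f = lambda: print 'yay'
    # Python 3: print is an ordinary function, so this is fine
    f = lambda: print('yay')
    f()   # prints: yay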


Last month, we had to add support for the Hindi language in our APIs within 2 weeks, as the Indian govt wanted to launch a learning program in both languages. Being a young startup of less than 2 years, we could not afford to lose this opportunity by asking for more time. Because all of our code was in Python 3.5, there was nothing we had to do. Hindi support was magically available and everything worked flawlessly.

Fingers crossed on the program launch... :)


What sort of APIs and learning program, if not confidential?


It's a learning program for budding entrepreneurs. We are at www.upgrad.com. There are APIs for questions, answers, feedback, the discussion forum, etc.


Sounds interesting, thanks.


It may be true that one of the issues with Python 3 strings is people not grokking them, thus littering their code with useless/redundant/wrong `.decode()` and `.encode()` calls. But I say this is a problem with Python 3 itself, since it obviously just replaced one set of problems with another.

I think the fundamental mistake of Python 3's approach to strings is assuming that programmers mostly meant "string" when they used strings in the past. This may be true in web circles, but in most other areas of application they actually meant (opaque) bytes. (`from __future__ import unicode_literals` makes more sense in this scenario, and I've been using it since forever.)

It's the changing of default behavior that confuses people!

Also, the standard library has a few places where the maintainers don't seem to grok strings either. Take the "json" module for example: Why does `json.dumps()` return a "str" while `json.loads()` doesn't accept "bytes" as input? (Hint: JSON is, by definition, UTF-8 encoded, so "dumps" should return "bytes" and "loads" should accept both.)

(I should mention that my native tongue uses characters outside of ASCII, so it would seem that I should be asking for unicode everywhere.)


JSON is a text format, not binary http://www.ecma-international.org/publications/files/ECMA-ST...

JSON text is a sequence of Unicode code points, not bytes. Python 3 str type is ideal to represent JSON text.

On the internet "JSON text SHALL be encoded in UTF-8, UTF-16, or UTF-32" https://tools.ietf.org/html/rfc7159

json.loads() accepts binary input as of Python 3.6, using the encoding detection scheme from the obsolete RFC 4627, which relies on now-false assumptions (that a JSON text is either an array or an object).
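For reference, the 3.6+ behaviour looks roughly like this (a sketch):

    import json
    obj = json.loads(b'{"name": "Z\xc3\xbcrich"}')  # bytes accepted in 3.6+, UTF-8 detected
    print(obj)                                      # {'name': 'Zürich'}
    text = json.dumps(obj, ensure_ascii=False)      # still returns str
    wire = text.encode('utf-8')                     # encode explicitly for I/O
    print(type(text), type(wire))                   # <class 'str'> <class 'bytes'>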


RFC 7159 is sort of sloppy, saying "JSON texts that are encoded in UTF-8 are interoperable in the sense that they will be read successfully by the maximum number of implementations" while allowing UTF-16 and UTF-32 and not actually requiring UTF-8 be supported. ECMA 404 doesn't address the issue.


In Python 3, a "bytes" type is too much like a string. It's supposed to be an array of [0..255]. But

    >>> s = bytes([97,98,99]) # created an array of "bytes" from a list of ints.
    >>> type(s)
    <class 'bytes'>   # it's really a type "bytes"
    >>> s
    b'abc'            # but it prints as a string
    >>> s[1]          # each element, however, prints as an integer
    98
Python 3 thus isn't rigorous about "bytes" as an array of byte values. It's become less rigorous; you can now use regular expressions on "bytes" types. If Python 3 had taken a harder line, that "bytes" is just an array of bytes, the distinction would be clearer.

Actually, Unicode has worked fine in Python since Python 2.6. You just had to write "unicode" and "u'foo'" a lot. In an exercise of sheer obnoxiousness, those were originally disallowed in Python 3, instead of making them null operations.

Strings in Python 3 appear to be arrays of Unicode characters. This is a bit tricky, because Python doesn't have a Unicode character type. Elements of a string are also strings. There's no type like Go's "rune".
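Concretely (sketch):

    s = 'abc'
    print(type(s[0]), s[0] == 'a')   # <class 'str'> True -- no separate char/rune type
    print(type(s[0:1]))              # indexing and slicing both give str
    print([ord(c) for c in s])       # ord() gets at the codepoint numbers: [97, 98, 99]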

Python has successfully hidden the internal structure of its strings. Internally, the representation can be 1, 2, or 4 bytes per codepoint, plus an optional cached UTF-8 form. This means a lot of run-time machinery.

Go and Rust both have a string type that is both internally and visibly UTF-8. They're subscriptable, but at the byte level (Rust only allows byte-range slicing, not single-index subscripts). An element of a Go string is not a string, a character, or a rune - it's just a byte out of the middle of something. Slicing works on bytes too; in Go a slice is not necessarily valid UTF-8, and Rust panics if a slice boundary falls inside a codepoint. This is a cause of trouble. You shouldn't be subscripting through UTF-8 byte by byte. In practice, you have to be aware of the UTF-8 representation in Go and Rust, or use a library which is. Here's my grapheme-aware word wrap in Rust.[1] Too much touchy fooling around with byte-level indices into arrays there.

Arguably, if you're going to use UTF-8 as a string representation, subscripts should be of an opaque type, not integers. You should be able to move forward or backwards one grapheme at a time cheaply, and if necessary, create an array of slices which represent all the graphemes in the string. Then programmers could random-access strings without fear, and you don't need multiple internal representations.

[1] https://github.com/John-Nagle/rust-rssclient/blob/master/src...


Perhaps the problem with Python3's str type is that a string type shouldn't be challenging to grok?

Also, for those of us who learned computer science back before UTF-16 was a standard, a "string" has always meant an array of chars, and a char was a byte. In some languages this is still the case.

In other languages, from Pascal to Java and beyond, a string has been a distinct class or type. Though generally those types began as a thin abstraction around an array of bytes. Surprise, surprise, that's how Python2 has always done it.

So Python3 changed which internals it uses for its default string type (pro tip: a b'foo' object in Python2.7 or Python3 isn't "a bytes", it's "a bytestring".)

I happen to think that this is a good choice in Python3. It's not 1996 any more. People expect software to support accented characters and ridiculous emoji. Default Unicode strings are easier to work with for most purposes that involve accepting text from a user and returning text for a user. For most of us it's worth the additional disk space and memory trade-offs even for things like dict keys that don't benefit from Unicode.

People whose work involves processing binary sequences as bytes for convenience are now inconvenienced and understandably frustrated, but they're not the majority of users of the language.

The author is correct that people who expect Python3 strings to be arrays of bytes are mistaken. But the author is wrong to tell people that what they've worked with as a "string" and considered a "string type" all their lives -- and which is STILL considered a "string" in many languages -- is not a string at all. It's still a string. It's still a type of string. Even Python still calls it a byteSTRING. It's just no longer the way Python internally represents a sequence of characters surrounded by unadorned quote marks.


I don't know about that... even well before Python 3 was released, there were urgent warnings to programmers to STOP thinking of byte arrays as strings. Joel Spolsky's Unicode essay predates Python 3 by five years, for instance: https://www.joelonsoftware.com/2003/10/08/the-absolute-minim...

That shift in thinking was overdue in 2003 and it's way overdue now in 2017.


(pro tip: a b'foo' object in Python2.7 or Python3 isn't "a bytes", it's "a bytestring".)

Quotes from the Python 3 documentation:

Bytes literals are always prefixed with 'b' or 'B'; they produce an instance of the bytes type instead of the str type.

class bytes([source[, encoding[, errors]]])

Return a new “bytes” object, which is an immutable sequence of integers in the range 0 <= x < 256. bytes is an immutable version of bytearray – it has the same non-mutating methods and the same indexing and slicing behavior.

https://docs.python.org/3/library/functions.html#bytes

https://docs.python.org/3/reference/lexical_analysis.html#st...


"bytes" is the type for bytestrings on both Python 2 and 3.


Gah. No.

Unicode code points (which is what a str is a sequence of) are not characters. That is not a one-to-one mapping, nor a one-to-many mapping. That is a one-or-many-or-none to one-or-many "mapping". And glyphs are a third category that we aren't even getting into.


Well, we are slowly approaching the "truth" .-)

Frankly I think most people don't care about the complexities of Unicode. Count me in. I treat it as a necessary evil. What I do with it is mostly concerned with the characters (code points if you insist) from the ASCII range that are in there (for example, splitting lines or words). I hope it's okay to ignore code points vs glyphs etc. in this case?


Btw. I'm fully aware that this is just the bytes vs unicode issue, taken to the next level.

The difference however is that a) most data doesn't contain combining code points while much data is non-ASCII unicode, and b) software (i.e. most software, with the exception of Perl 6 and probably a few others) doesn't have convenient support for glyph-level strings yet -- I wouldn't mind if it did.


Latin-1 files work just fine. The examples in the post are just overly complicated: open('test.txt','w',encoding='latin-1').write('No need to separately encode or decode.')


Yours is the correct way of doing it, but I chose to do it this way because I felt that this example gives more insight into the concepts this article is trying to explain.


The way I grok the computer science, strings are sequences of zero or more characters input into automata and unicode is a way of encoding text. Text is a sequence of one or more glyphs input into humans. Thus, for me, the implementation of String in Python 2 is about as sound from a computer science perspective as an implementation can get. By which I mean that in the end all data boils down to bits and clustering the bits as bytes is about as reasonable as alternatives. On the other hand, Python 3 treats strings as text and this necessitates all the overhead of multiple encodings and converting glyphs from one language to another language according to the messy and inconsistent and incomplete rules of human language. For me, it would have been better if Python 3 (and most other languages) had a 'Text' type in addition to the String type.

The problem is sloppy use of language in a domain where usually it is ok but sometimes it isn't.


Treating strings as "a sequence of bytes" is perfectly fine if you never, ever interpret or manipulate the contents.

As soon as you want to e.g. limit to 160 characters and suffix with an ellipsis, you run into problems of "what is a character", and you can't even call it a single unicode codepoint. Is "Z̴̛̺͉͙͚̰̔̏ͧͅ" a single character, or over 10? Can you even call it a single "glyph" since it's formed of many units? Here's the unicode representation:

    Z\u0314\u030F\u0367\u0334\u031B\u033A\u0349\u0359\u035A\u0330\u0345
Which is a whopping 46 bytes in UTF-8. Or what about unicode flags and their 2-character representation: https://esham.io/2014/06/unicode-flags

Strings as a concept as they currently stand are absolute nonsense. A "Text" type might resolve the semantic problems (and I love the name, this is a great idea), but strings are text. Any other use is just abusing the container because it's easier to type "v1.2.3" than to make a "Version(1,2,3)" structure (especially when you have to communicate it across different programs / languages).


To me, because texts get encoded into strings (or strings encode texts) they are not the same thing. For example, Base 64 encodes data into a string, but the source is not necessarily a text. That's a separate issue from the way 'string' often gets used in the context of programming...a context in which even the otherwise pedantic seem to lose the faith ('regular expressions' is another one that is closely related).

Whether or not Z̴̛̺͉͙͚̰̔̏ͧͅ is a character (in terms of computer science) is a matter of whether or not it is part of the input language which some machine accepts. Which is to say it is no different than whether or not 'HashMap' is part of the input language to a compiler (yes for Java, no for Python).


Base64 is a text representation of a binary blob of data - it's just a protocol that happens to limit itself to a less-likely-to-be-mangled-by-bad-string-handling-code sequence of bytes.

Regexes are a great example - they're text that is parsed into a parsing-engine that can be executed. The text part is just a human-interface protocol over many possible implementations, and importantly, it has a standard. Slightly-varying standards at times, but everything trends towards Perl's version, plus/minus some features. And yes - implementations can choose which variant(s) they support, because they control its interpretation.

Programming languages don't have a choice about if "Z̴̛̺͉͙͚̰̔̏ͧͅ" is a character though, if they deal with human-language input and output. Humans have already decided. When it's displayed, it either is or is not, often based on the viewer's language (e.g. `str.lower()` is locale-sensitive, but many programming languages ignore this and only deal with ASCII for English speakers). If the program doesn't understand how it's dealing with this human <-> computer protocol and mangles it, it's just as bad as something that mangles other protocols like TCP/IP, except that humans are occasionally more forgiving.

---

edit: I should probably tl;dr this.

There is a right way and a wrong way to manipulate human-language text. And it's extremely complicated to do correctly - people are difficult. Shoving it under the rug and ignoring it entirely, as has been done by most people in most languages, is 100% the wrong "solution".

Python 3 (or even better, Swift) have taken a step in the right direction to reduce accidents - it's painful because it requires correcting long-standing horrifically-wrong habits.


I used the term 'regular expressions.' I did not use the term 'regexes'. Regular expressions are clearly defined and have mathematical properties including equivalence to (or the ability to unequivocally and fully describe) finite automata. Conversely, regexes are not clearly defined mathematically.

There's nothing wrong with imprecision when precision is not called for and particularly when the imprecision facilitates communication. Sometimes however the abstractions leak and expecting the string type to embody the properties of human text is one of those...at least in my opinion. Other people may have different opinions.


Ah, yeah, you're entirely right about the regexes. My mistake.

So I think we mostly agree. My question is then: what are strings for, if not Text? It makes a terrible enum, a weirdly-limited escape-hatch for ignoring type systems, and an immensely wasteful protocol.


Your definition of a string is wrong. It was fine when C came about and we only had ASCII, but real strings hold none of the implicit assumptions that are held by C. C's strings are just a stream of zero-terminated unsigned bytes. Nothing more, nothing less.

Your vision works until you need to go beyond the 128 characters that ASCII can encode. The original way out was to have a bunch of different, incompatible encodings of different widths for different languages. This proved to be a complete disaster, and Unicode has been the answer.

Unicode is not easy; nothing in the domain of human text handling is. But its use case is for human-readable output. You can't really do user-facing applications using the C model of strings.


I am not a fan of C's strings either. Null termination rather than communicating the length as a preamble has been a font of bugs for decades. On the other hand, ASCII is merely a way of interpreting a stream of bits that chunks the bits in into bytes and maps the bytes to characters. Unicode is another way of interpreting a string of bits which also chunks the bits into bytes, but then chunks bytes together before mapping to characters.

Both Unicode and ASCII are abstractions built on top of streams of bits largely (since both also contain control characters such as <BEL>) intended to communicate text primarily and strings (in the computational sense) secondarily (for example as commands to a REST endpoint). For example, C has had a wide character type for about 25 years [1], available as an abstraction built on top of strings... like much of C, how wide is wide is implementation dependent, and explicit 16-bit and 32-bit wide characters were standardized more recently.

[1]:


But you are aware that python2 has not only bytestrings but also what you call "text"? That these are different types and there are implicit conversions between them?

    $ python2
    >>> len('ä'), 'ä'[0]
    (2, '\xc3')
    >>> len(u'ä'), u'ä'[0]
    (1, u'\xe4')
    >>> 'ä' == u'ä'
    __main__:1: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
    False
I think it's obvious that there are different levels of looking at the data that we have to deal with. If the implementors feel that the best way to enable different levels is having different representations, so be it. Note that it doesn't require twice the amount of memory if you stream the data.


It's true that programmers don't understand Python3 strings. But the language is to blame for being opaque.

Go is right up front: Go strings use UTF-8 and Go source code is in UTF-8. You can get the length of a string in bytes, or in runes. There's only one string type. Since every programmer needs to understand UTF-8 anyway, you can understand it immediately.

I tried to find how Python3 unicode strings work, but I could not find it anywhere, until I saw nneonneo's comment here.


Interesting, I haven't ever read about anyone thinking Python 3 is the one that has broken strings. It's always been Python 2.


Zed Shaw, author of Learn Python the Hard Way, thinks Python 3 strings are broken. He also needs to read this article.

https://learnpythonthehardway.org/book/nopython3.html


I'm doing ML in Python 2 and I can't see a single reason to move to Python 3. Every article is about strings, but in fact I've needed to work with Unicode exactly once (I mostly do computer vision tasks, but once trained an RNN on a book in my native language just for fun) - and I don't remember ANY problem with it in Python 2.

So, why should I bother migrating? I moved to C# 7 instantly and never looked back, for example, because it has tons of stuff. But the only thing I hear about Python 3 is "it's so much better!!!111" and "strings, you need it".


The reason that there are a lot of articles about strings is that there is a vocal minority that really hates the Python 3 string changes, and they were the most contentious changes in the language.

Current Python 3 has changed a lot since the first versions of Python 3, and there is a whole host of improvements that make it superior. None of them earth-shaking, but a host of them worth it. Python 3.6 was just released, and I would recommend having a look at its "what's new" document to get a feel for what has changed since 2.7.


Python 3 fixes the issue where variables in a comprehension "leak" to the surrounding scope.

Python 2:

    x = 'hello'
    [x*x for x in range(5)] #=> [0, 1, 4, 9, 16]
    x #=> 4
Python 3:

    x = 'hello'
    [x*x for x in range(5)] #=> [0, 1, 4, 9, 16]
    x #=> 'hello'
Python 3 has improved destructuring support

Python 2:

    a, *b = [1, 2, 3] #=> Syntax error
Python 3:

    a, *b = [1, 2, 3]
    a #=> 1
    b #=> [2, 3]
Type annotations[1] are nice not because they enable typechecking, but because they tell me what the function expects. Maybe it's just the kind of programming I do, but I use a bunch of libraries with functions that take a class or function as input, except half the time they want the name of the thing instead of the actual reference.

[1] https://www.python.org/dev/peps/pep-0484/
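e.g. a hypothetical annotated signature (the names are made up for illustration):

    from typing import Callable
    def register_handler(event: str, handler: Callable[[dict], None]) -> None:
        # the annotation says up front: pass the callable itself, not its name
        ...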


If you are not using strings then very few things are different between Python 3 and 2, so the cost of changing to a more maintained interpreter is even smaller for you. The only real problem you might have is some library that doesn't have Python 3 support. Also, since you are dealing with ML, I must warn you that the developers have already committed to not supporting Python 2 in the future (not sure about numpy, though).


Python 3 has mostly more stuff than Python 2 in the stdlib, plus some syntax improvements, most of which are supported by Python 2.7 backports as well. So yes, arguably there are entire classes of problems where one doesn't need to care about differences between Python 2 and 3. But there are also many classes of problems where the differences matter very much.


As an ML person, I appreciate the infix matrix multiply operator (@). Using np.dot is cumbersome and makes formulas difficult to read. In fact, until this feature was added, I considered the readability of vector and matrix formulae one of the strongest arguments for using Matlab over Python for prototyping.

https://www.python.org/dev/peps/pep-0465/#executive-summary
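i.e. something like this, assuming numpy is installed:

    import numpy as np
    A = np.eye(3)
    x = np.array([1.0, 2.0, 3.0])
    y_old = A.dot(x)   # np.dot / method style
    y_new = A @ x      # PEP 465 infix operator, Python 3.5+ with a recent numpy
    assert np.allclose(y_old, y_new)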


All the articles are about strings since those are the most fought-about change, and the thing you'll likely have to deal with the most when porting. Python3 IMHO has no single "killer feature", but especially with recent versions has a lot of very nice things that add up, which isn't talked about in the prominent articles because it is not controversial. (Well, the async stuff kind of is, since it's a feature many people want, but it also is quite confusing)


Actually I don't believe that "those are the most fought-about change", I think it's a very loud minority. In more personal circles I don't know anyone who knows anyone who ...

It might be though that the new text model faces more resistance in English speaking countries, because ASCIIbytes is obviously less troublesome there, so people might not see the need for Unicode.


I use Python for a significant amount of data munging. Dictionary comprehensions alone were worth the switch for me. When I started, pandas did not exist, so I had to develop everything from scratch.


This is probably why the scientific computing community is very slow in adopting python3, there are simply very few benefits apart from the threat of 2020.


> I can't see a single reason to move to Python 3.

If you are doing any sort of scientific computing there is no compelling reason to move to Python 3.

The nuances that you have to grok with the Py3 str are the same nuances you would have to grok about using the Unicode implementation of Py2 instead of the default encoding in Py2. They just made the default more complicated if you are working with data. Now they require you to add 'b' to every damn thing and they have the arrogance to claim this is the "right way" to do it.

Like someone else said earlier -- computer science strings are a sequence of bytes. If you want to encode and decode from byte strings use another type. They should make the default bytes and print utf-8 interpretation.

Oh, but then they would have Py2 and they would realize what an authoritarian goose chase Py3 has been.


I agree with the article that Python 3 has great character string handling, but I'd suggest that the author does not understand why people like Python 2's system.

Proper handling of unicode glyphs makes for a great demo and is clearly the proper behavior, but does not represent a common use scenario.

It's fairly rare to be doing sub-string manipulation of user input or strings that will be displayed. Instead, it's much more common to be transporting data around inside strings. Maybe you're moving some JSON, or some binary data, or CSV content. In these examples, you're moving data from one point to another, and the Python 2 approach is far simpler. What encoding is it? Is it unicode? For most applications, it does not matter. You can put the data in a str, then pass that str to a function.

In Python 3, this scenario gets more complicated. Is your data in bytes? Is it in a str? If it's in a str, how is it encoded? The Python typing system does not make finding these things out smooth. The details will determine which functions you can pass the data to and what transformations you (may) need to perform. Python 3 makes the programmer work harder to pass properly-tagged string data to functions, which generally makes string handling code more finicky and complex. The result is a string system that is _much_ more predictable and understandable in the failure case, but is more verbose in getting there.

It doesn't help that Python's poor type management tools mean that your first indication you mis-handled a string in Python 3 is the same as Python 2 - unexpected output to a buffer (often b'string' instead of terminal-killing garbage).
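e.g. the failure mode being described (sketch):

    payload = b'{"ok": true}'                      # bytes handed back by some library
    print("response: %s" % payload)                # response: b'{"ok": true}' -- no error, just wrong
    print("response: " + payload.decode('utf-8'))  # what was actually intended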


You shouldn't be moving things like that around in strings. You should be moving them around as bytes, then decoding them when you need to manipulate them. Your strategy is probably going to lead to subtly broken behaviors in your program. The problem here is not how Python 3 handles strings, it's how you're handling strings.

>It doesn't help that Python's poor type management tools mean that your first indication you mis-handled a string in Python 3 is the same as Python 2 - unexpected output to a buffer (often b'string' instead of terminal-killing garbage).

I agree, Python's behavior around implicit conversion of things to str is pretty bad. It would be better to throw a TypeError.


This is exactly what I mean.

I know I should move binary data around as bytes. I'm not saying I want to (or intend to) move binary data around as strings.

I'm saying that, in Python 2, you get a str - which happens to be binary and you don't have to worry about it. Even if it's unicode, it'll still work - it just won't matter. In Python 3, it depends on what the library gives you. Did it give you bytes? Maybe a str? You'll need to check and do the proper conversion to bytes. The behavior that Python 3 forces on you is safer and more correct, but it's behavior that you did not have to go through in Python 2.

Sometimes, in Python 3, the library gives you bytes and you want bytes and you're fine, but it's common that you are not.


Couldn't agree more with the article or the conclusion: "python 2 is dead. Long live python 3"


With these examples, I finally understood the difference between Python 2 and Python 3.


What does "to grok a string" mean? It is used across the article and comments, yet is not immediately referenced in Google.



    >>> open(b'test-\xd8\x00.txt', 'w').close()

Is that a C-string? Does it crash if one forgets the null byte?


This specific example would not work, since NUL bytes are not allowed in file names.

    open(b'test-\xd8\x01.txt', 'w').close()
Makes the same point.

Note that

    b'test-\xd8\x00.txt'
is totally valid. You just can't use that as a file name.


> Is that a C-string?

It's a Python bytestring.

> Does it crash if one forgets the null byte?

No. But in that case it will crash when you use it, as CPython (and most FS APIs, and most filesystems) don't allow NUL in filenames.


It's not going to crash. You'll get an error.


No, it is not a C-string. Python remembers string length regardless of null bytes.


Running that gave:

  >>> open(b'test-\xd8\x00.txt', 'w').close()
  Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  TypeError: embedded NUL character
So yeah, TypeError

Edit: Formatting


My bad, pushed a fix to use \x01.


It says bytes is an array of bytes; does the author mean bits?


It's a bit of a clunky sentence because of the name conflict between the concept of multiple bytes and the Python data type.

A byte is a sequence of bits, and "bytes" is the name of a sequence of bytes in Python.


What the author means is that the Python type named `bytes` is an array of conceptual byte values.


    >>> 'おはようございます'[::-1]
Why do people keep using this as an example? A character is not necessarily a single code point; reversing code points is not any more meaningful than reversing bytes. And reversing strings isn't something frequently needed in practice either.

Let me make this as clear as possible:

    a code point is not a character

    a code point is not a character

    a code point is not a character
I want the author to read that, over and over again, until it sinks in.
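A quick sketch of why codepoint-level reversal goes wrong:

    s = 'cafe\u0301!'   # 'café!' written with a combining acute accent
    print(s)            # café!
    print(s[::-1])      # the accent jumps onto the '!' -- reversed codepoints, not characters
    print(len(s))       # 6 codepoints for 5 user-perceived characters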


> The only problem with Python 3’s str is that you don’t grok it

Sadly this quote shows a fundamental lack of understanding of the problems with the Python 3 string type.


Downvoted for surprising lack of substance.

Note I do read many of your blog posts -- and thanks for writing them, there is a lot of insight for me.

I also did read most or all of your blog posts about python3 unicode handling. The thing is, while most or all of the facts presented there are "true", many of the negative conclusions there are just your opinion that stems from years and years of doing it "your way" (I would call it FUD, but I have strong opinions too). Python2's unpredictable implicit conversions are hardly "sane" (that's one of the unjustified claims made there). Do you also call Javascript's "==" sane?

For me Python3 str has worked like a charm for years. I like the strict separation of high and low level affairs. I like how I can treat files as text files (deal only with python3 strs) and don't have to think about low level affairs, or types, or conversions, which is a great boon for scripting. I like how I can drop down to the bytes level when needed and know exactly what I'm at.

For some balance, I wouldn't expect that conversion to python3 is always easy. But I would blame python2 for missing clean concepts, not python3.

Also, it's hardly convenient to code most of a Python3 app at the byte level. But it could be argued that python should not be used for these things.


I agree that having stricter coercion rules makes sense in situations where the coercion might be ambiguous; whether adding byte-strings and Unicode strings together is such a case is debatable, though (I think it is), as in many cases implicit conversion yields an acceptable outcome. Again, this is more a question of design philosophy, but the thing with Python (2) is that implicit coercion was the default behavior in many cases, so changing that is painful.

Personally I think we should double-down on type annotations and stronger (optional) typing for Python, because the lack of a good type system is by far the biggest obstacle for building robust, large systems in Python.


The problem is that in Python 2, "str" is a bytestring and "unicode" is a Unicode string, whereas in Python 3 "str" is a Unicode string and "bytes" is a bytestring.

In my experience this is the main reason why Python 2 users get confused. I can understand why the renaming might have been a good idea (to avoid changing the semantic meaning of an expression while leaving the syntax intact), but I can also see why people don't "grok it" immediately, though it is rather simple once you know it.


> The problem is that in Python 2, "str" is a bytestring and "unicode" is a Unicode string, whereas in Python 3 "str" is a Unicode string and "bytes" is a bytestring.

`str` in Python 3 is `unicode` in Python 2. The bytestring type was removed and replaced with a bytes type. Not sure what's particularly confusing about this.

None of this however is the issue with Unicode in Python in general. I think all these articles fail to understand that the actual story behind unicode in the language is very complex.


I really wish they named it something different.

Like `text` and `bytes`

In that scenario you have no name link between Py2 objects and Py3 objects.

The `str` name reuse for different objects makes the 2-3 transition that much harder.


Yes that's what I said, and there's nothing "wrong" with it, I just think that it's where most of the confusion comes from, as many people that come from Python 2 think of `str` as a byte-string, while in Python 3 it's a Unicode string.

I think the "unicode-by-default" approach of Python 3 is a great reason to switch to it (and I did so for almost all my codebases), converting code can sometimes cause surprises though, especially due to the lack of type checks in Python combined with the more stricter checking that Python 3 does when working with strings (i.e. adding Unicode and byte strings throws an exception in Python 3, whereas Python 2 would simply try to convert the byte-string implicitly into a string and only throw an exception if this fails). All theses changes combined make converting code from 2 to 3 more risky. I think especially the handling of type coercion between Unicode and bytes objects could have been handled differently, as it is (IMO) a big semantic change.

As an example, just today I ran into a problem with WTForms and Flask (which is a great framework btw, thanks for writing it!), where I used a cookie-based session store provided by `itsdangerous`, which deserializes a value stored in the session either into a byte-string or a Unicode string. After switching my code from Python 2 to Python 3, there were still some clients in the wild that had a session object in their cookie generated by Python 2 containing byte-strings, and `itsdangerous` happily deserialized it into a byte-string as well, causing an exception when my Python 3 code tried adding it to a Unicode string. With new sessions this didn't happen of course (as the objects were serialized as Unicode), which made this thing quite fun to debug. Granted, this is not a problem with Unicode in Python 3, but I think it illustrates the kind of challenges you face when migrating codebases from one to the other.

What I can agree on is that we'd be much better off if everyone was using Python 3, and eventually we'll get there.


Your comment would be more constructive if you could expand on what, in your opinion, Python 3's str failings are.


This blog post may illuminate you (commenter is the author, IIRC):

http://lucumr.pocoo.org/2014/5/12/everything-about-unicode/


Is this a comment on the article or just its title?


It's a comment on the title but also the article is not even talking about the various issues with Unicode in Python.


The only issue I ever had with Python's Unicode support was due to my own lack of understanding, a direct consequence of failing to read the docs, because I'm lazy and opinionated, and have a bad reaction when reality differs from what it obviously should be, also I'm intolerant to adversity.


In Python 2.7.12

    >>> s = u'おはようございます'
    >>> print s[::-1]
    すまいざごうよはお

The only problem with Python 2.7 strings is that the author of the article doesn't grok them. Just use u'' instead of ''. There is also "from __future__ import unicode_literals".


That's not a Python 2 str, that's a unicode object. Those two things are not the same. The unicode string behaviour is (indeed mostly) the same as what Python 3 calls a str.

  Python 2.7.13 (default, Dec 19 2016, 10:24:34)
  [GCC 4.2.1 Compatible Apple LLVM 8.0.0 (clang-800.0.42.1)] on darwin
  Type "help", "copyright", "credits" or "license" for more information.
  >>> s = u'おはようございます'
  >>> type(s)
  <type 'unicode'>
  >>> s = 'おはようございます'
  >>> type(s)
  <type 'str'>
Also, just doing a from __future__ import unicode_literals can cause problems in Python 2 if you're using other libraries that don't correctly handle str vs unicode or make assumptions about character encoding.


In both of those cases, you're no longer using a Python 2 string, in that it's not a bytes-like object. For example, you can't write a Unicode object across a socket - it has to be converted to bytes first.


unicode is a subtype of basestring. And it existed since the beginning of Python 2, afaik. It's extremely unfair to compare Python 3 strings to Python 2 str, when Python 3 str is just a rename of Python 2 unicode and has basically the same features. And the article doesn't even mention that Python 2 has all the same features.


> And the article doesn't even mention that Python 2 has all the same features.

Because it's not about the fact that you can achieve the same thing. What the article does is illustrate that a thing that returns 'str' when you call 'type' on it in Python 2 is not the same nor can be used in the same way as what Python 3 would call a 'str'.


No, if that was the author's intention, he would've simply pointed out that "unicode" was renamed to "str", while "str" was renamed to "bytes" and is no longer considered a string. Instead, he wanted to demonstrate how Python 3 string handling is some sort of a major breakthrough when it's largely a cosmetic change.


It's a cosmetic change which happens to completely derail ill-informed Python programmers, so yeah, it's a big deal.


The people complaining about "Python3 broke strings!" and who are targeted by the article likely don't use unicode objects in python2, and probably don't think they need them; if they did, they probably wouldn't have such issues with the change?


So, for example, what is The Py3k Right Way to fetch a URL with text content into a string? The example https://docs.python.org/3/library/urllib.request.html#exampl... just says "it's complicated, we just know it's utf-8", and even that would bomb out if there's a character spanning the 100-byte boundary.

To me it looks like by insisting on lossy byte/string conversion of all I/O the language painted itself into a corner, with funny "a bytes is not a string" chanting sideshow.


The "right way" in every single language is to follow the encoding sniffing algorithm: https://html.spec.whatwg.org/multipage/syntax.html#encoding-....

You may want to note the following:

1. implementing the entire encoding-sniffing algorithm for a basic example is a bit extreme

2. the very first step of the encoding-sniffing algorithm is "if the user has explicitly provided an encoding, use it", which is essentially what the example does

> To me it looks like by insisting on lossy byte/string conversion of all I/O the language painted itself into a corner

That makes literally no sense. If you want to "fetch a URL with text content into a string" — emphasis on string, not bag of bytes, if you want a bag of bytes you can skip the whole decoding thing in Python, it's not necessary — there's no other way than to decode it, which means you need to assert or discover its encoding; for all you know, the document could be in Big5.


Seriously. It's easy to "fetch a URL with text content into a string" with Python 2 right now _because Python 2 assumes that all bytes are ASCII_.


No, it's not! It's easy to fetch a url's content into a bag of bytes. That's it.

If the content of the page happens to contain an emoji (which hey, more and more do) or even a friggin' unicode double-quote mark, then you're no longer fetching it into a valid string, if you take string to mean faithful textual representation. You've got all the bytes there, but you can't interpret it correctly.

This is a real problem that has caused real pain for me with various tools written in python2


Which is something you can do in Python 3 as well, if you really want the broken behaviour of Python 2, or if the behaviour of Py2 is "ok for me".


> emphasis on string, not bag of bytes,

This is the crux of the issue. You run around reminding everyone of this, yet you are fine with "read bag of 100 bytes, then go merrily utf-8 decode them" as the canonical example in the official documentation.


What is text content? You have to either know the encoding, find it from the metadata, or blindly assume one for the byte string HTTP gives you. In the "simplest" case Python3 only forces you to write that assumption explicitly in your code; Python2 just made one for you.


Very well, now how do I put down that assumption? Is "read 100 bytes, utf8-decode them" the right example?


The example probably does the truncation this way just to show the effect of decode() on the console, but if you really want a truncated version you should either decode everything and then truncate, or use a StreamReader to only decode the first X characters. The docs probably should make that issue clear, so yes, the example there is badly chosen.

Another real-life answer probably is "don't use urllib directly", since sadly the trend seems to go away from stdlib libraries for these things. E.g. Requests will take the encoding from the HTTP headers if available, and allows you to manually specify it as well if you want.
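A hedged sketch of both routes (requests is a third-party package; falling back to UTF-8 when no charset is declared is an assumption, not a rule):

    from urllib.request import urlopen
    import requests  # third-party
    URL = 'https://example.com/'
    # stdlib: take the charset from the Content-Type header, read everything, then decode
    with urlopen(URL) as resp:
        charset = resp.headers.get_content_charset() or 'utf-8'
        text = resp.read().decode(charset)
    # requests does the header lookup for you
    text2 = requests.get(URL).text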


I've got nothing against Python3. But recognize that Python3 got itself into this mess by being incompatible with Python2. It's a different language, albeit superficially similar to Python2. Python3 should have been given another name, like "Bob" or something. Ditto for Perl6 vs Perl5 and Angular2 vs Angular1.


Everyone is running around talking about semver, but then tells you that you're supposed to rename your project when you do breaking changes? Sure.

And it's not like there was no early announcement that it wouldn't be 100% compatible. That was officially announced 10+ years ago.


It's not a matter of lead time in announcing it. If it's not backwards compatible then it's a different language. I can still run 20 year old C code in a modern C compiler.


In languages, this is accepted practice. Was Perl 5 backward compatible with Perl 4? Or C11 with C99? Or even C99 with C89, or K&R C?

If you use reasonable and forward-looking language constructs, it will be runnable on – or at least very easy to port to – newer major versions of the language. This is also true for the transition from Python 2 to Python 3. The problem is that a lot of people used (and still use) Python 2 badly.


> Was Perl 5 backward compatible with Perl 4? Or C11 with C99? Or even C99 with C89, or K&R C?

Yes.


Actually, no.

There are many differences between K&R/UNIX C and C89. For example, in K&R string constants could be modified, and repetitions of the same constant would be different strings. This is not the case in C89. Variadic functions are different. Octal numbers were changed. Arithmetic works differently. And so on.

I can't comment on Perl 4 vs 5 vs 6.


The latest version of Perl 5 is almost entirely backwards compatible with every earlier version of Perl 5. (There are some features which almost no one used, and that should never be used anyway, which were eventually removed.)

Perl 5 is backwards compatible with Perl 4

Perl 4 is just a renamed version of Perl 3 to coincide with the release of the book "Learning Perl"

Perl 3 is backwards compatible with Perl 2

Perl 2 is backwards compatible with the original Perl

Many of the problems that new Perl programmers have with learning Perl is that every new feature had to be added in a way that it didn't break existing code. That is why for example you have to add "use strict" and "use warnings" to every Perl source file, even though that should be the default.

Perl 6 exists because we "wanted to break everything that needs breaking", so it is very different than any previous version.

That is why both Perl 5 and Perl 6 will both continue to be supported languages.

Imagine taking good ideas from every high level modern programming language, bringing them all together, while making the features seem like they have always belonged together. That is Perl 6.

I like to say that as Perl 4 is to Perl 5 is to Perl 6, C is to C++ is to Haskell, C#, Smalltalk, BNF, Go, etc.


> in K&R string constants could be modified

Every compiler has a writable strings flag for backwards compatibility.

Everything else you cited is backwards compatible. You are confusing forward compatibility with backwards compatibility.


Python 'got itself into this mess' by having a broken string handling design. Python 3 fixes it - the incompatibility is not there for the sake of incompatibility.


To be fair, broken string handling is extremely abundant.


Well that's OK then


Oh, not at all. I'm just saying that no one did string handling right when Python was conceived. (And, evidently, much of the brokenness persists).


maybe 2020 also brings python4


Oh I grok it just fine, it just sucks horribly.

I may be missing something "pythonic" but casting between byte strings and ascii strings - a null operation - is by far my greatest cause of runtime bugs.

To make it worse, the obvious cast is broken beyond all belief. What do we get from str(b'hello world')? Literally "b'hello world'" - a __repr__ of the object. Of course, adding an encoding changes the actual functionality of the cast ... str(b'hello world', 'ascii') gives 'hello world'. So, so broken.


You are doing it wrong. Don't mix bytes and str. "Cast" is C lingo -- you don't "cast" completely distinct types like str and bytes.

(Note: Personally I would love to have str always represented internally in UTF-8. If it were like this, I would say "cast" was acceptable vocabulary. But it's not like this).

> str(b'hello world', 'ascii') gives 'hello world'. So, so broken.

https://docs.python.org/3/library/stdtypes.html#str

It's all documented and actually quite simple.

By the way, I never needed this. str() is just a convenience function, it's not meant for reliability. As you surely know there is complex interplay with __str__ and __repr__. I use str() for debugging, error messages etc. But not for anything that I expect to be able to process further.

How do you decode a bytes object to str object? Simple and intuitive, use .decode('utf-8')
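i.e. (sketch):

    raw = b'gr\xc3\xbc\xc3\x9fe'   # bytes off the wire
    text = raw.decode('utf-8')     # -> 'grüße', now a str
    back = text.encode('utf-8')    # -> the same bytes again
    assert back == raw and text == 'grüße'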


> (Note: Personally I would love to have str always represented internally in UTF-8. If it were like this, I would say "cast" was acceptable vocabulary. But it's not like this).

UTF-8 is a good choice for transmission and storage, but since it's somewhat difficult to decode and isn't easily seekable it's not necessarily the best in-memory representation of a string.

Fortunately, that's, from a Python PoV, an implementation detail (even at the C API level).

> By the way, I never needed this. str() is just a convenience function, it's not meant for reliability. As you surely know there is complex interplay with __str__ and __repr__. I use str() for debugging, error message etc. But not for anything that I expect to be able to process further.

Or log messages. Broadly speaking str() means "tell the object to describe itself", while repr() means "tell the object to describe itself, possibly in a way that I can eval() it"


UTF-32 isn't easily seekable either because Unicode text has combining marks that shouldn't be separated from their base character. Almost every language has a "string" type that doesn't handle this correctly.


To add, I think UTF-8 is actually very easy to decode. It's true that indexing is expensive, so ideally different or additional data structures are needed for different use cases.


You don't grok it. Bytes are not ASCII strings. They are NUMBERS between 0 and 255. Strings are strings. DECODING bytes with ASCII encoding into TEXT is not "casting", it's DECODING.


A thing I've found that helps people with that is to override __repr__ on bytestrings in python2 and on bytes in python3.

If you make it print the contents not as "ascii" but as hex, people grasp the difference more easily.


I wouldn't say they are not ASCII strings, because they are if you squint.

Better say "bytes are not Unicode strings".


No, they are NOT strings, they are not any kind of strings. They are bytes:

    >>> b'asd'[0]
    97
    >>> b'asd'[1]
    115
    >>> b'asd'[2]
    100
They happen to contain an ASCII DECODABLE string in this case, but to get the text back, you NEED TO decode it first:

    >>> text = b'asd'.decode()
    >>> text[0]
    'a'
    >>> text[1]
    's'
    >>> text[2]
    'd'


Technically true. I'd inferred a different context on this one: by "strings" I meant the abstract concept in this case, not the Python str, which is the type actually meant here.


Mate... the article was written for you. You don't 'grok' the difference between byte arrays and strings. Take a piece of paper and figure it out.


No, really, I know it. It's a pile of bytes until it has an encoding - and I agree with this wholeheartedly. I think that getting involved with 'str' is the (my) problem here.



