
Python 3 gets so much of this right. It's one of the things I really loved about Python 3, as it allows for correct string handling in most cases (see below).

Note that this is only really true with Python 3.3 and later: in earlier versions, stuff would start breaking for characters outside of the BMP (which is where JS is still stuck, btw) unless you had a wide build, which used a lot of memory for strings (4 bytes per character).

In general, using Unicode internally and converting to and from bytes when doing I/O is the right way to go.

But: due to Han unification (http://en.wikipedia.org/wiki/Han_unification), being locked into Unicode might not be feasible for everybody - especially in Asian regions, Unicode isn't yet as widespread and you still need to deal with regional encodings, mainly because even with its huge character set, Unicode still can't reliably represent every language.

Ruby 1.9 and later helps here by having many, many string types (as many as it knows encodings), which can't be combined with each other without conversion.

This allows you to still have an internal character set for your application and do encoding/decoding at I/O time, but you're not stuck with Unicode if that's not feasible for your use case.

People hate this though because it seems to interfere with their otherwise perfectly fine workflow ("why can't I assign this "string" I got from a user to this string variable here??"), but it's actually preventing data corruption (once strings of multiple encodings are mixed up, it's often impossible to un-mix them if they have the same character width).
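Python 3 enforces a similar separation between bytes and str, which gives a feel for what this strictness buys you; a minimal sketch:

    data = b'caf\xc3\xa9'                    # raw bytes, e.g. read from a socket
    text = 'café'
    # data + text raises TypeError: bytes and str never mix implicitly
    combined = text + data.decode('utf-8')   # fine once you declare the encoding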

I don't know how good the library support for the various Unicode encodings is in Ruby though. According to the article, there's still trouble with correctly doing case transformations and with reversing strings.

Which brings me to another point: some of the stuff you do with strings doesn't just depend on string encoding, but also on locale.

Uppercasing rules, for example, depend on locale, so you need to take that into account too. And, of course, deal with cases where you don't know the locale the string was in (encoding is hard enough and in most cases undetectable - but locales are next to impossible).
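The classic illustration is Turkish dotted/dotless i; Python's str.upper(), for instance, is locale-independent, so it can't get this right for Turkish text:

    'i'.upper()          # 'I' - correct for English
    'istanbul'.upper()   # 'ISTANBUL', but Turkish expects 'İSTANBUL' (U+0130)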

I laugh at people who constantly tell me that this isn't hard and that "it's just strings".




> Python 3 gets so much of this right

What does it get right????? It's all as broken as nearly everything else!

It's sad that 99% of the comments there are “oh see, I can run some examples from the page just fine. So everything's all right, I've got full Unicode!”

The reality is there are 1-2 languages that are trying to make it correct from the beginning (perl6, I'm looking at you). It's 2013, and if a language can compose bytes into code points, everyone declares a win, sticks a "full Unicode support" label on it and carries on with str[5:9].

”But I've got UnicodeUtils!” — it won't help. People just don't want to, or cannot, write it correctly. A word is not [a-z]. Not [[:alpha:]] either. And not [insert regex here]. You cannot reverse by reversing the codepoint list. And you cannot reverse by reversing the grapheme list. And string length is hard to compute, and then it doesn't make any sense. And indexing a string doesn't make any sense either, and it's far from O(1).


Can you provide some examples of Python 3 getting strings wrong?

Between strings being sequences of Unicode code points (you have to encode to bytes to get UTF-8) and unicodedata for normalization and decomposition (http://docs.python.org/3.3/library/unicodedata.html), I've found Python 3 pretty robust. Python 3.3 also uses appropriate Unicode data for regular expressions, as mentioned on http://docs.python.org/3.3/howto/regex.html.

If you want to compare strings you should really normalize them first, which is where unicodedata comes in. In my programming situations it would be wrong to conflate different decompositions of the same Unicode string. Why is this? Because other software you interact with uses encodings, and the UTF-8 encodings of two different decompositions are different. I've run into this with UTF-8 filenames on OS X when working with Subversion.
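For instance, a quick sketch of that comparison pitfall with unicodedata:

    import unicodedata

    composed = '\u00e9'     # 'é' precomposed
    decomposed = 'e\u0301'  # 'e' + combining acute accent

    composed == decomposed                                # False
    unicodedata.normalize('NFC', decomposed) == composed  # True
    composed.encode('utf-8')                              # b'\xc3\xa9'
    decomposed.encode('utf-8')                            # b'e\xcc\x81'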


Did you read the comment you're replying to at all? You can start at “It's sad that 99% of the comments”.

PS:

  Python 3.3.2 (default, Nov 27 2013, 20:04:48)
  [GCC 4.2.1 Compatible Apple LLVM 5.0 (clang-500.2.79)] on darwin
  Type "help", "copyright", "credits" or "license" for more information.
  >>> 'öo̧'[1:]
  '̈o̧'

And sorry, those new regexes don't even support \X (grapheme matching)

Edit: python version


Yes, I did, and you did not provide a single example. You just said “oh see, I can run some examples from the page just fine. So everything's all right, I've got full Unicode!”.

Taking the time to actually prove your point is useful. However, your recent example seems to be running fine on Python 3.3. You did not include any version info in your example output.

    Python 3.3.0 (default, Mar 11 2013, 00:32:12) 
    [GCC 4.7.2] on linux
    Type "help", "copyright", "credits" or "license" for more information.
    >>> "öo̧"[1:]
    'o̧'
    >>>
I haven't run across any situations where Python 3.3 is getting it wrong, which is why I am asking for some examples.


3.3.2. No, it is not. Use 'o\u0308o\u0327'


Oh, I see the issue here. You are expecting the string class to function via graphemes rather than code points. It should be possible to implement grapheme support since code point support is there, but I imagine the reverse is not true.

A little googling turned this up. https://mail.python.org/pipermail/python-ideas/2013-July/021...


TL;DR of parent comment here for those skimming:

  x = 'o\u0308o\u0327'
  len(x) == 4
  x == "öo̧"
  x[:2] == "ö"
  x[2:] == "o̧"
  x[0] == x[2] == "o"


You got me curious about grapheme matching in Python with regex. It looks like it is not in the stdlib yet with 3.3. However, if you install https://pypi.python.org/pypi/regex and then replace:

    import re
with

    import regex as re
Then if you want to do grapheme slicing, you could use something like:

    import regex as re

    decomposed_str = 'o\u0308o\u0327'
    graphemes = re.findall(r'\X', decomposed_str)  # one element per grapheme cluster
    sub_graphemes = graphemes[1:]                  # slice by grapheme, not by code point
    decomposed_substr = ''.join(sub_graphemes)     # 'o\u0327'


But what is that sequence (I know the Unicode sequence is listed below -- but is it some weird edge case)? — because if I manually compose/type those (and a few other characters) everything seems to work fine:

    [edit: Python 3.2.3]
    [edit: [GCC 4.7.2] on linux2]

    >>> 'öo̧'[1:] #copy-paste
    'o̧'
    >>> 'öo̧'[::-1] # "reverse" also breaks
    '̧oö'
    #But for Japanese:
    >>> '日本語'[1:]
    '本語'
    >>> '日本語'[:-1]
    '日本'
    >>> '日本語'[-1:]
    '語'
    >>> '日本語'[::-1]
    '語本日'
    # And Norwegian
    >>> 'æåø'[::-1]
    'øåæ'
    # And a few "French" characters (in this case
    # manually typed as alt+~+e, etc
    >>> 'ẽêèe'[::-1]
    'eèêẽ'
    # And crucially for your example, typed as
    # alt+"+o
    >>> 'öo'[::-1]
    'oö'
So is your initial example some kind of unicode-without-bom(b) or something?

[edit2: I gather that working with "pre-composed" characters works, and working with "de-composed" ones breaks. Which, while expected, is a little sad, I agree.]


> Python 3 gets so much of this right. It's one of the things I really loved about python 3 as it allows for correct string handling in most cases (see below).

One of the biggest things that I feel Python gets right with the string type is that strings are immutable. It makes a lot of things easier.

It really makes sense to have a good string type for small strings, stored in unicode. Immutability makes everything simpler.

The string type is not a good fit for handling large amounts of text. There are trade-offs for efficiency that have to be made to create a handy string type. It really makes sense to have a separate "bytes" type or some kind of StringBuffer for doing big text operations.
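In Python terms that usually means accumulating into io.StringIO (or a list you join at the end) instead of repeatedly concatenating immutable strings; a minimal sketch:

    import io

    buf = io.StringIO()              # mutable accumulator for large text
    for i in range(3):
        buf.write('chunk %d\n' % i)
    result = buf.getvalue()          # materialize the immutable str once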


Isn't the string type immutable in many (most?) other languages as well? In Objective-C the default is an immutable string (though optionally one can create mutable strings as well). Lua also uses immutable strings. In Java and C# I think the situation is the same, since if you want to use high-performance string manipulation, you'll generally resort to some form of StringBuilder helper class.


Correct, C# and .NET have an immutable string class and a mutable StringBuilder helper class.


I believe strings are mutable in Ruby.


They are...

    s = "hello"
    s << "   world"
    s # hello world


Is that allocating a new buffer, leaving the "hello" string to be collected by the GC?


No, it's operating in place:

    def append_world(str)
        str << " world"
    end

    a = "hello"
    append_world(a)
    a                       #=> "hello world"


No, it expands the existing buffer in place (leaving " world" to be collected). Note that the following is different and more like what you're thinking:

    a = a + " world"


>In general, internally using unicode and converting to and from bytes when doing i/o is the right way to go.

I'm not sure what "internally using unicode" means. Python's internal representation of strings has changed a lot; it hasn't even been stable within Python 3. Now they are apparently using an internal representation that varies depending on the "widest" character stored.

The only solution that isn't driving me insane is to use UTF-8 everywhere. The Python 3 unicode situation is actually the main reason why I'm not using Python much these days.


In Python 3, you don't care about what they use internally. You don't need to.

If you want to work with strings, you work with strings. If you want to work with bytes, you work with bytes. If you want to convert bytes into strings (maybe because it's user input that you want to work with), then you tell Python what encoding these bytes are in and you have it create a string for you. You don't care what Python uses internally, because their string API is correct and correctly works on characters.

That noël example from the original article consists of 4 characters in Python 3, which is exactly what you want.
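A minimal sketch of that boundary-conversion workflow, using the article's noël example (with a precomposed ë):

    raw = b'no\xc3\xabl'          # UTF-8 bytes from the outside world
    text = raw.decode('utf-8')    # decode once, at input time
    len(text)                     # 4 code points: 'n' 'o' 'ë' 'l'
    out = text.encode('utf-8')    # encode once, at output time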

I know that just using UTF-8 everywhere would be cool, but that's not how the world works, for various reasons. One is that UTF-8 is a variable-length encoding, which has performance issues for some operations (like getting the length of a string, or finding the n-th character).

UTF-8 also isn't widely used by current operating systems (Mac OS and Windows use UCS-2). It's also not what's used by way too many legacy systems still around.

So as long as the data you work with likely isn't in UTF-8, the encoding and decoding steps will be needed if you want to be correct. Otherwise, you risk mixing strings in different encodings together, which is an irrecoverable error (aside from using heuristics based on the language of the content).


>In Python 3, you don't care about what they use internally. You don't need to.

I do need to know and I always care. My requirements may be different than those of most others because I write text analysis code and I need to optimize the hell out of every single step. I shiver at the thought that any representation could be chosen for me automatically.

Of course, nothing is stopping me from simply using the bytes type instead of str, but clearly the Python community has decided to go down a road I feel is entirely wrong so I'm not coming along.

>I know that just using UTF-8 everywhere would be cool, but that's not how the world works for various reasons. One is that UTF-8 is a variable length encoding which has some performance issues for some operations (like getting the length of the string. Or finding the n-th character).

I'm bound to live in a variable length character world unless I decide to use 4 byte code points everywhere, which is prohibitive in terms of memory usage. Memory usage is absolutely critical. Iterating over a few characters now and then to count them is almost always negligible.

The need to index into a string to find the nth character only comes up when I know what I'm looking for. Things like parsing syntax or protocol headers come to mind, and they are always ASCII. I don't remember a situation where I needed to know the nth character of some arbitrary piece of text and repeat that operation in a tight loop.

If one day I find myself in such a situation I will have to convert to an array of code points anyway.


So in your one, specific, performance-limited situation, Python 3's implementation of unicode doesn't work for you. Mostly because you are trying to optimize based on implementation details.

I don't see how this equates to a general purpose language failing at strings, especially when the language isn't particularly focused on performance and optimization. And if memory usage is of concern, I would certainly think anything like Python and Ruby would be out of the running?


>I don't see how this equates to a general purpose language failing at strings

And I don't see where I said it did.

I used to favor a dual Python/C++ strategy, but Python's multithreading limitations and the decisions around unicode have convinced me to move on. It's not like anything has gotten worse in Python 3, it's just that there has been a major change and the opportunity to do the right thing was missed.

I happen to think that UTF-8 everywhere is the right way to go, not just for my particular requirements, but for all applications, because it reduces overall complexity.


I strongly disagree, and I'd like to know what you think the "right thing" would be.

I agree that only using UTF-8 would be the right thing, but only if you don't want to have an "array of codepoints"... the problem is: every language and every developer expects to be able to have random access to codepoints in their strings...

there're some weird exceptions, like Haskell's Data.Text (I think that's due to Haskell's laziness)

would you prefer to have O(n) indexing and slicing of strings... or would you prefer to get rid of these operations altogether?

if the latter, what would you prefer to do? force developers to use .find() and handle such things manually... or create some compatibility string type restricted to non-composable codepoints?

Getting an implementation out to see it used in the wild might be an interesting endeavor... it'd probably be easier to do in a language that allows you to customize its reader/parser... like some Lisp... Clojure


>I agree that only using UTF-8 would be the right thing, but only if you don't want to have "array of codepoints"

Then we agree entirely. I want all strings to be UTF-8. Period. What I said about an array of codepoints was that I would create one separately from the string if I ever had a requirement to access individual code point positions repeatedly in a tight loop.

>the problem is: every language, and every developer expect to be able to have random access to codepoints in their strings

If by random access you mean constant time access then those developers would be very disappointed to learn that they cannot do that in Java, C#, C++, JavaScript or Python, unless they happen to know that their string cannot possibly contain any characters outside the ASCII or BMP range.

>would you prefer to have O(n) indexing and slicing of strings

I would leave indexing/slicing operators in place and make sure everyone knows that they work on bytes, not codepoints. In addition to that I would provide an O(n) function to access the nth codepoint as part of the standard library.
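A rough Python sketch of what such an O(n) accessor over UTF-8 bytes could look like (nth_codepoint is a hypothetical helper, not an existing API):

    def nth_codepoint(data, n):
        """Return the nth code point in UTF-8 bytes, in O(n) time."""
        count = 0
        i = 0
        while i < len(data):
            if data[i] & 0xC0 != 0x80:  # not a continuation byte: a code point starts here
                if count == n:
                    j = i + 1
                    while j < len(data) and data[j] & 0xC0 == 0x80:
                        j += 1          # swallow this code point's continuation bytes
                    return data[i:j].decode('utf-8')
                count += 1
            i += 1
        raise IndexError(n)

    nth_codepoint('noël'.encode('utf-8'), 2)  # 'ë'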


> If by random access you mean constant time access then those developers would be very disappointed to learn that they cannot do that in Java, C#, C++, JavaScript or Python, unless they happen to know that their string cannot possibly contain any characters outside the ASCII or BMP range.

Actually, you can in Python... and obviously most developers ignore such issues [citation needed]

My point is that most developers don't know these details, and a lot of idioms are ingrained... getting them to work with string types properly won't be easy (but a good stdlib would obviously help immensely in this regard)

> I would leave indexing/slicing operators in place and make sure everyone knows that it works with bytes not codepoints. In addition to that I would provide an O(n) function to access the nth codepoint as part of the standard library.

Ok, so with your proposal, a hypothetical slicing method on a String class in a Java-like language would have this signature?

    byte[] slice(int start, int end);

I've been fancying the idea of writing a custom String type/protocol for Clojure that deals with the shortcomings of Java's strings... I'll probably have a try with your idea as well :)


> Actually, you can in Python...

No, you can only get random access to codepoints, which will break text as soon as combining characters are involved. Even normalizing everything beforehand (which most people don't do) doesn't help, as not all possible combinations have precomposed forms.

Unicode makes random access useless at anything other than destroying text.

> but a good stdlib would obviously help immensely in this regard

Which is extremely rare, and which Python does not have.


>Actually, you can in Python

You are right (apart from combining characters, as masklinn explained), but as I said, that's only possible if an array of 32-bit ints is used to hold string data, or if it can be guaranteed that there are no characters from outside ASCII or the BMP. If I understand PEP 393 correctly, what Python 3.3 does is use 32-bit ints to hold the entire string if even one such code point occurs. So if you load a (possibly large) text file into a string and one such code point exists, then the file's size is going to quadruple in memory. All of that is done just to implement one very rare operation efficiently. http://www.python.org/dev/peps/pep-0393/#new-api
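You can watch that widening happen in CPython 3.3+ with sys.getsizeof (exact byte counts vary by build; the ratio is the point):

    import sys

    ascii_str = 'a' * 1000
    astral_str = ascii_str + '\U0001F600'  # one code point outside the BMP

    sys.getsizeof(ascii_str)   # ~1 byte per code point, plus object overhead
    sys.getsizeof(astral_str)  # roughly 4x: the whole string is stored as 32-bit units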


Sounds like you want to use Go. It feels like Python, but with technically correct implementations of these concepts.


> Mac OS and Windows use UCS-2

Which parts of Mac OS? You'd have a lot of problems with Emoji support if that were true. To the best of my knowledge, it's UTF-16 everywhere.

Or do you actually mean Mac OS as in Mac OS 9, and not OS X?


Agreed with most of it except:

"because their string API is correct"

Apparently they have a bug in their UTF-7 parser that can lead to invalid unicode strings. Don't know if it's already fixed.


It was a bug in the decoding: it raised an unexpected exception, nothing that couldn't be worked around with a check (afaik it didn't crash the interpreter)

and it was fixed more than a month ago, just 2 days after it was reported:

http://bugs.python.org/issue19279

Let's avoid spreading fud, shall we? :)


That would be an implementation flaw, not an API issue.


Indeed.


> stuff would start breaking for characters outside of the BMP (which is where JS is still stuck at, btw)

ECMAScript 6 fixes that, mostly. See http://mathiasbynens.be/notes/javascript-unicode for details.




