> It is often best to avoid non-ASCII characters in source code. Indeed, in some...

O_H_E · on April 23, 2021

> It is often best to avoid non-ASCII characters in source code.

Scientists writing Julia are laughing on the floor.

kps · on April 23, 2021

Sometimes I wish COBOL style hadn't won over mathematical style as the convention for every general purpose language except possibly Go.

marcosdumay · on April 23, 2021

There are many languages pushing for mathematical style formulas, but until they get easy to type, I don't think any of them will succeed.

It's not COBOL that won. It's qwerty.

eigenspace · on April 23, 2021

Julia makes typing unicode characters very easy; so easy that I keep a Julia repl open a lot of the time just for typing various mathematical symbols easily.

For example, to type α², you just type \alpha, hit the tab button to turn it into α, and then type \^2 and hit tab again to turn that into ².

If someone gives you a unicode character that you don't know how to type, you just hit the ? button to enter the repl-help mode and then paste the character in.

    help?> α²
    "α²" can be typed by \alpha<tab>\^2<tab>

    search:

    Couldn't find α²

      No documentation found.

      Binding α² does not exist.

snicker7 · on April 23, 2021

The amazing thing about this UI is that it just works across many of different editors and REPLs (VSCode, Emacs, Pluto, &etc).

It reminds me of GNU TeXMacs, which has a similar interface for equations. That is, typing "\int" followed by tab will render an integral sign.

speedplane · on April 23, 2021

> It is often best to avoid non-ASCII characters in source code. ... >> Depends of the language. In Python 3, files are expected to be utf8 by default, and you can change that by adding a "# coding: <charset>" header.

It's interesting that many languages avoid unicode and non-ASCII text, yet they make assumptions about file and directory structures about the underlying system. It's as if interpreting directory and file system structures is "okay", but interpreting file formats is not.

> In fact, it's one of the reasons it was a breaking release in the first place, and being able to put non-ASCII characters in strings and comments in my source code are a huge plus.

Sorry, but as a Python dev that went from 2 to 3, yes native unicode features are nice, but no, it was not worth breaking two decades of existing code.

BiteCode_dev · on April 23, 2021

As somebody living in Europe, I think it's a perspective you can have only if you live mostly in an English speaking world.

Up to 2015, my laptop was plagued with Python software (and others) crashing because they didn't handle unicode properly. Reading from my "Vidéos" directory, entering my first name, dealing with the "février" month in date parsing...

For the USA, things just work most of the time. For us, it's a nightmare. It's even worse because most software is produced by English speaking people that have your attitude, and are completely oblivious about the crashing bugs they introduce on a daily basis.

In Asia it's even more terrible.

And I've heard the people saying you can perfectly code unicode aware software in Python 2. Yes, you can, just like you can code memory safe code in C.

In fact, just teaching Python 2 is painful in Europe. The student write their name in a comment ? Crashes if you forget the encoding header. Using raw_input() with unicode to ask a question? Crashes. Reading bytes from an image file ? Get back a string object. Got garbage bytes in string object ? Silently concatenate with a valid string and produce garbage.

qsort · on April 23, 2021

> As somebody living in Europe, I think it's a perspective you can have only if you live mostly in an English speaking world.

I live in Europe and I (mostly) agree that (most) code shouldn't (usually) contain any codepoint greater than 127. It's a simple matter of reducing the surface area of possible problems. Code needs to work across machines and platforms, and it's basically guaranteed that someone, somewhere is going to screw up the encoding. I know it shouldn't happen, but it will happen, and ASCII mostly works around that problem. Another issue is readability. I know ASCII by heart, but I can't memorize Unicode. If identifiers were allowed to contain random characters, it would make my job harder for basically no reason. Furthermore, the entire ASCII charset (or at least the subset of ASCII that's relevant for source code) can easily be typed from a keyboard. Almost everyone I know uses the US layout for programming because European ones frankly just suck, and that means typing accents, diacritics, emoji or what have you is harder, for no real discernible reason.

String handling is a different issue, and I agree that native functions for a modern language should be able to handle utf8 without crashing. The above only applies to the literal content of source file.

BiteCode_dev · on April 23, 2021

I understand to not want utf8 for the identifiers, as you type them often and want the lowest common keyboard layout demonitor, but comments and hardcoded strings, certainly not.

Otherwise, chinese coders can't put their name in comments? Or do they need to invent an ascii names for themself, like we forced them to do in the past for immigration ?

And no hard coded value for any string in scripts then? I mean, names, cities, user messages and label, all in in config files even for the smallest quick and dirty script because you can't type "Joyeux Noël" in a string in your file? Can't hardcode my "Téléchargements" folder when in doing some batch work in my shell?

Do I have write all my monetary symbols with escape sequences? Having a table of constant filled with EURO_SIGN=b'\xe2\x82\xac' instead of using "€"? Same for emoji?

I disagree, and I'm certainly glad that not only Python3 allows this, but do it in a way that is actually stable and safe to do. I rarely have encoding issues nowadays, and when I do, it's always because something outside is doing something fishy.

Utf8 in my code file, yes please.

BiteCode_dev · on April 23, 2021

Lol, just realized my table should be "EURO_SIGN=b'\xe2\x82\xac'.decode('utf8')".

Damn, a perfect example of why it's nice to be able to just do "€" in my file. It's also less bugs.

deathanatos · on April 23, 2021

If we're still talking about Python 3, you can also write,

  EURO_SIGN = '\N{EURO SIGN}'

Which is ASCII, and very clear about which character the escape is. I wish more languages had \N. That said, I'm also fine with a coder choosing to write a literal "€" — that's just as clear, to me.

While I don't believe the \N is backwards compatible to Python 2, explicitly declaring the file encoding is, and you can have a UTF-8 encoded source file w/ Python 2 in that way.

I'd also note the box drawing characters are extremely useful for diagrams in comments.

(And… it's 2021. If someone's tooling can't handle UTF-8, it's time for them to get with the program, not for us to waste time catering to them. There's enough real work to be done as it is…)

speedplane · on May 2, 2021

You can do Unicode in Python 2. You can do Unicode faster and easier in Python 3. But by gaining that ability, they set the existing community back by 10 years.

kazinator · on April 23, 2021

I wouldn't say that everything in \x00 to \x1f is OK in source code.

chrismorgan · on April 23, 2021

Eh, I reckon they’re fine.

In shell scripts especially (and sometimes elsewhere), if I want an ANSI escape code, I’ll type the actual bytes I want, because it’s more reliable and easier to type <C-V><Esc> and see ^[ in Vim than it is to worry about whether I’m using \u001b, \u{1b}, \x1b, \033 or something else, and whether I have '', "", $'', $"", printf, echo -e, &c. &c. Speaking generally, if you use the actual control characters, the code will either fail to run (parse or compile error) or work, whereas encoding your control codes with escape sequences is more hit and miss. I’ve seen quite a few others using control characters in this way too.

Truth to tell, I’m quite liberal in using all of Unicode in my source code. About the only places I wouldn’t use the actual characters are when (a) the environment requires me to escape the thing; (b) the numeric values of the characters are more meaningful than the characaters themselves; and (c) when dealing with combining characters and modifiers, because otherwise they’ll do things like combine with the opening ' from the string/character literal.

kazinator · on April 23, 2021

I mean, in non-strawman examples of programming languages.

The code will work, but how will it edit outside of your personal environment? Or print?

If you quote it in a PDF paper, will it be copy-pastable?

Basically I would worry about all that sort of stuff.

If I have echo '^L' to clear the screen, and that program is sent to a line printer, will I get:

  echo '
  [ paper ejected ]
  '

If you so much as "cat" your code to a terminal, all the codes get interpreted.

morelisp · on April 24, 2021

In Emacs Lisp it's common to to include real ^L specifically to trigger page breaks from when printing code was more common (and it has page-aware navigation commands accordingly).