When monospace fonts aren't: The Unicode character width nightmare

jhallenworld · on Sept 11, 2015

I recently changed how JOE dealt with this. Originally it used Markus Kuhn's wcwidth function (http://www.cl.cam.ac.uk/~mgk25/ucs/wcwidth.c), but I've changed it to use the data in EastAsianWidth.txt: http://sourceforge.net/p/joe-editor/mercurial/ci/default/tre...

JOE uses 4-level radix trees for character classes. These work well because the leaf nodes are highly redundant and can be merged together. The resulting structure is often smaller than a binary tree. Character classes are also used for regular expressions, so there is code to build them on the fly from a list of ranges (it's tricky to do this efficiently).
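
A minimal Python sketch of the leaf-sharing idea, assuming only two levels instead of JOE's four and a small hand-picked list of ranges rather than the full EastAsianWidth.txt data. The names `build_class` and `contains` are made up for illustration and are not JOE's API; the build step is deliberately naive.

```python
LEAF_BITS = 8
LEAF_SIZE = 1 << LEAF_BITS
MAX_CP = 0x110000  # one past the last Unicode code point


def build_class(ranges):
    """Build a two-level table from inclusive (first, last) ranges.

    Returns (top, unique_leaves): top maps the high bits of a code point
    to a leaf, and identical leaves are stored only once.
    """
    rows = [[False] * LEAF_SIZE for _ in range(MAX_CP >> LEAF_BITS)]
    for first, last in ranges:
        for cp in range(first, last + 1):
            rows[cp >> LEAF_BITS][cp & (LEAF_SIZE - 1)] = True

    shared = {}  # leaf contents -> the one shared copy of that leaf
    top = []
    for row in rows:
        leaf = tuple(row)
        top.append(shared.setdefault(leaf, leaf))
    return top, len(shared)


def contains(top, cp):
    """Test membership with one index per level."""
    return top[cp >> LEAF_BITS][cp & (LEAF_SIZE - 1)]


# Illustrative ranges only (Hangul Jamo, CJK Unified Ideographs, Hangul Syllables).
wide_ranges = [(0x1100, 0x115F), (0x4E00, 0x9FFF), (0xAC00, 0xD7A3)]
top, unique_leaves = build_class(wide_ranges)
print(len(top), "top-level slots share", unique_leaves, "distinct leaves")
```

Most slots point at the same all-false or all-true leaf, which is where the space saving comes from.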

Anyway, I'm surprised that emoji are not double-wide characters.

JOE is still missing Unicode normalization for string searches.
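
To illustrate why normalization matters for searching, here is a small Python sketch using the standard `unicodedata` module. It is not how JOE would do it (JOE is written in C), and `normalized_find` is a hypothetical helper name.

```python
import unicodedata


def normalized_find(haystack, needle, form="NFC"):
    """Search after normalizing both strings to the same form.

    The returned index refers to the normalized haystack, which may be
    shorter or longer than the original text.
    """
    normalized_haystack = unicodedata.normalize(form, haystack)
    normalized_needle = unicodedata.normalize(form, needle)
    return normalized_haystack.find(normalized_needle)


decomposed = "cafe\u0301 au lait"   # 'e' followed by a combining acute accent
precomposed = "caf\u00e9"           # single precomposed 'é'

print(decomposed.find(precomposed))             # plain search misses it
print(normalized_find(decomposed, precomposed)) # normalized search finds it
```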

waqf · on Sept 11, 2015

In most languages there's not really much need to display source code in monospace, except to the extent that a previous programmer formatted code with that assumption.

When you're trying to align lines of a block, you only need that a string of n initial spaces (or tabs) is the same width every time: you don't care if it is the same as a string of n arbitrary characters. This suffices for the compiler-enforced indentation rule in Python (though not, I think, for the indentation rule in Haskell).

(I code in a variable-width font in Emacs and I work on shared codebases. The codebase style does have a couple alignment rules that don't make sense for variable-width fonts, but I just let Emacs enforce them and I'm happy that overall the code as I see it is easier on the eyes than it would be with fixed-width formatting.)

blahedo · on Sept 12, 2015

The other main case where monospace (as such) is important is when multi-line expressions should be lined up in some semantically relevant way, e.g. to reflect boolean and/or grouping, or to line up function arguments. (This is especially true in Lisp/Scheme indent styles, but I use it fairly frequently in the C-style-syntax languages as well.)

Separately from the "lining things up" argument, though, there's an argument to be made for characters in any editing situation to be wider, even if they're not all the same width. For naturally-narrow letters like I and l, and for periods and commas, and of course spaces, the non-monospace fonts often make them so narrow that they become harder to target with a mouse, harder to distinguish from each other, and in some cases hard to even see at a glance whether the insertion point is to their left or right. This is a UI problem with a lot of editors (e.g. text entry for data or comments on the web) that doesn't get enough attention; if the content and presentation can be separate (i.e. it's not a WYSI-more-or-less-WYG editor) then the editor should use a font that doesn't have super-narrow anything. Monospace is a convenient way to achieve that.

azernik · on Sept 12, 2015

> Separately from the "lining things up" argument, though, there's an argument to be made for characters in any editing situation to be wider, even if they're not all the same width.

Have a look at Input [1]: a TrueType font built for coding, with somewhat-wide I and l and super-wide punctuation marks.

The issue is one of picking the right font, not of monospace vs. not - street signs, for example, have long used different fonts from printed pages, since they demand high accuracy and fast recognition of short chunks of text at long distances, as opposed to the pleasant long-reading experience demanded from most body fonts. Most systems just happen to have built-in variable-width fonts that are terrible for coding.

[1] http://input.fontbureau.com

pauljmartinez · on Sept 13, 2015

I just started using Input, giving Anonymous Pro a little hiatus. It has been pleasant on the eyes.

Stratoscope · on Sept 12, 2015

You don't need a monospaced font to line up multiline expressions or function arguments. All you have to do is use indentation instead of column alignment. In other words, follow the same rules for expressions that you most likely already use for statements.

For example, instead of this:

  someObject.someMethod(oneArgument,
                        anotherArgument,
                        oneMoreArgumentForTheRoad);

Format it exactly as you would if the parentheses were curly braces:

  someObject.someMethod(
      oneArgument,
      anotherArgument,
      oneMoreArgumentForTheRoad
  );

Or instead of this:

  string someVariable = oneThing +
                        anotherThing +
                        yetAnotherThing;

Do this:

  string someVariable =
      oneThing +
      anotherThing +
      yetAnotherThing;

The code is just as readable this way, and you gain many benefits:

* Your code looks exactly the same in a proportional or monospaced font, so people viewing the code are free to choose either.

* You no longer need to fiddle with adding or removing spaces when you change the length of one of your variables or function names.

* The diffs in your revision history no longer show spurious changes that result from that column fiddling.

* Your code lines become much shorter.

* Instead of having one formatting rule for curly braces and a completely different rule for parentheses, you use the same formatting rule for everything.

* It doesn't matter if the code is indented with spaces or tabs. With no column alignment, rules like "tabs for indentation, spaces for alignment" are no longer needed.

* If you need to change the code to a new indentation style (e.g. change 2 or 4 spaces to tabs, or vice versa), you can do that reliably with a simple regex search and replace, with no damage to the code formatting and no manual cleanup required.
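
As a rough illustration of that last point, here is a Python sketch of the search and replace, assuming the code uses indentation only (no column alignment) and a consistent indent width. `spaces_to_tabs` is a hypothetical helper, not part of any tool mentioned here.

```python
import re


def spaces_to_tabs(source, width=4):
    """Replace each leading group of `width` spaces with one tab."""
    def convert(match):
        return "\t" * (len(match.group(0)) // width)

    # ^ with re.MULTILINE anchors at the start of every line, so only
    # leading indentation is touched, never spaces inside a line.
    return re.sub(r"^(?: {%d})+" % width, convert, source, flags=re.MULTILINE)


sample = (
    "someObject.someMethod(\n"
    "    oneArgument,\n"
    "        nestedArgument\n"
    ");\n"
)
print(spaces_to_tabs(sample))
```

With column-aligned code the same substitution would leave stray spaces behind or shift the alignment, which is the manual cleanup the bullet refers to.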

I adopted this practice long before I switched to coding in proportional fonts - and once I started formatting code this way I realized that it didn't matter any more what kind of font I used. I was free to choose any font that pleased my eyes, with no impact on the readability of the code for those who prefer monospaced fonts.

A good place to see the problems that column alignment causes is the Servo source code, whose coding standard mandates column alignment. I posted a few examples here:

https://news.ycombinator.com/item?id=9469713

I've often wondered why column alignment is so popular given its drawbacks. I have a theory that I think explains some of it, at least in the case of function calls and parenthesized expressions.

It's because of the very common objection to putting spaces inside the parentheses. For example, PEP8 and many coding standards explicitly forbid this.

Here's what happens. Take my first example above, written as one line without spaces inside the parentheses:

  someObject.someMethod(oneArgument, anotherArgument, oneMoreArgumentForTheRoad);

That line is too long, so let's fix it. The natural starting place is to change every space to a newline:

  someObject.someMethod(oneArgument,
  anotherArgument,
  oneMoreArgumentForTheRoad);

That's a bit messed up, so what can we do? Indent the extra lines?

  someObject.someMethod(oneArgument,
      anotherArgument,
      oneMoreArgumentForTheRoad);

Ugh. Now the arguments don't line up at all. It's really ugly. The only cure is to align the columns:

  someObject.someMethod(oneArgument,
                        anotherArgument,
                        oneMoreArgumentForTheRoad);

But what if we adopted the practice of putting spaces inside the parentheses?

  someObject.someMethod( oneArgument, anotherArgument, oneMoreArgumentForTheRoad );

If we make the same substitution, changing each space to a newline, we get this:

  someObject.someMethod(
  oneArgument,
  anotherArgument,
  oneMoreArgumentForTheRoad
  );

And now it makes perfect sense to indent the arguments:

  someObject.someMethod(
      oneArgument,
      anotherArgument,
      oneMoreArgumentForTheRoad
  );

This is the same thing we intuitively do with curly braces. After all, hardly anybody codes like this:

  while(true) {oneStatement;
               anotherStatement;
               oneMoreStatement;}

I think one reason is that we pretty much tend to put spaces inside the curly braces, even in a one-liner. It's not too common to write this:

  while(true) {oneStatement; anotherStatement; oneMoreStatement;}

Instead, this seems more typical (if you use a one-liner at all, which I'm not arguing for or against):

  while(true) { oneStatement; anotherStatement; oneMoreStatement; }

No one seems to mind spaces inside the braces, but I've had developers totally freak out over the idea of putting spaces inside the parentheses. It Simply Is Not Done.

I don't understand why there's such an objection to that, especially when it seems to lead directly to the difficulties of column alignment.

I tend to think that it's because spaces never go inside the parentheses in English text. But this is code, not English, and we are free to choose conventions that benefit us, even if they differ from how we'd write prose.

Edit: I added a number of thoughts after first posting this. If you upvoted an earlier version of this comment and now think I'm insane for advocating spaces inside the parentheses, let me know and I'll post another comment that you can downvote. ;-)

jhallenworld · on Sept 12, 2015

This is pretty nice! I've been doing the same for Verilog module declarations (which can easily have 100 arguments):

    module fred(
        a,
        b,
        c
        );
    ...
    endmodule

I've not tried it for function calls, but it's not a bad idea...

How does it look for the case of an indented block after a while or a for? Let's see:

    while (
        a &&
        b && c
    ) {
        a = a + 1;
        b = b - 1;
    }

Hmm, maybe better than what I do now...

    while (a &&
           b && c) {
        a = a + 1;
        b = b - 1;
    }

Stratoscope · on Sept 12, 2015

Thanks! In fact, your first example of the while loop is almost exactly how I format it. The only very minor difference is that I don't put a space before the first parenthesis:

  while(
      a &&
      b && c
  ) {
      a = a + 1;
      b = b - 1;
  }

That's not a big difference; the only reason I leave out that one space goes back to my idea of having fewer formatting rules and special cases. I don't put a space before the open paren on a function call, so I don't put one there on a statement that has parens either. Not a big deal either way, I'm just lazy and like to have one rule instead of two. :-)

I experimented in the past with some other ways of formatting this kind of code, e.g.

    while (
        a &&
        b && c ) {
        a = a + 1;
        b = b - 1;
    }

Of course you can see the problem here - there's no visual separator between the while expression and the statements in the loop body.

That's why some coding standards require a double-indent:

    while (
            a &&
            b && c ) {
        a = a + 1;
        b = b - 1;
    }

That never appealed to me at all, but when I started moving the close paren to the next line and dedenting it there was no need for the double indent.

mzs · on Sept 12, 2015

I always put whitespace after a keyword (like while) so that I can use regexps to find only function calls.
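
A sketch of what such a regexp might look like, in Python. The pattern is an assumption about the convention described (keywords are always followed by a space, calls never are); it would also match function definitions and macro invocations.

```python
import re

# An identifier immediately followed by "(" with no whitespace in between.
CALL = re.compile(r"\b([A-Za-z_]\w*)\(")

code = "while (ready) { poll(queue); if (done) finish(); }"
print(CALL.findall(code))
```

Because `while (` and `if (` have a space before the parenthesis, only the two calls are reported.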

What about this?

  int c[] = {
       3,   6,   9,
      12,  15,  18,
  ...
     300, 303, 306
  };

or

  int v =
    ((r << 24) & 0xff000000) |
    ((g << 16) &   0xff0000) |
    ((b <<  8) &     0xff00) |
    ( a        &       0xff) ;

Stratoscope · on Sept 12, 2015

Those are nice - especially the second one, which is a great illustration of how you can use thoughtful alignment in a monospaced font to make the logic in a piece of code really stand out.

It's very different from the examples of column alignment I was critiquing in my earlier message, where there's an equally good or better way to format the code using indentation only.

All in all, I still prefer the visual appearance and compact nature of proportional fonts, especially my current favorite, Trebuchet MS - and they do just fine for almost all the code I write. But for all my criticism of column alignment, I have to admit I do miss it in cases like your second example where it brings clarity.

My dream editor would have a way to allow the use of both proportional fonts where they work well and monospaced fonts where they are better, all within the same file. Perhaps a special commenting convention that would let you switch back and forth.

One of the editors I use, Komodo, has a bit of this. When you create a custom visual theme, you can specify both a proportional font and a monospaced font, and either toggle the entire file back and forth with a hotkey or select one font or the other as part of syntax highlighting - similar to the way many editors let you select bold or italic for particular syntax elements. For example you can select a monospaced font for comments so you can do ASCII art there, while using a proportional font for other code.

Thanks for sharing those examples!

mzs · on Sept 12, 2015

First I just want to thank you for the very pleasant reply. More often than not, people online tend to get very religious and downright mean about things like conventions, so thank you; the dialog here has been refreshing.

You might be interested in literate programming:

https://en.wikipedia.org/wiki/Literate_programming#Tools

It's not exactly what you are after, but I think it gets at what you are more fundamentally after: ways to make it easier to understand what code is doing. You can have the best of both worlds here too, with something like noweb and two windows.

One other thing I ran into is a code base that had incredibly long lines (often more than 132 characters). So I started using ndiff to help me there. It's a Python script that also puts + and - under the lines where the differences are. I also wrote a little script to turn its output into HTML with different colors. Sadly, it was pretty awful until I made that monospace too.
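
For anyone who hasn't seen that style of output, here is a small Python sketch using `difflib.ndiff` from the standard library. It assumes the ndiff mentioned above behaves like the `difflib` one; the sample lines are invented.

```python
import difflib

old = ["total = price * quantity + shipping_cost - discount\n"]
new = ["total = price * quantity + handling_cost - discount\n"]

# Lines starting with "? " are guide lines that mark where inside the
# line the change happened, which helps a lot on very long lines.
diff_lines = list(difflib.ndiff(old, new))
for line in diff_lines:
    print(line, end="")
```

The guide lines only point at the right characters when every character has the same width, which matches the experience with the HTML version.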

Cheers!

thaumasiotes · on Sept 13, 2015

    int v =
      ((r << 24) & 0xff000000) |
      ((g << 16) &   0xff0000) |
      ((b <<  8) &     0xff00) |
      ( a        &       0xff) ;

Is there a reason you're doing the shift before the mask? I find this much less intuitive than the equivalent

    int v =
      ((r & 0xff) << 24) |
      ((g & 0xff) << 16) |
      ((b & 0xff) <<  8) |
      ( a & 0xff) ;
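
A quick Python check that the two forms compute the same value. This is only a sketch: Python integers never overflow, so it says nothing about what a signed `int` in C does when a value is shifted into the sign bit. The function names are made up.

```python
def pack_shift_then_mask(r, g, b, a):
    return (
        ((r << 24) & 0xff000000) |
        ((g << 16) & 0x00ff0000) |
        ((b << 8) & 0x0000ff00) |
        (a & 0x000000ff)
    )


def pack_mask_then_shift(r, g, b, a):
    return (
        ((r & 0xff) << 24) |
        ((g & 0xff) << 16) |
        ((b & 0xff) << 8) |
        (a & 0xff)
    )


# Include out-of-range inputs to show both forms discard the same bits.
samples = [(0x12, 0x34, 0x56, 0x78), (255, 0, 255, 0), (0x1ff, 0x100, 0x3ab, 0xfff)]
for channels in samples:
    assert pack_shift_then_mask(*channels) == pack_mask_then_shift(*channels)

print(hex(pack_mask_then_shift(0x12, 0x34, 0x56, 0x78)))
```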

mzs · on Sept 14, 2015

Cause I was trying to exaggerate the effect of space to within reason somewhat?

Cause it was like 1AM here when I wrote it?

Take your pick ;)

Yeah I'd totally write the bottom way I think, or I hope, cheers.

plonh · on Sept 13, 2015

I vaguely recall that Ruby promoted spaces inside parens, in part because Japanese programmers (Ruby started in Japan) have some trouble quickly separating parentheses from English letters visually when scanning code.

djent · on Sept 12, 2015

I believe the comment you're replying to is implying multiple instances of indentation on the same line, such as:

    if		( $test[0] )	func1()
    elseif	( $test[1] )	func2()
    elseif	( $test[2] )	func3()

Stratoscope · on Sept 12, 2015

That's a good point. Well, it was just an excuse for me to go off on a tangent! :-)

For the example you posted, I would classify that as column alignment rather than indentation. Just to clarify my terminology, what I call indentation is something that happens at the beginning of a line only. Any additional spacing after the first nonblank character is column alignment.

So the spaces before the if and elseif are indentation, and the extra spaces within the lines are column alignment (in my nomenclature).

The column alignment in your example is pretty appealing, but it does have some of the same problems as other forms of column alignment. When we get to ten or eleven tests, things have to get juggled around again. Do you do this:

    if		( $test[0] )	func1()
    elseif	( $test[1] )	func2()
    ...
    elseif	( $test[10] )	func3()

or this?

    if		( $test[0]  )	func1()
    elseif	( $test[1]  )	func2()
    ...
    elseif	( $test[10] )	func3()

or go all-out with something like this?

    if		( $test[ 0] )	func1()
    elseif	( $test[ 1] )	func2()
    ...
    elseif	( $test[10] )	func3()

I think basically I am just lazy and found all of this alignment so tedious that I looked for ways to avoid it. :-)

Symbiote · on Sept 12, 2015

Elastic tabstops would solve this: http://nickgravgaard.com/elastic-tabstops/
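
A very simplified Python sketch of the idea, assuming one contiguous block of lines and measuring cell widths in characters. A real implementation would measure rendered text widths in the editor and would only group adjacent lines that share a column; `align_block` is a hypothetical name.

```python
def align_block(lines, padding=2):
    """Pad tab-separated cells so each column fits its widest cell."""
    rows = [line.split("\t") for line in lines]

    # A cell only needs a width if a tab follows it, so skip the last cell.
    widths = {}
    for row in rows:
        for col, cell in enumerate(row[:-1]):
            widths[col] = max(widths.get(col, 0), len(cell) + padding)

    aligned = []
    for row in rows:
        cells = [cell.ljust(widths[col]) for col, cell in enumerate(row[:-1])]
        aligned.append("".join(cells) + row[-1])
    return aligned


block = [
    "if\t( $test[0] )\tfunc1()",
    "elseif\t( $test[1] )\tfunc2()",
    "elseif\t( $test[10] )\tfunc3()",
]
for line in align_block(block):
    print(line)
```

Adding the `$test[10]` line widens the second column for every line in the block automatically, with no edits to the other lines.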

randomanybody · on Sept 13, 2015

I also use this convention, but one frustration is that no IDE wants to follow it in their auto line breaking functions.

userbinator · on Sept 12, 2015

I often use ASCII art in comments to add explanatory detail. That would become nearly impossible to do correctly with a proportional font.

Another thing is easily formatting tabular-looking data (e.g. array initialisers) reasonably well. Proportional fonts mean this can't be done, and existing formatted data looks like a mess. Moving the cursor vertically with a proportional font also causes it to distractingly jump left and right instead of in a straight line.

Has anyone who prefers proportional fonts considered whether they'd like proportional vertical spacing (i.e. the height of each line varies depending on what characters are in it)? That would be the other extreme of variable spacing.

nitrogen · on Sept 12, 2015

If one were completely sold on the idea of proportional fonts for coding, it would make sense to go to fixed tabstops (like a word processor or typewriter) rather than fixed-width tab characters, so tabs could always be used to align tabular data.

ygra · on Sept 12, 2015

You could have comments in monospace and the rest of the code proportional. As long as you don't start ASCII art in normal code ...

scythe · on Sept 11, 2015

How do you deal with the fact that in most variable-width fonts, the characters []{}():;'!.*,|`ijl are often too small/narrow to really notice? I've looked into programming with variable-width fonts, but I kept finding that monospace fonts are much easier on the eyes. It's hard to even look at typical C code in a variable-width font.

waqf · on Sept 12, 2015

I don't have a perfect solution, but using syntax highlighting together with a bold font weight goes a long way to make typos apparent.

douche · on Sept 11, 2015

There are a few simple rules I have for code fonts: 1.) Monospace - I don't really go nuts about aligning things, but it is nice for looking at multi-line strings in code. 2.) Characters need to be distinct. l (lower-case ell), i (lower-case eye), I (upper-case eye) and 1 (one) cannot appear identical. 0, o and O shouldn't be the same - bonus if zero has a dot or line through the center.

Visual Studio defaults to Consolas, which is nice.

akkartik · on Sept 11, 2015

Now that I think about it, the biggest reason I continue using fixed-width is that my preferred 2-space indent looks narrower and so indentation becomes less salient. Perhaps I should just increase my indent. 2 spaces is mandated at work, though, so there's that..

seanmcdirmid · on Sept 12, 2015

Try using indent guidelines if they are available on your IDE (built in or as a plug in). I work with two space indent, minified curly braces, proportional font, and indent guidelines...it is quite nice and feels "modern."

gchpaco · on Sept 11, 2015

Acme, too, defaults to variable width, and looks rather better for it. It's always a little awkward in Emacs because a lot of the system packaging (package.el, for example) assumes mono, but it can work.

rspeer · on Sept 11, 2015

A project I work on, "ftfy", deals with various Unicode issues. On the master branch, it's recently gained a module for aligning monospaced Unicode in a terminal:

https://github.com/LuminosoInsight/python-ftfy/blob/master/f...

The result works pretty well in my gnome-terminal and Mac OS Terminal. You can probably see that GitHub in your web browser doesn't even come close to lining them up, though.

And the problem can't be completely solved, because the standards have gaps, and there are some scripts where monospacing just isn't a thing.

breadbox · on Sept 11, 2015

This stuff is a nightmare if you're trying to write a nice-looking terminal application. AFAICT there is no reliable way to determine how many cells an arbitrary Unicode glyph will occupy when output to a terminal. None. You can use various wcwidth() functions as a first approximation, but you have to give up on things working in the general case, because there's no guarantee (in theory or in practice) that the terminal's font will actually honor the width defined by the standards. Hopefully this situation will slowly improve with the years, but given the level of neglect the terminal environment gets from standardization processes these days, I'm not entirely optimistic.
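
For reference, a first approximation of that kind can be sketched in Python with the standard `unicodedata` module. This is not Markus Kuhn's wcwidth; it is a rough estimate under stated assumptions (combining marks take no cell, control characters are not treated specially), and `estimated_cells` is a made-up name.

```python
import unicodedata


def estimated_cells(text, ambiguous_wide=False):
    """Estimate how many terminal cells `text` occupies.

    Only an estimate: the terminal and its font have the final say.
    """
    total = 0
    for ch in text:
        if unicodedata.combining(ch):
            continue  # assume a combining mark shares the previous cell
        width_class = unicodedata.east_asian_width(ch)
        if width_class in ("W", "F"):
            total += 2  # Wide and Fullwidth
        elif width_class == "A" and ambiguous_wide:
            total += 2  # Ambiguous: depends on the terminal's convention
        else:
            total += 1
    return total


print(estimated_cells("abc"))
print(estimated_cells("日本語"))
print(estimated_cells("e\u0301"))
```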

mark-r · on Sept 12, 2015

You could always place each glyph individually, but that's likely to have subtle bugs too, while not being very performant.

breadbox · on Sept 12, 2015

I have used such a solution at times. But then you usually have the problem of overwriting half of a wide character, which is worse.

jquast · on Sept 13, 2015

There is a method: use the report-cursor-position query to determine the current location of the cursor, print the characters in question, and then re-read the location. The difference tells you how many cells the characters advanced the cursor.
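
A Python sketch of that approach, assuming a Unix terminal that answers the `ESC [ 6 n` query with `ESC [ row ; col R`. The function names are hypothetical, and the measurement is wrong if the text wraps at the right edge of the screen.

```python
import re
import sys

CPR_REPLY = re.compile(r"\x1b\[(\d+);(\d+)R")


def parse_cursor_report(reply):
    """Parse an 'ESC [ row ; col R' reply into (row, col), or None."""
    match = CPR_REPLY.search(reply)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def query_cursor():
    """Ask the terminal where the cursor is. Needs a real Unix tty."""
    import termios
    import tty

    fd = sys.stdin.fileno()
    saved = termios.tcgetattr(fd)
    try:
        tty.setcbreak(fd)             # unbuffered input, no echo of the reply
        sys.stdout.write("\x1b[6n")   # request a cursor position report
        sys.stdout.flush()
        reply = ""
        while not reply.endswith("R"):
            reply += sys.stdin.read(1)
    finally:
        termios.tcsetattr(fd, termios.TCSADRAIN, saved)
    return parse_cursor_report(reply)


def measured_cells(text):
    """Print `text` and return how many columns the cursor advanced."""
    _, before = query_cursor()
    sys.stdout.write(text)
    sys.stdout.flush()
    _, after = query_cursor()
    return after - before
```

Calling `measured_cells("日本語")` in an interactive terminal reports whatever that terminal actually did, at the cost of one round trip per measurement.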

jquast · on Sept 12, 2015

If anybody here has serious expertise on this subject, please consider reviewing my Python implementation of Markus Kuhn's wcwidth, updated for the latest Unicode Specification (programmatically by "python setup.py update").

https://github.com/jquast/wcwidth

glandium · on Sept 11, 2015

Relatedly, on GNU/Linux, depending on your locale, fontconfig configuration and characters in the strings displayed (no, I'm not making that up), your monospace font might end up not being a monospace font at all. Look at this nice ghex screenshot: http://i.imgur.com/z0Dp60H.png

fred256 · on Sept 11, 2015

Related, it always annoyed me in programming books when program listings using monospaced fonts used an "fi" ligature crammed into one character width.

gpvos · on Sept 12, 2015

Having a "fi" or "ff" or similar typographic ligature at all in any monospace font is a grave mistake. (Typographic as opposed to ligatures that have been promoted to letters in certain languages, such as "ß").

toothbrush · on Sept 11, 2015

Wow, that is an atrocity I have luckily never stumbled into. Do you have examples?

kylebgorman · on Sept 12, 2015

Don't quote me on this but I think there may have been an example of this in the most recent Stroustrup C++ book?

kevin_thibedeau · on Sept 12, 2015

There is no way to standardize this. Just because legacy encodings provided for single cell and double width characters doesn't mean that all monospace fonts should have to conform to that scheme to support Asian characters. It is up to the font designer to decide how much advance to use and some fonts are designed with Latin characters the same width as the Chinese.

arm · on Sept 12, 2015

Many Chinese characters are completely unreadable at small sizes if you try to fit them in the width of a halfwidth character. It’s just not practical.

masklinn · on Sept 12, 2015

The monospace font designer could integrate CJK logographs as double the standard width (that's essentially what CJK fonts do in reverse with halfwidth Latin characters). CJK is monospace to start with, though; I'd think cursive and highly ligatured scripts like Arabic or the Brahmic scripts would be a bigger issue when designing monospace fonts (if you want something which doesn't look like utter shite). Though I guess Kufic, the very blocky Arabic script found on the flags of Iraq and Iran, might be better source material than the more modern and cursive Naskh variations; it's commonly manipulated into tilings and brickwork already: https://en.wikipedia.org/wiki/Kufic#/media/File:Alijlas_kufi...

microcolonel · on Sept 12, 2015

This seems more like a problem with GDI and/or DirectWrite (as well as how each browser is making use of them), and less to do with Chrome vs. Firefox vs. IE vs. VS.

Chrome on Chrome OS (using FreeType 2) properly aligns the text in that <pre>, as does Firefox on GNU/Linux (also using FreeType 2, in addition to Graphite). On FreeType with Chrome or Graphite, the full-width Latin characters are also rendered with the correct weight and face, something that Chrome and IE on Windows seem to get wrong in your screenshots.

nsajko · on Sept 12, 2015

Isn't the best general answer to assume mono is mono and put the font configuration responsibility on the user?

mpweiher · on Sept 12, 2015

Seems to work fine in OS X Terminal.

kalleboo · on Sept 12, 2015

But not in Safari, TextEdit, Xcode or BBEdit. Seems like the Terminal has special monospace handling

anamexis · on Sept 13, 2015

Interestingly, not with emoji.

gcb0 · on Sept 12, 2015

gotta love how things catch up faster than you can complain about them.

the rendered text for me on firefox (actually, not even proper firefox, but iceweasel, on debian stable, which is almost a year behind firefox) aligns perfectly fine. but the picture labeled "firefox" is all out of whack.

douche · on Sept 12, 2015

I was originally a history major who had a hell of a professor that taught the history of pre-modern imperial China, so I ended up focusing on that area. One thing that baffled me was that characters never really fell out of usage in favor of a syllabary or alphabet the way similar ideographic systems have tended to elsewhere. Other countries in the Chinese cultural sphere that originally imported Chinese script developed alpha/syllabic replacements, like the Korean hangul and Japan's two kana systems. And it is not as though China was not exposed to alphabets - Buddhism, Tibet, and the various Turkic, Mongol and Manchurian tribes that bordered on (and not infrequently ruled over significant portions of) China all used variations and descendants of Semitic alphabets. It seemed like such a huge inefficiency and barrier to widespread literacy to have to memorize at least a thousand or more characters, compared to the 26-odd letters, 10 digits, and 50 or so more common phonetic spelling rules that could get you to an equivalent level of literacy in a Latin-based language. Once you've learned all the characters, and digested all the Confucian classics that make up the shared context of classical Chinese, you can be wonderfully succinct and expressive, but the learning curve to get there can be measured in decades, judging by accounts of prospective scholar-officials studying for the civil service examinations.

It's perhaps the difference between becoming a proficient Vim or Emacs user, and popping open Notepad.

To get back to Unicode, most of the hairiness could have been avoided if Chinese (and related regional variations) used a syllabic or alphabetic script - 100-200 characters vs ~80000 would make 2-byte wide chars sufficient to express nearly every non-dead script, with plenty of space for any number of poo or banana emoticons (if such a thing is really necessary, about which I have my doubts).

Kenji · on Sept 11, 2015

That reminds me of how I wanted to start a project with a name containing the German 'ö' character. Of course, as customary, I save my projects into folders with the name of the project. And, also of course, GCC issues instantly ensued when I tried to compile the C++ source of that project. What did I learn? Stay with ASCII for your code and names. It might be 2015 but apparently it's more important to put poop into unicode than to actually implement the thing in all the important software.

gchpaco · on Sept 11, 2015

Don't idly dismiss emoji. They're the first thing from outside the Basic Multilingual Plane that people actually want to use, and as a result the first thing to find all the bugs and hidden assumptions going from a 16 bit wide character set to a 20 bit wide character set. This helps people who actually needed access to the various CJK Ideograph Extensions, most of which are in Plane 2.

gsnedders · on Sept 11, 2015

Nit-pick: 21-bit wide character set.

gchpaco · on Sept 12, 2015

Truth, I dunno why I have 20 bits stuck in my head.

Dylan16807 · on Sept 12, 2015

Probably because it barely goes over 20 bits. 20.09ish

If you want to be cheeky you can argue that "hidden assumptions going from a 16 bit wide character set to a 20 bit wide character set" is still correct, because those hidden assumptions are part of UTF-16, and UTF-16 uses exactly 20 bits to express the astral planes, encoding them in a different way from the BMP.
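
The arithmetic behind both numbers, as a short Python sketch. `to_surrogates` is a made-up helper name; the constants are the standard UTF-16 surrogate bases.

```python
import math


def to_surrogates(code_point):
    """Split an astral code point into a UTF-16 surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above the BMP need a surrogate pair")
    offset = code_point - 0x10000        # always fits in exactly 20 bits
    high = 0xD800 + (offset >> 10)       # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits
    return high, low


# Bits needed to number all 0x110000 code points: slightly more than 20.
bits_needed = math.log2(0x110000)

high, low = to_surrogates(0x1F4A9)  # PILE OF POO
print(hex(high), hex(low), round(bits_needed, 2))
```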

jhallenworld · on Sept 11, 2015

Thanks to UTF-16, ugh.

jepler · on Sept 11, 2015

Have you tried this lately? On a Debian Jessie machine using the en_US.UTF-8 locale, I encountered no problems doing this, though I picked œ instead of ö. I did have a little bit of pain from git (git status showed \305\223 for œ) until I did "git config core.quotepath false".

Not sure how this'll come through on ycombinator, so it's also on a pastebin for 24 hours: https://paste.debian.net/311370/

~$ cd src

~/src$ mkdir œuvre

~/src$ cd œuvre

~/src/œuvre$ printf '#include <stdio.h>\nint main() {puts("hello œuvre");}\n' > œuvre.c

~/src/œuvre$ g++ œuvre.c -o œuvre

~/src/œuvre$ ./œuvre

hello œuvre
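
The \305\223 that git status printed is just the two UTF-8 bytes of œ written in octal. A small Python sketch shows the correspondence; `git_style_quote` is a hypothetical name and only imitates the non-ASCII part of git's quoting (the real thing also handles quotes, backslashes and control characters).

```python
def git_style_quote(name):
    """Write every non-ASCII byte of the UTF-8 encoding as \\ooo octal."""
    return "".join(
        chr(byte) if byte < 0x80 else "\\%03o" % byte
        for byte in name.encode("utf-8")
    )


print("œ".encode("utf-8"))       # the two bytes behind the escape
print(git_style_quote("œuvre"))
```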

lilyball · on Sept 12, 2015

That's a bit different. Presumably you're running this in a terminal emulator. Terminal emulators (and programmers' editors), if they're set to a monospace font, operate on a grid. For characters in the selected monospace font this is just like normal font rendering, but when fallback fonts are used, the terminal still uses the layout from the original monospace font (i.e. the grid) instead of the layout from the fallback font. This means that the fallback font rendering doesn't screw with the columns.

Naturally there's still the issue of properly identifying double-wide characters. But as long as you can correctly identify them (and for the ambiguous characters it generally treats them all the same, either as narrow or as wide depending on the software in question), you can simply render them with 2 cells instead of 1, and everything remains lined up.

But the article here was showing monospaced text in a web browser, and a web browser doesn't do any of this (nor does a rich text editor). Font fallback usually attempts to maintain properties like monospacing, but the fallback font may still have different metrics, meaning it won't line up with the monospaced text from the source font (and depending on the fonts available, if you have no monospaced font with Asian glyphs it would have to fall back to a non-monospaced font).

I suspect what's going on with the locale stuff here is simply that the locale affects the fallback list, and it ends up picking a different font that happens to have the right metrics to line up. But I can't say for certain that this is the explanation.

on Sept 12, 2015

[deleted]

lilyball · on Sept 12, 2015

How strange. When I read jepler's comment it actually appeared as a top-level comment instead of a reply, which is why I assumed his terminal commands were intended to represent a rendering error, not a GCC error. But looking at it now, there is indeed a parent comment.

Also, your first line there was quite reasonable, but the footnote is incredibly patronizing. I did not appreciate it in the slightest.

jepler · on Sept 12, 2015

I apologize for the misunderstanding. I also didn't intend to sound patronizing to anyone; I wanted to show Kenji exactly what I was doing in case it was helpful to him or encouraged him to try naming his projects with non-ASCII characters again.

lilyball · on Sept 13, 2015

You weren't patronizing. It was another comment that has since been deleted. The confusion was I misinterpreted your comment because I thought it was a top-level comment and so I lost the context (I believe it was a rendering bug in the beta version of Safari I was using).

simoncion · on Sept 11, 2015

I can also confirm success when compiling a file with a path "ömlaut/asdf.cpp".

  $ g++ --version | head -n 1
  g++ (Gentoo Hardened 4.9.3 p1.1, pie-0.6.2) 4.9.3
  $ uname -r -m
  4.1.6-hardened i686
  $ echo $LANG
  en_US.UTF-8
  $ bash --version | head -n 1
  GNU bash, version 4.3.42(1)-release (i686-pc-linux-gnu)

GFK_of_xmaspast · on Sept 12, 2015

"it's more important to put poop into unicode than to actually implement the thing in all the important software."

I somehow suspect those aren't the same people.

geofft · on Sept 12, 2015

Were you using a UTF-8 locale? (Try running `locale`.)