
>> Specify the dominant direction of your user-input-containing elements, people, and/or enclose the input in U+2068 FSI ... U+2069 PDI (after balancing outstanding bidi controls inside).

> The level of arrogance packed in this sentence is just mind-boggling.

It’s not arrogance, really, it’s just that I’ve been reading on this exact thing for the last couple of months, and the relevant knowledge is rather unpleasantly smeared over multiple documents in several places (W3C and Unicode.org mostly), so I tried to condense the recipe into a single sentence and drop some terms an interested person could look up: I was attempting to pack information. I see now how that could come off as arrogant, but can’t think of appropriate circumlocutions that could ward that off without turning it into a full bidi-in-HTML tutorial (which I am not qualified to write, for one thing). I already write too many unsolicited tutorials in my comments, this is me trying not to :(

> There are many other "Easter eggs" in various basic technologies. I can assure you that no matter how high of an opinion you have about yourself, if you write any production code at all, you are guaranteed to be using something that contains other Easter egg design decisions. [...]

I’m aware I have limits! I know lots of those! I discover new ones every day!

(I dread the day I need to figure out how an 802.11 retransmission works and how to fight one, for one thing. I can’t do post-2010 JS frontend to save my life, and my database knowledge is somewhere around “there were those guys with the normal form, I think?”. Limits? I’ve got ’em.)

I also expect that once I know about a footgun, I have a responsibility to avoid it, and that people who have just encountered such generally want to hear how to avoid it as well. I’m not entirely competent at the communication part. Sorry.

As to the actual issue... I could say that if you’re handling multilingual text, then you should damn well know how multilingual text works, that it’s not peripheral to your problem.

But I don’t actually believe that, not completely: I think this bidi thing is needlessly hard and we should have directional-stack-balancing and directionality-isolating functions in our standard libraries the same way we have URL-escaping or HTML-quoting ones. Perhaps even have the templating handle most of these cases automatically. It’s like with SQL injection: I don’t have a right to complain people are writing vulnerable queries if we don’t have convenient tools to write correct ones. Unfortunately, in the bidi case, we don’t, so we’ll have to treat this like spun glass until someone makes them.
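To make the wish concrete, here is a minimal sketch of what such a pair of helpers might look like in Python. The names `balance_bidi_controls` and `isolate` are made up for illustration and are not an existing library API. The balancing rules are a simplified reading of how UAX #9 pairs the controls (PDF closes embeddings and overrides, PDI closes isolates and anything left open inside them). It ignores the spec's depth limit, and is no substitute for reading the spec.

```python
# Explicit bidi formatting characters (code points from the Unicode standard).
LRE, RLE, PDF, LRO, RLO = "\u202a", "\u202b", "\u202c", "\u202d", "\u202e"
LRI, RLI, FSI, PDI = "\u2066", "\u2067", "\u2068", "\u2069"

EMBEDDING_OPENERS = {LRE, RLE, LRO, RLO}  # closed by PDF
ISOLATE_OPENERS = {LRI, RLI, FSI}         # closed by PDI


def balance_bidi_controls(text: str) -> str:
    """Hypothetical helper: drop stray closers, append missing ones."""
    out = []
    stack = []  # closers still owed, innermost last
    for ch in text:
        if ch in EMBEDDING_OPENERS:
            stack.append(PDF)
        elif ch in ISOLATE_OPENERS:
            stack.append(PDI)
        elif ch == PDF:
            # A PDF only matches an embedding opened at the same isolate level.
            if stack and stack[-1] == PDF:
                stack.pop()
            else:
                continue  # stray PDF: drop it
        elif ch == PDI:
            if PDI not in stack:
                continue  # stray PDI: drop it, or it would close our wrapper
            # Close embeddings left open inside the isolate explicitly
            # (the PDI would end them anyway), then close the isolate itself.
            while stack[-1] == PDF:
                out.append(PDF)
                stack.pop()
            stack.pop()
        out.append(ch)
    out.extend(reversed(stack))  # close whatever is still open
    return "".join(out)


def isolate(text: str) -> str:
    """Hypothetical helper: wrap untrusted text in FSI ... PDI."""
    return FSI + balance_bidi_controls(text) + PDI
```

Usage would be along the lines of `"Posted by " + isolate(username) + " today"`, so that a stray U+202E in the username cannot reach the surrounding text. In HTML, the `dir` attribute or the `<bdi>` element is usually the better tool, but the input still needs balancing if it may contain controls.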

(That’s part of why I’ve been looking into this so much lately.)

[Previously]

> The problem is not with Arabic or Hebrew. The problem is that this modifier affects other languages and characters in a way the vast majority of people clearly wouldn't expect (otherwise the story wouldn't make it to the front page).

As far as I know, this is not solvable. Or rather, this specific thing is, and the right-to-left override (U+202E RLO) is kind of a screw-up due to this kind of nonlocal effect on surrounding text (it might even be a holdover from the IBM days?), but you can’t design RTL such that it can be ignored by unaware programmers, with or without directional controls. Last I checked (several years ago), a post in Hebrew would wreak considerable destruction on an LTR Facebook news feed, no controls required.

The problem is of distinguishing a white zebra with black stripes from a black zebra with white stripes: Are you looking at RTL text with LTR pieces inside or LTR text with RTL pieces inside? (If you don’t see why this would change the layout, the Unicode Bidirectional Algorithm spec has examples.) What if the pieces themselves include opposite-direction quotes? How do you know where the pieces end in the presence of characters with no intrinsic direction (punctuation, emoji)?

You can encode everything in LTR display order. Your RTL-script users, DBAs, search engine developers, etc. will hate you.

You can require explicit indicators. If this needs to work in plain text (and it does, if Arabic and Hebrew are to do plain text at all, because RTL text requires embedded LTR pieces fairly often), you’ll have to express that in format controls. But then if a user manages to drop a right-to-left switch into English text, which couldn’t care less about RTL, the text will get completely messed up and the user gets to complain about why RTL influences English. You may try to completely disallow controls in markup that has alternative ways of expressing directionality, but then your input method, your clipboard, etc. need to know about every possible kind of markup, or every markup processor needs to generate equivalent controls. To at least limit the scope of the disaster, you declare that the effect of the controls ends at a paragraph boundary, but then you need to tell where that is, and the kind of “plain text” you inherited has no good way of distinguishing a mere hard line break from a paragraph terminator except by not-so-plain “protocol” conventions. So you’ll need to guess.

You can ditch explicit indicators and guess. Your processing algorithm will need to know which scripts have which direction, of course, but that’s not a problem. Given the presence of quotations and such in plain text, it’ll also need to learn about paired delimiters and which of them pair with which others, and try to recover when the pairs are wrong or unbalanced, because users are awful. Because of the aforementioned zebra problem, you’ll also need a way to guess which direction of a piece of text is the main one, which seems intractable without godlike NLP, so maybe just take the first character with a definite direction and tell people who start sentences with an opposite-direction fragment they lose? Overall, the whole guessing game becomes so complex it’s completely impossible to reliably embed an arbitrary fragment of user input inside your text unchanged (without inserting visible compensating delimiters, for example), so some kind of format controls that manipulate a stack of directions are called for.
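The "first character with a definite direction" heuristic is small enough to sketch with Python's standard `unicodedata` module. The function name `guess_base_direction` is invented for this example. Unlike the real algorithm, this simplified version does not skip over text nested inside isolates, so treat it as an illustration rather than a conforming implementation.

```python
import unicodedata


def guess_base_direction(text: str):
    """Return "ltr" or "rtl" from the first strongly directional character,
    or None if there is none (digits, punctuation, and spaces are not strong)."""
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class == "L":
            return "ltr"
        if bidi_class in ("R", "AL"):  # Hebrew-like and Arabic-like letters
            return "rtl"
    return None


# Escapes are used instead of literal Hebrew so the source itself stays unambiguous.
hebrew_word = "\u05e9\u05dc\u05d5\u05dd"
print(guess_base_direction("Hello " + hebrew_word))  # English first
print(guess_base_direction(hebrew_word + " Hello"))  # Hebrew first
print(guess_base_direction("123 !?"))                # no strong character
```

The second call is exactly the losing case mentioned above: an otherwise English sentence that merely starts with a Hebrew word gets classified as RTL.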

The Unicode design does most of the above; it is complex and could undoubtedly be simpler—there’s like three generations of “no, that’s a bad idea, let’s try again” in there. But it seems like some indication from a programmer that they want to insert this inner thing, that should remain intact, into this outer thing, that shouldn’t get messed up in the process, would be required in any logical-order design at all; you won’t be able to just concatenate byte sequences. It’s acting on that indication that could stand to be easier.




> I could say that if you’re handling multilingual text, then you should damn well know how multilingual text works, that it’s not peripheral to your problem.

Trouble is, anyone using Unicode and accepting user inputs is effectively handling multilingual text, unless they explicitly filter it out. Which includes the vast majority of websites and even web-based user interfaces for standalone hardware.

> As far as I know, this is not solvable.

I am sure it is solvable in the sense that it is possible to make the behaviors less surprising and complicated without sacrificing people's ability to use right-to-left languages. There would have to be a discussion about underlying assumptions and real-life usage to achieve that, however.

Generally, though, I don't see a legitimate use for ever reversing left-to-right languages when displayed to the user. That's not what anyone would expect, not even the writers of right-to-left languages. And the myriad malicious uses are kind of obvious. And the long-term effect of people abusing these will be websites banning more control characters, which will affect users of Arabic and Hebrew.

Also, with the way Unicode is being developed it is increasingly unclear what "plain text" even means these days. AFAIK, there isn't even a formal definition of that term. Maybe that's where the discussion should really start. What capabilities separate "plain text" from other things?



