Warning, controversial: I'm actually against this.
While I believe we should have a standard, I feel that having UTF-8 in source code brings more risks than benefits. [0]
Next to those additional risks it also limits discoverability if `fiancee` now is written as `fiancée` (yea I had to c/p that from somewhere). Searching for the former does not discover the latter.
Lastly, there is the issue of Intellisense, or whatever it is called in many languages. I have seen codebases in English with the aforementioned `é` in a function name. The only way for me to select that function name was with the arrow keys / mouse. I couldn't type it on my QWERTY.
And yes, I know that there are codebases which are non-English, and not even Latin. Those are very valid concerns to which I admittedly have no answer.
I can see wanting to stick to plain ASCII for identifiers, but what of embedded strings? Localization might prefer to be data-driven, but what of unit tests surrounding text management - including locale-specific sorting tests, glyph rendering, etc.? I don't think you're against making the source code UTF-8 per se, just non-ASCII identifiers. Which seems quite valid. In Rust you might write:
#![forbid(non_ascii_idents)]
pub fn fiancée() {}
And get an appropriate error:
error: identifier contains non-ASCII characters
 --> src/lib.rs:3:8
  |
3 | pub fn fiancée() {}
  |        ^^^^^^^
  |
note: the lint level is defined here
 --> src/lib.rs:1:11
  |
1 | #![forbid(non_ascii_idents)]
  |           ^^^^^^^^^^^^^^^^
> Why would you not just put test input in a separate test-input file that you pipe in?
You just added a dependency on file I/O (you can't easily run the test on filesystem-less embedded/wasm targets), on deserialization, on the current working directory if using relative paths, and on filesystem layout if using absolute paths. We also lose type checking as part of our compile step (so deserialization might fail at run time), and we lose IntelliSense...
> That way you can test multiple different encodings and all sorts of edge cases
Well, a mixture of half and fullwidth yen symbols (¥¥). A bit awkward to eyeball as they've been escaped via \u#### codes - dodging the whole "what encoding are our source files" problem - but given the graphical variants of that glyph, I could see keeping the escaped versions. Now, the array this is helping construct could certainly be deserialized from, say, a JSON or XML test-input file. But what would we gain, exactly, besides more boilerplate and context switching to wade through?
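For illustration, a made-up Java-flavoured sketch of the kind of escaped in-code test vector I mean (the class name and exact contents are invented, not the real test):

class YenTestVectors {
    // Escapes dodge the "what encoding are our source files" question entirely.
    static final String[] VALUES = {
        "\u00A5\uFFE5",   // half-width YEN SIGN (U+00A5) then FULLWIDTH YEN SIGN (U+FFE5)
        "\u00A5\u00A5",   // two half-width yen signs
        "\uFFE5\uFFE5",   // two full-width yen signs
    };
}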
> While I'm also team ASCII, it's possible I'm not appreciating some edge case here.
To be clear I'm not saying you can't use files, and sometimes files are more appropriate and convenient than loading everything into code despite the caveats I mentioned... but I'm not seeing much of a boon for exiling test data to files here.
Thanks. Interesting examples - and thanks for actually finding in-code examples :)
I didn't realize OpenJDK targets systems without filesystems. That's cool. I thought the Java Smart Card days were behind us and the Java folks only cared about server applications now. However, having a requirement that all tests be inline seems a bit onerous. Do they inline images as well for testing BufferedImage and company?
I can see how an external file makes it a bit more of an "integration test" that would in effect be testing several moving pieces. I don't agree it would create more boilerplate or make things any less clear, though. It makes it much easier to test many different complex and large inputs and to feed in new tests without needing to recompile. It's also easy to version control and introduce new tests with pathological cases. It presents a clear separation of test inputs from code - but maybe I can understand the counterargument. It does look neater to have everything together.
Jars make it easy to embed arbitrary data with your code (or in this case, your tests, which are usually compiled separately), so no, you can easily test against files without access to a filesystem (beyond the access required to read the jars).
I still think having test cases with strings in the code is often a lot clearer personally.
(Unrelated: I wish more PLs made it easy to embed files within executables and access them with an OS-independent filesystem like API, it's often very useful.)
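For the Java case specifically, here's a minimal sketch of reading bundled test data from the classpath; the resource name test-vectors.txt is made up:

import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

class JarResourceExample {
    // Reads a resource from the classpath (e.g. from inside the test jar),
    // so no filesystem access is needed beyond whatever loads the jar itself.
    static String loadTestVectors() throws IOException {
        try (InputStream in = JarResourceExample.class.getResourceAsStream("/test-vectors.txt")) {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        }
    }
}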
> I didn't realize OpenJDK targets systems without filesystems.
To be fair I'm not sure it does, really, per se. I know embedded Java is/was a thing, though, and it wouldn't surprise me if someone somewhere tortured their own personal fork into running a test suite on embedded stuff - if only for legacy support testing purposes.
And poking around in that directory did make it clear some tests use files containing test vectors.
> I don't agree it would create more boiler plate
To get concrete again, I consider this boilerplate:
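Something of this shape, say - the path and class here are invented, not the actual OpenJDK test, but it's the kind of ceremony I mean:

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

class CurrencySymbolTestData {
    // Ties the test to a working directory, an extra file to keep in sync,
    // and a read + decode that can only fail at run time.
    static List<String> load() throws IOException {
        return Files.readAllLines(Path.of("test/data/currency-symbols.txt"),
                StandardCharsets.UTF_8);
    }
}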
And this is merely (de)serializing pure unstructured strings without any kind of data format or fallible schema beyond charset encoding, making this a poster child for exiling test data to files (probably part of the reason why it was exiled to files!)
> It makes it much easier to test many different complex and large inputs and to feed in new tests without needing to recompile.
For bulk plain text I'd agree. For codebases with slow incremental builds, structured data might also benefit from being exiled for iteration speed. Or there can be benefits to not requiring a full developer environment.
But if incremental builds are fast (a worthy goal), and if a full developer environment can reasonably be assumed (dev-focused unit/integration testing), compiler assistance with structured data is often more convenient, and has better error reporting for syntax errors etc. than what you'll get from many/most simple and straightforward uses of deserializers.
If you are specifically trying to test Unicode support, then you might want to try multiple formulations for the same glyph. E.g. you can make é either directly with 00e9 or with a combining mark 0065 0301. You can't tell the difference in a string literal, but escape codes will make it clear.
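A quick Java-flavoured sketch of that, using escapes so the difference survives any source-encoding questions (the class name is mine):

import java.text.Normalizer;

class GlyphForms {
    public static void main(String[] args) {
        String precomposed = "fianc\u00E9e";   // U+00E9 LATIN SMALL LETTER E WITH ACUTE
        String combining   = "fiance\u0301e";  // 'e' followed by U+0301 COMBINING ACUTE ACCENT
        System.out.println(precomposed.equals(combining));  // false: different code point sequences
        System.out.println(Normalizer.normalize(precomposed, Normalizer.Form.NFC)
                .equals(Normalizer.normalize(combining, Normalizer.Form.NFC)));  // true once normalized
    }
}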
> it also limits discoverability if `fiancee` now is written as `fiancée` […] Searching for the former does not discover the latter.
Let's book this under "falsehoods programmers believe about Unicode".
You can try the code below, or simply use your browser's search function on this very page: type in the string `fiancee` and you will see all variants, accent-marked or not, highlighted as results.
#!/usr/bin/env perl
use utf8;
use Unicode::Collate;
my $uc = Unicode::Collate->new(
    normalization => undef, level => 1
);
printf "pos %d len %d\n",
    $uc->index('Lettre à ma fiancée.docx', $_)
    for 'fiancée', 'fiancee', 'lettre a ma-Fiancee _DOCX_';
__END__
pos 12 len 7
pos 12 len 7
pos 0 len 24
Pardon me if I butchered the French grammar; I do not speak the language and can't easily verify whether I made a mistake in adapting the code from a year ago. <https://news.ycombinator.com/item?id=30405840>
It isn't a falsehood but a valid remark regarding most tools. I tried searching "fiancee" in your example in VS Code and failed to find "fiancée". Of course you can consider "e=é" similar to how case-insensitive search considers "e=E". The problem is that most tools currently don't do this.
Yes, this. The files in the source repo are 99.9% ASCII. Anything that's not ASCII should be encoded in UTF-8 instead of some other encoding. The mailing list discussion refers to some issues we've had where characters in Shift-JIS (CP932) crept into the code base.
I don't think anyone is advocating using non-ASCII characters, such as accented characters or emoji, in identifiers.
Exactly, and this isn't a proposal for everybody else's Java software, it's a proposal for the JDK itself.
So this isn't even close to Rust's rule (all Rust source is UTF-8), let alone a declaration that Java is going to introduce data types with Korean names or new Russian method names on existing types. It's just: hey, if any part of the JDK's own source code needs non-ASCII then it should be UTF-8, and if it doesn't need to be non-ASCII that's fine, because ASCII is a strict subset of UTF-8.
> While I believe we should have a standard, I feel that having UTF-8 in source code brings more risks than benefits. [0]
> Next to those additional risks it also limits discoverability if `fiancee` now is written as `fiancée` (yea I had to c/p that from somewhere). Searching for the former does not discover the latter.
If your search function can't handle different characters with the same meaning then it's already broken (e.g. usually you want search to be case insensitive). Likewise if similar-looking characters are an exploitable problem in your codebase then that problem already exists with e.g. l/I.
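In Java, for instance, an accent-and-case-insensitive comparison is one Collator call away (a sketch, not a full search implementation):

import java.text.Collator;
import java.util.Locale;

class AccentInsensitiveCompare {
    public static void main(String[] args) {
        Collator c = Collator.getInstance(Locale.FRENCH);
        c.setStrength(Collator.PRIMARY);  // compare base letters only: ignores accents and case
        System.out.println(c.compare("fiancee", "fiancée"));  // 0, i.e. considered equal
    }
}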
> Lastly, there is the issue of Intellisense, or whatever it is called in many languages. I have seen codebases in English with the aforementioned `é` in a function name. The only way for me to select that function name was with the arrow keys / mouse. I couldn't type it on my QWERTY.
In any decent system you can type it with compose-'-e, just like it looks.
I'm all for making programmers dogfood the use of non-ASCII characters (actually I'd like to see more use of non-Unicode encodings, unless and until Unicode stops screwing over Japanese). Maybe it'll give them more empathy for those of us who need to type things like this daily, and stop them doing things like flashy new packaging systems that break IME.
> Can you say more about this? Are there still reasons to prefer shiftjis over unicode for japanese characters?
Yes. Unicode uses codepoints that are primarily for Chinese characters to represent Japanese characters that it considers equivalent, even when those Japanese characters have different appearances; as a result, Japanese text in Unicode looks bad (readable, but ugly) unless displayed in a specifically Japanese font (in which case you'd have the converse problem of Chinese text looking wrong). The Unicode consortium suggests various vaporware approaches to combat this, but the thing that actually works is keeping Japanese text in Japanese encodings and Chinese text in Chinese encodings. (Of course this means that you need to be able to display text from multiple encodings on the same page if you want to display both languages on the same page, but all of the Unicode consortium vaporware fixes require you to build something equivalently complex, and you wouldn't even be able to test it by using strings from two western encodings since it would be specific to Japanese and Chinese.)
There's no difference between encoding a page in Shift_JIS and tagging the page as Japanese language (assuming your browser switches fonts correctly based on language tags). Chinese text in Shift_JIS isn't going to look correct either.
I think the suggested alternative though is having UTF-8 take care of this.
Unicode already has different codepoints for every other language: I can type Russian (День), Greek (Ημέρα), English, even Egyptian hieroglyphs (𓀃), all of them in this one text field, and they all render right for both of us if we have a suitable font. If they don't render, it's unambiguous.
It's only Chinese and Japanese where I can type some characters and, depending on whether you have a Chinese or a Japanese font _first_ in your system, it might render wrong, and either way it's ambiguous to the computer without further metadata. The computer doesn't need metadata to know that "α" is "α", so why does it need metadata to know whether 直 is the Chinese (http://www.hanzi5.com/bishun/76f4.html) or Japanese (https://kanjivg.tagaini.net/viewer.html?kanji=%E7%9B%B4) character, two different-looking characters that share the same codepoint?
Unicode does not have different codepoints for every language. Leaving aside scripts that aren't encoded at all, all languages using the Latin script get a single set of code points, and if you want Comic Sans MS or Carolingian minuscule or fine-grained control over whether "a" has a hook at the top or not, you need to specify a font that renders it the way you want, just like with 直.
Because Unicode is about the meaning of characters, not their appearance. If there were a context where the two different shapes of 直 have different meaning, Unicode would add a new codepoint to distinguish them. In fact "ɑ" without hook does have a separate codepoint because it is used in linguistics to mark a vowel different from "a".
If CYRILLIC SMALL LETTER A deserves a code point despite being so similar to LATIN SMALL LETTER A, why doesn't a Japanese character that actually looks different from a Chinese one get a codepoint? And the idea that something can have the "same meaning" across two languages is very wooly.
Why don't Japanese characters in seal script or handwriting get different codepoints from mincho font characters? They look different too.
Sometimes looking visually different doesn't matter because if you know how to write them by stroke order, you'll still be able to read it (which is how handwriting works, I think; I'm pretty bad at reading that…)
Hanzi simplification in the Mainland also complicated things, since I doubt they wanted to make all of those into different characters.
Sure, but every application handles encoding (or at least, every application did up until the recent UTF-8-only movement), whereas language tagging is web/html-specific.
If you're really hardcore processing multilingual text you need an equivalent to language tags anyway, because you need a dictionary for word-wrapping, date formatting, quote marks etc along with changing fonts (or glyph selectors in the same font.) TeX has them too.
But usually people only care about their language, so it goes by the system UI language and it works out.
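To make the "more than fonts" point concrete, a small Java sketch of locale-dependent formatting (dates here, but the same goes for word-wrapping, quote marks, etc.):

import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.FormatStyle;
import java.util.Locale;

class LocaleMatters {
    public static void main(String[] args) {
        LocalDate d = LocalDate.of(2022, 6, 15);
        DateTimeFormatter f = DateTimeFormatter.ofLocalizedDate(FormatStyle.LONG);
        System.out.println(d.format(f.withLocale(Locale.JAPAN)));   // e.g. 2022年6月15日
        System.out.println(d.format(f.withLocale(Locale.FRANCE)));  // e.g. 15 juin 2022
    }
}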
> If you're really hardcore processing multilingual text you need an equivalent to language tags anyway, because you need a dictionary for word-wrapping, date formatting, quote marks etc along with changing fonts (or glyph selectors in the same font.) TeX has them too.
Sure. And once you're doing that you don't gain a lot of benefit from unicode AFAICS, because you have to track these spans of locale-specific text.
People's names are the thing you should be most careful to not do that with! Most Japanese people will generally put up with seeing the Chinese rendering of a character in ordinary text, but they really don't want you to do that to their names!
Even assuming that works, getting it to make its whole way through your whole tech stack is no easier (and more HTML-specific) than having spans with their own encoding.
This isn't about multiple encodings, this is about the one encoding, UTF-8, being able to represent multiple languages. Which it mostly can, except for Han characters.
In my reply here, I can type in english and russian at once. Привет, мир.
Yet, if I try to type Chinese on one line and Japanese on the next, I cannot do it. Hacker News does not let me enter "lang" tags, so I can only type either the Chinese or the Japanese variant of a kanji.
So yes, it is about pages with multiple encodings. A span's smaller than a page!
The Unicode answer is "variation selectors" which are used for some historical variant kanji, but not for whole language switching. I suppose they could be used for that too though.
I don't know about HTML specifically; I meant spans as a general concept rather than a literal <span> tag. If the web stack has actually implemented mixing languages on the same page to the point where you can use it in a "normal" application then that's very cool (and if they've done it with their lang tags rather than by allowing mixed encodings, well, fine), but I've yet to see a site that actually has that up and running.
I can't read it, but there will be hundreds of articles on Japanese Wikipedia covering Chinese literature etc that have text in both languages, all in Unicode.
<html lang="ja">
  Japanese text ...
  <span lang="zh">
    Chinese quote
  </span>
  ...
</html>
is much easier than mixing encodings. With the above entirely in Unicode, it will be handled reliably by anything that can handle Unicode, and is still reasonably readable even if the Chinese text is shown in a Japanese font. Reading just the fourth line without the third will still show something 'OK'.
They might be alluding to issues from CJK unification. But I don’t believe that many practitioners working with Japanese text consider these concerns more important than the value that Unicode brings to the table.
I think that CJK unification leads to awkwardness where some fonts will basically render some kanji the “Chinese way” or the “Japanese way”, so you can’t really have a good font that covers both at once. There’s a good aesthetic argument about not using the same font for both languages, but Unicode outright precludes it I think
Virtually every Japanese organisation disagrees, you only have to look at a Japanese website to know as much (or are those somehow not "practitioners working with Japanese text"?). Most will use a non-Unicode encoding and/or resort to rendering important text as images, since it's impossible to rely on Unicode text rendering well.
From here [0], 94.1% of .jp websites use UTF-8 and 6.5% use Shift-JIS. You can find a list of popular sites still using Shift-JIS here [1].
If you change your PC or phone locale to Japanese and access mostly Japanese websites, the font display on most major OSes works well. It's probably fair to assume that's a very common setup for Japanese users, and probably for Chinese and Korean users as well.
I think the CJK "problem" is more apparent for international users who don't have that locale setup or who deal with plain text in multiple languages at the same time.
Your own link shows that both Shift-JIS and EUC-JP are on an upward trend, which isn't what we'd expect if Unicode was succeeding. I can't see it listing examples of .jp sites using UTF-8; would be interesting to see how many of them are relying on text-as-images.
Yeah, what lmm says is not what happened / what's thought in Japan. Japanese sites still use Shift_JIS because they love (/s) legacy software. Using Shift_JIS isn't a practical solution.
Random sample of some websites:
- Rakuten Bank -> Shift JIS
- NHK, the Ministry of Health and Labor Welfare, yahoo japan -> UTF-8
I don't really really know, but this random yahoo answers [0] page seems to agree with me that really you're gonna use UTF-8 for new stuff in general
One point that answer brings up that I hadn't thought of: older flip phones would have good Shift-JIS support but not necessarily good Unicode support.
My guess is that "text as images" might be partly that, partly what you are saying in a legacy sense, and also partly for the same reason old sites in the US would do it: to make something look exactly the way you want when you already have a Photoshop render.
[0]: https://detail.chiebukuro.yahoo.co.jp/qa/question_detail/q14...
I wonder if this can ever be fixed. If basically every Japanese organization disagrees with the Unicode consortium to the point that it's undermining the use of Unicode in entire countries, it might be worth unwinding CJK unification.
This isn't the case, obviously. Japanese organisations are involved in the Unicode process (as evidenced by their input to the several CJK Unified Ideograph Extensions). Thanks to Unicode, many historical characters can be represented in digital text without resorting to placeholders or hacks.
Anecdotally, I visited a group enthusiastically exploring the possibilities of the then-new CJK Unified Ideograph Extension B in Tokyo over a decade ago. The CJK issue mentioned here was a bigger problem then than it is now (most developers know how to handle it properly, e.g. with language declarations in HTML etc.).
People will use legacy encodings. That's just a fact of life. Sometimes out of inertia (i.e., it just works, so why bother changing it?), sometimes out of technical limitations of their pipeline, sometimes because of a lack of understanding, rarely out of principle. I'm stuck fixing an issue with some numbskull sending us SAML requests declared as UTF-8 but containing some obsolete encoding instead, and this is for Dutch text! This just happens.
There is also nothing wrong with text in Shift-JIS, if that encoding encompasses all characters used.
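On the mis-declared-encoding issue: how loudly that fails depends on how strictly you decode. A Java sketch of catching it up front (the class and method names are mine):

import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class StrictUtf8 {
    // Throws instead of silently substituting U+FFFD when the bytes aren't valid UTF-8.
    static String decode(byte[] bytes) throws CharacterCodingException {
        CharsetDecoder dec = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        return dec.decode(ByteBuffer.wrap(bytes)).toString();
    }
}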
Snap/Flatpak/etc. often mean you can't use IME (input method extensions) with programs packaged with them unless the packager has jumped through certain hoops, which they usually haven't.
Ew, that's... extremely unfortunate. I'm sure there's some internal reason, but that feels like something the system should be able to add support for unilaterally, at least by default, without needing special packager effort. :(
I think you have misunderstood what this issue is about. The Java Language Specification already specifies that Java programs are written using the Unicode character set and that letters in identifiers can be drawn from the entire Unicode character set. The issue is that there are many ways to encode Unicode characters and that a single encoding is not used across the code base. The suggestion is merely to normalize this encoding so that all files use the same, so as to not confuse the tooling and complicate the build process.
> I feel that having UTF-8 in source code brings more risks than benefits. [0]
While this is a really fun theoretical attack, has it ever been encountered? Such a patch would have to go through a diff anyway before being accepted. And surely a "comment" that looks like code has a high chance of getting rejected?
Does it make sense to restrict this feature (that is useful to everyone in the world, except those who only speak English) on account of a very theoretical risk?
> Next to those additional risks it also limits discoverability if `fiancee` now is written as `fiancée`
Just in Italy, it can be written as: fidanzata, zita, morosa, ragazza, picciotta. Certainly the accent isn't the main problem here? Do you want to eliminate all synonyms? All languages other than English?
> I couldn't type it on my QWERTY.
Have you encountered this situation often? If so, just remap your keyboard.
I, for example, have a "w" key that is mostly used in shooter games, since the letter doesn't appear in my language.
> And yes, I know that there are codebases which are non-English, and not even Latin. Those are very valid concerns to which I admittedly have no answer to.
Perhaps allowing unicode in source files could be a solution.
It's not about using more or less non-ASCII characters. It's about saving the source code files all in the same encoding (UTF-8) and standardizing around this encoding.
They want to get rid of an encoding mess.
I don't know how one can be against this. Your code is probably saved in UTF-8 right now even if you don't use non-ASCII characters.
It's probably just that the JDK is a very old code base that was written on different OSes, without much thought given to encoding. Now, they want to standardize while making sure not to break anything.
What's more, the presence of non-ASCII characters in such a huge code base is almost inevitable so yeah, it matters if you want to avoid issues.
> Your code is probably saved in UTF-8 right now even if you don't use non-ASCII characters.
Nit: For a document that only contains ASCII characters, UTF-8 is identical to ASCII. So calling it UTF-8 is technically correct, but misleading -- you could equally well say the document is encoded using any of the other supersets of ASCII.
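A trivial Java illustration of that point:

import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class AsciiSubset {
    public static void main(String[] args) {
        String text = "plain ASCII source text";
        byte[] utf8  = text.getBytes(StandardCharsets.UTF_8);
        byte[] ascii = text.getBytes(StandardCharsets.US_ASCII);
        System.out.println(Arrays.equals(utf8, ascii));  // true: byte-for-byte identical
    }
}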
Indeed. I had that in mind when writing my comment, but it's good that you made this part explicit.
You'd better have your editor / tools set to use UTF-8, though, in case some non-ASCII characters end up in your file, or else you might be in for some pain.
While I agree that sticking with ASCII sidesteps a lot of potential issues, I don’t agree that “I don’t know how to type that character” is a valid excuse nowadays. There are various ways, from Compose keys to built-in OS functionality, so entering accented characters, at the very least, shouldn’t be a major hurdle. I say this to encourage people to look into it; it’s really not that difficult.
What you're against seems to be bad practices that could be more easily abused with UTF, rather than UTF itself.
I think the benefits could outweigh the harms, but some of it would indeed depend on good conventions being followed.
As for things like é, a function called fianc__e_accent_aigu__e would be far worse, and still a matter of convention. At least search tools could easily be configured to treat accented letters as equivalent to unaccented ones, etc. And even nano has pretty good autocomplete facilities these days.
Where I find a big benefit with Unicode names is in reproducing standard mathematical formulae in code. But I agree that this implies a good way to input the required symbols that doesn't require you to search Unicode charmaps.
Not so controversial. The vast majority of code is written in plain ASCII (and English), and I hope it stays that way. Using non-ASCII characters in your code is a double failure.
Take an example: "trouvé" instead of "found" as a variable name.
1. You're right: most people won't be able to type "é", "à", etc. with their keyboard. Working on that code won't be easy.
2. What is the meaning of "trouvé"? How could people figure out that variable name? The average coder will search for English-named functions and variables.
Depending on your OS / windowing system, you actually have easy ways to quickly get Greek letters out of your (non-Greek) keyboard without having to copy-paste them.
My favorite way is to assign a key or key combination of your choice to act as the so-called dead_greek modifier, and then just press that modifier before a Latin letter key to get the corresponding Greek letter. For example, under linux/xorg, if you wanted AltGr+g to be your dead_greek, you can use xmodmap to set that mapping up.
Alternative 1: assign a key as a direct modifier (not a dead key) and add key combination definitions for that to generate the characters you want … alternative 2: switchable keyboard layouts.
I can't test it currently, as the machine I have here runs Xorg rather than Wayland, but IIRC the general idea of using a dead_greek modifier should also work on Wayland; assigning a key for it, unfortunately, cannot be done with the simple non-xkb legacy X11 keyboard layout tools like xmodmap that still work on Xorg.
That being said, Wayland did take over part of xkb (the less old keyboard system) from Xorg, and so from what I can gather from a quick search, the easiest way to assign a key as dead_greek in Wayland would probably be with a file in $XDG_CONFIG_HOME/xkb/symbols/ like they do here for other key symbols: https://unix.stackexchange.com/questions/292868/how-to-custo...
I don't know any way to tell Wayland to (re)load an xkb config on the fly (without logging out and back in), though. In particular, I doubt that setxkbmap would work for that like it does on Xorg.
You'd need something that can actually parse the language to apply rules on language constructs like identifiers. I don't think EditorConfig supports this, but a linting tool would be able to.
What's the difference between a variable name that's obscure because it has a special character and a variable name that's obscure for some other reason?
I wish a compose key was more widely available. It's actually moderately intuitive; on X Windows, é is compose-e-'.
I actually agree with this. Any byte higher than 127 in any source file should be a compile-time error, and it is so in any project where I can make the call.
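A sketch of the sort of check I mean - a standalone gate you could wire into the build; the names and the default "src" directory are mine:

import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

class AsciiGate {
    public static void main(String[] args) throws IOException {
        Path root = Path.of(args.length > 0 ? args[0] : "src");
        try (Stream<Path> files = Files.walk(root)) {
            boolean clean = files.filter(Files::isRegularFile)
                    .allMatch(AsciiGate::isPureAscii);
            if (!clean) System.exit(1);  // treat any byte above 127 as a build failure
        }
    }

    static boolean isPureAscii(Path file) {
        try {
            for (byte b : Files.readAllBytes(file)) {
                if ((b & 0xFF) > 127) {
                    System.err.println(file + ": contains a byte above 127");
                    return false;
                }
            }
            return true;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}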
> And yes, I know that there are codebases which are non-English
Non-English speaker here. Use English for identifiers and comments. Put magic strings in files other than source files, which is what you should be doing anyway if you want to seriously support i18n.
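In Java terms that usually means a ResourceBundle backed by properties files rather than literals in code; a minimal sketch (the bundle and key names are made up):

import java.util.Locale;
import java.util.ResourceBundle;

class Greeter {
    public static void main(String[] args) {
        // Looks up Messages_fr.properties, falling back to Messages.properties.
        ResourceBundle messages = ResourceBundle.getBundle("Messages", Locale.FRENCH);
        System.out.println(messages.getString("greeting"));
    }
}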
This makes it a PITA to use non-ASCII in string literals. They have applications beyond non-English language support. Unicode character escapes are opaque and impair readability. If a wizened language like C can do it gracefully, newer languages should be able to as well.
I don’t know, if you want to find any form of official documentation for your dependencies/frameworks, you will have to know English either way. It is just the de facto language of IT, and I think it is fair to have every (even internal-only by a foreign language speaking group) codebase entirely in English.
And I say it as someone whose native tongue is not English.
[0] https://krebsonsecurity.com/2021/11/trojan-source-bug-threat...