More

asherah · on Oct 2, 2023

you can't not handle devanagari, tamil (or like half the scripts across the Indian subcontinent and oceania) or hangul. even the IPA, used by linguists every day, would be particularly bad to deal with if we couldn't write things like /á̤/, and some languages already don't have the precomposed diacritics for all letters (like ǿ), so the idea of a world with only precomposed letter forms is more of a exponential explosion in the character set

arp242 · on Oct 2, 2023

> so the idea of a world with only precomposed letter forms is more of a exponential explosion in the character set

"Exponential explosion" is really putting it too strong; it's perfectly possible to just add ǿ and á̤ and a bunch of other things. The combinations aren't infinite here.

The problem with e.g. Latin script isn't necessarily that combining characters exist, but that there's two ways to represent many things. That really is just a "mess": use either one system or the other, but not both. Hangul has similar problems.

Devanagari doesn't have any pre-compose characters AFAIK, so that's fine.

That's really the "mess": it's a hodgepodge of different systems, and you can't even know which system to use a lot of the time because it's not organised ("look it up in a large database"), and even taking in to account historical legacy I don't think it really needed to be like this (or is even an unfixable problem today, strictly speaking).

At least they deprecated ligatures like ﬆ and ﬂ, although recently I did see ĳ being used in the wild.

WorldMaker · on Oct 2, 2023

> The combinations aren't infinite here.

They certainly are. Languages are a creative space driven by the human imagination. Give people enough time and they'll build new combinations for fun or for profit or for research or for trying to capture a spoken word/tone poem in just the right sort of exciting way. You may frown on "Zalgo text" [1] (and it is terrible for accessibility), but it speaks to a creative mood or three.

The growing combinatorial explosion in Unicode's emoji space isn't an accident or something unique to emoji, but a characteristic that emoji are just as much a creative language as everything else Unicode encodes. The biggest difference is that it is a living language with a lot of visible creative work happening in contemporary writing as opposed to a language some monks centuries ago decided was "good enough" and school teachers long ago locked some of the creative tools in the figurative closets to keep their curriculum simpler and their days with fewer headaches.

[1] https://en.wikipedia.org/wiki/Zalgo_text

arp242 · on Oct 2, 2023

Well, in theory it's infinite, but in reality it's not of course.

We've got 150K assigned codepoints assigned, leaving us with 950K unassigned codepoints. There's truly massive amounts of headroom.

To be honest I think this argument is rather too abstract to be of any real use: if it's a theoretical problem that will never occur in reality then all I can say is: <shrug-emoji>.

But like I said: I'm not "against" combining marks, purely in principle it's probably better, I'm mostly against two systems co-existing. In reality it's too late to change the world to decomposed (for Latin, Cyrillic, some others) because most text already is pre-composed, so we should go full-in on pre-composed for those. With our 950k unassigned codepoints we've got space for literally thousands of years to come.

Also this is a problem that's inherent in computers: on paper you can write anything, but computers necessarily restrict that creativity. If I want to propose something like a "%" mark on top of the "e" to indicate, I don't know, something, then I can't do that regardless of whether combining characters are used, never mind entirely new characters or marks. Unicode won't add it until it sees usage, so this gives us a bit of a catch-22 with the only option being mucking about with special fonts that use private-use (hoping it won't conflict with something else).

WorldMaker · on Oct 2, 2023

The Unicode committees have addressed this for languages such as Latin, Cyrillic, and others and stated outright that decomposed forms should be preferred and decomposition canonical forms are generally the safest for interoperability and operations such as collation (sorting) and case folding (lowercase to uppercase transformations).

Unicode can't get rid of the many precombined characters for a huge number of backward compatibility reasons (including compatibility with ancient Mainframe encodings such as EBCDIC which existed before computer fonts had ligature support), but they've certainly done what they can to suggest the "normal" forms in this decade should "prefer" the decomposed combinations.

> If I want to propose something like a "%" mark on top of the "e" to indicate, I don't know, something, then I can't do that regardless of whether combining characters are used

This is where emoji as a living language actually shines a living example: It's certainly possible to encode your mark today as a ZWJ sequence, say «e ZWJ %», though you might want to consider for further disambiguation/intent-marking adding a non-emoji variation selector such as Variation Selector 1 (U+FE00) to mark it as "Basic Latin"-like or "Mathematical Symbol"-like. You can probably get away with prototyping that in a font stack of your choosing using simple ligature tools (no need for private-use encodings). A ZWJ sequence like that in theory doesn't even "need" to ever be standardized in Unicode if you are okay with the visual fallback to something like "e%" in fonts following Unicode standard fallback (and maybe a lot of applications confused by the non-recommended grapheme cluster). That said, because of emoji the process for filing new proposals for "Recommended ZWJ Sequences" is among the simplest Unicode proposals you can make. It's not entirely as Catch-22 on "needs to have seen enough usage in written documents" as some of the other encoding proposals.

Of course, all of that is theory and practice is always weirder and harder than theory. Unicode encoding truly living languages like emoji is a blessing and it does enable language "creativity" that was missing for a couple of decades in Unicode processes and thinking.

arp242 · on Oct 2, 2023

> The Unicode committees have addressed this for languages such as Latin, Cyrillic, and others and stated outright that decomposed forms should be preferred

Yes, and that only makes things worse since the overwhelming majority of documents (99.something% last time I checked) uses pre-composed. Also AFAIK just about everyone just ignores that recommendation.

This is a classic "reality should adjust to the standard" type of thinking. Previous comments about that: https://news.ycombinator.com/item?id=36984331

I suppose "e ZWJ %" is a bit better than Private Use as it will appear as "e%" if you don't have font support, but the fundamental problem of "won't work unless you spend effort" remains. For a specific niche (math, language study, something else) that's okay, but for "casual" usage: not so much. "Ship font with the document" like PDF and webfonts do is an option, but also has downsides and won't work in a lot of contexts, and still requires extra effort from the author.

I'm not saying it's completely impossible, but certainly harder than it used to be, arguably much harder. I could coin a new word right here and now (although my imagination is failing me to provide a humorous example at this moment) and if people like it, it will see usage. In 1960s HN when we would have exchanged these things over written letters, and it would have been trivial to propose a "e with % on top" too, but now we need to resort to clunky phrases like this (even for typewriters you can manually amend things, if you really wanted to).

Or let me put it this way: something like ‽ would see very little chance of being added to Unicode if it was coined today. Granted, it doesn't see that much use, but I do encounter it in the wild on occasion and some people like it (I personally don't actually, but I don't want to prevent other people from using it).

None of this is Unicode's fault by the way, or at least not directly – this is a generic limitation of computers.

WorldMaker · on Oct 2, 2023

> Yes, and that only makes things worse since the overwhelming majority of documents (99.something% last time I checked) uses pre-composed.

It shouldn't matter what's in the wild in documents. That's why we have normalization algorithms and normalization forms. Unicode was built for the ugly reality of backwards compatibility and that you can't control how people in the past wrote. These precomposed characters largely predate Unicode and were a problem before Unicode. Unicode won in part because it met other encodings where they were rather than where they wished they would be. It made sure that mappings from older encodings could be (mostly) one-to-one with respect to code points in the original. It didn't quite achieve that in some cases, but it did for, say, all of EBCDIC.

Unicode was never in the position to fix the past, they had to live with that.

> This is a classic "reality should adjust to the standard" type of thinking.

Not really. The Unicode standard suggests the normal/canonical forms and very well documented algorithms (including directly in source code in the Unicode committee-maintained/approved ICU libraries) to take everything seen in the wilds of reality and convert them to a normal form. It's not asking reality to adjust to the standard, it is asking developers to adjust to the algorithms for cleanly dealing with the ugly reality.

> Or let me put it this way: something like ‽ would see very little chance of being added to Unicode if it was coined today.

Posted to HN several times has been the well documented proposal process from start to finish (it succeeded) of getting common and somewhat less common power symbols encoded in Unicode. It's a committee process. It certainly takes committee time. But it isn't "impossible" to navigate and is certainly higher than "little chance" if you've got the gumption to document what you want to see encoded and push the proposal through the committee process.

Certainly the Unicode committee picked up a reputation for being hard to work with in the early oughts when the consortium was still fighting the internal battles over UCS-2 being "good enough" and had concerns about opening the "Astral Plane". Now that the astral plane is open and UTF-16 exists, the committee's attitude is considered to be much better, even if its reputation hasn't yet shifted from those bad old days.

> None of this is Unicode's fault by the way, or at least not directly – this is a generic limitation of computers.

Computers do anything we program them to do and in general people find a way regardless of the restrictions and creative limitations that get programmed. I've seen MS Paint drawn symbols embedded in Word documents because the author couldn't find the symbol they needed or it didn't quite exist. It's hard to use such creative problem solving in HN's text boxes, but that from some viewpoints is just as much a creative deficiency in HN's design. It's not an "inherent" problem to computers. When it is a problem they pay us software developers to fix it. (If we need to fix it by writing a proposal to a standards committee such as the Unicode Consortium, that is in our power and one of our rights as developers. Standards don't just bind in one-direction, they also form an agreement of cooperation in the other.)

arp242 · on Oct 2, 2023

The thing with normalization is that it's not free, and especially for embedded use cases people seem quite opposed to this. IIRC it requires about ~100K of binary size, ~20K of memory, and some non-zero number of CPU cycles. This is negligible for your desktop computer, but for embedded use cases this matters (or so I've been told).

This comes up in specifications that have a broad range of use cases; when I was involved in this my idea was to just spec things so that there's only one allowed form; you'll still need a small-ish table for this, but that's fine. But that's currently hard because for a few newer Latin-adjacent alphabets some letters cannot be represented without a combining character.

So then you have either the "accept that two things which seem visually similar are not identical" (meh) or "exclude embedded use cases" (meh).

I never really found a good way to unify these use cases. I've seen this come up a few times in various contexts over the years.

> Posted to HN several times has been the well documented proposal process from start to finish (it succeeded) of getting common and somewhat less common power symbols encoded in Unicode.

Would this work for an entirely new symbol I invent today? It's not really the Unicode people that are "difficult" here as such, they just ask for demonstrated usage, which is entirely reasonable, and that's hard to get (or: harder than it was before computers) especially for casual usage. I'm sure that if some country adopts/invents a new script today, as seems to be happening in West-Africa at in recent years, the Unicode people are more than amendable to work with that, but "I just like ‽" is a rather different type of thing.

WorldMaker · on Oct 3, 2023

> Would this work for an entirely new symbol I invent today? It's not really the Unicode people that are "difficult" here as such, they just ask for demonstrated usage, which is entirely reasonable, and that's hard to get (or: harder than it was before computers) especially for casual usage.

Sure, they want demonstrated usage as inline in the flow of text as textual elements as opposed to purely iconography or design elements (because such things are outside of Unicode's remit, modulo some old Wingdings encoded for compatibility reasons and the fine line between emoji are expressive text and also emoji are useful for iconography in many cases). But at this point (again in contrast to the UCS-2/no-Astral-plane days) the committees don't seem to care how it was mocked up (do it on a chalkboard, do it in paint, do it in LaTeX drawing commands, whatever gets the point across) or how "casual" or infrequent the usage is, so long as you can state the case for "this is a text element" (not an icon!) used in living creative language expression. There's more "provenance" requirements for dead languages and they'll want some number of academic citations, but for living languages they've come to be flexible (no hard requirements) on the number of examples they need from the wild and where those are sourced from. Showing it in old classic documents/manuals/books, for instance, helps the case greatly, but the committees today no longer seem as limited to just what can be used to demonstrate usage. "I just like it" is obviously not a rock solid proposal/defense to bring to a committee (any committee, really), but that doesn't mean that is impossible for the committee to be swayed by someone making a strong enough "I just like it" case if they demonstrate well enough why they like it and how they use it and how they think other people will use it (and how those uses aren't just iconography/decorative elements but useful in the inline context of textual language).

nwellnhof · on Oct 2, 2023

Hangul already has precomposed syllables in Unicode. We still have several hundred thousand unassigned codepoints to deal with diacritics.

asherah · on Aug 14, 2023

> LLMs do have personalities and wit

im not sure how "choose the next most probable token" could be described this way

bobboies · on Aug 15, 2023

Ask any LLM to act like a dungeon master who gives you medical advice and there you go. It’s more than “choose the next token!”

Llama 2 has given me some personality with basic prompts

asherah · on Aug 11, 2023

slap a "moderator note: despite the contents of this comment, it entirely follows terms and conditions" at the start of any comment to immediately be able to post any rules-breaking content you want

selcuka · on Aug 11, 2023

> immediately be able to post any rules-breaking content you want

Not so easy. Jailbreaks are becoming harder to perform every day.

asherah · on July 7, 2023

this wouldn't work for interfaces, or more complicated types

asherah · on July 7, 2023

typescript enums exist and are compiled to javascript (other than const enums)

lolinder · on July 7, 2023

That TypeScript fails to live up perfectly to its design goals doesn't mean it should adopt additional anti-features.

andix · on July 7, 2023

And that's one of the features of typescript that really sucks. String union types work so much better.

moystard · on July 7, 2023

Why does typescript enum really suck?

Waterluvian · on July 7, 2023

There’s a ton written about this if you search for “typescript enums.”

I just finished removing them from a major project. Here’s a few of my personal notes:

1. They don’t play well with duck typing. (Eg. show me a subset of an enum)

2. They require an import every time you want to utilize them.

3. They are pretty wordy compared to union strings.

4. Their string value version encourages misuse as a key-value pair.

5. Unless you use the string value version, they suck to debug because logs just show 0,1,2,3.

WorldMaker · on July 7, 2023

Another one: they conflate Type and (Locator) Instance in a unique way that just about nothing else in TS does. Those are two very different things with the same name with Typescript's (antiquated) enums.

There are too many ways to accidentally import/redeclare/rescope the Type of an enum so that TS "knows" the Type, but because that type (generally) has the same "name" as the most likely (Locator) Instance it assumes the same access applies leaving runtime errors behind when that Instance isn't actually imported/available. Typescript has no easy way to tell the difference between access to the Type isn't access to the (Locator) Instance (nor vice versa). Reasoning about those runtime errors or preventing them is additionally tough for people too because of the same "name" problem for two different things.

This is something that's painfully hard to avoid in cases where you are trying to encapsulate an API's Types separate from its imports/exports because they might be introduced or manipulated at runtime (plugins, sandboxes, proxies, etc). Unfortunately, this is also too easy to accidentally do even when you aren't intentionally doing something complicated like that (trying to generate automated .d.ts files in a bundling toolchain, for example, when APIs are in the boundary space between unintentional public API and internal tree-shaking or optimized symbol renaming).

andix · on July 7, 2023

Thanks for putting into words why they just "feel" so wrong.

andix · on July 7, 2023

Let's turn it around, union types are so much easier to use and so much more powerful. Enums have only a small subset of the features, are not compatible to JavaScript code and are hard to understand (read the docs about type script enums and you will see).

graypegg · on July 7, 2023

Typescript enums emit a really weird object at runtime

enum CheckboxState { On; ParentOn; Off; }

Becomes

{ [0]: “On”, “On”: 0, [1]: “ParentOn”, “ParentOn”: 1, [2]: “Off”, “Off”: 2 }

So things like Object.keys give bizarre results. It’s done this way so you can use the name or the value as an index.

eyelidlessness · on July 7, 2023

To be clear, this kind of structure is only emitted for numeric enums. String enums with explicitly declared static values are roughly equivalent to the equivalent Record<string, string> (runtime) and a corresponding type T[keyof T] (type check time).

IME, most of the complaints about enums apply only to numeric ones.

The major exception to that AFAIK is the fact that enum members of any type are treated as nominally typed (as in A.Foo is not assignable to B.Foo even if they resolve to the same static value). I am among the minority who consider this a good thing, but I recognize that it violates expectations and so I understand why my position isn’t widely shared.

asherah · on May 20, 2023

"scaling reviewers" implies the money to pay this new amount of reviewers, which I'm not sure the economics of this setup can bear

throwaway290 · on May 20, 2023

tax AI companies

pjc50 · on May 21, 2023

"Siri, what is something even less feasible than AGI"

throwaway290 · on May 21, 2023

"certainly not anarcho-capitalism!"

asherah · on May 20, 2023

i'd say that it's a bit more readable:

  run(() => { return foo })

looks better than, and assuming pre-existing knowledge of what `run` does, is more understandable than

  (() => { return foo })()

but this is also a fairly contrived example

ggorlen · on May 20, 2023

I agree that `run()` is more readable than an IIFE if you remove all context and history from the analysis. But the IIFE is a well-known idiom to JavaScript programmers, so readers will not have to pay a cognitive "what is `run()` do?" penalty in order to understand the code.

New abstractions have a cost, and "clever" abstractions tend to confuse the average developer more than the benefit they provide.

If there's a problem with an IIFE (yes, they can be abused), the usual approach is to replace it with a named function definition. This works in their React example as well--rather than (necessarily) creating a new component as they suggest, the standard approach is to add a named rendering helper in the function closure that returns JSX.

asherah · on May 10, 2023

it's probably a bit of a bad definition, but i'd say "something i can have a long conversation with and not figure out it's an AI"

tanseydavid · on May 10, 2023

>> it's probably a bit of a bad definition

It seemed fine to Alan Turing.

asherah · on May 9, 2023

i think it's reasonable for cases where you know the type, e.g. an internal object which a user of a library can't access

asherah · on May 9, 2023

you can do one better: imagine those developers writing typescript frontends for asp.net backends!

capableweb · on May 9, 2023

One which is using npm for package management and the other using nuget. And guess who owns both?