The Norway Problem (hitchdev.com)
656 points by dedalus on April 3, 2021 | 325 comments



This is part of a more general problem: they had to rename a gene to stop Excel auto-completing it into a date.

https://www.theverge.com/2020/8/6/21355674/human-genes-renam...

Edit: Apparently Excel has its own Norway Problem ... https://answers.microsoft.com/en-us/msoffice/forum/msoffice_...


> This is part of more general problem

The more general problem is basically sentinel values (which these sorts of inferences can be treated as) in stringly-typed contexts: if everything is a string and you match some of those strings for special treatment, you will eventually match one in a context where that's wholly incorrect, and break something.


edit: fixed formatting problem

> sentinel values

Using in-band signaling always involves the risk of misinterpreting types.
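
A minimal sketch of that risk with PyYAML, which follows YAML 1.1; the countries list is the article's own example:

    import yaml  # PyYAML implements YAML 1.1, whose bool type matches yes/no/on/off in any case

    print(yaml.safe_load("countries: [GB, IE, FR, DE, NO]"))
    # {'countries': ['GB', 'IE', 'FR', 'DE', False]}
    # 'NO' (Norway) silently hits the in-band "no" sentinel and becomes a boolean.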

> This is part of more general problem

DWIM ("Do What I Mean") was a terrible way to handle typos and spelling errors when Warren Teitelman tried it at Xerox PARC[1] over 50 years ago. From[2]:

>> In one notorious incident, Warren added a DWIM feature to the command interpreter used at Xerox PARC. One day another hacker there typed

    delete *$
>> to free up some disk space. (The editor there named backup files by appending $ to the original file name, so he was trying to delete any backup files left over from old editing sessions.) It happened that there weren't any editor backup files, so DWIM helpfully reported

    *$ not found, assuming you meant 'delete *'
>> [...] The disgruntled victim later said he had been sorely tempted to go to Warren's office, tie Warren down in his chair in front of his workstation, and then type 'delete *$' twice.

Trying to "automagically" interpret or fix input is always a terrible idea because you cannot discover the actual intent of an author from the text they wrote. In literary criticism they call this problem "Death of the Author"[3].

[1] https://en.wikipedia.org/wiki/DWIM

[2] http://www.catb.org/jargon/html/D/DWIM.html

[3] https://tvtropes.org/pmwiki/pmwiki.php/Main/DeathOfTheAuthor


>> [...] The disgruntled victim later said he had been sorely tempted to go to Warren's office, tie Warren down in his chair in front of his workstation, and then type 'delete $' twice.

Ironically, this did not render the way you intended because HN interpreted the asterisk as an emphasis marker in this line.

It works here:

    ... type 'delete *$' twice.
because the line is indented and so renders as code, but not here:

> ... type 'delete $' twice.

because the subsequent line has emphasized text*. So the scoping of the asterisks is all screwed up.


Eh. "Death of the Author" is a reaction to the text not being dispositive as to what the author meant. It's deciding you don't care what the author meant, no longer considering it a problem that the text doesn't reveal that. Instead the text means whatever you can argue it means.

Which can be a fun game, but is ultimately pointless.


It gets more complicated when the author themselves changes their mind about that.


That’s a shrewd observation. Static types help with this somewhat. E.g. in Inflex, if I import some CSV and the string “00.10” comes in as 0.1, then later when you try to do work on it like

x == “00.10”

You’ll get a type error that x is a decimal and the string literal is a string. So then you know you have to reimport it in the right way. So the type system told you that an assumption was violated.

This won’t always happen, though. E.g. “sort by this field” will happily do a decimal sort instead of a string sort on “00.10”.

The best approach is to ask the user at import time “here is my guess, feel free to correct me”. Excel/Inflex have this opportunity, but YAML doesn’t.

That is, aside from explicit schemas. Mostly, we don’t have a schema.


If we're talking about general problems, then I don't think we can be satisfied with "sometimes it's a problem with types and sometimes it's a UI bug." That's not general.


> E.g. sort by this field will happily do a decimal sort instead of the string 00.10.

So that system is not consistent with type checking? How is this not considered a bug?


I mean if the value is imported as a decimal, then a sort by that field will sort as decimal. This might not be obvious if a system imports 23.53, 53.98 etc - a user would think it looks good. It only becomes clear that it was an error to import as a decimal when we consider cases like “00.10”. E.g., package versions: 10.10 is a newer version than 10.1.

Types only help if you pick the right ones.
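
A quick Python sketch of that failure mode (the values are hypothetical):

    # Ordinary amounts sort fine as decimals, so the import looks correct at first.
    print(sorted([23.53, 53.98, 7.1]))   # [7.1, 23.53, 53.98]

    # But "00.10" has already been mangled by the numeric import...
    print(float("00.10"))                # 0.1 -- leading and trailing zeros are gone

    # ...and version-like values compare wrongly once they are numbers:
    print(10.10 > 10.1)                  # False -- as floats these are equal,
                                         # even though version 10.10 is newer than 10.1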


Sure. In most static type systems though, you would be importing the data into structures that you defined, with defined types. So you wouldn’t suddenly get a Decimal in place of a String just because the data was different. You’d get a type error on import.


And of course the plague that is CSV when your decimal delimiter is ,


Basically, autoimmune disease, but for software.


I suppose this is a cliched thought, but the more general problem is kind of emblematic of current "smart" features... and their expected successors.

On one hand, this is a typically human problem. We have a system. It's partly designed, partly evolved^. It's true enough to serve well in the contexts we use it in on most days. There are bugs in places (like Norway, lol) that we didn't think of initially, and haven't encountered often enough to evolve around.

In code, we call it bugs. In bureaucracy, we just call it bureaucracy. Agency A needs institution B's document X, in a way that has bugs.

Obviously, it's also a typical machine problem. @hitchdev wants to tell pyyaml that Norway exists, and pyyaml doesn't understand. A user wants to enter "MARCH1" as text (or the name of a gene), and excel doesn't understand.

Even the most rigid bureaucracy is made of people and has fairly advanced comprehension ability though. If Agency A, institution B or document X are so rigid that "NO" or "MARCH1" break them... it probably means that there's a machine bug behind the human one.

Meanwhile... a human reading this blog (even if they don't program) can understand just fine from context and assumptions of intent.

IDK... maybe I'm losing my edge, but natural language programming is starting to seem like a possibility to me.

^I feel like we need a new word for these: versioned, maybe?


"The computer won't let me" is a particularly maddening "excuse" from bureaucrats...


I don't understand why those support agents for Microsoft just threw their hands up in the air and asked customers to go through some special process for reporting the bug in Excel. Why are they not empowered/able to report the issue on behalf of customers? It's so clearly a bug in Excel that even they are able to reproduce with 100% reliability.


It looks like it is intended behavior in Excel.


Yes. Excel cells are set to a "General" format that, by default, tries to guess the type of data the cell should be from its content. A date-looking entry gets converted to a date type. A number-looking string to a number (so 5.80 --> 5.8, very annoying since I believe in significant digits). When you import CSV data, for example, the default import format is "General", so date-looking strings will be changed to a date format. This can be avoided by importing the file and choosing to import the data as "Text". People having these data corruption problems forgot to do that.

It's "user error" except that there is no way to set the default import to import as "Text" (as far as I know), so one has to remember to do the three-step "Text" import every time instead of the default one-step "General" import.


Excel doesn't support CSV files. Anyone who believes it does has never really used Excel. [0] You're supposed to use spreadsheets as-is. Programs that have Excel export features should always directly export xlsx files.

[0] The only thing you can safely do with CSV files is to interpret every value as a text cell. CSV files always require out-of-band negotiation on everything, including delimiters, quotation, escape characters, and the data type of each column.


However....

Users BELIEVE Excel supports CSV files. That's the reality on the ground. Fighting against that is a losing battle.


I'd say the more general problem is a bad type system! In any language with a half decent type system where you can define `type country = Argentina | ... | Zambia` this would be correctly handled at compile-time, instead of having strange dynamic weak typing rules (?) which throw runtime errors in production (???).
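
Python can't do this at compile time, but here is a hedged sketch of the same idea at parse time, using an Enum so an unexpected value fails loudly instead of being silently coerced (the country list is abbreviated and hypothetical):

    from enum import Enum

    class Country(Enum):
        ARGENTINA = "AR"
        NORWAY = "NO"
        ZAMBIA = "ZM"

    def parse_country(raw: str) -> Country:
        # Raises ValueError for anything that isn't a known country code,
        # rather than guessing that "NO" means False.
        return Country(raw)

    print(parse_country("NO"))  # Country.NORWAY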


I would like to see how your solution handles the case of new countries or countries changing name. Recompile and push an update? If the environment is governmental this can take a very very very long time.

The proper solution, in my opinion, is a lookup table stored in the database. It can be updated, it can be cached, it can be extended.

And for transfer of data, use formats to which you can attach a schema. This way type data is not lost on export. XML did this but everyone hates XML. And everyone hates XSD (the schema format) even more. However, if you use the proper tools with it, it is just wonderful.


An even more general problem is that we as humans use pattern-matching as a cerebral tool to navigate our environment, and sometimes the patterns aren't what they appear to be. The Norway problem is the programming equivalent of an optical illusion.


Good language design involves deliberately adding redundancy which acts like a parity bit in that errors are more likely to be detected.


That's an interesting statement to apply to natural languages.

Consider this headline in English: "Man attacks boy with knife". This can be read two ways, either the man is using a knife to attack the boy, or the boy had the knife and thus was being attacked.

The same sentence in Polish would make use of either genitive or instrumental case to disambiguate (although barely). However, a naive translation would only differ in the placement of a `z` (with) and so errors could still slip through. At least in this case the error would not introduce ambiguity, simply incorrectness.

Similar to language design we can also consider: does the inclusion/requirement of parity features reduce the expressivity of the language?


> does the inclusion/requirement of parity features reduce the expressivity of the language?

This was a real eye-opener for me when learning Latin in school: stylistic expressions such as meter, juxtaposition, symmetry are so much easier to include when the meaning of a sentence doesn't depend on word order.


> stylistic expressions such as meter, juxtaposition, symmetry are so much easier to include when the meaning of a sentence doesn't depend on word order.

Eh.... some things are easy and some things are hard in any language. The specifics differ, and so do the details of what kinds of things you're looking for in poetry. Traditional Germanic verse focuses on alliteration. Modern English verse focuses on rhyme. Latin verse focuses on neither. [1]

English divides poetically strong syllables from poetically weak syllables according to stress. It also has mechanisms for promoting weak syllables to strong ones if they're surrounded by other weak syllables.

In contrast, Latin divides strong syllables from weak syllables by length. Stress is irrelevant. But while stress can be changed easily, you're much more restricted when it comes to syllable length -- and so Publius Ovidius Naso is invariably referred to by cognomen in verse, because it isn't possible to fit his nomen, Ovidius, into a Latin metrical scheme. That's not a problem English has.

[1] I am aware of one exceptional Latin verse:

> O Tite, tute, Tati, tibi tanta, tyranne, tulisti.


The real problem here is that people use Excel to maintain data. Excel is terrible at that. But the fact that it may change data without the user being aware of it, is absolutely the biggest failing here.


The problem is more that it's insanely overpowered while aiming for convenience out of the box. An "Excel Pro" version which takes away all the convenience and gives the user the power to configure it precisely for their task might be a better solution. The funny part is, most of those things are already configurable now, but users are not educated enough about their tools to actually do it.


Excel allows people to maintain data all over the place: from golf league data, to job actuals compared to estimates, to so much more. And Excel is accessible enough that tens of millions (or maybe more) of people do it.


The one I’ve seen was a client who wanted to store credit card numbers in an Excel sheet (yes I know this is a bad idea, but it was 15 years ago and they were a scummy debt collection call center). Numbers in Excel have a precision limit of about 15 significant digits, which a 16-digit credit card number exceeds.

Now, you and I know this problem is solved by prepending ‘ to the number and it will be treated as a string, but your average Excel user has no understanding of types or why they might matter. Many engineers will also look past this when generating Excel reports.


And CUSIPs, which are strings, get converted to scientific notation.

https://social.msdn.microsoft.com/Forums/vstudio/en-US/92e0a...


Easiest solution is just to rename Norway.


"Renaming it to Xorway resulted in untold damages from computer bugs..." - Narrator


Norway Orway Xorway Nandway Andway

Yes, yes, I see... This could be problematic, indeed. If only there were a logical solution.


So basically they renamed a gene because they had employees who were too stupid to use excel?


Regarding Excel: It also happens with Somalia, which makes this issue even stranger. Apparently because of "SOM".


There’s a really simple solution to this problem, which has been around since the 70’s: schemas.


> they had to rename a gene to stop excel auto-completing it into a date.

No one in their right mind uses a spreadsheet for data analysis. Good for working out your ideas but not in a production environment. I figure Excel was chosen because it was the utility the scientists were most familiar with.

The proper tool for the job would be a database. I recall reading about a utility, a highly customized database with an interface that looks just like a spreadsheet.


The analysis itself isn’t (usually) happening in Excel.

A lot of tools operate on CSV files. People use Excel to peek at the results or prepare input for other tools, and that’s how the date coercion slips in.

Sometimes, people do use it to collate the results of small manual experiments, where a database might be overkill. Even so, the data is usually analyzed elsewhere (R, graphPad, etc).


>A lot of tools operate on CSV files.

The mistake was to believe that Excel can operate on CSV files. It doesn't support them in any meaningful way. It supports them in a "I can sort of pretend that I support CSV files" way.


What is a good alternative to Excel for working with CSV files? Excel sure isn't ideal but it's always there as part of the MS Office suite, so I've never looked for anything else.


And yet, we are still being taught to use an Excel (2003) spreadsheet for data analysis... (Because that's what most businesses are still using !)


> they had to rename a gene to stop excel auto-completing

I can just about understand that "No" might cause a problem, but “Membrane Associated Ring-CH-Type Finger 1" being converted to MAR-1 defeats me.


>, but “Membrane Associated Ring-CH-Type Finger 1" being converted to MAR-1 defeats me.

No, that's not what's happening. To clarify...

If you type the 41-character string "Membrane Associated Ring-CH-Type Finger 1" into a cell -- Excel will not convert that to a date of MAR-1.

On the other hand, if you type the 6-character abbreviation "MARCH1", which looks like a realistic date -- Excel converts it to MAR-1.


The world desperately needs a replacement for YAML.

TOML is fine for configuration, but not an adequate solution for representing arbitrary data.

JSON is a fine data exchange format, but is not particularly human-friendly, and is especially poor for editable content: it lacks comments and multi-line strings, is far too strict about unimportant syntax, etc.

Jsonnet (a derivative of Google's internal configuration language) is very good, but has failed to reach widespread adoption.

Cue is a newer Jsonnet-inspired language that ticks a lot of boxes for me (strict, schema support, human-readable, compact), but has not seen wide adoption.

Protobuf has a JSON-like text format that's friendlier, but I don't think it's widely adopted, and as I recall, it inherits a lot of Protobufisms.

Dhall is interesting, but a bit too complex to replace YAML.

Starlark is a neat language, but has the same problem as Dhall. It's essentially a stripped-down Python.

Amazon Ion [1] is neat, but I've not seen any adoption outside of AWS.

NestedText [2] looks promising, but it's just a Python library.

StrictYAML [3] is a nice attempt at cleaning up YAML. But we need a new language with wide adoption across many popular languages, and this is Python only.

Any others?

[1] https://amzn.github.io/ion-docs/

[2] https://nestedtext.org/

[3] https://github.com/crdoconnor/strictyaml/


Seems you're missing my personal favorite, extensible data notation - EDN (https://github.com/edn-format/edn). Probably I'm a bit biased coming from Clojure as it's widely used there but haven't really found a format that comes close to EDN when it comes to succinctness and features.

Some of the neat features: custom literals / tagged elements whose support can be added at runtime/compile time (dates can be represented, parsed and turned into proper dates in your language). Also, being able to namespace data inside of it makes things a bit easier to manage without having to resort to nesting or other hacks. Very human friendly, plus machine friendly.

Biggest drawback so far seems to be performance of parsing, although I'm not sure if that's actually about the format itself, or about the small adoption of the format, meaning not many parsers focusing on speed have been written.


Your list is like a graveyard of my dreams and hopes. Anything that doesn't validate the format of the underlying data is pretty much dead to me...

The problem with most of these is they're useless to describe the data. Honestly, it is completely not useful to have the following to describe data:

email => string

name => string

dob => string

IMHO, it is akin to having a dictionary (like Oxford English) read like:

email - noun

name - noun

birthday - noun

It says next to nothing except, yes, they are nouns. All too often I waste time fighting nils and bullshit in fields or duplicating validation logic all over the place.

"Oh wow, this field... is a string..? That's great... smiles gently except... THERE SHOULD NOT BE EMOJI IN MY FUCKING UUID, SCHEMA-CHUD. GET THE FUCK OFF MY LAWN!"


It sounds to me like XML with a DTD & XSD would solve your problem. XML is no longer fashionable, but its validation is Turing-complete.


My experience is that validation quickly becomes surprisingly complex, to the point of being infeasible to express in a message format.

Not only are the constraints very hard to express (remember that one 2000 char regexp that really validates email addresses?), they are also contextual: the correct validation in an Android client is not the same as on the server side. Eg you might want to check uniqueness or foreign key constraints that you cannot check on the client. Sometimes you want to store and transmit invalid messages (eg partially completed user input). And then you have evolving validation requirements: what do you do with the messages from three years ago that don't have field X yet?

Unfortunately I don't think you can express what you need in a declarative format. Even minimal features such as regexp validation or enums have pitfalls.

I think it's better to bite the bullet and implement the contextually required validation on each system boundary, for any message crossing boundaries.


If you want automatic built-in string validation, one option that seems particularly interesting is to use a variant of Lua patterns, which are weaker and easier to understand than regular expressions, but still provide a significant degree of "sanity" for something like an email. The original version works on bytes and not runes, but you could simply write a parser that works on runes instead, and the pattern-matching code is just 400 old and battle-tested lines of C89. You might want to add one extension: allow for escape sequences to be treated as a single character (hence included in repetition operators and adding the capability to match quoted strings); with this extension, I think you could implement full email address validation:

https://i.stack.imgur.com/YI6KR.png

Lua patterns have also shown up in other places, such as BSD's httpd, and an implementation for Rust:

https://www.gsp.com/cgi-bin/man.cgi?section=7&topic=PATTERNS

https://github.com/stevedonovan/lua-patterns

http://lua-users.org/wiki/PatternsTutorial


Amazon Ion [1] supports schema [2] and it all looks quite nice to me. Maybe it deserves wider adoption.

[1] https://amzn.github.io/ion-docs/ [2] https://amzn.github.io/ion-schema/


I agree with this, something RON/JSON-like with type annotations would be great:

    {
      "isTrue":false:Boolean,
      "id":"123e4567-e89b-12d3-a456-426614174000":UUID
    }


Sounds like your issue is that UUID is NOT a string, but a 128-bit integer?


>THERE SHOULD NOT BE EMOJI IN MY FUCKING UUID

thanks for the lolz


Still early, but here's my baby I hope can improve things:

website with grammar spec: https://tree-annotation.org/

prototype of a JSON/YAML alternative for JS: https://github.com/tree-annotation/tao-data-js

same thing, even less finished for C#: https://github.com/tree-annotation/tao-data-csharp

working on it constantly, more to come soon


XML and XML Schema solved this more than 20 years ago. It had to be replaced with JSON by the web developers though, so they could just “eval() it” to get their data.


XML with RelaxNG (https://relaxng.org/) would have made life so much better than using XML Schema, but, as they say, that ship has long since sailed.


All except the easily written by humans part. Which is kind of a key part.


If all the smart people like you used XML, how come it was so painful to use and it died?


Because it offered all the things the parent mentioned, but that made it too complex. You either provide a schema and get the conveniences of describing your data, or you don't.

I had a chance to use SOAP at one point. It was an F5 device and I used a Python library. What I really liked is that when the library connected to the device, it downloaded its schema and then used that to generate an object. At that point you just communicated with the device like you did with any object in Python.

We abandoned it for inferior technologies like REST and JSON because XML was harder to use from JS, as the parent mentioned.


Parent didn't say it was harder to use from JS. Parent said "It had to be replaced with JSON by the web developers though, so they could just “eval() it” to get their data."

First of all, I was there 20 years ago. I had to deal with XML, XSLT, one kind of Java XML parsers that didn't fully do what I needed, another kind of Java XML parsers that didn't fully do what I needed. And oh boy was it a pain. I just wanted to get a few properties of a bunch of entities in a bigger XML document, that's all. Big fail.

Second, JSON always had a parser in JS, so I don't know where that eval nonsense is coming from.

Third, JS actually had the best dev UX for XML of all languages 20 years ago. Maybe you know JavaScript from Node.js, but 20 years ago it used to run exclusively in web browsers, which even then were pretty good at parsing XML documents. The browser of course had a JS DOM traversal API known to every single JS developer, and very soon (although TBH I can't remember if before or after JSON) it also had XPath querying functions, all built in.

XML was so bad that its replacement came from the language where it was actually easiest to use. Think about that for a second.

So the answer to the question "Why was XML replaced?" is not "Because webdevs lol".

I suspect it was because it has both content and attributes, which all but guarantees you can't map it onto a bunch of simple, common data structures (the way JSON does).


> Second, JSON always had a parser in JS, so I don't know where that eval nonsense is coming from.

Firstly, it sounds like XML ran over your dog or something. Sorry to hear about that. It wasn’t particularly hard to use at all, and if you’re dealing with the possibility of emojis in your JSON UUIDs in 2021, one might even say it’s easier to use.

If you’re referring to JSON.parse() in “had a parser” above, then you have a temporal problem. Regarding eval(), it’s suggested right in the original RFC for JSON. Check it out. Web developers at the time were following that advice.


Another issue is that due to their age, a lot of XML tools ignore the existence of Unicode (or UTF-8).


> The world desperately needs a replacement for YAML.

The world desperately needs support for YAML 1.2, which solves the problems the article addresses fairly completely (largely in the “default” Core schema[0], but more completely with the support for schemas in general), plus a bunch of others, and has for more than a decade. But YAML 1.2 libraries aren’t available for most languages.

[0] not actually an official default, but reflects a cleanup of the YAML 1.1 behavior without optional types, so it's default-ish. Back when it looked like YAML 1.3 might happen in some reasonably-near future, it was actually indicated by team members that the JSON schema for YAML (not to be confused with the JSON Schema spec) would be the explicit default YAML schema in 1.3, which has a lot to recommend it.
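
For Python, ruamel.yaml (mentioned further down the thread) follows YAML 1.2 by default, so the article's example behaves; a sketch contrasting it with PyYAML's 1.1 behavior (both libraries are real; exact reprs may vary by version):

    import yaml                   # PyYAML: YAML 1.1 resolution rules
    from ruamel.yaml import YAML  # ruamel.yaml: YAML 1.2 by default

    doc = "countries: [GB, IE, FR, DE, NO]\n"

    print(yaml.safe_load(doc))
    # {'countries': ['GB', 'IE', 'FR', 'DE', False]}  -- 'NO' becomes a boolean

    data = YAML().load(doc)
    print(list(data["countries"]))
    # ['GB', 'IE', 'FR', 'DE', 'NO']  -- stays a string under YAML 1.2 rules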


Nope nope nope. YAML is awful and needs to die. The more you look at it the worse it gets. The basic functionality is elegant (at least until you consider stuff like The Norway Problem), but the advanced parts of YAML are batshit insane.


“The Norway Problem" is a YAML 1.1 problem, of which there are many.

What advanced parts of YAML are you talking about that remain problems in YAML 1.2?


From the article:

> The most tragic aspect of this bug, howevere, is that it is intended behavior according to the YAML 2.0 specification.


The article is simply, factually wrong; there is no “YAML 2.0 specification” [0], and everything they point to is YAML 1.1, and addressed in YAML 1.2 (the most recent YAML spec, from 2009.)

[0] https://yaml.org/


You seem pretty quick to disregard TOML. I switched all my JSON and YAML for TOML. Do you care to detail what is missing?


TOML quickly breaks down with lots of nested arrays of objects. For example:

    a:
      b:
      - c: 1
      - d:
        - e: 2
        - f:
            g: 3
Turns into this, which is unreadable:

    [[a.b]]
    c = 1

    [[a.b]]
    [[a.b.d]]
    e = 2

    [[a.b.d]]
    [a.b.d.f]
    g = 3

TOML also has a few restrictions, such as not supporting mixed-type arrays like [1, "hello", true], or arrays at the root of the data. JSON can represent any TOML value (as far as I know), but TOML cannot represent every JSON value.

At my company we use YAML a lot for table-driven tests (e.g. [1]), and this not only means lots of nested arrays, but also having to represent pure data (i.e. the expected output of a test), which requires a format that supports encoding arbitrary "pure" data structures of arrays, numbers, strings, booleans, and objects.

[1] https://github.com/sanity-io/groq-test-suite/


Looks fine to me:

    [[a.b]]
    c = 1
    d = [
       { e = 2 },
       { f = { g = 3 } }
    ]


An improvement, but the original YAML is still significantly better, in my opinion.


Also many (most? all?) serializers don't let you control which fields are serialized inline vs not. So if you have a program that generates configuration, you're going to end up with the original unreadable form anyway.


S-expressions are super easy to parse and are fairly easy for humans to read. See e.g. using s-expressions in OCaml: https://dev.realworldocaml.org/data-serialization.html


Apropos of this, in Clojure-land the idiomatic serialization is, EDN [1], which is pretty ergonomic to work with IMO, since in most cases it is the same as a data-literal in Clojure.

My feeling is that :keywords reduce the need and temptation to conflate strings and boolean/enumerations that occurs when there's no clear way to convey or distinguish between a string of data and a unique named 'symbol'. I miss them when I'm in Pythonland.

[1] https://www.compoundtheory.com/clojure-edn-walkthrough/


S-expressions inherit all the trouble with data types from JSON (dates, times, booleans, integer size, number vs numeric string).

You get neat ways of nesting data, but that is not enough for a robust and mistake-resilient configuration language.

The problem isn't parsing in itself. The problem is having clear semantics, without devolving into full SGML DTDs (or worse still, XML schemas).


> S-expressions inherits all trouble with data types from json (dates, times, booleans, integer size, number vs numeric string).

Hm, not sure that's true. S-expressions would only define the "shape" of how you're defining something, not the semantics. EDN https://github.com/edn-format/edn is for all purposes S-expressions and has support for custom literals and more, to avoid "the trouble with data types from JSON".


Yes, EDN is S-expressions plus a bunch of semantic rules. Parsing EDN is quite a bit more complex than just parsing S-expressions, just because you need to support a bunch of built in types, as well as arbitrary exensions through 'tags'.

The tag system is quite brilliant though.


I’ve used most of the technologies you listed. Cue is the best, and the only one with strong theoretical foundations. I’ve been using it for some time now and won’t go back to the others.


Jsonnet hasn't taken off because it's Turing complete. It's a really great language for generating JSON but not a replacement for JSON.


> The world desperately needs a replacement for YAML.

For situations like TFA you really want a configuration language that behaves exactly like you think it will, and since you don't have to interop with other organizations you don't really need a global standard.

Moreover, broadly used config languages can be somewhat counterproductive to that goal. Take JSON as an example; idiomatic JSON serdes in multiple programming languages has discrepancies in minint, maxfloat, datetime, timezone, round-tripping, max depth, and all kinds of other nuanced issues. Existing tooling is nice when it does what you expect, but for a no-frills, no-surprises configuration language I would almost always just prefer to use the programming language itself or otherwise write a parser if that doesn't suffice (e.g., in multilingual projects).

Mildly off-topic: The problem here, more or less, was that the configuration change didn't have the desired effect on an in-memory representation of that configuration. We can mitigate that at the language level, but as a sanity check it's also a good idea to just diff the in-memory objects and make sure the change looks kind of like what you'd expect.


You don't need wide adoption for internal projects in an organization, but you do want great toolchain support.

For example, the fact that NestedText is a Python library means a Python team could use it, but it's a poor fit for an organization whose other teams use Go and JavaScript/TypeScript.

We use YAML for much more than configuration, by the way. I feel like YAML hits a nice sweet spot where it's usable for almost everything.


> and since you don't have to interop with other organizations

Until you have to, and all hell breaks loose?

Now, the example of codepages maybe isn't really appropriate to companies, but it is still a good enough metaphor?


I don't think YAML is going anywhere, largely because it was the first format to prioritize readability and conciseness, and has used that advantage to achieve critical mass.

It's far more productive to push for incremental changes to the YAML spec (or even a fork of it) to make it more sane and better defined. Things like a StrictYAML subset mode for parsers in other popular languages.


> It's far more productive to push for incremental changes to the YAML spec

The problems this article raises and strictyaml purports to address were addressed in YAML 1.2, already supported in Python via ruamel.yaml. YAML 1.2 addresses much of this in the Core schema, which is the closest successor to the default behavior of earlier spec versions, and does so more completely in its support for schemas more generally, which define both the supported “built-in" tags (roughly, types) and how they are matched from the low-level representation, which consists only of strings, sequences, and maps (incidentally, the only three tags of the “Failsafe” schema; there’s also a “JSON” schema between Failsafe and Core, which has tags corresponding to the types supported by JSON).


JSON5 is the best option currently. A fair number of tools in the JS ecosystem support it.


JSON5 is better than JSON on my points, but it has downsides compared to YAML. For example, YAML is very good at multiline strings that don't require any sort of quoting, and knows to remove preceding indentation:

  foo: |
    "This is a string that goes across
    multiple lines," he wrote.
   
In JSON5, you'd have to write:

  {
    foo: "\"This is a string that goes across \
  multiple lines,\" he wrote."
  }
This sort of ergonomic approach is why YAML is so well-liked, I think. (Granted, YAML's use of obscure Perl-like sigils to indicate whitespace mode is annoying, but it does cover a lot of situations.)

YAML is also great at arrays, mimicking how you'd write a list in plaintext:

  foo:
  - "hello"
  - 42
  - true


You might look at JSON Next variants (if you remember - "classic" JSON is a subset of YAML), see https://github.com/json-next/awesome-json-next

My own little JSON Next entry / format is called JSON 1.1 or JSONX, that is, JSON with eXtensions, see https://json-next.github.io


The list is missing http://www.relaxedjson.org/

Also, there's no explanation what <..-..> and <..+..> do.


Also RON: https://github.com/ron-rs/ron

A bit like JSON5, but I believe even more advanced.


I will keep using YAML because I don't want to learn the pitfalls of your alternatives. With YAML everyone is complaining about the pitfalls, and therefore everyone is aware of them. A random replacement may not have this particular problem, but it may have other problems that remain unknown.


Thanks for this list, I’ve never heard of Ion. I’ll consider it for config and even replacing Avro & Protobuf in future projects.


Besides this issue, what's wrong with YAML?


YAML had a worse example, once.

For the ease of entering time units, YAML 1.1 parsed any colon-separated sequence of digit groups as a number in sexagesimal (base 60). So 1:11:00 would parse to the integer 4260, as in 1 hour and 11 minutes equals 4260 seconds.

Now try plugging MAC addresses into that parser.

The most annoying part is that the MAC addresses would only be mis-parsed if there were no hex letters in the string. Like the bug in this post, it could only be reproduced with specific values.

Generally, if you're doing implicit typing, you need to keep the number of cases as low as possible, and preferably error out in case of ambiguity.
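
Both misfires are easy to reproduce with PyYAML, which still follows YAML 1.1 (the MAC address below is made up):

    import yaml  # PyYAML, YAML 1.1: colon-separated digit groups are sexagesimal ints

    print(yaml.safe_load("duration: 1:11:00"))
    # {'duration': 4260}  -- 1*3600 + 11*60 seconds, not a string

    print(yaml.safe_load("mac: 10:23:45:12:11:45"))
    # {'mac': 8083843905}  -- an all-decimal MAC parses as one big base-60 integer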


> For the ease of entering time units, YAML 1.1 parsed any colon-separated sequence of digit groups as a number in sexagesimal (base 60).

This is a mind-boggling level of idiocy. Even leaving aside the MAC address problem, this conversion treats "11:15" (= 675) differently from "11:15:00" (= 40500), even though those denote the same time, while treating "00:15:00" (15 minutes past midnight) and "15:00" (3 in the afternoon) the same.


You know you've fucked up when you have to remove features from the spec (which they did in YAML 1.2).


On the other hand, you know that you did well, when a direct competitor would look exactly the same minus some undesired features.


> YAML had a worse example, once.

It had it literally at the same time as it had the problem in the article (the article refers to YAML 2.0, a nonexistent spec, and to PyYAML, a real parser which supports only YAML 1.1).

Both the unquoted-YES/NO-as-boolean and sexagesimal literals were removed in YAML 1.2. (As was the 0-prefixed-number-as-octal mentioned in a sibling comment.)


One that really surprised/confused me was that PyYAML (following the YAML 1.1 spec) attempts to interpret any 0-prefixed string as an octal number.

There was a list of AWS Account IDs that parsed just fine until someone added one that started with a 0 and had no digits greater than 7 in it, after which our parser started spitting out decidedly different values than we were expecting. Fixing it was easy, but figuring out what in the heck was going on took some digging.
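
A sketch of the same trap with PyYAML (the account IDs are made up):

    import yaml  # YAML 1.1: a plain scalar of 0 followed by digits 0-7 is octal

    print(yaml.safe_load("account_id: 012345670"))
    # {'account_id': 2739128}  -- silently reinterpreted as base 8

    print(yaml.safe_load("account_id: '012345670'"))
    # {'account_id': '012345670'}  -- quoting keeps it a string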


We had a Grafana dashboard where one of the columns was a short Git hash. One day, a commit got the hash `89e2520`, which Grafana's frontend helpfully decided to display as "+infinity". Presumably it was parsing 89E+2520.
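
The mechanics are easy to reproduce in Python; whether Grafana's frontend does exactly this is a guess, but the overflow itself is real:

    commit = "89e2520"  # a short git hash that happens to look like scientific notation

    print(float(commit))    # inf -- 89 * 10**2520 overflows an IEEE 754 double to +infinity
    print(int(commit, 16))  # 144581920 -- what was actually meant: hex, or better, just text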


Ha, that reminds me of some work I was doing just yesterday, implementing a custom dictionary for a postgres full text search index. Postgres has a number of mappings that you can specify, and it picks which one based on a guess of what the data represents. I got bit by a string token in this same format, because it got interpreted as an exponential number.


Sounds like the core issue is that a hexadecimal number was encoded as a string?


slightly related, on my microwave 99 > 100, even 61 > 100


I try to optimize my microwave button pushing too. I also have a +30 seconds button, so for 1:30 I can hit "1,3,0,Start" or "+30" three times and save a press!


Why does your microwave compare numbers?


It doesn’t compare them, it just counts down.

If I enter 1-3-0-start, I get 90 seconds of cooking. If I enter 9-9-start, I get 99 seconds of cooking, so in that sense, 99 > 130.

If I want about 90 seconds, I’ll use 88 as it’s faster to enter (fewer finger movements).


I've done the same thing for decades! Soul mates?


You might like this one as well.

Load soap into the dishwasher after emptying rather than after loading. If the soap dispenser is closed, the dishes are dirty.


My rule is that loading the dishwasher means that one loads all the available dishes, and runs it, even if it's only x% full. We use the (large) sink as an input buffer.

If the dishwasher has dishes in it and it's not running, they're clean.


This is exactly our algorithm as well. I can't really imagine flipping it the other way, since leaving dirty dishes in a dishwasher will just let them completely dry out, making it more likely they won't get fully clean when the cycle is eventually run.


Rinse until visually clean, then put in dishwasher.


This doubles the time required to do the dishes, defeating much of the purpose of the dishwasher.


Idk, to me it's not about time but effort. Rinsing is just pleasant.


That’s not a zero-copy algorithm. The algorithm that uses the closed soap dispenser as a flag is zero-copy.


I want to have two dishwashers. One with the dirty dishes and one with the clean dishes. So you never have to put the dishes away. They go from the clean dishwasher to the table to the dirty one. And then flip them.


This idea comes up periodically on Reddit. [0] has a few posts from people who have installed them, mostly for bachelors.

[0] https://www.reddit.com/r/self/comments/ayr9c/when_im_rich_im...


There’s a community near here with a high fraction of Orthodox Jews. One condo I toured in my 20s had two dishwashers and without thinking about why they did it, I commented how I thought that was awesome that you’d never need to put dishes away. (They of course installed two dishwashers for orthodox separation of dishes from each other.)


Blasphemy! I do the inverse. You're wrong. /s

insert code flame war here


Vi Hart - "How to Microwave Gracefully"

https://www.youtube.com/watch?v=T9E0zSpULFY


Not the OP, but I have the same problem. For some reason that escapes me, pressing the “10 sec” button 7 times produces 00 70 instead of 01 10. If you then press the “1 min” button you get 01 70


Most microwaves (in the USA) do this, at least in my experience.

They treat the ":" like a sum of two sexagesimal numbers, rather than a sexagesimal digit separator.


How else would you prove it's turing complete and can run Doom?


The worst tragedy of this is the security implications of subtly different parsers. As your application surface increases, you're likely to mix languages (and thus different parsers), which means that the same input data will produce different output data depending on whether your parser replaces, truncates, ignores, or otherwise attempts to automatically "fix up" the data. A carefully crafted document could exploit this to trick your data storage layer into storing truncated data that elevates privileges or sets zero cost, while your access control layer that ignores or replaces the data is perfectly happy to let the bad document pass by.

And here's something else to keep you up at night: Just think of how many unintentional land mines lurk in your serialized data, waiting to blow up spectacularly (or even worse, silently) as soon as you attempt to change implementation technologies!

This is why I've been so anal about consistent decoder behavior in Concise Encoding https://github.com/kstenerud/concise-encoding/blob/master/ce...

https://concise-encoding.org/


This is exactly why configuration/serialization formats should make as few assumptions about value types as possible. Once parsing's done, everything should be a string (or possibly a symbol/atom, if the program ingesting such a file supports those), and it should be up to the application to convert values to the types it expects. This is Tcl's approach, and it's about as sensible as it gets.

...which is why it pains me to admit that in my own project for a Tcl-like scripting/config language[1] I missed the float v. string issue, so it'll currently "cleverly" return different types for 1.2 (float) v. 1.2.3 (atom). Coincidentally, I started work on a "stringy" alternative interpreter that hews closer to Tcl's philosophy (to fix a separate issue - namely, to avoid dynamically generating atoms, and therefore avoid crashing the Erlang VM when given potentially-adversarial input), so I'm gonna fix that case for at least the "stringy" mode (by emitting strings instead of numbers, too), knocking out two birds with one stone for the upcoming 0.3.0 release :)

----

[1]: https://otpcl.github.io, for those curious


It’s reasons like this that I want my configuration languages to be explicit and unambiguous. This is why I use JSON or if I want a human friendly format, TOML. Strings are always “quoted” and numbers are always unquoted 1.2, it can never accidentally parse one as the other. The convenience of omitting quotes is just not worth the potential for ambiguity or edge cases to me.


> Once parsing's done, everything should be a string

Or give a schema to the parser, defining what type is expected in each field.


Yes, that looks like the right way to handle this problem without ignoring the YAML spec. Define what to parse up front.
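
That is essentially what the StrictYAML library behind the article does; a small sketch using its schema API (real library, hypothetical config keys):

    from strictyaml import load, Map, Str, Int

    schema = Map({"country": Str(), "port": Int()})
    doc = "country: NO\nport: 8080\n"

    config = load(doc, schema)
    print(config["country"].data)  # 'NO'  -- stays a string because the schema says so
    print(config["port"].data)     # 8080  -- converted to int because the schema asked for one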


> The most tragic aspect of this bug, howevere, is that it is intended behavior according to the YAML 2.0 specification.

This is one of those great ideas that sadly one needs experience to realize are really bad ideas. Every new generation of programmers has to relearn it.

Other bad ideas that resurface constantly:

1. implicit declaration of variables

2. don't really need a ; as a statement terminator

3. assert should not abort because one can recover from assert failures


I agree with the general observation, but the need for ";" ? Quite a few languages (over a few generations) have been doing fine without the semicolon. Just to mention two: python and haskell. (Yes, python has the semicolon but you'll only ever use it to put multiple statements on a single line.)


> Yes, python has the semicolon but you'll only ever use it to put multiple statements on a single line.

This is also true of Haskell btw.


Another interesting example is Lua. It's a free-form language without semicolons. It's not indentation sensitive.


Lua does have semicolons!

It even has semicolon insertion, but because the language is carefully designed, this doesn't cause problems, and most users can go a lifetime without knowing about it.

Our coding style requires semicolons for uninitialized variables, so you'll see

    local x;
    if flag then
       x = 12 
    else
       x = 24
    end

As a way of marking that the lack of initialization is deliberate. `local x = nil` is used only if x might remain nil.


I don't like saying that it's semicolon insertion because it might give people the idea that the semicolons work similarly to JavaScript. In Lua, inserting a semicolon is always optional and it's a stylistic matter (like in your example). It even allows putting multiple statements on the same line without a semicolon.

    -- Two assignment statements
    x = 10 y = 20


> I agree with the general observation, but the need for ";" ? Quite a few languages (over a few generations) have been doing fine without the semicolon. Just to mention two: python and haskell. (Yes, python has the semicolon but you'll only ever use it to put multiple statements on a single line.)

But then it's inconsistent and has unnecessary complexity because now there's one (or more) exceptions to the rules to remember: when the ';' is needed. And of course if you get it wrong you'll only discover it at runtime.

"Consistent applications of a general rule" is preferable to "An easier general rule but with exceptions to the rule".


Have you ever used Python? If you did you really wouldn't be saying this. There isn't an exception. The semicolon is used to put multiple statements on a single line. That's its only use, and that's the only time it's 'needed' - no exceptions.


But python has instead the "insert \ sometimes" rule, which isn't better.


> Have you ever used Python? If you did you really wouldn't be saying this. There isn't an exception.

For the ';', perhaps not. For the token that is used to terminate (or separate) statements? Yes, the ';' is an exception to the general rule of how to terminate statements.

The semicolon also works on some sorts of statements and not others, throwing errors only at runtime.

It's easier to remember one rule than many.


Honestly, the rule is "don't use semicolons in Python". I don't think there's a single one in the large codebase I work with, and there's really no reason at all to use it other than maybe playing code golf.

It's not a language in which you ever need be saving bytes on the source code. Just use a new line and indent. It's more readable and easier.


There are no exceptions. You only need it if/when you want to put multiple statements on a single line. That's its sole purpose.

And I'd also add that it's something that you almost never do. One practical use is writing single line scripts that you pass to the interpreter on the command line. E.g. `python -c 'print("first command"); print("second command")'`

If you don't know about the `;` at all in python then you are 100% fine.


When you use ; and possibly {, }, code statements / blocks are specified redundantly (indentation + separators), which can cause inconsistent interpretation of code by compiler / readers.

I find it much, much easier to look at code and parse blocks via indentation, than the many ways and exceptions of writing ; and {, }, while an extra or missing ';' or {} easily remains unspotted and leads to silly CVEs.


Haskell has the semicolon for the same reason!


> implicit declaration of variables

This is so true. I really like Julia and I know that explicitly declaring variables would be detrimental to adoption but I prefer it to the alternative, which is this: https://docs.julialang.org/en/v1/manual/variables-and-scopin...


What do you think of implicit member access (C++, Java, C#) vs explicit (Python, JavaScript)? Is there a concrete argument one way or the other?

I feel like I prefer explicit

    self.member = value
    this.member = value
vs implicit

    member = value
But clearly C++/Java/C# people are happy with implicit ... though many of them try to make it explicit by using a naming convention.


That was my single biggest pet-peeve of C++. A variable appears in the middle of a member function? Good luck figuring out what owns it. Is it local? Owned by the class? The super-class? (And in that case - which one?)

The added mental load of tracking variables' sources builds up.


FWIW, most C++ style guides recommend writing member variables like mVariableName or variable_name_ so they're easy to distinguish from local variables, and modern C++ doesn't generally make much use of inheritance so there's usually only one class it could belong to.


The fact that people introduce naming conventions to keep track of member variables is probably the biggest condemnation of implicit member access. People clearly need to know this, so you'd better make it explicit.

It's actually a bit surprising that this is one thing that javascript does better than Java. In most other areas, it's Java that's (sometimes overly) explicit.


I can tell for certain that as a JS/Python man, every time I look through Java code I have to spend a bit of time when stumbling upon such access, until I remember that it's a thing in Java. Pity that Kotlin apparently inherited it.

But at least, to my knowledge, in Java these things can't turn out to be global vars. Having this ‘feature’ in JS or Python would be quite a pain in the butt.


F#, Kotlin, Python, Nim and many others all seem to get by fine without semicolons as statement terminators.


In Python, a newline is a token and serves as a statement terminator.

What I'm referring to is the notion that:

    a = b c = d;
can be successfully parsed with no ; between b and c. This is true, it can be. But then it makes errors difficult to detect, such as:

    a = b
    *p;
Is that one statement or two?


> This is one of those great ideas that sadly one needs experience to realize are really bad ideas. Every new generation of programmers has to relearn it.

It's a bad idea because ASCII already includes dedicated characters for field separator, record separator and so on. These could easily be made displayable in a text editor if you wanted just as you can display newlines as ↲. Anyone who invents a format that involves using normal printable characters as delimiters and escaping them when you need them, is, I feel very confident in saying, grotesquely and malevolently incompetent and should be barred from writing software for life. CSV, JSON, XML, YAML, all guilty.
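
For what it's worth, a minimal sketch of what using the ASCII unit (0x1F) and record (0x1E) separators looks like in practice; whether your editor displays them nicely is another matter:

    US, RS = "\x1f", "\x1e"  # ASCII unit separator, record separator

    rows = [["Norway", "NO", "Oslo"], ["Ireland", "IE", "Dublin"]]

    # Encode: no quoting or escaping needed, since these bytes shouldn't appear in ordinary text.
    blob = RS.join(US.join(fields) for fields in rows)

    # Decode.
    print([record.split(US) for record in blob.split(RS)])
    # [['Norway', 'NO', 'Oslo'], ['Ireland', 'IE', 'Dublin']]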


The obvious first step toward the brighter future is to refrain from using any and all software that utilizes the malevolent formats you mentioned. Doing otherwise would mean simply being untrue to one's own conscience and word.


> It's a bad idea because ASCII already includes dedicated characters for field separator, record separator and so on.

ASCII is over 60 years old and separators haven't caught on yet; what's different now?

> These could easily be made displayable in a text editor if you wanted just as you can display newlines as ↲.

Can you name a common text editor with support for ASCII separators? It's a lot easier to use delimiters and escaping than to change every text editor in the world.

> Anyone who invents a format that involves using normal printable characters as delimiters and escaping them when you need them, is, I feel very confident in saying, grotesquely and malevolently incompetent and should be barred from writing software for life. CSV, JSON, XML, YAML, all guilty.

All of the formats you rant about are widely used, well supported, and easy to edit with a text editor - none of these are true of ASCII separators. People chose formats they can edit today instead of formats they might be able to edit in the future. All of these formats have some issues but none of the designers were incompetent.


US-ASCII only has four information separators, and I believe they can only be used in a four-layer schema with no recursion, sort of like CSV (if your keyboard didn’t have a comma or quote or return key). When you need to pass an object with records of fields inside a field you’re out of luck, and everyone has to agree on quoting or encoding or escaping again.

I think SGML (roll your own delimiters and nesting) was pretty close to the Right Thing,™ but ISO has the specs locked down so everyone had a second-hand understanding of it.


how do you write them though


Ctrl-\, Ctrl-], Ctrl-^ and Ctrl-_ for file, group, record and unit separator, respectively.

However, your tty driver, terminal or program are all likely to eat them or munge them. Also, virtually nothing actually uses these characters for these purposes.


> virtually nothing actually uses these characters for these purposes.

Right. Which is why we have all these hilarious escaping and interpolation problems. And why programmers will never be taken seriously by real engineers. It's like we have cement mixed and ready to go but we decide to go and forage for mud instead and think that makes us cleverer than the cement guys.


> your tty driver, terminal or program are all likely to eat them or munge them

Maybe that has something to do with this?


I’m surprised that with your experience you come to such unbalanced conclusions. Everything in engineering is about trade-offs, and while your conclusions may be indisputable for the design goals of D, they may be wrong in other contexts.

1. If I scribble some one time code etc., the probability of having an error coming from implicit declarations is, for most people, in the same order of magnitude as missing out edge cases or not getting the algorithm right. The extra convenience may well be worth it.

2. I would relax this: it should be clear to the programmer where a statement ends.

3. Go on with a warning is a sane strategy in some situations. I'd happily ruin my car engine to drive out of the desert. The assert might have been too strict and I know something about the data, so the program can ignore the assert failure.


> 1. If I scribble some one time code

.... and here is another entry for Walter's list of bad ideas:

4. "It's okay. I will use this code only once"


My favorite Red Green quote is “now, this is only temporary … unless it works.”


Your rationale in this and your followups is exactly what I'm talking about.

1. You're actually right if the entire program is less than about 20 lines. But bad programs always grow, and implicit declaration will inevitably lead you to have a bug which is really hard to find.

2. The trouble comes from programmer typos that turn out to be real syntax, so the compiler doesn't complain, and people tend to be blind to such mistakes so don't see it. My favorite actual real life C example:

    for (i = 0; i < 10; ++i);
    {
        do_something();
    }
My friend who coded this is an excellent, experienced programmer. He lost a day trying to debug this, and came to me sure it was a compiler bug. I pointed to the spurious ; and he just laughed.

(I incorporated this lesson into D's design, spurious ; produce a compiler error.)

3. I used to work for Boeing on flight critical systems, so I speak about how these things are really designed. Critical systems always have a backup. An assert fail means the system is in an unknown, unanticipated state, and cannot be relied on. It is shut down and the backup is engaged. The proof of this working is how incredibly safe air travel is.


> 3. I used to work for Boeing on flight critical systems, so I speak about how these things are really designed. Critical systems always have a backup. An assert fail means the system is in an unknown, unanticipated state, and cannot be relied on. It is shut down and the backup is engaged.

I ask you to reconsider your assumptions. How did this play out in the 737 MAX crashes? Was there a backup AoA sensor? Did MCAS properly shut down and backup engaged? Was manual overriding the system not vital knowledge to the crew?

You don’t have to answer. I probably wouldn’t get it anyway.

But rest assured that I won’t try to program flight control and I strongly appreciate your strive for better software.


> How did this play out in the 737 MAX crashes?

They didn't follow the rule in the MCAS design that a single point of failure cannot lead to a crash.

> Was manual overriding the system not vital knowledge to the crew?

It was, and if the crew followed the procedure they wouldn't have crashed.


I disagree with most of what you said but I want to specifically call out:

> 3. Go on with a warning is a sane strategy in some situations.

No, if it's sometimes OK to continue, then you should not assert it.

Assert means "I assert this will always be true, and if it's not our runtime is in unknown/bad state."

If you think you can recover, or partially recover, throw/return appropriate error, and go into emergency/recovery mode.
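
A minimal sketch of that distinction in Python (names and scenario made up, just to illustrate): the assert documents an invariant that should never be false, while the raise signals a condition the caller is expected to handle.

    def apply_discount(price: float, fraction: float) -> float:
        # Invariant: callers are required to pass a fraction in [0, 1].
        # If this fires, the program is in an unknown/bad state and
        # stopping is the point.
        assert 0.0 <= fraction <= 1.0, "discount fraction out of range"

        if price < 0:
            # Expected, recoverable bad input: signal it explicitly so the
            # caller can reject the record and carry on, rather than dying
            # on an assert.
            raise ValueError("price cannot be negative")
        return price * (1.0 - fraction)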


Your reactor is boiling. Your control software shut down with assertion failed: temperature too high, cannot display more than 3 digits.

Downvote me if you want to open a bug ticket with the vendor and wait a week for the fix.

Upvote me if you’d give it a try to restart with a switch to ignore assertions.

You may abstain if you never shipped a bug.

Edit: not to forget that this website runs on lisp which violates all three. Was it really a bad choice for the website?


> Your reactor is boiling. Your control software shut down with assertion failed: temperature too high, cannot display more than 3 digits.

Several points:

1. Most such critical components have several different and independent implementations, with an analog backup (if possible).

2. You are arguing that one specific safety-critical case, which 99.999% or even more of programmers will never face, should somehow inform decisions about a general-purpose programming language.

3. Even if you are working in such a safety-critical situation, you should not rely on an assertion bypass, but have a separate emergency procedure which bypasses all the checks and tries to force the issue (ever seen a --force flag?).

Because what happens in reality is: a developer encounters a bug (maybe while it's still in development), notices it can be bypassed by disabling assertions (or they are disabled by default), and logs it as a low-priority bug that never gets fixed.

Then a decade later, I or someone like me is cursing you because your enterprise app just shit the bed and is generating tons of assertion warnings even when it's running normally, so I have to figure out which of them are "just normal" program flow and which one just caused an outage.

I have never experienced a situation like the one you described, but I have experienced the behavior I wrote above too many times.

Bottom line:

- don't assert if you don't mean it

- if you need a bypass for various runtime checks, code one in explicitly.

Edit: Hacker News is written in Arc, which is a Scheme dialect. Arc doesn't have assertions as far as I can tell.

Arc doesn't have its own runtime and runs on the Racket language, which has optional assertions that exit the runtime if they fail: https://docs.racket-lang.org/ts-reference/Utilities.html


I agree with this. Nuclear reactors are a special case of systems where removing energy from the system makes it more unsafe, because it generates its own energy and without a control system it will generate so much energy that it destroys itself (and due to the nature of radiation, destroys the surrounding suburbs too).

With most systems, the safest state is off. CNC machine making a weird noise? Smash that e-stop. Computer overheating? Unplug it. With this in mind, "assert" transitions the system from an undefined state to an inoperative state, which is safer.

That isn't to say that you want bugs in your code, or that de-energizing a system is free of consequences. Your emergency stop of your mill just scrapped a $10,000 part. Unplugging your server made your website go down and you lost a million dollars in revenue. But it didn't kill someone or burn the building down, so that's nice.


Modern nuclear reactors are designed and built with the expectation that when they melt down, the results aren't catastrophic (at least for the outside world).


See my previous reply. Your reactor design is susceptible to a single point of failure, and, how do I say it strongly enough, is an utterly incompetent design. Bypassing assertions is not the answer.


If it ignores part of the spec, I don't think "strictyaml" is the correct name here. Instead, if it interprets everything as string, perhaps "stringyaml" would have been more accurate, though I'm sure that's not as good PR.

I'm reminded of the discussion we had a few days ago about environment variables; one problem there is that env variables are always strings, and sometimes you do want different types in your config. But clearly having the system automatically interpret whether it's a string or something else is a major source of bugs. Maybe having an explicit definition of which field should be which type would help, but then you end up with the heavy-handed XML with its XSD schema.

Or you just use JSON, which is light-weight, easy to read, but unambiguous about its types. I guess there's a good reason it's so popular.

Maybe other systems like yaml and environment variables should only ever be used for strings, and not for anything else, and I suppose replacing regular yaml with 'strictyaml' could play a role there. Or cause unending confusion, because it does violate the spec.


> JSON, which is [...] unambiguous about its types

With the one exception that for floating-point values the precision is not specified in the JSON spec and thus is implementation-defined[1], which may lead to its own issues and corner cases. It is for sure better than YAML's 'NO' problem, but depending on your needs JSON may have issues as well.

[1]: https://stackoverflow.com/questions/35709595/why-would-you-u...


Also JSON's complete lack of many commonly used types, and no way to define any new ones.


Isn't that a problem with most of these config languages, though? XML is the only one where I think this might be possible.


Allowing you to define types is quite uncommon, but many config languages allow more types than JSON (so more than boolean, number, string, list, dict). Date datatypes are a big one and are provided by about every second JSON variant, in addition to TOML, ION and others.


>If it ignores part of the spec, I don't think "strictyaml" is the correct name here.

The article didn't fully explain it but strictyaml requires a typed schema or defaults to string (or list or dict) if one is not provided. So it strictly follows the provided schema.
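
If I'm reading the strictyaml docs right, usage looks roughly like this (treat it as a sketch, not gospel):

    from strictyaml import load, Map, Str, Bool

    doc = "country: NO\nenabled: yes\n"

    print(load(doc).data)
    # {'country': 'NO', 'enabled': 'yes'}  -- no schema: everything is a string

    schema = Map({"country": Str(), "enabled": Bool()})
    print(load(doc, schema).data)
    # {'country': 'NO', 'enabled': True}   -- types only where the schema asks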


That makes a big difference indeed. It wasn't clear to me from the article, but string yaml + optional schema sounds like a useful combination.


“saneyaml” would not make for bad PR


I was helping out a friend of mine in the risk department of a Big 4; he was parsing CSV data from a client's portfolio. Once he started parsing it, he was getting random NaNs (pandas' nan type, to be more accurate).

I couldn't get access to the original dataset but the column gave it away. Namibia's 2-letter ISO country code is NA—which happens to be in pandas' default list of NaN equivalent strings.

It was a headache and a half...


Verbatim from the docs, on read-csv:

    na_values : scalar, str, list-like, or dict, default None

    Additional strings to recognize as NA/NaN. If dict passed, specific per-column NA values. By default the following values are interpreted as NaN: ‘’, ‘#N/A’, ‘#N/A N/A’, ‘#NA’, ‘-1.#IND’, ‘-1.#QNAN’, ‘-NaN’, ‘-nan’, ‘1.#IND’, ‘1.#QNAN’, ‘<NA>’, ‘N/A’, ‘NA’, ‘NULL’, ‘NaN’, ‘n/a’, ‘nan’, ‘null’.
You fix it by using `keep_default_na=False`, by the way.
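
Something like this reproduces the Namibia problem and the fix (column names made up):

    import io
    import pandas as pd

    csv = io.StringIO("country,amount\nNA,100\nNO,200\n")
    print(pd.read_csv(csv)["country"].tolist())
    # [nan, 'NO']  -- Namibia silently became NaN

    csv.seek(0)
    print(pd.read_csv(csv, keep_default_na=False)["country"].tolist())
    # ['NA', 'NO']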



That looks like an interesting hard-coded check, I wonder what it intended to fix.


There’s some analysis in this twitter thread: https://twitter.com/badedgecases/status/1368362392573317120

tl;dr: there are a bunch of fields of various types that arrive as strings, and they get coerced but without paying attention to which field should have which type


What I am most baffled by with Yaml is the fact that it’s a superset of JSON.

Whenever an input accepts YAML you can actually pass in JSON there and it’ll be valid

It really surprised me when I found out, and I've used JSON whenever possible since then, since it's much stricter.

https://en.m.wikipedia.org/wiki/JSON#YAML


> Whenever an input accepts YAML you can actually pass in JSON there and it’ll be valid

...unless your parser strictly implements YAML 1.1, in which case you should be careful to add whitespace around commas (and a few other minor things). This is a valid JSON that some YAML parsers will have problems with:

    {"foo":"bar","\/":10e1}
The very first result Google gives me for "yaml parser" is https://yaml-online-parser.appspot.com, which breaks on the backslash-forward slash sequence.


> Whenever an input accepts YAML you can actually pass in JSON there and it’ll be valid

Strictly speaking, this is only true of YAML 1.2, not YAML 1.0-1.1 (the article here addresses YAML 1.1 behavior, the headline example of which was removed in YAML 1.2 twelve years ago), though it calls YAML 1.1 “YAML 2.0”, which doesn’t actually exist.

Of course, there are lots of features, like custom types, that JSON doesn’t support, but you can still use YAML’s JSON-style syntax instead of actual JSON, for them.


Yes this is usually the best way. If you need some features for code reuse there are several preprocessors. I personally use Dhall to configure everything and then convert it to JSON for my application to consume. It is a lot more powerful than YAML and has a very safety-oriented type system.


> it’s equally true that extremely strict type systems require a lot more upfront and the law of diminishing returns applies to type strictness - a cogent answer to the question “why is so little software written in haskell?“

I was with the article up until that point. I don't agree that diminishing returns with regards to type strictness applies linearly. Term-level Haskell is not massively harder than writing most equivalent code in JavaScript — in fact I'd say it's easier and you reap greater benefit. Perhaps it's a different story when you go all-in on type-level programming, but I'm not sure that's what the author was getting at. This smells of the Middle Ground logical fallacy to me. Or of course the comment was tongue-in-cheek and I'm overreacting.


I had to rewrite some JavaScript code in Postgres recently that measured the overlap between different elevation ranges. In JS I had to write it myself and deal with the edge cases and bugs. In Postgres I just used the range type and some operators. It was brilliant in comparison. The tiny effort of learning it was worth it. The list of data types I use all the time is bigger than just strings, numbers and booleans. Serialisation formats should support them. Particularly as there are often text format standards that already exist for a lot of them. Give me WKT geometry and ISO-formatted dates. It's not that difficult and totally worth it.


That law of diminishing returns might actually apply, I am not 100% sure. But more powerful type systems allow for the more complex composition of more complex interfaces in a safe manner. Think of higher-level modules and data structures. Or dependent types and input handling. Or linear types and resource handling.


I agree. I would say that Erlang goes ~80% of the way compared to Haskell's type system and the last 20% really matter, to the point that in many cases I find myself not really using Erlang's (optional) type system at all. Better type coverage and more descriptive types allow the compiler to infer more and I'd say this is the opposite of diminishing returns.


Norwegian here. I’d say the problem is YAML, not Norway :D


That author's blog post sent me down a rabbit hole of insanity with YAML and the PyYAML parser idiosyncrasies.

First, he mentions "YAML 2.0" but there's no such reference to "2.0" on yaml.org or in Google/Bing searches. Yaml.org and Wikipedia say YAML is at 1.2. Apparently the other commenters in this thread clarified that the older "YAML 1.1" is what the author is referring to.

Ok, if we look at the official YAML 1.1 spec[1], it has this excerpt for implicit bool conversions:

   y|Y|yes|Yes|YES|n|N|no|No|NO
  |true|True|TRUE|false|False|FALSE
  |on|On|ON|off|Off|OFF

But the pyyaml code excerpts[2][3] from resolver.py has this:

  u'tag:yaml.org,2002:bool',
  re.compile(ur'''^(?:yes|Yes|YES|n|N|no|No|NO
              |true|True|TRUE|false|False|FALSE
              |on|On|ON|off|Off|OFF)$''', re.X),
The programmer omitted the single character options of 'y' and 'Y' but it still has 'n' and 'N' ?!? The lack of symmetry makes the parser inconsistent.

And btw for trivia... PyYAML also converts strings with leading zeros to numbers like MS Excel: https://stackoverflow.com/questions/54820256/how-to-read-loa...

[1] https://yaml.org/type/bool.html

[2] 2020 latest: https://github.com/yaml/pyyaml/blob/ee37f4653c08fc07aecff69c...

[3] 2006 original : https://github.com/yaml/pyyaml/blob/4c570faa8bc4608609f0e531...
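
For anyone who wants to see it first-hand, a quick demo with PyYAML's default (YAML 1.1-style) resolvers; other parsers and versions may behave differently:

    import yaml

    print(yaml.safe_load("country: NO"))    # {'country': False}
    print(yaml.safe_load("code: 0123"))     # {'code': 83} -- leading zero means octal
    print(yaml.safe_load("country: 'NO'"))  # {'country': 'NO'} -- quoting opts out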


You can catch this with yamllint (https://github.com/adrienverge/yamllint):

    % cat countries.yml 
    ---
    countries:
      - US
      - GB
      - NO
      - FR

    % yamllint countries.yml 
    countries.yml
      5:4       warning  truthy value should be one of [false, true]  (truthy)


YAML seems like a really neat idea, but over time I have come to regard it as being too complicated for me to use for configuration.

My personal favorite is TOML, but I would even prefer plain JSON over YAML.

The last thing I want at 2 AM, when trying to figure out if an outage is due to a configuration change, is having to think about whether each line of my configuration is doing the thing I want.

YAML prizes making data look nicely formatted over simplicity or precision. That, for me, is not a tradeoff I am willing to make.


They all have their downsides.

JSON:

- no comments, unless you fake them with fake properties, unless your configuration has a schema that doesn't allow extra fake properties

- no trailing commas; makes editing more annoying

- no raw strings

YAML:

- the automatic type coercion

- the many ways to encode strings ( https://yaml-multiline.info/ )

- the roulette wheel of whether this particular parser is anal about two-space indentation or accepts anything as long as it's used consistently

- the roulette wheel of whether this particular parser supports uncommon features like anchors

TOML:

- runtime footguns in automated serialization ( https://news.ycombinator.com/item?id=24853386 )

- hard to represent deeply-nested structures, unless you switch to inline tables which are like JSON but just different enough to be annoying


For hand-writing I love jsonnet, which produces JSON, is much more convenient to write, and has some templating, functions etc. https://jsonnet.org/

You wouldn't serialize data structures to jsonnet though, you'd just generate JSON.


This makes me sad. It's 2021 and we still haven't figure out how to serialize configuration in a format that is easy-to-edit and predictable.


This is the problem space I'm targeting with https://concise-encoding.org/

* Text AND binary, so that humans can edit easily and machines can transmit it energy- and bandwidth-efficiently.

* Carefully designed spec to avoid ambiguities (and their security implications).

* Strong type support so you're not using all kinds of incompatible hacks to serialize your data.

* Versioned, because there's no such thing as the perfect format.

* Also, the website is 32k bytes ;-)


+ Has binary format.

+ Avoids ambiguities.

- The format seems to feel the need to support everything, including things I am not sure are actual use cases (what's the point of the Markup element, for example? What does Metadata save us compared to just including it in the document, given that parsers must parse it anyway?). This must make implementations more complex and costly, and makes reading the text format more difficult.

- Not a fan of octal notation. At 3am I'm not sure I wouldn't confuse 0 and o in certain fonts. Does anyone even use it these days?

- Unquoted strings were discussed in the thread; I'd like to point out that it's very easy to make an unquoted string not "text-safe" (according to the spec) without noticing it, at which point the document is invalid.

Just add white-space (maybe a user pasted a string from somewhere without noticing whitespace at the end or forgot the rules), a dot, an exclamation or a question mark. Having surprises like that is IMHO worse than a consistent quoting method.

Basically all the things I don't like are about the format supporting a bit too much. YAML 1.1 should teach us more is sometimes less.


Alright that's two votes against unquoted strings so far (plus my wife agrees so that's three against!)

I put in octal because it was trivial to implement after the others. The canonical format when it's stored or being sent is binary, and a decoder shouldn't be presenting integers in octal (that would just be weird). But a human might want octal when inputting data that will be converted to the binary format.

Markup is for presentation data, UI layouts, etc, but with full type support rather than all the hacky XML+whatever solutions that many UI toolkits are adopting. Also, having presentation data in binary form is nice to have.


Well, unquoted strings work when a format is built for that. If the default was "it's text unless we see the special sequences" it would be better for unquoted strings. But even then there are too many special characters in this format IMHO.

I saw there's a 'Media' type in the spec. It seems the type is actually for serializing files. But there's no "name" (or call it "description") field. Of course we could accomplish this with a separate field - but then again the entire type's functionality could be accomplished with a u8x array and a string field. So if you're specifying this type at all, you might as well add a name field to make it useful.


The media object is for embedding media within a document (an image, a sound, an animation, some bytecode to execute in a sandbox, or whatever). It's not intended to be used as an archive format for storing files (which, as you said, could be trivially accomplished with a byte array for the data, a string for the file name, and some metadata like permissions etc). A file is just one way among many to store media (in this case as an entry in a hierarchical database - the filesystem - keyed by filename). CE is only interested in the media itself, not the database technology.

The media object is a way to embed media data directly into a document such that the receiving end will have some idea of how to deal with it (from its media type). It won't have or need a "file name" because it's not intended to be stored in a filesystem, but rather to be used directly by an application. Yes, it could be built up from the primitives, but then you lose the canonical "media" type, and everyone invents their own incompatible compound types (much like what happened with dates in JSON and XML).


OK, after more discussion and thought:

- I'm removing the metadata type. You're right that it's not really gaining us anything.

- I'm changing strings so they always must be quoted. This actually simplifies a lot of things.

Thanks for the critique!


I'm skimming through the human readable spec, and it seems decent, but I noticed the spec allows unquoted strings. What's the reasoning for this? In my experience unquoted strings cause nothing but trouble, and are confusing to humans who may interpret them as keywords.

Any reason for not using RFC2119 keywords in the spec? Using them should make the spec easier to read.


> I noticed the spec allows unquoted strings. What's the reasoning for this? In my experience unquoted strings cause nothing but trouble, and are confusing to humans who may interpret them as keywords.

Unquoted strings are much nicer for humans to work with. All special keywords and object encodings are prefixed with sigils (@, &, $, #, etc), so any bare text starting with a letter is either a string or an invalid document, and any bare text starting with a numeral is either a number or an invalid document.

> Any reason for not using RFC2119 keywords in the spec? Using them should make the spec easier to read.

I use a superset of those keywords to give more precision in meaning: https://github.com/kstenerud/concise-encoding/blob/master/ce...


If strings are always unambiguously detectable, why allow quoting them at all? Having two representations for the same data means you can't normalize a document unambiguously. I can understand that barewords seem cleaner for things like map keys, but I am not convinced that it's a worthwhile tradeoff.

An important feature of RFC2119 keywords is that they're always capitalized (ie. the keyword is "MUST", not "Must", or "must"). This makes requirements and recommendations stand out amid explanatory text, improving legibility. For example, RFC2119 itself uses MUST and must with different meanings.


> If strings are always unambiquously detectable, why allow quoting them at all?

Because strings can contain whitespace and other structural characters that would confuse a parser.

> Having two representations for the same data means you can't normalize a document unambiguously.

The document will always be normalized unambiguously in binary format. The text format is a bit more lenient because humans are involved.

The idea is that the binary format is the source of truth, and is what is used in 90% of situations. The text format is only needed as a conduit for human input, or as a human readable representation of the binary data when you need to see what's going on.

> An important feature of RFC2119 keywords is that they're always capitalized (ie. the keyword is "MUST", not "Must", or "must").

Hmm good point. I'll add that.


Update: I'm removing unquoted strings. Thanks for the critique!


Nice! I like some concepts that this format proposes, but the `@` and `|` modifier feels a bit too "loaded".


It's a compromise; there are only so many letters, numbers, and symbols available in a single keystroke on all keyboards, and I don't want there to be any ambiguity with numbers and unquoted strings (e.g. interpreting the unquoted string value true as the boolean value true).

So everything else needs some kind of initiator and/or container syntax to logically separate it from the other objects when interpreted by a human or machine.


We had such: XML. With proper editor support it is easy. I guess it needs rediscovery /s ;)


I used XML and didn't like it:

- A proper editor was never around.

- Closing tags were verbose.

- Attributes vs tags was confusing.

- It didn't map "naturally" to common data types, like lists, maps, integers, float, etc.


Don't forget about namespaces, another fiddly bit of XML that caused all sorts of problems and headaches.


You've just used XML tech as it was designed to post this comment.

XML is serialization. I hardly believe you were concerned about serialization while posting your comment, or thought about the attributes-vs-tags distinction.

This page uses requests to a server for multi-user editing. But it is easy to build a truly serverless (file-like) document with the same interface:

    data:text/html,<html><ul>Host: <span class=host contenteditable>example.com
Change it, save it, done. Web handles input of lists, maps, integers, float and much more.


You are right. XML is great for encoding the DOM. However, I didn't find it practical for interfacing with humans, due to the concerns I raised.


It is not practical to edit plain text in binary:

    636f 756e 7472 6965 733a 0a2d 2047 420a
    2d20 4945 0a2d 2046 520a 2d20 4445 0a2d
It is not practical to edit Excel documents in plain text:

    <?xml version="1.0"?>
    <Workbook xmlns="urn:schemas-microsoft-com:office:spreadsheet"
      xmlns:o="urn:schemas-microsoft-com:office:office"
      xmlns:x="urn:schemas-microsoft-com:office:excel"
      xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet"
      xmlns:html="http://www.w3.org/TR/REC-html40">
      <Worksheet ss:Name="Sheet1">
        <Table>
          <Row>
            <Cell><Data ss:Type="String">ID</Data></Cell>
Tim Berners-Lee browser was browser-editor. Can't you see parallels?


XML with convenient UI tools to edit it should have fit the bill. Yet, for whatever reason, a convenient UI tool would never happen to be there when needed, and so, scared and tired of manually editing XML, the world has embraced YAML.


> XML with a convenient UI tools to edit should have fit the bill.

"You need this special tool to work" immediately and instantly rules out "easy to edit". Or makes the debate irrelevant: every format is easy to edit if you have "a convenient UI" to do it for you.


The fault was in editing XML by hand; pure data authoring is hard. We have a convenient UI, the web browser: think of it as literate programming, a way to merge the man page and the configuration file.

And a plain text editor is itself a "widely deployed special tool to work". The actual data is

    countries:\n- GB\n- IE\n- FR\n- DE\n- NO
Or

    636f 756e 7472 6965 733a 0a2d 2047 420a
    2d20 4945 0a2d 2046 520a 2d20 4445 0a2d


Opening XMLs in ZIP containers is easy! Just spin up Word. :)



> - the automatic type coercion

Only when you "unmarshal" to an untyped data structure and then make assumptions about the type. I've used yaml with a go application, and it can't interpret NO as a bool when the field is a string.


Correct, like TFA.


Btw, the reason Haskell isn’t used more isn’t the type system per se, as all types can be inferred at compile time. People sometimes even use this feature to see if GHCi guesses the type correctly (by correctly I mean exactly what the user wants; technically it’s always correct) on the first try, and save themselves some time writing it out, either with an editor extension or just copy-and-paste from the interpreter window.

Where it gets hairy is that most programming languages have a low barrier to entry. To write Haskell effectively you’ve got to unlearn a lot of ingrained bad habits and dive into the “mathematical” aspect of the language. Not only do you get monads, but there’s a plethora of other types you need to get comfortable with, and a whole branch of mathematics talking about types (you don’t even need to know that a field like category theory exists in order to use it).

However, since most people just want to write X, or just want to hire a dev team at a price they can afford, Haskell is rarely the first-choice language.


In my opinion the mathematical concepts and abstractions are not the issue with Haskell. The issue is that it's a pain to use in practice because of:

1. Really annoying to do any kind of i/o

2. Extremely poor interoperability with non-Haskell code

3. (opinion) Unpleasant, inconsistent, hairy syntax


This comment was buried in a thread, but I'm bringing it out because it's very relevant to the conversation:

https://news.ycombinator.com/item?id=26679728

> the article refers to YAML 2.O, a nonexistent spec, and to PyYAML, a real parser which supports only YAML 1.1.

> Both the unquoted-YES/NO-as-boolean and sexagesimal literals were removed in YAML 1.2.


Yeah, I'd bet that YAML "two point oh" (rather than "two point zero") doesn't exist ! :p


I will never understand why YAML didn't just require quoted strings. Did the creator not anticipate how many problems the ambiguity would cause?


Never's a strong word; it seems quite easy to understand why, to me. You've got ease-of-use reasons, historical reasons like the misguided Robustness principle, etc.

And these sort of things happen time and time again.

And although officially JSON requires quoted strings, almost none of the parsers actually enforce that, and so you will find a huge amount of JSON out there that is not actually compliant with the official spec.

Just like browsers have huge hacks in them to handle malformed HTML.


> And although officially JSON requires quoted strings, almost none of the parsers actually enforce that

What programming language? I'm not familiar with those parsers, the ones I know of very much do enforce quoted strings.

> you will find a huge amount of JSON out there that is not actually compliant with the official spec

The parsers I use all follow the current JSON RFC specification, and I've never encountered any JSON from APIs which they reject.

> Just like browsers have huge hacks in them to handle misformed HTML.

Web browsers do deal with a variety of things, not so much JSON parsers in my experience.


I think the point is that they accept more than the spec dictates - do your JSON parsers accept e.g. the vs code config file (JSON with comments) or JSON with unquoted keys?


The most commonly used parsers only accept valid JSON - including the one included within most JS runtimes (JSON.stringify/parse). VSCode explicitly uses a `jsonc` parser, the only difference being that it strips comments before it parses the JSON. There's also such thing as `json5`, which has a few extra features inspired by ES5. None of them are unquoted strings. I've never come across anything JSON-like with unquoted strings other than YAML, and everything not entirely compliant with the spec has a different name.
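
For what it's worth, Python's standard json module is equally strict; a quick check (error messages paraphrased in the comments):

    import json

    json.loads('{"a": 1}')                      # fine
    try:
        json.loads("{a: 1}")                    # unquoted key
    except json.JSONDecodeError as err:
        print(err)  # Expecting property name enclosed in double quotes ...
    try:
        json.loads('{"a": 1}  // no comments')  # trailing comment
    except json.JSONDecodeError as err:
        print(err)  # Extra data ...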


Can you name a JSON parser which accept comments or unquoted keys?

I've never seen one


IIRC, Gson accepts unquoted keys.


If you want no misunderstandings, be explicit. This applies to YAML and life in general. There's an annoying but fairly accurate saying about assumptions that applies.

If you want something to be a specific type, you better have an explicit way of indicating that. If you say quotes will always indicate a string, great. Of course we know it's not that simple, since there are character sets to consider.

The safest answer is to do something like XML with DTDs. But that imposes a LOT of overhead. Naturally we hate that, so we make some "convention over configuration" choices. But eventually, we hit a point where the invisible magic bites us.

This is one case where tests would catch the problem, if those tests are thorough enough - explicitly testing every possibility or better yet, generative testing.


Or just opening your browser and trying out Norwegian in a QA environment.


I don't understand why Haskell gets brought up in the middle of an otherwise interesting and useful article. This sort of thing cannot happen in Haskell. And while Haskell is not universally admired, I can't recall seeing Haskell's flavor of type inference being a reason why someone claimed to dislike Haskell.


I have never gotten far into a project and thought, "my config files are too verbose. I wish there were clever shorthands."

Does Yaml have any sort of strict mode?

I imagine I could find a linter that disallows implicit strings.


Not YAML by itself, but there are libraries that parse a YAML-like format that is typed. For example this one: https://hitchdev.com/strictyaml/. Technically, it is not compatible with the YAML spec.


There exist a couple of mainstream languages that are full of these sorts of interesting behaviors; one of them is supposedly cool and productive and the other is supposedly ugly and evil.


The "Wat?" Talk got quite a few example and is hilarious.

https://www.destroyallsoftware.com/talks/wat


And yet I don't see anyone complain about bash, which is arguably far worse than those two. When things get hard in bash, you start to see Python scripts in CI, and the whole thing becomes a complete, unreadable mess.


> I don't see anyone complain about bash

You're not looking really hard then, but really

> When things get hard on bash, you will start to see python scripts

That's kinda the thing innit? Unless the system specifically only allows shell scripts (something I don't think I've ever encountered though I'm sure it exists) it's quite easy to just use something else when bash sucks, so while people will absolutely complain about it they also have an escape: don't use bash.

When a piece of software uses YAML for its configuration though, you don't really have such an option.

Furthermore, bash being a relatively old technology, people know to avoid it, or know what the most common pitfalls are. Though they'll still fall into these pitfalls regularly.


There is a lot of elitism around bash, like the "Arch btw" thing but far worse, because a lot of important things depend on it.

PowerShell has been working on Linux for quite a while now and doesn't seem to get any attention, even though it has nice IDE support and copies the good things about bash.


It doesn't copy all the good things about the Unix shell though.

The reason people are comfortable with the POSIX shell is that you use the same syntax for typing commands manually as you do for scripts. But you're going to have a hard time finding people who prefer writing:

    Remove-Item some/directory -recursive
Rather than

    rm -fr some/directory
People who write shell scripts often don't see themselves as writing a "program". They are just automating things they would do manually. Going to an IDE in this case is not something you'd consider.

I happen to be very aware of all the pitfalls in POSIX shell, and it's rare that I see a shellscript where I cannot immediately point out multiple potential problems, and I definitely agree that most scripts should probably be written in a language that doesn't contain so many guns aimed at the user's feet. I'm just pointing out a likely reason why people are not adopting powershell in the huge numbers that Microsoft may have hoped for.


Nonsense. This is the same in PowerShell:

    rm -r -f some/directory


Bash is a total disaster, I complain about it all the time. Unfortunately, rather like JS, it's unavoidable.


I'd not consider bash a

1. mainstream

2. programming language

(of course technically it is a programming language, but it is also more precisely a scripting language)


Python vs JavaScript?


Python vs PHP also.


> full of these sorts of interesting behavior

I don’t think that applies to Python - it’s quite strongly (although not statically) typed. I agree that it does apply to JavaScript and PHP.


I think this applies to Python pretty well. Although it's certainly not as bad as PHP, most JS traps also exist in Python (falsy values, optional glitchy semicolons, function-scoped variables, mutable closures). There are many JS-specific traps like these, and also other Python-specific ones (like static fields also being instance fields, Python versions, and library dependency hell). However, I find it easier to avoid them in JS than in Python, with TypeScript, avoiding classes, ...
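
For instance, here's the mutable-closure trap mentioned above, rendered in Python (the same thing bites you with var in pre-ES6 JavaScript):

    # Closures capture the variable, not its value, so every callback
    # ends up seeing the loop variable's final value.
    callbacks = [lambda: i for i in range(3)]
    print([f() for f in callbacks])  # [2, 2, 2], not [0, 1, 2]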


Javascript and PHP is correct.


It's weird that this is a 2019 article misrepresenting behavior from the YAML 1.1 spec (2005), most of which was reverted in the YAML 1.2 spec (2009), as being part of a nonexistent YAML 2.0 spec, and justifying a library that purports to handle “YAML” while ignoring the spec.


You're right, but it's worth noting that much of the world is still on YAML 1.1, for whatever reason, so in practice, these are actual problems that will be encountered in the real world.

For example, Ruby's standard library only supports YAML 1.1. It relies on libyaml, which is not yet compliant with 1.2. Meanwhile, Python's popular PyYAML library only supports 1.1, and asks users to migrate to a newer fork called ruamel.yaml for 1.2 support.


> You're right, but it's worth noting that much of the world is still on YAML 1.1

This is an article justifying use of (and justifying design decisions of) a particular Python quasi-YAML parsing library. If you are in a position to select a non-YAML-1.1-compliant parsing library for Python, or to take the article's advice on the design of a YAML(-ish) parsing library, you are, necessarily, not stuck with YAML 1.1.

> for whatever reason

Articles like this spreading misinformation about the current state of standard YAML are part of the reason. LibYAML lagging support is another since so much of the ecosystem depends on libYAML (though, while the documentation situation is terrible, it looks like maybe libYAML has some level of 1.2 support since 0.23.)

> For example, Ruby's standard library only supports YAML 1.1. It relies on libyaml, [...] Python's popular PyYAML library only supports 1.1

Which, also, is dependent on libYAML.

> and asks users to migrate to a newer fork called ruamel.yaml for 1.2 support.

Which makes a lot more sense than migrating to a library that supports neither 1.1 nor 1.2, but a nonstandard variant that addresses some of the same issues resolved years ago in 1.2, especially when a library supporting 1.2 is available for the same language.


Reminds me of the multiple YAML bugs that have plagued Kubernetes such as https://github.com/kubernetes/kubernetes/issues/82296

It is interesting how the standard of any language seems to diverge just due to the differing implementations of different parsers.


Other reasons to not want types happening during parse time:

- “modified” numbers, e.g. $50, 35%, 1.2345568896347853246863477

- Dates. If your language tries to convert a date to Unix time or Julian Day, you can have problems with time zones or distant or historical dates.

- strings vs symbols. The person writing config shouldn’t have to care about this distinction.

- Automatic deduplication for fields of objects can be a problem.


I was bitten with this issue some time ago.

The Stripe library has constants for which type of VAT number is supplied. One of those constants is 'NO_VAT'...

Needless to say, this caused me some grey hairs


Reminds me that the reasoning behind austerity came from an Excel calculation that didn't include all the relevant rows :~/

https://www.theguardian.com/politics/2013/apr/18/uncovered-e...

https://www.bbc.co.uk/news/magazine-22223190

https://theconversation.com/the-reinhart-rogoff-error-or-how...


Well, the main issue here seems to be that their work has not been peer reviewed ?


I've never seen anything that used YAML that I didn't want to douse with gasoline, nuke from orbit, and then salt the ground where it once stood.

I cry and rage and rend my clothes when I stumble upon some new thing that makes me have to use it.


> “While the website went down and we were losing money we chased down a number of loose ends until finally finding the root cause.”

And that's why you have a staging environment. Or you debug in production, whatever you prefer.


Everybody has a testing environment. Some people are lucky enough to have a totally separate environment to run production in.

https://mobile.twitter.com/stahnma/status/634849376343429120


Bugs make it to production no matter how careful you are.

What matters is how you deal with incidents as an organisation, not that you should never release a bug.


I'd go further and say this is why you write tests. Creating tests that cover a lot of (or all) possible inputs is sometimes not that hard and really pays off if you manage to catch a very common error like the Norway thing. Even better if you catch something that would have been a nightmare to fix in production.

I say this because two days ago I wrote a test that used all country codes as input. It took 15 minutes to write that test. During the whole testing session I found at least 5 mistakes of which 3 would have been quite dramatic.
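
As a rough sketch of that kind of test (pytest assumed; pycountry used here purely as a convenient source of ISO 3166-1 alpha-2 codes):

    import pycountry
    import pytest
    import yaml

    @pytest.mark.parametrize("code", [c.alpha_2 for c in pycountry.countries])
    def test_country_code_survives_yaml(code):
        # Fails for NO (and friends) under YAML 1.1-style resolvers.
        assert yaml.safe_load(f"country: {code}")["country"] == code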


>I say this because two days ago I wrote a test that used all country codes as input. It took 15 minutes to write that test. During the whole testing session I found at least 5 mistakes of which 3 would have been quite dramatic.

And how many minutes to test all city/state/region/street/person names?

It can also happen that your tests will become outdated, like when the URL standard changed and more character codes were allowed.


For something like URLs I'd use the Hypothesis Python module and rely on its implementation of URLs (and if that changes, the test will fail for newly formatted URLs); for everything "custom" I would extract problematic test cases and include them as examples.

Testing doesn't take too long on my machine (maybe 10 seconds), but even if it would, it would be totally acceptable as I run it pre commit only.


Or you just return to the previous (and working) version of the website while you fix the issue. At least if you have a good old monolith; if you have tens of microservices it may be more complicated.


The YAML specification eliminated this problem in 2009, with the release of version 1.2. That spec also eliminated some other problematic behaviors.

The real problem is that YAML parsers in wide use have not been updated to the spec that was released TWELVE years ago.

So who's going to help the common YAML parser developers update their implementations to support version 1.2? I think that would be a big help. Maybe the Norwegian government can chip in some money & time to get them updated, that would probably quietly eliminate a number of problems.


I'm replying to myself, because I think this text from YAML 1.2 (explaining its changes) is key:

> The primary objective of this revision is to bring YAML into compliance with JSON as an official subset. YAML 1.2 is compatible with 1.1 for most practical applications - this is a minor revision. An expected source of incompatibility with prior versions of YAML, especially the syck implementation, is the change in implicit typing rules. We have removed unique implicit typing rules and have updated these rules to align them with JSON's productions. In this version of YAML, boolean values may be serialized as “true” or “false”; the empty scalar as “null”. Unquoted numeric values are a superset of JSON's numeric production. Other changes in the specification were the removal of the Unicode line breaks and production bug fixes. We also define 3 built-in implicit typing rule sets: untyped, strict JSON, and a more flexible YAML rule set that extends JSON typing."

Since "no" is not the same as false, the Norway problem disappears. It's safer to always quote single-word strings like 'this', just like you always have to quote all strings in JSON.


Why not just enclose your strings in quotes and be done with it?

As far as I can see, this has nothing to do with typing and everything to do with syntax (of literals). If strings were required to be quoted this problem wouldn’t appear.

This is the reason no programming language has this issue — regardless of type system (JS/Python/Java/Haskell). If you want a string here you need quotes.

Haskell could even be regarded as what the author calls “implicitly typed” — since types are derived from literals — and I’ve never heard a Haskeller complain about this issue.


Norway is one of the luckiest countries in the world. They have a vast amount of resources, can produce their electrical energy entirely from hydropower, have a great democracy, a government they can trust, a beautiful landscape and great people.

I must say that I feel a little bit of relief to see that they have problems that nobody else has, besides insanely expensive alcohol that is only sold in "wine monopoly" stores that are more heavily guarded than banks.


Funny coincidence. Around 2000, I worked for a company that coined the term "Norway problem" for a different software problem.

Their product used an MVCC database (I think ObjectStore). One of their customers in Norway had a problem where updates to the database seemed to not show up. IIRC the problem was a bug in this company's software that caused MVCC to show an older version of the database content than expected.


> The real fix requires explicitly disregaring the spec

Or… just quote your strings.


Or, “use an appropriate schema”. Or, for several of the specific problems identified in the source article, use YAML 1.2 (2009) instead of YAML 1.1 (2005), which the article misidentifies as “YAML 2.0” and acts as if it is the current spec.


Cue also solves this problem. The "no" example is right on the front page: https://cuelang.org

I used it for configuration of a Go program recently and found it pleasant to work with. I hope the language is declared stable soon, because it's a good model.


I prefer JSON over YAML because with YAML I spend more time confused and burned by the problems it causes.

I understand that people don't like to use JSON directly because it's not very friendly: no comments, no multi-line strings, etc.

A great alternative IMHO is cson[0]. It is to CoffeeScript what JSON is to JavaScript (though nobody talks about it nowadays). It has indentation-based syntax, comments, and multiline strings which usually don't need escaping. The advantage is that it's close enough to JSON, which is the canonical format everybody can agree on nowadays. YAML and TOML depart too far, visually, from JSON.

Or just create a JSON variant that enables comments and the backtick multiline string from JavaScript.

[0] https://github.com/bevry/cson


Edit: downvoters, thanks! I realize this is not an easily agreeable opinion ("let's all chant 'death to YAML!'") but it's really easy to avoid losing money on something like this. Just do proper testing.

Aren't you setting yourself up for surprises if you write file formats such as TOML and YAML without reading the documentation, learning and experimenting first? How about unit testing? Or verifying the type in your config parser? Have you tried opening your site in the norway config on your development or testing environment? Or even in production? It all seems very basic and not at all blog post or even HN worthy.

I'm going to assume the authors still haven't learned their lesson and are going to experience many more surprises in the future working with plain text file formats.


It's fashionable to hate XML because it was used in a lot of places it was a bad fit in the 00s, but at least it's a pretty good document language.

YAML though is always a bad fit. If you want machine readable config, use JSON; human readable, use TOML. When does YAML ever fit?


> Christopher Null has a name that is notorious for breaking software code - airlines, banks, every bug caused by a programmer who didn’t know a type from their elbow has hit him.

This one made me chuckle, and TIL that Null is a real-life surname.


This is such a core issue with a tool like YAML; how the hell did this program get so popular? Are there that many developers willy-nilly using tools that fail in critical, silent ways, with the horde of know-nothings following them?


We too had this problem; we solved it by using the 3-letter country code instead.


Something that used to plague me is that I had database processes importing Excel docs from clients, and if the first few rows in a column were numbers, SQLServer assumed that all the values must be numbers. Then it would run into cells containing other strings, and instead of revising its assumption, it would just import them as null. Since clients often didn't have great data hygiene, it was a problem.

I finally solved it by exporting to csv, and using third-party software that handled its own import and did it correctly.


It seems like we need to treat YAML like JSON and quote all strings. Would that help resolve these issues? Just trying to figure out a rule I can implement to prevent them.


Recent related HN discussion: https://news.ycombinator.com/item?id=26365365


Have had a similar issue when adding git revisions to YAML documents.

The problem is that if a YAML parser sees a string like this:

"0123e04"

It interprets it as a number: 123 * 10^4

Our hacky solution was to prefix the revision hashes like sha-0123e04, but still this was quite annoying.

After that experience, I have stopped using YAML for any of my own configuration. Have started preferring putting my configurations in code. And when I don't want that, have found JSON good enough for my purposes.


But hashes are numbers, not strings, so is it really only YAML at fault?


Hashes are NOT numbers in base 10 scientific notation, which is how the hash that I showed you would be interpreted by YAML.

The point is that this behavior is sporadic. It doesn't apply consistently across all git hashes, which is the real problem. It is easy to be caught unawares by this behavior.


Yes, I got that, but why have you declared a hash, which is a number (though a different kind of number than base-10 scientific notation), as a string?


Because we were not using any numerical properties of the hash. We were not adding it to other hashes, seeing if it was greater than or less than other hashes, etc.

Literally the only thing we were doing was passing it between shell commands, helm charts, Kubernetes deployments and then back (if we needed to debug).

It sounds like you have a more attractive alternative in this case than to treat hashes as strings. Would love to hear it.


In the same thread :

https://news.ycombinator.com/item?id=26679590

Concise Encoding seems to have a hex-int type?


Problem is that we were using Kubernetes + helm, which means that we were bound to whatever configuration language they use.


So the fault is on them.


This is why I love JSON. It's only strings, numbers, booleans, arrays, and objects/dictionaries, unless you write custom serializers and deserializers.
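
A sketch of that "custom serializer" escape hatch: JSON has no date type, so you choose the representation explicitly instead of letting a parser guess.

    import datetime
    import json

    class DateEncoder(json.JSONEncoder):
        def default(self, obj):
            # Encode dates explicitly as ISO strings; everything else
            # falls through to the standard behaviour.
            if isinstance(obj, datetime.date):
                return obj.isoformat()
            return super().default(obj)

    print(json.dumps({"released": datetime.date(2021, 4, 3)}, cls=DateEncoder))
    # {"released": "2021-04-03"}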


Except that its numbers are underspecified and cannot be used safely outside of a certain range. The spec explicitly states that the precision of numbers is not defined, meaning that N and N+1 may be the same number, and the behaviour depends on the parser you're using.

The number one rule when creating a serialisation format should be that serialisation and deserialisation are predictable. It's quite remarkable that two of the most popular formats don't do this.

I'm actually surprised we haven't seen any major security issues caused by this.
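
A concrete instance (Python here, but the point is that the result is implementation-dependent):

    import json

    doc = '{"id": 9007199254740993}'  # 2**53 + 1
    print(json.loads(doc)["id"])      # 9007199254740993 -- Python ints are arbitrary precision
    # JavaScript's JSON.parse of the same document yields 9007199254740992,
    # because its numbers are IEEE-754 doubles: same JSON text, different value.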


How well does JSON deal with hexadecimal numbers?


> “While the website went down and we were losing money we chased down a number of loose ends until finally finding the root cause.”

Hopefully not a real story. If you’re trying out new configurations in production and have no mechanism to rollback problematic changes, you’ve got bigger problems than YAML.

To me, though, YAML, including “StrictYAML”, doesn’t solve any problems that JSON, perhaps with comments, doesn’t already solve.


I am sometimes annoyed by the fact you have to put double quotes around string properties in JSON. It would be so much lighter to use JS syntax..! Then I read articles like this one. Thank you JSON for not trying to be smart.


I don't like YAML because when I need to write configuration in it I waste time trying to remember the syntax. I have a much better understanding of JSON, because I use it almost on a daily basis.


This is why implicit typing is an invitation to errors.


Why does YAML accept unquoted strings? Be strict. Be safe.


Ah, tool-centric (not centered on the person), time-offset (the ones making the original mistake don't see it) pareidolia. Good ol' TCTOP...


I've hit this exact same problem loading YAML in Ruby. Luckily caught it before it hit prod, but still, it made me go argh for a while.


Wow, YAML has definitely some pretty quirky edges.


Why not just use Python itself for storing configurations? You can be explicit about the data type and no need to parse anything


This is another good argument against weak types in general. Strong types are better, and explicit is better than implicit.


Don't use ambiguous formats. Use TOML.

https://toml.io/


Hang on. The strict model seems off.

In the first model entering

GB 9.3

gets you a string and a number.

But the second gets you two strings?

Both are wrong in my opinion.

"GB" 9.3

is the correct approach

Explicit beats implicit every time.


I like StrictYAML but I have used it very little. Can anyone who uses it more give more feedback?


another gotcha:

2020-03-25 -> datetime.date(2020, 3, 25), not "2020-03-25"
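
With PyYAML, for instance, the conversion happens silently unless you quote the value (other parsers may differ):

    import yaml

    print(yaml.safe_load("released: 2020-03-25"))    # {'released': datetime.date(2020, 3, 25)}
    print(yaml.safe_load("released: '2020-03-25'"))  # {'released': '2020-03-25'}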


Just use json.


NO problem.


They decided to go against the YAML standard and therefore are no longer a YAML parser. The actual answer to this problem would have been to use a better storage format. Perhaps JSON5? or TOML?



Little Bobby Tables. I came here to post this.


You say Norway, I say Yesway.


The problem is insufficiently analysed by the article author and the commenters in this thread so far. It is very superficial. The recent thread "Can’t use iCloud with “true” as the last name" https://news.ycombinator.com/item?id=26364993 went deeper. Let me take up its relevant particulars into this thread.

The article author hitchdev does not say it outright, but it is heavily implied that the YAML file was edited by hand. This is the immediate cause of the problem. The indirect root of the problem is that the spec authors chose a plain text serialisation format and thus created an affordance http://enwp.org/Affordance#As_perceived_action_possibilities to be edited by hand.

This turns out to be unsafe/a source of bugs because YAML end-users are not capable of correctly applying the serialisation rules for the edge cases detailed in the article: humans are creatures of habit, applying analogy and common sense, making assumptions and then sometimes going wrong, whereas a piece of software will not make the Norway, Null, etc. mistakes. hitchdev even writes that quoting the string is "a fix for sure, but kind of a hack", but that's a grave misunderstanding. Quoting the string here is actually applying the serialisation rules correctly.

The tangent at the end of the article about typing is also orthogonal/irrelevant. YAML is strictly/strongly/unambiguously typed, and so is the mentioned variant Strict YAML. The difference is that Strict YAML has serialisation rules that are more amenable to, or aligned with, the human factors of habit etc. and thus work better in practice.

My personal recommendation is to never edit YAML by hand and always use a serialiser. This is less convenient, but safe.

In closing, I would like the reader of this comment to make an effort to distinguish between "what is" and "what ought to be" in their head, otherwise the ideas here will be very muddled.


The problem is not 'someone is not correctly following the serialization rules', the problem is 'the serialization rules are quite terrible'.

This is not some interesting trade-off, this problem is fixable on all axes by using non-ambiguous, non-overloaded typing rules for your config format.

Even JSON and XML got this right.


> The problem is not 'someone is not correctly following the serialization rules'

Yes, yes, I pointed that out. grep "immediate cause" and "indirect root"

> the serialization rules are quite terrible

Did that need to be said explicitly? I agree FWIW. I have already made a value judgement mildly against YAML, in case that's not clear. It's only mild because the problem can be worked around. I think this approach is more practical than moving the whole world over to a completely different thing.

> problem is fixable […] non-ambiguous […] rules

Is the implication here that you say YAML is ambiguous? It's not. I don't want sloppy analysis. To be precise, the ambiguity is imagined; it does not exist at the spec or software level, only in people's heads.


> The problem is insufficiently analysed by the article author

The article author also misidentifies the version of the YAML spec (calling it 2.0, which doesn’t exist; the behavior is from YAML 1.1, and this class of problems motivated a bunch of changes in YAML 1.2, which has been out since 2009.)

But the article author isn’t trying to analyze the problem, he’s trying to rationalize why what is notionally a YAML-processing library just ignores the spec.


> never edit YAML by hand and always use a serialiser

I don’t follow this. If yaml is your config format, and you are not editing it by hand, what are you editing?


I work on the deserialisation. This is a one-liner in many programming languages.


The very point of yaml is that it is easy to edit by hand. If you use an, I suppose, GUI editor then you don't need yaml. You could use any strictly typed serialization format. (Self describing or with a schema.)



