The more general problem is essentially sentinel values (which these sorts of inferences can be treated as) in stringly-typed contexts: if everything is a string and you match some of those strings for special treatment, you will eventually match one in a context where that's wholly incorrect, and break something.
Using in-band signaling always involves the risk of misinterpreting types.
> This is part of a more general problem
DWIM ("Do What I Mean") was a terrible way to handle typos and spelling errors when Warren Teitelman tried it at Xerox PARC[1] over 50 years ago. From[2]:
>> In one notorious incident, Warren added a DWIM feature to the command interpreter used at Xerox PARC. One day another hacker there typed
    delete *$
>> to free up some disk space. (The editor there named backup files by appending $ to the original file name, so he was trying to delete any backup files left over from old editing sessions.) It happened that there weren't any editor backup files, so DWIM helpfully reported
    *$ not found, assuming you meant 'delete *'
>> [...] The disgruntled victim later said he had been sorely tempted to go to Warren's office, tie Warren down in his chair in front of his workstation, and then type 'delete *$' twice.
Trying to "automagically" interpret or fix input is always a terrible idea because you cannot discover the actual intent of an author from the text they wrote. In literary criticism they call this problem "Death of the Author"[3].
>> [...] The disgruntled victim later said he had been sorely tempted to go to Warren's office, tie Warren down in his chair in front of his workstation, and then type 'delete $' twice.
Ironically, this did not render the way you intended because HN interpreted the asterisk as an emphasis marker in this line.
It works here:
    ... type 'delete *$' twice.
because the line is indented and so renders as code, but not here:
> ... type 'delete $' twice.
because the subsequent line has emphasized text*. So the scoping of the asterisks is all screwed up.
Eh. "Death of the Author" is a reaction to the text not being dispositive as to what the author meant. It's deciding you don't care what the author meant, no longer considering it a problem that the text doesn't reveal that. Instead the text means whatever you can argue it means.
Which can be a fun game, but is ultimately pointless.
That’s a shrewd observation. Static types help with this somewhat. E.g. in Inflex, if I import some CSV and the string “00.10” comes in as the decimal 0.1, then later when you try to do work on it like
    x == "00.10"
You’ll get a type error that x is a decimal and the string literal is a string. So then you know you have to reimport it in the right way. So the type system told you that an assumption was violated.
This won’t always happen, though. E.g. “sort by this field” will happily do a decimal sort instead of treating “00.10” as a string.
The best approach is to ask the user at import time “here is my guess, feel free to correct me”. Excel/Inflex have this opportunity, but YAML doesn’t.
That is, aside from explicit schemas. Mostly, we don’t have a schema.
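(As a rough sketch of the "decide the types at import time" idea in Python — pandas here, with a made-up file and column name:)

    import pandas as pd

    # Read everything as strings first, so nothing is silently coerced:
    df = pd.read_csv("data.csv", dtype=str)    # "00.10" survives as the string "00.10"

    # Then convert explicitly, column by column, where numbers are actually wanted:
    df["price"] = df["price"].astype(float)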
If we're talking about general problems, then I don't think we can be satisfied with "sometimes it's a problem with types and sometimes it's a UI bug." That's not general.
I mean, if the value is imported as a decimal, then a sort by that field will sort as decimal. This might not be obvious if a system imports 23.53, 53.98, etc. - a user would think it looks good. It only becomes clear that it was an error to import as a decimal when we consider cases like “00.10”. E.g., package versions: 10.10 is a newer version than 10.1.
Sure. In most static type systems though, you would be importing the data into structures that you defined, with defined types. So you wouldn’t suddenly get a Decimal in place of a String just because the data was different. You’d get a type error on import.
I suppose this is a cliched thought, but the more general problem is kind of emblematic of current "smart" features... and their expected successors.
On one hand, this is a typically human problem. We have a system. It's partly designed, partly evolved^. It's true enough to serve well in the contexts we use it in on most days. There are bugs in places (like Norway, lol) that we didn't think of initially, and haven't encountered often enough to evolve around.
In code, we call it bugs. In bureaucracy, we just call it bureaucracy. Agency A needs institution B's document X, in a way that has bugs.
Obviously, it's also a typical machine problem. @hitchdev wants to tell pyyaml that Norway exists, and pyyaml doesn't understand. A user wants to enter "MARCH1" as text (or the name of a gene), and excel doesn't understand.
Even the most rigid bureaucracy is made of people and has fairly advanced comprehension ability though. If Agency A, institution B or document X are so rigid that "NO" or "MARCH1" break them... it probably means that there's a machine bug behind the human one.
Meanwhile... a human reading this blog (even if they don't program) can understand just fine from context and assumptions of intent.
IDK... maybe I'm losing my edge, but natural language programming is starting to seem like a possibility to me.
^I feel like we need a new word for these: versioned, maybe?
I don't understand why those support agents for Microsoft just threw their hands up in the air and asked customers to go through some special process for reporting the bug in Excel. Why are they not empowered/able to report the issue on behalf of customers? It's so clearly a bug in Excel that even they are able to reproduce it with 100% reliability.
Yes. Excel cells are set to a "General" format that, by default, tries to guess the type of data the cell should hold from its content. A date-looking entry gets converted to a date type; a number-looking string to a number (so 5.80 --> 5.8, very annoying since I believe in significant digits). When you import CSV data, for example, the default import format is "General", so date-looking strings will be changed to a date format. This can be avoided by importing the file and choosing to import the data as "Text". People having these data corruption problems forgot to do that.
It's "user error" except that there is no way to set the default import to import as "Text" (as far as I know), so one has to remember to do the three step "Text" import every time instead of the default one step "General" import.
Excel doesn't support CSV files. Anyone who believes it does has never used Excel. [0] You're supposed to use spreadsheets as-is. Programs that have Excel export features should always directly export xlsx files.
[0] The only thing you can safely do with CSV files is to interpret every value as text cell. CSV files always require out of band negotiation on everything, including delimiters, quotation, escape characters, the data type of each column.
I'd say the more general problem is a bad type system! In any language with a half decent type system where you can define `type country = Argentina | ... | Zambia` this would be correctly handled at compile-time, instead of having strange dynamic weak typing rules (?) which throw runtime errors in production (???).
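A rough runtime approximation of that in Python (the Enum stands in for the compile-time sum type; the names are illustrative):

    from enum import Enum

    class Country(Enum):
        ARGENTINA = "AR"
        NORWAY = "NO"   # "NO" is just data here, never a boolean
        ZAMBIA = "ZM"

    def parse_country(code: str) -> Country:
        # Raises ValueError for anything outside the enumeration,
        # instead of silently coercing it to another type.
        return Country(code)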
I would like to see how your solution handles the case of new countries or countries changing name. Recompile and push an update? If the environment is governmental this can take a very very very long time.
The proper solution, in my opinion, is a lookup table stored in the database. It can be updated, it can be cached, it can be extended.
And for transfer of data, use formats to which you can attach a schema. This way type data is not lost on export. XML did this but everyone hates XML. And everyone hates XSD (the schema format) even more. However, if you use the proper tools with it, it is just wonderful.
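(A minimal sketch of the lookup-table idea, using sqlite3 from Python's standard library; the table and column names are made up:)

    import sqlite3

    conn = sqlite3.connect("app.db")
    conn.execute("CREATE TABLE IF NOT EXISTS countries (code TEXT PRIMARY KEY, name TEXT)")
    # New countries or renames are data changes, not code changes:
    conn.execute("INSERT OR REPLACE INTO countries VALUES (?, ?)", ("NO", "Norway"))
    conn.commit()

    row = conn.execute("SELECT name FROM countries WHERE code = ?", ("NO",)).fetchone()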
An even more general problem is that we as humans use pattern-matching as a cerebral tool to navigate our environment, and sometimes the patterns aren't what they appear to be. The Norway problem is the programming equivalent of an optical illusion.
That's an interesting statement to apply to natural languages.
Consider this headline in English: "Man attacks boy with knife". This can be read two ways, either the man is using a knife to attack the boy, or the boy had the knife and thus was being attacked.
The same sentence in Polish would make use of either genitive or instrumental case to disambiguate (although barely). However, a naive translation would only differ in the placement of a `z` (with) and so errors could still slip through. At least in this case the error would not introduce ambiguity, simply incorrectness.
Similarly to language design, we can also consider: does the inclusion/requirement of parity features reduce the expressivity of the language?
> does the inclusion/requirement of parity features reduce the expressivity of the language?
This was a real eye-opener for me when learning Latin in school: stylistic expressions such as meter, juxtaposition, symmetry are so much easier to include when the meaning of a sentence doesn't depend on word order.
> stylistic expressions such as meter, juxtaposition, symmetry are so much easier to include when the meaning of a sentence doesn't depend on word order.
Eh.... some things are easy and some things are hard in any language. The specifics differ, and so do the details of what kinds of things you're looking for in poetry. Traditional Germanic verse focuses on alliteration. Modern English verse focuses on rhyme. Latin verse focuses on neither. [1]
English divides poetically strong syllables from poetically weak syllables according to stress. It also has mechanisms for promoting weak syllables to strong ones if they're surrounded by other weak syllables.
In contrast, Latin divides strong syllables from weak syllables by length. Stress is irrelevant. But while stress can be changed easily, you're much more restricted when it comes to syllable length -- and so Publius Ovidius Naso is invariably referred to by cognomen in verse, because it isn't possible to fit his nomen, Ovidius, into a Latin metrical scheme. That's not a problem English has.
[1] I am aware of one exceptional Latin verse:
> O Tite, tute, Tati, tibi tanta, tyranne, tulisti.
The real problem here is that people use Excel to maintain data. Excel is terrible at that. But the fact that it may change data without the user being aware of it is absolutely the biggest failing here.
The problem is more that it's insanely overpowered while aiming for convenience out of the box. An "Excel Pro" version which takes away all the convenience and gives the user the power to configure it precisely for their task might be a better solution. The funny part is, most of those things are already configurable now, but users are not educated enough about their tools to actually do it.
Excel allows people to maintain data all over the place: from golf league data, to actuals compared to estimates on a job, to so much more. And Excel is accessible enough that tens of millions (or maybe more) of people do it.
The one I’ve seen was a client who wanted to store credit card numbers in an Excel sheet (yes I know this is a bad idea, but it was 15 years ago and they were a scummy debt collection call center). Excel numbers only keep 15 significant digits, which a 16 digit credit card number exceeds.
Now, you and I know this problem is solved by prepending ‘ to the number and it will be treated as a string, but your average Excel user has no understanding of types or why they might matter. Many engineers will also look past this when generating Excel reports.
> they had to rename a gene to stop excel auto-completing it into a date.
No one in their right mind uses a spreadsheet for data analysis. Good for working out your ideas, but not in a production environment. I figure Excel was chosen as it was the utility the scientists were most familiar with.
The proper tool for the job would be a database. I recall reading about a utility, a highly customized database with an interface that looks just like a spreadsheet.
The analysis itself isn’t (usually) happening in Excel.
A lot of tools operate on CSV files. People use Excel to peek at the results or prepare input for other tools, and that’s how the date coercion slips in.
Sometimes, people do use it to collate the results of small manual experiments, where a database might be overkill. Even so, the data is usually analyzed elsewhere (R, GraphPad, etc).
The mistake was to believe that Excel can operate on CSV files. It doesn't support them in any meaningful way. It supports them in a "I can sort of pretend that I support CSV files" way.
What is a good alternative to Excel for working with CSV files? Excel sure isn't ideal, but it's always there as part of the MS Office suite, so I've never looked for anything else.
The world desperately needs a replacement for YAML.
TOML is fine for configuration, but not an adequate solution for representing arbitrary data.
JSON is a fine data exchange format, but it is not particularly human-friendly, and is especially poor for editable content: it lacks comments and multi-line strings, is far too strict about unimportant syntax, etc.
Jsonnet (a derivative of Google's internal configuration language) is very good, but has failed to reach widespread adoption.
Cue is a newer Jsonnet-inspired language that ticks a lot of boxes for me (strict, schema support, human-readable, compact), but has not seen wide adoption.
Protobuf has a JSON-like text format that's friendlier, but I don't think it's widely adopted, and as I recall, it inherits a lot of Protobufisms.
Dhall is interesting, but a bit too complex to replace YAML.
Starlark is a neat language, but has the same problem as Dhall. It's essentially a stripped-down Python.
Amazon Ion [1] is neat, but I've not seen any adoption outside of AWS.
NestedText [2] looks promising, but it's just a Python library.
StrictYAML [3] is a nice attempt at cleaning up YAML. But we need a new language with wide adoption across many popular languages, and this is Python only.
Seems you're missing my personal favorite, extensible data notation - EDN (https://github.com/edn-format/edn). Probably I'm a bit biased coming from Clojure, as it's widely used there, but I haven't really found a format that comes close to EDN when it comes to succinctness and features.
Some of the neat features: custom literals / tagged elements whose support can be added at runtime/compile time (dates can be represented, parsed, and turned into proper dates in your language). Also, being able to namespace data inside of it makes things a bit easier to manage without having to resort to nesting or other hacks. Very human friendly, plus machine friendly.
Biggest drawback so far seems to be parsing performance, although I'm not sure if that's actually about the format itself, or about the small adoption of the format meaning that not many parsers focusing on speed have been written.
Your list is like a graveyard of my dreams and hopes. Anything that doesn't validate the format of the underlying data is pretty much dead to me...
The problem with most of these is that they're useless for describing the data. Honestly, it is completely not useful to have the following to describe data:
    email => string
    name => string
    dob => string
IMHO, it is akin to having a dictionary (like Oxford English) read like:
    email - noun
    name - noun
    birthday - noun
It says next to nothing except, yes, they are nouns. All too often I waste time fighting nils and bullshit in fields or duplicating validation logic all over the place.
"Oh wow, this field... is a string..? That's great... smiles gently except... THERE SHOULD NOT BE EMOJI IN MY FUCKING UUID, SCHEMA-CHUD. GET THE FUCK OFF MY LAWN!"
My experience is that validation quickly becomes surprisingly complex, to the point of being infeasible to express in a message format.
Not only are the constraints very hard to express (remember that one 2000-char regexp that really validates email addresses?), they are also contextual: the correct validation in an Android client is not the same as on the server side. E.g. you might want to check uniqueness or foreign key constraints that you cannot check on the client. Sometimes you want to store and transmit invalid messages (e.g. partially completed user input). And then you have evolving validation requirements: what do you do with the messages from three years ago that don't have field X yet?
Unfortunately I don't think you can express what you need in a declarative format. Even minimal features such as regexp validation or enums have pitfalls.
I think it's better to bite the bullet and implement the contextually required validation on each system boundary, for any message crossing boundaries.
If you want automatic built-in string validation, one option that seems particularly interesting is to use a variant of Lua patterns, which are weaker and easier to understand than regular expressions, but still provide a significant degree of "sanity" for something like an email. The original version works on bytes and not runes, but you could simply write a parser that works on runes instead, and the pattern-matching code is just 400 old and battle-tested lines of C89. You might want to add one extension: allow for escape sequences to be treated as a single character (hence included in repetition operators, and adding the capability to match quoted strings). With this extension, I think you could implement full email address validation.
XML and XML Schema solved this more than 20 years ago. It had to be replaced with JSON by the web developers though, so they could just “eval() it” to get their data.
Because it offered all the things the parent mentioned, but that made it too complex.
You either provide a schema and get the conveniences that come with describing it, or you don't.
I had a chance to use SOAP at one point. It was an F5 device, and I used a Python library. What I really liked is that when the library connected to the device, it downloaded its schema and then used that to generate an object. At that point you just communicated with the device like you did with any object in Python.
We abandoned it for inferior technologies like REST and JSON because they were easier to use from JS, as the parent mentioned.
Parent didn't say it was harder to use from JS. Parent said "It had to be replaced with JSON by the web developers though, so they could just “eval() it” to get their data."
First of all, I was there 20 years ago. I had to deal with XML, XSLT, one kind of Java XML parsers that didn't fully do what I needed, another kind of Java XML parsers that didn't fully do what I needed. And oh boy was it a pain. I just wanted to get a few properties of a bunch of entities in a bigger XML document, that's all. Big fail.
Second, JSON always had a parser in JS, so I don't know where that eval nonsense is coming from.
Third, JS actually had the best dev UX for XML of all languages 20 years ago. Maybe you know JavaScript from Node.js, but 20 years ago it ran exclusively in web browsers, which even then were pretty good at parsing XML documents. The browser of course had a JS DOM traversal API known to every single JS developer, and very soon (although TBH I can't remember if before or after JSON) it also had XPath querying functions, all built in.
XML was so bad that its replacement came from the language where XML was actually easiest to use. Think about that for a second.
So the answer to the question "Why was XML replaced?" is not "Because webdevs lol".
I suspect it was because it has both content and attributes, which all but guarantees it's impossible to map it onto the kind of simple, common data structures that JSON maps onto.
> Second, JSON always had a parser in JS, so I don't know where that eval nonsense is coming from.
Firstly, it sounds like XML ran over your dog or something. Sorry to hear about that. It wasn’t particularly hard to use at all, and if you’re dealing with the possibility of emojis in your JSON UUIDs in 2021, one might even say it’s easier to use.
If you’re referring to JSON.parse() in “had a parser” above, then you have a temporal problem. Regarding eval(), it’s suggested right in the original RFC for JSON. Check it out. Web developers at the time were following that advice.
> The world desperately needs a replacement for YAML.
The world desperately needs support for YAML 1.2, which solves the problems the article addresses fairly completely (largely in the “default” Core schema[0], but more completely with the support for schemas in general), plus a bunch of others, and has for more than a decade. But YAML 1.2 libraries aren’t available for most languages.
[0] Not actually an official default, but it reflects a cleanup of the YAML 1.1 behavior without optional types, so it's default-ish. Back when it looked like YAML 1.3 might happen in some reasonably-near future, team members indicated that the JSON schema for YAML (not to be confused with the JSON Schema spec) would be the explicit default YAML schema in 1.3, which has a lot to recommend it.
Nope nope nope. YAML is awful and needs to die. The more you look at it the worse it gets. The basic functionality is elegant (at least until you consider stuff like The Norway Problem), but the advanced parts of YAML are batshit insane.
The article is simply, factually wrong; there is no “YAML 2.0 specification” [0], and everything they point to is YAML 1.1, and addressed in YAML 1.2 (the most recent YAML spec, from 2009).
TOML quickly breaks down with lots of nested arrays of objects. For example:
    a:
      b:
        - c: 1
        - d:
            - e: 2
            - f:
                g: 3
Turns into this, which is unreadable:
    [[a.b]]
    c = 1
    [[a.b]]
    [[a.b.d]]
    e = 2
    [[a.b.d]]
    [a.b.d.f]
    g = 3
TOML also has a few restrictions, such as not supporting mixed-type arrays like [1, "hello", true], or arrays at the root of the document. JSON can represent any TOML value (as far as I know), but TOML cannot represent every JSON value.
At my company we use YAML a lot for table-driven tests (e.g. [1]), and this not only means lots of nested arrays, but also having to represent pure data (i.e. the expected output of a test), which requires a format that supports encoding arbitrary "pure" data structures of arrays, numbers, strings, booleans, and objects.
Also many (most? all?) serializers don't let you control which fields are serialized inline vs not. So if you have a program that generates configuration, you're going to end up with the original unreadable form anyway.
Apropos of this, in Clojure-land the idiomatic serialization is EDN [1], which is pretty ergonomic to work with IMO, since in most cases it is the same as a data literal in Clojure.
My feeling is that :keywords reduce the need and temptation to conflate strings and boolean/enumerations that occurs when there's no clear way to convey or distinguish between a string of data and a unique named 'symbol'. I miss them when I'm in Pythonland.
> S-expressions inherits all trouble with data types from json (dates, times, booleans, integer size, number vs numeric string).
Hm, not sure that's true. S-expressions only define the "shape" of what you're expressing, not the semantics. EDN (https://github.com/edn-format/edn) is for all purposes S-expressions, and it has support for custom literals and more, to avoid "the trouble with data types from JSON".
Yes, EDN is S-expressions plus a bunch of semantic rules.
Parsing EDN is quite a bit more complex than just parsing S-expressions, because you need to support a bunch of built-in types, as well as arbitrary extensions through 'tags'.
I’ve used most of the technologies you listed. Cue is the best, and the only one with strong theoretical foundations. I’ve been using it for some time now and won’t go back to the others.
> The world desperately needs a replacement for YAML.
For situations like TFA you really want a configuration language that behaves exactly like you think it will, and since you don't have to interop with other organizations you don't really need a global standard.
Moreover, broadly used config languages can be somewhat counterproductive to that goal. Take JSON as an example; idiomatic JSON serdes in multiple programming languages has discrepancies in minint, maxfloat, datetime, timezone, round-tripping, max depth, and all kinds of other nuanced issues. Existing tooling is nice when it does what you expect, but for a no-frills, no-surprises configuration language I would almost always just prefer to use the programming language itself or otherwise write a parser if that doesn't suffice (e.g., in multilingual projects).
Mildly off-topic: The problem here, more or less, was that the configuration change didn't have the desired effect on an in-memory representation of that configuration. We can mitigate that at the language level, but as a sanity check it's also a good idea to just diff the in-memory objects and make sure the change looks kind of like what you'd expect.
You don't need wide adoption for internal projects in an organization, but you do want great toolchain support.
For example, the fact that NestedText is a Python library means a Python team could use it, but it's a poor fit for an organization whose other teams use Go and JavaScript/TypeScript.
We use YAML for much more than configuration, by the way. I feel like YAML hits a nice sweet spot where it's usable for almost everything.
I don't think YAML is going anywhere, largely because it was the first format to prioritize readability and conciseness, and has used that advantage to achieve critical mass.
It's far more productive to push for incremental changes to the YAML spec (or even a fork of it) to make it more sane and better defined. Things like a StrictYAML subset mode for parsers in other popular languages.
> It's far more productive to push for incremental changes to the YAML spec
The problems this article raises and strictyaml purports to address were addressed in YAML 1.2, already supported in Python via ruamel.yaml. YAML 1.2 addresses much of this in the Core schema, which is the closest successor to the default behavior of earlier spec versions, and does so more completely in the support for schemas more generally, which define both the supported “built-in” tags (roughly, types) and how they are matched from the low-level representation, which consists only of strings, sequences, and maps. (Incidentally, those are the only three tags of the “Failsafe” schema; there’s also a “JSON” schema between Failsafe and Core, which has tags corresponding to the types supported by JSON.)
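(Assuming I have ruamel.yaml's API and defaults right, the difference is visible in a couple of lines:)

    import yaml                    # PyYAML: YAML 1.1 semantics
    from ruamel.yaml import YAML   # ruamel.yaml: YAML 1.2 by default

    print(yaml.safe_load("countries: [GB, NO, FR]"))
    # {'countries': ['GB', False, 'FR']}  <- the Norway problem

    print(YAML(typ="safe").load("countries: [GB, NO, FR]"))
    # {'countries': ['GB', 'NO', 'FR']}   <- 'NO' stays a string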
JSON5 is better than JSON on my points, but it has downsides compared to YAML. For example, YAML is very good at multiline strings that don't require any sort of quoting, and knows to remove preceding indentation:
    foo: |
      "This is a string that goes across
      multiple lines," he wrote.
In JSON5, you'd have to write:
    {
      foo: "\"This is a string that goes across\nmultiple lines,\" he wrote.",
    }
This sort of ergonomic approach is why YAML is so well-liked, I think. (Granted, YAML's use of obscure Perl-like sigils to indicate whitespace mode is annoying, but it does cover a lot of situations.)
YAML is also great at arrays, mimicking how you'd write a list in plaintext, with one dash-prefixed item per line.
I will keep using YAML because I don't want to learn the pitfalls of your alternatives. With YAML everyone is complaining about the pitfalls, and therefore everyone is aware of them. A random replacement may not have this particular problem, but it may have other problems that remain unknown.
For the ease of entering time units YAML 1.1 parsed any set of two digits, separated by colons, as a number in sexagesimal (base 60). So 1:11:00 would parse to the integer 4260, as in 1 hour and 11 minutes equals 4260 seconds.
Now try plugging MAC addresses into that parser.
The most annoying part is that a MAC address would only be mis-parsed if there were no letters (a-f) in it. Like the bug in this post, it could only be reproduced with specific values.
Generally, if you're doing implicit typing, you need to keep the number of cases as low as possible, and preferably error out in case of ambiguity.
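(PyYAML still implements the 1.1 rules, so this one is easy to demonstrate:)

    import yaml  # PyYAML implements YAML 1.1

    yaml.safe_load("mac: 12:34:56")  # -> {'mac': 45296}, i.e. 12*3600 + 34*60 + 56
    yaml.safe_load("mac: aa:bb:cc")  # -> {'mac': 'aa:bb:cc'}; the letters keep it a string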
> For the ease of entering time units YAML 1.1 parsed any set of two digits, separated by colons, as a number in sexagesimal (base 60).
This is a mind-boggling level of idiocy. Even leaving aside the MAC address problem, this conversion treats "11:15" (= 675) different from "11:15:00" (= 40500), even though those denote the same time, while treating "00:15:00" (15 minutes past midnight) and "15:00" (3 in the afternoon) the same.
It had it literally at the same time as it had the problem in the article (the article refers to YAML 2.0, a nonexistent spec, and to PyYAML, a real parser which supports only YAML 1.1).
Both the unquoted-YES/NO-as-boolean and sexagesimal literals were removed in YAML 1.2. (As was the 0-prefixed-number-as-octal mentioned in a sibling comment.)
One that really surprised/confused me was that PyYAML (and the YAML 1.1 spec) attempts to interpret any 0-prefixed string as an octal number.
There was a list of AWS Account IDs that parsed just fine until someone added one that started with a 0 and had no digits greater than 7 in it, after which our parser started spitting out decidedly different values than we were expecting. Fixing it was easy, but figuring out what in the heck was going on took some digging.
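(It reproduces in two lines with PyYAML; quoting is the fix:)

    import yaml

    yaml.safe_load("account: 012345")    # -> {'account': 5349}, read as octal
    yaml.safe_load("account: '012345'")  # -> {'account': '012345'}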
We had a Grafana dashboard where one of the columns was a short Git hash. One day, a commit got the hash `89e2520`, which Grafana's frontend helpfully decided to display as "+infinity". Presumably it was parsing 89E+2520.
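(The same coercion is a one-liner in Python, for anyone who wants to see it happen:)

    float("89e2520")  # a git hash, read as 89 * 10**2520 -> overflows to inf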
Ha, that reminds me of some work I was doing just yesterday, implementing a custom dictionary for a postgres full text search index. Postgres has a number of mappings that you can specify, and it picks which one based on a guess of what the data represents. I got bit by a string token in this same format, because it got interpreted as an exponential number.
I try to optimize my microwave button pushing too. I also have a +30 seconds button, so for 1:30 I can hit "1,3,0,Start" or "+30" three times and save a press!
My rule is that loading the dishwasher means that one loads all the available dishes, and runs it, even if it's only x% full. We use the (large) sink as an input buffer.
If the dishwasher has dishes in it and it's not running, they're clean.
This is exactly our algorithm as well. I can't really imagine flipping it the other way, since leaving dirty dishes in a dishwasher will just let them completely dry out, making it more likely they won't get fully clean when the cycle is eventually run.
I want to have two dishwashers. One with the dirty dishes and one with the clean dishes. So you never have to put the dishes away. They go from the clean dishwasher to the table to the dirty one. And then flip them.
There’s a community near here with a high fraction of Orthodox Jews. One condo I toured in my 20s had two dishwashers and without thinking about why they did it, I commented how I thought that was awesome that you’d never need to put dishes away. (They of course installed two dishwashers for orthodox separation of dishes from each other.)
Not the OP, but I have the same problem. For some reason that escapes me, pressing the “10 sec” button 7 times produces 00 70 instead of 01 10. If you then press the “1 min” button you get 01 70
The worst tragedy of this is the security implications of subtly different parsers. As your application surface increases, you're likely to mix languages (and thus different parsers), which means that the same input data will produce different output data depending on whether your parser replaces, truncates, ignores, or otherwise attempts to automatically "fix up" the data. A carefully crafted document could exploit this to trick your data storage layer into storing truncated data that elevates privileges or sets zero cost, while your access control layer that ignores or replaces the data is perfectly happy to let the bad document pass by.
And here's something else to keep you up at night: Just think of how many unintentional land mines lurk in your serialized data, waiting to blow up spectacularly (or even worse, silently) as soon as you attempt to change implementation technologies!
This is exactly why configuration/serialization formats should make as few assumptions about value types as possible. Once parsing's done, everything should be a string (or possibly a symbol/atom, if the program ingesting such a file supports those), and it should be up to the application to convert values to the types it expects. This is Tcl's approach, and it's about as sensible as it gets.
...which is why it pains me to admit that in my own project for a Tcl-like scripting/config language[1] I missed the float vs. string issue, so it'll currently "cleverly" return different types for 1.2 (a float) vs. 1.2.3 (an atom). Coincidentally, I started work on a "stringy" alternative interpreter that hews closer to Tcl's philosophy (to fix a separate issue - namely, to avoid dynamically generating atoms, and therefore avoid crashing the Erlang VM when given potentially-adversarial input), so I'm going to fix that case for at least the "stringy" mode (by emitting strings instead of numbers, too), knocking out two birds with one stone for the upcoming 0.3.0 release :)
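(The general pattern, sketched in Python rather than Tcl — the names are invented:)

    raw = {"country": "NO", "port": "8080"}  # the parser hands everything over as strings

    config = {
        "country": raw["country"],   # the app knows this is a string; leave it alone
        "port": int(raw["port"]),    # the app, not the parser, decides this is a number
    }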
It’s reasons like this that I want my configuration languages to be explicit and unambiguous. This is why I use JSON or, if I want a human-friendly format, TOML. Strings are always “quoted” and numbers are always unquoted, like 1.2, so one can never accidentally be parsed as the other. The convenience of omitting quotes is just not worth the potential for ambiguity or edge cases to me.
> The most tragic aspect of this bug, however, is that it is intended behavior according to the YAML 2.0 specification.
This is one of those great ideas that sadly one needs experience to realize are really bad ideas. Every new generation of programmers has to relearn it.
Other bad ideas that resurface constantly:
1. implicit declaration of variables
2. don't really need a ; as a statement terminator
3. assert should not abort because one can recover from assert failures
I agree with the general observation, but the need for ";" ? Quite a few languages (over a few generations) have been doing fine without the semicolon. Just to mention two: python and haskell. (Yes, python has the semicolon but you'll only ever use it to put multiple statements on a single line.)
It even has semicolon insertion, but because the language is carefully designed, this doesn't cause problems, and most users can go a lifetime without knowing about it.
Our coding style requires semicolons for uninitialized variables, so you'll see
    local x;
    if flag then
      x = 12
    else
      x = 24
    end
as a way of marking that the lack of initialization is deliberate. `local x = nil` is used only if x might remain nil.
I don't like calling it semicolon insertion, because it might give people the idea that the semicolons work similarly to JavaScript's. In Lua, inserting a semicolon is always optional and it's a stylistic matter (like in your example). It even allows putting multiple statements on the same line without a semicolon.
> I agree with the general observation, but the need for ";" ? Quite a few languages (over a few generations) have been doing fine without the semicolon. Just to mention two: python and haskell. (Yes, python has the semicolon but you'll only ever use it to put multiple statements on a single line.)
But then it's inconsistent and has unnecessary complexity because now there's one (or more) exceptions to the rules to remember: when the ';' is needed. And of course if you get it wrong you'll only discover it at runtime.
"Consistent applications of a general rule" is preferable to "An easier general rule but with exceptions to the rule".
Have you ever used Python? If you did you really wouldn't be saying this. There isn't an exception. The semicolon is used to put multiple statements on a single line. That's its only use, and that's the only time it's 'needed' - no exceptions.
> Have you ever used Python? If you did you really wouldn't be saying this. There isn't an exception.
For the ';', perhaps not. For the token that is used to terminate (or separate) statements? Yes, the ';' is an exception to the general rule of how to terminate statements.
The semicolon also works on some sorts of statements and not others, throwing errors only at runtime.
Honestly, the rule is "don't use semicolons in Python". I don't think there's a single one in the large codebase I work with, and there's really no reason at all to use it other than maybe playing code golf.
It's not a language in which you ever need be saving bytes on the source code. Just use a new line and indent. It's more readable and easier.
There are no exceptions. You only need it if/when you want to put multiple statements on a single line. That's its sole purpose.
And I'd also add that it's something that you almost never do. One practical use is writing single line scripts that you pass to the interpreter on the command line. E.g. `python -c 'print("first command"); print("second command")'`
If you don't know about the `;` at all in python then you are 100% fine.
When you use ; and possibly {, }, code statements / blocks are specified redundantly (indentation + separators), which can cause inconsistent interpretation of code by compiler / readers.
I find it much, much easier to look at code and parse blocks via indentation, than the many ways and exceptions of writing ; and {, }, while an extra or missing ';' or {} easily remains unspotted and leads to silly CVEs.
That was my single biggest pet-peeve of C++. A variable appears in the middle of a member function? Good luck figuring out what owns it. Is it local? Owned by the class? The super-class? (And in that case - which one?)
The added mental load of tracking variables' sources builds up.
FWIW, most C++ style guides recommend writing member variables like mVariableName or variable_name_ so they're easy to distinguish from local variables, and modern C++ doesn't generally make much use of inheritance, so there's usually only one class it could belong to.
The fact that people introduce naming conventions to keep track of member variables is probably the biggest condemnation of implicit member access. People clearly need to know this, so you'd better make it explicit.
It's actually a bit surprising that this is one thing that javascript does better than Java. In most other areas, it's Java that's (sometimes overly) explicit.
I can tell for certain that as a JS/Python man, every time I look through Java code I have to spend a bit of time when stumbling upon such access, until I remember that it's a thing in Java. Pity that Kotlin apparently inherited it.
But at least, to my knowledge, in Java these things can't turn out to be global vars. Having this ‘feature’ in JS or Python would be quite a pain in the butt.
> This is one of those great ideas that sadly one needs experience to realize are really bad ideas. Every new generation of programmers has to relearn it.
It's a bad idea because ASCII already includes dedicated characters for field separator, record separator, and so on. These could easily be made displayable in a text editor if you wanted, just as you can display newlines as ↲. Anyone who invents a format that involves using normal printable characters as delimiters and escaping them when you need them is, I feel very confident in saying, grotesquely and malevolently incompetent and should be barred from writing software for life. CSV, JSON, XML, YAML, all guilty.
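(For the record, here's what using them looks like in Python — no quoting or escaping needed, as long as the payload itself can't contain the separators:)

    US, RS = "\x1f", "\x1e"  # ASCII unit separator and record separator

    rows = [["NO", "Norway"], ["GB", "United Kingdom"]]
    encoded = RS.join(US.join(fields) for fields in rows)
    decoded = [record.split(US) for record in encoded.split(RS)]
    assert decoded == rows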
The obvious first step toward the brighter future is to refrain from using any and all software that utilizes the malevolent formats you mentioned. Doing otherwise would mean simply being untrue to one's own conscience and word.
> It's a bad idea because ASCII already includes dedicated characters for field separator, record separator and so on.
ASCII is over 60 years old and separators haven't caught on yet; what's different now?
> These could easily be made displayable in a text editor if you wanted just as you can display newlines as ↲.
Can you name a common text editor with support for ASCII separators? It's a lot easier to use delimiters and escaping than to change every text editor in the world.
> Anyone who invents a format that involves using normal printable characters as delimiters and escaping them when you need them, is, I feel very confident in saying, grotesquely and malevolently incompetent and should be barred from writing software for life. CSV, JSON, XML, YAML, all guilty.
All of the formats you rant about are widely used, well supported, and easy to edit with a text editor - none of these are true of ASCII separators. People chose formats they can edit today instead of formats they might be able to edit in the future. All of these formats have some issues but none of the designers were incompetent.
US-ASCII only has four information separators, and I believe they can only be used in a four-layer schema with no recursion, sort of like CSV (if your keyboard didn’t have a comma or quote or return key). When you need to pass an object with records of fields inside a field you’re out of luck, and everyone has to agree on quoting or encoding or escaping again.
I think SGML (roll your own delimiters and nesting) was pretty close to the Right Thing,™ but ISO has the specs locked down so everyone had a second-hand understanding of it.
Ctrl-\, Ctrl-], Ctrl-^ and Ctrl-_ for file, group, record and unit separator, respectively.
However, your tty driver, terminal or program are all likely to eat them or munge them. Also, virtually nothing actually uses these characters for these purposes.
> virtually nothing actually uses these characters for these purposes.
Right. Which is why we have all these hilarious escaping and interpolation problems. And why programmers will never be taken seriously by real engineers. It's like we have cement mixed and ready to go, but we decide to go and forage for mud instead and think that makes us cleverer than the cement guys.
I’m surprised that with your experience you come to such unbalanced conclusions. Everything in engineering is about trade-offs, and while your conclusions may be indisputable for the design goals of D, they may be wrong in other contexts.
1. If I scribble some one-time code etc., the probability of having an error coming from implicit declarations is, for most people, in the same order of magnitude as missing edge cases or not getting the algorithm right. The extra convenience may well be worth it.
2. I would relax this: it should be clear to the programmer where a statement ends.
3. Going on with a warning is a sane strategy in some situations. I'd happily ruin my car engine to drive out of the desert. The assert might have been too strict, and I know something about the data, so the program can ignore the assert failure.
Your rationale in this and your follow-ups is exactly what I'm talking about.
1. You're actually right if the entire program is less than about 20 lines. But bad programs always grow, and implicit declaration will inevitably lead you to have a bug which is really hard to find.
2. The trouble comes from programmer typos that turn out to be real syntax, so the compiler doesn't complain, and people tend to be blind to such mistakes so don't see it. My favorite actual real life C example:
    for (i = 0; i < 10; ++i);
    {
        do_something();
    }
My friend who coded this is an excellent, experienced programmer. He lost a day trying to debug this, and came to me sure it was a compiler bug. I pointed to the spurious ; and he just laughed.
(I incorporated this lesson into D's design: a spurious ; produces a compiler error.)
3. I used to work for Boeing on flight critical systems, so I speak about how these things are really designed. Critical systems always have a backup. An assert fail means the system is in an unknown, unanticipated state, and cannot be relied on. It is shut down and the backup is engaged. The proof of this working is how incredibly safe air travel is.
> 3. I used to work for Boeing on flight critical systems, so I speak about how these things are really designed. Critical systems always have a backup. An assert fail means the system is in an unknown, unanticipated state, and cannot be relied on. It is shut down and the backup is engaged.
I ask you to reconsider your assumptions. How did this play out in the 737 MAX crashes? Was there a backup AoA sensor? Did MCAS properly shut down and backup engaged? Was manual overriding the system not vital knowledge to the crew?
You don’t have to answer. I probably wouldn’t get it anyway.
But rest assured that I won’t try to program flight control and I strongly appreciate your strive for better software.
> Your reactor is boiling. Your control software shut down with assertion failed: temperature too high, cannot display more than 3 digits.
Several points:
1. Most such critical components have several different and independent implementations, with an analog backup (if possible).
2. You are arguing that one specific safety-critical case, which 99.999% or more of programmers will never face, should somehow inform decisions about a general purpose programming language.
3. Even if you are working in such a safety-critical situation, you should not rely on an assertion bypass, but have a separate emergency procedure which bypasses all the checks and tries to force the issue. (Ever seen a --force flag?)
Because what happens in reality is: a developer encounters a bug (maybe while it's still in development), notices they can bypass it by disabling assertions (or they are disabled by default), and logs it as a low-priority bug that never gets fixed.
Then a decade later, me or someone like me is cursing you, because your enterprise app just shit the bed and is generating tons of assertion warnings even when it's running normally, so I have to figure out which of them are "just normal" program flow, and which one just caused an outage.
I have never experienced a situation like you described, but I have experienced the behavior I wrote about above, too many times.
Bottom line is:
- don't assert if you don't mean it
- if you need bypass for various runtime checks, code one in explicitly.
Edit:
Hacker News is written in Arc, which is a Scheme dialect.
Arc doesn't have assertions as far as I can tell.
I agree with this. Nuclear reactors are a special case of systems where removing energy from the system makes it more unsafe, because it generates its own energy and without a control system it will generate so much energy that it destroys itself (and due to the nature of radiation, destroys the surrounding suburbs too).
With most systems, the safest state is off. CNC machine making a weird noise? Smash that e-stop. Computer overheating? Unplug it. With this in mind, "assert" transitions the system from an undefined state to an inoperative state, which is safer.
That isn't to say that you want bugs in your code, or that de-energizing some system is free of consequences. Your emergency stop of your mill just scrapped a $10,000 part. Unplugging your server made your website go down and you lost a million dollars in revenue. But it didn't kill someone or burn the building down, so that's nice.
Modern nuclear reactors are designed and built with the expectation that when they melt down, the results aren't catastrophic (at least for the outside world).
See my previous reply. Your reactor design is susceptible to a single point of failure, and, how do I say it strongly enough, is an utterly incompetent design. Bypassing assertions is not the answer.
If it ignores part of the spec, I don't think "strictyaml" is the correct name here. Instead, if it interprets everything as string, perhaps "stringyaml" would have been more accurate, though I'm sure that's not as good PR.
I'm reminded of the discussion we had a few days ago about environment variables; one problem there is that env variables are always strings, and sometimes you do want different types in your config. But clearly having the system automatically interpret whether it's a string or something else is a major source of bugs. Maybe having an explicit definition of which field should be which type would help, but then you end up with the heavy-handed XML with its XSD schema.
Or you just use JSON, which is light-weight, easy to read, but unambiguous about its types. I guess there's a good reason it's so popular.
Maybe other systems like yaml and environment variables should only ever be used for strings, and not for anything else, and I suppose replacing regular yaml with 'strictyaml' could play a role there. Or cause unending confusion, because it does violate the spec.
> JSON, which is [...] unambiguous about its types
With the one exception that for floating point values the precision is not specified in the JSON spec, and thus is implementation-defined[1], which may lead to its own issues and corner cases. It for sure is better than YAML's 'NO' problem, but depending on your needs JSON may have issues as well.
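(A concrete example of that implementation-defined behavior — Python's json keeps big integers exact, while JavaScript's JSON.parse silently rounds the same document:)

    import json

    doc = '{"id": 10000000000000000000000000000001}'
    json.loads(doc)["id"]  # -> 10000000000000000000000000000001, exact (Python int)
    # JSON.parse(doc).id in JavaScript -> 1e+31; the trailing 1 is silently gone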
Allowing you to define types is quite uncommon, but many config languages allow more types than JSON (so more than boolean, number, string, list, dict). Date datatypes are a big one and are provided by about every second JSON variant, in addition to TOML, ION and others.
>If it ignores part of the spec, I don't think "strictyaml" is the correct name here.
The article didn't fully explain it but strictyaml requires a typed schema or defaults to string (or list or dict) if one is not provided. So it strictly follows the provided schema.
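(If I remember the strictyaml API correctly, it looks roughly like this:)

    from strictyaml import load, Map, Seq, Str

    schema = Map({"countries": Seq(Str())})
    data = load("countries:\n- GB\n- NO\n- FR\n", schema).data
    # {'countries': ['GB', 'NO', 'FR']} -- 'NO' stays a string unless the schema says Bool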
I was helping out a friend of mine in the risk department of a Big 4; he was parsing CSV data from a client's portfolio. Once he started parsing it, he was getting random NaNs (pandas' nan type, to be more accurate).
I couldn't get access to the original dataset but the column gave it away. Namibia's 2-letter ISO country code is NA—which happens to be in pandas' default list of NaN equivalent strings.
    na_values : scalar, str, list-like, or dict, default None
        Additional strings to recognize as NA/NaN. If dict passed, specific per-column NA values. By default the following values are interpreted as NaN: ‘’, ‘#N/A’, ‘#N/A N/A’, ‘#NA’, ‘-1.#IND’, ‘-1.#QNAN’, ‘-NaN’, ‘-nan’, ‘1.#IND’, ‘1.#QNAN’, ‘<NA>’, ‘N/A’, ‘NA’, ‘NULL’, ‘NaN’, ‘n/a’, ‘nan’, ‘null’.
You fix it by using `keep_default_na=False`, by the way.
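(i.e. something like:)

    import pandas as pd

    # Nothing is silently promoted to NaN; pass na_values explicitly if some
    # sentinel strings really should be treated as missing.
    df = pd.read_csv("portfolio.csv", keep_default_na=False)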
tl;dr: there are a bunch of fields of various types that arrive as strings, and they get coerced but without paying attention to which field should have which type
> Whenever an input accepts YAML you can actually pass in JSON there and it’ll be valid
...unless your parser strictly implements YAML 1.1, in which case you should be careful to add whitespace around commas and colons (and a few other minor things). This is valid JSON that some YAML parsers will have problems with:
{"foo":"bar","\/":10e1}
The very first result Google gives me for "yaml parser" is https://yaml-online-parser.appspot.com, which breaks on the backslash-forward slash sequence.
> Whenever an input accepts YAML you can actually pass in JSON there and it’ll be valid
Strictly speaking, this is only true of YAML 1.2, not YAML 1.0-1.1 (the article here addresses YAML 1.1 behavior, the headline example of which was removed in YAML 1.2 twelve years ago), though it calls YAML 1.1 “YAML 2.0”, which doesn’t actually exist.
Of course, there are lots of features, like custom types, that JSON doesn’t support, but you can still use YAML’s JSON-style syntax instead of actual JSON, for them.
Yes this is usually the best way. If you need some features for code reuse there are several preprocessors. I personally use Dhall to configure everything and then convert it to JSON for my application to consume. It is a lot more powerful than YAML and has a very safety-oriented type system.
> it’s equally true that extremely strict type systems require a lot more upfront and the law of diminishing returns applies to type strictness - a cogent answer to the question “why is so little software written in haskell?“
I was with the article up until that point. I don't agree that diminishing returns with regards to type strictness applies linearly. Term-level Haskell is not massively harder than writing most equivalent code in JavaScript — in fact I'd say it's easier and you reap greater benefit. Perhaps it's a different story when you go all-in on type-level programming, but I'm not sure that's what the author was getting at. This smells of the Middle Ground logical fallacy to me. Or of course the comment was tongue-in-cheek and I'm overreacting.
I had to rewrite some JavaScript code in Postgres recently that measured the overlap between different elevation ranges. In JS I had to write it myself and deal with the edge cases and bugs. In Postgres I just used the range type and some operators. It was brilliant in comparison; the tiny effort of learning it was worth it. The list of data types I use all the time is bigger than just strings, numbers, and booleans. Serialisation formats should support them, particularly as there are often text format standards that already exist for a lot of them. Give me WKT geometry and ISO-formatted dates. It's not that difficult and totally worth it.
That law of diminishing returns might actually apply, I am not 100% sure. But more powerful type systems allow for the more complex composition of more complex interfaces in a safe manner. Think of higher-level modules and data structures. Or dependent types and input handling. Or linear types and resource handling.
I agree. I would say that Erlang goes ~80% of the way compared to Haskell's type system, and the last 20% really matters, to the point that in many cases I find myself not really using Erlang's (optional) type system at all. Better type coverage and more descriptive types allow the compiler to infer more, and I'd say this is the opposite of diminishing returns.
That author's blog post sent me down a rabbit hole of insanity with YAML and the PyYAML parser idiosyncrasies.
First, he mentions "YAML 2.0", but there's no such reference to "2.0" on yaml.org or in Google/Bing searches. Yaml.org and Wikipedia say YAML is at 1.2. Apparently the other commenters in this thread clarified that the older "YAML 1.1" is what the author is referring to.
Ok, if we look at the official YAML 1.1 spec[1], it has this excerpt for implicit bool conversions:
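    y|Y|yes|Yes|YES|n|N|no|No|NO
    |true|True|TRUE|false|False|FALSE
    |on|On|ON|off|Off|OFF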
The programmer omitted the single-character options 'y' and 'Y' but kept 'n' and 'N'?!? The lack of symmetry makes the parser inconsistent.
    % cat countries.yml
    ---
    countries:
    - US
    - GB
    - NO
    - FR
    % yamllint countries.yml
    countries.yml
    5:4 warning truthy value should be one of [false, true] (truthy)
YAML seems like a really neat idea, but over time I have come to regard it as being too complicated for me to use for configuration.
My personal favorite is TOML, but I would even prefer plain JSON over YAML.
The last thing I want at 2 AM, when trying to figure out if an outage is due to a configuration change, is having to think about whether each line of my configuration is doing the thing I want.
YAML prizes making data look nicely formatted over simplicity or precision. That, for me, is not a tradeoff I am willing to make.
- The format seems to feel the need to support everything, including things I am not sure are actual use cases (what's the point of the Markup element, for example? What does Metadata save us compared to just including it in the document, given that parsers must parse it anyway?). This must make implementations more complex and costly, and makes reading the text format more difficult.
- Not a fan of octal notation. At 3am I'm not sure I won't confuse 0 and o in certain fonts. Does anyone even use it these days?
- Unquoted strings were discussed in the thread; I'd like to point out that it's very easy to make an unquoted string not "text-safe" (according to the spec) without noticing it, at which point the document is invalid.
Just add whitespace (maybe a user pasted a string without noticing the whitespace at the end, or forgot the rules), a dot, an exclamation mark, or a question mark. Having surprises like that is IMHO worse than a consistent quoting method.
Basically all the things I don't like are about the format supporting a bit too much. YAML 1.1 should teach us more is sometimes less.
Alright that's two votes against unquoted strings so far (plus my wife agrees so that's three against!)
I put in octal because it was trivial to implement after the others. The canonical format when it's stored or being sent is binary, and a decoder shouldn't be presenting integers in octal (that would just be weird). But a human might want octal when inputting data that will be converted to the binary format.
Markup is for presentation data, UI layouts, etc, but with full type support rather than all the hacky XML+whatever solutions that many UI toolkits are adopting. Also, presentation data in binary form is nice to have.
Well, unquoted strings work when a format is built for that. If the default was "it's text unless we see the special sequences" it would be better for unquoted strings. But even then there are too many special characters in this format IMHO.
I saw there's a 'Media' type in the spec. It seems the type is actually for serializing files, but there's no "name" (or we could call it "description") field. Of course we could accomplish this with a separate field, but then again the entire type's functionality could be accomplished with a u8x array and a string field. So if you're specifying this type at all, you might as well add a name field to make it useful.
The media object is for embedding media within a document (an image, a sound, an animation, some bytecode to execute in a sandbox, or whatever). It's not intended to be used as an archive format for storing files (which, as you said, could be trivially accomplished with a byte array for the data, a string for the file name, and some metadata like permissions etc). A file is just one way among many to store media (in this case as an entry in a hierarchical database - the filesystem - keyed by filename). CE is only interested in the media itself, not the database technology.
The media object is a way to embed media data directly into a document such that the receiving end will have some idea of how to deal with it (from its media type). It won't have or need a "file name" because it's not intended to be stored in a filesystem, but rather to be used directly by an application. Yes, it could be built up from the primitives, but then you lose the canonical "media" type, and everyone invents their own incompatible compound types (much like what happened with dates in JSON and XML).
I'm skimming through the human readable spec, and it seems decent, but I noticed the spec allows unquoted strings. What's the reasoning for this? In my experience unquoted strings cause nothing but trouble, and are confusing to humans who may interpret them as keywords.
Any reason for not using RFC2119 keywords in the spec? Using them should make the spec easier to read.
> I noticed the spec allows unquoted strings. What's the reasoning for this? In my experience unquoted strings cause nothing but trouble, and are confusing to humans who may interpret them as keywords.
Unquoted strings are much nicer for humans to work with. All special keywords and object encodings are prefixed with sigils (@, &, $, #, etc), so any bare text starting with a letter is either a string or an invalid document, and any bare text starting with a numeral is either a number or an invalid document.
> Any reason for not using RFC2119 keywords in the spec? Using them should make the spec easier to read.
If strings are always unambiguously detectable, why allow quoting them at all? Having two representations for the same data means you can't normalize a document unambiguously. I can understand that barewords seem cleaner for things like map keys, but I am not convinced that it's a worthwhile tradeoff.
An important feature of RFC2119 keywords is that they're always capitalized (i.e. the keyword is "MUST", not "Must" or "must"). This makes requirements and recommendations stand out amid explanatory text, improving legibility. For example, RFC2119 itself uses MUST and must with different meanings.
> If strings are always unambiguously detectable, why allow quoting them at all?
Because strings can contain whitespace and other structural characters that would confuse a parser.
> Having two representations for the same data means you can't normalize a document unambiguously.
The document will always be normalized unambiguously in binary format. The text format is a bit more lenient because humans are involved.
The idea is that the binary format is the source of truth, and is what is used in 90% of situations. The text format is only needed as a conduit for human input, or as a human readable representation of the binary data when you need to see what's going on.
> An important feature of RFC2119 keywords is that they're always capitalized (i.e. the keyword is "MUST", not "Must" or "must").
It's a compromise; there are only so many letters, numbers, and symbols available in a single keystroke on all keyboards, and I don't want there to be any ambiguity with numbers and unquoted strings (e.g. interpreting the unquoted string value true as the boolean value true).
So everything else needs some kind of initiator and/or container syntax to logically separate it from the other objects when interpreted by a human or machine.
XML with convenient UI tools to edit it should have fit the bill. Yet, for whatever reason, a convenient UI tool would never happen to be there when needed, and so, scarred and tired of manually editing XML, the world embraced YAML.
> XML with a convenient UI tools to edit should have fit the bill.
"You need this special tool to work" immediately and instantly rules out "easy to edit". Or makes the debate irrelevant: every format is easy to edit if you have "a convenient UI" to do it for you.
The fault was in editing XML by hand; pure data authoring is hard. We have a convenient UI, the web browser; think of it as literate programming, a way to merge the man page and the configuration file.
And a plain text editor is a "widely deployed special tool to work". Actual data is
Only when you "unmarshal" to an untyped data structure and then make assumptions about the type. I've used YAML with a Go application, and it won't interpret NO as a bool when the target field is a string.
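A rough Python analog of that Go behaviour, with a hypothetical Config dataclass that checks its own field type after loading:

    from dataclasses import dataclass
    import yaml  # PyYAML

    @dataclass
    class Config:
        country: str  # the field is declared as a string

        def __post_init__(self):
            # the typed-unmarshal step: reject values the parser guessed wrong
            if not isinstance(self.country, str):
                raise TypeError(f"country must be str, got {type(self.country).__name__}")

    Config(**yaml.safe_load("country: 'NO'"))  # fine: quoted, stays a string
    Config(**yaml.safe_load("country: NO"))    # TypeError: PyYAML loaded it as False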
Btw, the reason Haskell isn't used more isn't the type system per se, as all types can be inferred at compile time. People sometimes even use this feature to see whether GHCi guesses the type correctly (by correctly I mean exactly how the user wants; technically it's always correct) on the first try, saving themselves some time writing it out, either with an extension or just copy-and-paste from the interpreter window.
Where it gets hairy is that most programming languages have a low barrier to entry. To write Haskell effectively you've got to unlearn a lot of deep-rooted bad habits and dive into the "mathematical" aspect of the language. Not only do you get monads, but there's a plethora of other types you need to get comfortable with, and a whole branch of mathematics about types (though you don't even need to know that such a field as category theory exists to use it).
However, since most people just want to write X, or just want to hire a dev team at a price they can afford, Haskell is rarely the first-choice language.
Never's a strong word; it seems quite easy to me to understand why. You've got ease-of-use reasons, historical reasons like the misguided Robustness Principle, etc.
And these sort of things happen time and time again.
And although officially JSON requires quoted strings, almost none of the parsers actually enforce that, and so you will find a huge amount of JSON out there that is not actually compliant with the official spec.
Just like browsers have huge hacks in them to handle malformed HTML.
I think the point is that they accept more than the spec dictates. Do your JSON parsers accept e.g. the VS Code config file (JSON with comments), or JSON with unquoted keys?
The most commonly used parsers only accept valid JSON, including the one included within most JS runtimes (JSON.stringify/parse). VSCode explicitly uses a `jsonc` parser, the only difference being that it strips comments before it parses the JSON. There's also such a thing as `json5`, which has a few extra features inspired by ES5. None of those features are unquoted strings. I've never come across anything JSON-like with unquoted strings other than YAML, and everything not entirely compliant with the spec has a different name.
If you want no misunderstandings, be explicit. This applies to YAML and life in general. There's an annoying but fairly accurate saying about assumptions that applies.
If you want something to be a specific type, you better have an explicit way of indicating that. If you say quotes will always indicate a string, great. Of course we know it's not that simple, since there are character sets to consider.
The safest answer is to do something like XML with DTDs. But that imposes a LOT of overhead. Naturally we hate that, so we make some "convention over configuration" choices. But eventually, we hit a point where the invisible magic bites us.
This is one case where tests would catch the problem, if those tests are thorough enough - explicitly testing every possibility or better yet, generative testing.
I don't understand why Haskell gets brought up in the middle of an otherwise interesting and useful article. This sort of thing cannot happen in Haskell. And while Haskell is not universally admired, I can't recall seeing Haskell's flavor of type inference being a reason why someone claimed to dislike Haskell.
Not YAML by itself, but there are libraries that parse a YAML-like format that is typed. For example this one: https://hitchdev.com/strictyaml/. Technically, it is not compatible with the YAML spec.
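With a schema, "NO" survives as a string; a small sketch against strictyaml's schema API:

    from strictyaml import load, Map, Seq, Str

    schema = Map({"countries": Seq(Str())})
    doc = load("countries:\n- US\n- GB\n- NO\n- FR\n", schema)
    print(doc.data)
    # {'countries': ['US', 'GB', 'NO', 'FR']} -- 'NO' stays a string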
There exist a couple of mainstream languages that are full of these sorts of interesting behaviors; one of them is supposedly cool and productive, and the other is supposedly ugly and evil.
And yet I don't see anyone complain about bash, which is arguably far worse than those two. When things get hard in bash, you start to see Python scripts in CI, and the whole thing becomes a completely unreadable mess.
> When things get hard on bash, you will start to see python scripts
That's kinda the thing innit? Unless the system specifically only allows shell scripts (something I don't think I've ever encountered though I'm sure it exists) it's quite easy to just use something else when bash sucks, so while people will absolutely complain about it they also have an escape: don't use bash.
When a piece of software uses YAML for its configuration though, you don't really have such an option.
Furthermore, bash being a relatively old technology, people know to avoid it, or at least know what the most common pitfalls are. Though they'll still fall into those pitfalls regularly.
There is a lot of elitism around bash, like the "Arch btw" thing but far worse, because a lot of important things depend on it.
Powershell has been working on Linux for quite a while now and doesn't seem to get any attention, even though it has nice IDE support and copies the good things about bash.
It doesn't copy all the good things about the Unix shell though.
The reason people are comfortable with the POSIX shell is that you use the same syntax for typing commands manually as you do for scripts. But you're going to have a hard time finding people who prefer writing:

    Remove-Item some/directory -recursive

Rather than

    rm -fr some/directory
People who write shellscripts are often not seeing themselves writing a "program". They are just automating things they would do manually. Going to an IDE in this case is not something you'd consider.
I happen to be very aware of all the pitfalls in POSIX shell, and it's rare that I see a shellscript where I cannot immediately point out multiple potential problems, and I definitely agree that most scripts should probably be written in a language that doesn't contain so many guns aimed at the user's feet. I'm just pointing out a likely reason why people are not adopting powershell in the huge numbers that Microsoft may have hoped for.
I think this applies to Python pretty well. Although certainly not as bad as PHP, most JS traps also exist in Python (falsy values, glitchy optional semicolons, function-scoped variables, mutable closures). There are many JS-specific traps like this, and also Python-specific ones (like static fields also being instance fields, Python versions, and library dependency hell). However, I find it easier to avoid them in JS than in Python, with TypeScript, avoiding classes, ...
It's weird that this is a 2019 article misrepresenting behavior in the YAML 1.1 spec (2005), most of which was reverted in the YAML 1.2 spec (2009), as being part of a nonexistent YAML 2.0 spec, and justifying a library that purports to handle "YAML" while ignoring the spec.
You're right, but it's worth noting that much of the world is still on YAML 1.1, for whatever reason, so in practice, these are actual problems that will be encountered in the real world.
For example, Ruby's standard library only supports YAML 1.1. It relies on libyaml, which is not yet compliant with 1.2. Meanwhile, Python's popular PyYAML library only supports 1.1, and asks users to migrate to a newer fork called ruamel.yaml for 1.2 support.
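The difference is easy to see side by side (assuming PyYAML's default 1.1 behaviour and ruamel.yaml's default 1.2 behaviour):

    import yaml                   # PyYAML: YAML 1.1 rules
    from ruamel.yaml import YAML  # ruamel.yaml: YAML 1.2 rules by default

    print(yaml.safe_load("country: NO"))         # {'country': False}
    print(YAML(typ="safe").load("country: NO"))  # {'country': 'NO'}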
> You're right, but it's worth noting that much of the world is still on YAML 1.1
This is an article justifying use of (and justifying design decisions of) a particular Python quasi-YAML parsing library. If you are in a position to select a non-YAML-1.1-compliant parsing library for Python, or to take the article's advice on the design of a YAML(-ish) parsing library, you are, necessarily, not stuck with YAML 1.1.
> for whatever reason
Articles like this spreading misinformation about the current state of standard YAML are part of the reason. LibYAML's lagging support is another, since so much of the ecosystem depends on libYAML (though, while the documentation situation is terrible, it looks like maybe libYAML has some level of 1.2 support since 0.23).
> For example, Ruby's standard library only supports YAML 1.1. It relies on libyaml, [...] Python's popular PyYAML library only supports 1.1
Which, also, is dependent on libYAML.
> and asks users to migrate to a newer fork called ruamel.yaml for 1.2 support.
Which makes a lot more sense than migrating to a library that supports neither 1.1 nor 1.2, but a nonstandard variant that addresses some of the same issues resolved years ago in 1.2, especially when a library supporting 1.2 is available for the same language.
I'd go further and say this is why you write tests. Creating tests that cover many (or all) possible inputs is sometimes not that hard and really pays off if you manage to catch a very common error like the Norway thing. Even better if you catch something that would have been a nightmare to fix in production.
I say this because two days ago I wrote a test that used all country codes as input. It took 15 minutes to write that test. During the whole testing session I found at least 5 mistakes of which 3 would have been quite dramatic.
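A sketch of what such a test can look like, with pytest and PyYAML (names here are hypothetical, and a real version would cover all 249 ISO 3166-1 alpha-2 codes):

    import pytest
    import yaml

    COUNTRY_CODES = ["US", "GB", "NO", "FR", "DE"]  # really: every ISO 3166-1 code

    @pytest.mark.parametrize("code", COUNTRY_CODES)
    def test_country_code_survives_yaml(code):
        # simulate a hand-edited config file containing the bare code
        loaded = yaml.safe_load(f"country: {code}")["country"]
        assert loaded == code  # fails for "NO" under YAML 1.1 rules: it loads as False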
>I say this because two days ago I wrote a test that used all country codes as input. It took 15 minutes to write that test. During the whole testing session I found at least 5 mistakes of which 3 would have been quite dramatic.
And how many minutes to test all city/state/region/street/person names ?
It can also happen that your tests become outdated, like when the URL standard changed and more characters were allowed.
For something like URLs I'd use the hypothesis Python module and rely on its implementation of URLs (and if that changes, the test will fail for newly formatted URLs); for everything "custom" I would extract problematic test cases and include them as examples.
Testing doesn't take too long on my machine (maybe 10 seconds), but even if it did, it would be totally acceptable as I only run it pre-commit.
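A sketch of that approach; hypothesis.provisional.urls is a real strategy (though the module is explicitly marked unstable), and parse_config_url is a hypothetical function under test:

    from hypothesis import given
    from hypothesis.provisional import urls

    @given(urls())
    def test_parser_accepts_any_valid_url(url):
        # the generated corpus tracks hypothesis's notion of a valid URL,
        # so the test evolves with the library instead of silently going stale
        parse_config_url(url)  # hypothetical function under test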
Or you just return to the previous (and working) version of the website while you fix the issue. At least if you have a good old monolith; if you have tens of microservices it may be more complicated.
The YAML specification eliminated this problem in 2009, with the release of version 1.2. That spec also eliminated some other problematic behaviors.
The real problem is that YAML parsers in wide use have not been updated to the spec that was released TWELVE years ago.
So who's going to help the common YAML parser developers update their implementations to support version 1.2? I think that would be a big help. Maybe the Norwegian government can chip in some money & time to get them updated, that would probably quietly eliminate a number of problems.
I'm replying to myself, because I think this text from YAML 1.2 (explaining its changes) is key:
> The primary objective of this revision is to bring YAML into compliance with JSON as an official subset. YAML 1.2 is compatible with 1.1 for most practical applications - this is a minor revision. An expected source of incompatibility with prior versions of YAML, especially the syck implementation, is the change in implicit typing rules. We have removed unique implicit typing rules and have updated these rules to align them with JSON's productions. In this version of YAML, boolean values may be serialized as “true” or “false”; the empty scalar as “null”. Unquoted numeric values are a superset of JSON's numeric production. Other changes in the specification were the removal of the Unicode line breaks and production bug fixes. We also define 3 built-in implicit typing rule sets: untyped, strict JSON, and a more flexible YAML rule set that extends JSON typing."
Since "no" is not the same as false, the Norway problem disappears. It's safer to always quote single-word strings like 'this', just like you always have to quote all strings in JSON.
Why not just enclose your strings in quotes and be done with it?
As far as I can see, this has nothing to do with typing and everything to do with syntax (of literals). If strings were required to be quoted this problem wouldn’t appear.
This is the reason no programming language has this issue — regardless of type system (JS/Python/Java/Haskell). If you want a string here you need quotes.
Haskell could even be regarded as what the author calls “implicitly typed” — since types are derived from literals — and I’ve never heard a Haskeller complain about this issue.
Norway is one of the luckiest countries in the world. They have a vast amount of resources, can produce their electrical energy entirely from hydropower, have a great democracy, a government they can trust, a beautiful landscape and great people.
I must say that I feel a little bit of relief to see that they have problems that nobody else has, besides insanely expensive alcohol that is only sold in "wine monopoly" stores that are more heavily guarded than banks.
Funny coincidence. Around 2000, I worked for a company that coined the term "Norway problem" for a different software problem.
Their product used an MVCC database (I think ObjectStore). One of their customers in Norway had a problem where updates to the database seemed to not show up. IIRC the problem was a bug in this company's software that caused MVCC to show an older version of the database content than expected.
Or, “use an appropriate schema”. Or, for several of the specific problems identified in the source article, use YAML 1.2 (2009) instead of YAML 1.1 (2005), which the article misidentifies as “YAML 2.0” and acts as if it is the current spec.
Cue also solves this problem. The "no" example is right on the front page: https://cuelang.org
I used it for configuration of a Go program recently and found it pleasant to work with. I hope the language is declared stable soon, because it's a good model.
I prefer JSON over YAML because I've spent more time confused and burned by the problems YAML causes.
I understand that people don't like using JSON directly because it's not very friendly: no comments, no multi-line strings, etc.
A great alternative IMHO is cson[0]. It's to CoffeeScript what JSON is to JavaScript (though nobody talks about CoffeeScript nowadays). It has indentation-based syntax, comments, and multiline strings which usually don't need escaping. The advantage is that it's close enough to JSON, which is the canonical format everybody can agree on nowadays. YAML and TOML have too many visual departures from JSON.
Or just create a JSON variant that enables comments and the backtick multiline string from JavaScript.
Edit: downvoters, thanks! I realize this is not an easily agreeable opinion ("let's all chant 'death to YAML!'") but it's really easy to avoid losing money on something like this. Just do proper testing.
Aren't you setting yourself up for surprises if you write file formats such as TOML and YAML without reading the documentation, learning and experimenting first? How about unit testing? Or verifying the type in your config parser? Have you tried opening your site with the Norway config in your development or testing environment? Or even in production? It all seems very basic and not at all blog-post or even HN worthy.
I'm going to assume the authors still haven't learned their lesson and are going to experience many more surprises in the future working with plain text file formats.
> Christopher Null has a name that is notorious for breaking software code - airlines, banks, every bug caused by a programmer who didn’t know a type from their elbow has hit him.
This one made me chuckle, and TIL that Null is a real-life surname.
This is such a core issue with a tool like YAML; how the hell did it get so popular? Are there that many developers willy-nilly using tools that fail in critical, silent ways, with a horde of know-nothings following them?
Something that used to plague me is that I had database processes importing Excel docs from clients, and if the first few rows in a column were numbers, SQL Server assumed that all the values must be numbers. Then it would run into cells containing other strings, and instead of revising its assumption, it would just import them as null. Since clients often didn't have great data hygiene, it was a problem.
I finally solved it by exporting to csv, and using third-party software that handled its own import and did it correctly.
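One way to switch off type guessing at import time, sketched with pandas (file and column names here are made up):

    import pandas as pd

    # dtype=str disables inference: every cell arrives as a string, and any
    # numeric conversion becomes an explicit, visible step afterwards
    df = pd.read_csv("client_data.csv", dtype=str)
    df["amount"] = pd.to_numeric(df["amount"], errors="raise")  # fail loudly, never null out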
It seems like we need to treat YAML like JSON and quote all strings. Would that help resolve these issues? I'm just trying to figure out a rule I can implement to prevent them.
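One enforceable rule: yamllint has a quoted-strings rule that can require quotes on every string scalar. A minimal .yamllint along these lines:

    # .yamllint -- flag any unquoted string scalar
    rules:
      quoted-strings:
        quote-type: any   # single or double quotes both fine
        required: true    # every string value must be quoted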
Have had a similar issue when adding git revisions to YAML documents.
The problem is that if a YAML parser sees an unquoted scalar like this:

    0123e04

It interprets it as a number in scientific notation: 123 * 10^4.
Our hacky solution was to prefix the revision hashes like sha-0123e04, but still this was quite annoying.
After that experience, I have stopped using YAML for any of my own configuration. I have started preferring to put my configuration in code, and when I don't want that, I have found JSON good enough for my purposes.
Hashes are NOT numbers in base 10 scientific notation, which is how the hash that I showed you would be interpreted by YAML.
The point is that this behavior is sporadic. It doesn't apply consistently across all git hashes, which is the real problem. It is easy to be caught unawares by this behavior.
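A quick sketch of why it's sporadic, using Python's float() as a stand-in for a 1.2-style scalar scanner that accepts JSON-ish scientific notation:

    # which short git hashes happen to be number-shaped?
    for h in ["0123e04", "deadbeef", "1234567", "99e9999", "abc1234"]:
        try:
            print(h, "->", float(h))       # would be swallowed as a number
        except ValueError:
            print(h, "-> stays a string")  # has a clearly non-numeric character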
Yes, I got that, but why have you declared a hash, which is a number (though a different kind of number than base-10 scientific notation), as a string?
Because we were not using any numerical properties of the hash. We were not adding it to other hashes, seeing if it was greater than or less than other hashes, etc.
Literally the only thing we were doing was passing it between shell commands, helm charts, Kubernetes deployments and then back (if we needed to debug).
It sounds like you have a more attractive alternative in this case than to treat hashes as strings. Would love to hear it.
Except that its numbers are underspecified and cannot be used safely outside of a certain range. The spec explicitly states that the precision of numbers is not defined, meaning that N and N+1 may be the same number, and its behaviour would depend on the parser you're using.
The number one rule when creating a serialisation format should be that serialisation and deserialisation are predictable. It's quite remarkable that two of the most popular formats don't do this.
I'm actually surprised we haven't seen any major security issues caused by this.
> “While the website went down and we were losing money we chased down a number of loose ends until finally finding the root cause.”
Hopefully not a real story. If you’re trying out new configurations in production and have no mechanism to rollback problematic changes, you’ve got bigger problems than YAML.
To me, though, YAML, including "StrictYAML", doesn't solve any problem that JSON, perhaps with comments, doesn't already solve.
I am sometimes annoyed by the fact you have to put double quotes around string properties in JSON. It would be so much lighter to use JS syntax..!
Then I read articles like this one. Thank you JSON for not trying to be smart.
I don't like YAML because when I need to write configuration in it, I waste time trying to remember the syntax. I have a much better understanding of JSON, because I use it almost on a daily basis.
They decided to go against the YAML standard and therefore are no longer a YAML parser.
The actual answer to this problem would have been to use a better storage format. Perhaps JSON5? Or TOML?
The problem is insufficiently analysed by the article author and the commenters in this thread so far; the analysis is very superficial. The recent thread "Can’t use iCloud with “true” as the last name" https://news.ycombinator.com/item?id=26364993 went deeper. Let me bring its relevant particulars into this thread.
The article author hitchdev does not say it outright, but it is heavily implied that the YAML file was edited by hand. This is the immediate cause of the problem. The indirect root of the problem is that the spec authors chose a plain text serialisation format and thus created an affordance (http://enwp.org/Affordance#As_perceived_action_possibilities) to be edited by hand.
This turns out to be unsafe and a source of bugs, because YAML end-users are not capable of correctly applying the serialisation rules with the edge cases detailed in the article: humans are creatures of habit, applying analogy and common sense, making assumptions, and sometimes going wrong, whereas a piece of software will not make the Norway, Null etc. mistakes. hitchdev even writes that quoting the string is "a fix for sure, but kind of a hack", but that's a grave misunderstanding. Quoting the string here is actually applying the serialisation rules correctly.
The tangent at the end of the article about typing is also orthogonal/irrelevant. YAML is strictly/strongly/unambiguously typed, and so is the mentioned variant StrictYAML. The difference is that StrictYAML has serialisation rules that align better with the human factors of habit etc. and thus work better in practice.
My personal recommendation is to never edit YAML by hand and always use a serialiser. This is less convenient, but safe.
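With PyYAML, for instance, the serialiser applies the quoting rules you would otherwise have to remember yourself:

    import yaml  # PyYAML

    print(yaml.safe_dump({"countries": ["US", "GB", "NO", "FR"]}))
    # countries:
    # - US
    # - GB
    # - 'NO'   <- the emitter quotes it, since bare NO would read back as a boolean
    # - FR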
In closing, I would like the reader of this comment to make an effort to distinguish between "what is" and "what ought to be" in their head, otherwise the ideas here will be very muddled.
The problem is not 'someone is not correctly following the serialization rules', the problem is 'the serialization rules are quite terrible'.
This is not some interesting trade-off, this problem is fixable on all axes by using non-ambiguous, non-overloaded typing rules for your config format.
> The problem is not 'someone is not correctly following the serialization rules'
Yes, yes, I pointed that out. grep "immediate cause" and "indirect root"
> the serialization rules are quite terrible
Did that need to be said explicitly? I agree FWIW. I have already made a value judgement mildly against YAML, in case that's not clear. It's only mild because the problem can be worked around. I think this approach is more practical than moving the whole world over to a completely different thing.
> problem is fixable […] non-ambiguous […] rules
Is the implication here that you say YAML is ambiguous? It's not. I don't want sloppy analysis. To be precise, the ambiguity is imagined, it does not exist on the spec or software level, only in the head of people.
> The problem is insufficiently analysed by the article author
The article author also misidentifies the version of the YAML spec (calling it 2.0, which doesn’t exist; the behavior is from YAML 1.1, and this class of problems motivated a bunch of changes in YAML 1.2, which has been out since 2009.)
But the article author isn’t trying to analyze the problem, he’s trying to rationalize why what is notionally a YAML-processing library just ignores the spec.
The very point of YAML is that it is easy to edit by hand. If you use a, I suppose, GUI editor, then you don't need YAML. You could use any strictly typed serialization format (self-describing or with a schema).
https://www.theverge.com/2020/8/6/21355674/human-genes-renam...
Edit: Apparently Excel has its own Norway Problem ... https://answers.microsoft.com/en-us/msoffice/forum/msoffice_...