Uniform eXchange Format (UXF) – plain text human readable typed storage format (github.com/mark-summerfield)
69 points by begoon on Oct 14, 2022 | 60 comments



This seems unnecessarily terse. If you want types, why not make it look like one of the existing popular languages with types? (Rust/TypeScript/protobuf)

Instead of:

    uxf 1
    =Database server:str ports:list connection_max:int enabled:bool
    =DateTime when:datetime tz:str
    =Owner name:str dob:DateTime
    =Hosts name:str
    [
      (Owner <Tom Preston-Werner> (DateTime 1979-05-27T07:32:00 <-08:00>))
      (Database <192.168.1.1> [8000 8001 8002] 5000 yes)
      (Hosts
        <alpha>
        <omega>)
    ]

Why not:

    type Database {
      server: str,
      ports: list,
      connection_max: int,
      enabled: bool,
    }
    type DateTime {
      when: datetime,
      tz: str,
    }
    type Owner {
      name: str,
      dob: DateTime,
    }
    type Hosts {
      name: str,
    }

    [
      Owner('Tom Preston-Werner', DateTime('1979-05-27T07:32:00', '-08:00')),
      Database('192.168.1.1', [8000, 8001, 8002], 5000, yes),
      [
        Hosts('alpha'),
        Hosts('omega'),
      ],
    ]

Evidence shows that people like that kind of format, for "plain text human readable" purposes. They are also used to it. It's only 20% longer, and you can also go with a Haskell-style syntax if you dislike braces.

What's the point of a plain-text format that is not human-friendly? Especially for type definitions, do you expect people to write this, or do you want them to compile their schema from a different human-readable format into your human-readable format (and why)?


I agree, visually parsing things that are all on one line is more difficult than it needs to be. Putting separate things on separate lines is preferable.


> Evidence shows that people like that kind of format,

I agree, but I don't think it's settled why people like that kind of format. I suspect it's because the majority of today's software has to touch the web, and the only programming language built into web browsers happens to consume and produce that kind of format natively.

In other words, I think it's popular, but I think it's popular because it's the path of least resistance for interfacing with the web, which isn't necessarily a priority all the time.


> Evidence shows that people like that kind of format

Please cite this evidence. I like your proposed format, just interested in that research.


The existing programming languages are the evidence. Sorry I thought that was clear. I'm suggesting they take hints from the structured languages that humans actually read and write.


They're evidence that people like that style. Which is not evidence that people wouldn't like another style.


This format does a lot of things correctly conceptually, in my opinion.

1. It supports data tables with named and typed columns.
2. It supports types in the header that can be referenced elsewhere.
3. It supports lists of stuff as well as types and nesting.
4. It uses a format header to easily declare what format to decode/encode.

Unlike lists of JSON objects, the data can be represented more compactly. I've done something similar when I encode tables as an array of header names, then each row is also an array, where index is used to match the name.
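A minimal Python sketch of the header-plus-rows encoding described above. The helper names `pack_table` and `unpack_table` and the sample records are hypothetical and only illustrate the idea; they are not the commenter's actual code.

```python
def pack_table(records):
    """Encode uniform dicts as one list of column names plus positional rows."""
    if not records:
        return {"columns": [], "rows": []}
    columns = list(records[0].keys())
    # Each row stores values only; the position in the row identifies the column.
    rows = [[rec[name] for name in columns] for rec in records]
    return {"columns": columns, "rows": rows}


def unpack_table(table):
    """Rebuild the dicts by matching each row index to its column name."""
    return [dict(zip(table["columns"], row)) for row in table["rows"]]


# Illustrative data, not taken from the UXF spec.
records = [
    {"server": "192.168.1.1", "port": 8000, "enabled": True},
    {"server": "192.168.1.2", "port": 8001, "enabled": False},
]
packed = pack_table(records)
```

The column names appear once instead of once per record, which is where the size saving over a list of JSON objects comes from.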

It would be fairly easy to make a binary version of this if you needed more compact representation, and make a lossless conversion between text and binary.

Why would you want a format like this? Many use cases. Every DB wire protocol essentially re-creates something like this, but often poorly. Writing multiple tables that include headers and types as well as data to disk is frequently useful.

The problem with xml schema is XML is really really complex when you add in transforms and namespaces and everything else that XML can include.

The only thing I might suggest is to be able to add meta-data about a specific type, or create specialized type based on type+meta-data (like max length, etc). This could also help with the issue of timestamps (local, second resolution, offset).


> I've done something similar

Any links to share? I think your feedback is very spot on so curious what you've built.


That's human-readable, is it? Personally I find the TOML example far easier to read. It doesn't seem to be spelt out what the 'advantage' is in translating to UXF; I think maybe that it supports custom types? A comparison to Recfiles (which do) would be nice then.

https://en.wikipedia.org/wiki/Recfiles

https://www.gnu.org/software/recutils/


I think TOML is optimized for reading/writing by humans, while this is primarily an exchange format, so readability is not too much of a concern. It seems like a compromise between JSON and protobuf.


TOML has a mess of separators (lines and commas) with confusingly optional delimiters instead of whitespace as separator and mandatory string delimiters: possibly nicer to write, but more complex, with unnecessary ambiguities, and less elegant.


> Use no for false and yes for true.

Wouldn’t it be better to use true for true and false for false?


Could you please elaborate? Why is true a better value than yes? Seems like an arbitrary choice to me. Does one choice have some advantages over the other?


True/false is more widely used than yes/no in this context. The fact that the author felt the need to explain that yes/no maps to true/false indicates that they also believe this to be the case.

The only time I can remember seeing yes/no used in a format like this is YAML, and that caused problems[1].

[1]: https://hitchdev.com/strictyaml/why/implicit-typing-removed/


The Norway problem came to mind for me, too, but I don't think there's much opportunity for that to arise here given that strings are quoted. Meanwhile, there's plenty of other precedent for yes/no instead of true/false; shell scripts come to mind.

I don't know if I prefer this over treating everything as a string and letting readers/writers decide on their own how to parse things, but it seems like a much more reasonable approach than YAML's.
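To make the difference concrete, here is a toy Python sketch, not any real parser's API. `resolve_implicit` mimics YAML 1.1 style implicit typing using a subset of the boolean spellings from that spec, and `resolve_delimited` mimics the UXF-like rule that strings are always wrapped in `<...>`. Both function names are hypothetical.

```python
# Subset of the unquoted spellings that YAML 1.1 resolves to booleans.
IMPLICIT_TRUE = {"yes", "Yes", "YES", "true", "True", "TRUE", "on", "On", "ON"}
IMPLICIT_FALSE = {"no", "No", "NO", "false", "False", "FALSE", "off", "Off", "OFF"}


def resolve_implicit(token):
    """Toy implicit typing: a bare token may silently become a bool."""
    if token in IMPLICIT_TRUE:
        return True
    if token in IMPLICIT_FALSE:
        return False
    return token  # anything else stays a string


def resolve_delimited(token):
    """Toy delimited rule: <...> is always a string, bare yes/no are bools."""
    if token.startswith("<") and token.endswith(">"):
        return token[1:-1]
    if token == "yes":
        return True
    if token == "no":
        return False
    raise ValueError(f"unsupported bare token: {token!r}")


norway_implicit = resolve_implicit("NO")      # country code read as a bool
norway_delimited = resolve_delimited("<NO>")  # country code kept as a string
```

With mandatory delimiters the country code and the boolean can never be confused, whichever words are chosen for the booleans.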


Yes, I agree, the format described here is more reasonable than YAML and would not suffer from the Norway problem.

I'm interested in what you mean when you mention shell scripts; I find it more idiomatic to write:

  if [ true ]; then echo test; fi

Over

  if [ yes ]; then echo test; fi

Because the behavior is more consistent when you try to use it in other constructs:

  while true; do echo test; done

Versus

  yes | while read _; do echo test; done

Or maybe

  while yes | :; do echo test; done

There's also no "no" command. To get that effect, you'd have to confusingly write:

  yes no

Is there a particular instance of yes/no in shell scripting that you had in mind?


> Is there a particular instance of yes/no in shell scripting that you had in mind?

Yes!

I see them all the time in various build scripts, especially for Slackware packages / SlackBuild scripts; it's a pretty common convention to use yes/no values for enabling/disabling (respectively) various build options.

OpenBSD's rc.conf(.local) also uses "NO" to indicate that a service/daemon should be disabled entirely; for example, the default httpd_flags=NO in rc.conf entirely disables httpd - unless, of course, you re-enable it later with httpd_flags= in rc.conf.local. "YES" is also sometimes used, e.g. library randomization being enabled by default via library_aslr=YES.


I would say “yes” and “no” are locale specific. Why not German “ja” and “nein”, French “oui” and “non” (with the advantage of being of the same, short, length), etc? Yes, “true” and “false” are English words, but in programming circles, IMO transcend locale.

(The more locale agnostic ⊤ and ⊥ (https://en.wikipedia.org/wiki/Verum and https://en.wikipedia.org/wiki/Up_tack), IMO are a bit elitist and difficult to type)


UXF has plenty of English-specific keywords.

I agree that "true" and "false" would be clearer than "yes" and "no", but the fact that "yes" and "no" are English words isn't an issue.


What fraction of existing code uses languages with true/false vs yes/no? I’d imagine it should be “least surprising” for most users.


Because even the author uses true to describe what he means by yes.


This is why you should never finalize specs until you've written the documentation for them.

If it sounds stupid when you say it out loud, it _is_ stupid.


Well, "false" can't get confused with the ISO code for Norway.


To be fair, it can't here, because all strings are delimited.


It saves bytes! (joking!)


I fail to see why I should prefer this over JSON. Dynamically typed languages prefer plain hash tables. Statically typed languages need to parse and validate input data anyway, so the lack of types in JSON is not a hindrance.

Maybe for communication between trusted parties? Then I would use a binary format.


I really fail to see the point of this format.

If human readability was the point, then doing something different than expected is a really bad idea:

* "no" and "yes" as boolean values may save some bytes, but the tradeoff isn't worth it (and if filesize matters, use a binary format to begin with).
* Using angle brackets instead of double quotes to fence strings makes the format look noisy and means you have to remember two kinds of escapes if you want to use < and > in the value.
* The format isn't object oriented in any way. You can simulate that by putting maps into maps, of course, but no one will have fun reading or writing that in a text editor.

Type information is for parsers, not humans. JSON does this right, Protobuf does this right. UXF is just a compromise combining (only) the disadvantages of the two.

UXF is self contained, that's great, but in 99.9% of the cases where you need a DX format, sender and receiver already know the schema, so that definition block just adds bloat.

You can happily mix lists, maps and tables of primitive or compound types. And since stuff is typed instead of named, order matters and you end up addressing everything through positional parameters. That's going to be fun when using a text editor to write down something like a list of GPS coordinates (you are likely to confuse latitude and longitude).


I'm not going to write this by hand. So what's the advantage over schema'd XML?


Pretty cool I guess, although there sure are a lot of such formats now ... but I like the typed-ness.

I feel the spec warranted more discussion about why strings <look like this> instead of the way more common "like this", i.e. why angle brackets are used to quote strings. Probably to make it easier to embed quotes, but I'm not sure. It was rather surprising at least, although I guess you get used to it if you read a lot of raw files.


There are languages that use \q instead of \" as a representation for quote inside quoted strings; I personally like it, it really simplifies and speeds up searching for the end of a string during parsing, and makes regex-based processing much more reliable.
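A small Python sketch of why that helps, assuming a hypothetical dialect in which a literal quote inside a string is written as backslash-q. Because a raw quote character can then only be a delimiter, finding the end of a string is a single search with no escape tracking. The function names are made up for illustration.

```python
def end_of_string(text, start):
    """Return the index of the closing quote for a string opening at start."""
    if text[start] != '"':
        raise ValueError("expected an opening quote")
    # No escape handling needed: the next raw quote is always the terminator.
    end = text.find('"', start + 1)
    if end == -1:
        raise ValueError("unterminated string")
    return end


def unescape(body):
    """Expand backslash-q to a quote and a doubled backslash to one backslash."""
    replacements = {"q": '"', "\\": "\\"}
    out = []
    i = 0
    while i < len(body):
        if body[i] == "\\" and i + 1 < len(body):
            nxt = body[i + 1]
            out.append(replacements.get(nxt, nxt))
            i += 2
        else:
            out.append(body[i])
            i += 1
    return "".join(out)


source = r'name = "say \qhi\q" rest'
start = source.index('"')
end = end_of_string(source, start)
value = unescape(source[start + 1:end])
```

The same property is what makes a regex such as `"[^"]*"` safe for this dialect, whereas with backslash-quote escapes it would stop at the first escaped quote.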


Maybe poor-man’s («») for (as you say) nesting strings.


Can we just add a version indicator, an ISO8601 datetime type and some kind of constraint (regex or BNF) for existing types to JSON and call it done?


Ah, you mean YAML with JSON Schema?


Exactly :-)


Likewise, I had a go at trying to improve on JSON and this is my attempt:

https://github.com/tlocke/zish

Any comments / criticisms gratefully received.


This is very very good. Only thing I can think that it's missing (and maybe you have support for this and I misread your readme) is ordered maps (there's a better name for this but I'm blanking).

For example, imagine I have a compact append-only map format and I want to represent it like this (where the last tuple wins when you have a duplicate key, but earlier history is preserved):

    {
     score: 0,
     score: 1
    }
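As a rough illustration of those semantics, here is a sketch that uses JSON and Python's standard `json` module as a stand-in; it says nothing about how zish itself behaves. The `object_pairs_hook` parameter receives every key/value pair in document order, so duplicates can be kept as history while a plain load still gives last-one-wins.

```python
import json

text = '{"score": 0, "score": 1}'

# Default behaviour: a plain dict, so the last duplicate key wins.
latest = json.loads(text)

# object_pairs_hook is called with the list of (key, value) pairs in
# document order, which preserves the earlier entries as history.
history = json.loads(text, object_pairs_hook=list)
```

Here `latest` holds only the final score, while `history` keeps both entries in the order they were written.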


Arghhh.. this name collides with our tool UMLet's "UML XML Format" extension ".uxf", in ubiquitous use since 2001! :)


You're in luck though, because nobody is ever going to use this format.


No time zones on date/times? They should be added.


No time zones ever. If I see a date and it's X, I know that I only need to add y to get my timezone to know when this happened. But if anybody puts a timezone there, now I can't do it mentally.


What timezone do you assume the source datetime is in?


UTC, of course: the only sensible timezone to store date and time in.


The response that always comes up is “at 9am” in a TZ with DST. That doesn’t work with UTC.


Until everyone starts using stardate or something similar :)


GPS


Since there are three incorrect responses to this comment already:

Anyone saying “UTC” is wrong. Unambiguously wrong if offsets are supported, and in foolish contexts like this where offsets are not supported, still wrong due to common sense and custom.

If there is no offset, there is no offset. It’s what is commonly called a naive or plain datetime. How it should be interpreted is explicitly undefined if offset-capable, and implicitly undefined by strong custom if not offset-capable; but it will generally mean in the local time zone, whatever that is—and it could be relative to a particular machine or a particular user. This is often suitable for social use, but completely unsuitable for machine history-recording use.

So: the question is rhetorical, unanswerable, thereby demonstrating why nmz’s position is unreasonable.

(Actually, only probably unreasonable because nmz’s wording with its “X” and “y” is not clear and may be using the term “timezone” subtly—the trouble is it’s used to mean three different things: firstly and most properly, a name for a set of rules about which time offsets to use when, e.g. “Australian Eastern Time” or “Australia/Melbourne” as it’s called in the IANA Time Zone Database, which roughly means AEST (+10:00) for half the year and AEDT (+11:00) for the other half, but conveys the rules as they have been through time; secondly, a somewhat less correct colloquial usage, a named time offset, e.g. “AEDT” or “Australian Eastern Daylight Saving Time” for +11:00; and thirdly, fairly clearly into the realm of misuse but still very common, a time offset like “+11:00”. If nmz was using the term “timezone” more precisely to mean one of the named concepts and expressly not an offset, then yeah, times written that way do require memorising a whole database, whereas offsets are straightforward to calculate, though it’s definitely harder having to do two calculations than the just one if it starts at UTC.)


If you want to make this a nitpicking discussion about phrasing a provocative statement, sure. If we’re talking about what matters, I stand by the notion that any point in time should be recorded in UTC, full stop. Storing a different time zone only makes sense if something happens at the same time in multiple time zones, but use cases are few. In the vast majority of scenarios, calculating the offset of the client and showing the adjusted date is the correct solution.


The context was specifically the handling of times that don’t include an offset (or perhaps time zone, it was unclear). The correct answer there is not UTC (which is flatly incorrect, depending on a locally-enforced convention that incidentally deviates from the most common meaning of such time stamps), but rather “don’t enter that situation in the first place, because any other answer is wrong”.

—⁂—

For the rest of your statement: for times not tied to a particular location or time offset, you should always use UTC in the form of the offset Z, in ISO 8601/RFC 3339 terms, since specifying any other offset indicates that it means something. (Note that RFC 3339 tried to have -00:00 be the neutral offset and Z and +00:00 meaningful, but that is acknowledged to have failed, and so https://www.ietf.org/archive/id/draft-ietf-sedate-datetime-e... is updating it to match actual usage.) But for things that involve humans and are anchored to a particular time zone, using UTC and not storing a time zone is wrong: you should store the relevant time zone and (fallback) offset so that if the time zone definition changes (as they do, sometimes with less than a few days’ notice), future times can be corrected, which they can’t be if you anchored them to UTC. So: things like system logs, use UTC; online conferences, use UTC; location-bound conferences, use that location’s time zone; general user calendars, use the user’s time zone; calendars for companies that straddle time zones (or people that work across time zones): deliberately choose a time zone or offset to anchor things to (sometimes at the level of individual events), especially for the sake of recurring event periods if you use a time zone with DST.


It isn't nitpicking, it's really a different type: 15 oct 2022 at 19:01 CET with DST, i.e. an instant, vs. 15 oct 2022 at 19:01 in an unspecified time zone, i.e. a set of about 25 instants (one per possible timezone).

While a datetime without a timezone isn't terribly useful, distinguishing it from a datetime and timezone designation pair is the only correct type system.
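Python's standard `datetime` module happens to model this distinction, so a short sketch can show it. The `+02:00` offset is an assumption standing in for Central European summer time on that date, and the range of whole-hour offsets from -12:00 to +14:00 is a simplification of real-world offsets.

```python
from datetime import datetime, timedelta, timezone

# No offset in the text: the parsed value is "naive" (tzinfo is None).
naive = datetime.fromisoformat("2022-10-15T19:01:00")

# Offset present: the parsed value is "aware" and names exactly one instant.
# +02:00 is assumed here as the Central European summer-time offset.
aware = datetime.fromisoformat("2022-10-15T19:01:00+02:00")

# Python refuses to order the two kinds against each other.
try:
    naive < aware
    comparable = True
except TypeError:
    comparable = False

# Pinning the naive value to each whole-hour offset from -12:00 to +14:00
# gives a different UTC instant every time.
candidates = {
    naive.replace(tzinfo=timezone(timedelta(hours=h))).astimezone(timezone.utc)
    for h in range(-12, 15)
}
```

The aware value is one member of `candidates`; the naive value on its own cannot say which one, which is why treating the two as a single type loses information.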


Why UTC instead of TAI?


Because that’s not what TAI is designed for. UTC is for civil time keeping, and that’s what everything except highly-specialised stuff is tied to. The ISO 8601/RFC 3339 serialisation is UTC-based, so you’ll have to go heavily non-standard to use TAI, and you’ll find a complete lack of support in general date time libraries, so 37 second errors are sure to crop up all over the place if others ever touch things.


If you're most Americans, EST. If you're Apple, PST. If you're a technical person, UTC.


UTC


Especially on the timestamps, I find some of the design choices a little bit bizarre. Choosing only a strict ISO8601 format: awesome! Choosing to excise critical parts (representing timezones and fractional seconds): very unfortunate.

Chesterton's Fence (https://en.wikipedia.org/wiki/G._K._Chesterton#Chesterton's_...) is a very powerful design principle. They chose to put those elements into ISO 8601 for principled reasons: they come from pain. They embody responses to mistakes that I've made, and thousands of other engineers before me. Unless we fully understand the reason they were included, we shouldn't arbitrarily conclude "I haven't used it, so it must be useless."

Other than that, it looks like a clean spec, but I'm not personally convinced that it has enough incremental value over JSON or YAML to replace them in the human-readable exchange format space. It can be a little more concise, but if I'm making something for humans, clarity (typically) has more value than conciseness. Are there other compelling values that I'm missing?


Amazon’s Ion[0] uses ISO 8601 including fractional seconds and offsets as its date time format.

0 - https://amzn.github.io/ion-docs/


Does seem a surprising omission. I’d expect offset support, and like more recent fancier draft stuff from https://www.ietf.org/archive/id/draft-ietf-sedate-datetime-e..., Internet Extended Date/Time Format, where you can specify a named time zone rather than just an offset.


Yes, and some people need to use TAI instead of UTC.


Slightly reminded me of Carousel, the format that PDFs are written in.


Smart, although the use of >< for strings makes it stand out compared to less noisy formats like YAML, IMO. Also, doesn't some of the custom typed map syntax overlap with the untyped map syntax? Thanks.




