Linus Torvalds: “I'm happily hacking on a new save format using ‘libgit2’”

mrcharles · on March 7, 2014

On the game I'm currently working on, it's built very heavily around Lua. So for the save system, we simply fill a large Lua table, and then write that to disk, as Lua code. The 'save' file then simply becomes a Lua file that can be read directly into Lua.

This is absolutely amazing for debugging purposes. Also you never have to worry about corrupt save files or anything of it's ilk. Development is easier, diagnosing problems is easier, and using a programmatic data structure on the backend means that you can pretty much keep things clean and forward compatible with ease.

(Oh, also being able to debug by altering the save file in any way you want is a godsend).

JackC · on March 7, 2014

You probably know this, but remember that storing user data as code is a place where you (general "you") have to think very carefully about security.

Is there any way that arbitrary code in the file could compromise the user's system? If so, does the user know to treat these data files as executables? Is there any way someone untrusted could ever edit the file without the user's knowledge? Even in combination with other programs the user might be running? Are you sure about all of that?

Maybe Lua in particular is sandboxed so that's not a problem (beats me), but in general this is an area where safe high-level languages can all of a sudden turn dangerous. Personally I would rarely find it worth it.

normalhuman · on March 7, 2014

This is a good point, but I feel that discouraging this type of approach is not the way to go.

I apologise in advance for ranting... I hope this is not too off-topic, but instead a "zoom out" on the issue.

This touches on something deep and wrong about how we use computers these days. Computers are really good at being computers, and the amplification of intellectual capabilities they afford is tremendous, but this is reserved for a limited few that were persistent enough and learned enough to rediscover the raw computer buried underneath, and what it can do.

For example, I dream of a world where everything communicates through s-expressions, all code is data and all data is code. Everything understandable all the way down. Imagine what people from all fields could create with this level of plug-ability and inter-operability. We had a whiff of that with the web so far, but it could be so much more powerful, so much simpler, so much more elegant. All the computer science is there, it's just a social problem.

I understand the security issues, but surely limiting the potential of computers is not the solution. There has to be a better way.

haberman · on March 7, 2014

Lack of Turing-completeness can be a feature. Take PDF vs PostScript. The latter is Turing-complete and therefore you cannot jump to an arbitrary page or even know how many pages the document has without running the entire thing first.

By limiting expressiveness you also gain static analysis and predictability. It's not about limiting the potential of computers, it's about designing systems that strike the right balance between the power given to the payload and the guarantees offered to the container/receiver.

For example, it is only because JSON is flat data and not executable that web pages can reasonably call JSON APIs from third parties. There really is no "better way" -- if JSON was executable then calling such an API would literally be giving it full control of your app and of the user's computer.

andrewflnr · on March 8, 2014

If you have a nice data format like s-exprs, it's a fairly simple matter to just aggressively reject any code/data that can't be proven harmless. For example, if you're loading saved game data, just verify that the table contains only tables with primitive data; if there's anything else, throw an error. Then you can safely execute it in a turing-complete environment and be sure it won't cause problems.

Speaking for myself, in my ideal world this sort of schema-checking and executing is ubiquitous and easy. Obviously that's not the world today. While there are tools for checking JSON schemata there doesn't seem to be a standard format. I wonder how hard it would be to implement a Lua schema-checker.

espeed · on March 8, 2014

Have you checked out EDN yet (https://github.com/edn-format/edn)?

It's a relatively new data format designed by Rich Hickey that has versioning and backward-compatibility baked in from the start.

EDN stands for "Extensible Data Notation". It has an extensible type system that enables you to define custom types on top of its built-in primitives, and there's no schema.

To define a type, you simply use a custom prefix/tag inline:

  #wolf/pack {:alpha "Greybeard" :betas ["Frostpaw" "Blackwind" "Bloodjaw"]}

While you can register custom handlers for specific tags, properly implemented readers can read unknown types without requiring custom extensions.

The motivating use case behind EDN was enabling the exchange of native data structures between Clojure and ClojureScript, but it's not Clojure specific -- implementations are starting to pop up in a growing number of languages (https://github.com/edn-format/edn/wiki/Implementations).

Here's the InfoQ video and a few threads from when it was announced:

https://news.ycombinator.com/item?id=4487462, https://groups.google.com/forum/#!topic/clojure/aRUEIlAHguU, http://www.infoq.com/interviews/hickey-clojure-reader

andrewflnr · on March 8, 2014

I've looked at EDN a bit, even started a sad little C# parser. I don't see what it has to do with my previous comment, which is all about how schemas are potentially useful. I'm trying to say that after you check the schema, you don't just read the data, you execute it, and that has the effect of applying the configuration or just constructing the object.

eurleif · on March 7, 2014

>There really is no "better way" -- if JSON was executable then calling such an API would literally be giving it full control of your app and of the user's computer.

Of course there's a "better way": running the code in a sandbox. You could do so using js.js[1], for example. (Of course, replacing a JSON API with sandboxed JS code is likely to be a bad idea. But it is possible.)

[1] https://sns.cs.princeton.edu/2012/04/javascript-in-javascrip...

haberman · on March 7, 2014

You're right inasmuch as I shouldn't have implied that unsandboxed interpretation is the only option.

But my larger point still stands; the fundamental tradeoff is still "power of the payload" vs "guarantees to the container." Even in the case of sandboxed execution, the container loses two important guarantees compared with non-executable data formats like JSON:

1. I can know a priori roughly how much CPU I will spend evaluating this payload.

2. I can know that the payload halts.

This is why, for example, the D language in DTrace is intentionally not Turing-complete.

Someone · on March 7, 2014

I agree 100% with you, but #1 isn't completely true. The counterexample is the ZIP bomb (http://en.wikipedia.org/wiki/Zip_bomb) Whenever you unzip anything you got from outside, you should limit the time spent and the amount of memory written.

eurleif · on March 7, 2014

If excess CPU/non-halting behavior is the issue, you could run the code with a timeout.

haberman · on March 7, 2014

You could. But that has downsides also:

1. imposing CPU limits incurs an inherent CPU overhead and code complexity.

2. if those limits are hit, you can't tell whether the code just ran too long or whether it was in an infinite loop.

So now if we fully evaluate the options, the choice is between:

1. A purely data language like JSON: simple to implement, fast to parse, decoder can skip over parts it doesn't want, etc.

2. A Turing-complete data format: have to implement sandboxing and CPU limits (both far trickier security attack surfaces), have configure CPU limits, when CPU limits are exceeded the user doesn't know whether the code was in an infinite loop or not, maybe have to re-configure CPU limits.

Sure, sometimes all the work involved in (2) is worth it, that's why we have JavaScript in web browsers after all. But a Turing-complete version of JSON would never have taken off like JSON did for APIs, because it would be far more difficult and perilous to implement.

darkmighty · on March 7, 2014

I have to agree here. General Turing-completeness was known from the beginning to imply undecidable questions -- about it's structure, running time, memory and so on. I don't think this has a place as the 'data'.

Abstractions exist for a reason -- this is analogous to source/channel coding separation or internet layers. They don't have to be that way, but are there for a reason.

Someone could change my opinion, though. Provide me a data format which proves certain things about it's behavior and that would be a nice counterexample.

nextstep · on March 8, 2014

http://en.wikipedia.org/wiki/PostScript#The_language

PostScript is Turing-complete.

nkurz · on March 8, 2014

Join me on my crusade to eliminate the use of 'former' and 'latter' in any writing unless the goal is obfuscation. It's error prone, almost always requires rereading, and never is the clearest choice.

Here's my attempt at a clearer version:

"Take Postscript vs PDF. Postscript is Turing-complete and therefore you cannot jump to an arbitrary page or even know how many pages the document has without running the entire thing first."

haberman · on March 8, 2014

That's like campaigning to eliminate pronouns. "Former" and "latter" are just like "it", except they are for when you mentioned two things.

Repeating the proper noun doesn't achieve the goal of emphasizing the fact that you're referring to something you just mentioned.

nkurz · on March 8, 2014

Pronouns are fine. Substituting 'the first' and 'the second' would be an improvement. It's specifically 'former' and 'latter' that should be deprecated. I'd be interested in seeing a study comparing readers' comprehensions of the various phrasings. What cost in clarity would you be willing to pay?

johnminniton · on March 8, 2014

Did you even read his comment? That is precisely what he said!

"Take PDF vs PostScript. The latter is Turing-complete"

Danieru · on March 8, 2014

I think that is what haberman meant.

Back on topic: The reason for PDF's existence is to be a non-turing complete subset of postscript. Features like direct indexing to a page are why Linux has switched to PDF as the primary interchange format.

seabee · on March 7, 2014

In a world where users will willingly enter malicious code into their computers if they believe it will do something they want[1], can there really be a better way?

[1] https://www.facebook.com/selfxss

Crito · on March 7, 2014

When it comes right down to it, you can't fully protect people from themselves. Even in 'meat space', which the general population is presumably experienced with, people talk others into doing things that they should not all the time. Anything from social engineering to bog-standard scam artists masquerading as door-to-door salesmen.

Someone · on March 7, 2014

But in 'meat space', it is way harder and more expensive to do evil against large numbers of people. For example, phishing as done electronically (throw a very, very wide net, and hope) doesn't make economical sense if one had to do it manually.

Also, if, for example, cars and airplanes and banks and nuclear submarines would accept executable code as input, some people would do damage on a gargantuan scale.

Clearly, being liberal in what you accept must end somewhere. I argue that it should end very, very soon. Even innocuous things such as "let's allow everybody to read the subject of everyone's mail messages", if available at scale and cheaply, would entice criminal behavior, for example by those mining them for information that you are away from home.

Does anybody know how RMS thinks about passwords nowadays?

Crito · on March 7, 2014

I'm not saying that scams in "cyberspace" don't present a greater threat than scams in "meatspace". I'm just pointing out that if you cannot protect people from themselves in "meatspace", then doing it in "cyberspace" is futile. You can fight it and cut back on it, but you will never actually win that fight. Technological problems to what is ultimately a sociological problem only go so far, and we should be careful to not obsess over them to a fault.

Someone · on March 8, 2014

And yet, in "meatspace", chainsaws come with more safety mechanisms than butter knives because the damage they can do is so much larger. Yes, we can't win that fight; people will die from chainsaw accidents, but I disagree that we shouldn't be more vigilant about chainsaws than about butter knives.

hdevalence · on March 7, 2014

This seems like exactly the problem the parent post is complaining about, though: the people in the limited group the parent talks about aren't the people being tricked by selfxss, it's the people who don't have the technical knowledge to understand what the developer console does and why pasting in random JS might be a bad idea. So the phenomenon of selfxss reinforces the point.

davb · on March 7, 2014

Perhaps modern operating systems (or hardware?) need two modes - "Safe mode", where everything is sanitised, checked, limited and Secure Boot-style verified, and "Open mode" where it's not; where experts and enthusiasts can work without limit and without DRM.

Seeless · on March 7, 2014

That only adds an extra step to the social engineering process.

kalleboo · on March 7, 2014

Safe mode is called iOS

judk · on March 7, 2014

When my conputer has the potential to erase my bank account, I want to limit its potential.

saraid216 · on March 7, 2014

You're discouraging it in the wrong place.

Learning to swim is not done by throwing a kid in the deep end of a pool. Learning to code is not done by encouraging bad security practices.

wtallis · on March 7, 2014

On the other hand, taking the easy way out when this kind of security problem comes up leads to having a machine that's just an appliance and not a computer. If you've closed every local code execution vulnerability, you've probably rendered your system completely non-programmable and erected a monumental barrier to learning how to hack.

ufo · on March 7, 2014

Lua sandboxing is relatively straightfoward. You can choose what functinos from the standard library the script you are evaluating will see in its global scope. By passing an empty scope the only thing the evaluated script can do is build tables, concatenate strings, do arithmetic, etc. You only need to worry about DOS due to infinite loops but there are also workarounds for that).

In Loa 5.1 you can use setfenv http://www.lua.org/manual/5.1/manual.html#pdf-setfenv

And in Lua 5.2 the functions that eval strings receive the global scope as an optional parameter. http://www.lua.org/manual/5.2/manual.html#pdf-loadfile

chalst · on March 7, 2014

I trust Lua sandboxing. See, e.g.:

1. http://stackoverflow.com/questions/1224708/how-can-i-create-...

2. http://stackoverflow.com/questions/4134114/capabilities-for-...

I find it easier to trust Lua than similar facilities in other programming languages because the kernel of the language has a relatively simple semantics, so the TCB of a sandbox is lower, and the source is easier to understand than most other languages.

Note that sandboxing in Lua 5.2 has a still simpler semantics than for Lua 5.1 - few other languages evolve in a way that makes the language easier to trust.

malkia · on March 7, 2014

It's the halting problem - all it takes for someone to embed in your data code that loops forever (wait infinite), or recurses (crash), or discover some vulnerability... Not directly related to saving files as lua, but say as bytecode: https://www.youtube.com/watch?v=OSMOTDLrBCQ

unsigner · on March 7, 2014

Lua can be sandboxed so your data file can't call arbitrary functions (but can still call a controlled subset, e.g. a function called RGB that does r255255+g*255+b so your colors are somewhat human-readable in the file, yet 24-bit integers in memory).

But it's still code, so you can e.g. inject an infinite loop and the loader will hang. (You can protect against this, you can install a debug hook that gets called after every N instructions executed, and kill the loader.)

malkia · on March 7, 2014

Or HashDOS attack - http://lua-users.org/wiki/HashDos

Guvante · on March 7, 2014

Typically sandboxing is stage one of any lua implementation. You don't need raw IO access and rarely need to print to the screen for instance.

mrcharles · on March 7, 2014

Aware of all the issues and already have a plan. But Lua generally only has access to the APIs you give it; from game code our Lua VM has no access to the OS at all, just game functions, and those game functions are never system related.

The biggest 'concern' would be save hacking, but at the end of the day that will happen no matter what so it doesn't bother me much.

mattgreenrocks · on March 7, 2014

Preach it. My favorite persistence code: stuff that has nothing to do with SQL/NoSQL.

I leaned heavily on Python's pickle module for serializing a few thousand entities to disk a few years ago. By streaming them to the application at startup time, it remained plenty fast for all datasets it'd encounter. I intended to replace it with SQLite one day, but I never had to. I could just keep them all in memory.

I'd probably choose something a bit safer now, but it was hard to beat the simplicity.

3pt14159 · on March 7, 2014

I used to do that, but pickle bit me once. I think it changes between versioning or something. I had to start the statistical model from scratch.

mattgreenrocks · on March 7, 2014

Now that you mention it, I remember running into something like it. It's a big issue. That module had a bunch of scar code to migrate entities as they came up from disk.

rcxdude · on March 7, 2014

Yeah, I wouldn't use pickle for anything where backwards or forwards compatibility is important. It is however very convenient for 'I want to send/store this data for a bit' tasks.

dbaupp · on March 7, 2014

Why does using a Lua-basef format stop the files being corrupted?

politician · on March 7, 2014

Maybe he means that he doesn't have to deal with bugs in a custom binary serializer.

mrcharles · on March 7, 2014

It can only become corrupted by external factors; a lot of games I've worked on, in-game bugs could lead to corrupted saves being written out to disk. Since in this case we are just serializing lua data, unless the serializer itself has a bug, it will always write out correctly, and any issues become issues of game logic rather than anything else.

Igglyboo · on March 7, 2014

I don't think it does. I think he meant that if a save became corrupted it wouldn't do so silently, it would violently crash the game because of a syntax error.

Zecc · on March 7, 2014

It doesn't. But it makes them much easier to fix.

Edit: Igglyboo has a point too.

TheEzEzz · on March 7, 2014

I did this with C# in my last game. All the map/object editors output C# code on save, which was then included in the compiled code on the next build. The beauty is that your "data" files get automatically updated when you refactor your regular code! On top of that loading is faster, because you don't need to worry about fetching a file and parsing it, the whole thing is just compiled code embedded in your executable.

Tanner · on March 7, 2014

That Lua was originally designed as a configuration language becomes really clear when you start doing things like this. Having my code and configuration being separate but equal was really a paradigm shift for me.

Also, the Tiled Map Editor exports directly to Lua.

agumonkey · on March 7, 2014

IIRC that's how Office and Photoshop file format started. I think it's a nightmare for compatibility in the end.

frik · on March 7, 2014

So, it's similar to JSON (JavaScript), but valid Lua syntax.

  local t = {}
  t = {["foo"] = "bar", [123] = 456}
  t.foo2 = "bar2"

ilovecookies · on March 7, 2014

While one could say that this is about savefiles for games, I would say it's implications could more be about savefiles for software projects. If you are building the game in LUA, of course LUA is going to be the preferred way to save your game in since you are already using LUA objects and interpreting files in that language will be easy to integrate.

If you ever used maven xml configs, java object marshalling or c# xml you would understand the pains of using xml as a file format for software projects and data representation. You have to find a solution that is language agnostic, neither LUA or JSON is.

balls187 · on March 7, 2014

I did something similar, but used JSON instead (pretty trivial to (de)/serialize LUA tables to JSON. This made it easy to send data to the server, and inspect with standard tools as well.

seivan · on March 8, 2014

This sounds a lot like NSCoding for Objective-C (Cocoa). Though you'd still have to define the types/classes and name for each property you want to save. But you could technically save it in a big blob, and then read it into memory as you resume.

Could persist to disk as a binary, sql or a plist (xml).

I guess the only downside is, that if you got a lot of composite classes all with their own properties and associations (say a graph), there's a lot of manual work to be done.

hootener · on March 7, 2014

I've had to write output save file formats for various projects on several occasions, and it never occurred to me to take this approach.

Thanks for sharing this, it's one of those ideas that (to me) seems so brilliant in its simplicity that I probably would've never thought of it.

Any hiccups in the day-to-day work using this approach? I'm just trying to get a better idea of the workflow since I'm very seriously considering applying it to my next project.

mrcharles · on March 7, 2014

The biggest hiccup is almost a literal one; serializing large lua structures and then writing them to disk can take a lot of time. But this can largely be mitigated by just saving compiled lua instead of text lua.

Touche · on March 7, 2014

That's how people are going to cheat at your game.

roryokane · on March 7, 2014

I had a lot more fun recently playing the free game Boson X for PC (http://www.boson-x.com/) than I would have otherwise because I discovered that the game folder contains editable Lua scripts. The scripts control the game physics, scoring system, controls, level data, and more.

I’ve created mods of the game where you fun faster but gravity is stronger, and where all levels are randomly mixed into one level, and where the dangerous falling platforms also give you energy while you’re on them, and where the sound effects give the player clearer feedback on what they’re doing. And though I could cheat by multiplying my score by 1000 and submitting it online, I actually have been careful to always comment out the high-score saving and submission code in each of my mods.

I like the game much more than if the developers had obfuscated the Lua files so I couldn’t read and edit them.

outworlder · on March 7, 2014

The save format does not matter at all. It wouldn't matter even if it were an obscure, made-up format. All it would do is slow down 'cheaters' by half an hour.

The only argument against human-editable text files is parsing speed, not security.

dinkumthinkum · on March 8, 2014

Data size has a bit to do with it. Not trying to be pedantic, just adding that.

phn · on March 7, 2014

Cheating is good, I remember having tons of fun with age of empires and sim city because I used cheat codes.

If the player has fun, it's a nice feature! :D

dfc · on March 7, 2014

And what about the people competing against the happy cheater?

baq · on March 7, 2014

in a single player game with save games on local disk? this question is nearly trolling.

ANTSANTS · on March 7, 2014

Some people do compete for speed or score in single player games. Arcade games have always had scoreboards, modern "arcade-style" games have online ones, and a community can turn any solitary activity into a competitive one:

http://speedrunslive.com/

http://speeddemosarchive.com/

Speedrunners are an exceptional case, but I think everyone gets a little annoyed when they look at a leaderboard and all the top players have scores of UINT_MAX or times of 0 seconds.

Obviously cheaters will find a way regardless of whether you give them the source code or not, I'm just saying dfc's concern is not totally ridiculous.

phn · on March 8, 2014

I DO get annoyed when I see those scores, but in a lot of cases even having a leaderboard is just something that was introduced in the game just to be more "social" and less because it makes sense in that specific game.

And yes, it's not ridiculous, on the contrary, it's perfectly understandable.

Of course, these kinds of questions depend a lot on the game in question, and I think they don't have a definitive answer :)

baq · on March 8, 2014

oh don't get me wrong, i love speedrunning and trickjumping competitions, except that they always should require the whole replay - and even then you can't be sure if the whole thing was or wasn't TASsed.

phn · on March 8, 2014

Of course, in multi-player competitive games anti-cheating is a pretty big concern, because it works against the purpose of the game: a competition with well defined rules and conditions.

If the core of the game is single-player/non-competitive, why should we be so worried about cheating?

jethro_tell · on March 7, 2014

Plot twist, It's actually a 'teach yourself lua game'

chris_mahan · on March 7, 2014

so level 38 is "figure out how to write the answer by editing the save file?"

arh68 · on March 8, 2014

Why not level 7? By 38 the player could be fixing bugged waypoints, profiling out bad O(n!) code, resetting time to get a specific time-based drop, tweaking character attributes to make puzzles/quests easier, etc. The savefile just becomes another interface to play with.

Gives a fresh angle on the 'open world' type game.

chris_mahan · on March 10, 2014

Reminds me of tweaking the Colonization game (1991?) by editing text files...

jiggy2011 · on March 7, 2014

Does it matter unless the game is multiplayer, in which case you should assume that client files are untrustworthy anyways.

mrcharles · on March 7, 2014

There are ways around it but if people want to cheat their own SP experience who am I to stop them? We'll obfuscate a bit to dissuade casual users but I don't know that I've ever encountered a game that didn't have some level of save hacking available.

Hell, I've used it myself more than a few times.

6d0debc071 · on March 7, 2014

Hash the information and include the hash in the file. If the hash and the contents don't match when you try to load it, you can refuse it.

If not loading things is important to you, mind.

scott_karana · on March 7, 2014

What's to stop people from re-hashing the changed file? :)

roryokane · on March 7, 2014

You could hash the file contents plus a salt that is contained within the application binary. Most people wouldn’t know how to extract the salt from the binary so they could get the hash right. Though I guess if people are determined enough to hack your game, one hacker might just publish the salt or a small program to rehash files for you. If a user has enough time on their hands, there’s nothing you can do to stop them from running your software in a VM with a debugger and finding out its secrets; the best you can do is making that hard enough that people won’t bother.

6d0debc071 · on March 7, 2014

Lack of knowledge, lack of interest.

If you wanted more security, you could keep a secret that you don't include in the saves but do include when you calculate the hash, so that anyone who doesn't have the key is going to get the wrong answer. That's about as far as I'd consider going for relatively trivial data like save games. Though that's, in principle, discoverable if someone's sufficiently interested.

After that point, it becomes much simpler for someone to watch the memory associated with your program and extract/alter the values there. (Programs to do that to games are generally called Trainers.) That's not a complicated thing to do unless someone's tried to stop you doing it.

There are some techniques to provide some degree of security there. Changing where in memory you place your information each time springs to mind, thus making it more difficult for people to find out where the values are and then share their locations. However, even that's not perfect. Depending on how sure you want to be that no-one's going to alter the values, you're potentially looking at requiring very deep knowledge of security there.

After that point the next easiest target may be the program file itself.

That said, if you want to get around that sort of problem and you're really serious about it, then running your encryption in an environment that hostile may be making things more difficult than they need to be. You might use a trusted platform module, to try to make the environment you were in less hostile, if one were present on the user's machine. But, honestly, I'd want the information not to be stored or calculated on the user's machine if it were that valuable. Have the user's end be the input, encrypt their signals with your public key, and do the calculations that you needed to be sure of remotely.

Though then the user has to trust you. I wouldn't usually advocate that my users trust me that much - not unless we were dealing with a situation where the information we were talking about was entangled with others in some way such that a reasonable argument could be made that they didn't own it, and I was just the best common arbiter I could think of.

-she shrugs awkwardly-

You can get yourself into a situation where it's probable that the amount of effort someone would have to invest is vastly greater than the likely value of the information fairly easily. But ultimately it's a question of how expensive you want to make things and what that's worth to you. Against a sufficiently dedicated adversary, with a sufficiently valuable target, there are so many unknowns in computer security that I wouldn't even be sure that storing the data on your server would be sufficient ^^;

saucetenuto · on March 7, 2014

It's only cheating if the developer disapproves.

bhaak · on March 7, 2014

What's with all the XML hate? Of course, doing everything in XML is a stupid idea (e.g. XSLT and Ant) and thanks heaven that hype is over.

But if I want something that is able to express data structures customized by myself, usually with hierarchical data that can be verified for validity and syntax (XML Schemas or old-school DTD), what other options are there?

Doing hierarchical data in SQL is a bitch and if you want to transfer it, well good luck with a SQL dump. JSON and other lightweight markup languages fail the verification requirement.

rbehrends · on March 7, 2014

XML is unnecessarily verbose, for the supposed sake of human readability. But used as a serialization format, it isn't really readable or editable by humans (except in the sense that a Turing machine is programmable): remember that the ML in XML stands for "markup language", and SGML, its predecessor, was designed as a way of marking up normal text, not littering data with angular brackets and identifiers. (XML/SGML arguably isn't that hot as a markup language, either.)

If you really need a hierarchical serialization format that is "verified for validity and syntax", the problem is that XML has prevented the adoption of something better (because it was "good enough").

If you don't need that, then XML is overkill and bloat and makes your format less readable than it could be. And you rarely need it, because either your data is computer-generated and -read, so there's little point in putting in extra schema checks, or schema verification is woefully insufficient (because it can't verify the contents of fields, relations between fields, or a ton of other stuff that can accidentally go wrong).

_pmf_ · on March 7, 2014

You fail to address OP's question:

> But if I want something that is able to express data structures customized by myself, usually with hierarchical data that can be verified for validity and syntax (XML Schemas or old-school DTD), what other options are there?

bhaak · on March 7, 2014

He actually did address my question in a way: "[...] XML has prevented the adoption of something better (because it was "good enough")."

Which IMO is a sensible way looking at it. I too think XML is not perfect but if all the other stuff we're stuck with currently would be as good enough as XML, IT would be a place with less WTFs all around. ;-)

rbehrends · on March 7, 2014

I did address it. Did you read my comment until the end?

k__ · on March 7, 2014

> the problem is that XML has prevented the adoption of something better

What would be better?

rbehrends · on March 7, 2014

That depends on your specific goals. But essentially, XML schemas are sort of like attribute grammars, except with an unnecessarily convoluted syntax, and yet more limited in their expressiveness than attribute grammars (because whatever constraints you need have to be procrusteanized into XML schemas).

Even if you were to stick with XML semantics as is, you could improve the syntax to be actually readable and eliminate the angle bracket tax [1, 2].

[1] http://blog.codinghorror.com/xml-the-angle-bracket-tax/

[2] http://blog.codinghorror.com/revisiting-the-xml-angle-bracke...

draegtun · on March 7, 2014

I think S-expressions would have been better.

Alternatively Carl Sassenrath was pushing Rebol in the past. See his blog post "Was XML Flawed from the Start?" - http://www.rebol.com/article/0108.html

Update: Just posted the above blog link to HN: https://news.ycombinator.com/item?id=7361260

groovy2shoes · on March 7, 2014

I'm a fan of edn myself. https://github.com/edn-format/edn

lmm · on March 7, 2014

If used sensibly XML isn't too bad. But there's a whole lot of cruft in the standard that seems to do nothing except make it harder to use. Part of this is a problem with popular libraries rather than inherent to the format, but we judge a thing by its ecosystem rather than in isolation. So: namespaces are a pain, making it much harder than it should be to just make my xpath work. DTDs are annoying, especially when a production system breaks because a remote server that was hosting a DTD goes down so now your parser refuses to load a file. User-defined entities seem pointless, and though most parsers can handle the billion laughs these days it wasn't always so. The handling of text nodes is confusing; whitespace is irrelevant except when it isn't. Specifying the encoding inside the document itself seems wrong, and supporting multiple encodings at all causes trouble (e.g. sometimes it's simply impossible to include one document in another inline).

Is XML schema really so much better than e.g. JSON schema?

To me it feels like there's an impedance mismatch between the kind of structures XML lends itself to and the kind of structures programs are good at dealing with. So for program-to-program communications with a certain level of validation I find Protocol Buffers is a much better fit. Conversely in cases where human readability is really important, XML isn't good enough compared to JSON.

saurik · on March 7, 2014

> So: namespaces are a pain, making it much harder than it should be to just make my xpath work.

Namespaces exist to solve a real-world problem that happens in real-world use cases (SVG embedded in HTML, HTML embedded in RSS). While it would be nice to look at things that are complex and say "it would be less complex for these trivial cases without this feature", in reality there are then common use cases that become more complex or even impossible in the general case, which seems like a very short-sighted benefit. Namespace prefixes are really not that difficult to configure, and once configured XPath makes them very easy to use :/.

vidarh · on March 7, 2014

The biggest caveat with namespaces is that most people have never bothered figuring out how they work. The number of applications I've seen that have hardcoded namespace names instead of looking up the namespace uri for example, is horrifying.

lmm · on March 9, 2014

Namespace prefixes are not that difficult to configure once you know about them. But if you're just starting with XML, probably because you need to extract some information from a document you've been sent, you don't want to learn the theory of XML, you want to get the data you need out and get on with adding business value. So you find a tutorial, you write an xpath, and it doesn't work. You try removing the foo: prefixes in your xpath, and it still doesn't work. This is not the experience that a technology should give new users. A default of matching ignoring namespaces would not make anything impossible.

rcxdude · on March 7, 2014

Indeed. XML gets a lot of hate because it's so difficult to use. It would be fine if you could use it without having to care about the 100 features you don't care about and just use the ones you need, but pretty much every library I've seen makes parsing (or generating) a document a huge and complicated task, and most of it is completely irrelevant to the problem I'm trying to solve.

And because of this almost no-one bothers to actually handle it properly so you often can't actually use the advanced features even if you wanted to.

rossjudson · on March 8, 2014

This varies greatly from framework to framework, and language to language. On the JVM at least, the dark machinery that handles the XML is rather rigorously correct. Parsing and generation are trivial, especially using JAXP. You have multiple ways of working with XML (objects, DOM, push, pull).

XML is "good enough" for a lot of cases. There are lots of tools to mess around with it too, which is really quite valuable when you're experimenting with various kinds of data or you're debugging. Being able to extract out stuff you're interested in XML format means you can perform a lot of complex manipulations quite easily.

Sharlin · on March 7, 2014

The issue is probably that 99.999% of all XML use cases don't use (or need) the verification aspect. For all of those, XML is overkill. Besides, surely it would be possible to design a verification layer on top of JSON, for instance - the fact that one does not currently exist does not mean that XML (and abuse of XML!) should not be criticized.

bananas · on March 7, 2014

One of the core aspects of XML that is really important is that no typing is inferred by the structure of the file unlike JSON. JSON is by nature tied to the JavaScript type system which is sparse and inaccurate. For example, if you look at the following:

   { "name": "bob", "salary": 1e999 }

Ah crap! Deserializer blew (in most cases silently converting the number to null)

   <person>
      <name>bob</name>
      <salary>1e999</salary>
   </person>

No problem. The consumer can throw that at their big decimal deserialiser.

And the following is not acceptable as it breaks the semantics of JSON and requires a secondary deserialisation step as strings ain't numbers...

   { "name": "bob", "salary": "1e999" }

JSON is a popular format but it's awful.

stingraycharles · on March 7, 2014

I think it's refreshing to hear someone advocate XML instead of JSON, specifically because you bring up a good point.

The problem I think is that just because XML is human-readable, it's less sufficient as a format that is human-writable (I'm looking at you, Maven!). I believe this is the root cause that many people hate XML, even though it has a very sweet spot in application-to-application communication.

Perseids · on March 7, 2014

I would even argue that XML is not even that human-readable. Take a look at this pom: https://maven.apache.org/pom.html#The_Super_POM . Even with syntax highlighting it is extremely difficult to parse visually. Compare that to nginx's custom config file format: http://wiki.nginx.org/FullExample .

ollysb · on March 8, 2014

If you take the brackets and the closing tags out (use meaningful space) it's a hell of an improvement[1], . A format I really like (ok it's aimed at html not xml) is the slim templating language[2]. It manages to pack the same information in but is massively more readable.

[1] https://gist.github.com/opsb/9424457

[2] http://slim-lang.com/

stingraycharles · on March 8, 2014

Yeah this is exactly where my hate towards Maven configuration comes from, but it's more a testimonial of a bad fit for configuration files than critique towards XML. Java enterprise application configuration has the tendency to be very "expert-friendly", and this is where XML got its bad name from.

mattfenwick · on March 7, 2014

> Ah crap! Deserializer blew (in most cases silently converting the number to null)

Right -- the parser blew it. That many implementations do this is frustrating (and caused me so many problems that I ended up building my own validator for problems like this: http://mattfenwick.github.io/Miscue-js/).

JSON doesn't set limits on number size. From RFC 4627:

An implementation may set limits on the range of numbers.

It's the implementation's fault if the number is silently converted to null.

I guess we need better implementations!

> JSON is a popular format but it's awful.

If you're willing to take the time to share, I'd love to hear more examples of JSON's problems. I'm collecting examples of problems, which I will then check for in my validator!

stegro32 · on March 7, 2014

If you're looking for examples of problems, RFC7159 (http://rfc7159.net/rfc7159) is a good place to start - just search for 'interop', as suggested by [1]. A quick look at Miscue-js suggests you already check for most of them, but you might still find something new.

[1] http://www.tbray.org/ongoing/When/201x/2014/03/05/RFC7159-JS...

giblaz · on March 7, 2014

Your example doesn't do anything but make XML look as bad as your saying JSON is. Think about it again - do you think your first XML example doesn't ALSO have to be deserialized twice (once into an XML in memory tree, once into a number)? It does. Also, both examples will fail if you try to deserialize either of them into numbers...

Regardless, JSON is so much more readable that I'm very glad it's pushed XML out of the picture for the most part.

bananas · on March 7, 2014

Actually no you couldn't be more wrong.

XML can be read as a stream and at certain points like after reading an element or attribute, an object can be created on the fly or a property on an object set and the type deserialised at the same time. The types don't have to be native types either; they can be complex types or aggregate types such as any numeric abstraction or date type you desire.

See java.xml.stream (Java) and System.Xml (CLR) for example.

As for readability, some XML is bad which is probably what you've seen but there's plenty that's well designed.

XML is afflicted with piles of criticism which usually comes from poor understanding or looking at machine targeted schemas that humans don't care about.

You'd complain the same if you looked at protobufs over the wire with a hex editor.

icebraining · on March 7, 2014

And the following is not acceptable as it breaks the semantics of JSON and requires a secondary deserialisation step as strings ain't numbers...

XML strings ain't numbers neither. You can throw a big decimal deserialiser (e.g. as a custom deserialization adapter) at a JSON document as well.

bananas · on March 7, 2014

Let's break this down into two statements:

XML doesn't have strings (or types at all really)

JSON strings are strings.

There is a massive semantic difference here when it comes to parsing.

sanderjd · on March 7, 2014

What is that massive semantic difference? If you want the number represented by 1e999 as the value for salary, at some point, something has to take "1e999", whether you call it a string or a something-with-no-type, and turn it into a number. Your deserializer has to know to do that in either case.

bananas · on March 7, 2014

As follows. It's more how the abstraction works.

XML:

  ->[byte stream]->[deserializer]->[bignum]

JSON:

  ->[byte stream]->[json reader]->[string]->[deserializer]->[bignum]

The latter is, well, wrong.

icebraining · on March 7, 2014

Multiple JSON deserializers have that mapping integrating, eliminating those steps. See, for example, the ContextualDeserializer in Jackson.

sanderjd · on March 7, 2014

How does the [deserializer] step in the XML example know to call into [bignum], and why can't the [json reader] in the JSON example have that knowledge in the same fashion?

jdbernard · on March 7, 2014

Because the XML document has a semantic meaning that is specifically designed for this application. It may even have a schema definition document which formally defines what types to expect. JSON, by contrast, has type definitions imposed on it by its nature as JavaScript code.

sanderjd · on March 8, 2014

I've sort of lost track of what this debate is about... Assuming you don't have a schema definition, it seems to me that you can just as easily parse `{ "salary": "1e999" }` with application-encoded semantics as `<salary>1e999</salary>` with (again) application-encoded semantics. Maybe having a formal schema definition is a win, though.

jules · on March 7, 2014

The equivalent of your XML would be:

    {"name": "bob", "salary": "1e999"}

snowwrestler · on March 7, 2014

I believe that creates a string with the characters "1e999", not the number 1e999.

beagle3 · on March 7, 2014

Same as the XML

snowwrestler · on March 7, 2014

I don't think XML does either by itself. The schema will determine which fields are parsed as strings and which are parsed as numbers.

beagle3 · on March 8, 2014

iff you have a schema, and a parser that actually uses it. I've seen a few DTDs but the vast majority of XML documents don't have a schema or even a DTD to follow.

And the vast majority of parsers will not parse anything for you, regardless of schema definitions.

Which effectively puts you in the same place as the JSON string.

jules · on March 7, 2014

Exactly.

pierrebai · on March 7, 2014

Either to author of the serialized data realized that the numbers could overflow a float or didn't. This is independent of serialization format.

In your contrived example, somehow, the user of JSON didn't realize the salary could overflow a float. (OTOH, he succeeded in serializing it, mysteriously.) All the while, the XML user was magically forward thinking and deserialized the value into a big decimal. Your argument simply hinges on making one programmer smarter than the other. If one knows that a value will not fit a float, the memory representation won't be a float and the serialization format won't use a float representation. It has nothing to do with JSON vs XML.

edraferi · on March 7, 2014

This. Types are a huge pain in JSON, particularly the lack of a good date time type. BSON fixes tips, but only of you're using MongoDB and are willing to give up the "human readable" requirement outside of mongo.

halflings · on March 7, 2014

JSON's semantics is that you represent numbers by their decimal representation.

In this particular case, you're giving a different representation, so of course you an pass it as a string.

andor · on March 7, 2014

His point was that this number is too large to store it in a Javascript Number variable (which is a IEEE 754 double).

shawnz · on March 7, 2014

OK, so the provided number format is not sufficient for the kind of numbers he is trying to deal with. So instead you would represent it as a string and handle the encoding/decoding of that number yourself. How is that different from the XML way where there is no provided number format to begin with, and everything is a string?

boomlinde · on March 8, 2014

That's completely irrelevant. Grok the JSON specs and reconsider what the javascript number format has to do with it.

DougBTX · on March 7, 2014

1e99 is valid JSON, that isn't what he is complaining about. See: http://json.org/number.gif

_delirium · on March 7, 2014

People seem to prefer JSON, but I don't find it any better to hand-write/hand-edit than XML. If anything it's slightly worse, because it has more syntax edge cases.

bananas · on March 7, 2014

And it doesn't support the multitude of accurate numeric types that XML does implicitly. XML data is not just "strings", it's a sequence of characters. The deserializer determines what sort of type it is based on either the structure or the language's capabilities. With XML, you can define these policies. With JSON you're stuck with JavaScript being the semantic standard and type definitions which ties you to floats or numbers inside strings. The latter is criminal.

Edit: clarification as HN won't let me reply any more.

icebraining · on March 7, 2014

How so? XML by itself only supports strings; any other data types have to be derived from a schema. But you can do the same with any other format that supports strings, including JSON.

bhaak · on March 7, 2014

But in the design of XML this was already acknowledged.

That's why there is the distinction between well-formed and valid XML documents. Only with valid XML documents there is a schema attached that will describe these nodes with the type attribute. And because it is extendable, these types can be anything but they will be automatically validated by the parser.

JSON OTOH doesn't have this extensibility. There are a couple of predefined types but if you need to go beyond them (and this happens all the time because JSON doesn't even define a date type!) any interpretation is up to the parsing program and this can vary tremendously (again, look at the handling of dates and for example the questions on stackoverflow about them).

icebraining · on March 7, 2014

Only with valid XML documents there is a schema attached that will describe these nodes with the type attribute.

http://json-schema.org/

What's the issue?

bhaak · on March 7, 2014

It's still a draft (and if I may nit-pick, an expired draft).

It only has "complete structural validation". Which means it doesn't feature custom types.

Although it adds a workaround for the date issue by adding a handful of supported sub-types (http://json-schema.org/latest/json-schema-validation.html#an...)

It is far from what validation XML Schemas offer.

Patient0 · on March 7, 2014

JSON is explicitly not designed to be hand-editable. Hence, for example, no comments.

It's just meant to be human readable.

If you want human editable "json", use Yaml: http://www.yaml.org/ (it's a superset of Json that adds comments, linking etc.)

hrjet · on March 7, 2014

I find HOCON better than YAML:

https://github.com/typesafehub/config/blob/master/HOCON.md

oneeyedpigeon · on March 7, 2014

How is YAML a superset of JSON? Do you mean 'conceptually'?

turnip1979 · on March 7, 2014

To be specific, JSON syntax is a subset of YAML version 1.2.

However, I hate YAML with a passion. It is worse than XML in my books. I can usually read JSON fine. I can also read XML in many cases. For the life of me, I just can't read YAML. It has something to do with "-", line indentation and different ways of writing lists.

Of course, someone will say YAML is technically better ...

ctb_mg · on March 7, 2014

Same here, it is very difficult for me to tell levels of nested structures in yaml. Though I'm sure if I sat down and read up on it I could force it into my brain. But shouldn't it be intuitive to read without that?

snogglethorpe · on March 8, 2014

Precisely.

Python has exactly the same problem -- control-structure nesting quickly gets confusing and hard to read beyond a certain (fairly small) size -- but at least with python, you have the option of splitting off stuff into separate functions to limit the amount of nesting and size of blocks.

buster · on March 7, 2014

Different for me, i would prefer YAML over XML or JSON

turnip1979 · on March 7, 2014

Do you use your naked eyes or do you have any tool recommendations? I don't see YAML going away so I'd better deal :)

_delirium · on March 7, 2014

It's technically true because YAML includes an alternate "inline style" that lets you write objects in JSON syntax. Therefore any JSON object is a valid YAML object as well. But, not an idiomatically written YAML object, since writing YAML using only inline style is unusual.

icebraining · on March 7, 2014

No, it is a superset. Every JSON document is a valid YAML document.

mattfenwick · on March 7, 2014

> because it has more syntax edge cases

Could you provide examples? I'm trying to collect more examples for a JSON validator -- http://mattfenwick.github.io/Miscue-js/ (built during a big project using JSON, after I started running into some issues that I couldn't check using other validators)

I'd love to hear more examples if you're willing to share.

icebraining · on March 7, 2014

It does exist: http://json-schema.org/

nly · on March 7, 2014

I personally miss having schemas and XSLT in JSON.

> doing everything in XML is a stupid idea (e.g. XSLT and Ant)

XSLT actually made a lot of sense. If everyone writes code to transform format1 to format2 then what you end up with a lot of slightly different transformations. Its main downfall, just like XML itself, was that it was annoying and time consuming to write.

How would you replace all this if you moved away from XML?

http://git.hohndel.org/?p=subsurface.git;a=tree;f=xslt;hb=HE...

_pmf_ · on March 7, 2014

> Its main downfall, just like XML itself, was that it was annoying and time consuming to write.

And impossible to debug. Write once, do something else for some weeks, and trying to understand what you were doing at a later point is nearly impossible.

allochthon · on March 7, 2014

There are schemas in JSON (see, for example, Kwalify) although they are not something that is built into the specification. I don't think the equivalent of XSLT is as necessary when the document readily translates to data structures in a scripting language.

eponeponepon · on March 7, 2014

> it was annoying and time consuming to write.

It remains annoying and time-consuming :) But there's no better option for reliably creating valid EPUBs to a predetermined business specification.

sparkie · on March 7, 2014

The problem with XSD and DTD is they only offer primitive ways to validate data, and it takes significant effort to validate some data (eg,[ https://stackoverflow.com/questions/3382944 ]). As a result, there have been a bunch of other XML schema validators created to counter these problems, but we should really ask why we need to keep inventing new languages when the existing ones turn out to be insufficient.

If we start out instead with something that's turing complete and simple to begin with (perhaps S-expressions?), we can (often trivially) write our own validators/type-checkers, or any other processing tool to verify the document structure, with few or no constraints, and without requiring the effort and expertise to parse complex syntax.

tomphoolery · on March 7, 2014

Unfortunately, XML is way too open-ended for my tastes. You end up getting entire rows of DB content (with full text paragraphs and everything) entirely in one tag, with attributes and values. There are so many options that you typically get a lot of idiot programmers who don't understand the purpose of all the shit in XML, so they fuck up their implementation.

Simply put, XML does not correctly model the data by which we intend to interchange. It was a noble effort, but it didn't come from a place of innovation. It came from corporate needs for standardization.

hardwaresofton · on March 7, 2014

S-expressions are also a reasonable choice, with some well-placed carriage returns and a serialization implementer that names things well

seanalltogether · on March 7, 2014

My biggest hatred of xml as a data structure, and believe me I've seen this in production systems more then once, is that it allows for the following.

  <customer>
	<account>
		<type>Personal</type>
		...
	</account>
	<account>
		<type>Business</type>
		...
	</account>
	<custid>496F3AB</custid>
  </customer>

This may seem innocuous, but XML allows mixing of arrays and objects too liberally, and makes automatic parsing overly complex. At first <customer> appears to be an array of account objects, but wait now that we reach the end we find that <customer> is an object with multiple keys and must create an unnamed array key to hold accounts.

XML is a document markup language, not a data format.

tokenizerrr · on March 7, 2014

Well yes. The problem there is that someone made a bad decision on how to structure their XML. If the same was done like this:

      <customer custid="496F3AB">
    	<account>
    		<type>Personal</type>
    		...
    	</account>
    	<account>
    		<type>Business</type>
    		...
    	</account>
      </customer>

it would make a lot more sense, I think.

sixbrx · on March 8, 2014

Which problem XML makes all too easy.

The really annoying issue is as the parent says, that the accounts collection does not have a name. This means there's no canonical mapping for the structure into a programming language object, which necessitates that libraries require annotations or some other side-channel way of specifying how to wrap the accounts into a collection.

In Jaxb e.g., how many times must we add junk like:

  @XmlElementWrapper(name = "accounts")  ?

In any individual case the workaround is easy, but it's annoying to have to do it repeatedly.

XML really is better as document markup than structured data representation.

icebraining · on March 8, 2014

Or:

      <customer>
        <custid>496F3AB</custid>
        <accounts>
           <account>
    	     <type>Personal</type>
             ...
    	   </account>
    	   <account>
    	     <type>Business</type>
    	     ...
    	   </account>
    	</accounts>
      </customer>

mahyarm · on March 7, 2014

People dislike XML because it's way overkill for %99 of people's use cases but it still gets used anyway! Most people who use it should of been using something simpler like JSON to create their configuration file or return their list of strings in some HTTP API. You can have bloody security vulnerabilities with XML, like you had recently with facebook: https://www.facebook.com/BugBounty/posts/778897822124446

The likelihood of a JSON feature biting you in the ass like that is far lower. Don't use XML until you actually need something XML SPECIFICALLY provides.

Also JSON easily translates with easy to work with dictionaries and lists, XML parsers take more code to work with equivalent items.

throwaway5752 · on March 7, 2014

He hates it, but he's using it. Take it for what it's worth.

lucian1900 · on March 7, 2014

Similarly, what's with the binary format/protocol hate? That's what brought us our current extremely common mishandling of text encoding.

wtbob · on March 10, 2014

> But if I want something that is able to express data structures customized by myself, usually with hierarchical data that can be verified for validity and syntax (XML Schemas or old-school DTD), what other options are there?

S-expressions work great. Syntax checking is far simpler, and validity checking is hence something you can roll yourself (and writing an S-expression schema checker ain't tough).

ddevault · on March 7, 2014

I haven't used it myself, but lisp seems well suited to the task. I've also heard good things about yaml, which is more well-supported by your language of choice.

norswap · on March 7, 2014

I'd say use JSON, but JSON doesn't have adequate schemas yet (JSON-Schemas is crap).

outworlder · on March 7, 2014

Because XML solves a 'problem' in the worst possible way. It is not that easy to parse for machines and only the simplest XML files are readable by humans.

Besides, since 1960 or thereabouts we have S-Expressions. The world should just have used that without reinventing the wheel once again.

bananas · on March 7, 2014

I think this title is wrong.

Firstly some clarification - this appears to just be about the persistence format for his dive log. It was XML, now it's git based with plain text.

As someone who had to manage a system which worked with plain text files structured in a filesystem for a number of years in the 1990s, this is done to death already.

You now end up with the following problems: locking, synchronising filesystem state with the program, inode usage, file handles to manage galore and concurrency. All sorts.

Basically this is a "look I've discovered maildir and stuffed it in a git repo".

Not saying there is a better solution but this isn't a magic bullet. It's just a different set of pain.

e12e · on March 7, 2014

> You now end up with the following problems: locking, synchronising filesystem state with the program, inode usage, file handles to manage galore and concurrency. All sorts.

Which is why he's reusing git for resolving those pain points? Well presumably all except "synchronizing filesystem state with the program" -- where he's gone from using some kind of xml parser to marshal xml to objects/structs in ram to using a (simple(r)?) text parser to do the same.

I'm guessing he just writes/reads a full (part) of a log (a branch of the full tree, or whatever is used in the program. Maybe a list anchored at a date?) -- and lets git sort the history/backup thing.

So, yes, it's a different format, but I think the argument you're making is off -- seeing as he already has git for that? It's more like combining Maildir (or mboxes, only commited when valid) and git.

xsace · on March 7, 2014

Maybe you want to wait till he release something. Cause you know, if he took months to get the big picture in mind, I doubt you grasp what he envision just by reading his comment.

bananas · on March 7, 2014

If it's not that, I'll eat my hat, and my pyjamas.

There's not much more to infer from the comment.

Unless he's invented a new ASN.1 encoding which plugs into libgit or something or a new text serialisation format (both unlikely).

bsder · on March 7, 2014

Yes, because his design of git was so well-formed.

Git is so well-designed that expert users manage to trash their repositories and propagate the damage.

Maybe that's not a problem of libgit. But tools are both the infrastructure and the UI.

taeric · on March 7, 2014

Not sure what you are referring to. What are some common ways "expert users" manage to "trash their repositories?"

bananas · on March 7, 2014

5 minutes with me and git rebase usually do the job :)

bsder · on March 7, 2014

Let's start here: http://randyfay.com/content/avoiding-git-disasters-gory-stor...

So, the solution to the fact that the merging UI is a pile of garbage is HAVE A SINGLE PERSON ALWAYS DO THE MERGE. Excuse me? The whole point of a distributed revision control system is so I don't have to have a single choke point. That's the definition of distributed.

Then there was the KDE disaster: http://jefferai.org/2013/03/29/distillation/

Yeah, the root fault wasn't Git. However, at no point did Git flag that something was going horribly wrong as the repository got corrupted and deleted. Other distributed SCM systems I have used tend to squawk very loudly if something comes off disk wrong.

Maybe the underlying git data structures are fine, but, man, the UI is a pile of crap.

And, I won't even get into rebase, because that seems to be a religious argument.

smharris65 · on March 7, 2014

The issues in the randyfay.com post are due to a misunderstanding when using git as a "centralized" repo like SVN. Git, by design, does not enforce a central repo even if you designate one logically. These issues can be completely avoided if you merge the right way:

http://tech.novapost.fr/merging-the-right-way-en.html

pjc50 · on March 7, 2014

Well, that confirms that the "obvious" workflow of "git pull" is dangerous. At least it explains all the spurious merges. Why on earth did it ship with this broken design? Why doesn't git pull do the right thing by default?

judk · on March 7, 2014

Yup, and Windows is broken because ctrl-c copies text instead of killing a process.

Why doesn't Windows do the right thing by default?

Oh, its because a different system behaves differently.

DVCS is fundamentally more complex than VCS.

craigching · on March 7, 2014

I believe you can flip a switch to turn on only allowing FF merges which should alleviate the situation. Certainly FF merges make a lot of things easier and "cleaner".

I guess maybe the reason git doesn't do this by default might be because the idea of rebasing early on (the "omg you're overwriting history in a RCS!!!!!") was a bit taboo and it's taken time for people to get used to the idea. Note that I'm just speculating about that, I did follow the git discussion early on and I know that people then (and still are to some degree) afraid of "rewriting history" (note that I quote that because I don't really see it that way).

taeric · on March 7, 2014

I'm not sure I follow. The advice for the "single person always do the merge" is essentially make sure the people doing the merges are experts. These mistakes do not seem like the kind of thing I have heard of experts doing.

Seriously, you can not call yourself a git expert, if you think rebase is a difficult thing to explain.

Might you sometimes make mistakes? Sure. I hardly see this as a systemic thing, though.

The mirror shenanigans I agree suck. Not sure what the real takeaway is there, other than don't rely on mirror as a good form of backup.

jamesgeck0 · on March 7, 2014

So, the solution to the fact that the merging UI is a pile of garbage is HAVE A SINGLE PERSON ALWAYS DO THE MERGE. Excuse me?

That isn't what the post advocates. He says that having a single person approve the pull request is a good idea, but approving the pull isn't the same thing as manually doing a merge. Projects I've worked on required that the submitter merge master into their branch before their PR would be accepted.

crucialfelix · on March 7, 2014

I took this to mean that what he is replacing is a single XML file whose content is a tree of element nodes. Every time you have to make a change to that file (changing, removing or adding children nodes within the file) you would have to store a new copy of the file. The most efficient you can get is to store just the text diffs using git or something.

But what he replaces it with is a git object store. Each xml-node becomes a git object. They each point to a parent (just as git commits point to a parent commit).

Now writing to this datastore means adding a new node to the git object database and changing the parent references.

Where git stores commits that are related sequentially in time, this stores nodes in a tree relationship that IS the document.

If he's not talking about this then I'd like to officially take credit for my weird idea right now.

dangoor · on March 7, 2014

The impression I got was that he was going to store his data in a git object database and that the files would be virtual in there. It would be like the .git directory without the working files on disk. It's all just conjecture until his code his out.

Regardless, I would think that some applications are simple enough (store few enough separate objects in the file system) that the issues you cite are not likely to cause a problem.

mixedbit · on March 7, 2014

What you describe is quite similar to how gollum wiki uses git for storage: https://github.com/gollum/gollum

WalterBright · on March 7, 2014

Back in the bad old DOS days, instead of creating a file format for saving/loading the configuration of the text editor, I simply wrote out the image in memory of the executable to the executable file. (The configuration was written to static global variables.)

Running the new executable then loaded the new configuration. This worked like a champ, up until the Age of Antivirus Software, which always had much grief over writing to executable files.

It's a trick I learned from the original Fortran version of ADVENT.

picomancer · on March 9, 2014

Readers may be familiar with the TI-83 programmable graphing calculator's assembly language functionality (especially those who took high school math classes in the mid-to-late 1990's). The TI-83's only user-writable storage was 32K of RAM (there was a small lithium battery to keep it powered when you changed the AAA's; also some of the RAM was used for system stuff so somewhat less than 32K was actually available for user purposes).

You could write hex values in the program text editor, then you could tell the calculator to execute the hex codes as machine code. I understand the previous models, TI-82 and TI-85, were hacked / backdoored to run user-supplied assembly language, so TI responded by including an official entry point and developer documentation for the TI-83.

People later wrote loaders which allowed programs to be stored as binary instead of text (using half the space). Some loaders also had the capability to run binary programs by swapping them into the target address rather than copying them (theoretically a third option would be possible, running programs in-place if they weren't written to depend on their load address, but this wasn't a direction the community went in. gcc users may be familiar with -fPIC which produces code which can run from any address, and this flag is necessary when compiling code for shared libraries.)

This allowed people to create massive 20K+ applications (an RPG called Joltima comes to mind), that used most of the available RAM.

The fact that this loading scheme made static variables permanent was also quite convenient. (And most variables were static; stack-based addressing would be tough because the Z80 only has two index registers, one of which is used -- or perhaps I should say "utterly wasted" -- by the TI-83 OS.)

The next generation, the TI-83+, included I think 256K of flash ROM, and a special silver edition was released which contained 2 MB.

chongli · on March 9, 2014

Reminds me of the approach taken by Xmonad where the configuration is compiled into a new executable and then run.

strictfp · on March 7, 2014

Thank you for that anecdote, it made my day. Simply awesome.

WalterBright · on March 7, 2014

I forgot to mention, on a floppy disk system, saving the configuration in the exe file made for fast loading of the program, since it didn't need to do extra floppy file operations to load the config.

WalterBright · on March 7, 2014

I learned a heckuva lot from reading the ADVENT Fortran source code. I was floored when I figured out how it was saving its configuration - such a brilliant idea. And in DOS it could be implemented in about 5 lines of simple C code. (Code size was critical in the old 64Kb days.)

The other huge thing I learned from ADVENT was polymorphism. The comment in the source code "the troll is a modified dwarf" was an epiphany for me.

jmnicolas · on March 7, 2014

From the comments (Tristan Colgate) :

"XML is what you do to a sysadmin if waterboarding him would get you fired."

Made my day :-)

Ygg2 · on March 7, 2014

That's just mean. Waterboarding isn't that bad...

jmnicolas · on March 7, 2014

But it gets you fired ... on the other end, nobody has ever been fired for using XML.

nzp · on March 8, 2014

With my occasional sysadmin hat on, until a few weeks ago I had the luck to never have had to deal with XML configuration files. Then came Solr and now I know what horror is. (To be clear, Solr itself is great, but those god damn config files...)

lifeisstillgood · on March 7, 2014

What I like is the "I dont start prototyping till I have a good mental picture"

I am currently stuck on a project I want to start becasue I cannot get it to fit right in my (future) head. And I am glad I am not an idiot for not being able to knock out my next great project in between lattes.

(Ok, in direct comparison terms I am an idiot, but at least its not compounded)

specialist · on March 7, 2014

  "A change in perspective is worth 80 IQ points."
  
  -- Alan Kay

My biggest hurdle solving new problems is divining a unifying, simplifying metaphor. Once you have the right notion, that Eureka! moment, everything falls into place, like magic.

Like how Kepler was able to fully explain Bache's astronomical data once he realized the planets orbits the sun.

Personal example: I used to write print production software. Placing pages onto much larger sheets of paper that get folded and bound into a book. A task called image positioning aka imposition. It took me years to figure out how to model the problem. Key insight was simulating the work backwards, from binding back to the press. Then when I showed the new solution to my coworkers, the response was "Well, duh."

tim333 · on March 7, 2014

Yeah, I noted that too, also that it took him months to to get his good mental picture. It makes me feel not so bad about spending months trying to get clear on some of my stuff.

tzury · on March 7, 2014

I just realized that Linus' posts are the only reason I ever go to Google Plus.

cbsmith · on March 7, 2014

The question nobody is asking, but actually should is: I wonder what other good G+ content you are missing?

G+ is largely misunderstood. It is a lousy tool for interaction with people connected to you purely socially. It's a very good way to find and interact with people connected to you by interest.

icefox · on March 7, 2014

The really sad thing is that I have tried several times to search for content that I know exists on G+, but I can't find it, even when I knew the author. After the third time failing at this my usage of G+ dropped significantly. Of all of the things that you would think would work search would be at the top... :|

tmzt · on March 9, 2014

Right, if I could subscribe to a Circle with all of the kernel devs in it I would.

G+ is actually a great place to read long form messages and comments, but doesn't really have content discovery down.

ChikkaChiChi · on March 7, 2014

This is exactly how I explain Google+ to folks. It's built for communities, not cliques.

jan_g · on March 7, 2014

For me it's not just G+, but also Facebook and Twitter. Only reason I ever visit those sites is indirectly through HN posts and similar.

npsimons · on March 7, 2014

I know this is completely off-topic, and I'll happily be downvoted for it, but why in the world does Google+ capture keyboard shortcuts that are already bound to other well known browser functions? (C-PgUp, C-PgDn, C-w, etc).

unsigner · on March 7, 2014

Linus : G+ :: notch : Java

xentronium · on March 7, 2014

That's unfair. Lots of infrastructural projects are done in java. E.g. my personal favorite: lucene (+ solr, elasticsearch).

unsigner · on March 10, 2014

Yeah, and I hear some communities (photographers?) have taken G+ as their home. It was tongue-in-cheek and purely from my (PC, desktop, Windows/console game developer) perspective.