Hacker News new | past | comments | ask | show | jobs | submit login
Billion laughs (wikipedia.org)
276 points by khet on Oct 20, 2012 | 63 comments



It isn't obvious at first glance that this small xml file actually expands to a billion "lols". You really have to give the bad guys credit for ingenuity.
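For readers who haven't clicked through, the payload from the article looks roughly like this (levels 3 through 8 elided; each level references the previous one ten times, so the single `&lol9;` in the body expands to 10^9 "lol"s):

```xml
<?xml version="1.0"?>
<!DOCTYPE lolz [
 <!ENTITY lol "lol">
 <!ENTITY lol1 "&lol;&lol;&lol;&lol;&lol;&lol;&lol;&lol;&lol;&lol;">
 <!ENTITY lol2 "&lol1;&lol1;&lol1;&lol1;&lol1;&lol1;&lol1;&lol1;&lol1;&lol1;">
 <!-- lol3 through lol8 follow the same pattern -->
 <!ENTITY lol9 "&lol8;&lol8;&lol8;&lol8;&lol8;&lol8;&lol8;&lol8;&lol8;&lol8;">
]>
<lolz>&lol9;</lolz>
```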


:(){ :|: & };:

Even simple bash scripts can do weird things like this. And that's a lot smaller.
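For anyone squinting at the one-liner, here is a readable translation. `:` is just the function name; the body pipes the function into itself and backgrounds the pipeline, so the process count doubles at every step:

```shell
# Readable version of ':(){ :|: & };:'
bomb() {
  bomb | bomb &
}
# Defining 'bomb' is harmless; calling it would exhaust the process
# table, so the invocation is left commented out:
# bomb
```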


Yeah but there's a HUGE difference: bash is intended to be a Turing-complete language, while XML is a data format. It's trivial to use infinite resources if you can execute arbitrary code.

It shouldn't be possible to use resources exponential to the input size for just PARSING (not doing anything else with) XML. This is a great example of why you want to use simple serialization formats like JSON.

It's the same reason I never liked YAML. The spec and implementations are just way too big. There's got to be something hiding in there like this.


YAML is actually trivially compatible with this exploit, in a sense, though the result is not particularly disastrous with Psych.

    lol1: &lol1
      "lol"
    lol2: &lol2
      [*lol1,*lol1,*lol1,*lol1,*lol1,*lol1,*lol1,*lol1,*lol1,*lol1]
    lol3: &lol3
      [*lol2,*lol2,*lol2,*lol2,*lol2,*lol2,*lol2,*lol2,*lol2,*lol2]
    lol4: &lol4
      [*lol3,*lol3,*lol3,*lol3,*lol3,*lol3,*lol3,*lol3,*lol3,*lol3]
    lol5: &lol5
      [*lol4,*lol4,*lol4,*lol4,*lol4,*lol4,*lol4,*lol4,*lol4,*lol4]
    lol6: &lol6
      [*lol5,*lol5,*lol5,*lol5,*lol5,*lol5,*lol5,*lol5,*lol5,*lol5]
    lol7: &lol7
      [*lol6,*lol6,*lol6,*lol6,*lol6,*lol6,*lol6,*lol6,*lol6,*lol6]
    lol8: &lol8
      [*lol7,*lol7,*lol7,*lol7,*lol7,*lol7,*lol7,*lol7,*lol7,*lol7]
    lol9: &lol9
      [*lol8,*lol8,*lol8,*lol8,*lol8,*lol8,*lol8,*lol8,*lol8,*lol8]


In YAML, this is 9 objects in memory, one of them is a scalar, the other 8 are arrays of size 10 that happen to be pointers to the other objects. Not very big at all... much less than 1k. It's not even interesting since YAML competently serializes graphs that are far more complex. Lots of reasons to trash YAML, this isn't one of them.
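The sharing argument is easy to check with a pure-Python sketch that builds the same graph a YAML loader would build for those aliases: the in-memory structure stays tiny, and only flattening it pays the exponential cost.

```python
import json

# Each level holds ten *references* to the previous level, not ten
# copies (3 levels here for brevity; the bomb uses 9).
lol1 = ["lol"]
lol2 = [lol1] * 10
lol3 = [lol2] * 10

# All ten slots of lol3 point at the same lol2 object:
assert lol3[0] is lol3[1]

# The graph is a handful of small objects, but serializing it expands
# every shared reference into a full copy:
print(len(json.dumps(lol3)))
```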

As aardvark179 noted, the usage concern with YAML is more about how an unguarded application might fail to check for cycles in a serialized graph.


That makes sense. I tested this in a REPL that tried to pretty-print the result.


> Lots of reasons to trash YAML, this isn't one of them.

Is there a list of these reasons? Apart from graph cycles, I have only come across some gotchas (e.g. using "yes" without quotes in a translation file).


That is probably not affected, as it's not necessary that each copy of lol1 inside lol2 actually be a copy. They could simply all be references to the same node.

In fact, I don't even think that is legal for them to be copies, as this is the means by which YAML allows circular data structures to be represented.


They're all classes of the same problem: hard to check for resource consumption from apparently innocent input.

#defines, or any kind of macro system: they all suffer from this. As soon as you are allowed to define an entity that references another defined entity, you have a bomb on board.


It's not really the same. If bash had a separate parse step (which isn't too hard to imagine), you could parse a fork bomb without any trouble. Executing it is what's dangerous.

A better analogy would be a language where you can't even parse it without executing arbitrary code (Perl is probably a good example: http://www.perlmonks.org/?node_id=663393)

Python AFAICT is quite safe to parse. It's not a common distinction because you normally parse stuff in order to execute it, and executing can do much worse things, of course. But it might come into play if you wanted to write a cloud service that accepts arbitrary code and just does static analysis on Python or bash. What kind of security would you need?

It's a design flaw in the format itself -- in this case, XML. A fork bomb is not a flaw in the design of bash.


I suppose that how safe a grammar is to parse depends on whether or not there are language constructs that can generate exponential growths of memory usage, how much memory you have available for the task, and whether or not it can even be parsed at all.

It seems like the only way to totally secure an automated system against this kind of attack is to have a limit on parser memory usage and CPU cycles, and reject input when the parser breaks any of these limits.
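That budget idea can be sketched in a few lines (the function and entity names here are ours, not any real parser's API): expand entities iteratively and bail out the moment the output exceeds a hard limit.

```python
def expand(entities, name, budget):
    """Naively expand &entity; references, aborting past a size budget."""
    out = entities[name]
    while any(f"&{k};" in out for k in entities):
        for k, v in entities.items():
            out = out.replace(f"&{k};", v)
        if len(out) > budget:
            raise ValueError("entity expansion exceeds budget")
    return out

ents = {"lol": "lol",
        "lol1": "&lol;" * 10,
        "lol2": "&lol1;" * 10}

print(expand(ents, "lol1", budget=10_000))   # harmless: 30 characters
try:
    expand(ents, "lol2", budget=100)         # bomb-shaped: aborted early
except ValueError as e:
    print(e)
```

The key property is that the check runs during expansion, not after, so the bomb is rejected long before it finishes growing.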


JSON: {{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{... and so on.

It isn't exponential, no, but you're quite likely to still crash something if the receiver was not programmed to handle that.


If I'm not mistaken, the memory load is even linear. You could protect yourself just by setting a maximum size limit for the buffer.
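Even with linear memory, a recursive parser can still fall over on such input. A stdlib Python sketch (the exact depth at which it fails depends on the interpreter's recursion limit):

```python
import json

depth = 100_000
evil = "[" * depth + "]" * depth   # ~200 KB of perfectly linear input
try:
    json.loads(evil)
except RecursionError:
    print("parser gave up: recursion limit hit")
```

So a size limit alone isn't quite enough; a nesting-depth limit matters too.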



Looks like an interesting emoticon.


Reminds me of a Zip bomb: http://en.wikipedia.org/wiki/Zip_bomb


Hah, I knew I'd seen this before! Someone submitted this link two weeks ago[1], and in that same thread someone mentioned the billion laughs attack[2].

[1] http://news.ycombinator.com/item?id=4616081

[2] http://news.ycombinator.com/item?id=4616922


Probably should rename this to "billion reposts."

Can we move beyond this simple issue and discuss more complicated aspects of security on HN?


Fine, you guys asked for it.

https://news.ycombinator.com/item?id=259458

https://news.ycombinator.com/item?id=3859853

https://news.ycombinator.com/item?id=1674911

https://news.ycombinator.com/item?id=301296

https://news.ycombinator.com/item?id=4619344

People that exploit these kinds of things continue to innovate, but HN seems to be stuck with XSS, SQLi, and malformed XML.


While I'm all with you on talking about advanced security, the reality is that most people here don't understand basic security. I think talking about the low-hanging fruit is important -- everyone has to start somewhere. And as always, if you want to see more advanced security stuff, post it! I'll upvote it for sure.

Edit: This does make me think that I've been meaning to write a blog post about a security issue I discovered for about 6 months now. Time to do that.


Which he did: http://news.ycombinator.com/item?id=4678309. This is a tale in which cody pretty much ends up owning ccbill.


Please consider that none of those posts with any comments in them is from the last two years. And "sadly" gets created just to tell people to read more books. Sad indeed...

EDIT:

> HN seems to be stuck with XSS, SQLi, and malformed XML

HN is not a person: half of today's HN didn't exist two years ago [1].

[1] http://blog.rjmetrics.com/surprising-hacker-news-data-analys...


The problem here is that a lot of newly minted IT professionals will write stuff that allows all those old exploits to be reused all over again.

Occasionally warning people about known dangers doesn't hurt. Sure, that's not exactly moving the needle, but it may help to mitigate a few lurking problems that people were not yet aware of.

Think of it as a 'wear your seatbelt' advertisement. If you're already wearing your seatbelt then it wasn't meant for you.


Entirely appropriate. We've probably had a minimum of 20-30 reposts about the problems with saving passwords as simple salted hashes, which is a useful introduction to bcrypt/scrypt/pbkdf2.


I'm so tired of these "seen it before, I'm smarter and more senior than you" responses.

Look, if people vote it up it means they like it. Just because it's been talked about before doesn't make it less valuable.


Disagree. I didn't know about that attack and found it very interesting.


Sadly, it's because you aren't reading enough books.

Read more books.


Sadly, I downvoted you. I wouldn't have if you had actually given some book suggestions, but instead your post amounts to "read more lol" which is not only unhelpful but also condescending.


Downvoting is good, but feeding the trolls is not!


Ten years deep in IT, recent HN member. This is the first I've seen of this attack. Reposts can be a good thing.


Achievement Unlocked: You've read all of the Internet.


What about one for reading all of HN?


The Internet is a much bigger place than some people realize.

If you would like a part of it improved & expanded, please proceed to do so.


So, how much memory would a real-world parser actually consume given this file? I'd try it, but I had to RMA my workstation's motherboard yesterday, leaving me with a machine that only has 3GB, which is the obvious minimum for a full expansion. But I could imagine an XML parser might use UCS-2 internally, inflating this to 6GB. Or, some parsers might be clever and not attempt a full expansion.


So you're asking how much memory this resource exhaustion attack consumes when you run it.

Each line in the WP example amplifies by a factor of 10. It has 9 lines. It's 10e9. That's a billion times 3 bytes, which is just enough to exhaust the 32-bit virtual memory space in common operating systems.

Of course the XML implementation could be smart and short-circuit this while preserving the semantics.
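The arithmetic above, spelled out (the UCS-2 figure echoes the grandparent comment's guess about internal string representation):

```python
lols = 10 ** 9            # 9 levels of 10x amplification
ascii_bytes = lols * 3    # "lol" is 3 bytes in an 8-bit encoding
ucs2_bytes = lols * 6     # 6 bytes if the parser stores UCS-2 internally
print(ascii_bytes, ucs2_bytes)
```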


I think you mean 10^9 = 1e9 rather than 10e9.

(And the parent actually mentioned the 3 GB minimum; I think it was essentially asking how much extra memory a real-world XML parser uses.)


Nope, 10e8 is what you're both looking for, as 1^9 is.. 1.


Actually, dbaupp is correct. The e9 part means *10^9.

See: https://www.wolframalpha.com/input/?i=1e9


Nope, 1e9 = 1 * 10^9. He was fine.


You could try using an iteration that only required 300 MB for a full expansion...


Is there a legitimate use case for being able to recursively define entities like that?


Look again. There's no actual recursion going on. It's fairly trivial to identify recursion in DTD entities.


Rather than nitpicking the terminology, can we answer the actual question: is there any legitimate use case for defining an entity in terms of another entity?


Right, that's the root of the problem: feature creep. Common with design-by-committee.

As with any feature, there are possible use cases. Perhaps you want to create a form document and use custom entities that you modify later to fill the document out.

But the number of times that's useful is not likely to be worth the potential harm of things like billion laughs. It's easy to separate that functionality and have your XML consumer do an element data replace with your own custom tokens. E.g. node.replace("{name}", customerName);
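The token-substitution alternative can be sketched with Python's stdlib (the template text and field names here are illustrative):

```python
from string import Template

# Ship the document with inert placeholders instead of DTD entities,
# then substitute at the application layer:
form = Template("Dear $name, your order $order_id has shipped.")
print(form.safe_substitute(name="Ada", order_id="42"))
```

Since the placeholders are plain text until the application fills them in, there is no expansion step for an attacker to abuse.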


DTD entities come from SGML, and are still commonly used as an extensibility/modularity mechanism in XML. XML power users (e.g. for documentation) commonly have DTDs that refer to entities that can be replaced to customize the format. IBM's DITA is an example for this. Having only one level of DTD expansion would make that much less powerful. I think that's a legitimate use case, but the resulting system is still a mess IMHO. Like many macro systems it's very hard to understand and debug, and on top of it you have to understand a relatively obscure format that isn't even XML (DTDs).

That being said, like many of the things in XML, DTDs and entities are not very useful outside of publishing/markup use cases, and dangerous on top of it (not only for the billion laughs, e.g. including system entities).


I don't see any recursion. Recursion would be an infinite loop and presumably a good parser would catch that.


No. Recursion can terminate and still be called recursion.

(news.yc used to be a Lisp hangout)


If you read the OP carefully, you'll see he's quite aware of the terminology.

There really isn't any recursion in that XML snippet; none of the entities refers to itself, directly or transitively. If there were, the file would be invalid and the error would probably be caught by the parser before it died from memory exhaustion. There's no way to terminate recursion in XML entities.


Not sure if you could abort a recursion in XML.


it's really the underlying macro system that's recursive - http://www.w3.org/TR/html4/intro/sgmltut.html#h-3.3.2

and that's probably because macro systems typically are recursive, since they are otherwise fairly limited; it's a way to give them sufficient expressive power. So the explanation is probably "tradition" (and it likely made more sense in SGML, which was the precursor to HTML).


I have to say, I love this. Crazy, for a language that is really just for transferring data.

I guess you could do this with YAML too?


To pull this particular style of trick you require a schema definition that allows for one object to be expanded into a whole set of objects, and for the resulting data structure to be a tree rather than simply a rooted directed graph.

I don't know YAML well but I believe if you tried this trick with something like alias nodes then you would end up with a lol9 node with ten separate connections to a single lol8 node with ten separate connections to a single lol7 node and so on. This would not produce the same problem in the parser, though might trigger problems in whatever processed the resulting graph.


That's definitely correct: since you can produce cycles in YAML (it's a deliberate choice), programs that don't check for graph cycles and blindly go about traversing a serialized graph are subject to DoS.


I don't think so. YAML by itself doesn't have any entity expansion or include capabilities -- you'd have to rely on extensions or something else that isn't on by default. The reference (alias/anchor) mechanism just rebuilds the serialized graph, so there's no expansion going on there. That said, I'm sure there are possible implementation issues, like with most software.


To some extent, every protocol that does not transmit message and payload sizes at a fixed offset in the header can be called "crazy", as it is vulnerable to all the problems of live parsing, terminators, and unpredictable memory requirements.


This doesn't work with my sed-based XML parsers. :(


so, how do I protect myself against this?


Assign a limited memory pool to the parser; it's in the article.


Most of the xml libraries already have the fix.


Stop using XML.


Use a parser or data structure that doesn't duplicate identical objects.

Functional programming wins here.


You might, but the expansion happens on a level where that might be really hard to implement if you want to keep XML semantics. E.g. in the DOM API, each element has its own identity: you can't collapse identical objects (or you could, but the objects generated this way aren't identical; they have different parent nodes, for example).


That's hardly a fool-proof solution, because you still end up with a document that has a very large logical size (even if you represent it very compactly in memory), and any non-trivial operation on the resulting string that needs to evaluate it character by character (e.g. a substring search) will still take "forever" to complete.



