Hacker News new | past | comments | ask | show | jobs | submit login
Billion laughs (wikipedia.org)
276 points by khet on Oct 20, 2012 | 63 comments



It isn't obvious at first glance that this small xml file actually expands to a billion "lols". You really have to give the bad guys credit for ingenuity.
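For readers who haven't clicked through, the payload from the article looks roughly like this (levels 3 through 8 elided; each level references the previous one ten times, so the single `&lol9;` in the body expands to 10^9 "lol"s):

```xml
<?xml version="1.0"?>
<!DOCTYPE lolz [
 <!ENTITY lol "lol">
 <!ENTITY lol1 "&lol;&lol;&lol;&lol;&lol;&lol;&lol;&lol;&lol;&lol;">
 <!ENTITY lol2 "&lol1;&lol1;&lol1;&lol1;&lol1;&lol1;&lol1;&lol1;&lol1;&lol1;">
 <!-- lol3 through lol8 follow the same pattern -->
 <!ENTITY lol9 "&lol8;&lol8;&lol8;&lol8;&lol8;&lol8;&lol8;&lol8;&lol8;&lol8;">
]>
<lolz>&lol9;</lolz>
```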


:(){ :|: & };:

Even simple bash scripts can do weird things like this. And that's a lot smaller.
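For anyone squinting at the one-liner, here is a readable translation. `:` is just the function name; the body pipes the function into itself and backgrounds the pipeline, so the process count doubles at every step:

```shell
# Readable version of ':(){ :|: & };:'
bomb() {
  bomb | bomb &
}
# Defining 'bomb' is harmless; calling it would exhaust the process
# table, so the invocation is left commented out:
# bomb
```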


Yeah but there's a HUGE difference: bash is intended to be a Turing-complete language, while XML is a data format. It's trivial to use infinite resources if you can execute arbitrary code.

It shouldn't be possible to use resources exponential to the input size for just PARSING (not doing anything else with) XML. This is a great example of why you want to use simple serialization formats like JSON.

It's the same reason I never liked YAML. The spec and implementations are just way too big. There's got to be something hiding in there like this.


YAML is actually trivially compatible with this exploit, in a sense, though the result is not particularly disastrous with Psych.

    lol1: &lol1
      "lol"
    lol2: &lol2
      [*lol1,*lol1,*lol1,*lol1,*lol1,*lol1,*lol1,*lol1,*lol1,*lol1]
    lol3: &lol3
      [*lol2,*lol2,*lol2,*lol2,*lol2,*lol2,*lol2,*lol2,*lol2,*lol2]
    lol4: &lol4
      [*lol3,*lol3,*lol3,*lol3,*lol3,*lol3,*lol3,*lol3,*lol3,*lol3]
    lol5: &lol5
      [*lol4,*lol4,*lol4,*lol4,*lol4,*lol4,*lol4,*lol4,*lol4,*lol4]
    lol6: &lol6
      [*lol5,*lol5,*lol5,*lol5,*lol5,*lol5,*lol5,*lol5,*lol5,*lol5]
    lol7: &lol7
      [*lol6,*lol6,*lol6,*lol6,*lol6,*lol6,*lol6,*lol6,*lol6,*lol6]
    lol8: &lol8
      [*lol7,*lol7,*lol7,*lol7,*lol7,*lol7,*lol7,*lol7,*lol7,*lol7]
    lol9: &lol9
      [*lol8,*lol8,*lol8,*lol8,*lol8,*lol8,*lol8,*lol8,*lol8,*lol8]


In YAML, this is 9 objects in memory, one of them is a scalar, the other 8 are arrays of size 10 that happen to be pointers to the other objects. Not very big at all... much less than 1k. It's not even interesting since YAML competently serializes graphs that are far more complex. Lots of reasons to trash YAML, this isn't one of them.
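The sharing argument is easy to check with a pure-Python sketch that builds the same graph a YAML loader would build for those aliases: the in-memory structure stays tiny, and only flattening it pays the exponential cost.

```python
import json

# Each level holds ten *references* to the previous level, not ten
# copies (3 levels here for brevity; the bomb uses 9).
lol1 = ["lol"]
lol2 = [lol1] * 10
lol3 = [lol2] * 10

# All ten slots of lol3 point at the same lol2 object:
assert lol3[0] is lol3[1]

# The graph is a handful of small objects, but serializing it expands
# every shared reference into a full copy:
print(len(json.dumps(lol3)))
```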

As aardvark179 noted, the usage concern with YAML is more about how an unguarded application might fail to check for cycles in a serialized graph.


That makes sense. I tested this in a REPL that tried to pretty-print the result.


> Lots of reasons to trash YAML, this isn't one of them.

Is there a list of these reasons? Apart from graph cycles, I have only come across some gotchas (e.g. using "yes" without quotes in a translation file).


That is probably not affected, as it's not necessary that each copy of lol1 inside lol2 actually be a copy. They could simply all be references to the same node.

In fact, I don't even think that is legal for them to be copies, as this is the means by which YAML allows circular data structures to be represented.


They're all classes of the same problem: hard to check for resource consumption from apparently innocent input.

#defines, or any kind of macro system: they all suffer from this. As soon as you are allowed to define an entity that references another defined entity, you have a bomb on board.


It's not really the same. If bash had a separate parse step (which isn't too hard to imagine), you could parse a fork bomb without any trouble. Executing it is what's dangerous.

A better analogy would be a language where you can't even parse it without executing arbitrary code (Perl is probably a good example: http://www.perlmonks.org/?node_id=663393)

Python AFAICT is quite safe to parse. It's not a common distinction because you normally parse stuff in order to execute it, and executing can do much worse things, of course. But it might come into play if you wanted to write a cloud service that accepts arbitrary code and just does static analysis on Python or bash. What kind of security would you need?

It's a design flaw in the format itself -- in this case, XML. A fork bomb is not a flaw in the design of bash.


I suppose that how safe a grammar is to parse depends on whether or not there are language constructs that can generate exponential growths of memory usage, how much memory you have available for the task, and whether or not it can even be parsed at all.

It seems like the only way to totally secure an automated system against this kind of attack is to have a limit on parser memory usage and CPU cycles, and reject input when the parser breaks any of these limits.
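That budget idea can be sketched in a few lines (the function and entity names here are ours, not any real parser's API): expand entities iteratively and bail out the moment the output exceeds a hard limit.

```python
def expand(entities, name, budget):
    """Naively expand &entity; references, aborting past a size budget."""
    out = entities[name]
    while any(f"&{k};" in out for k in entities):
        for k, v in entities.items():
            out = out.replace(f"&{k};", v)
        if len(out) > budget:
            raise ValueError("entity expansion exceeds budget")
    return out

ents = {"lol": "lol",
        "lol1": "&lol;" * 10,
        "lol2": "&lol1;" * 10}

print(expand(ents, "lol1", budget=10_000))   # harmless: 30 characters
try:
    expand(ents, "lol2", budget=100)         # bomb-shaped: aborted early
except ValueError as e:
    print(e)
```

The key property is that the check runs during expansion, not after, so the bomb is rejected long before it finishes growing.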


JSON: {{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{... and so on.

It isn't exponential, no, but you're quite likely to still crash something if the receiver was not programmed to handle that.


If I'm not mistaken, the memory load is even linear. You could protect yourself just by setting a maximum size limit for the buffer.
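Even with linear memory, a recursive parser can still fall over on such input. A stdlib Python sketch (the exact depth at which it fails depends on the interpreter's recursion limit):

```python
import json

depth = 100_000
evil = "[" * depth + "]" * depth   # ~200 KB of perfectly linear input
try:
    json.loads(evil)
except RecursionError:
    print("parser gave up: recursion limit hit")
```

So a size limit alone isn't quite enough; a nesting-depth limit matters too.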



Looks like an interesting emoticon.


Reminds me of a Zip bomb: http://en.wikipedia.org/wiki/Zip_bomb


Hah, I knew I'd seen this before! Someone submitted this link two weeks ago[1], and in that same thread someone mentioned the billion laughs attack[2].

[1] http://news.ycombinator.com/item?id=4616081

[2] http://news.ycombinator.com/item?id=4616922


Probably should rename this to "billion reposts."

Can we move beyond this simple issue and discuss more complicated aspects of security on HN?


Fine, you guys asked for it.

https://news.ycombinator.com/item?id=259458

https://news.ycombinator.com/item?id=3859853

https://news.ycombinator.com/item?id=1674911

https://news.ycombinator.com/item?id=301296

https://news.ycombinator.com/item?id=4619344

People that exploit these kinds of things continue to innovate, but HN seems to be stuck with XSS, SQLi, and malformed XML.


While I'm all with you on talking about advanced security, the reality is that most people here don't understand basic security. I think talking about the low-hanging fruit is important -- everyone has to start somewhere. And as always, if you want to see more advanced security stuff, post it! I'll upvote it for sure.

Edit: This does make me think that I've been meaning to write a blog post about a security issue I discovered for about 6 months now. Time to do that.


Which he did: http://news.ycombinator.com/item?id=4678309. This is a tale in which cody pretty much ends up owning ccbill.


Please consider that none of those posts with any comments in them is from the last two years. And "sadly" gets created just to tell people to read more books. Sad indeed...

EDIT:

> HN seems to be stuck with XSS, SQLi, and malformed XML

HN is not a person: half of today's HN didn't exist two years ago [1].

[1] http://blog.rjmetrics.com/surprising-hacker-news-data-analys...


The problem here is that a lot of newly minted IT professionals will write stuff that allows all those old exploits to be reused all over again.

Occasionally warning people about known dangers doesn't hurt. Sure, that's not exactly moving the needle, but it may help to mitigate a few lurking problems that people were not yet aware of.

Think of it as a 'wear your seatbelt' advertisement. If you're already wearing your seatbelt then it wasn't meant for you.


Entirely appropriate. We've probably had a minimum of 20-30 reposts about the problems with saving passwords as simple salted hashes, which is a useful introduction to bcrypt/scrypt/pbkdf2.


I'm so tired of these "seen it before, I'm smarter and more senior than you" responses.

Look, if people vote it up it means they like it. Just because it's been talked about before doesn't make it less valuable.


Disagree. I didn't know about that attack and found it very interesting.


Sadly, it's because you aren't reading enough books.

Read more books.


Sadly, I downvoted you. I wouldn't have if you had actually given some book suggestions, but instead your post amounts to "read more lol" which is not only unhelpful but also condescending.


Downvoting is good, but feeding the trolls is not!


Ten years deep in IT, recent HN member. This is the first I've seen of this attack. Reposts can be a good thing.


Achievement Unlocked: You've read all of the Internet.


What about one for reading all of HN?


The Internet is a much bigger place than some people realize.

If you would like a part of it improved & expanded, please proceed to do so.


So, how much memory would a real-world parser actually consume given this file? I'd try it, but I had to RMA my workstation's motherboard yesterday, leaving me with a machine that only has 3GB, which is the obvious minimum for a full expansion. But I could imagine an XML parser might use UCS-2 internally, inflating this to 6GB. Or, some parsers might be clever and not attempt a full expansion.


So you're asking how much memory this resource exhaustion attack consumes when you run it.

Each line in the WP example amplifies by a factor of 10. It has 9 lines. It's 10e9. That's a billion times 3 bytes, which is just enough to exhaust the 32-bit virtual memory space in common operating systems.

Of course the XML implementation could be smart and short-circuit this while preserving the semantics.
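The arithmetic above, spelled out (the UCS-2 figure echoes the grandparent comment's guess about internal string representation):

```python
lols = 10 ** 9            # 9 levels of 10x amplification
ascii_bytes = lols * 3    # "lol" is 3 bytes in an 8-bit encoding
ucs2_bytes = lols * 6     # 6 bytes if the parser stores UCS-2 internally
print(ascii_bytes, ucs2_bytes)
```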


I think you mean 10^9 = 1e9 rather than 10e9.

(And the parent actually mentioned the 3 GB minimum; I think it was essentially asking how much extra memory a real-world XML parser uses.)


Nope, 10e8 is what you're both looking for, as 1^9 is.. 1.


Actually, dbaupp is correct. The e9 part means *10^9.

See: https://www.wolframalpha.com/input/?i=1e9


Nope, 1e9 = 1 * 10^9. He was fine.


You could try using an iteration that only required 300 MB for a full expansion...


Is there a legitimate use case for being able to recursively define entities like that?


Look again. There's no actual recursion going on. It's fairly trivial to identify recursion in DTD entities.


Rather than nitpicking the terminology, can we answer the actual question: is there any legitimate use case for defining an entity in terms of another entity?


Right, that's the root of the problem: feature creep. Common with design-by-committee.

As with any feature, there are possible use cases. Perhaps you want to create a form document and use custom entities that you modify later to fill the document out.

But the number of times that's useful is not likely to be worth the potential harm of things like billion laughs. It's easy to separate that functionality and have your XML consumer do an element data replace with your own custom tokens. E.g. node.replace("{name}", customerName);
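The token-substitution alternative can be sketched with Python's stdlib (the template text and field names here are illustrative):

```python
from string import Template

# Ship the document with inert placeholders instead of DTD entities,
# then substitute at the application layer:
form = Template("Dear $name, your order $order_id has shipped.")
print(form.safe_substitute(name="Ada", order_id="42"))
```

Since the placeholders are plain text until the application fills them in, there is no expansion step for an attacker to abuse.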


DTD entities come from SGML, and are still commonly used as an extensibility/modularity mechanism in XML. XML power users (e.g. for documentation) commonly have DTDs that refer to entities that can be replaced to customize the format. IBM's DITA is an example for this. Having only one level of DTD expansion would make that much less powerful. I think that's a legitimate use case, but the resulting system is still a mess IMHO. Like many macro systems it's very hard to understand and debug, and on top of it you have to understand a relatively obscure format that isn't even XML (DTDs).

That being said, like many of the things in XML, DTDs and entities are not very useful outside of publishing/markup use cases, and dangerous on top of it (not only for the billion laughs, e.g. including system entities).


I don't see any recursion. Recursion would be an infinite loop and presumably a good parser would catch that.


No. Recursion can terminate and still be called recursion.

(news.yc used to be a Lisp hangout)


If you read the OP carefully, you'll see he's quite aware of the terminology.

There really isn't any recursion in that XML snippet; none of the entities refers to itself, directly or transitively. If there were, the file would be invalid and the error would probably be caught by the parser before it died from memory exhaustion. There's no way to terminate recursion in XML entities.


Not sure if you could abort a recursion in XML.


it's really the underlying macro system that's recursive - http://www.w3.org/TR/html4/intro/sgmltut.html#h-3.3.2

and that's probably because macro systems typically are recursive, since they are otherwise fairly limited; it's a way to give them sufficient expressive power. So the explanation is probably "tradition" (and it likely made more sense in SGML, which was the precursor to HTML).


I have to say, I love this. Crazy, for a language that is really just for transferring data.

I guess you could do this with YAML too?


To pull this particular style of trick you require a schema definition that allows for one object to be expanded into a whole set of objects, and for the resulting data structure to be a tree rather than simply a rooted directed graph.

I don't know YAML well but I believe if you tried this trick with something like alias nodes then you would end up with a lol9 node with ten separate connections to a single lol8 node with ten separate connections to a single lol7 node and so on. This would not produce the same problem in the parser, though might trigger problems in whatever processed the resulting graph.


That's definitely correct: since you can produce cycles in YAML (it's a deliberate choice), programs that don't check for graph cycles and blindly go about traversing a serialized graph are subject to DoS.


I don't think so. YAML by itself doesn't have any entity expansion or include capabilities -- you'd have to rely on extensions or something else that isn't on by default. The reference (alias/anchor) mechanism just rebuilds the serialized graph, so there's no expansion going on there. That said, I'm sure there are possible implementation issues, like with most software.


To some extent, every protocol that does not transmit message and payload sizes at a fixed offset in the header can be called "crazy", as it is vulnerable to all the problems of live parsing, terminators, and unpredictable memory requirements.


This doesn't work with my sed-based XML parsers. :(


so, how do I protect myself against this?


Assign a limited memory pool to the parser; it's in the article.


Most of the xml libraries already have the fix.


Stop using XML.


Use a parser or data structure that doesn't duplicate identical objects.

Functional programming wins here.


You might, but the expansion happens on a level where that might be really hard to implement if you want to keep XML semantics. E.g. in the DOM API, each element has its own identity: you can't collapse identical objects (or you could, but the objects generated this way aren't identical; they have different parent nodes, for example).


That's hardly a fool-proof solution, because you still end up with a document that has a very large logical size (even if you represent it very compactly in memory), and any non-trivial operation on the resulting string that needs to evaluate it character by character (e.g. a substring search) will still take "forever" to complete.



