How we applied fuzzing techniques to cURL (trailofbits.com)
305 points by ingve 9 months ago | 58 comments



I don't get the part about custom mutators:

> If the data can’t be parsed into a valid TLV, instead of throwing it away, return a syntactically correct dummy TLV. This can be anything, as long as it can be successfully unpacked.

If you're creating a dummy value, how is that better than failing? How does that give your fuzzer better coverage?


Not an expert, but I am a power user of fuzzers.

The problem is that the space of invalid inputs is far larger than the space of valid inputs. Sometimes orders of magnitude larger, say billions or more invalid inputs to one valid input.

Naive fuzzing will hit so many error cases that it will hardly produce a valid input. For the ratio that I mentioned, you might run a fuzzer for a billion runs and only get one valid input in the bunch.

Using a custom mutator and returning a dummy value will give the fuzzer a starting point from a valid input and makes generating other valid inputs more likely.

For my part, I prefer to use custom mutators to generate valid test cases most of the time, but I want some invalid inputs because error handling is where most bugs are.
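For a rough idea of what that looks like in practice, here is a minimal sketch of a libFuzzer custom mutator with a dummy-value fallback. The TLV layout, the validity check, and the dummy bytes are made up for illustration; they are not curl's actual format or harness code.

    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    /* Hypothetical layout: 1-byte type, 1-byte length, then payload. */
    static int tlv_is_valid(const uint8_t *d, size_t n) {
      return n >= 2 && (size_t)d[1] + 2 <= n;
    }

    /* libFuzzer's built-in generic byte mutator. */
    size_t LLVMFuzzerMutate(uint8_t *data, size_t size, size_t max_size);

    size_t LLVMFuzzerCustomMutator(uint8_t *data, size_t size,
                                   size_t max_size, unsigned int seed) {
      (void)seed;
      size_t new_size = LLVMFuzzerMutate(data, size, max_size);

      if (!tlv_is_valid(data, new_size)) {
        /* Instead of rejecting the mutated input, substitute a
           syntactically correct dummy TLV so the run still reaches
           the code past the framing check. */
        static const uint8_t dummy[] = {0x01, 0x03, 'a', 'b', 'c'};
        size_t n = sizeof(dummy) < max_size ? sizeof(dummy) : max_size;
        memcpy(data, dummy, n);
        return n;
      }
      return new_size;
    }

The effect is that every iteration hands the target something parseable, so the generic mutations on top of it explore the logic behind the framing check instead of bouncing off it.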


Your post doesn't address my question (and neither do any of the other replies). The article says:

> If the data can’t be parsed into a valid TLV, instead of throwing it away, return a syntactically correct dummy TLV. This can be anything, as long as it can be successfully unpacked.

Is the syntactically correct dummy value the same each time? If so, how does that lead to new coverage?

If it's a different valid TLV value each time, then it's not really a dummy value, is it? In any case, why bother with this "if invalid replace with dummy" step? Why not generate/mutate a valid TLV value from the start?


> Is the syntactically correct dummy value the same each time? If so, how does that lead to new coverage?

Per my understanding, the dummy value is constant but is only used once (to create the first valid TLV). Everything after that is a mutation of the original value that, due to the custom mutator logic, is a valid TLV. The mutations are where new coverage comes from.

> In any case, why bother with this "if invalid replace with dummy" step? Why not generate/mutate a valid TLV value from the start?

I'm guessing to handle both when there are seed files that are already valid (which you'll want to use instead of a dummy value), and when there aren't any valid seed files.


I personally would separate that into two fuzz cases - one that generates only valid inputs and one that generates only invalid inputs - and spend more resources on the latter, because verifying how invalid inputs are handled is more important. I didn’t read the article, but I like property testing for this, where your mutator takes random values and uses them to generate a valid input somehow, rather than stubbing in a static value. Where I can use a static value is where I wouldn’t be fuzzing the validation of that value anyway. Of course certs are complicated beasts, so I’m sure the cURL people did what made sense to them.


You think generative AI/deep learning tools could help with this?


The file format they chose is one for which it's difficult for a fuzzer to produce valid examples by random chance.

The file format isn't what's being fuzzed, so trying to accept as many things as possible as valid is useful.

It's a trick to make the fuzzer faster.


Remember, TLV stands for "Type, Length, Value". It's better to take your fuzzer output for the value (or possibly the type and value) but generate the length and the final TLV yourself, rather than have tons of fuzzer-generated sequences fail at the very basic type/length check, which is unlikely to be vulnerable in software like cURL (but can be in many others..).
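Concretely, that trick can look something like this in a harness. This is only a sketch assuming a hypothetical 1-byte type / 1-byte length framing and a stand-in parse_tlv(), not curl's real interface:

    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    /* Stand-in for the real parser under test. */
    static void parse_tlv(const uint8_t *buf, size_t len) {
      if (len >= 2 && (size_t)buf[1] + 2 <= len) {
        /* ... real parsing work would happen here ... */
      }
    }

    int LLVMFuzzerTestOneInput(const uint8_t *data, size_t size) {
      if (size < 1)
        return 0;

      /* The fuzzer drives the type byte and the value bytes; the
         length field is computed by the harness, so every input
         passes the trivial framing check and exercises the code
         behind it. */
      uint8_t type = data[0];
      size_t value_len = size - 1;
      if (value_len > 255)
        value_len = 255;

      uint8_t buf[2 + 255];
      buf[0] = type;
      buf[1] = (uint8_t)value_len;
      memcpy(buf + 2, data + 1, value_len);

      parse_tlv(buf, 2 + value_len);
      return 0;
    }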


You want the failures to occur deeper in the programs so you cover more code paths.


OSS-Fuzz is great in many cases, but as this post hints at, it seems like a lot of the time the fuzzers miss things because no one actually reviews the coverage info. Not in this case, but sometimes it's like someone just found the functions with signatures like parse(unsigned char *data, size_t size) just to get the money from Google for integrating fuzzing.

Then again some things are just difficult to fuzz properly. I tried writing a libpurple IRC harness by doing it the "right way" with the 3000 libpurple event callbacks you are supposed to set up and structures you are supposed to allocate, which worked, but it was very slow. I ended up being lazy and only calling irc_parse_msg after modifying the code to remove error logging that required proper libpurple setup.


Lots of good info here about how they added more fuzz tests, but no mention of whether the fuzzing actually found any bugs?


If they did, I'm sure they would mention it.


I am curious how much effort goes into creating and maintaining unit tests and fuzzing tests. Sometimes it takes longer / more lines of code to write thorough tests than it does to implement the core feature.

At that point, is it worth the time invested? Every new feature can take 2-3 times longer to deliver due to adding tests.


In the case of curl, the cost/benefit analysis is probably skewed by it being deployed on a massive scale, such that any bugs in curl have an unusually large impact. If your company's in-house CRM has an exploitable bug, that's bad but the impact is just your company. If libcurl has an exploitable bug, that's millions of devices affected.


Probably billions of devices at this point, honestly.


Curl claims 20 billion installations, FWIW.


Code that isn't tested, isn't done. Tests not only verify the expectations, but also prevent future regression. Fuzzing is essential for code that accepts external inputs. Heartbleed was discoverable with a fuzzer.


>Code that isn't tested, isn't done. Tests not only verify the expectations, but also prevent future regression

A few weeks ago someone at the company implemented a feature without tests. It worked perfectly, everyone was happy. A release of another new feature by a different team a few days later broke the previous feature :)


In most situations, we don't get paid to make code "done". We get paid to get it close enough to work most of the time.


True, but cURL is in a very different situation compared to your average run-of-the-mill program. It's one of the most massively deployed pieces of open-source software out there. When you are not beholden to capricious shareholders and incompetent managers, you can actually focus on quality.


For libraries, tools, and frameworks, testing is crucial, as it ensures that the code relying on them can focus on the issue at hand. Code can only be as reliable as what it's leaning on. So, to answer your question: a lot of time.

In a business-oriented project (i.e. most jobs), code may undergo frequent changes due to requests from the business, so too much focus on testing could slow down development if extensive testing is required for each change. However, regression tests can still provide valuable insights and allow for faster development later in the life of the project.

Many projects only do happy-path testing, and the value of such tests isn't as high. Coupling them with negative testing, and better yet boundary testing, compels developers to consider both valid and invalid inputs, helping to identify and address potential edge cases before they become bugs or security issues in production.

For instance, this [0] codebase has more tests than actual code, including fuzzing tests.

[0] https://github.com/Attumm/redis-dict/blob/main/tests.py


There's a tradeoff for sure, but to fully evaluate that tradeoff you must also take into account the future time spent on the code base, including the amount of time needed to debug future features, adopt a new build environment, allow new team members to muck around and see what does and does not work...

If it's a small project that won't be extended much, then perhaps re-runnable unit tests may not make the bar as a net-positive tradeoff. But some time must still be spent testing the code, even if it's not tests that are written in code.


It depends on how risk tolerant you are. If you are at a startup, the growth of your startup is often largely dependent on how quickly you can add features / onboard customers. In that context, writing tests not only slows you down from adding new features, it might make it harder to modify existing features. In addition, early customers also tend to be accepting of small bugs -- they themselves already have "taken a risk" in trusting an early startup. So testing is not really valuable -- you want to remain agile, and you won't lose much money because of a bug.

On the other hand, if you are Google, you already have found a money-printing firehose, and you /don't/ want to take on any additional unnecessary risk. Any new code needs to /not/ break existing functionality -- if it does, you might lose out of millions of revenue. In addition, your product becomes so large that it is impossible to manually test every feature. In this case, tests actually help you move /faster/ because they help you at scale automatically ensure that a change does not break anything.

While cURL does not make any money, it is solidly on the mature/Google end of the testing spectrum. It has found a footing in the open source tooling "market" and people rely on it to maintain its existing functionality. In addition, it has accumulated a fairly large surface area, so manual testing is not really feasible for every feature. So testing similarly helps cURL developers move faster (in the long run), not slower.


Everyone always praises SQLite

> As of version 3.42.0 (2023-05-16), the SQLite library consists of approximately 155.8 KSLOC of C code [...] the project has 590 times as much test code and test scripts - 92053.1 KSLOC.

https://www.sqlite.org/testing.html


I once found a confirmed bug in SQLite, and it’s the highlight of all my OSS bug reports.

Interestingly enough I found it while fuzz testing a query builder and cross testing it against PostgreSQL, MySQL and an in-memory filtering engine I wrote.


What the fuck. If they're investing that much then why don't they just go straight to formal verification. This is what things like frama-c (or whatever's popular now) are for.


The problem is that effort to do formal verification goes exponential beyond a certain point.

seL4 is around 10-12 KLoC, and it took a decade of effort from multiple people to make it happen.

At the size of SQLite, especially where they have to operate on platforms with different behavior (as an OS, seL4 is the platform), formal verification is just too much effort.

All that said, your reaction is totally understandable.


Link to how SQLite is tested, for anyone who's curious: https://www.sqlite.org/testing.html

There's also an interesting thing where formal verification requires a formal specification, which afaik there isn't one for SQLite. One of the toughest problems that someone would run into trying to put together a formal specification for code as widely deployed as SQLite boils down to Hyrum's Law[1]: on a long enough time scale, all observable behaviours of your system become interfaces that someone, somewhere depends on.

That massive suite of test cases isn't a formal specification, but given that it achieves 100% branch coverage, it implies to me that it:

- pretty tightly bounds the interface without formally specifying it

- also pretty tightly constrains the implementation to match the current implementation

Which, when you have as many users as you do with SQLite, is probably a fair way of providing guarantees to your users that upgrading from 3.44.0 to 3.45.1 isn't going to break anything you're using unless you were relying on explicitly-identified buggy behaviour (you'd be able to see the delta in the test cases if you looked at the Fossil diffs).

[1] https://www.hyrumslaw.com


https://corecursive.com/066-sqlite-with-richard-hipp/ is an interesting interview with the lead developer for SQLite about the inspiration and work that went into that testing as well.


Also a formal specification can have bugs. Formal verification checks that the code matches the spec, not that the spec implements all desired behaviors and no undesired behaviors.


You're right, though it's worth noting that you can also use formal verification to verify properties about the specification! For example you can verify that a security system doesn't allow privilege escalation.


Yep, though of course it's still on you to remember to verify everything you care about. Actually writing out all your requirements in a formal (or just programming) language is the hard part of programming.


That's the same link I posted in the original comment, JSYK.


Because formal verification doesn't scale at the moment. There are relatively few formal methods experts around the globe, and some of them would need to work on SQLite indefinitely unless you're okay with just a one-off verification. Frama-C has a different background because it's from Inria, an institution with many formal methods experts.


You can do many things other than testing:

- use a memory safe language

- formal verification (multiple implementations even)

- build a simulator like FoundationDB did


The best case IMO is a test suite where writing the test is almost as easy as (or maybe easier than?) doing the equivalent manual test. The test code is maybe 2–4x longer than the change.

The worst case is… arbitrarily bad, to the point of being impossible. Test setup is hell because there are too many dependencies. You have to mock the “real world” since (again) things depend on other things too much and you can’t really do a simple string-manipulating change without setting up a whole environment. Also you introduce bugs in your tests that take longer to debug than the real change you are making.

What I feel that we (current project) have fallen into the trap of is that there wasn’t a test suite from the start. Just manual testing. Then when you get to a stage where you feel you need one, the code base is not structured to make automated testing simple.


I would generally double whatever your expectations are for the initial feature development. Tests are essentially a second implementation from a different angle running in parallel, hoping the results match. Every feature change means changing 2 systems now. You save a bit of subsequent time with easier debugging when other features break tests, but that's somewhat eaten up by maintaining a system twice the size.

There are reasons many MVP developers and small teams whose focus is more on rapid feature implementation than large team coordination or code stability forego writing tests. It doesn't make sense in all circumstances. Generally, more complex, less grokable, more large-team-oriented or public library code is when you need testing.


Tests are only a second implementation if you use test doubles incorrectly. Test doubles should only be used for I/O outside the program under test that you can’t really run locally / that is a network dependency (eg mocking a SaaS service or something), or for performance (mocking database responses vs spinning up a test database instance). If you do it right, most of your tests are just testing each layer and everything below it.

I have yet to see a case where omitting tests actually helps you move meaningfully faster - you’re probably generating more heat than light and that makes you feel like you’re moving faster.


I guess I'm referring here to the unit test standards that seem typically employed by automated code review frameworks - i.e. "100% coverage" checkers. With those, any subsequent change to the system requires understanding and modifying both the original code and the set of tests, and seems to end up being around double the effort. It's not actual duplication - but mocking expected inputs/outputs becomes its own system (and often its own mini-language in most frameworks), which is not always trivial. You may be referring to something different though.

In those situations, yes I stand by them being a costly overhead - which makes sense in large collaborative systems, but not so much in small agile MVPs.


Don't tests protect against future unintended regressions?

If all code were written once and never changed, then testing would probably be a waste of time, but code changes constantly.


Generally speaking, yes. It's not like most code isn't changed as a consequence of writing those tests, so the practice has an immediate benefit, and future changes can be made swiftly and securely due to the confidence those tests should be giving you.

But also your time estimate does sound wonky, 2-3x sounds extreme. Maybe you need to improve your test writing process?


This is testing 101; it’s been discussed to death. As usual, “it’s a trade-off”.

You’re right that writing tests can make shipping a feature slower. However, already having tests makes shipping a feature faster (higher confidence it works, less manual testing of the feature and the adjacent ones). It also lowers the amount of bugs, which are going to slow down delivery of new features. They also increase maintenance cost though (tests are code that needs maintenance too).

Is it worth it? It depends, although it’s quite rare that no tests at all are the right trade-off.


>At that point, is it worth the time invested? Every new feature can take 2-3 times longer to deliver due to adding tests.

Depends if you are writing software to control a pacemaker, or writing software for some silly smartphone game.


It's an economic decision. If you're developing a new indie game with a $10k revenue expectation in its lifetime, then it's probably not worth your time. But if it's the core infrastructure of a multi-billion dollar business, then yes, it's worth your time, since any non-trivial security incident may cost more than your annual salary.


Fuzzing isn't really a "2-3 times longer" thing. You have to set it up once (which for basic fuzzing is quite easy), you throw compute power at it, and it turns up bugs for you. As you make changes to the project the fuzzer will explore new paths mostly automatically.

You may want to spend some time looking at code coverage and doing some of the advanced things outlined in this article, especially for very high risk / reward projects like Curl, but even that is not a lot of work.


> Every new feature can take 2-3 times longer to deliver due to adding tests.

alternate framing: you can have a feature done in half or a third of the time, if you don't care whether it actually works or not

...or whether it will continue working, as other new features are added to the system and the codebase evolves and grows. that sort of maintainability is one of the key things you get from investing in writing tests - you can refactor without fear of breaking some existing feature.

it also allows you to hire engineers and give them a "safety net" where they can make changes and be confident they're not breaking some crucial functionality of the unfamiliar codebase.

done correctly (an important caveat, because lots of people do testing poorly and then conclude that all testing is bad) that time spent writing tests is not wasted effort. cutting corners by skipping tests is very much a false economy in the long term.


It's worth it at the point where quality and security matter, and where not providing them comes with returned products, canceled contracts, or possible lawsuits.


[flagged]


Throw random inputs at a system and see what happens. The primary purpose is to trigger crashes or get the system into some other error state or the like. It's useful for checking the stability/robustness of a system.

https://en.wikipedia.org/wiki/Fuzzing


Generating random test cases, seeing what path each random case goes through, and then trying to generate cases to cover as many paths as possible.

I love fuzzing.
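Concretely, a coverage-guided fuzz target can be as small as a single entry point that the engine calls in a loop with mutated inputs, keeping any input that reaches new code. This is roughly the toy example from the libFuzzer docs (build with clang -fsanitize=fuzzer,address):

    #include <stddef.h>
    #include <stdint.h>

    /* The engine mutates inputs and keeps the ones that hit new paths,
       so it "discovers" the F, then FU, then FUZ prefix step by step. */
    int LLVMFuzzerTestOneInput(const uint8_t *data, size_t size) {
      if (size >= 3 && data[0] == 'F' && data[1] == 'U' && data[2] == 'Z')
        __builtin_trap();  /* simulated bug found via coverage feedback */
      return 0;
    }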


Here’s a survey of fuzzing techniques:

https://arxiv.org/pdf/1812.00140.pdf

One of the earliest examples was a person who fuzzed many applications by getting them to load files he created. His files triggered many vulnerabilities in apps. His method (IIRC): take a file, flip a bit, load it, repeat throughout the file. Amazing how effective it was.
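Something in the spirit of that method, sketched in C. The crash detection here is deliberately crude (just the child's exit status via system()), and the target is whatever program path you pass in:

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /* Flip each bit of a seed file, one at a time, and feed the result
       to a target program: the classic "dumb" file fuzzer. */
    int main(int argc, char **argv) {
      if (argc != 3) {
        fprintf(stderr, "usage: %s seedfile target\n", argv[0]);
        return 1;
      }

      FILE *f = fopen(argv[1], "rb");
      if (!f) return 1;
      fseek(f, 0, SEEK_END);
      long len = ftell(f);
      fseek(f, 0, SEEK_SET);
      unsigned char *buf = malloc(len);
      if (!buf || fread(buf, 1, len, f) != (size_t)len) return 1;
      fclose(f);

      char cmd[4096];
      for (long bit = 0; bit < len * 8; bit++) {
        buf[bit / 8] ^= 1u << (bit % 8);            /* flip one bit */

        FILE *out = fopen("mutated.bin", "wb");
        if (!out) return 1;
        fwrite(buf, 1, len, out);
        fclose(out);

        snprintf(cmd, sizeof(cmd), "%s mutated.bin", argv[2]);
        if (system(cmd) != 0)                        /* crash / nonzero exit */
          fprintf(stderr, "interesting result at bit %ld\n", bit);

        buf[bit / 8] ^= 1u << (bit % 8);            /* restore */
      }
      free(buf);
      return 0;
    }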


[flagged]


The analogy sounds great and catchy (and sufficiently contrarian to get knee jerk upvotes from the site), but it really oversimplifies things and misses the mark on what types of problems fuzzing catches and why that's incredibly important in a project as popular as curl.


Indeed. It's more of an argument not to execute scripts downloaded from the web. Which is valid, but not the point of fortifying curl.


Would it make sense for cURL to detect when it's being piped into `sh`? Possibly have it refuse to proceed unless the site is on a whitelist or the user has added an --i-accept-the-risk option.


That's not possible in general, I think. Can you even get process information out of file descriptors? But the people who write the scripts that embed such curl commands will add the option directly. And others will put it in the command to copy. Or put | cat | sh in the command. And there's also wget.

I think it's not something curl can tackle by itself. Instead, you'd need a file descriptor that informs you about the source, and the shell would have to refuse non-whitelisted sources.


You're right that it can be subverted in general. isatty() can tell if output is being piped. When piped, special safety rules could kick in. Of course there would be false positives, but then there could be an option (like I mentioned above).
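For reference, the check itself is a one-liner, and all it can tell you is that stdout isn't a terminal, not whether the other end is sh, cat, or a file:

    #include <stdio.h>
    #include <unistd.h>

    int main(void) {
      /* isatty() returns 1 if stdout is a terminal; 0 means it has been
         piped or redirected, e.g. into sh, into a file, or into cat. */
      if (!isatty(STDOUT_FILENO))
        fprintf(stderr, "warning: output is being piped or redirected\n");
      return 0;
    }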


I agree that fuzzing is a very important technique, and as I said it's worth fixing any issues found, but I still think that many of the most common uses of curl are very problematic. As you can see, I didn't get any knee jerk upvotes.


cURL is also embedded in a lot of applications. An exploit within curl could lead to a vulnerability in the host application.


It's a tired argument. Running any program has risks.



