
Here's their What Went Wrong:

  1. An engineer wrote a regular expression that could easily backtrack enormously.
  2. A protection that would have helped prevent excessive CPU use by a regular expression was removed by mistake during a refactoring of the WAF weeks prior—a refactoring that was part of making the WAF use less CPU.
  3. The regular expression engine being used didn’t have complexity guarantees.
  4. The test suite didn’t have a way of identifying excessive CPU consumption.
  5. The SOP allowed a non-emergency rule change to go globally into production without a staged rollout.
  6. The rollback plan required running the complete WAF build twice taking too long.
  7. The first alert for the global traffic drop took too long to fire.
  8. We didn’t update our status page quickly enough.
  9. We had difficulty accessing our own systems because of the outage and the bypass procedure wasn’t well trained on.
  10. SREs had lost access to some systems because their credentials had been timed out for security reasons.
  11. Our customers were unable to access the Cloudflare Dashboard or API because they pass through the Cloudflare edge.
Here's my version of what went wrong:

  1. The process for composing complex regular expressions is "engineer tries to shove a lot of symbols into a line" rather than "compile/compose regex programmatically from individual matches"
  2. Production services had no service health watchdog (the kind of thing that makes systemd stop re-running services that repeatedly hang/die)
  3. Performance testing/quality assurance not done before releasing changes (this is not CI/CD)
  4. No gradual rollout
  5. No testing of rollbacks
  6. Lack of emergency response plans / training
All of these things are completely common, by the way, so they're in no way surprising. Budget has to actually be set aside to continuously improve the reliability of a service, or it doesn't get done. These incidents are a good way to get that budget.

(Wrt the regexes: I know they're implementing a new system that avoids a lot of this, but in the new system they can still write regexes, which I think should be constructed programmatically.)




I don't see the relevance of how regexes are written to the problem they had. The engineer didn't typo the regex, or have a hard time understanding what it would match.

Instead, they didn't understand the runtime performance of the regex, as it was implemented in their particular system. No amount of syntax can change that.
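To make the runtime point concrete, here is a rough sketch of a toy backtracking matcher that counts how many match attempts it makes. It is not Cloudflare's engine, and the pattern shape .*.*=.* is an assumption taken from the public postmortem; count_steps and PATTERN are names made up for this illustration.

```python
def count_steps(tokens, text):
    """Match tokens against text from position 0, counting every attempt."""
    steps = 0

    def match(ti, si):
        nonlocal steps
        steps += 1
        if ti == len(tokens):
            return True  # every token consumed: overall match
        tok = tokens[ti]
        if tok == ".*":
            # Greedy: take as much as possible, then give characters back.
            for end in range(len(text), si - 1, -1):
                if match(ti + 1, end):
                    return True
            return False
        # Any other token is a single literal character.
        if si < len(text) and text[si] == tok:
            return match(ti + 1, si + 1)
        return False

    return match(0, 0), steps


# Hypothetical stand-in for the pattern .*.*=.*
PATTERN = [".*", ".*", "=", ".*"]

for n in (10, 20, 40):
    matched, steps = count_steps(PATTERN, "x" * n)
    print(n, matched, steps)
```

On inputs with no "=" the toy matcher never succeeds, and the attempt count goes 78, 253, 903 as the input length goes 10, 20, 40, so the work grows roughly with the square of the input. Nothing about how the pattern is spelled changes that; it is a property of the two adjacent .* groups under backtracking.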


"No amount of syntax can change that"

A framework that lets well-written, "normal" code parse out what you want can produce something easier to understand and maintain, surfacing this type of bug in a more obvious way.

Cryptic syntax is the main reason I avoid regexes (particularly complex ones).

Too much obfuscation between the code you write and the steps your program will take. Granted, my concern doesn't apply to master craftsmen who truly understand the nuances of the tool, but in the real world those are few and far between.

P.S. I get that there was a lot more going on in this postmortem than just one rogue regex.


By writing regexes by hand, you can accidentally introduce an obviously backtracking pattern such as .*.*=.* without noticing. By programmatically composing them, a program can analyze each regex group to find simple problems, and then combine them in ways that will avoid backtracking.

This isn't even why you should compose them programmatically, though. Perl allows you to compose a regex with in-line comments (https://perldoc.perl.org/perlfaq6.html#How-can-I-hope-to-use...), but it's still a hand-crafted regex, which is error-prone, much like composing code by hand. If you can get a machine to generate it for you, you avoid unintentional human-introduced bugs, as well as make it easier to read and reason about.

If you have a ton of regexes, or they are super important to your business, you should consider not editing them by hand. There's only so much that test cases can do to prevent bugs.
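A minimal sketch of what I mean by composing programmatically, assuming Python's re module. Piece, literal, any_run, any_except, and compose are hypothetical helpers invented for this example, and the single check shown (rejecting two adjacent unbounded wildcards) is only one of many a real tool would need.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Piece:
    pattern: str                 # regex source for this piece
    unbounded_any: bool = False  # True if it matches any run of any characters


def literal(text):
    """A piece that matches the given text exactly."""
    return Piece(re.escape(text))


def any_run():
    """A piece equivalent to .* (flagged so compose can inspect it)."""
    return Piece(".*", unbounded_any=True)


def any_except(chars):
    """A run of characters excluding the given ones, e.g. [^=]*."""
    return Piece("[^" + re.escape(chars) + "]*")


def compose(*pieces):
    """Join pieces into one compiled regex, refusing adjacent wildcards."""
    for left, right in zip(pieces, pieces[1:]):
        if left.unbounded_any and right.unbounded_any:
            raise ValueError("adjacent unbounded wildcards invite backtracking")
    return re.compile("".join(p.pattern for p in pieces))


# The hand-written shape is rejected before it ever reaches production.
try:
    compose(any_run(), any_run(), literal("="), any_run())
except ValueError as err:
    print("rejected:", err)

# A composed alternative: everything up to the first "=", then the rest.
key_value = compose(any_except("="), literal("="), any_run())
print(bool(key_value.match("key=value")))
print(bool(key_value.match("no separator here")))
```

The point is not this particular check but that each piece carries metadata a program can inspect, which a hand-typed string of symbols does not.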



