
One thing set off alarm bells in my head from an operational perspective:

> Switching to either the re2 or Rust regex engine which both have run-time guarantees. (ETA: July 31)

That's a short timescale for quite a significant change. I know it's just replacing a piece of automation with one that does the same task, but the guts are all changing, and all automation introduces some level of instability and a bunch of unknowns. From an operations perspective, changing the regex engine is just as significant as introducing new automation, even if it seems like it should be a no-brainer. I'd encourage taking time there (unless this is something they've already been working on for a while and are already doing canary testing).

The other steps look excellent, and they should all collectively give ample breathing room to make sure that switching to re2 or Rust's regex engine won't introduce further issues. There's no need to be doing it on a scale of weeks.

Some quick thoughts about Quicksilver: deploying everywhere super fast is inherently dangerous (for some reason, old-school rocket jumping springs to mind: fine until you get it wrong).

I definitely see the value for customer actions, but for WAF rule rollouts, some kind of automated, progressively widening rollout might be good, and might help catch issues even as the deployment steps beyond the bounds of the PIG etc. canary fleets. Of course, that's also useless in and of itself unless there is some kind of automated feedback mechanism to slow, stop, or undo changes.
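To sketch what I mean (all the names here, deploy_to_fraction, error_rate, rollback, are hypothetical stand-ins for whatever deployment and monitoring hooks actually exist; this is not Cloudflare's tooling):

    // Rough sketch of a ramped rollout with an automated feedback gate:
    // widen the deployment in stages, let metrics bake, and abort/undo
    // automatically if the error rate crosses a threshold.
    use std::{thread, time::Duration};

    const STAGES: &[f64] = &[0.01, 0.05, 0.25, 1.0]; // fraction of the fleet
    const MAX_ERROR_RATE: f64 = 0.001;                // abort threshold
    const BAKE_TIME: Duration = Duration::from_secs(300);

    fn main() {
        for &fraction in STAGES {
            deploy_to_fraction(fraction);
            thread::sleep(BAKE_TIME); // let metrics accumulate before widening

            if error_rate() > MAX_ERROR_RATE {
                rollback(); // automated undo, no human in the loop
                eprintln!("rollout aborted at {:.0}% of the fleet", fraction * 100.0);
                return;
            }
        }
        println!("rollout complete");
    }

    // Hypothetical hooks into deployment and monitoring systems.
    fn deploy_to_fraction(_fraction: f64) { /* push the change to this slice of the fleet */ }
    fn error_rate() -> f64 { 0.0 /* read error/CPU metrics from monitoring */ }
    fn rollback() { /* revert to the last known-good configuration */ }

The stages, bake time, and threshold are all placeholders; the point is just that the ramp and the undo are driven by the feedback signal, not by a human watching a dashboard.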

If I can make a reading suggestion: https://smile.amazon.com/gp/product/0804759464/ref=ppx_yo_dt... The book is "High Reliability Management: Operating on the Edge (High Reliability and Crisis Management)" (unfortunately not available in electronic form). It's focussed on the energy grid in California; the authors were university researchers specialising in high-reliability operations, and they had the good fortune to be on site doing research at the operations centre right when the California brownouts were occurring in the early 2000s. There's a lot to be gleaned from that book, particularly when it comes to automation, and especially changes to automation.




RE2 is bulletproof. It is the de facto regex engine used by anyone looking to deploy regexes to the wild (it was used in the now-defunct Google Code Search, for example). It has a track record. Russ Cox, its author, was affiliated with Ken Thompson from very early in his career.

Rust also has a good pedigree for not being faulty, and BurntSushi, the author of Rust's regex crate, has a strong track record as well...

We switched to RE2 for a massive project 2 years ago and haven't looked back. It is a massive improvement in peace of mind.

If anything, I'm surprised that JGC has allowed the use of PCRE in production and on live inputs...
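To illustrate why: a pattern with nested quantifiers like ^(a+)+$ is a classic backtracking trap. It makes a PCRE-style engine do exponential work on a non-matching input, whereas RE2 and Rust's regex crate compile to automata and match in time linear in the input. A minimal sketch using the Rust regex crate (the pattern and timing harness are my own toy illustration, not Cloudflare's actual rule):

    // Nested quantifiers plus a non-matching suffix: a classic catastrophic
    // backtracking trap for PCRE-style engines. The regex crate (like RE2)
    // compiles to a finite automaton and walks the input in linear time.
    use regex::Regex;
    use std::time::Instant;

    fn main() {
        let re = Regex::new(r"^(a+)+$").unwrap();

        // 40 'a's followed by a 'b': a backtracking engine explores on the
        // order of 2^40 paths here; the automaton scans the string once.
        let input = "a".repeat(40) + "b";

        let start = Instant::now();
        let matched = re.is_match(&input);
        println!("matched={} in {:?}", matched, start.elapsed());
    }

(Needs the regex crate as a dependency; the equivalent pattern run through a backtracking engine will hang for a very long time on that input.)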


I'm absolutely not denying that RE2 is great. Not in the slightest. I even agree with their idea to switch to it or the Rust one.

Changing anything brings an element of risk, and changing to it quickly, which is essentially what they're proposing to do, brings even more. That's where my concern lies.

Their current approach clearly has issues, but it has been running in production for several years now; those issues are well understood, engineers know how to debug them, and there's a lot of institutional knowledge around handling them. They've put a series of protective measures in place following the incident that take out one of the more significant risks. That gives them breathing space to evaluate and verify their options, carry out smaller-scale experiments, train up engineers across the company on any relevant changes, etc. There is no reason to go _fast_.


I concur with your assessment. I'm sure Cloudflare will be cautious and not rush the deployment after the switch. Lesson learned?



