Testing Firefox more efficiently with machine learning (hacks.mozilla.org)
279 points by srirangr on July 11, 2020 | 31 comments



This is the kind of article I'd love to read more of, and in more depth on each part! It allowed me to discover the very well-made docs for contributing to Firefox[0], which feel very welcoming to an enthusiastic non-genius-expert engineer who happens to have some experience with CI, test automation, and a couple of languages.

I assume the overhead of the project (and of subsequent tweaks to the model, re-training and validation) is negligible compared to the measured benefits, even if those weren't as clear-cut as 70%. I don't know how much compute the task requires, but presumably less than many compute-years per day :)

One thing I did not notice in the approach to modeling the problem is any link/tag for the platform the code changes target, or for the programming languages used. There seems to be some evidence that certain languages lead to more defect-fixing commits[1], and I don't know whether there's evidence that some platforms are more prone to bugs (I'm sure wars of words have been fought over this). Would it make sense to have that sort of information inform the model in some way? I fully understand that I might be out of my depth here.

[0] https://firefox-source-docs.mozilla.org/setup/index.html

[1] https://cacm.acm.org/magazines/2017/10/221326-a-large-scale-...


Looking at the code (https://github.com/mozilla/bugbug/blob/master/bugbug/model.p...), they test for any significant words in the code, comments, commit message, tags, etc.

I wouldn't be surprised if language is explicitly included as, for example, a flag in the "data" object. But the model should be able to figure it out by itself otherwise by identifying keywords that only some languages (often) use.
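
If it isn't an explicit flag, a plain bag-of-words pass over the patch text should pick the language up anyway. A toy sketch of the idea (not bugbug's actual pipeline, just scikit-learn):

    # Toy sketch: a bag-of-words featurizer over patch text picks up
    # language-specific keywords as ordinary features.
    from sklearn.feature_extraction.text import CountVectorizer

    patches = [
        "fn main() { let x: u32 = 0; }",           # Rust-flavoured hunk
        "def main():\n    x = 0",                  # Python-flavoured hunk
        "template <typename T> void f(T x) {}",    # C++-flavoured hunk
    ]

    vectorizer = CountVectorizer(token_pattern=r"[A-Za-z_]+")
    X = vectorizer.fit_transform(patches)

    # Tokens like "fn", "def" or "template" become columns a tree-based
    # model can split on, much like an explicit language flag would be.
    print(sorted(vectorizer.vocabulary_), X.shape)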


Fantastic read. My only concern is that there wasn't any discussion of the cost of false positives (selecting a test to run where it is unnecessary) vs. false negatives (incorrectly dismissing a relevant test), as those costs are not symmetrical in their effects.

The cost of a bug slipping through because a test was skipped will be higher than the cost of running a test irrelevant to the commit.


One of the authors here. First off, thanks!

Yes, a regression slipping through would far outweigh the benefits of reduced tests. The thing the post didn't make very clear is that, thanks to our integration branch, the chance of a missed regression is still nearly zero. If the scheduling algorithm misses something, the failure will show up on a "backstop" push. These are pushes where we run everything; a human code sheriff then inspects any failures, and if something was missed, figures out what caused it and backs it out.

So the costs of missed regressions are: 1) more strain on the sheriffs (too much strain means we need to hire more), and 2) more backouts, which are annoying to developers and can mess up annotation (though we have ideas to fix the latter).

For the record, the algorithm with the 70% reduction in tests has a regression rate almost on par with the baseline (it's ~3-4% lower). This hasn't seemed to result in much additional strain on the pipeline.


There isn't any discussion of the cost at all. It just says the number of tests run is down by 70%; it doesn't say anything about the defect detection rate, even though they say that's their cost function.

10 core-years per day sounds like a lot, but it's only about a 10 kW load, and they've saved 70% of that, or about $20 of opex per day.
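
For reference, the back-of-envelope behind that figure (the wattage and electricity price are my own assumptions, not numbers from the post):

    # Power-only view of the cost (all constants are assumptions).
    core_years_per_day = 10
    cores_running_constantly = core_years_per_day * 365                   # ~3650 cores
    watts_per_core = 3                                                    # assumed
    kwh_per_day = cores_running_constantly * watts_per_core / 1000 * 24   # ~263 kWh
    usd_per_kwh = 0.10                                                    # assumed
    saving = 0.70 * kwh_per_day * usd_per_kwh
    print(f"~${saving:.0f}/day if electricity were the only cost")        # ~$18/day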


One of the authors here. I can't exactly deny that line was added to sound impressive, so guilty as charged. However, the savings are much higher than $20/day, for a few reasons:

* Many tasks run on expensive instances (hardware acceleration, Windows)

* We have OSX/Android pools that run on physical devices in a data centre (these are an order of magnitude more expensive than Linux)

* There are ancillary costs. For example, each task generates artifacts, which incur storage costs. These artifacts are then downloaded, which incurs transfer costs.

* There are also overhead costs (idle time, rebooting, etc) that aren't counted in the 10 years / day stat.

All these things see a corresponding decrease in costs with fewer tasks.


Is that really all? That would be 3,650 cores running full time, and 3 W per core sounds too low for power consumption. And do power costs really dominate the price of running CPUs? I'm guessing the savings here are at least an order of magnitude more than your $20/day.

I get about $1000/day based on some EC2 prices for typical machines I've used, though I'm sure Mozilla's requirements are different and they can negotiate better prices than I can.
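
Roughly the arithmetic (the per-core-hour rate is my assumption, not anything Mozilla has published):

    # Instance-price view rather than electricity view (rate is assumed).
    core_hours_per_day = 10 * 365 * 24            # 10 core-years/day, ~87,600 core-hours
    usd_per_core_hour = 0.017                     # assumed spot-ish Linux rate
    daily_cost = core_hours_per_day * usd_per_core_hour
    print(f"~${daily_cost:.0f}/day before the reduction")   # ~$1500/day
    print(f"~${0.70 * daily_cost:.0f}/day saved at 70%")    # ~$1000/day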


I probably missed a few factors, but I just hate a blog post that uses big-sounding numbers when they aren't actually big.


Big for who? Hundreds of machines running constantly is big for me.


> “The cost of a bug slipping through because a test was skipped will be higher than the cost of running a test irrelevant to the commit.”

It really depends on the type of bug, and perhaps this could be factored into the model by also correlating change sets with outage severity or complexity of a fix.


"A bug slipping through" in this case just means slipping through to where it's detected on a later push to the integration branch, or failing that, when a more complete set of tests runs when the change is merged into the main branch. In no case will poor scheduling here result in a bug making it into the final product. It's just that it's more costly in human time to detect it later, so currently the entire goal is set at detecting the problem on the first round of testing after a push.


They talk about reducing on-commit test runs. I'd assume they all run pre-release.


They have a try server that developers can push to in order to run a swath of tests before a change is brought into the integration branch. Outsiders can access it by being vouched for by a Mozilla developer; insiders obviously have access already. Having used it as an outsider, it's kind of a pain, with a lot of setup and options, so having something like `mach try auto` would be awesome for outside devs in addition to reducing server costs.



I always thought software/GUI testing would be a great application for AI, although I've never really sat down to think about how it could be done.


Check this one out: https://mesmerhq.com/

It's for mobile apps mostly though.



URL without redirects & tracking IDs: https://research.google.com/pubs/archive/45861.pdf


Interesting. So for training they use these features:

> In the past, how often did this test fail when the same files were touched?

> How far in the directory tree are the source files from the test files?

> How often in the VCS history were the source files modified together with the test files?

But for prediction all they input is a tuple (TEST, PATCH), and XGBoost works fine without the additional features?


I think they're deriving the additional features at prediction time. The test and patch don't contain all the information you need to compute the features, but they contain sufficient information when combined with a big static lookup table. At least that's the way I read it; agree it could be clearer.
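
Something like the following is what I'd imagine "derived at prediction time" means: the (test, patch) pair gets expanded into the historical features via precomputed tables before the model ever sees it. The table names and the distance heuristic here are my invention, not bugbug's actual code:

    # Hypothetical feature derivation for a (test, patch) pair at prediction time.
    # The lookup tables would be built once from VCS and test-result history.
    import os

    past_failures = {("test_foo.py", "src/foo.cpp"): 7}      # test failed N times when file was touched
    touched_together = {("test_foo.py", "src/foo.cpp"): 42}  # modified in the same commit N times

    def path_distance(a, b):
        """Distance in the directory tree, based on the shared path prefix."""
        a_parts, b_parts = a.split("/"), b.split("/")
        common = len(os.path.commonprefix([a_parts, b_parts]))
        return (len(a_parts) - common) + (len(b_parts) - common)

    def features(test, patch_files):
        return [
            sum(past_failures.get((test, f), 0) for f in patch_files),
            min(path_distance(test, f) for f in patch_files),
            sum(touched_together.get((test, f), 0) for f in patch_files),
        ]

    # This derived vector is what the model actually gets as input.
    print(features("test_foo.py", ["src/foo.cpp", "src/bar.cpp"]))  # [7, 3, 42]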


The most interesting part of this to me was something tangential: they use Redis Queues. Anyone have experience with this? Good or bad impressions?

The documentation is tantalizing, but hilariously short: https://devcenter.heroku.com/articles/python-rq

Very "And then draw the rest of the owl." Oh really, you can just do `from utils import count_words_at_url; q.enqueue(count_words_at_url, 'http://heroku.com')` and presto, your blocking function -- whose source code exists locally -- is run successfully at the other end?

I'll have to set aside some time to try this out. Python does have introspection facilities that could make that possible. I could imagine that since the code is executed on the same box, it's relatively simple to send a request like "here's which module the function was loaded from; here's the order all modules were loaded in; load those modules and call this function." But it leaves so many questions: serialization, performance, scaling, and all the tiny bugs that inevitably come up.
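
Here's roughly what I imagine the round trip looks like, as a minimal sketch assuming a tasks.py importable by both sides (I haven't verified this against RQ's internals):

    # tasks.py -- must be importable by BOTH the enqueuing process and the worker
    import urllib.request

    def count_words_at_url(url):
        with urllib.request.urlopen(url) as resp:
            return len(resp.read().split())


    # producer side -- what goes to Redis is a reference like
    # "tasks.count_words_at_url" plus the pickled arguments, not the function's code
    from redis import Redis
    from rq import Queue
    from tasks import count_words_at_url

    q = Queue(connection=Redis())
    job = q.enqueue(count_words_at_url, "https://example.com")

    # A separate `rq worker` process, started against the same codebase, imports
    # tasks.count_words_at_url by name, runs it, and stores the return value in
    # Redis for the producer to fetch later.

If that's right, it also explains why the docs insist the function be context-free: only the import path and the arguments cross the wire.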

I guess I was hoping someone could give me a quick gut check of positive/negative reactions. The full RQ documentation is slightly better: https://python-rq.org/docs/ but has some worrying signs:

> Make sure that the function call does not depend on its context. In particular, global variables are evil (as always), but also any state that the function depends on (for example a “current” user or “current” web request) is not there when the worker will process it. If you want work done for the “current” user, you should resolve that user to a concrete instance and pass a reference to that user object to the job as an argument.

Yes, sure, global variables are the root of satan, but they're also a fact of life in many scenarios.

Interesting approach... I wonder how much of a nightmare it makes devops...


I've used it a bit in production. Our use case avoided a lot of the potential issues you mentioned, so it may not be entirely helpful:

* serialization: the input was passed as a JSON string argument. The output was a file uploaded to S3, so just the URL was returned, again as a JSON string.

* global variables: the program was quite self-contained: there was an initial state set up once and not mutated afterwards, so RQ's fork-exec model (the default) worked well enough.

Sorry, I don't have much to say about performance and scaling. It was quite fine for our needs; we could scale horizontally up to a certain point just by starting extra processes, and beyond that with more VMs. Since they all listened on the same queue, it worked fine. (The number of items in our queue never really hit any of Redis's limitations either.)

RQ lets you customize the worker model, so you could for instance use threads instead of processes.

Regarding monitoring: there's RQ dashboard[1] which gives a nice web interface to view jobs, failures, and restart them.

1: https://python-rq.org/docs/monitoring/


Thank you very much for the detailed response :) I really appreciate the thoroughness; it convinced me to use RQ.


+1 datapoint in favor of rq: https://twitter.com/sdan_io/status/1285687026386444288 was built with it.


I've used it for small projects and it's okay. RQ is quite simple and limited in a sense.

I guess what I'd recommend to Mozilla (although it's fantastic work) is that it's worth trying to compute certain features (data points) upfront to simplify the main model. These precomputed features can then be used as inputs to the main model, possibly reducing the response time.
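
Concretely, I'm picturing an offline job that walks the history once and persists the expensive aggregates, so the request path only does lookups. A rough sketch (file name and storage choice are just illustrative):

    # Offline job (run periodically): walk the history once, persist the aggregates.
    import json
    from collections import Counter

    def build_cofailure_table(history):
        """history: iterable of (touched_files, failed_tests) per past push."""
        table = Counter()
        for touched_files, failed_tests in history:
            for f in touched_files:
                for t in failed_tests:
                    table[f"{t}|{f}"] += 1
        return table

    history = [(["src/foo.cpp"], ["test_foo.py"])]   # toy stand-in for real data
    with open("cofailure.json", "w") as fh:
        json.dump(build_cofailure_table(history), fh)

    # Request path: no history walk, just a lookup against the persisted table.
    with open("cofailure.json") as fh:
        cofailures = json.load(fh)
    print(cofailures.get("test_foo.py|src/foo.cpp", 0))   # -> 1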


Stability and speed from Firefox are always welcome. I'd love to see some performance gains on armhf (Raspberry Pi 4) in particular. It's good, and close to being blissfully simple.


Wouldn’t you run arm64 Debian rather than armhf Debian on the RPi 4? I haven’t used the RPi 4 in a long while so I don’t remember. But it seems weird to me that what you said would be the case.

https://stackoverflow.com/a/48954012

https://www.debian.org/ports/#portlist-released


32-bit armhf is still the “supported” way of running Raspberry Pi OS. I believe the 64-bit build is on the horizon, but it's still considered beta.

If the consensus is that it's stable enough, though, I'll give it a go.


Really cool project, and a nice high-level overview of all the components. However, I still don't understand the impact measurement: how do you measure the impact of this against the baseline? I didn't get that part from the effectiveness section. Maybe I'm too much of a newb, but you could A/B test this, right? Subject 50% of PRs to the automated tooling and 50% to the manual approach, then compare compute cost and failures between the two?


That's what the shadow scheduler is measuring. If you run a superset of the AI-scheduled set, you can compute how well the AI is doing. Even if you don't run a superset, you can infer the results from subsequent test runs (on a tree with the changeset in question, plus a few more, applied; you just have to be careful not to blame later breakage on your changeset).
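
A toy version of that bookkeeping: given the tests that actually failed on each push (from the fuller run) and the subset the model would have scheduled, you can score how many regressions the reduced run still catches. Names here are illustrative, not the shadow scheduler's real code:

    # Toy evaluation of a reduced schedule against fuller ("superset") runs.
    def score(pushes):
        """pushes: list of (scheduled_tests, actually_failed_tests) sets per push."""
        caught = sum(1 for sched, failed in pushes if failed and (sched & failed))
        regressing = sum(1 for _, failed in pushes if failed)
        scheduled = sum(len(sched) for sched, _ in pushes)
        return caught / regressing, scheduled

    pushes = [
        ({"t1", "t2"}, {"t2"}),   # regression caught by the reduced set
        ({"t3"}, {"t4"}),         # missed here, would surface on a backstop push
        ({"t1"}, set()),          # clean push
    ]
    print(score(pushes))          # (0.5, 4): half the regressions caught, 4 tests scheduled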


yesss



