Hacker News
A Two Month Debugging Story (inburke.com)
89 points by craigkerstiens on May 26, 2016 | 44 comments



> To ensure each test has a clean slate, we clear the database between each test.

The proposed solution to manually nuke the database state seems crazy to me. Some alternatives:

1. Run the entire test in a transaction, do flushes and assert as normal. At the end, ROLLBACK instead of COMMIT and now you have a pristine database again. [1]

2. Setup pristine DB state once and then use `CREATE DATABASE ... WITH TEMPLATE ...` to create a temporary database. Not sure what the perf hit is, but it's probably worth trying. [2]

[1] http://alextechrants.blogspot.ca/2013/08/unit-testing-sqlalc...

[2] https://www.postgresql.org/docs/9.4/static/manage-ag-templat... CREATE DATABASE actually works by copying an existing database. By default, it copies the standard system database named template1.
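The rollback pattern in option 1 fits in a few lines. This is a minimal sketch using Python's stdlib sqlite3 as a stand-in for Postgres (the helper name is made up), but the BEGIN / run test / ROLLBACK shape is the same whatever the driver:

```python
import sqlite3

def run_test_in_transaction(conn, test_fn):
    """Run test_fn inside a transaction, then roll back so the
    database returns to its pristine pre-test state."""
    conn.execute("BEGIN")
    try:
        test_fn(conn)
    finally:
        conn.rollback()  # discard every write the test made

conn = sqlite3.connect(":memory:", isolation_level=None)  # manual tx control
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")

def my_test(c):
    c.execute("INSERT INTO users (name) VALUES ('alice')")
    assert c.execute("SELECT count(*) FROM users").fetchone()[0] == 1

run_test_in_transaction(conn, my_test)
# After rollback, the table is empty again: no manual cleanup needed.
assert conn.execute("SELECT count(*) FROM users").fetchone()[0] == 0
```

The catch, as discussed below, is that the ORM has to let you control the connection so every query in the test runs inside that one transaction.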


Hi, I wrote the post. We'd love to use transactions, but our ORM doesn't support them. More info here: https://kev.inburke.com/kevin/dont-use-sails-or-waterline/

We wrote our own transaction library and have been shifting queries over to it/a saner ORM. https://github.com/shyp/pg-transactions.

I've never heard of database templates, I'll have to take a look.


If your ORM doesn't know about transactions, that doesn't necessarily prevent you from wrapping your test runs in transactions. I was in a situation where I needed to start / rollback a transaction without ActiveRecord knowing about it, and was able to hack AR slightly to make that work -- so you can likely do the same.

You even mention in the Waterline section of that blog post that you've written your own transaction library, so this shouldn't be too hard.


Or run your database on a RAM disk with snapshots (LVM, FS, block device, or virtual host based)


I would imagine resetting the whole db even on a RAM disk after each of 6000 tests wouldn't be particularly fast.


With snapshots, it should be instant.


One advantage of the template approach is it allows for test level parallelism, which may address your primary goal of performance.

> Waterline queries are case insensitive; that is, Users.find().where(name: 'FOO') will turn into SELECT * FROM users WHERE name = LOWER('FOO');. There's no way to turn this off. If you ask Sails to generate an index for you, it will place the index on the uppercased column name, so your queries will miss it. If you generate the index yourself, you pretty much have to use the lowercased column value & force every other query against the database to use that as well.

What the actual fuck!


Oof - seems like you're running on a managed platform that can't be debugged in a reasonable level of detail?

If it were me I'd want to take the problem somewhere I _could_ get root access and debug it properly. So I'd be interested to know what value this (unnamed) CI platform provides to make it worth a wild goose chase.


Additional: Maybe it's just me but not being able to debug drives me CRAZY, sorry that I'm not being more helpful :) I've done bits of embedded systems work where you have to chisel out your own debug tools from glue and LEDs and serial ports and I basically hated it, or rather learned a bunch of ways of making it "right first time".

Test technology surely doesn't need that level of secret sauce that you need a hosted service - after two months I'd definitely want another way of doing it.


This shows the importance of having tools that let you gather information when you need it.


I'm not saying that e2e testing is without value, it can be useful... but I've found it's just not worth it. I write unit tests only, which don't touch the database. My litmus test is: am I testing an "algorithm" -- something whose outputs depend solely on its inputs?

If you insist on tests that touch the DB, the speed can be improved by chaining tests instead of resetting the DB every time (this has its trade-offs).

If I need to test a piece of code that generates a dynamic SQL statement, I would simply assert that the correct SQL is generated. I would not need to actually execute the SQL to see that it is correct. The point is to test that my logic generated the correct SQL, not to test that my database vendor implements SQL correctly. The latter would just be caught in manual testing. I like the BDD school of thought that you are writing specs, not tests.
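Asserting on the generated SQL instead of executing it can be as simple as this sketch (the query-builder function is hypothetical, not Waterline's API):

```python
def find_users_sql(filters):
    """Tiny illustrative query builder: returns parameterized SQL
    plus its bind parameters, with columns in sorted order."""
    where = " AND ".join(f"{col} = ?" for col in sorted(filters))
    return f"SELECT * FROM users WHERE {where}", [filters[k] for k in sorted(filters)]

# Spec-style test: check the SQL our logic produced, never run it.
sql, params = find_users_sql({"name": "FOO"})
assert sql == "SELECT * FROM users WHERE name = ?"
assert params == ["FOO"]
```

The test exercises only our logic; whether the database executes that SQL correctly is the vendor's problem, not ours.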


I second this. I worked at a research company that had terabyte-scale SQL databases of curated data, and the tests would take 5 - 8 hours to run because they were verifying query results rather than the queries themselves. This also meant tests would fail because of changes in the data rather than just changes in the code. I rewrote the entire test suite to match SQL statements against a symbolic SQL description, which turned out to be a much more robust and extensible solution.


At risk of invoking the wrath of test fanatics -- aren't tests supposed to save time? If your tests are both unreliable and so hard to debug that you haven't been able to fix them for two months, I'd think you'd be better off just turning them off and figuring out a better testing strategy.


Hi - we definitely weren't looking at this full time for two months, just off and on as it failed various test runs, and I had time to look at it.

We have a large suite of tests that integrate with Postgres. Maybe one out of 50,000 database queries would fail. It was hard/impossible to predict which query would fail, since the problem wasn't with any individual query. By your logic, we should throw out our entire test suite. I'm not sure what we're supposed to replace it with.


> By your logic, we should throw out our entire test suite.

That's why I figured I was invoking the wrath of the testing fanatics... that is where my logic leads. :-) I have no idea what your tests look like, maybe they're super useful, but I've worked at places in the past where there were thousands of tests that were frankly pretty useless, and were actually a net negative for a variety of reasons, but to point that out got you labeled as a "cowboy". (To be clear I'm not against automated testing, just automated testing done badly.) If you're running them in a context where you can't debug it, at the very least I would move it in house onto a machine where you can hook up a debugger.


Our tests are pretty useful, they check things like "when you ask the API to submit a pickup, it gets assigned to a driver", and so on. Especially in JavaScript where anything can take any value they've been extremely helpful.


That's another argument I've never understood.

I've been working on applications with javascript for over 10 years.

Not once can I remember a javascript bug where someone passed a value as the wrong argument and it made it into production.

Maybe it has happened, but it's just not a common bug. I just don't remember it.

Like, you can easily write it as a mistake as you're developing, but it's obvious as soon as you run it. If that bug makes it into production, your dev didn't even bother to run the code to see if it worked as intended.


First thought that came to mind was to just blow away the database and create a new one, see how long that takes.

I found Waterline's load time included a LOT of things I hadn't expected, and it took a long time... I'm just using knex now. Every time I try for magical solutions, I keep regretting it and end up using less magic.


Magic is fine, but there should always be a path to do things sans pixie dust. For something like an ORM, that would be a path to retrieve the raw DB connection to do things around / between queries. You even see the same pattern play out in programming languages; C# has the `dynamic` keyword that disables type checking on a variable.


What we do when we detect a failure is freeze the test runner instance, and allocate the same failed test to another runner. If the second runner succeeds, we okay the build, but we put the test and the frozen runner in a queue for investigation, and some engineer will be responsible for diagnosing and fixing this intermittent test. This queue is worked every day on a rotating schedule.
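The retry-and-quarantine flow described above can be sketched as follows (a simplified single-process model; in the real setup the second attempt runs on a fresh runner while the failed one stays frozen):

```python
quarantine = []  # intermittent tests queued for the daily investigation rotation

def run_with_retry(test_name, run_test):
    """Run a test; on failure, rerun it once. A pass on retry okays
    the build but quarantines the test for investigation."""
    if run_test():
        return True   # clean pass
    if run_test():
        quarantine.append((test_name, "intermittent: passed on retry"))
        return True   # build okayed, test flagged for the rotation
    return False      # failed twice: treat as a genuine failure

# Deterministic stand-in for a flaky test: fails once, then passes.
attempts = iter([False, True])
assert run_with_retry("flaky_test", lambda: next(attempts))
assert quarantine == [("flaky_test", "intermittent: passed on retry")]
```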

We run our own CI/CD infrastructure, on top of virtualized infrastructure; if you use a third party that doesn't give you that, you might want to look for alternatives.


Has this strategy worked perfectly so far? I would imagine a best two of three method would work better like those used for integrity in high radiation environments. However, I'm guessing code is significantly more deterministic and the danger of false positives (tests passing when they shouldn't) is much smaller.


This has worked well for many years. Most of the intermittently failing tests are older, from before we had robust dependency injection, and they get incrementally improved this way.

False positives are also super costly for us, as everyone works on trunk (by design, to avoid skew) and deploys all check-ins directly to production. "Best out of two" is sufficient for the old tests, and if someone creates new intermittent tests, we follow up with education so that isn't a persistent problem.


I think best practice here is to run the whole test in a transaction and then roll back the transaction at the end. This is the approach Django uses.


We'd love to do this but our ORM doesn't support it. https://kev.inburke.com/kevin/dont-use-sails-or-waterline/


I usually try to minimize the number of tests that rely on this (shared) DB state. So I'll focus more on unit and integration tests. Functional tests still do have value though. See: http://xunitpatterns.com/Testing%20With%20Databases.html

Where possible, I like to have data that can exist independent of the other data on the system. I can make a separate 'tenant' for that test - and just ensure it's wiped before I proceed. Sort of a multi-tenant approach. Works great. I don't bother with a 'teardown', but do any cleanup before the test runs. I also ensure the tests are written to not make assumptions about global state.

Instead of dropping all constraints as the article suggested (that sounds hacky), I use ON DELETE CASCADE constraints. If I miss some, the tests fail. Seems easy enough to maintain.
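The ON DELETE CASCADE approach can be sketched like this, using stdlib sqlite3 as a stand-in for Postgres (note SQLite only enforces foreign keys when the pragma is on; the table names are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
conn.execute("CREATE TABLE tenants (id INTEGER PRIMARY KEY)")
conn.execute("""CREATE TABLE orders (
    id INTEGER PRIMARY KEY,
    tenant_id INTEGER REFERENCES tenants(id) ON DELETE CASCADE)""")
conn.execute("INSERT INTO tenants (id) VALUES (1)")
conn.execute("INSERT INTO orders (id, tenant_id) VALUES (10, 1)")

# Cleanup before the test: one DELETE on the tenant cascades to its rows,
# with no need to know the dependency order or drop any constraints.
conn.execute("DELETE FROM tenants WHERE id = 1")
assert conn.execute("SELECT count(*) FROM orders").fetchone()[0] == 0
```

If a child table is missing the cascade, the parent DELETE fails loudly, which is the "tests fail if I miss some" property described above.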

With the above approach, DB testing is approachable and still pretty quick.


They are using CircleCI, btw. I really like Circle, but they could sure be more helpful with the SSH access, as you mention. Also a personal nitpick, they (still) have no way to test a circle.yml file except for pushing a change and seeing if it works. Anybody that's used the service will know exactly what I'm talking about. You end up pushing four times just to get the syntax right. They know about that (iirc there was a support forum post requesting the feature some years ago), too. But besides that, I heartily recommend Circle.


I always like to read stories of other teams' debugging adventures. One thing that occurred to me as I got toward the end of OP's (ongoing) narrative was that perhaps it's time to fall back and consider other ways of setting up the data environment for each test. If the data that each test is dependent on is a small enough subset, perhaps it makes sense to create a separate database for each test or class of tests, which can just be dropped and recreated before the test run?


> disabled autovacuum by default

I realize this wasn't their solution, but it's worth noting that generally speaking, disabling auto-vacuum isn't recommended. Even if you do, it will still force vacuum jobs to prevent transaction ID wraparound.

* https://www.postgresql.org/docs/9.5/static/routine-vacuuming...


Agreed, but if you are running a CI provider and recycling containers, seems easy enough to do it as part of the recycle step.


I agree it'd be a nice option for them to provide, but I don't believe making it the default for every customer is a laudable goal.

If vacuums are hanging/causing locks and these truly are just test tables, it might be beneficial to use temporary tables instead as the autovacuum daemon ignores them. Additionally, unlogged tables that reside on a memory disk can be insanely fast - might cut some additional time down depending on your data.


I thought instead of telling you to drop your test suite, switch your ORM, or something equally crazy, I would rather just empathize. I'm glad you guys figured it out. We all have these kinds of bugs that drive us insane, the difficult-to-test bugs that can't be reliably reproduced. The feeling when you finally solve them is amazing. So congrats on finding that crazy bug!


With Hibernate we used to connect it to an in-memory DB just for the tests and to a real DB for production.

It was easy and quick to drop schemas.


I agree that would be faster, but we rely a lot on Postgres's database constraints and we've previously had code problems go undiagnosed when the test schema didn't match production. Six of one, half a dozen of the other, I guess.


Why does your test infrastructure allow a schema that doesn't match production? Seems like low hanging fruit there!


Sorry - that was a few months ago; we replaced it as soon as feasible. We had to upgrade a lot of tests to make them work with our prod schema.


Right. In the best of worlds, your tests just apply the production schema update on start, and similarly you apply checked in schemas ASAP to production after passing tests!


> We could draw a dependency graph between our 60 tables and issue the DELETEs in the correct order

If the above works, automate it.


Creating a dependency graph between tables is not exactly a 20-minute task. Certainly doable, but getting it right would take time. Anyway, there's no need: you can just do the deletes in a transaction.


Writing 6000 integration and unit tests can be avoided by using purely functional statically-typed languages. There are some that even compile to JS, such as PureScript. Most of what typical JS tests check can be expressed with pure functions and types.


Have you thought about starting up multiple replicas of your database in parallel and then pointing each test suite to a different one? This way you may be able to parallelize them very well.


This is an interesting idea!


Let me know how it works out for you.


You can just do all the deletes in the same transaction and then the constraints wouldn't need to be disabled.
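One caveat: with Postgres's default immediate foreign-key checking, the deletes still have to run in dependency order even inside a transaction; the ordering only stops mattering if the constraints are declared DEFERRABLE (and deferred with SET CONSTRAINTS ALL DEFERRED). A sketch of that deferred behavior, using SQLite's defer_foreign_keys pragma as a stand-in (table names are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:", isolation_level=None)  # manual tx control
conn.execute("PRAGMA foreign_keys = ON")
conn.execute("CREATE TABLE parents (id INTEGER PRIMARY KEY)")
conn.execute("""CREATE TABLE children (
    id INTEGER PRIMARY KEY,
    parent_id INTEGER REFERENCES parents(id))""")
conn.execute("INSERT INTO parents (id) VALUES (1)")
conn.execute("INSERT INTO children (id, parent_id) VALUES (1, 1)")

conn.execute("BEGIN")
conn.execute("PRAGMA defer_foreign_keys = ON")  # check FKs at COMMIT, not per statement
conn.execute("DELETE FROM parents")   # "wrong" order: parent deleted first
conn.execute("DELETE FROM children")  # but both are gone by commit time
conn.execute("COMMIT")                # deferred check passes

assert conn.execute("SELECT count(*) FROM parents").fetchone()[0] == 0
```

Without the deferral, the `DELETE FROM parents` statement would fail immediately because a child row still references it.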


This might be a weird question, but surely you run your tests locally in a loop?



