LambCI – A continuous integration system built on AWS Lambda

adamb · on July 7, 2016

There's a lot of harsh commentary, but I think people here are missing the point. The fact that so much software needs root access to a (often mutable) global environment in order to properly build is a bug.

There are an increasing number of build systems that encourage squashing these bugs. The resulting build outputs are simpler and are often more portable. They're also easier to reason about. That translates to simpler deployment, simpler operations, and fewer edge cases to debug.

IMHO, the most promising answer to the 5 minute limit is finer granularity and better caching of dependency inputs.

somesaba · on July 6, 2016

Cool! I had an idea for something like this but instead of having each build be it's own lambda event, I wanted to make each individual test it's own lambda event. The goal is to have the build time for a complex project boil down to the time it takes to setup + run the longest test.

adamb · on July 7, 2016

Have you written any code to that effect? I've been trying to think through how booting up dependencies might work (for things like integration tests).

I have aspirations for QA (https://github.com/ajbouh/qa) to learn this trick, but it needs some lambda-specific smarts before it gets there.

falsedan · on July 7, 2016

We're doing something similar, but using mesos. We bundle tests up, so we don't have to eat the setup costs for every test, and try tracking test pollution by splitting bundles when they fail & running them separately, again.

not_yet_a_dalek · on July 7, 2016

That is more or less the same reason I started the eremetic framework. We were tired of dealing with the trouble of using a lot of VMs as jenkins-slaves, and long test suite runtimes.

https://github.com/klarna/eremetic

dominotw · on July 7, 2016

where do you store build caches? npm/gems ect?

falsedan · on July 7, 2016

Dependencies? On the host; they're ephemeral but they'll run multiple builds between coming into existence and disappearing.

dominotw · on July 10, 2016

but first build always take a hit making the build times unpredictable making build time regressions hard to track.

rapind · on July 7, 2016

I love that idea. Map -> Reduce of unit tests.

_qc3o · on July 6, 2016

Cute.

> No root access > 5 min max build time > Bring-your-own-binaries – Lambda has a limited selection of installed software > 1.5GB max memory > Linux only

There's a reason Jenkins is still used so widely. It's not because of utilization or all the other things pointed out. When your project gets big enough, managing the CI pipeline turns into a distributed systems problem with distributed queues, locks, error/failure recovery, and all the other headaches that such systems bring. Heck, reporting alone on a test suite with 12k tests is a problem in and of itself.

adamb · on July 7, 2016

In my (limited) experience, treating your CI pipeline like a distributed system is a design smell. It leads to build processes that are difficult to test, fix, and iterate on.

When a build system can only be effectively invoked by CI/CD, it starts to pervert developer incentives. People need to check things in before they can be sure they work. They don't bother with tiny fixes because of the inertia. Flaky jobs get a quick rebuild, because reproducing a build failure locally is complex enough that they'd prefer to avoid it if they can.

Over time, these add up to a system that grows through accretion, which is the enemy of both agility and understandability.

Better is a build process that uses simple, reusable components that work equally well on developers' machines. These tools can be tested, refined, and replaced incrementally, using the same build processes that the rest of your code base does. You can do this without needing coupling your build processes to the specific way(s) that Jenkins (and company) model builds or their configuration.

_qc3o · on July 7, 2016

Here's a problem statement for you. You have ~12k tests that takes > 40 hours to run sequentially. What do you do?

I know how we've solved the problem to provide as much validation as possible before shipping something to production and at pretty high rates of code churn. Whereas what you're suggesting is untenable on a large enough project. That's like saying drink your milk and have a hearty breakfast. Nice platitudes but not actual engineering. Our solution is not unique in fact. Shoppify and other big shops follow exact same practices (https://www.youtube.com/watch?v=zWR477ypEsc). Not because they don't know any better and haven't heard of setting up proper build pipelines using principles from immutable infrastructure but because at large enough scale you need mutability.

Jenkins was just an example. We don't use Jenkins but you do need something that manages workers and their lifecycle. Saying reduce your test runtime to 5 minutes and have better engineers and tools doesn't cut it.

DanielBMarkham · on July 7, 2016

Good discussion guys. Please keep going.

Isn't the architecture of your build directly related to both the architecture of your system and your deployment?

If so, why would somebody think that a monolithic app, even one with threading and workers built in, be better than simply engineering your own as you go along? After all, this is supposed to be engineering, right? Not "How to use Jenkins"

I agree that platitudes aren't solutions, but code smells are the kind of thing that lead one to actually take ownership instead of perhaps using the same paradigm only larger, yes?

Apologies if I missed the point, dkarapetyan.

_qc3o · on July 7, 2016

Code smell is a little ill-defined. Given two experienced enough engineers they'll smell different things based on what experiences have led them to that point. The general rough guidelines is I guess "things should be as simple as possible but not simpler" and depending on what sets of requirements you've optimized for it might not smell right to someone who values a different set of requirements.

adamb · on July 7, 2016

We suffered with a Jenkins-like solution for a long time before we decided enough was enough and we wanted to use an approach that didn't need as much soul-crushing, CI-specific effort.

If any of our experiences or insights can help others in their own environments, all the better!

adamb · on July 7, 2016

I don't know how large a large project is, but our system is pretty large. We build and test for 4 different operating system flavors and way more than that if you incorporate specific versions and distributions. We run end to end user tests against our applications that test functionality across many of these operating systems. We have broken up our tests into functional groups that have parallelism and caching within the groups and the groups themselves run in parallel. In some cases a single developer or build slave has used 40 machines at once to run these tests (this number was only limited by our budget... windows machines are extra expensive on EC2).

In terms of reporting on tests that run in parallel, we built a tool that specializes in exactly that. It collates output from parallel tests, it times out on tests that are hung, it makes sure the build system doesn't kill it if tests are too silent. It also tracks which tests have run against which versions of the codebase in the past and what their outcomes are. We use supporting tools to analyze test flakiness and understand when they are introduced. We have had a lot of success with this approach, as developers debugging weirdness across many tests is less miserable when they can use the same tools that CI does.

Critically, when bugs in those tools are discovered, developers can pinpoint and fix those bugs locally with reasonable ease. Deploying fixes to the test runner (or the logic that allocates workers for the test runner) is like any other change. No need to tinker with Jenkins (or buildbot, etc) config. No need to take the build system down to test that the change is correct. No need to bring up a test version of the build system and experiment with your change there.

We've gone to great lengths to make our system something that's a joy to work with and helped us be very productive across the many different environments we need to operate in.

It's tough to know how much detail is appropriate in comment threads like these. You're absolutely right that there's a lot that needs to come together to make something like what I've described work. I know because we pulled enough of it together to support our own large and heterogeneous projects.

It sounds like you have also thought about this problem a lot. Can you share more about the sorts of tests (language, test library, etc) you have? Perhaps we can break new ground where each of our respective experiences and intuition intersect.

_qc3o · on July 7, 2016

We have a similar design but need to juggle js and python in some interesting ways. There are only so many variations on the theme of CI so I'm not surprised about the convergent design. Our environment is not as heterogenous and we leverage pre-baked AMIs and LXC containers for isolation and reproducibility.

My contention was the emphasis on local reproducibility. In the past I would have said yes, local reproducibility should be a feature of any well designed CI pipeline but nowadays I'm not sure anymore.

Local development environments are optimized for iteration speed at the cost of reproducibility and stability. Whether this is the right decision or not can be debated. CI environment on the other hand is designed for reproducibility and stability. Those sets of requirements are somewhat at odds and you can't optimize for all at the same time. Tools should be shared across local and CI environments as much as possible but not when it comes at the cost of compromising the requirements for each environment.

GauntletWizard · on July 7, 2016

I disagree. Not that I don't think that simple, reusable components are valuable; They are, and any developer should run tests before sending things off to the cloud. But having things work on developers machines is itself a code smell, because developers have access to them.

Your whole project should build straight from very fresh boxes, and doing builds on developers machines will never be fresh boxes. It's hard to remove the cult knowledge from a development team. My project just discovered a new unlisted dependency when doing a deploy, because every developer knew about it and installed it on their machine beforehand. It was explicitly listed in the build/devtools dependencies, but not supposed to be a runtime dep. Had the developers run tests on a fresh machine, they'd have run into it. (Of course, the CI team had also installed it on the CI boxes, because they used it for debugging.)

For a large class of tests, having the developers run them in the build environment is fine. But you also need to run them in the deploy environment, and to that end developers should hit the CI system. Every test that the CI system runs on each commit should be runnable before the commit. You should have tooling and spare capacity such that the CI system is used to run tests immediately before commit, not right after - That's too late. You should run them whenever a dev sends off for code review. You should run whenever a dev feels like it; If they're in a good spot, run the unit tests, run the CI tests, see what's broken.

adamb · on July 7, 2016

Your comments are fantastically correct. Yes, secret dependencies are the worst. We use an OS sandbox to prevent access outside to non-declared dependencies. That same sandbox prevents access to the network unless a build or test target explicitly declares the need to use networking (e.g. for running tests against network services running on local host).

CI runs the same exact build system (though with a few different options so the outputs are easier to during and after the build).

Passing CI is compulsory, as humans aren't allowed to release changes on our team. Humans may only do code review. If and when a change passes code review, it will be deployed automatically once it passes CI.

We use some of the same compute capacity that our CI system uses to scale test runners across many physical machines (though tests run against a pool of freshly cloned VMs using delta disks so we get a pretty big speedup and lots of control over the environment that tests run in).

There's a fascinating correlation between developer machines and build slaves. It's been my experience that needing to install system software of any kind on one usually leads to a headache later. We've gotten it down to just Xcode on OS X and almost just build-essential on Ubuntu.

So in spirit we do exactly what you're saying, we've just found a way to do it while using the same tooling on both CI and developer machines. We also demand that the build slave images are generated straight from install media and a fixed set of files (like those that install Xcode), so the only simple way to add dependencies (i.e. build tools or libraries) is via our build system. Use of apt, homebrew, etc is completely separate for our developers. And if they mess with the build system in a way that allows those files to leak in, the fact that build slaves are pristine means that their change will fail CI and never be deployed.

Does my explanation make sense? Happy to answer follow up questions. Also happy to be shown where our rigor is lacking :)

adamb · on July 7, 2016

I forgot to mention that using the same build tool for both CI and developers has a number of other advantages, including artifact caching. When a developer downloads a change, their build will pull artifacts from the caches that build slaves have populated. So in many cases new changes (once deployed) are only built once, across the whole company by the build slave that kicks off the deploy. Everyone just reuses that cached output.

It was this sharing of artifacts that provided some of the impetus to use a sandbox, since a polluted output could poison the cache in hard to detect ways.

falsedan · on July 7, 2016

We decided to think of our CI system as a distributed job execution engine for one project… now that we know better, we have a maze of twisty jobs configs to unwind.

Rezo · on July 7, 2016

A Lambda / "serverless" approach makes a ton of sense for CI system just on the premise alone. Why have x number of build slaves sitting doing nothing 95% of the time? The ideal build system would paralellize on-demand to as many workers as possible within a few seconds of a build triggering (I don't want a build slave, I want 50), and then immediately terminate them, which is pretty much the whole point of Lambda.

Jenkins is a huge pain the behind, fragile, the machines always inevitably turns into special snowflakes, configs in git are poorly supported. It's also very expensive in any non-trivial setting and requires full-time babysitters. Cloud CIs still charge quite a lot for any meaningful parallel builds, and there's the security aspect of uploading your code to third-parties, I trust my S3 bucket permissions more than a random CI SaaS. This seems like a great start for a sweet spot between on-prem and SaaS CI.

batterseapower · on July 7, 2016

Hydra (https://nixos.org/hydra/) addresses a couple of these concerns as well, specifically the snowflakiness and version controlled config.

chriswarbo · on July 7, 2016

In the past week I've set up Hydra (adapting https://github.com/peti/hydra-tutorial ). It's quite nice, but there are a few sore points. These mostly seem to be known, so it's just a case of waiting for the features/fixes to land:

- No (semi-)stable releases, requiring mixing and matching git revisions of Hydra and Nix until we find a combination which builds. nixpkgs master (i.e. unstable) contains a Hydra package and module, but I think that's been added after the latest stable nixpkgs release (16.03).

- As a consequence of the above, we need to build a lot of stuff locally since they're not in the binary caches.

- While nixpkgs master contains a NixOS module for Hydra, these modules don't yet work outside NixOS. Hydra can be run on, say, Ubuntu, but it requires manual, imperative configuration of Hydra, Postgres, etc.

- Configuration relies on interacting with the Web UI. "Declarative jobsets", which allow jobset specification via Nix, are in master, but I couldn't get any git revision containing that to build.

- Build slaves must be running NixOS. As I'm stuck with an Ubuntu host, this forces me to use VMs, which has overhead and makes some architecture choices more difficult. NixOS containers may fix this, but they currently only work on NixOS.

- Since I'm forced into VMs, I might as well deploy them with NixOps. The standard NixOS image used by NixOps is only 10GB, and NixOps doesn't support resizing it (there are open pull requests for this).

- The Hydra Web UI allows logging in, configuring, etc. I would prefer if there were a Web UI without write access to the DB, i.e. once the jobsets are set up (via a declarative approach or a write-enabled Web UI), there should be a read-only view of the progress, outputs, logs, etc. (new builds would be triggered by git pushes, or by SSHing to the server and running a command).

- It would be nice if the DB requirements were a little lighter; e.g. if there were an SQLite option. Using the Postgres NixOS module is easy enough, but it's a bit overly-complicated (e.g. setting up credentials and sharing them between hydra and the DB, one-shot systemd jobs to initialise the DB, etc.); plus, if anything goes wrong I'd have to learn how to use Postgres (I already know far more about MySQL, MariaDB, SQLServer and Oracle than I'd like!).

dominotw · on July 7, 2016

use docker swarm or something similar for jenkins slaves. problem solved!!.

falsedan · on July 7, 2016

  > distributed queues, locks, error/failure recovery, […] [test] reporting

Jenkins doesn't help with any of those things.

_qc3o · on July 7, 2016

Neither does AWS lambda but at least Jenkins handles the worker management piece quite well.

falsedan · on July 7, 2016

Which plugin does that?

_qc3o · on July 7, 2016

They have an AWS plugin for spinning up instances and dynamically sizing the pool.

falsedan · on July 7, 2016

Oh boy, I hope it solves the job draining/triggered downstream builds problem!

empath75 · on July 7, 2016

Yeah, there is a place for this kind of thing though. For a lot of small projects, Jenkins is overkill.

Also, a lot of those issues can be worked around by using ecs for the builds.

toomuchtodo · on July 7, 2016

Or just spend the time setting up the right tool in the beginning, so you don't have to waste time later moving to the right tool?

illumin8 · on July 6, 2016

This looks very cool, but the 5 minute build time limit (an inherent limitation of the Lambda service) makes this less than ideal for a build system. The author does address this by recommending that you use Docker containers on ECS as an alternative for long running builds.

a_imho · on July 7, 2016

Hardware is usually the cheapest component when it comes to software manufacturing, but we found out we wanted CI/CD to spin 24x365 as much as it can, increasing the resolution to ~single commits with the shortest possible cycles. With a sizeable codebase and a thorough testsuite AWS bills went up so quickly even the proponents decided it was not worth it. We restored our old CI infra and were able to add a couple of new servers too. The throughput increased considerably with money to spare. Still an interesting experiment, but it showed burning money on Amazon not defaults to moving faster.

nzoschke · on July 7, 2016

Very nice!

This looks very close to the ideal CI infrastructure. I'm used to waiting on queues and long VM or container boots and configuration on other services.

We can almost certainly count on Lambda getting longer execution times and higher memory limits. We can also count on containerization solving the root problem.

We should also be building software with the goal of tests that run within reasonable limits like this.

`time make test` takes 39 seconds on my businesses Go projects. I'd consider a 5m test suite serious tech debt. The time that developers wait for feedback on tests and deployment is becoming a business bottleneck in the continuous delivery age.

rgbrgb · on July 7, 2016

Wow, what's your secret to such fast builds? We're at ~35 minutes to build and test our rails monolith on Wercker. I'm guessing you're not hitting a db too much or loading phantomjs for end-to-end tests of a web ui?

Intermernet · on July 7, 2016

At a guess, it's because they're using Go. The build and test processes are refreshingly (ludicrously) fast in that language.

Note that doesn't mean you should switch to Go, as the ridiculously fast compile times could be outweighed by other factors (retraining, lack of specific features, etc.)

I would however investigate it as an option as it seems to have found a place in the hearts of many former rails shops!

nulltype · on July 6, 2016

Is a Google App Engine application also serverless?

buckbova · on July 6, 2016

Perhaps that's PaaS? Lambda and others are now described as FaaS (function as a service).

This might help:

http://martinfowler.com/bliki/Serverless.html

Google has a seperate cloud functions offering:

https://cloud.google.com/functions/

hactually · on July 6, 2016

Kinda surprised Google don't have Golang as their language. AWS offers nodejs, Java and Python. I wonder if it's due to being able to charge more for slow to start VM based languages?

empath75 · on July 6, 2016

You win buzzword bingo for today.

dllthomas · on July 7, 2016

"LambCI - A continuous integration system built on surface dwellers"

oneplane · on July 6, 2016

How is it serverless if it runs on an Amazon server? Also, how is it serverless if you need to consume a service? (AWS in this case)...

Every time I see something nice, there is this increasing chance that I'm gonna end up sad because it requires some sort of external provider like AWS, DO, Heroku, GCE... I don't have any of those and I don't want any of them.

btown · on July 6, 2016

"Serverless" here refers to not needing a pre-requisitioned server to exist in advance of the request. Sure, the physical hardware exists in AWS, but previously you'd need to have some sort of CI system running on a server that needs to be running and accepting web requests at any time that you might push to Github. With AWS Lambda, which LambCI is run on, a lightweight server boots up in realtime in response to that webhook, runs code, and shuts down. So you have a CI server that can respond any time of the day or night, but only consumes resources when it's actually running.

inopinatus · on July 6, 2016

Serverless also describes the application invocation model.

The design intention for serverless applications uses pub-sub events rather than client-server calls. That's a major enabler of async processing and containers-on-demand, and it looks like LambCI has followed the pattern to a tee.

If you have to accept requests from a non-event-driven world, AWS offer their API Gateway to provide a listening server endpoint, but I think it's telling that this was not available when Lambda was released, and that LambCI does not need it.

oneplane · on July 7, 2016

The point I was trying to make is that it seems like people don't want to make actual services or servers, but rather depend on some sort of closed loop or ecosystem elsewhere, and if you happen to not use that, you can't use the nice stuff they put up on GitHub.

dragonwriter · on July 6, 2016

Serverless is a fairly new and content-free buzzword specific to lambda. To the extent it is sometimes rationalized as communicating something meaningful, it's not something distinct from what almost every PaaS ever has always provided. (Lambda is distinct from earlier PaaS's, and other solutions which mimic Lambda have emerged, but the way in which they are distinct from most earlier PaaS offerings isn't part of how people try to rationalize "serverless" and wouldn't really fit into such a rationalization.

runesoerensen · on July 6, 2016

The author address this question in the announcement blog post (under "PS: Hating on “serverless”?"): https://medium.com/@hichaelmart/lambci-4c3e29d6599b#.3nieqgg...

toomuchtodo · on July 6, 2016

"Well. Look. I’m not going to defend it to the death, but I don’t think it’s anywhere near as bad as some suggest. It’s a term people are using to describe architectures in which you don’t deal with anything resembling a server, or an instance, or similar."

How you can tell OP has very little experience with AWS services being broken for hours before the little "i" gets added to the service on their status board.

AlexCoventry · on July 6, 2016

  > AWS services being broken for hours before the little "i"

Are these failures being systematically tracked and documented somewhere?

toomuchtodo · on July 6, 2016

AWS Status Board I refer to: http://status.aws.amazon.com/

Down detector seems to show when they've had historical issues: http://downdetector.com/status/aws-amazon-web-services

Other HN comments about AWS' Status Board not being "faithful":

https://news.ycombinator.com/item?id=9809315

https://news.ycombinator.com/item?id=11839846

"It will take a direct hit with nuclear weapon on the datacenter for Amazon to change icon to red on service status page."

"Still down 5 hours later. ELB won't register instances. Ugh"

https://www.google.com/search?q=aws+status+hackernews

"Serverless" is not magic. Its only hiding the pain until its excruciating (service down, you wait until someone else fixes it).

discodave · on July 7, 2016

"It will take a direct hit with nuclear weapon on the datacenter for Amazon to change icon to red on service status page."

Should read:

"It will take a direct hit with multiple nuclear weapons on all the data centers in an AWS region for Amazon to change icon to red on service status page."

Not sure if I'm pointing out how good Amazons AZ/region model is or how bad they are at updating the status page. :)

bbcbasic · on July 7, 2016

I'm suspicious because anything slightly critical of AWS or their terminology has been heavily downvoted in this thread.

hn_user2 · on July 7, 2016

Not a down voter. But it's possible the down votes are for the comments spending time complaining about using the term "serverless" instead of something like "FaaS", and not for the criticism of AWS.

As a huge fan of FaaS its sad that half the discussions become arguing over what to call it.

bigiain · on July 6, 2016

There is no "serverless", there is only "the cloud".

There is no "cloud", there is only "someone else's computers".

bbcbasic · on July 7, 2016

It's just some timeshare on the mainframe

inopinatus · on July 7, 2016

You may jest, but when talking to CIOs wary of cloud computing, the comparison to old-school bureau computing does help.