A one-line change decreased our build times by 99% (medium.com/pinterest-engineering)
462 points by luord on Oct 26, 2020 | 243 comments



I think it takes some real humility to post this. No doubt someone will follow up with an “of course...” or “if you don’t understand the tech you use...” comment.

But thank you for this. It takes a bit of courage to point out you’ve been doing something grotesquely inefficient for years and years.


I'd be interested to know how they came to realize what was missing. Did they read the Jenkins docs more thoroughly? Post on a mailing list? See something on StackOverflow? Hire a consultant?


I would expect that internally someone profiled the build (i.e. looked at timestamps) and then either profiled git, or just looked at the logs and did some guessing/research. This didn't seem like it would be complicated to find once you realize the time is spent in git.

Also, this probably has been an exponentially increasing problem, and wasn't really a priority to solve until relatively recently. I would bet there are a lot of stale undeleted branches.


It really doesn't sound complicated to find, unless you just have a hands-off approach to building things and don't care as long as "something" comes out the other end.

What makes me wonder, however, is this: it took 40 minutes before they looked into it? 40 minutes is crazy long. What builds take that long? Chrome, Windows, the Linux kernel on a single core? This should have been raising red flags much earlier. The only explanation I can come up with is that the whole build takes hours anyway; otherwise there is no way you wouldn't notice this sooner.


Remember "build" time likely also includes test runs, packaging for deployment, ... 40 mins is easy to get to, and has nothing to do with "on a single core".


You can still see which stages are taking a long time.


I wonder if people got into the habit of synchronizing git pushes with socializing breaks, with the proverbial excuse of "yeah, compiling ...".

One day, someone forgets to brief the foreign intern about the necessity of breaks, the intern fixes the issue, the pointy-haired boss gets wind of the news, the old crew gets fired, and the new intern gets promoted and also fixes the Pinterest spam on Google Images.


> new intern gets promoted and also fixes the Pinterest spam on Google Images

A man can dream.


It takes >12h to build Windows on the MS build platform.

On a single core, Chromium surely takes hours to build.

Though I agree that 40min for the repository in question is highly suspect.


> On a single core, Chromium surely takes hours to build.

Earlier this year Bruce Dawson had a post indicating that it took about a CPU-day, though coalescing files (“jumbo builds”) significantly reduces build time (we're talking down to 5h). However, that comes at the expense of incremental building, and it constrains the code, as you can get symbol collisions between the coalesced files.


I read that comment as only applying the "single core" part to the Linux kernel, not Chromium.


I work in games. A clean sync of our project takes well over an hour (and if you're at home it takes multiple hours), and compiling takes a large amount of time. We use lots of tricks, for example "unity" builds, with our build system detecting modified files and compiling them standalone. On my last workstation (2x Intel Xeon Gold) it took about 30 minutes to compile.


Yep, I’m shocked that it had to be bloated to 40min before they even thought about fixing it. Anyone who has used Jenkins for nontrivial builds must have had the experience of staring at the slowly expanding session log screen? It doesn’t take any “profiling” to realize git clone’s taking forever.


I remember hearing facebook was on the order of hours (6+) to build.


I would bet on "the new recruit found it, 45 minutes into the onboarding process".


My first task at my current job, after familiarising myself with the codebase, was to improve the CI pipeline.


I think everyone has done this, sometimes it really does take a second set of eyes


In my experience it only takes someone who is not drowning in a backlog of tasks yet.

That tends to be the beginners during the onboarding weeks.


The beauty and power of the beginner's mind.

It can see things that were there all along, but everyone who has been there has developed a blindness to.


I see this kind of stuff at companies where only one or two developers work on a project, or the team working on it hasn't had much experience on other projects.

An example would be a company I worked for who ran a pretty standard LAMP setup but had never heard of memcached. Simply adding that reduced the database load by like 90%.


The same way it always happens to me: Over time build time creeps up and you don't really look at this code thinking "I implemented shallow clone years ago so there is nothing I can do, it's slower because we have more code."

Until you or some other person looks what the code is doing.

It could also be that it was a new hire. I shallow clone a huge monorepo similar in commits/branches and it takes seconds. My experience would instantly tell me that something is worth looking into.


[flagged]


Why are you so angry about this? You've commented throughout this post about how this is boring and the Pinterest team is incompetent. Why?

I found it quite interesting. I've been working in deployments for over 20 years at some pretty big places, and never really thought about this before. I now have a new tool in my toolbox, and I'm quite happy about it.


In general I'm frustrated that rigor, standards, etc are out the window in favor of all this warm fuzziness. I guess I might be angry...

The culture change in our industry, towards warm fuzzies and away from tech screens, results in calculable waste. Time, money, electricity, customers. We lose good engineers and tell ourselves they were a bad fit. We push crap on users just to sell ads. Then we write engineering posts to brag about fixing our own mistakes. It's a terrible shame and I speak up about it to remind everyone there was a time when RTFM would be the only response to this.

Edit: rate limited but one last thing: are we this forgiving of Equifax when they oopsie our data? Seeing this would immediately make me wonder if anything I have shared with Pinterest is safe. That's why they owe us a postmortem and not a thirst trap.


> are we this forgiving of Equifax when they oopsie our data?

The kind of culture you're in favor of, blaming engineers for mistakes and punishing them, is exactly what makes the kind of Equifax mistake possible. Suddenly, people stop improving things and just do the minimum possible so they can keep their job, since anything else can cause a mistake that will cost you your next performance cycle (or, even worse, your job).


I know I have certainly left cans of worms closed at various companies because I knew there was only risk and no reward, even though I would have loved to tackle the problem. Fix a major problem working late and weekends and you get an attaboy at the weekly conference. Do a good job but introduce a relatively minor bug that causes a slight delay in deployment and you get on the manager's shit list for the rest of your time there.


I'm talking about blaming and punishing management, not engineers. I'm sorry that wasn't clear.


1. You seem to be projecting a lot of your own thoughts and biases onto this article, and this discussion.

2. RTFM made sense when you wrote code in one terminal window and compiled it in another and then shipped a CD. There is no way you can RTFM for every tool you use at a modern software company. NO ifs ands or buts, it just is not happening. It sounds like you are looking at a nostalgic view of the past, and not understanding the context of the scope and scale of software that is built these days.

3. Engineers should be encouraged to share their learnings with each other to collectively 'raise the tide'. I will never pooh pooh a development team wanting to share their learnings, even if you may or may not think it was a good idea, it may have helped their team or someone reading the article.

4. We've been pushing crap to sell ads since advertising began, grow up and take a longer look, the technology has complicated it, but it's still the same as it always was.


I personally do think there is a systemic problem in companies trying to hire the bare minimum in skills/experience for technical roles and ending up with people operating right at the limit of their abilities (and intermittently being in over their heads).

But I do agree RTFM is frustrating advice for a lot of things. Especially if you aren't going to use such information often enough, so you will keep forgetting it anyway and have to start over with the manual each time.


> But I do agree RTFM is frustrating advice for a lot of things. Especially if you aren't going to use such information often enough, so you will keep forgetting it anyway and have to start over with the manual each time.

This is related to one of my biggest frustrations of the last ten years. We used to master the tool set. But now there are so many tools, and some of them we don't use often enough. I find myself doing a lot of guessing, whereas before I knew what was going on at every step.


> The culture change in our industry, towards warm fuzzies and away from tech screens, results in calculable waste.

I have not noticed, in the past decade, any move away from tech screens whatsoever.


Are we talking about technological screens, like LCD screens here?


I believe they mean screens in the sense of technical interviews to assess candidates' technical ability.


I don’t know, I consider myself fairly competent but I’d never even considered that. It’s just not so relevant until your repo is multiple gigabytes big.

Still, I’ll see if it works for our pipelines, and we can get our clone from 20s to 1s


Good teams profile everything. This team's only goal is to support other engineers. Build time is a huge issue for every ops team. Missing this for so long is wasted money that's easy to calculate. We can be nice to people while still having high standards. It's a missed opportunity for a deeper postmortem, and it's bland content at best.


I think you live in a highly theoretical parallel universe. The one I occupy is the one where 'good teams' profile those things that take too long.

Take yourself as an example: in spite of the wide availability of free certificates you are still hosting your domain without using a secure transport layer. Some would take that as incompetence. Others would assume you have more stuff on your plate rather than that you don't have high standards.


And others would assume that there are other philosophies out there regarding SSL everywhere. So the question is whose POV has more validity and logical rigor attached to it. I actually can't see any side winning here on a purely logical level, only on an ideological level. At least as long as we are talking about consuming public information.

Am I in favor of the aggressiveness of OP in other posts? No. Am I using SSL myself? Hell yes.

Nonetheless, I understand that there are people who feel that consuming public information like on a private homepage is nothing that necessitates using SSL. Even if I myself have a different ideology/value set governing my decision.

I once heard the comparison that it is like the difference of sending a letter and sending a picture postcard. Not sure if I buy into that, but I can't argue against it on a purely rational basis.


Yes, it's by choice. It's read only, public information. We don't set cookies or anything.

We take security very seriously. But we don't take anything too seriously.

Edit: by the way, persuade me that there's an upside and I'll turn it on.


> The one I occupy is the one where 'good teams' profile those things that take too long.

Deciding which things take too long is profiling. Maybe you do it in your head or with pencil and paper instead of using a software approach but I think your position aligns with "good teams profile everything".


This is neither incompetence nor surprising. Maybe you’ve only worked at large companies who have had time to optimize things for years (and even then, I see grotesque software decisions at my large company quite often). Try accepting that software is often written poorly optimized on the first pass, for good reason, and learn to celebrate the wins without needing to shame someone.


This is Pinterest. Every org I've worked at has been smaller. There's space between shame and ignoring mistakes.

The purpose of this post is not to educate. There's nothing in here that anyone can use to improve. It's just marketing.


Well I learned something from this article, and I thought I had a good handle on CI/CD, so either I am incredibly stupid and shouldn't be reading these 'nothing' articles, or maybe there is so much to learn it's impossible to know it all.


If you don't have massive repos, then this is the sort of thing that is not often a big problem. Also, if you are using things like GitLab runners, you might be in the same AZ, and even large repo clones are fast.

And it is impossible to know it all, I like these articles just for the differing ways people work


It sounds like GP has a bit of a chip on their shoulder when it comes to feeling like you're not worthy if you don't know everything.


Are you just an angry ex Pinterest employee? There is something to learn here and a reminder to pay more attention even when you're knee deep in other tasks. Besides the obvious feature/limitation of git that the author points out.


A team discovers a major efficiency win requiring minimal engineering effort and your response is to... punish them?


People praise my git skills at work (among other things) when they come to me for git help.

My response is always the same: I've just run into these bugs more often than they.


Shocking? Incompetent? A hiring postmortem? Really? AYFKM?


They did a billion dollars of revenue last year, their management and hiring systems seem to be getting the job done.


Yep--although multiple unpleasant experiences with pinterest have spurred me to permaban it from search engine results and smite it with network filters, somewhat wasteful CI/CD pipelines have clearly not prevented the company from flourishing.


I don't get why they have to clone their repo frequently in the first place. It seems to me like a brute-force use of a version control system that's prone to high cost.


It is a nice and foolproof way to get a clean working environment: just download everything from nothing. And you want different working folders for different jobs anyway, so they don't mess with each other or build up state between jobs due to scripting mess-ups.


I don't know about a big org like Pinterest, but it's pretty common for "clone the repo" to be the first step of a CI/CD pipeline when using something like CircleCI or GitlabCI.

It's an easy (if inefficient) way to always get the latest changes and if you have disposable build-runners then it all gets thrown away at the end of the pipeline.


It is interesting that we trust our tools so little. A git hash is a pretty robust way to know whether the code in the repo is what it is supposed to be, so a "git fetch" rather than a fresh "git clone" should be safe, but we can't trust the build steps to not trash the build-runner so the entire thing needs to be thrown away.

Edit: for context, I wrote this comment while waiting for `npm ci` to run. Its first step is to delete the node_modules folder, as otherwise it can't be trusted to update correctly.


> we can't trust the build steps to not trash the build-runner so the entire thing needs to be thrown away.

I think it's partly this, and partly that everything is shared infrastructure now. I don't want to pay to keep a machine up 24/7 just to use it to run a build for 10 minutes half a dozen times per day.

So instead I lease time on shared hardware with ephemeral "containers" or "virtual machines" or whatever.


Jenkins has a setting to keep the checkout directory (default) or to clear the directory between builds.

At my last job, the default was letting broken changes pass the build: they would break some step of the setup/run process that isn't run on a partial build. New joiners came in and they couldn't build because the build was broken.

Had to fix it by setting up two jobs, one running from scratch (30 minutes) and one incremental (10 minutes). The build from scratch was catching a broken change or two every week.


Ephemeral CI runners. I have the same problem at work - 4GB repository that is redownloaded on every single pipeline run.

Another reason (which is why we went for ephemeral runners in the first place...) is that if you have stuff that mounts a directory from the repository as a volume in a Docker container (e.g. for processing data), you may end up with the Docker container frying permissions in the repo folder (e.g. 0:0 owned files). Now, you can put a cleanup step as part of the CI (e.g. docker run --rm -v "$(pwd)":/mnt <image> sh -c 'chown -R $runner_uid:$runner_gid /mnt')... but unfortunately, GitLab does not allow a "finally" step that always gets run, so if the processing fails, the build gets aborted, or the server hosting the runner crashes, the permissions will be fried and a sysadmin will need to manually intervene.

An ephemeral runner using docker:dind however? It simply gets removed.


In order to start with a clean slate and to guarantee state and absence of artefacts from previous builds/pulls it is common practice to start off with a clean directory.


Better title: A one-line change decreased our "git clone" times by 99%.

It's a bit misleading to use "build time" to describe this improvement, as it makes people think about build systems, compilers, header files, or cache. On the other hand, the alternative title is descriptive and helpful to all developers, not only just builders - people who simply need to clone a branch from a large repository can benefit from this tip as well.


Right, from the article:

"This simple one line change reduced our clone times by 99% and significantly reduced our build times as a result."

So the title is just completely wrong.


There's also this part of the article:

"We found that setting the refspec option during git fetch reduced our build times by 99%."

So, the article contains contradictions.


They set out to reduce build times, not to reduce git checkout times. It turns out that 99% of the entire build was spent downloading code.


Where does the article say "99% of the entire build was spent downloading code"?


The title. If they reduced the build time by that much, then at least that much of the build time must have been spent downloading code.

If the title is a lie (which it probably is), then nevermind that number, but it's clear where it came from.


The text of the article clearly states that clone time was reduced by 99%.

The only way build time could have been reduced by 99% is if every part of the build other than cloning is negligible. It is far more plausible to assume that the title is simply wrong.


It quotes a jenkins job going from 40 minutes to 30 seconds.


They say "Cloning our largest repo, Pinboard, went from 40 minutes to 30 seconds"

Presumably the build does more than just clone


This isn't true either, as the article says that builds went from 40 minutes to 30 minutes. The time spent cloning was presumably about 10 minutes and came down very far, presumably by 99%.


> the article says that builds went from 40 minutes to 30 minutes.

Where in the article does it say that? The article says this:

> This simple one line change reduced our clone times by 99% and significantly reduced our build times as a result. Cloning our largest repo, Pinboard went from 40 minutes to 30 seconds.

Both of those sentences say the clone time was reduced by 99%. There are no percentage numbers given for how much the build time was reduced, nor any numbers about the total build time.


It says from 40 minutes to 30 seconds, not minutes.


I stand quite corrected. Sorry, all!


This reminds me of my first programming job in 2005, working with Macromedia Flash. They had one other Flash programmer who only worked there every once in a while because he was actually studying in college, and he was working on some kind of project from hell that, among other problems, took about two minutes to build to SWF.

Eventually they stopped asking him to come because he couldn't get anything done, and so I had a look at it. In the Movie Clip library of the project I found he had an empty text field somewhere that was configured to include a copy of almost the entire Unicode range, including thousands of CJK characters, so each time you built the SWF it would collect and compress numerous different scripts from different fonts as vectors for use by the program. And it wasn't even being used by anything.

Once I removed that one empty text field, builds went down to about ~3 seconds.


I take it that this is not something he added himself, but was likely a catch-all default of textfields at the time?


Yep. In order to use non-standard fonts in Flash I recall you had to embed the fonts, even if the movie clip containing the textfield was not being used anywhere.


This is the most I've ever gotten out of Pinterest. Other than this, it's just the "wrong site that Google turns up, that I can't use because it wants me to create an account just to view the image I searched for".


Can we not do the thing where we pick an organization from an article and then bring up the most generic complaint you can about it in a way that is entirely irrelevant to the post? We get it, you don't like Pinterest showing up in search results, nobody does. But this has absolutely nothing to do with the article other than it being pattern matching on the word "Pinterest", which is about the least informative comment you can make aside from outright trolling or spam. There are threads that come up from time to time where such comments would be appropriate, if not particularly substantive.


I guess you're right. I've not noticed this being a topic before, and I should have spent more words telling that the article in question is actually quite interesting, it definitely made me consider our own Jenkins setup.


Thanks :) I don't want to make it seem like I'm after you in particular, it's just that you were the top comment in this thread and it's long past the time of night when I should have logged off and gone to bed, so my patience for this was just a little thinner than it usually is. It's just that enough people have done this that I figured I might as well steal the second-to-top comment spot with this in the hopes that they might see it and not do it anymore.


So if Monsanto/Bayer had a post about their bio informatics stack, you'd expect nobody to complain about the company and its business practices?

Sometimes the negative impact of a company is just more interesting to people than what the article brings to the table.


It’s not surprising when certain firms evoke a strong personal feeling, but it’d be terribly exhausting if every article about, say, React, attracted the annotation that Facebook is the Philip Morris of media. The subsequent discussion then tends toward the divisive and derisive rather than the illuminating and informative. Hard to tell anyone they should suppress what they feel, but overall I’d tip the balance towards “fewer like this please.”


I think it's valuable to keep saying it because otherwise we start thinking it's okay to fetishise a company's products just because they're technologically interesting. If a company made them on the back of incredibly shady and unethical dealings, they shouldn't be getting free advertising here.


Who here is fetishizing products because they learned something from the engineering blog? This is not happening.


I wouldn't expect it, because I have been here long enough to know that that is just not going to happen, but I would very much like it to be so, yes. Rehashing the same topic whenever you see something tangentially related is just a lazy karma grab, not an attempt at creating interesting, insightful conversation.

Look, I get it, sometimes you want to rant about a company that you think is doing something you don't like: my point is that we have specific threads for them where such a comment could at least be on-topic. When you come to an article about Pinterest doing some git thing to make their builds faster and your comment is "they're ruining my search!", you're commenting at the level of someone who hasn't read the blog post.


The point is it’s not directly relevant to the article, and on top of that GP’s particular complaint was especially generic. In this case Pinterest’s negative impact is not that interesting and it’s constantly discussed too.


HN has always been very predictive.

Praise Microsoft for turning the corner, Dislike Google for ads and snooping, Praise Apple for privacy, Dislike Zoom for privacy, Dislike Pinterest for middlewaring Google Image, and so on.


I'm not even complaining about Hacker News being predictive, we all know that likes to have certain conversations and there is no stopping that. My only request is that this doesn't happen in every single thread regardless of whether it is relevant or not. (To be clear, I am "guilty" of the former myself; there are a handful of topics that I have a particular opinion about and I don't hesitate to share them even if I have mentioned them many times before. I just try to not bring them up in places where they clearly have no connection to what's being talked about.)


Friendly amendment: *predictable


Sorry no. If an article is paywalled, on Pinterest, or similar, then please let's discuss the source instead, even if it ruins the discussion, so people learn not to post such links.


TFA isn't paywalled, or on Pinterest, or similar.


Paywall complaints are explicitly off-topic: https://news.ycombinator.com/item?id=10178989. I am not a moderator, but I think I've made it clear that I personally consider comments like the one I responded to be as well.

FWIW, in the all years I have been on this site, I have seen this happen regularly and I have yet to see any reduction in such links or these kinds of discussions. Seeing as you've been here longer, I'd be curious to hear about why you might feel differently.


We heard your complaint but you are acting entitled now. People are free to register, free to comment; if you don't like it, downvote it. It is the top comment, which means it is being upvoted. Get over it.


I tend to downvote very rarely and only for clear violations of the rules, not for comments I don't like. Telling the author why you didn't like something they did often gets them to change or explain their behavior. Just because something is upvoted doesn't mean it is something that should be on Hacker News.


I just don’t mind repeating myself whether it changes anything or not I guess. Simply because discussing paywalled links or Pinterest linked is invariably more interesting than whatever is found (or not found) when following those links.


I am not sure why google does not penalize this behavior in their search ranking.


The most frequent search keyword that I use is "-pinterest"


Yes, there seems to be no way to make it clear to Google that we want to never see certain websites in our search results. Yet, Google claims they need our information to "improve our experience".


If Google wants my information to improve my experience, I'd love to be able to vote search results up or down. Or entire sites, like pinterest and content farms.


Wasn't that a thing? I remember a +1 button somewhere on Google Search

Edit: I misremembered, it was a social network thing from Google+ https://www.techspot.com/news/43064-google-adds-1-button-to-...


There was a thing called SearchWiki where you could adjust your own results. It didn't last long.


For what it's worth, DDG image results don't get spammed by Pinterest. While my browsing is a drop in the ocean compared to Google's market share, using a Google competitor is as clear a signal as one can send that you're unhappy with the Google service.


-pinterest should be a search extension.


That's my experience too. Imagine how many views they have lost over the years, just because they require a login.

And shame on you, Google, for playing along and indexing their shit, when it's not visible when I click through.


This fact has forced people to write browser extensions to filter Pinterest out.

I opt for the "teach non-tech people how to dork" route instead: https://soatok.blog/2020/07/21/dorking-your-way-to-search-re...


This is one situation where a duckduckgo search is objectively of a better signal/noise ratio.


Yeah, I always believed it was some kind of lone evil AI that lives through search results.


The worst experience is when, while doing a web search, you find a dead link whose useful information now exists only as a Pinterest snapshot...


Y'know, I actually made a Pinterest account once because of one particular picture I really wanted. Guess what, even with an account you can't have it. Oh well, guess I'll just let it go.


They also created/maintain the kotlin linter, "ktlint".


On my first job, 20 years ago, we used a custom Visual C framework that generated one huge .h file that connected all sorts of stuff together. Amongst other things, that .h file contained a list of 10,000 const uints, which were included in every file, and compiled in every file. Compiling that project took hours. At some point I wrote a script that changed all those const uints to #define, which cut our build time to a much more manageable half hour.

Project lead called it the biggest productivity improvement in the project; now we could build over lunch instead of over the weekend.

If there's a step in your build pipeline that takes an unreasonable amount of time, it's worth checking why. In my current project, the slowest part of our build pipeline is the Cypress tests. (They're also the most unreliable part.)


At my second job in the industry I worked on a Python project that had to be deployed in a kind of sandboxed production environment where we had no internet access.

Deploys were painful, as any missing dependency had to be searched for on our notebooks over 3G, then copied to external storage, which was then plugged into a Windows machine, uploaded to the production server through SCP, and then deployed manually over SSH. Sometimes we spent hours doing this again and again until all dependencies were finally resolved.

I worked there for almost a year, did many cool gigs and learned a lot. But my most valuable contribution came when, at some point, tired of the unpredictable torture that deploys had become, I started researching solutions. I set up a PyPI proxy on one of our spare office machines and routed all my daily package installs through it. Then I copied the entire proxy contents onto the production machine before every deploy, and voila, no more surprises.

I left this job a few weeks later, but have heard that this solution was very useful for many devs that joined the team afterwards.


I suppose no Docker containers were allowed in prod either?


Of course not. That was before docker, circa 2010. Our production environment was impossible to recreate.


> If there's a step in your build pipeline that takes an unreasonable amount of time, it's worth checking why. In my current project, the slowest part of our build pipeline is the Cypress tests. (They're also the most unreliable part.)

Would you say the (slow and unreliable) Cypress tests are worth it still?


I don't know. We need some sort of e2e tests, and all e2e test frameworks are terrible in one way or another. Cypress is okay. I would prefer to only run it on production or the dev server and have alarms go off when they fail, but either the requirement is, or other developers have decided that it's necessary to pass all e2e tests before a feature branch can be merged into the master branch.

And I get the reason for it; you don't want to accidentally merge breaking changes. But it does make our build pipelines very slow and unreliable.

So are they worth it? I don't know. If I had my way, we'd only run them on master, and not make it a requirement for feature branches to pass them. Because if you fix one tiny thing, you now have to wait 15 minutes again for the Cypress tests to run. I think they'd be better in a different setup than what we're doing.


We had similar issues with integration tests, and made them a separate jenkins job that didn't trigger automatically, but gitlab was still configured to require them to pass for merge. We would kick it off manually only after all other code review was complete. Then the only cases where we had to re-run it were the same cases where it would have failed in master if we only ran the test there, but it saved us the hassle of reverting or feeling pressured to get hotfixes into master quickly.


Check out https://reflect.run/ as a replacement for Cypress. I started using it recently to do E2E testing at work in our staging environment to run a suite of tests before we move anything to production.

So far it's been great and has saved a couple of releases in a month or so of use!


That's the nature of UI tests for the most part. IIRC Cypress tests are written declaratively, which would make them even more unreliable and slow, albeit easier to fix.

Personally I've recently started using Playwright and I'm quite happy with it. There was occasional misunderstanding of their API, but 95% of time it's great. Microsoft is kicking butt these days.


Cypress is horribly unreliable. We used to use it, and tests would pass, then fail on subsequent runs with no code changes, due to internal bugs within Cypress screenshot plugins, if I remember right.

I have no idea if it is any better now, but we dropped it about 6 months ago in favor of pure Selenium C# for our UI tests.

edit: a word


> In my current project, the slowest part of our build pipeline is the Cypress tests

Oh man, I feel your pain.


Personally I think longer tests (like a full Cypress run) should not be a boundary to merging in prod if they take more than 10 minutes, but should be run nightly or continuously in the background.

I've not yet had the opportunity of having a large Cypress suite (working on it as we speak), but is it still more stable than e.g. Selenium is? Honestly 80% of issues we had with that were 'unstable' tests.


Exactly. I would much prefer a setup like that over our current rule that all cypress tests must pass before merging.

A better rule might be that at least one unit or e2e test was added or updated to reflect the change in the code, and that that particular test succeeds. But run all the others on master.

One advantage (or occasional disadvantage) of Cypress test before merging, is that there is someone clearly responsible for fixing it if a test fails. Problem is, sometimes the failing test has nothing to do with anything the creator of the pull request did. It's still a mystery how that's possible, but it happens. Hence my feeling that Cypress tests aren't very reliable. At least some of ours aren't.


Unfortunately, the issues we had with Cypress were with the framework itself, not the tests.

I used to write automation, and I can say that Selenium tests can be written to be very stable. Just depends on how they are written.


I sympathise a lot with this post! Git cloning can be shockingly slow.

As a personal anecdote, clones of the Rust repository in CI used to be pretty slow, and on investigating we found out that one key problem was cloning the LLVM submodule (which Rust has a fork of).

In the end we put in place a hack to download the tar.gz of our LLVM repo from github and just copy it in place of the submodule, rather than cloning it. [0]

Also, as a counterpoint to some other comments in this thread - it's really easy to just shrug off CI getting slower. A few minutes here and there adds up. It was only because our CI would hard-fail after 3 hours that the infra team really started digging in (on this and other things) - had we left it, I suspect we might be at around 5 hours by now! Contributors want to do their work, not investigate "what does a git clone really do".

p.s. our first take on this was to have the submodules cloned and stored in the CI cache, then use the rather neat `--reference` flag [1] to grab objects from this local cache when initialising the submodule - incrementally updating the CI cache was way cheaper than recloning each time. Sadly the CI provider wasn't great at handling multi-GB caches, so we went with the approach outlined above.

[0] https://github.com/rust-lang/rust/blob/1.47.0/src/ci/init_re...

[1] https://github.com/rust-lang/rust/commit/0347ff58230af512c95...
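(For anyone curious, the --reference approach from [1] looks roughly like this; the cache path below is a placeholder:)

    # populate a bare cache once (e.g. restored from the CI cache)
    git clone --bare https://github.com/rust-lang/llvm-project.git /ci-cache/llvm.git
    # later clones borrow objects from the cache instead of refetching them
    git submodule update --init --reference /ci-cache/llvm.git src/llvm-project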


> Contributors want to do their work, not investigate "what does a git clone really do".

Exactly this. Especially if the repo and CI pipeline are complicated, it is incredibly easy to just assume “it’s slow” is a fact of life.

And from the point of view of the dev-productivity team, well, they have tons of possible issues to deal with at any given time. Not just CI but the repos themselves, the build system, maybe IDEs, debuggers, ... Sure the fix ends up being easy but you have to know to go looking for it.


When you’ve got a billion other tasks to do, you might even know that it could be orders of magnitude faster and still not fix it, simply because of higher priority work.

Frankly, I’d rather spend extra time trying to address problems/bugs/potential security holes in the actual shipped code than in fixing a poorly working CI pipeline...and I’m the kind of dev who gets really irritated by these problems. But you have to prioritize.

Basically, barring “external” forces like cost overflow, customer unhappiness, or similar...stuff like that gets fixed at an equilibrium point between how much the problem hurts the dev, how adjacent to the codebase the devs current work is, and how interesting/irritating the dev finds the problem.


Out of curiosity, why not use the submodule.<name>.shallow option in .gitmodules?
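(That option is set per submodule in .gitmodules; a sketch of what I mean, with path/URL as placeholders:)

    [submodule "src/llvm-project"]
        path = src/llvm-project
        url = https://github.com/rust-lang/llvm-project.git
        shallow = true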


Primarily because, until you mentioned it now, I wasn't even aware it was an option!

That said, I generally shy away from shallow clones and probably wouldn't use it here:

- it's a trap for people who ever want to work in that repo normally (we use the trick for more than just LLVM)

- I believe shallow clones, over time (e.g. for contributors), are less efficient than deep clones; I would expect shallow cloning to reuse fewer objects and benefit less from git's design. [0] describes a historic issue on this topic

[0] https://github.com/CocoaPods/CocoaPods/issues/4989#issuecomm...


> Even though we’re telling Git to do a shallow clone, to not fetch any tags, and to fetch the last 50 commits ...

What is the reason for cloning 50 commits? Whenever I clone a repo off GitHub for a quick build and don't care about sending patches back, I always use --depth=1 to avoid any history or stale assets. Is there a reason to get more commits if you don't care about having a local copy of the history? Do automated build pipelines need more info?


Some tools (like linters) might need to look at the actual changes that occurred for various reasons, such as to avoid doing redundant work on unmodified files. To do that, you need all the merge bases... which can present a kind of a chicken-and-egg problem because, to figure this out with git, you need the commits to be there locally to begin with. I'm sure you can find a way around it if you put enough effort into scripting against the remote git server, but you might need to deal with git internals in the process, and it's kind of a pain compared to just cloning the whole repo.
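(A rough sketch of the usual workaround, assuming the target branch is origin/master; in practice you'd cap the loop:)

    # keep deepening the shallow clone until a merge base with the target branch exists
    until git merge-base origin/master HEAD >/dev/null 2>&1; do
        git fetch --deepen=50 origin master
    done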


If you're interested in metadata, you can use --filter=blob:none to get the commit history but without any file contents.
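(For example; the URL is a placeholder:)

    # "blobless" partial clone: full commit/tree history, file contents fetched lazily on checkout
    git clone --filter=blob:none https://github.com/example/big-repo.git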


Did not know, that's great, thanks! Seems this is a relatively recent feature?


I can’t speak for the original post, but I’ve seen other people[1] increase the commit count because part of the build process looks for a specific commit to checkout after cloning. If you have pull requests landing concurrently and you only clone the most recent commit, there is a race condition between when you queue the build with a specific commit id and when you start the clone.

All that being said, I don’t know why you would need your build agents to clone the whole damn repo for every build. Why not keep a copy around? That’s what TFS does.

One other thing I've seen to reduce the Git clone bottleneck is to clone from Git once, create a Git bundle from the clone, upload the bundle to cloud storage, and then have the subsequent steps use the bundle instead of cloning directly. See these two files for the .NET Runtime repo[2][3]. I assume they do this because the clone step is slow or unreliable and then the subsequent moving around of the bundle is faster and more reliable. It also makes every node get the exact same clone (they build on macOS, Windows, and Linux).

Lastly, be careful with the depth option when cloning. It causes a higher CPU burden on the remote. You can see this in the console output when the remote says it is compressing objects. And if you subsequently do a normal fetch after a shallow clone, you can cause the server to do ever more work[4].

1: https://github.com/dotnet/runtime/pull/35109

2: https://github.com/dotnet/runtime/blob/693c1f05188330e270b01...

3: https://github.com/dotnet/runtime/blob/693c1f05188330e270b01...

4: https://github.com/CocoaPods/CocoaPods/issues/4989#issuecomm...
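(A rough sketch of the bundle trick; URL and paths are placeholders:)

    git clone --mirror https://github.com/example/repo.git repo.git
    git -C repo.git bundle create repo.bundle --all     # snapshot the whole repo into one file
    # ...upload repo.bundle to cloud storage; then on each build agent:
    git clone repo.bundle workdir
    git -C workdir remote set-url origin https://github.com/example/repo.git
    git -C workdir fetch origin                         # top up with anything newer than the bundle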


Also worth noting that git is pretty efficient at cloning a bunch of subsequent commits, due to delta encoding.

edit: looks like git doesn't implement fetching thin packs when populating a shallow clone. It will still avoid fetching unnecessary packs, so the efficiency is still high for most software repositories.


Does git do delta encoding during clones? I know it doesn’t use deltas for most things.


I am fairly sure it uses thin packs during a clone usually. Though I checked the docs at https://www.git-scm.com/docs/shallow, and it says:

> There are some unfinished ends of the whole shallow business:

> - maybe we have to force non-thin packs when fetching into a shallow repo (ATM they are forced non-thin).


Tags. All of my builds use `git describe` to get a meaningful version number for the build.
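(Which is why the tags matter; the output shape here is illustrative:)

    git describe --tags --always --dirty
    # -> v2.3.1-14-gdeadbee  (nearest tag, commits since that tag, abbreviated hash)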


I expected this to be some micro-optimization of moving a thing from taking 10 seconds to 100ms.

> Cloning our largest repo, Pinboard went from 40 minutes to 30 seconds.

This is both very impressive and very disheartening. If a process in my CI was taking 40 minutes, I would have been investigating well before it got to that point.

I don't mean to throw shade on the Pinterest engineering team, but it speaks to an institutional complacency with things like this.

I'm sure everyone was happy when the clone took 1 second.

I doubt anyone noticed when the clone took 1 minute.

Someone probably started to notice when the clone took 5 minutes but didn't look.

Someone probably tried to fix it when the clone was taking 10 minutes and failed.

I wonder what 'institutional complacencies' we have. Problems we assume are unsolvable but are actually very trivial to solve.


I'm not sure this is complacency - this just seems like regular old tech debt. The build takes 40 minutes but everyone has other things to do and there is no time to tend to the debt. Then one day someone has some cycles and discovers a one line change fixes the underlying issue.

I'm sure many engineering projects have similar improvements that just get a ticket/issue opened and never revisited due to the mountain of other seemingly pressing issues. From IPO to the start of the year Pinterest stock price had been trending downwards - I'm sure there was more external pressure to increase profitability than to fix CI build times. The stock has completely turned around since COVID, so I'm sure that changes things


IMHO (from having addressed such CI issues personally on teams that otherwise wouldn't bother) it's likely due to other factors, like a lack of interest, being scared of breaking the build, not being terribly comfortable touching build scripts, or the inability to run scripts locally, than a genuine lack of time. The returns you can get can be ridiculously huge across the entire team compared to the hours you might spend, but I've found many people just aren't terribly interested in sitting down and digging into ugly scripts and pushing dozens of commits to figure out what might be slowing things down. And honestly, it's not exactly trivial to structure things in a way that's simultaneously both efficient and maintainable, especially if you're refactoring an existing system instead of starting from scratch, so that can be another turn-off.


For me the biggest issue is that CI is often siloed to hell and back.

Even when most of the rest of the engineering environment is fine, the build scripts and configuration often aren't under version control themselves, or are manually deployed - meaning any changes require access to carefully guarded server credentials. This may even be by design as a "security measure" - as if I didn't already have the ability to run arbitrary code on the build servers in question through unit tests etc. The gatekeepers in question are often an underfunded IT department that has too much on their plate already, and are underwhelmed by the idea of reviewing a bunch of changes to "legacy" code that they've somehow convinced themselves they'll rewrite "soon" that they don't directly benefit from anyways.

And I find I can rarely run the scripts locally. They're also often hideously locked in to a specific CI solution that I can't locally install without a ton of work on my part to figure out the mess of undocumented dependencies, and rife with edge cases that I can't easily imitate on my dev machines.

My preferred CI setups involve a single configuration file, checked into the same repository it's configuring CI for, that simply forwards to a low-dependencies script that works on dev machines. Getting there from an existing CI setup, however, can be quite the challenge.


Or just creeping build time over years: "it's always taken a while, I guess it just takes longer now". You don't bother optimizing things until they cause you sufficient pain to optimize them.


I can totally see a situation where the engineers who made the script are long gone, the new engineers are justifying their hiring by churning out features and trying not to break things, especially things they don't own that affect everyone, like CI/CD, and that annoying but manageable 40-minute wait just gets put on the backlog, waiting for half a year until someone with just enough experience and frustration makes a push to management to dedicate a bit of time to diving into the issue.


My assumption is that it's some or all of those, more than people thinking it's "fine"; deficiencies more than complacencies.


Yup, it's all about incentives alignment. If you get promoted for shipping a feature but you don't get promoted for saving 40 minutes of everybody's time every day you will get a lot of features, delivered slowly.


This is the kind of thinking I tried to sell at my corp, where cloning the monorepo takes 30 minutes and building this monstrosity takes 1.5 hours (the first time). I got scolded by management for saying that speed of changes should be more important than "looking busy" delivering stuff.


> I wonder what 'institutional complacencies' we have. Problems we assume are unsolvable but are actually very trivial to solve.

I spend a lot of time optimizing builds, because the effect is a multiplicator for everything else in development.

But it is not an easy task. One issue with performance-monitoring is that you have to carefully plan your work, or you will sit around and wait for results a lot:

Try the build: 40 minutes. Maybe add profiling statements, because you forgot them: another 40 minutes. Change something and try it out: no change, 40 minutes. Find another optimization which decreases time locally and try it out: 39.5 minutes, because on the build-server that optimization does not work that well. etc.

You just spent 160 minutes and shaved 0.5 minutes off the build.

I'm not saying it's not worth it, but that line of work is not often rewarding.

On the flip side, I once took two hours to write a Java agent which caches File.exists for class-loading, and sped up local startup by 5x because the corporate virus scanner got triggered less often.


Considering the build host does this hundreds of times every day, a better solution would be to simply have a local git repo cache; that should be secure and reliable given git's object store design, right?

Any simple wrappers for git that can do this transparently?
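(I'm imagining something like the below, with placeholder URL/paths, just handled transparently by a wrapper:)

    # keep a bare mirror on the build host, refreshed cheaply per build
    git clone --mirror https://github.com/example/repo.git /var/cache/git/repo.git   # once per host
    git -C /var/cache/git/repo.git fetch --prune
    # new workspaces borrow objects from the mirror instead of the network
    git clone --reference /var/cache/git/repo.git https://github.com/example/repo.git workdir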


Build servers don't git clone every time though. They do a git clean if needed, followed by a git fetch / git pull equivalent.

GoCD for example maintains a single copy of the repo on the server for every pipeline that refers to it and the agents have the repos that they work on checked out. Any local changes or untracked files are by default cleaned. There are settings to force reclone etc, but it's not the default.


In many cases the build agent is a stateless container which is destroyed as soon as the build is finished. In cases like this the repo needs to be (shallow) cloned each time.


That depends very heavily on the build infrastructure being used however


I doubt that they started off with a 40 mins delay. It probably crept slowly as the repo got bigger and no one noticed it because of the gentle gradient. And they didn't have the time/resources to look into it.


You're confusing a full clone, which for a huge repo can reasonably take that long, with the fix, which was to specify one refspec so they don't fetch the full repo in CI.
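Roughly this shape of change (the refspecs here are illustrative, not Pinterest's exact Jenkins config):

    # before: fetch every branch
    git fetch origin '+refs/heads/*:refs/remotes/origin/*'
    # after: fetch only the branch being built
    git fetch origin '+refs/heads/master:refs/remotes/origin/master'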


People probably did complain, but they were met with, "We're cloning a 20GB repo! It's not going to happen in an instant!"


This is the real complacency

Did someone really think "well it takes 40min, what can you do about it?" and just left it as such?

I knew people who would have that mentality in companies that are not around anymore. Take it as you want.

Yes, git is hard, but you know, maybe someone else has a better idea, or you can check SO, etc. (I don't even know why they were adding the refspecs there)


I’ve found as an industry we’ve moved to more complex tools, but haven’t built the expertise in them to truly engineer solutions using them. I think lots of organizations could find major optimizations, but it requires really learning about the technology you’re utilizing.


It's a natural tradeoff made when we ask for generality and flexibility: Doing that means implicitly saying "I want to do less implementing and more engineering" because a complex configurable dependency becomes an object of study in itself, something that needs empirical testing to use at its best.

Versus the simple thing you would author yourself: if you know the engineering tradeoffs made at the per-line level you have a decent grasp of the performance and flexibility, but you are implementing it and debugging it.


Also, profiling applications is surprisingly easy to learn. It boils down to looking at timestamps, and seeing what takes the longest. The majority of the effort is just figuring out where/how to get the timestamps you are looking for.

I will add that I think software complexity is only going to continue increasing over the long term; it reduces in some domains, but expands in others as we develop more advanced systems. Some kind of analogy to entropy.


Totally agree. Example: now that Node.js supports native `import` and `export` from modules I can see how many JS libraries will not need a transpilation step.

On the other hand TS seems to be more and more popular, which requires a compilation step.


The whole point of being an Agile "generalizing specialist" is that one is a mile wide and an inch deep.


Which i think is a fair approach when you’re early on. When you have a dev efficiency team you’re no longer hiring generalists.


This, this, so much this. When we build more complexity into a system, the less we understand it, similar to how development frameworks create multiple layers of abstraction to the point where the developers have no idea what actual code the framework produces, much less how to fix it.


Yes, we probably need people to stop thinking about tools as if they "solved" problems; what they really do is "transform" them. Now instead of having to deal with the original problem, you only need to deal with part of it and part of the new problem of using the tool that's supposed to help you, plus any leaks you might have because tools rarely solve problems perfectly. It's a trade-off, and you need to be aware of these transformations.


Another way of looking at it is this is the current golden age of infosec.

Think of all these complex systems developers and SysAdmins need to maintain at a company. Then think of how well each person knows each technology. Most of them will be "T" shaped, ie know one tech well but surface-level on all the others.

If I know several tools really well (or better than the company's sysadmins / devs) I can probably find some security issues with them.


We have not "as an industry" moved to git. There's a vocal subset of git fans, but it is by no means an industry standard.


What industry are you part of?

In many domains git has replaced other version control systems.

I would love to see a new approach to version control. Things like subversion or mercurial have exposed too many drawbacks for them to win back industry.


Google and Facebook both don't use git. Google uses a proprietary, perforce-esque system with multiple frontends, and Facebook uses Mercurial.

Among startups, I'm sure git holds a near monopoly, but if you move into other parts of the industry, that monopoly loosens.


Is Git not the most used VCS?


When I first joined one of my previous jobs, the build process had a checkout stage that was blowing away the git folder and checking out the whole repo from scratch every time (!). Since the build machine was reserved for that build job, I simply made some changes to do git clean -dfx && git reset --hard && git checkout of the origin branch. It shaved off like 15 minutes of the build time, which was something like 50% of the total build time.


It's frustrating how many ways there are for a git clone to get out of sync, especially when it's an automation-managed one that is supposed to be long-lived (think stuff like gracefully handling force-pushed branches and tags that are deleted). I've dealt with a bit of this with my company's Hound (code search engine) instance. Currently there's a big snarl of fallback logic in there that tries a shallow clone, but then unshallows and pulls refs if it can't find what it's looking for, culminating in this ridiculousness:

    git fetch --prune --no-tags --depth 1 origin +{ref}:remotes/origin/{ref}
See the whole thing here: https://github.com/mikepurvis/hound/blob/6b0b44db489f9aeff39...

The pipeline I manage is many repos rather than a monorepo, and maintaining long-lived checkouts in this context is not really realistic, but what does work and is very fast is just grabbing tarballs: GitLab and GitHub both cache them, so they don't cost additional compute after the first time, and downloading them is strictly less transfer and fewer round trips than the git protocol.
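(Roughly; owner/repo and the sha are placeholders:)

    # fetch a cached tarball of the exact commit instead of cloning
    curl -sSL https://github.com/example/repo/archive/<sha>.tar.gz | tar -xz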

The only real cost is that anything at build time which needs VCS info (eg, to embed it in the binary) will need an alternate path, for example having it be able to be passed in via an envvar.


A new checkout is good practice. Using refspec and depth options can make it quick.


> In the case of Pinboard, that operation would be fetching more than 2,500 branches.

Ok, I'll ask: why does a single repository have over 2,500 branches? Why not delete the ones you no longer use?
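(Cleaning up merged ones is close to a one-liner; a rough sketch, assuming master is the mainline:)

    # delete remote branches that are already merged into master
    git branch -r --merged origin/master \
      | grep -vE 'origin/(HEAD|master)' \
      | sed 's#^ *origin/##' \
      | xargs -n 1 git push origin --delete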


Where I work doesn't delete branches, because there is no reason to. Git branches have essentially zero overhead, and deleting them is just extra complexity in the CI toolchain. Deleting branches also deletes context in some scenarios. When dealing with an old codebase it's nice to be able to check out the exact version of the code at some point without having to dig through the log to get hashes and then dealing with a detached head.

The example in the article is a bit of a special case. It is a huge, and old, monorepo. In the typical case, fetching everything and fetching master is equivalent because all commits in all branches make their way into master anyway. If you have a weird branching strategy where you maintain multiple, significantly diverged branches at once, but only care about one of those branches at build time, then this optimization would save you time.


> Git branches have essentially zero overhead

Based on the article linked here, they do.


> If you have a weird branching strategy where you maintain multiple, significantly diverged branches at once, but only care about one of those branches at build time, then this optimization would save you time.

It's not the fact that they had lots of branches per se; it's that they had lots of commits hanging out in the middle of nowhere.


If you are doing squash merges, git branches have a cost.


A git branch is literally a file with a commit hash in it. It's conceptually a pointer to a commit. Creating, destroying, and maintaining a branch has all the overhead of a ~40 byte file.
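
For example, a minimal sketch (the hash is illustrative, and this assumes the ref hasn't been packed into .git/packed-refs):

    # A branch is just a tiny file holding a commit SHA
    cat .git/refs/heads/master
    #=> 3f786850e387550fdab836ed7e6dc881de23001b
    wc -c .git/refs/heads/master
    #=> 41 .git/refs/heads/master   (40 hex characters plus a newline)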

Squash merges leave a ton of commits just floating in your old branch. If you delete the branch (the 40B file), all those commits are still there. Doing lots of squash merges brings you into this case I mentioned:

> If you have a weird branching strategy where you maintain multiple, significantly diverged branches at once, but only care about one of those branches at build time, then this optimization would save you time.


If you have several releases with different targets, and want to make future security updates accessible to all of them.


They could already be doing that.

That is, if we assume they copy Google's philosophy of a single monolithic repository.

Pinterest has about 2,000 employees; assuming 20% are active developers, that's about 400 people, or roughly 6 branches per developer, which wouldn't be outrageous.


Because they use a monorepo. With monorepos at large companies the individual git repositories will be much larger and contain a ton more branches than if you have a repository-per-project model.


Probably because they have 1600 employees and the 2500 branches are the active ones.


monorepo culture.


One of the (many) things that drives me batty about Jenkins is that there are two different ways to represent everything. These days the "declarative pipelines" style seems to be the first class citizen, but most of the documentation still shows the old way. I can't take the code in this example and compare it trivially to my pipelines because the exact same logic is represented in a completely different format. I wish they would just deprecate one or the other.


I find the self-congratulatory tone in the post kind of off-putting, akin to "I saved 99% on my heating bill when I started closing doors and windows in the middle of winter."

If your repos weigh in at 20GB in size, with 350k commits, subject to 60k pulls in a single day, having someone with half a devops clue take a look at what your Jenkinsfile is doing with git is not exactly rocket science or a needle in a haystack. (Here's hoping they discover branch pruning too; how many of those 2500 branches are active?)

As a consultant I've seen plenty of appallingly poor workflows and practices, so this isn't all that remarkable... but to me the post seems kind of pointless.


Indeed. I wasn't aware of that specific git option, but a build pipeline with a checkout step taking FORTY MINUTES is unacceptable. Plenty of ways to solve that problem, but it's a problem that never should have made it into a critical workflow.

I don't care for casting stones. It's clearly a big win, and you don't get numbers like that every day. But I feel like someone should've twigged to this much sooner.


Can someone explain the intended meaning behind calling six different repositories "monorepos"?

It sounds to me like you don't have a monorepo at all and instead have six repositories for six project areas.


My interpretation is that each "monorepo" is a big git repository that consists of a collection of individually-deployed services, as opposed to having a single git repository per service.

I do not know whether that's what the blog author meant by that though.


I got that impression too. I can imagine the Pinterest monorepo, for example, has the website and server code together.

Their iOS and Android repos may contain the code for multiple apps. Though I'm not aware of which other apps Pinterest (the company) creates besides the obvious one.


I'm a git noob, so I'm sorry if this sounds dumb but wouldn't

git clone --single-branch

achieve the same thing (i.e, check out only the branch you want to build) ?

Also, why would you check out more than one branch when doing CI?
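
For reference, a sketch of what that would look like (URL is a placeholder; --depth 50 just mirrors the 50-commit fetch discussed elsewhere in the thread):

    # Clone only one branch, with shallow history (--depth already implies --single-branch)
    git clone --single-branch --branch master --depth 50 https://example.com/org/repo.git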


Looks like it's implied, from the documentation:

    Implies --single-branch
https://git-scm.com/docs/git-clone#Documentation/git-clone.t...


Hmm, --depth implies --single-branch, but the +refs/heads/* wildcard refspec overrode it by making sure it had the data to match all branches?


I truly appreciate articles like this — it’s warming to see other companies running into the kinds of issues I’ve ran into or had to deal with, and more so that their culture openly discusses and shares these learnings with the broader community.

The most effective organizations I’ve worked at built mechanisms and processes to disseminate these kinds of learnings and have regular brown bags on how a particular problem was solved or how others can apply their lessons.

Keep it up Pinterest engineering folks.


He says that "Pinboard has more than 350K commits and is 20GB in size when cloned fully." I'm not clear though, exactly what "cloned fully" means in context of the unoptimized/optimized situation.

He says it went from 40 minutes to 30 seconds. Does this mean they found a way to grab the whole 20GB repo in 30 seconds? seems pretty darn fast to grab 20GB, but maybe on fast internal networks?

Or maybe they meant that it was 20GB if you grabbed all of the many thousands of garbage branches, when Jenkins really only needed to test "master", and finding a solution that allowed them to only grab what they needed made things faster.

I'm also curious about the incremental vs "cloning fully" aspect of it. Does each run of Jenkins clone the repo from scratch or does it incrementally pull into a directory where it has been cloned before? I could see how in a cloning-from-scratch situation the burden of cloning every branch that ever existed would be large, whereas incrementally I would think it wouldn't matter that much.


> He says that "Pinboard has more than 350K commits and is 20GB in size when cloned fully." I'm not clear though, exactly what "cloned fully" means in context of the unoptimized/optimized situation.

It probably means including all commits.

It looks like they were successfully only pulling the last 50 commits, but they were doing that for each of 2500 branches. Now they are pulling only the most recent 50 commits for one branch.
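
In git terms the difference is roughly this (a sketch; the wildcard is, if I remember right, the Jenkins git plugin's default refspec):

    # Before: a depth-50 fetch of every branch head
    git fetch --depth 50 origin '+refs/heads/*:refs/remotes/origin/*'

    # After: a depth-50 fetch of master only
    git fetch --depth 50 origin +refs/heads/master:refs/remotes/origin/master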


My similar story goes like this: We had CRM software that let you setup user defined menu options. Someone at our organization decided to make a set of nested menu options where you could configure a product, with every possible combination being assigned a value!

So if you had a large, blue, second-generation widget with a foo accessory and option buzz, you were value 30202, and if it was the same one except red, it was 26420...

Every time the CRM software started up, it cycled through the options and generated a new XML file with all the results; this took about a minute and created something like a 60MB file.

The fix was to basically version the XML file and the options definition file. If someone had already generated that file, just load the XML file instead of parsing and looping through the options file. Started up in 5 seconds!

What was the excuse that it took so long in the first place? "The CRM software is written in Java, so it's slow."


Seems like there's a lot of hostility towards the title, which might be considered the engineering blog equivalent of clickbait. If the authors are around, the post was quite informative and interesting to read, but I'm sure it would have been much more palatable with a more descriptive title.

But back on topic: does anyone have any insight into when git fetches things, and what it chooses to grab? Is it just "when we were writing git we chose these things as being useful to have a 'please update things before running this command' implicitly run before them"? For example, git pull seems to run a fetch for you, etc.


Ok, I'll ask the obvious question: why did setting the branches option to master not already do this?

EDIT

https://www.jenkins.io/doc/pipeline/steps/workflow-scm-step/ makes it sound like the branches option specifies which branches to monitor for changes, after which all branches are fetched. This still seems like a counter-intuitive design that doesn't fit the most common cases.


This is good info. I need to check my own build pipelines now and see whether we are just blindly cloning everything or not. 40 minutes to do a clone is a pretty long time to wait, though.


Parkinson's Law of builds: "work expands so as to fill the time available for its completion", or in this case the available time is the point at which people can't stand the build taking any longer. 30-60 minutes is normal because anything > 1 minute requires you to context-switch anyway, and > 60 minutes means you're now at risk of a build taking all day if you have the work queue of a one-pizza team. So anything in the [1..60] minute range causes a grumble, but nothing gets done about it.


Is there any way to do this for GitLab CI [1]? I'm using GIT_DEPTH=1, but I'm not sure how to set refspecs. It's not too important right now since it only takes about 11 seconds to clone the git repo, but maybe it's a quick win as well.

[1] https://docs.gitlab.com/ee/ci/large_repositories/


The docs seem to give the impression that they already do this, but it'd be great if someone from Gitlab could confirm because it doesn't use the refspec term or show the resulting git command.

> The following example makes the runner shallow clone to fetch only a given branch; it does not fetch any other branches nor tags.

https://docs.gitlab.com/ee/ci/large_repositories/#shallow-cl...


Check out the docs here: https://docs.gitlab.com/ee/ci/yaml/README.html#git-fetch-ext...

GIT_FETCH_EXTRA_FLAGS accepts all options of the git fetch command
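
If you do want to tweak it, a sketch of what that could look like in .gitlab-ci.yml (variable names as per the linked docs; exact defaults may vary by runner version):

    variables:
      GIT_DEPTH: "1"
      GIT_FETCH_EXTRA_FLAGS: "--prune --no-tags"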


Thanks for that link - seems to be further evidence that they already restrict the refspec by default.

> The default flags are:

    GIT_DEPTH.
    The list of refspecs.
    A remote called origin.


> For Pinboard alone, we do more than 60K git pulls on business days.

Can anyone explain this? Seems ripe for another 99% improvement even with hundreds of devs.


An unhealthy obsession with CI/CD is the usual culprit.


Misleading title. They reduced their clone time by 99%. Not their build time.


With a repo that is 20GB, I can imagine that could be 99% of the build time.


My CI servers have to build branches as well, though. A fresh clone for every build? No wonder it was slow, but even this solution seems inefficient. My preferred general solution is a persistent repository clone per build host, maintained by incremental fetches, using git worktree add, not git clone, to check out each build.
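
A sketch of that pattern, with placeholder paths:

    # One persistent clone per build host, one cheap worktree per build
    git -C /var/cache/repo fetch --prune origin                      # incremental fetch into the shared clone
    git -C /var/cache/repo worktree add /builds/job-1234 origin/master   # detached checkout for this build
    # ... run the build in /builds/job-1234 ...
    git -C /var/cache/repo worktree remove /builds/job-1234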


Well, good advice, and good for them, but

> Cloning monorepos that have a lot of code and history is time consuming, and we need to do it frequently throughout the day in our continuous integration pipelines.

No you don't!

If removing per-build clones was the only way to speed things up, I'm absolutely sure you could figure out how with medium difficulty at most.


60K pulls per day for maybe 100 commits a day? What tests are being done that can't leverage earlier pulls?


This just shows how poor visibility into git is; I hope it gets better.

Building a product with poor visibility and then ridiculing users for not knowing its internals is the worst practice in computer science.

Hadoop did the same, and set a record for the fastest software to become legacy.

Super nice to see great comments here and the nice article.


Looks like Pinterest's team is confused about git branches. They are not real, full copies of the main branch like in SVN or TFS; a branch in the git world is simply a pointer to a specific commit in the push history.

Having said that, happy to be proven wrong, and learn about it.


IIUC the issue here is the depth option - they're telling it to only fetch the last 50 commits, but they were fetching the last 50 commits from EACH branch. In other words, they were fetching all commits that are within 50 commits of any branch head. By restricting the branches, they drastically reduce the set of commits to fetch.


Yeah, especially given 2,500 branches.


For CI on large repos, you can do much better than this by using a persistent git cache. It takes a little finessing to destroy it if it's corrupt and avoid concurrent modifications, but it's extremely worth it.


You mean syncing bare repos onto the CI nodes, and then in the build not using a WAN remote but just cloning from the local bare repo (with hard links) and checking out?
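
For illustration, a sketch of that local-bare-mirror idea (placeholder paths and URL; the locking/corruption handling mentioned above is omitted):

    # Keep a bare mirror on the CI node, then clone locally (objects get hard-linked, so it's fast)
    git clone --mirror https://example.com/org/repo.git /var/cache/repo.git   # one-time setup per node
    git -C /var/cache/repo.git remote update --prune                          # refresh before each build
    git clone /var/cache/repo.git "$WORKSPACE/src"                            # local clone, no WAN traffic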


Because of the strife over the 99% claim: if the pull took 39.9 min (and thus the build itself took 0.1 min = 6 sec), then a 99% decrease in pull time would result in roughly a 99% decrease in total time, and you would get about 30 sec total time in the end (rounding to 0 decimal places).

Not that any of this is important for the article to be interesting. In a previous job we had to fight long pull times, and we quickly created a git repo for CI that sat on a machine next to the CI server and periodically pulled from GitHub, to avoid the CI having to pull over the Internet.


The title is a bit of a misnomer, isn't it?

> This simple one line change reduced our clone times by 99% and significantly reduced our build times as a result.

Sounds like it didn't reduce build times quite by 99%.


Misleading title. They reduced git clone time 99%, not build times.


Will this mean even more Google image search spam‽


Alternative title: "How one line of code made our build time 100x what it should have been"


I'm not impressed by the author of the post, since this is also documented in the plugin, which says you should not check out all the branches if you're not interested in them. The default behaviour, of course, is to fetch all of them.


So git doesn't scale well with wide, deep source histories? That's a failing of git, I think, not of the engineers, who may even have written that line when the source base was far less gnarly.


I once cut our test suite from 10 minutes to under 5 minutes by changing 2 characters in 1 line...

The bcrypt work factor! It was originally 12; I reduced it to 1 (don't worry, production is still 12).


Is it common practice to clone the repo on every build (especially for web apps)? I just have Jenkins navigate to an app folder, run a few git commands (hard reset, pull), and build (webpack).


The article is erroneous in many ways as others have described, but the main error I see is that it says 'git clone' is run before the fetch.

It should be 'git init'


It is pinteresting that a webapp for making your image-saving obsession easier to satisfy requires hundreds to thousands of developer actions per day and repositories tens of gigabytes in size.


Semi-related for JS developers: if you do `eslint` as part of your build, make sure `node_modules` (and `node_modules` in subfolders if you have a monorepo-ish setup) is excluded.
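
For instance, a sketch of an .eslintignore covering nested packages (depending on your ESLint version, some of this may already be ignored by default):

    # .eslintignore - keep dependency trees out of lint runs
    node_modules/
    **/node_modules/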


We recently reduced our build times by 5-10% or so by changing the default bcrypt iteration count (for tests). It also felt silly once we found it.


Thanks so much for this tip! I just made this change and some of my tests are now much faster. Here's the result for one of the affected tests (averaged across 5 runs):

Before: 1.39 seconds

After: 0.62 seconds

I have this default line in config/initializers/devise.rb:

    config.stretches = Rails.env.test? ? 1 : 11
So hashing user passwords was already very fast. But I'm also manually calling BCrypt in some other places, so these calls are now much faster as well.


You should consider doing the same thing in production.

It's a trope at this point how often the modern slow hashing algorithms are utterly misconfigured. I've stopped counting how many times I've seen it.

Taking a whole second to compute a hash on the production machine because "hashing is supposed to be slow", when the production server is a low-frequency Xeon that has many cores but each one is half as fast as your development machine with its 4GHz i7-9999.

Hashing is supposed to take milliseconds, not seconds. If it's taking longer than 100 ms you need to make it faster.

edit: found the problem, this bad stackoverflow answer that's been spreading bad recommendations for years https://security.stackexchange.com/questions/17207/recommend...


One thing to keep in mind is that this obviously changes the timing of your software with respect to production behaviour, which may or may not matter depending on what you are testing.


Troubleshooting CI/CD feels like troubleshooting a printer: What the hell is it doing now and why is it doing that?!


I'd rather Pinterest increased their build times by 99% so they could do less damage to search results.


“We have six main repositories at Pinterest: Pinboard, Optimus, Cosmos, Magnus, iOS, and Android. Each one is a monorepo and houses a large collection of language-specific services.”

What is an “iOS monorepo” supposed to be like?


@Dang, can we get an edit?

This did NOT slash build times 99%, but rather time to do a git pull.


If build includes a git pull, maybe it did.


Nitpick... if 99% of your build time is consumed by preparing the workspace, that's the story. This isn't interesting to anyone who doesn't have that exact problem. Most people who click this won't find it interesting.


From the article:

> We found that setting the refspec option during git fetch reduced our build times by 99%.

Seems pretty clear to me that build times were reduced by 99% as a result of cutting the git fetch times significantly (but the exact number is not given). The headline looks correct to me.


FTA: "This simple one line change reduced our clone times by 99% and significantly reduced our build times as a result"

Unless their build is 100% git pull time, this did not reduce build time by 99%.


Exactly. The article makes both statements in different places, and they are contradictory. Kind of gives an impression of sloppiness.


To be fair, if their pull took 40 minutes, that’s a very real option :)


Misleading title. It wasn't the build time that decreased by 99%, only the git checkout step.


TL;DR: can I guess it was doing some extra network round trips or something?


TL;DR: they reduced "git clone" time on their massive monorepo by making it fetch only the master branch when building in Jenkins.



