Incremental Builds in Gatsby Cloud (gatsbyjs.org)
130 points by Dockson on April 22, 2020 | 79 comments



Having never used a static site generator in anger, can someone explain to me like I'm five what's going on here?

My understanding is that Gatsby is a tool that converts a bunch of markdown files into a static HTML website. Why are slow builds a problem for any static site generator? Why does it need a cloud?

In other words, what problem am I supposed to be having that any of this solves?

Note, I'm trying not to be skeptical here - my company's website is hand-maintained HTML with a bunch of PHP mixed in so I can totally imagine that things may be better. But I don't understand the kinds of situations where using a 3rd party cloud to generate some static HTML solves a problem.


Gatsby is a fairly complex static site generator. At the highest level, it provides an ingest layer that can take any data sources (CMS, markdown, json, images, or anything that a plugin supports) and bring them into a single centralized GraphQL data source. Pages (which are built using React) can query this graph for the data they need to render. Gatsby then renders the React pages to static HTML and converts the queries to JSON (so there's no actual GraphQL in production).

This process is fairly fast on small/simple sites. Gatsby is overall very efficient and can render out thousands of pages drawing from large data sources rather quickly. The issue is that Gatsby isn't just used for personal blogs. As you can imagine, a site with thousands of pages of content that is processing thousands of images for optimization starts taking a long time to build (and a lot of resources). For example, I'm building a Gatsby site for a photographer that includes 16000+ photos totaling a few hundred GB. Without incremental builds, any change (e.g. fixing a typo) means every single page needs to be rebuilt.

Incremental builds mean you don't have to rebuild everything. Because the data all comes from the GraphQL layer (which Gatsby pre-processes and converts to static JSON), it is possible to diff the graphs (i.e. determine what data a commit has changed) and determine which pages it affects (i.e. which pages include queries that access that field). From there, Gatsby can rebuild only the changed pages.
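
To make the diffing idea concrete, here is a minimal sketch in Python. It is not Gatsby's implementation: the flat key/value data model, the `page_deps` mapping, and both function names are assumptions made up for illustration.

```python
# Hypothetical sketch: none of these names come from Gatsby itself.
_MISSING = object()  # sentinel so that "key absent" differs from "value is None"


def changed_keys(old_data: dict, new_data: dict) -> set:
    """Return the data keys that were added, removed, or modified."""
    all_keys = old_data.keys() | new_data.keys()
    return {
        key
        for key in all_keys
        if old_data.get(key, _MISSING) != new_data.get(key, _MISSING)
    }


def pages_to_rebuild(page_deps: dict, changed: set) -> set:
    """page_deps maps a page path to the set of data keys its query reads."""
    return {page for page, deps in page_deps.items() if deps & changed}


# A commit that fixes a typo in one post title.
old = {"post/1.title": "Helo", "post/2.title": "World", "site.name": "Blog"}
new = {"post/1.title": "Hello", "post/2.title": "World", "site.name": "Blog"}

page_deps = {
    "/posts/1/": {"post/1.title", "site.name"},
    "/posts/2/": {"post/2.title", "site.name"},
    "/": {"post/1.title", "post/2.title", "site.name"},
}

dirty = pages_to_rebuild(page_deps, changed_keys(old, new))
print(sorted(dirty))  # ['/', '/posts/1/']
```

Only the index page and the edited post are marked dirty; `/posts/2/` is left alone because none of the keys its query reads have changed.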

This not only means faster build times, it also means that only the changed pages and assets have to be re-pushed to your CDN. This way, content that hasn't changed will remain cached and only modified pages will have to be sent down to your site's users.


But if you have 16,000 of anything, why are you using a static site? Surely the access patterns are long tail and you need to build more often than most pages are even accessed.


Cheaper to build a site and dump files to a bucket than to run Wordpress code on every request.


Surely there's a CPU/disk trade off at some point. Static pages are much larger (less likely in memory) and would cause disk reads much sooner than the same files being generated dynamically. Of course wordpress isn't known for its efficiency so the static page preference is probably quite high.


there is a big difference in the cost of static hosting and CDN (think Cloudfront / S3) for static stuff and running an active piece of hardware for static stuff that doesn't change. Like orders of magnitude. Sure for small sites it's not that much but it's still orders of magnitude.

also the answer to a large number of my interview questions ends up being figuring out how you can just efficiently serve the stuff from a CDN/Blob Storage. You can scale the crap out of this for quite cheap.


If the final result is 500k of HTML, a dynamic website is doing a LOT more work to return that 500k than a static website. Assuming you get more requests than you have pages to generate.


> Surely there's a CPU/disk trade off at some point

At an extreme case, yes. Disk is SO CHEAP.


I was thinking more about the time cost. Disk is cheap but slow compared to memory.


Admittedly in this case I'm mostly just trying to push Gatsby to its limits. For a photography site, there ends up being very little overhead with a static site (if you can do incremental builds). I also explored NextJS (SSR) and just making a good old SPA, but decided to go with Gatsby because at the end of the day, a distinct majority of the storage cost is just the raw images. I think Gatsby ends up making the most sense because you get to take advantage of a CDN for caching (most don't like being used just as an asset cache) and I can just leave it there without worrying about a server.


Hi, have you thought about hosting for this yet? I've got a similar site which I originally tried on AWS Amplify but it got too big for the artifact size limit so I opted for S3/Cloudflare instead however build times are slow and more of a manual process currently.


I’m yet to figure that out! I’m procrastinating on that until I have everything else figured out (it’s for a family member so no strict timeline). I’m thinking in the end the setup will be something with a CMS for editing the photo metadata, a file storage system for the images, and everything else in git. The build would pull from all 3, run, and then the processed images will be pulled out and hosted on their own. I’m planning that it’ll just run and take a long time on a droplet unless I figure something better out.


netlify is the de facto build/host for gatsby


Server-side rendering (like Wordpress) generates HTML in response to a URL. Static site generators just visit every possible URL at build time and save the final HTML as files. This makes it easy to deploy and scale when your site is static and doesn't need any features of dynamic server-side rendering.
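
As a rough illustration of "visit every URL at build time and save the HTML", here is a toy sketch in Python. The `PAGES` dictionary, `render`, and `build` are hypothetical stand-ins; a real generator would pull content from files, a CMS, or a database and use real templates.

```python
import tempfile
from pathlib import Path

# Hypothetical content store standing in for a CMS, database, or markdown files.
PAGES = {"/": "Home", "/about/": "About us", "/posts/hello/": "Hello, world"}


def render(url: str) -> str:
    """What a dynamic server would run on every request for `url`."""
    return f"<html><body><h1>{PAGES[url]}</h1></body></html>"


def build(out_dir: Path) -> list:
    """Static generation: call render() once per known URL and save the result."""
    written = []
    for url in PAGES:
        target = out_dir / url.strip("/") / "index.html"
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(render(url), encoding="utf-8")
        written.append(target.relative_to(out_dir).as_posix())
    return sorted(written)


out = Path(tempfile.mkdtemp())
files = build(out)
print(files)  # ['about/index.html', 'index.html', 'posts/hello/index.html']
```

The same `render` function could sit behind a web server instead; the only difference is whether it runs per request or once per URL at build time.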

Gatsby (and other frameworks) automate this process by going through whatever data sources you have (directory of markdown files, databases, etc) and producing the HTML. Gatsby uses React for the templating logic and any client-side interactivity on the pages. Build times scale with the size of your content and number of pages to generate so that's the reason for the cloud.

Overall, static sites are in the hype phase of the software cycle. Most sites are just fine using Wordpress or some other CMS and putting a CDN in front to cache every pageview. Removing that server completely is nice but most static sites end up using some hosted CMS anyway and at that point you just replaced one component for another. There's also advantages to completely separating the frontend code from the backend system for fancy designs or large teams.


Wordpress in particular is not “just fine”. During this crisis, every government site based on WordPress ends up crashing under the load. Yes, you can avoid this with a good cache plugin and a CDN. But you can also avoid it by using a tool that is designed to not crash under load in the first place.


Serving up a HTML file from disk or generating some dynamic HTML are both trivial to do once. After that, a CDN layer is used to cache the HTTP response, regardless of how it was generated.
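
A toy model of that claim, sketched in Python. The `Cdn` class and both origin functions are invented for illustration, and the model deliberately ignores TTLs, cache headers, and invalidation, which real CDNs depend on.

```python
class Cdn:
    """Toy edge cache: forwards to the origin only on a cache miss."""

    def __init__(self, origin):
        self.origin = origin
        self.store = {}
        self.origin_hits = 0

    def get(self, url: str) -> str:
        if url not in self.store:
            self.origin_hits += 1
            self.store[url] = self.origin(url)
        return self.store[url]


def static_origin(url: str) -> str:
    """Pretend to read a pre-built file from disk."""
    return f"<h1>file for {url}</h1>"


def dynamic_origin(url: str) -> str:
    """Pretend to render the page on demand."""
    return f"<h1>rendered {url.upper()}</h1>"


for origin in (static_origin, dynamic_origin):
    cdn = Cdn(origin)
    for _ in range(1000):
        cdn.get("/faq")
    print(origin.__name__, cdn.origin_hits)
# static_origin 1
# dynamic_origin 1
```

Under these assumptions the origin sees one request per unique URL either way. Whether a given backend actually emits responses a CDN is willing to cache is a separate question.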

Sure if you just want to serve HTML files on every request with a simple file server then static files will be faster, but it'll eventually get overloaded too. The CDN is where the real scaling happens. And using a CDN is far easier than changing the entire backend to a static site.

Also most of the sites that crashed were dynamic applications, not just static pages. Using a static site generator wouldn't solve that problem.


I am specifically complaining about WordPress. Anyone could make a server side dynamic application that applies proper caching headers to work with a CDN. Anyone could, but WordPress specifically did not. It's an uncacheable mess by default and caching plugins just barely make it useable.

Using a static site generator is basically just a way of ensuring that the pages are properly cached by the brute fact that they've been dumped out on disk. It's not strictly necessary for a well designed system, but it raises the floor because even in the worst case, the pages are static files.

For many public facing sites, dynamic applications aren't strictly necessary. If you're just hosting a PDF of an Excel spreadsheet of permitted job categories, you don't need a dynamic application. Again, a well designed app would already be hosting this through S3, but you can't trust things to be well designed when made by a contractor with no technical oversight.


Static files don't automatically set any headers either, you still need a webserver to serve those. And you can override those headers in the server or in the CDN so there's no reason to switch out the entire backend for it.

CDNs handle scaling of static assets. That's their entire purpose, with features like request coalescing and origin shielding to help ensure unique URLs are never requested more than once. Optimizing for static files at the origin is just not worth the trouble when Wordpress and other frameworks are far more productive and provide CMS functionality which is usually needed anyway.


We're talking in circles. No one disputes that CDNs are good and expert users of WordPress are capable of making it not shit the bed. The point is that WP cannot be left unmanaged by novices, which means it should not be used in many situations in which it is currently used. Static sites have a higher floor and so are better suited to non-expert use.


My point is that tradeoff isn't worth it. It's far easier to tweak security settings and configure a CDN than to completely change the backend with a complex build process requiring more technical knowledge, deploy it to a host which you still need, configure a CDN which you still need, and wire it up to read from a CMS which you still need.


And then you can regenerate all your files every time you make a change - the trade off isn't worth it for one off sites that generally sit at zero traffic.

Cheaping out on things that need actual support won't be fixed by making it a static html page.


I'll take a stab at this, Kyle just shoot me if I get something wrong below :D

1) There's a server-centric approach and a client-centric approach:

--a) hand-maintained HTML + php falls into the first camp

--b) React (/Angular/Vue) fall into the second

2) If you go with the second camp (b), you end up having a higher initial page load time (due to pulling in the whole "single page app" experience), but a great time transitioning to "other pages" (really just showing different DIVs in the DOM)

3) Gatsby does some very clever things under the hood, to make it so that you get all the benefits of the second camp, with virtually no downsides.

4) There are of course all kinds of clever code-splitting, routing & pre-loading things Gatsby does, but I hope I got the general gist right.

If not, Kyle, get the nerf gun out! -- how would you describe the Gatsby (& static sitegen) benefits? :)


Gatsby can also let you use react components to do some pretty clever things around image resizing, effects, etc that you might expect from a static site generator but couldn’t achieve with just a frontend framework.


(3) is incorrect, Gatsby initial page load times are mostly really bad.

(2) is both overstated and overvalued. It's overstated because loading a static HTML page from a CDN is extremely fast. Too many people who point at this advantage for SPAs are thinking back to pre-CDN usage with slow origin servers. Of course there are still use-cases where going to network is not wanted, but these aren't the primary use-cases that Gatsby covers.

It's also overvalued in that most users are not getting to a page by navigating in a loaded site, they are coming from a social or search link (again, for the sort of use-cases that Gatsby pages are built for).


> (3) is incorrect, Gatsby initial page load times are mostly really bad.

This has not been my experience, considering all HTML is ready to go from last byte so that other than blocking CSS, rendering can begin ASAP. At this point, no JS is required to interact with the page so things are generally pretty snappy while we wait for React hydrate to kick in.


Pick a few random sites from the Gatsby showcase on their homepage and run them through webpagetest.org Simple Testing.


shopflamingo.com, ideo.com, ca.braun.com, and bejamas.io are all blazingly fast for me.


Ideo has a pretty bad insight score (41): https://developers.google.com/speed/pagespeed/insights/?url=...

First paint: 4.1 Seconds. Time to interactive: 11.5 Seconds.

I wouldn't say that's very fast.

Edit: I didn't check the other three.


shopflamingo.com gets 47

ca.braun.com gets 77

bejamas.io gets 96

So implementation dependent I guess


You mean your computer or on webpagetest.org?


So, for (2) above, not sure I understand:

camp 1) at best, a TCP connection is re-used, and the HTML for "page 2" is fetched over the network, parsed, the CSS OM is applied, and then the whole caboodle* is "painted on screen".

camp 2) the CSS OM is applied and "page 2" is painted on-screen (possibly even faster if the browser cached "page 2" in a texture on the GPU, so the CSS OM application step may be optimized away)

So I genuinely don't understand how fetching a "page 2" from a CDN

(we use Cloudfront & GCP's CDN at https://mintdata.com, so I'm basing my experience on this)

is faster than the SPA approach?

I am genuinely curious on the above -- not trying to start a flame war :D

* Yes, apparently caboodle is a word?! I had to Google it just like you to make sure :)


It's not faster than the SPA approach. It's just not very much slower. It used to be much slower before using CDNs was common.


I've only done simple stuff with Gatsby, but it fully supports generating static HTML from dynamic data sources. The difference between that and traditional JS frameworks is the generation for Gatsby happens at build time instead of runtime.

I love it because I vastly prefer serving static assets to server-side rendering because of the numerous simplicities it provides (aggressive caching, predictable latency, etc). In most cases you get to have the cake of complex sites generated from template and eat the cake of static asset serving.


It doesn't have to be markdown files. Gatsby supports a wide range of data sources which are available to use in your templates via graphql. If your website is big and gets frequently updated with data from the backend triggering the builds, new content on the site can take a few minutes to appear as the generator will need to build a static site (html/css/js files), which I assume is a problem for big publication sites.


For a hundred markdown files, no big deal. But if your site has tens of thousands of pages, those build times become a real pain point. Why should every single page rebuild if you only changed one of them?


> In other words, what problem am I supposed to be having that any of this solves?

It saves time, especially for larger pages, because instead of rebuilding the entire site with all its pages, you just rebuild those that change.


So does a Makefile.


A makefile works great when you have a pile of source files and you want to make a parallel pile of output files, and each source file is compiled individually. It's not so great when you have a compilation process where you have a folder of entrypoint source files that each need their own output artifacts produced but they happen to share many dependencies, and you want to automatically create common output chunks if there's enough overlap between them, etc. I'm sure you could find a way to involve Make by automatically generating makefiles, but at that point, Make is only handling the really easy part and isn't worth it.

Think of how many makefiles just end with one big linker call. Most web toolchains (which crunch a lot of source files into a few artifacts) have more in common with that linker call than the rest of the stuff that happens in a makefile. You have to have a system that's more integrated with what's being built to make that step meaningfully incremental.
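
A small sketch of why this step resists per-file rules, written in Python. The module names and the "shared by at least two entry points" threshold are made up; real bundlers use more elaborate heuristics.

```python
from collections import Counter


def common_chunk(entry_deps: dict, min_shared: int = 2) -> set:
    """Modules imported by at least `min_shared` entry points."""
    counts = Counter(
        module for deps in entry_deps.values() for module in set(deps)
    )
    return {module for module, n in counts.items() if n >= min_shared}


# Hypothetical entry points and the modules each one imports.
entries = {
    "home.js": {"react", "layout", "hero"},
    "blog.js": {"react", "layout", "markdown"},
    "about.js": {"react", "team"},
}
print(sorted(common_chunk(entries)))  # ['layout', 'react']

# Editing a single entry point changes a shared artifact.
entries["about.js"].add("markdown")
print(sorted(common_chunk(entries)))  # ['layout', 'markdown', 'react']
```

The shared chunk is a function of every entry point at once, so a change to `about.js` alone alters an output that `home.js` and `blog.js` also load. That is the part a one-source-to-one-target rule cannot express directly.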


Instead of a Markdown file, imagine your data is somewhere in a REST API, or many REST APIs. Gatsby (and next.js, which I vastly prefer) will query these APIs during the BUILD process to generate your static sites - and that can be slow. Imagine you have a site that list the top 1000 IMDB movies with details. To generate your static site, you need to make 1,000 REST calls to the IMDB API during build time to get the necessary data. Parallelizing and caching it makes it faster.
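
A hedged sketch of "parallelizing and caching" in Python. `fetch_movie` is a stand-in that sleeps instead of calling any real API, and the in-memory `cache` dictionary is an assumption; an actual build tool would have to persist the cache between builds.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fetch_movie(movie_id: int) -> dict:
    """Stand-in for a slow REST call (hypothetical; no real API is contacted)."""
    time.sleep(0.01)
    return {"id": movie_id, "title": f"Movie {movie_id}"}


def fetch_all(ids, cache: dict, workers: int = 20) -> int:
    """Fetch only ids missing from `cache`, in parallel; return how many were fetched."""
    missing = [i for i in ids if i not in cache]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for movie in pool.map(fetch_movie, missing):
            cache[movie["id"]] = movie
    return len(missing)


cache = {}
first = fetch_all(range(100), cache)   # cold build: every movie is requested
second = fetch_all(range(100), cache)  # warm build: everything comes from the cache
print(first, second)  # 100 0
```

The second build issues no requests at all, which is the effect smarter caching is after. Knowing when a cached entry has gone stale is the hard part and is not modeled here.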

If it were just Markdown files you probably wouldn't need this since parsing and transforming local Markdown files is fast. But this is Javascript, so nothing is truly fast.


Static site generators output static HTML files instead of running a server that renders every request, it doesn't say anything about what language they consume. Gatsby is built around Javascript and React, not Markdown.


>> Static site generators output static HTML files

This is not really true; they often generate a static client-side web application vs. a dynamic first-time (or every time) app based on server-side processing. This provides a highly optimized, largely self-contained application that avoids a lot of the runtime dependencies and complexity we typically get (ex: web servers and databases). They are still highly dynamic through the use of APIs and such.

Gatsby has an extensive build pipeline and can query almost any data source during the build, but the original base source is markdown, and react is the Javascript.


None of the answers/comments below even come close to answering the simple question that started this thread. This looks like an overly complex solution looking for a problem to me.


True. The article has a decent answer though. Incremental builds are necessary because:

> [Slow builds] can be annoying if your site has 1,000 pages and one content editor. But if you have say 100,000 pages and a dozen content contributors constantly triggering new builds it becomes just straight up impossible.

Gatsby needs a cloud to host this build server. They also apparently host a nice content editing UI.

If you don't need a content editing UI, and/or are fine maintaining your own static builds, you presumably wouldn't subscribe to the cloud service.


They don't host a content editing UI, only a "dynamic" version of the site that you can embed or link in a CMS for draft previews etc.

I use and like gatsby a lot and don't think it is generally overcomplicated for what it does. They are really pushing static at all costs though, and these cloud solutions are needed because of that. When seriously evaluating a 100.000 pages / dozens of editors project, if you ask what the benefits and the costs of static really are, I think you might come up with a different answer than Gatsby Inc. I think Zeit+Next actually has a better story there, because it's not "static at all costs".


yup, totally on us to prove the static model can scale!


Imagine MSDN as a static site


That is easy to do! https://docs.microsoft.com which is, to my knowledge, the MSDN replacement is built as a static site hybrid.

Here's a talk from one of the creators: https://www.youtube.com/watch?v=EpYYe6aQjJM


Gatsby founder here.

Really appreciate the feedback and support for our launch today! The team worked super hard to get Incremental Builds live in public beta but are taking all the feedback (here and all over the web) as we go into full launch. Let us know what you think. Thanks!


Kyle,

Just read the post, congrats on the launch!

We've been using Gatsby on:

https://mintdata.com

for the past few years, and are huge fans of your work.

I still recall the day when I brought Gatsby into our org, our front-end guys almost ate me alive :D

They said: a React.render(...) + GraphQL thing, why do we need it? What's the big deal?

Fast forward a few years, and Gatsby is (in my opinion) the best way to build a static website based on React.

Keep up the awesome work!

Your true fan, Denis


Wow MintData looks so cool! I was just trying to figure out whether Webflow can be used to build simple apps, then I saw this, a whole new level. Is there a way to try it?


Glad it's been a great experience!


Hi Kyle, Gatsby was my second static site generator (after Jekyll) and something I have stuck with the longest.

I know you guys are working towards SSR as well, but do you see a particular point of convergence between what you're doing and Nextjs?

Because it seems that given Nextjs SSR, SSG and everything else working now...Gatsby will get to where Nextjs is today.


Yeah. Next js already supports static site generation.


Thanks for the great piece of software! I really like how the Gatsby community is dabbling in templates for more than just blogs. I've seen documentation sites, landing pages, and notes.

Here are some thoughts on my experience with Gatsby:

- You've done a lot of work to make configuring Gatsby easier, but I still seem to constantly hit roadblocks trying to get the config I want. For example I was running into problems getting MermaidJS, embedded video (that I was hosting on my own machine, not on YouTube), and mdx files all working together.

- I've been thinking that Gatsby is the perfect framework for creating semantic web content. E.g., you could have calendar events sprinkled through a website and create a GraphQL API for listing those calendar events, and that API would be accessible during the build process.


Interesting, what were the challenges you faced with MDX?

We built https://mintdata.com/docs on it, and it's been proverbially better than sliced bread -- that is, a true joy to work with MDX.

What're the challenges/pitfalls you faced with MDX + Gatsby?


This is great! Is there any technical limitation keeping this from being part of the open source version?

I get that Gatsby company put a lot of effort into this and wants a return on that investment, and good for them. I assume a third party could offer the same but why would they compete at the same value prop.

However an open source version to not be reliant on any company would be compelling to many.


We recently introduced build optimizations for incremental data changes for self-hosted environments: https://www.gatsbyjs.org/docs/page-build-optimizations-for-i... and are continuing to improve build speed across platforms.

To reliably provide near real-time deployments, we need tight integration with the CI/CD environment to optimize and parallelize the work; that's why you’ll see the fastest builds and deploys through Gatsby Cloud — the platform is purpose built for Gatsby!


How much is the speed issue related to the language used? I know Hugo is an order of magnitude faster than most static site generators for example - it's written in Go with e.g. 2 seconds to generate about 10K pages https://forestry.io/blog/hugo-vs-jekyll-benchmark/.

I would have thought the generation process could be massively parallelised and a typical blog page would only need a modest amount of computation e.g. concat header, footer, pull in body text, resolve a few URLs. I can't help but think about how much work a typical computer game is doing in comparison 60 times per second even without a GPU.


I don’t think it’s a language issue. Even for JavaScript bundlers you have the slow extensible bundle and the “new super fast bundler” that dies in a month because it only fits one use case.

How flexible is Hugo? And how many plugins does someone generally use?


> How flexible is Hugo? And how many plugins does someone generally use?

It processes Markdown, JSON, YAML and SASS, can pull in data files from URLs, and has custom templates/themes, custom macros/shortcuts, image processing and live reload. It doesn't have a plugin system as far as I know but nothing stops you combining Hugo with other tools e.g. run a JS script to pull in and transform a JSON file before Hugo runs.


I think that’s the point. No plugin system. Compare Babel to Bublé or even Sucrase for example: https://github.com/alangpierce/sucrase

Preparing data for external use always takes extra effort.

You can build an efficient self-contained tool in JavaScript too.


A counterpoint: Babel's extensibility doesn't matter in practice at all, other than helping the Babel team organize their code.

Pretty much every new ES6 feature required parser and babel-core changes just to be able to be used.

Example: a lot of changes that only worked in Babel 7 (that was on Beta for months) were not possible in Babel 6, and so on for previous versions. A plugin was not enough: you also needed parser/core changes.

Other than for novel non-standard features (like code substitution), plugins are not exactly that powerful, and even things like that are frowned upon in most environments, as 99.9% of people just want ES6 features.


> Preparing data for external use always takes extra effort.

How much effort are you really talking about for a static site though? Can you be specific?

Most sites I've worked on are processing a modest number of Markdown + JSON files, where sometimes these are pulled in from an external URL. Why does any of this require anything particularly complicated or anything that could justify a big performance hit?

> You can build an efficient self-contained tool in JavaScript too.

Does one exist for static site generators though?


I use Hugo to run a site for a small news organization. It’s flexible enough. I’ve never run into something I couldn’t make it do with some creative thinking. It doesn’t have any plugins because why would it need a plugin? I guess there is a basic build and an extended build for including image resizing, but they would only have one build if they didn’t need to use C for the image stuff. Plugins mean the system doesn’t solve the problems of its users…

It does have themes, but I just write my own.


Some of it is due to the language and the general JS tooling being bloated and slow.

However, in many cases build time is slow because you're doing something that's slow, like calling a REST API. You are not going to generate 10k pages in 2sec if you need to make 10k REST requests, each taking 100ms, to a remote API to fetch the data for your pages. This kind of "data integration" from various sources is a standard use case for site generators like gatsby and next.js. It seems like what this is targeting is smarter caching to avoid such expensive calls when possible.
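
Back-of-the-envelope numbers for the 10k requests at 100ms each, as a small Python sketch. It assumes every request takes exactly the stated latency and that requests run in fixed-size waves, which is a simplification; the function name is made up.

```python
import math


def min_fetch_seconds(requests: int, latency_ms: int, concurrency: int) -> float:
    """Lower bound on wall time when requests run in waves of `concurrency`."""
    waves = math.ceil(requests / concurrency)
    return waves * latency_ms / 1000


for concurrency in (1, 10, 100, 500):
    print(concurrency, min_fetch_seconds(10_000, 100, concurrency))
# 1 1000.0
# 10 100.0
# 100 10.0
# 500 2.0
```

Fetched one at a time, the data alone takes about 17 minutes. Getting down to 2 seconds would need around 500 requests in flight at once, assuming the remote API tolerates that.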

Hugo is different in that it basically just transforms local HTML/Templates/Markdown. That's always fast. Even JS can handle that.


> Hugo is different in that it basically just transforms local HTML/Templates/Markdown. That's always fast. Even JS can handle that.

Do you know of any benchmarks that show that? As far as I know, most static site generators can take minutes to process a few thousand Markdown files with Hugo being the exception.


Is the technology behind "incremental builds" being upstreamed into the open source project?


Our team's been waiting on this for a year to start moving larger sites to Gatsby. Can't wait to try it.


so this is a cool release, and no objection on that, but if your pipeline has automated testing, security scans and more then you are not actually deploying in 10s

more technical details would be good but I guess either I missed it or they look at it as IP


You wouldn't be running automated testing etc on data updates though, surely? That's what this feature is for, not code updates.


Javascript re-invents "code typing" (Typescript)

Javascript re-invents "Promises" because callback hell

Javascript re-invents "compilers" (babel)

Javascript re-invents "build systems" (webpack, etc)

Javascript re-invents "caching" (incremental builds) - but paid, and in the cloud

Because why not.


/s/re-invents/implements/g and suddenly JavaScript is running a successful dev cycle.


You still need to use an api for everything. Good apps need a backend; not a JAMStack fan for anything but the most basic of sites.


The 'backend' here is ... HTML. For a read-only blog, that's likely more than enough. Otherwise, for dynamic content like contact forms and such, I don't know if there's a meaningful benefit to building out a whole site in PHP/Python/Rails or something (and paying commensurately more in hosting) than to use Formspree or something similar.

Yes, it calls an API. And thankfully with Formspree, it's pretty easy to see the price breakeven points vs. hosting, but there are benefits to be had.


Content has to come from somewhere. Operational complexity is increased, not decreased, since you are running a build server in addition to your CMS. The benefit is easier optimizations in the frontend.


Yeah, but you're only running your build server some of the time (which is favorable in this era of by-the-second pricing) and you're running it inside a private network where nobody can try to infiltrate it 24/7 for no good reason.


But you can outsource those (build server, CMS) to e.g. Netlify and Contentful.

But if you don't want that, there's a billion alternatives; the CMS market is one of the most saturated ones out there.


I actually feel like Gatsby will be developing a CMS to round out their paid feature-set. I have zero inside knowledge, just a hunch that to get customers to pay up handsomely, the product needs a deeper "fit" in the publishing pipeline.


Content can come from text files. Operational complexity ought to be neutral (or even negative) if that content compilation step replaces your CI pipeline.



