Having never used a static site generator in anger, can someone explain to me like I'm five what's going on here?
My understanding is that Gatsby is a tool that converts a bunch of markdown files into a static HTML website. Why are slow builds a problem for any static site generator? Why does it need a cloud?
In other words, what problem am I supposed to be having that any of this solves?
Note, I'm trying not to be skeptical here - my company's website is hand-maintained HTML with a bunch of PHP mixed in so I can totally imagine that things may be better. But I don't understand the kinds of situations where using a 3rd party cloud to generate some static HTML solves a problem.
Gatsby is a fairly complex static site generator. At the highest level, it provides an ingest layer that can take any data sources (CMS, markdown, json, images, or anything that a plugin supports) and bring them into a single centralized GraphQL data source. Pages (which are built using React) can query this graph for the data they need to render. Gatsby then renders the React pages to static HTML and converts the queries to JSON (so there's no actual GraphQL in production).
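Roughly, a page in that model looks like the sketch below. This is a minimal example, assuming gatsby-transformer-remark is sourcing local markdown and that the frontmatter has a `slug` field (both just for illustration):

```jsx
// src/pages/example.js -- minimal sketch; assumes gatsby-transformer-remark
// and a made-up `slug` frontmatter field.
import React from "react"
import { graphql } from "gatsby"

export default function ExamplePage({ data }) {
  const post = data.markdownRemark
  return (
    <article>
      <h1>{post.frontmatter.title}</h1>
      <div dangerouslySetInnerHTML={{ __html: post.html }} />
    </article>
  )
}

// Runs once at build time; the result ships as static JSON alongside
// the rendered HTML, so no GraphQL server exists in production.
export const query = graphql`
  query {
    markdownRemark(frontmatter: { slug: { eq: "example" } }) {
      frontmatter {
        title
      }
      html
    }
  }
`
```

The component only ever sees plain props; the query is compiled away during the build.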
This process is fairly fast on small/simple sites. Gatsby is overall very efficient and can render out thousands of pages drawing from large data sources rather quickly. The issue is that Gatsby isn't just used for personal blogs. As you can imagine, a site with thousands of pages of content that is processing thousands of images for optimization starts taking a long time (and a lot of resources) to build. For example, I'm building a Gatsby site for a photographer that includes 16,000+ photos totaling a few hundred GB. Without incremental builds, any change (e.g. fixing a typo) means every single page needs to be rebuilt.
Incremental builds mean you don't have to rebuild everything. Because the data all comes from the GraphQL layer (which Gatsby pre-processes and converts to static JSON), it is possible to diff the graphs (i.e. determine what data a commit has changed) and determine which pages that change affects (i.e. which pages include queries that access that field). From there, Gatsby can rebuild only the changed pages.
This not only means faster build times, it also means that only the changed pages and assets have to be re-pushed to your CDN. This way, content that hasn't changed will remain cached and only modified pages will have to be sent down to your site's users.
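Conceptually the diff looks something like the sketch below. This is not Gatsby's actual internal code, just the idea: hash every data node, compare against the previous build, and map changed nodes back to the pages whose queries touched them (the `pageDependencies` structure is hypothetical):

```js
// Conceptual sketch only -- not Gatsby's actual internals.
const crypto = require("crypto")

const hashNode = (node) =>
  crypto.createHash("sha1").update(JSON.stringify(node)).digest("hex")

// pageDependencies: Map<nodeId, Set<pagePath>>, recorded while running
// each page's query on the previous build (hypothetical structure).
function pagesToRebuild(prevHashes, nodes, pageDependencies) {
  const dirty = new Set()
  for (const node of nodes) {
    if (prevHashes.get(node.id) !== hashNode(node)) {
      for (const page of pageDependencies.get(node.id) || []) {
        dirty.add(page)
      }
    }
  }
  return dirty // only these pages get re-rendered and re-pushed to the CDN
}
```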
But if you have 16,000 of anything, why are you using a static site? Surely the access patterns are long tail and you need to build more often than most pages are even accessed.
Surely there's a CPU/disk trade-off at some point. Static pages are much larger (less likely to be in memory) and would cause disk reads much sooner than the same files being generated dynamically. Of course WordPress isn't known for its efficiency, so static pages probably still come out well ahead.
There is a big difference between the cost of static hosting plus a CDN (think CloudFront / S3) and the cost of keeping an active piece of hardware running to serve stuff that doesn't change. Like orders of magnitude. Sure, for small sites it's not that much in absolute terms, but it's still orders of magnitude.
Also, the answer to a large number of my interview questions ends up being figuring out how you can just efficiently serve the stuff from a CDN / blob storage. You can scale the crap out of that for quite cheap.
If the final result is 500k of HTML, a dynamic website is doing a LOT more work to return that 500k than a static website - assuming you get more requests than you have pages to generate.
Admittedly in this case I'm mostly just trying to push Gatsby to its limits. For a photography site, there ends up being very little overhead with a static site (if you can do incremental builds). I also explored NextJS (SSR) and just making a good old SPA, but decided to go with Gatsby because at the end of the day, a clear majority of the storage cost is just the raw images. I think Gatsby ends up making the most sense because you get to take advantage of a CDN for caching (most CDNs don't like being used as just an asset cache) and I can just leave it there without worrying about a server.
Hi, have you thought about hosting for this yet? I've got a similar site which I originally tried on AWS Amplify, but it got too big for the artifact size limit, so I opted for S3/Cloudflare instead; however, build times are slow and the deploy is more of a manual process currently.
I'm yet to figure that out! I'm procrastinating on that until I have everything else figured out (it's for a family member, so there's no strict timeline). I'm thinking in the end the setup will be something with a CMS for editing the photo metadata, a file storage system for the images, and everything else in git. The build would pull from all three, run, and then the processed images would be pulled out and hosted on their own. I'm planning that it'll just run and take a long time on a droplet unless I figure out something better.
Server-side rendering (like Wordpress) generates HTML in response to a URL. Static site generators just visit every possible URL at build time and save the final HTML as files. This makes it easy to deploy and scale when your site is static and doesn't need any features of dynamic server-side rendering.
Gatsby (and other frameworks) automate this process by going through whatever data sources you have (directory of markdown files, databases, etc) and producing the HTML. Gatsby uses React for the templating logic and any client-side interactivity on the pages. Build times scale with the size of your content and number of pages to generate so that's the reason for the cloud.
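Stripped of the React/GraphQL machinery, the core loop of any static site generator is tiny. Here's a toy sketch in Node, assuming a ./content folder of markdown files and the `marked` package (folder names are just illustrative):

```js
// Toy static site generator: every possible URL becomes a file on disk.
// Assumes a ./content folder of markdown files and the `marked` package.
const fs = require("fs")
const path = require("path")
const { marked } = require("marked")

fs.mkdirSync("./public", { recursive: true })

for (const file of fs.readdirSync("./content")) {
  if (!file.endsWith(".md")) continue
  const md = fs.readFileSync(path.join("./content", file), "utf8")
  const html = `<!doctype html><html><body>${marked.parse(md)}</body></html>`
  fs.writeFileSync(path.join("./public", file.replace(/\.md$/, ".html")), html)
}
```

Everything a real generator adds on top (templating, data sourcing, asset pipelines) is what makes builds slow as the site grows.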
Overall, static sites are in the hype phase of the software cycle. Most sites are just fine using Wordpress or some other CMS and putting a CDN in front to cache every pageview. Removing that server completely is nice, but most static sites end up using some hosted CMS anyway, and at that point you've just replaced one component with another. There are also advantages to completely separating the frontend code from the backend system for fancy designs or large teams.
Wordpress in particular is not “just fine”. During this crisis, every government site based on WordPress ends up crashing under the load. Yes, you can avoid this with a good cache plugin and a CDN. But you can also avoid it by using a tool that is designed not to crash under load in the first place.
Serving up a HTML file from disk or generating some dynamic HTML are both trivial to do once. After that, a CDN layer is used to cache the HTTP response, regardless of how it was generated.
Sure, if you just want to serve HTML files on every request with a simple file server, then static files will be faster, but it'll eventually get overloaded too. The CDN is where the real scaling happens. And using a CDN is far easier than changing the entire backend to a static site.
Also most of the sites that crashed were dynamic applications, not just static pages. Using a static site generator wouldn't solve that problem.
I am specifically complaining about WordPress. Anyone could make a server-side dynamic application that applies proper caching headers to work with a CDN. Anyone could, but WordPress specifically did not. It's an uncacheable mess by default, and caching plugins just barely make it usable.
Using a static site generator is basically just a way of ensuring that the pages are properly cached by the brute fact that they've been dumped out on disk. It's not strictly necessary for a well designed system, but it raises the floor because even in the worst case, the pages are static files.
For many public facing sites, dynamic applications aren't strictly necessary. If you're just hosting a PDF of an Excel spreadsheet of permitted job categories, you don't need a dynamic application. Again, a well designed app would already be hosting this through S3, but you can't trust things to be well designed when made by a contractor with no technical oversight.
Static files don't automatically set any headers either; you still need a web server to serve them. And you can override those headers in the server or in the CDN, so there's no reason to switch out the entire backend for it.
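For what it's worth, setting those headers is a few lines wherever the files are served from. A hedged Express sketch (directory name and max-age values are just illustrative):

```js
// Explicit Cache-Control headers on a folder of static files.
const express = require("express")
const app = express()

app.use(
  express.static("public", {
    setHeaders: (res, filePath) => {
      if (filePath.endsWith(".html")) {
        // HTML changes on every deploy, so let the CDN revalidate it.
        res.setHeader("Cache-Control", "public, max-age=0, must-revalidate")
      } else {
        // Fingerprinted assets can be cached essentially forever.
        res.setHeader("Cache-Control", "public, max-age=31536000, immutable")
      }
    },
  })
)

app.listen(8080)
```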
CDNs handle scaling of static assets. That's their entire purpose, with features like request coalescing and origin shielding to help ensure unique URLs are never requested more than once. Optimizing for static files at the origin is just not worth the trouble when Wordpress and other frameworks are far more productive and provide CMS functionality which is usually needed anyway.
We're talking in circles. No one disputes that CDNs are good and expert users of WordPress are capable of making it not shit the bed. The point is that WP cannot be left unmanaged by novices, which means it should not be used in many situations in which it is currently used. Static sites have a higher floor and so are better suited to non-expert use.
My point is that tradeoff isn't worth it. It's far easier to tweak security settings and configure a CDN than to completely change the backend with a complex build process requiring more technical knowledge, deploy it to a host which you still need, configure a CDN which you still need, and wire it up to read from a CMS which you still need.
And then you can regenerate all your files every time you make a change - the tradeoff isn't worth it for one-off sites that generally sit at zero traffic.
Cheaping out on things that need actual support won't be fixed by making them static HTML pages.
I'll take a stab at this, Kyle just shoot me if I get something wrong below :D
1) There's a server-centric approach and a client-centric approach:
--a) hand-maintained HTML + php falls into the first camp
--b) React (/Angular/Vue) fall into the second
2) If you go with the second camp (b), you end up having a higher initial page load time (due to pulling in the whole "single page app" experience), but a great time transitioning to "other pages" (really just showing different DIVs in the DOM)
3) Gatsby does some very clever things under the hood to make it so that you get all the benefits of the second camp, with virtually none of the downsides.
4) There are of course all kinds of clever code-splitting, routing & pre-loading things Gatsby does, but I hope I got the general gist right.
If not, Kyle, get the nerf gun out! -- how would you describe the Gatsby (& static sitegen) benefits? :)
Gatsby can also let you use react components to do some pretty clever things around image resizing, effects, etc that you might expect from a static site generator but couldn’t achieve with just a frontend framework.
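For example, with gatsby-plugin-image you get something like the sketch below - the file name and query shape are illustrative, and it assumes gatsby-source-filesystem plus the sharp plugins are set up:

```jsx
// Sketch using gatsby-plugin-image; assumes gatsby-plugin-sharp and
// gatsby-transformer-sharp are installed, and "example.jpg" is made up.
import React from "react"
import { graphql } from "gatsby"
import { GatsbyImage, getImage } from "gatsby-plugin-image"

export default function Photo({ data }) {
  // Resized variants, the placeholder, and lazy loading are all produced
  // at build time rather than in the browser.
  return <GatsbyImage image={getImage(data.file.childImageSharp)} alt="Example photo" />
}

export const query = graphql`
  query {
    file(relativePath: { eq: "example.jpg" }) {
      childImageSharp {
        gatsbyImageData(width: 1200, placeholder: BLURRED)
      }
    }
  }
`
```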
(3) is incorrect, Gatsby initial page load times are mostly really bad.
(2) is both overstated and overvalued. It's overstated because loading a static HTML page from a CDN is extremely fast. Too many people who point at this advantage for SPAs are thinking back to pre-CDN usage with slow origin servers. Of course there are still use-cases where going to network is not wanted, but these aren't the primary use-cases that Gatsby covers.
It's also overvalued in that most users are not getting to a page by navigating in a loaded site, they are coming from a social or search link (again, for the sort of use-cases that Gatsby pages are built for).
> (3) is incorrect, Gatsby initial page load times are mostly really bad.
This has not been my experience, considering all the HTML is ready to go from the last byte, so other than blocking CSS, rendering can begin ASAP. At that point no JS is required to interact with the page, so things are generally pretty snappy while we wait for React hydration to kick in.
camp 1) at best, a TCP connection is re-used, and the HTML for "page 2" is fetched over the network, parsed, the CSS OM is applied, and then the whole caboodle* is "painted on screen".
camp 2) the CSS OM is applied and "page 2" is painted on-screen (possibly even faster if the browser cached "page 2" in a texture on the GPU, so the CSS OM application step may be optimized away)
So I genuinely don't understand how fetching "page 2" from a CDN (we use CloudFront & GCP's CDN at https://mintdata.com, so I'm basing my experience on this) is faster than the SPA approach?
I am genuinely curious on the above -- not trying to start a flame war :D
* Yes, apparently caboodle is a word?! I had to Google it just like you to make sure :)
I've only done simple stuff with Gatsby, but it fully supports generating static HTML from dynamic data sources. The difference between that and traditional JS frameworks is the generation for Gatsby happens at build time instead of runtime.
I love it because I vastly prefer serving static assets over server-side rendering for the numerous simplicities it provides (aggressive caching, predictable latency, etc). In most cases you get to have the cake of complex sites generated from templates and eat the cake of static asset serving.
It doesn't have to be markdown files. Gatsby supports a wide range of data sources, which are available to use in your templates via GraphQL. If your website is big and gets frequently updated with data from the backend triggering the builds, new content on the site can take a few minutes to appear as the generator needs to build the static site (HTML/CSS/JS files), which I assume is a problem for big publication sites.
For a hundred markdown files, no big deal. But if your site has tens of thousands of pages, those build times become a real pain point. Why should every single page rebuild if you only changed one of them?
A makefile works great when you have a pile of source files and you want to make a parallel pile of output files, and each source file is compiled individually. It's not so great when you have a compilation process where you have a folder of entrypoint source files that each need their own output artifacts produced but they happen to share many dependencies, and you want to automatically create common output chunks if there's enough overlap between them, etc. I'm sure you could find a way to involve Make by automatically generating makefiles, but at that point, Make is only handling the really easy part and isn't worth it.
Think of how many makefiles just end with one big linker call. Most web toolchains (which crunch a lot of source files into a few artifacts) have more in common with that linker call than the rest of the stuff that happens in a makefile. You have to have a system that's more integrated with what's being built to make that step meaningfully incremental.
Instead of a Markdown file, imagine your data is somewhere in a REST API, or many REST APIs. Gatsby (and Next.js, which I vastly prefer) will query these APIs during the BUILD process to generate your static site - and that can be slow. Imagine you have a site that lists the top 1,000 IMDB movies with details. To generate your static site, you need to make 1,000 REST calls to the IMDB API at build time to get the necessary data. Parallelizing and caching it makes it faster.
If it were just Markdown files you probably wouldn't need this, since parsing and transforming local Markdown files is fast. But this is Javascript, so nothing is truly fast.
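A hedged sketch of what that build-time fetching looks like in Gatsby's gatsby-node.js - the API URL and fields are made up, and it assumes Node 18+ for global fetch:

```js
// gatsby-node.js -- hedged sketch of build-time API fetching.
const path = require("path")

exports.createPages = async ({ actions }) => {
  const { createPage } = actions

  // One round of API calls per build, not per visitor.
  const res = await fetch("https://example.com/api/top-movies")
  const movies = await res.json()

  for (const movie of movies) {
    createPage({
      path: `/movies/${movie.id}/`,
      component: path.resolve("./src/templates/movie.js"),
      context: { movie }, // handed to the template as pageContext
    })
  }
}
```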
Static site generators output static HTML files instead of running a server that renders every request, it doesn't say anything about what language they consume. Gatsby is built around Javascript and React, not Markdown.
>> Static site generators output static HTML files
This is not really true; they often generate a static client-side web application vs. a dynamic first-time (or every time) app based on server-side processing. This provides a highly optimized, largely self-contained application that avoids a lot of the runtime dependencies and complexity we typically get (ex: web servers and databases). They are still highly dynamic through the use of APIs and such.
Gatsby has an extensive build pipeline and can query almost any data source during the build, but the original base source is markdown, and React is the JavaScript side.
None of the answers/comments below even come close to answering the simple question that started this thread. This looks like an overly complex solution looking for a problem to me.
True. The article has a decent answer though. Incremental builds are necessary because:
> [Slow builds] can be annoying if your site has 1,000 pages and one content editor. But if you have say 100,000 pages and a dozen content contributors constantly triggering new builds it becomes just straight up impossible.
Gatsby needs a cloud to host this build server. They also apparently host a nice content editing UI.
If you don't need a content editing UI, and/or are fine maintaining your own static builds, you presumably wouldn't subscribe to the cloud service.
They don't host a content editing UI, only a "dynamic" version of the site that you can embed or link in a CMS for draft previews etc.
I use and like gatsby a lot and don't think it is generally overcomplicated for what it does. They are really pushing static at all costs though, and these cloud solutions are needed because of that.
When seriously evaluating a 100,000-pages / dozens-of-editors project, if you ask what the benefits and the costs of static really are, I think you might come up with a different answer than Gatsby Inc. I think Zeit+Next actually has a better story there, because it's not "static at all costs".