The Surprising Path to a Faster NYTimes.com (speakerdeck.com)
107 points by danso on Sept 24, 2014 | 33 comments



Improvements like this at NYTimes.com have had me happily paying for my daily news for the first time. The user experience is just orders of magnitude better than anything I've found that's available for free. It almost reminds me of when Google Maps came out and completely changed expectations for a map site. The iOS app also impressed me: it's great at caching stories, so even in crappy network areas it feels like a "broadband" news-reading experience (kind of amazing how long it took for a news app to accomplish this).


I find the user experience on the new site to be much worse than the old site. When I'm reading an article I really don't care about navigating to the rest of the site. I have a small widescreen laptop, which means that the fixed headers at the top give me less room to read the article. I don't need gobs of white space to read a news article in print, so why is white space so important on the web that navigation has to be hidden in one of those awful hamburger buttons? The text is harder to read because it has lower contrast. The new comment system is totally unusable. I don't really appreciate large, useless images interrupting the flow of the article. And I don't need JavaScript to read a newspaper article (the site loads like a pig with JavaScript), because I'm going to be navigating to a different page when I read another article. It's faster to just open the home page and open each article in a new tab (with JS disabled).


The slides say there are 1 million pages, and that "republishing" them would take 90 days. Maths: that's 7.8 seconds on average to "republish" a page. Modern template systems can render a page from structured data in less than 1/4 of a second (and that is a high estimate). That is ~30 times faster, meaning all pages could be "republished" in 3 days instead of 90 had a more efficient system been used from the start.
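Spelled out as a quick back-of-envelope check (the 1/4 second per page is just my estimate above, not a measurement):

```typescript
// Back-of-envelope check of the figures above.
const pages = 1_000_000;
const republishDays = 90;
const observedSecondsPerPage = (republishDays * 24 * 60 * 60) / pages; // ~7.8 s
const templatedSecondsPerPage = 0.25;                                  // high-end estimate
const speedup = observedSecondsPerPage / templatedSecondsPerPage;      // ~31x
const daysIfTemplated = republishDays / speedup;                       // ~3 days
console.log({ observedSecondsPerPage, speedup, daysIfTemplated });
```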

Focusing on shifting all "rendering" into front-end JS seems like it will lead to more difficulty in the long run than using a more efficient structured page-creation mechanism.

I am curious how the static pages were created. Others here are speculating that templating was not done. If not, what does "republishing" mean, exactly?


Their source material here is 1 million pages of HTML, they don't have (on my reading) some separate source of "structured data" for the modern template system to use.

It seems reasonable (and possibly low) to estimate 90 days to extract the content from the variety of versions of static pages, structure it, and then publish it in a more modern fashion.

It's all very well to say they should have used a more efficient system from the start, but "the start" in this case is 1996, which was the wild west in terms of best practices.


In 1996 (or as late as the early 2000s), many texts were published via tools like Dreamweaver, where the publisher would check out / lock a file, make static changes, and save the file directly on prod. Like you said, it's not surprising at all that they have a large portion of their articles in static HTML files.


Exactly. To compound the issue, older content may not fit into our current schema, so there are other data issues that need to be solved. Adding hand-coded HTML to pages was quite common 10 years ago, so parsing that into the structures we work with today isn't always straightforward and is difficult to automate.


As a disclaimer, this falls under the "Back then" section of the talk which is an overview of how things used to work. We no longer do things this way.

Publishing a page means running content from a CMS through a templating system. However, the time spent executing templates isn't the only factor in the duration of the "publish". The slides refer to a compilation step (which actually also included a preprocessor step), and the publish also includes delegating to a service that copies the resulting page to disk and ensures the write succeeds in all data centers. For data consistency and system monitoring, we essentially treat that entire process as an atomic action and wait for all parts to finish. Additionally, since "publishing" is a core process for us, we avoid massive publishes that might put at risk the systems involved in successfully publishing current articles. So increasing the number of these for the sake of pushing code is considered too risky. Yes, there are ways of mitigating that risk, but dealing with this legacy problem once and for all is a better path forward than scaling up this solution.
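As a very rough illustration of the shape of that flow (the function names and the two data centers below are made up for this sketch, not our actual system):

```typescript
// Illustrative sketch only: the whole publish is treated as one atomic unit,
// waiting on the template run and on every data-center write before it counts as done.
type DataCenter = "dc-east" | "dc-west"; // hypothetical data centers

function renderTemplate(content: { headline: string; body: string }): string {
  // Stand-in for the preprocessor + template compilation/execution steps.
  return `<article><h1>${content.headline}</h1><p>${content.body}</p></article>`;
}

async function writeAndVerify(dc: DataCenter, path: string, html: string): Promise<void> {
  // Stand-in for the file service that copies the rendered page to disk in one
  // data center and confirms the write succeeded.
  console.log(`wrote ${path} (${html.length} bytes) to ${dc}`);
}

async function publishPage(path: string, content: { headline: string; body: string }): Promise<void> {
  const html = renderTemplate(content);
  // The publish only finishes when every data center has the file, which is one
  // reason a bulk republish is slow and risky to run alongside live news.
  await Promise.all((["dc-east", "dc-west"] as DataCenter[]).map(
    dc => writeAndVerify(dc, path, html),
  ));
}

publishPage("/2014/09/24/example/index.html", { headline: "Example", body: "..." });
```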


Thank you for this detailed explanation. I assumed there was some extenuating complexity to the process, and was simply wondering what it was that prevented the original publishing system from being adapted to be faster.


"From the beginning" in this case going back to the mid 90s. Pretty sure "Modern template systems" have mostly been written since then.


http://search.cpan.org/~mjd/Text-Template-0.1a/ : basic text templating, circa 1995. Using templates is hardly a new thing. I wrote a full CMS for a news company in 2002. Sure, that is a couple of years later, but I based it on Text::Template in combination with XML, both of which existed in the mid-'90s.

I am not attacking what they were doing; my point was that this is an issue of delaying the move to a template-based system way longer than necessary. I am speculating on whether they did move to some templating, but not all the way, and wondering what exactly they were using before that.

Saying "static HTML files" is well and good, but it doesn't explain how they were created. By hand??

Also, saying "90 days to republish" seems to be suggesting some process they have in mind besides manually scraping the data out. If you have 1 million random files, scraping could easily take years not 3 months. It would be interesting to know what process they are suggesting to follow in those 90 days. My speculation is they did use some sort of outdated CMS software.


It also seems like a task with inherent parallelism.

Upload the corpus and throw a bunch of cloud instances at it.


The point about elements on the page shifting around is huge. I've personally dreamed of a browser change that would keep reflows outside the current viewport from changing your position on the page. If you have a slow connection or view certain types of content (liveblogs, etc.), this can become a huge pain.

Just thinking about it raises all sorts of questions about whether the browser/rendering engine can actually reliably know that information, but it doesn't mean I can't dream.
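For containers you control (a liveblog feed, say), you can approximate it in page code today. A minimal sketch, with a hypothetical prepend scenario:

```typescript
// Hypothetical userland sketch: insert late-arriving content above the reader
// without moving what they see, by compensating the scroll position.
function prependWithoutJump(container: HTMLElement, node: HTMLElement): void {
  const anchor = container.firstElementChild as HTMLElement | null;
  const before = anchor ? anchor.getBoundingClientRect().top : 0;
  container.prepend(node); // this reflow would normally push the reader's text down
  if (anchor) {
    const after = anchor.getBoundingClientRect().top;
    window.scrollBy(0, after - before); // scroll by the shift so the anchor stays put
  }
}
```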


If you have fixed ad sizes and known locations, you could make those spaces empty boxes of the correct size and then asynchronously fill them with data. I haven't tried it, but it seems like it should work.
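A minimal sketch of that idea, assuming the slot's box is already sized in the markup (the element id and ad URL are placeholders):

```typescript
// The slot keeps fixed dimensions, e.g. <div id="ad-slot-1" style="width:300px;height:250px"></div>,
// so filling it later can't shift the article text around it.
async function fillAdSlot(slotId: string, adUrl: string): Promise<void> {
  const slot = document.getElementById(slotId);
  if (!slot) return;
  const html = await (await fetch(adUrl)).text(); // load the creative asynchronously
  slot.innerHTML = html; // the reserved box absorbs the content with no reflow
}

fillAdSlot("ad-slot-1", "/ads/slot-1");
```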


I can imagine static pages getting really annoying at this scale, and it also seems like a no-brainer to have your content in a database... but the nerd in me did think "page rendering can be trivially parallelized – why not throw some map/reduce at it?"
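A toy sketch of that parallelism, assuming the content is already in a form a renderer can consume (all names here are made up):

```typescript
// Hypothetical sketch: a fixed-size pool of workers pulls page ids off a shared
// cursor and re-renders each page independently.
async function renderPage(pageId: string): Promise<string> {
  // Stand-in for "run the page's structured data through a template".
  return `<html><!-- rendered ${pageId} --></html>`;
}

async function republishAll(pageIds: string[], concurrency = 64): Promise<void> {
  let next = 0;
  const workers = Array.from({ length: concurrency }, async () => {
    while (next < pageIds.length) {
      const id = pageIds[next++];
      await renderPage(id); // in a real pipeline: also write the result to the page store
    }
  });
  await Promise.all(workers); // pages are independent, so this scales out trivially
}

republishAll(Array.from({ length: 1_000 }, (_, i) => `page-${i}`));
```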


I've been really impressed with the quality of New York Times pieces as of late. "Norway the Slow Way," posted here a few days back, was impressive in its use of a variety of front-end display techniques to tell a single story. Even their web console output had some neat ASCII art and a hiring call for interested developers.



I just found that annoying. I simply wanted to read the text.


Hello Hacker News.

I'd first like to say that this is the deck from a presentation at Velocity NY last week. As with most talks, separating the slides from the presenter can make interpreting the context difficult. I did try to make an effort to have my slides provide useful information without me presenting them, but I acknowledge that I may not have done enough in that regard. I also received feedback from people present that there were too many bullet points and my font was too small. Can't please everyone, I guess. But if you have a link to what you consider the "perfect" slide deck, where unambiguous context is maintained without video of the talk, I'd love to study it in order to improve.

Other replies will be directed at the specific comment thread.


Is there a video of your presentation? I'm interested in watching it.


Static pages are a barrier to scaling if you have a bunch of other stuff tied in with them: stuff like HTML macros, CSS, and so forth.

My physical NYT copy from 1980 is fine. It was "published" this way, and it stays this way.

What we're really saying is that if you want to go the static route, you can't go half-way: everything that _is_ the page gets deployed in one file. I doubt very many people who think they have static pages actually do.


What is WPO?


Web Performance Optimization


I was wondering the same thing; I went back through the previous slides trying to find a definition of it.


What really strikes me here, aside from the technical aspects, is the note on p. 21 about how the project was supported from the top because SEO was lagging as a result of site load time, and this line especially: "NYT became an e-commerce site since the last redesign."

Once you are focusing on e-commerce and SEO as an executive team, are you still committed to journalism?


If you're selling subscriptions, wouldn't that leave you more committed to the journalistic quality readers want rather than letting advertisers dominate that discussion?


Yes, if you're selling subscriptions. SEO is not meant to maximize subscription revenue, but to get people to click for free, aka ads. Which will it be for them?

Two recent pieces are leading me to believe that NYT is floundering around, without a real cohesive online strategy, still:

[1] http://www.cjr.org/the_audit/the_new_york_timess_digital_li....

[2] http://www.cjr.org/the_audit/the_new_york_times_cant_abando....


Why must it be either-or? The web is another medium for the journalism. There's no reason to assume performance was of greater importance than our core mission. The point of that slide is to explain why this redesign had a performance goal at all. Perhaps the slides don't explain this point well, but I think you're reading too much into it.


Perhaps off-topic, but recently it's appeared that NYT pages have had some sort of JS memory leak when left open for a long time in Chrome.


They were keeping a million static pages on disk without any templating?

Whoa.


What was so surprising about the path? Wasn't clear from the deck.


That they needed to step away from "load everything possible asynchronously" in order to avoid having the page move around in front of the user's eyes, and that this resulted in "objectively" slower page loads (more time to DOMReady) but an actual increase in perceived performance and load speed for users.

At least I think that's what they were saying. Always challenging to deal with a slide deck meant to be presented by a human, but without the human doing the presentation.


Really? Each surprise gets a whole slide in bold text all to itself. #1: A lot of static pages are a barrier to optimization. #2: A performance increase was demanded as part of the redesign. #3: Sometimes you have to slow down to seem faster.

#1 really did surprise me, because I had always assumed that serving static pages would be really fast. I guess I never thought about sites with millions of pages.


For #1, it could be fast, but the problem they're running up against is that the data appears to be held up by a single bottleneck, similar to pre-sharded database setups. In this case, their filesystem-as-database is hitting the limits of too many connections saturating much of their disk's I/O.

A possible way of approaching that problem is divide and conquer: a reverse proxy that assigns manageable chunks of their content across numerous machines, each serving far fewer than a million pages. The NYT already has a /<yyyy>/<mm>/<dd>/<section>/<subsection>/<slug> URL scheme, which would make this less painful.
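For example, the proxy could route purely on the date prefix; a quick sketch (the shard hostnames are placeholders):

```typescript
// Hypothetical sketch: pick a backend shard from the year/month in the NYT-style
// URL so no single file store has to serve every page.
const SHARDS = ["static-a.example.com", "static-b.example.com", "static-c.example.com"];

function shardFor(path: string): string {
  // e.g. /2014/09/24/technology/personaltech/some-slug.html
  const m = path.match(/^\/(\d{4})\/(\d{2})\//);
  if (!m) return SHARDS[0]; // fall back for non-article paths
  const bucket = (Number(m[1]) * 12 + Number(m[2])) % SHARDS.length;
  return SHARDS[bucket];
}

console.log(shardFor("/2014/09/24/technology/personaltech/example.html"));
```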

I'm not sure how inefficient this would be, and it's certainly a time investment, but it ends up offloading your disk I/O issues by creating more and more S3 buckets (or what have you) and routing via a proxy. I'd be curious to see when S3+CloudFront-as-host becomes too slow simply because of disk I/O limitations, although S3 almost certainly has its own abstraction above the bucket, which I'm not aware of, that mitigates that.

It still doesn't address the serious and complex frontend issues they were facing, which seemed much more onerous, to be honest. Their server rendering seems to be pretty lean, though; it looks like DOM processing and client rendering make up easily 80% of their 3.6s page load time.



