
No, you have no idea what your scale problems are going to be (if you ever have them). That is because if you get lucky and your application scales, it (and the world) will change significantly from what it is today.

Let me tell you a story: in 1998 at Inktomi (look it up) we had a distributed search engine running on Sun hardware. We could not have anticipated that we'd need to migrate to Linux/PC because Sun's prices would make it impractical for us to continue scaling using their hardware. It took us three years to make the switch, and that's one of the reasons we lost to Google. Had we started two years later (when Gigabit ethernet became available for no-brand PC hardware), then we would have built the entire thing on Linux to begin with.

"It really doesn't take a lot of time to plan ahead."

Have you ever experienced the growth of a startup and watched your infrastructure costs soar to five, six, seven figures per month? Two hours will get you as far as "one day we'll probably need to replace MySQL with something else." What you don't know is what that something else will be. Too many writes per second? Reads? Need for geographical distribution? A schema that changes all the time because you need features and fields you never imagined? Will you need to break up your store into a service-oriented architecture with different types of components? Will you run your own datacenter, or will you be on AWS? What will be the maturity level of different pieces of software in two years?

I hope you get the point.




Do you really expect me to buy the idea that your company failed because Gigabit was too expensive? Even if 100Mbit was far cheaper, there are plenty of workarounds to cheaply increase throughput.

I assume by no-brand you mean custom-built, and presumably the cheapest available, in which case even one gigabit interface may have been difficult, seeing as a 32-bit/33 MHz PCI bus's capacity is barely above gigabit speed. In any case, the money you saved on Sun gear could have built you a sizeable PC cluster, which even with several 100Mbit interfaces would have been more powerful and cheaper. Really I think it wasn't built on Linux because Sun was the more robust, stable platform. But I could be crazy.
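
For what it's worth, the bus arithmetic is easy to check. A quick back-of-envelope in Python, using nothing but the standard PCI spec figures (nothing specific to whatever boxes Inktomi actually ran):

    # 32-bit/33 MHz PCI vs. gigabit Ethernet, back of the envelope.
    PCI_WIDTH_BITS = 32
    PCI_CLOCK_HZ = 33.33e6                         # classic PCI clock
    GIGE_BPS = 1e9                                 # gigabit Ethernet line rate

    pci_peak_bps = PCI_WIDTH_BITS * PCI_CLOCK_HZ   # ~1.07 Gbit/s theoretical peak
    print(f"PCI peak : {pci_peak_bps / 1e9:.2f} Gbit/s (~{pci_peak_bps / 8e6:.0f} MB/s)")
    print(f"GigE rate: {GIGE_BPS / 1e9:.2f} Gbit/s")
    print(f"headroom : {pci_peak_bps / GIGE_BPS:.2f}x")
    # Arbitration and burst limits eat much of that peak in practice, so a single
    # GigE NIC can saturate the whole bus -- which is the point above.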

While I'm being crazy, I should point out that all the other things you mentioned can be planned for. Anybody who's even read about high-performance systems design should be able to account for too many reads/writes! Geographical distribution is simple math: at some point there are too many upstream clients for one downstream server, capacity fills, and latency goes through the roof. A DBA knows all about schema woes. I thought service-oriented architecture was basic CS stuff? (I don't know, I never went to school.) AWS didn't exist at the time. And the maturity level of your software in two years will, obviously, be two years more mature.
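
To put rough numbers on the "simple math" (everything below is a made-up illustration, not anyone's real traffic):

    # Toy capacity model: when does one downstream server stop being enough?
    server_capacity_rps = 5000      # requests/sec one server sustains (assumed)
    per_client_rps = 0.5            # average requests/sec per active client (assumed)
    target_utilization = 0.7        # keep headroom so queueing latency stays sane

    clients_per_server = int(server_capacity_rps * target_utilization / per_client_rps)
    print(f"one server comfortably handles ~{clients_per_server} clients")

    clients = 250_000
    servers = -(-clients // clients_per_server)    # ceiling division
    print(f"{clients} clients -> at least {servers} servers")
    # And once those clients are spread across continents, more servers in one
    # place don't fix the ~100 ms round trips -- that's when you're forced into
    # geographic distribution.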

All of these problems are what someone with no experience or training will run into. But there should be enough material out there now that anyone can read up enough to account for all these operational and design problems, and more. And if your argument is that start-up people shouldn't have to know it because "hey, I haven't got time to figure out how to do it right, I have to ship right now," I don't buy that for a minute.

There's a paper that some guy wrote years ago that goes over in great detail every single operational scaling issue you can think of, and it's free. I don't remember where it is. But it should be required reading for anyone who ever works on a network of more than two servers.

As an aside: was it really cost that prohibited you from porting to Linux? This article[1] from 2000 states "adding Linux to the Traffic Server's already impressive phalanx of operating systems, including Solaris, Windows 2000, HP-UX, DEC, Irix and others, shows that Inktomi is dedicated to open-source standards, similar to the way IBM Corp. has readily embraced the technology for its eServers." And this HN thread[2] has a guy claiming that in 1996 "we used Intel because with Sun servers you paid an extreme markup for unnecessary reliability". However, it did take him 4 years to move to Linux. (?!) A lot of other interesting comments on that thread.

[1] http://www.internetnews.com/bus-news/article.php/526691/Inkt...
[2] https://news.ycombinator.com/item?id=3924609


Hi Peter, another ex-Inktomi/ex-Yahoo guy here. I worked on this infrastructure much later than Diego did. Traffic Server is not a significant part of the Inktomi environment -- you are looking at the wrong thing. Diego is describing the search engine itself, which ran on Myrinet at that time. It did not run on 100BaseT Ethernet. Myrinet was costly and difficult to operate, but necessary, as the clusters performed an immense amount of network I/O.

It is also extremely non-trivial to replace your entire network fabric alongside new serving hardware and a new OS platform. These are not independent web servers; they are clustered systems that all speak peer-to-peer in the process of serving a search result. This is very different from running a few thousand web servers.
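
To make the difference concrete, here is a rough scatter-gather sketch in Python (purely illustrative, nothing like the actual Inktomi code; the partition count and scoring are made up): each query fans out to every index partition over the cluster fabric and the partial results come back to be merged, so internal traffic per query grows with the cluster.

    from concurrent.futures import ThreadPoolExecutor

    N_PARTITIONS = 64    # hypothetical number of index partitions

    def search_partition(partition_id, query):
        # Stand-in for a network call to one partition's index server.
        return [(1.0 / (partition_id + 1), f"doc-{partition_id}-{i}") for i in range(3)]

    def search(query, k=10):
        # One incoming query -> N_PARTITIONS peer-to-peer requests across the fabric.
        with ThreadPoolExecutor(max_workers=N_PARTITIONS) as pool:
            partials = pool.map(lambda p: search_partition(p, query), range(N_PARTITIONS))
        merged = [hit for part in partials for hit in part]
        return sorted(merged, reverse=True)[:k]

    print(search("example query")[:3])

A farm of independent web servers never generates that kind of east-west traffic per request, which is why the fabric mattered so much.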

Even once we had migrated to GigE and Linux, I watched the network topology evolve several times as the serving footprint doubled and doubled.

I assure you, there is no single collection of "every single operational scaling issue you can think of," because different systems have very different architectures and scale demands -- often driven by costs unique to their situation.


What you're saying makes total sense in terms of the complexity and time involved in standing up a whole new platform. But in my view it depends a lot on your application.

Was the app Myrinet-specific? If so, I can understand the increased difficulty in porting. But at the same time, in 1999 and 2000 people were already building real-time clusters on Linux Intel boxes with Myrinet. (I still don't know exactly what time frame his post was referencing.) If Diego's point was that they didn't move to Linux because Gigabit wasn't cheap enough yet, why did they stick with the expensive Sun/Myrinet gear when they could have used PC/Myrinet for cheaper? I must be missing something.

I can imagine your topology changing as you changed your algorithms or grew your infrastructure to work around the massive load. I think that's natural. My point was simply that making an attempt to understand your limitations and anticipate growth is completely within the realm of possibility. This doesn't sound unrealistic to me [based on working with HPC and web farms].

What I meant to say was "every single issue", as in, individual problems of scale, assumptions made about them, and how they affect your systems and end users. It's a broad paper that generically covers all the basic "pain points" of scaling both a network and the systems on it. You're going to have specific concerns not listed, but it points out all the categories you should look at. I believe it even went into datacenter design...


> Geographical distribution is simple math

HAHAHAHAHAHAHA.

> And the maturity level of your software in two years will, obviously, be two years more mature.

HAHAHAHAHAHAHA.


You don't get token karma on here for being an ass. This isn't reddit.


If this isn't Reddit, then why are you telling diego that you know more about his personal life experiences than he does?

That's a pretty Reddit thing to do.



