
I worked at a supercomputing facility for a few years. The codes are typically decades old and have been maintained by hundreds of people over the years. By and large, those maintainers understand the codes' performance profiles and are working to squeeze as much out of them as they can.

In addition, the performance engineers tend to be employed by the facilities rather than being the computational scientists themselves. They're the ones who do the legwork of profiling the existing code on their new platform and figuring out how to squeeze any machine-specific performance out of it.

A lot of these codes are time-marching PDE solvers that do a bunch of matrix math to advance the simulation, so the kernel of the code is responsible for the vast majority of the time spent during a job. That means it's not necessarily a huge chunk of code that needs to be tuned to wring better performance out of the machine.
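To make that concrete, here's a toy sketch (my own illustration, not any real production code) of the shape of such a solver - essentially all of the runtime is in the one update line:

    import numpy as np

    # Toy 1D heat equation with explicit time stepping. Real codes are 3D,
    # parallel, and far more elaborate, but the shape is the same: a small
    # update kernel applied over and over for many timesteps.
    nx, nt = 1_000_000, 200
    alpha, dx, dt = 1.0e-4, 1.0e-3, 1.0e-3
    u = np.random.rand(nx)

    for step in range(nt):
        # This line is "the kernel": it dominates the runtime, so it is
        # the only part worth aggressively tuning for each new machine.
        u[1:-1] += alpha * dt / dx**2 * (u[2:] - 2.0 * u[1:-1] + u[:-2])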

The parallel communication they do is also to an API, not an ABI - the supercomputing vendors drop optimizations into the build of the library for their machine, to take advantage of network-specific optimizations for various communication patterns. If you express your code with the most specific function (doing a collective all-to-all explicitly, say, rather than building your own all-to-all out of the point-to-point primitives), the MPI build can insert optimized code for those cases.
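A rough sketch of that difference, using mpi4py for brevity (the same point applies to the C/Fortran API; run it under mpirun with mpi4py installed):

    from mpi4py import MPI
    import numpy as np

    comm = MPI.COMM_WORLD
    rank, size = comm.Get_rank(), comm.Get_size()

    send = np.full(size, rank, dtype='i')   # one int destined for each rank
    recv = np.empty(size, dtype='i')

    # Hand-rolled all-to-all built from point-to-point primitives: correct,
    # but the library can't recognize the pattern and optimize it.
    reqs = [comm.Isend(send[p:p + 1], dest=p) for p in range(size)]
    for p in range(size):
        comm.Recv(recv[p:p + 1], source=p)
    MPI.Request.Waitall(reqs)

    # The explicit collective: the vendor's MPI build can swap in a
    # network-topology-aware algorithm behind this single call.
    comm.Alltoall(send, recv)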

There's some misalignment because the facility will be on the Top500 list for a few years, while the code lives on and on and on. If your supercomputer architecture is really out of left field (https://en.wikipedia.org/wiki/Roadrunner_(supercomputer)) it's not going to be super worth it for people to try to run on it without porting support from the facility.


The story I read (https://andrewdamitio-92271.medium.com/the-decline-of-the-am...) says that there was a post-WWII shift that encouraged building real estate - the accelerated depreciation from the article.

Then, as things shifted back to linear depreciation, building and running malls became much less attractive, and we're seeing that play out over the 20-30 year capital lifecycle you mention.


Yup. I have an Erdős number, but despite being a Bacon, no Bacon number to my great dismay.


You might enjoy http://whatever.scalzi.com/2010/10/02/when-the-yogurt-took-o... as a hypothetical walkthrough of your scenario.


You might enjoy http://www.iquilezles.org/live/ where he live-codes some ray-marching in some kind of OpenGL editor. It's not quite what you're talking about, since it's just running the OpenGL code, but you could imagine it going through some kind of compiler/visualizer pipeline like you're thinking about.


Yes, absolutely. They did something like this at the Sun Microsystems field office outside of Chicago while I worked there. You would log into a Sun Ray with your smart card and pick up whatever you had left behind in your session, with no permanent desk assignment.

It was unpopular, to say the least. Your personal belongings went into a pedestal on wheels that you could take to whichever workspace you wound up at that day. This was in 2000/2001 or so.


These days we call them Chromeboxes.

And they can be great. My work-issued laptop is a Pixel 2 Chromebook, and I love it. But that's mine, which I alone use (except when I lend it to someone), and the grease on the keyboard comes from my fingers.


My favorite interviewing question as an IC was "Tell me about someone on your team you admire". It let me learn what people valued, based on why someone was admired, and it gave some sense of the depth of the bench: were there lots of distinct names, or was everyone in awe of the one good person on the team?

If you're looking for cross-team health, maybe you could adapt it to "Tell me about someone on the other team that you admire?"


This is an awesome question. What sort of responses have you seen from this? Do most folks have a quick answer, or do they have to think about it? As an interviewer, I'm not sure I would ever expect a question like this.


Truth be told, I don't think I'm calibrated on the question yet; I've only used it twice. In one org, there was a shining star who attracted all the answers. In the other org, someone laughed because of the number of good answers, and started rattling off names and reasons.

In hindsight, I wish I'd had enough experience with the question and its possible scenarios to ask the people in the first org for a second answer; I suspect there were more good answers available, but there was one obvious answer that everyone snapped to first.


Regarding "theoretical best", I think that is "in the absence of mitigations". I think you can build a service with a higher SLA than one of its dependencies, but only if you recognize that impedance mismatch and build in defenses.

As a contrived example, if you've got a microservice that provides data FOO about a request that isn't actually end-user critical, you can mitigate your dependency on it by allowing your top-level request to succeed even if the FOO data is missing. Or maybe you can paper over blips of unavailability with cached data.
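Something like this sketch, say (fetch_foo, the cache, and the timeout are all made-up names for illustration):

    import random
    import time

    # Stand-in for the RPC to the hypothetical FOO microservice.
    def fetch_foo(request_id, timeout):
        if random.random() < 0.01:               # simulate a blip of unavailability
            raise TimeoutError("FOO unavailable")
        return {"foo": "data-for-%s" % request_id}

    _cache = {"value": None, "fetched_at": 0.0}
    CACHE_TTL_SECONDS = 300

    def get_foo_or_degrade(request_id):
        try:
            value = fetch_foo(request_id, timeout=0.2)
            _cache.update(value=value, fetched_at=time.time())
            return value
        except Exception:
            # Paper over a blip with recently cached data...
            if time.time() - _cache["fetched_at"] < CACHE_TTL_SECONDS:
                return _cache["value"]
            # ...or let the top-level request succeed with FOO missing.
            return None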

But, yes, know what you depend on and how reliable those dependencies are, then see whether you need to take further action if your target is higher than the computed one.


(Tedious disclaimer: my opinion only, not speaking for anybody else. I'm an SRE at Google)

Building reliable services out of unreliable dependencies is a part of what we do. At the lowest level, we're building services out of individual machines that have a relatively high rate of failure, and the same basic principles can be applied at every layer of the stack: make a bunch of copies, and make sure their failure modes are uncorrelated.
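The arithmetic behind "make a bunch of copies" is just this (my framing, and it only holds when the failures really are uncorrelated):

    # With each copy independently up with probability p, the chance that at
    # least one copy is up is 1 - (1 - p)**n. This is exactly why correlated
    # failure modes (same rack, same power feed, same bad config push) are
    # the thing to worry about: they break the independence assumption.
    def availability(p, n):
        return 1 - (1 - p) ** n

    for n in (1, 2, 3):
        print(n, availability(0.99, n))   # 0.99, 0.9999, 0.999999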


See also a very nice video from Clojure/West about queues in system architectures: https://www.youtube.com/watch?v=1bNOO3xxMc0

Queues are everywhere - your messaging queue, the threadpool, the hardware threads, and other layers of the stack and APIs you use. The video adds the interesting detail that as you add more tellers (workers), you learn of impending disaster only in the outlier p99 (or higher) latencies; by the time your p85 latency rises, you're already about to stall out.
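Here's a toy bank-teller simulation of that effect (my own sketch, not from the video; the numbers are made up, but the shape is the point - the p99 wait degrades long before the p85 looks alarming):

    import heapq
    import random

    def waits(n_tellers, utilization, n=50_000, seed=1):
        """n_tellers parallel workers fed from one FIFO queue: Poisson
        arrivals, exponentially distributed service times."""
        rng = random.Random(seed)
        service_mean = 1.0
        arrival_mean = service_mean / (utilization * n_tellers)
        free_at = [0.0] * n_tellers          # when each teller is next free
        heapq.heapify(free_at)
        clock, queued = 0.0, []
        for _ in range(n):
            clock += rng.expovariate(1.0 / arrival_mean)
            start = max(clock, heapq.heappop(free_at))
            queued.append(start - clock)     # time spent waiting in the queue
            heapq.heappush(free_at, start + rng.expovariate(1.0 / service_mean))
        queued.sort()
        return queued[int(0.85 * n)], queued[int(0.99 * n)]

    for u in (0.7, 0.9, 0.95, 0.99):
        p85, p99 = waits(n_tellers=10, utilization=u)
        print("utilization=%.2f  p85 wait=%6.2f  p99 wait=%6.2f" % (u, p85, p99))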


Fortress (http://en.wikipedia.org/wiki/Fortress_(programming_language)) and X10 (http://en.wikipedia.org/wiki/X10_(programming_language)) were the other two languages that came out of the DARPA HPCS program (http://en.wikipedia.org/wiki/High_Productivity_Computing_Sys...) and might be interesting if you liked Chapel.

Other HPC-oriented languages include Coarray Fortran (http://en.wikipedia.org/wiki/Coarray_Fortran) and Unified Parallel C (http://en.wikipedia.org/wiki/Unified_Parallel_C).

I never really saw any of them while working in HPC, though. It was just Fortran, C, and sometimes Python. The Python would really just call out via SWIG to a C function for the numeric kernel.
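That split looks roughly like this (a self-contained sketch that uses ctypes instead of SWIG so it compiles inline; it needs a C compiler on the PATH, and the kernel is a made-up example):

    import ctypes
    import os
    import subprocess
    import tempfile

    # The C numeric kernel that does the heavy lifting.
    C_KERNEL = r"""
    void saxpy(int n, float a, const float *x, float *y) {
        for (int i = 0; i < n; i++) y[i] = a * x[i] + y[i];
    }
    """

    workdir = tempfile.mkdtemp()
    src, lib = os.path.join(workdir, "kernel.c"), os.path.join(workdir, "libkernel.so")
    with open(src, "w") as f:
        f.write(C_KERNEL)
    subprocess.check_call(["cc", "-O3", "-shared", "-fPIC", "-o", lib, src])

    # The Python driver: set up the problem, hand the inner loop to C.
    kernel = ctypes.CDLL(lib)
    kernel.saxpy.argtypes = [ctypes.c_int, ctypes.c_float,
                             ctypes.POINTER(ctypes.c_float),
                             ctypes.POINTER(ctypes.c_float)]
    kernel.saxpy.restype = None

    n = 4
    x = (ctypes.c_float * n)(1, 2, 3, 4)
    y = (ctypes.c_float * n)(10, 20, 30, 40)
    kernel.saxpy(n, 2.0, x, y)
    print(list(y))   # [12.0, 24.0, 36.0, 48.0]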

