Thinking About Performance

mdwrigh2 · on April 29, 2015

> Well, good luck, because Android devices have touchscreen latency of 100+ ms.

For what it's worth, this statistic is a little outdated, a little wrong, and not necessarily the one you care about.

A little outdated: It was done in 2013, and in the past couple years the touch panels on most flagship Android devices have gotten significantly better. Even the linked article was comparing Apple's latest device to older flagship models. The Nexus 5, released a week after Touchmarks published their numbers, consistently has about 70ms of latency, for example. The M8 reportedly[1] gets around 50ms of latency, which is pretty astounding.

A little wrong: There were multiple issues with the Touchmarks benchmark. They reportedly "discovered an optimization in our iOS test app that was not present in our Android or Windows Phone test apps", they had known race conditions that could introduce additional delay on Android that were never fixed, etc.

Not necessarily the one you care about: That statistic measure physical touch down to visible response, but you only actually care about the time until the application receives the event because that's the point you actually can kick off the network activity. Considering the display latency side of it is ~48ms, that's a fairly significant difference.

[1]: http://www.phonearena.com/news/Funky-metrics-HTC-One-M8-has-...

chadaustin · on April 29, 2015

Thanks! I searched for more recent numbers, but couldn't find them. Much appreciated.

I'll edit the article when I get a chance.

mdwrigh2 · on April 29, 2015

Thanks for caring to fix it!

For what it's worth, I agree with pretty much the rest of your post. Too often I see people start to complain about "pre-mature optimization" but when you're trying to do something like hit a smooth 60fps animation then a lot of these things really matter. Profiling is great when you have hotspots, but too often these things are plagued by a death of a thousand cuts.

jblow · on April 29, 2015

This article seems goofy and weird. He spends a LOT of time randomly talking, in order to justify not using a profiler, when profiling is such a simple and easy thing.

I know many high-performance programmers and all of them profile because profiling is how you test your mental model against reality. Yes, as the author says, having a mental model of machine performance is important. But you need to test that against reality or you are guaranteed to be surprised in a big way, eventually.

Example: How does he even know that his div optimization matters? If he is even reading through one pointer in that time, he is probably taking a cache miss on that read, the latency of which is going to completely hide an integer divide. The author seems generally to not understand this, since he spends most of his time talking about instruction counts. Performance on modern processors is mostly determined by memory patterns, and you can have all kinds of extra instructions in there and they mostly don't matter.

Which this guy would know if he profiled his code.

chadaustin · on April 29, 2015

Hi Jon. I'm certainly familiar with caches and memory optimizations. I also know when I'm compute-bound (as in, the prefetcher is running ahead).

Sorry if I wasn't clear - I love profilers! CodeAnalyst in particular is my go-to choice for "quick, I need a sample histogram across my functions".

You're right that an example involving the memory subsystem would have been a good idea.

My two points are:

* It's possible to know something is on the latency critical path (e.g. div is ~20 cycle latency, but you run ~2 in parallel) without needing a profiler. Just look at the data flow through your algorithm.

* When you begin an application, you should know your performance goals and approximately how you plan to hit them. If you end up building an application where you round-trip to the network six times to build your UI, you've just limited your best possible load time in Australia to over a second.

That's all. :)

p.s. I've never used that div optimization, though I think it's interesting.

sqeaky · on April 29, 2015

How do you know you are right if you didn't measure?

I have done things "knowing" what the outcome would be only to be surprised, and I never would have known if I hadn't measured.

chadaustin · on April 29, 2015

You have to measure at some point to build up your mental model. I run experiments all the time.

In the specific example of buffer-builder, I have already built up a (reasonably accurate) mental model of modern CPUs, and I knew what I wanted the generated code to look like.

Once I made the generated code look like I wanted, then I was not surprised to find that it outperformed existing libraries by 5x. :)

I suspect the alternative approach, "profile the existing libraries and optimize hot spots" would have taken a lot more time.

sqeaky · on April 29, 2015

With results that good it is surprising that you didn't document your benchmark procedures and include them in your article.

I too feel comfortable working with modern CPUs but after performance sensitive projects I benchmark and/or profile to identify what I didn't know. How else can you learn (after listening to all the experts and reading all the documentation)?

As for your feeling it would have taken longer with the "alternative approach", I must again ask for numbers. How do you know which approach would take longer without taking measurements on it? Is that with you taking that approach or an expert with that approach taking it. Are you an expert in that approach, yet humbly avoided stating so in the blog post?

I don't really see them as alternatives. Using all the knowledge you have up front is simply a good design strategy, but once that knowledge is exhausted you can get more through testing empirically.

implicit · on April 29, 2015

It's more about knowing your constraints ahead of time and working within them.

If you're doing audio mixing, for instance, you probably have a thread that has to respond with samples within 1ms. Missing the window means catastrophic audio glitches. (it sounds terrible)

It's a mistake to write this in Ruby and expect the profiler to tell you something you don't already know.

kllrnohj · on April 29, 2015

> in order to justify not using a profiler, when profiling is such a simple and easy thing.

My day job is basically performance optimization. It is not simple nor is it easy to use a profiler. In fact, profilers often don't tell you anything useful at all. They'll tell you sort of where you're spending your time, but they don't help at all in telling you why you're spending your time there.

Most of the major performance wins I find are from optimizing the architecture not from optimizing a hot loop. I can't even think of the last time I've seen any measureable difference from optimizing a hot loop. Heck, I can't even remember the last time I found a hot loop.

Then again I work on things more like game engines and web browsers. Huge systems with tens of thousands of lines of code on the hot path. Profilers don't help all that much here, and I can optimize in advance in less time than it'll take to setup the profiler and "justify" the change.

mkesper · on April 29, 2015

TLDR: * To hit your performance goals, you first need to define your goals. Consider what you’re trying to accomplish.

* While throughput numbers increase over time, latency has only inched downwards. Thus, on most typical programs, you’re likely to find yourself latency-bound before being throughput-bound.

* A profiler is not needed to achieve the desired performance characteristics. An understanding of the problem, an understanding of the constraints, and careful attention to the generated code is all you need.

BruceM · on April 29, 2015

Since y'all killed his blog again:

http://webcache.googleusercontent.com/search?q=cache:wjD83Ex...

M8 · on April 29, 2015

It's not HN's fault, the blog is not performant :).

angersock · on April 29, 2015

Should've picked a webscale langauge like Perl, Ruby, or Javascript.

chadaustin · on April 29, 2015

Thanks Bruce. :) Seems my single small VPS is no longer enough to keep up with modern HN traffic...

mwcampbell · on April 29, 2015

Have you set up a WordPress caching plugin? Maybe it's time to switch to a static site builder.

chadaustin · on April 29, 2015

Hi Matt! Figures this thread would have everyone I know. :)

Yeah, I'm running WP Super Cache, which means the site is still up.

Do you have a recommended static site builder?

mwcampbell · on April 29, 2015

I use Pelican (http://getpelican.com/).

alain94040 · on April 29, 2015

And then we got to "you have to run a profiler to know it matters." I contend it’s possible to use your eyes and brain to see that a div is on the critical path in an inner loop without fancy tools. You just have to have a rough sense of operation cost and how out-of-order CPUs work

I'd be convinced only if you showed a benchmark with and without the trick. I still suspect it doesn't matter in the end. But the only we'd know is if the author ran a benchmark. Which he refuses to do, because he is so sure of himself.

smitherfield · on April 29, 2015

Interesting, but I have to wonder what the point of his project to reproduce the performance of C++ with Haskell is if he's doing things like replacing

  (i+1)%3

with

  (1<<i)&3 for i in [0, 2]

It seems to me like it would be far more human-readable and human-maintainable to simply write it in C++ in the first place.

chadaustin · on April 29, 2015

Haskell and that particular div -> shift trick were two unrelated examples.

The Haskell one is kind of a long story. In short, we love Haskell but a particular inner loop was destroying our performance. We were getting PHP-level performance in Haskell, where normally we can expect Java-level performance. So we took this inner loop (JSON encoding and URL encoding) and built BufferBuilder to solve it once and for all. Now JSON encoding and URL encoding are barely visible in the timings.

The div -> shift trick is worth knowing about, though I haven't had a reason to use it yet.

falcolas · on April 29, 2015

Any time you cross the FFI, you start losing things. Specific to Haskell, if you hit the FFI into C++, you have to go through two translation layers (Haskell to C, C to C++) in each direction, losing the ability to quickly identify identify code paths. Not to mention many debuggers and profilers tend to shrug their shoulders in ignorance when the FFI is crossed.

If a quick obfuscation in the name of performance can be explained away with a single comment, there's no need to jump over the FFI.

GFK_of_xmaspast · on April 29, 2015

I don't know haskell, but I would have expected the c++ end to have been compiled with C linkage, why do you have to have the second jump?

falcolas · on April 29, 2015

It would depend entirely upon whether you wrote all of the C++ code and thus provided yourself with C wrappers inside an `extern c` call. In which case, yes, it would be simpler (though still with an interpreter layer going from C to C++).

However, if you use an external library, or you are interfacing with C++ overloaded methods or classes, you have to write an interface inside the `extern c` wrapper which handles calling out to C++ code.

Of course, there is always the option of running something like swig to create the C bindings for you.

chadaustin · on April 29, 2015

I think falcolas is saying that, since the GHC FFI only supports C functions, if you wanted to expose a C++ API (classes et al) you'd need to have a C trampoline call into C++. But, then again, the compiler might inline the C++ call into the exported C function. Hard to say without trying and looking.

implicit · on April 29, 2015

I worked with Chad to write BufferBuilder.

We conspired our use of C++ to exclude the C++ standard library. It's basically C with stricter pointer conversion rules.

mwcampbell · on April 29, 2015

Another insightful post from Chad on performance is this one:

http://chadaustin.me/2009/02/logic-vs-array-processing/

When I read this post, I suddenly realized why one's chosen programming language (or more precisely, the compilation approach and runtime environment) has such an impact on application startup time, though he only briefly touched on that. Think of the logic required to locate and load each module or class as it's first needed (e.g. Python or a typical Java or .NET runtime) versus just mapping the executable into memory and jumping to main (AOT-compiled native code, best if statically linked). Good luck if your application uses the former approach and it typically starts when a user's computer starts up, on a computer with a spinning disk.

jconley · on April 29, 2015

Modern CPU's are so complex I'm not convinced that one can reason about the performance implications of micro optimizations such as the div trick. Perhaps the 0.001% of developers that specialize in, and understand how said CPU's work might be able to, but for the rest of us, we have profilers.

There are so many layers of abstraction and so much to understand in modern computing that it is a Bad Idea(tm) to tell engineers to not profile.