Measure code execution time accurately in Python

hinkley · on Jan 18, 2020

One of my special skills as a young developer was intuiting that the perf tool timings for small methods were highly inaccurate (at the time I characterized this as lying), and in a GC’d language that inaccuracy can be quite amplified.

I found a lot of performance issues that others missed, and there were a number of quite notable cases where I had rather extraordinary results. Like 20% from tweaking (not even deleting) a method measured as taking 5% of run time. Not the weirdest I every did, but a solid number of top ten entries from this category.

How did I find them? Sorting by invocation count instead of run time. You can blindly jump into any of functions and get some useful results, but a more sophisticated approach is to think first about expected versus actual results. You think to yourself, “I just ran this task ten times. It has this much data to deal with. How many calls to this particular function should I have expected? 250? Then why was it called 759 times?”

What you find next will be architectural problems, or problems of data flow through the system (while some, including me, would argue that I just repeated myself, not everyone accepts that).

The fastest methods are the ones you never call, and I’m surprised at how frequently people ignore call count even though it’s often right there in the results. Maybe because it’s often the last value displayed, as if to say “This is barely worth mentioning.”

zimbatm · on Jan 18, 2020

That's a great advice.

Another one is to only look at memory allocations. Usually when there is a lot of allocations, a lot of data is being handled. This gives a bit of a different perspective where you start thinking of folding chunks of data together.

The nice thing about those two approaches is that they don't vary between CPU and architectures.

hinkley · on Jan 18, 2020

Certainly. When a quick function allocates lots of memory, the slowdown gets blamed on the functions that follow. In purely sequential code this leaves some clues behind. With event-based or concurrent code these problems can be more elusive.

Cache invalidation can follow a similar misattribution pattern.

When you do succeed in making an improvement, the % of time for everything else should increase. 10% of 90% is 11%. Look back at the old results and see if anything else went up disproportionately, and see if you can understand why.

Related to that notion, and as a parting thought, since this is getting some love: game and A/V programmers understand something the rest of us do not. And that is a computation/time budget. It doesn’t matter if one thing takes four times as long as the other if they both take too long. You haven’t fixed “the problem” until you’ve fixed them all. Late is late.

There’s another “systematic” approach to making very large perf improvements that doesn’t fix problems from most to least significant but by functional unit. Fix all of the problems in that functional unit even if it’s 1% of the total (if you’re after a 5x speed gain it’s actually 5% of the total budget). If you make large scale changes you are going to have to do some expensive regression testing on this part of the code, tests or no tests. Once you’ve paid that price once, you will never go back for that last 2% because you can not justify it to your organization. You’re stuck with those in ten different functional areas, forever. What you can justify is boosting a raft of work from 18% to a 20% overall improvement.

chubot · on Jan 19, 2020

The uftrace [1] tool is gold for counting function calls (and more), and I don't think it's that well known. It's relatively new.

I used 4 or 5 performance tools to look at the performance of Oil shell, and uftrace was one of the most important ones.

I hope to write a blog post about it soon. The overall results were very good -- Oil is now faster than bash despite doing more parsing by design -- though the biggest win was translating the code to C++ from Python :)

So it's unusual, but after translation, uftrace was a winner, and I used it largely by counting function calls and allocations, not measuring timing. Although you can use it in many different ways.

Oil's Parser is 160x to 200x Faster Than It Was 2 Years Ago

http://www.oilshell.org/blog/2020/01/parser-benchmarks.html

Another lesson is that aside from the big difference in execution speed, tools for measuring C++ performance are just way better than those for Python. Debugging tools are also better. Of course I still avoid C++ for most problems because of build times, headers, memory management, library style smorgasbord, etc.

[1] https://github.com/namhyung/uftrace

magicalhippo · on Jan 19, 2020

Another quick check I've used is to take a function I suspect might need optimization and make it slower, and see how that affects run-time. If it hardly affects the run-time it's probably not worth spending time on.

Not perfect by any means but usually a nice check to verify optimization potential.

jaybosamiya · on Jan 19, 2020

I haven't had a chance to mess around with it much, but the coz-profiler [1] uses basically the same idea, but the other way around (i.e., slow down everything else to get a "virtual speedup") which helps quantify what the effect of optimizing some code would be.

I learnt about this tool from Emery Berger's talk [2] on this (at strangeloop), which I highly recommend. Lots of really nice insight, even outside of this tool.

[1] https://github.com/plasma-umass/coz [2] https://www.youtube.com/watch?v=r-TLSBdHe1A

sitkack · on Jan 19, 2020

Came to say the same thing, this Berger talk is phenomenal. I have watched it twice, planning on at least a third.

draw_down · on Jan 18, 2020

Great post, thanks.

oefrha · on Jan 18, 2020

No mention of PSL timeit? https://docs.python.org/3/library/timeit.html

Also, quote from the timeit page:

> Note: It’s tempting to calculate mean and standard deviation from the result vector and report these. However, this is not very useful. In a typical case, the lowest value gives a lower bound for how fast your machine can run the given code snippet; higher values in the result vector are typically not caused by variability in Python’s speed, but by other processes interfering with your timing accuracy. So the min() of the result is probably the only number you should be interested in. After that, you should look at the entire vector and apply common sense rather than statistics.

hyperpape · on Jan 18, 2020

It's technically true that the lowest value will approach the lower bound of how fast your machine can run the code, but there's no guarantee that number is valuable. If there is any significant non-determinism in the execution, the minimum time will be a value that you'll only expect to reach in a fraction of executions, even if there are no other processes contending for resources.

The only real solution for avoiding noise is to control its sources.

Reference here: https://tratt.net/laurie/blog/entries/minimum_times_tend_to_....

oefrha · on Jan 18, 2020

If there's significant nondeterminism in the execution, the method discussed in TFA is also worthless. It's pretty clear determinism is the basic assumption of this discussion; no one's arguing that you should use the minimum execution time among random cases as the performance characteristics of timsort, for instance.

Of course, caching of deterministic code complicates matters too.

hyperpape · on Jan 18, 2020

I would have to spend a long time studying the relevant Python implementation to think I knew whether there was nondeterminism present in a given benchmark. That's a tall order.

oefrha · on Jan 18, 2020

Now that I think about it, you’re right that randomized hashing in Python itself could cause nondeterminism in seemingly deterministic code. However, timeit does mitigate that to some extent by repeating the same code a large number of times — 1 million times by default, and it’s better used that way. Of course, depending on the actual code, it could be the case that all 1M invocations are skewed by the specific choice of PYTHONHASHSEED; it’s not something that you can solve in-process, so there’s no panacea, you must ditch the idea of getting a single stable number in that case.

mattkrause · on Jan 18, 2020

I'm not sure that there's ever a turnkey method that will save you from needing to do a careful experiment and a careful analysis of the results.

__s · on Jan 18, 2020

Victor Stinner (python dev) wrote a good series on trying to get stable benchmarks in python:

https://vstinner.github.io/journey-to-stable-benchmark-syste... https://vstinner.github.io/journey-to-stable-benchmark-deadc... https://vstinner.github.io/journey-to-stable-benchmark-avera...

poooogles · on Jan 18, 2020

Eurgh for starters using `time.time()` is a bad idea because it calls gettimeofday which is susceptible to clock changes. `time.perf_counter` is what you want to use as it can never go backwards and it will always be from the highest granularity clock.

gmrple · on Jan 18, 2020

Next, if you really want to time something fast you should probably know about cache. In a modern computer there are levels of memory. For now we’ll say we have 3 levels: cache, ram, and a hard drive. Cache is small but very fast, it is the closest level to the CPU. RAM is bigger but slower and the hard drive is much much slower. Because the cache is small the computer has to pick what’s in it. At program load the cache will be full of information related to whatever was running before you. After you run you will have left some information in the cache which will not need to be fetched from slower levels of memory. Unless your usecase is to run your test code repeatedly very quickly, you should be wary of timing multiple calls in fast sequence as the later calls will execute faster due to the cache being primed to execute your code.

hyperpape · on Jan 18, 2020

Great point. In addition to caching, there's potential for branch prediction effects.

This is why the JMH samples warn against including loops[0] of the operations you're trying to measure, and have an API that's designed to avoid the programmer having to every write their own looping code.

[0] https://hg.openjdk.java.net/code-tools/jmh/file/7bc6011260aa...

stygiansonic · on Jan 18, 2020

Great point about using JMH. As of JDK 12, it’s included in the SDK: https://openjdk.java.net/jeps/230

What’s the equivalent of JMH for Python?

hyperpape · on Jan 18, 2020

I'm afraid I don't know, I really don't have any particular knowledge of performance engineering in Python.

pvg · on Jan 19, 2020

I'm not sure the JMH is meaningfully included in the JDK for others to use. The JDK happens to come with some JMH-based benchmarks in the source tree, that seems to be about it.

bknasmueller · on Jan 18, 2020

You are completely right, thank you - perf_counter is an even better way to measure the passed time per measurement. I have corrected the usage in the code snippet.

umvi · on Jan 18, 2020

Isn't there a performance penalty when measuring with perf_counter for anything other than really short snippets of code?

pjc50 · on Jan 18, 2020

Don't think so; at least on Windows it uses the optimal QueryPerformanceCounter API. https://github.com/python/cpython/blob/master/Python/pytime....

cscheid · on Jan 18, 2020

I don't know how else to put it, but the technique suggested in this paper is irreparably incorrect in the setting that it recommends. Attempting to fix a linear least squares fit by removing points with high residuals does not work. Please don't do that: they won't necessarily be the outliers in the dataset and your model will converge to the wrong thing.

Statistics is hard. Like, really really hard. Stuff goes wrong all the time. Please leave it to the experts.

If you're going to do this, please look up the methods behind (for example) robust least squares, outlier detection, L_1 regression, etc. The right way to do this is to start with a small dataset that is with very high probability free of outliers, and slowly grow it by never adding points which have large residuals. (If you've done 3D scan registration and image alignment, this is what RANSAC does.)

The principle is, intuitively, that once an out-of-distribution point gets into your linear model, the model is poisoned forever. You can't trust the model to tell you that the bad points are the outliers. The way this paper does is is irreparably broken, sorry :/

gdxhyrd · on Jan 18, 2020

> Please don't do that: they won't necessarily be the outliers in the dataset and your model will converge to the wrong thing.

They are the outliers. These experiments are typically measuring something that is effectively constant with almost zero noise, rather than some complex physical phenomena.

If you don't get an almost perfect fit, there is something going on that invalidates what you are doing (e.g. cache effects, clock effects, etc.).

In fact, if there are any outliers, I would not trust the benchmark at all. So removing them seems like trying to fix a bad benchmark with statistics.

> Statistics is hard. Like, really really hard. Stuff goes wrong all the time. Please leave it to the experts.

That is unnecessary gatekeeping. Benchmarking in CS is hard not because the maths/stats that are needed are hard, but because setting up the right experiment is hard and most people don't know all the pitfalls.

Therefore, if anything, you should leave benchmarks to CS/SE experts, rather than a statistician!

colmmacc · on Jan 18, 2020

There's another reason too; there are strong reasons why the timing of successive code executions should not be linear. Typically the first call to a function will incur the cost of fetching data into a cache. At minimum the basic blocks for the function will be loaded through the l1/l2/l3 pipelines.

2, 3, 4 function calls in a row will benefit from that caching, but also help train the speculative execution engine. At some point that speculation may radically alter performance. Later, with a sufficient number of runs, the timing will also suffer from preemption, context-switches, or other interruptions that can invalidate both the cache and the speculative execution engine.

The idea that there is a fixed performance time for a function is naive on modern hardware.

cscheid · on Jan 18, 2020

You make a good point I hadn't considered. In that case, not even robust least squares will save you, because the samples are not independent of one another. You'd have to start looking into explicit time-dependent methods and yikes, I have no idea of what the literature says about time dependence and outliers. I wouldn't be surprised if it's "here be dragons" territory.

gdxhyrd · on Jan 18, 2020

You are speaking as a stats person. :)

While there are dozens of things that can go south in benchmarking, how to solve them is not about applying advanced stats, but about understanding and eliminating the sources of noise.

Most benchmarks in CS can be done quite easily as long as one understands all the technology stack.

cscheid · on Jan 18, 2020

I'm not a stats person! But I have been burned in real projects by doing the wrong thing. I just have learned --- the hard way --- to honestly appreciate it.

colmmacc · on Jan 18, 2020

Speculative execution optimizations, e.g. branch prediction, are even data dependent. It's a very complex measurement space!

jwkane · on Jan 18, 2020

"The idea that there is a fixed performance time for a function is naive on modern hardware." This is exactly it, though it can be further compounded by runtime optimization in software.

reubenmorais · on Jan 18, 2020

For the specific problem of benchmarking small code snippets, how do you create the initial small dataset?

cscheid · on Jan 18, 2020

Thank you for pointing out some bad phrasing on my part.

When I said "small dataset", I should have said "small subset of all collected points that are initially fed to the model". The issue isn't that you have to collect the data little by little. The issue is that, once you've given a linear model an input that is outside the distribution which you're hoping to model, nothing about the model can be trusted.

So you collect all the data points, but you don't give them all to the model at once. (I'm describing the RANSAC method here now) You start with a large number of "candidate models" that are all fit with a small number of input points, and then test which of the candidate models predict well the points you have not yet given the model. Then you feed the best of these candidate models only the points which it predicts well, and create a more refined, still outlier-free model. This can be proven to work in the presence of a small number of out-of-distribution points.

derefr · on Jan 18, 2020

Very interesting. It sounds to me like this approach exists outside any kind of Bayesian learning framework; i.e. an individual Bayesian agent couldn’t be expected to converge to a correct model here. Is that true?

I would hazard that the RANSAC method sounds a lot like what you’d get out of a larger hybrid model, where e.g. many Bayesian agents with different priors are bred under a genetic algorithm after being ranked by their predictive power. (Like humans surviving to reproduce and pass down models to their children!)

cscheid · on Jan 18, 2020

As far as i understand it the RANSAC arguments are very much not Bayesian (they're all about "if you repeat this an infinite number of times under an infinite number of new samples, then...", which is almost caricaturely frequentist).

Still, you just gave me reason to mention one of my favorite "no, that won't work either" paper :) on how Bayes will not save you in the presence of model misspecification. Instead of butchering it any further, I'll just point you to this piece explaining the work, written by the author of the paper himself: http://bactra.org/weblog/601.html

pletnes · on Jan 18, 2020

It’s good to know about turbo boost when measuring cpu bound code. When the cpu gets hot, the clock drops, sometimes by a large factor. When it’s cooler and the OS wants to run code, the clock goes back up. This can be disabled in BIOS to get a more stable result. You can monitor the cpu clock in Resource Monitor on windows 10, to see how big a deal the issue is. It varies by cpu model, temperature, cooling of your particular machine etc.

hinkley · on Jan 18, 2020

This is one of the things that people are thinking about when they ask you to rearrange your benchmarks and run them again. Sometimes the first or the last set run is the fastest, depending on how the confounding factors interact.

Extraordinary claims require extraordinary proof, and people can be rightly insistent.

kqr · on Jan 18, 2020

...but obviously only disable it if it is disabled in the environment which you want to predict. Don't try extrapolating from an environment in which it is disabled to one in which it is not, and vice versa.

pletnes · on Jan 19, 2020

Not really true. If the particular computer you’re running the code on is warm due to weather conditions, which affects clock frequency, it doesn’t tell you anything about the performance of your code.

gdxhyrd · on Jan 18, 2020

It depends on what you are trying to do.

Most of these benchmarks are not done to predict performance in a target machine, but to find out which is faster.

spankalee · on Jan 18, 2020

Isn't figuring out the slope a red herring? The total run time is:

    n * timing_overhead + n(n+1)/2 * runtime

Seems like you'd do better by measuring n^2 runs and dividing by n^2. But that's not very good because it's not taking into account the observed variance. What you really want to know it something like the 95% confidence interval after observing n runs.

My teammate has been working on a library to do just this for JavaScript, with added automation for running a set of benchmarks until some set of thresholds, like A is 95% likely faster than B.

https://www.npmjs.com/package/tachometer

kqr · on Jan 18, 2020

> What you really want to know it something like the 95% confidence interval after observing n runs.

Great in theory, but almost always relies on a normality assumption which just doesn't hold in general for performance measurements.

How well does this package handle pareto distributed run times?

mattkrause · on Jan 18, 2020

Though this particular package doesn't, you can certainly derive confidence (or better, credible) intervals for non-normally distributed data.

longemen3000 · on Jan 18, 2020

https://arxiv.org/abs/1608.04295 this paper is the statistical basis of time measuring in Julia. It is, at least a good read when trying to measure the execution of some program seriously at low level. It uses also a predictor of the estimate number of evaluations via a logistic function

cbsks · on Jan 18, 2020

Off topic, but why is there a pop up on the site that says “This website uses cookies to ensure you get the best experience on our website”? What is the “best experience” that I am missing out on by refusing to allow cookies? I didn’t think cookies on a blog are useful to anyone except to the website owner for tracking, but maybe I am missing something?

userbinator · on Jan 18, 2020

One thing I've learned over the years is that "best experience" is an euphemism for "what we want to force down your throat".

acdha · on Jan 18, 2020

You can open the developer tools in your browser to see what cookies are set. My guess is that this could be something remembering settings but it's most likely saying that the Google Ads will be more targeted to things you want.

rcthompson · on Jan 18, 2020

So the "algorithm" is just... running the code different numbers of times and then using least squares linear regression? Or am I missing something?

speedplane · on Jan 19, 2020

I used to develop in C and C++, back then the software ran on a single machine, and the most common performance issues we faced were unintentionally introducing some O(N^2) complexity, usually within a deep call stack.

Today I develop mostly in Python on a web backend system. Now, the most common performance issues I see are making multiple network calls when one would suffice (e.g., not caching repeated database queries). I have no doubt that there are many hidden O(N^2) issues lurking in the python code, but because network calls are so expensive, they seem to dominate optimization efforts.

I suspect that I'm not alone here, as today, there are many I/O performance tools than there were in the past.

jniedrauer · on Jan 19, 2020

I haven't used Python full time in a few years, but it seems like this has been a solved problem for a while with timeit[0]. Am I missing something?

[0]: https://docs.python.org/3/library/timeit.html

orf · on Jan 18, 2020

Just use the timeit module, or the %timeit iPython magic command.

xurukefi · on Jan 18, 2020

You could also do the following to compute the overhead of the timing function:

    time_start = time.perf_counter()
    time_end = time.perf_counter()
    execution_time = time_end - time_start

fleur-de-lotus · on Jan 19, 2020

I initially thought it was an entry for a Rube-Goldberg contest.

avremel · on Jan 19, 2020

That is exactly what the article negates.

luord · on Jan 19, 2020

We made measurements this way in my physics I lab classes. Guess there's no reason why it wouldn't apply to software.

cmonnow · on Jan 19, 2020

how about a simple average ?

cmonnow · on Jan 19, 2020

how about a simple average?

mastrsushi · on Jan 18, 2020

20 minutes

rasz · on Jan 19, 2020

Its python! The moment you need to measure precise execution time of functions is the moment you should be looking at rewrite to C, C++, Rust, maybe even Java.