Four Kinds of Optimisation (tratt.net)
120 points by ltratt on Nov 14, 2023 | 56 comments



> it's difficult to measure the indirect impact of things like memory locality — I have heard such factors blamed for poor performance much more often than I have seen such factors proven as responsible for poor performance

This particular assertion does not seem to be well-founded. The importance of spatial and temporal locality to software performance on modern hardware is a singular obsession of data-intensive software architectures. Nor is it difficult to measure: these are some of the highest-impact optimizations you can make, assuming the code isn't naive. There are perf counters and such, but the majority of poor locality is visible via simple code inspection. You don't need perf counters until you are doing serious performance engineering.

Locality optimization tends to be architectural in nature. If you do not design your software to have good locality characteristics then it will be complex to fix later. An argument can be made that it isn't worth the cost to fix software designs with poor locality after the fact, but the centrality of locality to software optimization is not controversial.


The first memorable locality problem I ever solved: I had clocked a function at 10% of overall CPU time. The more I looked at it, the more I realized we were calling it way too often. It wasn’t allocating a lot of memory; it was just crunching a bit of data.

A bit of jiggling of the call structure above it and it goes from 2 invocations per call to one. So almost exactly half as many calls overall. I ran my end to end benchmark again, expecting 4-6% improvement in run time. I got 20%. Twice what I would have expected to get by commenting the function out entirely.

A few years later, similar situation. Operation taking about 5 times as long as we were allowed. Function A gets a list from the database. Function B gets the same list. I don’t recall what percent of the overall time this was, but it was a tall tent pole and obvious waste. Something like 30%. As a first course for a longer optimization engagement, I changed the method signatures to take the list and return a result, and looked up the list once in the common parent function. Best case I figured a 20% reduction, which is still slow enough I can use the stopwatch function on my phone to estimate the improvement at the UI. I didn’t get 20%, I got 90%. So I went from 5x over the target to half, in one change. Unfortunately even this result didn’t convince some people they were barking up the wrong tree with respect to performance problems.

Right about the time of the first experience I had been learning an ugly truth: profilers lie. They aren’t the floodlight people think they are. They’re a book of matches in the dark. They reveal part of the picture, and cast as many shadows as they do light. You have to fill in a lot of gaps yourself, and that takes mechanical sympathy and more than a little stubbornness. Hardware counters narrow that gap a lot, but they still aren’t perfect, and a lot of languages seem not to use them.


Performance (Really) Matters by Emery Berger talks a lot about layout. https://www.youtube.com/watch?v=7g1Acy5eGbE

And on the topic of Coordinated Omission, "How NOT to Measure Latency" by Gil Tene: https://www.youtube.com/watch?v=lJ8ydIuPFeU

Injecting slowness into an application to measure the relative impact of that portion of the application has really worked well for me.
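A minimal sketch of that idea in Python, assuming nothing about the real codebase (inject_slowness and parse_record are made up for illustration): slow one code path down by a known amount and see how much the end-to-end number actually moves.

    import functools
    import time

    def inject_slowness(delay_s):
        """Wrap a function so every call costs an extra delay_s seconds."""
        def decorator(fn):
            @functools.wraps(fn)
            def wrapper(*args, **kwargs):
                time.sleep(delay_s)          # the injected slowness
                return fn(*args, **kwargs)
            return wrapper
        return decorator

    # Hypothetical hot path: add 1 ms per call, re-run the end-to-end
    # benchmark, and compare. If the total runtime barely changes, this
    # path matters less than the profile suggested.
    @inject_slowness(0.001)
    def parse_record(line):
        return line.split(",")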


Once I’ve done the dead obvious bits I start looking at invocation counts, on two fronts.

Not enough is said about running a subsystem a deterministic number of times and then evaluating the call counts. Invariably there is some method that is being called 2x as often as the requirements and architecture dictate (scenario one above). Someone just goofed, forgot they already had an answer, or part of one.

And then there’s the “profilers lie” part. If you have two functions that take 4.5% of overall time, and one is called 25k times, while the other is called 250k times, the chances of bookkeeping errors in the latter are much higher. It could be 4%, or it could be 10%. When in doubt, sort by invocation count and try another optimization pass.

I’ve also seen a variation on this in GC languages - one periodic large allocation gets stuck doing a full GC almost every time, because a dozen little methods are churning out garbage at a tempo that keeps missing the GC threshold because of the periodic allocation, which keeps raising the heap size toward whatever max you’ve set. To an extent cache smashing does the same thing. Little invalidations make problems for bigger methods.


In situations like this, off-heap collections or an array of primitive values can reduce GC churn and pressure.
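A rough Python analogue of that trick, if it helps make it concrete (numbers are machine-dependent): a list of boxed ints creates one heap object per element, while array (or a numpy array) keeps one contiguous buffer of machine integers, which is far kinder to the allocator and the GC.

    import array
    import sys

    n = 1_000_000

    boxed = list(range(n))                 # one Python int object per element
    unboxed = array.array("q", range(n))   # one contiguous buffer of C longs

    # Very rough footprints: pointer array + per-element objects vs. one buffer.
    print(sys.getsizeof(boxed) + n * sys.getsizeof(12345))
    print(sys.getsizeof(unboxed))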


> Coordinated Omission

Now I have something new to worry about.

Or at least a new name for a nameless terror.

Humans are impatient and hate bad news, which adds to these things. I filed a PR recently against a small benchmarking library because I discovered when I cranked up the number of iterations that it was running out of memory.

They were going for simplicity, and doing too much processing along the way complicates things like our friend cache locality, but also branch prediction and thermal throttling.

But they weren’t consolidating between runs either and the sequence diagram for callbacks prevented you from manual cleanup.

Code is hard if you really think about it. Hard as you want it to be (or often, harder).

For injecting complexity, there have been times where I add a for loop to an inner scope to multiply the number of calls and see if the profiling data follows the expected curve. Plenty of surprises to find there.
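Something like this minimal sketch, in Python terms (hot_function and workload are stand-ins): multiply the inner call count and check whether the profiler's attribution scales the way it should.

    import cProfile

    def hot_function(x):
        return sum(i * i for i in range(x))

    def workload(multiplier=1):
        total = 0
        for _ in range(10_000):
            # Temporarily multiply the call count; if the profiler's
            # bookkeeping is honest, hot_function's share should scale
            # roughly linearly with the multiplier.
            for _ in range(multiplier):
                total += hot_function(100)
        return total

    cProfile.run("workload(1)")
    cProfile.run("workload(4)")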


Well, it's a lot of work to prove to someone who doesn't do this sort of thing that they have this sort of problem. And it's a lot less work to say that there is this sort of problem. So he sounds right to me.


A nice easy one is to benchmark trivial examples for people, for example summing an array of numbers in JS. Have examples going from four levels of indirection between objects down to one level, then straight numbers in a standard JS array, and the same in a TypedArray. This pretty clearly demonstrates the gains that can be had.
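The same kind of demo can be sketched in Python if that's handier for your audience (Box and Leaf are made-up wrapper classes, numpy is assumed installed; absolute numbers will differ a lot from JS, but the shape of the result is the same):

    import random
    import timeit

    import numpy as np

    class Leaf:
        __slots__ = ("value",)
        def __init__(self, value):
            self.value = value

    class Box:
        __slots__ = ("inner",)
        def __init__(self, inner):
            self.inner = inner

    values = [random.random() for _ in range(100_000)]
    nested = [Box(Leaf(v)) for v in values]   # two levels of indirection
    flat = np.array(values)                   # contiguous doubles

    print(timeit.timeit(lambda: sum(b.inner.value for b in nested), number=100))
    print(timeit.timeit(lambda: sum(values), number=100))
    print(timeit.timeit(lambda: flat.sum(), number=100))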


Most code is naive. FWIW, in my experience most people who just look around for poor code locality focus on things that don't actually matter, while the most expensive misses (when you actually profile and decide you care about them) end up in surprising places, or they were just missed in someone's quest to reorder all the structs they see in the codebase.


It’s memory locality, not code locality (do you mean the instruction cache, or something else?), that matters. The problem with a naive solution that has poor memory locality is that fixing it is more akin to a rewrite in a faster language. The performance gap left on the table can be significant, but it need not show up in profiles because everything is uniformly slow.


That is very often the case, yes.


Anything that can't be objectively measured will inherently be controversial. You claim you can see poor locality via code inspection; will other people who inspect the same code reach the same conclusions? Why should we believe you rather than anyone else?


Agreed with the general point, but it doesn't apply here. Memory locality can be objectively measured (e.g. with last level cache miss counters), and parent comment is correct besides -- it's usually plain to see in the code.

There are mysterious boogiemen in performance optimization, but this isn't really one of them.


I am happy that (good) science does not take an "it's obvious" claim as sufficient, and instead focuses on proving things with objective facts.

I am not saying these cannot be plain to see in the code, but the best standard IMO is still to measure before and after the optimization. IMHO, again, you can skip that step, but then other people might rightfully ask you what proof you have that the optimization is faster (I would).


Some of these architectural decisions benefit from being made early. That doesn’t mean you can’t objectively defend them but it does mean the measure, optimize, measure loop you might use on a mature codebase to optimize a hot path can’t be the only standard.


It's a matter of economy-- if you spent all day measuring obvious things you'd never get anything done.

Clearly which things you choose to measure should be a function of how certain you are about them, and how much you stand to lose if you get them wrong.


You need perf measurements to understand the impact. It sounds like you're saying that it's a massive time sink to fix the problems, so it had better have a measurable impact to understand the ROI.

I've definitely seen careful memory management make a huge difference in custom RNN code, so don't doubt it can make a difference, but I expect it matters mainly when it's part of a very tight loop.


Unlike some other optimizations, the effects of locality optimization are usually not local. The reduction in cache and bandwidth pressure makes everything run better. These are common optimizations in e.g. databases, even shaving a few bytes off of key data structures can result in system-wide gains of 5-10% performance just from the cache pressure reduction resulting from a seemingly innocuous reduction in data size. You don’t need to measure it to know it will have impact, the relationship between reducing cache pressure and performance is pretty direct and repeatable. Spatial locality is straightforward, temporal locality is a bigger engineering commitment.

A canonical example of locality optimization is thread-per-core architectures versus classic multi-threading architectures. By random luck, I was around when thread-per-core architectures were first developed to improve efficiency on many-core processors (on supercomputers). Instant 10x performance improvement on the same hardware. By comparison, aggressive vectorization with AVX-512 might buy you 3x improvement. Locality optimizations are the highest leverage optimization we can do today. It does require disciplined software architecture.

There are limits to locality optimization in real systems. If you are doing a good job of it, you frequently run out of memory bandwidth at some point. That’s a pretty good outcome.


It's a good post, but I'd add that there is also a different kind of optimization, where you optimize by minimizing the overall cognitive load of any given piece of code.

For example, it's important to develop a feel for data structures. Often, if you choose the right data structure, the algorithm is fairly easy. In other words, when faced with a choice of complex algorithm vs complex data structures, it's generally easier to work with a complex data structure and use a simpler algorithm.

Data is fairly static, and therefore easier to reason about, vs a complex and "dynamic" algorithm. Optimization for ease of reading and maintenance is also really important.
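A tiny, made-up illustration of that trade in Python: pick a better structure and the "algorithm" mostly disappears.

    from collections import Counter

    names = ["ada", "grace", "ada", "alan", "grace", "ada"]

    # Awkward algorithm on a plain list: quadratic re-scanning.
    dupes_slow = sorted({n for n in names if names.count(n) > 1})

    # Simple algorithm on a better structure: one pass into a Counter.
    dupes_fast = sorted(n for n, c in Counter(names).items() if c > 1)

    assert dupes_slow == dupes_fast == ["ada", "grace"]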


A great example of this is Norvig’s sudoku solver. Arranging the data the right way makes solving the problem a mere matter of correct bookkeeping.



Yes, great example. I forgot about that one.


He forgot the most powerful optimization of all. Redirecting to /dev/null. Or in other words just not doing it in the first place.

As dumb as it sounds it's really easy to get caught up in optimizing stuff you just flat out don't need to do.


Very true. At work we have "the first and most important thing to optimize is the requirements" as a principle and it works well.


The fastest code is the code that never runs.


> Exactly what constitutes "correct" varies from one situation to another. For example, fast inverse square root approximates multiplicative inverse: for situations such as games, its fast nearly-correct answer is a better trade-off than a slow definitely-correct answer.

Amazingly on topic: the famous constant 0x5f3759df from this algorithm was also thought to be derived from an approximate search algorithm, because you need ~2^31 calls to determine the maximal relative error for a single constant---a big deal back in the 1980s. A better constant, 0x5f375a86, was only found in 2003, and it took a few more years to be proven optimal (Robertson 2012, Moroz et al. 2016).
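For anyone who wants to poke at it, here's a small Python sketch of the trick (not the original C, and it only samples a range of inputs rather than the exhaustive ~2^31-point check mentioned above):

    import struct

    def fast_inv_sqrt(x, magic=0x5F3759DF):
        i = struct.unpack("<I", struct.pack("<f", x))[0]   # float bits as uint32
        i = magic - (i >> 1)                               # the famous bit hack
        y = struct.unpack("<f", struct.pack("<I", i))[0]   # back to a float
        return y * (1.5 - 0.5 * x * y * y)                 # one Newton step

    # Compare the classic constant with the improved 0x5F375A86 over a sample.
    for magic in (0x5F3759DF, 0x5F375A86):
        worst = max(
            abs(fast_inv_sqrt(x, magic) - x ** -0.5) / x ** -0.5
            for x in (v / 256 for v in range(1, 1 << 16))
        )
        print(hex(magic), "max relative error ~", round(worst, 6))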


> For anything but the smallest lists [6], binary search is much quicker than the linear search above.

> In my experience, binary search becomes faster than linear search after about 8-12 elements

It depends on your data, but CPUs are extremely good at scanning through contiguous vectors. Binary searching might make a difference at tens of thousands of elements.

If you’re using Python then it’s different, as nothing is contiguous, or you have a complex equality function, but then a simple improvement is to put the data in a numpy array or dataframe.
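The crossover is easy to measure for your own data; here's a small Python sketch (CPython's constant factors are nothing like C's, so expect the crossover to land somewhere different from either number above):

    import bisect
    import random
    import timeit

    def linear_contains(xs, x):
        for v in xs:                        # contiguous, predictable scan
            if v == x:
                return True
        return False

    def binary_contains(xs, x):
        i = bisect.bisect_left(xs, x)       # xs must be sorted
        return i < len(xs) and xs[i] == x

    for n in (8, 64, 1024, 16384):
        xs = sorted(random.sample(range(10 * n), n))
        probe = xs[n // 2]
        lin = timeit.timeit(lambda: linear_contains(xs, probe), number=2000)
        bis = timeit.timeit(lambda: binary_contains(xs, probe), number=2000)
        print(n, f"linear {lin:.4f}s  binary {bis:.4f}s")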


> or you have a complex equality function

This is part of my thesis that Knuth today is wrong to the point of harm. That we need a new complexity theory built on top of Information Theory. We are steering kids wrong more often than right at this point. Data sets we work with are at least six orders of magnitude larger than they were in the eighties. There are almost no comparisons in Real Data that run in constant time, and for big enough n even addition and multiplication are not constant time. For large enough n, storage accesses are sqrt(n). Even in primary storage.

Imagine you have a list of a million distinct strings. Case sensitive, alphanumeric. Complexity theory tells us you can sort those in nlogn time. But that’s pure fantasy because n unique strings have an average length of logn, meaning the compares all take logn time. Mergesort for any real data type is n (logn)². Now sort a quintillion unique strings. That takes 65-70 exabytes of space, just for starters, so it’s going to involve Ethernet cables and NVME RAID arrays. Your mergesort is now in the neighborhood of n^1.5 (log n)². Your piece of paper says this might run in hours, while the real implementation runs for weeks.

For most purposes, you should multiply every algorithm in TAOCP by logn. In some cases, sqrt(n). And as n actually approaches infinity, multiply by both.

From another angle, I think the fact that it took 54 years to get from mergesort to Timsort has more to do with sacred cows than Tim’s brilliance. I’d like to see benchmarks on Pentium hardware, maybe even 486.

“If I have not seen further it is because giants are stepping on my toes.”


The complexity of sorting is measured in comparisons. If your system can compare in constant time, the complexity is indeed n*log n - if your strings are limited in length, there is a constant upper bound on comparison time and the complexity holds. Saying that the average length of input strings is log n seems to be true for specific data sets and not a general conclusion. Remember that big-O notation is not about calculating actual runtimes, but about growth as the input grows. One of the first things you learn in theory is that you must profile no matter what.

Basic complexity theory is not concerned with access times because it works in operations. If your operation involves slow random access over a network, then that is the basis. There are branches of complexity theory that take those exact examples into account - branches that prove certain time complexities in an arbitrary cache hierarchy, or that only allow streaming algorithms with no lookback. They all explicitly do this in terms of IO operations.

The basic theory is fine, and should be as it is for an introduction. Trying to teach a novice with respect to extremely complex CPUs or broader computer architecture is premature and sure to make many run away screaming. The overarching message should be, and in my experience is, that theory only guides you and profiling is superior.


> The complexity of sorting is done in comparisons.

Yes, and the difference often enough comes down to whether a calculation can be performed in an affordable manner. It may be that on demand is not fast enough and the calculation should be done at update time, or in a queue.

But it also come down to capacity planning. If you’re off by a factor of logn, it gives a bigger lie to the concept of “economies of scale” (which have always been BS in our field) and you can transition a project from making a healthy profit off of a set of features to going bankrupt on it. Each new customer costing more than you can make off of them until the VC money runs out. Who here hasn’t worked on that project?

> Trying to teach a novice with respect to extremely complex CPUs or broader computer architecture

I’m not talking about CPU architecture here at all. I’m talking about Information Theory, which they absolutely should be taught, and Physics, which often enough is a requirement for graduation. At least at schools where CS is lumped into the college of engineering.

I haven’t sorted integers in about a decade, and maybe three times in two decades. I use BI and telemetry tools that sort them, sure, but that’s someone else’s problem and not part of OLTP workflows. For me it’s a toy problem.

Nobody has ever sorted strings in O(nlogn) time. Nobody has ever inserted String->Object mappings into a hash map in O(1) time. Stop lying to intellectuals. They either learn to distrust you or you break them if they don’t.


> I’m not talking about CPU architecture here at all. I’m talking about Information Theory, which they absolutely should be taught, and Physics, which often enough is a requirement for graduation. At least at schools where CS is lumped into the college of engineering.

I would expect any reasonably reputable CS institution to do this as part of their curriculum. It just is not relevant for the theory of Big-O. If you want to use some of that knowledge in practical applications, then it is relevant, but I expect graduates to be able to make that connection by themselves. It is also part of advanced algorithmic theory anyway.

> Nobody has ever sorted strings in O(nlogn) time. Nobody has ever inserted String->Object mappings into a hash map in O(1) time. Stop lying to intellectuals. They either learn to distrust you or you break them if they don’t.

Big-O is not about "sorting in O(n log n) time", it's about the growth of the algorithm to do so as the input grows. It can obviously only hold as long as the assumptions hold, in this case that comparison is constant time. What that constant is, is of no importance for what you use Big-O for. I'd rather teach intellectuals what Big-O is (upper limit on growth when input size increases), rather than trying to get them to use it for something it isn't (measure runtime as input increases). It's in the very definition of the concept, and at least where I got my degree, it was drilled into our head "it's an upper bound, constants do not matter as the input grows sufficiently".


I don't think it's mathematically possible for n distinct strings to have less than O(log n) mean length.
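A quick counting sketch of why, for an alphabet of size \sigma \ge 2:

    \text{strings of length} \le L:\quad \sum_{i=0}^{L} \sigma^{i} < 2\sigma^{L},
    \text{so with } L = \log_{\sigma}(n/4) \text{ at most } n/2 \text{ of the } n \text{ distinct strings fit,}
    \text{hence mean length} \ \ge\ \tfrac{1}{2}\log_{\sigma}(n/4) \ =\ \Omega(\log n).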


You're right. I missed that crucial word, and the argument still stands: as long as you have an upper bound on the length of the strings, you can do comparisons in constant time.


And that’s my point. There is no upper bound on the length of strings. If your data continues to grow it is because the information inside of it is distinct in some way. Nobody is putting “a” into an ordered list or into a set billions of times. They’re putting John Smith at 123 Sesame St, Springfield, MO into that list.

If we had a billion Americans, we’d have to use 9 digit zip codes to distinguish customers. The data grows logarithmically. Always.

This is less true in event processing, but groups of events are often information in that case, so the pattern still mostly holds. (Maybe someone could convince me that events are loglogn, but we also deal with 3 orders of magnitude more of them than any other data at my job.)


Not all data entries in every dataset are distinct, far from it. It seems to me that you are trying to argue that big O is not useful, because you believe that an assumption of constant-time comparison cannot be made. But I disagree: once we go into the realm where we cannot assume that comparisons are constant time, we are also in the realm where IO operations dominate the runtime, and we base our big O on IO operations rather than comparisons. Here the bound and big O is even more relevant.


Knuth isn't wrong, he is taken out of context. Knuth doesn't say programmers shouldn't worry about performance. You still have to define the problem and the requirements and fulfill those reqs.

> There is no doubt that the grail of efficiency leads to abuse. Programmers waste enormous amounts of time thinking about, or worrying about, the speed of noncritical parts of their programs, and these attempts at efficiency actually have a strong negative impact when debugging and maintenance are considered. We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3 %. A good programmer will not be lulled into complacency by such reasoning, he will be wise to look carefully at the critical code; but only after that code has been identified. It is often a mistake to make a priori judgments about what parts of a program are really critical, since the universal experience of programmers who have been using measurement tools has been that their intuitive guesses fail. After working with such tools for seven years, I've become convinced that all compilers written from now on should be designed to provide all programmers with feedback indicating what parts of their programs are costing the most; indeed, this feedback should be supplied automatically unless it has been specifically turned off.

He says we need better tooling that continuously tells us which parts of the program are taking the most time and why. Yes! We do need this.

page 268, https://dl.acm.org/doi/10.1145/356635.356640 Structured Programming with go to Statements


I’m not talking about the “premature optimization” conversation for once. I’m talking about complexity theory, which he is the father of, being unmoored from information theory and problematic in this day and age - actually problematic back to 2000 if you’re working on problems that require clusters of machines. In the last 20 years we’ve flipped from less than 20% of developers working in horizontally scaled systems to perhaps 20% who have never had to. I believe that accentuated the discrepancy.


> Complexity theory tells us you can sort those in nlogn time.

You seem to be attacking a strawman — no CS textbook says you can sort n strings in O(n log n) time (that's just the number of comparisons); of course sorting n strings each of length m will take O(mn log n) time — that is, you have to multiply by the average time for each comparison. This is what complexity theory says, and what any good textbook would also say. You don't need "a new complexity theory built on top of Information Theory"; the current one already does so. (And m could be much bigger than log n too.)


I’ve seen actual implementations of hash tables claim O(1) amortized insertion time, not O(m). And I misspent a summer about 5 years back thinking I had a constant space overhead mergesort (when what I had was poorly chosen input data. Bummer). I read a lot of mergesort papers then. Nobody mentioned m. I may get bored and check your theory on textbooks, but I suspect you’re letting optimism get the better of you. In a conversation about pragmatism that’s not good.

All of my examples at the top were not hypothetical, they were all actual arguments I’ve had with real people or documentation I bitched to someone about.

Edit: I mean fuck, even Wikipedia has it wrong:

> In sorting n objects, merge sort has an average and worst-case performance of O(n log n).

N objects. Not n numbers. If you want to fix the world instead of declaring bankruptcy with me, start there.

At least they get space overhead right farther down:

> In this case, the notion of "in-place" can be relaxed to mean "taking logarithmic stack space", because standard merge sort requires that amount of space for its own stack usage.


Well, Wikipedia is obviously wrong there. I've fixed it to say number of comparison operations: https://en.wikipedia.org/w/index.php?title=Merge_sort&diff=p...

Note also that in computational complexity theory, "n" is the length of the input in bits: if you have k strings each of length l, using a regular sort, with a comparison subroutine that takes O(l) time, will take O(kl log k) time — O(k log k) operations each taking O(l) time — which, as n=kl, is O(n log k), upper-bounded by O(n log n).

————

I looked up a couple of influential algorithms textbooks in the meantime, and I must admit this is less clear than I'd have liked (though complexity theorists are very aware of it).

• TAOCP vol 3 (which starts with Chapter 5, Sorting) says, on page 5 (in the introduction section of the chapter, where he often says things informally and not exactly):

> The time required to sort N records, using a decent general-purpose sorting algorithm, is roughly proportional to NlogN; we make about logN “passes” over the data. This is the minimum possible time, as we shall see in Section 5.3.1, if the records are in random order and if sorting is done by pairwise comparisons of keys. Thus if we double the number of records, it will take a little more than twice as long to sort them, all other things being equal. (Actually, as N approaches infinity, a better indication of the time needed to sort is N(logN)^2, if the keys are distinct, since the size of the keys must grow at least as fast as logN; but for practical purposes, N never really approaches infinity.)

It's emphasized less than I'd have liked but he has already mentioned your observation though (and I checked this was present in the first edition in 1973). BTW, note no "O" here: where possible he actually gives the number of machine operations, including the constant factor. TAOCP is very much a "nothing up my sleeve" book, where everything is analyzed down to the constants (where possible) and machine code given.

He starts with a very concrete problem where what's being sorted are numbers that each fit in a machine word:

> “Memory locations R+1, R+2, R+3, R+4, and R+5 contain five numbers. Write a computer program that rearranges these numbers, if necessary, so that they are in ascending order.”

and he makes it clear that what's being sorted are records that contain keys and satellite information, and he's going to mostly assume the keys are integers that fit in a word, sometimes even part of a word. The mention of sorting strings (can get there via the index) is an exercise on page 178:

> Design an algorithm to sort strings α1, ..., αn on an m-letter alphabet into lexicographic order. The total running time of your algorithm should be O(m+n+N), where N = |α1| + · · · + |αn| is the total length of all the strings.

Note that this running time is better than O(n (N/n) log n) — assuming that the strings are of about equal length N/n, and using a standard sorting algorithm — that I mentioned above. The solution is essentially to construct a trie.

• CLRS starts its sorting section with "A sequence of n numbers" and has:

> A sorting algorithm describes the method by which we determine the sorted order, regardless of whether we are sorting individual numbers or large records containing many bytes of satellite data. Thus, when focusing on the problem of sorting, we typically assume that the input consists only of numbers. Translating an algorithm for sorting numbers into a program for sorting records…

Its index on sorting variable-length items points to p206 (in the 3rd edition) where, similar to TAOCP:

> 8-3 Sorting variable-length items

> a. You are given an array of integers, where different integers may have different numbers of digits, but the total number of digits over all the integers in the array is n. Show how to sort the array in O(n) time.

> b. You are given an array of strings, where different strings may have different numbers of characters, but the total number of characters over all the strings is n. Show how to sort the strings in O(n) time.
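For the curious, here's a small Python sketch in the spirit of those exercises (bucketing by character position rather than building an explicit trie; a sketch of the idea, not a tight implementation of the stated bound):

    def msd_radix_sort(strings, depth=0):
        """Sort strings by bucketing on one character position at a time,
        so each character is examined only a constant number of times."""
        if len(strings) <= 1:
            return list(strings)
        done, buckets = [], {}
        for s in strings:
            if len(s) == depth:
                done.append(s)              # exhausted strings sort first
            else:
                buckets.setdefault(s[depth], []).append(s)
        out = done
        for ch in sorted(buckets):          # alphabet assumed small
            out.extend(msd_radix_sort(buckets[ch], depth + 1))
        return out

    print(msd_radix_sort(["banana", "band", "ban", "apple", "app"]))
    # ['app', 'apple', 'ban', 'banana', 'band']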


> I think that, in general, most programmers struggle to accept that correctness can sometimes be traded-off—personally, it offends a deep internal conviction of mine that programs should be correct. [...] possible incorrectness more often causes problems. I might be happy trading off a bit of image-quality for better compression, but if an ML system rewrites my code and leaves off a "not" I'm unhappy.

I'm torn on this point. On the one hand, I'm as fallible as any human, and reading this has made me aware of a bias I've always had but never realised was there. But on the other hand, our field is young, and there are surely better, correct solutions to be found for things like compression and sorting. So with that assumption, I find it tempting to label programmers who trade correctness for speed or size (e.g. using JPEG when QOI and WebP exist) as 'lazy', since "they haven't exhausted all other options". Obviously that's not fair—it ignores practicality and trivialises the discovery of novel algorithms—but I feel there's still some truth in it.


I'm always reminded of the aphorism: If my solution doesn't have to be correct, `return 0` runs in a couple of cycles and uses zero additional memory.

The complication, of course, is (as the original author explains) that - with respect to some problems - there are degrees of correctness in the solution space, and some specific, limited trade-offs may be acceptable.


I think "use a better data structure" needs to be #1. If you have a bad data structure for your use cases, no algorithm change will save you.

If you have a linked list and your use case tends toward non-sequential search and access, tweaking the algorithm only burns cycles that could be spent switching up. If you have an ordered array and need to insert/delete values in the middle, you are similarly hobbled in performance. Then there's cache coherence when interacting with hardware, which is heavily influenced by data structure.

Data structures should come first when developing and optimizing.
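As a concrete (if toy) Python example of matching structure to access pattern: the same membership query against a list versus a set (n and the needle count are arbitrary; numbers are illustrative).

    import random
    import timeit

    n = 100_000
    haystack_list = list(range(n))
    haystack_set = set(haystack_list)
    needles = random.sample(range(2 * n), 1000)    # about half will miss

    # Same query, very different data structures: O(n) scans vs. O(1) lookups.
    print(timeit.timeit(lambda: [x in haystack_list for x in needles], number=1))
    print(timeit.timeit(lambda: [x in haystack_set for x in needles], number=1))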


I don't agree with this categorization and it leaves out other important considerations:

- Data structure and algorithm are of the same "algorithm" topic

- Some programming languages like Rust do not have tail-call optimization and so do not perform recursion well. You need to use the strengths of the language and avoid its weaknesses.

- IO Optimization is another kind of optimization. Minimizing copies of data in memory, sharing whatever possible

- Cache Optimization - design according to how CPU caches work with the operating system


The linked paper about PyPy's homogeneous collections optimization is extremely interesting:

https://tratt.net/laurie/research/pubs/html/bolz_diekmann_tr...

Would be nice to have a collection of such techniques for increasing performance sorted by effort required...


- parallelization


This is a great article, and I have some notes.

I initially disagreed with the author when he said, "we tend to overestimate how much we know about the software we're working on."

I disagreed because that is not my experience; I do know a lot about my software.

However, then he said, "We overemphasise the parts of the system we've personally worked on, particularly those we've most recently worked on. We downplay other parts of the system, including the impact of dependencies (e.g. libraries)."

Oh. Um, well, this is also not my experience.

I work on software alone, and I do not have any dependencies.

So the author is right; I just have weird experience. :)

I agree with the four kinds of optimization, especially since he focuses on optimization humans do. I think you could even shoehorn compiler optimizations into those four. For example, strength reduction could be considered a "better algorithm."

His observations about the differences between best, average, and worst cases are the things programmers need to learn most. At some point, our code will always hit bad cases, and we need to know if it will hit it often.

I love his example of Timsort because I've analyzed and implemented it myself. It ended up being more comments than code. (Of course, that includes two license header comments, so slightly fewer comments than code actually.)

Timsort is pretty complicated, and Tim isn't the best at naming things, so it took a bit for me to understand what was going on.

(The most prominent example is the galloping functions. He named them "gallop_left" and "gallop_right," but they both gallop left and right. I renamed them to "gallop2Leftmost" and "gallop2Rightmost," which is actually what they do.)

The author says he biases to writing algorithms and adopting pre-written data structures. This is a good thing. However, after having written data structures (because I work in C, which doesn't have them), I would encourage every programmer to write some. Best, average, and worst cases are easier to learn on data structures than algorithms. Still use pre-written ones though; they are usually better optimized.

The trick about reducing memory is crucial. I recently spent an entire refactor just reducing the size of some of the most-used structs. I cut the size of one struct in half. `pahole` is great for this.

The part about PyPy is a great example too. A great book to read is Is Parallel Programming Hard, And, If So, What Can You Do About It? [1] The entire point of the book is to make you think, "Okay, I need some concurrency or parallelism; what is the easiest way to get it?" The PyPy example is the same thing with optimization: what's the easiest way to get the performance you need?

Every time you run into an optimization problem, ask that question first, and it will get faster to answer every time. And the author's summary reinforces this.

[1]: https://mirrors.edge.kernel.org/pub/linux/kernel/people/paul...


> However, then he said, "We overemphasise the parts of the system we've personally worked on, particularly those we've most recently worked on. We downplay other parts of the system, including the impact of dependencies (e.g. libraries)."

> Oh. Um, well, this is also not my experience.

> I work on software alone, and I do not have any dependencies.

If you work in a company of 100+ engineers with multiple specialties it's impossible to keep up with everything. At <50 engineers if you collaborate and/or switch teams it's possible to know almost everything, but that's the most I think the majority of us have the internal storage/RAM for.


I suspect your thresholds are far too high, and that some specialties are almost mutually incomprehensible, so rare for one person to properly understand what is going on.

Time is a real factor also. How long has this code base been accumulating...


Early on in the company I work at we had a practice of collaborating across teams, sharing members, etc., so a lot of us knew a lot of the code base from "different angles". If you implemented a backend feature you might also write the terraform and the frontend that worked with it, with some input from the people experienced in those things. Kind of like random full-stack development? Slower than normal development at first, but it makes future collaboration easier. It was quite great, but unsustainable as the number of people grows.

edit: I think in general at the early stages of a company you hire a lot of people who are good at adapting and learning new skills since they might need to fill new roles quickly as new needs arise. So those people are good at this kind of full stack development since it's their niche. As you grow, you hire more specialized or junior people, and they can't do this process quickly enough for it to feel fluid anymore.


Yes, I said as much in my comment.


There need to be more articles like this; far too often the answers on SO and the like are canned and dismissive: "you don't need to optimize because who cares, you're writing a CRUD app anyway".


The very first CS class taught at my undergraduate, where many students get exposed to programming the very first time and solve the simplest problems in functional style (e.g. list map, fold, fibonacci), introduces problems which require optimizations in the later part.

No matter how powerful your computer and how small your computer program is, you have to optimize at least the exponential-time algorithms, and sometimes the high-degree-polynomial ones. When writing a real-time or production app you have to optimize many polynomial or even linear-time algorithms.

Micro-optimizations like allocation and boxed numbers? Unnecessary unless you need the performance; they only apply a constant multiplier to your program’s speed. But macro-optimizations which affect time complexity can’t be ignored.
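A classic example of the macro/micro split, sketched in Python: memoisation turns an exponential-time fibonacci into a linear one, which no amount of constant-factor tuning of the naive version could match.

    from functools import lru_cache

    def fib_naive(n):                  # exponential: ~1.6**n calls
        return n if n < 2 else fib_naive(n - 1) + fib_naive(n - 2)

    @lru_cache(maxsize=None)
    def fib_memo(n):                   # linear: each value computed once
        return n if n < 2 else fib_memo(n - 1) + fib_memo(n - 2)

    print(fib_memo(200))               # instant; fib_naive(200) would never finish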


Algorithmic time complexity is very important, but I honestly feel like a typical university education heavily underemphasizes exactly how much the constant factor can matter. Like, consider a simple example: someone's hand-written memcpy in C might achieve, say, 500 MB/s, maybe 1 GB/s if the loop is particularly tight. The one that ships on your system likely does 20-100 GB/s. Same big-O on both, of course. Most cases aren't this easy but there is a lot of performance that comes from constant factor optimization, which can dwarf all sorts of clever algorithms that are theoretically more efficient.

Everyone likes to go "ok well if I scale this to a billion users your algorithm takes a month and mine takes a minute" but it is very likely that between now and the billion users the code is probably not even going to look anywhere near the same. But it's likely to be running on the same JVM or the same OS or the same allocator and all of those have been optimized to cut their constant factor down. After all, you don't want to be the guy who has the same algorithmic complexity as your competitor but your cloud bill is twice as much.
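A toy illustration of the same point in Python (same big-O, wildly different constants; the 1 MB payload size is arbitrary):

    import timeit

    data = bytes(range(256)) * 4096            # ~1 MB payload

    def copy_by_hand(src):
        out = bytearray(len(src))
        for i, b in enumerate(src):            # one interpreted step per byte
            out[i] = b
        return bytes(out)

    def copy_builtin(src):
        return bytes(src)                      # one memcpy-like bulk copy

    print("by hand :", timeit.timeit(lambda: copy_by_hand(data), number=10))
    print("builtin :", timeit.timeit(lambda: copy_builtin(data), number=10))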

(This isn't to say you shouldn't do algorithmic improvements, and most of the performance work I do is in fact along those lines, but I do want to clarify that the "micro-optimizations" you're talking about are in fact the difference between a computer from today and the 1990s.)


> it's difficult to measure the indirect impact of things like memory locality — I have heard such factors blamed for poor performance much more often than I have seen such factors proven as responsible for poor performance. In general, I only look to such factors when I'm getting desperate.

No thanks.

Cache-related perf counters are easily measurable, but the impacts are so big you rarely need them.


    import random
    v = [ random.randrange(0, 100) for _ in range(1000) ]
    %timeit sorted(v) ## include a free extra allocation
    71.2 µs ± 1.99 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)



