Limitations of Goodhart's Law (commoncog.com)
127 points by trojanalert on March 3, 2023 | hide | past | favorite | 50 comments



I've always considered Goodhart's Law to be "of the people," but the article offers only top down solutions. What is the solution when the decision-makers have subjected you to a Goodhartable system of incentives and don't have enough awareness, vision, or humility themselves to change the system as described in the article? Well, you play the game or get beat by those who do. Right up until the point when you all lose.

I agree that if you're the one in control, complaining about Goodhart's Law is a sort of learned helplessness. However, for those not in control -- and this goes as well for scenarios with diffuse control, such as large crowds and democracy -- Goodhart's Law is useful, in the way that naming any systemic ailment is useful.

Perhaps we can talk about the 'hard problem' of Goodhart's Law to separate it from the problem solved in the article. Does anyone have good advice for those living inside the hard problem?


Even as someone who is in charge, it can be incredibly difficult to effect change once this type of culture is established. The only way it can happen is if enough people know it's wrong up and down the chain and are able to consciously recognize the systemic problems and rally individuals to start pulling the cart out of the mud. This is easier said than done, because most people are willing to put up with a tremendous amount of bullshit as long as it is predictable bullshit and they don't feel responsible for it.

I also want to challenge your framing about learned helplessness by "the one in control". It's true that in many situations you won't have power and you need to pick your battles. But on the other hand, everyone has some degree of agency, and programming especially requires folks to understand what they are doing and how it fits into larger systems (both technological and human). As a manager, I have seen a tremendous amount of learned helplessness based on assumed constraints that simply were not true. Yes, in some cases it's justified, but the most successful people tend to have fewer assumed constraints, regularly take action to do what they can to improve things, and they aren't easily discouraged when (inevitably) outcomes fall short of their ideal vision.


If people are serious about rooting out those beneath them in the management chain who lie or cheat to appear on-target, they need to employ anonymous whistleblowing straight up to them.

But I think many people in charge see the corruptness of the people below them as "headaches to deal with" and "if it's not squeaking it must be working" which I think is exactly backwards (it's the "if it compiles it's working" approach to management).


develop a network of senior people and build consensus around change. if you try to go it alone you are perceived as (and may very well be) a shortsighted iconoclast acting out some narcissistic fantasy.

modulo that, once you have established some legitimacy and buy-in from the people that matter... just lose your fear and do the right thing. sure, they can fire you.


One of the subtler points Eli Goldratt makes indirectly is essentially that, "this type of culture is established" as you put it, in situations of abundance.

So Goldratt's description of "crunch time" is a company which starts out every quarter doing things in the (as they see it) "right way," then they close out every quarter doing everything in the "wrong way" or taking shortcuts or spending considerable costs to finish projects to make their revenue come in.

The spin I'm offering is: when you have a season of abundance, you do not have to face crunch time, and you get, as you put it, "predictable bullshit and they don't feel responsible for it." But effecting change during crunch time, by contrast, seems much easier, as long as you can keep your head straight. Goldratt's starting point is: "maybe these 'shortcuts' we take under pressure are not the 'wrong way' per se."

I personally hate learning abstractions without examples, so let's get concrete. Your software project gets overdue, and you and your peers start approving each other's merge requests after only skimming them for flagrantly bad code. Rather than asking "is this method in the right class?" and "does this have adequate test coverage?" you are now looking for "okay, that's a raw SQL query, could that accidentally drop the whole database? no? approved! anything else we can add tests and refactor in a month." That attitude, right?

Goldratt says, "no, pay attention to that!" ... you say "but that's not how it should be done, we must review code as we go, tests are good to have" and my-imaginary-version-of-him replies, "sure I understand that you have a need for code safety, I am not saying that you don't. But the fact remains, you accepted the bypassing of this process when crunch-time came, which means that it constrained the team in unacceptable ways overall. So if we believe in working smarter-not-harder, we need to get imaginative now, because the 'logical place' for code safety measures to get injected, is too slow. And there are lots of imaginative solutions. We could hire someone whose only job will be code review, they could start reviewing your code before you're even done writing it. Or we could mandate that everyone must commit their code to the main branch every single day with only pro-forma code review, this will force us to adopt out-of-band practices to make things safe. Other things like that."

So what does effecting change look like in a system that does not have the luxury of abundance? It looks like a "crunch time" that never "ends", but does "get better." Goldratt's point is essentially that you don't want to meet this punch with a counterforce, as that will hurt: you instead want to yield with it and let its own momentum carry the both of you to a better solution.

I have a theory about how this works on software projects but I didn't quite get the chance to run the experiment at my last job, so if this intrigues you and you have the headcount to hire me... :)


Everybody should have a lot of experience working within Goodhart's Law, since grades and school tests are such a huge part of everybody's early life.

You can then apply Joiner's list to school tests as:

1. you can learn your source material

2. you can game the test

3. you can cheat


But notice that in a public school environment the hard incentives ($$$) are all on the administration. So it's attendance, tests, and grades, maybe followed by class size. Yes, there are incentives for students to cheat, but cheating is much more widespread and likely on the part of the teachers and administration. The blogger didn't talk about this, but for any metric or group of metrics you need to look at who is incentivized to cheat and how it might manifest.


The thing being distorted is not passing tests, it's education.

Learning the source material for tests does not give you a good education.


You say this like there is an obvious and right way to handle the situation. But the reason such a situation might end up Goodhartable is if you had three tests every day with no time to study. Unless the requirements change, it would be impossible to "succeed" without adopting #2 or #3. This is why the GP is saying this concept is great for a top-down approach, but if you are in a system that doesn't give a shit, then you are forced to play the game or lose to those who do.


> Does anyone have good advice for those living inside the hard problem?

Keep speaking up. Make the problem evident, on an ongoing basis. Mention possible solutions or alternatives when you think of them. People may not like complainers, but they'd probably have a more difficult time dismissing a sincere person who wants to make things better.

-----

With respect to education one of the main points of credentials is prestige (https://www.science20.com/quantum_gravity/blog/phd_octopus_w... ), from the point of view of the hirer. Particular demonstrable skills are also another point, but from the other end, from the education institute. Even in the case when the hirer is also an educational institute, the people in control of one set of reasons for choosing the measuring stick are not the ones in control of the other set of reasons.

Prestige requires a measuring stick, and the point of that measuring stick is exclusivity and marketing power. Neither of which particularly require a skill.

Basically fighting back against Goodhart's law is so difficult because no particular person or group has total control over all of the incentives for using a particular measure.

If people were more honest with themselves about why they are using a particular measuring stick then maybe we could tackle Goodhart's law.

Are you hiring Ph.D.s because on average they are more capable of meeting the demands of the job (Extremal Goodhart)? Because you also have a Ph.D. and so identify with Ph.D. holders (affinity pseudo-Goodhart [my own creation])? Because you also have a Ph.D. and know a whole ton of Ph.D. holders who have difficulty getting good paying jobs (a form of Causal Goodhart, probably)? Because you have too many decent resumes to sort through (Regressive Goodhart)?

But they either aren't honest, or have too many differing goals for the use of a measure to justify to themselves minimizing the measure.


I think trying to identify a separate "Hard Goodhart's Law" or whatever is a bit silly. Very little you've said here is really specific to Goodhart's Law. If you don't have the power to solve the problem, then, well, you can't solve the problem. You're left with the standard options for working with any bad system: go with the flow, leave, try to gain more power, etc.

Which is not to say that asking for advice is a bad idea, quite the opposite. Just that said advice doesn't need a special name.


That's, like, the whole point. If you as a manager set a goal, those three choices are all the choices the people under you have to meet the goal, and you often set it up so that the preferred choice (actually make things better) is impossible. It's saying that decision-makers need to be aware of this and adjust their behavior.

Like you said, if they don't, the people dealing with it have no choice -- so it'd be impossible to offer advice for those people, since they have no way of acting on it (unless it's convincing the decision-maker of that point).


What presumption! The author doesn't know how useful I find Goodhart's Law!

Seriously though, Goodhart's Law means that the world is more complex in a way that you may not have realized. Saying it is wrong, and that the world is more complex than you may have realized, is just expanding it.

Every model is wrong, some models are useful after all.

> "Well, let’s think about the weight control example. Losing weight is a process with two well known inputs: calories in (what you eat), and calories out (what you burn through exercise). This means that the primary difficulty of hitting a weight loss goal is to figure out how your body responds to different types of exercise or different types of foods, and how these new habits might fit into your daily life (this assumes you’re disciplined enough to stick to those habits in the first place, which, well, you know)."

Oh the irony! If you want to lose fat and keep it off, you need to understand and address HUNGER[0]. Just because you understand part of the system, doesn't mean you understand the whole system.

If one part of the process or system is obvious and easy to measure, we will measure it and talk about it. Thus we talk about weight and BMI, not Body Fat Percent and Lean Mass. We talk about calories and measuring food intake and not the body system that drives us to maintain our energy reserves.

Hunger is hard to understand, impossible to measure, and impossible to compare between people at scale. Calories are relatively easy to measure, but they are only INPUTS to the system, not an understanding of the system.

0. Basically every obese person has lost a significant amount of weight at least once in their life. Losing weight is not a mystery. Book recommendations:

guyenet hungry brain https://www.amazon.com/Hungry-Brain-Outsmarting-Instincts-Ov...

pontzer burn https://www.amazon.com/Burn-Research-Really-Calories-Healthy...


I thought it was more ironic than presumptuous.

After all, without the wisdom of Goodhart, he would not be addressing the problems exposed.

It may be presumptuous of me to assume that he wouldn't see right through how measures make people dysfunctional, but I'm not sure what makes him so special that, unlike the rest of us, he would have had the particular insight to realize where the problem actually lay with measurement dysfunction.


Picking on calories in/out is problematic if you think those are just two variables. To list a few of the problematic parts: there are several ways to get "calories out," "calories in" has an impact on energy levels, and "hunger" and "satiation" are impacted by more than just calories in or out. Though, as stated, the biggest gotcha is that, unless you are doing extreme exercise, exercise will not be the dominant way you expend calories. So the criticism of the idea is flawed from the start. That is not what in/out means.

For the rest? I don't buy it. For starters the WBR process can easily get downright toxic from my view. Too much time spent preparing the metrics without nearly enough time understanding. There are literally teams of people whose job is to present numbers up and feel like they are somehow masters of the system. Not at all healthy for the boots on the ground doing work. (Ironically, this is a manifestation of Goodhart's Law. When the metrics were being used by the boots on the ground, they were great. As soon as you get a middle layer...)

Which is the point of the "law." There are no evergreen policies or metrics that can lead to improvement. Sucks, as you can't even really take this in a meta direction: constantly changing metrics and process is a bad idea when you are tracking how often you do that.


1000 times this. Any management system is only as good as its practitioners. The most talented managers and leaders tend to be good communicators, and so there are hundreds and thousands of great books on management philosophy, systems, tactics, etc. But all of it is subject to cargo culting by people who simply lack the experience or skillset to successfully follow the advice. At the end of the day, unless you have a quorum of leaders who viscerally understand what is meant by the old adage "the map is not the territory", then you are doomed. This is why leaders should be drawn from experienced practitioners, MBA programs should require 5+ years industry experience before admission, and Finance should not be let anywhere near product decisions (other than to set a budget).


While it sounds simple in the article, I think in reality the difference between "Distorting the system" and "Improving the system" is often not very clear. Even looking at the example of a WBR metric that evolved over time[1], the final version incentivizes product behavior that may run counter to the idea of Amazon as an "everything store", such as down-ranking items that are a better match for what the customer is looking for because they aren't immediately available or removing a long tail of products altogether. Is that a distortion or an improvement?

[1]:

- number of detail pages, which we refined to

- number of detail page views (you don’t get credit for a new detail page if customers don’t view it), which then became

- the percentage of detail page views where the products were in stock (you don’t get credit if you add items but can’t keep them in stock), which was ultimately finalized as

- the percentage of detail page views where the products were in stock and immediately ready for two-day shipping, which ended up being called Fast Track In Stock.
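The final metric in that list is just a ratio over detail-page views. A minimal sketch of how it could be computed (the field names and records here are invented for illustration, not Amazon's actual schema):

```python
# Hypothetical sketch of the "Fast Track In Stock" ratio described above:
# the share of detail-page views where the product was in stock AND
# immediately ready for two-day shipping.

def fast_track_in_stock(page_views):
    """page_views: list of dicts with 'in_stock' and 'two_day' booleans."""
    if not page_views:
        return 0.0
    qualifying = sum(1 for v in page_views if v["in_stock"] and v["two_day"])
    return qualifying / len(page_views)

views = [
    {"in_stock": True,  "two_day": True},   # counts toward the metric
    {"in_stock": True,  "two_day": False},  # in stock, but slow shipping
    {"in_stock": False, "two_day": False},  # out of stock
    {"in_stock": True,  "two_day": True},   # counts toward the metric
]
print(fast_track_in_stock(views))  # → 0.5
```

Each refinement in the list above amounts to tightening the predicate inside that `sum(...)`, which is exactly what makes it harder to get credit without actually serving the customer.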


> such as down-ranking items that are a better match for what the customer is looking for because they aren't immediately available

What struck me about the WBR is this assumption they're making about the customer's desire.

> The backlog indicated the amount of work we’d need to do to make sure our customers received their gifts before the holidays.

How many of these people don't care if their product arrives before December 24th? You never, ever bothered to ask them. If it's even 5% of the orders during this period that customers don't care about arriving before December 25th then you could have made logistics easier by just asking this question.

And as Amazon grew to become the "everything store", a greater proportion of mid-December orders are ones that the customer doesn't need by December 24th. If you ever bothered to ask the question you'd make the holiday rush so much easier. Instead you make Amazon inconvenient for the customers who don't want to be a bother, by forcing the customer to put off submitting orders until after the holiday.


In any corporation many, if not most, managers, directors, and VPs are not measured by revenue impact. Not that they don't have a revenue impact, but that they aren't measured by it because the contribution is indirect. In those cases, I notice that the four most common places for distorting or cheating the system aren't in the article--budget, head count (hiring), performance reviews, and space allocation. And the gaming can come from below, by inflating budget requests, head count, and the like--empire building--or from above, by underinvesting in or cutting critical functions where the revenue impact is indirect.


"When Goodhart's law is used as a measure, it ceases to be a good measure"


I think this article has great advice when the metric is well chosen.

But that's the hard part: choosing a metric that matters, and that's the majority of what Goodhart's law is about. Too many things in business cannot be quantified, so you measure things adjacent and hope they correlate, and often that correlation is just wishful thinking.

Take for example, how do you measure productivity of developers?

Lines of code? Number of pull requests? Story points completed? Tickets closed?

All of those measure... something... but they don't measure productivity directly. A smart engineer who spends a few extra days on a design and comes up with a plan that cuts out half the work fails by all the metrics I've listed above, but could easily be the most productive developer just by cutting out work.

Business is hard to measure directly. It's way too easy to penalize your best people by over focusing on quantifiable metrics.


This is a nice refinement to the discussion. There is an element of truth to Goodhart's law, but it is certainly not a law and it is maybe not even a default condition. As per the author, the work of Deming clearly shows that measuring tolerances has a good effect on quality. I've personally even seen Goodhart's law (mis)applied to discussions of machine learning pipelines, so clearly there are glaring exceptions to the utility of metrics.


> Goodhart's law . . . is certainly not a law

A scientific law is a description of a common pattern. It's not a rule voted into being by the reality legislature and enforced by nature beat cops.

Goodhart's certainly is a law.

> Goodhart's law (mis)applied to discussions of machine learning pipelines, so clearly there are glaring exceptions to the utility of metrics.

How can an alleged misapplication of Goodhart's law mean that metrics are not always useful? Goodhart's law itself says metrics aren't always useful.


It's not a scientific law. It's a meme law, broadly applicable but not always true.

Scientific laws have no counterexamples, only refinements.


Goodhart was an economist who stated his eponymous law in 1975, before Dawkins's The Selfish Gene (origin of "meme") was published in 1976. So Goodhart was not meme-ing.

If you want to argue that economics isn't science or isn't hard science, fine.

But law is used here in the sense of scientific law nevertheless.

https://en.wikipedia.org/wiki/Category:Economics_laws


Yes, for Deming:

Enumerative study: A statistical study in which action will be taken on the material in the frame being studied.

Analytic study: A statistical study in which action will be taken on the process or cause-system that produced the frame being studied. The aim being to improve practice in the future.

https://en.wikipedia.org/wiki/Analytic_and_enumerative_stati...


> Goodhart's law (mis)applied to discussions of machine learning pipelines

I see a way in which it overlaps with this issue https://en.wikipedia.org/wiki/Overfitting
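The overlap is easy to see with a toy example (made-up numbers, not a real pipeline): a model that memorizes its training data drives the training metric (the proxy) to zero while the held-out metric (the real goal) stays poor.

```python
# Goodhart as overfitting: optimizing the proxy (training error) to zero
# while the real target (held-out error) stays bad.

train = {1: 2, 2: 4, 3: 6}    # underlying rule: y = 2x
test  = {4: 8, 5: 10}

def memorizer(x):
    """Perfect on the training set, useless off it."""
    return train.get(x, 0)

train_err = sum(abs(memorizer(x) - y) for x, y in train.items())
test_err  = sum(abs(memorizer(x) - y) for x, y in test.items())
print(train_err, test_err)  # → 0 18
```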


For context this was just a team objecting to tracking progress against an external standard. The proposed metric was not that far off from the mean square error loss that was used for training models. Goodhart's law was just used as a rhetorical device to evade accountability.


Site is not loading for me (maybe hugged to death).

Archive: https://web.archive.org/web/20230303122025/https://commoncog...


You can really tell when someone has never worked in a call center


This is such a compelling article on Goodhart's law, and it's interesting to see how Amazon conducts their Weekly Business Reviews. I'm wondering if there's a rebuttal to this piece.


Looking at the 'damage' Amazon has done, both to its workers and to the credibility of its wares, you could say that Amazon has suffered from Goodhart's law to some extent on the grander scale.

Specifically, they set a list of 'outcome metrics' that they wanted to optimize. Using these WBRs they ensured that the metrics by which their middle managers are measured remained effective for actually optimizing these outcome metrics. However, at some point these outcome metrics stopped being good targets. They have led Amazon to exploit its workers and to have a negative impact on outside businesses and the environment, at the cost of massive political pressure. Moreover, these targets have led Amazon to often sell counterfeit products, as well as incredibly low-quality budget alternatives. This is slowly causing people to distrust the Amazon store for many things.

In other words, whatever outcome metrics they set as their target have ceased to be a good metric for how well Amazon is doing.

Or maybe the problems I identify are well-considered trade-offs, and they are aware that this is what their targets are causing...


You not liking Amazon's goals doesn't make them bad for Amazon.

Unless your concerns lead to >$500B loss, they are merely tradeoffs in the overall optimization.


The entire overbuild and overhire of the pandemic? Hard not to think of the hiring metrics that you know they were tracking in "efficient WBR" meetings the entire time. :D


Not a direct rebuttal, but there are several things that need to come together for this to work.

1. The leadership team should trust each other's data. Over a period of time, that has not been true.

2. The group reviewing the data depends on everyone being there long enough to understand and get the feel. That is also hard to achieve and, imho, is no longer true.

3. The metrics tend to ignore the cost to the workers. After all, everything looks like numbers and charts.

For me, WBR is a mechanism, but Amazon's focus on just the metrics has hurt its workers, employees, sellers and partners a lot.


By "no longer", do you mean at Amazon particularly?


Yes, I mean at Amazon. The WBR (when I was there) was held on Monday at the director level, Tuesday at the VP level, and Wednesday at the S-team.

The following is purely my opinion:

The constant churn of employees and directors at the lower levels meant the data was mostly massaged or explained away.

The really good old-timers would focus on exceptions and single anomalies - much to the frustration of poor engineers and managers. But they intuitively knew that those small issues could snowball, and catching them early helped mitigate them. But as the systems grew more complex and the number of teams expanded massively, the feel for the pulse became more mechanical and remote. New managers, directors, and VPs did not come from a similar culture, and the constant politics sidelined the more experienced employees, who quit. Once that knowledge is lost, it takes multiple years to get it back, and I do not think Amazon has that mojo anymore.

It felt more mechanical and ritualistic, rather than genuinely trying to uncover issues (which is when I quit). It is Goodhart's law at the next level.


> if you want to improve some process, you have to ignore the goal (target)

This is the whole message of Goodhart's Law. Once you set a target, people will naturally aim for it. If you have to ignore the target, or even not have one, then you are saying Goodhart's Law is useful, which counters the author's narrative. The point of the measure should be to detect anomalies in the system, not to drive targets or goals.

If you measure widgets per week, you can look back at the end of each week and ask the questions:

- Why did we produce 10% more widgets this week than our 6-week rolling average?

- Why did we produce 5% fewer widgets this week than our 6-week rolling average?

- Is there anything we can easily do to increase the number of widgets we produce next week?

None of these questions have a target. They are process-oriented. We aren't saying what the rolling average must be; we are using it to detect variance. Or we are investigating to see if we can produce positive variance. What we try might fail and the variance is negative, but then we revert our process. Or it is positive and we have improved our process. Rinse and repeat.
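The variance-detection idea above can be sketched in a few lines (the window, threshold, and weekly counts are made-up numbers for illustration):

```python
# Flag weeks that deviate from a trailing rolling average by more than a
# chosen threshold. No target is set; the average only detects variance.

def flag_variance(weekly_counts, window=6, threshold=0.05):
    """Return (week_index, pct_change) pairs where a week's output
    deviates more than `threshold` from the trailing `window`-week mean."""
    flags = []
    for i in range(window, len(weekly_counts)):
        avg = sum(weekly_counts[i - window:i]) / window
        pct = (weekly_counts[i] - avg) / avg
        if abs(pct) > threshold:
            flags.append((i, round(pct, 3)))
    return flags

weeks = [100, 102, 98, 101, 99, 100, 110, 95]
print(flag_variance(weeks))  # → [(6, 0.1), (7, -0.066)]
```

Each flagged week is a prompt for the "why" questions above, not a pass/fail judgment.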


I mentioned it in a different comment, but my biggest reservation is that the difference between "Distorting the system" and "Improving the system" is more often than not in the eye of the beholder, and most metrics will incentivize a mixed bag of "good" behavior and "distorting" behavior.


What part of it do you want to see rebutted? I thought it was a great article too, but it seems like for the WBR process to work you need rapidly moving input and output metrics, which is more true of sales and marketing than engineering. Maybe you can apply the same approach to simple A/B tests, but then structurally you're incentivizing thinking small, which puts you right back in the passenger seat of Goodhart's law.


Correlation is not causality.

If you base your incentives on something that's only correlated with what you want (for example, paying programmers per line of code) you get screwy results (say, lots of manually unrolled loops).
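The unrolled-loop example is easy to make concrete. Both functions below do the same work, but a pay-per-line incentive rewards the second:

```python
# Two equivalent functions: a lines-of-code metric scores the second one
# several times higher, even though it is strictly worse code.

def total(xs):
    return sum(xs)

def total_unrolled(xs):  # "productive" under a LOC metric
    acc = 0
    acc = acc + xs[0]
    acc = acc + xs[1]
    acc = acc + xs[2]
    acc = acc + xs[3]
    return acc

assert total([1, 2, 3, 4]) == total_unrolled([1, 2, 3, 4]) == 10
```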


> [Goodhart's Law] is descriptive; it tells you of the existence of a phenomenon, but it doesn’t tell you what to do about it or how to solve it.

This is the difference between an adage and an essay. And I will say that just thinking about the existence of Goodhart's Law whenever you design a system is all you really need. Any time you make a metric, put yourself in the shoes of the people who will be judged by the metric, and try to game it as hard as you can. Ask others to try to game it. If the cheating you discover is manageable, use the metric. If not, get rid of it.


I agree with all that. If you're an organization of any size, you need some level of metrics and measurement. And, if you're a public company, the owners (i.e. shareholders) are going to dictate some of that at a company level. (And even if you're a growth unicorn, that just means the metrics are around growth rather than profit or revenue.)

Just counting on everyone to do the "right thing" and evaluating them on gut feel doesn't really work at scale.


I mostly disagree with this.

Yes, there are ways to be really careful with targets. But at a societal or institutional level, you don’t get that kind of control.

I’ve written about the difficulties a bit here:

https://superbowl.substack.com/p/beware-the-variable-maximiz...


I can't seem to reach the original URL; this seems to work:

https://web.archive.org/web/20230303122025/https://commoncog...


Goodhart's law is extremely useful for illustrating the issues which come from bad KPIs and raising awareness of the natural psychology that people will work to game those measures if incentivized to do so.


Campbell's law is not simply a "social variant" of Goodhart's law. The two have a subtle but crucial difference.

To paraphrase a Donald Trumpism, Goodhart's law says "Sounds good, doesn't work".

Whereas Campbell's law adds "Worse still, it'll certainly backfire", and makes that the focus.

Incidentally, the other "law" alluded to in the text, also has a name:

> The (Lance) Armstrong Principle:

"If you push people to promise more than they can deliver, they’re motivated to cheat."

A more general rephrasing: if you hold people accountable (e.g. by forcing a promise) to unreasonable expectations of performance that cannot be met, there will be a motivation to cut corners, e.g. to cheat.

(Popularized by Andrew Gelman; I first encountered this in an article by friend/colleague Nick Yeung, about the bad side of forcing phd students to publish, also an interesting article in its own right: https://www.nature.com/articles/s41562-019-0685-4 )


Although this is an insightful article with many good points, I am disappointed that the author didn't mention a more fundamental problem with metrics: that the very framing of business that leads to a focus on metrics is toxic to responsible business practices.

The author talks a lot about listening to the voice of the process. How about listening to the voice of people? For instance, people who are concerned that we're losing our ethics and our vision and our sense of meaning in our work?

The moment someone suggests a metric, my reaction is not to think about causes and effects but rather to ask, "what is motivating you to think that dwelling on that metric, OR ANY METRIC, will lead to happiness?"

Framing and philosophy precede any possible notion of improvement. Start there. If that leads you to measurement, so be it... Or you can be like Amazon: a rich company that is famous for its rapaciousness and exploitative behavior.


There are a lot of production-focused businesses where metrics are one of the most sensible, predictable ways to run the business. Are you going to listen to the opinions of 2000 auto manufacturing employees to know how things are going or look at completed cars and initial defects reported?


The voice of the people is also a metric.



