This is an ancient technique. I used essentially the same process ~30 years ago (1987–95) to publish N-way product comparisons. The approach was old even then (see e.g. https://en.wikipedia.org/wiki/Quality_function_deployment).
The promise of this quantifying is that results are more objective and precise, so everyone can come to rational agreement. In practice, it doesn't do that. What attributes are chosen to quantify, how they're scored, how they're weighted—these are all subject to a great deal of fiddling and constant debate. "You over-weighted X!" or "You didn't consider Y!" Et cetera. Truly never-ending, and unless the participants are already well-aligned, it doesn't secure genuine consensus.
Results are also highly perturbable. Tweak the weights and/or scores but a little and they tell an entirely different story. New winners emerge, clear victories become dead heats, and the former Red Lantern Award winner is suddenly in the middle of the pack.
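To make that concrete, here's a minimal sketch of that perturbability with entirely invented options, scores, and weights; a 0.1 shift in a single weight is enough to flip the winner:

    # Hypothetical options scored 1-5 on three criteria (all numbers invented).
    scores = {
        "Option A": {"cost": 5, "ease": 2, "scaling": 3},
        "Option B": {"cost": 3, "ease": 4, "scaling": 4},
    }

    def total(option_scores, weights):
        return sum(weights[c] * s for c, s in option_scores.items())

    weights_v1 = {"cost": 0.5, "ease": 0.3, "scaling": 0.2}
    weights_v2 = {"cost": 0.4, "ease": 0.4, "scaling": 0.2}  # shift 0.1 from cost to ease

    for weights in (weights_v1, weights_v2):
        ranking = sorted(scores, key=lambda o: total(scores[o], weights), reverse=True)
        print(weights, "->", ranking)
    # v1 ranks Option A first (3.7 vs 3.5); v2 flips it (3.4 vs 3.6).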
I've seen the same problems when this technique is attempted.
Something else to consider is that weights scaling linearly may not make sense.
More importantly, criteria may overlap. In the example in the article, I suspect technical ease and scaling ease are highly correlated, which effectively means you're double counting.
Basically the technique only works when you have fully independent criteria which cover the full spectrum of what matters and which can be weighted objectively using a scale that represents true relative importance.
When I have used this technique, I kept the weights confidential until I had collected the data from teams about how they rated each strategy on each criterion.
Then I created two conversations. The first was about our collective confidence in those estimated ratings; the second was about the proposed weights and whose they were, which I revealed afterward so we could add up what the model produced and have a much faster discussion about whether it yielded what we ought to do.
Nobody wants to be subject to process, but almost everyone wants to appeal to it to get their way, so I wouldn't use this tool to make the decisions themselves, but rather to help teams make higher-quality decisions faster.
Discussion plus qualitative, not quantitative, scores, in simple terms like "Weak" and "Strong" (plus occasional modifiers like "Very"), with no weighting of attributes. Keep the comparison matrix simple and the discussion at a high level. While that's paradoxical for the analytic person I am, I've come to realize that "net scores" are usually not the point. The real goal and need is to have a multi-party discussion that leads towards consensus, or at least a rational understanding of why product/approach X was chosen over its competitors. In my experience, soft, non-numerical discussions of the choices available have more often led to that happy (quasi-)consensus than when the numbers come out.
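For illustration only, here's what such an unweighted, qualitative matrix might look like (product names, attributes, and ratings are all made up):

    # Coarse labels, no weights, no totals; the matrix is just a prop for discussion.
    attributes = ["Ease of use", "Scaling", "Cost"]
    matrix = {
        "Product X": {"Ease of use": "Strong", "Scaling": "Weak", "Cost": "Very Strong"},
        "Product Y": {"Ease of use": "Weak", "Scaling": "Very Strong", "Cost": "Strong"},
    }

    print(f"{'':12}" + "".join(f"{a:>14}" for a in attributes))
    for product, ratings in matrix.items():
        print(f"{product:12}" + "".join(f"{ratings[a]:>14}" for a in attributes))
    # Deliberately no "net score" column.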
The major problem with this approach is that you go into the analysis with an opinion as to what is the better solution, then you tweak the weights and scores to match that opinion. This is not data-driven or even rational, it's just a way to express "numerically" what your guts tell you.
Clearly someone who disagrees with you will just tell you you got the weights and scores wrong.
The main value of this exercise is to sit down and think about a problem in a slightly deeper way and try to rationalize why we think solution A is better than solution B. That is good in itself. But people shouldn't mistake a mental exercise for a quantitative analysis.
> you go into the analysis with an opinion as to what is the better solution, then you tweak the weights and scores to match that opinion. This is not data-driven or even rational, it's just a way to express "numerically" what your guts tell you
I often use the same approach, except I lock the weights to ranks with unique numbers. I.e., with 5 weighted items, you can't use a weight of 5 twice. This forces me to further iterate on the criteria selection. Using the same principle across the board further helps to pick a clear winner, but it depends on the number of items on the list and how far you want to think without overthinking it.
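A rough sketch of that constraint, under my reading of it (criteria, strategies, and scores below are hypothetical): with N criteria, the weights must be a permutation of 1..N, so no two criteria can share the same importance.

    # Weights are ranks: with N criteria, use each of 1..N exactly once.
    criteria = ["cost", "technical ease", "scaling", "team familiarity", "risk"]
    rank_weights = {"cost": 5, "technical ease": 4, "scaling": 3,
                    "team familiarity": 2, "risk": 1}

    assert sorted(rank_weights.values()) == list(range(1, len(criteria) + 1)), \
        "each rank 1..N must be used exactly once"

    # Scores (1-5, invented) are then combined with the usual weighted sum.
    scores = {"Strategy A": {"cost": 4, "technical ease": 2, "scaling": 3,
                             "team familiarity": 5, "risk": 2},
              "Strategy B": {"cost": 3, "technical ease": 4, "scaling": 4,
                             "team familiarity": 2, "risk": 4}}
    for name, s in scores.items():
        print(name, sum(rank_weights[c] * s[c] for c in criteria))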
This framework is very similar to an a priori, multi-objective optimization using linearly scalarized weights[0]. It is a priori because the weights are chosen before scoring and kept constant.
I've found this approach generally works well for humans; however, the results may not be Pareto-optimal.
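A small sketch of what that scalarization looks like, with invented alternatives and the weights fixed a priori:

    import numpy as np

    # Rows = alternatives, columns = objectives (higher is better in this toy example).
    objectives = np.array([
        [0.9, 0.2, 0.5],
        [0.5, 0.7, 0.6],
        [0.3, 0.9, 0.8],
    ])
    weights = np.array([0.5, 0.3, 0.2])   # chosen before scoring, kept constant

    scalarized = objectives @ weights      # linear scalarization: one score per alternative
    print(scalarized, "-> pick alternative", int(np.argmax(scalarized)))
    # Known limitation: a weighted sum can only select supported points (those on
    # the convex hull of the Pareto front), whatever weights you choose.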
And a minor adjustment to your minor adjustment, a defence against those that lead with metrics (see for example Management by Objectives): explore what makes it important and meaningful before defining measures of success.
A handy mnemonic: Meaning before metric, measure before method (2MBM)
Reminds me of the process of designing scientific experiments. State exactly what you care about and measure it directly (not the proxies). Define the metrics before conducting the experiment and a way to aggregate these metrics. Pitch different approaches against each other by evaluating them on these metrics and choose the best one in a fair manner by detaching yourself from any approach.
The blog puts forth a nice framework that one can actually remember to apply in real life. I often unintentionally treat scientific decisions differently than real life decisions (not as quantitatively). So this is a nice way to force you to define your preferences to the best extent that you can. I also like the aspect of having this framework in a team and understanding everyone's POVs, allowing for more transparency. Often in larger groups, discussions tend to meander with viewpoints all over the place. This should also force people to start with and end with 1 objective metric at a time. I bet this exercise is quite fun when done with a team.
Score cards are simple but prone to manipulation. Another approach from multi-criteria decision-making that is very useful for benchmarking is Data Envelopment Analysis (DEA) [0], or the Analytic Hierarchy Process (AHP) [1] for group-decision making.
A strength of DEA is that criteria are not weighted; instead, alternatives are compared against the efficient frontier (similar to risk/return in a Markowitz portfolio). DEA is very useful for benchmarking many alternatives.
AHP uses pair-wise comparison which is less prone to manipulation than scoring is but it needs a group of people that do the comparison.
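As a rough illustration of the AHP priority-derivation step (the pairwise judgments below are invented, on Saaty's 1-9 scale), the criterion weights fall out of the principal eigenvector of the reciprocal comparison matrix:

    import numpy as np

    criteria = ["cost", "ease", "scaling"]
    # pairwise[i][j]: how much more important criterion i is than criterion j.
    pairwise = np.array([
        [1.0, 3.0, 5.0],
        [1/3, 1.0, 2.0],
        [1/5, 1/2, 1.0],
    ])

    eigvals, eigvecs = np.linalg.eig(pairwise)
    principal = np.real(eigvecs[:, np.argmax(np.real(eigvals))])
    weights = principal / principal.sum()
    print(dict(zip(criteria, weights.round(3))))

    # Saaty's consistency index flags contradictory judgments (e.g. A>B, B>C, C>A).
    n = len(criteria)
    ci = (np.max(np.real(eigvals)) - n) / (n - 1)
    print("consistency index:", round(ci, 4))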
Most of my decisions deal with investments, here bringing in a net-present-value measure as well as a metric for opportunity cost has been useful as well.
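For the investment angle, a quick sketch of folding in NPV and opportunity cost (cash flows, discount rate, and the forgone alternative are all hypothetical):

    def npv(rate, cashflows):
        """Discount yearly cash flows (year 0 first) to present value."""
        return sum(cf / (1 + rate) ** t for t, cf in enumerate(cashflows))

    option_a     = npv(0.08, [-100_000, 30_000, 40_000, 50_000, 50_000])
    best_forgone = npv(0.08, [-100_000, 45_000, 45_000, 45_000])

    print("NPV of option A:", round(option_a))
    # Subtracting the NPV of the best forgone alternative gives an opportunity-cost view.
    print("NPV net of opportunity cost:", round(option_a - best_forgone))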
Quantitative people can be scared of attributing numbers to fuzzy ideas, and non-quantitative people can be afraid of these uncomfortable glyphs altogether.
Yet, it is a helpful exercise to establish a way to compare alternatives, and this seems to be the least ridiculous method.
This looks like a good way to pick something everyone will feel good about, but not necessarily the best strategy. Pretty much a quantitative way to do design by committee.
That's a great point. I suspect though, the critique applies to decision making by consensus more broadly.
Decisions made by consensus are not necessarily the optimal strategy (as defined by efficacy to achieve targets). This method simply reflects that.
However, the benefit of this method is that all participants are forced to be more rigorous about why they hold certain beliefs, and then making an attempt to aggregate those preferences to drive an outcome.