Hacker News new | past | comments | ask | show | jobs | submit login
Make up your own rules of probability (johndcook.com)
88 points by wallflower on Jan 31, 2011 | hide | past | favorite | 21 comments



These [incorrect] rules are not explicitly stated in the methods; we inferred them... from formulae embedded in Excel files...

One of the things that most puts me off from using spreadsheets is the difficulty of validating calculations. The formulae themselves are not visible, so it takes a lot of mousing around to convince yourself the cells do what you think they should. For example, I've seen problems where a long "vector" calculation is correct in the upper reaches, and wrong down below, due to failure to completely "fill down" the calculation.

I wonder how many of the spreadsheets studied by Baggerly and Coombes expressed input errors, rather than errors of understanding?


One of the things that most puts me off from using spreadsheets is the difficulty of validating calculations.

A very good friend of mine worked for a major aerospace company that used Excel spreadsheets for almost all engineering. I'd say the biggest problem is that if you're doing important calculations in Excel, then you're completely ignorant of the last 40 years of research into defect reduction and detection in computer science. That ignorance is not completely unintentional either: there were a lot of people who literally pride themselves on not being like the dumb computer folks.

Plus, if spreadsheets are your tool, you likely have very bad tools. So when you make a spreadsheet to run the power budget for satellite A, the obvious thing to do when satellite B comes up for development is to copy the spreadsheet containing the analysis from sat A, change the filename and bury it somewhere on the shared network drive. Then, if you find a defect in the spreadsheet, you have no way of tracking which copies have that same defect because they're scattered all over the shared drive.

In any event, my friend found tons of bugs in spreadsheets, many of which were actually important. Probably the best bug involved a power budget for satellites that inverted electrical current so instead of comparing current to a max theoretical current value, it compared 1/current to the max value.


I can verify that this practice is still common. I have been amazed at the size and complexity of some spreadsheets in aerospace applications.

It's so clearly the wrong thing, and yet the authors of these beasts are always really pleased with their efforts.

Besides the "old shoe" feeling of Excel, I think there's something about the automatic updating done by the spreadsheet that users find gratifying. You alter a cell and the program launches into a bunch of updates automatically. It's like seeing the array of blinking lights.


Spread sheets with units would help "can't compare ampere to 1/ampere". As would ditching "the world is one giant grid" in favor of multiple tables as in a relational DB on the screen. For example if you applied a formula to a table, you get a new table instead of some region of cells that happens to be the formula applied to another region of cells. This also solves the problem of having to enlarge the region where a formula is applied when more data comes in.


I wonder whether there’d be any good way to encourage people to use more inspection-friendly code, e.g. in R, Matlab or even better Python/numpy/scipy. I wish that science journals would accept/encourage source code appendices to articles or at least prominent links to online source code, and that such code, ideally clearly organized and commented, were as expected for “serious” articles as bibliographies are currently.

Even many papers in computer sciencey topics leave out critical implementation steps in describing algorithms and approaches; even a mediocre implementation makes the authors assumptions’ and precise meanings much more explicit. Often (though not always) code is as readable or more readable than mathematical formulae.


The formulae themselves are not visible, so it takes a lot of mousing around to convince yourself the cells do what you think they should

To toggle formulae visibility in Excel hit: ctrl+tilde(~)


That barely begins to address the problems though. When you're writing formulas, you're writing in a small box with no syntax highlighting, no serious whitespace, and no comments. It almost seems like Excel's designers worked really hard to come up with a user interface that would make introducing errors trivial and noticing errors very hard.

And yeah, I understand that spreadsheets were never designed with the idea that people who would use them to enter long complex formulas or for purposes where defect detection would be really important, but that's how they're used in the real world.


> "It almost seems like Excel's designers worked really hard to come up with a user interface that would make introducing errors trivial and noticing errors very hard."

Or, rather, Excel's designers spent very little time and effort into support for non-trivial formula creation and use.

But, if we all seem to agree that Excel is not the place for serious calculations, can we fairly continue to take shots at Excel for being bad at supporting them?

i.e. When I see people maintain non-trivial web apps by opening notepad against production code, it never occurs to me to complain that notepad should have syntax highlighting, collaboration and version management, etc.


I actually feel bad for the Excel devs. In my friend's experience, the relatively modern features that MS introduced into Excel like named cell references were deliberately avoided because they smacked of "programming" and "made things too complicated"...so it seems like the Excel guys lose no matter what.

But with something like making the formula editor reasonable, I think they do have some blame. Excel costs a heck of a lot more than notepad and it is billed as being a serious tool for professionals to use.


Totally agree, hopefully MS will get on the ball one day and improve the editing features. Limited syntax highlighting and an awkward commenting tool are available, but mostly unusable.

Google Spreadsheets is catching up fast (and has even surpassed Excel in quite a few areas), I'm hoping that this motivates MS to make more improvements.


I am not a serious spreadsheet user. But I am curious. In what areas have Google Spreadsheets surpassed Excel? (Collaboration tools is probably one)


Correct, collaboration is much better (realtime editing/chat/cloud storage and sharing).

Admittedly, Excel has a smoother editing experience (which is the real issue here) and is probably better suited for more resource heavy tasks, but Google Spreadsheets excels [intentional] in the following areas:

API -- http://code.google.com/apis/spreadsheets/

Scripts/Templates -- http://code.google.com/googleapps/appsscript/

Forms -- http://docs.google.com/support/bin/answer.py?hl=en&answe...

External data functions -- http://docs.google.com/support/bin/answer.py?hl=en&answe...

Revision History --

Some of this might be achievable in Excel, but it's not integrated as well and requires much more wrangling (e.g. VB scripts).


At one stage, when we were selling a very expensive version of our spreadsheet product, we'd offer a pre-sales service of going through some of a potential client's spreadsheets and converting them to our own system -- the aim, of course, being to show how much better our product was. In one case, a company was using a specific metric to drive their strategy; it was calculated as the difference between the sum of two columns. Kind of, "if the metric is positive, buy, if it's negative, sell". (I'm simplifying, of course.)

Each column was populated with a recurring formula, for example lets say that column A had something like "=D3+E3" in row 3 and column B had "=D3-F3" in the same row, filled down the whole sheet. We factored out the repeating patterns, put them into user-defined functions, and found that we got different results to the original sheet. After spending a while trying to work out what we'd got wrong, we double-checked the original. Halfway down column B it switched from using its own formula to using a mis-copied version of column A's. So half the numbers in column B were wrong. In effect, the metric that the business had been running on for several months had been meaningless.

The company never bought our software, but that may have been because they disappeared a month or so later.


The problems pointed out in this post have turned into a major issue. A clinical trial was designed to test cancer treatments recommended by the erroneous analysis. The trial was stopped (and restarted and stopped again) and now the National Cancer Institute is conducting an investigation.


Part 1: I'm a financial model builder since Excel came out on the MAC in 1987 (I’m now however a PC and 63 years old www.fser.net). Some of the inaccuracies are caught using accounting articulation points: Assets = Liabilities + Shareholders' Equity... if out of balance, something is wrong... What is wrong is a bit tricky and takes thorough understanding of how Income Statements, Balance Sheets and Cash Flow Statements interact. Unfortunately, articulation points don't catch algorithmic mistakes. My models have some very complex algorithms built over the years and the best thing to do is graph each line “quickly”, looking for breaks or unusual patterns in each line item. My models are regularly 3,000 to 5,000 lines long - not because I like complexity, but because I have lots of ways of expressing revenues and expenses. I can model most business in a few hours. Recently, I built a steady state feasibility model for growing, harvesting and extracting lipids from algae using heterotrophic and autotrophic systems. I did make a graphing tool that would allow me to capture the relationship between any input and output as I incremented an input 10 times. In most cases the output was basically linear as the model was basically a linear calculator… I would suggest any builder of engineering models to implement an incrimination graphical recording system as an error checker.


These kinds of things are why I hate papers with undetailed methods sections. Being too explicit in the methods section is obviously unnecessary (e.g. restating Bayes' theorem), however your assumptions and probablistic methods should at least be detailed in supplemental notes.


Some of the errors are in applications of methods that I don't feel should be documented. I would be worried if I saw someone explaining what they did to find P(AB), even if their method were correct. Of course, I would be extremely worried if they claimed it was max(P(A), P(B)), as the original article quotes!


Part 2 To underscore the other knowledgeable experts on this bog, finding the error before you replicate is very tough, indeed. It takes time, several sets of devoted eyes to methodically question each algorithm, which unfortunately are written in cells, which are hard to edit. One thing I learned was to use spaces in a cells code… Spaces inside a huge concatenated IF Statement (the type I love to make), have no effect on calculation, but a huge effect on legibility and understanding… Remember, Excel 2007 allows something like 7,000 characters in a cell, so spaces can work… Use Alt Tab to move code to the next line… give yourself a clear line and line up the code so that similar code components are in columns… this really helps when you come back a year later and have no comments to remind you what you wrote… also if you use VBA – to calculate and fill in a cell, be sure you clean that code and leave some narratives as to what you are doing…


Part 3 A few weeks ago I wrote a simple fuel saving device feasibility calculator. I whipped off the Excel version in a couple of hours…. Then, I tried to do all the calculations inside controls using VBA code, placing output in text boxes… OMG … this “simple” experiment turned into a long nightmare… What we take for granted with Excel – i.e. dependent relationships being automatically programmed is NOT done in VBA… I had to place code in every dependent control (SpinButtons) to calculate several text boxes and to carefully format them both numerically and graphically … like Green Boxes for Positive Numbers and Pink for Negative Numbers… Now I should have call up a module and avoided duplication, bu nonetheless, this was a tough edit… Spinning numbers in and seeing if every text box that should have registered movement actually did…


Part 4 Why did I do this excruciating experiment? Well, I wanted to show clients how I can make Excel look like a bullet proof application with output on top of sculptured, raised backgrounds… i.e. the look and feel of make-your-own Dashboards… Bottom line – I don’t think clients will pay for this expertise… You can make you own Excel Dash Boards 9 as I have done many times) without resorting to the extremes I recently undertook… …So having confessed my inexperience about the difficulty in making non- Excel calculators, I thank my luck stars that we have Excel… without out a doubt the most flexible and beautiful numerical authoring tool on the planet offset by the extreme difficulty in detecting algorithmic mistakes. Isn’t life a trade off, anyway? Thanks for reading this … Bruce Badeau bruce@fser.net


Spreadsheets are a fascinating blight on the face of science and business. They offer so much convenience and speed, but amplify the risk of errors quickly.

There is actually a research group devoted to studying spreadsheet risks: http://www.eusprig.org/




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: