Misleading Graph Generator (yrden.de)
68 points by gulbrandr on Feb 23, 2014 | 35 comments



This isn't a graph generator. It's just a graph. You can plug data into it and get a graph, I guess, but the same goes for Excel.

A misleading graph generator that automatically matched concurrent data sets based on correlation would be quite interesting, but this isn't it.


So are you saying that the term "graph generator" is misleading here?


Ba dum cha


How is this the graph's fault? The only fault the graph has is that it clearly displays the data; the problem is in the idea that is represented. It's formally called "Correlation does not imply causation" and it's the fault of the person who is suggesting it.

There is a famous satirical version of this too: http://en.wikipedia.org/wiki/File:PiratesVsTemp(en).svg

It has nothing to do with graphs, graphs are just visual tools, and people reading data from them are supposed to evaluate it just like data from any other tool and not jump to conclusions.

What the graph in question suggests is that both "transistor count" and "average life expectancy in Germany" have risen through time, and if you are reading this as "rising transistor count increases life expectancy" it's your fault. Why not read it like "from 1971 to 2011 both transistor count and life expectancy increased steadily, maybe because of the advances in technology - we should look into it, it's too early to say anything"?


> It has nothing to do with graphs, graphs are just visual tools.

To make a graph, one has to make decisions, these decisions can be legitimate or they can be biased. Biased decisions lead to misleading graphs.

The same goes for other types of content such as news articles, product reviews, war photos, etc.

Edit, to illustrate better: the author of the graph chose to overlap data on a logscale (transistor count) with data on a linear scale (life expectancy). The resulting graph shows two similar curves, that suggests that the two are strongly correlated. That's what I call a biased decision.
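
To put rough numbers on that, here is a small Python sketch. The two series are made up for illustration (an exponential "transistor count" and a linearly growing "life expectancy") and are not the data behind the linked graph. It compares the Pearson correlation with and without taking the log of the exponential series.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Made-up series, for illustration only.
years = list(range(0, 41))
transistors = [2300 * 2 ** (t / 2) for t in years]   # doubles every two years
life_expectancy = [70 + 0.25 * t for t in years]     # grows linearly

# Both series on a linear scale.
r_raw = pearson(transistors, life_expectancy)
# Transistor count on a log scale, life expectancy on a linear scale.
r_log = pearson([math.log10(v) for v in transistors], life_expectancy)

print(f"linear scale for both:      r = {r_raw:.3f}")
print(f"log scale for transistors:  r = {r_log:.3f}")
```

With these invented series the log-scaled version is a straight line in time, so it lines up almost perfectly with the linear series, while the raw counts do not. Whether that counts as "biased" or as a fair way to show the relationship is exactly what the rest of the thread argues about.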


> How is this the graph's fault?

It's not. I'd love it if more people understood this clearly: it's not "statistics" that are "misleading", it's their authors who are lying.

You can lie when speaking, you can lie in writing, and in the same way you can lie with data or with data presentation. There's no fundamental difference. Either you do your best to tell the truth, or you do your best to mislead people.

I wish people treated, on the emotional level, being presented with a misleading graph in the same way as being lied to in person. It should have the same social cost. If I told you a blatant lie, you would never want to have anything to do with me again. So please, in the same way, don't trust those journalists, companies, salesmen and others who mislead people by using statistical graphics.


>It's formally called "Correlation does not imply causation" and it's the fault of the person who is suggesting it.

Well, correlation sure DOES imply causation.

In the dictionary sense of "imply", as: "suggested but not directly expressed; implicit".

What it does not do is necessitate causation, but it sure as hell does imply it.


This is not true. Correlation means that there is some kind of relation between two variables (for example, how they change over time), but it says nothing about the mechanics of the relation, which is the causation.

Let me investigate this example:

the observation: Students who watch less TV have better grades.

It's not like pirates vs. global warming. It actually makes sense at first, and if you are the minister of education whose goal is to increase the grades among students and you don't know about the "Correlation does not imply causation" principle, you may actually suggest banning TVs to improve education. You can even draw a fancy graph to support your idea.

However, after further investigation you may find out that this would not work because it's not the TV that is causing the lower grades. Maybe the better students tend to watch less TV because they prefer to read books instead of watching TV.

This time it may seem to suggest the opposite: increasing the students' success in school reduces the time spent watching TV. However, after even further investigation you may find out that your better students are from poor families (maybe the schools in your country accept the best students without tuition and are not selective about those who can pay) that just don't have a TV set, or maybe these better students also need to work part-time to support their families, so they just don't have time for TV.

Thus, you can't really change one value by synthetically manipulating the other one. Turns out, the correlation did not mean causation.


It's not like that. Correlation does not imply causation only in the strict meaning of "imply", =>, in mathematical logic. But correlation is sure as hell evidence for causal relationships. Just yesterday I found a blog post by the author of the "Think Bayes" book explaining that nicely.

http://allendowney.blogspot.com/2014/02/correlation-is-evide...

Moreover, with good enough correlation data, you can even determine the direction of the causal relationship if you're willing to do a little bit of maths:

http://lesswrong.com/lw/ev3/causal_diagrams_and_causal_model...

Also, http://xkcd.com/552/ ;).


I would not go so far as to call correlation evidence. It can be taken as a clue for further investigation, because it also does not rule out causation.


It is evidence for causation, in the Bayesian statistics meaning of "evidence". The first article I linked explains the process in detail.


Thank you for the links.


He didn't say it proved causation, but it increases the probability of causation. Let's say you have several possible hypotheses:

Watching TV causes worse grades.

Watching TV has no effect on grades.

Something else causes watching TV AND worse grades.

Worse grades cause watching TV.

Watching TV causes better grades.

Etc, for all the other possible correlations between these three variables.

Assume all these are equally likely. That makes it about a 1/7 chance that watching TV causes worse grades (and there is an equal chance of the exact opposite). Now a study comes out showing that watching TV is correlated with worse grades. You can eliminate every hypothesis except the three that predicted the correlation. Now the hypothesis "watching TV causes worse grades" has a probability of 1/3, more than twice as likely as before. And the hypothesis that watching TV has any positive effect on grades has been completely eliminated.

This is why I'm bothered when people say "correlation doesn't imply causation!!!". No it doesn't, but it significantly raises the probability. If the other hypotheses aren't that likely to begin with (i.e. "cancer causes cellphones") then it should really bring that hypothesis to your attention.
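
The update from 1/7 to 1/3 can be checked with a few lines of Python. This is only a sketch under the comment's own assumptions: seven equally likely hypotheses, and an idealized likelihood of 1 for hypotheses that predict the observed correlation and 0 for the rest. Which three hypotheses count as "predicting" it is my reading of the list, and the last two labels are placeholders for the unlisted "etc." cases.

```python
from fractions import Fraction

# True means the hypothesis predicts "more TV goes with worse grades".
predicts_observed_correlation = {
    "TV causes worse grades": True,
    "something else causes both TV and worse grades": True,
    "worse grades cause TV": True,
    "TV has no effect on grades": False,
    "TV causes better grades": False,
    "other hypothesis A (placeholder)": False,
    "other hypothesis B (placeholder)": False,
}

# Uniform prior over all hypotheses.
prior = Fraction(1, len(predicts_observed_correlation))

# Bayes' rule with a 0/1 likelihood: keep the prior mass of consistent
# hypotheses, zero out the others, then renormalize.
unnormalized = {
    name: prior * (1 if consistent else 0)
    for name, consistent in predicts_observed_correlation.items()
}
total = sum(unnormalized.values())
posterior = {name: mass / total for name, mass in unnormalized.items()}

target = "TV causes worse grades"
print(prior, posterior[target], posterior[target] / prior)
# prints: 1/7 1/3 7/3
```

So under these assumptions the hypothesis becomes 7/3 times as likely as before. Real likelihoods are rarely exactly 0 or 1, so treat the numbers as a toy.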


Correlation does _suggest_ causation.



> There is a famous satirical version of this too: http://en.wikipedia.org/wiki/File:PiratesVsTemp(en).svg

What do you mean, satirical? There's a religion founded on this image alone!


This is my favorite example: http://i.imgur.com/gAGjP.png


I hear more and more chanting of "correlation does not equal causation!" which is great if your goal is to form a causal model of the world, but there are plenty of insights you can arrive at from correlation alone.

For starters in the world of machine learning and predictive analytics, it doesn't really matter if X causes Y so long as X is a consistently good predictor of Y. Maybe powerlines being over someone's home are not the cause of cancer, but if their presence can be used to predict cancer rates that's a good thing.

More important imho is the idea of latent or hidden variables. Two things that are clearly correlated but also seem to not have a causal relationship (such as transistors and longevity) may share a latent variable that may be either non-quantifiable or completely unobservable. In either case, measuring outputs that share a common latent variable, and thus correlate with each other, might be the only way to attempt to measure hidden, non-quantifiable causes.

For example, employee happiness might be the cause of employee retention. However, you can't currently measure or observe 'happiness', but there may be many seemingly unrelated employee activities that correlate with retention because they are also driven by this same latent variable. Studying them is the only way to get a quantifiable understanding of this latent cause.
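
A toy simulation of that idea, in Python. Everything here is hypothetical: a hidden "happiness" score drives two observable quantities, each with its own independent noise. The two observables end up correlated even though neither affects the other, and the correlation disappears once the hidden factor is subtracted out (which in practice you could not do, since it is unobserved).

```python
import math
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

random.seed(0)  # fixed seed so the run is repeatable

n = 5000
# Hypothetical latent factor that is never observed directly.
happiness = [random.gauss(0, 1) for _ in range(n)]
# Two observables, each equal to the latent factor plus independent noise.
retention_score = [h + random.gauss(0, 1) for h in happiness]
event_attendance = [h + random.gauss(0, 1) for h in happiness]

r_observed = pearson(retention_score, event_attendance)

# Remove the latent factor from both observables: only independent noise is left.
r_without_latent = pearson(
    [r - h for r, h in zip(retention_score, happiness)],
    [a - h for a, h in zip(event_attendance, happiness)],
)

print(f"observed correlation:            {r_observed:.2f}")
print(f"after removing the latent part:  {r_without_latent:.2f}")
```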

tl;dr sometimes correlation is just as important as causation.


Ironically, this graph does an excellent job of showing a correlation that really exists between the two data sets, albeit a non-linear one.

The only misleading thing about this graph is the title, which states a causal link with no evidence of one.


Yeah, I think this is completely off-base in blaming the axes. The axes are perfectly fine in showing the actual correlation that's present here (almost certainly due to a third common causal factor, of course).

There's really no magically "unbiased" way of choosing axes. There seems to be a popular view recently that you should always start your axes at zero, I assume as a backlash to some graphs magnifying very small differences by choosing zoomed-in axis values that visually exaggerate variation. That isn't really an absolute truth either, since interesting data regions are not always near zero. For example, if you graph temperature variation in different cities starting at 0 K, you can make it look like essentially all habitable cities have around the same temperature, somewhere in the range of 250-300 K give or take. Of course 250 K versus 300 K is a huge difference to human perception of temperature, while the entire range 0-200 K is more or less irrelevant when discussing weather, so starting your axis at 0 would be a poor choice. In this case that would be the biased choice, intended to visually minimize actually important variation by choosing an unreasonably low starting point for the Y axis.
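
As a rough way to quantify this, the Python sketch below computes how much of the axis height the data actually occupies for two axis choices. The temperatures and the "zoomed" axis limits are hypothetical, picked only to span the 250-300 K range mentioned above.

```python
def visible_spread(values, axis_min, axis_max):
    """Fraction of the axis height covered by the spread of the values."""
    return (max(values) - min(values)) / (axis_max - axis_min)

# Hypothetical city temperatures in kelvin.
temps_k = [250.0, 265.0, 280.0, 300.0]

# Axis starting at absolute zero: the 50 K spread uses 1/6 of the height.
from_zero = visible_spread(temps_k, 0.0, 300.0)

# Axis zoomed to the region where the data lives (limits are an assumption).
zoomed = visible_spread(temps_k, 240.0, 310.0)

print(f"axis 0-300 K:   {from_zero:.0%} of the height shows variation")
print(f"axis 240-310 K: {zoomed:.0%} of the height shows variation")
```

Neither number is "the right one"; the point is only that the choice of axis range decides how large the same 50 K difference looks.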


> (almost certainly due to a third common causal factor, of course).

..or absolutely certainly associated so remotely that correlation is purely accidental and found only by carefully cherry picking data sources, ranges, functions to massage the data (log with the right base) and axis ranges. To sum up ... not meaningful.

You could find similar correlation between ocean temperatures and lottery numbers but you'd have to precisely adjust so many inputs that the correlation would be not so much found as constructed.


"log with the right base" doesn't make any difference. That's a linear mapping. For instance:

    2log x = log x / log 2 = ln x / ln 2
(With 2log the logarithm in base 2, log and ln the logarithm in bases you pick, but you can think of 10 and e)

See, for example, http://www.purplemath.com/modules/logrules5.htm, or derive it from the axioms.
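
A quick numerical check of the same identity in Python, using a few arbitrary sample values: switching from base 10 to base 2 multiplies every logarithm by the same constant, 1 / log10(2), so the shape of a curve on a log axis does not depend on the base.

```python
import math

# Arbitrary positive sample values spanning several orders of magnitude.
values = [1.0, 10.0, 1e3, 1e6, 2.6e9]

log2_vals = [math.log2(v) for v in values]
log10_vals = [math.log10(v) for v in values]

# Change of base: log2(x) = log10(x) / log10(2).
factor = 1 / math.log10(2)

for v, l2, l10 in zip(values, log2_vals, log10_vals):
    # The two sides agree up to floating-point rounding.
    assert math.isclose(l2, l10 * factor, rel_tol=1e-9, abs_tol=1e-12)
    print(f"x = {v:g}: log2 = {l2:.4f}, log10 * factor = {l10 * factor:.4f}")
```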


I think this ln2 factor matters when you are drawing a graph. But I guess that's included in what I called axis range.


Actually, in this case I'd argue that a general/generic measure of "human scientific and technical advancement" is a decent underlying causal factor.

Maybe meaningful, if kind of obvious...


Maybe a bit off-topic here, but I think there is a cause-effect relationship between number of transistors and life expectancy. More transistors implies more computing power. More computing power leads to better/faster information processing, including medical information. This leads to faster patient diagnostics, better treatments (pharmaceutical innovations), earlier and more precise health warnings (lab tests and medical equipment), and so on.

Faster and better information processing leads also to higher food quality (food processing plants), higher life quality (environmental temperature and humidity control), etc.

Germany is a highly industrialized country, so information processing power causes a big social impact.


Maybe, but there is another, perhaps simpler hypothesis: Both the transistor count and life expectancy increase with time.

One way to verify one or the other is to look at a linear hypothesis:

    States: LifeExp, Transistors, Time
    Structural Equation Model [1]:
Model 1:

    LifeExp ~ Time + Transistors + noise_exp_1
    Transistors ~ Time + noise_trans_1
(The 2nd equation means that: Transistors(t) = a * t + noise, and you try to estimate "a" from the data.)

vs Model 2:

    LifeExp ~ Time + noise_exp_2
    Transistors ~ Time + noise_trans_2
If model 1 has more predictive power for LifeExp than model 2 (i.e., noise_exp_1 is lower than noise_exp_2), then according to _this_ model, Transistors causally affects LifeExp. However, this model is way too simplistic and doesn't incorporate other causal paths, such as the one you describe (Transistors -> Computing -> DiagnosticRate -> ...).

[1] http://en.wikipedia.org/wiki/Structural_equation_modeling
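
Here is a minimal Python sketch of that comparison using plain least squares. The data is synthetic and generated so that Model 2 is true by construction (both series depend on time only); the coefficients and noise levels are arbitrary. Model 1 is fitted in two steps: first remove the part of each series explained by Time, then regress what is left of LifeExp on what is left of Transistors, which gives the same residuals as the full two-predictor fit.

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares for ys ~ a + b * xs; returns (a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sxy / sxx
    return my - b * mx, b

def residuals(xs, ys):
    """What is left of ys after removing the best straight-line fit on xs."""
    a, b = fit_line(xs, ys)
    return [y - (a + b * x) for x, y in zip(xs, ys)]

def mean_sq(vals):
    return sum(v * v for v in vals) / len(vals)

random.seed(1)  # fixed seed so the run is repeatable

# Synthetic data: both series are driven by time alone.
years = list(range(41))
transistors = [0.5 * t + random.gauss(0, 1) for t in years]  # e.g. a log-count
life_exp = [70 + 0.25 * t + random.gauss(0, 0.5) for t in years]

# Model 2: LifeExp ~ Time
res_life = residuals(years, life_exp)
noise_exp_2 = mean_sq(res_life)

# Model 1: LifeExp ~ Time + Transistors
res_trans = residuals(years, transistors)
noise_exp_1 = mean_sq(residuals(res_trans, res_life))

print(f"noise_exp_2 (Time only)           = {noise_exp_2:.4f}")
print(f"noise_exp_1 (Time + Transistors)  = {noise_exp_1:.4f}")
```

One caveat: measured on the same data it was fitted to, the larger model can never have a higher residual than the smaller one, so a slightly lower noise_exp_1 is not evidence by itself. A fairer comparison would use held-out data or a penalty for the extra parameter.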


Math is not my field of expertise, but I think I get your point. My point was based on a social and economics perspective. I think these other causal paths are fundamental, because they are empirically verified (computing power vs health and life quality).


For causation, you need A/B testing -- real A/B testing (randomized controlled experiments).

Not everyone understands the need for A/B testing and not everyone understands how to do A/B testing.

I recently worked on modeling to forecast revenues for a large multinational. One of the variables that was essential for an accurate model was whether prior data came from before or after the company hired their current A/B testing guru. With the new guru, revenues rose every week. With the previous "guru", changes by Engineering had no impact on revenues.


Everyone's talking about "correlation vs. causation" but I'm pretty sure the page explicitly states that that's not what this is about:

> if its axes are chosen unwisely

The graph is misleading because it makes it look like the transistor count and the life expectancy grow at the exact same rate. It pretends there's a linear correlation. And that's the misleading part.


Those who have not read "How to Lie With Statistics" are doomed to reinvent it.

https://archive.org/details/HowToLieWithStatistics


Beyond Numeracy (John Allen Paulos) is another great book.

http://www.amazon.com/gp/aw/d/067973807X


This is only one misleading graph, which, to be honest, I find misleading.


I feel that many missed my point: It’s not to say causation never implies correlation, that there’s no common source or that latent factors are fiction.

In my experience many people believe something is true just because of “math” or “data”. So, this is basically a variation of the joke that 73.37% of people put more trust into statistics if the value isn’t rounded. Since you all critically discussed the topic, you are far beyond what this little project can teach you.

If you have suggestions on how to express this more clearly, please let me know.



I think this is a very good idea. Lots of people still mix up correlation and causation. Either this tool will make them understand why this is wrong, or they will enjoy finding lots of weird relations between data sources.



