I have still not given up hope that this conversation is useful for somebody.

Here is my redoubled effort.

Part of what gives me hope is that the CEO of a prominent data analysis company, who does have a background in statistics and data analysis, and has a PhD in computational mathematics from Stanford, said that my original comment was "amazing" and that "the complacency of selecting variables is lost on the hive mind."

And so, while I have diminished hope of getting through to you this time, since things seem to have regressed to assertions that I don't have the "faintest clue" about things like model fitting and significance (this is absolutely false; I'm quite deeply aware of what they mean), this conversation does have at least some merit, even if only outside of this surprisingly argumentative Hacker News context.

Including a specific number of terms in a Taylor series expansion (as per the suggestion of xapata), or in any expansion, be it a Fourier expansion, a Lagrange expansion, or whatever, is a somewhat perilous choice that can distort the meaning. Any form of model fitting has this problem. But one cannot dismiss all other models that could be fit to the data as insignificant in this case!

In particular, the choice of a quadratic function for fitting, using constant, linear, and quadratic terms, automatically distorts this data, because of the nature of the data, where there is an anti-correlation between unemployment and cognitive indicators.

This is demonstrated by the following example, which took me about 5 minutes to construct -- and a bit more to explain and write about here.

Suppose the data show a completely flat response for IQ versus working hours, except for the unemployed population, which has a lower set of cognitive indicators.

The data and curve are linked here.

https://mycurvefit.com/share/0530c696-2eb0-4f9f-8af1-3277c5b...

In this example, the data shows no optimum number of working hours, and IQ doesn't diminish for more hours worked. But the quadratic fit does suggest this: a peak for IQ near 25 hours worked.
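For anyone who wants to reproduce the effect without the linked page, here is a minimal sketch. The numbers are made up by me for illustration (they are neither the study's data nor the exact points in the linked fit): the score is flat at 100 for everyone who works and lower only at 0 hours.

```python
import numpy as np

# Hypothetical toy data (NOT the study's data): cognitive score is flat
# at 100 for everyone who works, and lower (90) only at 0 hours worked.
hours = np.array([0, 10, 20, 30, 40, 50], dtype=float)
score = np.array([90, 100, 100, 100, 100, 100], dtype=float)

# Least-squares quadratic fit: score ~ a*hours**2 + b*hours + c
a, b, c = np.polyfit(hours, score, deg=2)

# A downward-opening parabola (a < 0) has its maximum at -b / (2a).
peak_hours = -b / (2 * a)

print(f"a = {a:.5f}, b = {b:.4f}, c = {c:.2f}")
print(f"fitted 'optimum' at about {peak_hours:.1f} hours per week")
```

With these particular points the fitted parabola opens downward and its maximum sits inside the range of hours (around 33), even though the underlying data has no optimum at all. Where exactly the peak lands depends on how the points are spread along the hours axis, so it should not be read as a property of the data.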

Obviously, this example is not the data the study worked on. The study doesn't directly share the data. But, from the graphs on page 20, the example I constructed is quite like the data. The part-time and full-time work probability density curves are practically identical to one another -- they are right on top of each other. The only really significant difference is between the working and not working populations.

Yet, the authors do not hedge their findings.

"Our findings show that there is a non-linearity in the effect of working hours on cognitive functioning. For working hours up to around 25 hours a week, an increase in working hours has a positive impact on cognitive functioning. However, when working hours exceed 25 hours per week, an increase in working hours has a negative impact on cognition."

and the study concludes "Our study highlights that too much work can have adverse effects on cognitive functioning."

In my judgment, this analysis does not demonstrate this, even though it would be convenient for me for this to be true.

PS:

Because they did two stage least squares, and instead of directly using working hours they used fitted values, there is a slight adjustment that needs to be done to the example above, in order to be relevant.

It is not entirely obvious exactly how well the anti-correlation for cognitive indicators will carry through after "working hours" are estimated by regression with the variables:

Vacancy rate, Inner regional, Outer regional, Remote, Very remote, Number of dependent, Children, Parent is still alive, Other public benefits, Australian citizen, Work experience, Ownhouse.

The strongest connection by far is "other public benefits", a variable with an effect measured in dozens of hours of work per week. Everything else has a far smaller effect. So, effectively, what the second stage of least squares is really doing is a regression of cognitive indicators on the variables above, and mostly on whether or not a person receives public benefits.

A large fraction of the people who receive public benefits will have their "estimated work hours" estimated below 0, and will then be reassigned to 0 for the purposes of the final regression. Hence, if there is an anti-correlation between "receiving other public benefits" (their terminology) and cognitive indicators -- and there is -- it will appear that cognitive indicators are significantly lower at low values of the fitted WH* that they estimate.

After that, the rest of my toy example is still quite apt -- there can be no effect in IQ as a function of WH* or WH (as measured) outside of the unemployed population, even though the quadratic analysis will suggest an optimum.
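To make the PS concrete, here is a rough simulation of the first stage. Everything in it is an assumption of mine for illustration: the population is simulated, the effect sizes are invented, and "experience" stands in for all of the weaker variables in the list. It is not the authors' procedure or data. In the simulation the score depends only on receiving benefits and has no direct dependence on hours.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000

# Hypothetical simulated population (NOT the study's data).
benefits = (rng.random(n) < 0.3).astype(float)  # receives other public benefits
experience = rng.random(n)                      # stand-in weak variable, 0..1

# Assumed: benefits cut hours by dozens per week; experience barely matters.
hours = 38 - 36 * benefits + 4 * (experience - 0.5) + rng.normal(0, 5, n)
hours = np.clip(hours, 0, None)

# Assumed: score depends on benefits only, with NO direct effect of hours.
score = 100 - 10 * benefits + rng.normal(0, 3, n)

# First stage: ordinary least squares of hours on the variables.
X = np.column_stack([np.ones(n), benefits, experience])
coef, *_ = np.linalg.lstsq(X, hours, rcond=None)

# Fitted hours WH*, with negative fitted values reassigned to 0
# (the clamp may rarely trigger for these particular parameters).
wh_star = np.clip(X @ coef, 0, None)

corr = np.corrcoef(wh_star, benefits)[0, 1]
low = score[wh_star < 20].mean()
high = score[wh_star >= 20].mean()

print(f"corr(WH*, benefits) = {corr:.3f}")
print(f"mean score, WH* < 20:  {low:.1f}")
print(f"mean score, WH* >= 20: {high:.1f}")
```

Under these assumptions WH* is almost a relabeling of the benefits indicator, so any gap in cognitive indicators between recipients and non-recipients shows up as a gap between low and high WH*, which is the step shape that the quadratic in my toy example then bends into an apparent optimum.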