N=36 and a testing protocol of 4 servings over a short period of time is all you need to know.
Even if you assume that everything they found is true, the effect could disappear or change into something else a month after initiation (dose accommodation, etc.).
Since you seem so certain, I have to ask: have you done the math to show that 36 isn't enough to tell you anything in this case?
Depending on the effect size and probability distributions involved, N=36 can give you a pretty respectable picture. It's not good enough to be the final word on the matter, but most papers aren't aiming to be that, nor should they. It seems to me these researchers were trying to scientifically test an idea that floats around coffee communities (I've heard specifically that theanine in your coffee will do this). There's no reason to "go big" right away when doing such a study. You start small to see if there's any hope for the idea, then try to get funding for a larger study by publishing.
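To put a rough number on that, here's a back-of-the-envelope power check in Python (statsmodels) for a within-subject comparison at n=36. The Cohen's d values are made-up illustrations, not anything from the paper:

    # Rough power of a one-sample / paired t-test with n=36, alpha=0.05,
    # two-sided. The effect sizes (Cohen's d) are illustrative guesses,
    # not numbers taken from the paper.
    from statsmodels.stats.power import TTestPower

    analysis = TTestPower()
    for d in (0.2, 0.5, 0.8):  # Cohen's rough "small", "medium", "large"
        power = analysis.solve_power(effect_size=d, nobs=36, alpha=0.05,
                                     alternative='two-sided')
        print(f"d = {d}: power ~ {power:.2f}")
    # roughly 0.21, 0.83, and 1.00 respectively

So it cuts both ways: if the true effect is medium-to-large, n=36 paired measurements are perfectly capable of detecting it; if it's small, they probably aren't.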
>4 servings over a short period of time
The definition of "a short period of time" is relative, and what you're calling a short period of time in this study seems ludicrous. The participants were given the servings on entirely different *days*, which is far longer than either caffeine or theanine remains active.
Finally, the article on HN the other day about bad research was not about small sample sizes; it was about outright fraudulent data. It's certainly possible that this study is a fraud, but even if we grant that the sample size is small and the interval between administrations is too short, that's not evidence of fraud.
What you'd hope is that someone would use that result to get funding for a bigger study. I've got a lot of sympathy for the sceptical position, but you've got to start somewhere.
> I'd be more impressed if you demonstrated why the sample size lacks the power to demonstrate an effect.
Unless and until someone in this thread gets a copy of the paper so we can find out the effect sizes involved, we simply aren't able to objectively assess the study's statistical power.
But even then, I'm perhaps more worried about the file drawer effect. The type 1 error rate is fixed at 5%, n=36 studies are cheap, and p>.05 studies never get published. And we're looking at exactly one paper here. As far as I'm concerned, you can't have credibility without replicability.
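To make the file-drawer worry concrete, here's a toy simulation (numpy/scipy, every number invented): a pile of cheap n=36 studies of a truly null effect, where only the p < .05 ones ever see print.

    # Toy file-drawer simulation: many labs run cheap n=36 studies of an
    # effect that is actually zero; only p < .05 results get "published".
    # All numbers here are invented for illustration.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    n, n_labs, true_effect = 36, 10_000, 0.0

    published = []
    for _ in range(n_labs):
        diffs = rng.normal(true_effect, 1.0, n)   # within-subject differences
        t, p = stats.ttest_1samp(diffs, 0.0)
        if p < 0.05:                              # the only studies that see print
            published.append(abs(diffs.mean()) / diffs.std(ddof=1))

    print(f"published: {len(published)}/{n_labs} studies")
    print(f"mean published |d|: {np.mean(published):.2f}")
    # With a true effect of zero, roughly 5% of studies still come out
    # "significant", and their effect sizes all sit well away from zero.

None of which says this particular study did anything wrong; it just means one unreplicated p < .05 result, viewed in isolation, carries less information than it appears to.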
It’s not just about raw numbers, though; you need to consider the experimental design, which looks really tight in this case.
It’s a repeated-measures study, and from the looks of it all subjects spent time in each of the four treatments plus control, so it’s direct comparisons of the same people in each condition. They used three separate measures. Accounting for all that (36 subjects × 5 conditions × 3 measures), they’re working with something like 540 data points, and the fact that it’s the same people in each set is a nice little feature for direct comparisons rather than a limitation. They even double-blinded everything. All of that has to count for something.
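For what it's worth, here's a quick sketch of why the within-subject part matters (all numbers invented, nothing from the paper): when each person serves as their own control, the stable person-to-person differences cancel out of the comparison.

    # Sketch of the within-subject advantage (all numbers invented):
    # each subject has a stable personal baseline; the treatment adds a
    # small shift on top of it.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)
    n = 36
    baseline = rng.normal(0.0, 1.0, n)                    # person-to-person variation
    control  = baseline + rng.normal(0.0, 0.3, n)
    treated  = baseline + 0.25 + rng.normal(0.0, 0.3, n)  # small true effect

    _, p_ind = stats.ttest_ind(treated, control)   # pretends they're different people
    _, p_rel = stats.ttest_rel(treated, control)   # uses the pairing

    print(f"independent-groups t-test: p = {p_ind:.3f}")
    print(f"paired (within-subject):   p = {p_rel:.3f}")
    # The paired test strips out the baseline noise, which is where much of
    # the extra sensitivity of a repeated-measures design comes from.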
It absolutely does, but it also leaves me even less inclined to make much of the abstract alone. You clearly have access to the full text. I don't, so the abstract and hearsay from others are all I've had to go on. But my concern about a repeated-measures study is that that extra power doesn't come for free. It comes along with a bunch of new and subtle ways for endogeneity to sneak into your model, and, sadly, the menu of techniques for dealing with that introduces a lot of new ways to (accidentally or otherwise) engage in p-hacking.
Since a lot of the things that matter happen behind closed doors and aren't necessarily mentioned in the paper (elsewhere someone quoted Gilman; another of his good zingers is something to the effect of "You don't talk about your exes during a date"), there's also just too much room for people to fire spitballs from the back row when you've got a complicated design like that.
On the other hand, a successful, independent replication can be quite compelling. Not only that, but it's on philosophically firmer ground. There's a reason why it was so central to Popper's original formulation of the scientific process. It's the empirical way to vet a result. Squabbling over the statistics, by contrast, frequently devolves into a kind of sophistry with a different mix of Greek letters. It's great fun for economists, but this isn't economics; it's science.
There are pharma companies worth over a billion dollars that are built on studies with smaller sample sizes, so I don't see N=36 as a deal-breaker, especially when there's very little financial incentive here.
And the abstract doesn't even give enough information to describe what they actually found. It gives p-values, but not effect sizes.