When a small sample size produces a big effect, it does not mean you can be more confident in the effect!
All else being equal, with a sample that small you'd expect confidence intervals / credible regions to be very wide.
Another way to say this: because the measurement error is large, any spurious findings that result from something like p-hacking will tend to have large apparent effect sizes (they have to be large to clear the significance threshold).
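To make that concrete, here is a quick simulation sketch (the true effect and sample sizes are made up, nothing to do with the actual study) of how wide a 95% confidence interval for a between-group difference is with ~20 people per arm versus larger samples:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
true_effect = 0.3  # assumed standardized effect (made up, in SD units)

for n in (20, 200, 2000):
    # one two-group comparison with n subjects per arm
    treat = rng.normal(true_effect, 1.0, n)
    control = rng.normal(0.0, 1.0, n)
    diff = treat.mean() - control.mean()
    se = np.sqrt(treat.var(ddof=1) / n + control.var(ddof=1) / n)
    half_width = stats.t.ppf(0.975, df=2 * n - 2) * se
    print(f"n per arm = {n:4d}   estimate = {diff:+.2f}   95% CI half-width = {half_width:.2f}")
```

At 20 per arm the half-width is around 0.6 SD, so the interval covers everything from "no effect" to "huge effect".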
What annoys me, as someone not trained in statistics, is that these discussions are always abstract, in-principle arguments about what proper statistics should look like.
Meanwhile, we have a real paper with real p-values; why not discuss the validity of those?
> SRT Consistent Long-Term Retrieval improved with curcumin (ES = 0.63, p = 0.002) but not with placebo (ES = 0.06, p = 0.8; between-group: ES = 0.68, p = 0.05). Curcumin also improved SRT Total (ES = 0.53, p = 0.002), visual memory (BVMT-R Recall: ES = 0.50, p = 0.01; BVMT-R Delay: ES = 0.51, p = 0.006), and attention (ES = 0.96, p < 0.0001) compared with placebo (ES = 0.28, p = 0.1; between-group: ES = 0.67, p = 0.04). FDDNP binding decreased significantly in the amygdala with curcumin (ES = −0.41, p = 0.04) compared with placebo (ES = 0.08, p = 0.6; between-group: ES = 0.48, p = 0.07). In the hypothalamus, FDDNP binding did not change with curcumin (ES = −0.30, p = 0.2), but increased with placebo (ES = 0.26, p = 0.05; between-group: ES = 0.55, p = 0.02).
I can see that there are a lot of p-values around 0.05, but that the supposed improvements have much lower p-values:
(ES = 0.63, p = 0.002) SRT Consistent Long-Term Retrieval
(ES = 0.53, p = 0.002) SRT Total
(ES = 0.50, p = 0.01) BVMT-R Recall
(ES = 0.51, p = 0.006) BVMT-R Delay
(ES = 0.96, p < 0.0001) attention
What does this imply? Is it a good sign, or does it actually make it more fishy? pythonslange's comment[0] seems to suggest the latter, claiming that none of the other pre-registered tests are mentioned, without any explanation why.
(eyeing that "attention" one: if this does hold up under scrutiny, could I expect turmeric-based ADD medication somewhere in the future, without all the nasty side-effects of my current amphetamine-based options?)
Here is the problem: If there is an actual large effect size, then you can detect this in small samples. However, the opposite is not true. In a small sample, spurious large effects are MORE likely, not less likely (outliers have a greater impact in a small sample). See: http://andrewgelman.com/2017/08/16/also-holding-back-progres...
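Gelman's name for this is a Type M (magnitude) error, amplified by the statistical significance filter. A throwaway simulation (my own numbers: an assumed true effect of 0.2 SD and 20 per arm, roughly this trial's size) shows how the results that survive p < 0.05 in a small sample systematically overstate the effect:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
true_effect, n, trials = 0.2, 20, 20000  # small true effect, ~20 per arm

significant = []
for _ in range(trials):
    treat = rng.normal(true_effect, 1.0, n)
    control = rng.normal(0.0, 1.0, n)
    t, p = stats.ttest_ind(treat, control)
    if p < 0.05 and t > 0:
        significant.append(treat.mean() - control.mean())

print(f"true effect: {true_effect}")
print(f"mean estimate among the 'significant' results: {np.mean(significant):.2f}")
# At this sample size, the estimates that clear p < 0.05 come out roughly
# three times larger than the true effect: the significance filter at work.
```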
I agree with the other commenter, and Gelman's blog is a good place to start. There are very clear, concrete statistical arguments, but they are difficult to summarize in the comments section of HN.
The gist is this: imagine someone had people play 20 different slot machines, then went into a private room and looked at the results for each machine. Afterwards, they come out with the results for 5 of the machines and say, "look, our slot machines pay out at a higher-than-chance level!"
Do you believe them? I hope not. If only 4 machines had done well, maybe they would have shown only 4, or 3, and so on. They've effectively stacked the deck.
On the other hand, suppose someone said they were running an honest experiment with 20 slot machines. How many machines would you expect them to report on?
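If you want to put numbers on the analogy, here's a quick sketch with 20 perfectly fair "machines" (i.e. 20 outcome measures with no real effect anywhere), where only the ones that happen to hit p < 0.05 get reported:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n_machines, n_per_group = 20, 20

reported = []
for m in range(n_machines):
    # both arms drawn from the same distribution: no real effect anywhere
    treat = rng.normal(0.0, 1.0, n_per_group)
    control = rng.normal(0.0, 1.0, n_per_group)
    t, p = stats.ttest_ind(treat, control)
    if p < 0.05:
        reported.append((m, treat.mean() - control.mean(), p))

print(f"machines that 'won' out of {n_machines}: {len(reported)}")
for m, effect, p in reported:
    print(f"  machine {m:2d}: effect = {effect:+.2f}, p = {p:.3f}")
# With 20 comparisons at alpha = 0.05, the chance of at least one spurious
# 'win' is about 1 - 0.95**20, i.e. roughly 64%, and the winners always
# look impressively large.
```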
I can follow your analogy, but it doesn't quite add up.
The researchers took a group of people and submitted each of them to an "extensive neuropsychologist test battery".
That's not just testing 20 different slot machines, that's testing 20 different designs of slot machines.
(Also, each person takes each test on their own, which is the equivalent of having them not only play twenty different models, but a unique production unit of each model per person)
In that light, the claim that all slot machines have a high pay-out chance is obviously suspicious, but would coming back and saying "these five designs have a higher than chance level of winning!" be an incorrect conclusion?
If each slot machine type is unique, no. But if one slot machine design is known to have a flaw, and if it shares this flaw with another design, and if that other design does not show the same increased performance, then things get really suspicious.
So the question becomes: do we know how strongly correlated the results of these tests typically are? If that is a lot (which I would expect to be true with at least some of these tests), the absence of the other tests is suspicious. If it is low, it might be less of an issue.
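For what it's worth, here's a rough simulation of that intuition (made-up numbers: 5 related outcome measures, 20 per arm, no real effect): when the measures are strongly correlated, spurious significances tend to arrive in clusters, which is exactly why the unreported tests matter.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n, n_tests, trials = 20, 5, 5000  # ~20 per arm, 5 related outcome measures

for rho in (0.0, 0.7):
    # pairwise correlation rho between the outcome measures, no real effect
    cov = np.full((n_tests, n_tests), rho) + (1 - rho) * np.eye(n_tests)
    clustered = 0
    for _ in range(trials):
        treat = rng.multivariate_normal(np.zeros(n_tests), cov, n)
        control = rng.multivariate_normal(np.zeros(n_tests), cov, n)
        _, p = stats.ttest_ind(treat, control, axis=0)
        clustered += (p < 0.05).sum() >= 2
    print(f"rho = {rho}: trials with 2+ spurious 'significant' outcomes: "
          f"{clustered / trials:.1%}")
# Correlated measures tend to go significant together, so a cluster of low
# p-values on related tests is not a set of independent confirmations, and
# the silence about the remaining tests becomes more telling.
```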
You hit on a lot of important points. The key here is that they are making a claim about memory, a psychological construct, that is being represented by some of their measures (batteries). But their batteries also purport to measure other psychological constructs, too.
To connect this with the slot machine analogy, it might be like if groups of slot machines had different colors, and they chose the color that yielded the best results.
> but would coming back and saying "these five designs have a higher than chance level of winning!" be an incorrect conclusion?
It would be, if you didn't take into account in your analysis that you looked at 20 machines.
The number of hypothesis tests is problematic (alpha inflation / familywise error rate) and the authors simply mention, "Another limitation was that we did not correct for multiple tests in the analyses as this was a pilot trial."
However, identifying their study as a pilot study does not excuse the lack of correction for multiple tests. These procedures are well known, so skipping them maximizes the chance of ending up with false positives. The study therefore presents the scenario most favorable to false positives, which seems to reflect the authors' priority of avoiding failures to detect effects.
There is always a statistical trade-off between reducing the number of failures to detect and reducing the number of false positives, so readers are left to their own devices to decide whether the authors' approaches and interpretations were justified.
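For a rough sense of what a correction would do, here is a sketch applying a plain Bonferroni adjustment to just the four between-group p-values I can read off the abstract quoted upthread (the mapping is my reading of the abstract; the actual battery involved many more tests, which would make the adjustment harsher):

```python
# Between-group p-values as I read them in the quoted abstract; the full
# test battery had many more comparisons than these four, so this is lenient.
pvals = {
    "SRT Consistent Long-Term Retrieval": 0.05,
    "Attention": 0.04,
    "FDDNP binding, amygdala": 0.07,
    "FDDNP binding, hypothalamus": 0.02,
}

m = len(pvals)
for name, p in sorted(pvals.items(), key=lambda kv: kv[1]):
    adjusted = min(p * m, 1.0)  # plain Bonferroni adjustment
    verdict = "still significant" if adjusted < 0.05 else "not significant"
    print(f"{name:36s} raw p = {p:.2f}   adjusted p = {adjusted:.2f}   {verdict}")
# Even correcting for only these four tests, none of the between-group
# comparisons stays below 0.05.
```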
The study's strength of argument is reduced by:
* The authors' self-description of the study as a pilot study with a small sample size,
* no corrections for multiple testing,
* use of a non-representative sample ("Only approximately 15% of the screened volunteers were included in the study, and our recruitment method yielded a sample of motivated, educated, physically healthy subjects concerned about age-related memory problems. The sample, therefore, was not representative of the general population.")
* PET brain imaging results self-described as "exploratory" (I take the word exploratory to mean "ourselves and others need not trust these results or interpretations" or at least "different results would not be unexpected")
What I find least appealing is the lack of a specific conflict of interest statement, even though the authors' financial interests in the substance being studied and the company selling it are spelled out. As noted in "Industry Sponsorship and Financial Conflict of Interest in the Reporting of Clinical Trials in Psychiatry" from the American Journal of Psychiatry (https://ajp.psychiatryonline.org/doi/abs/10.1176/appi.ajp.16...), "Author conflict of interest appears to be prevalent among psychiatric clinical trials and to be associated with a greater likelihood of reporting a drug to be superior to placebo." (Perlis et al., 2005), and this goes a long way toward explaining the primary conclusion of the current study: better than placebo.
We can't fault the inventors of the substance, the holders of the patent, and those with financial interests in the company that sells the product for wanting to test their product and promote positive findings, but how objective are the investigators, and how rigorously are they trying to apply the notion of falsification to their own ideas?
They have everything to lose from falsification: negative results would undermine the parent company's claim that "Theracurmin® product is one of the most advanced and studied, highly bioavailable forms of curcumin in the marketplace." (http://theravalues.com/english/). Looking at the research page, only a small number of studies are shown at http://theravalues.com/english/research-clinical-trials/ . All positive outcomes, of course.
What sample sizes do we see in other studies? How many other studies of this substance are just as weak in terms of sample size? 6 rats here, 6 people there, 40 people here... not convincing. If anything, the strong marketing hype built on such studies makes me more wary and less trusting of the marketing and scientific claims.
Given that the Theravalues clinical trials website promotes the product for "Progressing malignancies, Mild cognitive impairment (the study highlighted in the parent post), Heart failure / diastolic dysfunction, Cachectic condition, Osteoarthritis, Crohn’s disease, Prostate-Specific Antigen after surgery", I'm left with distrust.
Kudos for the effort to begin product testing, and this sort of research is time-consuming and expensive, but studies with sample sizes of 40 people do not support the company's marketing hype. If this study is positive enough for the authors to obtain more grant funding and run a much larger and better clinical trial, then I would be interested to see what is found, and happier still if authors with no financial or personal conflicts of interest ran the study.
I know you want this to be true given your stated dissatisfaction with the current state of attention-deficit medication, and I know that arguing about statistics seems pointless, but this is n=40 (21 treatment, 19 control). No matter how good your p-values are or how many different tests you run, there's just nothing to take away from this right now.
That is not to say it's worthless. If different researchers run a number of similar trials and this study is combined with them in a larger meta-analysis that continues to show significant results, then it could end up being a very important study to look back on.
But right now it's just 40 people eating refined turmeric and taking memory tests...
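To put a number on that: here's a back-of-the-envelope 95% confidence interval for the headline between-group effect size, using the usual large-sample approximation for the standard error of Cohen's d (my own sketch, not a calculation from the paper):

```python
import math

n1, n2 = 21, 19   # treatment and control arm sizes
d = 0.68          # reported between-group effect size for SRT CLTR

# large-sample approximation to the standard error of Cohen's d
se = math.sqrt((n1 + n2) / (n1 * n2) + d ** 2 / (2 * (n1 + n2)))
lo, hi = d - 1.96 * se, d + 1.96 * se
print(f"d = {d:.2f}, approximate 95% CI: ({lo:.2f}, {hi:.2f})")
# Comes out to roughly (0.04, 1.32): consistent with anything from a
# negligible effect to an enormous one.
```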
Yeah, I'm aware of my own bias. But maybe I phrased myself badly: it's not that I think that arguing about statistics is pointless, it's that this discussion often feels like it becomes disconnected from the paper at hand.
The replies to my complaint are nice counterexamples of that, though.
Most discussions about research findings are like that and HN is probably no exception. :-)
"The sample size is too small" or "they didn't control for [x] effect" are the sort of gut armchair comments people make offhand, knowing it probably applies to any study if you get challenged. I assume that's the sentiment you reacted to and I agree it can be frustrating.