> A good rule of thumb is to assume that around 90% of the information you see in Elicit is accurate. While we do our best to increase accuracy without skyrocketing costs, it’s very important for you to check the work in Elicit closely. We try to make this easier for you by identifying all of the sources for information generated with language models.
A 90% accuracy rate seems like the sweet spot between "an annoying waste of time" for honest researchers and "good enough to publish" for dishonest careerists.
I don't like disparaging the technology experts who work on these things. But as a business matter, 1/10 answers being wrong just is not good enough for a whole lot of people.
I don't think the number is as important as the question of how someone would be expected to magically know which 10% is wrong and needs to be corrected.
This is a good point! (Hopefully) obviously, if we knew a particular claim was fishy, we wouldn't make it in the app in the first place.
However, we do do a couple of things which go towards addressing your concern:
1. We can be more or less confident in the answers we're giving in the app, and if that confidence dips below a threshold we mark that particular cell in the results table with a red warning icon which encourages caution and user verification (there's a rough sketch of the idea just after this list). This confidence level isn't perfectly calibrated, of course, but we are trying to engender a healthy, active wariness in our users so that they don't take Elicit results as gospel.
2. We provide sources for all of the claims made in the app. You can see these by clicking on any cell in the results table. We encourage users to check—or at least spot-check—the results which they are updating on. This verification is generally much faster than generating the answer in the first place.
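To make point 1 a bit more concrete, here's a rough sketch of the thresholding idea. It's a toy illustration, not our actual code: the threshold value and the field names are invented.

```python
# Toy sketch of confidence-thresholded warnings (not Elicit's actual implementation).
# The threshold value and the Cell fields below are invented for illustration.
from dataclasses import dataclass

CONFIDENCE_THRESHOLD = 0.8  # illustrative cutoff, not a real product parameter

@dataclass
class Cell:
    answer: str
    confidence: float      # model-derived confidence in [0, 1]; imperfectly calibrated
    sources: list[str]     # the passages/papers the claim was drawn from

def needs_warning(cell: Cell) -> bool:
    """Mark the cell with a red warning icon if confidence dips below the threshold."""
    return cell.confidence < CONFIDENCE_THRESHOLD

cell = Cell(answer="n = 42 participants", confidence=0.64, sources=["example source"])
if needs_warning(cell):
    print("Low confidence: please check the cited sources before relying on this cell.")
```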
This is true but if the error rate were 1/1000 I could see the risk management argument for using this thing. 1/100 is pushing it. 1/10 seems unconscionably reckless and lazy.
> if it takes 1 hour to get one answer by hand, but only 20 minutes for the machine, and 20 minutes to check the answer, the user still comes out ahead
If it's right 90% of the time, with those other assumptions, then you either:
1) spend 10 hours doing all of them by hand
2) spend 3h 20 waiting for the machine, 3h 20 checking the machine, and 1h replacing the machine's mistake with a hand-written version, for a total of 7h 40
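Spelling that arithmetic out, using only the assumptions quoted above (so purely illustrative):

```python
# Spelling out the arithmetic above, using only the quoted assumptions.
tasks = 10
hand_minutes = 60        # doing one answer entirely by hand
machine_minutes = 20     # waiting for the machine, per answer
check_minutes = 20       # checking one machine answer
accuracy = 0.9           # claimed fraction of machine answers that are correct

by_hand = tasks * hand_minutes                                    # 600 min = 10h
redo = tasks * (1 - accuracy) * hand_minutes                      # ~60 min redoing the 1 failure
with_machine = tasks * (machine_minutes + check_minutes) + redo   # ~460 min = 7h 40
print(by_hand, round(with_machine))                               # 600 460
```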
(I never trust marketing claims, so I doubt 90% accuracy; but also it generally takes LLMs a few seconds rather than tens of minutes to produce an output to be checked).
There's also the issue that if you aren't doing the work much anymore, will you continue to be able to competently check its output? Will you be able to intervene when it makes a mistake?
I think it's tempting but oversimple to focus on "output" and "time saved generating it," but that misses all the other stuff that happens while doing something, especially when it's a "softer" task (vs. say, mechanical calculation). It also seems like a mindset focused on selling an application rather than doing a better job.
> Sure, but that's part of a much broader issue that predates AI by at least two millennia, probably much longer — the principal-agent problem.
I don't think that captures what I'm thinking about, which is more a skills-atrophy ("Children of the Magenta") problem than a conflict-of-interest problem.
> William Langewiesche's article analyzing the June 2009 crash of Air France flight 447 comes to this conclusion: “We are locked into a spiral in which poor human performance begets automation, which worsens human performance, which begets increasing automation” (www.vanityfair.com/news/business/2014/10/air-france-flight-447-crash).
> ...
> Langewiesche's rewording of these laws is that “the effect of automation is to reduce the cockpit workload when the workload is low and to increase it when the workload is high” and that “once you put pilots on automation, their manual abilities degrade and their flight-path awareness is dulled: flying becomes a monitoring task, an abstraction on a screen, a mind-numbing wait for the next hotel.”
Ah, yes, I think I get you this time. (Is it just me, or does that now feel hideously clichéd from the LLMs doing that every time you say "no" to them? Even deliberately phrasing it unlike the LLMs, it suddenly feels dirty to write the same meaning, and I'm not used to that feeling from writing).
I still think it's a concern at least as old as writing, given what Socrates is reported to have said about writing — that it meant we never learned to properly remember, and it was an illusion of understanding rather than actual understanding.
(That isn't a "no", by the way; merely that the concern isn't new).
Those numbers are arbitrary and fictional, and the more relevant made-up quantity would be the variance rather than the mean. It doesn't really matter if the "average user" saves time over 10,000 queries. I am much more concerned about the numerous edge cases, especially if those cases might be "edge fields" like animal cognition (see below).
In my experience it takes quite a bit longer to falsify GPT-4's incorrect answers than it does to do a Google search and get the right answer. It might take 30 seconds to check a correct answer (jump to the relevant paragraph and check), but 30 minutes to determine where an incorrect answer actually went wrong (you have to read the whole paper in close detail, and maybe even relevant citations). More specifically, it is somewhat quick to falsify something if it is directly contradicted by the text. It is much harder to falsify unsupported generalizations or summaries.
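To put rough numbers on that asymmetry, using my own estimates above plus the advertised 90% rule of thumb (so take this as illustrative only):

```python
# Rough expected checking cost per answer, using the estimates above.
p_correct = 0.9          # the advertised rule-of-thumb accuracy
confirm_minutes = 0.5    # ~30 seconds to confirm a correct, well-sourced answer
falsify_minutes = 30.0   # ~30 minutes to work out where an incorrect answer went wrong

expected = p_correct * confirm_minutes + (1 - p_correct) * falsify_minutes
print(round(expected, 2))  # ~3.45 minutes per answer, most of it from the 10% of errors
```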
As a specific example, I recently asked GPT for information on arithmetic abilities in amphibians. It made up a study - that was easy to check - but it also made up a bunch of results without citing specific studies. That was not easy to check[1]: each paragraph of text GPT generated needed to be cross-checked with Google Scholar to try and find a relevant paper. It turned out that everything GPT said, over 1000 words of output, was contradicted by actual published research. But I had to read three papers to figure that out. I would have been much better off with Google Scholar. But I am concerned that a large minority of cynical, lazy people will say "90% is good enough, I don't want to read all these papers and nobody's gonna check the citations anyway" and further drag down the reliability of published research.
[1] This was a test of GPT. If I were actually using it for work, obviously I would have stopped at the fake citation.
I'm not sure I agree that those rule-of-thumb statistics are "arbitrary" or "fictional"… I guess it depends on what you mean by that. I can say that on our part they're a good faith attempt to help users calibrate how best to use the tool, using evaluations of Elicit based on real usage.
Definitely accept that the tool can work better or worse depending on your domain or workflow though!
One way we do try to distinguish ourselves from vanilla LLMs is that we provide sources for all of the claims made. I mention this because we hope our users can get close to the kind of falsification process you describe with Google Scholar. We want to show people where particular claims come from so that we earn their trust.
Walking citation trails and verifying transitive claims is something we've talked about but need more people to implement! (https://elicit.com/careers)
This is impacting online discourse too. It used to be that when someone was wildly wrong it was relatively easy to identify why: ideology, a common urban myth, outdated research, whatever.
Now? I’ve seen people argue positions that are demonstrably wildly wrong in unusually creative and often subtle ways, and there’s no way to figure out where they went off the rails. Since the LLM is responsive, they can use it to come up with plausible-sounding nonsense to answer any criticism, collapsing the debate into a black hole of bullshit.
I am not sure what you mean by this comment. I took the language from the developers. If you mean that commercial AI providers should give more specific information then I agree wholeheartedly.
I assume it is difficult for Elicit to give specific numbers because they lack the data, and confabulations are highly dependent on what research area you are asking about. So the "rule of thumb" is a way of flattening this complexity into a usage guideline.