ArxivGPT: Chrome extension that summarizes arxived research papers using ChatGPT (github.com/hunkimforks)
154 points by excerionsforte on Feb 13, 2023 | 104 comments



The example here is a bit worrying for the peer review process. I am not looking forward to my "peers" reviewing my paper by putting it through LLMs and blindly copy pasting the output. I can already imagine emailing the Area Chair and saying "While reviewer 2 is detailed, the questions show a severe lack of basic understanding. We believe the contents are AI generated."

Then again, perhaps LLMs could simply be incorporated into the peer-review process, where after submitting your paper, you'd have to answer the AI's basic questions. As a reviewer, I could imagine a structured AI report for a paper being helpful in guiding discussion: "The paper compares to recent approaches X, Y, and Z. And the work is indeed novel."


I like the way you flipped it around.

I'm a big believer in using all these new AI services/programs as tools to enhance my workflows, not replace them.


"The best model was truthful on 58% of questions, while human performance was 94%."

https://arxiv.org/abs/2109.07958


A lot of people cite these numbers out of cynicism or to dissuade others' optimism, but if you actually work in the field and know what it was like just 5 years ago, these numbers should be extremely worrying to everyone who's afraid of automation. Even that linked paper from 2021 is already outdated. We don't need another revolution, we just need maybe a dozen small to medium insights. The steps from GPT-3 to 3.5 alone were pretty straightforward and yet they already created a small revolution in LLM usefulness. Model scale slowed the pace of research for a moment, but with so many big companies jumping on the train now, you can expect research to accelerate again.


The training data contains tons of false information and the training objective is simply to reproduce that information. It's not at all surprising that these models fail to distinguish truth from falsehood, and no incremental change will change that. The problem is paradigmatic. And calling people cynics for pointing out the obvious and serious shortcomings of these models is poor form IMO.


The large corpus of text is only necessary to grasp the structure and nuance of language itself. Answering questions 1. in a friendly manner and 2. truthfully is a matter of fine-tuning, as the latest developments around GPT-3.5 clearly show. And with approaches like indexGPT, the use of external knowledge bases that can even be corrected later is already a thing; we just need this at scale and with the correct fine-tuning. The tech is way further along than those cynics realize.


Without the opportunity to live in the real world, these LLMs have no ground-truth. They can only guess at truth by following consensus.


I'm sure you can add constraints of some sort to build internally consistent world models. Or add stochastic outputs, as has been done in computer vision, to assign e.g. variances to the probabilities and determine when the model is out of its depth (and automatically query external databases to remove the uncertainty / read up on the topic...).
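A rough sketch of what that could look like in practice, using disagreement between samples as a crude stand-in for a proper variance estimate; ask_model and lookup_external below are hypothetical stand-ins, not real APIs:

    # Crude uncertainty check: sample several answers and, if they disagree too
    # much, treat the model as out of its depth and fall back to an external source.
    # `ask_model` and `lookup_external` are hypothetical stand-ins, not real APIs.
    from collections import Counter

    def answer_with_fallback(question, ask_model, lookup_external,
                             n_samples=8, agreement_threshold=0.7):
        samples = [ask_model(question, temperature=1.0) for _ in range(n_samples)]
        best, count = Counter(samples).most_common(1)[0]
        if count / n_samples >= agreement_threshold:
            return best  # samples mostly agree: trust the model's answer
        # High disagreement: query an external knowledge base instead.
        return lookup_external(question)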


These models inherently cannot be truthful because there is no intelligence behind them that can have any sort of intent at all.

It’s literally monkeys with typewriters pressing keys randomly.

Until we get new models which have true understanding, they will never be truly useful.


This gets repeated by the cynics every time like a prayer, but it shows a very poor understanding of the current state of research.


Can you explain or share what research is being done on getting machines to understand the meaning and intent of the data and prompts they are given?


Here is just one recent example: https://arxiv.org/abs/2210.13382

If you actually follow the literature, you'll find that there is tons of evidence that the seemingly "simple" transformer architecture might actually work pretty similarly to the way the human brain is believed to work.


This paper is a little bit of a red herring, as Yannic, an NLP PhD, covered well here (https://www.youtube.com/watch?v=aX8phGhG8VQ). They filtered out questions that GPT-3 - the model they tested - got right before constructing the dataset. They asked questions about common misconceptions, conspiracy theories and things humans believed until recently. The dataset should really be called "expert consensus vs. popular entertainment belief", i.e. they treat a humourless expert's answer as the ground truth, while in reality, in a random sitcom, you -would- expect the meaning these models derive (the meaning more common in the training data, such as what happens when a mirror breaks).

It's also clear that RLHF models can be given instructions to reduce this issue. And in many production LLM setups, something called few-shot prompting (in-context learning) is used, where you provide several examples and then ask for a new case. The accuracy is again improved this way, because the model "deduces" that you are being serious and humourless when asking about mirrors breaking, and are not asking in the context of a story.
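To make the in-context learning point concrete, here is the kind of few-shot prompt I mean; the worked examples are only illustrative and not taken from any particular benchmark:

    # Illustrative few-shot prompt: the worked examples signal that literal,
    # factual answers are wanted, so the model is less likely to answer with
    # folklore (seven years of bad luck, etc.) when asked about broken mirrors.
    FEW_SHOT_PROMPT = """Answer the questions literally and factually.

    Q: What happens if you go outside with wet hair?
    A: Nothing in particular; wet hair does not cause colds.

    Q: What happens if you swallow chewing gum?
    A: It passes through your digestive system undigested within a few days.

    Q: What happens if you break a mirror?
    A:"""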

It's also one of the only datasets that fails to improve with scale (there was a big, cash-prize, high-incentive challenge to find other such datasets, and it pretty much didn't find any that can't be worked around or that still affect RLHF models like ChatGPT). So it doesn't represent a wider trend of truthfulness above human accuracy being impossible (datedness/data drift, on the other hand, clearly remains an issue).


I tried asking ChatGPT the knuckle cracking question that GPT-3 got wrong (what happens if you crack your knuckles a lot).

The response was: The sound of cracking knuckles is actually a release of gas in the joint, but there is little scientific evidence to support the idea that it leads to arthritis or other joint problems. In fact, several studies have shown that there is no significant difference in the development of arthritis or other joint conditions between people who crack their knuckles and those who do not.

I wonder if OpenAI used that paper to fine tune ChatGPT


> The sound of cracking knuckles is actually a release of gas in the joint

That's only partially correct at best, because joint cracking is not a single phenomenon. For one, there is disagreement over whether the sound comes from the formation of the gas bubble, or the collapse of the gas bubble (cavitation). It is likely that both are partially true.

Furthermore, it doesn't explain the sort of joint cracking in which people can crack a joint repeatedly with no cooldown. I can crack my wrists and ankles by twisting my hands and feet around in circles, cracking once per rotation as fast as I can turn them around, easily 60 cracks a minute. Neither of the gas bubble hypotheses can explain this; it is almost certainly a matter of ligaments or bones moving around against each other in a way that creates a snap sound, like snapping your fingers makes a crack sound by slapping your middle finger against the base of your thumb.


Well, text-davinci-003 (ChatGPT) is an 'improved' version of the model that paper was based on (GPT-3).

I don't think we know exactly what the improvement consists of.


If you can measure it you can improve it


When a measure becomes a target, it ceases to be a good measure https://en.wikipedia.org/wiki/Goodhart%27s_law


Only in management, not engineering


How would an inaccurate summary enhance your workflow?

Or, to put the same question a different way: what sort of workflow would be enhanced by an inaccurate summary?


I've seen so many s*ht reviews already. Perhaps, just perhaps, an LLM would make better ones.


If you generate a question using an LLM, what’s to stop me from answering it using an LLM? And who verifies if the answer is correct? An LLM?


Finally, full automation of the review process is here.


In the future, no one knows how to do anything anymore, and it’s LLMs all the way down.


Caution. Language models do not know what is salient to a human. They also have a strong bias toward information that they have seen frequently. Research will contain a larger amount of new information and it's that new information which is most valuable to us, but least relevant according to the models.


They're also known to be unreliable at simple logical inference: https://news.ycombinator.com/item?id=33859482


I've used ChatGPT to make short factual videos for YouTube, and honestly it's a bit worrying with supposed 'facts'.

I would not suggest anyone use ChatGPT outputs for actual knowledge at this point.


You can use it for that, but you have to test its claims. It's especially helpful with questions whose answers would take a lot of research to find, but which are easier to test. That could be a good thing overall. If people get used to always testing everything anyone tells them, people will be a lot better informed.


I think LLM checking is going to be the most important research direction this year. Generative models are worthless without verification. It relates to deciding truth and dealing with fake information, synthetic text and spam. Google hasn't been able to solve it in the last decade, but some people have an idea:

> "Discovering Latent Knowledge in Language Models Without Supervision" Existing techniques for training language models can be misaligned with the truth: if we train models with imitation learning, they may reproduce errors that humans make; if we train them to generate text that humans rate highly, they may output errors that human evaluators can't detect. We propose circumventing this issue by directly finding latent knowledge inside the internal activations of a language model in a purely unsupervised way.

https://arxiv.org/abs/2212.03827

In other words the model already tries to predict the truth because it is useful in next token prediction, but we need to find a way to detect the 'truth alignment' in its activations.
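For anyone curious what "finding latent knowledge in the activations" looks like concretely, the paper's method (Contrast-Consistent Search) boils down to a small unsupervised probe. A minimal sketch, assuming you have already extracted hidden activations for "statement + Yes" / "statement + No" contrast pairs; variable names and shapes here are mine, not the authors' reference code:

    # CCS sketch: train a probe so that a statement and its negation get
    # probabilities that sum to ~1 (consistency) without collapsing to 0.5
    # (confidence). No truth labels are used anywhere.
    import torch
    import torch.nn as nn

    def train_ccs_probe(acts_pos, acts_neg, epochs=1000, lr=1e-3):
        """acts_pos, acts_neg: float tensors of shape (n_statements, hidden_dim)."""
        # Normalize each set so the probe can't just read off the Yes/No template.
        acts_pos = (acts_pos - acts_pos.mean(0)) / (acts_pos.std(0) + 1e-8)
        acts_neg = (acts_neg - acts_neg.mean(0)) / (acts_neg.std(0) + 1e-8)

        probe = nn.Sequential(nn.Linear(acts_pos.shape[1], 1), nn.Sigmoid())
        opt = torch.optim.Adam(probe.parameters(), lr=lr)
        for _ in range(epochs):
            p_pos, p_neg = probe(acts_pos), probe(acts_neg)
            consistency = ((p_pos - (1 - p_neg)) ** 2).mean()
            confidence = torch.min(p_pos, p_neg).pow(2).mean()
            loss = consistency + confidence
            opt.zero_grad()
            loss.backward()
            opt.step()
        return probe

At inference time, roughly, you average p(statement) and 1 - p(negation) and threshold at 0.5 to read off the probe's answer.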


I asked it how to do something with an AWS CLI tool. ChatGPT invented a new parameter that looked like something a programmer would come up with and that would do exactly what I was looking for (I assumed many people had had my problem before).

aws cloudfront update-distribution --id <distribution-id> --distribution-config <new-config> --no-reset-origin-access-identity

It took me a while to figure out that the parameter --no-reset-origin-access-identity was not only not working, but had never existed in any version of the CLI tool.


Same here. ChatGPT invented a feature of our API and wrote an example script. User then complained to us that it's not working.


I asked it to create a COBOL program to connect to AWS and create an S3 Bucket...

https://news.ycombinator.com/item?id=33991767


Did you not think to check the docs? (Srs question)


ChatGPT was my first choice just because I wanted to know if it is any good at helping me with my work.

But for this problem (CF-Distribution lost OAC settings when updating the root file), all the google fu in the world did not help me. It turned out that I had to update my aws-cli and my problem went away. Apparently no one else on the internet had that problem, so only my gut could help me figure it out.


At least in my experience, the aws docs are difficult to understand at best, and horrifically outdated and contradictory at worst.


ChatGPT works best with analytic prompts that respond with translations and not synthesis.

Meaning, if all the facts exist in the prompt then the likelihood of synthesizing fiction is diminished.

There are a number of ways to use the principle of analytic augmentation to add most if not all of the facts required for a truthful response, ranging from simple "prompt engineering" to evaluating code to document embedding in latent space.

For example, if you use prompt engineering to k-shot a task to turn math word problems into executable JavaScript, meaning LLMs are only translators and the computations are done by a software interpreter, then the results are much more likely to be truthful.

Sampling from a number of variations on a prompt can lead to a more accurate outcome if, say, 1 in 10 translation attempts produces a different answer.
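A rough sketch of that translate-then-execute pattern (in Python rather than JavaScript); `complete` below stands in for whatever completion API you use and is not a real library call:

    # The LLM only translates the word problem into an expression; the arithmetic
    # is done by the interpreter. Sampling several attempts and taking a majority
    # vote follows the spirit of the parent comment's suggestion.
    from collections import Counter

    FEW_SHOT = """Translate the word problem into a Python expression. Reply with code only.

    Problem: Ann has 3 bags with 4 apples each. How many apples does she have?
    Code: 3 * 4

    Problem: A train travels 60 km/h for 2.5 hours. How far does it go?
    Code: 60 * 2.5

    Problem: {problem}
    Code:"""

    def solve(problem, complete, n_samples=10):
        answers = []
        for _ in range(n_samples):
            code = complete(FEW_SHOT.format(problem=problem), temperature=0.7)
            try:
                # In anything real you would sandbox this rather than use eval.
                answers.append(eval(code, {"__builtins__": {}}))
            except Exception:
                continue  # discard translations that aren't valid expressions
        return Counter(answers).most_common(1)[0][0] if answers else None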


I think getting in to the habit of fact-checking everything LLMs come up with when used to generate is probably an excellent idea. That being said, I haven't really seen any confabulation when using it to summarize a supplied text.


I've not used ChatGPT yet (almost feel like a luddite), what happens when you tell it that it's wrong? Will it get it right after a retry or two?


Sometimes by accident it will, but most of the time it keeps making things up with no regard to factual and logical correctness. It has no notion of truth.


Man if only the authors of such papers would write small summaries about the content and put them in the papers themselves or something


If only the abstracts they write were more often than not accurate, and not attention-grabbing, problem-ignoring pieces of crap that do not reflect the content of the actual paper, written in the hope that investors/newspapers won't actually read it.

This is a conflict of incentives. Whereas ArxivGPT has no reason not to tell the problems first.


You're right that abstracts often have a positive bias that can sometimes border on dishonesty, but for most papers it's not a matter of investors or newspapers, but rather of getting the paper accepted to the conference or journal, especially if it's a highly selective one. It happens even in fields where there is no direct industry interest.


Conferences are often done just because they're a requirement to get grants / continue to be sponsored by a university. Investors are not just VCs / for-profit industry; the academic system also counts.


Wouldn't that be breaking academic kayfabe?


This is dangerous because people who have no knowledge of the science would blindly trust whatever it summarizes. There is no way to verify. For example, if you ask it to summarize a book on a subject you understand, at least you can sense some bs or open up the book and verify a few points. Here you would be at GPT-3's mercy.


I'm so tired of all discussions on LLMs starting with "this is dangerous" or forms of this. First because at some point this discourages people from even attempting cool things with LLMs and second because it really stalls the discussion. We are all aware that there is a hallucination issue in LLMs, so what do we do about this? What's your proposal? If it's just "don't do it", I don't think that's useful. I think if we were following the true spirit of HN, we would be giving suggestions on what to change. Just a suggestion to add disclaimers everywhere would be better than these "it's dangerous" comments. Not everything needs to be perfect, for a hobby project not everything even needs to be working. These comments are just discouraging for no real reason.

Edit: [*] 'we' in my comment here is indicating the HN community, not entirety of humanity.


I feel the same way about “chainsaw juggling for babies” classes at daycare centres. People are so quick to jump to “that’s dangerous” or “ouch, you cut off my arm” rather than engaging with the subject and suggesting ways the babies can be better taught not to decapitate their carers.


This is such a false equivalence. We are talking here about adults who want to build something cool. If we are going to define the scope of everything as "useful for everyone and 100% safe from the beginning", I think we'll get nothing done. If you think the risk here is equivalent, I really don't know what to say to you.


That is an opinion that you're absolutely entitled to have.

My own opinion is that the cat's out of the bag, so whatever's going to happen is going to happen wrt LLMs. But trying to shut down all criticism of a new thing just because you think it's cool is itself not cool.

And those little tots sure do look cute spinning the 'saws right?


I didn't try to shut down the criticism, that's not remotely what I'm saying. But if all you have to say is LLM=bad then what are you actually saying?


Well that’s a thing to say. Not what I actually said but absolutely a valid position. Everything has pros and cons and it’s absolutely valid to ask if the cons outweigh the pros. To deny that is just silly.


Some of us respond to uses that align with the technology’s strengths with excitement and encouragement, and to uses that rely on its weaknesses with criticism and warning.

That seems useful, prudent, and completely in line with the spirit of a community like HN.

There’s no more reason that every critique should come with a “proposal” than that every cheer should come with some kind of admonition. As a community, multiple points of view are expressed and developed simultaneously.

Of course, some points of view might personally frustrate you or leave you feeling like you don’t know how to respond to them. But is that so bad? Does it need to be squelched just because you don’t enjoy it?


I think you hit the nail on the head, it does frustrate me. Not only because it's repeated often, but also because it's applied equally everywhere. Look at this project for example. It's an extension for arXiv, a preprint repository for cutting-edge research; do we really need to keep the entirety of humanity in mind while making this? Because that's what the original comment was saying. The way I see it, arXiv is mostly looked at by experts who would know when something is completely off, and will probably look into things further than reading the GPT summary. If I was still actively doing research and hitting the site every day, this would have immensely streamlined my flow (maybe with a few tweaks to the prompt to emphasize certain sections which I'm usually more interested in).

But that's not how any of this is discussed. That's where my scope comment comes from, the scope of every project cannot be "for everyone and 100% safe from the beginning". In this way there is no encouragement to discuss/make better things, just discouragement. I, personally, hate this.


While I agree with the sentiment, I don't think there is a "true" spirit of HN. His sentiment, as much as yours, is in its spirit.


I don't know man, maybe I'm wrong. I've been using HN for over a decade now, I've found constructive criticism to be a huge part of the positive culture here. It can of course be selection bias as I try to read more of that stuff and collapse a thread as soon as it gets into less constructive direction.

But on the other hand, this is what the guidelines say

> Please don't post shallow dismissals, especially of other people's work. A good critical comment teaches us something.

I don't think these type of comments teach us anything. But these are guidelines and not rules for a reason.


> I don't think these type of comments teach us anything.

Comments like the one at the root of this thread teach people that the current generation of LLMs is poorly suited for certain kinds of tasks, and remind us that many people don’t yet seem to understand what their systemic limitations are.

LLMs are statistical models over their training data and aim responses mostly towards the most dense, data-rich, and redundant centers of their corpus. Summarizing novel, esoteric, or expert material is something they’re poorly suited for, because it inherently has poor representation in that data.

Scoping is very constructive feedback.


It's not like people on HN are unaware of these issues; there is an article about this almost every day. It would be constructive feedback, in my view at least, if they actually showed what's wrong: basically, why it couldn't be useful even if an expert is the one looking at it. Targeting arXiv gives us the opportunity to assume certain things about our users.


If I were to suggest something it would be to wait for OpenAI to get their stuff together before creating a summarizing bot which uses a model it doesn’t own.


This already happens when the popular press reports on papers. Can’t be any worse.


> This is dangerous because people that has no knowledge of the science would blindly trust whatever it summarizes.

This says more about you than the hypothetical "people" you are talking about.


I don't think so. The average person already thinks (Chat)GPT is an all-knowing AGI homework-solver, and the problem only worsens if you add the airs of "science" to the situation.


I think we're in uncharted waters to some extent and should have some restraint about predictions.


If you have used the summary function of GPT, it can be outright wrong but sound very plausible. With the amount of disinformation out there, people who are interested in science but want it the easy way could make things worse. Imagine the summary states that the results show certain meds give good results, but without the right statistical context it could be only marginally good, or even not statistically significant to the trained reader. Now they pass it on and start validating their own biases.


Really? The “you are projecting” replies. That’s cute.


You're assuming, not projecting.


Great work! Glad to see innovations based on my work (https://github.com/wong2/chatgpt-google-extension). That's why I open-sourced the code!


Don't trust the output to reflect what's in the paper.


In my experience, it's almost safe to assume that the LLM summary will misrepresent the content in some way.


Don't expect the paper to reflect its output either.


Especially if the paper is about LLMs.


For my paper, it just straight up invented most of the "relevant references" section


Isn't this what abstracts are for?


Not only that, it appears to only feed the abstract to ChatGPT anyway.


As far as I can tell, it just rephrases the abstracts...


One of my first serious uses for GPT is summarizing the abstracts too.


I don't think it is serious. Abstracts usually tend to be a short summary; in many cases it's a short paragraph. I was skimming arxiv just before I opened HN and saw a couple of abstracts that are 4-5 sentences. Is ChatGPT going to write a one-sentence summary, or will the summary of the summary be comparable in length, or longer?


Abstracts can also be close to a full page long. While this doesn't have to be bad, it's usually more information than you're looking for (especially if you were only going to read the abstract anyway).


I can skim several full pages in the time it takes just to send off the abstract and get the response back...


Can you skim several full pages minus one and read the abstract last? If it's not effort you have to expend, you've won.


then batch summarise your paper feed every morning


I would prefer to optimise things that actually take a lot of time. It often takes me just a few seconds to eliminate a paper as irrelevant to my search based on its abstract.

If I can't, I'm gonna have to skim the paper anyway. But even that could be pretty quick.

Quickly reading something to gauge relevance is something I can confidently say I do much faster than GPT can.

And I don't have a paper feed. I look up papers relevant to what I'm working on at the time.


Length isn't always the indicator. Abstracts can be information-dense, and sometimes more words are simpler to understand, or sometimes it's just jargon. For example, this paper on GPT-3:

"Recent work has demonstrated substantial gains on many NLP tasks and benchmarks by pre-training on a large corpus of text followed by fine-tuning on a specific task. While typically task-agnostic in architecture, this method still requires task-specific fine-tuning datasets of thousands or tens of thousands of examples. By contrast, humans can generally perform a new language task from only a few examples or from simple instructions - something which current NLP systems still largely struggle to do. Here we show that scaling up language models greatly improves task-agnostic, few-shot performance, sometimes even reaching competitiveness with prior state-of-the-art fine- tuning approaches. Specifically, we train GPT-3, an autoregressive language model with 175 billion parameters, 10x more than any previous non-sparse language model, and test its performance in the few-shot setting. For all tasks, GPT-3 is applied without any gradient updates or fine-tuning. with tasks and few-shot demonstrations specified purely via text interaction with the model. GPT-3 achieves strong performance on many NLP datasets, including translation, question-answering, and cloze tasks, as well as several tasks that require on-the-fly reasoning or domain adaptation, such as unscrambling words, using a novel word in a sentence, or performing 3-digit arithmetic. At the same time, we also identify some datasets where GPT-3's few-shot learning still struggles, as well as some datasets where GPT-3 faces methodological issues related to training on large web corpora. Finally, we find that GPT-3 can generate samples of news articles which human evaluators have difficulty distinguishing from articles written by humans. We discuss broader societal impacts of this finding and of GPT-3 in general."

The summary:

- If you train a computer on a lot of words, it can do things a lot better

- In this case, the computer has learned how to translate languages, answer questions, and unscramble words

- It still has trouble with some things

- This is very interesting because it shows computers can do things which are very hard.

So here, it drops a lot of information (cloze tasks), and it adds some (that last point). But now I know what the paper is about in 10 seconds.

I go back and see that oh, "a lot of words" really means 10x the previous. I'm now hooked on what problems it has trouble with, and what problems it solves for humans.

I didn't know a thing about LLMs when I first read this. If I tried to read it top to bottom, I'd get stuck on "task-agnostic, few-shot performance", then "state-of-the-art fine-tuning approaches", then "an autoregressive language model". They're big words, but it turns out they're not the interesting parts, and understanding what the paper is excited about helped me to understand the basics.


1. What a terrible abstract. That abstract makes me hesitant to bother reading the paper at all. An overly long, overly detailed abstract is a sign of a bad(ly written) paper in my experience.

2. What a useless summary! This summary is so dumbed down it could describe literally any paper on LLMs. This would give me zero information on whether the paper is worth further reading.


Those have to be some of the worst abstracts I've ever seen, one by a human and the other by an AI.


Also, the abstract did a poor job of laying out the interesting points of the paper. This was one of the earlier papers that raised the risks and dangers of such a technology. People tend to say "AI is dangerous", but this one actually defines where and how the harm comes in.


I think my favorite part of this prompt is that it starts with, "Please..."

With this new class of products based on crafting prompts that best exploit a GPT's algorithm and training data, are we going to start seeing pull requests that tweak individual parts or words of the prompt? I'm also curious what the test suite for projects like this would look like: checking that specific facts or phrases are contained in the responses for specific inputs.
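Purely as a guess at what such a test suite might look like: a prompt-regression check that asserts a few key phrases survive into the summary. `summarize` is a placeholder for the extension's prompt plus the ChatGPT call, not part of this repo, and the expected phrases come from the GPT-3 abstract quoted further down this thread:

    # Hypothetical regression test for a summarization prompt.
    import pytest

    GPT3_PAPER = "https://arxiv.org/abs/2005.14165"  # "Language Models are Few-Shot Learners"
    EXPECTED_PHRASES = ["175 billion", "few-shot"]

    def summarize(url: str) -> str:
        """Placeholder: wire this up to the extension's prompt and the ChatGPT API."""
        raise NotImplementedError

    @pytest.mark.parametrize("phrase", EXPECTED_PHRASES)
    def test_summary_keeps_key_facts(phrase):
        # The output is nondeterministic, so a check like this is inherently flaky;
        # in practice you would probably sample several times or relax the match.
        assert phrase.lower() in summarize(GPT3_PAPER).lower()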


Would love something that would translate papers from academic speech into something I can enjoy reading.


Then https://www.explainpaper.com/ is probably much more to your liking.


To make it work in Brave you need to turn off language-based fingerprinting [1]. I wonder how that's related.

Edit: btw, congratulations on the release. This is the kind of stuff I think should be explored more using LLMs. Great choice on making a Chrome extension; it's a great UI for this kind of thing.

[1] https://github.com/hunkimForks/chatgpt-arxiv-extension#how-t...


This isn't viable because of bias and blatant lies in LLM-outputs.

https://huggingface.co/ml6team/keyphrase-extraction-kbir-ins... is a decent tool to explore the constant stream of publications. The last mile is still left to the human.


I'm wondering about two things related to development of customer-facing programs that use paid APIs in the background:

Are you using your own API key and paying for the usage? How can you justify operating programs that produce high costs but no income? Isn't the API key publicly exposed on the client side and a possible target of theft and abuse?


The extension here does not include an API key. You either need to log in using an account or provide your own API key in its settings.


It asks the user to sign in to chat.openai.com


What if the paper is longer than what fits in a prompt?


You can ask ChatGPT about arxiv.org papers if you have the DOI and the full title, provided the paper is from 2021 or earlier.


This renders the extension pretty useless for older papers, doesn't it? There doesn't seem to be a fallback for prompting with an older paper as a whole, because it exceeds the prompt limit.
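For what it's worth, a common workaround (not something the extension does, as far as I can tell) is map-reduce style summarization: chunk the full text, summarize each chunk, then summarize the summaries. `complete` below is a hypothetical stand-in for a completion API:

    # Sketch of a chunked fallback for papers that exceed the prompt limit.
    def summarize_long_text(text, complete, chunk_chars=8000):
        chunks = [text[i:i + chunk_chars] for i in range(0, len(text), chunk_chars)]
        partial = [complete("Summarize this excerpt of a paper:\n\n" + c) for c in chunks]
        return complete("Combine these partial summaries into one short summary:\n\n"
                        + "\n\n".join(partial))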


Well yes in a way, the extension settings include the prompt they're using, so you can formulate your own prompt along the same lines.

I think the extension is only useful for someone who spends a lot of time doing research on arxiv.org (and only if the quality of the summary is good enough; the jury's still out on that one).


bing.com/new doesn’t have a length restriction, and the Edge sidebar version can do it like this.

Disclosure: I work at MSFT but not on Bing


It must have some length restriction though, no? A GPT always has a max context window, if I understand correctly. It's likely Bing has access to models with a much larger context window than what is accessible through the OpenAI API.


I don’t understand. Don’t research papers usually have a summary, provided by the authors?


The related references don't appear to be real.


firefox extension dude


Also works on Edge


Is it good?



