ArXiv Papers as Audiobooks (github.com/imelnyk)
105 points by Acsmaggart 4 months ago | 41 comments



When I’ve tried listening to YouTube videos explaining, say, Attention Is All You Need, I find that I cannot do it passively at all. The first 10 or so minutes I’m nodding along, folding laundry or doing dishes, then the presenter says something like “by reifying this tensor against the priors I was just talking about, we’re able to—“ and I have to pause, rewind a couple minutes, grab a piece of paper and actually engage with what’s going on.

I have to imagine listening to raw papers (not even someone like Andrej Karpathy interpreting and presenting it) would be even more difficult. I don’t know if there’s an easy way to passively consume academic literature at all. If it’s important stuff, it will usually be pretty challenging.


There is definitely a way to make this happen though. Little bit o' whisper, Mixtral in some RAG, and you've got yourself a buddy to talk with about the paper while it's reading it to you.

Of course everyone will immediately say this is dangerous and it may mislead you by giving wrong explanations, etc. etc., and then others will counter with 'it will definitely get better over time' (the best models as products are ~3 years behind the improvements being shown in academic work, for example). However, ultimately this is just a neat product to make, even if it has some bugs. Listening to TTS right now spends about half the time reading jumbled numbers from tables and listing off author names. So just tackling that alone (which this would do much better) would be valuable.
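For what it's worth, a minimal sketch of the "buddy" half of that idea, swapping in sentence-transformers and an OpenAI-style chat endpoint for concreteness (the model names are placeholder assumptions, and this is not the linked project's code): chunk the paper, embed the chunks, retrieve the closest ones for a question, and hand them to a chat model.

    # Minimal RAG sketch: chunk -> embed -> retrieve -> ask.
    # Model names are placeholder assumptions.
    import numpy as np
    from openai import OpenAI
    from sentence_transformers import SentenceTransformer

    encoder = SentenceTransformer("all-MiniLM-L6-v2")
    client = OpenAI()

    def build_index(paper_text, chunk_size=800):
        # Naive fixed-size chunking; a real pipeline would split on sections.
        chunks = [paper_text[i:i + chunk_size]
                  for i in range(0, len(paper_text), chunk_size)]
        return chunks, encoder.encode(chunks, normalize_embeddings=True)

    def ask(question, chunks, embeddings, k=3):
        q = encoder.encode([question], normalize_embeddings=True)[0]
        top = np.argsort(embeddings @ q)[-k:]  # cosine similarity on normalized vectors
        context = "\n---\n".join(chunks[i] for i in top)
        resp = client.chat.completions.create(
            model="gpt-4o-mini",  # assumption: any capable chat model works
            messages=[
                {"role": "system",
                 "content": "Answer questions about this paper excerpt:\n" + context},
                {"role": "user", "content": question},
            ],
        )
        return resp.choices[0].message.content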


But listening to a paper passively is not the same thing as being mentally prepared to converse with an LLM about a dense topic. I feel the use cases are quite different, and I doubt that there is a middle ground between listening passively and learning a complex topic. But maybe I am missing something.


This is a bit different than the "read a paper" TTS app. I mentioned the idea just to say it's possible and coming. The blend of the two isn't out of the question though.

Think of asking for a reading of a paper wherein you could interject at any time.

System: "This work is presented from mainly 3 groups: DeepMind, University of Pennsylvania, and ETH Zurich. The authors are Matthew Botvinick, Dani Bassett, and Bastian Rieck. They uncover a useful meta-learning program that relies on an AT methodology rooted in the bifiltration of the Ricci curvature of the embeddings and training step, wherein ..."

You: "Wait a second - the algebraic topology method - what are the prior works in that area, and why would that be the starting point for this paper?"

System: "It appears that the relevant citations point to Anne Sizemore's work while in Bassett's lab, with a few other key authors such as Giusti. The titles suggest that..."

(...) System: "Now that we've cleared that up a bit (and added it to a research list for further exploration later), to continue on the paper ..."

And so on.

This is very achievable today with a little bit of work. Perhaps not easy to work _super well_ - but likely easy enough to get working to _some degree_. A well polished product that does work super well certainly isn't out of the question though.
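To make that concrete, here's a toy sketch of the interject-any-time loop, with input() and print() standing in for real speech recognition and TTS, and answer_question() as a hypothetical stand-in for the retrieval-plus-LLM backend:

    # Toy sketch of an interruptible paper reading session.
    # input()/print() stand in for real speech I/O.
    def speak(text):
        print("[reading]", text)

    def answer_question(question):
        # Hypothetical: retrieve relevant passages, ask an LLM, return the reply.
        return "(answer from retrieval + LLM goes here)"

    def read_interactively(sections):
        for section in sections:
            speak(section)
            q = input("Interject (Enter to continue): ").strip()
            if q:
                speak(answer_question(q))
                speak("Now, to continue with the paper ...")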


How much would this product be worth to you?


I came to post essentially this. I could listen to review articles in an area I'm familiar with, but listening to primary papers could never work for me.


It also really depends on what type of paper. Psychology papers, for example, are very accessible in an audio format, as they are, in general, quite comprehensible.


Just listening while doing nothing else is soporific, but I can imagine finding this invaluable if I had a long commute to work.


I sometimes use a combined approach: listening first, then reading the technical writing later.


Visually reading dense papers can also lead to failing to understand concepts, or to distraction.


This is a great point. People will complain whenever LLMs are applied to anything, but ultimately this improves accessibility and allows someone to dive deeper when needed.

There will always be ways to misinterpret some academic work, and there are plenty of opportunities in the path of understanding a work to do that.

Allowing someone to engage with a work _at all_ by lifting some barriers (for visually impaired people, for example) should be acknowledged as an improvement, not discouraged continually for having some bugs.


The LLM prompts are pretty interesting, e.g.: https://github.com/imelnyk/ArxivPapers/blob/main/gpt/utils.p...

> "You are an ArXiv paper audio paraphraser. Your primary goal is to rephrase the original paper content while preserving its overall meaning and structure, but simplifying along the way, and make it easier to understand. In the event that you encounter a mathematical expression, it is essential that you verbalize it in straightforward nonlatex terms, while remaining accurate, and in order to ensure that the reader can grasp the equation's meaning solely through your verbalization. Do not output any long latex expressions, summarize them in words."


There's no way the prompt actually works, though. LLMs are not able to reliably "preserve the overall meaning" of things unless they're doing direct technical translation. The problem is going to be even worse with original research, because the LLM will try to summarize according to old ideas from blog posts etc. in its training data, and not the new ideas in the original research. In general, document summarization is one of the worst use cases for LLMs, both in terms of its reliability and the difficulty of finding errors - how would you know without reading the paper? I would be surprised if this prompt worked on a single honest[1] paper that was written after the LLM was pretrained.

The bit about translating LaTeX expressions into human-comprehensible math sentences is interesting and AFAIK should work on something like GPT-4. But that's just a case of technical translation. GPT-4 definitely cannot "rephrase the overall paper... simplifying along the way." GPT-4 can't even summarize corporate reports without screwing up facts and figures - why on earth would you try to use it to summarize new scientific research?

Stuff like this is why I'm so concerned about LLMs: this prompt doesn't work, and people using AI for this stuff is just automating ignorance. Very frustrating.

[1] I say "honest" because this prompt would probably do ok on stuff coming out of a paper mill - the problem is carefully stated original ideas. GPT tears original ideas to shreds.


Can somebody please pay me a nickel every time someone states a belief that LLMs cannot perform some task or other?


LLMs can't give you nickels


Just put the words in there, signed The Prompt Engineer


Papers are already difficult to process when reading them carefully multiple times; what even is the point of turning them into an audio version? I am genuinely at a loss, unless we are talking about blind people.


The YouTube Channel may shed some light. As I understand this, it is not reading the paper, but interpreting or summarizing it with visual cues as to which section it is analyzing.


I still don't get the purpose. If you have a video to watch, it's not an audiobook anymore. Secondly, why not just read the abstract? The paper might contain formulas (which need to be read carefully to understand) and data (likewise). If you strip the paper of its scientific elements, then only a series of badly justified steps remains, at which point you might as well just consider the abstract + conclusion paragraphs.


The choice of the word "audiobook" is really unfortunate. That's never mentioned on the GitHub project page. I find LLMs to do a decent job of summarizing text. Obviously, it depends on the audience. If it is a subject-matter expert, they may not be happy with the result, but a layperson might be.


What if you want to hear about the latest arxiv updates while on your morning run?

This seems like a fantastic idea for that purpose.


I mean, couldn't you do it with a program that takes an RSS feed, parses the abstract from each paper, and puts it through a run-of-the-mill TTS engine?
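Something like this, perhaps (the feed category is just an example, and pyttsx3 stands in for the run-of-the-mill TTS engine):

    # Sketch: pull an arXiv RSS feed and read each abstract aloud.
    # Feed category is an example; pyttsx3 is a stand-in TTS engine.
    import feedparser
    import pyttsx3

    feed = feedparser.parse("http://export.arxiv.org/rss/cs.LG")
    engine = pyttsx3.init()

    for entry in feed.entries[:5]:
        engine.say(entry.title)
        engine.say(entry.summary)  # the abstract (may contain stray HTML)
    engine.runAndWait()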


Sure, I've done something similar to that in a few hours.

The free TTS options still aren't great, and "just the abstract" is not the problem. I did full articles, and the hard...er (it was a few hours) part was extracting the relevant sections of full papers without the 'junk' info (page numbers, superscript citations, 7 pages of authors in any CERN paper, etc.).

So you get something, but it's often not good.
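That cleanup step might look roughly like this (the patterns are illustrative guesses at common cases, nowhere near a complete solution):

    # Rough illustration of stripping "junk" from extracted paper text.
    # The regexes are guesses at common cases, not a complete solution.
    import re

    def clean_extracted_text(text):
        text = re.sub(r"\[\d+(?:,\s*\d+)*\]", "", text)              # [3] or [1, 2] citations
        text = re.sub(r"^\s*\d+\s*$", "", text, flags=re.MULTILINE)  # bare page numbers
        text = re.sub(r"\n{3,}", "\n\n", text)                       # collapse blank-line runs
        return text.strip()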


Many years ago, I did that when I had a large paper-reviewing load during my PhD. My solution was simply to purchase an app called SayIt for like a dollar that read the PDF to me; it worked really well.

Nowadays I often pass the PDF through LLMs to get it personalized (expanding on jargon or trimming the verbiage) and then read it. That gives me a better return on time spent.


I had been daydreaming a couple of weeks ago about being able to listen to papers while driving or doing repetitive tasks, and it looks like there is now a YouTube channel where these get posted:

https://www.youtube.com/@ArxivPapers

The pipeline seems to do a pretty good job of cleaning up the writing too; some ArXiv papers are a little rough.

(I'm not the project owner)


I've been looking for a good way to TTS longer PDFs and EPUBs into recordings so I can listen to them on the go. I'd like to take advantage of high-quality TTS models, but I'd prefer one I can host myself.

Haven't found the right way yet, I'm considering: https://github.com/MycroftAI/mimic3


I use Librera Reader [1] for this, it handles both ePub as well as PDF and then some. The quality of the TTS output is dependent on what you have on your (Android) device since that is what it uses. I tend to use Google's TTS with a male UK voice which I tune down (as in deeper voice) and speed up a bit. It mostly works fine, probably better for nonfiction than fiction but that is what I mostly use it for anyway. You can swap between reading on-screen and listening since it keeps position in the document while reading aloud.

You can also have it read into an audio file if so desired, which can be listened to later.

[1] https://f-droid.org/en/packages/com.foobnix.pro.pdf.reader/


If you’re an iOS user, try https://oration.app


Please don’t use HN primarily to promote your product. It is against the community guidelines.


It was a formal Show HN <https://news.ycombinator.com/item?id=39316019> from a while back, so I think that's within bounds, but I also agree that it should be one top-level comment and then let it go, not replying every time with the same text.


I had a similar idea, but what happens when you stumble upon code, equations, tables, graphs, etc.? Can an LLM understand those as well?

For example: you are listening to the paper with some text-to-speech model and it stumbles upon a code snippet or a table or a graph... what should happen next? Should the model skip it, or prompt you to look at the graph or table or whatever? Or should you write some software that tries to interpret graphs and other non-text content? One simple policy is sketched below.
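Here's a sketch of the simplest policy, under the assumption that preprocessing has already labeled each block (detecting those labels is the hard part, and that logic is omitted here):

    # One possible policy: replace non-prose blocks with spoken placeholders.
    # Assumes preprocessing has already tagged each block with a kind.
    def to_speakable(blocks):
        for kind, content in blocks:  # e.g. ("text", ...), ("table", ...), ("equation", ...)
            if kind == "text":
                yield content
            else:
                yield f"There is a {kind} here; see the paper for details."

    sections = [("text", "We propose ..."),
                ("table", "Table 2: results"),
                ("text", "In conclusion ...")]
    for utterance in to_speakable(sections):
        print(utterance)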


I am still trying to understand this, but it seems like the potential here is tremendous. For example, you can imagine producing audio tailored to the sophistication of the reader, where a layperson may want a more basic interpretation than a subject-matter expert would. Really looking forward to seeing where this goes for the dissection and understanding of scientific publications.


Did you purposefully omit a license?

I really do wish GitHub would prompt its repo owners with "did you forget a license?", but I also wish it would prompt them to add "topics" to enhance discovery, and I guess I'll just continue to hold my breath on those.


https://www.listening.com/ does this as a service. FWIW, I haven't tried it myself.

Edit: looks like they support a few traditional publishers as well.


Last time I tried it, the app literally just read papers, as in parsed arXiv PDF text to speech. It was an awful misunderstanding of the medium. Unless it was rebuilt significantly over the last few months, it's just bad.


We built Oration (https://oration.app) to improve on issues like this. It also generates a summarized version.


Give Oration (https://oration.app) a try! It's cheaper, and many of our users have found it a better option than Listening.


Whatever they're using for text to speech is rough. Probably using an open source model. The one used in OP (Google's) is a lot more listenable.


It'd be interesting to also have these generate a slide presentation explaining a paper via some combination of presentation markdown, MermaidJS, and an image generator.


I started working on a version of this just the other night—thank you for saving me the time! This is awesome.


Audiobooks make sense for things which are communicated as fast as speech. Like stories.



