
I was really impressed until I realized that the app is basically a wrapper around tesseract.js, which is the actually cool part. Tesseract has a WebAssembly (wasm) port that can run inside a web worker.

Not saying that the article was being misleading about this, just saying that the LLM part is basically doing some standard interfacing and HTML/CSS/JS around that core engine, which wasn’t immediately obvious to me when scanning the screenshots.
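To illustrate how thin that glue layer can be, here is a minimal TypeScript sketch. It is not the tool's actual code. The `OcrWorker` interface and the `ocrPages` helper are hypothetical. The interface is assumed to match the shape of a tesseract.js v5 worker, where `recognize(image)` resolves to `{ data: { text } }` and `terminate()` shuts the worker down. The worker is passed in as a parameter, so the loop can be exercised without loading the wasm engine.

```typescript
// Assumed shape of a tesseract.js v5 worker (only the parts this sketch uses).
interface OcrWorker {
  recognize(image: string): Promise<{ data: { text: string } }>;
  terminate(): Promise<unknown>;
}

// Hypothetical helper: OCR each page image in order, returning one string per page.
// Pages are processed one at a time to keep the sketch simple.
async function ocrPages(worker: OcrWorker, images: string[]): Promise<string[]> {
  const texts: string[] = [];
  try {
    for (const image of images) {
      const { data } = await worker.recognize(image);
      texts.push(data.text.trim());
    }
  } finally {
    // Release the worker even if one of the pages fails.
    await worker.terminate();
  }
  return texts;
}
```

In a browser with tesseract.js v5 installed, the worker would presumably come from `await createWorker("eng")`, and each image string could be a URL or data URL for one rendered PDF page.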




The LLM part is almost irrelevant to the final result to be honest: I used LLMs to help me build an initial prototype in five minutes that would otherwise have taken me about an hour, but the code really isn't very complex.

The point here is more about highlighting that browsers can do this stuff, and it doesn't take much to wire it all together into a useful interface.


Simon, I hope you don't mind me commenting on you in the third person in relation to the above. Simon is a great explainer, but I wish he would credit the underlying technology or library (like tesseract.js) a bit more upfront, as you did.

It matters in this case because, for Tesseract, the exact model version is incredibly important. For example, v4 is pretty bad (but it's what most Linux distros ship when run server-side), whereas v5 is decent. So I would have had a more accurate sense of this post if it had been a bit more upfront that "Tesseract.js lets you run OCR against PDFs fairly quickly now, largely because of the better processors we as devs have, not because of any real software change in the last 2-3 years".

I've felt this about his NLP content too, but clearly it works: he's such a great explainer, and so good at teasing content for later, that you do read it! I must say I've never been left confused by Simon's work.


I was pretty careful to credit Tesseract.js - it's linked at the top of the tool itself https://tools.simonwillison.net/ocr and prominently in the article I wrote: https://simonwillison.net/2024/Mar/30/ocr-pdfs-images/

What else should I have done?


It's all subjective. Reading your linked blog made it perfectly clear for me you built this using tesseract.js. No idea what the other guys are complaining about.


I drafted a few longer responses about how I felt after reading your post under that headline, but they came out unavoidably a bit asshole-y! So really, just adding something like "with Tesseract.js" to the headline is all I think would help people on sites like HN. I do like your writing. But when it's tech, I do like to know specifically what I'm reading, if possible.


You act like you were misled, but within the first few sentences the article says he realized the tools to do this already exist (naming tesseract.js explicitly!) and he just needed to glue them together. Then he details how he does that, and only then mentions that he used an LLM to help him in that process. The article's title isn't misleading either.

Was it an earlier headline or subtitle here on HN that was misleading, and it has since been changed?


Using the browser's built-in OCR is usually much better, but it is still behind an experimental API.



