Show HN: Copyfish – Extract text from images, videos or PDF

grezql · on June 20, 2017

Is the OCR extraction performed in the client? If it's transferred to a server, then people should be aware of this so sensitive data from documents/PDFs is not submitted.

qz_ · on June 20, 2017

Yeah apparently it uses https://ocr.space/, deal-breaker for me.

a9t9 · on June 20, 2017

I understand that hosted OCR, just like SaaS in general, is not suitable for every use case.

On the other hand, the OCR.space OCR API has a very strict privacy policy:

https://ocr.space/privacypolicy - All uploaded images and the extracted text are deleted immediately after processing.

cantrevealname · on June 20, 2017

> All uploaded images and the extracted text are deleted immediately

Until they are served with a subpoena for a particular client, or a sweeping subpoena to store everything forever, or the company is sold and the new parent has different values, or the company decides to mine customer data for advertising uses, or there's a bug in the software, or there's a long-lived cache of the data, or it gets into their backups accidentally or deliberately, or they don't keep the data but keep "just" the meta-data, or they do statistics or analytics before deleting the data, or they are hacked, or they simply change their minds.

In terms of privacy, even a non-free non-open-source local app with DRM or license management is better than a server app with a "strict privacy policy". With a good firewall setup, you can be pretty sure that the local app won't betray you.

bpicolo · on June 20, 2017

It doesn't seem reasonable to blame them for an arbitrary potential future when they're currently doing the right thing.

icebraining · on June 20, 2017

"The best way to avoid privacy breaches is not to formulate a detailed privacy policy; it's to reduce your capabilities so that you're unable to violate anyone's privacy."

http://www.daemonology.net/blog/2012-01-19-playing-chicken-w...

mnem · on June 20, 2017

No, however the description of the plugin should make it clear data will be uploaded to a third party server for recognition so the user can make a choice about that.

bpicolo · on June 20, 2017

It more or less does.

`For developers: Copyfish is published under the GPL open-source license. As OCR software, it uses the free OCR API from https://ocr.space/ .`

tripzilch · on June 21, 2017

I don't find that clear at all. And this is also important to non-developers.

Also, for nearly all documents I ever need to scan, if they're important enough to require scanning, they're important enough that a third party should have nothing to do with them.

The majority of exceptions to the above being, ironically, documents without text, sketches, doodles, etc.

m-p-3 · on June 20, 2017

> when they're currently doing the right thing

You mean that we have to place some trust that they are. Some users cannot afford that kind of trust.

stingraycharles · on June 20, 2017

I suggest adding a big notification dialog that explains this when you first try to do an OCR request.

JoblessWonder · on June 20, 2017

Why did you end up going with a .space domain? We blocked that whole TLD because we were getting massive amounts of spam from it when it first came out.

jolmg · on June 20, 2017

Oh dear. My main domain and email are in the .space TLD. I hope your practice is not widespread.

Personally, I chose .space simply because it's cool, cheap, and not overcrowded. It also seems to lend itself well to being part of a name.

I know spam is a hard problem, but I wish you wouldn't label me a spammer simply because of the TLD I chose.

treitnauer · on June 20, 2017

That's one of the problems with cheap domains in the sub $5 range. Some gTLD registries (.space included) thought it was a good idea to offer them really cheap, but what they got were mostly spammers which puts you in a bad neighborhood.

There are a few others which you may want to avoid according to this report: https://securityintelligence.com/enticing-clicks-with-spam/

ignoramceisblis · on June 21, 2017

The author's full comment:

> Why did you end up going with a .space domain? We blocked that whole TLD because we were getting massive amounts of spam from it when it first came out.

From your comment:

> I know spam is a hard problem, but I wish you wouldn't label me a spammer simply because of the TLD I chose.

The author is not "labeling you a spammer". They're simply stating a fact about their experience. And in fact, it doesn't even mention you.

jolmg · on June 21, 2017

I'm not taking offence nor am I taking it personally. I was hoping my tone was clear on that (i.e. "I know [you have reason], but ..." and "I wish ...", which is just an expression of hope). Sorry if it came off as aggressive.

I only tried to highlight that they have, in effect, labeled everyone in .space (not just me, but me included) as a spammer.

It's heavy handed, but I understand there are sometimes pressing needs for quick solutions, like when having your mailboxes flooded with SPAM. Hence, the "I know ..." clause.

superasn · on June 20, 2017

Why is it a deal breaker?

kebman · on June 20, 2017

It's a deal breaker because THAT'S NONE OF YOUR DAMN BUSINESS, and that also goes for Copyfish. It smells fishy to me, and _promises_ never kept prying eyes away from secret documents. People who handle confidential documents should never use SaaS. It's an issue of trust, and Copyfish deserves none.

ghostly_s · on June 20, 2017

Okay, don't use it then. They make no claims of enhanced privacy and frankly it's unreasonable to presume a service such as this would do all processing locally unless you're paying a premium for that ability. Or did I miss the "Great for confidential documents!" banner? For most people's use cases, this is not a concern.

mnem · on June 20, 2017

It's cheaper for a service to OCR locally than remotely.

a9t9 · on June 20, 2017

There is simply no good OCR engine available that can run inside a Chrome or Firefox extension. The best available is Tesseract.js. And while this engine is fantastic as a project, its recognition rate does not come close to what is available server side.

ocrcustomserver · on June 21, 2017

I agree. There's also ocrad.js .

awqrre · on June 21, 2017

Mozilla should have made an effort to have that OCR code be able to run locally... not everything needs the cloud (well, almost nothing)

ocrcustomserver · on June 21, 2017

If you need a private OCR server that you can host yourself (locally or on the cloud), shoot me an email.

kronos29296 · on June 21, 2017

Did you create this account just to answer this question? I am curious.

ocrcustomserver · on June 21, 2017

Yes. Does it sound like shameless promotion to you? Maybe it is, but some people might have a need for this (and it's relevant to the topic/comment).

kronos29296 · on June 27, 2017

Promotion - Yes. Shameless - No. (This is what I think) I am just curious. I sure love the idea of this extension. If I need to use something like this I at least know a handy extension for this now.

drez · on June 21, 2017

They sure did. I also have a sneaking suspicion that it's a9t9, the owner of copyfish and ocr.space.

johnvonneumann · on June 20, 2017

Can you explain why this is a deal breaker? Is it the use of OCR or the choice of provider? Assume I know nothing here, because I do.

awqrre · on June 21, 2017

it adds a third party in the equation...

TekMol · on June 20, 2017

What is the business model of free extensions like these? Is it all spyware/malware?

It looks like many free extensions either have malware in them from the start or get sold to malware companies later on, who then deploy the malware via updates:

http://lifehacker.com/many-browser-extensions-have-become-ad...

irrational · on June 20, 2017

Why does everything have to have a business model? Sometimes people like to create things for the sheer enjoyment of creating things or they have an itch to scratch and think others might have the same need. Not everything is nefarious.

Volt · on June 20, 2017

Exactly for the reason they said. The concern isn't that extensions are necessarily nefarious, but that people often want something in return for their work, which might be money by whatever means.

eriknstr · on June 20, 2017

I recently found an interesting issue [1] filed in public on the GitHub repository of a fork of a popular extension.

Here are archived versions of the URLs mentioned in the issue:

Without "partner extension": http://archive.is/anu2E

With "partner extension": http://archive.is/bp93l

As is evident, what their "partner extension" does is in fact maliciously hijacking and replacing ad-space on websites visited by the user.

Strangely, searching for their name among the issues on GitHub does not show other such results. I guess they usually make contact directly and that the person at that company who filed this issue did not realize it would be visible to the public.

Here is the full text of the issue:

> Adnow is interested in byuiing your extension traffic #1

> Dear Kyong Tsu,

> My name is Anastasia, I am a manager from international advertising network Adnow.

> Extension traffic is a hot trend nowadays, and we are interested in buying traffic from Facebook Video Downloader extension and the others. We are ready to share an idea of monetization extensions with you and give you a method.

> We offer:

> * high payouts

> * 100% fill rate (we buy traffic from all over the world)

> * Integration through JS Tag / XML / JSON feed

> * Integration method

> That's how the page looks without partner extension: https://gyazo.com/5d635a9dc7bdc142e18e6775a1d1340d

> And that's how it looks for user with our plugin/code in extension: https://gyazo.com/a2b48b16d304a3ba37cdf6967fa4d9d8

> Please contact me in case you are interested in monetization your extensions.

> I am looking forward to your answer.

> Thank you in advance.

> Best regards,

> --

> Anastasia Nova

> Sales manager | Adnow LLP

> e.: tasya@sales.adnow.com

> Skype: tasya@adnow.com

[1]: https://github.com/KyongTsu/TabMemorySaver/issues/1

Archived snapshot of above issue: http://archive.is/Z5mJl

ImJasonH · on June 20, 2017

Similar Chrome extension I wrote using Google Cloud APIs: https://chrome.google.com/webstore/detail/cloud-vision/nblmo...

Zyst · on June 20, 2017

It is actually on Chrome as well: https://chrome.google.com/webstore/detail/copyfish-%F0%9F%90...

I started using it a bit ago, the area selection seems a bit wonky, but otherwise works.

mdani · on June 20, 2017

No need to use third-party extensions if you have a Google cloud account. You can download https://github.com/kaneshin/pigeon and just run it from the command line - it protects privacy and is more secure than relying on third parties.

pbhjpbhj · on June 20, 2017

Isn't Google a third-party?

angry_octet · on June 20, 2017

We need to evolve a grammar for describing privacy implications, because proper classification of this software would allow it to be marked as malware/spyware.

It is beyond irresponsible for mozilla to do nothing to prevent this malware from being recommended on their platform.

ghostly_s · on June 20, 2017

How about you explain what on earth you're talking about if you're going to take the time to disparage this product here?

angry_octet · on June 20, 2017

Read the page. Yes, it isn't obvious is it?

Look down the bottom.

https://ocr.space/

It uploads everything to a commercial OCR service. Which provides these CPU cycles 'for free'.

Who owns this data? Do you have a privacy agreement with ocr.space? Can you trust them as far as you could spit?

It doesn't matter that this is documented though. Unless it had a popup banner EVERY TIME YOU USED IT saying "Your data will be sent to a cloud service for OCR, which may keep/index/sell your data without restriction."

cortesoft · on June 20, 2017

I think you are going a bit too far with your requirement for a popup banner every time you use it. Do you expect a popup banner every time you click a link on a web page taking you to a third party website, because they are going to be able to run javascript code on your computer?

As long as the plugin is clear that they are using a third party service that will receive your images, I think it is fine to leave it at that. Not everyone feels that is a deal breaker, and they shouldn't be annoyed by a pop up just because their deal breaker is different than yours.

angry_octet · on June 20, 2017

In reply to your first question, no, because it is my computer, it runs with the same-origin policy in the sandbox. And I've chosen to enable JS for that site. So it couldn't do what this extension does, which is cross-site data transfer.

If the end user clicks the 'do not show again' checkbox on the message, sure. But it should still be graphically represented whenever you use an insecure cloud plugin, e.g. via an unlocked padlock sub icon if it doesn't use TLS, maybe a cloud sub-icon to represent someone else's computer.

kgwxd · on June 21, 2017

A link shouldn't get a pop-up, but if running JavaScript had required at least a one-time user approval for each individual script link from the day of its inception, the web would be a much friendlier place.

cortesoft · on June 21, 2017

Do you really think so? I think in practice, if every website you visited made you click 10-50 pop ups the first time you went to the site, people would start to blindly click without reading, and it would be even easier to slip a malicious pop up request by a user.

While you might want to believe that a user would actually think about what they are accepting, reality is almost all don't. Even the more security minded people among us will start to get numb to the requests. Only the most paranoid would pay attention to all of them, and those people are probably already doing things that would make that sort of pop up redundant.

I think this is a very common trap we fall into, where we want to provide MORE warnings to people and let them use their judgement. However, there is such a thing as 'alert fatigue'.

In California, companies that produce carcinogens took advantage of this aspect of human nature; when California wanted to place warning signs about cancer causing substances, they realized they couldn't win the fight against the warnings. Instead, they fought for MORE warnings; they wanted warning signs for even very slight risk carcinogens. They knew that if the signs were EVERYWHERE, people would stop paying attention to them.

It worked. Basically every building in California has a warning that 'substances known to cause cancer or birth defects are present'. Since every building has the same warning, I have no way of knowing which ones are ACTUALLY dangerous.

kgwxd · on June 22, 2017

No, I don't expect most users would care. Most users wouldn't care if sites could execute native code as root on their machine. I think, if there was a prompt, content providers that cared, even a little bit, about presentation would think real hard before introducing that prompt. The way it works now, providers very rarely think twice about adding it. And I think ad networks, trackers and all the other useless, JS based, user hostile tools of the web, would have a much harder time convincing site owners to drop in a snippet of JS when there were actual consequences for doing so.

However, I don't believe for a second, without some kind of law, punishable by death, a requirement like that would have lasted. It would take only one browser to default "Never prompt for permissions to run JavaScript". Typical users would flock to it (because sites would say they only work with it) and compliant browsers would have to copy to compete. Users ruin everything.

cortesoft · on June 23, 2017

If human nature doesn't allow a solution to be viable, it is useless to blame human nature. You need a different solution.

kgwxd · on June 25, 2017

An argument against building A.I. because the simplest solution to that is to eliminate humans from the equation.

Nadya · on June 20, 2017

Maybe I'm missing something - but it only OCRs things you choose to capture and isn't constantly trying to OCR every single thing you see.

Are you misunderstanding the extension or am I missing something bigger?

E: A total guess: "the server will see the image you are trying to OCR"? That's about as much privacy as I could see being intruded upon.

angry_octet · on June 20, 2017

It could easily build a profile of everything you get scanned/translated. I don't know if it uses https, so maybe it encrypts, maybe everyone listening can see what you get scanned.

It is good that it isn't scanning everything, i.e. complete exfiltration, but that is a low bar. It leaks every time you use it.

mholt · on June 20, 2017

Even the name is too reminiscent of Superfish.

xophishox · on June 20, 2017

Give me an API endpoint to send an image to, and a text response. I'll hand you cash.

RhodesianHunter · on June 20, 2017

There are a ton of these now. Google provides OCR as part of their machine vision API. AWS has similar with Rekognition. As others have mentioned, there are dozens of others on less well known platforms.

a9t9 · on June 20, 2017

Actually, based on my tests, there are only a few good services:

Abbyy (best recognition rate but by far most expensive), Google Cloud Vision (second best recognition rate), Microsoft OCR and... our OCR.space service with a very generous free tier and a competitively priced PRO tier.

grezql · on June 20, 2017

Rekognition from Amazon doesn't have OCR as far as I remember

ythn · on June 20, 2017

Under the hood, the extension is using:

https://ocr.space/ocrapi

Nadya · on June 20, 2017

From the creators of Copyfish: https://ocr.space/

They should have an API to point to. It is fairly accurate. I use them occasionally via ShareX, which uses their API for OCR.

E: https://ocr.space/ocrapi
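
For anyone who wants to see what "send an image, get text back" looks like against that API, here is a minimal sketch using only the Python standard library. The endpoint URL, the form fields (`apikey`, `url`, `language`) and the response keys (`ParsedResults`, `ParsedText`, `IsErroredOnProcessing`, `ErrorMessage`) are assumptions taken from memory of the public docs, so check https://ocr.space/ocrapi before relying on any of them. The helper names are made up for illustration.

```python
import json
import urllib.parse
import urllib.request

# Assumed endpoint and field names; verify against https://ocr.space/ocrapi.
API_URL = "https://api.ocr.space/parse/image"


def build_request(image_url, api_key, language="eng"):
    """Build a form-encoded POST asking the service to OCR a reachable image URL."""
    form = urllib.parse.urlencode(
        {"apikey": api_key, "url": image_url, "language": language}
    ).encode("ascii")
    return urllib.request.Request(API_URL, data=form, method="POST")


def extract_text(response):
    """Join the text of every parsed result, or raise if the service reported an error."""
    if response.get("IsErroredOnProcessing"):
        raise RuntimeError(str(response.get("ErrorMessage")))
    results = response.get("ParsedResults") or []
    return "\n".join(r.get("ParsedText", "") for r in results).strip()


def ocr_image_url(image_url, api_key, language="eng"):
    """Send the image URL to the hosted service and return the recognized text."""
    request = build_request(image_url, api_key, language)
    with urllib.request.urlopen(request, timeout=30) as reply:
        return extract_text(json.load(reply))
```

Calling `ocr_image_url(...)` hands the image location to a third-party server, which is exactly the privacy trade-off discussed elsewhere in this thread.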

ocrcustomserver · on June 21, 2017

Like a9t9 said, ABBYY, Microsoft and Google offer this.

If your images however differ from the typical text document, recognition from those services will fail. OCR is highly dependent on the particular application and the kind of images that you're dealing with. Preprocessing and segmentation are very important.

If you need a custom solution, my email is in my profile.

jklinger410 · on June 20, 2017

CM30 · on June 20, 2017

Hmm, I've seen a few apps and extensions like this before. I think Project Naptha was a heavily advertised one that did the same thing a few years back.

But how's the accuracy here? Cause when I used previous plugins for this functionality, I often found they'd return gibberish if the text was even slightly ambiguous looking in image form.

How does it compare to the other plugins doing the same thing here?

maggit · on June 20, 2017

The text on the linked page actually compares this to Project Naptha:

> For extension gurus: You might have heard of Project Naptha, a great addon that applies state-of-the-art computer vision algorithms on every image you see while browsing the web. Copyfish solves the same problem, but it takes a different user interface approach. It does not try to alter the website. Instead, it lets you mark the text in the image that you want to extract. As a result Copyfish works with every website, even videos and PDF documents.

chrischen · on June 20, 2017

Need this for mobile. Somehow it became a defining characteristic of an "App" to disable text selection.

yjftsjthsd-h · on June 20, 2017

Should be fixable via accessibility APIs?

Zyst · on June 20, 2017

This seems cool, I just tried it in chrome and it has support for pop-up dictionaries, so I'll be using this for some beginner reading assistance.

Thanks for making this!

dontchooseanick · on June 20, 2017

Copyphish ?

RhodesianHunter · on June 20, 2017

Semi related: I would love to see someone do a comparison of the various OCR APIs on speed, accuracy, and cost.

ocrcustomserver · on June 21, 2017

https://ocr.space/blog/2015/02/ocr-online-converter-review.h...

asenna · on June 20, 2017

Same here. Was just researching on this. Not sure if I should go with an open source OCR engine or one of these APIs

RhodesianHunter · on June 23, 2017

Doing it yourself with Tesseract is pretty hard (time consuming, error prone). It's something I would only consider doing once my project was built, viable, and the costs of an API were an issue.

foota · on June 20, 2017

On my phone so I don't have a chance to give it a shot, but what I find has been most irritating in the past about ocr is the accuracy. If your extension has better accuracy you might call that out.

imron · on June 20, 2017

I saw the heading on HN and thought "I wonder if it works with Chinese".

I saw the first example screenshot on the page was a Chinese movie and thought "Great, it does"

I saw the enlarged version of the screenshot and the Chinese subtitles contain multiple mistakes: "Nice try, but maybe not so great after all for the use case I'd personally be interested in".

a9t9 · on June 20, 2017

Well, at least this confirms that the screenshots are not manipulated ;)

The tricky part for the OCR in this example is the diverse background, as the Chinese characters are directly inside the movie.

Your comment is interesting, as the original motivation for creating the Copyfish extension was to help me watch Chinese movies. So I can confirm that for this purpose, it works fine. Of course, once in a while it gets some characters wrong but it works ok with many movies.

Here is a screencast of Copyfish doing subtitle OCR:

https://www.youtube.com/watch?v=YNGkGWj8lA4

imron · on June 20, 2017

> as the Chinese characters are directly inside the movie.

Yep, same with TV shows, and soft-copies of transcripts are difficult to come by, hence my interest in something like this.

I just watched the video. When used on a video does it keep a history of all OCRed text?

Finally, you might also like to try posting this on http://www.chinese-forums.com If it mostly works well for TV and films, I'm sure there will be quite a few people there who are interested in it.

a9t9 · on June 20, 2017

> When used on a video does it keep a history of all OCRed text?

Not yet - but this feature is already on my todo list ;)

Thanks for the hint about the chinese forums!

imron · on June 20, 2017

> Not yet - but this feature is already on my todo list ;)

Another interesting feature would be to do some sort of statistical analysis of Chinese text being OCRed and then combining that with possible characters suggested by the OCR. This would almost certainly prevent the mistake in the last two characters of the Chinese movie screenshot.
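
A rough sketch of that idea, assuming the OCR engine can return several candidate characters per position with a confidence each, and that bigram probabilities have been estimated from some Chinese text corpus. Both inputs are hypothetical here, and this is not how Copyfish or OCR.space work today. It is a small Viterbi search that picks the character sequence with the best combined score.

```python
import math


def best_sequence(candidates, bigram_prob, floor=1e-6):
    """Pick one character per position, maximizing OCR confidence times bigram probability.

    candidates: list of positions, each a list of (char, ocr_confidence) pairs,
    with every confidence greater than zero.
    bigram_prob: dict mapping (previous_char, char) to a corpus probability;
    unseen pairs fall back to `floor`.
    """
    if not candidates:
        return ""
    # best[char] = (log score of the best path ending in char, that path)
    best = {ch: (math.log(conf), ch) for ch, conf in candidates[0]}
    for position in candidates[1:]:
        step = {}
        for ch, conf in position:
            score, path = max(
                (prev_score + math.log(bigram_prob.get((prev, ch), floor)), prev_path)
                for prev, (prev_score, prev_path) in best.items()
            )
            step[ch] = (score + math.log(conf), path + ch)
        best = step
    return max(best.values())[1]


# Toy data: the OCR engine slightly prefers the wrong second character.
candidates = [[("天", 0.9)], [("卞", 0.6), ("下", 0.4)]]
bigrams = {("天", "下"): 0.05, ("天", "卞"): 0.0001}
print(best_sequence(candidates, bigrams))  # prints 天下
```

In the toy data the language statistics outweigh the small OCR preference for the rarer character, which is the kind of correction I had in mind.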

bondolo · on June 20, 2017

This seems like it could be very useful for accessibility applications.

Nimsical · on June 20, 2017

This is cool!

Wondering what you're using for OCR?

jffry · on June 20, 2017

  For developers: Copyfish is published under the
  GPL open-source license. As OCR software, it uses
  the free OCR API from https://ocr.space/

whitten · on June 20, 2017

So, to answer the question mentioned above, the document storing the text is sent to an off-site server (https://ocr.space/) which does the OCR and returns the results.

tobltobs · on June 20, 2017

And what lib is ocr.space using for OCR?

tangue · on June 20, 2017

I suspect they're using Tesseract as they've written a gui for it ( https://ocr.space/blog/p/free-ocr-windows.html ) but there's no way to find more.

samfisher83 · on June 20, 2017

https://github.com/A9T9/Free-OCR-Software

Based on this github they might be using the microsoft ocr library.

PokemonNoGo · on June 21, 2017

I guess it defaults to English then? Running Tesseract on Scandinavian texts gives AAO instead of ÅÄÖ in my experience if you don't supply the correct language training set. That's quite the chicken-and-egg problem: you can't identify the language without the text, and you can't get the text without the right language identified.
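
One common workaround is to run the OCR once per plausible language and keep the result the engine is most confident about. A minimal sketch of that heuristic follows. The `recognize` callable is hypothetical and supplied by the caller; it could wrap Tesseract, which can report per-word confidence values, but nothing here is tied to a specific engine.

```python
def pick_language(image, languages, recognize):
    """Run OCR once per candidate language and keep the most confident result.

    recognize(image, language) is a caller-supplied function (hypothetical here)
    that returns (text, mean_confidence) for one language model.
    """
    if not languages:
        raise ValueError("at least one candidate language is required")
    results = {}
    for language in languages:
        text, confidence = recognize(image, language)
        results[language] = (confidence, text)
    # The language whose model produced the highest mean confidence wins.
    best = max(results, key=lambda language: results[language][0])
    return best, results[best][1]
```

This costs one OCR pass per candidate language, so it only makes sense with a short list, and confidence is an imperfect proxy for correctness.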

ransom1538 · on June 20, 2017

Is this using TesseractOCR?

sjs382 · on June 20, 2017

Could you add an email address to your HN profile so I can contact you?

a9t9 · on June 20, 2017

Done. In addition, the email listed on https://github.com/A9T9 also reaches me.

vdRrsithZm · on June 20, 2017

> Done. In addition, the email listed on https://github.com/A9T9 also reaches me.

Neat! Brother. +1 =100 Ace

vdRrsithZm · on June 20, 2017

Neat brother. +1 = A109

zhangkehu · on June 20, 2017

Wonderful tool, we can get text from some PDF files easily.