Show HN: PDFs that are readable by human eyes only (humaneyesonly.com)
216 points by viggity on July 6, 2022 | 126 comments
Hi, OP here. A friend involved in a custody battle was afraid his ex was going to leak all of his discovery documents on the internet, and he asked if there was something I could do to make it harder for bots/crawlers to find sensitive information. Originally I was going to turn all of his docs into image-based PDFs, but those get large fast and are easy to OCR.

So I found a post musing about altering fonts/glyphs so that the text looks like English, but the actual character seen by the PDF reader is a non-English character. As such, when you try to OCR these files, the software doesn't see any images and can't convert them.

I figured it had some potential uses and maybe you fine folks can identify other use cases. I'll be monitoring this post most of the day.




I am a blind user using an extension to my screen reader which (under the covers) uses the Windows 10 built-in OCR. Your sample document gives me:

INTRODUCTION The Cybersecurity and Infrastructure Security Agency (CISA) is committed to leading the response to cybersecurity incidents and vulnerabilities to safeguard the nation's critical assets. Section 6 of Executive Order 14028 directed DHS, via CISA, to "develop a standard set of operational procedures (playbook) to be used in planning and conducting cybersecurity vulnerability and incident response activity respecting Federal Civilian Executive Branch (FCEB) Information Systems." I Overview This document presents two playbooks: one for incident response and one for vulnerability response. These playbooks provide FCEB agencies with a standard set of procedures to identify, coordinate, remediate, recover, and track successful mitigations from incidents and vulnerabilities affecting FCEB systems, data, and networks. In addition, future iterations of these playbooks may be useful for organizations outside of the FCEB to standardize incident response practices. Working together across all federal government organizations has proven to be an effective model for addressing vulnerabilities and incidents. Building on lessons learned from previous incidents and incorporating industry best practices, CISA intends for these playbooks to evolve the federal government's practices for cybersecurity response through standardizing shared practices that bring together the best people and processes to drive coordinated actions.

Pretty sure this doesn't actually work.


OCR built into macOS Preview.app:

The Cybersecurity and Infrastructure Security Agency (CISA) is committed to leading the response to cybersecurity incidents and vulnerabilities to safeguard the nation's critical assets. Section 6 of Executive Order 14028 directed DHS, via CISA, to "develop a standard set of operational procedures (playbook) to be used in planning and conducting cybersecurity vulnerability and incident response activity respecting Federal Civilian Executive Branch (FCEB) Information Systems."


Very interesting. The Windows 10 screen reader consumes the raster data of PDFs to OCR, not the code point data embedded within the PDF. People here have been on my ass about saying "OCR resistant" and I get where they are coming from. I've primarily been testing the various "OCR" functionalities built into the various PDF readers out there, the "OCR" that 98% of laypeople are going to rely on. I always knew that exporting to an image-based PDF and OCRing that wouldn't be defeated. If a human can read it, a machine can read it; it's just that most PDF readers aren't set up to do it. Out of curiosity, when you use your screen reader on my website, does the <textarea> read and/or start with "Name: Satoshi Nakamoto"?


I don't think people have an issue with your implementation, but with your misrepresentation. OCR has a specific meaning, and this is not resistant to it (at all); in fact, it encourages people to do it because the text isn't already copyable.

What your service DOES block is casual copying and scraping. But the people who are going to be doing that (search engines and the like) are different from people who need actual protection from OCR (I don't know who that is, but presumably they've identified it as a threat and need to specifically mitigate it, a la CAPTCHAs).

By misrepresenting the actual security/obscurity of your service, you are putting people at risk with a false sense of security that's trivially defeated by anybody with minimal IT experience. It'd be like if Signal promised encryption but actually just implemented ROT13.

Which would you rather hear from, a bunch of devs saying "you're misrepresenting your product, might wanna tweak your marketing" or a bunch of burned users trying to sue you because you misled them into a bad situation?


I can see that the content in the textareas is a bunch of Unicode glyphs that aren't mapped to speakable characters, and when I perform a "read all" action they mostly render as question marks.


Interesting how the screen reader will use the raster image for a PDF but not for the website. Based on the work I had to do to get the PDF to work, PDFs are a clusterfuck of glyphs floating in space and not really structured like HTML. So presumably they have to OCR the raster for the PDF so they can determine and optimize their own text flow, while relying on the hierarchy of HTML as enough of a guidepost that they don't need to OCR raster data there. Thank you very much for replying!


"If a human can read it, a machine can read it." Completely contradicts the clickbait title I wish I never clicked


This is broken in multiple ways, some obvious, some not.

1. Obviously most people jumped directly to OCR, and that works of course. So counter to the OP, you can trivially render the first page to a high resolution PNG and then OCR that with what will probably be 100% accurate results. Sample image: https://i.imgur.com/hyJOSjY.jpg

2. This is just messing with glyphs in fonts, so one trivial way of undoing the changes losslessly (not even requiring OCR) is to create a mapping between each font glyph and the original character. For example, I was able to extract the font used for the text "CONTENTS" near the beginning of the sample document. It is named "SecureFont-1845559949-FranklinGothic-Demi", as extracted by mutool. In the PDF "CONTENTS" is made up of eight Unicode characters, which render as "CONTENTS" in that font.

3. Even if the first two methods somehow failed, the same character in a given font is repeatedly used to render the same character in English. That makes the approach similar to a substitution cipher [1] which is trivially broken with frequency analysis. You could literally just copy / paste the fake "text" out of the PDF and with an analysis tool derive the original text. This isn't really significant since the PDF can be read by sight anyway, but it's worth pointing out.

[1] https://en.wikipedia.org/wiki/Substitution_cipher
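
To make point 3 concrete, here's a naive frequency-analysis sketch in Python (the names are mine, and it assumes a strict one-to-one substitution, which the OP notes below isn't quite the case):

    from collections import Counter

    # English letters ordered by typical frequency in prose.
    ENGLISH_BY_FREQUENCY = "etaoinshrdlcumwfgypbvkjxqz"

    def guess_mapping(obfuscated: str) -> dict:
        """Pair the most frequent obfuscated codepoints with the most
        frequent English letters. Crude, but enough to seed a solver."""
        counts = Counter(c for c in obfuscated if not c.isspace())
        ranked = [char for char, _ in counts.most_common()]
        return dict(zip(ranked, ENGLISH_BY_FREQUENCY))

    # text = "gibberish" copy/pasted straight out of the PDF
    # print("".join(guess_mapping(text).get(c, c) for c in text))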


To point 1, it was trivial for me to open the PDF in Preview on macOS, save it out as a multi-page PNG, open that PNG in Preview and macOS itself just let me copy out the text without error (in my limited test).


Yep. Even taking the most obvious approach, PDF -> PNG, PNG -> PDF, and running the PDF through Acrobat's "Optimize Scanned Pages" feature results in a PDF that is almost the exact same size as the original and has copyable text: https://0x0.st/oQvE.pdf

Even most of the columns scan correctly!


I took a screenshot of the address and pasted it into OneNote. The 'Copy Text from Picture' extracted the text with no issues.


I believe this can be taken as an axiom: there's nothing a human eye can see that a machine cannot.


Isn't the opposite of your assertion the whole reason reCaptcha (and its dumbass hcaptcha competitor) exists?

Maybe you were asserting only on written text, and not machine vision in the general case -- but even then I'd bet that only applies to very regular text, and not handwritten items


Are you asserting that reCaptcha is not solvable by OCR?

It's a two parter if so. First, the main reCaptcha checks are via JS, that's what determines your likelihood to be a bot and the difficulty of the check. That's a cat and mouse game, but ultimately the client wins.

Second, the images (road signs, red lights, etc.) can be solved via OCR, and there are services that do so. But reCaptcha keeps a decent number of the opportunistic actors out, so it's good enough for the industry.

But I will give you that the human captcha services currently have a better solve rate, albeit at a higher price; the gap has been closing, though, and it's becoming more economical to use an OCR service.


But there are many things that the machine can read that are completely unintelligible to (most) humans.


except for beauty.


Sure thing. Machines can see, not appreciate, evaluate or contemplate the same way humans could.


For 1 - this is spot on. There are tools to dump PDF text, but they are quite flawed because you don't know how the text is laid out. In many cases the text can come out quite jumbled, for example if there are columns of text. Therefore, for OCR it's better to convert the PDF to an image and let OCR handle it. Google's OCR for example will understand and output columns properly.


One thing I do to counter point 3 is have multiple non-English characters map to the same English character and pick a random one each time. Depending on the input, there can be ~10 or so characters mapping to any single English letter. If you're advanced enough to know about a substitution cipher, you'll figure out how to convert to an image-based PDF and then use OCR on that. The reason I have the multiple mappings is so that if a layperson was trying to find all instances of "Billy", they could copy those characters and then search for "Ҽҙӈㅰベ", but the other instances of "Billy" might have the codepoints "Ҽтぃぃヴ".
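
A minimal sketch of that one-to-many substitution, with the pools seeded from the "Billy" examples above (in the real tool each stand-in codepoint is also paired with a matching glyph in the embedded font):

    import random

    # Pools of stand-in codepoints per English letter (illustrative;
    # the service generates roughly ten per letter).
    HOMOPHONES = {
        "B": ["Ҽ"],
        "i": ["ҙ", "т"],
        "l": ["ӈ", "ㅰ", "ぃ"],
        "y": ["ベ", "ヴ"],
    }

    def obfuscate(text: str) -> str:
        # Pick a random substitute each time, so two occurrences of
        # "Billy" rarely share codepoints and copy/paste search fails.
        return "".join(random.choice(HOMOPHONES.get(c, [c])) for c in text)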

Again, it's resistant to built-in PDF reader OCR, not bulletproof. I'm trying to thwart a crawler, a script kiddie, or a 50-year-old divorce attorney, not the denizens of HN.


That's a fair point. I was trying to point out an interesting relationship between your approach and various forms of cryptography, but obviously this would not be the first line solution to the problem. I think the resistance to crawlers will come down to whether they implement OCR or not. I suspect some of the more sophisticated ones do. (A naive one might also be fooled by the fact that real glyphs are used and not attempt to OCR the text.)

BTW, you probably shouldn't say it's resistant to PDF reader OCR because most PDF readers don't have OCR, AFAIK. They just pull the text from the document, that's not OCR. Software that has OCR like Adobe Acrobat will not be fooled by your obfuscation if you render it to bitmap or textless vector first. If OCR doesn't work on the document as is, it's only because the presence of text glyphs fools it into thinking there's nothing there to perform optical character recognition on.


I don't think you're right on that second point. I'm pretty sure they do OCR, but they're only looking for image data to mine the text out of. The way they're coded now, they think that the document already is all text, so they can't find any images to convert. Again, this is not an impervious approach. There are ways around it; it's just that crawlers don't go down that rabbit hole (right now).


In an earlier comment you said: "PDFs are a clusterfuck of glyphs floating in space". On that point you are spot on. A textual PDF is in essence simply a series of instructions to position glyphs in space (where "space" is the 2D "sheet of virtual paper" that the glyphs render onto).

But you are incorrect in asserting that PDF readers do OCR. Most do not, and even Acrobat did not have OCR capability built in for a good long time in its early history.

However, because the PDF file is simply instructions to position glyphs, if the PDF reader maintains a map table of "where in space" (the 2D sheet) it placed each glyph, select and copy can be performed by simply using the coordinates of the selection box to look up which glyphs were positioned in that space. Then, using either the reverse glyph map table (optional, but recommended) or, if that is missing, simply outputting the code point values that selected those glyphs, you get the "text" back out without doing any OCR.
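
A toy version of that coordinate lookup (hypothetical structure; real readers track full bounding boxes and text runs):

    from dataclasses import dataclass

    @dataclass
    class PlacedGlyph:
        codepoint: str  # what the PDF's text layer actually stores
        x: float        # position on the 2D "sheet of virtual paper"
        y: float

    def copy_selection(placed, x0, y0, x1, y1):
        """Return the codepoints whose positions fall inside the selection
        box, in rough reading order; no OCR involved."""
        hits = [g for g in placed if x0 <= g.x <= x1 and y0 <= g.y <= y1]
        hits.sort(key=lambda g: (-g.y, g.x))  # top-to-bottom, left-to-right
        return "".join(g.codepoint for g in hits)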


Maybe (2) can be resolved by using a different mapping in each document, making them not scrape-able by bots?


I like your intent (helping your friend), and I'm _REALLY_ impressed at the polish (clever name, nice looking website, convert-for-free microservice) but I'm sad to report (like other comments here) that in practice this is useless. :/

Repeating what others have said, don't assume that anyone who cares will even bother looking at the PDF's embedded text. They'll rasterize then OCR. To scale up, they'll just deploy more tesseract pods. :) (At least, this is how I've seen it done!)

I'd take what you have, tweak/rebrand a bit. I personally don't use embedded text in PDFs for anything, I know firsthand some large crawlers don't either, but perhaps something does. Identify that one $foo that uses embedded text then rebrand as a "$foo obfuscator" :)

Alternatively, you could try to make something that really does confuse OCR! You can't foil everyone, but you could raise the barrier to entry. Most rely on pre-trained models, which you can also use: keep permuting the image until the resulting PDF gives garbage when you try to OCR it (see the sketch below).

I'm sure you can do all sorts of transforms to the PDF that make the resulting image ugly-but-readable to humans, but really frustrating to use off-the-shelf OCR on. Mess with spacing then add slightly colored light geometric shapes in the white space. Change image contrast slowly from one corner to another. Things like that. ;)
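
A rough sketch of that permute-until-garbage loop, assuming PIL and pytesseract, with blur standing in for the fancier distortions above:

    from PIL import Image, ImageFilter
    import pytesseract  # wrapper around the tesseract CLI

    def ocr_survival(page: Image.Image, expected: str) -> float:
        """Fraction of the original words an off-the-shelf model recovers."""
        recovered = pytesseract.image_to_string(page)
        words = expected.split()
        return sum(1 for w in words if w in recovered) / max(len(words), 1)

    def degrade(page: Image.Image, expected: str, rounds: int = 10):
        # Keep distorting until the pre-trained model fails; a human still
        # has to eyeball the result for readability afterwards.
        for _ in range(rounds):
            if ocr_survival(page, expected) < 0.5:
                break
            page = page.filter(ImageFilter.GaussianBlur(radius=1))
        return page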


Having to read 40 pages of captchas is sure to please the judge :)


> As such, when you try to OCR these files, it doesn't see any images and can't convert it.

That isn't true. Acrobat might skip parts of the PDF that it thinks are already text/glyphs, but it's trivial to get around that by either using other OCR software or just printing the PDF to a raster image first. Example: https://filebin.net/qse2e0oaqkl1hjof/ocred.pdf

Still, though, for the purposes of obscuring these from bots/crawlers... a lot better than nothing!


As an author of a PDF library this is hilarious, because the number of bugs I have received over the years where this is unintentionally happening is quite high.


If someone who actually has to deal with the PDF standard can't help OP, I don't think anybody can.


Password-protect (i.e. encrypt) the documents and provide the passwords on a separate piece of paper with your reasoning and concerns clearly laid out. This won't prevent the leaking of those documents, but it will prevent automatic indexing by search engines unless they deliberately strip out encryption prior to leaking them.

If they do strip it out, then the excuses of them "accidentally" leaking the documents becomes very implausible to any reasonable judge.

For good measure, add unique watermarks to all documents to make it easier to prove who leaked the documents later.
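
For the encryption step, a sketch using pikepdf (one of several libraries that can do this; qpdf's CLI works just as well):

    import pikepdf

    with pikepdf.open("discovery.pdf") as pdf:
        pdf.save(
            "discovery-protected.pdf",
            encryption=pikepdf.Encryption(
                owner="owner-secret",  # controls permissions
                user="reader-secret",  # required just to open the file
                R=6,                   # AES-256 (PDF 2.0)
            ),
        )
    # Hand the user password over on that separate piece of paper.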


I thought this was going to be some adversarial neural network.

Unfortunately OP, I don't think your solution even works for your intended use case; Google already does OCR (actual OCR, not just parsing text) in Images. I use it in Gmail quite often. Regardless the implementation is quite neat and will surely thwart less advanced indexers.


Adding to this, it's trivial to get the human-readable text on a Google phone:

Switch between apps and pick the app with the text, but don't jump back into the app yet; select the text and you can immediately copy it out.

It's yielding the visible text, so it must be OCR'ing the image (works offline too).

When I copy directly from the example on the web page, in the non-OCR method, that does give the messed-up text, but not when done the way above.

"Phone: 514-867-5309" was copied out easily (can't be bothered to go back and get the "Cell" bit; I was just inaccurate in my copying, I'm sure it works!)


Likewise iOS, text in screenshots is selectable and in this case it is recognised correctly.


Does it OCR live-text PDFs, or just PDFs that have text in images/flattened?


Thank you to everyone in this thread who realizes it was never meant to be perfect and appreciates it for what it does do!


The point isn't that it has flaws, but that its description is wrong. "Non-human eyes" -- normally understood to be OCR -- read it just fine. I think most of us were expecting something that disrupts "computer eyes" (e.g. because of deceiving overly narrow "tricks" that neural networks use to identify characters) but left it readable for the typical human (like an easy Captcha).

A more accurate (and helpful!) description of the problem you're solving is that this disrupts text parsers. That is, any program that just reads this in as text won't see the "real" letters (unless it's been pre-programmed with a specific reverser, etc.) and thus will frustrate, say, text search.

Which, on that note, I notice elsewhere you mention this being a solution applied to document submission in legal proceedings. There, the assumption might be that one side wishes to run text searches and assumes the documents are compatible with that. If so, this could be viewed as non-compliance with a judge's orders, so FYI.


I think people will get tripped up by you saying it "can't" be OCR'd or that it is difficult to do so, and will end up looking past a pretty elegant solution in the process.

This seems like a nicely clever way to trip up non-targeted scrapers which might attempt to OCR any images they encounter, but which will ignore what looks like random gibberish codepoints. It doesn't eliminate the ability to index this data but I can see how it might greatly reduce it.

Obviously you could still convert these PDFs to an image and OCR them, but that's not the thing being defended against here.


Not sure OCR is the correct term here. OCR specifically means extracting text from an image, and this approach doesn't protect against that. Maybe better options would be "machine obfuscated" or "scrape resistant".


Hmm... interesting in theory, but take a screenshot and it's trivially bypassed. Try it yourself here: http://www.structurise.com/screenshot-ocr/


Which is not what I'd expect from anything that claims to be "OCR resistant". It's not at all clear what they mean by that.


I think the OP, while well-intentioned, did not really understand how OCR works. Follow-up convo in a separate thread here: https://news.ycombinator.com/item?id=32003066

What this blocks is not OCR but casual copy & pasting (and search engine indexing)


I think it works for the use case - where documents can be provided for discovery, but if posted online won't have the content indexed by search engines.

The various legal teams involved are unlikely to ever be the wiser. Or will they?

Won't this print out a pile of gibberish? Hard copies are rather important in the courts. Somebody is going to complain about what was provided in that case.


I posted some more info here: https://news.ycombinator.com/item?id=32003066


Funny enough, this is because the PDF spec literally allows you to map glyphs like that. Some properly-produced PDFs are broken like this, but it's been less common in recent years.

You're supposed to provide mapping tables for text extraction but they are optional.

This fails pretty badly for security because you can detect the glyphs themselves in the font tables and provide a mapping yourself.


It’s because PDF was designed before Unicode became viable, and was designed to be flexible regarding character sets, hence you can basically define your own encoding.


Coupled with embedded fonts that’s pretty clever and good foresight from Adobe.


I literally used an OCR tool to grab the text directly out of the first box. I think this is meant to be guarding against copy/pasting—not OCR.


Yeah, that was not an accurate choice of terminology... as it says in the "more info" box,

> Resistant to Optical Character Recognition (OCR), most laypeople will need to print+rescan to OCR

If print+rescan (or equivalently, screen-grab+OCR) works, which it will, then it's hardly OCR-resistant!

The only thing this "blocks" is text extraction from the PDF with things like copy/paste or pdftotext/html/whatever conversion tools, which will "see" the codepoints used rather than the glyph images.


So this is interesting. I guess I didn't realize that there are (common?) tools to OCR screenshots. And to that end, there probably isn't a whole lot I can do to stop it. But when you're looking at a huge tax return, or sworn testimony, or just a dump of 3000 emails, you're not gonna screenshot each one. You're going to want to automate the OCR, which most PDF readers (at least the commercial ones) will let you do. It is that type of OCR that my app is resistant to. They look for image data within the PDF and OCR that. They bypass my text because, to the PDF reader, it already is in a text format.

I'm 1000% sure there are gurus who could whip up a script to overcome this. But it's kind of one of those things where you don't have to outrun the bear, you have to outrun your friend running next to you. It makes your sensitive documents just that much less likely to be scanned/found.


Nothing that you said is wrong but it doesn't make the situation better.

1) As many people pointed out, this doesn't prevent OCR; it just prevents copying strings (e.g. with crawlers). 2) The majority of OCR doesn't deal with PDFs produced from a text source, but with either a) jpg scans of documents or b) PDFs produced from those jpg scans. 3) The first thing I tried was OCR with my iPhone, and it obviously worked. As someone else said, there are solutions that let you batch process many documents.

Don't get me wrong, your stuff works for what you designed it to do. However, it provides a false sense of security by falsely claiming that it prevents OCR, which in turn can lead to more harm [1].

[1] - e.g., it may convince people to share stuff that they wouldn't otherwise.


> I didn't realize that there are (common?) tools to OCR screenshots.

Retrieving text from images is literally the definition of OCR.


I think the main issue is that you have no idea what OCR is: https://en.wikipedia.org/wiki/Optical_character_recognition


> I didn't realize that there are (common?) tools to OCR screenshots.

This seems like quite the oversight to me...


But if you have a huge tax document, you're likely not going to screenshot page by page. Yes, there are ways to automate this. But if you're a 50-year-old divorce attorney, you're going to click the "OCR" button in your PDF reader and it will not work.


You don't have to screenshot every page: convert the PDF to a PNG/TIFF image for every page and OCR those. This is very easy to automate (see the sketch below). If this is working with Unicode code points, you're not blocking OCR, you're obfuscating text. Anything that renders the PDF to a raster format will produce an OCR-able document.
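
For concreteness, a sketch of that automation with pdf2image and pytesseract (Poppler and Tesseract need to be installed):

    from pdf2image import convert_from_path
    import pytesseract

    def ocr_pdf(path: str, dpi: int = 300) -> str:
        """Rasterize every page, then OCR each image; no screenshots needed."""
        pages = convert_from_path(path, dpi=dpi)
        return "\n".join(pytesseract.image_to_string(page) for page in pages)

    # print(ocr_pdf("obfuscated.pdf"))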

If you’re a divorce attorney who used this to convert documents in response to a discovery request, and the opposing side had a valid reason for needing the unobfuscated text, then you’re probably going to end up having a nice conversation with the judge about acceptable formats.

Sending compressed TIFFs would probably be just as good. A bit larger file sizes, but it would be just as effective at stopping automated scraping of text. Also, less likely to piss off a judge. Any opposing firm sophisticated enough to automate scraping the text from a normal PDF would be able to OCR these files just as easily.

Or maybe you have a second site that sells the decoder, so you get to sell to both sides. Not a bad business model, if you can work it.


I don't know why you think divorce attorneys are stupid. Some are probably very well versed in tech; those who aren't know others who are. They won't simply sit there and think "oh, for some reason I can't copy-paste from that PDF, better give up the case then".

... And most attorneys simply print documents. Once the PDF is on paper, OCR-ing it back into text is just one scanner away.


I understand that. And I understand your personal use case was valid. But I think your "Human Eyes Only" name and domain is a little deceptive.


I’m not sure what reader you are talking about, but that button is most certainly not doing any kind of OCR if your technique stops it.


It's actually much easier than you think.

You don't need any scripts, just Acrobat itself (or any comparable PDF viewer) can do this. Export the PDF to images, make a new PDF out of the images, scan the text, done.

Example (took your example and did just that with it, now everything can be copied & pasted as normal text): https://filebin.net/qse2e0oaqkl1hjof/ocred.pdf

In general, if it LOOKS like text, SOMETHING can OCR it. That's the whole point of OCR. If you want to try to block OCR, you need something like CAPTCHAs, and that's getting less and less effective every day. In fact many are already more easily solved by computers than humans.


Of course. The OP doesn't understand what "OCR" actually means.


>It is against that type of OCR that my app is resistant to.

There is no form of OCR this is resistant to. Simply change the description to be accurate and remove the references to being OCR resistant, as that claim is false.


>I'm 1000% sure there are gurus who could whip up a script to overcome this. But its kind of one of those things where you don't have to outrun the bear, you have to outrun your friend running next to you. It makes your sensitive documents just that much less likely to be scanned/found.

Security through obscurity is stupid:

    # Render the PDF to a 24-bit color TIFF with Ghostscript...
    gs -sDEVICE=tiffscaled24 -dNOPAUSE -dBATCH -dSAFER \
       -sOutputFile=filename.tiff \
       filename.pdf
    # ...then OCR with Tesseract; the second argument is the output base
    # name, so this writes filename.txt:
    tesseract filename.tiff filename
All you need is ghostscript and tesseract. Both are an apt-get away.


If it's on the dark web then they probably know how to use 'ocrmypdf' as well (which uses tesseract under the hood).

https://github.com/ocrmypdf/OCRmyPDF


I'm fairly sure you can just open up any image (not sure if there's a limit on size or complexity) on a Mac and use the select all shortcut to grab all the text to use for whatever you'd like.


macOS does it by default now. I've found it to be very useful at times.

Any image you open in Safari, Preview, etc. (official Apple programs) will be OCR'd automatically, allowing you to extract the text with copy+paste. I think it works with PDFs, but I haven't tested it.

https://support.apple.com/guide/preview/interact-with-text-i...


And tools like pdftotext... it effectively breaks those.


Cybersecurity Incident & Vulnerability Response Playbooks

Operational Procedures for Planning and Conducting Cybersecurity Incident and Vulnerability Response Activities in FCEB Information Systems

Publication: November 2021

Cybersecurity and Infrastructure Security Agency

DISCLAIMER: This document is marked TLP:WHITE. Disclosure is not limited. Sources may use TLP:WHITE when information carries minimal or no foreseeable risk of misuse, in accordance with applicable rules and procedures for public release. Subject to standard copyrght rules, TLP:WHITE information may be distributed without restriction. For more information on the Traffic Light Protocol, see

---

Converting the first page of the sample PDF file to a TIFF using Ghostscript and running Tesseract OCR without any special filters.

>Resistant to Optical Character Recognition (OCR), most laypeople will need to print+rescan to OCR

This is not OCR resistant; I used the same two-liner I used to get my textbooks scanned at university 20 years ago.


You specifically are technologically proficient. Not everybody knows how to export to a TIFF and then OCR; 98% don't. When I say "OCR resistant", I mean that I haven't found PDF software with built-in OCR that has managed to extract the English text back out.


That's like saying a lock is pick resistant because you haven't been able to open it with a dead fish.

Words mean things; if what you did can't stand up to 20-year-old technology, then it's basically useless. Remove the claim that it resists OCR and just call it copy/paste-proof and unsearchable.


If a lock convinces the most popular lock-picking devices to use the ineffective dead-fish technique, then it's something at least.


Using Tesseract OCR is literally the first thing anyone does.


PDF software doesn't attempt to OCR text because it is already text. This fools PDF software into not even attempting OCR, rather than defeating OCR.

What you're resisting here is the ability of other applications to scrape the already-text-format text.


You could use the ZXX typeface to defy OCR: https://walkerart.org/magazine/sang-mun-defiant-typeface-nsa...

It’s probably still not ML-resistant.


You could just train an OCR engine on that typeface. IIRC training Tesseract for a new font is quite trivial.


Once you get it working, I'm sure. I've tried doing exactly that and every time I end up scouring through unmaintained scripts, old manuals and help guides that assume you're already intimately familiar with the tool itself.

In the end I managed to get some text out of it but the end result was still pretty terrible.


This is basically just really weak DRM and is just as evil.


I agree that OCR is an important tool for end users, especially those with accessibility needs, and that we shouldn't use something like this lightly, but the context is completely different here. If I want to send you a PDF from my gmail, and want to make it difficult for Google to leverage that data - that's completely different than if I were a giant media company, gatekeeping to ensure a huge portion of culture flows through me, which I then claim as my own and charge exorbitant rents for, enforced by DRM.

The problem with DRM is not that someone is trying to control what happens to a string of bits, it's that it props up an institution which is harmful.


The problem is that people won’t “use it lightly” and not everyone speaks the same language. Being able to copy/paste text into a translation tool (probably Google Translate which is kinda ironic in the case you mentioned) to understand what the document is about is super important when in another country or communicating with someone in another country.


Being able to copy/paste is an important option that empowers users.

Being able to selectively defeat copy paste is an additional option that additionally empowers users.

I don't anticipate this tool being used very widely. If it became the default, I would have a problem with it for the reasons you highlight, among others.


It doesn't empower users when the final authority over what a computer does or refuses to do lies with anyone other than the computer's owner.


I agree that DRM is bad and that people should be able to exercise the full capabilities of their computer and the data on it. However, this includes controlling access to your computer and your data.

Saying this is tantamount to DRM misunderstands what the problem is and what empowerment means. Actions taken by those in positions of power and those who are not aren't morally equivalent. This sort of thinking is damaging to any effort to empower users.


DRM is absolutely immoral even if it's the little guy imposing it.


> I agree that OCR is an important tool for end users, especially those with accessibility needs, and that we shouldn't use something like this lightly, but the context is completely different here. If I want to send you a PDF from my gmail, and want to make it difficult for Google to leverage that data - that's completely different than if I were a giant media company, gatekeeping to ensure a huge portion of culture flows through me, which I then claim as my own and charge exorbitant rents for, enforced by DRM.

There are easier ways to do this, such as encrypting the PDF. It is trivially easy to password protect a PDF as well [0], it is even a part of the PDF spec. It isn't ironclad, but it will defeat Gmail's indexer.

[0]: https://digify.com/blog/protect-pdf-with-password/


Is it even DRM or more closely just child-like ROT13?


I'm not very convinced by the PDF idea, but web fonts done this way would be great for the parts of web pages you don't want scraped or collected by search engines, if it is on pages where you do want at least some of the content available to search engines.


The antiscraping thing is a good idea. Hell, you could poison it very lightly using homoglyphs (a Greek capital Epsilon instead of just an E) just to see where else on the internet your data ends up, too.
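
Something like this, say (a hypothetical sketch; each recipient would get a different set of swap positions so a leak points back to its source):

    # Latin letters and their visually identical Greek capitals.
    HOMOGLYPHS = {"E": "Ε", "A": "Α", "O": "Ο"}

    def poison(text: str, positions: set) -> str:
        return "".join(
            HOMOGLYPHS.get(c, c) if i in positions else c
            for i, c in enumerate(text)
        )

    # poison("LEAKED MEMO", {1}) renders identically but is searchable
    # on the web as a unique byte sequence.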


> his ex was going to leak all of his discovery documents on the internet

If that happens, I suspect he'd have a very strong case to win custody...


He got what he wanted out of the case, so good for him. The problem is that he just didn't know whether it was going to happen. And she could definitely get sanctioned for it, but his info would still be out there.


To me, this falls under the category of — you can’t have a technical solution to a societal problem. Yes, technology may have made your friend feel better. But the actual thing that protected him was the law, not the obfuscated PDFs.

But, if the judge didn’t care and it made your friend feel better, who am I to judge? But this isn’t a great protection scheme… it just adds a few extra technical hurdles that are easy to get around.


> afraid his ex was going to leak all of his discovery documents on the internet

Any chance of getting a clear protective order in advance from the judge, and then enforcing that with contempt of court sanctions if his ex violated the order?

The discovery process is very, very intrusive (sometimes unbelievably so), but judges presiding over cases involving sensitive discovery may be interested in trying to use their powers to mitigate harms from leaks.

(You could use some kind of PDF watermark or text watermark to confirm that the leaked version was the version produced in discovery.)


MacOS does OCR on this just fine. Screenshot, open in Preview, and select, copy, paste:

Name: Satoshi Nakamoto DOB: 1982-06-05 SSN: 958-20-3141 Cell Phone: 514-867-5309


This is terrible for accessibility, please don't do this.


Isn't that the point though?

If you make a sensitive document unindexable (assuming this works), then effectively no one can find it.

The intent here is not to restrict the document to sighted users but to hide the document from everyone, which includes sighted and blind users searching for keywords. The fact that blind users can't read the text at all without screen grabbing is just a bonus.


Treating blind people that way will 100% lead you towards a lawsuit. So have fun with that I guess.


You asked about use case ideas: while I personally strongly dislike them, there are a number of sites, including online testing apps, that try to prevent copy-and-paste. Not sure how valuable it would be, but it's for sure a use case.


Lesser-known fact: this "feature" is actually built into Chrome on macOS! If you print a website to PDF, it will completely break it, and copy and paste will result in random Unicode characters.


But why was this "feature" added in the first place?

In most cases, this feature isn't needed but causes confusion instead.


Funny, I did this for the web something like 7 or 8 years ago under the name "cprotext", but was unable to find a way to sell it as SaaS :)

There should be a WordPress plugin floating around somewhere called wp-cprotext, and maybe one or two demo websites whose URLs I can't even remember.

People came up with the same criticisms as we can read here: evil DRM, accessibility nightmare, and easily bypassed by OCR. All in all, I came to be quite convinced by these criticisms, especially the first and second, and shut it down completely.

I would genuinely be interested to see how you'll succeed where I failed! Good luck!


Interesting. Thanks for the info. I'll look it up!


The reason - to make it only readable by human eyes - doesn't make sense. It is always only ever readable by human eyes, since bots don't have eyes. You don't mean eyes, but consciousness? Bots don't have consciousness, at least not until AGI gets here.

By being indexed by bots and published on the Web, it’s just more human eyes. If you just want certain human eyes, then you instead need access control.

By trying to exclude machines you just make it harder for humans to use the information, like copy pasting, text-to-speech, screen readers for the blind, etc.


Aside from it evidently not working, there is, it seems to me, a logical problem at the root: if your friend's ex was going to leak all the documents on the internet and they were readable by people but not by crawlers, people would still read them, probably transcribe them if your friend was important enough to worry about this kind of thing, and then put the transcribed documents online.

It seems more likely that if you wanted to pass around documents with hidden content, you would do so via the time-honored methods of steganography, not anything else.


I am on an older phone.

What your eyes see is identical to what the computer sees: both are gibberish. Also, the email is gibberish.

What am I missing here?


I'm wondering if your browser isn't showing woff fonts for some reason. The "your eyes see" textbox on a modern browser shows:

Name: Satoshi Nakamoto

DOB: 1982-06-05

SSN: 958-20-3141

Cell Phone: 514-867-5309


Your PDF reader probably doesn't support the particular font this is using. Try it on your computer.


> Have a good use case? Email me: ㆋホυ@ӎㄻӈӭυӽѫұぅѡれハΣ.ҤㅔӸ

This is what I am getting on the browser.


Maybe you don't allow pages to choose their own fonts? From what I understand, it works by making a font that renders random Unicode code points as Latin glyphs. So unless a given text is displayed with its matching font, it will appear as gibberish.

edit: Saw you said you use an old phone. The fonts are in WOFF format, and older Android phones don't support it.


Mm, might have to use a different browser/phone? Not sure what else there is to do here... it's a proof of concept that doesn't work on all devices.


It probably depends on browser font support. PDFs work with embedded fonts, so it's less likely to be an issue when viewing them there.


Ironically, that undermines the entire point of PDF


My AP history textbooks did this, yet somehow was still searchable with ctrl-f.

I wonder what the compromise was.


I thought it would be trivial to render and apply OCR and was going to give it a quick attempt in Swift, but looking at the comments, it looks like many implementations already exist without even trying.


Phishers and scammers will like this. If they don't have links, they resort to using pictures of text instead and hope for a lack of OCR scanning.

But I am sure this can also be defeated by OCR.


Took me a while to realise that my font settings in Firefox break this.


Has anyone tried uploading one of these pdfs to a website? I’m curious to know whether search engines successfully ocr and index stuff like this already.


Using CLI OCR, I am getting a 100% match: `Name: Satoshi Nakamoto DOB: 1982-06-05 SSN: 958-20-3141 Cell Phone: 514-867-5309`


I thought this was going to be some style guide on how to make the PDF document easy on the eyes to read.


This seems like it would just be annoying and would not even work for most purposes. Kind of neat though.


Just give me readable papers instead. Such a pain-in-the-dick format, yet it's the only thing there is.


I like the idea, it’s a decent approach to stop low-effort scraping of your info.


With Javascript disabled, it doesn't work... they both look about the same


Impressive, the example web text area is very lovely too


Fascinating. How does this work?


It takes the embedded font out of your PDF, and then maps non-Latin characters (Japanese, Cyrillic, etc.) to render as if they were Latin characters. So in the example on the site, "ӕ" will render as a "D" using my special font, and "ㅈ" will draw the "B" glyph. Then I do a replacement on the underlying text so all "B"s are replaced with "ㅈ". It is more complicated than that, but that's the gist.
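
Roughly, in sketch form (hypothetical helpers; the actual font surgery would be done with something like fontTools on the embedded font):

    # English letter -> stand-in codepoint, per the example above.
    SUBSTITUTION = {"D": "ӕ", "B": "ㅈ"}

    def rewrite_text_layer(text: str) -> str:
        # The PDF's text layer now stores "ㅈ" wherever a "B" appeared...
        return "".join(SUBSTITUTION.get(c, c) for c in text)

    # ...while the doctored font maps each stand-in back to the original
    # outline, so the page still *renders* as English:
    #   new_font["ㅈ"] = original_font.glyph("B")
    #   new_font["ӕ"] = original_font.glyph("D")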


So basically it's a type of Caesar cipher where letters are mapped to something else one-to-one. Very easy to decrypt / reverse. If this tool ever became popular there would be hundreds of scripts to defeat it.

And as it is, it does not prevent "OCR", only copy-paste.


Looks like it's actually one-to-many across Unicode; if so, you could think of it as approaching one-time-pad encryption, with the key being the font.

If the generator crafted a new font every time, never used the same codepoint twice, and kept the font separate from the document (pre-shared by being installed on the intended receiver's machine), then it'd be uncrackable!


brilliant solution.




"As such, when you try to OCR these files, it doesn't see any images and can't convert it."

Bullshit

1. Screenshot

2. OCR

3. Profit



