I built an online PDF management platform using open-source software

porcoda · 2024-05-13T00:58:30 1715561910

Unfortunately, most of the PDF work I do involves things I’m not uploading to a service - ever. I don’t care if they’re “deleted immediately after processing” - they left my control. This sort of software would be great if it were 100% offline.

This isn’t just a niche issue either: this is a very real consideration for any corporate user. More companies are taking data loss and security issues seriously, which often means restricting what cloud services they are willing to use.

harryf · 2024-05-13T07:53:21 1715586801

I work at https://www.pdf-tools.com and we hear this again and again.

Despite the proliferation of cloud services, most large enterprises DO NOT want their sensitive documents entering the cloud. And in some cases, e.g. patient medical records, there are strict regulations about how those documents can be stored, which means on-premise is a requirement.

Good news for us, as that's what we specialise in, but also perplexing how trends in the software industry can completely ignore what customers actually want.

thr0waway001 · 2024-05-13T13:02:55 1715605375

Looks interesting.

However, the pricing page with no actual numbers and the ambiguous ‘Contact Us’ is a huge turn off.

I cannot stand the dance with business people who want to have a bunch of calls and meetings to know how big a company they’re dealing with is before they decide on a good rate to gouge them.

Pricing pages should be straightforward. Have tiers if you want to cover your rear, but only offer the ‘Contact Us’ option at the upper limit of usage.

I’m shopping around for a PDF solution and would’ve recommended this to my manager but I’m not willing to do more meetings to get quotes.

snehk · 2024-05-13T13:59:29 1715608769

> the ambiguous ‘Contact Us’ is a huge turn off

Same. About three years ago we introduced a company wide policy to not buy anything where the price is not known. So, so much time (money) being wasted on figuring out the actual costs, the offering would have to be really inexpensive to make up for this. And if that were the case, the price would be right there.

thr0waway001 · 2024-05-13T15:06:47 1715612807

Yup.

They usually do high usage volume pricing at high rates that are proportional to the size of the company and make you sign a yearly agreement so they can get a huge payment upfront.

How about building some trust? What if the service sucks? It will be hard to get your money back and you paid a year in advance.

They make you work to get a quote and the quote usually doesn’t work for your needs.

I too will not look at services with this pricing structure anymore unless word of mouth is favorable.

IG_Semmelweiss · 2024-05-13T16:41:47 1715618507

very good heuristic. I'll be borrowing. Any others you'd care to share ?

rekabis · 2024-05-15T17:02:56 1715792576

> the pricing page with no actual numbers and the ambiguous ‘Contact Us’ is a huge turn off.

It’s also one of the top-10 web usability mistakes as defined by the Nielsen Norman group.

As in, it drives away far more potential clients than it can possibly convert. It’s a massive anti-pattern.

bluGill · 2024-05-13T13:31:11 1715607071

Large enterprises can afford to take things in house and might even save money that way, not to mention the security gains. Medical offices have no choice. However, small companies often don't have anyone in IT (other than the CEO who does everything and only rarely knows what he is doing outside the niche the company is in). These should be the prime market for tools like this - just pay us a little bit and we will worry about the details for you - everything is backed up. However, if you can get one enterprise account, that is a lot more money than thousands of little accounts, and so everyone focuses on them anyway.

gruturo · 2024-05-13T10:42:34 1715596954

> Good news for us, as that's what we specialise in, but also perplexing how trends in the software industry can completely ignore what customers actually want.

I initially read this backwards and thought you were lamenting that people insist on on-prem stuff when cloud is clearly The Right Thing.

I certainly don't think the entire software industry is ignoring what customers actually want. Case in point, you. But also lots of other developers who thrive in covering the myriad use cases the myopic behemoths can't see. They just have very loud PR and marketing and pretend those cases don't exist, so you hear about them a lot.

isatty · 2024-05-13T18:28:47 1715624927

You seem to think that users want everything in the cloud and that’s what’s causing the proliferation of cloud services. You are wrong. Users want _convenience_. They couldn’t care less about the cloud or technical details. If your website can do what they want to do without uploading their documents to your server then and if it’s faster and cheaper then that’s what they’ll prefer.

iLoveOncall · 2024-05-13T10:15:49 1715595349

No PHP nor JavaScript SDK? You guys don't like money?

harryf · 2024-05-13T12:18:45 1715602725

It's a fair point. Most of our customers work with CPP, C# and Java in enterprise / back office contexts, which is why no PHP or Javascript right now - we've been tied up with other priorities. That said we just added Python to our main SDK and PHP is coming.

Plus our enterprise automation product can basically talk to anything via REST API ( https://www.pdf-tools.com/docs/conversion-service/api/conver... ).

But yeah - now you got me fired up to annoy some colleagues ;)

tracker1 · 2024-05-13T19:56:51 1715630211

I would think that JS/TS support would be relatively high up... my own bias speaking, but a lot of development and effort to easing cloud apps is JS/TS centric.

lomase · 2024-05-13T14:49:12 1715611752

PHP and Javascript? So you never worked on "enterprise"?

iLoveOncall · 2024-05-13T15:09:05 1715612945

I work in a FAANG on stuff that is definitely "enterprise software", a major part of what we develop is written in TypeScript.

I admit PHP will not be as good a candidate, but for smaller companies it is still extremely attractive, and it's probably easier to develop since you can write PHP extensions in C.

m3h · 2024-05-13T05:04:27 1715576667

In that case, you can use https://www.pdftool.org/, which runs in the browser but offline and never uploads your files to any server.

IG_Semmelweiss · 2024-05-13T16:54:14 1715619254

I wanted to let you know that i disabled UBlock and badger for your site, but i'm still getting "please disable adblocker" ad error.

The site renders fine otherwise. I'm not a technical user, but I do run uBlock with the setting that disables JavaScript completely.

m3h · 2024-05-13T18:31:55 1715625115

I didn't create this tool, but I use it frequently. I'm also using uBlock Origin, but I don't see the issue you describe. I'm not sure what Badger is, though.

BizarroLand · 2024-05-15T20:38:04 1715805484

Privacy Badger

https://privacybadger.org/

szundi · 2024-05-13T07:01:55 1715583715

How can I really know that as a random user

Rinzler89 · 2024-05-13T07:06:52 1715584012

Unplug your network cable when you use it.

zo1 · 2024-05-13T07:22:29 1715584949

And it stores it in local storage and uploads it using a service worker later when I'm online?

a2800276 · 2024-05-13T08:37:01 1715589421

If that's your paranoia level: How do you know the "offline" tool you're using is not uploading to a server? Possibly inadvertently in the course of bug reports, or surreptitiously while contacting the license server...?

Should security concerns really warrant not trusting the (reputable) vendor that the files are not being uploaded, you would need to do some sort of audit and/or run in an isolated environment and wouldn't be the "random user" referred to in OP.

ffpip · 2024-05-13T10:42:34 1715596954

You can easily block network access for an app on Windows using Windows Firewall. Same on a few Android skins such as MIUI by Xiaomi

alandarev · 2024-05-13T13:47:45 1715608065

The same is true for the Chrome browser: open dev tools and set the Network throttling option to "Offline".

ffpip · 2024-05-13T17:16:24 1715620584

Thanks

Rinzler89 · 2024-05-13T07:25:50 1715585150

Use incognito mode then close that window before reconnecting online?

navane · 2024-05-13T07:56:53 1715587013

I'd suggest installing a separate browser (there are a myriad by now), unplugging the internet, using the service, uninstalling the separate browser, and rebooting the PC.

Rinzler89 · 2024-05-13T08:07:19 1715587639

I suggest a separate VM for that, which you can delete when you're done. And put the VM on a separate PC that you bought with cash off craigslist. Then toss the PC away in a different postcode when you're done. Then you can use the PDF tool safely without fear you're being tracked.

intelVISA · 2024-05-13T10:48:03 1715597283

Run it on an air gapped breadboard 8086?

tqwhite · 2024-05-13T15:39:14 1715614754

Use 'Developer Tools' and Inspect. Watch the Network tab.

If you also wear a tinfoil hat, delete the local storage, etc, after you are done using it.

sp0ck · 2024-05-13T14:48:22 1715611702

Is it open source? Can it be run as docker pull; docker run? If that is an option, then users can make sure it will work offline.

m3h · 2024-05-13T18:35:50 1715625350

This isn't my tool but based on what I read on the previous thread about it, it doesn't seem to be open-source. However, some folks recommended this tool which does seem to run locally: https://github.com/torakiki/pdfsam

re-thc · 2024-05-13T04:26:48 1715574408

> This isn’t just a niche issue either: this is a very real consideration for any corporate user

Very true, but I wish this "common" knowledge were more widespread. Security is a major issue that is commonly overlooked. People do a lot of insecure things for convenience.

sanusihassan · 2024-05-13T01:19:03 1715563143

I understand that you want to keep your work private and not expose your documents to the internet, but there might be situations where the document isn't that important to you and any online solution would be sufficient. Let's say a friend asks you to put a math problem to an AI so they can learn how to solve it, but the AI only understands text, so you need to OCR the PDF (which is really a converted JPG) and copy the text over. You might be on your phone or away from your desktop environment; that's where you might consider using an online solution like pdfequips :)

nip · 2024-05-13T19:22:19 1715628139

For anyone looking to edit/fill PDFs locally (the data you fill in and the document you load stay in your browser): https://SimplePDF.eu

You can read more in the privacy policy [1]

It can also be embedded in any website [2]

Disclosure: I’m the developer behind it

[1] https://simplepdf.eu/privacy-policy

[2] https://simplepdf.github.io/

thekevan · 2024-05-13T01:34:36 1715564076

I'd also not upload any personal or identifying docs to this, but I would use it for fliers, and it would REALLY be useful for converting PDFs I downloaded off the internet to begin with. (I've downloaded stuff in the past that I had to convert in order to enter the data on the PDF into my computer. Geologic data for maps, list of states with capitals, alphabetized by them--well before ChatGPT, the list goes on.)

pan69 · 2024-05-13T01:27:48 1715563668

Sounds to me like that (a desktop app version) is the product to sell (since the online service seems to be free).

chakintosh · 2024-05-13T09:40:13 1715593213

docker pull frooodle/stirling-pdf-base

nashashmi · 2024-05-13T09:59:33 1715594373

This was on hn a couple of days ago. Stirling pdf is a self hosted docker container and this way you don’t have to worry about files being uploaded. https://news.ycombinator.com/item?id=40242639

I almost thought this hn post was the same service wrapped in a show and tell.

darken · 2024-05-13T19:02:58 1715626978

I had just set up "Stirling PDF" on my home NAS a few weeks ago, since my SO needed to merge some documents and I'd recently read that (or a similar) HN thread.

I definitely would recommend it. It was really quick to set up, though already having a reverse proxy with wildcard TLS certs in place probably helped streamline the networking side of things.

https://github.com/Stirling-Tools/Stirling-PDF

ranger_danger · 2024-05-13T02:20:17 1715566817

Stirling-pdf. You can self-host it. Even though it all runs locally anyway

hyuuu · 2024-05-13T19:17:59 1715627879

this might be a stupid question, but how do the teams share the documents?

sanusihassan · 2024-05-12T22:05:30 1715551530

I decided to create pdfequips.com when a friend kept sending me PDF files for translation, and I realized how widespread the need for PDF solutions is. Now it serves as a central hub for PDF management, offering conversion tools like PDF to Word and CSV, as well as OCR technology. Over the past year, I extensively developed the website, leveraging a wide range of open-source tools on both the front-end and back-end.

czl_my · 2024-05-12T23:04:53 1715555093

I'd like if there's more details on the open source software used.

coretx · 2024-05-13T00:13:04 1715559184

Same here. No (F)OSS licenses to be found on the page itself. Sus. Perhaps it is simply injecting remote root vulnerabilities into the PDF's.

sanusihassan · 2024-05-13T00:40:00 1715560800

the web app i.e the front end part is next.js and typescript mostly, the landing page is built using astro.js, and the back end is heavily python, flask and some javascript for web-to-pdf and markdown-to-pdf, the rest is mostly python

deathemperor · 2024-05-13T06:51:00 1715583060

just curious: what do you use to convert web pages to pdf?

cuu508 · 2024-05-13T07:21:51 1715584911

Not op, but I've had good experience with WeasyPrint. I use it for generating PDF invoices: I create a HTML invoice from a template, WeasyPrint turns it into a PDF document. It handles CSS, images, custom fonts, etc.
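A minimal sketch of that template-to-PDF flow, for illustration only. The template text, field names, and function names are hypothetical, not cuu508's actual code; the only WeasyPrint call assumed is `HTML(string=...).write_pdf(path)`.

```python
import html
from string import Template

# Hypothetical invoice template; the fields are made up for this sketch.
INVOICE_TEMPLATE = Template("""\
<html>
  <head><style>body { font-family: sans-serif; } .total { font-weight: bold; }</style></head>
  <body>
    <h1>Invoice $number</h1>
    <p>Billed to: $customer</p>
    <p class="total">Total: $total</p>
  </body>
</html>
""")


def render_invoice_html(number: str, customer: str, total: str) -> str:
    """Fill the template, escaping values so they cannot inject markup."""
    return INVOICE_TEMPLATE.substitute(
        number=html.escape(number),
        customer=html.escape(customer),
        total=html.escape(total),
    )


def write_invoice_pdf(invoice_html: str, pdf_path: str) -> None:
    """Render an HTML string to a PDF file with WeasyPrint."""
    # Imported lazily so the HTML step works without WeasyPrint installed.
    from weasyprint import HTML

    HTML(string=invoice_html).write_pdf(pdf_path)


# Example usage (requires `pip install weasyprint`):
#   page = render_invoice_html("2024-0042", "Example & Co.", "120.00 EUR")
#   write_invoice_pdf(page, "invoice.pdf")
```

Keeping the HTML rendering separate from the PDF step makes the invoice markup easy to preview in a browser before converting it.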

A neat trick to convert HTML to PDF in a browser environment is to open a new browser window, load the HTML in it, and call print() on it, like here: https://stackoverflow.com/a/33890644/5821. May be OK for an internal tool.

sanusihassan · 2024-05-16T03:27:08 1715830028

puppeteer

aspenmayer · 2024-05-13T00:36:43 1715560603

I hope those are FOSS remote root PDF vulns!

coretx · 2024-05-13T02:33:06 1715567586

If something is turing complete, don't trust/execute it until you have verified where it comes from, who is behind it and what it does.

Here you have what Adobe has to say about PDF's: https://www.adobe.com/acrobat/resources/can-pdfs-contain-vir...

sanusihassan · 2024-05-13T00:37:12 1715560632

i used open source solutions to build it, like libreoffice, ghostscript, google's tesseract and a bunch of other tools. Google's Tesseract: https://github.com/tesseract-ocr/tesseract

beagle3 · 2024-05-13T06:46:46 1715582806

I’m surprised everyone is using Tesseract. It was the only game in town 10 years ago, and it’s Ok on cleaned aligned data, but there are a few newer ones like EasyOCR [0] that can deal with much less organized text (albeit more slowly)

[0] https://github.com/JaidedAI/EasyOCR

harryf · 2024-05-13T08:05:47 1715587547

EasyOCR looks like it's more focused on the mobile use case of extract text from photos. That's a little bit different from extracting text from scanned documents, where document structure is an important aspect, and Tesseract is the devil we know. In the commercial space ABBYY Finereader still dominates - https://en.wikipedia.org/wiki/ABBYY_FineReader

But perhaps I'm wrong...

ianhawes · 2024-05-13T13:14:13 1715606053

ABBYY does indeed dominate, but Google Document AI is making inroads.

racl101 · 2024-05-13T13:18:06 1715606286

Careful with the Ghostscript AGPL licensing if you plan to make a commercial product that uses it.

sedro · 2024-05-13T00:42:35 1715560955

The PDF metadata says it's PyPDF2

sanusihassan · 2024-05-13T00:47:55 1715561275

i used PyPDF2 to implement some tools, but not all of them.

tqwhite · 2024-05-13T15:41:25 1715614885

I think it looks like a nice tool, naysayers notwithstanding. I don't have sensitive PDFs and, though I would probably not use it for my tax return, I'll use it for other stuff. For my level of security, I'm happy enough with your promise to delete the stuff right away.

sanusihassan · 2024-05-14T01:12:15 1715649135

i appreciate your trust, and yeah believe me i'm deleting the files right after the processing. the way it works is that i save the uploaded files as tmp files, process them, then delete them after the response.

this is what the code looks like on the server side for most of the tools:

```python
...
@after_this_request
def remove_file(response):
    os.remove(tmp_file.name)
    return response

return response
```

i don't have any reason to keep them.

sanusihassan · 2024-05-14T01:15:16 1715649316

indentation is not showing correctly, but you get the idea.
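For readers who want the same delete-after-processing idea without Flask, here is a framework-independent sketch. It swaps `@after_this_request` for a plain `try`/`finally`; the function name `process_upload` and the `process` callback are hypothetical and are not taken from the pdfequips code.

```python
import os
import tempfile
from typing import Callable


def process_upload(data: bytes, process: Callable[[str], bytes]) -> bytes:
    """Write uploaded bytes to a temp file, process it, and always delete it."""
    # delete=False so the file can be reopened by path while processing;
    # removing it is then our responsibility.
    tmp = tempfile.NamedTemporaryFile(suffix=".pdf", delete=False)
    try:
        with tmp:
            tmp.write(data)
        # `process` receives the path and returns the converted bytes.
        return process(tmp.name)
    finally:
        # Runs on success and on error, so the upload never outlives the call.
        os.remove(tmp.name)
```

The `finally` block is the design point: the temp file is removed even when processing raises, so a crash does not leave the upload behind.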

saturn5k · 2024-05-12T22:59:19 1715554759

This is great, definitely going into bookmarks. The website design lacks some refinement, but overall easy to use.

sanusihassan · 2024-05-13T00:41:01 1715560861

thanks for the feedback!

boffinAudio · 2024-05-13T07:45:43 1715586343

I have 85,000 PDF documents, collected over a few decades.

What I really want is a semantic interface to those PDF documents. Find me "all PDF files which mention <subject>", or "show me any PDF with python example code", or "all PDF's before 2011 on the subject of coding standards for SIL-4".

I keep thinking this is out there somewhere, but whenever something new comes along I get bogged down in the details of setting it up. Surely someone has come up with an AI that you can just 'give the folder to' and it figures things out automagically?
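As a rough illustration of the simplest of those queries ("all PDF files which mention <subject>"), here is a plain keyword index. It is a stand-in, not semantic search, and it assumes the text of each PDF has already been extracted into strings by some other tool; all function names are hypothetical.

```python
import re
from collections import defaultdict


def tokenize(text: str) -> set[str]:
    """Lowercase the text and split it into alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def build_index(docs: dict[str, str]) -> dict[str, set[str]]:
    """Map each word to the set of document names that contain it."""
    index: dict[str, set[str]] = defaultdict(set)
    for name, text in docs.items():
        for word in tokenize(text):
            index[word].add(name)
    return index


def search(index: dict[str, set[str]], query: str) -> list[str]:
    """Return the documents that contain every word in the query."""
    words = tokenize(query)
    if not words:
        return []
    hits = set.intersection(*(index.get(word, set()) for word in words))
    return sorted(hits)
```

Queries like "PDFs with python example code" or date filters would need more than this: extracted metadata for the dates, and embeddings or an LLM for anything genuinely semantic.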

timc3 · 2024-05-13T14:52:27 1715611947

Have you tried Paperless NGX?

boffinAudio · 2024-05-14T06:55:42 1715669742

No I haven't, so thanks for recommending it to me - looks pretty detailed. I will try it out some time this week, maybe it's exactly what I'm looking for. Thanks again!

spiderfarmer · 2024-05-13T13:29:56 1715606996

You can do this locally with your favourite LLM and Open WebUI: https://github.com/open-webui/open-webui

boffinAudio · 2024-05-14T06:56:55 1715669815

Looks like I've got a few days of hacking ahead of me, thanks for the recommendation - will put it alongside the other suggestions and check it out when I do my "PDF sortout workbench" session ..

andro_dev · 2024-05-13T15:31:31 1715614291

This is what I use for that

https://github.com/simon987/sist2

boffinAudio · 2024-05-14T06:56:18 1715669778

Looks pretty functional, if not entirely polished - I will try this out (alongside Paperless NGX, also suggested here..) - I appreciate the recommendation, thank you!

torgeros · 2024-05-13T07:02:52 1715583772

If you're so keen on the open source aspect, could you make the sources of your website and the tools involved available, too? Otherwise there is no use to it.

kordlessagain · 2024-05-13T13:00:28 1715605228

I have a lot of similar tools and it's all Open Source: https://mitta.ai

nicknow · 2024-05-13T02:09:14 1715566154

If this is entirely built using open-source software, why not open source the site itself? Especially if you aren't planning to turn it into a commercial service.

sanusihassan · 2024-05-16T07:15:16 1715843716

i'm open-sourcing the backend, but not 100% of the code.

zxexz · 2024-05-13T03:43:56 1715571836

This is quite nice, but you really ought to have some page accessible with attributions to the open source projects you're using to power this!

mikabasketball · 2024-05-12T23:25:27 1715556327

What do people use to perform those pdf tasks without uploading sensitive files to a website?

vikp · 2024-05-13T02:45:52 1715568352

For PDF to markdown, I recently released V2 of my tool marker - https://github.com/vikparuchuri/marker

rch · 2024-05-13T05:19:17 1715577557

This is very effective - it consistently yields great results.

sanusihassan · 2024-05-13T00:44:19 1715561059

The files are deleted immediately after processing. I'm considering implementing WebAssembly (Wasm) to do most of the work on the client's device and enable offline use.

bruce511 · 2024-05-13T05:54:28 1715579668

You say that here, but it's not in the privacy policy as far as I can see.

When you get to the stage of monetizing the site, I expect the most obvious starting point is monetizing the information inside the pdfs.

Then there's the obscurity of how much (if any) is passed on to other services (like Google). You may have one policy about the PDFs, they may have another.

So yeah, I'm in the "not for general use" camp myself. (Although there are edge cases where it may be useful.)

Don't get me wrong, I can see the upsides, and your web site looks professional, but alas the downsides are too significant to overcome the inconvenience of searching out something local.

lannisterstark · 2024-05-12T23:27:26 1715556446

Stirling-pdf. You can self-host it.

senectus1 · 2024-05-13T00:30:08 1715560208

using this myself. it's pretty good.

lloydatkinson · 2024-05-13T00:38:42 1715560722

Only last week there was a HN thread about how the author said they just used chatgpt to make the entire thing and as a result the code is beyond bad. I don't think I'd trust it.

l8arrival · 2024-05-13T00:53:52 1715561632

They didn't say that. They said they wrote the first version in a few days, using ChatGPT. Then worked on it almost another year since then. Something of that nature. Pretty big difference.

lannisterstark · 2024-05-13T20:03:34 1715630614

>I don't think I'd trust it.

You can audit the code yourself then. What's stopping you?

lloydatkinson · 2024-05-14T13:09:31 1715692171

Nothing is stopping me using something else that isn’t ChatGPT hope and pray code.

senectus1 · 2024-05-13T23:48:45 1715644125

Am not a Coder :-P

lannisterstark · 2024-05-14T03:02:12 1715655732

That's a fair point lol, my bad.

acidburnNSA · 2024-05-13T00:46:08 1715561168

Pandoc, ocrmypdf, libreoffice, pdftk, pypdf2
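A hedged sketch of how a few of these tools can be driven from a script. The helper names are hypothetical, `soffice` is assumed to be the LibreOffice binary on the PATH, and only basic, well-known invocations are used.

```python
import shutil
import subprocess


def pdftk_merge_cmd(inputs: list[str], output: str) -> list[str]:
    """Build: pdftk in1.pdf in2.pdf ... cat output merged.pdf"""
    return ["pdftk", *inputs, "cat", "output", output]


def libreoffice_to_pdf_cmd(source: str, outdir: str) -> list[str]:
    """Build a headless LibreOffice conversion of one document to PDF."""
    return ["soffice", "--headless", "--convert-to", "pdf", "--outdir", outdir, source]


def ocrmypdf_cmd(source: str, output: str) -> list[str]:
    """Build: ocrmypdf scanned.pdf searchable.pdf (adds a text layer)."""
    return ["ocrmypdf", source, output]


def run_if_installed(cmd: list[str]) -> bool:
    """Run the command locally; return False if the tool is not on the PATH."""
    if shutil.which(cmd[0]) is None:
        return False
    subprocess.run(cmd, check=True)
    return True


# Example usage:
#   run_if_installed(pdftk_merge_cmd(["a.pdf", "b.pdf"], "merged.pdf"))
```

Everything here runs on the local machine, which is the point of this list: no file leaves the computer.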

tqwhite · 2024-05-13T15:42:27 1715614947

There is a site listed elsewhere in these comments that does the work entirely in-browser. That's what I would use.

brailsafe · 2024-05-13T23:15:17 1715642117

I use Preview on mac or even Spotlight for a good portion of these functions

andretti1977 · 2024-05-13T08:24:45 1715588685

Am I missing something, or does it lack PDF editing functionality like adding/editing text or adding images? I usually use https://smallpdf.com/edit-pdf because 99% of the time I simply need to fill in fields with text, attach a PNG of my signature on some pages, and send the document back to the organization that asked me to fill it in (schools, medical self-certifications, government tax entities and so on). For those needs, smallpdf is fantastic, but obviously I'd prefer an open-source or at least self-hostable solution.

xvfLJfx9 · 2024-05-13T15:24:15 1715613855

This would be awesome, if it can be selfhosted. I work with sensitive documents I can't upload to a third party.

epalm · 2024-05-12T23:21:34 1715556094

Nice roundup of tools!

Just a small note, on safari mobile if I expand the Edit and then Convert sections, they open on top of each other.

https://i.imgur.com/bSZdRTN.png

sanusihassan · 2024-05-13T00:52:27 1715561547

Thanks for bringing that up, I'll take care of that issue.

omegant · 2024-05-13T08:36:51 1715589411

At this point a new format should emerge as a replacement of pdf. It’s very useful and easy to publish, but working with pdf documents beyond reading and printing is way too complicated.

laurensr · 2024-05-13T06:26:12 1715581572

If anyone knows about a FOSS pdf form editor, please share!

unanimous · 2024-05-13T18:12:05 1715623925

You can edit PDF forms by opening the files in Firefox, but maybe that's not exactly what you mean. I'm not sure about other browsers.

hatenberg · 2024-05-13T17:18:00 1715620680

90% overlap with the free and selfhostable stirling-pdf?

carte_blanche · 2024-05-15T07:08:43 1715756923

I've been using Stirling-PDF as my go-to solution for any pdf needs and have never needed any other service. Open source gold standard for any pdf needs: https://github.com/Stirling-Tools/Stirling-PDF

erremerre · 2024-05-13T07:05:31 1715583931

I am always surprised there is absolutely nothing like the Adobe Acrobat Pro on the open source space.

There is a collection of open source tools, each of which has its own interface and does a subset of things.

The alternative, an online service, is not great...

martin_a · 2024-05-13T07:15:33 1715584533

PDF is harder than most people think due to its variety. While there might be a tool for every job, "one for all" is hard to do.

xupybd · 2024-05-12T23:24:17 1715556257

You should run ads or charge for something. I suspect this is going to get very popular.

sanusihassan · 2024-05-13T01:03:28 1715562208

Thank you for your suggestion, I'm considering keeping the basic functions free and adding premium features, but still not changing any of the core features i.e the website is going to work as is.

teeray · 2024-05-13T04:38:04 1715575084

Anybody have a good open-source receipt data extraction tool for PDFs?

tappio · 2024-05-13T06:27:15 1715581635

We just launched an MVP for PDF data extraction: https://excelifier.com/. The service is not open source and relies on OpenAI, which is probably a bit problematic in your case.

However, we understand that privacy concerns are really important for many organizations. Making it self-hostable and depend on a locally running LLM is something that we are looking into.

julianwachholz · 2024-05-16T12:53:01 1715863981

It sure sounds interesting, but I'm only getting timeouts. A possible hug-of-death period should be over by now?

madspindel · 2024-05-13T07:29:41 1715585381

Any plans to make this available as a docker container?

Refusing23 · 2024-05-14T05:00:16 1715662816

i tried converting a pdf to markdown and i just got a large bunch of ... seemingly random numbers and letters.

nilstycho · 2024-05-12T23:30:24 1715556624

I am a happy user of smallpdf, which seems quite similar. What advantages do you offer?

sanusihassan · 2024-05-13T00:59:48 1715561988

pdfequips is fast, free, and offers tools that are not available on smallpdf, like pdf-to-csv, pdf-to-pdf-a and translate-pdf. you can of course use what you feel most comfortable with, but i guess you should give pdfequips a try :)

tomthumb · 2024-05-13T08:45:30 1715589930

Bookmarked!

sanusihassan · 2024-05-13T11:40:28 1715600428

Thanks! :)