I built an online PDF management platform using open-source software (pdfequips.com)
231 points by sanusihassan 8 months ago | 114 comments



Unfortunately, most of the PDF work I do involves things I’m not uploading to a service - ever. I don’t care if they’re “deleted immediately after processing” - they left my control. This sort of software would be great if it were 100% offline.

This isn’t just a niche issue either: this is a very real consideration for any corporate user. More companies are taking data loss and security issues seriously, which often means restricting what cloud services they are willing to use.


I work at https://www.pdf-tools.com and we hear this again and again.

Despite the proliferation of cloud services, most large enterprises DO NOT want their sensitive documents entering the cloud. And in some cases, e.g. patient medical records, there are strict regulations about how those documents can be stored, which means on-premise is a requirement.

Good news for us, as that's what we specialise in, but also perplexing how trends in the software industry can completely ignore what customers actually want.


Looks interesting.

However, the pricing page with no actual numbers and the ambiguous ‘Contact Us’ is a huge turn off.

I cannot stand the dance with business people who want to have a bunch of calls and meetings to know how big a company they’re dealing with is before they decide on a good rate to gouge them.

Pricing pages should be straightforward. Have tiers if you want to cover your rear, but only at the limit of usage have the ‘Contact Us’ option.

I’m shopping around for a PDF solution and would’ve recommended this to my manager but I’m not willing to do more meetings to get quotes.


> the ambiguous ‘Contact Us’ is a huge turn off

Same. About three years ago we introduced a company-wide policy to not buy anything where the price is not known. So much time (money) is wasted on figuring out the actual costs that the offering would have to be really inexpensive to make up for it. And if that were the case, the price would be right there.


Yup.

They usually do high usage volume pricing at high rates that are proportional to the size of the company and make you sign a yearly agreement so they can get a huge payment upfront.

How about building some trust? What if the service sucks? It will be hard to get your money back and you paid a year in advance.

They make you work to get a quote and the quote usually doesn’t work for your needs.

I too will not look at services with this pricing structure anymore unless word of mouth is favorable.


Very good heuristic. I'll be borrowing it. Any others you'd care to share?


> the pricing page with no actual numbers and the ambiguous ‘Contact Us’ is a huge turn off.

It’s also one of the top-10 web usability mistakes as defined by the Nielsen Norman Group.

As in, it drives away far more potential clients than it can possibly convert. It’s a massive anti-pattern.


Large enterprises can afford to take things in house and might even save money that way, not to mention the security gains. Medical offices have no choice. However, small companies often don't have anyone in IT (other than the CEO, who does everything and only rarely knows what he is doing outside the niche the company is in). These should be the prime market for tools like this: just pay us a little bit and we will worry about the details for you, and everything is backed up. However, if you can get one enterprise account, that is a lot more money than thousands of little accounts, and so everyone focuses on them anyway.


> Good news for us, as that's what we specialise in, but also perplexing how trends in the software industry can completely ignore what customers actually want.

I initially read this backwards and thought you were lamenting that people insist on on-prem stuff when cloud is clearly The Right Thing.

I certainly don't think the entire software industry is ignoring what customers actually want. Case in point, you. But also lots of other developers who thrive in covering the myriad use cases the myopic behemoths can't see. They just have very loud PR and marketing and pretend those cases don't exist, so you hear about them a lot.


You seem to think that users want everything in the cloud and that’s what’s causing the proliferation of cloud services. You are wrong. Users want _convenience_. They couldn’t care less about the cloud or technical details. If your website can do what they want to do without uploading their documents to your server, and if it’s faster and cheaper, then that’s what they’ll prefer.


No PHP nor JavaScript SDK? You guys don't like money?


It's a fair point. Most of our customers work with C++, C# and Java in enterprise / back office contexts, which is why there's no PHP or JavaScript right now - we've been tied up with other priorities. That said, we just added Python to our main SDK and PHP is coming.

Plus our enterprise automation product can basically talk to anything via REST API ( https://www.pdf-tools.com/docs/conversion-service/api/conver... ).

But yeah - now you got me fired up to annoy some colleagues ;)


I would think that JS/TS support would be relatively high up... my own bias speaking, but a lot of the development and effort toward easing cloud apps is JS/TS centric.


PHP and Javascript? So you never worked on "enterprise"?


I work in a FAANG on stuff that is definitely "enterprise software", a major part of what we develop is written in TypeScript.

I admit PHP will not be as good a candidate, but for smaller companies it is still extremely attractive, and it's probably easier to develop since you can write PHP extensions in C.


In that case, you can use https://www.pdftool.org/, which runs in the browser but offline and never uploads your files to any server.


I wanted to let you know that I disabled uBlock and Badger for your site, but I'm still getting a "please disable adblocker" error.

The site renders fine otherwise. I'm not a technical user, but I do run uBlock with JavaScript completely disabled.


I didn't create this tool, but I use it frequently. I'm also using uBlock Origin, but I don't see the issue you describe. I'm not sure what Badger is, though.



How can I really know that as a random user?


Unplug your network cable when you use it.


And it stores it in local storage and uploads it using a service worker later when I'm online?


If that's your paranoia level: How do you know the "offline" tool you're using is not uploading to a server? Possibly inadvertently in the course of bug reports, or surreptitiously while contacting the license server...?

Should security concerns really warrant not trusting the (reputable) vendor that the files are not being uploaded, you would need to do some sort of audit and/or run in an isolated environment and wouldn't be the "random user" referred to in OP.


You can easily block network access for an app on Windows using Windows Firewall. The same goes for a few Android skins such as MIUI by Xiaomi.


The same is true for the Chrome browser: open dev tools and set Network to "Offline".


Thanks


Use incognito mode then close that window before reconnecting online?


I'd suggest installing a separate browser (there exists a myriad by now), unplugging the internet, using the service, uninstalling the separate browser, and rebooting the PC.


I suggest a separate VM for that, which you can delete when you're done. And put the VM on a separate PC that you bought with cash off craigslist. Then toss the PC away in a different postcode when you're done. Then you can use the PDF tool safely without fear you're being tracked.


Run it on an air gapped breadboard 8086?


Use 'Developer Tools' and Inspect. Watch the Network tab.

If you also wear a tinfoil hat, delete the local storage, etc, after you are done using it.


Is it open source? Can it be run as docker pull; docker run? If this is an option then users can make sure it will work offline.


This isn't my tool but based on what I read on the previous thread about it, it doesn't seem to be open-source. However, some folks recommended this tool which does seem to run locally: https://github.com/torakiki/pdfsam


> This isn’t just a niche issue either: this is a very real consideration for any corporate user

Very true, but I wish this "common" knowledge were more widespread. Security is a major issue that is commonly overlooked. People do a lot of insecure things for convenience.


I understand that you want to keep your work private and not expose your documents to the internet, but there may be situations where the document isn't that important to you and any online solution would be sufficient. Say a friend asks you to put a math problem to an AI so they can learn how to solve it, but the AI only understands text. You then need to OCR the PDF (which is really a converted JPG) and copy the text to the AI. You might be on your phone or away from your desktop environment; here you might consider using an online solution like pdfequips :)


For anyone looking to edit/fill PDFs locally (the data you fill in and the document you load stay in your browser): https://SimplePDF.eu

You can read more in the privacy policy [1]

It can also be embedded in any website [2]

Disclosure: I’m the developer behind it

[1] https://simplepdf.eu/privacy-policy

[2] https://simplepdf.github.io/


I'd also not upload any personal or identifying docs to this, but I would use it for fliers, and it would REALLY be useful for converting PDFs I downloaded off the internet to begin with. (I've downloaded stuff in the past that I had to convert in order to enter the data from the PDF into my computer. Geologic data for maps, a list of states with capitals, alphabetized by them, well before ChatGPT; the list goes on.)


Sounds to me like that (a desktop app version) is the product to sell (since the online service seems to be free).


docker pull frooodle/stirling-pdf-base


This was on HN a couple of days ago. Stirling PDF is a self-hosted Docker container, and this way you don’t have to worry about files being uploaded. https://news.ycombinator.com/item?id=40242639

I almost thought this hn post was the same service wrapped in a show and tell.


I had just set up "Stirling PDF" on my home NAS a few weeks ago, since my SO needed to merge some documents and I'd recently read that (or a similar) HN thread.

I definitely would recommend it. It was really quick to set up, though my already having a reverse proxy with wildcard TLS certs set up probably helped streamline the networking side of things.

https://github.com/Stirling-Tools/Stirling-PDF


Stirling-pdf. You can self-host it. Even though it all runs locally anyway


This might be a stupid question, but how do the teams share the documents?


I decided to create pdfequips.com when a friend kept sending me PDF files for translation and I realized the widespread need for PDF solutions. Now it serves as a central hub for PDF management, offering conversion tools like PDF to Word and CSV, as well as OCR technology. Over the past year, I extensively developed the website, leveraging a wide range of open-source tools on both the front end and the back end.


I'd like more details on the open source software used.


Same here. No (F)OSS licenses to be found on the page itself. Sus. Perhaps it is simply injecting remote root vulnerabilities into the PDFs.


The web app, i.e. the front-end part, is mostly Next.js and TypeScript; the landing page is built using Astro.js; and the back end is heavily Python and Flask, with some JavaScript for web-to-pdf and markdown-to-pdf. The rest is mostly Python.


Just curious: what do you use to convert web pages to PDF?


Not op, but I've had good experience with WeasyPrint. I use it for generating PDF invoices: I create a HTML invoice from a template, WeasyPrint turns it into a PDF document. It handles CSS, images, custom fonts, etc.

A neat trick to convert HTML to PDF in a browser environment is to open a new browser window, load the HTML in it, and call print() on it, like here: https://stackoverflow.com/a/33890644/5821. May be OK for an internal tool.
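
For anyone who wants to see the template-then-render flow described above, here is a rough sketch rather than anyone's production code. The template, the function names and the invoice fields are made up for illustration; the only WeasyPrint calls assumed are `HTML(string=...)` and `write_pdf(...)`, and WeasyPrint itself is a third-party install.

```python
from html import escape
from string import Template

# Hypothetical invoice template; a real one would carry CSS, images and fonts.
INVOICE_TEMPLATE = Template(
    "<html><body>"
    "<h1>Invoice $number</h1>"
    "<p>Customer: $customer</p>"
    "<p>Total: $total</p>"
    "</body></html>"
)


def render_invoice_html(number: str, customer: str, total: str) -> str:
    """Fill the template, escaping values so they cannot break the markup."""
    return INVOICE_TEMPLATE.substitute(
        number=escape(number),
        customer=escape(customer),
        total=escape(total),
    )


def write_invoice_pdf(html: str, out_path: str) -> None:
    """Turn the rendered HTML into a PDF file using WeasyPrint."""
    # Imported lazily so the HTML step works without WeasyPrint installed.
    from weasyprint import HTML

    HTML(string=html).write_pdf(out_path)
```

Usage would be along the lines of `write_invoice_pdf(render_invoice_html("42", "ACME", "10.00"), "invoice.pdf")`.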


puppeteer


I hope those are FOSS remote root PDF vulns!


If something is turing complete, don't trust/execute it until you have verified where it comes from, who is behind it and what it does.

Here you have what Adobe has to say about PDF's: https://www.adobe.com/acrobat/resources/can-pdfs-contain-vir...


I used open-source solutions to build it, like LibreOffice, Ghostscript, Google's Tesseract, and a bunch of other tools. Google's Tesseract: https://github.com/tesseract-ocr/tesseract


I’m surprised everyone is using Tesseract. It was the only game in town 10 years ago, and it’s OK on clean, aligned data, but there are a few newer ones like EasyOCR [0] that can deal with much less organized text (albeit more slowly).

[0] https://github.com/JaidedAI/EasyOCR


EasyOCR looks like it's more focused on the mobile use case of extracting text from photos. That's a little bit different from extracting text from scanned documents, where document structure is an important aspect, and Tesseract is the devil we know. In the commercial space ABBYY FineReader still dominates - https://en.wikipedia.org/wiki/ABBYY_FineReader

But perhaps I'm wrong...


ABBYY does indeed dominate, but Google Document AI is making inroads.


Careful with the Ghostscript AGPL licensing if you plan to make a commercial product that uses it.


The PDF metadata says it's PyPDF2


I used PyPDF2 to implement some tools, but not all of them.


I think it looks like a nice tool, naysayers notwithstanding. I don't have sensitive PDFs and, though I would probably not use it for my tax return, I'll use it for other stuff. For my level of security, I'm happy enough with your promise to delete the stuff right away.


I appreciate your trust, and yeah, believe me, I'm deleting the files right after processing. The way it works is that I save each uploaded file as a tmp file, process it, then delete it after the response.

This is what the code looks like on the server side for most of the tools:

```python
...
    # Flask calls this hook once the current request has been handled.
    @after_this_request
    def remove_file(response):
        os.remove(tmp_file.name)  # delete the uploaded temp file
        return response

    return response
```

I don't have any reason to keep them.
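
For readers who want the save, process, delete pattern spelled out without Flask, a minimal stdlib-only sketch follows. It is not the site's actual code: `process_upload` and the `process` callback are hypothetical names, and the cleanup uses `try`/`finally` instead of Flask's `after_this_request` hook.

```python
import os
import tempfile


def process_upload(data: bytes, process) -> bytes:
    """Write an upload to a temp file, process it, and always delete it.

    `process` is a hypothetical callable that takes a file path and
    returns the processed bytes.
    """
    # delete=False so the file can be reopened by name on every platform.
    tmp_file = tempfile.NamedTemporaryFile(suffix=".pdf", delete=False)
    try:
        tmp_file.write(data)
        tmp_file.close()
        return process(tmp_file.name)
    finally:
        # Runs whether processing succeeded or raised.
        tmp_file.close()  # closing twice is harmless
        if os.path.exists(tmp_file.name):
            os.remove(tmp_file.name)
```

The point of `finally` here is that a crash during processing still removes the file, which a cleanup step placed only on the success path would miss.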


This is great, definitely going into bookmarks. The website design lacks some refinement, but overall easy to use.


thanks for the feedback!


I have 85,000 PDF documents, collected over a few decades.

What I really want is a semantic interface to those PDF documents. Find me "all PDF files which mention <subject>", or "show me any PDF with python example code", or "all PDF's before 2011 on the subject of coding standards for SIL-4".

I keep thinking this is out there somewhere, but whenever something new comes along I get bogged down in the details of setting it up. Surely someone has come up with an AI that you can just 'give the folder to' and it figures things out automagically?
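
As a rough starting point for the "all PDF files which mention <subject>" part, here is a sketch of a plain keyword index. This is keyword matching, not semantic search, and it assumes the text has already been extracted from each PDF by some other tool; `build_index`, `search` and the sample file names are made up for illustration.

```python
import re
from collections import defaultdict

WORD_RE = re.compile(r"[a-z0-9]+")


def build_index(texts: dict[str, str]) -> dict[str, set[str]]:
    """Map each lowercase word to the set of document paths containing it."""
    index: dict[str, set[str]] = defaultdict(set)
    for path, text in texts.items():
        for word in set(WORD_RE.findall(text.lower())):
            index[word].add(path)
    return index


def search(index: dict[str, set[str]], query: str) -> set[str]:
    """Return the documents that contain every word in the query."""
    words = WORD_RE.findall(query.lower())
    if not words:
        return set()
    matches = [index.get(word, set()) for word in words]
    return set.intersection(*matches)


# Hypothetical extracted text, keyed by file name.
docs = {
    "guide.pdf": "Python example code for parsing",
    "safety.pdf": "Coding standards for SIL-4 systems",
}
index = build_index(docs)
print(search(index, "python code"))
```

Date filters like "before 2011" would need file metadata on top of this, and anything closer to meaning-based queries needs embeddings or one of the tools suggested in the replies.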


Have you tried Paperless NGX?


No I haven't, so thanks for recommending it to me - looks pretty detailed. I will try it out some time this week, maybe it's exactly what I'm looking for. Thanks again!


You can do this locally with your favourite LLM and Open WebUI: https://github.com/open-webui/open-webui


Looks like I've got a few days of hacking ahead of me, thanks for the recommendation - will put it alongside the other suggestions and check it out when I do my "PDF sortout workbench" session ..


This is what I use for that

https://github.com/simon987/sist2


Looks pretty functional, if not entirely polished - I will try this out (alongside Paperless NGX, also suggested here..) - I appreciate the recommendation, thank you!


If you're so keen on the open source aspect, could you make the sources of your website and the tools involved available, too? Otherwise there is no use to it.


I have a lot of similar tools and it's all Open Source: https://mitta.ai


If this is entirely built using open-source software, why not open source the site itself? Especially if you aren't planning to turn it into a commercial service.


I'm open-sourcing the backend, but not 100% of the code.


This is quite nice, but you really ought to have some page accessible with attributions to the open source projects you're using to power this!


What do people use to perform those PDF tasks without uploading sensitive files to a website?


For PDF to markdown, I recently released V2 of my tool marker - https://github.com/vikparuchuri/marker


This is very effective - it consistently yields great results.


The files are deleted immediately after processing. I'm considering implementing WebAssembly (Wasm) to do most of the work on the client's device and enable offline use.


You say that here, but it's not in the privacy policy as far as I can see.

When you get to the stage of monetizing the site, I expect the most obvious starting point is monetizing the information inside the pdfs.

Then there's the obscurity of how much (if any) is passed on to other services (like Google). You may have one policy about the PDFs, they may have another.

So yeah, I'm in the "not for general use" camp myself. (Although there are edge cases where it may be useful.)

Don't get me wrong, I can see the upsides, and your web site looks professional, but alas the downsides are too significant to overcome the inconvenience of searching out something local.


Stirling-pdf. You can self-host it.


Using this myself. It's pretty good.


Only last week there was an HN thread about how the author said they just used ChatGPT to make the entire thing and as a result the code is beyond bad. I don't think I'd trust it.


They didn't say that. They said they wrote the first version in a few days, using ChatGPT. Then worked on it almost another year since then. Something of that nature. Pretty big difference.


>I don't think I'd trust it.

You can audit the code yourself then. What's stopping you?


Nothing is stopping me using something else that isn’t ChatGPT hope and pray code.


Am not a Coder :-P


That's a fair point lol, my bad.


Pandoc, ocrmypdf, libreoffice, pdftk, pypdf2


There is a site listed elsewhere in these comments that does the work entirely in-browser. That's what I would use.


I use Preview on mac or even Spotlight for a good portion of these functions


Am I missing something, or does it lack PDF editing functionality like adding/editing text or adding images? I usually use https://smallpdf.com/edit-pdf because 99% of the time I simply need to fill in fields with text, attach a PNG of my signature on some pages, and resend the document to the organization that required me to fill it in (schools, medical self-certifications, government tax entities and so on). For those needs, smallpdf is fantastic, but obviously I'd prefer an open-source or simply a self-hostable solution.


This would be awesome if it could be self-hosted. I work with sensitive documents I can't upload to a third party.


Nice roundup of tools!

Just a small note: on Safari mobile, if I expand the Edit and then Convert sections, they open on top of each other.

https://i.imgur.com/bSZdRTN.png


Thanks for bringing that up, I'll take care of that issue.


At this point a new format should emerge as a replacement of pdf. It’s very useful and easy to publish, but working with pdf documents beyond reading and printing is way too complicated.


If anyone knows about a FOSS pdf form editor, please share!


You can edit PDF forms by opening the files in Firefox, but maybe that's not exactly what you mean. I'm not sure about other browsers.


90% overlap with the free and selfhostable stirling-pdf?


I've been using Stirling-PDF as my go-to solution for any pdf needs and have never needed any other service. Open source gold standard for any pdf needs: https://github.com/Stirling-Tools/Stirling-PDF


I am always surprised there is absolutely nothing like Adobe Acrobat Pro in the open source space.

There is a collection of open source tools, each with its own interface and each doing a subset of things.

The alternative, an online service, is not great...


PDF is harder than most people think due to its variety. While there might be a tool for every job, "one for all" is hard to do.


You should run ads or charge for something. I suspect this is going to get very popular.


Thank you for your suggestion. I'm considering keeping the basic functions free and adding premium features, but still not changing any of the core features, i.e. the website is going to work as is.


Anybody have a good open-source receipt data extraction tool for PDFs?


We just launched an MVP for PDF data extraction: https://excelifier.com/. The service is not open source and relies on OpenAI, which is probably a bit problematic in your case.

However, we understand that privacy concerns are really important for many organizations. Making it self-hostable and depend on a locally running LLM is something that we are looking into.


It sure sounds interesting, but I'm only getting timeouts. A possible hug-of-death period should be over by now?


Any plans to make this available as a docker container?


I tried converting a PDF to markdown and I just got a large bunch of ... seemingly random numbers and letters.


I am a happy user of smallpdf, which seems quite similar. What advantages do you offer?


pdfequips is fast, free, and offers tools that are not available on smallpdf, like pdf-to-csv, pdf-to-pdf-a, and translate-pdf. You can of course use what you feel most comfortable with, but I guess you should give pdfequips a try :)


Bookmarked!


Thanks! :)



