Unfortunately, most of the PDF work I do involves things I’m not uploading to a service - ever. I don’t care if they’re “deleted immediately after processing” - they left my control. This sort of software would be great if it were 100% offline.
This isn’t just a niche issue either: this is a very real consideration for any corporate user. More companies are taking data loss and security issues seriously, which often means restricting what cloud services they are willing to use.
Despite the proliferation of cloud services, most large enterprises DO NOT want their sensitive documents entering the cloud. And in some cases, e.g. patient medical records, there are strict regulations about how those documents can be stored, which means on-premise is a requirement.
Good news for us, as that's what we specialise in, but also perplexing how trends in the software industry can completely ignore what customers actually want.
However, the pricing page with no actual numbers and the ambiguous ‘Contact Us’ is a huge turn off.
I cannot stand the dance with business people who want to have a bunch of calls and meetings to know how big a company they’re dealing with is before they decide on a good rate to gouge them.
Pricing pages should be straightforward. Have tiers if you want to cover your rear, but only at the limit of usage have the ‘Contact Us’ option.
I’m shopping around for a PDF solution and would’ve recommended this to my manager but I’m not willing to do more meetings to get quotes.
Same. About three years ago we introduced a company-wide policy to not buy anything where the price is not known. So, so much time (money) gets wasted on figuring out the actual costs that the offering would have to be really inexpensive to make up for it. And if that were the case, the price would be right there.
They usually do high usage volume pricing at high rates that are proportional to the size of the company and make you sign a yearly agreement so they can get a huge payment upfront.
How about building some trust? What if the service sucks? It will be hard to get your money back and you paid a year in advance.
They make you work to get a quote and the quote usually doesn’t work for your needs.
I too will not look at services with this pricing structure anymore unless word of mouth is favorable.
Large enterprises can afford to take things in house and might even save money that way, not to mention the security gains. Medical offices have no choice. However, small companies often don't have anyone in IT (other than the CEO who does everything and only rarely knows what he is doing outside the niche the company is in). These should be the prime market for tools like this: just pay us a little bit and we will worry about the details for you, and everything is backed up. However, one enterprise account is a lot more money than thousands of little accounts, so everyone focuses on enterprises anyway.
> Good news for us, as that's what we specialise in, but also perplexing how trends in the software industry can completely ignore what customers actually want.
I initially read this backwards and thought you were lamenting that people insist on on-prem stuff when cloud is clearly The Right Thing.
I certainly don't think the entire software industry is ignoring what customers actually want. Case in point, you. But also lots of other developers who thrive in covering the myriad use cases the myopic behemoths can't see. They just have very loud PR and marketing and pretend those cases don't exist, so you hear about them a lot.
You seem to think that users want everything in the cloud and that’s what’s causing the proliferation of cloud services. You are wrong. Users want _convenience_. They couldn’t care less about the cloud or technical details. If your website can do what they want without uploading their documents to your server, and if it’s faster and cheaper, then that’s what they’ll prefer.
It's a fair point. Most of our customers work with CPP, C# and Java in enterprise / back office contexts, which is why no PHP or Javascript right now - we've been tied up with other priorities. That said we just added Python to our main SDK and PHP is coming.
I would think that JS/TS support would be relatively high up... my own bias speaking, but a lot of the development effort toward easing cloud apps is JS/TS-centric.
I work in a FAANG on stuff that is definitely "enterprise software", a major part of what we develop is written in TypeScript.
I admit PHP will not be as good a candidate, but for smaller companies it is still extremely attractive, and it's probably easier to develop for since you can write PHP extensions in C.
I didn't create this tool, but I use it frequently. I'm also using uBlock Origin, but I don't see the issue you describe. I'm not sure what Badget is, though.
If that's your paranoia level: How do you know the "offline" tool you're using is not uploading to a server? Possibly inadvertently in the course of bug reports, or surreptitiously while contacting the license server...?
Should security concerns really warrant not trusting the (reputable) vendor that the files are not being uploaded, you would need to do some sort of audit and/or run in an isolated environment and wouldn't be the "random user" referred to in OP.
I suggest a separate VM for that, one you can delete when you're done. And put the VM on a separate PC that you bought with cash off craigslist. Then toss the PC away in a different postcode when you're done. Then you can use the PDF tool safely without fear you're being tracked.
This isn't my tool but based on what I read on the previous thread about it, it doesn't seem to be open-source. However, some folks recommended this tool which does seem to run locally: https://github.com/torakiki/pdfsam
> This isn’t just a niche issue either: this is a very real consideration for any corporate user
Very true, but I wish this "common" knowledge were more widespread. Security is a major issue that is commonly overlooked. People do a lot of insecure things for convenience.
I understand that you want to keep your work private and not expose your documents to the internet, but there are situations where the document isn't that important to you and an online solution is sufficient. Say a friend asks you to put a math problem to an AI so they can learn how to solve it, but the AI only understands text. You then need to OCR the PDF (which is just a converted JPG) and copy the text to the AI. You might be on your phone or away from your desktop environment, and that is where you might consider an online solution like pdfequips :)
I'd also not upload any personal or identifying docs to this, but I would use it for fliers, and it would REALLY be useful for converting PDFs I downloaded off the internet to begin with. (I've downloaded stuff in the past that I had to convert in order to enter the data from the PDF into my computer. Geologic data for maps, a list of states with capitals, alphabetized by capital, well before ChatGPT; the list goes on.)
This was on hn a couple of days ago. Stirling pdf is a self hosted docker container and this way you don’t have to worry about files being uploaded. https://news.ycombinator.com/item?id=40242639
I almost thought this hn post was the same service wrapped in a show and tell.
I had just set up "Stirling PDF" on my home NAS a few weeks ago, since my SO needed to merge some documents and I'd recently read that (or a similar) HN thread.
I definitely would recommend it. It was really quick to set up, though already having a reverse proxy with wildcard TLS certs in place probably helped streamline the networking side of things.
I decided to create pdfequips.com when a friend kept sending me PDF files for translation and I realized how widespread the need for PDF solutions is. Now it serves as a central hub for PDF management, offering conversion tools like PDF to Word and CSV, as well as OCR. Over the past year, I developed the website extensively, leveraging a wide range of open-source tools on both the front end and back end.
The web app, i.e. the front-end part, is mostly Next.js and TypeScript; the landing page is built with Astro. The back end is heavily Python and Flask, with some JavaScript for web-to-pdf and markdown-to-pdf.
Not op, but I've had good experience with WeasyPrint. I use it for generating PDF invoices: I create a HTML invoice from a template, WeasyPrint turns it into a PDF document. It handles CSS, images, custom fonts, etc.
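For anyone curious what that workflow looks like, here is a minimal sketch, not the commenter's actual code. The template, field names and function names are made up for illustration; the only WeasyPrint call used is `HTML(string=...).write_pdf(...)`, and it assumes WeasyPrint is installed.

```python
from html import escape
from string import Template

# Hypothetical invoice template; the fields are illustrative only.
INVOICE_TEMPLATE = Template("""\
<html>
  <head><style>h1 { font-family: sans-serif; } .total { font-weight: bold; }</style></head>
  <body>
    <h1>Invoice $number</h1>
    <p>Customer: $customer</p>
    <p class="total">Total: $total</p>
  </body>
</html>
""")


def render_invoice_html(number: str, customer: str, total: str) -> str:
    """Fill the HTML template, escaping values so they cannot break the markup."""
    return INVOICE_TEMPLATE.substitute(
        number=escape(number), customer=escape(customer), total=escape(total)
    )


def write_invoice_pdf(invoice_html: str, output_path: str) -> None:
    """Render an HTML string to a PDF file with WeasyPrint."""
    # Imported lazily so the template step works without WeasyPrint installed.
    from weasyprint import HTML

    HTML(string=invoice_html).write_pdf(output_path)
```

Usage would be something like `write_invoice_pdf(render_invoice_html("2024-001", "Example Ltd", "120.00 EUR"), "invoice.pdf")`. In practice you would probably swap the `string.Template` step for a real template engine.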
A neat trick to convert HTML to PDF in a browser environment is to open a new browser window, load the HTML in it, and call print() on it, like here: https://stackoverflow.com/a/33890644/5821. May be OK for an internal tool.
I used open-source solutions to build it, like LibreOffice, Ghostscript, Google's Tesseract (https://github.com/tesseract-ocr/tesseract) and a bunch of other tools.
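As a rough illustration of how a back end can drive one of these tools, here is a sketch that shells out to the Tesseract CLI. This is an assumption about the general approach, not the site's actual code; the function names are hypothetical, and it assumes the `tesseract` binary and the requested language data are installed.

```python
import shutil
import subprocess


def tesseract_command(image_path: str, lang: str = "eng") -> list[str]:
    """Build the CLI call that prints recognised text to standard output."""
    # Passing "stdout" as the output base makes tesseract write the text to stdout.
    return ["tesseract", image_path, "stdout", "-l", lang]


def ocr_image(image_path: str, lang: str = "eng") -> str:
    """Run Tesseract on one image and return the recognised text."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract binary not found on PATH")
    result = subprocess.run(
        tesseract_command(image_path, lang),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if tesseract exits non-zero
    )
    return result.stdout
```

A scanned PDF would first have to be rasterised to images (for example with Ghostscript) before each page is passed to `ocr_image`.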
I’m surprised everyone is using Tesseract. It was the only game in town 10 years ago, and it’s OK on cleaned, aligned data, but there are a few newer ones like EasyOCR [0] that can deal with much less organized text (albeit more slowly).
EasyOCR looks like it's more focused on the mobile use case of extracting text from photos. That's a little bit different from extracting text from scanned documents, where document structure is an important aspect, and Tesseract is the devil we know. In the commercial space ABBYY FineReader still dominates - https://en.wikipedia.org/wiki/ABBYY_FineReader
I think it looks like a nice tool, naysayers notwithstanding. I don't have sensitive PDFs and, though I would probably not use it for my tax return, I'll use it for other stuff. For my level of security, I'm happy enough with your promise to delete the stuff right away.
I appreciate your trust, and yeah, believe me, I'm deleting the files right after processing. The way it works is that I save each uploaded file as a tmp file, process it, then delete it after the response.
This is what the code looks like on the server side for most of the tools:
```python
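# Fragment from inside a Flask view; needs `import os` and
# `from flask import after_this_request` at module level.
...
@after_this_request
def remove_file(response):
    # Called with the response after the view returns: delete the temp upload.
    os.remove(tmp_file.name)
    return response

return response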
```
i don't have any reason to keep them.
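For readers who want to try the save, process, delete pattern without Flask, here is a self-contained sketch of the same idea. It swaps `after_this_request` for a plain `try`/`finally`, and `process_upload` and its parameters are hypothetical names, not the site's code.

```python
import os
import tempfile
from typing import Callable


def process_upload(data: bytes, process: Callable[[str], bytes]) -> tuple[bytes, str]:
    """Write uploaded bytes to a temp file, process it, then always delete it.

    Returns the processed bytes and the path of the (now deleted) temp file.
    """
    tmp_file = tempfile.NamedTemporaryFile(suffix=".pdf", delete=False)
    try:
        tmp_file.write(data)
        tmp_file.close()  # close before handing the path to the processor
        result = process(tmp_file.name)
    finally:
        tmp_file.close()  # harmless if the file is already closed
        os.remove(tmp_file.name)  # runs even if processing raised
    return result, tmp_file.name
```

The `finally` block means the file is removed even when processing fails, which is the property a privacy-minded user actually cares about. It still says nothing about logs, backups or third-party services, so it does not replace a written policy.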
I have 85,000 PDF documents, collected over a few decades.
What I really want is a semantic interface to those PDF documents. Find me "all PDF files which mention <subject>", or "show me any PDF with python example code", or "all PDF's before 2011 on the subject of coding standards for SIL-4".
I keep thinking this is out there somewhere, but whenever something new comes along I get bogged down in the details of setting it up. Surely someone has come up with an AI that you can just 'give the folder to' and it figures things out automagically?
No I haven't, so thanks for recommending it to me - looks pretty detailed. I will try it out some time this week, maybe its exactly what I'm looking for. Thanks again!
Looks like I've got a few days of hacking ahead of me, thanks for the recommendation - will put it alongside the other suggestions and check it out when I do my "PDF sortout workbench" session ..
Looks pretty functional, if not entirely polished - I will try this out (alongside Paperless NGX, also suggested here..) - I appreciate the recommendation, thank you!
If this is entirely built using open-source software, why not open source the site itself? Especially if you aren't planning to turn it into a commercial service.
The files are deleted immediately after processing. I'm considering implementing WebAssembly (Wasm) to do most of the work on the client's device and enable offline use.
You say that here, but it's not in the privacy policy as far as I can see.
When you get to the stage of monetizing the site, I expect the most obvious starting point is monetizing the information inside the pdfs.
Then there's the obscurity of how much (if any) is passed on to other services (like Google). You may have one policy about the PDFs; they may have another.
So yeah, I'm in the "not for general use" camp myself. (Although there are edge cases where it may be useful.)
Don't get me wrong, I can see the upsides, and your web site looks professional, but alas the downsides outweigh the inconvenience of searching out something local.
Only last week there was a HN thread about how the author said they just used chatgpt to make the entire thing and as a result the code is beyond bad. I don't think I'd trust it.
They didn't say that. They said they wrote the first version in a few days, using ChatGPT. Then worked on it almost another year since then. Something of that nature. Pretty big difference.
Am I missing something, or does it lack PDF editing functionality like adding/editing text or adding images?
I usually use https://smallpdf.com/edit-pdf because 99% of the time I simply need to fill in fields with text, attach a PNG of my signature on some pages, and send the document back to the organization that required me to fill it in (schools, medical self-certifications, government tax entities and so on).
For those needs, smallpdf is fantastic, but obviously I'd prefer an open-source or simply a self-hostable solution.
At this point a new format should emerge as a replacement of pdf. It’s very useful and easy to publish, but working with pdf documents beyond reading and printing is way too complicated.
I've been using Stirling-PDF as my go-to solution for any pdf needs and have never needed any other service. Open source gold standard for any pdf needs: https://github.com/Stirling-Tools/Stirling-PDF
Thank you for your suggestion. I'm considering keeping the basic functions free and adding premium features, but without changing any of the core features, i.e. the website is going to work as is.
We just launched an MVP for PDF data extraction: https://excelifier.com/. The service is not open source and relies on OpenAI, which is probably a bit problematic in your case.
However, we understand that privacy concerns are really important for many organizations. Making it self-hostable and depend on a locally running LLM is something that we are looking into.
pdfequips is fast, free, and offers tools that are not available on smallpdf, like pdf-to-csv, pdf-to-pdf-a and translate-pdf. You can of course use what you feel most comfortable with, but I guess you should give pdfequips a try :)