Show HN: PDF API – Generate, convert, and modify PDF documents
204 points by arkgil on March 17, 2022 | 116 comments
Hi HN,

Arek here. We’re super excited to officially launch PSPDFKit API [1].

PSPDFKit API is a collection of HTTP APIs that enable you to convert, generate, and edit documents without running any service on your infrastructure.

What differentiates our API from others is that you can chain together multiple “actions” as part of a single API request. For example, you can convert, OCR, watermark, edit, and flatten a document — all in one call.
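To illustrate what chaining could look like, here is a minimal sketch that assembles one request body listing several actions in order. The field names (`parts`, `actions`, `type`) and the helper `build_instructions` are hypothetical and chosen only for illustration; the actual request schema is described in the documentation [2].

```python
import json


def build_instructions(input_name, action_types):
    """Assemble one request body that applies several actions in order.

    NOTE: the field names below are illustrative assumptions, not the
    documented request schema of the API.
    """
    return {
        "parts": [{"file": input_name}],
        "actions": [{"type": t} for t in action_types],
    }


# One body, three actions, applied in the order given.
payload = build_instructions("document", ["ocr", "watermark", "flatten"])
print(json.dumps(payload, indent=2))
```

The point of the sketch is only that the order of the list is the order of processing, so a single call replaces three round trips.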

Available actions [2]:

- PDF Generator

- PDF Converter

- Image Converter

- OCR

- Watermark

- Merge

- Split

- Duplicate

- Delete

- Flatten

Our documentation includes sample code for JavaScript [3], Python [4], Java [5], C# [6], PHP [7], and the command line. We also have a Postman collection [8].

Let us know what you think or if you have any questions.

[1] https://pspdfkit.com/api/

[2] https://pspdfkit.com/api/documentation/tools-and-api/

[3] https://pspdfkit.com/api/tools/javascript/

[4] https://pspdfkit.com/api/tools/python/

[5] https://pspdfkit.com/api/tools/java/

[6] https://pspdfkit.com/api/tools/csharp/

[7] https://pspdfkit.com/api/tools/php/

[8] https://pspdfkit.com/api/documentation/getting-started/postm...




Very useful product, congratulations. ;-)

Quite expensive though. When I use an API, I usually assume (1) there will be some significant base volume, and (2) this volume has no upper bound, depending on my users' behavior. For ~750 € you can only process 1k documents during the month... hard-capped? The price schemes seem to target enterprise, but enterprises usually have bigger volumes than that. (But maybe I confuse API calls and document processing with your product?)

But it’s nice you released SDKs in several common languages.

Good luck!


Wow, so if I’m reading this right, they charge about $1 USD per “document”?

I’ve been looking for an easy OCR solution, considering I have about 10,000 one-page documents a month (invoices). For comparison, Amazon Textract is ~$0.05/pg for key-value pairs, but it involves more programming to set up.
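A quick back-of-the-envelope comparison of those two figures, as a sketch. It assumes flat per-unit pricing with no volume discounts or caps, and that every invoice is exactly one page; the helper name `monthly_cost` is made up for the example.

```python
from decimal import Decimal


def monthly_cost(units_per_month, price_per_unit):
    """Monthly cost at a flat per-unit price (assumes no discounts or caps)."""
    return units_per_month * Decimal(price_per_unit)


docs = 10_000  # one-page invoices per month, so documents == pages

per_document = monthly_cost(docs, "1.00")  # the ~$1/document reading of the pricing
per_page = monthly_cost(docs, "0.05")      # the ~$0.05/page Textract figure

print(per_document, per_page, per_document / per_page)
```

Decimal is used instead of float so that the cent amounts stay exact.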


If you're looking for something that's robust and easy to set up but a bit more expensive than Textract, check out https://www.impira.com/ (Disclaimer: I am the founder/CEO there).


With an annual plan of 10,000 documents per month, we charge $0.04 per document. If you try it out, I'd love to hear your feedback on the quality of OCR!


I think you might like fintract[0]; I've never tried them, just stumbled upon them.

[0]: https://www.fintract.io/en/


Shameless Plug: Want to try at 0.025/page on the UI? We can help you.


Thanks! This is Claudio, PSPDFKit's CTO.

At this point in time the price is per generated document - irrespectively of how complicated the operation is.

Because you can combine operations in one HTTP call, you're incentivised to do that as opposed to performing separate calls, which increases the possibility of errors and the cost for all sides.

Happily taking feedback though - your comment around hard-cap is definitely sound, for example.


> At this point in time the price is per generated document - irrespectively of how complicated the operation is.

As part of an integration for a few customers we do relatively simple PDF operations. Combining a few TIFFs into a PDF, or splitting a PDF into a separate PDF per page, for example. Think invoices vs orders. Can be a royal PITA though due to edge cases in terms of whatever generated the TIFFs or PDFs.

However, each of our customers has 1-10k documents per day, and we're in Scandinavia, so small fry in terms of volume.

edit: OCR and table extraction is something that might be very interesting depending on quality, but again even 10k/mo seems low for most of our larger customers.


The volume you’re describing is totally doable. We can do custom plans if you’re interested - there’s a contact link at https://pspdfkit.com/api/pricing/.


For higher volume, but simpler operations (merge, watermark, encrypt/decrypt, etc.) you can try https://www.pdfblocks.com/api. You get 10K docs processed for $29/mo, and 1M for $99/mo. We don't have conversion, OCR, generation, or chained operations though.

Source: I work at PDF Blocks


Is it appropriate to have multiple competitors pitch their own products in an Ask HN? This should not be such an opportunity.


Considering I often come to things like Show HN/Ask HN to see what exists in the world, yes. I say it is not only appropriate but within the spirit of fostering good conversation and interest in the community. (Obviously only as such competition remains civil.)


Absolutely, especially in such a crowded, mature space it makes total sense to discuss the full market. As others have said, I like to come to the comments to learn not just about the product at hand but about what else is out there. Show HN isn’t some exclusive showroom for the product in the title.


Why not?

Also, it’s a Show HN.

Seems reasonable to share alternatives.


Not a good use case for an online API. To the extent that those PDFs could include sensitive information, there's a huge security/privacy headache there, with no real benefit when compared to performing these functions offline. It also seems to me a lot more expensive than alternative ways of doing the same thing.


Very valid concern around privacy. We don't store the documents (see https://pspdfkit.com/api/privacy/), but for people that have sensitive documents to process, we offer an on-prem product, see https://pspdfkit.com/api/documentation/deployment-options/. You can run it in your own infra and it doesn't report any telemetry to us, so information remains completely private.


> with no real benefit when compared to performing these functions offline

Most organizations these days are developing in the cloud. I assume you don't mean "offline" but rather performing these functions yourself.

I've tried, but it's a huge pain in the butt. PDFs are very quirky. Things work well 95% of the time and the 5% takes a lot of time to figure out.

When trying to do this myself in an app deployed to AWS, I've had many issues with getting all characters in different languages to work. Every few months, some new thing in a PDF file throws an error and the file won't generate. You get weird file size errors. And the quality of PDF generation varies a lot by language. I'd much rather have an API that I can just call from any of my code if it JUST WORKS.

Now, their pricing is strange and might be a dealbreaker for me. I'd like to see an option to pay per transaction with no cap without having to negotiate with their sales team.

> there's a huge security/privacy headache there

eh, maybe. Some use cases don't require privacy. In my case, I'm mostly assembling PDFs from various sources with my company's documents. No, I don't want a vendor that is going to post my documents to twitter, but I can sleep at night if I have some kind of assurance that they don't use or sell my data.


By negating the cloud model, I'm not saying I want to do stuff "myself" at all. I just want the delegation to work differently.

By "offline", I mean "offline" w.r.t. their servers. The point is: I have some kind of an environment that I'm already using for handling the data underneath the PDFs I'm trying to produce. That environment has a given amount of attack surface area. If they hand me a program that I can run in my environment and that doesn't communicate outside that environment, I have gained additional functionality. This is additional stuff I can do that I couldn't do before, and I've achieved this without increasing the attack surface area that might lead to that data getting compromised.

If I do it the other way around, i.e. instead of them handing me a program, I am handing them my data, then the attack surface area increases. Because every attack against my pre-existing environment continues to be an attack that will compromise my data. But also: Some attacks against their environment would now also end up compromising my data. So it's worse for security.

If I have a car in a garage, and I give the key to the garage to 3 people, it's going to be more secure than if I give the key to 4 people. Because there's one less person who might lose the key and enable a robber to get in.

I don't understand the logic behind the converse idea at all. The idea seems to be: Azure/AWS/whatever is "the cloud". Therefore my data is already in "the cloud". Random company X is also in "the cloud". So I might as well send my data to company X. -- This sounds to me like the What-The-Hell Effect. Like: I've broken my diet because I ate a hamburger. Now I might as well quit the diet. No: Eating fewer hamburgers tomorrow is still better than eating more.

I also don't understand why things have to be architected that way in order to "just work". Weasyprint just works. Pandoc just works. LaTeX just works. I can put them on a computer with no network connection, and they'll happily do their job for me. They give me a lot of functionality and ask very little trust in return. That's a good thing. Whenever that's an option, that's what I'm going to do.
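One way to make the garage-key argument above concrete, under the simplifying assumption (not made in the comment itself) that each of the n parties holding the data is compromised independently with the same probability p:

```latex
P_n = 1 - (1 - p)^n,
\qquad
P_{n+1} - P_n = (1 - p)^n - (1 - p)^{n+1} = p\,(1 - p)^n > 0
\quad \text{for } 0 < p < 1 .
```

Under this model every additional party strictly raises the chance that the data is compromised, however small p is; that is the "one fewer key holder" point in formula form.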


That's a great point. For folks that have strong privacy needs, we do have an on-premise product that provides the same functionality [1].

[1] https://pspdfkit.com/server/processor/


So what exactly does that leave? A wrapper that you've created around weasyprint, pandoc, latex, ghostscript, imagemagick, and stuff like that?

Sounds to me like an unnecessary extra expense for an unnecessary extra layer of abstraction. And there's a risk factor that comes with it: Say I make a nontrivial investment, like write a book that I'm planning on typesetting with this, or write a reporting infrastructure that creates automated reports or something. I'll make a huge up-front investment there that is tied to your API. Then I want to run this, while not touching it, for 10 years so it can earn a return on investment.

Then I come back to it 10 years later, because I'm writing the second edition of the book, or I want to change something about my reporting infrastructure. Has your company gone out of business in the meantime? Have you deprecated the product? Do you still support the API from 10 years ago? Does it still produce the same output for the same input? ...or do I need to take a huge write-off on all the work I've done on the typesetting my book or hooking up my reporting infrastructure?

In the open source world, I'd just make sure to bundle all the tools I'm using, including their source code, in a docker container or something. In the "10 years later" scenario, I'll probably need to touch only the book's source code, or the reporting infrastructure's source code, not the typesetting infrastructure. And if there's something I really really need, then I can go to the source and change it.


You’re touching on a few different points so I’ll try to cover everything.

- We do build on top of OSS (just not those programs you listed - see https://pspdfkit.com/legal/acknowledgements/processor-acknow... for a complete list). The layer we build is quite large though, and it would take many person-years to replicate in its entirety. It’s possible though that you don’t need that at all and a focused program that wraps other ones might do the trick for your use case.

- If you build a product based on our tech, you’re making a conscious decision about risk: while I do think we’re gonna be in business in 10 years (we have solid revenue and last year we got backed by a large investor, Insight), and that we would version APIs and support you (not just during upgrades), the reality is that it is indeed possible that we’re not gonna be around anymore, like every other company on the planet. As a consumer, this is the reality for most of the things we buy nowadays. We do take deprecation seriously, as we sell SDKs, and I’m sure in case of the company shutting down you would have enough time to migrate.

- Depending on what you need to build, using our product may shortcut your development time by a large factor. It may not, if you just need to rotate pages of a PDF document and there’s a reliable OSS package that does that in your language of choice. It really depends on what you need to do.

- Even if you package everything with OSS, waiting 10 years is a sufficiently large amount of time that it may not work and you have to fork and rebuild yourself. It’s a different type of risk, but still a risk. 10 years ago Docker had just been launched. Whether you build something on OSS or commercial, you would wanna test things once a year to see if they still work or keep up with security and bug fixes.

Ultimately, there are situations where the approach you described is sound: for example, I do my taxes in plain text accounting, using ledger and emacs. I generate the reporting via a couple of Ruby scripts. I do that exactly because I care about longevity: I do my taxes once a year, I don’t wanna spend time fixing the toolchain every time I have to do them. Yet every year I hit a couple of snags I have to fix, but I consider that acceptable.


It's unclear what you're trying to say. They've been around since 2010 and they have quite a large team, why would they suddenly disappear? Also what do you want them to do?


What I wanted to say was: "PDF processing is something where I wouldn't want to rely on an online API over something local. And it's also something where I wouldn't want to rely on a small commercial company over an open source project".

I once worked for a software company where non-tech clients would have custom-made software developed for their exclusive use. Half the projects we did were "We're relying for one of our business functions on this software that we bought from this company that's now out of business. We need you to reimplement it from scratch, because we need a tiny change."


To me it makes sense to buy this functionality instead of building it yourself; the upfront cost involved with building it yourself will likely be much higher even if you manage to chain together a bunch of open source tools.


> Then I come back to it 10 years later, because I'm writing the second edition of the book, or I want to change something about my reporting infrastructure.

Those seem like one-off PDF conversion use cases that MS Word or Acrobat can easily handle. Not a high-volume, daily PDF invoice use case.


Would work if you want to publish the pdf anyway.


The vast majority of organisations already store all their working documents and data in the cloud.


"The cloud" is not one thing. Each additional company in "the cloud" that gets to see your data increases the attack surface.


FUD. Most companies are using Box, Dropbox, or some variant of cloud-based document storage today. Extending storage services with document transformations and conversions is a logical evolution.


A few years ago, $COMPANY had similar needs for a client. I ended up creating an in-house solution, which has a surprisingly close API (well, there are only so many ways to do this).

So I asked myself: if this kit existed at the time, would we have used it? I don't think so. For all specified pricing plans, the document limit is way too low for what $COMPANY or its clients do. Judging by the progression of the costs, we'd have gone in-house instead of negotiating an Enterprise plan.

If you don't want to adjust pricing, perhaps you could add a consumption based plan. The plan could have a much larger limit, but the client also pays per API call.


Thanks for the feedback! For those that need larger volume, we also have an on-premise product: https://pspdfkit.com/api/documentation/deployment-options/.


I recently discovered that search-replacing text in a PDF without changing the layout is much harder than I thought it would be (a customer forgot to change their billing address, and now that the invoice is finalized, Stripe won't let me edit anything, so down the PDF-editing rabbithole I went). I would love it if I could just use an API for this.


There are so many ways to lay out text on a PDF page that this is nearly impossible to implement for all scenarios. I don't know of a PDF editor which works in all cases.

Sometimes text is positioned absolutely, relative to the page border; sometimes it is positioned relative to other elements, so that moving a word shifts all following elements around. There can be multiple matrices involved in positioning text elements. Sometimes text elements are all positioned independently, sometimes by using newlines with a custom size. Text elements can span multiple lines or words, but sometimes each letter is a single text element, and then it is hard even to determine which letters go together or whether there is meant to be a space. Additionally, fonts can be subsetted, in which case it's impossible to use letters outside the subset without having the original font. And then there can be OCR'ed PDFs, where an image of the scanned text is overlaid on top of the real text. Oh, and there can be clipping paths: rectangles which erase all text below.

And each PDF producer creates a different PDF structure.

For reading, PDFs are awesome. For editing, PDFs are a nightmare.
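To show why "multiple matrices" makes locating a word harder than reading one coordinate, here is a minimal sketch. PDF matrices are written as six numbers [a b c d e f] and map (x, y) to (a·x + c·y + e, b·x + d·y + f); the specific matrix values and variable names below are invented for the example.

```python
def apply_matrix(m, point):
    """Apply a PDF-style matrix [a, b, c, d, e, f] to an (x, y) point."""
    a, b, c, d, e, f = m
    x, y = point
    return (a * x + c * y + e, b * x + d * y + f)


# Hypothetical values: a text matrix that moves the text origin to (72, 700),
# nested inside a page-level matrix that scales everything by 2.
text_matrix = [1, 0, 0, 1, 72, 700]
page_matrix = [2, 0, 0, 2, 0, 0]

# The text's own origin has to pass through both matrices, innermost first,
# before we know where it lands on the page.
origin_in_text_space = (0, 0)
on_page = apply_matrix(page_matrix, apply_matrix(text_matrix, origin_in_text_space))
print(on_page)
```

An editor that wants to replace a word has to undo this chain for every text element, and different producers nest the matrices differently.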


If it's just one off, I'd draw a white rectangle over the text that needs to be changed, then add the text on top of that.


This isn't easy because PDF descends from PostScript's page-description model, so text is laid out absolutely. You can make very small changes but a larger change requiring a reflow of the text would break things. In some cases it is possible to convert the PDF to a Word document, make edits, and then save it back to a PDF.


You only need Acrobat Pro for that.* That's daily business for me, although not with invoices but printing data.

* (becomes harder when the font is not embedded/existent as a subset, but Acrobat lets you choose another font, so no big deal.)


LibreOffice Draw edits PDFs pretty well.


What I used for this exact problem was pdftk's `stamp` option, with a stamp pdf that was just a white rectangle with text on it, as a sibling commenter mentioned. Worked for several hundred documents!
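A sketch of how that stamping step could be scripted over many files. pdftk's `stamp` operation overlays the first page of the stamp PDF on every page of the input; the file names and the helper names `stamp_command` and `stamp` are placeholders, and pdftk is assumed to be installed and on the PATH.

```python
import subprocess


def stamp_command(input_pdf, stamp_pdf, output_pdf):
    """Build the argv list for overlaying stamp_pdf on every page of input_pdf."""
    return ["pdftk", input_pdf, "stamp", stamp_pdf, "output", output_pdf]


def stamp(input_pdf, stamp_pdf, output_pdf):
    """Run pdftk; raises CalledProcessError if pdftk exits with an error."""
    subprocess.run(stamp_command(input_pdf, stamp_pdf, output_pdf), check=True)


# Placeholder file names; only the command is printed here, nothing is executed.
print(" ".join(stamp_command("invoice.pdf", "patch.pdf", "invoice.fixed.pdf")))
```

Note that the original text is still present underneath the white rectangle, so this hides the old address rather than removing it.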


I recently went down the PDF rabbit hole for a project.

I had to use different OSS tools to do everything I wanted. I was able to access three from within Node.js without touching the disk:

1) LibreOffice CLI for converting doc/docx to PDF. It handled the formatting remarkably well. WARNING: you must have the fonts on the system doing the generating or it will substitute "similar" fonts! NPM: libreoffice-convert

2) NPM pdfjs-dist from Mozilla for extracting text and finding page numbers.

3) NPM pdf-lib for manipulating PDFs: deleting pages, adding pages from other PDF files (even to the middle of a PDF).

4) PDF Jam command line for resizing a PDF: `pdfjam --keepinfo --outfile "${path}.resized.pdf" --paper letterpaper "${path}"`


LibreOffice only does a mediocre job of rendering Word documents. There are a number of cases where it really mangles things. An example would be some types of bulleted lists or indentation.


Given the GP's and your differing experiences, I wonder in which circumstances it works and in which it doesn't.


In most cases it is adequate. But there are a couple factors to consider:

1) OpenXML is an open standard and, like HTML, it has to be interpreted and rendered, much as a browser renders a web page. MS Word is obviously the reference here. But in certain cases, you will see differences when using other renderers. If you wrote the document in Word and then view it in LibreOffice or wherever, those differences are going to seem pretty glaring. It's possible to fix your document to not run into these issues, but because OpenXML has cascading styles and Word can oftentimes produce quite messy output, it's hard to know what is not going to work well. It has taken the web 20+ years to get to a place where the difference between renderers is small enough to not be a big headache.

2) In OpenXML there is no concept of a page. Everything is relative and only at render time do you know how it looks. PDFs do have a concept of a page and are absolute, so that translation can be quite important and another source of rendering issues. An example of this is in Excel: people do need to print spreadsheets, and if you want the result to not look obnoxious you have to fiddle with the settings to get everything looking good. If you convert a spreadsheet to PDF you're more or less doing the same thing. However, in an automated context you can't make the judgements needed to make it look good, so you will oftentimes end up with PDFs that have hundreds and hundreds of pages and look awful.

For 1) Microsoft could just release an API or some kind of package that gives you the output that matches Word and it would solve all of this. But at that point OpenXML ceases to be really open because nobody would want to use anything else. For 2) That's a lot harder to solve.


Thank you!


Just an FYI but PSPDFKit has a very predatory sales model. Our organization received pricing that was generally very high and we pushed back on it because we were a startup and it was outside of our budget.

I've connected with several other customers of PSPDFKit over the years and they almost all have much more reasonable pricing.

Beware!


I experienced the same. We ended up developing our own solution and no longer have to rely on any 3rd party framework.


We did the same.


I've always wondered (as someone who hasn't made corporate purchases), wouldn't there be a point where it would be easier to teach all employees LibreOffice (Draw) instead of buying proprietary/expensive software that might break/has external dependencies?


Maybe I’m missing something but how is that a “predatory sales model”?

Their prices may be too high, so you declined to pay. I don’t see that as predatory.


They claim to charge based upon your requirements, your business model and revenue. If you are a startup they will most certainly overcharge you for their framework. They also want access to your financials to make sure you would be in compliance with the contract.


How have you avoided the AGPL headache that comes with almost all the open source libraries for PDF editing?

Have you written your own code from scratch?


Our engine is based on Google's PDFium, which is Apache licensed. We use it for rendering and reading the PDF object tree. Editing, annotations, etc. are all built on top of that.


Very cool!

Consider perhaps adding one more call, one to remove passwords. Not brute force, although that would be cool too, but just one that lets you specify the password and makes the PDF no longer require a password.


Cool idea, yep. ;-)


It all sounds great until you get a quote. $15,000/yr for an API to sign a PDF. Come on.


I was contracted to make a legal document generation service for a client. I looked at all of these tools and decided to just use HTML/CSS and then print to PDF via puppeteer on a serverless cloud function.
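A rough sketch of the same "render HTML, print to PDF" idea. It swaps puppeteer for Chromium's own headless command-line mode so the example stays in Python; the executable name `chromium` is an assumption (it may be `chromium-browser` or `google-chrome` on a given system), and the file names and the helper `print_to_pdf_command` are placeholders.

```python
import subprocess
from pathlib import Path


def print_to_pdf_command(html_path, pdf_path, browser="chromium"):
    """Build the argv list for rendering a local HTML file to PDF.

    `browser` is an assumption: the executable name differs between systems.
    """
    url = Path(html_path).resolve().as_uri()  # headless Chromium expects a URL
    return [browser, "--headless", f"--print-to-pdf={pdf_path}", url]


cmd = print_to_pdf_command("contract.html", "contract.pdf")
print(cmd)
# To actually render (requires Chromium to be installed):
# subprocess.run(cmd, check=True)
```

Page size, margins, and page breaks are then controlled from the HTML side with CSS print rules such as `@page`.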


Congratulations. Last year I launched a PDF generator API here on HN and got zero upvotes; you have managed to get to the front page. Wish you all the success.



Some time ago at a $Company we needed to generate PDFs and also OCR incoming documents. In order to quickly release a product we decided to use an online API from a $Vendor. The initial price was quite OK-ish, but a year or two later we saw a significant increase in price. At that time we also started moving to on-premise hosting to decrease latency and to address other GDPR stuff. Considering the high volume of documents, it was just too much for us, and the $Vendor didn't want to negotiate. In addition to that, we also needed to implement reports to show the $Vendor how many items we processed, for on-premise licensing.

Long story short, instead of that we spent a few two-week sprints with a two-person team and were able to successfully fulfill our needs using open source software. $Company saved hundreds of thousands per year. We also tried to influence the company to donate to OSS; that unfortunately never happened, but that's another story.

So please be aware of vendor lock-in and of possible price increases. Always think of a plan B.


>We also tried to influence company to donate to OSS, that unfortunately never happened, but that's another story.

I don't blame you for going down that route. But it feels to me that open source is devaluing our work. PDF is a big and complex specification, there must be thousands of hours of work in the software you chose and yet you are getting all that value for free. Is there any other industry that does this to itself?


Very nice and useful product!

Question: do you have, or plan to support, PDF signatures? This may then be useful for us[1]: we issue qualified certificates and eIDAS-compliant legal qualified electronic signatures, which often then need to be embedded into PDFs.

[1]: https://www.zealid.com/en/


Signing is something we'd like to explore, we often hear from folks who'd want to simplify their signing workflows. Thanks for the feedback!


It looks interesting, but addresses a scale that I don't work on.

For my personal needs, I use pdftk from the command line.


dawg why can't you just make a normal program I don't want a REST API


What languages do you need support for?


How about none? Any service that can be passed text from an http request can be passed text via argv and called from the command line. The fact that a helper program runs on the same host rather than across a network doesn't mean you can only use it via direct function call. Imagine if the developers of pandoc didn't actually distribute pandoc and only allowed you to invoke a pandoc instance running on their servers remotely.
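A minimal sketch of that point: the same processing function can sit behind a command-line entry point just as easily as behind an HTTP handler. `convert` is a stand-in (it only upper-cases its input), and the script name `convert.py` is made up.

```python
import sys


def convert(text):
    """Stand-in for real document processing: here, just upper-case the input."""
    return text.upper()


def main(argv):
    """Command-line entry point: the text arrives via argv instead of an HTTP body."""
    if len(argv) != 2:
        print("usage: convert.py TEXT", file=sys.stderr)
        return 2
    print(convert(argv[1]))
    return 0


# Simulates running `python convert.py hello` from a shell.
exit_code = main(["convert.py", "hello"])
```

An HTTP wrapper, if one is wanted at all, would call the same `convert` function, so nothing about the processing itself requires a remote service.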


How well does this handle large tables that span pages? That seems to be a key differentiator for most PDF libs I sampled. I'd assume this works well if it's coming from Chromium


If you have a sample HTML you wanna try, you can use https://pspdfkit.com/pdf-sdk/web/pdf-generation/ and paste HTML there - the generation engine is virtually the same.


Do you mean tables when converting HTML to PDF, or simply rendering the PDFs with tables in them?


simply rendering tables - most of the (python) pdf generation libraries I evaluated a few years ago all had the same limitations (reflow is hard) around laying out large multipage tables. We went with a headless chrome service to print to pdf which did not have the limitation.


We've had customers in beta trying it out with multi-page tables and we've heard positive feedback.


Does this API allow for the generation of accessible documents, such as PDFs, which can then be read by blind persons using a screen reader such as Jaws or NVDA? These tools have the ability to bring up a dialog box (e.g. elements list) listing the links, headings, form fields, buttons and landmarks present on documents, (e.g. html, pdf, and so on) that blind people would need in order for them to navigate a document.


We’ve done some tests in that area and while Chromium is technically able to generate tagged PDFs, which would be accessible for the most part, it’s far from perfect.

We have some work planned in that direction, but nothing close to release at this stage.


Really handy service, even if there are probably a lot out there already. The following might be worth checking out:

- https://docspring.com/

- https://bulk-pdf.com/

Good documentation you have there, though!


From the pricing page, limits seem to be on number of documents, not number of pages. Is the number of pages per document also limited?


No, just number of created documents.


The pricing may need a revisit! Enterprises would probably be the most keen, but they could also find better alternatives for the cost.


> What differentiates our API from others is that you can chain together multiple “actions” as part of a single API request.

https://transloadit.com offers similar composable workflows in a single request, and supports more file types besides PDFs.

Disclosure, I am a founder :)


Nice set of tools!! I recently launched a PDF-related project.

https://www.scholars.io

It's a tool for reading research papers (PDFs) together with your colleagues. You can read, annotate, comment etc.

Needless to say, it led me down quite deep into the PDF world and it.. was interesting.


Incidentally, I wonder if you can answer a question: I want my books in an electronic format that will be usable for the rest of my life or longer, and which preserves annotations.

As far as I know, PDF/A is the only format that fits the first two specs. I know annotations are in the PDF specs but is it reasonable to think that annotations I make today will be readable - and updatable - in (e.g.,) 30 years?


Yes, that is quite reasonable. The simple text annotations are quite easy to work with when using a PDF library and the standard tries to be backwards compatible where sensible.

The PDF 2.0 standard removed some parts of PDF 1.7, like the proprietary XFA forms. But most things stayed in PDF 2.0 and one can expect that those annotations will also be available in future iterations of the PDF specification.

And generally, since future PDF viewers will need to be able to view older documents (think: all the (signed) documents created by governments), you can expect PDFs created today to be usable in 30 years and more.


Thank you.


Do you have any plan in your roadmap to support different languages in the OCR feature? I'm specifically interested in recognizing and processing PDF files written in Japanese and Korean.

I am also dealing with some clients that are struggling with processing handwriting in their documents, but I guess that would be a little far-fetched.


This is the list of supported languages: https://pspdfkit.com/api/pdf-ocr-api/#supported_languages

At the moment we don’t include Japanese and Korean, but I’ll take a note around your questions.

Handwriting is definitely a different beast, that’s not supported.


Thanks. I have been dealing with a ton of headaches on my projects since modifying PDFs can be very problematic. Rather than stitching together multiple libraries, I would rather suggest one platform to handle everything.

Will definitely keep this in mind until it meets my requirements. Is there a mailing list I can sign up for?


I’m sorry if this is a stupid question. What kind of industry (processing thousands of documents a month) or use case would justify using this possibly expensive tool? If there is such a use case, what is the manual process that it usually replaces? Thanks.


Hi Arek...congrats on the launch.

Maybe PSPDFKit is interested in integrating Bionic Reading into their products? Take a look at the website (bionic-reading.com) to see if BR can add value to your users.

Let me know if you are interested and best regards from the Swiss Alps, Renato


We need to offline convert HTML to PDF. We created a small docker container with Chromium and Selenium and added a small HTTP API layer on top. Works like a charm and it is easy to keep it up to date.


Would be great to have a file upload button to test the OCR API from the UI, without having to run curl. Just to test how your API works from the UI.


You can try it at https://pspdfkit.com/pdf-sdk/web/ocr/, the OCR functionality is shared with our SDKs.


Which rendering engine are you using in the backend?


We're based on PDFium, but there's a lot more going on than just that - see https://pspdfkit.com/blog/2019/contributing-to-pdfium/ for an overview.


So, no compliance with printing industry standards. That's a pity.


I don't know about the printing industry and its compatibility requirements. Would you mind elaborating a bit on this (I occasionally do some PDF output, so I'd like to avoid basic mistakes)?


What you would typically be looking at, is compliance witt the PDF/X standards [1] in various levels, which are basically ISO norms for PDFs.

Files for printing production need to have their fonts embedded, color profiles attached/at least tagged to images, transparency dealt with, lots of stuff that ensures that the PDF itself contains all the necessary information for successful reproduction/printing on a printing machine of any kind.

As printing production systems have evolved, the rules became "less strict", as all (most) systems can now handle transparency natively, for example. That was a big change with PDF/X-4; before it, you had to convert all transparencies and factor them into the underlying elements (the keyword is "transparency reduction").

Most PDF generators out there are not able to follow the rules of ISO/the PDF/X specifications, so print shops might have a hard time handling that data, due to various missing pieces of information.

That's normally not a big deal for your office printer, but when you are looking at large(r) printing operations, it surely is.

[1] https://en.wikipedia.org/wiki/PDF/X


Thanks!

> Most PDF generators out there are not able ...

Do you know of any that are compatible?


I'm mainly working with two systems.

The first is PDFlib [1], which can easily be accessed via Java and PHP. As a web guy, I'm using PHP, obviously. There's a learning curve to it, and you have to take care of lots of stuff yourself, but the results are pretty good afterwards.

The second are the products from callas, mainly pdfToolbox [2] and pdfChip [3], which are kind of the de facto standard for the printing industry, at least in my Western Europe bubble.

pdfChip is based around the WebKit rendering engine, so you can work with HTML + CSS and convert your document to a PDF file. The pdfChip internals will take care of PDF/X compliance, if you want to.

pdfToolbox and pdfChip both have a steep learning curve, too, but you'll probably find that with any software that is highly specialized.

[1] https://www.pdflib.com/

[2] https://www.callassoftware.com/en/products/pdftoolbox

[3] https://www.callassoftware.com/en/products/pdfchip


The metadata on an output PDF I tested says Skia (though I guess that could be being wrapped by another library)



Is this used for creating pdfs from scratch (a docraptor alternative)?

Or filling out pdfs programmatically (a docspring alternative)?


You can create a PDF from scratch starting from HTML - see https://pspdfkit.com/api/documentation/developer-guides/pdf-.... Note that HTML generation has a few nice quality of life additions around headers/footers, logos and conversion of HTML forms to PDF forms, which are things that you don't normally get with the print to PDF workflow you would normally build from scratch.

Filling out forms is not supported, but I'll take a note. The engine can do it, but we haven't got it exposed via the API.


I find this utterly bizarre. Once upon a time, if you wanted to left pad a string, you would just do it. A while later, people discovered that you could use a library. (I’m joking a bit here, but libraries are genuinely useful.) With a library, you get to pick from various schemes and schedules for updating the library, but you have a degree of control.

But now apparently you’re supposed to use a web API and depend on an external service. This has all kinds of downsides: it has latency (and potentially tail latency). It has larger security issues. It doesn’t work in many sandboxes. It requires an asynchronous call. Callers have to handle timeouts and retries. (If you left pad a string with a normal library, it either works or it doesn’t. With a web service, it can fail transiently or give wrong answers transiently.) It updates on its own schedule, without notice, and cannot be rolled back. And it can charge an utterly outrageous per-call price, so instead of merely profiling and debugging slowness due to making too many calls, developers also have to worry about inadvertently spending hundreds of thousands of dollars.

Replace “left pad a string” with “generate a PDF” and you get this. Why is this desirable?

I suppose things like this may partially explain the stunning slowness of bank websites.
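
To make the "callers have to handle timeouts and retries" point concrete, here is a rough sketch of the wrapper every caller ends up writing around a remote call. The names `TransientError` and `call_with_retries` are hypothetical, and the backoff schedule is just one reasonable choice, not anything a particular service prescribes.

```python
import time


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (timeouts, 5xx responses)."""


def call_with_retries(operation, attempts=3, base_delay=0.5, sleep=time.sleep):
    """Call operation(); retry on TransientError with exponential backoff."""
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == attempts:
                raise  # out of retries: surface the failure to the caller
            # Wait base_delay, then 2x, then 4x, ... before the next attempt.
            sleep(base_delay * 2 ** (attempt - 1))
```

None of this is needed when the same work is a local library call, which either returns or raises immediately.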


This really does not resonate at all, and I have the scars to prove it.

I used to work on a browser-based document management system, and I would have used (or at least tried) all of these APIs without hesitation. PDFs are a pain, and the mishmash of poorly functioning tools that exists is a constant headache.

1) OCR'ing of a PDF is difficult. The only good service is Google, but requires that you break it into pages as images to be performant. This would have simplified things greatly. Even if the PDF has text inside and is not an image, it can be wrong or not laid out in a linear way, so you have to OCR it. Command line tools do not get you very far. An example: if you OCR or text extract a PDF with multiple columns of text, does it handle the columns well?

2) People want searchable OCR'd PDFs where you can highlight the text, even when it's a bitmap underneath. This requires a technique where you overlay transparent text in the exact position of text in the bitmap. This does not come for free and I've only seen this done on proprietary Windows-only software. This alone would be worth it.

3) Office to PDF is an extremely standard need, especially if you want to display them online. But it's not easy. You have to hack together a headless OpenOffice to have it work at all, but it doesn't do a great job. It's difficult to do well because Office docs are like HTML pages in that it greatly depends on the renderer, not to mention the fonts. Microsoft does not offer a service to do this, unfortunately. If you think anything will do, it really won't: when people see their PDF looks very different than what they saw on Word, they get upset.

4) Table extraction APIs are super important, especially if you are trying to automatically extract data from PDFs (e.g. analyze financial disclosures). There have been whole startups dedicated to this.

5) HTML to PDF is also a pain: you have to set up an instance that is running headless Chromium, which can be quite slow. This has become the de facto standard to quickly create complex PDFs. Having a simple API wrapper around this is just one less thing to manage.

The rest of the APIs, like the merging/splitting/watermarking etc., are pretty standard and you do not need APIs if you already have access to the PDF on a server. But if you were in a browser or on mobile, you might not.
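
The overlay technique in (2) can be sketched roughly as follows. PDF text rendering mode 3 (`3 Tr`) draws text with neither fill nor stroke, so it is invisible but still selectable and searchable. The function name `invisible_text_ops` and the input tuple layout are assumptions for illustration, and the sketch assumes the page's resources already define a font named `/F1` and that the text is plain Latin.

```python
def invisible_text_ops(words, image_height_px, dpi=300, font="F1"):
    """Build PDF content-stream operators that place invisible text over a scan.

    words: iterable of (text, left_px, baseline_px_from_top, height_px),
    e.g. as reported by an OCR engine in image pixel coordinates.
    """
    scale = 72.0 / dpi  # pixels -> PDF points (72 points per inch)
    ops = ["BT", "3 Tr"]  # begin text object; render mode 3 = invisible
    for text, left, baseline_top, height in words:
        x = left * scale
        y = (image_height_px - baseline_top) * scale  # PDF origin is bottom-left
        size = height * scale
        # Backslashes and parentheses must be escaped in PDF literal strings.
        escaped = text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")
        ops.append(f"/{font} {size:.2f} Tf")
        ops.append(f"1 0 0 1 {x:.2f} {y:.2f} Tm")
        ops.append(f"({escaped}) Tj")
    ops.append("ET")
    return "\n".join(ops)
```

A real implementation also has to stretch each word to the width of its bounding box (for example with the `Tz` horizontal scaling operator) and deal with rotation and non-Latin encodings, which is where most of the effort goes.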


I'll just throw my hat in the ring and mention that at Impira, we are one of those startups wholly dedicated to (4). We happen to use Google's OCR engine (1) under the hood (for raw OCR), and what you said resonates for sure: there's a lot of engineering work required to make it work performantly and generally (happy to chat about this with anyone who is interested).

Feel free to take Impira for a spin (https://www.impira.com) if you need to accurately extract data from PDF documents. Would love feedback from anyone who tries it out. [Disclaimer: I am the CEO/Founder of Impira].


I agree many of these things are a pain. This often reflects a workflow that is approaching things from entirely the wrong direction. ("If I wanted to go there, I wouldn't start from here.")

E.g. instead of trying to OCR a PDF, go back to the source document or database or whatever from which the PDF was generated. (Yes, I know that's not always an option. But it should be the first avenue to explore. We should push back against people who send around PDFs as though they were an all-purpose interchange format for textual or structured data.)

I'm a bit puzzled by (3), though:

> Office to PDF ... it's not easy ... when people see their PDF looks very different than what they saw on Word, they get upset

To get a PDF that looks the same as the Word document, just tell them to use the Print to PDF driver from right there within Word.


I think you recognize this already, but to add a bit of color, in highly regulated industries (e.g. financial services) and B2B settings with lots of peers (e.g. supply chain), "going back to the source document or database or whatever" requires an insane amount of consensus (which is not currently incentivized).

To add to that, a lot of PDFs (e.g. financial reports) are generated procedurally with ancient code that would have to be rewritten to generate a different format. The underlying database format is often many layers of abstraction different than the final output.


> Office to PDF is an extremely standard need

Is it really an extremely standard need, or just something that appears in the bs corners of our jobs a few times a year?


Yes, if you're working with documents a lot, it is. Word docs are not portable, and people don't like them because they can be changed easily, not to mention that not everybody has Word. You also can't display them inline in a browser.


>HTML to PDF is also a pain: you have to set up an instance that is running headless Chromium, which can be quite slow...

There are at least 6 non-Chromium alternatives that I can think of at a moment's notice, and also the LGPL-licensed wkhtmltopdf.

>Office to PDF.... You have to hack together a headless OpenOffice to have it work at all, but it doesn't do a great job... Microsoft does not offer a service to do this, unfortunately.

Microsoft sorta does offer a service to do this. SharePoint has a Word-to-PDF action, and with some stitching you can make it into an API. There are also several commercial solutions (e.g. Spire.NET) for this, and there are ways to mangle the OpenXML into HTML (of course losing some fidelity in the process).


All of the above may be correct, but nothing here advocates for a web service instead of licensed software. If I want to solve a linear program, I can use an open source library or I can pay for a commercial offering, but that commercial offering will run on my hardware (or cloud instance) and will operate independently of the network. If I want to edit a Word document, I can pay Microsoft for a local copy of Word.


I'm a very happy user of OCRmyPDF: https://github.com/jbarlow83/OCRmyPDF/


> 1) OCR'ing of a PDF is difficult. The only good service is Google

OCRspace is OK, too, and easier to use. You can just send the PDF. It is free for PDFs with 3 or fewer pages.

> 2) People want searchable OCR'd PDFs where you can highlight the text, even when it's a bitmap underneath.

OCRspace can also create searchable PDFs: https://ocr.space/searchablepdf


Same thought here. Let's say you have to create an invoice for a customer and your operations stop just because you're not using {Cairo, Skia, PoDoFo, JagPDF, Haru, Whatever} in the local environment but relied on an external service that went down. This introduces a huge dependency chain across the web. But they don't provide anything that cannot be provided autonomously by a local library. Integrate with external services because you must, not because you can.


Nodejs forces this architecture (no, worker threads are not a solution; they are heavy and have too many restrictions): you don't want to slow down the event loop with heavy PDF processing.


This is not in any respect limited to NodeJS. If you want to do a 500ms computation, you don’t want to do it synchronously in your network thread. It doesn’t make much difference whether it’s C, Rust, NodeJS, Go, etc. (CGI is different: everything is off the network thread.)

But this doesn’t mean you should outsource computations to a third party remote system. You can have a local (same physical hardware or same data center) off-thread service (or just thread pool) to do this kind of work with much nicer properties.
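
As a rough illustration of the local off-thread option, here is a minimal sketch using Python's asyncio with a thread pool. `render_pdf` is a hypothetical stand-in for the slow job, and the pool size is arbitrary.

```python
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor


def render_pdf(document_id):
    """Stand-in for a slow, blocking PDF job (hypothetical)."""
    time.sleep(0.05)
    return f"pdf-for-{document_id}".encode()


async def handle_request(pool, document_id):
    """Run the blocking job on the pool so the event loop stays responsive."""
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(pool, render_pdf, document_id)


async def main():
    with ThreadPoolExecutor(max_workers=4) as pool:
        # Three requests are served concurrently without blocking the loop.
        return await asyncio.gather(*(handle_request(pool, i) for i in range(3)))
```

For CPU-bound work written in pure Python, a `ProcessPoolExecutor` would be the more suitable pool; the calling code stays the same.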


Taking a quick look at https://pspdfkit.com/api/documentation/tools-and-api/, I'm puzzled.... what distinguishes a "PDF Generator" from a "PDF Creator" from a "PDF Writer"? How would I know which one I want?

Oh, looks like they're the exact same thing: a webpage-to-PDF service.

Then there are a whole bunch of "PDF Converter" options, including "HTML > PDF", which seems to be yet another name for the same thing.

For me, all this has a whiff of SEO spam that I find quite distasteful. Just tell me what the product does. Don't try to list it under a collection of different titles in the hope of catching more search terms, it just makes you sound like a snake-oil salesman.


I'm sorry you find it distasteful - that was never our intention. What we found was that users searching for a PDF generator, creator, or writer are generally looking for the same solution - to create a PDF. So by repositioning our tool we were hoping to provide a better landing page experience for users who were searching for one of those specific keywords.

Of course the downside is as you've pointed out - it can be seen as distasteful and in some cases confusing to our users. We will review this on our side and see if it makes sense to remove some of those tools to reduce confusion.



