My first solution would be improving reader security by starting with one with decent code (Espie suggested MuPDF), compiling it with something that makes it memory-safe, and running it in a sandbox on a separation kernel (eg Genode or Muen). Then, a memory-safe conversion tool turns it into something more trustworthy. This might even be batched on simple hardware which itself has lower attack surface. Later on, it could move to secure hardware like a CHERI CPU, although that can happen today if you have an FPGA board and the skills to run their HDL code.
For fun, though, I'll dust off an old concept since you're talking printing. One might start by printing them to a virtual screen like in Nitpicker GUI with the untrusted reader. Aside from isolation, there could be a feature to convert what's on the virtual screen or page into a compressed image. A PDF with N pages becomes a zip of N images or a single image of some size. That itself could be distributed to run in the trusted, safe viewers we already should have, right? ;) It might also be run back through similarly-deprivileged OCR to turn into a safer format. Gotta eyeball it if doing it that way. That said, there are fonts that work well with OCR that it might be converted to as part of image production if OCR is the goal in the first place.
Could be a fun, little project teaching folks about a number of topics at once.
Your "first solution" would be to take a de novo PDF implementation written in C, "compile it with something that makes it memory-safe", and then port it to an L4 microkernel. Maybe bust out some HDL and get parts of it deployed directly onto an FPGA.
I said a separation kernel like the FOSS projects and commercial products dating back to 2005 I told Joanna about on the Qubes mailing list which were compartmentalizing things on security-focused kernels. Aside from small TCB, they have optional mitigations for storage and timing channels. Aside from isolation, a standard practice on the embedded side was including safe subsets of Java or Ada running right on the kernel to implement specific components more safely. So, basically just what was standard, deployed practice in high security over a decade ago.
Optionally, I also pointed out people interested in developing solutions have options available now for safety or security on CPU side, too. They can do software, hardware, mix of both, whatever suits their purposes.
"safe subsets of Java or Ada running right on the kernel to implement specific components more safely. "
"a safe subset of Java or Ada in the kernel."
Done here since you're arguing against points I'm not making. For anyone your strawman confused, the specific components are user-mode apps running on a separation kernel to minimize privilege. Not just piles of extra code in some kernel.
> For fun, though, I'll dust off an old concept since you're talking printing. One might start by printing them to a virtual screen like in Nitpicker GUI with the untrusted reader. Aside from isolation, there could be a feature to convert what's on the virtual screen or page into a compressed image. A PDF with N pages becomes a zip of N images or a single image of some size. That itself could be distributed to run in the trusted, safe viewers we already should have, right?
Which is literally what Qubes "Convert to trusted PDF" does.
> My first solution would be improving reader security by starting with one with decent code (Espie suggested MuPDF), compiling it with something that makes it memory-safe, and running it in a sandbox on a separation kernel (eg Genode or Muen). Then, a memory-safe conversion tool turns it into something more trustworthy.
It would of course be preferable to have a secure PDF reader to begin with, but the complexity of the PDF format isn't really conducive to that.
Oh, that's neat it's what they're doing. As far as a secure PDF reader goes, you can definitely reduce the risks it poses with mitigations, which reduce headaches even when they don't stop attacks outright. Those I was thinking of are doing it with acceptable overheads these days. On the far end, the CPU solution already compiles legacy C to run capability-secure on FreeBSD with OS and CPU available to download and run. Just gotta buy the board which has other uses.
So, there's more possibilities to explore on top of these existing solutions.
> It would of course be preferable to have a secure PDF reader to begin with, but the complexity of the PDF format isn't really conducive to that.
I thought pdf.js was a JavaScript application in a browser on a full OS with all the risks that come with that versus memory-safe, native code in a deprivileged partition or container. Web tech isn't my strong area so I could be wrong. Do correct me if it's not browser or JS tech built in an unsafe language.
And it's a little strange your reply to memory-safe code for a PDF reader is that an "unsafe one exists, just use it" when you or your colleagues are currently applying my recommendation to the browser hosting it via Rust and Quantum.
You're doing one thing that matches the language part of my recommendation while saying we should do the opposite about a type of program that's similarly high risk. Quite the contradiction.
This is really confusing for me since you keep implying JavaScript is all we need for safe, secure, efficient, and/or low-TCB apps like this one parsing and rendering PDFs. Yet, you aren't rewriting Firefox parsers and renderers in JavaScript: you are using a new language with the properties I just named. Properties shared with safe C/Java/Ada subsets used in embedded but with even more safety added (borrow-checker). That's probably because you didn't trust JavaScript to do the job efficiently, securely, and without leaks.
Now, you do in this thread if it involves a risky format attackers love. I don't. I think complex languages running in large apps increase attack surface. So, I still recommend strong sandboxing whatever parser/renderer one uses plus developers in security-focused projects (eg Qubes) using compilers or languages offering safety if having resources to spare. Everyone contributing a little gives us more building blocks over time.
And as far as your other comment, there are always new ways to turn C code safe or secure being developed. C++ might also be able to use them via a C++ to C compiler but has stuff like SaferCPlusPlus to help. For C, options to attempt include Softbound+CETS, SAFEcode, Code Pointer Integrity, and dataflow integrity. At least three are FOSS with one I haven't checked yet. So, they exist. They could also be in even better shape if security tool builders put more time in them.
That's all I'm saying on this since you seem set on JavaScript for efficient, secure apps. We aren't going to agree on that premise.
I'm not saying pdf.js is fast. I'm saying that it's fast enough to be a useful tool to read most PDFs securely (which in fact millions of Firefox users do!), and it has the large advantage of actually existing, unlike complex schemes involving vaporware compilers and Ada in the kernel. (If you care about fast secure PDF viewing, write a new PDF renderer in Rust or Java or Go or whatever. This doesn't have to be a complex problem.)
By the way, SaferCPlusPlus is not memory safe, and porting a PDF rendering code base to use it would be about as much work as rewriting the renderer in a safe language.
> This is really confusing for me since you keep implying JavaScript is all we need for safe, secure, efficient, and/or low-TCB apps like this one parsing and rendering PDFs. Yet, you aren't rewriting Firefox parsers and renderers in JavaScript: you are using a new language with the properties I just named. […] That's probably because you didn't trust JavaScript to do the job efficiently, securely, and without leaks.
JavaScript is a memory-safe language thanks to a well known runtime trick called a «garbage collector» … Until Rust came along, GC was the only viable way to have a memory-safe language. Unfortunately, it has significant performance drawbacks which make GC-ed languages unsuitable for writing a browser. But for 99% of the code written every day (including a PDF renderer), GC is a good enough solution to write memory-safe code.
Also, Rust has been designed to make parallel code safe, something a GC can't give you.
> So, I still recommend strong sandboxing whatever parser/renderer one uses plus developers in security-focused projects (eg Qubes)
Browsers are probably the most exposed piece of software nowadays, and the vendors already do a lot of work to provide secure sandboxing. When using JavaScript, you're using a memory-safe language, in a sandboxed environment, which means you need two exploits to get out of it (a bug in the JS VM and a sandboxing bug). There's no guarantee that using another sandboxing system instead would offer better security, especially because you'll just have 1 layer of security.
> And as far as your other comment, there are always new ways to turn C code safe or secure being developed. C++ might also be able to use them via a C++ to C compiler but has stuff like SaferCPlusPlus to help. For C, options to attempt include Softbound+CETS, SAFEcode, Code Pointer Integrity, and dataflow integrity. At least three are FOSS with one I haven't checked yet. So, they exist. They could also be in even better shape if security tool builders put more time in them.
If there's an easy way to give C or C++ code an acceptable level of memory safety, why aren't developers using it? (Don't tell me people already do, because that would be proof that those tools aren't able to reach the «acceptable level».) Notice that if such a tool were invented tomorrow, it would also benefit browsers, and increase the security offered by JavaScript.
Or, you know, you could use pdf.js, which has two advantages: (1) it already exists; (2) it can exist, unlike your proposal, which involves using a nonexistent memory-safe C++ compiler.