Question: why do we see stable video and audio "container formats" like MKV that persist as encodings come and go (where you might not be able to play a new .mkv file on an old player, but the expected answer to that is to upgrade your player to a new version, with universal support for pretty much any encoding being an inevitability on at least all software players); but every new image encoding seemingly necessitates its own new container format and file extension, and a minor format war to decide who will support it?
Is this because almost all AV decoders use libffmpeg or a fork thereof; where libffmpeg is basically an "uber-library" that supports all interesting AV formats and codecs; and therefore you can expect ~everything to get support for a new codec whenever libffmpeg includes it (rather than some programs just never ending up supporting the codec)?
If so — is there a reason that there isn't a libffmpeg-like uber-library for image formats+codecs?
The original entrant in this competition is TIFF, and—much as Matroska or QuickTime add indexing to raw MP3 or MPEG-TS—it does provide useful functionality over raw codec-stream non-formats like JPEG (properly JIF/JFIF/EXIF), in the form of striping or tiling and ready-made downscaled versions of the same image. But where unindexed video is essentially unworkable, an untiled image is in most cases OK, except for a couple of narrow application areas that need to deal with humongous amounts of pixel data.
So you’re absolutely going to see TIFF containers with JPEG or JPEG2000 tiles used for geospatial, medical, or hi-res scanned images, but given the sad state of open tooling for all of these, there’s little to no compatibility between their various subsets of the TIFF spec, especially across vendors, and more or less no FOSS beyond libtiff. (Not even viewers for larger-than-RAM images!) Some other people have used TIFF, but in places where there’s very little to be gained from compatibility (e.g. Canon’s CR2 raw images are TIFF-based, but nobody cares). LogLuv TIFF is a viable HDR format, but it’s in an awkward place between the hobby-renderer-friendly Radiance HDR, the Pixar-backed OpenEXR, and whatever consumer photo thing each of the major vendors is pushing this month; it also doesn’t have a bit-level spec so much as a couple of journal articles and some code in libtiff.
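For a sense of what those geospatial/medical files look like in practice, here's a minimal sketch of writing the kind of tiled, pyramidal TIFF with JPEG-compressed tiles those applications use, via pyvips (the Python binding for libvips); the filenames and parameters are just illustrative, and it assumes a libvips build with JPEG support:

    import pyvips

    # "sequential" access lets libvips stream a larger-than-RAM source
    image = pyvips.Image.new_from_file("scan.png", access="sequential")
    image.tiffsave(
        "scan_pyramid.tif",
        tile=True,            # store independent tiles instead of one big strip
        tile_width=256,
        tile_height=256,
        pyramid=True,         # also write ready-made downscaled versions
        compression="jpeg",   # each tile is JPEG-compressed inside the TIFF container
        Q=85,
    )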
Why did this happen? Aside from the niche character of very large images, Adobe abandoned the TIFF spec fairly quickly after acquiring it as part of Aldus, but IIUC for the first decade or so of that neglect Adobe legal was nevertheless fairly proactive about shutting up anyone who used the trademarked name for an incompatible extension (like TIFF64—and nowadays if you need TIFF you likely have >2G of data). Admittedly TIFF is also an overly flexible mess, but then so are Matroska (thus the need for the WebM profile of it) and QuickTime/BMFF (thus 3GPP, MOV, MP4, ..., which are vaguely speaking all subsets of the same thing).
One way or another, TIFF is to some extent what you want, but it doesn’t get a lot of use these days. No browser support either, which is likely important. Maybe the HEIF container (yet another QuickTime/BMFF profile) is better from a technical standpoint, but the transitive closure of the relevant ISO specs likely comes at $10k or more. So it’s a bit sad all around.
I think TIFF has some unique features that make it more prone to certain security issues[1] compared to other formats, such as storing absolute file offsets instead of relative offsets. So I am not sure TIFF is a good container format, but many camera raws are TIFF-based for some reason.[2]
> I think TIFF has some unique features that makes it more prone to certain security issues[] compared to other formats, such as storing absolute file offsets instead of relative offsets.
That’s an impressive number of CVEs for a fairly modest piece of code, although the sheer number of them dated ≥ 2022 baffles me—has a high-profile target started using libtiff recently, or has some hero set up a fuzzer? In any case libtiff is surprisingly nice to use but very old and not that carefully coded, so I’m not shocked.
I’m not sure about the absolute offsets, though. In which respect are those more error-prone? If I was coding a TIFF library in C against ISO or POSIX APIs—and without overflow-detecting arithmetic from GCC or C23—I’d probably prefer to deal with absolute offsets rather than relative ones, just to avoid an extra potentially-overflowing addition whenever I needed an absolute offset for some reason.
There are things I dislike about TIFF, including security-relevant ones. (Perhaps, for example, it’d be better to use a sequential format with some offsets on top, and not TIFF’s sea of offsets with hopefully some sequencing to them. Possibly ISO BMFF is in fact better here; I wouldn’t know, because—well—ISO.) But I don’t understand this particular charge.
Absolute file offsets demand a particular memory layout or some extra bookkeeping that could be avoided with relative offsets. If I were to write a JPEG parser, I could write a function to handle one particular segment and not have to worry about other segments, because relative offsets make parsing them independent, compared to TIFF, where I need to maintain a directory of things and make sure the offsets land in the right place.
I think parsing file format with absolute offsets is similar to handling a programming language with all GOTOs, compared to relative offsets which are more like structured control flow.
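To make the bookkeeping concrete, here's a rough Python sketch (classic TIFF only, no BigTIFF, no tag decoding) of walking the IFD chain: every link is an absolute file offset, and each one has to be validated against the whole file before use:

    import struct

    def read_ifd_offsets(path):
        with open(path, "rb") as f:
            data = f.read()
        endian = "<" if data[:2] == b"II" else ">"
        magic, first_ifd = struct.unpack(endian + "HI", data[2:8])
        assert magic == 42, "not a classic TIFF"

        offsets, ifd = [], first_ifd
        while ifd != 0:
            if ifd in offsets:
                raise ValueError("circular IFD chain")
            # absolute offsets are validated against the file as a whole,
            # not against whatever structure we happen to be parsing
            if not (8 <= ifd <= len(data) - 2):
                raise ValueError("IFD offset outside file")
            (count,) = struct.unpack_from(endian + "H", data, ifd)
            end = ifd + 2 + 12 * count + 4   # count + 12-byte entries + next-IFD link
            if end > len(data):
                raise ValueError("IFD runs past end of file")
            offsets.append(ifd)
            (ifd,) = struct.unpack_from(endian + "I", data, end - 4)
        return offsets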
As much as I’m fond of my collection of Matroska files with SSA/ASS subtitle tracks, I don’t think those are appropriate for the Web, what with all the font issues; and SRT is a nightmare of encodings. But apparently there’s a WebM-blessed way[1] of embedding WebVTT ( = SRT + UTF-8 − decimal commas) now? Which is of course different[2] from the more recent Matroska.org-blessed way[3], sigh.
“The PFR specification defines the Bitstream portable font resource (PFR), which is a compact, platform-independent format for representing high-quality, scalable outline fonts.

Many independent organizations responsible for setting digital TV standards have adopted the PFR font format as their standard font format, including:
Video container formats do something useful: they let you package several streams together (audio, video, subtitles), and they can take care of some important aspects of AV streaming, letting the codec part focus on being a codec. They let you use existing audio codecs with a new video codec.
OTOH a still image container would do nothing useful. If an image is all that needs to be contained, there's no need for a wrapper.
It would, at least, create a codec-neutral location and format for image metadata, with codec-neutral (and ideally extensible + vendor-namespaced) fields. EXIF is just a JPEG thing. There is a reason that TIFF is still to this day used in medical imaging — it allows embedding of standardized medical-namespace metadata fields.
Also, presuming the container format itself is extensible, it would also allow the PNG approach to ancillary data embedding ("allow optional chunks with vendor-specific meanings, for data that can be useful to clients, but which image processors can know it's safe to strip without understanding because 'is optional' is a syntactic part of the chunk name") to be used with arbitrary images — in a way where those chunks can even survive the image being transcoded! (If you're unaware, when you transcode a video file between video codecs using e.g. Handbrake, ancillary data like thumbnail and subtitle tracks will be ported as-is to the new file, as long as the new container format also supports those tracks.)
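As a small illustration of the PNG mechanism being referenced: the case of the first letter of a chunk's type marks it as critical or ancillary, so a processor can tell what is safe to strip without understanding it. A quick sketch (real PNG layout, but no CRC checking, and the path is a placeholder):

    import struct

    PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"

    def list_chunks(path):
        with open(path, "rb") as f:
            assert f.read(8) == PNG_SIGNATURE, "not a PNG"
            while True:
                header = f.read(8)
                if len(header) < 8:
                    break
                length, ctype = struct.unpack(">I4s", header)
                ancillary = bool(ctype[0] & 0x20)   # lowercase first letter = ancillary
                print(ctype.decode("ascii"), length,
                      "ancillary (safe to strip)" if ancillary else "critical")
                f.seek(length + 4, 1)               # skip chunk data + CRC
                if ctype == b"IEND":
                    break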
Also, speaking of subtitle tracks, here's something most people may have never considered: you know how video containers can embed "soft" subtitle tracks? Why shouldn't images embed "soft" subtitle tracks, in multiple languages? Why shouldn't you expect your OS screen-reader feature to be able to read you your accessibility-enabled comic books in your native language — and in the right order (an order that, for comic books, a simple OCR-driven text extraction could never figure out)?
(There are community image-curation services that allow images to be user-annotated with soft subtitles; but they do it by storing the subtitle data outside of the image file, in a database; sending the subtitle data separately as an XHR response after the image-display view loads; and then overlaying the soft-subtitle interaction-regions onto the image using client-side Javascript. Which makes sense in a world where users are able to freely edit the subtitles... but in a world where the subtitles are burned into the image at publication time by the author or publisher, it should be the browser [or other image viewer] doing this overlaying! Saving the image file should save the soft subtitles along with it! Just like when right-click-Save-ing a <video> element!)
“The Plain Text Extension contains textual data and the parameters necessary to render that data as a graphic, in a simple form. The textual data will be encoded with the 7-bit printable ASCII characters. Text data are rendered using a grid of character cells defined by the parameters in the block fields. Each character is rendered in an individual cell. The textual data in this block is to be rendered as mono-spaced characters, one character per cell, with a best fitting font and size.”
“The Comment Extension contains textual information which is not part of the actual graphics in the GIF Data Stream. It is suitable for including comments about the graphics, credits, descriptions or any other type of non-control and non-graphic data.”
Correct. Not subtitles as a vector layer of the image, but rather subtitles as regions of the image annotated with textual gloss information — information which has no required presentation as part of the rendering of the image, but which the UA is free to use as it pleases in response to user configuration — by presenting the gloss on hover/tap like alt text, yes; or by reading the gloss aloud; or by search-indexing pages of a graphic novel by their textual glosses like how you can search an ePub by text, etc.
In the alt-text case specifically, you could allow for optional styling info so that the gloss can be laid out as a visual replacement for the original text that was on the page. But that's not really necessary, and might even be counterproductive to some use-cases (like when interpretation of the meaning of the text depends on details of typography/calligraphy that can't be conveyed by the gloss, and so the user needs to see the original text with the gloss side-by-side; or when the gloss is a translation and the original is written with poetic meter, such that the user wants the gloss for understanding the words but the original for appreciating the poesy of the work.)
Concrete use-cases:
• the "cleaner" and "layout" roles in the (digitally-distributed) manga localization process, only continue to exist, because soft-subbed images (as standalone documents) aren't a thing. Nobody who has any respect for art wants to be "destructively restoring" an artist's original work and vision, just to translate some text within that work. They'd much rather be able to just hand you the original work, untouched, with some translation "sticky notes" on top, that you can toggle on and off.
• in the case of webcomic images that have a textual "bonus joke" (e.g. XKCD, Dinosaur Comics), where this is currently implemented as alt/title-attribute text — this could be moved into the image itself as a whole-image annotation, such that the "bonus joke" would be archivally preserved alongside the image document.
Region annotation is used for some images on Wikimedia Commons and a lot of Manga pages on the booru sites[1]. It's really, really good for translations.
That's a very basic view, take a look at TIFF or DICOM specs. It can be useful to have multiple images, resolutions, channels, z or t dimensions, metadata, ... all in a single container as it's all one "image"
captions / alt-text could also very reasonably be part of the image, as well as descriptions of regions and other metadata.
there are LOTS of uses for "image containers" that go beyond just pixels. heck, look at EXIF, which is extremely widespread - it's often stripped to save space on the web, but it's definitely useful and used.
- contain multiple streams of synced video, audio, and subtitles
- contain alternate streams of audio
- contain chapter information
- contain metadata such as artist information
For web distribution of static images, you want almost none of those things, especially regarding alternate streams. You just want to download the one stream you want. Easiest way to do that is to just serve each stream as a separate file, and not mux different streams into a single container in the first place.
Also, I could be wrong on this part, but my understanding is that for web streaming video, you don't really want those mkv* features either. You typically serve individual and separate streams of video, audio, and text, sourced from separate files, and your player/browser syncs them. The alternative would be unnecessary demux on the server side, or the client unnecessarily downloads irrelevant streams.
The metadata is the only case where I see the potential benefit of a single container format.
* Not specific to mkv, other containers have them of course
Container formats increase size. Now for video that doesn't matter much because it doesn't move the needle. For images a container format could be a significant percentage of the total image size.
> The alternative would be unnecessary demux on the server side, or the client unnecessarily downloads irrelevant streams.
HTTP file transfer protocols support partial downloads. A client can choose just not to download irrelevant audio. I think most common web platforms already work this way: when you open a video it is likely to be in .mp4 format, and the index you need to start playback is often near the end of it, so your browser fetches that part first with a range request. I am not entirely sure.
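For illustration, this is what a suffix range request looks like, fetching just the last 64 KiB of a file (the URL is a placeholder):

    import urllib.request

    req = urllib.request.Request(
        "https://example.com/video.mp4",
        headers={"Range": "bytes=-65536"},   # "give me the last 65536 bytes"
    )
    with urllib.request.urlopen(req) as resp:
        tail = resp.read()
        print(resp.status, len(tail))        # 206 Partial Content if ranges are supported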
I believe mp4 files can be repackaged to put the bookkeeping data at the front of the file, which makes them playable while doing a sequential download.
That metadata is usually put around the end of the file for compatibility reasons, but one can use ffmpeg's `-movflags faststart` option to move it to the beginning (very common in files that are meant to be served on the web).
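For reference, a remux along those lines, with ffmpeg assumed to be on PATH and the filenames as placeholders; the streams are copied untouched and only the index is relocated:

    import subprocess

    subprocess.run(
        ["ffmpeg", "-i", "input.mp4",
         "-c", "copy",                   # no re-encode, just repackage
         "-movflags", "faststart",       # move the moov atom to the beginning
         "output.mp4"],
        check=True,
    )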
> You typically serve individual and separate streams of video, audio, and text, sourced from separate files, and your player/browser syncs them.
That's one school of thought. Some of the biggest streaming providers simply serve a single muxed video+audio HLS stream based on bandwidth detection. Doesn't work very well for multi-language prerecorded content of course, but that's just one use case.
That's true, but my understanding is they serve a specific mux for a specific bandwidth profile, and serve it by just transmitting bytes, no demux required. I didn't mean to imply that wasn't a common option. I only meant to say I don't think a common option is to have a giant mux of all possible bandwidth profiles into one container file, that has to be demuxed at serve time.
My understanding is that YouTube supports both the "separate streams" and "specific mux per-bandwidth profile" methods, and picks one based on the codec support/preferences of the client.
Containers are just containers — you still need a decoder for their payload codec. This is the same for video and images. For video, containers are more important because you typically have several different codecs being used together (in particular video and audio) and the different bitstreams need to be interleaved.
The ISOBMFF format is used as a container for MP4, JPEG 2000, JPEG XL, HEIF, AVIF, etc.
And yes, there are ffmpeg-like "uber-libraries" for images: ImageMagick, GraphicsMagick, libvips, imlib2 and gdk-pixbuf are examples of those. They support basically all image formats, and applications based on one of these will 'automatically' get JPEG XL support.
Apple also has such an "uber-library" called CoreMedia, which means any application that uses this library will also get JPEG XL support automatically.
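To show what that 'automatic' support looks like from the application's side, here is a tiny format-agnostic converter sketched against pyvips (the Python binding for libvips, one of the libraries named above); the loader and saver are picked from the file extensions, so a new codec arrives with the library rather than with the application:

    import sys
    import pyvips

    src, dst = sys.argv[1], sys.argv[2]   # e.g. photo.jxl photo.png
    image = pyvips.Image.new_from_file(src, access="sequential")
    image.write_to_file(dst)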
I'm guessing it's mostly down to tradition/momentum from how the formats were initially created and maintained.
Videos have (most of the time, at least) at least two simultaneous tracks that have to be synchronized, and most of the time it's one video track and one audio track. With that in mind, it makes sense to wrap those in a "container" and allow the video and audio to be different formats. You can also have multiple audio/video tracks in one file, but I digress.
With images, it didn't make sense at least in the beginning, to have one container because you just have one image (or many, in the case of .gif).
We're starting to see a move towards this with HEIF / AVIF containers, however in cases where "every bit must be saved" the general purpose containers like ISO-BMFF introduce some wastage that is unappealing.
> however in cases where "every bit must be saved" the general purpose containers like ISO-BMFF introduce some wastage that is unappealing.
Sure, but I don't mean general-purpose multimedia containers (which put a lot of work into making multiple streams seekable with shared timing info). I mean bit-efficient, image-oriented, but image-encoding-neutral container formats.
There are at least two already-existing extensible image file formats that could be used for this: PNG and TIFF. In fact, TIFF was designed for this purpose — and even has several different encodings it supports!
But in practice, you don't see the people who create new image codecs these days thinking of themselves as creating image codecs — they think of themselves as creating vertically-integrated image formats-plus-codecs. You don't see the authors of these new image specifications thinking "maybe I should be neutral on container format for this codec, and instead just specify what the bitstream for the image data looks like and what metadata would need to be stored about said bitstream to decode it in the abstract; and leave containerizing it to someone else." Let alone do you ever see someone think "hey, maybe I should invent a codec... and then create multiple reference implementations for how it would be stored inside a TIFF container, a PNG container, an MKV container..."
But HEIC/AVIF did exactly that: they defined an image format on top of a standard container (ISOBMFF/HEIF). JPEG-XL is the odd one out because it doesn't have a standardized HEIF format, but for example JPEG-XS and JPEG-XR are supported in HEIF.
JPEG XL uses the ISOBMFF container, with an option to skip the container completely and just use a raw codestream. HEIF is also ISOBMFF-based but adds more mandatory stuff, so you end up with more header overhead, and it adds some functionality at the container level (like layers, or using one codestream for the color image and another codestream for the alpha channel) that is useful for codecs that don't have that functionality at the codestream level — like video codecs, which typically only support yuv, so if you want to do alpha you have to do it with one yuv frame and one yuv400 frame, and use a container like HEIF to indicate that the second frame represents the alpha channel.

So if you want to use a video codec like HEVC or AV1 for still images and have functionality like alpha channels, ICC profiles, or orientation, then you need such a container, since these codecs do not natively support those things. But for JPEG XL this is not needed, since JPEG XL already has native support for all of these things — it was designed to be a still image codec after all.

It's also more effective for compression to support these things at the codec level, e.g. in JPEG XL you can have an RGBA palette, which can be useful for lossless compression of certain images, while in HEIC/AVIF this is impossible since the RGB and A are in two different codestreams which are independent from one another and only combined at the container level.
It would be possible to define a JPEG XL payload for the HEIF container but it would not really bring anything except a few hundred bytes of extra header overhead and possibly some risk of patent infringement since the IP situation of HEIF is not super clear (Nokia claims it has relevant patents on it, and those are not expired yet).
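A small illustration of the naked-vs-contained point: you can tell the two apart from the first bytes of the file. The signature values below are the ones defined in the JPEG XL spec; the function itself is just a sketch:

    JXL_CODESTREAM = b"\xff\x0a"                               # bare codestream
    JXL_CONTAINER = b"\x00\x00\x00\x0cJXL \x0d\x0a\x87\x0a"    # ISOBMFF signature box

    def sniff_jxl(path):
        with open(path, "rb") as f:
            head = f.read(12)
        if head.startswith(JXL_CONTAINER):
            return "ISOBMFF-contained JPEG XL"
        if head.startswith(JXL_CODESTREAM):
            return "raw JPEG XL codestream"
        return "not JPEG XL"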
> JPEG XL uses the ISOBMFF container, with an option to skip the container completely and just use a raw codestream
Hey, thanks for the clarification! I was basing my info on Wikipedia (my bad): the ISO BMFF page doesn't mention JXL at all, and even the JPEG XL page has only small print in the infobox saying that it's "based on" ISO BMFF, but the main article text doesn't mention that at all.
> But for JPEG XL this is not needed since JPEG XL already does have native support for all of these things — it was designed to be a still image codec after all
I suppose that is a bit the thing the grandparent comment was complaining about: a format not designed for general-purpose containers but rather as a standalone thing. I suppose it could be a fun thought experiment to imagine what JXL would look like if it were specifically designed to be used in HEIF.
Of course it is quite understandable that making a tailored, purpose-built format ends up better in many ways than trying to fit into some existing generic thing.
> It would be possible to define a JPEG XL payload for the HEIF container but it would not really bring anything except a few hundred bytes of extra header overhead and possibly some risk of patent infringement since the IP situation of HEIF is not super clear (Nokia claims it has relevant patents on it, and those are not expired yet).
I suppose JXL-in-HEIF would allow some image management tools to have common code path for handling JXL and HEIC/AVIF files, grabbing metadata etc, and possibly would not need any specific JXL support. But that is probably not a practical concern in reality.
And at the same time, we are likely going to use codec-specific extensions for all AOM video codecs (.av1, .av2) as well as for images (.webp2, not sure if .avif2 will ever exist but I guess so), even when the same container is used, as we did with .webm (which was a subset of .mkv)
I think that's because video is a much more active and complex topic than still images.
We are still using image formats from the 90s, and their matching containers, and they are good enough, so there is not much work for going beyond that. There is no real incentive for making a more flexible format. By comparison, video is the biggest bandwidth hog and people care a lot.
And mkv supports video, multiple sound tracks, subtitles,... All using different codecs made by different people (ex: h265+opus or vp9+vorbis, or any other combination). An image container usually only has the image and a few metadata.
videos can have their own "containers" too; for instance, in AV1 the stream is stored inside an OBU, which is wrapped in an external container (such as Matroska). If you really wanted to, you could (and can) put images into containers too; PNGs in a Matroska are actually a pretty useful way of transferring PNG sequences.

you can also mux JXL sequences into MKV with a simple mod on an older commit of ffmpeg (the commit that added animated JXL broke this method and I haven't gotten around to fixing it), by simply adding the JXL 4CC.
1. put a reference to the decoder into the header of the compressed file
2. download the decoder only when needed, and cache it if required
3. run the decoder inside a sandbox
4. allow multiple implementations, based on hardware, but at least one reference implementation that runs everywhere
Then we never need any new formats. The system will just support any format. When you're not online, make sure you cached the decoders for whatever files you installed on your system.
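A sketch of how steps 1–4 could hang together; every name, header field, and the sandbox entry point here is hypothetical (this is the idea, not an existing format). The file header names its decoder by content hash plus a URL hint, the host fetches and caches the decoder by hash, and then runs it in a sandbox that only sees bytes in and bytes out:

    import hashlib, json, pathlib, urllib.request

    CACHE = pathlib.Path("~/.cache/decoders").expanduser()

    def run_sandboxed(decoder_bytes, payload):
        # placeholder for step 3: hand payload to the decoder inside a
        # sandbox (e.g. a wasm runtime) and get a raw bitmap back
        raise NotImplementedError

    def decode(path):
        with open(path, "rb") as f:
            header = json.loads(f.readline())   # hypothetical one-line JSON header
            payload = f.read()
        blob = CACHE / header["decoder_sha256"]
        if not blob.exists():                   # step 2: download only when needed
            CACHE.mkdir(parents=True, exist_ok=True)
            data = urllib.request.urlopen(header["decoder_url"]).read()
            if hashlib.sha256(data).hexdigest() != header["decoder_sha256"]:
                raise ValueError("decoder does not match pinned hash")
            blob.write_bytes(data)              # cache it for offline use
        return run_sandboxed(blob.read_bytes(), payload)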
We used to have formats like this, and then the attacker points to his decoder/malware package.
Apart from that, of course, the decoder has to be fast and thus native and interface with the OS, so the decoder is x86 on today's version of Windows, until the company hosting it dies and the patented, copyrighted decoder disappears from the internet.
If you assume you just say the magic word ‘sandbox’ and everything is safe, then yes, security is a solved problem. This is however a prime example of the saying that in theory, there is no difference between theory and practice but in practice there is.
You're treating sandboxing like it's all the same, but it's not. Multimedia decoders are one of the absolute easiest things to sandbox. Webassembly was designed for sandboxing and even that's very overcomplicated for what a decoder needs.
If we had standard headers, then reading metadata wouldn't be part of the decoder. The decoder would only need to take in bytes and output a bitmap, or take in bytes and output PCM audio. It doesn't need to be able to call any functions, or run any system calls, and the data it outputs can safely contain any bytes because nothing will interpret it.
It's like taking the very core of webassembly and then not attaching it to anything. The attack surface is astoundingly small.
You just need to give it a big array of memory and let it run arithmetic within that array, plus some control flow instructions. Easy to interpret, easy to safely JIT compile.
I have made secure emulators for simple CPUs before. Seriously, you only need a handful of opcodes and they only need to operate on a big array. It's hard to do wrong!
The part of sandboxing that's hard is dealing with I/O, or giving useful tools to the sandboxed code, or implementing data structures for the sandboxed code. You don't need any of that for a multimedia decoder. You just let it manipulate its big block of bytes, and make sure you bounds check.
A Java VM exposes tens of thousands of functions to the code inside it. A barebones sandbox exposes zero. It just waits for the HLT opcode.
And when it gives you raw RGB data, or raw PCM data, there's no way to hide a triggerable malicious payload inside. If the code does something bad, the worst it can do is show you the wrong image.
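To make the "big array plus a handful of opcodes" point concrete, here is a toy, bounds-checked interpreter sketch; the instruction set is made up for illustration and is nowhere near a real decoder, but it shows how small the surface can be when the guest only gets arithmetic, control flow, and one flat buffer:

    def run(program, memory, max_steps=1_000_000):
        """program: list of tuples; memory: bytearray shared with the host."""
        regs = [0] * 8
        pc = 0
        for _ in range(max_steps):              # hard step limit: no runaway loops
            op = program[pc]
            if op[0] == "hlt":                  # the only way out
                return memory
            elif op[0] == "set":                # regs[dst] = immediate
                regs[op[1]] = op[2] & 0xFFFFFFFF
            elif op[0] == "add":                # regs[dst] = regs[a] + regs[b]
                regs[op[1]] = (regs[op[2]] + regs[op[3]]) & 0xFFFFFFFF
            elif op[0] == "load":               # regs[dst] = memory[regs[addr]]
                addr = regs[op[2]]
                if addr >= len(memory):
                    raise IndexError("out-of-bounds read")
                regs[op[1]] = memory[addr]
            elif op[0] == "store":              # memory[regs[addr]] = regs[src]
                addr = regs[op[1]]
                if addr >= len(memory):
                    raise IndexError("out-of-bounds write")
                memory[addr] = regs[op[2]] & 0xFF
            elif op[0] == "jnz":                # jump if reg is nonzero
                if regs[op[1]]:
                    pc = op[2]
                    continue
            pc += 1
        raise RuntimeError("step limit exceeded")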
Yes, and you can also download a million projects like this, the sandbox exists.
But my suggestion would be that you build a video codec out of it. Preferably one that has the properties the market demands: performance and energy efficiency.
Not a codec, the only thing customized for this would be the container format.
You'd use existing codecs, and the way you get good performance and energy efficiency on a video codec is by having a hardware implementation. Software decoding doesn't even come into that picture.
As far as practical software decoding outside of battery-powered video, can I just point at webassembly? Especially the upcoming version with vector instructions. You could use normal webassembly, or even an extra-restricted version. It gets pretty good performance, and when you remove its ability to talk to the outside world it goes from pretty good to extremely good security.
> The container format would be decoded by a sandboxed codec that can be found by decoding the container?
The container parser would not be dynamically downloaded, and may or may not be sandboxed.
We don't need a new container with almost every codec. We just need the new codec itself.
> WebAssembly codecs indeed exist, and they are impractical due to a lack in performance.
Mostly because they don't have vector instructions yet, I bet. But plenty of webassembly is within 50% of native, which is good enough for lots of things, which includes image decoding for sure.
So now container decoders have been magically vetted and secured so they don’t need the sandbox. Which is quite surprising considering most vulnerabilities in streams are in the container decoders and their multitude of hardly used features, but okay.
The challenge remains for you to actually provide the codec you describe. Which a few comments ago was trivial because it was a hardware codec anyway; now it’s just a bit of WebAssembly away. Well, that should be trivial because cross compilers to WebAssembly exist. So why don’t you just provide a few real world examples? You’re probably not the first to think of these ideas; there has to be a reason why it hasn’t been done yet.
> So now container decoders have been magically vetted and secured so they don’t need the sandbox.
Not "magically". But you only need one or two, and they don't need to be very fast, so you can put a lot of effort into making them secure.
But more importantly, browsers already have many container decoders. This is not an expansion in attack surface. The goal here is allowing a lot more codecs compared to current browsers without a significant increase in attack surface compared to current browsers. Pointing out flaws that already exist doesn't disqualify the idea.
> So why don’t you just provide a few real world examples? Your probably not the first to think of these ideas, there has to be a reason why it hasn’t been done yet.
Image decoders in webassembly already exist. Did you even look? Including JXL!
Video decoding needs more support structure in the browser, and I already said some decoders need things that are being added to webassembly but aren't done yet. Even then, the first google result for "av1 webassembly" is a working decoder from five years ago.
You no longer need "printer drivers"; they're supposed to be automatically downloaded, installed, and run in a sandbox. You never need any "new drivers". The system will support any printer.
Except the "sandbox" was pretty weak and full of holes.
> Except the "sandbox" was pretty weak and full of holes.
Nothing prevents you from installing only the trusted ones.
Second, software is getting so complicated that if we don't build secure sandboxes anyway then at some point people will be bitten by a supply chain attack.