Also, at what point will AV1 become "mainstream"? How prevalent is it? It seems that hardware decoding (never mind encoding) support is still only so-so.
What I think the AV1 initiative neglects as a way of spreading its tech is giving away a great open implementation of an encoder.
x265 and x264 spread their respective codecs all over the internet through pirated content, which forced adoption.
The fact that you can play .mkv files almost anywhere now is a testament to that.
https://github.com/xiph/rav1e - written in Rust, and still getting a lot of optimisation. It doesn't cover all features; it's intended for places where libaom is too slow.
I have used x265 fairly easily to encode a 4K Bluray. None of the AV1 options seem to have the required performance (both quality and encoding speed) and ease of use.
SVT-AV1 has been both faster and higher quality than x265 for a while now though? For ease of use, you can just use ffmpeg, or one of the many CLIs or GUIs that support AV1 (preferably either Av1an for CLI, or a GUI that uses it under the hood, e.g. nmkoder).
This is false. Relative to x265, there is some preset that provides both faster encoding and higher "quality" at iso-bitrate, across the whole gamut of x265 settings, and I believe x264's as well, although I am not entirely sure what is available at the extremely inefficient/fast end (the usefulness of which is questionable).
I think backing companies like Google and Netflix don't care about that. They need hardware decoding support in phones and TVs, and they will serve av1 from their platforms and save a lot of money. It might become the dominant codec without you even noticing it.
HW vendors have already implemented AV1, or are implementing it, because they were on the big consortium and the industry is a bit more grown up now.
AV1 is a big compromise between a bunch of vendors. It is definitely the future, the powers that be have already decreed this, but these things move slowly.
Couldn’t they reuse the tensor cores that are shipped in every device at this point? There are already lots of papers on compressing images using deep learning, I don’t see any reason why the companies couldn’t make a video standard that relies on that hardware.
Having a hardware encoder and decoder on a device is super useful for streaming content off that device. Not sure I would want to use other compute for that; that compute is much better used doing CV on the video stream :)
Why do you think so? Those tensor processors are actually already optimized for video processing: all of the complex postprocessing in the iPhone camera app is done by the tensor cores inside the M1 chip. I wouldn't be surprised if they were already able to far outperform the mentioned codecs, but of course that needs lots of software development that can only be done by the big companies.
A codec is static, barely changing at all over a decade. That allows you to implement it as single-purpose hardware, which is orders of magnitude more efficient and faster than code running on a multipurpose chip, tensor or not.
For things that evolve fast, like deep learning, a programmable chip is the right choice.
The iPhone doesn't yet use M1. Besides, post-processing a video is one thing, encoding is completely different. What Apple does with the neural processing is most likely the analysis of the content, not the "editing".
In something like a mobile device, every watt counts. If it takes more energy to decode video on the tensor cores than it does to have a dedicated hardware block, you keep the hardware video decoder.
AFAICT AV1 is just a significantly more complex byte stream to create vs h264, especially if you want that byte stream to be meaningfully more compact than h265 created with x265 @ slow. x265 @ slow was already pretty time consuming[1][2]. And x265 itself had a hard time against x264 for years too[3]. So it may not be that the AV1 initiative is neglecting the public encoders and more that the math and algorithms for computing an AV1 byte stream really are that difficult.
[2] I ran test 1080p encodes using x264 and x265, both at the slow preset and CRF 20, via HandBrake on my M1 MacBook Air and got ~20 FPS for x264 slow and ~4.4 FPS for x265 slow. On my i7-1165G7 NUC I ran the same test but invoked FFmpeg directly and got ~16 FPS and ~4 FPS respectively.
No, OS support for H.264 happened because the industry wanted more efficient codecs, and it didn't have much competition at the time. H.265 and AV1 are both competing to replace H.264 and so far neither have a clear lead.
Yeah, in this era of mobile device dominance I think hardware decoding is key. I suspect Apple will add it, and then the Android SoC folks will suddenly consider it a priority, and not before.
Apple won't add it; they traditionally ignore free codecs, and even when they implement one, it is so limited that it's usable only in the narrow situations where it is required (see Opus, for example).
On the PC side, decoding is supported by Intel (since Tiger Lake), Nvidia (since the 30x0 series) and AMD (since RDNA2). On the mobile side, it is supported by a bunch of mobile chip vendors, like MediaTek or Samsung, so you can have an Android device that supports AV1 already; on the OS side it has been supported since Android 10. Vendors like Amlogic also support it, so set-top boxes are covered too.
Apple didn't care about free codecs before H.265 for the same reason ISO and Leonardo Chiariglione didn't: "the best codecs require big $$$ to research and competitive royalty-free codecs will stifle innovation by not sending money back to researchers"[0]. The ISO business model was to not care about patents as long as everyone agreed to charge FRAND rates, so they could just pick the best technology and rely on their creators' patent royalties to fund research. For Apple, they don't care about the cost, as long as the codec is suitable to implement in iPhones.
That business model broke once Velos Media and Access Advance realized they could game the system by offering a handful of net-implementer companies severely reduced patent rates if they agreed to license their patents or, better yet, pull out of MPEG-LA entirely. The end result is that companies have to license significant portions of H.265 multiple times from three different patent pools... and most people who are even remotely cost-sensitive just aren't bothering with it at all and are sticking with H.264 or moving to AV1.
Apple has yet to implement AV1 in hardware, which makes using the codec exclusively a non-starter as they have a very dim opinion of software decoders. I imagine they joined AOM to either hedge some bets or extract leverage from H.265/H.266 patent owners. The thing is that their business absolutely does not require the existence of a viable royalty-free codec, unlike Google's, which does. We'll know if Apple's actually decided to join the Free video party if they actually ship an AV1 decoder in their silicon.
[0] Leonardo is very vocally opposed to AOM's lack-of-a-business-model, see:
If you're wondering, he's the former chair of MPEG, before ISO decided to cut up MPEG into a bunch of different pieces and more or less left him without a job. He's salty about that too: https://blog.chiariglione.org/iso/
The broadcasting industry has definitely sided with VVC. So 8K broadcast will likely use VVC, possibly along with LCEVC [1], as announced by the Brazilian SBTVD standard.
The AV1 question is harder to answer. One could argue VP9 is good enough, and it's a few more years before smartphones reach 100% VP9 hardware decoding. Of course Google could force the usage of AV1 in the future if they slowly phase out VP9 video and only support AV1 on YouTube, while using H.264 as the baseline. Qualcomm should have AV1 decoding on Snapdragon by late 2022 / 2023. MediaTek is starting to roll out AV1 decoding as well. There's no timeline on encoding, and generally speaking I don't expect any mobile hardware vendor to ship AV1 encoding in the near future.
Even to this day I still do not believe Apple will support AV1 decode, at least not until Google forces AV1 usage. That is a very contrarian view, not just on HN but generally on the internet. But recent news with MPEG LA and Access Advance might have changed things a bit.
They somehow became a founding member of AOM 12 months after the announcement, because someone added their name to the list, and it was only discovered by a CNET journalist. And four years later AOM still doesn't have permission to use the official Apple logo on that page, compared to previous HEVC and VVC showings where the Apple logo was officially presented.
Doesn't inspire a lot of confidence to say the least.
The only recent change is that Apple's name somehow disappeared from both the Access Advance and MPEG LA lists.
Isn't their non-support more a consequence of the fact that they haven't shipped a single product with an optical drive since 2012? Like, you can obviously plug a $100 external Blu-Ray writer into any of their computers, but I think between the competition with their own digital distribution platform and the "bag of hurt" that is DRM nonsense, I can understand them not being interested in explicitly supporting it but still caring about the direction that it takes since they are a media and devices company, and the decisions made there impact them.
> but still caring about the direction that it takes since they are a media and devices company, and the decisions made there impact them.
Agreed, but with a sinister connotation.
While I am absolutely not accusing Apple of it, let's not forget that all companies want the inside track on any tech that could threaten profits, and may join these organizations to guide them away from decisions that are not in the companies' interests.
Presumably those conflicts of interest would be well understood by everyone at the table, and the other companies with a heavier stake on the devices side (Panasonic, Pioneer, Sony, etc) would hold Apple accountable for any obvious attempts to sabotage the standard.
An elucidating exercise is to use Python to do a DCT or wavelet transform on an image, quantize the result, and look at what comes out. It's a few hundred lines of code at most, and it gives you a solid idea of the significance of working in the frequency domain and how that makes compression much easier.
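For the DCT half of that exercise, a toy version fits in far fewer lines. This sketch (numpy only; the 8x8 block size matches JPEG, but the uniform quantization step of 16 is an arbitrary choice, not a real JPEG quantization table) shows how a smooth block collapses to a handful of coefficients:

```python
import numpy as np

def dct_matrix(n=8):
    # Orthonormal DCT-II basis (the same transform JPEG applies to 8x8 blocks)
    k = np.arange(n)
    m = np.cos(np.pi * (2 * k[None, :] + 1) * k[:, None] / (2 * n))
    m[0] /= np.sqrt(2)
    return m * np.sqrt(2 / n)

D = dct_matrix()

# A toy 8x8 "image" block: a smooth horizontal gradient
block = np.tile(np.linspace(0, 255, 8), (8, 1))

coeffs = D @ block @ D.T            # 2-D DCT: transform columns, then rows
quant = np.round(coeffs / 16) * 16  # crude uniform quantization, step 16
restored = D.T @ quant @ D          # inverse 2-D DCT

# A smooth block collapses to a handful of low-frequency coefficients,
# and the reconstruction error stays small -- that's the whole trick.
nonzero = np.count_nonzero(quant)
error = float(np.abs(restored - block).max())
print(nonzero, round(error, 2))
```

Because the block is smooth, nearly all of the energy lands in a few low-frequency coefficients, so throwing the rest away barely changes the reconstruction.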
PSA: For images, there's finally a successor to JPEG: JPEG XL (.jxl). It has lossy and lossless modes; it's progressive (you can download just the first part of the bitstream to get a lower-resolution image); and it has other benefits!
As a mobile developer I generally see WEBP getting served to Android devices and HEIC to Apple devices. Is there any advantage to JPEG XL over those?
If our app supported older iOS devices, maybe JPEG would be needed as a fallback, but it seems like JPEG XL wouldn't be compatible with old devices anyway, right?
I don't believe any platforms or browsers have enabled JPEG-XL support yet, so right now you'd need to ship the decoder yourself anyway.
But, it's at least as much better than WebP as WebP was better than JPEG. And unlike HEIC, web browsers are considering supporting it, though AVIF is currently ahead of JPEG-XL in browser support.
JPEG-XL also currently beats HEIC and AVIF at medium to high quality, but it's a fair question just how much that's intrinsic to the format and how much that is from libjxl's tuning focus being at those quality levels; AV1 and HEVC encoders have mostly focused on lower bitrates in the range that video would use.
JPEG-XL is the ~best lossless image format currently available, though.
> though AVIF is currently ahead of JPEG-XL in browser support.
It certainly looks like AVIF is ahead because it has landed in Firefox and Chrome, but it's blocked in Safari by a required OS change[0] (iOS/macOS support, if I understand the comment correctly). Additionally, there's no implementation in Edge (even though it's Chromium-based)[1]. Sorry, I wish I had more details, but I'm not sure where to look for a status update on that.
Meanwhile, JPEG-XL has support behind a feature flag in every major browser except Safari[2]. As you and others have noted, there seems to be a lot of buzz and excitement around JPEG-XL throughout the ecosystem. The WebKit folks seem to be making an effort to get the ball rolling[3]. I might be misinterpreting something, but it all looks very encouraging at any rate. It almost seems like it might gain wide support in Firefox and Chromium (Chrome & Edge) around the same time, with WebKit following (hopefully) closely thereafter. Heck, I don't see why WebKit doesn't just abandon AVIF and focus on rolling out JPEG-XL.
I think you might have that backwards. JPEG can be losslessly converted to JPEG XL with better compression (~20% smaller), but I haven't seen anything about going the other direction. I'm not sure how it would be possible with JPEG XL's different sized blocks and different available transforms. https://en.wikipedia.org/wiki/JPEG_XL
The JPEG -> JXL conversion is reversible, so you can encode JPEG -> convert JXL -> convert back to JPEG as needed. You could potentially encode a JPEG-XL directly with the subset that can be converted back to JPEG, but it's not clear to me if libjxl currently implements that.
Either way, it's not that useful for anyone that's already deploying post-JPEG formats; even WebP should save more than 20%. Mostly it's useful for CDNs and SaaS companies like Cloudinary to transparently deploy JPEG-XL for JPEG-only customers.
Do you mean that you would want that to happen on the web server, if they detect a user-agent that doesn't support JXL, they would convert back to JPEG? That seems pointlessly costly in CPU.
Also like mkl said, there doesn't seem to be any evidence that JXL->JPEG can be done losslessly.
I actually meant to say Capture One and Photoshop. I know this is HN and we're gung-ho about open source but the last time I tried Darktable it didn't hold a candle to Capture One. To be fair that was a few years ago, but the color work put in to Capture One doesn't really have an equal IMO.
Not sure when you used it, but you should try it again now... I was on Lightroom and Capture One for years, tried DT last year, and never bothered going back to them. Nothing to do with it being open source; it's just straight up a better raw editor. Yes, it's hard to learn, but once you figure it out you'll see how much better it handles editing raw images, especially with the filmic module.
It has poor support for input formats though. E.g. it still doesn't support CR3. I prefer RawTherapee which supports CR3, doesn't look like a space station and still is miles ahead of Lightroom in terms of capability.
JPEG2000 is a good option for lossless compression. JPEG XL seems to offer better results and improvements now, but it isn't yet mainstream and has competition.
I use the J2K photoshop plug-in - https://www.fnordware.com/j2k/ . And even macOS Preview supports JPEG2000 (HEIF on macOS on the other hand is not suitable for professional graphic works; doesn't work with CMYK).
A JPEG pixel is 64 (= 8 x 8) 8-bit coefficients which are summed together for each pixel. That result is not 8 bits, but more; the contrary is a misunderstanding, often repeated. A JPEG is capable of over 11 bits of dynamic range. You can figure out that an 11-bit dynamic range image has more than 8 bits. See Wikipedia.
A jpeg pixel is not 64 eight-bit coefficients. Jpeg compresses an 8x8 pixel block at a time by taking a DCT (which mathematically is lossless, but in practice is not due to rounding and quantization at this stage), which turns those original 8x8 values into another set of 8x8 values, then some of these are thrown away and/or quantized for lossy compression.
Decompression is the reverse: take these 8x8 quantized DCT coefficients and perform an inverse 8x8 DCT to get pixel values.
The 11-bit dynamic range part you claim is merely from a color profile, which takes the resulting 8 bits per channel (i.e., 256 possible values) and spreads them over a gamma curve to an 11-bit range. But there are still only 256 possible levels per channel, too few for quality image editing.
Think of it as squaring: taking [0,255] as your input and squaring every value gives you a range of 0 to 255^2 = 65025, but that does not allow you to store any value in that range. It only allows you the 256 values that are squares.
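That squaring analogy can be checked directly:

```python
# Squaring every value in [0, 255] stretches the *range* to 0..65025,
# but adds no new levels: there are still only 256 distinct values.
levels = {x * x for x in range(256)}
print(len(levels), max(levels))  # → 256 65025
```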
So the dynamic range is 11 stops, but the number of representable levels per channel is still 8 bit: 256 levels. This makes gradients band no matter how you do them in JPEG.
It's why photo processing software wants RAW, not JPEG. JPEG, besides being lossy, does not allow enough steps.
One example: given a so-so RAW, at 11 bits, you can pull out dark things or darken bright things smoothly. This is not possible once you go to jpeg, for any implementation of jpeg.
I said a JPEG pixel is the summation of 64 8-bit coefficients. The coefficients are 8 bit, but obviously the cosine values are not 8 bit. They can be floating point. All a JPEG need give is coefficients; the coder/decoder knows what to do with them. Summing 64 products, with each product = an 8-bit number x a cosine value, gives more than an 8-bit result for the resultant pixel. In addition, there is another DCT for the color values. This adds more bits of info.
The cosine values are transcendental numbers, they can have an infinite number of decimal places, yes? So adding up 64 products of (cosine values * 8 bit integer) to get 1 pixel value can obviously have more than 8 bits.
No, a jpeg pixel is not "the summation of 64 8 bit coefficients." I've written jpeg codecs (and many other image formats). It works just as I explained above.
Or simply read the libjpeg source.
Don't like that, read this [1]: "JPEG images are always recorded with 8-bit depth. This means the files can record 256 (2^8) levels of red, green and blue."
Don't like that, here [2] is the JPEG ISO standard, section 4.11, baseline jpeg, "Source image: 8-bit samples within each component".
A DCT takes an 8x8 pixel input, 8 bits per channel, and transforms them into an 8x8 output. It matters not what these are - the information theory content is nothing more than what was put into it. There is not suddenly magically more information content.
More simply, appending zeroes to a number does not mean you can represent more numbers. You simply can represent the exact same numbers, just wasting more space.
None of what you wrote adds more resolution at the output. It simply isn't there.
If I give you 5 possible inputs to a function, then you have 5 possible outputs, no matter how many digits you finagle into representing the output.
Jpeg has 8 bits of resolution per channel. End of story. That is why professional photos are taken and edited in raw - you get more bits of resolution per channel.
I'm not sure why you're still arguing this. It's a longstanding, well known issue, and I explained it all again very simply.
If you think it isn't true, encode one of your magic jepgs with more than 256 levels of gray and post it here. Good luck :)
If you cannot do that, then maybe you should consider that you're wrong.
What you have described is usually called a 24 bit rgb image--not a 8 bit image. An 8 bit image can have only 256 distinct levels, whereas a jpeg can have ~16 million or 2^24 values for each pixel. 8 bit images are used often for medical imaging, but they are crude compared to 24 bit rgb images. One can argue that 24 bit rgb images are too crude, but they should not, IMHO, be called 8 bit images. But that is often what people say about jpegs. Typical jpegs with 8 bit coefficients have much more information than 8 bit images. Perhaps typical imprecise terminology?
[1] https://en.wikipedia.org/wiki/Color_depth#True_color_(24-bit...
[2] https://www.quora.com/How-many-colors-does-a-JPEG-contain
[3] https://en.wikipedia.org/wiki/JPEG#JPEG_codec_example << They walk thru the steps.
I never called them 8 bit images. I wrote 8 bits per channel. Each of R, G, and B is a channel. An RGBA image has 4 channels. A grayscale image has one channel. This is standard terminology. So an 8 bits per channel image with three channels is a 24 bit image.
It is very precise terminology, used correctly. It's also covered in your links; you can read it there.
Now, if you encode gray levels in RGB, at 8 bits per channel, you do indeed end up with only 256 gray levels in the image, because for each pixel, R=G=B.
Actually the coefficients are 12 bit in JPEG, before quantization. In principle you can make pretty accurate 10-bit HDR JPEG files, and with an accurate JPEG decoder, it would work well enough.
The most common JPEG decoders though (in particular libjpeg-turbo) are using a cheap but not super precise iDCT that has 8-bit YCbCr as output, which then gets chroma-upsampled if needed and converted to 8-bit RGB. That causes the effective precision in reds and blues to be only 7-bit. But in principle you could have about 10 bits of effective RGB precision, it just requires a sufficiently precise JPEG decoder.
Alternatively, if you're being confused by the quoted intermediate DCT precision, that's not relevant to the final output either. You cannot implement any transform, DCT especially, without having intermediate values with a greater range than the input or output. Like, even a simple average of two 8-bit values (a+b)/2 has an intermediate range of 9 bits.
The JPEG spec does specify both 8 and 12 bit sample precision for lossy, but I don't think anyone ever implemented 12-bit since libjpeg never cared about it.
Libjpeg does have support for it. Unfortunately, you have to have two copies of the library, one for 8 bit and one for 12. And you have to rename all the API methods in one of the libraries so you don't get name collisions. I believe that LibTIFF has a build configuration for this so that 12-bit JPEG data can be encapsulated in a TIFF file.
Fair enough, but the end result is what really matters, and I regularly see banding in the skies of photos. Some people add grain just to help deal with that, which is a ridiculous problem to have in 2022.
JPEG2000 wasn't technically better enough; wavelets are too blurry and less efficient to decode. For some reason there was an epidemic of terrible codec designs for a decade or two, though it could've all been patent avoidance.
Wavelets were a great idea... except AFAIK we never figured out fast algorithms or hardware implementations to optimize processing them, the way we did with Fourier Transform and the FFT algorithm and IDCT blocks.
A lifting implementation of a wavelet is pretty low complexity; iirc the CDF 9/7 used in JPEG2000 was effectively a 5-tap filter. Implementation-wise, the biggest issue is that approximately no one bothered optimizing their memory access patterns for CPU caches with anywhere near the same attention FFTs got. Then, unlike block-based codecs, you basically need to store 16-bit coefficients for an entire frame of 8-bit pixels, instead of per-block.
But ultimately, DCTs make it way cheaper to encode similar but wrong details, where in wavelets details are effectively repeated across multiple highpass bands so the only cheap option is to not code detail.
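To make the lifting idea concrete, here's a sketch using the plain Haar wavelet rather than CDF 9/7 (the structure is the same; the filters are just shorter): each level is a cheap predict/update pair, and the integer version is exactly invertible:

```python
def haar_lift_forward(x):
    # Split into even/odd samples, predict odds from evens, then update evens.
    even, odd = x[0::2], x[1::2]
    detail = [o - e for e, o in zip(even, odd)]          # predict step (high-pass)
    approx = [e + d // 2 for e, d in zip(even, detail)]  # update step (low-pass)
    return approx, detail

def haar_lift_inverse(approx, detail):
    # Undo the update, then undo the predict -- exact in integer arithmetic.
    even = [a - d // 2 for a, d in zip(approx, detail)]
    odd = [e + d for e, d in zip(even, detail)]
    out = []
    for e, o in zip(even, odd):
        out.extend([e, o])
    return out

signal = [10, 12, 14, 13, 200, 202, 90, 91]
approx, detail = haar_lift_forward(signal)
assert haar_lift_inverse(approx, detail) == signal  # perfectly invertible
print(approx, detail)
```

The approx half carries the low-pass band and the detail half the high-pass band; JPEG2000 applies the same idea recursively with longer filters.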
DCI (how movies are sent to theaters) does use JPEG2000, but I think they just wanted something with 12-bit video and that's pretty rare in any codec.
One advantage is if you want to decode the same picture at lower resolutions; JPEG can do this equivalent to nearest-neighbor, wavelets can do it equivalent to a better filter than that. But still not as good as just decoding and resizing down.
That license doesn't say anything about patents. This one does: https://gitlab.com/wg1/jpeg-xl/-/blob/main/PATENTS and only says that patents owned by Google are covered. So it may be that other companies patent the technology. The information in the repository is not enough to decide whether JPEG XL is safe to use.
What's fascinating to me as a web dev is that (unless I'm mistaken), if I store all my images as JPEG-XL, I won't need to create thumbnails or resized images anymore... I think I can literally just hold onto the originals and just truncate the image binary stream at whatever resolution is required.
I'm not sure how this would interact with "right-click to save image" though... what resolution would it be at that point?
The problem with these formats is that they don't offer enough of an advantage to overcome the disadvantage of having less than the enormous install base of JPEG. Look how much better PNG is than GIF and realize it only succeeded in replacing it (for static images) because of the Unisys patent suits, not because it's so much better.
> Look how much better PNG is than GIF and realize it only succeeded in replacing it (for static images) because of the Unisys patent suits, not because it’s so much better.
GIF being limited to 8bpp would have meant that it would have been replaced eventually anyway.
In the specifics: WebP is pretty popular in my experience.
In general: I expect that eventually something else will win, even if it takes a long time. There's too much room for improvement for the ecosystem to fossilize just yet.
"Your browser does not support JPEG XL"; I'm sure it's cool technology but whoever is behind it needs to hire developers and advocates to push code to Webkit / Blink to add support for it, else it won't get mainstream adoption.
Chrome: behind a flag since version 91.0.4470.0
Firefox: behind a flag since version 90.0a1
Edge: behind a flag since version 91
Opera: behind a flag since version 77
ImageMagick: since version 7.0.10-54
XnView MP: since version 0.97.0
ImageGlass: since version 8.1.4.18
ExifTool: since version 12.23
gThumb: since version 3.11.3
OpenMandriva Lx: since version 4.3 RC
GIMP: since version 2.99.8
qimgv: since version 1.0.0
PhotoQt: since version 2.4
jxl-winthumb: with WIC-codec since version 0.1.11
KImageFormats: since version 5.89.0
I have read that both Chrome and Firefox now support it if one enables the proper option in the about:config page (or its Chrome equivalent, though I avoid Chrome on principle).
It's on my to-do list for the coming weeks; I have not verified this yet.
It’s available behind a flag, perhaps awaiting v1.0 in the reference implementation or something.
I really hope it takes off. The timing was a bit poor because widespread AVIF support happened _right_ before, and now we need to beg browser vendors to adopt JPEG XL! I hope they don't balk at it for introducing too much file-format cruft, because JPEG XL is really, really something they should implement. It can succeed JPEG as well as PNG (lossless JPEG XL tends to destroy even optimized PNG). I'd much rather abandon AVIF support. It's just one of those annoying single-video-frame-as-a-picture hacks, like WebP and HEIF, without the drive to make them outstanding image formats in their own right for photography as well as line art.
Is this too late? iPhones are shooting in HEIF now, which means there's already millions of HEIF images in the world. Surely that is the de-facto next generation image standard.
H.264 patents are not expiring yet, if anyone was wondering. 2027 seems to be when that happens. On the other hand, I believe H.263 patents already expired, and MPEG-4 ASP (DivX etc.) is expiring this year.
You don’t have to wonder: There have been many truly patent free codecs and the answer is mostly “Try to convince companies to adopt it even though they have a vested interest in using codecs they control”
FLAC was released in 2001 and widely popular by at least 2006.
With much fanfare, iOS11 (2017) was the first time iOS supported playing FLAC files and that was (and still is) not the Apple-branded “Music” app but only in the “Files” app.
Vorbis has been pretty much entirely obsoleted by Opus so there wouldn't have been much incentive for new Vorbis support since then - is there a similarly decisive open successor for FLAC?
I certainly have not seen one. FLAC is either A) the best option for free lossless music encoding, or B) The music industry is way too big of a heavy hitter to let any others build momentum
The problem was that FLAC wasn't guaranteed patent-free; it merely claimed not to be covered by any patent. That is quite a difference from a legal perspective.
One could argue about the patent system and its usage, but most companies (ignoring tech and internet companies) tend to avoid such risks, especially when there are alternatives.
And consumers tend to just use whatever they want. I remember I picked WavPack over FLAC in that era, but I can't recall the reason any more; it was just too long ago.
> The FLAC and Ogg FLAC formats themselves, and their specifications, are fully open to the public to be used for any purpose (the FLAC project reserves the right to set the FLAC specification and certify compliance). They are free for commercial or noncommercial use. That means that commercial developers may independently write FLAC or Ogg FLAC software which is compatible with the specifications for no charge and without restrictions of any kind. There are no licensing fees or royalties of any kind for use of the formats or their specifications, or for distributing, selling, or streaming media in the FLAC or Ogg FLAC formats.
While I know through experience that you have a point about some companies avoiding anything that might be patent encumbered, and that the statement “[FLAC is not] covered by any known patent.“ would be a red flag to them, the argument falls down when we talk about how much code has been used over the years by companies who stepped all over the intellectual property rights of the authors.
Even if we look at MP3, which is definitively patent-encumbered and commercially licensed, there have been thousands of devices that played MP3s before they had a license (if they ever got one at all).
> The problem was FLAC wasn't patent free but claims to be not covered by any patent. That is quite a bit of difference from a legal perspective.
What is that difference? It's not like you can prove that anything is not covered by any patent without being sued by every patent holder and winning.
>What is that difference? It's not like you can prove that anything is not covered by any patent
When you have a codec whose patents have all expired after 20 years, it is guaranteed patent-free, because no one can extract patent royalties from it any longer. Claiming you don't hold any patents just means you think it isn't covered by any patents, until proven otherwise.
Last I checked[0] I think there were some remaining H.263 patents in the US, but they only covered some error correction or optional annex features.
Interestingly enough, there's some non-standard-essential patents still floating around out there. For example, Mediatek patented converting Sorenson Spark bitstreams for decoding with standards-compliant H.263 hardware decoders[1], and that patent is still live until later in this decade.
[0] I wrote a Sorenson-flavored H.263 decoder in Rust for integration into Ruffle.
[1] The invention is literally just truncation-with-saturation wrapped in a bunch of patent language. If you have the money to front a legal fight against them, feel free to get the patent invalidated. But I don't think it matters.
You could write a complete from-scratch decoder, and be sued if you distribute it.
This is the case with MPEG-2 formats used in DVD Video, and the reason that the VLC Media Player exists as a French student research project distributed as source code instead of a normal library.
IANAL, but as far as my understanding goes, it means that the algorithm is patented and therefore requires you to pay a license fee if you want to redistribute the code in a country where software patents are practically enforceable (like the USA).
In other parts of the world (like large parts of Europe, for example) software patents are practically unenforceable and you can probably distribute your code as you see fit. Code is still copyrightable, so you can't just redistribute someone else's library, but your own code does not violate their copyright.
Some greedy American companies will probably still sue you if you do, hoping to enforce their patents regardless or at least scare you into paying them, but legally you _should_ be in the clear.
When it comes to dedicated hardware for an algorithm, though, like hardware accelerated video codecs, you'll likely run into patent trouble in most of the world if you choose to build your own hardware for a patented system.
I worked on a project where I extracted the motion vectors from the H.264-encoded stream from the camera to detect motion. It's like getting a basic motion detector for free.
It's been a long time. But I remember I started with this page [0]. There are also many resources on github, like [1] or [2], but I haven't tried them.
There is a GitHub project called pikrellcam that utilizes motion vectors generated from the Raspberry Pi's GPU to detect motion. It's a relatively simple piece of code and it works very well.
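As a rough sketch of the idea, assuming the motion vectors have already been extracted from the bitstream (e.g. via ffmpeg's `-flags2 +export_mvs` or the Pi's GPU firmware) - the function and thresholds here are invented for illustration:

```python
import math

def detect_motion(motion_vectors, mag_threshold=4.0, count_threshold=10):
    """Flag a frame as 'motion' if enough of its motion vectors
    exceed a magnitude threshold.

    motion_vectors: iterable of (dx, dy) displacements in pixels,
    as read out of the encoded stream's macroblock metadata.
    """
    moving = sum(
        1 for dx, dy in motion_vectors
        if math.hypot(dx, dy) >= mag_threshold
    )
    return moving >= count_threshold

# A mostly static scene: vectors near zero -> no motion flagged.
static = [(0, 0)] * 100 + [(1, 0)] * 5
# A cluster of large, coherent vectors -> motion flagged.
moving = [(0, 0)] * 80 + [(8, 3)] * 20
```

The appeal is that the encoder already paid the cost of motion search, so the detector itself is nearly free.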
Does anything interesting happen if you take the frequency domain representation of an image, represent the frequency domain as an image itself, and compress that with some sort of image compression?
For example, encode the frequency domain representation as a low quality JPEG, and then undo the steps to turn it back into the "original". How do the JPEG artifacts on the frequency domain manifest in the resulting image?
Thanks, this is great. I found it works a lot better if you leave the magnitude alone and only mess with the phase.
The image is still recognisable even at quite harsh compression of the phase file, but almost totally unrecognisable at even minor compression of the magnitude file.
This wouldn't work well because in the frequency domain representation, different "pixels" have very different importance for the overall appearance of the image: The pixels at the center of the frequency domain representation represent low frequencies, so compressing them will drastically alter the appearance of the image. On the other hand, the corners/edges of the frequency domain representation represent high frequencies, i.e. image details that can be removed without causing the image to change much. That's the crucial benefit of the Fourier transform for compression: it decomposes the image into important bits (low frequencies) and relatively less important bits (high frequencies). Applying compression that doesn't take that structure into account won't work well.
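You can see this asymmetry directly with a small numpy experiment (a sketch, not any codec's actual transform): keep only a block of low frequencies around DC and measure how much different kinds of content degrade.

```python
import numpy as np

def fft_lowpass(img, keep=8):
    """Keep only a (2*keep)x(2*keep) block of low frequencies around DC."""
    F = np.fft.fftshift(np.fft.fft2(img))   # move DC to the center
    mask = np.zeros_like(F)
    cy, cx = F.shape[0] // 2, F.shape[1] // 2
    mask[cy - keep:cy + keep, cx - keep:cx + keep] = 1
    return np.real(np.fft.ifft2(np.fft.ifftshift(F * mask)))

# A smooth gradient lives mostly in low frequencies, so it survives...
x = np.linspace(0, 1, 64)
smooth = np.tile(x, (64, 1))
err_smooth = np.abs(fft_lowpass(smooth) - smooth).mean()

# ...while a fine checkerboard (pure high frequency) is wiped out
# to a flat gray, leaving a large reconstruction error.
detail = np.indices((64, 64)).sum(axis=0) % 2
err_detail = np.abs(fft_lowpass(detail) - detail).mean()
```

Throwing away the same number of coefficients costs almost nothing on the gradient and everything on the fine detail, which is exactly why codecs quantize high frequencies more aggressively rather than compressing the frequency plane uniformly.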
Minor note.
If the original data is a time-signal like in electrical engineering (amplitude vs. time function), then the "frequency domain pixels" (its transform) are different frequencies (points in frequency domain: how many repetitions in a second, etc.) and the time-signal's transform function becomes an amplitude vs. frequency graph.
But if the original data is an image (matrix or grid of pixels in space), then the "frequency domain pixels" are different wave-numbers (aka spatial frequencies: how many repetitions in a meter, etc.) and the Fourier transform (of the pixel grid) is an amplitude vs. wave-number function.
I'm into the glitch art scene and this makes me wonder what happens if you crop/erase patterns of the frequency domain representation and put it back together...
Your idea is analogous to what’s done in MRI image reconstruction. “Images” are actually obtained in the frequency domain, but there are various undersampling strategies to accelerate the frequency domain data acquisition. These are then reconstructed into actual images with the inverse FFT.
the frequency domain of the frequency domain is just a flipped version of the original image, so that would not work. the idea of frequency-domain based compression is that natural images have little high-frequency information.
Almost all of the encoding concepts mentioned were not introduced with H.264, but much earlier, with MPEG-2 in the early 90s: https://en.wikipedia.org/wiki/MPEG-2
There was once a blog titled "Diary Of An x264 Developer" that gave some interesting detail of how H.264 worked and how x264 implemented it. It's still available via the Internet Archive.
Tied as the leading implementation by 2006, pulled far ahead in SSIM in 2010, and all FOSS. Hats off to Glaser and Merritt!
An interesting Baader–Meinhof for me when I first heard: Glaser (as Dark Shikari) used to be active on HN and wrote the quote (oft misattributed to Descartes),
Any community that gets its laughs by pretending to be idiots will eventually be flooded by actual idiots who mistakenly believe that they're in good company.
One thing I was curious about is how the PNG would compare after running it through an optimizer, since that would be a fairer comparison given that the H.264 encoder does optimize. Even so, I bet the H.264 would fare well. I did an experiment using single-frame H.264 instead of images on a site: I think it's a viable technique but didn't have time to flesh it out in full. If you have some kind of asset pipeline for a site, it's not really more work to encode, and the HTML is now a video tag with no player controls, so it isn't a ton of work client side either. Would love to explore that more at some point.
It would be better for you to serve .webp and .avif (or optionally .heif for apple devices) inside a <picture> tag with .jpg or .png as fallback, as these formats are designed for static images, but are based on I-frame compressors for vp9, av1 and hevc respectively.
> Suppose you have some strange coin - you've tossed it 10 times, and every time it lands on heads. How would you describe this information to someone? You wouldn't say HHHHHHHHH. You would just say "10 tosses, all heads" - bam! You've just compressed some data! Easy. I saved you hours of mindfuck lectures. This is obviously an oversimplification, but you've transformed some data into another shorter representation of the same information.
Future generations will disagree it's the "same information."
I predict there will be a small but vocal cabal who seize on your example with nine H's to argue that it's not the case.
On a more serious note, if you lose the decoder ring does it cease to be a "representation of the same information?"
It's curious, because where do you draw the line between an algorithm that compresses information and an algorithm that just stores the information in the algorithm itself?
Well, if we're being pedantic here, raw RGB video as would be displayed on a RGB monitor does indeed take 3 bytes per pixel. Technically, YUV is either an optimization or a backwards compatibility measure, and hence the adjective "raw" does not apply to it.
Actually the "raw" data coming from most cameras are often already compressed somehow, otherwise the camera would not be able to write it to storage (or send it over USB) fast enough.
In fact since decoding also often happens in hardware, the raw size may never materialize anywhere in the pipeline, other than the framebuffer of the video card. Even HDMI has compression now [0]
The author probably chose a screenshot as a benchmark, because otherwise it's hard to get your hands on anything compressible but not already compressed.
10-bit 4:4:4 video with 8,294,400 or 8,847,360 pixels per picture for 4K (TV vs. movie) tops out below 270 Mbit per frame, or 16 Gbit at 60 Hz, so you can fit half a dozen signals down a 100G transceiver, or two down a 40G.
Throw in YUV chroma subsampling and you can shift 120 Hz 8K with relatively little problem.
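The arithmetic above checks out directly (using decimal Mbit/Gbit; the helper below is just for illustration):

```python
# Bits per frame for uncompressed 10-bit 4:4:4 video: 3 samples per pixel,
# 10 bits per sample.
def bits_per_frame(width, height, bit_depth=10, samples_per_pixel=3):
    return width * height * bit_depth * samples_per_pixel

uhd_tv = bits_per_frame(3840, 2160)  # 8,294,400 pixels ("TV" 4K)
dci_4k = bits_per_frame(4096, 2160)  # 8,847,360 pixels ("movie" 4K)

mbit_per_frame = dci_4k / 1e6        # worst case, indeed below 270 Mbit
gbit_at_60hz = dci_4k * 60 / 1e9     # ~16 Gbit/s, so six fit in a 100G link
```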
If you define raw video as RGB with 1 byte for red, 1 byte for green and 1 byte for blue, then yes, it will be 3 bytes per pixel.
But there are clearly other ways to store a pixel of colour information in fewer than 3 bytes, which is OP's point. It's not really an optimization - it's just a different coding format (just as ASCII isn't an optimization of Unicode).
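A quick sketch of the sizes involved for 8-bit formats (`yuv420p` is the layout most video codecs actually consume; the function is illustrative):

```python
def raw_frame_bytes(width, height, fmt):
    """Uncompressed frame size for a few common 8-bit pixel layouts."""
    pixels = width * height
    if fmt in ("rgb24", "yuv444p"):  # three full-resolution planes
        return pixels * 3
    if fmt == "yuv420p":             # chroma stored at quarter resolution
        return pixels + 2 * (pixels // 4)  # Y plane + two small chroma planes
    raise ValueError(f"unknown format: {fmt}")

rgb = raw_frame_bytes(1920, 1080, "rgb24")    # 3 bytes per pixel
yuv = raw_frame_bytes(1920, 1080, "yuv420p")  # 1.5 bytes per pixel
```

So 4:2:0 subsampling alone halves the "raw" size before any actual compression happens, which is why the two baselines differ.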
I met one of the creators of H.264 years ago while photographing a party in Palo Alto. He and his wife were really genuinely nice and interesting people. At the time H.265 was not released, but it sounded like they were actively working on it, and even on something past it.
Just got an Air 2S drone and it recorded in H.265. I tried to play it on a 2013 MacBook Retina and it plays a frame every minute or so. I can play 4K H.264 no problem. Am I missing something about why it's not playing?
Is there a (good) college-level textbook on the topic of image and video encoders? I never took a digital image processing course .. curious if someone has a favorite book on this topic.
Lectures from a course I took (CS 674 at UNR) can be found online here, along with class exercises and names of textbooks:
https://www.cse.unr.edu/~bebis/CS474/
Would highly recommend it! We went over pretty much everything in this article, plus more :)
Back in college, when I was taking a Digital Image Processing class, we discussed H.264. Truly impressive technology, no doubt. This article goes into much more detail about it - job well done.
I stopped reading when the author started to do comparisons with PNG. PNG or alternatives should only be used when the errors created by lossy compression are too visible.
With that said, improvements in audio and video compression over the last 25 years are very impressive and have changed how the world works in several areas.
PNG is a good baseline for image sizes before adding any kind of smart compression. Nothing wrong with that, and the comparison would not look much different against JPG (except it would probably look actually worse than the video).
For the web, you might use it when you want alpha channels. WebP is still not supported everywhere.
Regardless, it’s useful as a comparison in the article to show just how good h264 is versus other compression techniques. They are not recommending PNG or anything. WebP would only produce a 20-50% smaller file, not 1000x smaller, so it doesn’t really matter which image format they compare to.
Suppose you have a bunch of raw video. You take extracts of it and put them together to make a movie, M1. You make an H.264 encoded copy of that. Let's call it C1.
You then make a new cut of your movie, M2, which is mostly the same footage as M1 except that you've shortened a few scenes and lengthened others. You make an H.264 encoded copy of that. Call this C2.
When making C1 and C2 your H.264 encoders have to decide which frames to turn into I-frames and which to turn into P-frames.
If they just do something simple like making every Nth frame an I-frame, then after the first difference between M1 and M2 it is unlikely that C1 and C2 will have many I-frames in common, and therefore they will also not have many P-frames in common.
If they look for scene changes and make new I-frames on scene changes, then we might expect that at least for the scenes that start identically in M1 and M2 they will get identical I-frames and P-frames up to their first edit if any.
Scenes that are edited at the front would still end up encoded totally differently in C1 and C2.
Question: are there any encoders that when encoding M2 to produce C2 can be given M1 and C1 as references using them to adjust I-frame spacing so as make as many C2 I-frames as possible match C1 I-frames?
That would allow C2 to be stored efficiently as a binary diff from C1. This could be handy if C1 and C2 needed to be checked into a version control system, or you needed to distribute C2 over a low bandwidth or expensive link to someone who already had C1.
The second question concerns recompressing after decompression. I actually thought of this question in terms of audio so will ask in those terms, but I guess it applies to video too.
Suppose someone has an uncompressed source S. They compress it with a lossy compressor producing C and distribute C to you. You decompress C producing S'.
You then compress S' with a lossy compressor (the same type that the original producer used--e.g., if C is an MP3 you use an MP3 compressor) producing C'. I don't know about video, but for audio (at least back in the days when MP3 was starting to get big) C' would be lower quality than C.
Are there any compressors that can figure out that they are dealing with something that already has undergone the "throw out imperceptible parts to make it more compressible" step done and just skip to the next stage, so they produce a C' that is a lossless representation of S'?
> If they look for scene changes and make new I-frames on scene changes, then we might expect that at least for the scenes that start identically in M1 and M2 they will get identical I-frames and P-frames up to their first edit if any.
> Question: are there any encoders that when encoding M2 to produce C2 can be given M1 and C1 as references using them to adjust I-frame spacing so as make as many C2 I-frames as possible match C1 I-frames?
I suspect if you were to encode both files with x264 with identical settings, in crf mode, with vbv disabled and keyint=0 (unlimited keyframe length), its scene detection should place I-frames in the same places (that is, on scene cuts, not on the timeline). Maybe some scenecut tuning would be necessary.
> That would allow C2 to be stored efficiently as a binary diff from C1. This could be handy if C1 and C2 needed to be checked into a version control system, or you needed to distribute C2 over a low bandwidth or expensive link to someone who already had C1.
I'm not aware of any automated way to do that, but you can do a similar thing manually using mkv's ordered chapters. You first compress the original cut, then you compress any additional scenes and insert them where you need. For example, you can make a mkv file for a theatrical cut of a movie, and then make a separate file for a director's cut that is only as big as the additional scenes are, since it uses the theatrical cut file for the common scenes.
> Are there any compressors that can figure out that they are dealing with something that already has undergone the "throw out imperceptible parts to make it more compressible" step done and just skip to the next stage, so they produce a C' that is a lossless representation of S'?
The compressor will try throwing away exactly the information that was thrown away the first time you compressed it. So basically it will leave the content as is, because there is nothing extra to throw away.
You can easily see this with an MP3 file at 128 kbps: the first time you compress it, most of the very high frequency content will be thrown away - you can see this by comparing a spectrogram of the uncompressed file with one of the compressed file. But if you compress it again, the second compression's spectrogram will look very similar to the first one, because there is nothing else left to throw away.
But there is a complication: the audio file is typically stored in the time domain (PCM), but the compressor operates in the frequency domain, and there is a conversion between these two domains that you can't avoid. This conversion unfortunately loses a bit of information and degrades quality a bit.
For the first question, I ended up using this trick for a video editing app on Android phones about 10 years ago in order to cut video together on low end, out of date (Android 2.3) phones. They couldn’t handle video compression in a reasonable time and we didn’t want to upload/download to process on a server.
The point of the app was to sync the cuts to music cues, so each clip had a defined length. I ended up doing it all through file manipulation. You can cut into a video file starting at any arbitrary I-frame then trim it to the desired length. I would cut the input videos down to size then concatenate the files, replacing the audio with the new soundtrack at the end.
It worked great, only took a few seconds to create the final edit. Of course you couldn’t overlay text or filter video, but I still think it was a valid solution.
With the requirement of starting each clip on an I-frame, there was some imprecision in where your cut would actually start - an auteur might have a problem with their masterpiece being butchered that way, but it would certainly work well for some special cases, like efficient distribution or being able to show via a diff that a video was unaltered apart from timing cuts.
> If they just do something simple like making every Nth frame an I-frame, then after the first difference between M1 and M2 it is unlikely that C1 and C2 will have many I-frames in common, and therefore also not have many P-frames in common.
Modern encoders won't be using a fixed rate for I-frames unless you force it to. It will choose an I-frame when deemed optimal.
You're correct that a compressed source which goes through a video editor and then to another encoder will likely not tell the new encoder which frames were originally I-frames in the source. This is because video editors combine multiple sources together, so there is no "single" source.
But if you're not actually editing and are just "trimming" and "joining", this can be done with perfectly matched I-frames and P-frames - but probably not B-frames?
Even when using the command-line x264, you can specify which frame numbers you'd like encoded as I-frames.
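As a toy illustration of threshold-based scene-cut detection for I-frame placement (this is not x264's actual algorithm, and the thresholds are invented):

```python
import numpy as np

def place_iframes(frames, cut_threshold=0.3, max_gap=250):
    """Toy keyframe placement: start a new I-frame on a large
    frame-to-frame change (a 'scene cut'), or after max_gap frames."""
    iframes = [0]  # the first frame must always be an I-frame
    for i in range(1, len(frames)):
        diff = float(np.mean(np.abs(frames[i] - frames[i - 1])))
        if diff > cut_threshold or i - iframes[-1] >= max_gap:
            iframes.append(i)
    return iframes

# Two flat "scenes": dark frames, then bright frames -> one cut at frame 5.
scene_a = [np.full((4, 4), 0.1) for _ in range(5)]
scene_b = [np.full((4, 4), 0.9) for _ in range(5)]
cuts = place_iframes(scene_a + scene_b)
```

The point of the earlier question follows from this: a content-driven rule like the one above places I-frames at the same cuts in both M1 and M2 wherever the footage is identical, whereas a fixed every-Nth-frame rule drifts apart after the first edit.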
> You make an H.264 encoded copy of that. Let's call it C1.
> You then make a new cut of your movie, M2, which is mostly the same footage as M1 except that you've shortened a few scenes and lengthened others.
Your new cut gets encoded from whatever you've chosen from frame-index numbers and timestamps in the various original raw YUV project files (or ProRes, etc., depending on what your cameras output).
The existence of the original cut should be irrelevant, because what you're not doing is taking the compressed output product of C1 and re-cutting it into C2 - unless, for some catastrophic reason, you have lost or deleted all of the original raw/uncompressed camera files.
If you are generating a new C2 from your original uncompressed (or lightly compressed) files, of course the encoder will make new decisions on where to insert I-frames and P-frames, based on the duration of each camera cut, changes you've made to CGI, and other visual factors in the re-cutting.
> Suppose someone has an uncompressed source S. They compress it with a lossy compressor producing C and distribute C to you. You decompress C producing S'.
> You then compress S' with a lossy compressor (the same type that the original producer used--e.g., if C is an MP3 you use an MP3 compressor) producing C'. I don't know about video, but for audio (at least back in the days when MP3 was starting to get big) C' would be lower quality than C.
All of this is generally a bad idea unless you have no other option but to work with received files that are already compressed. Say, for example, you're working on a documentary about something in a conflict zone and somebody sends you a highly compressed H.264 file recorded on a smartphone. In this case there is no uncompressed original available.
You're going to want to first extract it into an uncompressed YUV raw file on disk so that you can work with it in your standard editor, and then whatever scenes you choose from within it will get re-encoded into your final output.
That's amazing, thanks for sharing. Going in rather uninformed, I expected it to 'look better' by typical smoothing tricks and whatnot...but no. It has -tons- more detail...looking at little things like wrinkles in the socks and shirt and whatnot. Impressive.
What's the patent status? As far as I know, H.265 is patent-encumbered (and not actually supported on a large chunk of Android devices for the same reason), so including this somewhere might invite legal trouble.
I was under the impression most implementations rely on hardware support. Hardware vendors pay the royalties, then software just hands decoding off. AFAIK that's how it works in most browsers, and many modern CPUs/GPUs support hardware H.265 decoding.
HEIC is this plus a kitchen sink of features for digital camera rolls, which aren't useful on the web, all wrapped in a ton of legacy ISO BMFF complexity reaching as far as Classic Mac Quicktime. BPG structure is much more minimal and pragmatic.
However, BPG is unnecessary nowadays. Modern browsers support AVIF, which is HEIF+AV1, so basically HEIC with the painfully patented bit swapped for a freely licensed one.
It seems that way - I read this [1], which does a high-level comparison of the technologies. It does seem like BPG is patent-encumbered by the underlying H.265 (aka HEVC) algorithms [2].
Wow that comparison is really cool. Looking at BPG vs original, you can see some loss of fine detail as the file size cranks down, but you can get an excellent reproduction with an order of magnitude less storage. Seems like BPG is smoothing the image a bit. For example, the space shuttle image has film grain in the sky, but BPG smooths this away.
Significantly better than JPEG at small file sizes... BPG eliminates the obvious color banding and does a much better job preserving detail.
If you set BPG to a large file size, you still get some of the noise in the sky of the Endeavor photo. But this is 1/20 the size of the original image which is incredible.
Why would anyone use such an exotic, unsupported format when there is WebP, AVIF, or JPEG XL, which are all much better supported and better? It's a shame they are not natively supported by most Android camera apps.
This was fascinating. I changed to JPEG Large vs. BPG Small and they are almost indistinguishable. JPEG Small vs. BPG Small wasn't even close (JPEG being much worse).
It does look really good for detailed areas, but when you look at the abandoned building pic BPG can never quite get the smooth sky gradient right, even at Large size.
Bro. Why are you promoting an app that barely does anything (it opens a series of images from a hosting service) so heavily, as a 'pornographic application that changes your life'?
This account exists ONLY to talk about that app. Every one of my projects is isolated from the others by using different accounts, etc. We should all be meticulous about compartmentalizing our online identity.
This app uses an unusual image codec (BPG) which illustrates that if you control the encoder and decoder, you can choose a format with no mainstream support. I think it is a good point.
Do you run into issues being seen as a shill for your app?
I ask this in good faith, because I think a decent chunk of HN readers check a person's post history when they recommend something. I get the idea of compartmentalizing your online identity, but I imagine that comes at a cost in some instances.
> We should all be meticulous about compartmentalizing our online identity.
I disagree and I believe the HN guidelines do as well:
> Throwaway accounts are ok for sensitive information, but please don't create accounts routinely. HN is a community—users should have an identity that others can relate to.
Do you believe the person posting is guilty of "creating accounts routinely"? Or just one for this app and one 'main' account?
I don't see an issue with creating multiple accounts as long as they're not used for abuse, e.g. spam or trolling, or cause technical issues, e.g. creating millions of accounts. I mean if done right you'd never know or care.
I'm usually pretty critical of self-promotion, but in this case they gave a real world example of how they did something related to the article's contents. They didn't even provide a link or an app name, just said that you could find it in their history if you were curious.
I think that's about the best you can do not to promote your product while trying to give real-world examples of things that are being discussed.
Listen friend, I'm using it as a concrete example of a technology in every case.
Today I'm using it as a concrete example of H.265 (BPG) as an astonishingly good alternative to JPEG in applications where you control the encoder and decoder.
> "Suppose you have some strange coin - you've tossed it 10 times, and every time it lands on heads. How would you describe this information to someone? You wouldn't say HHHHHHHHH. You would just say "10 tosses, all heads" - bam! You've just compressed some data! Easy. I saved you hours of mindfuck lectures."
Ahh, my favorite kind of technical walkthrough. Love it
Doesn't that description basically presuppose that you already understand compression? "HHHHHHHHHH" is 10 characters, but "10 tosses, all heads" is 21. Already understanding compression, I of course know that the second part could be efficient if encoded properly, but someone actually needing the explanation probably wouldn't know that.
It's not really wrong, and not really an "oversimplification" as the parent article calls it. It's just a bad example, and it seems confused about the audience it's trying to reach. Either they already understand compression, and it's superfluous, or they don't and it's just confusing.
That was an example intended for humans. It is much quicker to say "10 tosses, all heads" than it is to say "HHHHHHHHHH".
A computer example might explain that it takes less storage to store "10H" (i.e. the integer 10 followed by the character H) than it does to store 10 H characters in a sequence.
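The "10H" idea is run-length encoding in miniature. A toy sketch (illustrative only - real coders pack the counts far more compactly):

```python
def rle_encode(s):
    """Collapse runs of repeated characters into count+character pairs."""
    runs = []
    for ch in s:
        if runs and runs[-1][1] == ch:
            runs[-1][0] += 1       # extend the current run
        else:
            runs.append([1, ch])   # start a new run
    return "".join(f"{count}{ch}" for count, ch in runs)

rle_encode("HHHHHHHHHH")  # ten heads collapse to "10H"
rle_encode("HHTTTH")      # mixed runs become "2H3T1H"
```

Note the trade-off: a string with no repeats ("HTHTHT") actually gets longer under this scheme, which is why real formats only apply run-length coding where runs are expected.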
Ah, but perhaps in this encoding scheme the first H is implicit, and a leading T is encoded as _T. Half the time this results in an infinity increase in bytes saved!
Ah, ASCII art. Long rows of identical characters. The only thing that actually allowed a dial-up modem's compression claims to live up to the marketing hype....
H.264 is magic (2016) - https://news.ycombinator.com/item?id=19997813 - May 2019 (180 comments)
H.264 is Magic – a technical walkthrough - https://news.ycombinator.com/item?id=17101627 - May 2018 (1 comment)
H.264 is Magic - https://news.ycombinator.com/item?id=12871403 - Nov 2016 (219 comments)