Reverse Engineering the Comtech AHA363 PCIe Gzip Accelerator Board (tomverbeure.github.io)
127 points by todsacerdoti on June 22, 2020 | 41 comments



So how do you get a web server to take advantage of a hardware gzip accelerator? Is there like a custom nginx plugin that points to a custom driver? Seems very cool and I've never heard of something like this.


According to their press release, it was an Apache plugin.

https://www.businesswire.com/news/home/20081008005207/en/Com...


They mention APIs for Windows, Linux, and OpenSolaris, a zlib-compatible library, and the Apache module.

http://www.aha.com/DrawProducts.aspx?Action=GetProductDetail...


It was also pretty common in the early days of SSL to have a similar outboard accelerator and webserver plugins to call it.

That still exists for Allwinner CPUs. You can offload the TLS handshake.


That's so cool. Has anyone made a JSON-parsing chip? I could see that being useful in today's servers.


Is JSON parsing still a bottleneck? https://github.com/simdjson/simdjson


mixing SIMD and non-SIMD loads is not recommended.

https://blog.cloudflare.com/on-the-dangers-of-intels-frequen...


this is well-addressed here: https://github.com/simdjson/simdjson/blob/master/doc/perform... .

first off, AVX2 is a subset of SIMD instructions - simdjson includes a large number of implementations, including several which do not use AVX2 at all.

next, only a subset of vector instructions cause serious downclocking (namely the 512-bit-wide AVX-512 ones), which simdjson does not use. throwing out "SIMD" entirely because you read an article about a specific AVX-512 workload that caused down-clocking might be a bit premature, no?
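
fwiw, the pattern simdjson relies on is runtime dispatch: detect the CPU once, pick a kernel, and machines without AVX2 simply run a different implementation. a generic sketch of that pattern in C (not simdjson's actual API; parse_avx2/parse_fallback are made-up placeholder kernels, and __builtin_cpu_supports is GCC/Clang on x86):

    /* Runtime dispatch: pick a SIMD or scalar kernel once, based on what
     * the CPU supports. Kernel names are placeholders, not simdjson APIs. */
    #include <stddef.h>

    static int parse_avx2(const char *buf, size_t len)     { (void)buf; return (int)len; }
    static int parse_fallback(const char *buf, size_t len) { (void)buf; return (int)len; }

    typedef int (*parse_fn)(const char *, size_t);

    int parse(const char *buf, size_t len)
    {
        static parse_fn impl;                 /* resolved on first call */
        if (!impl)
            impl = __builtin_cpu_supports("avx2") ? parse_avx2 : parse_fallback;
        return impl(buf, len);
    }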


Lots of NICs for large supercomputers have special message passing offload engines. But JSON is too ASCII and too random to find a reason to accelerate it.


One way to do it is to expose it as an NVMe block device. Not sure how that's used from userspace; I suspect it's either the right combination of mmap/vmsplice/sendfile to do P2P DMA between the devices in question, or some custom ioctls.

https://www.snia.org/sites/default/files/SDC/2019/presentati...


Maybe hooks into libz somehow?


Oooh, that would work; just provide your own libz that's a wrapper around the hardware. You could even package it up and do a drop-in replacement :)
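
Something like this minimal sketch, assuming the vendor driver exposes a one-shot entry point (hw_gzip_compress() is hypothetical here; a real shim would also need the streaming deflate()/inflate() calls, which is where nginx and Apache's mod_deflate actually spend their time):

    /* Hypothetical drop-in libz shim: intercept zlib's one-shot compress2()
     * and hand the buffer to the accelerator. Build as a shared library and
     * load it with LD_PRELOAD, or install it in place of libz.so. */
    #include <zlib.h>
    #include <stddef.h>

    /* Assumed vendor entry point (illustrative only): returns 0 on success
     * and writes the compressed length back through out_len. */
    extern int hw_gzip_compress(const unsigned char *in, size_t in_len,
                                unsigned char *out, size_t *out_len, int level);

    int compress2(Bytef *dest, uLongf *destLen,
                  const Bytef *source, uLong sourceLen, int level)
    {
        size_t out_len = *destLen;
        if (hw_gzip_compress(source, sourceLen, dest, &out_len, level) == 0) {
            *destLen = (uLongf)out_len;
            return Z_OK;
        }
        return Z_BUF_ERROR;   /* or fall back to the real software zlib */
    }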


I'm curious to know how much faster this is than gzip compression on a modern multicore CPU. The AHA webpage says "Compresses and decompresses at a throughput rate over 5.0 Gbits/sec" (that's about 0.6 GB/s). How fast can you gzip compress on a 16-core Ryzen CPU, for example?


I don't think gzip can use multiple cores, but there is a parallel implementation of gzip called pigz (race condition pun?) [1] which uses a clever trick [2] to avoid losing compression efficiency (sketched in code below):

> The input blocks, while compressed independently, have the last 32K of the previous block loaded as a preset dictionary to preserve the compression effectiveness of deflating in a single thread

[1] https://zlib.net/pigz/

[2] https://zlib.net/pigz/pigz.pdf
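
Per worker thread, the trick boils down to something like this (a rough sketch using zlib's raw-deflate API; error handling omitted, and the real pigz additionally deals with the gzip header/trailer and combining the per-block CRCs):

    /* Compress block N independently, but prime the encoder with the last
     * 32 KiB of block N-1 so matches can reach back across the boundary.
     * Raw deflate (windowBits = -15); assumes out_cap is large enough. */
    #include <zlib.h>
    #include <string.h>

    #define DICT_SIZE 32768

    size_t deflate_block(const unsigned char *prev, size_t prev_len,   /* previous input block */
                         const unsigned char *block, size_t block_len, /* this thread's block  */
                         unsigned char *out, size_t out_cap, int last)
    {
        z_stream zs;
        memset(&zs, 0, sizeof zs);
        deflateInit2(&zs, Z_DEFAULT_COMPRESSION, Z_DEFLATED,
                     -15 /* raw deflate */, 8, Z_DEFAULT_STRATEGY);

        /* The preset dictionary: the tail of the previous block. */
        if (prev && prev_len > 0) {
            size_t d = prev_len < DICT_SIZE ? prev_len : DICT_SIZE;
            deflateSetDictionary(&zs, prev + prev_len - d, (uInt)d);
        }

        zs.next_in   = (Bytef *)block;
        zs.avail_in  = (uInt)block_len;
        zs.next_out  = out;
        zs.avail_out = (uInt)out_cap;
        /* Flush so the output can be concatenated with the other blocks;
         * only the final block ends the deflate stream. */
        deflate(&zs, last ? Z_FINISH : Z_FULL_FLUSH);

        size_t produced = zs.total_out;
        deflateEnd(&zs);
        return produced;
    }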


According to https://rachaellappan.github.io/pigz/ pigz did about 360 MB/s on a 96-core machine, though that was 3 years ago.


I regularly get 80-200 MB/s on my 4-core/8-thread 7700HQ, though it is likely limited by disk speed.


Looking at the puny heatsink on this card, I call that pretty impressive performance/watt.

(Note that the heatsink is on the FPGA, not the actual gzip accelerator chip.)


Wow. I know very little about FPGAs and hardware and this is pure magic to me. Debugging these things must be a royal pain.


Yes, it is :)

The normal development flow for FPGAs is similar to ASICs in that people focus on testing as many elements as possible in software simulation, to a very high level of coverage, before even loading the design onto the FPGA. Once it's on there, you're reliant on JTAG (a serial debug bus) to read out values from the target device.

Tools like ChipScope can let you see what's going on and set ""breakpoints"". http://web.mit.edu/6.111/www/labkit/chipscope.shtml

All of this is much harder when it's on a board you didn't design!


Nice development board for $20, even if it takes some work. Bought one!


It's too late now since you've already bought one, but right now, it's still a doorstop that can't be used for anything: I haven't even been able to get one of the LEDs blinking, and that's only the Hello World of the FPGA hobbyist.


No worries. As the saying goes "this isn't my first rodeo". I'll let you know what I discover.


Get in touch!

One thing that didn't make it into the blog post was that a strategically soldered wire shortened the JTAG chain to only include the 2 Intel chips and bypass the AHA chips.

The power to the AHA chips gets cut off at some point, breaking the JTAG chain, but the FPGAs stay active.

So by bypassing, you keep the chain alive.


It is interesting to see how the prices have come down. While just the FPGA chip on this board was originally $1K, a new DE10 board is ~$130 with slightly more logic units at a higher clock, plus a dual core ARM A9, peripherals, etc.


. o O (And the MISTer project for lots of great retro systems/consoles/games...)


Are there any free drivers that work with the native hardware? There are ones on their site... but behind a registration wall.

It might be pretty nifty to run swap through this...


I'm super excited about what will come in this direction after the PS5 announcement and Microsoft mentioning DirectStorage.

The PS5 has an I/O chip and extra architecture for decompressing textures (hence the relevance of this gzip accelerator), among other features.

Mark Cerny said they needed this chip because, in terms of CPU resource usage, decompression would otherwise eat all the CPU cores. So NVMe is now so fast that an I/O coprocessor is feasible again.


PC SSD drives started out with built-in compression. This is why most test software has separate "incompressible data" graphs. It's usually something you actually don't want, akin to fake streamer-tape capacities/speeds that assume 2x compression.

> IO chip and extra architecture for decompressing textures

You don't want uncompressed textures in your GPU memory, and compressing already-DXT-compressed textures is not ideal, to say the least; you can count on a 30-50% compression ratio, not the marketing 2x peak number Sony was throwing around.

> would use all CPU cores

That's marketing exaggeration; LZ4 decompression can achieve ~3 GB/s per core. Here's how it stands today: https://www.jonolick.com/home/oodle-and-ue4-loading-time
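
Easy enough to sanity-check on a single core; a rough benchmark sketch with liblz4 (build with something like cc bench.c -llz4; the all-'A' input is a placeholder, so substitute representative data before trusting the number):

    /* Rough single-core LZ4 decode throughput check (liblz4, one thread). */
    #include <lz4.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <time.h>

    int main(void)
    {
        const size_t in_len = 64 * 1024 * 1024;     /* 64 MiB test buffer */
        char *input = malloc(in_len);
        memset(input, 'A', in_len);                 /* placeholder data   */

        int bound = LZ4_compressBound((int)in_len);
        char *comp = malloc(bound);
        int comp_len = LZ4_compress_default(input, comp, (int)in_len, bound);

        char *out = malloc(in_len);
        struct timespec t0, t1;
        int iters = 20;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int i = 0; i < iters; i++)
            LZ4_decompress_safe(comp, out, comp_len, (int)in_len);
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
        printf("%.2f GB/s decompressed\n", (double)in_len * iters / secs / 1e9);
        free(input); free(comp); free(out);
        return 0;
    }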


I have seen issues on my machines that I'd put squarely in the category of "NVMe got really fast really quickly, something in between hasn't kept up, and we need to do something."

My feeling is that what Sony is doing now goes in that direction.

I'm highly curious and hopeful.

I will read the blog you posted; it looks interesting. Let's see what arrives in the industry at the end of the day.


When I look at his data, it's capped: the additional PC transfer speed above ~512 MB/s buys little, and hitting that limit so soon does show that something is missing.


Diminishing returns. The cap is most likely caused by the engine's fixed costs; at some point you can't use more speed, no matter what, without a rewrite (initialization, deserialization, etc.).


DMA units are back, baby!


Revisiting LTO tape.


Why would they design it to require two ASICs AND an FPGA? Couldn't they just have built the FPGA program into the ASICs?


Making an ASIC with pedestrian low-speed I/Os is easy. Making an ASIC with the high-speed SERDES I/Os required for PCIe is hard.

Also, AHA has a follow-up product that doubles the performance by putting 4 ASICs and 1 FPGA on a board instead of the 2 ASICs here. So modularity is a factor as well.

But I think the first point is very likely the reason.


pure speculation:

the FPGA does the PCIe and all necessary data marshaling, which allows a lot of flexibility for updates and bug fixing on the most finicky parts of hardware

two ASICs because once you have one manufactured, most of the costs are sunk and per unit it's cheap to stick a second on the board


You're probably right. The FPGA is probably the most expensive part of this board by far, and maybe they figured the FPGA and the PCIe bus can handle enough traffic to keep two compression chips busy.


The FPGA is also potentially insurance for bugs on the ASIC that aren't economical to fix. Catch the bug and fix it on the way out of the card.


FPGAs are cheap compared to the initial run costs for comparable custom ASICs, though.


Maybe they designed the ASICs for a PCI board and they just have a memory-mapped interface.


Or vice versa...



