Reverse Engineering the Comtech AHA363 PCIe Gzip Accelerator Board (tomverbeure.github.io)
127 points by todsacerdoti on June 22, 2020 | 41 comments



So how do you get a web server to take advantage of a hardware gzip accelerator? Is there like a custom nginx plugin that points to a custom driver? Seems very cool and I've never heard of something like this.


According to their press release, it was an Apache plugin.

https://www.businesswire.com/news/home/20081008005207/en/Com...


They mention APIs for Windows, Linux, and OpenSolaris, a zlib-compatible library, and the Apache module.

http://www.aha.com/DrawProducts.aspx?Action=GetProductDetail...


It was also pretty common in the early days of SSL to have a similar outboard accelerator and webserver plugins to call it.

That still exists for Allwinner CPUs. You can offload the TLS handshake.


That's so cool. Has anyone made a JSON-parsing chip? I could see that being useful in today's servers.


Is JSON parsing still a bottleneck? https://github.com/simdjson/simdjson


mixing SIMD and non-SIMD loads is not recommended.

https://blog.cloudflare.com/on-the-dangers-of-intels-frequen...


this is well-addressed here: https://github.com/simdjson/simdjson/blob/master/doc/perform... .

first off, AVX2 is a subset of SIMD instructions - simdjson includes a large number of implementations, including several which do not use AVX2 at all.

next, only a subset of vector instructions cause serious downclocking (namely the 512-bit-wide AVX-512 ones), which simdjson does not use. throwing out "SIMD" entirely because you read an article about a specific AVX-512 workload that caused down-clocking might be a bit premature, no?
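
fwiw, the pattern simdjson relies on is runtime dispatch: detect the CPU once, pick a kernel, and machines without AVX2 simply run a different implementation. a generic sketch of that pattern in C (not simdjson's actual API; parse_avx2/parse_fallback are made-up placeholder kernels, and __builtin_cpu_supports is GCC/Clang on x86):

    /* Runtime dispatch: pick a SIMD or scalar kernel once, based on what
     * the CPU supports. Kernel names are placeholders, not simdjson APIs. */
    #include <stddef.h>

    static int parse_avx2(const char *buf, size_t len)     { (void)buf; return (int)len; }
    static int parse_fallback(const char *buf, size_t len) { (void)buf; return (int)len; }

    typedef int (*parse_fn)(const char *, size_t);

    int parse(const char *buf, size_t len)
    {
        static parse_fn impl;                 /* resolved on first call */
        if (!impl)
            impl = __builtin_cpu_supports("avx2") ? parse_avx2 : parse_fallback;
        return impl(buf, len);
    }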


Lots of NICs for large supercomputers have special message passing offload engines. But JSON is too ASCII and too random to find a reason to accelerate it.


One way to do it is to expose it as an NVMe block device. Not sure how that's used from userspace; I suspect it's either the right combination of mmap/vmsplice/sendfile to do P2P DMA between the devices in question, or some custom ioctls.

https://www.snia.org/sites/default/files/SDC/2019/presentati...


Maybe hooks into libz somehow?


Oooh, that would work; just provide your own libz that's a wrapper around the hardware. You could even package it up and do a drop-in replacement :)
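
Something like this minimal sketch, assuming the vendor driver exposes a one-shot entry point (hw_gzip_compress() is hypothetical here; a real shim would also need the streaming deflate()/inflate() calls, which is where nginx and Apache's mod_deflate actually spend their time):

    /* Hypothetical drop-in libz shim: intercept zlib's one-shot compress2()
     * and hand the buffer to the accelerator. Build as a shared library and
     * load it with LD_PRELOAD, or install it in place of libz.so. */
    #include <zlib.h>
    #include <stddef.h>

    /* Assumed vendor entry point (illustrative only): returns 0 on success
     * and writes the compressed length back through out_len. */
    extern int hw_gzip_compress(const unsigned char *in, size_t in_len,
                                unsigned char *out, size_t *out_len, int level);

    int compress2(Bytef *dest, uLongf *destLen,
                  const Bytef *source, uLong sourceLen, int level)
    {
        size_t out_len = *destLen;
        if (hw_gzip_compress(source, sourceLen, dest, &out_len, level) == 0) {
            *destLen = (uLongf)out_len;
            return Z_OK;
        }
        return Z_BUF_ERROR;   /* or fall back to the real software zlib */
    }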


I'm curious to know how much faster this is than gzip compression on a modern multicore CPU. The AHA webpage says "Compresses and decompresses at a throughput rate over 5.0 Gbits/sec" (that's about 0.6 GB/s). How fast can you gzip compress on a 16-core Ryzen CPU, for example?


I don't think gzip can use multiple cores, but there is a parallel implementation of gzip called pigz (race condition pun?) [1] which uses a clever trick [2] to avoid losing compression efficiency (sketched in code below):

> The input blocks, while compressed independently, have the last 32K of the previous block loaded as a preset dictionary to preserve the compression effectiveness of deflating in a single thread

[1] https://zlib.net/pigz/

[2] https://zlib.net/pigz/pigz.pdf
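
Per worker thread, the trick boils down to something like this (a rough sketch using zlib's raw-deflate API; error handling omitted, and the real pigz additionally deals with the gzip header/trailer and combining the per-block CRCs):

    /* Compress block N independently, but prime the encoder with the last
     * 32 KiB of block N-1 so matches can reach back across the boundary.
     * Raw deflate (windowBits = -15); assumes out_cap is large enough. */
    #include <zlib.h>
    #include <string.h>

    #define DICT_SIZE 32768

    size_t deflate_block(const unsigned char *prev, size_t prev_len,   /* previous input block */
                         const unsigned char *block, size_t block_len, /* this thread's block  */
                         unsigned char *out, size_t out_cap, int last)
    {
        z_stream zs;
        memset(&zs, 0, sizeof zs);
        deflateInit2(&zs, Z_DEFAULT_COMPRESSION, Z_DEFLATED,
                     -15 /* raw deflate */, 8, Z_DEFAULT_STRATEGY);

        /* The preset dictionary: the tail of the previous block. */
        if (prev && prev_len > 0) {
            size_t d = prev_len < DICT_SIZE ? prev_len : DICT_SIZE;
            deflateSetDictionary(&zs, prev + prev_len - d, (uInt)d);
        }

        zs.next_in   = (Bytef *)block;
        zs.avail_in  = (uInt)block_len;
        zs.next_out  = out;
        zs.avail_out = (uInt)out_cap;
        /* Flush so the output can be concatenated with the other blocks;
         * only the final block ends the deflate stream. */
        deflate(&zs, last ? Z_FINISH : Z_FULL_FLUSH);

        size_t produced = zs.total_out;
        deflateEnd(&zs);
        return produced;
    }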


According to https://rachaellappan.github.io/pigz/ pigz did about 360 MB/s on a 96-core machine, though that was 3 years ago.


I regularly get 80-200 MB/s on my 4-core/8-thread 7700HQ, though it is likely limited by disk speed.


Looking at the puny heatsink on this card, I call that pretty impressive performance/watt.

(Note that the heatsink is on the FPGA, not the actual gzip accelerator chip.)


Wow. I know very little about FPGAs and hardware and this is pure magic to me. Debugging these things must be a royal pain.


Yes, it is :)

The normal development flow for FPGAs is similar to ASICs in that people focus on testing as many elements as possible in software simulation, to a very high level of coverage, before even loading the design onto the FPGA. Once it's on there, you're reliant on JTAG (a serial debug bus) to read out values from the target device.

Tools like ChipScope can let you see what's going on and set ""breakpoints"". http://web.mit.edu/6.111/www/labkit/chipscope.shtml

All of this is much harder when it's on a board you didn't design!


Nice development board for $20, even if it takes some work. Bought one!


It's too late now since you've already bought one, but right now, it's still a doorstop that can't be used for anything: I haven't even been able to get one of the LEDs blinking, and that's only the Hello World of the FPGA hobbyist.


No worries. As the saying goes "this isn't my first rodeo". I'll let you know what I discover.


Get in touch!

One thing that didn't make it into the blog post was that a strategically soldered wire shortened the JTAG chain to only include the 2 Intel chips and bypass the AHA chips.

The power to the AHA chips gets cut off at some point, breaking the JTAG chain, but the FPGAs stay active.

So by bypassing, you keep the chain alive.


It is interesting to see how the prices have come down. While just the FPGA chip on this board was originally $1K, a new DE10 board is ~$130 with slightly more logic units at a higher clock, plus a dual core ARM A9, peripherals, etc.


. o O (And the MISTer project for lots of great retro systems/consoles/games...)


Are there any free drivers that work with the native hardware? There are ones on their site... but behind a registration wall.

It might be pretty nifty to run swap through this...


I'm super excited about what will come in this direction after the PS5 announcement and Microsoft mentioning DirectStorage.

The PS5 has an I/O chip and extra architecture for decompressing textures (hence the relevance of this gzip accelerator), among other features.

Mark Cerny said they needed this chip because, in terms of CPU resource usage, decompression would otherwise eat all the CPU cores. So NVMe is now so fast that an I/O coprocessor is feasible again.


PC SSD drives started out with built-in compression. This is why most test software has separate "incompressible data" graphs. It's usually something you actually don't want, akin to fake streamer-tape capacities/speeds that assume 2x compression.

> IO chip and extra architecture for decompressing textures

You don't want uncompressed textures in your GPU memory, and compressing already-DXT-compressed textures is not ideal, to say the least; you can count on a 30-50% compression ratio, not the marketing 2x peak number Sony was throwing around.

> would use all CPU cores

That's marketing exaggeration; LZ4 decompression can achieve ~3 GB/s per core. Here's how it stands today: https://www.jonolick.com/home/oodle-and-ue4-loading-time
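
Easy enough to sanity-check on a single core; a rough benchmark sketch with liblz4 (build with something like cc bench.c -llz4; the all-'A' input is a placeholder, so substitute representative data before trusting the number):

    /* Rough single-core LZ4 decode throughput check (liblz4, one thread). */
    #include <lz4.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <time.h>

    int main(void)
    {
        const size_t in_len = 64 * 1024 * 1024;     /* 64 MiB test buffer */
        char *input = malloc(in_len);
        memset(input, 'A', in_len);                 /* placeholder data   */

        int bound = LZ4_compressBound((int)in_len);
        char *comp = malloc(bound);
        int comp_len = LZ4_compress_default(input, comp, (int)in_len, bound);

        char *out = malloc(in_len);
        struct timespec t0, t1;
        int iters = 20;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int i = 0; i < iters; i++)
            LZ4_decompress_safe(comp, out, comp_len, (int)in_len);
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
        printf("%.2f GB/s decompressed\n", (double)in_len * iters / secs / 1e9);
        free(input); free(comp); free(out);
        return 0;
    }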


I have seen issues on my machines that I'd put squarely in the category of "NVMe got really fast really quickly, something in between hasn't kept up, and we need to do something."

My feeling is that what Sony is doing now goes in that direction.

I'm highly curious and hopeful.

I will read the blog you posted; it looks interesting. Let's see what arrives in the industry at the end of the day.


When I look at his data, it's capped: the additional PC transfer speed above ~512 MB/s buys little, and hitting that limit so soon does show that something is missing.


Diminishing returns. The cap is most likely caused by the engine's fixed costs; at some point you can't use more speed, no matter what, without a rewrite (initialization, deserialization, etc.).


DMA units are back, baby!


Revisiting LTO tape.


Why would they design it to require two ASICs AND an FPGA? Couldn't they just have built the FPGA program into the ASICs?


Making an ASIC with pedestrian low-speed I/Os is easy. Making an ASIC with the high-speed SERDES I/Os required for PCIe is hard.

Also, AHA has a follow-up product that doubles the performance by putting 4 ASICs and 1 FPGA on a board instead of the 2 ASICs here. So modularity is a factor as well.

But I think the first point is very likely the reason.


pure speculation:

the FPGA does the PCIe and all necessary data marshaling, which allows a lot of flexibility for updates and bug fixing on the most finicky parts of hardware

two ASICs because once you have one manufactured, most of the costs are sunk and per unit it's cheap to stick a second on the board


You're probably right. The FPGA is probably the most expensive part of this board by far, and maybe they figured the FPGA and the PCIe bus can handle enough traffic to keep two compression chips busy.


The FPGA is also potentially insurance for bugs on the ASIC that aren't economical to fix. Catch the bug and fix it on the way out of the card.


FPGAs are cheap compared to the initial run costs for comparable custom ASICs, though.


Maybe they designed the ASICs for a PCI board and they just have a memory-mapped interface.


Or vice versa...



