So how do you get a web server to take advantage of a hardware gzip accelerator? Is there like a custom nginx plugin that points to a custom driver? Seems very cool and I've never heard of something like this.
first off, AVX2 is a subset of SIMD instructions - simdjson includes a large number of implementations, including several which do not use AVX2 at all.
next, only a subset of vector instructions cause heavy downclocking (namely the 512-bit-wide AVX-512 ones; AVX2 is 256 bits wide, and simdjson does not use AVX-512). throwing out the use of "SIMD" entirely because you read an article about a specific AVX-512 instruction which caused down-clocking might be a bit premature, no?
Lots of NICs for large supercomputers have special message passing offload engines. But JSON is too ASCII and too random to find a reason to accelerate it.
One way to do it is to expose it as an nvme block device. Not sure how that's used from userspace, I suspect it's either the right combination of mmap/vmsplice/sendfile from userspace to do p2pdma between the devices in question or some custom ioctls.
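Setting aside the p2pdma part (which needs driver support), the plain zero-copy half of that combination can be sketched with sendfile(2). This is a generic illustration with made-up file names, not the actual interface of any accelerator driver; ordinary files stand in for the devices:

```python
import os
import tempfile

def zero_copy(in_path, out_path):
    """Stream bytes from one fd to another without pulling them through a
    userspace buffer (Linux sendfile; file-to-file works on Linux >= 2.6.33)."""
    with open(in_path, "rb") as src, open(out_path, "wb") as dst:
        size = os.fstat(src.fileno()).st_size
        offset = 0
        while offset < size:
            sent = os.sendfile(dst.fileno(), src.fileno(), offset, size - offset)
            if sent == 0:  # unexpected EOF
                break
            offset += sent
    return offset

# Tiny demo: hypothetical paths, plain files standing in for devices.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"x" * 100_000)
    src_name = f.name
dst_name = src_name + ".copy"
copied = zero_copy(src_name, dst_name)
```

The p2pdma variant would additionally require the kernel to route the DMA directly between the two PCIe devices, which is exactly the part that tends to need custom ioctls.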
I'm curious to know how much faster than gzip compression on a modern multicore CPU this is. The AHA webpage says "Compresses and decompresses at a throughput rate over 5.0 Gbits/sec" (that's about 0.6 GB/s). How fast can you gzip compress on a 16-core Ryzen CPU, for example?
I don't think gzip itself can use multiple cores, but there is a parallel implementation of gzip called pigz (race condition pun?) [1] which uses a clever trick [2] to avoid reducing compression efficiency:
> The input blocks, while compressed independently, have the last 32K of the previous block loaded as a preset dictionary to preserve the compression effectiveness of deflating in a single thread
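That trick is easy to demonstrate with Python's zlib, which exposes the same preset-dictionary mechanism. This is a rough sketch of the idea (zlib framing per block rather than pigz's single gzip stream, and sequential rather than threaded), not pigz's actual code:

```python
import zlib

DICT = 32 * 1024  # pigz seeds each block with the last 32 KiB of the previous one

def compress_blocks(data, block_size=128 * 1024):
    """Compress each block independently (so they could run on separate
    cores), but preset each block's dictionary with the previous tail."""
    out, tail = [], b""
    for i in range(0, len(data), block_size):
        block = data[i:i + block_size]
        c = zlib.compressobj(level=9, zdict=tail) if tail else zlib.compressobj(level=9)
        out.append(c.compress(block) + c.flush())
        tail = block[-DICT:]
    return out

def decompress_blocks(chunks):
    parts, tail = [], b""
    for ch in chunks:
        d = zlib.decompressobj(zdict=tail) if tail else zlib.decompressobj()
        block = d.decompress(ch) + d.flush()
        parts.append(block)
        tail = block[-DICT:]
    return b"".join(parts)

data = b"the quick brown fox jumps over the lazy dog " * 5000
chunks = compress_blocks(data, block_size=64 * 1024)
restored = decompress_blocks(chunks)
```

Because each block only depends on the previous block's last 32 KiB, the compression stages can run in parallel once that tail is available, while back-references across the block boundary keep the ratio close to a single-threaded deflate.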
The normal development flow for FPGA software is similar to ASIC in that people focus on testing as many elements as possible in software simulation to a very high level of coverage before even downloading to the FPGA. Once in there, you're reliant on JTAG (a relatively slow serial debug interface) to read out values from the target device.
It's too late now since you've already bought one, but right now it's still a doorstop that can't be used for anything: I haven't even been able to get one of the LEDs blinking, and that's only the Hello World of the FPGA hobbyist.
One thing that didn't make it into the blog post was that a strategically soldered wire shortened the JTAG chain to include only the 2 Intel chips and bypass the AHA chips.
The power to the AHA chips gets cut off at some point, breaking the JTAG chain, but the FPGAs stay active.
It is interesting to see how the prices have come down. While just the FPGA chip on this board was originally $1K, a new DE10 board is ~$130 with slightly more logic units at a higher clock, plus a dual core ARM A9, peripherals, etc.
I'm super excited about what will come in this direction after the PS5 announcement and Microsoft mentioning DirectStorage.
The PS5 has an I/O chip and extra architecture for decompressing textures (hence the relevance of this gzip accelerator) and more features.
Mark Cerny said they needed this chip because doing the decompression on the CPU would use up all the cores. So NVMe drives are now fast enough that an I/O coprocessor is feasible again.
PC SSD drives started out with built-in compression. This is why most test software has separate "incompressible data" graphs. It's usually something you actually don't want, akin to fake streamer tape capacities/speeds that assume 2x compression.
> IO chip and extra architecture for decompressing textures
you don't want uncompressed textures in your GPU memory, and compressing already-DXT-compressed textures is not ideal to say the least; you can count on a 30-50% compression ratio, not the marketing 2x peak number Sony was throwing around
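The underlying effect is easy to show: already-compressed data is close to maximum entropy, so a second compression pass barely shrinks it. Here random bytes stand in for DXT-compressed texture data (an approximation, not real texture content):

```python
import os
import zlib

# Random bytes approximate already-compressed data (DXT textures, gzip
# streams): near-maximum entropy, so deflate cannot shrink them further.
random_like = os.urandom(256 * 1024)
plain = b"row of identical texels " * 10000

ratio_random = len(zlib.compress(random_like)) / len(random_like)  # ~1.0
ratio_plain = len(zlib.compress(plain)) / len(plain)               # far below 0.1
```

This is also why SSD benchmarks carry separate "incompressible data" graphs: a controller that compresses transparently looks great on repetitive test patterns and ordinary on realistic data.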
I have seen issues on my machines that I'd chalk up to "NVMe got really fast really quickly, something in between hasn't kept up, and we need to do something about it".
My feeling is that what Sony is doing now goes in that direction.
I'm highly curious and hopeful.
I will read the blog you posted, it looks interesting. Let's see what arrives in the industry at the end of the day.
When I look at his data, it's capped: the PC has transfer speeds of well over 512 MB/s but hits the speed limit much sooner, which does show that something is missing.
Diminishing returns. The cap is most likely caused by the game engine's fixed costs; at some point you can't use more speed no matter what without a rewrite (initialization, deserialization, etc.).
Making an ASIC with pedestrian low speed IOs is easy. Making an ASIC with high speed SERDES IOs that are required for PCIe is hard.
Also, AHA has a followup product that doubles the performance by dropping 4 ASICs and 1 FPGA on a board instead of 2 ASICs. So modularity is a factor as well.
But I think the first point is very likely the reason.
the FPGA does the PCIe and all necessary data marshaling, which allows a lot of flexibility for updates and bug fixing on the most finicky parts of hardware
two ASICs because once you have one manufactured, most of the costs are sunk and per unit it's cheap to stick a second on the board
You're probably right. The FPGA is probably the most expensive part of this board by far, and maybe they figured the FPGA and the PCIe bus can handle enough traffic to keep two compression chips busy.