Xilinx-Samsung SmartSSD Computational Storage Drive Launched (servethehome.com)
116 points by blopeur on Nov 11, 2020 | 84 comments



The core concept that you ship computation to data rather than the other way around is what made Google so impressive when it launched. There are lots of algorithms that do well in that model. Back when I was at NetApp I did a design of a system where the "smart storage" essentially labeled blocks with an MD5 hash when you went to store them. That allowed you to rapidly determine if you already had the block stored and could safely toss the one being written[1]. Really fast de-duplication and good storage compression.

At Blekko they had taken this concept to the next logical step and built a storage array out of triply replicated blocks (called 'buckets') that were distributed by their hashid. You could then write templated Perl code that operated in parallel over hundreds (or thousands) of buckets and returned a composite result. It always surprised me that IBM didn't care about that system when they acquired Blekko; it was pretty cool. If you implemented it in these Samsung drives it would make for a killer data science appliance. That design almost writes itself.

Also in the storage space, there was the CMU "Active disk" architecture[2] which was supposed to replace RAID. There was a startup spin-off from this work but I cannot recall its name anymore, sigh.

These days it would be useful to design a simulator for systems like this and derive a calculus for analyzing their performance with respect to other architectures. There's probably a master's thesis and maybe a PhD or two in that work.

[1] Yes MD5 hash collisions are a thing but not for identical length documents (aka an 8K block), and yes NetApp got a patent issued for it.

[2] https://www.pdl.cmu.edu/PDL-FTP/Active/ActiveDisksBerkeley98...


Wasn't this the aim of Hadoop as well?

Also, to be fair, IBM wasn't able to do much with any of the companies they acquired in the same timeframe as Blekko. I was working at IBM at that time and witnessed this first hand.


I kind of gathered that IBM had this sort of love/hate relationship with acquisitions. They loved the "new ideas" but many of the "old guard" hated the idea that something from outside of IBM was novel and worth pursuing. I watched first hand as things that had been done by an acquired company were buried while the exact same kind of thing done by IBM Research was given lots of funding. I understood not having two efforts, but suggested they be blended to get the best of both teams; alas, that was shot down.

At one time IBM Research had a pretty awesome storage group; perhaps they will have a computational storage fabric offering at some point.


I know exactly the IBM Research group you are referring to but won't name names to avoid embarrassment :-) Politics in large organizations can be pretty nasty.


The way HBase/HDFS does this is perfectly backwards and counterproductive. You don't want to put a program and its data on one local disk, because that one local disk has almost no resources. The later-period thinking at Google is the opposite of shipping the program to the data. Instead, you allocate the data randomly over as many spindles as you can find, then when you read it all back out you very briefly use all of those spindles at once. Really, it's something to behold in action.
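A toy sketch of that placement strategy (plain Python, nothing Google-specific; the chunk and disk counts below are invented):

    import random

    # Scatter a file's chunks uniformly at random across many disks, so a later
    # read can pull from all of those spindles at once instead of one local disk.
    def scatter(num_chunks, num_disks):
        return {chunk: random.randrange(num_disks) for chunk in range(num_chunks)}

    def readback_plan(placement):
        """Group chunk reads by disk; every disk gets hit in parallel."""
        by_disk = {}
        for chunk, disk in placement.items():
            by_disk.setdefault(disk, []).append(chunk)
        return by_disk

    # ~1000 chunks spread over ~200 disks: each disk serves only a handful of reads.
    plan = readback_plan(scatter(num_chunks=1000, num_disks=200))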

That's batch processing: mapreduce and flume and whatnot. Search is still very much an exercise in getting the queries out to where the index shards live.


The secret sauce for that is networking. Once you have a domain with full Clos connectivity and very low latencies, you can start disaggregating storage from compute again.


Assume you have a 1GB file. According to the Colossus paper, that gets split into 1MB parts, so 1000 of them. And they probably use erasure coding for replication. How does that work? Do they erasure-code each of the 1000 parts separately, or?


According to these slides, each stripe is a separate replication group. Although the figure depicts replicated rather than erasure coded files, I think it's safe to imagine that the idea holds for both.

http://www.pdsw.org/pdsw-discs17/slides/PDSW-DISCS-Google-Ke...
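To make that concrete, here is a toy sketch of per-stripe encoding, using XOR parity as a stand-in for real Reed-Solomon codes (stripe size and chunk counts are assumptions carried over from the question above):

    STRIPE = 1 << 20   # 1MB stripes, as in the parent question

    def xor_parity(chunks):
        """Toy stand-in for erasure coding: one parity chunk per stripe."""
        parity = bytearray(len(chunks[0]))
        for chunk in chunks:
            for i, b in enumerate(chunk):
                parity[i] ^= b
        return bytes(parity)

    def encode_file(data, k=4):
        """Each 1MB stripe is its own group, erasure-coded independently."""
        for off in range(0, len(data), STRIPE):
            stripe = data[off:off + STRIPE]
            size = -(-len(stripe) // k)                        # ceil division
            chunks = [stripe[i * size:(i + 1) * size].ljust(size, b"\0")
                      for i in range(k)]
            yield chunks + [xor_parity(chunks)]                # k data + 1 parity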


Did you mean Hadoop/HDFS? HBase works pretty analogously to Bigtable (though, having run HBase in production at $dayjob-1, I'd take the latter any day).


HBase doesn't work analogously to bigtable. HBase intentionally tries to move region files onto the same node as the region server. That is ass-backwards. HBase devs think of this as "good locality" but all you're really getting from it is terrible performance.


> "smart storage" essentially labeled blocks with an MD5 hash when you went to store them

Asking out of curiosity - isn't this similar to what Venti (from Plan 9) did? Of course, Venti was content-addressed, and in this case I'm guessing this system sat above WAFL (which is definitely not content-addressed).

* http://doc.cat-v.org/plan_9/4th_edition/papers/venti/


Many of the same pieces solving a slightly different problem. In the NetApp case it was pushing the edge of de-duplication for the efficient-archival-storage problem. EMC had discovered that MD5 hash collisions made deduping at the document level dangerous (you could think you had a document when you didn't). Those collisions came from dissimilar-sized documents, and indeed you could "attack" MD5 signatures that way. With a fixed document size, the probabilities went back to the actual MD5 collision probabilities, which were acceptably small. On the "fast" part of the archival server, instead of storing 8K blocks you could store 16-byte "block identifiers" (while still using all of the standard WAFL file system layout; it thinks it is storing 16-byte blocks). Those could be stored in "fast" storage (think SSD) and the actual data on slow "cold" storage. Your back-end server does do "content addressable" kinds of things in terms of hash->block location services, but it is simple, with only three operations: "does block <x> exist?", "give me block <x>", and "store this block as <x>."
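A toy sketch of that backend and its three operations, plus the dedup-friendly write path (all names here are invented for illustration, not NetApp's):

    import hashlib

    class ColdStore:
        """Hypothetical hash->block service with just the three operations above."""
        def __init__(self):
            self._blocks = {}                                 # 16-byte MD5 id -> 8K block

        def exists(self, bid): return bid in self._blocks     # "does block <x> exist?"
        def get(self, bid):    return self._blocks[bid]       # "give me block <x>"
        def put(self, bid, data): self._blocks.setdefault(bid, data)  # "store this block as <x>"

    def write_block(cold, block):
        """The filer keeps only this 16-byte id; duplicate blocks are never stored twice."""
        bid = hashlib.md5(block).digest()
        if not cold.exists(bid):
            cold.put(bid, block)
        return bid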

For a write-mostly, fabric-attached archival device it had some benefits over the SATA-based filers (higher density, lower watts/terabyte, less CPU load on the filer head since the work was spread out to the storage retrieval unit, which could have many of the hash->block translators, etc.). I don't believe NetApp ever built a complete one though. Just "too many degrees off their bow," as an engineer I knew would say.


Thank you, this is fascinating to read! This is a piece of NTAP history I was completely unaware of :)


Worth mentioning that this model is reversing, with Dremel/BigQuery/Pulsar etc. looking to disaggregate compute from storage again.


Panasas?


Yes, man I could not come up with that.


I think putting something like SQLite on the actual storage device could be a super efficient way to directly express your intent to the actual durable storage system and bypass mountains of virtual bullshit.

The optimization opportunities are pretty obvious to me. Imagine if SQLite journaling was aware of how long the supercapacitor in the SSD would last, potentially even with real-time monitoring of device variables. You could have your entire WAL sitting in DRAM on the drive as long as it has enough stored energy to flush to NAND upon external power loss.
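A back-of-the-envelope sketch of that constraint (every device number below is a made-up assumption, just to show the shape of the calculation):

    # How much WAL can safely live in on-drive DRAM? Only as much as the
    # supercapacitor can flush to NAND after external power is lost.
    holdup_energy_j = 2.0     # hypothetical energy stored in the supercap (joules)
    flush_power_w   = 4.0     # hypothetical controller + NAND power while flushing
    flush_bw_mb_s   = 800.0   # hypothetical sustained DRAM -> NAND flush bandwidth

    holdup_time_s = holdup_energy_j / flush_power_w     # ~0.5 seconds of hold-up
    wal_budget_mb = holdup_time_s * flush_bw_mb_s       # ~400 MB of WAL that can sit in DRAM
    print(f"WAL may occupy up to ~{wal_budget_mb:.0f} MB of device DRAM")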


In one of his talks Richard Hipp actually mentioned off-hand that some storage devices allegedly use SQLite under the hood as their storage controller, with the firmware bridging the gap.

(I can't find the video, but his slides are here if you want to go searching - http://sqlite.org/talks/index.html )


I would pay good money to be able to talk directly to their SQLite, without any abstraction layer inbetween, to get some extra performance/lower latency - and I don't think I'm alone.

In fact, even if it wasn't SQLite but something non standard, I'd be interested in learning it and trying to make it work with my needs.


It shouldn't be too hard to get a SQLite-ish system working on the "smart" part of the SSD. What would be the use cases that you have in mind?


All great, then you need to port it to something other than that one specific SSD and it's too much work.

We have abstraction boundaries for a reason, we give up a small % of performance and in return we can write code once (say SQLite) and use it in many scenarios. For something like SQLite it means it's been around for a long time and had a lot of optimization work done on it (and that probably outweighs the few % gain you might get from such tight integration).

You'd probably get a bigger performance gain from just not using SQL (eg. a DBM).


There are really only two FPGA vendors that can compete in this space, and Xilinx is the one that's clearly ahead for computational storage applications. They already provide a platform for storage accelerator IP to be shared between this Smart SSD and their pre-existing PCIe FPGA cards that connect to standard SSDs over PCIe or networks. So porting to another accelerator platform probably isn't as big an issue as you expect.

The bigger challenge I see for implementing something like SQLite on a SmartSSD is that you really don't want your database to exist on just one drive, so you need to figure out how to do HA across multiple SSDs while still offloading most of the computationally expensive database operations to the FPGAs instead of leaving it on the CPU. I think this will condemn SmartSSDs to always working at a slightly lower abstraction layer than what the application really wants.


Thinking about a theoretical SQLite database running on an SSD like this as a HA database for server-based systems is a poor design and a mistake. It would be a poor design even without the FPGA; SQLite simply isn't a HA replicated database.

Something like SQLite might make a decent alternative API to the flash storage layer of the SSD, though. Imagine if the storage controller of your SSD exposed a built-in "filesystem" that featured robust indexes, transactions, sorting, column families, etc. You could skip talking to the Linux block layer or any POSIX filesystem at all, and your optimized userspace software could directly talk to the storage controller in the SSD instead with a high level software API. This isn't far-fetched; Samsung also has a "Key-Value SSD" on the way that exposes the underlying flash storage using a (surprise!) high-level get/set KV API, for similar reasons.

A design where the controller is this powerful would also allow features like predicate pushdown in the query planner to be implemented. i.e., a `WHERE x > 7` can get pushed into the storage controller, and bad tuples that don't fit the predicate can get excluded/filtered out before getting pushed onto the memory bus. That will save significant processing time and memory bus traffic in aggregate, and it scales with the number of drives (since each drive has its own controller). Not to mention tricks like inline hardware for sorting, compression, etc.
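A toy model of that pushdown, with host-side Python standing in for work the drive's controller/FPGA would do (the on-drive scan function is invented for illustration):

    # Pretend this generator runs on the SmartSSD: the predicate is applied on
    # the drive, so only matching tuples ever cross the PCIe link / memory bus.
    def drive_scan(rows, column, value):
        for row in rows:
            if row[column] > value:          # the pushed-down "WHERE x > 7"
                yield row

    # Host side: issues the scan and receives only the surviving rows.
    table = [{"x": 3, "y": "a"}, {"x": 9, "y": "b"}, {"x": 12, "y": "c"}]
    survivors = list(drive_scan(table, "x", 7))   # -> the rows with x = 9 and 12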

Outside of fancy SQLite-as-a-filesystem tricks, I suspect the allure of optimizations like predicate pushdown and inline sorting will be very attractive for OLAP systems. Time will tell if these things will stick around, but Xilinx at least seems sure as hell determined to make their way into the datacenter.


SSDs are so tiny these days you could probably bundle several of them into a 5.25" drive bay along with a controller PCB that is an FPGA interfacing with a striped and mirrored SQLite. (Does "striping" even mean anything anymore in the age of SSDs?)


I've been strongly interested in computational fabrics for at least 15 years... this looks interesting, but very, very locked down.

It is my understanding that FPGA vendors have fought the open source community every step of the way. I would hate to see the future of computing locked up in a new spiffy prison.


FPGAs have ITAR restrictions which might be an issue with open sourcing the tools.

That could get you to a proper prison?


In what way is it locked down? I guess the FPGA may be big enough that it isn't covered by the free version of the Vivado toolchain.


keep in mind that "locked down" doesn't necessarily mean "free vs paid" but possibly "open source vs proprietary"


Isn’t Vivado entirely proprietary?


Yes, and it is also a bug riddled mess that is prone to synthesis errors. I am hoping maybe the AMD acquisition will encourage them to open more things up to get more eyes on the synthesis flow and allow more recourse/debugging when issues are encountered.


What synthesis errors have you run into?


Random ones requiring keep attributes for no reason on larger designs to keep stuff from being inferred away erroneously. Certain issues with limitations on SV interfaces only supporting constructs used in the "IP integration" scripting as opposed to the full language spec. I am sorry if I came off as overly negative, but I really think the FOSS EDA tools are going to lap them unless they open up somewhat.


I agree that just about all of the EDA tools are a pretty terrible experience.

I’ve gotten tired of dealing with quirks using SV interfaces in RTL. I’m using structs as a substitute at the moment.


(I am also biased because the designs I work on are small enough that ECP and presumably upcoming Lattice FPGAs are plenty. I am excited by the Xilinx reverse engineering efforts, too. But there seems to be less official interest than we see with Lattice in supporting the OSS efforts.)


How does that prevent you from making use of the product described in the article?


That comment was made in the context of a thread about open source FPGA development tools.


And I'm questioning whether lack of open source FPGA development tools means that the device is "locked down". You can do everything that the device is designed to be able to do.


That's true for any locked down device. An iPhone can do everything it is designed to do yet it could also do so much more if it weren't locked down.


Ok, I understand now. It is locked down because it is locked down.


Storage is starting to get extremely exciting again. The KV SSDs, this, and Intel's Optane are opening up a lot of new avenues for extremely high-performance storage.


I have a 1,000,000,000 byte file on my Optane SSD, and it loads into RAM in about 4 seconds, with zero tweaking, under Windows. The thing is wicked fast, even on a consumer laptop. Intel didn't sell the best of their SSD tech, just the lower end stuff.


Curious, but 1 GB in four seconds sounds kind of slow?


That's 250MB/s... A SATA 3 SSD can do 500MB/s sequential reads. A 10GB file in that transfer period would be closer to the limit of NVMe PCIe 3 drives.


Isn't that slower than a lot of SSDs? Mine can do that in around 2 and a half seconds and it's not Optane, just a normal NVMe SSD.


Didn't Intel just sell off its SSD division to SK Hynix?



They should have called it OINF: OINF is not flash.


Sorry if I missed it, but I'm not seeing it: what's the bandwidth here? i.e., the time to read, process and write back the whole contents of the disk (using just the FPGA)?


This doesn't change anything about the speed of the NVMe drive; it offloads the workload from the processor.


I was wondering about the speed at which the FPGA could read from the attached storage; the bandwidth internal to the drive.


Can anyone quantify the advantages this yields in terms of latency and bandwidth, compared to plugging a regular SSD into an external FPGA (via PCIe or whatever interface)?


I'm still waiting to become skilled enough or end up invested in a project enough to merit dedicated super fast SSD storage, or some kind of exotic storage appliance!


So I'm thinking deduplication on the drive will be the big thing here. Think XFS or ReFS block cloning, but without server-side processing.


The compute-intensive part of that task is the hashing function, for which common CPU extensions are cheaply available.


Indeed, but not always the implementation. Microsoft's ReFS has had several patches due to issues involving memory usage and general performance.


Too bad IBM killed off Netezza. If the cloud vendors started offering this widely, it would have given them another round of relevance.


I wonder how hard it would be to port the server-side code of FoundationDB to one of these devices; architecturally FDB seems well suited to this (at least until predicates show up), as it is already extremely constrained as to the expectations on the storage nodes; they basically provide just (time-bounded) versioned KV access.


Looks like this is a Xilinx KU15P - not shabby, but about 1/2 the size of the 3-die monstrosities that are in the AWS FPGA instances you can rent for ~$1.50 an hour - so useful for disk stuff closely coupled to the drive, but maybe not as a general compute resource (depending on actual price, of course).


They really should extend KVS for this - it's going to be very difficult to leverage if the XSS interface sits underneath the filesystem (as shown in the diagram), especially for an RDBMS where the database is (usually) a single big flat file as far as the filesystem is concerned.


Reminds me of Micron's Automata Processor: https://www.cs.virginia.edu/~skadron/Papers/wang_APoverview_...


So one can look at the chip beside the storage as a CPU offload built inside the drive, instead of a coprocessor on the motherboard. I'm not seeing a huge use case here except the narrowest of uses, like decryption and compression.


> except the narrowest of uses like decryption, compression

These seem like broad use cases to me. Also consider ETL and database applications. Time series, finance, machine learning, search engines. It seems like the primary benefit is in terms of latency and minimizing data bussed to the main CPU.


> I’m not seeing a huge use case here except the narrowest of uses like decryption, compression.

SSDs already have all the provisions for both, and do it. Something like that will genuinely benefit more from a highly optimised ASIC than anything else.

The use case is obviously huge, and you don't see the elephant in the room: money.

Attaching all those drives to even the cheapest Xeon around increases the price n-fold over the price of the flash, unless you're talking about multi-terabyte-scale SSDs.


Convolution on stored data. This is huge and really, really simple to implement. You could work with TB of data without the GPU.

Or am I missing something obvious?


But, one potential issue here is that your FPGA now needs to modify the filesystem to write any new results.

Maybe the use-case here is more like transforming the data on the fly. Let's say storing the data compressed, but reading it back uncompressed. This could effectively be transparent to the host CPU, but handled by the FPGA.

The more that I think about it, this data flow sounds significantly more reasonable than asynchronously processing data. Then you could still read / transform / write the new data to the SSD, but you'd limit the main CPU to only sending the read/write IO, instead of the transparent transformation.


I see, thanks for the explanation.


Would anyone want to do an explain-like-I’m-in-high-school on this?


The FPGA is programmable silicon. It's not as fast as a regular CPU, but it can be re-programmed with whatever algorithms you want.

Currently one big bottleneck for data processing is moving the data from the drive, over a storage link, into memory, and then back to the drive.

By putting a programmable processor on the drive, you can eliminate that overhead by putting some of your processing algorithms on the storage.

So, for example, if you were running a Hadoop cluster, you could have your common Hadoop algorithms baked into the processor of the drive. Every drive now becomes a Hadoop processing accelerator. Rather than pulling a full dataset into main memory over that very slow link, you run some of the job on each drive, and return only the data you need. Every drive has its own processing power, so the more data you have, the more grunt you have, and you're eliminating the slowest steps.

Because the FPGA is reprogrammable, you can change which algorithms you push down to the drive as you change your workloads over time. Every disk you add becomes a specialized big data, ML, whatever, processing unit.
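A rough sketch of that Hadoop-style offload: each drive runs the map plus a partial reduce locally, so only a tiny per-drive result travels over the slow link (the split of work here is hypothetical):

    from collections import Counter

    def on_drive_partial_count(lines):
        """Pretend this word count runs on the drive's FPGA/embedded processor."""
        tally = Counter()
        for line in lines:
            tally.update(line.split())
        return tally                          # small tally instead of raw data

    def host_combine(per_drive_results):
        """The host only merges the tiny per-drive tallies."""
        total = Counter()
        for partial in per_drive_results:
            total.update(partial)
        return total

    drives = [["smart ssd smart"], ["ssd fpga"], ["fpga fpga smart"]]
    print(host_combine(on_drive_partial_count(d) for d in drives))
    # Counter({'smart': 3, 'fpga': 3, 'ssd': 2})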


This is a new class of devices that couples fast SSD storage directly with an FPGA.

The types of computing that can be done in an FPGA require a new style of programming, because in an FPGA, programs get mapped into hardware, such that data flows through the chip, with all of the code running at the same time, instead of in sequence.

There are large differences in the style of programming required to get efficient use out of an FPGA... there are some ways of translating C that work, but they are crude crutches at best.

This is like going back to the early days of computing, when the new machines were expensive, but very fast... and programmer time was relatively cheap. It's going to take a special new breed of programmer to make this really work well. We're at the beginning of a new era.

I hope that makes sense.


Is this to be leveraged at the application level or the OS level? Since so much of the world is going to the cloud, the only applications I can think are for file systems or databases, what else is there? Is there a significant advantage for personal computers?


I don't really know what this would be good for on a personal computing level. I'd do a lot of things with digital signal processing. If you had a large set of videos, and you wanted to stream a down-sampled version on the fly, the FPGA could do that type of thing, easy peasy. We're at the stage where there is maximum possibility, and minimal legacy to worry about... like the early days of the PC.


Me, too. I'd like to know if these will have a practical impact on, say, media-storage for editing.


That reminds me of calculating the Mandelbrot set on the Commodore 1541 floppy drive, back in the day.


This is super exciting. It makes no mention of on device bandwidth.


On-device bandwidth is PCIe 3 x4 (~3.5GB/s full duplex). The FPGA is separate from the SSD controller, which is an off the shelf Samsung controller. A portion of the FPGA's resources are used as a PCIe switch: the FPGA is connected to the host PCIe link and presents one PCIe endpoint for the FPGA and one for the SSD that is connected through the FPGA.

Other companies in the computational storage area are putting compute resources onto their SSD controller ASICs so that the compute doesn't have a PCIe bottleneck between it and the NAND. But you won't see that kind of design coming out of a Samsung/Xilinx collaboration.


If these drives are just FPGAs sitting on the existing interface, so still hitting the PCIe limits, then this is unimpressive. If we start seeing multiples of device bandwidth available to the FPGA, then we could see huge cost savings.


I agree that on its own, it doesn't seem too interesting for the FPGA to be accessing the flash through the same PCIe x4 interface that the host system could use. But servers with 24+ NVMe SSDs don't always have the bandwidth to saturate all the SSDs simultaneously; they're often connected through a PCIe fanout switch that has just an x16 uplink (or an x16 per CPU socket). Having an accelerator to offload eg. search means the drives in aggregate don't have to send as much data up to the CPUs (or NICs). Even if these drives have what appears to be a sub-optimal design, they can still help alleviate bottlenecks elsewhere.


So no mention of SYCL support... Only offering ~C in 2020 is an insult to computer science.

Unrelated: when will Nvidia allow seamless offloading of Java or another GC-based language to the GPU? https://developer.nvidia.com/blog/grcuda-a-polyglot-language... GrCuda seems promising, but it would only allow interoperability with Java on the CPU, not offloading Java to the GPU, right? Such advances would make GPU computing orders of magnitude more developer-friendly and therefore much more mainstream.


Xilinx makes FPGAs, not GPUs. The advantage of an FPGA is that all operations take place at the same time in hardware. Verilog is one language that I would expect strong support for... they do mention RTL, which does many of the same things.

This is not a general-purpose programming environment; it is more of a data flow / filtering system.

There is no allocation of memory, thus no need for garbage collection.


Verilog, VHDL, and SystemVerilog are supported by Vivado. A small subset of C++ is supported through Vivado's high level synthesis compiler, which transforms some C++ code into Verilog. RTL stands for register-transfer level, which is a general layer of abstraction; one uses a hardware description language such as Verilog to describe RTL designs. The analogy of Verilog with RTL is "Verilog is to RTL, as Haskell is to functional programming".


AMD owns Xilinx now, so there's that.

If AMD can get their act together by seamlessly integrating their GPU, FPGA and CPU with minimal I/O bottlenecks, it will be a huge boon for the computing industry. People will start doing things that are probably unthinkable now but will be very obvious in the near future. Personally, I have one application that needs proper integration of these disparate systems; I've already started talking to their R&D engineers but see little improvement being implemented.

FYI, AMD has come up with SSD storage and GPU integration before, but with limited success [1]; if they can also integrate an FPGA, that could be a recipe for great success.

I think AMD and Intel (since they both own GPU/CPU/FPGA technology) really need to come up with open and intuitive design tools for these new systems or just sponsor the work on MLIR and LLHD by LLVM and ETH Zurich, respectively.

[1] https://www.extremetech.com/extreme/232416-amd-announces-new...


FPGA vendor support for higher-level synthesis tools is still pretty poor. Not surprised on that front.

Offload-from-GC-runtime tools do exist, e.g. Cudafy would translate .NET code into CUDA kernels and handle kernel dispatch. Of course, you were very limited in what constructs and types you could put in kernel functions, but you could write your whole application in C# and accelerate the important blocks.

In practice, a lot of beginner GPU computing has moved to the world of NN training and inference, in which the complexities of GPU offload are entirely wrapped by the libraries you use.

For traditional GPU-accelerated tasks, the limited languages available are not the problem. Decomposing your problem into a form that is amenable to GPU offload can be difficult, and if you're experienced enough to do that well, writing Cuda kernels and dispatch in C++ is not an obstacle. For example, Cudafy meant you didn't need to know Cuda-specific syntax and expressions, but you still had to understand the behavior and limitations of GPUs to write performant code.


> when will Nvidia allow to seamlessly offload Java or another GC based language to the GPU?

Lots of garbage collected languages seamlessly target GPUs. This is typically done at the library level, either within the language ecosystem (eg in Python using Cupy instead of Numpy: https://towardsdatascience.com/heres-how-to-use-cupy-to-make...) or below it (using cuBLAS as your BLAS implementation: https://developer.nvidia.com/cublas)

Java can do this too - something like ArrayFire is reasonably popular: https://developer.nvidia.com/arrayfire


What does SYCL have to do with FPGAs?



