Xilinx-Samsung SmartSSD Computational Storage Drive Launched (servethehome.com)
116 points by blopeur on Nov 11, 2020 | 84 comments



The core concept that you ship computation to data rather than the other way around is what made Google so impressive when it launched. There are lots of algorithms that do well in that model. Back when I was at NetApp I did a design of a system where the "smart storage" essentially labeled blocks with an MD5 hash when you went to store them. That allowed you to rapidly determine if you already had the block stored and could safely toss the one being written[1]. Really fast de-duplication and good storage compression.

At Blekko they had taken this concept to the next logical step and built a storage array out of triply replicated blocks (called 'buckets') that were distributed by their hashid. You could then write templated Perl code that operated in parallel over hundreds (or thousands) of buckets and returned a composite result. It always surprised me that IBM didn't care about that system when they acquired Blekko; it was pretty cool. If you implemented it in these Samsung drives it would make for a killer data science appliance. That design almost writes itself.

Also in the storage space, there was the CMU "Active disk" architecture[2] which was supposed to replace RAID. There was a startup spin-off from this work but I cannot recall its name anymore, sigh.

These days it would be useful to design a simulator for systems like this and derive a calculus for analyzing their performance with respect to other architectures. There's probably a master's thesis and maybe a PhD or two in that work.

[1] Yes MD5 hash collisions are a thing but not for identical length documents (aka an 8K block), and yes NetApp got a patent issued for it.

[2] https://www.pdl.cmu.edu/PDL-FTP/Active/ActiveDisksBerkeley98...


Wasn't this the aim of Hadoop as well?

Also, to be fair, IBM wasn't able to do much with any of the companies they acquired in the same timeframe as Blekko. I was working at IBM at that time and witnessed this first hand.


I kind of gathered that IBM had this sort of love/hate relationship with acquisitions. They loved the "new ideas" but many of the "old guard" hated the idea that something from outside of IBM was novel and worth pursuing. I watched first hand as things that had been done by an acquired company were buried while the exact same kind of thing done by IBM Research was given lots of funding. I understood not having two efforts, but suggested they be blended to get the best of both teams; alas, that was shot down.

At one time IBM Research had a pretty awesome storage group; perhaps they will have a computational storage fabric offering at some point.


I know exactly the IBM Research group you are referring to but won't name names to avoid embarrassment :-) Politics in large organizations can be pretty nasty.


The way HBase/HDFS does this is perfectly backwards and counterproductive. You don't want to put a program and its data on one local disk, because that one local disk has almost no resources. The later-period thinking at Google is the opposite of shipping the program to the data. Instead, you allocate the data randomly over as many spindles as you can find, then when you read it all back out you very briefly use all of those spindles at once. Really, it's something to behold in action.
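A toy sketch of that placement strategy (plain Python, nothing Google-specific; the chunk and disk counts below are invented):

    import random

    # Scatter a file's chunks uniformly at random across many disks, so a later
    # read can pull from all of those spindles at once instead of one local disk.
    def scatter(num_chunks, num_disks):
        return {chunk: random.randrange(num_disks) for chunk in range(num_chunks)}

    def readback_plan(placement):
        """Group chunk reads by disk; every disk gets hit in parallel."""
        by_disk = {}
        for chunk, disk in placement.items():
            by_disk.setdefault(disk, []).append(chunk)
        return by_disk

    # ~1000 chunks spread over ~200 disks: each disk serves only a handful of reads.
    plan = readback_plan(scatter(num_chunks=1000, num_disks=200))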

That's batch processing: mapreduce and flume and whatnot. Search is still very much an exercise in getting the queries out to where the index shards live.


The secret sauce for that is networking. Once you have a domain with full Clos connectivity and very low latencies, you can start disaggregating storage from compute again.


Assume you have a 1GB file. According to the Colossus paper, that gets split into 1MB parts, so 1000 of them. And they probably use erasure coding for replication. How does that work? Do they erasure-code each of the 1000 parts separately, or?


According to these slides, each stripe is a separate replication group. Although the figure depicts replicated rather than erasure coded files, I think it's safe to imagine that the idea holds for both.

http://www.pdsw.org/pdsw-discs17/slides/PDSW-DISCS-Google-Ke...
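To make that concrete, here is a toy sketch of per-stripe encoding, using XOR parity as a stand-in for real Reed-Solomon codes (stripe size and chunk counts are assumptions carried over from the question above):

    STRIPE = 1 << 20   # 1MB stripes, as in the parent question

    def xor_parity(chunks):
        """Toy stand-in for erasure coding: one parity chunk per stripe."""
        parity = bytearray(len(chunks[0]))
        for chunk in chunks:
            for i, b in enumerate(chunk):
                parity[i] ^= b
        return bytes(parity)

    def encode_file(data, k=4):
        """Each 1MB stripe is its own group, erasure-coded independently."""
        for off in range(0, len(data), STRIPE):
            stripe = data[off:off + STRIPE]
            size = -(-len(stripe) // k)                        # ceil division
            chunks = [stripe[i * size:(i + 1) * size].ljust(size, b"\0")
                      for i in range(k)]
            yield chunks + [xor_parity(chunks)]                # k data + 1 parity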


Did you mean Hadoop/HDFS? HBase works pretty analogously to Bigtable (though, having run HBase in production at $dayjob-1, I'd take the latter any day).


HBase doesn't work analogously to bigtable. HBase intentionally tries to move region files onto the same node as the region server. That is ass-backwards. HBase devs think of this as "good locality" but all you're really getting from it is terrible performance.


> "smart storage" essentially labeled blocks with an MD5 hash when you went to store them

Asking out of curiosity - isn't this similar to what Venti (from Plan 9) did? Of course, Venti was content-addressed, and in this case I'm guessing this system sat above WAFL (which is definitely not content-addressed).

* http://doc.cat-v.org/plan_9/4th_edition/papers/venti/


Many of the same pieces solving a slightly different problem. In the NetApp case it was pushing the edge of de-duplication for the efficient-archival-storage problem. EMC had discovered that MD5 hash collisions made deduping at the document level dangerous (you could think you had a document when you didn't). Those collisions came from dissimilar-sized documents, and indeed you could "attack" MD5 signatures that way. With a fixed document size, the probabilities went back to the actual MD5 collision probabilities, which were acceptably small. On the "fast" part of the archival server, instead of storing 8K blocks you could store 16-byte "block identifiers" (while still using all of the standard WAFL file system layout; it thinks it is storing 16-byte blocks). Those could be stored in "fast" storage (think SSD) and the actual data on slow "cold" storage. Your back-end server does do "content addressable" kinds of things in terms of hash->block location services, but it is simple, with only three operations: "does block <x> exist?", "give me block <x>", and "store this block as <x>."
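A toy sketch of that backend and its three operations, plus the dedup-friendly write path (all names here are invented for illustration, not NetApp's):

    import hashlib

    class ColdStore:
        """Hypothetical hash->block service with just the three operations above."""
        def __init__(self):
            self._blocks = {}                                 # 16-byte MD5 id -> 8K block

        def exists(self, bid): return bid in self._blocks     # "does block <x> exist?"
        def get(self, bid):    return self._blocks[bid]       # "give me block <x>"
        def put(self, bid, data): self._blocks.setdefault(bid, data)  # "store this block as <x>"

    def write_block(cold, block):
        """The filer keeps only this 16-byte id; duplicate blocks are never stored twice."""
        bid = hashlib.md5(block).digest()
        if not cold.exists(bid):
            cold.put(bid, block)
        return bid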

For a write-mostly, fabric-attached archival device it had some benefits over the SATA-based filers (higher density, lower watts/terabyte, less CPU load on the filer head since the work was spread out to the storage retrieval unit, which could have many of the hash->block translators, etc.). I don't believe NetApp ever built a complete one though. Just "too many degrees off their bow," as an engineer I knew would say.


Thank you, this is fascinating to read! This is a piece of NTAP history I was completely unaware of :)


Worth mentioning that this model is reversing, with Dremel/BigQuery/Pulsar etc. looking to disaggregate compute from storage again.


Panasas?


Yes, man I could not come up with that.


I think putting something like SQLite on the actual storage device could be a super efficient way to directly express your intent to the actual durable storage system and bypass mountains of virtual bullshit.

The optimization opportunities are pretty obvious to me. Imagine if SQLite journaling was aware of how long the supercapacitor in the SSD would last, potentially even with real-time monitoring of device variables. You could have your entire WAL sitting in DRAM on the drive as long as it has enough stored energy to flush to NAND upon external power loss.
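A back-of-the-envelope sketch of that constraint (every device number below is a made-up assumption, just to show the shape of the calculation):

    # How much WAL can safely live in on-drive DRAM? Only as much as the
    # supercapacitor can flush to NAND after external power is lost.
    holdup_energy_j = 2.0     # hypothetical energy stored in the supercap (joules)
    flush_power_w   = 4.0     # hypothetical controller + NAND power while flushing
    flush_bw_mb_s   = 800.0   # hypothetical sustained DRAM -> NAND flush bandwidth

    holdup_time_s = holdup_energy_j / flush_power_w     # ~0.5 seconds of hold-up
    wal_budget_mb = holdup_time_s * flush_bw_mb_s       # ~400 MB of WAL that can sit in DRAM
    print(f"WAL may occupy up to ~{wal_budget_mb:.0f} MB of device DRAM")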


In one of his talks Richard Hipp actually mentioned off-hand that some storage devices allegedly use SQLite under the hood as their storage controller, with the firmware bridging the gap.

(I can't find the video, but his slides are here if you want to go searching - http://sqlite.org/talks/index.html )


I would pay good money to be able to talk directly to their SQLite, without any abstraction layer inbetween, to get some extra performance/lower latency - and I don't think I'm alone.

In fact, even if it wasn't SQLite but something non standard, I'd be interested in learning it and trying to make it work with my needs.


It shouldn't be too hard to get a SQLite-ish system working on the "smart" part of the SSD. What would be the use cases that you have in mind?


All great, then you need to port it to something other than that one specific SSD and it's too much work.

We have abstraction boundaries for a reason, we give up a small % of performance and in return we can write code once (say SQLite) and use it in many scenarios. For something like SQLite it means it's been around for a long time and had a lot of optimization work done on it (and that probably outweighs the few % gain you might get from such tight integration).

You'd probably get a bigger performance gain from just not using SQL (eg. a DBM).


There are really only two FPGA vendors that can compete in this space, and Xilinx is the one that's clearly ahead for computational storage applications. They already provide a platform for storage accelerator IP to be shared between this Smart SSD and their pre-existing PCIe FPGA cards that connect to standard SSDs over PCIe or networks. So porting to another accelerator platform probably isn't as big an issue as you expect.

The bigger challenge I see for implementing something like SQLite on a SmartSSD is that you really don't want your database to exist on just one drive, so you need to figure out how to do HA across multiple SSDs while still offloading most of the computationally expensive database operations to the FPGAs instead of leaving it on the CPU. I think this will condemn SmartSSDs to always working at a slightly lower abstraction layer than what the application really wants.


Thinking about a theoretical SQLite database running on an SSD like this as a HA database for server-based systems is a poor design and a mistake. It would be a poor design even without the FPGA; SQLite simply isn't a HA replicated database.

Something like SQLite might make a decent alternative API to the flash storage layer of the SSD, though. Imagine if the storage controller of your SSD exposed a built-in "filesystem" that featured robust indexes, transactions, sorting, column families, etc. You could skip talking to the Linux block layer or any POSIX filesystem at all, and your optimized userspace software could directly talk to the storage controller in the SSD instead with a high level software API. This isn't far-fetched; Samsung also has a "Key-Value SSD" on the way that exposes the underlying flash storage using a (surprise!) high-level get/set KV API, for similar reasons.

A design where the controller is this powerful would also allow features like predicate pushdown in the query planner to be implemented. i.e., a `WHERE x > 7` can get pushed into the storage controller, and bad tuples that don't fit the predicate can get excluded/filtered out before getting pushed onto the memory bus. That will save significant processing time and memory bus traffic in aggregate, and it scales with the number of drives (since each drive has its own controller). Not to mention tricks like inline hardware for sorting, compression, etc.
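A toy model of that pushdown, with host-side Python standing in for work the drive's controller/FPGA would do (the on-drive scan function is invented for illustration):

    # Pretend this generator runs on the SmartSSD: the predicate is applied on
    # the drive, so only matching tuples ever cross the PCIe link / memory bus.
    def drive_scan(rows, column, value):
        for row in rows:
            if row[column] > value:          # the pushed-down "WHERE x > 7"
                yield row

    # Host side: issues the scan and receives only the surviving rows.
    table = [{"x": 3, "y": "a"}, {"x": 9, "y": "b"}, {"x": 12, "y": "c"}]
    survivors = list(drive_scan(table, "x", 7))   # -> the rows with x = 9 and 12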

Outside of fancy SQLite-as-a-filesystem tricks, I suspect the allure of optimizations like predicate pushdown and inline sorting will be very attractive for OLAP systems. Time will tell if these things will stick around, but Xilinx at least seems sure as hell determined to make their way into the datacenter.


SSDs are so tiny these days you could probably bundle several of them into a 5.25" drive bay along with a controller PCB that is an FPGA interfacing with a striped and mirrored SQLite. (Does "striping" even mean anything anymore in the age of SSDs?)


I've been strongly interested in computational fabrics for at least 15 years... this looks interesting, but very, very locked down.

It is my understanding that FPGA vendors have fought the open source community every step of the way. I would hate to see the future of computing locked up in a new spiffy prison.


FPGAs have ITAR restrictions which might be an issue with open sourcing the tools.

That could get you to a proper prison?


In what way is it locked down? I guess the FPGA may be big enough that it isn't covered by the free version of the Vivado toolchain.


keep in mind that "locked down" doesn't necessarily mean "free vs paid" but possibly "open source vs proprietary"


Isn’t Vivado entirely proprietary?


Yes, and it is also a bug riddled mess that is prone to synthesis errors. I am hoping maybe the AMD acquisition will encourage them to open more things up to get more eyes on the synthesis flow and allow more recourse/debugging when issues are encountered.


What synthesis errors have you run into?


Random ones requiring keep attributes for no reason on larger designs to keep stuff from being inferred away erroneously. Certain issues with limitations on SV interfaces only supporting constructs used in the "IP integration" scripting as opposed to the full language spec. I am sorry if I came off as overly negative, but I really think the FOSS EDA tools are going to lap them unless they open up somewhat.


I agree that just about all of the EDA tools are a pretty terrible experience.

I’ve gotten tired of dealing with quirks using SV interfaces in RTL. I’m using structs as a substitute at the moment.


(I am also biased because the designs I work on are small enough that ECP and presumably upcoming Lattice FPGAs are plenty. I am excited by the Xilinx reverse engineering efforts, too. But there seems to be less official interest than we see with Lattice in supporting the OSS efforts.)


How does that prevent you from making use of the product described in the article?


That comment was made in the context of a thread about open source FPGA development tools.


And I'm questioning whether lack of open source FPGA development tools means that the device is "locked down". You can do everything that the device is designed to be able to do.


That's true for any locked down device. An iPhone can do everything it is designed to do yet it could also do so much more if it weren't locked down.


Ok, I understand now. It is locked down because it is locked down.


Storage is starting to get extremely exciting again. The KV SSDs, this, and Intel's Optane are opening up a lot of new avenues for extremely high-performance storage.


I have a 1,000,000,000 byte file on my Optane SSD, and it loads into RAM in about 4 seconds, with zero tweaking, under Windows. The thing is wicked fast, even on a consumer laptop. Intel didn't sell the best of their SSD tech, just the lower end stuff.


Curious, but 1 GB in four seconds sounds kind of slow?


That's 250MB/s... A SATA 3 SSD can do 500MB/s sequential reads. A 10GB file in that transfer period would be closer to the limit of NVMe PCIe 3 drives.


Isn't that slower than a lot of SSDs? Mine can do that in around 2 and a half seconds and it's not Optane, just a normal NVMe SSD.


Didn't Intel just sell off its SSD division to SK Hynix?



They should have called it OINF: OINF is not flash.


Sorry if I missed it, but I'm not seeing it: what's the bandwidth here? i.e., the time to read, process and write back the whole contents of the disk (using just the FPGA)?


This doesn't change anything about the speed of the NVMe drive; it offloads the workload from the processor.


I was wondering about the speed at which the FPGA could read from the attached storage; the bandwidth internal to the drive.


Can anyone quantify the advantages this yields in terms of latency and bandwidth, compared to plugging a regular SSD into an external FPGA (via PCIe or whatever interface)?


I'm still waiting to become skilled enough or end up invested in a project enough to merit dedicated super fast SSD storage, or some kind of exotic storage appliance!


So I'm thinking deduplication on the drive will be the big thing here. Think XFS or ReFS block cloning, but without server-side processing.


The compute-intensive part of that task is the hashing function, for which common CPU extensions are cheaply available.


Indeed, but not always the implementation. Microsoft's ReFS has had several patches due to issues involving memory usage and general performance.


Too bad IBM killed off Netezza. If the cloud vendors started offering this widely, it would have given them another round of relevance.


I wonder how hard it would be to port the server-side code of FoundationDB to one of these devices; architecturally FDB seems well suited to this (at least until predicates show up), as it is already extremely constrained as to the expectations on the storage nodes; they basically provide just (time-bounded) versioned KV access.


Looks like this is a Xilinx KU15P - not shabby, but about 1/2 the size of the 3-die monstrosities that are in the AWS FPGA instances you can rent for ~$1.50 an hour - so useful for disk stuff closely coupled to the drive, but maybe not as a general compute resource (depending on actual price, of course).


They really should extend KVS for this - it's going to be very difficult to leverage if the XSS interface sits underneath the filesystem (as shown in the diagram), especially for an RDBMS where the database is (usually) a single big flat file as far as the filesystem is concerned.


Reminds me of Micron's Automata Processor: https://www.cs.virginia.edu/~skadron/Papers/wang_APoverview_...


So one can look at the chip beside the storage as a CPU offload built inside the drive, instead of a coprocessor on the motherboard. I'm not seeing a huge use case here except the narrowest of uses, like decryption and compression.


> except the narrowest of uses like decryption, compression

These seem like broad use cases to me. Also consider ETL and database applications. Time series, finance, machine learning, search engines. It seems like the primary benefit is in terms of latency and minimizing data bussed to the main CPU.


> I’m not seeing a huge use case here except the narrowest of uses like decryption, compression.

SSDs already have all the provisions for both, and do it. Something like that will genuinely benefit more from a highly optimised ASIC than anything else.

The use case is obviously huge, and you don't see the elephant in the room: money.

Attaching all those drives to even the cheapest Xeon around increases the price n-fold over the price of the flash, unless you're talking about multi-terabyte-scale SSDs.


Convolution on stored data. This is huge and really, really simple to implement. You could work with TB of data without the GPU.

Or am I missing something obvious?


But, one potential issue here is that your FPGA now needs to modify the filesystem to write any new results.

Maybe the use-case here is more like transforming the data on the fly. Let's say storing the data compressed, but reading it back uncompressed. This could effectively be transparent to the host CPU, but handled by the FPGA.

The more that I think about it, this data flow sounds significantly more reasonable than asynchronously processing data. Then you could still read / transform / write the new data to the SSD, but you'd limit the main CPU to only sending the read/write IO, instead of the transparent transformation.


I see, thanks for the explanation.


Would anyone want to do an explain-like-I’m-in-high-school on this?


The FPGA is programmable silicon. It's not as fast as a regular CPU, but it can be re-programmed with whatever algorithms you want.

Currently one big bottleneck for data processing is moving the data from the drive, over a storage link, into memory, and then back to the drive.

By putting a programmable processor on the drive, you can eliminate that overhead by putting some of your processing algorithms on the storage.

So, for example, if you were running a Hadoop cluster, you could have your common Hadoop algorithms baked into the processor of the drive. Every drive now becomes a Hadoop processing accelerator. Rather than pulling a full dataset into main memory over that very slow link, you run some of the job on each drive, and return only the data you need. Every drive has its own processing power, so the more data you have, the more grunt you have, and you're eliminating the slowest steps.

Because the FPGA is reprogrammable, you can change which algorithms you push down to the drive as you change your workloads over time. Every disk you add becomes a specialized big data, ML, whatever, processing unit.
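A rough sketch of that Hadoop-style offload: each drive runs the map plus a partial reduce locally, so only a tiny per-drive result travels over the slow link (the split of work here is hypothetical):

    from collections import Counter

    def on_drive_partial_count(lines):
        """Pretend this word count runs on the drive's FPGA/embedded processor."""
        tally = Counter()
        for line in lines:
            tally.update(line.split())
        return tally                          # small tally instead of raw data

    def host_combine(per_drive_results):
        """The host only merges the tiny per-drive tallies."""
        total = Counter()
        for partial in per_drive_results:
            total.update(partial)
        return total

    drives = [["smart ssd smart"], ["ssd fpga"], ["fpga fpga smart"]]
    print(host_combine(on_drive_partial_count(d) for d in drives))
    # Counter({'smart': 3, 'fpga': 3, 'ssd': 2})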


This is a new class of devices that couples fast SSD storage directly with an FPGA.

The types of computing that can be done in an FPGA require a new style of programming, because in an FPGA, programs get mapped into hardware, such that data flows through the chip, with all of the code running at the same time, instead of in sequence.

There are large differences in the style of programming required to get efficient use out of an FPGA... there are some ways of translating C that work, but they are crude crutches at best.

This is like going back to the early days of computing, when the new machines were expensive, but very fast... and programmer time was relatively cheap. It's going to take a special new breed of programmer to make this really work well. We're at the beginning of a new era.

I hope that makes sense.


Is this to be leveraged at the application level or the OS level? Since so much of the world is going to the cloud, the only applications I can think are for file systems or databases, what else is there? Is there a significant advantage for personal computers?


I don't really know what this would be good for on a personal computing level. I'd do a lot of things with digital signal processing. If you had a large set of videos, and you wanted to stream a down-sampled version on the fly, the FPGA could do that type of thing, easy peasy. We're at the stage where there is maximum possibility, and minimal legacy to worry about... like the early days of the PC.


Me, too. I'd like to know if these will have a practical impact on, say, media-storage for editing.


That reminds me of calculating the Mandelbrot set on the Commodore 1541 floppy drive, back in the day.


This is super exciting. It makes no mention of on device bandwidth.


On-device bandwidth is PCIe 3 x4 (~3.5GB/s full duplex). The FPGA is separate from the SSD controller, which is an off the shelf Samsung controller. A portion of the FPGA's resources are used as a PCIe switch: the FPGA is connected to the host PCIe link and presents one PCIe endpoint for the FPGA and one for the SSD that is connected through the FPGA.

Other companies in the computational storage area are putting compute resources onto their SSD controller ASICs so that the compute doesn't have a PCIe bottleneck between it and the NAND. But you won't see that kind of design coming out of a Samsung/Xilinx collaboration.


If these drives are just FPGAs sitting on the existing interface, so still hitting the PCIe limits, then this is unimpressive. If we start seeing multiples of device bandwidth available to the FPGA, then we could see huge cost savings.


I agree that on its own, it doesn't seem too interesting for the FPGA to be accessing the flash through the same PCIe x4 interface that the host system could use. But servers with 24+ NVMe SSDs don't always have the bandwidth to saturate all the SSDs simultaneously; they're often connected through a PCIe fanout switch that has just an x16 uplink (or an x16 per CPU socket). Having an accelerator to offload eg. search means the drives in aggregate don't have to send as much data up to the CPUs (or NICs). Even if these drives have what appears to be a sub-optimal design, they can still help alleviate bottlenecks elsewhere.


So no mention of SYCL support... Only offering ~C in 2020 is an insult to computer science.

Unrelated: when will Nvidia allow seamless offloading of Java or another GC-based language to the GPU? https://developer.nvidia.com/blog/grcuda-a-polyglot-language... GrCuda seems promising, but it would only allow interoperability with Java on the CPU, not offloading Java to the GPU, right? Such advances would make GPU computing orders of magnitude more developer-friendly and therefore much more mainstream.


Xilinx makes FPGAs, not GPUs. The advantage of an FPGA is that all operations take place at the same time in hardware. Verilog is one language that I would expect strong support for... they do mention RTL, which does many of the same things.

This is not a general-purpose programming environment; it is more of a data flow / filtering system.

There is no allocation of memory, thus no need for garbage collection.


Verilog, VHDL, and SystemVerilog are supported by Vivado. A small subset of C++ is supported through Vivado's high level synthesis compiler, which transforms some C++ code into Verilog. RTL stands for register-transfer level, which is a general layer of abstraction; one uses a hardware description language such as Verilog to describe RTL designs. The analogy of Verilog with RTL is "Verilog is to RTL, as Haskell is to functional programming".


AMD owns Xilinx now, so there's that.

If AMD can get their act together by seamlessly integrating their GPU, FPGA and CPU with minimal I/O bottlenecks, it will be a huge boon for the computing industry. People will start doing things that are probably unthinkable now but will be very obvious in the near future. Personally, I have one application that needs proper integration of these disparate systems; I've already started talking to their R&D engineers but see little improvement being implemented.

FYI, AMD has come up with SSD storage and GPU integration before, but with limited success [1]; if they can also integrate an FPGA, that could be a recipe for great success.

I think AMD and Intel (since they both own GPU/CPU/FPGA technology) really need to come up with open and intuitive design tools for these new systems or just sponsor the work on MLIR and LLHD by LLVM and ETH Zurich, respectively.

[1] https://www.extremetech.com/extreme/232416-amd-announces-new...


FPGA vendor support for higher-level synthesis tools is still pretty poor. Not surprised on that front.

Offload-from-GC-runtime tools do exist, e.g. Cudafy would translate .NET code into CUDA kernels and handle kernel dispatch. Of course, you were very limited in what constructs and types you could put in kernel functions, but you could write your whole application in C# and accelerate the important blocks.

In practice, a lot of beginner GPU computing has moved to the world of NN training and inference, in which the complexities of GPU offload are entirely wrapped by the libraries you use.

For traditional GPU-accelerated tasks, the limited languages available are not the problem. Decomposing your problem into a form that is amenable to GPU offload can be difficult, and if you're experienced enough to do that well, writing Cuda kernels and dispatch in C++ is not an obstacle. For example, Cudafy meant you didn't need to know Cuda-specific syntax and expressions, but you still had to understand the behavior and limitations of GPUs to write performant code.


> when will Nvidia allow to seamlessly offload Java or another GC based language to the GPU?

Lots of garbage collected languages seamlessly target GPUs. This is typically done at the library level, either within the language ecosystem (eg in Python using Cupy instead of Numpy: https://towardsdatascience.com/heres-how-to-use-cupy-to-make...) or below it (using cuBLAS as your BLAS implementation: https://developer.nvidia.com/cublas)

Java can do this too - something like ArrayFire is reasonably popular: https://developer.nvidia.com/arrayfire


What does SYCL have to do with FPGAs?



