Designing far memory data structures: think outside the box

eternalban · on Sept 26, 2019

I find the term "far memory" a bit strange, especially considering that the paper starts out using the duality of "remote" and "local". The first paper under "Prior works" is also consistent and uses the adjective "remote" applied to CPU, procedure, and memory. Is there a technical distinction that I am missing here?

(Oddly enough, I just did a search for "remote memory data structures" and guess what blog post and paper come up!)

amelius · on Sept 26, 2019

Shouldn't we use better notation for the time complexity of the algorithms? For example, an algorithm can have

    O(n^2) + rt * O(n)

time complexity (where rt is the round trip time). Of course, if rt is treated as a constant, this expression collapses to O(n^2), but by writing it as above you can see more clearly where the cost comes from.

EDIT: on second thought, perhaps bring the rt under the O() together with n.
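A minimal sketch of the idea in this comment, keeping the local-work term and the round-trip term separate instead of folding them into one bound. Everything here is an assumption made for illustration: the `Cost` type, the `example_cost` function, and the 1 ns per local step and 2 us per round trip constants are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cost:
    """Cost split into local steps and far-memory round trips (both are counts)."""
    local_ops: int
    round_trips: int

    def seconds(self, op_time: float, rt: float) -> float:
        # Total time: each term is weighted by its own unit cost.
        return self.local_ops * op_time + self.round_trips * rt


def example_cost(n: int) -> Cost:
    # Hypothetical algorithm: n^2 local steps plus one round trip per element,
    # i.e. the O(n^2) + rt * O(n) shape from the comment above.
    return Cost(local_ops=n * n, round_trips=n)


if __name__ == "__main__":
    op_time = 1e-9  # assumed: 1 ns per local step
    rt = 2e-6       # assumed: 2 us per round trip
    for n in (10, 1_000, 100_000):
        c = example_cost(n)
        local = c.local_ops * op_time
        remote = c.round_trips * rt
        print(f"n={n:>7}  local={local:.2e}s  remote={remote:.2e}s")
```

With these assumed constants the two terms are equal at n = rt / op_time = 2000. Below that the round trips dominate the running time even though the expression is O(n^2) asymptotically, which is the information the collapsed form hides.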

sagebird · on Sept 26, 2019

I agree with the spirit, but why use O here at all? Isn’t the idea that O collapses to its highest-order term? If you don’t want that, don’t use it.

You could use a normal function, like t(n) = f(n^2) + g(n) + rt.

afiori · on Sept 26, 2019

The point is that all the nice manipulations you would like to do are sound in O-notation and unsound in many other notations. What the parent wants is

    O(n^2) + O(m) * O(n)

where m is the number of roundtrips.

sterkekoffie · on Sept 26, 2019

Writing an O in front of something doesn't mean it's in big O.

afiori · on Sept 26, 2019

You can use every tool in the wrong way; if you stay in simple cases it is (comparatively) hard to misuse big-O notation.

T3OU-736 · on Sept 26, 2019

Curious. This is somewhat reminiscent of SGI's ccNUMA and CRAYLink/NUMALink architectures.

If memory serves, IRIX (SGI's UNIX OS) had both the metrics to see the latency of access, and the ability to migrate the data and/or the compute closer to each other.

ccNUMA was open-sourced and AMD uses it on their multi-core/multi-socket systems, though usually within the motherboard. Not so much leaving the case and interlinking SGI Origin system style (which is what the CRAYLink/NUMALink tech did).

MisterTea · on Sept 26, 2019

The sad thing is that HyperTransport was supposed to offer this exact feature and implement it just like SGI did with NUMAlink. There were a few boards produced with HTX slots; I have an older Tyan dual-socket Opteron board with an HTX slot kicking around.

There is a connector standard: https://www.hypertransport.org/ht-connectors-and-cables

Connectors available from Samtec: https://www.samtec.com/standards/ht3#connectors

Manycore CPUs and converged Ethernet pretty much made it moot.

kjs3 · on Sept 27, 2019

Yeah... HTX was really interesting until it was clear that 40G/100G Ethernet was going to become commodity really fast.

inetknght · on Sept 26, 2019

This talk seems to me to follow a similar line of thinking to the one I saw presented by Chandler Carruth at the 2014 C++ conference [0]. In the talk he presented a table with approximate round-trip times of various data layers.

[0]: https://youtu.be/fHNmRkzxHWs?t=2208

slashdev · on Sept 26, 2019

Is it possible to have direct remote memory access in any of the major cloud providers?

I think it should be technically possible inside your virtual network, if the cloud platform and network gear were to support it.

xiii1408 · on Sept 26, 2019

Generally, no.

The main requirement to support this is that RoCE or another RDMA API needs to be exposed inside the cloud VM. This requires (1) that the physical boxes have RDMA (likely universal at this point), and (2) that the virtualized network adapter, e.g. AWS ENA, exposes an RDMA API, which is much harder.

AWS did not support any kind of RDMA when I looked into it last year. Azure does, but in my understanding this is only in their "supercomputer partition," which is not really a cloud environment.

I've heard that AWS is looking to write an ENA backend for GASNet (a communication library), which could perhaps (?!) lead to them exposing RDMA and other low-level NIC features.

posnet · on Sept 26, 2019

https://aws.amazon.com/blogs/aws/now-available-elastic-fabri...

aloknnikhil · on Sept 26, 2019

I think the answer is that it depends. Far memory is only useful when the CPU isn't involved, which probably means the VMM underneath should support VM-to-VM memory access without trapping the call. I don't think that's something VMMs support today. In fact, they're actively building measures to defend against such access.

fragmede · on Sept 26, 2019

If there's Remote DMA (RDMA) capable hardware (an InfiniBand or 10-gigabit Ethernet PCI card) and a hypervisor that supports PCI passthrough, then guest VMs can do RDMA. This is not especially applicable for cloud providers trying to offer generic VPSes, but it is possibly useful on the backend for managed services where the per-customer VM is not exposed to the customer (e.g. AWS Redshift).

namibj · on Sept 26, 2019

Azure has InfiniBand clusters.

nixgeek · on Sept 27, 2019

Oracle Cloud can support this - yes.

Disclaimer: I work for Oracle.

https://www.oracle.com/cloud/solutions/hpc.html

deffbjinnnbbvf · on Sept 26, 2019

How is far memory different from a disk?

xiii1408 · on Sept 26, 2019

Disk could be considered a specific form of "far memory."

In the context of this paper, though, "far memory" is referring to memory outside the local system that is accessed using RDMA instructions.

deffbjinnnbbvf · on Sept 26, 2019

Don't disk-based data structures have similar constraints? There, too, there is no ability to ship computations, and we try to optimize for minimal data round trips.

xiii1408 · on Sept 26, 2019

RDMA instructions (1) are more expressive than disk operations, from what I understand (they support compare-and-swap, fetch-and-add, etc.), and (2) have different latencies and bandwidths (on the order of 1 us latency, 20 GB/s BW).

This paper is mostly about proposing new RDMA instructions, such as a relative load/store, that could make remote data structures more efficient.
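To make the two points above concrete, here is a toy, in-process model of far memory that counts one round trip per operation. It is only a sketch under stated assumptions: it does not use any real RDMA library, the `FarMemory` class and all of its method names are hypothetical, and `load_indirect` is an illustrative stand-in for the kind of extended load the comment describes, not the paper's actual interface.

```python
class FarMemory:
    """Toy stand-in for far memory: a word array where every call is one round trip."""

    def __init__(self, size: int) -> None:
        self.words = [0] * size
        self.round_trips = 0

    def load(self, addr: int) -> int:
        self.round_trips += 1
        return self.words[addr]

    def store(self, addr: int, value: int) -> None:
        self.round_trips += 1
        self.words[addr] = value

    def compare_and_swap(self, addr: int, expected: int, new: int) -> int:
        # Returns the old value; the swap happened only if old == expected.
        self.round_trips += 1
        old = self.words[addr]
        if old == expected:
            self.words[addr] = new
        return old

    def fetch_and_add(self, addr: int, delta: int) -> int:
        # Returns the value before the addition.
        self.round_trips += 1
        old = self.words[addr]
        self.words[addr] = old + delta
        return old

    def load_indirect(self, ptr_addr: int) -> int:
        # Hypothetical primitive: the far side dereferences the pointer stored
        # at ptr_addr, so the caller pays one round trip instead of two.
        self.round_trips += 1
        return self.words[self.words[ptr_addr]]


if __name__ == "__main__":
    mem = FarMemory(8)
    mem.words[0] = 5    # word 0 holds a pointer to word 5 (setup, not counted)
    mem.words[5] = 42

    value = mem.load(mem.load(0))      # plain loads: two dependent round trips
    print(value, mem.round_trips)

    mem.round_trips = 0
    value = mem.load_indirect(0)       # hypothetical indirect load: one round trip
    print(value, mem.round_trips)
```

The pointer chase is where the difference shows up: with plain loads the second request cannot be issued until the first returns, so each level of indirection costs a full round trip.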

wtallis · on Sept 26, 2019

NVMe defines compare and atomic compare-and-write operations, but I'm not sure if there are any notable users of them. They certainly aren't exposed by typical file IO abstractions. There's nothing like a fetch-and-add in any typical storage protocol that I know of.