Are there languages that have first-class support for representing/optimizing memory hierarchy characteristics? Optimizing C compilers, for example, may have extensions to force specific alignments:
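For example, a rough sketch of the kind of compiler-specific extension I mean (GCC/Clang's `__attribute__((aligned))`, which MSVC spells `__declspec(align(...))`; the array here is just illustrative):

```cpp
#include <cstdint>
#include <cstdio>

// GCC/Clang extension: place the buffer on a 64-byte (cache-line) boundary.
// MSVC spells the same idea __declspec(align(64)).
__attribute__((aligned(64))) static float samples[1024];

int main() {
    std::printf("samples at %p, 64-byte aligned: %d\n",
                static_cast<void*>(samples),
                static_cast<int>(reinterpret_cast<std::uintptr_t>(samples) % 64 == 0));
}
```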
But I'm not aware of languages where e.g. declaring alignments is part of the base language. Awareness of L1, L2, L3 cache characteristics, plus NUMA nodes, is increasingly important to writing high performance code. Every language I've ever used exposes nothing of that hierarchy; it's all just "memory".
People who write performance-critical code are already aware of these issues and can measure/hint/optimize to work with the memory hierarchy in existing languages. But I feel like this could be better if we had languages that modeled these issues up front instead of only exposing the abstracted view. Maybe such languages already exist and I just haven't encountered them yet.
Representing memory hierarchy in a somewhat-HLL was the goal of the programming language Sequoia, presented in the paper 'Sequoia: Programming the Memory Hierarchy' [1][2] from Stanford University.
Abstract:
"We present Sequoia, a programming language designed to facilitate the development of memory hierarchy aware parallel programs that remain portable across modern machines featuring different memory hierarchy configurations. Sequoia abstractly exposes hierarchical memory in the programming model and provides language mechanisms to describe communication vertically through the machine and to localize computation to particular memory locations within it. We have implemented a complete programming system, including a compiler and runtime systems for Cell processor-based blade systems and distributed memory clusters, and demonstrate efficient performance running Sequoia programs on both of these platforms."
Sequoia has been largely superseded by its spiritual successor Legion [1], another programming system by Alex Aiken. Legion doesn't focus so much on low-level memory hierarchies, but it does a very, very good job of scaling to very large machines, and taking advantage of heterogeneous processors such as GPUs. It is also incomparably better at working with dynamic programs, whereas Sequoia needed to know basically everything about the program up front (including what machine it will run on).
If you're looking for a modern version of Sequoia that actually runs on modern hardware, I would strongly recommend looking at Legion. (Or the language component of Legion, Regent [2].)
I'm not sure if anyone is still reading this thread, but I wrote up some thoughts on the differences between Sequoia and Legion and some of the lessons we learned. Maybe this is a little too much "inside baseball", but hopefully some will enjoy it.
Sequoia was basically an answer to the question, "what could a compiler do for you if it knew everything about your program?" And I do mean everything: in addition to your source code, Sequoia had to be given, at compile time, the sizes of every input array and the exact layout of the machine it was to execute on. While these restrictions were lifted to some degree later on, Sequoia basically had to be able to statically compute every task that it would execute along with the exact sizes of all inputs and outputs. And with this information it was able to do some pretty [fantastic optimizations](http://theory.stanford.edu/~aiken/publications/papers/ppopp0...). I believe, though I can't find the reference at the moment, that the compiler was able to generate (from completely hardware agnostic source code) implementations of BLAS routines competitive with Intel MKL. This was a pretty mind-blowing achievement and to this day I'm not aware of any work that has really been able to equal it.
Requiring programs to be fully static also had a dramatic cost, one which turned out not to be so easy to lift. In what I believe was the [final paper to be published on Sequoia](http://theory.stanford.edu/~aiken/publications/papers/ppopp1...), Mike Bauer and others showed how to extend Sequoia to support limited forms of dynamic behavior. But the solution wasn't all that satisfying and the work to get there was fairly excruciating. Instead of continuing to work with the language, Mike ended up switching gears to start a new project entirely, Legion.
As an aside, one of the other lessons learned from the Sequoia project is that research projects tend to be tied to the Ph.D. students who work on them. In the Sequoia project, there was a gap in continuity between the first and second generations of students to work on the project. This meant that second-generation students like Mike basically ended up picking up a big legacy code base on their own. This turned out to be a huge drain on productivity and was among the reasons that Sequoia was eventually abandoned.
The biggest decision made about Legion up front was that it had to be dynamic. This was a huge risk because the analysis performed by Sequoia simply could not have been performed at runtime with reasonable cost. Sequoia relied on knowing every single task that would ever execute and the exact bounds of every array input and output. This meant that Sequoia could do perfect alias analysis on these input and output arrays. Because Legion would be a dynamic system, any analysis would have to be performed during the execution of the program. There was no way we were going to do this in Legion without fundamentally changing the abstractions of the programming model, and frankly it wasn't obvious up front if this would even be possible to make tractable.
One of the central insights of Legion was to [explicitly reify the notion of data partitioning](http://legion.stanford.edu/pdfs/oopsla2013.pdf) in the programming model. Sequoia didn't need this because it simply knew the inputs and outputs to every task. Legion couldn't afford to pay the $O(N \log N)$ cost to perform this analysis at runtime. By lifting partitioning into a first-class primitive, we were able to radically reduce the runtime cost of this analysis by leveraging what the user already knows about their data partitioning. Originally this required users to declare certain properties of their partitions, such as disjointness, in order to make things efficient. Eventually we discovered a [partitioning sublanguage](http://legion.stanford.edu/pdfs/dpl2016.pdf) which allowed us to statically discover most of these important properties, and to provide many static guarantees even in the presence of data-dependent behavior.
The other big insight of Legion was what we call index tasks. Index tasks are basically collections of tasks that can be described in $O(1)$ space. This turns out to be essential if you want to scale your system to very large distributed machines. Any representation which is $O(N)$ inevitably becomes a scalability bottleneck, either because of memory-capacity constraints or because of runtime complexity. By making the index task representation $O(1)$ we were able to design an [optimization to keep the runtime analysis complexity $O(1)$ as well](http://legion.stanford.edu/pdfs/cr2017.pdf).
One consequence of being a dynamic runtime system is that Legion itself isn't able to optimize the low-level code executed by the system. It seems to be a general principle that the closer you are to the hardware, the more static you need to be. Conceptually, Legion solves this problem by splitting it into two parts: the kernels, or low-level code that has to execute as efficiently as possible on the processor, and the orchestration of the parallelism that exists between those kernels. Legion focuses nearly entirely on the upper layer, leaving the lower layer to be handled by programmers. We've recovered some of this capability via the [Regent programming language](http://regent-lang.org/) which sits on top of Legion, but we haven't taken this nearly as far as Sequoia did.
Despite Legion being a dynamic runtime system, we still put a lot of effort into making sure that it had a principled foundation. Legion is one of the few runtime systems I'm aware of that has its own [type system and operational semantics](http://legion.stanford.edu/pdfs/oopsla2013.pdf). This also made it relatively easy to develop Regent.
With Regent in some ways we've now come full circle, since we're back to having a programming language again. However, one key difference between Regent and Sequoia is what happens when the compilers run into behavior they can't analyze. In Sequoia, it's of the utmost importance that the compiler understand your program, because if the compiler can't prove that parallelism is safe, then it will serialize that part of the program. In contrast, if Regent can't determine that parallelism is safe, it will simply fall back to the runtime system. This means that in general Regent suffers from fewer performance cliffs, or places where performance can suddenly (and sometimes inexplicably) degrade.
There were some lessons we failed to learn from Sequoia. Sequoia made heavy use of [graph-based IRs](http://theory.stanford.edu/~aiken/publications/papers/ppopp0...) for its compiler. This was a design I copied when developing the [RDIR format](https://github.com/StanfordLegion/rdir) for Regent, a decision I now consider to be a mistake. Graph-based IRs are superficially appealing because they seem to represent the fundamental invariants that you want to reason about. That is, if two tasks are mutually unreachable in the IR, then they are provably safe to execute in parallel. However, working with the graph-based IR as a representation of the program turns out to be a huge pain. RDIR has been by far the biggest source of bugs in Regent and continues to be the largest maintenance burden. The other big motivation for a graph-based IR, that it makes certain aliasing properties of the program more explicit than a traditional CFG, is something that I now think could be done adequately well in more traditional formats.
One final note about hardware support: Sequoia made a big bet on the [Roadrunner supercomputer](https://en.wikipedia.org/wiki/IBM_Roadrunner). Roadrunner was an excellent match for Sequoia's programming model, because it used scratchpads (instead of caches) for the compute-heavy processing units. This meant that data (and even instructions!) had to be explicitly copied in and out of the memories of the individual processors, something which Sequoia excelled at. Unfortunately, Roadrunner ended up not being so representative of the machines that would follow, and the effort required to retool all the infrastructure ended up being a big drain on the project. Though it would not surprise anyone if these ideas ended up resurfacing in the future, a project like this simply can't afford not to work on contemporary hardware.
In contrast, Legion made its bet on GPUs and heterogeneous architectures more generally. So far this appears to be turning out well and has positioned us to work well on architectures such as [Summit](https://en.wikipedia.org/wiki/Summit_(supercomputer)). However, investment in technologies such as LLVM means that we're also in a better position to adapt if future technologies end up going in a different direction.
How does Legion differ from languages like Halide which separate scheduling from the algorithm?
How much of this could get handled by PGO (profile guided optimization) to observe locality and access patterns?
Do you think AI has a place in directing scheduling given enough semantic information so that "bad hypothesis" can get pruned?
Does it rely on a static description of machine (latencies and bandwidth) or does it evolve over time? Can it handle things that are dynamic but are thought of as static like memory bandwidth in a cloud environment?
How does Legion compare to cache oblivious techniques?
Are there changes at the processor level that could fuse operations given the existence of data in the cache, much like hyperthreading can do a context switch on a cache miss?
Do we need to modify what constitutes a Basic Block? Is modern hardware too low level?
Have you read the "Collapsing Towers of Interpreters" paper and do you think it applies to your work?
Wow, this is a tough set of questions! Some initial thoughts:
> How does Legion differ from languages like Halide which separate scheduling from the algorithm?
Yes, that's one of the similarities. Legion, Sequoia, and Halide all share a separation of the specification of the algorithm from its mapping to any particular architecture.
The biggest difference between Legion and Halide, aside from the fact that Halide is much more domain-specific to image processing, is that Legion (being a dynamic runtime system) focuses more on higher-level orchestration of tasks executing in a distributed system.
> How much of this could get handled by PGO (profile guided optimization) to observe locality and access patterns?
Yes, this is called the Inspector/Executor technique. While it's an old technique, the most recent general-purpose implementation that I know of is this paper: [Code Generation for Parallel Execution of a Class of Irregular Loops on Distributed Memory Systems](http://web.cse.ohio-state.edu/~rountev.1/presto/pubs/sc12.pd...).
The main place where it struggles is memory capacity. It turns out that the profiles, when you're running on large distributed systems, can become very, very large. Keep in mind that modestly sized simulations these days run on hundreds to thousands of machines, and the largest run on tens of thousands of machines. So it's not hard to see why this can be problematic.
> Do you think AI has a place in directing scheduling given enough semantic information so that "bad hypothesis" can get pruned?
Yes. One nice property of Legion's approach to mapping is that no matter what mapping is chosen, the system will guarantee that the computation is still executed correctly. So the worst you can do is make yourself slower.
I think mapping, especially in the distributed setting, is one of the big unsolved problems. For domain-specific settings we have some preliminary answers. E.g. see this paper, which applies MCMC search to find optimal mapping strategies for DNN execution: [Beyond Data and Model Parallelism for Deep Neural Networks](https://arxiv.org/pdf/1807.05358.pdf)
> Does it rely on a static description of machine (latencies and bandwidth) or does it evolve over time? Can it handle things that are dynamic but are thought of as static like memory bandwidth in a cloud environment?
Legion's model of the machine happens to be static at the moment, because that's the most expedient to implement, but it was explicitly designed with dynamic behavior in mind. One of the biggest cases is where you lose a machine completely (e.g. because it dies, or your spot reservation gets revoked). I'm not sure if more fine-grained behavior can be exploited in a useful way; e.g. temperature fluctuations in a CPU might very well influence performance, but unless you can predict them I don't see what you can necessarily do about that. For many of these problems, being more asynchronous helps, because at least you can turn latency-limited problems into throughput-limited ones.
> How does Legion compare to cache oblivious techniques?
I'd say that cache oblivious techniques are particular algorithms (or perhaps mapping strategies) that you could implement in Legion, but which Legion is agnostic to. Legion provides mechanism but tries to avoid hard-coding any particular policy.
> Are there changes at the processor level that could fuse operations given the existence of data in the cache, much like hyperthreading can do a context switch on a cache miss?
We don't do this on quite such a low level, but we can do it at higher levels. Mappers can select the ordering of tasks on a processor, as well as the placement and layout of data, which is in theory sufficient to play with this. In the past Legion's overheads were too high to really think about this, but we've made some significant improvements in overhead recently which could allow us to start to think about things at this granularity.
> Do we need to modify what constitutes a Basic Block? Is modern hardware too low level?
I know there are people thinking about this, though I'm not one of them. I can't recall any references at the moment.
In general, I think systems like Legion make us much less dependent on caches. If the system manages data movement in and out of explicitly managed memories, that gives us a lot more freedom to play with the architectures. And increasingly I think we'll need this flexibility if we're going to keep driving performance improvements. As one of my friends once said, "x86 is a platform for making shitty code run fast", but perhaps we don't have that luxury any more.
> Have you read the "Collapsing Towers of Interpreters" paper and do you think it applies to your work?
Haven't read it yet, but it sounds like it shares some ideas with [Terra](http://terralang.org/), which we do use heavily. In fact, the Regent language is built entirely in Terra. In general, these sorts of techniques make it much faster to build new languages; while it's not really in my direct line of research, I am very grateful people are working on these things.
Thank you for such a detailed response, I am grateful.
You mentioned Terra, which is an inspiration of mine. I really enjoy mixing high-level and low-level code and being able to do so in the same program, which is one of the major affordances of LuaJIT.
Hugo M. Gualandi and Roberto Ierusalimschy have released the core Lua team's version of Terra, in Pallene [1][2], which has the austere aesthetic of Lua applied to the ideas of Terra. Pallene looks to be the next incarnation of Titan [3].
P.S. Shameless plug: If you're interested in working on this sort of stuff, we're hiring for full-time staff scientists as well as interns. Drop me a line! (Contact in profile.)
3rd party vouch for shameless plug: I was an RA in a minor role on this team, and these are some of the friendliest and most approachable people I've had the pleasure to work with in academia. (Hi Elliot! \o/)
Basically you can write an algorithm that is independent of the exact data layout, and then the compiler has some flexibility to work with.
I haven't worked with it at all, just read the paper a while ago. But I'm pretty sure they did something with SoA/AoS transformation (structure of arrays / array of structures). At the very least, the compiler can make the data smaller, which fits better in cache. Not sure if it can do anything more advanced than that (i.e. scaling to multiple levels of memory hierarchy rather than compacting data).
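For anyone unfamiliar with the transformation, here's a generic illustration of the AoS vs. SoA layouts; this is not Sequoia code, just the idea (the particle struct and sizes are made up):

```cpp
#include <cstdio>
#include <vector>

// Array of structures (AoS): the fields of each particle are interleaved, so a
// pass that only needs x still drags y, z, and mass through the cache.
struct ParticleAoS { float x, y, z, mass; };

// Structure of arrays (SoA): each field is contiguous, so a pass over x
// touches only the bytes it needs and is easier to vectorize.
struct ParticlesSoA { std::vector<float> x, y, z, mass; };

int main() {
    std::vector<ParticleAoS> aos(1000, ParticleAoS{1, 2, 3, 4});
    ParticlesSoA soa{std::vector<float>(1000, 1), std::vector<float>(1000, 2),
                     std::vector<float>(1000, 3), std::vector<float>(1000, 4)};

    float sum_aos = 0, sum_soa = 0;
    for (const ParticleAoS& p : aos) sum_aos += p.x;  // pulls whole structs in
    for (float x : soa.x) sum_soa += x;               // dense, sequential reads
    std::printf("%f %f\n", sum_aos, sum_soa);
}
```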
My memory of Sequoia is that it's very very far semantically from "Python" (mentioned in the original article). Their example was matrix multiplication, and the language had much more of a functional flavor (perhaps unsurprisingly). I'll have to go back and check it out again.
You can also think of MapReduce as a programming model that is locality-aware, and there have been several papers about multicore (not distributed) MapReduce.
You partition your data into smaller sections, then a user-defined "mapper" runs on each segment. The shuffle is an efficient distributed sort which you don't have to write -- it can be specialized to different problem sizes. And then the user-defined reducer also operates on a (different) small partition of data.
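A toy single-machine sketch of that shape (the word-count example, partition contents, and use of a plain sort as the "shuffle" are all made up for illustration):

```cpp
#include <algorithm>
#include <cstdio>
#include <map>
#include <string>
#include <utility>
#include <vector>

int main() {
    // Pre-partitioned input: in a multicore setting each partition would be
    // sized to fit in cache and handled by its own worker.
    std::vector<std::vector<std::string>> partitions = {
        {"cache", "memory", "cache"}, {"memory", "numa"}};

    // Map: a user-defined mapper emits (key, 1) pairs per partition.
    std::vector<std::pair<std::string, int>> emitted;
    for (const auto& part : partitions)
        for (const auto& word : part) emitted.emplace_back(word, 1);

    // Shuffle: grouping by key, here just a sort the user never writes.
    std::sort(emitted.begin(), emitted.end());

    // Reduce: a user-defined reducer folds each key's run of values.
    std::map<std::string, int> counts;
    for (const auto& kv : emitted) counts[kv.first] += kv.second;

    for (const auto& kv : counts)
        std::printf("%s %d\n", kv.first.c_str(), kv.second);
}
```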
I think the real problem is that the ISA doesn't really expose the memory hierarchy to the programming language, not necessarily that the programming language doesn't expose it to the programmer. GPGPU might change that but I'm not entirely familiar with it.
One of the differentiating features of Jonathan Blow's language, Jai, is supposed to be AoS/SoA transformation. I wasn't aware other languages were experimenting with this as well.
I thought it was something fairly specific to game programming, but I guess it shouldn't be surprising that other performance sensitive domains might also find it useful.
That was true long ago when he demo'd such a feature, but it seems to have since disappeared from the language. Jon claims there's "something better" that replaces it, a mysterious yet-to-be-revealed feature.
Welp, architectures don't really "expose" cache characteristics either in a first-order fashion; they also go along with the idea that it's all just "memory" like you say. Obviously you can query your cache characteristics but there's only so much you can do.
Some embedded systems had scratchpads and it is possible to 'wire down cache' on some architectures to similar effect. But it would be tricky in general at the language level.
I have a project half on the boil that's aiming to produce a language aimed at these kind of usages. I'm figuring that the supposed insult hurled at C ("structured assembly") isn't really an insult at all but that someone should actually do that. At the moment C, with its giant raft of Undefined Behavior, isn't really that language any more.
On GPUs you can program the L1 cache at least. I’ve never had much success with it though, as on more recent architectures I found the default L1 and the schedulers do a good job of keeping occupancy high, and by wiring it you either don’t win anything or it often even gets slower - and there’s no way to know in advance since the scheduler and cache architecture are largely a trade secret. In general, GPUs are optimized to use their high memory bandwidth through an extreme amount of hyperthreading, and the number of possible threads goes down with the resources your kernel occupies.
which you can use together to either make sure that 2 variables are not in the same cache line (if you are making a lock-free list for example) or to make sure at compile-time that 2 variables are close enough to share a cache-line.
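If the two features meant here are C++17's std::hardware_destructive_interference_size and std::hardware_constructive_interference_size (a guess; the parent comment isn't quoted, and compiler support for these constants varies), the two uses look roughly like this sketch:

```cpp
#include <atomic>
#include <new>  // hardware_{destructive,constructive}_interference_size (C++17)

// Two counters written by different threads: padding each to the destructive
// interference size keeps them on separate cache lines (no false sharing).
struct Counters {
    alignas(std::hardware_destructive_interference_size) std::atomic<long> a{0};
    alignas(std::hardware_destructive_interference_size) std::atomic<long> b{0};
};

// Two fields always read together: keeping them within the constructive
// interference size makes it likely they share a single cache line.
struct HotPair {
    int head;
    int tail;
};
static_assert(sizeof(HotPair) <= std::hardware_constructive_interference_size,
              "head and tail should fit in one cache line");

int main() {
    Counters c;
    HotPair p{0, 0};
    return static_cast<int>(c.a.load()) + p.head + p.tail;
}
```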
I see Halide and several interesting projects. Let me know if you find anything interesting :)
It does seem to confirm that locality aware programming models are probably going to look like "big data" programming models (i.e. MapReduce and family).
As does C++ as of C++11. [0]
For heap allocations, one used the POSIX (not stdlib) posix_memalign until C++17/C11, when std::aligned_alloc became standard. [1] The other option was manually aligning heap memory; it’s so much more convenient to use a transparent aligned allocator.
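A minimal sketch of both routes (the 64-byte alignment and buffer sizes are arbitrary choices for illustration, and posix_memalign assumes a POSIX system):

```cpp
#include <cstdio>
#include <cstdlib>

int main() {
    // C11 / C++17: std::aligned_alloc; the size must be a multiple of the
    // requested alignment (here, one 64-byte cache line).
    void* a = std::aligned_alloc(64, 64 * 16);

    // Pre-C++17 POSIX route: posix_memalign; the alignment must be a power of
    // two and a multiple of sizeof(void*).
    void* b = nullptr;
    if (posix_memalign(&b, 64, 1024) != 0) b = nullptr;

    std::printf("%p %p\n", a, b);
    std::free(a);
    std::free(b);
}
```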
Distinguishing stack versus heap memory accesses and allowing processors to not have to, e.g., check stack reads against pending heap stores could certainly have performance benefits in the memory system. The Reduceron does that since Haskell doesn't have the sort of free pointers that make this infeasible in C:
Assembly typically requires you to declare your own alignments. There are some higher-level assemblers, with macros and things, but you're right, there's rarely an abstract language offering low-level memory specification.
I also dislike the tendency to only expose the abstract view, although I think that trend is due to the difficulties involved in designing a language and compiler whose job is to output code for multiple architectures. Well, at least I hope that's why, because I don't know why you'd choose to offer less control in your language otherwise.
I don't think it's a dumb question. And I think you're right: the closer you get to the way the CPU you're targeting likes its memory managed, the closer you're heading towards assembly.
I have notions that there can be something 'between' assembly and, for instance, C, but that's a daydream of mine without any details worked out yet.
I’m not sure you’re actually heading down the path of assembly in terms of the language itself, but you’re definitely moving towards the architectural lock-in of assembly.
But maybe that's fine if the language can express architecture-specific decisions (like memory layout) and abstract them away in a more sane fashion than ifdefs, and the rest of the codebase can maintain the pretense of portability.
Low-level memory work, if you care about things like data locations and exact layout of memory, is possible in practice with C. There are cases where you can run into aliasing problems, but controlling the actual memory layout is no problem at all. (There are some limitations.)
niftich, chubot: Thanks for the pointers to Sequoia and citations thereof. It looks very interesting. That is the sort of thing I had in mind when I asked my question. There are a lot of citations to follow up on.
the8472, RcouF1uZ4gsC, stochastic_monk: Thanks for highlighting the language-level alignment support in Rust and newer revisions of C and C++.
glangdale, regarding "I have a project half on the boil that's aiming to produce a language aimed at these kind of usages. I'm figuring that the supposed insult hurled at C ("structured assembly") isn't really an insult at all but that someone should actually do that." -- That's great! I would like to see that language.
There's interesting work being done by the guys working on the Unity game engine. Their new data-oriented programming model (still based on C#) promises efficient packing of objects into arrays of properties which (supposedly) results in massive performance gains thanks to sequential processing. They also have a custom compiler optimized for this scenario. However, AFAIK there are no plans to provide this separately from the engine (on the other hand, they've been separating out a lot of functionality into optional modules so you can go really light).
XLA, the JIT compiler in TensorFlow, could hide this from you. It's a thing of beauty. It can vectorize along multiples of the memory hierarchies of CPUs, GPUs, and the family of TPUs.
It seems to me that it is generally better to keep architecture-specific memory commands as part of a platform-based library that is imported into the code depending on where it is being compiled (separation of concerns). Imagine the number of inappropriate/unnecessary commands you would create in a language if you moved the superset of all possible memory commands on all possible architectures into the language itself. There would be enough to make it ugly.
If you don't need the full precision or range of a 64-bit integer, you can pack multiple smaller values together to get multiple integers transferred in a single cache line read. Compressing in this manner also means you are more likely to get a cache hit.
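A rough sketch of the packing idea, with arbitrarily chosen 16-bit fields: one 8-byte load brings in four values, and a 64-byte cache line holds 32 of them instead of 8.

```cpp
#include <cstdint>
#include <cstdio>

// Pack four 16-bit values into one 64-bit word.
std::uint64_t pack4(std::uint16_t a, std::uint16_t b, std::uint16_t c,
                    std::uint16_t d) {
    return static_cast<std::uint64_t>(a) |
           (static_cast<std::uint64_t>(b) << 16) |
           (static_cast<std::uint64_t>(c) << 32) |
           (static_cast<std::uint64_t>(d) << 48);
}

// Extract the i-th 16-bit field (i in 0..3).
std::uint16_t unpack(std::uint64_t w, int i) {
    return static_cast<std::uint16_t>(w >> (16 * i));
}

int main() {
    std::uint64_t w = pack4(1, 2, 3, 4);
    for (int i = 0; i < 4; ++i)
        std::printf("%u ", static_cast<unsigned>(unpack(w, i)));
    std::printf("\n");
}
```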
Caches are significantly less dense than normal RAM. You wouldn't be able to fit gigabytes in the same area. They would probably be more expensive as well.
It's really hard. Remember the Itanium and the Cell. You can build it, but they may not come.
GPUs, though. Those have turned out to be very successful, they can be parallelized as much as you're willing to pay for, and they're good for some non-graphics tasks.
Much of machine learning is a simple repetitive computation running at low precision. Special purpose hardware can do that very well.
> Q: I feel like deja vu - at Hot Chips, Intel introduced VLIW-concept Itanium that pushed complexity onto the compiler. I see traces of that here. What are you doing to avoid the Itanium traps? How will you avoid IP from Intel?
> A: Itanium was in-order VLIW, hope people will build compiler to get perf. We came from opposite direction - we use dynamic scheduling. We are not VLIW, every node defines sub-graphs and dependent instructions. We designed the compiler first. We build hardware around the compiler, Intel approach the opposite.
That captures a lot of my take on Itanium at the time: despite being "VLIW" it took on what at the time was a vast amount of CPU controller complexity without an obvious design win or clear high-performance compilation model. As an outsider, it felt like an extremum of Intel's flavor of design hubris.
It's interesting to compare the Itanium's approach to a VLIW contemporary from TI: the C6x VLIW DSP ISA. The C6x ISA took on very little controller complexity. Instead, it exposed a very clear operational and optimization model for the VLIW instruction blocks. An optimizing compiler did a decent first-step job at generating code. That was complemented by clear documentation of the optimization workflow, from compiler-stage iteration down to good guidance on the nuts and bolts of hand-optimizing. By comparison, IA64 was just ... a big head-scratching moment all around. I never got to work with C6x code professionally, but dang it looked like a blast to code and optimize for.
Well, past experiences are not that useful right now. I don't think the Itanium or the Cell were great architectures, but it's clear that their most obvious failure mode (mainstream architectures improving faster than them) isn't a showstopper anymore.
For some guess on the next big thing, I would imagine that stuff that avoids the need for memory coherence has nice odds.
The most obvious failure for the Cell architecture was that it was horrendous to program. That was partly caused by the lack of memory coherence between the PPE (main core that ran the OS) and the SPEs (the faster cores meant for offloading computations to), but also caused by a lack of developer tools.
The main lesson I would take away from Cell is that you want to design an architecture that allows gradual performance refinement. With the Cell, it was more of a step function: good performance came from using the SPEs well, but using the SPEs well was hard. There wasn't much of an in-between.
Yes, I misspoke by saying the SPEs lacked memory coherence, since you used the same physical addresses to load memory into the SPEs' local storage. What I meant is that the SPEs are divorced from the normal memory hierarchy. You needed to explicitly retrieve and send all memory to and from each SPE, which was a significant burden on programmers.
Does this mean it was a mistake or just ahead of its time? I think software-managed memory tiers have been a dream for advanced architectures for a very long time. The problem, perhaps is assuming software-managed means programmer-managed.
In a single system there is precedent in virtual memory systems using software-managed page mappings rather than static page table data structures. In HPC there is of course the precedent of distributed parallel systems with message-passing rather than shared memory abstractions. And the separation of GPU memory from system memory is certainly common today. Similar things are happening in software-defined storage (i.e. RAM/SSD/disk/tape tiering).
Computation libraries and frameworks help straddle the gap between application programmer needs and current language/runtime/architecture semantics. I think one problem of current markets is that people tend to want to evaluate hardware independently of software, or in terms of yesterday's software.
There is also a strange history of wanting to discard the software/firmware offered by the hardware vendor (for being insecure/inept/whatever), but blindly accepting the complex hardware design that enables our naive/platform-independent software to run well. The whole Spectre debacle shows how that may have been wishful thinking that we can divorce the software and hardware designs...
Does anyone else have real-world Itanium experience? We've got a couple 5+ year old HP Integrity servers running OpenVMS on Itanium at work that we use to batch process large ASCII vendor files in an ETL process written in C. They certainly don't embarrass themselves. We'll be connecting them to a new Pure Storage SAN in a few months and the IT guys are really excited to see what happens to performance.
I take the same exact C code, compile it with Visual Studio, and then run the same ASCII files locally on my 7th-gen 4c/8t i7 desktop and am not seeing any improvement. That's about the best I can do as far as benchmarking goes.
I used Itanium from 2003-2006 for scientific computing on a big cluster. For that purpose it was much faster than Intel's Xeons of similar vintage. It was also significantly faster than the MIPS and POWER systems we had. Caveats:
- The simulation suite that I used most heavily was developed in-house at the same lab that bought the hardware. It was profiled and tuned specifically for our hardware. The development team was a real software development team with experience writing parallel simulation code since the early 1990s. It wasn't just a bunch of PhD students trying to shove new features into the software to finish their theses.
- There was also close cooperation between the lab, the system vendor (HP), and Intel.
- Other software (like all the packages shipped with RHEL for Itanium) didn't seem particularly fast.
- God help you if you needed to run software that was available only as binary for x86. The x86 emulation was not fast at all.
It was great for numerical code that had been specifically tuned for it. It was pretty good for numerical code that had been tuned for other contemporary machines. Otherwise I didn't see particularly good performance from it. I don't know if it really was a design that was only good for HPC or if (e.g.) it also would have been good for Java/databases, given sufficient software investments.
Maybe it would have been competitive against AMD64, even considering the difficulty of architecture-switching, if it had not been so expensive. But I'm not sure Intel had wiggle room to price Itanium to pressure AMD64 even if they had wanted to; Itaniums were quite big, complicated chips.
At my previous job we used Itanium servers for the Secure64 DNS platform [1]. It performed well; the logic behind Itanium is security, not performance (according to Secure64), but performance was never an issue. I do know the hardware support for RSA did give nice performance for DNSSEC signing.
The logic behind Itanium was mostly to have a 64-bit architecture that was specific to Intel without any cross-licensing to other vendors like AMD. And which would also replace proprietary RISC flavors like PA-RISC.
It did also have a fair bit of security acceleration built-in. As I recall, one of the principals behind Secure64 was one of the HP leads associated with the original IA64 announcement. Secure64 pushed Itanium for security applications when it was becoming clear that it was never going to be a mainstream 64-bit architecture.
Even SIMD, a feature of processors for over 20 years, is not handled well by compilers for automatic optimization. It seems like all code taking advantage of SIMD either drops into ASM, or uses a library from ARM/Intel to give descriptive C function names for the underlying ASM.
I wonder why high level languages themselves don't add syntax support for wide-math instructions? I understand why loop unpacking is a little tricky, so why not let the programmer take care of it?
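For reference, this is roughly what the intrinsics route looks like with Intel's AVX header; a small sketch assuming an AVX-capable x86 target (compile with -mavx):

```cpp
#include <immintrin.h>  // Intel's intrinsics: descriptive C names for SIMD ops
#include <cstdio>

// Add 8 floats at a time using AVX intrinsics. There is no syntax for this in
// the base language; you either write this, hope the auto-vectorizer finds it,
// or drop to assembly.
void add8(const float* a, const float* b, float* out) {
    __m256 va = _mm256_loadu_ps(a);
    __m256 vb = _mm256_loadu_ps(b);
    _mm256_storeu_ps(out, _mm256_add_ps(va, vb));
}

int main() {
    float a[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    float b[8] = {8, 7, 6, 5, 4, 3, 2, 1};
    float out[8];
    add8(a, b, out);
    for (float x : out) std::printf("%g ", x);
    std::printf("\n");
}
```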
It's really hard. Remember the Itanium and the Cell. You can build it, but they may not come.
GPUs, though. Those have turned out to be very successful, they can be parallelized as much as you're willing to pay for, and they're good for some non-graphics tasks.
Not all CPU architectures are equally successful. Likewise, not all highly parallel architectures will be equally successful. If progress in hardware is going to be characterized by more parallelization, things like Erlang Actors are going to become more important.
I vaguely remember my computer arch course in college describing how everything we have today essentially suffers from the von Neumann bottleneck, and that even if we implemented a lambda calculus machine, it would still suffer from this bottleneck. Has this been revisited in academia lately? I am a bit out of touch, but this always interested me.
The Cell was way ahead of its time. Its security architecture capabilities were almost perfect, but they screwed up the implementation and they failed to provide good programming tools. And it was very, very expensive.
I think that the sitch is different these days. Ideally, you’d have an end-to-end solution (hw, os, compiler).
These days you can build an architecture, fork llvm and add a new backend and get Linux running. At that point you can start selling.
Dynamic x86 interpretation could have provided competition as well. There were companies that did that, and had Intel moved to EPIC, others might have done what AMD did.
I fear, the more we advance, the more we are going to become permanently entrenched in The Way Things Are.
To really shake things up at a fundamental level will probably require running into an alien species that does basic things differently than us.
Even creative people with lots and lots of spare time and resources at their disposal will still be too "colored" by existing human knowledge and established practices to really try anything new (and without falling into the trap of needlessly reinventing the wheel.)
I want a language where you can express time and constraints on time within the language and the type system. I want to be able to guarantee program correctness, pre-calculate fine-grained energy usage, optimize for energy/power-saving mode usage, interface with asynchronous signals, and whatnot -- all with respect to program execution at the ISA-level within some hard real-time constraints.
Compilers make optimizations using detailed instruction timing information, but as far as I know, these details do not bubble up to the surface in any programming language.
It may be wise to keep these details under the surface at the compiler level, but for 8-bit architectures, it would be awesome to have a language where you have explicit control of time.
Part of the reason most languages obscure this is because it's a moving target.
If a language let you say, "this chunk of code here should run in 7 cycles", what happens when a new optimization finds a way to reduce that, or a new architecture comes up where that operation gets slower but lots of others get faster?
I'm not arguing against your desire, just explaining that it's not unreasonable that we're where we are now. We've gotten so used to language portability, that it's good to remember how painful it can be to lose that. It's no fun having to rewrite all your code every time a new chip comes out.
This could only ever be doable with extremely simple architectures anyway. Off the top of my head, add in just one of branch prediction, micro-op fusion, out-of-order execution, pipelines and pipeline stalls, or cache misses, and this becomes impossible. Of course, this assumes you even know which CPU you are targeting and its specific instruction latencies.
That's already an extremely niche set of processors. Further, the number of bits of code you're likely to care about this kind of extremely precise timing for, you'll either examine the emitted assembly, or just hand-write the ASM yourself.
It seems like a huge amount of effort for an extremely niche scenario. Remember, the ISA is still just an abstraction, after all.
To add to that there's also the difference between cycles spent executing an instruction and how many of those instructions can be executed at once in the pipeline. So there is a difference between executing a set of instructions once versus executing them millions of times.
This is actually why you want the compiler to track it.
You write an algorithm that seems reasonable. And you encode timing constraints into the thing. Now you re-target to a different machine, and the compiler re-checks those constraints. A much cheaper way of dealing with weird timing bugs than doing a post-mortem of why the fuel valves didn't quite close on the day of the rocket test.
But this only works if the CPU is knowable to the compiler. Which is why it is only even remotely feasible in the embedded world (where it is also the most useful).
This is one of the most actively researched fields in computer architecture/engineering. Hard real-time requirements have been extensively researched for ISAs, compilers, network-on-chips, concurrency models, etc.
Personally, I was briefly involved in the T-CREST [0] project, where a complete ISA, network-on-chip, and compiler were developed in tandem to support deterministic timing and well-understood worst-case execution times (WCET).
As an example, the ISA explicitly defines multiple caches with different write-rules. So for example the stack cache is write-back to lower WCET.
The whole project is available as a VM image[1] where the HDL can be compiled to an emulator and/or synthesized to hardware.
When I was a student in the late '90s, folks at the Uni wanted to do that stuff with Ada. I suspect that some limited-but-useful version of this does exist by now in niche sub-branches of safety-critical embedded programming. (Though the commercially used ones are as likely to be static analysers for C as they are to be an Ada compiler.)
I mean it's nice for compilers to count instructions, but that is far more likely to work on some "primitive" embedded CPU than it is to work on the modern spectre-pipelined goliath CPUs that we use in personal computers and datacentres.
So there are languages that can do these things individually:
Dedalus is an evolution of Datalog, incorporating the notion of time in the type system. You're better off working with an abstract version of time, for portability reasons - the compiler can look at the timing specs to optimize resource usage; otherwise you're into VHDL territory and your only recourse is simulation of your system to reason about it, due to complexity.
Fine grained energy usage would be specific to say the SOC you're using, eg. for the CC1310 there are specific modes for low power. The compiler could potentially take care of this, but you would probably require an abstract state machine for it to optimize, eg. identify periods to go in to low power mode.
Hard realtime nearly precludes many languages due to their need for GC - there are a few realtime GCs around, but really it's better to use a language that has managed memory without GC.
So after all that, you're down to a handful of suitable languages for realtime eg. Rust or Composita.
Lastly, guaranteeing program correctness is something you need to do in a high level language, to be able to capture the semantics you're trying to prove. ie. you need a language that can express higher order logic. Whether this proof step is intrinsic to the language is up to you, but you then need to map the higher-level model to your low level language. Several projects do this eg. CakeML, but the full self-verified OS doesn't exist as yet.
Interestingly there is a linguistic solution for all of this which I'm working on, just not got anything concrete to share yet - maybe in 6 months :)
I'm very curious about your solution! If there's a place where I could follow the development, I'd be grateful for a link or, if it's not too much of a hassle, a ping to the email in my profile when you have something you want to show off :)
I'm grateful for the interest! Not really thought about a location - I'll probably use my github account when I've got some material, and ping you when it's there.
As an aside, most of the stuff I've got on my account is higher level eg. https://github.com/alexisread/noreml deals more with a new method of visually programming cross-platform UIs ie. using nodered dashboards and flow-based programming in a drag and drop fashion.
Main influences at the moment are Maru, Composita, Ferret and Maude:
Are you suggesting a language where you could hypothetically look at your IDE and it'll tell you via static analysis, "this code block takes x cpu units; on a Y core this is 0.95ns"?
There are lots of attributes that are provably impossible to statically determine about a computer program. Whether the program halts is the obvious example, but Rice's theorem generalizes this to effectively any attribute of a computer program you would be likely to care about.
Impossible to prove for any arbitrary program doesn't mean impossible to prove for programs that we actually care about, and it especially doesn't mean impossible to prove for a program where the programmer is intentionally trying to write a program where certain attributes are provable.
Any statically typed language necessarily has to prevent some well-formed programs from being represented. This language would just need to reject programs where it can't prove the attributes that it is suppose to prove in addition to programs that are actually malformed.
It is really ergonomics that prevents a language like this from taking off, not any theoretical concerns. Languages that let you prove a lot of complex properties at compile-time take too much work to use, but they are not impossible to create.
F-Star [0] and Idris [1] are two great languages which give you a large variety of static analysis, and yes, provide ways to prove totality.
Idris, for instance, disallows recursion unless it can prove an argument "smaller" in the recursive call. For example:
fib : Nat -> Nat
fib Z = Z
fib (S Z) = (S Z)
fib (S (S n)) = fib (S n) + fib n
In this case, the function pattern matches on natural numbers (Z stands for Zero and S for the one plus some other natural number). Each recursive call of `fib` has an argument which is smaller (closer to the base case), and thus this program must halt. Idris will not compile unless it can prove the program will halt. It, therefore, rejects all programs which would have halted but which it could not prove.
Rice's Theorem generalizes this to any implementation-independent attribute (i.e. properties of the idealized function on natural numbers a program represents). However, certain implementation-specific details such as run time (i.e. performance) are fair game. In particular it is quite simple to at least calculate lower bounds on run time. Simply choose a lower bound and then run the program and see if it halts before then.
If you can do that with your program then you can probably just precompute the actual result at build time. Many, if not most unexpected performance problems occur with unexpected or unconsidered inputs.
There are far less wasteful ways of calculating implementation-specific details. My "run until completion" was just an existence proof that general techniques exist. Things like symbolic execution, type systems, and other static analysis tools could in theory help here.
That being said I have personally yet to see non-toy static analysis tools for performance (although I suspect they exist at least in some limited fashion or operating in some constrained domains). This is probably due to the nature of certain pathological inputs and code paths as you point out.
There may nonetheless be hope if you guide the way users can write programs as a sibling comment points out.
There are many properties like that, but we can still prove many properties about a program despite non-deterministic behaviour, using hidden logic:
https://cseweb.ucsd.edu/~goguen/pps/hherb.ps
As such, a proof system which incorporates behavioural specification is highly desirable.
Approaches like the Abstract Algorithm (that you link to) turn out to just shift the costs elsewhere; in particular to the "book keeping" (IIUC nodes act differently depending on which "path" we're traversing, and keeping track of that exactly cancels out the benefits of optimal sharing).
This is already possible to a degree within existing type systems - here’s a small experiment tracking execution time of list comprehensions on the type level in Purescript [0].
Here's something that seems shockingly under-explored to me: languages that incorporate relational data structures natively.
We have SQL of course, but SQL is not a general purpose language and is (intentionally) often not Turing-complete.
I'm imagining something like Go, Ruby, JavaScript, or Rust with native tables, queries, and other SQL-ish relational data structures.
The long term goal would be to kill the age-old impedance mismatch problem and eliminate CRUD code. Toward this long term end the language runtime or standard library could actually contain a full database engine with replication/clustering support.
Data frames in R give you the relational model with a Turing complete language. I think of it as a better SQL, at least for analytics (as opposed to transaction processing).
Besides its data structures, R is surprisingly similar to JavaScript -- it has Scheme-like semantics but C-like syntax.
For certain problems, it's a pleasure to program in. You don't have to go through "objects" to get to your tables; you just have tables!
R itself has a lot of flaws, but the tidyverse [1] is collection of packages that provide a more consistent interface.
FWIW I agree that more mainstream languages should have tables natively. It is an odd omission. Somehow they got coupled to storage, but they're also useful for computation.
Agreed - improving SQL should be a research focus, e.g. new options to make it more logical, such as putting the FROM clause before the SELECT clause (much better for tooling); adding immutable options (to make it truly functional); optimizing for Optane persistent memory; adding native GUI/HTML support (like FoxPro) to make it much easier to wire up IO between tables and display/input; and adding NOT IN foreign keys.
PostgreSQL desperately needs auto-refresh and incremental materialized views, and bitemporal system/app timestamps.
SQL is already an incredible logical, functional language that operates at a much higher abstraction level than mainstream languages. It could be even more amazing, with not that much effort/cost relative to any other routes I can think of, as a path to more human-friendly, more expressive, faster, more reliable app development tools.
Not improved SQL! Modern languages with native relational data models. Not a DSL inside a language but something native to the language and one with its normal syntax.
The ideas within SQL are powerful but the syntax would need a dust off and would need to be integrated into a modern language.
There's a bunch of relational algebra systems in Haskell, and one in Ruby that gives better first-class control over these structures. However, the fact that SQL engines all communicate using SQL as text results in marshalling costs. There are also languages that have an embedded database, such as K+.
I think the current attitude could be summed up better as "Who cares if it's inefficient, yesterday's computers are fast enough!" I think that's true of most software, but not all -- there will always be people who care about the best performance, but we shouldn't expect them to be in the majority.
"there will always be people who care about the best performance, but we shouldn't expect them to be in the majority."
Given how machine learning is going to kick in in lots of fields quite soon there will be a lot of real world applications that are not ambivalent about performance (if not else then from energy use point of view).
We're going to get Clippy the Clipper 2.0, who can actually do something useful, and how long he spends figuring out his suggestions and how good they are will likely depend on the available CPU resources and how fast Clippy can think.
I would say Erlang is an interesting new language, except that it's not new. The idea of a functional language that is just very careful about its impurities and has a clever way of managing state without spinlocks or mutexes is a great thing. BEAM languages have almost no footguns and programming them feels like programming with inflated gutterball balloons. A chip architecture that was highly optimized for the BEAM would be great.
Who cares if it’s slow, tomorrow’s computers will be faster!
Well, that’s a misrepresentation of history; the Python ethos has always been to drop into C for performance-critical sections. That’s why NumPy won, and why Numba is highly competitive with Julia, and you can keep your existing code.
I have to say that most bottlenecks are from lazy designs, not wimpy hardware. As a thought experiment, suppose parallel processing scaled easily without worrying about Amdahl's Law and similar bottlenecks. So then we put 128 cores into our computers. Software companies would eat up every single core eventually if there's no penalty for inefficient coding. They don't pay our electric bill, we do. It's kind of like freeways: the more you build, the more the traffic increases to fill them up again.
It is "nice" to be able to slap components together like magic legos and/or a Dagwood Sandwich rather than smartly tune and coordinate everything, but doing such is often computationally inefficient. The speed of light will probably put a hard upper limit on this practice that no clever language or architecture can work around to a notable degree.
Software is a gas; it expands to fill its container - Nathan Myhrvold
Bad performance isn't really a software design problem; it's an economics problem (ditto for security). You could speed up the browser by 2x now, or reduce its memory usage by 2x, and websites would be exactly as slow after a few months.
I'm not trying to be negative here, just stating an unfortunate systems dynamic.
I generally agree: it's about weighing trade-offs, and calling software companies "lazy" is perhaps an oversimplification. However, I do believe the speed of light and the size of atoms put an upper limit on the degree to which you can keep throwing hardware at the problem. We may continue to find tricks such as quantum computing, but I suspect it will get ever tougher. There's a theoretical limit to how small you can squeeze data and to how fast it can transfer.
> We may continue to find tricks such as quantum computing
Quantum computing does not magically speed up general purpose computing tasks. We only know a handful of algorithms where quantum computers provide algorithmic complexity reductions, and even that does not mean they necessarily calculate fast in terms of IPC; they just make previously intractable problems solvable in reasonable timescales.
TL;DR: you won't pop quantum GPUs/CPUs into your computer and suddenly get 50 times the framerates in DOOM.
Some have suggested that certain common algorithms (or needs) can be rewritten into the types QC is good at. I'm not an expert so cannot confirm such. Answering that may await new algorithm discoveries anyhow.
Software companies are lazy in exactly the same way that the entire rest of the economy is lazy. Companies in every industry try to cut costs as much as they can.
Product developers make cost-benefit decisions all the time and much of the cost of slow or insecure software is externalized. A knee in the developer-visible cost-benefit curve represents the MVP everyone is so keen to release to market.
To be clearer, I would say that software is exactly as fast and secure as the market will bear.
There is probably an economics term I can cite to explain this better, like some kind of equilibrium (I appreciate any help explaining this).
But the sibling basically got it half-right -- software developers and designers are always making cost-benefit decisions.
The other half of it is that consumers tend to choose features rather than security. There are ALREADY faster and more secure alternatives to every piece of software you use -- you just don't use them because they don't have the features you want!
People vote with their feet, and what we end up with is not the fastest or most secure situation, because there are tradeoffs.
This applies to developers too -- not just end users. I mean why do people choose Python, PHP, or JavaScript? There are faster languages out there. There are languages that could help you write more secure applications.
But those languages have the features and library ecosystem we want, so we use them.
----
I'm one of those people who laments slow and insecure software. But doing my own open source project has kind of brought home the tradeoffs. I've been working for over 2 years on an open source Unix shell. It's written in Python [1], which is not exactly the right tool for the job. But if I wrote it in C or C++ or Rust, it would take 4 or 6 or 10 years.
So unfortunately it is too slow, but there's a tradeoff between being slow, and being fast but not existing and having nobody use it. (Although I have a plan to fix it, basically outlined in that blog post. Not sure how much effort it will take, but I think less than rewriting from scratch.)
It's an economics problem -- if it doesn't have the features people already use, then they'll have to rewrite their software. Rewriting shell scripts costs a lot of money and it doesn't have a lot of advantages. It's more economical for one person/team to emulate bash in a new shell, than for the entire world to rewrite their scripts.
This recent blog post has the same sentiment buried in there:
XHTML didn't succeed because it's more economical for a single browser vendor to make a really complicated parsing algorithm than it is for every single person in the world to change how they write HTML.
It's not like we don't KNOW how to design strict data formats, or we don't KNOW how to make secure software. (e.g. ask yourself why more people aren't using DJB's software.) It's a matter of economics -- how much effort are the people making decisions willing to put in, and how much does the market reward them for it.
We can certainly improve the situation by valuing secure and fast software, and that has happened in the past. Windows used to be an utter disaster security-wise, but Microsoft realized that it was more than the market would bear, and they changed their whole attitude toward security.
Although if you want to be cynical, you can also say they made the "right" move to ship first and make things secure later :-(
Not quite -- see my sibling response. It's probably more accurate to say that some people have thrown a lot of time and energy at the problem.
But the market does not value their software; we use slower and less secure alternatives.
Slack is a great example. There are probably 99 other chat services that perform better than Slack. (Apparently it makes fans spin on laptops, which is kind of shocking for a chat app.)
But the market doesn't necessarily care -- it values the features that Slack provides more.
Part of it is a bad network effect. I care about speed, but I might have to use Slack because the people I want to talk to on Slack don't care about speed.
I think the idea is that people stop optimizing for speed as soon as it gets "fast enough", and what "fast enough" means doesn't change at all if you make better tools and machines.
Faster machines and tools only really matter to people who are trying to do things that are currently never fast enough to put into production.
I believe at the end of next month there will be another talk where the Mill designers discuss the Spectre and Meltdown attacks in relation to the Mill (though the spoiler is that the talk mentions the point "and why Mill is immune," so I guess it'll be a very fun talk).
I do hope a lot that Mill will succeed, it's an incredibly promising architecture.
I do hope they just take the paper design to silicon and implement the damn thing, even if they only distribute simple Raspberry Pi-type boards with a slow version on it.
The author makes it sound as if there have been no material changes to programming languages recently, but unless you've been hiding under a rock you will know that this is not true.
He compares Python to C. Come on, please. It would have been more relevant to talk about the LLVM project and how it enables architects to build more expressive languages that can still make use of 30 years of optimization wisdom (most of which was built when CPUs were much slower).
What about mentioning languages like Rust, which are designed for high concurrency and zero overhead, and which let developers write software with the safety of Python but the performance of C?
What about mentioning that some old languages, like C++, have evolved through new language features AND guidelines into much more effective languages to guide us into the future of concurrent programming (the easiest way to enjoy the speedups of yesteryear)?
The lack of obsolescence is terrible news for Dell and HP. Ran into a client this afternoon still running his seismic workflows on an old HP EliteBook 8570w with a couple big monitors, big SSD and 16 Gigs of memory.
It would be nice if low volume foundry costs went down an order of magnitude. I'm in a bio lab right now, but I did CS in college and really enjoyed VLSI. If I want to turn a design into a prototype in my bio lab, it's going to cost hundreds of dollars, but if I want to turn a chip I designed into reality, it's going to cost thousands to tens of thousands of dollars even with specialized low volume foundries like MOSIS. I feel like it should be cheaper.
Generally speaking, that's where FPGAs excel. You get a good chunk of the speedup that comes with designing your own circuit at a lower cost than actual ASICs (at least for prototyping and low-volume batches).
A little off topic, but I greatly admire generalists: Dr. Patterson (along with Dr. Fox) taught a good Coursera course on agile web development with Ruby and Rails. Quite a departure for an originator of RISC architectures!
I think Dr. Patterson makes a great point on there being plenty of headroom in the increase in software efficiency.
It's way easier to push the limits on simple things. We need languages with fewer features and clear design, running on hardware with less exotic features.
I agree in general, and while I think that is completely possible in languages and operating environments, I'm not so sure it is in hardware, at least not without sacrificing a lot of performance. People respected by people I trust, who have a lot more domain knowledge than I do, don't seem to think it is.
I'm a VC and I've seen lots of pitches lately for various tech related to new architectures, be it GPU, FPGA or RISC. I'd add cryptocurrency to the list of workloads with insatiable computing demands.
Then an ASIC for sparse matrix multiply, plus Valiant's reduction of CFG parsing to matrix multiplication: a massive speedup to parsing workflows, and a boon to security by getting rid of hand-rolled parsers.
Do you see more pitches for source-code-level solutions, i.e., analyzing and modifying the huge legacy of existing code bases? This is a very difficult area, of course (both financially and technologically), but the payoff can be very large. Imagine a VMware-like transformation at the source-code level.
They cite some numbers by Gartner (from '99, when the Y2K scare was on everyone's mind):
"Cost of doing a manual migration: The Gartner Group notes that the cost for manual code conversion can range from $6 - $26 per LOC and is accomplished at a rate of 160 LOC per day (Gartner Group study "Forecasting the Worldwide IT Services Industry: 1999,1"). An assumption is that the migration is going between two relatively similar languages (e.g., not trying to go from COBOL to full object-oriented Java). Using Gartner's numbers, a million lines of code costs $6-$26 million to migrate.
Time to do a manual migration: Again, using Gartner's numbers, a legacy system with a million lines of code requires some 28+ man-years of labor to convert. With a team of 10, this takes 3 elapsed years if done well. With larger systems, one has larger teams. Larger teams require more interactions, slowing them down further. A 10 million line application simply doesn't have any practical manual migration due to time frames."
I've never used their services, but I do think this will continue to be a difficult yet beneficial option for migrating away from legacy/less secure software.
> the so-called Adaptive Compute Acceleration Platform (ACAP) will deliver 20x and 4x performance increases on deep learning and 5G radio processing, respectively, Xilinx claimed. The first chip, called Everest, will tape out this year in a 7nm process.
> The centerpiece of the new architecture is word-based array of tiles made up of VLIW vector processors, each with local memory and interconnect.
> tiles communicate with each other to create data paths that best suit their application.
I have had this fun idea for an odd architecture floating in my head for ages.
Lots of small processors with local work ram, cached ram, shared ram.
Each processor has a numerical id and communicates, by optical link, with every processor whose id differs in exactly one bit, plus one additional link to the processor whose id complements all of the bits. They send messages and fill their caches from the pool of shared ram.
Place even-parity processors on one board and odd-parity processors on another board facing the first, so all processors have line of sight to their communication partners. Messages go back and forth, with processors relaying a message by fixing one bit of its destination address at a time. All neighbours whose address is one bit closer are candidates for relaying; if any are busy, there are other paths to use.
This means the most hops a 2^19-core system would have to do is 9: if more than half of the bits are wrong, jump to the complement first.
So the example has half a million processors, each talking to 20 neighbours over optical links. Messages can be sent anywhere with a latency of up to HopTime*9, and filling a cache from anywhere in the shared memory pool would have a latency of twice that. If the speed of light is the latency factor, roughly 150 ns gets you 9 hops across a 5-meter gap; smaller is of course always going to make it better.
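A minimal sketch of that routing rule, just to make it concrete (my own illustration, not the poster's design; only the 19-bit IDs and the "more than half the bits wrong, use the complement link" rule come from the description above):

    /* One-bit-fix hypercube routing sketch.  Each node links to the 19
       neighbours differing in exactly one bit, plus its bitwise complement.
       Worst case: at most 9 bit fixes, plus possibly one complement hop. */

    #include <stdio.h>

    #define BITS 19
    #define MASK ((1u << BITS) - 1u)

    static int popcount(unsigned x) {
        int n = 0;
        for (; x; x &= x - 1) n++;
        return n;
    }

    /* Next node on the path from cur to dst. */
    static unsigned next_hop(unsigned cur, unsigned dst) {
        unsigned diff = (cur ^ dst) & MASK;
        if (diff == 0) return cur;              /* already there            */
        if (popcount(diff) > BITS / 2)          /* more than half wrong:    */
            return ~cur & MASK;                 /* take the complement link */
        return cur ^ (diff & -diff);            /* fix the lowest wrong bit */
    }

    int main(void) {
        unsigned cur = 0x00000, dst = 0x003FF;  /* 10 bits differ */
        int hops = 0;
        while (cur != dst) {
            cur = next_hop(cur, dst);
            hops++;
        }
        printf("%d hops\n", hops);              /* 10: 1 complement + 9 fixes */
        return 0;
    }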
This is the sort of thing that would also probably need a new language. I'm not entirely sure you could come up with an appropriate language without at least a simulation of the architecture.
Hop time is not constant. There will be longer and shorter optical links, and if you want the system to be synchronous, everyone would have to wait on the longer links.
As part of one of those "45 hardware startups" (in stealth mode) trying to tackle the problem, I think there is nevertheless still plenty of fat to trim just on the software side... I mean, just look at that very page: 200 MB [edit: in memory consumption] for a single static article!
Information-theoretically, at least, I bet there is another 10-100x speedup opportunity right there (and the next multi-billion-dollar industry, perhaps)...
IMHO, machine learning is a tiny aspect of the issue. The main issue is that, strangely, C is still the desert-island language (first hit: http://www-cs-students.stanford.edu/~blynn/c/intro.html). There is a huge intimacy between the C language and the CPU; they have evolved together during the many years of Moore's law. I think there should exist a far better language for accessing hardware, and the Linux kernel should migrate to it. And the intimacy between OS and hardware should drive CPUs (and GPUs) in better directions. Maybe I am dreaming.
Only tangentially related to the article but... I'm left to wonder about the photograph selected. His left hand appears to be on a stack of 20-year old HP 9000s (maybe rp3440, pre-grey box?), and standing across from a rack full of ancient 4U systems mounted into 2-post relay racks. The white blanking panels are early 2000's-era HP kit, and in the background are racks full of tower systems stacked vertically.
Is this a recent photo? Do they have a computer history museum @ UCB or is that datacenter actually running production workloads on that kit?
That would certainly explain it! After posting my comment it occurred to me that having Dr Patterson pose next to a PA-RISC system was probably intentional, given his history.
In gamedev, the real-time rendering at 4K 60fps enabled by Nvidia's Turing could be "future-proof" well into the next decade. I'm just not sure end users are clamoring for more photo-realism.
Instead, the future of computing turns toward optimizing for the experience, with 3D-printed form factors and smart haptics. European startup Canatu provides a glimpse of what carbon nanotube (CNT) sheets make possible:
Now it's time for display technology to catch up. Once things like light fields roll around, that 4K will seem laughable.
> I'm just not sure end users are clamoring for more photo-realism.
People are definitely making fun of characters looking waxy, fire effects looking cheap, shadows being pixelated etc.
Plus, this is just the first generation of realtime raytracing. It's not like entire scenes will be raytraced; just some parts will be raytracing-assisted, e.g. lighting. The next generations will most likely enable a higher percentage of the scene to be based on tracing.
> And Intel, he said “is trying to make all the bets,” marketing traditional CPUs for machine learning, purchasing Altera (the company that provides FPGAs to Microsoft), and buying Nervana with it specialized neural network processor (similar in approach to Google’s TPU).
Actually Intel has an even wider view of things and an interest in non-(yet?)-mainstream tech, e.g. async circuit design. They have bought Achronix, Fulcrum, and Timeless Design Automation. Fulcrum in 2002 made things like "a fully asynchronous crossbar switch chip that achieves 260-Gbit/second cross-section bandwidth in a standard 180-nanometer CMOS process" - https://www.eetimes.com/document.asp?doc_id=1145012
> "When performance doubled every 18 months, people would throw out their desktop computers that were working fine because a friend’s new computer was so much faster."
So we have finally reached some sustainability. Good.
People still have to throw their computers out because other hardware still advances, software also advances, and companies decide not to support older hardware in their new software. The wastefulness really doesn't end.
I dunno... did people actually throw out their computers more often before? I'm honestly skeptical that people threw out their computers every 18 months just because their friend had a faster computer, as was claimed here. I feel like people are very quick to buy new computers now. And not just that, but now we're having this problem with phones too.
While I agree with Patterson, I'd be careful with the expectations. Computational engineering is the highest of high tech: whatever the purpose, whatever the specialization, processors are the central domain of modern technology and engineering. It would still take decades to find a feasible approach to making new architectures work (in silicon). But I agree that it is an exciting time; I want to see new clever architectures emerging, competing, and being used for different purposes.
Most languages in use today have had to extend from their core to the web, or to mobile, through a framework.
On the other hand, web-first or mobile-first frameworks/platforms are often high level, and have been a little ornery when digging into the weeds.
Our paradigm shifted from desktop-first to web-first to mobile-first a while ago, but our frameworks are still often anchored in a desktop-first world perspective.
If there are examples that do, or do not, highlight this, I would love to see and discuss :)
Not a new language, and maybe not even a new idea, but Unity's new Entity Component System is designed around making cache-friendly, parallelizable coding easy.
It seems like hardware is moving away from RISC and even CISC architectures in favor of lots of different types of silicon for specialized applications (GPU, machine learning, vision, DSP, hashing, encryption, SIMD).
It makes sense from a hardware perspective but it seems software development is not keeping pace. General purpose languages have trouble targeting this kind of hardware.
I think that we will see more and more specialized silicon, like what has been used for mining and training. This hardware is orders of magnitude faster than conventional CPUs.
Another point is that processors are already incredibly fast and performance is capped by the peripherals. This is where the huge potential has been hiding for some time.
Better FPGA could bridge the gap between ASIC and software. FPGA chips are still almost exclusively niche products, but if they became mainstream like GPUs the economy of scale kicks in and it's a whole different game.
>"Google has its Tensor Processing Unit (TPU), with one core per chip and software-controlled memory instead of caches"
I have heard of software-controlled caches before, but I'm imagining this is not the same thing? Could someone say? Does anyone have any decent resources on software-controlled memory architectures?
This is partly a question of computer science but more a question of economics. Computer systems will drop their own all-encompassing solutions for more specific solutions as time goes forward.
IMHO, stochastic optimal control (SOC) is a framework that is much more powerful than anything like current artificial intelligence, machine learning, deep learning, etc., and, in particular, a much better version of something like actual learning. The difference is like going from gnats to elephants, from a bicycle to the starship Enterprise.
A good application of SOC looks not just "intelligent" but wise, prudent, brilliant, and prescient.
There have been claims that SOC is necessarily the most powerful version of learning there can be -- a bit much, but ... maybe.
For computing architectures, the computing that SOC can soak up makes the computing needs of current deep learning look like counting on fingers in kindergarten.
The subject is no joke and goes back to E. Dynkin (student of Kolmogorov and Gel'fand), D. Bertsekas (used neural nets for approximations for some of the huge tables of data), R. Rockafellar (e.g., scenario aggregation), and, sure, R. Bellman. Some of the pure math is tricky, e.g., measurable selection.
I did my Ph.D. dissertation in SOC, and there I actually made the computing reasonable, e.g., wrote and ran the software -- it ran in a few minutes, with my progress in algorithms bringing the time down from an estimate of 64 years. For larger problems, we can be back to taking over all of Amazon for 64 years. More algorithmic progress is possible, and then some specialized hardware should also help, sure, by factors of 10; right, we want lots of factors of 10.
If you want to think about ambitious computing -- past Moore's law, past anything like current AI, with special-purpose hardware -- go for SOC.
SOC Applications 101. For your company, do financial planning for the next five years, 60 months. So, set up a spreadsheet with one column for the current state of the company and then 60 more columns, one for each month. The goal is, say, to maximize the expected value of something, maybe the value of the company, in the last column -- with control over the probability of going broke: if a bank, always being able to meet reserve requirements and pass stress tests; if a property-casualty insurance company, staying in business after hurricane Florence; etc.
For each variable of interest, have a row.
In the cells, put in the usual expressions in terms of the values of cells in earlier columns.
Also have some cells with random numbers -- the stochastic part.
Also have some cells empty for the business decisions, the control part.
This is the 101 version; the 201 version has more detail!
You can't get the solution with just the usual spreadsheet recalc, because the optimal solution varies through the 60 months as more information is gathered -- and that is the core of the need for more in algorithms, and for taking over all of Amazon for the computing.
Really, the work comes too close to looking at all possible business-state scenarios over the 60 months, yet it is still astronomically faster than direct or naive ways of doing this.
Or, a little like football: don't call the play on 2nd down until you see the results of the play on 1st down, BUT the play called on 1st down was optimal considering the options for the later downs and plays.
The optimality achieved is strict, the best possible, not merely heuristic: No means of making the decisions using only information available when the decisions are made can do better (proved in my dissertation).
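Since someone will ask what the computation actually looks like, here is a toy backward-induction sketch of that 60-column problem -- my own illustration, not the dissertation code. The one-dimensional cash state, the two controls, and the coin-flip shocks are all made up; the real problem has many state variables, which is exactly what blows the computation up to Amazon scale.

    /* Backward induction over 60 stages: V[t][s] = best expected terminal
       value from month t in state s; policy[t][s] = the decision to take. */

    #include <stdio.h>

    #define MONTHS   60
    #define STATES  101          /* cash discretized to 0..100    */
    #define CONTROLS  2          /* 0 = play safe, 1 = invest     */
    #define SHOCKS    2          /* good month / bad month        */

    static int clampi(int x) { return x < 0 ? 0 : (x > STATES - 1 ? STATES - 1 : x); }

    /* Made-up transition: investing swings the outcome more. */
    static int next_state(int s, int u, int w) {
        int swing = (u == 1) ? 10 : 3;
        return clampi(s + (w == 0 ? swing : -swing));
    }

    int main(void) {
        static double V[MONTHS + 1][STATES];
        static int    policy[MONTHS][STATES];

        for (int s = 0; s < STATES; s++)          /* terminal value = cash */
            V[MONTHS][s] = s;

        for (int t = MONTHS - 1; t >= 0; t--) {   /* backward in time */
            for (int s = 0; s < STATES; s++) {
                double best = -1e30; int bestu = 0;
                for (int u = 0; u < CONTROLS; u++) {
                    double ev = 0.0;              /* expectation over shocks */
                    for (int w = 0; w < SHOCKS; w++)
                        ev += 0.5 * V[t + 1][next_state(s, u, w)];
                    if (ev > best) { best = ev; bestu = u; }
                }
                V[t][s] = best;
                policy[t][s] = bestu;
            }
        }
        printf("expected terminal value, month 0, cash 50: %.2f\n", V[0][50]);
        printf("best first decision there: %d\n", policy[0][50]);
        return 0;
    }

The point of the sketch is only the recursion structure: the decision at each month is computed against the optimal continuation, which is the "don't call the 2nd-down play until you see 1st down" idea above.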
Intel, AMD, Qualcomm, DARPA, NSF, Microsoft, etc., think SOC!!! You just heard it here -- I might not bother to tell you again.
Start with books on dynamic programming: discrete time, discrete state space, no uncertainty. There's a nice one by Dreyfus and Law, The Art and Theory of Dynamic Programming, that also gets into the stochastic case. It's called dynamic programming because the program, that is, the planning, changes dynamically over time as you learn more and more as time passes.
D&L show a nice result that could be used for some first-cut approximations: if you are minimizing a quadratic cost, if the algebraic expressions in that spreadsheet I mentioned are all linear, and if the randomness is all just Gaussian, then you get "certainty equivalence" -- that is, you get to forget about the Gaussian, random, stochastic part and use just the expectations themselves. Huge speedup.
For more, there is Dynkin and Yushkevich, Controlled Markov Processes -- one of the key assumptions that permits treating the spreadsheet columns one at a time is that the stochastic part obeys a Markov assumption (the past and future are conditionally independent given the present).
There is, from MIT and CMU, Dimitri P. Bertsekas and Steven E. Shreve, Stochastic Optimal Control: The Discrete Time Case. And there is Wendell H. Fleming and Raymond W. Rishel, Deterministic and Stochastic Optimal Control.
There are papers by R. Rockafellar, long at U. Washington.
There is now an ambitious program at Princeton in the department of Operations Research and Financial Engineering.
I tried to get IBM's Watson lab interested; by now they would have been nicely ahead. The guys I was working for wanted me to do software architecture for the OSI/ISO CMIS/P data standards. Garbage direction. A really big mistake for Big Blue, not so big now.
One of the reasons for IBM to have done SOC was that they had some vector hardware instructions, that is, instructions that would do an inner product: for positive integer n, given arrays A and B, each of length n, find the sum over i = 1, 2, ..., n of A(i)*B(i).
Well, inevitably, work in probability does a LOT of this. So, if thinking about hardware instructions for SOC, such a vector instruction would be one of the first features. Then maybe more for neural-net approximations to some big tables of several dimensions, e.g., computer-language arrays A(m, n, p, q) for 4 dimensions, but in practice several more. And multivariate splines can play a role.
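To make that concrete, this is the whole loop such an instruction (or an auto-vectorizing compiler targeting vector/FMA units) would cover -- nothing deeper than this, it just gets executed astronomically often:

    /* Inner product of a and b, length n -- the operation the vector
       hardware instruction would implement directly. */
    double dot(const double *a, const double *b, int n) {
        double s = 0.0;
        for (int i = 0; i < n; i++)
            s += a[i] * b[i];   /* one multiply-add per element */
        return s;
    }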
Commonly you can get some nice gains by finding non-inferior points -- so you will want some fast ways to work with those. The software for my dissertation did that and got a speedup of maybe 100:1; on large problems, it could commonly get much more.
There is some compiling that can get some big gains -- some of the big gains I got came from just doing the compiling by hand, but what I did could be a feature in a compiler -- there's likely still a stream of publications there if anyone is interested in publications (my interests are in business, the money-making kind, now my startup).
There is a cute trick if, say, all the spreadsheet column logic is the same -- you can "double up on the number of stages".
There are lots of speedup techniques known. A big theme will be various approximations.
If there's an opportunity to exploit sufficient statistics, then that could yield big speedups and shouldn't be missed. Having the compiling, sufficient statistics, and non-inferior exploitation all work together could yield big gains -- that should all be compiled. Get some papers out of that! No doubt similarly for other speedups.
I looked at APL long ago. My guess is that since it is interpreted it would be too slow. I wrote my dissertation code in quite portable Fortran.
As I suggested, I do believe that some progress in execution time could be had from some new machine instructions, new language features to use those instructions, and some compiling help. For the compiling part, the language would have some well defined semantics the compiler could exploit. E.g., for something simple, the semantics could let the compiler exploit the idea of non-inferior sets.
E.g., say we have 100,000 points in 2-space. Just for intuitive visualization, plot them on a standard X-Y coordinate system, with time on the X axis and fuel on the Y axis. Part of the work is to minimize the cost of time and fuel. At this point in the computation, we don't know how the costs of time and fuel trade off, but we do know that with time held constant, less fuel saves cost, and with fuel held constant, less time saves cost.
So, in the plot of the 100,000 options, we look at the lower left, that is, the south-west parts of the 100,000 points.
Point (X2,Y2) is inferior to point (X1,Y1) if X1 <= X2 and Y1 <= Y2. That is, point (X2,Y2) is just equal to point (X1,Y1) or is to the upper right, north east of (X1,Y1). If point (X1,Y1) is not inferior to any other of the 100,000 points, then point (X1,Y1) is non-inferior.
So, in the work, we can just discard and ignore all the inferior points and work only with the non-inferior points. We may have only 100 non-inferior points -- then, presto, bingo, we just saved a factor of 1000 in the work. Okay, so have sufficiently restricted programming-language semantics that the compiler could figure all that out and take advantage of it. When I wrote my code in Fortran, I had to write and call a subroutine to find the non-inferior points -- bummer; that work should be automated in the compiler, but to do that the compiler will need some help from some semantic guarantees.
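A minimal C sketch of that 2-D non-inferior filter (my own illustration of the idea, not the original Fortran subroutine; a sort plus one sweep replaces the obvious all-pairs comparison):

    /* Sort by X, then sweep left to right keeping a point only when its Y
       beats every Y seen so far: O(n log n) instead of O(n^2). */

    #include <stdio.h>
    #include <stdlib.h>

    typedef struct { double x, y; } Pt;

    static int by_x_then_y(const void *a, const void *b) {
        const Pt *p = a, *q = b;
        if (p->x != q->x) return p->x < q->x ? -1 : 1;
        if (p->y != q->y) return p->y < q->y ? -1 : 1;
        return 0;
    }

    /* Moves the non-inferior points to the front of pts; returns how many. */
    static int noninferior(Pt *pts, int n) {
        int kept = 0;
        double best_y = 1e300;
        qsort(pts, n, sizeof *pts, by_x_then_y);
        for (int i = 0; i < n; i++) {
            if (pts[i].y < best_y) {              /* nothing so far dominates it */
                pts[kept++] = pts[i];
                best_y = pts[i].y;
            }
        }
        return kept;
    }

    int main(void) {
        Pt pts[] = { {3,5}, {1,9}, {2,7}, {4,4}, {2,8}, {5,6} };
        int n = noninferior(pts, 6);
        for (int i = 0; i < n; i++)               /* (1,9) (2,7) (3,5) (4,4) */
            printf("(%g, %g)\n", pts[i].x, pts[i].y);
        return 0;
    }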
The above is for just two dimensions, but in practice might have a dozen or more. So, what's the fast way with algorithms, programming language semantics, and hardware to find the non-inferior points for a dozen dimensions?
Non-inferior points are simple -- lots more is possible.
How about, instead of inventing new languages, we just pick a couple and tell people that if they care at all about performance, not to use the rest?
I would start by throwing away all the interpreted/GC'ed languages, because even after decades of massive effort they are frequently lucky if they even manage to keep up with unoptimized C. But that really isn't the problem; the problem is that they cannot be debugged for performance without directly hacking the GC/interpreter. At least in C, when you discover your data structures are being trashed by insufficient cache associativity (for example), you can actually fix the code.
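For example (my own illustration, not something from the parent): the classic associativity fix of padding a power-of-two-sized row so a column walk stops landing every access in the same cache set --

    /* A 4096-float row is a 16 KB stride, so walking down a column maps
       every access to the same set on a typical 32 KB 8-way L1 and thrashes
       those few ways.  Padding the row breaks the pattern; alignas pins the
       array to a cache-line boundary.  C11. */

    #include <stdalign.h>

    #define ROWS 1024
    #define COLS 4096
    #define PAD    16                      /* breaks the power-of-two stride */

    static alignas(64) float a[ROWS][COLS + PAD];

    float column_sum(int c) {
        float s = 0.0f;
        for (int r = 0; r < ROWS; r++)     /* stride is now (COLS+PAD)*4 bytes */
            s += a[r][c];
        return s;
    }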
Put another way, up front people need to know that if they write in Python/Java/etc. to save themselves a bit of engineering effort, and it's anything more than throwaway low-volume code, the effort to optimize or rewrite it will dwarf any savings they might have gotten by choosing Python.
For a lot of problems, performance really isn't the problem. In many, many cases the engineering cost of optimizing C++ to get a benefit over Java/C# will be way more than just paying for more servers. Even in gamedev, where milliseconds matter, C# is very common.
Personally I agree about dynamically typed languages not being great for large projects but that's more about maintainability than perf.
"unoptimized C" can easily take 10x as much effort to write as heavily optimized python/Java/..., and has a much higher defect rate. Make it work, then make it right, then make it fast; people working in C rarely get past step 1.
We need new languages, or at the very least new ways to approach optimization. We are at the point where even C doesn't give you the performance you want -- that's why people invented things like Halide, XLA (TensorFlow), etc.
There are a ton of domain specific languages (SQL, GLSL, etc), and I don't have a problem with those, they have a place.
The problem is that we already have too many languages trying to be generic, and languages like C++ and OpenMP can be wrangled into nearly any programming paradigm in common use, and the results tend to also be significantly faster.
If you're talking about performance, the minimum baseline requirement should be running faster than a language in common use, say C++. There are a few languages that might meet this (CUDA/OpenCL), but they also tend to be somewhat domain-specific. Similarly, Verilog/VHDL can produce fast results. There have been calls for decades for languages that can express parallelism better and are both expressive and safe. But what we have actually gotten hasn't been better by most objective measures. What we have gotten is a pile of "scripting-like" languages with very similar characteristics (Tcl, Ruby, Perl, Python, JavaScript, PHP, Lua, Go, AppleScript, etc.).
So would Rust be one of the languages you'd advocate including in the chosen few? Does Rust's approach to memory management put it more in the C group in terms of debugging for performance?
I wasn't really trying to pick or choose a language, although I did pick on Python a bit, because I just spent a few days wading through a gigantic "python shell script" that would literally have been 1/100th as many lines written in bash, since every third line was calling some shell utility or doing the equivalent of one-line sed calls in 4-8 line Python functions.
The point is that we literally have dozens and dozens of languages active on any given machine, with absolutely massive frameworks duplicating functionality. Here is a few-hundred-meg Ruby runtime, with a few-hundred-meg node.js runtime, with a few-hundred-meg python2 runtime (and another for python3), with a few-hundred-meg Perl runtime, etc., etc., etc. It's insane, because in most cases you can't even statically bind the whole thing into a single .jar-like file.
Not only is the cognitive load incredible, but the code quality in most of them is horrendous. Then there is the issue that most of these language runtimes have, for most purposes, zero documentation outside of the actual code.
Then when someone asks "what language should I choose," they are presented with a whole bunch of square pegs to pound into their round hole. If you're going to work with a square peg, it should probably be one well understood and polished enough to actually have documentation for more than the 100 most commonly called functions.
What I'm trying to say is that we don't need yet another over-hyped language, and we should probably try to kill a few more of them off. Object Pascal is a better language than probably half the ones in common use, but it's pretty much dead. Inventing another language that is _worse_ than Object Pascal is just stupid.