Has anyone experimented with this yet? I'd like to know how this is resolved when the architecture doesn't support Intel's SIMD approach, since the objects map pretty closely to the instructions (SIMD.float32x4.sub and the like).
I'm trying to figure out what happens when you port this to ARM NEON, and how you detect it on architectures that don't support NEON (it's often missing in Marvell and Allwinner chips).
I'm a Mozilla engineer involved in this. NEON support is very important and we're designing the spec to support it well.
CPUs that lack SIMD units can support the functionality (though not the performance of course), and there's even a polyfill library that can lower this API into scalar operations for SIMD-less browsers too.
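For example, a scalar fallback for a couple of float32x4 operations might look roughly like this (just a sketch of the idea, not the actual polyfill code; lane access via .x/.y/.z/.w follows the early proposal):

var scalarSIMD = {
  float32x4: function (x, y, z, w) {
    // Math.fround keeps each lane in float32 precision, like real SIMD lanes
    return { x: Math.fround(x), y: Math.fround(y),
             z: Math.fround(z), w: Math.fround(w) };
  }
};
scalarSIMD.float32x4.add = function (a, b) {
  return scalarSIMD.float32x4(a.x + b.x, a.y + b.y, a.z + b.z, a.w + b.w);
};
scalarSIMD.float32x4.mul = function (a, b) {
  return scalarSIMD.float32x4(a.x * b.x, a.y * b.y, a.z * b.z, a.w * b.w);
};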
It would be great if you could detect SIMD-able operations in classic JS (e.g. in loops) and use SIMD to execute them. I don't think adding low-level features to a high-level language is good practice.
We will probably do that too at some point, but it won't replace explicit SIMD, just as widely-available auto vectorization support in C++ hasn't eliminated the need for explicit SIMD extensions there either.
One thing to keep in mind is that most programmers probably won't want to use this feature directly; it'll be used in libraries that expose higher-level APIs. It's still true that every feature we add increases overall clutter, but SIMD seems sufficiently useful and sufficiently self-contained that it's worth the tradeoff.
The primitives are pretty generic, just a few new vector types based on typed arrays. Operations on those types are supported on CPUs without a SIMD unit; they're just slower, though no slower than coding with non-SIMD operations in the first place.
What about 8- and 16-bit ints? How about signed vs. unsigned? What about pixel-like data that clamps instead of overflowing? What about 64-bit IEEE floats? What if the SIMD unit is 64 bits wide? Or 256? It just doesn't seem future-proof, or robust against varying implementations.
> architectures that don't support NEON (it's often missing in Marvell and Allwinner chips).
I'm probably nitpicking here, but:
* All Allwinner SoCs have NEON[0]
* Most current ARMv7 processors have NEON. Of the current ARM cores, only Cortex-A5 and Cortex-A9 don't have mandatory NEON support (it's optional). Cortex-A5 is intended for embedded applications. Of the existing Cortex-A9 processors, AFAIK the only somewhat popular one without NEON support is NVIDIA Tegra 2, which is retired. Out of the third party cores, all Qualcomm and Apple ones have NEON support.
Yes, and as jmpe said, Marvell has quite a few SoCs without NEON (I think basically everything apart from the ARMADA 1500 Plus). They seem to be targeting devices like smart TVs and STBs nowadays, so I guess it's not a big deal.
Correct; it was wrong of me to point to "architectures" that lack NEON, and your reply has it right. I should have mentioned specific implementations. My experience with Allwinners without NEON indeed comes from smart TVs and STBs. You know your stuff ;)
It happens in the browser, right? I think ultimately there just needs to be a unified API, or maybe more domain-specific APIs (like BLAS), that map to NEON or SSE instructions as appropriate, and do everything the slow way if they aren't available.
The SIMD.float32x4 and SIMD.int32x4 classes are available in Firefox Nightly, but without Float32x4Array and Int32x4Array, loads and stores are horribly slow. About 100x slower than normal JavaScript in my tests.
I think their effort would be much more useful if they focused on WebCL. It is already standardized (unlike their "SIMD" object). A CPU implementation of WebCL that utilizes SIMD would probably offer much better performance than any current JavaScript engine.
WebCL implementations for Firefox and Chromium have existed for two years, but they have not been included in any alpha, beta, or canary release. Browser developers don't seem to care about WebCL, though they do care about WebGL, which is strange.
The WebCL community is geared towards exposing the platform's GPU-backed OpenCL, so that would be stealing their fire.
Though it might still be the smartest thing to do given the poor state and lack of recent progress of GPU OpenCL drivers. Even desktop apps that would like to use OpenCL are just barely limping along. See eg. Blender.
I agree; even a CPU implementation of WebCL could be much faster than the fastest JS engine (because it is low-level C code). There are currently several very good OpenCL implementations running on the CPU.
Stuff like this, and asm.js, and WebKit's crazy LLVM-based DFG optimizations, all lead me to think Native Client will end up being looked at as a transitional, niche tech: you keep tuning the JS engine until it's not unacceptably slower than native for asm.js-type code, and you keep hooking webpages up with more and more native capabilities and code (graphics, SIMD, crypto/compression), and eventually very few NaCl use cases are left. I don't see other vendors getting on the NaCl bandwagon because they don't want to depend on Google's code and it's a lot of work to implement--so maybe NaCl ends up remembered as the toolchain some companies used to port some apps to Chrome OS and that's mostly it. Sort of a shame; I _like_ NaCl, just don't see a great path for it compared to iterating on existing technologies.
Given that Nitro is being converted to LLVM bitcode and NaCl was using it as well, it seems like a good idea to have LLVM bitcode/assembly declared the web's standard language. Then any tool that generates llvmbc could be used with browsers. This would open up a lot of scope for other languages too, bringing the choice of languages the server side enjoys to the client side as well.
Threads... Native Client has threads; JavaScript does not. I'm writing a scheduler for Emscripten that supports pthreads, but it's crazy because it's still single-threaded, and the output no longer looks like the readable JavaScript Emscripten used to emit. Bending over backwards. ...and don't get me started on web workers.
Please do start; I'm curious why web workers don't fit the need the way threads do. I have a fairly basic knowledge of web workers and not much practice with them, but I'd like to know what the limitations are, if possible.
JavaScript is inherently single-threaded; it goes back to its roots. It would be extremely difficult to make it multithreaded. Apps developed in JavaScript are heavily asynchronous and avoid blocking for too long because of this limitation. In some ways this is good.
In the right situation, for example in Node.js, this discourages anyone from blocking the single thread handling requests for very long, and it makes it easier to handle the C10K problem in a much more interesting way (traditional servers create a thread per request or queue them to be handled one by one from a pool, whereas Node does it all from a single thread). In other ways this can be bad: it pushes developers (in the case of Node) not to do a lot of the heavy lifting in JavaScript itself, and instead to hand off to native code that can do long, hard, potentially blocking things on real threads (database queries, persistence backends, file IO).
The thing is that some code is easier to write with threads in mind. If I know there are 4 processors on a system, I can make 4 threads and (potentially) use all four processors effectively at the same time. This is not really possible with JavaScript. So heavy apps on a device like the Firefox phone (which is all about JavaScript only) that could really benefit from multiple cores have no way to take advantage of them (unless you use web workers).
There is also just an epic amount of legacy code in apps that are inherently built around the concepts threads provide. JavaScript is basically becoming the bytecode of the web; things like asm.js and Emscripten are pushing it that way. But it's not easy to port all of that code to fit a model where threads are not allowed.
Conceptually, web workers are more like processes than threads. They're a shitty workaround in JavaScript, since we can't easily have threads without breaking the way JavaScript works (which could potentially break the web). That means we have to split code into different processes where everything is isolated. On top of that, you can't share memory; you have to message and marshal data between web workers. This is awkward and hard to target from something like Emscripten, making it near impossible to port something that expects threads to exist (theoretically, a C app that used separate processes rather than threads would fit better, but that's almost unheard of). Most operating systems provide threads (pthreads most often, except on Windows).
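To make that concrete, here's a minimal sketch of the message-passing model (file names are only illustrative): everything crosses the worker boundary by copy (or by transferring ownership of an ArrayBuffer), never by shared pointer.

// worker.js -- no shared memory: e.data arrives as a structured-clone copy
onmessage = function (e) {
  var data = e.data;
  for (var i = 0; i < data.length; i++) data[i] *= 2;
  postMessage(data);                 // copied back to the main thread
};

// main.js
var worker = new Worker('worker.js');
worker.onmessage = function (e) { console.log(e.data); };
worker.postMessage(new Float64Array([1, 2, 3]));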
My solution is to write a scheduler that is used by my fork of Emscripten. All functions use return-on-call semantics, and a "scheduler" in JavaScript decides which function should be called next from a list of virtual thread "stacks" (arrays), or whether it should yield to the browser (setTimeout). The "scheduler" is single-threaded itself, so it's not really "threads", but it does make it possible to emulate them when porting from native code.
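Very roughly, the idea looks something like this (my sketch of the general technique, not the actual code from the fork):

var threads = [];                         // each entry is a queue of pending steps

function spawn(fn) { threads.push([fn]); }

function schedule() {
  for (var t = 0; t < threads.length; t++) {
    var queue = threads[t];
    if (queue.length) {
      var step = queue.shift();
      var next = step();                  // "return on call": a step returns the next step
      if (next) queue.push(next);
    }
  }
  if (threads.some(function (q) { return q.length; })) {
    setTimeout(schedule, 0);              // yield back to the browser between slices
  }
}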
It will, however, run (...) on the platforms that support SIMD. This includes both the client platforms (...) as well as servers that run JavaScript, for example through the Node.js V8 engine.
...and:
A major part of the SIMD.JS API implementation has already landed in Firefox Nightly and our full implementation of the SIMD API for Intel Architecture has been submitted to Chromium for review.
...and:
Google, Intel, and Mozilla are working on a TC39 ECMAScript proposal to include this JavaScript SIMD API in the future ES7 version of the JavaScript standard.
So, yes, there's definitely an intention to put it into V8/Node.js/ES7 (my guess is it will happen in that exact order).
Considering that the architecture of Node.js is not suited to compute-heavy tasks, I wonder what kind of code you're hoping to optimize in Node.js with SIMD?
Any improvements are still welcome, of course. There are a number of people/entities building desktop apps on Node, and those tend to do 'compute-heavy tasks'. A developer whose code runs in a mean of 10 seconds would also welcome any optimisation that brings it below that.
Not knowing much, I think it'll be interesting to see how general purpose applications would benefit from SIMD if it's accessed from a higher level. Does that mean that if I want to loop through 103 items and run arithmetic operations on them, I'd have to do something like the following (let's say I'm multiplying each item in items[] by 2, and items.length % 4 !== 0):
var results = [],
    batch = [],
    i = 0,
    len = items.length,
    mod = len % 4,                      // leftover items that don't fill a group of 4
    b = SIMD.float32x4(2, 2, 2, 2),
    a, c;

items.forEach(function (item) {
  if (i < mod) {
    // handle the remainder with plain scalar math
    results.push(item * 2);
    i++;
  } else {
    batch.push(item);
    if (batch.length === 4) {
      // multiply four items at once
      a = SIMD.float32x4(batch[0], batch[1], batch[2], batch[3]);
      c = SIMD.float32x4.mul(a, b);
      results.push(c.x, c.y, c.z, c.w);
      batch = [];
    }
  }
});
Of course, this is the interpretation of a non-CS graduate who taught himself JS; some of the stuff mentioned at https://01.org/node/1495 seems a bit over my head. It'd be great if V8 would (unless it already does) transparently generate SIMD-optimised code when one is looping through an array or the like instead.
That kind of 'pack non-SIMD into SIMD, do SIMD op, unpack back into non-SIMD' thing tends to be slower than just doing non-SIMD ops in most cases.
You'd want to convert your non-SIMD data into a big stream of SIMD data up front, then do lots of operations on it, and then after that perhaps unpack it. Most SIMD scenarios just keep data in SIMD format indefinitely.
(Sometimes a compiler can use SIMD operations on arbitrary data by maintaining alignment requirements, etc. That sort of optimization might be possible for the JS runtime, but seems unlikely for anything other than typed arrays.)
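To make the "pack up front" point concrete, here's a rough sketch (lane access via .x/.y/.z/.w follows the early SIMD.js proposal) where the data lives in a Float32Array the whole time and gets processed four lanes per iteration:

var data = new Float32Array(1024);          // packed once, up front
var two  = SIMD.float32x4(2, 2, 2, 2);

for (var i = 0; i + 4 <= data.length; i += 4) {
  var v = SIMD.float32x4(data[i], data[i + 1], data[i + 2], data[i + 3]);
  var r = SIMD.float32x4.mul(v, two);
  data[i] = r.x; data[i + 1] = r.y; data[i + 2] = r.z; data[i + 3] = r.w;
  // (a real loop would also handle a tail that doesn't fill a full group of 4)
}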
The architecture is suited just fine, it all depends on how you decide to use Node.js. If you're using it for handling lots of asynchronous IO (eg. for a webserver) then it's probably unwise to mix in lots of blocking compute-heavy tasks on the main thread. But that's not the only way to use Node.js.
For example, we have a media library that implements HTML5 canvas2d, WebGL, WebAudio, video/image/sound encode and decode and a bunch of other media related compute-heavy tasks. We've done this as a native module because it lets us control all those APIs using the same code we would use in the browser, but we run that code in Node.js as a separate worker process where we don't care about things like IO latency. It works great.
In Node JS you can spawn child processes. It's also designed to be asynchronous. But what I like about JS is that it doesn't force me into paradigms.
And when you run something heavy in Node JS the VM can use many threads. So I think it's weird to say it's single threaded. Aren't all programming languages single threaded then?
I was doing research on parallel computing last summer. Does anybody know if this SIMD object is similar to the ParallelArray object Intel made in Rivertrail? Or are there any similarities between the two libraries?
Yet another parallel framework. Without a third-party ecosystem of APIs for matrix math, any framework is doomed to just add noise, not value. Sure, there's some benefit in getting a marginal speedup on some algorithms, but for real speedup you need to know the parallel architecture of the processor (GPU, CPU or APU), which means a learning curve. The GPGPU industry has long been trying to abstract away the fine details and offer a plug-and-play, easy-to-learn framework, but then we suffer performance losses, and it really doesn't make sense to invest in GPUs for the kind of performance gains you get with these high-level APIs.
Read the release, this is a collaboration with Google and Mozilla. But you are right, one of the main reasons CUDA is so popular is because of cuBLAS. And it is a big pipe dream that you could program a GPU without being aware of communication and memory transfer behavior.
Won't we have HSA in the future? HSA is supposed to provide unified coherent memory access to both CPU and GPU. Do you think HSA is a pipe dream? If so why?
OK, it's not that HSA isn't useful, it's that coordination between the CPU and GPU is still stupidly hard and has a lot of CPU-side overhead, making it impractical for small workloads. The problem is that a large number of small workloads still can't be done by a GPU. I'm seriously doubting the limits of "coherent memory access" -- unless the GPU can snoop into the CPU's cache (or the GPU/CPU share L1 cache -- eeek), then you will still need cache flushes and fences. Let's hope that "HSA" is a lot lower overhead than current CPU/GPU combos from AMD/Intel.
It's not parallel, a framework, or a GPU feature. It's single-instruction-multiple-data (SIMD) which is used to speed up single threaded execution on a CPU when working with lists of numbers.
He found himself writing the NEON code in assembly entirely by hand because vector intrinsics didn't even expose CPU features he wanted to use—even in C, where vector intrinsics are CPU-specific.
Having access to SIMD is definitely better than not having it, but it really should be paired with good optimized implementations of things like BLAS and FFT libraries.
Most likely the same, unless they are using some chip specific instructions/anomalies instead of just the published instruction set (which I think is unlikely).
There may be performance differences of course where AMDs chips process the same instructions differently internally, but I would expect similar optimisation differences between different families of Intel's chips too.
They wouldn't bother if it didn't work well enough on AMD CPUs, because if it worked significantly worse, or not at all, then people other than Intel simply wouldn't use it. Of course they'll make no specific effort to optimise it for chip designs that aren't their own, but that is not the same thing.
The important thing here is that while Intel is driving this, the code must be landed in browsers, which are not owned by Intel. Those browsers are not going to ship code that only works on one type of hardware, or is weirdly unoptimized on some hardware.
Having dealt with cpuid, there were plenty of caveats in how AMD did things vs. how Intel did things. I can totally see Intel first checking whether it was an Intel chip and then punting, just due to complexity and testing. Keep in mind that this article mentions the P4 and Athlon, so Intel would also have had to care about Cyrix and Transmeta, which were different again.
"There is encouraging evidence that SIMD will enable a whole new class of application domains and high-performance libraries in JavaScript."
Anyone take a guess what those might be (honest question)?
In the past SIMD has been the primary way to accelerate audio and graphics related compute tasks, but with WebGL and shaders, JS users already have a very powerful vector processing unit at their fingertips.
Also, using SIMD is way easier than using shaders. You just write something like this:
double average(Float32x4List list) {
  var n = list.length;
  var sum = new Float32x4.zero();
  for (int i = 0; i < n; i++) {
    sum += list[i];
  }
  var total = sum.x + sum.y + sum.z + sum.w;
  return total / (n * 4);
}
Instead of:
double average(Float32List list) {
  var n = list.length;
  var sum = 0.0;
  for (int i = 0; i < n; i++) {
    sum += list[i];
  }
  return sum / n;
}
I am still looking for practical examples, unless there is a use case for averaging a gazillion numbers on the client (I would like to know what that use case is).
OK- Look at people trying to make 3D games for the web. GPU performance is a concern, but if you can't even run the physics simulation or cull your object database fast enough to push triangles to the GPU, your performance will be hurt and the GPU will be idle. People want games based on WebGL to have comparable performance to native apps -- well, you're going to need to take all of the libraries that games use and convert them too. This is a good solution to allow such SIMD optimization that have been present in native libs for years to have a chance to surface in JS libraries.
There's a skeletal animation demo that was made to show off Dart SIMD support. The bottleneck was the animation, not the 3D rendering, and using SIMD allowed almost 4x the number of characters to be drawn.
It is somewhat hyperbolic, given that nothing prevents JIT compilers from using SIMD on standard JS typed-array loops. And SIMD rarely gives more than a 2x speedup at the app level, even in native games and such.
Yes, I know. The person I was replying to was questioning why someone would want SIMD when shaders on a GPU were available. Well, if you don't have a GPU, shaders aren't available (or at least emulated at a huge performance hit).
Why not bring SIMD to Lua or Python? Or Lisp, or Haskell? Why not just bring SIMD on the web to anyone who wants it?
Will we have to explain to our grandchildren why they can only write code in some flavor of Javascript? Will it make sense, then, to wave our hands at ActiveX? Or do we have any better ideas for the future?
Mozilla's mission to protect the web from all languages but Javascript is locking us into a future where there will be no choice and nothing better than Javascript, because there will be neither an audience nor hardware support for anything but Javascript.
> Will we have to explain to our grandchildren why they can only write code in some flavor of Javascript?
No, because while JavaScript is the "common" language of the web, it's not the only language you can code in on the web -- there are plenty of implementations of other languages with it as a compilation target.
The value of having a guaranteed-to-be-everywhere target language for the web, even if it isn't the preferred development language of every developer, is fairly obvious.
> Mozilla's mission to protect the web from all languages but Javascript is locking us into a future where there will be no choice and nothing better than Javascript
As long as JavaScript keeps getting better -- and, particularly, as long as it is spurred on in that by efforts which propose alternative standard languages for the web with compelling stories so that JS has to keep moving forward in order to be acceptable as the universal, guaranteed target language -- that's fine. "Nothing better than JavaScript" isn't a real limitation if JavaScript is a moving target.
> "Nothing better than JavaScript" isn't a real limitation if JavaScript is a moving target
JS is only a "moving target" in the sense that stuff is being added to it. If you could make a perfect language by just adding things, then we'd be fine.
But the nature of the language itself is not going to change, because that would break backwards compatibility. The type system, prototype inheritance, `this`, type coercions, etc. There are plenty of undesirable things in JS which we're stuck with (unless we break compatibility, in which case it might as well be a different language).
>>No, because while JavaScript is the "common" language of the web, it's not the only language you can code in on the web -- there are plenty of implementations of other languages with it as a compilation target.
So they either have to bring SIMD to every language or to none at all?
They want to improve web application performance. JavaScript is the only language that works on the web. So they are bringing it to JavaScript. That doesn't mean they won't bring it to other languages in the future.
It doesn't make sense to improve HTML5 game performance by bringing SIMD to Lua or Python.
Because the more effort put into improving JavaScript performance, the more entrenched something terrible yet ubiquitous becomes, and the less chance we get of replacing it with something that is not a cyclopean horror?
Personally, I think it's a good thing that JS gets SIMD support -- since Dart has it, and it would be better if the discussion over whether a new standard language is needed to replace JS were about language fundamentals, not about one competitor or the other missing useful features that aren't part of the fundamental language structure.
You can always escape the "JavaScript lock-in" with a compile-to-JavaScript language. Thanks to asm.js and related performance improvements, you can even compile a "JavaScript binary" that looks nothing like vanilla JavaScript but is still fast.
"without the need to rely on any native plugins" - if I need a specific version of specific browsers (which it is), than it's no different (though sand-boxed, which plugins could be as well).
1. Browsers are often preinstalled for the user (especially, but not only, on mobile OSes). Also, many browsers auto-update, so the new features will eventually be available automatically, without the user installing anything.
2. The point is that while browsers are binary, they can then run a huge set of portable apps. As opposed to all those apps not being portable.
3. Everything Mozilla does is open source, and not just Mozilla but also most browsers today are open source - Chromium and WebKit in particular. This is a very open space.
"This is a very open space" - then why is it locked to one legacy scripting language of some sort of crappy transpilation they themselves don't use? It is effectively closed for new languages.
First, adding more options takes work. People need to volunteer to do that work, and prove that adding more VMs to the web can be effective (there are many technical challenges, like cross-VM garbage collection, sandboxing issues, etc.). People simply haven't shown this is practical yet.
But, people have meanwhile shown that cross-compiling to JS is practical, from things like CoffeeScript to C++. This is opening up the space to new languages, but it takes time and effort as well - again, the speed depends on how many people volunteer to help out.
if you use CSS - or the video/audio tags, or anything new - you need specific browser versions; it's not gonna run OK on ancient browsers
this is just exposing yet another platform API in the browser - once all browsers in the field are updated, you can reliably deploy your new code; or you can do it even earlier, but with a performance penalty. I'm not seeing anything wrong with that.
Dev: "Shame... Let's just wait n years until it gets released."
X = Chrome or Unity plug-in
Y = e.g. Safari for iOS.
PS: the whole idea of API-mega-mutant browsers sounds like shipping Node.js/Angular with all the packages instead of just letting them be installed in a flexible way via NPM/Bower/whatever.
Only the newer Chrome/Safari/Firefox/IE get to billions of users with auto-updates and such. Whether they want to run your nice game or not, they'll get to those new browsers, sooner or later. And any "nice game" will be able to take advantage of that.
Whereas a third-party plugin used just to run "a nice game" will never get the same adoption, and users won't be bothered.
So your example can be rendered as:
Dev: "I made this nice game - try it!"
User: "Cool! What do I need to run it?"
Dev: "Just install a browser version released after 2015!"
User: "Nice, my browser is already updated"
In the vast majority of cases. And for those who haven't yet updated, either they're not your target audience (they have some ancient IE6 they're mandated to use by company policy) or they can just go ahead and update their browser.
They should update their browsers regularly anyway, and getting the latest Chrome/FF/etc. is not like being forced to install some unknown plugin from some unknown developer just to run a single app.
Is it really that difficult to understand?
>PS: the whole idea of API-mega-mutant browsers sounds like shipping Node.js/Angular with all the packages instead of just letting them be installed in a flexible way via NPM/Bower/whatever.
Also called "batteries included". But unlike in your example, those packages are one of each kind (e.g. not 100 competing templating engines, or 100 MVC frameworks, like on NPM).
You need to ship the whole API with the browser for security reasons. If the API could be installed in a modular fashion in real time while browsing, it couldn't be made secure in practice.
> if I need a specific version of specific browsers (which it is)
No, you don't need a specific browser. You can polyfill it. (The reference implementation is a polyfill.)
If you want the performance benefits, you'll need to update your browser. But that's true for JavaScript in general. IE8's engine is very slow, for example. Only the most recent browsers can offer the best performance.
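Feature detection is cheap, too; something like this (the polyfill file name is just a placeholder) falls back to a scalar implementation when there's no native SIMD object:

if (typeof SIMD === 'undefined' || !SIMD.float32x4) {
  // no native SIMD object: load a scalar polyfill instead
  var s = document.createElement('script');
  s.src = 'simd-polyfill.js';      // placeholder name, not a real path
  document.head.appendChild(s);
}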
It is wrong to expose such low-level features in a programming language. These are exposed as actual types. More and more stuff is being added to HTML, CSS and JS without any thought as to what the language is meant for. Just because it can be done does not mean it should be. A better approach is to advance type inference and compiler technology, and add language primitives that assist such code generation.
The web page shows a Mandelbrot set, which is a terrible example because those are best done with shaders. In fact, there is no real use case for this. There will always be specific things that cannot be done, and no language can address them all. How many apps need SIMD-based physics? In fact I am not even sure what kind of physics needs SIMD.
So, before we had shaders, we used hand-written SIMD to perform those calculations as quickly as possible. Even with shaders available, there are broad categories of numerical work where you can't justify the expense of putting things out on a GPU but you still want them to be done quickly.
Good examples of this are basic vector arithmetic (useful for graphics, scene management, and physics) and certain types of hashing, among other things.
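For instance, a position update in a physics step can be done as one 4-wide operation (a sketch using the proposal's float32x4 type; the fourth lane is just padding):

var pos = SIMD.float32x4(1.0, 2.0, 3.0, 0.0);   // x, y, z, unused
var vel = SIMD.float32x4(0.1, 0.0, -0.2, 0.0);
var dt  = SIMD.float32x4(0.016, 0.016, 0.016, 0.016);

// pos += vel * dt, for all three components at once
pos = SIMD.float32x4.add(pos, SIMD.float32x4.mul(vel, dt));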
In some magical future land where sufficiently advanced compilers drop out of trees, maybe you have a point--but this is the web, son, and we're as quick and dirty as you can get.
Auto-vectorization is an extremely complicated topic. This stuff is slow. Not a big deal with AOT compilation, but it's of course a huge deal if you only have a few msecs to spare.
Also, a machine can't just jumble the operations around because that would change the result.
So, this is apparently something you have to do yourself. A compiler won't know what kind of data will be fed to that function. It can't make an informed decision.
> In fact I am not even sure what kind of physics needs SIMD.
It generally just means that you use some physics library which makes use of SIMD. Without having to do anything special, your game will run drastically better an use less energy to boot.
That's the primary use case; using libraries which use SIMD. Most people won't bother doing that by themself.
If speed and compilation are a concern, let's start with having proper types in JavaScript. That will speed things up and save a lot more battery than this will.
I am not saying this won't make things faster. There are many things you can do to make things faster but this really is such a niche feature (for web based apps).
> If speed and compilation are a concern, let's start with having proper types in JavaScript.
That is a much harder problem, especially since you brought up performance. Getting the interaction between dynamically typed and statically typed code to work in a way that's both easy and natural to use and allows compilers to get significant benefits from the statically typed code is an unsolved research problem.
Yes, it is. However, it's relatively easy to implement and the performance improvements and battery savings are fairly huge. All things considered, it's a pretty good deal.