Future instruction set: AVX-512 (agner.org)
102 points by przemoc on Oct 9, 2013 | 66 comments



The AVX / SIMD part of the x86 instruction set is probably the most understandable subset to focus on learning! And I'm very excited about AVX-512.

I'm actually going to be spending some of my time over the next year adding proper SIMD support (including all the shuffles) to the main Haskell compiler, GHC!

There are some really interesting constraints on the SIMD shuffle primops that need some type system cleverness to compile correctly!

Namely, you need to know "statically, at code gen time", the shuffle constants that are given as "immediates" to the instructions! Normal values don't quite have the right semantics, and accordingly the SIMD intrinsics in C compilers kinda lie about the types they expect (i.e. if you give them a variable of the right type, they'll give you an error saying they need an actual constant literal).
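
For a concrete flavor of what "must be an immediate" means, here's a minimal C sketch (assuming Intel's documented SSE2 intrinsics): the shuffle control is an 8-bit constant baked into the instruction encoding, so the compiler has to see it at compile time.

  #include <immintrin.h>

  /* Reverse the four 32-bit lanes of v.  The second argument to
     _mm_shuffle_epi32 must be a compile-time constant; pass a plain
     int variable instead and most compilers reject the call. */
  __m128i reverse_lanes(__m128i v)
  {
      return _mm_shuffle_epi32(v, _MM_SHUFFLE(0, 1, 2, 3));
  }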

tl;dr: I'm going to make sure that GHC (and Haskell) can support AVX-512 by the time CPUs so equipped are made available.


  > Namely, you need to know "statically, at code gen time",
  > the shuffle constants that are given as "immediates" to 
  > the instructions!
Can you clarify where the extra difficulty is?

I'm ignorant of GHC, but I'd think that from the compiler's POV all that matters is that the operand is a constant. Then it's just a matter of putting the value in instead of the name. In GCC inline assembly, you can use the poorly documented '%c' prefix to have a number treated as an immediate, so I'd guess this must be possible internally too. Also possibly worth noting is that, unlike the others, PSHUFB works from a register rather than a value encoded into the instruction.
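
For contrast, a sketch of the register-controlled case mentioned here: _mm_shuffle_epi8 (PSHUFB, SSSE3) takes its byte-shuffle control from an ordinary __m128i, so it can be computed at run time rather than encoded as an immediate.

  #include <immintrin.h>

  /* Byte-reverse a 128-bit vector.  The control mask lives in a normal
     vector register (PSHUFB), so no compile-time immediate is needed. */
  __m128i byte_reverse(__m128i v)
  {
      const __m128i ctrl = _mm_setr_epi8(15, 14, 13, 12, 11, 10, 9, 8,
                                         7, 6, 5, 4, 3, 2, 1, 0);
      return _mm_shuffle_epi8(v, ctrl);
  }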


There's more than one shuffle instruction; in fact, there are quite a few! You're right that some can take register args, but those aren't the ones I care about as much.

You're right, there are hacks in C that handle that. My goal for GHC is to actually have a systematic solution for handling any sort of constant literal expression at compile time. This includes making it easy to add new primops that require compile-time literal data.

There are some interesting implications if you want that restriction to be typecheckable! This includes having a "static data kind". Part of why you want that is also that GHC is great at common-subexpression elimination, and I consider any implementation strategy that could be broken by compiler optimization to be unacceptable.

[edit to clarify: just naively using normal literal values would likely be subject to CSE optimization, and having code gen need to look around to look up a variable, rather than being a localized process, is somewhat horrifying]

One particular end goal of mine is this: SIMD isn't that complicated, and it's really easy to experiment with (but only if you can cope with C). I want to make experimenting with SIMD much more accessible.

Interestingly enough, the notion of static data I want seems like it might be an especially strong version of the notion of static data that Cloud Haskell (the distributed-process lib) would like. So there may be some trickle-down there!

One really cool optimization that having a proper notion of static literals might enable is making it much easier to generate things like static lookup tables and related data structures that are small and perf-sensitive.

Edit: also, if you want to try and stare at the source for a serious compiler, GHC (while huge) is pretty darn readable. Just pick a piece you want to understand and stare at the code for a while!

Edit: I should add that Geoff Mainland has some great preliminary work adding experimental SIMD support to GHC that's in HEAD, pending 7.8. That said, GHC support for interesting SIMD won't be ready for prime time till 7.10, in a year-ish.


Seems to me like Intel and AMD aren't very forward-thinking, adding new instructions and registers every year or two. As if the x86 instruction set weren't bloated enough, now they're going to have instructions with 4-byte prefixes, and new registers you can only access with AVX. What's next after that, AVX-1024 with 6-byte prefixes? Meanwhile, this renders MMX and SSE sort of redundant. Seems to me we might be better served with some kind of vector coprocessor and instructions that can operate directly on large vectors in memory, instead of doubling the size of the vector registers all the time and making x86 an ever harder target to generate efficient code for.

Maybe this is another area where ARM can beat x86. Have a better planned out vector instruction set that can be expanded without adding hundreds of new instructions all the time, and more compact machine code.


  > Meanwhile this renders MMX and SSE sort of redundant.

It's an evolution of the same SIMD ideas. Yes, the newer variants do render the older variants redundant, but they hang around because existing code might use them.

MMX - SIMD, integer only, reused the existing floating-point registers, making it a PITA that often didn't even pay off because of the expensive state switching between MMX and floating point.

SSE - started as floating-point SIMD with its own 128-bit registers. Evolved through SSE 4.2 with more instructions (functionality in hardware), flexibility (e.g. operate on 4 singles or 2 doubles or ...), and the addition of integer functionality.

AVX - doubled the size of the vector registers, added more of them, and brought lots of new instructions and functionality. The successor of SSE, at least on the floating-point side. AVX2 brought the integer functionality.

AVX-512 - doubles the size of the vector registers again: 16 single-precision float operations in one go.
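
A rough illustration of that widening in intrinsics form (a sketch; the AVX-512 variant assumes an AVX-512F-capable compiler and target, which weren't generally available when this was written):

  #include <immintrin.h>

  /* The same element-wise add at three widths:
     4, 8, and 16 single-precision floats per instruction. */
  __m128 add4 (__m128 a, __m128 b) { return _mm_add_ps(a, b);    }  /* SSE     */
  __m256 add8 (__m256 a, __m256 b) { return _mm256_add_ps(a, b); }  /* AVX     */
  __m512 add16(__m512 a, __m512 b) { return _mm512_add_ps(a, b); }  /* AVX-512 */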

  > Maybe this is another area where ARM can beat x86. Have a better planned out vector instruction set that can be expanded without adding hundreds of new instructions all the time, and more compact machine code.

This kind of sounds like baseless griping. Unless you write a compiler, why do you care? Do you really sweat the instruction prefixes?


I wish Intel had spent some transistors on an arbitrary-precision decimal floating-point unit. That would have helped scientific processing, but in the past it has been 'too expensive' in terms of transistors to implement. Now that we have more transistors than we know what to do with, it seems like that should be revisited.

Then generalize the vector coprocessing abilities of the GPU, and that would be a pretty flexible base to work from.


Why does scientific computing care about decimal arithmetic?


Same reason accountants do: in decimal representation, numbers like .1 aren't approximations of repeating fractions, the way they are in binary.


Citation needed, please. I don't know where nature is decimal.


Actually it's more fun if you just "do the math", and it's a fun computer science problem:

0.1 is 1 / 10, fractional binary bits are 1/2, 1/4, 1/8, 1/16 ... find a combination of bits that represents 0.1.
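
A quick way to see it for yourself (plain C, nothing exotic): the double that gets stored is the nearest binary fraction to 0.1, not 0.1 itself.

  #include <stdio.h>

  int main(void)
  {
      /* 0.1 in binary is 0.000110011001100... repeating forever, so the
         stored double is only the nearest representable binary fraction. */
      printf("%.20f\n", 0.1);            /* prints 0.10000000000000000555... */
      printf("%d\n", 0.1 + 0.2 == 0.3);  /* prints 0: the rounding errors differ */
      return 0;
  }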

If you want a very long treatise about how binary sucks for doing arithmetic, wander over to http://speleotrove.com/decimal/


I'd certainly like to have decimal FPU types too. I know IBM did some nice work, but the rest of the industry mostly ignored it, and I think it's a pity.

Still, I don't see where in scientific computing anybody would need it, by the nature of the problems being solved -- what is decimal in nature? When you calculate with base-2 FP you have better "resolution" and partial results in the "absolute" sense (not in the "let me see the decimal digits" sense, of course). For the same reason, when you make a series of calculations, the error accumulates more slowly with binary. That's why base-2 FP has been used for all these years. When you don't need to calculate money amounts, it is better.

But what are the examples where decimal is more "natural" for scientific computing?


To amplify this a bit: in binary, the relative and absolute errors are within a factor of two of each other, but the gap can be a factor of up to 10 for decimal. IBM actually used to use base 16 on some hardware, but the relative/absolute error disparity of a factor of 16 hurt, and it was eventually abandoned.


Right, and I can see why that's a problem in accounting, but why does it matter for scientific computing? I do a fair amount of stuff that could be called scientific computing, and I just use doubles. If I need to keep track of uncertainty or propagate errors, I normally use Gaussians as the representation (not "significant figures" as would be implied by using decimal).


It almost never matters in scientific computing. Doubles give us the equivalent of almost 16 digits of accuracy, and that's more precision than we know any physical constant to. You're right that the world isn't decimal, and switching to decimal encodings actually reduces the effective precision of any computation.


There's a reason they're called the natural numbers. Nature doesn't have to be decimal for decimals to be useful (the question that started this debate); it just has to be rational. Many many many parts of nature are rational, and sometimes we need to deal with them in scientific computing. DNA sequence processing comes to mind.


It matters for numerical methods, which are frequently used in optimizations and simulations. A recent example: https://www.circuitlab.com/blog/2013/07/22/double-double-ple...

That's why many math packages (e.g. Mathematica) provide arbitrary-precision arithmetic.


Arbitrary precision != decimal. So the question still stands: why would decimal matter?


When the question is stated like this, the answer is: it is really more convenient for us humans, and computers would "lie" to us less in a significant number of cases. We don't care that much about accumulated partial errors in computation, and we just don't expect our inputs to be interpreted immediately "wrong."

Think about it this way: you have a computer capable of billions of operations per second, with unfathomable capacity, and still it lies to you as soon as you enter the number 16.1 in almost any program: it stores some other number, dropping an infinite number of binary digits! Why, you ask? The answer is "because otherwise it's not in the format native to the hardware."

So it should be native to the hardware. Just not because of "scientific computing" but more "for real-life, everybody's everyday computing." We need it for the computers to "just work."


Thanks for the response. So decimal-in-silicon doesn't matter for scientific computing after all. :)


Yes, I was the one who questioned the "scientific" motive -- see the top of the thread! Still, on human grounds, I claim we really, really need it in hardware. It doesn't matter for "scientific computing"; it matters for us humans, as long as the decimal system is the only one we really "understand."

Any program that computes anything a lot has to use hardware-based arithmetic to be really fast, yet nobody expects that the 0.10 they write loses its meaning as soon as it is entered. It is an absurd situation.


To be clear... I wasn't arguing for decimal. I was simply saying that "just use doubles" wasn't a valid solution for many scientific problems.


I'm sorry, are you asking for what purpose a scientist would need to multiply by a power of ten? Converting between units in scientific notation? Doing base 10 logarithms? Calculating things in decibels?

Maybe I have misunderstood the question.


The original post that started this thread was saying that chips should support decimal floating point natively in silicon instead of only base-2 floating point. Yes, those are different things: https://en.wikipedia.org/wiki/IEEE754#Basic_formats


I am well aware of the difference between base 2 and base 10.

You asked "why not just use doubles?" And my answer is "because one often multiplies by powers of ten."


Fascinating discussion. There are a couple of threads here, and they can be summed up as precision arguments and range arguments. I confess I'm friends with Mike Cowlishaw (the guy behind speleotrove.com) and he's influenced my thinking on this quite a bit.

So precision arguments generally come under the heading of how many significant digits you need, and if it's less than 15 or so you're fine using a binary representation. If it's more than 15 you can still use binary, but it's not clear to me that it's a win.

The second is range. So if you're simulating all of the CO2 molecules in the atmosphere, and you actually want to know how many there are, and you want to work with the precise values of the ratios of N2, NO2, H2, O2, CO2, etc. in the atmosphere, then, as I understand it, you're stuck approximating. (For context, I was asking a scientist from Sandia National Laboratories about their ability to simulate nuclear explosions at the particle level and wondering if climate scientists could do the same for simulating an atmosphere of molecules, or better atoms, or even better, particles.) That is a problem where there is a lot of dynamic range in your numbers: adding 1 to .5866115*10^20 doesn't really change the value because you've run out of precision.
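
The absorption effect is easy to reproduce; a small sketch in C with the same magnitude of number:

  #include <stdio.h>

  int main(void)
  {
      /* A double carries roughly 15-16 significant decimal digits, so near
         10^20 the spacing between representable values is in the thousands;
         adding 1 rounds straight back to x. */
      double x = 5.866115e19;
      printf("%d\n", x + 1.0 == x);   /* prints 1 */
      return 0;
  }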

And yes, you can build arbitrary-precision arithmetic libraries (I built one for Java way back in the old days), but if you're working with numbers at that scale it gets painful without hardware support of some kind.

In my day-to-day use I find that imprecision in binary screws up navigation in my robots as they are trying to figure out where they are, but repeated re-location helps keep the error from accumulating. And yes, it's a valid argument that dead reckoning is for sissies, but it's very helpful when you have limited sensor budgets :-)


It seems that acqq meant to ask where you might find exactly 1/10, 1/2 or .1 of something in nature.

I'm thinking maybe in stoichiometry?


Your fingers


Try Power7/8; IBM loves decimal math.


Well, all those instruction prefixes mean that decoding x86 instructions is really hard. That leads to more cycles to decode, and thus a larger branch-mispredict penalty. And your decode unit takes as much power as your integer cluster (which is still small potatoes compared to all the OoO resources). And you're limited to 1 decoded instruction per cycle when executing instructions brought into cache for the first time, before the processor can tag the boundaries (but again, most instructions are executed many times).


If the instruction cache still contains raw, undecoded x86 instructions, I'd be amazed. That seems like the lowest of low-hanging fruit.


Well, sort of.

For the actual 32 KB instruction cache they just tag the instructions in the cache at the instruction boundaries. The Pentium IV tried holding decoded instructions (or traces, rather) in the instruction cache, but it didn't work out very well. The complications induced by lining up the different instruction sizes with the previously specified jump targets and the fact that decoded instructions are much bigger made that approach less successful.

Intel Sandy Bridge and later processors do keep a sort of L0 instruction cache full of decoded instructions with all the annotations needed to sort out branching. It can hold 1.5 K uOps, which is certainly something but not so big that you won't be missing it regularly.


  > And you're limited to 1 decoded instruction per cycle
  > when executing instructions brought into cache for the
  > first time
I don't think that's true for any recent processors. For Sandy Bridge, you can get a throughput of 4 instructions decoded per cycle as long as they average 4 bytes or less in length and are well aligned. The exact details are a bit messy given the different layers, but you can mark up to 6 instructions or 16B per cycle, and you can put 4 into the queue each cycle (or 5 if there is fusion): http://www.realworldtech.com/sandy-bridge/3/

  > It can hold 1.5 K uOps, which is certainly something
  > but not so big that you won't be missing it regularly.

You'll be missing it frequently, but it's big enough that it should hold almost any tight loop. Intel is claiming it's equivalent to a 6KB instruction buffer, and has about an 80% hit rate.


In fact, it hasn't been true since the Pentium Pro. Agner Fog's excellent microarchitecture docs [1] indicate that the PPro (which was a 3-wide machine, and the first out-of-order x86) could find 3 instructions per cycle out of a 16-byte fetch chunk, and likely did so by speculatively starting decode at each byte in parallel.

[1] http://www.agner.org/optimize/microarchitecture.pdf


Is there a good place to begin to learn about this stuff from the ground up? Maybe a user-friendly compiler written for the purpose of education?


From the ground up? Well, there's always the source:

http://www.intel.com/content/www/us/en/processors/architectu...

For learning assembly? For MASM I enjoyed this book years ago:

http://www.amazon.com/Assembly-Language-Intel-Based-Computer...

For GAS something like this might be more appropriate:

http://www.amazon.com/Professional-Assembly-Language-Richard...

Patterson & Hennessy is used a lot in colleges to teach low-level architecture and assembly:

http://www.amazon.com/Computer-Architecture-Fifth-Quantitati...


The shallow end of the pool is simple RISC CPUs, e.g. the Atmel 8-bit AVR. The complete instruction manual is something like 160 pages (compared to 3k+ for x86), and there are tons of beginner resources for doing assembler on those chips.


There's more structured material for MIPS than AVR. Plus the AVR has some funky memory maps and modes.


This all time classic from MIPS is a good starting point: "MIPS R4000 Microprocessor User's Manual" [1]

[1] http://groups.csail.mit.edu/cag/raw/documents/R4400_Uman_boo...


There is a developer-friendly virtual machine called Bochs. It can be hooked to a debugger so you can see what is happening. I prefer DDD as the debugging front end, hooked to GDB as the backend.

Here is an assembly cheat sheet that I like a lot: http://www.jegerlehner.ch/intel/

Also, GCC can output assembly if you want to see what simple C code looks like in assembly: http://stackoverflow.com/questions/137038/how-do-you-get-ass...

Finally, I would say that the recommended method of learning about all of this and more is to take a course at uni. You should take C programming and operating system architecture, where you should learn assembly. Then you should take a course where you get to build your own OS (http://www.uio.no/studier/emner/matnat/ifi/INF3151/index-eng...) followed by a course on multi-core architecture (http://www.uio.no/studier/emner/matnat/ifi/INF5063/)


I would recommend not jumping right into x86 and instead starting with something simpler. I learned MIPS in school. The instruction set is very easy to wrap your head around.

MIPS Quick Tutorial: http://logos.cs.uic.edu/366/notes/mips%20quick%20tutorial.ht...

A simulator that will let you run MIPS on windows/linux/mac: http://pages.cs.wisc.edu/~larus/spim.html



And yet phones are already doing real-time 4k video encoding.

In other words, since a lot of compute-intensive scenarios are already being served by less general-purpose hardware, what are the most recognizable scenarios where AVX-512 will make a big difference over AVX-256?

I know video encoding can be an integer-only algorithm, while AVX seems to help floats more, but still...


Which phone can do realtime 4k encoding? Can you tell me which SoC it uses?

Edit: It's the Acer Liquid S2, with Qualcomm Snapdragon 800 SoC


All Snapdragon 800-based phones, and there are quite a few of them, with more coming in the next few months (and of course later even more chips doing the same). Right now it's the one you said, the LG G2, the Sony Xperia Z1, the Samsung Galaxy Note 3, and soon the Nexus 5.

Now I'm not 100 percent sure if every single one of them has that as a user-centric feature (maybe they didn't enable it), but the chip supports 4k video recording.


Surely you'd prefer to encode video much faster than real time.


If it can do it instantly, that would be great. But why does it need to be faster than real-time? You can't capture video any faster. (Though, "real-time" probably needs some sort of framerate qualifier… no doubt 60 fps is harder to encode than 30 fps.)

Unless you're thinking of re-encoding … on a phone?


Sure, that way you could apply shake-reduction or framerate adjustments (like slow motion) after capture. Or, re-encode to send as MMS or something.


Yeah, there's always more processing on a video that can be done: shake/jitter reduction, histogram normalisation, hipster-like (Instagram) filters, etc. (Just imagine a real-time Instagram filter for video on your phone.)

Edit: Plus it means less power used, since the processor doesn't need to be on for as long to process the video (duty throttling / pulse-width modulation, etc.).


I think AVX is for scientific computing, where there are a variety of algorithms that can't justify dedicated offload hardware the way video encoding does.


The mask registers that can turn off operations on individual elements of a vector reminded me of CUDA. It might be possible to emulate individual "threads" on these pretty easily.

Does OpenCL have a similar threading concept? I don't know much about it, sadly.
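
To make the mask-register idea concrete, a sketch using the AVX-512 intrinsic names from Intel's published documentation (hedged: toolchains shipping at the time may not have had them yet). Lanes whose mask bit is 0 simply keep the value from src, much like inactive threads in a CUDA warp.

  #include <immintrin.h>

  /* Per-lane predication: only lanes whose bit in k is set get a + b;
     the others pass src through unchanged. */
  __m512 masked_add(__m512 src, __mmask16 k, __m512 a, __m512 b)
  {
      return _mm512_mask_add_ps(src, k, a, b);
  }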


I really wonder what this is for. Isn't AVX-256 code usually already limited by memory bandwidth? There are design tradeoffs between memory latency and memory bandwidth, and in order for Intel CPUs to keep their advantage in high single-threaded performance, they have to lean towards the low-latency side of things.

Is this aimed at Larrabee?


SKUs of Haswell with the GT3e graphics configuration include 128 MB of on-package DRAM, which is intended to give more memory bandwidth to the GPU, but it also acts as a cache for the CPU.

Based on what you're saying, it seems as though AVX-512 and future models with larger, faster embedded DRAM might play very nicely together.


I should have been clearer. I was talking about the cache hierarchy and CPU memory pipes. I suppose there's a divergence in main memory too, with CPUs using DDR3 and GPUs using GDDR5, but as you say you can just throw cache at that problem.


Larrabee itself is defunct, but one of its successors is the "Xeon Phi" line, and the Knights Landing generation of that will have AVX-512.


Memory technology keeps improving as well, and will have improved by late 2014/2015. Regardless, even if an eight-core chip would be starved for memory running AVX-512 instructions continuously, the advantage would be power and heat savings, with a single core running a limited set of computations as quickly and efficiently as possible, doing the most with the least. As the power profile of chips gets smaller and smaller, that would be the biggest advantage.


Is there any x86 instruction set that supports quadruple-precision floating-point numbers? If not, why not? Is it not useful enough?


There aren't a lot of advantages to putting quadruple precision into the hardware. Typically with numeric code you don't run out of exponent space; you run into precision limits. To increase precision you can use software techniques like double-double representation. This doubles the precision and keeps the exponent range the same, at the cost of an increased number of instructions.
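
For the curious, the core of the double-double trick is an error-free transformation like Knuth's TwoSum; a minimal sketch (assumes round-to-nearest and no -ffast-math-style reassociation):

  /* TwoSum: returns s = fl(a + b) and e such that s + e == a + b exactly.
     Chaining steps like this yields ~32 significant digits from pairs of
     ordinary doubles, with no new hardware. */
  void two_sum(double a, double b, double *s, double *e)
  {
      double bb;
      *s = a + b;
      bb = *s - a;
      *e = (a - (*s - bb)) + (b - bb);
  }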

The real action is in FMA (fused multiply-add) instructions. These instructions do two operations and then a single correct rounding of the result (e.g. round(a*b + c)). FMA in hardware is great. It lets you write functions with provably tight error bounds or even provably correct rounding of the result. More and more platforms are providing FMA [1].

[1]: http://en.wikipedia.org/wiki/Multiply%E2%80%93accumulate_ope...
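
For example, C99's fma() from <math.h> exposes exactly that single-rounding behaviour, which makes the rounding error of a product available in one extra operation (a sketch; fast only if the target actually has an FMA unit):

  #include <math.h>

  /* TwoProd via FMA: p = fl(a * b) and err is its exact rounding error,
     so p + err == a * b exactly. */
  void two_prod(double a, double b, double *p, double *err)
  {
      *p   = a * b;
      *err = fma(a, b, -*p);
  }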


Where exactly is it useful? Double is already overkill for all multimedia operations, and finance folks use fixed-point anyway.


Quadruple (or higher) precision is needed in many scientific applications.


Do you know of any specific applications? "many" seems like an extraordinary claim. I'm under the impression that quad precision is fringe. Double precision support on GPUs is fairly recent.


Sure. Just for a start, some of David Bailey's papers (http://www.davidhbailey.com/dhbpapers/), e.g. "High-Precision Arithmetic: Progress and Challenges" and "High-Precision Floating-Point Arithmetic in Scientific Computation" show some concrete examples. You will find many others by searching for papers containing the words "quad double arithmetic", "high precision arithmetic", "multiple precision arithmetic" or similar terms. Most applications are probably in physics, and of course in pure mathematics.


In AVX-512, floating-point vector instructions have per-instruction options for specifying the rounding mode and for suppressing exceptions.

That's huge.


This might well be a dumb question, but why don't we have 128-bit processors?

No advantage? Crazy expense?


At some point, you need to ask what you mean by 128-bit. When people talk about an 8-bit, 16-bit, 32-bit, or 64-bit processor, they are actually generally conflating two or more things. There's the size of the general-purpose registers, the size of the data bus (how much you can load from memory in a single transfer), the size of the address bus (how many lines you have for addressing RAM), and the size of pointers. In many machines, these have been the same, though, for example, 8-bit processors frequently had 14- or 16-bit addresses and busses so they could access up to 16 or 64k of memory; but there's also, for example, the 68008 with 32-bit registers, a 20-bit address bus, and an 8-bit data bus.

So, when people talk about 32 or 64 bits, they generally mean two things: the size of the general-purpose registers and the size of addresses.

There's basically no need for addresses beyond 64 bits, at least for quite some time. With 64 bits, you can address 16,384 petabytes (16 exabytes) in a single process. Since the biggest single machines I can find these days support a maximum of 4 TB of RAM (if you filled it with 32 GB DIMMs that aren't yet available), we have a long way to go before you will need more than 64 bits of address space.

Furthermore, increasing address size can hurt performance. If your pointers are all 128-bit, they take up twice the space of 64-bit pointers. There have already been plenty of workloads that show a reduction in performance when ported to 64-bit machines, just because the 64-bit pointers fill up so much valuable cache space. In fact, for this reason, Linux even has support for the x32 ABI, which uses an x86-64 processor in 64-bit mode but only uses 32-bit pointers, so programs can take advantage of the extra registers available in x86-64 without paying the price of the larger pointers. https://en.wikipedia.org/wiki/X32_ABI

So, there's no benefit to 128-bit addresses and lots of potential downside, which means it's not going to happen for quite some time. How about for data, though?

Well, most software doesn't really need to work with integers or floating-point numbers larger than 64 bits anyhow. For lots of applications, 64 or even 32 bits is sufficient. Public-key crypto can frequently take advantage of large integers, though it generally needs even bigger integers, like 2048 bits, so you generally have to do bignum arithmetic anyhow.
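
And when a wider integer is occasionally handy, compilers already synthesize it from 64-bit operations; a sketch using the GCC/Clang __int128 extension (an assumption about the compiler, not standard C):

  #include <stdint.h>

  /* A full 64x64 -> 128-bit multiply compiles down to a couple of 64-bit
     multiply/add-with-carry instructions; no 128-bit general-purpose
     registers are required. */
  unsigned __int128 mul64x64(uint64_t a, uint64_t b)
  {
      return (unsigned __int128)a * b;
  }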

Lots of the gains that you get from working with larger types come from working on vectors of smaller types. But for those purposes, chips have had 128-bit registers for quite some time. SSE, introduced in 1999, included 128-bit vector registers, which could be treated as four 32-bit floats (AltiVec on PowerPC had introduced the same idea a few years earlier; the idea of SIMD has been around in supercomputers for many years). Later extensions like SSE2 expanded their use to let you treat them as two 64-bit doubles, two 64-bit integers, four 32-bit integers, eight 16-bit shorts, or sixteen 8-bit bytes.

So, for the only use case for which it's particularly valuable, working on vectors in aggregate, we've had 128-bit registers for quite some time. We've had 256-bit registers for a couple of years now in the form of AVX. Now this promises to expand those to 512 bits. There's no good reason to expand your addresses in the same way; at that point, you're just wasting space.


Upvoted. Although 16 exabytes is less headroom if you are memory-mapping persistent storage rather than just RAM, which makes increasing sense with SSDs. 64-bit addressing is still plenty for most scenarios for some time to come, though, even if this approach is taken.


There's no particular advantage to having 128-bit integers or pointers. 128-bit or larger SIMD has existed for 15 years or so.




