Does anyone know what they mean by "remote"? I skimmed through the documentation, and it still is not clear. When I search for "remote assembler" the only results are 1) this project and 2) a kind of job in manufacturing (mostly with reference to mainframes).
Remote in the sense that a local assembler runs on the developer's system, while a remote assembler runs on the end user's system. Using this lib, the developer uses C++ as if it were metacode or a macro system, emitting machine code instructions on demand. This can then presumably execute faster than plain C++ in certain applications.
I can think of several interesting applications for this, such as text search.
Regexp JIT was already mentioned. Notably, the regexp JIT technique is already used by practically all high-performance regexp libraries, for example PCRE's JIT and V8's Irregexp engine.
For text search in particular, you could also take advantage of SSE 4.2 string instructions, but still run on older CPUs. http://en.wikipedia.org/wiki/SSE4#SSE4.2
Similar story with AVX2: you have 256-bit-wide registers. Soon (Intel Skylake in 2015) there will be AVX-512, with 512-bit-wide registers and byte-level processing instructions. Being able to process 64 bytes in one instruction, with ILP [1] potential of two or more such instructions per clock cycle, can provide an order-of-magnitude performance advantage.
You can also optimize away code that's unnecessary for that particular search: no need for those ignore-case etc. flags. Or, for example, you could omit Unicode-related logic if you know ahead of time that normalization etc. won't be necessary for this particular text search. By reducing the number of branches [2], this can yield particularly high savings if you're already limited by the branch predictor's buffer.
If the memory access patterns are not sequential (= predictable by the CPU), you could insert prefetch instructions at CPU model appropriate places to ensure data is going to be in L1 cache in time before use.
If you know the data is going to be searched only once, you could give the CPU a hint that you're streaming it. The CPU can then optimize memory access patterns and minimize L1/L2 cache evictions, because it knows this data should not be kept in cache. In other words, non-temporal (= streaming) memory loads and stores, like http://www.felixcloutier.com/x86/MOVNTDQA.html.
You could do profile guided optimization at runtime. Or just try random variations and pick the fastest for that particular combination of parameters and hardware without recompiling anything. Different CPU models have a lot of variation [3].
And a lot of other things. If the data sets are large, ability to adapt to a particular problem at runtime can have a huge payoff.
[1]: Instruction level parallelism.
[2]: A branch can mean an if-statement, the ?: ternary operator, boolean logic ("||", "&&", etc.), a switch statement, and so on. Every branch in the currently executing loop can potentially need an entry in the CPU's branch predictor. If branch predictor buffer entries run out, the CPU may mispredict that branch every time. The cost of a mispredicted branch is very high: on an Intel Ivy Bridge processor, a single branch misprediction costs 14 clock cycles, the time to theoretically execute up to 4*14=56 instructions, practically about 15-30!
Slightly related links, LLVM CPU scheduler definitions:
I'm not sure either. My understanding is that they let you compile (part of) your program to a middle layer, a more abstract version of x86/64 instructions. The binary can then be generated on the actual machine, using the instruction set that machine supports. So if some of your computation prefers a certain instruction, but you still want it to run without that instruction, and you don't want the user to recompile from source, you can use this.