The cpu_features library (googleblog.com)
183 points by stablemap on Feb 7, 2018 | 35 comments



GCC has the __builtin_cpu_supports() function on some architectures, so you may want to use that instead.

It's supported:

* on x86 since GCC 4.8

* on PowerPC since GCC 6 (you will also need glibc ≥ 2.23)

Documentation:

https://gcc.gnu.org/onlinedocs/gcc/x86-Built-in-Functions.ht...

https://gcc.gnu.org/onlinedocs/gcc/PowerPC-Built-in-Function...
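For illustration, runtime dispatch on the builtin looks roughly like this (a minimal sketch; the compute_* functions are made-up placeholders):

    #include <stdio.h>

    static void compute_avx2(void)    { puts("AVX2 path"); }
    static void compute_generic(void) { puts("generic path"); }

    int main(void) {
        /* Only required when feature checks run before constructors;
           harmless to call here. */
        __builtin_cpu_init();
        if (__builtin_cpu_supports("avx2"))
            compute_avx2();
        else
            compute_generic();
        return 0;
    }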


Does clang have anything similar?


Since Clang 3.8 at least, it has __builtin_cpu_supports() too.


An alternative library, cpuinfo (https://github.com/Maratyszcza/cpuinfo), is a similar offering that additionally exposes more information, such as cache sizes, topology (number of sockets, etc.), and details of any attached integrated GPU.
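A hedged sketch of its C API, with function names taken from the project's README (verify against the current headers):

    #include <stdio.h>
    #include <cpuinfo.h>  /* from Maratyszcza/cpuinfo */

    int main(void) {
        if (!cpuinfo_initialize())
            return 1;
        printf("cores: %u\n", cpuinfo_get_cores_count());
    #if defined(__x86_64__) || defined(__i386__)
        printf("avx2:  %d\n", cpuinfo_has_x86_avx2());
    #endif
        cpuinfo_deinitialize();
        return 0;
    }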


I guess this is helpful for new projects, but most high-performance programmers are already parsing cpuid output.

I was hoping for a more convenient hwloc replacement for discovering NUMA topology and logical/physical cores.

I guess what I mean is that if you're already at the level of caring what AVX you have, then parsing cpuid isn't too difficult.
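For example, checking the AVX bit straight from CPUID via GCC/Clang's <cpuid.h> is only a few lines (a sketch; a complete check must also confirm OS support for the YMM state via XGETBV):

    #include <stdio.h>
    #include <cpuid.h>  /* GCC/Clang wrapper around the CPUID instruction */

    int main(void) {
        unsigned eax, ebx, ecx, edx;
        if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
            return 1;                   /* leaf 1 not supported */
        int osxsave = (ecx >> 27) & 1;  /* OS has enabled XSAVE */
        int avx     = (ecx >> 28) & 1;  /* CPU implements AVX */
        printf("AVX: %s\n", (avx && osxsave) ? "present" : "absent");
        return 0;
    }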


They already had a CPU features library in the Android NDK.

https://android.googlesource.com/platform/ndk.git/+/master/s...


FYI, I'm the author of the library. I designed it with the help of the person who wrote the code for the Android NDK. We agreed that the end goal is to replace the NDK-specific code with this library: the NDK will expose the exact same API but use this library under the hood.


> features are retrieved by using the cpuid instruction. Unfortunately this instruction is privileged for some architectures, in which case we fall back to Linux.

What's the rationale for making this instruction privileged on the other procs? I can kinda understand making system registers privileged reads, but if you have your own encoding there's "no reason" not to give this to everyone, right?


The complex CPUID instruction is, to my knowledge, unique to x86 processors. The equivalents for other processors are essentially special machine status registers that set bits appropriately.

The main reasons these are privileged are probably the following:

a) The main use of these is checking if you've got floating-point and/or vector unit support. This support requires the OS to know about this information anyways (imagine if your OS didn't bother to save/restore these registers on context switches!), so making the OS check this information is not difficult.

b) There's generally already a fairly generic mechanism for stuffing OS-level special-purpose registers that's fairly unbounded in size, so it's essentially free to add these specific registers to that addressing scheme. Making some of these registers legal in user-space code is more expensive in hardware size and complexity, and (as mentioned earlier) the OS has to be modified anyways to take advantage, so requiring the OS to report these details to userspace code isn't problematic (see the sketch after this list).

c) For RISC processors, you've generally got a fairly full opcode space anyways. Taking up an opcode for a very niche functionality that isn't particularly necessary isn't a good use of scarce extra space.
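This is visible on Linux: the kernel reads those privileged registers and reports the features to userspace through the auxiliary vector, which is roughly the Linux fallback mentioned upthread. A sketch:

    #include <stdio.h>
    #include <sys/auxv.h>  /* getauxval: glibc >= 2.16; AT_HWCAP2: glibc >= 2.18 */

    int main(void) {
        /* Bitmasks of kernel-reported CPU features; the bit meanings
           are architecture-specific (see <asm/hwcap.h>). */
        printf("AT_HWCAP:  %#lx\n", getauxval(AT_HWCAP));
        printf("AT_HWCAP2: %#lx\n", getauxval(AT_HWCAP2));
        return 0;
    }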


Several reasons:

1. Malware can infer details from CPUID to guess if it’s in a VM or not. Useful to avoid detection/analysis.

2. When processes execute in a legacy mode, you may trap on CPUID to hide details the library won't understand or doesn't expect to exist. To avoid backwards-compatibility issues... better safe than sorry.

3. When developing SIMD-related libraries that need to be portable across multiple CPU versions, you may set up a CPUID mask (i.e. trap, then hide features) to ensure compatibility on legacy computers.

Overall, having the ability to trap and rewrite the CPUID instruction is incredibly useful. The difference between denying and rewriting boils down to whether a callback is provided. Both features require disabling native CPUID execution.


CPUID on x86 is a user-level instruction. When you're doing processor-assisted virtualization, CPUID is an instruction that causes an exit to the hypervisor, which allows you to emulate it. You clearly don't need such instructions to be privileged to play emulation games with CPUID.


>You clearly don't need such instructions to be privileged to play emulation games with CPUID.

There are configuration registers which can make CPUID a privileged instruction.


> 3. When developing SIMD-related libraries that need to be portable across multiple CPU versions, you may set up a CPUID mask (i.e. trap, then hide features) to ensure compatibility on legacy computers.

Huh? My impression was CPUID was exactly the opposite: using it properly allows ensuring compatibility on older computers while still getting maximum performance, since one can dynamically dispatch to an implementation that uses only what is supported. (E.g. switch between an SSE2 version and an AVX version.)


I think GP is saying that unit tests for such dynamically-dispatching code will want to be able to inject older CPUIDs, so the compatibility fallbacks have test coverage. If CPUID is a native instruction, it's harder to inject a fake value.

I'm not sure I find this compelling though. It's easy to make your own CPUID function that doesn't actually call "CPUID" in a test build.
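For instance (a sketch; all names are made up): route feature queries through a function pointer that a test build can override.

    #include <cpuid.h>
    #include <string.h>

    typedef struct { unsigned eax, ebx, ecx, edx; } cpuid_regs;

    static void real_cpuid(unsigned leaf, cpuid_regs *r) {
        __get_cpuid(leaf, &r->eax, &r->ebx, &r->ecx, &r->edx);
    }

    /* A test build swaps this for a fake that reports an older CPU. */
    static void (*cpuid_fn)(unsigned, cpuid_regs *) = real_cpuid;

    int cpu_has_avx(void) {
        cpuid_regs r;
        memset(&r, 0, sizeof r);
        cpuid_fn(1, &r);
        return (r.ecx >> 28) & 1;  /* AVX bit: leaf 1, ECX bit 28 */
    }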


Your impression is correct.


If they give uniquely identifying chip serial numbers, that'd be a reason to keep it secure.



AMD's Family 12h processors (Llano) are unknown to this library.

VIA also apparently does not exist.


Can you please file an issue on GitHub?


Intel has really made a mess out of CPUID. It's supposed to be an easy way to query CPU features, and for some things it is, but they keep changing the way it works, especially for newer features like AVX2. It's like they cannot even follow their own APIs.
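For the record, a correct AVX2 check really is a multi-step dance, which illustrates the point (a sketch for GCC/Clang on x86; the helpers come from <cpuid.h>):

    #include <cpuid.h>

    static int avx2_usable(void) {
        unsigned a, b, c, d, lo, hi;
        if (!__get_cpuid(1, &a, &b, &c, &d)) return 0;
        if (!((c >> 27) & 1)) return 0;           /* OSXSAVE enabled? */
        __asm__("xgetbv" : "=a"(lo), "=d"(hi) : "c"(0));
        (void)hi;                                 /* upper 32 bits not needed */
        if ((lo & 0x6) != 0x6) return 0;          /* OS saves XMM+YMM state? */
        if (__get_cpuid_max(0, 0) < 7) return 0;  /* does leaf 7 exist? */
        __cpuid_count(7, 0, a, b, c, d);          /* leaf 7, subleaf 0 */
        return (b >> 5) & 1;                      /* AVX2 feature bit */
    }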


Wonder if dmidecode could be updated to include more CPU details; the cache features and other flags it reports are limited.

        Version: Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz
        Voltage: 0.8 V
        External Clock: 100 MHz
        Max Speed: 2900 MHz
        Current Speed: 2300 MHz
        Status: Populated, Enabled
        Upgrade: Other
        L1 Cache Handle: 0x004C
        L2 Cache Handle: 0x004D
        L3 Cache Handle: 0x004E
        Serial Number: To Be Filled By O.E.M.
        Asset Tag: To Be Filled By O.E.M.
        Part Number: To Be Filled By O.E.M.
        Core Count: 4
        Core Enabled: 4
        Thread Count: 8


dmidecode decodes the dmi. lshw lists the hardware. lscpu lists the cpus.


The std.parallelism library in D lets you get some basic CPU info:

https://dlang.org/phobos/std_parallelism.html#.totalCPUs

num_cores: find number of cores in your PC's processor

https://jugad2.blogspot.in/2016/09/numcores-find-number-of-c...

You can also use the library for some simple parallel processing, as the name indicates:

Simple parallel processing in D with std.parallelism

https://jugad2.blogspot.in/2016/12/simple-parallel-processin...


Is this something that can be done at compile time? Would the compiler not have access to this information?


The CPU you compile on will probably not be the CPU your program runs on. If you know what CPU you'll be running on, I guess you can just make assumptions about what features that chip supports, but it's much safer to use something like this to verify at runtime that the CPU supports AVX or whatever features you need.


I was getting at cross-compilation, where you do know the CPU you're running on, but I see that this breaks down because the binaries can run on CPUs that support different features.


Even for same-architecture compilation, the host's feature set is rarely that of the target. If you're distributing binaries, you generally want to target a safe, old minimum baseline to maximize your user base: for x86-64, restricting yourself to SSE2 is safe [1] and often sufficient [2]. Compile farms and developer machines tend to be newer hardware.

[1] The x86-64 ABI actually requires SSE2 extensions to work correctly.

[2] SSE2 added double-precision and vectorized-integer support to the SSE registers, the former allowing you to replace x87 FPU usage for floating point (unless you need long double, which is extremely rare). The newer SSE sets generally add only specialized operations that are unlikely to show up in autovectorization anyways, and the wider vectors of AVX are less useful for performance in the "it might be useful" autovectorization scenario. They are useful for specific known hot regions of code, and when distributing binaries, variants of those regions are built for different feature levels and selected dynamically based on the user's actual hardware (see the sketch below).
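GCC can automate that last pattern: the target_clones attribute compiles a function several times and emits an ifunc resolver that the dynamic loader runs once, at startup, to bind the best variant (GCC >= 6 on x86 with glibc; a sketch):

    #include <stddef.h>

    /* Compiled twice; the resolver picks a clone at load time. */
    __attribute__((target_clones("avx2", "default")))
    void axpy(float a, const float *x, float *y, size_t n) {
        for (size_t i = 0; i < n; ++i)
            y[i] += a * x[i];  /* the avx2 clone can autovectorize wider */
    }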


To add to the other comments: not only is the machine where you compile not necessarily the same as the one where you run your software, but consider the case where you want the software to run on as many CPUs as possible while still making use of advanced features where supported.

You'd tell the compiler "target a machine with <basic instruction set>", so it's not allowed to use advanced features like FMA because it can't assume they're supported. With this library, you'd check at run time whether FMA is supported and dispatch to different functions accordingly.
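A sketch of that check using this library's x86 API (header and field names as shown in the project README; verify against the actual headers):

    #include <stdio.h>
    #include "cpuinfo_x86.h"  /* from google/cpu_features */

    int main(void) {
        const X86Features features = GetX86Info().features;
        /* fma3 is the field name assumed here for FMA support. */
        puts(features.fma3 ? "dispatch to the FMA kernel"
                           : "dispatch to the portable kernel");
        return 0;
    }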


Yes, I realized that I didn't take that into account. A follow-up though: could an executable loader find and "trim" implementations that aren't optimized for the processor it's running on, if they're properly marked in the binary?



Comment from the post:

> Just in case anyone is tempted, please do not write code that assumes the host on which it is compiled will always be the host on which it will be executed.


A related note: Don't assume that all AWS EC2 instances of a given type use identical CPUs.


When the processor model varies for more recent generations of EC2 instances, it is documented here: https://aws.amazon.com/ec2/instance-types/

Scroll down to get to the table that includes Physical Processor, Intel AVX, Intel AVX2, etc.

T2 instances do run on a number of different processor models, so the family is listed as "Intel Xeon Family." M4 instances run on the Intel Xeon E5-2676 v3 (Haswell) or the Intel Xeon E5-2686 v4 (Broadwell), though m4.10xlarge only runs on Haswell and m4.16xlarge only runs on Broadwell. Pay close attention to the * in the table for those instance sizes that can run on either Haswell or Broadwell.

Generally you will find that recent generations of the C and R instance families have identical CPUs within a generation.


Very good point; this has been my experience as well, particularly for burstable instances such as the T2s.


Only if you are compiling and running on the same machine, no?



