I wasn't talking about using SIMD.jl. I was talking about the implementation of the package (which is why I linked to a specific file in it), which directly generates SIMD intrinsics (with the help of some macros). As for the per-core performance difference you're seeing, it's only because your C code uses 32-bit floats while the Julia code here uses 64-bit floats.
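To make the float-width point concrete, here is a minimal C sketch of the lane arithmetic. It assumes a 256-bit vector register (as with AVX2), which is my assumption and not something stated above, and `lanes_per_register` is a hypothetical helper name. With half the element size, each vector instruction handles twice as many values, which is one plausible source of the gap.

```c
#include <assert.h>
#include <stddef.h>

/* Number of elements of a given size that fit in one SIMD register.
   register_bits is an assumption about the target (e.g. 256 for AVX2). */
static size_t lanes_per_register(size_t register_bits, size_t elem_bytes)
{
    assert(elem_bytes > 0);                  /* guard against division by zero */
    return (register_bits / 8) / elem_bytes; /* bits -> bytes -> element count */
}
```

Calling `lanes_per_register(256, sizeof(float))` and `lanes_per_register(256, sizeof(double))` shows the two-to-one ratio on platforms where `float` is 4 bytes and `double` is 8 bytes.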
He has a point. Currently there is no way in Julia to check which CPU instructions are available. So in practice, it's impossible to write low-level assembly code in Julia.
IIUC, SIMD.jl only works because it restricts itself to what LLVM guarantees to work cross-platform, which is quite far from being able to use AVX2, for example.
IIRC it relies on HostCPUFeatures.jl, which parses output from LLVM. However, this means it just crashes when used on a different CPU than the one it was compiled on (which can happen on compute clusters), and it also crashes if the user sets JULIA_CPU_TARGET.
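For contrast, this is roughly what run-time detection looks like in C, where the feature set is queried on the machine actually running the code instead of being fixed at compile time. It is only a sketch: `__builtin_cpu_init` and `__builtin_cpu_supports` are GCC/Clang builtins for x86, and `cpu_has_avx2`, `sum_portable`, `sum_avx2_placeholder` and `select_sum` are hypothetical names of mine. The "AVX2" kernel is a placeholder that reuses the portable loop so the example stays runnable everywhere.

```c
#include <assert.h>
#include <stddef.h>

/* Returns 1 if the running CPU reports AVX2, 0 otherwise.
   On non-x86 targets or other compilers this sketch conservatively
   reports 0 instead of failing. */
static int cpu_has_avx2(void)
{
#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();
    return __builtin_cpu_supports("avx2") ? 1 : 0;
#else
    return 0;
#endif
}

typedef double (*sum_fn)(const double *, size_t);

/* Plain scalar loop that works on any CPU. */
static double sum_portable(const double *x, size_t n)
{
    assert(x != NULL || n == 0);
    double acc = 0.0;
    for (size_t i = 0; i < n; ++i)
        acc += x[i];
    return acc;
}

/* Placeholder for a hand-written AVX2 kernel (hypothetical); it just
   forwards to the portable loop so the sketch runs on every machine. */
static double sum_avx2_placeholder(const double *x, size_t n)
{
    return sum_portable(x, n);
}

/* Pick an implementation based on the CPU we are running on right now. */
static sum_fn select_sum(void)
{
    return cpu_has_avx2() ? sum_avx2_placeholder : sum_portable;
}
```

Because the choice is made when `select_sum()` is called, moving the binary to a cluster node with an older CPU falls back to the portable loop instead of crashing.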