Yes, most instructions that modify only the bottom element(s) didn't get a ymm (256-bit) version, since it would serve no purpose: it would produce the same result as the xmm one. The corresponding intrinsics mostly follow the same pattern, so there is no int64 -> ymm intrinsic.

Widening the xmm result with the `__m128i` -> `__m256i` cast intrinsic works fine though:

https://godbolt.org/z/M9XWCb
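In case the link goes stale, here is a minimal sketch of the idea rather than a copy of the linked code. The function names `int64_to_ymm` and `low_lane_roundtrip` are made up for illustration, and the `target("avx")` attribute is an assumption that you are on gcc or clang and would rather not pass `-mavx` on the command line.

```c
#include <immintrin.h>
#include <stdint.h>

/* Hypothetical helper: put a 64-bit integer in the low element of a ymm.
   target("avx") is a gcc/clang attribute; with -mavx it is not needed. */
__attribute__((target("avx")))
static __m256i int64_to_ymm(int64_t x)
{
    __m128i lo = _mm_cvtsi64_si128(x);   /* int64 -> low 64 bits of an xmm */
    return _mm256_castsi128_si256(lo);   /* reinterpret as ymm; upper 128 bits undefined */
}

/* Hypothetical check: read the low 64 bits back out of the ymm. */
__attribute__((target("avx")))
int64_t low_lane_roundtrip(int64_t x)
{
    __m256i v = int64_to_ymm(x);
    return _mm_cvtsi128_si64(_mm256_castsi256_si128(v));
}
```

Only the low 128 bits of the result are meaningful here. If you need the upper half to be zero, newer compilers also offer `_mm256_zextsi128_si256`, though whether your toolchain has it is something to check.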

Intel even says about the cast:

> Cast vector of type __m128i to type __m256i; the upper 128 bits of the result are undefined. This intrinsic is only used for compilation and does not generate any instructions, thus it has zero latency.

That seems a bit beyond their mandate, since what compilers generate is mostly up to them, and in fact it doesn't seem to be strictly true: at -O0, both gcc and clang generate a few extra instructions for the cast. With optimization on, it's all good though.