There's a --device flag you can pass. I've been trying to get `--device cuda` to work on my Windows machine and it's saying that torch wasn't compiled with CUDA. Trying to figure out what's going on there.
And on the M1, supposedly PyTorch has support for hardware acceleration using MPS (Metal Performance Shaders, announced here https://pytorch.org/blog/introducing-accelerated-pytorch-tra...) but when I tried `--device mps` it blew up with an error "input types 'tensor<1x1280x3000xf16>' and 'tensor<1xf32>' are not broadcast compatible".
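As a first sanity check, this at least shows what the installed torch build claims to support (a minimal sketch; `torch.backends.mps` only exists in PyTorch 1.12+):

    import torch

    print(torch.__version__)                  # a "+cpu" suffix means a build without CUDA support
    print(torch.cuda.is_available())          # False on CPU-only builds or without a visible NVIDIA GPU
    print(torch.backends.mps.is_available())  # True on Apple Silicon with an MPS-enabled build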
> I've been trying to get `--device cuda` to work on my Windows machine and it's saying that torch wasn't compiled with CUDA.
I struggled with the same. Here's what worked for me:
Use pip to uninstall PyTorch first, which should be "pip uninstall torch" or similar.
Find the CUDA version you have installed[1]. Go to the PyTorch Get Started page[2] and use their guide/wizard to generate the pip string, then run that. I had to change pip3 to pip FWIW, and with CUDA 11.6 installed I ended up with "pip install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu116".
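Once reinstalled, you can verify that the CUDA build took before running whisper again (these are standard torch calls):

    import torch

    print(torch.version.cuda)             # e.g. "11.6" for the cu116 wheel
    print(torch.cuda.is_available())      # should now be True
    print(torch.cuda.get_device_name(0))  # e.g. "NVIDIA GeForce RTX 2080 Ti"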
After that I could use --device cuda, and the difference was immense. On my 2080 Ti, transcribing a minute of audio with the large model went from roughly an hour to 10-20 seconds.
Yep, same for me: on M1, after enabling MPS (with `model.to("mps")`), it just SIGSEGVs or SIGABRTs every time on that line. The extremely unclean nature of the abort is making it hard to debug :(
I noticed the size seems to correspond to the model. With the large model the error is tensor<1x1280x3000xf16>; with tiny it's tensor<1x384x3000xf16>, and with medium it's tensor<1x1024x3000xf16>. It also seems like a bad sign that those are f16s while the "expected" data is f32.
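The middle dimension is just each model's audio-encoder width, and (if I'm reading Whisper's audio code right) 3000 is the number of mel frames in its 30-second window (30 s of 16 kHz audio with a hop length of 160), so the shapes line up exactly:

    # Encoder widths, as read off the error messages above
    dims = {"tiny": 384, "medium": 1024, "large": 1280}
    for name, width in dims.items():
        print(f"{name}: tensor<1x{width}x3000xf16>")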
I'm giving up for the night, but https://github.com/Smaug123/whisper/pull/1/files at least contains the setup instructions that may help others get to this point. Got it working on the GPU, but it's… much, much slower than the CPU? Presumably due to the 'aten::repeat_interleave.self_int' CPU fallback.
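For reference, that CPU fallback is opt-in via an environment variable; I believe it has to be set before torch is first imported, e.g.:

    import os

    # Opt into CPU fallback for ops the MPS backend doesn't implement yet.
    # Must be set before the first `import torch` to take effect.
    os.environ["PYTORCH_ENABLE_MPS_FALLBACK"] = "1"

    import torch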
Also hitting a nice little PyTorch bug:
> File "/Users/patrick/Documents/GitHub/whisper/whisper/decoding.py", line 388, in apply
>     logits[:, self.tokenizer.encode(" ") + [self.tokenizer.eot]] = -np.inf
> RuntimeError: dst_.nbytes() >= dst_byte_offset INTERNAL ASSERT FAILED at "/Users/runner/work/pytorch/pytorch/pytorch/aten/src/ATen/native/mps/operations/Copy.mm":200, please report a bug to PyTorch.
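A possible (unverified) workaround would be to do that fancy-index assignment on the CPU and copy the result back; here `mask_tokens` is a hypothetical stand-in for the `self.tokenizer.encode(" ") + [self.tokenizer.eot]` list from decoding.py:

    import numpy as np
    import torch

    def suppress_tokens(logits: torch.Tensor, mask_tokens: list) -> torch.Tensor:
        # Hypothetical workaround: the indexed assignment trips an internal
        # assert in the MPS Copy path, so do it on CPU and move it back.
        logits_cpu = logits.cpu()
        logits_cpu[:, mask_tokens] = -np.inf
        return logits_cpu.to(logits.device)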