GPU (currently CUDA only) is the primary target for our inference server implementation. It does "run" on CPU, but our goal is to enable an ecosystem that is competitive with Alexa in every possible way, and even with the amazing work of whisper.cpp and other efforts, CPU just isn't there (yet).
We're aware that's controversial and not really applicable to many home users - that's why we want to support any TTS/STT engine on any hardware supported by Home Assistant (or elsewhere), in addition to on-device local command recognition on the ESP BOX.
But for people such as yourself, and other commercial/power/whatever users, the inference server we're releasing next week to work with Willow provides impressive results - on anything from a GTX 1060 to an H100 (we've tested and optimized for everything in between).
We use ctranslate2 (like faster-whisper) and some other optimizations for performance and conservative VRAM usage. We can load large-v2, medium, and base simultaneously on a 3 GB GTX 1060 and handle requests without issue.
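To make that concrete, here's a minimal sketch of the general approach - loading several ctranslate2 Whisper models side by side through the faster-whisper API with int8 quantization to keep VRAM down. This is not our inference server code; the model names, compute type, and helper function are just illustrative:

```python
# Minimal sketch, NOT the Willow inference server: several ctranslate2 Whisper
# models loaded side by side via faster-whisper, quantized to keep VRAM low.
from faster_whisper import WhisperModel

# Standard Whisper sizes; "int8_float16" stores weights in int8 on the GPU
# while doing the math in fp16, which is what makes a small card viable.
models = {
    name: WhisperModel(name, device="cuda", compute_type="int8_float16")
    for name in ("large-v2", "medium", "base")
}

def transcribe(audio_path: str, model_name: str = "medium", beam_size: int = 1) -> str:
    """Run ASR with the requested model; beam_size=1 is greedy decoding."""
    segments, _info = models[model_name].transcribe(audio_path, beam_size=beam_size)
    # faster-whisper returns a lazy generator; joining it runs the decode.
    return " ".join(segment.text.strip() for segment in segments)
```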
Again, it's controversial, but the fact remains that a $100 Tesla P4 from eBay - which idles at 5 watts and has a max TDP of 60 watts - running our inference server implementation does the following:
large-v2, beam 5 - 3.8s of speech, inference time 1.1s
medium, beam 1 (suitable for Willow tasks) - 3.8s of speech, inference time 588ms
medium, beam 1 (suitable for Willow tasks) - 29.2s of speech, inference time 1.6s
An RTX 4090 with large-v2, beam 5 does 3.8s of speech in 140ms, and 29.2s of speech with medium, beam 1 (greedy) in 84ms.
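If you want to sanity-check numbers like these on your own hardware, a rough timing sketch with faster-whisper follows. It assumes a CUDA GPU and a local test clip ("speech.wav" is a placeholder), and it is not the benchmark harness we used:

```python
# Rough timing sketch (assumes faster-whisper and a CUDA GPU); not our
# benchmark harness, just a way to get roughly comparable numbers.
import time
from faster_whisper import WhisperModel

model = WhisperModel("large-v2", device="cuda", compute_type="float16")

def time_transcription(audio_path: str, beam_size: int) -> float:
    start = time.perf_counter()
    segments, _info = model.transcribe(audio_path, beam_size=beam_size)
    list(segments)  # the segment generator is lazy; consuming it runs the decode
    return time.perf_counter() - start

# "speech.wav" is a placeholder for your own test clip.
print(f"beam 5: {time_transcription('speech.wav', beam_size=5):.3f}s")
print(f"greedy: {time_transcription('speech.wav', beam_size=1):.3f}s")
```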