Our inference server (open source - releasing next week) supports loading LLaMA and derivative models, complete with 4-bit quantization, etc. I like Vicuna 13B myself :). Not to mention extremely fast, memory-optimized Whisper via ctranslate2 plus a bunch of our own tweaks.
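To give a feel for what the 4-bit / ctranslate2 side looks like with off-the-shelf libraries, here's a minimal sketch using transformers + bitsandbytes and faster-whisper. Model paths and options are illustrative, not necessarily what our server does internally:

```python
# Minimal sketch: a 4-bit LLaMA derivative via transformers + bitsandbytes and
# Whisper via ctranslate2 (faster-whisper). Model paths/options are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from faster_whisper import WhisperModel

# Quantize the LLM weights to 4 bits at load time so 13B fits in far less VRAM.
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
)
tokenizer = AutoTokenizer.from_pretrained("lmsys/vicuna-13b-v1.3")
llm = AutoModelForCausalLM.from_pretrained(
    "lmsys/vicuna-13b-v1.3",
    quantization_config=quant_config,
    device_map="auto",
)

# Whisper via ctranslate2: int8 weights keep it fast and memory friendly.
whisper = WhisperModel("large-v2", device="cuda", compute_type="int8_float16")
```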
Our inference server also supports long-lived sessions via WebRTC for transcription and similar applications ;).
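For the curious, a long-lived WebRTC audio session in Python looks roughly like this with aiortc (a common choice for this; purely illustrative of the pattern, not our exact implementation):

```python
# Rough sketch of accepting a long-lived WebRTC audio session with aiortc.
# A real deployment also needs signaling to exchange the SDP; omitted here.
from aiortc import RTCPeerConnection, RTCSessionDescription
from aiortc.contrib.media import MediaRecorder

async def handle_offer(offer_sdp: str) -> str:
    pc = RTCPeerConnection()
    recorder = MediaRecorder("session.wav")  # stand-in for a streaming STT sink

    @pc.on("track")
    def on_track(track):
        if track.kind == "audio":
            # For live transcription you'd pull frames with `await track.recv()`
            # and feed them to streaming STT instead of recording to a file.
            recorder.addTrack(track)

    await pc.setRemoteDescription(RTCSessionDescription(sdp=offer_sdp, type="offer"))
    await recorder.start()
    answer = await pc.createAnswer()
    await pc.setLocalDescription(answer)
    return pc.localDescription.sdp  # hand back to the client to finish setup
```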
You can chain speech to text -> LLM -> text to speech entirely within the inference server, with input and output flowing through Willow, other APIs, or whatever you want.
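Continuing the loading sketch above (this reuses its `whisper`, `llm`, and `tokenizer` objects), the chain itself is mostly plumbing: transcribe, prompt the model, synthesize the reply. SpeechT5 needs a speaker x-vector; the one below is the stock example embedding from the Hugging Face docs. Again, a sketch rather than our actual server code:

```python
# STT -> LLM -> TTS plumbing, continuing from the loading sketch above.
import torch
import soundfile as sf
from datasets import load_dataset
from transformers import SpeechT5ForTextToSpeech, SpeechT5HifiGan, SpeechT5Processor

processor = SpeechT5Processor.from_pretrained("microsoft/speecht5_tts")
tts = SpeechT5ForTextToSpeech.from_pretrained("microsoft/speecht5_tts")
vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan")
xvectors = load_dataset("Matthijs/cmu-arctic-xvectors", split="validation")
speaker = torch.tensor(xvectors[7306]["xvector"]).unsqueeze(0)

def handle_utterance(wav_path: str, out_path: str = "reply.wav") -> str:
    # 1) Speech to text (Whisper via ctranslate2)
    segments, _ = whisper.transcribe(wav_path, beam_size=5)
    text = " ".join(s.text for s in segments).strip()

    # 2) LLM (4-bit Vicuna)
    prompt = f"USER: {text}\nASSISTANT:"
    ids = tokenizer(prompt, return_tensors="pt").to(llm.device)
    out = llm.generate(**ids, max_new_tokens=128)
    reply = tokenizer.decode(out[0][ids["input_ids"].shape[1]:], skip_special_tokens=True)

    # 3) Text to speech (SpeechT5 + HiFi-GAN vocoder, 16 kHz output)
    inputs = processor(text=reply, return_tensors="pt")
    speech = tts.generate_speech(inputs["input_ids"], speaker, vocoder=vocoder)
    sf.write(out_path, speech.cpu().numpy(), 16000)
    return reply
```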
For wake word detection, voice activity detection, audio processing, etc. we use the ESP-SR (speech recognition) framework from Espressif[0].
For speech to text there are two options and more to come:
1) Completely on-device command recognition using the ESP-SR Multinet 6 model. Willow currently pulls your light and switch entities from Home Assistant and generates the grammar and command definitions Multinet requires (a rough sketch of the entity pull is below). We want to develop a Willow Home Assistant component that provides tighter Willow integration with HA and lets users do all of this point-and-click, with dynamic updates for new/changed entities, other kinds of entities, etc., right in the HA dashboard/config.
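The entity pull side is just the Home Assistant REST API. A rough sketch (the URL/token are placeholders, and the actual grammar/command definition format Multinet needs is left out here):

```python
# Sketch: grab light/switch entities from the Home Assistant REST API and turn
# them into command phrases. Converting these phrases into Multinet's grammar/
# command definition format is omitted. URL and token are placeholders.
import requests

HA_URL = "http://homeassistant.local:8123"   # placeholder
HA_TOKEN = "YOUR_LONG_LIVED_ACCESS_TOKEN"    # placeholder

def fetch_command_phrases() -> list[str]:
    resp = requests.get(
        f"{HA_URL}/api/states",
        headers={"Authorization": f"Bearer {HA_TOKEN}"},
        timeout=10,
    )
    resp.raise_for_status()
    phrases = []
    for state in resp.json():
        domain = state["entity_id"].split(".")[0]
        if domain not in ("light", "switch"):
            continue
        name = state["attributes"].get("friendly_name", state["entity_id"])
        phrases.append(f"turn on {name}")
        phrases.append(f"turn off {name}")
    return phrases
```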
The only "issue" with Multinet is that it supports at most 400 defined commands. You're not going to get something like "What's the weather like in $CITY?" out of it.
For that we have:
2-?) Our own highly optimized inference server using Whisper, LLaMA/Vicuna, and SpeechT5 from transformers (more to come soon). We're open sourcing it next week. Willow streams audio in realtime after wake, gets the STT output back, and sends it wherever you want. The Willow Home Assistant component (which doesn't exist yet) will sit between Willow and our inference server - or any other STT/TTS implementation supported by Home Assistant - and handle all of this for you, including chaining together other HA components, APIs, etc.
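To make that concrete, here's one way an STT transcript can be handed to Home Assistant today via HA's conversation REST endpoint (URL/token are placeholders; this is roughly the kind of wiring the planned Willow HA component would own):

```python
# Sketch: forward the STT transcript to Home Assistant's conversation endpoint,
# which routes it to intents/automations. URL and token are placeholders.
import requests

HA_URL = "http://homeassistant.local:8123"   # placeholder
HA_TOKEN = "YOUR_LONG_LIVED_ACCESS_TOKEN"    # placeholder

def send_transcript_to_ha(text: str) -> dict:
    resp = requests.post(
        f"{HA_URL}/api/conversation/process",
        headers={"Authorization": f"Bearer {HA_TOKEN}"},
        json={"text": text, "language": "en"},
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()

# e.g. send_transcript_to_ha("turn off the kitchen lights")
```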