The ESP BOX hardware and Espressif's ESP-SR speech recognition library handle the low-level audio work: wake word detection, DSP for voice quality, voice activity detection, etc., to get usable far-field audio. The wake word engine uses models from Espressif with wake words like "Alexa", "Hi ESP", "Hi Lexin", and so on. If we get traction, Espressif can train us a wake word model for whatever we want (we're thinking "Hi Willow"), but we're open to better ideas!
After wake, we currently stream audio in real time to our very high-performance Whisper inference server implementation (optimized for "realtime" speech). We plan to open source it next week.
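For a sense of what the server side of that step looks like, here's a minimal sketch using the stock openai-whisper Python package; our actual implementation is a streaming, latency-optimized server, and the model name and file path below are just placeholders:

```python
# Minimal sketch of the ASR step with the stock openai-whisper package.
# Our real server is streaming and latency-optimized; "base.en" and
# "utterance.wav" are placeholders, not what Willow actually uses.
import whisper

model = whisper.load_model("base.en")  # load once at startup, not per request
result = model.transcribe("utterance.wav", language="en")
print(result["text"])  # transcript that gets handed back to the device
```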
We also patched in support for the most recent ESP-SR release, which includes their genuinely impressive Multinet 6 speech command model: it recognizes up to 400 commands entirely on device after wake activation. We currently try to pull light and switch entities from your configured Home Assistant instance to build those speech commands, but it's really janky. We're working on it.
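To give a rough idea of what that entity pull involves, here's a hedged Python sketch against Home Assistant's standard REST API. The URL, token, and phrasing template are placeholders, and the real integration builds the Multinet command grammar for the device rather than just printing strings:

```python
# Rough sketch: pull light/switch entities from Home Assistant's /api/states
# endpoint and turn their friendly names into candidate speech commands.
# HA_URL and HA_TOKEN are placeholders; the phrasing template is made up.
import requests

HA_URL = "http://homeassistant.local:8123"
HA_TOKEN = "YOUR_LONG_LIVED_ACCESS_TOKEN"

resp = requests.get(
    f"{HA_URL}/api/states",
    headers={"Authorization": f"Bearer {HA_TOKEN}"},
    timeout=10,
)
resp.raise_for_status()

commands = []
for state in resp.json():
    domain = state["entity_id"].split(".")[0]
    if domain in ("light", "switch"):
        name = state["attributes"].get("friendly_name", state["entity_id"])
        commands.append(f"Turn on {name}")
        commands.append(f"Turn off {name}")

# Multinet 6 tops out around 400 commands, so the list has to be capped.
print(commands[:400])
```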
The current default is our best-effort hosted inference server implementation, but as I say in the README, we're open sourcing that next week so anyone can stand it up and do all of this completely locally, inside your own walls.
The "TTS Output" and "Audio on device" sections make it seem like there is no spoken output, only status beeps.
A former Mycroft dev, Michael Hansen[1], is still building several Year of the Voice projects after he was let go. I'm especially excited about Piper[2], a C++/Python alternative to Mimic 3.
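For anyone who hasn't played with Piper yet, it can be driven straight from the command line; here's a rough Python wrapper around that. The model name and CLI flags are taken from my reading of the Piper README and are assumptions here, so check `piper --help` on your install:

```python
# Sketch of generating a WAV with Piper's CLI from Python.
# The model name and flags are assumptions based on the Piper README;
# verify against `piper --help` for your installed version.
import subprocess

text = "Turning on the kitchen lights."
subprocess.run(
    ["piper", "--model", "en_US-lessac-medium.onnx", "--output_file", "response.wav"],
    input=text.encode("utf-8"),
    check=True,
)
```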
We plan to make a Home Assistant Willow component that can use any of their supported TTS modules to play speech output on the device. We just haven't gotten to it yet.
Our inference server (open source, releasing next week) also has highly optimized implementations of Whisper, LLaMA/Vicuna and friends, text-to-speech, and more.
It's actually not that hard on the device - if the response from the HA component has audio, play it.
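Sketched here in Python for readability rather than the actual ESP-IDF C, and with a hypothetical "tts_url" field that is not the real schema of the (not yet written) HA component:

```python
# Illustration of the device-side decision, in Python for readability
# (the firmware itself is ESP-IDF C). "tts_url" is a hypothetical field,
# not the real schema of the planned Home Assistant component.
import requests

def play_audio(data: bytes) -> None:
    print(f"playing {len(data)} bytes of TTS audio")  # stand-in for the audio pipeline

def play_success_beep() -> None:
    print("beep")  # stand-in for the current status chime

def handle_ha_response(response: dict) -> None:
    tts_url = response.get("tts_url")
    if tts_url:
        audio = requests.get(tts_url, timeout=10).content
        play_audio(audio)
    else:
        play_success_beep()
```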