The ESP BOX has an acoustically optimized enclosure with dual microphones for noise cancellation, source separation, etc.
Between that and the Espressif AFE (audio front end) doing a bunch of DSP "stuff", in our testing it does remarkably well in noisy environments and far-field (25-30 feet) use cases.
Our inference server implementation (open source, releasing next week) uses a highly performance-optimized Whisper, which does famously well with less-than-ideal speech quality.
All in, even though it's all very early, it's very competitive with Echo, etc.
What’s the latency on inference on a Raspberry Pi (I assume it’s not running directly on the device)? I think I read previously that it was up to 7 seconds, and if you wanted sub-second you’d need an i5.
Willow supports the Espressif ESP SR speech recognition framework for completely on-device speech recognition of up to 400 commands. When configured, we pull light and switch entities from Home Assistant and build the grammar to turn them on and off. There's no reason it has to be limited to that; we just need to do some extra work on dynamic configuration and tighter Home Assistant integration so users can define up to 400 commands to do whatever they want with their various entities.
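For the curious, the entity pull is simple enough to sketch against Home Assistant's documented REST API. This is illustrative only; the URL, token, and phrase format are placeholders, not our actual grammar generation:

    # Sketch: build on/off command phrases for ESP SR from
    # Home Assistant's REST API. URL and token are placeholders.
    import requests

    HA_URL = "http://homeassistant.local:8123"   # assumed HA instance
    TOKEN = "your-long-lived-access-token"       # assumed credential

    def build_commands(max_commands=400):
        resp = requests.get(
            f"{HA_URL}/api/states",
            headers={"Authorization": f"Bearer {TOKEN}"},
            timeout=10,
        )
        resp.raise_for_status()
        commands = []
        for state in resp.json():
            domain = state["entity_id"].split(".")[0]
            if domain not in ("light", "switch"):
                continue
            name = state["attributes"].get("friendly_name", state["entity_id"])
            for action in ("turn on", "turn off"):
                commands.append(f"{action} {name}")
        # ESP SR recognizes a finite command set, so cap the list.
        return commands[:max_commands]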
With Willow's local command recognition I can turn my Wemo switches on and off, completely on device, in roughly 300ms. That's not a typo. I'm going to make another demo video showing that.
We also support live streaming of audio after wake to our highly optimized Whisper inference server implementation (open source, releasing next week). That's what our current demo video uses[0]. It's really more intended for pro/commercial applications: it supports CPU but really flies with CUDA, where even on a GTX 1060 3GB you can transcribe 3-5 seconds of speech in ~500ms or so.
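Until the server is out, faster-whisper (built on CTranslate2) is a reasonable stand-in to get a feel for this class of optimized Whisper inference. The model size and compute type below are assumptions, not our server's configuration:

    # Stand-in example using faster-whisper (CTranslate2), not
    # Willow's actual (unreleased) inference server.
    from faster_whisper import WhisperModel

    # Model size and compute type are assumptions; float16 is the
    # usual CUDA choice, int8 the usual CPU fallback.
    model = WhisperModel("base.en", device="cuda", compute_type="float16")

    segments, _info = model.transcribe("utterance.wav", beam_size=5)
    for segment in segments:
        print(f"[{segment.start:.2f}s -> {segment.end:.2f}s]{segment.text}")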
We also plan to have a Willow Home Assistant component to support Willow-specific functionality while enabling use of any of the STT/TTS modules in Home Assistant (including another component for our self-hostable inference server that does special Willow stuff).
I think controlling the odd device, setting a timer, adding items to a shopping list covers about 90% of my Alexa use. The remaining bits are asking it to play music, or dropping into another room. Seems like a good portion of these could be covered already.
Have you considered K2/Sherpa for ASR instead of ctranslate2/faster-whisper? It’s much better suited for streaming ASR (Whisper transcribes 30-second chunks, no streaming). They’re also working on adding context biasing using Aho-Corasick automata, to handle dynamic recognition of e.g. contact list entries or music library titles (https://github.com/k2-fsa/icefall/pull/1038).
Whisper, by the model's design, works on 30-second chunks and doesn't support streaming in the strict sense.
You'll be able to see when we release our inference server implementation next week that it's "realtime" enough to fool nearly anyone, especially in an application like this where you aren't watching model output in real time. You're streaming speech, buffering on the server, waiting for voice activity detection to signal the end of speech, running Whisper, taking the transcription, and doing something with it. Other than a cool demo I'm not really sure what streaming ASR output provides, but that's probably lack of imagination on my part :).
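That buffer-until-silence flow is simple enough to sketch. This uses webrtcvad as the VAD, which is an assumption; our server may use something else entirely:

    # Sketch of the server-side flow: buffer streamed PCM, wait for
    # VAD to report sustained silence, then hand the whole utterance
    # to Whisper in one shot. webrtcvad is an assumed stand-in.
    import collections
    import webrtcvad

    SAMPLE_RATE = 16000
    FRAME_MS = 30
    FRAME_BYTES = SAMPLE_RATE * FRAME_MS // 1000 * 2  # 16-bit mono PCM

    def collect_utterance(frames, silence_frames=10):
        """Accumulate frames until ~300ms of continuous silence."""
        vad = webrtcvad.Vad(2)  # aggressiveness 0 (loose) to 3 (strict)
        buffered = bytearray()
        recent = collections.deque(maxlen=silence_frames)
        for frame in frames:  # each frame is FRAME_BYTES from the stream
            buffered.extend(frame)
            recent.append(vad.is_speech(frame, SAMPLE_RATE))
            if len(recent) == silence_frames and not any(recent):
                break  # end of speech: run Whisper on the buffer
        return bytes(buffered)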
That said, these are great pointers and we're certainly not opposed to it! At the end of the day Willow does the "work on the ground" of detecting wake word, getting clean audio, and streaming the audio. Where it goes and what happens then is up to you! There's no reason at all we couldn't support streaming ASR output.
You never know if people are going to love your pet project as much as you do. We had a hunch the community would appreciate Willow but like I said, you just never know.
My suspicion is Espressif (until now, hah) hasn't sold a lot of ESP Boxes. We were concerned that if Willow takes off they will sell out. That already appears to be happening.
Espressif has tremendous manufacturing capacity and we hope they will scale up ESP BOX production to meet demand now that (with Willow) it exists. The only gating item for them is probably the plastic enclosure, and they should be able to figure out how to produce that en masse :).
I really hope so, I've been waiting for good audio assistant hardware forever. I hope this is finally the time where I ditch Alexa once and for all, thanks for releasing Willow!