I know there have been many discussions about speech interfaces here lately, but I really think it's going to be as transformative as the touch screen, in that it removes the abstraction between the user's intent and the means of expressing it. Speech is very natural to us, and as soon as the AI becomes more contextual it will be truly powerful in our everyday lives.
I can of course only speak from personal experience, but it completely changed the way our family accesses various services, and it allows us to be together while using Google Home as a fifth member of the family.
Making it open source is great, but I do believe the hardware needs to be much more polished to compete with Amazon and Google.
So today I built this same thing[1], using a Logitech universal remote, Bing Speech recognition, ifttt.com and about 6 lines of Python.
[1] OK, it's the easy part. And I agree the whole system of skills and features sounds great. But I can say "Turn on TV" and my TV comes on, and "Turn on Xbox" and the TV and Xbox both come on. And... 6 lines of Python.
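Those six lines are essentially a transcript-to-webhook mapping. A minimal sketch of the shape, assuming Bing Speech hands you a transcript and IFTTT Maker webhooks drive the Logitech remote (the event names and key are placeholders for whatever you configure on ifttt.com):

    import requests

    IFTTT_KEY = "YOUR_MAKER_KEY"  # placeholder; see ifttt.com/maker_webhooks
    EVENTS = {"turn on tv": "tv_on", "turn on xbox": "xbox_on"}  # illustrative

    def handle(transcript):
        """Fire the IFTTT event mapped to the recognized phrase, if any."""
        event = EVENTS.get(transcript.lower().strip())
        if event:
            requests.post("https://maker.ifttt.com/trigger/%s/with/key/%s"
                          % (event, IFTTT_KEY))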
What hardware interface are you using? I'm just getting into this myself (using my shiny new Echo as the interface to my homebrewed domestic control service).
Does anyone know what the hardware looks like? I haven't managed to find any info other than "it's a Raspberry Pi with LEDs". I'm especially interested in whether it has a microphone array and how they process audio to remove noise.
I was an early backer on Kickstarter... At this point I believe the hardware to be vaporware... They have pushed the delivery date for backers back so many times that if I ever actually get one I will be shocked.
Many backers have built their own from the software using their own Raspberry Pis.
I don't know how expensive they are, but I did get a PS3 Eye camera with a four-mic array for $4. I just don't have the software to do noise reduction on the streams.
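For what it's worth, even naive delay-and-sum beamforming helps against uncorrelated noise. A minimal sketch, assuming you've saved a 4-channel capture from the Eye as a WAV; for a roughly broadside source the inter-channel delays are near zero, so plain averaging already buys you around sqrt(4) in SNR:

    import soundfile as sf  # pip install soundfile

    # 4-channel capture from the PS3 Eye; shape (samples, 4)
    audio, rate = sf.read("capture_4ch.wav")
    # Delay-and-sum with zero delays: averaging the channels attenuates
    # noise that is uncorrelated across microphones.
    mono = audio.mean(axis=1)
    sf.write("enhanced_mono.wav", mono, rate)

A real implementation would estimate per-channel delays from the source direction before summing.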
This is the biggest shortcoming of this project IMO; it only uses PocketSphinx to recognize the trigger phrase, and then uses Google STT for the command phrase.
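To be fair, the trigger stage itself stays local; with the pocketsphinx-python bindings it's only a few lines (the keyphrase and threshold here are illustrative), and only audio recorded after it fires goes to Google:

    from pocketsphinx import LiveSpeech

    # Spot the trigger phrase locally; nothing leaves the machine here.
    speech = LiveSpeech(lm=False, keyphrase="hey mycroft", kws_threshold=1e-20)
    for phrase in speech:  # blocks until the keyphrase is heard
        print("wake word detected:", phrase)
        # ...only now would the command audio be sent to a cloud STT...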
A big reason I (and, I believe, others) am wary of the Amazon Echo and Google Home is that we don't like the idea of having an always-on microphone in our homes, shuttling everything our families say to these giant companies.
I really hope the MycroftAI-backed OpenSTT[1] project gets off the ground so they (and others) can divorce themselves from Amazon, Google, Bing, and similar services.
It still relies on third-party proprietary STT, which is a huge problem for most of these speech-interface projects.
The best results in open-source STT come from a toolkit called Kaldi. However, it's a pain to set up a full acoustic + phoneme + language model, and harder still to run a continuous decoding server.
Luckily, a library [0] exists to accomplish this, and with some tweaks to the API it might provide an excellent replacement for those services in Mycroft.
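If the library in question speaks the kaldi-gstreamer-server protocol, the client side is pleasantly small. A sketch under that assumption (the URL, port, and content-type string are that project's documented defaults; treat them as placeholders for your own setup):

    import json
    import wave
    import websocket  # pip install websocket-client

    URI = ("ws://localhost:8888/client/ws/speech"
           "?content-type=audio/x-raw,+layout=(string)interleaved,"
           "+rate=(int)16000,+format=(string)S16LE,+channels=(int)1")

    ws = websocket.create_connection(URI)
    with wave.open("command.wav", "rb") as wav:  # 16 kHz mono 16-bit PCM
        while True:
            chunk = wav.readframes(4000)  # ~0.25 s of audio per message
            if not chunk:
                break
            ws.send_binary(chunk)
    ws.send("EOS")  # signal end of utterance

    while True:
        msg = json.loads(ws.recv())
        if msg.get("result", {}).get("final"):
            print(msg["result"]["hypotheses"][0]["transcript"])
            break
    ws.close()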
A notable effort; however, at first glance this seems like an imitation of the Alexa service, with like-for-like terminology and backend concepts, even using the word "intent". Does this not risk stepping on the toes of Amazon's copyright/trademark?
One thing I do regard highly is the focus on testing, versus Alexa's SDK, where Amazon doesn't seem even vaguely concerned.
Words like "intent", "sentiment", and "entities" were standard vernacular for natural language processing long before Amazon decided to jump into the space.
Trademark is not a problem unless you are selling the product; after all, you can just do a global search-and-replace to change the name. And neither trademark nor copyright covers the use of ordinary English words in ordinary technical contexts. APIs are like recipes and telephone directories: the expression is protected, but not the mere list of ingredients.
Of course, I have probably expressed this a little too strongly; I'm sure there are edge cases to worry about.
> stepping on the toes of Amazon's copyright/trademark?
I would not be surprised if Microsoft had a problem with the company name. They market themselves as an open-source project, but I assume they are in it to sell hardware and make a profit (or exit).
The company name sounds like a contraction of "Microsoft" to me, and initially I misread the title as announcing another Microsoft open-source project.
I'm actually expecting some initiative from Microsoft in exactly this area, with open hardware instead of open software.
From what I understand, Siri isn't always on, in the sense that you have to switch it on. This seems to be listening all the time.
Another comment mentioned that they use some Google speech recognition component, which probably means all of your conversations are delivered to Google. That's a big red flag for me.
It first detects its wake word by running PocketSphinx on roughly the last two seconds of audio; audio older than two seconds is discarded. After detecting the wake word, it records until it thinks the user has finished speaking.
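The rolling-window part is just a bounded queue of audio frames. A sketch of the pattern with PyAudio, where detect_wakeword() is a stub standing in for the PocketSphinx check:

    import collections
    import pyaudio

    RATE, CHUNK = 16000, 1024

    def detect_wakeword(pcm_bytes):
        """Stub for the PocketSphinx keyword check on the buffered audio."""
        return False

    # Keep roughly the last two seconds; older frames fall off automatically.
    frames = collections.deque(maxlen=int(2 * RATE / CHUNK))

    pa = pyaudio.PyAudio()
    stream = pa.open(format=pyaudio.paInt16, channels=1, rate=RATE,
                     input=True, frames_per_buffer=CHUNK)
    while True:
        frames.append(stream.read(CHUNK))
        if detect_wakeword(b"".join(frames)):
            break  # then record until the speaker seems to have finished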
I believe the PocketSphinx part runs offline, so only commands would be delivered to Google.
I do seem to recall that processing happens locally, which would give it an edge over at least Google Now (and possibly Siri, as I don't know how that works).