I know there have been many discussions about speech interfaces here lately, but I really think it's going to be as transformative as the touch screen, in that it removes the abstraction between the user's intent and the means of expressing it. Speech is very natural to us, and as soon as the AI becomes more contextual it will be truly powerful in our everyday lives.
I can of course only speak from personal experience, but it completely changed the way our family accesses various services, and it allows us to be together while using Google Home as a fifth member of the family.
Making it open source is great, but I do believe the hardware needs to be much more polished to compete with Amazon and Google.
So today I built this same thing[1], using a Logitech universal remote, Bing Speech recognition, ifttt.com and about 6 lines of Python.
[1] OK, it's the easy part. And I agree the whole system of skills and features sounds great. But I can say "Turn on TV" and my TV comes on, and "Turn on Xbox" and the TV and Xbox both come on. And... 6 lines of Python.
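Those six lines are essentially a transcript-to-webhook mapping. A minimal sketch of the shape, assuming Bing Speech hands you a transcript and IFTTT Maker webhooks drive the Logitech remote (the event names and key are placeholders for whatever you configure on ifttt.com):

    import requests

    IFTTT_KEY = "YOUR_MAKER_KEY"  # placeholder; see ifttt.com/maker_webhooks
    EVENTS = {"turn on tv": "tv_on", "turn on xbox": "xbox_on"}  # illustrative

    def handle(transcript):
        """Fire the IFTTT event mapped to the recognized phrase, if any."""
        event = EVENTS.get(transcript.lower().strip())
        if event:
            requests.post("https://maker.ifttt.com/trigger/%s/with/key/%s"
                          % (event, IFTTT_KEY))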
What hardware interface are you using? I'm just getting into this myself (using my shiny new Echo as the interface to my homebrewed domestic control service).
Does anyone know what the hardware looks like? I haven't managed to find any info other than "it's a Raspberry Pi with LEDs". I'm especially interested in whether it has a microphone array and how they process audio to remove noise.
I was an early backer on Kickstarter... At this point I believe the hardware to be vaporware... They have pushed the delivery date for backers back so many times that if I ever actually get one I will be shocked.
Many backers have built their own from the software using their own Raspberry Pis.
I don't know how expensive they are, but I did get a PS3 Eye camera with a four-mic array for $4. I just don't have the software to do noise reduction on the streams.
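For what it's worth, even naive delay-and-sum beamforming helps against uncorrelated noise. A minimal sketch, assuming you've saved a 4-channel capture from the Eye as a WAV; for a roughly broadside source the inter-channel delays are near zero, so plain averaging already buys you around sqrt(4) in SNR:

    import soundfile as sf  # pip install soundfile

    # 4-channel capture from the PS3 Eye; shape (samples, 4)
    audio, rate = sf.read("capture_4ch.wav")
    # Delay-and-sum with zero delays: averaging the channels attenuates
    # noise that is uncorrelated across microphones.
    mono = audio.mean(axis=1)
    sf.write("enhanced_mono.wav", mono, rate)

A real implementation would estimate per-channel delays from the source direction before summing.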
This is the biggest shortcoming of this project IMO; it only uses PocketSphinx to recognize the trigger phrase, and then uses Google STT for the command phrase.
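To be fair, the trigger stage itself stays local; with the pocketsphinx-python bindings it's only a few lines (the keyphrase and threshold here are illustrative), and only audio recorded after it fires goes to Google:

    from pocketsphinx import LiveSpeech

    # Spot the trigger phrase locally; nothing leaves the machine here.
    speech = LiveSpeech(lm=False, keyphrase="hey mycroft", kws_threshold=1e-20)
    for phrase in speech:  # blocks until the keyphrase is heard
        print("wake word detected:", phrase)
        # ...only now would the command audio be sent to a cloud STT...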
A big reason I (and, I believe, others) am wary of the Amazon Echo and Google Home is that we don't like the idea of having an always-on microphone in our homes, shuttling everything our families say to these giant companies.
I really hope the MycroftAI-backed OpenSTT[1] project gets off the ground so they (and others) can divorce themselves from Amazon, Google, Bing, and similar services.
It still relies on third-party proprietary STT, which is a huge problem for most of these speech-interface projects.
The best results in open-source STT come from a toolkit called Kaldi. However, it's a pain to set up a full acoustic + phoneme + language model, and harder still to run a continuous decoding server.
Luckily, a library [0] exists to accomplish this, and with some tweaks to the API it might provide an excellent replacement for those services in Mycroft.
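If the library in question speaks the kaldi-gstreamer-server protocol, the client side is pleasantly small. A sketch under that assumption (the URL, port, and content-type string are that project's documented defaults; treat them as placeholders for your own setup):

    import json
    import wave
    import websocket  # pip install websocket-client

    URI = ("ws://localhost:8888/client/ws/speech"
           "?content-type=audio/x-raw,+layout=(string)interleaved,"
           "+rate=(int)16000,+format=(string)S16LE,+channels=(int)1")

    ws = websocket.create_connection(URI)
    with wave.open("command.wav", "rb") as wav:  # 16 kHz mono 16-bit PCM
        while True:
            chunk = wav.readframes(4000)  # ~0.25 s of audio per message
            if not chunk:
                break
            ws.send_binary(chunk)
    ws.send("EOS")  # signal end of utterance

    while True:
        msg = json.loads(ws.recv())
        if msg.get("result", {}).get("final"):
            print(msg["result"]["hypotheses"][0]["transcript"])
            break
    ws.close()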
A notable effort; however, at first glance this seems like an imitation of the Alexa service, with like-for-like terminology and backend concepts, even using the word "intent". Does this not risk stepping on the toes of Amazon's copyright/trademark?
One thing I do regard highly is the focus on testing, versus Alexa's SDK, where Amazon doesn't seem even vaguely concerned.
Words like "intent", "sentiment", and "entities" were standard vernacular for natural language processing long before Amazon decided to jump into the space.
Trademark is not a problem unless you are selling the product; after all, you can just do a global search-and-replace to change the name. And neither trademark nor copyright covers the use of ordinary English words in ordinary technical contexts. APIs are like recipes and telephone directories: the expression is protected, but not the mere list of ingredients.
Of course, I have probably expressed this a little too strongly; I'm sure there are edge cases to worry about.
> stepping on the toes of Amazon's copyright/trademark?
I would not be surprised if Microsoft had a problem with the company name. They market themselves as an open-source project, but I assume they are in it to sell hardware and make a profit (or exit).
The company name sounds like a contraction of "Microsoft" to me, and initially I misread the title as announcing another Microsoft open-source project.
I'm actually expecting some initiative from Microsoft in exactly this area, with open hardware instead of open software.
From what I understand, Siri isn't always on, in the sense that you have to switch it on. This seems to be listening all the time.
Another comment mentioned that they use some Google speech recognition component, which probably means all of your conversations are delivered to Google. That's a big red flag for me.
It first detects its wake word by running PocketSphinx on roughly the last two seconds of audio; audio older than two seconds is discarded. After detecting the wake word, it records until it thinks the user has finished speaking.
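The rolling-window part is just a bounded queue of audio frames. A sketch of the pattern with PyAudio, where detect_wakeword() is a stub standing in for the PocketSphinx check:

    import collections
    import pyaudio

    RATE, CHUNK = 16000, 1024

    def detect_wakeword(pcm_bytes):
        """Stub for the PocketSphinx keyword check on the buffered audio."""
        return False

    # Keep roughly the last two seconds; older frames fall off automatically.
    frames = collections.deque(maxlen=int(2 * RATE / CHUNK))

    pa = pyaudio.PyAudio()
    stream = pa.open(format=pyaudio.paInt16, channels=1, rate=RATE,
                     input=True, frames_per_buffer=CHUNK)
    while True:
        frames.append(stream.read(CHUNK))
        if detect_wakeword(b"".join(frames)):
            break  # then record until the speaker seems to have finished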
I believe the PocketSphinx part runs offline, so only commands would be delivered to Google.
I do seem to recall that processing happens locally, which would give it an edge over at least Google Now (and possibly Siri, as I don't know how that works).