Had to read the privacy policy to see they use Google.
>We share your audio recording with Google Cloud’s speech-to-text service to assist us in processing and carrying out your commands. Audio recordings are shared without personally identifiable metadata, and we’ve instructed Google’s service not to retain the audio or transcript associated with a command after it processes the command
I built something like this 7 years ago, it's called Hands Free for Chrome, a now languishing project that I lost interest in a long time ago, unfortunately. It made the top 10 of HN back then though! My site's design is not nearly as nice as yours.
I just didn't get enough users or support to really care about it. But I wish you the best. It was an exciting thing to build and using it always felt futuristic to me.
This is just so fascinating though. It's like seeing what could have been if I had been a better developer and found the dedication to really stick to the project in the long term.
Edit: I see we had the exact same idea! Your "tag" is my "map." Love it. One big difference is that mine was just a free project. I'd be super interested to know how many users you've got. I never had more than ~1100. From looking through your website, mine was clearly a much less intensive project. (Oh, CWS says 4000+ for you... wow, wonder how many are paid.)
Edit2: Looking over your update history is almost nostalgic. "Fixed issue with overlapping commands -- delaying commands that are partial matches of other commands." Had to do the exact same thing!
Edit3: We have so many overlapping command names that I wonder if you took inspiration from my project, almost. Either that or it's just a case of convergent evolution.
Edit4: Suggestion for dictation: a way to alternate between a special character and actually writing the word. Doesn't look like there's a way to do ^ vs carrot or & vs ampersand. Something like "Enter special character caret". Maybe you already have a plugin for this though, idk.
Edit5: God, this is so well architected! plugins and contexts are just fantastic ideas for this domain. Click by voice using hidden search-for-text is also a perfect solution to that problem. I wonder if this could be made more intelligent, i.e. "Click Submit in the sidebar on the left"-- challenging though.
Edit6: Wow, just noticed someone else built something called "Handsfree for Web" somewhere along the way and theirs is ALSO way better than what I had built. Geez. Starting to feel bad about my awful website.
Never saw yours before, but I discovered "Handsfree for Web" a few months after I started - and thought he had ripped mine off. But I no longer think so. Yes, it seems like many commands are the same. Shame that so much wheel reinvention is going on. One thing that makes LipSurf "special" is the deep integration with sites. I wanted to use Duolingo, Reddit, HN and some others more with voice - so they get special plugins. Doing Duolingo with voice is a game changer for language learning - and if it weren't for use cases like that I would have likely lost interest like you long ago.
I'm convinced it's the future. The long-term future (a century from now) will be brain-machine interfaces (BMIs), but the nearer-term future (20 years from now) will be highly intelligent voice interfaces.
If you manage to get into the GPT-3 beta, I'd love to work on that with you :D.
For simple models, English -> OpenSCAD sounds like it's doable given the things I've seen on Twitter and for normal modeling, GPT-3 would probably make an excellent intent recognizer for voice commands.
The GPT-3 beta is something every programmer and their dog wants to get into these days and most of us can't. It's a really impressive new language processing neural network that people have managed to train to (among many other things) generate code from an English description of the program. If it can do that, it might be able to generate some reasonably complex Constructive Solid Geometry models and even something like MEL commands in Maya.
Greg, OpenAI's CTO, occasionally manually lets people in if they convince him of their use case (or, as one guy did, plant a bunch of trees in his name) in an email (gdb@openai.com). It might be worth shooting him a message explaining the idea. From what I've heard, once you have a key, it's mostly a matter of feeding the model examples until it does what you want.
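For anyone wondering what "feeding the model examples" looks like in practice: with the completion API as it existed at the time, it's roughly a few-shot prompt like the sketch below. The engine name, sampling parameters, and example prompt here are my guesses, not a recipe.

    # Hypothetical few-shot prompt: English description -> OpenSCAD source.
    # Engine choice and parameters are assumptions; tune with real examples.
    import openai

    openai.api_key = "sk-..."  # your beta API key

    prompt = (
        "English: a 20 mm cube with a 5 mm diameter hole through the middle\n"
        "OpenSCAD: difference() { cube(20, center=true); cylinder(h=30, d=5, center=true); }\n"
        "\n"
        "English: a plate 60 x 40 x 3 mm\n"
        "OpenSCAD:"
    )

    resp = openai.Completion.create(
        engine="davinci",
        prompt=prompt,
        max_tokens=120,
        temperature=0.2,
        stop=["English:"],
    )
    print(resp["choices"][0]["text"].strip())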
"On some browsers, like Chrome, using Speech Recognition on a web page involves a server-based recognition engine. Your audio is sent to a web service for recognition processing, so it won't work offline."
Seems like the answer for why the API isn't in Firefox. It's also not standardized, and is prefixed, so...
You should try Rhasspy. It's open source and respects your privacy by using offline services. It's fully customizable (each service can be replaced by another), all the services are containerized for easy installation, and it's available for several architectures such as ARM (e.g. on a Raspberry Pi). There is even an option to use Mozilla DeepSpeech as the speech-to-text service.
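If it helps anyone evaluate it: once it's running, everything goes over a local HTTP API, so a minimal client is a couple of requests calls. The port and endpoint paths below are from Rhasspy's documented HTTP API as I remember it - double-check against your install.

    # Minimal sketch of talking to a local Rhasspy instance over its HTTP API.
    # Port and endpoints are assumptions based on the docs; verify locally.
    import requests

    RHASSPY = "http://localhost:12101"

    # Offline speech-to-text: POST a 16 kHz mono WAV, get plain text back.
    with open("command.wav", "rb") as f:
        text = requests.post(f"{RHASSPY}/api/speech-to-text", data=f.read()).text
    print("heard:", text)

    # Offline intent recognition against the sentences defined in your profile.
    intent = requests.post(f"{RHASSPY}/api/text-to-intent", data=text).json()
    print(intent.get("intent", {}).get("name"), intent.get("slots"))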
I'm sorry, but cloud based speech recognition in itself would already be a red flag, even if Mozilla was doing it in-house. Outsourcing it to Google though? I feel like a company as ostensibly privacy-focused as Mozilla should really know better by now...
Mozilla DeepSpeech is rapidly maturing, but it needs thousands of hours of validated audio data to train each language. It's a feat that with only 2000 hours of audio they can achieve a 5.97% word error rate.
Baidu had 5000 hours of audio data to train their DeepSpeech and DeepSpeech 2 models, meanwhile Google, Microsoft & IBM have people constantly giving them fresh audio to train and validate their models with.
Hoping they are able to continue their efforts and be more respectful. They are the only company I am fully lenient with to "allow analytics", in hopes that they are able to improve and compete with the more nefarious competitors.
Mozilla and "respectful to their users" are not two sentences that belong in the same ballpark on the same planet in the same galaxy in the same galactic group. Anytime there's a choice between two options, one giving users more control and one stripping it away, they choose the latter.
It's the same kind of utter disdain I see in the GNOME design.
They just pushed an update on android on me that disabled all addons except uBO because they don't support them yet, a few hours ago.
That was the last straw that broke the camel's back for me, after using FF since 1.0. Just rm -rf'd my firefox profile on my desktop a few minutes ago.
I've defended them for a long time even if I didn't agree with everything they did, but they're so completely off the rails, enough is enough. Blink mono-culture it is then.
The writing's been on the wall for a while now. I call it "The Mozilla Shuffle". Power features start out built into the menu. Then they get pushed to settings. Then they go to a key in about:config. Then you have to make the key yourself. Then they start ignoring the key but there's an extension that puts support back. Then the functionality which runs the extension gets disabled but if you're really determined you can work some magic with userChrome to get it back. Then you have to make keys in about:config to make it take the changes in userChrome. Finally, it's disabled entirely with no way to get back.
But hey, at least these days I get the benefit of having to use two completely separate methods to tell Pocket to get the hell out of my browser.
Wow, that's extremely accurate! Disabling the tab bar has also gone this exact route (when you use a tiling WM with tabs, some people like to disable browser and terminal tabs as they're redundant).
I managed to do it with a combination of two extensions, some about:config flags and userChrome.css some years ago, but it may already be broken by now.
Mozilla keeps trying to get to the mainstream market while ignoring its dedicated users... The only times I got people to switch over to Firefox were because of things like uBlock support on Android. Nobody I've talked to has ever been interested in their side products (Pocket, that file transfer thing...), and something tells me this new voice engine is going to go the same route.
OTOH it's nice Mozilla is working on improving their voice dataset (which powers the DeepSpeech model, iirc). And again kind of sad they use Google Cloud while they could have their own version...
Disabling the tab bar is pretty easy. I use tree style tabs and they supply a couple of lines of css to paste into userChrome. You can even live debug userChrome using Inspector. The userChrome format did change about a year ago though.
They're paid something stupid like hundreds of millions of USD to direct users to Google, and I understand the CEO was paid by Google (>$1M), though I've not seen good corroboration of that. Bias towards Google is not surprising.
Frankly it started going downhill when they jumped headfirst into the version number dick-measuring contest with Chrome. They care about pulling in new users, not keeping old ones. Anything to make a news headline and generate clicks and downloads. Some asshat feature one percent of their users will try one time and then never touch again. Disemboweling their feature set to make a benchmark run five percent faster so they can make a graph with the biggest bar. Moral grandstanding about Google so a couple tech sites give them a mention.
It makes me so angry. Firefox was the chosen one. It slew the evil giant IE. Then Mr.moneybags Chrome came around and scared Mozilla so shitless they completely forgot why people chose them over IE in the first place. Their solution to every problem is "be more like Chrome while being less like Google".
Newsflash, Mozilla: you really suck at that. Maybe try something else.
> It makes me so angry. Firefox was the chosen one. It slew the evil giant IE. Then Mr.moneybags Chrome came around and scared Mozilla so shitless they completely forgot why people chose them over IE in the first place. Their solution to every problem is "be more like Chrome while being less like Google".
I totally agree. I feel so powerless watching them fail like that... Google annihilated Firefox. Moral lesson here: free software can be destroyed. Not by openly fighting it, like Microsoft, but by rotting it from the core, like Google.
That's frankly a pathetic list when you consider not only the sum total of available extensions, but the list of Recommended Extensions alone. There's a Github issue for this with one particularly interesting reply: [1]
>Let it be put this way. I have extensions which I can already test while tethered using web-ext. They work. I know they work. But for some reason, Mozilla seems adamantly opposed to letting me test these extensions while not tethered to my computer.
That's from an extension developer who's struggling to test his own extensions.
Based on liuche's comments in that issue, the Fenix team's current stance seems to be, "The latest version disables some of the WebExtension APIs and adds some new ones which may break some extensions, so we're manually testing and approving every extension by ourselves, starting with the Recommended list. Sideloading unapproved extensions will probably never be allowed because of security and performance implications, except maybe on Nightly release."
I'm sure you can guess how well that's going to work out, given that only nine extensions are currently approved after 2+ months.
> Sideloading unapproved extensions will probably never be allowed because of security and performance implications
Very wise. They should apply those same principles to the Rust project - the code I write might very well have issues with security and performance, so it would be better if the compiler submitted it to Mozilla and awaited their approval before allowing me (or potentially not allowing me) to produce an executable.
>They just pushed an update on android on me that disabled all addons except uBO because they don't support them yet, a few hours ago.
It's worse than you thought. You just got the new Fenix browser they've been working on for some time; it's currently on staged rollout to the stable channel. Old stable was Fennec and it's getting killed off. I know this because I've sat in their Matrix dev channel until today. I'm not sure what their priorities are, but it seems to not be focused on users.
While this submission was posted I spent the time filing a report to Mozilla because I brought up 3 specific concerns about this browser and wasn't sure where to file them (Bugzilla, Github issues, or now Jira?). Their Fenix Matrix channel responded by censoring my two messages in full and banning me for 'conspiracy theories'. So much for openness and inclusion. I was gonna save it up for a blog post but screw it, let's do HN, it's in your interest sphere:
First issue: I've been having occasional crashes with Fenix; this was why I was in their channel, as I was hoping to get to the bottom of it. I installed it as Firefox Preview and they quietly updated to Nightly. This is sorta expected and fine I guess for beta quality software. The issue is I had all the data reporting stuff turned off. Browser starts crashing, sends a report to Mozilla (shows up briefly in the Android status bar, disappears). I don't know what it contains and this was highly concerning to me, so I listed this as a first issue as I don't want to be bitten for leaking client data every time the browser crashes. Where, why, and how are these being sent? Was it because I was now on Nightly channel and not Preview? I don't know how else to classify this other than user-hostile behavior, in the same range of hostility as installing sponsored experiments without notification.
Second issue: One of the crashes happened while downloading. Every time I reopened the browser it'd re-initiate the download and crash again. Fenix doesn't have a working download manager other than to tell it to initially download. Sucks for you if you need to pause or view what was downloaded; there are no controls implemented. I expressed that this is likely to be a denial-of-service vector as I had to wipe the app's data to even use it again. It's also risky to users on metered service if it's continually pulling data attempting to re-initiate.
Third issue: for between 5 years (for Fenix) and 7 years (for Fennec), users have had open bugs requesting pdfjs integration on mobile. The ability to inline PDFs in the browser has been a Safari mobile feature since at least iOS 4 and possibly before; to the best of my knowledge it's always been a feature of Chrome. Desktop's pdfjs just got a promotion to first-class citizen last release. Someone wrote an extension to make the mobile browsers use it. You can't do that on Fenix anymore though, not because of a compatibility issue but rather because Mozilla won't allow outside addons in Fenix anymore without their blessing, even if it's to use a product they've developed. The only way to override is to build the browser yourself. Again, user hostile. If they don't want to add their own product, let us use an extension that adds it for us.
Finally regarding culture: I've been a part of IRC communities for the better part of a quarter century. I've also done HN for a few years now. I've gotten heated a few times in different communities, and been kicked or asked to leave. On forums I've had to remove content that was maybe a bit hotheaded. My interaction with Mozilla was literally the first time I've ever been censored and removed without a single word given in rebuttal. One of the things I love about HN is that moderation knows when to step in, and mods like Dan treat you like a human when those times occur. I can casually gripe sometimes about HN's preferences on topics, but the site works overall pretty well. The message I got from Mozilla is 'bot removed'. Here's their guidelines, they don't follow them (but I do, to be transparent I have this issue in their pipeline): https://www.mozilla.org/en-US/about/governance/policies/part...
By contrast, I've had to file bugs with Chromium for browser issues and have nothing but decent things to say about their dev community there: You go to crbug.com and file an issue. I've had brief convos with some of their devs in Freenode IRC over some of the more buggy roll-outs (Aura was pretty crashy in the early days) and I guess the only thing you could say is they aren't very chatty. I'm concerned of course about Chrome eating the world but that's how it is. At least I understand Google's motivations. I don't understand Mozilla's anymore; they survive on Google money and are facing a continual bleed of users. Stuff like my first issue above is pretty major from a user trust/legal liability standpoint and I don't appreciate being labeled a conspiracy theorist and censored for bringing up the behavior.
Issue 1: "Some fields, such as "URL" and "Email Address", are privacy-sensitive and are only visible to users with minidump access."[1] - So yes, you should not send crash reports when you are dealing with sensitive data.
Issue 2: That is unfortunate and should be fixed.
Issue 3: The issue for Fenix is not 5 years old, it is now one year old, see [2].
Regarding your ban:
- "I will not apologize for being spicy about these things"
- "It is a fact that you've made it intentionally difficult to use your own product to replicate functionality present in most browsers"
-> See the Mozilla Participation Guidelines - "Be respectful in all interactions and communications, especially when debating the merits of different options." [3] I guess you could have phrased things differently. I will not judge if this is "enough" to ban you, but I also do not know what you were posting before this specific comment; you mention additional comments.
Thanks for clarification on #1. It confirms what I was most concerned about: that my phone was leaking private info to Mozilla. On Linux I don't build the crash reporter; any crashes I used to backtrace and open with my distro first before moving to upstream. On mobile, Fennec had a specific option to turn off the crash reporter; Fenix only has the two 'data sharing' sections.
I included the singular prior message posted the day before in my last comment on HN. I removed the part that was baseless in my second response which is what you saw.
@ #1: To be clear, I am not a Mozilla employee. So do not take this as a confirmation, I am just quoting the documentation.
@ "Fenix only has the two 'data sharing' sections." - You can always untick the checkbox in the crash reporter, see [1]: The data choice in the settings was removed in [2] and tickboxes were added at the same time in [3]. Fenix will also remember you crash reporting choice in the following crashes, you can test this yourself using - warning, this will crash your browser - about:crashparent.
Thanks for including your previous message: I think the ban was primarily based on your first message, but that is just my personal opinion. Hopefully you get more details from your report.
I mean, that’s the last text you sent. There was clearly other text you sent that was baseless speculation. There was history there you’ve not shown. Pretending as if this is the only text that mattered is dishonest.
I can see why they “banned” if you can’t even be honest to third parties.
>There was clearly other text you sent that was baseless speculation. There was history there you’ve not shown. Pretending as if this is the only text that mattered is dishonest.
Comments like this only further strengthen my point that rather than discourse, this 'community' seeks to erase history and attack my credibility. Since Mozilla also censored it out, here's a screenshot of the original single text sent a day earlier: https://ibb.co/dgXBdJK
I handled all this (including attaching both screenshots) and more in my report to Mozilla. I don't believe it's dishonest to include the specific text that was censored and banned when requested, especially when that text references the earlier 'baseless speculation' and I did not bring it up again. They asked what I was banned for; it was expressing my frustrations that caused me to send the original message in the first place.
Furthermore, it's extremely concerning that the move to Matrix over IRC seems to be so that rather than just remove users they don't like, they can scrub the history of their message content and tag it with labels like 'baseless speculation' and 'conspiracy theories' to further debase the removed user.
I would include the text of the report I sent to Mozilla but it includes some personal information.
Kiwi Browser on Android supports Chrome extensions. After pushing the Firefox update that disabled my extensions, some of which I use when I buy groceries for recipes (Tap Translate), I am abandoning Mozilla too :(
Makes me sad after using Firefox for as long as I can remember, recommending it to people constantly and insisting that companies I worked at test in Firefox too, but I no longer trust Mozilla to do what's best for me.
Maybe they're only rolling it out for some users. I got it too and it was updated through some kind of internal update manager, not through the app store. Deleted my browsing history as well.
Switched over to Brave, as they have mobile extension support in the works, with proof-of-concept videos on Twitter. Hopefully they can accelerate it a bit. I think the HN crowd doesn't like Brave though, idk, I do. IMO a lot of the shit they get flak for is people not actually understanding it. Like no, they don't replace ads on websites without their consent, or at least that's what they say.
So, because Firefox on your phone doesn't have X, you deleted it everywhere and installed a browser that... also doesn't have X?
FWIW, this is one of the symmetries when we have to do policy shifts like TLS 1.0 deprecation. Even though every major browser will implement the policy and has announced that, some fraction of users will feel "betrayed" and switch from one browser implementing the policy to another browser also implementing the policy.
It's worse if you have defectors (e.g. back when Microsoft had Internet Explorer, you could rely on IE being last to actually implement even if it had previously announced a timeline right in the middle of the ballpark; customers would push back and somewhere a Microsoft exec would decide that hey, making a customer happy is more important than security), but it happens even for a more or less simultaneous policy change.
Guaranteed some CAs will lose customers over Apple's 398 day certificate expiry policy, even though it affects every CA equally at the same moment.
> So, because Firefox on your phone doesn't have X, you deleted it everywhere and installed a browser that... also doesn't have X?
No, because Firefox on my phone has removed X, I'm switching to a browser that's getting X, and where the time-line isn't "we dunno lul". Maybe Brave doesn't hit their timeline, but Mozilla doesn't have one and I have no idea when my stuff will start working again. Until Brave Mobile has extension support I've downgraded my FF on mobile to the EOL version.
But as I said, it was the last straw that broke the camel's back. I do not consider Mozilla to be trustworthy anymore, and this was just the latest in a series of issues.
I don't see how this is in any way similar to TLS 1.0 deprecation, because sites should have upgraded. There's nothing extension authors could have done. This is fully Mozilla dropping the ball (once again).
Except that FF is also getting the feature - according to their timeline around Q4 20 - Q1 21.
As many have pointed out, Mozilla updated your glitchy and slow browser with a newer and faster one that unfortunately still isn't at feature parity. They didn't remove the feature in the sense that they abandoned it, it was simply a regression that came with an update they considered more important. I agree it was way too soon to push Fenix to stable users, but switching to fancy Chrome just because of one FF regression that will be fixed seems a bit much.
I feel the same way. If they had FOSS speech to text available, then why not. But otherwise, why? For me, this voice thing is totally useless. Instead of adding it or developing new STT, they could have spent time on other innovative things. I am a big fan of Mozilla and a daily user, and I am still very thankful for the more open and privacy-friendly ecosystem it offers, but implementing voice on top of Google and presenting it as a big feature looks strange to me.
Bigger players have the leverage to get companies to do something they don't do out-of-the-box. They can contractually oblige them to do that, as well as sue each other if one side breaks its part of the deal.
DeepSpeech is a lot less accurate and much slower than Facebook's FOSS offering, wav2letter, on equivalent data. If they want something competitive they'll need to drop DeepSpeech, or overhaul it. Common Voice is where the value is.
Like, there's nothing stopping firefox from just using wav2letter. It's BSD-licensed.
But DeepSpeech has already been trained with millions of data samples!
I'd feel way better about it if they went for a slightly worse DeepSpeech-based implementation, but kept it working in the free software spirit they have been known for over many years.
Also, for desktop devices inference on DeepSpeech is cheap enough, so they could even go the extra mile and work on some Wasm magic to get offline recognition.
That's the kind of work I'd expect from Mozilla! Not wiring up your data collection to the Google Cloud APIs and calling it a day! I'm genuinely disappointed with them...
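For what it's worth, offline DeepSpeech inference really is only a few lines with their Python bindings. A rough sketch (model and scorer filenames are from the 0.7.x release, so substitute whatever is current):

    # Offline transcription with Mozilla DeepSpeech's Python package.
    # Model and scorer filenames are assumptions from the 0.7.x release.
    import wave
    import numpy as np
    import deepspeech

    model = deepspeech.Model("deepspeech-0.7.4-models.pbmm")
    model.enableExternalScorer("deepspeech-0.7.4-models.scorer")

    with wave.open("command.wav", "rb") as wav:   # expects 16 kHz, 16-bit mono PCM
        audio = np.frombuffer(wav.readframes(wav.getnframes()), dtype=np.int16)

    print(model.stt(audio))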
The audio Mozilla DeepSpeech is trained on is not very large (about 2000 hours) or diverse (e.g. mostly male native speakers of American English) and has very little ability to handle noise, accents or other variation.
Comparatively, Baidu had 5000 hours of English to train their versions of DeepSpeech and DeepSpeech2 on, and thus had better results years ago. Google, Microsoft, IBM and other companies have users providing more audio samples on a daily basis, enabling much better quality speech to text.
Mozilla needs to come more around to Apple's way of thinking. These things need to be done locally on the device, not farmed out to some cloud. Use the cloud (CDN) to deploy the software, but run the software locally.
* "Make me laugh" always brings me to the same YouTube video.
* Had pretty much no issues with the default prompts. It was able to find some challenging Spotify playlists and open random websites (including some non-standard English domain names when I spelled them out).
* "Read this page" uses an awful TTS engine, which is a shame considering that I might actually use this feature on a somewhat regular basis. I'm assuming it uses whatever it detects on the OS level, and so far I haven't bothered with finding a better one (on Ubuntu, if you know of one, please suggest).
* "Set a timer for X min" works just fine, which is probably the only thing I use Google's assistant on my phone (or whatever it's name might be now).
* I like the idea of routines in the app settings, which is supposed to tie multiple queries together. I could see myself using it for something like a morning routine (tell me what time it is, give me weather info, read me news, etc.)
Google-worries aside, judging from the preview it's pretty slow. I'm not a super-fast typer, but these delays sure look like something that would discourage me from actually using it. Maybe it's not even that it's slow, just that the delays are super-obvious somehow, all these disruptive animations and such.
Don't worry, it says they asked Google to not save everyone's voices!
On a serious note, it does bother me how much Mozilla constantly uses Google, even when they have their own solutions. They could easily choose not to, especially with their massive budget, but often don't.
They have their own Voice API, but they use Google's. They have their own location API, but they use Google's ('use my location' sends your info to Google in Firefox). They have thrown Google analytics into browser components before, and used it on their own websites.
I don't understand the utility of it. Yes, I can see how this might be considered cool and hip, but... which of my problems as a user does it solve, exactly?
Speech-based control is almost mandatory for people with various disabilities (the blind, those with hand-movement problems/disabilities) and the elderly, a huge chunk of the population.
I regularly encounter people relying on voice-to-text to search or browse, especially people like cab drivers, but also various others who don't have a no-hands restriction.
It's not the kind of people I would expect to comment here, though.
Voice browsing on Windows is exactly what I don't need. I'd have a lot of use for being able to search the internet by voice and have the browser read an article to me while my phone is mounted to my dashboard. Without having to configure my whole phone for visual impairment, that is.
Every time I use a voice interface I regret it. The only time it works is when I ask Google a single short sixth-grade-level question when nobody else is in the room talking, and I'm not otherwise occupied by anything that would prevent me from just using my phone. I get that there are some people who can't use their thumbs, and I pity those people because voice interfaces are the most frustrating things on this planet.
Why isn't Firefox implementing PWA features like the Share Target API instead of shaving this yak?
I actually tried this with Siri while cooking yesterday. It's not there yet, but I asked "Hey, Siri ... read me the synopsis of the movie Adam's Rib" and Siri proceeded to read a short synopsis of that movie. It worked on another but made me choose one of 7. It failed on the third try where I tried another movie: it gave me selections, and when I picked one ("read me the first one") it just repeated the title instead of telling me the synopsis.
I don't use voice assistants so I don't know if these are common, but some of the examples in the list of commands[0] are interesting.
>Ask about a webpage
- Display or open information to the current page or website.
>Example
- What are people saying about this page? (Opens Reddit comments for a specific webpage or article)
- What did this page used to look like? (Shows page history in archive.org)
---
>Giving commands nicknames (experimental)
- Create names or shortcuts for actions.
> Example
- Say "open new york times", then "Give that the name news"
- news (will open nytimes.com)
> Audio from your voice request is sent to Mozilla’s Voicefill server without any personally identifiable metadata.
> Voicefill sends the audio to Google’s Speech-to-Text engine, which returns transcribed text. We’ve instructed the Google Speech-to-Text engine to NOT save any recordings. Note: In the future, we expect to enable Mozilla’s own technology for Speech-to-Text which enables us to stop using Google’s Speech-to-Text engine.
I was kind of hoping their homegrown speech-to-text engine had become good enough for production use. Disappointing to see that they still have to rely on Google.
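For reference, the server-side hop is presumably something close to the standard Google Cloud client call below. This is just a sketch of what a proxy like Voicefill would do with the audio, not Mozilla's actual code; the filename and config values are assumptions.

    # Rough sketch: forwarding captured audio to Google Cloud Speech-to-Text.
    # Not Voicefill's actual implementation; config values are assumptions.
    from google.cloud import speech

    client = speech.SpeechClient()

    with open("utterance.wav", "rb") as f:
        audio = speech.RecognitionAudio(content=f.read())

    config = speech.RecognitionConfig(
        encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
        sample_rate_hertz=16000,
        language_code="en-US",
    )

    response = client.recognize(config=config, audio=audio)
    for result in response.results:
        print(result.alternatives[0].transcript)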
Google charges 2.4 cents per minute for STT so there's no way Mozilla could afford to offer this service if it actually got popular. I mean, that obviously won't be an issue, but still.
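Back-of-envelope, with made-up usage numbers just to show the order of magnitude:

    # All usage figures below are assumptions for illustration only.
    users = 250_000_000              # rough order of Firefox's user base
    seconds_per_user_per_day = 10    # one short query per user per day
    price_per_minute = 0.024         # Google Cloud STT list price cited above

    minutes_per_day = users * seconds_per_user_per_day / 60
    print(f"~${minutes_per_day * price_per_minute:,.0f}/day")  # ~$1,000,000/day at these numbers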
From the FAQ: "Voicefill sends the audio to Google’s Speech-to-Text engine, which returns transcribed text. [...] In the future, we expect to enable Mozilla’s own technology for Speech-to-Text [...]"
> When you make a request using Firefox Voice, the browser captures the audio and uses cloud-based services to transcribe and then process the request.
Is it that hard to do local processing, either due to computational power or storage requirements? Or is it just more convenient for them to do it this way?
If I'm drawing the right conclusion, it's a bit of both: hundreds of megabytes of storage is fine for most people but not everyone, and while I probably wouldn't listen to the latest and greatest artists (and binary diffs are a thing, small additions aren't that large), it is convenient for devs to just push it to a server and be done rather than pushing model updates to everyone all the time.
Edit2: https://news.ycombinator.com/item?id=24096836 Wait, what?! The data is all sent to Google? I was thinking of using this for their sake (opting into using my data for common voice) but this is an instant deal breaker.
The default keyboard shortcut wasn't working and it was opening a different extension instead. I went to the voice extension settings and thought it was bad UX that you have to type the case-sensitive key names instead of just pressing the keys to record them.
The Google Recorder app on Pixel phones (and I'm pretty sure general Android release) does super accurate on-device transcription, for what it's worth.
I see a lot of skeptical voices here (somewhat warranted, given it's a voice assistant technology), but the fact remains that if we want open, on-device voice recognition, we'll have to do the work and donate sample data.
This extension is trying to provide some useful functionality in the hopes that Mozilla gets more data for https://commonvoice.mozilla.org
I'd at least consider recording your voice, especially if you're a non-native English speaker like myself, have an accent, etc.
It took many years for free software to start to take on the smartphone segment, with previous efforts (including by Mozilla) failing, and only now are the PinePhone & Librem 5 giving it another go, but unless you're a super hardcore enthusiast, you carry an iPhone/Android today.
I see this as a way to push back on the likes of Amazon, Google and Apple. If regular Firefox users are able to use an on-device, privacy-respecting voice assistant, and other open-source projects can use Mozilla's tools and datasets to build compelling competitors to Alexa, I'd see that as proof that free software is able to address new, emerging markets too.
> This extension is trying to provide some useful functionality in the hopes that Mozilla gets more data for https://commonvoice.mozilla.org
This is awesome! I love contributing to open source initiatives like this. I'm also a non-native speaker so hopefully I'll add some color to the voices recorded.
Very good point. Honestly I use my Echos for exactly two things: turning smart lights on and off and setting timers. I occasionally will ask it the weather or to play a song or a podcast. That’s about it. It seems like for my use cases it doesn’t need full on speech recognition and the million Alexa skills out there. Just a few simple phrases would suffice.
I'd go farther and say that I specifically don't want the million Alexa skills out there. A system that let me write my own intents with access to the top-level namespace would be ideal, but it would absolutely need to come with good hardware. That's where Alexa/Google have the upper hand currently, I think.
It still remains a risk well worth taking. With Mozilla, it is merely uncertain but for just about every other company, it is all but guaranteed. You can plainly see this philosophy in the industry's naming conventions, where supercomputers of yesteryear are relegated to "edge" roles.
As of today, the open source and free software equivalents to machine learning and AI products are sorely lacking when compared to commercial offerings. Whether it is open-ended speech to text with good ergonomics, text to speech, intent recognition, speaker recognition, OCR for text, OCR in the wild, translation, object recognition, image segmentation, image to text or natural language processing, commercial offerings are leagues ahead of what free software can do.
If we look at one of the most impressive AI demonstrations in history, GPT-3, it is not apparent whether open source can even replicate it because with AI, unlike in the past, time and skill is no longer directly fungible with money. I would argue the concentration of such capabilities to Microsoft and Google servers is a threat to the ideals of free software as great as any it has seen before. Yet, relatively little attention is spent there because people are too focused on yesterday's problems.
This concentration is difficult to avoid because current algorithms require large amounts of data and computing ability, which only large corporations can marshal. Mozilla is far from perfect but despite their many stumbles, they're the only large organization seriously attempting to address this imbalance. As much as these algorithms are marketed as AI to users, ML is better thought of as libraries, in the line of ffmpeg, to programmers. Mozilla still do seem to care about creating a local-first offering. If everyone stops using them then what is gained exactly?
> it is not apparent whether open source can even replicate it because with AI, unlike in the past, time and skill is no longer directly fungible with money.
Indeed. The creator of LuaJIT is only a single, very skilled, person with a desktop computer. People like Fabrice Bellard can produce gigantic amounts of FLOSS source code. Yes, those are only examples, but people with their skills and motivations to build FLOSS software will need access to lots of money in order to be able to build ML models.
>With Mozilla, it is merely uncertain but for just about every other company, it is all but guaranteed
I disagree, most large companies view this data as a competitive advantage and won't sell it directly. They may sell the results but the data itself is their moat. Smaller companies on the other hand are more willing to sacrifice future profits for current money.
As long as they're upfront about what they collect and how to not send them data, I'm fine with it. Obviously opt-in is much preferred to opt-out, but the Firefox Voice site clearly states that sharing data is opt-in.
My guess would be there isn't much of a point selling voice data from an open dataset. Also, since the code is in the open, it would be relatively easy to spot if they were sending data to somewhere they do not or record when they shouldn't.
Right now you can only help the STT engine by contributing to commonvoice.mozilla.org, where the samples are published to the world. The add-on will now, if you opt into it, keep the data in only Mozilla's and Google's hands. Mozilla has an agreement that Google won't keep the data, but even if Google doesn't comply with the agreement, the number of parties with access to the data will be lower than "everyone".
You have the wrong premise here. First of all, current Mozilla data is almost useless for training because it is carefully read speech. You do not need much of it, even for accents. If you add 1000 hours of CV data to 1000 hours of random data, the improvement in accuracy will be minimal. Same for the speech collected with Firefox Voice: it will be mostly a set of short commands, most likely not very useful for generic transcription of random people's speech.
Second, you can build models much better than Mozilla's simply from public data; there is no need to collect user voices. We at Vosk https://alphacephei.com/vosk/ support 10+ languages, for example, without any user data. Everyone creates very good models from augmented text-to-speech data these days (Microsoft demonstrated in a recent paper that you can get results almost as good as with domain-specific data https://arxiv.org/abs/2007.15188).
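To give a sense of what that looks like in practice, Vosk's Python API is roughly the snippet below for a fully offline transcription (close to the example in our docs; the model directory and WAV file are placeholders):

    # Fully offline transcription with Vosk's Python bindings.
    # "model" is a directory containing one of the downloadable language models.
    import json
    import wave
    from vosk import Model, KaldiRecognizer

    model = Model("model")
    wf = wave.open("test.wav", "rb")             # 16 kHz, 16-bit mono PCM
    rec = KaldiRecognizer(model, wf.getframerate())

    while True:
        data = wf.readframes(4000)
        if not data:
            break
        rec.AcceptWaveform(data)

    print(json.loads(rec.FinalResult())["text"])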
Given that, it surprises me that Mozilla continues to insist they need the voices of their userbase.
> if we want open, on-device voice recognition, we'll have to do the work and donate sample data.
We absolutely will not. The only reason people believe this is that they've forgotten how to do speaker-dependent recognition (SDR), which is more accurate and more secure anyway. We were doing SDR in the 80s with 1/1000 the CPU power and 1/1000 the memory.
SDR does require an initial training session, but once that's done any modern computer or smartphone should be able to handle it locally with no cloud server environment.
You say “forgotten” as if we had great tools everyone just forgot about. Having actually used those systems I am rather skeptical of that claim - they really seemed to have hit a certain functional plateau below the level of modern systems.
Put another way, if this was off the shelf, why isn’t anyone marketing it?
One reason may be that since it doesn't require a cloud, there's no personal data to mine. Try getting VC without a recurring revenue stream. It's probably possible but it's more difficult. Same story for IoT: Cloudless home automation is trivial from a technical point of view, but cloudless home automation is a non-starter VC-wise.
This was a field with multiple products on the market. How much VC do you need to deliver benchmarks of shipping software?
Similarly, saying cloudless home automation is easy sounds like you’re leaving out a lot of experience other people gained about the challenges of getting consumer adoption with the need to take on 24x7 server maintenance, connectivity challenges blocking popular features, etc. which made that class of products less appealing to most customers.
Training a speaker-specific recogniser that improves over a generic recogniser requires a lot more data nowadays. First, generic systems are a lot better and trained on a lot more data nowadays. Second, speaker adaptation worked better for the Gaussian mixture models from the late nineties (don’t know about the eighties) than for neural networks.
Who's "we" in this context? Because just below you, HN has comments from willing donors.
My point being that while there may still be a market for SDR, there's a broader market for speaker-independent recognition (SIR) simply because people want the tech to just work rather than feel like they messed up training the device when the device can't recognize them.
Using someone else's voice assistant is also a legitimate use case, especially if it's used to control music, lights, blinds, AC, car functionality ... that absolutely requires solid SIR.
I think this can be viewed as a marketing and UX problem, sort of. It reminds me of the Wii Amiibo - people actually paid money to train their AI bots because of how Nintendo designed them. Not sure how many people, but a reasonable enough segment of the market that Nintendo thought it a worthwhile investment anyway
I've looked into open source voice assistants before. I found Mycroft, Jarvis and a few others, but either got bogged down in dependencies or configuration. Many supported shipping your data to Google or Amazon if you configured it, or using an open source voice recognition tool.
I hate this idea that our voice has to be shipped somewhere to be processed. I remember a lot of the speech-to-text tools in the early 2000s weren't all that great (they needed a lot of training), but why haven't we been able to advance on-device processing? Why is everything done in "the cloud."
So the only way to semi-accurately do voice recognition is to source algorithms that re-train off of millions of people? We have processors in our desktops and laptops that dwarf that compute power by leaps and bounds. We should be looking to Star Trek TNG level voice processing, on each individual device, without some central mainframe.
But marketing, advertising revenue, data mining, free (as in beer) software that pumps your data like an oil rig, efficiency in data centre (cloud) design... all these factors have led these powerful little Intel/ARM/Ryzen chips to be nothing more than thin clients when they're not playing games.
If Mozilla really wanted to make something amazing and in the spirit of Firefox, give us an experiment where voice processing is done on our devices. Even if it meant I needed to download a 230GB data set, I'd gladly do it, if it could remotely help in getting away from these data silos.
Besides being a way of collecting data and ultimately making money, it avoids some of the "bogged down in dependencies or configuration" problem. My voice projects need to run entirely offline on a variety of hardware and operating systems. If each client was just a little app piping audio data to a cloud service, it would be way easier to write and maintain.
> So the only way to semi-accurately do voice recognition is to source algorithms that re-train off of millions of people?
Nope. You can absolutely tune a speech model locally on your own samples and get great accuracy. The trouble comes with open-ended speech: people expect the voice assistant to recognize that new artist or movie they heard about yesterday. That doesn't work without upkeep somewhere.
Rhasspy/voice2json are intended for pre-defined voice commands using a template language. You can get almost perfect accuracy with this approach, even with millions of possible commands. Re-training only takes a minute, so personal upkeep isn't bad.
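To make the "millions of possible commands from a template" point concrete, here's a toy illustration of the combinatorics (not voice2json's actual template syntax, just the idea):

    # Toy illustration: a few small slot lists expand into a large, closed
    # command set that a grammar-based recognizer can match very accurately.
    from itertools import product

    actions = ["turn on", "turn off", "dim", "brighten"]
    rooms = ["kitchen", "living room", "bedroom", "office", "hallway"]
    devices = ["light", "lamp", "fan"]

    commands = [f"{a} the {r} {d}" for a, r, d in product(actions, rooms, devices)]
    print(len(commands), "commands from three short lists")  # 60 here; real profiles reach millions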
> Even if it meant I needed to download a 230GB data set, I'd gladly do it, if it could remotely help in getting away from these data silos.
Mozilla DeepSpeech is trained on about 2000 hours of audio that is mostly spoken by American males. It has little ability to handle noise or accents and has a 5.97% Word Error Rate on LibriVox (which is noiseless, plainly spoken English).
Meanwhile, Google, Microsoft & IBM have tons of fresh audio coming in constantly to use in augmenting their models.
Baidu was able to build a competitive English Speech to Text model with 5000 hours of quality audio to train against.
There's tens of millions of hours of video content out there that has been subtitled pretty well, and I'd wager a lot of it is under usable licenses for Mozilla. Has that been considered?
If you know such sources, file an issue, and better yet, download the video content yourself and publish a dataset.
But note that raw video content is not training data. It has to be segmented to be in short enough parts for training (few seconds), the subtitles have to be aligned to match what's said precisely, and one needs to balance the data, e.g. when 90% of speakers are men and 10% are women, you have a problem.
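As a very rough sketch of that first step (getting short candidate clips out of subtitled material), parsing cue timestamps out of standard SRT files gets you candidate segments; the real work is the forced alignment and balancing mentioned above.

    # Rough first pass: pull short subtitle cues out of an SRT file as
    # candidate (start, end, text) training segments. Real pipelines still
    # need forced alignment and speaker/gender balancing on top of this.
    import re

    CUE_RE = re.compile(
        r"(\d{2}):(\d{2}):(\d{2}),(\d{3}) --> (\d{2}):(\d{2}):(\d{2}),(\d{3})"
    )

    def to_seconds(h, m, s, ms):
        return int(h) * 3600 + int(m) * 60 + int(s) + int(ms) / 1000

    def candidate_clips(srt_text, max_len=10.0):
        for block in srt_text.strip().split("\n\n"):
            lines = block.splitlines()
            for i, line in enumerate(lines):
                m = CUE_RE.match(line)
                if m:
                    start = to_seconds(*m.groups()[:4])
                    end = to_seconds(*m.groups()[4:])
                    text = " ".join(lines[i + 1:]).strip()
                    if text and (end - start) <= max_len:
                        yield (start, end, text)
                    break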
> why haven't we been able to advance on-device processing?
Probably because getting the payloads shipped for server side processing provides a constant stream of training material updates? That's my cynical take, at least.
But there's got to be more. Paraphrasing H.L. Mencken[0], for every complex problem there is an answer that is straightforward, easy to accept - and wrong. I remember how in the late noughties Nokia had an early version of on-device speech recognition: long-press the trigger key and you could call people by uttering a saved "voice tag". IIRC the feature picked the right entry roughly 1/3 of the time.
For what comes next I'm relying on hearsay, but from what I heard at the time, the feature was originally developed at MIT. Nokia then financed the team to optimise their code and underlying detection model to be ported to ARM and to fit the constrained memory/CPU envelope. Because Nokia definitely had collaboration with MIT at the time, this is at least plausible.
If people are expected to use speech-to-text in real life, it has to work within very strict boundaries. Low latency and high accuracy are table stakes. At least some level of contextual awareness would be nice. As long as predictive text input routinely provides us with meme-worthy failures, I won't expect anything better from (fundamentally noisier) speech inputs. And if server-side processing is the only way to get performance from dismal to somewhat functional, practical applications don't have much of a choice. Plus, you don't have to ship your model to end user devices.
For what it's worth, I dislike voice interfaces. But when they do work, I dislike them less than Byzantine and user-hostile phone menu systems. I guess that qualifies as progress.
Try Nuance Dragon options (no affiliation, just a good tool). They have software that works offline, without cloud. They have a very competitive voice recognition quality for the supported languages. There are some products that are explicitly cloud-based (like mobile ones), but there are also desktop versions that do all the processing, inferring and even learning of your voice and new dictionaries on your local machine. You will need a somewhat powerful PC, but nothing crazy.
(PS but to be fair, it's a dictation tool with some extra commands, even if a very good one. It is not really a voice assistant. But I'm not sure if that is harder or easier to do. There are open-source offline assistant tools too that work quite well already, because once you have a set of pre-determined phrases/formulas, they are much easier to parse.)
Speech-to-text is relatively easy (not to discount the decades of work that went into it, lots of which was done by Dragon), but extracting intent and sending the user to the correct place is what is really difficult. IMO it is essentially an unsolved problem.
Let's say you have two companies A and B. A does on-device only and B does in the cloud. Even if their systems begin being just as accurate, B has an ever growing stream of new training data with which to improve their system. A does not. Over time B will become better and better while A does not. Worse, by downloading the model onto devices the model can be easily copied, cloned, reverse engineered, interrogated and otherwise used by competitors.
Kalliope[1] is a no code, but Python-based, voice assistant. I like it because you can switch out voice backends and it is designed to be modular with node-like constructs to build signals and responses.
I stuck it in a container and can control Home Assistant with it, run scripts, etc.
Drives me nuts. "Hey Siri, Pause" shouldn't require an internet connection. I often have bad or no connection (eg podcasts while hiking). Voice commands must be ubiquitous. If I have to fail-revert back to touch, what's the point?
The HomePod has a relatively powerful CPU for what it is doing; I'd assume there's lots of headroom to do local intent recognition on device, but Apple likes to get new training data constantly too, so they haven't really tried.
Googled those, and it looks like some company selling remotely-running models, which I think is what GP was referring to. Is there another technique that's been SEOd out by this company?
Whether they are non-profit or not, they still run like a business, and the business decisions about how to allocate resources likely use the same metrics as any other business's - maximize profit potential (however you define it).
It may just be that fixing accessibility bugs just doesn't produce much "profit".
Ah, yes, the classic unrelated Mozilla hate thread. Wherein a user finds a single grievance that doesn't matter to 99% of the users, uses it as a warhorse and brings it into fully unrelated threads.