Nerd-dictation, hackable speech to text on Linux (github.com/ideasman42)
202 points by ideasman42 on Jan 17, 2022 | 62 comments



This is better than any other speech-to-text setup I've ever encountered, for one simple reason: I followed the dead-simple install steps in the readme, started the program, and it worked. Bonus points for the install being a git clone and pip install away. I don't know why this is a hard bar to clear, but bravo. (I suspect that it's because a lot of FOSS speech recognition is from academia where "follow the following 13 steps, including hand-crafting recognition parameters" is more normal and acceptable because everyone involved is already a domain expert, whereas I, as a user, just want "plug in a mic, run this thing, and get text on stdout".)


Most speech recognition and speech synthesis software can be easy to install if you get rid of the GPU requirement. Both AMD and Nvidia have horrible workflows for installing their drivers and neural network/linear algebra kernels. Real-time speech recognition/synthesis on generic consumer-grade Intel/AMD cores is very, very difficult to do well, which is why most providers are cloud-based. (The alternative is targeting Mac only, as they have standardized hardware everywhere.)


Yeah, text-to-speech is and has been easy for ages; I'm pretty sure I used espeak like a decade ago. On the other hand, I have tried... pretty much all the big names in speech-to-text, without success, or at best "I kind of got the demo to work but couldn't figure out how to do anything useful with it". Kaldi, sphinx, julius, a handful of tiny PoC things I found online... maybe I'm just bad at following instructions, or I'm trying to do something that they're not trying to optimize for, but I have not had a good time.


You used to be able to configure, make and then install software from source quite easily. Maybe you were missing a devel lib or two, so you installed them. Now it's grown into a dependency hell so bad that we employ a myriad of package managers and container platforms to manage the disaster. I blame the package manager fetish.


> I followed the dead-simple install steps in the readme, started the program, and it worked. Bonus points for the install being a git clone and pip install away. I don't know why this is a hard bar to clear, but bravo.

Woah, it really is 2022.


What?


What you said used to be the standard. In fact, the standard used to be your Linux distribution's package manager, which is even more convenient. I cannot even imagine a piece of software that you cannot get working as easily as cloning the git repository and following the instructions (instructions that are typically pretty easy to follow and are supposed to work), or using a programming language's package manager, or your Linux distribution's or BSD's package manager.

At any rate, what I am trying to say is that if poorly documented software (i.e. software whose documentation is usually untested) is common, then we are definitely doing something wrong. You should be able to follow the installation instructions and have it work, i.e. just read INSTALL or README and follow the instructions, like the good old times!

You said it yourself: "I don't know why this is a hard bar to clear, but bravo." It should not be a hard bar; it should be expected, and it should be done. It should not be a magical or surprising thing.


(I am the OP you're talking about)

Ah, yeah, very much agreed. I don't mind stuff not being packaged in official distro repos (although it would of course be very nice), but a build/install process being more complex than `configure && make && make install` borders on "bug" territory in my mind. In fairness, however, speech-to-text seems to mostly suffer from problems after the build/install step proper; it's common enough that getting to the point of running the program is okay, but it's useless without the language/speech models that make it actually understand speech, and those are... arcane, I think is the best word. Or huge files. A major differentiator for this one was that getting and using the model was "wget this single 40M file and place it here". And then at runtime, the program correctly grabbed the right microphone and automatically output text to stdout and was mostly accurate at doing so:) Others have done any one of those steps, but never all.


> (I am the OP you're talking about)

Yes, sorry, I edited it before you commented.

I agree. I was going to write exactly those commands. OK, you usually have to execute "autogen.sh" or run "autoconf" when you clone a repository, and I am usually fine with perhaps even "mkdir build && cmake ..".

> A major differentiator for this one was that getting and using the model was "wget this single 40M file and place it here". And then at runtime, the program correctly grabbed the right microphone and automatically output text to stdout and was mostly accurate at doing so:)

Damn, that is crazy as well. If I were the one whose software depended on models or assets (that are relatively large in size), I would either put them into their own git repository and use them as a git submodule, or host them (probably with mirrors) and have a Makefile target (probably) that downloads them using "wget" or "curl". That, or execute a script from the Makefile, or anything of the like.
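For illustration, roughly the kind of script I mean, which a hypothetical "make model" target could invoke -- the URL, archive name, and destination directory here are made up, and a real setup would also verify a checksum:

  # fetch_model.py - grab and unpack a model archive if it is not already present.
  import urllib.request
  import zipfile
  from pathlib import Path

  URL = "https://example.com/models/model-small.zip"  # placeholder mirror URL
  DEST = Path("model")                                # where the program expects the model

  if not DEST.exists():
      archive, _ = urllib.request.urlretrieve(URL, "model.zip")  # download the archive
      with zipfile.ZipFile(archive) as zf:
          zf.extractall(DEST)                                    # unpack next to the program
      print("model unpacked to", DEST)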

For programs like this, grabbing the right microphone and outputting text to stdout should be pretty common, too[1]. If the program does not do this, then it should be considered buggy, and I would think that the developers did not even test their own stuff!

[1] I still have related issues with Audacity to this day! I cannot complain that much, but yeah.


What I like so much about using Arch and the AUR is that there is a community that will do these fiddly processes for you. Of course not everything is in the AUR, but the barrier to entry is much lower than with typical Linux distributions' repositories.



What is your point, and with what do you disagree exactly?


Go follow the build instructions. The "just build from git" line is pure fantasy.


No, it is not, but that is beside the point, because all we have been saying is that it should be the case with new/newer projects as well.


You know... I have an idea. How about we use vosk and this tech to integrate with ffmpeg somehow so that peertube videos can get subtitles while being transcoded. Once we get English SRT, we could use libretranslate to translate that English SRT to multiple languages.

This could be similar to what YouTube does with its automatic subtitles. What do you guys say?
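A rough sketch of what that pipeline could look like with the vosk Python package and ffmpeg decoding to raw PCM -- the "model" directory and "video.mp4" are placeholders, and turning the word timestamps into actual SRT cues is left out:

  import json
  import subprocess
  from vosk import Model, KaldiRecognizer

  SAMPLE_RATE = 16000
  rec = KaldiRecognizer(Model("model"), SAMPLE_RATE)  # "model" = unpacked vosk model dir
  rec.SetWords(True)  # include per-word start/end times, useful for subtitle cues

  # Let ffmpeg decode whatever the video container is into raw 16 kHz mono PCM.
  proc = subprocess.Popen(
      ["ffmpeg", "-loglevel", "quiet", "-i", "video.mp4",
       "-ar", str(SAMPLE_RATE), "-ac", "1", "-f", "s16le", "-"],
      stdout=subprocess.PIPE)

  while True:
      data = proc.stdout.read(4000)
      if not data:
          break
      if rec.AcceptWaveform(data):
          res = json.loads(rec.Result())
          print(res.get("text", ""))  # res.get("result") holds word-level timestamps
  print(json.loads(rec.FinalResult()).get("text", ""))

The word timings are enough to write SRT cues, and the resulting English text could then be fed to LibreTranslate as described above.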


The package that this project is built on (vosk-api) mentions & includes some examples to demonstrate exactly that type of use case with ffmpeg:

"...continuous large vocabulary transcription, zero-latency response with streaming API, reconfigurable vocabulary and speaker identification ... can also create subtitles for movies, transcription for lectures and interviews."

* https://github.com/alphacep/vosk-api/blob/master/python/exam...

* https://github.com/alphacep/vosk-api/blob/master/python/exam...

* https://github.com/alphacep/vosk-api/blob/master/python/exam...

(Edit: Also, thanks for introducing me to "libretranslate", looks like an interesting project.)


Great. Someone should link the PeerTube GitHub to this. I am sure the great people there will do it much faster and more elegantly :-)


I just checked and apparently they are already aware: https://github.com/Chocobozzz/PeerTube/issues/3325#issuecomm... :)

(Tho I'll admit I have no idea what "bluffing" means in that context. :D )


Cool: https://gitlab.mim-libre.fr/extensions-peertube/plugin-trans...

So this already exists. Damn, I thought I was the first one to think of this. Still, this existing means there are people who think faster than me.


Sounds great! Do it!


I've never even heard of VOSK-API [0], the underlying offline speech to text engine that this project uses.

Does anyone have experience using it? Is it any good?

[0] https://github.com/alphacep/vosk-api


Vosk powers Dicio, a free and open source voice assistant for Android. If you have an Android device, this app is another way to try out Vosk:

- F-Droid: https://f-droid.org/packages/org.dicio.dicio_android/

- Source: https://github.com/Stypox/dicio-android

- HN: https://news.ycombinator.com/item?id=29762526

The accuracy of the English language recognition is not bad. I'm glad to see an implementation of Vosk for desktop Linux.


I can second Dicio as a way to give Vosk a try. For a local model it worked surprisingly well. But you can't yet mix languages mid-sentence - difficult when searching for restaurants that have English names but are not located in an English-speaking country.


Vosk-api isn't an STT engine itself; it is built using the Kaldi speech recognition toolkit (https://github.com/kaldi-asr/kaldi) and nicely implements and packages an API for Kaldi chain/LF-MMI models.


I use it to transcribe English robocalls. Vosk gets all the words right as long as I use the "Accurate generic US English" model. PocketSphinx (with the default en-us.lm.bin model in the distro package, no idea what it is) didn't get a single word right IIRC. I didn't try anything else.


Yeah, I was really impressed with the project when I encountered it last year while trying out a bunch of FLOSS speech-to-text options.

It was significantly better than the other FLOSS options I looked at--both in terms of getting it going initially & the quality of the speech to text results.

I tested it with a lightly modified version of this example script: https://github.com/alphacep/vosk-api/blob/master/python/exam...

What I found particularly interesting was that when you have the "partial" recognition output shown in real time, you get to see how--at the end of a sentence--it may change a word earlier in the sentence in the final recognition output, based on (I guess) the additional context of the full sentence.

(I just did a quick test again (with the installs from my testing last year) using an internal laptop microphone & the test script recognized a significant chunk of my speech (using a headset definitely improves things though) whereas with the same environment a test with `mic_vad_streaming` (from `DeepSpeech-examples-r0.9` with `deepspeech-0.9.0-models.pbmm`) failed to recognize any words at all.)
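That real-time behaviour is easy to see with a stripped-down sketch of that kind of microphone script -- this assumes the sounddevice package (which, if I recall correctly, the upstream example also uses) and an unpacked model in a "model" directory; partial results print until AcceptWaveform() signals the end of an utterance:

  import json
  import queue
  import sounddevice as sd
  from vosk import Model, KaldiRecognizer

  SAMPLE_RATE = 16000
  q = queue.Queue()

  def callback(indata, frames, time, status):
      q.put(bytes(indata))  # hand raw audio blocks from the stream to the main loop

  rec = KaldiRecognizer(Model("model"), SAMPLE_RATE)

  with sd.RawInputStream(samplerate=SAMPLE_RATE, blocksize=8000, dtype="int16",
                         channels=1, callback=callback):
      while True:
          data = q.get()
          if rec.AcceptWaveform(data):
              # Final text for the utterance; words shown in earlier partials may change here.
              print("final:  ", json.loads(rec.Result())["text"])
          else:
              print("partial:", json.loads(rec.PartialResult())["partial"])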


It's very well known among people who know the field. It's quite good, and the lead developer has a nice blog too.


Results depend heavily on which speech files you use. You can even guess which one it was by looking at the errors it makes.


Nice. Another notable mention in this space is Talon. Useful for automating all OS tasks with voice commands, as well as just dictation: https://talonvoice.com/


Talon has an EULA that is enough to send me scrambling for the hills: https://talonvoice.com/EULA.txt

Meanwhile, the meat of this project's speech recognition is Vosk, which is just Apache-2: https://github.com/alphacep/vosk-api/blob/master/COPYING


Running it as a user prompts:

  > $ ./run.sh 
  > [+] Prompting for admin to set up Tobii udev rule
  > [sudo] password for dotancohen: 
That does not build trust. I would prefer instructions on how to set up the udev rule, or better yet, I would prefer that requirement to be relaxed. What does it need beyond the standard microphone access that e.g. nerd-dictation or even Telegram needs?


This is such an amazing technology for the many tech people who have to deal with hand/finger/elbow issues after years of extensive keyboard use.

I was looking for this type of tech for at least 2 years and I am glad it now exists.

FOSS is amazing!


Has anyone used this inside Emacs, or does anyone know how to make Emacs take its output and put it into a buffer?


Just open Emacs.

This program outputs text as if it were a keyboard. And, in English at least, it works really well. I cannot believe it.


I've used it for a while in German and in English, and I was impressed, too, with its recognition performance. Even the small and therefore fast language models perform decently. However, a major downside is that it doesn't do punctuation, new lines, capitalization, and the like. This means that you have to edit a lot of the recognized text by hand, which obviously spoils the fun. Having said that, the front-end code is in Python and you can easily hack it. With a small handful of lines of code I was able to address some of these issues somewhat.
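To give an idea of the kind of hack I mean (assuming I remember nerd-dictation's user configuration mechanism correctly): it reads ~/.config/nerd-dictation/nerd-dictation.py and calls a nerd_dictation_process() hook on the recognized text, so something like this handles a few spoken punctuation words plus naive capitalization -- the word list is just an illustration:

  # ~/.config/nerd-dictation/nerd-dictation.py (user configuration hook)

  # Spoken words to turn into literal punctuation; extend to taste.
  PUNCTUATION = {
      " comma": ",",
      " period": ".",
      " question mark": "?",
      " new line": "\n",
  }

  def nerd_dictation_process(text):
      for spoken, symbol in PUNCTUATION.items():
          text = text.replace(spoken, symbol)
      # Naive: capitalize only the first character of the dictated chunk.
      return text[:1].upper() + text[1:]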


We are working on punctuation/capitalization. Models are ready but not yet integrated. A German BERT-based model is available at https://alphacephei.com/vosk/models#punctuation-models, for example. It should be ready soon.


Was just about to mention this repo to the OP but suspect I found it from your site in the first place: https://github.com/benob/recasepunc :D

Punctuation/capitalization will make a massive difference to practical use! Look forward to it.


Reading my comment back, right here in the browser:

  > just open a max this program output like a keyboard and in
  > english at least it works really well i cannot believe it
The only problems that I see are:

1. Capitalization and punctuation.

2. Doesn't know what emacs is, so it got that wrong. A user-installed dictionary might help here.

3. "outputs" came out as "output". I just tried a few more times, and I got the same results. I suspect that like "emacs", the word "outputs" is not in the dictionary.


Regarding 1), I have two key bindings: one that starts a sentence and another that doesn't. While punctuation remains an issue (commas, brackets, and question marks, for example), since I'm mostly using this to save typing longer passages of text, having to manually deal with punctuation isn't all that much of a hassle. But I can understand that anyone attempting to go completely hands-free would need something to support entering literal characters and punctuation.


How do the keybound scripts know when to stop listening? After a certain delay? Does the sentence-script just capitalize the first word and add a period to the end?

I'd love to see the scripts if you don't mind.


Just shortcuts to begin/end dictation.

Although I use a programmable keyboard, which I've configured for "push to talk", that is, I hold a key for dictation and releasing the key stops dictation.


I wonder if I could configure a shortcut such as Win-D (dictation) to start recording, and to poll the Win key to stop recording when it is released.


As long as you can run commands on key-press, I don't see why not.

You could test it with file creation/removal or something simple; the same will then work for nerd-dictation.
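If a plain desktop shortcut can't react to the key release, one hypothetical approach is to watch the keyboard device directly with python-evdev and call nerd-dictation's begin/end commands -- the device path and key choice below are placeholders, and you need read access to the /dev/input device:

  # Push-to-talk sketch: hold the chosen key to dictate, release it to stop.
  import subprocess
  import evdev

  DEVICE = "/dev/input/event3"      # placeholder: your keyboard's event device
  KEY = evdev.ecodes.KEY_RIGHTMETA  # placeholder: the key to hold while dictating

  dev = evdev.InputDevice(DEVICE)
  for event in dev.read_loop():
      if event.type == evdev.ecodes.EV_KEY and event.code == KEY:
          if event.value == 1:      # key pressed: start dictation in the background
              subprocess.Popen(["nerd-dictation", "begin"])
          elif event.value == 0:    # key released: tell the running instance to stop
              subprocess.run(["nerd-dictation", "end"])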


It is only a matter of time before someone puts a dictation mode together. Given the design of Emacs key combinations, you will be able (required) to also use rhythm and tone.


I was wondering how well it dealt with accents, then I saw that the Vosk API page specifically mentions "English, Indian English, German, French, ..." :D I don't know the story behind "Indian English" specifically being listed as a separate language, but I'm glad to see it's supported.


Well, I'm not Indian but I can see it being a separate dialect, much like American English. For instance, an Indian might say "I have a doubt" instead of "I have a question". And as you mentioned, there is an accent, just like with American English.


Adding Indian English was definitely doing the needful.


That's true, but that's why I mentioned "specifically listed" in my comment - the same thing applies to Australian English, for example, and possibly to Filipino English, South African English, and many others. Indian English is the only _dialect_ mentioned in the list; the rest are languages, which made me curious.

The reason is probably something pragmatic, like perhaps a large enough corpus was available for that specifically.


Vosk, is it "wax" in Russian ("воск")?

I think of wax recording rolls - the CDs of the old days, a.k.a. the phonograph cylinder:

https://en.m.wikipedia.org/wiki/Phonograph_cylinder


I'm throwing another hat in the ring as this technology totally working most of the time. I used it to write this comment.

This should make my life a lot easier because I find myself going to my phone and using the dictation feature a lot recently. It's not as good as the one on my android, but it's 95% of the way there.


Reading your comment with nerd-dictation returns this for me:

  > i'm throwing another hat in the ring as this technology totally working most
  > of the time i used it to write this comment this should make my life ah lot
  > easier because i find myself going to my phone and using third dictation
  > feature a lot recently it's not as good as the one on my android for it's
  > ninety five percent of the way fair
For use with no training that looks great. I'm sure that as I learn to speak more clearly, your 95% estimate is achievable.


Interesting, are you using the full model? I found that with a good microphone and the full 1-gigabyte language model, the quality is quite good compared to other people's phones that I have used from time to time.


How did you add the punctuation and capitalization?


That was by hand but it's a small task.


I'm imagining remapping some VIM shortcuts for even easier capitalizing words and adding common punctuation.

This remap makes the "." key add a period after the previous word, capitalize the current word, then move on to the next word:

  :noremap . bea.<esc>w~w


Is there a good offline program for text-to-speech for German, French, Spanish, English? And no, Festival and espeak are not what I would consider good.

The AT&T website with text-to-speech as an audio file, which was used in those anonymous publications, was good, but espeak is not. If I had something like that for European (and Russian and Arabic) languages as open-source standalone software, I would be happy :(


Yes!

The project is called Larynx, and it is amazing: https://github.com/rhasspy/larynx/

I waxed lyrical about it recently in this thread about private alternatives to Alexa: https://news.ycombinator.com/item?id=29562526

I can only vouch for the quality/variety in English, but it does note support for 50 voices across 9 languages, including all of the first group of languages you mentioned, and also Russian. (I've "played" with all those languages to test them but can't really vouch for how a native speaker/listener might find it. :D )

It is miles ahead of any of the other Free/Open Source TTS solutions I've tried, including the ones you mentioned.

(It's still synthesized speech but the output quality is so good and the project is still extremely early days.)

And there's a range of options in accent & gender--which are in general sorely lacking in other FLOSS TTS options. (In terms of licensing, some voices are licensed more freely than others but the majority are without significant restriction.)

I like Larynx so much that I've been working on an editor for it to assist in "auditioning" & recording speech in a narrative context, e.g. game/film pre-viz.


Thanks, I will look it up! Thank you! :)


Very cool. Does it have an erotic voice? Asking for a friend.


Nerd Dictation does speech-to-text (voice recognition), not text-to-speech. If you want to speak to your computer in an erotic voice, nobody's stopping you.


There is actually an API for this:

evStart: start speaking in an erotic voice

evStop: stop speaking in an erotic voice

evQuery: query whether you are speaking in an erotic voice

evLinuxClassic: enable the inability to speak until Firefox is closed (experimental)


Festival, and install the Scottish or French voice (whatever floats your boat).



