Hacker News

I’ve been having similar issues lately, and after looking at Talon, Serenade, and Caster, I ended up using Caster [0]. The programs all have significant differences in usability and clever ideas behind them, but unfortunately automatic speech recognition (ASR) is still bad enough that the primary factor is which has the better ASR engine. Caster supports Dragon NaturallySpeaking, which is expensive, but enough better to make it worthwhile.

There are moments where I think this is going to be the future of programming, since code as text is only the easiest for as long as typing is the easiest way to record things unambiguously.

But for the most part it’s still pretty frustrating. If ASR systems can get the sentence error rate down by an order of magnitude or two, I sincerely think this will take off not just for accessibility, but for normal use.

Until then, it’s a PITA that is saving my career nonetheless.

[0] https://caster.readthedocs.io/en/latest/




Talon actually can work with Dragon; it started out requiring Dragon, but has since grown its own voice recognition. Anecdotally, a lot of folks on the Talon Slack seem to be refugees from Dragon-based solutions, and a common observation is that Talon's voice recognition is much lower-latency than Dragon's. Unfortunately I can't confirm this, as I haven't used Dragon myself.

My impression of Talon's speech recognition is that it's good enough for voice coding, but could be improved when it comes to dictating English prose. That said, there are promising avenues of improvement: if you pay for the Talon beta, there's a more advanced voice recognition engine available that's much better at English, and you can also hook into Chrome's voice recognition for dictation.


> Talon actually can work with Dragon

More specifically, "can work" means: if you run Talon on Windows or Mac alongside Dragon, and don't configure Talon to use its own engine, Talon will automatically use Dragon as the recognition engine.


> or Mac alongside Dragon

Unfortunately, it looks like Dragon for Mac isn't a thing anymore, which is a shame because my main dev machine is a Mac. Or at least it used to be, before I needed to start dictating everything.


Talon can also use Dragon remotely from a Windows machine or VM, kind of like aenea but more tightly integrated (the host is driving instead of the guest).

And I do have some users voluntarily switching from Mac Dragon to the improved speech engine in the Talon beta. Mac Dragon has been kind of buggy for the last few years so you're not missing much.


Any chance you have pointers on how to set that up? You'd probably laugh/cry to see my setup right now, with my Windows desktop on the left monitors and my MacBook on the right monitors, because I need both... purely because Dragon has only been sold for Windows since this started being an issue for me. A more tightly coupled super-aenea sounds pretty fantastic.


Sure: first, run Talon on both sides. Then go to your Talon home directory (click the Talon icon in the tray -> scripting -> open ~/talon). There's a draconity.toml file there.

On the Dragon side, you need to uncomment and update the `[[socket]]` section to listen on an accessible IP.

On the client side (Mac in your case), uncomment / update the `[[remote]]` section to point at your other machine.

You also need to make sure both configs have the same `secret` value.

From there, restart Dragon and look in the Talon log on the Mac side for a line like "activating speech engine: Dragon".

To prevent command conflicts, I recommend setting Dragon to "command mode" (if you have DPI), and only adding scripts to Talon on the Mac side.

If it doesn't work, you can uncomment the `logfile` line in draconity.toml on the Dragon side, restart Dragon, and look in the log to see what's going on.
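For illustration, a hypothetical pair of draconity.toml files for this setup might look roughly like the following. The exact key names, port number, and addresses here are assumptions from a made-up example, not authoritative; follow the comments in the draconity.toml that Talon generates for you:

```toml
# --- Dragon/Windows side: uncomment [[socket]] to listen on a reachable IP ---
[[socket]]
host = "0.0.0.0"    # hypothetical: a bind address reachable from the Mac
port = 38065        # hypothetical port; keep it consistent on both sides

# --- Mac/client side: uncomment [[remote]] to point at the Dragon machine ---
[[remote]]
host = "192.168.1.20"   # hypothetical: LAN IP of the Windows machine
port = 38065

# Both sides: the secret must match exactly.
secret = "use-the-same-random-string-on-both-machines"

# Dragon side, optional: uncomment for debugging if recognition never activates.
# logfile = "draconity.log"
```

After editing, restart Dragon and watch the Talon log on the Mac for the "activating speech engine" line.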


Do you know of any workflows like this using entirely open source software?

EDIT: Seems like caster itself has instructions for an open source recognition engine to pair with it. Not sure how accurate it'll be but I'm going to give it a shot!

I hate the idea of relying on closed source software to be able to continue in my profession. If this works I'm definitely going to be donating to the FOSS options


Yeah, there is the Kaldi backend for caster, which I've tried on my Mac, since Dragon isn't a thing on Mac. Unfortunately it's not nearly as good :-(

I'd like to record my usage of Dragon so that I could fine-tune my own model, but it's harder to get around to hobby coding projects like that now that coding is more of a pain in the ass.

The unfortunate reality in my case is that using Dragon is the least frustrating way to keep working. I don't think closed, paid ASR models will stop being noticeably better than the open ones until the state-of-the-art error rate is basically zero.

I do like that caster is open source where talon isn't, but preferences like that are pretty low priority when, say, your career is on the line.


> I do like that caster is open source where talon isn't, but preferences like that are pretty low priority when, say, your career is on the line.

For sure: if it's on the line right now, you use what's available and works best right now. But the reality of niche software is that the companies/people backing it tend to close down or move on as they realize the market is too small to sustain them. Or Microsoft updates its OS in a way that forces them to spend considerable time making it functional again, and they just don't have the bandwidth for it, so you're stuck for some amount of time, or indefinitely.

If you have no other choice, it's definitely better than nothing, and you want to use what enables you to do your job best. But in terms of where I'd be willing to throw my financial support? It'd need to be something for the better of the coding community as a whole. For me, this means an open source tool or set of tools, not a proprietary system.


FWIW, Talon has a setting to record everything locally, annotated with the recognized words. In the next release it will also work when using Dragon.


My understanding is that Talon is built on wav2letter's inference framework for ASR.


I do not use the wav2letter@anywhere inference frontend. I trained the acoustic model using the Facebook upstream code, but the decoder is almost entirely new, and on Windows I use PyTorch for inference.

Talon ships with a libw2l.so/dylib on Linux/Mac built from my open source repos.


It feels like wav2letter will not be actively developed anymore. Understandable, since it is hard to compete with PyTorch using a custom NN toolkit. Any plans to move to PyTorch/TensorFlow?


I don’t think that’s right at all. They moved development to the Flashlight repo, and it seems very actively developed to me: the last commit was 3 hours ago, the wav2vec 2.0 blog post went up last month (September), and the current state of the art, IIRC, is a Google model based on Facebook’s wav2vec 2.0 work.

For my own use I’ve already built PyTorch and CoreML frontends with a shared model format (I can convert models to/from wav2letter format and my custom format), and I have the ability to create new models in these frameworks from wav2letter architecture files.

I still run my training in the wav2letter framework, but for compatible training in PyTorch I would mostly just need criterion implementations. I assume warpCTC is fine for the CTC models. There’s also a third-party PyTorch ASG criterion package, but I haven’t tried it yet.
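Since the CTC criterion comes up here, a toy plain-Python sketch of what it computes may help for intuition. This is the standard CTC forward (alpha) recursion over a blank-interleaved target, checked against brute-force path enumeration on a tiny example; it is not code from Talon, wav2letter, or warpCTC, and a real criterion would also need the backward pass and batching:

```python
import math
from itertools import product

def logadd(a, b):
    """log(exp(a) + exp(b)), safe against -inf."""
    if a == float("-inf"): return b
    if b == float("-inf"): return a
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))

def ctc_nll(log_probs, target, blank=0):
    """-log P(target | frames) via the CTC forward (alpha) recursion.
    log_probs: T x V per-frame log-probabilities; target: labels, no blanks."""
    ext = [blank]                      # interleave blanks: [b, y1, b, y2, ..., b]
    for y in target:
        ext += [y, blank]
    S, T = len(ext), len(log_probs)
    alpha = [float("-inf")] * S
    alpha[0] = log_probs[0][ext[0]]
    if S > 1:
        alpha[1] = log_probs[0][ext[1]]
    for t in range(1, T):
        prev, alpha = alpha, [float("-inf")] * S
        for s in range(S):
            a = prev[s]                # stay
            if s >= 1:
                a = logadd(a, prev[s - 1])  # advance one state
            # skip a blank, only between distinct non-blank labels
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                a = logadd(a, prev[s - 2])
            alpha[s] = a + log_probs[t][ext[s]]
    return -logadd(alpha[-1], alpha[-2] if S > 1 else float("-inf"))

def collapse(path, blank=0):
    """CTC collapse: merge consecutive repeats, then drop blanks."""
    out, prev = [], None
    for p in path:
        if p != prev and p != blank:
            out.append(p)
        prev = p
    return out

# Tiny sanity check: 3 frames, vocab {blank, 1, 2}, target [1, 2].
probs = [[0.5, 0.3, 0.2], [0.2, 0.5, 0.3], [0.3, 0.3, 0.4]]
log_probs = [[math.log(p) for p in row] for row in probs]
target = [1, 2]
# Brute force: sum the probability of every 3-frame path collapsing to [1, 2].
brute = sum(
    math.prod(probs[t][p] for t, p in enumerate(path))
    for path in product(range(3), repeat=3)
    if collapse(list(path)) == target
)
nll = ctc_nll(log_probs, target)
```

The dynamic program and the brute-force enumeration agree, which is the essential property the criterion's forward pass has to have.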


Interesting. What model are you using for your acoustic model - the streaming convnet?

I didn't know that there was a PyTorch implementation of the w2l architectures.



