This post starts out talking about expecting to spend around $1,000.
There are at least two cross-platform projects where the biggest expense is a microphone instead of software.
1. My project, Talon. Windows/Linux/Mac support, and a first party local speech recognition engine that is pretty good and getting better. It’s free, but the engine is in a private beta (which is $15/mo to support development, optional if there’s a financial issue).
2. Serenade. They are VC backed. Currently free, unsure about their longer term plans. They use cloud based recognition.
I have intermittent RSI and I've been using Talon for 1.5 years for programming (web) with IntelliJ and Spacemacs. I reckon I'm as productive using only voice as I am using my hands. When my hands don't hurt and I mix Talon with my hands, I feel I can do more than I could with my hands alone. Thanks, lunixbochs. Talon is great.
When using vim keys, for example, I press ^ to jump to the beginning of the line, then press 'dw' to delete the first word. So I do the caret with my voice and the other presses with my hand, which is faster than doing it by hand alone.
Interesting, is that something you’d consider using the noise input for once I finish that? (As noise will be lower latency, you’ll have about 40 noises to map, and it’s “out of band” with speech)
Sure! I'd love that. I only use one noise at the moment, for mouse clicking (since I only know one sound).
Could you point me to a video that shows an example of each of the 40 available sounds?
They aren’t available for use yet. https://noise.talonvoice.com is the submission collection site for the training data. Each noise there has a description and audio example. When they’re available you’ll be able to bind them just like voice commands.
This comment fails to mention Dragonfly with my Kaldi Active Grammar backend [1], which is cross platform (Windows/Linux now and Mac functional and to be released soon), completely free with no private beta features (although I do accept donations), and 100% open source (unlike Talon). The speech recognition is local, with extremely low latency. See the video demonstration [2] on the project page. I think the underlying Kaldi engine delivers unmatched accuracy as a free non-commercial engine.
I created Kaldi Active Grammar because I didn't trust relying on closed source software for something so crucial to my productivity, where a decision by an outside party determines whether I can function. As a bonus, open source means I can make it work better to fit my needs than closed source ever could.
Furthermore, the original article mentions Caster (which is built on Dragonfly), but doesn't mention that KaldiAG works with it, and that work is underway to expand Caster's platform support.
I think to state "unmatched accuracy" in good faith we should actually come up with a common benchmark and measure against it. I believe there aren't really any clean benchmarks for command accuracy floating around (it would be ideal if we used a strict grammar to properly measure the command decoders), and a wav2letter model holds the state of the art for librispeech WER% as of 2019.
I found your measurement here [1] which is against an unknown wav2letter acoustic+language model pair, as the web demo is at any given point in time running an arbitrary model based on having users test in-progress models, and it has never been running the model I am currently shipping with Talon.
(As a small example, the unfinished wav2letter experiment I am training right now has a 3.17% WER on speech commands, and 6.86% WER on librispeech clean, both numbers without using a language model)
I am all for devising a good, fair apples to apples comparison. If you have any suggestions, let me know. In lieu of that, I use what I have available. While accuracy numbers from papers are informative and interesting, I don't think they directly apply to our usage particularly well. I would prefer to use numbers from actual usage.
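If it helps get a shared benchmark started, here is a minimal sketch of the kind of word-level WER scoring we could both run over the same recordings (plain edit distance over reference/hypothesis word lists in Python; this is illustrative only, not the tooling either engine actually ships, and a strict command benchmark could instead score whole commands as simply right or wrong):

    def word_error_rate(reference: str, hypothesis: str) -> float:
        """WER = (substitutions + insertions + deletions) / reference word count."""
        ref, hyp = reference.split(), hypothesis.split()
        # Standard Levenshtein distance over words, via dynamic programming.
        dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
        for i in range(len(ref) + 1):
            dist[i][0] = i
        for j in range(len(hyp) + 1):
            dist[0][j] = j
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                cost = 0 if ref[i - 1] == hyp[j - 1] else 1
                dist[i][j] = min(dist[i - 1][j] + 1,          # deletion
                                 dist[i][j - 1] + 1,          # insertion
                                 dist[i - 1][j - 1] + cost)   # substitution / match
        return dist[len(ref)][len(hyp)] / max(len(ref), 1)

    # e.g. scoring one command utterance against the recognizer output:
    print(word_error_rate("comment class", "comment glass"))  # 0.5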
I realise this is a little off-topic, but FYI the bolding of so many words & phrases in the README for kaldi-active-grammar makes it really hard to read for me.
That video is fairly old. I’ve polished my workflow a fair bit since then. My advent of code videos described here [1] are probably more representative, and the later videos in the playlist even show the keys that are pressed with each command.
I have a non-technical friend who says Dragon is too unreliable to be usable. My intuition was that the problem is probably somewhere else and that if Dragon doesn't work nothing will. I'm under the assumption that competing voice recognition software competes on price. Is that assumption wrong?
(For example, I suspect the microphone their school supplies them with may be no good)
For what it's worth, my voice is quite abnormal, so most untrained speech recognition is terrible for me, and even performing the normal "training" for Dragon still resulted in very poor accuracy. However, apparently their training is quite limited, because once I developed my Kaldi Active Grammar [1], and did my own direct training, the results were fantastic in comparison, with orders of magnitude better accuracy.
Also, Dragon has many commands that cannot be disabled, and if your accuracy is already low, more commands available means more possible things to get wrong. Something like Dragonfly with KaldiAG could allow you to reduce the command set, and improve the practical accuracy.
So, more personalized training/software may work better.
If they are using Windows, packaging is quite easy: see the winpython distribution of KaldiAG. The more difficult part is writing the commands: it is not hard programming, but it is technical. But if they can describe what they would want, someone else may be able to write it easily. The personalized training is still pretty new and raw, and it needs a lot of setup to do the training itself. Without knowing more about what your friend found problematic, it is hard to say what could help the most.
On the homepage for Talon it lists macOS under dependencies. I've actually come across the homepage before and didn't look into it further because I thought it was Mac-only.
Windows/Linux are in earlier beta, see the patreon posts for more info on status. There are about 20 full time users for each OS beta. I really should update the site, thanks for reminding me it's confusing.
I had several onsets of RSI a few years back, and had to turn to voice coding as a last resort, after stretches, pauses and ergonomic everything did not do the job. It was pretty awful.
But then, after having seen doctors and neurologists, and finally a physical therapist, I came across my salvation:
- Exercising my hands.
I very rarely see this mentioned for some reason. I exercised regularly, but only the bigger muscle groups, rarely grip strength and wrist strength. It felt counter-intuitive to exercise hands that already ached painfully whenever I typed, but after using grip weights and other methods to work out my hands and wrists, the pain went away quickly! If you are not diagnosed with carpal tunnel, and not already doing this, definitely try it, it saved my career.
Could you please elaborate on the exercises you're finding helpful? I have a mild RSI myself (outer part of the forearm, near the elbow) and have been trying some eccentric exercises for a few months now, but I'm not seeing a big improvement.
Look up nerve flossing exercises on YouTube. Routinely doing these had the largest impact for me. You'll feel the ones that work on whatever nerve is inflamed.
Try to improve your posture. My general mantra is "lift your head as much as possible. Pull your shoulders back". It gets easier eventually. If your head lies forward from the spine you have a hunch. You may notice a small lump of muscle behind your neck. That's bad. There are exercises to try and strengthen the opposite muscles.
If you sleep on your arms try to stop as well. I recommend sleeping on your back. Fluffy couches and back rests sacrifice posture. Don't use a laptop in bed.
Sleep, eat nutrient-dense foods, and run or swim. Avoid alcohol. If you do pushups or bench press, make sure to exercise your upper back equally to avoid imbalance.
Counterpoint: After a promising two weeks, it did nothing for me, back to square one. OTOH a colleague of mine recommended this after having personal success.
It is frustrating that there seems to be only trial and error in all of this.
Just want to point out a very accurate, yet completely inelegant solution to voice text input: Saying each letter and symbol individually.
If you haven't tried it, you'll think I'm crazy, but it's amazing how fast computers can recognize individual letter names. You can just blurt them all out. I discovered this while entering a domain name by voice - just spell it out and poof, no problem, no corrections.
Not sure I'd want to spend all day doing it that way, but rather than fighting with voice recognition for misunderstood homonyms, just fall back on individual keys.
You’re exactly right, and this is something I put a lot of work into! I designed a new phonetic alphabet, the “Talon alphabet” (which isn’t proprietary to Talon; people have used it with Caster as well).
The trick is one-syllable words to represent each letter of the alphabet, chosen for high phoneme diversity (sub-syllable unique sounds). Care was taken to ensure they can be differentiated even when spoken “too fast” in a stream without gaps, and I tested them extensively to pick words that don’t clog up your mouth too much. The result is I can type 58 wpm in short bursts using the alphabet alone, which is enough to play ztype, write entire words that would be hard to dictate, and control vim directly without an additional command set.
The base alphabet is:
air bat cap drum each fine gust harp sit jury crunch look made near odd pit quench red sun trap urge vest whale plex yank zip
Go ahead, see how fast you can read the alphabet in order. Then try reading it even faster, blurring the word boundaries together a bit. You can still tell the words apart!
Some people change a couple of words if they don’t work well in their accent, but the base idea holds strong.
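To make that concrete, here is a rough sketch of the alphabet as a plain mapping with a helper that spells out a word (hypothetical Python, not Talon's actual configuration format):

    # Spoken word -> the letter it produces, same words as the base alphabet above.
    ALPHABET = {
        "air": "a", "bat": "b", "cap": "c", "drum": "d", "each": "e",
        "fine": "f", "gust": "g", "harp": "h", "sit": "i", "jury": "j",
        "crunch": "k", "look": "l", "made": "m", "near": "n", "odd": "o",
        "pit": "p", "quench": "q", "red": "r", "sun": "s", "trap": "t",
        "urge": "u", "vest": "v", "whale": "w", "plex": "x", "yank": "y",
        "zip": "z",
    }

    def spell(utterance: str) -> str:
        """Turn a stream of alphabet words into the letters they spell."""
        return "".join(ALPHABET[word] for word in utterance.split())

    print(spell("harp each look look odd"))  # -> "hello"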
I think it’s more like 1.5 syllables in practice? I had a hard time finding a short word in the J space that was phonetically diverse but also didn’t clog the mouth. It’s worth noting that when doing the thing I mention where you blend the words together to go faster, “sit jury crunch” can be spoken more like “sitch ury crunch” or even “sitch re crunch” and still be recognized fine.
And see sibling thread suggesting “jree” like “tree”, which is similar to how it sounds when you’re going quickly.
Edit: For people suggesting new words, please read my comments downthread. Also the biggest complaint I’ve had about jury in practice is it can sound similar to “three”. Not a ton of words are spelled with J, and when doing vim input it’s faster to use numbers or a repeat command than say the letter repeatedly. So as it’s a lower priority word to trim half a syllable on, it’s worth getting it right if you’re going to change it.
The alphabet isn’t strict law, but it’s best to replace individual words after you’ve had specific problems with them and not before, as it’s hard to match the phonetic feel.
See sibling for testing methodology. It seems good at face value to me, however I did stumble a bit when doing the full alphabet out loud test with it (joy air, joy bat, joy crunch, etc...). I think there’s a harder-to-quantify component that causes the stumbles/tongue twisting, which is where your mouth/tongue position ends up when you finish speaking the letter.
Offhand that makes me suspicious of the jug gust chain. I’m not sure how often that would come up though. Maybe just in vim. One shortcut for finding words that end in phonemes I’ve already tested, is to find words with the exact same ending sound as another word in the alphabet, but with a sufficiently distinct starting phoneme.
Something to consider: replacing 'jury' with 'jug' (or some other one-syllabic phoneme) might increase overall efficiency, even if one or two combinations need a slight pause.
Thanks for the suggestion! There are several rules and considerations in play.
For testing juke, I’d take a look at any words in the alphabet that could conflict with it on either side:
> air bat cap drum each fine gust harp sit jury crunch look made near odd pit quench red sun trap urge vest whale plex yank zip
I see: cap, each, gust, crunch, quench
I’d say “juke cap juke each juke gust ....” out loud to test it. I’d also test it out loud against every word in the alphabet if it passes the obvious collisions.
Ultimately juke doesn’t pass for me for two reasons:
1. ook is something of a glottal sound, it can be awkward to reset your mouth afterwards when flowing into another letter
2. When chaining it with some of the other letters in the alphabet it can sound like you’re saying “Jew”, and it may be confusing to some people if you are saying that repeatedly. “Juice” had a similar effect.
Because the NATO alphabet is several times slower/more verbose and better suited to lossy radio links. We have a high quality mic hooked directly into our computer and the tradeoffs are completely different. Speed and vocal effort are more important than universality.
With good autocompletion this would work in many instances.
I’d like to save my voice from spelling the entire word. Often I don’t need to type more than a letter or two before I can select the desired variable, type name, etc.
I had my second surgery just a few months ago. I can type again, but each time I had to use my left hand for months. Initially, it feels like your brain doesn't function properly anymore (not mentioning the psychological effort you have to make in order to be focused on work when you feel your hands are falling apart). Keyboard speed is directly related to how fast you can move your hands to support your thought flow. I tried Kinesis and even a split vertical keyboard (KeyboardIo) but none avoided the pain and numbness that came with typing. The other problem with thumb-cluster keyboards is that your IDE productivity goes to zero. I was faster with just my left hand on a regular keyboard than with both. I think this would be fixable with a good amount of time remapping shortcuts, etc. Now that my hand works again, I think I should start spending time getting used to my KeyboardIO and at least try to buy some time.
The "voice coding" space is maybe not a mess, but far from great or even acceptable. However, there seem to be more recent efforts to make better tools. I would definitely check https://serenade.ai/ out.
The main problem, I think, is that "voice coding" is too focused on editor typing, which it can't get right because, combined with code syntax, it becomes too complex. Instead, the tools should focus on higher level actions (which, btw, Serenade does) along with a different approach to typing. I think Vim is a good example of where editing should be. IntelliJ refactoring is where voice coding should start. With all the AI buzz, it's unbelievable how bad voice recognition is. I'm not talking about "Siri set an alarm", but about separating context from tone, not having to say things 2-3 times, having good response latency, etc.
Lastly, I wish there was simple voice assistance for code navigation - like go to definition, find usages, etc. This is much simpler to "parse" than code structure. Unfortunately, this is not even tackled by any tool as far as I've seen.
I’m one of the creators of Serenade—thanks for mentioning us! We totally agree about the need for higher-level layers of abstraction, and we’re working on some of the code navigation functionality you mentioned right now. If you have any other ideas or feedback, we’d love to chat more, I’m matt@serenade.ai.
Among all the demos above, Serenade seems like the only product that does the right thing. Why the hell should I have to say "colon" or "quote" aloud when the tool can insert them automatically?
“I type fast enough that I have never had much use for snippets in my text editor. Perhaps if I’d used them more effectively, I would not have developed the RSI symptoms that I have!"
Often we have discussions about advanced code completion on HN. Many developers feel they don’t need it, or that it gets in their way, for example.
Reading stories like this convinces me even more that our editors (tools) need to be smarter. There is so much repetition in coding, it's hard to believe we can't do better.
Because this says “single token completion” next to the TabNine example gif, I’m guessing it isn’t comparing to Deep TabNine at all, which is on an entirely different level.
The memory usage comparison is accurate for TabNine, at least in my case. Local-only TabNine wasn't practical, and using a cloud server for autocomplete seems unnecessary.
I like the idea of TabNine, unfortunately it doesn't seem to be super well supported :(
There is no news on whether it is being actively developed, and the current implementation is unusable in a corporate environment because it can't dial home through a proxy, so it refuses to activate the license.
It's a shame too because it was basically a "shut up and take my money" reaction from me. I'd pay for this product. I'd pay good money for this product.
Yes, but it still needs to talk to its servers once to unlock the restricted version. It's all nice and dandy that after the one-time activation it can function fully locally, but I never get to that point because the pointless "feature" of activation can't get through the corporate proxy.
So to recap:
1. There is currently no paid license
2. The free version nevertheless needs to be activated online to unlock the full power (otherwise there are severe limitations)
3. It can't handle proxies, so you can't activate it at all in corporate environments
I am relatively sure that you could tell that my keyboard is primarily used for coding C# in Visual Studio just by looking at which keys are the most worn and which carry a significant layer of dust on them.
The keys that complete an Intellisense selection (space, period, semi-colon, Enter) are nearly ground down to nubs, and the open-paren and open-bracket keys are worn smooth, while the corresponding close keys indicate near complete disuse. Similarly for F5 (Start Debugger) and F10 (Step Over) compared to the rest of the F-keys.
We should use better languages. Better understood and integrated with the editor and better at expressing every level of abstraction. *cough* pretty much any lisp *cough*
If better tools also make "worse" programmers (i.e., people who are saner, happier, and uninitiated in the brain-deadening rigors of struggling with awful or no tools), then tools matter even more the better they get.
I actively try to switch things up, depending on if I feel something is starting to get uncomfortable. If my mouse hand starts to give signals, I'll switch to using keyboard more. Regardless I'll switch seating position and hand orientation throughout the day.
I also worked with someone in a remote capacity who insisted on using Dragon Naturally Speaking and his communication on technical topics was unintelligible. It clearly wasn't working, but he insisted and everyone who worked with him suffered for it.
And now he's not working in tech anymore and is a field engineer doing repetitive labor with his hands...
I've been voice coding for about 5 years now. For those of you not on Windows, I use Talon on Mac (the Linux version is in beta). It works quite well and I'm at least as productive writing code by voice as I ever was by hand. I was someone who would spend the time to get my emacs and then later vim configs highly optimized, but there is something liberating about not constraining yourself to key bindings. I used to type gcC to comment a python class in vim; now I say "comment class". For commands you type frequently enough to get into muscle memory this isn't a huge gain, but for all the things you don't use regularly, it's so much easier for me to remember normal words than keyboard shortcuts.
At this point all of the projects mentioned in this thread (caster/talon/serenade) have some option for supporting the three main (win/lin/mac) platforms.
You can follow the Talon website in my HN profile to the Donate link (Patreon) and sign up for the beta tier then join the Slack to get access. Or go straight to the Slack and make your case for not being able to support the beta on Patreon at this time.
FYI, I thought my programming career was over due to RSI.
Now, I only type while wearing long sleeves.
And of course, I still have to take regular breaks.
I no longer suffer RSI symptoms. I'm guessing because it increases blood flow to the area and perhaps the warmth helps keep ligaments and muscles flexible and loose.
I love these simple solutions that may take years to discover but are shamefully obvious after said discovery.
For the longest time, I'm talking near a decade, after working for a few hours, I'd get this unbearable neck and shoulder pain. Even after the workday it would last long until the night. I am not ashamed to say that I would be almost in tears some days. Not because of the level of pain, but because it was constant and I was frustrated.
Finally, almost as if by accident, I simply pushed my chair back a bit and extended my arms while typing. The pain went away almost immediately. I had been working with my shoulders hunched up the whole time, causing all manner of muscle tension and fatigue.
All I had to do was extend my arms and let my shoulders drop.
Thank you for sharing! I recently started working from home and had the stiffest neck after a few days. I realized I needed to raise my screen to eye-level again (as I had at the office).
Here's my submission to the "thing that solved my RSI but which is purely subjective and may or may not work for you" category:
I used to try to type with my arms/elbows floating -- not resting on anything. This seems to be fairly standard advice. For example, you can see it here: http://ergonomictrends.com/proper-ergonomic-typing-posture-a... I've had various doctors recommend the same thing.
But when I try that, the muscles in my shoulders/back get incredibly inflamed, and tension/pain tends to spread throughout my body, eventually to my hands/wrists.
Eventually I gave up on that, and now my elbows are splayed out and rest on the chair's arm pads. My hands rest on wrist pads, and my keyboard is split about 18" (using a Kinesis Freestyle 2). Basically, my hands/arms are nearly always at rest, and no energy is required to keep them suspended in the air. I also try to avoid "reaching" for the upper rows with my fingers, instead I move my entire arm forward to some extent to hit those keys (which is more a 'push' than a 'lift', because my arm pads have a bit of give in them).
Same, but for me the cure was John Sarno. Unfortunately RSI seems to be one of those disorders that everyone cures differently (if at all!) and you just have to stumble around until you find a method that works for you.
Yeah he probably is. Sarno wrote a whole range of books, primarily about back pain (https://www.saxo.com/dk/healing-back-pain_john-e-sarno_paper...), but also dealing with RSI. If you read up on it, a great deal of people have said it has helped them with their RSI.
Anecdotally it's done nothing for my RSI, but it has helped me deal with a 7+ year "chronic" upper back pain, though it did take a bit of time to swallow the concept at first.
I thought that too. Building muscles to protect and support ligaments to me sounds like a decent solution. I feel like people who are underweight are the ones that are most prone to suffer from RSI.
Well, be careful. I love trackballs, and have used them exclusively for 25+ years. But... “trackball thumb” is a thing. I had a physical therapist show me how to work out issues when I get them, but ultimately the healthiest thing for me to do is to switch between 3 different trackballs to change up the motions.
I use one of those at work, and a mouse at home. I like to think mixing up the motions helps, but honestly it's because I'm not nearly precise enough to game with a ball.
I've been using a SlimBlade for about 10 years now and really like it. Spinning the ball for the mousewheel was brilliant, and the giant, centered ball gives you a lot of freedom in how to position your hand on it. They also have nice quality builds; I've only just had to buy a replacement recently, after many years.
Same, I had debilitating RSI for years, to the point where I was using voice recognition and a foot mouse, but all symptoms went away once I started wearing hoodies to work. As soon as I use a computer with bare arms though it all comes right back.
Only a few minutes with bare arms and I get this snapping sensation in my forearms as I type. And if I push through, pain.
Doctor's wanted to do surgery. But I didn't have the problem prior to an injury that required me to use crutches, so I was hesitant to resort to surgery.
Keeping my arms covered has been a lower risk solution that's been working fine now for 6 years.
I remember reading some guy's article where he said wearing fingerless gloves helped him with RSI because he was actually sitting right below the AC fan the whole time.
I use a 250 watt heat lamp suspended about 3 feet above my keyboard. It warms my hands, the keyboard, and the desk surface. I've been doing this for years; way more comfy.
That's very interesting, your report and those in the comments mentioning warmth. Some years ago I experienced RSI problems and rectified them with neoprene "wrist braces"; while they didn't provide a huge amount of support, just some mild compression, they were certainly warm.
I had RSI for years as well, ended up changing careers away from tech for many years because of it. For an unrelated issue I started taking probiotics and gradually noticed the RSI had improved. It took a while to realize it was the probiotics doing it because I had generally been avoiding doing things that triggered the RSI. After I noticed I could type regularly again though, I've had a couple of instances where the RSI flared up again when I had stopped taking the probiotics for a week or two - that's when I finally realized that was likely the change that made the difference.
Looking back, I had had the onset of significant chronic digestive issues a month or two before the RSI started, and I now believe that was not a coincidence.
For me it’s a Kinesis Advantage and 3M wrist braces. This way I can type and mouse all day; otherwise I get pain within a couple of hours. For the laptop keyboard, braces help, but in general I avoid truly prolonged typing on a laptop.
The key is not to let pain develop: stop immediately and work out a solution, otherwise you can get past the point of no return.
I wonder if the sleeves help your wrists slide more rather than your skin "sticking" in place against the desk/laptop and causing you to bend your wrists more to reach keys.
Rapid Reboot makes compression (peristaltic?) massaging boots and the like for athletes. Maybe something similar but heated, massaging sleeves, can be developed to help with keyboard RSI?
Greetings Everyone, I help maintain the Caster project. The key difference from other solutions out there is that we seek to support a completely open source voice coding stack. Open source is the only way to go long term if you're going to be using a tool for most of your life. Fortunately, for some it acts as a bridge until their RSI symptoms become manageable or go into remission.
We are working towards cross-platform support for Linux and Mac, as well as adding support for Kaldi. Dragonfly is already cross-platform, so just a few Windows-specific functions remain to be ported in Caster.
I said this in another comment, but it can't be emphasized enough: I created Kaldi Active Grammar because I didn't trust relying on closed source software for something so crucial to my productivity, where a decision by an outside party determines whether I can function. As a bonus, open source means I can make it work better to fit my needs than closed source ever could.
For what it's worth, my voice is quite abnormal, so most untrained speech recognition is terrible for me, and even performing the normal "training" for Dragon still resulted in very poor accuracy. However, apparently their training is quite limited, because once I developed Kaldi Active Grammar, and did my own direct training, the results were fantastic in comparison, with orders of magnitude better accuracy.
I understand and won't argue on your preference for the core Talon app. However, as all of my wav2letter code, models, tools, training methodology, and general advice (e.g. I am very active on the github/facebookresearch/wav2letter issue tracker helping others) are open source, and wav2letter as used in Talon is built from the public repository and dynamically linked, I don't think the speech engine is the place to speak against Talon's source policy.
Sorry, I was only using the speech engine accuracy as an example. But the freedom of open source stands for any part of software: Dragon's spectacular failures for me are only in part because of its engine. Also, is the command portion of Talon's wav2letter backend open source? Nonetheless, thank you for releasing some of your work. It is all helpful.
Yes, the decoder in the open-source talonvoice/wav2letter/decoder will decode commands alongside speech if you hand it an NFA blob describing the command graph. It's up to you to generate that NFA, but it's probably identical to the graph you're creating with FSTs, and the C structures are described in the source/header.
Some of the voice coding videos I’ve found are a little contrived, or seem practiced (Even some of mine were just copying code from one window to another or demoing one small feature). In this playlist, I do a lot of code from scratch, by competing in Advent of Code 2018 for the first week:
As I start each day exactly when the problem is revealed, without any prior knowledge of the task, everything about these videos including debugging broken code is completely real.
I know it’s funny, but I hate the prevalence of this video so much, because people reference it as a reason voice programming doesn’t work, and I think it discourages people who may have otherwise had success with voice programming. You should consider linking to the middle of Emily Shea’s PerlCon video where she first plays this video then shows that it’s actually possible to dictate the same code effectively.
(note the fiddling around video playing isn’t a fundamental issue with voice input, google slides with video embeds turned out to be pretty unreliable with keyboard input and I believe that was ironed out in her later talks)
It's still surprising to see Dragon Speech Recognition as the recommended (and only) choice here.
Is anyone working on decent speech recognition for Mac/Linux or know good resources for that? The ideal output is a stream of what could have been said, as well as some alternatives, each with a confidence.
Every alternative I've tried has not been as effective as the version of Dragon I used from 2011. I think the focus on accents and training is a big thing here -- I'm happy to spend a couple hours training it for better results.
I'm working on a voice coding program for Linux so I've been forced to try all the different options and ultimately I decided to go with Google Cloud speech to text since the other ones were either too difficult to set up or just didn't work that well. I'm actually really impressed with it though even with my crappy gamer headset, and I'm even using it to type out this response right now.
The Talon beta ships with wav2letter and a really good many-accent English model that can handle both arbitrary commands and free form English. All of my trained models and some information is posted here: https://talonvoice.com/research/
I don’t consider it proprietary. As per the agreement I specifically ask for an open license so I will be able to release it in the future.
Right now it’s about 5 hours total, which isn’t a ton for actually training on, which is why I haven’t prioritized releasing it and haven’t even trained on it myself yet. I’ve been mostly using it for evaluation so far.
If someone approaches me and says “I have a compelling need for a bit of training data in the form of your prompts” I’ll probably prioritize a release higher.
As another perspective, a majority of the people at this point submitting their voice are already using Talon and just want the engine to be more robust.
If you are willing to do some training, you can get tremendously improved results, in my experience. For what it's worth, my voice is quite abnormal, so most untrained speech recognition is terrible for me, and even performing the normal "training" for Dragon still resulted in very poor accuracy. However, apparently their training is quite limited, because once I developed Kaldi Active Grammar [1], and did my own direct training, the results were fantastic in comparison, with orders of magnitude better accuracy. The personalized training is still pretty new and raw, and it needs a lot of setup to do the training itself currently.
I have been working on getting Mozilla's DeepSpeech and some additional JS libraries up to a level where it can be used (among other things) as a voice keyboard.
It can type numbers and symbols reasonably well, I need to do some additional work like build a custom language model to be able to type letters and plug some other gaps in Mozilla's CommonVoice model.
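For anyone curious what the core of that looks like, here is a rough sketch of a single transcription call using the deepspeech Python bindings (my own stack uses the JS libraries, but the API is much the same; the model and scorer paths are placeholders, and a real voice keyboard would stream microphone audio rather than read a file):

    import wave
    import numpy as np
    import deepspeech

    # Placeholder paths; use the matching .pbmm/.scorer files from a DeepSpeech release.
    model = deepspeech.Model("deepspeech-english.pbmm")
    model.enableExternalScorer("deepspeech-english.scorer")  # language model rescoring

    # DeepSpeech expects 16 kHz, 16-bit mono PCM.
    with wave.open("utterance.wav", "rb") as wav:
        audio = np.frombuffer(wav.readframes(wav.getnframes()), dtype=np.int16)

    print(model.stt(audio))  # plain-text transcript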
Through my research on alternative keyboards (http://tbf-rnd.life) I've come into contact with (and managed to piss off) the author of Talon. Seems like a competent solution, even though I haven't been able to dig that deep into it on account of not being a Mac user (until now)...
There was some back and forth regarding its suitability for coding over there. Much better support than I expected, apparently...
Still, as a general solution I do believe it has drawbacks: noisy environments, etc.
Hi, I believe you are referring to this [1]. I wasn’t feeling pissed. I was critical of your dismissive tone, as I felt your post was possibly harmful to people who may be looking for solutions. You had said something along the lines of “voice coding can’t work well because Siri doesn’t recognize code” which is a surprising conclusion from a flawed premise.
As a generalization, you seem to be coming up with reasons you _think_ voice coding won’t work well, while ignoring the fact it already does. For example, noisy environments have several very good solutions, such as using microphones designed for them, leaving the environment, or software to reduce noise like Krisp.
The biggest realistic drawback from my perspective is the fact it’s not very quick at mousing, which is why I’ve done a bunch of research on fused eye/head mousing as well.
I would like to apologise for that. You really made me think, and now that I've become a Mac user I have the opportunity to try your solution out. The blog post was intended in a rather light tone. My intention was to add the more in-depth analysis in the book.
Now that I am a Mac user I finally have the chance to take a look at your solution. I do think that I have a lot to learn, and I do not think that the areas of research are in conflict at all.
Funny timing on switching to Mac, as Talon now has a triple platform beta. If you’re looking to get an accurate (pun intended) read on things, make sure to try the beta wav2letter engine or try Talon with Dragon. Apple’s MacOS speech engine was never as accurate or comfortable to use, and as of Catalina it might be more accurate but it’s broken for Talon’s use case anyway :(
Also, regardless of whether you are trying to tie in a custom chorded keyboard, plug in voice recognition, or control your computer with a single muscle, there would be a certain overlap in integration with computer software. That'd be the rationale for creating a repository / loose collection of interoperable solutions, so that when you have e.g. a great voice recognition platform you could have it working with N platforms straight away, instead of every new interface attempt having to write all of this boilerplate from scratch. Let's say interfaces for Eclipse, Ubuntu, Windows, Chrome, ...
Along with benchmarks for testing the performance and in a more or less sciency fashion compare them to each other. An overlap would also exist in other areas such as word prediction and probably many more.
Talon isn’t fundamentally just a voice control framework. It’s a general platform for bolting on accessibility tooling, because as you’ve said a lot of the requirements are related.
Ah, really glad to hear that! Will need to get a Linux laptop as well. Am using a MacBook for work now but I'd like to have a Linux box as well. Will try it out on several platforms, and will try to give you feedback on what I think!
> wav2letter is also under use by some, although requires more effort to setup
This isn’t quite accurate. To my knowledge the options are: you are either using it in Talon and it just works, or you want to use it outside Talon and you will need to write entirely new glue code to add support for it in your project of choice.
Off topic, but I really wish I could find an Alexa-like "smart speaker" capable of voice programming.
For example:
1. I would like to command the speaker to listen for a keyword, like the Fizz Buzz Test[1], e.g. when I've counted to a certain number.
2. Ask the speaker to remind me of something when hearing certain topics during a conversation. Much like the "if" keyword in text based computer programming languages.
3. Program a poem into the speaker over the microphone, tutor my kids to memorize it, and correct the wrong parts. Share the snippet with other parents. Program simple home-made riddles and tests over voice.
4. The ability to store certain list/map structure as global variables. e.g. asking the speaker, who is the second oldest son in this family? Who got up first this morning?
5. Voice memos and search engine. Stored and indexed securely offline on my home NAS.
I think all the big smart speaker makers would tell you to just write a serverless function and hook it up to your speaker. It's unlikely they would ever create such functionality themselves.
Most likely, if you want to make it work, you'd either have to build your own smart speaker or write a serverless function that uses one of the other voice programming tools mentioned in this thread as its backend.
> build your own smart speaker or make a serverless function
If I build my own smart speaker it would surely run on my home server, not so much serverless. Yeah, but I get it: the voice commands should be counted as new "keywords" or "functions". Let there be a general "voice programming" language.
I would probably start with the seeed studio ReSpeaker array hardware wise. Otherwise, can you write up a more specific use case list with specific commands and responses? This sounds fun and I can probably help you make this happen.
> write up a more specific use case list with specific commands and responses
"no swear words" rule for kids. The speaker need to maintain a global counter for kid_id/word counter for every time a swear word is heard. Will reset every week for rewards/penalty, etc.
The parents (as admin) could add/remove the swear words list, and can configure how long the counter need to be rotated.
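Roughly the kind of bookkeeping I have in mind, just to make it concrete (hypothetical names; the speech recognition is the hard part, this is only the counter):

    from collections import Counter
    from datetime import date

    SWEAR_WORDS = {"darn", "heck"}      # parents (as admins) edit this set
    counts: dict[str, Counter] = {}     # kid_id -> counts of swear words heard
    week_started = date.today()

    def heard(kid_id: str, word: str) -> None:
        """Called for every recognized word; bumps the kid's counter on a match."""
        if word.lower() in SWEAR_WORDS:
            counts.setdefault(kid_id, Counter())[word.lower()] += 1

    def weekly_rotation() -> dict[str, Counter]:
        """Report and reset, e.g. every Sunday, for rewards/penalties."""
        global counts, week_started
        report, counts, week_started = counts, {}, date.today()
        return report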
A non-trivial example: I am an ESL speaker, trying to teach my kid English, but I have little patience with the grammar and pronunciation. (My bad.)
I'd like to record a simple poem, e.g. The Moon by Robert Stevenson, and ask my kid to listen to it without needing to stare at any screen. The aim is to recite it correctly: the programmable speaker would correct any missing words with stronger emphasis next time, count badly spoken words, and run several rounds of testing to see if my kid improves and recites the full text correctly.
There are many family games that can be played around a programmable speaker: as the dungeon master, the storyteller, the on-demand background music composer, etc.
Ok, here are some quick musings, I’m open to talking about different approaches for any of this or additional features I didn’t cover here. I’m hearing two stronger use cases here.
1. “Family database” - you tell it facts and it tries to remember and reproduce them later when asked. Like storing your family tree somehow and asking about it. This has a lot of nuance but is likely possible in some way with current technology.
2. Teaching assist. You give it a task, such as an English text, and it helps train and evaluate your kid.
For (1), I think there are three obvious approaches to me:
1. Manually enter the data with a computer, but use natural language queries (some kind of Intent model) to access it
2. Have a strict set of data that you can enter (such as family tree, grocery lists, voice memos), and allow slightly strict natural language-ish statements to add data, and natural language queries to access it.
3. Convince a general understanding model like GPT-2 to learn your facts, and just pipe your questions directly into it. This is the coolest answer but would likely also be wrong more often.
I think (2) is an easier task, and likely would use an entirely different approach for it than (1).
(2) is especially easy if the system already knows the text of the poem. If you’re making up a new poem, or speaking a poem it can’t somehow look up, it will be harder to match two voices against the text of the poem.
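As a rough sketch of the matching step for (2), assuming you already have the poem text and a transcript of the kid's attempt (difflib here is just the simplest stand-in for alignment, and the word lists are made-up examples):

    import difflib

    def missing_words(poem: str, attempt: str) -> list[str]:
        """Align the recited transcript against the known text and list dropped words."""
        poem_words, said_words = poem.lower().split(), attempt.lower().split()
        matcher = difflib.SequenceMatcher(a=poem_words, b=said_words)
        missed = []
        for op, i1, i2, _, _ in matcher.get_opcodes():
            if op in ("delete", "replace"):   # words in the poem that weren't said
                missed.extend(poem_words[i1:i2])
        return missed

    poem = "twinkle twinkle little star how I wonder what you are"
    print(missing_words(poem, "twinkle twinkle little star how I wonder"))
    # -> ['what', 'you', 'are']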
One caveat to all of this is I’ve heard speech recognition doesn’t work as well on children because they aren’t well represented in the training data.
You gotta be careful not to get RSI in your vocal cords. I almost did when I tried voice coding years ago. For me at least, the strain tends to shift to wherever the “work” is being done. The same thing happened when I experimented with eye tracking.
I've had some close calls with RSI and the most helpful things for me were:
1) Getting an ergo keyboard, in my case the Microsoft Sculpt
2) Remapping my keys to better match my workflow. Left and right parens are mapped to left and right Shift, and similarly braces to Ctrl and brackets to Alt. Mapping Caps Lock to delete one word back was also a big one. Further, I have the number pad on the left, both to make using the mouse require less movement and because I've remapped all the numpad keys to useful programming commands.
A minor tip for people fiddling with keyboards/layout. Be careful of regular pinky use for modifiers. Pinkies are weak and overuse can cause ulnar issues.
Corollary, if you have ulnar (pinky side) issues in your forearm, reduce pinky use.
Just a quick shout out to Microsoft (no affiliation): their Sculpt keyboard has taken all the discomfort out of 15-hour coding sessions. If you're a professional developer, get one.
The Sculpt and Kensington expert trackball (big trackball, not thumb trackball), while not fixing my issues completely, did improve things for me quite a bit.
That said, barring a genetic lottery win, as the sibling said: enough 15-hour sessions are going to give you RSI no matter what you’re typing on. There’s no way to do that every day and still exercise, take breaks, and sleep enough. Balance is important.
I've been using the Microsoft Natural Ergo for 20+ years. It's the very first thing I ask for when I get a new job, or if they don't supply keyboards, the first thing I bring from home.
I've never had any RSI type symptoms or even fatigue after long typing sessions.
The sculpt seems to just be a fancier wireless version of the same thing (although I haven't tried it so I could be wrong).
For those on Linux, I've been working on a Talon inspired voice coding program called Osprey that uses the Google Cloud speech to text API: https://github.com/osprey-voice/osprey.
It's still very much a work in progress but it's already been working very well for me and I'm actually using it to type out this response right now.
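For context, the recognition call itself is only a few lines with the google-cloud-speech client. A rough sketch of the batch form (a live dictation tool like this would normally use the streaming API instead, and pcm_bytes here is a placeholder for captured 16 kHz microphone audio):

    from google.cloud import speech

    def transcribe(pcm_bytes: bytes) -> str:
        """Send 16 kHz LINEAR16 mono audio to Google Cloud Speech-to-Text."""
        client = speech.SpeechClient()  # auth via GOOGLE_APPLICATION_CREDENTIALS
        config = speech.RecognitionConfig(
            encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
            sample_rate_hertz=16000,
            language_code="en-US",
        )
        audio = speech.RecognitionAudio(content=pcm_bytes)
        response = client.recognize(config=config, audio=audio)
        return " ".join(r.alternatives[0].transcript for r in response.results)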
Why would you work with Google when there are much more accurate open source speech recognizers based on Kaldi? With that specific use case it is very easy to beat Google on accuracy.
You can solve most carpal tunnel issues by massaging your forearms with a ball (or your knuckles).
I know it sounds crazy, but I solved a very intense bout of plantar fasciitis by massaging my calves. Took some time, but eventually the pain went away.
And when I had carpal tunnel, I did the same (although not leaning heavily on my wrists helped a lot too).
You'll know if you're hitting the right spot because it'll hurt. A lot.
* I am not a doctor, I have simply had a lot of experience with treating my own injuries and speaking to others in similar situations
This is partially good, but dangerously incomplete advice. The spectrum of RSI is far bigger than carpal tunnel; it includes cubital tunnel, tennis elbow, tendinitis, and general neuropathy.
It can be caused by many things, including various muscle groups in your shoulders and neck, posture, past injuries, nerve adhesion, flexibility, and lack of strength.
Massage is a wonderful way to address nerve adhesion, flexibility, general tightness, but if you have a different issue it’s not really a panacea.
My philosophy is that you should do these things in strict order if you have symptoms:
1. reduce use (take breaks, get hobbies that take you away from the screen at home)
2. vary conditions of use (standing desk, ergonomics, using more than one kind of keyboard/mouse)
3. work out (yoga, swimming are highly recommended)
4. physical treatment (see a doctor, physical therapist, deep masseuse, etc)
5. alternative input (voice, eye tracking, etc)
If you do these in a different order or skip any of them you’re probably doing yourself a disservice.
Also. Don’t ice. Don’t take anti-inflammatories or other pain meds while you continue to work. Blood flow is important to recovery; heat can soothe your pain similarly to ice, but without causing further damage. And pain meds during working hours will make you ignore the damage that is causing your pain, allowing you to cause more damage. Bad.
I’ve also seen some negative studies on steroids and surgery, suggesting if you get those without changing your habits, your pain will recur within a year in 90% of cases. Do your own research, but you might as well change your habits first and only use surgery/drugs as an extreme last resort. (Unless you have the specific form of carpal tunnel that can be permanently fixed in a day with a minor surgery in exchange for lower grip strength?)
There’s recently Common Voice from Mozilla, which is a huge free English dataset (1500 hours and growing), and wer_are_we [1] has shown really impressive accuracy increases in published research the past few years. Exciting times.
I suspect this setup would not be ideal: the added background noise would be annoying for all concerned and make an open plan office even more challenging than it already is, and your colleagues might not appreciate it unless you're already surrounded by people talking all day (e.g. support or sales teams on calls).
That aside, in terms of worrying about your mic picking up other people's voices and the voice dictation getting confused, most dedicated microphones these days (i.e. not ones that are built into your phone's headphones), are pretty good at background noise reduction.
I've not used the one OP recommends - I'd never have considered a table based mic like that before - but the noise reduction on the Plantronics Blackwire 3215 headset I use is so good that if I move the mic boom a few inches up or down away from my mouth, people can't really hear me on calls. It's superb at getting rid of background noises, and if somebody else was in my home office using voice dictation it would not be picked up by my headset.
I think I personally would, but then I would rather hear some of my colleagues talking rather than using their mechanical keyboards in an open plan office.
I work mostly from home, so I'm interested in this as a programming method to see how or if it changes the way I approach it as work.
I use Talon with Dragon and an Audio Technica PRO8HE.
I mostly whisper to the system, and people in my office say they don't hear me at all. Also, whispering keeps my throat from getting tired.
There are mics that are really good at background rejection, if you’re worried about external voices filtering in.
1. Cheapest is probably a USB dynamic mic of some kind.
2. Next is a Stenomask at around $250
3. A lot of folks swear by the DPA d:fine cardioid, which is $800-1300 including an interface. There’s also a cardioid shirt-worn lavalier I’m interested in trying sometime, which uses the same interface but the mic is $150 cheaper ($650 -> $500).
If you’re worried about other people hearing you, your options include an isolated area, playing noise (white noise or music?), or using a StenoMask, which blocks sound in both directions.
Remember in the US your employer is required under the ADA to provide “reasonable accommodations” for disability, which may include a private working space, pair programming, or letting you work from home more often.
I am also thinking in a more general sense of keyboard usage. Walking, for example, would reduce precision; so would noisy areas such as in a car or in industry, etc...
A small nitpick: when talking about voice programming, there’s a superset, “voice control”. When your hands are hurt you don’t just want to type in your editor and terminal; there are chat programs, application switching, web browsing, and other tasks it would be nice to offload from your hands too.
So the clearest way to represent some of this may be to encode the meaning of “voice input/control capable of programming” in a title. We might need a new name for this kind of input to best represent it.
I went through a similar situation and ended up using https://voicecode.io for a few years. It worked well enough for me, but I always found it hard using it in a public setting.
Ironically, I solved my RSI issue after coming across a book on HN. Everyone seemed to recommend the Mind Body Prescription, and after trying it, it did the trick.
Edit: I've been wanting to try this project for several years now: https://www.ctrl-labs.com. I think something like this is the solution moving forward vs voice coding.