Launch HN: Aqua Voice (YC W24) – Voice-driven text editor
716 points by the_king 7 months ago | 244 comments
Hey HN! We’re Jack and Finn from Aqua Voice (https://withaqua.com/). Aqua is a voice-native document editor that combines reliable dictation and natural language commands, letting you say things like: “make this a list” or “it’s Erin with an E” or “add an inline citation here for page 86 of this book”. Here is a demo: https://youtu.be/qwSAKg1YafM.

Finn, who is big-time dyslexic, has been using dictation software since the sixth grade, when his dad set him up on Dragon Dictation. He used it through school to write papers, and has been keeping his own transcription benchmarks since college. All that time, writing with your voice has remained a cumbersome and brittle experience riddled with pain points.

Dictation software is still terrible. All the solutions basically compete on accuracy (i.e. speech recognition), but none of them deal with the fundamentally brittle nature of the text that they generate. They don't try to format text correctly and require you to learn a bunch of specialized commands, which often are not worth it. They're not even close to a voice replacement for a keyboard.

Even post-LLM, you are limited to a set of specific commands, and the most accurate models don’t support any commands at all. Outside of these rules, the models have no sense of what is an instruction and what is content. You can’t say “and format this like an email” or “make the last bullet point shorter”. Aqua solves this.

This problem is important to Finn and millions of other people who would write with their voice if they could. Initially, we didn't think of it as a startup project. It was just something we wanted for ourselves. We thought maybe we'd write a novel with it - or something. After friends started asking to use the early versions of Aqua, it occurred to us that, if we didn't build it, maybe nobody would.

Aqua Voice is a text editor that you talk to like a person. Depending on the way that you say it and the context in which you're operating, Aqua decides whether to transcribe what you said verbatim, execute a command, or subtly modify what you said into what you meant to write.

For example, if you were to dictate: "Gryphons have classic forms resembling shield volcanoes," Aqua would output your text verbatim. But if you stumble over your words or start a sentence over a few times, Aqua is smart enough to figure that out and to only take the last version of the sentence.

The vision is not only to provide a more natural dictation experience, but to enable for the first time an AI-writing experience that feels natural and collaborative. This requires moving away from using LLMs for one-off chat requests and towards something that is more like streaming where you are in constant contact with the model. Voice is the natural medium for this.

Aqua is actually 6 models working together to transcribe, interpret, and rewrite the document according to your intent. Technically, executing a real-time voice application with a language model at its core requires complex coordination between multiple pieces. We use MoE transcription to outperform what was previously thought possible in terms of real-time accuracy. Then we sync up with a language model to determine what should be on the screen as quickly as possible.
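To give a flavor of the control flow, here is a deliberately tiny sketch (illustrative only; the real pipeline is much more involved, and the voting scheme and prompt below are invented for this example):

    # Toy sketch of the dictation loop: several recognizers vote on a
    # transcript, then a language model decides what it means for the doc.
    from collections import Counter

    def ensemble_transcribe(audio_chunk, recognizers):
        # Take the majority hypothesis (real MoE routing is smarter).
        hypotheses = [r(audio_chunk) for r in recognizers]
        text, _ = Counter(hypotheses).most_common(1)[0]
        return text

    def apply_utterance(document, utterance, llm):
        # The LLM decides: verbatim dictation, a command, or a cleanup
        # of what the speaker meant to write.
        prompt = (
            f"Document:\n{document}\n\nThe user just said: {utterance!r}\n"
            "If it is an instruction, apply it to the document; otherwise "
            "append it (cleaned up) as content. Return the full document."
        )
        return llm(prompt)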

The model isn't perfect, but it is ready for early adopters and we’ve already been getting feedback from grateful users. For example, a historian with carpal tunnel sent us an email he wrote using Aqua and said that he is now able to be five times as productive as he was previously. We've heard from other people with disabilities that prevent them from typing. We've also seen good adoption from people who are dyslexic or simply prefer talking to typing. It’s being used for everything from emails to brainstorming to papers to legal briefings.

While there is much left to do in terms of latency and robustness, the best experiences with Aqua are beginning to feel magical. We would love for you to try it out and give us feedback, which you can do with no account on https://withaqua.com. If you find it useful, it’s $10/month after a 1000-token free trial. (We want to bump the free trial in the future, but we're a small team, and running this thing isn’t cheap.)

We’d love to hear your ideas and comments, ideally written with voice-to-text!




This is cool! Some feedback:

- As others have said, "1000 tokens" doesn't mean anything to non-technical users and barely means anything to me. Just tell me how many words I can dictate!

- That serif-font LaTeX error rate table is also way too boring. People want something flashy: "Up to 7x fewer errors than macOS dictation" is cool, a comparison table is not.

- Similarly, ".05 Word Error Rate" has to go. Spell out what that means and use percentages.

- "Forgot a name, word, fact, or number? Just ask Aqua to fill it in for you." It would be nice to be able to turn this off, or at least have a clear indication when content that I did not say is inserted into my document. If I'm dictating, I don't usually want anything but the words I say on the page.


> People want something flashy: "Up to 7x fewer errors than macOS dictation" is cool, a comparison table is not.

Respectfully disagree on this one: as a startup, you can't effectively compete with the likes of Apple on flashiness. However, the target market of people dictating large amounts of text will itself include a significant number of people in academia. For those people, Aqua Voice will feel relevant. Those who aren't interested in comparison tables will simply skip over them :)


One of my favourite nitpicks, but IMO 7x fewer errors means -6 times the error rate. Maybe error rate reduced by 86%.


Isn't it just errors_count/7? (errors_count * 1/7)

For example, if you got 70 errors before you now get only 10 errors.


7 times 70 = 490. 490 fewer than 70 is -420. But words mean what you want them to mean, so "7 times fewer" to mean 1/7th is becoming commonplace.

(edited because formatting swallowed asterisk for times)


Is there any case where your interpretation makes sense? Would you ever say 0.86 times fewer instead of 0.86 of the size?

If the new count is 1/7 the size of the old, then 7 * new = old. It takes 7 times new_count to get the value of old_count, i.e. new_count is 1/7th of old_count. "7x fewer" seems like a shorthand, but I'm not a native English speaker.

  x times != x times fewer
  7 times => multiply
  7 times fewer => multiply by fraction


I would never say times fewer, because it is ambiguous.


When the value is above 1, it's not really ambiguous anymore, as only one interpretation makes sense. But I understand that it could still be incorrect if it's not a well-defined term in English.


You chose an interesting case at 1. What does "1 times less errors" mean? To me it means no errors.


It's ambiguous as it's not above or below 1 :)


Gotta put a back-whack in like \* for *


Thanks for the feedback! On the last point, you can't see it in the sandbox, but the app has a Strict mode that does what you're looking for


I was wondering whether the table actually comes from some paper, or it's just a marketing trick for techy folks.


This is incredible. I said go back and swap one word with another, and it did it. This blew my mind; I've not been able to do that before.

I'm a heavy voice dictation user, and I would switch to this in a heartbeat. I'll tell you why this is so impressive: it means you can make mistakes and correct them with voice. That takes away the overhead of preparing a sentence in your mind before saying it, one of the hardest things about voice dictation.

My shoulder is often in pain, and I have to reach for my mouse to change a word; I wouldn't have to if I used this. This software would literally spare me pain.

However, I cannot use it without a privacy policy. I have to know where the recording of my voice is being saved, whether it's being saved, and what it is going to be used for.

I would pay extra for my voice to be entirely deleted and not used afterwards; that could even be an upsell in your packages. An extra $5 to never save your voice or data.

I love it, but I can't use it for most things without a privacy policy.


This is so cool! Great work. I'm writing this comment using Aqua Voice, and it's very impressive. I've been waiting for something like this. As a neurodivergent person, certain tasks (cough, email, cough) are about 10 times harder sitting down at my computer than they are handling them aloud with my assistant.

I'm sure you get this feedback 100 times a day, but I'd gladly pay a substantial amount to use this in place of the system dictation on my Mac and iPhone. Right now, the main limitation to me using it constantly would be the endless need to copy and paste from this separate new document editor into my email app or into Notion or Google Docs, etc.


Two more small pieces of feedback, in case they're useful:

- Consider a time-based free trial. As others have said, tokens are confusing, but also your model is unlimited so the chunk of tokens doesn't allow me to see what it might be like to actually use your product. I'm more than halfway through my tokens after writing an HN comment and a brief todo list for work, so I've been able to see what it'd be like to pay the $10 for about 5 minutes worth of work, which feels like a very short trial. A week, say, seems fair? And then you have some kind of cap on tokens that only comes up if someone uses an abusively huge amount (an issue, I'm sure, you'd face with paying customers too, right?)

- I had a bit of trouble with making a todo list—I kept wanting the system to do a "new line" or "next item" and show me a new line with a dash so I know I'm dictating to the right place, but I couldn't coax it into doing that for me. I had to sort of just start on the next item and then use my keyboard to push return. When making lists, it's good to be able to do so fluidly and intentionally as much as possible. Sometimes it did figure out, impressively, that long pauses meant I wanted a new line. But not always.


Awesome. Agree on the copy-paste annoyance, we're working on more clients.

But I do think that the reliability needs to take a few more steps before it becomes a true keyboard replacement.


Thanks for all your hard work! I even found myself asking the app to copy the text to the clipboard for me without thinking. Might be nice to be able to do that more seamlessly, just as a start?

You've moved us all a lot closer to my dream: taking a long walk outside, AirPods in, and handling the day's email without even looking at a screen once.


That's a great idea, we should do that.

I have a similar dream, we'll make it happen!


Since voice-to-text has gotten so good I've used it a lot more and also noticed how distracting and confusing it can be. Using Apple's dictation has a similar feel to this where you're constantly seeing something that's changing on the screen. It's kind of irritating and I don't really know what the solution is.

One suggestion I have here is to have at least two different sections of the UI. One part would be the actual document and the other would be the scratchpad. It seems like much of what you say would not actually make it into the document (edits, corrections, etc) so those would only be shown in the scratchpad. Once the editor has processed the text from the scratchpad then it can go into the document how it's supposed to. Having text immediately show up in the document as it's dictated is weird.

Your big challenge right now is just that STT is still relatively slow for this use case. Time will be on your side in that regard, as I'm sure you know.

Good luck! Voice is the future of a lot of the interactions we have with computers.


Not trying to hijack this. Great demo! But STT can be very much real-time now. Try SoundHound's transcription service available through the Houndify platform [0] (we really don't market this well enough). It's lightning fast and it's half of what powers the Dynamic Interaction demos that we've been putting out.

I actually made a demo just like this Aqua Voice one internally (unfortunately it didn't get prioritized), but there is really no lag. However, it will always be the case that the model wants to "revisit" transcribed words based on what comes next. So if you want the best accuracy, you do want to wait a sec or two for the transcription to settle down a bit.

[0]: https://www.houndify.com


Distil-whisper is incredibly fast. Realtime on a 3060 Ti, and I used it to transcribe an 11 hour audiobook in 9 minutes.
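For anyone who wants to reproduce this, the standard transformers recipe looks roughly like this (model name from the distil-whisper repo; device and batch size are whatever fits your card):

    # Chunked long-form transcription with distil-whisper.
    import torch
    from transformers import pipeline

    pipe = pipeline(
        "automatic-speech-recognition",
        model="distil-whisper/distil-large-v2",
        torch_dtype=torch.float16,
        device="cuda:0",       # e.g. a 3060 Ti
        chunk_length_s=15,     # chunk the audio for long-form decoding
        batch_size=16,         # tune to your VRAM
    )
    print(pipe("audiobook.mp3")["text"])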


You know, those audiobooks already have transcriptions. Often written by the original author!

I kid. Your comment made me think of a shower thought I had recently where I wished my audiobook had subtitles.


It really is a little absurd IMO that the text of the book is sold separately from the audio.


The book publishing industry is different from the audio recording industry.


I developed an RSI-related injury back in 94/95 and have been using speech recognition ever since. I would love a solution that would let me move off of Windows. I would love a solution allowing me to easily dictate into text areas in Firefox, Thunderbird, or VS Code. Most important, however, would be the ability to edit/manipulate the text using what Nuance used to call Select-and-Say. The ability to do minor edits, replace sentences with new dictation, etc., is so powerful and makes speech much easier to use than straight captured dictation like most whisper apps. If you can do that, I will be a lifelong customer.

The next most important thing would be the ability to write action routines for grammars. My preference is for Python because it's the easiest target when using ChatGPT to write code. However, I could probably learn to live with other languages (except JavaScript, which I hate). I refer you to Joel Gould's "natPython" package he wrote for NaturallySpeaking. Here's the original presentation that people built on: https://slideplayer.com/slide/5924729/
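For anyone unfamiliar with the term, an "action routine" is just a spoken rule bound to code. This isn't NatLink's actual API, but the idea is roughly:

    # The gist of an action routine: a recognized phrase dispatches to a
    # Python callable (a sketch, not NatLink/Vocola's real API).
    import datetime
    import subprocess

    GRAMMAR = {
        "open terminal": lambda: subprocess.Popen(["x-terminal-emulator"]),
        "insert date":   lambda: print(datetime.date.today().isoformat()),
    }

    def on_recognition(phrase: str):
        action = GRAMMAR.get(phrase.lower().strip())
        if action:
            action()        # phrase matched a command grammar
        else:
            print(phrase)   # plain dictation falls through as text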

Here's a lesson from the past. In the early days of DragonDictate/NaturallySpeaking, when the Bakers ran Dragon Systems, they regularly had employees drop into the local speech recognition user group meetings and talk to us about what worked for us and what failed. They knew that watching us Crips would give them more information about how to build a good speech recognition environment than almost any other user community. We found the corner cases before anybody else. They did some nice things, such as supporting a couple of speech recognition user group conferences with space and employee time.

It seems like Nuance has forgotten those lessons.

Anyway, I was planning on getting work done today, but your announcement shoots that in the head. :-)

[edit] Freaking impressive. It is clear that I should spend more time on this. I can see how my experience of Naturally Speaking limited my view, and you have a much wider view of what the user interface could be.


> when the Bakers ran Dragon Systems

For those who don't know what happened next, and why Dragon seemed to stagnate so much in the aughts: the story of how Goldman Sachs helped them sell to what was essentially the Belgian Enron, months before it collapsed, was quite illuminating to me, and sad.

https://archive.ph/Zck6i


That's only the intro. Here's the conclusion: https://www.cornerstone.com/insights/cases/janet-baker-v-gol...

> Professor Gompers opined that at the time the acquisition closed, Dragon was a troubled company that was losing money and had regularly missed its own financial projections. It was highly uncertain whether Dragon could survive as a stand-alone entity. Professor Gompers also showed that technology stocks were on a downward trend, and L&H was the only buyer willing to pay the steep price Dragon demanded. Thus, he concluded that if the company had not accepted the L&H deal, Dragon likely would have declared bankruptcy. The jury found in favor of the defendants and awarded no damages to the plaintiffs.


It’s crazy to me that they were helped by what were essentially boys right out of college, and that they had any faith it would work…


Goldman Sachs is such a wonderful model of what is possible via Capitalism. I think they are holding back on what they really could achieve with a little will.


Down voters: sarcasm alert!


I remember being in a conversation back in 2002 or so, where some Smalltalkers were brainstorming over the idea of controlling the IDE and debugger with voice.

It just so happens that many of the interfaces one has to deal with are somewhat low-bandwidth. (For example, many spend most of their time stepping over, stepping into, or setting breakpoints in a debugger.) Code completion greatly cuts down the number of options to be navigated second to second. It seems like the time has arrived for an interactive voice-operated AI pair-programmer agent, where the human takes the "strategic" role.


Thank you! We love hearing stories like this.

We want to get Aqua into as many places as possible — and will go full tilt into that as soon as the core is extremely extremely solid (this is our focus right now).

Great lessons from Dragon Dictation. Would love to learn more about the speech recognition user group meetings! Are those still running? Are you a part of any?


Unfortunately no. I think they faded out almost 20 years ago. The main problem was that without having someone able to create solutions, the speech recognition user group devolved into a bunch of crips complaining about how fewer and fewer applications work with speech recognition. We knew what was wrong; we knew how to iterate to where NaturallySpeaking should be, but nobody was there to do it.

FWIW, I am fleeing Fusebase, formerly known as Nimbus, because they "pivoted" and messed up my notetaking environment. In the beginning, I went with Nimbus because it was the only notetaking environment that worked with Dragon. After the pivot, not so much. I'm giving Joplin a try. Aqua might work well as an extension to Joplin, especially if there were a WYSIMWYG (what you see is mostly what you get) front-end like Rich Markdown. I'd also look at Heynote.


On a somewhat unrelated note, I remember Nuance used to be quite litigious, using its deep patent collection to sue startups and competitors. I'm not sure if this is still the case now that they're owned by Microsoft, but you may want to look into that.


I always felt coding could be such a great fit for voice recognition, as you have a limited number of tokens in scope and know all the syntax in advance (so recognition accuracy should be pretty good). Never saw a solution that really capitalized on that, though.


You should check out cursorless… it may be more directly targeting your use case


I saw it was based on Talon, but unfortunately, Talon makes things overly complex and focuses the user on the wrong part of the process. The learning curve to get started, especially when writing your action routines, is much higher than it needs to be. See: https://vocola.net/. It's not perfect; it's clumsy, but you can start creating action routines within 5 to 10 minutes of reading the documentation. Once you exceed the capabilities of Vocola, you can develop extensions in Python based on what you've learned in Vocola. One could say that Talon is a case of the second-system effect from The Mythical Man-Month.

My use case is dictating text into various applications and correcting that text within the text area. If I have to, I can use the dictation box and then paste it into the target application.

When you talk about using speech recognition for creating code, I've been through enough brute-force solutions like Talon to know they are the wrong way because they always focus the user on the wrong thing. When creating code, you should be thinking about the data structure and the environment in which it operates. When you use speech-driven programming systems, you focus on what you have to say to get the syntax you need to make it compile correctly. As a result, you lose your connection to the problem you're trying to solve.

Whether you like it or not, ChatGPT is currently the best solution as long as you never edit the code directly.


For anyone else reading, see: https://news.ycombinator.com/item?id=38214915 - "Cursorless is alien magic from the future" article linked from 4 months ago.


This is really great. I was hoping someone would build this: https://bprp.xyz/__site/Looking+for+Collaborators/Better+Loc...

I would really happily pay $10/month for this, but what I really want is either:

- A Raycast plugin or desktop app that lets this interact with any editable text area in my environment

- An API that I can pass existing text/context + an audio stream to and get back a heartbeat of full document updates (sketched below). Then the community can build Obsidian/VSCode/browser plugins for the huge surface area of text entry.
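Something like this, say (the endpoint and message shapes here are entirely made up, just to sketch the contract):

    # Hypothetical streaming client for such an API.
    import asyncio
    import json
    import websockets  # pip install websockets

    async def dictate(audio_chunks, context_text):
        async with websockets.connect("wss://api.example.com/v1/stream") as ws:
            # Seed the session with the existing document as context.
            await ws.send(json.dumps({"type": "context", "text": context_text}))

            async def pump_audio():
                for chunk in audio_chunks:   # raw frames from the mic
                    await ws.send(chunk)     # binary frames carry audio

            async def read_updates():
                async for msg in ws:         # "heartbeat" of full doc states
                    print(json.loads(msg)["text"])

            await asyncio.gather(pump_audio(), read_updates())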

Going to give you $10 later this afternoon regardless, and congrats!


I would also love to integrate this into text areas in my app, or as an editor of a JSON object.

That would let me quickly build an interface for editing basically any application state, which would be awesome!


There should probably be a community effort to build an open-source version of this around Obsidian?


Take this [TEXT] read it and then let me tell you how to edit it:

>Certainly - let me grok your text!!... OK - I am ready!

BLAH BLAH BLAH...

etc


Dictation software is huge in the healthcare industry. Every doctor uses it, and a solution like yours could likely make their work much more efficient.

Have you explored this market segment?


I’d consider dentistry first. It’s still an open market in terms of SaaS, and they tend to have the same computer sitting there all day, constantly switching between the patient and the machine.


Why do doctors use it?


Not OP but a big part of a doctor's job is clinical notes. Typing is slow, talking is fast. Less time spent taking notes == more time with patient.


From exploring this segment for a while, I believe that dictation software is the "brick" in the hair-on-fire analogy (that is, it provides some relief but is far from an actual solution). There is a form of water (scribes on retainer), but it is too expensive for all but the most profitable of specialties. The problem to be solved is not "dictation but better," but "take this cognitive load away from doctors, and keep notes accurate" (which is what scribes with experience do). In a broader sense, the problem to be solved is the American healthcare/insurance system (the reason these notes have to be taken in this way in the first place)...

> Less time spent taking notes == more time with patient.

This can be true in some cases, but from what I understand, industry wide it would end up more like:

Less time spent taking notes == more patients scheduled.

Which is still of value, but fails to solve the original point of a physician's frustration, and possibly makes it worse (assuming the physician is still the one generating, handling and verifying the notes, but with better efficiency).



Thanks! As I mentioned, I've been looking into this for a while. I'll add it to my list.

Now I have:

    - vetrec.io
    - Abridge
    - Scribeberry
    - Scribematic
    - Notezap
    - Lytte
    - Deepscribe
    - FreedAI
    - s10 AI
    - Nable
    - DeepCura
    - DAX copilot
    - Suki
    - M*Modal
    - Amazon Healthscribe
Slightly different (?) - maybe more human:

    - Overnight Scribe
    - Rev.ai


Plus the output is legible


My wife is a radiologist and uses voice transcription literally ALL DAY LONG as she reads imaging and transcribes her findings. PowerScribe from Nuance, in case you're curious.


This was such a well executed demo. A few seconds in and I'm seeing the value. The core of the product is fully explained in just 36 seconds.

It's less about how quickly all that transpires and more about presenting the product in a way that doesn't require a lot of talking around it. Well done.


I agree, very well spent seconds. Straight to the point and immediately obvious what the product is doing and how useful it could be.

My first thought, when reading the headline, was that this could be useful for my coworker who got RSI in both hands and codes using special commands to a mic. But after having watched it, I think it can be much more than such a niche product.


Congrats on the launch!

I absolutely love the idea, as a fellow neurodivergent who works much better over voice than text. My only feedback is... I'd love to run this with more control. I already run LLMs locally (LM Studio), and I can run something like whisper too. I understand that open-sourcing (or even making the source code available) might go against any commercialization attempt. However, there are some options (Red Hat-esque) where it may be possible to charge for business use and allow local running for free for personal use.

On one hand, you've got a solid first-mover advantage in a field where lots of people can benefit from and use this. However, if someone can cobble together several layers of LLM output, they might be able to offer competition (and such projects are often open-source, albeit sometimes less "polished"). If you offer a good deal, you might have a good chance of major success. Best of luck!


So… what do you want?


Ideally, some way to run this locally on my own machine. That would offer more power (and also allow the product at a lower/no cost without any demands on their servers). Are you from the Aqua Voice team by the way?


After watching the video demo and logging in, I was able to compose and edit text easily. Nice job.

My own use case is a bit different from many others who have commented here. I'm a reasonably fast typist and don't currently have any physical or neurological issues that might make typing difficult. I have tried voice input methods a number of times over the years, as I thought speaking would be faster than typing, but I always went back to typing due to accuracy problems and difficulty editing.

Aqua Voice does seem to be a significant advance. I'm going to try it out from time to time to see if I can get comfortable with voice input. If I can, I will subscribe.

I drafted this comment using Aqua Voice, but I ended up editing it quite a bit with a keyboard before posting.


I don't think there are genuine use cases for this except accessibility etc. I always doubted voice; however, I'm beginning to see that OpenAI's Voice Engine will potentially be huge and incredibly addicting to chat with. I.e., it's more interesting to confide in a human-sounding 'friend' than in vanilla ChatGPT with a keyboard.


Appreciate it.

I think the preference for voice versus typing is something that hits everyone differently, and I think as the reliability and speed improves, more and more people will find themselves using voice as a "tool in the toolbox," which aside from the occasional "Hey Siri, set a timer," isn't the case today.


As others have said, good job.

This seems like it would be particularly good on a telephone or my watch. In those places it seems like a real game changer in terms of ability to take notes when the keyboard experience is less than awesome.

Have you tried using it to write code? This could be amazing as a IDE/text editor plugin.

It's nice to see someone do something that's not regrettable with AI. So many of the applications we see are horrible. What you've made is brilliant and very far from being just another cursed chocolate factory experience.


Nice work and congratulations on the launch! It’s awesome when a project works as expected/assumed. I immediately started using it to build my packing list for my upcoming weekend away.

Three suggestions:

1. If I try the demo on the landing page and then sign up, I want it to automatically copy the contents from my demo usage into my newly created account.

2. I would like another option than an auto-renewing subscription. If I could have loaded $5 or $10 that my usage would deduct from, instead of having to subscribe, I would have paid money right then and there, but I didn’t because I have too many damn subscriptions (an alternative is to offer a non-auto-renewing $10 one-month trial).

3. iOS app please :)


Thanks Merik! "It does what it says" is honestly my favorite feedback...

1. Good call on the demo page, would be slick to sync that up with your account.

2. Subscription exhaustion is a real thing — credits/usage-based billing is an interesting idea.

3. ASAP!


You don't say so explicitly, but it'd be good to know what data goes to the cloud - I presume all of it including speech recordings? Or is STT on device? Also what your privacy / retention policies are around this data.

Excellent demo and great-looking product btw!


I just spent 10 seconds trying it. It was able to interpret my intentions and parse out commands from the literal transcription. "bazinga but in all caps and with a j" became "BAZINJA". So at a minimum, it's going through an LLM in the league of Llama, which if run locally in browser is slow as molasses on my ancient MacBook. So it's definitely going to the cloud. As a rule of thumb, you should just assume any website you didn't completely code yourself is sending every mouse movement and every text that you type and then backspace, including passwords, to a cloud big data analytics repo via a few javascript listeners.


That’s a hilarious over-assumption, but point taken.

Also I really enjoyed your analysis


Tried it. Seemed quite impressive. Two issues:

- it consistently uses the word "two" instead of "to"

- forcing Google OAuth as the only way to sign up is not a good idea. That prevented me from signing up.


Did you wait for the text to turn blue and then black? And were the twos still wrong then? The real-time text is non-final tokens and has many more errors than what is ultimately committed to the document (but committing is slower than we'd like at the moment).


Yes I did. I even later tried to tell it to fix this, and was not successful.


From one dyslexic to another, who never got the option to even use a computer in school or college and instead was forced to write out everything long-hand, thank you so much for this.

I use voice-to-text in the workshop, when taking notes, and when reviewing a PR. And all the current options are pretty much what you would expect: focused on accuracy, which is usually quite poor. To paraphrase: "It's Erin with an E. Oh for **s sake, Erin. ERIN! E. R. I. N. <pause> N. I said N. Eh-rin. Fine. Whatever." So anything that can improve on that experience will be immensely helpful.

Looking forward to seeing where you go with this, and I hope at some point you make a native desktop application.


I think I developed "bad handwriting" partially to hide misspellings—this is necessary in school though.

On pure WER we are state-of-the-art in our testing, but more importantly, mistakes in Aqua are correctable.

So you can speak your mind instead of having to wait until you have the perfect sentence and then dictate it.

That said, we know it's not perfect, but we know a few more months of work will have it really solid.


This is amazing! It's very satisfying to use and the combination of transcription + intent seems like it has huge potential.

I would love to use this in healthcare for dictating patient letters etc. I guess a local model / HIPAA compliance is some way off?


First impression: Wow, this is awesome.

So let's say I work in a quiet home office by myself. Could I just have Aqua open throughout the day and give it notes / to-dos without having to click the microphone on/off each time?


Thank you! And yes, the app has a Background mode which is designed for this use case exactly


I've wanted something like this for data entry for a while now. I often find my hands full measuring things and need to take notes. Can this output/format tabular data?


When I was playing around with the demo, I gave it a list of things to do and then asked it to convert the list to a markdown table and label the second column estimated duration. It worked like a charm. It set the first column heading to "Description" even.

I was then able to go through my list very quickly and add times to each item.

The one failure I had was when I asked it to add whitespace to visually align the table columns.

The table was, however, converted back into a list when I asked it to turn the text into an email.

It's not exactly punching numbers into a spreadsheet, but it worked pretty well for the simple use case I tried.
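For reference, the shape of the table it produced was something like this (items invented here for illustration):

    | Description        | Estimated Duration |
    |--------------------|--------------------|
    | Fix the gate latch | 20 min             |
    | Water the garden   | 10 min             |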


Thanks, I'll have to try it out. Ideally I'd like to ask for a table with lettered/numbered columns and rows, and then just call out "B6 is 2.024" as I go.


I don't think it would perform too well in tabular contexts, at least with just natural language, for reasons I've explored in one of my own projects [https://github.com/AmberSahdev/Open-Interface/?tab=readme-ov...].

That said, I would still think it's very doable to set reserved keywords to navigate it ourselves while keeping it conversational.


I tried the demo; it worked well, allowing me to add a line and then delete the first line - a test that Dragon or Apple would have failed.

What does the actual app look like though? Is it only in a browser or can I use this anywhere on my Mac?


I think I would pay for this in a heartbeat if it was more "available", as in not just a web app. I'd love a native app where I could use this in any textbox (web, app, etc.) on my Mac. Ideally I'd be able to remap the key I use for Mac dictation. I think I'd be fine with a popup where I dictate all the text and then it just inserts it when it finishes (so you can just render your UI and paste in the result instead of needing to interact with existing text fields).


I use Apple dictation heavily for transcribing interviews. I've tried all the voice-to-text services out there and none have been reliable enough at transcribing an audio file. I've settled on playing audio in my headphones and pausing while I carefully dictate text into a document. If I could upload the audio file, get a first-pass transcription, and then go through and edit / make corrections with voice, that would be awesome.

A difference in error rate from 20-something percent down to less than 5 percent sounds incredible.


Have you tried using Whisper from OpenAI? Aiko [0] has Whisper large-v2 built in and allows transcription of audio files.

[0] https://apps.apple.com/fr/app/aiko/id1672085276


Is there anything like this for watching foreign television (or radio)? I don't want to create a document, I just want real-time translated subtitles, but I can't do it in advance for live shows.


This is amazing. Just tried really mumbling along for a while and it got every word.


Have you tried openai whisper? Last time I compared it was quite a bit better than all the other options.


Check out Descript. It was awesome when I used it in the past.


Deepgram has been incredibly accurate for me.


This is excellent and very impressive. Have you thought about offering this as an API? I'd bet there are lots of startups that want to easily integrate better speech-to-text for conversational AI (e.g. word correction, adding punctuation, etc), and would pay you for the service.

(Personally, I would! Email me at matt@syntheticdreamlabs.com if you're interested in offering it as an API, I'd be pretty curious about pricing.)


We have thought about this: there are some boring reasons that it doesn't make sense to do right now but I would definitely be interested in learning about your use case and keeping you posted if and when we decide to do it!


For sure! Email me if you want to chat more :)


Yes, upvote for this please! We would pay to integrate this into our SaaS for our users.


Hands down one of the best AI demos I have seen. Last time I got a wow feeling like this was when ChatGPT was released.


I've just subscribed! Congratulations on this well-crafted, immediately useful tool. It's something I was already looking for.

Some Feature Requests:

- Please implement proper undo/redo history and allow us to use it with voice, GUI and keyboard.

- When I take over with the keyboard, do not mess with the text while I'm typing.

- When I start speaking, if the text cursor is in between words, I'd expect it to insert, not append. (If you want to prevent this happening accidentally, you could move the cursor to the end of the text after some idle time.)

- A slightly more advanced feature: it would be very nice to have some form of tagging for sentences and/or paragraphs with numbers, text, or colors. That would make it easier to delete, reorganize, and move them around by saying things like "move the green section before the purple one" or "part 12 should go before part 4"; then we can let the language model do its magic and rephrase/reformat a bit. Your AI is probably smart enough to understand which sentence we are referring to from just hearing part of it, but as users we may feel too lazy for that.

- Add a cheat-sheet for what's possible next to the editor. Something more succinct than the Notion doc.

- Allow pausing/resuming dictation with voice command.

Note: I would really like to be able to benefit from its smart features while I'm not sitting in front of my computer or holding my phone. But I'm not sure there will be a day when Apple lets us interface with smarter AI agents like ChatGPT or Aqua Voice on iOS while the screen is locked. IMO, this is gatekeeping of an inferior feature (Siri) in the name of protecting my privacy. I hope some day the EU will also intervene on that.


I tried it in Firefox on my Android and got this error when I tried to use the demo:

"Error: NotSupportedError: AudioContext.createMediaStreamSource: Connecting AudioNodes from AudioContexts with different sample-rate is currently not supported."


FYI to the devs... I got the same error on Firefox Win11 x64.


Patching this now!


This is fixed!


Same here


Congratulations! This is really cool. Maybe your website could just load into the demo? Have a talking avatar that looks like a paperclip with googly eyes to explain how to use it...

edit: I refreshed and then it did load with the blue mic button


Congrats! The launch is a stellar example of focusing on key functionality and cutting everything else. Core tech is amazing.

I played with writing an intro for some documentation I'm creating, and it was amazing. It felt very conversational, very different from traditional dictation software.

1. I have the same question as others: what's the privacy policy? Where is this data being used? I'd like to expense it for work, but their first question is where the content is going.

2. And straying from focusing on core functionality: do you have a vision of a code-specific version of this? Do you feel like a specialized model/tool is needed, or do you believe this can take care of it?

Currently, I tried asking it to write out `def func():[newline]pass` and then correcting it but didn't get very far:

> "Just my funk. Oh no the word death try type the letters DEF FU. Okay, no let's stop."

Overall, great work!


Great work, really hope you'll be able to break into the medical market eventually. Dragon is still useless to anyone who can touch type.


No way, I was literally mulling over the exact idea of voice-driven text editing (focused towards programming), using a mix of voice commands and usual speech to text. This is really exciting to see!


The programming use case is pretty interesting.


Just thought I'd let you know about this event that popped up in my inbox that I think you should definitely attend.

2024 Bridge2AI Voice Symposium | Voice as a Biomarker of Health

https://www.eventsquid.com/event.cfm?id=22807

The 2024 Voice AI Symposium will be a groundbreaking 2-day event and a unique opportunity to connect with stakeholders invested in artificial intelligence and voice biomarkers. This year's symposium will serve as a nexus for dialogue, collaboration, awareness, and engagement across diverse sectors and members of the community about the use of voice artificial intelligence in healthcare. Attendees will experience dynamic speakers, panels, and networking opportunities. Innovative interactive events include a Call for Science with 3 submission categories, a Voice AI Tech Fair, and a patient challenge competition.


Great recommendation, thanks for sharing!


I love it when someone shares an idea that I wouldn't have considered (which is on me) and so clearly solves a real problem.


You’re early and this is effectively a demo but just in case this is a blind spot: “token” is an in-the-weeds LLMism that means nothing in the context of transcription. Your costs may be measured in tokens but that’s not relevant to customers. Just “A free trial” with no quantifier would be better than 1k tokens.


Appreciate the feedback, we'll take a look at that.


Great point. Someone needs to replace this measure.

Words? Minutes? Number of edits? E.g.:

Free - try 10 minutes active editing a month, great for trying it out

Light use - 120 minutes a month, perfect for jotting down a few things daily

Pro - 600 minutes a month, write an entire essay by voice

Ultra - unlimited. Make voice editing your main workflow and work 10 times faster


This is a great point and a topic I’ve been thinking about myself. As more LLM services pop up that are subject to token/consumption pricing, what is the right pricing model for consumer-facing consumption products like this?


Price based on value. Pricing is hard, something as simple as per-token is alluring because it doesn’t require any thought but it’s leaving a lot of money on the table. There’s nothing unique about LLMs when it comes to pricing, all common pricing wisdom applies.


That seems challenging to do with a writing/note taking app like this. First, what would the pricing tiers be based on? Word count? That would just be another way of saying token. Number of documents created? That puts you at risk of long unprofitable documents. Google Sheets doesn’t really have this problem because the incremental cost of storage is relatively cheap. Tokens on the other hand are not cheap.

How do you price based on value without a corollary to tokens? If you charged $40 for this service then maybe you don’t provide enough value for the casual user who does the occasional school report. On the other hand you may be unprofitable for the doctor that decides to dictate all of her interactions every day or the author who dictates an entire book.


> First, what would the pricing tiers be based on? Word count? That would just be another way of saying token.

A customer sees "word count", they understand what's going on perfectly, right away. Tokens? More than half of them will think "what, like, game tokens? do I have to buy them in advance?"

Generously, 10% of potential customers are going to have even an approximate idea of what a token means in this context, maybe 1% could tell you that words and tokens aren't quite the same thing.


Words. Just estimate how many tokens that'd be and talk in words, paragraphs, etc instead.
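Back-of-the-envelope, assuming the usual ~3/4 of a word per token and a typical ~150 spoken words per minute:

    1000 tokens ≈ 750 words ≈ 5 minutes of dictation

Which also matches the "about 5 minutes worth of work" others here reported burning the trial on.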


This is super awesome. Do you develop your own models, or is this a wrapper around existing APIs? It would be great to have a way to introduce environment variables like my name, my preferences, and the topics I usually write about. I've actually written this comment using your service. Thank you. Looking forward to seeing what it becomes.


One of the first Launch HN products I've been excited about in a long time. I'm a student, and really looking forward to using this to write papers, assignments, emails, etc.

Congrats!


Thanks Eliot! Let us know how we can make it better for you!


This is really great. I imagined such a thing should be created; amazing to see it in reality. It would be great for those of us not limited exclusively to voice to be able to use commands as well, as I still think in some cases doing explicitly what I want for simple things is easier than figuring out how to explain it :)


We agree totally; voice only can be ridiculous, for example, if you're spelling out a username or something.

The sandbox doesn't have typing, but the full app does - you can switch between typing and talking seamlessly there.

(written with Aqua)


Nice work. I am very involved in the Talon community and it is cool to see other projects tackling voice interaction from different perspectives.

I develop a very similar natural language voice interaction tool using the OpenAI API and Talon as the engine[0] (i.e. you can apply any voice-command transformation with AI on any text, or use it alongside Cursorless for semantically targeting scopes in the AST). You can use my solution with offline LLM models too.

If you are interested in chatting, please reach out[1], as I am very interested and experienced in this space.

[0] https://github.com/c-loftus/talon-ai-tools

[1] https://colton.bio/contact/


I have neuropathy in my arms, so this is something I'm very interested in!

Do I have to use a specific Aqua Voice text editor, or can I use it in apps like JetBrains Rider and Visual Studio Code? If so, are there some kind of plugins that would allow using IDE-specific features? (e.g. "build and run the API project")


Hey! Right now our focus is getting the core tech solid and we can do that much faster if we aren't juggling multiple platforms and plugins (we learned this the hard way), but after that we are going to blitz into as many places as possible.


Impressive to see the total lack of response to "umm" and "ahh" filler sounds (in the text editing area); it does seem to recognize them for what they are. Also, the rest of it seems valuable, especially the formatting abilities.


Cool product. I signed up.

I wish there were a clear way of sending you feedback, though; there are some details that annoy me a lot.

I'm transcribing some recorded sound, and after I have had it transcribed and I am editing it, every time I tab away from the browser the cursor position gets lost. The focus on the textarea is also lost, and when I click it, it doesn't insert the cursor where I click but at the start of the document, so I even lose my scroll position.

As a paying customer I'd hope to have a way to give you this kind of feedback. It should be fairly easy to make this a much better experience.

Really cool product all in all though! I don't often subscribe to stuff.


Absolutely, and sorry for the UX annoyances. We'll add a way to quickly share this kind of thing from within the app.


This is very cool. I would immediately buy it if someone ends up making an Obsidian plugin


This would be very effective


I feel like I'd much prefer this as an API I can request and get realtime updates from so that I can hook it into any application. Is that on the roadmap?

Also, latency seems to be a bit high; I wish it were faster. Maybe that's due to traffic right now.


Infinite details to remark on, but,

NO NOTES.

This is the sort of the thing that I forward to people who are skeptical about the disruptive capacity AI has, to take long-standing seemingly intractable problems, and "solve" them.

Hats off. Truly inspiring in many senses!


Thank you very much. I can't say how much these comments motivate us.

When I was a kid I saw tech as a really aspirational thing (Jobs, iPod, touch screens, Skyrim). What can be done with modern GPUs and transformers is making me feel like that again.


Suggestion for go-to-market: if you haven’t already done so, try to sell this direct to universities. They are an absolute pain to deal with, but they have various obligations to show how they are supporting dyslexic students etc and this fits the bill perfectly.

Their lead times are long, so I’d start establishing trials now: partner with a university on an agreed discount and use them to develop the software.

Once you’ve got a few unis on board, you can rapidly expand, and they are very unlikely to churn as you’re serving such an important niche.

I know a very similar product that has had huge success doing this.


Great idea.

This is definitely worth looking into. We don't want to slow down the pace of development, but this might be a case where the partnership makes too much sense.

A huge portion of what goes on in universities is writing.

edit: Reading this back, I thought I sounded too eager to partner with universities. We're a tech company, and quality and performance will always come first.


Working with universities won't slow your pace of development, because you can work directly with users on that.

But on go-to-market: after the initial delay, unis will drive sales fast and at scale.


I'm surprised that universities have to consider dyslexic students. When I went to university, I was basically told to figure it out. "It's your problem after elementary school. Nobody cares."


The demo seemed to struggle a bit with my accent (Scottish), getting quite a few words wrong - for example, every time I said "test" it would write "taste". Is this something you can improve going forward?


Sorry about that. We know we need to be better about that and of course add more languages.

A few things to try to maximize your accuracy right now are:

- Don't use AirPods, especially not AirPods Pro. Most built-in laptop mics or EarPods or a gaming headset are perfect. It doesn't need to be podcast quality.

- Correct transcription mistakes as you would a person, then "plow through" and often the error will be corrected as you complete the sentence.


All "normal" voice programs struggle with us non-native speakers and our funny accents (sample size: 2). The first try on your site was satisfactory but I'll have to lurk around more just to feel safer... And yes, I am really looking forward for more languages. And switching between them!


What’s the problem with the AirPods? Too much pre-processing?



In the past when I've been in the USA, I've legit had to put on an American accent when calling for taxis and the like!

I don't even have that strong an accent, and I always try my best to enunciate correctly when talking to others shrug


I'm getting married in Scotland in December and will presumably want to be able to demo so you can bank on priority support and a hard deadline :)


Lol, excellent :)


I'll certainly go give this a spin later as I use voice to text daily. My first few questions:

How does the dictation accuracy compare to Talon's latest model, or Microsoft's new Voice Access? Or Dragon? You've got a few comparisons already, but nothing that I actually use.

What's the latency like?

At least for me, a general voice editor isn't useful; give me something that can send text to wherever my mouse is pointing and that's useful. Then make sure it works with Microsoft's Mouse Without Borders, Synergy, Barrier, Input Director, etc.

Oh and does it support a user dictionary?


We'll be releasing a custom dictionary and templates soon. We are testing them internally now, and they aren't quite reliable enough to release, but we understand how important this is for many workflows.

On accuracy, we benchmark very well against even large async models, with a WER of .05-.06, and when Aqua does make a mistake you can often correct it by just telling it "no, it's 'our side' not 'outside'" and it won't mangle the text.
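For anyone unfamiliar with the metric: WER is word-level edit distance divided by the number of reference words, so .05 means roughly one wrong word in twenty. A toy example with the jiwer package:

    # WER = (substitutions + deletions + insertions) / reference words
    import jiwer  # pip install jiwer

    reference  = ("the quick brown fox jumps over the lazy dog near the "
                  "old red barn by the river at dawn today")   # 20 words
    hypothesis = reference.replace("fox", "box")               # 1 substitution

    print(jiwer.wer(reference, hypothesis))  # 0.05 -> one error in twenty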


I have goosebumps!

Jiminy Crickets...

I have SOOO many use cases for your thing.

[edit: what does this mean: https://i.imgur.com/rHQt6ul.png when attempting to demo?]

---

* I want an agent that I can speak to on a mobile headset, as I love to think out loud - and air my thoughts and thought process by talking through my internal dialogue - if this could just capture what I am saying and log it, I can refine thoughts as I go.

For example - I ride a lot. I try to cycle 1000 miles a month if I am doing a solid month - but otherwise - I ride daily and it's a movement meditation. As I ride - I think through things and I speak through thought processes with differing opposing 'experts' in my internal monologue to self-argue through to a solution....

If I could have this record all that, then random epiphanies I think through while on a ride will be captured in a meaningful way.

---

* A meeting-notes-transcriber for whiteboard sessions.

* record everything you say in an interview and be able to review after for self-coaching

* talking through a dish as you wing the ingredients, so that you speak out loud what you did (my grandmother was friends with Julia Child - my grandmother taught me to cook, and when it came to measurements of things, they always winged it per feel/taste; "salt to taste", for example, means "eh... whatever")

so to be able to talk through what you're 'winging it with' and have it captured into a salient, reproducible recipe (I make a mean chimichurri, sometimes, if I can recall)

* a voice "body cam" for things I may say in situations where I may be too flustered to recall.

* Spoken authoring - start telling a story outline so it captures a synopsis that you can further develop

* Speech (like giving a speech) refinement as you can talk through the speech and capture and rework and reiterate etc

and that's just off the top of my head through your demo....

LOVE this.


Thank you, awesome to see how many ideas this inspired! We've thought about a lot of similar things ourselves and will certainly build some of them :)

Sorry about the error you are getting! It's a Firefox thing. We will patch. In the meantime, Chrome/Safari will work


> how many ideas this inspired!

I just want to qualify - you did not inspire these ideas.

These are desires sought which have been there for eons...

You are not inspiring them

You have a tool that ENABLES them.

Seek that which already is a flustered pop of ideas waiting for a release valve for such thought.

You are not inspiring - you are enabling that which is already there, think of it as which valve to open - the pressure is mounting upon your dyke.


that's a better way to put it!


This is awesome!!! I'm really impressed with the demo. In particular, with how fast it seems to work given the number of models you use, the client-server back and forth, and the required processing and text gen. How did you do that? And at what point do you start to get bigger latencies, e.g. writing an email, an essay, or a novel where you change the spelling of a character's name two chapters earlier?


> Aqua is smart enough to figure that out and to only take the last version of the sentence

I wish Siri, Alexa, et al would do this as well. They seem to expect you to speak perfectly the first time.


I liked how easy the demo was to play around with! I don't have much use for this product myself, but kudos for making something that clearly works very well!


Not sure I'm going to get any traction here, but seeing as the 'support' button on the website is greyed out, not sure where else to post this. I got double-charged for my subscription, and I'm hoping you folks can help me get a refund on the extra charge. I was going through the subscription flow and got an error. The website allowed me to subscribe again, so I did, assuming the first one hadn't gone through. I now have two charges on my card.


Hi! I'm so sorry about that. Will you send me an email (jack@withaqua.com) and we will sort it out!

(we'll also fix the support button)


What are your opinions on https://github.com/cursorless-dev/cursorless?

Are you targeting developers?

My understanding was people who are serious about developing via voice use it pretty exclusively.

Like, yeah you need to learn commands, but "are often not worth it" feels like brushing a pretty massive offering under the rug.

Is learning vi / emacs commands not worth it (or shortcuts in another IDE?)

Is there a middle ground?


Cursorless is really cool, but we see the ideal computer-voice interaction a little differently.

Our approach is based around understanding intent from speech alone. We think this will be the ideal division of labor between man and machine going forward - let the person think and the machine fit it into the document/file/text. Over time we think this will reduce the number of commands you have to learn to use it to zero.

But our "command-less" approach isn't reliable for every use case yet - and as a fan of voice interfaces I am rooting for Cursorless - it's super sci-fi.


This is awesome, will likely subscribe--just need to pare down some of my other subscriptions--there are too many tempting AI products lately.


I understand the feeling :) great to hear!


My child is profoundly dyslexic. This kind of tool is a game-changer for him.


Hope this can be helpful. We know there are still many kinks to iron out.

On another note, I think once you leave school, dyslexia can become a wash or even a net positive in the right setting. I think whatever the underlying brain config is can be a huge unlock for creative thinking - it's not always super helpful in the school context, but it can be really asymmetric in tech and probably other industries.


As many others have noted, once you've got everything stable (and hopefully profitable) you should seriously explore a way to use this as input into any text field in any program. Microsoft is actively experimenting with something similar in Copilot Voice although theirs is very integrated with the editor and specialized for code. It would be great to have these types of voice interfaces in all software. Maybe you could look at providing a way to integrate with your system through an API so others could do the heavy lifting of creating a native experience for each app?

Absolutely amazing product by the way! The 1000 free tokens are enough; the fact that people are complaining about running out too soon is good - it shows that they like the product and want to use it more. They do have a point about adding a rough word count, though - maybe just a subheading that says "on average, X spoken words".


This is really nice. I tried it out on the app and wanna start using it more. The only issue stopping me is the privacy policy.

I understand why you have to retain the voice data, however, is there any way you can implement an opt-out feature?

I’m just not comfortable with my voice data lingering on some servers, ready to be used for training ML models.


How else can it get better?


Money


This is really cool. Can anyone who knows this stuff explain how it works on the backend? The founders mentioned 6 different models, but how do they interact with each other so fast? There's definitely an LLM in the back interpreting the commands, but how exactly does it work?


I tried using it for a screenplay and it more or less knew the format, but didn't remember what to do with the blocks of text or how to separate them properly.

It would be handy to be able to select various formats and have it know to keep to that format.

Also I really liked it but found it quite slow. Assuming that will improve over time.


100% agree on speed — there's lots we can and will do to improve that

also good point on memory. this turns out to be relevant in a lot of cases!


Love this app, love the story and everyone being so supportive is making my day! Hope you guys go all the way!


This is super impressive!

I am really looking forward to the day when dictation like this can be done locally on our phones. I'd really like to do a lot of my basic messaging with my voice but the need to do corrections with that tiny keyboard means it's not much of a time saver.


100%. This day will come!


Wow, great demo! Excited to see this grow.


Impressive demo.

I noticed a correction that was done retroactively in the demo

'make that H100 GPUs'

and noticed that there was only one instance of the token GPU, hence the correction was seamless. Had there been a couple more instances, I guess all GPU tokens would have been replaced with H100 GPUs. You could say "make that NVIDIA H100 GPUs", which would be more accurate, but if there were multiple instances and you needed the change in only one of them, I'm not sure how that'd fly. I am nitpicking, but this could be a common theme.

The fact this can retroactively change the text and also understand a command is quite brilliant. I don't see any trigger word for a command, so I'm wondering: if I needed the command phrase as part of the actual text, how would that work?


Great question. I just tested it, and Aqua was smart enough to figure out which "GPU" I was talking about using context.

In the example below I asked it to "make it H100 GPUs" and it only modified the GPU in the list.

Aqua isn't perfect though, and while I think we are mostly solving this case, there are plenty where we need to do better.

---

Hey Team,

I just had a chat with the marketing people, and they confirmed we should buy enough GPUs to meet the demand.

Our Equipment List:

- 1,000 H100 GPUs

- 1,000 processors

- 500 NVME SSDs

- 200 standard racks

edit: formatting


Well done! Without context I was a bit confused regarding the usability given the speed of the models (which of course can and will be improved), but given your story I'm sold and sure this will help a lot of people.

More input modalities are IMO always better, and being able to switch to this in the future when your fingers get tired would be awesome. My key pain point from the demo is the speed, but I'm sure that will improve as models and inference get faster.

One awesome thing would be to integrate the contextual understanding with a programming copilot, so you can pair program with only your voice as input.

Rooting for you guys!


Thank you. 100% I'm with you on speed.

I think we will be in a much better place speed-wise in a few months; some of that will be our stack, and some of that is what is happening lower down in the stack, but it will be meaningfully faster and more responsive soon.


I was impressed with the demo, ready to pay $10, and there's no option to sign up with email :(


Made some tradeoffs for the sake of speed — email signup will come. We want it too!


Good to hear! I very much dislike using my Google account and other third parties to sign up for accounts.

Do you have any idea of how soon? Not looking for a public commitment to hang you with, just wondering if this is one of those "we're working on it now" (so days) or one of those "it's in the backlog" (months or maybe never depending on priorities).


Hah! Excellent question. Somewhere in between (i.e. it is not a P1 right now, but "months" is way too long). Give us till after demo day, then one of us will sit down and knock it out.


Very excited for something like this. I'm hoping you are a victim of your own success, but it's far too slow and is missing words. Would love to see something like this succeed.


Congrats on the launch. The demo is truly impressive. On my iPhone with the Chrome browser, the latency feels a little sluggish (I am sure you are working on it). Congrats again and all the best!


Thanks, appreciate it. We can do a lot better than most people experienced today in terms of latency.


I used this to type an email on my phone, and the editing process in particular was very smooth once I realised I had to wait a little bit. You can have ten of my dollars per month.


Thanks! Sorry about the lag, today was slower than normal, but we need to improve latency overall.


I've got a pinned tab and the site saved to my home screen. I look forward to using it as you build it out - very promising product, well done.


Nice demo, I like this.

Will this work for programming languages? I would often like to type code by talking and, especially, to talk into AI coding tools that help you complete code.

Will this allow me to add links to text based on context? Like "Go to <mysite>/blog and link to the article on Voice AI" - I do this quite a lot in my article writing.

Will this work over the sound of me cooking a loud fry up?

Take all 3 and I may never leave the kitchen again.


Awesome demo. A challenge where I work is an extremely acronym-rich lingo. Is your model open to extension or learning in some fashion, so it can pick up thousands of acronyms? We also shift into rapid, specialized speaking patterns that I think are quite learnable but not really 'out of the box' for normal software products. I would think many industries have their own lingos like this.


I remember hearing about Dragon when I was in elementary school. It's cool to reflect on how far things have progressed in the last decade and a half.


WOW!!! Just wow...

When will we PC peeps get to use it?


You can use it in the browser right now! But we get it... native is better for voice stuff, and we'll be in more places soon.


This looks very sick!

Any chance of getting this as a native app on mobile? Or better yet, as a global macOS utility like dictation, so you can "type" in any app?


Fascinating. Are you still using Whisper in any of these MoE experts to transcribe, or do you have something custom? Would love to learn more about the tech.


Nice - (code-like) refactoring meets speech-to-text.


Just wanted to inform you that your demo video is actually unlisted and invisible to the public. I hope that is not intentional.


Fixed. Thanks for the heads up.


This is super cool! Should ideally happen at the OS level (some future version of Siri) across whatever apps you’re using


Super cool and I'm excited to try it out more as I also prefer voice to typing.

I think I'm having trouble getting it to edit the existing text when writing though? https://i.imgur.com/IfoWvMG.png


Yikes! I think we may have made normal mode a bit too conservative at the moment. We've got a lot more tuning to do.

One thing you can try if Aqua doesn't seem to be "getting" what you're after is to say something like "make this the email," or "transform this into the email I want to write." Sometimes it needs an extra push.


This is awesome.

The video talks about a Mac app. Where can I get that?

Voice input did not work on Edge browser on Windows, btw.


Thanks!

We had to make a bunch of breaking API changes over the last week and the Mac app isn't ready to go on it quite yet, but we'll bring it back as soon as we can, max two weeks, hopefully sooner.


Please don't only offer registration via Google. That's bad for you. Bad for users. And very bad for the world. Yes it's cheap. And also catastrophic.


Great product idea, excellent demo. Fantastic use case for LLMs. Keep it up!


One more piece of feedback: I'd like the new-note page to have a unique route, so that I can make a shortcut for creating a new note and easily put it somewhere else, like in a document.


Amazing! I'm French and I'd like to know if there is a chance I could dictate in French to Aqua Voice someday? If yes, any idea when that would be implemented? Great work!


Beautiful. We are exploring new ways of human-machine interaction.


Are these your models or a wrapper around model apis?


We use our own fusion model in the transcription pipeline for intent understanding from encoded audio, but most of the rewriting tasks like "Turn this into a list" call out to fine-tunes of GPT-4. It's a combination.

The fusion model is similar to the architecture described here: https://arxiv.org/abs/2310.13289
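
For anyone wondering what that kind of fusion looks like in code, here is a minimal generic sketch - not Aqua's implementation, just the common pattern from papers like the one linked above, with placeholder modules: an audio encoder produces feature vectors, a learned projection maps them into the LLM's embedding space, and the LLM decodes conditioned on them.

    import torch
    import torch.nn as nn

    class AudioLLMFusion(nn.Module):
        # Generic audio-text fusion: encode audio, project the features
        # into the LLM's token-embedding space, and let the LLM decode
        # conditioned on them. audio_encoder and llm are placeholders.
        def __init__(self, audio_encoder, llm, audio_dim, llm_dim):
            super().__init__()
            self.audio_encoder = audio_encoder   # e.g. a Whisper-style encoder
            self.proj = nn.Linear(audio_dim, llm_dim)
            self.llm = llm                       # a decoder-only LM

        def forward(self, audio, text_embeds):
            audio_embeds = self.proj(self.audio_encoder(audio))
            # Prepend the projected audio embeddings to the text prompt
            fused = torch.cat([audio_embeds, text_embeds], dim=1)
            # HuggingFace-style call on the placeholder LM
            return self.llm(inputs_embeds=fused)

The hard engineering is in everything a sketch like this omits: streaming the audio, deciding when to invoke the rewriter, and keeping end-to-end latency down.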


That looks amazing, congrats!

Minor friendly heads up: the withaqua.com link in the description of your YouTube video is currently not a link.


I was impressed with the demo video.

I was impressed with the live demo on the app's homepage.

I subscribed.

But it was so slow, and the actual app was clunky and refused to enter much of the dictated text I gave it (even though it seemed to be recognizing what I said): the words appeared on the screen, then they disappeared.

I had both the speed problem and the recognition problem on both of the latest versions of Firefox and Safari, even though my connection is ~ 620 Mbps up and down.

I canceled my subscription within about 10 minutes.

Oh, I should also add the following feedback:

* The homepage promises that your subscription will include support, but even when you're subscribed, the support button is greyed out.

* There is no way to manually enter or edit the titles of your documents.

I'm disappointed, because I was very excited about this.


Things like this always remind me of this excellent talk: https://youtu.be/8SkdfdXWYaI?si=MFxs7wFdqws0OeCi

Worth a watch.


Friendly FYI - not sure if this is a skill issue on my part or something that's not possible yet, but I couldn't figure out how to change the audio input. I think when it asked for microphone access (latest Chrome, Mac), it chose the MacBook microphone, which won't work as it's docked.


Really cool product! The demo worked well for me. This must be incredibly useful to so many people.


Not working for me on Firefox/macOS.


Sorry about that, will fix ASAP. I love Firefox.


Trying out the app on Firefox gets me this error:

    NotSupportedError: AudioContext.createMediaStreamSource: Connecting AudioNodes from AudioContexts with different sample-rate is currently not supported.

I would add that this really needs to be a native app with the ability to use it within Microsoft Word, which itself has a decent voice-to-text tool built in.


Sorry about the Firefox error! Agreed on the sentiment behind native app — we plan to get Aqua in as many places as possible asap. For product iteration, you can’t beat the speed the browser affords.


Make an Electron app that simply wraps your website! Just build best-practice auto-updating into the wrapper from day one, in case you want to ship improvements to it or start to move more things to client-side processing.

As a side benefit, you get real estate in people’s docks and desktops :)


Would that help with the problem of integration though? What would be absolutely killer would be to emulate a USB HID keyboard or something, which would make it usable with pretty much everything, though there are definitely some security considerations there. Or if there are higher-level APIs to hook into that could work, but I would guess those would also require native function calls.

The way Google's keyboard works on Android, but on my Linux computer (and my Android phone) would be my dream here. I'd pay $10 a month for that for sure.
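
On Linux you don't even need to emulate USB hardware - the kernel's uinput interface lets you create a virtual keyboard that types into whatever window has focus. A minimal sketch with the python-evdev library (assuming it's installed and you have access to /dev/uinput, which usually means root or a udev rule):

    from evdev import UInput, ecodes as e

    # Create a virtual keyboard device (needs access to /dev/uinput)
    with UInput() as ui:
        def tap(key):
            ui.write(e.EV_KEY, key, 1)   # key down
            ui.write(e.EV_KEY, key, 0)   # key up
            ui.syn()                     # flush the events to the kernel

        # Types "hi" into whatever window currently has focus
        tap(e.KEY_H)
        tap(e.KEY_I)

Wiring a dictation stream into something like this would get you "works in every app" without per-app integrations, though mapping characters to key codes across layouts (and unicode) is its own rabbit hole.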


Some of my thoughts:

1. This is an amazing idea!

2. I love that it is browser-based so it can work everywhere. A native app would let you integrate more tightly (such as becoming a "keyboard" on the system), but that probably means "a Mac app", which doesn't do me any good on Linux. If you could keep the bulk of it in cross-platform tech and just do the small integration part with native code, I think supporting at least "the big three" is doable. I bet if you provided a good API, somebody in the open source world would even do the work for you, on Linux at least.

3. Would really prefer being able to sign up with my email, and not having to log in with a third party account.

4. Online-only access is definitely fine for now, but to stay competitive in the future I would keep an eye toward being able to run inference locally so you don't have to be online to use it. This would also be a way for you to reduce costs and offer a cheaper version. If I were you, my long-term goal would be for this to be used by everybody (though that's years down the road). Local inference does complicate monetization, but that can be figured out.

5. For me to really use this enough to pay out every month, it needs to be relatively easy for me to get the output into whatever app I'm using, whether that is Chrome, Slack, Gmail, Google Docs, Vim, Gedit, or anything else. This is undoubtedly related to item 2 above, but I figured it warranted its own mention, as there may be solutions besides browser-based vs. native.

6. You're gonna have competitors hot on your heels, if they aren't already. Google in particular, with Gboard on Android, could be absolutely killer. Since it is Android-only, I don't think it's a major competitor now, but if they broadened it, it absolutely could be.

7. Do you have an exit strategy in mind already? Would you be willing to share anything on that? (I ask because your product could easily become part of my standard workflow, and I'm very conservative about becoming dependent on proprietary products, especially from startups.) Please do not go native-only and only release a Mac app. At a minimum, please maintain the web-based version. And please, for the love of all that is holy, don't sell to/get acquired by Apple! I want and need your product, and I don't and won't switch platforms (Fedora Linux currently) to get it.

Really amazing idea and great work! It is rare that I see products that I think could actually "change the world" but this one has some potential by changing the way we interact with our computers!


The signup just failed for me. The console was logging out the token... you might want to fix that.


patching now!! good catch.


should be fixed.


I love this idea. Wish there were a browser extension so I could dictate in my emails.


Are you saving the voice recordings and/or using that data for training?


This is useful. I hope it's enabled for other languages soon.


The video demo is sooo slow. Seems like I could type it faster.


This is great! I'd love this as a plugin for Obsidian :-)


This needs to be built into the OS - you should talk to Apple and make some money before it's built into the OS.


It's susceptible to subtle prompt injection. I got it to output LLM-like responses to just snippets of speech.


Holy crap I’m blown away by the demo. That was so easy and natural to use.

I had a brush with RSI some years ago. I’m good now with an ergonomic keyboard and better habits (water, exercise) but it brings me great comfort knowing something like this exists. Thank you!


Thank you for trying it! Our goal is to make it not just a suitable alternative but way better than your keyboard. Still a lot of work to do, but that is the fun part :)


How soon can I use this to write code?


Curious - what makes you want to use this for code? I built an MVP some time back; happy to share a demo and chat more (DM me at @tankots on X!)


How much would 1000 tokens give us?


It's 1000-1500 words. I know that seems cheap of us, but the cost to run the Aqua stack is eye-watering right now. We will increase this amount as we optimize.


Would subscribe in a heartbeat if I could upload audio files for transcription.


What would your use case be?


This site can’t be reached(?)


This is very well done!!


Paywalled behind Google Oauth? Why?


Train this model on this:

>>Dearest creature in creation, Study English pronunciation. I will teach you in my verse Sounds like corpse, corps, horse, and worse. I will keep you, Suzy, busy, Make your head with heat grow dizzy. Tear in eye, your dress will tear. So shall I! Oh hear my prayer.

>>Just compare heart, beard, and heard, Dies and diet, lord and word, Sword and sward, retain and Britain. (Mind the latter, how it's written.) Now I surely will not plague you With such words as plaque and ague. But be careful how you speak: Say break and steak, but bleak and streak; Cloven, oven, how and low, Script, receipt, show, poem, and toe.

>>Hear me say, devoid of trickery, Daughter, laughter, and Terpsichore, Typhoid, measles, topsails, aisles, Exiles, similes, and reviles; Scholar, vicar, and cigar, Solar, mica, war and far; One, anemone, Balmoral, Kitchen, lichen, laundry, laurel; Gertrude, German, wind and mind, Scene, Melpomene, mankind.

>>Billet does not rhyme with ballet, Bouquet, wallet, mallet, chalet. Blood and flood are not like food, Nor is mould like should and would. Viscous, viscount, load and broad, Toward, to forward, to reward. And your pronunciation's OK When you correctly say croquet, Rounded, wounded, grieve and sieve, Friend and fiend, alive and live.

>>Ivy, privy, famous; clamour And enamour rhyme with hammer. River, rival, tomb, bomb, comb, Doll and roll and some and home. Stranger does not rhyme with anger, Neither does devour with clangour. Souls but foul, haunt but aunt, Font, front, wont, want, grand, and grant, Shoes, goes, does. Now first say finger, And then singer, ginger, linger, Real, zeal, mauve, gauze, gouge and gauge, Marriage, foliage, mirage, and age.

>>Query does not rhyme with very, Nor does fury sound like bury. Dost, lost, post and doth, cloth, loth. Job, nob, bosom, transom, oath. Though the differences seem little, We say actual but victual. Refer does not rhyme with deafer. Feoffer does, and zephyr, heifer. Mint, pint, senate and sedate; Dull, bull, and George ate late. Scenic, Arabic, Pacific, Science, conscience, scientific.

>>Liberty, library, heave and heaven, Rachel, ache, moustache, eleven. We say hallowed, but allowed, People, leopard, towed, but vowed. Mark the differences, moreover, Between mover, cover, clover; Leeches, breeches, wise, precise, Chalice, but police and lice; Camel, constable, unstable, Principle, disciple, label.*

>>Petal, panel, and canal, Wait, surprise, plait, promise, pal. Worm and storm, chaise, chaos, chair, Senator, spectator, mayor. Tour, but our and succour, four. Gas, alas, and Arkansas. Sea, idea, Korea, area, Psalm, Maria, but malaria. Youth, south, southern, cleanse and clean. Doctrine, turpentine, marine.

>>Compare alien with Italian, Dandelion and battalion. Sally with ally, yea, ye, Eye, I, ay, aye, whey, and key. Say aver, but ever, fever, Neither, leisure, skein, deceiver. Heron, granary, canary. Crevice and device and aerie.

>>Face, but preface, not efface. Phlegm, phlegmatic, ass, glass, bass. Large, but target, gin, give, verging, Ought, out, joust and scour, scourging. Ear, but earn and wear and tear Do not rhyme with here but ere. Seven is right, but so is even, Hyphen, roughen, nephew Stephen, Monkey, donkey, Turk and jerk, Ask, grasp, wasp, and cork and work.

>>Pronunciation — think of Psyche! Is a paling stout and spikey? Won't it make you lose your wits, Writing groats and saying grits? It's a dark abyss or tunnel: Strewn with stones, stowed, solace, gunwale, Islington and Isle of Wight, Housewife, verdict and indict.

>>Finally, which rhymes with enough — Though, through, plough, or dough, or cough? Hiccough has the sound of cup. My advice is to give up!!!

=====

--

I don't have the energy to defend an F-up -- but there is a LOT of really cool development happening on HN... from AI to all sorts of SHOW and ASK and just an F-TON to keep track of.

I am not an OCD content-influencer focused type...

But know --

the VELOCITY of thought that is flowing through HN and human consciousness, as accelerated by our tipping-the-cup on AI, is having IRL consequences on both mentality and reality...

If there is a community with a higher-velocity firehose of where we are going, share it.

So - we are spewing a firehose of ideas into the quantum future, as unknown boomerangs.

The truth is to understand the boomerangs...

(To de-vague-lize this: Tesla:

Pre-compute an AI token. 3:6:9.

This token is a prime reflection of that.)


lol based, will do


Congratulations on an interesting project. There is a lost opportunity with your natural-language-only approach. The issue is that natural language will never be efficient as an interface. Natural language helps with low domain knowledge - that's the plus side, as it allows the end user to say a variety of phrases to get the desired result. Commands allow for surgical precision, efficiency, and less voice strain for the end user. So there needs to be an approach that allows for both elements: natural language and commands. As users develop their own process and workflow, they will create actions as commands (high domain knowledge).

Since these commands are self-created, end users remember them for their specific purposes. These are often high-frequency commands, where low-frequency use would still leverage the large language model. You have an opportunity here to leverage this workflow. Being able to create commands with a large language model is not something many projects have explored.


I'm curious, what does an example of this look like? (this is one of the fun things I'm working on right now)


> I'm curious, what does an example of this look like? (this is one of the fun things I'm working on right now)

I work with people with disabilities and IoT. If the project is open source I would love to get in touch.

Back to your question: that largely depends on what kind of interface is being used. Is there a screen, or is information conveyed only via text-to-speech? Users aren't going to want to keep track of their customized commands through just a voice interface.

An example: an alias command.

Terminology: utterance (a series of words); action (a programmatic action to do something).

1. (Previous utterances)(Some actions)

2. Alias (new utterance)

3. A new command is created by voice to do the actions in step 1 with the new utterance in step 2.

The new command encapsulates both the context of the command (command availability), if there is one, and the command itself. Think of it as a voice macro. Essentially, it allows you to complete a series of complex tasks with a small voice command.

Alternatively, the alias command without an utterance could trigger a GUI pop-up showing the history of the last commands, where a user can select by voice or touch.

This could work for both the LM and commands. Commands would take priority over the LM for recognition.
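
A rough sketch of that dispatch order in Python - all names hypothetical, just to illustrate exact user-defined commands winning over the language model:

    # Hypothetical dispatcher: user-created alias commands are matched
    # first; anything unmatched falls through to the language model.
    aliases = {}   # utterance -> list of actions (callables)

    def create_alias(utterance, actions):
        # Steps 2-3: bind a new utterance to previously performed actions
        aliases[utterance.strip().lower()] = actions

    def handle(utterance, llm_fallback):
        actions = aliases.get(utterance.strip().lower())
        if actions is not None:          # high-frequency: surgical command
            for action in actions:
                action()
        else:                            # low-frequency: let the LM interpret
            llm_fallback(utterance)

    # Example: alias "sign off" to a previously dictated signature block
    create_alias("sign off", [lambda: print("Best regards,\nSam")])
    handle("sign off", llm_fallback=lambda u: print("LM interprets:", u))

The nice property is that the LM remains the catch-all, so nothing the user says is ever "unrecognized" - aliases just make the hot paths deterministic.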



