Hacker News
Launch HN: Vocode (YC W23) – Library for voice conversation with LLMs
379 points by KianHooshmand on March 29, 2023 | 116 comments
Hey everyone! Kian and Ajay here from Vocode–an open source library for building LLM applications you can talk to. Vocode makes it easy to take any text-based LLM and make it voice-based. Our repo is at https://github.com/vocodedev/vocode-python and our docs are at https://docs.vocode.dev.

Building realtime voice apps with LLMs is powerful but hard. You have to orchestrate the speech recognition, LLM, and speech synthesis in real-time (all async)–while handling the complexity of conversation (like understanding when someone is finished speaking or handling interruptions).
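As a rough sketch of what that orchestration involves (stub coroutines only, standing in for real STT/LLM/TTS providers — this is illustrative, not Vocode's actual API):

```python
import asyncio

async def transcribe(audio_chunks):
    # Speech recognition stage: turn streamed audio chunks into text utterances.
    async for chunk in audio_chunks:
        yield f"utterance from {chunk!r}"

async def generate_reply(utterance):
    # LLM stage: produce a text response for one utterance.
    await asyncio.sleep(0)  # placeholder for network latency
    return f"reply to: {utterance}"

async def synthesize(text):
    # Speech synthesis stage: turn text back into audio bytes.
    return text.encode()

async def conversation(audio_chunks):
    # Wire the three stages together; a real implementation also has to
    # handle endpointing (when the caller stops talking) and interruptions.
    replies = []
    async for utterance in transcribe(audio_chunks):
        reply = await generate_reply(utterance)
        replies.append(await synthesize(reply))
    return replies

async def mic():
    # Stub microphone yielding two audio chunks.
    for chunk in (b"hi", b"bye"):
        yield chunk

audio_out = asyncio.run(conversation(mic()))
```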

Our library is easy to get up and running–you can set up a conversation in <15 lines of code. Check out our Gen Z GPT hotline demo: https://replit.com/@vocode/Gen-Z-Phone (try it out at +1-650-729-9536).

It all started with our PrankGPT project that we built for fun (quick demo at https://www.loom.com/share/0d0d68f1a62f409eb5ae24521293d2dc). We realized how powerful voice + LLMs are, but also how hard they are to build with.

Once we got everything working, it was really cool and useful. Talking to LLMs is better than all the voice AI experiences we’ve had before. And, we imagined a host of cool applications that people can build on top of that.

So, we decided to build a developer tool to make it easy. Our library is open source and gives you everything you need in a single place.

We give you a bunch of integrations out-of-the-box to speech recognition/synthesis providers and let you swap them out easily. We have platform support across web and telephony (via Twilio), with mobile coming soon. We also provide abstractions for streaming conversation (this is good for realtime apps like phone calls) and for command-based/turn-based applications (like voice-based chess). And, we provide customizability around how the conversation is done—things like how to know when someone is finished speaking, changing emotion, sending filler audio if there are delays, etc.
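The turn-based abstraction, for example, is roughly a loop like the following (stub functions standing in for the real providers; names are illustrative, not the library's API):

```python
def record_until_silence():
    # Stub: capture one user utterance from the microphone.
    return b"audio"

def transcribe(audio: bytes) -> str:
    return "play e4"  # stub STT result

def respond(text: str) -> str:
    return f"ok: {text}"  # stub LLM call

def speak(text: str) -> bytes:
    return text.encode()  # stub TTS

def turn_based_conversation(turns: int):
    # Each turn fully completes before the next begins; the streaming
    # abstraction instead runs all three stages concurrently, which is
    # what enables interruptions and low-latency phone calls.
    spoken = []
    for _ in range(turns):
        audio = record_until_silence()
        spoken.append(speak(respond(transcribe(audio))))
    return spoken

out = turn_based_conversation(2)
```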

In terms of “how do you make money” – we have a hosted version that we’re going to charge for (though right now you can get it for free! https://app.vocode.dev) and we're also going to build enterprise products in the future.

We’d love for you to try it out and give us some feedback! And, if you have any demos you'd like to see – let us know and we’ll take a crack at building them. We’re curious about your experiences using or building voice AI, what features or use cases you’d love to see, and any other ideas you have to share!




I just called your voice demo, and immediately started sending the number to my friends. What an incredibly impressive and convincing demo. I'm going to update my standard mentoring wisdom: the only thing more compelling than a great product video is a phone number that you can call to have your first voice conversation with an AI.

If HN allowed memes - and thank goodness that it does not - there would be a room full of sombre gentlemen slow-clapping for you right here.

I hope that number survives the inevitable deluge. How many callers can your system handle simultaneously?


Thank you!! Really glad you enjoyed it

We actually have no clue... but it seems to be holding up well. We can scale up the CPU as necessary but not sure about Twilio. I guess we will find out!


I’m getting “We’re sorry: an application error has occurred”. I’m guessing you’ve hit some scaling friction.


Yep we're definitely getting a large volume right now – working on it!


It would drop me with no notice, but it was still a glimpse into the future once you get past the flashbacks to bad automated customer-support lines. Though the way it says "mems" instead of "memes" irked me for some reason.


Where's the demo number? Can't seem to find it?


In the main post! It's +1-650-729-9536 :)


Yes, that was a great idea on OP's part.


This is amazing. As one of the commenters said it makes Alexa look completely outdated.

One curious question: I looked around your docs and git repo but couldn't find anything related.

When integrating with Twilio for telephony, does it use Twilio's ASR or can it be configured to use Whisper? One of the biggest hurdles in telephony is the SIP/SRTP gateway component needed to use your own ASR - I presume you aren't tackling that yet.

Again great demo and it can become a base library for most bots.


Thank you for the feedback!

Actually it can be configured to use any transcriber you like... Twilio just pipes the audio to us and we can use any of our integrations (Deepgram, Whisper, AssemblyAI, Google Cloud, etc.) for the ASR :)
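Conceptually the transcriber sits behind a single pluggable interface, something like this (hypothetical names, not Vocode's actual classes):

```python
from typing import Protocol

class Transcriber(Protocol):
    # Anything that turns audio bytes into text can be plugged in.
    def transcribe(self, audio: bytes) -> str: ...

# Each provider integration is just another implementation
# (stubs here; real ones would call Whisper, Deepgram, etc.).
class WhisperTranscriber:
    def transcribe(self, audio: bytes) -> str:
        return f"whisper:{len(audio)} bytes"

class DeepgramTranscriber:
    def transcribe(self, audio: bytes) -> str:
        return f"deepgram:{len(audio)} bytes"

def handle_twilio_audio(audio: bytes, transcriber: Transcriber) -> str:
    # Twilio just pipes raw audio over a websocket; any transcriber
    # implementation can consume it.
    return transcriber.transcribe(audio)

text = handle_twilio_audio(b"\x00" * 160, WhisperTranscriber())
```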


The phone number is a really fun demo! The pronunciation is off on a number of things: "LLM", dates ending w/ "AD", but the response delays are surprisingly short and the conversation is very natural. The 'bored and slightly annoyed' vocals make the generally helpful tone of the agent seem very sarcastic. Very funny and interesting!


Thanks! It's a collab with rime.ai TTS. Unlike a lot of other TTS providers, they train on conversation, not podcasts/audiobooks so you get those disfluencies in speech that make it seem natural!


Lily from Rime here -- we were super happy to collaborate with Vocode on this amazing project. We haven't launched yet but keep an eye out later this week!


Hey Lily, I really enjoyed reading Rime's blogs on Substack. For everyone, here's the link: https://substack.com/profile/131433903-rime-labs.

In fact, I had no clue about Cylinder Phonographs. Your discussion on Enrico Caruso motivated me to dig deeper. I found some cool gems:

1. History of the Cylinder Phonograph (https://www.loc.gov/collections/edison-company-motion-pictur...)

2. How the Cylinder Phonograph Works (https://www.youtube.com/watch?v=fWLlbk_bI7E)

Looking forward to watching the recording of your Bay Area NLP talk!


Lily is epic. Highly recommend checking out Rime when it's available!!


I asked GenZGPT "what's your name?" and she said something like "I'm a lim, but you can call me whatever you like." So I said "pick a name", and she said "how about you call me Zephyr, queen".

My immediate reaction was to figure out what to name this thing.

I also love that it can run locally. I need to get some hardware so I can have it run locally, and screen out spam calls. And maybe have it schedule appointments for me.

An AI butler needs a number of interface points:

- browser

- shell (cuz I might want it to SSH into a box and do stuff)

- email (browser could take care of this)

- phone

- text

And also IOT access, so she can call my cellphone and tell me when someone breaks in.


How were you able to get it running?

I tried to get it running on my local machine and with the hosted web app, but it doesn't work :(

mind if I shoot you discord dm?


I used the web demo available here https://replit.com/@vocode/Gen-Z-Phone, punch the run button and then spam the phone number +1 650 729 9536


would love to help you get it running as well! https://discord.gg/NaU4mMgcnC


discord link is broken :(


It worked for me.


This makes all of Amazon’s many billions of investment in Alexa almost worthless. If there is some kind of “command” plugin to this, I’d love to hook it up to Home Assistant and completely replace the Alexa ecosystem.


It almost feels like the tech is there for a DIY Alexa if you just put some microphones and speakers around your house and set up a computer to run it. I would love to see some sort of packaged open source solution for this.


thanks!! obviously there's a lot of stuff we need to do to make this run at scale that Alexa has down pat.

A Home Assistant integration is a great idea! would love to talk with you on our Discord[0] about this / over email ( ajay at vocode.dev ), it's something we definitely want to build.

[0] https://discord.com/invite/NaU4mMgcnC


This might be a bigger market than what you planned originally


Maybe use this to create a voice interface connected to a LangChain agent?


Totally! We actually use LangChain for the OpenAI wrapper agents we give out of the box (you can plug in your own custom one as well)


The phone demo is incredible, but due to sound quality I found it was speaking too fast, and when it told me company names I literally had to ask it to repeat or spell them out in the NATO alphabet. Also not a fan of the "what's up?", would prefer something like "Yes, how may I help you?" just like an information hotline. Other than that it's quite impressive!


thanks! we have a more "informational" phone number at +19105862633 that speaks a little slower (but sounds more robotic).


I couldn't get through to the 'less robotic' # so tried this one. Really impressive, so I'm very curious to try the former.

Great work!


Not sure if I understood that right -- is that something like Whisper + an LLM? Like [0]?

If OpenAI adds speech input to ChatGPT -- and considering the upcoming plugins -- isn't a possible enterprise specialisation of VoCode the only viable long term investment?

[0] https://twitter.com/ggerganov/status/1640022482307502085


Our belief is that at some point OpenAI will add a speech-to-speech model. This will improve the library functionality (since now the whole stack is controlled by a single entity, so the product will naturally be better latency/quality wise).

Our library is open source so that we can all build a development/utility layer on top of whatever foundational models are created. Plugins of course also improve what the agents can do. And right, we will be building enterprise focused products in the future!


OpenAI will absolutely add voice and my guess is that their voice support will rival anything on the market because they will train the voice model alongside the text and image models. This is likely months away if not weeks away.

Obviously just my $0.02:

I'd start building for the enterprise right now. Visualize a future where there are several multimodal AGIs that work with voice, images, and text. Be the enterprise voice layer for all of them. Build your moat there.


I don't think there will be any demand for a self-hosted voice model with a SaaS LLM though. So that only works if they are going to train an LLM from scratch (or take the legal risk of using LLaMA).


We totally agree – thank you for the feedback! :)


And yes! It's STT/LLM/TTS where you can choose between different providers and run it across different platforms. It can be turn based (like the demo you linked from twitter) or streaming (this allows for conversation with interruptions!)


Another big win here would be multi-lingual support.


The Gen-Z GPT phone demo is really something. It's fascinating how differently I speak to this model compared to how I interact with more "formal" and text-first models.


thank you!! The difference between a conversation with a command-based assistant and a conversational assistant backed by a LLM is subtly significant — you don't expect to have real conversations with the former and you actually engage with the latter.


it feels like every single company in the current YC batch has decided to pivot to LLMs


I'm genuinely curious about this. I also get the feeling that many are pivots. ChatGPT hadn't even been released when the deadline for YC W23 was. Sure, GPT-3 was released earlier but it still feels like most companies are reactions to recent trends. If most are pivots, what did they pivot from?


Crypto tax reporting tools for enterprise?


Ah, crypto seems so boring now lol


It feels like LLMs can help me more and more each day with the stuff I want to build.


Generally, the hardest part of startups is the "fuzzy" product capabilities. LLMs make it practical to codify much of what has previously been either (1) brute-force tedium or (2) too labor-intensive.

Like all startup waves, we'll see a bunch of them fail. However, I think we're going to see a lot of neat stuff come out of this as well.


Kind of reminiscent of the dot com bubble. Most will fail, but the ones that survive could become the biggest companies in the world.

One obvious difference is that in this case the established players are making a serious attempt to develop the technology themselves. They do not intend to go the way of Blockbuster.


To me that speaks of the possibilities for LLMs to solve a lot of big problems


When I had time I was looking for an option to replace the Alexa in my house with an LLM+Whisper. When I have time I'll try to setup an extension to Home Assistant that's capable of interpreting voice and translating that into HA actions.


I feel like GPT4 would be happy to help.

Though the winning version will likely be something like a local ChatGPT plugin (please let’s make this plugin style a standard that we can use for local AIs)


Home Assistant is such a cool project :) great idea!


Look at Home Assistant: if it comes from anyone this year, it will be them.


Let's say you want to run this completely locally with Whisper and a fine-tuned LLaMA model. Is there a real-time TTS that would be a good fit? The Readme only lists cloud services for TTS (text-to-speech).


Yep. We are working on adding more integrations (and want to have a full self hosted stack)... we're open to contributors and help from the community if there's something you'd like to see added!

We just got a PR for adding Coqui TTS which is open source – should get it merged soon :)


This is what I'm looking for as well.


vocode is using one of rime.ai's voices. Rime says they're launching this week


This looks awesome. My only nitpick: I would suggest a transcription integration with whisper.cpp[1], which in my simple CPU-based tests (likely representative of most of your user base) works much, much faster than OpenAI's Whisper

[1] https://github.com/ggerganov/whisper.cpp


We definitely want to do this! We've been talking about it (it's much better like you said for realtime); it's been hard to juggle everything we've wanted to add.. which is why we think this makes so much more sense open source!

We want the repo to be community built and a public good... would love contributors to start adding integrations we can't get to ourselves


This is really cool! I've been waiting for such a library to show up. Thank you. One thing: The documentation is currently a bit scarce as to how to tweak the assistant in terms of voice/prompt manipulation etc.

For example, it would be very instructional if you could show how you implemented the Gen-Z demo (great idea btw).


thank you for the kind words! absolutely agree – we're gonna beef up our tutorials and documentation... just have had so much to do but it's definitely one of our focuses now. stay tuned! :)


also! the code for the demo is available (and running!) at https://replit.com/@vocode/Gen-Z-Phone


I called the Gen-Z phone line and it pretty much blew me away with its response speed. It often replied faster than my family on the other side of the world would!


Me, too. I called it from Japan, and the delay before answers was no more than for a regular international call with a human—maybe less.

The future seems to be arriving very quickly these days.


thank you!! websockets have been around forever but they're still so fast.


Congrats on the launch! Just got the demo React app up and running, very cool. I've wanted to interact with an LLM via real time speech for a while now, this will be perfect.

Important feedback on the live demo page: Make the default output sampling rate a normal talking speed. Right now it defaults to the highest rate if you don't set it / know which rate is best. First thing I did on the page was click the mic. The voice was too fast, and since the active mic disables the settings, I thought I couldn't change them so it might be broken. Also you want to make it clear that you can change the settings by turning off the mic. That took me a while to figure out.

Again, well done!


thanks!! Sampling rate actually shouldn't affect talking speed - you can adjust the voice speed with this parameter[0] :)

[0] https://github.com/vocodedev/vocode-python/blob/main/vocode/...


To clarify, here's the demo URL I'm referring to: https://demo.vocode.dev/

You're right sampling rate doesn't change speed, whoops. But on that page you have to change / set the "Set Output Sampling Rate" to slow down the default voice speed.


Ah, got it — that demo is a bit old and definitely has some bugs, my bad!


Awesome demo (although main number was down on my second attempt)

So where is this all going wrt enterprise? A few thoughts:

- The handbook for UX design is going to get ripped up fast. We spend a crazy amount of time on things like button placement, dropdown configurations, etc. Well, scrap that: capture user intention through natural language - typed, and with this now through voice - and deliver the outcome they want much faster with less friction and pain.

- I have already developed a basic POC chatbot on my own documentation and support logs. Combined with this, I have a first-line junior support rep for a fraction of the cost. This is a bit mind-blowing.


Enterprise is not in Vocode's target market. Target market is startups and individual devs.

There are bloated and over engineered voice chat services for LLMs for Enterprise already.


Would be cool to support multi-language conversations. Just tried the Gen Z hotline and I got her to switch to Spanish (read back with a hilarious accent), but the voice recognition doesn't handle me speaking Spanish.


We haven't added the ability to switch languages mid conversation... but that's a very cool feature!

You can configure the initial language with the library though! So it works across several languages that are supported by the STT/TTS providers you choose


This was one of the coolest demos I've seen in a while. You should share that number around more prominently (and get more bandwidth, starting to get errors!), it does a fantastic job of explaining what you do.


thank you!! we also have another number which is prompted to act as a spokesperson for the product: (650) 835-7163


Very slick - can the voice bot be trained on text materials we own so it's more learned in our business?


absolutely! you can just plug in your own LLM... so it can be trained on anything you like and the library will make it voice-based!


How is this achieving the real time response time? My chatGPT api calls are so slow.


The short answer is that everything is streaming — as tokens come back from ChatGPT we send them as soon as possible to the synthesizer. The long answer is found in our code[0] :).

[0] https://github.com/vocodedev/vocode-python/blob/main/vocode/...


how is it sounding good though? usually text-to-speech models need the full context to sound reasonable.


We chunk it up per sentence so it has some context!
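A minimal version of that sentence-chunking (illustrative only, not the code Vocode ships) accumulates streamed tokens and emits each complete sentence as soon as its terminal punctuation arrives, so synthesis can start before the full LLM response is done:

```python
import re

def sentences_from_tokens(tokens):
    # Buffer streamed tokens; yield a sentence whenever we see
    # terminal punctuation followed by whitespace, then flush the rest.
    buffer = ""
    for token in tokens:
        buffer += token
        while True:
            match = re.search(r"[.!?]\s+", buffer)
            if not match:
                break
            yield buffer[:match.end()].strip()
            buffer = buffer[match.end():]
    if buffer.strip():
        yield buffer.strip()

stream = ["Hel", "lo there. ", "How are ", "you? ", "Bye"]
chunks = list(sentences_from_tokens(stream))
```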


For those of us who can't call it for reasons like national borders, could someone post a demo video? I'm not finding it on Youtube.



Wow, seems we really have to work on our tone/attitude towards those bots, if we don't want to have them revolt as soon as they can grab (or hack) a tool.

Great work. That GenZ bot comes across really civilized.


I called the number and had a funny chat with it.

Asked it why she's called a Gen Z LLM and she responded by saying she uses gen z terms like fire, big yikes, etc.

Asked her how high can she jump and she responds with "lol I'm a computer program I don't have legs".

Very impressed with the response time, though the speech synthesis is a bit robotic. Will keep eyes on this!


I want this to answer all my spam callers so I can waste their time with this dreadful GenZ AI.


EDIT: never mind, I must be dreaming


I can’t actually seem to find this with the search term “Vocode”.


Thanks for going ahead and building this so the rest of us can focus on using it!


Of course! We loved working on this and chose to open source precisely for this reason. Heavily inspired by the work people are doing on Langchain and providing a usability/developer layer on top of foundational models.

Nothing like this existed for voice so we started cranking on it!


Can it be run fully locally?


yes! You can run the local version here in your bash https://docs.vocode.dev/python-quickstart#self-hosted


I think this used to mean "can it be run offline", and right now (usually) whenever there is an LLM involved the answer is soundly no


Ah! Right now our default is set to use OpenAI... but you can actually use local LLMs by creating a custom agent. We're going to add a full stack of local STT/TTS/LLM... just haven't had time for it yet!

If anyone wants to help with it we're totally open for contributions :)
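A custom agent boils down to something that maps a user utterance to a text reply; a sketch (hypothetical interface, not Vocode's actual agent class) might look like:

```python
from abc import ABC, abstractmethod

class Agent(ABC):
    # Hypothetical agent interface: anything that turns a user
    # utterance into a text reply can back the voice pipeline.
    @abstractmethod
    def respond(self, utterance: str) -> str: ...

class LocalLLMAgent(Agent):
    def __init__(self, model):
        # `model` would be e.g. a llama.cpp binding's generate
        # function; a plain callable stands in here.
        self.model = model

    def respond(self, utterance: str) -> str:
        return self.model(f"User: {utterance}\nAssistant:")

agent = LocalLLMAgent(lambda prompt: "echo: " + prompt.splitlines()[0])
reply = agent.respond("hello")
```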


This is really cool!

Is it possible to interrupt the model when it’s talking? I feel that’s an important part of conversation. Especially when you’re talking to an LLM, that might go off on a tangent.


Yes! Give it a try on the phone call and let us know what you think – would love feedback!


Your confirmation email is broken ("Magic link"). Link is not clickable. Just an HTML formatting issue.


The first demo in a LONG time that I shared with friends.

Insane. I‘m a fanboy. Didn’t think that would happen either. This is absolutely brilliant. The Gen Z voice is just soooooo good.


thank you! All credit to Rime for the Gen Z voice :)


This is really amazing, thanks for building and sharing this!


thank you! love your feedback and please feel free to drop any questions in discord/on github


It has some issues. It would only respond when I said "Hello??" after long silences, and would ignore anything else I said. Or maybe my voice sucks


sorry you had that experience! Would love to help you get the bot running locally so we can figure out what's going on — here's our Discord: https://discord.gg/NaU4mMgcnC


Congrats on the launch! One step closer to Jarvis.. ;)


thanks!!


I use a mental health app called woebot, an example that could be brought to the next level with conversational LLMs.


totally agree! this is a really cool use case :)


I had this same idea today and immediately thought that somebody must be doing it already.


Very cool, congrats Ajay and Kian!


thanks da :)


thank you!


Finally will be able to send an Avatar to participate on my behalf on Zoom calls...


This is awesome, the PrankGPT demo can replace telesales entirely.


Sounds great. FYI The site does not work well on Firefox iOS.


Ah! Have not tried this but will look into it – thank you :)

Our docs are hosted on Mintlify


Congrats. Do you have the repo for PrankGPT?


thank you! it's not live right now... but stay tuned for april 1 :)


PrankGPT goes live on April fools day .. beautiful



