I suspect this is coming. We already have decent text-to-speech systems, but in this vein of “we used neural networks and now it’s very, very good” you can imagine extending something like GPT-3 with a speech-to-text system, so you could speak to it for input. The natural progression from there is using text-to-speech to return the output, so you end up with a voice-oriented conversational system.
So I think TTS is a logical part of the system. I also think that there are peculiarities of voice interaction that aren’t captured in text training datasets, so they would need to do some fine-tuning on actual voice conversations to make it feel natural.
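To make that concrete, here’s a minimal sketch of the loop I’m imagining. The three functions are hypothetical placeholders for whatever STT, LLM, and TTS models you’d actually plug in, not any particular library’s API:

```python
def transcribe(audio: bytes) -> str:
    """Placeholder for a speech-to-text model."""
    return "hello there"

def generate_reply(prompt: str) -> str:
    """Placeholder for a large language model."""
    return f"You said: {prompt}"

def synthesize(text: str) -> bytes:
    """Placeholder for a text-to-speech model."""
    return text.encode("utf-8")

def converse(audio_in: bytes) -> bytes:
    # Speech in -> text -> model reply -> speech out.
    return synthesize(generate_reply(transcribe(audio_in)))
```

The point being that once each stage is a good neural model, the glue code is almost trivial; the hard part is making the conversation itself feel natural.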
A full NLP system would include speech recognition, TTS, a large language model, and a vector search engine. The language model should be multi-modal, multi-language and multi-task, "multi-multi-model" for short haha. I'm wondering when we'll have this stack as a default on all OSes. We want to be able to search, transcribe, generate speech, run NLP tasks on the language model, and integrate with external APIs via intent detection.
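For the intent detection piece, something like this toy sketch is the shape of it. The intents and keyword matching here are invented for illustration; a real system would presumably classify intents with the language model itself:

```python
# Toy intent router: map an utterance to an intent, then dispatch
# to the matching external API (or fall back to the language model).
INTENT_KEYWORDS = {
    "weather": ["weather", "forecast", "rain"],
    "search":  ["find", "search", "look up"],
}

def detect_intent(utterance: str) -> str:
    text = utterance.lower()
    for intent, keywords in INTENT_KEYWORDS.items():
        if any(k in text for k in keywords):
            return intent
    return "chat"  # no API matched, let the language model handle it

print(detect_intent("What's the weather tomorrow?"))  # -> weather
print(detect_intent("Tell me a joke"))                # -> chat
```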
On the search part there are lots of vector search companies - Weaviate, Deepset Haystack, Milvus, Pinecone, Vespa, Vald, GSI and Qdrant. But vector search hasn't become generally deployed yet; people are only just finding out about it. Large language models are still difficult to run locally, and all of these models require plenty of RAM and GPU, so the barrier to entry is still high.
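For anyone unfamiliar, the core idea behind all of these engines is roughly the following (a toy numpy sketch with made-up random embeddings; the real systems add approximate-nearest-neighbor indexes so it scales past brute force):

```python
import numpy as np

# Pretend these are embedding vectors for 1000 documents.
rng = np.random.default_rng(0)
docs = rng.normal(size=(1000, 384))
docs /= np.linalg.norm(docs, axis=1, keepdims=True)  # unit-normalize

# Embed the query the same way, then rank by cosine similarity.
query = rng.normal(size=384)
query /= np.linalg.norm(query)

scores = docs @ query                 # dot product = cosine similarity
top5 = np.argsort(scores)[::-1][:5]   # indices of the 5 nearest docs
print(top5, scores[top5])
```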
Ah very interesting, thank you. I’m not familiar with research into vector search, I’ll look that up.
But yeah, you make a good point about LLMs being too large to run on a normal PC. I do somewhat suspect that we might see rapid advances in neural network processors as large models begin to offer more utility. I think for now they have limited appeal, but we’re already seeing things like Tesla’s Dojo make large leaps in the ability to process complex networks quickly.
In five to ten years we may see built-in accelerators capable of running very complex models come standard in most computers. Apple already ships ever more powerful accelerators in its phones. You could imagine Adobe offering real-time diffusion models as part of Photoshop, among other things.
All in due time I suppose.