
Video here: https://youtu.be/5rPJyrU-WE4

Can anyone explain why people would use text to speech for something like this, when they have perfectly good voices themselves?




Lots of people don't have perfectly good voices, and if you can copy the best voices, why settle for lesser ones? There are likely plenty of voices people prefer to your own.

To give an example, 15-kun recently built on the Pony Preservation Project, using neural nets to voice-clone, among others, My Little Pony voices, and offering it as a service: https://fifteen.ai/ People have used it for all sorts of things: https://www.equestriadaily.com/2020/03/pony-voice-event-what... Suppose you want to do, say, an F1 commentary on the Austrian GP 2019 (#4): why do it with your voice if you can do it with Fluttershy's?

This will be the next evolution of streamers, especially Virtual Youtubers and their ilk.


Because reading from a script for five minutes is likely to require multiple takes for someone who isn't a practiced voice actor, while text to speech requires no extra effort on their part?


> Because reading from a script for five minutes is likely to require multiple takes for someone who isn't a practiced voice actor

This depends on how much you can tolerate speech errors. Most listeners will gloss over them, preferring the human voice to the speech synthesizer while not even really noticing the errors.


None of the authors appear to be native English speakers, so perhaps they're self-conscious about their accents?


Could also have speech problems. Could be lazy. Could want to save time. Could be useful for producing consistent CC information across mediums. Could allow people to choose arbitrary voice synthesis in the future, which super-futurists may like the idea of. Could have used a translator to produce the text (I haven't listened) and not know English at all.

Personally, I'll take the human voice unless you literally cannot speak (e.g. disability) or feel uncomfortable.


Many academics like the ability to "compile" LaTeX, and probably want to "compile" their videos too, complete with an autogenerated script. That way, when they make a small change to their source, a new video is automatically generated with an updated script.
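A minimal sketch of what such a pipeline could look like, assuming a plain-text script, a pre-rendered slides video, pyttsx3 for synthesis, and ffmpeg for muxing (script.txt, slides.mp4, and the other file names are placeholders):

    # Sketch only: pyttsx3 + ffmpeg are one possible toolchain, not
    # necessarily what the authors used. File names are placeholders.
    import subprocess
    import pyttsx3

    def build_narration(script_path: str, wav_path: str) -> None:
        """Synthesize the narration track from the plain-text script."""
        text = open(script_path, encoding="utf-8").read()
        engine = pyttsx3.init()
        engine.save_to_file(text, wav_path)
        engine.runAndWait()  # blocks until the WAV file is written

    def mux(video_path: str, wav_path: str, out_path: str) -> None:
        """Replace the video's audio with the generated narration."""
        subprocess.run([
            "ffmpeg", "-y",
            "-i", video_path, "-i", wav_path,
            "-map", "0:v", "-map", "1:a",  # video from input 0, audio from input 1
            "-c:v", "copy",                # don't re-encode the video track
            out_path,
        ], check=True)

    if __name__ == "__main__":
        build_narration("script.txt", "narration.wav")
        mux("slides.mp4", "narration.wav", "talk.mp4")

Wrapped in a Makefile or CI job, re-running this after editing script.txt is exactly the one-click rebuild described elsewhere in this thread.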


Yup.

I like a real voice, but I could see that if you were generating lots of videos, and doing so in multiple languages, you could abstract that away a bit and dynamically generate videos these days. This is part of the reason video containers are separated into components/layers (video, audio, and subtitle tracks). I don't see why you couldn't also read the subtitle data and generate the audio dynamically based on language; some of this probably already happens somewhere by some group. Just an idea I found interesting, similar to composing documents with LaTeX: think of how visual frameworks separate out a "presentation" layer, and imagine the audio track playing the same role. It's especially useful for videos where the speaker isn't visible, so syncing audio with lip movements across languages isn't a problem. A rough sketch of the idea is below.
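Here is that sketch, under some assumptions: ffmpeg handles the container-level demux/mux, the subtitle streams in talk.mkv are text-based with the stream indices guessed below, and srt_to_speech() is a hypothetical TTS step (synthesize each cue, place it at its timestamp):

    # Rough sketch only. talk.mkv, the stream indices, and srt_to_speech()
    # are all assumptions for illustration.
    import subprocess

    def extract_subs(src: str, sub_index: int, out_srt: str) -> None:
        """Demux one subtitle stream from the container as SRT."""
        subprocess.run(["ffmpeg", "-y", "-i", src,
                        "-map", f"0:s:{sub_index}", out_srt], check=True)

    def add_audio_track(src: str, wav: str, lang: str, out: str) -> None:
        """Append a language-tagged audio track next to the existing streams."""
        subprocess.run([
            "ffmpeg", "-y", "-i", src, "-i", wav,
            "-map", "0", "-map", "1:a",     # keep all original streams, add new audio
            "-c", "copy", "-c:a:1", "aac",  # re-encode only the appended track
            "-metadata:s:a:1", f"language={lang}",
            out,
        ], check=True)

    # extract_subs("talk.mkv", 1, "subs_de.srt")
    # narration = srt_to_speech("subs_de.srt")   # hypothetical TTS step
    # add_audio_track("talk.mkv", narration, "deu", "talk_multi.mkv")

Because the new narration lives in its own tagged track, players can offer it as a language choice without touching the video stream at all.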


My thought as well. The TTS is good enough that it won't take much of an accent before the accent is harder to understand than the TTS. I know my own accent is strong enough that I'd have to put in very conscious effort to be easier to understand than this video.


I've found it's surprisingly hard to get a satisfactory recording setup - noise, volume, echoes.


I did research on a speech-to-text-to-speech system, and many non-native English speakers were self-conscious about their speech and preferred text-to-speech that wasn't in the style of their original voice.

Also, it's much simpler to make changes to a publication video, since using your own voice requires re-recording with a high-quality microphone and post-processing of background noise.


Judging from SIGGRAPH videos/presentations, it's pretty common for graphics researchers who are non-native speakers to use text-to-speech or a native-speaking acquaintance for narration. I think it's done explicitly to help comprehension, although I think self-consciousness or fear of public speaking plays a role, too.


Well, I have a slight speech impediment that maybe 1 in 1000 people notices in person, once in a while, and then thinks they must be mistaken; but if they were listening to a video they might hear it more clearly.

Although I can get rid of it if I focus.


In addition to the possible reasons stated by peer comments, revisability. One-click builds are just as attractive for writing as they are for programming.



