Alas, no (made some contributions to TTS at G, and worked on Android). iOS has t...

anotherevan · 2024-01-02T00:42:55 1704156175

> I'll try to ping you

Thanks. Use the email address in my profile if anything eventuates.

> is an Android app with a keyboard and "speak" button that does API calls to eleven labs sufficient for something worth trying?

Maybe. Obviously something with local processing would be preferred, but it might be an option when internet connectivity is good. Is there such an app?

refulgentis · 2024-01-02T03:15:17 1704165317

There isn't an ElevenLabs app like that, but I think that's the most expedient method, by far. (i.e. O(days) instead of O(months))

(warning: detailed opinionated take, I suggest skimming)

Why? Local inference is hard. You need two things: the clips to voice model (which we have here, but bleeding edge), and text + voice -> speech model.

Text to voice to speech, locally, has excellent prior art for me, in the form of a Raspberry Pi-based ONNX inference library called [Piper](https://github.com/rhasspy/piper). I should just be able to copy that, about an afternoon of work! :P

Except...when these models are trained, they encode plaintext to model input using a library called eSpeak.

eSpeak is basically f(plaintext) => ints representing phonemes.

eSpeak is a C library and written in a style I haven't seen in a while and depends on other C libraries. So I end up needing to port like 20K lines of C to Dart...or I could use WASM, but over the last year, I lost the ability to be able to reason through how to get WASM running in Dart, both native and web.

Re: ElevenLabs

I had looked into the API months ago and vaguely remembered it was _very_ complete.

I spent the last hour or two playing with it, and reconfirmed that. They have enough API surface that you could build an app that took voice recordings, created a voice, and then did POSTs / socket connection to get audio data from that voice at will.

Only issue is pricing IMHO, $0.18 for 1000 characters. :/ But this is something I feel very comfortable saying wouldn't be _that_ much work to build and open source with a "bring your own API key" type thing.

I had forgotten about Eleven Labs till your post, which made me realize there was an actually meaningful and quite moving use case for it. All of Elevens advantages (cloning, peak quality by a mile) come into play, and the disadvantages are blunted: local voice cloning isn't there yet, and $0.18 / 1000 characters doesn't matter as much when it's interpersonal exchanges instead of long AI responses

l-albertovich · 2024-01-02T15:05:00 1704207900

Wouldn't it be better to use FFI to build an idiomatic interface to use in Dart instead?

refulgentis · 2024-01-02T15:54:53 1704210893

It's a good point but I'm a perfectionist and can't abide without a web version.

though, now that I write that...

Native: FFI.

Web: Dart calling simple JS function, and the JS handles WASM.

...is an excellent sweet spot. Matches exactly what I do with FONNX. The trouble with WASM is Dart-bounded.

(n.b. re: local cloning for anyone this deep, this would allow local inference of the existing voices in the Raspberry Pi x ONNX voicer project above. It won't _necessarily_ help with doing voice cloning locally, you'll need to prove out that you can get a voice cloning model in ONNX to confirm.)

(n.b. re: translating to Dart, I think the only advantage of a pure Dart port would be memory safety stuff but I also don't think a pure Dart port is feasible without O(months) of time. The C is...very very very 2000s C. globals in one file representing current state that 3 other files need to access. array of structs formed by just reading bytes from a file at runtime that matches the struct layout)