What challenges have you faced? I’ve been planning to do the same thing (whisper fo...

gorbypark · 2023-12-04T08:09:46.000000Z

Mostly issues with plumbing it all together. So far I've used BlackHole to create a virtual speaker/microphone, so system audio shows up as a microphone input device, and then fed that input into Whisper. Out of the box, Whisper seems designed to work on WAV files, not a stream of audio. There are some "hacks" that chunk 2 or 3 seconds of audio and feed each chunk into Whisper. Of course, it needs to be smarter than fixed-length chunking, because a chunk boundary can cut a word in half. There are some projects attempting to do this intelligently (splitting chunks between words). That's as far as I've got, besides feeding Whisper's output into SeamlessM4T, which also doesn't seem to accept streamed text and instead wants a text file or a command-line argument.
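To make the "split between words" idea concrete, here's a rough sketch of one simple heuristic: instead of cutting at exactly 3 seconds, look in a window around the target length and cut at the quietest short frame. This is only an illustration, not what Whisper or any particular project does (real implementations usually rely on a proper voice activity detector). The function names and the `target_s`, `search_s`, and `frame_s` parameters are made up for this example, and the input is assumed to be a plain sequence of mono float samples.

```python
from typing import List, Sequence


def frame_energies(samples: Sequence[float], frame_len: int) -> List[float]:
    """Mean squared amplitude of each full, non-overlapping frame."""
    energies = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energies.append(sum(s * s for s in frame) / frame_len)
    return energies


def split_at_quiet_points(
    samples: Sequence[float],
    sample_rate: int,
    target_s: float = 3.0,   # hypothetical: desired chunk length in seconds
    search_s: float = 1.0,   # hypothetical: how far around the target to look
    frame_s: float = 0.02,   # hypothetical: 20 ms analysis frames
) -> List[Sequence[float]]:
    """Cut audio into roughly target_s-long chunks at low-energy frames."""
    frame_len = max(1, round(sample_rate * frame_s))
    energies = frame_energies(samples, frame_len)
    target_frames = max(1, round(target_s / frame_s))
    search_frames = round(search_s / frame_s)

    chunks = []
    start_frame = 0
    # Keep cutting while a full search window still fits in the remaining audio.
    while len(energies) - start_frame > target_frames + search_frames:
        lo = start_frame + max(1, target_frames - search_frames)
        hi = start_frame + target_frames + search_frames
        # Quietest frame in the window (earliest one on ties).
        cut = min(range(lo, hi + 1), key=lambda i: energies[i])
        chunks.append(samples[start_frame * frame_len:cut * frame_len])
        start_frame = cut
    # Whatever is left becomes the final chunk.
    chunks.append(samples[start_frame * frame_len:])
    return chunks
```

Each chunk could then be written out as a short WAV file (or passed as an array) to whatever transcription step follows. The energy check is crude: it will happily cut in the middle of a quiet consonant, which is why the smarter projects use VAD or word timestamps instead.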

Of course, this was all before the new Seamless v2 drop, so I'm hoping it will become easier! There are a lot of interesting things in this new Seamless release, among them ggml inference and the streaming model. Poking around the code, there's even some Whisper binding in there too, so it seems like a possible integration is already in use (I haven't had time to really dive in).