Unfortunately, it often hallucinates wrong parameters (or gets their order wrong) when there are multiple different APIs for similar packages. For example, there are plenty of ML model inference packages, and its suggestions for NVIDIA Triton Inference Server Python code are pretty much always wrong: it generates code that’s probably correct for other Python ML inference packages with a slightly different API.
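For reference, the correct client-side boilerplate (via the tritonclient package) looks roughly like this; the model name, tensor names, shape and dtype are placeholders that have to match your config.pbtxt, and the positional InferInput(name, shape, datatype) order is exactly the kind of detail that gets scrambled:

    import numpy as np
    import tritonclient.http as httpclient

    # Connect to a local Triton server (HTTP endpoint, default port 8000).
    client = httpclient.InferenceServerClient(url="localhost:8000")

    # Placeholder names/shape/dtype -- must match the model's config.pbtxt.
    inp = httpclient.InferInput("INPUT__0", [1, 3, 224, 224], "FP32")
    inp.set_data_from_numpy(np.random.rand(1, 3, 224, 224).astype(np.float32))
    out = httpclient.InferRequestedOutput("OUTPUT__0")

    result = client.infer(model_name="my_model", inputs=[inp], outputs=[out])
    print(result.as_numpy("OUTPUT__0").shape)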
Yes, its performance is rather poor and there can be a lot of headaches with caching (especially if you're using a file system that doesn't support reflinks). For large sharded datasets (e.g. WebDataset), you're better off with other solutions, especially when your ML pipeline can stream them directly from object storage.
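For example (bucket and shard names made up), WebDataset itself can stream shards straight out of object storage through a shell pipe, with no local cache involved:

    import webdataset as wds

    # "pipe:" streams each shard through a shell command (here the AWS CLI)
    # instead of downloading and caching it locally.
    urls = "pipe:aws s3 cp s3://my-bucket/shards/train-{000000..000099}.tar -"

    dataset = (
        wds.WebDataset(urls)
        .decode("pil")             # decode images to PIL
        .to_tuple("jpg", "json")   # (image, metadata) pairs per sample
    )

    for image, meta in dataset:
        ...  # hand off to your training loop / DataLoader
        break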
Right, DVC caches data for consistency and reproducibility.
If caching isn't needed and streaming is required, we've created a sister tool, DataChain. It even supports WebDataset and can stream from tar archives and filter images by metadata.
There’s just not a single one-size-fits-all model/pipeline. You choose the right one for the job, depending on whether you need streaming (i.e., low latency, with words output right as they’re spoken), whether it runs on device (e.g. a phone) or on a server, which languages/dialects you need, whether the audio is conversational or more “produced” like a news broadcast or podcast, etc. The best way to decide is to benchmark with data from your target domain.
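A minimal sketch of what that benchmarking can look like (the test sentences and model names are made up; jiwer is just one convenient WER implementation):

    from jiwer import wer

    # Tiny hypothetical in-domain test set: reference transcripts plus each
    # candidate model's output for the same audio clips.
    references = [
        "set the thermostat to seventy two degrees",
        "play the latest episode of the daily",
    ]
    hypotheses_by_model = {
        "model_a": ["set the thermostat to 72 degrees",
                    "play the latest episode of the daily"],
        "model_b": ["set their thermostat to seventy two degrees",
                    "play the latest episode of the day"],
    }

    # jiwer.wer takes lists of reference/hypothesis strings and returns the
    # aggregate word error rate; lower is better.
    for name, hypotheses in hypotheses_by_model.items():
        print(f"{name}: WER = {wer(references, hypotheses):.2%}")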
Sure, you're just going to try lots of things and see what works best, but it's confusing to compare things at such different levels of abstraction: a lot of the time you don't even know what you're comparing, and it's impossible to do an apples-to-apples comparison even on your own test data. If your need is "speaker identification", you're going to end up comparing commercial black boxes like Speechmatics (probably custom) vs commercial translucent boxes like Gladia (some custom blend of whisper + pyannote + etc) vs [asr_api]/[some_specific_sepformer_model]. Like, I can observe that products I know to be built on top of whisper don't seem to handle overlapping speaker diarization that well, but I don't actually have any way of knowing whether that has anything to do with whisper.
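To illustrate the "translucent box" point, a naive whisper + pyannote blend (not any vendor's actual pipeline, just the obvious way to glue the two together) assigns each transcript segment to whichever speaker overlaps it the most, which is precisely why overlapping speech ends up with a single label:

    import whisper
    from pyannote.audio import Pipeline

    # Off-the-shelf components; the blending logic below is deliberately naive.
    asr = whisper.load_model("base")
    diarizer = Pipeline.from_pretrained(
        "pyannote/speaker-diarization-3.1",
        use_auth_token="HF_TOKEN",  # placeholder Hugging Face token
    )

    transcript = asr.transcribe("meeting.wav")   # placeholder audio file
    diarization = diarizer("meeting.wav")

    # One speaker label per ASR segment: the speaker with the most overlap wins.
    for seg in transcript["segments"]:
        overlaps = {}
        for turn, _, speaker in diarization.itertracks(yield_label=True):
            overlap = min(seg["end"], turn.end) - max(seg["start"], turn.start)
            if overlap > 0:
                overlaps[speaker] = overlaps.get(speaker, 0.0) + overlap
        speaker = max(overlaps, key=overlaps.get) if overlaps else "unknown"
        print(f"[{speaker}] {seg['text'].strip()}")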
Their specifications include the technology used in telephony, such as encodings/codecs, link adaptation schemes, etc. They also publish reference implementations of many of the algorithms used.
There are some approaches that use an LLM to generate “scripts” (you can think of them as a DSL) for composing/arranging media, essentially driving other models to generate parts of the media. One example is WavJourney: https://audio-agi.github.io/WavJourney_demopage/
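Roughly, the generated "script" is a structured plan where each element is rendered by a specialized model and then mixed together. This is a purely illustrative sketch, not WavJourney's actual schema:

    # Purely illustrative, not WavJourney's actual format. The LLM emits a
    # structured plan; a renderer dispatches each element to a specialized
    # model (TTS, music generation, sound effects) and mixes the results.
    audio_script = [
        {"type": "music",  "desc": "soft piano intro", "start": 0.0, "length": 8.0},
        {"type": "speech", "voice": "narrator", "text": "Welcome to the show.", "start": 2.0},
        {"type": "sfx",    "desc": "door creaking open", "start": 6.5, "length": 1.5},
    ]

    def render(script):
        for element in script:
            ...  # call the TTS / music / sfx generator matching element["type"]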
Not OP, but I agree that this could lead to questionable learning outcomes, especially since Whisper isn’t that good for low-resource languages. It’s probably fine for languages like English/Spanish/Mandarin, though.