Unfortunately, it often hallucinates wrong parameters (or gets their order wrong) when there are multiple different APIs for similar packages. For example, there are plenty of ML model inference packages, and its suggestions for NVIDIA Triton Inference Server Python code are pretty much always wrong: it generates code that’s probably correct for other Python ML inference packages with a slightly different API.
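For reference, the correct client-side boilerplate (via the tritonclient package) looks roughly like this; the model name, tensor names, shape and dtype are placeholders that have to match your config.pbtxt, and the positional InferInput(name, shape, datatype) order is exactly the kind of detail that gets scrambled:

    import numpy as np
    import tritonclient.http as httpclient

    # Connect to a local Triton server (HTTP endpoint, default port 8000).
    client = httpclient.InferenceServerClient(url="localhost:8000")

    # Placeholder names/shape/dtype -- must match the model's config.pbtxt.
    inp = httpclient.InferInput("INPUT__0", [1, 3, 224, 224], "FP32")
    inp.set_data_from_numpy(np.random.rand(1, 3, 224, 224).astype(np.float32))
    out = httpclient.InferRequestedOutput("OUTPUT__0")

    result = client.infer(model_name="my_model", inputs=[inp], outputs=[out])
    print(result.as_numpy("OUTPUT__0").shape)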
Yes, its performance is rather poor and there can be a lot of headaches with caching (especially if you're using a file system that doesn't support reflinks). For large sharded datasets (e.g. WebDataset), you're better off with other solutions, especially when your ML pipeline can stream them directly from object storage.
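For example (bucket and shard names made up), WebDataset itself can stream shards straight out of object storage through a shell pipe, with no local cache involved:

    import webdataset as wds

    # "pipe:" streams each shard through a shell command (here the AWS CLI)
    # instead of downloading and caching it locally.
    urls = "pipe:aws s3 cp s3://my-bucket/shards/train-{000000..000099}.tar -"

    dataset = (
        wds.WebDataset(urls)
        .decode("pil")             # decode images to PIL
        .to_tuple("jpg", "json")   # (image, metadata) pairs per sample
    )

    for image, meta in dataset:
        ...  # hand off to your training loop / DataLoader
        break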
Right, DVC caches data for consistency and reproducibility.
If caching isn't needed and streaming is required, we've created a sister tool, DataChain. It even supports WebDataset and can stream from tar archives and filter images by metadata.
There’s just not a single one-size-fits-all model/pipeline. You choose the right one for the job, depending on whether you need streaming (i.e., low latency, with words output right as they’re spoken), whether it runs on device (e.g. a phone) or on a server, which languages/dialects you need, whether the audio is conversational or more “produced” like a news broadcast or podcast, etc. The best way to decide is to benchmark with data from your target domain.
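A minimal sketch of what that benchmarking can look like (the test sentences and model names are made up; jiwer is just one convenient WER implementation):

    from jiwer import wer

    # Tiny hypothetical in-domain test set: reference transcripts plus each
    # candidate model's output for the same audio clips.
    references = [
        "set the thermostat to seventy two degrees",
        "play the latest episode of the daily",
    ]
    hypotheses_by_model = {
        "model_a": ["set the thermostat to 72 degrees",
                    "play the latest episode of the daily"],
        "model_b": ["set their thermostat to seventy two degrees",
                    "play the latest episode of the day"],
    }

    # jiwer.wer takes lists of reference/hypothesis strings and returns the
    # aggregate word error rate; lower is better.
    for name, hypotheses in hypotheses_by_model.items():
        print(f"{name}: WER = {wer(references, hypotheses):.2%}")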
Sure, you're just going to try lots of things and see what works best, but it's confusing to compare things at such different levels of abstraction: a lot of the time you don't even know what you're comparing, and it's impossible to do an apples-to-apples comparison even on your own test data. If your need is "speaker identification", you're going to end up comparing commercial black boxes like Speechmatics (probably custom) vs commercial translucent boxes like Gladia (some custom blend of whisper + pyannote + etc) vs [asr_api]/[some_specific_sepformer_model]. Like, I can observe that products I know to be built on top of whisper don't seem to handle overlapping speaker diarization that well, but I don't actually have any way of knowing whether that has anything to do with whisper.
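To illustrate the "translucent box" point, a naive whisper + pyannote blend (not any vendor's actual pipeline, just the obvious way to glue the two together) assigns each transcript segment to whichever speaker overlaps it the most, which is precisely why overlapping speech ends up with a single label:

    import whisper
    from pyannote.audio import Pipeline

    # Off-the-shelf components; the blending logic below is deliberately naive.
    asr = whisper.load_model("base")
    diarizer = Pipeline.from_pretrained(
        "pyannote/speaker-diarization-3.1",
        use_auth_token="HF_TOKEN",  # placeholder Hugging Face token
    )

    transcript = asr.transcribe("meeting.wav")   # placeholder audio file
    diarization = diarizer("meeting.wav")

    # One speaker label per ASR segment: the speaker with the most overlap wins.
    for seg in transcript["segments"]:
        overlaps = {}
        for turn, _, speaker in diarization.itertracks(yield_label=True):
            overlap = min(seg["end"], turn.end) - max(seg["start"], turn.start)
            if overlap > 0:
                overlaps[speaker] = overlaps.get(speaker, 0.0) + overlap
        speaker = max(overlaps, key=overlaps.get) if overlaps else "unknown"
        print(f"[{speaker}] {seg['text'].strip()}")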
Their specifications include the technology used in telephony, such as encodings/codecs, link adaptation schemes, etc. They also publish reference implementations of many of the algorithms used.
There are some approaches that use an LLM to generate “scripts” (you can think of them as a DSL) for composing/arranging media, essentially driving other models to generate parts of the media. One example is WavJourney: https://audio-agi.github.io/WavJourney_demopage/
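Roughly, the generated "script" is a structured plan where each element is rendered by a specialized model and then mixed together. This is a purely illustrative sketch, not WavJourney's actual schema:

    # Purely illustrative, not WavJourney's actual format. The LLM emits a
    # structured plan; a renderer dispatches each element to a specialized
    # model (TTS, music generation, sound effects) and mixes the results.
    audio_script = [
        {"type": "music",  "desc": "soft piano intro", "start": 0.0, "length": 8.0},
        {"type": "speech", "voice": "narrator", "text": "Welcome to the show.", "start": 2.0},
        {"type": "sfx",    "desc": "door creaking open", "start": 6.5, "length": 1.5},
    ]

    def render(script):
        for element in script:
            ...  # call the TTS / music / sfx generator matching element["type"]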
Not OP, but I agree that this could lead to questionable learning outcomes, especially since Whisper isn’t that good for low-resource languages. It’s probably fine for languages like English/Spanish/Mandarin, though.