The performance on Apple Silicon should be much better today compared to what is shown in the video, as whisper.cpp now runs fully on the GPU and there have been significant improvements in llama.cpp generation speed over the last few months.
This is cool. I hooked up Llama to an open-source TTS model for a recent project and there was lots of fun engineering that went into it.
On a different note:
I think the most useful coding copilot tools for me reduce "manual overhead" without attempting to do any hard thinking/problem solving for me (such as generating arguments and types from docstrings or vice-versa, etc.). For more complicated tasks you really have to give copilot a pretty good "starting point".
I often talk to myself while coding. It would be extremely, extremely futuristic (and potentially useful) if a tool like this embedded my speech into a context vector and used it as an additional copilot input so the model has a better "starting point".
I'm a late adopter of copilot and don't use it all the time but if anyone is aware of anything like this I'd be curious to hear about it.
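As a rough sketch of what I mean: the speech could be transcribed with Whisper and simply prepended to the code-completion prompt as extra context. Only the openai-whisper calls below are real; the audio file name, prompt shape, and complete() placeholder are assumptions:

    # Sketch: transcribe what the developer says out loud and feed it to a
    # code model as extra context. whisper.load_model/transcribe come from
    # the openai-whisper package; complete() is a placeholder for any copilot.
    import whisper

    stt = whisper.load_model("base.en")
    spoken = stt.transcribe("mic_capture.wav")["text"]   # hypothetical recording

    code_so_far = "def parse_config(path):\n    "
    prompt = (f"# Developer's spoken intent: {spoken.strip()}\n"
              f"{code_so_far}")

    def complete(prompt: str) -> str:
        # placeholder for whatever completion model/API is actually used
        return "..."

    print(complete(prompt))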
This assertion is not supported by the text of Biden’s Executive Order. There are a number of requirements placed on various government agencies to come up with safety evaluation frameworks, to make initial evaluations related to open weight models, and to provide recommendations back to the President within 270 days. But there’s nothing whatsoever that I can find that outlaws open weight models. There’s also little reason to think that “outlaw them” would be amongst the recommendations ultimately provided to the executive.
(I can imagine recommendations that benefit incumbents by, for instance, placing such high burden on government adoption of open weight models that OpenAI is a much more attractive purchase. But that’s not the same as what you’re talking about.)
I dunno, the EO seems pretty easy to read. Am I missing something in the text?
Oh they have. Many of the industry leaders are importuning them for this. For industry leaders this is a business wet dream. It's "regulatory capture" at its finest.
OpenAI has lobbied them among others, already. Our elected officials accept written law from lobbyists. Money and power games don't look out for common folks except incidentally or to the extent commoners mobilize against them.
I honestly can't blame OpenAI. They likely threw a huge amount of money at training and fine-tuning their models. I don't know how open source will surpass them, rather than staying a second-tier solution, without another hugely generous event like Facebook open-sourcing LLaMa.
The only thing this directs is... consulting with a variety of groups, including industry, and writing a report based on that consultation.
So, literally, they can't enforce it without consulting with industry, since enforcement just amounts to one part of the government holding another part of the government accountable for consulting with, among others, the industry.
> Soliciting Input on Dual-Use Foundation Models with Widely Available Model Weights. When the weights for a dual-use foundation model are widely available — such as when they are publicly posted on the Internet — there can be substantial benefits to innovation, but also substantial security risks, such as the removal of safeguards within the model. To address the risks and potential benefits of dual-use foundation models with widely available weights, within 270 days of the date of this order, the Secretary of Commerce, acting through the Assistant Secretary of Commerce for Communications and Information, and in consultation with the Secretary of State, shall...
I believe that is the relevant section. I am hoping they realize how dumb this is going to be.
China and Russia will keep using US models because they don't care about US laws. I think if restrictions on AI are only applied in the US, such restrictions will only put Americans at a disadvantage.
I feel like this is another case of screens for e-book readers, where a single player had everybody else in a chokehold for decades and slowed innovation to a drip.
But this would be akin to them saying "you know, bytecode could be used for evil, and we'd like to regulate/outlaw the development of new virtual machines like the one we already have".
> this cat is already out of the bag, so this is pointless legislation that just hampers progress
This isn't legislation.
And it's not proposing anything except soliciting input from a broad range of civil society groups and writing a report. If the report calls for dumb regulatory ideas, that'll be appropriate to complain about. But "there's a thing that is happening that seems like it might have big effects, so gather input from experts and concerned parties and then write up findings on the impacts and what, if any, action seems warranted" is... not a particularly alarming thing.
This is insanity. I have to be missing something: what else do the safeguards prevent the LLM from doing? This has to be about more than just preventing an LLM from using bad words or showing support for Trump...
Yes, but companies will have no incentive to publish open-source models anymore. Or it could become so difficult/bureaucratic that no one will bother, and they'll keep everything closed source.
Mistral and Falcon are not from megacorps, and not even from the US, and there are many other open-source Chinese models.
And both are base models, which means they are totally organic and developed outside of the US.
That’s what they told us. Turns out Google stopped innovating a long time ago. They could say stuff like this when Bard wasn’t out, but now we have Mistral and friends to compare to Llama.
Now it turns out they were just bullshitting at Google.
> Now it turns out they were just bullshitting at Google.
I don't think Google was bullshitting when they wrote, documented and released Tensorflow, BERT and flan-t5 to the public. Their failure to beat OpenAI in a money-pissing competition really doesn't feel like it reflects on their capability (or intentions) as a company. It certainly doesn't feel like they were "bullshitting" anyone.
Everyone told us they had secret tech that they were keeping inside. But then Bard came out and it was like GPT-3. I don’t know man. The proof of the pudding is in the eating.
> The innovation is largely happening within the megacorps anyway
That was the part I was replying to. Whichever megacorp this is, it’s not Google.
Hey, feel free to draw your own conclusions. AI quality is technically a subjective topic. For what it's worth though, Google's open models have benched quite competitively with GPT-3 for multiple years now: https://blog.research.google/2021/10/introducing-flan-more-g...
The flan quantizations are also still pretty close to SOTA for text transformers. Their quality-to-size ratio is much better than a 7b Llama finetune, and it appears to be what Apple based their new autocorrect off of.
Still, one of those corporations wants to capture the market and has a monopolistic attitude. Meta clearly chose the other direction by publishing their models and allowing us all to participate.
Then we'll create a distributed infrastructure for the creation of models. Run some program and donate spare GPU cycles to generate public AI tools that will be made available to all.
I'm getting a "floating point exception" when running ./talk-llama on arch and debian. Already checked sdl2lib and ffmpeg (because of this issue: https://github.com/ggerganov/whisper.cpp/issues/1325) but nothing seems to fix it. Anyone else?
I'm not sure what changed, but basically I purged ffmpeg and libsdl2-dev and the `make` output in the root of the repo. Then I installed libsdl2 and ffmpeg and ran `make talk-llama`.
It's quite slow on 4 core i7-8550U and 16 GB of RAM.
Aren't there text-to-speech solutions that can receive a stream of text, so one doesn't have to wait for llama to finish producing before getting the answer spoken out?
I guess it'd only work if the model can keep the buffer filled fast enough so the tts engine doesn't stall.
I can't tell if you're disparaging the usage or not (truly, I can't tell), but such utterances exist because they serve a real function. Disfluency is an integral part of speech.
I think it's a good idea, if done well. It could also potentially be combined with dynamically adjusting speed of the speech, and reducing or increasing the use of shortcuts and contractions, making word replacements.
I now wish for a model built to be a low-computation filter that takes text in and produces padded text out, intended for TTS and annotated with pauses, sounds, and extra words, that maintains the same meaning but provides the ability to dynamically adjust the level of verbosity to maintain a fixed rate of words per minute.
Timing and emphasis work better if you know where the sentence is going; otherwise you sound like one of those translators at the UN with the flat stream of words.
I mean.. 99% of the current TTS engines wouldn't know timing and emphasis if it hit them.
Besides humans do this all the time, we start saying words before we even have, uhh, any idea how we're gonna end the sentence and for the most part it, uhh, works out. Should be doable.
The difference is that if you are the one saying the sentence, you at least have some idea where it's going. Receiving one word at a time from a different source isn't the same.
I suppose it could be buffered, and used to only shorten the wait in cases with long responses, rather than aiming for perfect and push out words as they come. Besides, GPTs sound like translators anyway.
ElevenLabs and Gemelo.AI are services that both support text input streaming for exactly this use case. I am not aware of any open-source Incremental TTS model (this is the term used in research, afaik), but you can already achieve something similar by buffering tokens and sending them to the TTS model on punctuation characters.
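For example, roughly like this (a minimal Python sketch; the llama-cpp-python streaming calls are real, but the model path and the speak() placeholder are assumptions and not tied to any particular TTS service):

    # Sketch: stream LLM tokens and flush the buffered text to a TTS engine
    # whenever a sentence boundary is hit, so speech can start before the
    # full response is generated.
    from llama_cpp import Llama

    PUNCTUATION = {".", "!", "?", ":", ";"}

    def speak(text: str) -> None:
        # placeholder: hand the chunk to your TTS engine of choice
        print(f"[TTS] {text.strip()}")

    llm = Llama(model_path="models/llama-2-7b-chat.Q4_0.gguf")  # hypothetical path

    buffer = ""
    for chunk in llm("User: How do transformers work?\nAssistant:",
                     max_tokens=256, stream=True):
        piece = chunk["choices"][0]["text"]
        buffer += piece
        if piece and piece[-1] in PUNCTUATION:
            speak(buffer)   # flush on punctuation so the TTS can start early
            buffer = ""

    if buffer.strip():
        speak(buffer)       # flush whatever is left at the end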
ElevenLabs only has streaming output available. I've had a look at both recently and ElevenLabs doesn't have streaming input listed as a feature. Would be cool if it had it, though. You could probably approximate this on a sentence level, but you would need to do some normalisation to make the speech sound even.
Would it be possible to reduce lag by streaming groups of ~6 tokens at a time to the TTS as they're generated, instead of waiting for the full LLM response before beginning to speak it?
Yes, I was planning to do this back then, but other stuff came up.
There are many different ways in which this simple example can be improved:
- better detection of when speech ends (currently basic adaptive threshold)
- use a small LLM for a quick, generic response while the big LLM computes (see the sketch after this list)
- TTS streaming in chunks or sentences
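Roughly what that second point could look like as a Python sketch (the model paths and the llama-cpp-python usage are placeholder assumptions, not part of talk-llama):

    # Sketch: a tiny model produces a short generic acknowledgement right
    # away, while the big model generates the real answer in a background
    # thread. Both model paths are hypothetical.
    import threading
    from llama_cpp import Llama

    small = Llama(model_path="models/tinyllama-1.1b.Q4_0.gguf")
    big = Llama(model_path="models/llama-2-13b-chat.Q4_0.gguf")

    def answer(prompt: str) -> None:
        result = {}

        def run_big():
            result["text"] = big(prompt, max_tokens=256)["choices"][0]["text"]

        t = threading.Thread(target=run_big)
        t.start()

        # speak something generic immediately while the big model is busy
        filler = small(prompt + "\nReply with one short acknowledgement:",
                       max_tokens=16)["choices"][0]["text"]
        print("[TTS filler]", filler.strip())

        t.join()
        print("[TTS answer]", result["text"].strip())

    answer("User: Summarize the plot of Hamlet.\nAssistant:")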
One of the better OSS versions of such chatbot I think is https://github.com/yacineMTB/talk.
Though probably many other similar projects also exist by now.
I keep wondering if a small LLM can also be used to help detect when the speaker has finished speaking their thought, not just when they've paused speaking.
That works when you know what you’re going to say. A human knows when you’re pausing to think, but have a thought you’re in the middle of expressing. A VAD doesn’t know this and would interrupt when it hears a silence of N seconds; a lightweight LLM would know to keep waiting despite the silence.
And the inverse: the VAD would wait longer than necessary after a person says e.g. "What do you think?", in case they were still in the middle of talking.
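A rough sketch of what that lightweight check could look like (the prompt wording, the model, and the llama-cpp-python usage here are all assumptions):

    # Sketch: after the VAD reports N seconds of silence, ask a small LLM
    # whether the partial transcript reads like a finished thought; if not,
    # keep listening instead of interrupting.
    from llama_cpp import Llama

    classifier = Llama(model_path="models/tinyllama-1.1b.Q4_0.gguf")  # hypothetical

    def looks_finished(transcript: str) -> bool:
        prompt = (
            "Decide whether the speaker has finished their thought.\n"
            f"Transcript so far: \"{transcript}\"\n"
            "Answer with exactly one word, YES or NO: "
        )
        out = classifier(prompt, max_tokens=2, temperature=0.0)
        return "YES" in out["choices"][0]["text"].upper()

    print(looks_finished("So what I was thinking is, maybe we could"))  # likely False
    print(looks_finished("What do you think about that approach?"))     # likely True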
Here is an open source project that supports voice as well:
https://github.com/cogentapps/chat-with-gpt
I think it is meant to be used with ElevenLabs and OpenAI API, but might be easy to configure for use with local Whisper.cpp + llama?
It's not open source but it is still (for now) free: lmstudio.ai. Chat histories, a good config UI, easy prompt management, model management, model discovery, easy set up, cross platform, able to serve an API for connectivity to other tools.
They're hiring and have no currently disclosed monetization strategy so I expect a rugpull soon where some now-free feature gets paywalled or purposefully crippled, but it's not like porcelain apps for free LLMs that rely entirely on llama.cpp to function can do vendor lock-in. I'd second ollama if OSS is a higher priority than features though.
This makes me wonder, what's the equivalent to ollama for whisper/SOTA OS tts models? I'm really happy with ollama for locally running OS LLMs, but I don't know of any project that makes it that simple to set up whisper locally.
I've just been looking for SOTA TTS. I found coqui.ai and elevenlabs.io (and a bunch of others). They're good (and better than older TTS), but I am not fooled by any of them. Do you have recommendations?
Could anyone explain the capability of this in plain English? Can this learn and retain context of a chat and build on some kind of long term memory? Thanks
I'm not an LLM expert by any means but here is my take.
It's Speech Recognition -> Llama -> Text to Speech, running on your own PC rather than that of a third party.
The limitations on the context of the LLM are those of the model being used (e.g. Llama 2, Wizard Vicuna, whatever is chosen), in whatever compatible configuration is set by the user regarding context window etc., and given a preliminary transcript. (The LLM doesn't "reply" to the user in a sense; it just predicts the best continuation of a transcript between the user and a useful assistant, resulting in it successfully pretending to be a useful assistant, and thus being a useful assistant. It's confusing.)
I can imagine that it's viable to get that kind of behaviour by modifying the pipeline.
If the architecture was instead Speech Recognition -> Wrapper[Llama] -> Text 2 Speech, where "Wrapper" is some process that lets Llama do its thing but hooks onto the input text to add some additional processing, then things could get interesting.
The wrapper could analyse the conversation and pick out key aspects ("The person's name is Bob, male, 35, he likes dogs, he likes things to be organised, he wants a reminder at 5pm to call his daughter, he is an undercover agent for the Antarctic mafia, and he prefers to be spoken to in a strong Polish accent") and perform actions based on that:
- Set a reminder at 5pm to call his daughter (through e.g. HomeAssistant)
- Configure the text-2-speech engine to use a Polish accent
- Modify the starting transcript for future runs:
- Put his name as the human's name within the underlying chat dialogue
- Provide a condensed representation of his interests and personality within the preliminary introduction to the next chat dialogue
This way there's some interactivity involved (through actions performed by some other tool), some continuity (by modifying the next chat dialogue) and so on.
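A very rough sketch of such a wrapper might look like this (the extraction prompt, the JSON format, and the llama-cpp-python calls are all illustrative assumptions, not anything talk-llama provides):

    # Sketch of Wrapper[Llama]: after each exchange, ask the model to pull
    # out durable facts about the user and fold them into the preliminary
    # transcript used for the next run.
    import json
    from llama_cpp import Llama

    llm = Llama(model_path="models/llama-2-7b-chat.Q4_0.gguf")  # hypothetical

    def extract_facts(conversation: str) -> list:
        prompt = (
            "List durable facts about the user from this conversation as a "
            "JSON array of short strings. Conversation:\n"
            f"{conversation}\nJSON: "
        )
        raw = llm(prompt, max_tokens=128, temperature=0.0)["choices"][0]["text"]
        try:
            return json.loads(raw.strip())
        except json.JSONDecodeError:
            return []   # model didn't return valid JSON; skip this round

    def build_next_preamble(facts: list) -> str:
        memory = "\n".join(f"- {f}" for f in facts)
        return ("A transcript between a user and a helpful assistant.\n"
                f"Known about the user:\n{memory}\n")

    facts = extract_facts("User: Hi, I'm Bob. Remind me at 5pm to call my daughter.")
    print(build_next_preamble(facts))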
I've been wondering about how feasible it is to simulate long term memory by running multiple LLMs at the same time. One of them would be tasked with storing and retrieving long term memories from disc, so it'd need to be instructed about some data structure where memories were persisted, and then you'd feed it the current context, instructing it to provide a way to navigate the memory data structure to any potentially relevant memories. Whatever data was retrieved could be injected into the prompt to the next LLM, which would just respond to the given prompt.
No idea what sort of data structure could work. Perhaps a graph database could be feasible, and the memory prompt could instruct it to write a query for the given database.
This is achieved using vector databases to store memories as embeddings. Then you can retrieve a “memory” closest to the question in the embedding space.
This is an active area of research. The best we currently have is vector databases and/or sparse hierarchical information storage (you retrieve a summary of a summary via vector search, find associated summaries via vector search once more, then pluck out the actual data item and add it to the prompt).
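A minimal sketch of that retrieval step, using sentence-transformers embeddings and plain cosine similarity standing in for a real vector database:

    # Sketch: store past snippets as embeddings, then pull the closest ones
    # back into the prompt. A real vector database (FAISS, Chroma, ...)
    # would replace the numpy search below.
    import numpy as np
    from sentence_transformers import SentenceTransformer

    encoder = SentenceTransformer("all-MiniLM-L6-v2")

    memories = [
        "The user's name is Bob and he likes dogs.",
        "Bob wants a reminder at 5pm to call his daughter.",
        "Bob prefers short answers.",
    ]
    memory_vecs = encoder.encode(memories, normalize_embeddings=True)

    def recall(question: str, k: int = 2) -> list:
        q = encoder.encode([question], normalize_embeddings=True)[0]
        scores = memory_vecs @ q              # cosine similarity (unit vectors)
        best = np.argsort(scores)[::-1][:k]
        return [memories[i] for i in best]

    context = "\n".join(recall("When should Bob be reminded about his call?"))
    prompt = f"Relevant memories:\n{context}\n\nUser: When is my call?\nAssistant:"
    print(prompt)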
Does anybody have a quick start for building all of this on Windows?
I could probably check it out as a VS project and build, but I'm going to bet that since it's not documented it's going to have issues, specifically because the Linux build instructions are the only ones treated as a first-class citizen...
What are currently the best/go-to approaches to detect the end of an utterance? This can be tricky even in conversations between humans, requiring semantic information about what the other person is saying. I wonder if there’s any automated strategy that works well enough.
Voice activity detection (referred to as VAD for short) is what you're looking for. Personally, I'd prefer a hold-down hotkey a la press-to-speak for more immediate UI feedback, but both work pretty well.
It waits for sufficient silence to determine when to stop recording the voice and send it to the model. There are other modes in the source as well, and methods of setting the length of silences in order to chunk up and send bits at a time, but I imagine that is either work in progress or not planned for this demo.
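A bare-bones version of that kind of silence-based endpointing might look like this (the frame length, threshold, and sample rate are arbitrary illustrative choices, not whisper.cpp's actual parameters):

    # Sketch: treat the utterance as finished once the short-term RMS energy
    # of the incoming audio stays below a threshold for `silence_sec` seconds.
    import numpy as np

    SAMPLE_RATE = 16000
    FRAME_SEC = 0.03                          # 30 ms frames
    FRAME_LEN = int(SAMPLE_RATE * FRAME_SEC)

    def utterance_ended(audio: np.ndarray, threshold: float = 0.01,
                        silence_sec: float = 1.0) -> bool:
        needed = int(silence_sec / FRAME_SEC)
        frames = [audio[i:i + FRAME_LEN]
                  for i in range(0, len(audio) - FRAME_LEN + 1, FRAME_LEN)]
        if len(frames) < needed:
            return False
        energies = [float(np.sqrt(np.mean(f ** 2))) for f in frames[-needed:]]
        return all(e < threshold for e in energies)

    # e.g. call utterance_ended(ring_buffer) as each new chunk arrives, and
    # hand the accumulated audio to whisper once it returns True.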
I was surprised they didn’t combine this work with the streaming whisper demo, so I guess I will implement that for iOS/macOS: streaming whisper results in realtime without waiting on an audio pause, but, as you say, using the audio pauses and other signals like punctuation in the result to determine when to run the LLM completion. It also makes me wonder about streaming whisper results into the LLM incrementally, before it's ready for completion.
It may be using the streaming demo. The reason I know to answer your question is that I had modified the streaming demo myself for personal use before. I think there are bugs in the silence detection code (as of a few months back, maybe fixed now). Maybe what we are seeing in this demo is just the "silence detection" setting waiting for very long pauses; I believe it's configurable.
A major benefit (for me) is that I can build it with "make" and know that it just works.
Now perhaps this is a skills issue on my part (because I'm not a Python dev), but I've had endless trouble with Python-based ML projects, with some requiring I use/avoid specific 3.x versions of Python, each project's install instructions seemingly using a different tool to create a virtual environment to isolate dependencies, and issues tracking down specific custom versions of core libraries in order to allow the use of GPU/Neural Engine on Apple Silicon.
The whisper.cpp and llama.cpp projects just build and run so easily by comparison.
> A major benefit (for me) is that I can build it with "make" and know that it just works.
So, as someone who has never gotten around to doing this, and who also likes not having to deal with the Python tools, it's not quite that simple. Steps I had to take for talk-llama after cloning whisper.cpp:
* apt install libsdl2-dev (Linux; other steps elsewhere)
* make talk-llama from the root of whisper.cpp, not from the examples/talk-llama directory
* ./download-ggml-model.sh small.en from the models directory
* Tried to run it with the command line in the README, had it seg fault after failing to open ../llama.cpp/models/llama-13b/ggml-model-q4_0.gguf, cloned llama.cpp, and found the file is not in the repo.
* Searched through the README for how to find the models, and found I needed to go searching elsewhere because no URLs were listed.
* Had to install a bunch of Python dependencies to quantize the models...
This is still far from "build and run". Though I will fully believe that a lot of the Python-based ML projects are worse.
I've had the same experience. One of the things I like most about llama.cpp is the relatively straightforward build process, at least when compared to the mess of Python library requirements you run into if you want to experiment with ML models.
Having said that, I have the sense that the ML ecosystem is coalescing around using `venv` as a standard for Python dependencies. Most of the build instructions for Python ML projects I've seen recently begin with setting up the environment using venv, and in my experience, it works fairly reliably. I don't particularly like downloading gigabytes of dependencies for each new project, but that mess of dependencies is what's powering the rapid pace of prototypes and development.
As another non-python dev, interested in and trying to get into AI/ML, I think the limitation of venv is that it can't handle multiple versions of system libraries.
CUDA for example, different project will require different versions of some library like pytorch, but these seem to be tied to cuda version. This is where anaconda (and miniconda) come in, but omfg, I hate those. So far all anaconda does is screw up my environment, causing weird binaries to come into my path, overriding my newer/vetted ffmpeg and other libraries with some outdated libraries. Not to mention, I have no idea if they are safe to use, since I can't figure out where this avalanche (literally gigs) of garbage gets pulled in. If I don't let it mess with my startup scripts, nothing works.
And note, I'm not smart, but I've been a user of UNIX from the 90's and I can't believe we haven't progressed much in all these decades. I remember trying to pull in source packages and compiling them from scratch and that sucked too (make, cmake, m4, etc). But package managers and other tech has helped the general public that just wants to use the damn software. Nobody wants to track down and compile every dependency and become an expert in the build process. But this is where we are. Still.
I am currently trying to get these projects working in Docker, but that is a whole other ordeal that I haven't completed yet, though I am hopeful that I'm almost there :) Some projects have Dockerfiles and some even have docker-compose files. None have worked out-of-the-box for me. And that's both surprising and sad.
I don't know where the blame lies exactly. Docker? The package maintainers that don't know docker or unix (a lot of these new LLM/AI projects are windows only or windows-first and I hear data scientists hate-hate-hate sysadmin tasks)? Nvidia for their eco-system? Dunno, all I know is I'm experiencing pain and time wastage that I'd rather not deal with. I guess that's partly why open-ai and other paid services exist. lol.
I'm in the same situation. I found this cog project to dockerise ML (https://github.com/replicate/cog): you write just one Python class and a YAML file, and it takes care of the "CUDA hell" and deps. It even creates a Flask app in front of your model.
That helps keep your system clean, but someone with big $s please rewrite pytorch to golang or rust or even nodejs / typescript.
This. The Python machine learning ecosystem is the singular most difficult to navigate crustlefuck [0] that I've experienced in my life - issues listed, and more. Good luck if you need specific versions of any hardware drivers or SDKs too.
[0]: *Crustlefuck (noun)*: A labyrinthine, solidified matrix of chaos that has accreted over time. A crustlefuck differs from a clusterfuck due to an added dimension of rigid, ingrained complications that render any attempt at untangling the issues extraordinarily daunting and taxing.
memgpt has: 1. a community based on corporate Discord, 2. a Python ecosystem requiring layers and layers of virtualization/containers and dependency management, 3. a heavy commercial OpenAI-first focus, with local LLMs as an afterthought.
4. There are llama.cpp ways to do pretty much all that it does, and you aren't restricted to just a couple of model types like with memgpt.
long memory/context, easily talk to your docs, define personas, works with gpt4 out-of-the-box but also supports local LLMs... and it comes with a CLI chat app
No. This is a demo of directly prompting a model using voice-to-text.
If your model has a long enough context (models like MistralLite can do 32,000 tokens now, which is about 30 pages of text) you could run a PDF text extraction tool and then dump that text into the model context and ask questions about it with the remaining tokens.
You could also plug this into one of the ask-questions-of-a-long-document-via-embedding-search tools.
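For example, something along these lines (a sketch assuming pypdf for extraction and llama-cpp-python with a long-context model; the file name, model path, and context size are illustrative):

    # Sketch: extract a PDF's text, stuff it into a long-context model, and
    # ask a question with the remaining tokens.
    from pypdf import PdfReader
    from llama_cpp import Llama

    reader = PdfReader("report.pdf")                        # hypothetical file
    document = "\n".join(page.extract_text() or "" for page in reader.pages)

    llm = Llama(model_path="models/mistrallite.Q4_0.gguf",  # hypothetical path
                n_ctx=32768)

    prompt = (f"Document:\n{document}\n\n"
              "Question: What are the key findings?\nAnswer:")
    print(llm(prompt, max_tokens=256)["choices"][0]["text"])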