Talk-Llama (github.com/ggerganov)
474 points by plurby on Nov 2, 2023 | 140 comments



Heh, funny to see this pop up here :)

The performance on Apple Silicon should be much better today compared to what is shown in the video as whisper.cpp now runs fully on the GPU and there have been significant improvements in llama.cpp generation speed over the last few months.


13 minutes between this and the commit of a new demo video, not bad :D

And impressive performance indeed!


Ah, forget the other message, I watched the videos in the wrong order! And I can’t delete or edit using the Hack app!


Is it just me, or is the gpu version actually slower to respond?


You are kinda famous now man. Odds are, people follow your github religiously.


Is ggerganov to LLM what Fabrice Bellard is to QuickJS/QEMU/FFMPEG?


That's a big burden to place on anyone.


I have sent a PR to move that new demo to the top. I think the new demo is significantly better.


Is running this on Apple Silicon the most cost effective way to run this, or can it be done cheaper on a beefed up homelab Linux server?


Will this work with the latest distilled Llama?


This is cool. I hooked up Llama to an open-source TTS model for a recent project and there was lots of fun engineering that went into it.

On a different note:

I think the most useful coding copilot tools for me reduce "manual overhead" without attempting to do any hard thinking/problem solving for me (such as generating arguments and types from docstrings or vice-versa, etc.). For more complicated tasks you really have to give copilot a pretty good "starting point".

I often talk to myself while coding. It would be extremely, extremely futuristic (and potentially useful) if a tool like this embedded my speech into a context vector and used it as an additional copilot input so the model has a better "starting point".

I'm a late adopter of copilot and don't use it all the time but if anyone is aware of anything like this I'd be curious to hear about it.
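
Purely as a sketch of that idea (a minimal, hypothetical example: transcribe() and complete_code() are stand-in functions of my own, not real APIs), the spoken "thinking out loud" transcript could simply be prepended to the completion prompt:

  # Hypothetical sketch: turn "thinking out loud" audio into extra completion context.
  # transcribe() and complete_code() are stand-ins for a real STT step (e.g. whisper)
  # and a real code-completion backend; both are assumptions, not existing APIs.

  def transcribe(wav_path: str) -> str:
      """Stand-in for a speech-to-text call (e.g. whisper.cpp via subprocess)."""
      return "I want a helper that retries the request three times with backoff"

  def complete_code(prompt: str) -> str:
      """Stand-in for whatever completion backend is in use."""
      return "# ...model output would go here...\n"

  def build_prompt(spoken_notes: str, file_snippet: str, cursor_hint: str) -> str:
      # The spoken transcript becomes a natural-language "starting point"
      # alongside the code around the cursor.
      return (
          f"# Developer's spoken notes:\n# {spoken_notes}\n\n"
          f"{file_snippet}\n{cursor_hint}"
      )

  if __name__ == "__main__":
      notes = transcribe("mic_capture.wav")
      prompt = build_prompt(notes, "import requests\n", "def fetch_with_retry(url):")
      print(complete_code(prompt))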


A few months until this is effectively outlawed, if the open-weights proposal due in 270 days comes into existence.


This assertion is not supported by the text of Biden’s Executive Order. There are a number of requirements placed on various government agencies to come up with safety evaluation frameworks, to make initial evaluations related to open weight models, and to provide recommendations back to the President within 270 days. But there’s nothing whatsoever that I can find that outlaws open weight models. There’s also little reason to think that “outlaw them” would be amongst the recommendations ultimately provided to the executive.

(I can imagine recommendations that benefit incumbents by, for instance, placing such high burden on government adoption of open weight models that OpenAI is a much more attractive purchase. But that’s not the same as what you’re talking about.)

I dunno, the EO seems pretty easy to read. Am I missing something in the text?

https://www.whitehouse.gov/briefing-room/presidential-action...


Yeah, they're not going to outlaw them. The well-worn path is to make the regulatory burden insurmountable for small companies; that will be enough.

Disobey.


How does one provide the KYC information for open models?


More to the point, why in the world should anyone "know" me to download a file? A simple antithesis of open technology, that's what this proposal is.


Nothing in the order requires KYC information for open models.


I don't see anywhere that it says weights are outlawed. The part I saw says something about making a report on risks or benefits of open weights.

I agree that it is concerning the way it's open-ended. But where is the actual outlawing?


Common sense AI control NOW. Ban assault style GPUs


Nobody needs more than 4GB of VRAM


> Common sense AI control NOW. Ban assault style GPU

That is beautiful, I made you a shirt. https://sprd.co/ZZufv7j


The laws were designed for the GPUs of 2020; there's no way they could have predicted where technology would go.


How seriously threatening is this? How can they enforce something this stupid without even consulting with industry leaders?


Oh they have. Many of the industry leaders are importuning them for this. For industry leaders this is a business wet dream. It's "regulatory capture" at its finest


OpenAI has lobbied them among others, already. Our elected officials accept written law from lobbyists. Money and power games don't look out for common folks except incidentally or to the extent commoners mobilize against them.


I honestly can't blame OpenAI. They likely threw a huge amount of money at training and fine-tuning their models. I don't know how open-source will surpass them and not stay a second-tier solution without another huge generous event occurring like Facebook open-sourcing LLaMa


It seems like we need a distributed training network where people can donate GPU time, but then any training result is automatically open.


Great idea!


The only thing this directs is... consulting with a variety of groups, including industry, and writing a report based on that consultation.

So, literally, they can’t enforce it without consulting with industry, since enforcement is just someone in the government holding someone else in government accountable for consulting with, among others, the industry.


Can you elaborate please?



Soliciting Input on Dual-Use Foundation Models with Widely Available Model Weights. When the weights for a dual-use foundation model are widely available — such as when they are publicly posted on the Internet — there can be substantial benefits to innovation, but also substantial security risks, such as the removal of safeguards within the model. To address the risks and potential benefits of dual-use foundation models with widely available weights, within 270 days of the date of this order, the Secretary of Commerce, acting through the Assistant Secretary of Commerce for Communications and Information, and in consultation with the Secretary of State, shall...

I believe that is the relevant section. I am hoping they realize how dumb this is going to be.


China and Russia will keep using US models because they don't care about US laws. I think if restrictions on AI are only applied in the US, such restrictions will only put Americans at a disadvantage.

P.S. I'm from one of the said countries


I feel like this is another case of screens for e-book readers, where a single player had everybody else in a chokehold for decades and slowed innovation to a drip.


This exists solely to protect the incumbents (OpenAI et al)


OpenAI has truly lived up to its name!

Not even Oracle dared to pull shit like this w/ Java.


Eh, I may be misremembering my history, but didn't they try back in the day and quickly get smacked down for one reason or another?


They do defend their business, tooth and nail.

But this would be akin to them saying "you know, bytecode could be used for evil, and we'd like to regulate/outlaw the development of new virtual machines like the one we already have".


This is huge news... Has anybody seen any discussions around that topic?



This is not a discussion, this is just you posting verbatim what the executive order said.


> I believe that is the relevant section. I am hoping they realize how dumb this is going to be.

How dumb it's going to be to... solicit input from a wide range of different areas and write a report?


> the removal of safeguards within the model

this cat is already out of the bag, so this is pointless legislation that just hampers progress

I have already had good-actor success with uncensored models


> this cat is already out of the bag, so this is pointless legislation that just hampers progress

This isn't legislation.

And it's not proposing anything except soliciting input from a broad range of civil society groups and writing a report. If the report calls for dumb regulatory ideas, that’ll be appropriate to complain about. But “there’s a thing happening that seems like it might have big effects; gather input from experts and concerned parties and then write up findings on the impacts and what, if any, action seems warranted” is... not a particularly alarming thing.


OK. The way it was portrayed is that this was "very bad" so I assumed something was decreed without sufficient input from the industry/community.


> removal of safeguards within the model

This is insanity. I have to be missing something; what else do the safeguards prevent the LLM from doing? There has to be more to this than preventing an LLM from using bad words or showing support for Trump...


we just have to download the weights and use piratebay, or even emule, just like old times


Yes, but companies will have no incentive to publish open-source models anymore. Or it could be so difficult/bureaucratic that no one will bother, and they'll keep them closed source.


What will actually happen is that innovation there will just move somewhere else (and it has partly done so).

This proposal is the US doing that bicycle/stick meme, it will backfire spectacularly.


The innovation is largely happening within the megacorps anyway, this is solely intended to make sure the innovation cannot move somewhere else.


Mistral and Falcon are not from megacorps, and not even from the US, and there are many other open-source Chinese models. Both are base models, which means they are totally organic and built outside of the US.


That’s what they told us. Turns out Google stopped innovating a long time ago. They could say stuff like this when Bard wasn’t out but now we have Mistral and friends to compare to Llama.

Now it turns out they were just bullshitting at Google.


> Now it turns out they were just bullshitting at Google.

I don't think Google was bullshitting when they wrote, documented and released Tensorflow, BERT and flan-t5 to the public. Their failure to beat OpenAI in a money-pissing competition really doesn't feel like it reflects on their capability (or intentions) as a company. It certainly doesn't feel like they were "bullshitting" anyone.


Everyone told us they had secret tech that they were keeping inside. But then Bard came out and it was like GPT-3. I don’t know man. The proof of the pudding is in the eating.

> The innovation is largely happening within the megacorps anyway

That was the part I was replying to. Whichever megacorp this is, it’s not Google.


Hey, feel free to draw your own conclusions. AI quality is technically a subjective topic. For what it's worth though, Google's open models have benched quite competitively with GPT-3 for multiple years now: https://blog.research.google/2021/10/introducing-flan-more-g...

The flan quantizations are also still pretty close to SOTA for text transformers. Their quality-to-size ratio is much better than a 7b Llama finetune, and it appears to be what Apple based their new autocorrect off of.


Still, one of those corporations wants to capture the market and has a monopolistic attitude. Meta clearly chose the other direction by publishing their models and allowing us all to participate.


Then we'll create a distributed infrastructure for the creation of models. Run some program and donate spare GPU cycles to generate public AI tools that will be made available to all.


I really really would like to believe this could work in practice.

Given the current data volume used during the training phase (TB/s), I highly doubt it's possible without two magnitude-changing breakthroughs at once.


I'm getting a "floating point exception" when running ./talk-llama on arch and debian. Already checked sdl2lib and ffmpeg (because of this issue: https://github.com/ggerganov/whisper.cpp/issues/1325) but nothing seems to fix it. Anyone else?


I was struggling with the same error on PopOS 22.04, and this helped: https://github.com/ggerganov/whisper.cpp/issues/352#issuecom...

I'm not sure what changed, but basically I purged ffmpeg and libsdl2-dev and ran `make` in the root of the repo. Then I installed libsdl2 and ffmpeg and ran `make talk-llama`.

It's quite slow on a 4-core i7-8550U with 16 GB of RAM.

basically, in the root of the repo:

  $ sudo apt purge ffmpeg
  $ make clean
  $ git pull
  $ make
  $ sudo apt install libsdl2-dev
  $ make talk-llama
  $ ./talk-llama -mw ./models/ggml-small.en.bin -ml ../llama.cpp/models/llama-2-13b.Q4_0.gguf -p "t0mk" -t 8

HTH


Aren't there text-to-speech solutions that can receive a stream of text, so one doesn't have to wait for llama to finish generating before getting the answer spoken out?

I guess it'd only work if the model can keep the buffer filled fast enough so the tts engine doesn't stall.


Just have llama.cpp emit an “um”, “uhh” etc. when the buffer’s down to one word. :D


You laugh at the end, but I love this solution.


Humans have loved the same solution since we first started talking, as well


Don’t forget to mix it with “apparently”, “you know what I’m saying”, “I mean”, “you know” etc.


I can't tell if you're disparaging the usage or not (truly, I can't tell), but such utterances exist because they serve a real function. Disfluency is an integral part of speech.


I think it's a good idea, if done well. It could also potentially be combined with dynamically adjusting speed of the speech, and reducing or increasing the use of shortcuts and contractions, making word replacements.

I now wish for a model built to be a low-computation filter that takes text in and produces padded text out, intended for TTS and annotated with pauses, sounds, and extra words, that maintains the same meaning but can dynamically adjust the level of verbosity to maintain a fixed rate of words per minute.


I always thought of them as the human equivalent of hard drive noises. <brrrrr brrbrrbr>


array_rand($verbal_fry[$locale]) /* :D */


Timing and emphasis work better if you know where the sentence is going; otherwise you sound like one of those translators at the UN with the flat stream of words.


I mean... 99% of the current TTS engines wouldn't know timing and emphasis if they hit them.

Besides humans do this all the time, we start saying words before we even have, uhh, any idea how we're gonna end the sentence and for the most part it, uhh, works out. Should be doable.


The difference is that if you are the one saying the sentence, you at least have some idea where it's going. Receiving one word at a time from a different source isn't the same.


> you at least have some idea where it's going

Unless you're Michael Scott.


Also there's a big difference between waiting for a buffer of outputs and waiting for all the outputs.


I suppose it could be buffered, and used to only shorten the wait in cases with long responses, rather than aiming for perfect and push out words as they come. Besides, GPTs sound like translators anyway.


ElevenLabs and Gemelo.AI are services that both support text input streaming for exactly this use case. I am not aware of any open-source Incremental TTS (this is the term used in research, afaik) model, but you can already achieve something similar by buffering tokens and sending them to the TTS model on punctuation characters.
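
For anyone wondering what that buffering looks like in code, here is a minimal sketch: stream_tokens() and speak() are placeholders for a real LLM token stream and a real TTS call, and the punctuation set and flushing policy are just assumptions.

  # Minimal sketch of "buffer tokens, flush to TTS at punctuation".
  # stream_tokens() and speak() are placeholders, not real APIs.

  SENTENCE_END = {".", "!", "?", ":", ";"}

  def stream_tokens():
      # Stand-in for tokens arriving incrementally from llama.cpp or an API stream.
      for tok in "Sure , here is a quick summary . First , load the model .".split():
          yield tok

  def speak(text):
      # Stand-in for a TTS call (ElevenLabs, coqui, say(1), ...).
      print(f"[TTS] {text}")

  def speak_incrementally(tokens):
      buffer = []
      for tok in tokens:
          buffer.append(tok)
          if tok in SENTENCE_END:      # flush a complete clause/sentence
              speak(" ".join(buffer))
              buffer.clear()
      if buffer:                       # whatever is left at the end of generation
          speak(" ".join(buffer))

  if __name__ == "__main__":
      speak_incrementally(stream_tokens())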


ElevenLabs only has streaming output available. I've had a look at both recently and ElevenLabs doesn't have streaming input listed as a feature. Would be cool if it had it, though. You could probably approximate this on a sentence level, but you would need to do some normalisation to make the speech sound even.


Would it be possible to reduce lag by streaming groups of ~6 tokens at a time to the TTS as they're generated, instead of waiting for the full LLM response before beginning to speak it?


Yes, I was planning to do this back then, but other stuff came up. There are many different ways in which this simple example can be improved:

- better detection of when speech ends (currently basic adaptive threshold; see the sketch further below)

- use small LLM for quick response with something generic while big LLM computes

- TTS streaming in chunks or sentences

One of the better OSS versions of such a chatbot, I think, is https://github.com/yacineMTB/talk. Though probably many other similar projects also exist by now.
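
For illustration, here is a rough sketch of what a basic adaptive-threshold end-of-speech detector can look like. This is not the actual whisper.cpp code; the constants and the seeding of the noise floor are made-up assumptions.

  # Toy end-of-speech detection: track a running noise floor and declare the
  # utterance finished after enough consecutive low-energy frames.
  # All constants are assumptions, not values from whisper.cpp.

  import math

  SILENCE_FRAMES_TO_STOP = 25    # e.g. ~750 ms of silence at 30 ms frames
  THRESHOLD_RATIO = 2.0          # "speech" if energy > ratio * noise floor

  def frame_energy(samples):
      return math.sqrt(sum(s * s for s in samples) / max(len(samples), 1))

  def detect_end_of_speech(frames):
      """frames: iterable of lists of float samples; returns the frame index where speech ends."""
      noise_floor = None
      silent_run = 0
      for i, frame in enumerate(frames):
          e = frame_energy(frame)
          if noise_floor is None:
              noise_floor = e                    # crude: seed from the first frame
          if e > THRESHOLD_RATIO * noise_floor:
              silent_run = 0                     # still speaking
          else:
              silent_run += 1
              noise_floor = 0.9 * noise_floor + 0.1 * e   # slowly adapt to background noise
              if silent_run >= SILENCE_FRAMES_TO_STOP:
                  return i
      return None

  if __name__ == "__main__":
      quiet, loud = [[0.001] * 160], [[0.5] * 160]
      print(detect_end_of_speech(quiet * 5 + loud * 5 + quiet * 30))   # -> 34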


I keep wondering if a small LLM can also be used to help detect when the speaker has finished speaking their thought, not just when they've paused speaking.


Maybe using a voice activity detector (VAD) would be a lighter option (fewer resources required).


That works when you know what you’re going to say. A human knows when you’re pausing to think, but have a thought you’re in the middle of expressing. A VAD doesn’t know this and would interrupt when it hears a silence of N seconds; a lightweight LLM would know to keep waiting despite the silence.
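
To make that concrete, here is a toy sketch of combining the two, where the VAD flags a pause and a lightweight model gets the final say on whether the thought is complete. vad_is_silent() and utterance_looks_complete() are placeholders (the latter would be a small LLM or classifier in practice), not real APIs.

  # Toy sketch: only treat a pause as "done talking" if a lightweight model also
  # judges the transcript so far to be a complete thought.

  PAUSE_FRAMES_REQUIRED = 20   # assumption: ~0.6 s of silence before we even ask

  def vad_is_silent(frame):
      # Stand-in for a real VAD (webrtcvad, libfvad, an energy threshold, ...).
      return max(abs(s) for s in frame) < 0.01

  def utterance_looks_complete(transcript):
      # Stand-in for a small LLM / classifier prompt such as:
      # "Is the speaker finished with their thought? Answer yes or no."
      # Here: a dumb heuristic so the sketch runs on its own.
      return transcript.rstrip().endswith((".", "?", "!"))

  def should_respond(frames, transcript):
      silent = 0
      for frame in frames:
          silent = silent + 1 if vad_is_silent(frame) else 0
          if silent >= PAUSE_FRAMES_REQUIRED:
              # The VAD says "pause"; the model decides whether it is a real endpoint
              # or the speaker just thinking mid-sentence.
              return utterance_looks_complete(transcript)
      return False

  if __name__ == "__main__":
      silence = [[0.0] * 160] * 30
      print(should_respond(silence, "So what I was thinking is"))   # False: keep waiting
      print(should_respond(silence, "What do you think?"))          # True: go ahead and answer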


And the inverse: the VAD would wait longer than necessary after a person says e.g. "What do you think?", in case they were still in the middle of talking.


> use small LLM for quick response with something generic while big LLM computes

Can't wait for poorly implemented chat apps to always start a response with "That's a great question!"


“Uhm, i mean, like, you know” would indeed be a little more human.


Just like poorly implemented human brains tend to do :P


What’s the best chat interface for llama? I have a 3090 and would love to get one of the models running in my terminal for quick coding tasks.


ollama is dead easy to use. It's a single binary that downloads models on demand, like docker downloads images.

  pacman -S ollama
  ollama serve
  ollama run llama2:13b 'insert prompt'
https://ollama.ai/


ollama wraps llama.cpp into a Docker container, correct? Besides that, it seems like a Go server for chat?


The ollama shell is really nice.


Here is an open source project that supports voice as well: https://github.com/cogentapps/chat-with-gpt I think it is meant to be used with ElevenLabs and OpenAI API, but might be easy to configure for use with local Whisper.cpp + llama?


It's not open source but it is still (for now) free: lmstudio.ai. Chat histories, a good config UI, easy prompt management, model management, model discovery, easy set up, cross platform, able to serve an API for connectivity to other tools.

They're hiring and have no currently disclosed monetization strategy so I expect a rugpull soon where some now-free feature gets paywalled or purposefully crippled, but it's not like porcelain apps for free LLMs that rely entirely on llama.cpp to function can do vendor lock-in. I'd second ollama if OSS is a higher priority than features though.


This is the easiest to setup: https://faraday.dev/

I think Wizard is the “meta” for technical questions now.


Is this open source? I don't see a Linux download link and I want to try building it for Linux myself


Depends on what you mean by "best"? Absolute bleeding edge fastest possible inference? ExLlama or ExLlamaV2 on a 4090.


This makes me wonder, what's the equivalent to ollama for whisper/SOTA OS tts models? I'm really happy with ollama for locally running OS LLMs, but I don't know of any project that makes it that simple to set up whisper locally.


For SRT, here are some front-ends: https://www.reddit.com/r/OpenAI/comments/163hzhe/recommended...

Also I saw this thing called WhisperScript that looks pretty slick: https://github.com/openai/whisper/discussions/1028

That being said, WhisperX isn't that hard to setup. My step by step from a couple months ago: https://llm-tracker.info/books/logbook/page/transcription-te...


I've been using MacWhisper as a macOS app for running Whisper transcription jobs for a few months, I really like it.

https://goodsnooze.gumroad.com/l/macwhisper


McWhisper sounds like a diet burger. Does it come with fries?


Whisper is an STT model; you can use whisperx to transcribe audio locally via the CLI, or whisper-turbo.com, which runs in the browser.

For TTS, coqui has the best UX and models for a lot of languages, although quality is not on par with commercial TTS providers.


I've just been looking for SOTA TTS. I found coqui.ai and elevenlabs.io (and a bunch of others). They're good (and better than older TTS), but I am not fooled by any of them. Do you have recommendations?


Gemelo was the other one listed. I doubt you'll get anything sounding more natural than ElevenLabs with the following settings:

* Model: Multilingual v2

* All options and sliders to boost similarity: set to max/yes

* Stability slider: experimentally set to a value where the model sounds natural enough without destabilising sound output


Could anyone explain the capability of this in plain English? Can this learn and retain context of a chat and build on some kind of long term memory? Thanks


I'm not an LLM expert by any means but here is my take.

It's Speech Recognition -> Llama -> Text to Speech, running on your own PC rather than that of a third party.

The limitations on the context of the LLM are that of the model being used, e.g. Llama 2, Wizard Vicuna, whatever is chosen, in whatever compatible configuration is set by the user regarding context window etc, and given a preliminary transcript (as the LLM doesn't "reply" to the user in a sense, it just predicts the best continuation of a transcript between the user and a useful assistant, resulting in it successfully pretending to be a useful assistant, thus being a useful assistant - it's confusing).

I can imagine that it's viable to get that kind of behaviour by modifying the pipeline.

If the architecture was instead Speech Recognition -> Wrapper[Llama] -> Text 2 Speech, where "Wrapper" is some process that lets Llama do its thing but hooks onto the input text to add some additional processing, then things could get interesting.

The wrapper could analyse the conversation and pick out key aspects ("The person's name is Bob, male, 35, he likes dogs, he likes things to be organised, he wants a reminder at 5pm to call his daughter, he is an undercover agent for the Antarctic mafia, and he prefers to be spoken to in a strong Polish accent") and perform actions based on that:

- Set a reminder at 5pm to call his daughter (through e.g. HomeAssistant)

- Configure the text-2-speech engine to use a Polish accent

- Modify the starting transcript for future runs:

  - Put his name as the human's name within the underlying chat dialogue

  - Provide a condensed representation of his interests and personality within the preliminary introduction to the next chat dialogue

This way there's some interactivity involved (through actions performed by some other tool), some continuity (by modifying the next chat dialogue) and so on.
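
As a rough illustration of where such a wrapper could sit (every function and file name here is a placeholder of my own; nothing maps to a real API):

  # Hypothetical Wrapper[Llama] sketch: extract facts from the transcript, persist
  # them, and inject them into the next run's system prompt. llama_generate() and
  # extract_facts() are stubs; side effects (reminders, TTS voice) are only noted.

  import json
  import pathlib

  PROFILE_PATH = pathlib.Path("profile.json")   # assumed persistence between runs

  def llama_generate(prompt):
      return "Sure Bob, I'll remind you at 5pm to call your daughter."   # stub

  def extract_facts(user_text):
      # In practice this could be a second, cheaper LLM call returning JSON
      # ("name", "accent", "reminders", ...). Here: a stub.
      return {"name": "Bob", "accent": "pl", "reminders": ["17:00 call daughter"]}

  class Wrapper:
      def __init__(self):
          self.profile = json.loads(PROFILE_PATH.read_text()) if PROFILE_PATH.exists() else {}

      def handle(self, transcribed_text):
          self.profile.update(extract_facts(transcribed_text))
          PROFILE_PATH.write_text(json.dumps(self.profile))   # continuity for the next chat
          system = f"You are a helpful assistant. Known user profile: {self.profile}"
          reply = llama_generate(f"{system}\nUser: {transcribed_text}\nAssistant:")
          # Side effects would go here: set the reminder, switch the TTS accent, etc.
          return reply

  if __name__ == "__main__":
      print(Wrapper().handle("Hi, I'm Bob, remind me at 5pm to call my daughter."))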


I've been wondering about how feasible it is to simulate long term memory by running multiple LLMs at the same time. One of them would be tasked with storing and retrieving long term memories from disc, so it'd need to be instructed about some data structure where memories were persisted, and then you'd feed it the current context, instructing it to provide a way to navigate the memory data structure to any potentially relevant memories. Whatever data was retrieved could be injected into the prompt to the next LLM, which would just respond to the given prompt.

No idea what sort of data structure could work. Perhaps a graph database could be feasible, and the memory prompt could instruct it to write a query for the given database.


This is achieved using vector databases to store memories as embeddings. Then you can retrieve a “memory” closest to the question in the embedding space.
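
A toy sketch of that retrieval loop, with a deliberately fake bag-of-words "embedding" standing in for a real embedding model and vector database:

  # Store past exchanges as "embeddings" and pull the closest one back into the prompt.
  # embed() is a toy stand-in; a real system would use an embedding model and a vector DB.

  import math
  from collections import Counter

  def embed(text):
      return Counter(text.lower().split())       # toy "embedding": bag of words

  def cosine(a, b):
      dot = sum(a[t] * b[t] for t in a)
      na = math.sqrt(sum(v * v for v in a.values()))
      nb = math.sqrt(sum(v * v for v in b.values()))
      return dot / (na * nb) if na and nb else 0.0

  class Memory:
      def __init__(self):
          self.items = []                        # list of (embedding, text)

      def store(self, text):
          self.items.append((embed(text), text))

      def recall(self, query, k=1):
          q = embed(query)
          ranked = sorted(self.items, key=lambda item: cosine(item[0], q), reverse=True)
          return [text for _, text in ranked[:k]]

  if __name__ == "__main__":
      mem = Memory()
      mem.store("the user's dog is called Rex and is afraid of thunder")
      mem.store("the user prefers metric units")
      # The recalled text would be injected into the next prompt as a "memory".
      print(mem.recall("is my dog afraid of thunder"))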


This is an active area of research. The best we currently have is vector databases and/or sparse hierarchical information storage (you retrieve a summary of a summary via vector search, find associated summaries via vector search once more, then pluck out the actual data item and add it to the prompt).


This has really strong ELIZA vibes.


Does anybody have a quick start for building all of this on Windows? I could probably check it out as a VS project and build it, but I'm going to bet that since it's not documented it's going to have issues, specifically because the Linux build instructions are the only ones that are a first-class citizen...


What are currently the best/go-to approaches to detect the end of an utterance? This can be tricky even in conversations between humans, requiring semantic information about what the other person is saying. I wonder if there’s any automated strategy that works well enough.


Voice activity detection (referred to as VAD for short) is what you're looking for. Personally, I'd prefer a hold-down hotkey a la press-to-speak for more immediate UI feedback, but both work pretty well.


How does this choose when to speak back? (Like is it after a pause, or other heuristics.) I tried looking through the source to find this logic.


It waits for sufficient silence to determine when to stop recording the voice and send it to the model. There are other modes in the source as well, and methods of setting the length of silences in order to chunk up and send bits at a time, but I imagine that is either work in progress or not planned for this demo.


Thanks

I was surprised they didn’t combine this work with the streaming whisper demo. So I guess I will implement that for iOS/macOS (streaming whisper results in realtime without waiting on an audio pause, but, as you say, using the audio pauses and other signals like punctuation in the result to determine when to run the LLM completion; it also makes me wonder about streaming whisper results into the LLM incrementally, before it's ready for completion).


It may be using the streaming demo. The reason I know to answer your question is that I had modified the streaming demo myself for personal use before. I think there are bugs in the silence detection code (as of a few months back, maybe fixed now). Maybe what we are seeing in this demo is just the "silence detection" setting waiting for very long pauses; I believe it's configurable.


I added libfvad


very sick demo! if anyone wants to work on packaging this up for broader (swiftUI/macos) consumption, I just added an issue https://github.com/psugihara/FreeChat/issues/30


Elevenlabs voice is amazing but it's so expensive. You can easily spend $20 on a single conversation.


If you have adequate local resources, TorToiSe has pretty remarkable quality, especially after a little training.


All I need is ... Vulkan support :) pls, pls, pls :)


why use this instead of "memgpt run" ?


A major benefit (for me) is that I can build it with "make" and know that it just works.

Now perhaps this is a skills issue on my part (because I'm not a Python dev), but I've had endless trouble with Python-based ML projects, with some requiring I use/avoid specific 3.x versions of Python, each project's install instructions seemingly using a different tool to create a virtual environment to isolate dependencies, and issues tracking down specific custom versions of core libraries in order to allow the use of GPU/Neural Engine on Apple Silicon.

The whisper and llama.cpp projects just build and run so easily by comparison


> A major benefit (for me) is that I can build it with "make" and know that it just works.

So, as someone who has never gotten around to doing this, and who also likes not having to deal with the Python tools, it's not quite that simple. Steps I had to take for talk-llama after cloning whisper.cpp:

* apt install libsdl2-dev (Linux; other steps elsewhere)

* make talk-llama from the root of whisper.cpp, not from the example's subdirectory

* ./download-ggml-model.sh small.en from the models directory

* Tried to run it with the command line in the README, had it segfault after failing to open ../llama.cpp/models/llama-13b/ggml-model-q4_0.gguf, cloned llama.cpp, and found the file is not in the repo.

* Searched through the README for how to find the models, and found I needed to go searching elsewhere because no URLs were listed.

* Had to install a bunch of Python dependencies to quantize the models...

This is still far from "build and run". Though I will fully believe that a lot of the Python-based ML projects are worse.


Got stuck on the model same as you did. No idea what model to use, not interested in fighting with Python to convert the models.

I was able to get llama.cpp itself to work, though, including image analysis.


I also had the same issue; in my case it was because I was trying to use a Llama 2 model. When trying with CodeLlama https://huggingface.co/TheBloke/CodeLlama-7B-GGUF/tree/main, which is based on the first Llama, it works.


Correction: it looks like it has to do with the quantization instead: 8-bit quantization works, while lower-bit quantization does not seem to work. Another working model example (no conversion needed): https://huggingface.co/TheBloke//Yarn-Mistral-7B-64k-GGUF/ya...


I've had the same experience. One of the things I like most about llama.cpp is the relatively straightforward build process, at least when compared to the mess of Python library requirements you run into if you want to experiment with ML models.

Having said that, I have the sense that the ML ecosystem is coalescing around using `venv` as a standard for Python dependencies. Most of the build instructions for Python ML projects I've seen recently begin with setting up the environment using venv, and in my experience, it works fairly reliably. I don't particularly like downloading gigabytes of dependencies for each new project, but that mess of dependencies is what's powering the rapid pace of prototypes and development.


As another non-python dev, interested in and trying to get into AI/ML, I think the limitation of venv is that it can't handle multiple versions of system libraries.

CUDA, for example: different projects will require different versions of some library like PyTorch, but these seem to be tied to the CUDA version. This is where Anaconda (and Miniconda) come in, but omfg, I hate those. So far all Anaconda does is screw up my environment, causing weird binaries to come into my path and overriding my newer/vetted ffmpeg and other libraries with some outdated ones. Not to mention, I have no idea if they are safe to use, since I can't figure out where this avalanche (literally gigs) of garbage gets pulled in. If I don't let it mess with my startup scripts, nothing works.

And note, I'm not smart, but I've been a user of UNIX from the 90's and I can't believe we haven't progressed much in all these decades. I remember trying to pull in source packages and compiling them from scratch and that sucked too (make, cmake, m4, etc). But package managers and other tech has helped the general public that just wants to use the damn software. Nobody wants to track down and compile every dependency and become an expert in the build process. But this is where we are. Still.

I am currently trying to get these projects working in Docker, but that is a whole other ordeal that I haven't completed yet, though I am hopeful that I'm almost there :) Some projects have Dockerfiles and some even have docker-compose files. None have worked out of the box for me. And that's both surprising and sad.

I don't know where the blame lies exactly. Docker? The package maintainers that don't know docker or unix (a lot of these new LLM/AI projects are windows only or windows-first and I hear data scientists hate-hate-hate sysadmin tasks)? Nvidia for their eco-system? Dunno, all I know is I'm experiencing pain and time wastage that I'd rather not deal with. I guess that's partly why open-ai and other paid services exist. lol.


I'm in the same situation. I found this cog project to dockerise ML: https://github.com/replicate/cog . You write just one Python class and a YAML file, and it takes care of the "CUDA hell" and deps. It even creates a Flask app in front of your model.

That helps keep your system clean, but someone with big $s please rewrite pytorch to golang or rust or even nodejs / typescript.


This. The Python machine learning ecosystem is the singular most difficult to navigate crustlefuck [0] that I've experienced in my life - issues listed, and more. Good luck if you need specific versions of any hardware drivers or SDKs too.

[0]: *Crustlefuck (noun)*: A labyrinthine, solidified matrix of chaos that has accreted over time. A crustlefuck differs from a clusterfuck due to an added dimension of rigid, ingrained complications that render any attempt at untangling the issues extraordinarily daunting and taxing.


memgpt has: 1. a community based on Discord (a corporate platform), 2. a Python ecosystem requiring layers and layers of virtualization/containers and dependency management, 3. a heavy commercial OpenAI-first concentration with local LLMs as an afterthought.

4. There are llama.cpp ways to do pretty much all that it does, and you aren't just restricted to a couple of model types like with memgpt.


I dunno, why would you use memgpt run?


long memory/context, easily talk to your docs, define personas, works with gpt4 out-of-the-box but also supports local LLMs... and it comes with a CLI chat app


Whoever downvoted this, I'd be interested in an argument as for why.


can this be used to talk with local documents?

say I have a research paper in pdf, can I ask llama questions about it?


No. This is a demo of directly prompting a model using voice-to-text.

If your model has a long enough context (models like MistralLite can do 32,000 tokens now, which is about 30 pages of text) you could run a PDF text extraction tool and then dump that text into the model context and ask questions about it with the remaining tokens.

You could also plug this into one of the ask-questions-of-a-long-document-via-embedding-search tools.
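
A sketch of that extract-then-dump approach, assuming a long-context model. pypdf is a real library; ask_model() is a placeholder for however you reach the local model (llama-cpp-python, a local HTTP server, etc.), and the truncation limit is a rough guess, not a computed token count.

  # Extract the PDF text, stuff it into the prompt, ask the question.
  from pypdf import PdfReader

  def ask_model(prompt):
      return "(model answer would appear here)"   # stand-in for the local LLM call

  def ask_pdf(pdf_path, question, max_chars=90_000):
      reader = PdfReader(pdf_path)
      text = "\n".join(page.extract_text() or "" for page in reader.pages)
      context = text[:max_chars]                  # crude truncation to fit the context window
      prompt = (
          "You are given the text of a research paper.\n\n"
          f"{context}\n\n"
          f"Question: {question}\nAnswer:"
      )
      return ask_model(prompt)

  if __name__ == "__main__":
      print(ask_pdf("paper.pdf", "What is the main contribution?"))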


Tacoma


I don’t want to talk to anything in my terminal. It’s a shitty interface for that.


I have the opposite opinion. I don't see how a simple window with a microphone image, or something like the Siri bubble, would be any better.


Then use llama.cpp or whatever. No need to be sour over more options & innovation



