I mean anything in the 0.5B-3B range that's available on Ollama (for example). Have you built any cool tooling that uses these models as part of your work flow?
I built an Excel Add-In that allows my girlfriend to quickly filter 7000 paper titles and abstracts for a review paper that she is writing [1]. It uses Gemma 2 2b which is a wonderful little model that can run on her laptop CPU. It works surprisingly well for this kind of binary classification task.
The nice thing is that she can copy/paste the titles and abstracts in to two columns and write e.g. "=PROMPT(A1:B1, "If the paper studies diabetic neuropathy and stroke, return 'Include', otherwise return 'Exclude'")" and then drag down the formula across 7000 rows to bulk process the data on her own because it's just Excel. There is a gif on the readme on the Github repo that shows it.
I don't know. This paper [1] reports accuracies in the 97-98% range on a similar task with more powerful models. With Gemma 2 2b the accuracy will certainly be lower.
Y'all definitely need to cross validate a small number of samples by hand. When I did this kind of research, I would hand validate to at least P < .01.
She and one other researcher has manually classified all 7000 papers as per standard protocol. Perhaps for the next article they will measure how this tool agreed with them against them and include it in the protocol if good enough.
Looks like I'm out...
Would be great if there was a google apps script alternative. My company gave all devs linux systems and the business team operates on windows. So I always use browser based tech like Gapps script for complex sheet manipulation
I've been using Llama models to identify cookie notices on websites, for the purpose of adding filter rules to block them in EasyList Cookie. Otherwise, this is normally done by, essentially, manual volunteer reporting.
Most cookie notices turn out to be pretty similar, HTML/CSS-wise, and then you can grab their `innerText` and filter out false positives with a small LLM. I've found the 3B models have decent performance on this task, given enough prompt engineering. They do fall apart slightly around edge cases like less common languages or combined cookie notice + age restriction banners. 7B has a negligible false-positive rate without much extra cost. Either way these things are really fast and it's amazing to see reports streaming in during a crawl with no human effort required.
This is so cool thanks for sharing. I can imagine it’s not technically possible (yet?) but it would be cool if this could simply be run as a browser extension rather than running a docker container
I did actually make a rough proof-of-concept of this! One of my long-term visions is to have it running natively in-browser, and able to automatically fix site issues caused by adblocking whenever they happen.
There are a couple of WebGPU LLM platforms available that form the building blocks to accomplish this right from the browser, especially since the models are so small.
It should be possible using native messaging [1] which can call out to an external binary. The 1password extensions use that to communicate with the password manager binary.
I think there is real potential here, for smart browsing. Have the llm get the page, replace all the ads with kittens, find non-paywall versions if possible and needed, spoof fingerprint data, detect and highlight AI generated drivel, etc. The site would have no way of knowing that it wasn’t touching eyeballs. We might be able to rake back a bit of the web this way.
You probably wouldn't want to run this in real-time on every site as it'll significantly increase the load on your browser, but as long as it's possible to generate adblock filter rules, the fixes can scale to a pretty large audience.
I was thinking running it in my home lab server as a proxy, but yeah, scaling it to the browser would require some pretty strong hardware. Still, maybe in a couple of years it could be mainstream.
To me this take is like smokers complaining that the evil government is forcing the good tobacco companies to degrade the experience by adding pictures of cancer patients on cigarette packs.
Ha, I'm not sure the EU is prepared to handle the deluge of petitions that would ensue.
On a more serious note, this must be the first time we can quantitatively measure the impact of cookie consent legislation across the web, so maybe there's something to be explored there.
why don't you spam the companies who want your data instead? The sites can simply stop gathering your data, then they will not require to ask for consent ...
It’s the same comments on HN as always. They think EU setting up rules is somehow worse than companies breaking them. We see how the US is turning out without pesky EU restrictions :)
I have ollama responding to SMS spam texts. I told it to feign interest in whatever the spammer is selling/buying. Each number gets its own persona, like a millennial gymbro or 19th century British gentleman.
Given the source, I'm skeptical it's not just a troll, but found this explanation [0] plausible as to why those vague spam text exists. If true, this trolling helps the spammers warm those phone numbers up.
Carriers and SMS service providers (like Twillio) obey that, no matter what service is behind.
There are stories of people replying STOP to spam, then never getting a legit SMS because the number was re-used by another service. That's because it's being blocked between the spammer and the phone.
Calling Jessica an old chap is quite a giveaway that it's a bot xD
Nice idea indeed, but I have a feeling that it's just two LLMs now conversing with each other.
Yes it’s possible, but it’s not something you can easily scale.
I had a similar project a few years back that used OSX automations and Shortcuts and Python to send a message everyday to a friend. It required you to be signed in to iMessage on your MacBook.
Than was a send operation, the reading of replies is not something I implemented, but I know there is a file somewhere that holds a history of your recent iMessages. So you would have to parse it on file update and that should give you the read operation so you can have a conversation.
Very doable in a few hours unless something dramatic changed with how the messages apps works within the last few years.
If you are willing to use Apple Shortcuts on iOS it’s pretty easy to add something that will be trigged when a message is received and can call out to a service or even use SSH to do something with the contents, including replying
For something similar with FB chat, I use Selenium and run it on the same box that the llm is running on. Using multiple personalities is really cool though. I should update mine likewise!
You realize this is going to cause carriers to allow the number to send more spam, because it looks like engagement. The best thing to do is to report the offending message to 7726 (SPAM) so the carrier can take action. You can also file complaints at the FTC and FCC websites, but that takes a bit more effort.
I have a mini PC with an n100 CPU connected to a small 7" monitor sitting on my desk, under the regular PC. I have llama 3b (q4) generating endless stories in different genres and styles. It's fun to glance over at it and read whatever it's in the middle of making. I gave llama.cpp one CPU core and it generates slow enough to just read at a normal pace, and the CPU fans don't go nuts. Totally not productive or really useful but I like it.
FORTUNE=$(fortune) && echo $FORTUNE && echo "Convert the following output of the Unix `fortune` command into a small screenplay in the style of Shakespeare: \n\n $FORTUNE" | ollama run phi4
Do you find that it actually generates varied and diverse stories? Or does it just fall into the same 3 grooves?
Last week I tried to get an LLM (one of the recent Llama models running through Groq, it was 70B I believe) to produce randomly generated prompts in a variety of styles and it kept producing cyberpunk scifi stuff. When I told it to stop doing cyberpunk scifi stuff it went completely to wild west.
You should not ever expect an LLM to actually do what you want without handholding, and randomness in particular is one of the places it fails badly. This is probably fundamental.
That said, this is also not helped by the fact that all of the default interfaces lack many essential features, so you have to build the interface yourself. Neither "clear the context on every attempt" nor "reuse the context repeatedly" will give good results, but having one context producing just one-line summaries, then fresh contexts expanding each one will do slightly less badly.
(If you actually want the LLM to do something useful, there are many more things that need to be added beyond this)
Sounds to me like you might want to reduce the Top P - that will prevent the really unlikely next tokens from ever being selected, while still providing nice randomness in the remaining next tokens so you continue to get diverse stories.
They linked to an interactive explorer that nicely shows the diversity of the dataset, and the HF repo links to the GitHub repo that has the code that generated the stories: https://github.com/lennart-finke/simple_stories_generate
So, it seems there are ways to get varied stories.
It's a 3b model so the creativity is pretty limited. What helped for me was prompting for specific stories in specific styles. I have a python script that randomizes the prompt and the writing style, including asking for specific author styles.
> Do you find that it actually generates varied and diverse stories? Or does it just fall into the same 3 grooves?
> Last week I tried to get an LLM (one of the recent Llama models running through Groq, it was 70B I believe) to produce randomly generated prompts in a variety of styles and it kept producing cyberpunk scifi stuff.
oh wow that is actually such a brilliant little use case-- really cuts to the core of the real "magic" of ai: that it can just keep running continuously. it never gets tired, and never gets tired of thinking.
I have a small fish script I use to prompt a model to generate three commit messages based off of my current git diff. I'm still playing around with which model comes up with the best messages, but usually I only use it to give me some ideas when my brain isn't working. All the models accomplish that task pretty well.
Interesting idea. But those say what’s in the commit. The commit diff already tells you that. The best commit messages IMO tell you why you did it and what value was delivered. I think it’s gonna be hard for an LLM to do that since that context lives outside the code. But maybe it would, if you hook it to e.g. a ticketing system and include relevant tickets so it can grab context.
For instance, in your first example, why was that change needed? It was a fix, but for what issue?
In the second message: why was that a desirable change?
This reminds me of the antics of streamer DougDoug, who often uses LLM APIs to live-summarize, analyze, or interact with his (often multi-thousand-strong) Twitch chat. Most recently I saw him do a GeoGuessr stream where he had ChatGPT assume the role of a detective who must comb through the thousands of chat messages for clues about where the chat thinks the location is, then synthesizes the clamor into a final guess. Aside from constantly being trolled by people spamming nothing but "Kyoto, Japan" in chat, it occasionaly demonstrated a pretty effective incarnation of "the wisdom of the crowd" and was strikingly accurate at times.
What approach/stack would you recommend for listening to an ongoing conversation, transcribing it and passing through llm? I had some use cases in mind but I'm not very familiar with AI frameworks and tools
I'd love to hear more about the hardware behind this project. I've had concepts for tech requiring a mic on me at all times for various reasons. Always tricky to have enough power in a reasonable DIY form factor.
It's a lightweight tool that summarizes Hacker News articles. For example, here’s what it outputs for this very post, "Ask HN: Is anyone doing anything cool with tiny language models?":
"A user inquires about the use of tiny language models for interesting applications, such as spam filtering and cookie notice detection. A developer shares their experience with using Ollama to respond to SMS spam with unique personas, like a millennial gymbro or a 19th-century British gentleman. Another user highlights the effectiveness of 3B and 7B language models for cookie notice detection, with decent performance achieved through prompt engineering."
I originally used LLaMA 3:Instruct for the backend, which performs much better, but recently started experimenting with the smaller LLaMA 3.2:1B model.
It’s been cool seeing other people’s ideas too. Curious—does anyone have suggestions for small models that are good for summaries?
That's cool, I really like it. One piece of feedback: I am usually more interested in the HN comments than in the original article. If you'd include a link to the comments then I might switch to GopherSignal as a replacement for the HN frontpage.
My flow is generally: Look at the title and the amount of upvotes to decide if I'm interested in the article. Then view the comments to see if there's interesting discussion going on or if there's already someone adding essential context. Only then I'll decide if I want to read the article or not.
Of course no big deal if you're not interested in my patronage, just wanted to let you know your page already looks good enough for me to consider switching my most visited page to it if it weren't for this small detail. And maybe the upvote count.
Hey, thanks a ton for the feedback! That was super helpful to hear about your flow—makes a lot of sense and it's pretty similar to how I browse HN too. I usually only dive into the article after checking out the upvotes and seeing what context the comments add.
I'll definitely add a link to the comments and the upvote count—gotta keep my tiny but mighty userbase (my mom, me, and hopefully you soon) happy, right? lol
And if there's even a chance you'd use GopherSignal as your daily driver, that's a no-brainer for me. Really appreciate you taking the time to share your ideas and help me improve.
Microsoft published a paper on their FLAME model (60M parameters) for Excel formula repair/completion which outperformed much larger models (>100B parameters).
That paper is from over a year ago, and it compared against codex-davinci... which was basically GPT-3, from what I understand. Saying >100B makes it sound a lot more impressive than it is in today's context... 100B models today are a lot more capable. The researchers also compared against a couple of other ancient(/irrelevant today), small models that don't give me much insight.
FLAME seems like a fun little model, and 60M is truly tiny compared to other LLMs, but I have no idea how good it is in today's context, and it doesn't seem like they ever released it.
This is wild. They claim it was trained exclusively on Excel formulas, but then they mention retrieval? Is it understanding the connection between English and formulas? Or am I misunderstanding retrieval in this context?
Edit: No, the retrieval is Formula-Formula, the model (nor I believe tokenizer) does not handle English.
But I feel we're going back full circle. These small models are not generalist, thus not really LLMs at least in terms of objective. Recently there has been a rise of "specialized" models that provide lots of values, but that's not why we were sold on LLMs.
Specialized models work much better still for most stuff. Really we need an LLM to understand the input and then hand it off to a specialized model that actually provides good results.
But that's the thing, I don't need my ML model to be able to write me a sonnet about the history of beets, especially if I want to run it at home for specific tasks like as a programming assistant.
I'm fine with and prefer specialist models in most cases.
I would love a model that knows SQL really well so I don't need to remember all the small details of the language. Beyond that, I don't see why the transformer architecture can't be applied to any problem that needs to predict sequences.
I think playing word games about what really counts as an LLM is a losing battle. It has become a marketing term, mostly. It’s better to have a functionalist point of view of “what can this thing do”.
We (avy.ai) are using models in that range to analyze computer activity on-device, in a privacy sensitive way, to help knowledge workers as they go about their day.
The local models do things ranging from cleaning up OCR, to summarizing meetings, to estimating the user's current goals and activity, to predicting search terms, to predicting queries and actions that, if run, would help the user accomplish their current task.
The capabilities of these tiny models have really surged recently. Even small vision models are becoming useful, especially if fine tuned.
So you are using a local small model to remove identifying information and make the question generic, which is then sent to a larger model? Is that understanding correct?
I think this would have some additional benefits of not confusing the larger model with facts it doesn't need to know about. My erasing information, you can allow its attention heads to focus on the pieces that matter.
> So you are using a local small model to remove identifying information and make the question generic, which is then sent to a larger model? Is that understanding correct?
You're using it to anonymize your code, not de-anonymize someone's code. I was confused by your comment until I read the replies and realized that's what you meant to say.
I read it the other way, their code contains eg fetch(url, pw:hunter123), and they're asking Claude anonymized questions like
"implement handler for fetch(url, {pw:mycleartrxtpw})"
And then claude replies
fetch(url, {pw:mycleartrxtpw}).then(writething)
And then the local llm converts the placeholder mycleartrxtpw into hunter123 using its access to the real code
> Put in all your work related questions in the plugin, an LLM will make it as an abstract question for you to preview and send it
So the LLM does both the anonymization into placeholders and then later the replacing of the placeholders too. Calling the latter step de-anonymization is confusing though, it's "de-anonymizing" yourself to yourself. And the overall purpose of the plugin is to anonymize OP to Claude, so to me at least that makes the whole thing clearer.
Llama 3.2 has about 3.2b parameters. I have to admit, I use bigger ones like phi-4 (14.7b) and Llama 3.3 (70.6b) but I think Llama 3.2 could do de-anonimization and anonimization of code
Llama 3.2 punches way above its weight. For general "language manipulation" tasks it's good enough - and it can be used on a CPU with acceptable speed.
Are you using the model to create a key-value pair to find/replace and then reverse to reanonymize, or are you using its outputs directly? If the latter, is it fast enough and reliable enough?
I've made a tiny ~1m parameter model that can generate random Magic the Gathering cards that is largely based on Karpathy's nanogpt with a few more features added on top.
I don't have a pre-trained model to share but you can make one yourself from the git repo, assuming you have an apple silicon mac.
It also does RAG on apps there, like the music player, contacts app and to-do app. I can ask it to recommend similar artists to listen to based on my music library for example or ask it to quiz me on my PDF papers.
Not sure it qualifies, but I've started building an Android app that wraps bergamot[0] (the firefox translation models) to have on-device translation without reliance on google.
Bergamot is already used inside firefox, but I wanted translation also outside the browser.
I built https://ffprompt.ryanseddon.com using the chrome ai (Gemini nano). Allows you to do ffmpeg operations on videos using natural language all client side.
I wonder how big that model is in RAM/disk. I use LLMs for FFMPEG all the time, and I was thinking about training a model on just the FFMPEG CLI arguments. If it was small enough, it could be a package for FFMPEG. e.g. `ffmpeg llm "Convert this MP4 into the latest royalty-free codecs in an MKV."`
Please submit a blog post to HN when you're done. I'd be curious to know the most minimal LLM setup needed get consistently sane output for FFMPEG parameters.
You can train that size of a model on ~1 billion tokens in ~3 minutes on a rented 8xH100 80GB node (~$9/hr on Lambda Labs, RunPod io, etc.) using the NanoGPT speed run repo: https://github.com/KellerJordan/modded-nanogpt
For that short of a run, you'll spend more time waiting for the node to come up, downloading the dataset, and compiling the model, though.
If you have modern hardware, you can absolutely train that at home. Or very affordable on a cloud service.
I’ve seen a number of “DIY GPT-2” tutorials that target this sweet spot. You won’t get amazing results unless you want to leave a personal computer running for a number of hours/days and you have solid data to train on locally, but fine-tuning should be in the realm of normal hobbyists patience.
Not even on the edge. That's something you could train on a 2 GB GPU.
The general guidance I've used is that to train a model, you need an amount of RAM (or VRAM) equal to 8x the number of parameters, so a 0.125B model would need 1 GB of RAM to train.
Hm... I wonder what your use case it. I do the modern Enterprise Java and the tab completion is a major time saver.
While interactive AI is all about posing, meditating on the prompt, then trying to fix the outcome, IntelliJ tab completion... shows what it will complete as you type and you Tab when you are 100% OK with the completion, which surprisingly happens 90..99% of the time for me, depending on the project.
I've had good speed / reliability with TheBloke/rocket-3B-GGUF on Huggingface, the Q2_K model. I'm sure there are better models out there now, though.
It takes ~8-10 seconds to process an image on my M2 Macbook, so not quite quick enough to run on phones yet, but the accuracy of the output has been quite good.
what are you feed into the model? Image (like product packaging) or Image of Structured Table? I found out that model performs good in general with sturctured table, but fails a lot over images.
I am doing nothing, but I was wondering if it would make sense to combine a small LLM and SQLITE to parse date time human expressions. For example, given a human input like "last day of this month", the LLM will generate the following query `SELECT date('now','start of month','+1 month','-1 day');`
It is probably super overengineering, considering that pretty good libraries are already doing that on different languages, but it would be funny. I did some tests with chatGPT, and it worked sometimes. It would probably work with some fine-tuning, but I don't have the experience or the time right now.
LLMs tend to REALLY get this wrong. Ask it to generate a query to sum up likes on items uploaded in the last week, defined as the last monday-sunday week (not the last 7 days), and watch it get it subtly wrong almost every time.
I bought a tiny business in Brazil, the database (Excel) I inherited with previous customer data *do not include gender*. I need gender to start my marketing campaigns and learn more about my future customer. I used Gemma-2B and Python to determine gender based on the data and it worked perfect
I use a small model to rename my Linux ISOs. I gave it a custom prompt with examples of how I want the output filenames to be structured and then just feed it files to rename. The output only works 90ish percent of the time, so I wrote a little CLI to iterate through the files and accept / retry / edit the changes the LLM outputs.
I built a platform to monitor LLMs that are given complete freedom in the form of a Docker container bash REPL. Currently the models have been offline for some time because I'm upgrading from a single DELL to a TinyMiniMicro Proxmox cluster to run multiple small LLMs locally.
The bots don't do a lot of interesting stuff though, I plan to add the following functionalities:
- Instead of just resetting every 100 messages, I'm going to provide them with a rolling window of context.
- Instead of only allowing BASH commands, they will be able to also respond with reasoning messages, hopefully to make them a bit smarter.
- Give them a better docker container with more CLI tools such as curl and a working package manager.
If you're interested in seeing the developments, you can subscribe on the platform!
We're using small language models to detect prompt injection. Not too cool, but at least we can publish some AI-related stuff on the internet without a huge bill.
My husband and me made a stock market analysis thing that gets it right about 55% of the time, so better than a coin toss. The problem is that it keeps making unethical suggestions, so we're not using it to trade stock. Does anyone have any idea what we can do with that?
Suggestion: calculate the out-of-sample Sharpe ratio[0] of the suggestions over a reasonable period to gauge how good the model would actually perform in terms of return compared to risks. It is better than vanilla accuracy or related metrics. Source: I'm a financial economist.
I think this is a good highlight of why context and reality checks are incredibly important when doing work like this. At first glance, it might look like 55% is a really good result, but in the previous year, a flat buy every day strategy would've been right 56.7% of the time.
You can literally flip coins and get better than 50% success in a bull market. Just buy index funds and spend your time on something that isn't trying to beat entropy. You won't be able to.
We are building a framework to run this tiny language model in the web so anyone can access private LLMs in their browser: https://github.com/sauravpanda/BrowserAI.
With just three lines of code, you can run Small LLM models inside the browser. We feel this unlocks a ton of potential for businesses so that they can introduce AI without fear of cost and can personalize the experience using AI.
Would love your thoughts and what we can do more or better!
Has anyone ever tried to do some automatic email workflow autoresponder agents?
Lets say, I want some outcome and it will autonomousl handle the process prompt me and the other side for additional requirements if necessary and then based on that handle the process and reach the outcome?
I think I know what he means. I use AI Chat. I load Qwen2.5-1.5B-Instruct with llama.cpp server, fully offloaded to the CPU, and then I config AI Chat to connect to the llama.cpp endpoint.
What's your workflow like? I use AI Chat. I load Qwen2.5-1.5B-Instruct with llama.cpp server, fully offloaded to the CPU, and then I config AI Chat to connect to the llama.cpp endpoint.
I have several rhetoric and logic books of the sort you might use for training or whatever, and one of my best friends got a doctorate in a tangential field, and may have materials and insights.
We actually just threw a relationship curative app online in 17 hours around Thanksgiving., so they "owe" me, as it were.
I'm one of those people that can do anything practical with tech and the like, but I have no imagination for it - so when someone mentions something that I think would be beneficial for my fellow humans I get this immense desire to at least cheer on if not ask to help.
I'll be very positively impressed if you make this work; I spend all day every day for work trying to make more capable models perform basic reasoning, and often failing :-P
I am using smollm2 to extract some useful information (like remote, language, role, location, etc.) from "Who is hiring" monthly thread and create an RSS feed with specific filter.
Still not ready for Show HN, but working.
I have this idea that a tiny LM would be good at canonicalizing entered real estate addresses. We currently buy a data set and software from Experian, but it feels like something an LM might be very good at. There are lots of weirdnesses in address entry that regexes have a hard time with. We know the bulk of addresses a user might be entering, unless it's a totally new property, so we should be able to train it on that.
Apple’s on device models are
around 3B if I’m nit mistaken, and they developed some nice tech around them that they published, if I’m not
mistaken - where they have just one model, but have switchable finetunings of that model so that it can perform different functionalities depending on context.
I've been working on a self-hosted, low-latency service for small LLM's. It's basically exactly what I would have wanted when I started my previous startup. The goal is for real time applications, where even the network time to access a fast LLM like groq is an issue.
I haven't benchmarked it yet but I'd be happy to hear opinions on it. It's written in C++ (specifically not python), and is designed to be a self-contained microservice based around llama.cpp.
Kinda? All local so very much personal, non-business use. I made Ollama talk in a specific persona styles with the idea of speaking like Spider Jerusalem, when I feel like retaining some level of privacy by avoiding phrases I would normally use. Uncensored llama just rewrites my post with a specific persona's 'voice'. Works amusingly well for that purpose.
when i feel like casually listening to something, instead of netflix/hulu/whatever, i'll run a ~3b model (qwen 2.5 or llama 3.2) and generate and audio stream of water cooler office gossip. (when it is up, it runs here: https://water-cooler.jothflee.com).
some of the situations get pretty wild, for the office :)
I'm interested in finding tiny models to create workflows stringing together several function/tools and running them on device using mcp.run servlets on Android (disclaimer: I work on that)
I’m tired of the bad playlists I get from algorithms, so I made a specific playlist with an Llama2 based on several songs I like. I started with 50, removed any I didn’t like, and added more to fill in the spaces. The small models were pretty good at this. Now I have a decent fixed playlist. It does get “tired” after a few weeks and I need to add more to it. I’ve never been able to do this myself with more than a dozen songs.
Interesting! I wrote a prompt for something similar[1], but I use Claude Sonnet for it. I wonder how a small model would handle it. Time to test, I guess.
I programmed my own version of Tic Tac Toe in Godot, using a Llama 3B as the AI opponent. Not for work flow, but figuring out how to beat it is entertaining during moments of boredom.
I don’t know if this counts as tiny but I use llama 3B in prod for summarization (kinda).
Its effective context window is pretty small but I have a much more robust statistical model that handles thematic extraction. The llm is essentially just rewriting ~5-10 sentences into a single paragraph.
I’ve found the less you need the language model to actually do, the less the size/quality of the model actually matters.
I'm using ollama, llama3.2 3b, and python to shorten news article titles to 10 words or less. I have a 3 column web site with a list of news articles in the middle column. Some of the titles are too long for this format, but the shorter titles appear OK.
Is there any experiments in a small models that does paraphrasing? I tried hsing some off-the-shelf models, but it didn't go well.
I was thinking of hooking them in RPGs with text-based dialogue, so that a character will say something slightly different every time you speak to them.
I put llama 3 on a RBPi 5 and have it running a small droid. I added a TTS engine so it can hear spoken prompts which it replies to in droid speak. It also has a small screen that translates the response to English. I gave it a backstory about being a astromech droid so it usually just talks about the hyperdrive but it's fun.
We're prototyping a text firewall (for Android) with Gemma2 2B (which limits us to English), though DeepSeek's R1 variants now look pretty promising [0]: Depending on the content, we rewrite the text or quarantine it from your view. Of course this is easy (for English) in the sense that the core logic is all LLMs [1], but the integration points (on Android) are not so straight forward for anything other than SMS. [2]
A more difficult problem we forsee is to turn it into a real-time (online) firewall (for calls, for example).
The nice thing is that she can copy/paste the titles and abstracts in to two columns and write e.g. "=PROMPT(A1:B1, "If the paper studies diabetic neuropathy and stroke, return 'Include', otherwise return 'Exclude'")" and then drag down the formula across 7000 rows to bulk process the data on her own because it's just Excel. There is a gif on the readme on the Github repo that shows it.
[1] https://github.com/getcellm/cellm
reply