Hey, project lead here. I had a very specific use case in mind: I’m playing with using LLM agent frameworks for software engineering - like MemGPT, swe-agent, Langchain, and my own hobby project called headlong (https://github.com/andyk/headlong). Headlong is focused on making it easy for a human to edit the thought history of an agent via a webapp. The longer-term goal of headlong is collecting large-ish human-curated datasets that intermix actions/observations/inner-thoughts, and then using those data to fine-tune models to see if we can improve their reasoning.
While working on headlong I tried out and implemented a variety of ‘tools’ (i.e., functions) like editFile(), findFile(), sendText(), checkTime(), searchWeb(), etc., which the agents call using LLM function calling.
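To make that concrete, here's a minimal sketch of what declaring one of those tools for LLM function calling might look like, using the common OpenAI-style JSON-schema format. The parameter names here are illustrative assumptions, not headlong's actual definitions.

```python
# Hypothetical tool declaration for searchWeb() in the OpenAI-style
# function-calling format. The model sees this schema and responds with
# a structured call (name + JSON arguments) that the agent then executes.
search_web_tool = {
    "type": "function",
    "function": {
        "name": "searchWeb",
        "description": "Search the web and return snippets from the top results.",
        "parameters": {
            "type": "object",
            "properties": {
                "query": {
                    "type": "string",
                    "description": "The search query text",
                },
                "maxResults": {
                    "type": "integer",
                    "description": "Maximum number of results to return",
                },
            },
            "required": ["query"],
        },
    },
}
```

Each tool (editFile, findFile, etc.) gets its own schema like this, and the full list is passed to the model on every call.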
A bunch of these ended up being functions that interacted with an underlying terminal, which is actually similar to how swe-agent works.
But I figured that instead of writing a bunch of functions that sit between the LLM and the terminal, maybe I could let the LLM use a terminal more like a human does, i.e., by “typing” input into it and looking at snapshots of its current state. I needed a way to get those stateful text snapshots though.
I first tried using tmux and also looked to see whether any existing libs provide the same functionality. I couldn’t find anything, so I teamed up with Marcin to design and build ht.
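For anyone curious what the tmux version of this looks like: you can drive a detached session with `tmux send-keys` and read a stateful screen snapshot with `tmux capture-pane -p`. A rough sketch (the session name and helper structure are mine, not from headlong or ht):

```python
# Sketch: stateful terminal snapshots via tmux. The agent "types" into a
# detached session and reads back the rendered screen contents as text.
import subprocess

SESSION = "agent-term"  # arbitrary session name, just for illustration

def new_session_cmd(session=SESSION):
    # Start a detached tmux session the agent can type into.
    return ["tmux", "new-session", "-d", "-s", session]

def send_keys_cmd(text, session=SESSION):
    # "Type" input into the session, followed by Enter.
    return ["tmux", "send-keys", "-t", session, text, "Enter"]

def snapshot_cmd(session=SESSION):
    # Print the currently visible pane contents to stdout (-p).
    return ["tmux", "capture-pane", "-t", session, "-p"]

def run(cmd):
    # Thin wrapper; returns captured stdout as text.
    return subprocess.run(cmd, capture_output=True, text=True).stdout
```

Usage would be roughly `run(new_session_cmd())`, then `run(send_keys_cmd("ls"))`, then `screen = run(snapshot_cmd())` to hand the screen text to the model. It works, but the process wiring and screen-state handling are exactly the finicky parts ht was built to replace.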
Playing with the agent using the terminal directly has evolved into a hypothesis I’ve been exploring: the terminal may be the “one tool to rule them all” - i.e., if an agent learns to use a terminal well, it can do most of what humans do on our computers. Or maybe terminal + browser are the “two tools to rule them all”?
Not sure how useful ht will be for other use cases, but maybe!
This makes a lot of sense. I would call that out, because it's really surprising out of context. Hopefully you can see how unusual it is to drive human interfaces from code when, in the majority of cases, a programmatic interface for the task already exists and would be much less bug-prone and finicky. I guess the analogy would be choosing to use WebDriver to interact with a service that already has an API.