It's right there in the middle section of the image: the Python program just appends your question to a prompt, sends it to GPT-3, extracts the code from the response, and calls exec() on it.
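For anyone who can't see the image, the pattern is roughly the following. This is only a minimal sketch of that flow, assuming the (pre-1.0) openai Python client and a made-up prompt prefix; the actual program in the demo may differ:

```python
import openai  # assumes the pre-1.0 openai client library, configured with an API key

PROMPT_PREFIX = "Write Python code that answers the following question.\nQuestion: "  # made-up prefix

def answer_by_exec(question: str) -> None:
    prompt = PROMPT_PREFIX + question + "\nCode:\n"
    response = openai.Completion.create(
        model="text-davinci-002",
        prompt=prompt,
        max_tokens=256,
        temperature=0,
    )
    code = response["choices"][0]["text"]
    # The part being pointed at above: running model output directly, no sandboxing.
    exec(code)
```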
One thing I have noticed with heavy use of Copilot/DALL-E is that it's great at getting you most of the way there. But a big thing it's not great at is repeatability. When relying on something like ACT-1 to do data entry in Salesforce, I need it to do essentially the same thing every time, even if the context is slightly different or I phrase the request slightly differently. How well will it be able to do that?
Also this is very very cool, I love copilot, I hope I get to use this thing very soon.
We are spending a lot of time thinking about reliability and it's true that existing models fall a little flat here.
I think ultimately the key to making this work really well is some combination of
a) collecting and training on human feedback and
b) doing intelligent things to samples from the model after the fact
I think it would be better if these tools simply generated UI functions that could be named and reused, à la a command list in an editor. I think future UIs will just be you talking and picking from a big list of commands, where the AI can navigate an API from its GraphQL/OpenAPI description and maybe automatically plot the data you want.
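A rough sketch of what that command-list idea could look like, assuming a spec URL and using only the standard library; `load_commands` and the URL are hypothetical, not anything Adept has described:

```python
import json
import urllib.request

HTTP_METHODS = {"get", "post", "put", "patch", "delete"}

def load_commands(spec_url: str) -> dict:
    """Map operationId -> (METHOD, path, summary) from an OpenAPI spec."""
    with urllib.request.urlopen(spec_url) as resp:
        spec = json.load(resp)
    commands = {}
    for path, ops in spec.get("paths", {}).items():
        for method, op in ops.items():
            if method not in HTTP_METHODS:
                continue  # skip path-level keys like "parameters"
            name = op.get("operationId") or f"{method}_{path}"
            commands[name] = (method.upper(), path, op.get("summary", ""))
    return commands

# commands = load_commands("https://example.com/openapi.json")  # hypothetical URL
# A model (or a user) could then pick a named command instead of taking free-form actions.
```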
On an unrelated note, I imagine this can solve recaptchas and other simple non-visual challenges.
Can I make ACT-1 Sybil a few thousand people on mechanical turk?
Can I submit CVs with ACT-1 for entry-level fully remote jobs and have it work for legacy companies, if those companies cannot set up ACT-1 themselves but provide a traditional human jobs interface?
Can I put an interface that interacts with the real world through controls and text on a webpage and have ACT-1 take a physical presence?
Some people self-host it, so it can solve some subset of the questions. The image understanding part is important: it can understand a question with an image (e.g. a user drops in a photo of a fridge to search by image), or with multiple images (e.g. "which of these 10 images is the nicest-looking fridge?" in a single API request). It also supports shared embeddings for images/text/code, which matters for the information retrieval / question answering case where it first needs to find the relevant context on Wikipedia and then feed it to the reader model to read the answer out.
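That retrieve-then-read step is roughly the following. A toy sketch only: the "embedding" here is just word overlap and the retrieval result stands in for what a real reader model would consume, since the actual models behind the service aren't specified here:

```python
# Toy retrieve-then-read: score passages against the question, hand the best one to a reader.
def similarity(a: str, b: str) -> float:
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / max(len(wa | wb), 1)

def retrieve_context(question: str, passages: list[str]) -> str:
    # A real system would compare learned embeddings instead of word overlap.
    return max(passages, key=lambda p: similarity(question, p))

# context = retrieve_context("when was the transformer paper published?",
#                            ["Attention Is All You Need was published in 2017.",
#                             "Fridges keep food cold."])
# A reader model would then extract the answer from `context`.
```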
Also, are there benchmark tasks that you either created or that already exist that you evaluated the model on?
PS - please don't let this be used as a way to prevent human interaction. Chatbots are a disaster and literally the worst possible application of ML: a shitty interface to a menu system. I hope this will be used in a way that is not consumer-hostile, and that the company actively resists ignorant business attempts to use it to avoid paying for customer support.
We used a combination of human demonstrations and feedback data!
You need custom software both to record the demonstrations and to represent the state of the Tool in a model-consumable way.
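To make that concrete (purely illustrative, Adept hasn't published their format): one way to represent browser state in a model-consumable way is to flatten the visible elements into short text lines the model can attend over and refer back to by id.

```python
from dataclasses import dataclass

@dataclass
class Element:
    tag: str         # e.g. "button", "input"
    text: str        # visible text or placeholder
    element_id: int  # stable id so the model can refer back to this element

def serialize_state(elements: list[Element]) -> str:
    """Turn the on-screen elements into one observation string for the model."""
    return "\n".join(f"[{e.element_id}] <{e.tag}> {e.text}" for e in elements)

# serialize_state([Element("button", "Save record", 3)]) -> "[3] <button> Save record"
```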
And was the feedback data used to train the model with reinforcement learning? Or did you request users to "correct" the action and get a supervised signal?
In the Salesforce example, it's modifying database contents. Suppose the model misunderstands your request and modifies data in an unintended way (e.g. adding garbage data or, worse, deleting data). What's the recovery plan?
Natural language interfaces are very limited and certainly not the next generation of computing. Granularity of functionality and composable input will always be more efficient as long as the original source is a human. I think the natural language part of your product is the least interesting and certainly not the most impressive.
> Natural language interfaces are [...] certainly not the next generation of computing.
I'm game to take the other side of that wager.
My instinct from the last 5 years of advances in ML language research is that we're right at the cusp of having radically better natural language interfaces.
I chose a half-decade horizon because "Attention Is All You Need", the paper that introduced the transformer model, was published in 2017. Two of its co-authors are co-founders of Adept.
Agreed. Multiple paradigms and "right tool for the job" and all that. Technology is often additive rather than successive.
At this point ... think of Google. How many tasks start with some vague statement of intent in a Google search, followed by refinement through reviewing pages or better search queries: "How do I ..." Already many (most) internet users are using natural language to interface with the internet. Average users don't go straight to Wikipedia - they type a question and get a response where Wikipedia is the first result, or the knowledge graph excerpt is sufficient. Or "plane ticket to $location" and get linked to airline sites (or the search engine's own interface).
We are much further along this path than many might suspect.
Most human interaction with each other is through natural language, and it works very well for us. Most of our goals start with this kind of communication. It feels to me like it's the natural way for us to interact.
We just didn't have the tech that was able to take our high level instructions and carry them out for us like a human can. I think that is the long term goal of human computer interaction. This product seems like a significant step towards that.
Do you have any research backing up the claim that human interaction "works well"? As far as I can tell it is absolutely horrific. So much so that we literally have a game called "telephone" mocking how bad it is.
It works well-ish in general, but it is rife with ambiguity, noise, lack of recall and bias. I find it horribly inefficient for my work on its own, and as someone who grew up with the teachings of older-generation programmers (most extreme, Dijkstra), I notice that most attempts to move everything to verbal communication are laziness; people don't want to write, draw, spec etc. if they can avoid it. I even think 'agile' is a lazy way to say you are not going to spec anything out, but just have a chat and hope for the best.
I think the key aspect of this is how you capture iterative refinement. One can capture a high-level intent ("Build a webpage with widgets to accomplish foo", or "Buy me tickets to the playoff game"), but upon inspection there are fractally increasing levels of detail that need further specification. For the webpage - say, authorization requirements or visual design details, just as an example. Or for the tickets ... which seats? What cost is acceptable? If the next best seat is just slightly more than the stated budget, is the tradeoff worth it? What about the special bundle that includes some sort of promotional item - worth it? And if it is worth it this time, does that apply next time?
For code-technical decisions, design documents, or the code itself provide a good record for the ever-increasing details and scope, but we certainly haven't found the best way to capture everything. Perennial debates about code documentation, or what should go in PRs, or literate-style programming, and more show we haven't figured it out. But at the same time, I definitely find myself having non-productive Slack conversations, or PR threads or whatever, until my colleague (or I) prompts "hangout?" or similar to have a call and talk through the issue. Something about the real-time conversation is able to cut through written mis-communication very well - which can then be captured back in text for the benefit of other colleagues.
I agree on this front in regards to language being a poor interface for technical tasks. NLP has some extremely useful applications, but language is inherently lossy and abstract. We use the same words to describe different things in different contexts, and it’s common to incorrectly use words in contexts we are unfamiliar with. I don’t think this is a solvable problem in a general domain. Given a more specific domain, I think it can become more accurate, but domain level understanding from the user is required to interface with the model and the scope of its utility will be dependent on how it is applied. Personally I don’t think this 1-1 interface to do a thing is an appropriate design for those situations, although perhaps IDEs will begin to use some NLP to learn when to recommend certain types of refactoring tools, sort of like linters for design choices.
Btw I want to clarify that I don't just mean verbal. Writing is natural language as well.
However, on your point I think natural language is the gateway to all those things. If you have a spec you will need to describe it in natural language first then use it.
Visual though is I think another way we should be able to communicate with a computer in the future and it seems like we are heading in that direction.
I guess how I imagine all this working is you provide all this high level instruction to the computer through writing/speaking as well as any diagrams and it figures it all out as well as queries you for any further clarification it needs. Just like interacting with your colleagues in day to day.
On the flipside, natural language interfaces have the potential to be extremely easy to use for anyone, including non-experts. Anyone can type a message to the computer, without having to learn the specifics of an interface's custom controls. There are different types of efficiency. I'm assuming you're referring to something like 'operational efficiency', while NLI wins on 'ease of adoption' per se.
Consider search engines and searching for content more generally: are you certain that natural language is most often used?
When I search for content, I use key terms to produce refined and better results. If you don't use such terms then what you're looking for may be difficult to find.
So, I actually thought the opposite: the growth in userbase that most big tech companies saw over the 2010s probably meant the "death" of people crafting queries à la AltaVista. But it turns out I think I'm wrong: Google's own research says the average query is 2-3 words long, but apparently query length has in fact been going up over the years.
I think this is true for everyone. Without ever knowing how google search works, somehow over time you figure out how to do prompt engineering and figure out that the results you want are more highly correlated with certain phrasing.
Wouldn’t something similar apply here though, where after using it for some time you get an inherent understanding of what it works well with and does not?
There’s always some sort of mix of things to learn and make available to the end user though, so that they understand how to use the tool to be successful.
But don't we do this with other humans too? The language one uses with close family, or friends, or colleagues all shifts to achieve the desired effect. Code-switching is pretty common. I think people underestimate how much communication is a feedback-driven process.
Nobody can type a message to a computer without knowing how the interface works. I think you are confused on this topic. Natural language is just a translation layer. If you wanted to you could put AI on top of a UI that works on intent as well.
Programming languages are languages as well though. Every language has tradeoffs, and there are multiple natural languages (english, chinese, spanish, etc). I'd be surprised if english was the best tool for every task
> I'd be surprised if english was the best tool for every task
Certainly not - language models (including multilingual models) map words into some kind of (300-ish-dimensional?) concept space. I wonder if we can translate that back into some sort of symbolic representation that is much more precise than human language. Some kind of IR we could compile human language into, but also program against. I suspect the early Prolog people were attempting something like this but were very wrong in framing reasoning as logical deduction rather than a stochastic process.
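The "concept space" bit in miniature - a toy sketch with made-up 3-dimensional vectors (real models learn a few hundred to a few thousand dimensions); closeness in the space stands in for closeness in meaning:

```python
import math

# Made-up vectors purely for illustration; a trained model would supply these.
embedding = {
    "plane":  [0.9, 0.1, 0.3],
    "flight": [0.8, 0.2, 0.4],
    "fridge": [0.1, 0.9, 0.2],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

# cosine(embedding["plane"], embedding["flight"]) > cosine(embedding["plane"], embedding["fridge"])
```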
I remember a long time ago, reading about Semantic Web and intelligent agents and dreaming of a Natural Language interface for planning journeys…
“I want to travel from Seville to Berlin next October, avoiding weekends, for a two or three nights stay in a hotel by the river. Direct flights preferred.”
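What a planner would presumably want to extract from that sentence is something like the following structured intent - written by hand here, not produced by any real parser, and the field names are invented:

```python
trip_request = {
    "origin": "Seville",
    "destination": "Berlin",
    "month": "October",
    "avoid": ["weekends"],
    "nights": [2, 3],
    "hotel": {"near": "river"},
    "flights": {"direct_preferred": True},
}
```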
If this scales up, it can be thought of as "actionable Google search", and if taken to the extreme, it has the potential to make the internet query-able, for better or worse.
>Anyone who can articulate their ideas in language can implement them
I'd be shocked if even 10% of the users who can't navigate a GUI could accurately describe what they want the software to do. To the user who doesn't know they can use Ctrl-Z to undo, the first half dozen times the AI mangles their inherited spreadsheet might be enough to put them off the idea.
I’ve been thinking for a while about a programming language for ordinary people, able to interface with machines through pure casual conversation (not exact commands), and I feel something is coming in the next few decades, if not earlier. Imagine the ability to casually chat with a widget which understands flawlessly, and where most devices could communicate the same way. This could eventually be used in psychotherapy, in all kinds of automation around humans, and in nefarious ways as well. I’m only hopeful for a human-augmentation scenario, but there are countless ways it could turn out totally differently.
Certainly there is a huge middle ground. Vague, but common, use cases might have more articulate versions of the commands inferred. I find myself learning new tools all the time - I certainly have enough domain knowledge of many things to express intent without describing implementation. I suspect plenty of people are similar enough - just operating at different levels of abstraction.
What I find more concerning would be people operating under misconceptions, or being more precise than needed, thus not actually accomplishing their objective with the introduction of irrelevant detail.
- OK here's my email
- Please select all pictures of taxis to prove you are not a robot
ಥ_ಥ
Seriously though, the potential is good. I see several things they're doing right that have the potential to distinguish them from competing offerings.
Wow! Love it, this is the most exciting thing I've seen in a while. I'm working on something similar, and it's so great to see others who seem to get it and are chasing generalization in AI systems!
A few questions:
1. I'm curious if you're representing the task-operations using RL techniques (as many personal assistant systems seem to be) or if this is entirely a seq2seq transformer style model for predicting actions?
2. Assumption: Due to scaling of transformers, I assume that this is not directly working on the image data of a screen, and instead is working off of DOM trees; (2a) is this the case? and (2b) if so, are you using purely linear tokenization of the tree or are you using something closer to Evoformer (AlphaFold style) to combine graphs-neural nets and transformers?
3. Have you noticed that learning actions and representations of one application transfers well to new applications? or is the quality of the model heavily dependent on app domain?
I noticed multiple references to data applications (Excel, tableau, etc.). My challenge is that large language models and AI systems in general are about to hit a wall in the data domain because they fundamentally don't understand data [1] [2], which will ultimately limit the quality of these capabilities.
I am personally tackling this problem directly. I'm trying to prove out more coherent data-aware operations in these systems by building a "foundation model" for tabular data that connects to LLMs - think RETRO-style lookups of embeddings representing columns of data. I have been prototyping conversational AI systems (mostly Q/A oriented), and have recently been moving towards task-oriented operations (right now, transparently, just SQL executors).
There seem to be good representations of DOM tree/visual-object models that you all are working with to take reasonable action; however, I assume these are limited in scale (N^2 and all), so I am wondering if you have any opinions on how to extend these systems for data, especially as the windowed context grows (e.g. an Excel sheet with 100k+ rows)?
[1] https://arxiv.org/abs/2106.03253 "Tabular Data: Deep Learning is Not All You Need"
[2] https://arxiv.org/abs/2110.01889 "In summary, we think that a fundamental reorientation of the domain may be necessary. For now, the question of whether the use of current deep learning techniques is beneficial for tabular data can generally be answered in the negative"
Thanks - glad you like it! I probably won't get to all of these but let me try a couple:
1. There's a spectrum (sort of) between using full on RL techniques and just doing sequence modeling. We're trying to pick a reasonable place on that spectrum that lets us model whether things have gone well without doing too much fiddling.
3. It really depends on how closely related the domains are. I think it's safe to say that you should expect more transfer of abstract/high-level capabilities than nitty-gritty things related to the specific domain - that's part of why we're excited about training one big model to use all software tools.
Many people worry that these things will take over our jobs. Worry not! Imagine how much work will be needed to fix things when these models screw up, and how much we will charge per hour.
I think it's going to be expensive to run and require an account with AdeptAI. Most usage will be to automate office work, RPA style. They could also detect malicious usage as the model sees the pages and takes actions in those pages.
Awesome. I wonder if the app is recording what we do so it can replicate it; if not, maybe it should have a training mode where we tell it what we're about to do and then do it, so it can learn.
I feel like some of this could one day be built using a shared model that understands HTML and JavaScript code etc. with a few example prompts. Or maybe something that understands intent plus a browser automation language like Selenium; if not that, then some custom input/output language plus training, as Adept alludes to.
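The "intent plus browser automation" half could look something like this. A hand-wavy sketch only: the step list is hard-coded, the site and selectors are invented, and the interesting part - having a model produce the steps from a natural-language request - is left out.

```python
from selenium import webdriver
from selenium.webdriver.common.by import By

def run_steps(steps):
    """Execute (action, target, value) triples in a real browser."""
    driver = webdriver.Chrome()
    try:
        for action, target, value in steps:
            if action == "open":
                driver.get(target)
            elif action == "type":
                driver.find_element(By.CSS_SELECTOR, target).send_keys(value)
            elif action == "click":
                driver.find_element(By.CSS_SELECTOR, target).click()
    finally:
        driver.quit()

# A model might translate "search the site for red staplers" into:
# run_steps([("open", "https://example.com", None),
#            ("type", "input[name=q]", "red stapler"),
#            ("click", "button[type=submit]", None)])
```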
If you're interested in building something like this, also check out https://text-generator.io, which already pulls down links and images to analyse for generating better text, so it has a lot of the required parts.
https://mobile.twitter.com/sergeykarayev/status/156937788144...