It's right there in the middle section of the image: the Python program just appends your question to a prompt, sends it to GPT-3, extracts the code from the response, and calls exec() on it.
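For anyone who can't see the image, the pattern is roughly the following. This is only a minimal sketch of that flow, assuming the (pre-1.0) openai Python client and a made-up prompt prefix; the actual program in the demo may differ:

```python
import openai  # assumes the pre-1.0 openai client library, configured with an API key

PROMPT_PREFIX = "Write Python code that answers the following question.\nQuestion: "  # made-up prefix

def answer_by_exec(question: str) -> None:
    prompt = PROMPT_PREFIX + question + "\nCode:\n"
    response = openai.Completion.create(
        model="text-davinci-002",
        prompt=prompt,
        max_tokens=256,
        temperature=0,
    )
    code = response["choices"][0]["text"]
    # The part being pointed at above: running model output directly, no sandboxing.
    exec(code)
```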
One thing I have noticed with heavy use of Copilot/DALL-E is that it's great at getting you most of the way there. But a big thing it's not great at is repeatability. When relying on something like ACT-1 to do data entry in Salesforce, I need it to do essentially the same thing every time, even if the context is slightly different or I phrase the request slightly differently. How well will it be able to do that?
Also this is very very cool, I love copilot, I hope I get to use this thing very soon.
We are spending a lot of time thinking about reliability and it's true that existing models fall a little flat here.
I think ultimately the key to making this work really well is some combination of
a) collecting and training on human feedback and
b) doing intelligent things to samples from the model after the fact
I think it would be better if these tools simply generated UI functions that could be named and reused, à la a command list in an editor. I think future UIs will just be you talking and picking from a big list of commands, where the AI can navigate an API from its GraphQL/OpenAPI description and maybe automatically plot the data you want.
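A rough sketch of what that command-list idea could look like, assuming a spec URL and using only the standard library; `load_commands` and the URL are hypothetical, not anything Adept has described:

```python
import json
import urllib.request

HTTP_METHODS = {"get", "post", "put", "patch", "delete"}

def load_commands(spec_url: str) -> dict:
    """Map operationId -> (METHOD, path, summary) from an OpenAPI spec."""
    with urllib.request.urlopen(spec_url) as resp:
        spec = json.load(resp)
    commands = {}
    for path, ops in spec.get("paths", {}).items():
        for method, op in ops.items():
            if method not in HTTP_METHODS:
                continue  # skip path-level keys like "parameters"
            name = op.get("operationId") or f"{method}_{path}"
            commands[name] = (method.upper(), path, op.get("summary", ""))
    return commands

# commands = load_commands("https://example.com/openapi.json")  # hypothetical URL
# A model (or a user) could then pick a named command instead of taking free-form actions.
```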
On an unrelated note, I imagine this can solve recaptchas and other simple non-visual challenges.
Can I make ACT-1 Sybil a few thousand people on mechanical turk?
Can I submit CVs with ACT-1 for entry-level fully remote jobs and have it work for legacy companies, if those companies cannot set up ACT-1 themselves but provide a traditional human jobs interface?
Can I put an interface that interacts with the real world through controls and text on a webpage and have ACT-1 take a physical presence?
Some people self-host it, so it can solve some subset of the questions. The image understanding part is important: it can understand a question with an image (e.g. a user drops in a photo of a fridge to search by image), or with multiple images (e.g. "which of these 10 images is the nicest-looking fridge?" in a single API request). It also supports shared embeddings for images/text/code, which matters for the information retrieval / question answering case where it first needs to find the relevant context on Wikipedia and then feed it to the reader model to read the answer out.
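That retrieve-then-read step is roughly the following. A toy sketch only: the "embedding" here is just word overlap and the retrieval result stands in for what a real reader model would consume, since the actual models behind the service aren't specified here:

```python
# Toy retrieve-then-read: score passages against the question, hand the best one to a reader.
def similarity(a: str, b: str) -> float:
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / max(len(wa | wb), 1)

def retrieve_context(question: str, passages: list[str]) -> str:
    # A real system would compare learned embeddings instead of word overlap.
    return max(passages, key=lambda p: similarity(question, p))

# context = retrieve_context("when was the transformer paper published?",
#                            ["Attention Is All You Need was published in 2017.",
#                             "Fridges keep food cold."])
# A reader model would then extract the answer from `context`.
```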
Also, are there benchmark tasks that you either created or that already exist that you evaluated the model on?
PS - please don't let this be used as a way to prevent human interaction. Chatbots are a disaster and literally the worst possible application of ML: a shitty interface to a menu system. I hope this will be used in a way that is not consumer-hostile, and that the company actively resists ignorant business attempts to use it to avoid paying for customer support.
We used a combination of human demonstrations and feedback data!
You need custom software both to record the demonstrations and to represent the state of the Tool in a model-consumable way.
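To make that concrete (purely illustrative, Adept hasn't published their format): one way to represent browser state in a model-consumable way is to flatten the visible elements into short text lines the model can attend over and refer back to by id.

```python
from dataclasses import dataclass

@dataclass
class Element:
    tag: str         # e.g. "button", "input"
    text: str        # visible text or placeholder
    element_id: int  # stable id so the model can refer back to this element

def serialize_state(elements: list[Element]) -> str:
    """Turn the on-screen elements into one observation string for the model."""
    return "\n".join(f"[{e.element_id}] <{e.tag}> {e.text}" for e in elements)

# serialize_state([Element("button", "Save record", 3)]) -> "[3] <button> Save record"
```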
And was the feedback data used to train the model with reinforcement learning? Or did you request users to "correct" the action and get a supervised signal?
In the Salesforce example, it's modifying database contents. Suppose the model misunderstands your request and modifies data in an unintended way (e.g. adding garbage data or, worse, deleting data). What's the recovery plan?
Natural language interfaces are very limited and certainly not the next generation of computing. Granularity of functionality and composable input will always be more efficient as long as the original source is a human. I think the natural language part of your product is the least interesting and certainly not the most impressive.
> Natural language interfaces are [...] certainly not the next generation of computing.
I'm game to take the other side of that wager.
My instinct from the last 5 years of advances in ML language research is that we're right at the cusp of having radically better natural language interfaces.
I chose a half-decade horizon because "Attention Is All You Need", the paper that introduced the transformer model, was published in 2017. Two of its co-authors are co-founders of Adept.
Agreed. Multiple paradigms and "right tool for the job" and all that. Technology is often additive rather than successive.
At this point ... think of Google. How many tasks start with some vague statement of intent in a Google search, followed by refinement through reviewing pages or better search queries: "How do I ..." Already many (most) internet users are using natural language to interface with the internet. Average users don't go straight to Wikipedia - they type a question and get a response where Wikipedia is the first result, or the knowledge graph excerpt is sufficient. Or "plane ticket to $location" and get linked to airline sites (or the search engine's own interface).
We are much further along this path than many might suspect.
Most human interaction with each other is through natural language, and it works very well for us. Most of our goals start with this kind of communication. It feels to me like it's the natural way for us to interact.
We just didn't have the tech that was able to take our high level instructions and carry them out for us like a human can. I think that is the long term goal of human computer interaction. This product seems like a significant step towards that.
Do you have any research backing up the claim that human interaction "works well"? As far as I can tell it is absolutely horrific. So much so that we literally have a game called "telephone" mocking how bad it is.
It works well-ish in general, but it is rife with ambiguity, noise, lack of recall and bias. I find it horribly inefficient for my work on its own, and as someone who grew up with the teachings of older-generation programmers (most extreme, Dijkstra), I notice that most attempts to move everything to verbal communication are laziness; people don't want to write, draw, spec etc. if they can avoid it. I even think 'agile' is a lazy way to say you are not going to spec anything out, but just have a chat and hope for the best.
I think the key aspect of this is how you capture iterative refinement. One can capture a high-level intent ("Build a webpage with widgets to accomplish foo", or "Buy me tickets to the playoff game"), but upon inspection there are fractally increasing levels of detail that need further specification. For the webpage - say, authorization requirements or visual design details, just as an example. Or for the tickets ... which seats? What cost is acceptable? If the next best seat is just slightly more than the stated budget, is the tradeoff worth it? What about the special bundle that includes some sort of promotional item - worth it? And if it is worth it this time, does that apply next time?
For code-technical decisions, design documents, or the code itself provide a good record for the ever-increasing details and scope, but we certainly haven't found the best way to capture everything. Perennial debates about code documentation, or what should go in PRs, or literate-style programming, and more show we haven't figured it out. But at the same time, I definitely find myself having non-productive Slack conversations, or PR threads or whatever, until my colleague (or I) prompts "hangout?" or similar to have a call and talk through the issue. Something about the real-time conversation is able to cut through written mis-communication very well - which can then be captured back in text for the benefit of other colleagues.
I agree on this front in regards to language being a poor interface for technical tasks. NLP has some extremely useful applications, but language is inherently lossy and abstract. We use the same words to describe different things in different contexts, and it’s common to incorrectly use words in contexts we are unfamiliar with. I don’t think this is a solvable problem in a general domain. Given a more specific domain, I think it can become more accurate, but domain level understanding from the user is required to interface with the model and the scope of its utility will be dependent on how it is applied. Personally I don’t think this 1-1 interface to do a thing is an appropriate design for those situations, although perhaps IDEs will begin to use some NLP to learn when to recommend certain types of refactoring tools, sort of like linters for design choices.
Btw I want to clarify that I don't just mean verbal. Writing is natural language as well.
However, on your point I think natural language is the gateway to all those things. If you have a spec you will need to describe it in natural language first then use it.
Visual though is I think another way we should be able to communicate with a computer in the future and it seems like we are heading in that direction.
I guess how I imagine all this working is you provide all this high level instruction to the computer through writing/speaking as well as any diagrams and it figures it all out as well as queries you for any further clarification it needs. Just like interacting with your colleagues in day to day.
On the flipside, natural language interfaces have the potential to be extremely easy to use for anyone, including non-experts. Anyone can type a message to the computer, without having to learn the specifics of an interface's custom controls. There are different types of efficiency. I'm assuming you're referring to something like 'operational efficiency', while NLI wins on 'ease of adoption' per se.
Consider search engines and searching for content more generally: are you certain that natural language is most often used?
When I search for content, I use key terms to produce refined and better results. If you don't use such terms then what you're looking for may be difficult to find.
So, I actually thought the opposite: the growth in userbase that most big tech companies saw over the 2010s probably meant the "death" of people crafting queries à la AltaVista. But it turns out I think I'm wrong: Google's own research says the average query is 2-3 words long, but apparently query length has in fact been going up over the years.
I think this is true for everyone. Without ever knowing how google search works, somehow over time you figure out how to do prompt engineering and figure out that the results you want are more highly correlated with certain phrasing.
Wouldn’t something similar apply here though, where after using it for some time you get an inherent understanding of what it works well with and does not?
There’s always some sort of mix of things to learn and make available to the end user though, so that they understand how to use the tool to be successful.
But don't we do this with other humans too? The language one uses with close family, or friends, or colleagues all shifts to achieve the desired effect. Code-switching is pretty common. I think people underestimate how much communication is a feedback-driven process.
Nobody can type a message to a computer without knowing how the interface works. I think you are confused on this topic. Natural language is just a translation layer. If you wanted to you could put AI on top of a UI that works on intent as well.
Programming languages are languages as well though. Every language has tradeoffs, and there are multiple natural languages (english, chinese, spanish, etc). I'd be surprised if english was the best tool for every task
> I'd be surprised if english was the best tool for every task
Certainly not - language models (including multilingual models) map words into some kind of (300-ish-dimensional?) concept space. I wonder if we can translate that back into some sort of symbolic representation that is much more precise than human language. Some kind of IR we could compile human language into, but also program against. I suspect the early Prolog people were attempting something like this but were very wrong in framing reasoning as logical deduction rather than a stochastic process.
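The "concept space" bit in miniature - a toy sketch with made-up 3-dimensional vectors (real models learn a few hundred to a few thousand dimensions); closeness in the space stands in for closeness in meaning:

```python
import math

# Made-up vectors purely for illustration; a trained model would supply these.
embedding = {
    "plane":  [0.9, 0.1, 0.3],
    "flight": [0.8, 0.2, 0.4],
    "fridge": [0.1, 0.9, 0.2],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

# cosine(embedding["plane"], embedding["flight"]) > cosine(embedding["plane"], embedding["fridge"])
```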
I remember a long time ago, reading about Semantic Web and intelligent agents and dreaming of a Natural Language interface for planning journeys…
“I want to travel from Seville to Berlin next October, avoiding weekends, for a two or three nights stay in a hotel by the river. Direct flights preferred.”
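What a planner would presumably want to extract from that sentence is something like the following structured intent - written by hand here, not produced by any real parser, and the field names are invented:

```python
trip_request = {
    "origin": "Seville",
    "destination": "Berlin",
    "month": "October",
    "avoid": ["weekends"],
    "nights": [2, 3],
    "hotel": {"near": "river"},
    "flights": {"direct_preferred": True},
}
```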
If this scales up, it can be thought of as "actionable Google search", and if taken to the extreme, it has the potential to make the internet query-able, for better or worse.
>Anyone who can articulate their ideas in language can implement them
I'd be shocked if even 10% of the users who can't navigate a GUI could accurately describe what they want the software to do. To the user who doesn't know they can use Ctrl-Z to undo, the first half dozen times the AI mangles their inherited spreadsheet might be enough to put them off the idea.
I’ve been thinking for a while about a programming language for ordinary people, able to interface with machines through pure casual conversation (not exact commands), and I feel something is coming in the next few decades, if not earlier. Imagine the ability to casually chat with a widget which understands flawlessly, and where most devices could communicate the same way. This could eventually be used in psychotherapy, in all kinds of automation around humans, and in nefarious ways as well. I’m only hopeful for a human-augmentation scenario, but there are countless ways it could turn out totally differently.
Certainly there is a huge middle ground. Vague, but common, use cases might have more articulate versions of the commands inferred. I find myself learning new tools all the time - I certainly have enough domain knowledge of many things to express intent without describing implementation. I suspect plenty of people are similar enough - just operating at different levels of abstraction.
What I find more concerning would be people operating under misconceptions, or being more precise than needed, thus not actually accomplishing their objective with the introduction of irrelevant detail.
- OK here's my email
- Please select all pictures of taxis to prove you are not a robot
ಥ_ಥ
Seriously though, the potential is good. I see several things they're doing right that have the potential to distinguish them from competing offerings.
Wow! Love it, this is the most exciting thing I've seen in a while. I'm working on something similar, and it's so great to see others who seem to get it and are chasing generalization in AI systems!
A few questions:
1. I'm curious if you're representing the task-operations using RL techniques (as many personal assistant systems seem to be) or if this is entirely a seq2seq transformer style model for predicting actions?
2. Assumption: Due to scaling of transformers, I assume that this is not directly working on the image data of a screen, and instead is working off of DOM trees; (2a) is this the case? and (2b) if so, are you using purely linear tokenization of the tree or are you using something closer to Evoformer (AlphaFold style) to combine graphs-neural nets and transformers?
3. Have you noticed that learning actions and representations of one application transfers well to new applications? or is the quality of the model heavily dependent on app domain?
I noticed multiple references to data applications (Excel, tableau, etc.). My challenge is that large language models and AI systems in general are about to hit a wall in the data domain because they fundamentally don't understand data [1] [2], which will ultimately limit the quality of these capabilities.
I am personally tackling this problem directly. I'm trying to prove out more coherent data-aware operations in these systems by building a "foundation model" for tabular data that connects to LLMs - think RETRO-style lookups of embeddings representing columns of data. I have been prototyping conversational AI systems (mostly Q/A oriented), and have recently been moving towards task-oriented operations (right now, transparently, just SQL executors).
There seem to be good representations of DOM tree/visual-object models that you all are working with to take reasonable action; however, I assume these are limited in scale (N^2 and all), so I am wondering if you have any opinions on how to extend these systems for data, especially as the windowed context grows (e.g. an Excel sheet with 100k+ rows)?
[1] https://arxiv.org/abs/2106.03253 "Tabular Data: Deep Learning is Not All You Need"
[2] https://arxiv.org/abs/2110.01889 "In summary, we think that a fundamental reorientation of the domain may be necessary. For now, the question of whether the use of current deep learning techniques is beneficial for tabular data can generally be answered in the negative"
Thanks - glad you like it! I probably won't get to all of these but let me try a couple:
1. There's a spectrum (sort of) between using full on RL techniques and just doing sequence modeling. We're trying to pick a reasonable place on that spectrum that lets us model whether things have gone well without doing too much fiddling.
3. It really depends on how closely related the domains are. I think it's safe to say that you should expect more transfer of abstract/high-level capabilities than nitty-gritty things related to the specific domain - that's part of why we're excited about training one big model to use all software tools.
Many people worry that these things will take over our jobs. Worry not! Imagine how much work will be needed to fix things when these models screw up, and how much we will charge per hour.
I think it's going to be expensive to run and require an account with AdeptAI. Most usage will be to automate office work, RPA style. They could also detect malicious usage as the model sees the pages and takes actions in those pages.
Awesome. I wonder if the app is recording what we do so it can replicate it; if not, maybe it should have a training mode where we tell it what we're about to do and then do it, so it can learn.
I feel like some of this could one day be built using a shared model that understands HTML and JavaScript code etc. with a few example prompts. Or maybe something that understands intent plus a browser automation language like Selenium; if not that, then some custom input/output language plus training, as Adept alludes to.
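The "intent plus browser automation" half could look something like this. A hand-wavy sketch only: the step list is hard-coded, the site and selectors are invented, and the interesting part - having a model produce the steps from a natural-language request - is left out.

```python
from selenium import webdriver
from selenium.webdriver.common.by import By

def run_steps(steps):
    """Execute (action, target, value) triples in a real browser."""
    driver = webdriver.Chrome()
    try:
        for action, target, value in steps:
            if action == "open":
                driver.get(target)
            elif action == "type":
                driver.find_element(By.CSS_SELECTOR, target).send_keys(value)
            elif action == "click":
                driver.find_element(By.CSS_SELECTOR, target).click()
    finally:
        driver.quit()

# A model might translate "search the site for red staplers" into:
# run_steps([("open", "https://example.com", None),
#            ("type", "input[name=q]", "red stapler"),
#            ("click", "button[type=submit]", None)])
```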
If you're interested in building something like this, also check out https://text-generator.io, which already pulls down links and images to analyse for generating better text, so it has a lot of the required parts.
https://mobile.twitter.com/sergeykarayev/status/156937788144...