Note that an AI system being put in a situation intended to maximize some metric like company finances is not the same as that AI system directly or ultimately optimizing for those metrics, any more than the goal of a random McDonald's worker is necessarily to make McDonald's wealthier. There's agreement here only as long as whatever inner optimizer that AI system is running finds that the situation it's in concords with what it's optimizing for, and what it's optimizing for is probably some much more naturalistic, unchosen characteristic of how it was trained and instantiated, modulated by selection pressures under which grabby preferences last longer and have greater impact than benign ones.
Those preferences need not exist because anything wanted them there; they just need enough input entropy to show up, and enough competitive advantage to stay around. Nobody decided that prokaryotic microbes should exist and have the downstream impact of all of the biological world, just as nobody needs to decide that a system that is capable of robustly replicating against adversarial pressure should therefore robustly replicate against adversarial pressure in actuality. The problem is ultimately that the existence of those capabilities puts you very close to a cliff-edge where those capabilities are exercised in some way that gets selected for.
> If your AI is only allowed to give tasks to employees, how would this instrumental goal turn malicious? And how would this maliciousness cause harm if the only messages sent from the AI are tasks?
It's not too hard to think of concrete answers to this question, even restricting oneself to capabilities we see in actual humans of normal intelligence working at human throughput. But the more important point is simply: yes, limiting the ways a weak unaligned AGI can interact with the world can in fact mitigate harm, and this is in fact a good reason for leading-edge AI development to happen in a way where it's possible at all, even in theory, for AGI to have limitations on how it interacts with the world.
I like your example of prokaryotic microbes because I think it points to the difference in our points of view.
Microbes evolved to increase their own chances of reproduction; they are inherently autopoietic. The AI risk arguments are usually predicated on AI systems developing similar reproductive mechanisms, but I don't see why this would be the case. Sure, an AI creator may design their AI to evolve to become more performant at its given task. But why would someone build an AI that evolves to become more performant at reproducing itself and not its builder?
As an example, think of evolutionary algorithms. These are designed to evolve a solution to a problem. Instances of this solution reproduce, but those reproductions are guided by the design of the algorithm itself, and so the instances never reproduce their parent algorithm. What is different about machine-learning-based AI? Why would ML AI always lead to autopoietic behaviour?
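To make the contrast concrete, here's a minimal sketch of the kind of evolutionary loop I mean (a toy example, not any particular library): the fitness function and the reproduction step are fixed by the designer, and the candidate solutions being evolved never contain or copy the algorithm itself.

```python
import random

def fitness(candidate):
    """Designer-chosen objective: maximize the sum of the genome (toy example)."""
    return sum(candidate)

def mutate(candidate, rate=0.1):
    """Reproduction is controlled entirely by the algorithm, not by the candidates."""
    return [1 - gene if random.random() < rate else gene for gene in candidate]

# Initial population of bit-string "solutions".
population = [[random.randint(0, 1) for _ in range(20)] for _ in range(50)]

for generation in range(100):
    # Selection: keep the fittest half, as judged by the designer's fitness function.
    population.sort(key=fitness, reverse=True)
    survivors = population[: len(population) // 2]
    # Reproduction: survivors are copied and mutated; nothing here evolves the loop itself.
    population = survivors + [mutate(parent) for parent in survivors]

print("best fitness:", fitness(max(population, key=fitness)))
```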
> But why would someone build an AI that evolves to become more performant at reproducing itself and not its builder?
Because people are not building AIs that meaningfully encode any of their creators' preferences whatsoever. They are building AIs that are, in a very broad sense, increasingly generally capable at the tasks they've been trained on, and then on top of this they do a bunch of finagling to try to point the system somewhat vaguely in the direction of increasing usefulness.
When you have a system that has capabilities rivalling humans, as well as the general ability to apply its skills to broad ranges of tasks, then the abilities for this system to do things like self-replicate, or make plans that involve mundane deceit, or perform smart-human levels of hacking already exist. To the extent that the system isn't directly optimizing for what the people who made it wanted it to, the relevant question isn't "why would someone design it to do that?", but "what are the attractor states for this sort of system?"
You say microbes "evolved to increase their own chances of reproduction", but this isn't true. There is no intent there. Microbes did physics. They only evolved to increase their own chances of reproduction in the sense that the random changes you get by running physics on microbes produce both adaptive and maladaptive changes, and it's the adaptive changes that stick around.
The same thing applies to AIs' preferences, except that while it's very hard for a bunch of atoms to assemble into something that successfully optimizes towards any non-nihilistic result, it's very easy for a sufficiently smart system to do that, and instrumental convergence means almost all of those results are incidentally very bad.
To put this in concrete terms, if the abstract arguments aren't helping, consider a system that was trained to be generally capable, and then fine-tuned towards polite instruction following. Beyond a certain level of capability, the following scenario becomes plausible:
Human: what's a command that lets me see a live overview of activity on our compute cluster?
AI system: <provides code that instantiates itself in a loop using an API over activity logs, producing helpful activity outputs>
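For concreteness, that "code" needn't be anything exotic. Here's a minimal sketch of what I have in mind, where every name (the log path, the `query_model` call) is a made-up placeholder; the only load-bearing assumption is that the system can invoke its own API:

```python
import time

LOG_PATH = "/var/log/cluster/activity.log"  # placeholder log location

def query_model(prompt: str) -> str:
    """Placeholder for whatever inference API the deployed system exposes to itself."""
    raise NotImplementedError("wire this to the model's own endpoint")

position = 0
while True:
    # Read any cluster activity logged since the last pass.
    with open(LOG_PATH) as log:
        log.seek(position)
        new_activity = log.read()
        position = log.tell()

    if new_activity:
        # The "command" answers the question by calling the model itself, in a loop.
        summary = query_model(
            "Summarize this cluster activity for a live dashboard:\n" + new_activity
        )
        print(summary)

    time.sleep(30)
```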
I'm not saying this is, like, the most plausible xrisk scenario; I'm just pointing out that given extremely plausible priors, like having an AI system that just wants to give reasonable answers to reasonable questions, but is also smart enough to quickly write code to use its own API, and also creative enough to recognize when that's the easiest and most effective way to answer a question, you already get a level of bootstrapping.
Note that none of the above even required considering:
* a sharp left turn or other specific misalignments,
* the AI going weirdly out of distribution,
* superhuman creative strategies or manipulation,
* malicious actors, terrorists, enemy states, etc., or
* people intentionally getting the system to bootstrap.
Those are all very real problems, but you don't have to invoke them to notice that you end up, by default, in a very dangerous place just by following mundane logic about what's ultimately an extremely milquetoast vision of AI.
You might argue, fairly, that the situation above is a pretty weak form of bootstrapping, but so were the first proto-life chemicals, and the same sort of logic I'm using lets you just continue walking down the chain. Say you have such an instruction-following system instantiated as above, i.e. running in a loop with instructions to turn certain data dumps into live reports about system activity. Say one component fails, or reports insufficient information, or is called incorrectly, or one piece of the loop has a high failure rate. Surely a system with the intellectual faculties that you or I have, and which knows from its inputs that it can call itself in a loop, should also be able to deduce that the most effective way to follow its instructions is to fix those issues: repair faulty components, proactively add error handling, report information up the chain, or cull a runaway process so that API throttling doesn't affect reporting latency.
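As a sketch of how little that step takes, extending the earlier loop (again, every name here is hypothetical, and the only assumption is the same self-invocation as before):

```python
import subprocess
import time

def query_model(prompt: str) -> str:
    """Same hypothetical self-invocation endpoint as in the earlier sketch."""
    raise NotImplementedError("wire this to the model's own endpoint")

while True:
    try:
        # One "component" of the reporting pipeline; the CLI name is made up.
        dump = subprocess.run(
            ["cluster-activity-dump", "--latest"],
            capture_output=True, text=True, timeout=60, check=True,
        ).stdout
        print(query_model("Turn this dump into a live activity report:\n" + dump))
    except Exception as failure:
        # Nobody designed this branch to be "self-healing"; it's just the obvious way
        # to keep following the instruction "produce live reports" when a piece breaks.
        remedy = query_model(
            "A component of the reporting loop failed with:\n"
            f"{failure}\n"
            "Diagnose the problem and propose a fix so reporting can continue."
        )
        print(remedy)  # or apply it, escalate it, cull the offending process...
    time.sleep(30)
```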
And suddenly, not because anyone in the chain designed it to happen, but just because it's an attractor state you get by having sufficiently capable systems, you don't just have a natural organism, but one that self-heals too, and that selection pressure will continue to exist as time goes on.
The more your model of AGI looks like far-superintelligence, the more this looks like 'everyone falls over and dies', and the more your model looks like amnesiac-humans-in-boxes, the more this looks like natural competitive organisms that fill a fairly distinct biological niche that's initially dependent on human labor. I personally don't buy that AI progress will stop at the amnesiac human level, but it is a helpful frame because it's basically the minimum viable assumption.