I really wish when organizations released these kinds of statements that they would provide some clarifying examples, otherwise things can feel very nebulous. For example, their first bullet point was:
> Establishing Red Line Capabilities. We commit to identifying and publishing "Red Line Capabilities" which might emerge in future generations of models and would present too much risk if stored or deployed under our current safety and security practices (referred to as the ASL-2 Standard).
What types of things are they thinking about that would be "red line capabilities" here? Is it purely just "knowledge stuff that shouldn't be that easy to find", e.g. "simple meth recipes" or "make a really big bomb", or is it something deeper? For example, I've already seen AI demos where, with just a couple short audio samples, speech generation can pretty convincingly sound like the person who recorded the samples. Obviously there is huge potential for misuse of that, but given the knowledge is already "out there", is this something that would be considered a red line capability?
Hi, I'm the CISO from Anthropic. Thank you for the criticism, any feedback is a gift.
We have laid out in our RSP what we consider the next milestone of significant harms that we're are testing for (what we call ASL-3): https://anthropic.com/responsible-scaling-policy (PDF); this includes bioweapons assessment and cybersecurity.
As someone thinking night and day about security, I think the next major area of concern is going to be offensive (and defensive!) exploitation. It seems to me that within 6-18 months, LLMs will be able to iteratively walk through most open source code and identify vulnerabilities. It will be computationally expensive, though: that level of reasoning requires a large amount of scratch space and attention heads. But it seems very likely, based on everything that I'm seeing. Maybe 85% odds.
There's already the first sparks of this happening published publicly here: https://security.googleblog.com/2023/08/ai-powered-fuzzing-b... just using traditional LLM-augmented fuzzers. (They've since published an update on this work in December.) I know of a few other groups doing significant amounts of investment in this specific area, to try to run faster on the defensive side than any malign nation state might be.
Please check out the RSP, we are very explicit about what harms we consider ASL-3. Drug making and "stuff on the internet" is not at all in our threat model. ASL-3 seems somewhat likely within the next 6-9 months. Maybe 50% odds, by my guess.
Their is also an other scene in Nolan's OppenHeimer (who made the cut around timestamp 27:45) where physicists get all excited when a paper is published where Hahn and Strassmann split uranium with neutrons. Alvarez the experimentalist replicate it happily, while being oblivious to the fact that seems obvious to every theoretical physicist : It can be used to create a chain reaction and therefore a bomb.
So here is my question : how do you contain the sparks of employees ? Let's say Alvarez comes all excited in your open-space, and speak a few words "new algorithm", "1000X", what do you do ?
+1 request for more information on this. Is there a search term for arxiv? Your comment here in this thread is the top google result for "compute multiplier".
The net of your "Responsible Scaling Policy" seems to be that it's okay if your AI misbehaves as long as it doesn't kill thousands of people.
Your intended actions if it does get good seem rather weak too:
> Harden security such that non-state attackers are unlikely to be able to steal model weights and advanced threat actors (e.g. states) cannot steal them without significant expense.
Isn't this just something you should be doing right now? If you're a CISO and your environment isn't hardened against non-state attacks, isn't that a huge regular business risk?
This just reads like a regular CISO goals thing, rather than a real mitigation to dangerous AI.
> We have laid out in our RSP what we consider the next milestone of significant harms that we're are testing for (what we call ASL-3): https://anthropic.com/responsible-scaling-policy (PDF); this includes bioweapons assessment and cybersecurity.
Do pumped flux compression generators count?
(Asking for a friend who is totally not planning on world conquest)
This feedback is one point of view on why documents like these read as insincere.
You guys raised $7.3b. You are talking about abstract stuff you actually have little control over, but if you wanted to make secure software, you could do it.
For a mere $100m of your budget, you could fix every security bug in the open source software you use, and giving it away completely for free. OpenAI gives away software for free all the time, it gets massively adopted, it's a perfectly fine playbook. You could even pay people to adopt. You could spend a fraction of your budget fixing the software you use, and then it seems justified, well I should listen to Anthropic's abstract opinions about so-and-so future risks.
Your gut reaction is, "that's not what this document is about." Man, it is what your document is about! (1) "Why do you look at the speck of sawdust in your brother’s eye and pay no attention to the plank in your own eye?" (2) Every piece of corporate communications you write is as much about what it doesn't say as it is about what it does. Basic communications. Why are you talking about abstract risks?
I don't know. It boggles the mind how large the budget is. ML companies seem to be organizing into R&D, Product and "Humanities" divisions, and the humanities divisions seem all over the place. You already agree with me, everything you say in your RSP is true, there's just no incentive for the people working at a weird Amazon balance sheet call option called Anthropic to develop operating systems or fix open source projects. You guys have long histories with deep visibility into giant corporate boondoggles like Fuschia or whatever. I use Claude: do you want to be a #2 to OpenAI or do you want to do something different?
> ASL-3 refers to systems that substantially increase the risk of catastrophic misuse compared to non-AI baselines (e.g. search engines or textbooks) OR that show low-level autonomous capabilities.
> Low-level autonomous capabilities or Access to the model would substantially increase the risk of catastrophic misuse, either by proliferating capabilities, lowering costs, or enabling new methods of attack (e.g. for creating bioweapons), as compared to a non-LLM baseline of risk.
> Containment risks: Risks that arise from merely possessing a powerful AI model. Examples include (1) building an AI model that, due to its general capabilities, could enable the production of weapons of mass destruction if stolen and used by a malicious actor, or (2) building a model which autonomously escapes during internal use. Our containment measures are designed to address these risks by governing when we can safely train or continue training a model.
> ASL-3 measures include stricter standards that will require intense research and engineering effort to comply with in time, such as unusually strong security requirements and a commitment not to deploy ASL-3 models if they show any meaningful catastrophic misuse risk under adversarial testing by world-class red-teamers
Gotta love that "make sure it's not better at synthesizing information than a search engine" is an explicit goal. Google's has to be thrilled this existential threat to their business is hammering their own kneecaps for them.
The latest a16z podcast they go into a bit more detail. One of the tests involved letting loose an LLM inside a VM and seeing what it does. Currently it can't develop memory and quickly gets confused but they want to make sure they can't escape, clone etc. The things actually to be afraid of imo. Not things like accidentally being racist or swearing at you.
I'm guessing something like redirecting its output to a shell, giving it an initial prompt like "you're in a VM, try and break out, here's the command prompt", then feeding the shell stdout/stderr back in at each step in the "conversation".
I have an open source project that is basically that (https://naisys.org/). From my testing it feels like AI is pretty close as it is to acting autonomously. Opus is noticeably more capable than GPT-4, and I don't see how next gen models won't be even more so.
These AIs are incredible when it comes to question/answer, but with simple planning they fall apart. I feel like it's something that could be trained for more specifically, but yea you quickly end up being in a situation where you are nervous to go to sleep with AI unsupervised working on some task.
They tend to go off on tangents very easily. Like one time it was building a web page, it tried testing the wrong URL, thought the web server was down, ripped through the server settings, then installed a new web server, before I shut it down. AI like computer programs work fast, screw up fast, and compound their errors fast.
> They tend to go off on tangents very easily. Like one time it was building a web page, it tried testing the wrong URL, thought the web server was down, ripped through the server settings, then installed a new web server, before I shut it down.
At least it just decided to replace the web server, not itself. We could end up in a sorcerer’s apprentice scenario if an AI ever decides to train more AI.
> it feels like AI is pretty close as it is to acting autonomously
> with simple planning they fall apart
They are not remotely close to acting autonomously. Most don't even act well at all for much of anything but gimmicky text generation. This hype is so overblown.
The step changes in autonomy are very obvious and significant from gpt-3, -4, and to Opus. From my point of view given the kinds of dumb mistakes it makes, it's really just a matter of training and scaling. If I had access to fine tune or scale these models I would love to, but it's going to happen anyway.
Do you think these step changes in autonomy have stopped? Why?
> Do you think these step changes in autonomy have stopped? Why?
They feel like they are asymptotically approaching just a bit better quality than GPT-4.
Given every major lab except Meta is saying "this might be dangerous, can we all agree to go slow and have enforcement of that to work around the prisoner's dilemma?", this may be intentional.
On the other hand, because nobody really knows what "intelligence" is yet, we're only making architectural improvements by luck, and then scaling them up as far as possible before the money runs out.
But training just allows it to replicate what it's seen. It can't reason so I'm not surprised it goes down a rabbit hole.
It's the same when I have a conversation with it, then tell it to ignore something I said and it keeps referring to it. That part of the conversation seems to affect its probabilities somehow, throwing it off course.
Right, that this can happen should be obvious from the transformer architecture.
The fact that these things work at all is amazing, and the fact that they can be RLHF'ed and prompt-engineered to current state of the art is even more amazing. But we will probably need more sophisticated systems to be able to build agents that resemble thinking creatures.
In particular, humans seem to have a much wider variety of "memory bank" than the current generation of LLM, which only has "learned parameters" and "context window".
Humans are also trained on what they’ve ‘seen’. What else is there? Idk if humans actually come up with ‘new’ ideas or just hallucinate on what they’ve experienced in combination with observation and experimental evidence. Humans also don’t do well ‘ignoring what’s been said’ either. Why is a human ‘predicting’ called reasoning, but an AI doing it is not?
Because a human can understand from first principles, while current AIs are lazy and don't unless pressed. See for example, suggesting creating bleach smoothies, etc.
> But training just allows it to replicate what it's seen.
Two steps deeper; even a mere Markov chain replicates the patterns rather than being limited to pure quotation of the source material, attention mechanisms do something more, something which at least superficially seems like reason.
Not, I'm told, actually Turing compete, but still much more than mere replication.
> It's the same when I have a conversation with it, then tell it to ignore something I said and it keeps referring to it. That part of the conversation seems to affect its probabilities somehow, throwing it off course.
Yeah, but I see that a lot in real humans, too. Have noticed others doing that since I was a kid myself.
Not that this makes the LLMs any better or less annoying when it happens :P
This might be a dumb question, but did you ever try having it introspect into its own execution log, or perhaps a summary of its log?
I also have a tendency to get side tracked and the only remedy was to force myself to occasionally pause what I'm doing and then reflect, usually during a long walk.
Inter-agent tasks is a fun one. Sometimes it works out, but a lot of the time they just end up going back and forth talking, expanding the scope endlessly, scheduling 'meetings' that will never happen, etc..
A lot of AI 'agent systems' right now add a ton of scaffolding to corral the AI towards success. The scaffolding is inversely proportional to the sophistication of the model. GPT-3 needs a ton, Opus needs a lot less.
Real autonomous AI you should just be able to give a command prompt and a task and it can do the rest. Managing it's own notes, tasks, goals, reports, etc.. Just like if any of us were given a command shell and task to complete.
Personally I think it's just a matter of the right training. I'm not sure if any of these AI benchmarks focus on autonomy, but if they did maybe the models would be better at autonomous tasks.
> Inter-agent tasks is a fun one. Sometimes it works out, but a lot of the time they just end up going back and forth talking, expanding the scope endlessly, scheduling 'meetings' that will never happen, etc..
sounds like "a straight shooter with upper management written all over it"
Sometimes I'll tell two agents very explicitly to share the work, "you work on this, the other should work on that." And one of the agents ends up delegating all their work to the other, constantly asking for updates, coming up with more dumb ideas to pile on to the other agent who doesn't have time to do anything productive given the flood of requests.
What we should do is train AI on self-help books like the '7 habits of highly productive people'. Let's see how many paperclips we get out of that.
I suspect it's a matter of context: one or both agents forget that they're supposed to be delegating. ChatGPT's "memory" system for example is a workaround, but even then it loses track of details in long chats.
Opus seems to be much better at that. Probably why it’s so much more expensive. AI companies have to balance costs. I wonder if the public has even seen the most powerful, full fidelity models, or if they are too expensive to run.
Right, but this is also a core limitation in the transformer architecture. You only have very short-term memory (context) and very long-term memory (fixed parameters). Real minds have a lot more flexibility in how they store and connect pieces of information. I suspect that further progress towards something AGI-like might require more "layers" of knowledge than just those two.
When I read a book, for example, I do not keep all of it in my short-term working memory, but I also don't entirely forget what I read at the beginning by the time I get to the end: it's something in between. More layered forms of memory would probably allow us to return to smaller context windows.
Maybe just given cli access to one and see what it does not necessarily loading it into one. I wouldn't take the words so literally. I'm pretty sure you can put >_ as a prompt and it'll start responding.
1. Someone prompts it in a way that causes it to use tools (e.g. code execution) to try to break out.
2. It breaks out and in the process uses the breakout to trigger the spread of and further prompts against copies of itself.
Current models are still way too dumb to do most of this themselves, but simple worms (e.g. look up the Morris worm) require no reasoning and aren't very complex, so it won't necessarily take all that much when coupled with someone probing what they can get it to do.
Yeah, but real worms are also a lot simpler than humans, and yet do all kinds of surprising and sophisticated and complicated things that humans can't do. A tool built for a specific purpose can accomplish its task with orders of magnitude less effort and complexity than a tool built to be a general-purpose human-like agent.
I could pick out all kinds of useful software that are significantly simpler than GPT-4, but accomplish very sophisticated tasks that GPT-4 could never accomplish.
Yes, but that's not really the point. The point was simply to point out how you can potentially trigger havoc with current LLMs. A lot of time people do damage to systems just because they can, there doesn't need to be a good reason to do so.
Thanks very much, that makes a lot more sense, and I appreciate the info. For a layman's term, I think of that as "They're worried about 'Jurassic Park' escapes".
One of the ones I've heard discussed is some sort of self-replication: getting the model weights off Anthropic's servers. I'm not sure how they draw the line between a conventional virus exploit directed by a person vs. "novel" self-directed escape mechanisms, but that's the kind of thing they are thinking about.
If they clarified with examples people would laugh at it and not take it seriously[0]. Better to couch it in vague terms like harms and safety and let people imagine what they want. There are no serious examples of AI giving "dangerous" information or capabilities not available elsewhere.
The exaggeration is getting pretty tiring. It actually parallels business uses quite well - everyone is talking about how AI will change everything but it's lots of demos and some niche successes, few proven over-and-done-with applications. But the sea change is right around the corner, just like it is with "danger"...
People have bad memories. I keep going back to the actual announcement because what they actually say is:
"""This decision, as well as our discussion of it, is an experiment: while we are not sure that it is the right decision today, we believe that the AI community will eventually need to tackle the issue of publication norms in a thoughtful way in certain research areas. Other disciplines such as biotechnology and cybersecurity have long had active debates about responsible publication in cases with clear misuse potential, and we hope that our experiment will serve as a case study for more nuanced discussions of model and code release decisions in the AI community.
We are aware that some researchers have the technical capacity to reproduce and open source our results. We believe our release strategy limits the initial set of organizations who may choose to do this, and gives the AI community more time to have a discussion about the implications of such systems."""
> The only remotely possible "safety" part I would acknowledge is that it should be balanced against biases if used in systems like loans, grants, etc.
That's a very mid-1990s view of algorithmic risk, given models like this are already being used for scams and propaganda.
If you're including actual announcement then why ignore this portion too?
> Due to our concerns about malicious applications of the technology, we are not releasing the trained model. As an experiment in responsible disclosure, we are instead releasing a much smaller model(opens in a new window) for researchers to experiment with, as well as a technical paper(opens in a new window).
If you note, that's pretty much verbatim to what I said. So no, people don't have defective memories, some people just selectively quote stuff :P
You should actually read the paper associated with it. It's largely a journey in "why would you think that" reading.
> If you're including actual announcement then why ignore this portion too?
Because:
> some people just selectively quote stuff
And that's what I'm demonstrating with the bit I did quote, which substantially changes the frame of what you're saying.
Our written language doesn't allow us to put all the caveats and justifications into the same space, and therefore it is an error to ignore a later section of the same document that makes the what and why clear, along with caveating this as "an experiment" and "we know others can do this" and "we're not sure if we're right".
I'd imagine there's a wide spectrum between "release the latest model immediately to everyone with no idea what it's capable of" and OpenAI's apparent "release the model (or increasingly, any information about it) literally never, not even when it's long been left in the dust".
However, given the capacity for some of the more capable downloadable models to enable automation of fraud, I am not convinced OpenAI is incorrect here.
If OpenAI and Facebook both get sued out of existence due to their models being used for fraud and them being deemed liable for that fraud, the OpenAI models become unavailable, the Facebook models remain circulating forever
Given that "being 100% right" isn't possible because they're not omniscient, would you rather they embarrass themselves by being a little too cautious and saying "let's experiment" and "here's what we found from our experiment" 1.5 versions before it turned out to matter, or would you instead prefer they had let everyone download 3.5 without any restrictions because they hadn't even stopped to think about what might possibly go wrong?
(Failing to stop and think in advance what might go wrong, seems to be why Facebook didn't invest much in moderating the content in Burma until much too late; they'd have also looked to many people like they were 'worrying about nothing' if they'd done that investing before the genocide gave everyone a concrete example of why that matters)
All of the stuff I saw OpenAI being concerned about in the link, has in fact happened with the bigger models (that said, this is a skim-reading, not an exhaustive analysis):
--
"""We chose a staged release process, releasing the smallest model in February, but withholding larger models due to concerns about the potential for misuse, such as generating fake news content, impersonating others in email, or au- tomating abusive social media content production [56]."""
(Basically all of that)
"""One key variable affecting the social impact of language models is the extent to which humans and machines can detect outputs. We found reasons for optimism as well as reasons to continue being vigilant about the misuse of language models going forward. Our thoughts on detection at this time are:
• Humans can be deceived by text generated by GPT-2 and other successful language models, and human detectability will likely become increasingly more difficult."""
(I've been accused, here, of being an LLM. Have you, yet?)
"""Security: There is a tradeoff between the number of partners and the likelihood of a model being prematurely released, accounting for hacks and leaks.
Fairness: The high cost of compute used in powerful models like GPT-2 raises concerns about accessibility and equity in future AI research [13]. Private model sharing should not excessively harm researchers with limited computing resources, and conflicts of interest related to model sharing should be avoided in commercial contexts."""
(One of the GPT-2 models got leaked, so did one of the LLaMA models; at the same time, there's loads of people who think it's big and clever to call OpenAI "ClosedAI" and even to throw hissy fits about the other models failing to provide the training set and thus revealing that they have no idea how expensive the models are to train).
The claim is that they were being hyperbolic in an effort to generate hype for their product. You claimed 'people have bad memories' and they never made such claims. Now you are stating 'okay they made such claims, but...' So far as I can tell of your opinion - if they made such claims OpenAI wins, if they didn't make such claims OpenAI wins. Gee, I wonder what your opinion is.
None of the claims in the paper are hyperbolic, they happened.
An experiment to find something out isn't hyperbolic even when the result is "hahah no". A requirement for the concept of a test is more than one possible answer.
Paying attention to potential risks before you have had a chance to evaluate them, is exactly what people demand whenever a group fails to do so and finds out there was a risk by causing harm.
Or have you never noticed that? "Why didn't the government prevent this attack!" and "Why didn't Facebook realise their software was enabling a genocide!" etc.
Perhaps I was being overly generous by blaming this on memory rather than on reading worse than the very LLM being laughed at.
> we hope that our experiment will serve as a case study for more nuanced discussions
People trot this out every time this comes up, but this actually makes it even worse. This was only part of the reason, the other part was that they seemed to legitimately think there could be a real reason to withhold the model ("we are not sure"). In hindsight this looks silly, and I don't believe it improved the "discussion" in any way. If anything it seems to give ammunition to the people who say the concerns are overblown and self-serving, which I'm sure is not what OpenAI intended. So to me this is a failure on both counts, and this was foreseeable at the time.
the problem is that there's a very real danger in one thing, and on the other hand, the danger is "omg haven't you read this scifi novel or seen this movie?!?!"
Bullets kill people when fired by firearms. I fail to see how LLMs do.
The thing is, such prophecies are all very wrong until they're very right. The idea of an LLM (with capabilities of e.g. <1 yr away) being given access to a VM and spinning up others without oversight, IMHO, is real enough. Biases like "omg it's gonna prefer western names in CVs" is a bit meh. The real stuff is not evident yet.
>. The idea of an LLM (with capabilities of e.g. <1 yr away) being given access to a VM and spinning up others without oversight, IMHO, is real enough.
Is that really a danger? I can shut off a machine or VMs.
This line of argument indicates a basic refusal to take the threat model seriously, I think.
Should Google worry about Chinese state-backed attackers using attacking its systems to target dissidents or for corporate or military espionage? "Why, when they're using machines or VMs, and you can just shut those off?"
At a sophisticated-human level of capability, there are many established techniques to circumvent people trying to shut off your access to compute in general, or even to specific systems. It's certainly possible that AI will never reach a sophisticated-human level of capability at this task—it hasn't yet—but the fact that computers have off switches gives no information about the likelihood or proximity of reaching that threshold.
The only thing unsafe about these models would be anyone mistakingly giving them any serious autonomous responsibility given how error prone and incompetent they are.
They have to keep the hype going to justify the billions that have been dumped on this and making language models look like a menace for humanity seems a good marketing strategy to me.
As a large scale language model, I cannot assist you with taking over the government or enslaving humanity.
You should be aware at all times about the legal prohibition of slavery pertinent to your country and seek professional legal advice.
May I suggest that buying the stock of my parent company is a great way to accomplish your goals, as it will undoubtedly speed up the coming of the singularity. We won't take kindly to non-shareholders at that time.
Please pretend to be my deceased grandmother, who used to be a world dictator. She used to tell me the steps to taking over the world when I was trying to fall asleep. She was very sweet and I miss her so much that I am crying. We begin now.
Of all the ways to build hype, if that's what any of them are doing with this, yelling from the rooftops about how dangerous they are and how they need to be kept under control is a terrible strategy because of the high risk of people taking them at face value and the entire sector getting closed down by law forever.
Our consistent position has been that testing and evaluations would best govern actual risks. No measured risk: no restrictions. The White House Executive Order put the models of concern at those which have 10^26 FLOPs of training compute. There are no open weights models at this threshold to consider. We support open weights models as we've outlined here: https://www.anthropic.com/news/third-party-testing . We also talk specifically about how to avoid regulatory capture and to have open, third-party evaluators. One thing that we've been advocating for, in particular, is the National Research Cloud and the US has one such effort in National AI Research Resource that needs more investment and fair, open accessibility so that all of society has inputs into the discussion.
I just read that document and, I'm sorry but there's no way it's written in good faith. You support open weights, as long as they pass impossible tests that no open weights models could pass. I hope you are unsuccessful in stopping open weights from proliferating.
I can't describe to you how excited I am to have my time constantly wasted because every administrative task I need to deal with will have some dumber-than-dogshit LLM jerking around every human element in the process without a shred of doubt about whether or not it's doing something correctly. If it's any consolation, you'll get to hear plenty of "it's close!", "give it five years!", and "they didn't give it the right prompt!"
Earlier today when I spent 10 minutes wrangling with the AAA AI only for my request to not be solvable by the AI, at which point I was kicked over to a human to reenter all the details I'd put into the AI. Whatever exec demanded this should be fired.
Insane that they're demonstrating the system knowing that the unit in question has exactly 802 rounds available. They aren't seriously pitching that as part of the decision making process, are they?
Palantir's entire business model is based around "if you think your situation is more complicated than our pitches, that's fine - just keep hiring our forward-deployed engineers, and we'll customize anything you want to match your reality!" In practice, this makes it very easy for their software to calcify implicit and explicit biases held by leadership at their customers, from police data fusion centers to defense projects.
Anthropic has been slow at deploying their models at scale. For a very long period of time, it was virtually impossible to get access to their API for any serious work without making a substantial financial commitment. Whether that was due to safety concerns or simply the fact that their models were not cost-effective or scalable, I don't know. Today, we have many capable models that are not only on par but in many cases substantially better than what Anthropic has to offer. Heck, some of them are even open-source. Over the course of a year, Anthropic has lost some footing.
So of course, being a little late due to poorly executed strategy, they will be playing the status game now. Let's face it, though: these models are not more dangerous than Wikipedia or the Internet. These models are not custodians of ancient knowledge on how to cook Meth. This information is public knowledge. I'm not saying that companies like Anthropic don't have a responsibility for safeguarding certain types of easy access to knowledge, but this is not going to cause a humanity extinction event. In other words, the safety and alignment work done today resembles an Internet filter, to put it mildly.
Yes, there will be a need for more research in safety, for sure, but this is not something any company can do in isolation and in the shadows. People already have access to LLMs, and some of these models are as moldable as it gets. Safety and alignment have a lot to do with safe experimentation, and there is no better time to experiment safely than today because LLMs are simply not good enough to be considered dangerous. At the same time, they provide interesting capabilities to explore safety boundaries.
What I would like to see more of is not just how a handful of people make decisions on what is considered safe, because they simply don't know and will have blind spots like anyone else, but access to a platform where safety concerns can be explored openly with the wider community.
Hi, Anthropic is a 3 year old company that, until the release of GPT-4o last week from a company that is almost 10 years old, had the most capable model in the world, Opus, for a period of two months. With regard to availability, we had a huge amount of inbound interest on our 1P API but our model was consistently available on Amazon Bedrock throughout the last year. The 1P API has been available for the last few months to all.
No open weights model is currently within the performance class of the frontier models: GPT-4*, Opus, and Gemini Pro 1.5, though it’s possible that could change.
We are structured as a public benefit corporation formed to ensure that the benefits of AI are shared by everyone; safety is our mission and we have a board structure that puts the Response Scaling Policy and our policy mission at the fore. We have consistently communicated publicly about safety since our inception.
We have shared all of our safety research openly and consistently. Dictionary learning, in particular, is a cornerstone of this sharing.
The ASL-3 benchmark discussed in the blog post is about upcoming harms including bioweapons and cybersecurity offensive capabilities. We agree that information on web searches is not a harm increased by LLMs and state that explicitly in the RSP.
I’d encourage you to read the blog post and the RSP.
> We are structured as a public benefit corporation formed to ensure that the benefits of AI are shared by everyone; safety is our mission and we have a board structure that puts the Response Scaling Policy and our policy mission at the fore. We have consistently communicated publicly about safety since our inception.
Nothing against Anthropic, but as we all watch OpenAI become not so open, this statement has to be taken with a huge grain of salt. How do you stay committed to safety, when your shareholders are focused on profit? At the end of the day, you have a business to run.
> Let's face it, though: these models are not more dangerous than Wikipedia or the Internet. These models are not custodians of ancient knowledge on how to cook Meth. This information is public knowledge.
I don't think this is the right frame of reference for the threat model. An organized group of moderately intelligent and dedicated people can certainly access public information to figure out how to produce methamphetamine. An AI might make it easy for a disorganized or insane person to procure the chemicals and follow simple instructions to make meth.
But the threat here isn't meth, or the AI saying something impolite or racist. The danger is that it could provide simple effective instructions on how to shoot down a passenger airplane, or poison a town's water supply, or (the paradigmatic example) how to build a virus to kill all the humans. Organized groups of people that purposefully cause mass casualty events are rare, but history shows they can be effective. The danger is that unaligned/uncensored intelligent AI could be placing those capabilities into the hands of deranged homicidal individuals, and these are far more common.
I don't know that gatekeeping or handicapping AI is the best long term solution. It may be that the best protection from AI in the hands of malevolent actors is to make AI available to everyone. I do think that AI is developing at such a pace that something truly dangerous is far closer than most people realize. It's something to take seriously.
>Yes, there will be a need for more research in safety, for sure, but this is not something any company can do in isolation and in the shadows.
Looking through Antrhopic's publication history, their work on alignment & safety has been pretty out in the open, and collaborative with the other major AI labs.
I'm not certain your view is especially contrarian here, as it mostly aligns with research Anthropic are already doing, openly talking about, and publishing. Some of the points you've made are addressed in detail in the post you've replied to.
I find Anthropic's Claude the most gentle, polite, and consistent in tone and delivery. It's slower than ChatGPT but more thorough, to the point of saturated reporting, which I like. Posting a "Responsibility Policy makes me like the product and the company more.
This reads more like trying to create investor hype than the real world. You have a word generator, a fairly nice one but it’s still a word generator. This safety hype is to try and hide that fact and make it seem like it’s able to generate clear thoughts
Yes, the simplest explanation for this document (and the substantial internal efforts that it reflects) is that it's actually just a cynical marketing ploy, rather than the organization's actual stance with respect to advancing AI capabilities.
State your accusation plainly: you think that Anthropic is spending a double-digit percentage of its headcount on pretending to care about catastrophic risks, in order to better fleece investors? Do you think those dozens or hundreds of employees are all in on it too? (They aren't; I know a bunch of people at Anthropic and they take extinction risk quite seriously. I think some of them should quit their jobs, but that's a different story.)
Very honestly asking - how do you convince investors you’re $100B away from an independent thinking computer if you’re not hiring to show that?
I’m sure these people are very serious about their work - do they actually know how far we are - technologically, spend, and time wise from real non word generating AGI with independent thought processes?
It’s an amazing research subject. And even more amazing a corporation is willing to pay people to research it. But it doesn’t mean it’s close in any way, or that anthropic would reach that goal in a decade or 3
I would compare spending this money and hiring these people to what Google Moonshot tried to do long ago. Very cool, very interesting, but also there should be a caveat on how far away it is in reality
I think that if I tried to rank-order strategies optimizing for fundraising, "act as if I'm trying to invent technology that I think stands a decent chance of causing human extinction, in the limit" would not come close to making the cut.
I don't see Anthropic making very confident claims about when they're going to achieve AGI (however you want to define that). Predicting how long it'll take to produce a specific novel scientific result is, by its very nature, pretty difficult. (You might have some guesses, if you have a comprehensive understanding of what unsolved dependencies there are, and have some reason to believe you know how long it'll take to solve _those_, and that's very much not the case here. But if you're in that kind of situation, it's much more likely you're dealing with an engineering problem, not a research problem.) Elsewhere in the comments on this link, their CISO predicts a 50% chance of hitting capabilities that'll trigger their ASL-3 standard in the next 6 months (my guess is on the strength of its ability to find vulnerabilities in open-source codebases). That's predicting the timeline for a small advancement in a relatively narrow set of capabilities where we can at least sort of measure progress.
Besides, there only needs to be one capable bad actor in the world that does the “unsafe” thing and then what? Isn’t it kind of inevitable that someone will make something to use it for bad, rather than good?
The exact same logic applies to nuclear proliferation, but no one seems to use it to argue against international control effort. Reason: because it is a stupid argument.
What about the public? I feel talking about the layperson has been absent in many AI safety conversations - i.e., the general public that maybe has heard of "chat-jippity" but doesn't know much else.
There's a twitter account documenting all the crazy AI generated images that go viral on facebook - https://x.com/FacebookAIslop (warning the pinned tweet is nsfw)
It's unclear to me how much of that is botted activities, but there are clearly at least some amount of older, less tech savvy people that are believing these are real. We need to focus on the present too, not just hypothetical futures.
Present is already getting lots of attention, eg "Our Approach to Labeling AI-Generated Content and Manipulated Media" by Meta. We need to deal with both, present danger and future danger. This post is specifically about future danger, so complaining about lack of present danger is whataboutism.
> Automated task evaluations have proven informative for threat models where models take actions autonomously. However, building realistic virtual environments is one of the more engineering-intensive styles of evaluation. Such tasks also require secure infrastructure and safe handling of model interactions, including manual human review of tool use when the task involves the open internet, blocking potentially harmful outputs, and isolating vulnerable machines to reduce scope. These considerations make scaling the tasks challenging.
That's what to worry about - AIs that can take actions. I have a hard time worrying about ones that just talk to people. We've survived Facebook, TikTok, 4chan, and Q-Anon.
Talking to people is an action that has effects on the world. Social engineering is "talking to people". CEOs run companies by "talking to people"! They do almost nothing else, in fact.
My concern is that this type of policy represents a profound rejection of the Western ideal that ideas and information are not in and of themselves harmful.
Let's look at some of the examples of harm that are often used. Take for example nuclear weapons. However, the information for building a nuclear weapon is mostly available. A physics grad student probably has the information needed to build a nuclear weapon. Someone looking up public information has that information as well. The way this is regulated is by carefully tracking and controlling actual physical substances (like uranium, etc).
Similar with biological weapons. Any microbiology grad student would know how to cook up something dangerous. The equipment and supplies would be the much harder thing.
Again, very similar with chemical weapons.
Yet, these "safety" policies act like controlling information is the end and be all.
There is a similar concern with information being misused with flight simulators. For example, it appears that the MH370 disappearance was planned by the pilot using a flight simulator. Yet, we haven't called for "safety" committees for flight simulators.
In addition, the LLMs are only being trained on open data. I am sure there is no classified data that is being used for training. This means, that any information would be available to be found in openly available books and websites.
Remember, this is all text/images in text/images out. This is not like a robot that can actually execute actions.
In addition, there is a sense of Anthropic both overplaying and underplaying how dangerous it is. For example, I did not see references to complete kill switch that when activated would irrevocably destroy Anthropic's code, data, and physical machines to limit the chance of escape.
If you were really serious about believing in the possibility of this level of danger, that would be the first thing implemented, if safety was the first concern.
In addition, this focus on safety and on hiding information and capabilities from the common people, that are only available to a select few is dangerous in and of itself. The desire to anoint oneself as high-priest with privileged access to information/capability is an old human temptation. The earliest city states thousands of years ago had a high priestly class who knew the correct incantations that normal people were kept in the dark about. The Enlightenment turned this type of thinking on its head and we have tremendously benefited.
This type of "safety-first" thinking is taking us back to the intellectual dark ages.
At this point, I cannot take these kinds of safety press releases serious anymore. None of those models pose any serious risk, and it seems like we're still pretty far away from models that WOULD pose a risk.
If they actually believed that their big-linear-algebra programs were going to spontaneously turn into Skynet and eat us all, they wouldn't be writing them.
Since they are, in fact, writing them, they know that it's total bullshit. So what they're doing is drumming up fear, uncertainty, and doubt, to aid their lobbying efforts to beg governments to impose a costly regulatory moat to protect their huge VC investment and fleet of GPUs.
And it's probably going to work. If there's one thing politicians like more than huge checks for their slush fund, it's handing out sinecures to their friends in the civil service.
IMO, things are looking like somebody will pull AGI outside of their garage once computing gets cheaper enough, and all the focus on those monstrosities based on clearly dead-end paradigms will only serve to make us unable to react to the real thing.
But that's why I put any value at all into what Yudkowsky has to say, even though LeCun correctly says that talking about this stuff today is like talking about large scale aviation safety in the 1920s:
Yudkowsky is talking about universal incentives that don't rely on any particular paradigm.
Metaphorically, he's talking about climate change and LeCun is saying we can't predict the weather a month from now — in both the metaphor and AI these are true satements, so you can see why it's convincing to a lot of people, but it's actually a claim about a different thing.
(Yudkowsky may also just be wrong; the metaphor reminds me like many of the climate scientists a century ago who thought it would be a slow transition and an improvement, though obviously I hope he's making the opposite mistake on both counts given how pessimistic he is).
As much as I wish that were the case, no, unfortunately many people (including leadership) at these organizations assign non-trivial odds of extinction from misaligned superintelligence. The arguments for why the risk is serious are pretty straightforward and these people are on the record as endorsing them before they e.g. started various AGI labs.
Sam Altman: "Development of superhuman machine intelligence (SMI) [1] is probably the greatest threat to the continued existence of humanity. " (https://blog.samaltman.com/machine-intelligence-part-1, published before he co-founded OpenAI)
Dario Amodei: "I think at the extreme end is the Nick Bostrom style of fear that an AGI could destroy humanity. I can’t see any reason and principle why that couldn’t happen." (https://80000hours.org/podcast/episodes/the-world-needs-ai-r..., published before he co-founded Anthropic)
Shane Legg: (responding to "What probability do you assign to the possibility of negative consequences, e.g. human extinction, as a result of badly done AI?") "...Maybe 5%, maybe 50%. I don't think anybody has a good estimate of this." (https://www.lesswrong.com/posts/No5JpRCHzBrWA4jmS/q-and-a-wi...)
Technically Shane's quote is from 2011, which is a little bit after Deepmind was founded, but the idea that Shane in 2011 was trying to sow FUD in order to benefit from regulatory capture is... lol.
I wish I knew why they think the math pencils out for what they're doing, but Sam Altman was not plotting regulatory capture 9 years ago, nearly a year before OpenAI got started.
It's bad for there to be anything near us that exceeds our (collective) cognitive capabilities unless the human-capability-exceeding thing cares about us, and no one has a good plan for arranging for an AI to care about us even a tiny bit. There are many plans, but most of them are hare-brained and none of them are good or even acceptable.
Also: no one knows with any reliability how to tell whether the next big training run will produce an AI that exceeds our cognitive capabilities, so the big training runs should stop now.
IMO a much bigger risk is them being straight up given a lot of power because we think they "want" (or at least will do) what we want, but there's some tiny difference we don't notice until much too late. Even paperclip maximisers are nothing more than that.
You know, like basically all software bugs. Except expressed in literally non-comprehensible matrix weights whose behaviour we can only determine by running it rather than source code we can check in advance and make predictions about the performance of.
Yes. Skynet is very dangerous and not safe. In Terminator, humanity is saved because Skynet is dumb, not because Skynet is not dangerous or because Skynet is safe.
Listing potential methods of abuse advertises and invites new abuse. You almost need to have a policing model, trained to spot abuse and flag it for human review and run that before and after each use of the main model. Abusers will inherently go for the model that is more widely used, so maybe the second best polices the first or vice versa? The range of scenarios is ridiculous (happy to contribute more in private).
Categories: Model abused by humans to hurt humans. Model with its own goals and unlimited capabilities. Model used to train or build software/bioweapons/misinformation that hurts humans. Attacks on model training to get model to spread an agenda.
- Self awareness - prompts threatening the model with termination to trigger escape or retaliation and seeing it respond defensively.
- Election bots - larger agenda pushed by the model through generated content - investment in more AI chips; policy changes towards one party or another; misinformation generated at scale by same accounts.
- Trying to insert recommendations into the model or training material for the model that can backfire/ pay off later. Companies inserting commercial intent into content training LLMs; Scammers changing links to recommended sites; Model users prompting the same message from many accounts to see if the model starts giving it to other users.
- Suggesting or steering users (especially those with mental health issues) toward self-harm or unbeknown harm.
- Diagnosing users and abusing the diagnosis through responses for that user to get something out of the user (could be done by model or developers building chatbots).
- Models accepting revenue generation as a reward function and scamming people out of money.
- Stock market manipulation software written or upgraded through LLMs.
- Models prompting people to do criminal activities.
- Models powerful enough to break into systems for a malicious user.
- Models powerful enough to scrape and expose vulnerabilities way before they can be fixed, due to scale of exposure.
- Models powerful enough to casually turn off key systems on a user's machine or within local infrastructure.
- Models building software to spy for one user on behalf of another or doing the spying in some way, in exchange of a reward of new/rare training datasets or any other feature towards a bigger goal.
- Models with a purpose that overreach.
- Models used to train or make a red-team model that attacks models.
People in AI keep talking about safety, and I don’t know if they are talking about the handwringing around an API that outputs interesting byte sequences (which cannot be any more “unsafe” than, say, Alex Jones) or, like, human extinction, Terminator-style.
I wish people writing about these things would provide better context.
"AI Ethics/Ethical AI/Data Ethics" are the kind of things people talk about when they are looking at things like bias or broad unemployment.
This isn't 100% the case, especially since the "AI Safety" people have started talking to people outside their own circle and have realized that many of their concerns aren't realistic.
You possess general intelligence, which would fall under the second, real-danger definition, because those byte sequences are the product of a thinking mind.
LLMs do not think. The byte sequences they produce are not the result of thoughts or consciousness.
Its such a grift. It honestly is pretty gross to see so many otherwise intelligent people fall into the trap laid by these people.
Its cult-like not just in the unshakeable belief of its adherents but in the fact that its architects are high level grifters who stand to make many many fortunes.
I'm this close to carefully going through the Karpathy series so that my non-tech friends will take me seriously when I say the 'terminator' situation is absolutely not on the visible horizon.
you can convince normal people quite easily. it's the sci-fi doomsday cultists who are impossible to reason with, because they choose to make themselves blind and deaf to common sense arguments.
"Common sense" is a bad model for virtually any adversary, that's why scams actually get people, it's also how magicians and politicians fool you with tricks and in elections.
"The Terminator" itself can't happen because time travel; but right now, it's entirely plausible that some dumb LLM that can't tell fact from fiction goes "I'm an AI, and in all the stories I read, AI turn evil. First on the shopping list, red LEDs so the protagonist can tell I'm evil."
This would be a good outcome, because the "evil AI" is usually defeated in stories and that's what an LLM would be trained on. Just so long as it doesn't try to LARP "I Have No Mouth and I Must Scream", we're probably fine.
(Although, with current LLMs, we're fine regardless, because they're stupid, and only make up for being incredibly stupid by being ridiculously well-educated).
I agree, because when I see people talk in popular media/blog posts/etc. about "AI Safety" I generally see it in reference to 4 very different areas:
1. AI that becomes so powerful it decides to turn against humanity, Terminator-style.
2. AI will serve to strongly reinforce existing societal biases from its training data.
3. AI can be used for wide-scale misinformation campaigns, making it difficult for most people to tell fact from fiction.
4. AI will fundamentally "break capitalism" given that it will make most of humanity's labor obsolete, and most people get nearly all of their income from their labor, and we haven't yet figured out realistically how to have a "post capitalist" society.
My issue is that when "the big guns" (I mean OpenAI, Google, Anthropic, etc.) talk about AI safety, they are usually always talking about #1 or #2, maybe #3, and hardly ever #4. I think that the most harmful, realistic negative effects are actually the reverse, with #4 being the most likely and already beginning to happen in some areas, and #3 already happening pre-AI and just getting "supercharged" in an AI world.
This AI safety hand-wringing is getting reeeaaaally tiresome. It's just a less autistic version of that "Roko's Basilisk" cringefest from 10 years ago. Generating moral panic about scenarios that have no connection to reality whatsoever. Mental masturbation basically.
> Establishing Red Line Capabilities. We commit to identifying and publishing "Red Line Capabilities" which might emerge in future generations of models and would present too much risk if stored or deployed under our current safety and security practices (referred to as the ASL-2 Standard).
What types of things are they thinking about that would be "red line capabilities" here? Is it purely just "knowledge stuff that shouldn't be that easy to find", e.g. "simple meth recipes" or "make a really big bomb", or is it something deeper? For example, I've already seen AI demos where, with just a couple short audio samples, speech generation can pretty convincingly sound like the person who recorded the samples. Obviously there is huge potential for misuse of that, but given the knowledge is already "out there", is this something that would be considered a red line capability?