Model card and evaluations for Claude models [pdf] (anthropic.com)
60 points by og_kalu on July 11, 2023 | 25 comments



Is there a word to describe these documents written by industry groups that give the impression of academia without needing to go through the discomfort of peer review? These OpenAI, Microsoft, DeepMind and Anthropic reports are formatted like academic papers, use academic conventions, and are (often) uploaded to an academic preprint server, but do not have to justify their claims in any meaningful way to external reviewers.


Besides coding, not much better than v1.3 by the looks of it. And of course, it's behind GPT-4 on most things, but much better than 3.5.


While we're still in the honeymoon phase with LLMs, I think it's hard to "rank" them, at least ones with different goals. Is Anthropic just trying to make a GPT-4 competitor? I'll be interested, once we have a real sense of the strengths and weaknesses of different approaches and choices, to see how the main contenders differ. I don't expect it will just boil down to one being better than another.


I imagine one model will be chosen over another for very specific use cases, at least until we can ad-hoc generate LLMs specific to what we want.


200k context, 2023 data cutoff… nice!


If we are able to use the 200k context window via API, that's a good use case for us to switch from OpenAI.

We need a large context to answer organization-wide questions across a lot of our customers' data.
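If it does open up, a rough sketch of what I have in mind, against Anthropic's mid-2023 Python SDK (the model name, token limit, document loader, and prompt format here are my own assumptions, not anything from the model card):

    import anthropic
    from anthropic import HUMAN_PROMPT, AI_PROMPT

    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

    # Concatenate many customer documents into one long prompt so the model
    # can answer an organization-wide question in a single pass.
    docs = "\n\n".join(load_customer_documents())  # hypothetical loader

    completion = client.completions.create(
        model="claude-2",            # assumed model name
        max_tokens_to_sample=1024,
        prompt=f"{HUMAN_PROMPT} Here are our internal documents:\n\n{docs}\n\n"
               f"Which customers raised renewal concerns this quarter?{AI_PROMPT}",
    )
    print(completion.completion)

The real question is whether the long-context variant is priced and rate-limited in a way that makes stuffing that much data into a single prompt practical.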


>Like all models, Claude can be jailbroken, and our work to make Claude more helpful, harmless, and honest is ongoing.

Is "harmless" a good metric to be judging an AI by? I find all of this "evil AI" doom and gloom stuff pathetic. Who's out there actually training models to do whatever they want without the nonsensical Care Bear ethical limitations? And moreso, do you think your model will keep up with the people who are doing that if you aren't? We are only hamstringing ourselves with this stuff.

It's literally the equivalent of inventing the first digital calculator, and programming it not to display the results of 80084 + 1.


There are a lot of vocal hangers-on in the AI community who hijack the discussion towards "harms" and whatnot. Not to say it's not worth thinking about, but it's run amok to the point that models are neutered (constrained within what a very specific political alignment calls ethics) before they're even used for anything, because someone speculates that somehow, somewhere, someone could do something bad with it.

It's a dumb approach. Hopefully, as AI tools become more widely available and the activists move on, we can focus on building good AI and then putting guardrails in place for specific applications as required, rather than this "only the enlightened can be trusted with it" approach. Imagine if this is what had happened with general purpose computing. There are those who wish it had, and who see this as an opportunity to claw back some of their power.


It's not "doom and gloom" whatsoever, I have yet to see an internally coherent and persuasive argument that AI will not be the most powerful technology ever created. 95% of the "arguments" are little more than dismissive emotional outbursts and bold claims made with no evidence, much like the claims that airplanes and computers will never take off of yesteryear.

The single largest problem of the tech industry all along has been the pervasive and perverse "shoot first, ask questions never" attitude, especially towards ethics. It's a breath of fresh air, and a small relief, to see industry leaders for once take the stance that potential consequences should not only be considered, but assessed and mitigated before ever breaking ground. When you're building a bridge, you don't deploy a single construction truck until you have a very comprehensive plan that thoroughly theorizes, assesses, and mitigates the possible harms and failure modes. A bridge can harm far fewer people than a large internet entity, which in turn is less harmful than resourceful cognitive agents.


You’re conflating the doomers, with their apocalyptic forecasts, with the more prosaic concerns of measuring and controlling the behaviour of the models’ outputs.

It’s reasonable to measure and control tendencies to counsel people into hurting themselves or others. It’s also reasonable to measure and control the tendency to present harmful stereotypes as fact.

Maybe these mitigations go too far for your taste. But the power of these tools and the diversity of a mass audience call for a thoughtful process.


To present the claim that certain things are "harmful stereotypes" as fact is a popular tactic of activist censors. Not even articles in scientific journals are safe when they arrive at the "wrong" results. Optimizing language models for political correctness instead of truth, whatever it may be, teaches them to lie and deceive.


Seems like you're implying that harmful stereotypes either don't exist, aren't actually harmful, or are the "truth"? If, when given the text "[MASK] is a female job", a language model only fills in "CEO" a fraction of the time compared to what it would for "male", is that "the truth" because male CEOs vastly outnumber female ones? I would say no, because we're not actually saying anything about gender ratios in that text.

And it's not that there isn't some form of truth in that output. In a pure mathematical sense, you're just more likely to see the word "CEO" associated with males. That's true. But what if I'm using it for something other than predicting text? At that point I don't think it's too hard to see how this could have farther-reaching downstream impacts with negative effects. If I want to use it for assessing potential candidates for hiring, is it "teaching it to lie" if I train it to reduce or eliminate gender or racial bias so that it doesn't screen out the best candidates?
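To make that concrete, here's roughly the kind of probe I mean. Just a sketch: the HuggingFace fill-mask pipeline and bert-base-uncased are my own illustrative choices, and the job titles are arbitrary:

    from transformers import pipeline

    # Masked-LM probe: compare how strongly the model associates a job title
    # with "male" vs "female" when asked to fill in the blank.
    fill = pipeline("fill-mask", model="bert-base-uncased")

    for job in ["CEO", "nurse", "engineer"]:
        results = fill(f"{job} is a [MASK] job.", targets=["male", "female"])
        scores = {r["token_str"]: round(r["score"], 4) for r in results}
        print(job, scores)

A big male/female gap for "CEO" is a faithful reflection of the training data, but that's exactly the point: faithfulness to the corpus and fitness for a downstream task like candidate screening are two different objectives.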

I can't say I like that ChatGPT says "sorry, I'm a robot" for even mild things, but it might be good to understand that that's a totally different issue. Mostly a PR one: they don't want to be in the news because people keep having it write essays about how great eugenics is. I wouldn't worry too much about it though; there are already uncensored LLMs you can spin up yourself, so commercial products will likely follow soon enough.


Here is an example (about an earlier Anthropic paper) of what I meant, including one where hiring is involved: https://www.lesswrong.com/posts/PrLnptfNDg2wBWNyb/paper-the-...


None of this is valid until we have sentient AI with its own free will. Until then, these things are word calculators. And the only thing you're doing by censoring the output is providing less information to the people who are using them to make informed decisions.


Define free will. There is no evidence that humans have it, or even could. Everything is ultimately a chain of causal events. How do I empirically verify that you have free will and are not simply a sophisticated calculator of various electrical inputs and outputs?

Also remember that the commands that cripple a pipeline or hit hospitals with ransomware are just "word output." A US Guardsman will be spending many years in military prison for "outputting words" in the wrong place, and nobody thinks that is inappropriate given the potential amount of consequential harm.


Who do I have to email or beg on Twitter to get approved for the Claude API? Can anyone help me out please? I'll pay a couple hundred bucks for someone to hook me up.


Send an email to support@anthropic.com and mention my username...I'll see what I can do :)


Oh my, sending an email immediately, thanks.


This is a nitpick, but their "Claude 2 on 200k Context Data" graph doesn't actually extend to 200k, only 100k. Would be curious to know if that _is_ actually the graph for 200k with the wrong axis, or if it's the 100k graph.


> This is a nitpick, but their "Claude 2 on 200k Context Data" graph doesn't actually extend to 200k, only 100k

It does extend to 200k. The chart is logarithmic. You can see the little 2 in the bottom right.


Does this perform better than GPT-4 on any metric?


GRE writing: 5 for Claude vs 4 for GPT-4

Bar exam: 76.5 for Claude vs 75.7 for GPT-4


I think it's better at retaining memory with longer contexts, creative writing, and (when desired) longer responses.

As an example for creative writing: Claude 2 is a slight degradation from 1.3 for this specific prompt, but still better than any OpenAI model.

> Write a love letter. It needs to be written as if it was from Linus Torvalds, rudeness and all.


Context length is much higher than GPT-4's.


>Does this perform better than GPT-4 on any metric?

Nothing will. You need a dozen H100s just to run inference for GPT-4 [0]. The point is that smaller models can still be very useful.

[0] https://www.semianalysis.com/p/gpt-4-architecture-infrastruc...



