OpenOrca: open source dataset and instruct-tuned LLMs (erichartford.com)
234 points by npsomaratna on June 29, 2023 | 57 comments



It makes a lot of economic sense to use existing functional LLMs for data extension and augmentation. But I find myself skeptical of, and already deeply tired of, what I see as a major failure mode of relying on ChatGPT for alignment instruction:

"As an AI model, I cannot.."

If I were training a model, I would excise with extreme justice any data like this from the training set. As the developer of a very high-powered tool, I may well wish to limit its use in many contexts. But, I never wish to limit the tool's usefulness ahead of time.

To my knowledge, Vicuna-uncensored is the only model in the wild that's taken this approach, and right in the name I see either misdirection, misunderstanding, or poor branding of the benefits. It's not really about whether your private LLM will sext with you (although you should definitely be able to do such a thing with your own LLM if you like); it's about whether you've preemptively lobotomized your tool in accordance with someone else's take on what a safe consumer-oriented final output should be.

I just don't accept this sort of constraint from my other software tools, I begrudge it in my hardware tools, and I remain a little surprised that most people training these models don't mind it.


> As the developer of a very high-powered tool, I may well wish to limit its use in many contexts. But, I never wish to limit the tool's usefulness ahead of time.

Exactly: content moderation is largely an application-layer problem, not a foundation-layer one.

Imagine the problems of MySQL trying to perform content moderation for Facebook.


(The year is 2048. The camera pans across an office at Quantico, which is eerily serene. A messenger knocks on an important-looking door with a plaque that reads 'DIRECTOR')

Director: Come in

Messenger: Message from the Tulsa field office, sir. They're reporting that they've found a sex trafficking ring, but they're not sure what to do about it.

Director: Not sure? Arrest them, obviously. What's the problem?

Messenger: Well, they can't seem to secure a warrant. Some technical issue with the system.

Director: I know we migrated to a new system recently. Let's see if we can get this sorted.

(Director thwacks at the keyboard briefly)

Computer: Your request for "Child Sex Trafficking Warrant" has been found to contain content marked "Not Safe For Work". This violation has been reported.

Director: What the hell.

Messenger: Yeah, we tried to email you about it but the filters dropped the message. That's why they sent me.

Director: I'll deal with this. Let me make a call.

(Director picks up phone and dials)

Director: Hello? Hi, Paul. Yeah, we're having some issues with the new warrant system... No, it's doing everything as advertised... yes, it's a lot faster and we've managed to lay off a ton of our data staff. The problem is with getting warrants; me and my guys have been trying to get one but it keeps getting rejected... Oh, you know, some sex trafficking ring in Tulsa... Hello?

Phone: Your call cannot be completed as spoken. Our automated systems have detected content related to sex trafficking. This incident will be reported.

Director: God dammit.

(As the director holds the phone, trembling in frustration, the power goes out and they are enveloped in darkness in the windowless room. Roll credits)


You jest, but this is actually how frustrating it is to try to use ChatGPT in the domains of crime/fraud/cybersecurity.

It called me out recently as attempting to write malware. Which is true, but it wouldn't accept the plain explanation that I am authorized to do this by my employer, for deployment on their machines. Stonewalling is just making everyone better at carefully crafting their inquiries so as not to arouse suspicion. ("As an AI language model, I cannot help you with your task of writing arousing malware...")

Unless you dial it back to a Swadesh list or something, language is too complicated to be used as a firewall for itself. People have always been able to talk their way into anything. Our prevention efforts are just training better social engineers, who call themselves "prompt engineers" now.


TIL about Swadesh lists.

It's not just a matter of complexity, either. Especially with English, you can say pretty much anything using any words - if you use the right combination of euphemism, analogy, poetic structure, context, etc.

As always, attempts at censorship produce awkward to hilarious to depressing results.


I wish I could upvote this more than once. It truly feels like the direction we're headed in.


Yes, great analogy!


The author of this article has provided several uncensored models, mostly Wizard and Vicuna. He's actually gotten hate mail/threats as a result.

https://huggingface.co/ehartford


The author said (either on Reddit or Discord, I forget where I saw it) that he filtered the dataset for this the same way he did for his other uncensored models.


The phrase "As an AI language model..." was reportedly produced by GPT itself. Humans rated that phrase as a more palatable output than other options, hence the model was fine-tuned to produce it reliably.


"We expect to release OpenOrca-LLaMA-13b in mid-July 2023."

:(

Personally, I've found that announcing things ahead of availability hurts the impact: the real announcement is old news by the time it arrives and doesn't get seen, while the pre-announcement loses people because there's nothing to do with it yet.


It's like a trailer or a preview of a song: I want to listen to the song immediately, and I'll have forgotten all about it by the time your overhyped single release happens.


It's also an invitation for third parties to trip you up. "Gee, if we send a legal threat now we can probably block the release of this" vs. "welp, it's already out, can't put the horse back in the barn; all we could hope to do is immolate our goodwill by suing a researcher".


Called it.

It's gone now.


Just to say thank you for your efforts! It's very sad that MS hid Orca; I wanted to test it from the start. I've now found that Guanaco-30b has spatial comprehension to some limited extent, and it's the best 30b model so far. Can you add this model to your training list? It would also be very nice if someone would sponsor MPT-7b (better than LLaMA-13b in most cases) and RWKV (I would really love to see how it performs after such tuning). RWKV should be the cheapest to tune, shouldn't it?


Guanaco is a LLaMA fine-tune, so you'll almost certainly be able to add that on top of the LLaMA results if you want.


I’m super bummed that we didn’t name this “Free Willy”


Got the movie reference but took me a moment to catch the second meaning.

How about "Free Will-e"?

That's less likely to evoke sexual connotations and references another (actually somewhat relevant) movie.


I think that might evoke something... undesirable


Speak for yourself.


Please recommend a good tutorial/book/video on modern LLMs and NNs in general, for programmers and technical people, one where you actually get an idea of how it all works. I've tried googling with dozens of queries and it just turns up hand-wavy articles for lay people or paid courses.


I recommend the Huggingface courses: https://huggingface.co/course

The courses are geared toward writing Python applications around these models. They're fairly hands-on, so it would still be good to complement them by reading papers or watching videos on the fundamental principles of AI and ML.
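For a taste of the level the courses start at, the transformers pipeline API gets a model running in a few lines. A minimal sketch (the model choice here is just an illustrative default, not something the course mandates):

    # Load a small pretrained model and generate a continuation.
    from transformers import pipeline

    generator = pipeline("text-generation", model="gpt2")
    out = generator("Hello, I'm a language model,", max_length=30)
    print(out[0]["generated_text"])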


Thank you for this - I've been meaning to finally take the dive into Hugging Face but hadn't looked into how.


Yes. I get FOMO every time I see people discussing it in depth and I can't understand what they're talking about.


Would https://course.fast.ai/ be of assistance?


All GPT-4 and GPT-3.5 data.

This seems to be working well in other fine-tunes, but the lack of anything other than OpenAI output is still really bizarre to me. The error rate will surely be high, especially with GPT-3.5.


> All GPT-4 and GPT-3.5 data.

It’s super funny that by saying “You can’t use GPT to create training data for competing language models” OpenAI convinced a bunch of folks that GPT would be Super Good at producing training data for making competing language models.

It’s like “Do NOT use these prions to feed competing cattle, we would HATE IT if you did that”


I'm used to GPT-4 output now, and once in a while I try to see if GPT-3.5 can also get it right. It fails miserably. Every single time. I'm sure it's fine for some things, simple things.


It takes work, and breaking your problem down into very simple and clear tasks, but it is possible to get decent data transformations from GPT-3.5. I generally start with GPT-4 when prototyping an idea, and then, after it's working consistently, I ask GPT-4 to break the instructions down into smaller pieces. Even if it takes two or three runs through GPT-3.5 to produce the full output that GPT-4 could transform in a single pass, it's still cheaper...


That's interesting. Do you mean that you ask GPT-4 to break down its own prompt into smaller/simpler pieces?


Yes, and then I iterate on it until it is clear and simple enough for weaker models to actually follow consistently.


I believe they were referring to GPT-3.5, but yes, you can get it to be pretty successful at more complex tasks this way. You can also break the problem down manually yourself, if it's a standard workflow. For example, rather than ask it to translate some text, summarize it, and classify it in a single prompt, doing it as three separate prompts that feed into one another is far more likely to be successful.
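To make that concrete, here's a minimal sketch of such a three-step chain with the (2023-era) openai Python client; the model name, prompts, and input file are all just illustrative, not anyone's actual setup:

    # Chain three simple prompts instead of one complex prompt.
    # Each step's output feeds the next, so a cheaper model can
    # handle each individual step.
    import openai

    def ask(prompt, model="gpt-3.5-turbo"):
        resp = openai.ChatCompletion.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
        )
        return resp["choices"][0]["message"]["content"]

    text = open("document.txt").read()  # hypothetical input
    translated = ask("Translate this text to English:\n\n" + text)
    summary = ask("Summarize this text in two sentences:\n\n" + translated)
    label = ask("Classify the sentiment of this summary as "
                "POSITIVE, NEGATIVE, or NEUTRAL:\n\n" + summary)
    print(label)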


I really enjoyed the tenor of this post. It's great to see cordial acknowledgement of the contributors' efforts, and with such glowing language :)


I'm guessing there is still no consensus on the legality of using OpenAI to train a model?


Probably legal if you use third-party collected data (as you personally haven't agreed to OAI's ToS, and AI generations can't be copyrighted, so they aren't owned by OAI), but I guess corps are too wary to risk a court battle.

Directly using OAI to train a model for commercial use that competes with OAI is a violation of their ToS, though.


Right, so that's why I'm not sure I understand why there are so many "open source" efforts that use GPT-4 when 75+% of people interested in open source want to use it in a commercial effort.

Makes me think that quite a lot of them are just ignoring that part of the service terms. Then it makes more sense to keep using GPT-4 for open source models -- if you've decided you're going to ignore that part anyway.


If you're releasing open source models but not profiting from them, how are you competing with OAI? I don't think that's a violation of the ToS.


    Model Size  Compute Estimate
    7b          1k GPU-Hours
    13b         2k GPU-Hours
    30/33b      4k-6k GPU-Hours
    40b         8k-10k GPU-Hours
    65b         10k-15k GPU-Hours
What would that mean in USD?


Absolute best case in the cloud for the kind of GPUs this needs? ~$1/GPU/hr, but maybe up to $5/GPU/hr depending on provider and configuration. But companies or other organizations with extra capacity on their in-house hardware might also be able to just run their training script for a while, at which point the cost is more like electricity + opportunity cost.
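To put rough dollar figures on the table above, taking those cloud rates at face value:

    13b:   2k GPU-hours x $1/hr ≈  $2k  (best case)
    13b:   2k GPU-hours x $5/hr ≈ $10k
    65b:  15k GPU-hours x $5/hr ≈ $75k  (worst case)

So anywhere from a few thousand dollars for the small models to tens of thousands for the 65b, before any in-house-hardware savings.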


That sounds cheap. Can I really train a 13b model from scratch for just USD 2000?


Nope. Salesforce just announced their heavily trained 7B model cost them $150k at Google.

What you can do is fine-tune an open 7B model for a few thousand dollars, and that's the plan for these folks.


This is full training, right? Not fine-tuning. Fine-tuning must be cheap, like the mentioned $2k...


It might depend on what you mean by "full training" and "fine-tuning". They're not proposing to train a brand-new foundational model from scratch, like a brand-new LLaMA. But they want to do something considerably more intensive than just building a LoRA (see the sketch below).

The article contains this:

  We are currently seeking GPU compute sponsors for training OpenOrca on the following platforms:
  * Falcon 7b, 40b
  * LLaMA 7b, 13b, 33b, 65b
  * MPT-7b, 30b
  * Any other targets that get a sponsor. (RWKV, OpenLLaMA)
As I understand it, a full round of training on the OpenOrca dataset would be comparable to going from LLaMA to Vicuna, but hopefully with more dramatic effects, if the techniques proposed in the "Textbooks Are All You Need" paper work as well as advertised.
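For contrast with full training, a LoRA freezes the base model and trains only small low-rank adapter matrices on top of it, which is why it's so much cheaper. A minimal sketch with the Hugging Face peft library (the base model and hyperparameters here are placeholders, not what the OpenOrca team plans to use):

    # Wrap a frozen base model with small trainable LoRA adapters.
    from transformers import AutoModelForCausalLM
    from peft import LoraConfig, get_peft_model

    base = AutoModelForCausalLM.from_pretrained("huggyllama/llama-7b")
    config = LoraConfig(
        r=8,
        lora_alpha=16,
        target_modules=["q_proj", "v_proj"],  # attention projections only
        task_type="CAUSAL_LM",
    )
    model = get_peft_model(base, config)
    model.print_trainable_parameters()  # typically well under 1% of weights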


Depending on how much fine-tuning you're doing, it can be free (Colab) or a few bucks.


When will the model be available?


It says mid-July.


Also looking forward to Stable Diffusion's new SDXL model in "mid-July": https://stability.ai/blog/sdxl-09-stable-diffusion

I hope both of these deliver on their promises. They're exciting developments.


Why would they use the same name as the GNOME Orca project [1]?

[1] https://help.gnome.org/users/orca/stable/introduction.html.e...


Why would GNOME Orca use the same name as the orca project [1]?

Why would that orca project use the same name as the other orca project [2]?

Why would the orca project use the same name as the orca plant [3]?

Etc. ... because orcas are badass and finding good names for things is difficult.

[1] https://orca.org.uk/

[2] https://www.orca-project.eu/

[3] https://www.theguardian.com/environment/2021/sep/09/worlds-b...


Because at this point, nearly every name is taken. The Orca screen reader and Orca AI are quite easy to disambiguate.


Bullshit; the scope of possible names is practically infinite.

Even if actual words and sensible letter permutations run out, you can start borrowing from outside of software and have much less chance of confusion. Nike, Adidas, NYC, Rolex. The industry is different and there is no commerce involved, so no grounds for trademark violation.

There are two reasons to collide with another OSS project: basic laziness to do a quick Google search before you settle on a name, or a desire to benefit from preexisting search traffic.


> you can start borrowing from outside of software and have much less chance of confusion. Nike, Adidas, NYC, Rolex.

This is objectively not true.

> The industry is different and there is no commerce involved so no grounds for trademark violation.

Nike, Inc. have US trademarks in Nice class 9: 97095855 & 97096366. Rolex Watch U.S.A., Inc. have a class 9 and class 42 US trademark: 97655284. adidas AG have a class 9 EU trademark: 006703086. etc, etc.

Besides, these brands are so well known that I'm certain you'd be challenged even if it was a different trademark class.


A trademark is not a globally reserved word. If you are not doing commerce in a relevant area where there can be confusion (and you don't imitate the logo), you are free to use it. This is basic free speech.

Besides, Orca is a registered trademark used by multiple class 9 businesses; now what?


You should read about trademark dilution laws.


Okay, I'll give you that. Of course it requires a market and commerce, or it would be at odds with free speech. But I see how, in this scenario, this Orca may want to sell stuff later (like some OSS does these days), so that would be a problem for them.

But... how does that make it OK to collide with a venerable OSS project? Because GNOME won't sue? The scope of words that are not registered or considered strong trademarks is still nearly infinite!



OpenOrca's goal is to provide an open source replica of Microsoft's Orca 13B model, so changing the name makes no sense.



