InternLM – new open source 7B LLM (github.com/internlm)
308 points by freediver on July 6, 2023 | 89 comments



Note that this is apparently a 7B version of a 104B model trained with the intention of competing with OpenAI offerings on the Chinese market [1]. There are a number of such projects: Baichuan, ChatGLM2, InternLM, and some more IIRC, and they all have small-scale open-source versions.

For what it's worth, I've tried out ChatGLM2-6B and Baichuan converted to LLaMA (the architecture is literally identical in that case). They're okay, though underwhelming given their reported benchmarks; probably the main point of creating them is for the engineers to gain experience, and to get feedback from a wider community that has less incentive to downplay their shortcomings.

Surprisingly, they do not appear censored in any particularly "Chinese" political direction, but they share sensibilities of ChatGPT and Claude.

1. https://github.com/InternLM/InternLM-techreport


Chinese regulation around generative AI isn’t yet formalized, including provisions for censorship. The Cyberspace Administration of China published a set of draft measures[0] for public comment, but it doesn’t seem like a revised version has been released.

The draft indicates that there will be some level of censorship, but it’s unclear what the scope will be. This analysis[1] suggests that generative AI for research purposes could be exempted (section 1). The same analysis points out that there are other government bodies at play that are more focused on advancing AI as an industry within China.

It does seem likely that there will be some kind of censorship carve-out for AI research, whereas companies offering generative AI products to the public will need to self-censor to avoid fines and/or prosecution.

[0] https://digichina.stanford.edu/work/translation-measures-for...

[1] https://fpf.org/blog/unveiling-chinas-generative-ai-regulati...


> Surprisingly, they do not appear censored in any particularly "Chinese" political direction, but they share sensibilities of ChatGPT and Claude.

Perhaps they used GPT-4 responses for the instruct finetuning, as many LLaMA finetunes do?

The paper doesn't say where they got the data from, other than "The pre-trained language model is further fine-tuned, following the mainstream procedure as in InstructGPT."

(Also, I don't like how they use raw LLaMA 65b as a benchmark rather than an instruct tuned derivative)


I believe it's more likely that they used Anthropic human preference data [1] or similar, and accordingly the Anthropic/progressive American notion of honest-helpful-harmless behavior. Thus I've seen these models misgeneralize towards prudish finger-wagging. For example, they parse bad words like "beat", "abuse", "steal" in morally neutral contexts ("beat a benchmark" or something) as signifiers of substantial transgression and spiral into telling me how, as language models, they insist it's never okay to etc. etc. This attitude was strikingly reminiscent of American models, even though other failure modes – like hallucinations – don't seem so similar.

Papers like Tulu [2] suggest that LLaMA-65b is indeed an appropriate baseline, given reasonable prompting. Instruct datasets only convey a flavor of responses, and for a strong foundation model that can infer the intended flavor on its own, naive finetuning seems to be detrimental. GPT-4 was much more powerful prior to having been finetuned, if reports of early witnesses and researchers are to be believed.

1. https://huggingface.co/datasets/Anthropic/hh-rlhf

2. https://arxiv.org/abs/2306.04751


I don't know anything about what you're talking about. Where do I start to learn some of the AI terminology, models, benefits and drawbacks of each, etc?


The most patient lecturer would probably be ChatGPT itself ...


> trust_remote_code=True

This is a hard no from me. Anyone know why this is so common in models from China? I'm not getting into conspiracies or anything here, but I've seen it in quite a few others from there.

I wouldn't run a model with this requirement from anyone else for that matter.


That's because the model architecture hasn't been added to huggingface/transformers yet, because it literally was just published today.

    >>> from transformers import AutoTokenizer, AutoModel
    >>> model = AutoModel.from_pretrained("internlm/internlm-chat-7b", trust_remote_code=True, device='cuda')
Here, "trust_remote_code=True" means "download the model code from the huggingface repo 'internlm/internlm-chat-7b', along with the weights, and run it". If it's False, the library will use the built-in model architectures hardcoded in huggingface/transformers and only download the weights.

The scary flag is there because, of course, newcomers may not realize that model == code, and if you load an arbitrary model you are likely executing arbitrary code.

Wonder why, for example, you don't remember LLaMA having this on release day? Because they don't use the huggingface transformers library and don't use huggingface to distribute their model. You just clone and run their code from GitHub, and... how is that not "trust_remote_code"?


> newcomers may not realize that model == code

This makes sense in a way given the API of typical ML libraries. But there is no fundamental reason this needs to be the case.

Or, more correctly stated: model == code for sure, but said code need not have any rights to perform side effects. For some reason, e.g. TensorFlow has stuff like tf.io.write_file [1] (is that actually an operation you can put in a model???), but one could easily imagine a more appropriate domain-specific model language that your code is compiled to, which by design cannot perform any IO. Imagine that the model you distribute is not random Python code that may or may not run a model, but instead the model itself, i.e. the graph encoded in that domain-specific language.

Then downloading a random model from some random untrusted place is no different from downloading some random data from some untrusted place: you're going to execute the model, which may DOS you, but nothing much else will happen.
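
(To make that concrete: safetensors gets you partway there, for the weights at least. A rough sketch, assuming a LLaMA-shaped checkpoint exported to a safetensors file; the file name below is hypothetical:)

    # Loading safetensors is pure deserialization: no pickle, no arbitrary code
    # execution, just tensors. The architecture still has to come from code you
    # already trust (here, the stock LLaMA classes in transformers).
    from safetensors.torch import load_file
    from transformers import LlamaConfig, LlamaForCausalLM

    state_dict = load_file("model.safetensors")      # hypothetical file name
    model = LlamaForCausalLM(LlamaConfig())          # graph defined by trusted, local code
    model.load_state_dict(state_dict, strict=False)  # weights are inert data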

Unfortunately the ML world is too stuck in the imperative mindset for this (IMO more sensible) way of doing things. :)

[1]: https://www.tensorflow.org/api_docs/python/tf/io/write_file


At that point you'd need a machine learning DSL and runtime. Currently, it's all Python libraries, so you can do everything Python can... which is everything, essentially.

It's highly unlikely that the market for running these models like an appliance, securely in an untrusted context, will ever manifest. It's just too much of a niche, as it would also reduce their extensibility/usability significantly.


Something like this may grow out of the GGML project, which is gaining traction. They already have a weights format which can be loaded with mmap, though AFAIK the model architecture still needs to be defined in C++.
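
(For what it's worth, consuming a GGML file from Python via llama-cpp-python already feels close to "opening a data file"; a sketch, with a hypothetical local path:)

    # The GGML model file is mmap'd as plain data; the compute graph lives in the
    # llama.cpp C++ code you compiled, not in the downloaded file.
    from llama_cpp import Llama

    llm = Llama(model_path="./ggml-model-q4_0.bin")   # hypothetical path to a quantized model
    out = llm("Q: Name three fruits. A:", max_tokens=32)
    print(out["choices"][0]["text"])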


I've only used llama via llama.cpp.

In general I think the Python ML stuff is a mess. But I still won't execute code that asks me to trust arbitrary remote code, since that remote code can change at any time. It would be better to wait with the release until the architecture is in the transformers library, or just include it in a clonable repo so the trust_remote_code flag isn't needed.

It is much better to just be able to clone the code and have it locally, so you can verify it once instead of trusting that it won't suddenly download new code you haven't been able to look at.

trust_remote_code means you have no real control; cloning a repo means you decide yourself when new code is added.
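
(One partial mitigation, if you do end up using the flag: pin the repo to a specific revision so the executed code can't silently change under you. A sketch, with a hypothetical commit SHA:)

    from transformers import AutoModel

    # Pinning to a commit means the weights and the bundled model code both come
    # from the exact revision you audited, not whatever "main" points to later.
    model = AutoModel.from_pretrained(
        "internlm/internlm-chat-7b",
        trust_remote_code=True,
        revision="abc1234",   # hypothetical commit SHA of the version you reviewed
    )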


Yeah, I agree promoting this usage is as bad as promoting `curl | sh` in README.md.

Similar to how you can inspect the contents of a `curl | sh` script and then run it, the model is also in a clonable repo; you may just:

    git clone https://huggingface.co/internlm/internlm-chat-7b
and:

    >>> from transformers import AutoTokenizer, AutoModel
    >>> # the flag is still required for the custom architecture, but now it runs the local copy you cloned and audited
    >>> model = AutoModel.from_pretrained("./internlm-chat-7b", trust_remote_code=True, device='cuda')


This way is much more palatable for me, thanks for showing it :)


Interesting attack vector. Malicious model code.


Thank you for the explanation.


I believe it's because the model architecture isn't in the Hugging Face transformers library yet, so it needs to execute some Python code shipped with the repo to create the PyTorch model. I haven't noticed it being specific to models from China; almost all lesser-known models have to do this.


Seems pretty common though, for defining custom architectures, configs and whatnot?

AFAIK the "remote code" is still openly hosted on huggingface so you can audit it if you like. Seems no more dangerous than things like `pip install some_random_library`?
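
(A minimal sketch of that audit step, assuming the custom code ships as a modeling_*.py file in the repo; the exact file name is a guess:)

    from huggingface_hub import hf_hub_download

    # Fetch just the custom model code and read it before deciding whether to
    # pass trust_remote_code=True.
    path = hf_hub_download("internlm/internlm-chat-7b", "modeling_internlm.py")  # file name assumed
    print(open(path).read())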


This has become less common in recent days, at least for image generation (e.g. safetensors in Stable Diffusion).

The point of open-source models is that they can be fine-tuned. When many people create fine-tuned versions, a zoo of models appears. So far so good (maybe), but the bad practice of running untrusted code from the zoo will sooner or later lead to a wave of cryptominer, ransomware, and credential-theft incidents.


I like this pip metaphor. If we had required `--trust-remote-code` for every `npm install` we could have avoided left-pad and most of the software supply chain drama in the past years.


How would that have avoided left-pad? Do you just mean that people would have been discouraged from pulling in so many dependencies?


I think that would just teach people to type --trust-remote-code fast.


Mind pasting the link to that line? Am on mobile and can’t find it myself easily.


> The code in this repository is open-source under the Apache-2.0 license. The InternLM weights are fully open for academic research and also allow commercial use with written permission from the official team. For inquiries about commercial licenses and collaborations, please contact internlm@pjlab.org.cn.

This makes me much less excited about this model.


Agreed, this basically moves it to the "don't bother" pile. There are already the llama variants with non-commercial licenses, and open-llama as an open source model (I'm thinking in the 7B space specifically). This would have to be pretty friggin compelling to spend any time on.


Do they even have the legal standing to say how you can or cannot use the weights? It could be ruled that weights are uncopyrightable. I think we as a community should advocate for that.

If you train on data you don't own, the results (weights, unmodified outputs) should be public domain. When people create novel works on top (SaaS tools, music, films), then those human combinations should hold copyright. Not the model weights.

If you can prove you own all of the inputs into training, then perhaps it's another story. But that could also be dangerous and allow for data cartels to own the future.


Is that even valid? This seems to be the only place where they've made this exemption; it's not written in the license. Even the weights on Hugging Face are licensed under Apache 2.0.

Doesn't Apache 2.0 allow for fairly unrestricted commercial use? Isn't that the whole point of using that license?


The model code is Apache 2.0, the weights are proprietary.


That's not what their huggingface repo says: https://huggingface.co/internlm/internlm-7b

The current release on huggingface is available under plain Apache 2.0.


If weights are even protected by copyright at all...which would be a departure from current law.


Less excited that you can't freely take work from academics to resell it in a shitty SaaS company?


You do understand that academics are usually funded by taxpayers? Obviously, not by me, as I don't pay taxes in China, but it's not like academics are doing this work for free. Society pays them for their work so that it can benefit from its results.


You do understand that private companies, as a whole, are a drain on the academic system, pushing to lower the very taxes that fund this research? Society should benefit from these results. The 5000th LLM-haiku-generator-SaaS-company-incorporated-in-Delaware is not society.


Is there any breakdown of private vs. government funding for general-purpose academic research? I was under the impression that most of the funds in fields like ML come from undergrad fees and donations by alumni or private companies.


> I was under the impression that most of the funds in fields like ML come from undergrad fees and donations by alumni or private companies.

In the United States, tuition makes up less than 35% of most universities' revenue.[0] Donations are significant, but if we were to just look at research funding, it would mostly be government grants.

"The federal government is by far the largest funder of academic R&D..." [1]

[0] https://nces.ed.gov/programs/coe/indicator/cud [1] https://ncses.nsf.gov/pubs/nsb20202/academic-r-d-in-the-unit...


Don't worry, the private money will not go into their pockets but to fund future projects of the university, lab, or whatever. It's a way to lessen the burden on taxpayers, and to shift it to those who benefit the most from it.


"Society"

quiet chuckle


So you don't use open source software? You should try it, there's a great ecosystem of free software, including lots written by academics who are happy to have their work add value to industry.


I do. And for my job I also don't use open-source software whose license restricts commercial use, because I respect the wishes of the author. It doesn't make the projects, tools and libraries any less good. OP is just looking to quickly cash in on something he didn't put an ounce of effort into.

AGPL & Dual-licensing are the way forward, because of leeches.


The entitlement and audacity of people who consume open source blows me away. I've maintained a project for 12 years and recently someone wanted me to help them implement the software in their system. I politely told them that since this wasn't a bug, they would need to purchase a support package. They then accused me of trying to "sell open source software" and closed the issue. People are unbelievable. Fuck me for trying to make a living providing you personal development time, using my software that I've supported for free for over a decade.


I neither agree nor disagree with your position (still thinking about it), but I do think that's right uncharitable mind-reading you've done of GP. They never said anything about "looking to quickly cash in to something he didn't put an ounce of effort in."


Less excited that I can't freely take work from someone, create a startup that is going to resell it in a shitty SaaS company, and cash out for half a B? Yes, yes I am.


In Europe we call that a success story.


There is a great opportunity for totalitarian and authoritarian regimes (China and UAE so far) to create commercially usable and free LLMs that work significantly better than alternatives (backed by large amounts of government money).

Over time, as they get used in more and more products, these LLMs can become more 'aligned' to these regimes' way of thinking.

There are no Chinese companies that are not part of the Chinese government.

This is a new kind of cultural soft power.


This would be more interesting for discussion if the comment described an actual threat scenario instead of a vague hypothetical. Even in the hypothetical, there’s no actual described consequence, only that China gains some undefined level of soft power which means nothing on its own.

Some ideas:

- How are these LLMs being used? Who is the end user and what are they using the application for?

- If a state-level threat actor wanted to compromise an LLM, how would they do it? What would their goals be? How would they then use the attack vector to accomplish their goals?

- What benefit would the actor get from doing so? What are the costs? What are the consequences if they fail or are discovered?

- How would a target detect if they’ve been compromised? How easily could they recover?


I didn't think it was that vague. But if you're looking for ideas:

1. High quality LLM made free for commercial use.

2. LLM is used in many places as it is the best.

3. LLM is aligned to subtly promote the interests of the threat actor.

There is no 'compromise'. It is not hacking software, only wetware.

A concrete example...

An LLM created by an Advanced Persistent Threat (APT) is used in educational software aimed at kids. Over time, as the kid interacts with it, the LLM promotes a way of thought that either aligns with the APT or undermines the ideology of the society they are in. There is no moment that can be pointed to that says "look, they are trying to hack us!", but decades later you have adult members of a foreign society more open to your way of thinking.

i.e. Soft Power.


Aligned how? If you download the code, it won't change under your feet.


The weights, calculated through very intensive computing, are what hold the knowledge in LLMs; the source code just executes them. These products could simply update/patch their weights periodically, and no one would complain, because that's not bad per se.


Tried it with a few PyLLMs benchmarks [1] and it did not do very well.

Q. Use m to substitute p, a to substitute e, n to substitute a, g to substitute c, o to substitute h, how to spell peach under this rule?

A.

m: p

a: e

n: a

g: c

o: h

how: peach

Q. write three sentences each ending with the word apple

A.

1. The apple tree was covered in apples.

2. She sliced the apple with a knife.

3. The apple juice was too sweet.

Wonder if the synthetic benchmarks are very biased and not that useful.

[1] https://github.com/kagisearch/pyllms/blob/6c11cd51b099d33e61...


“when a measure becomes a target, it ceases to be a good measure”

The model itself is optimising weights for its targets, and the model builders are optimising the model for the benchmarks.


A related question -- when fine-tuning a model like this on a specific corpus, how does the fine-tuning affect the actual chat capability, since the chat model weights seem to come as a separate model? Does one fine-tune the LLM+Chat model directly? If so, does that not require some kind of prompt-based training as opposed to just lookahead prediction? Does one have to fine-tune the LLM and then repeat whatever they do to get the LLM+Chat model?


Is this also censored/nerfed? I'd love to play with a "raw" unnerfed model to fully grasp what an LLM can do (and see how biased it is). Does anyone have any recommendations for unnerfed models to try out?


LLaMA 65B is the best uncensored model we've got, and the Airoboros fine-tuning if you want it to follow instructions.



The most powerful available foundation model is code-davinci-002, a.k.a. GPT-3.5. It's only available on Azure since OpenAI removed it from their own Playground and API for some reason.


All 3 text-davinci models are available on OpenAI's API, including text-davinci-003 (which is the GPT-3.5 gen). Code-davinci-002 is a code-tuned model. You can see a nice visual summary of the relationships between the OpenAI models at https://yaofu.notion.site/How-does-GPT-Obtain-its-Ability-Tr...

Or the official source is https://platform.openai.com/docs/model-index-for-researchers


> All 3 text-davinci models are available on openAI's api.

That's irrelevant because these are all fine-tuned.

> Code-davinci-002 is a code-tuned model

No, "code-tuned" isn't even a thing. It is a foundation model, which consists purely of pretraining. No fine-tuning is involved.

> Or the official source is

The official source says exactly what I just said.


OK, perhaps I used slightly the wrong term. The docs[1] say that code-davinci-002 is "optimized for code completion tasks" though, so it seems unlikely to fulfil the OP's purpose of playing around with an unaligned/sweary model, which was my main point. Some of the uncensored models from huggingface would probably serve that purpose much better.

[1] see the entry for code-davinci-002 in https://platform.openai.com/docs/models/gpt-3-5


Code was just part of its pretraining. All other GPT-3.5 models are fine-tuned versions of code-davinci-002.

Quote:

1 code-davinci-002 is a base model, so good for pure code-completion tasks

2 text-davinci-002 is an InstructGPT model based on code-davinci-002

3 text-davinci-003 is an improvement on text-davinci-002

4 gpt-3.5-turbo-0301 is an improvement on text-davinci-003, optimized for chat

Quote end.

https://platform.openai.com/docs/model-index-for-researchers

The reason you want a base model for code completion has nothing to do with code itself; it has to do with the fact that it completes text, unlike all the instruction-tuned models, which expect instructions. When you have code, there aren't necessarily any instructions present. You basically want autocomplete. That's what a base model does. But that doesn't mean it doesn't work with other things apart from code. After all, all other GPT-3.5 models are just code-davinci-002 with additional instruction and RLHF fine-tuning added, and they know countless other subject areas apart from code.

I don't get why this is so hard to understand.


It's not hard to understand. We just have a disagreement about something that you think is very important probably partly because you know more about this than I do. Have a nice day. Thanks for explaining.


The model isn't available at all?


It is available in the sense that it is accessible. The weights are not available for download of course, but the OP wanted to "play around" with it, for which only access is required. There is no other accessible foundation model that can compete with GPT-3.5.


Why are you guys downvoting me?


Because GPT-3.5 is not very good compared to LLaMA 65b or even 33b finetunes, from my testing.

Also because 3.5 is not really available?


Have you actually tested code-davinci-002?


Maybe you mean gpt-3.5-turbo or text-davinci-003? Or GPT-4 (technically in beta so not fully available to everyone)?


No, those are all fine-tuned models which are "nerfed" in the terminology of the OP. I mean code-davinci-002, the GPT-3.5 base model.


code-davinci models are finetuned on code so I don't think that's what the OP wants. For reference the family tree is here https://platform.openai.com/docs/model-index-for-researchers


As the website you linked says, code-davinci-002 is not fine-tuned. It is the GPT-3.5 base model.


Is that what nerfed means? I usually see "nerfed" used in a way that means that it will refuse to answer certain topics. "I can't answer that as it would violate copyright" and such.


The fine-tuned models are certainly censored and not "raw".


But doesn't code-davinci-002 also have OpenAI's filters in between you and the model?


Yes, but that's different from the model itself being fine-tuned.


> I mean... the GPT-3.5 base model

That would be text-davinci-003, I believe.


No, text-davinci-003 is fine-tuned. The base model is code-davinci-002. See https://platform.openai.com/docs/model-index-for-researchers


Saving you a click: despite what the repo title might suggest, while the code is open source, the model weights cannot be used commercially without permission.

> The code in this repository is open-source under the Apache-2.0 license. The InternLM weights are fully open for academic research and also allow commercial use with written permission from the official team. For inquiries about commercial licenses and collaborations, please contact internlm@pjlab.org.cn.

https://github.com/InternLM/InternLM#open-source-license


What kind of hardware do you need to run a model like this? Do you need an A100, or can something smaller suffice?

What about for fine tuning? Are hardware requirements higher?


I welcome new models! The more, the merrier.

That said, this model has been tailored but they are comparing it to non-finetuned LLaMA-7B in their benchmark? That seems a bit fainthearted.


> HumanEval: InternLM-7B: 10.4, LLaMA-7B: 14.0

The funny part is that the base model apparently outperforms the fine tune.

So far the HumanEval benchmark seems to be the only one that can objectively compare overall model performance despite being a coding-only benchmark; the rest mostly just give "99.7% of ChatGPT" bullshit results. Turns out you can't compare creative writing because all outputs are basically valid.


why is 7B parameters seemingly a magic number?


It matches LLaMA 7B, and it's "cheap" to train for a demo.

If they actually wanted to finetune/train for commodity hardware use, 13b-40b would be a better target.


I guess going with a parameter count that matches existing models makes it easier to compare benchmarks. Perhaps there is another particular reason like required memory, but momentum is probably also significant.


7B params would take 14 GB of GPU RAM at fp16 precision. So it would be able to run on 16 GB GPUs with 2 GB to spare for other small things.
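
(The back-of-the-envelope arithmetic, weights only, before KV cache and activations:)

    params = 7e9
    for precision, bytes_per_param in [("fp16", 2), ("int8", 1), ("int4", 0.5)]:
        print(f"{precision}: {params * bytes_per_param / 1e9:.1f} GB")
    # fp16: 14.0 GB, int8: 7.0 GB, int4: 3.5 GB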


But in practice, no one is running inference at FP16. int8 is more like the bare minimum.


I have an 8GB card, and I am considering two more 8GB cards, or should I get a single 16GB? The 8GB card was donated, and we need some pipelining... I have 10~15 2GB Quadro cards... apparently useless.


I mean... It depends?

You are just trying to host a llama server?

Matching the VRAM doesn't necessarily matter; get the most you can afford on a single card. Splitting beyond 2 cards doesn't work well at the moment.

Getting a non-Nvidia card is a problem for certain backends (like exLLaMA) but fine for llama.cpp in the near future.

AFAIK most backends are not pipelined; the load jumps sequentially from one GPU to the next.
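
(If you do split across cards with the transformers/accelerate stack, it's exactly that sequential layout. A sketch, assuming accelerate is installed, with placeholder per-card memory caps:)

    import torch
    from transformers import AutoModelForCausalLM

    # device_map="auto" shards consecutive layers across the visible GPUs; at
    # inference time the activations hop from card to card, so only one GPU is
    # busy at a time rather than a true pipeline.
    model = AutoModelForCausalLM.from_pretrained(
        "internlm/internlm-chat-7b",
        trust_remote_code=True,
        torch_dtype=torch.float16,
        device_map="auto",
        max_memory={0: "8GiB", 1: "8GiB"},   # placeholder limits for two 8GB cards
    )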


just easier to run on smaller hardware


Is the dataset it is trained on mentioned anywhere?


Great name for a simple LLM hehe



