Note that this is apparently a 7B version of a 104B model trained with the intention of competing with OpenAI's offerings on the Chinese market [1]. There are a number of such projects: Baichuan, ChatGLM2, InternLM and some more iirc, and they all have small-scale open-source versions.
For what it's worth, I've tried out ChatGLM2-6B and Baichuan converted to LLaMA (the architecture is literally identical in that case). They're okay, though underwhelming given their reported benchmarks; probably the main point of creating them is gaining experience for engineers, and feedback from the wider community that has less incentive to downplay their shortcomings.
Surprisingly, they do not appear censored in any particularly "Chinese" political direction, but they share sensibilities of ChatGPT and Claude.
Chinese regulation around generative AI isn’t yet formalized, including provisions for censorship. The Cyberspace Administration of China published a set of draft measures[0] for public comment, but it doesn’t seem like a revised version has been released.
The draft indicates that there will be some level of censorship, but it’s unclear what the scope will be. This analysis[1] suggests that generative AI for research purposes could be exempted (section 1). The same analysis points out that there are other government bodies at play that are more focused on advancing AI as an industry within China.
It does seem likely that there will be some kind of censorship carve-out for AI research, whereas companies offering generative AI products to the public will need to self-censor to avoid fines and/or prosecution.
> Surprisingly, they do not appear censored in any particularly "Chinese" political direction, but they share sensibilities of ChatGPT and Claude.
Perhaps they used GPT4 responses for the instruct finetuning, as many LLaMA finetunes do?
The paper doesn't say where they got the data from, other than "The pre-trained language model is further fine-tuned, following the mainstream procedure as in InstructGPT."
(Also, I don't like how they use raw LLaMA-65B as a baseline rather than an instruct-tuned derivative.)
I believe it's more like they used Anthropic human preference data [1] or similar, and accordingly the Anthropic/progressive American notion of honest-helpful-harmless behavior. Thus I've seen these models misgeneralize towards prudish finger-wagging. For example, they parse bad words like "beat", "abuse", "steal" in morally neutral contexts ("beat a benchmark" or something) as signifiers of substantial transgression and spiral into telling me how, as language models, they must insist it's never okay to etc. etc. This attitude was strikingly reminiscent of American models, even though other failure modes – like hallucinations – don't seem so similar.
Papers like Tulu [2] suggest that LLaMA-65B is indeed an appropriate baseline, given reasonable prompting. Instruct datasets only convey a flavor of responses, and for a strong foundation model that can infer the intended flavor on its own, naive finetuning seems to be detrimental. GPT-4 was much more powerful prior to being finetuned, if reports from early testers and researchers are to be believed.
I don't know anything about what you're talking about. Where do I start to learn some of the AI terminology, models, benefits and drawbacks of each, etc?
This is a hard no from me. Does anyone know why this is so common in models from China? I'm not getting into conspiracies or anything here, but I've seen it in quite a few others from there.
I wouldn't run a model with this requirement from anyone else for that matter.
That's because the model architecture hasn't been added to huggingface/transformers yet, because it literally was just published today.
>>> from transformers import AutoTokenizer, AutoModel
>>> tokenizer = AutoTokenizer.from_pretrained("internlm/internlm-chat-7b", trust_remote_code=True)
>>> model = AutoModel.from_pretrained("internlm/internlm-chat-7b", trust_remote_code=True).cuda()
Here, the "trust_remote_code=True" means "download the model code from huggingface repo 'internlm/internlm-chat-7b'", along with the weight, and run it. If it's False, the library would use builtin model architectures hardcoded in huggingface/transformers and only download the weight.
The scary flag is here because, of course, newcomers may not realize that model == code, and if you load an arbitrary model you are likely executing arbitrary code.
Wonder why, for example, you don't remember seeing this with LLaMA on release day? Because they don't use the huggingface transformers library and don't use huggingface to distribute their model. You just clone and run their code from GitHub, and... how is this not "trust_remote_code"?
This makes sense in a way given the API of typical ML libraries. But there is no fundamental reason this needs to be the case.
Or, more correctly stated: model == code for sure, but said code need not have any rights to perform side effects. For some reason e.g. TensorFlow has stuff like tf.io.write_file [1] (is that actually an operation you can put in a model???), but one could easily imagine a more appropriate domain-specific model language that your code is compiled to, and that by design cannot perform any IO. Imagine that a model you distribute is not random Python code that may or may not run a model, but the model itself, i.e. the graph encoded in that domain-specific language.
Then downloading a random model from some random untrusted place is no different from downloading some random data from some untrusted place: you're going to execute the model, which may DOS you, but nothing much else will happen.
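To make that concrete: something close to this already exists in spirit with graph formats such as ONNX, where the downloaded artifact is a serialized graph interpreted by a fixed runtime rather than arbitrary Python. A minimal sketch (the file name, input name and shape are made up):

    import numpy as np
    import onnxruntime as ort

    # The downloaded artifact is just a serialized graph; the runtime decides
    # which operators exist, so the file can't open sockets or write to disk
    # on its own.
    session = ort.InferenceSession("untrusted_model.onnx")

    # Feed data in, get data out; executing the graph is the only "code" run.
    input_name = session.get_inputs()[0].name
    dummy = np.random.rand(1, 3, 224, 224).astype(np.float32)
    outputs = session.run(None, {input_name: dummy})
    print(outputs[0].shape)

(Whether every operator set is really that well sandboxed is a separate question, but the shape of the interface is the point.)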
Unfortunately the ML world is too stuck in the imperative mindset for this (IMO more sensible) way of doing things. :)
At that point you'd need a machine learning DSL and runtime. Currently, it's all python libraries, so you can do everything python can... Which is everything, essentially.
It's highly unlikely that the market for running these models like an appliance securely in an untrusted context will ever manifest. It's just too much of a niche, as it would also reduce their extensibility/usability significantly
Something like this may grow out of the GGML project, which is gaining traction. They already have a weights format which can be loaded with mmap, though AFAIK the model architecture still needs to be defined in C++.
In general I think the Python ML stuff is a mess. But I still won't execute code that requires me to trust arbitrary remote code, since the remote code can change at any time. It would be better to hold the release until the architecture is published in the transformers library, or just include the code in a clonable repo without the trust_remote_code flag.
It is much better to be able to clone the code and have it locally, so you can verify it once and not have to trust that it won't suddenly download new code you haven't been able to look at.
trust_remote_code means you have no control, really; cloning a repo means you control when new code is added yourself.
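For what it's worth, a middle ground is possible with the existing tooling: snapshot the repo at a pinned revision, read the shipped modeling code, and only then load from the local copy, so nothing new can be pulled in behind your back. A rough sketch (the revision string is a placeholder, not a real commit):

    from huggingface_hub import snapshot_download
    from transformers import AutoModel, AutoTokenizer

    # Download code + weights at a pinned revision, then audit the files on disk.
    local_dir = snapshot_download(
        "internlm/internlm-chat-7b",
        revision="0123456789abcdef0123456789abcdef01234567",  # placeholder commit
    )

    # trust_remote_code is still needed because the architecture ships as code,
    # but now it only runs the files you just inspected locally.
    tokenizer = AutoTokenizer.from_pretrained(local_dir, trust_remote_code=True)
    model = AutoModel.from_pretrained(local_dir, trust_remote_code=True)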
I believe it's because the model architecture isn't added to the Huggingface transformers library, so it needs to run some Python code from the repo (rather than just loading weights) to create the PyTorch model. I have not noticed this being specific to models from China; almost all lesser-known models have to do it.
Seems pretty common though, for defining custom architectures, configs and whatnot?
AFAIK the "remote code" is still openly hosted on huggingface so you can audit it if you like. Seems no more dangerous than things like `pip install some_random_library`?
This has become less common recently, at least for image generation (e.g. safetensors in Stable Diffusion).
The point of opensource models is that they can be finetuned. When many people create finetuned versions, a zoo of models appears. So far so good (maybe), but the bad practice of running untrusted code from the zoo will sooner or later lead to a wave of cryptominers, ransomware, and credential-theft incidents.
I like this pip metaphor. If we had required `--trust-remote-code` for every `npm install` we could have avoided left-pad and most of the software supply chain drama in the past years.
> The code in this repository is open-source under the Apache-2.0 license. The InternLM weights are fully open for academic research and also allow commercial use with written permission from the official team. For inquiries about commercial licenses and collaborations, please contact internlm@pjlab.org.cn.
Agreed, this basically moves it to the "don't bother" pile. There are already the llama variants with non-commercial licenses, and open-llama as an open source model (I'm thinking in the 7B space specifically). This would have to be pretty friggin compelling to spend any time on.
Do they even have the legal justification of saying how you can or cannot use the weights? It could be ruled that weights are uncopyrightable. I think we as a community should advocate for that.
If you train on data you don't own, the results (weights, unmodified outputs) should be public domain. When people create novel works on top (SaaS tools, music, films), then those human combinations should hold copyright. Not the model weights.
If you can prove you own all of the inputs into training, then perhaps it's another story. But that could also be dangerous and allow for data cartels to own the future.
Is that even valid? This seems to be the only place where they've made this exemption; it's not written in the license. Even the weights on Hugging Face are licensed under Apache 2.0.
Doesn't Apache 2.0 allow for fairly unrestricted commercial use? Isn't that the whole point of using that license?
You do understand that academics are usually funded by taxpayers? Obviously, not by me, as I don't pay taxes in China, but it's not like academics are doing this work for free. Society pays them for their work so that it can benefit from its results.
You do understand that private companies, as a whole, are a drain on the academic system, pushing to lower the very taxes that fund this research? Society should benefit from these results. The 5000th LLM-haiku-generator-saas-company-incorporated-in-delaware is not society.
Is there any breakdown of private vs. government funding for general-purpose academic research? I was under the impression that most of the funds in fields like ML come from undergrad fees and donations by alumni, or by private companies.
> I was under the impression that most of the funds in fields like ML come from undergrad fees and donations by alumni, or by private companies.
In the United States, tuition makes up less than 35% of most universities' revenue.[0] Donations are significant, but if we were to just look at research funding, it would mostly be government grants.
"The federal government is by far the largest funder of academic R&D..." [1]
Don't worry, the private money will not go to their pocket but to fund future projects of the university, lab or whatever. It's a way to lessen the burden on taxpayers, and to shift it to those who benefit the most from it.
So you don't use open source software? You should try it; there's a great ecosystem of free software, including lots written by academics who are happy to have their work add value to industry.
I do. And for my job I also don't use open-source software that requires a commercial license, because I respect the wishes of the author. It doesn't make those projects, tools and libraries any less good. OP is just looking to quickly cash in on something he didn't put an ounce of effort into.
AGPL & Dual-licensing are the way forward, because of leeches.
The entitlement and audacity of people who consume open source blows me away. I've maintained a project for 12 years and recently someone wanted me to help them implement the software in their system. I politely told them that since this wasn't a bug, they would need to purchase a support package. They then accused me of trying to "sell open source software" and closed the issue. People are unbelievable. Fuck me for trying to make a living providing you personal development time, using my software that I've supported for free for over a decade.
I neither agree nor disagree with your position (still thinking about it), but I do think that's right uncharitable mind-reading you've done of GP. They never said anything about "looking to quickly cash in to something he didn't put an ounce of effort in."
Less excited that I can't freely take work from someone, create a startup that is going to resell it in a shitty SaaS company and cash out for a half B. Yes, yes I am.
There is a great opportunity for totalitarian and authoritarian regimes (China and UAE so far) to create commercially usable and free LLMs that work significantly better than alternatives (backed by large amounts of government money).
Over time, as they get used in more and more products, these LLMs can become more 'aligned' with these regimes' way of thinking.
There are no Chinese companies that are not part of the Chinese government.
This would be more interesting for discussion if the comment described an actual threat scenario instead of a vague hypothetical. Even in the hypothetical, there’s no actual described consequence, only that China gains some undefined level of soft power which means nothing on its own.
Some ideas:
- How are these LLMs being used? Who is the end user and what are they using the application for?
- If a state-level threat actor wanted to compromise an LLM, how would they do it? What would their goals be? How would they then use the attack vector to accomplish their goals?
- What benefit would the actor get from doing so? What are the costs? What are the consequences if they fail or are discovered?
- How would a target detect if they’ve been compromised? How easily could they recover?
I didn't think it was that vague. But if you're looking for ideas:
1. High quality LLM made free for commercial use.
2. LLM is used in many places as it is the best.
3. LLM is aligned to subtly promote the interests of the threat actor.
There is no 'compromise'. It is not hacking software, only wetware.
A concrete example...
LLM created by Advanced Persistent Threat (APT) is used in educational software aimed at kids. Over time as the kid interacts with it the LLM promotes a way of thought that either aligns with the APT or undermines the ideology of the society they are in. There is no moment that can be pointed to that says: "look they are trying to hack us!", but decades later you have adult members of a foreign society more open to your way of thinking.
The weights, calculated through very intensive computing, are what hold the knowledge in LLMs; the source code just executes them. These products could simply update/patch their weights periodically, and no one would complain, because that's not bad per se.
A related question -- when fine-tuning a model like this on a specific corpus, how does the fine-tuning affect the actual chat capability, since the chat-model weights seem to come as a separate model? Does one fine-tune the LLM+Chat model directly? If so, does that not require some kind of prompt-based training as opposed to just lookahead prediction? Does one have to fine-tune the LLM and then repeat whatever they do to get the LLM+Chat model?
Is this also censored/nerfed? I'd love to play with a "raw" unnerfed model to fully grasp what an LLM can do (and see how biased it is). Does anyone have any recommendations for unnerfed models to try out?
The most powerful available foundation model is code-davinci-002, a.k.a. GPT-3.5. It's only available on Azure since OpenAI removed it from their own Playground and API for some reason.
All 3 text-davinci models are available on OpenAI's API, including text-davinci-003 (which is the GPT-3.5 generation). code-davinci-002 is a code-tuned model. You can see a nice visual summary of the relationships between the OpenAI models at https://yaofu.notion.site/How-does-GPT-Obtain-its-Ability-Tr...
OK, perhaps I used slightly the wrong term. The docs[1] say that code-davinci-002 is "optimized for code completion tasks", though, so it seems unlikely to fulfil the OP's purpose of playing around with an unaligned/sweary model, which was my main point. Some of the uncensored models on huggingface would probably serve that purpose much better.
The reason you want a base model for code completion has nothing to do with code itself, it has to do with the fact that it completes text unlike all the instruction tuned models, which expect instructions. When you have code, there aren't necessarily any instructions present. You basically want autocomplete. That's what a base model does. But that doesn't mean it doesn't work with other things apart from code. After all, all other GPT-3.5 models are just code-davinci-002 with additional instruction and RLHF fine-tuning added, and they know countless other subject areas apart from code.
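To make the "autocomplete" point concrete, here's a minimal sketch with the legacy (pre-1.0) openai Python client, assuming you actually have access to the model on your endpoint (per the comments above, code-davinci-002 may only be reachable via Azure). The prompt is plain text to continue; there are no instructions anywhere:

    import openai  # legacy 0.x client; API key set via OPENAI_API_KEY

    # Base-model prompting: just text for the model to continue, like autocomplete.
    prompt = 'def fizzbuzz(n):\n    """Return the FizzBuzz string for n."""\n'

    resp = openai.Completion.create(
        model="code-davinci-002",  # or whichever base model you can reach
        prompt=prompt,
        max_tokens=128,
        temperature=0,
    )
    print(prompt + resp["choices"][0]["text"])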
It's not hard to understand. We just have a disagreement about something that you think is very important probably partly because you know more about this than I do. Have a nice day. Thanks for explaining.
It is available in the sense that it is accessible. The weights are not available for download of course, but the OP wanted to "play around" with it, for which only access is required. There is no other accessible foundation model that can compete with GPT-3.5.
Is that what nerfed means? I usually see "nerfed" used in a way that means that it will refuse to answer certain topics. "I can't answer that as it would violate copyright" and such.
Saving you a click: despite what the repo title might suggest, while the code is open source, the model weights cannot be used commercially without permission.
> The code in this repository is open-source under the Apache-2.0 license. The InternLM weights are fully open for academic research and also allow commercial use with written permission from the official team. For inquiries about commercial licenses and collaborations, please contact internlm@pjlab.org.cn.
The funny part is that the base model apparently outperforms the fine tune.
So far the HumanEval benchmark seems to be the only one that can objectively compare overall model performance, despite being a coding-only benchmark; the rest mostly just give bullshit "99.7% of ChatGPT" results. Turns out you can't compare creative writing, because all outputs are basically valid.
I guess going with a parameter count that matches existing models makes it easier to compare benchmarks. Perhaps there is another particular reason like required memory, but momentum is probably also significant.
I have an 8GB card and am considering two more 8GB cards, or should I get a single 16GB? The 8GB card was donated, and we need some pipelining... I have 10~15 2GB Quadro cards, which are apparently useless.
1. https://github.com/InternLM/InternLM-techreport