InternLM2 (arxiv.org)
136 points by milliondreams 6 months ago | 24 comments



We really need better long context benchmarks than needle-in-a-haystack. There is LV-Eval (https://arxiv.org/abs/2402.05136) with multi-hop QA that's better but still pretty basic.
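The needle-in-a-haystack setup being criticized here is easy to sketch: plant a "needle" fact at varying depths inside filler text of varying lengths, then check whether the model can retrieve it. A minimal sketch (the model call is stubbed out with a string search; a real harness would call the LLM under test):

```python
def build_haystack(filler: str, needle: str, total_chars: int, depth: float) -> str:
    """Repeat filler text to ~total_chars and bury the needle at a
    relative depth (0.0 = start of context, 1.0 = end)."""
    body = (filler * (total_chars // len(filler) + 1))[:total_chars]
    pos = int(len(body) * depth)
    return body[:pos] + " " + needle + " " + body[pos:]

def score_retrieval(answer: str, secret: str) -> bool:
    """Pass/fail: did the model's answer contain the planted secret?"""
    return secret in answer

# Sweep context length x needle depth, as in the standard benchmark grid.
secret = "7431"
needle = f"The magic number is {secret}."
for total in (1_000, 10_000):
    for depth in (0.0, 0.5, 1.0):
        prompt = build_haystack("The sky was grey over the harbor. ", needle, total, depth)
        # answer = llm(prompt + "\nWhat is the magic number?")  # real model call
        answer = secret if needle in prompt else ""  # stub standing in for the model
        assert score_retrieval(answer, secret)
```

The criticism in the comments above is that this only tests exact lookup, which is why multi-hop variants like LV-Eval are a step up: they require combining facts from several positions rather than copying one span.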



Yes, I don't understand why they use a search benchmark for these... it would be much better to give it a story up to the context length (from a book? how do you find one that's not in the training data?) and have it write a new chapter/ending that is consistent with all prior text and introduces zero inconsistencies.

But how can you automatically evaluate whether it did this?


Because that's how people use LLMs. You go to ChatGPT to ask a question and get an answer, rather than searching on Google, revising your search because you didn't know a term, and then looking at 3-5 different links to find the answer to what you were searching for.


It seems you might be mixing up different types of "context" in LLM benchmarking. In this case, it refers to the input text directly provided to the model during evaluation by the user (as in in-context learning). This is separate from the text an LLM is trained on or can access via RAG methods.
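The distinction can be made concrete with two toy prompt builders (hypothetical helper names; the retriever here is a deliberately naive word-overlap ranker, not any particular RAG library):

```python
# In-context: the full document goes straight into the prompt, so the
# model's context window must hold all of it. This is what long-context
# benchmarks measure.
def in_context_prompt(document: str, question: str) -> str:
    return f"{document}\n\nQuestion: {question}"

# RAG: only retrieved chunks reach the model, so the prompt stays short
# regardless of corpus size. The context window is not being stress-tested.
def rag_prompt(chunks: list[str], question: str, top_k: int = 2) -> str:
    # toy retriever: rank chunks by word overlap with the question
    q = set(question.lower().split())
    ranked = sorted(chunks, key=lambda c: len(q & set(c.lower().split())), reverse=True)
    return "\n".join(ranked[:top_k]) + f"\n\nQuestion: {question}"
```

The point of the comment above: long-context evaluation exercises the first pattern, which is independent of both the training corpus and any retrieval layer.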


That's how people use LLMs because they (the LLMs) don't seem to be good at the more sophisticated thing.


But also because search engines have gotten even worse at answering questions.


Pretty amazing to see training data being discussed more openly


Indeed. I think part of the reason they are not discussed openly may be that much of the data used is copyrighted, which introduces some legal ambiguity.


IANAL, but hiding something doesn't make someone legally immune. Any company could sue the LLM companies, and they couldn't hide it during the case. There is already a similar case against OpenAI.


Yes, but it at the very least delays any findings while you rake in the cash and try to create a favorable environment. OpenAI has even stated that they think using copyrighted texts is necessary and should be covered by fair use.


TL;DR:
1. InternLM2 is an open-source Large Language Model that has shown improvements over previous models, particularly in long-context modeling.
2. The model uses a unique approach, combining traditional training with Supervised Fine-Tuning and Conditional Online Reinforcement Learning from Human Feedback.
3. It offers a variety of model sizes and training stages to the community, demonstrating significant advancements in AI research and application.


Excited to see how it will perform on the lmsys leaderboard


Is it normal for papers to have that many authors?


It's not abnormal in many fields. A lot of biology or physics papers have more authors than that.

In academic labs, you might be a postgrad working the overnight shift watching some petri dish to make sure the bacteria don't die, etc. It's super boring grunt work, but you do it to get on the paper's author list.


Only in fields like foundation models and high-energy physics, where immense resources are required. Look at GPT-4's credits: https://openai.com/contributions/gpt-4


Particle physics papers usually have more pages for the names than for the work.


Anyone who even sniffed it can get in on the action.


Does anyone know how the free commercial license works? Do they usually grant it? There's an application form at https://wj.qq.com/s2/12727483/5dba/.

Apache 2 code, free commercial license with application form for weights.


The name suggests this is interns posing as a chatbot, especially considering today’s date.


I experimented with this model and vLLM around a month ago. The long context length is attractive, but it was incredibly slow on a g5.12xlarge (4 NVIDIA A10G GPUs). I actually couldn't get it to respond to single examples longer than 50K tokens.



How good is the base (non-instruction-tuned) model? Everyone is trying to make chat bots, but for my use cases, I find base models more suitable.


Interesting. What are some of those use cases?




