Replit's new Code LLM: Open Source, 77% smaller than Codex, trained in 1 week (latent.space)
891 points by swyx on May 3, 2023 | 220 comments



Some links:

- Repo: https://github.com/replit/ReplitLM/tree/main/replit-code-v1-...

- HuggingFace: https://huggingface.co/replit/replit-code-v1-3b

- Demo: https://huggingface.co/spaces/replit/replit-code-v1-3b-demo

- Early benchmark results: https://twitter.com/amasad/status/1651019556423598081

A lot about this project was surprising. We knew it was going to be good, but didn't expect it to be this good. Especially surprising was the finetuned performance boost, and the fact that the model is decent at language tasks and reasoning (in some cases much better than much larger general-purpose models).

It feels like there is a lot more to do with this model, and I have a suspicion you can even make a half-decent chatbot (at least one focused on code) by finetuning it on conversation (and/or instruction) datasets.

Will follow up with a more comprehensive technical report and the UL2R version (fill-in-the-middle support).


First - thank you for open sourcing this! It's a real gift to the community to have a model intended for "commercial use" that's actually licensed as such.

I'd be very interested to hear about the choice/evaluation of the ALiBi approach for positional embedding (perhaps in the technical report).

My intuition suggests that while this allows for better generalizability for longer sequence lengths, it penalizes scenarios where an LLM might need to check for things like a function signature far away from where the next token is generated. My initial testing of this model tracks with this intuition but that's by no means a rigorous evaluation.


(I wrote ALiBi) You can read the paper here https://arxiv.org/abs/2108.12409

While intuitively it does seem like ALiBi would make it hard for the model to attend to things that are far away, in many scenarios we've tested with different models trained on different datasets, ALiBi always performs better than sinusoidal, rotary, and other embedding types, even when we're not using it to extrapolate to longer sequence lengths.

These findings have been confirmed by others, including by the BLOOM open source LM project.
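
To make the mechanism concrete, here is a minimal sketch of the idea (my own illustration in PyTorch, not code from the paper or from Replit): each head gets a fixed slope, and attention scores are penalized linearly with query-key distance before the softmax.

    # Illustrative ALiBi-style bias; function names and shapes are my own choices.
    import torch

    def alibi_slopes(n_heads: int) -> torch.Tensor:
        # Geometric sequence of per-head slopes, e.g. 1/2, 1/4, ..., 1/256 for 8 heads.
        start = 2 ** (-8.0 / n_heads)
        return torch.tensor([start ** (i + 1) for i in range(n_heads)])

    def alibi_bias(n_heads: int, seq_len: int) -> torch.Tensor:
        pos = torch.arange(seq_len)
        # distance[i, j] = how far key j lies behind query i (>= 0 under a causal mask)
        distance = (pos[:, None] - pos[None, :]).clamp(min=0)
        return -alibi_slopes(n_heads)[:, None, None] * distance  # (heads, seq, seq)

    # The bias is simply added to the raw attention scores before the softmax:
    # scores = q @ k.transpose(-2, -1) / head_dim**0.5 + alibi_bias(n_heads, seq_len)

Because the slopes are spread over a wide range, the most heavily penalized heads behave almost locally while the smallest-slope heads can still attend far back, which is the per-head specialization discussed below.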


Small world!

Thanks for the link (which I've now skimmed beyond the abstract). What wasn't obvious to me from the abstract is that different attention heads have different penalty strengths, so if some prediction task requires long range dependencies you might expect one of the less-penalized heads to end up specializing. I wonder what would happen if the penalty for one head is zero? (The paper suggests this might've been tried and just made things worse, but unclear)

I must admit that this is a wonderfully elegant (and interpretable) way to do this... much more intuitive (to me at least, a wannabe practitioner) than all of the trig-based embeddings.


> so if some prediction task requires long range dependencies you might expect one of the less-penalized heads to end up specializing

Exactly. You have heads that focus on content nearby and ones that focus on stuff that is far away.

> I wonder what would happen if the penalty for one head is zero? (The paper suggests this might've been tried and just made things worse, but unclear)

Yup, this is something we tried. Making the penalty for one of the heads zero doesn't improve or degrade performance.

>I must admit that this is a wonderfully elegant (and interpretable) way to do this... much more intuitive (to me at least, a wannabe practitioner) than all of the trig-based embeddings.

Thanks so much!!


Impressive model, thank you for releasing it under a business-friendly license!

Have you considered using Google's sparse "scaling transformer" architecture as the base? Even at 3B scale it can generate 3-4x more tokens per FLOP while being competitive at perplexity with a dense transformer. I think OpenAI uses a variant of it in their ChatGPT-3.5-Turbo product.

Here is the paper https://arxiv.org/abs/2111.12763 and the implementation https://github.com/google/trax/blob/master/trax/models/resea... if you are interested.

Hope you get to look into this!


Thank you for releasing the weights along with the announcement. Too many posts make great headlines, but then it's "weights are on their way!"

Like, why did we even get excited? This, though? Great work.


> I think OpenAI uses a variant of it in their ChatGPT-3.5-Turbo product.

is that a guess or is there a source? im curious to read more


It is a guess, informed by some familiarity with the literature and by going over the papers authored by researchers credited on OpenAI's "GPT-4 contributors" web page.

I have an expanded list of foundational research that is likely to serve as the basis for GPT-4 on my blog: https://kir-gadjello.github.io/posts/gpt4-some-technical-hyp...

Hope it helps!


Interesting resource. I had been wondering whether anyone had tried to compile such a list.


thank you! glad i asked


I don't think it's a business friendly license?


It allows for modifications and commercial use: https://creativecommons.org/licenses/by-sa/4.0/

>You are free to:

>Share — copy and redistribute the material in any medium or format

>Adapt — remix, transform, and build upon the material

>for any purpose, even commercially.

Compare this to the latest release from StabilityAI lab DeepFloyd, "IF", which in addition to various restrictive clauses strictly prohibits commercial use: https://github.com/deep-floyd/IF/blob/develop/LICENSE-MODEL

Repl.it's release is as open as it gets these days, in my book.


It's a copyleft license; and lots of folks on HN seem to think that copyleft, while being open, isn't business friendly.


Wow! I sincerely wonder how all those folks manage to do business in the tech industry without ever touching Linux, Git, Bash, GCC, glibc, WordPress, Ansible, Grafana, MongoDB, 7-Zip, Vim, Emacs, Firefox, Thunderbird, StackOverflow, Wikipedia, most web fonts, most ad blockers, and all the rest!


What does "fine tuning" mean in this context? Does it mean you fine-tuned it on a specific code repository, or collection of code repositories and then had it do work in those repositories?


Broadly, finetuning is any training done after pretraining. Most of the time it is used in the context of fitting a narrower task. In our case, it was the same training objective as the pretraining, but meant to be more representative of what Replit users like to code. However, we were surprised by how much it boosted overall performance. Best guess: it's a) novel data and b) the model could take even more training!!


How feasible and effective would it be to fine-tune a model against an organization's private source code, resulting in an "internal" model that knows how to work with that org's stuff?

Could you, say, fine-tune the model every week with the latest merges? Every hour?


Finetuning is a relatively quick process. Training the base model is the expensive part (it can take weeks and huge amounts of compute), whereas finetuning usually touches only the last few layers and can be done with far fewer resources. You could definitely have a "nightly" finetuned model that is retrained every day or so.
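
As a rough illustration of that kind of lightweight finetune (a sketch only: the layer-name patterns, hyperparameters, and training loop are placeholder assumptions, not Replit's recipe):

    # Hedged sketch: freeze most of a pretrained code model and finetune only the
    # last couple of blocks on your own code, with a small learning rate.
    import torch
    from transformers import AutoModelForCausalLM

    model_id = "replit/replit-code-v1-3b"
    model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)

    for param in model.parameters():
        param.requires_grad = False
    for name, param in model.named_parameters():
        # Assumption: adjust these substrings to the model's real parameter names.
        if "blocks.23" in name or "blocks.22" in name or "norm_f" in name:
            param.requires_grad = True

    optimizer = torch.optim.AdamW(
        [p for p in model.parameters() if p.requires_grad], lr=1e-5)

    # Standard causal-LM update over batches of your org's code,
    # tokenized with the model's own tokenizer:
    # for batch in dataloader:
    #     loss = model(**batch, labels=batch["input_ids"]).loss
    #     loss.backward(); optimizer.step(); optimizer.zero_grad()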


Interesting - how would that work for a company that wanted to run their own codex model, on-prem, trained on their own code? Perhaps also trained on their dependencies?


Finetuning a smaller model leading to better performance seems like a significant finding that'll lead to a lot of companies fine-tuning their own internal "ChatGPT"s


You seem to know your stuff some, so I'll ask you a question on this: Are there any good books on all the different approaches in this space, or is it all too new and fast moving for such a thing?


There are no books on Large LMs but almost any resource about neural networks covers fine tuning. I like the FastAI courses, and these do cover language models.


You can also check the NLP with transformers book


When you fine-tune it, do you train just the head/last few layers or do you also unfreeze the model afterwards and retrain the whole model with a very small LR for a few epochs?


You can take a network and its weights that someone else trained, and use that pretrained network to train on your own data, which is likely to be a better starting point than random weights.


How is this code licensed? I didn't see a license in the repo. It looks interesting!


The README indicates:

The base model checkpoint is licensed under the Creative Commons license (CC BY-SA-4.0). Under the license, you must give credit to Replit, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests that Replit endorses you or your use.


Doesn't the Stack contain HumanEval? So you're basically comparing numbers on the pretraining data.


Can't find it now, but I'm pretty sure BigCode said somewhere they explicitly looked for it and removed it. Also, the subjective measure does match up with the benchmark: our finetuned model scored about 50% higher on HumanEval, and when using it, it felt at least that much improved.


You can view the prompts, solutions, and checks here[0]. See my sibling comment (to yours) where I quote the HumanEval paper and do some more analysis. But I think if you look at [0] you'll see that these aren't really unique problems and are likely to have large repetitions in the dataset. I should add to that comment to include the dataset[1] (too late to edit), where they mention that they just scrape all of GitHub (Jan 1 2015 - Mar 31 2022). They do exact and near de-duplication, but near de-duplication is messy.

> We implement near-deduplication in our pre-processing pipeline on top of exact deduplication. We first split the files into words/tokens based on non-alphanumeric characters and remove files with fewer than 10 tokens. Next, we compute the MinHash with 256 permutations of all documents, and use Locality Sensitive Hashing to find clusters of duplicates. We further reduce these clusters by ensuring that each file in the original cluster is similar to at least one other file in the reduced cluster. We consider two files similar when their Jaccard similarity exceeds 0.85.

Near-duplicates are still difficult to measure. So we should expect duplication, and it should be proportional to the number of samples we have (even if the same variance, but I'd wager higher variance with larger duplications).

[0] https://github.com/openai/code-align-evals-data/tree/97446d9...

[1] https://arxiv.org/abs/2211.15533
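
To make the quoted near-deduplication recipe concrete, here is a rough sketch of MinHash + LSH dedup (a sketch using the `datasketch` library; the non-alphanumeric tokenization, 256 permutations, 10-token floor, and 0.85 threshold mirror the quoted description, but this is not BigCode's actual pipeline):

    import re
    from datasketch import MinHash, MinHashLSH

    def tokens_of(text: str):
        # Split on non-alphanumeric characters, as in the quoted pipeline.
        return [t for t in re.split(r"[^0-9A-Za-z]+", text) if t]

    def minhash_of(text: str, num_perm: int = 256) -> MinHash:
        m = MinHash(num_perm=num_perm)
        for tok in set(tokens_of(text)):
            m.update(tok.encode("utf-8"))
        return m

    def near_duplicate_candidates(files: dict, threshold: float = 0.85):
        lsh = MinHashLSH(threshold=threshold, num_perm=256)
        sketches = {}
        for path, text in files.items():
            if len(tokens_of(text)) < 10:
                continue  # drop tiny files
            sketches[path] = minhash_of(text)
            lsh.insert(path, sketches[path])
        # Approximate clusters: each file mapped to files with Jaccard >= ~0.85.
        return {path: lsh.query(mh) for path, mh in sketches.items()}

Even with something like this in place, files that share most of their tokens but differ in a handful of identifiers can land on either side of the threshold, which is the messiness being pointed out above.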


My favorite line from the HumanEval paper[0]

> It is important for these tasks to be hand-written, since our models are trained on a large fraction of GitHub, which already contains solutions to problems from a variety of sources.

So to answer your question, yes, the evaluation dataset is spoiled. You can find such unique and never before seen docstrings like

> For a given list of input numbers calculate the Mean Absolute Deviation around the mean of this dataset. Mean Absolute Deviation is the absolute difference between each element and a centerpoint (mean in this case)[1]

And here's a repo I found that is 8 years old[2]. But how about a more recent one that is even closer?[3] There are plenty more examples[4] (does anyone know how to actually limit the date to prior to 2021? `pushed:<2021` doesn't work, nor does using the `created` keyword. Date searching doesn't seem to work well).
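
For reference, the natural solution to that docstring is the kind of two-liner that shows up all over public GitHub (written out here just for illustration):

    from typing import List

    def mean_absolute_deviation(numbers: List[float]) -> float:
        mean = sum(numbers) / len(numbers)
        return sum(abs(x - mean) for x in numbers) / len(numbers)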

In essence, we can still use this evaluation method to determine how good our model is at doing fuzzy searching. Which, mind you, is still a useful thing. But I would be careful in concluding that this means the model is good at generalizing arbitrary descriptions of code or novel pieces of code. That said, one may be able to argue that not many lines of code are actually that novel. Still, we need to be careful about our conclusions and understand the limitations of our metrics (something I am currently deeply troubled by)

[0] https://arxiv.org/abs/2107.03374

[1] https://github.com/openai/code-align-evals-data/blob/97446d9...

[2] https://github.com/bertomartin/stat4701/blob/ec2b64f629cbbf6...

[3] https://github.com/danielwatson6/hate-speech-project/blob/64...

[4] https://github.com/search?q=abs%28x+-+mean%29+for+language%3...


(follow-up: Figured this should be a different comment)

I wanted to demonstrate what I said above, so I came up with some examples of things I think a human would have an easy time implementing but a model might find hard. BUT a key part is that I expect these to be in the dataset! I just don't expect them to be in hundreds or thousands of GitHub repos, because they will be uncommon (but not rare). Also, we'll pretty much ask for few-liners to give the model the biggest advantage we can (errors will compound).

Prompt:

    from torch import nn

    class LipSwish(nn.Module):
        """
        The Swish activation function is defined by a gated linear unit,
        where the gate is defined by a sigmoid function and multiplies the input with
        a learnable parameter, beta. Beta is initialized as 0.5.
        The Lipswish function normalizes the output by the upper bound of 1.1.
        """

        def __init__(self):
            super().__init__()

Result: Mostly correct but missing the division by 1.1. The forward is `return x * F.sigmoid(self.beta * x)`, which is Swish (it also assumes we had "import torch" and applied type hinting). It did properly set the parameter (this is just a 3 liner)

Discussion: The Swish function should be in the dataset and is a well known activation function (though beta is not in the pytorch version). Despite LipSwish being in the dataset (introduced in 2019 from Residual Flows[0]) it is not common. I could get the code to generate the swish function (initializing beta, and performing the gate) but could not get the code to divide the output by 1.1. I would not expect a human to have difficulties with this.

Okay, so let's try something else that might be a bit more common and older. The same paper uses a concatenated activation function, and those aren't "uncommon". CReLU was introduced in 2016[1] and there have been plenty of concatenated activations around since then. The PyTorch documentation even uses it as an example[2]. There are far more examples of CReLU (3k Python results for "class CReLU" vs 58 for "class LipSwish"; use these numbers as weak hints because search sucks and isn't always accurate).

Prompt:

    from torch import nn
    from torch.nn import functional as F

    class CReLU(nn.Module):
        """
        Concatenated version of ReLU. The activation is applied to both the positive and
        negative of our input and the result is concatenated.
        """

        def __init__(self):
            super().__init__()

        def forward(self, x):

Result: `return torch.cat([x.clamp(min=0), -x.clamp(min=0)], 1)`. This is correct but not the expected one-liner result.

Discussion: This was a bit surprising; it didn't use functional as we might expect (or hinted). But interestingly it will if we change the class name to "ConcatenatedReLU". I found exact copies on GitHub with the full name (memorization), but the first page of instances for CReLU I found used functional (I did find one that was exactly the above code, when adding "clamp" to the search, but missing the minus sign. There were plenty of errors in CReLU implementations). Interesting side note: CReLU continues and defines a function CReLU6 which uses the same docstring but clamps with a max of 6 on the positive input, whereas ConcatenatedReLU starts to define a convolutional block (Conv + BatchNorm + ReLU) called Conv2d.

So we have kinda mixed results, and in both cases the outputs are rather odd and probably not what we wanted. We can clearly see that there are issues where a human would not have had much trouble. There's a big tension in these types of problems: we need to memorize a lot of information (otherwise we can't even write code or know library calls), but too much memorization prevents creativity. There is a lot of gray area between the _pure_ "Stochastic Parrot"/"fancy copy machine" and a generalized intelligence (with a broad and flexible definition of intelligence). I'd still call them stochastic parrots, because to me the evidence suggests that we're closer to the memorization side than the creation side.

But that doesn't mean these frameworks aren't useful. We all know a lot of code is boilerplate (otherwise we wouldn't have the joke "copy paste from SO"), and these tools can be very useful for that. I think the utility is going to depend heavily on what you are coding and how you code, though. If you're doing standard stuff, this probably has high utility for you and can save you a lot of time, the same way writing macros does, but this is FAR more powerful. It can also help novices a lot. Also, if your main errors are reading mistakes (e.g. you're dyslexic) -- this is my largest problem -- then this might make things difficult, as you have a tendency to gloss over text and miss minor errors. I also don't think these tools would help much if you're a researcher or writing optimized or specialized code.

These differences are probably why we see such different reactions to these tools. But who raves and who rants about them may also be a hint into what people do and how they work.

[0] https://arxiv.org/abs/1906.02735

[1] https://arxiv.org/abs/1603.05201

[2] https://pytorch.org/docs/stable/generated/torch.nn.ReLU.html

Edit: We can also check if code is in the stack[3]. We see that [0] is indeed in the dataset so we know there is information leakage. Interestingly the exact copy I found in the previous comment[4] isn't! (The repo, though the user is)

[3] https://huggingface.co/spaces/bigcode/in-the-stack

[4] https://github.com/bertomartin/stat4701/blob/ec2b64f629cbbf6...


Hi there, I have two questions:

1 - Why did you choose Markdown? It seems an odd choice for training a model like this.

2 - Have you tried to train only one single PL and then benchmark it against this more general version?


1- We trained on languages that are most popular on Replit. Markdown is important because you need some amount of natural language in the data, and it will act as a sort of "natural language label" for code.

2- I like how portable it is, being a single small model doing a lot of languages. Single-language code models are an approach that models like Salesforce/CodeGen took, but I believe we beat (or get very close to) their mono models on benchmarks.


Have you thought of finding or creating something like this [0]?

I created this as the basis for my origami folding descriptive language. I tried to find something similar, requirements being both well structured and English-like but couldn't find any, so I created it.

The origami folding app will hopefully be out in 2 weeks, so you can see how it's used.

[0] https://github.com/fuzzthink/mation-spec


They trained on https://huggingface.co/datasets/bigcode/the-stack-dedup which is a massive curated dataset accumulated from GitHub. Details are here: https://www.bigcode-project.org/docs/about/the-stack/

Many of the most-represented "languages" on GitHub are actually things like JSON, XML, HTML, CSV, text, markdown, YAML, and SVG.

More details from them here: https://blog.replit.com/llm-training


Did any interns help in developing this? If so are you planning on intimidating them as usual? :)

Reference: How Replit used legal threats to kill my open-source project https://intuitiveexplanations.com/tech/replit/


Wow. That's extremely poor behaviour if the account is accurate.



Very exciting, thanks for sharing all this


The model is way too small; comparing it to Codex feels disingenuous. Sure, it's 77% smaller, but it's also 77% worse. Although it's a cool project nonetheless.

For instance, even this simple snippet generates wrong inline completions:

   // Only return even numbers bigger than 10 from the array
   const arrayFilter = (array) =>
Replit-code-v1:

   // Only return even numbers bigger than 10 from the array
   const arrayFilter = (array) => {
     return array.filter((item) => item > 10);
   };
Gets it wrong, returns odd numbers.

Codeium:

   // Only return even numbers bigger than 10 from the array
   const arrayFilter = (array) => {
     return array.filter((num) => num > 10 && num % 2 === 0);
   };
ChatGPT (GPT-3.5 Turbo) - Code-only, without the rest of the completion since it's instruction-tuned:

   const arrayFilter = (array) => {
     return array.filter(num => num % 2 === 0 && num > 10);
   }
Not comparable at all. For reference if anyone wants to test I ran this through the HuggingFace space using the default parameters, ChatGPT through chat.openai.com, and Codeium through the VSCodium extension on an empty JavaScript file.


Interesting. This seems like a weakness of natural language understanding. If you rephrase your prompt slightly it would get it right. Try:

  // return even numbers that are also more than 10
  const arrayFilter = (array) =>
It would do the right thing. The fine-tuned version gets your prompt right so maybe it benefited from natural language data. Will look more into it.


That's really interesting, indeed I can reproduce this by changing the comment. I also managed to get correct output for this sample by renaming the function.


clearly your original comment was unfair.


Is it, though? The major selling point of coding LLMs is that you can use natural language to describe what you want. If minor changes to wording - the ones that would not make any difference with a human - can result in drastically worse results, that feels problematic for real-world scenarios.


The model is small, so it has weaker semantics.


I get that. But they are explicitly comparing it to Codex themselves.


The criticism stands if you have to continue to rewrite your "prompt" until you can coax out the correct desired output.


I agree. Maybe it interpreted it as return the numbers that are more than 10 in the given array of even numbers.

For example, if the instruction says "return person objects that are at least 20 years old", it might be more reasonable to generate:

array.filter(item => item.age >= 20)

as opposed to

array.filter(item => (item instanceof Person) && (item.age >= 20))


It seems like every week someone comes out with some version of "we can get results similar to OpenAI's API with our model that you can run on a Commodore 64!"

And then you dig in, and it's always far behind in some important way.

Not hating here, I love the pace of iteration, just not the hyperbole.


>"we can get results similar to OpenAI's API with our model that you can run on a Commodore 64!"

I have felt similar frustrations with statements that feel disingenuous too. Thanks for articulating this with such a beautifully hilarious metaphor.


I need more time to compare it, the short 128 tokens in the demo is a bit rough but -

On first look this seems to blow the current llama based models out of the water including the 30B ones.

Pasting what you want + url + example json with no other context and it "knows" what the url and the json is for, without even telling it.

I'm not even saying it's as good as chatGPT, but this is a tenth the size of the best llama models I've seen.


Yeah I tried the demo, it wrote some wrong code with comments in Chinese. I think I'll pass.

It's a pretty well accepted fact now that bigger LLM = moar better without exceptions. I'm not sure why there's a race to the bottom of who'll make the most useless model that can run everywhere.


> It's a pretty well accepted fact now that bigger LLM = moar better without exceptions.

That's not true, the amount of training is a MAJOR factor.

See the Chinchilla paper - https://arxiv.org/abs/2203.15556

tl;dr - a "fully" trained small model can outperform a "undertrained" larger model. If you have a fixed amount of compute (budget), then you need to optimize for the largest model that you can fully train, and not simply up the parameter count.

EDIT: Also you can't necessarily compare the parameter count across model architectures*

This thing seems to outperform the finetuned 30B llama models I've seen.
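
As a back-of-envelope illustration of that scaling-law arithmetic (using the common C ~ 6*N*D approximation for training FLOPs and the rough Chinchilla heuristic of ~20 tokens per parameter; both are approximations, not figures from any particular paper's table):

    # Rough "compute optimal" estimate for a model of this size.
    N = 2.7e9        # parameters
    D = 20 * N       # ~5.4e10 tokens at ~20 tokens per parameter
    C = 6 * N * D    # ~8.7e20 training FLOPs
    print(f"tokens: {D:.2e}, FLOPs: {C:.2e}")

Training on many more tokens than that trades compute efficiency for a smaller model that is cheaper to serve, which is a different trade-off than the strictly compute-optimal one.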


Well if you're set on training something for a specific budget then it does of course make sense to pick the most optimal model size, true.

But the problem is that these models don't exist in a vacuum, and have to go against slightly larger ones that are also compute optimal and use more data, which will definitely perform better.

Then again maybe there is a sweet spot for a model that's small enough to run effortlessly on regular machines while only serving as a control node in an autoGPT style setting, where it fetches the context it can't possibly have from a curated online database to make up for its shortcomings.


> But the problem is that these models don't exist in a vacuum, and have to go against slightly larger ones that are also compute optimal and use more data, which will definitely perform better.

They don't have to go against those, though. Most of these models are research models, either from academia or from companies experimenting to see what works. From my understanding, most of these are a "We have X amount of USD for the next month or so; we'll try a few things, then whatever our best bet is, we'll stick the time out on that".

Very few companies have the resources to train big models with as much compute as Google/OpenAI/Microsoft/Facebook.

These are also not being monetized as they're open source.

Going from their 2.7B model to 10B would be ~10X the compute (FLOPS) required for an optimal model. And this is likely their first open model and not their last; since Replit likely doesn't have the budget that OpenAI does, it makes sense they didn't want to blow their entire year's budget on their first open model.

2.7B would also be really nice if anyone can get it working, because it's more likely to be able to run in the IDE at that point instead of needing a massively scaled cloud (which might be valuable for Replit).


> Sure it's 77% smaller, it's also 77% worse.

Hehe, yeah, imagine saying you made a new programming language with 77% less lines of code than Python.


Finally, an opportunity to share this https://nsl.com/papers/denial.html


I’m curious about the downvotes because I thought I was just agreeing with OP. Obviously lines of code in a programming language repo is no correlate at all to quality. It’s like the old adage about measuring aircraft quality by weight.


That’s an inverse correlate, not no correlate


No, a PL could have millions of lines of code and still not be very good. Consider any enterprise language that no one likes :)


I didn't get the punchline of this, so I asked GPT-4 to explain the punchline. Actually quite amusing.


From the context it presumably finds the maximum of an array in K. Also quite a nice demonstration of why K is a bad language. For writing maintainable software at least; for code golf it's clearly amazing! (And maybe interactive calculator style usage?)


hi HN! back again with an exclusive deep dive with Replit’s head of AI. I attended their developer day last week (https://twitter.com/swyx/status/1650989632413401089) just expecting a regular fundraise announcement and was totally shocked when they announced their own LLM and also said they would open source it. so I immediately asked them for a podcast interview and this is the result.

my favorite learning is how they are pushing the state of the art - openai’s HumanEval is the industry standard benchmark for code LLMs, but Reza kindly went above and beyond to show how they use “AmjadEval” - using coder intuition to capture human preference on what output is more helpful to coders (see screenshots https://twitter.com/swyx/status/1653791019421569024?s=20)

please AMA!


Sorry, I have to ask this: how does this compare to ChatGPT?


It's not crucial that it beat ChatGPT this year. That's a pretty unattainable goal for a group like Replit. From the user's POV, none of the current copilots compare favorably against ChatGPT, even Microsoft's OpenAI-powered GitHub Copilot.

What's important is that they're preparing for the future by building all the tooling/UI/UX around coding copilots. This way, when costs and feasibility of building ChatGPT-quality LLM's drop and multiple open-source models are available, Replit has the ability to immediately drop them into their production environment. They'll also have the skills and systems to finetune any new models and wring extra performance out of them.

This is more important to users than it seems at first, because the current UX of things like GitHub Copilot doesn't allow me to use their AI against my codebase the way that I want to (the way I use ChatGPT). Right now GitHub Copilot is a glorified auto-complete, but I want it to do widespread scaffolding, refactoring, and analysis across my whole codebase. Microsoft has access to LLM's that can do this through their control of OpenAI -- but Microsoft lacks the tooling/UI/UX to bring the power of ChatGPT to me as a user of VSCode/IntelliJ/PyCharm/Visual Studio.

So if Replit can find more innovative, boundary-pushing ways of integrating LLM's, they won't necessarily need the highest quality LLM's to produce a superior user experience. It's a strong signal that Replit is well-positioned for the future, when ChatGPT-like models are democratized.

Hopefully JetBrains is paying attention. They definitely have time to wait a bit more (1-2 years?), but not a lot of time. JetBrains shouldn't solely rely on Github Copilot plug-in to provide their users with LLM's, because it's not clear that the user experience of that plug-in will stay competitive with the user experience that GitHub Copilot will offer directly in VSCode. The IntelliJ/PyCharm plugin may remain "just a fancy auto-complete" while VSCode gets more interactive workflows.

Future IDE's with LLM integration require novel, smart, clever UX typically invented only by very creative people.

It's also worth noting that Replit is not just trying to be an IDE -- they're also building a marketplace to buy/sell coding work, and establishing a small foothold as a niche cloud computing provider.


I keep saying that it's obvious that local execution is the future of LLMs. Remote execution makes a ton of sense for databases, and most web apps are on some level just CRUD over a remote DB, so we've all gotten used to the idea that in the 21st century a software business should be running remote servers… But LLMs don't need to run remotely, and they don't especially benefit from running remotely either (okay, more training data, but you can batch that and send it back asynchronously). The future is local.


The future is using the best possible tool to drive your work. Won’t local models be systematically inferior to bigger commercial offerings for the next few years at least?


"The future is using the best possible tool to drive your work"

Not if that tool is censored, and you need an uncensored version to do your work. Or maybe you have privacy considerations, or your company policies forbid using something hosted remotely or owned by another company, etc...


cries in HIPAA


Maybe. I wonder if very narrow, multi-model systems might eventually deliver better performance and utility than monolithic models like GPT. Rather than pay for access to that, you might be better off investing in resources that can train and learn on exactly what you're doing, rather than something general that is good at a lot of things but not incredible at your specific task.


http://www.incompleteideas.net/IncIdeas/BitterLesson.html

Generally we have continued finding that the more "other"/general stuff an AI model is trained on, the better it performs on specific tasks. As in, an AI model trained to identify photos of all animals will perform better than an AI model that is only trained to identify breeds of dogs. Even at identifying breeds of dogs.

Taken to the extreme, we've found that training image models with "multi-modal" LLM capabilities improves their ability to identify dogs/etc. A lot of people don't realize that GPT-4 is actually multi-modal...while OpenAI has only allowed API access to use text input, the model itself can also accept image input.

Note that we've moved on from ImageNet-style tests ("Choose the most appropriate label for this image from 200 possible labels") to much more advanced "Reasoning" tests[0]. PaLI[1] is potentially the SoTA here, but BeIT-3[2] may be a better example for my thesis. Notice that BeIT-3 is trained not just on images but also like an LLM. Yet it outperforms purely image-trained models on pure-image tasks like Object Detection and Semantic Segmentation.

More importantly, it can understand human questioning like "What type of flowers are in the blue buckets of this image?" and respond intelligently.

0: https://paperswithcode.com/area/reasoning

1: https://arxiv.org/pdf/2209.06794v2.pdf

2: https://paperswithcode.com/paper/image-as-a-foreign-language...



Very interesting. I’d hoped it was going to go a different direction, but the evidence clearly suggests that what you linked here is correct so far.


For shops that want to ensure their own codebase stays local, definitely no.


I think that we'll reach "good enough" - and that the commercial offerings won't have much tangible benefit for at least simply being "fancy autocomplete".

Currently you don't really use LLMs for designing the structure, just completing the implementation, and I think that will be very doable locally.


Local models can access anything on your filesystem without sending it over the network. Easy to imagine certain tasks that would have better performance.


Whether it beats ChatGPT right now is important to me, right now.

I'm very excited about everyone doing work even when they're not beating ChatGPT right now, of course.

But how it compares to ChatGPT right now is extremely relevant to lots of people.

It's also become very common to vaguely reference OpenAI's offerings when announcing new models without saying how they actually compare, or only mentioning some small way in which it compares favorably.

(Though it seems to often be that some comment from the article comparing to OpenAI gets promoted to the title when posted on HN, like here.)


I think this is somewhat of a naive way to look at this. Yes, ChatGPT is really good, but they're basically completely closed source. A lot of the power of LLMs can and will come from open sourced models that anyone can dig into the weights and tune it for their use case, as well as train and run on their own platform.


What does this mean for the future of editors like emacs and (neo)vim? Right now the Copilot plugin for Neovim works pretty much the same as the one for VSCode, but as LLMs get integrated more into IDEs and new workflows are built around them, will the old-school editors be able to keep up? I'm a little worried because I just switched from VSCode to Neovim a few months ago!


This could be the dawn of a new day for the old-school editors. Not to start any wars here, but I could never get the hang of Vim, and that's hardly an unusual complaint. But now, free high-quality personalized "tuition" just became economically viable.


Side note, potentially check out vimtutor, or also https://vim-adventures.com/


I'd second the advice to go through vimtutor.

Highly recommended.


Stuff like the Language Server shows that people are interested in making new stuff work well with our old beloved editors. I have faith.


There is a ChatGPT shell for Emacs: https://xenodium.com/chatgpt-shell-available-on-melpa/


It would be great if they built language servers via the Language Server Protocol; that would be editor agnostic.


Github Copilot actually works through the language server protocol already. Document contents are sent to it and it responds with code completions.


Neovim already can’t keep up by itself. The future of vim won’t be as a standalone application, but as a plugin into other IDEs. The support for Neovim and VSCodeVim within VSCode greatly reduces the utility of a standalone app for anything other than edits to very small projects.


vim is a text editor.


I'm a bit surprised that IP and infosec isn't a much bigger part of this discussion.

ChatGPT ought to be a non-starter for many use cases where data cannot be shared with OpenAI or where the copyright situation of the generated output could become too vague.

Having the option of open source models that potentially could be self hosted could make those use cases viable.


ChatGPT now has an option to turn off both historical logging of conversations and use of your interactions to train the model. There's still concern, but for any company which is already using GitHub/BitBucket to host their code or Azure/GCP/AWS for their build/CI/CD servers...it's not like they're hermetically sealed in the first place.

OpenAI probably hasn't gone through all the SOC2/etc/etc/etc/etc audit certification that AWS/GCP/Azure have, but if you're using those, then this decision is just a matter of degree. Plus OpenAI is clearly aware of the concerns and beginning to address them in order to expand their addressable market.

For defense companies, yeah, this is a non-starter. But they often don't even have access to StackOverflow and cell signals are physically, purposefully blocked by the building construction materials. And they only recently even started using cloud computing and use a purpose-built cloud at Azure/GCP/AWS that's specifically walled off for DoD partners.


It says it has those things, but it changes absolutely nothing for corporate use - you're still exfiltrating important information. To say nothing of sending personal information there, that is an even worse idea (because you have less money and are therefore less important to keep in mind for a large corporation).

Saying that they're "working on it" is not useful IMO - at the end of the day, they'll be exactly as unethical as they can get away with. We live in a time where we can comfortably say that that is "very unethical".


> changes absolutely nothing for corporate use - you're still exfiltrating important information.

How is it different from storing all the sourcecode on a private GitHub.com repo?


How are you using ChatGPT with your codebase that makes it superior to copilot?


it doesn't. replit-code-v1-3b is a code LLM, ChatGPT is an app on top of LLMs. it compares to OpenAI Codex, a small version of which is behind GitHub Copilot.


It (replit-code-v1-3b) is already quite good at explaining code:

Input:

    below is a SQL statement:

    SELECT
      CAST(DATE_TRUNC('week', "t1"."TIMESTAMP") AS DATE) AS "WEEK_START",
      COUNT(*) AS "EVENT_COUNT"
    FROM "ANALYTICS"."POSTHOG"."POSTHOG_EVENTS" AS "t1"
    GROUP BY
      "WEEK_START"
    ORDER BY
      "WEEK_START"
    LIMIT 2000

    Explain this SQL. Respond in JSON format with the following keys: 
    TITLE, DESCRIPTION, TABLES
    JSON response:
output:

    {
        "title": "Weekly Events Count",
        "description": "Count of weekly events",
        "tables": [
            {
                "name": "POSTHOG_EVENTS",
                "columns": [
                    "WEEK_START",
                    "EVENT_COUNT"
                ]
            }
        ]
    }


Free ChatGPT is based on code-davinci-002 (GPT-3.5), which is used in OpenAI Codex. See

https://platform.openai.com/docs/model-index-for-researchers

https://help.openai.com/en/articles/6195637-getting-started-...


Sorry, but this is wrong. OpenAI's gpt-3.5-turbo and gpt-4 models (aka ChatGPT) are what it should be compared against. Codex has been deprecated since March, something that has been widely discussed in the LLM world and is prominently noted on the main Codex page: "As of March 2023, the Codex models are now deprecated. Please check out our newer Chat models which are able to do many coding tasks with similar capability"

See: https://platform.openai.com/docs/guides/code


Hard to compare them, actually. The thing about ChatGPT is the chat part: it was trained to interact and respond with human conversation. This is more like Copilot, with code completion based off of actual code.


This was a lot of fun to record, and second episode where I get an eval question wrong, I'm going to be demoted to bot soon lol


means you are human! like the rest of us


we also did an interview with Varun Mohan of Codeium, which is another competing code model trained from complete scratch: https://lspace.swyx.io/p/varun-mohan#details


Replit: I tried to write HMM Forward function code:

```
# Write a function that performs the forward algorithm of an HMM. Ideally it uses numpy vectorization,
# But if you cannot do that it is fine too. Just perform forward inference, given HMM parameters and Observations.
# def HMM_forward() :
    # Write your code here
    # Hint:
    # - Remember that HMMs are stochastic, so you will need to multiply transition matrices by observation probabilities
    # - Remember that HMMs are stochastic, so you will need to multiply emission matrices by observation probabilities
    # - Remember that HMMs are stochastic, so you will need to multiply emission matrices by transition matrices
    # - Remember that HMMs are stochastic, so you will need to multiply emission matrices by transition matrices
    # - Remember that HMMs are stochastic, so
```

It gets stuck in an infinite loop. ChatGPT gets it right, first shot:

```
import numpy as np

def forward_algorithm(A, B, pi, obs):
    """
    Performs the forward algorithm for an HMM.

    Parameters:
        A: numpy.ndarray, shape (N, N)
            Transition matrix of the HMM, where N is the number of states.
        B: numpy.ndarray, shape (N, M)
            Emission matrix of the HMM, where M is the number of possible observations.
        pi: numpy.ndarray, shape (N,)
            Initial probability distribution over states.
        obs: numpy.ndarray, shape (T,)
            Sequence of T observations.

    Returns:
        alpha: numpy.ndarray, shape (T, N)
            Forward probabilities for each state at each time step.
    """
    T = obs.shape[0]
    N = A.shape[0]

    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]

    for t in range(1, T):
        alpha[t] = np.dot(alpha[t-1], A) * B[:, obs[t]]

    return alpha
```

OpenAI managed to do the important but extremely hard thing: they moved out of the DL benchmark frame and made something that is general-purpose useful. Great effort and congrats to the Replit team though; hopefully they can keep iterating on this and reach ChatGPT capabilities someday.


The model is not RLHF'd or instruction-tuned. It's an inline autocomplete model, so it will get confused if you talk to it like you're talking to a person (although it is possible to finetune it that way). To get better full-function completion, try giving it the function definition and a descriptive docstring as a prompt.
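
For example, a prompt shaped roughly like this plays to the model's strengths (an illustrative sketch; the function and argument names here are made up, not from Replit's docs):

    def hmm_forward(A, B, pi, obs):
        """Run the forward algorithm of a hidden Markov model.

        A is the (N, N) transition matrix, B the (N, M) emission matrix,
        pi the (N,) initial state distribution, and obs a length-T sequence of
        observation indices. Returns the (T, N) matrix of forward probabilities.
        """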


> But if you cannot do that it is fine too. Just perform forward inference, given HMM parameters and Observations.

Stuff like this will make your outcomes worse for any model.


Really? My experience with GPT is that the more description I add, the better the results. I presume this is because it has a longer prompt to attend to. I think the whole idea of focusing on keywords/concise sentences is a very “search engine” paradigm, and language models do better the more you describe your question.


Details are fine. But think about how this thing works. It does not think about your request. It comes up with the most probable answer. There is some tuning to imply self reflection, but that’s mostly fake. When you say “Do X, but if you can’t do X, do Y”, you may very well encourage the model to do Y without any qualitative assessment over whether it could actually do X.

Same for questions where you ask “is X good or bad? And why?”. It answers good or bad before it comes up with the reasons. That’s very plausibly ok, but it’s different from how people imagine it works and thinks.


More tools in the field is great! I tried a few things, and it's reasonable, but it does have some quirks that seem to repeat, like:

I tried a prompt of:

  # python function that returns a random integer between min and max
And it produced:

  def random_int(min, max):
      return random.randint(min, max)

  # define the size of the grid
  n = 5
It doesn't add the needed import statement, and I'm unclear why it's "defining the size of the grid".


LLMs generally, but small models even more so, will keep going and generate seemingly unrelated things. On the frontend, tools like Copilot and Ghostwriter do a lot of things like use stop words or simply not show completions outside a single block.

As for your prompt, it's following your prompt a little too closely and generating just the function. You can, however, condition it so that it treats this as the start of the program and does the import, e.g.

   # python function that returns a random integer between min and max
   import
This is in fact a suggestion from OpenAI on best practices for prompting called "leading words" https://help.openai.com/en/articles/6654000-best-practices-f...


That's because it's not following instructions like ChatGPT; it's just trying to guess what could plausibly come after what you put, like Copilot or the old GPT-3 models.


Isn’t ChatGPT also just generating plausible text that could be a response to an instruction?


It's not generating the most likely next word in the 'meta-corpora' of all possible discussions similar to the ones it has been trained on, it is trying to generate plausible text that would be scored well as a helpful assistant - and in the process has transferred knowledge acquired from its pre-training task.


"that could be a response to an instruction" is the critical part here


Yeah, at their core they’re both trying to guess/generate what comes next. Differences: Being trained towards conversations versus code. Hyperparameters set to stop differently. “Wrappers” that form the prompt.


and imports are (almost) always at the top of your file, not with this function


I tried the same input, except wrapping it in triple-quotes instead of commenting it. So that it would match the standard practice for module doc strings. Here's the result:

    """python function that returns a random integer between min and max"""
        return random.randint(min, max)


    def gen_random_float(min, max):
        """python function that returns a random float between min and max"""
        return random.uniform(
So, it assumed the triple-quote was a function's doc string, despite it not being indented. It then assumes I'll want a similar function for floats (I assume it was cut off by a token limit).


Based on the the replies, I tried a different prompt:

  # python script that prints out an integer between min and max
And it did better. Included the import, didn't add unrelated code, but did still put the code inside a function.


I've had the issue of generating random code after the completion with other models as well; it's due to how the models are trained. You need to stop generating when you encounter token(s) that indicate you're done - see https://huggingface.co/replit/replit-code-v1-3b#post-process...
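
The post-processing amounts to cutting the completion at the first stop token or sentinel, something like the sketch below (the placeholder stop string here is an assumption; the real ones for this model are listed on the model card):

    def truncate_at_stop(completion: str, stop_sequences=("<|endoftext|>",)) -> str:
        # Keep only the text before the earliest stop sequence, if any appears.
        cut = len(completion)
        for stop in stop_sequences:
            idx = completion.find(stop)
            if idx != -1:
                cut = min(cut, idx)
        return completion[:cut]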


I get such unrelated statements from copilot too, not often, but a few I remember.


This is amazing work, and bravo to the people working on redpajama.

This is fantastic for the world, this means LLMs will not be controlled by a couple of companies with the associated rents.

Yes, private LLMs will likely be a couple of years ahead of 'free' alternatives, but that's OK; we want to incentivize for-profit research so long as the services become low priced in time (and in this case, in short order).

AMAZING WORK.


My first reaction was, "why is replit building LLMs," but I guess it fits their needs to have one optimized for their use. But I wonder, is this the beginning of another wave of "every company is an AI company?" Are we going to see a spike in tech hiring around AI/LLM, money starting to flow again, etc? And how many years until it all blows up and the layoffs start?


Finetuning LLMs (and models in general) is going to be a common practice. Each company is its own domain, with domain knowledge and data it can use to specialize open-sourced models, or it can use other models to distill/teach its own proprietary model (home grown, or a modification of someone else's).


Have you even tried it? It’s pretty bad


But that's fine; it can be a year or two behind the state of the art. That's not the point.

The point is that there will be alternatives and that will reduce the price in time further increasing the impact of the technology.

There was a possible future where only MSFT and maybe GOOG and maybe one or two other companies had this technology and extracted massive rents.


to be clear this work is not based on redpajama - though we did discuss that in the previous episode https://twitter.com/swyx/status/1648080532734087168?s=46&t=9...


Oh my bad!

I thought I read that, is it based upon:

https://arxiv.org/abs/2211.15533 (The Stack) ?


partially. Reza discussed their data pipeline in the blogpost that we reference in the show notes


No Clojure. No Julia. No Haskell. No Racket. No Scheme. No Common Lisp. No OCaml. And, as much as I despise Microsoft, No C#. No F#. No Swift. No Objective-C. No Perl. No Datalog. A glaringly lacking choice of languages.


I fed it some OCaml and it worked, though the example was trivial:

    type point = { x: int; y : int }
    let manhattan_distance (a: point) (b: point) : int =
which it completed to

    type point = { x: int; y : int }
    let manhattan_distance (a: point) (b: point) : int =
        abs (a.x - b.x) + abs (a.y - b.y)
...which is a valid and correct OCaml definition of this method:

https://try.ocamlpro.com/#code/type'point'='$4'x:'int;'y':'i...


I hate to admit it, but Python, C, Java, and JS cover most of modern programming. But not supporting C# sounds like a bad idea.


C# was available in the dataset they link, and is the most glaring omission by global usage...


Despite the lack of examples, it still completes trivial clojure like "(defn connect [" and other lisp syntax like "(define (hello" which is promising for further refinement training on Lisp languages.


I'm sure that has to do with the dataset available to them.


Which is a deduplicated version of this: https://www.bigcode-project.org/docs/about/the-stack/

And probably, yes. While it contains 358 programming languages, obviously there's a long tail after the 20 most-represented languages. Some people might not realize, without thinking about it for a bit, that many of the most-represented "languages" are actually things like JSON, XML, HTML, CSV, text, markdown, YAML, and SVG.

Also note that it won't be able to parse natural language nearly as well without additionally being trained on something like the LAION dataset, so this version will be more of an autocomplete like Copilot rather than something which can manifest high level business logic from whole cloth like ChatGPT.


You could take it and finetune it on a bunch of Lisps; that would probably cost on the order of $50-500.


if anyone from MosaicML is reading this, i’d love a guide on how to do exactly this!


It's a bit hard to believe that the system is decent at producing code which captures complex ideas and higher-level structure when the tokens/param value is >30 (it's ~200 here?). The 'good' models (meaning ones with lots of 'knowledge' or 'memorization' about the dataset) typically tend to be around 2 tokens/param, and models with decent language generation but less knowledge/memorization are around 30 tokens/param. Perhaps the domain allows for this, but since the linguistic interface on the input is still needed... it's hard to believe.


Tokens/param shouldn't matter more than the total training FLOPs, I believe. Clearly, if we train at your claimed 'ideal' 2 tokens/param on a very small dataset (not many tokens in the first place), the model wouldn't have enough data to properly learn the relevant languages. Once there is enough data, it becomes a question of model capacity (does it have enough degrees of freedom to support the computational structures needed?).

I believe the overparametrization largely helps with generalization and reducing overfitting; at 2 tokens/param there are many more degrees of freedom than structures that can be learned, from what I can tell (the extra capacity just provides good breathing room for internal structures). But if your model has enough capacity, and you can find a good enough training method (and you have enough data to learn the task), then you should be able to succeed at arbitrarily low tokens/param, which is good to keep in mind for making efficient models.


this kind of critical thinking is exactly what replit is going to need for their stated goal of doing whole-app generation. right now they only test it on AmjadEval. you… might wanna consider joining them to work on it?


I'm not sure that noticing tokens/param ratios or simplicial modeling properties requires much critical thought - perhaps it's just a standard first thought for anyone handed an LM now. I've worked tangentially to NLP for about 7 years in academia, but most of my work has been focused on a lesser-known field of mathematics applied either to outputs or to the NNs themselves, as well as bioinformatics. As such, my expertise may not be as refined as the real players in the field such as Glaese, Finn, Velockovic, etc., but I typically try to keep up with the actual key advancements in the field (usually the stuff few people notice). This area takes far too much compute capability for many people to actually become experts in it, so my interests have been less on large LMs and more on algorithms. But I suppose I agree that it is frustrating to see how little knowledge many of the hype-filled crowds piling into this area possess. (Not calling anyone specifically out in this thread, just in general across the internet.)


Are you saying the less you train the model the better it is? I'm confused


i believe GP is referencing the Kaplan and Chinchilla scaling laws. we reference those in the podcast but i’m not sure if some deeper insight is being hinted at here where different scaling laws apply for different domains/purposes


But these say exactly the opposite: the more tokens/param, the better. There is some optimum after which you need more training FLOPs to improve than you would by adding parameters, but it is definitely not the other way around.


Tried it out on the HuggingFace demo, with default settings.

Prompt:

    def nth_prime(n):

Completion:

    if n == 1:
        return 2
    if n == 2:
        return 3
    if n == 3:
        return 5
    if n == 4


I genuinely don't understand how anyone can use something like this and seriously think "oh yeah, this is revolutionary." It's almost complete garbage and can't do anything remotely interesting.

    # a method that approximates the hyperbolic tangent (clamped tanh)

    def rational_tanh(x):
        return (x + 1) / (x - 1)
Even gave it the BIG hint of a "clamped" and "rational" tanh, but that ain't it, chief. Forget GPT-4, I would be embarrassed to even show this as a tech demo.


I use "something like this" (GPT4) all the time. Use a good model. Can't wait till proper open source models catch up, but they're not there yet.

Here's GPT4's response:

```
import math

def clamped_tanh(x, n_terms=10):
    """
    Approximate the hyperbolic tangent (tanh) function using a Maclaurin series expansion.

    Args:
        x (float): The input value for which to compute the tanh.
        n_terms (int, optional): The number of terms to use in the Maclaurin series. Default is 10.

    Returns:
        float: The approximated tanh value.
    """
    tanh_approx = 0

    for n in range(n_terms):
        coef = ((-1) ** n) * (2 * n + 1)
        term = coef * (x ** (2 * n + 1)) / math.factorial(2 * n + 1)
        tanh_approx += term

    # Clamping the tanh approximation to the range [-1, 1]
    tanh_approx = max(-1, min(tanh_approx, 1))

    return tanh_approx

# Example usage
x = 0.5
result = clamped_tanh(x)
print(f"clamped_tanh({x}) = {result}")
```


I don't mean to get in the weeds here, but this is still bad, as the Padé approximation[1] (which is what you're actually looking for) is orders of magnitude better than the Maclaurin series (which is just a special case of the Taylor expansion).

Keep in mind that I'm not even an expert (I merely earned a minor in math). In fact, I only know this because I did some research years ago on rational approximations of hyperbolic functions. The right answer gets clipped to -1 or 1 beyond (-3, 3), but it should look like: `x * ( 27 + x * x ) / ( 27 + 9 * x * x )`—though the coefficients (and clip range) can vary.
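
For concreteness, a minimal sketch of the clamped rational form described above (coefficients and clip range taken straight from this comment, and they can vary):

```
# Minimal sketch of the clamped rational approximation described above;
# the coefficients and the (-3, 3) clip range are illustrative and can vary.
def rational_tanh(x):
    if x < -3.0:
        return -1.0
    if x > 3.0:
        return 1.0
    return x * (27.0 + x * x) / (27.0 + 9.0 * x * x)

# e.g. rational_tanh(1.0) ~= 0.778 vs math.tanh(1.0) ~= 0.762
```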

[1] https://en.wikipedia.org/wiki/Pad%C3%A9_approximant


Ah. It weakens my point, but I'll say it anyway: you do need to be able to tell the difference between good and bad responses to use it to make you more efficient. I definitely wouldn't use it blindly for things I don't know much about at this point.

Pretending I know my Padé from my Maclaurin, I'd follow up with: "use the better Padé approximation".


I recognized the name Replit and couldn't remember why. A quick search reminded me: https://news.ycombinator.com/item?id=27424195


This founder has extreme views and is full of hyperbole: https://twitter.com/amasad/status/1504092244168478728?s=20


Is this the best you can find? Not even top 10 bangers.


Makes it very much seem like you were only sorry you got caught — and were actually never sorry and didn’t learn from what should have been a teachable moment. Sad.


this feels like an attempt to hive mind against anything cool from this company


I think it's fair to evaluate a company's behavior before engaging in business with them. And I personally dislike persons in power abusing their position, which is why I remembered the company name almost two years later.

I haven't heard of any similar behavior since then, which is a good sign. But a reputation can be a hard thing to shake. The CEO should have considered that before doing what he did.


Threatening a guy for making an open source version of replit sounds pretty crummy in my eyes.


I think people are smart enough to receive extra information and do whatever they want with that.


+1, this is unnecessary.


Alternatively, it's called consequences of your actions. Don't be surprised if shitty behaviour comes back to bite you.


[flagged]


Replit doesn't have special mod powers. A HN moderator downweighted this subthread, the same way we do any generic/indignant/offtopic subthread when we see it. That's standard HN moderation.

In this case we did so less than we normally would because we moderate HN less when the topic is a YC co - see https://hn.algolia.com/?dateRange=all&page=0&prefix=false&qu... for lots of past explanation.


I for one downvoted it and I have no relationship with Replit.


Darn, it doesn't look like it has C#.


Any idea how much it cost to train it and how it was trained?


I keep thinking there should be a way to train a copilot against just one set of code libraries. I know LLMs require training against a lot of text to get their smarts, but is there a way to set this up so a model can be created for a specific library by anyone, so it could provide open source support via a transformer + model? Maybe this would be a better approach than a jack of all trades, master of none.


Yes, this is what fine tuning is for.

It's pretty obvious that lots of people will want to take a strong code completion model, then fine tune it on their docs + libraries and then make it available inside their docs/discord/slack as a support thing.
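
As a rough sketch of what that could look like with the standard Hugging Face stack (the model choice, file glob, and hyperparameters below are illustrative assumptions, not a tested recipe):

```
# Rough sketch of fine-tuning a code model on a single library's sources,
# using the standard Hugging Face Trainer API. Model choice, file glob and
# hyperparameters are illustrative assumptions, not a tested recipe.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

model_name = "replit/replit-code-v1-3b"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True)

# Some code tokenizers ship without a pad token; reuse EOS if so.
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# Treat every source file of the target library as a training document.
dataset = load_dataset("text", data_files={"train": "my-library/**/*.py"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=2048)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="finetuned-on-my-library",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=8,
        num_train_epochs=1,
        learning_rate=1e-5,
        bf16=True,
    ),
    train_dataset=tokenized["train"],
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```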


I guess that as soon as a kit for this purpose is available that doesn't require advanced knowledge (say, `aisupport4 my-repo/`), runs on mainstream-ish hardware, and doesn't require a centralized service (even running in the browser via e.g. transformers.js), things will change considerably.


As someone who is very interested in decentralized services (as in my day job involves decentralized databases and I'm actively working on WebGPU support for training) I'd say that the browser-based vision is a fair way off.

The software ecosystem is pretty immature, and there are numerous things that need to change before the core technologies are good enough to fine tune competitive LLMs.

I do think fine tuning moderate-sized LLMs on your own (pretty expensive) hardware using consumer GPUs may be possible this year.

Unfortunately all the evidence is that training (as opposed to inference) requires high precision, and hence a lot of memory. This is something that consumer GPUs for the most part lack. New techniques are likely to be required (e.g. better sharing of training across low-memory GPUs), but it's hard to predict how they will develop.


This probably makes a self-hosted and/or local Copilot a lot more feasible


Yes, something like FauxPilot[0] should be able to use it instead of CodeGen

[0] https://github.com/fauxpilot/fauxpilot


I can barely keep up with this stuff, but quick question: is there a way to simply change the URL setting of Copilot to point to this model? Obviously it needs an endpoint; I could hack something up, but I'm asking in case somebody has already done this. Would be nice to cancel my Copilot subscription.


It's nowhere close to Codex/Copilot. Try the demo: https://huggingface.co/spaces/replit/replit-code-v1-3b-demo



They are probably not lying, but good performance on benchmarks does not imply good performance on your use cases.


Yep


There's https://github.com/fauxpilot/fauxpilot but it doesn't use this model


I don't think it's possible to point Copilot to other models. I don't think Microsoft would benefit much from that feature. You could use existing tools [0] to host your own model which in theory could be used by an extension your IDE uses. But I'm not sure if an extension like that exists.

[0] https://github.com/oobabooga/text-generation-webui


Of course it's possible, just not officially

See https://github.com/fauxpilot/fauxpilot/blob/main/documentati...


It just gave me prototypes lol

    def sieve_eratosthenes(n):

    ##a function to sort 10 numbers

    def bubble_sort(a):
    ##a function to sort 10 numbers

    def insertion_sort(a):
    ##a function to sort 10 numbers

    def quick_sort(a):


Did you mess around with the settings? I'm getting a correct implementation and since it's deterministic (with default settings) it should be the same for you.


I left the settings as they were. All I added was `##a function to sort 10 numbers`, assuming it would complete it like Copilot.


I think that 20 years from now, we'll all be sitting around wondering 1) where the fuck are my flying cars, and 2) what were they thinking using computers to write code?

And the reason I say this is because these tools are answering a question that we haven't asked yet: what common problems need to be solved in this programming language, and where do I get code to solve that problem?

These LLMs are basically telling us how to duplicate code, and what we need is the opposite: how to stop reinventing the wheel for the 100th time.

Instead of writing code for me, tell me if I already have it. If I'm writing it, tell me there's a library for that. If I'm a library writer, give me suggestions for what libraries are missing from the toolkit.

All we've done so far is begun the process of automating the production of duplicate code. With absolutely no way to go back in time and correct bugs introduced in earlier iterations. We are likely, for instance, to see 0 day attacks that affect hundreds of applications, but with no simple way to describe which applications are affected. That's going to be a first rate trainwreck.


Well fwiw, working with GPT 4 it often suggests which libraries to use assuming the question allows for it, so it's not like everyone's writing everything from scratch.

But libraries and especially frameworks as they are these days are also a giant liability more often than not. APIs change for no reason, they can be removed from the package manager at any moment without warning, people may slip malicious code into them past LGTM reviews, have recursive dependencies upon dependencies that bloat and slow down your build process, etc.

Sometimes you don't need to install the entire damn car manufacturing plant and dealership it comes with just to get that one wheel you needed. And an LLM can just write you the code for a very nicely customized wheel in a few seconds anyway.


> how to stop reinventing the wheel for the 100th time.

The idea of libraries may not have been a good one. It saved human time, but no library is perfect because no abstraction is perfect, and this causes unnecessary bloat. It seems that Nature does not use libraries, it uses replication instead, and we can now have that too.


You have a point, but I think there are some big trade-offs...

Nature uses replication, but it's also horrifically complex and we have no real idea about the specifics of how it all works, or what to do when many, many things go wrong.

Also, I think nature uses cloning, which I kind of think would be called a 'library' in this case, for single-celled organisms (archaea and bacteria). In addition many eukaryotic organisms can reproduce via cloning under special situations.

I don't know, I'm not really trying to argue one way or the other. I'm kinda' thinking out loud here... but I'd like to see LLMs used to create really great libraries, or some other abstractions, that are easy to use and also understandable. It might not happen soon, but I think that there is a lot of value in moving things that way.


> but it's also horrifically complex and we have no real idea about the specifics of how it all works, or what to do when many, many things go wrong.

Sounds a lot like neural networks.

I think libraries are a result of the human brain's limited working memory capacity, while organisms or neural networks aren't so limited in what they can focus on. Perhaps the transformer became successful in syntactic abstraction because it is limited in how many things it can focus on.

Computer libraries are equally a result of limited memory and disk. Javascript is a counterexample of what happens when computing is free (because it runs on someone else's computer)


So instead everyone who has to solve a problem has to be an expert on that problem, rather than just an informed consumer.


Replication does not help in managing complexity. That's why we use abstractions, even with the problems they have.


Ha I never wondered what the physical/life version of a shared library is until I read your post so thanks for that.


Reminds me of Java's debate of autogenerating boilerplate vs using the Lombok library: https://old.reddit.com/r/java/comments/c8oqkq/why_we_removed...


I agree -- maybe someday LLMs will give me the code for a set of simple abstractions that are well-matched to the problems I currently face. Something like a Pattern Language that was all the rage, but, um, better? More objective and pragmatically useful. Not galaxy-brain theory.

That's what I really want. But that would also put me out of a job.


I'm always sad to see these things being trained on a tiny number of programming languages. Makes it harder still for the good languages to compete.


Unfortunately I'm someone who sometimes can't separate the art from the artist. Replit is the company where the founder sent these nasty pompous threats to their ex-employee for their innocent side project and then tried to double talk his way out of it with a bs non-apology when it got exposed in public. I won't support Replit or anything they make.


Is this a Co-pilot like assistant or something more? Co-pilot is neat but is basically not much more than an automated snippet system. The actual writing of the code is not the part that I want help with, I want an AI system that helps me design better software systems. Something more akin to program mind mapping than some fancy auto-completion system.


I wonder if an LLM combined with something like PlantUML would generate anything useful.


> Markdown, Java, JavaScript, Python, TypeScript, PHP, SQL, JSX, reStructuredText, Rust, C, CSS, Go, C++, HTML, Vue, Ruby, Jupyter Notebook, R, Shell

No Kotlin T_T

I wonder if fine-tuning to a new language would even make sense. AFAIK, it is the core knowledge within the model that really matters; finetuning is essentially specialising.


Is there any way to connect these new code focused LLMs into VS Code in order to replace Github Copilot?


"1M concurrent containers" curious about replit containers. Do they run on firecracker?


Can I use repl.it with an external Code LLM, with or without paying repl.it for Ghostwriter ?


Yes we have a robust extension system and some are already building alternatives.


Hi from the Codeium team. It's awesome to hear you are allowing other code LLMs to be used on the Replit platform (we're big fans)! We'd love to enable our free chrome extension on Replit.


would love to be able to compare codeium vs ghostwriter inside replit! (or toggle between them based on known strengths or preferences, perhaps by project or by filetype)


An important distinction I'm learning today is that not all LLMs will be interoperable with each other's queries/prompts/inputs.

Code LLM right now is not responding how a Chat LLM would respond.

Hats off to the team on the impressive work!


3 billion parameters. Does that mean I will be able to run it on an 8GB consumer GPU?


Probably not out of the box but if some of the local deep learning wizards get a quantized version working well and optimize it a bit, definitely.
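
As a sketch of one possible route once that happens, 8-bit loading via bitsandbytes looks roughly like this; whether int8 plays nicely with this model's custom modeling code is an assumption I haven't verified:

```
# Sketch of 8-bit loading via bitsandbytes; whether int8 works out of the box
# with this model's custom code is an assumption, not a tested claim.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "replit/replit-code-v1-3b"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    trust_remote_code=True,
    device_map="auto",     # places layers on the GPU, spilling to CPU if needed
    load_in_8bit=True,     # requires the bitsandbytes package
)
```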


Means that once it's incorporated into llama.cpp, you can run it on your laptop.


Hopefully on phones too


No, I could only get 2.7B to run on 8GB of VRAM, unfortunately.


it is 2.7B


Loading seems to have worked on my laptop's RTX 3070, `nvidia-smi` shows `5188MiB / 8192MiB` in memory usage.


their pytorch_model.bin is 10.4GB


I just loaded this on my laptop's RTX 3070 GPU by following the instructions here: https://huggingface.co/replit/replit-code-v1-3b

I don't know how I can test the model, but it seems loading worked. When I run `nvidia-smi` on another terminal, I see `5188MiB / 8192MiB` in the memory-usage column.
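
The numbers roughly check out if the weights end up on the GPU in half precision (an assumption on my part):

```
# Rough arithmetic: the checkpoint is stored in float32, but half-precision
# loading takes ~2 bytes per parameter, which is roughly what nvidia-smi shows.
params = 2.7e9
print(f"fp32: ~{params * 4 / 1e9:.1f} GB")  # ~10.8 GB, in the ballpark of the 10.4GB .bin file
print(f"fp16: ~{params * 2 / 1e9:.1f} GB")  # ~5.4 GB, close to the 5188MiB nvidia-smi reading
```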


you can load it but you can't run inference? what's the issue?


No issue, I'm simply unfamiliar with python machine learning APIs.

I managed to run inference locally by installing the requirements and running app.py from the demo: https://huggingface.co/spaces/replit/replit-code-v1-3b-demo/...

It is very fast on my RTX 3070, VRAM usage goes to ~= 6.3GB during inference.
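
For anyone wanting something more minimal than the demo's app.py, a sketch along these lines should be close (the generation settings are my guesses, not the demo's exact defaults):

```
# Minimal half-precision inference sketch; generation settings are illustrative
# and not necessarily the demo's defaults.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "replit/replit-code-v1-3b"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name, trust_remote_code=True, torch_dtype=torch.float16
).to("cuda")

prompt = "def fibonacci(n):"
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=64, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```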


Weak spot which I guess is similar to other LLMs: if you mention recursion somewhere in the comments, the model sometimes starts to recursively generate the same lines over and over again.


title is missing: "trained in 1 week, and like most open source LLMs so far... it sucks compared to the closed source alternatives"

Great effort of course bla bla bla...

Open source really needs some benchmarking, and to up its game quality-wise.

And yes I know they're expensive as shit to train... let's not keep wasting our money and actually work together, pool our resources, to make a GOOD model.

But oh no, everyone wants to put their stamp on it. "Replit did this! Look at us!"


This is easy to say, but I think the issue is that getting an LLM right isn't easy, so it's not clear who should steward such a project. Something like BLOOM shows that even if you have the necessary compute, you can still get a model that isn't good.

I think it will take some time for it to be clear who is a leader in training open source models (maybe it will be the red pajama folks?) and I think they'll get more support after that.


Fair point


> It's been trained on 525 billion tokens of, of code all permissively licensed code

What does "permissively licensed" mean?


Interesting how this guy has a finance background but knows how to code, especially for emerging technologies like LLMs


Didn't MosaicML do the training for them?


Can this be used with the Copilot plugins for every IDE?



