We'd love to help you all deploy this!

1. We just released a couple of models that are much smaller (https://huggingface.co/databricks/dolly-v2-6-9b), and these should be much easier to run on commodity hardware in a reasonable amount of time.

2. Regarding this particular issue, I suspect something is wrong with the setup. The example generates a little over 100 words, which is probably around 250 tokens; 12 minutes makes no sense for that if you're running on a modern GPU. I'd love to see details on which GPU was selected - I'm not aware of a modern GPU with 30GB of memory (the A10 is 24GB, the T4 is 16GB, and the A100 is 40/80GB). Are you sure you're using a version of PyTorch that installs CUDA correctly? (There's a quick sanity check after this list.)

3. We have seen single-GPU inference work in 8-bit on the A10, so I'd suggest that as a follow-up - see the sketch below the sanity check.
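
To rule out a CPU fallback quickly, here's a generic PyTorch sanity check (nothing Dolly-specific):

    import torch

    # If this prints False, generation silently falls back to CPU,
    # which would easily explain multi-minute latencies.
    print(torch.cuda.is_available())
    print(torch.cuda.get_device_name(0))
    print(torch.cuda.get_device_properties(0).total_memory / 1024**3, "GiB")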
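
And for the 8-bit path, a rough sketch of what that looks like with HF transformers - this assumes bitsandbytes and accelerate are installed, and the exact flags may vary by version:

    from transformers import AutoModelForCausalLM, AutoTokenizer

    # load_in_8bit quantizes the weights via bitsandbytes, roughly halving
    # memory vs fp16; device_map="auto" places the layers onto the GPU.
    tokenizer = AutoTokenizer.from_pretrained("databricks/dolly-v2-6-9b")
    model = AutoModelForCausalLM.from_pretrained(
        "databricks/dolly-v2-6-9b",
        load_in_8bit=True,
        device_map="auto",
        trust_remote_code=True,
    )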


I've also been struggling to run anything but the smallest model you've shared on Paperspace:

    import torch
    from transformers import pipeline

    generate_text = pipeline(model="databricks/dolly-v2-6-9b",
                             torch_dtype=torch.bfloat16,
                             trust_remote_code=True, device=0)
    generate_text("Explain to me the difference between nuclear fission and fusion.")

This causes the kernel to crash; the GPU should be plenty:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.73.05    Driver Version: 510.73.05    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Quadro P6000        Off  | 00000000:00:05.0 Off |                  Off |
| 26%   45C    P8    10W / 250W |   6589MiB / 24576MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

I'm extremely excited to try these models, but this has been by far the most difficult experience I've ever had trying to do basic inference.


I’ve never used Paperspace, so I’ll give it a try this weekend. How much RAM do you have attached to the compute? We don’t think it should be any harder to run this via HF pipelines than other similarly sized models, but I’ll look into it.


Hey there! I'm one of the folks working on Dolly - Dolly-V2 is based on the GPT-NeoX architecture. llama.cpp is a really cool library built to optimize execution of Facebook's Llama architecture on CPUs, and as such, from what I understand it doesn't really support other architectures at this time. That said, Llama features most of the tricks used in GPT-NeoX (and probably more), so I can't imagine it's a super heavy lift to add support for NeoX and GPT-J to the library.

We couldn't use Llama because we wanted a model that could be used commercially, and the Llama weights aren't available for that kind of use.


Hey! I worked on this here at Databricks: the blog post goes into the dataset collection design a bit (https://www.databricks.com/blog/2023/04/12/dolly-first-open-...). In summary, you're right - brainstorming and GeneralQA will have overlap because the taxonomy naturally has some overlap.


Out of curiosity: what's an example of a metric you would use to evaluate the model's ability? For example, just looking qualitatively, asking Pythia a prompt like "How do I tie a tie?" produces content that isn't even reasonably responsive to the question. And yet many benchmarks have no problem with that.


Hey there! I worked on Dolly, and I work on Model Serving at Databricks. Dolly-V1 is GPT-J-based, so it'll run easily on llama.cpp. Dolly-V2 is Pythia-based, and Pythia is built with the GPT-NeoX library.

GPT-NeoX is not that different from GPT-J (it also has the rotary embeddings, which llama.cpp supports for GPT-J). I would imagine it's not too heavy a lift to add NeoX architecture support (a toy sketch of the rotary idea is below).
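
If it helps build intuition: rotary embeddings just rotate pairs of query/key channels by a position-dependent angle. A toy sketch in the interleaved GPT-J style (NeoX applies it to only a fraction of each head's dimensions, and this is not llama.cpp's actual implementation):

    import torch

    def rotary_embed(x, base=10000):
        # x: (seq_len, dim), dim even. Rotate each (even, odd) channel pair
        # by an angle that grows with position and shrinks with channel
        # index, baking relative position into the queries and keys.
        seq_len, dim = x.shape
        pos = torch.arange(seq_len).unsqueeze(1)          # (seq_len, 1)
        freqs = base ** (-torch.arange(0, dim, 2) / dim)  # (dim/2,)
        angles = pos * freqs                              # (seq_len, dim/2)
        cos, sin = angles.cos(), angles.sin()
        x1, x2 = x[:, 0::2], x[:, 1::2]
        out = torch.empty_like(x)
        out[:, 0::2] = x1 * cos - x2 * sin
        out[:, 1::2] = x1 * sin + x2 * cos
        return out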


Because the firehose of AI/GPT is a lot to take in, please ELI5 - unpack this comment and provide more definitions.

Thank you.

Just so I'm clear: does "parameters" refer to the total number of node-relation connections between a single node and its neighbors for that prompt/label? Or how would you explain it ELI5-style?


Sure! I'll try to summarize briefly, though I'll almost certainly oversimplify. There are a couple of open-source language models trained by EleutherAI - the first, called GPT-J, used some newer model-architecture concepts. Subsequently, they released a model architected in the likeness of GPT-3, called GPT-NeoX-20B; it was architecturally quite similar to GPT-J, just with more parameters. Pythia is a family of models with the same architecture and the same dataset but different parameter counts, built to test scaling laws.

Dolly-V2 is a Pythia model fine-tuned on the Databricks 15K dataset.


Augmenting the answer to address your follow-up: parameters are any trainable variables in a model's definition. Model training is a process where you repeatedly tweak the parameters and re-evaluate the model against a metric judging its quality. A lot of models consist of matrix multiplications, so if you are multiplying a 2x2 matrix A by a 2x2 matrix B and both matrices can be tweaked, then you've got 8 parameters, since there are 8 numbers that can be tweaked (toy version below).
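
To make that concrete, a toy PyTorch version of the 2x2 example:

    import torch

    # Two trainable 2x2 matrices -> 8 numbers that training can tweak.
    A = torch.nn.Parameter(torch.randn(2, 2))
    B = torch.nn.Parameter(torch.randn(2, 2))
    print((A @ B).shape)          # the "model" output, a 2x2 matrix
    print(A.numel() + B.numel())  # 8 parameters

    # The same idiom gives the headline count for any PyTorch model:
    # sum(p.numel() for p in model.parameters())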


While this is true, I'm pretty sure the referenced poll was conducted well before that was announced. In fact, I've heard internal criticism running in the opposite direction.

The prospect of a full-time shift to remote was not communicated as context for the poll. Employees answered thinking the questions were about how they'd go into the office given COVID-19 (a lot more people said they'd do 50/50, when in reality that just reflected their lack of comfort due to the disease).

Companies should be careful not to confuse the actions people are willing to take during a pandemic with what they'd do in a post-pandemic world. The same goes for productivity: just as we don't know the long-term impact of this disease, we also don't know that employees will remain as productive as they've been so far.

