Show HN: Autolabel, a Python library to label and enrich text data with LLMs (github.com/refuel-ai)
153 points by nihit-desai on June 20, 2023 | 22 comments
Hi HN! I'm excited to share Autolabel, an open-source Python library to label and enrich text datasets with any Large Language Model (LLM) of your choice.

We built Autolabel because access to clean, labeled data is a huge bottleneck for most ML/data science teams. The most capable LLMs are able to label data with high accuracy, and at a fraction of the cost and time compared to manual labeling. With Autolabel, you can leverage LLMs to label any text dataset with <5 lines of code.
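Here's a rough sketch of what usage looks like (simplified from our README; the config keys shown here are illustrative and may change):

    from autolabel import LabelingAgent

    # Illustrative config: task type, model provider, and a labeling
    # prompt with the allowed labels (keys simplified from the README).
    config = {
        "task_name": "MovieSentimentReview",
        "task_type": "classification",
        "model": {"provider": "openai", "name": "gpt-3.5-turbo"},
        "prompt": {
            "task_guidelines": "Classify the sentiment of the given movie review.",
            "labels": ["positive", "negative", "neutral"],
            "example_template": "Review: {example}\nSentiment: {label}",
        },
    }

    agent = LabelingAgent(config)
    agent.plan("dataset.csv")  # dry run: estimates cost and previews prompts
    agent.run("dataset.csv")   # labels the dataset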

We’re eager for your feedback!


What can this do that the new ‘function calling’ feature can’t? It seems to be roughly the same thing?


function calling, as I understand it, makes LLM outputs easier for downstream APIs/functions to consume (https://openai.com/blog/function-calling-and-other-api-updat...).

Autolabel is quite orthogonal to this - it's a library that makes it very easy to use LLMs to label text datasets for NLP tasks.

We are actively looking at integrating function calling into Autolabel, though, to improve label quality and to support downstream processing.
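As a sketch of what that could look like (hypothetical, not in Autolabel today): function calling can constrain the model to emit one of the allowed labels, using the OpenAI Python SDK as of mid-2023:

    import json
    import openai

    # Hypothetical sketch: force the model to return a label from a
    # fixed set via function calling (mid-2023 OpenAI SDK interface).
    functions = [{
        "name": "record_label",
        "description": "Record the sentiment label for the given review.",
        "parameters": {
            "type": "object",
            "properties": {
                "label": {
                    "type": "string",
                    "enum": ["positive", "negative", "neutral"],
                },
            },
            "required": ["label"],
        },
    }]

    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo-0613",
        messages=[{"role": "user",
                   "content": "Review: 'Loved it!'\nLabel the sentiment."}],
        functions=functions,
        function_call={"name": "record_label"},  # force this function
    )
    args = response.choices[0].message.function_call.arguments
    label = json.loads(args)["label"]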


This is very interesting to me. We spent significant time “labelling” data when I worked in public-sector digitalisation; basically, we did the LLM part manually and then ran engines like this on top of it. Having used ChatGPT to write JSDoc documentation for a while now, and having been very impressed with how good it is when it understands your code through good use of naming conventions, I’m fairly certain this will be the future of “librarian”-style labelling of case files.

But the key issue is going to be privacy. I’m not big on LLMs, so I’m sorry if this is obvious, but can I use something like this without sending my data outside my own organisation?


You can self-host an open-source model. llama.cpp is a very popular project with great docs.

https://github.com/ggerganov/llama.cpp

You need to be careful about licensing - for some of these models it's a legal grey area whether you can use them for commercial projects.

The 'best' models require quite large hardware to run, but a popular compression approach at the moment is 'quantization': using lower-precision model weights. I find it a bit hard to evaluate which open-source models are better than others, and how they are affected by quantization.
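For example, with the llama-cpp-python bindings you can load a quantized model file and run it entirely locally. A rough sketch (the model path/filename is illustrative; you'd download weights whose licence permits your use case):

    from llama_cpp import Llama

    # Load a locally stored, quantized model (path is illustrative).
    llm = Llama(model_path="./models/llama-7b-q4_0.bin", n_ctx=2048)

    out = llm(
        "Classify the sentiment of this review as positive or negative: "
        "'Loved it!'\nSentiment:",
        max_tokens=8,
        temperature=0.0,  # near-deterministic output for labeling
    )
    print(out["choices"][0]["text"].strip())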

You can also use the OpenAI API. They don't train on your data; they store it for 30 days for abuse monitoring and then delete it. It doesn't seem hugely different from using something like Slack/Google Docs/AWS.

I think some people imagine their data will end up in the knowledge base of GPT-5 if they use OpenAI products, but that would be a clear breach of their own ToS.

https://openai.com/policies/api-data-usage-policies


I’m not sure the OpenAI model is EU-regulation compliant. It’s not just GDPR these days; the laws are ramping up to the point where we might not even be able to use Azure (as Microsoft, unlike Amazon, still can’t guarantee that only EU citizens staff support for the services). This is obviously worse in some EU sectors than others, but I’m not sure I’ll ever work outside Green Energy, Public Sector or Finance, so I’ll always have to deal with the harshest parts.

I wonder if one day they will sell a “self-hosted” version of GPT. We wouldn’t mind having a ChatGPT with its 2021 dataset and no ability to use the internet if it meant it lived up to the regulations.

But can you do that? Can you “download” a model and then just use it?

As far as the hardware goes, I think we will be fine. My sector uses a lot of expensive hardware like mainframes for old legacy systems, where we come together as organisations and buy the service from companies like IBM (or similar; typically there are 3-5 companies that take turns winning the 8-12-year contracts), who then operate the stuff inside our country. I’m sure we can do the same with LLMs.


Yep! I totally understand the concerns around not being able to share data externally - the library currently supports open-source, self-hosted LLMs through Hugging Face pipelines (https://github.com/refuel-ai/autolabel/blob/main/src/autolab...), and we plan to add more support here for models like llama.cpp that can be run without many constraints on hardware.
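A rough sketch of pointing the config at a local Hugging Face model instead of an API provider (illustrative; see the linked source for the exact provider and model names):

    # Sketch: local labeling via transformers, so no data leaves your
    # machine (provider/model names illustrative; check the linked source).
    config = {
        "task_name": "ToxicCommentClassification",
        "task_type": "classification",
        "model": {"provider": "huggingface_pipeline",
                  "name": "google/flan-t5-large"},
        "prompt": {
            "task_guidelines": "Classify the comment as toxic or not toxic.",
            "labels": ["toxic", "not toxic"],
            "example_template": "Comment: {example}\nLabel: {label}",
        },
    }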


Very interesting. I’ll most certainly favorite this and keep an eye on it. I think that sort of thing will be the future of LLMs for many of us.


Thank you for open-sourcing this! This seems very useful, especially because of the confidence estimation, which lets you use LLMs for the cases they handle well and fall back to manual labelling for the rest.


>Refuel provides LLMs that can compute confidence scores for every label, if the LLM you've chosen doesn't provide token-level log probabilities.

How does this work exactly?
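(For reference, when a model does expose token-level log probabilities, a common confidence proxy is the average logprob of the emitted label tokens. A sketch using the OpenAI completions API, which exposes logprobs; the interesting part is presumably how Refuel estimates this when they're not available:)

    import math
    import openai

    # Sketch: average token log-probability of the generated label as a
    # confidence proxy (not necessarily Refuel's method).
    resp = openai.Completion.create(
        model="text-davinci-003",
        prompt="Review: 'Loved it!'\nSentiment (positive/negative):",
        max_tokens=2,
        temperature=0,
        logprobs=1,
    )
    token_logprobs = resp["choices"][0]["logprobs"]["token_logprobs"]
    confidence = math.exp(sum(token_logprobs) / len(token_logprobs))
    print(resp["choices"][0]["text"].strip(), confidence)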


You should carefully read OpenAI's terms and conditions before using it to build custom datasets.


No, you don't need to. OpenAI didn't ask for permission when it took everyone's work to create GPT.

Pirate all LLMs. They're all yours anyway.


Which part?


The part that says you shouldn't take outputs from their models to build datasets for training competitor models.

Outputs from models that they trained on stolen ebooks, unpaid reddit data, data scraped from millions of websites without credit, etc. Sort of like stealing a bike and then getting mad that it got stolen again later, because it was clearly rightfully yours.

https://i.pinimg.com/originals/d7/72/22/d77222df469b50e3b4cd...


I get your point but your analogy doesn’t quite work.


Yeah it's more like stealing a million bikes, putting all parts into a pile and custom assembling them on request.


Still not exactly right. Stealing bikes deprives owners of them, while scraping data doesn’t.


How about torrenting the entirety of the world's filmography, using that content to make clip compilations on YouTube, then filing copyright strikes and demonetizing videos that contain those clips?

In a sense, it's almost patent trolling.


For anyone wondering, it's here: https://openai.com/policies/terms-of-use

>use output from the Services to develop models that compete with OpenAI;

Well, I can still use ChatGPT labeling for many other purposes anyway.


There’s some room for interpretation here. Are small sentiment-analysis models competing with a large, general-purpose generative model? OpenAI doesn’t provide the former.

I see competing models as the likes of LLaMA, Falcon, etc., which would fall under the terms in my interpretation.


How do the confidence scores work?


You just posted this here https://news.ycombinator.com/item?id=36384015

It's one thing to Show HN / share; it's another thing to spam it with your ads.


Hi!

The earlier post was a report summarizing LLM labeling benchmarking results. This post shares the open source library.

Neither is intended to be an ad. Our hope in sharing these is to demonstrate how LLMs can be used for data labeling, and to get feedback from the community.



