
I almost believe that if I know how to make an LLM prompt and how to make an API call to OpenAI, Mistral, Claude 3, or together.ai, then as an application programmer I can skip this whole book. I see people posting project specifications asking for NLP, named entity extraction, etc. But most of those jobs look like they could be handled by an LLM, possibly even one smaller than 7B, and probably more robustly. My other assumption is that this wasn't true at all three years ago.
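Roughly, I mean something like this (a minimal sketch using the OpenAI Python client; the model name and the output format in the prompt are placeholders I picked for illustration):

    # pip install openai -- sketch of LLM-based entity extraction, not production code
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    def extract_entities(text: str) -> str:
        resp = client.chat.completions.create(
            model="gpt-4",  # placeholder: any of the providers mentioned above
            temperature=0,
            messages=[
                {"role": "system",
                 "content": "Extract person, organization and location names. "
                            'Return JSON: {"entities": [{"text": ..., "type": ...}]}'},
                {"role": "user", "content": text},
            ],
        )
        return resp.choices[0].message.content

    print(extract_entities("Tim Cook met EU officials in Brussels on Monday."))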

Maybe machine learning engineers want to explain what I am missing?




This is an interesting question.

As an NLP researcher who is in contact with companies that demand NLP applications, I think we are not there yet. For example, a company that wants to extract information from medical records cannot use solutions like GPT-4 or Claude that involve sending protected data to third parties in foreign jurisdictions. Modest local models (like 7B models) don't work so well for things like named entity extraction yet. And furthermore, companies typically want some explanation and accountability for the results (especially in sensitive domains...) and sure, LLMs can explain, but you have no guarantee that the explanation isn't hallucinated. When I mention the possibility of hallucinations, companies typically balk and say that they prefer the classic way.

My (potentially biased) opinion is that classic NLP still has life left in it. For how long, I don't know. If small, locally-runnable LLMs get much better and more reliable, addressing the hallucination problems, what you mention might become largely true for most engineering applications.

Also note that beyond engineering, things like syntactic parsing are also useful for scientific pursuits, and at the moment those seem to be out of reach of LLMs.


In my experience, entity recognition is very good with GPT-4. I recorded CNN for hours and it was able to identify and summarize all the commercials.

It’s magic stuff


You're not entirely wrong.

10 years ago, I started an ML & NLP consulting firm. Back then nobody was doing NLP in production (SpaCy hadn't come out yet, efficient vector embeddings were not around, the only comprehensive library to do NLP was NLTK, which was academic and hard to run in prod).

I recently revisited some of our projects from back then (like this super fun one where we put 1 Million words into the dictionary [1]) and realized how much faster we could have done many of those tasks with LLMs.

Except we couldn't — the whole "in production" part would have made using LLMs for the most minute tasks prohibitively expensive, and that is not going to change for a while, sadly. So, if you want to run something in prod that is not specifically an LLM application, this book is still super valuable.

[1] https://www.nytimes.com/2015/10/04/technology/scouring-the-w...


LLMs silently accumulate errors as the length of the predicted sequence grows ("label bias" is the term that should be in that book, though not in the context of LLMs).

The problem is exacerbated by a very big token dictionary. Every new token you generate is sampled independently of anything that has already fallen outside the context window. If you have any task that requires joint distribution modeling, an LLM will fail where HMMs/CRFs will succeed. Of course, when the problem manifests depends on the task, and I can't say in advance, but skipping the whole book is not recommended.

For example, there were approaches for many language tasks in that book that used CNNs (instead of something more principled) and were extremely successful (despite the lack of joint modeling). Who knows when this accumulation of errors becomes measurable.

As long as your generation of tokens fits inside the context window, you're modeling everything jointly. But you need to be aware that once you drop the furthest tokens to generate the next ones, performance can drop drastically (multiplicative accumulation of errors).
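Back-of-the-envelope illustration of that multiplicative accumulation (the 99% per-token figure is made up, just to show the shape of the decay):

    # if each generated token is "right" with independent probability p,
    # the chance the whole sequence is right decays exponentially with length
    p = 0.99  # assumed per-token accuracy, purely illustrative
    for n in (10, 100, 1000):
        print(n, p ** n)
    # 10   -> ~0.90
    # 100  -> ~0.37
    # 1000 -> ~0.00004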


If you want to process 100M documents, you want the most performant, fastest option, and a 7B-parameter model isn't going to be it.
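Rough arithmetic (the per-document latencies below are assumptions for illustration, not benchmarks):

    docs = 100_000_000
    llm_7b_sec_per_doc = 1.0          # assumed ~1 s/doc for a 7B model on a single GPU
    small_tagger_sec_per_doc = 0.005  # assumed ~5 ms/doc for a small finetuned tagger

    print(docs * llm_7b_sec_per_doc / 86_400)        # ~1157 machine-days
    print(docs * small_tagger_sec_per_doc / 86_400)  # ~6 machine-days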

Additionally, entity linking is an extremely common task at which LLMs fail pretty miserably (certainly if the dictionary is custom/private). Additional work must be done to somehow (!? many options here ?!) perform EL.

So, in the end, making a silver corpus from an LLM may be an option for NER, in order to train a much, much smaller model. But EL is _still_ not a 'plug and play' problem, and can actually be pretty difficult to do "well" (using the modern techniques of MHS, etc.).


It depends on what you are trying to do.

LLMs are powerful, but can be difficult to work with w.r.t. post-processing or adding markup/annotations to the text. For example, when asked to label terms in a sentence they can produce varied output, e.g. sometimes listing "word or clause: description of the meaning", which is hard to parse in downstream tasks, or listing items out of order.

I've also seen LLMs label split infinitives with the correct meaning at the preposition instead of the whole subclause. With NLP you would label either the start and span of the label (common) or the root of the subclause, depending on the application. The NLP approach would work for more complex nested and interconnected expressions where the structure is not flat text.

If you ask it to generate XML or HTML, it will usually invent its own markup, or sometimes generate invalid output. E.g. I've seen it output HTML using tags like `<Question Word>` in some cases.

Asking it to generate CoNLL-U (which it clearly knew about, judging from its response), it only output 3 columns, like "in _ PREP" etc., which is not correct.

Asking `Can you lemmatize "She is going to the bank to get some money."` I get `"She go(ing) to bank(er) for money(y)"` which leaves out words, doesn't lemmatize some words correctly, and has a format that is difficult to parse/interpret reliably.

LLMs also have a limited context window, so if you are trying to process a large document, or have a large complex set of instructions (e.g. on how to label, process, or format the data) then it can lose those instructions and deviate from what you are doing.

LLMs are also susceptible to being guided by the input text, since the prompt and the input are taken together. Thus, if the text is talking about a different format or something else, the model can easily switch to that.

--

While NLP pipelines require more work to get right, they can often be much more computationally efficient, as they are usually a lot smaller than 7B parameters, or use other techniques that don't involve ML/NNs at all.

You can build more custom pipelines by querying the different features (part of speech, lemma, lexical features, etc.) without having to reparse the data, and you can keep the annotations consistent across those different pipelines, e.g. when labelling, extracting, and storing the data in a database for searching, so a user can see all the places where a term is referenced in a given text.
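For example, something like this with spaCy (a sketch; en_core_web_sm is just the standard small English model, and which attributes you actually need is application-specific):

    # pip install spacy && python -m spacy download en_core_web_sm
    import spacy

    nlp = spacy.load("en_core_web_sm")
    doc = nlp("She is going to the bank to get some money.")

    for tok in doc:
        # lemma, part of speech and dependency label per token, in a fixed,
        # machine-readable structure -- no free-text output to parse
        print(tok.text, tok.lemma_, tok.pos_, tok.dep_)

    for ent in doc.ents:
        # character offsets let you store and cross-reference annotations in a database
        print(ent.text, ent.label_, ent.start_char, ent.end_char)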


That’s kinda like saying we don’t need to teach CS students anything about algorithms.

I don’t think that book was ever aimed at application engineers in the first place.


Well, I can see it as being useful to give aspiring ML researchers a survey and jumping off point for possibly digging deeper.

But I also think when they created that book originally it was about giving ML practitioners tools they would use.


LLMs typically do worse on specific extractive and classification tasks than a finetuned BERT-large model, so no, you actually can't replace everything with LLM calls and get similar performance.

(Cf. the BloombergGPT paper, all financial benchmark tasks.)

And that's not even taking into account inference cost, but that is a business case issue.


BERT is a language model! It was considered a "large" language model for its time, and it's even based on transformers. It's a very small language model by today's standards (340MM params), and is encoder-only instead of decoder-only, but trying to draw a hard line between "BERT" and "LLMs" is more about parameter count than capabilities — in fact, the original GPT-3 paper benchmarked GPT-3 against finetuned BERT-large and beat it on nearly every measure [1]. And BERT-large is not unique in being able to be finetuned; finetuning Mistral 7B on your task should result in very good performance (similarly, OpenAI allows finetuning of gpt-3.5-turbo, and there are plenty of non-Mistral open-source LLMs like Yi and Qwen that should do well too).
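For what it's worth, the finetuning route for an encoder model is only a few lines with Hugging Face transformers; a rough sketch, where the checkpoint, dataset file, and hyperparameters are all placeholders:

    # pip install transformers datasets
    from datasets import load_dataset
    from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                              Trainer, TrainingArguments)

    name = "bert-large-uncased"  # or any other encoder checkpoint
    tok = AutoTokenizer.from_pretrained(name)
    model = AutoModelForSequenceClassification.from_pretrained(name, num_labels=3)

    # placeholder dataset: one JSON object per line with "text" and "label" fields
    ds = load_dataset("json", data_files="my_task.jsonl")
    ds = ds.map(lambda x: tok(x["text"], truncation=True, padding="max_length"), batched=True)

    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir="out", num_train_epochs=3,
                               per_device_train_batch_size=16),
        train_dataset=ds["train"],
    )
    trainer.train()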

I'm not sure what BloombergGPT has to do with LLMs vs non-LLMs; BloombergGPT is an LLM [2], and it defeating other LLMs on financial benchmarks doesn't prove much about large language models other than "LLMs can be trained to be better at specific tasks."

1: https://arxiv.org/abs/2005.14165

2: https://www.bloomberg.com/company/press/bloomberggpt-50-bill...


Of course BERT is an LM; I never claimed it wasn't. It is just much smaller than what is now termed an "LLM", which is typically an extremely large generative decoder-only transformer for general multi-language use (often instruction- and chat-tuned). I was training and finetuning dozens of LMs back when fitting BERT-large into GPU memory was considered a problem, and I have finetuned hundreds of LMs over my NLP research and engineering career. I have even implemented Hidden Markov Model LMs from scratch (mostly as an exercise; performance was pretty bad even in 2015).

When I said "specific task", I meant specialized, domain-specific tasks like financial sentiment and event extraction, the area I hold a PhD in. As a matter of fact, on FiQA SA a finetuned RoBERTa scores 88% F1 while BloombergGPT scores 75% F1 [1]. Still very impressive for a zero/few-shot learner, but depending on data availability, performance targets, and inference cost tradeoffs, it might not meet your needs.

My point was that "small" masked encoder transformer LMs like BERT can still hold their own on narrow domain tasks. And the OP's claim that all NLP is solved by prompting a general-purpose LLM service is simply inaccurate.

1. https://arxiv.org/pdf/2310.12664.pdf


Ah, yeah, it's definitely true that prompts alone are typically beaten by finetunes on narrow domain tasks.

I hadn't read the financial paper you linked, it's very interesting! One odd bit I did notice was they set the gpt-4 temperature to 1.0, which is... not a great setting for analysis, and probably harmed the results somewhat. Typically you'd want a value much closer to 0 for that. But while a lower, more reasonable temperature setting would probably improve gpt-4's performance, I would still expect a finetuned LM to outperform larger models with just prompting on those kinds of narrow domains, especially once cost is a factor.

It's somewhat surprising to see how bad BloombergGPT was... Even gpt-4 trounced it on every published metric, and it wasn't trained for finance tasks specifically. The bitter scaling lesson, I suppose.


Take LexisNexis as an example. They used to build sophisticated NLP pipelines to squeeze information out of legal documents: part-of-speech tagging, dependency parsing, NER, topic modeling, relation extraction, event extraction, summarization, etc. They also spent millions of dollars on custom training just to expand the set of entity types. The whole process is error-prone, painful to maintain, and expensive. Similarly, McDonald's must have struggled a lot building their automatic ordering system.

LLMs must be a godsend for these companies. All of a sudden, low-level tasks like POS tagging can be eliminated. Tasks like NER have only limited use left, and companies can enjoy orders of magnitude more entity types almost for free. Tasks like intent slotting and topic modeling become trivial compared to the pre-LLM-era pipelines.


Reliability is what you are missing. LLMs are not reliable.


And cost effectiveness.


I suggest going through the exercise of seeing whether this is true quantitatively. Get a business-relevant NER dataset together (not CoNLL, preferably something that your boss or customers would care about), run it against Mistral/etc, look at the P/R/F1 scores, and ask "does this solve the problem that I want to solve with NER". If the answer is 'yes', and you could do all those things without reading the book or other NLP educational sources, then yeah you're right, job's finished.
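A sketch of the scoring step, treating each entity as a (start, end, label) tuple and computing micro-averaged precision/recall/F1 (exact-span matching; how forgiving to be about boundaries is your call):

    def micro_prf(gold_docs, pred_docs):
        # gold_docs and pred_docs are parallel lists; each element is a set
        # of (start, end, label) tuples for one document
        tp = fp = fn = 0
        for gold, pred in zip(gold_docs, pred_docs):
            tp += len(gold & pred)
            fp += len(pred - gold)
            fn += len(gold - pred)
        p = tp / (tp + fp) if tp + fp else 0.0
        r = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * p * r / (p + r) if p + r else 0.0
        return p, r, f1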


You're correct, except for the use cases where one of these comes into play:

A. Latency: for some systems, you need near real-time predictions. LLMs (today) are slow for that.

B. Cost: when the low dev effort (for building and deploying an ML model) and low sample complexity (i.e. zero/few-shot) don't translate into proportionate monetary gains over what you pay for LLM usage.

C. Precision: when you want the model to reliably tell you when it doesn't know the correct answer. Hallucination is part of it, but I think of this requirement as the broader umbrella of good uncertainty quantification. I think there are two reasons why this is worse for LLMs: (1) traditional ML models also suffer from this, but there are some well-known ways to mitigate it (one is sketched below), whereas for LLMs there is still no universal or accepted way to do this reliably; (2) the quality of the language an LLM generates seems more likely to deceive you when it is wrong. I don't know how to think about this scientifically; maybe as LLMs proliferate, people will build appropriate mental defenses?
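One such mitigation, sketched with scikit-learn (the toy data, classifier, and threshold are placeholders; the point is just that abstaining below a probability threshold is straightforward with classic models):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    texts = ["great quarter, revenue up", "profit warning issued", "board meets tuesday"]
    labels = ["pos", "neg", "neutral"]

    clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
    clf.fit(texts, labels)

    THRESHOLD = 0.7  # placeholder; tune on a validation set
    probs = clf.predict_proba(["shares plunge after warning"])[0]
    answer = clf.classes_[probs.argmax()] if probs.max() >= THRESHOLD else "don't know"
    print(answer, probs.max())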

There is also the practical problem of prompt transferability across LLMs: what works best for one LLM might not work well for another, and there is no systematic way to optimally modify the original prompt. This is painful in setups where you're trying not to be locked in. But I didn't put it in the list because it seems to be a problem for niche groups; everyone seems to be busy getting stuff working with one LLM. Maybe this will become a larger issue later.


It is not written for application programmers, it is a machine learning book.


Imagine a category - books for machines (some sort of LLM or RAG augment). Kind of like the downloadable helicopter manual that Trinity uses in the Matrix. Then having a higher order integration onto your foundational muscle memory would be the next step. Human (vCurrent + 0.1).

The nice thing is performance would be different based on the human getting it. It kind of preserves some unique element.


It's probably covered in many other machine learning books, and generally assumed to be common knowledge here: the final validation for whatever you need the application to do should be done on your end.

Need to use CRFs for NER? Test on your own validation set. Need to use LLMs for NER? Test on your own validation set.

The validation set should be made up of carefully curated samples that you've seen errors on, edge cases, and so on, just like test-driven development but at a much larger scale. Assessing LLMs by demo is a horrible habit we've picked up, and it needs to change.
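Concretely, that can just be a regression suite over the samples you've seen fail before (a pytest-style sketch; extract_entities here is hypothetical and stands for whatever system is under test, assumed to return a set of (text, label) pairs):

    # curated edge cases collected from past failures; grows over time like any test suite
    EDGE_CASES = [
        ("Dr. Smith of St. Mary's Hospital",
         {("Smith", "PERSON"), ("St. Mary's Hospital", "ORG")}),
        ("apple fell in after-hours trading", {("apple", "ORG")}),  # lowercase company mention
    ]

    def test_ner_edge_cases():
        for text, expected in EDGE_CASES:
            predicted = extract_entities(text)  # system under test: CRF, LLM, whatever
            missing = expected - predicted
            assert not missing, f"{text!r}: missed {missing}"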



