Very interesting. Can you please elaborate or give a TLDR on the linguistics ang...

Mezzie · on March 27, 2023

I thought about this for a while, and I think I would boil it down to being used to dealing with language as data instead of just as a communication medium. Experience with corpora, sentiment analysis, and the various parts of linguistics gives you a solid grounding in why the frequency distribution in the training set(s) looks the way it does.

Some examples of things I consider when interfacing with an LLM that derive from my linguistic knowledge:

* That the language it's trained on is thoroughly modern and confined to the speech of a particular place and time, which means any viewpoint I receive is not only modern but from a very specific time and place. The language everybody uses is like the water fish swim in: most people don't think about it. So if I ask it something about (to use an example that is a culture war issue) the history of racism, I know the answer is being run through modern conceptions of racism, and if I want historical views, I need to get those elsewhere.

* That which words are most commonly used depends on the social and economic status of the speaker as well as word properties like phonetics and phonology. Knowing this makes it much easier to choose the vocabulary and sentence structures that 'sub-pick' the part of the training set you want answers from. Asking 'how to grow peppers' and 'what soil variables control how well a Capsicum annuum plant grows' will get you different answers.

* Related to this, the differences between spoken and written English on a large scale - one problem with the 'everybody can just use LLMs' idea is that the LLMs are trained on written English, but the majority of people interfacing with them write to them as though they were having a verbal conversation. There are parts of verbal communication that don't exist in written communication, and knowing how to provide the scaffolding that makes a request make sense requires understanding the difference between the two.

* A basic knowledge of sociolinguistics is fantastically helpful for developing personae and/or picking out biases in the training data set. (Somewhat similar to how I can usually ID an American's media diet and political affiliation after a 5-10 minute casual conversation.)
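
To make the vocabulary point a bit more concrete, here's a rough sketch of how word choice lines a prompt up with one register rather than another. Everything in it is an assumption for illustration: the two tiny "corpora" are made up, the `tokens` and `overlap` helpers are hypothetical, and a real training set is obviously nothing this simple. It only shows the general idea that the two pepper prompts share vocabulary with different kinds of text.

```python
import re

def tokens(text):
    """Lowercase alphabetic tokens; a deliberately crude tokenizer."""
    return re.findall(r"[a-z]+", text.lower())

# Hypothetical toy texts standing in for two registers in a training set.
CASUAL = ("how do i grow peppers in my garden and how often "
          "should i water them so they grow well")
TECHNICAL = ("soil ph nitrogen and moisture are variables that control "
             "how well capsicum annuum plant growth proceeds")

def overlap(prompt, corpus):
    """Fraction of the prompt's tokens that also occur in the corpus."""
    vocab = set(tokens(corpus))
    words = tokens(prompt)
    return sum(w in vocab for w in words) / len(words)

prompts = [
    "how to grow peppers",
    "what soil variables control how well a Capsicum annuum plant grows",
]

for p in prompts:
    print(f"{p!r}: casual={overlap(p, CASUAL):.2f}, "
          f"technical={overlap(p, TECHNICAL):.2f}")
```

With these toy texts, the first prompt overlaps more with the casual sample and the second more with the technical one, which is the 'sub-picking' effect in miniature.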