GPT-4 LLM simulates people well enough to replicate social science experiments (treatmenteffect.app)
223 points by thoughtpeddler 45 days ago | 94 comments



I'm very skeptical of this; the paper they linked is not convincing. It says that GPT-4 predicts the direction of experimental outcomes correctly 69% of the time, versus 66% for human forecasters. But this is a silly benchmark, because people don't trust human forecasters in the first place - that's the whole reason the experiment is run. Knowing that GPT-4 is slightly better at predicting experiments than a human guessing doesn't make it a useful substitute for the actual experiment.


For sure. Great argument

+ the experiments may already be in the dataset so it’s really testing if it remembers pop psychology


Yes. A stronger test would be guessing the results of as-yet-unpublished experiments.


They did this. Read the paper


Well, they looked at papers that weren't published as of the original model release. But GPT very likely had unannounced model updates. Isn't it possible that many of the post-2021 papers were in the version of GPT they actually worked with?


Furthermore, there’s a replication crisis in social sciences. The last thing we need is to accumulate less data and let an LLM tell us the “right” answer.


You can see this in their results, where certain types of studies have a lower prediction rate and higher variability


Predicting the actual results of real unpublished experiments with a 0.9 correlation is a very non-trivial result. The comparison with human forecasts is not the central finding.


Nicely put! Well argued!

I was not able to put my finger on what I felt wrong about the article -- till I read this


That's surprisingly low considering it was probably trained on many of the papers it's supposed to be replicating.


I totally agree. So many people are missing the point here.

Also important is that in Psychology/Sociology, it's the counter-intuitive results that get published. But these results disproportionately fail to replicate!

Nobody cares if you confirm something obvious, unless it's on something divisive (e.g. sexual behavior, politics), or there is an agenda (dieting, etc). So people can predict those ones more easily than predicting a randomly generated premise. The ones that made their way into the prediction set were the ones researchers expected to be counter-intuitive (and likely P-hacked a significant proportion of them to find that result). People know this (there are more positive confirming papers than negative/fail-to-replicate).

This means the counter-intuitive, negatively forecast results are the ones that get published; i.e. the dataset behind that 66% figure for human forecasters is disproportionately made up of studies that found counter-intuitive results, compared to the overall neutral pool of pre-publication studies, because scientists and grant winners are incentivised to publish counter-intuitive work. I would even suggest the selected studies are more tantalizing than average: in most of these papers they are key findings, rather than minor comments on methods or re-analyses.

By the way, the 66% result has not held up super well in other research; for example, only 58% could predict whether papers would replicate later on: https://www.bps.org.uk/research-digest/want-know-whether-psy... - Results with random people show they are better than chance for psychology, but on average below 66% and with massive variance. The figure doesn't differ for psychology professors, which should tell you the stat reflects the context of the field and its research apparatus rather than any real capability to predict research. What if we revisit this GPT-4 paper in 20 years, see which studies have replicated, and ask people to predict that - will GPT-4 still be higher if its data is frozen today? If it is kept up to date? Will people hit 66%, 58%, or 50%?

My point is, predicting the results now is not that useful because historically, up to "most" of the results have been wrong anyhow. Predicting which results will be true and remain true would be more useful. The article tries to dismiss the issue of the replication crisis by avoiding it, and by using pre-registered studies, but such tools are only bandages. Studies still get cancelled, or never proposed after internal experimentation; we don't have a "replication reputation meter" to measure those (which affect and increase false-positive results), and we likely never will with this model of science for psychology/sociology statistics. If the authors read my comment and disagree, they should run predictions with GPT-4 and humans for replications that are currently underway, wait a few years for the results, and then do the analysis.

Also, more to the point, as a Psychology grant holder once told me, the way to get a grant in Psychology is to:

1) Acquire a counter-intuitive result first. Quick'n'dirty research method like students filling in forms, small sample size, not even published, whatever. Just make the story good for this one and get some preliminary numbers on some topic by casting a big web of many questions (a few will hit P < 0.05 by chance eventually in most topics anyway at this sample size).

2) Find an angle whereby said result says something about culture or development (e.g. "The Marshmallow experiment shows that poverty is already determined by your response to tradeoffs at a young age", or better still "The Marshmallow experiment is rubbish because it's actually entirely explained by SES as a third factor, and wealth disparity in the first place is ergo the cause"). Importantly, change the research method to something more "proper" and instead apply P-hacking if possible when you actually carry out the research. The biggest P-hack is so simple and obvious nobody cares: you drop results that contradict or are insignificant, and just don't report them - carrying out alternate analyses, collecting slightly different data, switching from online to in-person experiments, whatever you can to get a result.

3) On the premise of further tantalizing results, propose several studies which can fund you over 5 years, and apply some of the buzzwords of the day. Instead of "Thematic Analysis", it's "AI Summative Assessment" for the word frequency counts, etc. If you know the grant judges, avoid contradicting whatever they say, but be just outside the dogma enough (usually, culturally) to represent movement/progress of "science".

This is how 99% of research works. The grant holder directs the other researchers. When directing them to carry out an alternate version of the experiment or to change what is being analyzed, you motivate them that it's for the good of the future, of society, of being at the cutting edge, and of supporting the overarching theory (which of course already has "hundreds" of supporting studies constructed in the same fashion).

As to sociology/psychology experiments - Do social experiments represent language and culture more than people and groups? Randomly.

Do they represent what would be counter-intuitive or support developing and entrenching models and agendas? Yes.

90% of social science studies have insufficient data to say anything at the P < 0.01 level, which should realistically be our goal if we even want to do statistics under the current dogma of this field (said kindly, because some large datasets are genuine enough and are reused across several studies to make up the numbers in the 10%). I strongly expect a revolution in psychology/sociology within the next 50 years to redefine a new basis.


I think this analysis is misguided.

Even granting a historic bias for counter-intuitive results in social science, this has no bearing on the results of the paper being discussed. Most of the survey experiments that the researchers used in their analyses came from TESS, an NSF-funded program that collects well-powered, nationally representative samples for researchers. A key thing to note here is that not every study from TESS gets published. Of course, some do, but the researchers find that GPT-4 can predict the results of both published and unpublished studies at a similar rate of accuracy (r = 0.85 for published studies and r = 0.90 for unpublished studies). Also, given that the majority of these studies 1) were pre-registered (even pre-registering sample size), 2) had their data collected through TESS (an independent survey vendor), and 3) were well-powered and nationally representative, it is extremely unlikely that they were p-hacked. Therefore, regardless of what the researchers hypothesized, TESS still collected the data, and the data is of the highest quality within social science.

Moreover, the researchers don't just look at psychology or sociology studies; there are studies from other fields like political science and social policy, for example, so your critiques about psychology don't apply to all the survey experiments.

Lastly, the study also includes a number of large-scale behavioral field experiments and finds that GPT4 can accurately predict the results of these field experiments, even when the dependent variable is a behavioral metric and not just a text-based response (e.g., figuring out which text messages encourage greater gym attendance). It's hard for me to see how your critique works in light of this fact also.


Yes, and I am sure you would have said the same about the research before 2011 and the replication crisis, when it was always claimed that scientists like Bem (precognition) and Baumeister (ego depletion) could not possibly be faking their findings - they contributed so much, their models had "theoretical validity", they had hundreds of studies and other researchers building on their work! They had big samples. Regardless of TESS/NSF, the studies it focuses on have been funded (as you mention) and they were simply not chosen randomly. People had to apply for grants. They had to bring in early, previous or prototype results to convince people to fund them.

The points specific to psychology apply to most fields in the soft sciences, with their typical research techniques.

The main point is that prior research shows absolutely no difference between field experts and random people in predicting the results of studies - pre-registered, replications, and others.

GPT-4 achieving roughly the same success rate as any person has nothing whatsoever to do with it simulating people. I suspect an 8-year-old could predict psychology replications 10 years out with about the same accuracy. It's also key that in prior studies, like the one I linked, this same lack of difference occurred even when the people involved were provided additional recent resources from the field, albeit with higher prediction accuracy overall.

The meat of the issue is simple - show me a true positive study, make the predictions on whether it will replicate, and let's see in 10 years, when replication efforts have been carried out, whether GPT-4 is any better than a random 10-year-old with no information on the study. The implied claim here is that since GPT-4 can supposedly simulate sociology experiments and so more accurately judge the results, we can iterate on it and eventually conduct science that way, or at least speed up the scientific process. I am telling you that the simulation aspect has nothing to do with the success of the algorithm, which is not really outperforming humans, because, to put it simply, humans are bad at using any subject-specific or case knowledge to predict the replication/success of a specific study (there is no difference between lay people and experts) and the entire set of published work is naturally biased anyhow. In other words, the "simulation" framing may just be a prompting style that elicits higher test scores.

Describing GPT-4's role here as "simulating" is a human theoretical construction. We know that people with a knowledge advantage are not able to apply it to predict experimental results any more accurately than lay people. That is because they are trying to predict a biased dataset. The field of sociology as a whole, like most fields that study humans (because they are vastly underfunded for large samples), struggles to replicate or conduct science in a reliable, repeatable way, and until we resolve that, the claims of GPT-4 simulating people are spurious and unrelated at best, misleading at worst.


I'm not sure how to respond to your point about Bem and Baumeister's work since those cases are the most obvious culprits for being vulnerable to scientific weakness/malpractice (in particular, because they came before the time of open access science, pre-registration, and sample sizes calculated from power analyses).

I also don't get your point about TESS. It seems obvious that there are many benefits to choosing the repository of TESS studies from the authors' perspective. Namely, it conveniently allows for a consistent analytic approach, since many important things are held constant between studies, such as 1) the studies have the exact same sample demographics (which prevents accidental heterogeneity in results due to differences in participant demographics) and 2) the way in which demographic variables are measured is standardized, so that the only difference between survey datasets is the specific experiment at hand (this is crucial because the way demographic variables are measured can affect the interpretation of results). This is apart from the more obvious benefits that the TESS studies cover a wide range of social science fields (like political science, sociology, psychology, communication, etc., allowing for testing the robustness of GPT predictions across multiple fields) and that all of the studies are well-powered, nationally representative probability samples.

Re: your point about experts being equal to random people in predicting results of studies, that's simply not true. The current evidence on this shows that, most of the time, experts are better than laypeople when it comes to predicting the results of experiments. For example, this thorough study (https://www.nber.org/system/files/working_papers/w22566/w225...) finds that the average of expert predictions outperforms the average of laypeople predictions. One thing I will concede here though is that, despite social scientists being superior at predicting the results of lab-based experiments, there seems to be growing evidence that social scientists are not particularly better than laypeople at predicting domain-relevant societal change in the real world (e.g., clinical psychologists predicting trends in loneliness) [https://www.cell.com/trends/cognitive-sciences/abstract/S136... ; full-text pdf here: https://www.researchgate.net/publication/374753713_When_expe...]. Nonetheless, your point about there being no difference in the predictive capabilities of experts vs. laypeople (which you raise multiple times) is just not supported by any evidence since, especially in the case of the GPT study we're discussing, most of the analyses focus on predicting survey experiments that are run by social science labs.

Also, the authors don't seem to be suggesting that these are "replications" of the original work. Rather, GPT-4 is able to simulate the results of these experiments as if with true participants. To fully replicate the work, you'd need to do a lot more (in particular, you'd want 'conceptual replications', wherein the underlying causal model is validated but with different stimuli/questions).

Finally, to address the previous discussion about the authors finding that GPT-4 seems to be comparable to human forecasters in predicting the results of social science experiments, let's dig deeper into this. In the paper, specifically in the supplemental material, the authors note that they "designed the forecasting study with the goal of giving forecasters the best possible chance to make accurate predictions." The way they do this is by showing laypeople the various conditions of the experiment and having the participants predict where the average response for a given dependent variable would fall within each of those conditions. This is very different from how GPT-4 predicts the results of experiments in the study. Specifically, they prompt GPT to be a respondent and do this iteratively (feeding it different demographic info each time). The result of this is essentially the same raw data that you would get from actually running the experiment. In light of this, it's clear that this is a very conservative way of testing how much better GPT is than humans at predicting results, and they still find comparable performance. All that said, what's so nice about GPT being able to predict social science results just as well as (or perhaps better than) humans? Well, it's much cheaper (and more efficient) to run thousands of GPT queries than it is to recruit thousands of human participants!
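
For anyone wondering what that "GPT as respondent" loop looks like mechanically, here is a rough sketch. The prompt wording, demographic fields, and model name are my own stand-ins rather than the authors' actual protocol, and it assumes the openai>=1.0 Python client:

    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    # Hypothetical respondent profiles; the paper conditions on real survey demographics.
    respondents = [
        {"age": 34, "gender": "woman", "education": "college degree", "politics": "independent"},
        {"age": 61, "gender": "man", "education": "high school diploma", "politics": "conservative"},
    ]

    # Hypothetical treatment text for one experimental condition.
    treatment = ("Please read the following message, then answer on a 1-7 scale "
                 "how convincing you found it: ...")

    responses = []
    for r in respondents:
        persona = (f"You are a {r['age']}-year-old {r['gender']} with a {r['education']} "
                   f"who is politically {r['politics']}. Answer the survey question as "
                   "that person would, giving only a number.")
        reply = client.chat.completions.create(
            model="gpt-4",            # stand-in model name
            temperature=1.0,          # sampling noise stands in for respondent variation
            messages=[{"role": "system", "content": persona},
                      {"role": "user", "content": treatment}],
        )
        responses.append(reply.choices[0].message.content)

    # 'responses' plays the role of respondent-level raw data: repeat per condition,
    # then compare condition means as you would with human participants.

The point is that the unit of output is a simulated respondent-level answer, not a direct forecast of the effect, which is why the comparison with human forecasters is conservative.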


Fair enough, you might indeed have rejected those authors - however, vast swathes of the field (for Baumeister, the majority) did not at the time. The same is almost certainly true now for authors we have yet to identify, or maybe never will.

I admit the point on TESS, I didn't research that enough. I'll look into that at a later point as I have an interest in learning more.

To address your studies regarding expert forecasting of study results - thank you for sharing some papers. I had time and knew some papers in the area, so I have formulated a response because, as you allude to later regarding cultural predictions, there is debate over the usefulness of expert vs non-expert forecasts (e.g. there is a wide base of research on recession/war predictions showing the error rate is essentially random beyond a certain number of years out). I have not fully digested the first paper but I understand the gist of it.

Economics bridges sociology and the harder science of mathematics, and I do think it makes sense for it to be more predictable by experts than psychology studies (and note the studies being predicted were not survey-response studies like most in psychology), but even this one paper does not particularly support your point. Critically, some of the conclusions in the paper you cite are that "Forecasters with higher vertical, horizontal, or contextual expertise do not make more accurate forecasts", "If forecasts are used just to rank treatments, non-experts, including even an easy-to-recruit online sample, do just as well as experts", and "Fourth, experts as a group do better than non-experts, but not if accuracy is defined as rank ordering treatments", as well as "The experts are indistinguishable with respect to absolute forecast error, as Column 7 of Table 4 also shows... Thus, various measures of expertise do not increase accuracy". Critically, at a glance, the experts are outperformed by non-experts anyhow on almost 40% of the selected statements in Table 2 (the last column). I also question the use of MTurk workers as lay people (because of historic influences of language and culture on IQ tests, the layperson group would ideally be at least geographically or WEIRD-ly similar to the expert group), but that's a minimal point.

Another point is that further domain information, simulation, or other tactics do not address the root issue of the biased dataset of published papers - "Sixth, using these measures we identify `superforecasters' among the non-experts who outperform the experts out of sample." Might we be in danger, with claims made 8 years later about LLMs, of the very thing the paper warns against: "As academics we know so little about the accuracy of expert forecasts that we appear to hold incorrect beliefs about expertise and are not well calibrated in our accuracy."?

I know what you are getting at, that these are not replications - that it feels elementally exciting that GPT-4 could simulate a study taking place, rather than replicate it as such, and determine the result more accurately than a human forecast. But what I am saying is that, historically, we have needed replication data to assess whether human forecasts (expert and non-expert) were correct in the long run anyway, and the predictions need to be for future or currently underway replications, so that the training data cannot include the results, before we can draw any conclusion about how GPT-4 achieves this forecasting accuracy, whether by simulation or by direct answer. The idea that it is cheaper to run GPT queries than to recruit human participants makes me wonder if you are actively trolling though - you can't be serious? These are fields in which awful statistics and research go on all the time, awaiting an evolution to a better basic method, and the result is accuracy 3% higher than a group of experts, when we don't even know whether those studies will replicate in the long run (and yes, even innocently pre-registered research tends to proliferate false positives, because the proportion of pre-registered studies that get published is not close to 100% and thus false-positive publishing still occurs: https://www.youtube.com/watch?v=42QuXLucH3Q).

The problem is that until the fundamentals are more stable, making large claims about behaviour from small increments repeats the mistake of anthropomorphizing biological and computational systems before we understand them well enough to make those claims. I am saying the future is bright in this regard - we will likely understand these systems better and one day be able to make these claims, or counter-claims. And that is exciting.

Now, this is a separate topic/argument, but here is why I really care about these non-substantial but newsworthy claims: let's not jump the gun for credence. I read a PhD AI paper in 2011. It was the very furthest thing from making bold claims - people were so gloomy about AI. That is because AI was pretty much at its lowest in 2011, especially with cuts after the recession. It was a cold part of the "AI winter". Now that AI is roaring at full speed, people overclaim. This will cause a new, third AI winter. Trust me, it will - so many members of faculty I know started feeling this way even back in 2020. Doing this is harmful not only to the field but to our understanding, really.


This so much. There was another similar one recently which was also BS.


If GPT emulations of social experiments are not correct, policy decisions based on them will make them so.

“GPT said people would hate buses, so we halved their number and slashed transportation budget… Wow, do our people actually hate buses with passion!”

“A year ago GPT said people would not be worried about climate change, so we stopped giving it coverage and removed related social adverts and initiatives. People really don’t give a flying duck about climate change it turns out, GPT was so right!”

This is an oversimplification, of course; to say it with more nuance, anything socio- and psycho- is a minefield of self-fulfilling prophecies that ML seems to be nicely positioned to wreak havoc in. (But the small “this is not a replacement for human experiment” notice is going to be heeded by all, right?)

As someone wrote once, all you need for machine dictatorship is an LLM and a critical number of human accomplices. No need for superintelligence or robots.


> “GPT said people would hate buses, so we halved their number and slashed transportation budget… Wow, do our people actually hate buses with passion!”

You jest, but if you don't mind me going off on a tangent, this reminds me how in the summer 2020 post-lockdown-period the local authorities of Barcelona decided that to reduce the spread of COVID they had to discourage out-of-town people going to the city for nightlife... so they halved the number of night buses connecting Barcelona with nearby towns. Because, of course, making twice the number of people congregate at bus stops and making night buses even more crammed was a great way to reduce contagion. Also, as everybody knows, people's decision whether or not to party in town on a Friday night is naturally contingent on the purely rational analysis as to the number of available buses to get home afterwards.


Institutions have shown themselves to be poorly geared for coordinating and enacting consistent policy changes and avoiding unintended consequences under time pressure. Hopefully COVID was a lesson they learned from.

I remember how in Seoul city authorities put yellow tape over outdoor sitting areas in public parks, while at the same time cafes (many of which are next to parks, highlighting the hilarity in real time) were full of people—because another policy allowed indoor dining as long as the number of people in each party is small and you put on a mask while not eating and leave when you are finished (guess how well that all was enforced).


All you need for dictatorship in general is a critical number of human accomplices. I don’t see how an LLM in the mix would make it worse.

IMO mass communication technologies (radio, TV, internet) are much more important in building a dictatorship.


The quote was mostly a flourish (and apparently too open to interpretation to be useful).

In any case, it is about hypothetical “machine dictatorship” in particular, not human dictatorships you describe. Machine dictatorship traditionally invokes an image of “AGI” and violent robots forcing or eliminating humans with raw power and compute capabilities, and thus with no substantial need for accomplices (us vs. them). In contrast, it could be that the more realistic and probable danger from ML is in fact more insidious and prosaic.

What you say about human dictatorship is trivially true, but the quote is not about that.

> I don’t see how an LLM in the mix would make it worse

How about a thought experiment.

1. Take some historical persona you consider well-intentioned (for example, Lincoln), throw an LLM in that situation, and see if it could make it better

2. Take a person you consider a badly intentioned dictator (maybe that is Hitler), throw an LLM in that situation, and see if it could make it worse

Let me know what you find.


Don't forget the deceptive aura of objectivity that machines have. It's easier to issue a command when "the machine has decided" or "God has decided" rather than "I just made this up".


This. The point of the "AI" is that it may make humans more willing to go along with the orders.


Even a pair of dice helps in that regard.


> As someone wrote once, all you need for machine dictatorship is an LLM and a critical number of human accomplices. No need for superintelligence or robots.

If that dictatorship shows up, the real dictator will be a human - the one who hacks the AI to control it. (Whether hacking from the inside or outside, and whether hacking by traditional means, or by feeding it biased training data.)


In actuality though, GPT would likely be correct on the democratic will of the people for the things you cited. It’s literally just the blended average of human knowledge. What’s more democratic than that?

Meanwhile, it seems the bigger risk for dictatorship is the current system where we put a tiny group of elites who condescendingly believe they’re smarter than the rest of us in charge (“you will take the bus with your 3 kids and groceries in hand and you will like it”).

This is how you get do-nothing social-signaling policies for climate change (e.g. straws, bottle caps, grocery bags), which make urban elites feel good about themselves but are ironically actively harmful to getting the correct policies enacted (e.g. investment in nuclear).


> It’s literally just the blended average of human knowledge. What’s more democratic than that?

No, it's the 'blended average' of the texts it's been fed with.

To state the obvious: illiterate people did not get a vote. Terminally online people got plenty of votes.

And, GPT is also tuned to be helpful and to not get OpenAI in the news for racism etc, which is far from the 'blended average' of even the input texts.


> GPT would likely be correct on the democratic will of the people for the things you cited

This is a dangerous line of thought, if you extend it to “why bother actually asking people what they want, let’s get rid of voting and just use unfeeling software that can be pointed fingers at whenever things go wrong”.

> a tiny group of elites who condescendingly believe they’re smarter

I suppose I don’t disagree, a small group without a working democratic election process is how dictatorships work.

> you will take the bus with your 3 kids and groceries in hand and you will like it

Bit of a tangent from me, but it looks like you are mixing bits of city planner utopia with bits of, I guess, typical American suburban reality. In a walkable city planned for humans (not cars) the grocery store is just downstairs or around the corner, because denser living makes them profitable enough. When you can pop down for some eggs, stop by local bakery for fresh bread, and be back home in under 7 minutes, you don’t really want to take a major trip to Costco with all your kids to load up the fridge for the week. You could still drive there, of course, and I don’t think those “condescending elites”* frown too much on a fully occupied car (especially with kids), but unless you really enjoy road trips and parking lots you probably wouldn’t.

> do-nothing social signaling policies for climate change (eg. Straws, bottle caps, grocery bags)

Reducing use of plastic is not “do-nothing” for me. I’m not sure it has much to do with climate change but I don’t want microplastics to accumulate in my body or bodies of my kids. However, I can agree with you that these are only half-measures with good optics.

* Very flattering by the way, I can barely afford a car** but if seeing benefits to walkable city planning makes me a bit elite I’ll take it!

** If my lack of wealth now makes you think I’m some kind of socialist, well I can only give you my word that I am far from.


Everyone and their mom in advertising sold brands "GPT persona" tools for target-group simulation, which are basically just an API call. Think "chat with your target group" kinda stuff.

Hint: They like it because it's biased for what they want... like real marketing studies.


Yeah anyone who has used ChatGPT for more than 30 minutes of asking it to write poetry about King Charles rapping with Tupac and other goofy stuff has realized that it is essentially trained to assume that whatever you're saying to it is true and to not say anything negative to you. It can't write stories without happy endings, and it can't recognize when you ask it a question that contains a false premise. In marketing, I assume if you ask a fake target demographic if it will like your new product that is pogs but with blockchain technology, it will pretty much always say yes


I’ve noticed this in article summaries. It does seem to have some weird biases.

I’ve been able to get around that by directly asking it for pros/cons or “what are the downsides” or “identify any inconsistencies” or “where are the main risks”… etc

There’s also a complexity threshold where it performs much better if you break a simple question down into multiple parts. You can basically to prompt-based transformations of your own input to break down information and analyze it in different ways prior to using all of that information to finally answer a higher level question.

I wish ChatGPT could do this behind the scenes. Prompt itself “what questions should I ask myself that would help me answer this question?” And go through all those steps without exposing it to the user. Or maybe it can or already does, but it still seems like I get significantly better results when I do it manually and walk ChatGPT through the thought process myself.
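
If you want to script that decomposition rather than walking it through by hand, a minimal two-pass sketch looks something like this (the prompt wording and model name are placeholders, it assumes the openai Python client, and it is of course not how ChatGPT works internally):

    from openai import OpenAI

    client = OpenAI()

    def ask(prompt: str) -> str:
        out = client.chat.completions.create(
            model="gpt-4",  # placeholder model name
            messages=[{"role": "user", "content": prompt}],
        )
        return out.choices[0].message.content

    question = "Should we migrate this service from REST to gRPC?"  # example question

    # Pass 1: have the model propose the sub-questions it should answer first.
    subqs = ask("List 3-5 sub-questions you would need to answer before answering "
                f"this question, one per line:\n{question}")

    # Pass 2: answer the sub-questions, then feed those answers into the final response.
    worked = ask(f"Answer each of these briefly:\n{subqs}")
    final = ask(f"Given these intermediate answers:\n{worked}\n\n"
                f"Now answer the original question: {question}")
    print(final)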


If you can do that yourself, why do you still need to ask the chatbot? Genuine question, because in my mind that's where the heavy lifting happens, and you'd get to a conclusion in the process. All the bot can do is agree with you, and what purpose does that serve?


Another interesting case with this is an instance I had with Google Assistant's AI summary feature for group chats. In the group chat, my mom said that my grandma was in the hospital and my sister said she was going to go visit her. In the AI summary, my grandma was on vacation and my sister was in the hospital. Completely useless.


> it is essentially trained to assume that whatever you're saying to it is true and to not say anything negative to you.

Oh, it's actually worse than that: A given LLM probably has zero concept of "entities", let alone "you are one entity and I am another" or "statements can be truths or lies."

There is merely one combined token stream, and dream-predicting the next tokens. While that output prediction often resembles a conversation we expect between two entities with boundaries, that says more about effective mimicry than about its internal operation.


I agree there is limited modeling going on but the smoking gun is not on the fact that all there is to an LLM is mere "next token prediction".

In order to successfully predict the next token the model needs to reach a significant level of "understanding" of the preceding context and the next token is the "seed" of a much longer planned response.

Now, it's true that this "understanding" is not even close to what humans would call understanding (hence the quotes) and that the model behaviour is heavily biased towards productions that "sound convincing" or "sound human".

Nevertheless, LLMs perform an astounding amount of computation in order to produce that next token, and that computation happens in a high-dimensional space that captures a lot of "features" of the world derived from an unfathomably large and diverse training set. And there is still room for improvement in collecting, cleaning and/or synthesizing an even better training corpus for LLMs.

Whether the current architecture of LLMs will ever be able to truly model the world is an open question but I don't think the question can be resolved just by pointing out that all the model does is produce the next token. That's just an effective way researchers found to build a channel with the external world (humans and the training set) and transform to and from the high-dimensional reasoning space and legible text.


I think this is just wrong.

Any "understanding" is a mirage.

Case in point. prompt: "Is it possible to combine statistical mechanics, techno music production and pizza?"

If the model had even the slightest understanding of the world the answer would just be no. Instead gpt4o:

"Yes, it is possible to combine statistical mechanics, techno music production, and pizza, though it might require some creative thinking. Here’s how these three seemingly unrelated things could be connected" then gives a list of complete nonsense.

The trick is that it can't say no because it doesn't understand ANYTHING.

It has no understanding of the difference between combining pizza with two completely disparate nonsense subjects/items and combining pizza with two other food items. The latter would at least seem, in the response, to have a mirage of high-dimensional "understanding" derived from data about "food".


I just asked ChatGPT-4o and the answers were perfectly logical, although not creative at the level of a creative human (but many humans are not that creative either).

For example one of the outputs:

"Host an event where statistical mechanics concepts are explained or demonstrated while making pizzas, all set to a backdrop of live techno music. The music could be dynamically generated based on real-time data from the pizza-making process, perhaps using sensors to monitor heat, time, or the distribution of toppings, with this data influencing the techno tracks played."

It's not doing such a bad job trying to mix up three unrelated concepts. It knows music is not an ingredient for the pizza, and it knows that pizza requires heat for cooking and that heat is explained by statistical mechanics.

Sure you can nitpick and find nuances that are wrong but honestly an average human asked to come up with something for a school assignment would probably not do a much better job.

Now, there are clearly better examples - utter failures that even the best models trip on - which reveal that they are not even close to understanding and modeling the world correctly.

My point is just that their weakness cannot merely be explained by the next-token prediction process.


"Is it possible to combine statistical mechanics, techno music production and pizza?"

You just did.


Yes, but with the caveat that in some very specific cases it will say no.

I spent a good deal of time trying to get it to believe there was a concept in the law of "praiseworthy homicide". I even had (real) citations to a law textbook. It refused to believe me.

Given the massive selling point of ChatGPT to the legal profession, and the importance of actually being right, OpenAI certainly reduces the "high trait agree-ability" in favor of accuracy in this particular area.


There’s a way to apply the concept of praiseworthy homicide – metaphorically – to your battle with it.


here's a story with a sad ending, called sad musical farewell.

https://chatgpt.com/share/0d651c67-166f-4cef-bc8c-1f4d5747bd...


I should have clarified that I meant that it has trouble writing stories with bad endings unless you ask for them directly and specifically, which can be burdensome if you're trying to get it to write a story about something specific that would naturally have a sad ending.


Apparently counterexamples are very unappreciated. I also gave a counterexample for each of their claims, but my comment got flagged immediately.

https://news.ycombinator.com/item?id=41187549


As a probabilistic machine, singular counterexamples do not mean much. Two people can ask LLMs the same question and get different results.


Reminds me of:

> Out of One, Many: Using Language Models to Simulate Human Samples

> We propose and explore the possibility that language models can be studied as effective proxies for specific human sub populations in social science research. Practical and research applications of artificial intelligence tools have sometimes been limited by problematic biases (such as racism or sexism), which are often treated as uniform properties of the models. We show that the "algorithmic bias" within one such tool -- the GPT 3 language model -- is instead both fine grained and demographically correlated, meaning that proper conditioning will cause it to accurately emulate response distributions from a wide variety of human subgroups. We term this property "algorithmic fidelity" and explore its extent in GPT-3. We create "silicon samples" by conditioning the model on thousands of socio demographic backstories from real human participants in multiple large surveys conducted in the United States. We then compare the silicon and human samples to demonstrate that the information contained in GPT 3 goes far beyond surface similarity. It is nuanced, multifaceted, and reflects the complex interplay between ideas, attitudes, and socio cultural context that characterize human attitudes. We suggest that language models with sufficient algorithmic fidelity thus constitute a novel and powerful tool to advance understanding of humans and society across a variety of disciplines.

https://arxiv.org/abs/2209.06899
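
The mechanical claim is just that, once conditioned on a backstory, the model's response distribution should track the corresponding human subgroup's distribution. A toy comparison (with invented data, nothing from the paper) would look like:

    from collections import Counter

    # Invented illustrative data: answers from real participants vs. "silicon" ones
    # generated by conditioning the model on matching backstories.
    human_sample   = ["agree", "agree", "disagree", "neutral", "agree", "disagree"]
    silicon_sample = ["agree", "neutral", "agree", "disagree", "agree", "agree"]

    def proportions(sample):
        counts = Counter(sample)
        return {k: counts[k] / len(sample) for k in ("agree", "neutral", "disagree")}

    print("human:  ", proportions(human_sample))
    print("silicon:", proportions(silicon_sample))
    # "Algorithmic fidelity" is roughly the claim that these distributions line up
    # across many subgroups and questions, not just on average.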



why not just link directly? https://existentialcomics.com/comic/557


A YC company called Roundtable tried to do this.[1]

The comments were not terribly supportive. They’ve since pivoted to a product that does survey data cleaning.

[1] https://news.ycombinator.com/item?id=36865625


A social science experiment in and of itself. A fine thread of tragedy in the rich tapestry of enterprise.


Products like this make me pretty cynical about VCs' ability to evaluate novel technical products. Any ML engineer who spent 5 minutes understanding it would have rejected the pitch.


I’m an ML engineer who’s spent more than 5 minutes thinking about this idea and would not have automatically rejected the pitch.


There are so many basic questions raised in the Launch HN thread that didn’t have good answers. It indicates to me that YC didn’t raise those questions, which is a red flag.


Can someone translate for us non-social-scientists in the audience what this means? "3. Treatment. Write a message or vignette exactly as it would appear in a survey experiment."

Probably would be sufficient to just give a couple examples of what might constitute one of these.

Sorry, I know this is probably basic to someone who is in that field.


A treatment might look like

"In the US, XXX are much more likely to be unemployed than are YYY. The unemployment rate is defined as the percentage of jobless people who have actively sought work in the previous four weeks. According to the U.S. Bureau of Labor Statistics, the average unemployment rate for XXX in 2016 was five times higher than the unemployment rate for YYY"

"How much of this difference do you think is due to discrimination?"

In this case you'd fill in XXX and YYY with different values and show those treatments to your participants based on your treatment assignment scheme.
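
And the "treatment assignment scheme" part is usually just randomization (sometimes blocked or stratified) over those fill-in values. A toy version, with invented group labels standing in for XXX/YYY:

    import random

    # Invented placeholder group pairs standing in for the XXX/YYY fill-ins above.
    group_pairs = [("group A", "group B"), ("group C", "group D")]

    vignette = ("In the US, {xxx} are much more likely to be unemployed than are {yyy}. ... "
                "How much of this difference do you think is due to discrimination?")

    def assign_treatment(participant_id: int) -> str:
        # Simple random assignment; real designs pre-register the scheme.
        random.seed(participant_id)          # deterministic per participant
        xxx, yyy = random.choice(group_pairs)
        return vignette.format(xxx=xxx, yyy=yyy)

    print(assign_treatment(1))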


Same for me. I had no idea what was being asked.


But do the experiments replicate better in LLMs than in actual humans? :D

We should expect LLMs to be pretty good at repeating back to us the stories we tell about ourselves.


I don't think "replicate" is the appropriate word here.


I'm sure Philip K Dick would disagree.


Only until you realize that repli-cant.


Dick would hate these guys lol.


So did ELIZA[0] about sixty (60) years ago.

0 - https://en.wikipedia.org/wiki/ELIZA


So, we finally found the cure for the replication crisis in social sciences: just run them on LLMs.


At least they will confirm the experiments they have been trained on.


Maybe that will help extend the veneer of science on social studies for a few more years before the echo chamber implodes.


Problem is that many policy decisions are based on bad science in the social sciences, because it provides an excuse. The validity is completely secondary.


Why stop at social science? I say we make a questionnaire, give it to the GPT over a broad range of sampling temperatures, and collect the resulting score:temperature data. From that dataset, we can take people's temperatures over the phone with a short panel of questions!

(this is parody)


Accompanying working paper that demonstrates 85% accuracy of GPT-4 in replicating 70 social science experiment results: https://docsend.com/view/qeeccuggec56k9hd


Do you even get 85% replication rate with humans in social science? Doesn't seem right.

At least it can give them hints of where to look, but going that way is very dangerous, as it gives LLM operators the power to shape social science.


The study isn't trying to do replication, but rather seems to have tested the rate at which GPT-4 predicts human responses to survey studies. After reading it, the authors really were not clear on how they fed the studies whose responses they were trying to predict into the LLM. The data they used for training was also not clear, as they dedicated only a few lines to it. For 18 pages, there was barely any detail on the methods employed. I also don't believe the use of the word "replication" makes any sense here.


I wonder if this could be used for testing marketing or UX actions?


Is that the solution to social science's replication problem?


With the temperature parameter effectively set to 0, it may finally be possible!


Were those experiments in the training set?

If so, how close was the examined result to the record the model was trained on?

Some interesting insights there, I think.


The answers to your questions are in the paper linked in the first line of the app


> Accuracy remained high for unpublished studies that could not appear in the model's training data (r = 0.90).


Did they test whether GPT4 replicated already existing social science experiments? If so, this might have happened because the experiment was in the training data


That's only for known situations.

E.g. try getting LLMs to find availability hours when you have the start and end time of each day.

LLMs don't really understand that you need to use day 1's end hour and then the start hour of the next day.
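
Concretely, the pairing that trips them up is just "day N's end, day N+1's start", which is trivial in code (times made up for illustration):

    from datetime import datetime

    # Made-up (start, end) times of each day's scheduled block.
    days = [
        (datetime(2024, 8, 5, 9, 0), datetime(2024, 8, 5, 17, 30)),
        (datetime(2024, 8, 6, 9, 0), datetime(2024, 8, 6, 17, 0)),
        (datetime(2024, 8, 7, 10, 0), datetime(2024, 8, 7, 16, 0)),
    ]

    # Availability runs from day N's end hour to day N+1's start hour.
    gaps = [(end, next_start) for (_, end), (next_start, _) in zip(days, days[1:])]

    for free_from, free_until in gaps:
        print(f"available from {free_from} until {free_until}")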


What happened to running virtual experiments, aka running filters on NSA metadata from all conversations of all mankind?


Nature's technology vs. Human Technology

Emergent vs Reductionist

LLMs are Emergent

Therefore, LLMs break the mold of human technology and enter the realm of nature's technology

We should talk about AI Beings not apps if we wish to stay with this analogy.

We can always say that's BS and that the analogy does not reach that deep

Or we can take it for what it is and admit LLMs are not similar in their behavior to any tool that humans have created to date, all the way from Homo habilis to Homo sapiens sapiens. Tools are predictable. Intelligences are not.


The good news is that they should be able to replicate real-world events to validate whether this is true or not.

Tesla FSD is a good example of this in real life. You can measure how closely the car acts like a human based on interventions and crashes that were due to un-humanlike behavior; likewise, in the first round of the robotaxi fleet, which will have a safety driver, you can measure how many people complain that the driver was bad.


I think it is far, far more likely that it replicates social science experiments well enough to simulate people


Not so sure about that, but I can absolutely see people getting addicted to them.


Ooooor maybe, testing if the experiments are similar to what was in the corpus.


But does it replicate _better_ than really running the experiment again?

Joking…but not joking.


This is gonna end well…


This says more about how social science data is manipulated than about the usefulness of LLMs.


Please don't. Need I remind you of the joke that social science is not real science?


I love that anyone can just write whatever they want and post it online.

GPT-4 can stand in for humans. Charlie Brown is mentioned in the Upanishads. The bubonic plague was spread via telegram. Easter falls on 9/11 once every other decade.

You can just write shit and hit post and boom, by nature of it being online someone will entertain it as true, even if only briefly so. Wild stuff!


Well that’s one way to solve the replication crisis


And yet it can't replicate a human support agent. Or even a basic search function for that matter ;)


garbage in, eh?


Psychohistory


Source: trust us. This is some bullshit science.


Is it possible to train an LLM that is minimally biased and that could assume various personas for the purpose of the experiments? Then I imagine it’s just some prompt engineering no?



