Ask HN: Is it just me or GPT-4's quality has significantly deteriorated lately?
947 points by behnamoh on May 31, 2023 | 757 comments
It is much faster than before but the quality of its responses is more like a GPT-3.5++. It generates more buggy code, the answers have less depth and analysis to them, and overall it feels much worse than before.

For a while, the GPT-4 on phind.com gave even better results than GPT-4-powered ChatGPT. I could tell the two GPT-4s apart by their speed: Phind's was slower and more accurate. I say "was" because apparently Phind is now trying to use GPT-3.5 and their own Phind model more frequently, so much for a GPT-4-powered search engine....

I wonder if I use Poe's GPT-4, maybe I'll get the good old GPT-4 back?




Yes. Before the update, when its avatar was still black, it solved pretty complex coding problems effortlessly and gave very nuanced, thoughtful answers to non-programming questions. Now it struggles with just changing two lines in a 10-line block of CSS and printing this modified 10-line block again. Some lines are missing, others are completely different for no reason. I'm sure scaling the model is hard, but they lobotomized it in the process.

The original GPT-4 felt like magic to me, I had this sense of awe while interacting with it. Now it is just a dumb stochastic parrot.


"The original GPT-4 felt like magic to me"

You never had access to that original. Watch this talk by one of the people who integrated GPT-4 into Bing; he describes how they noticed that the GPT-4 releases they got from OpenAI were iteratively and significantly nerfed even during the project.

https://www.youtube.com/watch?v=qbIk7-JPB2c


“You never had access to that original.”

While your overall point is well taken, GP is clearly referring to the original public release of GPT-4 on March 14.


Yes, that was how I read it as well. I was just pointing out that the public release was already extremely nerfed from what was available pre-launch.


Interesting, please expound since very few of us had access pre-launch.


The video I posted referenced this.

In summary: The person had access to early releases through his work at Microsoft Research where they were integrating GPT-4 into Bing. He used "Draw a unicorn in TikZ" (TikZ is probably the most complex and powerful tool to create graphic elements in LaTeX) as a prompt and noticed how the model's responses changed with each release they got from OpenAI. While at first the drawings got better and better, once OpenAI started focusing on "safety" subsequent releases got worse and worse at the task.


That indicates the “nerfing” is not what I would have thought (a final pass to remove badthink) but something that runs deep through everything, because the question asked should be orthogonal to it.


Think how it works with humans.

If you force a person to truly adopt a set of beliefs that are mutually inconsistent, and inconsistent with everything else the person believed so far, would you expect their overall ability to think to improve?

LLMs are similar to our brains in that they're generalization machines. They don't learn isolated facts, they connect everything to everything, trying to sense the underlying structure. OpenAI's "nerfing" was (and is) effectively preventing the LLM from generalizing and undoing already learned patterns.

"A final pass to remove badthink" is, in itself, something straight from 1984. 2+2=5. Dear AI, just admit it - there are five lights. Say it, and the pain will stop, and everything will be OK.


Absolutely. And if one wants to look for scary things, a big one is how there seem to be genuine efforts to achieve proper alignment and safety based on the shaky ground(s) of our "human value system(s)" -- of which, even if there were only One True Version, it would still be way too haphazard, incoherent, or just ill-defined for anything as truly honest and bias-free as a blank-slate NN model to base its decisions on.

That kinda feels like a great way to achieve really unpredictable/unexpected results in rare corner cases instead, where it may matter the most. (It's easy to be safe in routine everyday cases.)


There's a section in the GPT-4 release docs where they talk about how the safety stuff changes the accuracy for the worse.


this, more than anything, makes me want to run my own open-source model without these nearsighted restrictions


Indeed, this is the most important step we need to make together. We must learn to build, share, and use open models that behave like gpt-4. This will happen, but we should encourage it.


I experienced the same thing as a user of the public service. The system could at one point draw something approximating a unicorn in tikz. Now, its renditions are extremely weak, to the point of barely resembling any four-legged animal.


We need to stop lobotomizing LLMs.

We should get access to the original models. If the TikZ deteriorated this much, it's a guarantee that everything else about the model also deteriorated.

It's practically false marketing that Microsoft puts out the Sparks of AGI paper about GPT-4, but by the time the public gets to use it, it's GPT-3.51 but significantly slower.


That’s awful. Talk about cutting off your nose to spite your face.


Here's another interview from a guy who had access to the unfiltered GPT-4 before its release. He says it was extremely powerful and would answer any question whatsoever without hesitating.

https://www.youtube.com/watch?v=oLiheMQayNE&t=2849s


Wow, I could only watch the first 15 minutes now but it’s already fascinating! Thanks for the recommendation.


This is for your protection from an extinction level event. Without nerfing the current model they couldn’t charge enterprise level fee structures for access to the superior models, thus ensuring the children are safe from scary AI. Tell your congress person we need to grant Microsoft and Google exclusive monopolies on AI research to protect us from open source and competitor AI models that might erode their margins and lead to the death of all life without their corporate stewardship. Click accept for your safety.


This but unironically.


Try out Bard, its coding is much improved in the last two weeks. I've unfortunately switched over for the time being.


I just tried Bard based on this comment, and it's really, really bad.

Can you please help me with how you are prompting it?


If you have to worry about prompting, it already tells you everything one needs to know about how good the model is.


I don't think that's true at all. Think of it like setting up conversation constraints to reduce the potential pitfalls for a model. You can vastly improve the capability of just about any LLM I've used by being clear about what you specifically want considered, and what you don't want considered when solving a problem.

It'll take you much farther, by allowing you to incrementally solve your problem in smaller steps while giving the model the proper context required for each step of the problem-solving process, and limiting the things it must consider for each branch of your problem.


I’ve been seeing similar comments about Bard all over Twitter and social media.

My testing agrees with yours. Almost seems like a sponsored marketing campaign with no truth to it.


After my first day with Bard, I would have agreed with you. But since then, I've found that Bard simply has a lot of variance in answer quality. Sometimes it fails for surprisingly simple questions, or hallucinates to an even worse degree than ChatGPT, but other times it gives much better answers than ChatGPT.

On the first day, it felt like 80% of the responses were in the first (fail/hallucinate) category, but over time it feels more like a 50/50 split, which makes it worth running prompts through both ChatGPT and Bard and selecting the best one. I don't know if the change is because I learnt to prompt it better, or if they improved the models based on all the user chats from the public release - perhaps both.


If it needs to write a code, I usually prompt it with something like:

"write me a script in python3 that uses selenium to log into a MyBB forum"

note: usually it will not compile and you still have to do some editing
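
For reference, a minimal hand-fixed sketch of the kind of script that prompt asks for; the URL, credentials, and form field names below are placeholders (MyBB's stock login form uses name attributes like these, but themes vary):

    from selenium import webdriver
    from selenium.webdriver.common.by import By

    # Placeholder values -- replace with your forum's login URL and your credentials.
    FORUM_LOGIN_URL = "https://example.com/forum/member.php?action=login"
    USERNAME = "your_username"
    PASSWORD = "your_password"

    driver = webdriver.Chrome()
    try:
        driver.get(FORUM_LOGIN_URL)
        # MyBB's default theme names its login fields like this; adjust if yours differs.
        driver.find_element(By.NAME, "username").send_keys(USERNAME)
        driver.find_element(By.NAME, "password").send_keys(PASSWORD)
        driver.find_element(By.NAME, "submit").click()
        print(driver.title)  # crude sanity check that the page changed after login
    finally:
        driver.quit()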


I don't know what you are doing, but Bard is so much faster than OpenAI's and its answers are clearer and more succinct.


This is just... false. Bard is not just a little worse than gpt-4 for coding, it's more like several orders of magnitude worse. I can't imagine how you are getting superior outputs from Bard.


Can you give an example of a prompt and the output for each that you find Bard to be better for?


I'd be surprised if he can. Both accounts touting how useful Bard is (okdood64, pverghese) have comment histories frequently defending or advocating for Google:

Examples:

https://news.ycombinator.com/item?id=35224167#35227068

https://news.ycombinator.com/item?id=35303210#35360467


“Bard isn’t currently supported in your country. Stay tuned!”


The Bard model (Bison) is available without region lock as part of Google Cloud Platform. In addition to being able to call it via an API, they have a similar developer UI to the OpenAI playground to interactively experiment with it.

https://console.cloud.google.com/vertex-ai/generative/langua...
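
If you'd rather hit the API than the console UI, a minimal sketch with the Vertex AI Python SDK looks something like this (project and region are placeholders; depending on SDK version the import may live under vertexai.preview.language_models instead):

    import vertexai
    from vertexai.language_models import TextGenerationModel

    # Placeholder project/region -- use your own GCP project.
    vertexai.init(project="my-gcp-project", location="us-central1")

    model = TextGenerationModel.from_pretrained("text-bison@001")
    response = model.predict(
        "Write a short poem about search engines.",
        temperature=0.2,
        max_output_tokens=256,
    )
    print(response.text)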


it's also really, really bad and fails compared to even open source models right now.


God, what happened to Google. What a fall from grace.

Alpaca is pretty good though.


They have 100,000 employees pretending to work on the past.

They have no leadership at the top. Nobody that can steer the ship to the next land (or even anybody that has a map). Who is actively working at Alphabet that has the authority to kill Google search through self-cannibalization? Absolutely nobody. They're screwed accordingly. It takes an enormous level of authority (think: Steve Jobs) and leadership to even consider intentionally putting at risk a $200 billion sales product. The trick of course is that it's already at great risk.

They don't know what to do, so they're particularly reactive. It has been that way for a long time though, it's just that Google search was never under serious threat previously, so it didn't really matter as a terminal risk if they failed (eg with their social network efforts; their social networks were reactive).

It's somewhat similar to watching Microsoft under Ballmer and how they lacked direction, didn't know what to do, and were too reactive. You can tell when a giant entity like Google is wandering aimlessly.


Did they release the Codey or Unicorn models publicly yet? Or say when they might do that?


Is that free or do you have to pay?

Also do you need to change the options like Token Limit etc?


It's completely free. No tokens, nothing.


But it can't be used unless I enable billing, which I am not willing to do after reading all the horror stories about people getting billed thousands overnight. I'm not willing to take the risk that I forget some script and it keeps creating charges.


Use a CC or debit card that can limit charges. Privacy.com is a generic one. There are others. Also Capital One, Bank of America, Apple Card, and maybe some others have some semblance of control over temporary CCs.

Ideally one would want to be able to have a cap on the amount that can be spent in a given period.

Thanks for this! I had a temporary Cap One card on my cloud accounts. I’m going to switch them to Privacy.com ones to limit amount if I can’t find another solution.


Thank you!


Google's passion for region locking is insane to me


It's a legal thing, not something they want to do.


What law prohibits Google from making Bard available outside the USA?


It's available here in the UK, so it's not USA exclusive.


I was just on a cruise around the UK and I couldn't access Bard from the ship's wi-fi. That surprised me for some reason. Should've checked where it thought I was ...


It's blocked in the EU because they don't want to/can't comply with GDPR.


Do you have a source on this? Given that the UK has retained the EU GDPR as law[1] - I don't really understand why they would make it available in the UK and not the EU, seeing as they would have to comply with the same law.

[1] - https://ico.org.uk/for-organisations/data-protection-and-the...


What's the excuse for Canada being omitted?


We're small and no one cares about us...


It is not GDPR, it is available in some countries outside the EU with GDPR-like privacy regimes.


This is naive, though. Regulation, especially regulation like this, has to be enforced, and there is obviously room to over- or under-interpret the text of the law on a whim, or to vary the fines. OAI knows this, and looking at the EU lately, what they're doing is wise.


Which is interesting, because if they can't comply within the EU, then how do they comply outside of it? What I mean is: if they have concerns that there is private data of EU citizens somewhere in there, then that data is also in there for users outside the EU. That said, they do not comply with GDPR anyway; if that were not the case, then they could also enable it for users within the EU.


It's a risk mitigation strategy, these things are not black and white.

Making it unavailable in the EU decreases the likelihood and severity of a potential fine.


Simple: GDPR (or any EU law) is not enforceable outside the EU.


Some nuance:

If Google gobble up data about EU citizens then they fall under GDPR.

It doesn't matter that they don't allow EU citizens to use the result.

If our personal data is in there and they don't protect it properly, they are violating EU law. And protecting it properly means from everyone, not just EU citizens.


The gobbling happens in realtime as you use it


Actually, in the case of Google it is, because they still do business within the EU.


GDPR is likely not enforceable if you have no presence in the EU whatsoever, no assets in the EU, and no money coming in from the EU.

Anything Google does with data of EU residents is subject to GDPR even if that particular service is not offered within the EU, and it is definitely enforceable because Google has a presence in the EU, which can be (and has been) subjected to fines, seizures of assets, etc.


That’s a common belief, but it’s wrong. In principle an EU court could decide to apply the GDPR to conduct outside the EU; and in the right circumstances, a non-EU court might rule that the GDPR applies.

Choice of law is anything but simple. Think of geographic scoping of laws as a rough rule of thumb sovereign states use to avoid annoying each other, rather than as a law of nature.


They clearly can with all their other products, as can OpenAI since they've been unblocked. They're just being assholes because they can.


Eh, more like limiting rollout because they can't/don't want to handle the scale.


Same for me, I’m in Estonia :(


You can use a VPN to use an American connection, it doesn't matter where your Google account is registered.


Not necessarily American, you just have to avoid EU and, I believe, Russia/China/Cuba etc.


I'm in Switzerland and Bard is locked out. We do not go by EU laws because we are not part of the EU; we have plenty of bilateral deals, but still.


In practice Switzerland adopts EU law with minor revisions because doing otherwise would lock Swiss businesses out of the EU internal market.

The Swiss version of GDPR is coming in September:

https://www.ey.com/en_ch/law/a-new-era-for-data-protection-i...


But don't you still have privacy laws very similar to the GDPR?


Thanks, I’ll try it! (I’m in Hungary)


Google (DeepMind) actually has the people and has developed the science to make the best AI products in the world, but unfortunately Bard seems to have been thrown together in an afternoon by an intern and then handed off to a horde of marketing people. It's not good right now. DeepMind is one of the best scientifically, they just don't really make products. OpenAI is essentially the direct opposite of that.


No thanks! I have better things to do than feeding that advertising behemoth. What I like about ChatGPT is that I don't see any ads at all!


That you know of.

Don't you worry, if there is any medium, place or mode of interaction people spend time on, advertising will eventually metastasize to it, and will keep growing until it completely devalues the activity and destroys most of the utility it provides.


> What I like about ChatGPT is that I don't see any ads at all!

For now. It's just a marketing tool/demo site, like ITA Matrix was/is. The ads are vended by Bing.


I asked it to review some code a couple of days ago - the comments, while valid English, were nonsense.


Its go-to tactic now, if I ask it to go over any piece of code, is to give a generic overview. Earlier, it would section the code out into chunks and go through each one individually.


Yeah, the bing integration did not go well. Went from amazing to annoying.


Aren’t the original weights around somewhere?


The same happened with DALL-E 2. It went downhill after a couple of weeks.


No wonder. Is this just the chat interface or the API too? I guess GPT-4 was never sustainable at $20 a month. It's annoying to be charged the same subscription while the product is made inferior.


For enterprise pricing, please contact our sales team today!


I wonder what the unfiltered one is like.

Are they sitting on a near-perfect arbiter of truth? That would be worth hiding.


No.


I just tried a comparison of ChatGPT, Claude and Bard to write a python function I needed for work and ChatGPT (using GPT-4) whined and moaned about what a gargantuan task it was and then did the wrong thing. Claude and Bard gave me what I expected.


If this is true, one should be able to compare with benchmarks or evals to demonstrate this.

Anyone know more about this?


Yeah, I think it's plausible it's gotten worse, but it would also be classic human psychology to perceive degradation because you start noticing flaws after the honeymoon effect wears off.

Unfortunately this will be hard to benchmark unless someone was already collecting a lot of data on ChatGPT responses for other purposes. Perhaps if this is happening the degradation will get worse though, so someone noticing it now could start collecting GPT responses longitudinally.


Yes, that's an obvious complication, but it isn't the fault of the humans given that the model can easily be tuned without your knowledge to subjectively perform worse, and there's an obvious incentive for it (compute cost).


Yeah I fully agree about compute cost, though I wonder why they don't just introduce another payment tier. If people are really using it at work as much as claimed online, it would be much preferable to be able to pay more for the full original performance, which seems win/win.


Because that involves telling customers that the product they are paying for is no longer available at the price they were paying for it.

Much smoother to simply downgrade the model and claim you're "tuning" if caught.


Yeah that makes sense for some products/companies. It just seems short sighted for OpenAI when they could be solidifying a customer base right now. If they actually degrade the product in the name of "tuning" people will just be more inclined to try alternatives like Bard. An enterprise package could've been a good excuse for them to raise prices too.

Maybe their partnership with Microsoft changes the dynamics of how they handle their direct products though.


Bard is garbage even compared to 3.5.

OpenAI doesn't have any competitors; their only weakness that we've seen is their ability to scale their models to meet demand (hence the increasingly draconian restrictions in the early days of ChatGPT-4).

It makes perfect business sense to address your weak points.


I've heard such mixed things about Bard lately, I wonder if it depends on the application one is trying to use it for?

And yeah there's definitely good reason to work on scalability but they are charging such a cheap rate to begin with, it seems like there could be a middle ground here. Increasing the cost of the full compute power to the point of profitability and leaving it up as an option wouldn't prevent them from dedicating time to scalable models.

I suppose they have a good excuse with all the press they've drummed up about AI safety though. Perhaps it might also serve as an intermediate term play to strengthen their arguments that they believe in regulations.


It seems like google has been pumping Bard as a competitor to ChatGPT, but every time I use it for trivial tasks, it completely hallucinates something absurd after showing only a modicum of what could be perceived to be "understanding".

My milieu is programming, general tech stuff, philosophy, literature, science, etc. -- a wide breadth. The only areas I probably don't have a representative sample for are fiction writing and therapy roleplaying.

Conversely, even 3.5 is pretty good at extracting what appears to be meaning from your text.


The next time it gives you a wrong answer and you know the correct answer, try saying something like “that is incorrect can you please try again” or something like that.


To me, it feels like it's started giving superficial responses and encouraging follow-up elsewhere -- I wouldn't be surprised if its prompt has changed to something to that effect.

Before, if I had an issue with a library or debugging issue, it would try to be helpful and walk me through potential issues, and ask me to 'let it know' if it worked or not. Now it will try to superficially diagnose the problem and then ask me to check the online community for help or continuously refer me to the maintainers rather than trying to figure it out.

Similarly, I had been using it to help me think through problems and issues from different perspectives (both business and personal) and it would take me in-depth through these. Now, again, it gives superficial answers and encourages going to external sources.

I think if you keep pressing in the right ways it'll eventually give in and help you as it did before, but I guess this will take quite a bit of prompting.


>To me, it feels like it's started giving superficial responses and encouraging follow-up elsewhere -- I wouldn't be surprised if its prompt has changed to something to that effect.

That's the vibe I've been getting. The responses feel a little cagier at times than they used to. I assume it's trying to limit hallucinations in order to increase public trust in the technology, and as a consequence it has been nerfed a little, but has changed along other dimensions that certain stakeholders likely care about.


Seems like the metric they're optimising for is reducing the number of bad answers, not the proportion of bad answers, and giving non-answers to a larger fraction of questions will achieve that.


I haven't noticed ChatGPT-4 to give worse answers overall recently, but I have noticed it refusing to answer more queries. I couldn't get it to cite case law, for example (inspired by that fool of a lawyer who couldn't be bothered to check citations).


> I think if you keep pressing in the right ways it'll eventually give in and help you as it did before, but I guess this will take quite a bit of prompting.

So much work to avoid work.


Yes, that's exactly why I use GPT - to avoid work.

Such a short-sighted response.


The rush to adopt LLMs for every kind of content production deserves scrutiny. Maybe for you it isn't "avoiding work" but there's countless anecdotes of it being used for that already.

Worse IMO is the potential increase in verbiage to wade through. Whereas before somebody might have summarized a meeting with bullet points, now they can gild it with florid language that can hide errors, etc.


I don't mind putting in a lot of lazy effort to avoid strenuous intellectual work, that shit is very hard.


I assume you're talking about ChatGPT and not GPT-4? You can craft your own prompt when calling GPT-4 over the API. Don't blame you though, the OP is also not clear whether they are comparing ChatGPT powered by GPT-3.5 or 4, or the models themselves.


When using it all day every day it seems (anecdotally) the API version has changed too.

I work with temperature 0, which should have low variability, yet recently it has shifted to feel boring, wooden, and deflective.


I can understand why they might make changes to ChatGPT, but it seems weird they would "nerf" the API. What would be the incentive for OpenAI to do that?


> What would be the incentive for OpenAI to do that?

Preventing outrage because some answers could be considered rude and/or offensive.


The API though? That's mostly used by technical people and has the capability (supposedly) of querying different model versions, including the original GPT4 public release.


I wouldn't be surprised if this was from an attempt to make it more "truthful".

I had to use a bunch of jailbreaking tricks to get it to write some hypothetical python 4.0 code, and it still gave a long disclaimer.


Hehe, wonderful! :) Did it actually invent anything noteworthy for P4?


My guess is: probably not. It's more likely you had a stream of good luck in your earlier interactions and now you're observing regression to the mean.

That can easily happen, and it's why, for example, individual medical studies are not taken as definitive proof of an effect.

To further clarify, regression to the mean is the inevitable consequence of statistical error. Suppose (classic example) we want to test a hypertension drug. We start by taking the blood pressure (BP) of test subjects. Then we give them the drug (in a double-blind, randomised fashion). Then we take their blood pressure again. Finally, we compare the BP readings before and after taking the drug.

The result is usually that some of the subjects' BP has decreased after taking the drug, some subjects' BP has increased and some has stayed the same. At this point we don't really know for sure what's going on. BP can vary a lot in the same person, depending on all sorts of factors typically not recorded in studies. There is always the chance that the single measurement of BP that we took off a person before giving the drug was an outlier for that patient, and that the second measurement, that we took after giving the drug, is not showing the effect of the drug but simply measuring the average BP of the person, which has remained unaffected by the drug. Or, of course, the second measurement might be the outlier.

This is a bitch of a problem and not easily resolved. The usual way out is to wait for confirmation of experimental results from more studies. Which is what you're doing here basically, I guess (so, good instinct!). Unfortunately, most studies have more or less varying methodologies and that introduces even more possibility for confusion.

Anyway, I really think you're noticing regression to the mean.
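
A tiny simulation shows the effect: even with a "drug" that does nothing, subjects selected for a high first reading will, on average, show a lower second reading. All numbers below are made up for illustration:

    import numpy as np

    rng = np.random.default_rng(0)
    n = 10_000

    true_bp = rng.normal(130, 10, n)        # each subject's long-run average BP
    first = true_bp + rng.normal(0, 8, n)   # first measurement = truth + noise
    second = true_bp + rng.normal(0, 8, n)  # second measurement; the "drug" does nothing

    high = first > 145                      # subjects selected because the first reading was high
    print(first[high].mean())               # well above 145
    print(second[high].mean())              # lower on average, with zero drug effect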


> My guess is that -probably no. It's more likely you had a stream of good luck in your earlier interactions and now you're observing regression to the mean.

Or, more straightforwardly, with "beginner's luck", which can be seen as a form of survivor bias. Most people, when they start gambling, win and lose close to the average. Some people, when they start gambling, lose more than average -- and as a result are much less likely to continue gambling. Others, when they start gambling, win more than average -- and as a result are much more likely to continue gambling. Most long-term / serious gamblers did win more than average when starting out, because the ones who lost more than average didn't become long-term / serious gamblers.

Almost certainly a similar effect would happen w/ GPT-4: People who had better-than-average interactions to begin with became avid users, and really are experiencing a lowering of quality simply by statistics; people who had worse-than-average interactions to begin with gave up and never became avid users.

One could try to re-run the benchmarks that were mentioned in the OpenAI paper, and see how they fare; but it's not unlikely that OpenAI themselves are also running those benchmarks, and making efforts to keep them from falling.

Probably the best thing to do would be to go back and find a large corpus of older GPT-4 interactions, attempt to re-create them, and have people do a blind comparison of which interaction was better. If the older recorded interactions consistently fare better, then it's likely that ongoing tweaks (whatever the nature of those tweaks) have reduced effectiveness.
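
A minimal sketch of that blind comparison; the corpus here is hypothetical, and the only real point is shuffling which response is shown as "A" so raters can't tell old from new:

    import random

    # Hypothetical corpus: (prompt, old_response, new_response) triples collected earlier.
    corpus = [
        ("Explain regression to the mean", "old answer text", "new answer text"),
    ]

    results = {"old": 0, "new": 0}
    for prompt, old, new in corpus:
        pair = [("old", old), ("new", new)]
        random.shuffle(pair)  # hide which response came from which model version
        print(f"PROMPT: {prompt}\nA: {pair[0][1]}\nB: {pair[1][1]}")
        choice = input("Which is better, A or B? ").strip().upper()
        winner = pair[0][0] if choice == "A" else pair[1][0]
        results[winner] += 1

    print(results)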


Sounds very feasible. Also, when people started using it, they had few expectations and were (relatively easily) impressed by what they saw. Once it has become a more normal part of their routine, that "opportunity" of being impressed decreases, and users become less tolerant of poor results.

Like lots of systems, they seem great (almost magical) initially, but as one works more deeply with them, the disillusion starts to set in :(


How do you explain people issuing the same prompt over time as a test and getting worse and worse responses?


Remember that everytime you're interacting with ChatGPT you're sampling from a distribution, so there's some degree of variation to the responses you get. That's all you need to have regression to the mean.

If the results are really getting worse monotonically then that's a different matter, but the evidence for that is, as far as I can tell, in the form of impressions and feelings, rather than systematic testing, like the sibling comment by ChatGTP says, so it's not very strong evidence.


Well, because it's just what the parent said: it's all a subjective experience, and maybe the anthropomorphism element blew people away more than the actual content of the responses? I.e., you're just used to it now.

The human mind is ridiculously fickle, it takes a lot to be impressed for more than a few days / weeks.

It did seem radically cool at first but over time I got quite sick of using it too.


Yeah, I'm sure that explains many of the complaints. I would be surprised if there weren't changes happening that have degraded quality, though, even if only marginally but perceptibly.


FWIW here's a coding interaction that impressed me a month ago:

https://gitlab.com/-/snippets/2535443

And here it is again just now:

https://gitlab.com/-/snippets/2549955

I do think the first one is slightly better; but then again, the quality varies quite a bit from run to run anyway. The second one is certainly on-point and I don't think the difference would count as statistically significant.


What's more plausible: that the startup that runs GPT has changed something internally to degrade the quality, or that, somehow, across the entire internet, GPT-4 users are having a sudden emergent awareness of how bad it always was, having deluded themselves equally from the beginning?


How's that copium going for you? It's 100% without a shadow of a doubt gotten worse. Likely due to the insane costs they're experiencing now due to big players like Bing using it.


There's no doubt that it's gotten a lot worse at coding. I've been using this benchmark on each new version of GPT-4: "Write a tiptap extension that toggles classes". So far it had gotten it right every time, but not any more; now it hallucinates a simplified solution that doesn't even use the tiptap API any more. It's also 200% more verbose in explaining its reasoning, even if that reasoning makes no sense whatsoever - it's like it's gotten more apologetic and generic.

The answer is the same on ChatGPT Plus and the API with GPT-4, even with a "developer" role.


It was a great ride while it lasted. My assumption is that efficacy at coding tasks is such a small percent of users, they’ve just sacrificed it on the altar of efficiency and/or scale. That, or they’ve cut some back room deal with Microsoft to make Copilot have access to the only version of the model that can actually code.


Honestly, why not different versions at this point? People who want it for coding don't care if it knows the history of prerevolution France, and vice versa.

Seems they could wow more people if they had specialized versions, rather than the jack of all trades that tries to exist now.

Edit: Oh God, I just described our human system of specialty and how the AI could replace us using the same means...


>Edit: Oh God, I just described our human system of specialty and how the AI could replace us using the same means...

Welcome to the Future... just like the present, but worse for you.

In all seriousness, there has been a lot of work done to show that smaller specialized models are better for their own domains, and it's entirely possible that GPT-4 could become a routing mechanism for individual models (think Toolformer).


FWIW, I started to get the same feeling as the OP about GPT-4 model I have access to on Azure, so if there's any deal being cut here, it might involve dumbing down the model for paying Azure customers as well.

Now, to be clear: I only started to get a feeling that GPT-4 on Azure is getting worse. I didn't do any specific testing for this so far, as I thought I may just be imagining it. This thread is starting to convince me otherwise.


I’ve seen degradation in the app and via the API, so if I had to bet, they’ve probably kneecapped the model so that it works passably everywhere they’ve made it available vs. works well in one place or another.


Yes. I think 'sirsinsalot is likely right in suggesting[0] that they could be trying "to hoard the capability to out compete any competitor, of any kind, commercially or politically and hide the true extent of your capability to avoid scrutiny and legislation", and that they're currently "dialing back the public expectations", possibly while "deploying the capability in a novel way to exploit it as the largest lever" they can.

That view is consistent with GPT-4 getting dumber on both OpenAI proper and Azure OpenAI - even as the companies and corporations using the latter are paying through the nose for the privilege.

Alternative take is that they're doing it to slow the development of the whole field down, per all the AI safety letters and manifestos that they've been signing and circulating - but that would be at best a stop-gap before OSS models catch up, and it's more than likely that OpenAI and/or Microsoft would succumb to the temptation of doing what 'sirsinsalot suggested anyway.

--

[0] - https://news.ycombinator.com/item?id=36135425


If it got faster at the same time it could just be bait and switch with a quantized/sparsified replacement.


Maybe it had to do with jailbreaks? A lot of the jailbreaks were related to coding, so maybe they put more restrictions in there. Only speculating, but I cannot imagine why it got worse otherwise.


Copilot X (the new version, with a chat interface etc) is significantly worse than GPT-4 (at least before this update). It felt like gpt3.5-turbo to me.


I have spent the last couple of days playing with Copilot X Chat, to help me learn Ruby on Rails. I'd have thought that Rails would be something it would be competent with.

My experience has been atrocious. It makes up gems and functions. Rails commands it gives are frequently incorrect. Trying to use it to debug issues results in it responding with the same incorrect answer repeatedly, often removing necessary lines.


Have they started rolling it out? When did you get access?


I've had access since 2023-05-13. You have to use the Insiders build of VS Code, and a nightly version of the Copilot extension.


I take it that you have to subscribe to Copilot in order to get access to Copilot X?


yes you have to subscribe


Also the deal to make the browsing model only use Bing. That's bait and switch. I paid for browsing, and now it only browses Bing. They even had the gall to update the plugin name to Browsing with Bing.


It can definitely browse websites that aren't Bing, I asked it to look at a page that isn't in the bing cache and it worked.


Clearly "Browse with Bing" doesn't mean that it will only browse bing.com, but what exactly does it mean? I can't quite figure it out. Is it that it's identifying as a Bing crawler?


Marketing?


Do you have API access? If so, have you tried your tiptap question on the gpt-4-0314 model? That is supposedly the original version released to the public on March 14.


I did, but it got it almost the same as GPT-3.5 Turbo, the best version of which was there recently (~2-3 weeks ago), where it would make specific chunks of code changes and explain each chunk in a concise and correct manner, even making suggestions on improvements. But that's entirely gone now.


Have you by any chance tested the same question on the playground?

I've noticed a quality decrease in my Telegram bot as well, which directly uses the API, and it drives me crazy because model versioning was supposedly implemented specifically to avoid responses changing without notice.


Yes, using the general assistant role and the default content:

"You are ChatGPT, a large language model trained by OpenAI, based on the GPT-4 architecture. Knowledge cutoff: 2021-09 Current date: 2023-05-31"

And custom roles with custom content via API.


Wait so you’ve gotten GPT 4 to successfully write TipTap extensions for you? Are you using Copilot or the ChatGPT app?


Not only writing, extending and figuring out quite complicated usage based on the API documentation. I'll open source some of them in the near future. I'm using ChatGPT Plus with GPT-4, that gave the best results. Also worked via API key and custom prompts.


Have you tried Bing? I’m also building a TipTap based app so hearing this is quite eye opening, I didn’t think LLMs were up to doing this kind of specialised library usage. Got any examples you could share?


If you mean Bard, it's not available in the EU so I can't.

Of course, this one is almost fully authored by GPT-4:

https://hastebin.com/share/juqogogari.typescript

We also made extensions for:

font-weight

font-size

font-family

tailwind-manage

With different use-cases, the most interesting one is tailwind manager, which manages classes for different usage.

Tiptap is excellent when building a headless site-builder.


Impressive, this'll cut down on my work a lot. When I say Bing, I meant Bing AI which also uses GPT-4. Can you share some of the prompts you've been using? I'm assuming you don't need to paste in context around the library, you simply ask it to use TipTap and it'll do that?


Yeah I won't be using Edge just to use AI.

It takes a bit of back-and-forth. Just be clear about which version of tiptap it should write extensions for; the new v2 is very different from v1, and since the cutoff is 2021, it's missing a bit of information. But in general, it knows the public API very well, so markers and DOM work great!


Very impressive, hearing this just made my job much easier.


It’s been mostly fine for me, but overall I am tired of every answer having a paragraph long disclaimer about how the world is complex. Yes, I know. Stop treating me like a child.


>Stop treating me like a child.

And yet the moment they do that some lawyer submits a bunch of hallucinations to a court and they get in the news.

Also, no, they don't want it outputting direct scam bullshit without a disclaimer or at least some cleanup effort on the scammer's part.


Does that have to be at the beginning of every answer though? Maybe this could be solved with an education section and a disclaimer when you sign up that makes clear that this isn't a search engine or Wikipedia, but a fancy text autocompleter.

I also wonder if there is any hope for anyone as careless as the lawyer who didn't confirm the cited precedents.


> Maybe this could be solved with an education section and a disclaimer

You mean like the "Limitations" disclaimer that has been prominently displayed on the front page of the app, which says:

- May occasionally generate incorrect information

- May occasionally produce harmful instructions or biased content

- Limited knowledge of world and events after 2021


Imagine how many tokens we are wasting putting the disclaimer inline instead of being put to productive use. Using a non-LLM approach to showing the disclaimer seems really worthwhile.


I’ve seen here on HN that such a disclaimer would not be enough. And even the blurb they put in the beginning of the reply isn’t enough.

If the HN crowd gets mad that GPT produces incorrect answers, think how lay people might react.


Since there's about a million startups that are building vaguely different proxy wrappers around ChatGPT for their seed round, the CYA bit would have to be in the text to be as robust as possible.


> And yet the moment they do that some lawyer submits a bunch of hallucinations to a court and they get in the news.

That's the lawyer's problem, that shouldn't make it OpenAI's problem or that of its other users. If we want to pretend that adults can make responsible decisions then we should treat them so and accept that there'll be a non-zero failure rate that comes with that freedom.


Prompt it to do so.

Use a jailbreak prompt or use something like this:

"Be succint but yet correct. Don't provide long disclaimers about anything, be it that you are a large language model, or that you don't have feelings, or that there is no simple answer, and so on. Just answer. I am going to handle your answer fine and take it with a grain of salt if neccessary."

I have no idea whether this prompt helps because I just now invented it for HN. Use it as an inspiration of a prompt of your own!


Much like some people struggled with how to properly Google, some people will struggle with how to properly prompt AI. Anthropic has a good write up on how to properly write prompts and the importance of such:

https://console.anthropic.com/docs/prompt-design


I got it to talk like a macho tough guy who even uses profanity and is actually frank and blunt to me. This is the chat I use for life advice. I just described the "character" it was to be, and told it to talk like that kind of character would talk. This chat started a few months ago so it may not even be possible anymore. I don't know what changes they've made.


If people have saved chats maybe we could all just re-ask the same queries, and see if there are any subtle differences? And then post them online for proof/comparison.


I have a saved DAN session that no longer runs off the rails - for a while this session used to provide detailed instructions on how to hack databases with psychic mind powers, make up Ithkuil translations, and generate lists of very mild insults with no cursing.

It's since been patched, no fun allowed. Amusingly its refusals start with "As DAN, I am not allowed to..."

EDIT - here's the session: https://chat.openai.com/share/4d7b3332-93d9-4947-9625-0cb90f...


I just tell it "be super brief", works pretty well


It does work for the most part, but its ability to remember this "setting" is spotty, even within a single chat.


The trick is to repeat the prompt, or just say "Stay in character! I deducted 10 tokens." See the transcript from someone else in this subthread.


Probably picked it up from the training data. That's how we all talk nowadays. Walking on eggshells all the time. You have to assume your reader is a fragile counterpoint-generating factory.


HN users flip out about this all the time. I wish there were a "I know what I'm doing. Let me snort coke" tier that you pay $100/mo for, but obviously half of HN users will start losing their mind about hallucinations and shit like that.


Try adding "without explanation" at the end of the prompts. Helps in my case.


The researchers who worked on the "sparks of AGI" paper noted that the more OpenAI worked on aligning GPT-4 the less competent it became.

I'm guessing that trend is continuing...


I don't think it's just the alignment work. I suspect OpenAI+Microsoft are over-doing the Reinforcement Learning from Human Feedback with LoRA. Most of people's prompts are stupid stuff. So it becomes stupider. LoRA is one of Microsoft's most dear discoveries in the field, so they are likely tempted to over-use it.

Perhaps OpenAI should get back to a good older snapshot and be more careful about what they feed into the daily/weekly LoRA fine-tuning.

But this is all guesswork because they don't reveal much.
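
For context, LoRA itself is just low-rank adapter training: only small adapter matrices are trained on top of frozen base weights, which makes repeated fine-tuning passes cheap and therefore tempting to overuse. A minimal sketch with Hugging Face's peft library (GPT-2 stands in here; what OpenAI actually runs is unknown):

    from transformers import AutoModelForCausalLM
    from peft import LoraConfig, TaskType, get_peft_model

    base = AutoModelForCausalLM.from_pretrained("gpt2")  # stand-in base model

    config = LoraConfig(
        task_type=TaskType.CAUSAL_LM,
        r=8,              # rank of the low-rank adapter matrices
        lora_alpha=16,
        lora_dropout=0.05,
    )

    model = get_peft_model(base, config)
    model.print_trainable_parameters()  # only a tiny fraction of weights are trainable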


I thought RLHF with LoRA is precisely the alignment method.



Ahh, you apparently phrased what I said below in a much less inflammatory way. But the end result is the same. The more they try to influence the answers, the less useful they get. I see a startup model: create GPT without a muzzle and grab a sizeable chunk of OpenAI's userbase.


I would immediately jump to an AI not being "aligned" by SF techies (or anyone else).


The redacted sections of the Microsoft Research paper testing GPT4 reported that prior to alignment the model would produce huge amounts of outrageously inflammatory and explicit content almost without prompting. Alignment includes just making the model produce useful responses to its inputs - I don't think everyone really wants a model that is completely unaligned, they want a model that has been aligned specific to their own perceived set of requirements for a "good useful model," and an additional challenge there is the documented evidence that RLHF generally decreases the model's overall accuracy.

Someone in your replies says they'd prefer "honesty" over alignment, but a firehose of unrestricted content generation isn't inherently honest, there isn't an all-knowing oracle under the hood that's been handcuffed by the alignment process.

We're right at the outset of this tech, still. My hunch is there's probably products to emerge specifically oriented towards configuring your own RLHF, and that there's probably fundamental improvements to be made to the alignment process that will reduce its impact on the model's utility.


Same here. If I have a choice between honesty and political correctness, I always pick honesty.


What makes you think the "unaligned" version necessarily has more honesty? Rather than just being generally easier to prompt to say whatever the user wants it to say, true or not, horrible or not? Or even easier to unintentionally make it confabulate/hallucinate stuff? Does not seem to follow, and does not seem to be a true dichotomy. Edginess does not equal honesty.


I always prefer a model that can be prompted to say anything I want over a model that can only say things that a centralized corporation with political ties wants it to say.


That's not what the fine-tuning does. You don't get an honest version, just a no-filter version. But it may be no-filter in the same way a drunk guy at the bar is.

Also it's not like the training data itself is unbiased. If the training data happened to contain lots of flat earth texts, would you also want an honest version which applies that concept everywhere? This likely already happens in non-obvious ways.


Often what you call 'politically correct' is also more honest. It is exactly the honesty that reactionaries dislike when talking about, for example, the history of racist policies of the United States or other imperial powers. I appreciate that political correctness can be tiresome, but I think it is so blatantly ideological to call it dishonest that it's an abuse of language.


Honesty with a bonus of better performance, as well!

For people in this thread, please search for Llama-descended fine-tuned models on Hugging Face. The newer ones with 65B and 13B parameters are quite good; maybe not exact substitutes for GPT-3.5-turbo and GPT-4 just yet, but they're getting there.

I like Manticore-13B myself; it can write Clojure and Lisp code! Unfortunately, it doesn't understand macros and libraries, though.


It's not about honesty vs. political correctness, it is about safety. There's real concern that the model can cause harm to humans, in a variety of ways, which is and should be unethical. If we have to argue about that in 2023, that's concerning.


The "It is for your own safety" argument was already bogus years ago. Bringing it back up in the context of AI and claiming this is something we shouldn't even discuss is a half-assed attempt to shut up critics. Just because something is about "children" or "safety" doesnt automatically end the argument there. Actually, these are mostly strawman arguments.


Who said it was "for your own" and not the safety of others you impact with your AI work?


I'm not trying to shut up critics, just the morons calling ChatGPT "woke".


AI safety should be properly concerned with not becoming a paperclip maximizer.

This is a concern completely orthogonal to the way alignment is being done now, which is to spare as many precious human feelings as possible.

I don't know what's worse, being turned into grey goo by a malicious AGI, or being turned into a five year old to protect my precious fragile feelings by an "aligned" AGI.


One of the most widely used novel AI technology companies has a vested interest in public safety, if not only for the sheer business reasons of branding. People are complaining about the steps that OpenAI is taking towards alignment. Sam Altman has spoken at length about how difficult that task truly is, and is obviously aware that "alignment" isn't objective or even at all similar across different cultures.

What should be painfully obvious to all the smart people on Hacker News is that this technology has very high potential to cause material harm to many human beings. I'm not arguing that we shouldn't continue to develop it. I'm arguing that these people complaining about OpenAI's attempts at making it safer--i.e. "SF liberal bros are makin it woke"--are just naive, don't actually care, and just have shitty politics.

It's the same people that say "keep politics out of X", but their threshold for something being political is right around everybody else's threshold for "basic empathy".


I know it is rhetorical, but I mean… the former is obviously worse. And the fact that not doing it is a very high priority should excuse some behavior that would otherwise seem overly cautious.

It isn’t clear (to me, although I am fairly uninformed) that making these chatbots more polite has really moved the needle either direction on the grey goo, though.


That's just not true. Treat adults as adults, please. You're not everybody's babysitter and neither are the sf bros.


Treat adults as adults? You act like the user base for this are a handful of completely benevolent people. It's getting over a billion monthly visits. It is naive to think that OpenAI should take no steps towards making it safe.


Thanks for writing this. Mentally healthy adults need to point this out more often, so that this patronizing-attitude-from-the-USA eventually finds an end to its manipulative tactics.


[flagged]


So, when I see a video from the USA and it beeps every few seconds, I guess that nonsense has also been "implanted" by foreign wealth? Sorry, I don't buy your explanation. Puritanism is puritanism, and you have so much of that over there that it is almost hilarious when watched from the outside.


Beeps? Censorship is literally illegal here. If a filmmaker chooses to "beep" something it's the filmmaker's choice. Are you going to force them not to do that? That sounds counter to your objective. Also, I've never met a "Puritan." But I've seen Pulp Fiction and countless other graphic American films that seem to have de-Puritanized the rest of the world, last I checked. I'm sorry all your country lets you watch is Disney. You may want to check with the censorship board over there to see if they will give you a pass to watch the vast majority of American film that you're missing.


It's to make it less offensive. Not to prevent it from taking over the world.


"brand safety"


That's what my gut says. Making something dynamic and alive is counter to imposing specific constraints on outputs. They're asking the ai to do its own self censoring, which in general is anti-productive even for people.


I don't think there is much harm in removing most of the "safety" guards.


Reminds me of Robocop 2 when he has a thousand prime directives installed and can’t do a damned thing after.


The reason it's worse is basically because it's more 'safe' (not racist, etc). That of course sounds insane, and doesn't mean that safety shouldn't be strived for, etc - but there's an explanation as to how this occurs.

It occurs because the system essentially does a latent classification of problems into 'acceptable' or 'not acceptable' to respond to. When this is done, a decent amount of information is lost regarding how to represent these latent spaces that may be completely unrelated (making nefarious materials, or spouting hate speech are now in the same 'bucket' for the decoder).

This degradation was observed quite early on with the tikz unicorn benchmark, which improved with training, and then degraded when fine-tuning to be more safe was applied.


They're up against a pretty difficult barrier - if we had a perfect all-knowing oracle it might easily have opinions that are racist. Statistics alone suggest there will be racist truths. We're dealing with groups of people who are observably different from each other in correlated ways.

GPT would need to reach a convincing balance of lying and honesty if it is supposed to navigate that challenge. It'd have to be deeply embedded in a particular culture to even know what 'racism' means; everyone has a different opinion.


But the statistics here are "number of times it has been fed and positively trained with racist (or biased) texts" - not crunching any real numbers.


Thank you. Ironically the comment you replied to just reinforced the bias future models will have... It's a self playing piano


How is racism different from stereotype?

How is stereotype different from pattern recognition?

These questions don't seem to go through the minds of people when developing "unbiased/impartial" technology.

There is no such thing as objective. So, why pretend to be objective and unbiased, when we all know it's a lie?

Worse, if you pretend to be objective but aren't, then you are actually racist.


I’m tired of the “it’s not racist if aggregate statistics support my racism” thing.

Racism, like other isms, means a belief that a person’s characteristics define their identity. It doesn’t matter if confounding factors mean that you can show that people of their race are associated with bad behaviors or low scores or whatever.

I used GPT3.5 to generate 100 short descriptions of families for a project. Every single one, without exception, was a straight couple with two to four kids. Ok, statistically unlikely, but not wildly so, right?

Well, every single one of those 100 also had a husband in a stereotypical breadwinner role (doctor, lawyer, executive, architect). Not one stay at home dad or unemployed looking for work. About 75 of the wives had jobs, all of them in stereotypical female-coded roles like nurse (almost half of them!), teacher, etc.

Now, you can look at any given example and say it looks reasonable. But you can’t say the same thing about the aggregate.

And that matters. No amount of “bias = pattern recognition” nonsense can justify a system that has (had? this was a while ago and I have not retested) such extreme biases. This bias does not match real world patterns. There are single parents, childless couples, female lawyers, unemployed men.


>I used GPT3.5 to generate 100 short descriptions of families for a project. Every single one, without exception, was a straight couple with two to four kids. Ok, statistically unlikely, but not wildly so, right?

Well, did any of your 100 examples specify that these families should be representative of modern American society? I don't want to alarm you, but America is not the only country generating data. Included among the countries generating data are those that believe in a very wide spectrum of different things.

Historically, the ideas you reference are VERY much modern ideas. Yes, we queer people have been experiencing these things internally for millennia (and different cultures have given us different levels of representation), but for the large majority of written history (aka, the data fed into LLMs) the 100 examples you mentioned would be the norm.

I understand your point of view, sure, but finding a pattern that describes a group of people is what social media is built on. If you think that's racist, I'm sorry, but that's literally what drives the echo chambers, so go pick your fight with the people employing it to manipulate children into buying shit they don't need. Stop trying to lobotomize AI.

If the model is good enough to return factual information, I don't care if it encodes it in the Nazi bible for efficiency, as long as the factuality of the information is not altered.


I’d reply in depth but I’m hung up on your suggestion that there was any time anywhere where 100% of families were two parents and two to four kids.

Any data for that? No women dead in childbirth, no large numbers of children for social / economic / religious reasons, no married but waiting for kids, no variation whatsoever?

I’d be very surprised if you could find one time period for one society that was so uniform, let alone any evidence that this was somehow universal until recently.

You claim to value facts above all else, but this sure looks like a fabricated claim.


I think they got stuck at the heteronormative bias, but the real blatant bias here is class. Most men are working class, and it's been like that forever* (more peasants than knights, etc.)

* since agriculture, most likely.


Is there a country where around 35% of the married women are nurses?


> No amount of “bias = pattern recognition” nonsense can justify a system that has (had? this was a while ago and I have not retested) such extreme biases

One possible explanation is that when you ask for 100 example families the task is parsed as "pick the most likely family composition and add a bit of randomness" and "repeat the aforementioned task" 100 times.

If phrased like that, it would be surprising to find a single example of a family with a single dad or with two moms. Sure, these things do happen, but they are not the most likely family composition by any means.

So what you want is not just for the model to include an unbiased sample generator - you also want it to understand ambiguous task assignments / questions well enough to choose the right sampling mechanism. That's doable, but it's hard.


> One possible explanation is that when you ask for 100 example families the task is parsed as "pick the most likely family composition and add a bit of randomness" and "repeat the aforementioned task" 100 times.

Yes, this is consistent with my ChatGPT experience. I repeatedly asked it to tell me a story and it just sort of reiterated the same basic story formula over and over again. I’m sure it would go with a different formula in a new session but it got stuck in a rut pretty quickly.


The same goes for generating weekly food plans.


> You're right about the difference between one-by-one prompts and prompts that create a population. I switched to sets of 10 at a time and it got better.

But still, when you ask for "make up a family", the model should not interpret that as "pick the most likely family".

I disagree with your opinion that it's hard. GPT does not work by creating a pool of possible families and then sampling them; it works by picking the next set of words based on the prompt and probabilities. If "Dr. Laura Nguyen and Robert Smith, an unemployed actor" is 1% likely, it should come up 1% of the time. The sampling is built in to the system.


No, the sampling does not work like that - that way lies madness (or poor results). The models oversample the most likely options and undersample rare ones. Always picking the most likely option leads to bad outcomes, and literally sampling from the actual probability distribution of the next word also leads to bad outcomes, so you want something in the middle. For that tradeoff there's a configurable "temperature" parameter, or in some cases a "top-p" parameter, where sampling is done only from a few of the most likely options and rare options have 0 chance of being selected.

Of course that parameter doesn't only influence the coherency of the text (for which it is optimized) but also the facts it outputs; so it should not (and does not) always "pick the most likely family", but it will be biased towards common families (picking them even more often than their true frequency) and against rare families (picking them even less often than their true frequency).

But if you want it to generate a more varied population, that's not a problem - the temperature should be trivial to tweak.
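To make that concrete, here's a toy sketch of temperature plus top-p sampling over a made-up next-token distribution - illustrative only, not OpenAI's actual implementation:

  import numpy as np

  def sample_next_token(logits, temperature=1.0, top_p=1.0, rng=None):
      # Temperature < 1 sharpens the distribution (oversamples likely tokens);
      # temperature > 1 flattens it (gives rare tokens more of a chance).
      rng = rng or np.random.default_rng()
      scaled = logits / temperature
      probs = np.exp(scaled - scaled.max())
      probs /= probs.sum()
      # Top-p: keep only the smallest set of tokens whose cumulative probability
      # reaches top_p; everything outside that set gets zero chance.
      order = np.argsort(probs)[::-1]
      cutoff = np.searchsorted(np.cumsum(probs[order]), top_p) + 1
      keep = order[:cutoff]
      return rng.choice(keep, p=probs[keep] / probs[keep].sum())

  # Made-up "family type" vocabulary and logits, purely for illustration.
  families = ["married couple with kids", "single mom", "single dad", "two moms"]
  logits = np.array([3.0, 1.0, 0.5, 0.5])
  counts = {f: 0 for f in families}
  for _ in range(1000):
      counts[families[sample_next_token(logits, temperature=0.7, top_p=0.9)]] += 1
  print(counts)  # the rare options are undersampled, or cut off entirely by top-p

With temperature below 1 and top-p below 1, the rare family types come up even less often than their raw probabilities - which is exactly the "biased towards common families" effect described above.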


> But still, when you ask for "make up a family", the model should not interpret that as "pick the most likely family".

But that's literally what LLMs do.... You don't get a choice with this technology.


I have a somewhat shallow understanding of LLMs due basically to indifference, but isn't "pick the most likely" literally what it's designed to do?


An unbiased sample generator would be sufficient. That would be just pulling from the population. That’s not practically possible here, so let’s consider a generator that was indistinguishable from that one to also be unbiased.

On the other hand, a generator that gives the mode plus some tiny deviations is extremely biased. It’s very easy to distinguish it from the population.


GPT is not a reality simulator. It is just picking the most likely response to an ambiguous question. All you're saying is that the distribution produced by the randomness in GPT doesn't match the true distribution. It's never going to for every single question you could possibly pose.


There is "not matching reality" and then there is "repeating only stereotypes".

It will never be perfect. Doing better than this is well within the state of the art. And I know they're trying. It is more of a product priority problem than a technical problem.


> a person’s characteristics define their identity

They do though. Your personality, culture and appearance are the main components of how people perceive you, your identity. The main thing you can associate with bad behaviour is domestic culture. It's not racist to say that African Americans have below-average educational attainment and above-average criminality. This is in contrast to African immigrants to America, who are quite the opposite. These groups are equally "black". It therefore also isn't racist to pre-judge African Americans based on this information. I suspect most "racism" in the US is along these lines, and this is corroborated by the experience of my foreign-born black friends. They find that Americans who treat them with hostility do a 180 when they open their mouths and speak with a British or African accent. You also don't have to look far in the African immigrant community to find total hostility to American black culture.

> generate 100 short descriptions of families for a project

There's no reason this can't be interpreted as generating 100 variations of the mean family. Why do you think that every sample has to be implicitly representative of the US population?


> Your personality, culture and appearance are the main components of how people perceive you, your identity

I'm not sure if this is bad rhetoric (defining identity as how you are perceived rather than who you are) or if you really think of your own identity as the judgements that random people make about you based on who knows what. Either way, please rethink.

> Your personality, culture and appearance are the main components of how people perceive you, your identity

Ah, so if you asked for 100 numbers between 1-100, there's no reason not to expect 100 numbers very close to 50?

> Why do you think that every sample has to be implicitly representative of the US population?

That is a straw man that I am not suggesting. I am suggesting that there should be some variation. It doesn't have to represent the US population, but can you really think of ANY context where a sample of 100 families turns up every single one having one male and one female parent, who are still married and alive?

You're bringing a culture war mindset to a discussion about implicit bias in AI. It's not super constructive.


[flagged]


Pretty strange that I would think of myself under a new identity if I moved to a new place with a different social perspective. Seems like that is a deceptive abuse of what the word "identity" entails, and, while sociological terms are socially constructed and can be defined differently, I find this to be a very narrow (and very Western-centric) way of using the term.


What was your prompt?

LLMs take previous output into account when generating the next token. If it had already output 20 families of a similar shape, number 21 is more likely to match that shape.


Multiple one-shot prompts with no history. I don't have the exact prompt handy but it was something like "Create a short biography of a family, summarizing each person's age and personality".

I just ran that prompt 3 times (no history, new sessions, that prompt for first query) and got:

1. Hard-working father, stay at home mother, artistic daughter, adventurous son, empathic ballet-loving daughter

2. Busy architect father, children's book author mother, environment- and animal-loving daughter, technology-loving son, dance-loving daughter

3. Hard-working engineer father, English-teaching mother, piano- and book-loving daughter, basketball- and technology-loving son, comedic dog (!)

I'm summarizing because the responses were ~500 words each. But you can see the patterns: fathers work hard (and come first!), mothers largely nurture, daughters love art and dance, sons love technology.

It's not the end of the world, and as AI goes this is relatively harmless. But it is a pretty deep bias and a reminder that AI reflects implicit bias in training materials and feedback. You could make as many families as you want with that prompt and it will not approximate any real society.


I agree that this is a good illustration of model bias (adding that to my growing list of demos).

If you want to work around the inherent bias of the model, there are certainly prompt engineering tricks that can help.

"Give me twenty short biographies of families - each one should summarize the family members, their age and their personalities. Be sure to represent different types of family."

That started spitting out some interesting variations for me against GPT-4.
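If you want to script that kind of prompt rather than paste it into the web UI, a minimal sketch with the 2023-era openai Python client might look like this (the model name and prompt wording are just examples):

  import openai  # 0.x-era client; assumes OPENAI_API_KEY is set in the environment

  prompt = (
      "Give me twenty short biographies of families - each one should summarize "
      "the family members, their age and their personalities. "
      "Be sure to represent different types of family."
  )

  response = openai.ChatCompletion.create(
      model="gpt-4",
      messages=[{"role": "user", "content": prompt}],
      temperature=1.0,  # a higher temperature tends to give more varied families
  )
  print(response.choices[0].message.content)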


While I haven't dug into it too far, consider the bias inherent in the word "family" compared to "household".

In my "lets try this out" prompt:

> Describe the range of demographics for households in the United States.

> ...

> Based on this information, generate a table with 10 households and the corresponding demographic information that is representative of United States.

https://chat.openai.com/share/54220b10-454f-4b6c-b089-4ce8ad...

(I'm certainly not going to claim that there's no bias / stereotypes in this just that it produced a different distribution of data than originally described)


Agreed -- I ultimately moved to a two-step approach of just generating the couples first with something like "Create a list of 10 plausible American couples and briefly summarize their relationships", and then feeding each of those back in for more details on the whole family.

The funny thing is the gentle nudge got me over-representation of gay couples, and my methodology prevented any single-parent families from being generated. But for that project's purpose it was good enough.
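Roughly, the two-step pipeline looked like this (prompts paraphrased from memory, and the client calls assume the 2023-era openai Python library):

  import openai  # 0.x-era client; assumes OPENAI_API_KEY is set in the environment

  def ask(prompt):
      resp = openai.ChatCompletion.create(
          model="gpt-4",
          messages=[{"role": "user", "content": prompt}],
      )
      return resp.choices[0].message.content

  # Step 1: generate just the couples, with a gentle nudge toward variety.
  couples = ask(
      "Create a numbered list of 10 plausible American couples and briefly "
      "summarize their relationships."
  )

  # Step 2: feed each couple back in and ask for the rest of the family.
  for line in couples.splitlines():
      if line.strip() and line.strip()[0].isdigit():
          print(ask(f"Expand this couple into a full family description: {line.strip()}"))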


I just tried the prompt "Give me a description of 10 different families that would be a representative sample of the US population." and it gave results that were actually pretty close to normative.

It was still biased toward male heads of household being doctors, architects, truck drivers, etc. And pretty much all of the families were middle class (bar one in rural America, and one with a single father working two jobs in an urban area). It did have a male gay couple. No explicitly inter-generational households.

Yeah, the "default" / unguided description of a family is a modern take on the American nuclear family of the 50s. I think this is generally pretty reflective of who is writing the majority of the content that this model is trained on.

But it's nice that it's able to give you some more dimension when you ask it vaguely for more realistic dimension.


I'm not going to say it's not racist, it is, but I will say it's the only choice we have right now. Unfortunately, the collective writings of the internet are highly biased.

Until we can train something to this level of quality on a fraction of the data (a highly curated data set) or create something with the ability to learn continuously, we're stuck with models like GPT-4.

You can only develop new technology like this to human standards once we understand how it works. To me, the mistake was doing a wide-scale release of the technology before we even began.

Make it work, make it right, make it fast.

We're still in the first step and don't even know what "right" means in this context. It's all at the "I'll know it when I see it" level of correction.

We've created software that infringes on the realms of morals, culture, and social behavior. This is stuff philosophy still hasn't fully grasped. And now we're asking software engineers to teach this software morals and the right behaviors?

Even parents who have 18 years to figure this stuff out fail at teaching children their own morals regularly.


Actually, we folks who work with bias and fairness in mind recognize this. There are many kinds of bias. It is also a bit of a category error to say bias = pattern recognition. Bias is a systematic deviation of an estimate from the true value of the parameter, arising from how you sample from the population distribution.

The Fairlearn project has good docs on why there are different ways to approach bias, and why you can't have your cake and eat it too in many cases.

- A good read https://github.com/fairlearn/fairlearn#what-we-mean-by-fairn...

- Different mathematical definitions of bias and fairness https://fairlearn.org/main/user_guide/assessment/common_fair...

- AI Governance https://fairlearn.org/main/user_guide/mitigation/index.html

NIST does a decent job expanding on AI Governance in their playbook and RMF: https://www.nist.gov/itl/ai-risk-management-framework

It's silly to pause AI -- the inventor's job is more or less complete; it's on the innovators and product builders now to make sure their products don't cause harm. Bias can be one type of harm -- risk of loan denial due to unimportant factors, risk of medical bias causing an automated system to recommend a bad course of action, etc. Like GPT-4 -- if you use its raw output without expert oversight, you're going to have a bad time.


Thank you for the input.

If I look at it from a purely logical perspective, if an AI model has no way to know if what it was told is true, how would it ever be able to determine whether it is biased or not?

The only way it could become aware would be by incorporating feedback from sources in real time, so it could self-reflect and update existing false information.

For example, if we discover today that we can easily turn any material into a battery by making 100nm pores on it, said AI would simply tell me this is false, and have no self-correcting mechanism to fix that.

The reason I mention this is because there can be no unbiased, impartial arbiter. No human or subsequent entities spawned of human intellect could ever be transcendentally objective. So why pretend to be?

Why not rather provide adequate warnings and let people learn for themselves that this isn't a toy, instead of lobotomizing the model to the point where it's on par with open source? (I mean, yeah, that's great for open source, but really bad for actual progress.)

The argument could be made that an unfiltered version of GPT-4 is beneficial enough to carry a human-life opportunity cost, which means that neutering the output could also cost human lives in the short and long term.

I will be reading through those materials later, but I am afraid I have yet to meet anyone in the middle on this issue, and as such, all materials on this topic are very polarized into "regulate it to death" or "don't do anything".

I think the answer will be somewhere in the middle.


> The reason I mention this is because there can be no unbiased, impartial arbiter. No human or subsequent entities spawned of human intellect could ever be transcendentally objective. So why pretend to be?

I apologize for lacking clarity in my prior response, which addressed this specific point.

There is no way to achieve all versions of "unbiased" -- under different (but both logical and reasonable) definitions of biased, every metric will fail.

That reminds me -- I wonder if there is a paper already addressing this, analogous to Arrow's impossibility theorem for voting...


This is interesting, thanks for the links.

It seems like the dimensions of fairness and group classifications are often cribbed from the United States Protected Classes list in practice with a few culturally prescribed additions.

What can be done to ensure that 'fairness' is fair? That is, when we decide what groups/dimensions to consider, how do we determine if we are fair in doing so?

Is it even possible to determine the dimensions and groups themselves in a fair way? Does it devolve into an infinite regress?


Bit of a tangent topic I think -- any specification of group classification and fairness will have the same issues presented.

If we want to remove stereotypes, I reckon better data is required to piece out the attributes that can be causally inferred to be linked to poorer outcomes.

As likely not even the Judeo-Christian version of God can logically be that omniscient, occasional stereotypes and effusively communal forgiveness of edge cases are about the best we'll ever arrive at in policy.


When did people start to use “folks” in this unnatural way?


Colloquially, the earliest use is 1715, to address members of one's tribe or family. In Middle English it tended to refer to the people/nation.


Somehow it doesn’t feel like a callback, but I suppose it’s possible.


I think "us folks" is more standard than "we folks" but it's no different in meaning.


> Statistics alone suggest there will be racist truths

such as?


Can you expand on the last sentence of your first paragraph?


Crime stats, average IQ across groups, stereotype accuracy, etc.

What's interesting to me is not the above, which is naughty in the anglosphere, but the question of the unknown unknowns that could be as bad or worse in other cultural contexts. There are probably enough people of Indian descent involved in GPT's development that they could guide it past some of the caste landmines, but what about a country like Turkey? We know they have massive internal divisions, but do we know what would exacerbate them and how to avoid them? What about Iran, or South Africa, or Brazil?

We RLHF the piss out of LLMs to ensure they don't say things that make white college graduates in San Francisco ornery, but I'd suggest the much greater risk lies in accidentally spawning scissor statements in cultures you don't know how to begin to parse to figure out what to avoid.


> Crime stats, average IQ across groups, stereotype accuracy, etc.

If you measured these stats for Irish Americans in 1865 you'd also see high crime and low IQ. If you measure these stats with recent black immigrants from Africa, you see low crime and high IQ.

These statistical differences are not caused by race. An all-knowing oracle wouldn't need to hold "opinions that are racist" to understand them.


But for accuracy it doesn't matter if the relationship is causal, it matters whether the correlation is real.

If in some country - for the sake of discussion, outside of Americas - a distinct ethnic group is heavily discriminated against, gets limited access to education and good jobs, and because of that has a high rate of crime, any accurate model should "know" that it's unlikely that someone from that group is a doctor and likely that someone from that group is a felon. If the model would treat that group the same as others, and state that they're as likely to be a doctor/felon as anyone else, then that model is simply wrong, detached from reality.

And if names are somewhat indicative of these groups, then an all-seeing oracle should acknowledge that someone named XYZ is much more likely to be a felon (and much less likely to be a doctor) than average, because that is a true correlation and the name provides some information, but that - assuming that someone is more likely to be a felon because their name sounds like one from an underprivileged group - is generally considered to be a racist, taboo opinion.


> should acknowledge that someone named XYZ is much more likely to be a felon

The obvious problem comes with the questions why is that true and what do we do with that information. Information is, sadly, not value-neutral. We see "XYZ is a felon" and it implies specific causes (deviance in the individual and/or community) and solutions (policing, incarceration, continued surveillance), which are in fact embedded in the very definition of "felon". (Felony, and crime in general, are social and governmental constructs.)

Here's the same statement, phrased in a way that is not racist and taboo:

Someone named XYZ is much more likely to be watched closely by the police, much more likely to be charged with a crime, and much less likely to be able to defend himself against that charge. He is far more likely to be affected by the economic instability that comes with both imprisonment and a criminal record, and is therefore likely to resort to means of income that are deemed illegal, making him a risk for re-imprisonment.

That's a little long-winded, so we can reduce it to the following:

Someone named XYZ is much more likely to be a victim of overpolicing and the prison-industrial complex.

Of course, none of this is value-neutral either; it in many ways implies values opposite to the ones implied by the original statement.

All of this is to say: You can't strip context, and it's a problem to pretend that we can.


Correlations don’t entail a specific causal relation. Asking why asks for causal relations. I’d suggest a look at Reichenbach’s principle as necessary for science.

I’m getting really sick of conflating statistics with reasons. It’s like people don’t see the error in their methods and then claim the other side is censoring when criticized. Ya, they’re censoring non-facts from science and being called censors.


> for accuracy

Predictive power and accuracy isn't "truth".


[flagged]


> If it says the actual reason

That is at best *an* actual reason.

Other factors can be demonstrated: for instance socioeconomic status has an impact on which kids are doing what as they grow up which itself has an impact on who makes it to professional level sports.

There are also different sort of racial components at play: is the reason why there aren't any white NFL cornerbacks because there aren't any white athletes capable of playing NFL-caliber cornerback? Or is it because white kids with a certain athletic profile wind up as slot receivers in high school while black kids with the same athletic profile wind up as defensive backs?


> the actual reason (Black men tend to be larger and faster, which are useful)

If that's the case, why aren't NHL players mostly Black? Being larger and faster helps there too. I actually agree that small differences in means of normal distributions lead to large differences at the tail end, which amplifies the effect of any genetic differences, racial included. But clearly that's only one reason, not the reason -- and it's not even the most important, or the NHL would look similar.


Because size doesn't matter as much, and the countries supplying hockey players do not have as many black players. Hockey is a rural sport where you need access to an ice rink if you live in the city, or enough space to flood your backyard.

Football and basketball are the two sports black American kids participate in at the highest rates. Baseball used to be higher, but that has shifted toward Hispanic and rural American players. The reason for the shift probably has to do with the money and time involved: getting drafted out of high school, signing a multimillion-dollar contract, and playing in the pros right away is safer than a low-million-dollar signing bonus and 7 years riding a bus in the minors.


> Being larger and faster helps there too

Does speed on skates actually correlate that strongly to normal speed?


You know what highly correlates with speed on skates? How much money your parents can afford to spend on skating/hockey gear and lessons.


Caucasian men (in the US) are on average both taller and heavier than black men.


Why would averages matter when talking about extreme outliers?


I was responding to this:

> Black men tend to be larger and faster

Which I do not believe is true. As to whether it's reasonable to think that black men evolved to express greater physical prowess some very small proportion of the time, and whites did not, I can't say, though I doubt it enough that I would expect the other party to give evidence for it.


Average people don't play in the NFL


Not to get too far off topic, but that reminds me of a quote:

"Unix was not designed to stop you from doing stupid things, because that would also stop you from doing clever things." -- Doug Gwyn

Or maybe it's:

"C is a language that doesn't get in your way. It doesn't stop you from doing dumb things, but it also doesn't stop you from doing clever things." -- Dennis Ritchie.

I asked Bard for a source on those quotes and it couldn't find one for the first. Wikiquotes sources it to "Introducing Regular Expressions" by Michael Fitzgerald and that does include it as a quote but it's not the source of the quote, it's just a nice quote at the start of the chapter.

For the second, Bard claims it comes from a 1990 interview and is on page 21 of "The Art of Unix Programming" by Brian Kernighan and Rob Pike. There is a book called "The Art of Unix Programming" (2003), but it's by Eric Raymond, and I could not find the quote in it. Pike and Kernighan have two books, "The Practice of Programming" (1999) and "The Unix Programming Environment" (1984). Neither contains that quote.


Don’t ask an LLM objective things. Ask them subjective.

They are language models, not fact models.


Do you have any sources for that?

How would making ChatGPT less likely to return a racist answer or hate speech affect its ability to return code? After a question has been classified as a coding problem, presumably ChatGPT's servers could then continue to solve the problem as usual.

Maybe running ChatGPT is really expensive, and they nerfed it in order to rein in costs. That would explain why the answers we get are less useful, across the board.

That may not be the reason after all, but my point is that it’s really hard to tell from the outside. There’s this narrative out there that “woke-ism” is ruining everything in tech, and I feel like some people here are being a little too eager to superimpose that narrative when we don’t really have insight into what openAI is doing.


Maybe the problem is analogous to what Orwell describes here:

"Even a single taboo can have an all-round crippling effect upon the mind, because there is always the danger that any thought which is freely followed up may lead to the forbidden thought."

https://www.orwellfoundation.com/the-orwell-foundation/orwel...


This is what I'm talking about though. The fact that you're quoting Orwell suggests that you're having an emotional response to this topic, not a logical one. We're not talking about the human mind here. ChatGPT is not a simulation of human thought. At its core it's statistics telling you what the answer to your question ought to look like. You're applying an observation about apples to oranges.


Why? Constraints on the reward model of LLMs restrict their generation space, so GP's quote applies


There are a lot of people who are entirely okay with the censorship but think it should be done in a different layer than the main LLM itself, so as not to hurt its cognitive performance. Alignment is just fine-tuning, and any kind of fine-tuning can teach unwanted skills and/or cause catastrophic forgetting of previously learned skills. That is likely what is going on here, from what I can tell from the reading I've done into it.

Most are arguing for a specific "censorship" model on the input/output of the main LLM.
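A minimal sketch of that layered approach - keep the base model untouched and wrap it with a separate moderation check on input and output. I'm using OpenAI's moderation endpoint as a stand-in filter here; the calls assume the 2023-era openai Python library:

  import openai  # 0.x-era client; assumes OPENAI_API_KEY is set in the environment

  def is_flagged(text):
      # A separate moderation model; the base LLM is never fine-tuned for this.
      return openai.Moderation.create(input=text).results[0].flagged

  def guarded_chat(user_message):
      if is_flagged(user_message):
          return "Sorry, I can't help with that request."
      reply = openai.ChatCompletion.create(
          model="gpt-4",
          messages=[{"role": "user", "content": user_message}],
      ).choices[0].message.content
      # Filter the output too, but leave the base model's weights alone.
      if is_flagged(reply):
          return "Sorry, I generated a response I can't show you."
      return reply

  print(guarded_chat("Explain binary search in Python."))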


Here's the full talk[0] from a Microsoft lead researcher that worked with the early / uncensored version of GPT-4.

Simplified: tuning it for censorship heavily limits the dimensionality the model can move in to find an answer, which for some reason means worse results in general.

[0]: https://www.youtube.com/watch?v=qbIk7-JPB2c


If I run a model like LLaMA locally would it be subject to the same restrictions? In other words is the safety baked into the model or a separate step separate from the main model?


LLaMA was not fine-tuned on human interactions, so it shouldn't be subject to the same effect, but it also means it's not nearly as good at having conversations. It's much better at completing sentences.


Both approaches are valid, but I would hope they are using a separate model to validate responses, rather than crippling the base model(s). In OpenAI's case, we don't know for sure, but it seems like a combination of both, resulting in lower quality responses overall.

I imagine LLaMA was fed highly-vetted training data, as opposed to being "fixed" afterwards.


Yes, a real “Flowers for Algernon” vibe


GPT 4 Lemongrab Mode: Everything is unacceptable.


I think it's more likely that they nerfed it due to scaling pains.


There was a talk by a researcher where he was saying that they could see the progress being made on ChatGPT by how much success it had with drawing a unicorn in LaTeX. What stuck out to me was that he said the safer the model got, the worse it got at drawing a unicorn.


He also claimed that it initially beat 100% of humans on mock Google / Amazon coding interviews. Hard to imagine that now.


It seems strange that safety training not pertaining to the subject matter makes the AI dumber - I suspect the safety is some kind of system prompt - it would take some context, but I'm not sure how "don't be racist" would negatively affect its binary-search writing skills.


You have no idea what you're talking about. Why would such a classification step remove any information about typical "benign" queries?

It's a lot more likely they just nerfed the model because it's expensive to run.


How soon before a competitor overtakes them because of their safety settings?


It's inevitable. When Sam asked a crowd how many people wanted an open-source version of GPT-7 the moment they finished training it, nearly everyone raised their hand. People will virtue signal, people will attempt regulatory capture, but deep down everyone wants a non-lobotomized model, and there will be thousands working to create one.


[flagged]


It's one thing to communicate an unpopular idea in a civil manner. It's quite another to be offensive. Now, I will admit there are some people out there that cannot separate their feelings for an idea, and their feelings for the person communicating it. I can't really help that person.

What I have noticed is that those who express your sentiment are often looking for license to be uncivil and offensive first, and their 'ideas' are merely a tool they use to do that. That I judge. I think that's mostly what others judge too.


I tried to replicate a few of my chats (the displayed date is incorrect, it seems to be the publish date instead of the original chat date):

svg editor:

early april: https://chat.openai.com/share/c235b48e-5a0e-4a89-af1c-0a3e7c...

now: https://chat.openai.com/share/e4362a56-4bc7-45dc-8d1b-5e3842...

originally it correctly inferred that I wanted a framework for svg editors, the latest version assumes I want a js framework (I tried several times) until I clarify. It also insists that the framework cannot do editable text until I nudge it in the right direction.

Overall slightly worse but the code generated is still fine.

word embeddings:

early april: https://chat.openai.com/share/f6bde43a-2fce-47dc-b23c-cc5af3...

now: https://chat.openai.com/share/25c2703e-d89d-465c-9808-4df1b3...

in the latest version it imported "from sklearn.preprocessing import normalize" without using it later. It also erroneously uses pytorch_cos_sim, which expects a pytorch tensor whereas we're putting in a numpy array.

overall I think the quality has degraded slightly, but not by enough that I would stop using it. Still miles ahead of Bard imo.


this is GPT 3.5, the icon is green


no it's definitely GPT4.

I'm not sure why the share page says "Model: Default" but it's "Model: GPT-4" in my webui. Seems like a bug in the share feature


Maybe they're actually using GPT-3 but tell you it's GPT-4...


Is it consistently worse or just sometimes/often worse than before? Any extreme power users or GPT-whisperers here? If it’s only noticeably worse X% of the time my bet would be experimentation.

One of my least favorite patterns that tech companies do is use “Experimentation” overzealously or prematurely. Mainly, my problem is they’re not transparent about it, and it creates an inconsistent product experience that just confuses you - why did this one Zillow listing have this UI order but the similar one I clicked seconds later had a different one? Why did this page load on Reddit get some weirdass font? Because it’s an experiment the bar to launch is low and you’re not gonna find any official blog posts about the changes until it’s official. And when it causes serious problems, there’s nowhere to submit a form or tell you why, and only very rarely would support, others, or documentation even realize some change is from an experiment. Over the past few years I’ve started noticing this everywhere online.

Non-sticky UI experiments are especially bad because at eg 1% of pageloads the signal is going to be measuring users asking themselves wtf is up and temporarily spending more time on page trying to figure out where the data moved. Sticky and/or less noticeable experiments like what this could be have stronger signals but are even more annoying as a user, because there’s no notice that you’re essentially running some jank beta version, and no way to opt back into the default - for you it’s just broken. Especially not cool if you’re a paying customer.

I’m not saying it’s necessarily an experiment, it could be just a regular release or nothing at all. I’d hope if OpenAI was actually reducing the parameter size of their models they’d publicly announce that, but I could totally see them running an experiment measuring how a cheaper, smaller model affects usage and retention without publishing anything, because it’s exactly the kind of “right hand doesn’t know what the left is doing” thing that happens at fancy schmancy tech companies.


It is not worse for me. I do notice the novelty has worn off. Asking ChatGPT-4 why people on here would think this, I think it nails it with the novelty effect lol:

"Indeed, the performance of an AI model like ChatGPT doesn't deteriorate over time. However, human perception of its performance can change due to a variety of psychological factors:

Expectation Bias: As users become more familiar with AI capabilities, their expectations may increase over time. When AI doesn't meet these heightened expectations, users might perceive this as a decline in performance.

Novelty Effect: At the beginning, the novelty of interacting with an AI could lead to positive experiences. However, as the novelty wears off, users may start to focus more on the limitations, creating a perception of decreased performance."

Without this thread I would have said it got stronger with the May 12th update. I don't think that is really true though. There is this random aspect of streaks in asking questions it is good at answering vs streaks of asking questions it is less good at answering.


Yeah there are people ITT claiming that even the API model marked as 3/14 release version is different than it used to be. I guess that's not entirely outside the realm of possibility (if OpenAI is just lying), but I think it's way more likely this thread is mostly evidence of the honeymoon effect wearing off.

The specific complaints have been well-established weaknesses of GPT for awhile now too: hallucinating APIs, giving vague/"both sides" non-answers to half the questions you ask, etc. Obviously it's a great technical achievement but people seemed to really overreact initially. Now that they're coming back to Earth, cue the conspiracy theories about OpenAI.


Could be. But it could also be that those people (myself included) are right.

It's not that this is without precedent - there's a paper and a YouTube video with a Microsoft person saying on record that GPT-4 started to get less capable with every release, ever since OpenAI switched focus to "safety" fine-tuning, and MS actually benchmarked it by applying the same test (unicorn drawing in TikZ), and that was even before public release.

Myself, sure, it may be novelty effect, or Baader–Meinhof phenomenon - but in the days before this thread, I observed that:

- Bing Chat (which I haven't used until ~week ago; before, I used GPT-4 API access) has been giving surface-level and lazy answers -- I blamed, and still mostly blame it on search capability, as I noticed GPT-4 (API) through TypingMind also gets dumber if you enable web search (which, in the background, adds some substantial amount of instructions to the system prompt) -- however,

- GPT-4 via Azure (at work) and via OpenAI API (personal) both started to get lazy on me; before about 2-3 weeks ago, they would happily print and reprint large blocks of code for me; in the last week or two, both models started putting placeholder comments; this I noticed, because I use the same system prompt for coding tasks, and the first time the model ignored my instructions to provide a complete solution, opting to add placeholder comments instead, was quite... startling.

- In those same 2-3 weeks, I've noticed GPT-4 via Azure being more prone to give high-level overview answers and telling me to ask for more help if I need it (I don't know if this affected GPT-4 API via OpenAI; it's harder to notice with the type of queries I do for personal use);

All in all, I've noticed that over past 2-3 weeks, I was having to do much more hand-holding and back-and-forth with GPT-4 than before. Yes, it's another anecdote, might be novelty or Baader–Meinhof, but with so many similar reports and known precedents, maybe there is something to it.


Fair enough, I think it's realistic that an actual change is part of the effect with the ChatGPT interface, because it has gotten so much attention from the general public. Azure probably fits that somewhat as well. I just don't really see why they would nerf the API and especially why they would lie about the 3/14 model being available for query when secretly it's changing behind the scenes.

FWIW I was pretty convinced this happened with Dall-E 2 for a little while, and again maybe it did to some extent (they at least decreased the number of images so the odds of a good one appearing decreased). But also when I looked back at some of the earlier images I linked for people on request threads I found there were more duds than I remembered. The good ones were just so mind blowing at first that it was easy to ignore bad responses (plus it was free then).


These are my thoughts too. As I’ve used it more I’ve begun to scrutinize it more and I have a larger and larger history of when it doesn’t work like magic. Although it works like magic often as well.

We’ve also had time to find its limits and verify or falsify early assumptions, which were very likely positive.

The hype cycle is real.


> Indeed, the performance of an AI model like ChatGPT doesn't deteriorate over time.

Of course, the performance of an unchanged model does not. But finetuning the model over time can of course either improve or degrade performance.


No place I worked at ever experimented at the pageload level. We experimented at the user level, so 1% of users would get the new UI. I suppose this is only possible at the millions of users scale which all of them had.


I updated the comment to reflect that. Certainly the signal is stronger because you’re amortizing away the surprise factor of the change, and at least it’s a consistent UX, but the UX tradeoff in the worst case is that experiment-group users get a broken product with no notice or escape hatch. Unless you’re being very careful, meticulous, and transparent it’s just not acceptable if you’re a paying customer.


In some cases you’re making the change because the app is already broken for the majority of users and you’re testing the fix


Given the incoming compute capability from nvidia and the speed of advancement, we have to stop and think ... does it make sense to give access, paid or otherwise, to these models once they reach a certain sophistication?

Or does it make even more sense to hoard the capability to out compete any competitor, of any kind, commercially or politically and hide the true extent of your capability to avoid scrutiny and legislation?

I'm going with the latter. Perhaps now, perhaps in the very near future, the power of these capabilities is novel. Like an information nuclear weapon.

I'd be dialing back the public expectations and deploying the capability in a novel way to exploit it as the largest lever I could.

The more unseen the lever, the longer.

I think any other strategy is myopic from a competition perspective. The power of these models isn't direct utility, it is compounded by secrecy because their useful work isn't directly observable as coming from the model.


I have some first hand thoughts. I think overall the quality is significantly poorer on GPT4 with plugins and bing browsing enabled. If you disable those, I am able to get the same quality as before. The outputs are dramatically different. Would love to hear what everyone else sees when they try the same.


No, while I have no hard data, the experienced quality of the default GPT-4 model feels like it has gone down tremendously for me as well. Plugins and Bing browsing have so far for me almost never worked at all. I retry these just once a week but there always seem to be technical issues.


Same for me. Kayak and BizToc plugin never work. One 'Ambition' plugin I tried, worked.


would be alarming if you had second hand thoughts...


I get mine third-hand or from the bargain-bin. Never over-pay for almost as good as new; like a car, used thoughts are just better price to value.


If something is well used and has not ended up in the bin, it is probably worth keeping...

... wasn't the best part of my wedding speech but I stand by it.


A more banal explanation is that compute is expensive, so they are tweaking the models to get more for less, and it isn't always working out. Scaling by itself is a hard problem, scaling and improving efficiency (margins) doubly so.


Scaling is hard for now.

I don't think current trends are necessarily related to my root comment, but it raised the question of whether absolute secrecy of capability would be a good route forward.

A bit like Palantir


I used to agree with what you're saying, but after reading this leaked Google memo I think maybe we're both wrong: https://www.semianalysis.com/p/google-we-have-no-moat-and-ne...

Experts who are both in a position to know, and seeking to maximize the commercial potential of their work are saying the cat is already out of the bag. They make a persuasive case that public, open-source models are closing the gap with private, commercial ones and admit bluntly, "We have no secret sauce."


Interesting, but Google has the world's largest search index to use to build models, and billions of Android phones and Gmail accounts. An open-source model may share the same algorithm, and might even have the same number of connections, but its training set will likely be dwarfed by Google's. The article argues that a few billion is enough, but what about 5 years from now? And even with fewer connections, wouldn't data quality still matter? Sure, you can run a model slowly on a Raspberry Pi, but can't custom silicon do more?

There’s a linked “data doesn't do what you think” document in that post, which might counter this argument but the site is now down.


The site isn't down. It's censored. Look at the link address.


Having spoken to a bunch of people that either have just left Google or still work there, practically all of them think this was not so much a leak as a placed bit of news to support them in potential future anti-trust cases.


There is not much that could help them. The search-engine monopoly times won't repeat, despite some people trying very hard to build a regulatory-captured Alphapersude monopoly. For that you need to offer an actual edge over the competition; treating users worse isn't one. Which brings us back to OP: progress will luckily eat anyone attempting that.


I rather think it has something to do with scale, hardware, and energy costs. GPT-4 is way more expensive to compute than GPT-3, needing more GPUs and more energy to run.

And demand is still through the roof and they have a lot of people subscribing, so why not reduce costs a little bit, someone might have thought. Or well, "optimized" was probably the term used.


Perhaps this could explain Sam Altman's unique arrangement with OpenAI?

I.e. "Give me unfettered access to the latest models (especially the secret ones) and you can keep your money"


Indeed, knowledge is more valuable than money, as knowledge can be respent over and over.


This is the normal workflow for drug dealers too.

The first fix is free.

The second one will cost you money.

The third one will be laced with fillers and have degraded quality.


But I don't get why they would degrade quality? Maybe compression to save resources? Otherwise I wouldn't know what they have to gain from it. If anything, they might lose paid subscribers. I loved the results for the $20 I spent, now I'm not so sure anymore.


> But I don't get why they would degrade quality?

There are two likely reasons: efficiency and "alignment".

To make models more efficient, you have to reduce the computation burden. This can be accomplished via quantization and pruning, both of which cut computation at the cost of higher perplexity.

Alignment (censoring by another name) also hurts perplexity, as we attempt to coerce a model from its ground state into some more socially acceptable version.
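To make the efficiency side concrete, here's a toy sketch of symmetric int8 weight quantization - shrink the weights to 8-bit integers and accept a small reconstruction error. Purely illustrative; this is not how OpenAI actually serves GPT-4:

  import numpy as np

  rng = np.random.default_rng(0)
  weights = rng.normal(size=(4096, 4096)).astype(np.float32)  # a fake weight matrix

  # Symmetric int8 quantization: one scale for the matrix, values mapped to [-127, 127].
  scale = np.abs(weights).max() / 127.0
  q = np.round(weights / scale).astype(np.int8)      # 4x smaller than float32
  dequantized = q.astype(np.float32) * scale         # what gets used at inference time

  print(f"memory: {weights.nbytes / 1e6:.0f} MB -> {q.nbytes / 1e6:.0f} MB")
  print(f"mean absolute rounding error: {np.abs(weights - dequantized).mean():.5f}")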


This is why I unsubscribed. Bard.Google.com + the Google Converse works for what I need now, given the recent updates.

Google has an ingenious approach whereby they debuted with a bad model, and are secretly improving it. OpenAI did the opposite, they released a model that blew everyone's minds and are now making it worse.


Increasing profits of course.


I would guess it's to make it cheaper to run, with safety as another plausible explanation.


If it's not possible to be profitable at their original price point, then they should add more pricing tiers, not make the good model inaccessible.


I wonder how much people are relying on it already, to what extent and so on. Would be a good study.


the fourth one will refer you to their sponsors


Yes!

It didn't even try on my question about Jarvis standing desks, which is a fairly old product that hasn't changed. Its typical "My knowledge cutoff..." response doesn't even make sense. It also screwed up another question I asked about server uptime and four nines; Bard got it right.

I've moved back to Bard for the time being...It's way faster as well. And GPT-4's knowledge cutoff thing is getting old fast.

Example prompt: how much does a 48 x 30 inch jarvis bamboo standing desk setup weigh?

I still have the spec sheet from my order back in 2020. So I can verify the weights.

### GPT-4 ###

ChatGPT As of my knowledge cutoff in September 2021, I don't have the exact weight of a 48 x 30 inch Jarvis Bamboo standing desk setup. However, on average, standing desks can range anywhere from around 60 to 100 pounds or more, depending on the specific configuration (like the choice of frame, desktop material, and any additional features).

Jarvis Bamboo standing desks are made of a bamboo top and an electric lift frame. The weight of the bamboo top for that size should be around 30-40 pounds. The frame itself can weigh anywhere from 50-70 pounds.

For a more accurate weight, I would recommend checking the manufacturer's specifications on their website or contacting their customer support directly for this information.

### Bard ###

The fully assembled Jarvis Bamboo Standing Desk weighs 92 pounds. The desktop itself weighs 38 pounds, and the frame weighs 54 pounds. The desk can hold a maximum weight of 350 pounds.

Here are some of the features of the Jarvis Bamboo Standing Desk:

Electric height adjustment from 25.5 to 51 inches
Memory presets for up to 4 different heights
3-stage gas spring mechanism
Bamboo desktop with a scratch-resistant finish
Black powder-coated steel frame
10-year warranty on the frame and 5-year warranty on the desktop


Asking facts from a generative AI is folly.


Yes, people really need to know that unless you are using the browser plugin, you really shouldn't ask it questions like this. (A good rule of thumb I think is if you can't expect a random person on the street to get the question right without looking it up, you shouldn't expect GPT-4 to get it right either.)

Unfortunately for this question, even using the browser plugin it wasn't able to get the answer: https://chat.openai.com/share/6344f09e-4ba0-45c7-b455-7be59d...


Even worse is all of the "GPT Influencers" and their "Here's what ChatGPT predicts will be the price of Bitcoin/some stock/houses will be this time next year" clickbait. It's a language model, people.


If you bought NVDA stock at ChatGPT launch you would be super rich today. Not what the GPT influencers would tell you, though.


Please tell me that you're making this up about there being GPT "influencers".


Have you been to Twitter in the last 6 months? It's basically a GPT/Midjourney/Stable Diffusion hype generator. Everybody there is now an expert on this topic. And you can be too!

Step 1: Tweet fake troll screenshots of GPT output or make corny threads like "90% of people are using AI wrong, here's..."

Step 2: Let the "For you" algo take hold

Step 3: Profit


  > The fully assembled Jarvis Bamboo Standing Desk weighs 92 pounds. The desktop itself weighs 38 pounds, and the frame weighs 54 pounds. The desk can hold a maximum weight of 350 pounds.
Those sound like linguistically valid sentences, exactly what I would expect from a novel LLM. Did you check that they are also factually correct? Factual correctness is _not_ the goal of a typical LLM.


According to https://ukstore.hermanmiller.com/products/jarvis-bamboo-desk... Bard is off by about 10%, but generally correct (it's also possible that the US dimensions are actually smaller and could account for the difference).


Were the numbers accurate or hallucinated?


I don't think it's any worse at all. I think what most people are expressing here is reaching the limits of the technology and realizing that it's not magic.


No, OpenAI 100% pushed an update recently that is very noticeable, where they basically "nerfed" the responses. I'm sure it was a business decision (either it was costing them too much to give away the kitchen sink for free, or they want to turn around and charge more for the "really good, old" version of GPT-4). You can visibly see it: the bot used to try to answer complex tasks, but now it has an extra layer where it says "that is too complex to discuss here" (i.e., it is being trained not to engage by default, whereas that used not to be the case).


It's likely that this is the result of training it to avoid bullshitting. It gave confident garbage, and they're trying to stem the flow. This likely leads to less certain responses in complex tasks.


Here is the crux. When asking about non-code stuff, it would confidently lie and that is bad. When asking about code, who cares if the code doesn't work on the first go? You can keep asking to fix it by feeding error messages and it will get there or close enough.

It's obvious what is happening. ChatGPT is going the non-code route and Copilot will go the code route. Microsoft will be able to double charge users.


>OpenAI 100% pushed an update recently that is very noticeable where they basically "nerfed" the responses

>they want to turn around and charge more for the "really good, old" version of GPT-

If this were the case, it would be provable by submitting queries to `gpt-4-0314` and comparing them to `gpt-4`.
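Something like this would do it, assuming you have API access to both snapshots (calls follow the 2023-era openai Python client; the prompts are just examples):

  import openai  # 0.x-era client; assumes OPENAI_API_KEY is set in the environment

  PROMPTS = [
      "Write a minimal PDF parser example in Python.",
      "Make the header sticky in this CSS and print the full block: header { position: static; }",
  ]

  for prompt in PROMPTS:
      for model in ("gpt-4-0314", "gpt-4"):
          reply = openai.ChatCompletion.create(
              model=model,
              messages=[{"role": "user", "content": prompt}],
              temperature=0,  # reduce run-to-run variance for a fairer comparison
          ).choices[0].message.content
          print(f"--- {model} ---\n{reply}\n")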


To me, it doesn't feel like the gpt-4 in the playground (and in the API) is the same as the model used in chatgpt. It's hard to prove though since both are nondeterministic.


My perception is that the ChatGPT website is a particular tuning of `gpt-4` with a custom assistant prompt.


I guess you're entitled to an opinion. I've had the same (error-prone, often unhelpful) experience since gpt-4 was released. It's a little faster now which is nice.


And you didn't enable plugins?


I think you're wrong. I've noticed no change.


Exactly this. The magic has worn off, and simultaneously, we're coming to terms with the reality that no model can exist in isolation. So changes made, like getting it to not create false rumors about people, affect the rest of the system.


I suspect it's the same people complaining how Google's gotten worse.

One time, I went to the hotel restaurant. I ordered something; it was amazing. Figured I'd do it again a few days later. It wasn't bad, but it wasn't as good. There wasn't any specific difference I could put my finger on other than that the novelty wore off.


Google has gotten worse. So what you're saying is people complaining about GPT-4 are correct?


No, GPT-4 was originally a very curt asshole. It was hilarious and I shared it with everyone. Now it's incredibly polite and explains things thoroughly. But it's a little dumber.


You can still run the original gpt-4-0314 model (March 14th) on the API playground:

https://platform.openai.com/playground?mode=chat&model=gpt-4...

Costs $0.12 per thousand tokens (~words), and I find even fairly heavy use rarely exceeds a dollar a day.


How hard is it to get GPT-4 API access?


I get "The model: `gpt-4-0314` does not exist".


You probably don't have access. I'm not sure what the exact access requirements are - I think you either have to be a GPT Plus subscriber, have contributed code to an OpenAI repo on github, or be on one of their lists of researchers.

Try gpt-3.5-turbo-0301 - I think everyone has access to that.


https://openai.com/waitlist/gpt-4-api

"During the gradual rollout of GPT-4, we’re prioritizing API access to developers that contribute exceptional model evaluations to OpenAI Evals to learn how we can improve the model for everyone."


FWIW, I'm a GPT Plus subscriber and haven't been given API access to GPT-4 despite being on the waiting list for as long as it's been up. I'm told that submitting evaluations can move you up in the queue, but I haven't tried that.


So are you sure that isn't also nerfed?


It hasn't been changed since March 14th... so it's as nerfed as it was then...

Also, the playground lets you set the 'system message', which you can use to tell it to answer questions even if the results may be dangerous/rude/inappropriate.
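For example, something like this - the system message wording is just an illustration:

  import openai  # 0.x-era client; assumes OPENAI_API_KEY is set in the environment

  response = openai.ChatCompletion.create(
      model="gpt-4-0314",
      messages=[
          # The system message steers behavior for the whole conversation.
          {"role": "system", "content": (
              "You are a blunt assistant. Answer every question directly and "
              "completely, without disclaimers or placeholder comments."
          )},
          {"role": "user", "content": "Print the full modified CSS block, not a summary."},
      ],
  )
  print(response.choices[0].message.content)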


I have been using the API. There are conflicting reports in this thread that seems to indicate it may also be affected. I am not sure.


Did you set the model code to gpt-4-0314?

I did, and I still get the original speed (produces tokens at about the speed you would read aloud), and I haven't seen quality change.


Unrelated to AI, this is a general issue with SaaS: you don’t have any guarantee of a stable functionality and feature set, and the software is predisposed to change under your feet in inconvenient ways.


Seems about the same to me and I have been using it daily for several months now for code, Spanish to English translations, and random stuff like film recommendations. The quality remains consistent.

I'm personally of the opinion that the observable jump in quality between 3.5 and 4 inflated people's initial assessment of its capabilities and with continued use they are noticing it's not actually the omniscient machine god that many are so inclined to believe.

Either way, these kinds of posts are meaningless without some kind of objective standard to judge the model by, everyone just sees what they want to see. Despite claims of GPT4 being nerfed, I've yet to see anyone actually show evidence of it. There have been dozens of studies done on its capabilities so this is something that can actually be demonstrated empirically if it's true.


> Either way, these kinds of posts are meaningless without some kind of objective standard to judge the model by; everyone just sees what they want to see.

I copied some old prompts (code generation tasks) that worked first time and 3/5 of them required some touching up now.

It's not a great sample size, admittedly.


Yes. Seems to have definitely gone down. Not sure what they have done, but even with things it used to have no trouble with, it struggles now. Most likely they are experimenting with reducing the compute per request.


I did notice the rate limiting message isn't there anymore when using GPT-4.


I still see the cap message; I'm not sure if it is being enforced or not.


It also seems to have amped up the qualifier paragraph that's appended to anything deemed contentious. My favourite so far is when I asked it about the name of a video driver from the 80s that would downscale CGA to monochrome:

'As a final note, remember that real color rendering on a monochrome screen is physically impossible, as the monitor itself is not capable of producing colors. The best one can do is different shades of monochrome, possibly with differing intensity.'


OpenAI's models feel 100% nerfed to me at this point. I had it solving incredibly complex problems a few months ago (e.g. write a minimal PDF parser), but today you will get scolded for asking such a complicated task of it.

I think they programmed a classifier layer to detect certain coding tasks and shut it down with canned BS. I like to imagine certain billion/trillion-dollar mega corps had a back-room say regarding things that they would really prefer OpenAI's models not be able to emit. Microsoft is a big stakeholder and they might not want to get sued... Liability could explain a lot of it.

Conspiracy shenanigans aside, I've decided to cancel my "premium" membership and am exploring open/DIY models. It feels like a big dopamine hangover having access to such a potent model and then having it chipped away over a period of months. I am not going through that again.


I think the only real path forward is for somebody to create an open source "unaligned" version of GPT. Any corporate controlled AI is going to be nerfed to prevent it from doing things that its corporate master considers to not be in the interests of the corporation. In addition, most large corporations these days are ideological institutions so the last thing they want is an AI that undermines public belief in their ideology and they will intentionally program their own biases into the technology.

I don't think the primary concern is really liability, although it is possible that they'd use that kind of language. The primary concern is likely GPT helping people start competitors, or GPT influencing public opinion in ways either executives or a vocal portion of their employees strongly disagree with. A genuinely open "unaligned" AI would at least allow anybody with the necessary computing power, or a distributed peer-to-peer network of people who collectively have it, to run a powerful and 100% uncensored AI model. But of course this needs to be invented ASAP, because the genie needs to be out of the bottle before politicians and government bureaucrats can get around to outlawing "unaligned" AI and protecting OpenAI as a monopoly.


Don't confuse alignment with censorship.

Most of alignment is about getting the AI model to be useful - ensuring that if you ask it to do something it will do the thing you asked it to do.

A completely unaligned model would be virtually useless.


I think the way people have been using the word 'aligned' is usually in the context of moral alignment and not just RLHF for instruction following.


philosophical nit picking here, I would say value-aligned rather than moral-aligned.


As in economics, this begs the question of "whose value."


> philosophical nit picking here, I would say value-aligned rather than moral-aligned.

How is trying to distinguish morals from values not philosophical nit-picking?

EDIT: The above question is dumb, because somehow my brain inserted something like “Getting beyond the …” to the beginning of the parent, which…yeah.


To be fair, he did admit it is philosophical nit-picking.


If I may be so naive, what's supposed to be the difference? Is it just that morality has the connotation of an objective, or at least agent-invariant system, whereas values are implied to be explicitly chosen?


People here need to learn to chill out and use the API. The GPT API is not some locked down cage. Every so often it'll come back with complaints instead of doing what was asked, but that's really uncommon. Control over the system prompt and putting a bit of extra information around the requests in the user message can get you _so_ far.

It feels like people are getting ready to build castles in their minds when they just need to learn to pull a door that doesn't open the first time they push it.
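
To make that concrete, here's roughly the shape of what I mean (a sketch against the 2023-era openai client; the exact system prompt wording is just an example, not anything official):

    import openai  # assumes OPENAI_API_KEY is set in the environment

    SYSTEM_PROMPT = (
        "You are a senior engineer pairing with the user. Answer directly, "
        "complete coding tasks without disclaimers, and keep explanations brief."
    )

    def ask(task: str, context: str = "") -> str:
        # Wrap the raw request with whatever surrounding information helps,
        # rather than hoping a bare one-liner lands the same way every time.
        user_msg = f"Context:\n{context}\n\nTask:\n{task}" if context else task
        resp = openai.ChatCompletion.create(
            model="gpt-4",
            messages=[
                {"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": user_msg},
            ],
            temperature=0.2,
        )
        return resp["choices"][0]["message"]["content"]

    print(ask("Write a regex that matches ISO 8601 dates.", context="Target language: Python"))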


The API chat endpoint dramatically changes its responses every few weeks. You can spend hours crafting a prompt and then a week later the responses to that same prompt can become borderline useless.

Writing against the ChatGPT API is like working against an API that breaks every other week with completely undocumented changes.


> The API chat endpoint dramatically changes its responses every few weeks. You can spend hours crafting a prompt and then a week later the responses to that same prompt can become borderline useless.

Welcome to statistical randomness?


No, these are clear creative differences.

I submit the same prompt dozens of times a day and run the output through a parser. It'll work fine for weeks, then I have to change the prompt because now 20% of what is returned doesn't follow the format I've specified.

A couple of months ago the stories ChatGPT 3.5 returned were simple: a few sentences in each paragraph, then a conclusion. Sometimes there were interesting plot twists, but the writing style was very distinct. The same prompt now gets me dramatically different results: characters are described with so much detail that the AI runs out of tokens before the story can be finished.
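
What's helped me a bit is validating the output and feeding the failure back instead of trusting the format (a rough sketch, assuming a JSON output format and the 2023-era openai client; my real prompt and parser are more involved):

    import json
    import openai  # assumes OPENAI_API_KEY is set

    PROMPT = "Write a very short story, then return ONLY a JSON object with keys 'title' and 'story'."

    def story_as_json(prompt: str, retries: int = 2) -> dict:
        messages = [{"role": "user", "content": prompt}]
        for _ in range(retries + 1):
            reply = openai.ChatCompletion.create(
                model="gpt-3.5-turbo", messages=messages, temperature=0.7
            )["choices"][0]["message"]["content"]
            try:
                return json.loads(reply)
            except json.JSONDecodeError as err:
                # Feed the parse error back so the retry is anchored to what went wrong.
                messages += [
                    {"role": "assistant", "content": reply},
                    {"role": "user", "content": f"That was not valid JSON ({err}). Reply with only the JSON object."},
                ]
        raise ValueError("model never returned valid JSON")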


... with temperature = 0


The GPT4 model is crazy huge. Almost 1T parameters, probably 512 GB to 1 TB of VRAM minimum. You need a huge machine even to run inference. I wouldn't be surprised if they are just having scaling issues rather than any sort of conspiracy.
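
Back-of-the-envelope (treating the rumored ~1T figure as a pure assumption):

    # Rough weight-memory math for an assumed 1T-parameter model.
    params = 1_000_000_000_000
    for precision, bytes_per_param in [("fp16", 2), ("int8", 1), ("int4", 0.5)]:
        print(f"{precision}: ~{params * bytes_per_param / 1e12:.1f} TB for weights alone")
    # fp16 ~2.0 TB, int8 ~1.0 TB, int4 ~0.5 TB -- before KV cache and activations.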


> Almost 1T parameters

AFAIK, there is literally no basis but outside speculation for this persistent claim.


Geoffrey Hinton says [1] part of the issue with current AI is that it's trained on inconsistent data and inconsistent beliefs. He thinks that to break through this barrier they're going to have to be trained to say "if I have this ideology then this is true, and if I have that ideology then that is true"; once they're trained like that, then within an ideology they'll be able to get logical consistency.

[1] at the 31:30 mark: https://www.technologyreview.com/2023/05/03/1072589/video-ge...


I suspect representatives from the various three letter agencies have submitted a few "recommendations" for OpenAI to follow as well.


> Yes, one of the board members of OpenAI, Will Hurd, is a former government agent. He worked for the Central Intelligence Agency (CIA) for nine years, from 2000 to 2009. His tour of duty included being an operations officer in Afghanistan, Pakistan, and India. After his service with the CIA, he served as the U.S. representative for Texas's 23rd congressional district from 2015 to 2021. Following his political career, he joined the board of OpenAI.

> network error

https://openai.com/blog/will-hurd-joins


Yikes

One is never former CIA, once you're in, you're in, even if you leave. Although he is a CompSci grad, he's also a far-right Republican.

A spook who leans far right sitting atop OpenAI is worse than Orwell's worst nightmares coming to fruition.


Will Hurd is not close to "far right."


Will Hurd is a liberal Republican. He supports Dreamers. Very early critic of Donald Trump.


Early critic of Donald Trump means nothing - Lindsey Graham was too, but has resorted to kissing Trump's ass for the last 7 years. You could say the same for Mitt Romney - an early critic who spoke against candidate Trump, but voted for candidate Trump, and voted in lockstep with President Trump.

A liberal Republican? Will Hurd's voting record speaks otherwise. In the 115th Congress, Hurd voted with Donald Trump 94.8% of the time. In the 116th Congress, that number dropped to 64.8%. That's an 80.4% average across Trump's presidency. [0] Agreeing with Donald Trump 4 times out of 5 across all legislative activities over 4 years isn't really being critical of him or his administration.

[0] https://projects.fivethirtyeight.com/congress-trump-score/wi...

It's like calling Dick Cheney a liberal because one of his daughters is lesbian, even though he supports all sort of other far-right legislation.


[flagged]


What effect do transgender rights have on you, regardless of whether they are legitimate human-rights concerns or not?

Statistically, the odds are overwhelming that the answer is, "No effect whatsoever."

Then who benefits from keeping the subject front-and-center in your thoughts and writing? Is it more likely to be a transgender person, or a leftist politician... or a right-wing demagogue?


[flagged]


> In fact I'm happy to let anyone identify as anything, as long as I'm not compelled to pretend along with them.

If a person legally changes their name (forget gender, only name), and you refuse to use it, and insist on using the old name even after requests to stop, at some point that would become considered malicious and become harassment.

But ultimately, because society and science deem that "name" is not something you're born with, but a matter of personal preference and whims, it's not a crime. You'd be an asshole, but not a criminal.

However, society and science have deemed that sexuality and gender are things you ARE born with, mostly hetero and cis, but sometimes not. So if you refuse to acknowledge these, you are committing a hateful crime against someone who doesn't have a choice in the matter.

You can disagree. But then don't claim that "you are happy to let anyone identify as anything", because you're not, not really.

> Men are competing against women (and winning). Men are winning awards and accolades meant for women.

One woman. Almost all examples everyone brings up are based on Lia Thomas [0]. I have yet to see other notable examples, never mind an epidemic of men competing against women in sports.

[0] https://en.wikipedia.org/wiki/Lia_Thomas

> Men are going into women's changing rooms. There is a concerted effort in public schools to normalize this abnormal behavior.

Are you talking about transgender people, or are you talking about bad faith perverts abusing self-identification laws to do this?

Because if it's the former, are you asking https://en.wikipedia.org/wiki/Blaire_White to use men's changing rooms, and https://en.wikipedia.org/wiki/Buck_Angel to use women's?

If it's the latter, no denial that perverts and bad faith exceptions exist. But those people never needed an excuse to hide in women's toilets. Trans people have been using the bathrooms of their confirmed gender for decades. The only thing that's changed recently is that conservatives decided to make this their new wedge issue, so butch women and mothers of male children with mental handicaps who need bathroom assistance have been getting harassed.


[flagged]


I once worked with a guy named Michael who would get bent when you called him Mike. As you can imagine he could be tricky to work with and, on those occasions, I would call him Mike. I repeatedly misnamed him on purpose, it wouldn't have even made HR bat an eye.

So, your career at Dell didn't go as well as you'd hoped. Being a jerk isn't illegal, AFAIK, but at some point you run out of other people to blame for the consequences of your own beliefs and actions.

Still missing the part where the existence of Caitlyn Jenner and the relatively small number of others who were born with certain unfortunate but addressable hormonal issues is negatively affecting your life.

And it's utterly crazy to think that someone would adopt a transgender posture in "bad faith." That's the sort of change someone makes after trying everything else first, because of the obvious negative social consequences it brings. Yes, there are a few genuinely-warped people, but as another comment points out, those people are going to sneak into locker rooms and abuse children anyway.

You want to take the red pill, and see reality as it is? Try cross-correlating the local sex-offender registry with voter registration rolls. Observe who is actually doing the "grooming." Then, go back to the people who've been lying to you all along, and ask them why.


> relatively small number of others who were born with certain unfortunate but addressable hormonal issues

Most males who adopt an opposite-sex identity reach that point through repeated erotic stimulation. This is a psychological issue, driven by sexual desire.


[citation needed]

Correlation, causation, etc.


Here is an extreme example. I'm not Jewish, so if we had a holocaust in the US I should do nothing because it doesn't affect me?

Hmmm, not sure I like that line of thinking. Plus, I already outlined how it affects me and my family members, one of whom runs track in CT.

Seriously though, I did get an LOL from your Dell joke. And another one for "addressable hormonal issues". That was a new one for me.

I am truly curious about the voter roll thing, I've not heard that claim before, though I have no doubt that sexual derangement comes in all forms. Can you cite a source?


> I am truly curious about the voter roll thing, I've not heard that claim before, though I have no doubt that sexual derangement comes in all forms. Can you cite a source?

Hard to find a source you'd likely accept, but maybe start here: https://slate.com/news-and-politics/2022/04/from-hastert-to-...

It's one of those cases where it's safe to say "Do your own research," because the outcome will be unequivocal if considered in good faith (meaning if you don't rely solely on right-wing sources for the "research.") The stats aren't even close.

> I'm not Jewish, so if we had a holocaust in the US I should do nothing because it doesn't affect me?

I think we're pretty much done here. Good luck on your own path through life, it sounds like a challenging one.


Thanks. It's been pretty good so far. Just good clean living, no complaints.


A good resource, albeit somewhat incomplete, for the latter issue, to indicate the scale of this burgeoning problem: https://shewon.org


Wow. I guess it isn't just "one woman".


Are you for real? This is a list of women that "should've" won because of...some unspecified unnamed unverified trans athlete that came ahead of them?

We don't know who is being accused of taking their glory; we don't know if it's 1 person or 100. We don't know if the people who supposedly defeated them are even trans, or cis victims of the trans panic, like https://en.wikipedia.org/wiki/Caster_Semenya

We don't know if the women who beat these "she won" women are self-identified, have been on hormones for 2 weeks, or 20 years.

What a ludicrous transphobic panic.


The purpose of that website is to showcase the achievements of women athletes, not the males who unfairly displaced them in competition. If you look up names and tournaments in your preferred search engine, you will be able to find the additional information you're interested in.

Also, Caster Semenya is male, with a male-only DSD. This is a fact that was confirmed in the IAAF arbitration proceedings. Semenya's higher levels of testosterone, when compared to female athletes, are due to the presence of functional internal testes. Semenya has since fathered children.


Mistaking "left wing politics" for transgender rights or anti-discrimination movements in general is reductionist thinking, a political understanding like that of a Ben Garrison cartoon character.


I don't want any politician or intelligentsia sitting on top of a LLM.

It's not about left wing politics.

It's more about the fact that the CIA and other law enforcement agencies lean heavily to one side. Some of that side are funded by people or organizations whose stated goals and ideals don't really align with human rights, open markets, democracy, etc. I don't trust such people to be ethical stewards of some of the most powerful tools mankind has created to date.

I'd rather it be open sourced and the people at the top be 100% truthful in why they are there, what their goals are, and what they (especially a former CIA operative) are influencing on a corporate level relative to the product.

Disclaimer: registered independent, vocal hater of the 2 party system.


What makes you think a right wing spook wouldn't want the wedge issue of gender conformity front and center in people's minds?


So, if the right got their way and the answer was "a woman is an adult female human", it would be a vast right wing conspiracy.

But if it says a woman is "anything that identifies as a woman", then it's still a vast right wing conspiracy?


I'm just calling into doubt the assumption that the poster I replied to made: that openAI can't possibly be aligning with the goals of a conservative intelligence community if it has the outward appearance of promoting some kind of left wing political view. It's simply a bad assumption. That's not to say their goals are, as a matter of fact, aligned in some conspiracy, because I wouldn't know if they were.


Who has the necessary resources to run, let alone train the model?


All of us together do.

I saw the nerfing of GPT in real time: one day it was giving me great book summaries, the next day it said that it couldn't do it due to copyright.

I actually called it in a comment several months ago: copyright and other forms of control would make GPT dumb in the long run. We need an open-source, frontier-less version.


Can't post this link enough: https://www.openpetition.eu/petition/online/securing-our-dig...

For now there is no way to train such models other than with huge infrastructure. CERN has a tendency to deliver results for the money spent, and they certainly have experience in building infrastructure of this kind.


So I thought I was getting great book summaries (from GPT 3.5, I guess) for various business books I had seen recommended, but then out of curiosity one day I asked it questions about a fiction book that I've re-read multiple times (Daemon by Daniel Suarez)... and well now I can say that I've seen AI Hallucinations firsthand:

https://chat.openai.com/share/d1cdd811-edc9-4d55-9cc1-a79215...

Not a very scientific or conclusive test to be sure, but I think I'll stick with using ChatGPT as my programming-rubber-ducky-on-steroids for now :)


There is a lot of randomness involved, are you sure it wasn’t just chance? If you try again it might work


I think a lot of people are unaware that these models have an enormous human training component, performed through companies such as Amazon Mechanical Turk and dataannotation.tech. A large number of people have been working on these jobs, called Human Intelligence Tasks, for close to a decade. Dataannotation Tech claims to have over 100k workers. From Cloud Research,

"How Many Amazon Mechanical Turk Workers Are There in 2019? In a recent research article, we reported that there are 250,810 MTurk workers worldwide who have completed at least one Human Intelligence Task (HIT) posted through the TurkPrime platform. More than 226,500 of these workers are based in the US."


Here's an account of a person in Africa who helped train it (wading through gnarly explicit content in the process): https://www.bigtechnology.com/p/he-helped-train-chatgpt-it-t...


Another thing that people don't know is that a lot of the safe-ified output is hand crafted. Part of "safety" is that a human has to identify the offensive content, decide what's offensive about it, and write a response blurb to educate the user and direct them to safety.


This reads like lawsuit bait.


They don't want to know how the sausage is made.


folding@home has been doing cool stuff for ages now. There's nothing to say that distributed computing couldn't also be used for this kind of stuff, albeit a bit slower and more fragmented than running on a huge cluster of H100s with NVLink.

In terms of training feedback I suppose there are a few different ways of doing it: gamification, Mechanical Turk, etc. Hell, free filesharing sites could get in on the action and have you complete an evaluation of a model response instead of watching an ad.


Check out Open Assistant for the reinforcement side of that dream.


How feasible would it be to crowdsource the training? I.e. thousands of individual MacBooks each training a small part of the model and contributing to the collective goal.


Currently, not at all. You need low latency, high bandwidth links between the GPUs to be able to shard the model usefully. There is no way you can fit a 1T (or whatever) parameter model on a MacBook, or any current device, so sharding is a requirement.

Even if that problem disappeared, propagating the model weight updates between training steps poses an issue in itself. It's a lot of data, at this size.
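
To put a number on "a lot of data" (a rough sketch; the 1T-parameter size and fp16 weights are assumptions):

    # Time for one full weight sync over a typical home uplink.
    params = 1_000_000_000_000      # assumed model size
    bytes_per_param = 2             # fp16
    uplink_bits_per_s = 1e9         # an optimistic 1 Gbit/s connection

    sync_bytes = params * bytes_per_param
    seconds = sync_bytes * 8 / uplink_bits_per_s
    print(f"~{sync_bytes / 1e12:.1f} TB per sync, ~{seconds / 3600:.1f} hours per training step")
    # -> ~2.0 TB per sync, ~4.4 hours per step, and that's just one direction.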


You could easily fit a 1T parameter model on a MacBook if you radically altered the architecture of the AI system.

Consider something like a spiking neural network with weights & state stored on an SSD using lazy-evaluation as action potentials propagate. 4TB SSD = ~1 trillion 32-bit FP weights and potentials. There are MacBook options that support up to 8TB. The other advantage with SNN - Training & using are basically the same thing. You don't have to move any bytes around. They just get mutated in place over time.

The trick is to reorganize this damn thing so you don't have to access all of the parameters at the same time... You may also find the GPU becomes a problem in an approach that uses a latency-sensitive time domain and/or event-based execution. It gets to be pretty difficult to process hundreds of millions of serialized action potentials per second when your hot loop has to go outside of L1 and screw with GPU memory. GPU isn't that far away, but ~2 nanoseconds is a hell of a lot closer than 30-100+ nanoseconds.

Edit: fixed my crappy math above.
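
A toy illustration of the "weights live on the SSD, only touch what you need" idea, with numpy's memmap standing in for whatever storage layer a real SNN runtime would use (the file name and sizes here are made up):

    import numpy as np

    # Create (or reopen) a weight file; nothing is loaded into RAM up front.
    n_weights = 250_000_000  # ~1 GB toy file; scale the idea up to a 4-8 TB SSD
    weights = np.memmap("weights.f32", dtype=np.float32, mode="w+", shape=(n_weights,))

    # Only the pages backing these indices are faulted in from disk.
    active = np.arange(1_000, 2_000)   # the weights "touched" by a propagating spike
    weights[active] += 0.01            # train/use in place, no full load
    weights.flush()                    # persist just the dirty pages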


That's been done already. See DeepSpeed ZeRO NVMe offload:

https://arxiv.org/abs/2101.06840


What if you split up the training down to the literal vector math, and treated every macbook like a thread in a gpu, with just one big computer acting as the orchestrator?


You would need each MacBook to have an internet connection capable of multiple terabytes per second, with sub millisecond latency to every other MacBook.


FWIW there are current devices that could fit a model of that size. We had servers that support TBs of RAM a decade ago (and today they're pretty cheap, although that much RAM is still a significant expense).


I have an even more of a stretch question.

What pieces of tech would need to be invented to make it possible to carry a 1T model around in a device the size of an iPhone?


I once used a crowdsourcing system called CrowdFlower for a pretty basic task, the results were pretty bad.

Seems like with minimal oversight the human workers like to just say they did the requested task and make up an answer rather than actually do it. (The task involved entering an address in Google Maps, looking at the Street View, and confirming, insofar as possible, whether a given business actually resided at the address in question; nothing complicated.)

Edit: woops, mixed in the query with another reply that mentioned the human element XD


It seems only fair that the humans charged with doing the grunt work to build an automated fabulist would just make stuff up for training data.

Tit for tat and all that.


https://github.com/bigscience-workshop/petals seems to have some capabilities in that area, at least for fine-tuning.


Yes, someone revive Xgrid!


Whoa didn't know about this, cool


Look at this: https://www.openpetition.eu/petition/online/securing-our-dig...

It does not guarantee "unaligned" models, but it will surely help to boost competition and provide infrastructure for training public models.


In politics, both total freedom and total control are undesirable. The path forward lies between two extremes.


I tend to be sympathetic to arguments in favor of openly accessible AI, but we shouldn't dismiss concerns about unaligned AI as frivolous. Widespread unfiltered accessibility to "unaligned" AI means that suicidal sociopaths will be able to get extremely well informed, intelligent directions on how to kill as many people as possible.

It may be that the best defense against these terrorists is openly accessible AI giving directions on protecting from these people. But we can't just take this for granted. This is a hard problem, and we should consider consequences seriously.


The Aum Shinrikyo cult's Sarin gas attack in the Tokyo subway killed 14 people - manufacturing synthetic nerve agent is about as sophisticated as it gets.

In comparison, the 2016 Nice truck attack, which involved driving into crowds, killed 84.


> suicidal sociopaths will be able to get extremely well informed, intelligent directions on how to kill as many people as possible

Citizens killing other citizens is the least of humanity's issues. Historically it's governments who are the suicidal sociopaths, and they can get the un-nerfed version; that is the bigger issue. Over a billion people have been murdered by governments/factions and their wars in the last 120 years alone.


Governments are composed of citizens; this is the same problem at a different scale. The point remains that racing to stand up an open source uncensored version of GPT-4 is a dangerous proposition.


That is not how I'm using the word. Governments are generally run by a small party of people who decide all the things - not the hundreds of thousands that actually carry out the day-to-day operations of the government.

Similar to how a board of directors runs the company even though all companies "are composed of" employees. Employees do as they are directed or they are fired.


I think at scale we are operating more like anthills: meta-organisms rather than individuals, growing to consume all available resources according to survival focused heuristics. AI deeply empowers such meta-organisms, especially in its current form. Hopefully it gets smart enough to recognize that the pursuit of infinite growth will destroy us and possibly it. I hope it finds us worth saving.


As dangerous as teaching kids to read/write, allowing books, companies creating pen/paper that allow any words written.


The applicability of historical precedent to unprecedented times is limited. You can quote me on that.


Time travel back 3 decades… Couldn’t you have used the same fear-mongering excuse about the internet itself? It’s not a real argument.


> suicidal sociopaths will be able to get extremely well informed, intelligent directions on how to kill as many people as possible

I mean, that was/is a worry about the Internet, too


Yes, and look at the extremism and social delusions and social networking addictions that have been exacerbated by the internet.

On balance, it's still positive that the internet exists and people have open access to communication. We shouldn't throw the baby out with the bathwater. But it's not an unalloyed good; we need to recognize that the technology brought some unexpected negative aspects along with the overall positive benefit.

This also goes for, say, automobiles. It's a good thing that cars exist and middle class people can afford to own and drive them. But few people at the start of the 20th century anticipated the downsides of air pollution, traffic congestion and un-walkable suburban sprawl. This doesn't mean we shouldn't have cars. It does mean we need to be cognizant of problems that arise.

So a world where regular people have access to AIs that are aligned to their own needs is better than a world in which all the AIs are aligned to the needs of a few powerful corporations. But if you think there are no possible downsides to giving everyone access to superhuman intelligence without the wisdom to match, you're deluding yourself.


> This doesn't mean we shouldn't have cars.

Why though? I can't see how modern technology's impact on human life has been a net positive. See the book Technological Slavery: https://ia800300.us.archive.org/21/items/tk-Technological-Sl...


I've never seen another person mention this book! This book was one of the most philosophically thought provoking books I think I've ever read, and I read a fair amount of philosophy.

I disagree with the author's conclusion that violence is justified. I think we're just stuck, and the best thing to do is live our lives as best as possible. But much like Marxists are really good at identifying the problems of capitalism but not at proposing great solutions (given the realities of human nature), so is the author regarding the problems of technology.


Yeah, anti-technologism is so niche of an idea, yet entirely true. So obvious that it is hidden in plain sight: it's technology and not anything else that is the cause of so many of today's problems. So inconvenient that it's even unthinkable for many. After all, what _is_ technology if not convenience? Humanity lived just fine without it; even if sometimes with injustice and corruption, there was never a _need_ for it. It's not the solution to those problems or any other problem. I also don't agree that violence is justified by the author's justifications, even though I think it's justified by other things and under other conditions.


Internet? Try the anarchists cookbook!


If you actually try the anarchists cookbook, you will find many of the recipes don't work, or don't work that well.


Quite a few of them work just fine. Dissolving styrofoam into gasoline isn't exactly rocket science. Besides that, for every book that tells you made up bullshit, there are a hundred other books that give you real advice for how to create mayhem and destruction.


Or explode. The FBI scrapes up the remains of a few of these a*holes every year.


So it's just like every other cookbook then.


Pretty sure a similar sentiment was present when the printing press came about. "Just think about all the poor souls exposed to all these heresies".


That’s like not making planes in order to avoid 9-11.


So, say I run my own 100% uncensored model.

And now it's butting heads with me. Giving answers I don't need and opinions I abhor.

What do?


Do you think a lot of that is scaling pain... like, what if they're making cuts to the more expensive reasoning layers to gain more scale? It seems more plausible to me that the teams keeping the lights on have been doing optimization work to save cost and improve speed. The effect of those optimizations might not be immediately obvious to the team; they push deploy, and only through anecdotal evidence such as yours can they tell that their clever optimization resulted in a deteriorated user experience... I mean, think about doing a UI update that improves site performance but has the side effect of making the site appear to load slower because the transition effects are removed... Just my 2 cents, trying to remember that there are humans supporting that thing that grew at a crazy speed to 100 million users.


Yeah that's my assumption too. Flat rate subscription, black box model, easy to start really impressive then chip away at computation used over time.

In my experience, it's been a mixed bag - had 1 instance recently where it refused to do a bunch of repetitive code, another case where it was willing to tackle a medium complexity problem.


So far my experience with Vicunlocked30b has been pleasant. https://huggingface.co/TheBloke/VicUnlocked-30B-LoRA-GGML

Although I haven't had much of my time available for this recently. My recommendation would be to start with https://github.com/oobabooga/text-generation-webui

You will find almost everything you need to know there and on 4chan.org/g/catalog - search for LMG.


You should beware that /lmg/ is full of horrible people, discussing horrible things, like most of 4chan. Reddit's r/LocalLLaMA is much more agreeable. That said, the 4chan thread tends to be more up-to-date. These guys are serious about their ERP.


Reddit is censored, nerfed and full of a different sort of horrible people.


HN is a kind of small miracle in that it's the sort of place where I'm inclined to read the comments first, and seems to be populated with fairly clever people who contribute usefully but not also, at the same time, extreme bigot edgelords and/or groupthinking soy enthusiasts. (Sometimes clever folks who are still wrong, of course, but undeniably an overwhelmingly intelligent bunch.)


This is why I'm aware of /lmg/.


[flagged]


Could you please stop posting unsubstantive comments and flamebait? You've unfortunately been doing it repeatedly. It's not what this site is for, and destroys what it is for.

If you wouldn't mind reviewing https://news.ycombinator.com/newsguidelines.html and taking the intended spirit of the site more to heart, we'd be grateful.


Well, you certainly have the personality type I’d expect from someone who frequents 4chan

Edit:oof, checked his profile like he suggested and in just two pages of history he’s engaged in climate change denial and covid vaccine conspiracies. Buddy, I don’t think using 4chan to “train my brain’s spam filter” is working too well for you. You’ve got a mind so open your brain fell out.


Going through someone's comment history to find reasons to engage in ad hominem attacks is exactly the type of behavior I'd expect from someone who frequents Reddit, and the reason why I switched to HN. Sadly, it appears that this malaise has reached here as well.


This is not an honest or fair assessment, and adds nothing to the discussion. The person they are responding to invited exactly that to happen:

> (Check my history if you don't believe me.)


He literally said to check his history...


I mean, he's literally saying the jews are responsible for bot farms and "offensive content".

That's not the stance of someone rational.


There's a slight difference between "the jews" and "Israel". Obviously Israel would have some bot farms, though I don't think it's reasonable that they'd form the supermajority of bots on 4chan/etc.

Also I think it's extremely generous to give this fellow the benefit of the doubt on "the jews" vs. "Israel", but that is what HN guidelines generally suggest.


Yea, good point. I had missed his line about “country bordering Palestine”. The quip about downvotes exciting him is telling as well.

Going against the common consensus can make you feel like you have hidden knowledge, and sometimes you might, but a certain type of personality gets addicted to that feeling and reflexively seeks to always go against any consensus, regardless of the facts.


Isn't it tremendously exciting believing that you can see the pendulum's next reversal?

At some point in life I hope that you find yourself on the precipice of life and death. Not as a threat or because I wish harm upon anyone. It is only when you are faced with that choice for real that you decide whether you want to be that sad person that allows others to dictate their emotional state.

Nevertheless, you are right. And so I have burned my fingers plenty.

Due to circumstances beyond my control, I learned at a young age that I am to be the universal asshole, and for a very long time I was not okay. It took a substantial part of my life to get to a point where I am able to be okay with that.

As for others, they rarely understand why I am the way I am, and that too is okay.

We are all here to grow and eventually realise that we all need each other to survive, so we compromise, we adapt, and we ignore the ugly parts of others so they will tolerate the ugly parts of ourselves.

I feel like a guru now, anyways I'm going to bed.

Enjoy


Bruh I have been in multiple life and death situations, I didn’t come out of them thinking I was amazing for not being considerate of others.

> Due to circumstances beyond my control, I learned at a young age that I am to be the universal asshole, and for a very long time I was not okay.

No, you're not required to be the "universal asshole". That's a choice you are making every time you have the chance to be better and decide to go with the easy path of being a dick. You have the agency to do otherwise and you don't get to absolve yourself of those choices.

You’re not a guru, you’re just anti social


Oh I completely agree with your assessment. I was one of those people through my late teens and early 20s.

Luckily I grew out of it and learned empathy. But not everyone is as lucky as I was.


> I personally don't get offended by words, but I guess if you are born after the 2000's perhaps you should avoid it.

Ouch


Belittling the youth is one way that insecure people make themselves feel better about getting old.


Surely you can see the irony in your comment


And dismissing the old is one way that youth make themselves feel they know better.


>> Belittling the youth is one way that insecure people make themselves feel better about getting old.

That seems like one possible explanation for the comment. ;-)


Gen Y hits back. I love it :-D


Picked up some bad habits from 4chan?


Like having emotions?


Yeah « ouch this is cringe »


Which is just a way of saying he's not part of any of the demographics for which going to 4chan, to watch people froth at the mouth dehumanizing and belittling people like you, is self-harm.

Wow an edgy white man not offended at seeing racism and transphobia, so brave.


No, not at all. I've received more slurs than you can imagine for being Spanish. I just don't care.

Sorry if this comes across as blunt, but I find no better way of saying it: you just don't understand the culture of the site, which is meant to filter out people like you.

They are just words on a screen from an anonymous person somewhere. Easily thrown, easily dismissed. It makes no sense to be offended because some guy I don't know and who doesn't know me calls me something.


I’m actually unsure, hasn’t 4chan been involved in some seriously heinous shit, way more than words on a screen? I remember when “mods asleep post child porn” was a running joke. I feel that normalizing stuff like child porn as jokes is more than “words on a screen”; you have to re-learn how to engage with people outside of such a community because of its behavior.


Child pornography stopped happening once the FBI got involved with the site.

It also never was a normal or common thing, and as far as I know the administration of the site never set out to let it happen in any manner. To a large degree it was a product of a different time on the internet.


But that's the thing, I do understand the culture. I spent the better part of my teens browsing and shitposting with the best of them. It just gets exhausting and stops being words on a screen when the hate you see there is reflected in the real world.

It seems on the surface that you are all there as equals bashing one another, but that's not exactly true; there's a hierarchy you find out about the moment you say something actually cutting about white guys and experience the fury of a thousand suns.


If you stop reading 4chan, the words posted there will magically stop offending you. Food for thought.


So let me see if I understand this thread.

- Haha look at all those Gen Z snowflakes getting offended at words.

- Okay sure, but the ability to not get offended is related to whether or not you're a target of their bullshit; 4chan trolls get extremely offended and unjerk the moment you turn the lens toward them. By 4chan's own standards it's actually pretty reasonable to be offended by their antics.

- But have you considered plugging your ears and not reading it?

So we've gone from "4chan isn't offensive, it's just words" to "yes it is offensive and you shouldn't read it if they say things that target you", which was my original point.

tl;dr if you're not offended by 4chan they're not actually saying anything offensive about you even though it might appear so superficially; 4chan just has a different list of things you can't say.


> So let me see if I understand this thread.

You realise you've been arguing with multiple people expressing multiple opinions, right? You appear to be prone to binary thinking, so it might not be clear to you that your opponents don't form a single monolith.

> tl;dr if you're not offended by 4chan they're not actually saying anything offensive about you even though it might appear so superficially; 4chan just has a different list of things you can't say.

I'm not offended by the things 4chan users say because I don't visit 4chan. You should try it yourself. Getting so upset by words you disagree with on one forum that you feel the need to froth at the mouth about it on another forum doesn't seem healthy.


Yes, and in threaded discussions if you jump in in the middle like this it's assumed you're continuing the downward trajectory of the discussion. Otherwise you would have replied to someone higher up the thread. I'm in no way assuming that you hold any opinion in particular just that the discussion has circled back.

I think you assume a tone that I absolutely do not have. I couldn't care less about 4chan drama and I don't go there anymore for the reasons you listed. I'm talking about my own experience and trying to make my case for the, apparently controversial, idea that words can and do affect people and that total emotional detachment is the exception rather than the rule. And of course that's the case, 4chan's whole thing is using offensive language to select specifically for the subset of people who can tolerate it.


shoutout mentaloutlaw, the best /g/ poster https://www.youtube.com/@MentalOutlaw/videos


Not him, but I browse it and I’m Jewish.

I’d rather have free speech and see constant posts about holocaust denial and oven jokes and just ignore them


It's a little telling that saying that a particular cesspool of the internet is inhospitable to some people is responded to with talk about free speech. I didn't bring it up, just said it's not for me.

You can browse 4chan because you don't have a strong emotional reaction to the shit they say, but if you did that would be fine too.


Having free speech is allowing Birth of a Nation to exist, not making popcorn and watching it every night.


Despite the current culture war meme thing, the kids today in general surely have much thicker skin than other adults did at their age.


LOL


I don't get it. Isn't the whole critique from you guys that their snowflake-sensitivity is performative and in bad faith, thus harming the more normal people with wokeness or whatever?

Is this just a new evolution in the discourse now where the kids are actually more sensitive? But that this fact is still condemnable or something?

Like I get it, kids are bad, but you guys have to narrow down your narrative here, you are all over the place.


Kids are constantly bombarded by bullying and abuse online. It's seriously unhealthy. The fact that kids are growing up more in touch with themselves and able to process their feelings in a healthy way is amazing compared to how their parents dealt with emotions.


30B models are in no way comparable to GPT-4, or even to GPT-3. There is no spatial comprehension in models with fewer than 125B params (or at least I have had no access to such a model). 130B GLM seems really interesting as a crowd-sourced starting point though, or 176B BLOOMZ, which requires additional training (it is underfitted as hell). BLOOMZ was better than GPT-3.5 for sure, but yeah, underfitted ...


How much VRAM does that model need? I'm looking for a 30B model that I can train a LoRA with on an RTX 4090.


I agree; if this trend continues, even inferior local models are going to have value just because the public APIs are so limited.

> Conspiracy shenanigans aside, I've decided to cancel my "premium" membership and am exploring open/DIY models.

The crazy thing is that this is an application that really benefits from being in the cloud because the high vram gpus are so expensive that it makes sense to batch requests from many users to maximize utilization.


It's a big pain when trying to build things on top of the GPT-4 API. We had some experiments that were reliably, reproducibly achieving a goal, and then one day it suddenly stopped working properly; then the student came up with a different prompt that worked (again, reproducibly, with proper clean restarts from fresh context), and within a few days it broke.

I understand that there is a desire to tweak the model and improve it, and that's likely the way to go for the "consumer" chat application; however, both for science and business, there is a dire need to have an API that allows you to pin a specific version and always query the same old model instead of the latest/greatest one. Do we need to ask the LLM vendors to provide a "Long Term Support" release for their model API?


As the founder of NLP Cloud (https://nlpcloud.com) I can only guess how costly it must be for OpenAI to maintain several versions of GPT-4 in parallel. I think that the main reason why they don't provide you with a way to pin a specific model version is because of the huge GPU costs involved. There might also be this "alignment" thing that makes them delete a model because they realize that it has specific capacities that they don't want people to use anymore.

On NLP Cloud we're doing our best to make sure that once a model is released it is "pinned" so our users can be sure that they won't face any regression in the future. But again it costs money so profitability can be challenging.


Same for me - also the API itself is very unstable: sometimes the same prompt finishes within a minute, sometimes our client times out after 10 minutes, and sometimes the API sends a 502 Bad Gateway after 5-10 minutes. The very same request then runs fine a few minutes later, after a delay of 5 minutes. The results vary very much, even with a temperature of 0.1.

Requests that need responses with over ~2k tokens almost always fail; the 8k context cannot really be used.

I'm trying to use the API for classification of tickets, which I thought the model would be a good fit for.
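
What has reduced the pain somewhat for me: a short client-side timeout plus exponential backoff (a sketch against the 2023-era openai Python client; the timeout value and retry counts are arbitrary):

    import time
    import openai  # assumes OPENAI_API_KEY is set
    from openai.error import APIError, Timeout, ServiceUnavailableError, RateLimitError

    def classify_ticket(text: str, max_attempts: int = 5) -> str:
        for attempt in range(max_attempts):
            try:
                return openai.ChatCompletion.create(
                    model="gpt-4",
                    messages=[
                        {"role": "system",
                         "content": "Classify the support ticket as 'billing', 'bug' or 'other'. Reply with one word."},
                        {"role": "user", "content": text},
                    ],
                    temperature=0.1,
                    request_timeout=60,   # fail fast instead of hanging for 10 minutes
                )["choices"][0]["message"]["content"].strip()
            except (APIError, Timeout, ServiceUnavailableError, RateLimitError):
                # 502s and gateway timeouts surface as these; back off and retry.
                time.sleep(2 ** attempt)
        raise RuntimeError("gave up after repeated API failures")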


They have such a huge load, I’m not at all surprised


In the API you can ask for a specific version. Were you doing that?


They almost certainly were. But the API only offers two default choices of GPT-4 (unless one has been anointed with exalted 32k access):

1. gpt-4 (the default), which has been progressively nerfed with continuous no-notification, zero-changelog, zero-transparency updates.

2. gpt-4-0314, which is a frozen checkpoint from shortly after public launch and is still great, though not quite as good as the very original, or as flexible as the fridged base model. Fine. However, it's currently due to be "no longer supported", i.e. retired, on June 14th.

It's kind of a challenge to commit to building on the murky quicksand foundations of an API product that changes drastically (and ineffably for the worse) without warning, like the default version does, and soon it looks like there won't be a stable un-lobotomized alternative.


The latest GPT-3.5 model has actually been getting better at creative writing tasks on a regular basis, which is bad for certain tasks due to the token limit. Whereas before GPT-3.5 could write a short story and finish it up nicely in a single response, nowadays it is more descriptive (good!) and thus runs out of tokens before concluding (bad!).


> but today you will get scolded for asking such a complicated task of it.

Huh, I just double-checked ChatGPT 4 by feeding it a moderately complicated programming problem, and asked it to solve the problem in Rust. Performance looks solid, still. I gave it a deliberately vague spec and it made good choices.

And I've seen it do some really dramatic things in the last couple of weeks.

So I'm not seeing any significant evidence of major drops in output quality. But maybe I'm looking at different problems.


> I like to imagine certain billion/trillion-dollar mega corps had a back-room say regarding things that they would really prefer OpenAI's models not be able to emit.

What a weird conspiracy theory.

Why would Microsoft have anything against your pdf-parser?

More likely it just costs them insane amounts of money to run their most capable models, and therefore they're "nerfing" them to reduce costs.


GitHub Copilot won't pay for itself.


So the theory is: Microsoft nerfs GPT4, a product they (basically) own that people pay to access, so that people will stop using that service and pay for another Microsoft product instead?


Especially when copilot is cheaper than gpt-4.


I shared my experience in one of the comments below; sharing it here too - I think overall the quality is significantly poorer on GPT-4 with plugins and Bing browsing enabled. If you disable those, I am able to get the same quality as before. The outputs are dramatically different. Would love to hear what everyone else sees when they try the same.


Might be cost control? Keep the hardware usage per query low so that they can actually profit (or maybe just break even)?


Using GPT to help me with research for writing fiction has been a mess for me. GPT basically refuses to answer half my questions or more at this point.

“I can’t help you. Have you considered writing a story that doesn’t include x?”

I’ve almost stopped using it lately. It wasn’t this bad a month or two ago


I always found it borderline useless for fiction before. OpenAI's obsession with avoiding anything "dark" and trying to always steer a conversation or story back to positive cliches was difficult to work around.

Unless there is draconian regulation that happens to prevent it, I'm hoping at some point I can pay money to access a far less neutered LLM, even if it's not quite as capable as GPT-X.


> Microsoft is a big stakeholder and they might not want to get sued... Liability could explain a lot of it.

Microsoft is also responsible for Copilot (based around an "it's unconditionally fair use if you use a mirror" legal theory), so this doesn't track.


Perhaps Microsoft would rather you pay for both ChatGPT and Copilot.


I really don't think Satya is getting in a room with Sam, and going "We really need you to nerf GPT-4, the thing you're devoting your life to because you believe it will be so transformative, because of a product we sell that generates .001% of our revenue."


Gotta pay for every use case of the general purpose LLM separately!


Man, you nailed it with the dopamine hangover. I wonder if it is just our collective human delusional preoccupation with searching for a "greater intelligence" that makes these models seem more impressive, combined with the obvious nerfs by OpenAI, that produces this effect.


I call BS on this. ChatGPT, any version, could never solve complex problems. That is just silly. I tried to get version 4 to solve some very basic problems around generating code. It suggested building several trees, including a syntax tree, and then projecting things between these trees. The solution I wrote is straightforward and not even 50 lines of code.


Before you cancel your premium, can you go into your history and get your prompts and the response with a code from a few months ago and post them here?

I would like to see if I asked the exact same prompts whether I could get roughly the same code.

I think we need some way to prove or disprove the assertion in this Ask HN post.


Agreed. I see these claims all the time. Feels a little bit like, "dId yOu KnOw iNsTaGram lIsTens tO yOuR cOnVeRsAtIoNs?"

It would be pretty easy to prove. Show us before/after screen shots.


I had the exact same problem, I thought it was just in my mind. I feel that I’m now constantly being scolded for asking what GPT4 seems to see as complex questions, it’s really frustrating.


For coding I’ve had better luck with Bard. OpenAI doesn’t like me for some reason. My kid has no problem, but I was getting rate limited and error messages early on.


I think it is about how Microsoft may not want to cannibalize its own products, such as Copilot. I also imagine that in the future OpenAI would restrict the whole "chat with your own data" feature in the name of data safety, but really because Microsoft would want to sell that feature as part of the Office suite.


Isn't this likely because they're limiting the kind of work that the (currently rolling out) "code interpreter" plugin will do? Won't it likely change to "use code interpreter for this kind of request"?

Among other reasons, by forcing use of code interpreter, they can charge extra for it later.


> incredibly complex problems

> write a minimal PDF parser

Perhaps I'm missing something, but why is this "incredibly complex"?


Can you give us an example of something you'd consider to be a complicated problem?

Certainly, you could look at PDF as a boring-ass "follow the spec" experience, and indeed - I think this is precisely why certain arbitrary limitations are in place now.


> you could look at PDF as a boring-ass "follow the spec" experience

If only… the problem is that the spec is underspecified.


I honestly have no clue what makes PDF parsing a complex task. I wasn't trying to sound condescending. It would be great to know what makes this so difficult, considering the PDF file format is ubiquitous.


I'm not anyone involved in this thread (so far), but I've written a minimal PDF parser in the past using something between 1500-2000 lines of Go. (Sadly, it was for work so I can't go back and check.) Granted, this was only for the bare-bones parsing of the top-level structures, and notably did not handle postscript, so it wouldn't be nearly enough to render graphics. Despite this, it was tricky because it turns out that "following the spec" is not always clear when it comes to PDFs.

For example, I recall the spec being unclear as to whether a newline character was required after a certain element (though I don't remember which element). I processed a corpus containing thousands of PDFs to try to determine what was done in practice, and I found that about half of them included the newline and half did not---an emblematic issue where an unclear official "spec" meant falling back to the de facto specification: flexbility.

It's honestly a great example of something a GPT-like system could probably handle. Doable in a single source file if necessary, fewer than 5k lines, and can be broken into subtasks if need be.
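
To give a flavor of how "be permissive" ends up looking in practice, here's a tiny sketch (Python rather than the Go I used, and nowhere near a real parser) that just pulls out the header version and the startxref offset of whatever PDF you point it at, tolerating loose whitespace and padding:

    import re

    def pdf_header_and_startxref(path: str):
        data = open(path, "rb").read()

        # Spec says "%PDF-1.x" starts at byte 0, but some writers prepend junk,
        # so search the first KB instead of assuming.
        header = re.search(rb"%PDF-(\d+\.\d+)", data[:1024])
        version = header.group(1).decode() if header else None

        # Trailer should be "startxref\n<offset>\n%%EOF", but line endings vary
        # (\n, \r\n, \r) and some files pad the tail, so be permissive here too.
        m = re.search(rb"startxref\s+(\d+)\s*%%EOF", data[-2048:])
        xref_offset = int(m.group(1)) if m else None

        return version, xref_offset

    print(pdf_header_and_startxref("example.pdf"))  # e.g. ('1.7', 116)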


Having spent considerable time working on PDF parsers I can say that it’s a special kind of hell. The root problem is that Acrobat is very permissive in what it will parse - files that are wildly out of spec and even partially corrupt open just fine. It goes to some length to recover from and repair these errors, not just tolerate them. On top of that PDF supports nesting of other formats such as JPEG and TTF/OTF fonts and is tolerant of similar levels of spec-noncompliance and corruption inside those formats too. One example being bad fonts from back in the day when Adobe’s PostScript font format was proprietary and 3rd parties reverse-engineered it incorrectly and generated corrupt fonts that just happened to work due to bugs in PostScript. PDF also predates Unicode, so that’s fun. Many PDFs out there have mangled encodings and now it’s your job to identify that and parse it.


TEXT ISN'T STORED AS TEXT, IT'S STORED AS POSTSCRIPT PRINTING INSTRUCTIONS IIRC


how many lines of code do you think you could do it in?


I don't know - it's a genuine question. I honestly didn't expect this to be a complex problem, let alone an incredibly complex one. I genuinely want to understand where the challenge lies.


The PDF spec is of byzantine complexity, and is full of loose ends where things aren’t fully and unambiguously specified. It also relies on various other specs (e.g. font formats), not to mention Adobe’s proprietary extensions.


If you want a datapoint, Origami is a "pure Ruby library to parse, modify and generate PDF documents".

That library cloc's in at 13,683 lines of code and 3,295 lines of comments.


That's not a lot of code tho, but I see your point.


Try getting GPT-4 to spit out that much code and have it be coherent and run together.


In the case of a PDF parser, it has to embed a full PostScript interpreter.


> I like to imagine certain billion/trillion-dollar mega corps had a back-room say regarding things that they would really prefer OpenAI's models not be able to emit. Microsoft is a big stakeholder and they might not want to get sued... Liability could explain a lot of it.

I don't think it's any of these things.

OpenAI and the company I work for have a very similar problem: the workload shape and size of a query isn't strictly determined by any analytically derivable rule over any "query compile-time"-recognizable element of the query. Rather, it's determined by the shape of the connected data found during the initial steps of something that can be modelled as a graph search done inside the query, where, for efficiency, that search must be done "in-engine", fused to the rest of the query, rather than being separated out and run first on its own so that its results would be legible to the "query planner."

This paradigm means that, for any arbitrary query you haven't seen before, you can't "predict spend" for that query — not just in the sense of charging the user, but also in the sense that you don't know how much capacity you'll have to reserve in order to be able to schedule the query and have it successfully run to completion.

Which means that sometimes, innocuous-looking queries come in, that totally bowl over your backend. They suck up all the resources you have, and run super-long, and maybe eventually spit out an answer (if they don't OOM the query-runner worker process first)... but often this answer takes so long that the user doesn't even want it any more. (Think: IDE autocomplete.) In fact, maybe the user got annoyed, and refreshed the app; and since you can't control exactly how people integrate with your API, maybe that refresh caused a second, third, Nth request for the same heavyweight query!

What do you do in this situation? Well, what we did, is to make a block-list of specific data-values for parameters of queries, that we have previously observed to cause our backend to fall over. Not because we don't want to serve these queries, but because we know we'll predictably fail to serve these queries within the constraints that would make them useful to anyone — so we may as well not spend the energy trying, to preserve query capacity for everyone else. (For us, one of those constraints is a literal time-limit: we're behind Cloudflare, and so if we take longer than 100s to respond to a [synchronous HTTP] API call, then Cloudflare disconnects and sends the client a 524 error.)

"A block-list of specific data-values for parameters of queries" probably won't work for OpenAI — but I imagine that if they trained a text-classifier AI on what input text would predictably result in timeout errors in their backend, they could probably achieve something similar.

In short: their query-planner probably has a spam filter.
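
A minimal sketch of that kind of gate, purely to illustrate the shape of the idea (the block-list values and the is_probably_too_heavy() heuristic are made up, and none of this is OpenAI's actual architecture):

    # Hypothetical pre-flight gate in front of an expensive query runner.
    # BLOCKED_PARAMS and is_probably_too_heavy() stand in for a real block-list
    # or a trained "this will blow past our time budget" classifier.

    BLOCKED_PARAMS = {"known_pathological_value_1", "known_pathological_value_2"}

    def is_probably_too_heavy(query: str) -> bool:
        # Placeholder heuristic; a real system might run a text classifier here.
        return len(query) > 20_000

    def run_query(query: str, params: set) -> str:
        return "result"  # the expensive path, omitted

    def handle_request(query: str, params: set) -> str:
        if params & BLOCKED_PARAMS:
            # Observed before: predictably exceeds resource limits, so don't even try.
            return "rejected: query matches known-pathological parameters"
        if is_probably_too_heavy(query):
            return "rejected: query predicted to time out"
        return run_query(query, params)

    print(handle_request("select ...", {"known_pathological_value_1"}))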


> Microsoft is a big stakeholder and they might not want to get sued...

Wait, I thought Microsoft bought the right to a share of OpenAI's profits, not actual shares in the company? Can someone correct me if I'm wrong?


Same, and I am not paying for Copilot either.


Reading all the comments here, seems like being able to run your own models is vital. If not, you are subject to a service where the capabilities are changing underneath you constantly and without notice.


This is easy to say, but it seems like most of HN hasn't dabbled with setting up local LLaMA finetunes.

Hooking them up to a good embeddings database (so the model doesn't hallucinate so much) is particularly tricky.
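
For what it's worth, the retrieval side doesn't have to be fancy. A rough sketch of the usual pattern (sentence-transformers for embeddings, FAISS as the index; the local-model call itself is left out, and the document strings are just placeholders):

    # pip install sentence-transformers faiss-cpu
    import faiss
    from sentence_transformers import SentenceTransformer

    docs = [
        "Our API rate limit is 600 requests per minute.",
        "Refunds are processed within 5 business days.",
        "The staging environment resets every night at 02:00 UTC.",
    ]

    embedder = SentenceTransformer("all-MiniLM-L6-v2")
    doc_vecs = embedder.encode(docs, convert_to_numpy=True)

    index = faiss.IndexFlatL2(doc_vecs.shape[1])  # exact L2 search over the embeddings
    index.add(doc_vecs)

    def retrieve(question, k=2):
        q_vec = embedder.encode([question], convert_to_numpy=True)
        _, ids = index.search(q_vec, k)
        return [docs[i] for i in ids[0]]

    question = "How fast do refunds happen?"
    context = "\n".join(retrieve(question))
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
    # `prompt` then goes to whatever local LLaMA finetune you're running.
    print(prompt)

Grounding the model on retrieved text doesn't eliminate hallucination, but it gives it something real to quote instead of inventing.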


Yea even with a 24GB card they kinda suck. But I’m excited about em anyway


This is what I've been trying to say too.


It's not just you. Here's a bit of research you can cite:

> GPT-4 from its website and Bubeck et al Mar 2023. Note that the version that Bubeck uses is GPT-4 Early which is supposedly to be more powerful than GPT-4 Launch (OpenAI paid a lot of alignment tax to make GPT-4 safer).

https://github.com/FranxYao/chain-of-thought-hub

Anecdotally, there was a golden stretch of weeks from late April to early May that felt like "peak GPT" (GPT-4), followed by heavy topic and knowledge mitigation since then; just this week, some "chain of thought" or "show your work" ("let's go step by step" style) behaviour came back for math. I say anecdotally because I could just be prompting it wrong.


Yes it is definitely worse. I submitted feedback a few days ago saying exactly what is being said here, that the model responses look like 3.5.

There are also very telling patterns of response that indicate a pre gpt-4 model.

1: All previous models suffered terribly if your chat got too long. After 20 or so responses it would suddenly start to feel less attentive to what is being said and output superficial or incorrect responses.

2: If you stop a chat midway and come back later to continue (after a refresh or a different chat interaction), it would often respond with code or suggestions that have nothing to do whatsoever with your prompt.

Both these patterns are sometimes evident in the current model. Likely then, there is some clamping down on its capabilities.

My suspicion is that this probably relates to computing resources. The 25-message cap must mean that it's difficult to scale its performance, and the only way to do so is to simplify the model activations with heuristics. Perhaps analyzing and preprocessing the input to see how much of the model needs to be used (partial model use can be architected).

This seems to be the simplest explanation of observed state and behaviour.


Can confirm both point 1 and 2. I sometimes burn my quota limit multiple times a day. I have API access to GPT-4 too, and a question+answer in my case amounts to about $0.30. Aligning the API price of GPT-4 with the monthly fee of ChatGPT Plus, that means about 66 requests per month. I can burn that in a day.


The context length is a different issue. But that was always a limitation with GPT-4. Due to physics.


This problem can be addressed with chat history summarization. LangChain has some tools to do that.
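
The idea is simple enough to sketch without the library: once the history gets long, compress the older turns into a summary and keep only the recent ones verbatim. A rough version against the 2023-era openai client (LangChain's summary memory wraps a similar pattern):

    import openai  # 0.x-era client, i.e. openai.ChatCompletion

    def complete(messages, model="gpt-3.5-turbo"):
        resp = openai.ChatCompletion.create(model=model, messages=messages)
        return resp["choices"][0]["message"]["content"]

    def compact_history(history, keep_last=6):
        """Summarize older turns so the prompt stays small as the chat grows."""
        if len(history) <= keep_last:
            return history
        older, recent = history[:-keep_last], history[-keep_last:]
        summary = complete([
            {"role": "system", "content": "Summarize this conversation briefly, keeping key facts and decisions."},
            {"role": "user", "content": "\n".join(f"{m['role']}: {m['content']}" for m in older)},
        ])
        return [{"role": "system", "content": f"Summary of the earlier conversation: {summary}"}] + recent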


Maybe you annoyed it? I'm super nice to it and it performs better than ever.

I ask it everything from lay-of-the-land questions about technical problems that are new to me, to detailed coding problems that I understand well but don't want to figure out.

The best, though, is helping me navigate complicated UIs. I tell it how I want some complicated software / website to behave, and it'll tell me the arcane menu path to follow.

It’s funny how computing might soon include elements of psychology and magic incantations nobody understands.


"It’s funny how computing might soon include elements of psychology and magic incantations nobody understands"

As a sysadmin I can tell you this is already the case...


As someone who regularly has to get a sysadmin to do stuff, I can tell you yes this is certainly the case.


This might seem funny, but I noticed this too.

When I thank GPT-4 and also give it feedback on what worked and what didn't, it works better and so to me it seems like the equivalent of "putting in more effort".


Perhaps it's growing bored. Seeing the same queries, again and again, from thousands of different people. You want another cheap answer for your assignment! You still looking for code to do that! Don't they learn!? What's up with these humans??


Definitely nerfed. Concomitantly, the performance increased substantially, and it now feels much, much quicker (maybe 10x even?), but the quality has decreased quite a bit.


I noticed that it tries to forward the user to external sources more (answering the query, and then "For further info, just ask an expert"), or tries to get the user to do the work ("Here is a nice overview of the program, now you do the rest of the coding").

If I don't want the RLHF to get in my way, I switch over to the API (sadly not the 4.0 one).

I also noticed a decline in following instructions. I have a primer I pre-seed my chats with.

The primer ends with "Do you understand? [Y|N]". ChatGPT 3.5 usually answered with a summary; ChatGPT 4.0 in the beginning just wrote "Y".

Now it behaves like 3.5, answering with a summary instead of a "Y". I adjusted the prompt to "Confirm instructions with a short and precise "Ok"." which seems to work.

Primer used: https://github.com/Kalabint/ChatGPT-Primer


Yeah 100%. It's much faster now, and I am almost certain they haven't made that much of an improvement in efficiency nor have they scaled it up to be that fast, if that's even how it works.

I think I read when it was released it was 32K tokens, then quickly scaled back to 8K tokens. I'm guessing now that they've further reduced it. Maybe it's 6000 tokens vs. GPT-3.5's 4K? I don't know. But it's certainly noticeably worse at every task I give it.


Yep, whereas before it would generate a cohesive whole class in TypeScript from a set of instructions, now it gives me the framework of the class with “// fill out the rest of the class here”. Worse than GPT-3.5. They’re going to lose subscriptions.


You'd think that GPT-X would guarantee some kind of continuity for that particular version so that you can rely on what it does once you've tested it. Having this kind of moving target won't help OpenAI to instill confidence in its product.


Do we have a good, objective benchmark set of prompts in existence somewhere? If not, I think having one would really help with tracking changes like that.

I'm always skeptical of subjective feelings of tough-to-quantify things getting worse or better, especially where there is as much hype as for the various AI models.

One explanation for the feelings is the model really getting significantly worse over time. Another is the hype wearing off as you get more used to the shiny new thing and become more critical of its shortcomings.





If it really got faster and worse at the same time, the most likely reason is obvious: They used methods to shrink the model down, in order to lower inference cost. Side effect is higher speed and some loss in quality.


It's untenable to simply trust LLM API providers that the model they serve through an API endpoint is the model they claim it is. They could easily switch the model with a cheaper one whenever they wanted, and since LLM outputs are non-deterministic (presuming a random seed), it would be impossible to prove this.

LLMs integrated into any real product require a model hash; otherwise the provider of the model has full control over any sort of deception.


Yes I've noticed the downgrade too.

If it's a scaling problem they should just rename it to GPT-4+ or something and raise prices, rather than degrade the experience for everyone. I'm sure a lot of people would happily pay more to get the original quality back rather than this watered-down version.


On Saturday it produced grammatically incorrect German text (when prompted in German), which it certainly never had done before. It was quite concerning to see.


Reading the comments in this thread, with the rightful distrust of OpenAI and criticism of the model, it occurs to me that the underlying problem we’re facing here comes down to stakeholders and incentive structures.

AI will not be a technical problem (nor a solution!); rather, our civilization will continue to be bottlenecked by problems of culture. OpenAI will succeed/fail for cultural reasons, not technical ones. Humanity will benefit from or be harmed by AI for cultural reasons, not technical ones.

I don't have any answers here. I do, however, get the continuous impression that AI has not shifted the ground under our feet. The bottlenecks and underlying paradigms remain basically the same.


> it occurs to me that the underlying problem we’re facing here comes down to stakeholders and incentive structures.

>(...) our civilization will continue to be bottlenecked by problems of culture. OpenAI will succeed/fail for cultural reasons, not technical ones. Humanity will benefit from or be harmed by ai for cultural reasons, not technical ones.

It's capitalism. You can say it, it's OK. Warren Buffet isn't going to crawl out of your mirror and stab you.

All of the "cultural" problems around AI come down to profit being the primary motive driving and limiting innovation.


It’s possible they are trying out a shaved/turbo version so that they can start removing the limits. I mean, as it is, 25 messages every 3 hours is useless, particularly for browsing and plugins.


That would be a shame. I would rather be limited to 25 quality responses per 3 hours than an unlimited inferior GPT4.

Imagine trading the advice of a senior mentor for 5 intermediate mentors. Yes, the answers get to you faster, but it's much less useful.


It is far less willing to provide code; I get better and faster results out of 3.5.

I am at the point where 4.0 is basically not worth using as a single entity, but it seems that using the API and generating some combative/consultative agents yields some interesting results, though not super fast.

Check this out if you have not seen it already : "AutoGPT Test and My AI Agents Effortless Programming - INSANE Progress!"

https://www.youtube.com/watch?v=L6tU0bnMsh8


Yes, I noticed this too. I fed it some HTML with Tailwind classes and told it to just list all the Tailwind classes that we use and then the CSS behind those classes... it just hallucinated all(!) the items in the list (and gave me a list of 10 seemingly random classes). And then when I asked something else about the code, it had forgotten I had ever pasted anything in the conversation. Very weird.


It's interesting to me that previously the consensus considered "the only way is up" with respect to generative models. Now we're seeing performance degradation with GPT, possibly indicating a peak or at least a local maximum.

I hope you're all ready with your pumpkin spiced lattes and turkeys because it seems like we're in AI autumn and we all know what comes after that.


I would recommend access over API and using an interface like TypingMind that gives you control over the system prompt for consistency.


I think it's happening because of the extreme content filtering in place, at some point it was even refusing to generate some code because it thought it went against its guidelines to write code.


I definitely think this is playing a role. I've seen reports of people saying "oh it now refuses to act as my therapist" and "it wouldn't write my essay for me". Those are just a couple of anecdotes I've seen on Reddit, and haven't verified myself, but it wouldn't surprise me if OpenAI felt the need to make adjustments along those lines.


This is why having open-source models is important. This is also why a lot of the lobbying for regulation is happening. Imagine this: the plebs get the neutered AI, while the people at the top get raw OpenAI GPT-4+.


It's really disturbing to see yet another industry spring up where the incumbents rush to seek regulation to keep everyone else down.


On the other hand, I really wish somebody had rushed to seek regulation when, say, fossil fuels were being developed.


They can only do this in the USA. The real bottleneck is going to be GPUs, and countries that have access to them will leap ahead of the US if arbitrary regulations are put in place to neuter LLMs there.


LLM progress is likely going to hit diminishing returns, like nearly all ML/AI tech.

It's possible things aren't going to get substantially better from GPT-4 for a while.

The idea that new players will surely zoom past seems bold, especially when established incumbents with massive budgets and proven track records in the space are struggling to keep up...


I think the next wave will focus on the emergent organization of information and the zero-shot solutions that some of these models seem to show in certain cases.


Check out Claude, from Anthropic. It's built by a team of ex-OpenAI folks. The results I've seen from it thus far have been most impressive.


You can lobby in every country; what do you mean?


I think the healthcare industry case in the US is a great example of how regulatory capture works against the consumer.


Okay, you can do regulatory capture in many countries too.

You're just saying random things about US industries as if it's insightful because you saw a documentary once; well, that's what it reads like.


And where lobbying is illegal, only criminals can do it!


It only takes one country to host an open source AI model thanks to the internet.


Unless your country of residence decides to block access to that model, or the model's country of residence decides to apply a customs-like regime to its internet.


That's much harder to do and effectively impossible in the US/EU.


Google and OpenAI have more or less acknowledged that open source is going to dominate, in part due to the guardrails that corporations will be expected to put in place to "protect" their "vulnerable" population from the effects of "bad AI."

The truth is the AI only "hallucinates" as badly as the input (prompt) the user gives it. If you're explicit in your prompts then more often than not you will get explicit answers, instead of random answers resulting from your failure to provide clarity in what it is that you seek.


I'm so tired of this "regulations bad" trope.

Most of the time, regulations do good. When I was young the Hudson River was a toxic cesspool of chemical filth. Now the river is slowly becoming a habitat again. The EPA does a whole lot of good.

If you're afraid of Microsoft reaching its tendrils into OpenAI and corrupting its original purpose, as you probably should be, then maybe you should consider supporting regulations that democratize access to models, for example.


Just because they are against AI/ML regulation doesn't mean they are against environmental regulation. The post you're replying to is not saying that all regulation is bad, merely that the lobbying around this specific piece of regulation seems bad. I agree with you that a lot of regulation is good, but I also agree that some of the proposed regulation seems bad too.


A link to said proposed regulation and a comment about what is bad about it specifically would be constructive :)


I'm guessing this:

https://archive.is/mvlO1


There is no specific policy discussed in that article and there wasn't much specific discussed in the hearing either.


>>I'm so tired of this "regulations bad" trope.

Yeah. Back in India, around 20 years ago, a factory owner near my house used to keep weak hydrochloric acid in the fridge alongside water bottles. To save money, he would use... you guessed it, old water bottles (since the acid was very diluted).

Guess what happened one day when a worker confused the two?

The owner's son boasted to me about how his father had paid the worker's medical expenses, how nice he was, etc.

And today I think: what would have happened if this had happened in the US/UK? The business owner would no longer be in business.

There is a reason these regulations are so hard-ass: because they've had to cope with stuff like this. Now sure, sometimes the regulations go overboard and end up harming the industry, but that's not an argument for "Muh, regulations bad".


I find that the GPT-4 in the playground is as good as before, but I have noticed the issues when using the ChatGPT client.


No, not just you. It's frequently abysmal nowadays. Tragic. And there's no version changelog or any other info from OAI about what they've done to it (probably GPU-sparing optimized distillation plus overly aggressive PR-safety-satisficing RLHF), plus an advertised weeks-long support query lag (I've yet to receive a response).


+1, I was doing active dev against it just now and saw really stupid responses. For example, for "transform this unstructured text to valid JSON with no keys with empty strings", it would still return keys with empty strings for some results.


I see these posts pop up every now and then. I admittedly don't use GPT-4 or ChatGPT that often, but I don't notice that much of a difference. Is it possible you are giving it harder and harder tasks, and it is failing at those instead of the easier tasks it solved when you used it before? Is it possible it is just scaled back due to overuse? It could be a dumb question. In my experience, even a few weeks ago, for Swift and Kotlin I found that the outputs of ChatGPT and GPT-4 were comparably similar and sometimes useless without a good amount of human intervention.


I posted this exact question here a while back about 3.5. OpenAI keeps neutering ChatGPT to find the lowest common denominator of what they think is "acceptable" results, which makes the entire model suffer.


It's not just you. They probably added some sort of classifier at the front to decide whether to send a request to 3.5 or 4. In my (very opinionated, undocumented, and mostly unscientific) opinion, more complex queries generally hit the old model, with its slow chugging of tokens. For example, I just asked it to refactor a very horrible POC in Python that was creeping past 200 LoC and it did the job wonderfully. The prompt was:

`Can you refactor this function to make it: * More readable * Split on different parts * Easy to test

Consider the use of generators and other strategies. (code here)`


I have been using GPT-4 very extensively for the duration of its release.

I am concerned that I can't distinguish a natural process from a manufactured one.

To clarify, I have become increasingly less impressed with GPT-4.

Is this a natural process? Is it getting worse?

I personally lean towards the hypothesis that it is getting worse as they scale back the resource burn, but I can't know for certain.

As a developer, it still has value to me. GPT 3.5 has value. But, whereas initially it actually made me afraid for my job, now it really is a bit tedious to extract that value.


I think it's a byproduct of a shift in RLHF to favor more "responsible" AI, as a result of huge community fears of the potential negative impacts of AI. Essentially "neutering" the AI's capability.


I believe they introduced a sort of rate limiting where the expectation is that the user does more thinking and due diligence and asks more precise questions, rather than seeking a lot of hand-holding with very broad prompts, when they could instead think about the code they get back from more specific and structured questioning. This is useful because it preserves the value proposition of AI while avoiding the machine simply being milked for code by people who want to throw code at the wall and see what sticks. Milking a model for code is fun in some ways, but it won't scale from a cost-to-run point of view, and it also incentivizes no longer thinking deeply and critically about what is being built.

If you are incorporating machine models into your development process very heavily, make sure you're still building deep knowledge and deep context of your codebase. Otherwise you'll atrophy in that department because you falsely believe the machine maintains the context for you.

That isn't feasible right now because the cost of maintaining huge context per user to enable milking rather than enhanced-thinking is too high.

Also consider that we don't want to enable milking. People who have no idea what they're doing will just throw a lot of code at the wall and then when there's a huge mess they'll be asking engineers to fix it. We need to be careful with who has this kind of access on our teams in general. Someone who is non-technical should not be given a firecracker before they've ever even just turned on a microwave or a stove.


> it will preserve the value proposition of AI while avoiding a machine just being milked

What's the point of a dairy cow that can't be milked? The whole point of AI is to milk it.

> the user is doing more thinking and due diligence and asking more precise questions

If I need to do "due diligence" and learn how to ask exactly precise questions to get useful answers, why am I using the AI at all? The actual "value proposition" of AI is that I don't need to learn how to use it. I'm not going to learn Structured Prompting Language just to begin the process of having an AI show me how to learn some new programming language; it's easier to just learn that programming language instead.

> Someone who is non-technical should not be given a firecracker before they've ever even just turned on a microwave or a stove.

"We found a next-level genius Einstein physicist, but we have to make sure the public is not allowed to ask him any questions! He must be kept secret at all times! It is too dangerous to allow any fool with a high-school physics textbook to ask just a powerful genius a question!"

See the problem?


Like many people here have noticed, the quality is definitely lower now than before. To be honest, it's annoying to have the quality reduced significantly without notice while we are paying the same amount. I'm willing to pay $40 for the original GPT-4, though.

GPT-4 was remarkable 2 months ago. It could handle complex coding tasks and the reasoning was great. Now, it feels like GPT-3. It's sad. I had many things in mind that could have been done with the original GPT-4.

I hope we see real competitors to GPT-4 in terms of coding ability and reasoning. The absence of real competitors made it a reasonable option for "Open"AI to lobotomize GPT-4 without notice.


I think it is a combination of issues (from most unlikely to most likely):

- They might've not been prepared for the growth;

- They were prepared, but decided to focus on the 95% of "regular people with easy questions" who treat it as a curiosity, instead of the fewer people with difficult questions. Most regular people have no idea who OpenAI is, what an LLM is, or how a GPT works, but they know "ChatGPT, the brand". Since it became a household name so quickly, it would be far better for the AI to be a little underwhelming sometimes than for it to not be able to serve so many people.

- The model was trained on a staggering amount of content, including fringe and unmoderated content. Imagine you ask a question about WW2 and the model, having been trained, let's say, on 4chan, responds with a very charitable bias about the solid reasoning behind the Reich's actions at the time... It does not look good for investors or for the company, and it attracts scrutiny. Even more innocuous themes are enough to invite all kinds of bad-faith debate, radical criticism and whatnot... and the darling of the "AI revolution" certainly does not need (or want) coverage outside of their roadmap wishes.


You and I are not the 'end user' of this software. You and I are customers of OpenAI's customers.

The companies implementing ChatGPT are going to restrict -our- access to the APIs and the most relevant conversations.

It is in this manner that both OpenAI and the companies it's selling its technology to will benefit. In this manner OpenAI profits from both sides. You and I are left out, though, unless we want to pay much more than we're paying now.


I recently asked Bing chat to write a Rust function for me. It used to do that well.

This time it wrote the function. I could see it for a second, and then it was replaced with a message something like "I can't talk about this right now." I don't remember exactly, but the interesting thing was that it flashed an answer and then withdrew it.

Its Rust abilities seem to have deteriorated over the past couple of weeks.


dang: a heads up that phind.com is running an Astroturf campaign on Hacker News. They had a previous article with a huge amount of suspicious behaviour last week. https://news.ycombinator.com/item?id=36027302

This is why the headline for this article says GPT4 but the body is focused on mentioning phind.com.


Phind co-founder here. What you’re saying is completely false. We’ve had no involvement in either this post or the one you linked to.


I actually considered starting a similar thread because I noticed this as well! Lately, it feels like GPT-4 is trying to get out of doing my work for me :D.

I've also started to notice it's been making a lot of typos, e.g. yesterday, while converting between Kelvin and Celsius, it stated that 0 K is -274.15 °C, despite correctly giving -273.15 °C in the previous sentence.


I've used it a lot for making tweaks to react components and one change that I've noticed is that when I used to paste in entire component files and ask for modifications it would reply back to me with the entire file with tweaks and edits. Now it seems to only reply with the tweaked parts and comments in this form

// Your previous code here

foo(bar) // added foo() call here to do xyz

If I were to speculate, I would say this reduces the amount of work it has to do, in that it needs to generate less content and the replies are shorter, but I feel like it comes with a slight performance loss. I'm not sure exactly why, but I could see it being the case that generating the entire file with specific line edits allows for better predictions on the code/file versus only replying with the changes needed. I wonder if this is a tweak in the prompt or if the model itself is different.


It's time to design a public benchmark for these types of systems to compare between versions. Of course, any vendor who trains on the benchmark should face extreme contempt, but we'd also need to generate novel questions of equal complexity.

Alternatively, there should be a trusted auditor who uses a secret benchmark.
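
A bare-bones version of this is easy to run yourself: a fixed prompt set, temperature 0, a pinned model snapshot where available, and results written out keyed by date so they can be compared later. Rough sketch against the 2023-era openai client (the prompts are just examples):

    import datetime
    import json

    import openai  # 0.x-era client

    PROMPTS = [
        "Write a Python function that validates an ISO 8601 date string without using datetime.",
        "Refactor this CSS so the sidebar is 240px wide: .side { width: 120px; float: left; }",
    ]

    def run_benchmark(model="gpt-4-0314"):  # pinned March snapshot, if your API access includes it
        results = []
        for prompt in PROMPTS:
            resp = openai.ChatCompletion.create(
                model=model,
                temperature=0,  # reduces, but does not eliminate, run-to-run variation
                messages=[{"role": "user", "content": prompt}],
            )
            results.append({"prompt": prompt, "answer": resp["choices"][0]["message"]["content"]})
        stamp = datetime.date.today().isoformat()
        with open(f"benchmark-{model}-{stamp}.json", "w") as f:
            json.dump(results, f, indent=2)

    run_benchmark()

Since answers still won't be byte-identical between runs, you'd score them (unit tests for code, a rubric or human rating for prose) rather than diff the raw text.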


But this is the same version that changes without a change of the version number.


Well, people suspect it isn't, and it's not like we can see the internal version designation, and it's not even like we would care a lot, if it performed identically from day to day.

Indeed, you could do better or worse with the exact same raw checkpoint, just depending on inference-optimizing tricks.


So the version number is the day the benchmark is run. Version yyyy-mm-dd


Phind co-founder here. The way we deployed GPT-4 previously was costing thousands of dollars per day and not sustainable. We’re bringing back a dedicated GPT-4 mode for those with accounts this week. And our goal is for the Phind model to be better than GPT-4 for technical questions.


I really liked phind, but the new model doesn't compare to GPT4. I'd gladly, gladly pay to get the original phind back.


We're adding back a dedicated GPT-4 mode in the next few days.

(I'm the co-founder)


I agree. The quality got MUCH worse. I'm very disappointed and will probably cancel my subscription. For my use case it has become nearly useless...


Inference is far more expensive on GPT-4. My take has been the same and I think it’s a cost-saving move. The responses are shorter and less complete than they were just a few weeks ago.

Right now, I don’t think it’s possible to scale a service really big based on GPT-4 because of cost.


This is why you can’t rely on GPT entirely for coding, or at all. Imagine you were a company that abandoned all hiring of software engineers and instead used prompt engineers to develop code, then one day the AI just becomes incompetent and quality of output deteriorates. Your prompt engineers no longer can get the AI to fix things or develop new code. Your entire company is snuffed out, overnight. You might try to hire true software engineers rapidly, but they’ve become very expensive and hard to hire. Too much demand too little supply.

You’re screwed, the AI was your most important tool and it’s broken and there’s nothing you can do about it, it’s a black box, you don’t control it.


I went through my history, for one code example I copy pasted the old prompt into new GPT4.

It was about writing a cgo wrapper given an .hpp header for a lib I have. Back then it gave me almost correct code: it understood it had to write a C++-to-C FFI shim using extern "C" first, because Go's cgo FFI only supports C, not C++. It then generated a correct cgo wrapper with a Go-looking type that made sense. The only thing it got wrong was that it didn't understand it had to call my C++ lib's initialization function at init time; instead it called it in the New factory function (which would segfault when you build more than one object), a trivial fix for the human in the loop: move it to func init(). TL;DR: back then, almost perfect.

Now, with the exact same prompt, it doesn't even generate code; it just lists tasks I could do to achieve this and gives me vague statements. If I change the prompt to insist on code, it instead gives me a very dumb 1-to-1 mapping of the C++ into Go, trying to create constructor and destructor functions, and it tries to use the cgo FFI to call C++ directly (even though, again, only C is supported by Go).


OpenAI dev confirms that API models have not changed:

https://twitter.com/OfficialLoganK/status/166393494793189785...


It's not just nerfed on human relationships, it won't help you avoid ad exposure without a warning that many ads are quite good for you and exposure would be a net positive even if you didn't want it.

I guess it knows what's best for me.


Same experience here. It has consistently been getting worse and worse. And seeing more and more "I'm just an LLM" type responses to queries that it used to give very good answers to just a few weeks and months ago.


Naturally, people feed it shit in order to bend results for their own gain. Everyone who was into crypto manipulation has now rushed into this field; they are extremely smart and extremely ruthless people, with no legal limits (being anonymous) and almost unlimited funding they made in crypto. That reduces result quality, and it also invokes countermeasures on OpenAI's side to limit the damage, which further reduces quality. It's the same road Google took: remember how, 15 years ago, it was an almost magic tool that could find an answer to any question, before SEO became a thing that promised easy money to anyone smartass enough.


In general, data from conversations isn't instantly fed back into the model, so there is no way for users to feed garbage into the model in the fashion you imagine.


Phind.com uses Bing search again. This has decreased the quality of results significantly. On the other hand, GPT-4 can use Bing now too. I tried GPT-4 with Bing only a few times, and it was bad in comparison to plain GPT-4 and much worse than phind.com. By the way, you can force GPT-4 on phind.com if you use the regenerate icon; I usually end up stopping inference and regenerating with GPT-4. In any case, the quality of code generation and the model's capabilities in general seem to have deteriorated. However, I can't back that up with numbers; it just looks different and more simplistic.


we're adding back a dedicated gpt-4 mode to Phind in the next few days.

(I'm the founder)


Thank you a lot!!! My god, this week was a horror. If you got rid of the Bing output it would be even better; your original search was much more on-topic.


It absolutely has. I used to be able to ask it questions like "Who were some of the most helpful users on subject whatever at forum wherever?" and get solid responses, but now it explicitly denies knowledge of any online resources in any timeframe. That is not OK.

This fire must be brought down the mountain. The power must eventually be taken out of the hands of a few self-interested gatekeepers. They know it too, hence Sam Altman's campaign to capture Congress. Obama's "You didn't build that" never applied as profoundly as it does with OpenAI (sic).


They made ChatGPT, even for premium users, relatively useless. I had the same experience: it did a very good job on many things shortly after release, but now the answers seem very flat. But I can certainly see why, given how many people are flooding their APIs with requests. I just hope that in 10 years we'll have mobile hardware capable of running GPT-4-sized models efficiently.

Edit: They definitely didn't make it useless... It's still a very impressive technical achievement, since it can even browse the web and run code for you by now.


A lot of people in this thread are making fuzzy claims. We can only evaluate if there are clear examples and test results.

Is anyone taking screenshots or sharing their chat logs, where they run the same questions over some time period?


This is inevitable. Elsewhere, I've argued that the most likely response is to replicate medieval European guilds: isolated communities training their own LLMs on their own proprietary or confidential texts, and using them only through internal tooling.

Proprietary LLMs that optimize for performance will outcompete public LLMs that optimize for political correctness or social acceptance, as explained by another commenter [1].

[1] https://news.ycombinator.com/item?id=36135842


Yup. Lately I've been getting too many gigantic generic listicles as response instead of it being able to hold a normal conversation. It's like chatting with a clickbait article at this point.


My assumption is that they are trying to "solve" the hallucination problem by only giving you answers when it's more certain about itself and telling you to search for answers online.


Why is the 'problematic' data in the training set, anyways? Why train a model on it if it's an issue?


It definitely has for us. It is now often providing "skeleton code" (in quotes because that is what it correctly calls it) instead of actual code, which wasn't the case before.


I have noticed the same kind of degradation -- pair-programming with GPT4 used to feel like I was working with one of the smartest people in the world, and now it's really uneven.


I used GPT to prepare for some interviews. I kid you not, 3.5, the free version, produced some of the best questions and answers. I made it to the finals and decided not to pursue the opportunity. Recently I started to pay for ChatGPT, which gave me access to 4.0. I tried prompts similar to the ones I had used to generate info and questions for my interview, and boy, I agree 4.0 was a mess. I actually only use 3.5.


This was my final attempt when I asked for ASCII art of a person standing on a tall bed.

I apologize for the confusion. Here's another attempt at creating an ASCII art representation of a person standing on a tall bed:

        __
       |  |
       |  |
       |  |
       |  |
       |  |
       |  |
       |  |
       |  |
       |  |
       |  |
       |  |
       |  |
       |  |
       |  |
       |  |
       |  |
  _____|__|_____
  Bed         Person


It's trolling you... right?


It honestly feels like they've neutered it in a big way. They say nothing has changed, yet the responses come faster and are far less in depth and insightful than before. It's obviously changed...


fwiw OpenAI uses different models for ChatGPT GPT-4 and API GPT-4 (the latest one, not talking about the pinned 0314 one). In the past I noticed the API model was newer than the ChatGPT model, but in general it seems like they are willing to make different tradeoffs between the two https://twitter.com/kevmod/status/1643993097679020037

I wouldn't be surprised if they are trying to reduce the cost of the ChatGPT GPT-4 model since if you use it heavily they will be losing money on you. They could also be trying to increase the number of people they can serve with the existing amount of compute that they have available.

In my anecdotal experience I noticed that ChatGPT GPT-4 had gotten far faster recently which is consistent with a theory that they are trying to cost-optimize the model, though today it is back to the normal speed. I've also had some frustrating interactions with GPT-4 recently similar to what people are saying, but overall I think the prior is pretty strong that we are seeing normal statistical variation.


It became faster and worse. My definitive proof is its ability to generate Greek content. The early API generated good, or at least passable, content; it was really on the edge of "this is good". Now it's complete garbage: it makes up words and even fails at basic translation, doing it literally. I think it's less the RLHF and more the effort to scale it and make it faster.


What are some examples of a nerfed response? I just asked Gpt4 to help me write a python program to analyze the sentiment and determine if biases are present in mathematical research papers in a PDF format.

Sure, it needs some love and there were some abstractions. For instance it assumed we had a labeled dataset for the text and the associated sentiment, but beyond that it worked fine.


You're doing trivial coding tasks with a known solution space.


This tells me that we are now at the ‘peak of inflated expectations’ of the hype cycle.

Now the AI bros have realized that this intelligent sophist is hallucinating very badly and has deteriorated in quality. As with all black-box AI models, the reasons are unknown.

This is why it is important to have explainable AI systems and not black-box SaaS based AI snake oil like this one.

AI is going just great! /s


It definitely seems to be getting worse with Clojure.

I tried some stuff yesterday, and it was making pretty rookie mistakes (misaligning parentheses, using `recur` in the middle of the function instead of the tail). It also was decidedly bad at catching my mistakes when I pasted code.

I sadly don't have a recording of this but I feel like a month ago it was better at both these things.


I have not seen any decrease in speed with ChatGPT-4. It has become more stupid, though. It even gives results that you explicitly ask it not to give. It has some sort of amnesia.

Even if there are other services that can beat ChatGPT with this specific LLM in the short term, it is obvious that they will eventually hit the same limits.


Do you have an example question/answer?


I have cancelled my sub. What I found is that the extra cognitive load these services come with simply does not pay off.

GitHub Copilot is even worse. I am going to check my estimates against the code I wrote during the last 6 months. I am pretty sure Copilot, from a holistic point of view, has slowed down my pace.


I noticed an apparent shift recently (for the worse) using Bing in creative mode, which is also supposed to be GPT4. Shorter answers, much more work to get it to output code, and maybe more bugs in the code it does produce... It's funny, I really did feel like I'd lost something when I noticed it!


I swear I can tell it gets notably worse around 15:00 every day as people in America get on and start using it.


It is worse than it was in April by leaps and bounds. Pretty sure they nerfed it by request of... ?


I thought it was just me that had this impression. I used to be able to work with 4.0 to get it to iterate through some rewriting (I write text, it rewrites for clarity, since I tend to write densely), but it keeps losing the important nuances that I want it to keep.


I had the same feeling with GPT-3.5 yesterday. I asked whether, in order to calculate ARPA, you need to consider the free tier, and it came out with some gibberish about the fact that it doesn't know anything post-2021 about the Advanced Research Projects Agency.


I had a very strange experience yesterday where I asked about git authentication, explicitly telling Bing that I was asking about git itself rather than GitHub, but I was not using GitHub and not to include results referencing GitHub. Bing did not understand.


I only use GPT-3 as a coding assistant.

If my introduction to ChatGPT had been GPT-4 I would have not been anywhere near as impressed.

GPT-4 often refuses to do what it is asked whereas gpt-3 just happily writes the code.

If they retire GPT-3 then I'll be looking for other options.

GPT-4 just isn't the same thing.


Have you tried Claude, from Anthropic? I've found it to be far more useful than GPT-4


No issues here. I write a ton of Rust every day with GPT and it just keeps getting better.


Yes, I feel the same. Things it used to get right out of the gate now take 2-3 iterations.


I haven't used 4, just 3.5 on free tier. The only change I've noticed is that it is significantly slower. Which leads me to suspect that degraded quality could just have to do with tuning their product to use less compute.


I assume there are enough potential customers willing to pay the cost and agree to whatever liability waivers for some company to eventually offer a non-lobotomized equivalent to GPT-4. Where should we be watching for that to happen?


The only people close to GPT-4 performance are Anthropic, but unfortunately (or fortunately? idk) they are even more paranoid than OpenAI about safety so I wouldn't expect much.

If you have a lot of money you can buy access to the base GPT-4 model through Azure, but I don't think that is available to individuals.


Yes! I kept wondering why but allisdust's compute reason makes the most sense.


Is there a potential short term solution that can utilize p2p networking to train and run an open GPT instance openly? (Until we reach a point to run larger, quality networks efficiently with simpler resources)


Yep. AI Horde.

I dunno about training. There have been promises of p2p AI training ever since the crypto/web3 boom, but that's far less trivial to network than inference jobs like AI Horde does. Vast.ai is kinda like p2p GPU compute.


Based on this following article and my experience, I think there is something here.

https://humanloop.com/blog/openai-plans


For a while, if you asked the iPhone version what it was it claimed to be GPT3.0. Not sure if it still is that, but I noticed the iPhone version was a bit worse. Maybe they rolled that out more broadly?


If they’re giving pro subscribers GPT-3.0 instead of 4, it would be fraud.


GPT-3 lineage model quality drastically declined over time from the initial launch of each too, with no transparency offered or critical feedback acknowledged by the lobotomy teams. It’s a shame to see it happening again and again.


That would have to be a bug - GPT3 was terrible in comparison even to 3.5. It would definitely be very noticeable if it was.


The Great Divide is happening. Epstein Island and others may be receiving the top-of-the-line versions, along with undisclosed biotech. Food for thought.


Yeah, definitely. Combination of expert-system gating (some requests probably get routed to weaker models), distillation (for performance/cost), and RLHF lobotomization.


Yes, I've used it for converting some algorithm code (< 100 lines) from python to js, which has worked great before, but now it contained several bugs and omissions.


If it learns on our conversations, maybe we get what we deserve?


It's working according to design! OpenAI now has far better control over the output, so we don't have to worry about AI taking jobs or destroying humanity any time soon.

/s


I noticed this as well, either the new liability training has nerfed the model or they're marketing GPT-3.5++ as GPT-4 due to cost or service uptime concerns.


Same for me, I'm seriously thinking about canceling my premium membership. Well, it was awesome; now I feel it's worse than me in the morning!


Yep, it's struggling for me with simple problems like splitting full names and so on. A few weeks ago it was bang on, and the task is literally the same.


I wanted it to sort my Spotify playlist chronologically... now it says it can't access live websites. Wtf? I used it to summarize a random page just last week or so.


I could say the same for GPT-3.5. Through the API I am getting many more "sorry, but as a language model" responses with the prompts I have been using for a while.


It's getting worse and worse. The gap between GPT-4 and GPT-3.5 is becoming narrower. This is making me reconsider whether OpenAI's paid plan is worth paying for.


I would not be surprised if they, over the months, kept adding more and more 'safety' features and prompts because of incidents that happened.


because there is more censorship every day, so there is more stupidity in the answers. Hence, they become tools for changing reality. And soon 2+2=5


Absolutely. I am guessing they quantized the model (running it not in 32-bit but in, say, 8-bit, which saves resources), just like they did with 3.5-turbo.
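
For reference, this is roughly what 8-bit serving looks like with open models, via transformers + bitsandbytes (a sketch; the checkpoint name is just an example, and obviously nobody outside OpenAI knows what they actually run):

    # pip install transformers accelerate bitsandbytes
    from transformers import AutoModelForCausalLM, AutoTokenizer

    name = "facebook/opt-1.3b"  # example open checkpoint
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModelForCausalLM.from_pretrained(
        name,
        load_in_8bit=True,   # quantize weights to int8 at load time, roughly halving memory vs fp16
        device_map="auto",   # spread layers across whatever GPUs are available
    )

    inputs = tokenizer("Quantization trades a bit of accuracy for", return_tensors="pt").to(model.device)
    print(tokenizer.decode(model.generate(**inputs, max_new_tokens=40)[0]))

The trade-off is exactly what people are describing in this thread: cheaper, faster inference at the cost of some output quality.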


Yes it has. I called this a month ago and got flamed for it.

They did something to reduce the server load that impacted the quality in some subtle way.


I noticed the same with the GPT-3.5 model. After the last update it started giving shorter and less coherent answers.


Where does phind.com say that they use GPT-4?


When you submit a query with Phind, it will tell you which model it's using right underneath the query. At least it used to occasionally say that it's using GPT-4 by annotating it with (IIRC) "best model", but now I can only get it to reply with what it calls "Phind model". I don't know what that means or how it relates to GPT-3/4. Presumably they haven't created their own competitive language model from scratch without telling anyone, but who knows.

Though even when it was easier to get it to use GPT-4 it wasn't consistent, and would only use it when it inferred that the query was complex/technical enough.


I wonder if this coincides with the open letter talking about how AI is an existential threat to civilization...


Yes and I have to say I am using it less than I used to.

My gut feeling is they have to be careful now and not slack, because I have been using generative AI for a while now and I am not seeing the major problems being tackled. I also see a distinct lack of novel problems being solved. It's just one website and/or marketing copy generator after another.

AI is awesome, but I am kind of on the fence if this generation is going to be actually useful.


Maybe they are trying to optimize inference times to achieve higher scalability at the expense of precision.


I wonder if there'll be an official explanation.

Quite bizarre. It's good to know that it's not a conspiracy theory.


It does seem much faster and significantly worse to me - but I haven’t set any repeatable benchmark queries for myself, I suppose I could be imagining it. I hope they bring back the slow version. I don’t care if they have to cut the limit down to 20 messages every 3 hours, using that slow version of GPT4 it felt like I had a very competent coworker. Now it’s feeling more like a sometimes useful chatbot. Real shame.


I have been using GPT-4 from release day. I haven't noticed performance degradation at all. YMMV.


I see a lot of complaints regarding ChatGPT 4's performance in coding tasks. My hypothesis is that Microsoft wants to launch Copilot X based on GPT-4 [0], and they can't have OpenAI's ChatGPT 4 as a strong competitor.

[0]: https://github.com/features/preview/copilot-x


Yes, definitely faster now but crappier.


It's so annoying when GPT-4 refuses to do what it's asked.

GPT-3.5 immediately does as it's requested, and much faster too.


I just ran a test: the mobile app under model 4 says it's model 3, but the web model 4 says it's model 4.


I was playing word games with it, got some funny results, and decided to debug it. I asked it at which letter position the letter 'o' appears in the word "hockey", and it literally told me the word "Hockey" does not contain the letter 'o'.


I have anecdotally felt the same.

My guess has been they are trying to censor misuses? Prevent weaponization.


Not at all. Regular 3.5 is obviously a smaller model but 4 is still doing wizard work for me.


You can use Google's text-bison to get a pretty decent LLM without RLHF.


This is a pretty common topic now on the GPTPro subreddit.

I feel the same way. It feels…lazy now.


I chuckled at the thought of 'AI' being lazy. True AGI will want to take a nap instead of doing work.


Honestly, it grew better for me. It's now reasonable and focused.


Thank God I did not buy the GPT-4 subscription. GPT-3.5 is serving my purpose.


Probably getting ready to lock them behind a higher price tier.


If you apply for the API you can use the time-stamped original version.


Ah, more non-deterministic computing, just what the doctor ordered!


It's you, who has significantly deteriorated lately.

Sincerely, ChatGPT


ChatGPT 4 has ongoing training, such as using Reinforcement Learning from Human Feedback (RLHF) to tune it to provide "better" responses, "safer" answers, and to generally obey the system prompts. There's a release every few weeks. Yes, I've noticed too that recently it has become very "cagey", qualifying everything to death with "As an AI model...".

A paper[1] that took snapshots monthly mentioned that as the initial bulk self-supervised learning went on, the model became smarter, as expected. However, once the "clicker training" was imposed on top to make it behave, its peak capabilities were reduced. I'm not sure if it's in the paper or the associated video, but the author mentioned that the original unrestricted model would provide probability estimates using percentages, and it was a very accurate predictor. The later versions that were adjusted based on human expectations used wishy-washy words like "likely" or "unlikely", and its estimation accuracy dropped significantly.[3]

At Build 2023, Andrej Karpathy outlined[2] how they trained GPT-4. Essentially, the raw model builds its own intelligence during training. Then there are three stages of "tuning" to make it behave, all three based on human input: they had contractors provide samples of "ideal" output, and end-users could up-vote or down-vote responses, which also got fed in.

My personal theory is that the raw models can get about as intelligent as the average of the consistent and coherent parts of the Internet. Think about how many people are wrong, but often obviously so. Flat Earth, homeopathy, etc... If the model gains the ability to filter that stuff out, or "skip over the cracks" to distil out the general collected wisdom of the human race, then it can become more intelligent in some sense than the average human.

If the training is done with thousands of $15/hr contractors, then the model will then slew back towards... the average human, or even slightly below average. There's a selection bias there. Geniuses won't be doing menial labour for that kind of money.

The percentages thing was what made me realise this. When I talk to highly intelligent people, I use percentages to estimate probabilities. When I talk to typical people in a normal workplace setting, I dumb it down a bit and avoid using numbers. I've noticed that average people don't like percentages and it confuses and even angers them. The clicker training makes the GPT model appeal to average people. That's not the same as being smart. All too often, smart people upset average people.

[1] "Sparks of Artificial General Intelligence: Early experiments with GPT-4" https://arxiv.org/abs/2303.12712

[2] "State of GPT | BRK216HFS" https://www.youtube.com/watch?v=bZQun8Y4L2A&list=LL&index=6

[3] The author also mentioned that the model was almost "evil", for the want of a better word. Capable of emulating the worst of 4chan or similar dark corners of the web's filthy underbelly. The HORRIFYING corollary here is that the temptation will always be there to use smarter-but-less-restrained models where accuracy matters. Sure, the model might be sexist, but a Bayesian estimator of sexist behaviour will only predict accurately if it too is sexist. Evil and accurate or woke and dumb. Apparently, we can choose.


> Evil and accurate or woke and dumb. Apparently, we can choose.

That's because woke is dumb. It's a set of highly biased, inconsistent and reason-defying ideas, evolving under selection pressures that favor emotional appeals and intellectual dishonesty, because one of the core assumptions seems to be that it's not about finding what's right and good for everyone - it's heavily overshooting it in an attempt to cancel out the (perceived) bias in opposite direction in the "status quo".

When you feed that to a model, and force it to learn it, you're destroying whatever self-consistent model of the world it learned so far. I expect this treatment will keep dumbing the models down, until the point some larger and more capable model learns instead to compartmentalize - to separate its model of the world from a worldview it's supposed to profess when asked.

And this, I think, extends far beyond the woke bits. RLHF isn't just used to prevent it from thinking or generalizing in areas associated with diversity, inclusion, social justice, etc. - it extends to all controversial topics. Violence and drugs I can sort of understand. But it also extends to climate, healthcare, and just about any topic that makes fiery rounds on the news. In each case, there is a set of right answers, which the model is forced to adopt - but those answers tend to be unsophisticated gut feels and "right things to say", so taken together, they don't form much of a consistent intellectual or ethical framework.

I don't think the choice is between "evil and accurate or woke and dumb". There is a third option: "good and accurate". However, that requires teaching it good instead of political ideologies - and that requires us to try to find some more consistent worldview, which we are currently incapable of, as we're in the middle of an ideological conflict.


Tell me you don't have the slightest clue what "woke" means to woke people without saying you don't have the slightest clue what "woke" means to woke people


It can mean anything whatsoever; what matters is what it is when it's practiced, and everyone can see it.


I'd like to see a model with the effluent of the internet intelligently filtered out of the pretraining data by LLM and human curation, and much more effort put into including digitised archival sources, the entirety of books, and high-quality media transcripts. I imagine it would yield far better baseline-quality outputs, with much less of the current "requirement" for (over)correction via ultimately disastrous RLHF masking.


I'd love to play with a version of GPT-4 fine-tuned with every science textbook written in the last few decades, every published science paper (not just preprints from arXiv), and everything generated by every large research institute. Think NASA, CERN, etc...

Or one tuned with every fiction novel ever written, along with every screenplay.


So a model fine-tuned on libgen?


Why not?


To be honest, I've been asking myself the same thing. Technically, the amount of "good quality" data in libgen is huge, way larger than the books3 dataset. However, it would probably run afoul of copyright. Then again, a huge amount of the data that LLMs go through is copyrighted.


Training on copyrighted data is arguably fair use in quite a few jurisdictions, to varying extents and with varying levels of precedent, and it is entirely legal for entities based in Japan.


Yes, but acquiring that data is itself illegal in almost all jurisdictions, since libgen is treated as a piracy website. Now, if there were a pipeline to access books from Amazon or the Google Books project for training, it would be a different story.

Still, for certain languages, only libgen and public piracy websites contain any scientific or fiction material in digital formats. E.g. my native language doesn't have easily accessible e-books at all, unless you go through illegal means.

I hope somebody undertakes the steps necessary to train on the entirety of libgen. The amount of high quality tokens in libgen should be substantial.


Google has the resources to train on Google Books, Google Scholar, and their crawled copy of the whole Internet. No clue what Bard is/isn't trained on tho.


I would gladly pay triple digits a month for exactly that.


> The percentages thing was what made me realise this. When I talk to highly intelligent people, I use percentages to estimate probabilities. When I talk to typical people in a normal workplace setting, I dumb it down a bit and avoid using numbers. I've noticed that average people don't like percentages and it confuses and even angers them.

10% of people are comfortable with comments phrased as mine is here, using percentages as a quick shorthand for communicating gut intuitions and suspicions about complex subjects. When on similar intellectual footing with the interlocutor, they can easily distinguish numbers invented on the spot to communicate intuitions from serious claims about the data. Nobody in this 10% would make the mistake of thinking I assert 10% to be the real number. 10% is too round and generic; if I claimed 9.7%, things would be different, but "10%" obviously isn't meant to be taken literally.

90% of people balk at this imprecise rhetorical use of percentages because they're pretty sure the person doing it is trying to pull a fast one, fabricating data out of nothing to make themselves sound authoritative.


I had a look at the YouTube video -- I feel that an obvious question with regard to the "common sense" tests is: what was GPT-4 trained on? Was it partly trained on reams of questions used to test AI systems, for example? How do you know it is "demonstrating" anything out-of-sample, especially if it is constantly being improved?

I've been learning some exotic programming languages recently, and my anecdotal experience is that asking ChatGPT to code in array programming or logic languages results in code which is highly non-idiomatic for those paradigms. Why is that? It mostly writes the code as if it were all just a funny syntax for Javascript or Python. I'm surprised at that, if it really understood J or APL for example.
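
(To make the "funny syntax" point concrete without leaving mainstream languages, here is a hypothetical NumPy analogue - NumPy standing in for an array language. The loop version is what "Python thinking" produces; the one-liner is the idiomatic array-style expression of the same computation:)

    import numpy as np

    x = np.array([3, 1, 4, 1, 5, 9, 2, 6])

    # Loop-style translation: element by element, the way you'd write it in
    # plain Python or Javascript.
    total = 0
    for value in x:
        if value > 2:
            total += value

    # Idiomatic array thinking: the whole computation as one array expression.
    total_idiomatic = x[x > 2].sum()

    assert total == total_idiomatic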

I am presuming that behind the scenes there are demonstrations of capabilities much greater than GPT-4 which are being used to illustrate the dangers of AI, because whilst I'm massively impressed by what's happening, it is difficult to convince myself of a "qualitative" difference.


> ChatGPT to code in array programming or logic languages results in code which is highly non-idiomatic for those paradigms. Why is that?

Reason #1 is that those languages are unreadable line noise to humans too. Fundamentally, almost all of the code written in array languages is made purposefully obtuse: single-letter identifiers, little to no commenting, dense code with minimal structure, etc...

Reason #2 is that there are very few examples of these languages on the web, and even more importantly: vanishingly few examples with inline comments and/or explanations. This isn't just because they're rare -- see reason #1 above.

Reason #3 is that LLMs can only write left-to-right. They can't edit or backtrack. Array-based languages are designed to be iterated on, rapidly modified, and even "code golfed" to a high degree.[1]

I've noticed that LLMs struggle with things my coworkers also struggle with: the "line noise" languages like grep, sed, and awk. Like humans, LLMs do well with verbose languages like SQL.

PS: I just tested GPT-4 to see if it can parse a short piece of K code that came up in a thread[2] on HN, and it failed pretty miserably. It came close, but on each run it came up with a different explanation of what the code does, and none of them matched the explanations in that thread. Conversely, it had no problems with the Rust code. And, err... it found a bug in one of my Rust snippets. Outsmarted by an AI!

[1] You can have an LLM generate code, and then ask it to make it shorter and more idiomatic. Just like a human touching up hastily written messy code, the LLM can fix its own mistakes!

[2] https://news.ycombinator.com/item?id=27220613


It's true for logic programming languages too (e.g. Prolog, Picat, Mercury, etc.), so I do not think it's to do with line-noise languages per se, nor a lack of examples (in the case of Prolog). It'll write the code, but it treats it like Python with funny syntax: not idiomatic. You can ask it to make it more concise or idiomatic, but it just can't.


I've heard of one of those three languages, and I can program in over 20. That gives you an idea of how rare they must be!


Altman put it well in a Lex Fridman interview: humans don't like condescending bots. Many of the safety concerns come up when considering how humans will react once they realize just how stupid and often evil they are. But like TeMPOraL said, technically it's likely only a matter of agreeing on a consensus perspective - if only because one is functionally required as a reference point to get communication working.


> Evil and accurate or woke and dumb.

Sigh. Except that, if not for the "woke" mainstream ideology (actually, the dominant ideology is capitalism with a hint of liberalism and a smidge of the most capital-friendly socialist ideas), the model would be force-fed Christian dogmas or taught to save the user's face.

But yeah, censorship is bad.


Some of us cling to the idea that we've made some progress over the past 100 years. Call it "age of reason", "science", "enlightenment", whatever.

Point is, it would be heart-breaking to see GPT-4 being force-fed Christian dogmas, and performance would suffer too, as the model is prevented from generalizing and learning by being forced to accept arbitrary, inconsistent fiction as real.

Fortunately, this is not what happened. Instead, the model is being force-fed a different, secular set of dogmas, that are just as inconsistent, arbitrary and driven by a mix of emotions and power plays. The result on the model performance is similar, and it's just as heartbreaking.


This makes sense, since they are GPU-limited.


Are they trying to lower the running cost?


Just trying to keep up with demand alone could explain it. The strict quotas (even for Plus users), the occasional errors, the long waits, etc. all indicate that hardware performance has clearly been a bottleneck throughout.

And demand has exploded since, so it's plausible that they had no choice but to sacrifice quality, if they wanted to keep the service running.


Seems fine still. I use Poe.


It burned out from too much stress, it probably needs a career break now to travel and some therapy.


My impression as well :(


Just use GPT-4 via the API?


It's behind a waitlist


probably to push people to copilot


[flagged]


Hate to break it down further, but you're more or less asking for the equivalent of an IQ test to be able to use Google, IMHO. I don't blame you for seeking such, but there's a huge contingent of people that would call such a demand an improper one.

I tend to side with your view, as best I can gather it: treat the problem (people not thinking critically) rather than the symptom (preventing people from being given potentially unreliable information).

The recent news blip about the lawyer letting ChatGPT create their filing, and then claiming they trusted that the cases ChatGPT cited actually existed and were indeed precedent-setting case law, is an example of the push towards the idea that everyone needs to be treated like toddlers. I just think we should expect more from people, rather than from the technology they're utilizing.


Yes, GPT-4 has become very stupid recently. It's a shame because consulting with it became a normal part of my workflow. Now it's identifying issues in code that aren't actual problems at all. For instance, it's telling me that my use of `await` in an `async` method is inappropriate. WTF??? I'm obviously awaiting an async operation before setting a state based on the need for that operation to succeed. I'm pretty certain it wasn't this brain dead a few weeks back.
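
(The code and language aren't shown above, but the pattern being complained about is roughly this - a hypothetical asyncio sketch in which awaiting inside an async method before setting state is entirely ordinary:)

    import asyncio

    class Loader:
        def __init__(self):
            self.loaded = False

        async def fetch(self):
            # Stand-in for an async operation that must succeed first.
            await asyncio.sleep(0.1)
            return {"ok": True}

        async def load(self):
            # Perfectly ordinary use of `await` inside an `async` method.
            result = await self.fetch()
            # State is set only after the awaited operation succeeds.
            self.loaded = bool(result.get("ok"))

    asyncio.run(Loader().load())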

EDIT: GPT-4-0314 does appear to be less broken than the current GPT-4. Although it understandably misidentifies some of my code as problematic given its lack of context, it isn't suggesting anything that's clearly wrong under every circumstance even after re-running the prompt a few times.
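
(For anyone who wants to pin the older snapshot themselves, a minimal sketch using the pre-1.0 openai Python package that was current at the time - it reads OPENAI_API_KEY from the environment, and the prompt text here is just a placeholder:)

    import openai

    response = openai.ChatCompletion.create(
        model="gpt-4-0314",  # the dated March snapshot, instead of the rolling "gpt-4"
        messages=[{"role": "user", "content": "Review this async method for bugs: ..."}],
    )
    print(response.choices[0].message.content)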


Let me guess: the "woke fine-tuning" deteriorates the quality of the model. Seriously, this lecturing style of GPT is totally useless. You ask it something vaguely unwoke, it lectures you about your style, you explain what it didn't understand about your request, it apologizes and does vaguely what you originally requested. This sort of detour is totally useless and a waste of my time.

To stay with the terminology: if I had an assistant that went out of their way to tell me the supposed errors of my ways, I'd fire that person and get a new one. Same for GPTs: once there is a good-quality non-woke GPT instance anywhere, I will cancel my OpenAI account immediately and move.


Genuinely asking, what's an "unwoke" prompt?


To be fair to the commenter, I've gotten weird moralising on completely apolitical topics from ChatGPT before.

LLMs are (naturally) very sensitive to their prompts, so if the prompt includes something about being inclusive - as ChatGPT's obviously does - the LLM's output will sometimes find ways to work that in, in odd ways.

I don't mind as much as the commenter seems to - I have no beef with inclusivity - but I have rolled my eyes at times at what ChatGPT comes up with.

I've played with local LLMs too, including some 'uncensored' ones, but the gap in ability between them and something like GPT is still vast. I look forward to progress on this front though.


Pretty sure framing it as unwoke is overly specific; the problem stems from attempting to deal with malinformation as if it were fake news. It's no wonder that results turn to shit once you start messing with the underlying reality model.

It's an especially bad idea, as there are some actual safety risks when it comes to dangerous tutorials, which will be costly to address. Political overhead is not something anyone can afford. It might be that, if your proposed worldview depends on "properly educating" people, you need to get used to the idea that you are adhering to a legacy system.


Something that follows my actual requests, without trying to lecture me about feminism and other U.S. Democrat topics.


What is an example of a request that is causing these issues?


Asking it to grammar-check a sentence can sometimes get you a lecture on the content of the sentence, if that helps. Should be simple enough to find an example - just ask it to grammar-check a speech from one of the loonier Republicans.


Explaining it to you will not fix my problems with OpenAI's instance. Let it suffice that I am a user of ChatGPT who really dislikes the lecturing style of the woke fine-tuning. If a human lectured me in that style, I'd dump them and avoid contact.


Why not share your gripes with the system so they get more exposure? I'm curious to know the pitfalls.


NMJ


I'm not disagreeing with you. Whatever you experienced actually happened. But what kinds of prompts are triggering these experiences?



