Native JSON Output from GPT-4 (yonom.substack.com)
604 points by yonom on June 14, 2023 | 248 comments



I'm concerned that OpenAI's example documentation suggests using this to A) construct SQL queries and B) summarize emails, but that their example code doesn't include clear hooks for human validation before actions are called.

For a recipe builder it's not so big a deal, but I really worry how eager people are to remove human review from these steps. It gets rid of a very important mechanism for reducing the risks of prompt injection.

The top comment here suggests wiring this up to allow GPT-4 to recursively call itself. Meanwhile, some of the best advice I've seen from security professionals on secure LLM app development is to isolate queries from each other whenever possible, to reduce the potential damage that a compromised agent can do before its "memory" is wiped.

There are definitely ways to use this safely, and there are definitely some pretty powerful apps you could build on top of this without much risk. LLMs as a transformation layer for trusted input is a good use-case. But are devs going to stick with that? Is it going to be used safely? Do devs understand any of the risks or how to mitigate them in the first place?

3rd-party plugins on ChatGPT have repeatedly been vulnerable in the real world, I'm worried about what mistakes developers are going to make now that they're actively encouraged to treat GPT as even more of a low-level data layer. Especially since OpenAI's documentation on how to build secure apps is mostly pretty bad, and they don't seem to be spending much time or effort educating developers/partners on how to approach LLM security.


In my opinion the only way to use it safely is to ensure your AI only has access to data that the end user already has access to.

At that point, prompt injection is no longer an issue - because the AI doesn't need to hide anything.

Giving GPT access to your entire database, but telling it not to reveal certain bits, is never going to work. There will always be side channel vulnerabilities in those systems.


> e.g. define a function called extract_data(name: string, birthday: string), or sql_query(query: string)

This section in OpenAI's product announcement really irritates me because it's so obvious that the model should have access to a subset of API calls that themselves fetch the data, as opposed to giving the model raw access to SQL. You could have the same capabilities while eliminating a huge amount of risk. And OpenAI just sticks this right in the announcement, they're encouraging it.

When I'm building a completely isolated backend with just regular code, I still usually put a data access layer in front of the database in most cases. I still don't want my REST endpoints directly building SQL queries or directly accessing the database, and that's without an LLM in the loop at all. It's just safer.

It's the same idea as using `innerHTML`; in general it's better when possible to have those kinds of calls extremely isolated and to go through functions that constrain what can go wrong. But no, OpenAI just straight up telling developers to do the wrong things and to give GPT unrestricted database access.
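
For example, instead of a raw sql_query(query: string) function, you'd expose a handful of narrow wrappers. A rough sketch in the functions format (names and fields here are made up, and `db` is assumed to be a parameterized DB handle):

    functions = [{
        "name": "get_recent_orders",
        "description": "Fetch the current user's most recent orders",
        "parameters": {
            "type": "object",
            "properties": {
                "limit": {"type": "integer", "description": "How many orders, max 20"}
            },
            "required": ["limit"],
        },
    }]

    def get_recent_orders(user_id, limit):
        # Parameterized query runs server-side; the model never sees or writes SQL.
        return db.execute(
            "SELECT id, total FROM orders WHERE user_id = ? "
            "ORDER BY created_at DESC LIMIT ?",
            (user_id, min(limit, 20)),
        ).fetchall()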


SQL doesn’t necessarily have to mean full database access.

I know it’s pretty common to have apps connect to a database with a db user with full access to do anything, but that’s definitely not the only way.

If you’re interested in being safer, it’s worth learning the security features built in to your database.


> If you’re interested in being safer, it’s worth learning the security features built in to your database.

The problem isn't that there's no way to be safe, the problem is that OpenAI's documentation does not do anything to discourage developers from implementing this in the most dangerous way possible. Like you suggest, the most common way this will be implemented is via a db user with full access to do anything.

Developers would be far more likely to implement this safely if they were discouraged from using direct SQL queries. Developers who know how to safely add SQL queries will still know how to do that -- but developers who are copying and pasting code or thinking naively "can't I just feed my schema into GPT" should be pushed towards an implementation that's harder to mess up.


It's hard for me to believe openai's documentation will have any effect on developers who write or copy-and-paste data access code without regard to security, no matter what it says.

If you provide an API or other external access to app data and the app data contains anything not everyone should be able to access freely then your API has to implement some kind of access control. It really doesn't matter if your API is SQL-based, REST-based, or whatever.

A SQL-based API isn't inherently less secure than a non-SQL-based one if you implement access control, and a non-SQL-based API isn't inherently more secure than a SQL-based one if you don't implement access control. The SQL-ness of an API doesn't change the security picture.


> If you provide an API or other external access to app data and the app data contains anything not everyone should be able to access freely then your API has to implement some kind of access control. It really doesn't matter if your API is SQL-based, REST-based, or whatever.

I don't think that's the way developers are going to interact with GPT at all, I don't think they're looking at this as if it's external access. OpenAI's documentation makes it feel like a system library or dependency, even though it's clearly not.

I'll go out on a limb, I suspect a pretty sizable chunk (if not an outright majority) of the devs who try to build on this will not be thinking about the fact that they need access controls at all.

> A SQL-based API isn't inherently less secure than a non-SQL-based one if you implement access control, and a non-SQL-based API isn't inherently more secure than a SQL-based one if you don't implement access control. The SQL-ness of an API doesn't change the security picture.

I'm not sure I agree with this either. If I see a dev exposing direct query access to a database, my reaction is going to be very dependent on whether or not I think they're an experienced programmer already. If I know them enough to trust them, fine. Otherwise, my assumption is that they're probably doing something dangerous. I think the access controls that are built into SQL are a lot easier to foot-gun, I generally advise devs to build wrappers because I think it's generally harder to mess them up. Opinion me :shrug:

Regardless, I do think the way OpenAI talks about this does matter, I do think their documentation will influence how developers use the product, so I think if they're going to talk about SQL they should be showing, in code, examples of how to implement those access controls. "We're just providing the API, if developers mess it up it's their fault" -- I don't know, good APIs and good documentation should try, when possible, to provide a "pit of success[0]" for naive developers. In particular I think that matters when talking about a market segment that is getting a lot of naive VC money thrown at it, sometimes without a lot of diligence, and where those security risks may end up impacting regular people.

[0]: https://blog.codinghorror.com/falling-into-the-pit-of-succes...


You don't need to directly run the query it returns, you can use that query as a sub-query on a known safe set of data and let it fail if someone manages to prompt inject their way into looking at other tables/columns.

That way you can support natural language to query without sending dozens of functions (which will eat up the context window)
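
One mechanical way to get that "let it fail" behaviour, sketched with sqlite3's authorizer hook rather than a literal subquery (same spirit; `model_generated_sql` is whatever the model returned):

    import sqlite3

    ALLOWED = {("recipes", "name"), ("recipes", "calories")}  # assumed safe columns

    def authorizer(action, arg1, arg2, dbname, trigger):
        if action == sqlite3.SQLITE_SELECT:
            return sqlite3.SQLITE_OK
        if action == sqlite3.SQLITE_READ:
            return sqlite3.SQLITE_OK if (arg1, arg2) in ALLOWED else sqlite3.SQLITE_DENY
        return sqlite3.SQLITE_DENY  # coarse: no writes, no anything else

    conn = sqlite3.connect("file:app.db?mode=ro", uri=True)  # read-only connection
    conn.set_authorizer(authorizer)
    try:
        rows = conn.execute(model_generated_sql).fetchall()
    except sqlite3.DatabaseError:
        rows = None  # query touched something it shouldn't; fail instead of leaking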


You can do that (I wouldn't advise it, there are still problems that are better solved by building explicit functions; but you can use subqueries and it would be safer) -- but most developers won't. They'll run the query directly. Most developers also will not execute it as a readonly query, they'll give the LLM write access to the database.

If OpenAI doesn't know that, then I don't know what to say, they haven't spent enough time writing documentation for general users.


You can't advise for or against it without a well defined problem: for some cases explicit functions won't even be an option.

Defining basic CRUD functions for a few basic entities will take a ton of tokens in schema definitions, and still suffers from injection if you want to support querying on data that wasn't well defined a priori, which is a problem I've worked on.

Overall if this was one of their example projects I'd be disappointed, but it was a snippet in a release note. So far their actual example projects have done a fair job showing where guardrails in production systems are needed, I wouldn't over-index on this.


> You can't advise for or against it without a well defined problem: for some cases explicit functions won't even be an option.

On average I think I can. I mean, I can't know without the exact problem specifications whether or not a developer should use `innerHTML`/`eval`. But I can offer general advice against it, even though both can be used securely. I feel pretty safe saying that exposing SQL access directly in an API will usually lead to more fragile infrastructure. There are plenty of exceptions of course, but there are exceptions to pretty much all programming advice. I don't think it's good for it to be one of the first examples they bring up for how to use the API.

----

> Overall if this was one of their example projects I'd be disappointed

I have similar complaints about their example code. They include the comment:

> # Note: the JSON response from the model may not be valid JSON

But they don't actually do schema validation here or check anything. Their example project isn't fit to deploy. My thought on this is that if every response for practically every project needs to have schema validation (and I would strongly advise doing schema validation on every response), then the sample code should have schema validation in it. Their example project should be something that could be almost copy-and-pasted.

If that makes the code sample longer, well... that is the minimum complexity to build an app on this. The sample code should reflect that.
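
For reference, the missing validation step is only a few lines, e.g. with the jsonschema package (a sketch; `schema` is the same parameters object you sent in the function definition):

    import json
    from jsonschema import ValidationError, validate

    def parse_function_args(message, schema):
        try:
            args = json.loads(message["function_call"]["arguments"])
            validate(instance=args, schema=schema)
            return args
        except (KeyError, json.JSONDecodeError, ValidationError) as err:
            # reject, retry, or fall back -- but never act on unvalidated arguments
            raise ValueError(f"model returned invalid arguments: {err}")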

> and still suffers from injection if you want to support querying on data that wasn't well defined a-priori

This is a really good point. My response would be that they should be expanding on this as well. I'm really frustrated that OpenAI's documentation provides (imo) basically no really practical/great security advice other than "hey, this problem exists, make sure you deal with it." But it seems to me like they're already falling over on providing good documentation before they even get to the point where they can talk seriously about bigger security decisions.


> your AI only has access to data that the end user already has access to.

That doesn't work for the same reason you mention with a DB ... any data source is vulnerable to indirect injection attacks. If you open the door to ANY data source, this is a factor, including ones under the sole "control" of the user.


>At that point, prompt injection is no-longer an issue [...]

As far as input goes, yes. But I am more worried about agents that can take actions that affect the outside world, like sending emails on your behalf.


I was going to say “I look forward to it and think it’s hilarious,” but then I remembered that most victims will be people learning to code, not companies. It would really suck to suddenly lose your recipe database when you just wanted to figure out how this programming stuff worked.

Some kind of “heads up” tagline is probably a good idea, yeah.


I think the victims will mostly be the users of the software. The personal assistant that can handle your calendar and emails and all would be able to do real damage.


I don't understand why they have done this? Like, how did the conversations go when it was pointed out to them what a pretty darn bad idea it was to recommend connecting chatgpt directly to a SQL database?

I know we are supposed to assume incompetence over malice, but no one is that incompetent. They must have had the conversations, and chose to do it anyway.


Why is this unreasonable to you? I can imagine using this, just run it with read access and check the sql if the results are interesting.


Even read only. You are giving access to your data to a black box API.


If it's on Azure anyway I don't see the big deal, especially if you are an enterprise and so buying it via azure instead of directly.


Perhaps they plan on having ChatGPT make a quick copy of your database, for your convenience of course.


i think people are underestimating the potential here for agents building - it is now a lot easier for GPT4 to call other models, or itself. while i was taking notes for our emergency pod yesterday (https://www.latent.space/p/function-agents) we had this interesting debate with Simon Willison on just how many functions will be supplied to this API. Simon thinks it will be "deep" rather than "wide" - eg a few functions that do many things, rather than many functions that do few things. I think i agree.

you can now trivially make GPT4 decide whether to call itself again, or to proceed to the next stage. it feels like the first XOR circuit from which we can compose a "transistor", from which we can compose a new kind of CPU.


It was already quite easy to get GPT-4 to output json. You just append ‘reply in json with this format’ and it does a really good job.

GPT-3.5 was very haphazard though and needs extensive babysitting and reminding, so if this makes gpt3 better then it’s useful - it does have an annoying disclaimer though that ‘it may not reply with valid json’ so we’ll still have to do some sense checks on the output.

I have been using this to make a few ‘choose your own adventure’ type games and I can see there’s a TONNE of potential useful things.


> You just append ‘reply in json with this format’ and it does a really good job.

It does an ok job. Except when it doesn't. Definitely misses a lot of the time, sometimes on prompts that succeeded on previous runs.


It literally does it every time perfectly. I remember I put together an entire system that would validate the JSON against a zod schema and use reflection to fix it and it literally never gets triggered because GPT3.5-turbo always does it right the first time.


> It literally does it every time perfectly. I remember I put together an entire system that would validate the JSON against a zod schema and use reflection to fix it and it literally never gets triggered because GPT3.5-turbo always does it right the first time.

Danger! There be assumptions!!

gpt-? is a moving target and in rapid development. What it does Tuesday, which it did not do on Monday, it may well not do on Wednesday

If there is a documented method to guarantee it, it will work that way (modulo OpenAI bugs - and now Microsoft is involved....)

What we had before, what you are talking of, was observed behaviour. An assumption that what we observed in the past will continue in the future is not something to build a business on


ChatGPT moves fast. The API version doesn’t seem to change except with the model and documented API changes.


No it doesn't lol. I've seen it just randomly not use a comma after one array element, for example.


Yep. Incorrect trailing commas ad nauseam for me.


Are you saying that it returned only JSON before? I'm with the other commenters: it was wildly variable and always at least said "Here is your response", which doesn't parse well.


If you want a parsable response, have it wrap that with ```. Include an example request/response in your history. Treat any message you can’t parse as an error message.

This works well because it has a place to put any “keep in mind” noise. You can actually include that in your example.
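
Something like this (a rough sketch of the approach, not battle-tested):

    import json, re

    def parse_reply(text):
        match = re.search(r"```(?:json)?\s*(.*?)```", text, re.DOTALL)
        if not match:
            raise RuntimeError(f"model sent an error/noise instead of JSON: {text}")
        return json.loads(match.group(1))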


Yeah no


The solution that worked great for me - do not use JSON for GPT to agent communication. Use comma separated key=value, or something to that effect.

Then have another pure code layer to parse that into structured JSON.

I think it’s the JSON syntax (with curly braces) that does it in. So YAML or TOML might work just as well, but I haven’t tried that.
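
The pure-code layer ends up being tiny, e.g. (assuming flat key=value output with no commas inside values):

    import json

    def kv_to_json(reply):
        pairs = (item.split("=", 1) for item in reply.split(",") if "=" in item)
        return json.dumps({k.strip(): v.strip() for k, v in pairs})

    kv_to_json("name=Tomato soup, servings=4, vegetarian=yes")
    # '{"name": "Tomato soup", "servings": "4", "vegetarian": "yes"}'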


Coincidentally, I just published this JS library[1] over the weekend that helps prompt LLMs to return typed JSON data and validates it for you. Would love feedback on it if this is something people here are interested in. Haven’t played around with the new API yet but I think this is super exciting stuff!

[1] https://github.com/jacobsimon/prompting


Looks promising! Do you do retries when returned json is invalid? Personally, I used io-ts for parsing, and GPT seems to be able to correct itself easily when confronted with a well-formed error message.


Great idea, I was going to add basic retries but didn’t think to include the error.

Any other features you’d expect in a prompt builder like this? I’m tempted to add lots of other utility methods like classify(), summarize(), language(), etc


It's harder to form a tree with key value. I also tried the relational route. But it would always mess up the cardinality (one person should have 0 or n friends, but a person has a single birth date).


You could flatten it using namespaced keys. Eg.

    {
      parent1: { child1: value }
    }
Becomes one of the following:

    parent1/child1=value
    parent1_child1=value
    parent1.child1=value
..you get the idea.


It's also harder to stream JSON? Maybe I'm overthinking this.


even with gpt 4, it hallucinates enough that it’s not reliable, forgetting to open/close brackets and quotes. This sounds like it’d be a big improvement.


Not that it matters now but just doing something like this works 99% of the time or more with 4 and 90% with 3.5.

It is VERY IMPORTANT that you respond in valid JSON ONLY. Nothing before or after. Make sure to escape all strings. Use this format:

{“some_variable”: [describe the variable purpose]}


99% of the time is still super frustrating when it fails, if you're using it in a consumer facing app. You have to clean up the output to avoid getting an error. If it goes from 99% to 100% JSON that is a big deal for me, much simpler.


Except it says in the small print to expect invalid JSON occasionally, so you have to write your error handling code either way


If you're building an app based on LLMs that expects higher than 99% correctness from it, you are bound to fail. Workarounds for negative scenarios and retries are mandatory.


Yup. Is there a good/forgiving "drunken JSON parser" library that people like to use? Feels like it would be a useful (and separable) piece?


Honestly, I suspect asking GPT-4 to fix your JSON (in a new chat) is a good drunken JSON parser. We are only scraping the surface of what's possible with LLMs. If Token generation was free and instant we could come up with a giant schema of interacting model calls that generates 10 suggestions, iterates over them, ranks them and picks the best one, as silly as it sounds.
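
A rough sketch of that "drunken JSON parser" loop, feeding the parser error back in a fresh chat (pre-1.0 openai SDK assumed):

    import json
    import openai

    def repair_json(broken, attempts=2):
        for _ in range(attempts):
            try:
                return json.loads(broken)
            except json.JSONDecodeError as err:
                fix = openai.ChatCompletion.create(
                    model="gpt-3.5-turbo",
                    messages=[{
                        "role": "user",
                        "content": f"Fix this so it parses as JSON ({err}). "
                                   f"Reply with JSON only:\n{broken}",
                    }],
                )
                broken = fix["choices"][0]["message"]["content"]
        return json.loads(broken)  # still raises if it never got fixed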


That's hilarious... if parsing GPT's JSON fails, keep asking GPT to fix it until it parses!


It shouldn't be surprising though. If a human makes an error writing JSON, what do you do? You make them look over it again. Unless their intelligence is the bottleneck, they might just be able to fix it.


It works. Just be sure to build a good error message.


I already do this today to create domain-specific knowledge focused prompts and then have them iterate back and forth and a ‘moderator’ that chooses what goes in and what doesn’t.


Wouldn't you use traditional software to validate the JSON, then ask chatgpt to try again if it wasn't right?


In my experience, telling it "no thats wrong, try again" just gets it to be wrong in a new different way, or restate the same wrong answer slightly differently. I've had to explicitly guide it to correct answers or formats at times.


Try different phrasing, like "Did your answer follow all of the criteria?".


It forgets commas too


Nah, this was solved by most teams a while ago.


I feel like I’m taking crazy pills with the amount of people saying this is game changing.

Did they not even try asking gpt to format the output as json?


> I feel like I’m taking crazy pills....try asking gpt to format the output as json

You are taking crazy pills. Stop

gpt-? is unreliable! That is not a bug in it, it is the nature of the beast.

It is not an expert at anything except natural language, and even then it is an idiot savant


I like to define a JSON schema (https://json-schema.org/) and prompt GPT-4 to output JSON based on that schema.

This lets me specify general requirements (not just JSON structure) inline with the schema and in a very detailed and structured manner.
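
Roughly like this (a sketch; the schema and wording are just an example):

    import json

    recipe_schema = {
        "type": "object",
        "properties": {
            "title": {"type": "string", "description": "Max 8 words, no emoji"},
            "steps": {
                "type": "array",
                "items": {"type": "string"},
                "description": "Imperative voice, one action per step",
            },
        },
        "required": ["title", "steps"],
    }

    prompt = (
        "Return a recipe as JSON conforming to this JSON Schema, and treat every "
        "description field as a requirement:\n" + json.dumps(recipe_schema, indent=2)
    )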


In a production system, you don’t need easy to do most of the time, you need easy without fail.


Ok, just playing devil's advocate here. How many FAANG companies have you seen have an outage this year? What's their budget?

I think a better way to reply to the author would have been "how often does it fail"?

Every system will have outages, it's just a matter of how much money you can throw at the problem to reduce them.


If 99.995% correct looks bad to users, wait until they see 37%.


It's fine, but the article makes some good points why - less cognitive load for GPT and less tokens. I think the transistor to logic gate analogy makes sense. You can build the thing perfectly with transistors, but just use the logic gate lol.


Is there any publicly available resource to replicate your work? I would love to just find the right kind of "incantation" for gpt-3.5-t or gpt-4 to output a meaningful story arc etc.

Any examples of your work would be greatly helpful as well!


I'm not the person you're asking, but I built a site that allows you to generate fiction if you have an OpenAI API key. You can see the prompts sent in console, and it's all open source:

https://havewords.ai/


I have an open source project doing exactly this at https://www.generativestorytelling.ai/ GitHub link is on the main page!


I could not get GPT-4 to reliably not give some sort of text response, even if it was just a simple "Sure" followed by the JSON.


Pass in an agent message with "Sure here is the answer in json format:" after the user message. GPT will think it has already done the preamble and the rest of the message will start right with the json.
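
Concretely, something like this (pre-1.0 openai SDK assumed; the "agent" message is the assistant role):

    import openai

    response = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": "You output JSON only."},
            {"role": "user", "content": "List three sorting algorithms with their complexity."},
            {"role": "assistant", "content": "Sure, here is the answer in JSON format:"},
        ],
    )
    print(response["choices"][0]["message"]["content"])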


Did you try using the API and providing a very clear system message followed by several examples that were pure JSON?


Yep. I even gave it a JSON schema file to use. It just wouldn't stop adding extra verbiage.


I just use a regex to select everything between the first and last curly bracket, which reliably fixes the “sure, here’s your object” problem.
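
E.g. (a sketch; assumes exactly one object in the reply):

    import json, re

    def extract_json(text):
        return json.loads(re.search(r"\{.*\}", text, re.DOTALL).group(0))

    extract_json('Sure, here\'s your object: {"ok": true} Hope that helps!')  # {'ok': True}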


Say it's a json API and may only reply with valid json without explanation.


Lol yes of course I tried that.


I've had good luck with both:

https://github.com/drorm/gish/blob/main/tasks/coding.txt

and

https://github.com/drorm/gish/blob/main/tasks/webapp.txt

With the second one, I reliably generated half a dozen apps with one command.

Not to say that it won't fail sometimes.


Combine both ? :)


Just end your request with

‘’’json

Or provide a few examples of user request and then agent response in json. Or both.


Does the ```json trick work with the chat models? Or only the earlier completion models?


Works with chat. They’re still text completion models under all that rlhf


GPT-4 was already a massive improvement on 3.5 in terms of replying consistently in a certain JSON structure - I often don't even need to give examples, just a sentence describing the format.

It's great to see they're making it even better, but where I'm currently hitting the limit still in GPT-4 for "shelling out" is about it being truly "creative" or "introspective" about "do I need to ask for clarifications" or "can I find a truly novel way around this task" type of things vs "here's a possible but half-baked sequence I'm going to follow".


It is “good enough”. Where I struggle is maintaining its memory through a longer request where multiple iterations fail or succeed and then all of a sudden its memory is exceeded and starts fresh. I wish I could store “learnings” that it could revisit.


Sounds like you want something like tree of thoughts: https://arxiv.org/abs/2305.10601


Interestingly the paper's repo starts off :

Blah Blah "...is NOT the correct implementation to replicate paper results. In fact, people have reported that his code cannot properly run, and is probably automatically generated by ChatGPT, and kyegomez has done so for other popular ML methods, while intentionally refusing to link to official implementations for his own interests"

Love a good GitHub Identity Theft Star farming ML story

But this method could have potential for a chain of function


It's interesting to think about this form of computation (LLM + function call) in terms of circuitry. It is still unclear to me however, if the sequential form of reasoning imposed by a sequence of chat messages is the right model here. LLM decoding and also more high-level "reasoning algorithms" like tree of thought are not that linear.

Ever since we started working on LMQL, the overarching vision all along was to get to a form of language model programming, where LLM calls are just the smallest primitive of the "text computer" you are running on. It will be interesting to see what kind of patterns emerge, now that the smallest primitive becomes more robust and reliable, at least in terms of the interface.


Exactly, we humans can use specialized models and traditional tool APIs and models and orchestrate the use of all these without understanding how these things work in detail.

To do accounting, GPT-4 (or future models) doesn't have to know how to calculate. All it needs to know is how to interface with tools like calculators, spreadsheets, etc. and parse their outputs. Every script, program, etc. becomes a thing that has such an API. A lot of what we humans do to solve problems is breaking down big problems into problems where we know the solution already.

Real life tool interfaces are messy and optimized for humans with their limited language and cognitive skills. Ironically, that means they are relatively easy to figure out for AI language models. Relative to human language the grammar of these tool "languages" is more regular and the syntax less ambiguous and complicated. Which is why gpt 3 and 4 are reasonably proficient with even some more obscure programming languages and in the use of various frameworks; including some very obscure ones.

Given a lot of these tools with machine accessible APIs with some sort of description or documentation, figuring out how to call these things is relatively straightforward for a language model. The rest is just coming up with a high level plan and then executing it. Which amounts to generating some sort of script that does this. As soon as you have that, that in itself becomes a tool that may be used later. So, it can get better over time. Especially once it starts incorporating feedback about the quality of its results. It would be able to run mini experiments and run its own QA on its own output as well.


"Trivial" is misleading. From OpenAI's docs and demos, the full ReAct workflow is an order of magnitude more difficult than typical ChatGPT API usage with a new set of constaints (e.g. schema definitions)

Even OpenAI's notebook demo has error handling workflows, which were actually necessary since ChatGPT returned incorrectly formatted output.


Maybe trivial isn't the right word, but it's still very straight-forward to get something basic, yet really powerful...

ReAct Setup Prompt (goal + available actions) -> Agent "ReAction" -> Parse & Execute Action -> Send Action Response (success or error) -> Agent "ReAction" -> repeat

As long as each action has proper validation and returns meaningful error messages, you don't need to even change the control flow. The agent will typically understand what went wrong, and attempt to correct it in the next "ReAction".

I've been refactoring some agents to use "functions" and so far it seems to be a HUGE improvement in reliability vs the "Return JSON matching this format" approach. Most impactful is the fact that "3.5-turbo" will now reliably return JSON (before you'd be forced to use GPT-4 for a ReAct-style agent of modest complexity).

My agents also seem to be better at following other instructions now that the noise of the response format is gone (of course it's still there, but in a way it has been specifically trained on). This could also just be a result of the improvements to the system prompt though.


For 3.5, I found it easiest to specify a simple, but parsable, format for responses and then convert that to JSON myself.

I'll have to see if the new JSON schema support is easier than what I already have in place.


The first transistors were slow, and it seems this "GPT3/4 calling itself" stuff is quite slow. GPT3/4 as a direct chat is about as slow as I can take. Once this gets sped up.

I am sure it will, as you can scale out, scale up and build more efficient code and build more efficient architectures and "tool for the job" different parts of the process.

The problem now (using auto gpt, for example) is accuracy is bad, so you need human feedback and intervention AND it is slow. Take away the slow, or the needing human intervention and this can be very powerful.

I dream of the breakthrough "shitty old laptop is all you need" paper where they figure out how to do amazing stuff with a 1Gb of space on a spinny disk and 1Gb RAM and a CPU.


I agree with this. We’ve already gotten pretty good at json coercion, but this seems like it goes one step further by bundling decision making in to the model instead of junking up your prompt or requiring some kind of eval on a single json response.

It should also be much easier to cache these functions. If you send the same set of functions on every API hit, OpenAI should be able to cache that more intelligently than if everything was one big text prompt.


Wow your brand is huge. Crazy growth. i wonder how much these subtle mentions on forums help


They're the only one commenter on HN I noticed keeps writing "smol" instead of "small", and is associated with projects with "smol" in their name. Surely I'm not the only one who missed it being a meme around 2015 or sth., and finds this word/use jarring - and therefore very attention-grabbing? Wonder how much that helps with marketing.

This is meant with no negative intentions. It's just that 'swyx was, in my mind, "that HN-er that does AI and keeps saying 'smol'" for far longer than I was aware of latent.space articles/podcasts.


Personally, I associate "smol" with "doggo" and "chonker" and other childish redditspeak.


and fun fact i used to work at Temporal too heheh.


i mean hopefully its relevant content to the discussion, i hope enough pple know me here by now that i fully participate in The Discourse rather than just being here to cynically plug my stuff. i had a 1.5 hr convo with simon willison and other well known AI tinkerers on this exact thing, and so I shared it, making the most out of their time that they chose to share with me.


100%, if the API itself can choose to call a function or an LLM, then it's way easier to build any agent loop without extensive prompt engineering + worrying about errors.

Tweeted about it here as well: https://twitter.com/jerryjliu0/status/1668994580396621827?s=...


You still have to worry about errors. You will probably have to add an error handler function that it can call out to. Otherwise the LLM will hallucinate a valid output regardless of the input. You want it to be able to throw an error and say it couldn't produce the output in the given format.


> "you can now trivially make GPT4 decide whether to call itself again, or to proceed to the next stage."

Does this mean the GPT-4 API is now publicly available, or is there still a waitlist? If there's a waitlist and you literally are not allowed to use it no matter how much you are willing to pay then it seems like it's hard to call that trivial.


Not GP, but it's still the latter...i've been (im)patiently waiting.

From their blog post the other day: With these updates, we’ll be inviting many more people from the waitlist to try GPT-4 over the coming weeks, with the intent to remove the waitlist entirely with this model. Thank you to everyone who has been patiently waiting, we are excited to see what you build with GPT-4!


If you put contact info in your HN profile - especially an email address that matches one you use to login to openai, someone will probably give you access...

Anyone with access can share it with any other user via the 'invite to organisation' feature. Obviously that allows the invited person to make requests billed to the inviter, but since most experiments are only a few cents that doesn't really matter much in practice.


Good to know, but I've racked up a decent bill for just my GPT 3.5 use. I can get by with experiments using my ChatGPT Plus subscription, but I really need my own API access to start using it for anything serious.


"With these updates, we’ll be inviting many more people from the waitlist to try GPT-4 over the coming weeks, with the intent to remove the waitlist entirely with this model. Thank you to everyone who has been patiently waiting, we are excited to see what you build with GPT-4!"

https://openai.com/blog/function-calling-and-other-api-updat...


Interesting observation, @swyx. There seems to be a connection to transitive closure in SQL queries, where the output of the query is fed as the input to the query in the next iteration [1]. We are thinking about how to best support such recursive functions in EvaDB [2].

[1] http://dwhoman.com/blog/sql-transitive-closure.html [2] https://evadb.readthedocs.io/en/stable/source/tutorials/11-s...


The thing is the relevant context often depends on what it's trying to do. You can give it a lot of context in 16k but if there are too many different types of things then I think it will be confused or at least have less capacity for the actual selected task.

So what I am thinking is that some functions might just be like gateways into a second menu level. So instead of just edit_file with the filename and new source, maybe only select_files_for_edit is available at the top level. In that case I can ensure it doesn't overwrite an existing file and lose important stuff that was already in there, by providing the requested files' existing contents along with the function allowing the file edit.


Not sure that’s true. I haven’t completely filled the context with examples but I do provide 8 or so exchanges between user and assistant along with a menu of available commands and it seems to be able to generalize from that very well. No hallucinations either. Good idea about sub menus though, I’ll have to use that.


I think big context only makes sense for document analysis.

For programming you want to keep it slim. Just like you should keep your controllers and classes slim.

Also people with 32k access report very very long response times of up to multiple minutes which is not feasible if you only want a smaller change or analysis.


What would be an example where there needs to be an arbitrary level of recursive ability for GPT4 to call itself?


writing code of higher complexity (we know from CICERO that longer time spent on inference is worth orders of magnitude more than the equivalent in training when it comes to improving end performance), or doing real world tasks with unknown fractal depth (aka yak shave)


Who is Simon Willison? Is he big in AI?


formerly cocreator of Django, now Datasette, but pretty much the top writer/hacker on HN making AI topics accessible to engineers https://hn.algolia.com/?dateRange=pastYear&page=0&prefix=tru...


Oh wow, nice! Big fan of his work


Do you people always have to overhype this shit?


Do you have to be nasty?

That's a person you're replying to with feelings, so why not default to being kind in comments as per HN guidelines?

As it happens, swyx has built notable AI related things, for example smol-developer

https://twitter.com/swyx/status/1657892220492738560

and it would be nice to be able to read his and other perspectives without having to read shallow, mean, dismissive replies such as yours.


[flagged]


Hey, I understand the frustration (both the frustration of endless links on an over-hyped topic, and the frustration of getting scolded by another user when expressing yourself) - but it really would be good if you'd post more in the intended spirit of this site (https://news.ycombinator.com/newsguidelines.html).

People sometimes misunderstand this, so I'd like to explain a bit. (It probably won't help, but it might, and I don't like flagging or banning accounts without trying to persuade people first if possible.)

We don't ask people to be kind, post thoughtfully, not call names, not flame, etc., out of nannyism or some moral thing we're trying to impose. That wouldn't feel right and I wouldn't want to be under Mary Poppins's umbrella either.

The reason is more like an engineering problem: we're trying to optimize for one specific thing (https://hn.algolia.com/?dateRange=all&page=0&prefix=true&sor...) and we can't do that if people don't respect certain constraints. The constraints are to prevent the forum from burning itself to a crisp, which is where the arrow of internet entropy will take us if we don't expend energy to stave it off.

It probably doesn't feel like you're doing anything particularly wrong, but there's a cognitive bias where everyone underestimates the damage they're causing (by say 10x) and overestimates the damage others are causing (by say 10x) and that compounds into a major bias where everyone feels like everyone else is the problem. We need a way out of that dynamic if we're to have any hope of keeping this place interesting. As you probably realize, HN is forever on the brink of caving into a pit. We need you to help nudge it back from that, not push it over.

Of course you're free to say "what do I care if HN burns itself to a crisp, fuck you all" but I'd argue you shouldn't take that nihilistic position because it isn't in your own interests. HN may be annoying at times, but it's interesting enough for you to spend time here—otherwise you wouldn't be reading the site and posting to it. Why not contribute to making it more interesting rather than destroying it for yourself and everyone else? (I don't mean that you're intentionally destroying it—but the way you've been posting is unintentionally contributing to that outcome.)

I'm sure you wouldn't drop lit matches in a dry forest, or dump motor oil in a mountain lake, trample flower gardens, or litter in a city park, for much the same reason. It's in your own interest to practice the same care for the commons here. Thanks for listening.


Thanks for the effort of explanation.

If someone were egregiously out of line, typically, I feel community sentiment reflects this.

Personally, I feel your assessment of cognitive bias at play is way off base. I don't think it's a valid comparison to claim that someone is causing "damage" by merely expressing distaste. That's a common tool that humans use for social feedback. Is cutting off the ability for genuine social feedback or adjustment and forcing people to be saccharine out of fear of reprisal from the top really an optimal solution to an engineering problem? It seems more like a simulacrum of an HR department where the guillotine is more real: your job and life rather than merely your ability to share your thoughts on a corner of the Internet.

Think about the engineering problem you find yourself in with this state of affairs: something very similar to the kind of content you might find on LinkedIn, a sort of circular back-patting engine devoid of real challenge and grit because of the aforementioned guillotine under which all participants hang.

And, quite frankly, you do see the effects of this in precisely the post in this initial exchange: hyperbole and lack of deep critical assessment are artificially inflated. This isn't a coincidence: this has been cultured very specifically by the available growing conditions and the starter used -- saccharine hall monitors that fold like cheap suits (e.g. very poorly, lots of creases) when the lowest level of social challenge is raised fo their ideas.

You know what it really feels like? A Silicon Valley reconstruction of all the bad things about a workplace, not a genuine forum for debate and intellectual exploration. If you want to find a place to model such behavior, the Greeks already have you figured out - how do you think Diogenes would feel about human resources?

That being said, I appreciate the empathy.

Obviously, I feel a bit like a guy Tony Soprano beat up and being forced to apologize afterwards to him for bruising his knuckles.


Not to belabor the point but from my perspective you've illustrated the point about cognitive bias: it always feels like the other person started it and did worse ("I feel a bit like a guy Tony Soprano beat up and being forced to apologize afterwards to him for bruising his knuckles") and it always feels like one was merely defending oneself reasonably ("merely expressing distaste"). This is the asymmetry I'm talking about.

As you can imagine, mods get this kind of feedback all the time from all angles. The basic learning it adds up to is that everybody always feels this way. Therefore those feelings are not a reliable compass to navigate by.

This is not a criticism—I appreciate your reply!

Edit:

> forcing people to be saccharine [...] like a simulacrum of an HR department

We definitely don't want that and the site guidelines don't call for that. There is tons of room to make your substantive points thoughtfully without being saccharine. It can take a little bit of reflective work to find that room, though, just because we (humans in general) tend to get locked into binary oppositions.

The best principle to go by is just to ask yourself: is what I'm posting part of a curious conversation? That's the intended spirit of the site. It's possible to tell if you (I don't mean you personally, I mean all of us) are functioning in the range of curiosity and to refrain from posting if you aren't.

It is true that the HN guidelines bring a bit of blandness to discourse because they eliminate the rough-and-tumble debate that can work well in much smaller groups of close peers. But that's because that kind of debate is impossible in a large public forum like HN—it just degenerates immediately into dumb brawls. I've written about this quite a bit if you or anyone wants to read about that:

https://hn.algolia.com/?dateRange=all&page=0&prefix=true&que... (I like that analogy)

https://hn.algolia.com/?dateRange=all&page=0&prefix=true&que...


I think your argument is reasonable from a logical perspective, and I would generally make a similar argument as I would find the template quite persuasive.

However, I, again, feel you're improperly pushing shapes into the shape-board again. Of course, understanding cognitive bias is a fantastic tool to improve human behavior from an engineering perspective, and your argumentum ad numerum is sound.

That being said, you're focusing too much on what my emotional motivation might be rather than looking at the system - do you really think there isn't an element of that dynamic I outlined in an interaction like this? Of course there is.

Anyhow, you know, I don't have the terminology in my back-pocket, but there's definitely a large blind-spot when someone is ignoring the spirit of intellectual curiosity in a positive light rather than a negative one.

In this case, don't you think a tool like mild negative social feedback might be a useful mechanism? Of course, there's a limit, and if such a person were incapable of further insight, they'd probably not be very useful conversants. That's obviously not happening here.

One final thing is relevant here - you just hit on a pretty important point. There is a grit to a certain type of discourse that is actually superior to this discourse, I'd happily accept that point. Why not just transfer the burden of moderation to that point, rather than what you perceive to be the outset? Surely, you'll greatly reduce your number of false positives.

I provide negative social feedback sometimes because I feel it's appropriate. In the future, I probably won't. That being said, it's obvious that I've never sparked a thoughtless brawl, so the tolerance is at least inappropriately adjusted sufficiently to that extent.


What’s your problem? There’s nothing overhyped about that comment. People, including me, are building complex agents that can execute multi stage prompts and perform complex tasks. Comparing these first models to a basic unit of logic is more than fair given how much more capable they are. Do you just have an axe to grind?


[flagged]


How is it inappropriate? How is it not building?


After reading the docs for the new ChatGPT function calling yesterday, it's structured and/or typed data for GPT input or output that's the key feature of these new models. The ReAct flow of tool selection that it provides is secondary.

As this post notes, you don't even need the full flow of passing a function result back to the model: getting structured data from ChatGPT in itself has a lot of fun and practical use cases. You could coax previous versions of ChatGPT to "output results as JSON" with a system prompt, but in practice results are mixed, although even with this finetuned model the docs warn that there still could be parsing errors.

OpenAI's demo for function calling is not a Hello World, to put it mildly: https://github.com/openai/openai-cookbook/blob/main/examples...


IIRC, there's a way to "force" LLMs to output proper JSON by adding some logic to the top token selection. I.e. in the randomness function (which OpenAI calls temperature) you'd never choose a next token that results in broken JSON. The only reason it wouldn't would be if the output exceeds the token limit. I wonder if OpenAI is doing something like this.


Note that you don’t necessarily need to have the AI output any JSON at all — simply have it answer when being asked for the value to a specific JSON key, and handle the JSON structure part in your own, hallucination-free code: https://github.com/manuelkiessling/php-ai-tool-bridge
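
A rough Python analogue of the idea (not the linked library's API): ask for one value at a time and keep the structure entirely in your own code.

    import openai

    def fill_template(document, questions):
        result = {}
        for key, question in questions.items():
            reply = openai.ChatCompletion.create(
                model="gpt-3.5-turbo",
                messages=[{
                    "role": "user",
                    "content": f"{question}\nAnswer with the bare value only.\n\n{document}",
                }],
            )
            result[key] = reply["choices"][0]["message"]["content"].strip()
        return result  # the JSON shape never depends on model output

    contact = fill_template(email_text, {   # email_text: whatever you're extracting from
        "name": "What is the sender's full name?",
        "company": "What company does the sender work for?",
    })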


Would be nice if you could send a back and forth interaction for each key. This approach turns into lots of requests that reapply the entire context and ends up slow. I wish I could just send a Microsoft guidance template program, and process that in a single pass.


Thanks for sharing!


It would seem not, as the official documentation mentions the arguments may be hallucinated or be a malformed JSON.

(unless they mean the JSON syntax is valid but may not conform to the schema; they're unclear on that).


For various reasons, token selection may be implemented as upweighting/downweighting instead of outright ban of invalid tokens. (Maybe it helps training?) Then the model could generate malformed JSON. I think it is premature to infer from "can generate malformed JSON" that OpenAI is not using token selection restriction.


the linked article hypothesizes:

> I assume OpenAI’s implementation works conceptually similar to jsonformer, where the token selection algorithm is changed from “choose the token with the highest logit” to “choose the token with the highest logit which is valid for the schema”.


Note that this (token selection restriction) is even available on OpenAI API as logit_bias.


But only for the whole generation. So if you want to constrain things one token at a time (as you would to force things to follow a grammar) you have to make fresh calls and only request one token which makes things more or less impractical if you want true guarantees. A few months ago I built this anyway to suss out how much more expensive it was [1]

[1] https://github.com/newhouseb/clownfish#so-how-do-i-use-this-...


I think the problem is that tokens are not characters. So even if you had access to a JSON parser state that could tell you whether or not a given character is valid as the next character, I am not sure how you would translate that into tokens to apply the logit biases appropriately. There would be a great deal of computation required at each step to scan the parser state and generate the list of prohibited or allowable tokens.

But if one could pull this off, it would be super cool. Similar to how Microsoft’s guidance module uses the logit_bias parameter to force the model to choose between a set of available options.


You simply sample tokens starting with the allowed characters and truncate if needed. It’s pretty efficient, there’s an implementation here: https://github.com/1rgs/jsonformer


This is the best implementation I've seen, but only for Hugging Face models: https://github.com/1rgs/jsonformer


How would a tweaked temp enforce a non broken output exactly?


It's not temperature, but sampling. Output of LLM is probabilistic distribution over tokens. To get concrete tokens, you sample from that distribution. Unfortunately, OpenAI API does not expose the distribution. You only get the sampled tokens.

As an example, on the link JSON schema is defined such that recipe ingredient unit is one of grams/ml/cups/pieces/teaspoons. LLM may output the distribution grams(30%), cups(30%), pounds(40%). Sampling the best token "pounds" would generate an invalid document. Instead, you can use the schema to filter tokens and sample from the filtered distribution, which is grams(50%), cups(50%).
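
The filtering step in that example, spelled out (hypothetical probabilities):

    allowed_units = {"grams", "ml", "cups", "pieces", "teaspoons"}
    raw = {"grams": 0.30, "cups": 0.30, "pounds": 0.40}   # model's (assumed) distribution

    filtered = {t: p for t, p in raw.items() if t in allowed_units}
    total = sum(filtered.values())
    dist = {t: p / total for t, p in filtered.items()}    # {'grams': 0.5, 'cups': 0.5}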


Not traditional temperature, maybe the parent worded it somewhat obtusely. Anyway, to disambiguate...

I think it works something like this: You let something akin to a json parser run with the output sampler. First token must be either '{' or '['; then if you see [ has the highest probability, you select that. Ignore all other tokens, even those with high probability.

Second token must be ... and so on and so on.

Guarantee for non-broken (or at least parseable) json


What's the implication of this new change for Microsoft Guidance, LMQL, Langchain, etc.? It looks like much of their functionality (controlling model output) just became obsolete. Am I missing something?


If anything this removes a major roadblock for libraries/languages that want to employ LLM calls as a primitive, no? Although, I fear the vendor lock-in intensifies here, also given how restrictive and specific the Chat API is.

Either way, as part of the LMQL team, I am actually pretty excited about this, also with respect to what we want to build going forward. This makes language model programming much easier.


> Although, I fear the vendor lock-in intensifies here,

The openAI API is super simple - any other vendor is free to copy it, and I'm sure many will.


`Although, I fear the vendor lock-in intensifies here, also given how restrictive and specific the Chat API.`

Eh, would be pretty easy to write a wrapper that takes a functions-like JSON Schema object and interpolates it into a traditional "You MUST return ONLY JSON in the following format:" prompt snippet.
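
Sketch of such a wrapper (same functions-style schema in, vendor-neutral prompt snippet out):

    import json

    def schema_to_prompt(fn):
        return (
            f"You are acting as the function `{fn['name']}`: {fn['description']}\n"
            "You MUST return ONLY JSON matching this JSON Schema, nothing else:\n"
            + json.dumps(fn["parameters"], indent=2)
        )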


Langchain added support for `function_call` args yesterday:

* https://github.com/hwchase17/langchain/pull/6099/files

* https://github.com/hwchase17/langchain/issues/6104

IMHO, this should make Langchain much easier and less chaotic to use.


It's only been added to the OpenAI interface. Function calling is really useful when used with agents. To include that to agents would require some redesign as the tool instructions should be removed from the prompt templates in favor of function definitions in the API request. The response parsing code would also be affected.

I just hope they won't come up with yet another agent type.



LangChain is a perpetual hackathon.


They have something closer to a simple Hello World example here:

https://platform.openai.com/docs/guides/gpt/function-calling

That example needs a bit of work I think. In Step 3, they're not really using the returned function_name; they're just assuming it's the only function that's been defined, which I guess is equivalent for this simple example with just one function but less instructive. In Step 4, I believe they should also have sent the function definition block again a second time since model calls in the API are memory-less and independent. They didn't, although the model appears to guess what's needed anyway in this case.
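
A minimal round trip with both nits addressed might look like this (a sketch assuming the 0613 models, the pre-1.0 openai SDK, and the docs' get_current_weather example as the stand-in implementation):

    import json
    import openai

    AVAILABLE = {"get_current_weather": get_current_weather}  # your real implementations

    msg = openai.ChatCompletion.create(
        model="gpt-3.5-turbo-0613", messages=messages, functions=functions
    )["choices"][0]["message"]

    if msg.get("function_call"):
        name = msg["function_call"]["name"]                   # use what the model chose
        args = json.loads(msg["function_call"]["arguments"])  # validate these in real code
        result = AVAILABLE[name](**args)

        messages += [msg, {"role": "function", "name": name, "content": json.dumps(result)}]
        followup = openai.ChatCompletion.create(
            model="gpt-3.5-turbo-0613", messages=messages,
            functions=functions,  # calls are stateless, so send the definitions again
        )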


That SQL example is going to result in a catastrophe somewhere when someone uses it in their project. It is encouraging something very dangerous when allowed to run on untrusted inputs.


Marvin Minsky was so damn far ahead of his time with Society of Mind.

Engineering of cognitively advanced multiagent systems will become the area of research of this century / multiple decades.

GPT-GPT > GPT-API in terms of power.

The space of possible combinations of GPT multiagents goes beyond imagination, since even GPT-4 on its own already does.

Multiagent systems are best modeled with signal theory, graph theory and cognitive science.

Of course "programming" will also play a role, in sense of abstractions and creation of systems of / for thought.

Signal theory will be a significant approach for thinking about embedded agency.

Complex multiagent systems approach us.


Makes me think of the Freud/Jungian notions of personas in us that are in various degrees semi-autonomously looking out for themselves. The “angry” agent, the “child” agent, so on.


Building agents that use advanced APIs was not really practical until now. Things like Langchain's Structured Agents worked somewhat reliably, but due to the massive token count it was so slow, the experience was _never_ going to be useful.

Due to this, the speed at which our agent processes results has improved 5-6 times, and it does actually do a pretty good job of keeping to the schema.

One problem that is not resolved yet is that it still hallucinates a lot of attributes. For example, we have a tool that allows it to create contacts in the user's CRM. I ask it to:

"Create contacts for the top 3 Barcelona players."

It creates a structure like this:

1. Lionel Messi - Email: lionel.messi@barcelona.com - Phone Number: +1234567890 - Tags: Player, Barcelona

2. Gerard Pique - Email: gerard.pique@barcelona.com - Phone Number: +1234567891 - Tags: Player, Barcelona

3. Marc-Andre ter Stegen - Email: marc-terstegen@barcelona.com - Phone Number: +1234567892 - Tags: Player, Barcelona

And you can see it hallucinated email addresses and phone numbers.


ChatGPT can be useful for many things, but you really should not use it if you want to retrieve factual data. This might partly be resolved by querying the internet like Bing does, but purely on the language model side these hallucinations are just an unavoidable part of it.


Yep, the answer is always to have it write the code / query / function / whatever you need, then parse that and retrieve the data from an external system.


I would never rely on an LLM as a source of such information, just as I wouldn't trust the general knowledge of a human being used as a database. Does your workflow include a step for information search? With the new json features, it should be easy to instruct it to perform a search or directly feed it the right pages to parse.


For those who want to test out the LLM as API idea, we are building a turnkey prompt to API product. Here's Simon's recipe maker deployed in a minute: https://preview.promptjoy.com/apis/1AgCy9 . Public preview to make and test your own API: https://preview.promptjoy.com


This is really cool, I had a similar idea but didn't build it. I was also thinking a user could take these different prompts (I called them tasks) that anyone could create, and then connect them together like a node graph or visual programming interface, with some Chat-GPT middleware that resolves the outputs to inputs.


Congrats on the first-time user experience, I could experiment with your API in a few seconds, and the product is sleek!


This is cool! Are you using one-shot learning under the hood with a user provided example?


BTW: Here's a more performant version (fewer tokens) https://preview.promptjoy.com/apis/jNqCA2 that uses a smaller example but will still generate pretty good results.


This is still pretty fast - impressive! Are there any tricks you're doing to speed things up?


Thanks. We find few-shot learning to be more effective overall. So we are generating additional examples from the provided example.


I own this domain: prompts.run. Do you want it?


The JSON schema not counting toward token usage is huge, that will really help reduce costs.


> Under the hood, functions are injected into the system message in a syntax the model has been trained on. This means functions count against the model's context limit and are billed as input tokens. If running into context limits, we suggest limiting the number of functions or the length of documentation you provide for function parameters.


I believe functions do count in some way toward the token usage; but it seems to be in a more efficient way than pasting raw JSON schemas into the prompt. Nevertheless, the token usage seems to be far lower than previous alternatives, which is awesome!


But it does count toward token usage. And they picked JSON schema which is like 6x more verbose than typescript for defining the shape of json.


That is up in the air and needs more testing. Field descriptions, for example, are important but extraneous input that would be tokenized and count in the costs.

At least for ChatGPT, input token costs were cut by 25%, so it roughly evens out.


I'm wondering if introducing a system message like "convert the resulting json to yaml and return the yaml only" would adversely affect the optimization done for these models. The reason is that yaml uses significantly fewer tokens compared to json. For the output, where data type specification or adding comments may not be necessary, this could be beneficial. From my understanding, specifying functions in json now uses fewer tokens, but I believe the response still consumes the usual amount of tokens.
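
For anyone who wants to measure the difference on their own payloads, a small sketch with tiktoken (assuming the cl100k_base encoding used by the gpt-3.5/gpt-4 chat models; the sample data is made up):

  import json
  import tiktoken
  import yaml  # PyYAML

  enc = tiktoken.get_encoding("cl100k_base")
  data = {"dish": "spaghetti bolognese", "servings": 4,
          "ingredients": ["spaghetti", "ground beef", "tomato sauce"]}

  as_json = json.dumps(data, separators=(",", ":"))   # compact JSON
  as_yaml = yaml.safe_dump(data, sort_keys=False)     # block-style YAML

  print("json tokens:", len(enc.encode(as_json)))
  print("yaml tokens:", len(enc.encode(as_yaml)))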


I think one should not underestimate the impact on downstream performance the output format can have. From a modelling perspective it is unclear whether asking/fine-tuning the model to generate JSON (or YAML) output is really lossless with respect to the raw reasoning powers of the model (e.g. it may perform worse on tasks when asked/trained to always respond in JSON).

I am sure they ran tests on this internally, but I wonder what the concrete effects are, especially comparing different output formats like JSON, YAML, different function calling conventions and/or forms of tool discovery.


That's what I'm doing. I ask ChatGPT to return inline yaml (no wasting tokens on line breaks), then I parse the yaml output into JSON once I receive it. A bit awkward but it cuts costs in half.
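
Roughly like this (a sketch, assuming the model cooperates and returns flow-style YAML):

  import json
  import yaml  # PyYAML

  # e.g. the model was asked for inline (flow-style) YAML to save tokens
  raw = "{dish: spaghetti bolognese, servings: 4, vegetarian: false}"

  parsed = yaml.safe_load(raw)   # flow-style YAML is close to JSON anyway
  print(json.dumps(parsed))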


This is useful, but for me at least, GPT-4 is unusable because it sometimes takes 30 seconds + to reply to even basic queries.


Also the rate limit is pretty bad if you want to release any type of app


More importantly: there's a waiting list.

Also, if you want to use both the ChatGPT web app and the API, you'll be billed for both separately. They really should be unified and billed under a single account. The difference is literally just whether there's a "web UI" on top of the API... or not.


It works pretty well. You define a few “functions” and enter a description of what each does; when the user prompts, it will understand the prompt and tell you which “function” it likely should use, which is just the function name. I feel like this is a new way to program, a sort of fuzzy-logic type of programming.
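
Rough sketch of that flow with the current Python client (the get_weather function here is made up):

  import json
  import openai

  functions = [{
      "name": "get_weather",
      "description": "Get the current weather for a city",
      "parameters": {
          "type": "object",
          "properties": {"city": {"type": "string"}},
          "required": ["city"],
      },
  }]

  resp = openai.ChatCompletion.create(
      model="gpt-3.5-turbo-0613",
      messages=[{"role": "user", "content": "Do I need an umbrella in Oslo?"}],
      functions=functions,
      function_call="auto",  # let the model decide whether to call anything
  )

  msg = resp["choices"][0]["message"]
  if msg.get("function_call"):
      name = msg["function_call"]["name"]                    # which function it picked
      args = json.loads(msg["function_call"]["arguments"])   # its JSON arguments
      # ...run your own deterministic code here, then send the result back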


> fuzzy logic

Yes and no. While the choice of which function to call is dependent on an llm, ultimately, you control the function itself whose output is deterministic.

Even today, given an api, people can choose to call or not call based on some factor. We don’t call this fuzzy logic. E.g., people can decide to sell or buy stock through an api based on some internal calculations - doesn’t make the system “fuzzy”.


If you feed that result into another IO box you may or may not know if it is the correct answer, which may need some sort of error detection. I think this is going to be the majority of the use cases.


Hm, I see what you mean. Afaict, only the decision to call or not call a function is up to the model (fuzzy). Once it decides to call the function, it generates mostly correct JSON based on your schema and returns that to you as is (not very fuzzy).

It’ll be interesting to test APIs which accept user inputs. Depending on how ChatGPT populates the JSON, the API could be required to understand/interpret/respond to lots of variability in inputs.


Yeah, I’ve tested it. You should use the curl example they gave, since you can test it instantly by pasting it into your terminal. The descriptions of the functions are prompt engineering in addition to the original system prompt; I need to test that interplay more, it’s so new.


Glad we didn't get too far into adopting something like Guardrails. This sort of kills its main value prop for OpenAI.

https://shreyar.github.io/guardrails/


i mean only at the most superficial level. she has a ton of other validators that aren't superseded (e.g. SQL is validated by branching the database - we discussed it on our pod https://www.latent.space/p/guaranteed-quality-and-structure)


yeah, listened to the pod (that's how I found out about guardrails!).

fair point, I should have said: "value prop for our use case"... the thing I was most interested in was how well Guardrails structured output.


haha excellent. i was quite impressed by her and the vision for guardrails. thanks for listening!


Luckily it's for LLMs, not openai


Guardrails is an awesome project and will continue to be even after this.


Is there a decent way of converting to a structure with a very constrained vocabulary? For example, given some input text, converting it to something like {"OID-189": "QQID-378", "OID-478":"QQID-678"}. Where OID and QQID dictionaries can be e.g. millions of different items defined by a description. The rules for mapping could be essentially what looks closest in semantic space to the descriptions given in a dictionary.

I know this should be solvable with local LLMs and BERT cosine similarity (it isn't exactly that, but it's a start on the idea), but is there a way to do this with decoder models rather than encoder models plus other logic?


You can train custom GPT 3 models, and Azure now has vector database integration for GPT-based models in the cloud. You can feed it the data, and ask it for the embedding lookup, etc...

You can also host a vector database yourself and fill it up with the embeddings from the OpenAI GPT 3 API.
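
A minimal sketch of that embedding-lookup idea with the OpenAI API (the dictionary contents are made up, and at millions of entries you'd want a real vector index rather than this brute-force scan):

  import numpy as np
  import openai

  def embed(texts):
      resp = openai.Embedding.create(model="text-embedding-ada-002", input=texts)
      return np.array([d["embedding"] for d in resp["data"]])

  # Hypothetical OID dictionary: id -> description
  oid_descriptions = {"OID-189": "customer billing address",
                      "OID-478": "primary contact email"}
  oid_ids = list(oid_descriptions)
  oid_vecs = embed(list(oid_descriptions.values()))

  def closest_oid(text):
      q = embed([text])[0]
      sims = oid_vecs @ q / (np.linalg.norm(oid_vecs, axis=1) * np.linalg.norm(q))
      return oid_ids[int(np.argmax(sims))]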


Unfortunately this doesn't really work, as the model is not limited in its decoding vocabulary.

Does anyone have other suggestions that may work in this space?


The way OpenAI implemented this is really clever, beyond how neat the plugin architecture is: it lets them peek one layer inside your internal API surface and infer what you intend to do with the LLM output. They're collecting some good data here.


huh, i never thought of it that way. i thought openai pinky swears not to train on our data tho


I was thinking more as data for their product org


Recent and related:

Function calling and other API updates - https://news.ycombinator.com/item?id=36313348 - June 2023 (154 comments)


IMO this isn't a dupe and shouldn't be penalized as a result.


It's certainly not a dupe. It looks like a follow-up though. No?


More of a very timely but practical demo.


Ok, thanks!


OpenAI integration is going to be a goldmine for criminals in the future.

Everyone and their momma is gonna start passing poorly validated/sanitized client input to shared sessions of a non-deterministic function.

I love the future!


In the “future”?


Nice to have an endpoint which takes care of this. I've been doing this manually, it's a fairly simple process:

* Add "Output your response in json format, with the fields 'x', which indicates 'x_explanation', 'z', which indicates 'z_explanation' (...)" etc. GPT-4 does this fairly reliably.

* Validate the response, repeat if malformed (a rough sketch of this loop is below).

* Bam, you've got a json.

I wonder if they've implemented this endpoint with validation and carefully crafted prompts on the base model, or if this is specifically fine-tuned.
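
For reference, the validate-and-retry step above can be as simple as this (a sketch; model and prompts are placeholders):

  import json
  import openai

  def ask_for_json(prompt, retries=3):
      messages = [{"role": "user", "content": prompt}]
      for _ in range(retries):
          resp = openai.ChatCompletion.create(model="gpt-4", messages=messages,
                                              temperature=0)
          text = resp["choices"][0]["message"]["content"]
          try:
              return json.loads(text)
          except json.JSONDecodeError:
              # tell the model what went wrong and try again
              messages.append({"role": "assistant", "content": text})
              messages.append({"role": "user",
                               "content": "That was not valid JSON. Reply with only the JSON object."})
      raise ValueError("no valid JSON after retries")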


It appears to be fine-tuning:

"These models have been fine-tuned to both detect when a function needs to be called (depending on the user’s input) and to respond with JSON that adheres to the function signature."

https://openai.com/blog/function-calling-and-other-api-updat...


I will experiment with this at the weekend. One thing I found useful when supplying a JSON schema in the prompt was that I could supply inline comments and tell it when to leave a field null, etc. I found that much more reliable than describing these nuances elsewhere in the prompt. Presumably I can't do this with functions, but maybe I'll be able to work around it in the prompt (particularly now that I have more room to play with).


Running an LLM every time someone clicks on a button is expensive and slow in production, but probably still ~10x cheaper to produce than code.


New techniques like semantic caching will help. This is the modern era's version of building a performant social graph.


What's semantic caching?


With LLMs, the inputs are highly variable so exact match caching is generally less useful. Semantic caching groups similar inputs and returns relevant results accordingly. So {"dish":"spaghetti bolognese"} and {"dish":"spaghetti with meat sauce"} could return the same cached result.


Or store it as a sentence embedding and calculate the vector distance, but that creates many edge cases.
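
A toy sketch of the embedding-based variant (the 0.95 threshold is arbitrary, and brute-force search won't scale, but it shows the idea):

  import numpy as np
  import openai

  class SemanticCache:
      """Reuse a cached answer when a new query embeds close enough to an old one."""
      def __init__(self, threshold=0.95):
          self.threshold = threshold
          self.keys, self.values = [], []

      def _embed(self, text):
          resp = openai.Embedding.create(model="text-embedding-ada-002", input=[text])
          v = np.array(resp["data"][0]["embedding"])
          return v / np.linalg.norm(v)

      def get(self, query):
          if not self.keys:
              return None
          sims = np.array(self.keys) @ self._embed(query)  # cosine; vectors are normalized
          i = int(np.argmax(sims))
          return self.values[i] if sims[i] >= self.threshold else None

      def put(self, query, value):
          self.keys.append(self._embed(query))
          self.values.append(value)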


Just this morning I wrote a JSON object. I told GPT to turn it into a schema. I tweaked that and then gave a list of terms for which I wanted GPT to populate the schema accordingly.

It worked pretty well without any functions, but I did feel like I was missing something because I was ready to be explicit and there wasn’t any way for me to tell that to GPT.

I look forward to trying this out.


It's a shame they couldn't use yaml, instead. I compared them and yaml uses about 20% fewer tokens. However, I can understand accuracy, derived from frequency, being more important than token budget.


I think YAML actually uses more tokens than JSON without indents, especially with deep data. For example "," being a single token makes JSON quite compact.

You can compare JSON and YAML on https://platform.openai.com/tokenizer


I would imagine JSON is easier for a LLM to understand (and for humans!) because it doesn't rely on indentation and confusing syntax for lists, strings etc.


It's a lot more straightforward to use JSON programmatically than YAML.


If you are using any kind of type checking instead of blindly trusting generated json it's exactly the same amount of work.


It really shouldn't be, though. I.e. not unless you're parsing or emitting it ad-hoc, for example by assuming that an expression like:

  "{" + $someKey + ":" + $someValue + "}"
produces a valid JSON. It does - sometimes - and then it's indeed easier to work with. It'll also blow up in your face. Using JSON the right way - via a proper parser and serializer - should be identical to using YAML or any other equivalent format.
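
To make it concrete, a trivial example of where the hand-rolled version falls over:

  import json

  some_key, some_value = "note", 'she said "hi"'

  handrolled = "{" + some_key + ":" + some_value + "}"
  # -> {note:she said "hi"}   ...not JSON: unquoted key, unescaped quotes

  proper = json.dumps({some_key: some_value})
  # -> {"note": "she said \"hi\""}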


Even if the APIs for both were equally simple, modules for manipulating json are way more likely to be available in the stdlib of whatever language you’re using.


JSON can be minified.


This was technically possible before. I think the approach used by many - myself included - is to simply embed the result in a markdown code block and then match it with a regex pattern. Then you just need to phrase the prompt to generate the desired output.

This is an example of that, generating the arguments for MongoDB's `db.runCommand()` function: https://aihelperbot.com/snippets/cliwx7sr80000jj0finjl46cp
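
The extraction step is small (a sketch; it assumes the model wrapped its answer in a fenced block):

  import json
  import re

  def extract_json_block(text):
      # grab the contents of the first ``` fence, with or without a "json" language tag
      m = re.search(r"```(?:json)?\s*(.*?)```", text, re.DOTALL)
      return json.loads(m.group(1) if m else text)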


In the OpenAI blog post they mention "Convert “Who are my top ten customers this month?” to an internal API call", but I'm assuming they mean GPT will respond with structured JSON (which we define via a schema in the function prompt) that we can use to more easily programmatically make that API call?

I could be confused but I'm interpreting this function calling as "a way to define structured input and selection of function and then structured output" but not the actual ability to send it arbitrary code to execute.

Still amazing, just wanting to see if I'm wrong on this.


This does not execute code!


Ok, yea this makes sense. Also for others curious of the flow here's a video walkthrough I just skimmed through: https://www.youtube.com/watch?v=91VVM6MNVlk


> The process is simple enough that you can let non-technical people build something like this via a no-code interface. No-code tools can leverage this to let their users define “backend” functionality.

> Early prototypes of software can use simple prompts like this one to become interactive. Running an LLM every time someone clicks on a button is expensive and slow in production, but probably still ~10x cheaper to produce than code.

Hah wow... no. Definitely not.


Has anyone tried throwing their backend Swagger at this and having ChatGPT perform user story tests?


Wouldn't this be possible with a solution like Guidance, where you have a pre-structured JSON format ready to go and all you need is the text: https://github.com/microsoft/guidance


Can I use this to make it reliably output code (say JavaScript)? I haven't managed to do it with just prompt engineering as it will still add explanations, apologies and do other unwanted things like splitting the code into two files as markdown.


Here’s an approach to return just JavaScript:

https://github.com/williamcotton/transynthetical-engine

The key is the addition of few-shot exemplars.


Here's a demo of some system prompt engineering which resulted in better results for the older ChatGPT: https://github.com/minimaxir/simpleaichat/blob/main/examples...

Coincidentally, the new gpt-3.5-turbo-0613 model also has better system prompt guidance: with the demo above and some further prompt tweaking, it's possible to get ChatGPT to output code super reliably.


Not this, but using the token-selection-restriction approach, you can get an LLM to produce output that conforms to an arbitrary formal grammar completely reliably. JavaScript, Python, whatever.


Did people really struggle with getting JSON output from GPT-4? You can literally do it zero-shot by just saying "match this TypeScript type".

GPT3.5 would output perfect JSON with a single example.

I have no idea why people are talking about this like it’s a new development.


Unfortunately, in practice that works only most of the time. At least in our experience (and the article says something similar) sometimes ChatGPT would return something completely different when JSON-formatted response would be expected.


I've been using the same prompts for months and have never seen this happen on 3.5-turbo let alone 4.

https://gist.github.com/BLamy/244eec016beb9ad8ed48cf61fd2054...


In my experience if you set the temperature to zero it works 99.9% of the time, and then you can just add retry logic for the remaining 0.1%


It’s pretty interesting how the work they’ve been doing on plugins has fed into this.

I suspect that they’ve managed to get a lot of good training data by calling the APIs provided by plugins and detecting when it’s gone wrong from bad request responses.


I'm trying to experiment with the API but the response time is always in the 15-25second range. How are people getting any interesting work done with it?

I see others on the OpenAI dev forum complaining about this too, but no resolution.


I pass a Kotlin data class and ask ChatGPT to return JSON which can be parsed by that class. It reduces errors with date-time parsing and other formatting issues, and takes up fewer tokens than the approach in the article.


I thought GPT-4 was doing a pretty good job at outputting JSON (for some of the toy problems I've given it like some of my gardening projects.) Interesting to see this hit the very top of HN


I've used GCP Vertex AI for a specific task and the prompt was to generate a JSON response with keys specified and it does generate the result as JSON with said keys.


The issue is that it's not guaranteed, unlike this new OpenAI feature. Personally, I've found Vertex AI's JSON output to be not so great; it often uses single quotes in my experience. But maybe you have figured out the right prompts? I'd be interested in what you use if so.


Is it possible to fine-tune with custom data to output JSON?


That's not the current OpenAI recipe. Their expectation is that your custom data will be retrieved via a function/plugin and then be subsequently processed by a chat model.

Only the older completion models (davinci, curie, babbage, ada) are available for fine-tuning.


Here is code (with several examples) that takes it a couple of steps further by validating the output JSON against a pydantic model and providing feedback to the LLM when it gets either of those wrong:

https://github.com/jiggy-ai/pydantic-chatcompletion/blob/mas...
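
The core of that pattern looks roughly like this (a sketch with a pydantic v1-style API, not the linked repo's actual code; the Recipe model and prompts are placeholders):

  import openai
  from pydantic import BaseModel, ValidationError

  class Recipe(BaseModel):
      name: str
      minutes: int
      ingredients: list[str]

  def chat_to_model(prompt, model_cls, retries=3):
      messages = [{"role": "user", "content":
                   f"{prompt}\nRespond with JSON matching this schema:\n{model_cls.schema_json()}"}]
      for _ in range(retries):
          resp = openai.ChatCompletion.create(model="gpt-3.5-turbo-0613", messages=messages)
          text = resp["choices"][0]["message"]["content"]
          try:
              return model_cls.parse_raw(text)   # validates both the JSON and the types
          except ValidationError as e:
              messages.append({"role": "assistant", "content": text})
              messages.append({"role": "user",
                               "content": f"That failed validation:\n{e}\nReturn corrected JSON only."})
      raise ValueError("model never produced valid output")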


We're not far from writing a bunch of stubs and querying GPT at startup to resolve the business logic. I guess we're going to need a new JAX-RS soon.


Actually I'm looking to take GPT-4 output and create file formats like keynote presentations, or pptx. Is that currently possible with some tools?


I would recommend creating a simplified JSON schema for the slides (say, a presentation is an array of slides, each slide has a title, body, optional image, optional diagram, and each diagram is one of pie, table, ...). Then use a library to generate the pptx file from the generated content.
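
For example, once you have that structure, something like python-pptx can turn it into a deck (a rough sketch; the schema and content here are made up, and images/diagrams/error handling are omitted):

  from pptx import Presentation

  deck = {"slides": [
      {"title": "Q2 results", "body": "Revenue up 12%\nChurn down 1.5%"},
      {"title": "Next steps", "body": "Hire\nShip\nRepeat"},
  ]}

  prs = Presentation()
  layout = prs.slide_layouts[1]   # "Title and Content" in the default template
  for s in deck["slides"]:
      slide = prs.slides.add_slide(layout)
      slide.shapes.title.text = s["title"]
      slide.placeholders[1].text_frame.text = s["body"]
  prs.save("generated.pptx")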


Library? What library?

It seems to me that a Transformer should excel at Transforming, say, text into pptx or pdf or HTML with CSS etc.

Why don't they train it on that? So I don't have to sit there with manually written libraries. It can easily transform HTML to XML or text bullet points so why not the other formats?


I don't think the name "Transformer" is meant in the sense of "transforming between file formats".

My intuition is that LLMs tend to be good at things human brains are good at (e.g. reasoning), and bad at things human brains are bad at (e.g. math, writing pptx binary files from scratch, ...).

Eventually, we might get LLMs that can open PowerPoint and quickly design the whole presentation using a virtual mouse and keyboard but we're not there yet.


It’s just XML. They can produce HTML and transform Python into PHP etc.

So why not? It’s easy for them, no?


apparently pandoc also supports pptx

so you can tell GPT4 to output markdown, then use pandoc to convert that markdown to pptx or pdf.



I have been using gpt4 to translate natural language to JSON already. And on v4 (not v3) it hasn't returned any malformed JSON iirc


- If the only reason you're using v4 over v3.5 is to generate JSON, you can now use this API and downgrade for faster and cheaper API calls.

- Malicious user input may break your JSON (by asking GPT to include comments around the JSON, as another user suggested); this may or may not be an issue (e.g. if one user can influence other users' experience).


What if you ask it to include comments in the JSON explaining its choices?


Newbie in machine learning here. It’s crazy that this is the top post just today. I’ve been doing the intro to deep learning course from MIT this week, mainly because I have a ton of JSON files that are already classified, and want to train a model that can generate new JSON data by taking classification tags as input.

So naturally this post is exciting. My main unknown right now is figuring out which model to train my data on. An RNN, a GAN, a diffusion model?


Did you read the article? To do it with OpenAI you would just put a few output examples in the prompt and then give it a function that takes the class, where the output parameters correspond to the JSON format you want, or just a string containing JSON.

You could also fine-tune an LLM like Falcon-7b, but that's probably not necessary and has nothing to do with OpenAI.

You might also look into the OpenAI Embedding API as a third option.

I would try the first option though.


having gpt-4 as a dependency for your product or business seems... shortsighted


what if you don't have a product yet? a tied product is better than no product


as long as you walk into it with your eyes wide open



