I've written a whole series of posts about prompt injection that you might find ...

jarulraj · on May 1, 2023

Hey Simon! Thanks for sharing this. I have long admired your work on Datasette :) We will check out your posts for ideas on coping with prompt injection.

I just came across your recent post on the ChatGPT SQL function in SQLite [1]. We just added a ChatGPT-based UDF in EVA [2]. I would love to hear your thoughts on the difference between these two approaches.

Another coincidence is that EVA uses SQLite for managing structured data by default. Can EVA's SQLite database be an interesting use case for Datasette?

[1] https://simonwillison.net/2023/Apr/29/enriching-data/

[2] https://github.com/georgia-tech-db/eva/pull/655

simonw · on May 1, 2023

The approaches look pretty similar. My chatgpt() function is pretty much the most basic possible implementation of that pattern - it's just a SQLite custom SQL function written in Python.
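For readers unfamiliar with the pattern, here is a minimal sketch of a custom SQL function registered from Python with the standard `sqlite3` module. It is not the code from the linked post: `fake_chatgpt` is a hypothetical stand-in that returns a canned string instead of calling any API.

```python
import sqlite3


def fake_chatgpt(prompt):
    # Hypothetical stand-in for a real model call; no network access here.
    return f"echo: {prompt}"


conn = sqlite3.connect(":memory:")
# Register the Python callable as a one-argument SQL function named chatgpt().
conn.create_function("chatgpt", 1, fake_chatgpt)

row = conn.execute("SELECT chatgpt('hello')").fetchone()
print(row[0])
```

SQLite calls the Python function once per evaluation, so a real implementation would make one API request for each row the query touches.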

You should absolutely try pointing Datasette at that SQLite database, I imagine it would work really well!

jarulraj · on May 1, 2023

Thanks so much for sharing your thoughts! I also felt that they are pretty similar. However, I am guessing that SQLite (like most relational database systems) does not automatically cache the results of functions, perform non-trivial cost-based optimization for functions in queries, or reorder function-based predicates by the estimated cost of running them.

Edit: I have shared more details on the function-aware optimization in EVA in this post (in case you are interested) -- https://news.ycombinator.com/item?id=35764355#35773608

Sure, we will try it out and keep you posted :)

simonw · on May 1, 2023

You can cache function results yourself in Python if you want to - my implementation also sums up the tokens used by the calls to the functions.
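A rough sketch of what caching and token counting could look like inside such a function. The names `call_model`, `cache`, and `usage` are hypothetical, and the stub counts whitespace-separated words as "tokens" purely for illustration; a real API reports its own usage numbers.

```python
import sqlite3

cache = {}                          # prompt -> reply
usage = {"calls": 0, "tokens": 0}   # running totals across all queries


def call_model(prompt):
    # Hypothetical stub for a remote call: returns (reply, tokens_used).
    return prompt.upper(), len(prompt.split())


def chatgpt(prompt):
    # Only call the (stubbed) model for prompts we have not seen before.
    if prompt not in cache:
        reply, tokens = call_model(prompt)
        usage["calls"] += 1
        usage["tokens"] += tokens
        cache[prompt] = reply
    return cache[prompt]


conn = sqlite3.connect(":memory:")
conn.create_function("chatgpt", 1, chatgpt)
conn.execute("CREATE TABLE notes (body TEXT)")
conn.executemany(
    "INSERT INTO notes VALUES (?)",
    [("hello world",), ("hello world",), ("good morning to you",)],
)

rows = conn.execute("SELECT chatgpt(body) FROM notes").fetchall()
print(rows, usage)
```

The duplicate row is served from the dictionary, so the stub is invoked only for the two distinct prompts. The cache here lives in process memory and is lost when the script exits.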

Influencing optimization isn't possible using regular Python-based custom SQL functions though. I think you can influence that stuff in SQLite if you create more complex virtual table functions, but those aren't exposed through the regular Python sqlite3 module yet.

jarulraj · on May 1, 2023

Thanks for the clarifications. Token summation is a cool optimization :)

Query optimizers in SQL database systems typically optimize based on the time to execute the function on a local server. The token summation optimization generalizes time-based optimization of local functions to dollar-based optimization for remote functions.

Execution time-based optimization: time(FunctionFoo(input 1)) = 2x time(FunctionFoo(input 2))

Dollar-based optimization: cost(ChatGPT(prompt with 100 tokens)) = 2x cost(ChatGPT(prompt with 50 tokens))
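To make the dollar-based idea concrete, here is a small sketch under stated assumptions. It is not EVA's optimizer: `PRICE_PER_1K_TOKENS` is an illustrative placeholder rather than a real price, the `Predicate` fields are hypothetical, and the ordering rule (cost divided by the fraction of rows filtered out) is one common textbook heuristic for ordering expensive predicates.

```python
from dataclasses import dataclass

PRICE_PER_1K_TOKENS = 0.002  # assumed placeholder price in dollars


def dollar_cost(prompt_tokens):
    # Cost grows linearly with the number of tokens sent.
    return prompt_tokens / 1000 * PRICE_PER_1K_TOKENS


@dataclass
class Predicate:
    name: str
    cost_per_row: float  # estimated dollars per evaluated row
    selectivity: float   # fraction of rows that pass, assumed in [0, 1)


def order_predicates(predicates):
    # Evaluate predicates that are cheap and discard many rows first.
    return sorted(predicates, key=lambda p: p.cost_per_row / (1 - p.selectivity))


ratio = dollar_cost(100) / dollar_cost(50)

ordered = order_predicates([
    Predicate("chatgpt_filter", dollar_cost(100), 0.5),
    Predicate("id_filter", 0.0, 0.1),
])
print(ratio, [p.name for p in ordered])
```

With these made-up numbers the free local filter runs before the remote call, so fewer rows ever reach the predicate that costs money.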

We are also exploring dollar-based optimization in EVA, and will check out your openai-to-sqlite tool for ideas [1].

[1] https://datasette.io/tools/openai-to-sqlite