Fwiw I've spent my whole career doing data analysis, but the ease with which I was able to use OpenAI to help me with this post (am author) blew me away.
The fact that I can do this type of analysis is why I appreciate it so much. It's one of the reasons I'm convinced AI engineering will find its way into the average software engineer's remit (https://blog.lawrencejones.dev/2023/#ai), because it makes this kind of analysis far more accessible than it was before.
I still don't think it'll make devs redundant, though. Things the model can't help you with (yet, I guess):
- Providing it with clean data => I had to figure out what data to collect, write software to collect it, ship it to a data warehouse, clean it, then upload it into the model (rough sketch of that step at the end of this comment).
- Knowing what you want to achieve => it can help suggest questions to ask, but people who don't know what they want will still struggle to get results even from a very helpful assistant.
These tools are great though, and one of the main reasons I wrote this article was to convince other developers to start experimenting with them like this.
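To give a sense of the data-prep work in that first bullet, here's a minimal sketch of the warehouse-to-CSV step, assuming BigQuery as the warehouse; the project, dataset, table, and column names are invented for illustration:

```python
# Minimal sketch of pulling cleaned build data out of the warehouse as a CSV.
# Assumes BigQuery; the project/dataset/table and columns are invented.
from google.cloud import bigquery

client = bigquery.Client()

query = """
    SELECT build_id, platform, outcome, duration_seconds
    FROM `my_project.ci.builds`
    WHERE started_at >= TIMESTAMP('2023-11-01')
"""

# Run the query and write the result to a CSV, ready to upload to the assistant.
client.query(query).to_dataframe().to_csv("builds.csv", index=False)
```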
I'm a data scientist and it's my first time seeing analysis of a dataset done with prompts rather than code (e.g. Python/R/SQL). I'm slightly blown away! The plot titled 'Distribution of Builds by Platform and Outcome' looks professional and would take me 10-60 minutes using ggplot. The spacing between the text and the other graphical elements is done well and would be time-consuming (not to mention tedious) for a human to get right.
I'm wondering if we'll soon see Jupyter notebooks with R, Python, Julia, and OpenAI-assistant kernels! (The latter being human-readable plain-text instructions like the ones used in your analysis, e.g. instead of 20 lines of matplotlib or ggplot: "Show me the distribution of builds by machine platform, where the platforms are ordered from M1 to M3, and within each platform class Pro comes before Max.")
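For a sense of what that one sentence replaces, here's roughly the kind of matplotlib boilerplate I'd otherwise be writing (the CSV and its column names are invented for illustration):

```python
# Roughly the plotting code the plain-text prompt replaces; the CSV and
# its "platform"/"outcome" columns are invented for illustration.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("builds.csv")

# Order platforms M1 -> M3, with Pro before Max within each generation.
platform_order = ["M1 Pro", "M1 Max", "M2 Pro", "M2 Max", "M3 Pro", "M3 Max"]

counts = (
    df.groupby(["platform", "outcome"])
      .size()
      .unstack(fill_value=0)
      .reindex(platform_order)
)

ax = counts.plot(kind="bar", stacked=True, figsize=(10, 5))
ax.set_title("Distribution of Builds by Platform and Outcome")
ax.set_xlabel("Machine platform")
ax.set_ylabel("Number of builds")
plt.tight_layout()
plt.show()
```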
This has blown my mind.
I'm still unclear on the exact tech stack you used. If I understand correctly, the steps were:
- generate data locally,
- use an ETL tool to push data to Google BigQuery,
- use BigQuery to generate CSVs,
- give CSVs to an OpenAI assistant.
From there you asked the OpenAI assistant questions and it generated the plots? Is this understanding correct?
Last question: how many times did you have to re-submit or rewrite the prompts? Were the outputs mostly from the first attempt, or was there a fair bit of back and forth rewording the prompts?
I think we’ll definitely see AI find its way into the notebook tools. Funnily enough, you can ask the model to give you an IPython notebook of its workings if you want to move your analysis locally, so in a way it’s already there!
On the process: we’re using OpenAI’s assistants feature alongside the ‘code interpreter’ tool. This means the LLM you speak to is fine-tuned to produce Python code that can do data analysis.
You upload your files to OpenAI and make them available to the assistant. Then you speak to the LLM and ask questions about your data; it generates Python code (using pandas/numpy/etc.), runs that code in a sandbox on OpenAI infra, and pulls the results back out for the LLM to interpret.
So the plots you see are coming direct from Python code the LLM generated.
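For anyone wanting to drive this from code rather than the UI, the flow looks roughly like this with the openai Python SDK (a sketch against the beta assistants API as it stood at the time of writing; the file name, model, and prompt here are placeholders):

```python
# Rough sketch of the assistants + code interpreter flow via the openai
# Python SDK (beta assistants API, late 2023); file name, model name,
# and prompt are placeholders.
from openai import OpenAI

client = OpenAI()

# Upload the CSV so the code interpreter sandbox can read it.
data_file = client.files.create(
    file=open("builds.csv", "rb"),
    purpose="assistants",
)

# Create an assistant with the code interpreter tool and the file attached.
assistant = client.beta.assistants.create(
    name="build-analysis",
    instructions="You are a data analyst. Answer questions about the uploaded build data.",
    model="gpt-4-1106-preview",
    tools=[{"type": "code_interpreter"}],
    file_ids=[data_file.id],
)

# Ask a question in plain English on a new thread...
thread = client.beta.threads.create()
client.beta.threads.messages.create(
    thread_id=thread.id,
    role="user",
    content="Show the distribution of builds by machine platform and outcome.",
)

# ...then start a run: the assistant writes pandas/matplotlib code, executes
# it in OpenAI's sandbox, and attaches the resulting text and images to the thread.
run = client.beta.threads.runs.create(
    thread_id=thread.id,
    assistant_id=assistant.id,
)
```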
On how many times I had to resubmit: quite a few. I’d ask for a graph and it would give me what I needed, but maybe the layout was bad, so I’d say “repeat the above but show two histograms per row and colour each machine model differently”.
I was using a very recent model that’s in preview. It was a bit slow (30s to 2m to get a response sometimes) but that’s expected on a preview and this stuff will only get faster.
> it makes this analysis far more accessible than it was before
How does the average engineer verify that the result is correct? You claim (and I believe you) to be able to do this "by hand", if required. Great, but that likely means you are able to catch when the LLM makes a mistake. Any ideas on how the average engineer, without much experience in this area, should validate the results?
I mentioned this in a separate comment, but it’s worth bearing in mind how the AI pipeline works: you’re not pushing all this data into an LLM and asking it to produce graphs directly, which would be prone to some terrible errors.
Instead, you’re using the LLM to generate Python code that runs against normal libraries like Pandas and matplotlib. When it makes mistakes it’s usually generating totally the wrong graph rather than inaccurate data, and you can quickly ask it “how many X Y Z” and use that to spot-check the graphs before you proceed.
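As a concrete example of that kind of spot check, a couple of lines of pandas run locally (assuming a builds.csv export with made-up column names) give you raw counts to compare against whatever the assistant plotted:

```python
# Independent sanity check of the assistant's output; the CSV and its
# "platform"/"outcome" columns are invented for illustration.
import pandas as pd

df = pd.read_csv("builds.csv")

# Counts per platform and outcome -- compare these against the bars in
# the generated chart before trusting the rest of the analysis.
print(df.groupby("platform")["outcome"].value_counts())
```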
My initial version of this began in a spreadsheet so it’s not like you need sophisticated analysis to check this stuff. Hope that explains it!