Taming randomness in ML models with hypothesis testing and marimo

axpy906 · 2024-10-19T15:32:29 1729351949

Good post. I’ve been thinking about offline testing of LLM tasks lately and have concluded that old-school statistical testing is the best option until more mature tooling is developed. Specifically, I mean running a power analysis to determine the sample size, drawing a random sample of that size, and then running a test such as a z-test to see whether there is a difference and within what bounds. Tests are expensive, and I wish there was a better way to do realizable offline evals.
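For anyone who wants to try that workflow, here is a minimal sketch using only the Python standard library. It assumes the eval metric is a binary pass/fail rate per prompt and that two variants are compared on independent random samples. The function names and the example rates are hypothetical, and the sample-size formula is the usual normal approximation for a two-sided two-proportion z-test, not anything specific to the article.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal, mean 0 and sd 1


def sample_size_two_proportions(p1, p2, alpha=0.05, power=0.8):
    """Per-group sample size for a two-sided two-proportion z-test.

    p1 and p2 are the assumed pass rates of the two variants (guesses made
    before running the eval). Uses the normal approximation.
    """
    if p1 == p2:
        raise ValueError("p1 and p2 must differ to compute a sample size")
    z_alpha = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    z_beta = _STD_NORMAL.inv_cdf(power)
    p_bar = (p1 + p2) / 2
    numerator = (
        z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    ) ** 2
    return math.ceil(numerator / (p1 - p2) ** 2)


def two_proportion_z_test(successes_a, n_a, successes_b, n_b, alpha=0.05):
    """Return (z, two-sided p-value, confidence interval for p_b - p_a)."""
    p_a = successes_a / n_a
    p_b = successes_b / n_b
    diff = p_b - p_a

    # Pooled standard error for the test statistic (null: equal rates).
    pooled = (successes_a + successes_b) / (n_a + n_b)
    se_pooled = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se_pooled == 0:
        raise ValueError("all outcomes identical; the z-test is undefined")
    z = diff / se_pooled
    p_value = 2 * (1 - _STD_NORMAL.cdf(abs(z)))

    # Unpooled standard error for the interval around the difference.
    se_unpooled = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    margin = _STD_NORMAL.inv_cdf(1 - alpha / 2) * se_unpooled
    return z, p_value, (diff - margin, diff + margin)


if __name__ == "__main__":
    # Hypothetical planning numbers: baseline passes 70%, candidate 75%.
    n = sample_size_two_proportions(0.70, 0.75)
    print("samples needed per variant:", n)

    # Hypothetical observed counts after running both variants.
    z, p, (lo, hi) = two_proportion_z_test(700, 1000, 760, 1000)
    print(f"z={z:.2f} p={p:.4f} CI=({lo:.3f}, {hi:.3f})")
```

The interval returned by the second function is the "between what bounds" part: it brackets the plausible difference in pass rates rather than only saying whether one exists. Smaller expected differences drive the required sample size up quickly, which is where the cost comes from.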

rhdunn · 2024-10-19T16:16:25 1729354585

Have you seen LLM testing tools like promptfoo?

axpy906 · 2024-10-19T22:35:24 1729377324

Yes, I have seen it and BrainTrust too. Unfortunately, I need something FOSS that works at scale without a vendor.

amarcheschi · 2024-10-19T15:19:28 1729351168

I (luckily) passed my statistics exam this summer. It's still cool to visualize what's happening, though.

qefduzh · 2024-10-19T11:34:04 1729337644

[flagged]

aduffy · 2024-10-19T13:29:40 1729344580

- Brand new burner account

- upset about “AI slop” (the image is clearly not AI)

- mentions tech buzzwords that annoy you

- claiming the article is not as rigorous as an academic paper

Perhaps I’m just old school. But I miss the HN where the best way to get upvotes was to be insightful and not to send low-effort snarky replies.

overbytecode · 2024-10-19T12:28:49 1729340929

One of those points is not like the others: Marimo’s ability to deploy a notebook as WASM is a very nice feature, imo.

smrtinsert · 2024-10-19T14:17:57 1729347477

Clearly marked as unsplash