People’s names for the four types vary, but I’m personally a fan of naming the axes “propositional vs procedural” and “informational vs developmental”, giving us a final four categories (00, 01, 10, 11) of “References”, “Instructions”, “Lessons”, and “Tutorials”. I think the applicability to LLMs clearly holds up! Though more so for advanced chatbots than HR widgets TBF — I doubt anyone is looking for developmental content from one of those.
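The quadrant scheme above is easy to sketch in code. Note the bit assignment (first bit = the informational/developmental axis, second bit = the propositional/procedural axis) is my own guess at how the 00/01/10/11 ordering lines up with the four names:

```python
# Sketch of the four-quadrant content taxonomy described above.
# Bit order is an assumption: (developmental?, procedural?).
CATEGORIES = {
    (False, False): "References",    # 00: propositional + informational
    (False, True):  "Instructions",  # 01: procedural    + informational
    (True,  False): "Lessons",       # 10: propositional + developmental
    (True,  True):  "Tutorials",     # 11: procedural    + developmental
}

def categorize(procedural: bool, developmental: bool) -> str:
    """Map the two binary axes to one of the four content categories."""
    return CATEGORIES[(developmental, procedural)]

print(categorize(procedural=True, developmental=False))  # -> Instructions
```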
That's actually a pretty interesting point. Not just evals but other components like the system prompt should also be tailored to match the expected outcome.
My experience with them doesn't quite fit either: I've primarily used LLMs to give me hints when I'm struggling with a leetcode problem or similar. They're surprisingly good at it, provided you regularly remind them to offer only small clues.
When I was doing SEO full-time, this was one of the ways we categorized content: by intent. As a result, my immediate question becomes: how long before those responses start to be subsumed by commercial-intent responses? To me, this is an inevitability. A when, not an if.
But then how do you classify a task that the LLM performed, such as a summary?
I think you are onto something here, but it really depends on what task you want the LLM to perform: search, how-to, summary, extraction, etc.