I’ve especially noticed this with gpt-4o-mini [1], and it’s a big problem. My particular use case involves keeping a running summary of a conversation between a user and the LLM, and 4o-mini has a really bad tendency to invent details in order to hit the desired summary word limit. I didn’t see this with 4o or earlier models.
Fwiw my subjective experience has been that non-technical stakeholders tend to be more impressed with / agreeable to longer AI outputs, regardless of underlying quality. I have lost count of the number of times I’ve been asked to make outputs longer. Maybe this is just OpenAI responding to what users want?
> You may output only up to 500 words, if the best summary is less than 500 words, that's totally fine. If details are unclear, do not fill-in gaps, do leave them out of the summary instead.
I wanted to document a particular genAI antipattern which I've seen a few times now.
LLMs are theoretically pretty fungible, because you send English and get English back--but in practice you still need to do some amount of technical due diligence before swapping models. These things are benchmarked on tasks which rarely resemble your specific use case. Blindly swap models at your own risk!
Something that has become very clear since the advent of GPT-3.5 is that LLMs are far from magic, and using them does not remove the need for good engineering fundamentals. It's important to have a solid eval suite so you can quickly benchmark your system against different LLMs, because the APIs we're all building on are constant moving targets.
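A minimal sketch of what that can look like, assuming a hypothetical runTask wrapper around whichever provider SDK you use:

    // Tiny eval harness sketch. `runTask` is a hypothetical wrapper around your
    // provider SDK; swapping models means changing one string and re-running this.
    type EvalCase = { input: string; check: (output: string) => boolean };

    async function runEvalSuite(
      model: string,
      cases: EvalCase[],
      runTask: (model: string, input: string) => Promise<string>,
    ) {
      let passed = 0;
      for (const c of cases) {
        const output = await runTask(model, c.input);
        if (c.check(output)) passed += 1;
      }
      console.log(`${model}: ${passed}/${cases.length} cases passed`);
    }

Even crude checks (word counts, "must not mention X" assertions) are enough to catch regressions like the 4o-mini behaviour described above.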
Even with this setup in place you need a heightened level of caution relative to a monolith. In a monolith I can refactor function signatures however I desire because the whole service is an atomically deployed unit. Once you have two independently deployed components, that goes out the window, and you now need to be a lot more mindful when introducing breaking changes to an endpoint’s types.
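A small illustration of the difference, using a hypothetical shared response type:

    // Hypothetical response type consumed by an independently deployed frontend.
    // Inside a monolith, renaming a field is a find-and-replace; here it's a
    // breaking change until every consumer has been updated and redeployed.
    export interface GetUserResponse {
      id: string;
      displayName: string; // renaming this to `name` breaks clients still reading `displayName`
      avatarUrl?: string;  // adding an optional field is backwards-compatible
    }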
Not sure what the situation is like now, but we stopped using LangChain last year because the rate of change in the library was huge. Whenever we needed to upgrade for a new feature or bug fix we’d be ~20 versions behind and need to work through breaking changes. Eventually we decided that it was easier to just write everything ourselves.
This is from the first half of 2023 or so; maybe things are more stable now, but looks like the Python implementation is still pre-v1.
The other possibility (which is common in startups) is that often the “right way” is different depending on the scale of the system you need to design for. In cases like this you end up with technical debt a year down the line, but at the time the feature was shipped the engineering decisions made were extremely reasonable.
I’ve seen a few colleagues jump to writing off all technical debt as being inherently bad, but in cases like this it’s a sign of success and something that’s largely impossible to avoid (the EV of building for 10-100x current scale is generally negative, factoring in the risk of the business going bust). There’s a kind of entropy at play here.
Big fan of tidying things up incrementally as you go [1], because it enables teams to at least mitigate this natural degradation over time
Big fan of this. Here in New Zealand we have a slogan “be a tidy kiwi” that encourages people to pick up their litter and be good stewards of our natural environment.
Imo the same mentality is good to have in software, and I’ve always appreciated being in a team that makes codebase improvements alongside feature additions. It makes things a lot more pleasant
This already exists (in a slightly different prompt format); it's the underlying idea behind ReAct: https://react-lm.github.io
As you say, I'm skeptical this counts as AGI, although I admit that I don't have a particularly rock-solid definition of what _would_ constitute true AGI.
This sounds like an argument against TypeScript in general, no?
e.g. If I am parsing a string to a number via Number.parseInt, I don’t need a “: number” annotation because I can just call the variable “myNumber” and use that.
Branding a string is in many ways an extension on the idea of “branding” my “myNumber” variable as “: number” rather than leaving it as “: any”. Even if the TS type system is easy to bail out of, I still want the type annotations in the first place because they are useful regardless. I like reducing the number of things I need to think about and shoving responsibility off to my tools.
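A sketch of that parallel (the __brand field is just a common convention, not a built-in feature):

    declare const input: string;

    // The annotation tells the compiler this value is a number...
    const myNumber: number = Number.parseInt(input, 10);

    // ...and a branded string type extends the same idea to a specific *kind* of string.
    type UserId = string & { readonly __brand: 'UserId' };
    const userId = input as UserId;

    function getUser(id: UserId) { /* ... */ }
    getUser(userId); // OK
    // getUser(input); // compile error: a plain string is not a UserId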
Happens a lot with junction tables ime. e.g. At my last job we had three tables: user, stream, user_stream. user_stream is an N:N junction between a user and a stream
A user is free to leave and rejoin a stream, and we want to retain old data. So each user_stream has columns id, user_id, stream_id (+ others)
Issues occur when people write code along the lines of the following sketch (buildStreamUrl and the exact fields are illustrative):
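    interface UserStream {
      id: string;        // the junction row's own id
      user_id: string;
      stream_id: string; // the id of the stream it points at
    }

    function buildStreamUrl(stream: { id: string }) {
      return `/streams/${stream.id}`;
    }

    declare const userStream: UserStream;

    // Typechecks, because a user_stream row also has a string `id` -- but this
    // builds the URL from the junction row's id rather than the stream's id.
    buildStreamUrl(userStream);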
The issue is easily noticed if you name the “stream” parameter “userStream” instead, but this particular footgun came up _all_ the time in code review, and with other junction tables too. Branded types on the various id fields completely solve this mistake at design time.
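With branded id types the same call stops compiling (again a sketch; the __brand fields are just a convention):

    type StreamId = string & { readonly __brand: 'StreamId' };
    type UserStreamId = string & { readonly __brand: 'UserStreamId' };

    interface UserStream {
      id: UserStreamId;
      user_id: string;
      stream_id: StreamId;
    }

    function buildStreamUrl(stream: { id: StreamId }) {
      return `/streams/${stream.id}`;
    }

    declare const userStream: UserStream;

    // @ts-expect-error -- a UserStreamId is not a StreamId
    buildStreamUrl(userStream);
    buildStreamUrl({ id: userStream.stream_id }); // OK: the right id, enforced by the compiler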
Type-prefixed IDs are the way to go. For completeness it's worth noting that the first example using the `string | 'currentNode'` type can be slightly improved in cases where you _do_ want autocomplete for known-good values but are still OK with accepting arbitrary string values:
    type Target = 'currentNode' | (string & {});
    const targets: Target[] = [
      'currentNode', // you get autocomplete hints for this!
      'somethingElse', // no autocomplete here, but it typechecks
    ];
It's a useful hack. In JavaScript, there is no value that's both a string and an object. At runtime, it will just be a string. You can use it like a string and it will type-check, because it's a string plus some extra compile-time baggage, sort of like you subclassed the string type. ('&' builds an intersection type, which is a subtype of both operands.)
When converting something to this type, it will fail unless you cast it, but it's a compile-time cast. At runtime, there's no conversion.
This is essentially "lying" to the type checker in order to extend it.
[1] https://sophiabits.com/blog/new-llms-arent-always-better#exa...