Possibly overly cynical, but I think a lot of the people being forced to shoehorn LLMs into their applications don’t have the option of NOT using an LLM for the required use case.
Testing and validation kind of imply that it won’t ship if it isn’t fit for purpose, which isn’t an option, and most devs will already know it’s kind of junk.
That’s not to say there aren’t good use cases, but when forced to add an LLM somewhere it doesn’t work, and you have no examples of “correct” output anyway, validation is usually an afterthought.
Not cynical at all. I think you're highlighting a real problem in the industry, and certainly something we've seen - teams, for a number of reasons (optics, marketing, hype/vibes, experimentation, pressure to adopt AI), use LLMs perhaps without proper consideration. That's actually the opposite of what we're advocating for.
The whole point of proper testing is to determine if an LLM is suitable for your specific task, and then to keep testing and measuring to optimise for the outcome you want. The post is more about testing LLMs at scale, and the use cases we refer to assume a system design process took place where the use of an LLM was deemed necessary for the task. Teams absolutely should have the option to determine that an LLM is not "fit for purpose". Reva actually helps with this - good testing and validation during the experimentation stage often reveals when a simpler solution works better. But "pressure" can come in many forms, and I have empathy for teams that are perhaps not in a healthy environment where saying no is part of the culture.
We're working with teams that have real use cases, and we've seen a real problem with how teams are testing their use of LLMs. It's hard. Especially at scale! We built infrastructure that lets you test with your own real historic data, so you can measure actual performance improvements (or regressions) against the business outcome, rather than "yep, looks good!"
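To make that concrete, here's a minimal sketch of the kind of check we mean. Everything here (the `summarize` stand-in, the historic cases, the baseline number) is invented for illustration - in practice the model call and the labelled data would be your own:

```python
# Sketch: score an LLM-backed function against historic labelled data
# instead of eyeballing outputs. `summarize` is a fake stand-in for a
# real model call so this example runs without any API.

def summarize(ticket: str) -> str:
    # placeholder for the real LLM call
    return "refund" if "money back" in ticket else "other"

# historic inputs with the outcome a human actually accepted
historic = [
    ("I want my money back for this order", "refund"),
    ("How do I reset my password?", "other"),
    ("Please give me my money back", "refund"),
]

def accuracy(cases):
    hits = sum(1 for text, label in cases if summarize(text) == label)
    return hits / len(cases)

baseline = 0.66  # what the previous prompt/model scored, measured the same way
score = accuracy(historic)
print(f"accuracy={score:.2f}")
assert score >= baseline, "regression against historic data"
```

The point isn't the metric (yours might be resolution rate, refund cost, whatever the business outcome is); it's that a prompt or model change fails loudly instead of shipping on vibes.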
Ugh I am dealing with an amazingly productive platform team who churn out so much stuff that the product teams just can’t keep on top of all the changes to tools and tech.
That one team is super productive at the cost of everyone else grinding to a halt.
Some of the coolest demos from my Lean Six Sigma training were all about demonstrating that oftentimes the easiest way to increase the end-to-end throughput of your value chain is to find the team member with the greatest personal productivity and force them to slow down.
You don't necessarily even have to do anything more than that. Just impose the rate limit on them and they'll automatically and unconsciously stop doing all sorts of little things - mostly various flavors of corner cutting - that make life harder for everyone around them.
I have to work with a DBA who has decided that nothing new gets developed using Postgres and is confident that our use cases are best served with a document db… all without knowing any of our use cases, requirements or constraints.
Now I just don’t involve him in anything unless forced.
Honestly this is a concern for me in a non-boogeyman way. I joined a company to work on their edtech product for kids and got assigned to work on their AI product. I have no idea how to be confident and prove that the gen AI won’t tell the kids harmful things. We can try all kinds of things, but I don’t know how to PROVE it won’t.
I'm curious which fast-charging standard is going to win in markets dominated by Japanese used cars. Japan still sticks with CHAdeMO for its local market despite it dying outside of Japan. I think that's quite bad for the used-car export economy.
The company I work at uses step functions heavily and I hate it. Instead of an if statement, they make it a new step with the conditional in JSON, so following the code requires you to jump around between files.
It has also been built by contractors who have no incentives to make it run locally or be easy to manage in production, but that isn't specific to step functions, and more due to poor leadership.
Imagine using JSON as a programming language, and all the implications of that. AWS does provide a VS Code and web renderer for the JSON so you can visually see what the flow of the step function looks like, but this is just a small improvement overall. If you split your step functions into multiple ones, where one step function can call another, and also invoke lambdas and other things, you can imagine you lose all the benefits of a modern code editor/IDE and you're manually looking up which file to jump to in order to find the next step in the execution path.
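For anyone who hasn't worked with it, even a trivial if/else becomes its own Choice state in the Amazon States Language JSON (the state names and the `$.orderSize` field here are made up for illustration):

```json
{
  "StartAt": "CheckOrderSize",
  "States": {
    "CheckOrderSize": {
      "Type": "Choice",
      "Choices": [
        {
          "Variable": "$.orderSize",
          "NumericGreaterThan": 100,
          "Next": "HandleLargeOrder"
        }
      ],
      "Default": "HandleSmallOrder"
    },
    "HandleLargeOrder": { "Type": "Pass", "End": true },
    "HandleSmallOrder": { "Type": "Pass", "End": true }
  }
}
```

In a real system, "HandleLargeOrder" would typically be a Task state invoking a Lambda defined in yet another file, which is where the jumping around starts.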
To be fair, AWS does provide a "CDK" which lets you write Python code that gets converted into their JSON DSL -- or something along those lines. But I haven't used that, only the direct JSON DSL "code" to write step functions.
To interpret your example in another way, a page working in IE is doing it right. So first you do it, and structure it the way you think it should be done with "correct" markup. Once you have that, you can then do it right and get it working properly in IE. After that, doing it better would be restructuring things so maybe you don't need as many hacks.
Reality: Your employer doesn't pay you to write 'right' software, they want a deliverable that works in IE by end of day tomorrow and for the life of you, you can't figure out where half the elements are actually displaying.
Reality: it is far easier to iterate on software that’s clean and mostly correct than it is to do on software that is riddled with hacks, gotchas, footguns, and long-distance side effects.
It’s extremely depressing working with “senior” engineers who’ve spent an entire career with the above mentality, who have missed out on any chance at ever learning how to actually engineer software for reliability and maintainability. Their inability to do so reflects on a lack of practice rather than some sort of fundamental impossibility. Which sadly seems to be a widespread misconception these days.
I've seen it go wrong where there is a design system, but none of the designs that come in actually match it, so we need to extend the standard components or create whole new ones; but timelines are set as if it's all off the shelf stuff.
Not sure if related, but my 2020 M1 MacBook Air bricked a week or so after upgrading to Sonoma. I was suspicious that it was related to the update.
Luckily the logic board was replaced for free under warranty laws here, though it put me off switching to iPhone, which I was a day away from doing.
I was affected by this and, like many users, the problem was fixed after replacing the I/O board. In my case, I did it myself using a $10 part from eBay since the machine was well out of warranty at that point.
From comments #736 and #747 attached to the forum post you kindly shared, it sounds like simply disconnecting and reconnecting the I/O board may be sufficient (found those comments linked in #831):
Why the "/troll"? You're 100% right non-ironically: the problem being that on Linux the need to consult forum posts to fix these kinds of issues is far more frequent than on macOS.
By the standard of "do you ever need to consult forum posts to solve a problem", sure, Linux is worse than macOS. By the standard of "do you ever need to consult forum posts to fix hardware that has apparently been bricked by a software update", macOS seems to be considerably worse. At least, that's my experience. I've never had hardware damaged by Linux, which I've run almost exclusively. On the other hand the one Apple device I've ever owned got bricked by their software update.
I can't say I've heard of that happening to people on Linux at all other than maybe early days of Xorg. Damage (reversible or otherwise) to hardware is extraordinarily rare on Linux, I can only think of it happening during the very early days of EFI and only under very specific conditions.
> I've never had hardware damaged by Linux, which I've run almost exclusively. [...] I can't say I've heard of that happening to people on Linux at all other than maybe early days of Xorg.
More recently there was "rm -rf /" wiping efivars and bricking some motherboards with shitty uefi implementations thanks to systemd mounting efivars rw by default (and shitty motherboard firmware). The kernel "fixed" this by mounting unknown efivars as (mostly) immutable.
There were also some motherboards with shitty UEFI implementations which got bricked when the efivars storage did not have enough free space to do the garbage collection. The kernel "fixed" this by not allowing more than half of the efivars storage to be used (https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/lin...).
Right, this is what I meant by the EFI brick in my comment. And in my comment "I can't say I've heard of that happening", I meant bricking a device on a system update. That's the specific thing which seems to happen on occasion with macOS, but that I've not seen with Linux. I do grant that there have been some (very rare) instances like this where hardware can be bricked by a command run on a Linux system.
Sure, but that's when they get told "you need to replace the logic board, that'll be $500 since it's out of warranty". That's not a theoretical problem either, people were literally told to do this back when the Big Sur brick happened to my model (2014 MBP). Eventually users figured out on their own that you could just replace the I/O board (not the mainboard / "logic board"). Apparently (going by that forum thread) disconnecting and then reconnecting the I/O board fixes it as well for some people (I don't remember whether I tested this), but this isn't something that Apple happily figured out and did for everyone who walked into one of their stores. We had to fix it ourselves.
Sure, doesn't make this any less inconvenient though. I would much prefer to finish whatever I want to finish today (even if I spend a few hours trying to fix an issue) than wait however many days until my hardware is fixed.
You're right that it's not ideal, but it certainly makes it a lot _less_ inconvenient than being completely stuck forever (as most non-highly-technical people would be if they had to follow instructions on forums to fix Linux).
This is anecdotal, but my last “corporate job” was the closest thing to shrink-wrapped software, even though it was a SaaS. Every release was meticulously documented. Any public-facing UI or API change was approved by the appropriate teams.
This is similar to macOS, Windows, or even FreeBSD releases. I haven’t seen any Linux distribution that has such comprehensively coordinated releases. Between systemd and the Linux kernel, I’m not sure it would be possible.
Many distros have good documentation, but, in my experience, far too often the bulk of it is in out of date wikis or forums. Perhaps this is out of date thinking and I’ve missed the train in the past 10 years.
As a counterpoint, OpenWRT has been good, but their main “product”, imho, is LuCI. Lower level issues often require vendor specific forums.
So essentially, the situation you'd have if you'd bought a Mac?
If we want to compare apples to apples, then we compare:
Mac with macOS updates installed regularly, and only those provided by Apple. Non-Apple apps get dropped in /Applications like they should be. If there's an installer that asks for root access, you might get boned.
Linux preinstalled with OS updates installed regularly, and only those provided by the vendor. Apps that don't come with the OS's package manager should be installed somewhere under $HOME, and never installed systemwide as root.
Sure, if you have a Mac and disable SIP (or whatever it's called nowadays) and start mucking around with files in /System or whatever, because you want to install some mod that does something cool, you might have a bad time. Same as if you decide that screwing around in /lib on a Linux machine is a good idea.
But if we actually compare these two apples, I suspect the Linux one would have fewer problems.
>So essentially, the situation you'd have if you'd bought a Mac?
No, worse, with more device incompatibilities, manual fiddling, arcane settings, and so on to make things work.
>Sure, if you have a Mac and disable SIP (or whatever it's called nowadays) and start mucking around with files in /System or whatever, because you want to install some mod that does something cool, you might have a bad time.
Sure, but I'm not talking about that. With Linux you often have a bad time trying to make basic, but not distro-configured, functionality work.
> with more device incompatibilities, manual fiddling
Did you read what I wrote above? "if you buy preinstalled"
If you install Linux on an ordinary "Windows-certified" computer, you will have problems. If you install an alternative OS on a Macbook, you will have exactly the same problems.
>Did you read what I wrote above? "if you buy preinstalled"
And did you understand my point? Buying preinstalled isn't a cure-all; it only ensures that the bundled hardware drivers are compatible and configured. That's a pretty low bar.
It doesn't cover doing stuff with third party devices (which on the Mac 99% of the time it works every time).
Not to mention the bundled hardware doesn't always work even if you buy preinstalled (the laptop not sleeping properly, for example).
That's the point, I think - Linux gets derided because people say it just breaks at random and you have to wade through forums to find arcane incantations to fix it, either implying or outright stating that their favorite proprietary OS would never just blow up in your face and force you to resort to exotic troubleshooting steps. So when macos, the poster child for "user friendly", proceeds to brick the machine and require elaborate rituals to fix, it invites a certain level of snark from users pointing out that the high and mighty proprietary OSs might be just as bad as Linux after all.
Of course, whether that's valid is at minimum a question of actual frequency of problems and relative impact and effort to fix, but from a perspective of optics and emotions I understand the reaction.
I concur; the number of times I've had to Google for dozens of minutes for issues happening on my work-issued MacBook Pro, and never found answers because things are supposed to "just work", is maddening.
For one example off the top of my head: sometimes I can't adjust the brightness of the MacBook's monitor from the Notification Center (it is grayed out), but if I open "Settings -> Displays" I can do it. Never found a solution after searching for a while, so I just gave up.
Or the fact that I can't enable Retina scaling or font smoothing on my 1440p monitor, so the fonts look ugly (I got used to it eventually, but they still look worse than Windows or Linux on the same monitor). I used a workaround in the past with "Better Display" to create a 4k framebuffer that was downscaled to 1440p, but this was so slow and prone to other issues that eventually I just got used to the ugly fonts.
Another one: I have a TouchBar MacBook (again, a work-issued laptop), but I just want it to work like a normal keyboard: show the function keys, and if I press Fn, show the shortcuts. Yep, doesn't work: while you can configure this, pressing Fn and then pressing some of the shortcuts in the TouchBar doesn't work. This is especially infuriating because one of the shortcuts that doesn't work is the brightness one. Go back to the first issue and you can see why this drives me mad sometimes.
I've never had an Android phone brick itself in 13 years of owning them. I have friends whose iPhones have gotten bad updates. Not sure if they were bricked, though, or if they "only" needed a factory reset to get things going again.
In the same period, 10-12 years ago, both Androids and iPhones bricked themselves if there was no storage left on the device.
Both needed some free bytes on boot, and if they couldn't write to disk, they failed to boot.
Phones are Apple's main business. At this point, Macs are second-tier. With Google, I suspect it's their engineering practice. Google doesn't like to make engineering mistakes.
The logic board failed in my 2020 M1 Air as well. Opened the lid one day, and it wouldn't power on. I have AppleCare on it, otherwise it would have been a $500 repair.
About two weeks ago I'm sitting in a hotel room with the Air on the bed with the lid open. I grab it by the screen to slide it closer to me, and the screen shatters from the light pressure of my finger.
There are instances of both of these things happening to the Air all over the internet. At first I really liked the M1 Air, but it has now proven too unreliable for me.
My 2020 M1 Air generally requires a hard reboot if left closed and on a charger overnight, but that's been the worst of it until now (besides the rapidly degrading battery that seems calibrated to hit 79% a month after my AppleCare+ expires, while my 2015 Air's is still going strong).