So you pick up the mouse, and talk into it?
....
Seriously though, I don't understand why the state of voice interaction is so poor.
In the '90s we had voice commands (early Dragon?) available to tell our computers what to do. It was limited, but it worked extremely well, even in busy environments.
I remember using my ThinkPad (a 486DX2?) at a party: opening software, choosing music to play from a list, and controlling the volume, all by voice. We thought to ourselves: imagine what this will be like in five years, when we have a stronger, faster computer.
It's truly gone nowhere. Still, the most advanced thing you can reliably get it to do is "Set a timer for 10 minutes."
I wonder if these "SmartAssistant" programmers have ever actually had a human personal assistant. For most of what you need them to do, you don't even ask; they just know you and do it. An actually good computerized SmartAssistant would know that it's been a year, so it's time to book my physical with my doctor. It would have contacted the doctor's office for me, checked my calendar, scheduled the appointment, and then proactively reminded me a few days in advance. I shouldn't have to say "Hey, Assistant: Please schedule a physical with Doctor X at Clinic Y on July 1 of this year." (By the way, SmartAssistants can't currently even do that.)
The voice interaction should only be for exceptional cases: "Hey, Assistant: My trip to the Paris office needs to be delayed by one week." The assistant should then go and re-book flights, hotels, and rental cars, and then when finished, merely say "Done."
Until they can do this, tech companies might as well stop bothering to release incremental crap products that can barely understand a task I'd expect a 4-year-old to be able to do.
In general, I wouldn't want to pass off that level of control. Maybe if I'm really busy and an assistant knew me really well... And there are certainly sometimes heavily scheduled trips where your "handlers" pretty much just tell you where to be and when.
But especially if it's my money, on a trip where I want at least some "me" time, I probably want to take at least a cursory look at flight and hotel options and lots of other details.
That's also a detail a human assistant would know about you, and would know to pass the information on to you for confirmation before they took action. I would expect a SmartAssistant to do the same.
The point is that "Alexa, play music" is a huge distance away from what the product should be.
I think the problem was scaling to languages, dialects, idioms, etc. It's super easy to get high-quality American English recognition with a few commands. It's much harder to support any language, any syntax, any accent. The brute-force optimizations just don't scale.
Modern ML and embedding models are the discontinuity that was needed to get from "massively complex hack that can't scale" to "even more complex but principled approach that scales pretty well".
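To make that contrast concrete, here is a rough sketch, not how any real assistant works. The command list, the function names, and the threshold are all made up for illustration, and `toy_embed` is just a bag-of-words count standing in for a learned embedding model. The first matcher only accepts the exact phrase; the second picks the nearest command by vector similarity, so a rephrased request still lands.

```python
import math
from collections import Counter

# Hypothetical command set, invented for this illustration.
COMMANDS = {
    "set_timer": "set a timer",
    "play_music": "play some music",
    "volume_up": "turn the volume up",
}


def exact_match(utterance):
    """Fixed-grammar style: only the exact phrase is recognized."""
    text = utterance.lower().strip()
    for intent, phrase in COMMANDS.items():
        if text == phrase:
            return intent
    return None


def toy_embed(text):
    """Stand-in for a learned embedding: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def nearest_intent(utterance, threshold=0.3):
    """Pick the most similar command, or None if nothing is close enough."""
    query = toy_embed(utterance)
    best_intent, best_score = None, 0.0
    for intent, phrase in COMMANDS.items():
        score = cosine(query, toy_embed(phrase))
        if score > best_score:
            best_intent, best_score = intent, score
    return best_intent if best_score >= threshold else None


print(exact_match("please play music"))     # the fixed grammar rejects the rephrasing
print(nearest_intent("please play music"))  # the similarity matcher still finds a command
```

Word counts only catch overlapping words, so this toy version still misses synonyms and other languages. A real embedding model is what would be expected to close that gap, which is the scaling point above.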