Because you need a reliable way to convert human speech into a structured representation of actions the device can actually execute; that representation is how the LLM interfaces with the underlying system. For most smart devices that layer is probably locked away, and combined with issues like hallucination and the lack of training data for the action representation, it's hard to say whether such technology would be reliable enough for stable use.
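To make that concrete, here's a minimal sketch of what the validation side of such an action layer might look like. Everything here is hypothetical (the action names, `ALLOWED_ACTIONS`, and `parse_action` are illustrations, not any vendor's real API); the point is just that you'd whitelist what the firmware supports and reject anything else rather than execute it:

```python
import json

# Hypothetical whitelist: action names the device firmware actually
# supports, mapped to their required parameters. Not a real vendor API.
ALLOWED_ACTIONS = {
    "set_temperature": {"target_f"},
    "toggle_light": {"room"},
}

def parse_action(llm_output: str) -> dict:
    """Validate the LLM's structured output before it touches hardware."""
    action = json.loads(llm_output)
    name = action.get("name")
    if name not in ALLOWED_ACTIONS:
        # This is where a hallucinated action gets caught instead of run.
        raise ValueError(f"unsupported action: {name!r}")
    missing = ALLOWED_ACTIONS[name] - action.keys()
    if missing:
        raise ValueError(f"missing parameters: {missing}")
    return action

# The LLM would be prompted to emit JSON like this for "make it warmer":
print(parse_action('{"name": "set_temperature", "target_f": 72}'))
```

Even with a guard like that, you've only caught malformed output -- a syntactically valid but wrong action (setting the thermostat to 90 instead of 72) sails right through.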
We've seen a similar thing with big companies rolling out LLM chatbots for customer service -- it turns out LLMs can go off the rails really easily with the right prompts.