I don't think the issue is the tools available but the motivation behind their use. Until we get rid of the ad revenue model we'll be stuck with user profiling and tracking. Micropayments or some other model will have to take it's place before we see a return to the ideals we used to aspire to.
If there's support for embedded assets (you mention videos, so I assume also images, etc), how do you prevent "tracking pixels" that can monitor what content is viewed per IP address?
I think it's generally useful to look at email here, where only a small subset of html—primarily photos + text—are reliably supported across clients.
Content addressing can help here. If every asset is identified by a cryptographic hash, the user agent is free to fetch it from any available source. In the case of images, you can hash over the rendered pixels instead of the file encoding, so all transparent 1x1 images are concise red interchangeable.
Alternatively/in addition, the user agent can treat embedded assets like textbooks do, and present them all as numbered, boxed, and captioned figures.
What would that constraint look like? Is animation and interaction prohibited there, for example?