You can use NTP to sync the devices' clocks to a much tighter tolerance than necessary, and play back accordingly.
And then you "just" have the same problems that you have with purely electrically connected, analogue speakers (which are effectively 100% in sync in terms of receiving the signal): Sound is relatively slow, and so the audio from a speaker that is far away will reach you later than the nearby speaker.
You can mitigate that by adding a precise delay to the far away speaker... but of course that does not work if you're standing on the other side. Nevertheless, as said, that problem exists regardless of whether your speaker is network-connected or not.
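To put rough numbers on that, here is a minimal sketch. The speed of sound (343 m/s, dry air around 20 °C) and the distances are assumed round figures, and `arrival_offset_ms` is just a name made up for this example.

```python
SPEED_OF_SOUND_M_S = 343.0  # assumption: dry air at roughly 20 °C


def arrival_offset_ms(dist_a_m, dist_b_m):
    """Milliseconds by which speaker B's sound arrives after speaker A's.

    Negative means B's sound arrives first.
    """
    return (dist_b_m - dist_a_m) / SPEED_OF_SOUND_M_S * 1000.0


# Listener standing 2 m from speaker A and 12 m from speaker B.
near_a = arrival_offset_ms(2.0, 12.0)
# Listener on the other side: 12 m from A and 2 m from B.
near_b = arrival_offset_ms(12.0, 2.0)

print(f"near A: {near_a:+.1f} ms, near B: {near_b:+.1f} ms")
```

The offset has the same size but the opposite sign at the two spots, so a fixed delay tuned for one listener doubles the error for the other.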
Kind of. The bigger problem you will have if you try this is that the audio is not clocked by the system clock, and the audio clock is almost always free-running (and even if it were derived from the system clock, NTP et al. generally don't discipline the clock itself, just the OS's presentation of it). So for long-running playback (or continuous playback, as in this case), you will drift out of sync over time, and it doesn't take long to become noticeable. At some point you'll also start dropping out because of buffer underflow or overflow. So you do still need to deal with this.
So to work well you do need to resync the audio to the local audio clock using a sample rate converter, or build some custom hardware that lets you sync the playback audio clocks somehow. Or if you want to be sloppy about it, keep close track and stuff or drop individual samples as you drift.
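A rough sketch of the "sloppy" option, to show the scale of the problem. The 100 ppm mismatch is an assumed figure for two free-running crystal oscillators, and `drift_ms` and `stuff_or_drop` are hypothetical helpers, not anything from a real audio stack.

```python
def drift_ms(ppm_offset, seconds):
    """Accumulated timing error between two clocks that differ by ppm_offset."""
    return ppm_offset * 1e-6 * seconds * 1000.0


def stuff_or_drop(block, fill_error):
    """Nudge one block of samples by at most one sample.

    fill_error is the playback buffer fill level minus its target, in samples:
    positive means the source runs fast relative to the local audio clock
    (drop a sample), negative means it runs slow (repeat a sample).
    """
    if fill_error > 0 and len(block) > 1:
        return block[:-1]            # drop the last sample
    if fill_error < 0 and block:
        return block + block[-1:]    # duplicate the last sample
    return block


# Assumed 100 ppm mismatch over ten minutes of playback: about 60 ms.
print(drift_ms(100, 600))
print(stuff_or_drop([0.1, 0.2, 0.3], fill_error=+2))
print(stuff_or_drop([0.1, 0.2, 0.3], fill_error=-2))
```

Each dropped or repeated sample is a small discontinuity, which is why this counts as the sloppy route compared with a proper sample rate converter.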
Sonos has a remarkably good implementation of all of this.
For URL-based streams they buffer and use NTP to sync. For live streams (e.g. gaming) they p2p multicast and tweak the wifi params in real-time to minimize drops.
The speakers create their own wifi and use MST network heuristics to latency-min route over that versus native wifi or ethernet if you've plugged it in. Sound drops when the wifi spectrum blinks (rarely), but I have never encountered the speakers being out of sync or noticed an echo effect.
And the speakers can use your phone's mic to scan the soundscape of a room to acoustically balance the sound when you set them up. I particularly like how consistent the sound volume is room-to-room even with very different speaker setups.
IIRC they've patented their specific mechanism. So ya, it's solved, but it may be expensive to license.
(Not affiliated with Sonos, I just have a bunch of them and like them a lot.)
Yeah, Sonos is very much the Apple of this space. A solid, user-friendly implementation of several pre-existing concepts into a cohesive product - no small task. I don't think the technologically important parts of this are patentable though, there's both prior art and the obviousness standard to worry about. But very much like Apple's 'rounded corners' case, they've gone after (IMO) obvious UI functionality for such a system to extract money from their competitors.
If you are just interested in the synchronized Audio-over-Ethernet part, AES67 is the industry standard, and a pretty complete open-source implementation can be found at https://github.com/bondagit/aes67-linux-daemon . AES67 is itself a composition of existing standards, though: fundamentally it is mostly SDP for session description, RTP for media, and PTP for clock sync, so you can build it out of a variety of implementations too.
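For a feel of the RTP piece, here is a minimal sketch that packs and parses the fixed 12-byte RTP header (the RFC 3550 layout) with Python's `struct` module. It is not taken from the daemon linked above; the function names and the payload type 96 are assumptions for illustration (the real payload type is whatever the SDP announces).

```python
import struct

RTP_HEADER = struct.Struct("!BBHII")  # network byte order, 12 bytes


def pack_rtp_header(seq, timestamp, ssrc, payload_type=96, marker=False):
    """Build a fixed RTP header: version 2, no padding, no extension, no CSRCs."""
    byte0 = 2 << 6
    byte1 = (int(marker) << 7) | (payload_type & 0x7F)
    return RTP_HEADER.pack(byte0, byte1, seq & 0xFFFF,
                           timestamp & 0xFFFFFFFF, ssrc & 0xFFFFFFFF)


def parse_rtp_header(packet):
    """Decode the fixed header fields from the first 12 bytes of a packet."""
    byte0, byte1, seq, timestamp, ssrc = RTP_HEADER.unpack(packet[:12])
    return {
        "version": byte0 >> 6,
        "marker": bool(byte1 >> 7),
        "payload_type": byte1 & 0x7F,
        "sequence": seq,
        "timestamp": timestamp,  # counts media clock ticks, i.e. samples
        "ssrc": ssrc,
    }


header = pack_rtp_header(seq=1, timestamp=48, ssrc=0x1234ABCD)
print(parse_rtp_header(header))
```

The timestamp field is what ties the media to the shared clock: every receiver can work out when a given sample is due, as long as their clocks agree, which is the job PTP does here.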
The patent actually covers a mechanism for electing a master controller for syncing and storing configuration parameters. The actual process of syncing audio is not covered. Not that difficult to work around the patent. But definitely easy to trip over the patent if you're not careful.
True, it was definitely simplified. But yeah, in cases where you really care, there's a bunch of options to do it completely/sufficiently in sync. (A true asynchronous sample rate converter, as it would have to be here, might be a bit expensive, but simple interpolation, or even stuffing/dropping, might be sufficient for this particular use case.)
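As a sketch of the "simple interpolation" option: a plain linear-interpolation resampler that stretches or squeezes a block by a ratio very close to 1. This is an illustration only, not a real ASRC (no anti-aliasing filter, no state carried across blocks), and `resample_linear` is a made-up name.

```python
def resample_linear(samples, ratio):
    """Resample by linear interpolation; ratio = output rate / input rate."""
    if len(samples) < 2:
        return [float(s) for s in samples]
    n_out = int((len(samples) - 1) * ratio) + 1
    last = len(samples) - 1
    out = []
    for i in range(n_out):
        pos = i / ratio              # position in the input, in samples
        k = int(pos)
        frac = pos - k
        if k >= last:
            out.append(float(samples[last]))
        else:
            # Weighted average of the two neighbouring input samples.
            out.append(samples[k] * (1.0 - frac) + samples[k + 1] * frac)
    return out


print(resample_linear([0, 2], 2.0))           # [0.0, 1.0, 2.0]
print(resample_linear([0, 1, 2, 3], 1.0))     # [0.0, 1.0, 2.0, 3.0]

# Assumed 100 ppm correction applied to one second of 48 kHz audio.
block = [0.0] * 48_000
print(len(resample_linear(block, 1.0001)) - len(block))
```

In a real player the ratio would be steered continuously from the measured buffer fill level or clock offset rather than fixed per block.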
Just re-sync at the start of each song. Sound propagating through air introduces ~1 ms of latency per foot. So if tracks drift out of sync by a few milliseconds, it's no big deal.
That is one solution, and in some scenarios it might not even be noticeable, but it's basically conceding the problem and accepting a guaranteed audio dropout at the end of every 'song', since for this to work you need some dead time to ensure all buffers are drained and start the new stream.
The simplest model is a source that generates a continuous audio stream, and a sink that plays it back; adding the idea of songs complicates the model, and in some use cases might be totally inappropriate. For elevator music, sure, it likely doesn't matter, and maybe you can hide it in a crossfade or something with enough metadata. But this is probably not a dedicated elevator music system. More likely it's part of a system where you put audio into one device on the network, possibly including live stuff like PA announcements, and it comes out of a bunch of others.