A fairly typical and simple approach is to set an intentional, fixed delay, say 500 ms, to absorb network latency and jitter. The sender attaches a target playback timestamp, about 500 ms in the future, to each block of audio. The time each block spends buffered at the playback side can then expand or contract as necessary to take up variation in network delay. This only works if the sender and all receivers share a synchronized clock. The lower you make the delay, the more care you need to take on the network side to guarantee timely delivery.
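To make the timestamping idea concrete, here is a minimal sketch in Python. The names (`AudioBlock`, `stamp_block`, `buffer_time`) are made up for illustration and do not come from any particular protocol. It also assumes both sides already read the same synchronized clock, and that a block arriving after its deadline is simply dropped, which is one possible policy among several.

```python
from dataclasses import dataclass
from typing import Optional

# The fixed, intentional delay from the description above (500 ms).
PLAYOUT_DELAY_S = 0.5


@dataclass
class AudioBlock:
    play_at: float  # target playback time on the shared clock, in seconds
    samples: bytes  # raw audio payload


def stamp_block(samples: bytes, sender_now: float,
                delay: float = PLAYOUT_DELAY_S) -> AudioBlock:
    """Sender side: tag a block with a playback time `delay` seconds ahead."""
    return AudioBlock(play_at=sender_now + delay, samples=samples)


def buffer_time(block: AudioBlock, receiver_now: float) -> Optional[float]:
    """Receiver side: how long to hold the block before playing it.

    Returns None if the block arrived after its playback time
    (assumed policy: drop late blocks).
    """
    remaining = block.play_at - receiver_now
    return remaining if remaining >= 0 else None
```

With this sketch, a block stamped at t = 10.0 s is due at t = 10.5 s. If it arrives after 20 ms of network delay it is held for roughly 480 ms, and if it arrives after 450 ms it is held for roughly 50 ms. Either way it plays at the same instant on every receiver, which is the point of the fixed delay.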
NTP is accurate enough for this, but I think most of the modern protocols in the wild, e.g. AES67 and AirPlay 2, use PTP. It is both more accurate and in some ways simpler for this use case.