A naive, thread-safe reference counter built on atomic compare-and-swap is extremely expensive. Even on the best CPUs, in the uncontended case, each atomic update costs 30-50 clock cycles. So modifying a single pointer field becomes an affair on the order of 100 clock cycles (a decrement for the old value, an increment for the new value).
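To make the two atomic operations per pointer store concrete, here is a minimal C++ sketch of the naive scheme. The names `Object`, `retain`, `release`, and `store_field` are hypothetical and chosen only for illustration. The sketch also assumes that only one thread writes a given field at a time; only the counts themselves are updated atomically.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical counted object: the header carries an atomic count.
struct Object {
    std::atomic<long> refcount{1};  // the creator holds the first reference
    int payload = 0;
};

// Increment with an explicit compare-and-swap loop.
inline void retain(Object* obj) {
    if (obj == nullptr) return;
    long seen = obj->refcount.load(std::memory_order_relaxed);
    // On failure compare_exchange_weak reloads `seen`, so we simply retry.
    while (!obj->refcount.compare_exchange_weak(seen, seen + 1,
                                                std::memory_order_relaxed)) {
    }
}

// Decrement with a compare-and-swap loop; frees the object when the last
// reference is dropped. Returns true if the object was freed.
inline bool release(Object* obj) {
    if (obj == nullptr) return false;
    long seen = obj->refcount.load(std::memory_order_relaxed);
    while (!obj->refcount.compare_exchange_weak(seen, seen - 1,
                                                std::memory_order_acq_rel,
                                                std::memory_order_relaxed)) {
    }
    if (seen == 1) {  // we moved the count from 1 to 0
        delete obj;
        return true;
    }
    return false;
}

// Overwriting one pointer field costs two atomic read-modify-write
// operations: one increment and one decrement.
// Assumption: no other thread writes `field` concurrently.
inline void store_field(Object*& field, Object* fresh) {
    retain(fresh);        // increment for the new value
    Object* old = field;
    field = fresh;
    release(old);         // decrement for the old value
}
```

Retaining the new value before releasing the old one keeps the sketch safe when a field is assigned the object it already holds.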
Thread-safe deferred reference counting looks like GC.