The uncontended case is uninteresting, because all mutex implementations perform roughly the same there, give or take a nanosecond or two. If you're truly uncontended, then a naïve spin lock will actually seem fastest, because xchg is faster than cmpxchg, which good locks need.
On x86 you can. When xchg is used with a memory operand, it's implicitly locked: it behaves as if it had a lock prefix, even when none is written. I included a spinlock implementation in the blog post. If you see any errors in it, then please let me know!
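
For illustration only (this is not the code from the blog post), a naïve exchange-based spin lock in C11 might look something like the sketch below. The names naive_spinlock, naive_lock, naive_trylock and naive_unlock are made up for this example. On x86, compilers typically lower atomic_exchange to xchg with a memory operand, so no cmpxchg is involved.

```c
#include <stdatomic.h>

/* Hypothetical type for this sketch: 0 means free, 1 means held. */
typedef struct {
    atomic_int locked;
} naive_spinlock;

/* Try once to take the lock. Returns 1 on success, 0 if already held. */
static inline int naive_trylock(naive_spinlock *l) {
    /* Unconditionally swap in 1; the old value says whether it was free. */
    return atomic_exchange_explicit(&l->locked, 1, memory_order_acquire) == 0;
}

/* Spin until the lock is acquired. */
static inline void naive_lock(naive_spinlock *l) {
    while (!naive_trylock(l)) {
        /* Wait on plain loads until the lock looks free, so waiters
           do not keep issuing locked exchanges on the cache line. */
        while (atomic_load_explicit(&l->locked, memory_order_relaxed)) {
        }
    }
}

/* Release the lock with a plain release store. */
static inline void naive_unlock(naive_spinlock *l) {
    atomic_store_explicit(&l->locked, 0, memory_order_release);
}
```

A lock like this tends to look good in an uncontended benchmark because the fast path is a single exchange. It has no fairness and never puts a waiter to sleep, which is where it would be expected to fall down under real contention.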