MEDIUM: ring: improve speed in the queue waiting loop on x86_64
x86_64 doesn't have a native atomic FETCH_OR(), it's implemented using
a CAS, which will always cause a write cycle. Here we know we can just
wait as long as the lock bit is held so better loop on a load, and only
attempt the CAS on success. This requires a tiny ifdef and brings nice
benefits. This brings the performance back from 3.33M to 3.75M at 24C48T
while doing no change at 3C6T.