Data-share write (ds_write) instructions do not necessarily complete
the write to LDS immediately. When a write completes, LGKM_CNT is
decremented. For now, we wait until LGKM_CNT reaches zero after each
ds_write instruction.
This fixes a race condition in the case where LDS is read immediately
after being written. This can happen with broadcast operations.