On atomic types (atomic_t, atomic64_t and atomic_long_t).

The atomic type provides an interface to the architecture's means of atomic
RMW operations between CPUs (atomic operations on MMIO are not supported and
can lead to fatal traps on some platforms).

API
---

The 'full' API consists of (atomic64_ and atomic_long_ prefixes omitted for
brevity):

Non-RMW ops:

  atomic_read(), atomic_set()
  atomic_read_acquire(), atomic_set_release()


RMW atomic operations:

Arithmetic:

  atomic_{add,sub,inc,dec}()
  atomic_{add,sub,inc,dec}_return{,_relaxed,_acquire,_release}()
  atomic_fetch_{add,sub,inc,dec}{,_relaxed,_acquire,_release}()


Bitwise:

  atomic_{and,or,xor,andnot}()
  atomic_fetch_{and,or,xor,andnot}{,_relaxed,_acquire,_release}()


Swap:

  atomic_xchg{,_relaxed,_acquire,_release}()
  atomic_cmpxchg{,_relaxed,_acquire,_release}()
  atomic_try_cmpxchg{,_relaxed,_acquire,_release}()


Reference count (but please see refcount_t):

  atomic_add_unless(), atomic_inc_not_zero()
  atomic_sub_and_test(), atomic_dec_and_test()


Misc:

  atomic_inc_and_test(), atomic_add_negative()
  atomic_dec_unless_positive(), atomic_inc_unless_negative()


Barriers:

  smp_mb__{before,after}_atomic()


TYPES (signed vs unsigned)
-----

While atomic_t, atomic_long_t and atomic64_t use int, long and s64
respectively (for hysterical raisins), the kernel uses -fno-strict-overflow
(which implies -fwrapv) and defines signed overflow to behave like
2s-complement.

Therefore, an explicitly unsigned variant of the atomic ops is strictly
unnecessary and we can simply cast; there is no UB.

There was a bug in UBSAN prior to GCC-8 that would generate UB warnings for
signed types.

With this we also conform to the C/C++ _Atomic behaviour and things like
P1236R1.
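
As a rough userspace illustration of the "simply cast" point, the sketch below
uses C11 <stdatomic.h> rather than the kernel API: atomic_int stands in for
atomic_t, and sketch_inc_as_unsigned() is a hypothetical helper, not a kernel
function. C11 defines arithmetic on signed atomic types to wrap silently,
which is comparable to what -fno-strict-overflow gives the kernel.

```c
#include <limits.h>
#include <stdatomic.h>

/*
 * Userspace analogue only: atomic_int plays the role of atomic_t.
 * The increment wraps from INT_MAX to INT_MIN without UB, and the
 * caller gets the new value as an unsigned quantity by casting.
 */
static inline unsigned int sketch_inc_as_unsigned(atomic_int *v)
{
	/* fetch_add returns the old value; view it as unsigned and add 1. */
	return (unsigned int)atomic_fetch_add(v, 1) + 1u;
}
```

Starting from INT_MAX, the stored int wraps to INT_MIN while the unsigned
view simply continues counting upwards.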


SEMANTICS
---------

Non-RMW ops:

The non-RMW ops are (typically) regular LOADs and STOREs and are canonically
implemented using READ_ONCE(), WRITE_ONCE(), smp_load_acquire() and
smp_store_release() respectively. Therefore, if you find yourself only using
the Non-RMW operations of atomic_t, you do not in fact need atomic_t at all
and are doing it wrong.

A note for the implementation of atomic_set{}() is that it must not break the
atomicity of the RMW ops. That is:

  C Atomic-RMW-ops-are-atomic-WRT-atomic_set

  {
    atomic_t v = ATOMIC_INIT(1);
  }

  P0(atomic_t *v)
  {
    (void)atomic_add_unless(v, 1, 0);
  }

  P1(atomic_t *v)
  {
    atomic_set(v, 0);
  }

  exists
  (v=2)

In this case we would expect the atomic_set() from CPU1 to either happen
before the atomic_add_unless(), in which case that latter one would no-op, or
_after_ in which case we'd overwrite its result. In no case is "2" a valid
outcome.

This is typically true on 'normal' platforms, where a regular competing STORE
will invalidate a LL/SC or fail a CMPXCHG.

The obvious case where this is not so is when we need to implement atomic ops
with a lock:

  CPU0						CPU1

  atomic_add_unless(v, 1, 0);
    lock();
    ret = READ_ONCE(v->counter); // == 1
						atomic_set(v, 0);
    if (ret != u)				  WRITE_ONCE(v->counter, 0);
      WRITE_ONCE(v->counter, ret + 1);
    unlock();

the typical solution is to then implement atomic_set{}() with atomic_xchg().
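
A minimal userspace sketch of such a lock-based implementation follows. It
assumes a C11 atomic_flag spinlock as the lock, and every sketch_ name is
hypothetical. Instead of routing the store through an xchg, it takes the lock
directly in the set operation, which has the same effect here: the store can
no longer land between the read and the write of a concurrent add_unless.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical lock-based atomic for a platform without native RMW ops. */
struct sketch_locked_atomic {
	atomic_flag lock;	/* stand-in for the lock protecting @counter */
	int counter;
};

static inline void sketch_lock(struct sketch_locked_atomic *v)
{
	while (atomic_flag_test_and_set_explicit(&v->lock, memory_order_acquire))
		;	/* spin until the flag was previously clear */
}

static inline void sketch_unlock(struct sketch_locked_atomic *v)
{
	atomic_flag_clear_explicit(&v->lock, memory_order_release);
}

/* Add @a to @v unless it currently equals @u; return true if added. */
static inline bool sketch_add_unless(struct sketch_locked_atomic *v, int a, int u)
{
	bool added;

	sketch_lock(v);
	added = (v->counter != u);
	if (added)
		v->counter += a;
	sketch_unlock(v);
	return added;
}

/* The store is serialized against every RMW op by taking the same lock. */
static inline void sketch_set(struct sketch_locked_atomic *v, int i)
{
	sketch_lock(v);
	v->counter = i;
	sketch_unlock(v);
}
```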


RMW ops:

These come in various forms:

 - plain operations without return value: atomic_{}()

 - operations which return the modified value: atomic_{}_return()

   these are limited to the arithmetic operations because those are
   reversible. Bitops are irreversible and therefore the modified value
   is of dubious utility.

 - operations which return the original value: atomic_fetch_{}()

 - swap operations: xchg(), cmpxchg() and try_cmpxchg()

 - misc; the special purpose operations that are commonly used and would,
   given the interface, normally be implemented using (try_)cmpxchg loops but
   are time critical and can, (typically) on LL/SC architectures, be more
   efficiently implemented.

All these operations are SMP atomic; that is, the operations (for a single
atomic variable) can be fully ordered and no intermediate state is lost or
visible.
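
The relation between the 'return' and 'fetch' forms can be sketched in
userspace C11, where only the fetch form exists natively. The two helpers
below are hypothetical names chosen for this illustration; they show why the
modified value is enough for arithmetic (the original can be recomputed) but
not for bitops.

```c
#include <stdatomic.h>

/* Returns the modified value: the C11 fetch form plus the operand. */
static inline int sketch_add_return(atomic_int *v, int i)
{
	return atomic_fetch_add(v, i) + i;
}

/*
 * Recovers the original value from the modified one. This only works
 * because addition is reversible; after an OR, a result of 11 could
 * have come from either 10 or 11, so no such helper exists for bitops.
 */
static inline int sketch_original_from_add_return(int modified, int i)
{
	return modified - i;
}
```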


ORDERING  (go read memory-barriers.txt first)
--------

The rule of thumb:

 - non-RMW operations are unordered;

 - RMW operations that have no return value are unordered;

 - RMW operations that have a return value are fully ordered;

 - RMW operations that are conditional are unordered on FAILURE,
   otherwise the above rules apply.

Except of course when an operation has an explicit ordering like:

 {}_relaxed: unordered
 {}_acquire: the R of the RMW (or atomic_read) is an ACQUIRE
 {}_release: the W of the RMW (or atomic_set)  is a  RELEASE

Where 'unordered' is against other memory locations. Address dependencies are
not defeated.

Fully ordered primitives are ordered against everything prior and everything
subsequent. Therefore a fully ordered primitive is like having an smp_mb()
before and an smp_mb() after the primitive.
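
To make the ACQUIRE/RELEASE annotations concrete, here is a small userspace
sketch using C11 explicit memory orders in place of atomic_set_release() and
atomic_read_acquire(). The struct and function names are hypothetical, and
the C11 memory model is not identical to the kernel's, so treat this as an
analogy only.

```c
#include <stdatomic.h>
#include <stdbool.h>

struct sketch_mailbox {
	int payload;		/* plain data, published via @ready */
	atomic_int ready;
};

static inline void sketch_publish(struct sketch_mailbox *m, int value)
{
	m->payload = value;
	/* RELEASE: the payload store is ordered before the flag store. */
	atomic_store_explicit(&m->ready, 1, memory_order_release);
}

static inline bool sketch_try_consume(struct sketch_mailbox *m, int *out)
{
	/* ACQUIRE: the payload load is ordered after the flag load. */
	if (!atomic_load_explicit(&m->ready, memory_order_acquire))
		return false;
	*out = m->payload;
	return true;
}
```

A consumer that observes the flag set is thereby guaranteed to observe the
payload written before it; with relaxed (unordered) accesses on the flag
there would be no such guarantee.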


The barriers:

  smp_mb__{before,after}_atomic()

only apply to the RMW atomic ops and can be used to augment/upgrade the
ordering inherent to the op. These barriers act almost like a full smp_mb():
smp_mb__before_atomic() orders all earlier accesses against the RMW op
itself and all accesses following it, and smp_mb__after_atomic() orders all
later accesses against the RMW op and all accesses preceding it. However,
accesses between the smp_mb__{before,after}_atomic() and the RMW op are not
ordered, so it is advisable to place the barrier right next to the RMW atomic
op whenever possible.

These helper barriers exist because architectures have varying implicit
ordering on their SMP atomic primitives. For example our TSO architectures
provide fully ordered atomics and these barriers are no-ops.

NOTE: when the atomic RMW ops are fully ordered, they should also imply a
compiler barrier.

Thus:

  atomic_fetch_add();

is equivalent to:

  smp_mb__before_atomic();
  atomic_fetch_add_relaxed();
  smp_mb__after_atomic();

However the atomic_fetch_add() might be implemented more efficiently.
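
The same shape can be written in userspace C11, with sequentially consistent
fences standing in for the smp_mb__{before,after}_atomic() pair. This is an
analogy under the assumption that a seq_cst fence plays a role comparable to
a full barrier; it is not how the kernel implements either form, and
sketch_fetch_add_full() is a hypothetical name.

```c
#include <stdatomic.h>

/* Relaxed RMW bracketed by full fences; returns the original value. */
static inline int sketch_fetch_add_full(atomic_int *v, int i)
{
	int old;

	atomic_thread_fence(memory_order_seq_cst);	/* role of smp_mb__before_atomic() */
	old = atomic_fetch_add_explicit(v, i, memory_order_relaxed);
	atomic_thread_fence(memory_order_seq_cst);	/* role of smp_mb__after_atomic() */
	return old;
}
```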

Further, while something like:

  smp_mb__before_atomic();
  atomic_dec(&X);

is a 'typical' RELEASE pattern, the barrier is strictly stronger than
a RELEASE because it orders preceding instructions against both the read
and write parts of the atomic_dec(), and against all following instructions
as well. Similarly, something like:

  atomic_inc(&X);
  smp_mb__after_atomic();

is an ACQUIRE pattern (though very much not typical), but again the barrier is
strictly stronger than ACQUIRE. As illustrated:

  C Atomic-RMW+mb__after_atomic-is-stronger-than-acquire

  {
  }

  P0(int *x, atomic_t *y)
  {
    r0 = READ_ONCE(*x);
    smp_rmb();
    r1 = atomic_read(y);
  }

  P1(int *x, atomic_t *y)
  {
    atomic_inc(y);
    smp_mb__after_atomic();
    WRITE_ONCE(*x, 1);
  }

  exists
  (0:r0=1 /\ 0:r1=0)

This should not happen; but a hypothetical atomic_inc_acquire() --
(void)atomic_fetch_inc_acquire() for instance -- would allow the outcome,
because it would not order the W part of the RMW against the following
WRITE_ONCE.  Thus:

  P0			P1

			t = LL.acq *y (0)
			t++;
			*x = 1;
  r0 = *x (1)
  RMB
  r1 = *y (0)
			SC *y, t;

is allowed.


CMPXCHG vs TRY_CMPXCHG
----------------------

  int atomic_cmpxchg(atomic_t *ptr, int old, int new);
  bool atomic_try_cmpxchg(atomic_t *ptr, int *oldp, int new);

Both provide the same functionality, but try_cmpxchg() can lead to more
compact code. The functions relate like:

  bool atomic_try_cmpxchg(atomic_t *ptr, int *oldp, int new)
  {
    int ret, old = *oldp;
    ret = atomic_cmpxchg(ptr, old, new);
    if (ret != old)
      *oldp = ret;
    return ret == old;
  }

and:

  int atomic_cmpxchg(atomic_t *ptr, int old, int new)
  {
    (void)atomic_try_cmpxchg(ptr, &old, new);
    return old;
  }

Usage:

  old = atomic_read(&v);			old = atomic_read(&v);
  for (;;) {					do {
    new = func(old);				  new = func(old);
    tmp = atomic_cmpxchg(&v, old, new);		} while (!atomic_try_cmpxchg(&v, &old, new));
    if (tmp == old)
      break;
    old = tmp;
  }

NB. try_cmpxchg() also generates better code on some platforms (notably x86)
where the function more closely matches the hardware instruction.
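
C11's atomic_compare_exchange_strong() has the same calling convention as
try_cmpxchg(): it returns a bool and refreshes the expected value on failure.
The userspace sketch below uses it for a do/while loop of the kind described
above; sketch_fetch_max() is a hypothetical helper in which func(old) is
simply "take @candidate if it is larger".

```c
#include <stdatomic.h>

/* Raise @v to at least @candidate; return the value seen before the update. */
static inline int sketch_fetch_max(atomic_int *v, int candidate)
{
	int old = atomic_load_explicit(v, memory_order_relaxed);

	do {
		if (old >= candidate)
			break;	/* nothing to do, @v is already large enough */
		/* On failure, @old is refreshed with the current value of @v. */
	} while (!atomic_compare_exchange_strong(v, &old, candidate));

	return old;
}
```

No temporary is needed to carry the observed value into the next iteration,
which is where the more compact code comes from.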


FORWARD PROGRESS
----------------

In general strong forward progress is expected of all unconditional atomic
operations -- those in the Arithmetic and Bitwise classes and xchg(). However
a fair amount of code also requires forward progress from the conditional
atomic operations.

Specifically 'simple' cmpxchg() loops are expected to not starve one another
indefinitely. However, this is not evident on LL/SC architectures, because
while an LL/SC architecture 'can/should/must' provide forward progress
guarantees between competing LL/SC sections, such a guarantee does not
transfer to cmpxchg() implemented using LL/SC. Consider:

  old = atomic_read(&v);
  do {
    new = func(old);
  } while (!atomic_try_cmpxchg(&v, &old, new));

which on LL/SC becomes something like:

  old = atomic_read(&v);
  do {
    new = func(old);
  } while (!({
    asm volatile ("1: LL  %[oldval], %[v]\n"
                  "   CMP %[oldval], %[old]\n"
                  "   BNE 2f\n"
                  "   SC  %[new], %[v]\n"
                  "   BNE 1b\n"
                  "2:\n"
                  : [oldval] "=&r" (oldval), [v] "m" (v)
                  : [old] "r" (old), [new] "r" (new)
                  : "memory");
    success = (oldval == old);
    if (!success)
      old = oldval;
    success; }));

However, even the forward branch from the failed compare can cause the LL/SC
to fail on some architectures, let alone whatever the compiler makes of the C
loop body. As a result there is no guarantee whatsoever that the cacheline
containing @v will stay on the local CPU and that progress is made.

Even native CAS architectures can fail to provide forward progress for their
primitive (see Sparc64 for an example).

Such implementations are strongly encouraged to add exponential backoff loops
to a failed CAS in order to ensure some progress. Affected architectures are
also strongly encouraged to inspect/audit the atomic fallbacks, refcount_t and
their locking primitives.
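
A minimal userspace sketch of a CAS loop with exponential backoff is given
below. It assumes C11 atomics, a busy-wait delay in place of an
architecture-specific relax instruction, and an arbitrary cap of 1024
iterations; none of these details, nor the sketch_ names, come from an
actual architecture implementation.

```c
#include <stdatomic.h>

#define SKETCH_BACKOFF_MAX 1024u	/* arbitrary cap, chosen for illustration */

static inline void sketch_spin(unsigned int iterations)
{
	/* volatile keeps the compiler from deleting the delay loop */
	for (volatile unsigned int i = 0; i < iterations; i++)
		;
}

/* OR @bits into @v with a CAS loop; return the value seen before the update. */
static inline int sketch_or_with_backoff(atomic_int *v, int bits)
{
	unsigned int delay = 1;
	int old = atomic_load_explicit(v, memory_order_relaxed);

	/* A failed CAS refreshes @old, so the new value is recomputed each try. */
	while (!atomic_compare_exchange_strong(v, &old, old | bits)) {
		sketch_spin(delay);	/* back off before retrying */
		if (delay < SKETCH_BACKOFF_MAX)
			delay *= 2;	/* double the wait, up to the cap */
	}
	return old;
}
```

The growing delay gives competing CPUs a widening window in which one of
them can complete its update, at the cost of extra latency under contention.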