..
  Copyright 1988-2022 Free Software Foundation, Inc.
  This is part of the GCC manual.
  For copying conditions, see the copyright.rst file.

.. _nvptx:

nvptx
*****

On the hardware side, there is the hierarchy (fine to coarse):

* thread

* warp

* thread block

* streaming multiprocessor

All OpenMP and OpenACC levels are used, i.e.

* OpenMP's simd and OpenACC's vector map to threads

* OpenMP's threads ('parallel') and OpenACC's workers map to warps

* OpenMP's teams and OpenACC's gangs use a threadpool with the
  size of the number of teams or gangs, respectively.
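
As an illustration of the mapping above, the following minimal sketch uses a
combined OpenMP construct that exercises all three levels (teams, threads and
simd) in one loop.  The function name ``saxpy_offload`` and the array size are
hypothetical and chosen only for this example; how the iterations are actually
distributed is up to the compiler and runtime.

.. code-block:: c

   #include <stdio.h>

   #define N 1024

   /* Hypothetical helper: computes y = a * x + y on the device and
      returns the number of elements that differ from the expected value.  */
   int saxpy_offload (float a)
   {
     static float x[N], y[N];
     for (int i = 0; i < N; i++)
       {
         x[i] = (float) i;
         y[i] = 1.0f;
       }

     /* Combined construct using the teams, parallel (threads) and simd
        levels described above.  */
     #pragma omp target teams distribute parallel for simd \
       map(to: x) map(tofrom: y)
     for (int i = 0; i < N; i++)
       y[i] = a * x[i] + y[i];

     int errors = 0;
     for (int i = 0; i < N; i++)
       if (y[i] != a * (float) i + 1.0f)
         errors++;
     return errors;
   }

   int main (void)
   {
     int errors = saxpy_offload (2.0f);
     printf ("errors: %d\n", errors);
     return errors != 0;
   }

When compiled without ``-fopenmp``, or when no offload device is available,
the loop simply runs on the host and produces the same result.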

The sizes used are:

* The ``warp_size`` is always 32

* CUDA kernel launched: ``dim={#teams,1,1}, blocks={#threads,warp_size,1}``.
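
The launch line above can be read as simple arithmetic.  The sketch below is
not part of libgomp; the ``launch_geometry`` structure and its helper
functions are hypothetical names used only to show how the number of teams,
the number of threads and the fixed ``warp_size`` combine.

.. code-block:: c

   #include <stdio.h>

   #define WARP_SIZE 32  /* Always 32, per the list above.  */

   /* Hypothetical container mirroring the launch line:
      dim={#teams,1,1}, blocks={#threads,warp_size,1}.  */
   struct launch_geometry
   {
     unsigned dim[3];
     unsigned blocks[3];
   };

   struct launch_geometry make_geometry (unsigned teams, unsigned threads)
   {
     struct launch_geometry g = { { teams, 1, 1 }, { threads, WARP_SIZE, 1 } };
     return g;
   }

   /* Product of all six extents, i.e. the number of hardware threads
      that the launch covers.  */
   unsigned long total_hw_threads (const struct launch_geometry *g)
   {
     unsigned long n = 1;
     for (int i = 0; i < 3; i++)
       n *= (unsigned long) g->dim[i] * g->blocks[i];
     return n;
   }

   int main (void)
   {
     struct launch_geometry g = make_geometry (4, 8);
     printf ("dim={%u,%u,%u}, blocks={%u,%u,%u}, hardware threads=%lu\n",
             g.dim[0], g.dim[1], g.dim[2],
             g.blocks[0], g.blocks[1], g.blocks[2],
             total_hw_threads (&g));
     return 0;
   }

For 4 teams of 8 threads each, this prints
``dim={4,1,1}, blocks={8,32,1}, hardware threads=1024``.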

Additional information can be obtained by setting the environment variable
``GOMP_DEBUG=1`` (very verbose; grep for ``kernel.*launch`` to find the launch
parameters).

GCC generates generic PTX ISA code, which CUDA compiles just in time.  CUDA
caches the JIT-compiled code in the user's directory (see the CUDA
documentation); the cache can be tuned with the environment variables
``CUDA_CACHE_{DISABLE,MAXSIZE,PATH}``.

Note: While the PTX ISA is generic, the ``-mptx=`` and ``-march=`` command-line
options still affect the PTX ISA code that is generated and, thus, the
requirements on the CUDA version and hardware.

Implementation remarks:

* I/O within OpenMP target regions and OpenACC parallel/kernels regions is
  supported using the C library ``printf`` functions.  Note that the Fortran
  ``print`` / ``write`` statements are not yet supported.

* Compiling OpenMP code that contains ``requires reverse_offload``
  requires at least ``-march=sm_35``; compiling for ``-march=sm_30``
  is not supported.
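
To illustrate the first remark, here is a minimal sketch of calling ``printf``
from inside an OpenMP target region.  The function name
``report_from_target`` is hypothetical and used only for this example.

.. code-block:: c

   #include <stdio.h>

   /* Hypothetical helper: prints from inside the target region and
      returns value + 1, computed in that region.  */
   int report_from_target (int value)
   {
     int result = 0;

     #pragma omp target map(tofrom: result)
     {
       /* I/O in the target region goes through the C library printf.  */
       printf ("in target region: value = %d\n", value);
       result = value + 1;
     }

     return result;
   }

   int main (void)
   {
     int r = report_from_target (41);
     printf ("back on host: result = %d\n", r);
     return r == 42 ? 0 : 1;
   }

If no offload device is available, or the code is compiled without
``-fopenmp``, the region runs on the host and the output is the same.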

.. -
   The libgomp ABI
   -