@copying
-Copyright @copyright{} 2006-2015 Free Software Foundation, Inc.
+Copyright @copyright{} 2006-2019 Free Software Foundation, Inc.
Permission is granted to copy, distribute and/or modify this document
under the terms of the GNU Free Documentation License, Version 1.3 or
This manual documents the usage of libgomp, the GNU Offloading and
Multi Processing Runtime Library. This includes the GNU
-implementation of the @uref{http://www.openmp.org, OpenMP} Application
+implementation of the @uref{https://www.openmp.org, OpenMP} Application
Programming Interface (API) for multi-platform shared-memory parallel
programming in C/C++ and Fortran, and the GNU implementation of the
-@uref{http://www.openacc.org/, OpenACC} Application Programming
+@uref{https://www.openacc.org, OpenACC} Application Programming
Interface (API) for offloading of code to accelerator devices in C/C++
and Fortran.
@comment
@menu
* Enabling OpenMP:: How to enable OpenMP for your applications.
-* Runtime Library Routines:: The OpenMP runtime application programming
+* OpenMP Runtime Library Routines: Runtime Library Routines.
+ The OpenMP runtime application programming
interface.
-* Environment Variables:: Influencing runtime behavior with environment
- variables.
+* OpenMP Environment Variables: Environment Variables.
+ Influencing OpenMP runtime behavior with
+ environment variables.
+* Enabling OpenACC:: How to enable OpenACC for your
+ applications.
+* OpenACC Runtime Library Routines:: The OpenACC runtime application
+ programming interface.
+* OpenACC Environment Variables:: Influencing OpenACC runtime behavior with
+ environment variables.
+* CUDA Streams Usage:: Notes on the implementation of
+ asynchronous operations.
+* OpenACC Library Interoperability:: OpenACC library interoperability with the
+ NVIDIA CUBLAS library.
+* OpenACC Profiling Interface::
* The libgomp ABI:: Notes on the external ABI presented by libgomp.
* Reporting Bugs:: How to report bugs in the GNU Offloading and
Multi Processing Runtime Library.
(@ref{Runtime Library Routines}).
A complete description of all OpenMP directives accepted may be found in
-the @uref{http://www.openmp.org, OpenMP Application Program Interface} manual,
-version 4.0.
+the @uref{https://www.openmp.org, OpenMP Application Program Interface} manual,
+version 4.5.
@c ---------------------------------------------------------------------
-@c Runtime Library Routines
+@c OpenMP Runtime Library Routines
@c ---------------------------------------------------------------------
@node Runtime Library Routines
-@chapter Runtime Library Routines
+@chapter OpenMP Runtime Library Routines
The runtime routines described here are defined by Section 3 of the OpenMP
-specification in version 4.0. The routines are structured in following
+specification in version 4.5. The routines are structured in following
three parts:
@menu
* omp_get_dynamic:: Dynamic teams setting
* omp_get_level:: Number of parallel regions
* omp_get_max_active_levels:: Maximum number of active regions
+* omp_get_max_task_priority:: Maximum task priority value that can be set
* omp_get_max_threads:: Maximum number of threads of parallel region
* omp_get_nested:: Nested parallel regions
* omp_get_num_devices:: Number of target devices
@ref{omp_get_level}, @ref{omp_get_max_active_levels}, @ref{omp_set_max_active_levels}
@item @emph{Reference}:
-@uref{http://www.openmp.org/, OpenMP specification v4.0}, Section 3.2.20.
+@uref{https://www.openmp.org, OpenMP specification v4.5}, Section 3.2.20.
@end table
@ref{omp_get_level}, @ref{omp_get_thread_num}, @ref{omp_get_team_size}
@item @emph{Reference}:
-@uref{http://www.openmp.org/, OpenMP specification v4.0}, Section 3.2.18.
+@uref{https://www.openmp.org, OpenMP specification v4.5}, Section 3.2.18.
@end table
@ref{OMP_CANCELLATION}
@item @emph{Reference}:
-@uref{http://www.openmp.org/, OpenMP specification v4.0}, Section 3.2.9.
+@uref{https://www.openmp.org, OpenMP specification v4.5}, Section 3.2.9.
@end table
@ref{OMP_DEFAULT_DEVICE}, @ref{omp_set_default_device}
@item @emph{Reference}:
-@uref{http://www.openmp.org/, OpenMP specification v4.0}, Section 3.2.24.
+@uref{https://www.openmp.org, OpenMP specification v4.5}, Section 3.2.30.
@end table
@ref{omp_set_dynamic}, @ref{OMP_DYNAMIC}
@item @emph{Reference}:
-@uref{http://www.openmp.org/, OpenMP specification v4.0}, Section 3.2.8.
+@uref{https://www.openmp.org, OpenMP specification v4.5}, Section 3.2.8.
@end table
@ref{omp_get_active_level}
@item @emph{Reference}:
-@uref{http://www.openmp.org/, OpenMP specification v4.0}, Section 3.2.17.
+@uref{https://www.openmp.org, OpenMP specification v4.5}, Section 3.2.17.
@end table
@ref{omp_set_max_active_levels}, @ref{omp_get_active_level}
@item @emph{Reference}:
-@uref{http://www.openmp.org/, OpenMP specification v4.0}, Section 3.2.16.
+@uref{https://www.openmp.org, OpenMP specification v4.5}, Section 3.2.16.
@end table
+@node omp_get_max_task_priority
+@section @code{omp_get_max_task_priority} -- Maximum priority value
+that can be set for tasks.
+@table @asis
+@item @emph{Description}:
+This function obtains the maximum allowed priority number for tasks.
+
+@item @emph{C/C++}
+@multitable @columnfractions .20 .80
+@item @emph{Prototype}: @tab @code{int omp_get_max_task_priority(void);}
+@end multitable
+
+@item @emph{Fortran}:
+@multitable @columnfractions .20 .80
+@item @emph{Interface}: @tab @code{integer function omp_get_max_task_priority()}
+@end multitable
+
+@item @emph{Reference}:
+@uref{https://www.openmp.org, OpenMP specification v4.5}, Section 3.2.29.
+@end table
+
@node omp_get_max_threads
@section @code{omp_get_max_threads} -- Maximum number of threads of parallel region
@ref{omp_set_num_threads}, @ref{omp_set_dynamic}, @ref{omp_get_thread_limit}
@item @emph{Reference}:
-@uref{http://www.openmp.org/, OpenMP specification v4.0}, Section 3.2.3.
+@uref{https://www.openmp.org, OpenMP specification v4.5}, Section 3.2.3.
@end table
@ref{omp_set_nested}, @ref{OMP_NESTED}
@item @emph{Reference}:
-@uref{http://www.openmp.org/, OpenMP specification v4.0}, Section 3.2.11.
+@uref{https://www.openmp.org, OpenMP specification v4.5}, Section 3.2.11.
@end table
@end multitable
@item @emph{Reference}:
-@uref{http://www.openmp.org/, OpenMP specification v4.0}, Section 3.2.25.
+@uref{https://www.openmp.org, OpenMP specification v4.5}, Section 3.2.31.
@end table
@end multitable
@item @emph{Reference}:
-@uref{http://www.openmp.org/, OpenMP specification v4.0}, Section 3.2.5.
+@uref{https://www.openmp.org, OpenMP specification v4.5}, Section 3.2.5.
@end table
@end multitable
@item @emph{Reference}:
-@uref{http://www.openmp.org/, OpenMP specification v4.0}, Section 3.2.26.
+@uref{https://www.openmp.org, OpenMP specification v4.5}, Section 3.2.32.
@end table
@ref{omp_get_max_threads}, @ref{omp_set_num_threads}, @ref{OMP_NUM_THREADS}
@item @emph{Reference}:
-@uref{http://www.openmp.org/, OpenMP specification v4.0}, Section 3.2.2.
+@uref{https://www.openmp.org, OpenMP specification v4.5}, Section 3.2.2.
@end table
@ref{OMP_PROC_BIND}, @ref{OMP_PLACES}, @ref{GOMP_CPU_AFFINITY},
@item @emph{Reference}:
-@uref{http://www.openmp.org/, OpenMP specification v4.0}, Section 3.2.22.
+@uref{https://www.openmp.org, OpenMP specification v4.5}, Section 3.2.22.
@end table
Obtain the runtime scheduling method. The @var{kind} argument will be
set to the value @code{omp_sched_static}, @code{omp_sched_dynamic},
@code{omp_sched_guided} or @code{omp_sched_auto}. The second argument,
-@var{modifier}, is set to the chunk size.
+@var{chunk_size}, is set to the chunk size.
@item @emph{C/C++}
@multitable @columnfractions .20 .80
-@item @emph{Prototype}: @tab @code{void omp_get_schedule(omp_sched_t *kind, int *modifier);}
+@item @emph{Prototype}: @tab @code{void omp_get_schedule(omp_sched_t *kind, int *chunk_size);}
@end multitable
@item @emph{Fortran}:
@multitable @columnfractions .20 .80
-@item @emph{Interface}: @tab @code{subroutine omp_get_schedule(kind, modifier)}
+@item @emph{Interface}: @tab @code{subroutine omp_get_schedule(kind, chunk_size)}
@item @tab @code{integer(kind=omp_sched_kind) kind}
-@item @tab @code{integer modifier}
+@item @tab @code{integer chunk_size}
@end multitable
@item @emph{See also}:
@ref{omp_set_schedule}, @ref{OMP_SCHEDULE}
@item @emph{Reference}:
-@uref{http://www.openmp.org/, OpenMP specification v4.0}, Section 3.2.13.
+@uref{https://www.openmp.org, OpenMP specification v4.5}, Section 3.2.13.
@end table
@end multitable
@item @emph{Reference}:
-@uref{http://www.openmp.org/, OpenMP specification v4.0}, Section 3.2.27.
+@uref{https://www.openmp.org, OpenMP specification v4.5}, Section 3.2.33.
@end table
@ref{omp_get_num_threads}, @ref{omp_get_level}, @ref{omp_get_ancestor_thread_num}
@item @emph{Reference}:
-@uref{http://www.openmp.org/, OpenMP specification v4.0}, Section 3.2.19.
+@uref{https://www.openmp.org, OpenMP specification v4.5}, Section 3.2.19.
@end table
@ref{omp_get_max_threads}, @ref{OMP_THREAD_LIMIT}
@item @emph{Reference}:
-@uref{http://www.openmp.org/, OpenMP specification v4.0}, Section 3.2.14.
+@uref{https://www.openmp.org, OpenMP specification v4.5}, Section 3.2.14.
@end table
@ref{omp_get_num_threads}, @ref{omp_get_ancestor_thread_num}
@item @emph{Reference}:
-@uref{http://www.openmp.org/, OpenMP specification v4.0}, Section 3.2.4.
+@uref{https://www.openmp.org, OpenMP specification v4.5}, Section 3.2.4.
@end table
@end multitable
@item @emph{Reference}:
-@uref{http://www.openmp.org/, OpenMP specification v4.0}, Section 3.2.6.
+@uref{https://www.openmp.org, OpenMP specification v4.5}, Section 3.2.6.
@end table
@end multitable
@item @emph{Reference}:
-@uref{http://www.openmp.org/, OpenMP specification v4.0}, Section 3.2.21.
+@uref{https://www.openmp.org, OpenMP specification v4.5}, Section 3.2.21.
@end table
@end multitable
@item @emph{Reference}:
-@uref{http://www.openmp.org/, OpenMP specification v4.0}, Section 3.2.28.
+@uref{https://www.openmp.org, OpenMP specification v4.5}, Section 3.2.34.
@end table
@ref{OMP_DEFAULT_DEVICE}, @ref{omp_get_default_device}
@item @emph{Reference}:
-@uref{http://www.openmp.org/, OpenMP specification v4.0}, Section 3.2.23.
+@uref{https://www.openmp.org, OpenMP specification v4.5}, Section 3.2.29.
@end table
@ref{OMP_DYNAMIC}, @ref{omp_get_dynamic}
@item @emph{Reference}:
-@uref{http://www.openmp.org/, OpenMP specification v4.0}, Section 3.2.7.
+@uref{https://www.openmp.org, OpenMP specification v4.5}, Section 3.2.7.
@end table
@ref{omp_get_max_active_levels}, @ref{omp_get_active_level}
@item @emph{Reference}:
-@uref{http://www.openmp.org/, OpenMP specification v4.0}, Section 3.2.15.
+@uref{https://www.openmp.org, OpenMP specification v4.5}, Section 3.2.15.
@end table
@ref{OMP_NESTED}, @ref{omp_get_nested}
@item @emph{Reference}:
-@uref{http://www.openmp.org/, OpenMP specification v4.0}, Section 3.2.10.
+@uref{https://www.openmp.org, OpenMP specification v4.5}, Section 3.2.10.
@end table
@ref{OMP_NUM_THREADS}, @ref{omp_get_num_threads}, @ref{omp_get_max_threads}
@item @emph{Reference}:
-@uref{http://www.openmp.org/, OpenMP specification v4.0}, Section 3.2.1.
+@uref{https://www.openmp.org, OpenMP specification v4.5}, Section 3.2.1.
@end table
value @code{omp_sched_static}, @code{omp_sched_dynamic},
@code{omp_sched_guided} or @code{omp_sched_auto}. Except for
@code{omp_sched_auto}, the chunk size is set to the value of
-@var{modifier} if positive, or to the default value if zero or negative.
-For @code{omp_sched_auto} the @var{modifier} argument is ignored.
+@var{chunk_size} if positive, or to the default value if zero or negative.
+For @code{omp_sched_auto} the @var{chunk_size} argument is ignored.
@item @emph{C/C++}
@multitable @columnfractions .20 .80
-@item @emph{Prototype}: @tab @code{void omp_set_schedule(omp_sched_t kind, int modifier);}
+@item @emph{Prototype}: @tab @code{void omp_set_schedule(omp_sched_t kind, int chunk_size);}
@end multitable
@item @emph{Fortran}:
@multitable @columnfractions .20 .80
-@item @emph{Interface}: @tab @code{subroutine omp_set_schedule(kind, modifier)}
+@item @emph{Interface}: @tab @code{subroutine omp_set_schedule(kind, chunk_size)}
@item @tab @code{integer(kind=omp_sched_kind) kind}
-@item @tab @code{integer modifier}
+@item @tab @code{integer chunk_size}
@end multitable
@item @emph{See also}:
@ref{OMP_SCHEDULE}
@item @emph{Reference}:
-@uref{http://www.openmp.org/, OpenMP specification v4.0}, Section 3.2.12.
+@uref{https://www.openmp.org, OpenMP specification v4.5}, Section 3.2.12.
@end table
@ref{omp_destroy_lock}
@item @emph{Reference}:
-@uref{http://www.openmp.org/, OpenMP specification v4.0}, Section 3.3.1.
+@uref{https://www.openmp.org, OpenMP specification v4.5}, Section 3.3.1.
@end table
@ref{omp_init_lock}, @ref{omp_test_lock}, @ref{omp_unset_lock}
@item @emph{Reference}:
-@uref{http://www.openmp.org/, OpenMP specification v4.0}, Section 3.3.3.
+@uref{https://www.openmp.org, OpenMP specification v4.5}, Section 3.3.4.
@end table
@ref{omp_init_lock}, @ref{omp_set_lock}, @ref{omp_set_lock}
@item @emph{Reference}:
-@uref{http://www.openmp.org/, OpenMP specification v4.0}, Section 3.3.5.
+@uref{https://www.openmp.org, OpenMP specification v4.5}, Section 3.3.6.
@end table
@ref{omp_set_lock}, @ref{omp_test_lock}
@item @emph{Reference}:
-@uref{http://www.openmp.org/, OpenMP specification v4.0}, Section 3.3.4.
+@uref{https://www.openmp.org, OpenMP specification v4.5}, Section 3.3.5.
@end table
@ref{omp_init_lock}
@item @emph{Reference}:
-@uref{http://www.openmp.org/, OpenMP specification v4.0}, Section 3.3.2.
+@uref{https://www.openmp.org, OpenMP specification v4.5}, Section 3.3.3.
@end table
@ref{omp_destroy_nest_lock}
@item @emph{Reference}:
-@uref{http://www.openmp.org/, OpenMP specification v4.0}, Section 3.3.1.
+@uref{https://www.openmp.org, OpenMP specification v4.5}, Section 3.3.1.
@end table
@ref{omp_init_nest_lock}, @ref{omp_unset_nest_lock}
@item @emph{Reference}:
-@uref{http://www.openmp.org/, OpenMP specification v4.0}, Section 3.3.3.
+@uref{https://www.openmp.org, OpenMP specification v4.5}, Section 3.3.4.
@end table
@ref{omp_init_lock}, @ref{omp_set_lock}, @ref{omp_set_lock}
@item @emph{Reference}:
-@uref{http://www.openmp.org/, OpenMP specification v4.0}, Section 3.3.5.
+@uref{https://www.openmp.org, OpenMP specification v4.5}, Section 3.3.6.
@end table
@ref{omp_set_nest_lock}
@item @emph{Reference}:
-@uref{http://www.openmp.org/, OpenMP specification v4.0}, Section 3.3.4.
+@uref{https://www.openmp.org, OpenMP specification v4.5}, Section 3.3.5.
@end table
@ref{omp_init_lock}
@item @emph{Reference}:
-@uref{http://www.openmp.org/, OpenMP specification v4.0}, Section 3.3.2.
+@uref{https://www.openmp.org, OpenMP specification v4.5}, Section 3.3.3.
@end table
@ref{omp_get_wtime}
@item @emph{Reference}:
-@uref{http://www.openmp.org/, OpenMP specification v4.0}, Section 3.4.2.
+@uref{https://www.openmp.org, OpenMP specification v4.5}, Section 3.4.2.
@end table
@ref{omp_get_wtick}
@item @emph{Reference}:
-@uref{http://www.openmp.org/, OpenMP specification v4.0}, Section 3.4.1.
+@uref{https://www.openmp.org, OpenMP specification v4.5}, Section 3.4.1.
@end table
@c ---------------------------------------------------------------------
-@c Environment Variables
+@c OpenMP Environment Variables
@c ---------------------------------------------------------------------
@node Environment Variables
-@chapter Environment Variables
+@chapter OpenMP Environment Variables
The environment variables which beginning with @env{OMP_} are defined by
-section 4 of the OpenMP specification in version 4.0, while those
+section 4 of the OpenMP specification in version 4.5, while those
beginning with @env{GOMP_} are GNU extensions.
@menu
-* OMP_CANCELLATION:: Set whether cancellation is activated
-* OMP_DISPLAY_ENV:: Show OpenMP version and environment variables
-* OMP_DEFAULT_DEVICE:: Set the device used in target regions
-* OMP_DYNAMIC:: Dynamic adjustment of threads
-* OMP_MAX_ACTIVE_LEVELS:: Set the maximum number of nested parallel regions
-* OMP_NESTED:: Nested parallel regions
-* OMP_NUM_THREADS:: Specifies the number of threads to use
-* OMP_PROC_BIND:: Whether theads may be moved between CPUs
-* OMP_PLACES:: Specifies on which CPUs the theads should be placed
-* OMP_STACKSIZE:: Set default thread stack size
-* OMP_SCHEDULE:: How threads are scheduled
-* OMP_THREAD_LIMIT:: Set the maximum number of threads
-* OMP_WAIT_POLICY:: How waiting threads are handled
-* GOMP_CPU_AFFINITY:: Bind threads to specific CPUs
-* GOMP_DEBUG:: Enable debugging output
-* GOMP_STACKSIZE:: Set default thread stack size
-* GOMP_SPINCOUNT:: Set the busy-wait spin count
+* OMP_CANCELLATION:: Set whether cancellation is activated
+* OMP_DISPLAY_ENV:: Show OpenMP version and environment variables
+* OMP_DEFAULT_DEVICE:: Set the device used in target regions
+* OMP_DYNAMIC:: Dynamic adjustment of threads
+* OMP_MAX_ACTIVE_LEVELS:: Set the maximum number of nested parallel regions
+* OMP_MAX_TASK_PRIORITY:: Set the maximum task priority value
+* OMP_NESTED:: Nested parallel regions
+* OMP_NUM_THREADS:: Specifies the number of threads to use
+* OMP_PROC_BIND:: Whether theads may be moved between CPUs
+* OMP_PLACES:: Specifies on which CPUs the theads should be placed
+* OMP_STACKSIZE:: Set default thread stack size
+* OMP_SCHEDULE:: How threads are scheduled
+* OMP_THREAD_LIMIT:: Set the maximum number of threads
+* OMP_WAIT_POLICY:: How waiting threads are handled
+* GOMP_CPU_AFFINITY:: Bind threads to specific CPUs
+* GOMP_DEBUG:: Enable debugging output
+* GOMP_STACKSIZE:: Set default thread stack size
+* GOMP_SPINCOUNT:: Set the busy-wait spin count
+* GOMP_RTEMS_THREAD_POOLS:: Set the RTEMS specific thread pools
@end menu
@ref{omp_get_cancellation}
@item @emph{Reference}:
-@uref{http://www.openmp.org/, OpenMP specification v4.0}, Section 4.11
+@uref{https://www.openmp.org, OpenMP specification v4.5}, Section 4.11
@end table
@item @emph{Reference}:
-@uref{http://www.openmp.org/, OpenMP specification v4.0}, Section 4.12
+@uref{https://www.openmp.org, OpenMP specification v4.5}, Section 4.12
@end table
@ref{omp_get_default_device}, @ref{omp_set_default_device},
@item @emph{Reference}:
-@uref{http://www.openmp.org/, OpenMP specification v4.0}, Section 4.11
+@uref{https://www.openmp.org, OpenMP specification v4.5}, Section 4.13
@end table
@ref{omp_set_dynamic}
@item @emph{Reference}:
-@uref{http://www.openmp.org/, OpenMP specification v4.0}, Section 4.3
+@uref{https://www.openmp.org, OpenMP specification v4.5}, Section 4.3
@end table
@ref{omp_set_max_active_levels}
@item @emph{Reference}:
-@uref{http://www.openmp.org/, OpenMP specification v4.0}, Section 4.9
+@uref{https://www.openmp.org, OpenMP specification v4.5}, Section 4.9
+@end table
+
+
+
+@node OMP_MAX_TASK_PRIORITY
+@section @env{OMP_MAX_TASK_PRIORITY} -- Set the maximum priority
+number that can be set for a task.
+@cindex Environment Variable
+@table @asis
+@item @emph{Description}:
+Specifies the initial value for the maximum priority value that can be
+set for a task. The value of this variable shall be a non-negative
+integer, and zero is allowed. If undefined, the default priority is
+0.
+
+@item @emph{See also}:
+@ref{omp_get_max_task_priority}
+
+@item @emph{Reference}:
+@uref{https://www.openmp.org, OpenMP specification v4.5}, Section 4.14
@end table
@ref{omp_set_nested}
@item @emph{Reference}:
-@uref{http://www.openmp.org/, OpenMP specification v4.0}, Section 4.6
+@uref{https://www.openmp.org, OpenMP specification v4.5}, Section 4.6
@end table
@ref{omp_set_num_threads}
@item @emph{Reference}:
-@uref{http://www.openmp.org/, OpenMP specification v4.0}, Section 4.2
+@uref{https://www.openmp.org, OpenMP specification v4.5}, Section 4.2
@end table
@ref{OMP_PLACES}, @ref{GOMP_CPU_AFFINITY}, @ref{omp_get_proc_bind}
@item @emph{Reference}:
-@uref{http://www.openmp.org/, OpenMP specification v4.0}, Section 4.4
+@uref{https://www.openmp.org, OpenMP specification v4.5}, Section 4.4
@end table
@ref{OMP_DISPLAY_ENV}
@item @emph{Reference}:
-@uref{http://www.openmp.org/, OpenMP specification v4.0}, Section 4.5
+@uref{https://www.openmp.org, OpenMP specification v4.5}, Section 4.5
@end table
dependent.
@item @emph{Reference}:
-@uref{http://www.openmp.org/, OpenMP specification v4.0}, Section 4.7
+@uref{https://www.openmp.org, OpenMP specification v4.5}, Section 4.7
@end table
@ref{omp_set_schedule}
@item @emph{Reference}:
-@uref{http://www.openmp.org/, OpenMP specification v4.0}, Sections 2.7.1 and 4.1
+@uref{https://www.openmp.org, OpenMP specification v4.5}, Sections 2.7.1.1 and 4.1
@end table
@ref{OMP_NUM_THREADS}, @ref{omp_get_thread_limit}
@item @emph{Reference}:
-@uref{http://www.openmp.org/, OpenMP specification v4.0}, Section 4.10
+@uref{https://www.openmp.org, OpenMP specification v4.5}, Section 4.10
@end table
@ref{GOMP_SPINCOUNT}
@item @emph{Reference}:
-@uref{http://www.openmp.org/, OpenMP specification v4.0}, Section 4.8
+@uref{https://www.openmp.org, OpenMP specification v4.5}, Section 4.8
@end table
-@c ---------------------------------------------------------------------
-@c The libgomp ABI
-@c ---------------------------------------------------------------------
+@node GOMP_RTEMS_THREAD_POOLS
+@section @env{GOMP_RTEMS_THREAD_POOLS} -- Set the RTEMS specific thread pools
+@cindex Environment Variable
+@cindex Implementation specific setting
+@table @asis
+@item @emph{Description}:
+This environment variable is only used on the RTEMS real-time operating system.
+It determines the scheduler instance specific thread pools. The format for
+@env{GOMP_RTEMS_THREAD_POOLS} is a list of optional
+@code{<thread-pool-count>[$<priority>]@@<scheduler-name>} configurations
+separated by @code{:} where:
+@itemize @bullet
+@item @code{<thread-pool-count>} is the thread pool count for this scheduler
+instance.
+@item @code{$<priority>} is an optional priority for the worker threads of a
+thread pool according to @code{pthread_setschedparam}. In case a priority
+value is omitted, then a worker thread will inherit the priority of the OpenMP
+master thread that created it. The priority of the worker thread is not
+changed after creation, even if a new OpenMP master thread using the worker has
+a different priority.
+@item @code{@@<scheduler-name>} is the scheduler instance name according to the
+RTEMS application configuration.
+@end itemize
+In case no thread pool configuration is specified for a scheduler instance,
+then each OpenMP master thread of this scheduler instance will use its own
+dynamically allocated thread pool. To limit the worker thread count of the
+thread pools, each OpenMP master thread must call @code{omp_set_num_threads}.
+@item @emph{Example}:
+Lets suppose we have three scheduler instances @code{IO}, @code{WRK0}, and
+@code{WRK1} with @env{GOMP_RTEMS_THREAD_POOLS} set to
+@code{"1@@WRK0:3$4@@WRK1"}. Then there are no thread pool restrictions for
+scheduler instance @code{IO}. In the scheduler instance @code{WRK0} there is
+one thread pool available. Since no priority is specified for this scheduler
+instance, the worker thread inherits the priority of the OpenMP master thread
+that created it. In the scheduler instance @code{WRK1} there are three thread
+pools available and their worker threads run at priority four.
+@end table
-@node The libgomp ABI
-@chapter The libgomp ABI
-The following sections present notes on the external ABI as
-presented by libgomp. Only maintainers should need them.
-@menu
-* Implementing MASTER construct::
-* Implementing CRITICAL construct::
-* Implementing ATOMIC construct::
-* Implementing FLUSH construct::
-* Implementing BARRIER construct::
-* Implementing THREADPRIVATE construct::
-* Implementing PRIVATE clause::
-* Implementing FIRSTPRIVATE LASTPRIVATE COPYIN and COPYPRIVATE clauses::
-* Implementing REDUCTION clause::
-* Implementing PARALLEL construct::
-* Implementing FOR construct::
-* Implementing ORDERED construct::
-* Implementing SECTIONS construct::
-* Implementing SINGLE construct::
-@end menu
+@c ---------------------------------------------------------------------
+@c Enabling OpenACC
+@c ---------------------------------------------------------------------
+@node Enabling OpenACC
+@chapter Enabling OpenACC
-@node Implementing MASTER construct
-@section Implementing MASTER construct
+To activate the OpenACC extensions for C/C++ and Fortran, the compile-time
+flag @option{-fopenacc} must be specified. This enables the OpenACC directive
+@code{#pragma acc} in C/C++ and @code{!$accp} directives in free form,
+@code{c$acc}, @code{*$acc} and @code{!$acc} directives in fixed form,
+@code{!$} conditional compilation sentinels in free form and @code{c$},
+@code{*$} and @code{!$} sentinels in fixed form, for Fortran. The flag also
+arranges for automatic linking of the OpenACC runtime library
+(@ref{OpenACC Runtime Library Routines}).
-@smallexample
-if (omp_get_thread_num () == 0)
- block
-@end smallexample
+A complete description of all OpenACC directives accepted may be found in
+the @uref{https://www.openacc.org, OpenACC} Application Programming
+Interface manual, version 2.0.
-Alternately, we generate two copies of the parallel subfunction
-and only include this in the version run by the master thread.
-Surely this is not worthwhile though...
+Note that this is an experimental feature and subject to
+change in future versions of GCC. See
+@uref{https://gcc.gnu.org/wiki/OpenACC} for more information.
-@node Implementing CRITICAL construct
-@section Implementing CRITICAL construct
+@c ---------------------------------------------------------------------
+@c OpenACC Runtime Library Routines
+@c ---------------------------------------------------------------------
-Without a specified name,
+@node OpenACC Runtime Library Routines
+@chapter OpenACC Runtime Library Routines
-@smallexample
- void GOMP_critical_start (void);
- void GOMP_critical_end (void);
-@end smallexample
+The runtime routines described here are defined by section 3 of the OpenACC
+specifications in version 2.0.
+They have C linkage, and do not throw exceptions.
+Generally, they are available only for the host, with the exception of
+@code{acc_on_device}, which is available for both the host and the
+acceleration device.
-so that we don't get COPY relocations from libgomp to the main
-application.
+@menu
+* acc_get_num_devices:: Get number of devices for the given device
+ type.
+* acc_set_device_type:: Set type of device accelerator to use.
+* acc_get_device_type:: Get type of device accelerator to be used.
+* acc_set_device_num:: Set device number to use.
+* acc_get_device_num:: Get device number to be used.
+* acc_async_test:: Tests for completion of a specific asynchronous
+ operation.
+* acc_async_test_all:: Tests for completion of all asychronous
+ operations.
+* acc_wait:: Wait for completion of a specific asynchronous
+ operation.
+* acc_wait_all:: Waits for completion of all asyncrhonous
+ operations.
+* acc_wait_all_async:: Wait for completion of all asynchronous
+ operations.
+* acc_wait_async:: Wait for completion of asynchronous operations.
+* acc_init:: Initialize runtime for a specific device type.
+* acc_shutdown:: Shuts down the runtime for a specific device
+ type.
+* acc_on_device:: Whether executing on a particular device
+* acc_malloc:: Allocate device memory.
+* acc_free:: Free device memory.
+* acc_copyin:: Allocate device memory and copy host memory to
+ it.
+* acc_present_or_copyin:: If the data is not present on the device,
+ allocate device memory and copy from host
+ memory.
+* acc_create:: Allocate device memory and map it to host
+ memory.
+* acc_present_or_create:: If the data is not present on the device,
+ allocate device memory and map it to host
+ memory.
+* acc_copyout:: Copy device memory to host memory.
+* acc_delete:: Free device memory.
+* acc_update_device:: Update device memory from mapped host memory.
+* acc_update_self:: Update host memory from mapped device memory.
+* acc_map_data:: Map previously allocated device memory to host
+ memory.
+* acc_unmap_data:: Unmap device memory from host memory.
+* acc_deviceptr:: Get device pointer associated with specific
+ host address.
+* acc_hostptr:: Get host pointer associated with specific
+ device address.
+* acc_is_present:: Indiciate whether host variable / array is
+ present on device.
+* acc_memcpy_to_device:: Copy host memory to device memory.
+* acc_memcpy_from_device:: Copy device memory to host memory.
+
+API routines for target platforms.
+
+* acc_get_current_cuda_device:: Get CUDA device handle.
+* acc_get_current_cuda_context::Get CUDA context handle.
+* acc_get_cuda_stream:: Get CUDA stream handle.
+* acc_set_cuda_stream:: Set CUDA stream handle.
+
+API routines for the OpenACC Profiling Interface.
+
+* acc_prof_register:: Register callbacks.
+* acc_prof_unregister:: Unregister callbacks.
+* acc_prof_lookup:: Obtain inquiry functions.
+* acc_register_library:: Library registration.
+@end menu
-With a specified name, use omp_set_lock and omp_unset_lock with
-name being transformed into a variable declared like
-@smallexample
- omp_lock_t gomp_critical_user_<name> __attribute__((common))
-@end smallexample
-Ideally the ABI would specify that all zero is a valid unlocked
-state, and so we wouldn't need to initialize this at
-startup.
+@node acc_get_num_devices
+@section @code{acc_get_num_devices} -- Get number of devices for given device type
+@table @asis
+@item @emph{Description}
+This function returns a value indicating the number of devices available
+for the device type specified in @var{devicetype}.
+@item @emph{C/C++}:
+@multitable @columnfractions .20 .80
+@item @emph{Prototype}: @tab @code{int acc_get_num_devices(acc_device_t devicetype);}
+@end multitable
+@item @emph{Fortran}:
+@multitable @columnfractions .20 .80
+@item @emph{Interface}: @tab @code{integer function acc_get_num_devices(devicetype)}
+@item @tab @code{integer(kind=acc_device_kind) devicetype}
+@end multitable
-@node Implementing ATOMIC construct
-@section Implementing ATOMIC construct
+@item @emph{Reference}:
+@uref{https://www.openacc.org, OpenACC specification v2.0}, section
+3.2.1.
+@end table
-The target should implement the @code{__sync} builtins.
-Failing that we could add
-@smallexample
- void GOMP_atomic_enter (void)
- void GOMP_atomic_exit (void)
-@end smallexample
+@node acc_set_device_type
+@section @code{acc_set_device_type} -- Set type of device accelerator to use.
+@table @asis
+@item @emph{Description}
+This function indicates to the runtime library which device typr, specified
+in @var{devicetype}, to use when executing a parallel or kernels region.
-which reuses the regular lock code, but with yet another lock
-object private to the library.
+@item @emph{C/C++}:
+@multitable @columnfractions .20 .80
+@item @emph{Prototype}: @tab @code{acc_set_device_type(acc_device_t devicetype);}
+@end multitable
+@item @emph{Fortran}:
+@multitable @columnfractions .20 .80
+@item @emph{Interface}: @tab @code{subroutine acc_set_device_type(devicetype)}
+@item @tab @code{integer(kind=acc_device_kind) devicetype}
+@end multitable
+@item @emph{Reference}:
+@uref{https://www.openacc.org, OpenACC specification v2.0}, section
+3.2.2.
+@end table
-@node Implementing FLUSH construct
-@section Implementing FLUSH construct
-Expands to the @code{__sync_synchronize} builtin.
+@node acc_get_device_type
+@section @code{acc_get_device_type} -- Get type of device accelerator to be used.
+@table @asis
+@item @emph{Description}
+This function returns what device type will be used when executing a
+parallel or kernels region.
+@item @emph{C/C++}:
+@multitable @columnfractions .20 .80
+@item @emph{Prototype}: @tab @code{acc_device_t acc_get_device_type(void);}
+@end multitable
-@node Implementing BARRIER construct
-@section Implementing BARRIER construct
+@item @emph{Fortran}:
+@multitable @columnfractions .20 .80
+@item @emph{Interface}: @tab @code{function acc_get_device_type(void)}
+@item @tab @code{integer(kind=acc_device_kind) acc_get_device_type}
+@end multitable
-@smallexample
- void GOMP_barrier (void)
-@end smallexample
+@item @emph{Reference}:
+@uref{https://www.openacc.org, OpenACC specification v2.0}, section
+3.2.3.
+@end table
-@node Implementing THREADPRIVATE construct
-@section Implementing THREADPRIVATE construct
-In _most_ cases we can map this directly to @code{__thread}. Except
-that OMP allows constructors for C++ objects. We can either
-refuse to support this (how often is it used?) or we can
-implement something akin to .ctors.
+@node acc_set_device_num
+@section @code{acc_set_device_num} -- Set device number to use.
+@table @asis
+@item @emph{Description}
+This function will indicate to the runtime which device number,
+specified by @var{num}, associated with the specifed device
+type @var{devicetype}.
-Even more ideally, this ctor feature is handled by extensions
-to the main pthreads library. Failing that, we can have a set
-of entry points to register ctor functions to be called.
+@item @emph{C/C++}:
+@multitable @columnfractions .20 .80
+@item @emph{Prototype}: @tab @code{acc_set_device_num(int num, acc_device_t devicetype);}
+@end multitable
+@item @emph{Fortran}:
+@multitable @columnfractions .20 .80
+@item @emph{Interface}: @tab @code{subroutine acc_set_device_num(devicenum, devicetype)}
+@item @tab @code{integer devicenum}
+@item @tab @code{integer(kind=acc_device_kind) devicetype}
+@end multitable
+@item @emph{Reference}:
+@uref{https://www.openacc.org, OpenACC specification v2.0}, section
+3.2.4.
+@end table
-@node Implementing PRIVATE clause
-@section Implementing PRIVATE clause
-In association with a PARALLEL, or within the lexical extent
-of a PARALLEL block, the variable becomes a local variable in
-the parallel subfunction.
-In association with FOR or SECTIONS blocks, create a new
-automatic variable within the current function. This preserves
-the semantic of new variable creation.
+@node acc_get_device_num
+@section @code{acc_get_device_num} -- Get device number to be used.
+@table @asis
+@item @emph{Description}
+This function returns which device number associated with the specified device
+type @var{devicetype}, will be used when executing a parallel or kernels
+region.
+@item @emph{C/C++}:
+@multitable @columnfractions .20 .80
+@item @emph{Prototype}: @tab @code{int acc_get_device_num(acc_device_t devicetype);}
+@end multitable
+@item @emph{Fortran}:
+@multitable @columnfractions .20 .80
+@item @emph{Interface}: @tab @code{function acc_get_device_num(devicetype)}
+@item @tab @code{integer(kind=acc_device_kind) devicetype}
+@item @tab @code{integer acc_get_device_num}
+@end multitable
-@node Implementing FIRSTPRIVATE LASTPRIVATE COPYIN and COPYPRIVATE clauses
-@section Implementing FIRSTPRIVATE LASTPRIVATE COPYIN and COPYPRIVATE clauses
+@item @emph{Reference}:
+@uref{https://www.openacc.org, OpenACC specification v2.0}, section
+3.2.5.
+@end table
-This seems simple enough for PARALLEL blocks. Create a private
-struct for communicating between the parent and subfunction.
-In the parent, copy in values for scalar and "small" structs;
-copy in addresses for others TREE_ADDRESSABLE types. In the
-subfunction, copy the value into the local variable.
-It is not clear what to do with bare FOR or SECTION blocks.
-The only thing I can figure is that we do something like:
-@smallexample
-#pragma omp for firstprivate(x) lastprivate(y)
-for (int i = 0; i < n; ++i)
- body;
-@end smallexample
+@node acc_async_test
+@section @code{acc_async_test} -- Test for completion of a specific asynchronous operation.
+@table @asis
+@item @emph{Description}
+This function tests for completion of the asynchrounous operation specified
+in @var{arg}. In C/C++, a non-zero value will be returned to indicate
+the specified asynchronous operation has completed. While Fortran will return
+a @code{true}. If the asynchrounous operation has not completed, C/C++ returns
+a zero and Fortran returns a @code{false}.
-which becomes
+@item @emph{C/C++}:
+@multitable @columnfractions .20 .80
+@item @emph{Prototype}: @tab @code{int acc_async_test(int arg);}
+@end multitable
-@smallexample
-@{
- int x = x, y;
+@item @emph{Fortran}:
+@multitable @columnfractions .20 .80
+@item @emph{Interface}: @tab @code{function acc_async_test(arg)}
+@item @tab @code{integer(kind=acc_handle_kind) arg}
+@item @tab @code{logical acc_async_test}
+@end multitable
- // for stuff
+@item @emph{Reference}:
+@uref{https://www.openacc.org, OpenACC specification v2.0}, section
+3.2.6.
+@end table
- if (i == n)
- y = y;
-@}
-@end smallexample
-where the "x=x" and "y=y" assignments actually have different
-uids for the two variables, i.e. not something you could write
-directly in C. Presumably this only makes sense if the "outer"
-x and y are global variables.
-COPYPRIVATE would work the same way, except the structure
-broadcast would have to happen via SINGLE machinery instead.
+@node acc_async_test_all
+@section @code{acc_async_test_all} -- Tests for completion of all asynchronous operations.
+@table @asis
+@item @emph{Description}
+This function tests for completion of all asynchrounous operations.
+In C/C++, a non-zero value will be returned to indicate all asynchronous
+operations have completed. While Fortran will return a @code{true}. If
+any asynchronous operation has not completed, C/C++ returns a zero and
+Fortran returns a @code{false}.
+@item @emph{C/C++}:
+@multitable @columnfractions .20 .80
+@item @emph{Prototype}: @tab @code{int acc_async_test_all(void);}
+@end multitable
+@item @emph{Fortran}:
+@multitable @columnfractions .20 .80
+@item @emph{Interface}: @tab @code{function acc_async_test()}
+@item @tab @code{logical acc_get_device_num}
+@end multitable
-@node Implementing REDUCTION clause
-@section Implementing REDUCTION clause
+@item @emph{Reference}:
+@uref{https://www.openacc.org, OpenACC specification v2.0}, section
+3.2.7.
+@end table
-The private struct mentioned in the previous section should have
-a pointer to an array of the type of the variable, indexed by the
-thread's @var{team_id}. The thread stores its final value into the
-array, and after the barrier, the master thread iterates over the
-array to collect the values.
-@node Implementing PARALLEL construct
-@section Implementing PARALLEL construct
+@node acc_wait
+@section @code{acc_wait} -- Wait for completion of a specific asynchronous operation.
+@table @asis
+@item @emph{Description}
+This function waits for completion of the asynchronous operation
+specified in @var{arg}.
-@smallexample
- #pragma omp parallel
- @{
- body;
- @}
-@end smallexample
+@item @emph{C/C++}:
+@multitable @columnfractions .20 .80
+@item @emph{Prototype}: @tab @code{acc_wait(arg);}
+@item @emph{Prototype (OpenACC 1.0 compatibility)}: @tab @code{acc_async_wait(arg);}
+@end multitable
-becomes
+@item @emph{Fortran}:
+@multitable @columnfractions .20 .80
+@item @emph{Interface}: @tab @code{subroutine acc_wait(arg)}
+@item @tab @code{integer(acc_handle_kind) arg}
+@item @emph{Interface (OpenACC 1.0 compatibility)}: @tab @code{subroutine acc_async_wait(arg)}
+@item @tab @code{integer(acc_handle_kind) arg}
+@end multitable
-@smallexample
- void subfunction (void *data)
- @{
- use data;
- body;
- @}
+@item @emph{Reference}:
+@uref{https://www.openacc.org, OpenACC specification v2.0}, section
+3.2.8.
+@end table
- setup data;
- GOMP_parallel_start (subfunction, &data, num_threads);
- subfunction (&data);
- GOMP_parallel_end ();
-@end smallexample
-@smallexample
- void GOMP_parallel_start (void (*fn)(void *), void *data, unsigned num_threads)
-@end smallexample
-The @var{FN} argument is the subfunction to be run in parallel.
+@node acc_wait_all
+@section @code{acc_wait_all} -- Waits for completion of all asynchronous operations.
+@table @asis
+@item @emph{Description}
+This function waits for the completion of all asynchronous operations.
-The @var{DATA} argument is a pointer to a structure used to
+@item @emph{C/C++}:
+@multitable @columnfractions .20 .80
+@item @emph{Prototype}: @tab @code{acc_wait_all(void);}
+@item @emph{Prototype (OpenACC 1.0 compatibility)}: @tab @code{acc_async_wait_all(void);}
+@end multitable
+
+@item @emph{Fortran}:
+@multitable @columnfractions .20 .80
+@item @emph{Interface}: @tab @code{subroutine acc_wait_all()}
+@item @emph{Interface (OpenACC 1.0 compatibility)}: @tab @code{subroutine acc_async_wait_all()}
+@end multitable
+
+@item @emph{Reference}:
+@uref{https://www.openacc.org, OpenACC specification v2.0}, section
+3.2.10.
+@end table
+
+
+
+@node acc_wait_all_async
+@section @code{acc_wait_all_async} -- Wait for completion of all asynchronous operations.
+@table @asis
+@item @emph{Description}
+This function enqueues a wait operation on the queue @var{async} for any
+and all asynchronous operations that have been previously enqueued on
+any queue.
+
+@item @emph{C/C++}:
+@multitable @columnfractions .20 .80
+@item @emph{Prototype}: @tab @code{acc_wait_all_async(int async);}
+@end multitable
+
+@item @emph{Fortran}:
+@multitable @columnfractions .20 .80
+@item @emph{Interface}: @tab @code{subroutine acc_wait_all_async(async)}
+@item @tab @code{integer(acc_handle_kind) async}
+@end multitable
+
+@item @emph{Reference}:
+@uref{https://www.openacc.org, OpenACC specification v2.0}, section
+3.2.11.
+@end table
+
+
+
+@node acc_wait_async
+@section @code{acc_wait_async} -- Wait for completion of asynchronous operations.
+@table @asis
+@item @emph{Description}
+This function enqueues a wait operation on queue @var{async} for any and all
+asynchronous operations enqueued on queue @var{arg}.
+
+@item @emph{C/C++}:
+@multitable @columnfractions .20 .80
+@item @emph{Prototype}: @tab @code{acc_wait_async(int arg, int async);}
+@end multitable
+
+@item @emph{Fortran}:
+@multitable @columnfractions .20 .80
+@item @emph{Interface}: @tab @code{subroutine acc_wait_async(arg, async)}
+@item @tab @code{integer(acc_handle_kind) arg, async}
+@end multitable
+
+@item @emph{Reference}:
+@uref{https://www.openacc.org, OpenACC specification v2.0}, section
+3.2.9.
+@end table
+
+
+
+@node acc_init
+@section @code{acc_init} -- Initialize runtime for a specific device type.
+@table @asis
+@item @emph{Description}
+This function initializes the runtime for the device type specified in
+@var{devicetype}.
+
+@item @emph{C/C++}:
+@multitable @columnfractions .20 .80
+@item @emph{Prototype}: @tab @code{acc_init(acc_device_t devicetype);}
+@end multitable
+
+@item @emph{Fortran}:
+@multitable @columnfractions .20 .80
+@item @emph{Interface}: @tab @code{subroutine acc_init(devicetype)}
+@item @tab @code{integer(acc_device_kind) devicetype}
+@end multitable
+
+@item @emph{Reference}:
+@uref{https://www.openacc.org, OpenACC specification v2.0}, section
+3.2.12.
+@end table
+
+
+
+@node acc_shutdown
+@section @code{acc_shutdown} -- Shuts down the runtime for a specific device type.
+@table @asis
+@item @emph{Description}
+This function shuts down the runtime for the device type specified in
+@var{devicetype}.
+
+@item @emph{C/C++}:
+@multitable @columnfractions .20 .80
+@item @emph{Prototype}: @tab @code{acc_shutdown(acc_device_t devicetype);}
+@end multitable
+
+@item @emph{Fortran}:
+@multitable @columnfractions .20 .80
+@item @emph{Interface}: @tab @code{subroutine acc_shutdown(devicetype)}
+@item @tab @code{integer(acc_device_kind) devicetype}
+@end multitable
+
+@item @emph{Reference}:
+@uref{https://www.openacc.org, OpenACC specification v2.0}, section
+3.2.13.
+@end table
+
+
+
+@node acc_on_device
+@section @code{acc_on_device} -- Whether executing on a particular device
+@table @asis
+@item @emph{Description}:
+This function returns whether the program is executing on a particular
+device specified in @var{devicetype}. In C/C++ a non-zero value is
+returned to indicate the device is execiting on the specified device type.
+In Fortran, @code{true} will be returned. If the program is not executing
+on the specified device type C/C++ will return a zero, while Fortran will
+return @code{false}.
+
+@item @emph{C/C++}:
+@multitable @columnfractions .20 .80
+@item @emph{Prototype}: @tab @code{acc_on_device(acc_device_t devicetype);}
+@end multitable
+
+@item @emph{Fortran}:
+@multitable @columnfractions .20 .80
+@item @emph{Interface}: @tab @code{function acc_on_device(devicetype)}
+@item @tab @code{integer(acc_device_kind) devicetype}
+@item @tab @code{logical acc_on_device}
+@end multitable
+
+
+@item @emph{Reference}:
+@uref{https://www.openacc.org, OpenACC specification v2.0}, section
+3.2.14.
+@end table
+
+
+
+@node acc_malloc
+@section @code{acc_malloc} -- Allocate device memory.
+@table @asis
+@item @emph{Description}
+This function allocates @var{len} bytes of device memory. It returns
+the device address of the allocated memory.
+
+@item @emph{C/C++}:
+@multitable @columnfractions .20 .80
+@item @emph{Prototype}: @tab @code{d_void* acc_malloc(size_t len);}
+@end multitable
+
+@item @emph{Reference}:
+@uref{https://www.openacc.org, OpenACC specification v2.0}, section
+3.2.15.
+@end table
+
+
+
+@node acc_free
+@section @code{acc_free} -- Free device memory.
+@table @asis
+@item @emph{Description}
+Free previously allocated device memory at the device address @code{a}.
+
+@item @emph{C/C++}:
+@multitable @columnfractions .20 .80
+@item @emph{Prototype}: @tab @code{acc_free(d_void *a);}
+@end multitable
+
+@item @emph{Reference}:
+@uref{https://www.openacc.org, OpenACC specification v2.0}, section
+3.2.16.
+@end table
+
+
+
+@node acc_copyin
+@section @code{acc_copyin} -- Allocate device memory and copy host memory to it.
+@table @asis
+@item @emph{Description}
+In C/C++, this function allocates @var{len} bytes of device memory
+and maps it to the specified host address in @var{a}. The device
+address of the newly allocated device memory is returned.
+
+In Fortran, two (2) forms are supported. In the first form, @var{a} specifies
+a contiguous array section. The second form @var{a} specifies a
+variable or array element and @var{len} specifies the length in bytes.
+
+@item @emph{C/C++}:
+@multitable @columnfractions .20 .80
+@item @emph{Prototype}: @tab @code{void *acc_copyin(h_void *a, size_t len);}
+@end multitable
+
+@item @emph{Fortran}:
+@multitable @columnfractions .20 .80
+@item @emph{Interface}: @tab @code{subroutine acc_copyin(a)}
+@item @tab @code{type, dimension(:[,:]...) :: a}
+@item @emph{Interface}: @tab @code{subroutine acc_copyin(a, len)}
+@item @tab @code{type, dimension(:[,:]...) :: a}
+@item @tab @code{integer len}
+@end multitable
+
+@item @emph{Reference}:
+@uref{https://www.openacc.org, OpenACC specification v2.0}, section
+3.2.17.
+@end table
+
+
+
+@node acc_present_or_copyin
+@section @code{acc_present_or_copyin} -- If the data is not present on the device, allocate device memory and copy from host memory.
+@table @asis
+@item @emph{Description}
+This function tests if the host data specifed by @var{a} and of length
+@var{len} is present or not. If it is not present, then device memory
+will be allocated and the host memory copied. The device address of
+the newly allocated device memory is returned.
+
+In Fortran, two (2) forms are supported. In the first form, @var{a} specifies
+a contiguous array section. The second form @var{a} specifies a variable or
+array element and @var{len} specifies the length in bytes.
+
+@item @emph{C/C++}:
+@multitable @columnfractions .20 .80
+@item @emph{Prototype}: @tab @code{void *acc_present_or_copyin(h_void *a, size_t len);}
+@item @emph{Prototype}: @tab @code{void *acc_pcopyin(h_void *a, size_t len);}
+@end multitable
+
+@item @emph{Fortran}:
+@multitable @columnfractions .20 .80
+@item @emph{Interface}: @tab @code{subroutine acc_present_or_copyin(a)}
+@item @tab @code{type, dimension(:[,:]...) :: a}
+@item @emph{Interface}: @tab @code{subroutine acc_present_or_copyin(a, len)}
+@item @tab @code{type, dimension(:[,:]...) :: a}
+@item @tab @code{integer len}
+@item @emph{Interface}: @tab @code{subroutine acc_pcopyin(a)}
+@item @tab @code{type, dimension(:[,:]...) :: a}
+@item @emph{Interface}: @tab @code{subroutine acc_pcopyin(a, len)}
+@item @tab @code{type, dimension(:[,:]...) :: a}
+@item @tab @code{integer len}
+@end multitable
+
+@item @emph{Reference}:
+@uref{https://www.openacc.org, OpenACC specification v2.0}, section
+3.2.18.
+@end table
+
+
+
+@node acc_create
+@section @code{acc_create} -- Allocate device memory and map it to host memory.
+@table @asis
+@item @emph{Description}
+This function allocates device memory and maps it to host memory specified
+by the host address @var{a} with a length of @var{len} bytes. In C/C++,
+the function returns the device address of the allocated device memory.
+
+In Fortran, two (2) forms are supported. In the first form, @var{a} specifies
+a contiguous array section. The second form @var{a} specifies a variable or
+array element and @var{len} specifies the length in bytes.
+
+@item @emph{C/C++}:
+@multitable @columnfractions .20 .80
+@item @emph{Prototype}: @tab @code{void *acc_create(h_void *a, size_t len);}
+@end multitable
+
+@item @emph{Fortran}:
+@multitable @columnfractions .20 .80
+@item @emph{Interface}: @tab @code{subroutine acc_create(a)}
+@item @tab @code{type, dimension(:[,:]...) :: a}
+@item @emph{Interface}: @tab @code{subroutine acc_create(a, len)}
+@item @tab @code{type, dimension(:[,:]...) :: a}
+@item @tab @code{integer len}
+@end multitable
+
+@item @emph{Reference}:
+@uref{https://www.openacc.org, OpenACC specification v2.0}, section
+3.2.19.
+@end table
+
+
+
+@node acc_present_or_create
+@section @code{acc_present_or_create} -- If the data is not present on the device, allocate device memory and map it to host memory.
+@table @asis
+@item @emph{Description}
+This function tests if the host data specifed by @var{a} and of length
+@var{len} is present or not. If it is not present, then device memory
+will be allocated and mapped to host memory. In C/C++, the device address
+of the newly allocated device memory is returned.
+
+In Fortran, two (2) forms are supported. In the first form, @var{a} specifies
+a contiguous array section. The second form @var{a} specifies a variable or
+array element and @var{len} specifies the length in bytes.
+
+
+@item @emph{C/C++}:
+@multitable @columnfractions .20 .80
+@item @emph{Prototype}: @tab @code{void *acc_present_or_create(h_void *a, size_t len)}
+@item @emph{Prototype}: @tab @code{void *acc_pcreate(h_void *a, size_t len)}
+@end multitable
+
+@item @emph{Fortran}:
+@multitable @columnfractions .20 .80
+@item @emph{Interface}: @tab @code{subroutine acc_present_or_create(a)}
+@item @tab @code{type, dimension(:[,:]...) :: a}
+@item @emph{Interface}: @tab @code{subroutine acc_present_or_create(a, len)}
+@item @tab @code{type, dimension(:[,:]...) :: a}
+@item @tab @code{integer len}
+@item @emph{Interface}: @tab @code{subroutine acc_pcreate(a)}
+@item @tab @code{type, dimension(:[,:]...) :: a}
+@item @emph{Interface}: @tab @code{subroutine acc_pcreate(a, len)}
+@item @tab @code{type, dimension(:[,:]...) :: a}
+@item @tab @code{integer len}
+@end multitable
+
+@item @emph{Reference}:
+@uref{https://www.openacc.org, OpenACC specification v2.0}, section
+3.2.20.
+@end table
+
+
+
+@node acc_copyout
+@section @code{acc_copyout} -- Copy device memory to host memory.
+@table @asis
+@item @emph{Description}
+This function copies mapped device memory to host memory which is specified
+by host address @var{a} for a length @var{len} bytes in C/C++.
+
+In Fortran, two (2) forms are supported. In the first form, @var{a} specifies
+a contiguous array section. The second form @var{a} specifies a variable or
+array element and @var{len} specifies the length in bytes.
+
+@item @emph{C/C++}:
+@multitable @columnfractions .20 .80
+@item @emph{Prototype}: @tab @code{acc_copyout(h_void *a, size_t len);}
+@end multitable
+
+@item @emph{Fortran}:
+@multitable @columnfractions .20 .80
+@item @emph{Interface}: @tab @code{subroutine acc_copyout(a)}
+@item @tab @code{type, dimension(:[,:]...) :: a}
+@item @emph{Interface}: @tab @code{subroutine acc_copyout(a, len)}
+@item @tab @code{type, dimension(:[,:]...) :: a}
+@item @tab @code{integer len}
+@end multitable
+
+@item @emph{Reference}:
+@uref{https://www.openacc.org, OpenACC specification v2.0}, section
+3.2.21.
+@end table
+
+
+
+@node acc_delete
+@section @code{acc_delete} -- Free device memory.
+@table @asis
+@item @emph{Description}
+This function frees previously allocated device memory specified by
+the device address @var{a} and the length of @var{len} bytes.
+
+In Fortran, two (2) forms are supported. In the first form, @var{a} specifies
+a contiguous array section. The second form @var{a} specifies a variable or
+array element and @var{len} specifies the length in bytes.
+
+@item @emph{C/C++}:
+@multitable @columnfractions .20 .80
+@item @emph{Prototype}: @tab @code{acc_delete(h_void *a, size_t len);}
+@end multitable
+
+@item @emph{Fortran}:
+@multitable @columnfractions .20 .80
+@item @emph{Interface}: @tab @code{subroutine acc_delete(a)}
+@item @tab @code{type, dimension(:[,:]...) :: a}
+@item @emph{Interface}: @tab @code{subroutine acc_delete(a, len)}
+@item @tab @code{type, dimension(:[,:]...) :: a}
+@item @tab @code{integer len}
+@end multitable
+
+@item @emph{Reference}:
+@uref{https://www.openacc.org, OpenACC specification v2.0}, section
+3.2.22.
+@end table
+
+
+
+@node acc_update_device
+@section @code{acc_update_device} -- Update device memory from mapped host memory.
+@table @asis
+@item @emph{Description}
+This function updates the device copy from the previously mapped host memory.
+The host memory is specified with the host address @var{a} and a length of
+@var{len} bytes.
+
+In Fortran, two (2) forms are supported. In the first form, @var{a} specifies
+a contiguous array section. The second form @var{a} specifies a variable or
+array element and @var{len} specifies the length in bytes.
+
+@item @emph{C/C++}:
+@multitable @columnfractions .20 .80
+@item @emph{Prototype}: @tab @code{acc_update_device(h_void *a, size_t len);}
+@end multitable
+
+@item @emph{Fortran}:
+@multitable @columnfractions .20 .80
+@item @emph{Interface}: @tab @code{subroutine acc_update_device(a)}
+@item @tab @code{type, dimension(:[,:]...) :: a}
+@item @emph{Interface}: @tab @code{subroutine acc_update_device(a, len)}
+@item @tab @code{type, dimension(:[,:]...) :: a}
+@item @tab @code{integer len}
+@end multitable
+
+@item @emph{Reference}:
+@uref{https://www.openacc.org, OpenACC specification v2.0}, section
+3.2.23.
+@end table
+
+
+
+@node acc_update_self
+@section @code{acc_update_self} -- Update host memory from mapped device memory.
+@table @asis
+@item @emph{Description}
+This function updates the host copy from the previously mapped device memory.
+The host memory is specified with the host address @var{a} and a length of
+@var{len} bytes.
+
+In Fortran, two (2) forms are supported. In the first form, @var{a} specifies
+a contiguous array section. The second form @var{a} specifies a variable or
+array element and @var{len} specifies the length in bytes.
+
+@item @emph{C/C++}:
+@multitable @columnfractions .20 .80
+@item @emph{Prototype}: @tab @code{acc_update_self(h_void *a, size_t len);}
+@end multitable
+
+@item @emph{Fortran}:
+@multitable @columnfractions .20 .80
+@item @emph{Interface}: @tab @code{subroutine acc_update_self(a)}
+@item @tab @code{type, dimension(:[,:]...) :: a}
+@item @emph{Interface}: @tab @code{subroutine acc_update_self(a, len)}
+@item @tab @code{type, dimension(:[,:]...) :: a}
+@item @tab @code{integer len}
+@end multitable
+
+@item @emph{Reference}:
+@uref{https://www.openacc.org, OpenACC specification v2.0}, section
+3.2.24.
+@end table
+
+
+
+@node acc_map_data
+@section @code{acc_map_data} -- Map previously allocated device memory to host memory.
+@table @asis
+@item @emph{Description}
+This function maps previously allocated device and host memory. The device
+memory is specified with the device address @var{d}. The host memory is
+specified with the host address @var{h} and a length of @var{len}.
+
+@item @emph{C/C++}:
+@multitable @columnfractions .20 .80
+@item @emph{Prototype}: @tab @code{acc_map_data(h_void *h, d_void *d, size_t len);}
+@end multitable
+
+@item @emph{Reference}:
+@uref{https://www.openacc.org, OpenACC specification v2.0}, section
+3.2.25.
+@end table
+
+
+
+@node acc_unmap_data
+@section @code{acc_unmap_data} -- Unmap device memory from host memory.
+@table @asis
+@item @emph{Description}
+This function unmaps previously mapped device and host memory. The latter
+specified by @var{h}.
+
+@item @emph{C/C++}:
+@multitable @columnfractions .20 .80
+@item @emph{Prototype}: @tab @code{acc_unmap_data(h_void *h);}
+@end multitable
+
+@item @emph{Reference}:
+@uref{https://www.openacc.org, OpenACC specification v2.0}, section
+3.2.26.
+@end table
+
+
+
+@node acc_deviceptr
+@section @code{acc_deviceptr} -- Get device pointer associated with specific host address.
+@table @asis
+@item @emph{Description}
+This function returns the device address that has been mapped to the
+host address specified by @var{h}.
+
+@item @emph{C/C++}:
+@multitable @columnfractions .20 .80
+@item @emph{Prototype}: @tab @code{void *acc_deviceptr(h_void *h);}
+@end multitable
+
+@item @emph{Reference}:
+@uref{https://www.openacc.org, OpenACC specification v2.0}, section
+3.2.27.
+@end table
+
+
+
+@node acc_hostptr
+@section @code{acc_hostptr} -- Get host pointer associated with specific device address.
+@table @asis
+@item @emph{Description}
+This function returns the host address that has been mapped to the
+device address specified by @var{d}.
+
+@item @emph{C/C++}:
+@multitable @columnfractions .20 .80
+@item @emph{Prototype}: @tab @code{void *acc_hostptr(d_void *d);}
+@end multitable
+
+@item @emph{Reference}:
+@uref{https://www.openacc.org, OpenACC specification v2.0}, section
+3.2.28.
+@end table
+
+
+
+@node acc_is_present
+@section @code{acc_is_present} -- Indicate whether host variable / array is present on device.
+@table @asis
+@item @emph{Description}
+This function indicates whether the specified host address in @var{a} and a
+length of @var{len} bytes is present on the device. In C/C++, a non-zero
+value is returned to indicate the presence of the mapped memory on the
+device. A zero is returned to indicate the memory is not mapped on the
+device.
+
+In Fortran, two (2) forms are supported. In the first form, @var{a} specifies
+a contiguous array section. The second form @var{a} specifies a variable or
+array element and @var{len} specifies the length in bytes. If the host
+memory is mapped to device memory, then a @code{true} is returned. Otherwise,
+a @code{false} is return to indicate the mapped memory is not present.
+
+@item @emph{C/C++}:
+@multitable @columnfractions .20 .80
+@item @emph{Prototype}: @tab @code{int acc_is_present(h_void *a, size_t len);}
+@end multitable
+
+@item @emph{Fortran}:
+@multitable @columnfractions .20 .80
+@item @emph{Interface}: @tab @code{function acc_is_present(a)}
+@item @tab @code{type, dimension(:[,:]...) :: a}
+@item @tab @code{logical acc_is_present}
+@item @emph{Interface}: @tab @code{function acc_is_present(a, len)}
+@item @tab @code{type, dimension(:[,:]...) :: a}
+@item @tab @code{integer len}
+@item @tab @code{logical acc_is_present}
+@end multitable
+
+@item @emph{Reference}:
+@uref{https://www.openacc.org, OpenACC specification v2.0}, section
+3.2.29.
+@end table
+
+
+
+@node acc_memcpy_to_device
+@section @code{acc_memcpy_to_device} -- Copy host memory to device memory.
+@table @asis
+@item @emph{Description}
+This function copies host memory specified by host address of @var{src} to
+device memory specified by the device address @var{dest} for a length of
+@var{bytes} bytes.
+
+@item @emph{C/C++}:
+@multitable @columnfractions .20 .80
+@item @emph{Prototype}: @tab @code{acc_memcpy_to_device(d_void *dest, h_void *src, size_t bytes);}
+@end multitable
+
+@item @emph{Reference}:
+@uref{https://www.openacc.org, OpenACC specification v2.0}, section
+3.2.30.
+@end table
+
+
+
+@node acc_memcpy_from_device
+@section @code{acc_memcpy_from_device} -- Copy device memory to host memory.
+@table @asis
+@item @emph{Description}
+This function copies host memory specified by host address of @var{src} from
+device memory specified by the device address @var{dest} for a length of
+@var{bytes} bytes.
+
+@item @emph{C/C++}:
+@multitable @columnfractions .20 .80
+@item @emph{Prototype}: @tab @code{acc_memcpy_from_device(d_void *dest, h_void *src, size_t bytes);}
+@end multitable
+
+@item @emph{Reference}:
+@uref{https://www.openacc.org, OpenACC specification v2.0}, section
+3.2.31.
+@end table
+
+
+
+@node acc_get_current_cuda_device
+@section @code{acc_get_current_cuda_device} -- Get CUDA device handle.
+@table @asis
+@item @emph{Description}
+This function returns the CUDA device handle. This handle is the same
+as used by the CUDA Runtime or Driver API's.
+
+@item @emph{C/C++}:
+@multitable @columnfractions .20 .80
+@item @emph{Prototype}: @tab @code{void *acc_get_current_cuda_device(void);}
+@end multitable
+
+@item @emph{Reference}:
+@uref{https://www.openacc.org, OpenACC specification v2.0}, section
+A.2.1.1.
+@end table
+
+
+
+@node acc_get_current_cuda_context
+@section @code{acc_get_current_cuda_context} -- Get CUDA context handle.
+@table @asis
+@item @emph{Description}
+This function returns the CUDA context handle. This handle is the same
+as used by the CUDA Runtime or Driver API's.
+
+@item @emph{C/C++}:
+@multitable @columnfractions .20 .80
+@item @emph{Prototype}: @tab @code{void *acc_get_current_cuda_context(void);}
+@end multitable
+
+@item @emph{Reference}:
+@uref{https://www.openacc.org, OpenACC specification v2.0}, section
+A.2.1.2.
+@end table
+
+
+
+@node acc_get_cuda_stream
+@section @code{acc_get_cuda_stream} -- Get CUDA stream handle.
+@table @asis
+@item @emph{Description}
+This function returns the CUDA stream handle for the queue @var{async}.
+This handle is the same as used by the CUDA Runtime or Driver API's.
+
+@item @emph{C/C++}:
+@multitable @columnfractions .20 .80
+@item @emph{Prototype}: @tab @code{void *acc_get_cuda_stream(int async);}
+@end multitable
+
+@item @emph{Reference}:
+@uref{https://www.openacc.org, OpenACC specification v2.0}, section
+A.2.1.3.
+@end table
+
+
+
+@node acc_set_cuda_stream
+@section @code{acc_set_cuda_stream} -- Set CUDA stream handle.
+@table @asis
+@item @emph{Description}
+This function associates the stream handle specified by @var{stream} with
+the queue @var{async}.
+
+This cannot be used to change the stream handle associated with
+@code{acc_async_sync}.
+
+The return value is not specified.
+
+@item @emph{C/C++}:
+@multitable @columnfractions .20 .80
+@item @emph{Prototype}: @tab @code{int acc_set_cuda_stream(int async, void *stream);}
+@end multitable
+
+@item @emph{Reference}:
+@uref{https://www.openacc.org, OpenACC specification v2.0}, section
+A.2.1.4.
+@end table
+
+
+
+@node acc_prof_register
+@section @code{acc_prof_register} -- Register callbacks.
+@table @asis
+@item @emph{Description}:
+This function registers callbacks.
+
+@item @emph{C/C++}:
+@multitable @columnfractions .20 .80
+@item @emph{Prototype}: @tab @code{void acc_prof_register (acc_event_t, acc_prof_callback, acc_register_t);}
+@end multitable
+
+@item @emph{See also}:
+@ref{OpenACC Profiling Interface}
+
+@item @emph{Reference}:
+@uref{https://www.openacc.org, OpenACC specification v2.6}, section
+5.3.
+@end table
+
+
+
+@node acc_prof_unregister
+@section @code{acc_prof_unregister} -- Unregister callbacks.
+@table @asis
+@item @emph{Description}:
+This function unregisters callbacks.
+
+@item @emph{C/C++}:
+@multitable @columnfractions .20 .80
+@item @emph{Prototype}: @tab @code{void acc_prof_unregister (acc_event_t, acc_prof_callback, acc_register_t);}
+@end multitable
+
+@item @emph{See also}:
+@ref{OpenACC Profiling Interface}
+
+@item @emph{Reference}:
+@uref{https://www.openacc.org, OpenACC specification v2.6}, section
+5.3.
+@end table
+
+
+
+@node acc_prof_lookup
+@section @code{acc_prof_lookup} -- Obtain inquiry functions.
+@table @asis
+@item @emph{Description}:
+Function to obtain inquiry functions.
+
+@item @emph{C/C++}:
+@multitable @columnfractions .20 .80
+@item @emph{Prototype}: @tab @code{acc_query_fn acc_prof_lookup (const char *);}
+@end multitable
+
+@item @emph{See also}:
+@ref{OpenACC Profiling Interface}
+
+@item @emph{Reference}:
+@uref{https://www.openacc.org, OpenACC specification v2.6}, section
+5.3.
+@end table
+
+
+
+@node acc_register_library
+@section @code{acc_register_library} -- Library registration.
+@table @asis
+@item @emph{Description}:
+Function for library registration.
+
+@item @emph{C/C++}:
+@multitable @columnfractions .20 .80
+@item @emph{Prototype}: @tab @code{void acc_register_library (acc_prof_reg, acc_prof_reg, acc_prof_lookup_func);}
+@end multitable
+
+@item @emph{See also}:
+@ref{OpenACC Profiling Interface}, @ref{ACC_PROFLIB}
+
+@item @emph{Reference}:
+@uref{https://www.openacc.org, OpenACC specification v2.6}, section
+5.3.
+@end table
+
+
+
+@c ---------------------------------------------------------------------
+@c OpenACC Environment Variables
+@c ---------------------------------------------------------------------
+
+@node OpenACC Environment Variables
+@chapter OpenACC Environment Variables
+
+The variables @env{ACC_DEVICE_TYPE} and @env{ACC_DEVICE_NUM}
+are defined by section 4 of the OpenACC specification in version 2.0.
+The variable @env{ACC_PROFLIB}
+is defined by section 4 of the OpenACC specification in version 2.6.
+The variable @env{GCC_ACC_NOTIFY} is used for diagnostic purposes.
+
+@menu
+* ACC_DEVICE_TYPE::
+* ACC_DEVICE_NUM::
+* ACC_PROFLIB::
+* GCC_ACC_NOTIFY::
+@end menu
+
+
+
+@node ACC_DEVICE_TYPE
+@section @code{ACC_DEVICE_TYPE}
+@table @asis
+@item @emph{Reference}:
+@uref{https://www.openacc.org, OpenACC specification v2.0}, section
+4.1.
+@end table
+
+
+
+@node ACC_DEVICE_NUM
+@section @code{ACC_DEVICE_NUM}
+@table @asis
+@item @emph{Reference}:
+@uref{https://www.openacc.org, OpenACC specification v2.0}, section
+4.2.
+@end table
+
+
+
+@node ACC_PROFLIB
+@section @code{ACC_PROFLIB}
+@table @asis
+@item @emph{See also}:
+@ref{acc_register_library}, @ref{OpenACC Profiling Interface}
+
+@item @emph{Reference}:
+@uref{https://www.openacc.org, OpenACC specification v2.6}, section
+4.3.
+@end table
+
+
+
+@node GCC_ACC_NOTIFY
+@section @code{GCC_ACC_NOTIFY}
+@table @asis
+@item @emph{Description}:
+Print debug information pertaining to the accelerator.
+@end table
+
+
+
+@c ---------------------------------------------------------------------
+@c CUDA Streams Usage
+@c ---------------------------------------------------------------------
+
+@node CUDA Streams Usage
+@chapter CUDA Streams Usage
+
+This applies to the @code{nvptx} plugin only.
+
+The library provides elements that perform asynchronous movement of
+data and asynchronous operation of computing constructs. This
+asynchronous functionality is implemented by making use of CUDA
+streams@footnote{See "Stream Management" in "CUDA Driver API",
+TRM-06703-001, Version 5.5, for additional information}.
+
+The primary means by that the asychronous functionality is accessed
+is through the use of those OpenACC directives which make use of the
+@code{async} and @code{wait} clauses. When the @code{async} clause is
+first used with a directive, it creates a CUDA stream. If an
+@code{async-argument} is used with the @code{async} clause, then the
+stream is associated with the specified @code{async-argument}.
+
+Following the creation of an association between a CUDA stream and the
+@code{async-argument} of an @code{async} clause, both the @code{wait}
+clause and the @code{wait} directive can be used. When either the
+clause or directive is used after stream creation, it creates a
+rendezvous point whereby execution waits until all operations
+associated with the @code{async-argument}, that is, stream, have
+completed.
+
+Normally, the management of the streams that are created as a result of
+using the @code{async} clause, is done without any intervention by the
+caller. This implies the association between the @code{async-argument}
+and the CUDA stream will be maintained for the lifetime of the program.
+However, this association can be changed through the use of the library
+function @code{acc_set_cuda_stream}. When the function
+@code{acc_set_cuda_stream} is called, the CUDA stream that was
+originally associated with the @code{async} clause will be destroyed.
+Caution should be taken when changing the association as subsequent
+references to the @code{async-argument} refer to a different
+CUDA stream.
+
+
+
+@c ---------------------------------------------------------------------
+@c OpenACC Library Interoperability
+@c ---------------------------------------------------------------------
+
+@node OpenACC Library Interoperability
+@chapter OpenACC Library Interoperability
+
+@section Introduction
+
+The OpenACC library uses the CUDA Driver API, and may interact with
+programs that use the Runtime library directly, or another library
+based on the Runtime library, e.g., CUBLAS@footnote{See section 2.26,
+"Interactions with the CUDA Driver API" in
+"CUDA Runtime API", Version 5.5, and section 2.27, "VDPAU
+Interoperability", in "CUDA Driver API", TRM-06703-001, Version 5.5,
+for additional information on library interoperability.}.
+This chapter describes the use cases and what changes are
+required in order to use both the OpenACC library and the CUBLAS and Runtime
+libraries within a program.
+
+@section First invocation: NVIDIA CUBLAS library API
+
+In this first use case (see below), a function in the CUBLAS library is called
+prior to any of the functions in the OpenACC library. More specifically, the
+function @code{cublasCreate()}.
+
+When invoked, the function initializes the library and allocates the
+hardware resources on the host and the device on behalf of the caller. Once
+the initialization and allocation has completed, a handle is returned to the
+caller. The OpenACC library also requires initialization and allocation of
+hardware resources. Since the CUBLAS library has already allocated the
+hardware resources for the device, all that is left to do is to initialize
+the OpenACC library and acquire the hardware resources on the host.
+
+Prior to calling the OpenACC function that initializes the library and
+allocate the host hardware resources, you need to acquire the device number
+that was allocated during the call to @code{cublasCreate()}. The invoking of the
+runtime library function @code{cudaGetDevice()} accomplishes this. Once
+acquired, the device number is passed along with the device type as
+parameters to the OpenACC library function @code{acc_set_device_num()}.
+
+Once the call to @code{acc_set_device_num()} has completed, the OpenACC
+library uses the context that was created during the call to
+@code{cublasCreate()}. In other words, both libraries will be sharing the
+same context.
+
+@smallexample
+ /* Create the handle */
+ s = cublasCreate(&h);
+ if (s != CUBLAS_STATUS_SUCCESS)
+ @{
+ fprintf(stderr, "cublasCreate failed %d\n", s);
+ exit(EXIT_FAILURE);
+ @}
+
+ /* Get the device number */
+ e = cudaGetDevice(&dev);
+ if (e != cudaSuccess)
+ @{
+ fprintf(stderr, "cudaGetDevice failed %d\n", e);
+ exit(EXIT_FAILURE);
+ @}
+
+ /* Initialize OpenACC library and use device 'dev' */
+ acc_set_device_num(dev, acc_device_nvidia);
+
+@end smallexample
+@center Use Case 1
+
+@section First invocation: OpenACC library API
+
+In this second use case (see below), a function in the OpenACC library is
+called prior to any of the functions in the CUBLAS library. More specificially,
+the function @code{acc_set_device_num()}.
+
+In the use case presented here, the function @code{acc_set_device_num()}
+is used to both initialize the OpenACC library and allocate the hardware
+resources on the host and the device. In the call to the function, the
+call parameters specify which device to use and what device
+type to use, i.e., @code{acc_device_nvidia}. It should be noted that this
+is but one method to initialize the OpenACC library and allocate the
+appropriate hardware resources. Other methods are available through the
+use of environment variables and these will be discussed in the next section.
+
+Once the call to @code{acc_set_device_num()} has completed, other OpenACC
+functions can be called as seen with multiple calls being made to
+@code{acc_copyin()}. In addition, calls can be made to functions in the
+CUBLAS library. In the use case a call to @code{cublasCreate()} is made
+subsequent to the calls to @code{acc_copyin()}.
+As seen in the previous use case, a call to @code{cublasCreate()}
+initializes the CUBLAS library and allocates the hardware resources on the
+host and the device. However, since the device has already been allocated,
+@code{cublasCreate()} will only initialize the CUBLAS library and allocate
+the appropriate hardware resources on the host. The context that was created
+as part of the OpenACC initialization is shared with the CUBLAS library,
+similarly to the first use case.
+
+@smallexample
+ dev = 0;
+
+ acc_set_device_num(dev, acc_device_nvidia);
+
+ /* Copy the first set to the device */
+ d_X = acc_copyin(&h_X[0], N * sizeof (float));
+ if (d_X == NULL)
+ @{
+ fprintf(stderr, "copyin error h_X\n");
+ exit(EXIT_FAILURE);
+ @}
+
+ /* Copy the second set to the device */
+ d_Y = acc_copyin(&h_Y1[0], N * sizeof (float));
+ if (d_Y == NULL)
+ @{
+ fprintf(stderr, "copyin error h_Y1\n");
+ exit(EXIT_FAILURE);
+ @}
+
+ /* Create the handle */
+ s = cublasCreate(&h);
+ if (s != CUBLAS_STATUS_SUCCESS)
+ @{
+ fprintf(stderr, "cublasCreate failed %d\n", s);
+ exit(EXIT_FAILURE);
+ @}
+
+ /* Perform saxpy using CUBLAS library function */
+ s = cublasSaxpy(h, N, &alpha, d_X, 1, d_Y, 1);
+ if (s != CUBLAS_STATUS_SUCCESS)
+ @{
+ fprintf(stderr, "cublasSaxpy failed %d\n", s);
+ exit(EXIT_FAILURE);
+ @}
+
+ /* Copy the results from the device */
+ acc_memcpy_from_device(&h_Y1[0], d_Y, N * sizeof (float));
+
+@end smallexample
+@center Use Case 2
+
+@section OpenACC library and environment variables
+
+There are two environment variables associated with the OpenACC library
+that may be used to control the device type and device number:
+@env{ACC_DEVICE_TYPE} and @env{ACC_DEVICE_NUM}, respecively. These two
+environement variables can be used as an alternative to calling
+@code{acc_set_device_num()}. As seen in the second use case, the device
+type and device number were specified using @code{acc_set_device_num()}.
+If however, the aforementioned environment variables were set, then the
+call to @code{acc_set_device_num()} would not be required.
+
+
+The use of the environment variables is only relevant when an OpenACC function
+is called prior to a call to @code{cudaCreate()}. If @code{cudaCreate()}
+is called prior to a call to an OpenACC function, then you must call
+@code{acc_set_device_num()}@footnote{More complete information
+about @env{ACC_DEVICE_TYPE} and @env{ACC_DEVICE_NUM} can be found in
+sections 4.1 and 4.2 of the @uref{https://www.openacc.org, OpenACC}
+Application Programming Interfaceā€¯, Version 2.0.}
+
+
+
+@c ---------------------------------------------------------------------
+@c OpenACC Profiling Interface
+@c ---------------------------------------------------------------------
+
+@node OpenACC Profiling Interface
+@chapter OpenACC Profiling Interface
+
+@section Implementation Status and Implementation-Defined Behavior
+
+We're implementing the OpenACC Profiling Interface as defined by the
+OpenACC 2.6 specification. We're clarifying some aspects here as
+@emph{implementation-defined behavior}, while they're still under
+discussion within the OpenACC Technical Committee.
+
+This implementation is tuned to keep the performance impact as low as
+possible for the (very common) case that the Profiling Interface is
+not enabled. This is relevant, as the Profiling Interface affects all
+the @emph{hot} code paths (in the target code, not in the offloaded
+code). Users of the OpenACC Profiling Interface can be expected to
+understand that performance will be impacted to some degree once the
+Profiling Interface has gotten enabled: for example, because of the
+@emph{runtime} (libgomp) calling into a third-party @emph{library} for
+every event that has been registered.
+
+We're not yet accounting for the fact that @cite{OpenACC events may
+occur during event processing}.
+
+We're not yet implementing initialization via a
+@code{acc_register_library} function that is either statically linked
+in, or dynamically via @env{LD_PRELOAD}.
+Initialization via @code{acc_register_library} functions dynamically
+loaded via the @env{ACC_PROFLIB} environment variable does work, as
+does directly calling @code{acc_prof_register},
+@code{acc_prof_unregister}, @code{acc_prof_lookup}.
+
+As currently there are no inquiry functions defined, calls to
+@code{acc_prof_lookup} will always return @code{NULL}.
+
+There aren't separate @emph{start}, @emph{stop} events defined for the
+event types @code{acc_ev_create}, @code{acc_ev_delete},
+@code{acc_ev_alloc}, @code{acc_ev_free}. It's not clear if these
+should be triggered before or after the actual device-specific call is
+made. We trigger them after.
+
+Remarks about data provided to callbacks:
+
+@table @asis
+
+@item @code{acc_prof_info.event_type}
+It's not clear if for @emph{nested} event callbacks (for example,
+@code{acc_ev_enqueue_launch_start} as part of a parent compute
+construct), this should be set for the nested event
+(@code{acc_ev_enqueue_launch_start}), or if the value of the parent
+construct should remain (@code{acc_ev_compute_construct_start}). In
+this implementation, the value will generally correspond to the
+innermost nested event type.
+
+@item @code{acc_prof_info.device_type}
+@itemize
+
+@item
+For @code{acc_ev_compute_construct_start}, and in presence of an
+@code{if} clause with @emph{false} argument, this will still refer to
+the offloading device type.
+It's not clear if that's the expected behavior.
+
+@item
+Complementary to the item before, for
+@code{acc_ev_compute_construct_end}, this is set to
+@code{acc_device_host} in presence of an @code{if} clause with
+@emph{false} argument.
+It's not clear if that's the expected behavior.
+
+@end itemize
+
+@item @code{acc_prof_info.thread_id}
+Always @code{-1}; not yet implemented.
+
+@item @code{acc_prof_info.async}
+@itemize
+
+@item
+Not yet implemented correctly for
+@code{acc_ev_compute_construct_start}.
+
+@item
+In a compute construct, for host-fallback
+execution/@code{acc_device_host} it will always be
+@code{acc_async_sync}.
+It's not clear if that's the expected behavior.
+
+@item
+For @code{acc_ev_device_init_start} and @code{acc_ev_device_init_end},
+it will always be @code{acc_async_sync}.
+It's not clear if that's the expected behavior.
+
+@end itemize
+
+@item @code{acc_prof_info.async_queue}
+There is no @cite{limited number of asynchronous queues} in libgomp.
+This will always have the same value as @code{acc_prof_info.async}.
+
+@item @code{acc_prof_info.src_file}
+Always @code{NULL}; not yet implemented.
+
+@item @code{acc_prof_info.func_name}
+Always @code{NULL}; not yet implemented.
+
+@item @code{acc_prof_info.line_no}
+Always @code{-1}; not yet implemented.
+
+@item @code{acc_prof_info.end_line_no}
+Always @code{-1}; not yet implemented.
+
+@item @code{acc_prof_info.func_line_no}
+Always @code{-1}; not yet implemented.
+
+@item @code{acc_prof_info.func_end_line_no}
+Always @code{-1}; not yet implemented.
+
+@item @code{acc_event_info.event_type}, @code{acc_event_info.*.event_type}
+Relating to @code{acc_prof_info.event_type} discussed above, in this
+implementation, this will always be the same value as
+@code{acc_prof_info.event_type}.
+
+@item @code{acc_event_info.*.parent_construct}
+@itemize
+
+@item
+Will be @code{acc_construct_parallel} for all OpenACC compute
+constructs as well as many OpenACC Runtime API calls; should be the
+one matching the actual construct, or
+@code{acc_construct_runtime_api}, respectively.
+
+@item
+Will be @code{acc_construct_enter_data} or
+@code{acc_construct_exit_data} when processing variable mappings
+specified in OpenACC @emph{declare} directives; should be
+@code{acc_construct_declare}.
+
+@item
+For implicit @code{acc_ev_device_init_start},
+@code{acc_ev_device_init_end}, and explicit as well as implicit
+@code{acc_ev_alloc}, @code{acc_ev_free},
+@code{acc_ev_enqueue_upload_start}, @code{acc_ev_enqueue_upload_end},
+@code{acc_ev_enqueue_download_start}, and
+@code{acc_ev_enqueue_download_end}, will be
+@code{acc_construct_parallel}; should reflect the real parent
+construct.
+
+@end itemize
+
+@item @code{acc_event_info.*.implicit}
+For @code{acc_ev_alloc}, @code{acc_ev_free},
+@code{acc_ev_enqueue_upload_start}, @code{acc_ev_enqueue_upload_end},
+@code{acc_ev_enqueue_download_start}, and
+@code{acc_ev_enqueue_download_end}, this currently will be @code{1}
+also for explicit usage.
+
+@item @code{acc_event_info.data_event.var_name}
+Always @code{NULL}; not yet implemented.
+
+@item @code{acc_event_info.data_event.host_ptr}
+For @code{acc_ev_alloc}, and @code{acc_ev_free}, this is always
+@code{NULL}.
+
+@item @code{typedef union acc_api_info}
+@dots{} as printed in @cite{5.2.3. Third Argument: API-Specific
+Information}. This should obviously be @code{typedef @emph{struct}
+acc_api_info}.
+
+@item @code{acc_api_info.device_api}
+Possibly not yet implemented correctly for
+@code{acc_ev_compute_construct_start},
+@code{acc_ev_device_init_start}, @code{acc_ev_device_init_end}:
+will always be @code{acc_device_api_none} for these event types.
+For @code{acc_ev_enter_data_start}, it will be
+@code{acc_device_api_none} in some cases.
+
+@item @code{acc_api_info.device_type}
+Always the same as @code{acc_prof_info.device_type}.
+
+@item @code{acc_api_info.vendor}
+Always @code{-1}; not yet implemented.
+
+@item @code{acc_api_info.device_handle}
+Always @code{NULL}; not yet implemented.
+
+@item @code{acc_api_info.context_handle}
+Always @code{NULL}; not yet implemented.
+
+@item @code{acc_api_info.async_handle}
+Always @code{NULL}; not yet implemented.
+
+@end table
+
+Remarks about certain event types:
+
+@table @asis
+
+@item @code{acc_ev_device_init_start}, @code{acc_ev_device_init_end}
+@itemize
+
+@item
+@c See 'DEVICE_INIT_INSIDE_COMPUTE_CONSTRUCT' in
+@c 'libgomp.oacc-c-c++-common/acc_prof-kernels-1.c',
+@c 'libgomp.oacc-c-c++-common/acc_prof-parallel-1.c'.
+Whan a compute construct triggers implicit
+@code{acc_ev_device_init_start} and @code{acc_ev_device_init_end}
+events, they currently aren't @emph{nested within} the corresponding
+@code{acc_ev_compute_construct_start} and
+@code{acc_ev_compute_construct_end}, but they're currently observed
+@emph{before} @code{acc_ev_compute_construct_start}.
+It's not clear what to do: the standard asks us provide a lot of
+details to the @code{acc_ev_compute_construct_start} callback, without
+(implicitly) initializing a device before?
+
+@item
+Callbacks for these event types will not be invoked for calls to the
+@code{acc_set_device_type} and @code{acc_set_device_num} functions.
+It's not clear if they should be.
+
+@end itemize
+
+@item @code{acc_ev_enter_data_start}, @code{acc_ev_enter_data_end}, @code{acc_ev_exit_data_start}, @code{acc_ev_exit_data_end}
+@itemize
+
+@item
+Callbacks for these event types will also be invoked for OpenACC
+@emph{host_data} constructs.
+It's not clear if they should be.
+
+@item
+Callbacks for these event types will also be invoked when processing
+variable mappings specified in OpenACC @emph{declare} directives.
+It's not clear if they should be.
+
+@end itemize
+
+@end table
+
+Callbacks for the following event types will be invoked, but dispatch
+and information provided therein has not yet been thoroughly reviewed:
+
+@itemize
+@item @code{acc_ev_alloc}
+@item @code{acc_ev_free}
+@item @code{acc_ev_update_start}, @code{acc_ev_update_end}
+@item @code{acc_ev_enqueue_upload_start}, @code{acc_ev_enqueue_upload_end}
+@item @code{acc_ev_enqueue_download_start}, @code{acc_ev_enqueue_download_end}
+@end itemize
+
+During device initialization, and finalization, respectively,
+callbacks for the following event types will not yet be invoked:
+
+@itemize
+@item @code{acc_ev_alloc}
+@item @code{acc_ev_free}
+@end itemize
+
+Callbacks for the following event types have not yet been implemented,
+so currently won't be invoked:
+
+@itemize
+@item @code{acc_ev_device_shutdown_start}, @code{acc_ev_device_shutdown_end}
+@item @code{acc_ev_runtime_shutdown}
+@item @code{acc_ev_create}, @code{acc_ev_delete}
+@item @code{acc_ev_wait_start}, @code{acc_ev_wait_end}
+@end itemize
+
+For the following runtime library functions, not all expected
+callbacks will be invoked (mostly concerning implicit device
+initialization):
+
+@itemize
+@item @code{acc_get_num_devices}
+@item @code{acc_set_device_type}
+@item @code{acc_get_device_type}
+@item @code{acc_set_device_num}
+@item @code{acc_get_device_num}
+@item @code{acc_init}
+@item @code{acc_shutdown}
+@end itemize
+
+Aside from implicit device initialization, for the following runtime
+library functions, no callbacks will be invoked for shared-memory
+offloading devices (it's not clear if they should be):
+
+@itemize
+@item @code{acc_malloc}
+@item @code{acc_free}
+@item @code{acc_copyin}, @code{acc_present_or_copyin}, @code{acc_copyin_async}
+@item @code{acc_create}, @code{acc_present_or_create}, @code{acc_create_async}
+@item @code{acc_copyout}, @code{acc_copyout_async}, @code{acc_copyout_finalize}, @code{acc_copyout_finalize_async}
+@item @code{acc_delete}, @code{acc_delete_async}, @code{acc_delete_finalize}, @code{acc_delete_finalize_async}
+@item @code{acc_update_device}, @code{acc_update_device_async}
+@item @code{acc_update_self}, @code{acc_update_self_async}
+@item @code{acc_map_data}, @code{acc_unmap_data}
+@item @code{acc_memcpy_to_device}, @code{acc_memcpy_to_device_async}
+@item @code{acc_memcpy_from_device}, @code{acc_memcpy_from_device_async}
+@end itemize
+
+
+
+@c ---------------------------------------------------------------------
+@c The libgomp ABI
+@c ---------------------------------------------------------------------
+
+@node The libgomp ABI
+@chapter The libgomp ABI
+
+The following sections present notes on the external ABI as
+presented by libgomp. Only maintainers should need them.
+
+@menu
+* Implementing MASTER construct::
+* Implementing CRITICAL construct::
+* Implementing ATOMIC construct::
+* Implementing FLUSH construct::
+* Implementing BARRIER construct::
+* Implementing THREADPRIVATE construct::
+* Implementing PRIVATE clause::
+* Implementing FIRSTPRIVATE LASTPRIVATE COPYIN and COPYPRIVATE clauses::
+* Implementing REDUCTION clause::
+* Implementing PARALLEL construct::
+* Implementing FOR construct::
+* Implementing ORDERED construct::
+* Implementing SECTIONS construct::
+* Implementing SINGLE construct::
+* Implementing OpenACC's PARALLEL construct::
+@end menu
+
+
+@node Implementing MASTER construct
+@section Implementing MASTER construct
+
+@smallexample
+if (omp_get_thread_num () == 0)
+ block
+@end smallexample
+
+Alternately, we generate two copies of the parallel subfunction
+and only include this in the version run by the master thread.
+Surely this is not worthwhile though...
+
+
+
+@node Implementing CRITICAL construct
+@section Implementing CRITICAL construct
+
+Without a specified name,
+
+@smallexample
+ void GOMP_critical_start (void);
+ void GOMP_critical_end (void);
+@end smallexample
+
+so that we don't get COPY relocations from libgomp to the main
+application.
+
+With a specified name, use omp_set_lock and omp_unset_lock with
+name being transformed into a variable declared like
+
+@smallexample
+ omp_lock_t gomp_critical_user_<name> __attribute__((common))
+@end smallexample
+
+Ideally the ABI would specify that all zero is a valid unlocked
+state, and so we wouldn't need to initialize this at
+startup.
+
+
+
+@node Implementing ATOMIC construct
+@section Implementing ATOMIC construct
+
+The target should implement the @code{__sync} builtins.
+
+Failing that we could add
+
+@smallexample
+ void GOMP_atomic_enter (void)
+ void GOMP_atomic_exit (void)
+@end smallexample
+
+which reuses the regular lock code, but with yet another lock
+object private to the library.
+
+
+
+@node Implementing FLUSH construct
+@section Implementing FLUSH construct
+
+Expands to the @code{__sync_synchronize} builtin.
+
+
+
+@node Implementing BARRIER construct
+@section Implementing BARRIER construct
+
+@smallexample
+ void GOMP_barrier (void)
+@end smallexample
+
+
+@node Implementing THREADPRIVATE construct
+@section Implementing THREADPRIVATE construct
+
+In _most_ cases we can map this directly to @code{__thread}. Except
+that OMP allows constructors for C++ objects. We can either
+refuse to support this (how often is it used?) or we can
+implement something akin to .ctors.
+
+Even more ideally, this ctor feature is handled by extensions
+to the main pthreads library. Failing that, we can have a set
+of entry points to register ctor functions to be called.
+
+
+
+@node Implementing PRIVATE clause
+@section Implementing PRIVATE clause
+
+In association with a PARALLEL, or within the lexical extent
+of a PARALLEL block, the variable becomes a local variable in
+the parallel subfunction.
+
+In association with FOR or SECTIONS blocks, create a new
+automatic variable within the current function. This preserves
+the semantic of new variable creation.
+
+
+
+@node Implementing FIRSTPRIVATE LASTPRIVATE COPYIN and COPYPRIVATE clauses
+@section Implementing FIRSTPRIVATE LASTPRIVATE COPYIN and COPYPRIVATE clauses
+
+This seems simple enough for PARALLEL blocks. Create a private
+struct for communicating between the parent and subfunction.
+In the parent, copy in values for scalar and "small" structs;
+copy in addresses for others TREE_ADDRESSABLE types. In the
+subfunction, copy the value into the local variable.
+
+It is not clear what to do with bare FOR or SECTION blocks.
+The only thing I can figure is that we do something like:
+
+@smallexample
+#pragma omp for firstprivate(x) lastprivate(y)
+for (int i = 0; i < n; ++i)
+ body;
+@end smallexample
+
+which becomes
+
+@smallexample
+@{
+ int x = x, y;
+
+ // for stuff
+
+ if (i == n)
+ y = y;
+@}
+@end smallexample
+
+where the "x=x" and "y=y" assignments actually have different
+uids for the two variables, i.e. not something you could write
+directly in C. Presumably this only makes sense if the "outer"
+x and y are global variables.
+
+COPYPRIVATE would work the same way, except the structure
+broadcast would have to happen via SINGLE machinery instead.
+
+
+
+@node Implementing REDUCTION clause
+@section Implementing REDUCTION clause
+
+The private struct mentioned in the previous section should have
+a pointer to an array of the type of the variable, indexed by the
+thread's @var{team_id}. The thread stores its final value into the
+array, and after the barrier, the master thread iterates over the
+array to collect the values.
+
+
+@node Implementing PARALLEL construct
+@section Implementing PARALLEL construct
+
+@smallexample
+ #pragma omp parallel
+ @{
+ body;
+ @}
+@end smallexample
+
+becomes
+
+@smallexample
+ void subfunction (void *data)
+ @{
+ use data;
+ body;
+ @}
+
+ setup data;
+ GOMP_parallel_start (subfunction, &data, num_threads);
+ subfunction (&data);
+ GOMP_parallel_end ();
+@end smallexample
+
+@smallexample
+ void GOMP_parallel_start (void (*fn)(void *), void *data, unsigned num_threads)
+@end smallexample
+
+The @var{FN} argument is the subfunction to be run in parallel.
+
+The @var{DATA} argument is a pointer to a structure used to
communicate data in and out of the subfunction, as discussed
above with respect to FIRSTPRIVATE et al.
+@node Implementing OpenACC's PARALLEL construct
+@section Implementing OpenACC's PARALLEL construct
+
+@smallexample
+ void GOACC_parallel ()
+@end smallexample
+
+
+
@c ---------------------------------------------------------------------
@c Reporting Bugs
@c ---------------------------------------------------------------------