* Stream/Descriptor Precautions:: Precautions needed if you use both
descriptors and streams.
* Scatter-Gather:: Fast I/O to discontinuous buffers.
+* Copying File Data:: Copying data between files.
* Memory-mapped I/O:: Using files like memory.
* Waiting for I/O:: How to check for input or output
on multiple file descriptors.
@c This is a syscall for Linux v4.6. The sysdeps/posix fallback emulation
@c is also MT-Safe since it calls preadv.
-This function is similar to the @code{preadv} function, with the difference
-it adds an extra @var{flags} parameter of type @code{int}. The supported
-@var{flags} are dependent of the underlying system. For Linux it supports:
+This function is similar to the @code{preadv} function, with the
+difference it adds an extra @var{flags} parameter of type @code{int}.
+Additionally, if @var{offset} is @math{-1}, the current file position
+is used and updated (like the @code{readv} function).
+
+The supported @var{flags} are dependent of the underlying system. For
+Linux it supports:
@vtable @code
@item RWF_HIPRI
@c This is a syscall for Linux v4.6. The sysdeps/posix fallback emulation
@c is also MT-Safe since it calls pwritev.
-This function is similar to the @code{pwritev} function, with the difference
-it adds an extra @var{flags} parameter of type @code{int}. The supported
-@var{flags} are dependent of the underlying system and for Linux it supports
-the same ones as for @code{preadv2}.
+This function is similar to the @code{pwritev} function, with the
+difference it adds an extra @var{flags} parameter of type @code{int}.
+Additionally, if @var{offset} is @math{-1}, the current file position
+should is used and updated (like the @code{writev} function).
+
+The supported @var{flags} are dependent of the underlying system. For
+Linux, the supported flags are the same as those for @code{preadv2}.
When the source file is compiled with @code{_FILE_OFFSET_BITS == 64} the
@code{pwritev2} function is in fact @code{pwritev64v2} and the type
@code{pwritev2} and so transparently replaces the 32 bit interface.
@end deftypefun
+@node Copying File Data
+@section Copying data between two files
+@cindex copying files
+@cindex file copy
+
+A special function is provided to copy data between two files on the
+same file system. The system can optimize such copy operations. This
+is particularly important on network file systems, where the data would
+otherwise have to be transferred twice over the network.
+
+Note that this function only copies file data, but not metadata such as
+file permissions or extended attributes.
+
+@deftypefun ssize_t copy_file_range (int @var{inputfd}, off64_t *@var{inputpos}, int @var{outputfd}, off64_t *@var{outputpos}, ssize_t @var{length}, unsigned int @var{flags})
+@standards{GNU, unistd.h}
+@safety{@prelim{}@mtsafe{}@assafe{}@acsafe{}}
+
+This function copies up to @var{length} bytes from the file descriptor
+@var{inputfd} to the file descriptor @var{outputfd}.
+
+The function can operate on both the current file position (like
+@code{read} and @code{write}) and an explicit offset (like @code{pread}
+and @code{pwrite}). If the @var{inputpos} pointer is null, the file
+position of @var{inputfd} is used as the starting point of the copy
+operation, and the file position is advanced during it. If
+@var{inputpos} is not null, then @code{*@var{inputpos}} is used as the
+starting point of the copy operation, and @code{*@var{inputpos}} is
+incremented by the number of copied bytes, but the file position remains
+unchanged. Similar rules apply to @var{outputfd} and @var{outputpos}
+for the output file position.
+
+The @var{flags} argument is currently reserved and must be zero.
+
+The @code{copy_file_range} function returns the number of bytes copied.
+This can be less than the specified @var{length} in case the input file
+contains fewer remaining bytes than @var{length}, or if a read or write
+failure occurs. The return value is zero if the end of the input file
+is encountered immediately.
+
+If no bytes can be copied, to report an error, @code{copy_file_range}
+returns the value @math{-1} and sets @code{errno}. The following
+@code{errno} error conditions are specific to this function:
+
+@table @code
+@item EISDIR
+At least one of the descriptors @var{inputfd} or @var{outputfd} refers
+to a directory.
+
+@item EINVAL
+At least one of the descriptors @var{inputfd} or @var{outputfd} refers
+to a non-regular, non-directory file (such as a socket or a FIFO).
+
+The input or output positions before are after the copy operations are
+outside of an implementation-defined limit.
+
+The @var{flags} argument is not zero.
+
+@item EFBIG
+The new file size would exceed the process file size limit.
+@xref{Limits on Resources}.
+
+The input or output positions before are after the copy operations are
+outside of an implementation-defined limit. This can happen if the file
+was not opened with large file support (LFS) on 32-bit machines, and the
+copy operation would create a file which is larger than what
+@code{off_t} could represent.
+
+@item EBADF
+The argument @var{inputfd} is not a valid file descriptor open for
+reading.
+
+The argument @var{outputfd} is not a valid file descriptor open for
+writing, or @var{outputfd} has been opened with @code{O_APPEND}.
+
+@item EXDEV
+The input and output files reside on different file systems.
+@end table
+
+In addition, @code{copy_file_range} can fail with the error codes
+which are used by @code{read}, @code{pread}, @code{write}, and
+@code{pwrite}.
+
+The @code{copy_file_range} function is a cancellation point. In case of
+cancellation, the input location (the file position or the value at
+@code{*@var{inputpos}}) is indeterminate.
+@end deftypefun
+
@node Memory-mapped I/O
@section Memory-mapped I/O
Memory mapping only works on entire pages of memory. Thus, addresses
for mapping must be page-aligned, and length values will be rounded up.
-To determine the size of a page the machine uses one should use
+To determine the default size of a page the machine uses one should use:
@vindex _SC_PAGESIZE
@smallexample
size_t page_size = (size_t) sysconf (_SC_PAGESIZE);
@end smallexample
-@noindent
-These functions are declared in @file{sys/mman.h}.
+On some systems, mappings can use larger page sizes
+for certain files, and applications can request larger page sizes for
+anonymous mappings as well (see the @code{MAP_HUGETLB} flag below).
+
+The following functions are declared in @file{sys/mman.h}:
@deftypefun {void *} mmap (void *@var{address}, size_t @var{length}, int @var{protect}, int @var{flags}, int @var{filedes}, off_t @var{offset})
@standards{POSIX, sys/mman.h}
@code{malloc} for large blocks. This is not an issue with @theglibc{},
as the included @code{malloc} automatically uses @code{mmap} where appropriate.
+@item MAP_HUGETLB
+@standards{Linux, sys/mman.h}
+This requests that the system uses an alternative page size which is
+larger than the default page size for the mapping. For some workloads,
+increasing the page size for large mappings improves performance because
+the system needs to handle far fewer pages. For other workloads which
+require frequent transfer of pages between storage or different nodes,
+the decreased page granularity may cause performance problems due to the
+increased page size and larger transfers.
+
+In order to create the mapping, the system needs physically contiguous
+memory of the size of the increased page size. As a result,
+@code{MAP_HUGETLB} mappings are affected by memory fragmentation, and
+their creation can fail even if plenty of memory is available in the
+system.
+
+Not all file systems support mappings with an increased page size.
+
+The @code{MAP_HUGETLB} flag is specific to Linux.
+
+@c There is a mechanism to select different hugepage sizes; see
+@c include/uapi/asm-generic/hugetlb_encode.h in the kernel sources.
+
@c Linux has some other MAP_ options, which I have not discussed here.
@c MAP_DENYWRITE, MAP_EXECUTABLE and MAP_GROWSDOWN don't seem applicable to
@c user programs (and I don't understand the last two). MAP_LOCKED does
@item EINVAL
-Either @var{address} was unusable, or inconsistent @var{flags} were
-given.
+Either @var{address} was unusable (because it is not a multiple of the
+applicable page size), or inconsistent @var{flags} were given.
+
+If @code{MAP_HUGETLB} was specified, the file or system does not support
+large page sizes.
@item EACCES
causing any changes to the pages to be lost, as well as swapped
out pages to be discarded.
+@item MADV_HUGEPAGE
+@standards{Linux, sys/mman.h}
+Indicate that it is beneficial to increase the page size for this
+mapping. This can improve performance for larger mappings because the
+system needs to handle far fewer pages. However, if parts of the
+mapping are frequently transferred between storage or different nodes,
+performance may suffer because individual transfers can become
+substantially larger due to the increased page size.
+
+This flag is specific to Linux.
+
+@item MADV_NOHUGEPAGE
+Undo the effect of a previous @code{MADV_HUGEPAGE} advice. This flag
+is specific to Linux.
+
@end vtable
The POSIX names are slightly different, but with the same meanings:
On failure @code{errno} is set.
@end deftypefn
+@deftypefun int memfd_create (const char *@var{name}, unsigned int @var{flags})
+@standards{Linux, sys/mman.h}
+@safety{@prelim{}@mtsafe{}@assafe{}@acsafe{@acsfd{}}}
+The @code{memfd_create} function returns a file descriptor which can be
+used to create memory mappings using the @code{mmap} function. It is
+similar to the @code{shm_open} function in the sense that these mappings
+are not backed by actual files. However, the descriptor returned by
+@code{memfd_create} does not correspond to a named object; the
+@var{name} argument is used for debugging purposes only (e.g., will
+appear in @file{/proc}), and separate invocations of @code{memfd_create}
+with the same @var{name} will not return descriptors for the same region
+of memory. The descriptor can also be used to create alias mappings
+within the same process.
+
+The descriptor initially refers to a zero-length file. Before mappings
+can be created which are backed by memory, the file size needs to be
+increased with the @code{ftruncate} function. @xref{File Size}.
+
+The @var{flags} argument can be a combination of the following flags:
+
+@vtable @code
+@item MFD_CLOEXEC
+@standards{Linux, sys/mman.h}
+The descriptor is created with the @code{O_CLOEXEC} flag.
+
+@item MFD_ALLOW_SEALING
+@standards{Linux, sys/mman.h}
+The descriptor supports the addition of seals using the @code{fcntl}
+function.
+
+@item MFD_HUGETLB
+@standards{Linux, sys/mman.h}
+This requests that mappings created using the returned file descriptor
+use a larger page size. See @code{MAP_HUGETLB} above for details.
+
+This flag is incompatible with @code{MFD_ALLOW_SEALING}.
+@end vtable
+
+@code{memfd_create} returns a file descriptor on success, and @math{-1}
+on failure.
+
+The following @code{errno} error conditions are defined for this
+function:
+
+@table @code
+@item EINVAL
+An invalid combination is specified in @var{flags}, or @var{name} is
+too long.
+
+@item EFAULT
+The @var{name} argument does not point to a string.
+
+@item EMFILE
+The operation would exceed the file descriptor limit for this process.
+
+@item ENFILE
+The operation would exceed the system-wide file descriptor limit.
+
+@item ENOMEM
+There is not enough memory for the operation.
+@end table
+@end deftypefun
+
@node Waiting for I/O
@section Waiting for Input or Output
@cindex waiting for input or output