.\" Copyright (c) 2016, IBM Corporation.
.\" Written by Mike Rapoport <rppt@linux.vnet.ibm.com>
.\" and Copyright (C) 2016 Michael Kerrisk <mtk.manpages@gmail.com>
.\" SPDX-License-Identifier: Linux-man-pages-copyleft
.TH ioctl_userfaultfd 2 (date) "Linux man-pages (unreleased)"
ioctl_userfaultfd \- create a file descriptor for handling page faults in user
.RI ( libc ", " \-lc )
.BR "#include <linux/userfaultfd.h>" " /* Definition of " UFFD* " constants */"
.B #include <sys/ioctl.h>
.BI "int ioctl(int " fd ", int " cmd ", ...);"
operations can be performed on a userfaultfd object (created by a call to
using calls of the form:
is a file descriptor referring to a userfaultfd object,
is one of the commands listed below, and
is a pointer to a data structure that is specific to
operations are described below.
operations are used to
These operations allow the caller to choose what features will be enabled and
what kinds of events will be delivered to the application.
The remaining operations are
These operations enable the calling application to resolve page-fault
Enable operation of the userfaultfd and perform API handshake.
argument is a pointer to a
structure, defined as:
    __u64 api;        /* Requested API version (input) */
    __u64 features;   /* Requested features (input/output) */
    __u64 ioctls;     /* Available ioctl() operations (output) */
field denotes the API version requested by the application.
The kernel verifies that it can support the requested API version,
fields to bit masks representing all the available features and the generic
applications should use the
field to perform a two-step handshake.
The kernel responds by setting all supported feature bits.
Applications which do not require any specific features
can begin using the userfaultfd immediately.
Applications which do need specific features
again with a subset of the reported feature bits set
to enable those features.
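.P
The two-step handshake described above might be sketched as follows.
This is only an illustration under stated assumptions:
the helper names
.I uffd_missing_features
and
.I uffd_handshake
are hypothetical,
the descriptor is assumed to have been obtained elsewhere,
and the choice of
.B ENOTSUP
for a missing feature is a convention of this sketch, not of the kernel.

```c
#include <errno.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/userfaultfd.h>

/* Hypothetical helper: return the bits of `wanted` that are absent
   from `supported`; 0 means every wanted feature is available. */
static uint64_t
uffd_missing_features(uint64_t supported, uint64_t wanted)
{
    return wanted & ~supported;
}

/* Hypothetical helper: perform the two-step handshake on an already
   opened userfaultfd.  Returns 0 on success; on failure returns -1
   with errno set. */
static int
uffd_handshake(int uffd, uint64_t wanted)
{
    struct uffdio_api api;

    /* Step 1: query.  No feature bits are requested, so the kernel
       reports every supported feature in api.features. */
    memset(&api, 0, sizeof(api));
    api.api = UFFD_API;
    if (ioctl(uffd, UFFDIO_API, &api) == -1)
        return -1;

    if (uffd_missing_features(api.features, wanted) != 0) {
        errno = ENOTSUP;    /* Sketch-only convention */
        return -1;
    }

    if (wanted == 0)
        return 0;           /* Nothing to enable; ready to use */

    /* Step 2: enable.  Reinitialize the structure and request a
       subset of the reported feature bits. */
    memset(&api, 0, sizeof(api));
    api.api = UFFD_API;
    api.features = wanted;
    return ioctl(uffd, UFFDIO_API, &api);
}
```

A caller that needs no particular feature can pass 0 for
.I wanted
and start using the descriptor after the first step.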
Before Linux 4.11, the
field must be initialized to zero before the call to
and zero (i.e., no feature bits) is placed in the
field by the kernel upon return from
If the application sets unsupported feature bits,
the kernel will zero out the returned
The following feature bits may be set:
.BR UFFD_FEATURE_EVENT_FORK " (since Linux 4.11)"
When this feature is enabled,
the userfaultfd objects associated with a parent process are duplicated
into the child process during
event is delivered to the userfaultfd monitor
.BR UFFD_FEATURE_EVENT_REMAP " (since Linux 4.11)"
If this feature is enabled,
when the faulting process invokes
the userfaultfd monitor will receive an event of type
.BR UFFD_EVENT_REMAP .
.BR UFFD_FEATURE_EVENT_REMOVE " (since Linux 4.11)"
If this feature is enabled,
when the faulting process calls
advice value to free a virtual memory area,
the userfaultfd monitor will receive an event of type
.BR UFFD_EVENT_REMOVE .
.BR UFFD_FEATURE_EVENT_UNMAP " (since Linux 4.11)"
If this feature is enabled,
when the faulting process unmaps virtual memory either explicitly with
or implicitly during either
the userfaultfd monitor will receive an event of type
.BR UFFD_EVENT_UNMAP .
.BR UFFD_FEATURE_MISSING_HUGETLBFS " (since Linux 4.11)"
If this feature bit is set,
the kernel supports registering userfaultfd ranges on hugetlbfs
.BR UFFD_FEATURE_MISSING_SHMEM " (since Linux 4.11)"
If this feature bit is set,
the kernel supports registering userfaultfd ranges on shared memory areas.
This includes all kernel shared memory APIs:
System V shared memory,
.BR memfd_create (2),
.BR UFFD_FEATURE_SIGBUS " (since Linux 4.14)"
.\" commit 2d6d6f5a09a96cc1fec7ed992b825e05f64cb50e
If this feature bit is set, no page-fault events
.RB ( UFFD_EVENT_PAGEFAULT )
signal will be sent to the faulting process.
Applications using this
feature will not require the use of a userfaultfd monitor for processing
memory accesses to the regions registered with userfaultfd.
.BR UFFD_FEATURE_THREAD_ID " (since Linux 4.14)"
If this feature bit is set,
.I uffd_msg.pagefault.feat.ptid
will be set to the faulted thread ID for each page-fault message.
.BR UFFD_FEATURE_PAGEFAULT_FLAG_WP " (since Linux 5.10)"
If this feature bit is set,
userfaultfd supports write-protect faults
for anonymous memory.
(Note that shmem / hugetlbfs support
is indicated by a separate feature.)
.BR UFFD_FEATURE_MINOR_HUGETLBFS " (since Linux 5.13)"
If this feature bit is set,
the kernel supports registering userfaultfd ranges
in minor mode on hugetlbfs-backed memory areas.
.BR UFFD_FEATURE_MINOR_SHMEM " (since Linux 5.14)"
If this feature bit is set,
the kernel supports registering userfaultfd ranges
in minor mode on shmem-backed memory areas.
.BR UFFD_FEATURE_EXACT_ADDRESS " (since Linux 5.18)"
If this feature bit is set,
.I uffd_msg.pagefault.address
will be set to the exact page-fault address that was reported by the hardware,
and will not mask the offset within the page.
Note that old Linux versions might indicate the exact address as well,
even though the feature bit is not set.
.BR UFFD_FEATURE_WP_HUGETLBFS_SHMEM " (since Linux 5.19)"
If this feature bit is set,
userfaultfd supports write-protect faults
for hugetlbfs and shmem / tmpfs memory.
.BR UFFD_FEATURE_WP_UNPOPULATED " (since Linux 6.4)"
If this feature bit is set,
the kernel will handle anonymous memory the same way as file memory,
by allowing the user to write-protect unpopulated page table entries.
.BR UFFD_FEATURE_POISON " (since Linux 6.6)"
If this feature bit is set,
the kernel supports resolving faults with the
.BR UFFD_FEATURE_WP_ASYNC " (since Linux 6.7)"
If this feature bit is set,
the write protection faults would be asynchronously resolved
field can contain the following bits:
.\" FIXME This user-space API seems not fully polished. Why are there
.\" not constants defined for each of the bit-mask values listed below?
operation is supported.
.B 1 << _UFFDIO_REGISTER
operation is supported.
.B 1 << _UFFDIO_UNREGISTER
operation is supported.
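.P
Because the bits listed above have no named mask constants,
a caller has to build each mask from the operation number.
A minimal sketch follows;
the helper name
.I uffd_ioctl_available
is hypothetical.

```c
#include <stdbool.h>
#include <linux/userfaultfd.h>

/* Hypothetical helper: report whether the operation numbered `nr`
   (for example, _UFFDIO_REGISTER) is flagged in an `ioctls` mask
   returned by the kernel. */
static bool
uffd_ioctl_available(__u64 ioctls, unsigned int nr)
{
    /* Widen before shifting: operation numbers can exceed 31. */
    return (ioctls & ((__u64) 1 << nr)) != 0;
}
```

The cast to
.I __u64
matters because a plain
.I int
shift would overflow for operation numbers of 31 or more.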
operation returns 0 on success.
On error, \-1 is returned and
is set to indicate the error.
the kernel may zero the provided
The caller should treat its contents as unspecified,
and reinitialize it before re-attempting another
Possible errors include:
refers to an address that is outside the calling process's
accessible address space.
The API version requested in the
field is not supported by this kernel, or the
field passed to the kernel includes feature bits that are not supported
by the current kernel version.
call already enabled one or more features for this userfaultfd.
the first time with no features set,
is explicitly allowed
as per the two-step feature detection handshake.
.B UFFD_FEATURE_EVENT_FORK
but the calling process doesn't have the
Register a memory address range with the userfaultfd object.
The pages in the range must be \[lq]compatible\[rq].
Please refer to the list of register modes below
for the compatible memory backends for each mode.
argument is a pointer to a
structure, defined as:
struct uffdio_range {
    __u64 start;    /* Start of range */
    __u64 len;      /* Length of range (bytes) */
};
struct uffdio_register {
    struct uffdio_range range;
    __u64 mode;     /* Desired mode of operation (input) */
    __u64 ioctls;   /* Available ioctl() operations (output) */
};
field defines a memory range starting at
bytes that should be handled by the userfaultfd.
field defines the mode of operation desired for this memory region.
The following values may be bitwise ORed to set the userfaultfd mode for
.B UFFDIO_REGISTER_MODE_MISSING
Track page faults on missing pages.
only private anonymous ranges are compatible.
hugetlbfs and shared memory ranges are also compatible.
.B UFFDIO_REGISTER_MODE_WP
Track page faults on write-protected pages.
only private anonymous ranges are compatible.
.B UFFDIO_REGISTER_MODE_MINOR
Track minor page faults.
only hugetlbfs ranges are compatible.
compatibility with shmem ranges was added.
If the operation is successful, the kernel modifies the
bit-mask field to indicate which
operations are available for the specified range.
This returned bit mask can contain the following bits:
operation is supported.
operation is supported.
.B 1 << _UFFDIO_WRITEPROTECT
.B UFFDIO_WRITEPROTECT
operation is supported.
.B 1 << _UFFDIO_ZEROPAGE
operation is supported.
.B 1 << _UFFDIO_CONTINUE
operation is supported.
.B 1 << _UFFDIO_POISON
operation is supported.
operation returns 0 on success.
On error, \-1 is returned and
is set to indicate the error.
Possible errors include:
.\" FIXME Is the following error list correct?
A mapping in the specified range is registered with another
refers to an address that is outside the calling process's
accessible address space.
An invalid or unsupported bit was specified in the
There is no mapping in the specified address range.
is not a multiple of the system page size; or,
is zero; or these fields are otherwise invalid.
There was an incompatible mapping in the specified address range.
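.P
A registration in missing mode could be sketched as below.
The helper names
.I uffd_prepare_register
and
.I uffd_register_missing
are hypothetical,
the caller is assumed to supply the system page size,
and the local alignment check only mirrors the page-size rule
listed among the errors above so that bad ranges are caught early.
Requiring the copy operation in the returned mask,
and reporting its absence as
.BR ENOTSUP ,
are choices of this sketch.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/userfaultfd.h>

/* Hypothetical helper: fill *reg for the range [addr, addr + len).
   Empty or misaligned ranges are rejected locally with EINVAL. */
static int
uffd_prepare_register(struct uffdio_register *reg, uintptr_t addr,
                      size_t len, size_t page_size, uint64_t mode)
{
    if (page_size == 0 || len == 0
        || addr % page_size != 0 || len % page_size != 0) {
        errno = EINVAL;
        return -1;
    }

    memset(reg, 0, sizeof(*reg));
    reg->range.start = addr;
    reg->range.len = len;
    reg->mode = mode;
    return 0;
}

/* Hypothetical helper: register a range in missing mode and check
   that the returned mask allows the copy operation. */
static int
uffd_register_missing(int uffd, void *addr, size_t len, size_t page_size)
{
    struct uffdio_register reg;

    if (uffd_prepare_register(&reg, (uintptr_t) addr, len, page_size,
                              UFFDIO_REGISTER_MODE_MISSING) == -1)
        return -1;

    if (ioctl(uffd, UFFDIO_REGISTER, &reg) == -1)
        return -1;

    /* reg.ioctls now describes what this particular range supports. */
    if ((reg.ioctls & ((__u64) 1 << _UFFDIO_COPY)) == 0) {
        errno = ENOTSUP;    /* Sketch-only convention */
        return -1;
    }
    return 0;
}
```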
.\" ENOMEM if the process is exiting and the
.\" mm_struct has gone by the time userfault grabs it.
.SS UFFDIO_UNREGISTER
Unregister a memory address range from userfaultfd.
The pages in the range must be \[lq]compatible\[rq]
(see the description of
.BR UFFDIO_REGISTER .)
The address range to unregister is specified in the
structure pointed to by
operation returns 0 on success.
On error, \-1 is returned and
is set to indicate the error.
Possible errors include:
structure was not a multiple of the system page size; or the
field was zero; or these fields were otherwise invalid.
There was an incompatible mapping in the specified address range.
There was no mapping in the specified address range.
Atomically copy a contiguous memory chunk into the userfault registered
range and optionally wake up the blocked thread.
The source and destination addresses and the number of bytes to copy are
structure pointed to by
    __u64 dst;    /* Destination of copy */
    __u64 src;    /* Source of copy */
    __u64 len;    /* Number of bytes to copy */
    __u64 mode;   /* Flags controlling behavior of copy */
    __s64 copy;   /* Number of bytes copied, or negated error */
The following value may be bitwise ORed in
to change the behavior of the
.B UFFDIO_COPY_MODE_DONTWAKE
Do not wake up the thread that waits for page-fault resolution
.B UFFDIO_COPY_MODE_WP
Copy the page with read-only permission.
This allows the user to trap the next write to the page,
which will block and generate another write-protect userfault message.
This is used only when both
.B UFFDIO_REGISTER_MODE_MISSING
.B UFFDIO_REGISTER_MODE_WP
modes are enabled for the registered range.
field is used by the kernel to return the number of bytes
that was actually copied, or an error (a negated
.\" FIXME Above: Why is the 'copy' field used to return error values?
.\" This should be explained in the manual page.
If the value returned in
doesn't match the value that was specified in
the operation fails with the error
field is output-only;
it is not read by the
operation returns 0 on success.
In this case, the entire area was copied.
On error, \-1 is returned and
is set to indicate the error.
Possible errors include:
The number of bytes copied (i.e., the value returned in the
does not equal the value that was specified in the
was not a multiple of the system page size, or the range specified by
An invalid bit was specified in the
.BR ENOENT " (since Linux 4.11)"
The faulting process has changed
its virtual memory layout simultaneously with an outstanding
.BR ENOSPC " (from Linux 4.11 until Linux 4.13)"
The faulting process has exited at the time of a
.BR ESRCH " (since Linux 4.13)"
The faulting process has exited at the time of a
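.P
One way to cope with a short copy is to resume from the reported offset.
The following is a sketch only:
the helper name
.I uffd_copy_all
is hypothetical,
and it assumes that a short copy is reported with
.B EAGAIN
and a positive byte count in the
.I copy
field.
Any other failure is passed back to the caller unchanged.

```c
#include <errno.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/userfaultfd.h>

/* Hypothetical helper: copy `len` bytes from `src` to `dst`,
   resuming after short copies.  Returns 0 once everything has been
   copied; on failure returns -1 with errno set. */
static int
uffd_copy_all(int uffd, uint64_t dst, uint64_t src, uint64_t len)
{
    while (len > 0) {
        struct uffdio_copy c;

        memset(&c, 0, sizeof(c));
        c.dst = dst;
        c.src = src;
        c.len = len;
        c.mode = 0;             /* Wake the faulting thread */

        if (ioctl(uffd, UFFDIO_COPY, &c) == 0)
            return 0;           /* The entire area was copied */

        /* Only a short copy with forward progress is retried here. */
        if (errno != EAGAIN || c.copy <= 0)
            return -1;

        dst += (uint64_t) c.copy;
        src += (uint64_t) c.copy;
        len -= (uint64_t) c.copy;
    }
    return 0;
}
```

A caller that sees
.B EAGAIN
returned from this helper made no progress on the last attempt
and must decide for itself whether to try again.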
Zero out a memory range registered with userfaultfd.
The requested range is specified by the
structure pointed to by
struct uffdio_zeropage {
    struct uffdio_range range;
    __u64 mode;     /* Flags controlling behavior of zeropage */
    __s64 zeropage; /* Number of bytes zeroed, or negated error */
};
The following value may be bitwise ORed in
to change the behavior of the
.B UFFDIO_ZEROPAGE_MODE_DONTWAKE
Do not wake up the thread that waits for page-fault resolution.
field is used by the kernel to return the number of bytes
that was actually zeroed,
or an error in the same manner as
.\" FIXME Why is the 'zeropage' field used to return error values?
.\" This should be explained in the manual page.
If the value returned in the
field doesn't match the value that was specified in
the operation fails with the error
field is output-only;
it is not read by the
operation returns 0 on success.
In this case, the entire area was zeroed.
On error, \-1 is returned and
is set to indicate the error.
Possible errors include:
The number of bytes zeroed (i.e., the value returned in the
does not equal the value that was specified in the
was not a multiple of the system page size; or
was zero; or the range specified was invalid.
An invalid bit was specified in the
.BR ESRCH " (since Linux 4.13)"
The faulting process has exited at the time of a
Wake up the thread waiting for page-fault resolution on
a specified memory address range.
operation is used in conjunction with
operations that have the
.B UFFDIO_COPY_MODE_DONTWAKE
.B UFFDIO_ZEROPAGE_MODE_DONTWAKE
The userfault monitor can perform several
operations in a batch and then explicitly wake up the faulting thread using
argument is a pointer to a
structure (shown above) that specifies the address range.
operation returns 0 on success.
On error, \-1 is returned and
is set to indicate the error.
Possible errors include:
structure was not a multiple of the system page size; or
was zero; or the specified range was otherwise invalid.
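.P
The batching pattern described above might look like the following sketch.
The helper name
.I uffd_copy_batch_then_wake
is hypothetical,
and the sketch assumes that the source and destination
are contiguous runs of whole pages
and that the first failing copy should abort the batch.

```c
#include <errno.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/userfaultfd.h>

/* Hypothetical helper: resolve `npages` consecutive pages without
   waking anyone, then wake waiters on the whole range with a single
   call.  Returns 0 on success; on failure returns -1 with errno set. */
static int
uffd_copy_batch_then_wake(int uffd, uint64_t dst, uint64_t src,
                          uint64_t page_size, unsigned int npages)
{
    struct uffdio_range range;

    if (npages == 0)
        return 0;               /* Nothing to copy or wake */

    for (unsigned int i = 0; i < npages; i++) {
        struct uffdio_copy c;

        memset(&c, 0, sizeof(c));
        c.dst = dst + (uint64_t) i * page_size;
        c.src = src + (uint64_t) i * page_size;
        c.len = page_size;
        c.mode = UFFDIO_COPY_MODE_DONTWAKE;     /* Defer the wakeup */

        if (ioctl(uffd, UFFDIO_COPY, &c) == -1)
            return -1;
    }

    /* One explicit wakeup covering everything resolved above. */
    memset(&range, 0, sizeof(range));
    range.start = dst;
    range.len = page_size * npages;
    return ioctl(uffd, UFFDIO_WAKE, &range);
}
```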
.SS UFFDIO_WRITEPROTECT
Write-protect or write-unprotect a userfaultfd-registered memory range
.BR UFFDIO_REGISTER_MODE_WP .
argument is a pointer to a
structure as shown below:
struct uffdio_writeprotect {
    struct uffdio_range range; /* Range to change write permission */
    __u64 mode;                /* Mode to change write permission */
};
There are two mode bits that are supported in this structure:
.B UFFDIO_WRITEPROTECT_MODE_WP
When this mode bit is set,
the ioctl will be a write-protect operation upon the memory range specified by
Otherwise it will be a write-unprotect operation upon the specified range,
which can be used to resolve a userfaultfd write-protect page fault.
.B UFFDIO_WRITEPROTECT_MODE_DONTWAKE
When this mode bit is set,
do not wake up any thread that waits for
page-fault resolution after the operation.
This can be specified only if
.B UFFDIO_WRITEPROTECT_MODE_WP
operation returns 0 on success.
On error, \-1 is returned and
is set to indicate the error.
Possible errors include:
structure was not a multiple of the system page size; or
was zero; or the specified range was otherwise invalid.
The process was interrupted; retry this call.
The range specified in
For example, the virtual address does not exist,
or is not registered with userfaultfd write-protect mode.
Encountered a generic fault during processing.
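.P
Switching a range between the protected and unprotected states
could be sketched as follows.
The helper names
.I uffd_wp_mode
and
.I uffd_set_wp
are hypothetical;
the sketch never sets the don't-wake bit,
so an unprotect call also wakes any thread blocked on the range.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/userfaultfd.h>

/* Hypothetical helper: choose the mode bits for the request.
   With the WP bit the range is write-protected; with no bits set
   it is write-unprotected, which resolves a write-protect fault. */
static uint64_t
uffd_wp_mode(bool protect)
{
    return protect ? UFFDIO_WRITEPROTECT_MODE_WP : 0;
}

/* Hypothetical helper: apply or remove write protection on
   [start, start + len).  Returns 0 on success; on failure returns
   -1 with errno set. */
static int
uffd_set_wp(int uffd, uint64_t start, uint64_t len, bool protect)
{
    struct uffdio_writeprotect wp;

    memset(&wp, 0, sizeof(wp));
    wp.range.start = start;
    wp.range.len = len;
    wp.mode = uffd_wp_mode(protect);

    return ioctl(uffd, UFFDIO_WRITEPROTECT, &wp);
}
```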
Resolve a minor page fault
by installing page table entries
for existing pages in the page cache.
argument is a pointer to a
structure as shown below:
struct uffdio_continue {
    struct uffdio_range range;
                   /* Range to install PTEs for and continue */
    __u64 mode;    /* Flags controlling the behavior of continue */
    __s64 mapped;  /* Number of bytes mapped, or negated error */
};
The following value may be bitwise ORed in
to change the behavior of the
.B UFFDIO_CONTINUE_MODE_DONTWAKE
Do not wake up the thread that waits for page-fault resolution.
field is used by the kernel
to return the number of bytes that were actually mapped,
or an error in the same manner as
If the value returned in the
field doesn't match the value that was specified in
the operation fails with the error
field is output-only;
it is not read by the
operation returns 0 on success.
the entire area was mapped.
On error, \-1 is returned and
is set to indicate the error.
Possible errors include:
The number of bytes mapped
(i.e., the value returned in the
does not equal the value that was specified in the
One or more pages were already mapped in the given range.
No existing page could be found in the page cache for the given range.
was not a multiple of the system page size; or
was zero; or the range specified was invalid.
An invalid bit was specified in the
The faulting process has changed its virtual memory layout simultaneously with
Allocating memory needed to set up the page table mappings failed.
The faulting process has exited at the time of a
Mark an address range as "poisoned".
Future accesses to these addresses will raise a
this works by installing page table entries,
rather than "really" poisoning the underlying physical pages.
This means it only affects this particular address space.
argument is a pointer to a
structure as shown below:
struct uffdio_poison {
    struct uffdio_range range;
                    /* Range to install poison PTE markers in */
    __u64 mode;     /* Flags controlling the behavior of poison */
    __s64 updated;  /* Number of bytes poisoned, or negated error */
};
The following value may be bitwise ORed in
to change the behavior of the
.B UFFDIO_POISON_MODE_DONTWAKE
Do not wake up the thread that waits for page-fault resolution.
field is used by the kernel
to return the number of bytes that were actually poisoned,
or an error in the same manner as
If the value returned in the
field doesn't match the value that was specified in
the operation fails with the error
field is output-only;
it is not read by the
operation returns 0 on success.
the entire area was poisoned.
On error, \-1 is returned and
is set to indicate the error.
Possible errors include:
The number of bytes poisoned
(i.e., the value returned in the
does not equal the value that was specified in the
was not a multiple of the system page size; or
was zero; or the range specified was invalid.
An invalid bit was specified in the
One or more pages were already mapped in the given range.
The faulting process has changed its virtual memory layout simultaneously with
Allocating memory for page table entries failed.
The faulting process has exited at the time of a
See descriptions of the individual operations, above.
See descriptions of the individual operations, above.
In addition, the following general errors can occur for all of the
operations described above:
does not point to a valid memory address.
(For all operations except
The userfaultfd object has not yet been enabled (via the
In order to detect available userfault features and
enable some subset of those features,
the userfaultfd file descriptor must be closed after the first
operation that queries feature availability and reopened before
operation that actually enables the desired features.
.BR userfaultfd (2).
.I Documentation/admin\-guide/mm/userfaultfd.rst
in the Linux kernel source tree