Commit | Line | Data |
---|---|---|
e00c3a07 | 1 | .\" Copyright (C) 2001 David Gómez <davidge@jazzfree.com> |
fea681da | 2 | .\" |
5fbde956 | 3 | .\" SPDX-License-Identifier: Linux-man-pages-copyleft |
fea681da MK |
4 | .\" |
5 | .\" Based on comments from mm/filemap.c. Last modified on 10-06-2001 | |
c11b1abf | 6 | .\" Modified, 25 Feb 2002, Michael Kerrisk, <mtk.manpages@gmail.com> |
fea681da | 7 | .\" Added notes on MADV_DONTNEED |
5baa8f09 MK |
8 | .\" 2010-06-19, mtk, Added documentation of MADV_MERGEABLE and |
9 | .\" MADV_UNMERGEABLE | |
f5321b14 MK |
10 | .\" 2010-06-15, Andi Kleen, Add documentation of MADV_HWPOISON. |
11 | .\" 2010-06-19, Andi Kleen, Add documentation of MADV_SOFT_OFFLINE. | |
3d4b49b0 MK |
12 | .\" 2011-09-18, Doug Goldstein <cardoe@cardoe.com> |
13 | .\" Document MADV_HUGEPAGE and MADV_NOHUGEPAGE | |
347e325b | 14 | .\" |
4c1c5274 | 15 | .TH madvise 2 (date) "Linux man-pages (unreleased)" |
fea681da MK |
16 | .SH NAME |
17 | madvise \- give advice about use of memory | |
f934f70e AC |
18 | .SH LIBRARY |
19 | Standard C library | |
8fc3b2cf | 20 | .RI ( libc ", " \-lc ) |
fea681da | 21 | .SH SYNOPSIS |
c7db92b9 | 22 | .nf |
fea681da | 23 | .B #include <sys/mman.h> |
68e4db0a | 24 | .PP |
c64cd13e | 25 | .BI "int madvise(void " addr [. length "], size_t " length ", int " advice ); |
c7db92b9 | 26 | .fi |
68e4db0a | 27 | .PP |
d39ad78f | 28 | .RS -4 |
cc4615cc MK |
29 | Feature Test Macro Requirements for glibc (see |
30 | .BR feature_test_macros (7)): | |
d39ad78f | 31 | .RE |
68e4db0a | 32 | .PP |
cc4615cc | 33 | .BR madvise (): |
9d2adbae MK |
34 | .nf |
35 | Since glibc 2.19: | |
36 | _DEFAULT_SOURCE | |
37 | Up to and including glibc 2.19: | |
38 | _BSD_SOURCE | |
39 | .fi | |
fea681da MK |
40 | .SH DESCRIPTION |
41 | The | |
e511ffb6 | 42 | .BR madvise () |
845c8bea MK |
43 | system call is used to give advice or directions to the kernel |
44 | about the address range beginning at address | |
14f5ae6d | 45 | .I addr |
fea681da | 46 | and with size |
756761bf MK |
47 | .IR length . |
48 | .BR madvise () | |
49 | only operates on whole pages, therefore | |
50 | .I addr | |
51 | must be page-aligned. | |
52 | The value of | |
fea681da | 53 | .I length |
756761bf | 54 | is rounded up to a multiple of page size. |
a8db50d3 MK |
55 | In most cases, |
56 | the goal of such advice is to improve system or application performance. | |
efeece04 | 57 | .PP |
845c8bea MK |
58 | Initially, the system call supported a set of "conventional" |
59 | .I advice | |
60 | values, which are also available on several other implementations. | |
61 | (Note, though, that | |
62 | .BR madvise () | |
63 | is not specified in POSIX.) | |
64 | Subsequently, a number of Linux-specific | |
1ae6b2c7 | 65 | .I advice |
845c8bea MK |
66 | values have been added. |
67 | .\" | |
68 | .\" ====================================================================== | |
69 | .\" | |
70 | .SS Conventional advice values | |
71 | The | |
72 | .I advice | |
73 | values listed below | |
74 | allow an application to tell the kernel how it expects to use | |
fea681da MK |
75 | some mapped or shared memory areas, so that the kernel can choose |
76 | appropriate read-ahead and caching techniques. | |
845c8bea MK |
77 | These |
78 | .I advice | |
79 | values do not influence the semantics of the application | |
fea681da MK |
80 | (except in the case of |
81 | .BR MADV_DONTNEED ), | |
845c8bea | 82 | but may influence its performance. |
845c8bea MK |
83 | All of the |
84 | .I advice | |
85 | values listed here have analogs in the POSIX-specified | |
86 | .BR posix_madvise (3) | |
87 | function, and the values have the same meanings, with the exception of | |
88 | .BR MADV_DONTNEED . | |
dd3568a1 | 89 | .PP |
c13182ef | 90 | The advice is indicated in the |
fea681da | 91 | .I advice |
95467f1d | 92 | argument, which is one of the following: |
fea681da MK |
93 | .TP |
94 | .B MADV_NORMAL | |
c13182ef MK |
95 | No special treatment. |
96 | This is the default. | |
fea681da MK |
97 | .TP |
98 | .B MADV_RANDOM | |
99 | Expect page references in random order. | |
100 | (Hence, read ahead may be less useful than normally.) | |
101 | .TP | |
102 | .B MADV_SEQUENTIAL | |
103 | Expect page references in sequential order. | |
104 | (Hence, pages in the given range can be aggressively read ahead, | |
105 | and may be freed soon after they are accessed.) | |
106 | .TP | |
107 | .B MADV_WILLNEED | |
108 | Expect access in the near future. | |
109 | (Hence, it might be a good idea to read some pages ahead.) | |
110 | .TP | |
111 | .B MADV_DONTNEED | |
112 | Do not expect access in the near future. | |
113 | (For the time being, the application is finished with the given range, | |
114 | so the kernel can free resources associated with it.) | |
efeece04 | 115 | .IP |
a727d7cc MK |
116 | After a successful |
117 | .B MADV_DONTNEED | |
118 | operation, | |
119 | the semantics of memory access in the specified region are changed: | |
120 | subsequent accesses of pages in the range will succeed, but will result | |
d5e9c9bb MK |
121 | in either repopulating the memory contents from the |
122 | up-to-date contents of the underlying mapped file | |
cd15218e MK |
123 | (for shared file mappings, shared anonymous mappings, |
124 | and shmem-based techniques such as System V shared memory segments) | |
125 | or zero-fill-on-demand pages for anonymous private mappings. | |
efeece04 | 126 | .IP |
d5e9c9bb | 127 | Note that, when applied to shared mappings, |
1ae6b2c7 | 128 | .B MADV_DONTNEED |
d5e9c9bb MK |
129 | might not lead to immediate freeing of the pages in the range. |
130 | The kernel is free to delay freeing the pages until an appropriate moment. | |
131 | The resident set size (RSS) of the calling process will be immediately | |
132 | reduced, however. | |
efeece04 | 133 | .IP |
a727d7cc | 134 | .B MADV_DONTNEED |
756761bf | 135 | cannot be applied to locked pages, or |
1ae6b2c7 | 136 | .B VM_PFNMAP |
36e5bc92 MK |
137 | pages. |
138 | (Pages marked with the kernel-internal | |
139 | .B VM_PFNMAP | |
140 | .\" http://lwn.net/Articles/162860/ | |
141 | flag are special memory areas that are not managed | |
142 | by the virtual memory subsystem. | |
143 | Such pages are typically created by device drivers that | |
144 | map the pages into user space.) | |
756761bf MK |
145 | .IP |
146 | Support for Huge TLB pages was added in Linux v5.18. | |
147 | Addresses within a mapping backed by Huge TLB pages must be aligned | |
148 | to the underlying Huge TLB page size, | |
149 | and the range length is rounded up | |
150 | to a multiple of the underlying Huge TLB page size. | |
845c8bea MK |
151 | .\" |
152 | .\" ====================================================================== | |
153 | .\" | |
154 | .SS Linux-specific advice values | |
155 | The following Linux-specific | |
156 | .I advice | |
157 | values have no counterparts in the POSIX-specified | |
158 | .BR posix_madvise (3), | |
159 | and may or may not have counterparts in the | |
160 | .BR madvise () | |
fb2bb886 MK |
161 | interface available on other implementations. |
162 | Note that some of these operations change the semantics of memory accesses. | |
835c4d5c | 163 | .TP |
31c1f2b0 | 164 | .BR MADV_REMOVE " (since Linux 2.6.16)" |
498f9213 | 165 | .\" commit f6b3ec238d12c8cc6cc71490c6e3127988460349 |
835c4d5c | 166 | Free up a given range of pages |
c13182ef | 167 | and its associated backing store. |
756761bf | 168 | This is equivalent to punching a hole in the corresponding |
49170db5 MK |
169 | range of the backing store (see |
170 | .BR fallocate (2)). | |
171 | Subsequent accesses in the specified address range will see | |
756761bf | 172 | data with a value of zero. |
bc6eb5ef MK |
173 | .\" Databases want to use this feature to drop a section of their |
174 | .\" bufferpool (shared memory segments) - without writing back to | |
175 | .\" disk/swap space. This feature is also useful for supporting | |
176 | .\" hot-plug memory on UML. | |
efeece04 | 177 | .IP |
5575818d | 178 | The specified address range must be mapped shared and writable. |
756761bf | 179 | This flag cannot be applied to locked pages, or |
1ae6b2c7 | 180 | .B VM_PFNMAP |
36e5bc92 | 181 | pages. |
efeece04 | 182 | .IP |
4e07c70f MK |
183 | In the initial implementation, only |
184 | .BR tmpfs (5) | |
756761bf | 185 | supported |
deb99649 MK |
186 | .BR MADV_REMOVE ; |
187 | but since Linux 3.5, | |
188 | .\" commit 3f31d07571eeea18a7d34db9af21d2285b807a17 | |
f7282b7b | 189 | any filesystem which supports the |
deb99649 | 190 | .BR fallocate (2) |
1ae6b2c7 | 191 | .B FALLOC_FL_PUNCH_HOLE |
95467f1d | 192 | mode also supports |
f7282b7b | 193 | .BR MADV_REMOVE . |
756761bf MK |
194 | Filesystems which do not support |
195 | .B MADV_REMOVE | |
196 | fail with the error | |
deb99649 | 197 | .BR EOPNOTSUPP . |
756761bf MK |
198 | .IP |
199 | Support for the Huge TLB filesystem was added in Linux v4.3. | |
835c4d5c | 200 | .TP |
31c1f2b0 | 201 | .BR MADV_DONTFORK " (since Linux 2.6.16)" |
498f9213 | 202 | .\" commit f822566165dd46ff5de9bf895cfa6c51f53bb0c4 |
835c4d5c MK |
203 | .\" See http://lwn.net/Articles/171941/ |
204 | Do not make the pages in this range available to the child after a | |
205 | .BR fork (2). | |
206 | This is useful to prevent copy-on-write semantics from changing | |
95467f1d | 207 | the physical location of a page if the parent writes to it after a |
835c4d5c MK |
208 | .BR fork (2). |
209 | (Such page relocations cause problems for hardware that | |
95467f1d | 210 | DMAs into the page.) |
835c4d5c | 211 | .\" [PATCH] madvise MADV_DONTFORK/MADV_DOFORK |
c13182ef MK |
212 | .\" Currently, copy-on-write may change the physical address of |
213 | .\" a page even if the user requested that the page is pinned in | |
214 | .\" memory (either by mlock or by get_user_pages). This happens | |
215 | .\" if the process forks meanwhile, and the parent writes to that | |
216 | .\" page. As a result, the page is orphaned: in case of | |
217 | .\" get_user_pages, the application will never see any data hardware | |
218 | .\" DMA's into this page after the COW. In case of mlock'd memory, | |
835c4d5c | 219 | .\" the parent is not getting the realtime/security benefits of mlock. |
c13182ef MK |
220 | .\" |
221 | .\" In particular, this affects the Infiniband modules which do DMA from | |
835c4d5c | 222 | .\" and into user pages all the time. |
c13182ef MK |
223 | .\" |
224 | .\" This patch adds madvise options to control whether memory range is | |
225 | .\" inherited across fork. Useful e.g. for when hardware is doing DMA | |
226 | .\" from/into these pages. Could also be useful to an application | |
227 | .\" wanting to speed up its forks by cutting large areas out of | |
835c4d5c | 228 | .\" consideration. |
49237f3d MK |
229 | .\" |
230 | .\" SEE ALSO: http://lwn.net/Articles/171941/ | |
231 | .\" "Tweaks to madvise() and posix_fadvise()", 14 Feb 2006 | |
835c4d5c | 232 | .TP |
31c1f2b0 | 233 | .BR MADV_DOFORK " (since Linux 2.6.16)" |
835c4d5c MK |
234 | Undo the effect of |
235 | .BR MADV_DONTFORK , | |
d9bfdb9c | 236 | restoring the default behavior, whereby a mapping is inherited across |
835c4d5c | 237 | .BR fork (2). |
523c2f67 | 238 | .TP |
9bfc9cb1 | 239 | .BR MADV_HWPOISON " (since Linux 2.6.32)" |
498f9213 | 240 | .\" commit 9893e49d64a4874ea67849ee2cfbf3f3d6817573 |
11c25e24 MK |
241 | Poison the pages in the range specified by |
242 | .I addr | |
243 | and | |
1ae6b2c7 | 244 | .I length |
11c25e24 MK |
245 | and handle subsequent references to those pages |
246 | like a hardware memory corruption. | |
33a0ccb2 | 247 | This operation is available only for privileged |
523c2f67 AK |
248 | .RB ( CAP_SYS_ADMIN ) |
249 | processes. | |
250 | This operation may result in the calling process receiving a | |
251 | .B SIGBUS | |
252 | and the page being unmapped. | |
efeece04 | 253 | .IP |
ae24c212 | 254 | This feature is intended for testing of memory error-handling code; |
33a0ccb2 | 255 | it is available only if the kernel was configured with |
ae24c212 AK |
256 | .BR CONFIG_MEMORY_FAILURE . |
257 | .TP | |
5baa8f09 | 258 | .BR MADV_MERGEABLE " (since Linux 2.6.32)" |
498f9213 | 259 | .\" commit f8af4da3b4c14e7267c4ffb952079af3912c51c5 |
5baa8f09 MK |
260 | Enable Kernel Samepage Merging (KSM) for the pages in the range specified by |
261 | .I addr | |
262 | and | |
e5963382 | 263 | .IR length . |
3b18c59b | 264 | The kernel regularly scans those areas of user memory that have |
5baa8f09 MK |
265 | been marked as mergeable, |
266 | looking for pages with identical content. | |
267 | These are replaced by a single write-protected page (which is automatically | |
268 | copied if a process later wants to update the content of the page). | |
33a0ccb2 | 269 | KSM merges only private anonymous pages (see |
5baa8f09 | 270 | .BR mmap (2)). |
efeece04 | 271 | .IP |
5baa8f09 MK |
272 | The KSM feature is intended for applications that generate many |
273 | instances of the same data (e.g., virtualization systems such as KVM). | |
274 | It can consume a lot of processing power; use with care. | |
66a9882e | 275 | See the Linux kernel source file |
b49c2acb | 276 | .I Documentation/admin\-guide/mm/ksm.rst |
5baa8f09 | 277 | for more details. |
efeece04 | 278 | .IP |
5baa8f09 | 279 | The |
1ae6b2c7 | 280 | .B MADV_MERGEABLE |
5baa8f09 | 281 | and |
1ae6b2c7 | 282 | .B MADV_UNMERGEABLE |
33a0ccb2 | 283 | operations are available only if the kernel was configured with |
8c3fb604 | 284 | .BR CONFIG_KSM . |
5baa8f09 MK |
285 | .TP |
286 | .BR MADV_UNMERGEABLE " (since Linux 2.6.32)" | |
287 | Undo the effect of an earlier | |
1ae6b2c7 | 288 | .B MADV_MERGEABLE |
5baa8f09 | 289 | operation on the specified address range; |
ff24dd19 | 290 | KSM unmerges whatever pages it had merged in the address range specified by |
1ae6b2c7 | 291 | .I addr |
5baa8f09 MK |
292 | and |
293 | .IR length . | |
e8dd3ed2 | 294 | .TP |
9bfc9cb1 | 295 | .BR MADV_SOFT_OFFLINE " (since Linux 2.6.33)" |
6b1e34f2 MK |
296 | .\" commit afcf938ee0aac4ef95b1a23bac704c6fbeb26de6 |
297 | Soft offline the pages in the range specified by | |
298 | .I addr | |
299 | and | |
300 | .IR length . | |
301 | The memory of each page in the specified range is preserved | |
302 | (i.e., when next accessed, the same content will be visible, | |
303 | but in a new physical page frame), | |
304 | and the original page is offlined | |
305 | (i.e., no longer used, and taken out of normal memory management). | |
306 | The effect of the | |
307 | .B MADV_SOFT_OFFLINE | |
308 | operation is invisible to (i.e., does not change the semantics of) | |
309 | the calling process. | |
efeece04 | 310 | .IP |
6b1e34f2 MK |
311 | This feature is intended for testing of memory error-handling code; |
312 | it is available only if the kernel was configured with | |
313 | .BR CONFIG_MEMORY_FAILURE . | |
314 | .TP | |
e8dd3ed2 | 315 | .BR MADV_HUGEPAGE " (since Linux 2.6.38)" |
498f9213 | 316 | .\" commit 0af4e98b6b095c74588af04872f83d333c958c32 |
3d4b49b0 MK |
317 | .\" http://lwn.net/Articles/358904/ |
318 | .\" https://lwn.net/Articles/423584/ | |
95467f1d | 319 | Enable Transparent Huge Pages (THP) for pages in the range specified by |
e8dd3ed2 DG |
320 | .I addr |
321 | and | |
322 | .IR length . | |
e8dd3ed2 DG |
323 | The kernel will regularly scan the areas marked as huge page candidates |
324 | to replace them with huge pages. | |
325 | The kernel will also allocate huge pages directly when the region is | |
3d4b49b0 | 326 | naturally aligned to the huge page size (see |
e8dd3ed2 | 327 | .BR posix_memalign (3)). |
efeece04 | 328 | .IP |
c0e140e6 | 329 | This feature is primarily aimed at applications that use large mappings of |
e9dedcd2 | 330 | data and access large regions of that memory at a time (e.g., virtualization |
c0e140e6 | 331 | systems such as QEMU). |
ee8655b5 MK |
332 | It can very easily waste memory (e.g., a 2\ MB mapping that only ever accesses |
333 | 1 byte will result in 2\ MB of wired memory instead of one 4\ KB page). | |
66a9882e | 334 | See the Linux kernel source file |
b49c2acb | 335 | .I Documentation/admin\-guide/mm/transhuge.rst |
e8dd3ed2 | 336 | for more details. |
efeece04 | 337 | .IP |
38b08118 MK |
338 | Most common kernel configurations provide |
339 | .BR MADV_HUGEPAGE -style | |
340 | behavior by default, and thus | |
1ae6b2c7 | 341 | .B MADV_HUGEPAGE |
38b08118 MK |
342 | is normally not necessary. |
343 | It is mostly intended for embedded systems, where | |
20b9102a | 344 | .BR MADV_HUGEPAGE -style |
38b08118 MK |
345 | behavior may not be enabled by default in the kernel. |
346 | On such systems, | |
347 | this flag can be used in order to selectively enable THP. | |
348 | Whenever | |
1ae6b2c7 | 349 | .B MADV_HUGEPAGE |
38b08118 MK |
350 | is used, it should always be in regions of memory with |
351 | an access pattern that the developer knows in advance won't risk | |
352 | increasing the memory footprint of the application when transparent | |
353 | hugepages are enabled. | |
354 | .IP | |
797a95e0 ZK |
355 | .\" commit 99cb0dbd47a15d395bf3faa78dc122bc5efe3fc0 |
356 | Since Linux 5.4, | |
357 | automatic scan of eligible areas and replacement by huge pages works with | |
358 | private anonymous pages (see | |
359 | .BR mmap (2)), | |
360 | shmem pages, | |
361 | and file-backed pages. | |
362 | For all memory types, | |
363 | memory may only be replaced by huge pages on hugepage-aligned boundaries. | |
364 | For file-mapped memory | |
365 | \(emincluding tmpfs (see | |
366 | .BR tmpfs (5))\(em | |
367 | the mapping must also be naturally hugepage-aligned within the file. | |
368 | Additionally, | |
369 | for file-backed, | |
370 | non-tmpfs memory, | |
371 | the file must not be open for write and the mapping must be executable. | |
372 | .IP | |
373 | The VMA must not be marked | |
374 | .BR VM_NOHUGEPAGE , | |
375 | .BR VM_HUGETLB , | |
376 | .BR VM_IO , | |
377 | .BR VM_DONTEXPAND , | |
378 | .BR VM_MIXEDMAP , | |
379 | or | |
380 | .BR VM_PFNMAP , | |
381 | nor can it be stack memory or backed by a DAX-enabled device | |
382 | (unless the DAX device is hot-plugged as System RAM). | |
383 | The process must also not have | |
384 | .B PR_SET_THP_DISABLE | |
385 | set (see | |
386 | .BR prctl (2)). | |
387 | .IP | |
e8dd3ed2 | 388 | The |
b106cd5b ZK |
389 | .BR MADV_HUGEPAGE , |
390 | .BR MADV_NOHUGEPAGE , | |
e8dd3ed2 | 391 | and |
b106cd5b | 392 | .B MADV_COLLAPSE |
33a0ccb2 | 393 | operations are available only if the kernel was configured with |
797a95e0 ZK |
394 | .B CONFIG_TRANSPARENT_HUGEPAGE |
395 | and file/shmem memory is only supported if the kernel was configured with | |
396 | .BR CONFIG_READ_ONLY_THP_FOR_FS . | |
e8dd3ed2 DG |
397 | .TP |
398 | .BR MADV_NOHUGEPAGE " (since Linux 2.6.38)" | |
399 | Ensures that memory in the address range specified by | |
1ae6b2c7 | 400 | .I addr |
e8dd3ed2 | 401 | and |
1ae6b2c7 | 402 | .I length |
38b08118 | 403 | will not be backed by transparent hugepages. |
c639b314 | 404 | .TP |
b106cd5b ZK |
405 | .BR MADV_COLLAPSE " (since Linux 6.1)" |
406 | .\" commit 7d8faaf155454f8798ec56404faca29a82689c77 | |
407 | .\" commit 34488399fa08faaf664743fa54b271eb6f9e1321 | |
408 | Perform a best-effort synchronous collapse of | |
409 | the native pages mapped by the memory range | |
410 | into Transparent Huge Pages (THPs). | |
411 | .B MADV_COLLAPSE | |
412 | operates on the current state of memory of the calling process and | |
413 | makes no persistent changes or guarantees on how pages will be mapped, | |
414 | constructed, | |
415 | or faulted in the future. | |
416 | .IP | |
417 | .B MADV_COLLAPSE | |
418 | supports private anonymous pages (see | |
419 | .BR mmap (2)), | |
420 | shmem pages, | |
421 | and file-backed pages. | |
422 | See | |
423 | .B MADV_HUGEPAGE | |
424 | for general information on memory requirements for THP. | |
425 | If the range provided spans multiple VMAs, | |
426 | the semantics of the collapse over each VMA is independent from the others. | |
427 | If collapse of a given huge page-aligned/sized region fails, | |
428 | the operation may continue to attempt collapsing | |
429 | the remainder of the specified memory. | |
430 | .B MADV_COLLAPSE | |
431 | will automatically clamp the provided range to be hugepage-aligned. | |
432 | .IP | |
433 | All non-resident pages covered by the range | |
434 | will first be swapped/faulted-in, | |
435 | before being copied onto a freshly allocated hugepage. | |
436 | If the native pages compose the same PTE-mapped hugepage, | |
437 | and are suitably aligned, | |
438 | allocation of a new hugepage may be elided and | |
439 | collapse may happen in-place. | |
440 | Unmapped pages will have their data directly initialized to 0 | |
441 | in the new hugepage. | |
442 | However, | |
443 | for every eligible hugepage-aligned/sized region to be collapsed, | |
444 | at least one page must currently be backed by physical memory. | |
445 | .IP | |
446 | .B MADV_COLLAPSE | |
447 | is independent of any sysfs | |
448 | (see | |
449 | .BR sysfs (5)) | |
450 | setting under | |
451 | .IR /sys/kernel/mm/transparent_hugepage , | |
452 | both in terms of determining THP eligibility, | |
453 | and allocation semantics. | |
454 | See Linux kernel source file | |
455 | .I Documentation/admin\-guide/mm/transhuge.rst | |
456 | for more information. | |
457 | .B MADV_COLLAPSE | |
458 | also ignores | |
459 | .B huge= | |
460 | tmpfs mount option when operating on tmpfs files. | |
461 | Allocation for the new hugepage may enter direct reclaim and/or compaction, | |
462 | regardless of VMA flags | |
463 | (though | |
464 | .B VM_NOHUGEPAGE | |
465 | is still respected). | |
466 | .IP | |
467 | When the system has multiple NUMA nodes, | |
468 | the hugepage will be allocated from | |
469 | the node providing the most native pages. | |
470 | .IP | |
471 | If all hugepage-sized/aligned regions covered by the provided range were | |
472 | either successfully collapsed, | |
473 | or were already PMD-mapped THPs, | |
474 | this operation will be deemed successful. | |
475 | Note that this doesn't guarantee anything about | |
476 | other possible mappings of the memory. | |
477 | In the event multiple hugepage-aligned/sized areas fail to collapse, | |
478 | only the most-recently\[en]failed code will be set in | |
479 | .IR errno . | |
480 | .TP | |
c639b314 | 481 | .BR MADV_DONTDUMP " (since Linux 3.4)" |
498f9213 MK |
482 | .\" commit 909af768e88867016f427264ae39d27a57b6a8ed |
483 | .\" commit accb61fe7bb0f5c2a4102239e4981650f9048519 | |
c639b314 JB |
484 | Exclude from a core dump those pages in the range specified by |
485 | .I addr | |
486 | and | |
487 | .IR length . | |
488 | This is useful in applications that have large areas of memory | |
489 | that are known not to be useful in a core dump. | |
490 | The effect of | |
1ae6b2c7 | 491 | .B MADV_DONTDUMP |
c639b314 | 492 | takes precedence over the bit mask that is set via the |
750653a8 | 493 | .I /proc/[pid]/coredump_filter |
c639b314 JB |
494 | file (see |
495 | .BR core (5)). | |
496 | .TP | |
497 | .BR MADV_DODUMP " (since Linux 3.4)" | |
498 | Undo the effect of an earlier | |
499 | .BR MADV_DONTDUMP . | |
9ec13698 | 500 | .TP |
d432f10d MK |
501 | .BR MADV_FREE " (since Linux 4.5)" |
502 | The application no longer requires the pages in the range specified by | |
1ae6b2c7 | 503 | .I addr |
d432f10d MK |
504 | and |
505 | .IR length . | |
506 | The kernel can thus free these pages, | |
507 | but the freeing could be delayed until memory pressure occurs. | |
508 | For each of the pages that has been marked to be freed | |
509 | but has not yet been freed, | |
510 | the free operation will be canceled if the caller writes into the page. | |
511 | After a successful | |
512 | .B MADV_FREE | |
513 | operation, any stale data (i.e., dirty, unwritten pages) will be lost | |
514 | when the kernel frees the pages. | |
515 | However, subsequent writes to pages in the range will succeed | |
516 | and then the kernel cannot free those dirtied pages, | |
517 | so that the caller can always see just-written data. | |
518 | If there is no subsequent write, | |
519 | the kernel can free the pages at any time. | |
520 | Once pages in the range have been freed, the caller will | |
521 | see zero-fill-on-demand pages upon subsequent page references. | |
efeece04 | 522 | .IP |
d432f10d MK |
523 | The |
524 | .B MADV_FREE | |
525 | operation | |
526 | can be applied only to private anonymous pages (see | |
9ec13698 | 527 | .BR mmap (2)). |
b324e17d | 528 | Before Linux 4.12, |
07ca8b34 MK |
529 | .\" commit 93e06c7a645343d222c9a838834a51042eebbbf7 |
530 | when freeing pages on a swapless system, | |
531 | the pages in the given range are freed instantly, | |
9ec13698 | 532 | regardless of memory pressure. |
c0c4f6c2 RR |
533 | .TP |
534 | .BR MADV_WIPEONFORK " (since Linux 4.14)" | |
535 | .\" commit d2cd9ede6e193dd7d88b6d27399e96229a551b19 | |
536 | Present the child process with zero-filled memory in this range after a | |
537 | .BR fork (2). | |
2c63b13e MK |
538 | This is useful in forking servers in order to ensure |
539 | that sensitive per-process data | |
540 | (for example, PRNG seeds, cryptographic secrets, and so on) | |
541 | is not handed to child processes. | |
c0c4f6c2 RR |
542 | .IP |
543 | The | |
544 | .B MADV_WIPEONFORK | |
2c63b13e | 545 | operation can be applied only to private anonymous pages (see |
c0c4f6c2 | 546 | .BR mmap (2)). |
dca5d444 MK |
547 | .IP |
548 | Within the child created by | |
549 | .BR fork (2), | |
550 | the | |
551 | .B MADV_WIPEONFORK | |
552 | setting remains in place on the specified address range. | |
553 | This setting is cleared during | |
554 | .BR execve (2). | |
c0c4f6c2 RR |
555 | .TP |
556 | .BR MADV_KEEPONFORK " (since Linux 4.14)" | |
557 | .\" commit d2cd9ede6e193dd7d88b6d27399e96229a551b19 | |
558 | Undo the effect of an earlier | |
559 | .BR MADV_WIPEONFORK . | |
c9c9ab2e MK |
560 | .TP |
561 | .BR MADV_COLD " (since Linux 5.4)" | |
562 | .\" commit 9c276cc65a58faf98be8e56962745ec99ab87636 | |
563 | Deactivate a given range of pages. | |
564 | This will make the pages a more probable | |
565 | reclaim target should there be memory pressure. | |
566 | This is a nondestructive operation. | |
567 | The advice might be ignored for some pages in the range when it is not | |
568 | applicable. | |
569 | .TP | |
570 | .BR MADV_PAGEOUT " (since Linux 5.4)" | |
571 | .\" commit 1a4e58cce84ee88129d5d49c064bd2852b481357 | |
572 | Reclaim a given range of pages. | |
573 | This is done to free up memory occupied by these pages. | |
574 | If a page is anonymous, it will be swapped out. | |
575 | If a page is file-backed and dirty, it will be written back to the backing | |
576 | storage. | |
577 | The advice might be ignored for some pages in the range when it is not | |
578 | applicable. | |
9f307c06 DH |
579 | .TP |
580 | .BR MADV_POPULATE_READ " (since Linux 5.14)" | |
581 | Populate (prefault) page tables readable, | |
582 | faulting in all pages in the range just as if manually reading from each page; | |
583 | however, | |
584 | avoid the actual memory access that would have been performed after handling | |
585 | the fault. | |
586 | .IP | |
587 | In contrast to | |
588 | .BR MAP_POPULATE , | |
589 | .B MADV_POPULATE_READ | |
590 | does not hide errors, | |
591 | can be applied to (parts of) existing mappings and will always populate | |
592 | (prefault) page tables readable. | |
593 | One example use case is prefaulting a file mapping, | |
594 | reading all file content from disk; | |
595 | however, | |
596 | pages won't be dirtied and consequently won't have to be written back to disk | |
597 | when evicting the pages from memory. | |
598 | .IP | |
599 | Depending on the underlying mapping, | |
600 | map the shared zeropage, | |
601 | preallocate memory or read the underlying file; | |
602 | files with holes might or might not preallocate blocks. | |
603 | If populating fails, | |
604 | a | |
605 | .B SIGBUS | |
606 | signal is not generated; instead, an error is returned. | |
607 | .IP | |
608 | If | |
609 | .B MADV_POPULATE_READ | |
610 | succeeds, | |
611 | all page tables have been populated (prefaulted) readable once. | |
612 | If | |
613 | .B MADV_POPULATE_READ | |
614 | fails, | |
615 | some page tables might have been populated. | |
616 | .IP | |
617 | .B MADV_POPULATE_READ | |
618 | cannot be applied to mappings without read permissions | |
619 | and special mappings, | |
620 | for example, | |
621 | mappings marked with kernel-internal flags such as | |
622 | .B VM_PFNMAP | |
623 | or | |
624 | .BR VM_IO , | |
625 | or secret memory regions created using | |
626 | .BR memfd_secret (2). | |
627 | .IP | |
628 | Note that with | |
629 | .BR MADV_POPULATE_READ , | |
630 | the process can be killed at any moment when the system runs out of memory. | |
631 | .TP | |
632 | .BR MADV_POPULATE_WRITE " (since Linux 5.14)" | |
633 | Populate (prefault) page tables writable, | |
634 | faulting in all pages in the range just as if manually writing to each | |
635 | page; | |
636 | however, | |
637 | avoid the actual memory access that would have been performed after handling | |
638 | the fault. | |
639 | .IP | |
640 | In contrast to | |
641 | .BR MAP_POPULATE , | |
642 | .BR MADV_POPULATE_WRITE " does not hide errors," | |
643 | can be applied to (parts of) existing mappings and will always populate | |
644 | (prefault) page tables writable. | |
645 | One example use case is preallocating memory, | |
646 | breaking any CoW (Copy on Write). | |
647 | .IP | |
648 | Depending on the underlying mapping, | |
649 | preallocate memory or read the underlying file; | |
650 | files with holes will preallocate blocks. | |
651 | If populating fails, | |
652 | a | |
653 | .B SIGBUS | |
654 | signal is not generated; instead, an error is returned. | |
655 | .IP | |
656 | If | |
657 | .B MADV_POPULATE_WRITE | |
658 | succeeds, | |
659 | all page tables have been populated (prefaulted) writable once. | |
660 | If | |
661 | .B MADV_POPULATE_WRITE | |
662 | fails, | |
663 | some page tables might have been populated. | |
664 | .IP | |
665 | .B MADV_POPULATE_WRITE | |
666 | cannot be applied to mappings without write permissions | |
667 | and special mappings, | |
668 | for example, | |
669 | mappings marked with kernel-internal flags such as | |
670 | .B VM_PFNMAP | |
671 | or | |
672 | .BR VM_IO , | |
673 | or secret memory regions created using | |
674 | .BR memfd_secret (2). | |
675 | .IP | |
676 | Note that with | |
677 | .BR MADV_POPULATE_WRITE , | |
678 | the process can be killed at any moment when the system runs out of memory. | |
47297adb | 679 | .SH RETURN VALUE |
95467f1d | 680 | On success, |
e511ffb6 | 681 | .BR madvise () |
c13182ef MK |
682 | returns zero. |
683 | On error, it returns \-1 and | |
fea681da | 684 | .I errno |
f6a4078b | 685 | is set to indicate the error. |
fea681da MK |
686 | .SH ERRORS |
687 | .TP | |
7208ad0a MK |
688 | .B EACCES |
689 | .I advice | |
690 | is | |
691 | .BR MADV_REMOVE , | |
692 | but the specified address range is not a shared writable mapping. | |
693 | .TP | |
fea681da MK |
694 | .B EAGAIN |
695 | A kernel resource was temporarily unavailable. | |
696 | .TP | |
697 | .B EBADF | |
698 | The map exists, but the area maps something that isn't a file. | |
699 | .TP | |
b106cd5b ZK |
700 | .B EBUSY |
701 | (for | |
702 | .BR MADV_COLLAPSE ) | |
703 | Could not charge hugepage to cgroup: cgroup limit exceeded. | |
704 | .TP | |
9f307c06 DH |
705 | .B EFAULT |
706 | .I advice | |
707 | is | |
708 | .B MADV_POPULATE_READ | |
709 | or | |
710 | .BR MADV_POPULATE_WRITE , | |
711 | and populating (prefaulting) page tables failed because a | |
712 | .B SIGBUS | |
713 | would have been generated on actual memory access and the reason is not a | |
714 | HW poisoned page | |
715 | (HW poisoned pages can, | |
716 | for example, | |
717 | be created using the | |
718 | .B MADV_HWPOISON | |
719 | flag described elsewhere in this page). | |
720 | .TP | |
fea681da | 721 | .B EINVAL |
ac95034e MK |
722 | .I addr |
723 | is not page-aligned or | |
c608a033 | 724 | .I length |
601f3bc6 | 725 | is negative. |
c608a033 | 726 | .\" .I length |
fea681da | 727 | .\" is zero, |
ac95034e MK |
728 | .TP |
729 | .B EINVAL | |
730 | .I advice | |
731 | is not valid. | |
732 | .TP | |
733 | .B EINVAL | |
4335648d | 734 | .I advice |
8604677b CTR |
735 | is |
736 | .B MADV_COLD | |
737 | or | |
738 | .B MADV_PAGEOUT | |
739 | and the specified address range includes locked, Huge TLB pages, or | |
740 | .B VM_PFNMAP | |
741 | pages. | |
742 | .TP | |
743 | .B EINVAL | |
744 | .I advice | |
4335648d MK |
745 | is |
746 | .B MADV_DONTNEED | |
747 | or | |
1ae6b2c7 | 748 | .B MADV_REMOVE |
36e5bc92 MK |
749 | and the specified address range includes locked, Huge TLB pages, or |
750 | .B VM_PFNMAP | |
751 | pages. | |
ac95034e MK |
752 | .TP |
753 | .B EINVAL | |
c13182ef | 754 | .I advice |
ac95034e | 755 | is |
1ae6b2c7 | 756 | .B MADV_MERGEABLE |
5baa8f09 | 757 | or |
ac95034e | 758 | .BR MADV_UNMERGEABLE , |
5baa8f09 MK |
759 | but the kernel was not configured with |
760 | .BR CONFIG_KSM . | |
fea681da | 761 | .TP |
c0c4f6c2 RR |
762 | .B EINVAL |
763 | .I advice | |
764 | is | |
1ae6b2c7 | 765 | .B MADV_FREE |
c0c4f6c2 | 766 | or |
1ae6b2c7 | 767 | .B MADV_WIPEONFORK |
c0c4f6c2 RR |
768 | but the specified address range includes file, Huge TLB, |
769 | .BR MAP_SHARED , | |
770 | or | |
1ae6b2c7 | 771 | .B VM_PFNMAP |
c0c4f6c2 RR |
772 | ranges. |
773 | .TP | |
9f307c06 DH |
774 | .B EINVAL |
775 | .I advice | |
776 | is | |
777 | .B MADV_POPULATE_READ | |
778 | or | |
779 | .BR MADV_POPULATE_WRITE , | |
780 | but the specified address range includes ranges with insufficient permissions | |
781 | or special mappings, | |
782 | for example, | |
783 | mappings marked with kernel-internal flags such as | |
784 | .B VM_IO | |
785 | or | |
786 | .BR VM_PFNMAP , | |
787 | or secret memory regions created using | |
788 | .BR memfd_secret (2). | |
789 | .TP | |
fea681da | 790 | .B EIO |
682edefb MK |
791 | (for |
792 | .BR MADV_WILLNEED ) | |
793 | Paging in this area would exceed the process's | |
fea681da MK |
794 | maximum resident set size. |
795 | .TP | |
796 | .B ENOMEM | |
682edefb MK |
797 | (for |
798 | .BR MADV_WILLNEED ) | |
799 | Not enough memory: paging in failed. | |
fea681da MK |
800 | .TP |
801 | .B ENOMEM | |
b106cd5b ZK |
802 | (for |
803 | .BR MADV_COLLAPSE ) | |
804 | Not enough memory: could not allocate hugepage. | |
805 | .TP | |
806 | .B ENOMEM | |
fea681da MK |
807 | Addresses in the specified range are not currently |
808 | mapped, or are outside the address space of the process. | |
9c0b66eb | 809 | .TP |
9f307c06 DH |
810 | .B ENOMEM |
811 | .I advice | |
812 | is | |
813 | .B MADV_POPULATE_READ | |
814 | or | |
815 | .BR MADV_POPULATE_WRITE , | |
816 | and populating (prefaulting) page tables failed because there was not enough | |
817 | memory. | |
818 | .TP | |
9c0b66eb MK |
819 | .B EPERM |
820 | .I advice | |
821 | is | |
822 | .BR MADV_HWPOISON , | |
823 | but the caller does not have the | |
824 | .B CAP_SYS_ADMIN | |
825 | capability. | |
9f307c06 DH |
826 | .TP |
827 | .B EHWPOISON | |
828 | .I advice | |
829 | is | |
830 | .B MADV_POPULATE_READ | |
831 | or | |
832 | .BR MADV_POPULATE_WRITE , | |
833 | and populating (prefaulting) page tables failed because a HW poisoned page | |
834 | (HW poisoned pages can, | |
835 | for example, | |
836 | be created using the | |
837 | .B MADV_HWPOISON | |
838 | flag described elsewhere in this page) | |
839 | was encountered. | |
6e519900 MK |
840 | .SH VERSIONS |
841 | Since Linux 3.18, | |
842 | .\" commit d3ac21cacc24790eb45d735769f35753f5b56ceb | |
843 | support for this system call is optional, | |
844 | depending on the setting of the | |
845 | .B CONFIG_ADVISE_SYSCALLS | |
846 | configuration option. | |
3113c7f3 | 847 | .SH STANDARDS |
c73c7130 MK |
848 | .BR madvise () |
849 | is not specified by any standards. | |
850 | Versions of this system call, implementing a wide variety of | |
851 | .I advice | |
852 | values, exist on many other implementations. | |
853 | Other implementations typically implement at least the flags listed | |
854 | above under | |
95467f1d | 855 | .IR "Conventional advice flags" , |
c73c7130 | 856 | albeit with some variation in semantics. |
efeece04 | 857 | .PP |
a1d5f77c MK |
858 | POSIX.1-2001 describes |
859 | .BR posix_madvise (3) | |
682edefb MK |
860 | with constants |
861 | .BR POSIX_MADV_NORMAL , | |
f78ed33a | 862 | .BR POSIX_MADV_RANDOM , |
b7bc9bfd MK |
863 | .BR POSIX_MADV_SEQUENTIAL , |
864 | .BR POSIX_MADV_WILLNEED , | |
865 | and | |
866 | .BR POSIX_MADV_DONTNEED , | |
95467f1d | 867 | with behavior close to the similarly named flags listed above. |
4fb31341 | 868 | .SH NOTES |
c634028a | 869 | .SS Linux notes |
fea681da | 870 | The Linux implementation requires that the address |
14f5ae6d | 871 | .I addr |
fea681da MK |
872 | be page-aligned, and allows |
873 | .I length | |
c13182ef MK |
874 | to be zero. |
875 | If there are some parts of the specified address range | |
fea681da | 876 | that are not mapped, the Linux version of |
e511ffb6 | 877 | .BR madvise () |
c13182ef | 878 | ignores them and applies the call to the rest (but returns |
fea681da MK |
879 | .B ENOMEM |
880 | from the system call, as it should). | |
bd14f1e3 | 881 | .PP |
4bbb0652 | 882 | .I madvise(0,\ 0,\ advice) |
bd14f1e3 ZK |
883 | will return zero if and only if | |
884 | .I advice | |
885 | is supported by the kernel, so this call can be relied on to probe for support. | |
889829be MK |
886 | .\" .SH HISTORY |
887 | .\" The | |
888 | .\" .BR madvise () | |
889 | .\" function first appeared in 4.4BSD. | |
47297adb | 890 | .SH SEE ALSO |
fea681da | 891 | .BR getrlimit (2), |
1ae6b2c7 | 892 | .BR memfd_secret (2), |
fea681da MK |
893 | .BR mincore (2), |
894 | .BR mmap (2), | |
895 | .BR mprotect (2), | |
896 | .BR msync (2), | |
c639b314 | 897 | .BR munmap (2), |
48cb32cd | 898 | .BR prctl (2), |
81ec67d8 | 899 | .BR process_madvise (2), |
3a4e05a1 | 900 | .BR posix_madvise (3), |
c639b314 | 901 | .BR core (5) |