1@node Resource Usage And Limitation, Non-Local Exits, Date and Time, Top
2@c %MENU% Functions for examining resource usage and getting and setting limits
3@chapter Resource Usage And Limitation
4This chapter describes functions for examining how much of various kinds of
5resources (CPU time, memory, etc.) a process has used and getting and setting
6limits on future usage.
7
8@menu
9* Resource Usage:: Measuring various resources used.
10* Limits on Resources:: Specifying limits on resource usage.
11* Priority:: Reading or setting process run priority.
* Memory Resources:: Querying available memory resources.
13* Processor Resources:: Learn about the processors available.
14@end menu
15
16
17@node Resource Usage
18@section Resource Usage
19
20@pindex sys/resource.h
21The function @code{getrusage} and the data type @code{struct rusage}
22are used to examine the resource usage of a process. They are declared
23in @file{sys/resource.h}.
24
5ce8f203 25@deftypefun int getrusage (int @var{processes}, struct rusage *@var{rusage})
d08a7e4c 26@standards{BSD, sys/resource.h}
27@safety{@prelim{}@mtsafe{}@assafe{}@acsafe{}}
28@c On HURD, this calls task_info 3 times. On UNIX, it's a syscall.
29This function reports resource usage totals for processes specified by
30@var{processes}, storing the information in @code{*@var{rusage}}.
31
32In most systems, @var{processes} has only two valid values:
33
a449fc68 34@vtable @code
5ce8f203 35@item RUSAGE_SELF
d08a7e4c 36@standards{BSD, sys/resource.h}
37Just the current process.
38
5ce8f203 39@item RUSAGE_CHILDREN
d08a7e4c 40@standards{BSD, sys/resource.h}
5ce8f203 41All child processes (direct and indirect) that have already terminated.
a449fc68 42@end vtable
5ce8f203 43
44The return value of @code{getrusage} is zero for success, and @code{-1}
45for failure.
46
47@table @code
48@item EINVAL
49The argument @var{processes} is not valid.
50@end table
51@end deftypefun
52
53One way of getting resource usage for a particular child process is with
54the function @code{wait4}, which returns totals for a child when it
55terminates. @xref{BSD Wait Functions}.
56
5ce8f203 57@deftp {Data Type} {struct rusage}
d08a7e4c 58@standards{BSD, sys/resource.h}
59This data type stores various resource usage statistics. It has the
60following members, and possibly others:
61
62@table @code
63@item struct timeval ru_utime
64Time spent executing user instructions.
65
66@item struct timeval ru_stime
67Time spent in operating system code on behalf of @var{processes}.
68
69@item long int ru_maxrss
70The maximum resident set size used, in kilobytes. That is, the maximum
71number of kilobytes of physical memory that @var{processes} used
72simultaneously.
73
74@item long int ru_ixrss
75An integral value expressed in kilobytes times ticks of execution, which
76indicates the amount of memory used by text that was shared with other
77processes.
78
79@item long int ru_idrss
80An integral value expressed the same way, which is the amount of
81unshared memory used for data.
82
83@item long int ru_isrss
84An integral value expressed the same way, which is the amount of
85unshared memory used for stack space.
86
87@item long int ru_minflt
88The number of page faults which were serviced without requiring any I/O.
89
90@item long int ru_majflt
91The number of page faults which were serviced by doing I/O.
92
93@item long int ru_nswap
94The number of times @var{processes} was swapped entirely out of main memory.
95
96@item long int ru_inblock
97The number of times the file system had to read from the disk on behalf
98of @var{processes}.
99
100@item long int ru_oublock
101The number of times the file system had to write to the disk on behalf
102of @var{processes}.
103
104@item long int ru_msgsnd
105Number of IPC messages sent.
106
107@item long int ru_msgrcv
108Number of IPC messages received.
109
110@item long int ru_nsignals
111Number of signals received.
112
113@item long int ru_nvcsw
114The number of times @var{processes} voluntarily invoked a context switch
115(usually to wait for some service).
116
117@item long int ru_nivcsw
118The number of times an involuntary context switch took place (because
119a time slice expired, or another process of higher priority was
120scheduled).
121@end table
122@end deftp
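As a rough sketch of how @code{getrusage} and @code{struct rusage} fit
together, the following C fragment reads the CPU time consumed so far by
the calling process. The helper names @code{timeval_to_seconds} and
@code{self_cpu_seconds} are made up for this illustration and are not
part of @theglibc{}.

```c
#include <sys/resource.h>
#include <sys/time.h>

/* Convert a struct timeval to seconds.  Hypothetical helper.  */
static double
timeval_to_seconds (const struct timeval *tv)
{
  return (double) tv->tv_sec + (double) tv->tv_usec / 1e6;
}

/* Store the user and system CPU time used so far by the calling
   process.  Returns 0 on success and -1 on failure, mirroring
   getrusage.  Hypothetical helper.  */
int
self_cpu_seconds (double *user, double *sys)
{
  struct rusage ru;

  if (getrusage (RUSAGE_SELF, &ru) != 0)
    return -1;

  *user = timeval_to_seconds (&ru.ru_utime);
  *sys = timeval_to_seconds (&ru.ru_stime);
  return 0;
}
```

A caller might invoke @code{self_cpu_seconds} before and after a
computation and subtract the two results to estimate its CPU cost.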
123
124@code{vtimes} is a historical function that does some of what
125@code{getrusage} does. @code{getrusage} is a better choice.
126
127@code{vtimes} and its @code{vtimes} data structure are declared in
128@file{sys/vtimes.h}.
129@pindex sys/vtimes.h
5ce8f203 130
8ded91fb 131@deftypefun int vtimes (struct vtimes *@var{current}, struct vtimes *@var{child})
d08a7e4c 132@standards{???, sys/vtimes.h}
133@safety{@prelim{}@mtsafe{}@assafe{}@acsafe{}}
134@c Calls getrusage twice.
135
136@code{vtimes} reports resource usage totals for a process.
137
138If @var{current} is non-null, @code{vtimes} stores resource usage totals for
139the invoking process alone in the structure to which it points. If
140@var{child} is non-null, @code{vtimes} stores resource usage totals for all
141past children (which have terminated) of the invoking process in the structure
142to which it points.
143
144@deftp {Data Type} {struct vtimes}
145This data type contains information about the resource usage of a process.
146Each member corresponds to a member of the @code{struct rusage} data type
147described above.
148
149@table @code
@item vm_utime
User CPU time. Analogous to @code{ru_utime} in @code{struct rusage}.
@item vm_stime
System CPU time. Analogous to @code{ru_stime} in @code{struct rusage}.
@item vm_idsrss
Data and stack memory. The sum of the values that would be reported as
@code{ru_idrss} and @code{ru_isrss} in @code{struct rusage}.
@item vm_ixrss
Shared memory. Analogous to @code{ru_ixrss} in @code{struct rusage}.
@item vm_maxrss
Maximum resident set size. Analogous to @code{ru_maxrss} in
@code{struct rusage}.
@item vm_majflt
Major page faults. Analogous to @code{ru_majflt} in @code{struct rusage}.
@item vm_minflt
Minor page faults. Analogous to @code{ru_minflt} in @code{struct rusage}.
@item vm_nswap
Swap count. Analogous to @code{ru_nswap} in @code{struct rusage}.
@item vm_inblk
Disk reads. Analogous to @code{ru_inblock} in @code{struct rusage}.
@item vm_oublk
Disk writes. Analogous to @code{ru_oublock} in @code{struct rusage}.
172@end table
173@end deftp
174
175
176The return value is zero if the function succeeds; @code{-1} otherwise.
177
178
179
180@end deftypefun
184
185@node Limits on Resources
186@section Limiting Resource Usage
187@cindex resource limits
188@cindex limits on resource usage
189@cindex usage limits
190
191You can specify limits for the resource usage of a process. When the
192process tries to exceed a limit, it may get a signal, or the system call
193by which it tried to do so may fail, depending on the resource. Each
194process initially inherits its limit values from its parent, but it can
195subsequently change them.
196
197There are two per-process limits associated with a resource:
198@cindex limit
199
200@table @dfn
201@item current limit
The current limit is the value the system will not allow usage to
exceed. It is also called the ``soft limit'' because the process being
limited can generally raise the current limit at will, up to the
maximum limit.
205@cindex current limit
206@cindex soft limit
207
208@item maximum limit
209The maximum limit is the maximum value to which a process is allowed to
210set its current limit. It is also called the ``hard limit'' because
211there is no way for a process to get around it. A process may lower
212its own maximum limit, but only the superuser may increase a maximum
213limit.
214@cindex maximum limit
215@cindex hard limit
216@end table
217
218@pindex sys/resource.h
219The symbols for use with @code{getrlimit}, @code{setrlimit},
0bc93a2f 220@code{getrlimit64}, and @code{setrlimit64} are defined in
221@file{sys/resource.h}.
222
5ce8f203 223@deftypefun int getrlimit (int @var{resource}, struct rlimit *@var{rlp})
d08a7e4c 224@standards{BSD, sys/resource.h}
225@safety{@prelim{}@mtsafe{}@assafe{}@acsafe{}}
226@c Direct syscall on most systems.
227Read the current and maximum limits for the resource @var{resource}
228and store them in @code{*@var{rlp}}.
229
230The return value is @code{0} on success and @code{-1} on failure. The
231only possible @code{errno} error condition is @code{EFAULT}.
232
When the sources are compiled with @code{_FILE_OFFSET_BITS == 64} on a
32-bit system, this function is in fact @code{getrlimit64}. Thus, the
LFS interface transparently replaces the old interface.
236@end deftypefun
237
5ce8f203 238@deftypefun int getrlimit64 (int @var{resource}, struct rlimit64 *@var{rlp})
d08a7e4c 239@standards{Unix98, sys/resource.h}
240@safety{@prelim{}@mtsafe{}@assafe{}@acsafe{}}
241@c Direct syscall on most systems, wrapper to getrlimit otherwise.
242This function is similar to @code{getrlimit} but its second parameter is
243a pointer to a variable of type @code{struct rlimit64}, which allows it
244to read values which wouldn't fit in the member of a @code{struct
245rlimit}.
246
If the sources are compiled with @code{_FILE_OFFSET_BITS == 64} on a
32-bit machine, this function is available under the name
@code{getrlimit} and so transparently replaces the old interface.
250@end deftypefun
251
5ce8f203 252@deftypefun int setrlimit (int @var{resource}, const struct rlimit *@var{rlp})
d08a7e4c 253@standards{BSD, sys/resource.h}
254@safety{@prelim{}@mtsafe{}@assafe{}@acsafe{}}
255@c Direct syscall on most systems; lock-taking critical section on HURD.
Set the current and maximum limits for the resource @var{resource}
to the values given in @code{*@var{rlp}}.
258
259The return value is @code{0} on success and @code{-1} on failure. The
260following @code{errno} error condition is possible:
261
262@table @code
263@item EPERM
264@itemize @bullet
265@item
266The process tried to raise a current limit beyond the maximum limit.
267
268@item
269The process tried to raise a maximum limit, but is not superuser.
270@end itemize
271@end table
272
When the sources are compiled with @code{_FILE_OFFSET_BITS == 64} on a
32-bit system, this function is in fact @code{setrlimit64}. Thus, the
LFS interface transparently replaces the old interface.
276@end deftypefun
277
5ce8f203 278@deftypefun int setrlimit64 (int @var{resource}, const struct rlimit64 *@var{rlp})
d08a7e4c 279@standards{Unix98, sys/resource.h}
280@safety{@prelim{}@mtsafe{}@assafe{}@acsafe{}}
281@c Wrapper for setrlimit or direct syscall.
282This function is similar to @code{setrlimit} but its second parameter is
283a pointer to a variable of type @code{struct rlimit64} which allows it
284to set values which wouldn't fit in the member of a @code{struct
285rlimit}.
286
If the sources are compiled with @code{_FILE_OFFSET_BITS == 64} on a
32-bit machine, this function is available under the name
@code{setrlimit} and so transparently replaces the old interface.
290@end deftypefun
291
5ce8f203 292@deftp {Data Type} {struct rlimit}
d08a7e4c 293@standards{BSD, sys/resource.h}
294This structure is used with @code{getrlimit} to receive limit values,
295and with @code{setrlimit} to specify limit values for a particular process
296and resource. It has two fields:
297
298@table @code
299@item rlim_t rlim_cur
The current limit.
301
302@item rlim_t rlim_max
303The maximum limit.
304@end table
305
306For @code{getrlimit}, the structure is an output; it receives the current
307values. For @code{setrlimit}, it specifies the new values.
308@end deftp
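The following fragment is a minimal sketch of reading both limits for a
resource with @code{getrlimit} and @code{struct rlimit}. The function
name @code{read_limits} is hypothetical and chosen only for this
illustration.

```c
#include <sys/resource.h>

/* Fetch the current (soft) and maximum (hard) limits for RESOURCE.
   Returns 0 on success and -1 on failure, mirroring getrlimit.
   Hypothetical helper, not part of the C library.  */
int
read_limits (int resource, rlim_t *soft, rlim_t *hard)
{
  struct rlimit rl;

  if (getrlimit (resource, &rl) != 0)
    return -1;

  *soft = rl.rlim_cur;   /* the current limit */
  *hard = rl.rlim_max;   /* the maximum limit */
  return 0;
}
```

For example, @code{read_limits (RLIMIT_NOFILE, &soft, &hard)} would
report how many files the calling process may have open.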
309
310For the LFS functions a similar type is defined in @file{sys/resource.h}.
311
5ce8f203 312@deftp {Data Type} {struct rlimit64}
d08a7e4c 313@standards{Unix98, sys/resource.h}
314This structure is analogous to the @code{rlimit} structure above, but
315its components have wider ranges. It has two fields:
316
317@table @code
318@item rlim64_t rlim_cur
319This is analogous to @code{rlimit.rlim_cur}, but with a different type.
320
321@item rlim64_t rlim_max
322This is analogous to @code{rlimit.rlim_max}, but with a different type.
323@end table
324
325@end deftp
326
327Here is a list of resources for which you can specify a limit. Memory
328and file sizes are measured in bytes.
329
2fe82ca6 330@vtable @code
5ce8f203 331@item RLIMIT_CPU
d08a7e4c 332@standards{BSD, sys/resource.h}
333The maximum amount of CPU time the process can use. If it runs for
334longer than this, it gets a signal: @code{SIGXCPU}. The value is
335measured in seconds. @xref{Operation Error Signals}.
336
5ce8f203 337@item RLIMIT_FSIZE
d08a7e4c 338@standards{BSD, sys/resource.h}
339The maximum size of file the process can create. Trying to write a
340larger file causes a signal: @code{SIGXFSZ}. @xref{Operation Error
341Signals}.
342
5ce8f203 343@item RLIMIT_DATA
d08a7e4c 344@standards{BSD, sys/resource.h}
345The maximum size of data memory for the process. If the process tries
346to allocate data memory beyond this amount, the allocation function
347fails.
348
5ce8f203 349@item RLIMIT_STACK
d08a7e4c 350@standards{BSD, sys/resource.h}
351The maximum stack size for the process. If the process tries to extend
352its stack past this size, it gets a @code{SIGSEGV} signal.
353@xref{Program Error Signals}.
354
5ce8f203 355@item RLIMIT_CORE
d08a7e4c 356@standards{BSD, sys/resource.h}
357The maximum size core file that this process can create. If the process
358terminates and would dump a core file larger than this, then no core
359file is created. So setting this limit to zero prevents core files from
360ever being created.
361
5ce8f203 362@item RLIMIT_RSS
d08a7e4c 363@standards{BSD, sys/resource.h}
364The maximum amount of physical memory that this process should get.
365This parameter is a guide for the system's scheduler and memory
366allocator; the system may give the process more memory when there is a
367surplus.
368
5ce8f203 369@item RLIMIT_MEMLOCK
d08a7e4c 370@standards{BSD, sys/resource.h}
371The maximum amount of memory that can be locked into physical memory (so
372it will never be paged out).
373
5ce8f203 374@item RLIMIT_NPROC
d08a7e4c 375@standards{BSD, sys/resource.h}
376The maximum number of processes that can be created with the same user ID.
377If you have reached the limit for your user ID, @code{fork} will fail
378with @code{EAGAIN}. @xref{Creating a Process}.
379
5ce8f203 380@item RLIMIT_NOFILE
5ce8f203 381@itemx RLIMIT_OFILE
d08a7e4c 382@standardsx{RLIMIT_NOFILE, BSD, sys/resource.h}
383The maximum number of files that the process can open. If it tries to
384open more files than this, its open attempt fails with @code{errno}
385@code{EMFILE}. @xref{Error Codes}. Not all systems support this limit;
386GNU does, and 4.4 BSD does.
387
5ce8f203 388@item RLIMIT_AS
d08a7e4c 389@standards{Unix98, sys/resource.h}
390The maximum size of total memory that this process should get. If the
391process tries to allocate more memory beyond this amount with, for
392example, @code{brk}, @code{malloc}, @code{mmap} or @code{sbrk}, the
393allocation function fails.
394
5ce8f203 395@item RLIM_NLIMITS
d08a7e4c 396@standards{BSD, sys/resource.h}
397The number of different resource limits. Any valid @var{resource}
398operand must be less than @code{RLIM_NLIMITS}.
2fe82ca6 399@end vtable
5ce8f203 400
8ded91fb 401@deftypevr Constant rlim_t RLIM_INFINITY
d08a7e4c 402@standards{BSD, sys/resource.h}
403This constant stands for a value of ``infinity'' when supplied as
404the limit value in @code{setrlimit}.
405@end deftypevr
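As the description of @code{RLIMIT_CORE} notes, setting that limit to
zero prevents core files from being created. A minimal sketch of doing
so with @code{setrlimit} follows. It lowers only the current limit and
leaves the maximum limit untouched, so the process can raise the limit
again later. The function name @code{disable_core_dumps} is
hypothetical.

```c
#include <sys/resource.h>

/* Set the current RLIMIT_CORE limit of the calling process to zero
   while keeping its maximum limit.  Returns 0 on success and -1 on
   failure.  Hypothetical helper, not part of the C library.  */
int
disable_core_dumps (void)
{
  struct rlimit rl;

  /* Read the existing values first so that rlim_max is preserved.  */
  if (getrlimit (RLIMIT_CORE, &rl) != 0)
    return -1;

  rl.rlim_cur = 0;
  return setrlimit (RLIMIT_CORE, &rl);
}
```

Lowering a current limit needs no special privilege, so this sketch is
expected to work for an ordinary process.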
406
407
408The following are historical functions to do some of what the functions
409above do. The functions above are better choices.
410
411@code{ulimit} and the command symbols are declared in @file{ulimit.h}.
412@pindex ulimit.h
5ce8f203 413
8ded91fb 414@deftypefun {long int} ulimit (int @var{cmd}, @dots{})
d08a7e4c 415@standards{BSD, ulimit.h}
416@safety{@prelim{}@mtsafe{}@assafe{}@acsafe{}}
417@c Wrapper for getrlimit, setrlimit or
418@c sysconf(_SC_OPEN_MAX)->getdtablesize->getrlimit.
419
420@code{ulimit} gets the current limit or sets the current and maximum
421limit for a particular resource for the calling process according to the
d3e22d59 422command @var{cmd}.
423
424If you are getting a limit, the command argument is the only argument.
425If you are setting a limit, there is a second argument:
426@code{long int} @var{limit} which is the value to which you are setting
427the limit.
428
429The @var{cmd} values and the operations they specify are:
2fe82ca6 430@vtable @code
431
@item UL_GETFSIZE
Get the current limit on the size of a file, in units of 512 bytes.

@item UL_SETFSIZE
Set the current and maximum limit on the size of a file to @var{limit} *
512 bytes.
438
2fe82ca6 439@end vtable
440
441There are also some other @var{cmd} values that may do things on some
442systems, but they are not supported.
443
444Only the superuser may increase a maximum limit.
445
446When you successfully get a limit, the return value of @code{ulimit} is
447that limit, which is never negative. When you successfully set a limit,
448the return value is zero. When the function fails, the return value is
449@code{-1} and @code{errno} is set according to the reason:
450
451@table @code
452@item EPERM
453A process tried to increase a maximum limit, but is not superuser.
454@end table
455
456
457@end deftypefun
458
459@code{vlimit} and its resource symbols are declared in @file{sys/vlimit.h}.
5ce8f203 460@pindex sys/vlimit.h
461
462@deftypefun int vlimit (int @var{resource}, int @var{limit})
d08a7e4c 463@standards{BSD, sys/vlimit.h}
464@safety{@prelim{}@mtunsafe{@mtasurace{:setrlimit}}@asunsafe{}@acsafe{}}
465@c It calls getrlimit and modifies the rlim_cur field before calling
466@c setrlimit. There's a window for a concurrent call to setrlimit that
467@c modifies e.g. rlim_max, which will be lost if running as super-user.
468
469@code{vlimit} sets the current limit for a resource for a process.
470
471@var{resource} identifies the resource:
472
2fe82ca6 473@vtable @code
474@item LIM_CPU
475Maximum CPU time. Same as @code{RLIMIT_CPU} for @code{setrlimit}.
476@item LIM_FSIZE
477Maximum file size. Same as @code{RLIMIT_FSIZE} for @code{setrlimit}.
478@item LIM_DATA
479Maximum data memory. Same as @code{RLIMIT_DATA} for @code{setrlimit}.
480@item LIM_STACK
481Maximum stack size. Same as @code{RLIMIT_STACK} for @code{setrlimit}.
482@item LIM_CORE
Maximum core file size. Same as @code{RLIMIT_CORE} for @code{setrlimit}.
484@item LIM_MAXRSS
485Maximum physical memory. Same as @code{RLIMIT_RSS} for @code{setrlimit}.
2fe82ca6 486@end vtable
487
488The return value is zero for success, and @code{-1} with @code{errno} set
489accordingly for failure:
490
491@table @code
492@item EPERM
493The process tried to set its current limit beyond its maximum limit.
494@end table
495
496@end deftypefun
497
498@node Priority
639c6286 499@section Process CPU Priority And Scheduling
5ce8f203 500@cindex process priority
639c6286 501@cindex cpu priority
502@cindex priority of a process
503
504When multiple processes simultaneously require CPU time, the system's
505scheduling policy and process CPU priorities determine which processes
506get it. This section describes how that determination is made and
1f77f049 507@glibcadj{} functions to control it.
508
509It is common to refer to CPU scheduling simply as scheduling and a
510process' CPU priority simply as the process' priority, with the CPU
511resource being implied. Bear in mind, though, that CPU time is not the
512only resource a process uses or that processes contend for. In some
513cases, it is not even particularly important. Giving a process a high
514``priority'' may have very little effect on how fast a process runs with
515respect to other processes. The priorities discussed in this section
516apply only to CPU time.
517
518CPU scheduling is a complex issue and different systems do it in wildly
519different ways. New ideas continually develop and find their way into
520the intricacies of the various systems' scheduling algorithms. This
87b56f36 521section discusses the general concepts, some specifics of systems
1f77f049 522that commonly use @theglibc{}, and some standards.
523
For simplicity, we talk about CPU contention as if there is only one CPU
in the system. But all the same principles apply when a system has
multiple CPUs, and knowing that the number of processes that can run at
any one time is equal to the number of CPUs, you can easily extrapolate
the information.
529
530The functions described in this section are all defined by the POSIX.1
95fdc6a0 531and POSIX.1b standards (the @code{sched@dots{}} functions are POSIX.1b).
532However, POSIX does not define any semantics for the values that these
533functions get and set. In this chapter, the semantics are based on the
534Linux kernel's implementation of the POSIX standard. As you will see,
535the Linux implementation is quite the inverse of what the authors of the
536POSIX syntax had in mind.
537
538@menu
539* Absolute Priority:: The first tier of priority. Posix
540* Realtime Scheduling:: Scheduling among the process nobility
541* Basic Scheduling Functions:: Get/set scheduling policy, priority
542* Traditional Scheduling:: Scheduling among the vulgar masses
d9997a45 543* CPU Affinity:: Limiting execution to certain CPUs
544@end menu
545
546
547
548@node Absolute Priority
549@subsection Absolute Priority
550@cindex absolute priority
551@cindex priority, absolute
552
553Every process has an absolute priority, and it is represented by a number.
554The higher the number, the higher the absolute priority.
555
556@cindex realtime CPU scheduling
On systems of the past, and most systems today, all processes have
absolute priority 0 and this section is irrelevant. In that case,
see @ref{Traditional Scheduling}. Absolute priorities were invented to
accommodate realtime systems, in which it is vital that certain processes
be able to respond to external events happening in real time, which
means they cannot wait around while some other process that @emph{wants
to}, but doesn't @emph{need to}, run occupies the CPU.
564
565@cindex ready to run
566@cindex preemptive scheduling
567When two processes are in contention to use the CPU at any instant, the
568one with the higher absolute priority always gets it. This is true even if the
11bf311e 569process with the lower priority is already using the CPU (i.e., the
570scheduling is preemptive). Of course, we're only talking about
571processes that are running or ``ready to run,'' which means they are
572ready to execute instructions right now. When a process blocks to wait
573for something like I/O, its absolute priority is irrelevant.
574
575@cindex runnable process
48b22986 576@strong{NB:} The term ``runnable'' is a synonym for ``ready to run.''
577
578When two processes are running or ready to run and both have the same
579absolute priority, it's more interesting. In that case, who gets the
0bc93a2f 580CPU is determined by the scheduling policy. If the processes have
581absolute priority 0, the traditional scheduling policy described in
582@ref{Traditional Scheduling} applies. Otherwise, the policies described
583in @ref{Realtime Scheduling} apply.
584
585You normally give an absolute priority above 0 only to a process that
586can be trusted not to hog the CPU. Such processes are designed to block
587(or terminate) after relatively short CPU runs.
588
589A process begins life with the same absolute priority as its parent
590process. Functions described in @ref{Basic Scheduling Functions} can
591change it.
592
593Only a privileged process can change a process' absolute priority to
594something other than @code{0}. Only a privileged process or the
595target process' owner can change its absolute priority at all.
596
597POSIX requires absolute priority values used with the realtime
598scheduling policies to be consecutive with a range of at least 32. On
599Linux, they are 1 through 99. The functions
@code{sched_get_priority_max} and @code{sched_get_priority_min} portably
601tell you what the range is on a particular system.
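A minimal sketch of querying that range is shown below. The helper name
@code{priority_range} is hypothetical. The policy macro
@code{SCHED_FIFO}, which selects First Come First Served scheduling, is
declared in @file{sched.h}.

```c
#include <sched.h>

/* Store the lowest and highest absolute priority values usable with
   POLICY.  Returns 0 on success and -1 if either query fails.
   Hypothetical helper, not part of the C library.  */
int
priority_range (int policy, int *lowest, int *highest)
{
  int lo = sched_get_priority_min (policy);
  int hi = sched_get_priority_max (policy);

  /* Both functions report failure by returning -1.  */
  if (lo == -1 || hi == -1)
    return -1;

  *lowest = lo;
  *highest = hi;
  return 0;
}
```

Calling @code{priority_range (SCHED_FIFO, &lo, &hi)} avoids hard-coding
a range that may differ from one system to another.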
602
603
604@subsubsection Using Absolute Priority
605
606One thing you must keep in mind when designing real time applications is
607that having higher absolute priority than any other process doesn't
608guarantee the process can run continuously. Two things that can wreck a
87b56f36 609good CPU run are interrupts and page faults.
610
611Interrupt handlers live in that limbo between processes. The CPU is
612executing instructions, but they aren't part of any process. An
613interrupt will stop even the highest priority process. So you must
614allow for slight delays and make sure that no device in the system has
615an interrupt handler that could cause too long a delay between
616instructions for your process.
617
618Similarly, a page fault causes what looks like a straightforward
619sequence of instructions to take a long time. The fact that other
620processes get to run while the page faults in is of no consequence,
d3e22d59 621because as soon as the I/O is complete, the higher priority process will
622kick them out and run again, but the wait for the I/O itself could be a
623problem. To neutralize this threat, use @code{mlock} or
624@code{mlockall}.
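As an illustration only, the following sketch asks the system to keep
all current and future pages of the calling process in physical memory.
The function name @code{lock_all_memory} is hypothetical. Whether the
call succeeds depends on the privileges of the process and on its
@code{RLIMIT_MEMLOCK} limit, so the sketch reports the error code
instead of assuming success.

```c
#include <errno.h>
#include <sys/mman.h>

/* Try to lock every page the process has now, and every page it maps
   later, into physical memory.  Returns 0 on success or the errno
   value describing the failure.  Hypothetical helper.  */
int
lock_all_memory (void)
{
  if (mlockall (MCL_CURRENT | MCL_FUTURE) != 0)
    return errno;
  return 0;
}
```

A realtime program would typically make this call once during startup,
before it enters its time-critical phase.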
625
626There are a few ramifications of the absoluteness of this priority on a
627single-CPU system that you need to keep in mind when you choose to set a
628priority and also when you're working on a program that runs with high
629absolute priority. Consider a process that has higher absolute priority
630than any other process in the system and due to a bug in its program, it
631gets into an infinite loop. It will never cede the CPU. You can't run
632a command to kill it because your command would need to get the CPU in
633order to run. The errant program is in complete control. It controls
634the vertical, it controls the horizontal.
635
636There are two ways to avoid this: 1) keep a shell running somewhere with
d3e22d59 637a higher absolute priority or 2) keep a controlling terminal attached to
638the high priority process group. All the priority in the world won't
639stop an interrupt handler from running and delivering a signal to the
640process if you hit Control-C.
641
95fdc6a0 642Some systems use absolute priority as a means of allocating a fixed
0bc93a2f 643percentage of CPU time to a process. To do this, a super high priority
644privileged process constantly monitors the process' CPU usage and raises
645its absolute priority when the process isn't getting its entitled share
646and lowers it when the process is exceeding it.
647
48b22986 648@strong{NB:} The absolute priority is sometimes called the ``static
649priority.'' We don't use that term in this manual because it misses the
650most important feature of the absolute priority: its absoluteness.
651
652
653@node Realtime Scheduling
654@subsection Realtime Scheduling
b642f101 655@cindex realtime scheduling
656
657Whenever two processes with the same absolute priority are ready to run,
658the kernel has a decision to make, because only one can run at a time.
659If the processes have absolute priority 0, the kernel makes this decision
660as described in @ref{Traditional Scheduling}. Otherwise, the decision
661is as described in this section.
662
663If two processes are ready to run but have different absolute priorities,
664the decision is much simpler, and is described in @ref{Absolute
665Priority}.
666
87b56f36 667Each process has a scheduling policy. For processes with absolute
668priority other than zero, there are two available:
669
670@enumerate
671@item
672First Come First Served
673@item
674Round Robin
675@end enumerate
676
677The most sensible case is where all the processes with a certain
678absolute priority have the same scheduling policy. We'll discuss that
679first.
680
681In Round Robin, processes share the CPU, each one running for a small
682quantum of time (``time slice'') and then yielding to another in a
683circular fashion. Of course, only processes that are ready to run and
684have the same absolute priority are in this circle.
685
686In First Come First Served, the process that has been waiting the
687longest to run gets the CPU, and it keeps it until it voluntarily
688relinquishes the CPU, runs out of things to do (blocks), or gets
689preempted by a higher priority process.
690
691First Come First Served, along with maximal absolute priority and
692careful control of interrupts and page faults, is the one to use when a
693process absolutely, positively has to run at full CPU speed or not at
694all.
695
696Judicious use of @code{sched_yield} function invocations by processes
697with First Come First Served scheduling policy forms a good compromise
698between Round Robin and First Come First Served.
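One possible shape for such a compromise is sketched below: a worker
performs its job in small units and calls @code{sched_yield} after every
few units, so that other ready processes of the same absolute priority
get a turn. The function @code{run_with_yields}, its parameters, and the
callback are hypothetical names used only for this illustration.

```c
#include <sched.h>
#include <stddef.h>

/* Perform UNITS pieces of work, calling WORK (if not null) for each
   one and yielding the CPU after every YIELD_EVERY pieces.  A
   YIELD_EVERY of zero means never yield.  Returns the number of times
   the CPU was yielded.  Hypothetical helper.  */
unsigned int
run_with_yields (unsigned int units, unsigned int yield_every,
                 void (*work) (unsigned int))
{
  unsigned int yields = 0;

  for (unsigned int i = 0; i < units; i++)
    {
      if (work != NULL)
        work (i);

      if (yield_every != 0 && (i + 1) % yield_every == 0)
        {
          sched_yield ();
          yields++;
        }
    }
  return yields;
}
```

Choosing how often to yield is a design decision: yielding rarely keeps
the behavior close to First Come First Served, while yielding often
approaches Round Robin.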
699
700To understand how scheduling works when processes of different scheduling
701policies occupy the same absolute priority, you have to know the nitty
d3e22d59 702gritty details of how processes enter and exit the ready to run list.
703
704In both cases, the ready to run list is organized as a true queue, where
705a process gets pushed onto the tail when it becomes ready to run and is
706popped off the head when the scheduler decides to run it. Note that
707ready to run and running are two mutually exclusive states. When the
708scheduler runs a process, that process is no longer ready to run and no
709longer in the ready to run list. When the process stops running, it
710may go back to being ready to run again.
711
712The only difference between a process that is assigned the Round Robin
scheduling policy and a process that is assigned First Come First Served
714is that in the former case, the process is automatically booted off the
715CPU after a certain amount of time. When that happens, the process goes
716back to being ready to run, which means it enters the queue at the tail.
717The time quantum we're talking about is small. Really small. This is
718not your father's timesharing. For example, with the Linux kernel, the
719round robin time slice is a thousand times shorter than its typical
720time slice for traditional scheduling.
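The queue discipline described above can be illustrated with a toy
simulation. This is only a sketch: it models a single absolute priority
level, measures work in abstract units, and assumes that no process ever
blocks. The function @code{simulate_ready_queue} and the constant
@code{MAX_PROCS} are hypothetical names. A positive @var{quantum} models
Round Robin; a @var{quantum} of zero models First Come First Served.

```c
#include <stddef.h>

#define MAX_PROCS 16   /* arbitrary capacity of this toy model */

/* Simulate the ready to run list for N processes, where WORK[i] is
   the amount of CPU work process i needs.  Each time a process is
   given the CPU, its index is recorded in ORDER (up to ORDER_CAP
   entries).  Returns the number of times the CPU was handed out, or
   0 if N exceeds MAX_PROCS.  */
size_t
simulate_ready_queue (const int *work, size_t n, int quantum,
                      int *order, size_t order_cap)
{
  int remaining[MAX_PROCS];
  size_t queue[MAX_PROCS];   /* circular buffer: the ready to run list */
  size_t head = 0, count = 0, slices = 0;

  if (n > MAX_PROCS)
    return 0;

  /* Every process with work to do enters the queue at the tail.  */
  for (size_t i = 0; i < n; i++)
    {
      remaining[i] = work[i];
      if (work[i] > 0)
        queue[(head + count++) % MAX_PROCS] = i;
    }

  while (count > 0)
    {
      /* The scheduler pops the process at the head and runs it.  */
      size_t p = queue[head];
      head = (head + 1) % MAX_PROCS;
      count--;

      /* Round Robin caps the run at one quantum; First Come First
         Served lets the process run until it is finished.  */
      int run = remaining[p];
      if (quantum > 0 && run > quantum)
        run = quantum;
      remaining[p] -= run;

      if (slices < order_cap)
        order[slices] = (int) p;
      slices++;

      /* A process booted off the CPU re-enters the queue at the tail.  */
      if (remaining[p] > 0)
        queue[(head + count++) % MAX_PROCS] = p;
    }
  return slices;
}
```

For work amounts 3, 1 and 2 with a quantum of 1, the recorded order is
0, 1, 2, 0, 2, 0. With a quantum of 0, it is simply 0, 1, 2.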
721
722A process begins life with the same scheduling policy as its parent process.
723Functions described in @ref{Basic Scheduling Functions} can change it.
724
725Only a privileged process can set the scheduling policy of a process
726that has absolute priority higher than 0.
727
728@node Basic Scheduling Functions
729@subsection Basic Scheduling Functions
730
1f77f049 731This section describes functions in @theglibc{} for setting the
732absolute priority and scheduling policy of a process.
733
@strong{Portability Note:} On systems that have the functions in this
section, the macro @code{_POSIX_PRIORITY_SCHEDULING} is defined in
@file{unistd.h}.
737
738For the case that the scheduling policy is traditional scheduling, more
739functions to fine tune the scheduling are in @ref{Traditional Scheduling}.
740
741Don't try to make too much out of the naming and structure of these
742functions. They don't match the concepts described in this manual
743because the functions are as defined by POSIX.1b, but the implementation
1f77f049 744on systems that use @theglibc{} is the inverse of what the POSIX
745structure contemplates. The POSIX scheme assumes that the primary
746scheduling parameter is the scheduling policy and that the priority
747value, if any, is a parameter of the scheduling policy. In the
748implementation, though, the priority value is king and the scheduling
749policy, if anything, only fine tunes the effect of that priority.
750
The symbols in this section are declared in @file{sched.h}.
752
639c6286 753@deftp {Data Type} {struct sched_param}
d08a7e4c 754@standards{POSIX, sched.h}
639c6286
UD
755This structure describes an absolute priority.
756@table @code
757@item int sched_priority
758absolute priority value
759@end table
760@end deftp
761
639c6286 762@deftypefun int sched_setscheduler (pid_t @var{pid}, int @var{policy}, const struct sched_param *@var{param})
d08a7e4c 763@standards{POSIX, sched.h}
c8ce789c
AO
764@safety{@prelim{}@mtsafe{}@assafe{}@acsafe{}}
765@c Direct syscall, Linux only.
639c6286
UD
766
767This function sets both the absolute priority and the scheduling policy
768for a process.
769
770It assigns the absolute priority value given by @var{param} and the
771scheduling policy @var{policy} to the process with Process ID @var{pid},
772or the calling process if @var{pid} is zero. If @var{policy} is
0bc93a2f 773negative, @code{sched_setscheduler} keeps the existing scheduling policy.
639c6286
UD
774
775The following macros represent the valid values for @var{policy}:
776
2fe82ca6 777@vtable @code
639c6286
UD
778@item SCHED_OTHER
779Traditional Scheduling
780@item SCHED_FIFO
87b56f36 781First In First Out
639c6286
UD
782@item SCHED_RR
783Round Robin
2fe82ca6 784@end vtable
639c6286
UD
785
786@c The Linux kernel code (in sched.c) actually reschedules the process,
787@c but it puts it at the head of the run queue, so I'm not sure just what
788@c the effect is, but it must be subtle.
789
On success, the return value is @code{0}. Otherwise, it is @code{-1}
and @code{errno} is set accordingly. The @code{errno} values specific
to this function are:
793
794@table @code
795@item EPERM
796@itemize @bullet
797@item
The calling process does not have @code{CAP_SYS_NICE} permission and
@var{policy} is not @code{SCHED_OTHER} (or it is negative and the
existing policy is not @code{SCHED_OTHER}).
801
802@item
803The calling process does not have @code{CAP_SYS_NICE} permission and its
11bf311e 804owner is not the target process' owner. I.e., the effective uid of the
639c6286
UD
805calling process is neither the effective nor the real uid of process
806@var{pid}.
807@c We need a cross reference to the capabilities section, when written.
808@end itemize
809
810@item ESRCH
811There is no process with pid @var{pid} and @var{pid} is not zero.
812
813@item EINVAL
814@itemize @bullet
815@item
816@var{policy} does not identify an existing scheduling policy.
817
818@item
The absolute priority value identified by @code{*@var{param}} is outside the
valid range for the scheduling policy @var{policy} (or the existing
scheduling policy if @var{policy} is negative), or @var{param} is a
null pointer. @code{sched_get_priority_max} and @code{sched_get_priority_min}
tell you what the valid range is.
824
825@item
826@var{pid} is negative.
827@end itemize
828@end table
829
830@end deftypefun
831
832
639c6286 833@deftypefun int sched_getscheduler (pid_t @var{pid})
d08a7e4c 834@standards{POSIX, sched.h}
c8ce789c
AO
835@safety{@prelim{}@mtsafe{}@assafe{}@acsafe{}}
836@c Direct syscall, Linux only.
639c6286
UD
837
838This function returns the scheduling policy assigned to the process with
839Process ID (pid) @var{pid}, or the calling process if @var{pid} is zero.
840
841The return value is the scheduling policy. See
842@code{sched_setscheduler} for the possible values.
843
844If the function fails, the return value is instead @code{-1} and
845@code{errno} is set accordingly.
846
847The @code{errno} values specific to this function are:
848
849@table @code
850
851@item ESRCH
There is no process with pid @var{pid} and @var{pid} is not zero.
853
854@item EINVAL
855@var{pid} is negative.
856
857@end table
858
859Note that this function is not an exact mate to @code{sched_setscheduler}
860because while that function sets the scheduling policy and the absolute
861priority, this function gets only the scheduling policy. To get the
862absolute priority, use @code{sched_getparam}.
863
864@end deftypefun
865
866
639c6286 867@deftypefun int sched_setparam (pid_t @var{pid}, const struct sched_param *@var{param})
d08a7e4c 868@standards{POSIX, sched.h}
c8ce789c
AO
869@safety{@prelim{}@mtsafe{}@assafe{}@acsafe{}}
870@c Direct syscall, Linux only.
639c6286
UD
871
872This function sets a process' absolute priority.
873
874It is functionally identical to @code{sched_setscheduler} with
875@var{policy} = @code{-1}.
876
877@c in fact, that's how it's implemented in Linux.
878
879@end deftypefun
880
8ded91fb 881@deftypefun int sched_getparam (pid_t @var{pid}, struct sched_param *@var{param})
d08a7e4c 882@standards{POSIX, sched.h}
c8ce789c
AO
883@safety{@prelim{}@mtsafe{}@assafe{}@acsafe{}}
884@c Direct syscall, Linux only.
639c6286
UD
885
886This function returns a process' absolute priority.
887
888@var{pid} is the Process ID (pid) of the process whose absolute priority
889you want to know.
890
891@var{param} is a pointer to a structure in which the function stores the
892absolute priority of the process.
893
894On success, the return value is @code{0}. Otherwise, it is @code{-1}
d3e22d59 895and @code{errno} is set accordingly. The @code{errno} values specific
639c6286
UD
896to this function are:
897
898@table @code
899
900@item ESRCH
There is no process with pid @var{pid} and @var{pid} is not zero.
902
903@item EINVAL
904@var{pid} is negative.
905
906@end table
907
908@end deftypefun
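
Since @code{sched_getscheduler} reports only the policy and
@code{sched_getparam} reports only the absolute priority, a program that
wants both has to make two calls. The following is a minimal sketch of
that pairing; the helper names @code{policy_name} and
@code{get_scheduling} are invented for this example and are not part of
@theglibc{}.

```c
#include <sched.h>
#include <sys/types.h>

/* Map a scheduling policy value to a printable name.  */
const char *
policy_name (int policy)
{
  switch (policy)
    {
    case SCHED_OTHER: return "SCHED_OTHER";
    case SCHED_FIFO:  return "SCHED_FIFO";
    case SCHED_RR:    return "SCHED_RR";
    default:          return "unknown";
    }
}

/* Fetch both scheduling attributes of process PID (0 means the
   calling process).  Returns 0 on success, or -1 with errno set by
   the call that failed.  */
int
get_scheduling (pid_t pid, int *policy, int *priority)
{
  struct sched_param param;

  int p = sched_getscheduler (pid);
  if (p == -1)
    return -1;
  if (sched_getparam (pid, &param) == -1)
    return -1;

  *policy = p;
  *priority = param.sched_priority;
  return 0;
}
```

A caller would typically pass @code{0} for @var{pid} and print
@code{policy_name (policy)} together with the priority.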
909
910
8ded91fb 911@deftypefun int sched_get_priority_min (int @var{policy})
d08a7e4c 912@standards{POSIX, sched.h}
c8ce789c
AO
913@safety{@prelim{}@mtsafe{}@assafe{}@acsafe{}}
914@c Direct syscall, Linux only.
639c6286
UD
915
916This function returns the lowest absolute priority value that is
917allowable for a process with scheduling policy @var{policy}.
918
On Linux, it is 0 for @code{SCHED_OTHER} and 1 for everything else.

On success, the return value is that lowest priority value. Otherwise,
it is @code{-1} and @code{errno} is set accordingly. The @code{errno}
values specific to this function are:
924
925@table @code
926@item EINVAL
927@var{policy} does not identify an existing scheduling policy.
928@end table
929
930@end deftypefun
931
8ded91fb 932@deftypefun int sched_get_priority_max (int @var{policy})
d08a7e4c 933@standards{POSIX, sched.h}
c8ce789c
AO
934@safety{@prelim{}@mtsafe{}@assafe{}@acsafe{}}
935@c Direct syscall, Linux only.
639c6286
UD
936
This function returns the highest absolute priority value that is
allowable for a process with scheduling policy @var{policy}.

On Linux, it is 0 for @code{SCHED_OTHER} and 99 for everything else.

On success, the return value is that highest priority value. Otherwise,
it is @code{-1} and @code{errno} is set accordingly. The @code{errno}
values specific to this function are:
945
946@table @code
947@item EINVAL
948@var{policy} does not identify an existing scheduling policy.
949@end table
950
951@end deftypefun
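
A program can use @code{sched_get_priority_min} and
@code{sched_get_priority_max} together to keep a requested priority
inside the valid range before handing it to @code{sched_setscheduler}.
The sketch below shows one way to do that. The helper name
@code{clamp_priority} is invented for this example, and treating
@code{-1} as an error assumes a system, such as Linux, where valid
priorities are never negative.

```c
#include <sched.h>

/* Clamp WANTED into the valid absolute priority range of POLICY.
   Returns the clamped value, or -1 with errno set if POLICY does not
   identify a scheduling policy.  Assumes valid priorities are
   nonnegative, as on Linux.  */
int
clamp_priority (int policy, int wanted)
{
  int lo = sched_get_priority_min (policy);
  int hi = sched_get_priority_max (policy);

  if (lo == -1 || hi == -1)
    return -1;
  if (wanted < lo)
    return lo;
  if (wanted > hi)
    return hi;
  return wanted;
}
```

The result can be stored in the @code{sched_priority} member of a
@code{struct sched_param} and passed on to @code{sched_setscheduler}.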
952
639c6286 953@deftypefun int sched_rr_get_interval (pid_t @var{pid}, struct timespec *@var{interval})
d08a7e4c 954@standards{POSIX, sched.h}
c8ce789c
AO
955@safety{@prelim{}@mtsafe{}@assafe{}@acsafe{}}
956@c Direct syscall, Linux only.
639c6286 957
87b56f36 958This function returns the length of the quantum (time slice) used with
639c6286
UD
959the Round Robin scheduling policy, if it is used, for the process with
960Process ID @var{pid}.
961
It stores the length of time in @code{*@var{interval}}.
963@c We need a cross-reference to where timespec is explained. But that
964@c section doesn't exist yet, and the time chapter needs to be slightly
965@c reorganized so there is a place to put it (which will be right next
966@c to timeval, which is presently misplaced). 2000.05.07.
967
968With a Linux kernel, the round robin time slice is always 150
969microseconds, and @var{pid} need not even be a real pid.
970
The return value is @code{0} on success. In the pathological case
that it fails, the return value is @code{-1} and @code{errno} is set
accordingly. There is nothing specific that can go wrong with this
function, so there are no specific @code{errno} values.
975
976@end deftypefun
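
The quantum comes back as a @code{struct timespec}, which is awkward to
compare or print directly. The following sketch converts it to a single
nanosecond count. The helper names @code{timespec_to_ns} and
@code{rr_quantum_ns} are invented for this example.

```c
#include <sched.h>
#include <time.h>
#include <sys/types.h>

/* Convert a struct timespec to a count of nanoseconds.  */
long long
timespec_to_ns (const struct timespec *ts)
{
  return (long long) ts->tv_sec * 1000000000LL + ts->tv_nsec;
}

/* Store the round robin quantum for process PID (0 means the calling
   process) in *NS.  Returns 0 on success, -1 with errno set on
   failure.  */
int
rr_quantum_ns (pid_t pid, long long *ns)
{
  struct timespec interval;

  if (sched_rr_get_interval (pid, &interval) == -1)
    return -1;
  *ns = timespec_to_ns (&interval);
  return 0;
}
```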
977
3c44837c 978@deftypefun int sched_yield (void)
d08a7e4c 979@standards{POSIX, sched.h}
c8ce789c
AO
980@safety{@prelim{}@mtsafe{}@assafe{}@acsafe{}}
981@c Direct syscall on Linux; alias to swtch on HURD.
639c6286
UD
982
983This function voluntarily gives up the process' claim on the CPU.
984
985Technically, @code{sched_yield} causes the calling process to be made
986immediately ready to run (as opposed to running, which is what it was
987before). This means that if it has absolute priority higher than 0, it
988gets pushed onto the tail of the queue of processes that share its
989absolute priority and are ready to run, and it will run again when its
990turn next arrives. If its absolute priority is 0, it is more
991complicated, but still has the effect of yielding the CPU to other
992processes.
993
994If there are no other processes that share the calling process' absolute
995priority, this function doesn't have any effect.
996
997To the extent that the containing program is oblivious to what other
998processes in the system are doing and how fast it executes, this
999function appears as a no-op.
1000
The return value is @code{0} on success. In the pathological case
that it fails, the return value is @code{-1} and @code{errno} is set
accordingly. There is nothing specific that can go wrong with this
function, so there are no specific @code{errno} values.
1005
1006@end deftypefun
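
One common use of @code{sched_yield} is a polite polling loop that hands
the CPU to other ready-to-run processes between checks. The sketch below
assumes the flag lives in memory that some other thread or process can
write (for example shared memory); the helper name
@code{wait_for_flag} is invented for this example, and the loop is
bounded so that it cannot spin forever.

```c
#include <sched.h>
#include <stdatomic.h>

/* Poll FLAG, yielding the CPU between checks, for at most MAX_YIELDS
   yields.  Returns 1 if the flag was seen set, 0 if we gave up.  */
int
wait_for_flag (atomic_int *flag, int max_yields)
{
  for (int i = 0; i < max_yields; i++)
    {
      if (atomic_load (flag) != 0)
        return 1;
      sched_yield ();   /* Let other ready-to-run processes go first.  */
    }
  return atomic_load (flag) != 0;
}
```

Whether yielding actually helps depends on the scheduling policy and on
what else is ready to run, as described above.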
1007
1008@node Traditional Scheduling
1009@subsection Traditional Scheduling
1010@cindex scheduling, traditional
1011
1012This section is about the scheduling among processes whose absolute
1013priority is 0. When the system hands out the scraps of CPU time that
0bc93a2f 1014are left over after the processes with higher absolute priority have
639c6286
UD
1015taken all they want, the scheduling described herein determines who
1016among the great unwashed processes gets them.
1017
1018@menu
1019* Traditional Scheduling Intro::
1020* Traditional Scheduling Functions::
1021@end menu
1022
1023@node Traditional Scheduling Intro
1024@subsubsection Introduction To Traditional Scheduling
1025
Long before there was absolute priority (@pxref{Absolute Priority}),
d3e22d59 1027Unix systems were scheduling the CPU using this system. When POSIX came
0bc93a2f 1028in like the Romans and imposed absolute priorities to accommodate the
639c6286
UD
1029needs of realtime processing, it left the indigenous Absolute Priority
1030Zero processes to govern themselves by their own familiar scheduling
1031policy.
1032
1033Indeed, absolute priorities higher than zero are not available on many
1034systems today and are not typically used when they are, being intended
1035mainly for computers that do realtime processing. So this section
1036describes the only scheduling many programmers need to be concerned
1037about.
1038
1039But just to be clear about the scope of this scheduling: Any time a
9dcc8f11 1040process with an absolute priority of 0 and a process with an absolute
639c6286
UD
1041priority higher than 0 are ready to run at the same time, the one with
1042absolute priority 0 does not run. If it's already running when the
1043higher priority ready-to-run process comes into existence, it stops
1044immediately.
1045
In addition to its absolute priority of zero, every process has another
priority, which we will refer to as ``dynamic priority'' because it changes
over time. The dynamic priority is meaningless for processes with
an absolute priority higher than zero.
1050
1051The dynamic priority sometimes determines who gets the next turn on the
1052CPU. Sometimes it determines how long turns last. Sometimes it
1053determines whether a process can kick another off the CPU.
1054
d3e22d59 1055In Linux, the value is a combination of these things, but mostly it
639c6286
UD
1056just determines the length of the time slice. The higher a process'
1057dynamic priority, the longer a shot it gets on the CPU when it gets one.
1058If it doesn't use up its time slice before giving up the CPU to do
1059something like wait for I/O, it is favored for getting the CPU back when
1060it's ready for it, to finish out its time slice. Other than that,
1061selection of processes for new time slices is basically round robin.
1062But the scheduler does throw a bone to the low priority processes: A
1063process' dynamic priority rises every time it is snubbed in the
1064scheduling process. In Linux, even the fat kid gets to play.
1065
1066The fluctuation of a process' dynamic priority is regulated by another
1067value: The ``nice'' value. The nice value is an integer, usually in the
1068range -20 to 20, and represents an upper limit on a process' dynamic
1069priority. The higher the nice number, the lower that limit.
1070
1071On a typical Linux system, for example, a process with a nice value of
107220 can get only 10 milliseconds on the CPU at a time, whereas a process
1073with a nice value of -20 can achieve a high enough priority to get 400
1074milliseconds.
1075
1076The idea of the nice value is deferential courtesy. In the beginning,
1077in the Unix garden of Eden, all processes shared equally in the bounty
1078of the computer system. But not all processes really need the same
1079share of CPU time, so the nice value gave a courteous process the
1080ability to refuse its equal share of CPU time that others might prosper.
1081Hence, the higher a process' nice value, the nicer the process is.
1082(Then a snake came along and offered some process a negative nice value
1083and the system became the crass resource allocation system we know
d3e22d59 1084today.)
639c6286
UD
1085
1086Dynamic priorities tend upward and downward with an objective of
1087smoothing out allocation of CPU time and giving quick response time to
1088infrequent requests. But they never exceed their nice limits, so on a
1089heavily loaded CPU, the nice value effectively determines how fast a
1090process runs.
1091
1092In keeping with the socialistic heritage of Unix process priority, a
1093process begins life with the same nice value as its parent process and
1094can raise it at will. A process can also raise the nice value of any
1095other process owned by the same user (or effective user). But only a
1096privileged process can lower its nice value. A privileged process can
1097also raise or lower another process' nice value.
1098
@glibcadj{} functions for getting and setting nice values are described in
@ref{Traditional Scheduling Functions}.
1101
1102@node Traditional Scheduling Functions
1103@subsubsection Functions For Traditional Scheduling
1104
5ce8f203 1105@pindex sys/resource.h
639c6286
UD
1106This section describes how you can read and set the nice value of a
1107process. All these symbols are declared in @file{sys/resource.h}.
1108
The function and macro names are defined by POSIX and refer to
``priority'', but the functions actually deal with nice values, as
the terms are used both in this manual and in POSIX.
1112
The range of valid nice values depends on the kernel, but typically it
runs from @code{-20} to @code{20}. A lower nice value corresponds to
higher priority for the process. These constants describe the range of
nice values:
1117
b642f101 1118@vtable @code
5ce8f203 1119@item PRIO_MIN
d08a7e4c 1120@standards{BSD, sys/resource.h}
639c6286 1121The lowest valid nice value.
5ce8f203 1122
5ce8f203 1123@item PRIO_MAX
d08a7e4c 1124@standards{BSD, sys/resource.h}
639c6286 1125The highest valid nice value.
b642f101 1126@end vtable
5ce8f203 1127
5ce8f203 1128@deftypefun int getpriority (int @var{class}, int @var{id})
d08a7e4c
RJ
1129@standards{BSD, sys/resource.h}
1130@standards{POSIX, sys/resource.h}
c8ce789c
AO
1131@safety{@prelim{}@mtsafe{}@assafe{}@acsafe{}}
1132@c Direct syscall on UNIX. On HURD, calls _hurd_priority_which_map.
639c6286 1133Return the nice value of a set of processes; @var{class} and @var{id}
5ce8f203 1134specify which ones (see below). If the processes specified do not all
639c6286 1135have the same nice value, this returns the lowest value that any of them
5ce8f203
UD
1136has.
1137
On success, the return value is the nice value. Otherwise, it is @code{-1}
and @code{errno} is set accordingly. The @code{errno} values specific
to this function are:
5ce8f203
UD
1141
1142@table @code
1143@item ESRCH
1144The combination of @var{class} and @var{id} does not match any existing
1145process.
1146
1147@item EINVAL
1148The value of @var{class} is not valid.
1149@end table
1150
639c6286
UD
1151If the return value is @code{-1}, it could indicate failure, or it could
1152be the nice value. The only way to make certain is to set @code{errno =
11530} before calling @code{getpriority}, then use @code{errno != 0}
1154afterward as the criterion for failure.
5ce8f203
UD
1155@end deftypefun
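
The @code{errno} check described above can be wrapped in a small helper
so that callers get an unambiguous status. This is only a sketch; the
name @code{get_nice_value} and its out-parameter convention are
invented for this example.

```c
#include <errno.h>
#include <sys/resource.h>

/* Store the nice value selected by PRIO_CLASS and ID in *NICEVAL.
   Returns 0 on success, -1 with errno set on failure.  */
int
get_nice_value (int prio_class, int id, int *niceval)
{
  errno = 0;                    /* -1 is also a legitimate result.  */
  int value = getpriority (prio_class, id);
  if (value == -1 && errno != 0)
    return -1;
  *niceval = value;
  return 0;
}
```

For example, @code{get_nice_value (PRIO_PROCESS, 0, &n)} stores the nice
value of the calling process in @code{n}.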
1156
639c6286 1157@deftypefun int setpriority (int @var{class}, int @var{id}, int @var{niceval})
d08a7e4c
RJ
1158@standards{BSD, sys/resource.h}
1159@standards{POSIX, sys/resource.h}
c8ce789c
AO
1160@safety{@prelim{}@mtsafe{}@assafe{}@acsafe{}}
1161@c Direct syscall on UNIX. On HURD, calls _hurd_priority_which_map.
639c6286 1162Set the nice value of a set of processes to @var{niceval}; @var{class}
5ce8f203
UD
1163and @var{id} specify which ones (see below).
1164
The return value is @code{0} on success, and @code{-1} on
failure. The following @code{errno} error conditions are possible for
this function:
5ce8f203
UD
1168
1169@table @code
1170@item ESRCH
1171The combination of @var{class} and @var{id} does not match any existing
1172process.
1173
1174@item EINVAL
1175The value of @var{class} is not valid.
1176
1177@item EPERM
639c6286 1178The call would set the nice value of a process which is owned by a different
11bf311e 1179user than the calling process (i.e., the target process' real or effective
639c6286
UD
1180uid does not match the calling process' effective uid) and the calling
1181process does not have @code{CAP_SYS_NICE} permission.
5ce8f203
UD
1182
1183@item EACCES
639c6286
UD
1184The call would lower the process' nice value and the process does not have
1185@code{CAP_SYS_NICE} permission.
5ce8f203 1186@end table
639c6286 1187
5ce8f203
UD
1188@end deftypefun
1189
1190The arguments @var{class} and @var{id} together specify a set of
1191processes in which you are interested. These are the possible values of
1192@var{class}:
1193
b642f101 1194@vtable @code
5ce8f203 1195@item PRIO_PROCESS
d08a7e4c 1196@standards{BSD, sys/resource.h}
639c6286 1197One particular process. The argument @var{id} is a process ID (pid).
5ce8f203 1198
5ce8f203 1199@item PRIO_PGRP
d08a7e4c 1200@standards{BSD, sys/resource.h}
639c6286
UD
1201All the processes in a particular process group. The argument @var{id} is
1202a process group ID (pgid).
5ce8f203 1203
5ce8f203 1204@item PRIO_USER
d08a7e4c 1205@standards{BSD, sys/resource.h}
11bf311e 1206All the processes owned by a particular user (i.e., whose real uid
639c6286 1207indicates the user). The argument @var{id} is a user ID (uid).
b642f101 1208@end vtable
5ce8f203 1209
639c6286
UD
1210If the argument @var{id} is 0, it stands for the calling process, its
1211process group, or its owner (real uid), according to @var{class}.
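
As an illustration of @code{PRIO_PROCESS} with an @var{id} of 0, the
following sketch makes the calling process at least as nice as a given
value. Because it only ever raises the nice value, it needs no
privilege. The helper name @code{be_at_least_this_nice} is invented for
this example.

```c
#include <errno.h>
#include <sys/resource.h>

/* Make the calling process at least as nice as NICEVAL.  Never lowers
   the nice value, so no privilege is needed.  Returns 0 on success,
   -1 with errno set on failure.  */
int
be_at_least_this_nice (int niceval)
{
  errno = 0;
  int current = getpriority (PRIO_PROCESS, 0);
  if (current == -1 && errno != 0)
    return -1;
  if (current >= niceval)
    return 0;                   /* Already nice enough.  */
  return setpriority (PRIO_PROCESS, 0, niceval);
}
```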
5ce8f203 1212
5ce8f203 1213@deftypefun int nice (int @var{increment})
d08a7e4c 1214@standards{BSD, unistd.h}
c8ce789c
AO
1215@safety{@prelim{}@mtunsafe{@mtasurace{:setpriority}}@asunsafe{}@acsafe{}}
1216@c Calls getpriority before and after setpriority, using the result of
1217@c the first call to compute the argument for setpriority. This creates
1218@c a window for a concurrent setpriority (or nice) call to be lost or
1219@c exhibit surprising behavior.
639c6286 1220Increment the nice value of the calling process by @var{increment}.
6a7a8b22
AJ
1221The return value is the new nice value on success, and @code{-1} on
1222failure. In the case of failure, @code{errno} will be set to the
1223same values as for @code{setpriority}.
1224
5ce8f203
UD
1225
1226Here is an equivalent definition of @code{nice}:
1227
@smallexample
int
nice (int increment)
@{
  int result, old = getpriority (PRIO_PROCESS, 0);
  result = setpriority (PRIO_PROCESS, 0, old + increment);
  if (result != -1)
    return old + increment;
  else
    return -1;
@}
@end smallexample
1240@end deftypefun
b642f101 1241
d9997a45
UD
1242
1243@node CPU Affinity
1244@subsection Limiting execution to certain CPUs
1245
On a multi-processor system the operating system usually distributes
the runnable processes across all available CPUs in a way that allows
the system to work most efficiently. Which processes and threads run
can to some extent be controlled with the scheduling functionality
described in the previous sections. But which CPU finally executes
which process or thread is not covered.
1252
1253There are a number of reasons why a program might want to have control
1254over this aspect of the system as well:
1255
1256@itemize @bullet
1257@item
1258One thread or process is responsible for absolutely critical work
1259which under no circumstances must be interrupted or hindered from
d3e22d59 1260making progress by other processes or threads using CPU resources. In
d9997a45
UD
1261this case the special process would be confined to a CPU which no
1262other process or thread is allowed to use.
1263
1264@item
1265The access to certain resources (RAM, I/O ports) has different costs
1266from different CPUs. This is the case in NUMA (Non-Uniform Memory
11bf311e 1267Architecture) machines. Preferably memory should be accessed locally
d9997a45
UD
1268but this requirement is usually not visible to the scheduler.
1269Therefore forcing a process or thread to the CPUs which have local
d3e22d59 1270access to the most-used memory helps to significantly boost the
d9997a45
UD
1271performance.
1272
1273@item
In controlled runtimes resource allocation and book-keeping work (for
instance garbage collection) is performed local to processors. This
can help to reduce locking costs if the resources do not have to be
protected from concurrent accesses from different processors.
1278@end itemize
1279
The POSIX standard to date is not much help in solving this
problem. The Linux kernel provides a set of interfaces to allow
specifying @emph{affinity sets} for a process. The scheduler will
schedule the thread or process on CPUs specified by the affinity
masks. The interfaces which @theglibc{} defines follow to some
extent the Linux kernel interface.
d9997a45 1286
d9997a45 1287@deftp {Data Type} cpu_set_t
d08a7e4c 1288@standards{GNU, sched.h}
d9997a45
UD
This data type is a bitset where each bit represents a CPU. How the
system's CPUs are mapped to bits in the bitset is system dependent.
The data type has a fixed size; in the unlikely case that the number
of bits is not sufficient to describe the CPUs of the system, a
different interface has to be used.
1294
1295This type is a GNU extension and is defined in @file{sched.h}.
1296@end deftp
1297
A number of macros are defined to manipulate the bitset, that is, to set
and reset bits. Some of the macros take a CPU number as a parameter. Here
it is important to never exceed the size of the bitset. The following
macro specifies the number of bits in the @code{cpu_set_t} bitset.
1302
d9997a45 1303@deftypevr Macro int CPU_SETSIZE
d08a7e4c 1304@standards{GNU, sched.h}
d9997a45
UD
1305The value of this macro is the maximum number of CPUs which can be
1306handled with a @code{cpu_set_t} object.
1307@end deftypevr
1308
1309The type @code{cpu_set_t} should be considered opaque; all
1310manipulation should happen via the next four macros.
1311
d9997a45 1312@deftypefn Macro void CPU_ZERO (cpu_set_t *@var{set})
d08a7e4c 1313@standards{GNU, sched.h}
c8ce789c
AO
1314@safety{@prelim{}@mtsafe{}@assafe{}@acsafe{}}
1315@c CPU_ZERO ok
1316@c __CPU_ZERO_S ok
1317@c memset dup ok
d9997a45
UD
1318This macro initializes the CPU set @var{set} to be the empty set.
1319
1320This macro is a GNU extension and is defined in @file{sched.h}.
1321@end deftypefn
1322
d9997a45 1323@deftypefn Macro void CPU_SET (int @var{cpu}, cpu_set_t *@var{set})
d08a7e4c 1324@standards{GNU, sched.h}
c8ce789c
AO
1325@safety{@prelim{}@mtsafe{}@assafe{}@acsafe{}}
1326@c CPU_SET ok
1327@c __CPU_SET_S ok
1328@c __CPUELT ok
1329@c __CPUMASK ok
d9997a45
UD
1330This macro adds @var{cpu} to the CPU set @var{set}.
1331
1332The @var{cpu} parameter must not have side effects since it is
1333evaluated more than once.
1334
1335This macro is a GNU extension and is defined in @file{sched.h}.
1336@end deftypefn
1337
d9997a45 1338@deftypefn Macro void CPU_CLR (int @var{cpu}, cpu_set_t *@var{set})
d08a7e4c 1339@standards{GNU, sched.h}
c8ce789c
AO
1340@safety{@prelim{}@mtsafe{}@assafe{}@acsafe{}}
1341@c CPU_CLR ok
1342@c __CPU_CLR_S ok
1343@c __CPUELT dup ok
1344@c __CPUMASK dup ok
d9997a45
UD
1345This macro removes @var{cpu} from the CPU set @var{set}.
1346
1347The @var{cpu} parameter must not have side effects since it is
1348evaluated more than once.
1349
1350This macro is a GNU extension and is defined in @file{sched.h}.
1351@end deftypefn
1352
d9997a45 1353@deftypefn Macro int CPU_ISSET (int @var{cpu}, const cpu_set_t *@var{set})
d08a7e4c 1354@standards{GNU, sched.h}
c8ce789c
AO
1355@safety{@prelim{}@mtsafe{}@assafe{}@acsafe{}}
1356@c CPU_ISSET ok
1357@c __CPU_ISSET_S ok
1358@c __CPUELT dup ok
1359@c __CPUMASK dup ok
d9997a45
UD
1360This macro returns a nonzero value (true) if @var{cpu} is a member
1361of the CPU set @var{set}, and zero (false) otherwise.
1362
1363The @var{cpu} parameter must not have side effects since it is
1364evaluated more than once.
1365
1366This macro is a GNU extension and is defined in @file{sched.h}.
1367@end deftypefn
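
The four macros can be combined as in the following sketch, which builds
a set from a range of CPU numbers and counts its members while staying
below @code{CPU_SETSIZE}. The helper names @code{count_cpus} and
@code{fill_cpu_range} are invented for this example.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE             /* cpu_set_t and CPU_* are GNU extensions.  */
#endif
#include <sched.h>

/* Count the CPUs that are members of SET using only CPU_ISSET.  */
int
count_cpus (const cpu_set_t *set)
{
  int count = 0;

  for (int cpu = 0; cpu < CPU_SETSIZE; cpu++)
    if (CPU_ISSET (cpu, set))
      count++;
  return count;
}

/* Initialize SET to contain exactly the CPUs FIRST .. LAST inclusive.
   Returns 0 on success, -1 if the range does not fit in a cpu_set_t.  */
int
fill_cpu_range (cpu_set_t *set, int first, int last)
{
  if (first < 0 || last >= CPU_SETSIZE || first > last)
    return -1;
  CPU_ZERO (set);
  for (int cpu = first; cpu <= last; cpu++)
    CPU_SET (cpu, set);
  return 0;
}
```

Note that the loop variable is passed to the macros as a plain name, so
the repeated evaluation of the @var{cpu} argument is harmless.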
1368
1369
1370CPU bitsets can be constructed from scratch or the currently installed
1371affinity mask can be retrieved from the system.
1372
6f0b2e1f 1373@deftypefun int sched_getaffinity (pid_t @var{pid}, size_t @var{cpusetsize}, cpu_set_t *@var{cpuset})
d08a7e4c 1374@standards{GNU, sched.h}
c8ce789c
AO
1375@safety{@prelim{}@mtsafe{}@assafe{}@acsafe{}}
1376@c Wrapped syscall to zero out past the kernel cpu set size; Linux
1377@c only.
d9997a45 1378
d3e22d59 1379This function stores the CPU affinity mask for the process or thread
6f0b2e1f
RM
1380with the ID @var{pid} in the @var{cpusetsize} bytes long bitmap
1381pointed to by @var{cpuset}. If successful, the function always
1382initializes all bits in the @code{cpu_set_t} object and returns zero.
d9997a45
UD
1383
If @var{pid} does not correspond to a process or thread on the system,
or the function fails for some other reason, it returns @code{-1}
and @code{errno} is set to represent the error condition.
1387
1388@table @code
1389@item ESRCH
1390No process or thread with the given ID found.
1391
1392@item EFAULT
d3e22d59 1393The pointer @var{cpuset} does not point to a valid object.
d9997a45
UD
1394@end table
1395
1396This function is a GNU extension and is declared in @file{sched.h}.
1397@end deftypefun
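
As a minimal sketch of retrieving the installed mask, the following
helper counts the CPUs the caller is allowed to run on. It relies on the
Linux behavior that a @var{pid} of 0 refers to the caller; the helper
name @code{allowed_cpu_count} is invented for this example.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE             /* sched_getaffinity is a GNU extension.  */
#endif
#include <sched.h>

/* Return the number of CPUs the caller may run on, or -1 with errno
   set if the affinity mask cannot be retrieved.  */
int
allowed_cpu_count (void)
{
  cpu_set_t set;
  int count = 0;

  /* On Linux, a pid of 0 means the caller.  */
  if (sched_getaffinity (0, sizeof set, &set) == -1)
    return -1;
  for (int cpu = 0; cpu < CPU_SETSIZE; cpu++)
    if (CPU_ISSET (cpu, &set))
      count++;
  return count;
}
```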
1398
Note that it is not portably possible to use this function to
retrieve the affinity information for different POSIX threads. A separate
interface must be provided for that.
1402
6f0b2e1f 1403@deftypefun int sched_setaffinity (pid_t @var{pid}, size_t @var{cpusetsize}, const cpu_set_t *@var{cpuset})
d08a7e4c 1404@standards{GNU, sched.h}
c8ce789c
AO
1405@safety{@prelim{}@mtsafe{}@assafe{}@acsafe{}}
1406@c Wrapped syscall to detect attempts to set bits past the kernel cpu
1407@c set size; Linux only.
d9997a45 1408
6f0b2e1f
RM
1409This function installs the @var{cpusetsize} bytes long affinity mask
1410pointed to by @var{cpuset} for the process or thread with the ID @var{pid}.
d3e22d59 1411If successful the function returns zero and the scheduler will in the future
6f0b2e1f 1412take the affinity information into account.
d9997a45
UD
1413
1414If the function fails it will return @code{-1} and @code{errno} is set
1415to the error code:
1416
1417@table @code
1418@item ESRCH
1419No process or thread with the given ID found.
1420
1421@item EFAULT
d3e22d59 1422The pointer @var{cpuset} does not point to a valid object.
d9997a45
UD
1423
1424@item EINVAL
The bitset is not valid. This might mean that the affinity set does
not leave a processor for the process or thread to run on.
1427@end table
1428
1429This function is a GNU extension and is declared in @file{sched.h}.
1430@end deftypefun
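
The following sketch combines both functions to confine the caller to a
single CPU. Picking the CPU from the currently installed mask avoids the
@code{EINVAL} case where the new set leaves no usable processor. It
relies on the Linux behavior that a @var{pid} of 0 refers to the caller;
the helper name @code{pin_to_first_allowed_cpu} is invented for this
example.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE             /* The affinity calls are GNU extensions.  */
#endif
#include <errno.h>
#include <sched.h>

/* Restrict the caller to the lowest-numbered CPU it is currently
   allowed to use.  Returns that CPU number, or -1 with errno set on
   failure.  */
int
pin_to_first_allowed_cpu (void)
{
  cpu_set_t allowed, wanted;

  if (sched_getaffinity (0, sizeof allowed, &allowed) == -1)
    return -1;

  for (int cpu = 0; cpu < CPU_SETSIZE; cpu++)
    if (CPU_ISSET (cpu, &allowed))
      {
        CPU_ZERO (&wanted);
        CPU_SET (cpu, &wanted);
        if (sched_setaffinity (0, sizeof wanted, &wanted) == -1)
          return -1;
        return cpu;
      }

  errno = EINVAL;               /* The installed mask was empty.  */
  return -1;
}
```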
1431
a092ca94
L
@deftypefun int getcpu (unsigned int *@var{cpu}, unsigned int *@var{node})
@standards{Linux, sched.h}
1434@safety{@prelim{}@mtsafe{}@assafe{}@acsafe{}}
1435The @code{getcpu} function identifies the processor and node on which
1436the calling thread or process is currently running and writes them into
1437the integers pointed to by the @var{cpu} and @var{node} arguments. The
1438processor is a unique nonnegative integer identifying a CPU. The node
1439is a unique nonnegative integer identifying a NUMA node. When either
1440@var{cpu} or @var{node} is @code{NULL}, nothing is written to the
1441respective pointer.
1442
1443The return value is @code{0} on success and @code{-1} on failure. The
1444following @code{errno} error condition is defined for this function:
1445
1446@table @code
1447@item ENOSYS
1448The operating system does not support this function.
1449@end table
1450
1451This function is Linux-specific and is declared in @file{sched.h}.
1452@end deftypefun
d9997a45 1453
b642f101
UD
1454@node Memory Resources
1455@section Querying memory available resources
1456
The amount of memory available in the system and the way it is organized
often determine the way programs can and have to work. For
functions like @code{mmap} it is necessary to know the size of
individual memory pages, and knowing how much memory is available enables
a program to select appropriate sizes for, say, caches. Before we get
into these details, a few words about memory subsystems in traditional
Unix systems are in order.
b642f101
UD
1464
1465@menu
1466* Memory Subsystem:: Overview about traditional Unix memory handling.
1467* Query Memory Parameters:: How to get information about the memory
1468 subsystem?
1469@end menu
1470
1471@node Memory Subsystem
1472@subsection Overview about traditional Unix memory handling
1473
1474@cindex address space
1475@cindex physical memory
1476@cindex physical address
Unix systems normally provide processes with virtual address spaces. This
1478means that the addresses of the memory regions do not have to correspond
1479directly to the addresses of the actual physical memory which stores the
1480data. An extra level of indirection is introduced which translates
1481virtual addresses into physical addresses. This is normally done by the
1482hardware of the processor.
1483
1484@cindex shared memory
Using a virtual address space has several advantages. The most important
is process isolation. The different processes running on the system
cannot interfere directly with each other. No process can write into
the address space of another process (except when shared memory is used,
but then it is wanted and controlled).

Another advantage of virtual memory is that the address space the
processes see can actually be larger than the physical memory available.
The physical memory can be extended by storage on an external medium
where the content of currently unused memory regions is stored. The
address translation can then intercept accesses to these memory regions
and make memory content available again by loading the data back into
memory. This concept requires programs that use lots of memory to know
the difference between available virtual address space and available
physical memory. If the working set of virtual memory of all the
processes is larger than the available physical memory, the system will
slow down dramatically due to constant swapping of memory content from
the memory to the storage medium and back. This is called
``thrashing''.
@cindex thrashing

@cindex memory page
@cindex page, memory
A final important aspect of virtual memory, which follows from the
previous paragraph, is the granularity of virtual address space
handling. Memory content cannot be moved to external storage on a
byte-by-byte basis; the administrative overhead does not allow this
(let alone the processor hardware). Instead, several thousand bytes are
handled together and form a @dfn{page}. The size of each page is always
a power of two bytes. The smallest page size in use today is 4096
bytes, with 8192, 16384, and 65536 being other popular sizes.

@node Query Memory Parameters
@subsection How to get information about the memory subsystem?

The page size of the virtual memory the process sees is essential to
know in several situations. Some programming interfaces (e.g.,
@code{mmap}, @pxref{Memory-mapped I/O}) require the user to provide
information adjusted to the page size. In the case of @code{mmap} it is
necessary to provide a length argument which is a multiple of the page
size. Another place where knowledge of the page size is useful
is in memory allocation. If one allocates memory in larger
chunks which are then subdivided by the application code, it is useful to
adjust the size of the larger blocks to the page size. If the total
memory requirement for the block is close to (but not larger than) a
multiple of the page size, the kernel's memory handling can work more
effectively since it only has to allocate memory pages which are fully
used. (To do this optimization it is necessary to know a bit about the
memory allocator, which requires a bit of memory itself for each block;
this overhead must not push the total size over the page size multiple.)
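
As a minimal sketch of the size adjustment described above, the
following helper rounds a length up to the next multiple of the page
size. It relies only on the page size being a power of two. The name
@code{round_up_to_page} is an assumption made for this illustration and
is not part of @theglibc{}.

```c
#include <stddef.h>

/* Hypothetical helper, not part of the C library: round LEN up to the
   next multiple of PAGE_SIZE.  PAGE_SIZE must be a power of two, which
   makes PAGE_SIZE - 1 a mask of the low-order bits.  The addition can
   wrap around for LEN values close to SIZE_MAX; this sketch does not
   check for that.  */
size_t
round_up_to_page (size_t len, size_t page_size)
{
  size_t mask = page_size - 1;
  return (len + mask) & ~mask;
}
```

With a page size of 4096, a request of 5000 bytes is rounded up to 8192,
while a request of exactly 4096 bytes is left unchanged.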

The page size traditionally was a compile-time constant, but processor
development has changed this. Processors now support
different page sizes, and the page size may even vary among different
processes on the same system. Therefore the system should be queried at
run time for the current page size, and no assumptions (except that it
is a power of two) should be made.

@vindex _SC_PAGESIZE
The correct interface to query the page size is @code{sysconf}
(@pxref{Sysconf Definition}) with the parameter @code{_SC_PAGESIZE}.
There is a much older interface available, too.

@deftypefun int getpagesize (void)
@standards{BSD, unistd.h}
@safety{@prelim{}@mtsafe{}@assafe{}@acsafe{}}
@c Obtained from the aux vec at program startup time. GNU/Linux/m68k is
@c the exception, with the possibility of a syscall.
The @code{getpagesize} function returns the page size of the process.
This value is fixed for the lifetime of the process but can vary between
different runs of the application.

The function is declared in @file{unistd.h}.
@end deftypefun
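
The following sketch shows one way to query the page size at run time
with @code{sysconf}, checking the one property the text says may be
assumed (that the value is a power of two). The helper name
@code{query_page_size} is hypothetical.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE   /* for the getpagesize declaration */
#endif
#include <unistd.h>

/* Hypothetical helper: return the page size in bytes, or -1 if it
   cannot be determined or is not a power of two.  */
long int
query_page_size (void)
{
  long int size = sysconf (_SC_PAGESIZE);

  /* A power of two has exactly one bit set, so clearing the lowest
     set bit must leave zero.  */
  if (size <= 0 || (size & (size - 1)) != 0)
    return -1;
  return size;
}
```

On a system that provides both interfaces, the result is expected to
equal the value returned by @code{getpagesize}.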

Widely available on @w{System V} derived systems is a method to get
information about the physical memory the system has. The call

@vindex _SC_PHYS_PAGES
@cindex sysconf
@smallexample
 sysconf (_SC_PHYS_PAGES)
@end smallexample

@noindent
returns the total number of pages of physical memory the system has.
This does not mean all this memory is available. The number of pages
currently available can be found using

@vindex _SC_AVPHYS_PAGES
@cindex sysconf
@smallexample
 sysconf (_SC_AVPHYS_PAGES)
@end smallexample

These two values help to optimize applications. The value returned for
@code{_SC_AVPHYS_PAGES} is the amount of memory the application can use
without hindering any other process (given that no other process
increases its memory usage). The value returned for
@code{_SC_PHYS_PAGES} is more or less a hard limit for the working set.
If all applications together constantly use more than that amount of
memory, the system is in trouble.
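
Both calls return page counts, not bytes. As a sketch, assuming a
system where @code{sysconf} supports these two parameters, the
following hypothetical helpers convert the counts into byte values.

```c
#include <unistd.h>

/* Hypothetical helper: convert a page count obtained from sysconf
   into bytes.  Returns 0 if the value is not available.  The
   multiplication is done in unsigned long long so that it does not
   overflow where long int is 32 bits wide.  */
static unsigned long long
pages_to_bytes (int name)
{
  long int pages = sysconf (name);
  long int page_size = sysconf (_SC_PAGESIZE);

  if (pages < 0 || page_size <= 0)
    return 0;
  return (unsigned long long) pages * (unsigned long long) page_size;
}

/* Total physical memory in bytes, or 0 if unknown.  */
unsigned long long
total_physical_memory (void)
{
  return pages_to_bytes (_SC_PHYS_PAGES);
}

/* Currently available physical memory in bytes, or 0 if unknown.  */
unsigned long long
available_physical_memory (void)
{
  return pages_to_bytes (_SC_AVPHYS_PAGES);
}
```

The available value is only a snapshot; it can change as soon as any
process on the system allocates or frees memory.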

In addition to the methods already described, @theglibc{} provides two
functions to get this information. They are declared in
@file{sys/sysinfo.h}. Programmers should prefer the
@code{sysconf} method described above.

@deftypefun {long int} get_phys_pages (void)
@standards{GNU, sys/sysinfo.h}
@safety{@prelim{}@mtsafe{}@asunsafe{@ascuheap{} @asulock{}}@acunsafe{@aculock{} @acsfd{} @acsmem{}}}
@c This fopens a /proc file and scans it for the requested information.
The @code{get_phys_pages} function returns the total number of pages of
physical memory the system has. To get the amount of memory, this number has to
be multiplied by the page size.

This function is a GNU extension.
@end deftypefun

@deftypefun {long int} get_avphys_pages (void)
@standards{GNU, sys/sysinfo.h}
@safety{@prelim{}@mtsafe{}@asunsafe{@ascuheap{} @asulock{}}@acunsafe{@aculock{} @acsfd{} @acsmem{}}}
The @code{get_avphys_pages} function returns the number of available pages of
physical memory the system has. To get the amount of memory, this number has to
be multiplied by the page size.

This function is a GNU extension.
@end deftypefun
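
As an illustration of how such a value might be used to size a cache,
the sketch below multiplies the result of @code{get_avphys_pages} by
the page size and applies a made-up policy. The helper name
@code{suggest_cache_size} and the one-quarter fraction are assumptions
chosen for this example, not recommendations of @theglibc{}.

```c
#include <sys/sysinfo.h>
#include <unistd.h>

/* Hypothetical policy, chosen only for illustration: suggest a cache
   size of at most one quarter of the currently available physical
   memory, and never more than LIMIT bytes.  Returns 0 if the
   available memory cannot be determined.  */
unsigned long long
suggest_cache_size (unsigned long long limit)
{
  long int pages = get_avphys_pages ();
  long int page_size = sysconf (_SC_PAGESIZE);
  unsigned long long budget;

  if (pages <= 0 || page_size <= 0)
    return 0;

  budget = (unsigned long long) pages * (unsigned long long) page_size / 4;
  return budget < limit ? budget : limit;
}
```

A caller would pass the largest cache it could usefully fill as
@var{limit} and treat the result as an upper bound.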

@node Processor Resources
@section Learn about the processors available

The use of threads or processes with shared memory allows an application
to take advantage of all the processing power a system can provide. If
the task can be parallelized, the optimal way to write an application is
to have at any time as many processes running as there are processors.
To determine the number of processors available to the system one can
call

@vindex _SC_NPROCESSORS_CONF
@cindex sysconf
@smallexample
 sysconf (_SC_NPROCESSORS_CONF)
@end smallexample

@noindent
which returns the number of processors the operating system configured.
However, the operating system may be able to disable individual
processors, so the call

@vindex _SC_NPROCESSORS_ONLN
@cindex sysconf
@smallexample
 sysconf (_SC_NPROCESSORS_ONLN)
@end smallexample

@noindent
returns the number of processors which are currently online (i.e.,
available).
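
A minimal sketch of how the online count might be used to choose the
number of workers follows. The helper name @code{worker_count} and the
fallback to a single worker are assumptions made for this example.

```c
#include <unistd.h>

/* Hypothetical helper: choose how many worker threads or processes to
   start.  Uses the number of online processors and falls back to 1 if
   the value is not available.  */
long int
worker_count (void)
{
  long int online = sysconf (_SC_NPROCESSORS_ONLN);

  return online > 0 ? online : 1;
}
```

The online count is used here instead of the configured count because
processors that are configured but disabled cannot run any work.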

For these two pieces of information @theglibc{} also provides
functions to get the information directly. The functions are declared
in @file{sys/sysinfo.h}.

@deftypefun int get_nprocs_conf (void)
@standards{GNU, sys/sysinfo.h}
@safety{@prelim{}@mtsafe{}@asunsafe{@ascuheap{} @asulock{}}@acunsafe{@aculock{} @acsfd{} @acsmem{}}}
@c This function reads from /sys using dir streams (single user, so
@c no @mtasurace issue), and on some arches, from /proc using streams.
The @code{get_nprocs_conf} function returns the number of processors the
operating system configured.

This function is a GNU extension.
@end deftypefun

@deftypefun int get_nprocs (void)
@standards{GNU, sys/sysinfo.h}
@safety{@prelim{}@mtsafe{}@assafe{}@acsafe{@acsfd{}}}
@c This function reads from /proc using file descriptor I/O.
The @code{get_nprocs} function returns the number of available processors.

This function is a GNU extension.
@end deftypefun

@cindex load average
Before starting more threads, a program should check whether the
processors are already overloaded. Unix systems calculate something
called the @dfn{load average}. This is a number indicating how many
processes were running, averaged over different periods of time
(normally 1, 5, and 15 minutes).

@deftypefun int getloadavg (double @var{loadavg}[], int @var{nelem})
@standards{BSD, stdlib.h}
@safety{@prelim{}@mtsafe{}@assafe{}@acsafe{@acsfd{}}}
@c Calls host_info on HURD; on Linux, opens /proc/loadavg, reads from
@c it, closes it, without cancellation point, and calls strtod_l with
@c the C locale to convert the strings to doubles.
This function gets the 1, 5 and 15 minute load averages of the
system. The values are placed in @var{loadavg}. @code{getloadavg} will
place at most @var{nelem} elements into the array but never more than
three elements. The return value is the number of elements written to
@var{loadavg}, or @code{-1} on error.

This function is declared in @file{stdlib.h}.
@end deftypefun
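
The sketch below shows one possible use of @code{getloadavg} before
starting additional threads. The helper name @code{has_spare_capacity}
and the rule it applies (the 1-minute load average is below the
processor count) are assumptions chosen for illustration; a real
program may need a different threshold.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE   /* for the getloadavg declaration */
#endif
#include <stdlib.h>

/* Hypothetical helper: return nonzero if the 1-minute load average is
   below the number of processors given in NPROCS, or if the load
   average cannot be obtained.  */
int
has_spare_capacity (int nprocs)
{
  double loadavg[3];
  int n = getloadavg (loadavg, 3);

  if (n < 1)
    return 1;   /* No information; assume capacity is available.  */
  return loadavg[0] < (double) nprocs;
}
```

Only the first element is examined here, so the check reflects the most
recent minute. The return value of @code{getloadavg} is tested first
because fewer than three elements, or none at all, may have been
written.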