git.ipfire.org Git - thirdparty/glibc.git/blame - manual/resource.texi
1@node Resource Usage And Limitation, Non-Local Exits, Date and Time, Top
2@c %MENU% Functions for examining resource usage and getting and setting limits
3@chapter Resource Usage And Limitation
4This chapter describes functions for examining how much of various kinds of
5resources (CPU time, memory, etc.) a process has used and getting and setting
6limits on future usage.
7
8@menu
9* Resource Usage:: Measuring various resources used.
10* Limits on Resources:: Specifying limits on resource usage.
11* Priority:: Reading or setting process run priority.
* Memory Resources:: Querying available memory resources.
13* Processor Resources:: Learn about the processors available.
14@end menu
15
16
17@node Resource Usage
18@section Resource Usage
19
20@pindex sys/resource.h
21The function @code{getrusage} and the data type @code{struct rusage}
22are used to examine the resource usage of a process. They are declared
23in @file{sys/resource.h}.
24
25@comment sys/resource.h
26@comment BSD
27@deftypefun int getrusage (int @var{processes}, struct rusage *@var{rusage})
28@safety{@prelim{}@mtsafe{}@assafe{}@acsafe{}}
29@c On HURD, this calls task_info 3 times. On UNIX, it's a syscall.
30This function reports resource usage totals for processes specified by
31@var{processes}, storing the information in @code{*@var{rusage}}.
32
33In most systems, @var{processes} has only two valid values:
34
35@table @code
36@comment sys/resource.h
37@comment BSD
38@item RUSAGE_SELF
39Just the current process.
40
41@comment sys/resource.h
42@comment BSD
43@item RUSAGE_CHILDREN
44All child processes (direct and indirect) that have already terminated.
45@end table
46
The return value of @code{getrusage} is zero for success, and @code{-1}
for failure. The following @code{errno} error condition is defined for
this function:
49
50@table @code
51@item EINVAL
52The argument @var{processes} is not valid.
53@end table
54@end deftypefun
55
56One way of getting resource usage for a particular child process is with
57the function @code{wait4}, which returns totals for a child when it
58terminates. @xref{BSD Wait Functions}.
59
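As a minimal sketch of the @code{wait4} approach, the following forks a
child and collects that child's totals when it terminates. The helper name
@code{child_usage} and the throwaway loop in the child are illustrative
assumptions, not part of the library.

```c
#define _DEFAULT_SOURCE 1   /* Assumption: needed for wait4 under strict -std modes.  */
#include <stdlib.h>
#include <sys/resource.h>
#include <sys/time.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: fork a child that does a little work, then
   collect its resource usage totals with wait4.
   Returns 0 on success and -1 on failure.  */
int
child_usage (struct rusage *usage)
{
  pid_t pid = fork ();
  if (pid < 0)
    return -1;

  if (pid == 0)
    {
      /* Child: burn a little CPU time, then exit successfully.  */
      volatile unsigned long sum = 0;
      for (unsigned long i = 0; i < 1000000UL; i++)
        sum += i;
      _exit (EXIT_SUCCESS);
    }

  /* Parent: wait for exactly this child and fetch its totals.  */
  int status;
  if (wait4 (pid, &status, 0, usage) != pid)
    return -1;
  return (WIFEXITED (status) && WEXITSTATUS (status) == 0) ? 0 : -1;
}
```

A caller passes the address of a @code{struct rusage} and, on success, reads
members such as @code{ru_utime} from it.
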
60@comment sys/resource.h
61@comment BSD
62@deftp {Data Type} {struct rusage}
63This data type stores various resource usage statistics. It has the
64following members, and possibly others:
65
66@table @code
67@item struct timeval ru_utime
68Time spent executing user instructions.
69
70@item struct timeval ru_stime
71Time spent in operating system code on behalf of @var{processes}.
72
73@item long int ru_maxrss
74The maximum resident set size used, in kilobytes. That is, the maximum
75number of kilobytes of physical memory that @var{processes} used
76simultaneously.
77
78@item long int ru_ixrss
79An integral value expressed in kilobytes times ticks of execution, which
80indicates the amount of memory used by text that was shared with other
81processes.
82
83@item long int ru_idrss
84An integral value expressed the same way, which is the amount of
85unshared memory used for data.
86
87@item long int ru_isrss
88An integral value expressed the same way, which is the amount of
89unshared memory used for stack space.
90
91@item long int ru_minflt
92The number of page faults which were serviced without requiring any I/O.
93
94@item long int ru_majflt
95The number of page faults which were serviced by doing I/O.
96
97@item long int ru_nswap
98The number of times @var{processes} was swapped entirely out of main memory.
99
100@item long int ru_inblock
101The number of times the file system had to read from the disk on behalf
102of @var{processes}.
103
104@item long int ru_oublock
105The number of times the file system had to write to the disk on behalf
106of @var{processes}.
107
108@item long int ru_msgsnd
109Number of IPC messages sent.
110
111@item long int ru_msgrcv
112Number of IPC messages received.
113
114@item long int ru_nsignals
115Number of signals received.
116
117@item long int ru_nvcsw
118The number of times @var{processes} voluntarily invoked a context switch
119(usually to wait for some service).
120
121@item long int ru_nivcsw
122The number of times an involuntary context switch took place (because
123a time slice expired, or another process of higher priority was
124scheduled).
125@end table
126@end deftp
127
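The following sketch shows one way to read a few of these members for the
calling process. The helper names @code{timeval_seconds} and
@code{print_self_usage} are illustrative assumptions; only
@code{getrusage}, @code{RUSAGE_SELF} and the @code{struct rusage} members
come from the text above.

```c
#include <stdio.h>
#include <sys/resource.h>
#include <sys/time.h>

/* Hypothetical helper: convert a struct timeval to seconds.  */
double
timeval_seconds (struct timeval tv)
{
  return (double) tv.tv_sec + (double) tv.tv_usec / 1e6;
}

/* Hypothetical helper: print CPU times, peak resident set size and
   page fault counts for the calling process.
   Returns 0 on success and -1 if getrusage fails.  */
int
print_self_usage (void)
{
  struct rusage usage;

  if (getrusage (RUSAGE_SELF, &usage) != 0)
    {
      perror ("getrusage");
      return -1;
    }

  printf ("user CPU time:   %.6f s\n", timeval_seconds (usage.ru_utime));
  printf ("system CPU time: %.6f s\n", timeval_seconds (usage.ru_stime));
  printf ("max RSS:         %ld kB\n", usage.ru_maxrss);
  printf ("minor faults:    %ld\n", usage.ru_minflt);
  printf ("major faults:    %ld\n", usage.ru_majflt);
  return 0;
}
```

Passing @code{RUSAGE_CHILDREN} instead would report the totals for
terminated children.
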
128@code{vtimes} is a historical function that does some of what
129@code{getrusage} does. @code{getrusage} is a better choice.
130
131@code{vtimes} and its @code{vtimes} data structure are declared in
132@file{sys/vtimes.h}.
133@pindex sys/vtimes.h
135@comment sys/vtimes.h
136@deftypefun int vtimes (struct vtimes *@var{current}, struct vtimes *@var{child})
137@safety{@prelim{}@mtsafe{}@assafe{}@acsafe{}}
138@c Calls getrusage twice.
139
140@code{vtimes} reports resource usage totals for a process.
141
142If @var{current} is non-null, @code{vtimes} stores resource usage totals for
143the invoking process alone in the structure to which it points. If
144@var{child} is non-null, @code{vtimes} stores resource usage totals for all
145past children (which have terminated) of the invoking process in the structure
146to which it points.
147
148@deftp {Data Type} {struct vtimes}
149This data type contains information about the resource usage of a process.
150Each member corresponds to a member of the @code{struct rusage} data type
151described above.
152
153@table @code
@item vm_utime
User CPU time. Analogous to @code{ru_utime} in @code{struct rusage}.
@item vm_stime
System CPU time. Analogous to @code{ru_stime} in @code{struct rusage}.
@item vm_idsrss
Data and stack memory. The sum of the values that would be reported as
@code{ru_idrss} and @code{ru_isrss} in @code{struct rusage}.
@item vm_ixrss
Shared memory. Analogous to @code{ru_ixrss} in @code{struct rusage}.
@item vm_maxrss
Maximum resident set size. Analogous to @code{ru_maxrss} in
@code{struct rusage}.
@item vm_majflt
Major page faults. Analogous to @code{ru_majflt} in @code{struct rusage}.
@item vm_minflt
Minor page faults. Analogous to @code{ru_minflt} in @code{struct rusage}.
@item vm_nswap
Swap count. Analogous to @code{ru_nswap} in @code{struct rusage}.
@item vm_inblk
Disk reads. Analogous to @code{ru_inblock} in @code{struct rusage}.
@item vm_oublk
Disk writes. Analogous to @code{ru_oublock} in @code{struct rusage}.
176@end table
177@end deftp
178
179
180The return value is zero if the function succeeds; @code{-1} otherwise.
181
182
183
184@end deftypefun
185An additional historical function for examining resource usage,
186@code{vtimes}, is supported but not documented here. It is declared in
187@file{sys/vtimes.h}.
188
189@node Limits on Resources
190@section Limiting Resource Usage
191@cindex resource limits
192@cindex limits on resource usage
193@cindex usage limits
194
195You can specify limits for the resource usage of a process. When the
196process tries to exceed a limit, it may get a signal, or the system call
197by which it tried to do so may fail, depending on the resource. Each
198process initially inherits its limit values from its parent, but it can
199subsequently change them.
200
201There are two per-process limits associated with a resource:
202@cindex limit
203
204@table @dfn
205@item current limit
206The current limit is the value the system will not allow usage to
207exceed. It is also called the ``soft limit'' because the process being
208limited can generally raise the current limit at will.
209@cindex current limit
210@cindex soft limit
211
212@item maximum limit
213The maximum limit is the maximum value to which a process is allowed to
214set its current limit. It is also called the ``hard limit'' because
215there is no way for a process to get around it. A process may lower
216its own maximum limit, but only the superuser may increase a maximum
217limit.
218@cindex maximum limit
219@cindex hard limit
220@end table
221
222@pindex sys/resource.h
223The symbols for use with @code{getrlimit}, @code{setrlimit},
@code{getrlimit64}, and @code{setrlimit64} are defined in
225@file{sys/resource.h}.
226
227@comment sys/resource.h
228@comment BSD
229@deftypefun int getrlimit (int @var{resource}, struct rlimit *@var{rlp})
230@safety{@prelim{}@mtsafe{}@assafe{}@acsafe{}}
231@c Direct syscall on most systems.
232Read the current and maximum limits for the resource @var{resource}
233and store them in @code{*@var{rlp}}.
234
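The return value is @code{0} on success and @code{-1} on failure. The
possible @code{errno} error conditions are @code{EFAULT}, when @var{rlp}
points outside the accessible address space, and @code{EINVAL}, when
@var{resource} is not a valid resource.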
237
238When the sources are compiled with @code{_FILE_OFFSET_BITS == 64} on a
23932-bit system this function is in fact @code{getrlimit64}. Thus, the
240LFS interface transparently replaces the old interface.
241@end deftypefun
242
243@comment sys/resource.h
244@comment Unix98
245@deftypefun int getrlimit64 (int @var{resource}, struct rlimit64 *@var{rlp})
246@safety{@prelim{}@mtsafe{}@assafe{}@acsafe{}}
247@c Direct syscall on most systems, wrapper to getrlimit otherwise.
248This function is similar to @code{getrlimit} but its second parameter is
249a pointer to a variable of type @code{struct rlimit64}, which allows it
250to read values which wouldn't fit in the member of a @code{struct
251rlimit}.
252
253If the sources are compiled with @code{_FILE_OFFSET_BITS == 64} on a
25432-bit machine, this function is available under the name
255@code{getrlimit} and so transparently replaces the old interface.
256@end deftypefun
257
258@comment sys/resource.h
259@comment BSD
260@deftypefun int setrlimit (int @var{resource}, const struct rlimit *@var{rlp})
261@safety{@prelim{}@mtsafe{}@assafe{}@acsafe{}}
262@c Direct syscall on most systems; lock-taking critical section on HURD.
Set the current and maximum limits for the resource @var{resource}
to the values in @code{*@var{rlp}}.
265
266The return value is @code{0} on success and @code{-1} on failure. The
267following @code{errno} error condition is possible:
268
269@table @code
270@item EPERM
271@itemize @bullet
272@item
273The process tried to raise a current limit beyond the maximum limit.
274
275@item
276The process tried to raise a maximum limit, but is not superuser.
277@end itemize
278@end table
279
280When the sources are compiled with @code{_FILE_OFFSET_BITS == 64} on a
28132-bit system this function is in fact @code{setrlimit64}. Thus, the
282LFS interface transparently replaces the old interface.
283@end deftypefun
284
285@comment sys/resource.h
286@comment Unix98
287@deftypefun int setrlimit64 (int @var{resource}, const struct rlimit64 *@var{rlp})
288@safety{@prelim{}@mtsafe{}@assafe{}@acsafe{}}
289@c Wrapper for setrlimit or direct syscall.
290This function is similar to @code{setrlimit} but its second parameter is
291a pointer to a variable of type @code{struct rlimit64} which allows it
292to set values which wouldn't fit in the member of a @code{struct
293rlimit}.
294
295If the sources are compiled with @code{_FILE_OFFSET_BITS == 64} on a
29632-bit machine this function is available under the name
297@code{setrlimit} and so transparently replaces the old interface.
298@end deftypefun
299
300@comment sys/resource.h
301@comment BSD
302@deftp {Data Type} {struct rlimit}
303This structure is used with @code{getrlimit} to receive limit values,
304and with @code{setrlimit} to specify limit values for a particular process
305and resource. It has two fields:
306
307@table @code
308@item rlim_t rlim_cur
The current limit.
310
311@item rlim_t rlim_max
312The maximum limit.
313@end table
314
315For @code{getrlimit}, the structure is an output; it receives the current
316values. For @code{setrlimit}, it specifies the new values.
317@end deftp
318
319For the LFS functions a similar type is defined in @file{sys/resource.h}.
320
321@comment sys/resource.h
322@comment Unix98
323@deftp {Data Type} {struct rlimit64}
324This structure is analogous to the @code{rlimit} structure above, but
325its components have wider ranges. It has two fields:
326
327@table @code
328@item rlim64_t rlim_cur
329This is analogous to @code{rlimit.rlim_cur}, but with a different type.
330
331@item rlim64_t rlim_max
332This is analogous to @code{rlimit.rlim_max}, but with a different type.
333@end table
334
335@end deftp
336
337Here is a list of resources for which you can specify a limit. Memory
338and file sizes are measured in bytes.
339
@vtable @code
341@comment sys/resource.h
342@comment BSD
343@item RLIMIT_CPU
344The maximum amount of CPU time the process can use. If it runs for
345longer than this, it gets a signal: @code{SIGXCPU}. The value is
346measured in seconds. @xref{Operation Error Signals}.
347
348@comment sys/resource.h
349@comment BSD
350@item RLIMIT_FSIZE
351The maximum size of file the process can create. Trying to write a
352larger file causes a signal: @code{SIGXFSZ}. @xref{Operation Error
353Signals}.
354
355@comment sys/resource.h
356@comment BSD
357@item RLIMIT_DATA
358The maximum size of data memory for the process. If the process tries
359to allocate data memory beyond this amount, the allocation function
360fails.
361
362@comment sys/resource.h
363@comment BSD
364@item RLIMIT_STACK
365The maximum stack size for the process. If the process tries to extend
366its stack past this size, it gets a @code{SIGSEGV} signal.
367@xref{Program Error Signals}.
368
369@comment sys/resource.h
370@comment BSD
371@item RLIMIT_CORE
372The maximum size core file that this process can create. If the process
373terminates and would dump a core file larger than this, then no core
374file is created. So setting this limit to zero prevents core files from
375ever being created.
376
377@comment sys/resource.h
378@comment BSD
379@item RLIMIT_RSS
380The maximum amount of physical memory that this process should get.
381This parameter is a guide for the system's scheduler and memory
382allocator; the system may give the process more memory when there is a
383surplus.
384
385@comment sys/resource.h
386@comment BSD
387@item RLIMIT_MEMLOCK
388The maximum amount of memory that can be locked into physical memory (so
389it will never be paged out).
390
391@comment sys/resource.h
392@comment BSD
393@item RLIMIT_NPROC
394The maximum number of processes that can be created with the same user ID.
395If you have reached the limit for your user ID, @code{fork} will fail
396with @code{EAGAIN}. @xref{Creating a Process}.
397
398@comment sys/resource.h
399@comment BSD
400@item RLIMIT_NOFILE
@itemx RLIMIT_OFILE
402The maximum number of files that the process can open. If it tries to
403open more files than this, its open attempt fails with @code{errno}
404@code{EMFILE}. @xref{Error Codes}. Not all systems support this limit;
405GNU does, and 4.4 BSD does.
406
407@comment sys/resource.h
408@comment Unix98
409@item RLIMIT_AS
410The maximum size of total memory that this process should get. If the
411process tries to allocate more memory beyond this amount with, for
412example, @code{brk}, @code{malloc}, @code{mmap} or @code{sbrk}, the
413allocation function fails.
414
415@comment sys/resource.h
416@comment BSD
417@item RLIM_NLIMITS
418The number of different resource limits. Any valid @var{resource}
419operand must be less than @code{RLIM_NLIMITS}.
@end vtable
421
422@comment sys/resource.h
423@comment BSD
@deftypevr Constant rlim_t RLIM_INFINITY
425This constant stands for a value of ``infinity'' when supplied as
426the limit value in @code{setrlimit}.
427@end deftypevr
428
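A common pattern with these functions is to read the existing limits,
change only the current (soft) limit, and write the pair back. The
following is a minimal sketch of that pattern; the helper name
@code{set_soft_limit} and its clamping behavior are illustrative
assumptions rather than anything the library provides.

```c
#include <sys/resource.h>

/* Hypothetical helper: set the current (soft) limit for RESOURCE to
   WANTED, clamped to the maximum (hard) limit so that an unprivileged
   process does not ask for more than it is allowed.
   Returns 0 on success and -1 on failure, with errno set by
   getrlimit or setrlimit.  */
int
set_soft_limit (int resource, rlim_t wanted)
{
  struct rlimit rl;

  if (getrlimit (resource, &rl) != 0)
    return -1;

  /* A finite hard limit caps any request, including "infinity".  */
  if (rl.rlim_max != RLIM_INFINITY
      && (wanted == RLIM_INFINITY || wanted > rl.rlim_max))
    wanted = rl.rlim_max;

  /* Leave the hard limit untouched; only the soft limit changes.  */
  rl.rlim_cur = wanted;
  return setrlimit (resource, &rl);
}
```

For example, @code{set_soft_limit (RLIMIT_CORE, 0)} disables core files for
the calling process, and @code{set_soft_limit (RLIMIT_NOFILE, RLIM_INFINITY)}
raises the open file limit as far as the hard limit permits.
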
429
430The following are historical functions to do some of what the functions
431above do. The functions above are better choices.
432
433@code{ulimit} and the command symbols are declared in @file{ulimit.h}.
434@pindex ulimit.h
436@comment ulimit.h
437@comment BSD
@deftypefun {long int} ulimit (int @var{cmd}, @dots{})
439@safety{@prelim{}@mtsafe{}@assafe{}@acsafe{}}
440@c Wrapper for getrlimit, setrlimit or
441@c sysconf(_SC_OPEN_MAX)->getdtablesize->getrlimit.
442
443@code{ulimit} gets the current limit or sets the current and maximum
444limit for a particular resource for the calling process according to the
command @var{cmd}.
446
447If you are getting a limit, the command argument is the only argument.
448If you are setting a limit, there is a second argument:
449@code{long int} @var{limit} which is the value to which you are setting
450the limit.
451
452The @var{cmd} values and the operations they specify are:
@vtable @code
454
@item UL_GETFSIZE
456Get the current limit on the size of a file, in units of 512 bytes.
457
@item UL_SETFSIZE
459Set the current and maximum limit on the size of a file to @var{limit} *
460512 bytes.
461
@end vtable
463
464There are also some other @var{cmd} values that may do things on some
465systems, but they are not supported.
466
467Only the superuser may increase a maximum limit.
468
469When you successfully get a limit, the return value of @code{ulimit} is
470that limit, which is never negative. When you successfully set a limit,
471the return value is zero. When the function fails, the return value is
472@code{-1} and @code{errno} is set according to the reason:
473
474@table @code
475@item EPERM
476A process tried to increase a maximum limit, but is not superuser.
477@end table
478
479
480@end deftypefun
481
482@code{vlimit} and its resource symbols are declared in @file{sys/vlimit.h}.
@pindex sys/vlimit.h
485@comment sys/vlimit.h
486@comment BSD
@deftypefun int vlimit (int @var{resource}, int @var{limit})
488@safety{@prelim{}@mtunsafe{@mtasurace{:setrlimit}}@asunsafe{}@acsafe{}}
489@c It calls getrlimit and modifies the rlim_cur field before calling
490@c setrlimit. There's a window for a concurrent call to setrlimit that
491@c modifies e.g. rlim_max, which will be lost if running as super-user.
492
493@code{vlimit} sets the current limit for a resource for a process.
494
495@var{resource} identifies the resource:
496
@vtable @code
498@item LIM_CPU
499Maximum CPU time. Same as @code{RLIMIT_CPU} for @code{setrlimit}.
500@item LIM_FSIZE
501Maximum file size. Same as @code{RLIMIT_FSIZE} for @code{setrlimit}.
502@item LIM_DATA
503Maximum data memory. Same as @code{RLIMIT_DATA} for @code{setrlimit}.
504@item LIM_STACK
505Maximum stack size. Same as @code{RLIMIT_STACK} for @code{setrlimit}.
506@item LIM_CORE
Maximum core file size. Same as @code{RLIMIT_CORE} for @code{setrlimit}.
508@item LIM_MAXRSS
509Maximum physical memory. Same as @code{RLIMIT_RSS} for @code{setrlimit}.
@end vtable
511
512The return value is zero for success, and @code{-1} with @code{errno} set
513accordingly for failure:
514
515@table @code
516@item EPERM
517The process tried to set its current limit beyond its maximum limit.
518@end table
519
520@end deftypefun
521
522@node Priority
@section Process CPU Priority And Scheduling
@cindex process priority
@cindex cpu priority
526@cindex priority of a process
527
528When multiple processes simultaneously require CPU time, the system's
529scheduling policy and process CPU priorities determine which processes
530get it. This section describes how that determination is made and
@glibcadj{} functions to control it.
532
533It is common to refer to CPU scheduling simply as scheduling and a
534process' CPU priority simply as the process' priority, with the CPU
535resource being implied. Bear in mind, though, that CPU time is not the
536only resource a process uses or that processes contend for. In some
537cases, it is not even particularly important. Giving a process a high
538``priority'' may have very little effect on how fast a process runs with
539respect to other processes. The priorities discussed in this section
540apply only to CPU time.
541
542CPU scheduling is a complex issue and different systems do it in wildly
543different ways. New ideas continually develop and find their way into
544the intricacies of the various systems' scheduling algorithms. This
section discusses the general concepts, some specifics of systems
that commonly use @theglibc{}, and some standards.
547
548For simplicity, we talk about CPU contention as if there is only one CPU
in the system. But all the same principles apply when a system has
550multiple CPUs, and knowing that the number of processes that can run at
551any one time is equal to the number of CPUs, you can easily extrapolate
552the information.
553
554The functions described in this section are all defined by the POSIX.1
and POSIX.1b standards (the @code{sched@dots{}} functions are POSIX.1b).
556However, POSIX does not define any semantics for the values that these
557functions get and set. In this chapter, the semantics are based on the
558Linux kernel's implementation of the POSIX standard. As you will see,
559the Linux implementation is quite the inverse of what the authors of the
560POSIX syntax had in mind.
561
562@menu
563* Absolute Priority:: The first tier of priority. Posix
564* Realtime Scheduling:: Scheduling among the process nobility
565* Basic Scheduling Functions:: Get/set scheduling policy, priority
566* Traditional Scheduling:: Scheduling among the vulgar masses
* CPU Affinity:: Limiting execution to certain CPUs
568@end menu
569
570
571
572@node Absolute Priority
573@subsection Absolute Priority
574@cindex absolute priority
575@cindex priority, absolute
576
577Every process has an absolute priority, and it is represented by a number.
578The higher the number, the higher the absolute priority.
579
580@cindex realtime CPU scheduling
581On systems of the past, and most systems today, all processes have
absolute priority 0 and this section is irrelevant. In that case, see
@ref{Traditional Scheduling}. Absolute priorities were invented to
accommodate realtime systems, in which it is vital that certain processes
585be able to respond to external events happening in real time, which
586means they cannot wait around while some other process that @emph{wants
587to}, but doesn't @emph{need to} run occupies the CPU.
588
589@cindex ready to run
590@cindex preemptive scheduling
591When two processes are in contention to use the CPU at any instant, the
592one with the higher absolute priority always gets it. This is true even if the
process with the lower priority is already using the CPU (i.e., the
594scheduling is preemptive). Of course, we're only talking about
595processes that are running or ``ready to run,'' which means they are
596ready to execute instructions right now. When a process blocks to wait
597for something like I/O, its absolute priority is irrelevant.
598
599@cindex runnable process
@strong{NB:} The term ``runnable'' is a synonym for ``ready to run.''
601
602When two processes are running or ready to run and both have the same
603absolute priority, it's more interesting. In that case, who gets the
CPU is determined by the scheduling policy. If the processes have
605absolute priority 0, the traditional scheduling policy described in
606@ref{Traditional Scheduling} applies. Otherwise, the policies described
607in @ref{Realtime Scheduling} apply.
608
609You normally give an absolute priority above 0 only to a process that
610can be trusted not to hog the CPU. Such processes are designed to block
611(or terminate) after relatively short CPU runs.
612
613A process begins life with the same absolute priority as its parent
614process. Functions described in @ref{Basic Scheduling Functions} can
615change it.
616
617Only a privileged process can change a process' absolute priority to
618something other than @code{0}. Only a privileged process or the
619target process' owner can change its absolute priority at all.
620
621POSIX requires absolute priority values used with the realtime
622scheduling policies to be consecutive with a range of at least 32. On
623Linux, they are 1 through 99. The functions
@code{sched_get_priority_max} and @code{sched_get_priority_min} portably
625tell you what the range is on a particular system.
626
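As a minimal sketch, the range for a given policy can be queried as
follows. The helper name @code{priority_range} is an illustrative
assumption; @code{sched_get_priority_min} and
@code{sched_get_priority_max} are declared in @file{sched.h}.

```c
#include <sched.h>

/* Hypothetical helper: store the lowest and highest absolute priority
   usable with POLICY in *LO and *HI.
   Returns 0 on success and -1 if POLICY is not recognized.  */
int
priority_range (int policy, int *lo, int *hi)
{
  int min = sched_get_priority_min (policy);
  int max = sched_get_priority_max (policy);

  if (min == -1 || max == -1)
    return -1;

  *lo = min;
  *hi = max;
  return 0;
}
```

Calling @code{priority_range (SCHED_FIFO, &lo, &hi)} avoids hard-coding the
Linux values in a program that may run elsewhere.
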
627
628@subsubsection Using Absolute Priority
629
630One thing you must keep in mind when designing real time applications is
631that having higher absolute priority than any other process doesn't
632guarantee the process can run continuously. Two things that can wreck a
good CPU run are interrupts and page faults.
634
635Interrupt handlers live in that limbo between processes. The CPU is
636executing instructions, but they aren't part of any process. An
637interrupt will stop even the highest priority process. So you must
638allow for slight delays and make sure that no device in the system has
639an interrupt handler that could cause too long a delay between
640instructions for your process.
641
642Similarly, a page fault causes what looks like a straightforward
643sequence of instructions to take a long time. The fact that other
644processes get to run while the page faults in is of no consequence,
because as soon as the I/O is complete, the higher priority process will
646kick them out and run again, but the wait for the I/O itself could be a
647problem. To neutralize this threat, use @code{mlock} or
648@code{mlockall}.
649
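A minimal sketch of the @code{mlockall} approach follows. The helper name
@code{lock_process_memory} is an illustrative assumption, and the call may
fail for an unprivileged process, for example when the
@code{RLIMIT_MEMLOCK} limit is too low.

```c
#include <stdio.h>
#include <sys/mman.h>

/* Hypothetical helper: lock every page the process has mapped now
   (MCL_CURRENT) and every page it maps later (MCL_FUTURE) into
   physical memory, so that none of them can page fault on I/O.
   Returns 0 on success and -1 on failure.  */
int
lock_process_memory (void)
{
  if (mlockall (MCL_CURRENT | MCL_FUTURE) != 0)
    {
      perror ("mlockall");
      return -1;
    }
  return 0;
}
```

A realtime program would typically call this once during startup, before
entering its time-critical loop, and treat failure as a reason not to rely
on page-fault-free execution.
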
650There are a few ramifications of the absoluteness of this priority on a
651single-CPU system that you need to keep in mind when you choose to set a
652priority and also when you're working on a program that runs with high
653absolute priority. Consider a process that has higher absolute priority
654than any other process in the system and due to a bug in its program, it
655gets into an infinite loop. It will never cede the CPU. You can't run
656a command to kill it because your command would need to get the CPU in
657order to run. The errant program is in complete control. It controls
658the vertical, it controls the horizontal.
659
660There are two ways to avoid this: 1) keep a shell running somewhere with
a higher absolute priority or 2) keep a controlling terminal attached to
662the high priority process group. All the priority in the world won't
663stop an interrupt handler from running and delivering a signal to the
664process if you hit Control-C.
665
Some systems use absolute priority as a means of allocating a fixed
percentage of CPU time to a process. To do this, a super high priority
668privileged process constantly monitors the process' CPU usage and raises
669its absolute priority when the process isn't getting its entitled share
670and lowers it when the process is exceeding it.
671
@strong{NB:} The absolute priority is sometimes called the ``static
673priority.'' We don't use that term in this manual because it misses the
674most important feature of the absolute priority: its absoluteness.
675
676
677@node Realtime Scheduling
678@subsection Realtime Scheduling
@cindex realtime scheduling
680
681Whenever two processes with the same absolute priority are ready to run,
682the kernel has a decision to make, because only one can run at a time.
683If the processes have absolute priority 0, the kernel makes this decision
684as described in @ref{Traditional Scheduling}. Otherwise, the decision
685is as described in this section.
686
687If two processes are ready to run but have different absolute priorities,
688the decision is much simpler, and is described in @ref{Absolute
689Priority}.
690
Each process has a scheduling policy. For processes with absolute
692priority other than zero, there are two available:
693
694@enumerate
695@item
696First Come First Served
697@item
698Round Robin
699@end enumerate
700
701The most sensible case is where all the processes with a certain
702absolute priority have the same scheduling policy. We'll discuss that
703first.
704
705In Round Robin, processes share the CPU, each one running for a small
706quantum of time (``time slice'') and then yielding to another in a
707circular fashion. Of course, only processes that are ready to run and
708have the same absolute priority are in this circle.
709
710In First Come First Served, the process that has been waiting the
711longest to run gets the CPU, and it keeps it until it voluntarily
712relinquishes the CPU, runs out of things to do (blocks), or gets
713preempted by a higher priority process.
714
715First Come First Served, along with maximal absolute priority and
716careful control of interrupts and page faults, is the one to use when a
717process absolutely, positively has to run at full CPU speed or not at
718all.
719
720Judicious use of @code{sched_yield} function invocations by processes
721with First Come First Served scheduling policy forms a good compromise
722between Round Robin and First Come First Served.
723
724To understand how scheduling works when processes of different scheduling
725policies occupy the same absolute priority, you have to know the nitty
gritty details of how processes enter and exit the ready to run list.
727
728In both cases, the ready to run list is organized as a true queue, where
729a process gets pushed onto the tail when it becomes ready to run and is
730popped off the head when the scheduler decides to run it. Note that
731ready to run and running are two mutually exclusive states. When the
732scheduler runs a process, that process is no longer ready to run and no
733longer in the ready to run list. When the process stops running, it
734may go back to being ready to run again.
735
736The only difference between a process that is assigned the Round Robin
scheduling policy and a process that is assigned First Come First Served
738is that in the former case, the process is automatically booted off the
739CPU after a certain amount of time. When that happens, the process goes
740back to being ready to run, which means it enters the queue at the tail.
741The time quantum we're talking about is small. Really small. This is
742not your father's timesharing. For example, with the Linux kernel, the
743round robin time slice is a thousand times shorter than its typical
744time slice for traditional scheduling.
745
746A process begins life with the same scheduling policy as its parent process.
747Functions described in @ref{Basic Scheduling Functions} can change it.
748
749Only a privileged process can set the scheduling policy of a process
750that has absolute priority higher than 0.
751
752@node Basic Scheduling Functions
753@subsection Basic Scheduling Functions
754
This section describes functions in @theglibc{} for setting the
756absolute priority and scheduling policy of a process.
757
758@strong{Portability Note:} On systems that have the functions in this
section, the macro @code{_POSIX_PRIORITY_SCHEDULING} is defined in
@file{unistd.h}.
761
762For the case that the scheduling policy is traditional scheduling, more
763functions to fine tune the scheduling are in @ref{Traditional Scheduling}.
764
765Don't try to make too much out of the naming and structure of these
766functions. They don't match the concepts described in this manual
767because the functions are as defined by POSIX.1b, but the implementation
on systems that use @theglibc{} is the inverse of what the POSIX
structure contemplates. The POSIX scheme assumes that the primary
770scheduling parameter is the scheduling policy and that the priority
771value, if any, is a parameter of the scheduling policy. In the
772implementation, though, the priority value is king and the scheduling
773policy, if anything, only fine tunes the effect of that priority.
774
The symbols in this section are declared in @file{sched.h}.
776
777@comment sched.h
778@comment POSIX
779@deftp {Data Type} {struct sched_param}
780This structure describes an absolute priority.
781@table @code
782@item int sched_priority
783absolute priority value
784@end table
785@end deftp
786
787@comment sched.h
788@comment POSIX
789@deftypefun int sched_setscheduler (pid_t @var{pid}, int @var{policy}, const struct sched_param *@var{param})
790@safety{@prelim{}@mtsafe{}@assafe{}@acsafe{}}
791@c Direct syscall, Linux only.
792
793This function sets both the absolute priority and the scheduling policy
794for a process.
795
796It assigns the absolute priority value given by @var{param} and the
797scheduling policy @var{policy} to the process with Process ID @var{pid},
798or the calling process if @var{pid} is zero. If @var{policy} is
negative, @code{sched_setscheduler} keeps the existing scheduling policy.
800
801The following macros represent the valid values for @var{policy}:
802
@vtable @code
@item SCHED_OTHER
Traditional Scheduling
@item SCHED_FIFO
First In First Out
@item SCHED_RR
Round Robin
@end vtable
811
812@c The Linux kernel code (in sched.c) actually reschedules the process,
813@c but it puts it at the head of the run queue, so I'm not sure just what
814@c the effect is, but it must be subtle.
815
On success, the return value is @code{0}. Otherwise, it is @code{-1}
and @code{errno} is set accordingly. The @code{errno} values specific
to this function are:
819
820@table @code
821@item EPERM
822@itemize @bullet
823@item
The calling process does not have @code{CAP_SYS_NICE} permission and
@var{policy} is not @code{SCHED_OTHER} (or it is negative and the
existing policy is not @code{SCHED_OTHER}).

@item
The calling process does not have @code{CAP_SYS_NICE} permission and its
owner is not the target process' owner. I.e., the effective uid of the
calling process is neither the effective nor the real uid of process
@var{pid}.
833@c We need a cross reference to the capabilities section, when written.
834@end itemize
835
836@item ESRCH
837There is no process with pid @var{pid} and @var{pid} is not zero.
838
839@item EINVAL
840@itemize @bullet
841@item
842@var{policy} does not identify an existing scheduling policy.
843
844@item
845The absolute priority value identified by *@var{param} is outside the
846valid range for the scheduling policy @var{policy} (or the existing
847scheduling policy if @var{policy} is negative) or @var{param} is
848null. @code{sched_get_priority_max} and @code{sched_get_priority_min}
849tell you what the valid range is.
850
851@item
852@var{pid} is negative.
853@end itemize
854@end table
855
856@end deftypefun
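
As an illustration of the calling convention and the @code{EPERM} case
described above, here is a minimal sketch. The helper name
@code{request_fifo_scheduling} is hypothetical and not part of
@theglibc{}; the sketch assumes a Linux system and simply reports failure
to the caller, which is the usual outcome for an unprivileged process.

```c
#include <errno.h>
#include <sched.h>
#include <string.h>

/* Hypothetical helper: ask for SCHED_FIFO at the lowest valid absolute
   priority for the calling process (pid 0).  Returns 0 on success, or
   -1 with errno set (typically EPERM without CAP_SYS_NICE).  */
int
request_fifo_scheduling (void)
{
  struct sched_param param;
  int lowest = sched_get_priority_min (SCHED_FIFO);

  if (lowest == -1)
    return -1;

  memset (&param, 0, sizeof param);
  param.sched_priority = lowest;

  return sched_setscheduler (0, SCHED_FIFO, &param);
}
```

A caller that receives @code{-1} with @code{errno} equal to @code{EPERM}
can reasonably continue under its existing policy.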
857
858
859@comment sched.h
860@comment POSIX
861@deftypefun int sched_getscheduler (pid_t @var{pid})
862@safety{@prelim{}@mtsafe{}@assafe{}@acsafe{}}
863@c Direct syscall, Linux only.
864
865This function returns the scheduling policy assigned to the process with
866Process ID (pid) @var{pid}, or the calling process if @var{pid} is zero.
867
868The return value is the scheduling policy. See
869@code{sched_setscheduler} for the possible values.
870
871If the function fails, the return value is instead @code{-1} and
872@code{errno} is set accordingly.
873
874The @code{errno} values specific to this function are:
875
876@table @code
877
878@item ESRCH
879There is no process with pid @var{pid} and it is not zero.
880
881@item EINVAL
882@var{pid} is negative.
883
884@end table
885
886Note that this function is not an exact mate to @code{sched_setscheduler}
887because while that function sets the scheduling policy and the absolute
888priority, this function gets only the scheduling policy. To get the
889absolute priority, use @code{sched_getparam}.
890
891@end deftypefun
892
893
894@comment sched.h
895@comment POSIX
896@deftypefun int sched_setparam (pid_t @var{pid}, const struct sched_param *@var{param})
897@safety{@prelim{}@mtsafe{}@assafe{}@acsafe{}}
898@c Direct syscall, Linux only.
899
900This function sets a process' absolute priority.
901
902It is functionally identical to @code{sched_setscheduler} with
903@var{policy} = @code{-1}.
904
905@c in fact, that's how it's implemented in Linux.
906
907@end deftypefun
908
909@comment sched.h
910@comment POSIX
@deftypefun int sched_getparam (pid_t @var{pid}, struct sched_param *@var{param})
912@safety{@prelim{}@mtsafe{}@assafe{}@acsafe{}}
913@c Direct syscall, Linux only.
914
915This function returns a process' absolute priority.
916
917@var{pid} is the Process ID (pid) of the process whose absolute priority
918you want to know.
919
920@var{param} is a pointer to a structure in which the function stores the
921absolute priority of the process.
922
923On success, the return value is @code{0}. Otherwise, it is @code{-1}
and @code{errno} is set accordingly. The @code{errno} values specific
925to this function are:
926
927@table @code
928
929@item ESRCH
930There is no process with pid @var{pid} and it is not zero.
931
932@item EINVAL
933@var{pid} is negative.
934
935@end table
936
937@end deftypefun
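
Because @code{sched_getscheduler} and @code{sched_getparam} each report
only half of what @code{sched_setscheduler} sets, a program usually calls
both. The following sketch shows one way to combine them; the helper
names @code{policy_name} and @code{get_policy_and_priority} are
illustrative assumptions, not library functions.

```c
#include <sched.h>
#include <sys/types.h>

/* Hypothetical helper: map a policy value to a printable name.  */
const char *
policy_name (int policy)
{
  switch (policy)
    {
    case SCHED_OTHER: return "SCHED_OTHER";
    case SCHED_FIFO:  return "SCHED_FIFO";
    case SCHED_RR:    return "SCHED_RR";
    default:          return "unknown";
    }
}

/* Hypothetical helper: fetch both the scheduling policy and the
   absolute priority of PID (0 means the calling process).  Returns 0
   and fills in *POLICY and *PRIORITY, or returns -1 with errno set.  */
int
get_policy_and_priority (pid_t pid, int *policy, int *priority)
{
  struct sched_param param;
  int current = sched_getscheduler (pid);

  if (current == -1)
    return -1;
  if (sched_getparam (pid, &param) == -1)
    return -1;

  *policy = current;
  *priority = param.sched_priority;
  return 0;
}
```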
938
939
940@comment sched.h
941@comment POSIX
@deftypefun int sched_get_priority_min (int @var{policy})
943@safety{@prelim{}@mtsafe{}@assafe{}@acsafe{}}
944@c Direct syscall, Linux only.
945
946This function returns the lowest absolute priority value that is
947allowable for a process with scheduling policy @var{policy}.
948
On Linux, it is 0 for @code{SCHED_OTHER} and 1 for @code{SCHED_FIFO}
and @code{SCHED_RR}.

On success, the return value is the lowest priority value. Otherwise,
it is @code{-1} and @code{errno} is set accordingly. The @code{errno}
values specific to this function are:
954
955@table @code
956@item EINVAL
957@var{policy} does not identify an existing scheduling policy.
958@end table
959
960@end deftypefun
961
962@comment sched.h
963@comment POSIX
@deftypefun int sched_get_priority_max (int @var{policy})
965@safety{@prelim{}@mtsafe{}@assafe{}@acsafe{}}
966@c Direct syscall, Linux only.
967
This function returns the highest absolute priority value that is
allowable for a process with scheduling policy @var{policy}.

On Linux, it is 0 for @code{SCHED_OTHER} and 99 for @code{SCHED_FIFO}
and @code{SCHED_RR}.

On success, the return value is the highest priority value. Otherwise,
it is @code{-1} and @code{errno} is set accordingly. The @code{errno}
values specific to this function are:
976
977@table @code
978@item EINVAL
979@var{policy} does not identify an existing scheduling policy.
980@end table
981
982@end deftypefun
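
Since the valid range depends on the policy and on the system, portable
code can query it rather than hard-code values such as 1 and 99. A
minimal sketch, in which the helper name @code{clamp_priority} is
hypothetical:

```c
#include <sched.h>

/* Hypothetical helper: force WANTED into the range of absolute
   priorities that is valid for POLICY.  Returns the clamped value, or
   -1 if the range cannot be determined (errno is then set by
   sched_get_priority_min or sched_get_priority_max).  */
int
clamp_priority (int policy, int wanted)
{
  int lowest = sched_get_priority_min (policy);
  int highest = sched_get_priority_max (policy);

  if (lowest == -1 || highest == -1)
    return -1;
  if (wanted < lowest)
    return lowest;
  if (wanted > highest)
    return highest;
  return wanted;
}
```

The result can be stored in the @code{sched_priority} member of a
@code{struct sched_param} before calling @code{sched_setscheduler}.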
983
984@comment sched.h
985@comment POSIX
986@deftypefun int sched_rr_get_interval (pid_t @var{pid}, struct timespec *@var{interval})
987@safety{@prelim{}@mtsafe{}@assafe{}@acsafe{}}
988@c Direct syscall, Linux only.

This function returns the length of the quantum (time slice) used with
991the Round Robin scheduling policy, if it is used, for the process with
992Process ID @var{pid}.
993
It stores the length of time in @code{*@var{interval}}.
995@c We need a cross-reference to where timespec is explained. But that
996@c section doesn't exist yet, and the time chapter needs to be slightly
997@c reorganized so there is a place to put it (which will be right next
998@c to timeval, which is presently misplaced). 2000.05.07.
999
1000With a Linux kernel, the round robin time slice is always 150
1001microseconds, and @var{pid} need not even be a real pid.
1002
1003The return value is @code{0} on success and in the pathological case
1004that it fails, the return value is @code{-1} and @code{errno} is set
1005accordingly. There is nothing specific that can go wrong with this
1006function, so there are no specific @code{errno} values.
1007
1008@end deftypefun
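
Rather than relying on a fixed figure, a program can ask for the quantum
at run time. The sketch below converts the result to nanoseconds; the
helper name @code{rr_quantum_ns} is an assumption made for illustration.

```c
#include <sched.h>
#include <sys/types.h>
#include <time.h>

/* Hypothetical helper: return the Round Robin quantum reported for PID
   (0 means the calling process) in nanoseconds, or -1 on failure with
   errno set by sched_rr_get_interval.  */
long long
rr_quantum_ns (pid_t pid)
{
  struct timespec interval;

  if (sched_rr_get_interval (pid, &interval) == -1)
    return -1;

  return (long long) interval.tv_sec * 1000000000LL + interval.tv_nsec;
}
```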
1009
1010@comment sched.h
1011@comment POSIX
@deftypefun int sched_yield (void)
1013@safety{@prelim{}@mtsafe{}@assafe{}@acsafe{}}
1014@c Direct syscall on Linux; alias to swtch on HURD.
1015
1016This function voluntarily gives up the process' claim on the CPU.
1017
1018Technically, @code{sched_yield} causes the calling process to be made
1019immediately ready to run (as opposed to running, which is what it was
1020before). This means that if it has absolute priority higher than 0, it
1021gets pushed onto the tail of the queue of processes that share its
1022absolute priority and are ready to run, and it will run again when its
1023turn next arrives. If its absolute priority is 0, it is more
1024complicated, but still has the effect of yielding the CPU to other
1025processes.
1026
1027If there are no other processes that share the calling process' absolute
1028priority, this function doesn't have any effect.
1029
1030To the extent that the containing program is oblivious to what other
1031processes in the system are doing and how fast it executes, this
1032function appears as a no-op.
1033
1034The return value is @code{0} on success and in the pathological case
1035that it fails, the return value is @code{-1} and @code{errno} is set
1036accordingly. There is nothing specific that can go wrong with this
1037function, so there are no specific @code{errno} values.
1038
1039@end deftypefun
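
One common use of @code{sched_yield} is a polite polling loop that lets
other ready processes run between checks. The following is only a
sketch: the helper name @code{yield_until_set} is hypothetical, and it
assumes a C11 compiler for @file{stdatomic.h}. Blocking primitives are
usually preferable when they are available.

```c
#include <sched.h>
#include <stdatomic.h>

/* Hypothetical helper: poll *FLAG at most MAX_TRIES times, giving up
   the CPU between polls.  Returns the number of yields performed
   before the flag was seen nonzero, or -1 if it never was.  */
int
yield_until_set (atomic_int *flag, int max_tries)
{
  for (int tries = 0; tries < max_tries; tries++)
    {
      if (atomic_load (flag) != 0)
        return tries;
      sched_yield ();
    }
  return -1;
}
```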
1040
1041@node Traditional Scheduling
1042@subsection Traditional Scheduling
1043@cindex scheduling, traditional
1044
1045This section is about the scheduling among processes whose absolute
1046priority is 0. When the system hands out the scraps of CPU time that
are left over after the processes with higher absolute priority have
1048taken all they want, the scheduling described herein determines who
1049among the great unwashed processes gets them.
1050
1051@menu
1052* Traditional Scheduling Intro::
1053* Traditional Scheduling Functions::
1054@end menu
1055
1056@node Traditional Scheduling Intro
1057@subsubsection Introduction To Traditional Scheduling
1058
Long before there was absolute priority (@pxref{Absolute Priority}),
Unix systems were scheduling the CPU using this system. When POSIX came
in like the Romans and imposed absolute priorities to accommodate the
1062needs of realtime processing, it left the indigenous Absolute Priority
1063Zero processes to govern themselves by their own familiar scheduling
1064policy.
1065
1066Indeed, absolute priorities higher than zero are not available on many
1067systems today and are not typically used when they are, being intended
1068mainly for computers that do realtime processing. So this section
1069describes the only scheduling many programmers need to be concerned
1070about.
1071
1072But just to be clear about the scope of this scheduling: Any time a
process with an absolute priority of 0 and a process with an absolute
1074priority higher than 0 are ready to run at the same time, the one with
1075absolute priority 0 does not run. If it's already running when the
1076higher priority ready-to-run process comes into existence, it stops
1077immediately.
1078
1079In addition to its absolute priority of zero, every process has another
priority, which we will refer to as ``dynamic priority'' because it changes
over time. The dynamic priority is meaningless for processes with
an absolute priority higher than zero.
1083
1084The dynamic priority sometimes determines who gets the next turn on the
1085CPU. Sometimes it determines how long turns last. Sometimes it
1086determines whether a process can kick another off the CPU.
1087
In Linux, the value is a combination of these things, but mostly it
1089just determines the length of the time slice. The higher a process'
1090dynamic priority, the longer a shot it gets on the CPU when it gets one.
1091If it doesn't use up its time slice before giving up the CPU to do
1092something like wait for I/O, it is favored for getting the CPU back when
1093it's ready for it, to finish out its time slice. Other than that,
1094selection of processes for new time slices is basically round robin.
1095But the scheduler does throw a bone to the low priority processes: A
1096process' dynamic priority rises every time it is snubbed in the
1097scheduling process. In Linux, even the fat kid gets to play.
1098
1099The fluctuation of a process' dynamic priority is regulated by another
1100value: The ``nice'' value. The nice value is an integer, usually in the
1101range -20 to 20, and represents an upper limit on a process' dynamic
1102priority. The higher the nice number, the lower that limit.
1103
1104On a typical Linux system, for example, a process with a nice value of
110520 can get only 10 milliseconds on the CPU at a time, whereas a process
1106with a nice value of -20 can achieve a high enough priority to get 400
1107milliseconds.
1108
1109The idea of the nice value is deferential courtesy. In the beginning,
1110in the Unix garden of Eden, all processes shared equally in the bounty
1111of the computer system. But not all processes really need the same
1112share of CPU time, so the nice value gave a courteous process the
1113ability to refuse its equal share of CPU time that others might prosper.
1114Hence, the higher a process' nice value, the nicer the process is.
1115(Then a snake came along and offered some process a negative nice value
1116and the system became the crass resource allocation system we know
today.)
1118
1119Dynamic priorities tend upward and downward with an objective of
1120smoothing out allocation of CPU time and giving quick response time to
1121infrequent requests. But they never exceed their nice limits, so on a
1122heavily loaded CPU, the nice value effectively determines how fast a
1123process runs.
1124
1125In keeping with the socialistic heritage of Unix process priority, a
1126process begins life with the same nice value as its parent process and
1127can raise it at will. A process can also raise the nice value of any
1128other process owned by the same user (or effective user). But only a
1129privileged process can lower its nice value. A privileged process can
1130also raise or lower another process' nice value.
1131
@glibcadj{} functions for getting and setting nice values are described in
@ref{Traditional Scheduling Functions}.
1134
1135@node Traditional Scheduling Functions
1136@subsubsection Functions For Traditional Scheduling
1137
@pindex sys/resource.h
1139This section describes how you can read and set the nice value of a
1140process. All these symbols are declared in @file{sys/resource.h}.
1141
The function and macro names are defined by POSIX, and refer to
``priority,'' but the functions actually have to do with nice values, as
the terms are used both in the manual and POSIX.
1145
1146The range of valid nice values depends on the kernel, but typically it
1147runs from @code{-20} to @code{20}. A lower nice value corresponds to
1148higher priority for the process. These constants describe the range of
1149priority values:
1150
@vtable @code
@comment sys/resource.h
@comment BSD
@item PRIO_MIN
The lowest valid nice value.

@comment sys/resource.h
@comment BSD
@item PRIO_MAX
The highest valid nice value.
@end vtable
1162
1163@comment sys/resource.h
@comment BSD, POSIX
@deftypefun int getpriority (int @var{class}, int @var{id})
@safety{@prelim{}@mtsafe{}@assafe{}@acsafe{}}
@c Direct syscall on UNIX. On HURD, calls _hurd_priority_which_map.
Return the nice value of a set of processes; @var{class} and @var{id}
specify which ones (see below). If the processes specified do not all
have the same nice value, this returns the lowest value that any of them
has.

On success, the return value is the nice value. Otherwise, it is @code{-1}
and @code{errno} is set accordingly. The @code{errno} values specific
to this function are:

1177@table @code
1178@item ESRCH
1179The combination of @var{class} and @var{id} does not match any existing
1180process.
1181
1182@item EINVAL
1183The value of @var{class} is not valid.
1184@end table
1185
If the return value is @code{-1}, it could indicate failure, or it could
be the nice value. The only way to make certain is to set @code{errno =
0} before calling @code{getpriority}, then use @code{errno != 0}
afterward as the criterion for failure.
1190@end deftypefun
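
The @code{errno} idiom described above can be wrapped once so that
callers do not have to repeat it. This is a minimal sketch; the helper
name @code{read_nice_value} and its out-parameter convention are
assumptions made for the example.

```c
#include <errno.h>
#include <sys/resource.h>

/* Hypothetical helper: store the nice value selected by CLASS and ID
   in *NICEVAL.  Returns 0 on success and -1 on failure.  A nice value
   of -1 is told apart from an error by clearing errno first.  */
int
read_nice_value (int class, int id, int *niceval)
{
  int result;

  errno = 0;
  result = getpriority (class, id);
  if (result == -1 && errno != 0)
    return -1;

  *niceval = result;
  return 0;
}
```

For example, @code{read_nice_value (PRIO_PROCESS, 0, &value)} reads the
nice value of the calling process.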
1191
1192@comment sys/resource.h
@comment BSD, POSIX
@deftypefun int setpriority (int @var{class}, int @var{id}, int @var{niceval})
@safety{@prelim{}@mtsafe{}@assafe{}@acsafe{}}
@c Direct syscall on UNIX. On HURD, calls _hurd_priority_which_map.
Set the nice value of a set of processes to @var{niceval}; @var{class}
and @var{id} specify which ones (see below).

The return value is @code{0} on success, and @code{-1} on
failure. The following @code{errno} error conditions are possible for
this function:
1203
1204@table @code
1205@item ESRCH
1206The combination of @var{class} and @var{id} does not match any existing
1207process.
1208
1209@item EINVAL
1210The value of @var{class} is not valid.
1211
1212@item EPERM
The call would set the nice value of a process which is owned by a different
user than the calling process (i.e., the target process' real or effective
uid does not match the calling process' effective uid) and the calling
process does not have @code{CAP_SYS_NICE} permission.
1217
1218@item EACCES
The call would lower the process' nice value and the process does not have
@code{CAP_SYS_NICE} permission.
@end table

1223@end deftypefun
1224
1225The arguments @var{class} and @var{id} together specify a set of
1226processes in which you are interested. These are the possible values of
1227@var{class}:
1228
@vtable @code
@comment sys/resource.h
@comment BSD
@item PRIO_PROCESS
One particular process. The argument @var{id} is a process ID (pid).

@comment sys/resource.h
@comment BSD
@item PRIO_PGRP
All the processes in a particular process group. The argument @var{id} is
a process group ID (pgid).

@comment sys/resource.h
@comment BSD
@item PRIO_USER
All the processes owned by a particular user (i.e., whose real uid
indicates the user). The argument @var{id} is a user ID (uid).
@end vtable

1248If the argument @var{id} is 0, it stands for the calling process, its
1249process group, or its owner (real uid), according to @var{class}.
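
Raising a process' own nice value needs no privilege, so a background
task can make itself more deferential at startup. The sketch below uses
@code{PRIO_PROCESS} with an @var{id} of 0; the helper name
@code{be_at_least_this_nice} is hypothetical.

```c
#include <errno.h>
#include <sys/resource.h>

/* Hypothetical helper: make sure the calling process has a nice value
   of at least NICEVAL.  It never lowers the nice value, so it does not
   need CAP_SYS_NICE.  Returns 0 on success, -1 on failure.  */
int
be_at_least_this_nice (int niceval)
{
  int current;

  errno = 0;
  current = getpriority (PRIO_PROCESS, 0);
  if (current == -1 && errno != 0)
    return -1;

  if (current >= niceval)
    return 0;                   /* Already nice enough.  */

  return setpriority (PRIO_PROCESS, 0, niceval);
}
```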

@comment unistd.h
@comment BSD
@deftypefun int nice (int @var{increment})
1254@safety{@prelim{}@mtunsafe{@mtasurace{:setpriority}}@asunsafe{}@acsafe{}}
1255@c Calls getpriority before and after setpriority, using the result of
1256@c the first call to compute the argument for setpriority. This creates
1257@c a window for a concurrent setpriority (or nice) call to be lost or
1258@c exhibit surprising behavior.
Increment the nice value of the calling process by @var{increment}.
The return value is the new nice value on success, and @code{-1} on
failure. In the case of failure, @code{errno} will be set to the
same values as for @code{setpriority}.

1265Here is an equivalent definition of @code{nice}:
1266
@smallexample
int
nice (int increment)
@{
  int result, old = getpriority (PRIO_PROCESS, 0);
  result = setpriority (PRIO_PROCESS, 0, old + increment);
  if (result != -1)
    return old + increment;
  else
    return -1;
@}
@end smallexample
1279@end deftypefun
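
Because the new nice value can itself be @code{-1}, a caller of
@code{nice} has the same ambiguity to resolve as a caller of
@code{getpriority}. A minimal sketch of one way to handle it, with the
hypothetical helper name @code{checked_nice}:

```c
#include <errno.h>
#include <unistd.h>

/* Hypothetical helper: add INCREMENT to the nice value of the calling
   process and store the new value in *NEW_VALUE.  Returns 0 on
   success, -1 on failure with errno set.  */
int
checked_nice (int increment, int *new_value)
{
  int result;

  errno = 0;
  result = nice (increment);
  if (result == -1 && errno != 0)
    return -1;

  *new_value = result;
  return 0;
}
```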
1281
1282@node CPU Affinity
1283@subsection Limiting execution to certain CPUs
1284
On a multi-processor system the operating system usually distributes
the runnable processes across all available CPUs in a
way which allows the system to work most efficiently. Which processes
and threads run can to some extent be controlled with the scheduling
functionality described in the previous sections. But which CPU finally
executes which process or thread is not covered.
1291
1292There are a number of reasons why a program might want to have control
1293over this aspect of the system as well:
1294
1295@itemize @bullet
1296@item
One thread or process is responsible for absolutely critical work
which under no circumstances must be interrupted or hindered from
making progress by other processes or threads using CPU resources. In
this case the special process would be confined to a CPU which no
other process or thread is allowed to use.

@item
The access to certain resources (RAM, I/O ports) has different costs
from different CPUs. This is the case in NUMA (Non-Uniform Memory
Architecture) machines. Preferably memory should be accessed locally
but this requirement is usually not visible to the scheduler.
Therefore forcing a process or thread to the CPUs which have local
access to the most-used memory helps to significantly boost the
performance.

@item
In controlled runtimes resource allocation and book-keeping work (for
instance garbage collection) is performed local to processors. This
can help to reduce locking costs if the resources do not have to be
protected from concurrent accesses from different processors.
1317@end itemize
1318
The POSIX standard to date offers little help with this
problem. The Linux kernel provides a set of interfaces to allow
specifying @emph{affinity sets} for a process. The scheduler will
schedule the thread or process on CPUs specified by the affinity
masks. The interfaces which @theglibc{} defines follow to some
extent the Linux kernel interface.
1325
1326@comment sched.h
1327@comment GNU
1328@deftp {Data Type} cpu_set_t
This data type is a bitset where each bit represents a CPU. How the
system's CPUs are mapped to bits in the bitset is system dependent.
The data type has a fixed size; in the unlikely case that the number
of bits is not sufficient to describe the CPUs of the system a
different interface has to be used.
1334
1335This type is a GNU extension and is defined in @file{sched.h}.
1336@end deftp
1337
To manipulate the bitset, to set and reset bits, a number of macros are
1339defined. Some of the macros take a CPU number as a parameter. Here
1340it is important to never exceed the size of the bitset. The following
1341macro specifies the number of bits in the @code{cpu_set_t} bitset.
1342
1343@comment sched.h
1344@comment GNU
1345@deftypevr Macro int CPU_SETSIZE
1346The value of this macro is the maximum number of CPUs which can be
1347handled with a @code{cpu_set_t} object.
1348@end deftypevr
1349
1350The type @code{cpu_set_t} should be considered opaque; all
1351manipulation should happen via the next four macros.
1352
1353@comment sched.h
1354@comment GNU
1355@deftypefn Macro void CPU_ZERO (cpu_set_t *@var{set})
1356@safety{@prelim{}@mtsafe{}@assafe{}@acsafe{}}
1357@c CPU_ZERO ok
1358@c __CPU_ZERO_S ok
1359@c memset dup ok
1360This macro initializes the CPU set @var{set} to be the empty set.
1361
1362This macro is a GNU extension and is defined in @file{sched.h}.
1363@end deftypefn
1364
1365@comment sched.h
1366@comment GNU
1367@deftypefn Macro void CPU_SET (int @var{cpu}, cpu_set_t *@var{set})
1368@safety{@prelim{}@mtsafe{}@assafe{}@acsafe{}}
1369@c CPU_SET ok
1370@c __CPU_SET_S ok
1371@c __CPUELT ok
1372@c __CPUMASK ok
1373This macro adds @var{cpu} to the CPU set @var{set}.
1374
1375The @var{cpu} parameter must not have side effects since it is
1376evaluated more than once.
1377
1378This macro is a GNU extension and is defined in @file{sched.h}.
1379@end deftypefn
1380
1381@comment sched.h
1382@comment GNU
1383@deftypefn Macro void CPU_CLR (int @var{cpu}, cpu_set_t *@var{set})
c8ce789c
AO
1384@safety{@prelim{}@mtsafe{}@assafe{}@acsafe{}}
1385@c CPU_CLR ok
1386@c __CPU_CLR_S ok
1387@c __CPUELT dup ok
1388@c __CPUMASK dup ok
1389This macro removes @var{cpu} from the CPU set @var{set}.
1390
1391The @var{cpu} parameter must not have side effects since it is
1392evaluated more than once.
1393
1394This macro is a GNU extension and is defined in @file{sched.h}.
1395@end deftypefn
1396
1397@comment sched.h
1398@comment GNU
1399@deftypefn Macro int CPU_ISSET (int @var{cpu}, const cpu_set_t *@var{set})
c8ce789c
AO
1400@safety{@prelim{}@mtsafe{}@assafe{}@acsafe{}}
1401@c CPU_ISSET ok
1402@c __CPU_ISSET_S ok
1403@c __CPUELT dup ok
1404@c __CPUMASK dup ok
1405This macro returns a nonzero value (true) if @var{cpu} is a member
1406of the CPU set @var{set}, and zero (false) otherwise.
1407
1408The @var{cpu} parameter must not have side effects since it is
1409evaluated more than once.
1410
1411This macro is a GNU extension and is defined in @file{sched.h}.
1412@end deftypefn
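
The four macros can be combined as in the following sketch, which builds
a set from scratch and then inspects it. The helper name
@code{fill_even_cpus} is hypothetical. Note that the loop variable is
passed to the macros as a plain variable, never as an expression such as
@code{cpu++}, because the macros evaluate the argument more than once.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE   /* cpu_set_t and the CPU_* macros are GNU extensions.  */
#endif
#include <sched.h>

/* Hypothetical helper: make SET contain the even-numbered CPUs below
   LIMIT (and below CPU_SETSIZE).  Returns the number of members.  */
int
fill_even_cpus (cpu_set_t *set, int limit)
{
  int count = 0;

  CPU_ZERO (set);
  for (int cpu = 0; cpu < limit && cpu < CPU_SETSIZE; cpu += 2)
    CPU_SET (cpu, set);

  /* Walk the whole bitset and count the bits that are set.  */
  for (int cpu = 0; cpu < CPU_SETSIZE; cpu++)
    if (CPU_ISSET (cpu, set))
      count++;

  return count;
}
```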
1413
1414
1415CPU bitsets can be constructed from scratch or the currently installed
1416affinity mask can be retrieved from the system.
1417
1418@comment sched.h
1419@comment GNU
@deftypefun int sched_getaffinity (pid_t @var{pid}, size_t @var{cpusetsize}, cpu_set_t *@var{cpuset})
1421@safety{@prelim{}@mtsafe{}@assafe{}@acsafe{}}
1422@c Wrapped syscall to zero out past the kernel cpu set size; Linux
1423@c only.

This function stores the CPU affinity mask for the process or thread
with the ID @var{pid} in the @var{cpusetsize} bytes long bitmap
pointed to by @var{cpuset}. If successful, the function always
initializes all bits in the @code{cpu_set_t} object and returns zero.
1429
If @var{pid} does not correspond to a process or thread on the system
or the function fails for some other reason, it returns @code{-1}
and @code{errno} is set to represent the error condition.
1433
1434@table @code
1435@item ESRCH
1436No process or thread with the given ID found.
1437
1438@item EFAULT
The pointer @var{cpuset} does not point to a valid object.
1440@end table
1441
1442This function is a GNU extension and is declared in @file{sched.h}.
1443@end deftypefun
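
A typical use is to find out how many CPUs the current process may run
on. The sketch below assumes the Linux convention that a @var{pid} of
zero selects the caller, which the text above does not state; the helper
name @code{count_allowed_cpus} is hypothetical.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE   /* sched_getaffinity is a GNU extension.  */
#endif
#include <sched.h>

/* Hypothetical helper: return the number of CPUs in the caller's
   affinity mask, or -1 if sched_getaffinity fails (errno is set).
   Passing 0 as the pid is assumed to mean "the caller" (Linux).  */
int
count_allowed_cpus (void)
{
  cpu_set_t set;
  int count = 0;

  if (sched_getaffinity (0, sizeof set, &set) == -1)
    return -1;

  for (int cpu = 0; cpu < CPU_SETSIZE; cpu++)
    if (CPU_ISSET (cpu, &set))
      count++;

  return count;
}
```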
1444
1445Note that it is not portably possible to use this information to
1446retrieve the information for different POSIX threads. A separate
1447interface must be provided for that.
1448
1449@comment sched.h
1450@comment GNU
@deftypefun int sched_setaffinity (pid_t @var{pid}, size_t @var{cpusetsize}, const cpu_set_t *@var{cpuset})
1452@safety{@prelim{}@mtsafe{}@assafe{}@acsafe{}}
1453@c Wrapped syscall to detect attempts to set bits past the kernel cpu
1454@c set size; Linux only.

This function installs the @var{cpusetsize} bytes long affinity mask
pointed to by @var{cpuset} for the process or thread with the ID @var{pid}.
If successful the function returns zero and the scheduler will in the future
take the affinity information into account.
1460
1461If the function fails it will return @code{-1} and @code{errno} is set
1462to the error code:
1463
1464@table @code
1465@item ESRCH
1466No process or thread with the given ID found.
1467
1468@item EFAULT
The pointer @var{cpuset} does not point to a valid object.
1470
1471@item EINVAL
1472The bitset is not valid. This might mean that the affinity set might
1473not leave a processor for the process or thread to run on.
1474@end table
1475
1476This function is a GNU extension and is declared in @file{sched.h}.
1477@end deftypefun
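
To avoid the @code{EINVAL} case, a program can choose a CPU from the
mask it already has instead of guessing a CPU number. A minimal sketch
under that approach follows; the helper name
@code{pin_to_first_allowed_cpu} is hypothetical, and a @var{pid} of zero
is assumed to mean the caller, as on Linux.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE   /* The affinity interfaces are GNU extensions.  */
#endif
#include <sched.h>

/* Hypothetical helper: restrict the caller to the lowest-numbered CPU
   it is currently allowed to use.  Returns that CPU number, or -1 on
   failure.  */
int
pin_to_first_allowed_cpu (void)
{
  cpu_set_t allowed, wanted;

  if (sched_getaffinity (0, sizeof allowed, &allowed) == -1)
    return -1;

  for (int cpu = 0; cpu < CPU_SETSIZE; cpu++)
    if (CPU_ISSET (cpu, &allowed))
      {
        CPU_ZERO (&wanted);
        CPU_SET (cpu, &wanted);
        if (sched_setaffinity (0, sizeof wanted, &wanted) == -1)
          return -1;
        return cpu;
      }

  return -1;                    /* Empty mask; should not happen.  */
}
```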
1478
1479
1480@node Memory Resources
1481@section Querying memory available resources
1482
The amount of memory available in the system and the way it is organized
often determine the way programs can and have to work. For
functions like @code{mmap} it is necessary to know the size of
individual memory pages, and knowing how much memory is available enables
a program to select appropriate sizes for, say, caches. Before we get
into these details, here are a few words about memory subsystems in
traditional Unix systems.
1490
1491@menu
1492* Memory Subsystem:: Overview about traditional Unix memory handling.
1493* Query Memory Parameters:: How to get information about the memory
1494 subsystem?
1495@end menu
1496
1497@node Memory Subsystem
1498@subsection Overview about traditional Unix memory handling
1499
1500@cindex address space
1501@cindex physical memory
1502@cindex physical address
1503Unix systems normally provide processes virtual address spaces. This
1504means that the addresses of the memory regions do not have to correspond
1505directly to the addresses of the actual physical memory which stores the
1506data. An extra level of indirection is introduced which translates
1507virtual addresses into physical addresses. This is normally done by the
1508hardware of the processor.
1509
1510@cindex shared memory
Using a virtual address space has several advantages. The most important
1512is process isolation. The different processes running on the system
1513cannot interfere directly with each other. No process can write into
1514the address space of another process (except when shared memory is used
1515but then it is wanted and controlled).
1516
1517Another advantage of virtual memory is that the address space the
1518processes see can actually be larger than the physical memory available.
The physical memory can be extended by storage on an external medium
1520where the content of currently unused memory regions is stored. The
1521address translation can then intercept accesses to these memory regions
1522and make memory content available again by loading the data back into
1523memory. This concept makes it necessary that programs which have to use
1524lots of memory know the difference between available virtual address
1525space and available physical memory. If the working set of virtual
1526memory of all the processes is larger than the available physical memory
1527the system will slow down dramatically due to constant swapping of
1528memory content from the memory to the storage media and back. This is
1529called ``thrashing''.
1530@cindex thrashing
1531
@cindex memory page
@cindex page, memory
A final aspect of virtual memory which is important and follows from
what is said in the last paragraph is the granularity of the virtual
address space handling. The virtual memory system cannot move memory
content to external storage on a byte-by-byte basis. The administrative
overhead does not allow this (let alone the processor hardware).
Instead several thousand bytes are handled together and form a
@dfn{page}. The size of each page, in bytes, is always a power of two.
The smallest page size in use today is 4096, with 8192, 16384, and 65536
being other popular sizes.

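Because the page size is a power of two, an address can be split into
the start of its page and an offset within that page using simple bit
masks. The following is a minimal sketch of that arithmetic; the
function names @code{is_power_of_two}, @code{page_base} and
@code{page_offset} are made up for this illustration and are not part of
@theglibc{}.

```c
#include <stdint.h>

/* Return nonzero if SIZE is a power of two, as every page size is.  */
int
is_power_of_two (uintptr_t size)
{
  return size != 0 && (size & (size - 1)) == 0;
}

/* Return the address at which the page containing ADDR starts.
   PAGE_SIZE must be a power of two.  */
uintptr_t
page_base (uintptr_t addr, uintptr_t page_size)
{
  return addr & ~(page_size - 1);
}

/* Return the offset of ADDR within its page.
   PAGE_SIZE must be a power of two.  */
uintptr_t
page_offset (uintptr_t addr, uintptr_t page_size)
{
  return addr & (page_size - 1);
}
```

With a page size of 4096, the address @code{0x12345} lies in the page
starting at @code{0x12000}, at offset @code{0x345}.
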
@node Query Memory Parameters
@subsection How to get information about the memory subsystem?

The page size of the virtual memory the process sees is essential to
know in several situations. Some programming interfaces (e.g.,
@code{mmap}, @pxref{Memory-mapped I/O}) require the user to provide
information adjusted to the page size. In the case of @code{mmap} memory
is mapped in units of whole pages, so the file offset has to be a
multiple of the page size and the length is in effect rounded up to one.
Another place where knowledge of the page size is useful is in memory
allocation. If one allocates memory in larger chunks which are then
subdivided by the application code, it is useful to adjust the size of
the larger blocks to the page size. If the total memory requirement for
the block is close to (but not larger than) a multiple of the page size,
the kernel's memory handling can work more effectively since it only has
to allocate memory pages which are fully used. (To do this optimization
it is necessary to know a bit about the memory allocator, which requires
a bit of memory itself for each block; this overhead must not push the
total size over the page size multiple.)

The page size traditionally was a compile-time constant. But recent
processor developments have changed this. Processors now support
different page sizes, and the page size can possibly even vary among
different processes on the same system. Therefore the system should be
queried at run time about the current page size, and no assumptions
(except about it being a power of two) should be made.

@vindex _SC_PAGESIZE
The correct interface to query the page size is @code{sysconf}
(@pxref{Sysconf Definition}) with the parameter @code{_SC_PAGESIZE}.
There is a much older interface available, too.

@comment unistd.h
@comment BSD
@deftypefun int getpagesize (void)
@safety{@prelim{}@mtsafe{}@assafe{}@acsafe{}}
@c Obtained from the aux vec at program startup time. GNU/Linux/m68k is
@c the exception, with the possibility of a syscall.
The @code{getpagesize} function returns the page size of the process.
This value is fixed for the runtime of the process but can vary in
different runs of the application.

The function is declared in @file{unistd.h}.
@end deftypefun
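The following sketch shows one way to query the page size with
@code{sysconf} and to round a length up to a whole number of pages, for
example when sizing a mapping or a large block that will be subdivided
later. The helper names @code{query_page_size} and
@code{round_up_to_page} are hypothetical; the rounding relies only on
the page size being a power of two.

```c
#include <stddef.h>
#include <unistd.h>

/* Query the page size at run time.  Returns 0 if sysconf reports
   an error.  */
size_t
query_page_size (void)
{
  long int result = sysconf (_SC_PAGESIZE);
  return result > 0 ? (size_t) result : 0;
}

/* Round LENGTH up to the next multiple of PAGE_SIZE, which must be
   a power of two.  Assumes LENGTH + PAGE_SIZE does not overflow.  */
size_t
round_up_to_page (size_t length, size_t page_size)
{
  return (length + page_size - 1) & ~(page_size - 1);
}
```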

Widely available on @w{System V} derived systems is a method to get
information about the physical memory the system has. The call

@vindex _SC_PHYS_PAGES
@cindex sysconf
@smallexample
  sysconf (_SC_PHYS_PAGES)
@end smallexample

@noindent
returns the total number of pages of physical memory the system has.
Not all of this memory is necessarily available. The number of pages
currently available can be found using

@vindex _SC_AVPHYS_PAGES
@cindex sysconf
@smallexample
  sysconf (_SC_AVPHYS_PAGES)
@end smallexample

These two values help to optimize applications. The value returned for
@code{_SC_AVPHYS_PAGES} is the number of pages of memory the application
can use without hindering any other process (given that no other process
increases its memory usage). The value returned for
@code{_SC_PHYS_PAGES} is more or less a hard limit for the working set.
If all applications together constantly use more than that amount of
memory, the system is in trouble.

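Both calls report a number of pages, so a value in bytes is obtained by
multiplying by the page size. The following is a minimal sketch,
assuming a system where @code{sysconf} supports these parameters; the
function names are invented for this example. The product is computed
in @code{unsigned long long} so that it does not overflow a 32-bit
@code{long int}.

```c
#include <unistd.h>

/* Convert the page count that sysconf reports for NAME into bytes.
   Returns 0 if either value is unavailable.  */
static unsigned long long
pages_to_bytes (int name)
{
  long int pages = sysconf (name);
  long int page_size = sysconf (_SC_PAGESIZE);

  if (pages < 0 || page_size <= 0)
    return 0;
  return (unsigned long long) pages * (unsigned long long) page_size;
}

/* Total physical memory of the system, in bytes.  */
unsigned long long
total_physical_memory (void)
{
  return pages_to_bytes (_SC_PHYS_PAGES);
}

/* Physical memory currently available, in bytes.  */
unsigned long long
available_physical_memory (void)
{
  return pages_to_bytes (_SC_AVPHYS_PAGES);
}
```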
In addition to the @code{sysconf} calls already described,
@theglibc{} provides two functions to get this information. They are
declared in the file @file{sys/sysinfo.h}. Programmers should prefer to
use the @code{sysconf} method described above.

@comment sys/sysinfo.h
@comment GNU
@deftypefun {long int} get_phys_pages (void)
@safety{@prelim{}@mtsafe{}@asunsafe{@ascuheap{} @asulock{}}@acunsafe{@aculock{} @acsfd{} @acsmem{}}}
@c This fopens a /proc file and scans it for the requested information.
The @code{get_phys_pages} function returns the total number of pages of
physical memory the system has. To get the amount of memory in bytes,
this number has to be multiplied by the page size.

This function is a GNU extension.
@end deftypefun

@comment sys/sysinfo.h
@comment GNU
@deftypefun {long int} get_avphys_pages (void)
@safety{@prelim{}@mtsafe{}@asunsafe{@ascuheap{} @asulock{}}@acunsafe{@aculock{} @acsfd{} @acsmem{}}}
The @code{get_avphys_pages} function returns the number of available
pages of physical memory the system has. To get the amount of memory in
bytes, this number has to be multiplied by the page size.

This function is a GNU extension.
@end deftypefun
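Since both functions count pages of the same size, their ratio can be
computed without knowing the page size. The following sketch uses the
GNU functions for brevity (portable code should prefer @code{sysconf},
as noted above); @code{available_memory_percent} is a name chosen for
this example.

```c
#include <sys/sysinfo.h>

/* Return the share of physical memory that is currently available,
   as a percentage from 0 to 100, or -1 if it cannot be determined.  */
int
available_memory_percent (void)
{
  long int total = get_phys_pages ();
  long int available = get_avphys_pages ();

  if (total <= 0 || available < 0 || available > total)
    return -1;
  return (int) ((double) available * 100.0 / (double) total);
}
```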

@node Processor Resources
@section Learn about the processors available

The use of threads or processes with shared memory allows an application
to take advantage of all the processing power a system can provide. If
the task can be parallelized, the optimal way to write an application is
to have at any time as many processes running as there are processors.
To determine the number of processors available to the system one can
call

@vindex _SC_NPROCESSORS_CONF
@cindex sysconf
@smallexample
  sysconf (_SC_NPROCESSORS_CONF)
@end smallexample

@noindent
which returns the number of processors the operating system configured.
However, the operating system might be able to disable individual
processors, and so the call

@vindex _SC_NPROCESSORS_ONLN
@cindex sysconf
@smallexample
  sysconf (_SC_NPROCESSORS_ONLN)
@end smallexample

@noindent
returns the number of processors which are currently online (i.e.,
available).

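A program can use the value reported for @code{_SC_NPROCESSORS_ONLN} to
decide how many workers to start. The following is a minimal sketch,
assuming that one worker per online processor suits the task;
@code{default_worker_count} is a hypothetical helper, and it falls back
to a single worker if @code{sysconf} reports an error.

```c
#include <unistd.h>

/* Return the number of workers to start: one per online processor,
   and at least one even if the processor count is unavailable.  */
long int
default_worker_count (void)
{
  long int online = sysconf (_SC_NPROCESSORS_ONLN);
  return online > 0 ? online : 1;
}
```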
For these two pieces of information @theglibc{} also provides
functions to get the information directly. The functions are declared
in @file{sys/sysinfo.h}.

@comment sys/sysinfo.h
@comment GNU
@deftypefun int get_nprocs_conf (void)
@safety{@prelim{}@mtsafe{}@asunsafe{@ascuheap{} @asulock{}}@acunsafe{@aculock{} @acsfd{} @acsmem{}}}
@c This function reads from /sys using dir streams (single user, so
@c no @mtasurace issue), and on some arches, from /proc using streams.
The @code{get_nprocs_conf} function returns the number of processors the
operating system configured.

This function is a GNU extension.
@end deftypefun

@comment sys/sysinfo.h
@comment GNU
@deftypefun int get_nprocs (void)
@safety{@prelim{}@mtsafe{}@assafe{}@acsafe{@acsfd{}}}
@c This function reads from /proc using file descriptor I/O.
The @code{get_nprocs} function returns the number of available processors.

This function is a GNU extension.
@end deftypefun

@cindex load average
Before starting more threads, a program should check whether the
processors are already overused. Unix systems calculate something
called the @dfn{load average}. This is a number indicating how many
processes were running, averaged over different periods of time
(normally 1, 5, and 15 minutes).

@comment stdlib.h
@comment BSD
@deftypefun int getloadavg (double @var{loadavg}[], int @var{nelem})
@safety{@prelim{}@mtsafe{}@assafe{}@acsafe{@acsfd{}}}
@c Calls host_info on HURD; on Linux, opens /proc/loadavg, reads from
@c it, closes it, without cancellation point, and calls strtod_l with
@c the C locale to convert the strings to doubles.
This function gets the 1, 5 and 15 minute load averages of the
system. The values are placed in @var{loadavg}. @code{getloadavg} will
place at most @var{nelem} elements into the array but never more than
three elements. The return value is the number of elements written to
@var{loadavg}, or @code{-1} on error.

This function is declared in @file{stdlib.h}.
@end deftypefun
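
The following sketch combines @code{getloadavg} with a processor count
to decide whether starting more threads is reasonable. The policy
(comparing the 1-minute load average against the number of processors)
and the name @code{system_has_spare_capacity} are assumptions made for
this example, not a recommendation of @theglibc{}.

```c
#include <stdlib.h>

/* Return 1 if the 1-minute load average is below NPROCS, i.e. the
   processors do not appear to be fully used.  Return 0 otherwise,
   or if the load average cannot be obtained.  */
int
system_has_spare_capacity (int nprocs)
{
  double loadavg[3];

  /* Ask for all three averages; only the first one is used here.  */
  if (getloadavg (loadavg, 3) < 1)
    return 0;
  return loadavg[0] < (double) nprocs;
}
```

A caller might pass the result of @code{sysconf (_SC_NPROCESSORS_ONLN)}
as @var{nprocs}.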