1@node Resource Usage And Limitation, Non-Local Exits, Date and Time, Top
2@c %MENU% Functions for examining resource usage and getting and setting limits
3@chapter Resource Usage And Limitation
4This chapter describes functions for examining how much of various kinds of
5resources (CPU time, memory, etc.) a process has used and getting and setting
6limits on future usage.
7
@menu
* Resource Usage:: Measuring various resources used.
* Limits on Resources:: Specifying limits on resource usage.
* Priority:: Reading or setting process run priority.
* Memory Resources:: Querying available memory resources.
* Processor Resources:: Learning about the available processors.
@end menu
15
16
17@node Resource Usage
18@section Resource Usage
19
20@pindex sys/resource.h
21The function @code{getrusage} and the data type @code{struct rusage}
22are used to examine the resource usage of a process. They are declared
23in @file{sys/resource.h}.
24
25@comment sys/resource.h
26@comment BSD
27@deftypefun int getrusage (int @var{processes}, struct rusage *@var{rusage})
28@safety{@prelim{}@mtsafe{}@assafe{}@acsafe{}}
29@c On HURD, this calls task_info 3 times. On UNIX, it's a syscall.
30This function reports resource usage totals for processes specified by
31@var{processes}, storing the information in @code{*@var{rusage}}.
32
33In most systems, @var{processes} has only two valid values:
34
35@table @code
36@comment sys/resource.h
37@comment BSD
38@item RUSAGE_SELF
39Just the current process.
40
41@comment sys/resource.h
42@comment BSD
43@item RUSAGE_CHILDREN
44All child processes (direct and indirect) that have already terminated.
45@end table
46
The return value of @code{getrusage} is zero for success, and @code{-1}
for failure. On failure, @code{errno} is set; the following error
condition is defined for this function:
49
50@table @code
51@item EINVAL
52The argument @var{processes} is not valid.
53@end table
54@end deftypefun
55
56One way of getting resource usage for a particular child process is with
57the function @code{wait4}, which returns totals for a child when it
58terminates. @xref{BSD Wait Functions}.
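
The following is a minimal sketch of that approach, not a definitive
implementation. The function name @code{run_child_and_measure} is
hypothetical and chosen only for illustration; the child here does no
work and simply exits with the requested status.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/resource.h>
#include <sys/time.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Fork a child that exits immediately with EXIT_CODE, wait for it
   with wait4, and store the child's resource usage totals in *USAGE.
   Return the child's exit status, or -1 on failure.  */
int
run_child_and_measure (int exit_code, struct rusage *usage)
{
  pid_t pid;
  int status;

  assert (usage != NULL);
  pid = fork ();
  if (pid < 0)
    return -1;
  if (pid == 0)
    _exit (exit_code);          /* Child: terminate at once.  */

  /* Parent: collect the exit status and the usage totals together.  */
  if (wait4 (pid, &status, 0, usage) != pid)
    return -1;
  return WIFEXITED (status) ? WEXITSTATUS (status) : -1;
}
```

After a successful call, members such as @code{usage->ru_utime} describe
only that one child, unlike @code{RUSAGE_CHILDREN}, which accumulates
totals over all terminated children.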
59
60@comment sys/resource.h
61@comment BSD
62@deftp {Data Type} {struct rusage}
63This data type stores various resource usage statistics. It has the
64following members, and possibly others:
65
66@table @code
67@item struct timeval ru_utime
68Time spent executing user instructions.
69
70@item struct timeval ru_stime
71Time spent in operating system code on behalf of @var{processes}.
72
73@item long int ru_maxrss
74The maximum resident set size used, in kilobytes. That is, the maximum
75number of kilobytes of physical memory that @var{processes} used
76simultaneously.
77
78@item long int ru_ixrss
79An integral value expressed in kilobytes times ticks of execution, which
80indicates the amount of memory used by text that was shared with other
81processes.
82
83@item long int ru_idrss
84An integral value expressed the same way, which is the amount of
85unshared memory used for data.
86
87@item long int ru_isrss
88An integral value expressed the same way, which is the amount of
89unshared memory used for stack space.
90
91@item long int ru_minflt
92The number of page faults which were serviced without requiring any I/O.
93
94@item long int ru_majflt
95The number of page faults which were serviced by doing I/O.
96
97@item long int ru_nswap
98The number of times @var{processes} was swapped entirely out of main memory.
99
100@item long int ru_inblock
101The number of times the file system had to read from the disk on behalf
102of @var{processes}.
103
104@item long int ru_oublock
105The number of times the file system had to write to the disk on behalf
106of @var{processes}.
107
108@item long int ru_msgsnd
109Number of IPC messages sent.
110
111@item long int ru_msgrcv
112Number of IPC messages received.
113
114@item long int ru_nsignals
115Number of signals received.
116
117@item long int ru_nvcsw
118The number of times @var{processes} voluntarily invoked a context switch
119(usually to wait for some service).
120
121@item long int ru_nivcsw
122The number of times an involuntary context switch took place (because
123a time slice expired, or another process of higher priority was
124scheduled).
125@end table
126@end deftp
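
As an illustration of reading @code{struct rusage}, here is a short
sketch that reports the CPU time consumed so far by the calling process.
The helper names @code{timeval_to_seconds} and @code{cpu_seconds_used}
are assumptions made for this example and are not part of the library.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/resource.h>
#include <sys/time.h>

/* Convert a struct timeval to seconds as a double.  */
static double
timeval_to_seconds (struct timeval tv)
{
  return (double) tv.tv_sec + (double) tv.tv_usec / 1e6;
}

/* Store the user and system CPU time used so far by the calling
   process in *USER and *SYS, in seconds.  Return 0 on success and
   -1 on failure, like getrusage itself.  */
int
cpu_seconds_used (double *user, double *sys)
{
  struct rusage usage;

  assert (user != NULL && sys != NULL);
  if (getrusage (RUSAGE_SELF, &usage) != 0)
    return -1;
  *user = timeval_to_seconds (usage.ru_utime);
  *sys = timeval_to_seconds (usage.ru_stime);
  return 0;
}
```

Passing @code{RUSAGE_CHILDREN} instead of @code{RUSAGE_SELF} would give
the same figures for terminated children.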
127
128@code{vtimes} is a historical function that does some of what
129@code{getrusage} does. @code{getrusage} is a better choice.
130
@code{vtimes} and its @code{struct vtimes} data structure are declared in
@file{sys/vtimes.h}.
@pindex sys/vtimes.h

135@comment sys/vtimes.h
136@deftypefun int vtimes (struct vtimes *@var{current}, struct vtimes *@var{child})
137@safety{@prelim{}@mtsafe{}@assafe{}@acsafe{}}
138@c Calls getrusage twice.
139
140@code{vtimes} reports resource usage totals for a process.
141
142If @var{current} is non-null, @code{vtimes} stores resource usage totals for
143the invoking process alone in the structure to which it points. If
144@var{child} is non-null, @code{vtimes} stores resource usage totals for all
145past children (which have terminated) of the invoking process in the structure
146to which it points.
147
148@deftp {Data Type} {struct vtimes}
149This data type contains information about the resource usage of a process.
150Each member corresponds to a member of the @code{struct rusage} data type
151described above.
152
@table @code
@item vm_utime
User CPU time. Analogous to @code{ru_utime} in @code{struct rusage}.
@item vm_stime
System CPU time. Analogous to @code{ru_stime} in @code{struct rusage}.
@item vm_idsrss
Data and stack memory. The sum of the values that would be reported as
@code{ru_idrss} and @code{ru_isrss} in @code{struct rusage}.
@item vm_ixrss
Shared memory. Analogous to @code{ru_ixrss} in @code{struct rusage}.
@item vm_maxrss
Maximum resident set size. Analogous to @code{ru_maxrss} in
@code{struct rusage}.
@item vm_majflt
Major page faults. Analogous to @code{ru_majflt} in @code{struct rusage}.
@item vm_minflt
Minor page faults. Analogous to @code{ru_minflt} in @code{struct rusage}.
@item vm_nswap
Swap count. Analogous to @code{ru_nswap} in @code{struct rusage}.
@item vm_inblk
Disk reads. Analogous to @code{ru_inblock} in @code{struct rusage}.
@item vm_oublk
Disk writes. Analogous to @code{ru_oublock} in @code{struct rusage}.
@end table
177@end deftp
178
179
180The return value is zero if the function succeeds; @code{-1} otherwise.
181
182
183
184@end deftypefun
188
189@node Limits on Resources
190@section Limiting Resource Usage
191@cindex resource limits
192@cindex limits on resource usage
193@cindex usage limits
194
195You can specify limits for the resource usage of a process. When the
196process tries to exceed a limit, it may get a signal, or the system call
197by which it tried to do so may fail, depending on the resource. Each
198process initially inherits its limit values from its parent, but it can
199subsequently change them.
200
201There are two per-process limits associated with a resource:
202@cindex limit
203
204@table @dfn
205@item current limit
The current limit is the value the system will not allow usage to
exceed. It is also called the ``soft limit'' because the process being
limited can generally raise the current limit at will, up to the
maximum limit.
209@cindex current limit
210@cindex soft limit
211
212@item maximum limit
213The maximum limit is the maximum value to which a process is allowed to
214set its current limit. It is also called the ``hard limit'' because
215there is no way for a process to get around it. A process may lower
216its own maximum limit, but only the superuser may increase a maximum
217limit.
218@cindex maximum limit
219@cindex hard limit
220@end table
221
222@pindex sys/resource.h
The symbols for use with @code{getrlimit}, @code{setrlimit},
@code{getrlimit64}, and @code{setrlimit64} are defined in
@file{sys/resource.h}.
226
227@comment sys/resource.h
228@comment BSD
229@deftypefun int getrlimit (int @var{resource}, struct rlimit *@var{rlp})
230@safety{@prelim{}@mtsafe{}@assafe{}@acsafe{}}
231@c Direct syscall on most systems.
232Read the current and maximum limits for the resource @var{resource}
233and store them in @code{*@var{rlp}}.
234
235The return value is @code{0} on success and @code{-1} on failure. The
236only possible @code{errno} error condition is @code{EFAULT}.
237
238When the sources are compiled with @code{_FILE_OFFSET_BITS == 64} on a
23932-bit system this function is in fact @code{getrlimit64}. Thus, the
240LFS interface transparently replaces the old interface.
241@end deftypefun
242
243@comment sys/resource.h
244@comment Unix98
245@deftypefun int getrlimit64 (int @var{resource}, struct rlimit64 *@var{rlp})
246@safety{@prelim{}@mtsafe{}@assafe{}@acsafe{}}
247@c Direct syscall on most systems, wrapper to getrlimit otherwise.
248This function is similar to @code{getrlimit} but its second parameter is
249a pointer to a variable of type @code{struct rlimit64}, which allows it
250to read values which wouldn't fit in the member of a @code{struct
251rlimit}.
252
253If the sources are compiled with @code{_FILE_OFFSET_BITS == 64} on a
25432-bit machine, this function is available under the name
255@code{getrlimit} and so transparently replaces the old interface.
256@end deftypefun
257
258@comment sys/resource.h
259@comment BSD
260@deftypefun int setrlimit (int @var{resource}, const struct rlimit *@var{rlp})
261@safety{@prelim{}@mtsafe{}@assafe{}@acsafe{}}
262@c Direct syscall on most systems; lock-taking critical section on HURD.
Change the current and maximum limits of the process for the resource
@var{resource} to the values provided in @code{*@var{rlp}}.
265
266The return value is @code{0} on success and @code{-1} on failure. The
267following @code{errno} error condition is possible:
268
269@table @code
270@item EPERM
271@itemize @bullet
272@item
273The process tried to raise a current limit beyond the maximum limit.
274
275@item
276The process tried to raise a maximum limit, but is not superuser.
277@end itemize
278@end table
279
280When the sources are compiled with @code{_FILE_OFFSET_BITS == 64} on a
28132-bit system this function is in fact @code{setrlimit64}. Thus, the
282LFS interface transparently replaces the old interface.
283@end deftypefun
284
285@comment sys/resource.h
286@comment Unix98
287@deftypefun int setrlimit64 (int @var{resource}, const struct rlimit64 *@var{rlp})
288@safety{@prelim{}@mtsafe{}@assafe{}@acsafe{}}
289@c Wrapper for setrlimit or direct syscall.
290This function is similar to @code{setrlimit} but its second parameter is
291a pointer to a variable of type @code{struct rlimit64} which allows it
292to set values which wouldn't fit in the member of a @code{struct
293rlimit}.
294
295If the sources are compiled with @code{_FILE_OFFSET_BITS == 64} on a
29632-bit machine this function is available under the name
297@code{setrlimit} and so transparently replaces the old interface.
298@end deftypefun
299
300@comment sys/resource.h
301@comment BSD
302@deftp {Data Type} {struct rlimit}
303This structure is used with @code{getrlimit} to receive limit values,
304and with @code{setrlimit} to specify limit values for a particular process
305and resource. It has two fields:
306
307@table @code
@item rlim_t rlim_cur
The current limit.

@item rlim_t rlim_max
The maximum limit.
313@end table
314
315For @code{getrlimit}, the structure is an output; it receives the current
316values. For @code{setrlimit}, it specifies the new values.
317@end deftp
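
A common pattern is to read the existing limits with @code{getrlimit},
change only the current limit, and write the structure back with
@code{setrlimit}, so that the maximum limit is left untouched. The
sketch below shows this under that assumption; the function name
@code{set_soft_limit} is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/resource.h>

/* Set the current (soft) limit of RESOURCE to NEW_SOFT while leaving
   the maximum (hard) limit unchanged.  Return 0 on success and -1 on
   failure, with errno set by getrlimit or setrlimit.  */
int
set_soft_limit (int resource, rlim_t new_soft)
{
  struct rlimit rl;

  /* Any valid resource operand is less than RLIM_NLIMITS.  */
  assert (resource >= 0 && resource < RLIM_NLIMITS);

  if (getrlimit (resource, &rl) != 0)
    return -1;
  rl.rlim_cur = new_soft;
  return setrlimit (resource, &rl);
}
```

For example, @code{set_soft_limit (RLIMIT_CORE, 0)} suppresses core
files for the calling process but still allows it to raise the current
limit again later, up to the unchanged maximum.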
318
319For the LFS functions a similar type is defined in @file{sys/resource.h}.
320
321@comment sys/resource.h
322@comment Unix98
323@deftp {Data Type} {struct rlimit64}
324This structure is analogous to the @code{rlimit} structure above, but
325its components have wider ranges. It has two fields:
326
327@table @code
328@item rlim64_t rlim_cur
329This is analogous to @code{rlimit.rlim_cur}, but with a different type.
330
331@item rlim64_t rlim_max
332This is analogous to @code{rlimit.rlim_max}, but with a different type.
333@end table
334
335@end deftp
336
337Here is a list of resources for which you can specify a limit. Memory
338and file sizes are measured in bytes.
339
340@table @code
341@comment sys/resource.h
342@comment BSD
343@item RLIMIT_CPU
344@vindex RLIMIT_CPU
345The maximum amount of CPU time the process can use. If it runs for
346longer than this, it gets a signal: @code{SIGXCPU}. The value is
347measured in seconds. @xref{Operation Error Signals}.
348
349@comment sys/resource.h
350@comment BSD
351@item RLIMIT_FSIZE
352@vindex RLIMIT_FSIZE
353The maximum size of file the process can create. Trying to write a
354larger file causes a signal: @code{SIGXFSZ}. @xref{Operation Error
355Signals}.
356
357@comment sys/resource.h
358@comment BSD
359@item RLIMIT_DATA
360@vindex RLIMIT_DATA
361The maximum size of data memory for the process. If the process tries
362to allocate data memory beyond this amount, the allocation function
363fails.
364
365@comment sys/resource.h
366@comment BSD
367@item RLIMIT_STACK
368@vindex RLIMIT_STACK
369The maximum stack size for the process. If the process tries to extend
370its stack past this size, it gets a @code{SIGSEGV} signal.
371@xref{Program Error Signals}.
372
373@comment sys/resource.h
374@comment BSD
375@item RLIMIT_CORE
376@vindex RLIMIT_CORE
377The maximum size core file that this process can create. If the process
378terminates and would dump a core file larger than this, then no core
379file is created. So setting this limit to zero prevents core files from
380ever being created.
381
382@comment sys/resource.h
383@comment BSD
384@item RLIMIT_RSS
385@vindex RLIMIT_RSS
386The maximum amount of physical memory that this process should get.
387This parameter is a guide for the system's scheduler and memory
388allocator; the system may give the process more memory when there is a
389surplus.
390
391@comment sys/resource.h
392@comment BSD
393@item RLIMIT_MEMLOCK
394The maximum amount of memory that can be locked into physical memory (so
395it will never be paged out).
396
397@comment sys/resource.h
398@comment BSD
399@item RLIMIT_NPROC
400The maximum number of processes that can be created with the same user ID.
401If you have reached the limit for your user ID, @code{fork} will fail
402with @code{EAGAIN}. @xref{Creating a Process}.
403
404@comment sys/resource.h
405@comment BSD
406@item RLIMIT_NOFILE
407@vindex RLIMIT_NOFILE
408@itemx RLIMIT_OFILE
409@vindex RLIMIT_OFILE
410The maximum number of files that the process can open. If it tries to
411open more files than this, its open attempt fails with @code{errno}
412@code{EMFILE}. @xref{Error Codes}. Not all systems support this limit;
413GNU does, and 4.4 BSD does.
414
415@comment sys/resource.h
416@comment Unix98
417@item RLIMIT_AS
418@vindex RLIMIT_AS
419The maximum size of total memory that this process should get. If the
420process tries to allocate more memory beyond this amount with, for
421example, @code{brk}, @code{malloc}, @code{mmap} or @code{sbrk}, the
422allocation function fails.
423
424@comment sys/resource.h
425@comment BSD
426@item RLIM_NLIMITS
427@vindex RLIM_NLIMITS
428The number of different resource limits. Any valid @var{resource}
429operand must be less than @code{RLIM_NLIMITS}.
430@end table
431
432@comment sys/resource.h
433@comment BSD
@deftypevr Constant rlim_t RLIM_INFINITY
435This constant stands for a value of ``infinity'' when supplied as
436the limit value in @code{setrlimit}.
437@end deftypevr
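
Because @code{RLIM_INFINITY} is a special value rather than a
meaningful number, code that displays limits usually tests for it
explicitly. The following sketch illustrates one way to do so; the
function name @code{format_limit} and the word ``unlimited'' are
choices made for this example only.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/resource.h>

/* Write a printable form of LIMIT into BUF, which holds SIZE bytes:
   the word "unlimited" for RLIM_INFINITY, a decimal number otherwise.
   Return BUF.  */
char *
format_limit (rlim_t limit, char *buf, size_t size)
{
  assert (buf != NULL && size > 0);
  if (limit == RLIM_INFINITY)
    snprintf (buf, size, "unlimited");
  else
    snprintf (buf, size, "%llu", (unsigned long long) limit);
  return buf;
}
```

The same test applies to values read back by @code{getrlimit}, which
reports a resource with no limit as @code{RLIM_INFINITY}.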
438
439
440The following are historical functions to do some of what the functions
441above do. The functions above are better choices.
442
443@code{ulimit} and the command symbols are declared in @file{ulimit.h}.
444@pindex ulimit.h
446@comment ulimit.h
447@comment BSD
@deftypefun {long int} ulimit (int @var{cmd}, @dots{})
449@safety{@prelim{}@mtsafe{}@assafe{}@acsafe{}}
450@c Wrapper for getrlimit, setrlimit or
451@c sysconf(_SC_OPEN_MAX)->getdtablesize->getrlimit.
452
@code{ulimit} gets the current limit or sets the current and maximum
limit for a particular resource for the calling process according to the
command @var{cmd}.
456
457If you are getting a limit, the command argument is the only argument.
458If you are setting a limit, there is a second argument:
459@code{long int} @var{limit} which is the value to which you are setting
460the limit.
461
462The @var{cmd} values and the operations they specify are:
463@table @code
464
@item UL_GETFSIZE
Get the current limit on the size of a file, in units of 512 bytes.

@item UL_SETFSIZE
Set the current and maximum limit on the size of a file to @var{limit} *
512 bytes.
471
472@end table
473
474There are also some other @var{cmd} values that may do things on some
475systems, but they are not supported.
476
477Only the superuser may increase a maximum limit.
478
479When you successfully get a limit, the return value of @code{ulimit} is
480that limit, which is never negative. When you successfully set a limit,
481the return value is zero. When the function fails, the return value is
482@code{-1} and @code{errno} is set according to the reason:
483
484@table @code
485@item EPERM
486A process tried to increase a maximum limit, but is not superuser.
487@end table
488
489
490@end deftypefun
491
@code{vlimit} and its resource symbols are declared in @file{sys/vlimit.h}.
@pindex sys/vlimit.h

495@comment sys/vlimit.h
496@comment BSD
@deftypefun int vlimit (int @var{resource}, int @var{limit})
498@safety{@prelim{}@mtunsafe{@mtasurace{:setrlimit}}@asunsafe{}@acsafe{}}
499@c It calls getrlimit and modifies the rlim_cur field before calling
500@c setrlimit. There's a window for a concurrent call to setrlimit that
501@c modifies e.g. rlim_max, which will be lost if running as super-user.
502
503@code{vlimit} sets the current limit for a resource for a process.
504
505@var{resource} identifies the resource:
506
507@table @code
508@item LIM_CPU
509Maximum CPU time. Same as @code{RLIMIT_CPU} for @code{setrlimit}.
510@item LIM_FSIZE
511Maximum file size. Same as @code{RLIMIT_FSIZE} for @code{setrlimit}.
512@item LIM_DATA
513Maximum data memory. Same as @code{RLIMIT_DATA} for @code{setrlimit}.
514@item LIM_STACK
515Maximum stack size. Same as @code{RLIMIT_STACK} for @code{setrlimit}.
@item LIM_CORE
Maximum core file size. Same as @code{RLIMIT_CORE} for @code{setrlimit}.
518@item LIM_MAXRSS
519Maximum physical memory. Same as @code{RLIMIT_RSS} for @code{setrlimit}.
520@end table
521
522The return value is zero for success, and @code{-1} with @code{errno} set
523accordingly for failure:
524
525@table @code
526@item EPERM
527The process tried to set its current limit beyond its maximum limit.
528@end table
529
530@end deftypefun
531
@node Priority
@section Process CPU Priority And Scheduling
@cindex process priority
@cindex cpu priority
@cindex priority of a process

When multiple processes simultaneously require CPU time, the system's
scheduling policy and process CPU priorities determine which processes
get it. This section describes how that determination is made and
@glibcadj{} functions to control it.
542
543It is common to refer to CPU scheduling simply as scheduling and a
544process' CPU priority simply as the process' priority, with the CPU
545resource being implied. Bear in mind, though, that CPU time is not the
546only resource a process uses or that processes contend for. In some
547cases, it is not even particularly important. Giving a process a high
548``priority'' may have very little effect on how fast a process runs with
549respect to other processes. The priorities discussed in this section
550apply only to CPU time.
551
CPU scheduling is a complex issue and different systems do it in wildly
different ways. New ideas continually develop and find their way into
the intricacies of the various systems' scheduling algorithms. This
section discusses the general concepts, some specifics of systems
that commonly use @theglibc{}, and some standards.
557
For simplicity, we talk about CPU contention as if there is only one CPU
in the system. But all the same principles apply when a system has
multiple CPUs, and knowing that the number of processes that can run at
any one time is equal to the number of CPUs, you can easily extrapolate
the information.
563
The functions described in this section are all defined by the POSIX.1
and POSIX.1b standards (the @code{sched@dots{}} functions are POSIX.1b).
However, POSIX does not define any semantics for the values that these
functions get and set. In this chapter, the semantics are based on the
Linux kernel's implementation of the POSIX standard. As you will see,
the Linux implementation is quite the inverse of what the authors of the
POSIX syntax had in mind.
571
@menu
* Absolute Priority:: The first tier of priority. POSIX
* Realtime Scheduling:: Scheduling among the process nobility
* Basic Scheduling Functions:: Get/set scheduling policy, priority
* Traditional Scheduling:: Scheduling among the vulgar masses
* CPU Affinity:: Limiting execution to certain CPUs
@end menu
579
580
581
582@node Absolute Priority
583@subsection Absolute Priority
584@cindex absolute priority
585@cindex priority, absolute
586
587Every process has an absolute priority, and it is represented by a number.
588The higher the number, the higher the absolute priority.
589
590@cindex realtime CPU scheduling
On systems of the past, and most systems today, all processes have
absolute priority 0 and this section is irrelevant. In that case,
see @ref{Traditional Scheduling}. Absolute priorities were invented to
accommodate realtime systems, in which it is vital that certain processes
be able to respond to external events happening in real time, which
means they cannot wait around while some other process that @emph{wants
to}, but doesn't @emph{need to} run occupies the CPU.
598
599@cindex ready to run
600@cindex preemptive scheduling
When two processes are in contention to use the CPU at any instant, the
one with the higher absolute priority always gets it. This is true even if the
process with the lower priority is already using the CPU (i.e., the
scheduling is preemptive). Of course, we're only talking about
processes that are running or ``ready to run,'' which means they are
ready to execute instructions right now. When a process blocks to wait
for something like I/O, its absolute priority is irrelevant.
608
609@cindex runnable process
@strong{NB:} The term ``runnable'' is a synonym for ``ready to run.''
611
When two processes are running or ready to run and both have the same
absolute priority, it's more interesting. In that case, who gets the
CPU is determined by the scheduling policy. If the processes have
absolute priority 0, the traditional scheduling policy described in
@ref{Traditional Scheduling} applies. Otherwise, the policies described
in @ref{Realtime Scheduling} apply.
618
619You normally give an absolute priority above 0 only to a process that
620can be trusted not to hog the CPU. Such processes are designed to block
621(or terminate) after relatively short CPU runs.
622
623A process begins life with the same absolute priority as its parent
624process. Functions described in @ref{Basic Scheduling Functions} can
625change it.
626
627Only a privileged process can change a process' absolute priority to
628something other than @code{0}. Only a privileged process or the
629target process' owner can change its absolute priority at all.
630
POSIX requires absolute priority values used with the realtime
scheduling policies to be consecutive with a range of at least 32. On
Linux, they are 1 through 99. The functions
@code{sched_get_priority_max} and @code{sched_get_priority_min} portably
tell you what the range is on a particular system.
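
A program that should not hard-code the Linux range can query it at run
time. The following sketch does so for a given policy; the function
name @code{priority_range} is hypothetical and used only for
illustration.

```c
#include <assert.h>
#include <sched.h>
#include <stddef.h>

/* Store the lowest and highest absolute priority usable with POLICY
   in *LOWEST and *HIGHEST.  Return 0 on success, or -1 if the system
   rejects POLICY.  */
int
priority_range (int policy, int *lowest, int *highest)
{
  int lo, hi;

  assert (lowest != NULL && highest != NULL);
  lo = sched_get_priority_min (policy);
  hi = sched_get_priority_max (policy);
  if (lo == -1 || hi == -1)
    return -1;
  *lowest = lo;
  *highest = hi;
  return 0;
}
```

Calling it with @code{SCHED_FIFO} or @code{SCHED_RR} yields the range
of realtime priorities on the running system.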
636
637
638@subsubsection Using Absolute Priority
639
One thing you must keep in mind when designing realtime applications is
that having higher absolute priority than any other process doesn't
guarantee the process can run continuously. Two things that can wreck a
good CPU run are interrupts and page faults.
644
645Interrupt handlers live in that limbo between processes. The CPU is
646executing instructions, but they aren't part of any process. An
647interrupt will stop even the highest priority process. So you must
648allow for slight delays and make sure that no device in the system has
649an interrupt handler that could cause too long a delay between
650instructions for your process.
651
652Similarly, a page fault causes what looks like a straightforward
653sequence of instructions to take a long time. The fact that other
654processes get to run while the page faults in is of no consequence,
655because as soon as the I/O is complete, the high priority process will
656kick them out and run again, but the wait for the I/O itself could be a
657problem. To neutralize this threat, use @code{mlock} or
658@code{mlockall}.
659
660There are a few ramifications of the absoluteness of this priority on a
661single-CPU system that you need to keep in mind when you choose to set a
662priority and also when you're working on a program that runs with high
663absolute priority. Consider a process that has higher absolute priority
664than any other process in the system and due to a bug in its program, it
665gets into an infinite loop. It will never cede the CPU. You can't run
666a command to kill it because your command would need to get the CPU in
667order to run. The errant program is in complete control. It controls
668the vertical, it controls the horizontal.
669
670There are two ways to avoid this: 1) keep a shell running somewhere with
671a higher absolute priority. 2) keep a controlling terminal attached to
672the high priority process group. All the priority in the world won't
673stop an interrupt handler from running and delivering a signal to the
674process if you hit Control-C.
675
Some systems use absolute priority as a means of allocating a fixed
percentage of CPU time to a process. To do this, a super high priority
privileged process constantly monitors the process' CPU usage and raises
its absolute priority when the process isn't getting its entitled share
and lowers it when the process is exceeding it.
681
@strong{NB:} The absolute priority is sometimes called the ``static
priority.'' We don't use that term in this manual because it misses the
most important feature of the absolute priority: its absoluteness.
685
686
@node Realtime Scheduling
@subsection Realtime Scheduling
@cindex realtime scheduling
690
691Whenever two processes with the same absolute priority are ready to run,
692the kernel has a decision to make, because only one can run at a time.
693If the processes have absolute priority 0, the kernel makes this decision
694as described in @ref{Traditional Scheduling}. Otherwise, the decision
695is as described in this section.
696
697If two processes are ready to run but have different absolute priorities,
698the decision is much simpler, and is described in @ref{Absolute
699Priority}.
700
Each process has a scheduling policy. For processes with absolute
priority other than zero, there are two available:
703
704@enumerate
705@item
706First Come First Served
707@item
708Round Robin
709@end enumerate
710
711The most sensible case is where all the processes with a certain
712absolute priority have the same scheduling policy. We'll discuss that
713first.
714
715In Round Robin, processes share the CPU, each one running for a small
716quantum of time (``time slice'') and then yielding to another in a
717circular fashion. Of course, only processes that are ready to run and
718have the same absolute priority are in this circle.
719
720In First Come First Served, the process that has been waiting the
721longest to run gets the CPU, and it keeps it until it voluntarily
722relinquishes the CPU, runs out of things to do (blocks), or gets
723preempted by a higher priority process.
724
725First Come First Served, along with maximal absolute priority and
726careful control of interrupts and page faults, is the one to use when a
727process absolutely, positively has to run at full CPU speed or not at
728all.
729
730Judicious use of @code{sched_yield} function invocations by processes
731with First Come First Served scheduling policy forms a good compromise
732between Round Robin and First Come First Served.
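
As a rough sketch of that compromise, a long computation can be split
into chunks with a call to @code{sched_yield} between them. The
function name @code{sum_with_yields} and the summing workload are
invented for this example; a real program would substitute its own unit
of work.

```c
#include <assert.h>
#include <sched.h>
#include <stddef.h>

/* Sum the COUNT integers in VALUES, calling sched_yield after every
   CHUNK elements so that other ready-to-run processes of the same
   absolute priority get a turn.  CHUNK must be nonzero.  */
long
sum_with_yields (const int *values, size_t count, size_t chunk)
{
  long total = 0;
  size_t i;

  assert (values != NULL || count == 0);
  assert (chunk > 0);
  for (i = 0; i < count; i++)
    {
      total += values[i];
      /* Voluntarily give up the CPU at each chunk boundary.  */
      if ((i + 1) % chunk == 0)
        sched_yield ();
    }
  return total;
}
```

The chunk size determines how long the process holds the CPU between
yields, which is the role the time slice plays under Round Robin.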
733
734To understand how scheduling works when processes of different scheduling
735policies occupy the same absolute priority, you have to know the nitty
736gritty details of how processes enter and exit the ready to run list:
737
738In both cases, the ready to run list is organized as a true queue, where
739a process gets pushed onto the tail when it becomes ready to run and is
740popped off the head when the scheduler decides to run it. Note that
741ready to run and running are two mutually exclusive states. When the
742scheduler runs a process, that process is no longer ready to run and no
743longer in the ready to run list. When the process stops running, it
744may go back to being ready to run again.
745
The only difference between a process that is assigned the Round Robin
scheduling policy and a process that is assigned First Come First Served
is that in the former case, the process is automatically booted off the
CPU after a certain amount of time. When that happens, the process goes
back to being ready to run, which means it enters the queue at the tail.
The time quantum we're talking about is small. Really small. This is
not your father's timesharing. For example, with the Linux kernel, the
round robin time slice is a thousand times shorter than its typical
time slice for traditional scheduling.
755
756A process begins life with the same scheduling policy as its parent process.
757Functions described in @ref{Basic Scheduling Functions} can change it.
758
759Only a privileged process can set the scheduling policy of a process
760that has absolute priority higher than 0.
761
762@node Basic Scheduling Functions
763@subsection Basic Scheduling Functions
764
This section describes functions in @theglibc{} for setting the
absolute priority and scheduling policy of a process.
767
@strong{Portability Note:} On systems that have the functions in this
section, the macro @code{_POSIX_PRIORITY_SCHEDULING} is defined in
@file{unistd.h}.
771
772For the case that the scheduling policy is traditional scheduling, more
773functions to fine tune the scheduling are in @ref{Traditional Scheduling}.
774
Don't try to make too much out of the naming and structure of these
functions. They don't match the concepts described in this manual
because the functions are as defined by POSIX.1b, but the implementation
on systems that use @theglibc{} is the inverse of what the POSIX
structure contemplates. The POSIX scheme assumes that the primary
scheduling parameter is the scheduling policy and that the priority
value, if any, is a parameter of the scheduling policy. In the
implementation, though, the priority value is king and the scheduling
policy, if anything, only fine tunes the effect of that priority.
784
The symbols in this section are declared in @file{sched.h}.
786
787@comment sched.h
788@comment POSIX
789@deftp {Data Type} {struct sched_param}
790This structure describes an absolute priority.
791@table @code
792@item int sched_priority
793absolute priority value
794@end table
795@end deftp
796
797@comment sched.h
798@comment POSIX
799@deftypefun int sched_setscheduler (pid_t @var{pid}, int @var{policy}, const struct sched_param *@var{param})
800@safety{@prelim{}@mtsafe{}@assafe{}@acsafe{}}
801@c Direct syscall, Linux only.
802
803This function sets both the absolute priority and the scheduling policy
804for a process.
805
806It assigns the absolute priority value given by @var{param} and the
807scheduling policy @var{policy} to the process with Process ID @var{pid},
808or the calling process if @var{pid} is zero. If @var{policy} is
negative, @code{sched_setscheduler} keeps the existing scheduling policy.
810
811The following macros represent the valid values for @var{policy}:
812
813@table @code
814@item SCHED_OTHER
815Traditional Scheduling
816@item SCHED_FIFO
First In First Out
818@item SCHED_RR
819Round Robin
820@end table
821
822@c The Linux kernel code (in sched.c) actually reschedules the process,
823@c but it puts it at the head of the run queue, so I'm not sure just what
824@c the effect is, but it must be subtle.
825
On success, the return value is @code{0}. Otherwise, it is @code{-1}
and @code{errno} is set accordingly. The @code{errno} values specific
to this function are:
829
830@table @code
831@item EPERM
832@itemize @bullet
833@item
The calling process does not have @code{CAP_SYS_NICE} permission and
@var{policy} is not @code{SCHED_OTHER} (or it's negative and the
existing policy is not @code{SCHED_OTHER}).
837
838@item
The calling process does not have @code{CAP_SYS_NICE} permission and its
owner is not the target process' owner. I.e., the effective uid of the
calling process is neither the effective nor the real uid of process
@var{pid}.
843@c We need a cross reference to the capabilities section, when written.
844@end itemize
845
846@item ESRCH
847There is no process with pid @var{pid} and @var{pid} is not zero.
848
849@item EINVAL
850@itemize @bullet
851@item
852@var{policy} does not identify an existing scheduling policy.
853
854@item
The absolute priority value identified by @code{*@var{param}} is outside the
valid range for the scheduling policy @var{policy} (or the existing
scheduling policy if @var{policy} is negative) or @var{param} is
null. @code{sched_get_priority_max} and @code{sched_get_priority_min}
tell you what the valid range is.
860
861@item
862@var{pid} is negative.
863@end itemize
864@end table
865
866@end deftypefun
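As a minimal sketch of the call described above, the following helper
asks for the First In First Out policy at the lowest realtime priority.
The name @code{request_fifo} is hypothetical and not part of @theglibc{};
the sketch assumes a system where an unprivileged caller gets @code{EPERM}.

```c
#include <errno.h>
#include <sched.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: ask for SCHED_FIFO at the lowest realtime
   priority for the calling process (pid 0).
   Returns 0 on success, or -1 with errno set.  */
int
request_fifo (void)
{
  struct sched_param param = { 0 };
  int min = sched_get_priority_min (SCHED_FIFO);

  if (min == -1)
    return -1;
  param.sched_priority = min;

  if (sched_setscheduler (0, SCHED_FIFO, &param) == -1)
    {
      /* Save errno because fprintf is allowed to change it.  */
      int saved_errno = errno;
      /* EPERM is the usual result for an unprivileged caller.  */
      fprintf (stderr, "sched_setscheduler: %s\n", strerror (saved_errno));
      errno = saved_errno;
      return -1;
    }
  return 0;
}
```

A caller would typically treat a @code{-1} return with @code{errno} equal
to @code{EPERM} as a signal to continue under traditional scheduling.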
867
868
869@comment sched.h
870@comment POSIX
871@deftypefun int sched_getscheduler (pid_t @var{pid})
872@safety{@prelim{}@mtsafe{}@assafe{}@acsafe{}}
873@c Direct syscall, Linux only.
874
875This function returns the scheduling policy assigned to the process with
876Process ID (pid) @var{pid}, or the calling process if @var{pid} is zero.
877
878The return value is the scheduling policy. See
879@code{sched_setscheduler} for the possible values.
880
881If the function fails, the return value is instead @code{-1} and
882@code{errno} is set accordingly.
883
884The @code{errno} values specific to this function are:
885
886@table @code
887
888@item ESRCH
889There is no process with pid @var{pid} and it is not zero.
890
891@item EINVAL
892@var{pid} is negative.
893
894@end table
895
896Note that this function is not an exact mate to @code{sched_setscheduler}
897because while that function sets the scheduling policy and the absolute
898priority, this function gets only the scheduling policy. To get the
899absolute priority, use @code{sched_getparam}.
900
901@end deftypefun
902
903
904@comment sched.h
905@comment POSIX
906@deftypefun int sched_setparam (pid_t @var{pid}, const struct sched_param *@var{param})
907@safety{@prelim{}@mtsafe{}@assafe{}@acsafe{}}
908@c Direct syscall, Linux only.
909
910This function sets a process' absolute priority.
911
912It is functionally identical to @code{sched_setscheduler} with
913@var{policy} = @code{-1}.
914
915@c in fact, that's how it's implemented in Linux.
916
917@end deftypefun
918
919@comment sched.h
920@comment POSIX
@deftypefun int sched_getparam (pid_t @var{pid}, struct sched_param *@var{param})
922@safety{@prelim{}@mtsafe{}@assafe{}@acsafe{}}
923@c Direct syscall, Linux only.
924
925This function returns a process' absolute priority.
926
927@var{pid} is the Process ID (pid) of the process whose absolute priority
928you want to know.
929
930@var{param} is a pointer to a structure in which the function stores the
931absolute priority of the process.
932
On success, the return value is @code{0}. Otherwise, it is @code{-1}
and @code{errno} is set accordingly. The @code{errno} values specific
to this function are:
936
937@table @code
938
939@item ESRCH
940There is no process with pid @var{pid} and it is not zero.
941
942@item EINVAL
943@var{pid} is negative.
944
945@end table
946
947@end deftypefun
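Because @code{sched_getscheduler} reports only the policy, reading the
full scheduling state takes two calls. The following is a sketch under
that assumption; @code{get_scheduling} is a hypothetical name, not a
function of @theglibc{}.

```c
#include <sched.h>
#include <sys/types.h>

/* Hypothetical helper: fetch both the scheduling policy and the
   absolute priority of process PID (0 means the calling process).
   Returns 0 on success, or -1 with errno set by the failing call.  */
int
get_scheduling (pid_t pid, int *policy, int *priority)
{
  struct sched_param param;
  int pol = sched_getscheduler (pid);

  if (pol == -1)
    return -1;
  if (sched_getparam (pid, &param) == -1)
    return -1;

  *policy = pol;
  *priority = param.sched_priority;
  return 0;
}
```

Note that the two calls are not atomic: another process could change the
policy between them, so the pair is only a snapshot.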
948
949
950@comment sched.h
951@comment POSIX
@deftypefun int sched_get_priority_min (int @var{policy})
953@safety{@prelim{}@mtsafe{}@assafe{}@acsafe{}}
954@c Direct syscall, Linux only.
955
956This function returns the lowest absolute priority value that is
957allowable for a process with scheduling policy @var{policy}.
958
On Linux, it is 0 for @code{SCHED_OTHER} and 1 for everything else.

On success, the return value is the minimum priority value. Otherwise, it
is @code{-1} and @code{errno} is set accordingly. The @code{errno} values
specific to this function are:
964
965@table @code
966@item EINVAL
967@var{policy} does not identify an existing scheduling policy.
968@end table
969
970@end deftypefun
971
972@comment sched.h
973@comment POSIX
@deftypefun int sched_get_priority_max (int @var{policy})
975@safety{@prelim{}@mtsafe{}@assafe{}@acsafe{}}
976@c Direct syscall, Linux only.
977
This function returns the highest absolute priority value that is
allowable for a process with scheduling policy @var{policy}.

On Linux, it is 0 for @code{SCHED_OTHER} and 99 for everything else.

On success, the return value is the maximum priority value. Otherwise, it
is @code{-1} and @code{errno} is set accordingly. The @code{errno} values
specific to this function are:
986
987@table @code
988@item EINVAL
989@var{policy} does not identify an existing scheduling policy.
990@end table
991
992@end deftypefun
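The two range functions are usually called together before choosing a
priority. The following sketch wraps them; @code{priority_range} is a
hypothetical helper name, and clearing @code{errno} first is a defensive
assumption for systems where @code{-1} could be a valid priority.

```c
#include <errno.h>
#include <sched.h>

/* Hypothetical helper: store the valid absolute priority range for
   POLICY in *MIN and *MAX.  Returns 0 on success, -1 on failure.  */
int
priority_range (int policy, int *min, int *max)
{
  int lo, hi;

  /* Clear errno so that a legitimate -1 is not mistaken for failure.  */
  errno = 0;
  lo = sched_get_priority_min (policy);
  if (lo == -1 && errno != 0)
    return -1;

  errno = 0;
  hi = sched_get_priority_max (policy);
  if (hi == -1 && errno != 0)
    return -1;

  *min = lo;
  *max = hi;
  return 0;
}
```

A program can then clamp a requested priority into the range from
@code{*min} to @code{*max} before calling @code{sched_setscheduler}.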
993
994@comment sched.h
995@comment POSIX
996@deftypefun int sched_rr_get_interval (pid_t @var{pid}, struct timespec *@var{interval})
997@safety{@prelim{}@mtsafe{}@assafe{}@acsafe{}}
998@c Direct syscall, Linux only.
This function returns the length of the quantum (time slice) used with
1001the Round Robin scheduling policy, if it is used, for the process with
1002Process ID @var{pid}.
1003
It stores the length of time in @code{*@var{interval}}.
1005@c We need a cross-reference to where timespec is explained. But that
1006@c section doesn't exist yet, and the time chapter needs to be slightly
1007@c reorganized so there is a place to put it (which will be right next
1008@c to timeval, which is presently misplaced). 2000.05.07.
1009
1010With a Linux kernel, the round robin time slice is always 150
1011microseconds, and @var{pid} need not even be a real pid.
1012
1013The return value is @code{0} on success and in the pathological case
1014that it fails, the return value is @code{-1} and @code{errno} is set
1015accordingly. There is nothing specific that can go wrong with this
1016function, so there are no specific @code{errno} values.
1017
1018@end deftypefun
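The interval comes back as a @code{struct timespec}, so a caller usually
converts it to a single unit. A minimal sketch, in which
@code{rr_quantum_ns} is a hypothetical name and nanoseconds are an
arbitrary choice of unit:

```c
#include <sched.h>
#include <sys/types.h>
#include <time.h>

/* Hypothetical helper: store the Round Robin quantum of process PID
   (0 means the calling process) in *NS, in nanoseconds.
   Returns 0 on success, or -1 with errno set.  */
int
rr_quantum_ns (pid_t pid, long long *ns)
{
  struct timespec interval;

  if (sched_rr_get_interval (pid, &interval) == -1)
    return -1;

  *ns = (long long) interval.tv_sec * 1000000000LL + interval.tv_nsec;
  return 0;
}
```

The value is only meaningful for a process that actually uses the Round
Robin policy; what is reported for other policies depends on the kernel.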
1019
1020@comment sched.h
1021@comment POSIX
@deftypefun int sched_yield (void)
1023@safety{@prelim{}@mtsafe{}@assafe{}@acsafe{}}
1024@c Direct syscall on Linux; alias to swtch on HURD.
1025
1026This function voluntarily gives up the process' claim on the CPU.
1027
1028Technically, @code{sched_yield} causes the calling process to be made
1029immediately ready to run (as opposed to running, which is what it was
1030before). This means that if it has absolute priority higher than 0, it
1031gets pushed onto the tail of the queue of processes that share its
1032absolute priority and are ready to run, and it will run again when its
1033turn next arrives. If its absolute priority is 0, it is more
1034complicated, but still has the effect of yielding the CPU to other
1035processes.
1036
1037If there are no other processes that share the calling process' absolute
1038priority, this function doesn't have any effect.
1039
1040To the extent that the containing program is oblivious to what other
1041processes in the system are doing and how fast it executes, this
1042function appears as a no-op.
1043
1044The return value is @code{0} on success and in the pathological case
1045that it fails, the return value is @code{-1} and @code{errno} is set
1046accordingly. There is nothing specific that can go wrong with this
1047function, so there are no specific @code{errno} values.
1048
1049@end deftypefun
1050
1051@node Traditional Scheduling
1052@subsection Traditional Scheduling
1053@cindex scheduling, traditional
1054
1055This section is about the scheduling among processes whose absolute
1056priority is 0. When the system hands out the scraps of CPU time that
are left over after the processes with higher absolute priority have
1058taken all they want, the scheduling described herein determines who
1059among the great unwashed processes gets them.
1060
1061@menu
1062* Traditional Scheduling Intro::
1063* Traditional Scheduling Functions::
1064@end menu
1065
1066@node Traditional Scheduling Intro
1067@subsubsection Introduction To Traditional Scheduling
1068
Long before there was absolute priority (@pxref{Absolute Priority}),
Unix systems were scheduling the CPU using this system. When POSIX came
in like the Romans and imposed absolute priorities to accommodate the
1072needs of realtime processing, it left the indigenous Absolute Priority
1073Zero processes to govern themselves by their own familiar scheduling
1074policy.
1075
1076Indeed, absolute priorities higher than zero are not available on many
1077systems today and are not typically used when they are, being intended
1078mainly for computers that do realtime processing. So this section
1079describes the only scheduling many programmers need to be concerned
1080about.
1081
1082But just to be clear about the scope of this scheduling: Any time a
process with an absolute priority of 0 and a process with an absolute
1084priority higher than 0 are ready to run at the same time, the one with
1085absolute priority 0 does not run. If it's already running when the
1086higher priority ready-to-run process comes into existence, it stops
1087immediately.
1088
In addition to its absolute priority of zero, every process has another
priority, which we will refer to as ``dynamic priority'' because it changes
over time. The dynamic priority is meaningless for processes with
1092an absolute priority higher than zero.
1093
1094The dynamic priority sometimes determines who gets the next turn on the
1095CPU. Sometimes it determines how long turns last. Sometimes it
1096determines whether a process can kick another off the CPU.
1097
In Linux, the value is a combination of these things, but mostly it
just determines the length of the time slice. The higher a process'
1100dynamic priority, the longer a shot it gets on the CPU when it gets one.
1101If it doesn't use up its time slice before giving up the CPU to do
1102something like wait for I/O, it is favored for getting the CPU back when
1103it's ready for it, to finish out its time slice. Other than that,
1104selection of processes for new time slices is basically round robin.
1105But the scheduler does throw a bone to the low priority processes: A
1106process' dynamic priority rises every time it is snubbed in the
1107scheduling process. In Linux, even the fat kid gets to play.
1108
1109The fluctuation of a process' dynamic priority is regulated by another
1110value: The ``nice'' value. The nice value is an integer, usually in the
1111range -20 to 20, and represents an upper limit on a process' dynamic
1112priority. The higher the nice number, the lower that limit.
1113
1114On a typical Linux system, for example, a process with a nice value of
111520 can get only 10 milliseconds on the CPU at a time, whereas a process
1116with a nice value of -20 can achieve a high enough priority to get 400
1117milliseconds.
1118
1119The idea of the nice value is deferential courtesy. In the beginning,
1120in the Unix garden of Eden, all processes shared equally in the bounty
1121of the computer system. But not all processes really need the same
1122share of CPU time, so the nice value gave a courteous process the
1123ability to refuse its equal share of CPU time that others might prosper.
1124Hence, the higher a process' nice value, the nicer the process is.
1125(Then a snake came along and offered some process a negative nice value
1126and the system became the crass resource allocation system we know
1127today).
1128
1129Dynamic priorities tend upward and downward with an objective of
1130smoothing out allocation of CPU time and giving quick response time to
1131infrequent requests. But they never exceed their nice limits, so on a
1132heavily loaded CPU, the nice value effectively determines how fast a
1133process runs.
1134
1135In keeping with the socialistic heritage of Unix process priority, a
1136process begins life with the same nice value as its parent process and
1137can raise it at will. A process can also raise the nice value of any
1138other process owned by the same user (or effective user). But only a
1139privileged process can lower its nice value. A privileged process can
1140also raise or lower another process' nice value.
1141
@glibcadj{} functions for getting and setting nice values are described in
@ref{Traditional Scheduling Functions}.
1144
1145@node Traditional Scheduling Functions
1146@subsubsection Functions For Traditional Scheduling
1147
@pindex sys/resource.h
1149This section describes how you can read and set the nice value of a
1150process. All these symbols are declared in @file{sys/resource.h}.
1151
The function and macro names are defined by POSIX, and refer to
``priority,'' but the functions actually have to do with nice values, as
the terms are used both in the manual and POSIX.
1155
1156The range of valid nice values depends on the kernel, but typically it
1157runs from @code{-20} to @code{20}. A lower nice value corresponds to
1158higher priority for the process. These constants describe the range of
1159priority values:
1160
@vtable @code
1162@comment sys/resource.h
1163@comment BSD
1164@item PRIO_MIN
The lowest valid nice value.
1166
1167@comment sys/resource.h
1168@comment BSD
1169@item PRIO_MAX
The highest valid nice value.
@end vtable
1172
1173@comment sys/resource.h
@comment BSD,POSIX
@deftypefun int getpriority (int @var{class}, int @var{id})
1176@safety{@prelim{}@mtsafe{}@assafe{}@acsafe{}}
1177@c Direct syscall on UNIX. On HURD, calls _hurd_priority_which_map.
Return the nice value of a set of processes; @var{class} and @var{id}
specify which ones (see below). If the processes specified do not all
have the same nice value, this returns the lowest value that any of them
has.

On success, the return value is the nice value. Otherwise, it is @code{-1}
and @code{errno} is set accordingly. The @code{errno} values specific
to this function are:
1186
1187@table @code
1188@item ESRCH
1189The combination of @var{class} and @var{id} does not match any existing
1190process.
1191
1192@item EINVAL
1193The value of @var{class} is not valid.
1194@end table
1195
1196If the return value is @code{-1}, it could indicate failure, or it could
1197be the nice value. The only way to make certain is to set @code{errno =
11980} before calling @code{getpriority}, then use @code{errno != 0}
1199afterward as the criterion for failure.
1200@end deftypefun
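The @code{errno} test described above can be sketched as follows. The
wrapper name @code{get_nice} is hypothetical; it separates the error
indication from the nice value so that a nice value of @code{-1} is not
mistaken for failure.

```c
#include <errno.h>
#include <sys/resource.h>

/* Hypothetical helper: store the nice value selected by PRIO_CLASS and
   ID in *NICEVAL.  Returns 0 on success, or -1 with errno set.  */
int
get_nice (int prio_class, int id, int *niceval)
{
  int value;

  /* -1 is a valid nice value, so only errno can signal failure.  */
  errno = 0;
  value = getpriority (prio_class, id);
  if (value == -1 && errno != 0)
    return -1;

  *niceval = value;
  return 0;
}
```

For example, @code{get_nice (PRIO_PROCESS, 0, &n)} reads the nice value
of the calling process.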
1201
1202@comment sys/resource.h
1203@comment BSD,POSIX
1204@deftypefun int setpriority (int @var{class}, int @var{id}, int @var{niceval})
1205@safety{@prelim{}@mtsafe{}@assafe{}@acsafe{}}
1206@c Direct syscall on UNIX. On HURD, calls _hurd_priority_which_map.
Set the nice value of a set of processes to @var{niceval}; @var{class}
1208and @var{id} specify which ones (see below).
1209
The return value is @code{0} on success, and @code{-1} on
failure. The following @code{errno} error conditions are possible for
this function:
1213
1214@table @code
1215@item ESRCH
1216The combination of @var{class} and @var{id} does not match any existing
1217process.
1218
1219@item EINVAL
1220The value of @var{class} is not valid.
1221
1222@item EPERM
The call would set the nice value of a process which is owned by a different
user than the calling process (i.e., the target process' real or effective
uid does not match the calling process' effective uid) and the calling
process does not have @code{CAP_SYS_NICE} permission.
1227
1228@item EACCES
The call would lower the process' nice value and the process does not have
@code{CAP_SYS_NICE} permission.
@end table
1233@end deftypefun
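Since raising a nice value needs no privilege, a process can make itself
more deferential without risking @code{EACCES}. The sketch below only
ever raises the value; @code{be_at_least_this_nice} is a hypothetical
name, and @code{PRIO_PROCESS} with an @var{id} of 0 selects the calling
process.

```c
#include <errno.h>
#include <sys/resource.h>

/* Hypothetical helper: make the calling process at least as nice as
   NICEVAL.  It never lowers the nice value, so it needs no privilege.
   Returns 0 on success, or -1 with errno set.  */
int
be_at_least_this_nice (int niceval)
{
  int current;

  /* -1 is a valid nice value, so only errno can signal failure.  */
  errno = 0;
  current = getpriority (PRIO_PROCESS, 0);
  if (current == -1 && errno != 0)
    return -1;

  if (current >= niceval)
    return 0;                   /* Already nice enough.  */

  return setpriority (PRIO_PROCESS, 0, niceval);
}
```

A batch job might call @code{be_at_least_this_nice (10)} once at startup.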
1234
1235The arguments @var{class} and @var{id} together specify a set of
1236processes in which you are interested. These are the possible values of
1237@var{class}:
1238
@vtable @code
1240@comment sys/resource.h
1241@comment BSD
1242@item PRIO_PROCESS
One particular process. The argument @var{id} is a process ID (pid).
1244
1245@comment sys/resource.h
1246@comment BSD
1247@item PRIO_PGRP
All the processes in a particular process group. The argument @var{id} is
a process group ID (pgid).
1250
1251@comment sys/resource.h
1252@comment BSD
1253@item PRIO_USER
All the processes owned by a particular user (i.e., whose real uid
indicates the user). The argument @var{id} is a user ID (uid).
@end vtable
1258If the argument @var{id} is 0, it stands for the calling process, its
1259process group, or its owner (real uid), according to @var{class}.
1261@comment unistd.h
1262@comment BSD
@deftypefun int nice (int @var{increment})
1264@safety{@prelim{}@mtunsafe{@mtasurace{:setpriority}}@asunsafe{}@acsafe{}}
1265@c Calls getpriority before and after setpriority, using the result of
1266@c the first call to compute the argument for setpriority. This creates
1267@c a window for a concurrent setpriority (or nice) call to be lost or
1268@c exhibit surprising behavior.
Increment the nice value of the calling process by @var{increment}.
The return value is the new nice value on success, and @code{-1} on
failure. In the case of failure, @code{errno} will be set to the
same values as for @code{setpriority}.
1274
1275Here is an equivalent definition of @code{nice}:
1276
@smallexample
int
nice (int increment)
@{
  int result, old;

  /* -1 is a valid nice value, so use errno to detect failure.  */
  errno = 0;
  old = getpriority (PRIO_PROCESS, 0);
  if (old == -1 && errno != 0)
    return -1;
  result = setpriority (PRIO_PROCESS, 0, old + increment);
  if (result != -1)
    return old + increment;
  else
    return -1;
@}
@end smallexample
1289@end deftypefun
1291
1292@node CPU Affinity
1293@subsection Limiting execution to certain CPUs
1294
On a multi-processor system the operating system usually distributes
the runnable processes across all available CPUs in a
way which allows the system to work most efficiently. Which processes
and threads run can to some extent be controlled with the scheduling
functionality described in the last sections. But which CPU finally
executes which process or thread is not covered.
1301
1302There are a number of reasons why a program might want to have control
1303over this aspect of the system as well:
1304
1305@itemize @bullet
1306@item
One thread or process is responsible for absolutely critical work
which under no circumstances must be interrupted or hindered from
making progress by other processes or threads using CPU resources. In
this case the special process would be confined to a CPU which no
other process or thread is allowed to use.
1312
1313@item
The access to certain resources (RAM, I/O ports) has different costs
from different CPUs. This is the case in NUMA (Non-Uniform Memory
Architecture) machines. Preferably memory should be accessed locally
but this requirement is usually not visible to the scheduler.
Therefore forcing a process or thread to the CPUs which have local
access to the most-used memory helps to significantly boost the
performance.
1321
1322@item
In controlled runtimes resource allocation and book-keeping work (for
instance garbage collection) is performed local to processors. This
can help to reduce locking costs if the resources do not have to be
protected from concurrent accesses from different processors.
1327@end itemize
1328
The POSIX standard up to this date is not of much help in solving this
problem. The Linux kernel provides a set of interfaces to allow
specifying @emph{affinity sets} for a process. The scheduler will
schedule the thread or process on CPUs specified by the affinity
masks. The interfaces which @theglibc{} defines follow to some
extent the Linux kernel interface.
1335
1336@comment sched.h
1337@comment GNU
1338@deftp {Data Type} cpu_set_t
This data type is a bitset where each bit represents a CPU. How the
system's CPUs are mapped to bits in the bitset is system dependent.
The data type has a fixed size; in the unlikely case that the number
of bits is not sufficient to describe the CPUs of the system a
different interface has to be used.
1344
1345This type is a GNU extension and is defined in @file{sched.h}.
1346@end deftp
1347
To manipulate the bitset, to set and reset bits, a number of macros are
defined. Some of the macros take a CPU number as a parameter. Here
it is important to never exceed the size of the bitset. The following
macro specifies the number of bits in the @code{cpu_set_t} bitset.
1352
1353@comment sched.h
1354@comment GNU
1355@deftypevr Macro int CPU_SETSIZE
1356The value of this macro is the maximum number of CPUs which can be
1357handled with a @code{cpu_set_t} object.
1358@end deftypevr
1359
1360The type @code{cpu_set_t} should be considered opaque; all
1361manipulation should happen via the next four macros.
1362
1363@comment sched.h
1364@comment GNU
1365@deftypefn Macro void CPU_ZERO (cpu_set_t *@var{set})
1366@safety{@prelim{}@mtsafe{}@assafe{}@acsafe{}}
1367@c CPU_ZERO ok
1368@c __CPU_ZERO_S ok
1369@c memset dup ok
1370This macro initializes the CPU set @var{set} to be the empty set.
1371
1372This macro is a GNU extension and is defined in @file{sched.h}.
1373@end deftypefn
1374
1375@comment sched.h
1376@comment GNU
1377@deftypefn Macro void CPU_SET (int @var{cpu}, cpu_set_t *@var{set})
1378@safety{@prelim{}@mtsafe{}@assafe{}@acsafe{}}
1379@c CPU_SET ok
1380@c __CPU_SET_S ok
1381@c __CPUELT ok
1382@c __CPUMASK ok
1383This macro adds @var{cpu} to the CPU set @var{set}.
1384
1385The @var{cpu} parameter must not have side effects since it is
1386evaluated more than once.
1387
1388This macro is a GNU extension and is defined in @file{sched.h}.
1389@end deftypefn
1390
1391@comment sched.h
1392@comment GNU
1393@deftypefn Macro void CPU_CLR (int @var{cpu}, cpu_set_t *@var{set})
1394@safety{@prelim{}@mtsafe{}@assafe{}@acsafe{}}
1395@c CPU_CLR ok
1396@c __CPU_CLR_S ok
1397@c __CPUELT dup ok
1398@c __CPUMASK dup ok
1399This macro removes @var{cpu} from the CPU set @var{set}.
1400
1401The @var{cpu} parameter must not have side effects since it is
1402evaluated more than once.
1403
1404This macro is a GNU extension and is defined in @file{sched.h}.
1405@end deftypefn
1406
1407@comment sched.h
1408@comment GNU
1409@deftypefn Macro int CPU_ISSET (int @var{cpu}, const cpu_set_t *@var{set})
1410@safety{@prelim{}@mtsafe{}@assafe{}@acsafe{}}
1411@c CPU_ISSET ok
1412@c __CPU_ISSET_S ok
1413@c __CPUELT dup ok
1414@c __CPUMASK dup ok
1415This macro returns a nonzero value (true) if @var{cpu} is a member
1416of the CPU set @var{set}, and zero (false) otherwise.
1417
1418The @var{cpu} parameter must not have side effects since it is
1419evaluated more than once.
1420
1421This macro is a GNU extension and is defined in @file{sched.h}.
1422@end deftypefn
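The four macros can be combined as in the following sketch, which builds
a set from scratch and counts its members. The helper name
@code{count_cpus} is hypothetical. The sketch assumes that
@code{_GNU_SOURCE} must be defined before any header is included for
these GNU extensions to be visible.

```c
#define _GNU_SOURCE             /* Must come before any #include.  */
#include <sched.h>

/* Hypothetical helper: count the CPUs that are members of SET by
   testing every bit position below CPU_SETSIZE.  */
int
count_cpus (const cpu_set_t *set)
{
  int cpu, count = 0;

  for (cpu = 0; cpu < CPU_SETSIZE; cpu++)
    if (CPU_ISSET (cpu, set))
      count++;
  return count;
}
```

A set is prepared with @code{CPU_ZERO} and then filled with
@code{CPU_SET}; the loop variable is passed to @code{CPU_ISSET} without
side effects, as the macro descriptions require.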
1423
1424
1425CPU bitsets can be constructed from scratch or the currently installed
1426affinity mask can be retrieved from the system.
1427
1428@comment sched.h
1429@comment GNU
@deftypefun int sched_getaffinity (pid_t @var{pid}, size_t @var{cpusetsize}, cpu_set_t *@var{cpuset})
1431@safety{@prelim{}@mtsafe{}@assafe{}@acsafe{}}
1432@c Wrapped syscall to zero out past the kernel cpu set size; Linux
1433@c only.
1434
This function stores the CPU affinity mask for the process or thread
1436with the ID @var{pid} in the @var{cpusetsize} bytes long bitmap
1437pointed to by @var{cpuset}. If successful, the function always
1438initializes all bits in the @code{cpu_set_t} object and returns zero.
1439
If @var{pid} does not correspond to a process or thread on the system
or the function fails for some other reason, it returns @code{-1}
and @code{errno} is set to represent the error condition.
1443
1444@table @code
1445@item ESRCH
1446No process or thread with the given ID found.
1447
1448@item EFAULT
The pointer @var{cpuset} does not point to a valid object.
1450@end table
1451
1452This function is a GNU extension and is declared in @file{sched.h}.
1453@end deftypefun
1454
1455Note that it is not portably possible to use this information to
1456retrieve the information for different POSIX threads. A separate
1457interface must be provided for that.
1458
1459@comment sched.h
1460@comment GNU
@deftypefun int sched_setaffinity (pid_t @var{pid}, size_t @var{cpusetsize}, const cpu_set_t *@var{cpuset})
1462@safety{@prelim{}@mtsafe{}@assafe{}@acsafe{}}
1463@c Wrapped syscall to detect attempts to set bits past the kernel cpu
1464@c set size; Linux only.
1466This function installs the @var{cpusetsize} bytes long affinity mask
1467pointed to by @var{cpuset} for the process or thread with the ID @var{pid}.
If successful the function returns zero and the scheduler will in the future
take the affinity information into account.
1470
1471If the function fails it will return @code{-1} and @code{errno} is set
1472to the error code:
1473
1474@table @code
1475@item ESRCH
1476No process or thread with the given ID found.
1477
1478@item EFAULT
The pointer @var{cpuset} does not point to a valid object.
1480
1481@item EINVAL
1482The bitset is not valid. This might mean that the affinity set might
1483not leave a processor for the process or thread to run on.
1484@end table
1485
1486This function is a GNU extension and is declared in @file{sched.h}.
1487@end deftypefun
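Retrieving the installed mask and installing a narrower one can be
sketched as follows. The helper name @code{pin_to_first_allowed_cpu} is
hypothetical, and the sketch assumes that @code{_GNU_SOURCE} is defined
before any header is included and that the system has no more than
@code{CPU_SETSIZE} CPUs.

```c
#define _GNU_SOURCE             /* Must come before any #include.  */
#include <errno.h>
#include <sched.h>

/* Hypothetical helper: restrict the calling process (pid 0) to the
   lowest-numbered CPU it is currently allowed to run on.
   Returns that CPU number, or -1 with errno set.  */
int
pin_to_first_allowed_cpu (void)
{
  cpu_set_t set;
  int cpu;

  if (sched_getaffinity (0, sizeof (set), &set) == -1)
    return -1;

  /* Find the first CPU in the currently installed mask.  */
  for (cpu = 0; cpu < CPU_SETSIZE; cpu++)
    if (CPU_ISSET (cpu, &set))
      break;
  if (cpu == CPU_SETSIZE)
    {
      errno = EINVAL;
      return -1;
    }

  /* Build a set containing only that CPU and install it.  */
  CPU_ZERO (&set);
  CPU_SET (cpu, &set);
  if (sched_setaffinity (0, sizeof (set), &set) == -1)
    return -1;
  return cpu;
}
```

Starting from the existing mask, rather than from CPU 0, avoids the
@code{EINVAL} case in which the new set leaves no processor to run on.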
1488
1489
1490@node Memory Resources
1491@section Querying memory available resources
1492
The amount of memory available in the system and the way it is organized
often determine the way programs can and have to work. For
functions like @code{mmap} it is necessary to know about the size of
individual memory pages and knowing how much memory is available enables
a program to select appropriate sizes for, say, caches. Before we get
into these details a few words about memory subsystems in traditional
Unix systems will be given.
1500
1501@menu
1502* Memory Subsystem:: Overview about traditional Unix memory handling.
1503* Query Memory Parameters:: How to get information about the memory
1504 subsystem?
1505@end menu
1506
1507@node Memory Subsystem
1508@subsection Overview about traditional Unix memory handling
1509
1510@cindex address space
1511@cindex physical memory
1512@cindex physical address
1513Unix systems normally provide processes virtual address spaces. This
1514means that the addresses of the memory regions do not have to correspond
1515directly to the addresses of the actual physical memory which stores the
1516data. An extra level of indirection is introduced which translates
1517virtual addresses into physical addresses. This is normally done by the
1518hardware of the processor.
1519
1520@cindex shared memory
Using a virtual address space has several advantages. The most important
1522is process isolation. The different processes running on the system
1523cannot interfere directly with each other. No process can write into
1524the address space of another process (except when shared memory is used
1525but then it is wanted and controlled).
1526
1527Another advantage of virtual memory is that the address space the
1528processes see can actually be larger than the physical memory available.
The physical memory can be extended by storage on an external medium
1530where the content of currently unused memory regions is stored. The
1531address translation can then intercept accesses to these memory regions
1532and make memory content available again by loading the data back into
1533memory. This concept makes it necessary that programs which have to use
1534lots of memory know the difference between available virtual address
1535space and available physical memory. If the working set of virtual
1536memory of all the processes is larger than the available physical memory
1537the system will slow down dramatically due to constant swapping of
1538memory content from the memory to the storage media and back. This is
1539called ``thrashing''.
1540@cindex thrashing
1541
1542@cindex memory page
1543@cindex page, memory
1544A final aspect of virtual memory which is important and follows from
1545what is said in the last paragraph is the granularity of the virtual
1546address space handling. When we said that the virtual address handling
1547stores memory content externally it cannot do this on a byte-by-byte
basis. The administrative overhead does not allow this (let alone
the processor hardware). Instead several thousand bytes are handled
together and form a @dfn{page}. The size of each page is always a power
of two bytes. The smallest page size in use today is 4096, with 8192,
155216384, and 65536 being other popular sizes.
1553
@node Query Memory Parameters
@subsection How to get information about the memory subsystem?

The page size of the virtual memory the process sees is essential to
know in several situations. Some programming interfaces (e.g.,
@code{mmap}, @pxref{Memory-mapped I/O}) require the user to provide
information adjusted to the page size. In the case of @code{mmap},
mappings are made in units of whole pages and the file offset argument
must be a multiple of the page size. Another place where knowledge of
the page size is useful is memory allocation. If one allocates memory
in larger chunks which are then subdivided by the application code, it
is useful to adjust the size of the larger blocks to the page size. If
the total memory requirement for the block is close to (but not larger
than) a multiple of the page size, the kernel's memory handling can work
more effectively since it only has to allocate memory pages which are
fully used. (To do this optimization it is necessary to know a bit
about the memory allocator, which will require a bit of memory itself
for each block; this overhead must not push the total size over the page
size multiple.)

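As a minimal sketch of the size adjustment described above, the
following helper rounds a length up to the next multiple of a page size.
The name @code{round_up_to_page} is hypothetical and not part of
@theglibc{}. The sketch assumes that the page size is a power of two and
that the addition does not overflow; it does not account for any
per-block overhead of the memory allocator.

```c
#include <stddef.h>

/* Hypothetical helper: round LENGTH up to the next multiple of
   PAGE_SIZE.  PAGE_SIZE is assumed to be a power of two, so the
   rounding can be done with a bit mask.  The caller must make sure
   that LENGTH + PAGE_SIZE - 1 does not overflow.  */
size_t
round_up_to_page (size_t length, size_t page_size)
{
  return (length + page_size - 1) & ~(page_size - 1);
}
```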
Traditionally the page size was a compile-time constant, but recent
processor development has changed this. Processors now support
different page sizes, and the page size can possibly even vary among
different processes on the same system. Therefore the system should be
queried at runtime about the current page size and no assumptions
(except about it being a power of two) should be made.

@vindex _SC_PAGESIZE
The correct interface to query the page size is @code{sysconf}
(@pxref{Sysconf Definition}) with the parameter @code{_SC_PAGESIZE}.
There is a much older interface available, too.

@comment unistd.h
@comment BSD
@deftypefun int getpagesize (void)
@safety{@prelim{}@mtsafe{}@assafe{}@acsafe{}}
@c Obtained from the aux vec at program startup time. GNU/Linux/m68k is
@c the exception, with the possibility of a syscall.
The @code{getpagesize} function returns the page size of the process.
This value is fixed for the runtime of the process but can vary in
different runs of the application.

The function is declared in @file{unistd.h}.
@end deftypefun

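The recommended @code{sysconf} query can be sketched as follows. The
helper names @code{query_page_size} and @code{is_power_of_two} are
hypothetical; the first only wraps the @code{sysconf} call and passes a
failure result of @code{-1} through to the caller, the second checks the
one property of the page size that may be assumed.

```c
#include <unistd.h>

/* Hypothetical helper: return the page size in bytes as reported by
   sysconf, or -1 if the value cannot be determined.  */
long int
query_page_size (void)
{
  return sysconf (_SC_PAGESIZE);
}

/* Hypothetical helper: return nonzero if VALUE is a positive power
   of two.  */
int
is_power_of_two (long int value)
{
  return value > 0 && (value & (value - 1)) == 0;
}
```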
Widely available on @w{System V} derived systems is a method to get
information about the physical memory the system has. The call

@vindex _SC_PHYS_PAGES
@cindex sysconf
@smallexample
  sysconf (_SC_PHYS_PAGES)
@end smallexample

@noindent
returns the total number of pages of physical memory the system has.
This does not mean all this memory is available. The number of pages
currently available can be found using

@vindex _SC_AVPHYS_PAGES
@cindex sysconf
@smallexample
  sysconf (_SC_AVPHYS_PAGES)
@end smallexample

These two values help to optimize applications. The value returned for
@code{_SC_AVPHYS_PAGES} is the amount of memory the application can use
without hindering any other process (given that no other process
increases its memory usage). The value returned for
@code{_SC_PHYS_PAGES} is more or less a hard limit for the working set.
If all applications together constantly use more than that amount of
memory, the system is in trouble.

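Both values are page counts, so converting them to bytes requires the
page size. The following is a minimal sketch under the assumption that
the system supports @code{_SC_PHYS_PAGES} and @code{_SC_AVPHYS_PAGES}
(they are not part of POSIX) and that the product fits in a
@code{long long}. The helper names @code{pages_to_bytes},
@code{total_physical_bytes} and @code{available_physical_bytes} are
hypothetical.

```c
#include <unistd.h>

/* Hypothetical helper: convert a page count to bytes.  Returns -1 if
   either input is negative, which is how sysconf reports failure.  */
long long
pages_to_bytes (long int pages, long int page_size)
{
  if (pages < 0 || page_size < 0)
    return -1;
  return (long long) pages * page_size;
}

/* Total physical memory in bytes, or -1 if it cannot be determined.  */
long long
total_physical_bytes (void)
{
  return pages_to_bytes (sysconf (_SC_PHYS_PAGES),
                         sysconf (_SC_PAGESIZE));
}

/* Currently available physical memory in bytes, or -1 if it cannot
   be determined.  */
long long
available_physical_bytes (void)
{
  return pages_to_bytes (sysconf (_SC_AVPHYS_PAGES),
                         sysconf (_SC_PAGESIZE));
}
```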
In addition to the methods already described, @theglibc{} provides two
functions to get this information. They are declared in the file
@file{sys/sysinfo.h}. Programmers should prefer to use the
@code{sysconf} method described above.

@comment sys/sysinfo.h
@comment GNU
@deftypefun {long int} get_phys_pages (void)
@safety{@prelim{}@mtsafe{}@asunsafe{@ascuheap{} @asulock{}}@acunsafe{@aculock{} @acsfd{} @acsmem{}}}
@c This fopens a /proc file and scans it for the requested information.
The @code{get_phys_pages} function returns the total number of pages of
physical memory the system has. To get the amount of memory this number
has to be multiplied by the page size.

This function is a GNU extension.
@end deftypefun

@comment sys/sysinfo.h
@comment GNU
@deftypefun {long int} get_avphys_pages (void)
@safety{@prelim{}@mtsafe{}@asunsafe{@ascuheap{} @asulock{}}@acunsafe{@aculock{} @acsfd{} @acsmem{}}}
The @code{get_avphys_pages} function returns the number of available
pages of physical memory the system has. To get the amount of memory
this number has to be multiplied by the page size.

This function is a GNU extension.
@end deftypefun

@node Processor Resources
@section Learn about the processors available

The use of threads or processes with shared memory allows an application
to take advantage of all the processing power a system can provide. If
the task can be parallelized, the optimal way to write an application is
to have at any time as many processes running as there are processors.
To determine the number of processors available to the system one can
run

@vindex _SC_NPROCESSORS_CONF
@cindex sysconf
@smallexample
  sysconf (_SC_NPROCESSORS_CONF)
@end smallexample

@noindent
which returns the number of processors the operating system configured.
But it might be possible for the operating system to disable individual
processors and so the call

@vindex _SC_NPROCESSORS_ONLN
@cindex sysconf
@smallexample
  sysconf (_SC_NPROCESSORS_ONLN)
@end smallexample

@noindent
returns the number of processors which are currently online (i.e.,
available).

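One possible way to size a pool of workers from the number of online
processors is sketched below. The helper name
@code{choose_worker_count} is hypothetical, and falling back to a single
worker when the query fails is an assumption of this sketch, not
behavior prescribed by @theglibc{}.

```c
#include <unistd.h>

/* Hypothetical helper: pick one worker per online processor.  If
   sysconf fails or reports a non-positive count, fall back to a
   single worker (an assumption of this sketch).  */
long int
choose_worker_count (void)
{
  long int online = sysconf (_SC_NPROCESSORS_ONLN);
  return online > 0 ? online : 1;
}
```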
For these two pieces of information @theglibc{} also provides
functions to get the information directly. The functions are declared
in @file{sys/sysinfo.h}.

@comment sys/sysinfo.h
@comment GNU
@deftypefun int get_nprocs_conf (void)
@safety{@prelim{}@mtsafe{}@asunsafe{@ascuheap{} @asulock{}}@acunsafe{@aculock{} @acsfd{} @acsmem{}}}
@c This function reads from /sys using dir streams (single user, so
@c no @mtasurace issue), and on some arches, from /proc using streams.
The @code{get_nprocs_conf} function returns the number of processors the
operating system configured.

This function is a GNU extension.
@end deftypefun

@comment sys/sysinfo.h
@comment GNU
@deftypefun int get_nprocs (void)
@safety{@prelim{}@mtsafe{}@assafe{}@acsafe{@acsfd{}}}
@c This function reads from /proc using file descriptor I/O.
The @code{get_nprocs} function returns the number of available processors.

This function is a GNU extension.
@end deftypefun

@cindex load average
Before starting more threads it should be checked whether the processors
are not already overused. Unix systems calculate something called the
@dfn{load average}. This is a number indicating how many processes were
running. This number is averaged over different periods of time
(normally 1, 5, and 15 minutes).

@comment stdlib.h
@comment BSD
@deftypefun int getloadavg (double @var{loadavg}[], int @var{nelem})
@safety{@prelim{}@mtsafe{}@assafe{}@acsafe{@acsfd{}}}
@c Calls host_info on HURD; on Linux, opens /proc/loadavg, reads from
@c it, closes it, without cancellation point, and calls strtod_l with
@c the C locale to convert the strings to doubles.
This function gets the 1, 5 and 15 minute load averages of the
system. The values are placed in @var{loadavg}. @code{getloadavg} will
place at most @var{nelem} elements into the array but never more than
three elements. The return value is the number of elements written to
@var{loadavg}, or @code{-1} on error.

This function is declared in @file{stdlib.h}.
@end deftypefun
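
The check described above might be sketched as follows. The helper
names @code{one_minute_load} and @code{has_spare_processors} are
hypothetical, and comparing the 1 minute load average against the
processor count is only one possible policy, assumed here for
illustration. When compiling in strict ISO C mode, the declaration of
@code{getloadavg} may additionally require a feature test macro such as
@code{_DEFAULT_SOURCE}.

```c
#include <stdlib.h>

/* Hypothetical helper: return the 1 minute load average, or -1.0 if
   it cannot be obtained.  */
double
one_minute_load (void)
{
  double loadavg[1];

  if (getloadavg (loadavg, 1) < 1)
    return -1.0;
  return loadavg[0];
}

/* Hypothetical policy (an assumption of this sketch): there is spare
   capacity if LOAD is a valid load average below NPROCS.  */
int
has_spare_processors (double load, int nprocs)
{
  return load >= 0.0 && load < (double) nprocs;
}
```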