1 .\" Copyright (C) 2015 Alexei Starovoitov <ast@kernel.org>
2 .\" and Copyright (C) 2015 Michael Kerrisk <mtk.manpages@gmail.com>
3 .\"
4 .\" %%%LICENSE_START(VERBATIM)
5 .\" Permission is granted to make and distribute verbatim copies of this
6 .\" manual provided the copyright notice and this permission notice are
7 .\" preserved on all copies.
8 .\"
9 .\" Permission is granted to copy and distribute modified versions of this
10 .\" manual under the conditions for verbatim copying, provided that the
11 .\" entire resulting derived work is distributed under the terms of a
12 .\" permission notice identical to this one.
13 .\"
14 .\" Since the Linux kernel and libraries are constantly changing, this
15 .\" manual page may be incorrect or out-of-date. The author(s) assume no
16 .\" responsibility for errors or omissions, or for damages resulting from
17 .\" the use of the information contained herein. The author(s) may not
18 .\" have taken the same level of care in the production of this manual,
19 .\" which is licensed free of charge, as they might when working
20 .\" professionally.
21 .\"
22 .\" Formatted or processed versions of this manual, if unaccompanied by
23 .\" the source, must acknowledge the copyright and authors of this work.
24 .\" %%%LICENSE_END
25 .\"
26 .TH BPF 2 2019-08-02 "Linux" "Linux Programmer's Manual"
27 .SH NAME
28 bpf \- perform a command on an extended BPF map or program
29 .SH SYNOPSIS
30 .nf
31 .B #include <linux/bpf.h>
32
33 .BI "int bpf(int " cmd ", union bpf_attr *" attr ", unsigned int " size );
34 .fi
35 .SH DESCRIPTION
36 The
37 .BR bpf ()
38 system call performs a range of operations related to extended
39 Berkeley Packet Filters.
40 Extended BPF (or eBPF) is similar to
41 the original ("classic") BPF (cBPF) used to filter network packets.
42 For both cBPF and eBPF programs,
43 the kernel statically analyzes the programs before loading them,
44 in order to ensure that they cannot harm the running system.
45 .PP
46 eBPF extends cBPF in multiple ways, including the ability to call
47 a fixed set of in-kernel helper functions
48 .\" See 'enum bpf_func_id' in include/uapi/linux/bpf.h
49 (via the
50 .B BPF_CALL
51 opcode extension provided by eBPF)
52 and access shared data structures such as eBPF maps.
53 .\"
54 .SS Extended BPF Design/Architecture
55 eBPF maps are a generic data structure for storage of different data types.
56 Data types are generally treated as binary blobs, so a user just specifies
57 the size of the key and the size of the value at map-creation time.
58 In other words, a key/value for a given map can have an arbitrary structure.
59 .PP
60 A user process can create multiple maps (with key/value-pairs being
61 opaque bytes of data) and access them via file descriptors.
62 Different eBPF programs can access the same maps in parallel.
63 It's up to the user process and eBPF program to decide what they store
64 inside maps.
65 .PP
66 There's one special map type, called a program array.
67 This type of map stores file descriptors referring to other eBPF programs.
68 When a lookup in the map is performed, the program flow is
69 redirected in-place to the beginning of another eBPF program and does not
return to the calling program.
71 The level of nesting has a fixed limit of 32,
72 .\" Defined by the kernel constant MAX_TAIL_CALL_CNT in include/linux/bpf.h
73 so that infinite loops cannot be crafted.
74 At run time, the program file descriptors stored in the map can be modified,
75 so program functionality can be altered based on specific requirements.
76 All programs referred to in a program-array map must
77 have been previously loaded into the kernel via
78 .BR bpf ().
79 If a map lookup fails, the current program continues its execution.
80 See
81 .B BPF_MAP_TYPE_PROG_ARRAY
82 below for further details.
83 .PP
84 Generally, eBPF programs are loaded by the user process and automatically
85 unloaded when the process exits.
86 In some cases, for example,
87 .BR tc-bpf (8),
88 the program will continue to stay alive inside the kernel even after the
89 process that loaded the program exits.
90 In that case,
91 the tc subsystem holds a reference to the eBPF program after the
92 file descriptor has been closed by the user-space program.
93 Thus, whether a specific program continues to live inside the kernel
94 depends on how it is further attached to a given kernel subsystem
95 after it was loaded via
96 .BR bpf ().
97 .PP
98 Each eBPF program is a set of instructions that is safe to run until
99 its completion.
100 An in-kernel verifier statically determines that the eBPF program
101 terminates and is safe to execute.
102 During verification, the kernel increments reference counts for each of
103 the maps that the eBPF program uses,
104 so that the attached maps can't be removed until the program is unloaded.
105 .PP
106 eBPF programs can be attached to different events.
107 These events can be the arrival of network packets, tracing
108 events, classification events by network queueing disciplines
109 (for eBPF programs attached to a
110 .BR tc (8)
111 classifier), and other types that may be added in the future.
112 A new event triggers execution of the eBPF program, which
113 may store information about the event in eBPF maps.
114 Beyond storing data, eBPF programs may call a fixed set of
115 in-kernel helper functions.
116 .PP
117 The same eBPF program can be attached to multiple events and different
118 eBPF programs can access the same map:
119 .PP
120 .in +4n
121 .EX
tracing     tracing    tracing    packet      packet     packet
event A     event B    event C    on eth0     on eth1    on eth2
 |             |         |          |           |          ^
 |             |         |          |           v          |
 --> tracing <--     tracing      socket    tc ingress   tc egress
      prog_1          prog_2      prog_3    classifier    action
      |  |              |           |         prog_4      prog_5
   |---  -----|  |------|          map_3        |           |
 map_1       map_2                              --| map_4 |--
131 .EE
132 .in
133 .\"
134 .SS Arguments
135 The operation to be performed by the
136 .BR bpf ()
137 system call is determined by the
138 .I cmd
139 argument.
140 Each operation takes an accompanying argument,
141 provided via
142 .IR attr ,
143 which is a pointer to a union of type
144 .I bpf_attr
145 (see below).
146 The
147 .I size
148 argument is the size of the union pointed to by
149 .IR attr .
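.PP
The wrapper functions shown on this page call
.BR bpf ()
directly.
Where the C library provides no such wrapper
(as is the case with glibc at the time of writing),
the call can be made through
.BR syscall (2).
The following is a minimal sketch, assuming that the installed kernel
headers define
.BR __NR_bpf .
Because some commands fail with
.B EINVAL
when unused fields of
.I attr
are nonzero, callers are assumed to zero the union before filling it in.
.PP
.in +4n
.EX
#define _GNU_SOURCE
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/bpf.h>

/* Thin wrapper: pass the three arguments to the raw system call. */
static int
bpf(int cmd, union bpf_attr *attr, unsigned int size)
{
    return (int) syscall(__NR_bpf, cmd, attr, size);
}
.EE
.in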
150 .PP
151 The value provided in
152 .I cmd
153 is one of the following:
154 .TP
155 .B BPF_MAP_CREATE
156 Create a map and return a file descriptor that refers to the map.
157 The close-on-exec file descriptor flag (see
158 .BR fcntl (2))
159 is automatically enabled for the new file descriptor.
160 .TP
161 .B BPF_MAP_LOOKUP_ELEM
162 Look up an element by key in a specified map and return its value.
163 .TP
164 .B BPF_MAP_UPDATE_ELEM
165 Create or update an element (key/value pair) in a specified map.
166 .TP
167 .B BPF_MAP_DELETE_ELEM
168 Look up and delete an element by key in a specified map.
169 .TP
170 .B BPF_MAP_GET_NEXT_KEY
171 Look up an element by key in a specified map and return the key
172 of the next element.
173 .TP
174 .B BPF_PROG_LOAD
175 Verify and load an eBPF program,
176 returning a new file descriptor associated with the program.
177 The close-on-exec file descriptor flag (see
178 .BR fcntl (2))
179 is automatically enabled for the new file descriptor.
.PP
181 The
182 .I bpf_attr
183 union consists of various anonymous structures that are used by different
184 .BR bpf ()
185 commands:
186 .PP
187 .in +4n
188 .EX
union bpf_attr {
    struct {    /* Used by BPF_MAP_CREATE */
        __u32         map_type;
        __u32         key_size;    /* size of key in bytes */
        __u32         value_size;  /* size of value in bytes */
        __u32         max_entries; /* maximum number of entries
                                      in a map */
    };

    struct {    /* Used by BPF_MAP_*_ELEM and BPF_MAP_GET_NEXT_KEY
                   commands */
        __u32         map_fd;
        __aligned_u64 key;
        union {
            __aligned_u64 value;
            __aligned_u64 next_key;
        };
        __u64         flags;
    };

    struct {    /* Used by BPF_PROG_LOAD */
        __u32         prog_type;
        __u32         insn_cnt;
        __aligned_u64 insns;      /* 'const struct bpf_insn *' */
        __aligned_u64 license;    /* 'const char *' */
        __u32         log_level;  /* verbosity level of verifier */
        __u32         log_size;   /* size of user buffer */
        __aligned_u64 log_buf;    /* user supplied 'char *'
                                     buffer */
        __u32         kern_version;
                                  /* checked when prog_type=kprobe
                                     (since Linux 4.1) */
.\"                 commit 2541517c32be2531e0da59dfd7efc1ce844644f5
    };
} __attribute__((aligned(8)));
224 .EE
225 .in
226 .\"
227 .SS eBPF maps
228 Maps are a generic data structure for storage of different types of data.
229 They allow sharing of data between eBPF kernel programs,
230 and also between kernel and user-space applications.
231 .PP
232 Each map type has the following attributes:
233 .IP * 3
234 type
235 .IP *
236 maximum number of elements
237 .IP *
238 key size in bytes
239 .IP *
240 value size in bytes
241 .PP
242 The following wrapper functions demonstrate how various
243 .BR bpf ()
244 commands can be used to access the maps.
245 The functions use the
246 .I cmd
247 argument to invoke different operations.
248 .TP
249 .B BPF_MAP_CREATE
250 The
251 .B BPF_MAP_CREATE
252 command creates a new map,
253 returning a new file descriptor that refers to the map.
254 .IP
255 .in +4n
256 .EX
int
bpf_create_map(enum bpf_map_type map_type,
               unsigned int key_size,
               unsigned int value_size,
               unsigned int max_entries)
{
    union bpf_attr attr = {
        .map_type    = map_type,
        .key_size    = key_size,
        .value_size  = value_size,
        .max_entries = max_entries
    };

    return bpf(BPF_MAP_CREATE, &attr, sizeof(attr));
}
272 .EE
273 .in
274 .IP
275 The new map has the type specified by
276 .IR map_type ,
277 and attributes as specified in
278 .IR key_size ,
279 .IR value_size ,
280 and
281 .IR max_entries .
282 On success, this operation returns a file descriptor.
283 On error, \-1 is returned and
284 .I errno
285 is set to
286 .BR EINVAL ,
287 .BR EPERM ,
288 or
289 .BR ENOMEM .
290 .IP
291 The
292 .I key_size
293 and
294 .I value_size
295 attributes will be used by the verifier during program loading
296 to check that the program is calling
297 .BR bpf_map_*_elem ()
298 helper functions with a correctly initialized
299 .I key
300 and to check that the program doesn't access the map element
301 .I value
302 beyond the specified
303 .IR value_size .
304 For example, when a map is created with a
305 .I key_size
306 of 8 and the eBPF program calls
307 .IP
308 .in +4n
309 .EX
310 bpf_map_lookup_elem(map_fd, fp - 4)
311 .EE
312 .in
313 .IP
314 the program will be rejected,
315 since the in-kernel helper function
316 .IP
317 .EX
318 bpf_map_lookup_elem(map_fd, void *key)
319 .EE
320 .IP
321 expects to read 8 bytes from the location pointed to by
322 .IR key ,
323 but the
324 .I fp\ -\ 4
325 (where
326 .I fp
327 is the top of the stack)
328 starting address will cause out-of-bounds stack access.
329 .IP
330 Similarly, when a map is created with a
331 .I value_size
332 of 1 and the eBPF program contains
333 .IP
334 .in +4n
335 .EX
336 value = bpf_map_lookup_elem(...);
337 *(u32 *) value = 1;
338 .EE
339 .in
340 .IP
341 the program will be rejected, since it accesses the
342 .I value
343 pointer beyond the specified 1 byte
344 .I value_size
345 limit.
346 .IP
347 Currently, the following values are supported for
348 .IR map_type :
349 .IP
350 .in +4n
351 .EX
enum bpf_map_type {
    BPF_MAP_TYPE_UNSPEC,  /* Reserve 0 as invalid map type */
    BPF_MAP_TYPE_HASH,
    BPF_MAP_TYPE_ARRAY,
    BPF_MAP_TYPE_PROG_ARRAY,
    BPF_MAP_TYPE_PERF_EVENT_ARRAY,
    BPF_MAP_TYPE_PERCPU_HASH,
    BPF_MAP_TYPE_PERCPU_ARRAY,
    BPF_MAP_TYPE_STACK_TRACE,
    BPF_MAP_TYPE_CGROUP_ARRAY,
    BPF_MAP_TYPE_LRU_HASH,
    BPF_MAP_TYPE_LRU_PERCPU_HASH,
    BPF_MAP_TYPE_LPM_TRIE,
    BPF_MAP_TYPE_ARRAY_OF_MAPS,
    BPF_MAP_TYPE_HASH_OF_MAPS,
    BPF_MAP_TYPE_DEVMAP,
    BPF_MAP_TYPE_SOCKMAP,
    BPF_MAP_TYPE_CPUMAP,
};
371 .EE
372 .in
373 .IP
374 .I map_type
375 selects one of the available map implementations in the kernel.
376 .\" FIXME We need an explanation of why one might choose each of
377 .\" these map implementations
378 For all map types,
379 eBPF programs access maps with the same
380 .BR bpf_map_lookup_elem ()
381 and
382 .BR bpf_map_update_elem ()
383 helper functions.
384 Further details of the various map types are given below.
385 .TP
386 .B BPF_MAP_LOOKUP_ELEM
387 The
388 .B BPF_MAP_LOOKUP_ELEM
389 command looks up an element with a given
390 .I key
391 in the map referred to by the file descriptor
392 .IR fd .
393 .IP
394 .in +4n
395 .EX
int
bpf_lookup_elem(int fd, const void *key, void *value)
{
    union bpf_attr attr = {
        .map_fd = fd,
        .key    = ptr_to_u64(key),
        .value  = ptr_to_u64(value),
    };

    return bpf(BPF_MAP_LOOKUP_ELEM, &attr, sizeof(attr));
}
407 .EE
408 .in
409 .IP
410 If an element is found,
411 the operation returns zero and stores the element's value into
412 .IR value ,
413 which must point to a buffer of
414 .I value_size
415 bytes.
416 .IP
417 If no element is found, the operation returns \-1 and sets
418 .I errno
419 to
420 .BR ENOENT .
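.IP
The
.BR ptr_to_u64 ()
helper used in these wrappers is not defined in
.IR <linux/bpf.h> .
One plausible definition, modeled on the helper of the same name in the
kernel sources, is sketched below.
It widens a pointer so that it can be stored in the
.I __aligned_u64
fields of
.IR bpf_attr ;
the use of
.I uint64_t
rather than
.I __u64
is a choice made here to keep the sketch self-contained.
.IP
.in +4n
.EX
#include <stdint.h>

/* Widen a user-space pointer to the 64-bit form used in bpf_attr. */
static inline uint64_t
ptr_to_u64(const void *ptr)
{
    return (uint64_t) (uintptr_t) ptr;
}
.EE
.in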
421 .TP
422 .B BPF_MAP_UPDATE_ELEM
423 The
424 .B BPF_MAP_UPDATE_ELEM
425 command
426 creates or updates an element with a given
427 .I key/value
428 in the map referred to by the file descriptor
429 .IR fd .
430 .IP
431 .in +4n
432 .EX
int
bpf_update_elem(int fd, const void *key, const void *value,
                uint64_t flags)
{
    union bpf_attr attr = {
        .map_fd = fd,
        .key    = ptr_to_u64(key),
        .value  = ptr_to_u64(value),
        .flags  = flags,
    };

    return bpf(BPF_MAP_UPDATE_ELEM, &attr, sizeof(attr));
}
446 .EE
447 .in
448 .IP
449 The
450 .I flags
451 argument should be specified as one of the following:
452 .RS
453 .TP
454 .B BPF_ANY
455 Create a new element or update an existing element.
456 .TP
457 .B BPF_NOEXIST
458 Create a new element only if it did not exist.
459 .TP
460 .B BPF_EXIST
461 Update an existing element.
462 .RE
463 .IP
464 On success, the operation returns zero.
465 On error, \-1 is returned and
466 .I errno
467 is set to
468 .BR EINVAL ,
469 .BR EPERM ,
470 .BR ENOMEM ,
471 or
472 .BR E2BIG .
473 .B E2BIG
474 indicates that the number of elements in the map reached the
475 .I max_entries
476 limit specified at map creation time.
477 .B EEXIST
478 will be returned if
479 .I flags
480 specifies
481 .B BPF_NOEXIST
482 and the element with
483 .I key
484 already exists in the map.
485 .B ENOENT
486 will be returned if
487 .I flags
488 specifies
489 .B BPF_EXIST
490 and the element with
491 .I key
492 doesn't exist in the map.
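.IP
As an illustration of the error cases above, a caller might translate
.I errno
after a failed update into a diagnostic.
The function name
.BR update_errno_hint ()
and the message strings are hypothetical and serve only to restate
the rules in code form:
.IP
.in +4n
.EX
#include <errno.h>

/* Map an errno value left by BPF_MAP_UPDATE_ELEM to a short hint. */
static const char *
update_errno_hint(int err)
{
    switch (err) {
    case E2BIG:
        return "map is full (max_entries reached)";
    case EEXIST:
        return "BPF_NOEXIST given, but the key already exists";
    case ENOENT:
        return "BPF_EXIST given, but the key does not exist";
    default:
        return "other error (see ERRORS)";
    }
}
.EE
.in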
493 .TP
494 .B BPF_MAP_DELETE_ELEM
495 The
496 .B BPF_MAP_DELETE_ELEM
497 command
498 deletes the element whose key is
499 .I key
500 from the map referred to by the file descriptor
501 .IR fd .
502 .IP
503 .in +4n
504 .EX
int
bpf_delete_elem(int fd, const void *key)
{
    union bpf_attr attr = {
        .map_fd = fd,
        .key    = ptr_to_u64(key),
    };

    return bpf(BPF_MAP_DELETE_ELEM, &attr, sizeof(attr));
}
515 .EE
516 .in
517 .IP
518 On success, zero is returned.
519 If the element is not found, \-1 is returned and
520 .I errno
521 is set to
522 .BR ENOENT .
523 .TP
524 .B BPF_MAP_GET_NEXT_KEY
525 The
526 .B BPF_MAP_GET_NEXT_KEY
527 command looks up an element by
528 .I key
529 in the map referred to by the file descriptor
530 .I fd
531 and sets the
532 .I next_key
533 pointer to the key of the next element.
534 .IP
535 .in +4n
536 .EX
int
bpf_get_next_key(int fd, const void *key, void *next_key)
{
    union bpf_attr attr = {
        .map_fd   = fd,
        .key      = ptr_to_u64(key),
        .next_key = ptr_to_u64(next_key),
    };

    return bpf(BPF_MAP_GET_NEXT_KEY, &attr, sizeof(attr));
}
548 .EE
549 .in
550 .IP
551 If
552 .I key
553 is found, the operation returns zero and sets the
554 .I next_key
555 pointer to the key of the next element.
556 If
557 .I key
558 is not found, the operation returns zero and sets the
559 .I next_key
560 pointer to the key of the first element.
561 If
562 .I key
563 is the last element, \-1 is returned and
564 .I errno
565 is set to
566 .BR ENOENT .
567 Other possible
568 .I errno
569 values are
570 .BR ENOMEM ,
571 .BR EFAULT ,
572 .BR EPERM ,
573 and
574 .BR EINVAL .
575 This method can be used to iterate over all elements in the map.
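.IP
The iteration described above might be sketched as follows.
The function
.BR count_map_keys ()
is hypothetical.
It assumes that the caller supplies a key that is not present in the map
(so that the first call yields the first key, as described above)
and that keys are at most 64 bytes long;
both are simplifications made for this sketch.
.IP
.in +4n
.EX
#define _GNU_SOURCE
#include <errno.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/bpf.h>

static int
bpf(int cmd, union bpf_attr *attr, unsigned int size)
{
    return (int) syscall(__NR_bpf, cmd, attr, size);
}

/* Walk all keys of the map referred to by 'fd'.
   Returns the number of keys visited, or \-1 on error. */
static long
count_map_keys(int fd, const void *absent_key, size_t key_size)
{
    unsigned char cur[64], next[64];
    union bpf_attr attr;
    long n = 0;

    if (key_size > sizeof(cur)) {
        errno = E2BIG;
        return -1;
    }
    memcpy(cur, absent_key, key_size);

    for (;;) {
        memset(&attr, 0, sizeof(attr));
        attr.map_fd = fd;
        attr.key = (uint64_t) (uintptr_t) cur;
        attr.next_key = (uint64_t) (uintptr_t) next;

        /* ENOENT marks the end of the map; anything else is
           treated as a real error */
        if (bpf(BPF_MAP_GET_NEXT_KEY, &attr, sizeof(attr)) == -1)
            return (errno == ENOENT) ? n : -1;

        n++;
        memcpy(cur, next, key_size);    /* continue from this key */
    }
}
.EE
.in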
576 .TP
577 .B close(map_fd)
578 Delete the map referred to by the file descriptor
579 .IR map_fd .
580 When the user-space program that created a map exits, all maps will
581 be deleted automatically (but see NOTES).
582 .\"
583 .SS eBPF map types
584 The following map types are supported:
585 .TP
586 .B BPF_MAP_TYPE_HASH
587 .\" commit 0f8e4bd8a1fc8c4185f1630061d0a1f2d197a475
588 Hash-table maps have the following characteristics:
589 .RS
590 .IP * 3
591 Maps are created and destroyed by user-space programs.
592 Both user-space and eBPF programs
593 can perform lookup, update, and delete operations.
594 .IP *
595 The kernel takes care of allocating and freeing key/value pairs.
596 .IP *
597 The
598 .BR map_update_elem ()
helper will fail to insert a new element when the
600 .I max_entries
601 limit is reached.
602 (This ensures that eBPF programs cannot exhaust memory.)
603 .IP *
604 .BR map_update_elem ()
605 replaces existing elements atomically.
606 .RE
607 .IP
608 Hash-table maps are
609 optimized for speed of lookup.
610 .TP
611 .B BPF_MAP_TYPE_ARRAY
612 .\" commit 28fbcfa08d8ed7c5a50d41a0433aad222835e8e3
613 Array maps have the following characteristics:
614 .RS
615 .IP * 3
616 Optimized for fastest possible lookup.
617 In the future the verifier/JIT compiler
may recognize lookup() operations that employ a constant key
and optimize them into a constant pointer.
620 It is possible to optimize a non-constant
621 key into direct pointer arithmetic as well, since pointers and
622 .I value_size
623 are constant for the life of the eBPF program.
624 In other words,
625 .BR array_map_lookup_elem ()
626 may be 'inlined' by the verifier/JIT compiler
627 while preserving concurrent access to this map from user space.
628 .IP *
All array elements are preallocated and zero-initialized at init time.
630 .IP *
631 The key is an array index, and must be exactly four bytes.
632 .IP *
633 .BR map_delete_elem ()
634 fails with the error
635 .BR EINVAL ,
636 since elements cannot be deleted.
637 .IP *
638 .BR map_update_elem ()
639 replaces elements in a
640 .B nonatomic
641 fashion;
642 for atomic updates, a hash-table map should be used instead.
643 There is however one special case that can also be used with arrays:
644 the atomic built-in
645 .B __sync_fetch_and_add()
can be used on 32-bit and 64-bit atomic counters.
647 For example, it can be
648 applied on the whole value itself if it represents a single counter,
649 or in case of a structure containing multiple counters, it could be
650 used on individual counters.
651 This is quite often useful for aggregation and accounting of events.
652 .RE
653 .IP
654 Among the uses for array maps are the following:
655 .RS
656 .IP * 3
657 As "global" eBPF variables: an array of 1 element whose key is (index) 0
658 and where the value is a collection of 'global' variables which
659 eBPF programs can use to keep state between events.
660 .IP *
661 Aggregation of tracing events into a fixed set of buckets.
662 .IP *
663 Accounting of networking events, for example, number of packets and packet
664 sizes.
665 .RE
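.IP
The counter pattern mentioned above can be illustrated as follows.
This is ordinary user-space C, shown only to make the access pattern
concrete;
.I struct counters
and
.BR account_packet ()
are hypothetical, and an eBPF program would apply the same built-in to
the pointer returned by a map lookup.
.IP
.in +4n
.EX
#include <stdint.h>

/* Hypothetical value layout for a one-element "globals" array map */
struct counters {
    uint64_t packets;
    uint64_t bytes;
};

/* Update each counter atomically and independently of the other. */
static void
account_packet(struct counters *c, uint64_t len)
{
    __sync_fetch_and_add(&c->packets, 1);
    __sync_fetch_and_add(&c->bytes, len);
}
.EE
.in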
666 .TP
667 .BR BPF_MAP_TYPE_PROG_ARRAY " (since Linux 4.2)"
668 A program array map is a special kind of array map whose map values
669 contain only file descriptors referring to other eBPF programs.
670 Thus, both the
671 .I key_size
672 and
673 .I value_size
674 must be exactly four bytes.
675 This map is used in conjunction with the
676 .BR bpf_tail_call ()
677 helper.
678 .IP
679 This means that an eBPF program with a program array map attached to it
680 can call from kernel side into
681 .IP
682 .in +4n
683 .EX
void bpf_tail_call(void *context, void *prog_map,
                   unsigned int index);
686 .EE
687 .in
688 .IP
689 and therefore replace its own program flow with the one from the program
690 at the given program array slot, if present.
This can be regarded as a kind of jump table to a different eBPF program.
692 The invoked program will then reuse the same stack.
693 When a jump into the new program has been performed,
694 it won't return to the old program anymore.
695 .IP
696 If no eBPF program is found at the given index of the program array
697 (because the map slot doesn't contain a valid program file descriptor,
698 the specified lookup index/key is out of bounds,
699 or the limit of 32
700 .\" MAX_TAIL_CALL_CNT
nested calls has been exceeded),
702 execution continues with the current eBPF program.
703 This can be used as a fall-through for default cases.
704 .IP
705 A program array map is useful, for example, in tracing or networking, to
706 handle individual system calls or protocols in their own subprograms and
707 use their identifiers as an individual map index.
708 This approach may result in performance benefits,
709 and also makes it possible to overcome the maximum
710 instruction limit of a single eBPF program.
711 In dynamic environments,
712 a user-space daemon might atomically replace individual subprograms
713 at run-time with newer versions to alter overall program behavior,
714 for instance, if global policies change.
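.IP
From user space, a slot in a program array can be filled with
.BR BPF_MAP_UPDATE_ELEM ,
passing the 4-byte index as the key and the program's file descriptor
as the 4-byte value.
The following is a minimal sketch under that assumption;
.BR prog_array_set ()
is a hypothetical name, and the choice of
.B BPF_ANY
is made here so that an occupied slot is simply replaced.
.IP
.in +4n
.EX
#define _GNU_SOURCE
#include <stdint.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/bpf.h>

static int
bpf(int cmd, union bpf_attr *attr, unsigned int size)
{
    return (int) syscall(__NR_bpf, cmd, attr, size);
}

/* Store 'prog_fd' in slot 'index' of the program array 'map_fd'.
   Returns 0 on success, or \-1 with errno set. */
static int
prog_array_set(int map_fd, uint32_t index, int prog_fd)
{
    union bpf_attr attr;

    memset(&attr, 0, sizeof(attr));
    attr.map_fd = map_fd;
    attr.key = (uint64_t) (uintptr_t) &index;
    attr.value = (uint64_t) (uintptr_t) &prog_fd;
    attr.flags = BPF_ANY;

    return bpf(BPF_MAP_UPDATE_ELEM, &attr, sizeof(attr));
}
.EE
.in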
715 .\"
716 .SS eBPF programs
717 The
718 .B BPF_PROG_LOAD
719 command is used to load an eBPF program into the kernel.
720 The return value for this command is a new file descriptor associated
721 with this eBPF program.
722 .PP
723 .in +4n
724 .EX
char bpf_log_buf[LOG_BUF_SIZE];

int
bpf_prog_load(enum bpf_prog_type type,
              const struct bpf_insn *insns, int insn_cnt,
              const char *license)
{
    union bpf_attr attr = {
        .prog_type = type,
        .insns     = ptr_to_u64(insns),
        .insn_cnt  = insn_cnt,
        .license   = ptr_to_u64(license),
        .log_buf   = ptr_to_u64(bpf_log_buf),
        .log_size  = LOG_BUF_SIZE,
        .log_level = 1,
    };

    return bpf(BPF_PROG_LOAD, &attr, sizeof(attr));
}
744 .EE
745 .in
746 .PP
747 .I prog_type
748 is one of the available program types:
749 .IP
750 .in +4n
751 .EX
enum bpf_prog_type {
    BPF_PROG_TYPE_UNSPEC,        /* Reserve 0 as invalid
                                    program type */
    BPF_PROG_TYPE_SOCKET_FILTER,
    BPF_PROG_TYPE_KPROBE,
    BPF_PROG_TYPE_SCHED_CLS,
    BPF_PROG_TYPE_SCHED_ACT,
};
760 .EE
761 .in
762 .PP
763 For further details of eBPF program types, see below.
764 .PP
765 The remaining fields of
766 .I bpf_attr
767 are set as follows:
768 .IP * 3
769 .I insns
770 is an array of
771 .I "struct bpf_insn"
772 instructions.
773 .IP *
774 .I insn_cnt
775 is the number of instructions in the program referred to by
776 .IR insns .
777 .IP *
778 .I license
779 is a license string, which must be GPL compatible to call helper functions
780 marked
781 .IR gpl_only .
782 (The licensing rules are the same as for kernel modules,
so dual licenses, such as "Dual BSD/GPL", may also be used.)
784 .IP *
785 .I log_buf
786 is a pointer to a caller-allocated buffer in which the in-kernel
787 verifier can store the verification log.
788 This log is a multi-line string that can be checked by
789 the program author in order to understand how the verifier came to
790 the conclusion that the eBPF program is unsafe.
791 The format of the output can change at any time as the verifier evolves.
792 .IP *
793 .I log_size
is the size of the buffer pointed to by
795 .IR log_buf .
796 If the size of the buffer is not large enough to store all
797 verifier messages, \-1 is returned and
798 .I errno
799 is set to
800 .BR ENOSPC .
801 .IP *
802 .I log_level
is the verbosity level of the verifier.
804 A value of zero means that the verifier will not provide a log;
805 in this case,
806 .I log_buf
807 must be a NULL pointer, and
808 .I log_size
809 must be zero.
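.PP
The rule tying
.IR log_level ,
.IR log_buf ,
and
.I log_size
together can be captured in a small helper.
The following sketch is illustrative only;
.BR set_verifier_log ()
is a hypothetical name, and a level of 1 is assumed whenever a buffer
is supplied.
.PP
.in +4n
.EX
#include <stddef.h>
#include <stdint.h>
#include <linux/bpf.h>

/* Fill in the three log fields consistently: either a buffer with
   log_level 1, or no buffer with all three fields zero. */
static void
set_verifier_log(union bpf_attr *attr, char *buf, size_t size)
{
    if (buf != NULL && size > 0) {
        attr->log_level = 1;
        attr->log_buf = (uint64_t) (uintptr_t) buf;
        attr->log_size = (uint32_t) size;
    } else {
        attr->log_level = 0;
        attr->log_buf = 0;
        attr->log_size = 0;
    }
}
.EE
.in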
810 .PP
811 Applying
812 .BR close (2)
813 to the file descriptor returned by
814 .B BPF_PROG_LOAD
815 will unload the eBPF program (but see NOTES).
816 .PP
817 Maps are accessible from eBPF programs and are used to exchange data between
818 eBPF programs and between eBPF programs and user-space programs.
819 For example,
eBPF programs can process various events (such as kprobes and packets) and
821 store their data into a map,
822 and user-space programs can then fetch data from the map.
823 Conversely, user-space programs can use a map as a configuration mechanism,
824 populating the map with values checked by the eBPF program,
825 which then modifies its behavior on the fly according to those values.
826 .\"
827 .\"
828 .SS eBPF program types
829 The eBPF program type
830 .RI ( prog_type )
831 determines the subset of kernel helper functions that the program
832 may call.
833 The program type also determines the program input (context)\(emthe
834 format of
835 .I "struct bpf_context"
836 (which is the data blob passed into the eBPF program as the first argument).
837 .\"
838 .\" FIXME
839 .\" Somewhere in this page we need a general introduction to the
840 .\" bpf_context. For example, how does a BPF program access the
841 .\" context?
842 .PP
843 For example, a tracing program does not have the exact same
844 subset of helper functions as a socket filter program
845 (though they may have some helpers in common).
846 Similarly,
847 the input (context) for a tracing program is a set of register values,
848 while for a socket filter it is a network packet.
849 .PP
850 The set of functions available to eBPF programs of a given type may increase
851 in the future.
852 .PP
853 The following program types are supported:
854 .TP
855 .BR BPF_PROG_TYPE_SOCKET_FILTER " (since Linux 3.19)"
856 Currently, the set of functions for
857 .B BPF_PROG_TYPE_SOCKET_FILTER
858 is:
859 .IP
860 .in +4n
861 .EX
bpf_map_lookup_elem(map_fd, void *key)
                    /* look up key in a map_fd */
bpf_map_update_elem(map_fd, void *key, void *value)
                    /* update key/value */
bpf_map_delete_elem(map_fd, void *key)
                    /* delete key in a map_fd */
868 .EE
869 .in
870 .IP
871 The
872 .I bpf_context
873 argument is a pointer to a
874 .IR "struct __sk_buff" .
875 .\" FIXME: We need some text here to explain how the program
876 .\" accesses __sk_buff.
877 .\" See 'struct __sk_buff' and commit 9bac3d6d548e5
878 .\"
879 .\" Alexei commented:
880 .\" Actually now in case of SOCKET_FILTER, SCHED_CLS, SCHED_ACT
881 .\" the program can now access skb fields.
882 .\"
883 .TP
884 .BR BPF_PROG_TYPE_KPROBE " (since Linux 4.1)"
885 .\" commit 2541517c32be2531e0da59dfd7efc1ce844644f5
886 [To be documented]
887 .\" FIXME Document this program type
888 .\" Describe allowed helper functions for this program type
889 .\" Describe bpf_context for this program type
890 .\"
891 .\" FIXME We need text here to describe 'kern_version'
892 .TP
893 .BR BPF_PROG_TYPE_SCHED_CLS " (since Linux 4.1)"
894 .\" commit 96be4325f443dbbfeb37d2a157675ac0736531a1
895 .\" commit e2e9b6541dd4b31848079da80fe2253daaafb549
896 [To be documented]
897 .\" FIXME Document this program type
898 .\" Describe allowed helper functions for this program type
899 .\" Describe bpf_context for this program type
900 .TP
901 .BR BPF_PROG_TYPE_SCHED_ACT " (since Linux 4.1)"
902 .\" commit 94caee8c312d96522bcdae88791aaa9ebcd5f22c
903 .\" commit a8cb5f556b567974d75ea29c15181c445c541b1f
904 [To be documented]
905 .\" FIXME Document this program type
906 .\" Describe allowed helper functions for this program type
907 .\" Describe bpf_context for this program type
908 .SS Events
909 Once a program is loaded, it can be attached to an event.
910 Various kernel subsystems have different ways to do so.
911 .PP
912 Since Linux 3.19,
913 .\" commit 89aa075832b0da4402acebd698d0411dcc82d03e
914 the following call will attach the program
915 .I prog_fd
916 to the socket
917 .IR sockfd ,
918 which was created by an earlier call to
919 .BR socket (2):
920 .PP
921 .in +4n
922 .EX
setsockopt(sockfd, SOL_SOCKET, SO_ATTACH_BPF,
           &prog_fd, sizeof(prog_fd));
925 .EE
926 .in
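.PP
The socket passed to
.BR setsockopt (2)
must already exist.
One way to obtain a raw packet socket bound to a given interface is
sketched below.
The helper
.BR open_raw_sock ()
is illustrative (a function of this name is used, but not defined,
in EXAMPLES); the details shown here, such as the use of
.B ETH_P_ALL
and
.BR SOCK_CLOEXEC ,
are assumptions of this sketch, and creating such a socket typically
requires the
.B CAP_NET_RAW
capability.
.PP
.in +4n
.EX
#define _GNU_SOURCE
#include <arpa/inet.h>
#include <linux/if_ether.h>
#include <linux/if_packet.h>
#include <net/if.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Open a raw packet socket bound to the interface 'ifname'.
   Returns the socket, or \-1 on error. */
static int
open_raw_sock(const char *ifname)
{
    struct sockaddr_ll sll;
    unsigned int ifindex;
    int sock;

    ifindex = if_nametoindex(ifname);
    if (ifindex == 0)               /* no such interface */
        return -1;

    sock = socket(AF_PACKET, SOCK_RAW | SOCK_CLOEXEC,
                  htons(ETH_P_ALL));
    if (sock == -1)
        return -1;

    memset(&sll, 0, sizeof(sll));
    sll.sll_family = AF_PACKET;
    sll.sll_ifindex = (int) ifindex;
    sll.sll_protocol = htons(ETH_P_ALL);

    if (bind(sock, (struct sockaddr *) &sll, sizeof(sll)) == -1) {
        close(sock);
        return -1;
    }

    return sock;
}
.EE
.in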
927 .PP
928 Since Linux 4.1,
929 .\" commit 2541517c32be2531e0da59dfd7efc1ce844644f5
930 the following call may be used to attach
931 the eBPF program referred to by the file descriptor
932 .I prog_fd
933 to a perf event file descriptor,
934 .IR event_fd ,
935 that was created by a previous call to
936 .BR perf_event_open (2):
937 .PP
938 .in +4n
939 .EX
940 ioctl(event_fd, PERF_EVENT_IOC_SET_BPF, prog_fd);
941 .EE
942 .in
943 .\"
944 .\"
945 .SH EXAMPLES
946 .EX
/* bpf+sockets example:
 * 1. create array map of 256 elements
 * 2. load program that counts number of packets received
 *    r0 = skb->data[ETH_HLEN + offsetof(struct iphdr, protocol)]
 *    map[r0]++
 * 3. attach prog_fd to raw socket via setsockopt()
 * 4. print number of received TCP/UDP packets every second
 */
int
main(int argc, char **argv)
{
    int sock, map_fd, prog_fd, key;
    long long value = 0, tcp_cnt, udp_cnt;

    map_fd = bpf_create_map(BPF_MAP_TYPE_ARRAY, sizeof(key),
                            sizeof(value), 256);
    if (map_fd < 0) {
        printf("failed to create map '%s'\en", strerror(errno));
        /* likely not run as root */
        return 1;
    }

    struct bpf_insn prog[] = {
        BPF_MOV64_REG(BPF_REG_6, BPF_REG_1),        /* r6 = r1 */
        BPF_LD_ABS(BPF_B, ETH_HLEN + offsetof(struct iphdr, protocol)),
                                /* r0 = ip->proto */
        BPF_STX_MEM(BPF_W, BPF_REG_10, BPF_REG_0, -4),
                                /* *(u32 *)(fp - 4) = r0 */
        BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),       /* r2 = fp */
        BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -4),      /* r2 = r2 - 4 */
        BPF_LD_MAP_FD(BPF_REG_1, map_fd),           /* r1 = map_fd */
        BPF_CALL_FUNC(BPF_FUNC_map_lookup_elem),
                                /* r0 = map_lookup(r1, r2) */
        BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 2),
                                /* if (r0 == 0) goto pc+2 */
        BPF_MOV64_IMM(BPF_REG_1, 1),                /* r1 = 1 */
        BPF_XADD(BPF_DW, BPF_REG_0, BPF_REG_1, 0, 0),
                                /* lock *(u64 *) r0 += r1 */
.\"                                == atomic64_add
        BPF_MOV64_IMM(BPF_REG_0, 0),                /* r0 = 0 */
        BPF_EXIT_INSN(),                            /* return r0 */
    };

    prog_fd = bpf_prog_load(BPF_PROG_TYPE_SOCKET_FILTER, prog,
                            sizeof(prog) / sizeof(prog[0]), "GPL");

    sock = open_raw_sock("lo");

    assert(setsockopt(sock, SOL_SOCKET, SO_ATTACH_BPF, &prog_fd,
                      sizeof(prog_fd)) == 0);

    for (;;) {
        key = IPPROTO_TCP;
        assert(bpf_lookup_elem(map_fd, &key, &tcp_cnt) == 0);
        key = IPPROTO_UDP;
        assert(bpf_lookup_elem(map_fd, &key, &udp_cnt) == 0);
        printf("TCP %lld UDP %lld packets\en", tcp_cnt, udp_cnt);
        sleep(1);
    }

    return 0;
}
1009 .EE
1010 .PP
1011 Some complete working code can be found in the
1012 .I samples/bpf
1013 directory in the kernel source tree.
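.PP
The instruction macros used in the program above, such as
.BR BPF_MOV64_IMM ()
and
.BR BPF_EXIT_INSN (),
are not provided by
.IR <linux/bpf.h> ;
in the kernel sources they expand to initializers for
.IR "struct bpf_insn" .
As a hedged sketch of what such an expansion looks like,
the final two instructions of the program could be written out by hand
as follows
.RI ( trivial_prog
is a hypothetical name):
.PP
.in +4n
.EX
#include <linux/bpf.h>

/* "r0 = 0; exit", encoded without helper macros */
static const struct bpf_insn trivial_prog[] = {
    {   /* BPF_MOV64_IMM(BPF_REG_0, 0): r0 = 0 */
        .code = BPF_ALU64 | BPF_MOV | BPF_K,
        .dst_reg = BPF_REG_0,
        .src_reg = 0,
        .off = 0,
        .imm = 0,
    },
    {   /* BPF_EXIT_INSN(): return r0 */
        .code = BPF_JMP | BPF_EXIT,
        .dst_reg = 0,
        .src_reg = 0,
        .off = 0,
        .imm = 0,
    },
};
.EE
.in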
1014 .SH RETURN VALUE
1015 For a successful call, the return value depends on the operation:
1016 .TP
1017 .B BPF_MAP_CREATE
1018 The new file descriptor associated with the eBPF map.
1019 .TP
1020 .B BPF_PROG_LOAD
1021 The new file descriptor associated with the eBPF program.
1022 .TP
1023 All other commands
1024 Zero.
1025 .PP
1026 On error, \-1 is returned, and
1027 .I errno
1028 is set appropriately.
1029 .SH ERRORS
1030 .TP
1031 .B E2BIG
1032 The eBPF program is too large or a map reached the
1033 .I max_entries
1034 limit (maximum number of elements).
1035 .TP
1036 .B EACCES
1037 For
1038 .BR BPF_PROG_LOAD ,
1039 even though all program instructions are valid, the program has been
1040 rejected because it was deemed unsafe.
Possible reasons include
access to a disallowed memory region or an uninitialized stack slot or register,
a mismatch between a function's argument constraints and the actual types,
or a misaligned memory access.
1045 In this case, it is recommended to call
1046 .BR bpf ()
1047 again with
1048 .I log_level = 1
1049 and examine
1050 .I log_buf
1051 for the specific reason provided by the verifier.
.TP
.B EBADF
.I fd
is not an open file descriptor.
.TP
.B EFAULT
One of the pointers
.RI ( key
or
.I value
or
.I log_buf
or
.IR insns )
is outside the accessible address space.
.TP
.B EINVAL
The value specified in
.I cmd
is not recognized by this kernel.
.TP
.B EINVAL
For
.BR BPF_MAP_CREATE ,
either
.I map_type
or attributes are invalid.
.TP
.B EINVAL
For
.B BPF_MAP_*_ELEM
commands,
some of the fields of
.I "union bpf_attr"
that are not used by this command
are not set to zero.
.TP
.B EINVAL
For
.BR BPF_PROG_LOAD ,
indicates an attempt to load an invalid program.
eBPF programs can be deemed
invalid due to unrecognized instructions, the use of reserved fields, jumps
out of range, infinite loops or calls of unknown functions.
.TP
.B ENOENT
For
.B BPF_MAP_LOOKUP_ELEM
or
.BR BPF_MAP_DELETE_ELEM ,
indicates that the element with the given
.I key
was not found.
.TP
.B ENOMEM
Cannot allocate sufficient memory.
.TP
.B EPERM
The call was made without sufficient privilege
(without the
.B CAP_SYS_ADMIN
capability).
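.PP
The advice given under
.B EACCES
(repeat the load with
.I log_level = 1
and inspect
.IR log_buf )
might be sketched as follows.
The helper name
.IR load_with_verifier_log ()
is hypothetical, the program type is fixed to
.B BPF_PROG_TYPE_SOCKET_FILTER
purely for illustration,
and the code assumes that
.I <linux/bpf.h>
and
.B __NR_bpf
are available.

```c
#include <assert.h>
#include <linux/bpf.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical helper: try to load insns as a socket filter while
   asking the verifier to write its messages into log.
   Returns the program file descriptor, or -1 with errno set. */
int load_with_verifier_log(const struct bpf_insn *insns,
                           unsigned int insn_cnt,
                           char *log, unsigned int log_size)
{
    union bpf_attr attr;

    assert(log != NULL && log_size > 0);
    log[0] = '\0';          /* leave an empty string if nothing is logged */

    memset(&attr, 0, sizeof(attr));     /* unused fields must be zero */
    attr.prog_type = BPF_PROG_TYPE_SOCKET_FILTER;
    attr.insns = (__u64) (unsigned long) insns;
    attr.insn_cnt = insn_cnt;
    attr.license = (__u64) (unsigned long) "GPL";
    attr.log_level = 1;     /* request verifier output */
    attr.log_buf = (__u64) (unsigned long) log;
    attr.log_size = log_size;

    return (int) syscall(__NR_bpf, BPF_PROG_LOAD, &attr, sizeof(attr));
}
```

.PP
If the call returns \-1 with
.I errno
set to
.BR EACCES ,
the caller can print the contents of
.I log
to see why the verifier rejected the program.
A buffer of a few kilobytes is assumed to be enough for small programs.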
.SH VERSIONS
The
.BR bpf ()
system call first appeared in Linux 3.18.
.SH CONFORMING TO
The
.BR bpf ()
system call is Linux-specific.
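.SH NOTES
Prior to Linux 4.4, all
.BR bpf ()
commands require the caller to have the
.B CAP_SYS_ADMIN
capability.
From Linux 4.4 onwards, an unprivileged user may create limited programs
of type
.B BPF_PROG_TYPE_SOCKET_FILTER
and associated maps, unless unprivileged use has been disabled via
.IR /proc/sys/kernel/unprivileged_bpf_disabled .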
.PP
eBPF objects (maps and programs) can be shared between processes.
For example, after
.BR fork (2),
the child inherits file descriptors referring to the same eBPF objects.
In addition, file descriptors referring to eBPF objects can be
transferred over UNIX domain sockets.
File descriptors referring to eBPF objects can be duplicated
in the usual way, using
.BR dup (2)
and similar calls.
An eBPF object is deallocated only after all file descriptors
referring to the object have been closed.
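.PP
Transferring a file descriptor over a UNIX domain socket uses the
.B SCM_RIGHTS
ancillary message described in
.BR unix (7).
The following minimal sketch is generic: it works for any file descriptor,
including one that refers to an eBPF map or program.
The helper names
.IR send_fd ()
and
.IR recv_fd ()
are hypothetical and are introduced here only for illustration.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>

/* Buffer for one SCM_RIGHTS message carrying a single descriptor,
   aligned as required for struct cmsghdr. */
union fd_cmsg {
    char buf[CMSG_SPACE(sizeof(int))];
    struct cmsghdr align;
};

/* Send fd over the connected UNIX domain socket sock.
   Returns 0 on success, -1 on error. */
int send_fd(int sock, int fd)
{
    char dummy = 'x';       /* at least one byte of real data is sent */
    struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
    union fd_cmsg u;
    struct msghdr msg;
    struct cmsghdr *cmsg;

    assert(fd >= 0);

    memset(&u, 0, sizeof(u));
    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = u.buf;
    msg.msg_controllen = sizeof(u.buf);

    cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive a descriptor sent by send_fd().
   Returns the new descriptor, or -1 on error. */
int recv_fd(int sock)
{
    char dummy;
    struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
    union fd_cmsg u;
    struct msghdr msg;
    struct cmsghdr *cmsg;
    int fd;

    memset(&u, 0, sizeof(u));
    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = u.buf;
    msg.msg_controllen = sizeof(u.buf);

    if (recvmsg(sock, &msg, 0) <= 0)
        return -1;

    cmsg = CMSG_FIRSTHDR(&msg);
    if (cmsg == NULL || cmsg->cmsg_level != SOL_SOCKET ||
        cmsg->cmsg_type != SCM_RIGHTS ||
        cmsg->cmsg_len != CMSG_LEN(sizeof(int)))
        return -1;

    memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
    return fd;
}
```

.PP
The receiving process obtains its own descriptor for the same object,
so the object stays allocated until both processes have closed
their descriptors.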
.PP
eBPF programs can be written in a restricted C that is compiled (using the
.B clang
compiler) into eBPF bytecode.
Various features are omitted from this restricted C, such as loops,
global variables, variadic functions, floating-point numbers,
and passing structures as function arguments.
Some examples can be found in the
.I samples/bpf/*_kern.c
files in the kernel source tree.
.\" There are also examples for the tc classifier, in the iproute2
.\" project, in examples/bpf
.PP
The kernel contains a just-in-time (JIT) compiler that translates
eBPF bytecode into native machine code for better performance.
In kernels before Linux 4.15,
the JIT compiler is disabled by default,
but its operation can be controlled by writing one of the
following integer strings to the file
.IR /proc/sys/net/core/bpf_jit_enable :
.IP 0 3
Disable JIT compilation (default).
.IP 1
Normal compilation.
.IP 2
Debugging mode.
The generated opcodes are dumped in hexadecimal into the kernel log.
These opcodes can then be disassembled using the program
.I tools/net/bpf_jit_disasm.c
provided in the kernel source tree.
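.PP
As a minimal sketch, a program might read the current setting from
.I /proc/sys/net/core/bpf_jit_enable
and map it to the three values listed above.
The function names
.IR jit_mode_name (),
.IR read_int_file (),
and
.IR print_jit_mode ()
are hypothetical and exist only in this example.

```c
#include <assert.h>
#include <stdio.h>

/* Translate a bpf_jit_enable value into the description used above. */
const char *jit_mode_name(int mode)
{
    switch (mode) {
    case 0:
        return "JIT compilation disabled";
    case 1:
        return "normal compilation";
    case 2:
        return "debugging mode";
    default:
        return "unknown value";
    }
}

/* Read a single integer from the file at path.
   Returns the integer, or -1 if the file cannot be opened or parsed. */
int read_int_file(const char *path)
{
    FILE *fp;
    int value = -1;

    assert(path != NULL);

    fp = fopen(path, "r");
    if (fp == NULL)
        return -1;
    if (fscanf(fp, "%d", &value) != 1)
        value = -1;
    fclose(fp);

    return value;
}

/* Print the current JIT setting of the running kernel. */
void print_jit_mode(void)
{
    int mode = read_int_file("/proc/sys/net/core/bpf_jit_enable");

    if (mode == -1)
        printf("bpf_jit_enable could not be read\n");
    else
        printf("bpf_jit_enable = %d (%s)\n", mode, jit_mode_name(mode));
}
```

.PP
Changing the setting requires writing to the same file
with sufficient privilege; this sketch only reads it.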
.PP
Since Linux 4.15,
.\" commit 290af86629b25ffd1ed6232c4e9107da031705cb
the kernel may be configured with the
.B CONFIG_BPF_JIT_ALWAYS_ON
option.
In this case, the JIT compiler is always enabled, and
.I bpf_jit_enable
is initialized to 1 and is immutable.
(This kernel configuration option was provided as a mitigation for
one of the Spectre attacks against the BPF interpreter.)
.PP
The JIT compiler for eBPF is currently
.\" Last reviewed in Linux 4.18-rc by grepping for BPF_ALU64 in arch/
.\" and by checking the documentation for bpf_jit_enable in
.\" Documentation/sysctl/net.txt
available for the following architectures:
.IP * 3
x86-64 (since Linux 3.18; cBPF since Linux 3.0);
.\" commit 0a14842f5a3c0e88a1e59fac5c3025db39721f74
.PD 0
.IP *
ARM32 (since Linux 3.18; cBPF since Linux 3.4);
.\" commit ddecdfcea0ae891f782ae853771c867ab51024c2
.IP *
SPARC 32 (since Linux 3.18; cBPF since Linux 3.5);
.\" commit 2809a2087cc44b55e4377d7b9be3f7f5d2569091
.IP *
ARM-64 (since Linux 3.18);
.\" commit e54bcde3d69d40023ae77727213d14f920eb264a
.IP *
s390 (since Linux 4.1; cBPF since Linux 3.7);
.\" commit c10302efe569bfd646b4c22df29577a4595b4580
.IP *
PowerPC 64 (since Linux 4.8; cBPF since Linux 3.1);
.\" commit 0ca87f05ba8bdc6791c14878464efc901ad71e99
.\" commit 156d0e290e969caba25f1851c52417c14d141b24
.IP *
SPARC 64 (since Linux 4.12);
.\" commit 7a12b5031c6b947cc13918237ae652b536243b76
.IP *
x86-32 (since Linux 4.18);
.\" commit 03f5781be2c7b7e728d724ac70ba10799cc710d7
.IP *
MIPS 64 (since Linux 4.18; cBPF since Linux 3.16);
.\" commit c6610de353da5ca6eee5b8960e838a87a90ead0c
.\" commit f381bf6d82f032b7410185b35d000ea370ac706b
.IP *
riscv (since Linux 5.1).
.\" commit 2353ecc6f91fd15b893fa01bf85a1c7a823ee4f2
.PD
.SH SEE ALSO
.BR seccomp (2),
.BR bpf-helpers (7),
.BR socket (7),
.BR tc (8),
.BR tc-bpf (8)
.PP
Both classic and extended BPF are explained in the kernel source file
.IR Documentation/networking/filter.txt .