1 .\" Copyright (C) 2015 Alexei Starovoitov <ast@kernel.org>
2 .\" and Copyright (C) 2015 Michael Kerrisk <mtk.manpages@gmail.com>
3 .\"
4 .\" %%%LICENSE_START(VERBATIM)
5 .\" Permission is granted to make and distribute verbatim copies of this
6 .\" manual provided the copyright notice and this permission notice are
7 .\" preserved on all copies.
8 .\"
9 .\" Permission is granted to copy and distribute modified versions of this
10 .\" manual under the conditions for verbatim copying, provided that the
11 .\" entire resulting derived work is distributed under the terms of a
12 .\" permission notice identical to this one.
13 .\"
14 .\" Since the Linux kernel and libraries are constantly changing, this
15 .\" manual page may be incorrect or out-of-date. The author(s) assume no
16 .\" responsibility for errors or omissions, or for damages resulting from
17 .\" the use of the information contained herein. The author(s) may not
18 .\" have taken the same level of care in the production of this manual,
19 .\" which is licensed free of charge, as they might when working
20 .\" professionally.
21 .\"
22 .\" Formatted or processed versions of this manual, if unaccompanied by
23 .\" the source, must acknowledge the copyright and authors of this work.
24 .\" %%%LICENSE_END
25 .\"
26 .TH BPF 2 2016-10-08 "Linux" "Linux Programmer's Manual"
27 .SH NAME
28 bpf \- perform a command on an extended BPF map or program
29 .SH SYNOPSIS
30 .nf
31 .B #include <linux/bpf.h>
32 .sp
33 .BI "int bpf(int " cmd ", union bpf_attr *" attr ", unsigned int " size );
34 .SH DESCRIPTION
35 The
36 .BR bpf ()
37 system call performs a range of operations related to extended
38 Berkeley Packet Filters.
39 Extended BPF (or eBPF) is similar to
40 the original ("classic") BPF (cBPF) used to filter network packets.
41 For both cBPF and eBPF programs,
42 the kernel statically analyzes the programs before loading them,
43 in order to ensure that they cannot harm the running system.
44 .P
45 eBPF extends cBPF in multiple ways, including the ability to call
46 a fixed set of in-kernel helper functions
47 .\" See 'enum bpf_func_id' in include/uapi/linux/bpf.h
48 (via the
49 .B BPF_CALL
50 opcode extension provided by eBPF)
51 and access shared data structures such as eBPF maps.
52 .\"
53 .SS Extended BPF Design/Architecture
54 eBPF maps are a generic data structure for storage of different data types.
55 Data types are generally treated as binary blobs, so a user just specifies
56 the size of the key and the size of the value at map-creation time.
57 In other words, a key/value for a given map can have an arbitrary structure.
58
59 A user process can create multiple maps (with key/value-pairs being
60 opaque bytes of data) and access them via file descriptors.
61 Different eBPF programs can access the same maps in parallel.
62 It's up to the user process and eBPF program to decide what they store
63 inside maps.
64
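As an illustration, a program that tracks network flows might describe
its keys and values with ordinary C structures and pass their sizes at
map-creation time.
The structure and field names below are hypothetical and are not taken
from any kernel header; this is only a sketch of the idea.

```c
#include <stdint.h>

/* Hypothetical key: identifies a flow by addresses and ports. */
struct flow_key {
    uint32_t saddr;
    uint32_t daddr;
    uint16_t sport;
    uint16_t dport;
};

/* Hypothetical value: per-flow counters. */
struct flow_stats {
    uint64_t packets;
    uint64_t bytes;
};

/* The sizes that would be supplied as key_size and value_size. */
enum {
    FLOW_KEY_SIZE   = sizeof(struct flow_key),
    FLOW_VALUE_SIZE = sizeof(struct flow_stats),
};
```

Because keys are treated as opaque bytes, any padding inside a key
structure should be zeroed before use; otherwise, keys that look
identical to the program may fail to match.
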
65 There's one special map type, called a program array.
66 This type of map stores file descriptors referring to other eBPF programs.
67 When a lookup in the map is performed, the program flow is
68 redirected in-place to the beginning of another eBPF program and does not
69 return to the calling program.
70 The level of nesting has a fixed limit of 32,
71 .\" Defined by the kernel constant MAX_TAIL_CALL_CNT in include/linux/bpf.h
72 so that infinite loops cannot be crafted.
73 At runtime, the program file descriptors stored in the map can be modified,
74 so program functionality can be altered based on specific requirements.
75 All programs referred to in a program-array map must
76 have been previously loaded into the kernel via
77 .BR bpf ().
78 If a map lookup fails, the current program continues its execution.
79 See
80 .B BPF_MAP_TYPE_PROG_ARRAY
81 below for further details.
82 .P
83 Generally, eBPF programs are loaded by the user process and automatically
84 unloaded when the process exits.
85 In some cases, for example,
86 .BR tc-bpf (8),
87 the program will continue to stay alive inside the kernel even after the
88 process that loaded the program exits.
89 In that case,
90 the tc subsystem holds a reference to the eBPF program after the
91 file descriptor has been closed by the user-space program.
92 Thus, whether a specific program continues to live inside the kernel
93 depends on how it is further attached to a given kernel subsystem
94 after it was loaded via
95 .BR bpf ().
96
97 Each eBPF program is a set of instructions that is safe to run until
98 its completion.
99 An in-kernel verifier statically determines that the eBPF program
100 terminates and is safe to execute.
101 During verification, the kernel increments reference counts for each of
102 the maps that the eBPF program uses,
103 so that the attached maps can't be removed until the program is unloaded.
104
105 eBPF programs can be attached to different events.
106 These events can be the arrival of network packets, tracing
107 events, classification events by network queueing disciplines
108 (for eBPF programs attached to a
109 .BR tc (8)
110 classifier), and other types that may be added in the future.
111 A new event triggers execution of the eBPF program, which
112 may store information about the event in eBPF maps.
113 Beyond storing data, eBPF programs may call a fixed set of
114 in-kernel helper functions.
115
116 The same eBPF program can be attached to multiple events and different
117 eBPF programs can access the same map:
118
119 .in +4n
120 .nf
121 tracing     tracing    tracing    packet      packet     packet
122 event A     event B    event C    on eth0     on eth1    on eth2
123  |             |         |          |           |          ^
124  |             |         |          |           v          |
125  --> tracing <--     tracing      socket    tc ingress   tc egress
126       prog_1          prog_2      prog_3    classifier    action
127       |  |              |           |         prog_4      prog_5
128    |---  -----|  |------|          map_3        |           |
129  map_1       map_2                              --| map_4 |--
130 .fi
131 .in
132 .\"
133 .SS Arguments
134 The operation to be performed by the
135 .BR bpf ()
136 system call is determined by the
137 .IR cmd
138 argument.
139 Each operation takes an accompanying argument,
140 provided via
141 .IR attr ,
142 which is a pointer to a union of type
143 .IR bpf_attr
144 (see below).
145 The
146 .I size
147 argument is the size of the union pointed to by
148 .IR attr .
149
150 The value provided in
151 .IR cmd
152 is one of the following:
153 .TP
154 .B BPF_MAP_CREATE
155 Create a map and return a file descriptor that refers to the map.
156 The close-on-exec file descriptor flag (see
157 .BR fcntl (2))
158 is automatically enabled for the new file descriptor.
159 .TP
160 .B BPF_MAP_LOOKUP_ELEM
161 Look up an element by key in a specified map and return its value.
162 .TP
163 .B BPF_MAP_UPDATE_ELEM
164 Create or update an element (key/value pair) in a specified map.
165 .TP
166 .B BPF_MAP_DELETE_ELEM
167 Look up and delete an element by key in a specified map.
168 .TP
169 .B BPF_MAP_GET_NEXT_KEY
170 Look up an element by key in a specified map and return the key
171 of the next element.
172 .TP
173 .B BPF_PROG_LOAD
174 Verify and load an eBPF program,
175 returning a new file descriptor associated with the program.
176 The close-on-exec file descriptor flag (see
177 .BR fcntl (2))
178 is automatically enabled for the new file descriptor.
179 .P
180 The
181 .I bpf_attr
182 union consists of various anonymous structures that are used by different
183 .BR bpf ()
184 commands:
185
186 .in +4n
187 .nf
188 union bpf_attr {
189     struct {    /* Used by BPF_MAP_CREATE */
190         __u32         map_type;
191         __u32         key_size;    /* size of key in bytes */
192         __u32         value_size;  /* size of value in bytes */
193         __u32         max_entries; /* maximum number of entries
194                                       in a map */
195     };
196
197     struct {    /* Used by BPF_MAP_*_ELEM and BPF_MAP_GET_NEXT_KEY
198                    commands */
199         __u32         map_fd;
200         __aligned_u64 key;
201         union {
202             __aligned_u64 value;
203             __aligned_u64 next_key;
204         };
205         __u64         flags;
206     };
207
208     struct {    /* Used by BPF_PROG_LOAD */
209         __u32         prog_type;
210         __u32         insn_cnt;
211         __aligned_u64 insns;      /* 'const struct bpf_insn *' */
212         __aligned_u64 license;    /* 'const char *' */
213         __u32         log_level;  /* verbosity level of verifier */
214         __u32         log_size;   /* size of user buffer */
215         __aligned_u64 log_buf;    /* user supplied 'char *'
216                                      buffer */
217         __u32         kern_version;
218                                   /* checked when prog_type=kprobe
219                                      (since Linux 4.1) */
220 .\" commit 2541517c32be2531e0da59dfd7efc1ce844644f5
221     };
222 } __attribute__((aligned(8)));
223 .fi
224 .in
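.PP
Pointers are passed to the kernel in the
.I __aligned_u64
fields of the union.
The following sketch shows one plausible way to define a conversion
helper and a thin
.BR bpf ()
wrapper built on
.BR syscall (2),
on the assumption that the C library provides no wrapper of its own.
The helper name
.I ptr_to_u64
simply follows the usage in the wrapper functions on this page.

```c
#include <linux/bpf.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Store a user-space pointer in one of the __aligned_u64 fields of
   union bpf_attr; going through uintptr_t keeps the conversion
   well defined on 32-bit systems too. */
static inline uint64_t
ptr_to_u64(const void *ptr)
{
    return (uint64_t) (uintptr_t) ptr;
}

/* Invoke the system call directly via syscall(2). */
static inline int
bpf(int cmd, union bpf_attr *attr, unsigned int size)
{
    return syscall(__NR_bpf, cmd, attr, size);
}
```
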
225 .\"
226 .SS eBPF maps
227 Maps are a generic data structure for storage of different types of data.
228 They allow sharing of data between eBPF kernel programs,
229 and also between kernel and user-space applications.
230
231 Each map type has the following attributes:
232
233 .PD 0
234 .IP * 3
235 type
236 .IP *
237 maximum number of elements
238 .IP *
239 key size in bytes
240 .IP *
241 value size in bytes
242 .PD
243 .PP
244 The following wrapper functions demonstrate how various
245 .BR bpf ()
246 commands can be used to access the maps.
247 The functions use the
248 .IR cmd
249 argument to invoke different operations.
250 .TP
251 .B BPF_MAP_CREATE
252 The
253 .B BPF_MAP_CREATE
254 command creates a new map,
255 returning a new file descriptor that refers to the map.
256
257 .in +4n
258 .nf
259 int
260 bpf_create_map(enum bpf_map_type map_type,
261                unsigned int key_size,
262                unsigned int value_size,
263                unsigned int max_entries)
264 {
265     union bpf_attr attr = {
266         .map_type    = map_type,
267         .key_size    = key_size,
268         .value_size  = value_size,
269         .max_entries = max_entries
270     };
271
272     return bpf(BPF_MAP_CREATE, &attr, sizeof(attr));
273 }
274 .fi
275 .in
276
277 The new map has the type specified by
278 .IR map_type ,
279 and attributes as specified in
280 .IR key_size ,
281 .IR value_size ,
282 and
283 .IR max_entries .
284 On success, this operation returns a file descriptor.
285 On error, \-1 is returned and
286 .I errno
287 is set to
288 .BR EINVAL ,
289 .BR EPERM ,
290 or
291 .BR ENOMEM .
292
293 The
294 .I key_size
295 and
296 .I value_size
297 attributes will be used by the verifier during program loading
298 to check that the program is calling
299 .BR bpf_map_*_elem ()
300 helper functions with a correctly initialized
301 .I key
302 and to check that the program doesn't access the map element
303 .I value
304 beyond the specified
305 .IR value_size .
306 For example, when a map is created with a
307 .IR key_size
308 of 8 and the eBPF program calls
309
310 .in +4n
311 .nf
312 bpf_map_lookup_elem(map_fd, fp - 4)
313 .fi
314 .in
315
316 the program will be rejected,
317 since the in-kernel helper function
318
319 bpf_map_lookup_elem(map_fd, void *key)
320
321 expects to read 8 bytes from the location pointed to by
322 .IR key ,
323 but the
324 .IR "fp\ -\ 4"
325 (where
326 .I fp
327 is the top of the stack)
328 starting address will cause out-of-bounds stack access.
329
330 Similarly, when a map is created with a
331 .I value_size
332 of 1 and the eBPF program contains
333
334 .in +4n
335 .nf
336 value = bpf_map_lookup_elem(...);
337 *(u32 *) value = 1;
338 .fi
339 .in
340
341 the program will be rejected, since it accesses the
342 .I value
343 pointer beyond the specified 1 byte
344 .I value_size
345 limit.
346
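Both rejections above come down to simple range checks.
A much simplified model of them (not the kernel's verifier code) might
look as follows.
The function names are made up for this illustration, and
.B MAX_BPF_STACK
is assumed to be 512 bytes, which is the eBPF stack size in the kernel.

```c
#include <stdbool.h>

#define MAX_BPF_STACK 512   /* assumed eBPF stack size in bytes */

/* fp points to the top of the stack, so valid stack bytes lie at
   offsets [-MAX_BPF_STACK, 0) relative to fp. */
static bool
stack_access_ok(int off, unsigned int size)
{
    return off >= -MAX_BPF_STACK && off < 0 &&
           size <= (unsigned int) -off;
}

/* A map value may be accessed only within [0, value_size). */
static bool
value_access_ok(unsigned int off, unsigned int size,
                unsigned int value_size)
{
    return off <= value_size && size <= value_size - off;
}
```

With these definitions, reading an 8-byte key at
.I "fp\ -\ 4"
fails the first check, and a 4-byte store into a 1-byte value fails the
second.
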
347 Currently, the following values are supported for
348 .IR map_type :
349
350 .in +4n
351 .nf
352 enum bpf_map_type {
353     BPF_MAP_TYPE_UNSPEC,  /* Reserve 0 as invalid map type */
354     BPF_MAP_TYPE_HASH,
355     BPF_MAP_TYPE_ARRAY,
356     BPF_MAP_TYPE_PROG_ARRAY,
357 };
358 .fi
359 .in
360
361 .I map_type
362 selects one of the available map implementations in the kernel.
363 .\" FIXME We need an explanation of why one might choose each of
364 .\" these map implementations
365 For all map types,
366 eBPF programs access maps with the same
367 .BR bpf_map_lookup_elem ()
368 and
369 .BR bpf_map_update_elem ()
370 helper functions.
371 Further details of the various map types are given below.
372 .TP
373 .B BPF_MAP_LOOKUP_ELEM
374 The
375 .B BPF_MAP_LOOKUP_ELEM
376 command looks up an element with a given
377 .I key
378 in the map referred to by the file descriptor
379 .IR fd .
380
381 .in +4n
382 .nf
383 int
384 bpf_lookup_elem(int fd, const void *key, void *value)
385 {
386     union bpf_attr attr = {
387         .map_fd = fd,
388         .key    = ptr_to_u64(key),
389         .value  = ptr_to_u64(value),
390     };
391
392     return bpf(BPF_MAP_LOOKUP_ELEM, &attr, sizeof(attr));
393 }
394 .fi
395 .in
396
397 If an element is found,
398 the operation returns zero and stores the element's value into
399 .IR value ,
400 which must point to a buffer of
401 .I value_size
402 bytes.
403
404 If no element is found, the operation returns \-1 and sets
405 .I errno
406 to
407 .BR ENOENT .
408 .TP
409 .B BPF_MAP_UPDATE_ELEM
410 The
411 .B BPF_MAP_UPDATE_ELEM
412 command
413 creates or updates an element with a given
414 .I key/value
415 in the map referred to by the file descriptor
416 .IR fd .
417
418 .in +4n
419 .nf
420 int
421 bpf_update_elem(int fd, const void *key, const void *value,
422                 uint64_t flags)
423 {
424     union bpf_attr attr = {
425         .map_fd = fd,
426         .key    = ptr_to_u64(key),
427         .value  = ptr_to_u64(value),
428         .flags  = flags,
429     };
430
431     return bpf(BPF_MAP_UPDATE_ELEM, &attr, sizeof(attr));
432 }
433 .fi
434 .in
435
436 The
437 .I flags
438 argument should be specified as one of the following:
439 .RS
440 .TP
441 .B BPF_ANY
442 Create a new element or update an existing element.
443 .TP
444 .B BPF_NOEXIST
445 Create a new element only if it did not exist.
446 .TP
447 .B BPF_EXIST
448 Update an existing element.
449 .RE
450 .IP
451 On success, the operation returns zero.
452 On error, \-1 is returned and
453 .I errno
454 is set to
455 .BR EINVAL ,
456 .BR EPERM ,
457 .BR ENOMEM ,
458 or
459 .BR E2BIG .
460 .B E2BIG
461 indicates that the number of elements in the map reached the
462 .I max_entries
463 limit specified at map creation time.
464 .B EEXIST
465 will be returned if
466 .I flags
467 specifies
468 .B BPF_NOEXIST
469 and the element with
470 .I key
471 already exists in the map.
472 .B ENOENT
473 will be returned if
474 .I flags
475 specifies
476 .B BPF_EXIST
477 and the element with
478 .I key
479 doesn't exist in the map.
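
The interaction between
.I flags
and the state of the map can be summarized in a small user-space model.
This is a sketch of the rules described above, not kernel code; the
function name and its two boolean parameters are invented for the
illustration, and only the three flag values listed here are modeled.

```c
#include <errno.h>
#include <linux/bpf.h>
#include <stdbool.h>

/* Returns 0 on success, or the errno value that the text describes.
   'exists' says whether the key is already present; 'full' says
   whether the map already holds max_entries elements. */
static int
update_outcome(unsigned long long flags, bool exists, bool full)
{
    if (flags != BPF_ANY && flags != BPF_NOEXIST && flags != BPF_EXIST)
        return EINVAL;
    if (flags == BPF_NOEXIST && exists)
        return EEXIST;
    if (flags == BPF_EXIST && !exists)
        return ENOENT;
    if (!exists && full)
        return E2BIG;       /* inserting would exceed max_entries */
    return 0;
}
```

Note that, in this model, replacing an existing element succeeds even
when the map is full, since no new entry is needed.
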
480 .TP
481 .B BPF_MAP_DELETE_ELEM
482 The
483 .B BPF_MAP_DELETE_ELEM
484 command
485 deletes the element whose key is
486 .I key
487 from the map referred to by the file descriptor
488 .IR fd .
489
490 .in +4n
491 .nf
492 int
493 bpf_delete_elem(int fd, const void *key)
494 {
495     union bpf_attr attr = {
496         .map_fd = fd,
497         .key    = ptr_to_u64(key),
498     };
499
500     return bpf(BPF_MAP_DELETE_ELEM, &attr, sizeof(attr));
501 }
502 .fi
503 .in
504
505 On success, zero is returned.
506 If the element is not found, \-1 is returned and
507 .I errno
508 is set to
509 .BR ENOENT .
510 .TP
511 .B BPF_MAP_GET_NEXT_KEY
512 The
513 .B BPF_MAP_GET_NEXT_KEY
514 command looks up an element by
515 .I key
516 in the map referred to by the file descriptor
517 .IR fd
518 and sets the
519 .I next_key
520 pointer to the key of the next element.
521
522 .nf
523 .in +4n
524 int
525 bpf_get_next_key(int fd, const void *key, void *next_key)
526 {
527     union bpf_attr attr = {
528         .map_fd   = fd,
529         .key      = ptr_to_u64(key),
530         .next_key = ptr_to_u64(next_key),
531     };
532
533     return bpf(BPF_MAP_GET_NEXT_KEY, &attr, sizeof(attr));
534 }
535 .fi
536 .in
537
538 If
539 .I key
540 is found, the operation returns zero and sets the
541 .I next_key
542 pointer to the key of the next element.
543 If
544 .I key
545 is not found, the operation returns zero and sets the
546 .I next_key
547 pointer to the key of the first element.
548 If
549 .I key
550 is the last element, \-1 is returned and
551 .I errno
552 is set to
553 .BR ENOENT .
554 Other possible
555 .I errno
556 values are
557 .BR ENOMEM ,
558 .BR EFAULT ,
559 .BR EPERM ,
560 and
561 .BR EINVAL .
562 This method can be used to iterate over all elements in the map.
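
A minimal sketch of such an iteration, which merely counts the
elements, might look as follows.
It invokes the system call directly via
.BR syscall (2)
so that it is self-contained.
Two assumptions are specific to this sketch: keys are at most
.B MAX_KEY_SIZE
bytes long, and a key consisting entirely of 0xff bytes is not present
in the map, so that the first call yields the first element.

```c
#include <errno.h>
#include <linux/bpf.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

#define MAX_KEY_SIZE 64     /* arbitrary limit for this sketch */

/* Count the elements of the map referred to by fd, whose keys are
   key_size bytes long.  Returns the count, or -1 on error. */
static long
count_map_elements(int fd, size_t key_size)
{
    unsigned char key[MAX_KEY_SIZE], next_key[MAX_KEY_SIZE];
    union bpf_attr attr;
    long count = 0;

    if (key_size == 0 || key_size > MAX_KEY_SIZE) {
        errno = EINVAL;
        return -1;
    }

    /* Assumption: the all-ones key is absent, so the first call
       returns the key of the first element. */
    memset(key, 0xff, key_size);

    for (;;) {
        memset(&attr, 0, sizeof(attr));
        attr.map_fd = fd;
        attr.key = (uint64_t) (uintptr_t) key;
        attr.next_key = (uint64_t) (uintptr_t) next_key;

        if (syscall(__NR_bpf, BPF_MAP_GET_NEXT_KEY, &attr,
                    sizeof(attr)) == -1)
            return errno == ENOENT ? count : -1;

        count++;
        memcpy(key, next_key, key_size);
    }
}
```

If the map is modified while the loop runs, the count is only
approximate.
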
563 .TP
564 .B close(map_fd)
565 Delete the map referred to by the file descriptor
566 .IR map_fd .
567 When the user-space program that created a map exits, all maps will
568 be deleted automatically (but see NOTES).
569 .\"
570 .SS eBPF map types
571 The following map types are supported:
572 .TP
573 .B BPF_MAP_TYPE_HASH
574 .\" commit 0f8e4bd8a1fc8c4185f1630061d0a1f2d197a475
575 Hash-table maps have the following characteristics:
576 .RS
577 .IP * 3
578 Maps are created and destroyed by user-space programs.
579 Both user-space and eBPF programs
580 can perform lookup, update, and delete operations.
581 .IP *
582 The kernel takes care of allocating and freeing key/value pairs.
583 .IP *
584 The
585 .BR map_update_elem ()
586 helper will fail to insert a new element when the
587 .I max_entries
588 limit is reached.
589 (This ensures that eBPF programs cannot exhaust memory.)
590 .IP *
591 .BR map_update_elem ()
592 replaces existing elements atomically.
593 .RE
594 .IP
595 Hash-table maps are
596 optimized for speed of lookup.
597 .TP
598 .B BPF_MAP_TYPE_ARRAY
599 .\" commit 28fbcfa08d8ed7c5a50d41a0433aad222835e8e3
600 Array maps have the following characteristics:
601 .RS
602 .IP * 3
603 Optimized for fastest possible lookup.
604 In the future the verifier/JIT compiler
605 may recognize lookup() operations that employ a constant key
606 and optimize them into a constant pointer.
607 It is possible to optimize a non-constant
608 key into direct pointer arithmetic as well, since pointers and
609 .I value_size
610 are constant for the life of the eBPF program.
611 In other words,
612 .BR array_map_lookup_elem ()
613 may be 'inlined' by the verifier/JIT compiler
614 while preserving concurrent access to this map from user space.
615 .IP *
616 All array elements are pre-allocated and zero-initialized at init time.
617 .IP *
618 The key is an array index, and must be exactly four bytes.
619 .IP *
620 .BR map_delete_elem ()
621 fails with the error
622 .BR EINVAL ,
623 since elements cannot be deleted.
624 .IP *
625 .BR map_update_elem ()
626 replaces elements in a
627 .B nonatomic
628 fashion;
629 for atomic updates, a hash-table map should be used instead.
630 There is however one special case that can also be used with arrays:
631 the atomic built-in
632 .BR __sync_fetch_and_add ()
633 can be used on 32-bit and 64-bit atomic counters.
634 For example, it can be
635 applied on the whole value itself if it represents a single counter,
636 or in case of a structure containing multiple counters, it could be
637 used on individual counters.
638 This is quite often useful for aggregation and accounting of events.
639 .RE
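.IP
The counter case can be sketched as follows, for a value that holds
two counters.
The structure and function names are hypothetical; in an eBPF program,
.I stats
would be the pointer returned by
.BR bpf_map_lookup_elem (),
but the code is plain C and also compiles in user space with GCC or
clang.

```c
#include <stdint.h>

/* Hypothetical value layout: two counters in one array element. */
struct pkt_stats {
    uint64_t packets;
    uint64_t bytes;
};

/* Account for one packet of 'len' bytes.  Each counter is updated
   atomically on its own; the pair is not updated as a unit. */
static void
account_packet(struct pkt_stats *stats, uint64_t len)
{
    __sync_fetch_and_add(&stats->packets, 1);
    __sync_fetch_and_add(&stats->bytes, len);
}
```
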
640 .IP
641 Among the uses for array maps are the following:
642 .RS
643 .IP * 3
644 As "global" eBPF variables: an array of 1 element whose key is (index) 0
645 and where the value is a collection of 'global' variables which
646 eBPF programs can use to keep state between events.
647 .IP *
648 Aggregation of tracing events into a fixed set of buckets.
649 .IP *
650 Accounting of networking events, for example, number of packets and packet
651 sizes.
652 .RE
653 .TP
654 .BR BPF_MAP_TYPE_PROG_ARRAY " (since Linux 4.2)"
655 A program array map is a special kind of array map whose map values
656 contain only file descriptors referring to other eBPF programs.
657 Thus, both the
658 .I key_size
659 and
660 .I value_size
661 must be exactly four bytes.
662 This map is used in conjunction with the
663 .BR bpf_tail_call ()
664 helper.
665
666 This means that an eBPF program with a program array map attached to it
667 can call from kernel side into
668
669 .in +4n
670 .nf
671 void bpf_tail_call(void *context, void *prog_map, unsigned int index);
672 .fi
673 .in
674
675 and therefore replace its own program flow with the one from the program
676 at the given program array slot, if present.
677 This can be regarded as a kind of jump table to a different eBPF program.
678 The invoked program will then reuse the same stack.
679 When a jump into the new program has been performed,
680 it won't return to the old program anymore.
681
682 If no eBPF program is found at the given index of the program array
683 (because the map slot doesn't contain a valid program file descriptor,
684 the specified lookup index/key is out of bounds,
685 or the limit of 32
686 .\" MAX_TAIL_CALL_CNT
687 nested calls has been exceeded),
688 execution continues with the current eBPF program.
689 This can be used as a fall-through for default cases.
690
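The jump and fall-through behavior can be modeled in ordinary C.
The following is only a user-space sketch of the semantics described
above, not kernel code: "programs" are plain functions that return the
slot they want to jump to, and every name in it is invented for the
illustration.

```c
#include <stddef.h>

#define MAX_TAIL_CALLS 32   /* the limit stated above */

/* A model "program": returns the slot it wants to tail-call into,
   or -1 to exit normally. */
typedef int (*model_prog)(void);

static int loop_forever(void) { return 0; }   /* always jumps to slot 0 */
static int exit_now(void)     { return -1; }  /* never tail-calls */

/* Run slots[start] and follow tail calls; returns the number of
   programs executed. */
static unsigned int
run_with_tail_calls(model_prog *slots, size_t nslots, size_t start)
{
    unsigned int executed = 0, tail_calls = 0;
    model_prog cur = start < nslots ? slots[start] : NULL;

    while (cur != NULL) {
        int want = cur();

        executed++;
        /* A failed tail call falls through: the current program
           continues, which in this model means it finishes. */
        if (want < 0 || (size_t) want >= nslots ||
            slots[want] == NULL || tail_calls >= MAX_TAIL_CALLS)
            break;
        tail_calls++;
        cur = slots[want];      /* jump; never returns to the caller */
    }
    return executed;
}
```

In this model, a program that keeps jumping to itself stops after the
initial run plus 32 jumps, which is how the limit rules out endless
chains.
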
691 A program array map is useful, for example, in tracing or networking, to
692 handle individual system calls or protocols in their own subprograms and
693 use their identifiers as an individual map index.
694 This approach may result in performance benefits,
695 and also makes it possible to overcome the maximum
696 instruction limit of a single eBPF program.
697 In dynamic environments,
698 a user-space daemon might atomically replace individual subprograms
699 at run-time with newer versions to alter overall program behavior,
700 for instance, if global policies change.
701 .\"
702 .SS eBPF programs
703 The
704 .B BPF_PROG_LOAD
705 command is used to load an eBPF program into the kernel.
706 The return value for this command is a new file descriptor associated
707 with this eBPF program.
708
709 .in +4n
710 .nf
711 char bpf_log_buf[LOG_BUF_SIZE];
712
713 int
714 bpf_prog_load(enum bpf_prog_type type,
715               const struct bpf_insn *insns, int insn_cnt,
716               const char *license)
717 {
718     union bpf_attr attr = {
719         .prog_type = type,
720         .insns     = ptr_to_u64(insns),
721         .insn_cnt  = insn_cnt,
722         .license   = ptr_to_u64(license),
723         .log_buf   = ptr_to_u64(bpf_log_buf),
724         .log_size  = LOG_BUF_SIZE,
725         .log_level = 1,
726     };
727
728     return bpf(BPF_PROG_LOAD, &attr, sizeof(attr));
729 }
730 .fi
731 .in
732
733 .I prog_type
734 is one of the available program types:
735
736 .in +4n
737 .nf
738 enum bpf_prog_type {
739     BPF_PROG_TYPE_UNSPEC,        /* Reserve 0 as invalid
740                                     program type */
741     BPF_PROG_TYPE_SOCKET_FILTER,
742     BPF_PROG_TYPE_KPROBE,
743     BPF_PROG_TYPE_SCHED_CLS,
744     BPF_PROG_TYPE_SCHED_ACT,
745 };
746 .fi
747 .in
748
749 For further details of eBPF program types, see below.
750
751 The remaining fields of
752 .I bpf_attr
753 are set as follows:
754 .IP * 3
755 .I insns
756 is an array of
757 .I "struct bpf_insn"
758 instructions.
759 .IP *
760 .I insn_cnt
761 is the number of instructions in the program referred to by
762 .IR insns .
763 .IP *
764 .I license
765 is a license string, which must be GPL compatible to call helper functions
766 marked
767 .IR gpl_only .
768 (The licensing rules are the same as for kernel modules,
769 so dual licenses, such as "Dual BSD/GPL", may also be used.)
770 .IP *
771 .I log_buf
772 is a pointer to a caller-allocated buffer in which the in-kernel
773 verifier can store the verification log.
774 This log is a multi-line string that can be checked by
775 the program author in order to understand how the verifier came to
776 the conclusion that the eBPF program is unsafe.
777 The format of the output can change at any time as the verifier evolves.
778 .IP *
779 .I log_size
780 is the size of the buffer pointed to by
781 .IR log_buf .
782 If the size of the buffer is not large enough to store all
783 verifier messages, \-1 is returned and
784 .I errno
785 is set to
786 .BR ENOSPC .
787 .IP *
788 .I log_level
789 is the verbosity level of the verifier.
790 A value of zero means that the verifier will not provide a log;
791 in this case,
792 .I log_buf
793 must be a NULL pointer, and
794 .I log_size
795 must be zero.
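.P
Since the licensing rules follow those for kernel modules, a loader
may want to check a license string before using it.
The sketch below compares against the strings that kernel sources have
accepted as GPL compatible for modules; the list is reproduced here as
an assumption and may differ between kernel versions.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* License strings assumed to be treated as GPL compatible. */
static bool
license_is_gpl_compatible(const char *license)
{
    static const char *const ok[] = {
        "GPL", "GPL v2", "GPL and additional rights",
        "Dual BSD/GPL", "Dual MIT/GPL", "Dual MPL/GPL",
    };

    for (size_t i = 0; i < sizeof(ok) / sizeof(ok[0]); i++)
        if (strcmp(license, ok[i]) == 0)
            return true;
    return false;       /* the comparison is case sensitive */
}
```
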
796 .P
797 Applying
798 .BR close (2)
799 to the file descriptor returned by
800 .B BPF_PROG_LOAD
801 will unload the eBPF program (but see NOTES).
802
803 Maps are accessible from eBPF programs and are used to exchange data between
804 eBPF programs and between eBPF programs and user-space programs.
805 For example,
806 eBPF programs can process various events (such as kprobes and packets) and
807 store their data into a map,
808 and user-space programs can then fetch data from the map.
809 Conversely, user-space programs can use a map as a configuration mechanism,
810 populating the map with values checked by the eBPF program,
811 which then modifies its behavior on the fly according to those values.
812 .\"
813 .\"
814 .SS eBPF program types
815 The eBPF program type
816 .RI ( prog_type )
817 determines the subset of kernel helper functions that the program
818 may call.
819 The program type also determines the program input (context)\(emthe
820 format of
821 .I "struct bpf_context"
822 (which is the data blob passed into the eBPF program as the first argument).
823 .\"
824 .\" FIXME
825 .\" Somewhere in this page we need a general introduction to the
826 .\" bpf_context. For example, how does a BPF program access the
827 .\" context?
828
829 For example, a tracing program does not have the exact same
830 subset of helper functions as a socket filter program
831 (though they may have some helpers in common).
832 Similarly,
833 the input (context) for a tracing program is a set of register values,
834 while for a socket filter it is a network packet.
835
836 The set of functions available to eBPF programs of a given type may increase
837 in the future.
838
839 The following program types are supported:
840 .TP
841 .BR BPF_PROG_TYPE_SOCKET_FILTER " (since Linux 3.19)"
842 Currently, the set of functions for
843 .B BPF_PROG_TYPE_SOCKET_FILTER
844 is:
845
846 .in +4n
847 .nf
848 bpf_map_lookup_elem(map_fd, void *key)
849                     /* look up key in a map_fd */
850 bpf_map_update_elem(map_fd, void *key, void *value)
851                     /* update key/value */
852 bpf_map_delete_elem(map_fd, void *key)
853                     /* delete key in a map_fd */
854 .fi
855 .in
856
857 The
858 .I bpf_context
859 argument is a pointer to a
860 .IR "struct __sk_buff" .
861 .\" FIXME: We need some text here to explain how the program
862 .\" accesses __sk_buff.
863 .\" See 'struct __sk_buff' and commit 9bac3d6d548e5
864 .\"
865 .\" Alexei commented:
866 .\" Actually now in case of SOCKET_FILTER, SCHED_CLS, SCHED_ACT
867 .\" the program can now access skb fields.
868 .\"
869 .TP
870 .BR BPF_PROG_TYPE_KPROBE " (since Linux 4.1)"
871 .\" commit 2541517c32be2531e0da59dfd7efc1ce844644f5
872 [To be documented]
873 .\" FIXME Document this program type
874 .\" Describe allowed helper functions for this program type
875 .\" Describe bpf_context for this program type
876 .\"
877 .\" FIXME We need text here to describe 'kern_version'
878 .TP
879 .BR BPF_PROG_TYPE_SCHED_CLS " (since Linux 4.1)"
880 .\" commit 96be4325f443dbbfeb37d2a157675ac0736531a1
881 .\" commit e2e9b6541dd4b31848079da80fe2253daaafb549
882 [To be documented]
883 .\" FIXME Document this program type
884 .\" Describe allowed helper functions for this program type
885 .\" Describe bpf_context for this program type
886 .TP
887 .BR BPF_PROG_TYPE_SCHED_ACT " (since Linux 4.1)"
888 .\" commit 94caee8c312d96522bcdae88791aaa9ebcd5f22c
889 .\" commit a8cb5f556b567974d75ea29c15181c445c541b1f
890 [To be documented]
891 .\" FIXME Document this program type
892 .\" Describe allowed helper functions for this program type
893 .\" Describe bpf_context for this program type
894 .SS Events
895 Once a program is loaded, it can be attached to an event.
896 Various kernel subsystems have different ways to do so.
897
898 Since Linux 3.19,
899 .\" commit 89aa075832b0da4402acebd698d0411dcc82d03e
900 the following call will attach the program
901 .I prog_fd
902 to the socket
903 .IR sockfd ,
904 which was created by an earlier call to
905 .BR socket (2):
906
907 .in +4n
908 .nf
909 setsockopt(sockfd, SOL_SOCKET, SO_ATTACH_BPF,
910            &prog_fd, sizeof(prog_fd));
911 .fi
912 .in
913
914 Since Linux 4.1,
915 .\" commit 2541517c32be2531e0da59dfd7efc1ce844644f5
916 the following call may be used to attach
917 the eBPF program referred to by the file descriptor
918 .I prog_fd
919 to a perf event file descriptor,
920 .IR event_fd ,
921 that was created by a previous call to
922 .BR perf_event_open (2):
923
924 .in +4n
925 .nf
926 ioctl(event_fd, PERF_EVENT_IOC_SET_BPF, prog_fd);
927 .fi
928 .in
929 .\"
930 .\"
931 .SH EXAMPLES
932 .nf
933 /* bpf+sockets example:
934  * 1. create array map of 256 elements
935  * 2. load program that counts number of packets received
936  *    r0 = skb->data[ETH_HLEN + offsetof(struct iphdr, protocol)]
937  *    map[r0]++
938  * 3. attach prog_fd to raw socket via setsockopt()
939  * 4. print number of received TCP/UDP packets every second
940  */
941 int
942 main(int argc, char **argv)
943 {
944     int sock, map_fd, prog_fd, key;
945     long long value = 0, tcp_cnt, udp_cnt;
946
947     map_fd = bpf_create_map(BPF_MAP_TYPE_ARRAY, sizeof(key),
948                             sizeof(value), 256);
949     if (map_fd < 0) {
950         printf("failed to create map '%s'\\n", strerror(errno));
951         /* likely not run as root */
952         return 1;
953     }
954
955     struct bpf_insn prog[] = {
956         BPF_MOV64_REG(BPF_REG_6, BPF_REG_1),        /* r6 = r1 */
957         BPF_LD_ABS(BPF_B, ETH_HLEN + offsetof(struct iphdr, protocol)),
958                                 /* r0 = ip->proto */
959         BPF_STX_MEM(BPF_W, BPF_REG_10, BPF_REG_0, -4),
960                                 /* *(u32 *)(fp - 4) = r0 */
961         BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),       /* r2 = fp */
962         BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -4),      /* r2 = r2 - 4 */
963         BPF_LD_MAP_FD(BPF_REG_1, map_fd),           /* r1 = map_fd */
964         BPF_CALL_FUNC(BPF_FUNC_map_lookup_elem),
965                                 /* r0 = map_lookup(r1, r2) */
966         BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 2),
967                                 /* if (r0 == 0) goto pc+2 */
968         BPF_MOV64_IMM(BPF_REG_1, 1),                /* r1 = 1 */
969         BPF_XADD(BPF_DW, BPF_REG_0, BPF_REG_1, 0, 0),
970                                 /* lock *(u64 *) r0 += r1 */
971 .\" == atomic64_add
972         BPF_MOV64_IMM(BPF_REG_0, 0),                /* r0 = 0 */
973         BPF_EXIT_INSN(),                            /* return r0 */
974     };
975
976     prog_fd = bpf_prog_load(BPF_PROG_TYPE_SOCKET_FILTER, prog,
977                             sizeof(prog) / sizeof(prog[0]), "GPL");
978
979     sock = open_raw_sock("lo");
980
981     assert(setsockopt(sock, SOL_SOCKET, SO_ATTACH_BPF, &prog_fd,
982                       sizeof(prog_fd)) == 0);
983
984     for (;;) {
985         key = IPPROTO_TCP;
986         assert(bpf_lookup_elem(map_fd, &key, &tcp_cnt) == 0);
987         key = IPPROTO_UDP;
988         assert(bpf_lookup_elem(map_fd, &key, &udp_cnt) == 0);
989         printf("TCP %lld UDP %lld packets\\n", tcp_cnt, udp_cnt);
990         sleep(1);
991     }
992
993     return 0;
994 }
995 .fi
996
997 Some complete working code can be found in the
998 .IR samples/bpf
999 directory in the kernel source tree.
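.PP
The example calls
.IR open_raw_sock (),
which is not defined on this page.
One plausible definition, along the lines of the helper of the same
name in the kernel's
.I samples/bpf
directory, is sketched below; treat the details as assumptions rather
than as the authors' code.

```c
#include <arpa/inet.h>
#include <linux/if_ether.h>
#include <linux/if_packet.h>
#include <net/if.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Open a raw packet socket bound to the named interface.
   Returns a file descriptor, or -1 on error. */
static int
open_raw_sock(const char *name)
{
    struct sockaddr_ll sll;
    unsigned int ifindex;
    int sock;

    ifindex = if_nametoindex(name);
    if (ifindex == 0)
        return -1;              /* no such interface */

    sock = socket(PF_PACKET, SOCK_RAW | SOCK_CLOEXEC, htons(ETH_P_ALL));
    if (sock == -1)
        return -1;              /* typically requires CAP_NET_RAW */

    memset(&sll, 0, sizeof(sll));
    sll.sll_family = AF_PACKET;
    sll.sll_ifindex = ifindex;
    sll.sll_protocol = htons(ETH_P_ALL);
    if (bind(sock, (struct sockaddr *) &sll, sizeof(sll)) == -1) {
        close(sock);
        return -1;
    }
    return sock;
}
```
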
1000 .SH RETURN VALUE
1001 For a successful call, the return value depends on the operation:
1002 .TP
1003 .B BPF_MAP_CREATE
1004 The new file descriptor associated with the eBPF map.
1005 .TP
1006 .B BPF_PROG_LOAD
1007 The new file descriptor associated with the eBPF program.
1008 .TP
1009 All other commands
1010 Zero.
1011 .PP
1012 On error, \-1 is returned, and
1013 .I errno
1014 is set appropriately.
1015 .SH ERRORS
1016 .TP
1017 .BR E2BIG
1018 The eBPF program is too large or a map reached the
1019 .I max_entries
1020 limit (maximum number of elements).
1021 .TP
1022 .BR EACCES
1023 For
1024 .BR BPF_PROG_LOAD ,
1025 even though all program instructions are valid, the program has been
1026 rejected because it was deemed unsafe.
1027 This may be because it accessed
1028 a disallowed memory region or an uninitialized stack/register,
1029 because the function constraints don't match the actual types, or because
1030 there was a misaligned memory access.
1031 In this case, it is recommended to call
1032 .BR bpf ()
1033 again with
1034 .I log_level = 1
1035 and examine
1036 .I log_buf
1037 for the specific reason provided by the verifier.
1038 .TP
1039 .B EBADF
1040 .I fd
1041 is not an open file descriptor.
1042 .TP
1043 .B EFAULT
1044 One of the pointers
1045 .RI ( key
1046 or
1047 .I value
1048 or
1049 .I log_buf
1050 or
1051 .IR insns )
1052 is outside the accessible address space.
1053 .TP
1054 .B EINVAL
1055 The value specified in
1056 .I cmd
1057 is not recognized by this kernel.
1058 .TP
1059 .B EINVAL
1060 For
1061 .BR BPF_MAP_CREATE ,
1062 either
1063 .I map_type
1064 or attributes are invalid.
1065 .TP
1066 .B EINVAL
1067 For
1068 .BR BPF_MAP_*_ELEM
1069 commands,
1070 some of the fields of
1071 .I "union bpf_attr"
1072 that are not used by this command
1073 are not set to zero.
1074 .TP
1075 .B EINVAL
1076 For
1077 .BR BPF_PROG_LOAD ,
1078 indicates an attempt to load an invalid program.
1079 eBPF programs can be deemed
1080 invalid due to unrecognized instructions, the use of reserved fields, jumps
1081 out of range, infinite loops or calls of unknown functions.
1082 .TP
1083 .BR ENOENT
1084 For
1085 .B BPF_MAP_LOOKUP_ELEM
1086 or
1087 .BR BPF_MAP_DELETE_ELEM ,
1088 indicates that the element with the given
1089 .I key
1090 was not found.
1091 .TP
1092 .B ENOMEM
1093 Cannot allocate sufficient memory.
1094 .TP
1095 .B EPERM
1096 The call was made without sufficient privilege
1097 (without the
1098 .B CAP_SYS_ADMIN
1099 capability).
1100 .SH VERSIONS
1101 The
1102 .BR bpf ()
1103 system call first appeared in Linux 3.18.
1104 .SH CONFORMING TO
1105 The
1106 .BR bpf ()
1107 system call is Linux-specific.
1108 .SH NOTES
1109 In the current implementation, all
1110 .BR bpf ()
1111 commands require the caller to have the
1112 .B CAP_SYS_ADMIN
1113 capability.
1114
1115 eBPF objects (maps and programs) can be shared between processes.
1116 For example, after
1117 .BR fork (2),
1118 the child inherits file descriptors referring to the same eBPF objects.
1119 In addition, file descriptors referring to eBPF objects can be
1120 transferred over UNIX domain sockets.
1121 File descriptors referring to eBPF objects can be duplicated
1122 in the usual way, using
1123 .BR dup (2)
1124 and similar calls.
1125 An eBPF object is deallocated only after all file descriptors
1126 referring to the object have been closed.
1127
1128 eBPF programs can be written in a restricted C that is compiled (using the
1129 .B clang
1130 compiler) into eBPF bytecode.
1131 Various features are omitted from this restricted C, such as loops,
1132 global variables, variadic functions, floating-point numbers,
1133 and passing structures as function arguments.
1134 Some examples can be found in the
1135 .I samples/bpf/*_kern.c
1136 files in the kernel source tree.
1137 .\" There are also examples for the tc classifier, in the iproute2
1138 .\" project, in examples/bpf
1139
1140 The kernel contains a just-in-time (JIT) compiler that translates
1141 eBPF bytecode into native machine code for better performance.
1142 The JIT compiler is disabled by default,
1143 but its operation can be controlled by writing one of the
1144 following integer strings to the file
1145 .IR /proc/sys/net/core/bpf_jit_enable :
1146 .IP 0 3
1147 Disable JIT compilation (default).
1148 .IP 1
1149 Normal compilation.
1150 .IP 2
1151 Debugging mode.
1152 The generated opcodes are dumped in hexadecimal into the kernel log.
1153 These opcodes can then be disassembled using the program
1154 .IR tools/net/bpf_jit_disasm.c
1155 provided in the kernel source tree.
1156 .PP
1157 The JIT compiler for eBPF is currently available for the x86-64, arm64,
1158 and s390 architectures.
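.PP
A program can check the current setting before relying on JIT
compilation.
The following sketch reads the integer from a file whose path is
supplied by the caller; the function name is made up for this
illustration.

```c
#include <stdio.h>

/* Read the JIT setting from 'path'; returns the value found there
   (0, 1, or 2 as described above), or -1 if the file cannot be
   opened or does not start with an integer. */
static int
read_bpf_jit_mode(const char *path)
{
    FILE *fp = fopen(path, "r");
    int mode;

    if (fp == NULL)
        return -1;
    if (fscanf(fp, "%d", &mode) != 1)
        mode = -1;
    fclose(fp);
    return mode;
}
```

A caller would typically pass
.IR /proc/sys/net/core/bpf_jit_enable .
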
1159 .SH SEE ALSO
1160 .BR seccomp (2),
1161 .BR socket (7),
1162 .BR tc (8),
1163 .BR tc-bpf (8)
1164
1165 Both classic and extended BPF are explained in the kernel source file
1166 .IR Documentation/networking/filter.txt .