man2/bpf.2

   1 .\" Copyright (C) 2015 Alexei Starovoitov <ast@kernel.org>
   2 .\" and Copyright (C) 2015 Michael Kerrisk <mtk.manpages@gmail.com>
   3 .\"
   4 .\" %%%LICENSE_START(VERBATIM)
   5 .\" Permission is granted to make and distribute verbatim copies of this
   6 .\" manual provided the copyright notice and this permission notice are
   7 .\" preserved on all copies.
   8 .\"
   9 .\" Permission is granted to copy and distribute modified versions of this
  10 .\" manual under the conditions for verbatim copying, provided that the
  11 .\" entire resulting derived work is distributed under the terms of a
  12 .\" permission notice identical to this one.
  13 .\"
  14 .\" Since the Linux kernel and libraries are constantly changing, this
  15 .\" manual page may be incorrect or out-of-date.  The author(s) assume no
  16 .\" responsibility for errors or omissions, or for damages resulting from
  17 .\" the use of the information contained herein.  The author(s) may not
  18 .\" have taken the same level of care in the production of this manual,
  19 .\" which is licensed free of charge, as they might when working
  20 .\" professionally.
  21 .\"
  22 .\" Formatted or processed versions of this manual, if unaccompanied by
  23 .\" the source, must acknowledge the copyright and authors of this work.
  24 .\" %%%LICENSE_END
  25 .\"
  26 .TH BPF 2 2019-08-02 "Linux" "Linux Programmer's Manual"
  27 .SH NAME
  28 bpf \- perform a command on an extended BPF map or program
  29 .SH SYNOPSIS
  30 .nf
  31 .B #include <linux/bpf.h>
  32
  33 .BI "int bpf(int " cmd ", union bpf_attr *" attr ", unsigned int " size );
  34 .fi
  35 .SH DESCRIPTION
  36 The
  37 .BR bpf ()
  38 system call performs a range of operations related to extended
  39 Berkeley Packet Filters.
  40 Extended BPF (or eBPF) is similar to
  41 the original ("classic") BPF (cBPF) used to filter network packets.
  42 For both cBPF and eBPF programs,
  43 the kernel statically analyzes the programs before loading them,
  44 in order to ensure that they cannot harm the running system.
  45 .PP
  46 eBPF extends cBPF in multiple ways, including the ability to call
  47 a fixed set of in-kernel helper functions
  48 .\" See 'enum bpf_func_id' in include/uapi/linux/bpf.h
  49 (via the
  50 .B BPF_CALL
  51 opcode extension provided by eBPF)
  52 and access shared data structures such as eBPF maps.
  53 .\"
  54 .SS Extended BPF Design/Architecture
  55 eBPF maps are a generic data structure for storage of different data types.
  56 Data types are generally treated as binary blobs, so a user just specifies
  57 the size of the key and the size of the value at map-creation time.
  58 In other words, a key/value for a given map can have an arbitrary structure.
  59 .PP
  60 A user process can create multiple maps (with key/value-pairs being
  61 opaque bytes of data) and access them via file descriptors.
  62 Different eBPF programs can access the same maps in parallel.
  63 It's up to the user process and eBPF program to decide what they store
  64 inside maps.
  65 .PP
  66 There's one special map type, called a program array.
  67 This type of map stores file descriptors referring to other eBPF programs.
When a lookup in the map is performed, the program flow is
redirected in-place to the beginning of another eBPF program and does not
return to the calling program.
  71 The level of nesting has a fixed limit of 32,
  72 .\" Defined by the kernel constant MAX_TAIL_CALL_CNT in include/linux/bpf.h
  73 so that infinite loops cannot be crafted.
  74 At run time, the program file descriptors stored in the map can be modified,
  75 so program functionality can be altered based on specific requirements.
  76 All programs referred to in a program-array map must
  77 have been previously loaded into the kernel via
  78 .BR bpf ().
  79 If a map lookup fails, the current program continues its execution.
  80 See
  81 .B BPF_MAP_TYPE_PROG_ARRAY
  82 below for further details.
  83 .PP
  84 Generally, eBPF programs are loaded by the user process and automatically
  85 unloaded when the process exits.
  86 In some cases, for example,
  87 .BR tc-bpf (8),
  88 the program will continue to stay alive inside the kernel even after the
  89 process that loaded the program exits.
  90 In that case,
  91 the tc subsystem holds a reference to the eBPF program after the
  92 file descriptor has been closed by the user-space program.
  93 Thus, whether a specific program continues to live inside the kernel
  94 depends on how it is further attached to a given kernel subsystem
  95 after it was loaded via
  96 .BR bpf ().
  97 .PP
  98 Each eBPF program is a set of instructions that is safe to run until
  99 its completion.
 100 An in-kernel verifier statically determines that the eBPF program
 101 terminates and is safe to execute.
 102 During verification, the kernel increments reference counts for each of
 103 the maps that the eBPF program uses,
 104 so that the attached maps can't be removed until the program is unloaded.
 105 .PP
 106 eBPF programs can be attached to different events.
 107 These events can be the arrival of network packets, tracing
events, classification events by network queueing disciplines
 109 (for eBPF programs attached to a
 110 .BR tc (8)
 111 classifier), and other types that may be added in the future.
 112 A new event triggers execution of the eBPF program, which
 113 may store information about the event in eBPF maps.
 114 Beyond storing data, eBPF programs may call a fixed set of
 115 in-kernel helper functions.
 116 .PP
 117 The same eBPF program can be attached to multiple events and different
 118 eBPF programs can access the same map:
 119 .PP
 120 .in +4n
 121 .EX
 122 tracing     tracing    tracing    packet      packet     packet
 123 event A     event B    event C    on eth0     on eth1    on eth2
 124  |             |         |          |           |          ^
 125  |             |         |          |           v          |
 126  --> tracing <--     tracing      socket    tc ingress   tc egress
 127       prog_1          prog_2      prog_3    classifier    action
 128       |  |              |           |         prog_4      prog_5
 129    |---  -----|  |------|          map_3        |           |
 130  map_1       map_2                              --| map_4 |--
 131 .EE
 132 .in
 133 .\"
 134 .SS Arguments
 135 The operation to be performed by the
 136 .BR bpf ()
 137 system call is determined by the
 138 .I cmd
 139 argument.
 140 Each operation takes an accompanying argument,
 141 provided via
 142 .IR attr ,
 143 which is a pointer to a union of type
 144 .I bpf_attr
 145 (see below).
 146 The
 147 .I size
 148 argument is the size of the union pointed to by
 149 .IR attr .
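.PP
The wrapper functions shown later in this page call
.BR bpf ()
directly and convert pointers with a
.IR ptr_to_u64 ()
helper, but neither is defined here.
The C library may not provide a wrapper for this system call,
so the following is one plausible definition, given as a minimal sketch
built on
.BR syscall (2).
The body of the helper is an assumption based on how the
.I __aligned_u64
fields are used:

```c
#define _GNU_SOURCE
#include <linux/bpf.h>      /* union bpf_attr, BPF_* constants */
#include <stdint.h>
#include <sys/syscall.h>    /* __NR_bpf */
#include <unistd.h>         /* syscall() */

/* Convert a user-space pointer to the 64-bit integer form stored in
   the __aligned_u64 fields of union bpf_attr. */
static inline uint64_t
ptr_to_u64(const void *ptr)
{
    return (uint64_t) (uintptr_t) ptr;
}

/* Thin wrapper around the raw system call: the result is passed
   through unchanged, so errors are reported as -1 with errno set. */
static inline int
bpf(int cmd, union bpf_attr *attr, unsigned int size)
{
    return syscall(__NR_bpf, cmd, attr, size);
}
```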
 150 .PP
 151 The value provided in
 152 .I cmd
 153 is one of the following:
 154 .TP
 155 .B BPF_MAP_CREATE
 156 Create a map and return a file descriptor that refers to the map.
 157 The close-on-exec file descriptor flag (see
 158 .BR fcntl (2))
 159 is automatically enabled for the new file descriptor.
 160 .TP
 161 .B BPF_MAP_LOOKUP_ELEM
 162 Look up an element by key in a specified map and return its value.
 163 .TP
 164 .B BPF_MAP_UPDATE_ELEM
 165 Create or update an element (key/value pair) in a specified map.
 166 .TP
 167 .B BPF_MAP_DELETE_ELEM
 168 Look up and delete an element by key in a specified map.
 169 .TP
 170 .B BPF_MAP_GET_NEXT_KEY
 171 Look up an element by key in a specified map and return the key
 172 of the next element.
 173 .TP
 174 .B BPF_PROG_LOAD
 175 Verify and load an eBPF program,
 176 returning a new file descriptor associated with the program.
 177 The close-on-exec file descriptor flag (see
 178 .BR fcntl (2))
 179 is automatically enabled for the new file descriptor.
.PP
 181 The
 182 .I bpf_attr
 183 union consists of various anonymous structures that are used by different
 184 .BR bpf ()
 185 commands:
 186 .PP
 187 .in +4n
 188 .EX
 189 union bpf_attr {
 190     struct {    /* Used by BPF_MAP_CREATE */
 191         __u32         map_type;
 192         __u32         key_size;    /* size of key in bytes */
 193         __u32         value_size;  /* size of value in bytes */
 194         __u32         max_entries; /* maximum number of entries
 195                                       in a map */
 196     };
 197
 198     struct {    /* Used by BPF_MAP_*_ELEM and BPF_MAP_GET_NEXT_KEY
 199                    commands */
 200         __u32         map_fd;
 201         __aligned_u64 key;
 202         union {
 203             __aligned_u64 value;
 204             __aligned_u64 next_key;
 205         };
 206         __u64         flags;
 207     };
 208
 209     struct {    /* Used by BPF_PROG_LOAD */
 210         __u32         prog_type;
 211         __u32         insn_cnt;
 212         __aligned_u64 insns;      /* 'const struct bpf_insn *' */
 213         __aligned_u64 license;    /* 'const char *' */
 214         __u32         log_level;  /* verbosity level of verifier */
 215         __u32         log_size;   /* size of user buffer */
 216         __aligned_u64 log_buf;    /* user supplied 'char *'
 217                                      buffer */
 218         __u32         kern_version;
 219                                   /* checked when prog_type=kprobe
 220                                      (since Linux 4.1) */
 221 .\"                 commit 2541517c32be2531e0da59dfd7efc1ce844644f5
 222     };
 223 } __attribute__((aligned(8)));
 224 .EE
 225 .in
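.PP
When filling in
.IR bpf_attr ,
it is generally safest to zero the whole union before setting fields:
the kernel can reject a call with
.B EINVAL
if bytes that the command does not use are nonzero,
and a compiler need not clear the bytes outside the member named in
an initializer.
The following sketch shows that pattern for
.BR BPF_MAP_CREATE .
The function names are hypothetical and chosen only for illustration:

```c
#define _GNU_SOURCE
#include <linux/bpf.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Zero the entire union, then set only the BPF_MAP_CREATE fields. */
static void
init_map_create_attr(union bpf_attr *attr, unsigned int map_type,
                     unsigned int key_size, unsigned int value_size,
                     unsigned int max_entries)
{
    memset(attr, 0, sizeof(*attr));
    attr->map_type    = map_type;
    attr->key_size    = key_size;
    attr->value_size  = value_size;
    attr->max_entries = max_entries;
}

/* Returns a map file descriptor, or -1 with errno set on error. */
static int
create_map_zeroed(unsigned int map_type, unsigned int key_size,
                  unsigned int value_size, unsigned int max_entries)
{
    union bpf_attr attr;

    init_map_create_attr(&attr, map_type, key_size, value_size,
                         max_entries);
    return syscall(__NR_bpf, BPF_MAP_CREATE, &attr, sizeof(attr));
}
```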
 226 .\"
 227 .SS eBPF maps
 228 Maps are a generic data structure for storage of different types of data.
 229 They allow sharing of data between eBPF kernel programs,
 230 and also between kernel and user-space applications.
 231 .PP
 232 Each map type has the following attributes:
 233 .IP * 3
 234 type
 235 .IP *
 236 maximum number of elements
 237 .IP *
 238 key size in bytes
 239 .IP *
 240 value size in bytes
 241 .PP
 242 The following wrapper functions demonstrate how various
 243 .BR bpf ()
 244 commands can be used to access the maps.
 245 The functions use the
 246 .I cmd
 247 argument to invoke different operations.
 248 .TP
 249 .B BPF_MAP_CREATE
 250 The
 251 .B BPF_MAP_CREATE
 252 command creates a new map,
 253 returning a new file descriptor that refers to the map.
 254 .IP
 255 .in +4n
 256 .EX
 257 int
 258 bpf_create_map(enum bpf_map_type map_type,
 259                unsigned int key_size,
 260                unsigned int value_size,
 261                unsigned int max_entries)
 262 {
 263     union bpf_attr attr = {
 264         .map_type    = map_type,
 265         .key_size    = key_size,
 266         .value_size  = value_size,
 267         .max_entries = max_entries
 268     };
 269
 270     return bpf(BPF_MAP_CREATE, &attr, sizeof(attr));
 271 }
 272 .EE
 273 .in
 274 .IP
 275 The new map has the type specified by
 276 .IR map_type ,
 277 and attributes as specified in
 278 .IR key_size ,
 279 .IR value_size ,
 280 and
 281 .IR max_entries .
 282 On success, this operation returns a file descriptor.
 283 On error, \-1 is returned and
 284 .I errno
 285 is set to
 286 .BR EINVAL ,
 287 .BR EPERM ,
 288 or
 289 .BR ENOMEM .
 290 .IP
 291 The
 292 .I key_size
 293 and
 294 .I value_size
 295 attributes will be used by the verifier during program loading
 296 to check that the program is calling
 297 .BR bpf_map_*_elem ()
 298 helper functions with a correctly initialized
 299 .I key
 300 and to check that the program doesn't access the map element
 301 .I value
 302 beyond the specified
 303 .IR value_size .
 304 For example, when a map is created with a
 305 .I key_size
 306 of 8 and the eBPF program calls
 307 .IP
 308 .in +4n
 309 .EX
 310 bpf_map_lookup_elem(map_fd, fp - 4)
 311 .EE
 312 .in
 313 .IP
 314 the program will be rejected,
 315 since the in-kernel helper function
 316 .IP
 317 .EX
 318     bpf_map_lookup_elem(map_fd, void *key)
 319 .EE
 320 .IP
 321 expects to read 8 bytes from the location pointed to by
 322 .IR key ,
 323 but the
 324 .I fp\ -\ 4
 325 (where
 326 .I fp
 327 is the top of the stack)
 328 starting address will cause out-of-bounds stack access.
 329 .IP
 330 Similarly, when a map is created with a
 331 .I value_size
 332 of 1 and the eBPF program contains
 333 .IP
 334 .in +4n
 335 .EX
 336 value = bpf_map_lookup_elem(...);
 337 *(u32 *) value = 1;
 338 .EE
 339 .in
 340 .IP
 341 the program will be rejected, since it accesses the
 342 .I value
 343 pointer beyond the specified 1 byte
 344 .I value_size
 345 limit.
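.IP
The two rejections described above come down to simple range checks.
The following is a simplified model of those checks, not the
verifier's actual code (the real verifier tracks considerably more state);
both function names are hypothetical:

```c
#include <stdbool.h>
#include <stdint.h>

/* A stack read of 'len' bytes starting at fp + off stays inside the
   frame only if it starts below fp and does not run past fp. */
static bool
stack_read_in_bounds(int32_t off, uint32_t len)
{
    return off < 0 && (int64_t) off + len <= 0;
}

/* An access of 'len' bytes at byte offset 'off' into a map value is
   acceptable only if it ends within value_size. */
static bool
value_access_in_bounds(uint32_t off, uint32_t len, uint32_t value_size)
{
    return (uint64_t) off + len <= value_size;
}
```

With a
.I key_size
of 8, a key at fp \- 4 fails the first check, while a key at fp \- 8 passes;
a 4-byte store into a 1-byte value fails the second check.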
 346 .IP
 347 Currently, the following values are supported for
 348 .IR map_type :
 349 .IP
 350 .in +4n
 351 .EX
 352 enum bpf_map_type {
 353     BPF_MAP_TYPE_UNSPEC,  /* Reserve 0 as invalid map type */
 354     BPF_MAP_TYPE_HASH,
 355     BPF_MAP_TYPE_ARRAY,
 356     BPF_MAP_TYPE_PROG_ARRAY,
 357     BPF_MAP_TYPE_PERF_EVENT_ARRAY,
 358     BPF_MAP_TYPE_PERCPU_HASH,
 359     BPF_MAP_TYPE_PERCPU_ARRAY,
 360     BPF_MAP_TYPE_STACK_TRACE,
 361     BPF_MAP_TYPE_CGROUP_ARRAY,
 362     BPF_MAP_TYPE_LRU_HASH,
 363     BPF_MAP_TYPE_LRU_PERCPU_HASH,
 364     BPF_MAP_TYPE_LPM_TRIE,
 365     BPF_MAP_TYPE_ARRAY_OF_MAPS,
 366     BPF_MAP_TYPE_HASH_OF_MAPS,
 367     BPF_MAP_TYPE_DEVMAP,
 368     BPF_MAP_TYPE_SOCKMAP,
 369     BPF_MAP_TYPE_CPUMAP,
 370 };
 371 .EE
 372 .in
 373 .IP
 374 .I map_type
 375 selects one of the available map implementations in the kernel.
 376 .\" FIXME We need an explanation of why one might choose each of
 377 .\" these map implementations
 378 For all map types,
 379 eBPF programs access maps with the same
 380 .BR bpf_map_lookup_elem ()
 381 and
 382 .BR bpf_map_update_elem ()
 383 helper functions.
 384 Further details of the various map types are given below.
 385 .TP
 386 .B BPF_MAP_LOOKUP_ELEM
 387 The
 388 .B BPF_MAP_LOOKUP_ELEM
 389 command looks up an element with a given
 390 .I key
 391 in the map referred to by the file descriptor
 392 .IR fd .
 393 .IP
 394 .in +4n
 395 .EX
 396 int
 397 bpf_lookup_elem(int fd, const void *key, void *value)
 398 {
 399     union bpf_attr attr = {
 400         .map_fd = fd,
 401         .key    = ptr_to_u64(key),
 402         .value  = ptr_to_u64(value),
 403     };
 404
 405     return bpf(BPF_MAP_LOOKUP_ELEM, &attr, sizeof(attr));
 406 }
 407 .EE
 408 .in
 409 .IP
 410 If an element is found,
 411 the operation returns zero and stores the element's value into
 412 .IR value ,
 413 which must point to a buffer of
 414 .I value_size
 415 bytes.
 416 .IP
 417 If no element is found, the operation returns \-1 and sets
 418 .I errno
 419 to
 420 .BR ENOENT .
 421 .TP
 422 .B BPF_MAP_UPDATE_ELEM
 423 The
 424 .B BPF_MAP_UPDATE_ELEM
 425 command
 426 creates or updates an element with a given
 427 .I key/value
 428 in the map referred to by the file descriptor
 429 .IR fd .
 430 .IP
 431 .in +4n
 432 .EX
 433 int
 434 bpf_update_elem(int fd, const void *key, const void *value,
 435                 uint64_t flags)
 436 {
 437     union bpf_attr attr = {
 438         .map_fd = fd,
 439         .key    = ptr_to_u64(key),
 440         .value  = ptr_to_u64(value),
 441         .flags  = flags,
 442     };
 443
 444     return bpf(BPF_MAP_UPDATE_ELEM, &attr, sizeof(attr));
 445 }
 446 .EE
 447 .in
 448 .IP
 449 The
 450 .I flags
 451 argument should be specified as one of the following:
 452 .RS
 453 .TP
 454 .B BPF_ANY
 455 Create a new element or update an existing element.
 456 .TP
 457 .B BPF_NOEXIST
 458 Create a new element only if it did not exist.
 459 .TP
 460 .B BPF_EXIST
 461 Update an existing element.
 462 .RE
 463 .IP
 464 On success, the operation returns zero.
 465 On error, \-1 is returned and
 466 .I errno
 467 is set to
 468 .BR EINVAL ,
 469 .BR EPERM ,
 470 .BR ENOMEM ,
 471 or
 472 .BR E2BIG .
 473 .B E2BIG
 474 indicates that the number of elements in the map reached the
 475 .I max_entries
 476 limit specified at map creation time.
 477 .B EEXIST
 478 will be returned if
 479 .I flags
 480 specifies
 481 .B BPF_NOEXIST
 482 and the element with
 483 .I key
 484 already exists in the map.
 485 .B ENOENT
 486 will be returned if
 487 .I flags
 488 specifies
 489 .B BPF_EXIST
 490 and the element with
 491 .I key
 492 doesn't exist in the map.
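.IP
The interaction of the three flags with
.I max_entries
can be summarized with a small user-space model.
This sketch is not kernel code:
.IR model_map ,
.IR model_update (),
and
.I MODEL_MAX_ENTRIES
are hypothetical names, the order of the checks is an assumption of the
model, and the function returns a negative error number directly instead
of returning \-1 and setting
.IR errno .

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Same numeric values as BPF_ANY, BPF_NOEXIST, and BPF_EXIST. */
enum { MODEL_ANY = 0, MODEL_NOEXIST = 1, MODEL_EXIST = 2 };

#define MODEL_MAX_ENTRIES 2     /* stands in for max_entries */

struct model_map {
    bool     used[MODEL_MAX_ENTRIES];
    uint32_t key[MODEL_MAX_ENTRIES];
    uint64_t value[MODEL_MAX_ENTRIES];
};

static int
model_update(struct model_map *m, uint32_t key, uint64_t value,
             uint64_t flags)
{
    int slot = -1, free_slot = -1;

    if (flags > MODEL_EXIST)
        return -EINVAL;

    /* Find the existing element, and remember the first free slot. */
    for (int i = 0; i < MODEL_MAX_ENTRIES; i++) {
        if (m->used[i] && m->key[i] == key)
            slot = i;
        else if (!m->used[i] && free_slot < 0)
            free_slot = i;
    }

    if (slot >= 0) {                    /* key already present */
        if (flags == MODEL_NOEXIST)
            return -EEXIST;
        m->value[slot] = value;
        return 0;
    }

    if (flags == MODEL_EXIST)           /* key absent */
        return -ENOENT;
    if (free_slot < 0)                  /* map is full */
        return -E2BIG;

    m->used[free_slot]  = true;
    m->key[free_slot]   = key;
    m->value[free_slot] = value;
    return 0;
}
```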
 493 .TP
 494 .B BPF_MAP_DELETE_ELEM
 495 The
 496 .B BPF_MAP_DELETE_ELEM
 497 command
 498 deletes the element whose key is
 499 .I key
 500 from the map referred to by the file descriptor
 501 .IR fd .
 502 .IP
 503 .in +4n
 504 .EX
 505 int
 506 bpf_delete_elem(int fd, const void *key)
 507 {
 508     union bpf_attr attr = {
 509         .map_fd = fd,
 510         .key    = ptr_to_u64(key),
 511     };
 512
 513     return bpf(BPF_MAP_DELETE_ELEM, &attr, sizeof(attr));
 514 }
 515 .EE
 516 .in
 517 .IP
 518 On success, zero is returned.
 519 If the element is not found, \-1 is returned and
 520 .I errno
 521 is set to
 522 .BR ENOENT .
 523 .TP
 524 .B BPF_MAP_GET_NEXT_KEY
 525 The
 526 .B BPF_MAP_GET_NEXT_KEY
 527 command looks up an element by
 528 .I key
 529 in the map referred to by the file descriptor
 530 .I fd
 531 and sets the
 532 .I next_key
 533 pointer to the key of the next element.
 534 .IP
 535 .in +4n
 536 .EX
 537 int
 538 bpf_get_next_key(int fd, const void *key, void *next_key)
 539 {
 540     union bpf_attr attr = {
 541         .map_fd   = fd,
 542         .key      = ptr_to_u64(key),
 543         .next_key = ptr_to_u64(next_key),
 544     };
 545
 546     return bpf(BPF_MAP_GET_NEXT_KEY, &attr, sizeof(attr));
 547 }
 548 .EE
 549 .in
 550 .IP
 551 If
 552 .I key
 553 is found, the operation returns zero and sets the
 554 .I next_key
 555 pointer to the key of the next element.
 556 If
 557 .I key
 558 is not found, the operation returns zero and sets the
 559 .I next_key
 560 pointer to the key of the first element.
 561 If
 562 .I key
 563 is the last element, \-1 is returned and
 564 .I errno
 565 is set to
 566 .BR ENOENT .
 567 Other possible
 568 .I errno
 569 values are
 570 .BR ENOMEM ,
 571 .BR EFAULT ,
 572 .BR EPERM ,
 573 and
 574 .BR EINVAL .
 575 This method can be used to iterate over all elements in the map.
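.IP
As a sketch of such an iteration, the following function counts the keys
of a map whose
.I key_size
is four bytes.
It relies on the behavior described above: the caller supplies a key
that it assumes is not present in the map, so that the first call yields
the first key.
The function name and the
.I absent_key
parameter are hypothetical, and the system call is invoked through
.BR syscall (2):

```c
#define _GNU_SOURCE
#include <errno.h>
#include <linux/bpf.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Returns the number of keys visited, or -1 with errno set on error.
   'absent_key' must not exist in the map; otherwise the walk starts
   after that key and earlier elements are missed. */
static long
count_map_keys(int map_fd, uint32_t absent_key)
{
    uint32_t key = absent_key, next_key;
    union bpf_attr attr;
    long count = 0;

    for (;;) {
        memset(&attr, 0, sizeof(attr));
        attr.map_fd   = map_fd;
        attr.key      = (uint64_t) (uintptr_t) &key;
        attr.next_key = (uint64_t) (uintptr_t) &next_key;

        if (syscall(__NR_bpf, BPF_MAP_GET_NEXT_KEY, &attr,
                    sizeof(attr)) == -1)
            /* ENOENT marks the end of the map; anything else is a
               genuine error. */
            return errno == ENOENT ? count : -1;

        count++;
        key = next_key;         /* continue from the key just returned */
    }
}
```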
 576 .TP
 577 .B close(map_fd)
 578 Delete the map referred to by the file descriptor
 579 .IR map_fd .
 580 When the user-space program that created a map exits, all maps will
 581 be deleted automatically (but see NOTES).
 582 .\"
 583 .SS eBPF map types
 584 The following map types are supported:
 585 .TP
 586 .B BPF_MAP_TYPE_HASH
 587 .\" commit 0f8e4bd8a1fc8c4185f1630061d0a1f2d197a475
 588 Hash-table maps have the following characteristics:
 589 .RS
 590 .IP * 3
 591 Maps are created and destroyed by user-space programs.
 592 Both user-space and eBPF programs
 593 can perform lookup, update, and delete operations.
 594 .IP *
 595 The kernel takes care of allocating and freeing key/value pairs.
 596 .IP *
 597 The
 598 .BR map_update_elem ()
helper will fail to insert a new element when the
 600 .I max_entries
 601 limit is reached.
 602 (This ensures that eBPF programs cannot exhaust memory.)
 603 .IP *
 604 .BR map_update_elem ()
 605 replaces existing elements atomically.
 606 .RE
 607 .IP
 608 Hash-table maps are
 609 optimized for speed of lookup.
 610 .TP
 611 .B BPF_MAP_TYPE_ARRAY
 612 .\" commit 28fbcfa08d8ed7c5a50d41a0433aad222835e8e3
 613 Array maps have the following characteristics:
 614 .RS
 615 .IP * 3
 616 Optimized for fastest possible lookup.
In the future, the verifier/JIT compiler
may recognize lookup() operations that employ a constant key
and optimize them into a constant pointer.
 620 It is possible to optimize a non-constant
 621 key into direct pointer arithmetic as well, since pointers and
 622 .I value_size
 623 are constant for the life of the eBPF program.
 624 In other words,
 625 .BR array_map_lookup_elem ()
 626 may be 'inlined' by the verifier/JIT compiler
 627 while preserving concurrent access to this map from user space.
 628 .IP *
All array elements are preallocated and zero-initialized at init time.
 630 .IP *
 631 The key is an array index, and must be exactly four bytes.
 632 .IP *
 633 .BR map_delete_elem ()
 634 fails with the error
 635 .BR EINVAL ,
 636 since elements cannot be deleted.
 637 .IP *
 638 .BR map_update_elem ()
 639 replaces elements in a
 640 .B nonatomic
 641 fashion;
 642 for atomic updates, a hash-table map should be used instead.
 643 There is however one special case that can also be used with arrays:
 644 the atomic built-in
 645 .B __sync_fetch_and_add()
can be used on 32-bit and 64-bit atomic counters.
 647 For example, it can be
 648 applied on the whole value itself if it represents a single counter,
 649 or in case of a structure containing multiple counters, it could be
 650 used on individual counters.
 651 This is quite often useful for aggregation and accounting of events.
 652 .RE
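.IP
The per-counter use of the built-in can be sketched in plain C.
The structure layout and the function name are hypothetical;
in an eBPF program, the
.I stats
pointer would be the value pointer returned by
.BR bpf_map_lookup_elem ().

```c
#include <stdint.h>

/* Hypothetical layout of one array-map value holding two counters. */
struct pkt_stats {
    uint64_t packets;
    uint64_t bytes;
};

/* Account for one packet; each counter is incremented atomically. */
static void
account_packet(struct pkt_stats *stats, uint32_t pkt_len)
{
    __sync_fetch_and_add(&stats->packets, 1);
    __sync_fetch_and_add(&stats->bytes, pkt_len);
}
```

Each counter is updated atomically on its own;
the two counters are not updated together as a single atomic unit.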
 653 .IP
 654 Among the uses for array maps are the following:
 655 .RS
 656 .IP * 3
 657 As "global" eBPF variables: an array of 1 element whose key is (index) 0
 658 and where the value is a collection of 'global' variables which
 659 eBPF programs can use to keep state between events.
 660 .IP *
 661 Aggregation of tracing events into a fixed set of buckets.
 662 .IP *
 663 Accounting of networking events, for example, number of packets and packet
 664 sizes.
 665 .RE
 666 .TP
 667 .BR BPF_MAP_TYPE_PROG_ARRAY " (since Linux 4.2)"
 668 A program array map is a special kind of array map whose map values
 669 contain only file descriptors referring to other eBPF programs.
 670 Thus, both the
 671 .I key_size
 672 and
 673 .I value_size
 674 must be exactly four bytes.
 675 This map is used in conjunction with the
 676 .BR bpf_tail_call ()
 677 helper.
 678 .IP
 679 This means that an eBPF program with a program array map attached to it
 680 can call from kernel side into
 681 .IP
 682 .in +4n
 683 .EX
 684 void bpf_tail_call(void *context, void *prog_map,
 685                    unsigned int index);
 686 .EE
 687 .in
 688 .IP
 689 and therefore replace its own program flow with the one from the program
 690 at the given program array slot, if present.
This can be regarded as a kind of jump table to a different eBPF program.
 692 The invoked program will then reuse the same stack.
 693 When a jump into the new program has been performed,
 694 it won't return to the old program anymore.
 695 .IP
 696 If no eBPF program is found at the given index of the program array
 697 (because the map slot doesn't contain a valid program file descriptor,
 698 the specified lookup index/key is out of bounds,
 699 or the limit of 32
 700 .\" MAX_TAIL_CALL_CNT
nested calls has been exceeded),
 702 execution continues with the current eBPF program.
 703 This can be used as a fall-through for default cases.
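.IP
The control flow described above can be modeled in user space.
This is only an illustration of the semantics stated in this page
(slot lookup, fall-through on failure, and a cap of 32 tail calls),
not kernel code, and every name in it is hypothetical.

```c
#include <stddef.h>

#define MODEL_MAX_TAIL_CALLS 32     /* limit stated in this page */
#define MODEL_SLOTS          4      /* stands in for max_entries */

/* A modeled program returns the slot it wants to tail-call,
   or -1 if it does not attempt a tail call. */
typedef int (*model_prog)(void);

static int prog_dispatch(void)  { return 1; }   /* jump to slot 1 */
static int prog_loop(void)      { return 1; }   /* slot 1 calls itself */
static int prog_bad_index(void) { return 99; }  /* out-of-bounds index */

/* Run 'entry', following tail calls; return how many were taken. */
static int
model_run(model_prog entry, model_prog slots[MODEL_SLOTS])
{
    model_prog cur = entry;
    int taken = 0;

    for (;;) {
        int idx = cur();

        /* No tail call, or a failed one: the current program just
           continues to its own exit. */
        if (idx < 0 || idx >= MODEL_SLOTS || slots[idx] == NULL ||
                taken >= MODEL_MAX_TAIL_CALLS)
            return taken;

        cur = slots[idx];       /* control never returns to the caller */
        taken++;
    }
}
```

A program that keeps tail-calling its own slot is stopped by the cap
rather than looping forever, while an empty slot or an out-of-bounds
index results in no jump at all.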
 704 .IP
 705 A program array map is useful, for example, in tracing or networking, to
 706 handle individual system calls or protocols in their own subprograms and
 707 use their identifiers as an individual map index.
 708 This approach may result in performance benefits,
 709 and also makes it possible to overcome the maximum
 710 instruction limit of a single eBPF program.
 711 In dynamic environments,
 712 a user-space daemon might atomically replace individual subprograms
 713 at run-time with newer versions to alter overall program behavior,
 714 for instance, if global policies change.
 715 .\"
 716 .SS eBPF programs
 717 The
 718 .B BPF_PROG_LOAD
 719 command is used to load an eBPF program into the kernel.
 720 The return value for this command is a new file descriptor associated
 721 with this eBPF program.
 722 .PP
 723 .in +4n
 724 .EX
 725 char bpf_log_buf[LOG_BUF_SIZE];
 726
 727 int
 728 bpf_prog_load(enum bpf_prog_type type,
 729               const struct bpf_insn *insns, int insn_cnt,
 730               const char *license)
 731 {
 732     union bpf_attr attr = {
 733         .prog_type = type,
 734         .insns     = ptr_to_u64(insns),
 735         .insn_cnt  = insn_cnt,
 736         .license   = ptr_to_u64(license),
 737         .log_buf   = ptr_to_u64(bpf_log_buf),
 738         .log_size  = LOG_BUF_SIZE,
 739         .log_level = 1,
 740     };
 741
 742     return bpf(BPF_PROG_LOAD, &attr, sizeof(attr));
 743 }
 744 .EE
 745 .in
 746 .PP
 747 .I prog_type
 748 is one of the available program types:
 749 .IP
 750 .in +4n
 751 .EX
 752 enum bpf_prog_type {
 753     BPF_PROG_TYPE_UNSPEC,        /* Reserve 0 as invalid
 754                                     program type */
 755     BPF_PROG_TYPE_SOCKET_FILTER,
 756     BPF_PROG_TYPE_KPROBE,
 757     BPF_PROG_TYPE_SCHED_CLS,
 758     BPF_PROG_TYPE_SCHED_ACT,
 759 };
 760 .EE
 761 .in
 762 .PP
 763 For further details of eBPF program types, see below.
 764 .PP
 765 The remaining fields of
 766 .I bpf_attr
 767 are set as follows:
 768 .IP * 3
 769 .I insns
 770 is an array of
 771 .I "struct bpf_insn"
 772 instructions.
 773 .IP *
 774 .I insn_cnt
 775 is the number of instructions in the program referred to by
 776 .IR insns .
 777 .IP *
 778 .I license
 779 is a license string, which must be GPL compatible to call helper functions
 780 marked
 781 .IR gpl_only .
(The licensing rules are the same as for kernel modules,
so dual licenses, such as "Dual BSD/GPL", may also be used.)
 784 .IP *
 785 .I log_buf
 786 is a pointer to a caller-allocated buffer in which the in-kernel
 787 verifier can store the verification log.
 788 This log is a multi-line string that can be checked by
 789 the program author in order to understand how the verifier came to
 790 the conclusion that the eBPF program is unsafe.
 791 The format of the output can change at any time as the verifier evolves.
 792 .IP *
 793 .I log_size
is the size of the buffer pointed to by
 795 .IR log_buf .
 796 If the size of the buffer is not large enough to store all
 797 verifier messages, \-1 is returned and
 798 .I errno
 799 is set to
 800 .BR ENOSPC .
 801 .IP *
 802 .I log_level
is the verbosity level of the verifier.
 804 A value of zero means that the verifier will not provide a log;
 805 in this case,
 806 .I log_buf
 807 must be a NULL pointer, and
 808 .I log_size
 809 must be zero.
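.PP
The fields above can be exercised with about the smallest useful program,
"r0 = 0; exit".
The following sketch writes the two instructions out as
.I "struct bpf_insn"
initializers instead of using the instruction macros that appear in
EXAMPLES, since those macros are not provided by
.IR <linux/bpf.h> .
The buffer size and all function and variable names are assumptions made
for this example:

```c
#define _GNU_SOURCE
#include <linux/bpf.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

#define LOG_BUF_SIZE 65536      /* arbitrary size chosen for this sketch */
static char verifier_log[LOG_BUF_SIZE];

/* "r0 = 0; exit", spelled out field by field. */
static const struct bpf_insn trivial_prog[] = {
    { .code = BPF_ALU64 | BPF_MOV | BPF_K, .dst_reg = BPF_REG_0, .imm = 0 },
    { .code = BPF_JMP | BPF_EXIT },
};

/* Returns a program file descriptor, or -1 with errno set on error. */
static int
load_trivial_prog(void)
{
    union bpf_attr attr;

    memset(&attr, 0, sizeof(attr));
    attr.prog_type = BPF_PROG_TYPE_SOCKET_FILTER;
    attr.insns     = (uint64_t) (uintptr_t) trivial_prog;
    attr.insn_cnt  = sizeof(trivial_prog) / sizeof(trivial_prog[0]);
    attr.license   = (uint64_t) (uintptr_t) "GPL";
    attr.log_buf   = (uint64_t) (uintptr_t) verifier_log;
    attr.log_size  = sizeof(verifier_log);
    attr.log_level = 1;

    return syscall(__NR_bpf, BPF_PROG_LOAD, &attr, sizeof(attr));
}
```

If the call fails,
.I verifier_log
may contain the verifier's explanation and is worth printing alongside
.IR errno .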
 810 .PP
 811 Applying
 812 .BR close (2)
 813 to the file descriptor returned by
 814 .B BPF_PROG_LOAD
 815 will unload the eBPF program (but see NOTES).
 816 .PP
 817 Maps are accessible from eBPF programs and are used to exchange data between
 818 eBPF programs and between eBPF programs and user-space programs.
 819 For example,
 820 eBPF programs can process various events (like kprobe, packets) and
 821 store their data into a map,
 822 and user-space programs can then fetch data from the map.
 823 Conversely, user-space programs can use a map as a configuration mechanism,
 824 populating the map with values checked by the eBPF program,
 825 which then modifies its behavior on the fly according to those values.
 826 .\"
 827 .\"
 828 .SS eBPF program types
 829 The eBPF program type
 830 .RI ( prog_type )
 831 determines the subset of kernel helper functions that the program
 832 may call.
 833 The program type also determines the program input (context)\(emthe
 834 format of
 835 .I "struct bpf_context"
 836 (which is the data blob passed into the eBPF program as the first argument).
 837 .\"
 838 .\" FIXME
 839 .\" Somewhere in this page we need a general introduction to the
 840 .\" bpf_context. For example, how does a BPF program access the
 841 .\" context?
 842 .PP
 843 For example, a tracing program does not have the exact same
 844 subset of helper functions as a socket filter program
 845 (though they may have some helpers in common).
 846 Similarly,
 847 the input (context) for a tracing program is a set of register values,
 848 while for a socket filter it is a network packet.
 849 .PP
 850 The set of functions available to eBPF programs of a given type may increase
 851 in the future.
 852 .PP
 853 The following program types are supported:
 854 .TP
 855 .BR BPF_PROG_TYPE_SOCKET_FILTER " (since Linux 3.19)"
 856 Currently, the set of functions for
 857 .B BPF_PROG_TYPE_SOCKET_FILTER
 858 is:
 859 .IP
 860 .in +4n
 861 .EX
 862 bpf_map_lookup_elem(map_fd, void *key)
 863                     /* look up key in a map_fd */
 864 bpf_map_update_elem(map_fd, void *key, void *value)
 865                     /* update key/value */
 866 bpf_map_delete_elem(map_fd, void *key)
 867                     /* delete key in a map_fd */
 868 .EE
 869 .in
 870 .IP
 871 The
 872 .I bpf_context
 873 argument is a pointer to a
 874 .IR "struct __sk_buff" .
 875 .\" FIXME: We need some text here to explain how the program
 876 .\" accesses __sk_buff.
 877 .\" See 'struct __sk_buff' and commit 9bac3d6d548e5
 878 .\"
 879 .\" Alexei commented:
 880 .\" Actually now in case of SOCKET_FILTER, SCHED_CLS, SCHED_ACT
 881 .\" the program can now access skb fields.
 882 .\"
 883 .TP
 884 .BR BPF_PROG_TYPE_KPROBE " (since Linux 4.1)"
 885 .\" commit 2541517c32be2531e0da59dfd7efc1ce844644f5
 886 [To be documented]
 887 .\" FIXME Document this program type
 888 .\"       Describe allowed helper functions for this program type
 889 .\"       Describe bpf_context for this program type
 890 .\"
 891 .\" FIXME We need text here to describe 'kern_version'
 892 .TP
 893 .BR BPF_PROG_TYPE_SCHED_CLS " (since Linux 4.1)"
 894 .\" commit 96be4325f443dbbfeb37d2a157675ac0736531a1
 895 .\" commit e2e9b6541dd4b31848079da80fe2253daaafb549
 896 [To be documented]
 897 .\" FIXME Document this program type
 898 .\"       Describe allowed helper functions for this program type
 899 .\"       Describe bpf_context for this program type
 900 .TP
 901 .BR BPF_PROG_TYPE_SCHED_ACT " (since Linux 4.1)"
 902 .\" commit 94caee8c312d96522bcdae88791aaa9ebcd5f22c
 903 .\" commit a8cb5f556b567974d75ea29c15181c445c541b1f
 904 [To be documented]
 905 .\" FIXME Document this program type
 906 .\"       Describe allowed helper functions for this program type
 907 .\"       Describe bpf_context for this program type
 908 .SS Events
 909 Once a program is loaded, it can be attached to an event.
 910 Various kernel subsystems have different ways to do so.
 911 .PP
 912 Since Linux 3.19,
 913 .\" commit 89aa075832b0da4402acebd698d0411dcc82d03e
 914 the following call will attach the program
 915 .I prog_fd
 916 to the socket
 917 .IR sockfd ,
 918 which was created by an earlier call to
 919 .BR socket (2):
 920 .PP
 921 .in +4n
 922 .EX
 923 setsockopt(sockfd, SOL_SOCKET, SO_ATTACH_BPF,
 924            &prog_fd, sizeof(prog_fd));
 925 .EE
 926 .in
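.PP
A sketch of the same call with explicit error reporting follows.
It may be preferable to wrapping the call in
.BR assert (3),
as the program in EXAMPLES does, because assertions are compiled out
when
.B NDEBUG
is defined.
The function name is hypothetical, and the fallback definition of
.B SO_ATTACH_BPF
is an assumption for older headers
(50 is the value used on most, but not all, architectures):

```c
#include <stdio.h>
#include <sys/socket.h>

#ifndef SO_ATTACH_BPF
#define SO_ATTACH_BPF 50    /* assumed fallback; check your headers */
#endif

/* Attach the eBPF program 'prog_fd' to 'sockfd'.
   Returns 0 on success, or -1 after reporting the error. */
static int
attach_prog_to_socket(int sockfd, int prog_fd)
{
    if (setsockopt(sockfd, SOL_SOCKET, SO_ATTACH_BPF,
                   &prog_fd, sizeof(prog_fd)) == -1) {
        perror("setsockopt(SO_ATTACH_BPF)");
        return -1;
    }
    return 0;
}
```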
 927 .PP
 928 Since Linux 4.1,
 929 .\" commit 2541517c32be2531e0da59dfd7efc1ce844644f5
 930 the following call may be used to attach
 931 the eBPF program referred to by the file descriptor
 932 .I prog_fd
 933 to a perf event file descriptor,
 934 .IR event_fd ,
 935 that was created by a previous call to
 936 .BR perf_event_open (2):
 937 .PP
 938 .in +4n
 939 .EX
 940 ioctl(event_fd, PERF_EVENT_IOC_SET_BPF, prog_fd);
 941 .EE
 942 .in
 943 .\"
 944 .\"
 945 .SH EXAMPLES
.EX
/* bpf+sockets example:
 * 1. create array map of 256 elements
 * 2. load program that counts number of packets received
 *    r0 = skb->data[ETH_HLEN + offsetof(struct iphdr, protocol)]
 *    map[r0]++
 * 3. attach prog_fd to raw socket via setsockopt()
 * 4. print number of received TCP/UDP packets every second
 *
 * bpf_create_map(), bpf_prog_load(), and bpf_lookup_elem() are
 * assumed to be thin wrappers around bpf(); open_raw_sock() is a
 * helper from samples/bpf.
 */
int
main(int argc, char **argv)
{
    int sock, map_fd, prog_fd, key;
    long long value = 0, tcp_cnt, udp_cnt;

    map_fd = bpf_create_map(BPF_MAP_TYPE_ARRAY, sizeof(key),
                            sizeof(value), 256);
    if (map_fd < 0) {
        printf("failed to create map '%s'\en", strerror(errno));
        /* likely not run as root */
        return 1;
    }

    struct bpf_insn prog[] = {
        BPF_MOV64_REG(BPF_REG_6, BPF_REG_1),        /* r6 = r1 */
        BPF_LD_ABS(BPF_B, ETH_HLEN + offsetof(struct iphdr, protocol)),
                                /* r0 = ip->proto */
        BPF_STX_MEM(BPF_W, BPF_REG_10, BPF_REG_0, -4),
                                /* *(u32 *)(fp - 4) = r0 */
        BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),       /* r2 = fp */
        BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -4),      /* r2 = r2 - 4 */
        BPF_LD_MAP_FD(BPF_REG_1, map_fd),           /* r1 = map_fd */
        BPF_CALL_FUNC(BPF_FUNC_map_lookup_elem),
                                /* r0 = map_lookup(r1, r2) */
        BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 2),
                                /* if (r0 == 0) goto pc+2 */
        BPF_MOV64_IMM(BPF_REG_1, 1),                /* r1 = 1 */
        BPF_XADD(BPF_DW, BPF_REG_0, BPF_REG_1, 0, 0),
                                /* lock *(u64 *) r0 += r1 */
.\"                                == atomic64_add
        BPF_MOV64_IMM(BPF_REG_0, 0),                /* r0 = 0 */
        BPF_EXIT_INSN(),                            /* return r0 */
    };

    prog_fd = bpf_prog_load(BPF_PROG_TYPE_SOCKET_FILTER, prog,
                            sizeof(prog) / sizeof(prog[0]), "GPL");
    if (prog_fd < 0) {
        printf("failed to load program '%s'\en", strerror(errno));
        return 1;
    }

    sock = open_raw_sock("lo");
    if (sock < 0)
        return 1;

    /* Not wrapped in assert(): the call must survive NDEBUG */
    if (setsockopt(sock, SOL_SOCKET, SO_ATTACH_BPF, &prog_fd,
                   sizeof(prog_fd)) != 0) {
        printf("failed to attach program '%s'\en", strerror(errno));
        return 1;
    }

    for (;;) {
        key = IPPROTO_TCP;
        if (bpf_lookup_elem(map_fd, &key, &tcp_cnt) != 0)
            return 1;
        key = IPPROTO_UDP;
        if (bpf_lookup_elem(map_fd, &key, &udp_cnt) != 0)
            return 1;
        printf("TCP %lld UDP %lld packets\en", tcp_cnt, udp_cnt);
        sleep(1);
    }

    return 0;
}
.EE
1010 .PP
1011 Some complete working code can be found in the
1012 .I samples/bpf
1013 directory in the kernel source tree.
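.PP
The example above calls
.BR open_raw_sock ()
without defining it.
The following is a sketch of such a helper, in the spirit of the one in
.IR samples/bpf ;
the details (flags, error handling, and the order of the steps)
are assumptions and not a copy of the kernel sample.
.PP
.in +4n
.EX
```c
#define _GNU_SOURCE
#include <arpa/inet.h>
#include <errno.h>
#include <linux/if_ether.h>
#include <linux/if_packet.h>
#include <net/if.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Sketch of a raw-socket helper; details are assumptions.
 * Returns a packet socket bound to the interface called name,
 * or -1 with errno set. */
static int
open_raw_sock(const char *name)
{
    struct sockaddr_ll sll;
    unsigned int ifindex;
    int sock, saved_errno;

    /* Resolve the interface first, so that a bad name fails
     * before any privileged operation is attempted. */
    ifindex = if_nametoindex(name);
    if (ifindex == 0) {
        errno = ENODEV;
        return -1;
    }

    /* Creating a packet socket requires CAP_NET_RAW. */
    sock = socket(AF_PACKET, SOCK_RAW | SOCK_CLOEXEC,
                  htons(ETH_P_ALL));
    if (sock == -1)
        return -1;

    /* Bind to the interface so that only its packets are seen. */
    memset(&sll, 0, sizeof(sll));
    sll.sll_family = AF_PACKET;
    sll.sll_ifindex = ifindex;
    sll.sll_protocol = htons(ETH_P_ALL);
    if (bind(sock, (struct sockaddr *) &sll, sizeof(sll)) == -1) {
        saved_errno = errno;
        close(sock);
        errno = saved_errno;
        return -1;
    }

    return sock;
}
```
.EE
.in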
1014 .SH RETURN VALUE
1015 For a successful call, the return value depends on the operation:
1016 .TP
1017 .B BPF_MAP_CREATE
1018 The new file descriptor associated with the eBPF map.
1019 .TP
1020 .B BPF_PROG_LOAD
1021 The new file descriptor associated with the eBPF program.
1022 .TP
1023 All other commands
1024 Zero.
1025 .PP
1026 On error, \-1 is returned, and
1027 .I errno
is set to indicate the error.
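.PP
The return convention can be sketched with a small wrapper.
The helper names
.BR sys_bpf ()
and
.BR create_map ()
are hypothetical; the raw
.BR syscall (2)
is used because glibc provides no wrapper for
.BR bpf ().
.PP
.in +4n
.EX
```c
#define _GNU_SOURCE
#include <errno.h>
#include <linux/bpf.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical thin wrapper around the raw system call. */
static int
sys_bpf(int cmd, union bpf_attr *attr, unsigned int size)
{
    return syscall(__NR_bpf, cmd, attr, size);
}

/* Hypothetical helper: create a map and return its file
 * descriptor (>= 0), or -1 with errno set. */
static int
create_map(enum bpf_map_type type, unsigned int key_size,
           unsigned int value_size, unsigned int max_entries)
{
    union bpf_attr attr;

    /* Fields not used by the command must be zero. */
    memset(&attr, 0, sizeof(attr));
    attr.map_type = type;
    attr.key_size = key_size;
    attr.value_size = value_size;
    attr.max_entries = max_entries;

    return sys_bpf(BPF_MAP_CREATE, &attr, sizeof(attr));
}
```
.EE
.in
.PP
A caller would test the result against \-1 and consult
.I errno
before using the returned value as a file descriptor.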
1029 .SH ERRORS
1030 .TP
1031 .B E2BIG
1032 The eBPF program is too large or a map reached the
1033 .I max_entries
1034 limit (maximum number of elements).
1035 .TP
1036 .B EACCES
1037 For
1038 .BR BPF_PROG_LOAD ,
1039 even though all program instructions are valid, the program has been
1040 rejected because it was deemed unsafe.
The program may have accessed a disallowed memory region or
an uninitialized stack slot or register,
the function constraints may not match the actual types,
or there may have been a misaligned memory access.
1045 In this case, it is recommended to call
1046 .BR bpf ()
1047 again with
1048 .I log_level = 1
1049 and examine
1050 .I log_buf
1051 for the specific reason provided by the verifier.
1052 .TP
1053 .B EBADF
1054 .I fd
1055 is not an open file descriptor.
1056 .TP
1057 .B EFAULT
1058 One of the pointers
1059 .RI ( key
1060 or
1061 .I value
1062 or
1063 .I log_buf
1064 or
1065 .IR insns )
1066 is outside the accessible address space.
1067 .TP
1068 .B EINVAL
1069 The value specified in
1070 .I cmd
1071 is not recognized by this kernel.
1072 .TP
1073 .B EINVAL
1074 For
1075 .BR BPF_MAP_CREATE ,
1076 either
1077 .I map_type
1078 or attributes are invalid.
1079 .TP
1080 .B EINVAL
1081 For
1082 .B BPF_MAP_*_ELEM
1083 commands,
1084 some of the fields of
1085 .I "union bpf_attr"
1086 that are not used by this command
1087 are not set to zero.
1088 .TP
1089 .B EINVAL
1090 For
1091 .BR BPF_PROG_LOAD ,
1092 indicates an attempt to load an invalid program.
1093 eBPF programs can be deemed
1094 invalid due to unrecognized instructions, the use of reserved fields, jumps
1095 out of range, infinite loops or calls of unknown functions.
1096 .TP
1097 .B ENOENT
1098 For
1099 .B BPF_MAP_LOOKUP_ELEM
1100 or
1101 .BR BPF_MAP_DELETE_ELEM ,
1102 indicates that the element with the given
1103 .I key
1104 was not found.
1105 .TP
1106 .B ENOMEM
1107 Cannot allocate sufficient memory.
1108 .TP
1109 .B EPERM
1110 The call was made without sufficient privilege
1111 (without the
1112 .B CAP_SYS_ADMIN
1113 capability).
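.PP
The advice given under
.B EACCES
above, to retry with the verifier log enabled,
can be sketched as follows.
The helper name
.BR load_with_log ()
is hypothetical, and the choice of
.B BPF_PROG_TYPE_SOCKET_FILTER
is only an assumption made to keep the example short.
.PP
.in +4n
.EX
```c
#define _GNU_SOURCE
#include <errno.h>
#include <linux/bpf.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical helper: load a socket filter program with the
 * verifier log enabled.  On failure, log (if nonempty) holds
 * the verifier's explanation.
 * Returns a program file descriptor, or -1 with errno set. */
static int
load_with_log(const struct bpf_insn *insns, unsigned int insn_cnt,
              const char *license, char *log, unsigned int log_size)
{
    union bpf_attr attr;

    memset(&attr, 0, sizeof(attr));
    attr.prog_type = BPF_PROG_TYPE_SOCKET_FILTER;
    attr.insns = (uint64_t) (uintptr_t) insns;
    attr.insn_cnt = insn_cnt;
    attr.license = (uint64_t) (uintptr_t) license;
    attr.log_level = 1;             /* ask for verifier output */
    attr.log_size = log_size;
    attr.log_buf = (uint64_t) (uintptr_t) log;

    if (log_size > 0)
        log[0] = 0;     /* empty string if nothing is written */

    return syscall(__NR_bpf, BPF_PROG_LOAD, &attr, sizeof(attr));
}
```
.EE
.in
.PP
When the helper returns \-1, a caller would typically print
.I errno
together with the contents of the log buffer.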
1114 .SH VERSIONS
1115 The
1116 .BR bpf ()
1117 system call first appeared in Linux 3.18.
1118 .SH CONFORMING TO
1119 The
1120 .BR bpf ()
1121 system call is Linux-specific.
1122 .SH NOTES
1123 In the current implementation, all
1124 .BR bpf ()
1125 commands require the caller to have the
1126 .B CAP_SYS_ADMIN
1127 capability.
1128 .PP
1129 eBPF objects (maps and programs) can be shared between processes.
1130 For example, after
1131 .BR fork (2),
1132 the child inherits file descriptors referring to the same eBPF objects.
1133 In addition, file descriptors referring to eBPF objects can be
1134 transferred over UNIX domain sockets.
1135 File descriptors referring to eBPF objects can be duplicated
1136 in the usual way, using
1137 .BR dup (2)
1138 and similar calls.
1139 An eBPF object is deallocated only after all file descriptors
1140 referring to the object have been closed.
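.PP
Transferring a file descriptor over a UNIX domain socket uses the
.B SCM_RIGHTS
ancillary message described in
.BR unix (7).
The sketch below is generic and works for any file descriptor,
including one that refers to an eBPF object;
the helper names
.BR send_fd ()
and
.BR recv_fd ()
are hypothetical.
.PP
.in +4n
.EX
```c
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

/* Control buffer sized and aligned for one file descriptor. */
union fd_cmsg {
    char buf[CMSG_SPACE(sizeof(int))];
    struct cmsghdr align;
};

/* Hypothetical helper: send file descriptor fd over the UNIX
 * domain socket sock.  Returns 0 on success, -1 on error. */
static int
send_fd(int sock, int fd)
{
    char dummy = 'x';           /* one byte of ordinary data */
    struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
    union fd_cmsg u;
    struct msghdr msg;
    struct cmsghdr *cmsg;

    memset(&u, 0, sizeof(u));
    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = u.buf;
    msg.msg_controllen = sizeof(u.buf);

    cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Hypothetical helper: receive a file descriptor sent by
 * send_fd().  Returns the new descriptor, or -1 on error. */
static int
recv_fd(int sock)
{
    char dummy;
    struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
    union fd_cmsg u;
    struct msghdr msg;
    struct cmsghdr *cmsg;
    int fd;

    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = u.buf;
    msg.msg_controllen = sizeof(u.buf);

    if (recvmsg(sock, &msg, 0) != 1)
        return -1;

    cmsg = CMSG_FIRSTHDR(&msg);
    if (cmsg == NULL || cmsg->cmsg_level != SOL_SOCKET ||
        cmsg->cmsg_type != SCM_RIGHTS ||
        cmsg->cmsg_len != CMSG_LEN(sizeof(int)))
        return -1;

    memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
    return fd;
}
```
.EE
.in
.PP
The descriptor returned by
.BR recv_fd ()
is a new descriptor in the receiving process
that refers to the same object as the one that was sent.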
1141 .PP
1142 eBPF programs can be written in a restricted C that is compiled (using the
1143 .B clang
1144 compiler) into eBPF bytecode.
1145 Various features are omitted from this restricted C, such as loops,
1146 global variables, variadic functions, floating-point numbers,
1147 and passing structures as function arguments.
1148 Some examples can be found in the
1149 .I samples/bpf/*_kern.c
1150 files in the kernel source tree.
1151 .\" There are also examples for the tc classifier, in the iproute2
1152 .\" project, in examples/bpf
1153 .PP
1154 The kernel contains a just-in-time (JIT) compiler that translates
1155 eBPF bytecode into native machine code for better performance.
1156 In kernels before Linux 4.15,
1157 the JIT compiler is disabled by default,
1158 but its operation can be controlled by writing one of the
1159 following integer strings to the file
1160 .IR /proc/sys/net/core/bpf_jit_enable :
1161 .IP 0 3
1162 Disable JIT compilation (default).
1163 .IP 1
1164 Normal compilation.
1165 .IP 2
1166 Debugging mode.
1167 The generated opcodes are dumped in hexadecimal into the kernel log.
1168 These opcodes can then be disassembled using the program
1169 .I tools/net/bpf_jit_disasm.c
1170 provided in the kernel source tree.
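.PP
As a small illustration, the current setting can be read back
from the same file.
The helper names
.BR jit_mode_name ()
and
.BR read_bpf_jit_enable ()
are hypothetical, and the descriptive strings are only
paraphrases of the list above.
.PP
.in +4n
.EX
```c
#include <stdio.h>

/* Hypothetical helper: describe a bpf_jit_enable setting. */
static const char *
jit_mode_name(int mode)
{
    switch (mode) {
    case 0:  return "disabled";
    case 1:  return "enabled";
    case 2:  return "enabled, with debugging output";
    default: return "unknown";
    }
}

/* Hypothetical helper: read the current setting.
 * Returns the integer found in the file, or -1 if the file
 * cannot be opened or parsed. */
static int
read_bpf_jit_enable(void)
{
    FILE *fp = fopen("/proc/sys/net/core/bpf_jit_enable", "r");
    int mode;

    if (fp == NULL)
        return -1;
    if (fscanf(fp, "%d", &mode) != 1)
        mode = -1;
    fclose(fp);
    return mode;
}
```
.EE
.in
.PP
For example,
.I puts(jit_mode_name(read_bpf_jit_enable()))
prints a one-line summary, or "unknown" if the file is unreadable.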
1171 .PP
1172 Since Linux 4.15,
1173 .\" commit 290af86629b25ffd1ed6232c4e9107da031705cb
the kernel may be configured with the
1175 .B CONFIG_BPF_JIT_ALWAYS_ON
1176 option.
In this case, the JIT compiler is always enabled, and
.I bpf_jit_enable
is initialized to 1 and is immutable.
1180 (This kernel configuration option was provided as a mitigation for
1181 one of the Spectre attacks against the BPF interpreter.)
1182 .PP
1183 The JIT compiler for eBPF is currently
1184 .\" Last reviewed in Linux 4.18-rc by grepping for BPF_ALU64 in arch/
1185 .\" and by checking the documentation for bpf_jit_enable in
1186 .\" Documentation/sysctl/net.txt
1187 available for the following architectures:
1188 .IP * 3
1189 x86-64 (since Linux 3.18; cBPF since Linux 3.0);
1190 .\" commit 0a14842f5a3c0e88a1e59fac5c3025db39721f74
1191 .PD 0
1192 .IP *
1193 ARM32 (since Linux 3.18; cBPF since Linux 3.4);
1194 .\" commit ddecdfcea0ae891f782ae853771c867ab51024c2
1195 .IP *
1196 SPARC 32 (since Linux 3.18; cBPF since Linux 3.5);
1197 .\" commit 2809a2087cc44b55e4377d7b9be3f7f5d2569091
1198 .IP *
1199 ARM-64 (since Linux 3.18);
1200 .\" commit e54bcde3d69d40023ae77727213d14f920eb264a
1201 .IP *
1202 s390 (since Linux 4.1; cBPF since Linux 3.7);
1203 .\" commit c10302efe569bfd646b4c22df29577a4595b4580
1204 .IP *
1205 PowerPC 64 (since Linux 4.8; cBPF since Linux 3.1);
1206 .\" commit 0ca87f05ba8bdc6791c14878464efc901ad71e99
1207 .\" commit 156d0e290e969caba25f1851c52417c14d141b24
1208 .IP *
1209 SPARC 64 (since Linux 4.12);
1210 .\" commit 7a12b5031c6b947cc13918237ae652b536243b76
1211 .IP *
1212 x86-32 (since Linux 4.18);
1213 .\" commit 03f5781be2c7b7e728d724ac70ba10799cc710d7
1214 .IP *
1215 MIPS 64 (since Linux 4.18; cBPF since Linux 3.16);
1216 .\" commit c6610de353da5ca6eee5b8960e838a87a90ead0c
1217 .\" commit f381bf6d82f032b7410185b35d000ea370ac706b
1218 .IP *
1219 riscv (since Linux 5.1).
1220 .\" commit 2353ecc6f91fd15b893fa01bf85a1c7a823ee4f2
1221 .PD
1222 .SH SEE ALSO
1223 .BR seccomp (2),
1224 .BR bpf-helpers (7),
1225 .BR socket (7),
1226 .BR tc (8),
1227 .BR tc-bpf (8)
1228 .PP
1229 Both classic and extended BPF are explained in the kernel source file
1230 .IR Documentation/networking/filter.txt .