man2/bpf.2

   1 .\" Copyright (C) 2015 Alexei Starovoitov <ast@kernel.org>
   2 .\" and Copyright (C) 2015 Michael Kerrisk <mtk.manpages@gmail.com>
   3 .\"
   4 .\" %%%LICENSE_START(VERBATIM)
   5 .\" Permission is granted to make and distribute verbatim copies of this
   6 .\" manual provided the copyright notice and this permission notice are
   7 .\" preserved on all copies.
   8 .\"
   9 .\" Permission is granted to copy and distribute modified versions of this
  10 .\" manual under the conditions for verbatim copying, provided that the
  11 .\" entire resulting derived work is distributed under the terms of a
  12 .\" permission notice identical to this one.
  13 .\"
  14 .\" Since the Linux kernel and libraries are constantly changing, this
  15 .\" manual page may be incorrect or out-of-date.  The author(s) assume no
  16 .\" responsibility for errors or omissions, or for damages resulting from
  17 .\" the use of the information contained herein.  The author(s) may not
  18 .\" have taken the same level of care in the production of this manual,
  19 .\" which is licensed free of charge, as they might when working
  20 .\" professionally.
  21 .\"
  22 .\" Formatted or processed versions of this manual, if unaccompanied by
  23 .\" the source, must acknowledge the copyright and authors of this work.
  24 .\" %%%LICENSE_END
  25 .\"
  26 .TH BPF 2 2016-10-08 "Linux" "Linux Programmer's Manual"
  27 .SH NAME
  28 bpf \- perform a command on an extended BPF map or program
  29 .SH SYNOPSIS
  30 .nf
  31 .B #include <linux/bpf.h>
  32 .sp
  33 .BI "int bpf(int " cmd ", union bpf_attr *" attr ", unsigned int " size );
  34 .SH DESCRIPTION
  35 The
  36 .BR bpf ()
  37 system call performs a range of operations related to extended
  38 Berkeley Packet Filters.
  39 Extended BPF (or eBPF) is similar to
  40 the original ("classic") BPF (cBPF) used to filter network packets.
  41 For both cBPF and eBPF programs,
  42 the kernel statically analyzes the programs before loading them,
  43 in order to ensure that they cannot harm the running system.
  44 .P
  45 eBPF extends cBPF in multiple ways, including the ability to call
  46 a fixed set of in-kernel helper functions
  47 .\" See 'enum bpf_func_id' in include/uapi/linux/bpf.h
  48 (via the
  49 .B BPF_CALL
  50 opcode extension provided by eBPF)
  51 and access shared data structures such as eBPF maps.
  52 .\"
  53 .SS Extended BPF Design/Architecture
  54 eBPF maps are a generic data structure for storage of different data types.
  55 Data types are generally treated as binary blobs, so a user just specifies
  56 the size of the key and the size of the value at map-creation time.
  57 In other words, a key/value for a given map can have an arbitrary structure.
  58
  59 A user process can create multiple maps (with key/value-pairs being
  60 opaque bytes of data) and access them via file descriptors.
  61 Different eBPF programs can access the same maps in parallel.
  62 It's up to the user process and eBPF program to decide what they store
  63 inside maps.
  64
  65 There's one special map type, called a program array.
  66 This type of map stores file descriptors referring to other eBPF programs.
  67 When a lookup in the map is performed, the program flow is
  68 redirected in-place to the beginning of another eBPF program and does not
  69 return to the calling program.
  70 The level of nesting has a fixed limit of 32,
  71 .\" Defined by the kernel constant MAX_TAIL_CALL_CNT in include/linux/bpf.h
  72 so that infinite loops cannot be crafted.
  73 At runtime, the program file descriptors stored in the map can be modified,
  74 so program functionality can be altered based on specific requirements.
  75 All programs referred to in a program-array map must
  76 have been previously loaded into the kernel via
  77 .BR bpf ().
  78 If a map lookup fails, the current program continues its execution.
  79 See
  80 .B BPF_MAP_TYPE_PROG_ARRAY
  81 below for further details.
  82 .P
  83 Generally, eBPF programs are loaded by the user process and automatically
  84 unloaded when the process exits.
  85 In some cases, for example,
  86 .BR tc-bpf (8),
  87 the program will continue to stay alive inside the kernel even after the
  88 process that loaded the program exits.
  89 In that case,
  90 the tc subsystem holds a reference to the eBPF program after the
  91 file descriptor has been closed by the user-space program.
  92 Thus, whether a specific program continues to live inside the kernel
  93 depends on how it is further attached to a given kernel subsystem
  94 after it was loaded via
  95 .BR bpf ().
  96
  97 Each eBPF program is a set of instructions that is safe to run until
  98 its completion.
  99 An in-kernel verifier statically determines that the eBPF program
 100 terminates and is safe to execute.
 101 During verification, the kernel increments reference counts for each of
 102 the maps that the eBPF program uses,
 103 so that the attached maps can't be removed until the program is unloaded.
 104
 105 eBPF programs can be attached to different events.
 106 These events can be the arrival of network packets, tracing
 107 events, classification events by network queueing disciplines
 108 (for eBPF programs attached to a
 109 .BR tc (8)
 110 classifier), and other types that may be added in the future.
 111 A new event triggers execution of the eBPF program, which
 112 may store information about the event in eBPF maps.
 113 Beyond storing data, eBPF programs may call a fixed set of
 114 in-kernel helper functions.
 115
 116 The same eBPF program can be attached to multiple events and different
 117 eBPF programs can access the same map:
 118
 119 .in +4n
 120 .nf
 121 tracing     tracing    tracing    packet      packet     packet
 122 event A     event B    event C    on eth0     on eth1    on eth2
 123  |             |         |          |           |          ^
 124  |             |         |          |           v          |
 125  --> tracing <--     tracing      socket    tc ingress   tc egress
 126       prog_1          prog_2      prog_3    classifier    action
 127       |  |              |           |         prog_4      prog_5
 128    |---  -----|  |------|          map_3        |           |
 129  map_1       map_2                              --| map_4 |--
 130 .fi
 131 .in
 132 .\"
 133 .SS Arguments
 134 The operation to be performed by the
 135 .BR bpf ()
 136 system call is determined by the
 137 .IR cmd
 138 argument.
 139 Each operation takes an accompanying argument,
 140 provided via
 141 .IR attr ,
 142 which is a pointer to a union of type
 143 .IR bpf_attr
 144 (see below).
 145 The
 146 .I size
 147 argument is the size of the union pointed to by
 148 .IR attr .
 149
 150 The value provided in
 151 .IR cmd
 152 is one of the following:
 153 .TP
 154 .B BPF_MAP_CREATE
 155 Create a map and return a file descriptor that refers to the map.
 156 The close-on-exec file descriptor flag (see
 157 .BR fcntl (2))
 158 is automatically enabled for the new file descriptor.
 159 .TP
 160 .B BPF_MAP_LOOKUP_ELEM
 161 Look up an element by key in a specified map and return its value.
 162 .TP
 163 .B BPF_MAP_UPDATE_ELEM
 164 Create or update an element (key/value pair) in a specified map.
 165 .TP
 166 .B BPF_MAP_DELETE_ELEM
 167 Look up and delete an element by key in a specified map.
 168 .TP
 169 .B BPF_MAP_GET_NEXT_KEY
 170 Look up an element by key in a specified map and return the key
 171 of the next element.
 172 .TP
 173 .B BPF_PROG_LOAD
 174 Verify and load an eBPF program,
 175 returning a new file descriptor associated with the program.
 176 The close-on-exec file descriptor flag (see
 177 .BR fcntl (2))
 178 is automatically enabled for the new file descriptor.
 179 .P
 180 The
 181 .I bpf_attr
 182 union consists of various anonymous structures that are used by different
 183 .BR bpf ()
 184 commands:
 185
 186 .in +4n
 187 .nf
 188 union bpf_attr {
 189     struct {    /* Used by BPF_MAP_CREATE */
 190         __u32         map_type;
 191         __u32         key_size;    /* size of key in bytes */
 192         __u32         value_size;  /* size of value in bytes */
 193         __u32         max_entries; /* maximum number of entries
 194                                       in a map */
 195     };
 196
 197     struct {    /* Used by BPF_MAP_*_ELEM and BPF_MAP_GET_NEXT_KEY
 198                    commands */
 199         __u32         map_fd;
 200         __aligned_u64 key;
 201         union {
 202             __aligned_u64 value;
 203             __aligned_u64 next_key;
 204         };
 205         __u64         flags;
 206     };
 207
 208     struct {    /* Used by BPF_PROG_LOAD */
 209         __u32         prog_type;
 210         __u32         insn_cnt;
 211         __aligned_u64 insns;      /* 'const struct bpf_insn *' */
 212         __aligned_u64 license;    /* 'const char *' */
 213         __u32         log_level;  /* verbosity level of verifier */
 214         __u32         log_size;   /* size of user buffer */
 215         __aligned_u64 log_buf;    /* user supplied 'char *'
 216                                      buffer */
 217         __u32         kern_version;
 218                                   /* checked when prog_type=kprobe
 219                                      (since Linux 4.1) */
 220 .\"                 commit 2541517c32be2531e0da59dfd7efc1ce844644f5
 221     };
 222 } __attribute__((aligned(8)));
 223 .fi
 224 .in
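.PP
The wrapper functions shown in this page call
.BR bpf ()
and
.BR ptr_to_u64 ()
without defining them.
The following is a minimal sketch of how both could be written.
It assumes that the C library provides no
.BR bpf ()
wrapper, so the call goes through
.BR syscall (2),
and that
.B __NR_bpf
is made available by
.IR <sys/syscall.h> .
The body of
.BR ptr_to_u64 ()
is an assumption modeled on the helper of the same name in the kernel's
.I samples/bpf
code.
.PP
.in +4n
.nf
```c
#include <linux/bpf.h>      /* union bpf_attr, BPF_* commands */
#include <stdint.h>
#include <sys/syscall.h>    /* __NR_bpf */
#include <unistd.h>         /* syscall() */

/* Store a user-space pointer in an __aligned_u64 field of bpf_attr. */
static inline uint64_t
ptr_to_u64(const void *ptr)
{
    return (uint64_t) (uintptr_t) ptr;
}

/* Thin wrapper: returns -1 and sets errno on failure. */
static inline int
bpf(int cmd, union bpf_attr *attr, unsigned int size)
{
    return (int) syscall(__NR_bpf, cmd, attr, size);
}
```
.fi
.in
.PP
Going through a 64-bit integer keeps the layout of
.I bpf_attr
identical for 32-bit and 64-bit user space.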
 225 .\"
 226 .SS eBPF maps
 227 Maps are a generic data structure for storage of different types of data.
 228 They allow sharing of data between eBPF kernel programs,
 229 and also between kernel and user-space applications.
 230
 231 Each map type has the following attributes:
 232
 233 .PD 0
 234 .IP * 3
 235 type
 236 .IP *
 237 maximum number of elements
 238 .IP *
 239 key size in bytes
 240 .IP *
 241 value size in bytes
 242 .PD
 243 .PP
 244 The following wrapper functions demonstrate how various
 245 .BR bpf ()
 246 commands can be used to access the maps.
 247 The functions use the
 248 .IR cmd
 249 argument to invoke different operations.
 250 .TP
 251 .B BPF_MAP_CREATE
 252 The
 253 .B BPF_MAP_CREATE
 254 command creates a new map,
 255 returning a new file descriptor that refers to the map.
 256
 257 .in +4n
 258 .nf
 259 int
 260 bpf_create_map(enum bpf_map_type map_type,
 261                unsigned int key_size,
 262                unsigned int value_size,
 263                unsigned int max_entries)
 264 {
 265     union bpf_attr attr = {
 266         .map_type    = map_type,
 267         .key_size    = key_size,
 268         .value_size  = value_size,
 269         .max_entries = max_entries
 270     };
 271
 272     return bpf(BPF_MAP_CREATE, &attr, sizeof(attr));
 273 }
 274 .fi
 275 .in
 276
 277 The new map has the type specified by
 278 .IR map_type ,
 279 and attributes as specified in
 280 .IR key_size ,
 281 .IR value_size ,
 282 and
 283 .IR max_entries .
 284 On success, this operation returns a file descriptor.
 285 On error, \-1 is returned and
 286 .I errno
 287 is set to
 288 .BR EINVAL ,
 289 .BR EPERM ,
 290 or
 291 .BR ENOMEM .
 292
 293 The
 294 .I key_size
 295 and
 296 .I value_size
 297 attributes will be used by the verifier during program loading
 298 to check that the program is calling
 299 .BR bpf_map_*_elem ()
 300 helper functions with a correctly initialized
 301 .I key
 302 and to check that the program doesn't access the map element
 303 .I value
 304 beyond the specified
 305 .IR value_size .
 306 For example, when a map is created with a
 307 .IR key_size
 308 of 8 and the eBPF program calls
 309
 310 .in +4n
 311 .nf
 312 bpf_map_lookup_elem(map_fd, fp - 4)
 313 .fi
 314 .in
 315
 316 the program will be rejected,
 317 since the in-kernel helper function
 318
 319     bpf_map_lookup_elem(map_fd, void *key)
 320
 321 expects to read 8 bytes from the location pointed to by
 322 .IR key ,
 323 but the
 324 .IR "fp\ -\ 4"
 325 (where
 326 .I fp
 327 is the top of the stack)
 328 starting address will cause out-of-bounds stack access.
 329
 330 Similarly, when a map is created with a
 331 .I value_size
 332 of 1 and the eBPF program contains
 333
 334 .in +4n
 335 .nf
 336 value = bpf_map_lookup_elem(...);
 337 *(u32 *) value = 1;
 338 .fi
 339 .in
 340
 341 the program will be rejected, since it accesses the
 342 .I value
 343 pointer beyond the specified 1 byte
 344 .I value_size
 345 limit.
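.IP
The stack-bounds reasoning behind the first rejection can be sketched
in ordinary C.
This is a toy model, not the verifier's code:
the function name is hypothetical, the 512-byte stack size is an
assumption of the sketch, and only the bounds check is modeled
(the real verifier also checks that the bytes are initialized).
.IP
.in +4n
.nf
```c
#include <stdbool.h>

#define TOY_STACK_SIZE 512  /* assumed stack size in bytes */

/*
 * fp is the top of the stack, so valid stack bytes lie at offsets
 * -TOY_STACK_SIZE .. -1 relative to fp.  An access of 'size' bytes
 * starting at fp + off is in bounds only if it ends at or below fp.
 */
static bool
toy_stack_access_ok(int off, unsigned int size)
{
    if (size == 0 || size > TOY_STACK_SIZE)
        return false;
    if (off >= 0 || off < -TOY_STACK_SIZE)
        return false;
    return off + (int) size <= 0;
}
```
.fi
.in
.IP
With a
.I key_size
of 8, an offset of \-4 fails this check because the read would extend
4 bytes past the top of the stack, whereas an offset of \-8 passes.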
 346
 347 Currently, the following values are supported for
 348 .IR map_type :
 349
 350 .in +4n
 351 .nf
 352 enum bpf_map_type {
 353     BPF_MAP_TYPE_UNSPEC,  /* Reserve 0 as invalid map type */
 354     BPF_MAP_TYPE_HASH,
 355     BPF_MAP_TYPE_ARRAY,
 356     BPF_MAP_TYPE_PROG_ARRAY,
 357 };
 358 .fi
 359 .in
 360
 361 .I map_type
 362 selects one of the available map implementations in the kernel.
 363 .\" FIXME We need an explanation of why one might choose each of
 364 .\" these map implementations
 365 For all map types,
 366 eBPF programs access maps with the same
 367 .BR bpf_map_lookup_elem ()
 368 and
 369 .BR bpf_map_update_elem ()
 370 helper functions.
 371 Further details of the various map types are given below.
 372 .TP
 373 .B BPF_MAP_LOOKUP_ELEM
 374 The
 375 .B BPF_MAP_LOOKUP_ELEM
 376 command looks up an element with a given
 377 .I key
 378 in the map referred to by the file descriptor
 379 .IR fd .
 380
 381 .in +4n
 382 .nf
 383 int
 384 bpf_lookup_elem(int fd, const void *key, void *value)
 385 {
 386     union bpf_attr attr = {
 387         .map_fd = fd,
 388         .key    = ptr_to_u64(key),
 389         .value  = ptr_to_u64(value),
 390     };
 391
 392     return bpf(BPF_MAP_LOOKUP_ELEM, &attr, sizeof(attr));
 393 }
 394 .fi
 395 .in
 396
 397 If an element is found,
 398 the operation returns zero and stores the element's value into
 399 .IR value ,
 400 which must point to a buffer of
 401 .I value_size
 402 bytes.
 403
 404 If no element is found, the operation returns \-1 and sets
 405 .I errno
 406 to
 407 .BR ENOENT .
 408 .TP
 409 .B BPF_MAP_UPDATE_ELEM
 410 The
 411 .B BPF_MAP_UPDATE_ELEM
 412 command
 413 creates or updates an element with a given
 414 .I key/value
 415 in the map referred to by the file descriptor
 416 .IR fd .
 417
 418 .in +4n
 419 .nf
 420 int
 421 bpf_update_elem(int fd, const void *key, const void *value,
 422                 uint64_t flags)
 423 {
 424     union bpf_attr attr = {
 425         .map_fd = fd,
 426         .key    = ptr_to_u64(key),
 427         .value  = ptr_to_u64(value),
 428         .flags  = flags,
 429     };
 430
 431     return bpf(BPF_MAP_UPDATE_ELEM, &attr, sizeof(attr));
 432 }
 433 .fi
 434 .in
 435
 436 The
 437 .I flags
 438 argument should be specified as one of the following:
 439 .RS
 440 .TP
 441 .B BPF_ANY
 442 Create a new element or update an existing element.
 443 .TP
 444 .B BPF_NOEXIST
 445 Create a new element only if it did not exist.
 446 .TP
 447 .B BPF_EXIST
 448 Update an existing element.
 449 .RE
 450 .IP
 451 On success, the operation returns zero.
 452 On error, \-1 is returned and
 453 .I errno
 454 is set to
 455 .BR EINVAL ,
 456 .BR EPERM ,
 457 .BR ENOMEM ,
 458 or
 459 .BR E2BIG .
 460 .B E2BIG
 461 indicates that the number of elements in the map reached the
 462 .I max_entries
 463 limit specified at map creation time.
 464 .B EEXIST
 465 will be returned if
 466 .I flags
 467 specifies
 468 .B BPF_NOEXIST
 469 and the element with
 470 .I key
 471 already exists in the map.
 472 .B ENOENT
 473 will be returned if
 474 .I flags
 475 specifies
 476 .B BPF_EXIST
 477 and the element with
 478 .I key
 479 doesn't exist in the map.
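.IP
The interaction of the three flags with the error codes above can be
summarized by a small user-space model.
Everything in it (the
.I toy_
names, the fixed capacity of two entries, and the linear search)
is hypothetical and exists only to illustrate the documented behavior;
it does not call the kernel.
.IP
.in +4n
.nf
```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

enum toy_flag { TOY_ANY, TOY_NOEXIST, TOY_EXIST };

#define TOY_MAX_ENTRIES 2

struct toy_map {
    int    keys[TOY_MAX_ENTRIES];
    long   values[TOY_MAX_ENTRIES];
    size_t count;
};

/* Returns 0 on success; on failure returns -1 and sets errno. */
static int
toy_update_elem(struct toy_map *map, int key, long value,
                enum toy_flag flag)
{
    size_t i;
    bool found;

    for (i = 0; i < map->count; i++)
        if (map->keys[i] == key)
            break;
    found = i < map->count;

    if (found && flag == TOY_NOEXIST) {
        errno = EEXIST;
        return -1;
    }
    if (!found && flag == TOY_EXIST) {
        errno = ENOENT;
        return -1;
    }
    if (!found) {
        if (map->count == TOY_MAX_ENTRIES) {
            errno = E2BIG;      /* max_entries reached */
            return -1;
        }
        map->keys[map->count++] = key;  /* new slot is index i */
    }
    map->values[i] = value;
    return 0;
}
```
.fi
.in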
 480 .TP
 481 .B BPF_MAP_DELETE_ELEM
 482 The
 483 .B BPF_MAP_DELETE_ELEM
 484 command
 485 deletes the element whose key is
 486 .I key
 487 from the map referred to by the file descriptor
 488 .IR fd .
 489
 490 .in +4n
 491 .nf
 492 int
 493 bpf_delete_elem(int fd, const void *key)
 494 {
 495     union bpf_attr attr = {
 496         .map_fd = fd,
 497         .key    = ptr_to_u64(key),
 498     };
 499
 500     return bpf(BPF_MAP_DELETE_ELEM, &attr, sizeof(attr));
 501 }
 502 .fi
 503 .in
 504
 505 On success, zero is returned.
 506 If the element is not found, \-1 is returned and
 507 .I errno
 508 is set to
 509 .BR ENOENT .
 510 .TP
 511 .B BPF_MAP_GET_NEXT_KEY
 512 The
 513 .B BPF_MAP_GET_NEXT_KEY
 514 command looks up an element by
 515 .I key
 516 in the map referred to by the file descriptor
 517 .IR fd
 518 and sets the
 519 .I next_key
 520 pointer to the key of the next element.
 521
 522 .nf
 523 .in +4n
 524 int
 525 bpf_get_next_key(int fd, const void *key, void *next_key)
 526 {
 527     union bpf_attr attr = {
 528         .map_fd   = fd,
 529         .key      = ptr_to_u64(key),
 530         .next_key = ptr_to_u64(next_key),
 531     };
 532
 533     return bpf(BPF_MAP_GET_NEXT_KEY, &attr, sizeof(attr));
 534 }
 535 .fi
 536 .in
 537
 538 If
 539 .I key
 540 is found, the operation returns zero and sets the
 541 .I next_key
 542 pointer to the key of the next element.
 543 If
 544 .I key
 545 is not found, the operation returns zero and sets the
 546 .I next_key
 547 pointer to the key of the first element.
 548 If
 549 .I key
 550 is the last element, \-1 is returned and
 551 .I errno
 552 is set to
 553 .BR ENOENT .
 554 Other possible
 555 .I errno
 556 values are
 557 .BR ENOMEM ,
 558 .BR EFAULT ,
 559 .BR EPERM ,
 560 and
 561 .BR EINVAL .
 562 This method can be used to iterate over all elements in the map.
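.IP
The iteration pattern can be illustrated with a user-space stand-in
that follows the return-value contract described above.
The
.I toy_
names and the array-backed key set are hypothetical, and treating an
empty set as
.B ENOENT
is an assumption of this sketch.
With a real map, the same loop would call a wrapper such as
.BR bpf_get_next_key ()
instead.
.IP
.in +4n
.nf
```c
#include <errno.h>
#include <stddef.h>

struct toy_keyset {
    const int *keys;
    size_t     count;
};

/* Returns 0 and sets *next_key, or returns -1 with errno ENOENT. */
static int
toy_get_next_key(const struct toy_keyset *set, int key, int *next_key)
{
    size_t i;

    for (i = 0; i < set->count; i++)
        if (set->keys[i] == key)
            break;

    if (i == set->count) {      /* key not found: start at first */
        if (set->count == 0) {
            errno = ENOENT;
            return -1;
        }
        *next_key = set->keys[0];
        return 0;
    }
    if (i + 1 == set->count) {  /* key is the last element */
        errno = ENOENT;
        return -1;
    }
    *next_key = set->keys[i + 1];
    return 0;
}

/* Visit every key once, starting from a key known to be absent. */
static size_t
toy_count_keys(const struct toy_keyset *set, int absent_key)
{
    size_t visited = 0;
    int key = absent_key, next;

    while (toy_get_next_key(set, key, &next) == 0) {
        visited++;
        key = next;
    }
    return visited;
}
```
.fi
.in
.IP
Starting from a key that is not in the map relies on the "not found"
case returning the first element.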
 563 .TP
 564 .B close(map_fd)
 565 Delete the map referred to by the file descriptor
 566 .IR map_fd .
 567 When the user-space program that created a map exits, all maps will
 568 be deleted automatically (but see NOTES).
 569 .\"
 570 .SS eBPF map types
 571 The following map types are supported:
 572 .TP
 573 .B BPF_MAP_TYPE_HASH
 574 .\" commit 0f8e4bd8a1fc8c4185f1630061d0a1f2d197a475
 575 Hash-table maps have the following characteristics:
 576 .RS
 577 .IP * 3
 578 Maps are created and destroyed by user-space programs.
 579 Both user-space and eBPF programs
 580 can perform lookup, update, and delete operations.
 581 .IP *
 582 The kernel takes care of allocating and freeing key/value pairs.
 583 .IP *
 584 The
 585 .BR map_update_elem ()
 586 helper will fail to insert a new element when the
 587 .I max_entries
 588 limit is reached.
 589 (This ensures that eBPF programs cannot exhaust memory.)
 590 .IP *
 591 .BR map_update_elem ()
 592 replaces existing elements atomically.
 593 .RE
 594 .IP
 595 Hash-table maps are
 596 optimized for speed of lookup.
 597 .TP
 598 .B BPF_MAP_TYPE_ARRAY
 599 .\" commit 28fbcfa08d8ed7c5a50d41a0433aad222835e8e3
 600 Array maps have the following characteristics:
 601 .RS
 602 .IP * 3
 603 Optimized for fastest possible lookup.
 604 In the future the verifier/JIT compiler
 605 may recognize lookup() operations that employ a constant key
 606 and optimize them into a constant pointer.
 607 It is possible to optimize a non-constant
 608 key into direct pointer arithmetic as well, since pointers and
 609 .I value_size
 610 are constant for the life of the eBPF program.
 611 In other words,
 612 .BR array_map_lookup_elem ()
 613 may be 'inlined' by the verifier/JIT compiler
 614 while preserving concurrent access to this map from user space.
 615 .IP *
 616 All array elements are preallocated and zero-initialized at init time.
 617 .IP *
 618 The key is an array index, and must be exactly four bytes.
 619 .IP *
 620 .BR map_delete_elem ()
 621 fails with the error
 622 .BR EINVAL ,
 623 since elements cannot be deleted.
 624 .IP *
 625 .BR map_update_elem ()
 626 replaces elements in a
 627 .B nonatomic
 628 fashion;
 629 for atomic updates, a hash-table map should be used instead.
 630 There is however one special case that can also be used with arrays:
 631 the atomic built-in
 632 .BR __sync_fetch_and_add ()
 633 can be used on 32-bit and 64-bit atomic counters.
 634 For example, it can be
 635 applied on the whole value itself if it represents a single counter,
 636 or in case of a structure containing multiple counters, it could be
 637 used on individual counters.
 638 This is quite often useful for aggregation and accounting of events.
 639 .RE
 640 .IP
 641 Among the uses for array maps are the following:
 642 .RS
 643 .IP * 3
 644 As "global" eBPF variables: an array of 1 element whose key is (index) 0
 645 and where the value is a collection of 'global' variables which
 646 eBPF programs can use to keep state between events.
 647 .IP *
 648 Aggregation of tracing events into a fixed set of buckets.
 649 .IP *
 650 Accounting of networking events, for example, number of packets and packet
 651 sizes.
 652 .RE
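.IP
The atomic-counter idiom mentioned above can be tried in user space,
since
.BR __sync_fetch_and_add ()
is an ordinary compiler built-in (GCC and Clang).
The structure and function names below are hypothetical;
the sketch shows a value that holds two counters,
each updated with its own atomic add.
.IP
.in +4n
.nf
```c
#include <stdint.h>

/* Hypothetical map value holding two independent counters. */
struct toy_stats {
    uint64_t packets;
    uint64_t bytes;
};

/* Account for one packet; each counter gets its own atomic add. */
static void
toy_account(struct toy_stats *stats, uint64_t packet_len)
{
    __sync_fetch_and_add(&stats->packets, 1);
    __sync_fetch_and_add(&stats->bytes, packet_len);
}
```
.fi
.in
.IP
Each add is atomic on its own, but the pair is not updated as a unit,
so a concurrent reader may briefly see one counter ahead of the other.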
 653 .TP
 654 .BR BPF_MAP_TYPE_PROG_ARRAY " (since Linux 4.2)"
 655 A program array map is a special kind of array map whose map values
 656 contain only file descriptors referring to other eBPF programs.
 657 Thus, both the
 658 .I key_size
 659 and
 660 .I value_size
 661 must be exactly four bytes.
 662 This map is used in conjunction with the
 663 .BR bpf_tail_call ()
 664 helper.
 665
 666 This means that an eBPF program with a program array map attached to it
 667 can call from kernel side into
 668
 669 .in +4n
 670 .nf
 671 void bpf_tail_call(void *context, void *prog_map, unsigned int index);
 672 .fi
 673 .in
 674
 675 and therefore replace its own program flow with the one from the program
 676 at the given program array slot, if present.
 677 This can be regarded as a kind of jump table to a different eBPF program.
 678 The invoked program will then reuse the same stack.
 679 When a jump into the new program has been performed,
 680 it won't return to the old program anymore.
 681
 682 If no eBPF program is found at the given index of the program array
 683 (because the map slot doesn't contain a valid program file descriptor,
 684 the specified lookup index/key is out of bounds,
 685 or the limit of 32
 686 .\" MAX_TAIL_CALL_CNT
 687 nested calls has been exceeded),
 688 execution continues with the current eBPF program.
 689 This can be used as a fall-through for default cases.
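.IP
The fall-through rules can be modeled in plain C.
This is a user-space toy, not kernel code:
the
.I toy_
names, the four-slot table, and the use of integer program IDs are
all hypothetical, and the limit of 32 is taken from the text above.
.IP
.in +4n
.nf
```c
#define TOY_MAX_TAIL_CALLS 32   /* limit quoted in the text above */
#define TOY_SLOTS          4

/* Toy program array: a slot holds a program ID, or 0 if empty. */
struct toy_prog_array {
    int slots[TOY_SLOTS];
};

/*
 * Returns the ID of the program to jump to, or 0 if the jump fails
 * and the caller should carry on (the fall-through case).
 */
static int
toy_tail_call(const struct toy_prog_array *map, unsigned int index,
              unsigned int *tail_calls)
{
    if (index >= TOY_SLOTS || map->slots[index] == 0)
        return 0;               /* out of bounds or empty slot */
    if (*tail_calls >= TOY_MAX_TAIL_CALLS)
        return 0;               /* nesting limit reached */
    (*tail_calls)++;
    return map->slots[index];
}

/*
 * Simulate a program that always tail-calls slot 'index'; returns
 * the number of jumps performed before execution falls through.
 */
static unsigned int
toy_run_chain(const struct toy_prog_array *map, unsigned int index)
{
    unsigned int tail_calls = 0;

    while (toy_tail_call(map, index, &tail_calls) != 0)
        continue;       /* jumped: the target program tries again */
    return tail_calls;
}
```
.fi
.in
.IP
A program that keeps jumping to a populated slot stops after 32 jumps
in this model, which is how the limit prevents infinite loops.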
 690
 691 A program array map is useful, for example, in tracing or networking, to
 692 handle individual system calls or protocols in their own subprograms and
 693 use their identifiers as an individual map index.
 694 This approach may result in performance benefits,
 695 and also makes it possible to overcome the maximum
 696 instruction limit of a single eBPF program.
 697 In dynamic environments,
 698 a user-space daemon might atomically replace individual subprograms
 699 at run-time with newer versions to alter overall program behavior,
 700 for instance, if global policies change.
 701 .\"
 702 .SS eBPF programs
 703 The
 704 .B BPF_PROG_LOAD
 705 command is used to load an eBPF program into the kernel.
 706 The return value for this command is a new file descriptor associated
 707 with this eBPF program.
 708
 709 .in +4n
 710 .nf
 711 char bpf_log_buf[LOG_BUF_SIZE];
 712
 713 int
 714 bpf_prog_load(enum bpf_prog_type type,
 715               const struct bpf_insn *insns, int insn_cnt,
 716               const char *license)
 717 {
 718     union bpf_attr attr = {
 719         .prog_type = type,
 720         .insns     = ptr_to_u64(insns),
 721         .insn_cnt  = insn_cnt,
 722         .license   = ptr_to_u64(license),
 723         .log_buf   = ptr_to_u64(bpf_log_buf),
 724         .log_size  = LOG_BUF_SIZE,
 725         .log_level = 1,
 726     };
 727
 728     return bpf(BPF_PROG_LOAD, &attr, sizeof(attr));
 729 }
 730 .fi
 731 .in
 732
 733 .I prog_type
 734 is one of the available program types:
 735
 736 .in +4n
 737 .nf
 738 enum bpf_prog_type {
 739     BPF_PROG_TYPE_UNSPEC,        /* Reserve 0 as invalid
 740                                     program type */
 741     BPF_PROG_TYPE_SOCKET_FILTER,
 742     BPF_PROG_TYPE_KPROBE,
 743     BPF_PROG_TYPE_SCHED_CLS,
 744     BPF_PROG_TYPE_SCHED_ACT,
 745 };
 746 .fi
 747 .in
 748
 749 For further details of eBPF program types, see below.
 750
 751 The remaining fields of
 752 .I bpf_attr
 753 are set as follows:
 754 .IP * 3
 755 .I insns
 756 is an array of
 757 .I "struct bpf_insn"
 758 instructions.
 759 .IP *
 760 .I insn_cnt
 761 is the number of instructions in the program referred to by
 762 .IR insns .
 763 .IP *
 764 .I license
 765 is a license string, which must be GPL compatible to call helper functions
 766 marked
 767 .IR gpl_only .
 768 (The licensing rules are the same as for kernel modules,
 769 so dual licenses, such as "Dual BSD/GPL", may also be used.)
 770 .IP *
 771 .I log_buf
 772 is a pointer to a caller-allocated buffer in which the in-kernel
 773 verifier can store the verification log.
 774 This log is a multi-line string that can be checked by
 775 the program author in order to understand how the verifier came to
 776 the conclusion that the eBPF program is unsafe.
 777 The format of the output can change at any time as the verifier evolves.
 778 .IP *
 779 .I log_size
 780 is the size of the buffer pointed to by
 781 .IR log_buf .
 782 If the size of the buffer is not large enough to store all
 783 verifier messages, \-1 is returned and
 784 .I errno
 785 is set to
 786 .BR ENOSPC .
 787 .IP *
 788 .I log_level
 789 is the verbosity level of the verifier.
 790 A value of zero means that the verifier will not provide a log;
 791 in this case,
 792 .I log_buf
 793 must be a NULL pointer, and
 794 .I log_size
 795 must be zero.
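.PP
To make the relationship between
.I insns
and
.I insn_cnt
concrete, the following sketch builds a two-instruction program
("r0 = 0; exit") and counts its instructions.
The
.I toy_
helpers and the
.B TOY_INSN_CNT
macro are hypothetical names; the opcode constants and
.I "struct bpf_insn"
are assumed to come from
.IR <linux/bpf.h> .
.PP
.in +4n
.nf
```c
#include <linux/bpf.h>  /* struct bpf_insn, BPF_ALU64, BPF_REG_0 */

/* Number of instructions, not bytes, in an instruction array. */
#define TOY_INSN_CNT(prog) (sizeof(prog) / sizeof((prog)[0]))

/* dst_reg = imm (64-bit move of an immediate value) */
static struct bpf_insn
toy_mov64_imm(int dst_reg, int imm)
{
    struct bpf_insn insn = {
        .code    = BPF_ALU64 | BPF_MOV | BPF_K,
        .dst_reg = dst_reg,
        .imm     = imm,
    };

    return insn;
}

/* return r0 */
static struct bpf_insn
toy_exit_insn(void)
{
    struct bpf_insn insn = { .code = BPF_JMP | BPF_EXIT };

    return insn;
}
```
.fi
.in
.PP
For an array declared as
.IR "struct bpf_insn prog[]" ,
the value to pass in
.I insn_cnt
is
.IR TOY_INSN_CNT(prog) ;
.I sizeof(prog)
alone would give the size in bytes.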
 796 .P
 797 Applying
 798 .BR close (2)
 799 to the file descriptor returned by
 800 .B BPF_PROG_LOAD
 801 will unload the eBPF program (but see NOTES).
 802
 803 Maps are accessible from eBPF programs and are used to exchange data between
 804 eBPF programs and between eBPF programs and user-space programs.
 805 For example,
 806 eBPF programs can process various events (such as kprobes or packets) and
 807 store their data into a map,
 808 and user-space programs can then fetch data from the map.
 809 Conversely, user-space programs can use a map as a configuration mechanism,
 810 populating the map with values checked by the eBPF program,
 811 which then modifies its behavior on the fly according to those values.
 812 .\"
 813 .\"
 814 .SS eBPF program types
 815 The eBPF program type
 816 .RI ( prog_type )
 817 determines the subset of kernel helper functions that the program
 818 may call.
 819 The program type also determines the program input (context)\(emthe
 820 format of
 821 .I "struct bpf_context"
 822 (which is the data blob passed into the eBPF program as the first argument).
 823 .\"
 824 .\" FIXME
 825 .\" Somewhere in this page we need a general introduction to the
 826 .\" bpf_context. For example, how does a BPF program access the
 827 .\" context?
 828
 829 For example, a tracing program does not have the exact same
 830 subset of helper functions as a socket filter program
 831 (though they may have some helpers in common).
 832 Similarly,
 833 the input (context) for a tracing program is a set of register values,
 834 while for a socket filter it is a network packet.
 835
 836 The set of functions available to eBPF programs of a given type may increase
 837 in the future.
 838
 839 The following program types are supported:
 840 .TP
 841 .BR BPF_PROG_TYPE_SOCKET_FILTER " (since Linux 3.19)"
 842 Currently, the set of functions for
 843 .B BPF_PROG_TYPE_SOCKET_FILTER
 844 is:
 845
 846 .in +4n
 847 .nf
 848 bpf_map_lookup_elem(map_fd, void *key)
 849                     /* look up key in a map_fd */
 850 bpf_map_update_elem(map_fd, void *key, void *value)
 851                     /* update key/value */
 852 bpf_map_delete_elem(map_fd, void *key)
 853                     /* delete key in a map_fd */
 854 .fi
 855 .in
 856
 857 The
 858 .I bpf_context
 859 argument is a pointer to a
 860 .IR "struct __sk_buff" .
 861 .\" FIXME: We need some text here to explain how the program
 862 .\" accesses __sk_buff.
 863 .\" See 'struct __sk_buff' and commit 9bac3d6d548e5
 864 .\"
 865 .\" Alexei commented:
 866 .\" Actually now in case of SOCKET_FILTER, SCHED_CLS, SCHED_ACT
 867 .\" the program can now access skb fields.
 868 .\"
 869 .TP
 870 .BR BPF_PROG_TYPE_KPROBE " (since Linux 4.1)"
 871 .\" commit 2541517c32be2531e0da59dfd7efc1ce844644f5
 872 [To be documented]
 873 .\" FIXME Document this program type
 874 .\"       Describe allowed helper functions for this program type
 875 .\"       Describe bpf_context for this program type
 876 .\"
 877 .\" FIXME We need text here to describe 'kern_version'
 878 .TP
 879 .BR BPF_PROG_TYPE_SCHED_CLS " (since Linux 4.1)"
 880 .\" commit 96be4325f443dbbfeb37d2a157675ac0736531a1
 881 .\" commit e2e9b6541dd4b31848079da80fe2253daaafb549
 882 [To be documented]
 883 .\" FIXME Document this program type
 884 .\"       Describe allowed helper functions for this program type
 885 .\"       Describe bpf_context for this program type
 886 .TP
 887 .BR BPF_PROG_TYPE_SCHED_ACT " (since Linux 4.1)"
 888 .\" commit 94caee8c312d96522bcdae88791aaa9ebcd5f22c
 889 .\" commit a8cb5f556b567974d75ea29c15181c445c541b1f
 890 [To be documented]
 891 .\" FIXME Document this program type
 892 .\"       Describe allowed helper functions for this program type
 893 .\"       Describe bpf_context for this program type
 894 .SS Events
 895 Once a program is loaded, it can be attached to an event.
 896 Various kernel subsystems have different ways to do so.
 897
 898 Since Linux 3.19,
 899 .\" commit 89aa075832b0da4402acebd698d0411dcc82d03e
 900 the following call will attach the program
 901 .I prog_fd
 902 to the socket
 903 .IR sockfd ,
 904 which was created by an earlier call to
 905 .BR socket (2):
 906
 907 .in +4n
 908 .nf
 909 setsockopt(sockfd, SOL_SOCKET, SO_ATTACH_BPF,
 910            &prog_fd, sizeof(prog_fd));
 911 .fi
 912 .in
 913
 914 Since Linux 4.1,
 915 .\" commit 2541517c32be2531e0da59dfd7efc1ce844644f5
 916 the following call may be used to attach
 917 the eBPF program referred to by the file descriptor
 918 .I prog_fd
 919 to a perf event file descriptor,
 920 .IR event_fd ,
 921 that was created by a previous call to
 922 .BR perf_event_open (2):
 923
 924 .in +4n
 925 .nf
 926 ioctl(event_fd, PERF_EVENT_IOC_SET_BPF, prog_fd);
 927 .fi
 928 .in
 929 .\"
 930 .\"
 931 .SH EXAMPLES
 932 .nf
 933 /* bpf+sockets example:
 934  * 1. create array map of 256 elements
 935  * 2. load program that counts number of packets received
 936  *    r0 = skb->data[ETH_HLEN + offsetof(struct iphdr, protocol)]
 937  *    map[r0]++
 938  * 3. attach prog_fd to raw socket via setsockopt()
 939  * 4. print number of received TCP/UDP packets every second
 940  */
 941 int
 942 main(int argc, char **argv)
 943 {
 944     int sock, map_fd, prog_fd, key;
 945     long long value = 0, tcp_cnt, udp_cnt;
 946
 947     map_fd = bpf_create_map(BPF_MAP_TYPE_ARRAY, sizeof(key),
 948                             sizeof(value), 256);
 949     if (map_fd < 0) {
 950         printf("failed to create map '%s'\\n", strerror(errno));
 951         /* likely not run as root */
 952         return 1;
 953     }
 954
 955     struct bpf_insn prog[] = {
 956         BPF_MOV64_REG(BPF_REG_6, BPF_REG_1),        /* r6 = r1 */
 957         BPF_LD_ABS(BPF_B, ETH_HLEN + offsetof(struct iphdr, protocol)),
 958                                 /* r0 = ip->proto */
 959         BPF_STX_MEM(BPF_W, BPF_REG_10, BPF_REG_0, -4),
 960                                 /* *(u32 *)(fp - 4) = r0 */
 961         BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),       /* r2 = fp */
 962         BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -4),      /* r2 = r2 - 4 */
 963         BPF_LD_MAP_FD(BPF_REG_1, map_fd),           /* r1 = map_fd */
 964         BPF_CALL_FUNC(BPF_FUNC_map_lookup_elem),
 965                                 /* r0 = map_lookup(r1, r2) */
 966         BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 2),
 967                                 /* if (r0 == 0) goto pc+2 */
 968         BPF_MOV64_IMM(BPF_REG_1, 1),                /* r1 = 1 */
 969         BPF_XADD(BPF_DW, BPF_REG_0, BPF_REG_1, 0, 0),
 970                                 /* lock *(u64 *) r0 += r1 */
 971 .\"                                == atomic64_add
 972         BPF_MOV64_IMM(BPF_REG_0, 0),                /* r0 = 0 */
 973         BPF_EXIT_INSN(),                            /* return r0 */
 974     };
 975
 976     prog_fd = bpf_prog_load(BPF_PROG_TYPE_SOCKET_FILTER, prog,
 977                             sizeof(prog) / sizeof(prog[0]), "GPL");
 978
 979     sock = open_raw_sock("lo");
 980
 981     assert(setsockopt(sock, SOL_SOCKET, SO_ATTACH_BPF, &prog_fd,
 982                       sizeof(prog_fd)) == 0);
 983
 984     for (;;) {
 985         key = IPPROTO_TCP;
 986         assert(bpf_lookup_elem(map_fd, &key, &tcp_cnt) == 0);
 987         key = IPPROTO_UDP;
 988         assert(bpf_lookup_elem(map_fd, &key, &udp_cnt) == 0);
 989         printf("TCP %lld UDP %lld packets\\n", tcp_cnt, udp_cnt);
 990         sleep(1);
 991     }
 992
 993     return 0;
 994 }
 995 .fi
 996
 997 Some complete working code can be found in the
 998 .IR samples/bpf
 999 directory in the kernel source tree.
1000 .SH RETURN VALUE
1001 For a successful call, the return value depends on the operation:
1002 .TP
1003 .B BPF_MAP_CREATE
1004 The new file descriptor associated with the eBPF map.
1005 .TP
1006 .B BPF_PROG_LOAD
1007 The new file descriptor associated with the eBPF program.
1008 .TP
1009 All other commands
1010 Zero.
1011 .PP
1012 On error, \-1 is returned, and
1013 .I errno
1014 is set appropriately.
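.PP
The return-value conventions above can be sketched as follows.
This is a minimal illustration, not a reference implementation:
glibc provides no wrapper for this system call, so the sketch goes through
.BR syscall (2),
and the helper names
.BR sys_bpf ()
and
.BR try_create_array_map ()
are hypothetical, chosen only for this example.
.PP
.in +4n
.nf
```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/bpf.h>

/* Hypothetical helper: thin wrapper around the raw system call. */
static int
sys_bpf(int cmd, union bpf_attr *attr, unsigned int size)
{
    return syscall(__NR_bpf, cmd, attr, size);
}

/* Hypothetical helper: create an array map of max_entries elements.
   Returns the new file descriptor (>= 0) on success.  On failure,
   returns -1 and stores the errno value in *err. */
static int
try_create_array_map(unsigned int max_entries, int *err)
{
    union bpf_attr attr;
    int fd;

    assert(err != NULL);

    memset(&attr, 0, sizeof(attr));     /* unused fields stay zero */
    attr.map_type = BPF_MAP_TYPE_ARRAY;
    attr.key_size = sizeof(int);
    attr.value_size = sizeof(long long);
    attr.max_entries = max_entries;

    fd = sys_bpf(BPF_MAP_CREATE, &attr, sizeof(attr));
    *err = (fd == -1) ? errno : 0;
    return fd;
}
```
.fi
.in
.PP
A caller would test the result for \-1 before using it as a file descriptor,
and close the descriptor with
.BR close (2)
once the map is no longer needed.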
1015 .SH ERRORS
1016 .TP
1017 .BR E2BIG
1018 The eBPF program is too large or a map reached the
1019 .I max_entries
1020 limit (maximum number of elements).
1021 .TP
1022 .BR EACCES
1023 For
1024 .BR BPF_PROG_LOAD ,
1025 even though all program instructions are valid, the program has been
1026 rejected because it was deemed unsafe.
1027 This may be because the program could access
1028 a disallowed memory region or an uninitialized stack slot or register,
1029 because the function constraints don't match the actual types, or because
1030 it could perform a misaligned memory access.
1031 In this case, it is recommended to call
1032 .BR bpf ()
1033 again with
1034 .I log_level = 1
1035 and examine
1036 .I log_buf
1037 for the specific reason provided by the verifier.
1038 .TP
1039 .B EBADF
1040 .I fd
1041 is not an open file descriptor.
1042 .TP
1043 .B EFAULT
1044 One of the pointers
1045 .RI ( key
1046 or
1047 .I value
1048 or
1049 .I log_buf
1050 or
1051 .IR insns )
1052 is outside the accessible address space.
1053 .TP
1054 .B EINVAL
1055 The value specified in
1056 .I cmd
1057 is not recognized by this kernel.
1058 .TP
1059 .B EINVAL
1060 For
1061 .BR BPF_MAP_CREATE ,
1062 either
1063 .I map_type
1064 or attributes are invalid.
1065 .TP
1066 .B EINVAL
1067 For
1068 .BR BPF_MAP_*_ELEM
1069 commands,
1070 some of the fields of
1071 .I "union bpf_attr"
1072 that are not used by this command
1073 are not set to zero.
1074 .TP
1075 .B EINVAL
1076 For
1077 .BR BPF_PROG_LOAD ,
1078 indicates an attempt to load an invalid program.
1079 eBPF programs can be deemed
1080 invalid due to unrecognized instructions, the use of reserved fields, jumps
1081 out of range, infinite loops, or calls to unknown functions.
1082 .TP
1083 .BR ENOENT
1084 For
1085 .B BPF_MAP_LOOKUP_ELEM
1086 or
1087 .BR BPF_MAP_DELETE_ELEM ,
1088 indicates that the element with the given
1089 .I key
1090 was not found.
1091 .TP
1092 .B ENOMEM
1093 Cannot allocate sufficient memory.
1094 .TP
1095 .B EPERM
1096 The call was made without sufficient privilege
1097 (without the
1098 .B CAP_SYS_ADMIN
1099 capability).
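.PP
The advice given for
.B EACCES
above (retry with
.I log_level
set to 1 and read
.IR log_buf )
might look like the following sketch.
The helper names
.BR load_with_log ()
and
.BR load_trivial_program ()
are hypothetical, the program type is an arbitrary choice for illustration,
and the instructions are filled in by hand rather than with helper macros.
.PP
.in +4n
.nf
```c
#define _GNU_SOURCE
#include <assert.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/bpf.h>

/* Hypothetical helper: load a socket filter with the verifier log
   enabled.  Returns a program file descriptor, or -1 with errno set. */
static int
load_with_log(const struct bpf_insn *insns, unsigned int insn_cnt,
              char *log_buf, unsigned int log_size)
{
    union bpf_attr attr;

    assert(log_buf != NULL && log_size > 0);
    log_buf[0] = 0;             /* empty string if nothing is logged */

    memset(&attr, 0, sizeof(attr));
    attr.prog_type = BPF_PROG_TYPE_SOCKET_FILTER;
    attr.insns = (unsigned long) insns;
    attr.insn_cnt = insn_cnt;
    attr.license = (unsigned long) "GPL";
    attr.log_buf = (unsigned long) log_buf;
    attr.log_size = log_size;
    attr.log_level = 1;         /* enable verifier logging */

    return syscall(__NR_bpf, BPF_PROG_LOAD, &attr, sizeof(attr));
}

/* Example input: the two-instruction program "r0 = 0; exit". */
static int
load_trivial_program(char *log_buf, unsigned int log_size)
{
    struct bpf_insn prog[2];

    memset(prog, 0, sizeof(prog));
    prog[0].code = BPF_ALU64 | BPF_MOV | BPF_K;     /* r0 = 0 */
    prog[0].dst_reg = BPF_REG_0;
    prog[0].imm = 0;
    prog[1].code = BPF_JMP | BPF_EXIT;              /* return r0 */

    return load_with_log(prog, 2, log_buf, log_size);
}
```
.fi
.in
.PP
If the call fails with
.BR EACCES ,
the caller can print
.I log_buf
as an ordinary string to see why the verifier rejected the program.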
1100 .SH VERSIONS
1101 The
1102 .BR bpf ()
1103 system call first appeared in Linux 3.18.
1104 .SH CONFORMING TO
1105 The
1106 .BR bpf ()
1107 system call is Linux-specific.
1108 .SH NOTES
1109 In the current implementation, all
1110 .BR bpf ()
1111 commands require the caller to have the
1112 .B CAP_SYS_ADMIN
1113 capability.
1114
1115 eBPF objects (maps and programs) can be shared between processes.
1116 For example, after
1117 .BR fork (2),
1118 the child inherits file descriptors referring to the same eBPF objects.
1119 In addition, file descriptors referring to eBPF objects can be
1120 transferred over UNIX domain sockets.
1121 File descriptors referring to eBPF objects can be duplicated
1122 in the usual way, using
1123 .BR dup (2)
1124 and similar calls.
1125 An eBPF object is deallocated only after all file descriptors
1126 referring to the object have been closed.
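.PP
Transferring a file descriptor over a UNIX domain socket uses
.B SCM_RIGHTS
ancillary data, as described in
.BR unix (7).
The following sketch works for any file descriptor and assumes,
as the text above states, that descriptors referring to eBPF objects
behave the same way.
The names
.BR send_fd (),
.BR recv_fd (),
and
.I "union fd_cmsg"
are hypothetical, chosen only for this example.
.PP
.in +4n
.nf
```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

/* Ancillary-data buffer sized and aligned for one file descriptor. */
union fd_cmsg {
    char buf[CMSG_SPACE(sizeof(int))];
    struct cmsghdr align;
};

/* Hypothetical helper: send fd over the connected UNIX domain socket
   sock.  Returns 0 on success, -1 on error. */
static int
send_fd(int sock, int fd)
{
    char dummy = 0;             /* one byte of ordinary data */
    struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
    union fd_cmsg u;
    struct msghdr msg;
    struct cmsghdr *cmsg;

    assert(sock >= 0 && fd >= 0);
    memset(&u, 0, sizeof(u));
    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = u.buf;
    msg.msg_controllen = sizeof(u.buf);

    cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

    return (sendmsg(sock, &msg, 0) == 1) ? 0 : -1;
}

/* Hypothetical helper: receive one descriptor sent by send_fd().
   Returns the new descriptor, or -1 on error. */
static int
recv_fd(int sock)
{
    char dummy;
    struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
    union fd_cmsg u;
    struct msghdr msg;
    struct cmsghdr *cmsg;
    int fd;

    memset(&u, 0, sizeof(u));
    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = u.buf;
    msg.msg_controllen = sizeof(u.buf);

    if (recvmsg(sock, &msg, 0) != 1)
        return -1;
    cmsg = CMSG_FIRSTHDR(&msg);
    if (cmsg == NULL || cmsg->cmsg_level != SOL_SOCKET ||
            cmsg->cmsg_type != SCM_RIGHTS ||
            cmsg->cmsg_len != CMSG_LEN(sizeof(int)))
        return -1;
    memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
    return fd;
}
```
.fi
.in
.PP
The receiver obtains a new descriptor number that refers to the same
underlying object, so under the rule above the object stays allocated
until both the sender's and the receiver's descriptors are closed.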
1127
1128 eBPF programs can be written in a restricted C that is compiled (using the
1129 .B clang
1130 compiler) into eBPF bytecode.
1131 Various features are omitted from this restricted C, such as loops,
1132 global variables, variadic functions, floating-point numbers,
1133 and passing structures as function arguments.
1134 Some examples can be found in the
1135 .I samples/bpf/*_kern.c
1136 files in the kernel source tree.
1137 .\" There are also examples for the tc classifier, in the iproute2
1138 .\" project, in examples/bpf
1139
1140 The kernel contains a just-in-time (JIT) compiler that translates
1141 eBPF bytecode into native machine code for better performance.
1142 The JIT compiler is disabled by default,
1143 but its operation can be controlled by writing one of the
1144 following integer strings to the file
1145 .IR /proc/sys/net/core/bpf_jit_enable :
1146 .IP 0 3
1147 Disable JIT compilation (default).
1148 .IP 1
1149 Normal compilation.
1150 .IP 2
1151 Debugging mode.
1152 The generated opcodes are dumped in hexadecimal into the kernel log.
1153 These opcodes can then be disassembled using the program
1154 .IR tools/net/bpf_jit_disasm.c
1155 provided in the kernel source tree.
1156 .PP
1157 The JIT compiler for eBPF is currently available for the x86-64, arm64,
1158 and s390 architectures.
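.PP
A program can check the current JIT setting by reading the file named above.
The following is a minimal sketch; the helper name
.BR read_jit_setting ()
is hypothetical, and the path is passed in as a parameter so that
the caller decides which file to read.
.PP
.in +4n
.nf
```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: read the integer stored in the file at path.
   Returns the value (0, 1, or 2 for the settings listed above), or
   -1 if the file cannot be opened or holds no usable integer. */
static int
read_jit_setting(const char *path)
{
    FILE *fp;
    int val = -1;

    assert(path != NULL);

    fp = fopen(path, "r");
    if (fp == NULL)
        return -1;              /* file missing or not readable */
    if (fscanf(fp, "%d", &val) != 1 || val < 0)
        val = -1;               /* no usable integer in the file */
    fclose(fp);
    return val;
}
```
.fi
.in
.PP
For example,
.I read_jit_setting("/proc/sys/net/core/bpf_jit_enable")
yields 0 on a system where the JIT compiler is disabled,
and \-1 where the file is not available.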
1159 .SH SEE ALSO
1160 .BR seccomp (2),
1161 .BR socket (7),
1162 .BR tc (8),
1163 .BR tc-bpf (8)
1164
1165 Both classic and extended BPF are explained in the kernel source file
1166 .IR Documentation/networking/filter.txt .