Commit | Line | Data |
---|---|---|
cc7ac21d | 1 | .\" Copyright (C) 2015 Alexei Starovoitov <ast@kernel.org> |
ce5db3fc | 2 | .\" and Copyright (C) 2015 Michael Kerrisk <mtk.manpages@gmail.com> |
cc7ac21d AS |
3 | .\" |
4 | .\" %%%LICENSE_START(VERBATIM) | |
5 | .\" Permission is granted to make and distribute verbatim copies of this | |
6 | .\" manual provided the copyright notice and this permission notice are | |
7 | .\" preserved on all copies. | |
8 | .\" | |
9 | .\" Permission is granted to copy and distribute modified versions of this | |
10 | .\" manual under the conditions for verbatim copying, provided that the | |
11 | .\" entire resulting derived work is distributed under the terms of a | |
12 | .\" permission notice identical to this one. | |
13 | .\" | |
14 | .\" Since the Linux kernel and libraries are constantly changing, this | |
15 | .\" manual page may be incorrect or out-of-date. The author(s) assume no | |
16 | .\" responsibility for errors or omissions, or for damages resulting from | |
17 | .\" the use of the information contained herein. The author(s) may not | |
18 | .\" have taken the same level of care in the production of this manual, | |
19 | .\" which is licensed free of charge, as they might when working | |
20 | .\" professionally. | |
21 | .\" | |
22 | .\" Formatted or processed versions of this manual, if unaccompanied by | |
23 | .\" the source, must acknowledge the copyright and authors of this work. | |
24 | .\" %%%LICENSE_END | |
25 | .\" | |
63121bd4 | 26 | .TH BPF 2 2019-08-02 "Linux" "Linux Programmer's Manual" |
cc7ac21d | 27 | .SH NAME |
99663603 | 28 | bpf \- perform a command on an extended BPF map or program |
cc7ac21d AS |
29 | .SH SYNOPSIS |
30 | .nf | |
31 | .B #include <linux/bpf.h> | |
c36ac88f | 32 | |
266791fb | 33 | .BI "int bpf(int " cmd ", union bpf_attr *" attr ", unsigned int " size ); |
c36ac88f | 34 | .fi |
cc7ac21d | 35 | .SH DESCRIPTION |
5988a659 | 36 | The |
16152abb | 37 | .BR bpf () |
842ee010 MK |
38 | system call performs a range of operations related to extended |
39 | Berkeley Packet Filters. | |
40 | Extended BPF (or eBPF) is similar to | |
54513c00 MK |
41 | the original ("classic") BPF (cBPF) used to filter network packets. |
42 | For both cBPF and eBPF programs, | |
842ee010 MK |
43 | the kernel statically analyzes the programs before loading them, |
44 | in order to ensure that they cannot harm the running system. | |
11ac5b51 | 45 | .PP |
cc42e9b8 | 46 | eBPF extends cBPF in multiple ways, including the ability to call |
f774ddf1 MK |
47 | a fixed set of in-kernel helper functions |
48 | .\" See 'enum bpf_func_id' in include/uapi/linux/bpf.h | |
49 | (via the | |
842ee010 MK |
50 | .B BPF_CALL |
51 | opcode extension provided by eBPF) | |
ce5db3fc | 52 | and access shared data structures such as eBPF maps. |
fcd1bee3 | 53 | .\" |
cc7ac21d | 54 | .SS Extended BPF Design/Architecture |
953d2673 | 55 | eBPF maps are a generic data structure for storage of different data types. |
9a818ddd | 56 | Data types are generally treated as binary blobs, so a user just specifies |
cd579c3f | 57 | the size of the key and the size of the value at map-creation time. |
9a818ddd | 58 | In other words, a key/value for a given map can have an arbitrary structure. |
f0271688 | 59 | .PP |
cc7ac21d | 60 | A user process can create multiple maps (with key/value-pairs being |
16152abb | 61 | opaque bytes of data) and access them via file descriptors. |
b87d8ba6 | 62 | Different eBPF programs can access the same maps in parallel. |
54513c00 | 63 | It's up to the user process and eBPF program to decide what they store |
cc7ac21d | 64 | inside maps. |
f0271688 | 65 | .PP |
cd579c3f MK |
66 | There's one special map type, called a program array. |
67 | This type of map stores file descriptors referring to other eBPF programs. | |
68 | When a lookup in the map is performed, the program flow is | |
69 | redirected in-place to the beginning of another eBPF program and does not | |
70 | return to the calling program. | |
aabe0499 MK |
71 | The level of nesting has a fixed limit of 32, |
72 | .\" Defined by the kernel constant MAX_TAIL_CALL_CNT in include/linux/bpf.h | |
73 | so that infinite loops cannot be crafted. | |
29c0586f | 74 | At run time, the program file descriptors stored in the map can be modified, |
9a818ddd | 75 | so program functionality can be altered based on specific requirements. |
cd579c3f MK |
76 | All programs referred to in a program-array map must |
77 | have been previously loaded into the kernel via | |
78 | .BR bpf (). | |
79 | If a map lookup fails, the current program continues its execution. | |
80 | See | |
81 | .B BPF_MAP_TYPE_PROG_ARRAY | |
82 | below for further details. | |
11ac5b51 | 83 | .PP |
9a818ddd | 84 | Generally, eBPF programs are loaded by the user process and automatically |
cd579c3f MK |
85 | unloaded when the process exits. |
86 | In some cases, for example, | |
9a818ddd DB |
87 | .BR tc-bpf (8), |
88 | the program will continue to stay alive inside the kernel even after the | |
a0d8ddd1 | 89 | process that loaded the program exits. |
cd579c3f MK |
90 | In that case, |
91 | the tc subsystem holds a reference to the eBPF program after the | |
92 | file descriptor has been closed by the user-space program. | |
9a818ddd DB |
93 | Thus, whether a specific program continues to live inside the kernel |
94 | depends on how it is further attached to a given kernel subsystem | |
95 | after it was loaded via | |
cd579c3f | 96 | .BR bpf (). |
f0271688 | 97 | .PP |
cd579c3f | 98 | Each eBPF program is a set of instructions that is safe to run until |
9a5215bf | 99 | its completion. |
54513c00 | 100 | An in-kernel verifier statically determines that the eBPF program |
9a5215bf | 101 | terminates and is safe to execute. |
896388c8 MK |
102 | During verification, the kernel increments reference counts for each of |
103 | the maps that the eBPF program uses, | |
953d2673 | 104 | so that the attached maps can't be removed until the program is unloaded. |
f0271688 | 105 | .PP |
54513c00 | 106 | eBPF programs can be attached to different events. |
9ab03361 | 107 | These events can be the arrival of network packets, tracing |
953d2673 MK |
108 | events, classification events by network queueing disciplines |
109 | (for eBPF programs attached to a | |
9ab03361 MK |
110 | .BR tc (8) |
111 | classifier), and other types that may be added in the future. | |
54513c00 | 112 | A new event triggers execution of the eBPF program, which |
f774ddf1 | 113 | may store information about the event in eBPF maps. |
54513c00 | 114 | Beyond storing data, eBPF programs may call a fixed set of |
896388c8 | 115 | in-kernel helper functions. |
f0271688 | 116 | .PP |
f774ddf1 | 117 | The same eBPF program can be attached to multiple events and different |
cc42e9b8 | 118 | eBPF programs can access the same map: |
f0271688 | 119 | .PP |
1148d934 | 120 | .in +4n |
f0271688 | 121 | .EX |
cd579c3f MK |
122 | tracing tracing tracing packet packet packet |
123 | event A event B event C on eth0 on eth1 on eth2 | |
124 | | | | | | ^ | |
125 | | | | | v | | |
126 | --> tracing <-- tracing socket tc ingress tc egress | |
127 | prog_1 prog_2 prog_3 classifier action | |
128 | | | | | prog_4 prog_5 | |
129 | |--- -----| |------| map_3 | | | |
130 | map_1 map_2 --| map_4 |-- | |
f0271688 | 131 | .EE |
1148d934 | 132 | .in |
fcd1bee3 | 133 | .\" |
5988a659 | 134 | .SS Arguments |
842ee010 | 135 | The operation to be performed by the |
1148d934 | 136 | .BR bpf () |
842ee010 | 137 | system call is determined by the |
266791fb | 138 | .I cmd |
f774ddf1 MK |
139 | argument. |
140 | Each operation takes an accompanying argument, | |
141 | provided via | |
142 | .IR attr , | |
143 | which is a pointer to a union of type | |
266791fb | 144 | .I bpf_attr |
f774ddf1 MK |
145 | (see below). |
146 | The | |
147 | .I size | |
148 | argument is the size of the union pointed to by | |
149 | .IR attr . | |
efeece04 | 150 | .PP |
f774ddf1 | 151 | The value provided in |
266791fb | 152 | .I cmd |
f774ddf1 | 153 | is one of the following: |
cc7ac21d AS |
154 | .TP |
155 | .B BPF_MAP_CREATE | |
953d2673 | 156 | Create a map and return a file descriptor that refers to the map. |
0f166ce1 MK |
157 | The close-on-exec file descriptor flag (see |
158 | .BR fcntl (2)) | |
159 | is automatically enabled for the new file descriptor. | |
cc7ac21d AS |
160 | .TP |
161 | .B BPF_MAP_LOOKUP_ELEM | |
842ee010 | 162 | Look up an element by key in a specified map and return its value. |
cc7ac21d AS |
163 | .TP |
164 | .B BPF_MAP_UPDATE_ELEM | |
842ee010 | 165 | Create or update an element (key/value pair) in a specified map. |
cc7ac21d AS |
166 | .TP |
167 | .B BPF_MAP_DELETE_ELEM | |
842ee010 | 168 | Look up and delete an element by key in a specified map. |
cc7ac21d AS |
169 | .TP |
170 | .B BPF_MAP_GET_NEXT_KEY | |
842ee010 MK |
171 | Look up an element by key in a specified map and return the key |
172 | of the next element. | |
cc7ac21d AS |
173 | .TP |
174 | .B BPF_PROG_LOAD | |
9ab03361 MK |
175 | Verify and load an eBPF program, |
176 | returning a new file descriptor associated with the program. | |
0f166ce1 MK |
177 | The close-on-exec file descriptor flag (see |
178 | .BR fcntl (2)) | |
179 | is automatically enabled for the new file descriptor. | |
f0271688 | 180 | .IP |
842ee010 MK |
181 | The |
182 | .I bpf_attr | |
183 | union consists of various anonymous structures that are used by different | |
184 | .BR bpf () | |
185 | commands: | |
b3b5781e | 186 | .PP |
842ee010 | 187 | .in +4n |
f0271688 | 188 | .EX |
cc7ac21d | 189 | union bpf_attr { |
842ee010 MK |
190 | struct { /* Used by BPF_MAP_CREATE */ |
191 | __u32 map_type; | |
192 | __u32 key_size; /* size of key in bytes */ | |
193 | __u32 value_size; /* size of value in bytes */ | |
194 | __u32 max_entries; /* maximum number of entries | |
195 | in a map */ | |
cc7ac21d AS |
196 | }; |
197 | ||
f774ddf1 MK |
198 | struct { /* Used by BPF_MAP_*_ELEM and BPF_MAP_GET_NEXT_KEY |
199 | commands */ | |
842ee010 MK |
200 | __u32 map_fd; |
201 | __aligned_u64 key; | |
cc7ac21d AS |
202 | union { |
203 | __aligned_u64 value; | |
204 | __aligned_u64 next_key; | |
205 | }; | |
842ee010 | 206 | __u64 flags; |
cc7ac21d AS |
207 | }; |
208 | ||
842ee010 MK |
209 | struct { /* Used by BPF_PROG_LOAD */ |
210 | __u32 prog_type; | |
211 | __u32 insn_cnt; | |
212 | __aligned_u64 insns; /* 'const struct bpf_insn *' */ | |
213 | __aligned_u64 license; /* 'const char *' */ | |
214 | __u32 log_level; /* verbosity level of verifier */ | |
215 | __u32 log_size; /* size of user buffer */ | |
216 | __aligned_u64 log_buf; /* user supplied 'char *' | |
217 | buffer */ | |
f774ddf1 | 218 | __u32 kern_version; |
9ab03361 MK |
219 | /* checked when prog_type=kprobe |
220 | (since Linux 4.1) */ | |
221 | .\" commit 2541517c32be2531e0da59dfd7efc1ce844644f5 | |
cc7ac21d AS |
222 | }; |
223 | } __attribute__((aligned(8))); | |
f0271688 | 224 | .EE |
842ee010 | 225 | .in |
fcd1bee3 | 226 | .\" |
ce5db3fc | 227 | .SS eBPF maps |
8440f771 MK |
228 | Maps are a generic data structure for storage of different types of data. |
229 | They allow sharing of data between eBPF kernel programs, | |
230 | and also between kernel and user-space applications. | |
f0271688 | 231 | .PP |
16152abb | 232 | Each map type has the following attributes: |
16152abb MK |
233 | .IP * 3 |
234 | type | |
235 | .IP * | |
79e2beef | 236 | maximum number of elements |
16152abb MK |
237 | .IP * |
238 | key size in bytes | |
239 | .IP * | |
240 | value size in bytes | |
16152abb | 241 | .PP |
842ee010 MK |
242 | The following wrapper functions demonstrate how various |
243 | .BR bpf () | |
244 | commands can be used to access the maps. | |
9a5215bf | 245 | The functions use the |
266791fb | 246 | .I cmd |
cc7ac21d | 247 | argument to invoke different operations. |
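The wrapper functions in this section call `bpf()` and `ptr_to_u64()` without defining them. The following is a minimal sketch of both helpers, assuming that the C library provides no `bpf()` wrapper, so the call goes through `syscall(2)` with `__NR_bpf`. The helper bodies are illustrative assumptions and are not taken from this page.

```c
#include <linux/bpf.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Assumed helper: widen a user-space pointer to the 64-bit integer
   form stored in the __aligned_u64 fields of union bpf_attr. */
static inline uint64_t
ptr_to_u64(const void *ptr)
{
    return (uint64_t) (uintptr_t) ptr;
}

/* Assumed helper: invoke the raw system call.
   Returns the system call result, or -1 with errno set on error. */
static inline int
bpf(int cmd, union bpf_attr *attr, unsigned int size)
{
    return syscall(__NR_bpf, cmd, attr, size);
}
```

Going through `uintptr_t` before widening keeps the conversion well defined on 32-bit systems, where a pointer is narrower than the 64-bit field.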
ce5db3fc | 248 | .TP |
842ee010 MK |
249 | .B BPF_MAP_CREATE |
250 | The | |
cc7ac21d | 251 | .B BPF_MAP_CREATE |
5415d504 MK |
252 | command creates a new map, |
253 | returning a new file descriptor that refers to the map. | |
f0271688 | 254 | .IP |
842ee010 | 255 | .in +4n |
f0271688 | 256 | .EX |
842ee010 | 257 | int |
953d2673 MK |
258 | bpf_create_map(enum bpf_map_type map_type, |
259 | unsigned int key_size, | |
260 | unsigned int value_size, | |
261 | unsigned int max_entries) | |
cc7ac21d AS |
262 | { |
263 | union bpf_attr attr = { | |
953d2673 MK |
264 | .map_type = map_type, |
265 | .key_size = key_size, | |
266 | .value_size = value_size, | |
cc7ac21d AS |
267 | .max_entries = max_entries |
268 | }; | |
269 | ||
270 | return bpf(BPF_MAP_CREATE, &attr, sizeof(attr)); | |
271 | } | |
f0271688 | 272 | .EE |
842ee010 | 273 | .in |
f0271688 | 274 | .IP |
842ee010 MK |
275 | The new map has the type specified by |
276 | .IR map_type , | |
277 | and attributes as specified in | |
1148d934 MK |
278 | .IR key_size , |
279 | .IR value_size , | |
842ee010 | 280 | and |
1148d934 | 281 | .IR max_entries . |
46a4949b | 282 | On success, this operation returns a file descriptor. |
9a5215bf | 283 | On error, \-1 is returned and |
cc7ac21d | 284 | .I errno |
1148d934 MK |
285 | is set to |
286 | .BR EINVAL , | |
287 | .BR EPERM , | |
288 | or | |
289 | .BR ENOMEM . | |
f0271688 | 290 | .IP |
953d2673 | 291 | The |
cc7ac21d AS |
292 | .I key_size |
293 | and | |
294 | .I value_size | |
953d2673 MK |
295 | attributes will be used by the verifier during program loading |
296 | to check that the program is calling | |
1148d934 MK |
297 | .BR bpf_map_*_elem () |
298 | helper functions with a correctly initialized | |
cc7ac21d | 299 | .I key |
f774ddf1 | 300 | and to check that the program doesn't access the map element |
cc7ac21d AS |
301 | .I value |
302 | beyond the specified | |
16152abb | 303 | .IR value_size . |
842ee010 | 304 | For example, when a map is created with a |
266791fb | 305 | .I key_size |
f774ddf1 | 306 | of 8 and the eBPF program calls |
f0271688 | 307 | .IP |
1148d934 | 308 | .in +4n |
f0271688 | 309 | .EX |
cc7ac21d | 310 | bpf_map_lookup_elem(map_fd, fp - 4) |
f0271688 | 311 | .EE |
1148d934 | 312 | .in |
f0271688 | 313 | .IP |
cc7ac21d | 314 | the program will be rejected, |
1148d934 | 315 | since the in-kernel helper function |
f0271688 MK |
316 | .IP |
317 | .EX | |
842ee010 | 318 | bpf_map_lookup_elem(map_fd, void *key) |
f0271688 MK |
319 | .EE |
320 | .IP | |
46a4949b MK |
321 | expects to read 8 bytes from the location pointed to by |
322 | .IR key , | |
323 | but the | |
266791fb | 324 | .I fp\ -\ 4 |
46a4949b MK |
325 | (where |
326 | .I fp | |
327 | is the top of the stack) | |
1148d934 | 328 | starting address will cause out-of-bounds stack access. |
f0271688 | 329 | .IP |
842ee010 MK |
330 | Similarly, when a map is created with a |
331 | .I value_size | |
f774ddf1 | 332 | of 1 and the eBPF program contains |
f0271688 | 333 | .IP |
1148d934 | 334 | .in +4n |
f0271688 | 335 | .EX |
cc7ac21d | 336 | value = bpf_map_lookup_elem(...); |
1148d934 | 337 | *(u32 *) value = 1; |
f0271688 | 338 | .EE |
1148d934 | 339 | .in |
f0271688 | 340 | .IP |
cc7ac21d AS |
341 | the program will be rejected, since it accesses the |
342 | .I value | |
1148d934 MK |
343 | pointer beyond the specified 1 byte |
344 | .I value_size | |
345 | limit. | |
f0271688 | 346 | .IP |
f774ddf1 MK |
347 | Currently, the following values are supported for |
348 | .IR map_type : | |
f0271688 | 349 | .IP |
1148d934 | 350 | .in +4n |
f0271688 | 351 | .EX |
cc7ac21d | 352 | enum bpf_map_type { |
ce5db3fc | 353 | BPF_MAP_TYPE_UNSPEC, /* Reserve 0 as invalid map type */ |
842ee010 MK |
354 | BPF_MAP_TYPE_HASH, |
355 | BPF_MAP_TYPE_ARRAY, | |
5415d504 | 356 | BPF_MAP_TYPE_PROG_ARRAY, |
1b7adc7c NB |
357 | BPF_MAP_TYPE_PERF_EVENT_ARRAY, |
358 | BPF_MAP_TYPE_PERCPU_HASH, | |
359 | BPF_MAP_TYPE_PERCPU_ARRAY, | |
360 | BPF_MAP_TYPE_STACK_TRACE, | |
361 | BPF_MAP_TYPE_CGROUP_ARRAY, | |
362 | BPF_MAP_TYPE_LRU_HASH, | |
363 | BPF_MAP_TYPE_LRU_PERCPU_HASH, | |
364 | BPF_MAP_TYPE_LPM_TRIE, | |
365 | BPF_MAP_TYPE_ARRAY_OF_MAPS, | |
366 | BPF_MAP_TYPE_HASH_OF_MAPS, | |
367 | BPF_MAP_TYPE_DEVMAP, | |
368 | BPF_MAP_TYPE_SOCKMAP, | |
369 | BPF_MAP_TYPE_CPUMAP, | |
cc7ac21d | 370 | }; |
f0271688 | 371 | .EE |
1148d934 | 372 | .in |
f0271688 | 373 | .IP |
cc7ac21d | 374 | .I map_type |
9a5215bf | 375 | selects one of the available map implementations in the kernel. |
f774ddf1 | 376 | .\" FIXME We need an explanation of why one might choose each of |
b913d165 | 377 | .\" these map implementations |
16152abb | 378 | For all map types, |
f774ddf1 MK |
379 | eBPF programs access maps with the same |
380 | .BR bpf_map_lookup_elem () | |
381 | and | |
1148d934 | 382 | .BR bpf_map_update_elem () |
cc7ac21d | 383 | helper functions. |
ce5db3fc | 384 | Further details of the various map types are given below. |
cc7ac21d AS |
385 | .TP |
386 | .B BPF_MAP_LOOKUP_ELEM | |
842ee010 MK |
387 | The |
388 | .B BPF_MAP_LOOKUP_ELEM | |
389 | command looks up an element with a given | |
390 | .I key | |
391 | in the map referred to by the file descriptor | |
392 | .IR fd . | |
f0271688 | 393 | .IP |
842ee010 | 394 | .in +4n |
f0271688 | 395 | .EX |
842ee010 | 396 | int |
953d2673 | 397 | bpf_lookup_elem(int fd, const void *key, void *value) |
cc7ac21d AS |
398 | { |
399 | union bpf_attr attr = { | |
400 | .map_fd = fd, | |
953d2673 MK |
401 | .key = ptr_to_u64(key), |
402 | .value = ptr_to_u64(value), | |
cc7ac21d AS |
403 | }; |
404 | ||
405 | return bpf(BPF_MAP_LOOKUP_ELEM, &attr, sizeof(attr)); | |
406 | } | |
f0271688 | 407 | .EE |
842ee010 | 408 | .in |
f0271688 | 409 | .IP |
842ee010 MK |
410 | If an element is found, |
411 | the operation returns zero and stores the element's value into | |
5415d504 MK |
412 | .IR value , |
413 | which must point to a buffer of | |
414 | .I value_size | |
415 | bytes. | |
f0271688 | 416 | .IP |
842ee010 | 417 | If no element is found, the operation returns \-1 and sets |
cc7ac21d | 418 | .I errno |
1148d934 MK |
419 | to |
420 | .BR ENOENT . | |
cc7ac21d AS |
421 | .TP |
422 | .B BPF_MAP_UPDATE_ELEM | |
842ee010 MK |
423 | The |
424 | .B BPF_MAP_UPDATE_ELEM | |
425 | command | |
426 | creates or updates an element with a given | |
427 | .I key/value | |
428 | in the map referred to by the file descriptor | |
429 | .IR fd . | |
f0271688 | 430 | .IP |
842ee010 | 431 | .in +4n |
f0271688 | 432 | .EX |
842ee010 | 433 | int |
953d2673 MK |
434 | bpf_update_elem(int fd, const void *key, const void *value, |
435 | uint64_t flags) | |
cc7ac21d AS |
436 | { |
437 | union bpf_attr attr = { | |
438 | .map_fd = fd, | |
953d2673 MK |
439 | .key = ptr_to_u64(key), |
440 | .value = ptr_to_u64(value), | |
441 | .flags = flags, | |
cc7ac21d AS |
442 | }; |
443 | ||
444 | return bpf(BPF_MAP_UPDATE_ELEM, &attr, sizeof(attr)); | |
445 | } | |
f0271688 | 446 | .EE |
842ee010 | 447 | .in |
f0271688 | 448 | .IP |
842ee010 | 449 | The |
cc7ac21d | 450 | .I flags |
842ee010 MK |
451 | argument should be specified as one of the following: |
452 | .RS | |
453 | .TP | |
454 | .B BPF_ANY | |
455 | Create a new element or update an existing element. | |
456 | .TP | |
457 | .B BPF_NOEXIST | |
458 | Create a new element only if it did not exist. | |
459 | .TP | |
460 | .B BPF_EXIST | |
461 | Update an existing element. | |
462 | .RE | |
463 | .IP | |
464 | On success, the operation returns zero. | |
cc7ac21d AS |
465 | On error, \-1 is returned and |
466 | .I errno | |
1148d934 MK |
467 | is set to |
468 | .BR EINVAL , | |
469 | .BR EPERM , | |
470 | .BR ENOMEM , | |
471 | or | |
472 | .BR E2BIG . | |
cc7ac21d | 473 | .B E2BIG |
842ee010 | 474 | indicates that the number of elements in the map reached the |
cc7ac21d AS |
475 | .I max_entries |
476 | limit specified at map creation time. | |
477 | .B EEXIST | |
842ee010 MK |
478 | will be returned if |
479 | .I flags | |
480 | specifies | |
481 | .B BPF_NOEXIST | |
482 | and the element with | |
1148d934 MK |
483 | .I key |
484 | already exists in the map. | |
cc7ac21d | 485 | .B ENOENT |
953d2673 | 486 | will be returned if |
842ee010 MK |
487 | .I flags |
488 | specifies | |
489 | .B BPF_EXIST | |
490 | and the element with | |
1148d934 MK |
491 | .I key |
492 | doesn't exist in the map. | |
cc7ac21d AS |
493 | .TP |
494 | .B BPF_MAP_DELETE_ELEM | |
842ee010 MK |
495 | The |
496 | .B BPF_MAP_DELETE_ELEM | |
497 | command | |
96ed2f3f | 498 | deletes the element whose key is |
842ee010 MK |
499 | .I key |
500 | from the map referred to by the file descriptor | |
501 | .IR fd . | |
f0271688 | 502 | .IP |
842ee010 | 503 | .in +4n |
f0271688 | 504 | .EX |
842ee010 | 505 | int |
953d2673 | 506 | bpf_delete_elem(int fd, const void *key) |
cc7ac21d AS |
507 | { |
508 | union bpf_attr attr = { | |
509 | .map_fd = fd, | |
953d2673 | 510 | .key = ptr_to_u64(key), |
cc7ac21d AS |
511 | }; |
512 | ||
513 | return bpf(BPF_MAP_DELETE_ELEM, &attr, sizeof(attr)); | |
514 | } | |
f0271688 | 515 | .EE |
842ee010 | 516 | .in |
f0271688 | 517 | .IP |
842ee010 MK |
518 | On success, zero is returned. |
519 | If the element is not found, \-1 is returned and | |
cc7ac21d | 520 | .I errno |
842ee010 | 521 | is set to |
1148d934 | 522 | .BR ENOENT . |
cc7ac21d AS |
523 | .TP |
524 | .B BPF_MAP_GET_NEXT_KEY | |
842ee010 MK |
525 | The |
526 | .B BPF_MAP_GET_NEXT_KEY | |
527 | command looks up an element by | |
528 | .I key | |
529 | in the map referred to by the file descriptor | |
266791fb | 530 | .I fd |
842ee010 MK |
531 | and sets the |
532 | .I next_key | |
533 | pointer to the key of the next element. | |
f0271688 | 534 | .IP |
842ee010 | 535 | .in +4n |
f0271688 | 536 | .EX |
842ee010 | 537 | int |
953d2673 | 538 | bpf_get_next_key(int fd, const void *key, void *next_key) |
cc7ac21d AS |
539 | { |
540 | union bpf_attr attr = { | |
953d2673 MK |
541 | .map_fd = fd, |
542 | .key = ptr_to_u64(key), | |
cc7ac21d AS |
543 | .next_key = ptr_to_u64(next_key), |
544 | }; | |
545 | ||
546 | return bpf(BPF_MAP_GET_NEXT_KEY, &attr, sizeof(attr)); | |
547 | } | |
f0271688 | 548 | .EE |
842ee010 | 549 | .in |
f0271688 | 550 | .IP |
5415d504 MK |
551 | If |
552 | .I key | |
553 | is found, the operation returns zero and sets the | |
554 | .I next_key | |
555 | pointer to the key of the next element. | |
cc7ac21d AS |
556 | If |
557 | .I key | |
842ee010 | 558 | is not found, the operation returns zero and sets the |
cc7ac21d AS |
559 | .I next_key |
560 | pointer to the key of the first element. | |
561 | If | |
562 | .I key | |
842ee010 | 563 | is the last element, \-1 is returned and |
cc7ac21d | 564 | .I errno |
842ee010 | 565 | is set to |
1148d934 | 566 | .BR ENOENT . |
9a5215bf | 567 | Other possible |
cc7ac21d | 568 | .I errno |
1148d934 MK |
569 | values are |
570 | .BR ENOMEM , | |
571 | .BR EFAULT , | |
572 | .BR EPERM , | |
573 | and | |
574 | .BR EINVAL . | |
cc7ac21d AS |
575 | This method can be used to iterate over all elements in the map. |
576 | .TP | |
577 | .B close(map_fd) | |
842ee010 | 578 | Delete the map referred to by the file descriptor |
1148d934 | 579 | .IR map_fd . |
842ee010 | 580 | When the user-space program that created a map exits, all maps will |
ce5db3fc MK |
581 | be deleted automatically (but see NOTES). |
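The iteration that BPF_MAP_GET_NEXT_KEY makes possible can be sketched as a loop that feeds each returned key back in until ENOENT is reported. This sketch assumes a kernel that accepts a NULL key as a request for the first key (to my understanding, Linux 4.12 and later); on older kernels a key known to be absent from the map would be passed instead, as described above. The names `get_next_key`, `count_map_keys`, and `MAX_KEY_SIZE` are hypothetical.

```c
#include <errno.h>
#include <linux/bpf.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

#define MAX_KEY_SIZE 64    /* hypothetical upper bound for this sketch */

static int
get_next_key(int fd, const void *key, void *next_key)
{
    union bpf_attr attr;

    memset(&attr, 0, sizeof(attr));
    attr.map_fd = fd;
    attr.key = (uint64_t) (uintptr_t) key;
    attr.next_key = (uint64_t) (uintptr_t) next_key;

    return syscall(__NR_bpf, BPF_MAP_GET_NEXT_KEY, &attr, sizeof(attr));
}

/* Count the keys in a map.  Returns -1 (with errno set) on any error
   other than the ENOENT that marks the end of the iteration. */
static long
count_map_keys(int fd, size_t key_size)
{
    unsigned char key[MAX_KEY_SIZE], next_key[MAX_KEY_SIZE];
    const void *cur = NULL;    /* assumption: NULL asks for the first key */
    long count = 0;

    if (key_size > MAX_KEY_SIZE) {
        errno = E2BIG;
        return -1;
    }

    while (get_next_key(fd, cur, next_key) == 0) {
        memcpy(key, next_key, key_size);    /* returned key is the next input */
        cur = key;
        count++;
    }

    return (errno == ENOENT) ? count : -1;
}
```

If other programs insert or delete elements while the loop runs, the set of keys visited may not match the map's contents at any single instant.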
582 | .\" | |
583 | .SS eBPF map types | |
584 | The following map types are supported: | |
585 | .TP | |
586 | .B BPF_MAP_TYPE_HASH | |
587 | .\" commit 0f8e4bd8a1fc8c4185f1630061d0a1f2d197a475 | |
ce5db3fc MK |
588 | Hash-table maps have the following characteristics: |
589 | .RS | |
590 | .IP * 3 | |
591 | Maps are created and destroyed by user-space programs. | |
592 | Both user-space and eBPF programs | |
46a4949b | 593 | can perform lookup, update, and delete operations. |
ce5db3fc MK |
594 | .IP * |
595 | The kernel takes care of allocating and freeing key/value pairs. | |
596 | .IP * | |
597 | The | |
598 | .BR map_update_elem () | |
998f951b | 599 | helper will fail to insert a new element when the |
ce5db3fc MK |
600 | .I max_entries |
601 | limit is reached. | |
602 | (This ensures that eBPF programs cannot exhaust memory.) | |
603 | .IP * | |
604 | .BR map_update_elem () | |
605 | replaces existing elements atomically. | |
606 | .RE | |
607 | .IP | |
953d2673 | 608 | Hash-table maps are |
ce5db3fc MK |
609 | optimized for speed of lookup. |
610 | .TP | |
611 | .B BPF_MAP_TYPE_ARRAY | |
612 | .\" commit 28fbcfa08d8ed7c5a50d41a0433aad222835e8e3 | |
ce5db3fc MK |
613 | Array maps have the following characteristics: |
614 | .RS | |
615 | .IP * 3 | |
616 | Optimized for fastest possible lookup. | |
46a4949b | 617 | In the future the verifier/JIT compiler |
ce5db3fc MK |
618 | may recognize lookup() operations that employ a constant key |
619 | and optimize them into a constant pointer. | |
620 | It is possible to optimize a non-constant | |
621 | key into direct pointer arithmetic as well, since pointers and | |
622 | .I value_size | |
623 | are constant for the life of the eBPF program. | |
624 | In other words, | |
625 | .BR array_map_lookup_elem () | |
626 | may be 'inlined' by the verifier/JIT compiler | |
627 | while preserving concurrent access to this map from user space. | |
628 | .IP * | |
629 | All array elements are pre-allocated and zero-initialized at init time. | |
630 | .IP * | |
631 | The key is an array index, and must be exactly four bytes. | |
632 | .IP * | |
633 | .BR map_delete_elem () | |
634 | fails with the error | |
635 | .BR EINVAL , | |
636 | since elements cannot be deleted. | |
637 | .IP * | |
638 | .BR map_update_elem () | |
953d2673 MK |
639 | replaces elements in a |
640 | .B nonatomic | |
641 | fashion; | |
cd579c3f MK |
642 | for atomic updates, a hash-table map should be used instead. |
643 | There is however one special case that can also be used with arrays: | |
644 | the atomic built-in | |
266791fb | 645 | .B __sync_fetch_and_add() |
cd579c3f MK |
646 | can be used on 32-bit and 64-bit atomic counters. |
647 | For example, it can be | |
9a818ddd DB |
648 | applied on the whole value itself if it represents a single counter, |
649 | or in case of a structure containing multiple counters, it could be | |
cd579c3f MK |
650 | used on individual counters. |
651 | This is quite often useful for aggregation and accounting of events. | |
ce5db3fc MK |
652 | .RE |
653 | .IP | |
654 | Among the uses for array maps are the following: | |
655 | .RS | |
656 | .IP * 3 | |
657 | As "global" eBPF variables: an array of 1 element whose key is (index) 0 | |
658 | and where the value is a collection of 'global' variables which | |
659 | eBPF programs can use to keep state between events. | |
660 | .IP * | |
661 | Aggregation of tracing events into a fixed set of buckets. | |
9a818ddd DB |
662 | .IP * |
663 | Accounting of networking events, for example, number of packets and packet | |
664 | sizes. | |
ce5db3fc MK |
665 | .RE |
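The counter case mentioned for array maps can be illustrated with the `__sync_fetch_and_add()` built-in applied to individual fields of a structure. This is ordinary user-space C shown only to make the pattern concrete; in an eBPF program, the `stats` pointer would be the value pointer obtained from a map lookup. The structure layout and function name are hypothetical.

```c
#include <stdint.h>

/* Hypothetical value layout for a one-element array map. */
struct pkt_stats {
    uint64_t packets;
    uint64_t bytes;
};

/* Account one packet of 'len' bytes.  Each counter is updated with its
   own atomic add; the two counters are not updated as a single unit. */
static void
account_packet(struct pkt_stats *stats, uint64_t len)
{
    __sync_fetch_and_add(&stats->packets, 1);
    __sync_fetch_and_add(&stats->bytes, len);
}
```

Because each field is updated separately, a concurrent reader may briefly see a packet counted whose bytes have not yet been added, which is usually acceptable for accounting.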
666 | .TP | |
667 | .BR BPF_MAP_TYPE_PROG_ARRAY " (since Linux 4.2)" | |
cd579c3f MK |
668 | A program array map is a special kind of array map whose map values |
669 | contain only file descriptors referring to other eBPF programs. | |
670 | Thus, both the | |
671 | .I key_size | |
672 | and | |
673 | .I value_size | |
674 | must be exactly four bytes. | |
9a818ddd | 675 | This map is used in conjunction with the |
cd579c3f | 676 | .BR bpf_tail_call () |
9a818ddd | 677 | helper. |
f0271688 | 678 | .IP |
9a818ddd DB |
679 | This means that an eBPF program with a program array map attached to it |
680 | can call from kernel side into | |
f0271688 | 681 | .IP |
9a818ddd | 682 | .in +4n |
f0271688 | 683 | .EX |
05f10213 MK |
684 | void bpf_tail_call(void *context, void *prog_map, |
685 | unsigned int index); | |
f0271688 | 686 | .EE |
9a818ddd | 687 | .in |
f0271688 | 688 | .IP |
9a818ddd | 689 | and therefore replace its own program flow with the one from the program |
cd579c3f MK |
690 | at the given program array slot, if present. |
691 | This can be regarded as a kind of jump table to a different eBPF program. | |
692 | The invoked program will then reuse the same stack. | |
693 | When a jump into the new program has been performed, | |
694 | it won't return to the old program anymore. | |
f0271688 | 695 | .IP |
aabe0499 MK |
696 | If no eBPF program is found at the given index of the program array |
697 | (because the map slot doesn't contain a valid program file descriptor, | |
698 | the specified lookup index/key is out of bounds, | |
699 | or the limit of 32 | |
700 | .\" MAX_TAIL_CALL_CNT | |
701 | nested calls has been exceeded), | |
9a818ddd DB |
702 | execution continues with the current eBPF program. |
703 | This can be used as a fall-through for default cases. | |
f0271688 | 704 | .IP |
9a818ddd | 705 | A program array map is useful, for example, in tracing or networking, to |
cd579c3f MK |
706 | handle individual system calls or protocols in their own subprograms and |
707 | use their identifiers as an individual map index. | |
708 | This approach may result in performance benefits, | |
709 | and also makes it possible to overcome the maximum | |
710 | instruction limit of a single eBPF program. | |
711 | In dynamic environments, | |
712 | a user-space daemon might atomically replace individual subprograms | |
713 | at run time with newer versions to alter overall program behavior, | |
714 | for instance, if global policies change. | |
ce5db3fc MK |
715 | .\" |
716 | .SS eBPF programs | |
842ee010 MK |
717 | The |
718 | .B BPF_PROG_LOAD | |
54513c00 | 719 | command is used to load an eBPF program into the kernel. |
9ab03361 | 720 | The return value for this command is a new file descriptor associated |
ce5db3fc | 721 | with this eBPF program. |
f0271688 | 722 | .PP |
842ee010 | 723 | .in +4n |
f0271688 | 724 | .EX |
cc7ac21d AS |
725 | char bpf_log_buf[LOG_BUF_SIZE]; |
726 | ||
842ee010 | 727 | int |
953d2673 | 728 | bpf_prog_load(enum bpf_prog_type type, |
842ee010 MK |
729 | const struct bpf_insn *insns, int insn_cnt, |
730 | const char *license) | |
cc7ac21d AS |
731 | { |
732 | union bpf_attr attr = { | |
953d2673 MK |
733 | .prog_type = type, |
734 | .insns = ptr_to_u64(insns), | |
735 | .insn_cnt = insn_cnt, | |
736 | .license = ptr_to_u64(license), | |
737 | .log_buf = ptr_to_u64(bpf_log_buf), | |
738 | .log_size = LOG_BUF_SIZE, | |
cc7ac21d AS |
739 | .log_level = 1, |
740 | }; | |
741 | ||
742 | return bpf(BPF_PROG_LOAD, &attr, sizeof(attr)); | |
743 | } | |
f0271688 | 744 | .EE |
842ee010 | 745 | .in |
f0271688 | 746 | .PP |
1148d934 | 747 | .I prog_type |
cc7ac21d | 748 | is one of the available program types: |
f0271688 | 749 | .IP |
1148d934 | 750 | .in +4n |
f0271688 | 751 | .EX |
cc7ac21d | 752 | enum bpf_prog_type { |
f774ddf1 MK |
753 | BPF_PROG_TYPE_UNSPEC, /* Reserve 0 as invalid |
754 | program type */ | |
ce5db3fc MK |
755 | BPF_PROG_TYPE_SOCKET_FILTER, |
756 | BPF_PROG_TYPE_KPROBE, | |
757 | BPF_PROG_TYPE_SCHED_CLS, | |
758 | BPF_PROG_TYPE_SCHED_ACT, | |
cc7ac21d | 759 | }; |
f0271688 | 760 | .EE |
1148d934 | 761 | .in |
f0271688 | 762 | .PP |
ce5db3fc | 763 | For further details of eBPF program types, see below. |
f0271688 | 764 | .PP |
ce5db3fc | 765 | The remaining fields of |
842ee010 MK |
766 | .I bpf_attr |
767 | are set as follows: | |
842ee010 | 768 | .IP * 3 |
1148d934 | 769 | .I insns |
842ee010 | 770 | is an array of |
1148d934 MK |
771 | .I "struct bpf_insn" |
772 | instructions. | |
842ee010 | 773 | .IP * |
1148d934 | 774 | .I insn_cnt |
842ee010 MK |
775 | is the number of instructions in the program referred to by |
776 | .IR insns . | |
777 | .IP * | |
1148d934 | 778 | .I license |
842ee010 | 779 | is a license string, which must be GPL compatible to call helper functions |
1148d934 MK |
780 | marked |
781 | .IR gpl_only . | |
fcd1bee3 | 782 | (The licensing rules are the same as for kernel modules, |
9a818ddd | 783 | so dual licenses, such as "Dual BSD/GPL", may also be used.) |
842ee010 | 784 | .IP * |
1148d934 | 785 | .I log_buf |
842ee010 MK |
786 | is a pointer to a caller-allocated buffer in which the in-kernel |
787 | verifier can store the verification log. | |
9a5215bf | 788 | This log is a multi-line string that can be checked by |
cc7ac21d | 789 | the program author in order to understand how the verifier came to |
953d2673 | 790 | the conclusion that the eBPF program is unsafe. |
cc7ac21d | 791 | The format of the output can change at any time as the verifier evolves. |
842ee010 | 792 | .IP * |
1148d934 | 793 | .I log_size |
842ee010 | 794 | is the size of the buffer pointed to by |
029b613f | 795 | .IR log_buf . |
9a5215bf | 796 | If the size of the buffer is not large enough to store all |
cc7ac21d AS |
797 | verifier messages, \-1 is returned and |
798 | .I errno | |
1148d934 MK |
799 | is set to |
800 | .BR ENOSPC . | |
842ee010 | 801 | .IP * |
1148d934 | 802 | .I log_level |
9a5215bf | 803 | is the verbosity level of the verifier. |
fcd1bee3 MK |
804 | A value of zero means that the verifier will not provide a log; |
805 | in this case, | |
806 | .I log_buf | |
807 | must be a NULL pointer, and | |
808 | .I log_size | |
809 | must be zero. | |
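The field rules above can be sketched as a small helper.
This is a minimal illustration, not code from the kernel tree:
the name fill_prog_load_attr and its parameter list are assumptions made for this example.
It sets the three log fields only when a log is actually requested, so that a
log_level of zero always goes together with a NULL log_buf and a zero log_size.

```c
#include <linux/bpf.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper (illustrative name): fill in the BPF_PROG_LOAD
   fields of union bpf_attr following the rules listed above. */
static void
fill_prog_load_attr(union bpf_attr *attr, enum bpf_prog_type type,
                    const struct bpf_insn *insns, uint32_t insn_cnt,
                    const char *license,
                    char *log_buf, uint32_t log_size, uint32_t log_level)
{
    /* Zero everything first so that fields not set below are zero. */
    memset(attr, 0, sizeof(*attr));

    attr->prog_type = type;
    attr->insns = (uint64_t) (uintptr_t) insns;
    attr->insn_cnt = insn_cnt;
    attr->license = (uint64_t) (uintptr_t) license;

    /* Request a verifier log only if the caller supplied a buffer. */
    if (log_level != 0 && log_buf != NULL && log_size != 0) {
        attr->log_level = log_level;
        attr->log_buf = (uint64_t) (uintptr_t) log_buf;
        attr->log_size = log_size;
    }
    /* Otherwise log_level, log_buf, and log_size all remain zero. */
}
```

The filled structure would then be passed to bpf() with the BPF_PROG_LOAD command, as in the wrapper shown earlier in this section.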
f0271688 | 810 | .PP |
ce5db3fc MK |
811 | Applying |
812 | .BR close (2) | |
813 | to the file descriptor returned by | |
814 | .B BPF_PROG_LOAD | |
815 | will unload the eBPF program (but see NOTES). | |
f0271688 | 816 | .PP |
54513c00 MK |
817 | Maps are accessible from eBPF programs and are used to exchange data between |
818 | eBPF programs and between eBPF programs and user-space programs. | |
5415d504 MK |
819 | For example, |
820 | eBPF programs can process various events (such as kprobes and packets) and | |
821 | store their data into a map, | |
822 | and user-space programs can then fetch data from the map. | |
823 | Conversely, user-space programs can use a map as a configuration mechanism, | |
824 | populating the map with values checked by the eBPF program, | |
825 | which then modifies its behavior on the fly according to those values. | |
953d2673 MK |
826 | .\" |
827 | .\" | |
ce5db3fc | 828 | .SS eBPF program types |
953d2673 MK |
829 | The eBPF program type |
830 | .RI ( prog_type ) | |
fcd1bee3 | 831 | determines the subset of kernel helper functions that the program |
953d2673 | 832 | may call. |
fcd1bee3 | 833 | The program type also determines the program input (context)\(emthe |
953d2673 MK |
834 | format of |
835 | .I "struct bpf_context" | |
ce5db3fc | 836 | (which is the data blob passed into the eBPF program as the first argument). |
0fc33df7 | 837 | .\" |
30ea59e7 | 838 | .\" FIXME |
24493e9b | 839 | .\" Somewhere in this page we need a general introduction to the |
0fc33df7 MK |
840 | .\" bpf_context. For example, how does a BPF program access the |
841 | .\" context? | |
f0271688 | 842 | .PP |
953d2673 MK |
843 | For example, a tracing program does not have the exact same |
844 | subset of helper functions as a socket filter program | |
845 | (though they may have some helpers in common). | |
846 | Similarly, | |
847 | the input (context) for a tracing program is a set of register values, | |
848 | while for a socket filter it is a network packet. | |
f0271688 | 849 | .PP |
ce5db3fc MK |
850 | The set of functions available to eBPF programs of a given type may increase |
851 | in the future. | |
f0271688 | 852 | .PP |
ce5db3fc MK |
853 | The following program types are supported: |
854 | .TP | |
855 | .BR BPF_PROG_TYPE_SOCKET_FILTER " (since Linux 3.19)" | |
856 | Currently, the set of functions for | |
857 | .B BPF_PROG_TYPE_SOCKET_FILTER | |
858 | is: | |
f0271688 | 859 | .IP |
1148d934 | 860 | .in +4n |
f0271688 | 861 | .EX |
ce5db3fc MK |
862 | bpf_map_lookup_elem(map_fd, void *key) |
863 | /* look up key in a map_fd */ | |
864 | bpf_map_update_elem(map_fd, void *key, void *value) | |
865 | /* update key/value */ | |
866 | bpf_map_delete_elem(map_fd, void *key) | |
867 | /* delete key in a map_fd */ | |
f0271688 | 868 | .EE |
1148d934 | 869 | .in |
f0271688 | 870 | .IP |
ce5db3fc MK |
871 | The |
872 | .I bpf_context | |
873 | argument is a pointer to a | |
b87d8ba6 | 874 | .IR "struct __sk_buff" . |
953d2673 | 875 | .\" FIXME: We need some text here to explain how the program |
b913d165 MK |
876 | .\" accesses __sk_buff. |
877 | .\" See 'struct __sk_buff' and commit 9bac3d6d548e5 | |
878 | .\" | |
b87d8ba6 | 879 | .\" Alexei commented: |
b913d165 MK |
880 | .\" Actually now in case of SOCKET_FILTER, SCHED_CLS, SCHED_ACT |
881 | .\" the program can now access skb fields. | |
ce5db3fc MK |
882 | .\" |
883 | .TP | |
266791fb | 884 | .BR BPF_PROG_TYPE_KPROBE " (since Linux 4.1)" |
ce5db3fc MK |
885 | .\" commit 2541517c32be2531e0da59dfd7efc1ce844644f5 |
886 | [To be documented] | |
887 | .\" FIXME Document this program type | |
888 | .\" Describe allowed helper functions for this program type | |
889 | .\" Describe bpf_context for this program type | |
b913d165 | 890 | .\" |
ce5db3fc MK |
891 | .\" FIXME We need text here to describe 'kern_version' |
892 | .TP | |
266791fb | 893 | .BR BPF_PROG_TYPE_SCHED_CLS " (since Linux 4.1)" |
ce5db3fc MK |
894 | .\" commit 96be4325f443dbbfeb37d2a157675ac0736531a1 |
895 | .\" commit e2e9b6541dd4b31848079da80fe2253daaafb549 | |
896 | [To be documented] | |
897 | .\" FIXME Document this program type | |
898 | .\" Describe allowed helper functions for this program type | |
899 | .\" Describe bpf_context for this program type | |
900 | .TP | |
266791fb | 901 | .BR BPF_PROG_TYPE_SCHED_ACT " (since Linux 4.1)" |
ce5db3fc MK |
902 | .\" commit 94caee8c312d96522bcdae88791aaa9ebcd5f22c |
903 | .\" commit a8cb5f556b567974d75ea29c15181c445c541b1f | |
904 | [To be documented] | |
905 | .\" FIXME Document this program type | |
906 | .\" Describe allowed helper functions for this program type | |
907 | .\" Describe bpf_context for this program type | |
908 | .SS Events | |
909 | Once a program is loaded, it can be attached to an event. | |
910 | Various kernel subsystems have different ways to do so. | |
f0271688 | 911 | .PP |
ce5db3fc MK |
912 | Since Linux 3.19, |
913 | .\" commit 89aa075832b0da4402acebd698d0411dcc82d03e | |
914 | the following call will attach the program | |
cc7ac21d | 915 | .I prog_fd |
842ee010 MK |
916 | to the socket |
917 | .IR sockfd , | |
ce5db3fc MK |
918 | which was created by an earlier call to |
919 | .BR socket (2): | |
f0271688 | 920 | .PP |
1148d934 | 921 | .in +4n |
f0271688 | 922 | .EX |
ce5db3fc MK |
923 | setsockopt(sockfd, SOL_SOCKET, SO_ATTACH_BPF, |
924 | &prog_fd, sizeof(prog_fd)); | |
f0271688 | 925 | .EE |
1148d934 | 926 | .in |
f0271688 | 927 | .PP |
ce5db3fc MK |
928 | Since Linux 4.1, |
929 | .\" commit 2541517c32be2531e0da59dfd7efc1ce844644f5 | |
930 | the following call may be used to attach | |
931 | the eBPF program referred to by the file descriptor | |
cc7ac21d | 932 | .I prog_fd |
ce5db3fc MK |
933 | to a perf event file descriptor, |
934 | .IR event_fd , | |
935 | that was created by a previous call to | |
936 | .BR perf_event_open (2): | |
efeece04 | 937 | .PP |
ce5db3fc | 938 | .in +4n |
b76974c1 | 939 | .EX |
ce5db3fc | 940 | ioctl(event_fd, PERF_EVENT_IOC_SET_BPF, prog_fd); |
b76974c1 | 941 | .EE |
ce5db3fc MK |
942 | .in |
943 | .\" | |
ce5db3fc | 944 | .\" |
cc7ac21d | 945 | .SH EXAMPLES |
f0271688 | 946 | .EX |
cc7ac21d AS |
947 | /* bpf+sockets example: |
948 | * 1. create array map of 256 elements | |
949 | * 2. load program that counts number of packets received | |
950 | * r0 = skb->data[ETH_HLEN + offsetof(struct iphdr, protocol)] | |
951 | * map[r0]++ | |
952 | * 3. attach prog_fd to raw socket via setsockopt() | |
953 | * 4. print number of received TCP/UDP packets every second | |
954 | */ | |
842ee010 MK |
955 | int |
956 | main(int argc, char **argv) | |
cc7ac21d AS |
957 | { |
958 | int sock, map_fd, prog_fd, key; | |
959 | long long value = 0, tcp_cnt, udp_cnt; | |
960 | ||
1148d934 MK |
961 | map_fd = bpf_create_map(BPF_MAP_TYPE_ARRAY, sizeof(key), |
962 | sizeof(value), 256); | |
cc7ac21d | 963 | if (map_fd < 0) { |
d1a71985 | 964 | printf("failed to create map '%s'\en", strerror(errno)); |
cc7ac21d AS |
965 | /* likely not run as root */ |
966 | return 1; | |
967 | } | |
968 | ||
969 | struct bpf_insn prog[] = { | |
1148d934 MK |
970 | BPF_MOV64_REG(BPF_REG_6, BPF_REG_1), /* r6 = r1 */ |
971 | BPF_LD_ABS(BPF_B, ETH_HLEN + offsetof(struct iphdr, protocol)), | |
972 | /* r0 = ip->proto */ | |
973 | BPF_STX_MEM(BPF_W, BPF_REG_10, BPF_REG_0, -4), | |
974 | /* *(u32 *)(fp - 4) = r0 */ | |
975 | BPF_MOV64_REG(BPF_REG_2, BPF_REG_10), /* r2 = fp */ | |
976 | BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -4), /* r2 = r2 - 4 */ | |
977 | BPF_LD_MAP_FD(BPF_REG_1, map_fd), /* r1 = map_fd */ | |
978 | BPF_CALL_FUNC(BPF_FUNC_map_lookup_elem), | |
979 | /* r0 = map_lookup(r1, r2) */ | |
980 | BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 2), | |
981 | /* if (r0 == 0) goto pc+2 */ | |
982 | BPF_MOV64_IMM(BPF_REG_1, 1), /* r1 = 1 */ | |
983 | BPF_XADD(BPF_DW, BPF_REG_0, BPF_REG_1, 0, 0), | |
984 | /* lock *(u64 *) r0 += r1 */ | |
4fba111e | 985 | .\" == atomic64_add |
1148d934 MK |
986 | BPF_MOV64_IMM(BPF_REG_0, 0), /* r0 = 0 */ |
987 | BPF_EXIT_INSN(), /* return r0 */ | |
cc7ac21d AS |
988 | }; |
989 | ||
1148d934 | 990 | prog_fd = bpf_prog_load(BPF_PROG_TYPE_SOCKET_FILTER, prog, |
527bd1d7 | 991 | sizeof(prog) / sizeof(prog[0]), "GPL"); |
cc7ac21d AS |
992 | |
993 | sock = open_raw_sock("lo"); | |
994 | ||
1148d934 MK |
995 | assert(setsockopt(sock, SOL_SOCKET, SO_ATTACH_BPF, &prog_fd, |
996 | sizeof(prog_fd)) == 0); | |
cc7ac21d AS |
997 | |
998 | for (;;) { | |
999 | key = IPPROTO_TCP; | |
1000 | assert(bpf_lookup_elem(map_fd, &key, &tcp_cnt) == 0); | |
c83ad1ad | 1001 | key = IPPROTO_UDP; |
cc7ac21d | 1002 | assert(bpf_lookup_elem(map_fd, &key, &udp_cnt) == 0); |
d1a71985 | 1003 | printf("TCP %lld UDP %lld packets\en", tcp_cnt, udp_cnt); |
cc7ac21d AS |
1004 | sleep(1); |
1005 | } | |
1006 | ||
1007 | return 0; | |
1008 | } | |
f0271688 MK |
1009 | .EE |
1010 | .PP | |
ce5db3fc | 1011 | Some complete working code can be found in the |
266791fb | 1012 | .I samples/bpf |
5415d504 | 1013 | directory in the kernel source tree. |
cc7ac21d AS |
1014 | .SH RETURN VALUE |
1015 | For a successful call, the return value depends on the operation: | |
1016 | .TP | |
1017 | .B BPF_MAP_CREATE | |
ce5db3fc | 1018 | The new file descriptor associated with the eBPF map. |
cc7ac21d AS |
1019 | .TP |
1020 | .B BPF_PROG_LOAD | |
54513c00 | 1021 | The new file descriptor associated with the eBPF program. |
cc7ac21d AS |
1022 | .TP |
1023 | All other commands | |
1024 | Zero. | |
1025 | .PP | |
1026 | On error, \-1 is returned, and | |
1027 | .I errno | |
1028 | is set appropriately. | |
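The return convention can be illustrated with a short sketch.
Because glibc provides no wrapper for bpf(), the example invokes it through syscall(2);
the names sys_bpf and create_array_map are assumptions chosen for this illustration.
On success the helper returns the new map file descriptor; on failure it reports the error and returns \-1 with errno preserved.

```c
#include <errno.h>
#include <linux/bpf.h>
#include <stdio.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Thin wrapper around the raw system call (illustrative name). */
static int
sys_bpf(int cmd, union bpf_attr *attr, unsigned int size)
{
    return syscall(__NR_bpf, cmd, attr, size);
}

/* Hypothetical helper: create an array map and return its file
   descriptor, or -1 with errno set if BPF_MAP_CREATE fails. */
static int
create_array_map(unsigned int value_size, unsigned int max_entries)
{
    union bpf_attr attr;
    int fd, saved_errno;

    memset(&attr, 0, sizeof(attr));
    attr.map_type = BPF_MAP_TYPE_ARRAY;
    attr.key_size = sizeof(int);      /* array maps use 4-byte keys */
    attr.value_size = value_size;
    attr.max_entries = max_entries;

    fd = sys_bpf(BPF_MAP_CREATE, &attr, sizeof(attr));
    if (fd == -1) {
        saved_errno = errno;          /* fprintf() may change errno */
        fprintf(stderr, "BPF_MAP_CREATE: %s\n", strerror(saved_errno));
        errno = saved_errno;
    }
    return fd;
}
```

A caller would typically check for a negative result, and apply close(2) to the returned file descriptor once the map is no longer needed.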
1029 | .SH ERRORS | |
1030 | .TP | |
266791fb | 1031 | .B E2BIG |
6cedbd4c MK |
1032 | The eBPF program is too large or a map reached the |
1033 | .I max_entries | |
1034 | limit (maximum number of elements). | |
cc7ac21d | 1035 | .TP |
266791fb | 1036 | .B EACCES |
6cedbd4c | 1037 | For |
266791fb | 1038 | .BR BPF_PROG_LOAD , |
6cedbd4c MK |
1039 | even though all program instructions are valid, the program has been |
1040 | rejected because it was deemed unsafe. | |
1041 | The verifier may have found that the program could access | |
1042 | a disallowed memory region or an uninitialized stack slot or register, | |
1043 | that the function constraints don't match the actual types, or that | |
1044 | there was a misaligned memory access. | |
1045 | In this case, it is recommended to call | |
1046 | .BR bpf () | |
1047 | again with | |
1048 | .I log_level = 1 | |
1049 | and examine | |
1050 | .I log_buf | |
1051 | for the specific reason provided by the verifier. | |
cc7ac21d AS |
1052 | .TP |
1053 | .B EBADF | |
1054 | .I fd | |
7d6bfe72 | 1055 | is not an open file descriptor. |
cc7ac21d AS |
1056 | .TP |
1057 | .B EFAULT | |
1148d934 MK |
1058 | One of the pointers |
1059 | .RI ( key | |
cc7ac21d AS |
1060 | or |
1061 | .I value | |
1062 | or | |
1063 | .I log_buf | |
1064 | or | |
1148d934 MK |
1065 | .IR insns ) |
1066 | is outside the accessible address space. | |
cc7ac21d AS |
1067 | .TP |
1068 | .B EINVAL | |
1069 | The value specified in | |
1070 | .I cmd | |
1071 | is not recognized by this kernel. | |
1072 | .TP | |
1073 | .B EINVAL | |
1074 | For | |
1075 | .BR BPF_MAP_CREATE , | |
1076 | either | |
1077 | .I map_type | |
1078 | or attributes are invalid. | |
1079 | .TP | |
1080 | .B EINVAL | |
1081 | For | |
266791fb | 1082 | .B BPF_MAP_*_ELEM |
cc7ac21d | 1083 | commands, |
1148d934 MK |
1084 | some of the fields of |
1085 | .I "union bpf_attr" | |
1086 | that are not used by this command | |
cc7ac21d AS |
1087 | are not set to zero. |
1088 | .TP | |
1089 | .B EINVAL | |
1090 | For | |
266791fb | 1091 | .BR BPF_PROG_LOAD , |
9a5215bf | 1092 | indicates an attempt to load an invalid program. |
953d2673 MK |
1093 | eBPF programs can be deemed |
1094 | invalid due to unrecognized instructions, the use of reserved fields, jumps | |
cc7ac21d AS |
1095 | out of range, infinite loops or calls of unknown functions. |
1096 | .TP | |
266791fb | 1097 | .B ENOENT |
cc7ac21d AS |
1098 | For |
1099 | .B BPF_MAP_LOOKUP_ELEM | |
1100 | or | |
16152abb | 1101 | .BR BPF_MAP_DELETE_ELEM , |
cc7ac21d AS |
1102 | indicates that the element with the given |
1103 | .I key | |
1104 | was not found. | |
1105 | .TP | |
6cedbd4c MK |
1106 | .B ENOMEM |
1107 | Cannot allocate sufficient memory. | |
1108 | .TP | |
1109 | .B EPERM | |
1110 | The call was made without sufficient privilege | |
1111 | (without the | |
1112 | .B CAP_SYS_ADMIN | |
1113 | capability). | |
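As a rough illustration of how a loader might act on these errors, the sketch below maps
errno values from a failed BPF_PROG_LOAD to short hints.
The function name bpf_prog_load_hint and the hint strings are assumptions for this example;
the ENOSPC case is taken from the description of log_size earlier in this page.

```c
#include <errno.h>

/* Hypothetical helper: return a short hint for an errno value
   produced by a failed BPF_PROG_LOAD command. */
static const char *
bpf_prog_load_hint(int err)
{
    switch (err) {
    case E2BIG:
        return "program is too large";
    case EACCES:
        return "rejected as unsafe; retry with log_level = 1 "
               "and examine log_buf";
    case EFAULT:
        return "a pointer (insns or log_buf) is outside the "
               "accessible address space";
    case EINVAL:
        return "invalid program or attributes";
    case ENOMEM:
        return "cannot allocate sufficient memory";
    case ENOSPC:
        return "log_buf is too small for the verifier log; "
               "increase log_size";
    case EPERM:
        return "insufficient privilege (CAP_SYS_ADMIN required)";
    default:
        return "unexpected error";
    }
}
```

A caller might print the hint next to the text from strerror(3) after a failed load.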
5f920e10 MK |
1114 | .SH VERSIONS |
1115 | The | |
1116 | .BR bpf () | |
1117 | system call first appeared in Linux 3.18. | |
8dbf8f2d MK |
1118 | .SH CONFORMING TO |
1119 | The | |
1120 | .BR bpf () | |
1121 | system call is Linux-specific. | |
cc7ac21d | 1122 | .SH NOTES |
842ee010 MK |
1123 | In the current implementation, all |
1124 | .BR bpf () | |
1125 | commands require the caller to have the | |
cc7ac21d | 1126 | .B CAP_SYS_ADMIN |
842ee010 | 1127 | capability. |
f0271688 | 1128 | .PP |
f774ddf1 MK |
1129 | eBPF objects (maps and programs) can be shared between processes. |
1130 | For example, after | |
1131 | .BR fork (2), | |
1132 | the child inherits file descriptors referring to the same eBPF objects. | |
1133 | In addition, file descriptors referring to eBPF objects can be | |
1134 | transferred over UNIX domain sockets. | |
1135 | File descriptors referring to eBPF objects can be duplicated | |
1136 | in the usual way, using | |
1137 | .BR dup (2) | |
1138 | and similar calls. | |
1139 | An eBPF object is deallocated only after all file descriptors | |
1140 | referring to the object have been closed. | |
f0271688 | 1141 | .PP |
4fba111e MK |
1142 | eBPF programs can be written in a restricted C that is compiled (using the |
1143 | .B clang | |
953d2673 MK |
1144 | compiler) into eBPF bytecode. |
1145 | Various features are omitted from this restricted C, such as loops, | |
f774ddf1 | 1146 | global variables, variadic functions, floating-point numbers, |
953d2673 | 1147 | and passing structures as function arguments. |
4fba111e MK |
1148 | Some examples can be found in the |
1149 | .I samples/bpf/*_kern.c | |
1150 | files in the kernel source tree. | |
ce5db3fc MK |
1151 | .\" There are also examples for the tc classifier, in the iproute2 |
1152 | .\" project, in examples/bpf | |
f0271688 | 1153 | .PP |
953d2673 MK |
1154 | The kernel contains a just-in-time (JIT) compiler that translates |
1155 | eBPF bytecode into native machine code for better performance. | |
5a29959a MK |
1156 | In kernels before Linux 4.15, |
1157 | the JIT compiler is disabled by default, | |
953d2673 MK |
1158 | but its operation can be controlled by writing one of the |
1159 | following integer strings to the file | |
1160 | .IR /proc/sys/net/core/bpf_jit_enable : | |
1161 | .IP 0 3 | |
1162 | Disable JIT compilation (default). | |
1163 | .IP 1 | |
1164 | Normal compilation. | |
1165 | .IP 2 | |
1166 | Debugging mode. | |
1167 | The generated opcodes are dumped in hexadecimal into the kernel log. | |
1168 | These opcodes can then be disassembled using the program | |
266791fb | 1169 | .I tools/net/bpf_jit_disasm.c |
953d2673 | 1170 | provided in the kernel source tree. |
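The current setting can also be inspected from a program.
The following is a minimal sketch, assuming the file is readable by the caller;
the function names read_bpf_jit_enable and bpf_jit_mode_name are illustrative and not part of any library.

```c
#include <stdio.h>

#define BPF_JIT_ENABLE_PATH "/proc/sys/net/core/bpf_jit_enable"

/* Hypothetical helper: return the value (0, 1, or 2) read from
   bpf_jit_enable, or -1 if the file is missing, unreadable, or
   holds a value outside the documented range. */
static int
read_bpf_jit_enable(void)
{
    FILE *fp = fopen(BPF_JIT_ENABLE_PATH, "r");
    int value;

    if (fp == NULL)
        return -1;
    if (fscanf(fp, "%d", &value) != 1 || value < 0 || value > 2)
        value = -1;
    fclose(fp);
    return value;
}

/* Map a bpf_jit_enable value to the description given above. */
static const char *
bpf_jit_mode_name(int value)
{
    switch (value) {
    case 0:
        return "disabled";
    case 1:
        return "normal compilation";
    case 2:
        return "debugging mode";
    default:
        return "unknown";
    }
}
```

Changing the setting requires writing one of the integer strings to the same file, which normally needs privilege.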
fcd1bee3 | 1171 | .PP |
5a29959a MK |
1172 | Since Linux 4.15, |
1173 | .\" commit 290af86629b25ffd1ed6232c4e9107da031705cb | |
1174 | the kernel may be configured with the | |
1175 | .B CONFIG_BPF_JIT_ALWAYS_ON | |
1176 | option. | |
1177 | In this case, the JIT compiler is always enabled, and | |
1178 | .I bpf_jit_enable | |
1179 | is initialized to 1 and is immutable. | |
1180 | (This kernel configuration option was provided as a mitigation for | |
1181 | one of the Spectre attacks against the BPF interpreter.) | |
1182 | .PP | |
2b623a23 | 1183 | The JIT compiler for eBPF is currently |
4167f63f | 1184 | .\" Last reviewed in Linux 4.18-rc by grepping for BPF_ALU64 in arch/ |
6d2ac026 MK |
1185 | .\" and by checking the documentation for bpf_jit_enable in |
1186 | .\" Documentation/sysctl/net.txt | |
2b623a23 MK |
1187 | available for the following architectures: |
1188 | .IP * 3 | |
2ef9216b MK |
1189 | x86-64 (since Linux 3.18; cBPF since Linux 3.0); |
1190 | .\" commit 0a14842f5a3c0e88a1e59fac5c3025db39721f74 | |
2b623a23 MK |
1191 | .PD 0 |
1192 | .IP * | |
2ef9216b MK |
1193 | ARM32 (since Linux 3.18; cBPF since Linux 3.4); |
1194 | .\" commit ddecdfcea0ae891f782ae853771c867ab51024c2 | |
1195 | .IP * | |
1196 | SPARC 32 (since Linux 3.18; cBPF since Linux 3.5); | |
1197 | .\" commit 2809a2087cc44b55e4377d7b9be3f7f5d2569091 | |
2b623a23 | 1198 | .IP * |
2ef9216b MK |
1199 | ARM-64 (since Linux 3.18); |
1200 | .\" commit e54bcde3d69d40023ae77727213d14f920eb264a | |
2b623a23 | 1201 | .IP * |
069be4fd MK |
1202 | s390 (since Linux 4.1; cBPF since Linux 3.7); |
1203 | .\" commit c10302efe569bfd646b4c22df29577a4595b4580 | |
1204 | .IP * | |
2ef9216b MK |
1205 | PowerPC 64 (since Linux 4.8; cBPF since Linux 3.1); |
1206 | .\" commit 0ca87f05ba8bdc6791c14878464efc901ad71e99 | |
1207 | .\" commit 156d0e290e969caba25f1851c52417c14d141b24 | |
2b623a23 MK |
1208 | .IP * |
1209 | SPARC 64 (since Linux 4.12); | |
2ef9216b | 1210 | .\" commit 7a12b5031c6b947cc13918237ae652b536243b76 |
2b623a23 | 1211 | .IP * |
2ef9216b MK |
1212 | x86-32 (since Linux 4.18); |
1213 | .\" commit 03f5781be2c7b7e728d724ac70ba10799cc710d7 | |
2b623a23 | 1214 | .IP * |
2ef9216b MK |
1215 | MIPS 64 (since Linux 4.18; cBPF since Linux 3.16); |
1216 | .\" commit c6610de353da5ca6eee5b8960e838a87a90ead0c | |
1217 | .\" commit f381bf6d82f032b7410185b35d000ea370ac706b | |
c3a42840 | 1218 | .IP * |
2ef9216b MK |
1219 | riscv (since Linux 5.1). |
1220 | .\" commit 2353ecc6f91fd15b893fa01bf85a1c7a823ee4f2 | |
2b623a23 | 1221 | .PD |
cc7ac21d | 1222 | .SH SEE ALSO |
842ee010 | 1223 | .BR seccomp (2), |
3bcfaff6 | 1224 | .BR bpf-helpers (7), |
cc42e9b8 | 1225 | .BR socket (7), |
8440f771 MK |
1226 | .BR tc (8), |
1227 | .BR tc-bpf (8) | |
f0271688 | 1228 | .PP |
5988a659 | 1229 | Both classic and extended BPF are explained in the kernel source file |
1148d934 | 1230 | .IR Documentation/networking/filter.txt . |