Commit | Line | Data |
---|---|---|
cc7ac21d | 1 | .\" Copyright (C) 2015 Alexei Starovoitov <ast@kernel.org> |
ce5db3fc | 2 | .\" and Copyright (C) 2015 Michael Kerrisk <mtk.manpages@gmail.com> |
cc7ac21d AS |
3 | .\" |
4 | .\" %%%LICENSE_START(VERBATIM) | |
5 | .\" Permission is granted to make and distribute verbatim copies of this | |
6 | .\" manual provided the copyright notice and this permission notice are | |
7 | .\" preserved on all copies. | |
8 | .\" | |
9 | .\" Permission is granted to copy and distribute modified versions of this | |
10 | .\" manual under the conditions for verbatim copying, provided that the | |
11 | .\" entire resulting derived work is distributed under the terms of a | |
12 | .\" permission notice identical to this one. | |
13 | .\" | |
14 | .\" Since the Linux kernel and libraries are constantly changing, this | |
15 | .\" manual page may be incorrect or out-of-date. The author(s) assume no | |
16 | .\" responsibility for errors or omissions, or for damages resulting from | |
17 | .\" the use of the information contained herein. The author(s) may not | |
18 | .\" have taken the same level of care in the production of this manual, | |
19 | .\" which is licensed free of charge, as they might when working | |
20 | .\" professionally. | |
21 | .\" | |
22 | .\" Formatted or processed versions of this manual, if unaccompanied by | |
23 | .\" the source, must acknowledge the copyright and authors of this work. | |
24 | .\" %%%LICENSE_END | |
25 | .\" | |
63121bd4 | 26 | .TH BPF 2 2019-08-02 "Linux" "Linux Programmer's Manual" |
cc7ac21d | 27 | .SH NAME |
99663603 | 28 | bpf \- perform a command on an extended BPF map or program |
cc7ac21d AS |
29 | .SH SYNOPSIS |
30 | .nf | |
31 | .B #include <linux/bpf.h> | |
c36ac88f | 32 | |
266791fb | 33 | .BI "int bpf(int " cmd ", union bpf_attr *" attr ", unsigned int " size ); |
c36ac88f | 34 | .fi |
cc7ac21d | 35 | .SH DESCRIPTION |
5988a659 | 36 | The |
16152abb | 37 | .BR bpf () |
842ee010 MK |
38 | system call performs a range of operations related to extended |
39 | Berkeley Packet Filters. | |
40 | Extended BPF (or eBPF) is similar to | |
54513c00 MK |
41 | the original ("classic") BPF (cBPF) used to filter network packets. |
42 | For both cBPF and eBPF programs, | |
842ee010 MK |
43 | the kernel statically analyzes the programs before loading them, |
44 | in order to ensure that they cannot harm the running system. | |
11ac5b51 | 45 | .PP |
cc42e9b8 | 46 | eBPF extends cBPF in multiple ways, including the ability to call |
f774ddf1 MK |
47 | a fixed set of in-kernel helper functions |
48 | .\" See 'enum bpf_func_id' in include/uapi/linux/bpf.h | |
49 | (via the | |
842ee010 MK |
50 | .B BPF_CALL |
51 | opcode extension provided by eBPF) | |
ce5db3fc | 52 | and access shared data structures such as eBPF maps. |
fcd1bee3 | 53 | .\" |
cc7ac21d | 54 | .SS Extended BPF Design/Architecture |
953d2673 | 55 | eBPF maps are a generic data structure for storage of different data types. |
9a818ddd | 56 | Data types are generally treated as binary blobs, so a user just specifies |
cd579c3f | 57 | the size of the key and the size of the value at map-creation time. |
9a818ddd | 58 | In other words, a key/value for a given map can have an arbitrary structure. |
f0271688 | 59 | .PP |
cc7ac21d | 60 | A user process can create multiple maps (with key/value-pairs being |
16152abb | 61 | opaque bytes of data) and access them via file descriptors. |
b87d8ba6 | 62 | Different eBPF programs can access the same maps in parallel. |
54513c00 | 63 | It's up to the user process and eBPF program to decide what they store |
cc7ac21d | 64 | inside maps. |
f0271688 | 65 | .PP |
cd579c3f MK |
66 | There's one special map type, called a program array. |
67 | This type of map stores file descriptors referring to other eBPF programs. | |
68 | When a lookup in the map is performed, the program flow is | |
69 | redirected in-place to the beginning of another eBPF program and does not | |
70 | return to the calling program. | |
aabe0499 MK |
71 | The level of nesting has a fixed limit of 32, |
72 | .\" Defined by the kernel constant MAX_TAIL_CALL_CNT in include/linux/bpf.h | |
73 | so that infinite loops cannot be crafted. | |
29c0586f | 74 | At run time, the program file descriptors stored in the map can be modified, |
9a818ddd | 75 | so program functionality can be altered based on specific requirements. |
cd579c3f MK |
76 | All programs referred to in a program-array map must |
77 | have been previously loaded into the kernel via | |
78 | .BR bpf (). | |
79 | If a map lookup fails, the current program continues its execution. | |
80 | See | |
81 | .B BPF_MAP_TYPE_PROG_ARRAY | |
82 | below for further details. | |
11ac5b51 | 83 | .PP |
9a818ddd | 84 | Generally, eBPF programs are loaded by the user process and automatically |
cd579c3f MK |
85 | unloaded when the process exits. |
86 | In some cases, for example, | |
9a818ddd DB |
87 | .BR tc-bpf (8), |
88 | the program will continue to stay alive inside the kernel even after the | |
a0d8ddd1 | 89 | process that loaded the program exits. |
cd579c3f MK |
90 | In that case, |
91 | the tc subsystem holds a reference to the eBPF program after the | |
92 | file descriptor has been closed by the user-space program. | |
9a818ddd DB |
93 | Thus, whether a specific program continues to live inside the kernel |
94 | depends on how it is further attached to a given kernel subsystem | |
95 | after it was loaded via | |
cd579c3f | 96 | .BR bpf (). |
f0271688 | 97 | .PP |
cd579c3f | 98 | Each eBPF program is a set of instructions that is safe to run until |
9a5215bf | 99 | its completion. |
54513c00 | 100 | An in-kernel verifier statically determines that the eBPF program |
9a5215bf | 101 | terminates and is safe to execute. |
896388c8 MK |
102 | During verification, the kernel increments reference counts for each of |
103 | the maps that the eBPF program uses, | |
953d2673 | 104 | so that the attached maps can't be removed until the program is unloaded. |
f0271688 | 105 | .PP |
54513c00 | 106 | eBPF programs can be attached to different events. |
9ab03361 | 107 | These events can be the arrival of network packets, tracing |
953d2673 MK |
108 | events, classification events by network queueing disciplines |
109 | (for eBPF programs attached to a | |
9ab03361 MK |
110 | .BR tc (8) |
111 | classifier), and other types that may be added in the future. | |
54513c00 | 112 | A new event triggers execution of the eBPF program, which |
f774ddf1 | 113 | may store information about the event in eBPF maps. |
54513c00 | 114 | Beyond storing data, eBPF programs may call a fixed set of |
896388c8 | 115 | in-kernel helper functions. |
f0271688 | 116 | .PP |
f774ddf1 | 117 | The same eBPF program can be attached to multiple events and different |
cc42e9b8 | 118 | eBPF programs can access the same map: |
f0271688 | 119 | .PP |
1148d934 | 120 | .in +4n |
f0271688 | 121 | .EX |
cd579c3f MK |
122 | tracing     tracing    tracing    packet      packet     packet |
123 | event A     event B    event C    on eth0     on eth1    on eth2 | |
124 |  |             |         |          |           |          ^ | |
125 |  |             |         |          |           v          | | |
126 |  --> tracing <--     tracing      socket    tc ingress   tc egress | |
127 |       prog_1          prog_2      prog_3    classifier    action | |
128 |       |  |              |           |         prog_4      prog_5 | |
129 |    |---  -----|  |------|          map_3        |           | | |
130 |  map_1       map_2                              --| map_4 |-- | |
f0271688 | 131 | .EE |
1148d934 | 132 | .in |
fcd1bee3 | 133 | .\" |
5988a659 | 134 | .SS Arguments |
842ee010 | 135 | The operation to be performed by the |
1148d934 | 136 | .BR bpf () |
842ee010 | 137 | system call is determined by the |
266791fb | 138 | .I cmd |
f774ddf1 MK |
139 | argument. |
140 | Each operation takes an accompanying argument, | |
141 | provided via | |
142 | .IR attr , | |
143 | which is a pointer to a union of type | |
266791fb | 144 | .I bpf_attr |
f774ddf1 MK |
145 | (see below). |
146 | The | |
147 | .I size | |
148 | argument is the size of the union pointed to by | |
149 | .IR attr . | |
efeece04 | 150 | .PP |
f774ddf1 | 151 | The value provided in |
266791fb | 152 | .I cmd |
f774ddf1 | 153 | is one of the following: |
cc7ac21d AS |
154 | .TP |
155 | .B BPF_MAP_CREATE | |
953d2673 | 156 | Create a map and return a file descriptor that refers to the map. |
0f166ce1 MK |
157 | The close-on-exec file descriptor flag (see |
158 | .BR fcntl (2)) | |
159 | is automatically enabled for the new file descriptor. | |
cc7ac21d AS |
160 | .TP |
161 | .B BPF_MAP_LOOKUP_ELEM | |
842ee010 | 162 | Look up an element by key in a specified map and return its value. |
cc7ac21d AS |
163 | .TP |
164 | .B BPF_MAP_UPDATE_ELEM | |
842ee010 | 165 | Create or update an element (key/value pair) in a specified map. |
cc7ac21d AS |
166 | .TP |
167 | .B BPF_MAP_DELETE_ELEM | |
842ee010 | 168 | Look up and delete an element by key in a specified map. |
cc7ac21d AS |
169 | .TP |
170 | .B BPF_MAP_GET_NEXT_KEY | |
842ee010 MK |
171 | Look up an element by key in a specified map and return the key |
172 | of the next element. | |
cc7ac21d AS |
173 | .TP |
174 | .B BPF_PROG_LOAD | |
9ab03361 MK |
175 | Verify and load an eBPF program, |
176 | returning a new file descriptor associated with the program. | |
0f166ce1 MK |
177 | The close-on-exec file descriptor flag (see |
178 | .BR fcntl (2)) | |
179 | is automatically enabled for the new file descriptor. | |
f0271688 | 180 | .IP |
842ee010 MK |
181 | The |
182 | .I bpf_attr | |
183 | union consists of various anonymous structures that are used by different | |
184 | .BR bpf () | |
185 | commands: | |
b3b5781e | 186 | .PP |
842ee010 | 187 | .in +4n |
f0271688 | 188 | .EX |
cc7ac21d | 189 | union bpf_attr { |
842ee010 MK |
190 | struct { /* Used by BPF_MAP_CREATE */ |
191 | __u32 map_type; | |
192 | __u32 key_size; /* size of key in bytes */ | |
193 | __u32 value_size; /* size of value in bytes */ | |
194 | __u32 max_entries; /* maximum number of entries | |
195 | in a map */ | |
cc7ac21d AS |
196 | }; |
197 | ||
f774ddf1 MK |
198 | struct { /* Used by BPF_MAP_*_ELEM and BPF_MAP_GET_NEXT_KEY |
199 | commands */ | |
842ee010 MK |
200 | __u32 map_fd; |
201 | __aligned_u64 key; | |
cc7ac21d AS |
202 | union { |
203 | __aligned_u64 value; | |
204 | __aligned_u64 next_key; | |
205 | }; | |
842ee010 | 206 | __u64 flags; |
cc7ac21d AS |
207 | }; |
208 | ||
842ee010 MK |
209 | struct { /* Used by BPF_PROG_LOAD */ |
210 | __u32 prog_type; | |
211 | __u32 insn_cnt; | |
212 | __aligned_u64 insns; /* 'const struct bpf_insn *' */ | |
213 | __aligned_u64 license; /* 'const char *' */ | |
214 | __u32 log_level; /* verbosity level of verifier */ | |
215 | __u32 log_size; /* size of user buffer */ | |
216 | __aligned_u64 log_buf; /* user supplied 'char *' | |
217 | buffer */ | |
f774ddf1 | 218 | __u32 kern_version; |
9ab03361 MK |
219 | /* checked when prog_type=kprobe |
220 | (since Linux 4.1) */ | |
221 | .\" commit 2541517c32be2531e0da59dfd7efc1ce844644f5 | |
cc7ac21d AS |
222 | }; |
223 | } __attribute__((aligned(8))); | |
f0271688 | 224 | .EE |
842ee010 | 225 | .in |
fcd1bee3 | 226 | .\" |
ce5db3fc | 227 | .SS eBPF maps |
8440f771 MK |
228 | Maps are a generic data structure for storage of different types of data. |
229 | They allow sharing of data between eBPF kernel programs, | |
230 | and also between kernel and user-space applications. | |
f0271688 | 231 | .PP |
16152abb | 232 | Each map type has the following attributes: |
16152abb MK |
233 | .IP * 3 |
234 | type | |
235 | .IP * | |
79e2beef | 236 | maximum number of elements |
16152abb MK |
237 | .IP * |
238 | key size in bytes | |
239 | .IP * | |
240 | value size in bytes | |
16152abb | 241 | .PP |
842ee010 MK |
242 | The following wrapper functions demonstrate how various |
243 | .BR bpf () | |
244 | commands can be used to access the maps. | |
9a5215bf | 245 | The functions use the |
266791fb | 246 | .I cmd |
cc7ac21d | 247 | argument to invoke different operations. |
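The wrapper functions in this section call bpf() and ptr_to_u64() without defining them. The following is a minimal sketch of how the two could be written, assuming the C library provides no bpf() wrapper of its own, so that the system call is reached through syscall(2) and __NR_bpf. The definition of ptr_to_u64() is likewise an assumption about what the examples intend: it widens a pointer to the 64-bit form expected by the __aligned_u64 fields of union bpf_attr.

```c
#define _GNU_SOURCE
#include <stdint.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/bpf.h>

/* Thin wrapper: invoke the bpf system call through syscall(2).
   Returns -1 and sets errno on error, like other system calls. */
static int
bpf(int cmd, union bpf_attr *attr, unsigned int size)
{
    return syscall(__NR_bpf, cmd, attr, size);
}

/* Widen a user-space pointer to the 64-bit integer form used by the
   __aligned_u64 fields of union bpf_attr (key, value, insns, ...). */
static uint64_t
ptr_to_u64(const void *ptr)
{
    return (uint64_t) (uintptr_t) ptr;
}
```

Going through uintptr_t keeps the conversion well defined on 32-bit systems, where a pointer is narrower than the 64-bit field that carries it.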
ce5db3fc | 248 | .TP |
842ee010 MK |
249 | .B BPF_MAP_CREATE |
250 | The | |
cc7ac21d | 251 | .B BPF_MAP_CREATE |
5415d504 MK |
252 | command creates a new map, |
253 | returning a new file descriptor that refers to the map. | |
f0271688 | 254 | .IP |
842ee010 | 255 | .in +4n |
f0271688 | 256 | .EX |
842ee010 | 257 | int |
953d2673 MK |
258 | bpf_create_map(enum bpf_map_type map_type, |
259 | unsigned int key_size, | |
260 | unsigned int value_size, | |
261 | unsigned int max_entries) | |
cc7ac21d AS |
262 | { |
263 | union bpf_attr attr = { | |
953d2673 MK |
264 | .map_type = map_type, |
265 | .key_size = key_size, | |
266 | .value_size = value_size, | |
cc7ac21d AS |
267 | .max_entries = max_entries |
268 | }; | |
269 | ||
270 | return bpf(BPF_MAP_CREATE, &attr, sizeof(attr)); | |
271 | } | |
f0271688 | 272 | .EE |
842ee010 | 273 | .in |
f0271688 | 274 | .IP |
842ee010 MK |
275 | The new map has the type specified by |
276 | .IR map_type , | |
277 | and attributes as specified in | |
1148d934 MK |
278 | .IR key_size , |
279 | .IR value_size , | |
842ee010 | 280 | and |
1148d934 | 281 | .IR max_entries . |
46a4949b | 282 | On success, this operation returns a file descriptor. |
9a5215bf | 283 | On error, \-1 is returned and |
cc7ac21d | 284 | .I errno |
1148d934 MK |
285 | is set to |
286 | .BR EINVAL , | |
287 | .BR EPERM , | |
288 | or | |
289 | .BR ENOMEM . | |
f0271688 | 290 | .IP |
953d2673 | 291 | The |
cc7ac21d AS |
292 | .I key_size |
293 | and | |
294 | .I value_size | |
953d2673 MK |
295 | attributes will be used by the verifier during program loading |
296 | to check that the program is calling | |
1148d934 MK |
297 | .BR bpf_map_*_elem () |
298 | helper functions with a correctly initialized | |
cc7ac21d | 299 | .I key |
f774ddf1 | 300 | and to check that the program doesn't access the map element |
cc7ac21d AS |
301 | .I value |
302 | beyond the specified | |
16152abb | 303 | .IR value_size . |
842ee010 | 304 | For example, when a map is created with a |
266791fb | 305 | .I key_size |
f774ddf1 | 306 | of 8 and the eBPF program calls |
f0271688 | 307 | .IP |
1148d934 | 308 | .in +4n |
f0271688 | 309 | .EX |
cc7ac21d | 310 | bpf_map_lookup_elem(map_fd, fp - 4) |
f0271688 | 311 | .EE |
1148d934 | 312 | .in |
f0271688 | 313 | .IP |
cc7ac21d | 314 | the program will be rejected, |
1148d934 | 315 | since the in-kernel helper function |
f0271688 MK |
316 | .IP |
317 | .EX | |
842ee010 | 318 | bpf_map_lookup_elem(map_fd, void *key) |
f0271688 MK |
319 | .EE |
320 | .IP | |
46a4949b MK |
321 | expects to read 8 bytes from the location pointed to by |
322 | .IR key , | |
323 | but the | |
266791fb | 324 | .I fp\ -\ 4 |
46a4949b MK |
325 | (where |
326 | .I fp | |
327 | is the top of the stack) | |
1148d934 | 328 | starting address will cause out-of-bounds stack access. |
f0271688 | 329 | .IP |
842ee010 MK |
330 | Similarly, when a map is created with a |
331 | .I value_size | |
f774ddf1 | 332 | of 1 and the eBPF program contains |
f0271688 | 333 | .IP |
1148d934 | 334 | .in +4n |
f0271688 | 335 | .EX |
cc7ac21d | 336 | value = bpf_map_lookup_elem(...); |
1148d934 | 337 | *(u32 *) value = 1; |
f0271688 | 338 | .EE |
1148d934 | 339 | .in |
f0271688 | 340 | .IP |
cc7ac21d AS |
341 | the program will be rejected, since it accesses the |
342 | .I value | |
1148d934 MK |
343 | pointer beyond the specified 1 byte |
344 | .I value_size | |
345 | limit. | |
f0271688 | 346 | .IP |
f774ddf1 MK |
347 | Currently, the following values are supported for |
348 | .IR map_type : | |
f0271688 | 349 | .IP |
1148d934 | 350 | .in +4n |
f0271688 | 351 | .EX |
cc7ac21d | 352 | enum bpf_map_type { |
ce5db3fc | 353 | BPF_MAP_TYPE_UNSPEC, /* Reserve 0 as invalid map type */ |
842ee010 MK |
354 | BPF_MAP_TYPE_HASH, |
355 | BPF_MAP_TYPE_ARRAY, | |
5415d504 | 356 | BPF_MAP_TYPE_PROG_ARRAY, |
1b7adc7c NB |
357 | BPF_MAP_TYPE_PERF_EVENT_ARRAY, |
358 | BPF_MAP_TYPE_PERCPU_HASH, | |
359 | BPF_MAP_TYPE_PERCPU_ARRAY, | |
360 | BPF_MAP_TYPE_STACK_TRACE, | |
361 | BPF_MAP_TYPE_CGROUP_ARRAY, | |
362 | BPF_MAP_TYPE_LRU_HASH, | |
363 | BPF_MAP_TYPE_LRU_PERCPU_HASH, | |
364 | BPF_MAP_TYPE_LPM_TRIE, | |
365 | BPF_MAP_TYPE_ARRAY_OF_MAPS, | |
366 | BPF_MAP_TYPE_HASH_OF_MAPS, | |
367 | BPF_MAP_TYPE_DEVMAP, | |
368 | BPF_MAP_TYPE_SOCKMAP, | |
369 | BPF_MAP_TYPE_CPUMAP, | |
0e861952 PW |
370 | BPF_MAP_TYPE_XSKMAP, |
371 | BPF_MAP_TYPE_SOCKHASH, | |
372 | BPF_MAP_TYPE_CGROUP_STORAGE, | |
373 | BPF_MAP_TYPE_REUSEPORT_SOCKARRAY, | |
374 | BPF_MAP_TYPE_PERCPU_CGROUP_STORAGE, | |
375 | BPF_MAP_TYPE_QUEUE, | |
376 | BPF_MAP_TYPE_STACK, | |
377 | /* See /usr/include/linux/bpf.h for the full list. */ | |
cc7ac21d | 378 | }; |
f0271688 | 379 | .EE |
1148d934 | 380 | .in |
f0271688 | 381 | .IP |
cc7ac21d | 382 | .I map_type |
9a5215bf | 383 | selects one of the available map implementations in the kernel. |
f774ddf1 | 384 | .\" FIXME We need an explanation of why one might choose each of |
b913d165 | 385 | .\" these map implementations |
16152abb | 386 | For all map types, |
f774ddf1 MK |
387 | eBPF programs access maps with the same |
388 | .BR bpf_map_lookup_elem () | |
389 | and | |
1148d934 | 390 | .BR bpf_map_update_elem () |
cc7ac21d | 391 | helper functions. |
ce5db3fc | 392 | Further details of the various map types are given below. |
cc7ac21d AS |
393 | .TP |
394 | .B BPF_MAP_LOOKUP_ELEM | |
842ee010 MK |
395 | The |
396 | .B BPF_MAP_LOOKUP_ELEM | |
397 | command looks up an element with a given | |
398 | .I key | |
399 | in the map referred to by the file descriptor | |
400 | .IR fd . | |
f0271688 | 401 | .IP |
842ee010 | 402 | .in +4n |
f0271688 | 403 | .EX |
842ee010 | 404 | int |
953d2673 | 405 | bpf_lookup_elem(int fd, const void *key, void *value) |
cc7ac21d AS |
406 | { |
407 | union bpf_attr attr = { | |
408 | .map_fd = fd, | |
953d2673 MK |
409 | .key = ptr_to_u64(key), |
410 | .value = ptr_to_u64(value), | |
cc7ac21d AS |
411 | }; |
412 | ||
413 | return bpf(BPF_MAP_LOOKUP_ELEM, &attr, sizeof(attr)); | |
414 | } | |
f0271688 | 415 | .EE |
842ee010 | 416 | .in |
f0271688 | 417 | .IP |
842ee010 MK |
418 | If an element is found, |
419 | the operation returns zero and stores the element's value into | |
5415d504 MK |
420 | .IR value , |
421 | which must point to a buffer of | |
422 | .I value_size | |
423 | bytes. | |
f0271688 | 424 | .IP |
842ee010 | 425 | If no element is found, the operation returns \-1 and sets |
cc7ac21d | 426 | .I errno |
1148d934 MK |
427 | to |
428 | .BR ENOENT . | |
cc7ac21d AS |
429 | .TP |
430 | .B BPF_MAP_UPDATE_ELEM | |
842ee010 MK |
431 | The |
432 | .B BPF_MAP_UPDATE_ELEM | |
433 | command | |
434 | creates or updates an element with a given | |
435 | .I key/value | |
436 | in the map referred to by the file descriptor | |
437 | .IR fd . | |
f0271688 | 438 | .IP |
842ee010 | 439 | .in +4n |
f0271688 | 440 | .EX |
842ee010 | 441 | int |
953d2673 MK |
442 | bpf_update_elem(int fd, const void *key, const void *value, |
443 | uint64_t flags) | |
cc7ac21d AS |
444 | { |
445 | union bpf_attr attr = { | |
446 | .map_fd = fd, | |
953d2673 MK |
447 | .key = ptr_to_u64(key), |
448 | .value = ptr_to_u64(value), | |
449 | .flags = flags, | |
cc7ac21d AS |
450 | }; |
451 | ||
452 | return bpf(BPF_MAP_UPDATE_ELEM, &attr, sizeof(attr)); | |
453 | } | |
f0271688 | 454 | .EE |
842ee010 | 455 | .in |
f0271688 | 456 | .IP |
842ee010 | 457 | The |
cc7ac21d | 458 | .I flags |
842ee010 MK |
459 | argument should be specified as one of the following: |
460 | .RS | |
461 | .TP | |
462 | .B BPF_ANY | |
463 | Create a new element or update an existing element. | |
464 | .TP | |
465 | .B BPF_NOEXIST | |
466 | Create a new element only if it did not exist. | |
467 | .TP | |
468 | .B BPF_EXIST | |
469 | Update an existing element. | |
470 | .RE | |
471 | .IP | |
472 | On success, the operation returns zero. | |
cc7ac21d AS |
473 | On error, \-1 is returned and |
474 | .I errno | |
1148d934 MK |
475 | is set to |
476 | .BR EINVAL , | |
477 | .BR EPERM , | |
478 | .BR ENOMEM , | |
479 | or | |
480 | .BR E2BIG . | |
cc7ac21d | 481 | .B E2BIG |
842ee010 | 482 | indicates that the number of elements in the map reached the |
cc7ac21d AS |
483 | .I max_entries |
484 | limit specified at map creation time. | |
485 | .B EEXIST | |
842ee010 MK |
486 | will be returned if |
487 | .I flags | |
488 | specifies | |
489 | .B BPF_NOEXIST | |
490 | and the element with | |
1148d934 MK |
491 | .I key |
492 | already exists in the map. | |
cc7ac21d | 493 | .B ENOENT |
953d2673 | 494 | will be returned if |
842ee010 MK |
495 | .I flags |
496 | specifies | |
497 | .B BPF_EXIST | |
498 | and the element with | |
1148d934 MK |
499 | .I key |
500 | doesn't exist in the map. | |
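The effect of the three flags can be checked from user space. The sketch below is illustrative only: check_update_flags() and sys_bpf() are hypothetical names, and the program assumes that the caller is permitted to create a small hash map. If map creation fails (for example, with EPERM), the function reports that instead of testing the flags.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/bpf.h>

static int
sys_bpf(int cmd, union bpf_attr *attr, unsigned int size)
{
    return syscall(__NR_bpf, cmd, attr, size);
}

/* Returns 0 if the flags behaved as described above, 1 if they did
   not, and -1 if the map could not be created. */
int
check_update_flags(void)
{
    union bpf_attr attr;
    uint32_t key = 1;
    uint64_t value = 42;
    int fd, ok = 1;

    memset(&attr, 0, sizeof(attr));
    attr.map_type = BPF_MAP_TYPE_HASH;
    attr.key_size = sizeof(key);
    attr.value_size = sizeof(value);
    attr.max_entries = 4;
    fd = sys_bpf(BPF_MAP_CREATE, &attr, sizeof(attr));
    if (fd == -1)
        return -1;

    memset(&attr, 0, sizeof(attr));
    attr.map_fd = fd;
    attr.key = (uint64_t) (uintptr_t) &key;
    attr.value = (uint64_t) (uintptr_t) &value;

    /* BPF_EXIST on a missing element: expect ENOENT. */
    attr.flags = BPF_EXIST;
    if (sys_bpf(BPF_MAP_UPDATE_ELEM, &attr, sizeof(attr)) != -1 ||
        errno != ENOENT)
        ok = 0;

    /* BPF_NOEXIST on a missing element: expect success. */
    attr.flags = BPF_NOEXIST;
    if (sys_bpf(BPF_MAP_UPDATE_ELEM, &attr, sizeof(attr)) != 0)
        ok = 0;

    /* BPF_NOEXIST on an existing element: expect EEXIST. */
    if (sys_bpf(BPF_MAP_UPDATE_ELEM, &attr, sizeof(attr)) != -1 ||
        errno != EEXIST)
        ok = 0;

    close(fd);
    return ok ? 0 : 1;
}
```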
cc7ac21d AS |
501 | .TP |
502 | .B BPF_MAP_DELETE_ELEM | |
842ee010 MK |
503 | The |
504 | .B BPF_MAP_DELETE_ELEM | |
505 | command | |
96ed2f3f | 506 | deletes the element whose key is |
842ee010 MK |
507 | .I key |
508 | from the map referred to by the file descriptor | |
509 | .IR fd . | |
f0271688 | 510 | .IP |
842ee010 | 511 | .in +4n |
f0271688 | 512 | .EX |
842ee010 | 513 | int |
953d2673 | 514 | bpf_delete_elem(int fd, const void *key) |
cc7ac21d AS |
515 | { |
516 | union bpf_attr attr = { | |
517 | .map_fd = fd, | |
953d2673 | 518 | .key = ptr_to_u64(key), |
cc7ac21d AS |
519 | }; |
520 | ||
521 | return bpf(BPF_MAP_DELETE_ELEM, &attr, sizeof(attr)); | |
522 | } | |
f0271688 | 523 | .EE |
842ee010 | 524 | .in |
f0271688 | 525 | .IP |
842ee010 MK |
526 | On success, zero is returned. |
527 | If the element is not found, \-1 is returned and | |
cc7ac21d | 528 | .I errno |
842ee010 | 529 | is set to |
1148d934 | 530 | .BR ENOENT . |
cc7ac21d AS |
531 | .TP |
532 | .B BPF_MAP_GET_NEXT_KEY | |
842ee010 MK |
533 | The |
534 | .B BPF_MAP_GET_NEXT_KEY | |
535 | command looks up an element by | |
536 | .I key | |
537 | in the map referred to by the file descriptor | |
266791fb | 538 | .I fd |
842ee010 MK |
539 | and sets the |
540 | .I next_key | |
541 | pointer to the key of the next element. | |
f0271688 | 542 | .IP |
842ee010 | 543 | .in +4n |
f0271688 | 544 | .EX |
842ee010 | 545 | int |
953d2673 | 546 | bpf_get_next_key(int fd, const void *key, void *next_key) |
cc7ac21d AS |
547 | { |
548 | union bpf_attr attr = { | |
953d2673 MK |
549 | .map_fd = fd, |
550 | .key = ptr_to_u64(key), | |
cc7ac21d AS |
551 | .next_key = ptr_to_u64(next_key), |
552 | }; | |
553 | ||
554 | return bpf(BPF_MAP_GET_NEXT_KEY, &attr, sizeof(attr)); | |
555 | } | |
f0271688 | 556 | .EE |
842ee010 | 557 | .in |
f0271688 | 558 | .IP |
5415d504 MK |
559 | If |
560 | .I key | |
561 | is found, the operation returns zero and sets the | |
562 | .I next_key | |
563 | pointer to the key of the next element. | |
cc7ac21d AS |
564 | If |
565 | .I key | |
842ee010 | 566 | is not found, the operation returns zero and sets the |
cc7ac21d AS |
567 | .I next_key |
568 | pointer to the key of the first element. | |
569 | If | |
570 | .I key | |
842ee010 | 571 | is the last element, \-1 is returned and |
cc7ac21d | 572 | .I errno |
842ee010 | 573 | is set to |
1148d934 | 574 | .BR ENOENT . |
9a5215bf | 575 | Other possible |
cc7ac21d | 576 | .I errno |
1148d934 MK |
577 | values are |
578 | .BR ENOMEM , | |
579 | .BR EFAULT , | |
580 | .BR EPERM , | |
581 | and | |
582 | .BR EINVAL . | |
cc7ac21d AS |
583 | This method can be used to iterate over all elements in the map. |
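The iteration described above can be sketched as follows. The names count_map_entries() and sys_bpf() are hypothetical. The sketch also assumes a kernel that accepts a null key as a request for the first key (believed to be the case since Linux 4.12); on older kernels, a key known to be absent from the map would have to be supplied instead.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/bpf.h>

static int
sys_bpf(int cmd, union bpf_attr *attr, unsigned int size)
{
    return syscall(__NR_bpf, cmd, attr, size);
}

/* Count the elements of a map by walking its keys.
   Returns the count, or -1 on error (errno is preserved). */
long
count_map_entries(int map_fd, size_t key_size)
{
    union bpf_attr attr;
    void *key = malloc(key_size);
    void *next_key = malloc(key_size);
    long count = -1;
    int saved_errno = ENOMEM;

    if (key != NULL && next_key != NULL) {
        memset(&attr, 0, sizeof(attr));
        attr.map_fd = map_fd;
        attr.key = 0;               /* null key: ask for the first key */
        attr.next_key = (uint64_t) (uintptr_t) next_key;

        count = 0;
        while (sys_bpf(BPF_MAP_GET_NEXT_KEY, &attr, sizeof(attr)) == 0) {
            count++;
            /* The key just returned becomes the starting point. */
            memcpy(key, next_key, key_size);
            attr.key = (uint64_t) (uintptr_t) key;
        }

        /* ENOENT marks the end of the map; anything else is an error. */
        saved_errno = errno;
        if (saved_errno != ENOENT)
            count = -1;
    }

    free(key);
    free(next_key);
    errno = saved_errno;
    return count;
}
```

If other programs insert or delete elements while the walk is in progress, the count is only approximate.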
584 | .TP | |
585 | .B close(map_fd) | |
842ee010 | 586 | Delete the map referred to by the file descriptor |
1148d934 | 587 | .IR map_fd . |
842ee010 | 588 | When the user-space program that created a map exits, all maps will |
ce5db3fc MK |
589 | be deleted automatically (but see NOTES). |
590 | .\" | |
591 | .SS eBPF map types | |
592 | The following map types are supported: | |
593 | .TP | |
594 | .B BPF_MAP_TYPE_HASH | |
595 | .\" commit 0f8e4bd8a1fc8c4185f1630061d0a1f2d197a475 | |
ce5db3fc MK |
596 | Hash-table maps have the following characteristics: |
597 | .RS | |
598 | .IP * 3 | |
599 | Maps are created and destroyed by user-space programs. | |
600 | Both user-space and eBPF programs | |
46a4949b | 601 | can perform lookup, update, and delete operations. |
ce5db3fc MK |
602 | .IP * |
603 | The kernel takes care of allocating and freeing key/value pairs. | |
604 | .IP * | |
605 | The | |
606 | .BR map_update_elem () | |
998f951b | 607 | helper will fail to insert a new element when the |
ce5db3fc MK |
608 | .I max_entries |
609 | limit is reached. | |
610 | (This ensures that eBPF programs cannot exhaust memory.) | |
611 | .IP * | |
612 | .BR map_update_elem () | |
613 | replaces existing elements atomically. | |
614 | .RE | |
615 | .IP | |
953d2673 | 616 | Hash-table maps are |
ce5db3fc MK |
617 | optimized for speed of lookup. |
618 | .TP | |
619 | .B BPF_MAP_TYPE_ARRAY | |
620 | .\" commit 28fbcfa08d8ed7c5a50d41a0433aad222835e8e3 | |
ce5db3fc MK |
621 | Array maps have the following characteristics: |
622 | .RS | |
623 | .IP * 3 | |
624 | Optimized for fastest possible lookup. | |
46a4949b | 625 | In the future the verifier/JIT compiler |
ce5db3fc MK |
626 | may recognize lookup() operations that employ a constant key |
627 | and optimize them into a constant pointer. | |
628 | It is possible to optimize a non-constant | |
629 | key into direct pointer arithmetic as well, since pointers and | |
630 | .I value_size | |
631 | are constant for the life of the eBPF program. | |
632 | In other words, | |
633 | .BR array_map_lookup_elem () | |
634 | may be 'inlined' by the verifier/JIT compiler | |
635 | while preserving concurrent access to this map from user space. | |
636 | .IP * | |
637 | All array elements are preallocated and zero-initialized at init time. | |
638 | .IP * | |
639 | The key is an array index, and must be exactly four bytes. | |
640 | .IP * | |
641 | .BR map_delete_elem () | |
642 | fails with the error | |
643 | .BR EINVAL , | |
644 | since elements cannot be deleted. | |
645 | .IP * | |
646 | .BR map_update_elem () | |
953d2673 MK |
647 | replaces elements in a |
648 | .B nonatomic | |
649 | fashion; | |
cd579c3f MK |
650 | for atomic updates, a hash-table map should be used instead. |
651 | There is, however, one special case that can also be used with arrays: | |
652 | the atomic built-in | |
266791fb | 653 | .B __sync_fetch_and_add() |
cd579c3f MK |
654 | can be used on 32-bit and 64-bit atomic counters. |
655 | For example, it can be | |
9a818ddd DB |
656 | applied on the whole value itself if it represents a single counter, |
657 | or in case of a structure containing multiple counters, it could be | |
cd579c3f MK |
658 | used on individual counters. |
659 | This is quite often useful for aggregation and accounting of events. | |
ce5db3fc MK |
660 | .RE |
661 | .IP | |
662 | Among the uses for array maps are the following: | |
663 | .RS | |
664 | .IP * 3 | |
665 | As "global" eBPF variables: an array of 1 element whose key is (index) 0 | |
666 | and where the value is a collection of 'global' variables which | |
667 | eBPF programs can use to keep state between events. | |
668 | .IP * | |
669 | Aggregation of tracing events into a fixed set of buckets. | |
9a818ddd DB |
670 | .IP * |
671 | Accounting of networking events, for example, number of packets and packet | |
672 | sizes. | |
ce5db3fc MK |
673 | .RE |
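The accounting use case, with a structure containing several counters, might look like the sketch below. The structure flow_stats and both functions are hypothetical names. In an eBPF program, the pointer passed to account_packet() would be the value pointer returned by a map lookup; here the same built-in is exercised on an ordinary variable so that the example can be compiled and run in user space.

```c
#include <stdint.h>

/* Hypothetical map value holding two independent counters. */
struct flow_stats {
    uint64_t packets;
    uint64_t bytes;
};

/* Atomically account for one packet of 'len' bytes.  Each counter is
   updated with its own atomic add, so concurrent callers never lose
   an increment, although the two fields are not updated as a unit. */
void
account_packet(struct flow_stats *stats, uint32_t len)
{
    __sync_fetch_and_add(&stats->packets, 1);
    __sync_fetch_and_add(&stats->bytes, len);
}

/* Usage: account for packets of 100 and 50 bytes and return the
   accumulated byte count. */
uint64_t
account_two_packets(void)
{
    struct flow_stats stats = { 0, 0 };

    account_packet(&stats, 100);
    account_packet(&stats, 50);
    return stats.bytes;
}
```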
674 | .TP | |
675 | .BR BPF_MAP_TYPE_PROG_ARRAY " (since Linux 4.2)" | |
cd579c3f MK |
676 | A program array map is a special kind of array map whose map values |
677 | contain only file descriptors referring to other eBPF programs. | |
678 | Thus, both the | |
679 | .I key_size | |
680 | and | |
681 | .I value_size | |
682 | must be exactly four bytes. | |
9a818ddd | 683 | This map is used in conjunction with the |
cd579c3f | 684 | .BR bpf_tail_call () |
9a818ddd | 685 | helper. |
f0271688 | 686 | .IP |
9a818ddd DB |
687 | This means that an eBPF program with a program array map attached to it |
688 | can call from kernel side into | |
f0271688 | 689 | .IP |
9a818ddd | 690 | .in +4n |
f0271688 | 691 | .EX |
05f10213 MK |
692 | void bpf_tail_call(void *context, void *prog_map, |
693 | unsigned int index); | |
f0271688 | 694 | .EE |
9a818ddd | 695 | .in |
f0271688 | 696 | .IP |
9a818ddd | 697 | and therefore replace its own program flow with the one from the program |
cd579c3f MK |
698 | at the given program array slot, if present. |
699 | This can be regarded as a kind of jump table to a different eBPF program. | |
700 | The invoked program will then reuse the same stack. | |
701 | When a jump into the new program has been performed, | |
702 | it won't return to the old program anymore. | |
f0271688 | 703 | .IP |
aabe0499 MK |
704 | If no eBPF program is found at the given index of the program array |
705 | (because the map slot doesn't contain a valid program file descriptor, | |
706 | the specified lookup index/key is out of bounds, | |
707 | or the limit of 32 | |
708 | .\" MAX_TAIL_CALL_CNT | |
709 | nested calls has been exceeded), | |
9a818ddd DB |
710 | execution continues with the current eBPF program. |
711 | This can be used as a fall-through for default cases. | |
f0271688 | 712 | .IP |
9a818ddd | 713 | A program array map is useful, for example, in tracing or networking, to |
cd579c3f MK |
714 | handle individual system calls or protocols in their own subprograms and |
715 | use their identifiers as an individual map index. | |
716 | This approach may result in performance benefits, | |
717 | and also makes it possible to overcome the maximum | |
718 | instruction limit of a single eBPF program. | |
719 | In dynamic environments, | |
720 | a user-space daemon might atomically replace individual subprograms | |
721 | at run-time with newer versions to alter overall program behavior, | |
722 | for instance, if global policies change. | |
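From user space, replacing a subprogram amounts to storing a new program file descriptor in the corresponding slot with BPF_MAP_UPDATE_ELEM. The following sketch shows one way a daemon might do this; prog_array_set() and sys_bpf() are hypothetical names, and both the key and the value are four bytes wide, as required for this map type.

```c
#define _GNU_SOURCE
#include <stdint.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/bpf.h>

static int
sys_bpf(int cmd, union bpf_attr *attr, unsigned int size)
{
    return syscall(__NR_bpf, cmd, attr, size);
}

/* Store the program referred to by prog_fd at slot 'index' of the
   program array referred to by map_fd.
   Returns 0 on success, or -1 with errno set on error. */
int
prog_array_set(int map_fd, uint32_t index, int prog_fd)
{
    union bpf_attr attr;
    uint32_t value = (uint32_t) prog_fd;

    memset(&attr, 0, sizeof(attr));
    attr.map_fd = map_fd;
    attr.key = (uint64_t) (uintptr_t) &index;
    attr.value = (uint64_t) (uintptr_t) &value;
    attr.flags = BPF_ANY;       /* create the slot entry or replace it */

    return sys_bpf(BPF_MAP_UPDATE_ELEM, &attr, sizeof(attr));
}
```

A program that later tail-calls through this slot runs whichever program was stored most recently.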
ce5db3fc MK |
723 | .\" |
724 | .SS eBPF programs | |
842ee010 MK |
725 | The |
726 | .B BPF_PROG_LOAD | |
54513c00 | 727 | command is used to load an eBPF program into the kernel. |
9ab03361 | 728 | The return value for this command is a new file descriptor associated |
ce5db3fc | 729 | with this eBPF program. |
f0271688 | 730 | .PP |
842ee010 | 731 | .in +4n |
f0271688 | 732 | .EX |
cc7ac21d AS |
733 | char bpf_log_buf[LOG_BUF_SIZE]; |
734 | ||
842ee010 | 735 | int |
953d2673 | 736 | bpf_prog_load(enum bpf_prog_type type, |
842ee010 MK |
737 | const struct bpf_insn *insns, int insn_cnt, |
738 | const char *license) | |
cc7ac21d AS |
739 | { |
740 | union bpf_attr attr = { | |
953d2673 MK |
741 | .prog_type = type, |
742 | .insns = ptr_to_u64(insns), | |
743 | .insn_cnt = insn_cnt, | |
744 | .license = ptr_to_u64(license), | |
745 | .log_buf = ptr_to_u64(bpf_log_buf), | |
746 | .log_size = LOG_BUF_SIZE, | |
cc7ac21d AS |
747 | .log_level = 1, |
748 | }; | |
749 | ||
750 | return bpf(BPF_PROG_LOAD, &attr, sizeof(attr)); | |
751 | } | |
f0271688 | 752 | .EE |
842ee010 | 753 | .in |
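As an illustration of what such a load might look like in practice, the sketch below builds the smallest useful program (set the return register to 0, then exit) and passes it to BPF_PROG_LOAD as a socket filter. The names trivial_prog, verifier_log, and load_trivial_prog() are hypothetical, and the 4096-byte log buffer is an arbitrary choice. Whether the load succeeds depends on the caller's privileges and the kernel configuration.

```c
#define _GNU_SOURCE
#include <stdint.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/bpf.h>

/* Two instructions: "r0 = 0" followed by "exit". */
const struct bpf_insn trivial_prog[] = {
    { .code = BPF_ALU64 | BPF_MOV | BPF_K,
      .dst_reg = BPF_REG_0, .imm = 0 },
    { .code = BPF_JMP | BPF_EXIT },
};

/* Receives the verifier's messages when log_level is nonzero. */
char verifier_log[4096];

/* Returns a new program file descriptor, or -1 with errno set. */
int
load_trivial_prog(void)
{
    union bpf_attr attr;

    memset(&attr, 0, sizeof(attr));
    attr.prog_type = BPF_PROG_TYPE_SOCKET_FILTER;
    attr.insns = (uint64_t) (uintptr_t) trivial_prog;
    attr.insn_cnt = sizeof(trivial_prog) / sizeof(trivial_prog[0]);
    attr.license = (uint64_t) (uintptr_t) "GPL";
    attr.log_buf = (uint64_t) (uintptr_t) verifier_log;
    attr.log_size = sizeof(verifier_log);
    attr.log_level = 1;

    return syscall(__NR_bpf, BPF_PROG_LOAD, &attr, sizeof(attr));
}
```

On failure, the contents of verifier_log are usually the first place to look, bearing in mind that the log format is not stable.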
f0271688 | 754 | .PP |
1148d934 | 755 | .I prog_type |
cc7ac21d | 756 | is one of the available program types: |
f0271688 | 757 | .IP |
1148d934 | 758 | .in +4n |
f0271688 | 759 | .EX |
cc7ac21d | 760 | enum bpf_prog_type { |
f774ddf1 MK |
761 | BPF_PROG_TYPE_UNSPEC, /* Reserve 0 as invalid |
762 | program type */ | |
ce5db3fc MK |
763 | BPF_PROG_TYPE_SOCKET_FILTER, |
764 | BPF_PROG_TYPE_KPROBE, | |
765 | BPF_PROG_TYPE_SCHED_CLS, | |
766 | BPF_PROG_TYPE_SCHED_ACT, | |
0e861952 PW |
767 | BPF_PROG_TYPE_TRACEPOINT, |
768 | BPF_PROG_TYPE_XDP, | |
769 | BPF_PROG_TYPE_PERF_EVENT, | |
770 | BPF_PROG_TYPE_CGROUP_SKB, | |
771 | BPF_PROG_TYPE_CGROUP_SOCK, | |
772 | BPF_PROG_TYPE_LWT_IN, | |
773 | BPF_PROG_TYPE_LWT_OUT, | |
774 | BPF_PROG_TYPE_LWT_XMIT, | |
775 | BPF_PROG_TYPE_SOCK_OPS, | |
776 | BPF_PROG_TYPE_SK_SKB, | |
777 | BPF_PROG_TYPE_CGROUP_DEVICE, | |
778 | BPF_PROG_TYPE_SK_MSG, | |
779 | BPF_PROG_TYPE_RAW_TRACEPOINT, | |
780 | BPF_PROG_TYPE_CGROUP_SOCK_ADDR, | |
781 | BPF_PROG_TYPE_LWT_SEG6LOCAL, | |
782 | BPF_PROG_TYPE_LIRC_MODE2, | |
783 | BPF_PROG_TYPE_SK_REUSEPORT, | |
784 | BPF_PROG_TYPE_FLOW_DISSECTOR, | |
785 | /* See /usr/include/linux/bpf.h for the full list. */ | |
cc7ac21d | 786 | }; |
f0271688 | 787 | .EE |
1148d934 | 788 | .in |
f0271688 | 789 | .PP |
ce5db3fc | 790 | For further details of eBPF program types, see below. |
f0271688 | 791 | .PP |
ce5db3fc | 792 | The remaining fields of |
842ee010 MK |
793 | .I bpf_attr |
794 | are set as follows: | |
842ee010 | 795 | .IP * 3 |
1148d934 | 796 | .I insns |
842ee010 | 797 | is an array of |
1148d934 MK |
798 | .I "struct bpf_insn" |
799 | instructions. | |
842ee010 | 800 | .IP * |
1148d934 | 801 | .I insn_cnt |
842ee010 MK |
802 | is the number of instructions in the program referred to by |
803 | .IR insns . | |
804 | .IP * | |
1148d934 | 805 | .I license |
842ee010 | 806 | is a license string, which must be GPL compatible to call helper functions |
1148d934 MK |
807 | marked |
808 | .IR gpl_only . | |
fcd1bee3 | 809 | (The licensing rules are the same as for kernel modules, |
9a818ddd | 810 | so dual licenses, such as "Dual BSD/GPL", may also be used.) |
842ee010 | 811 | .IP * |
1148d934 | 812 | .I log_buf |
842ee010 MK |
813 | is a pointer to a caller-allocated buffer in which the in-kernel |
814 | verifier can store the verification log. | |
9a5215bf | 815 | This log is a multi-line string that can be checked by |
cc7ac21d | 816 | the program author in order to understand how the verifier came to |
953d2673 | 817 | the conclusion that the eBPF program is unsafe. |
cc7ac21d | 818 | The format of the output can change at any time as the verifier evolves. |
842ee010 | 819 | .IP * |
1148d934 | 820 | .I log_size |
842ee010 | 821 | is the size of the buffer pointed to by |
029b613f | 822 | .IR log_buf . |
9a5215bf | 823 | If the size of the buffer is not large enough to store all |
cc7ac21d AS |
824 | verifier messages, \-1 is returned and |
825 | .I errno | |
1148d934 MK |
826 | is set to |
827 | .BR ENOSPC . | |
842ee010 | 828 | .IP * |
1148d934 | 829 | .I log_level |
9a5215bf | 830 | is the verbosity level of the verifier. |
fcd1bee3 MK |
831 | A value of zero means that the verifier will not provide a log; |
832 | in this case, | |
833 | .I log_buf | |
834 | must be a NULL pointer, and | |
835 | .I log_size | |
836 | must be zero. | |
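.PP
The rule tying
.IR log_level ,
.IR log_buf ,
and
.I log_size
together can be captured in a small helper.
This is a sketch only:
.IR set_verifier_log ()
is a hypothetical name, and the helper assumes the field layout of
.I union bpf_attr
from
.IR <linux/bpf.h> .

```c
#include <stddef.h>
#include <stdint.h>
#include <linux/bpf.h>

/* Fill in the verifier-log fields of attr (hypothetical helper).
   A level of 0 means "no log", in which case the buffer pointer
   must be NULL and the size must be zero, so buf and size are
   ignored and the fields are cleared. */
static void
set_verifier_log(union bpf_attr *attr, char *buf, unsigned int size,
                 unsigned int level)
{
    if (level == 0 || buf == NULL || size == 0) {
        attr->log_level = 0;
        attr->log_buf = 0;
        attr->log_size = 0;
        return;
    }

    buf[0] = 0;         /* an untouched log then reads as empty */
    attr->log_level = level;
    attr->log_buf = (uint64_t) (uintptr_t) buf;
    attr->log_size = size;
}
```

.PP
A caller that sees
.B ENOSPC
could allocate a larger buffer, call the helper again,
and retry the load.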
f0271688 | 837 | .PP |
ce5db3fc MK |
838 | Applying |
839 | .BR close (2) | |
840 | to the file descriptor returned by | |
841 | .B BPF_PROG_LOAD | |
842 | will unload the eBPF program (but see NOTES). | |
f0271688 | 843 | .PP |
54513c00 MK |
844 | Maps are accessible from eBPF programs and are used to exchange data between |
845 | eBPF programs and between eBPF programs and user-space programs. | |
5415d504 MK |
846 | For example, |
847 | eBPF programs can process various events (such as kprobes and packets) and | |
848 | store their data into a map, | |
849 | and user-space programs can then fetch data from the map. | |
850 | Conversely, user-space programs can use a map as a configuration mechanism, | |
851 | populating the map with values checked by the eBPF program, | |
852 | which then modifies its behavior on the fly according to those values. | |
953d2673 MK |
853 | .\" |
854 | .\" | |
ce5db3fc | 855 | .SS eBPF program types |
953d2673 MK |
856 | The eBPF program type |
857 | .RI ( prog_type ) | |
fcd1bee3 | 858 | determines the subset of kernel helper functions that the program |
953d2673 | 859 | may call. |
fcd1bee3 | 860 | The program type also determines the program input (context)\(emthe |
953d2673 MK |
861 | format of |
862 | .I "struct bpf_context" | |
ce5db3fc | 863 | (which is the data blob passed into the eBPF program as the first argument). |
0fc33df7 | 864 | .\" |
30ea59e7 | 865 | .\" FIXME |
24493e9b | 866 | .\" Somewhere in this page we need a general introduction to the |
0fc33df7 MK |
867 | .\" bpf_context. For example, how does a BPF program access the |
868 | .\" context? | |
f0271688 | 869 | .PP |
953d2673 MK |
870 | For example, a tracing program does not have the exact same |
871 | subset of helper functions as a socket filter program | |
872 | (though they may have some helpers in common). | |
873 | Similarly, | |
874 | the input (context) for a tracing program is a set of register values, | |
875 | while for a socket filter it is a network packet. | |
f0271688 | 876 | .PP |
ce5db3fc MK |
877 | The set of functions available to eBPF programs of a given type may increase |
878 | in the future. | |
f0271688 | 879 | .PP |
ce5db3fc MK |
880 | The following program types are supported: |
881 | .TP | |
882 | .BR BPF_PROG_TYPE_SOCKET_FILTER " (since Linux 3.19)" | |
883 | Currently, the set of functions for | |
884 | .B BPF_PROG_TYPE_SOCKET_FILTER | |
885 | is: | |
f0271688 | 886 | .IP |
1148d934 | 887 | .in +4n |
f0271688 | 888 | .EX |
ce5db3fc MK |
889 | bpf_map_lookup_elem(map_fd, void *key) |
890 | /* look up key in a map_fd */ | |
891 | bpf_map_update_elem(map_fd, void *key, void *value) | |
892 | /* update key/value */ | |
893 | bpf_map_delete_elem(map_fd, void *key) | |
894 | /* delete key in a map_fd */ | |
f0271688 | 895 | .EE |
1148d934 | 896 | .in |
f0271688 | 897 | .IP |
ce5db3fc MK |
898 | The |
899 | .I bpf_context | |
900 | argument is a pointer to a | |
b87d8ba6 | 901 | .IR "struct __sk_buff" . |
953d2673 | 902 | .\" FIXME: We need some text here to explain how the program |
b913d165 MK |
903 | .\" accesses __sk_buff. |
904 | .\" See 'struct __sk_buff' and commit 9bac3d6d548e5 | |
905 | .\" | |
b87d8ba6 | 906 | .\" Alexei commented: |
b913d165 MK |
907 | .\" Actually now in case of SOCKET_FILTER, SCHED_CLS, SCHED_ACT |
908 | .\" the program can now access skb fields. | |
ce5db3fc MK |
909 | .\" |
910 | .TP | |
266791fb | 911 | .BR BPF_PROG_TYPE_KPROBE " (since Linux 4.1)" |
ce5db3fc MK |
912 | .\" commit 2541517c32be2531e0da59dfd7efc1ce844644f5 |
913 | [To be documented] | |
914 | .\" FIXME Document this program type | |
915 | .\" Describe allowed helper functions for this program type | |
916 | .\" Describe bpf_context for this program type | |
b913d165 | 917 | .\" |
ce5db3fc MK |
918 | .\" FIXME We need text here to describe 'kern_version' |
919 | .TP | |
266791fb | 920 | .BR BPF_PROG_TYPE_SCHED_CLS " (since Linux 4.1)" |
ce5db3fc MK |
921 | .\" commit 96be4325f443dbbfeb37d2a157675ac0736531a1 |
922 | .\" commit e2e9b6541dd4b31848079da80fe2253daaafb549 | |
923 | [To be documented] | |
924 | .\" FIXME Document this program type | |
925 | .\" Describe allowed helper functions for this program type | |
926 | .\" Describe bpf_context for this program type | |
927 | .TP | |
266791fb | 928 | .BR BPF_PROG_TYPE_SCHED_ACT " (since Linux 4.1)" |
ce5db3fc MK |
929 | .\" commit 94caee8c312d96522bcdae88791aaa9ebcd5f22c |
930 | .\" commit a8cb5f556b567974d75ea29c15181c445c541b1f | |
931 | [To be documented] | |
932 | .\" FIXME Document this program type | |
933 | .\" Describe allowed helper functions for this program type | |
934 | .\" Describe bpf_context for this program type | |
935 | .SS Events | |
936 | Once a program is loaded, it can be attached to an event. | |
937 | Various kernel subsystems have different ways to do so. | |
f0271688 | 938 | .PP |
ce5db3fc MK |
939 | Since Linux 3.19, |
940 | .\" commit 89aa075832b0da4402acebd698d0411dcc82d03e | |
941 | the following call will attach the program | |
cc7ac21d | 942 | .I prog_fd |
842ee010 MK |
943 | to the socket |
944 | .IR sockfd , | |
ce5db3fc MK |
945 | which was created by an earlier call to |
946 | .BR socket (2): | |
f0271688 | 947 | .PP |
1148d934 | 948 | .in +4n |
f0271688 | 949 | .EX |
ce5db3fc MK |
950 | setsockopt(sockfd, SOL_SOCKET, SO_ATTACH_BPF, |
951 | &prog_fd, sizeof(prog_fd)); | |
f0271688 | 952 | .EE |
1148d934 | 953 | .in |
f0271688 | 954 | .PP |
ce5db3fc MK |
955 | Since Linux 4.1, |
956 | .\" commit 2541517c32be2531e0da59dfd7efc1ce844644f5 | |
957 | the following call may be used to attach | |
958 | the eBPF program referred to by the file descriptor | |
cc7ac21d | 959 | .I prog_fd |
ce5db3fc MK |
960 | to a perf event file descriptor, |
961 | .IR event_fd , | |
962 | that was created by a previous call to | |
963 | .BR perf_event_open (2): | |
efeece04 | 964 | .PP |
ce5db3fc | 965 | .in +4n |
b76974c1 | 966 | .EX |
ce5db3fc | 967 | ioctl(event_fd, PERF_EVENT_IOC_SET_BPF, prog_fd); |
b76974c1 | 968 | .EE |
ce5db3fc MK |
969 | .in |
970 | .\" | |
ce5db3fc | 971 | .\" |
cc7ac21d | 972 | .SH EXAMPLES |
f0271688 | 973 | .EX |
cc7ac21d AS |
974 | /* bpf+sockets example: |
975 | * 1. create array map of 256 elements | |
976 | * 2. load program that counts number of packets received | |
977 | * r0 = skb->data[ETH_HLEN + offsetof(struct iphdr, protocol)] | |
978 | * map[r0]++ | |
979 | * 3. attach prog_fd to raw socket via setsockopt() | |
980 | * 4. print number of received TCP/UDP packets every second | |
981 | */ | |
842ee010 MK |
982 | int |
983 | main(int argc, char **argv) | |
cc7ac21d AS |
984 | { |
985 | int sock, map_fd, prog_fd, key; | |
986 | long long value = 0, tcp_cnt, udp_cnt; | |
987 | ||
1148d934 MK |
988 | map_fd = bpf_create_map(BPF_MAP_TYPE_ARRAY, sizeof(key), |
989 | sizeof(value), 256); | |
cc7ac21d | 990 | if (map_fd < 0) { |
d1a71985 | 991 | printf("failed to create map '%s'\en", strerror(errno)); |
cc7ac21d AS |
992 | /* likely not run as root */ |
993 | return 1; | |
994 | } | |
995 | ||
996 | struct bpf_insn prog[] = { | |
1148d934 MK |
997 | BPF_MOV64_REG(BPF_REG_6, BPF_REG_1), /* r6 = r1 */ |
998 | BPF_LD_ABS(BPF_B, ETH_HLEN + offsetof(struct iphdr, protocol)), | |
999 | /* r0 = ip->proto */ | |
1000 | BPF_STX_MEM(BPF_W, BPF_REG_10, BPF_REG_0, -4), | |
1001 | /* *(u32 *)(fp - 4) = r0 */ | |
1002 | BPF_MOV64_REG(BPF_REG_2, BPF_REG_10), /* r2 = fp */ | |
1003 | BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -4), /* r2 = r2 - 4 */ | |
1004 | BPF_LD_MAP_FD(BPF_REG_1, map_fd), /* r1 = map_fd */ | |
1005 | BPF_CALL_FUNC(BPF_FUNC_map_lookup_elem), | |
1006 | /* r0 = map_lookup(r1, r2) */ | |
1007 | BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 2), | |
1008 | /* if (r0 == 0) goto pc+2 */ | |
1009 | BPF_MOV64_IMM(BPF_REG_1, 1), /* r1 = 1 */ | |
1010 | BPF_XADD(BPF_DW, BPF_REG_0, BPF_REG_1, 0, 0), | |
1011 | /* lock *(u64 *) r0 += r1 */ | |
4fba111e | 1012 | .\" == atomic64_add |
1148d934 MK |
1013 | BPF_MOV64_IMM(BPF_REG_0, 0), /* r0 = 0 */ |
1014 | BPF_EXIT_INSN(), /* return r0 */ | |
cc7ac21d AS |
1015 | }; |
1016 | ||
1148d934 | 1017 | prog_fd = bpf_prog_load(BPF_PROG_TYPE_SOCKET_FILTER, prog, |
527bd1d7 | 1018 | sizeof(prog) / sizeof(prog[0]), "GPL"); |
cc7ac21d AS |
1019 | |
1020 | sock = open_raw_sock("lo"); | |
1021 | ||
1148d934 MK |
1022 | assert(setsockopt(sock, SOL_SOCKET, SO_ATTACH_BPF, &prog_fd, |
1023 | sizeof(prog_fd)) == 0); | |
cc7ac21d AS |
1024 | |
1025 | for (;;) { | |
1026 | key = IPPROTO_TCP; | |
1027 | assert(bpf_lookup_elem(map_fd, &key, &tcp_cnt) == 0); | |
c83ad1ad | 1028 | key = IPPROTO_UDP; |
cc7ac21d | 1029 | assert(bpf_lookup_elem(map_fd, &key, &udp_cnt) == 0); |
d1a71985 | 1030 | printf("TCP %lld UDP %lld packets\en", tcp_cnt, udp_cnt); |
cc7ac21d AS |
1031 | sleep(1); |
1032 | } | |
1033 | ||
1034 | return 0; | |
1035 | } | |
f0271688 MK |
1036 | .EE |
1037 | .PP | |
ce5db3fc | 1038 | Some complete working code can be found in the |
266791fb | 1039 | .I samples/bpf |
5415d504 | 1040 | directory in the kernel source tree. |
cc7ac21d AS |
1041 | .SH RETURN VALUE |
1042 | For a successful call, the return value depends on the operation: | |
1043 | .TP | |
1044 | .B BPF_MAP_CREATE | |
ce5db3fc | 1045 | The new file descriptor associated with the eBPF map. |
cc7ac21d AS |
1046 | .TP |
1047 | .B BPF_PROG_LOAD | |
54513c00 | 1048 | The new file descriptor associated with the eBPF program. |
cc7ac21d AS |
1049 | .TP |
1050 | All other commands | |
1051 | Zero. | |
1052 | .PP | |
1053 | On error, \-1 is returned, and | |
1054 | .I errno | |
1055 | is set appropriately. | |
1056 | .SH ERRORS | |
1057 | .TP | |
266791fb | 1058 | .B E2BIG |
6cedbd4c MK |
1059 | The eBPF program is too large or a map reached the |
1060 | .I max_entries | |
1061 | limit (maximum number of elements). | |
cc7ac21d | 1062 | .TP |
266791fb | 1063 | .B EACCES |
6cedbd4c | 1064 | For |
266791fb | 1065 | .BR BPF_PROG_LOAD , |
6cedbd4c MK |
1066 | even though all program instructions are valid, the program has been |
1067 | rejected because it was deemed unsafe. | |
1068 | Possible reasons include an access to | |
1069 | a disallowed memory region or to an uninitialized stack slot or register, | |
1070 | function constraints that do not match the actual types, or | |
1071 | a misaligned memory access. | |
1072 | In this case, it is recommended to call | |
1073 | .BR bpf () | |
1074 | again with | |
1075 | .I log_level = 1 | |
1076 | and examine | |
1077 | .I log_buf | |
1078 | for the specific reason provided by the verifier. | |
cc7ac21d AS |
1079 | .TP |
1080 | .B EBADF | |
1081 | .I fd | |
7d6bfe72 | 1082 | is not an open file descriptor. |
cc7ac21d AS |
1083 | .TP |
1084 | .B EFAULT | |
1148d934 MK |
1085 | One of the pointers |
1086 | .RI ( key | |
cc7ac21d AS |
1087 | or |
1088 | .I value | |
1089 | or | |
1090 | .I log_buf | |
1091 | or | |
1148d934 MK |
1092 | .IR insns ) |
1093 | is outside the accessible address space. | |
cc7ac21d AS |
1094 | .TP |
1095 | .B EINVAL | |
1096 | The value specified in | |
1097 | .I cmd | |
1098 | is not recognized by this kernel. | |
1099 | .TP | |
1100 | .B EINVAL | |
1101 | For | |
1102 | .BR BPF_MAP_CREATE , | |
1103 | either | |
1104 | .I map_type | |
1105 | or attributes are invalid. | |
1106 | .TP | |
1107 | .B EINVAL | |
1108 | For | |
266791fb | 1109 | .B BPF_MAP_*_ELEM |
cc7ac21d | 1110 | commands, |
1148d934 MK |
1111 | some of the fields of |
1112 | .I "union bpf_attr" | |
1113 | that are not used by this command | |
cc7ac21d AS |
1114 | are not set to zero. |
1115 | .TP | |
1116 | .B EINVAL | |
1117 | For | |
266791fb | 1118 | .BR BPF_PROG_LOAD , |
9a5215bf | 1119 | indicates an attempt to load an invalid program. |
953d2673 MK |
1120 | eBPF programs can be deemed |
1121 | invalid due to unrecognized instructions, the use of reserved fields, jumps | |
cc7ac21d AS |
1122 | out of range, infinite loops or calls of unknown functions. |
1123 | .TP | |
266791fb | 1124 | .B ENOENT |
cc7ac21d AS |
1125 | For |
1126 | .B BPF_MAP_LOOKUP_ELEM | |
1127 | or | |
16152abb | 1128 | .BR BPF_MAP_DELETE_ELEM , |
cc7ac21d AS |
1129 | indicates that the element with the given |
1130 | .I key | |
1131 | was not found. | |
1132 | .TP | |
6cedbd4c MK |
1133 | .B ENOMEM |
1134 | Cannot allocate sufficient memory. | |
1135 | .TP | |
1136 | .B EPERM | |
1137 | The call was made without sufficient privilege | |
1138 | (without the | |
1139 | .B CAP_SYS_ADMIN | |
1140 | capability). | |
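.PP
When reporting a failed
.BR BPF_PROG_LOAD ,
it can help to translate
.I errno
into a hint about what to check next.
The following is a sketch;
.IR prog_load_hint ()
is a hypothetical helper, and its messages merely paraphrase
the descriptions on this page.

```c
#include <errno.h>

/* Map the errno left by a failed BPF_PROG_LOAD to a short hint
   (hypothetical helper; wording paraphrases this page). */
static const char *
prog_load_hint(int err)
{
    switch (err) {
    case E2BIG:
        return "program is too large";
    case EACCES:
        return "rejected as unsafe; retry with log_level = 1";
    case EFAULT:
        return "insns or log_buf is outside the address space";
    case EINVAL:
        return "invalid program";
    case ENOMEM:
        return "cannot allocate sufficient memory";
    case ENOSPC:
        return "log_buf is too small for the verifier log";
    case EPERM:
        return "insufficient privilege";
    default:
        return "unexpected error";
    }
}
```

.PP
For
.BR EACCES ,
the verifier log in
.I log_buf
remains the authoritative explanation;
the hint only says where to look.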
5f920e10 MK |
1141 | .SH VERSIONS |
1142 | The | |
1143 | .BR bpf () | |
1144 | system call first appeared in Linux 3.18. | |
8dbf8f2d MK |
1145 | .SH CONFORMING TO |
1146 | The | |
1147 | .BR bpf () | |
1148 | system call is Linux-specific. | |
cc7ac21d | 1149 | .SH NOTES |
821bf91c | 1150 | Prior to Linux 4.4, all |
842ee010 MK |
1151 | .BR bpf () |
1152 | commands required the caller to have the | |
cc7ac21d | 1153 | .B CAP_SYS_ADMIN |
35732aa7 MK |
1154 | capability. |
1155 | From Linux 4.4 onwards, | |
1156 | .\" commit 1be7f75d1668d6296b80bf35dcf6762393530afc | |
1157 | an unprivileged user may create limited programs of type | |
821bf91c | 1158 | .B BPF_PROG_TYPE_SOCKET_FILTER |
35732aa7 MK |
1159 | and associated maps. |
1160 | However, they may not store kernel pointers within | |
821bf91c | 1161 | the maps and are presently limited to the following helper functions: |
f7d706ba MK |
1162 | .\" [Linux 5.6] mtk: The list of available functions is, I think, governed |
1163 | .\" by the check in net/core/filter.c::bpf_base_func_proto(). | |
821bf91c RP |
1164 | .IP * 3 |
1165 | get_random | |
1166 | .PD 0 | |
1167 | .IP * | |
1168 | get_smp_processor_id | |
1169 | .IP * | |
1170 | tail_call | |
1171 | .IP * | |
1172 | ktime_get_ns | |
1173 | .PD 1 | |
1174 | .PP | |
1175 | Unprivileged access may be blocked by setting the sysctl | |
1176 | .IR /proc/sys/kernel/unprivileged_bpf_disabled . | |
f0271688 | 1177 | .PP |
f774ddf1 MK |
1178 | eBPF objects (maps and programs) can be shared between processes. |
1179 | For example, after | |
1180 | .BR fork (2), | |
1181 | the child inherits file descriptors referring to the same eBPF objects. | |
1182 | In addition, file descriptors referring to eBPF objects can be | |
1183 | transferred over UNIX domain sockets. | |
1184 | File descriptors referring to eBPF objects can be duplicated | |
1185 | in the usual way, using | |
1186 | .BR dup (2) | |
1187 | and similar calls. | |
1188 | An eBPF object is deallocated only after all file descriptors | |
1189 | referring to the object have been closed. | |
f0271688 | 1190 | .PP |
4fba111e MK |
1191 | eBPF programs can be written in a restricted C that is compiled (using the |
1192 | .B clang | |
953d2673 MK |
1193 | compiler) into eBPF bytecode. |
1194 | Various features are omitted from this restricted C, such as loops, | |
f774ddf1 | 1195 | global variables, variadic functions, floating-point numbers, |
953d2673 | 1196 | and passing structures as function arguments. |
4fba111e MK |
1197 | Some examples can be found in the |
1198 | .I samples/bpf/*_kern.c | |
1199 | files in the kernel source tree. | |
ce5db3fc MK |
1200 | .\" There are also examples for the tc classifier, in the iproute2 |
1201 | .\" project, in examples/bpf | |
f0271688 | 1202 | .PP |
953d2673 MK |
1203 | The kernel contains a just-in-time (JIT) compiler that translates |
1204 | eBPF bytecode into native machine code for better performance. | |
5a29959a MK |
1205 | In kernels before Linux 4.15, |
1206 | the JIT compiler is disabled by default, | |
953d2673 MK |
1207 | but its operation can be controlled by writing one of the |
1208 | following integer strings to the file | |
1209 | .IR /proc/sys/net/core/bpf_jit_enable : | |
1210 | .IP 0 3 | |
1211 | Disable JIT compilation (default). | |
1212 | .IP 1 | |
1213 | Normal compilation. | |
1214 | .IP 2 | |
1215 | Debugging mode. | |
1216 | The generated opcodes are dumped in hexadecimal into the kernel log. | |
1217 | These opcodes can then be disassembled using the program | |
266791fb | 1218 | .I tools/net/bpf_jit_disasm.c |
953d2673 | 1219 | provided in the kernel source tree. |
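.PP
A program can check the current setting before relying on the JIT.
The following is a sketch;
.IR jit_mode_name ()
and
.IR read_bpf_jit_enable ()
are hypothetical helpers, and the code treats a missing or unreadable
file as "unknown" rather than guessing at the reason.

```c
#include <stdio.h>

/* Describe a bpf_jit_enable value, following the list above
   (hypothetical helper). */
static const char *
jit_mode_name(int mode)
{
    switch (mode) {
    case 0:
        return "disabled";
    case 1:
        return "enabled";
    case 2:
        return "enabled, debugging mode";
    default:
        return "unknown";
    }
}

/* Read the current setting (hypothetical helper).
   Returns the integer stored in the file, or -1 if the file
   cannot be opened or does not contain an integer. */
static int
read_bpf_jit_enable(void)
{
    FILE *fp = fopen("/proc/sys/net/core/bpf_jit_enable", "r");
    int mode = -1;

    if (fp == NULL)
        return -1;
    if (fscanf(fp, "%d", &mode) != 1)
        mode = -1;
    fclose(fp);
    return mode;
}
```

.PP
For example,
.I jit_mode_name(read_bpf_jit_enable())
yields a string suitable for a diagnostic message.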
fcd1bee3 | 1220 | .PP |
5a29959a MK |
1221 | Since Linux 4.15, |
1222 | .\" commit 290af86629b25ffd1ed6232c4e9107da031705cb | |
1223 | the kernel may be configured with the | |
1224 | .B CONFIG_BPF_JIT_ALWAYS_ON | |
1225 | option. | |
1226 | In this case, the JIT compiler is always enabled, and | |
1227 | .I bpf_jit_enable | |
1228 | is initialized to 1 and is immutable. | |
1229 | (This kernel configuration option was provided as a mitigation for | |
1230 | one of the Spectre attacks against the BPF interpreter.) | |
1231 | .PP | |
2b623a23 | 1232 | The JIT compiler for eBPF is currently |
4167f63f | 1233 | .\" Last reviewed in Linux 4.18-rc by grepping for BPF_ALU64 in arch/ |
6d2ac026 MK |
1234 | .\" and by checking the documentation for bpf_jit_enable in |
1235 | .\" Documentation/sysctl/net.txt | |
2b623a23 MK |
1236 | available for the following architectures: |
1237 | .IP * 3 | |
2ef9216b MK |
1238 | x86-64 (since Linux 3.18; cBPF since Linux 3.0); |
1239 | .\" commit 0a14842f5a3c0e88a1e59fac5c3025db39721f74 | |
2b623a23 MK |
1240 | .PD 0 |
1241 | .IP * | |
2ef9216b MK |
1242 | ARM32 (since Linux 3.18; cBPF since Linux 3.4); |
1243 | .\" commit ddecdfcea0ae891f782ae853771c867ab51024c2 | |
1244 | .IP * | |
1245 | SPARC 32 (since Linux 3.18; cBPF since Linux 3.5); | |
1246 | .\" commit 2809a2087cc44b55e4377d7b9be3f7f5d2569091 | |
2b623a23 | 1247 | .IP * |
2ef9216b MK |
1248 | ARM-64 (since Linux 3.18); |
1249 | .\" commit e54bcde3d69d40023ae77727213d14f920eb264a | |
2b623a23 | 1250 | .IP * |
069be4fd MK |
1251 | s390 (since Linux 4.1; cBPF since Linux 3.7); |
1252 | .\" commit c10302efe569bfd646b4c22df29577a4595b4580 | |
1253 | .IP * | |
2ef9216b MK |
1254 | PowerPC 64 (since Linux 4.8; cBPF since Linux 3.1); |
1255 | .\" commit 0ca87f05ba8bdc6791c14878464efc901ad71e99 | |
1256 | .\" commit 156d0e290e969caba25f1851c52417c14d141b24 | |
2b623a23 MK |
1257 | .IP * |
1258 | SPARC 64 (since Linux 4.12); | |
2ef9216b | 1259 | .\" commit 7a12b5031c6b947cc13918237ae652b536243b76 |
2b623a23 | 1260 | .IP * |
2ef9216b MK |
1261 | x86-32 (since Linux 4.18); |
1262 | .\" commit 03f5781be2c7b7e728d724ac70ba10799cc710d7 | |
2b623a23 | 1263 | .IP * |
2ef9216b MK |
1264 | MIPS 64 (since Linux 4.18; cBPF since Linux 3.16); |
1265 | .\" commit c6610de353da5ca6eee5b8960e838a87a90ead0c | |
1266 | .\" commit f381bf6d82f032b7410185b35d000ea370ac706b | |
c3a42840 | 1267 | .IP * |
2ef9216b MK |
1268 | riscv (since Linux 5.1). |
1269 | .\" commit 2353ecc6f91fd15b893fa01bf85a1c7a823ee4f2 | |
2b623a23 | 1270 | .PD |
cc7ac21d | 1271 | .SH SEE ALSO |
842ee010 | 1272 | .BR seccomp (2), |
3bcfaff6 | 1273 | .BR bpf-helpers (7), |
cc42e9b8 | 1274 | .BR socket (7), |
8440f771 MK |
1275 | .BR tc (8), |
1276 | .BR tc-bpf (8) | |
f0271688 | 1277 | .PP |
5988a659 | 1278 | Both classic and extended BPF are explained in the kernel source file |
1148d934 | 1279 | .IR Documentation/networking/filter.txt . |