Performance Guide
=================

To get the best out of the PowerDNS recursor, which is important if you are doing thousands of queries per second, please consider the following.

A busy server may need hundreds of file descriptors on startup, and deals with spikes better if it has that many available later on.
Linux by default restricts processes to 1024 file descriptors, which should suffice most of the time, but Solaris has a default limit of 256.
This can be raised using the ``ulimit`` command or via the ``LimitNOFILE`` unit directive when ``systemd`` is used.
FreeBSD has a default limit that is high enough for even very heavy duty use.
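
For example, on a ``systemd``-managed system the limit can be raised with a drop-in file (a minimal sketch; the unit name and the value are assumptions, adjust them to your packaging and expected load)::

    # /etc/systemd/system/pdns-recursor.service.d/limits.conf
    [Service]
    LimitNOFILE=16384

Run ``systemctl daemon-reload`` and restart the service for the new limit to take effect.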

Limit the size of the caches to a sensible value.
Cache hit rate does not improve meaningfully beyond a few million :ref:`setting-max-cache-entries`; reducing the memory footprint reduces CPU cache misses.
See below for more information about the various caches.

When deploying (large scale) IPv6, please be aware some Linux distributions leave IPv6 routing cache tables at very small default values.
Please check and if necessary raise ``sysctl net.ipv6.route.max_size``.
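
A quick way to inspect and raise the value (the number shown is only an assumed starting point, size it to your deployment)::

    sysctl net.ipv6.route.max_size
    sysctl -w net.ipv6.route.max_size=32768

To persist the change, add it to ``/etc/sysctl.conf`` or a file under ``/etc/sysctl.d/``.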

Set :ref:`setting-threads` to your number of CPU cores minus the number of distributor threads.
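
For example, on an 8-core machine running one distributor thread, a minimal configuration sketch (assuming you enabled :ref:`setting-pdns-distributes-queries`) could be::

    # recursor.conf
    pdns-distributes-queries=yes
    distributor-threads=1
    threads=7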

Threading and distribution of queries
-------------------------------------

When running with several threads, you can either ask PowerDNS to start one or more special threads to dispatch the incoming queries to the workers by setting :ref:`setting-pdns-distributes-queries` to ``yes``, or let the worker threads handle the incoming queries themselves.
The latter is the default since version 4.9.0.

The dispatch thread enabled by :ref:`setting-pdns-distributes-queries` tries to send the same queries to the same thread to maximize the cache-hit ratio.
If the incoming query rate is so high that the dispatch thread becomes a bottleneck, you can increase :ref:`setting-distributor-threads` to use more than one.

If :ref:`setting-pdns-distributes-queries` is set to ``no`` and either ``SO_REUSEPORT`` support is not available or the :ref:`setting-reuseport` directive is set to ``no``, all worker threads share the same listening sockets.

This prevents a single thread from having to handle every incoming query, but can lead to thundering herd issues where all threads are awoken at once when a query arrives.

If ``SO_REUSEPORT`` support is available and :ref:`setting-reuseport` is set to ``yes``, which is the default since version 4.9.0, separate listening sockets are opened for each worker thread and query distribution is handled by the kernel, avoiding any thundering herd issue as well as preventing the distributor thread from becoming the bottleneck.
The next section discusses how to determine if the mechanism is working properly.

.. _worker_imbalance:

Imbalance
^^^^^^^^^
Due to the nature of the distribution method used by the kernel, imbalance may occur with the new default settings of :ref:`setting-reuseport` and :ref:`setting-pdns-distributes-queries` if you have very few clients.
Imbalance can be observed by reading the periodic statistics reported by :program:`Recursor`::

    Jun 26 11:06:41 pepper pdns-recursor[10502]: msg="Queries handled by thread" subsystem="stats" level="0" prio="Info" tid="0" ts="1687770401.359" count="7" thread="0"
    Jun 26 11:06:41 pepper pdns-recursor[10502]: msg="Queries handled by thread" subsystem="stats" level="0" prio="Info" tid="0" ts="1687770401.359" count="535167" thread="1"
    Jun 26 11:06:41 pepper pdns-recursor[10502]: msg="Queries handled by thread" subsystem="stats" level="0" prio="Info" tid="0" ts="1687770401.359" count="5" thread="2"

In the above log lines we see that almost all queries are processed by thread 1.
This can typically be observed when using ``dnsdist`` in front of :program:`Recursor`.

When using ``dnsdist`` with a single ``newServer`` to a recursor instance in its configuration, the kernel will regard ``dnsdist`` as a single client unless you use the ``sockets`` parameter to ``newServer`` to increase the number of source ports used by ``dnsdist``.
The following guidelines apply for the ``dnsdist`` case; see the configuration sketch after this list:

- Be generous with the ``sockets`` setting of ``newServer``.
  A starting point is to configure twice as many sockets as :program:`Recursor` threads.
- As long as the threads of the :program:`Recursor` are not overloaded, some imbalance will not impact performance significantly.
- If you want to reduce imbalance, increase the value of ``sockets`` even more.
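
As a sketch (assuming a :program:`Recursor` with 8 threads listening on 127.0.0.1:5300; adjust the address and count to your setup), the ``dnsdist`` configuration could contain::

    -- dnsdist.conf
    newServer({address="127.0.0.1:5300", sockets=16})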

Non-Linux systems
^^^^^^^^^^^^^^^^^
On some systems setting :ref:`setting-reuseport` to ``yes`` does not have the desired effect at all.
If your system shows great imbalance in the number of queries processed per thread (as reported by the periodic statistics report), try switching :ref:`setting-reuseport` to ``no`` and/or setting :ref:`setting-pdns-distributes-queries` to ``yes``.

.. versionadded:: 4.1.0
   The :ref:`setting-cpu-map` parameter can be used to pin worker threads to specific CPUs, in order to keep caches as warm as possible and optimize memory access on NUMA systems.

.. versionadded:: 4.2.0
   The :ref:`setting-distributor-threads` parameter can be used to run more than one distributor thread.

.. versionchanged:: 4.9.0
   The :ref:`setting-reuseport` parameter now defaults to ``yes``.

.. versionchanged:: 4.9.0
   The :ref:`setting-pdns-distributes-queries` parameter now defaults to ``no``.

MTasker and MThreads
--------------------

PowerDNS Recursor uses cooperative multitasking in userspace called ``MTasker``, based either on ``boost::context`` if available, or on ``System V ucontexts`` otherwise. For maximum performance, please make sure that your system supports ``boost::context``, as the alternative has been known to be significantly slower.

The maximum number of simultaneous MTasker threads, called ``MThreads``, can be tuned via :ref:`setting-max-mthreads`, as the default value of 2048 might not be enough for large-scale installations.
This setting limits the number of mthreads *per physical (Posix) thread*.
The threads that create mthreads are the distributor and worker threads.

When an ``MThread`` is started, a new stack is dynamically allocated for it on the heap. The size of that stack can be configured via the :ref:`setting-stack-size` parameter, whose default value is 200 kB, which should be enough in most cases.

To reduce the cost of allocating a new stack for every query, the recursor can cache a small number of stacks to make sure that the allocation stays cheap. This can be configured via the :ref:`setting-stack-cache-size` setting.
This limit is per physical (Posix) thread.
The only trade-off of enabling this cache is a slightly increased memory consumption, at worst equal to the number of stacks specified by :ref:`setting-stack-cache-size` multiplied by the size of one stack, itself specified via :ref:`setting-stack-size`.
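
As an illustration (a sketch only; the numbers are assumptions to be sized against your memory budget), a large installation could use::

    # recursor.conf
    max-mthreads=4096
    stack-size=204800
    stack-cache-size=100

With these values the stack cache can consume at worst 100 * 200 kB = 20 MB per Posix thread.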

Performance tips
----------------

For best PowerDNS Recursor performance, use a recent version of your operating system, since this generally offers the best event multiplexer implementation available (``kqueue``, ``epoll``, ``ports`` or ``/dev/poll``).

On AMD/Intel hardware, wherever possible, run a 64-bit binary. This delivers a nearly twofold performance increase.
On UltraSPARC, there is no need to run with 64 bits.

Consider performing a 'profiled build' by building with ``gprof`` support enabled, running the recursor for a while, and then feeding that info into the next build.
This is good for a 20% performance boost in some cases.

When running with >3000 queries per second, and running Linux versions prior to 2.6.17 on some motherboards, your computer may spend an inordinate amount of time working around an ACPI bug for each call to gettimeofday.
This is solved by rebooting with ``clock=tsc`` or upgrading to a 2.6.17 kernel.
This is relevant if ``dmesg`` shows ``Using pmtmr for high-res timesource``.

Connection tracking and firewalls
---------------------------------

A Recursor under high load puts severe stress on any stateful (connection tracking) firewall, so much so that the firewall may fail.

Specifically, many Linux distributions run with a connection tracking firewall configured.
For high load operation (thousands of queries/second), it is advised to either turn off ``iptables`` completely, or use the ``NOTRACK`` feature to make sure client DNS traffic bypasses the connection tracking.

Sample Linux command lines would be::

    ## IPv4
    ## NOTRACK rules for 53/udp, keep in mind that you also need your regular rules for 53/tcp
    iptables -t raw -I OUTPUT -p udp --sport 53 -j CT --notrack
    iptables -t raw -I PREROUTING -p udp --dport 53 -j CT --notrack
    iptables -I INPUT -p udp --dport 53 -j ACCEPT

    ## IPv6
    ## NOTRACK rules for 53/udp, keep in mind that you also need your regular rules for 53/tcp
    ip6tables -t raw -I OUTPUT -p udp --sport 53 -j CT --notrack
    ip6tables -t raw -I PREROUTING -p udp --dport 53 -j CT --notrack
    ip6tables -I INPUT -p udp --dport 53 -j ACCEPT

When using FirewallD (CentOS 7+ / Red Hat 7+ / Fedora 21+), connection tracking can be disabled via direct rules.
The settings can be made permanent by using the ``--permanent`` flag::

    ## IPv4
    ## NOTRACK rules for 53/udp, keep in mind that you also need your regular rules for 53/tcp
    firewall-cmd --direct --add-rule ipv4 raw OUTPUT 0 -p udp --sport 53 -j CT --notrack
    firewall-cmd --direct --add-rule ipv4 raw PREROUTING 0 -p udp --dport 53 -j CT --notrack
    firewall-cmd --direct --add-rule ipv4 filter INPUT 0 -p udp --dport 53 -j ACCEPT

    ## IPv6
    ## NOTRACK rules for 53/udp, keep in mind that you also need your regular rules for 53/tcp
    firewall-cmd --direct --add-rule ipv6 raw OUTPUT 0 -p udp --sport 53 -j CT --notrack
    firewall-cmd --direct --add-rule ipv6 raw PREROUTING 0 -p udp --dport 53 -j CT --notrack
    firewall-cmd --direct --add-rule ipv6 filter INPUT 0 -p udp --dport 53 -j ACCEPT

Following the instructions above, you should be able to attain very high query rates.

Tuning Incoming TCP and Out-of-Order processing
-----------------------------------------------

In general TCP uses more resources than UDP, so beware!
It is impossible to give hard numbers for the various parameters as each site is different.
Instead we describe the mechanism and relevant metrics so you can study your setup and change the proper settings if needed.

Each incoming TCP connection uses a file descriptor in addition to the file descriptors for other purposes, like contacting authoritative servers.
When the recursor starts up, it will check if enough file descriptors are available and complain if not.

When a query is received over a TCP connection, first the packet cache is consulted.
If an answer is found it will be returned immediately.
If no answer is found, the Recursor will process :ref:`setting-max-concurrent-requests-per-tcp-connection` queries per incoming TCP connection concurrently.
If more queries than this are pending for a TCP connection, the remaining queries will stay in the TCP receive buffer to be processed later.
Each of the queries processed will consume an mthread until processing is done.
A response to a query is sent immediately when it becomes available; the response can be sent before other responses to queries that were received earlier by the Recursor.
This is the Out-of-Order feature, which greatly enhances performance, as a single slow query does not prevent other queries from being processed.

Before version 5.0.0, TCP queries were processed by either the distributor thread(s) if :ref:`setting-pdns-distributes-queries` is ``yes``, or by worker threads if :ref:`setting-pdns-distributes-queries` is ``no``.
Starting with version 5.0.0, :program:`Recursor` has dedicated thread(s) processing TCP queries.

The maximum number of mthreads consumed by TCP queries is :ref:`setting-max-tcp-clients` times :ref:`setting-max-concurrent-requests-per-tcp-connection`.
Before version 5.0.0, if :ref:`setting-pdns-distributes-queries` is ``no``, this number should be (much) lower than :ref:`setting-max-mthreads`, to also allow UDP queries to be handled, as these also consume mthreads.
Note that :ref:`setting-max-mthreads` is a per Posix thread setting.
This means that the global maximum number of mthreads is (#distributor threads + #worker threads) * max-mthreads.

If you expect few clients, you can increase :ref:`setting-max-concurrent-requests-per-tcp-connection` to allow more concurrency per TCP connection.
If you expect many clients and you have increased :ref:`setting-max-tcp-clients`, reduce the :ref:`setting-max-concurrent-requests-per-tcp-connection` value to prevent mthread starvation, or increase the maximum number of mthreads.

To increase the maximum number of concurrent queries consider increasing :ref:`setting-max-mthreads`, but be aware that each active mthread consumes more than 200 kB of memory.
To see the current number of mthreads in use, consult the :ref:`stat-concurrent-queries` metric.
If a query could not be handled due to mthread shortage, the :ref:`stat-over-capacity-drops` metric is increased.

As an example, if you typically have 200 TCP clients, and the default maximum number of mthreads of 2048, a good number of concurrent requests per TCP connection would be 5. Assuming a worst case packet cache hit ratio, if all 200 TCP clients fill their connections with queries, about half (5 * 200) of the mthreads would be used by incoming TCP queries, leaving the other half for incoming UDP queries.
Note that starting with version 5.0.0, TCP queries are processed by dedicated TCP thread(s), so the sharing of mthreads between UDP and TCP queries no longer applies.
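
A configuration sketch matching the example above (the values are illustrative only and must be sized to your own client population)::

    # recursor.conf
    max-tcp-clients=200
    max-concurrent-requests-per-tcp-connection=5
    max-mthreads=2048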

The total number of incoming TCP connections is limited by :ref:`setting-max-tcp-clients`.
There is also a per-client-address limit, :ref:`setting-max-tcp-per-client`, to limit the impact of a single client.
Consult the :ref:`stat-tcp-clients` metric for the current number of TCP connections and the :ref:`stat-tcp-client-overflow` metric to see if client connection attempts were rejected because there were too many existing connections from a single address.

.. _tcp-fast-open-support:

TCP Fast Open Support
---------------------
On Linux systems, the recursor can use TCP Fast Open for passive (incoming, since 4.1) and active (outgoing, since 4.5) TCP connections.
TCP Fast Open allows the initial SYN packet to carry data, saving one network round-trip.
For details, consult :rfc:`7413`.

On Linux systems, to enable TCP Fast Open, it might be necessary to change the value of the ``net.ipv4.tcp_fastopen`` sysctl.
Value 0 means Fast Open is disabled, 1 enables it for active connections only, 2 for passive connections only, and 3 for both.
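
For a recursor that both receives and makes TCP connections, enabling both directions could look like this (persist the setting via ``/etc/sysctl.d/`` as appropriate)::

    sysctl -w net.ipv4.tcp_fastopen=3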

The operation of TCP Fast Open can be monitored by looking at these kernel metrics::

    netstat -s | grep TCPFastOpen

Please note that if active (outgoing) TCP Fast Open attempts fail in particular ways, the Linux kernel stops using active TCP Fast Open for a while for all connections, even connections to servers that previously worked.
This behaviour can be monitored by watching the ``TCPFastOpenBlackHole`` kernel metric and influenced by setting the ``net.ipv4.tcp_fastopen_blackhole_timeout_sec`` sysctl.
While developing active TCP Fast Open, it was necessary to set ``net.ipv4.tcp_fastopen_blackhole_timeout_sec`` to zero to circumvent the issue, since it was triggered regularly when connecting to authoritative nameservers that did not respond.

At the moment of writing, some Google-operated nameservers (both recursive and authoritative) indicate Fast Open support in the TCP handshake, but do not accept the cookie they sent previously and send a new one for each connection.
Google is working to fix this.

If you operate an anycast pool of machines, make them share the TCP Fast Open Key by setting the ``net.ipv4.tcp_fastopen_key`` sysctl, otherwise you will create a similar issue to the one some Google servers have.
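
A sketch of sharing a key across an anycast pool (the key below is a placeholder; generate your own random value and distribute it to all pool members)::

    sysctl -w net.ipv4.tcp_fastopen_key=00112233-44556677-8899aabb-ccddeeff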

To determine a good value for the :ref:`setting-tcp-fast-open` setting, watch the ``TCPFastOpenListenOverflow`` metric.
If this metric increases often, the setting might be too low for your traffic, but note that increasing it will use kernel resources.
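
The setting takes the Fast Open queue size (a sketch; the value is an assumption to be tuned against ``TCPFastOpenListenOverflow``)::

    # recursor.conf
    tcp-fast-open=1024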

Running with a local root zone
------------------------------
Running with a local root zone as described in :rfc:`8806` can help reduce traffic to the root servers and reduce response times for clients.
Since 4.6.0 PowerDNS Recursor supports two ways of doing this; a configuration sketch for the second method follows the list.

- The first method is to have a local Authoritative Server that has a copy of the root zone and forward queries to it.
  Setting up a PowerDNS Authoritative Server to serve a copy of the root zone looks like::

      pdnsutil create-secondary-zone . ip1 ip2

  where ``ip1`` and ``ip2`` are servers willing to serve an AXFR for the root zone; :rfc:`8806` contains a list of candidates in appendix A. The Authoritative Server will periodically make sure its copy of the root zone is up-to-date.
  The next step is to configure a forward zone to the IP ``ip`` of the Authoritative Server in the settings file of the Recursor::

      forward-zones=.=ip

  The Recursor will use the Authoritative Server to ask questions about the root zone, but if it learns about delegations, it will still follow those.
  Multiple Recursors can use this Authoritative Server.

- The second method is to cache the root zone as described in :ref:`ztc`.
  Here each Recursor will download and fill its cache with the contents of the root zone.
  Depending on the ``timeout`` parameter, this will be done once or periodically.
  Refer to :ref:`ztc` for details.
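
As a sketch of the second method (assuming the Lua configuration file is enabled via :ref:`setting-lua-config-file`; the URL is the well-known InterNIC copy of the root zone)::

    -- lua-config-file
    zoneToCache(".", "url", "https://www.internic.net/domain/root.zone")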

Recursor Caches
---------------

The PowerDNS Recursor contains a number of caches, or information stores:

Nameserver speeds cache
^^^^^^^^^^^^^^^^^^^^^^^

The "NSSpeeds" cache contains the average latency to all remote authoritative servers.

Negative cache
^^^^^^^^^^^^^^

The "Negcache" contains all domains known not to exist, or record types known not to exist for a domain.

Recursor Cache
^^^^^^^^^^^^^^

The Recursor Cache contains all DNS knowledge gathered over time.
This is also known as the "record cache".

Packet Cache
^^^^^^^^^^^^

The Packet Cache contains previous answers sent to clients.
If a question comes in that matches a previous answer, this is sent back directly.

The Packet Cache is consulted first, immediately after receiving a packet.
This means that a high hit rate for the Packet Cache automatically lowers the cache hit rate of subsequent caches.

Measuring performance
---------------------

The PowerDNS Recursor exposes many :doc:`metrics <metrics>` that can be graphed and monitored.

Event Tracing
-------------
Event tracing is an experimental feature introduced in version 4.6.0 that allows following the internals of query processing in more detail.

At certain spots in the resolving process, event records are created that contain an identification of the event, a timestamp, potentially a value, and an indication whether this was the start or the end of an event. The latter is relevant for events that describe stages in the resolving process.

At this point in time, event logs of queries can be exported using a protobuf log or they can be written to the log file.

Note that this is an experimental feature that will change in upcoming releases.

Currently, an event protobuf message has the following definition:

.. code-block:: protobuf

    enum EventType {
      // Range 0..99: Generic events
      CustomEvent = 0; // A custom event
      ReqRecv = 1;     // A request was received
      PCacheCheck = 2; // A packet cache check was initiated or completed; value: bool cacheHit
      AnswerSent = 3;  // An answer was sent to the client

      // Range 100: Recursor events
      SyncRes = 100;   // Recursor Syncres main function has started or completed; value: int rcode
      LuaGetTag = 101; // Events below mark start or end of Lua hook calls; value: return value of hook
      LuaGetTagFFI = 102;
      LuaIPFilter = 103;
      LuaPreRPZ = 104;
      LuaPreResolve = 105;
      LuaPreOutQuery = 106;
      LuaPostResolve = 107;
      LuaNoData = 108;
      LuaNXDomain = 109;
    }

.. code-block:: protobuf

    message Event {
      required uint64 ts = 1;
      required EventType event = 2;
      required bool start = 3;
      optional bool boolVal = 4;
      optional int64 intVal = 5;
      optional string stringVal = 6;
      optional bytes bytesVal = 7;
      optional string custom = 8;
    }
    repeated Event trace = 23;

Event traces can be enabled by either setting :ref:`setting-event-trace-enabled` or by using the :doc:`rec_control <manpages/rec_control.1>` subcommand ``set-event-trace-enabled``.
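
For example, to enable tracing at runtime (a sketch; the numeric argument selects the destination, log file and/or protobuf, consult :ref:`setting-event-trace-enabled` for the exact values)::

    rec_control set-event-trace-enabled 1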

An example of a trace (timestamps are relative in nanoseconds) as shown in the logfile:

.. code-block:: C

    - ReqRecv(70);
    - PCacheCheck(411964);
    - PCacheCheck(416783,0,done);
    - SyncRes(441811);
    - SyncRes(337233971,0,done);
    - AnswerSent(337266453)

The packet cache check produces two events.
The first signals the start of the packet cache lookup, and the second the completion of the packet cache lookup with result 0 (not found).
The SyncRes event also has two entries. The value (0) is the return value of the SyncRes function.

An example of a trace with a packet cache hit:

.. code-block:: C

    - ReqRecv(60);
    - PCacheCheck(22913);
    - PCacheCheck(113255,1,done);
    - AnswerSent(117493)

Here it can be seen that the packet cache check returns 1 (found).

An example where various Lua-related events can be seen:

.. code-block:: C

    ReqRecv(150);
    PCacheCheck(26912);
    PCacheCheck(51308,0,done);
    LuaIPFilter(56868);
    LuaIPFilter(57149,0,done);
    LuaPreRPZ(82728);
    LuaPreRPZ(82918,0,done);
    LuaPreResolve(83479);
    LuaPreResolve(210621,0,done);
    SyncRes(217424);
    LuaPreOutQuery(292868);
    LuaPreOutQuery(292938,0,done);
    LuaPreOutQuery(24702079);
    LuaPreOutQuery(24702349,0,done);
    LuaPreOutQuery(43055303);
    LuaPreOutQuery(43055634,0,done);
    SyncRes(80470320,0,done);
    LuaPostResolve(80476592);
    LuaPostResolve(80476772,0,done);
    AnswerSent(80500247)

There is no packet cache hit, so SyncRes is called, which does a couple of outgoing queries.