Internals of the PowerDNS Recursor
==================================

**Warning**: This section is aimed at programmers wanting to contribute
to the recursor, or to help fix bugs. It is not required reading for a
PowerDNS operator, although it might prove interesting.

This Recursor depends on the use of some fine infrastructure: MTasker,
MOADNSParser, MPlexer and the C++ Standard Library/Boost. This page
will explain the conceptual relation between these components, and the
route of a packet through the program.

The PowerDNS Recursor
---------------------

The Recursor started out as a tiny project, mostly a technology
demonstration. These days it is a full-blown recursor with many
features. This, combined with a need for very high performance, has made
the recursor code less accessible than it was. The page you are reading
hopes to rectify this situation.

Synchronous code using MTasker
------------------------------

The original name of the program was **syncres**, which is still
reflected in the file name ``syncres.cc``, and the class SyncRes. This
means that PowerDNS is written naively, with one thread of execution per
query, synchronously waiting for packets. Normally this would lead to
very bad performance (unless running on a computer with very fast
threading, like possibly the Sun CoolThreads family), so PowerDNS
employs `MTasker <http://ds9a.nl/mtasker>`__ for very fast userspace
threading.

MTasker, which was developed separately from PowerDNS, does not provide
a full multithreading system but restricts itself to those features a
nameserver needs. It offers cooperative multitasking, which means there
is no forced preemption of threads. This in turn means that no two
**MThreads** ever really run at the same time.

This is both good and bad, but mostly good. It means the recursor does not
have to think about locking in many cases.

It also means that the recursor could block if any operation takes too
long.

The core interaction with MTasker happens through the waitEvent() and
sendEvent() functions. These pass around PacketID objects. Everything
PowerDNS needs to wait for is described by a PacketID event, so the name
is a bit misleading. Waiting for a TCP socket to have data available is
also passed via a PacketID, for example.

The version of MTasker in PowerDNS is newer than that described at the
MTasker site, with a vital difference being that the waitEvent()
structure passes along a copy of the exact PacketID sendEvent()
transmitted. Furthermore, threads can trawl through the list of events
being waited for and modify the respective PacketIDs. This is used for
example with **near miss** packets: packets that appear to answer
questions we asked, but differ in the DNS id. On seeing such a packet,
the recursor trawls through all PacketIDs and if it finds any
near misses, it updates the PacketID::nearMisses counter. The actual
PacketID thus lives inside MTasker while any thread is waiting for it.

MPlexer
-------

The Recursor uses a separate socket per outgoing query. This has the
important benefit of making spoofing 64000 times harder, and
additionally means that ICMP errors are reported back to the program. In
measurements this appears to happen to one in ten queries, which would
otherwise take a two-second timeout before PowerDNS moves on to another
nameserver.

However, this means that the program routinely needs to wait on hundreds
or even thousands of sockets. Different operating systems offer various
ways to monitor the state of sockets or more generally, file
descriptors. To abstract out the differing strategies (``select``,
``epoll``, ``kqueue``, ``completion ports``), PowerDNS contains
**MPlexer** classes, all of which descend from the FDMultiplexer class.

This class is very simple and offers only five important methods:
addReadFD(), addWriteFD(), removeReadFD(), removeWriteFD() and run().

The arguments to the **add** functions consist of an fd, a callback, and
a boost::any variable that is passed as a reference to the callback.

This might remind you of the MTasker above, and it is indeed the same
trick: state is stored within the MPlexer. As long as a file descriptor
remains within either the Read or Write active list, its state will
remain stored.

On arrival of a packet (or more generally, when an FD becomes readable
or writable, which for example might mean a new TCP connection), the
callback is called with the aforementioned reference to its parameter.

The callback is free to call removeReadFD() or removeWriteFD() to remove
itself from the active list.

PowerDNS defines such callbacks as newUDPQuestion(), newTCPConnection()
and handleRunningTCPConnection().

Finally, the run() method needs to be called whenever the program is
ready for new data. This happens in the main loop in pdns\_recursor.cc.
This loop is what MTasker refers to as **the kernel**. In this loop, any
packets or other MPlexer events get translated either into new MThreads
within MTasker, or into calls to sendEvent(), which in turn wakes up
other MThreads.

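The callback registration described in this section can be illustrated
with a toy model. The sketch below is not the FDMultiplexer code: the
class name ``ToyMultiplexer`` is made up, ``std::any`` stands in for
boost::any, only the read side is modelled, and ``run()`` is handed the
list of ready descriptors instead of asking the operating system for it.

.. code-block:: cpp

   #include <any>
   #include <functional>
   #include <map>
   #include <memory>
   #include <vector>

   // Hypothetical stand-in for FDMultiplexer, read side only.
   class ToyMultiplexer
   {
   public:
     using Callback = std::function<void(int fd, std::any& param)>;

     // Register a descriptor; callback and parameter are stored inside.
     void addReadFD(int fd, Callback cb, std::any param = {})
     {
       d_readers[fd] = std::make_shared<Entry>(Entry{std::move(cb), std::move(param)});
     }

     void removeReadFD(int fd)
     {
       d_readers.erase(fd);
     }

     // Call the callback of every ready descriptor that is still registered.
     // Returns the number of callbacks invoked.
     int run(const std::vector<int>& readyFDs)
     {
       int called = 0;
       for (int fd : readyFDs) {
         auto it = d_readers.find(fd);
         if (it == d_readers.end()) {
           continue;
         }
         // Hold a reference so the callback may remove its own registration.
         std::shared_ptr<Entry> entry = it->second;
         entry->callback(fd, entry->param);
         ++called;
       }
       return called;
     }

   private:
     struct Entry
     {
       Callback callback;
       std::any param;
     };
     std::map<int, std::shared_ptr<Entry>> d_readers;
   };

A callback registered this way receives its stored parameter by
reference on every event, and a descriptor that has removed itself is
silently skipped on later calls to ``run()``.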
MOADNSParser
------------

Yes, this does stand for **the Mother of All DNS Parsers**. And even
that name does not do it justice! The MOADNSParser is the third attempt
I've made at writing a DNS packet parser and after two miserable failures,
I think I've finally gotten it right.

Writing and parsing DNS packets, and the DNS records they contain,
consists of four things:

1. Parsing a DNS record (from packet) into memory
2. Generating a DNS record from memory (to packet)
3. Writing out memory to user-readable zone format
4. Reading said zone format into memory

This gets tedious very quickly, as one needs to implement all four
operations for each new record type, and there are dozens of them.

While writing the MOADNSParser, it was discovered there is a remarkable
symmetry between these four transitions. DNS records are nearly always
laid out in the same order in memory as in their zone format
representation. And reading is nothing but inverse writing.

So, the MOADNSParser is built around the notion of a **Conversion**, and
we write all Conversion types once. So we have a Conversion from IP
address in memory to an IP address in a DNS packet, and vice versa. And
we have a Conversion from an IP address in zone format to memory, and
vice versa.

This in turn means that the entire implementation of the ARecordContent
is as follows (wait for it!)

::

   conv.xfrIP(d_ip);

Through the use of the magic called C++ templates, this one line
does everything needed to perform the four operations mentioned above.

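To make the symmetry concrete, here is a minimal sketch of the idea. It
is not the PowerDNS code: ``ToyPacketWriter``, ``ToyPacketReader`` and
``ToyARecord`` are invented names, and only the two packet directions
are shown. The zone format directions would be two more classes offering
the same ``xfrIP()`` member.

.. code-block:: cpp

   #include <array>
   #include <cstddef>
   #include <cstdint>
   #include <vector>

   // Hypothetical writer: appends the four address bytes to a packet buffer.
   struct ToyPacketWriter
   {
     std::vector<uint8_t> bytes;

     void xfrIP(std::array<uint8_t, 4>& ip)
     {
       bytes.insert(bytes.end(), ip.begin(), ip.end());
     }
   };

   // Hypothetical reader: fills the address from the next four packet bytes.
   struct ToyPacketReader
   {
     const std::vector<uint8_t>& bytes;
     std::size_t pos = 0;

     void xfrIP(std::array<uint8_t, 4>& ip)
     {
       for (auto& octet : ip) {
         octet = bytes.at(pos++); // at() throws on a truncated packet
       }
     }
   };

   struct ToyARecord
   {
     std::array<uint8_t, 4> d_ip{};

     // Written once, instantiated for every conversion direction.
     template <class Convertor>
     void xfr(Convertor& conv)
     {
       conv.xfrIP(d_ip);
     }
   };

The record type only states which fields it has and in what order. Each
convertor decides whether ``xfrIP()`` means reading or writing.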
At one point, I got really obsessed with PowerDNS memory use. So, how do
we store DNS data in the PowerDNS recursor? I mentioned **memory** above
a lot - this means we could just store the DNSRecordContent objects.
However, this would be wasteful.

For example, storing the following:

::

   www.example.org 3600 IN CNAME outpost.example.org.

would duplicate a lot of data. So, what is actually stored is a partial
DNS packet. To store the CNAMEDNSRecordContent that corresponds to the
above, we generate a DNS packet that has **www.example.org IN CNAME** as
its question. Then we add **3600 IN CNAME outpost.example.org.** as its
answer. Then we chop off the question part, and store the rest under the
**www.example.org IN CNAME** key in our cache.

When we need to retrieve **www.example.org IN CNAME**, the inverse
happens. We find the proper partial packet, prefix it with a question
for **www.example.org IN CNAME**, and expand the resulting packet into
the answer **3600 IN CNAME outpost.example.org.**

Why do we go through all these motions? Because of DNS compression,
which allows us to omit the whole **.example.org.** part, saving us 9
bytes. This is amplified when storing multiple MX records which all look
more or less alike. This optimization is not performed yet though.

Even without compression, it makes sense as all records are
automatically stored very compactly.

The PowerDNS recursor only parses a number of **well known record
types** and passes all other information across verbatim - it doesn't
have to know about the content it is serving.

The C++ Standard Library / Boost
--------------------------------

C++ is a powerful language. Perhaps a bit too powerful at times: you can
turn a program into a real freakshow if you so desire.

PowerDNS generally tries not to go overboard in this respect, but we do
build upon a very advanced part of the `Boost <http://www.boost.org>`__
C++ library: `boost::multi index
container <http://boost.org/libs/multi_index/doc/index.html>`__.

This container provides the equivalent of SQL indexes on multiple keys.
It also implements compound keys, which PowerDNS uses as well.

The main DNS cache is implemented as a multi index container object,
with a compound key on the name and type of a record. Furthermore, the
cache is sequenced: each time a record is accessed it is moved to the
end of the list. When cleanup is performed, we start at the beginning.
New records also get inserted at the end. For DNS correctness, the sort
order of the cache is case insensitive.

The multi index container appears in other parts of PowerDNS, and
MTasker as well.

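The behaviour of such a sequenced cache with a case-insensitive compound
key can be sketched using only the standard library. This is a
simplified stand-in, not the actual cache: a ``std::list`` plays the
role of the sequenced index and a ``std::map`` that of the ordered
index, and the names ``ToyCache`` and ``CIKeyLess`` are invented for the
example.

.. code-block:: cpp

   #include <algorithm>
   #include <cctype>
   #include <cstddef>
   #include <cstdint>
   #include <iterator>
   #include <list>
   #include <map>
   #include <string>
   #include <utility>

   // Case-insensitive ordering on (name, type), standing in for the compound key.
   struct CIKeyLess
   {
     using Key = std::pair<std::string, uint16_t>;

     static std::string lower(std::string s)
     {
       std::transform(s.begin(), s.end(), s.begin(),
                      [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
       return s;
     }

     bool operator()(const Key& a, const Key& b) const
     {
       return std::make_pair(lower(a.first), a.second) < std::make_pair(lower(b.first), b.second);
     }
   };

   class ToyCache
   {
   public:
     using Key = CIKeyLess::Key;

     // New or replaced records go to the end of the sequence.
     void insert(const Key& key, std::string content)
     {
       auto found = d_index.find(key);
       if (found != d_index.end()) {
         d_sequence.erase(found->second);
         d_index.erase(found);
       }
       d_sequence.emplace_back(key, std::move(content));
       d_index[key] = std::prev(d_sequence.end());
     }

     // A hit moves the record to the end of the sequence.
     const std::string* get(const Key& key)
     {
       auto found = d_index.find(key);
       if (found == d_index.end()) {
         return nullptr;
       }
       d_sequence.splice(d_sequence.end(), d_sequence, found->second);
       return &found->second->second;
     }

     // Cleanup starts at the beginning, where the least recently used record is.
     void evictOldest()
     {
       if (d_sequence.empty()) {
         return;
       }
       d_index.erase(d_sequence.front().first);
       d_sequence.pop_front();
     }

     std::size_t size() const { return d_sequence.size(); }

   private:
     using Sequence = std::list<std::pair<Key, std::string>>;
     Sequence d_sequence;
     std::map<Key, Sequence::iterator, CIKeyLess> d_index;
   };

A multi index container gives both views over a single set of nodes,
which avoids keeping two structures in sync by hand as this sketch does.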
Actual DNS Algorithm
--------------------

The DNS RFCs do define the DNS algorithm, but you can't actually
implement it exactly that way; it was written in 1987.

Also, like what happened to HTML, it is expected that even non-standards
conforming domains work, and a sizable fraction of them are misconfigured
these days.

Everything begins with SyncRes::beginResolve(), which knows nothing
about sockets, and needs to be passed the domain name, DNS type and DNS
class we are interested in. It returns a vector of
DNSResourceRecord objects, ready for writing either into an answer
packet, or for internal use.

After checking if the query is for any of the hardcoded domains
(localhost, version.bind, id.server), the query is passed to
SyncRes::doResolve, together with two vital parameters: the ``depth``
and the ``beenthere`` set. As the word **recursor** implies, we will need
to recurse for answers. The ``depth`` parameter tracks how deep we've
recursed already.

The ``beenthere`` set prevents loops. At each step, when a nameserver is
queried, it is added to the ``beenthere`` set. No nameserver in the set
will ever be queried again for the same question in the recursion
process - we know for a fact it won't help us further. This prevents the
process from getting stuck in loops.

SyncRes::doResolve first checks if there is a CNAME in cache, using
SyncRes::doCNAMECacheCheck, for the domain name and type queried and if
so, changes the query (which is passed by reference) to the domain the
CNAME points to. This is the cause of many DNS problems: a CNAME record
really means **start over with this query**.

This is followed by a call to SyncRes::doCacheCheck, which consults the
cache for a straight answer to the question (as possibly rerouted by a
CNAME). This function also consults the so-called negative cache, but we
won't go into that just yet.

If this function finds the correct answer, and the answer hasn't expired
yet, it gets returned and we are (almost) done. This happens in 80 to
90% of all queries. Which is good, as what follows is a lot of work.

To recap:

1. beginResolve() - entry point, does checks for hardcoded domains
2. doResolve() - start of recursion process, gets passed ``depth`` of 0
   and empty ``beenthere`` set
3. doCNAMECacheCheck() - check if there is a CNAME in cache which would
   reroute the query
4. doCacheCheck() - see if cache contains straight answer to possibly
   rerouted query.

If the data we were queried for was in the cache, we are almost done.
One final step, which might as well be optional as nobody benefits from
it, is SyncRes::addCruft. This function does additional processing,
which means that if the query was for the MX record of a domain, we also
add the IP address of the mail exchanger.

The non-cached case
^^^^^^^^^^^^^^^^^^^

This is where things get interesting, because we start out with a nearly
empty cache and have to go out to the net to get answers to fill it.

The way DNS works, if you don't know the answer to a question, you find
somebody who does. Initially you have no other place to go than the root
servers. This is embodied in the SyncRes::getBestNSNamesFromCache
method, which gets passed the domain we are interested in, as well as
the ``depth`` and ``beenthere`` parameters mentioned earlier.

From now on, assume our query will be for ``www.powerdns.com.``.
SyncRes::getBestNSNamesFromCache will first check if there are NS
records in cache for ``www.powerdns.com.``, but there won't be. It then
checks ``powerdns.com. NS``, and while these records do exist on the
internet, the recursor doesn't know about them yet. So, we go on to
check the cache for ``com. NS``, for which the same holds. Finally we
end up checking for ``. NS``, and these we do know about: they are the
root servers and were loaded into PowerDNS on startup.

So, SyncRes::getBestNSNamesFromCache fills out a set with the **names**
of nameservers it knows about for the ``.`` zone.

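The order in which names are tried can be sketched as a small helper
that chops off one label at a time. The function name ``nsLookupOrder``
is invented for this illustration, and it assumes an absolute name with
a trailing dot and no escaped dots. The real method works on the cache
and on DNS names rather than on strings.

.. code-block:: cpp

   #include <string>
   #include <vector>

   // Returns the names to probe for NS records, most specific first.
   std::vector<std::string> nsLookupOrder(std::string name)
   {
     std::vector<std::string> order;
     while (true) {
       order.push_back(name);
       if (name == ".") {
         break; // the root is always the last candidate
       }
       auto dot = name.find('.');
       if (dot == std::string::npos || dot + 1 == name.size()) {
         name = "."; // last label removed, only the root remains
       }
       else {
         name = name.substr(dot + 1); // drop the leftmost label
       }
     }
     return order;
   }

For ``www.powerdns.com.`` this yields the four names checked in the
walkthrough above, ending at the root.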
This set, together with the original query ``www.powerdns.com.``, gets
passed to SyncRes::doResolveAt. This function can't go to work
immediately though, as it only knows the names of nameservers it can try.
This is like asking for directions and instead of hearing **take the
third right** you are told **go to 123 Fifth Avenue, and take a right**:
the answer doesn't help you further unless you know where 123 Fifth
Avenue is.

SyncRes::doResolveAt first shuffles the nameservers both randomly and on
performance order. If it knows a nameserver was fast in the past, it
will get queried first. More about this later.

Ok, here is the part where things get a bit scary. How does
SyncRes::doResolveAt find the IP address of a nameserver? Well, by
calling SyncRes::getAs (**get A records**), which in turn calls...
SyncRes::doResolve. Hang on! That's where we came from! Massive
potential for loops here. Well, it turns out that for any domain which
can be resolved, this loop terminates. We do pass the ``beenthere`` set
again, which makes sure we don't keep on asking the same questions to
the same nameservers.

Ok, SyncRes::getAs will give us the IP addresses of the chosen
root-server, because these IP addresses were loaded on startup. We then
ask these IP addresses (nameservers can have several) for their best
answer for ``www.powerdns.com.``. This is done using the LWRes class
and specifically LWRes::asyncresolve, which gets passed domain name,
type and IP address. This function interacts with MTasker and MPlexer
above in ways which needn't concern us now. When it returns, the LWRes
object contains the best answers the queried server had for our domain,
which in this case means it tells us about the nameservers of ``com.``,
and their IP addresses.

All the relevant answers it gives are stored in the cache (or actually,
merged), after which SyncRes::doResolveAt (which we are still in)
evaluates what to do now.

There are 6 options:

1. The final answer is in, we are done, return to SyncRes::doResolve and
   SyncRes::beginResolve
2. The nameserver we queried tells us the domain we asked for
   authoritatively does not exist. In case of the root-servers, this
   happens when we query for ``www.powerdns.kom.`` for example, there
   is no ``kom.``. Return to SyncRes::beginResolve, we are done.
3. A lesser form - it tells us it is authoritative for the query we
   asked about, but there is no record matching our type. This happens
   when querying for the IPv6 address of a host which only has an IPv4
   address. Return to SyncRes::beginResolve, we are done.
4. The nameserver passed us a CNAME to another domain, and we need to
   reroute. Go to SyncRes::doResolve for the new domain.
5. The nameserver did not know about the domain, but does know who does,
   a *referral*. Stay within doResolveAt and loop to these new
   nameservers.
6. The nameserver replied saying *no idea*. This is called a *lame
   delegation*. Stay within SyncRes::doResolveAt and try the other
   nameservers we have for this domain.

When not redirected using a CNAME, this function will loop until it has
exhausted all nameservers and all their IP addresses. DNS is
surprisingly resilient, but often only a single non-broken
nameserver is left to answer queries, and we need to be prepared for that.

This is the whole DNS algorithm in PowerDNS. It contains a lot of
tricky bits though, related to the caches and things like RPZ handling
and DNSSEC validation.

QName Minimization
------------------

Since the 4.3 release, the recursor implements a relaxed form of QName
Minimization. This is a method to enhance privacy and is described in
:rfc:`7816`. By asking the authoritative server not for the full
QName, but for one more label than we already know it is authoritative for,
we do not leak which exact names are queried to servers higher up in
the hierarchy.

The implementation uses a relaxed form of QName Minimization, following
the recommendations found in the paper "A First Look at QNAME
Minimization in the Domain Name System" by De Vries et al.

We originally started with using NS probes as the example algorithm in
the RFC draft recommends.

We then quickly discovered that using NS probes was somewhat
troublesome and after reading the mentioned paper we changed to QType
A for probes, which worked better. We did not implement the extra
label prepend, not understanding why that would be needed (a more
recent draft of the RFC came to the same conclusion).

Following the recommendations in the paper we also implemented larger
steps when many labels are present. We use steps 1-1-1-3-3-...; we
already have a limit on the number of outgoing queries induced by a
client query. We do a final full QName query if we get an unexpected
error. This happens when we encounter authoritative servers that are
not fully compliant; there are still many servers like that. The
recursor keeps track of this fallback scenario in the
``qname-min-fallback-success`` metric.

For forwarded queries, we do not use QName Minimization.


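The 1-1-1-3-3 stepping can be illustrated with a small function that
lists the names that would be probed. This is a sketch under stated
assumptions, not the recursor's code: the name ``probeNames`` and its
parameters are invented, and the step counter is assumed to start at the
closest zone we already know about.

.. code-block:: cpp

   #include <algorithm>
   #include <cstddef>
   #include <string>
   #include <vector>

   // labels holds the query name split into labels, leftmost first.
   // known is how many rightmost labels are already covered by a zone
   // we know about (0 means we only know the root).
   std::vector<std::string> probeNames(const std::vector<std::string>& labels,
                                       std::size_t known = 0)
   {
     std::vector<std::string> probes;
     std::size_t used = std::min(known, labels.size());
     std::size_t step = 0;
     while (used < labels.size()) {
       // The first three probes add one label each, later ones add three.
       used = std::min(labels.size(), used + (step < 3 ? 1 : 3));
       ++step;
       std::string name;
       for (std::size_t i = labels.size() - used; i < labels.size(); ++i) {
         name += labels[i] + ".";
       }
       probes.push_back(name);
     }
     return probes;
   }

For an eight-label name this produces five probes instead of eight,
with the last one being the full QName.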
Some of the things we glossed over
----------------------------------

Whenever a packet is sent to a remote nameserver, the response time is
stored in the SyncRes::s\_nsSpeeds map, using an exponentially weighted
moving average. This EWMA averages out different response times, and
also makes them decrease over time. This means that a nameserver that
hasn't been queried recently gradually becomes **faster** in the eyes of
PowerDNS, giving it a chance again.

A timeout is accounted as a 1s response time, which should take that
server out of the running for a while.

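A decaying average of this kind can be sketched as follows. The class
name ``ToyDecayingEwma``, the time constant and the weighting formula
are assumptions chosen for the illustration; they are not taken from
the recursor source.

.. code-block:: cpp

   #include <cmath>

   class ToyDecayingEwma
   {
   public:
     explicit ToyDecayingEwma(double tauSeconds) : d_tau(tauSeconds) {}

     // Blend in a new sample. History counts for at most one half and
     // counts for less the longer ago it was recorded.
     void submit(double sampleMsec, double now)
     {
       double keep = 0.5 * std::exp(-(now - d_last) / d_tau);
       d_value = keep * d_value + (1.0 - keep) * sampleMsec;
       d_last = now;
     }

     // Reading the value decays it, so an idle server looks faster over time.
     double get(double now) const
     {
       return d_value * std::exp(-(now - d_last) / d_tau);
     }

   private:
     double d_tau;
     double d_value = 0.0;
     double d_last = 0.0;
   };

Submitting a timeout as a 1000 ms sample pushes the average up sharply,
after which the decay slowly brings the server back into contention.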
Furthermore, queries are throttled. This means that each query to a
nameserver that has failed is accounted in the ``s_throttle`` object.
Before performing a new query, the query and the nameserver are looked
up via shouldThrottle. If that lookup says the query should be throttled,
the query is assumed to have failed without even being performed. This
saves a lot of network traffic and makes PowerDNS quick to respond to
lame servers.

It also offers a modicum of protection against birthday attack powered
spoofing attempts, as PowerDNS will not inundate a broken server with
queries.

The negative query cache we mentioned earlier caches the cases 2 and 3
in the enumeration above. This data needs to be stored separately, as it
represents **non-data**. Each negcache query entry is the name of the
SOA record that was presented with the evidence of non-existence. This
SOA record is then retrieved from the regular cache, but with the TTL
that originally came with the NXDOMAIN (case 2) or NXRRSET (case 3).

The Recursor Cache
------------------

As mentioned before, the cache stores partial packets. It also stores
not the **Time To Live** of records, but in fact the **Time To Die**. If
the cache contains data, but it is expired, that data should not be
deemed present. This bit of PowerDNS has proven tricky, leading to
deadlocks in the past.

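The difference between storing a TTL and storing a Time To Die can be
shown in a few lines. The names ``ToyCacheEntry``, ``makeEntry`` and
``remainingTTL`` are invented for this sketch and do not correspond to
the real cache code.

.. code-block:: cpp

   #include <cstdint>
   #include <ctime>
   #include <optional>

   struct ToyCacheEntry
   {
     time_t d_ttd; // absolute expiry time, not a relative TTL
   };

   // Convert the TTL from the wire into an absolute expiry time on insertion.
   ToyCacheEntry makeEntry(uint32_t ttl, time_t now)
   {
     return ToyCacheEntry{now + static_cast<time_t>(ttl)};
   }

   // TTL to hand out now, or nothing if the entry must be treated as absent.
   std::optional<uint32_t> remainingTTL(const ToyCacheEntry& entry, time_t now)
   {
     if (entry.d_ttd <= now) {
       return std::nullopt;
     }
     return static_cast<uint32_t>(entry.d_ttd - now);
   }

Storing the expiry time means nothing has to be updated as time passes.
The remaining TTL is computed only when an entry is read.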
There are some other very tricky things to deal with. For example,
through a process called **more details**, a domain might have more
nameservers than listed in its parent zone. So, there might only be two
nameservers for ``powerdns.com.`` in the ``com.`` zone, but the
``powerdns.com.`` zone might list more, say four.

This means that the cache should not, when talking to the ``com.``
servers later on, overwrite these four nameservers with only the two
copies the ``com.`` servers pass us.

However, in other cases (like for example for SOA and CNAME records),
new data should overwrite old data.

Note that PowerDNS deviates from RFC 2181 (section 5.4.1) in this
respect.

Starting with version 4.7.0, there is a mechanism to save the
parent NS set if it contains *more* names than the child NS set.
This allows falling back to the saved parent NS set on resolution errors
using the child specified NS set.
As experience shows, this configuration error is encountered in the
wild often enough to warrant this workaround.
See :ref:`setting-save-parent-ns-set`.

.. _serve-stale:

Serve Stale
-----------

Starting with version 4.8.0, the Recursor implements ``Serve Stale`` (:rfc:`8767`).
This is a mechanism that allows records in the record cache that are expired
but that cannot be refreshed (due to network or authoritative server issues) to be served anyway.

The :ref:`setting-serve-stale-extensions` setting determines how many times the record's lifetime can be extended.
Each extension of the lifetime of a record lasts 30s.
A value of 1440 means the maximum extra lifetime is 30 * 1440 seconds, which is 12 hours.
If the original TTL of a record was less than 30s, the original TTL will be used as the extension period.

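The arithmetic above can be written down as a small helper. The function
name ``maxStaleSeconds`` and the constant name are invented for this
sketch. Only the 30 second period and the rule for short TTLs come from
the text.

.. code-block:: cpp

   #include <algorithm>
   #include <cstdint>

   constexpr uint32_t kExtensionPeriod = 30; // seconds per extension

   // Upper bound on the extra lifetime of a stale record, in seconds.
   constexpr uint64_t maxStaleSeconds(uint32_t originalTTL, uint32_t extensions)
   {
     // Records with a TTL below 30s are extended by their own TTL instead.
     return static_cast<uint64_t>(std::min(originalTTL, kExtensionPeriod)) * extensions;
   }

   static_assert(maxStaleSeconds(3600, 1440) == 43200, "30 * 1440 seconds is 12 hours");

A record with an original TTL of 10 seconds and 1440 extensions would be
served stale for at most 14400 seconds, which is 4 hours.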
On each extension an asynchronous task to resolve the name will be created.
If that task succeeds, the record will not be served stale anymore, as an up-to-date value is now available.

If :ref:`setting-serve-stale-extensions` is not zero, expired records will be kept in the record cache until the number of records becomes too large.
Cache eviction will then be done on a least-recently-used basis.

When dumping the cache using ``rec_control dump-cache`` the ``ss`` value shows the serve stale extension count.
A value of 0 means the record is not being served stale, while
a positive value shows the number of times the serve stale period has been extended.

Some small things
-----------------

The server-side part of PowerDNS (``pdns_recursor.cc``), which listens
to queries by end-users, is fully IPv6 capable using the ComboAddress
class. This class is in fact a union of a ``struct sockaddr_in`` and a
``struct sockaddr_in6``. As long as the ``sin_family`` (or
``sin6_family``) and ``sin_port`` members are in the same place, this
works just fine, allowing us to pass a ComboAddress\*, cast to a
``sockaddr*`` to the socket functions. For convenience, the ComboAddress
also offers a length() method which can be used to indicate the length -
either sizeof(sockaddr\_in) or sizeof(sockaddr\_in6).

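The union trick can be sketched on a POSIX system as follows.
``ToyComboAddress`` and its member names are made up for the
illustration and the real class offers far more. The ``static_assert``
lines check the layout condition mentioned above.

.. code-block:: cpp

   #include <cstddef>
   #include <cstring>
   #include <netinet/in.h>
   #include <sys/socket.h>

   // The trick only works if family and port sit at the same offsets.
   static_assert(offsetof(sockaddr_in, sin_family) == offsetof(sockaddr_in6, sin6_family),
                 "address family must be in the same place");
   static_assert(offsetof(sockaddr_in, sin_port) == offsetof(sockaddr_in6, sin6_port),
                 "port must be in the same place");

   union ToyComboAddress
   {
     struct sockaddr_in sin4;
     struct sockaddr_in6 sin6;

     // Start out as an all-zero IPv4 address.
     ToyComboAddress()
     {
       std::memset(this, 0, sizeof(*this));
       sin4.sin_family = AF_INET;
     }

     // Length to pass to the socket functions along with the address.
     socklen_t length() const
     {
       return static_cast<socklen_t>(sin4.sin_family == AF_INET ? sizeof(sin4) : sizeof(sin6));
     }
   };

Because both structures start with the address family, code can inspect
that field through either member before deciding how to treat the rest.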
Access to the recursor is governed through the NetmaskGroup class, which
internally contains Netmask objects, each of which in turn contains a
ComboAddress.