to the Postfix-owned data_directory. File: global/data_redirect.c.
Lots of pathname fixes in the examples of TLS_README and
- postconf(5); -lm library screw-up in 8qmgr/Makefile.in.
+ postconf(5); -lm library screw-up in queue manager Makefiles.
+
+20071207
+
+ Cleanup: pathname fixes in documentation; unnecessary queue
+ scan in the queue manager rate limiter; inverse square root
+ feedback in the queue manager concurrency scheduler. Files:
+ mantools/postlink, proto/TLS_README.html, *qmgr/qmgr_queue.c.
+
+ All changes up to this point should be ready for Postfix 2.5.
+
+	Documentation: updated nqmgr preemptive scheduler documentation
+ by Patrik Rak. File: proto/SCHEDULER_README.html.
and removes mail from the queue after the last delivery attempt. There are two
major classes of mechanisms that control the operation of the queue manager.
-The first class of mechanisms is concerned with the number of concurrent
-deliveries to a specific destination, including decisions on when to suspend
-deliveries after persistent failures:
+ * Concurrency scheduling is concerned with the number of concurrent
+ deliveries to a specific destination, including decisions on when to
+ suspend deliveries after persistent failures.
+ * Preemptive scheduling is concerned with the selection of email messages and
+ recipients for a given destination.
+ * Credits. This document would not be complete without them.
- * Concurrency scheduling
-
- o Summary of the Postfix 2.5 concurrency feedback algorithm
- o Summary of the Postfix 2.5 "dead destination" detection algorithm
- o Pseudocode for the Postfix 2.5 concurrency scheduler
- o Results for delivery to concurrency limited servers
- o Discussion of concurrency limited server results
- o Limitations of less-than-1 per delivery feedback
- o Concurrency configuration parameters
-
-The second class of mechanisms is concerned with the selection of what mail to
-deliver to a given destination:
-
- * Preemptive scheduling
+C\bCo\bon\bnc\bcu\bur\brr\bre\ben\bnc\bcy\by s\bsc\bch\bhe\bed\bdu\bul\bli\bin\bng\bg
- o Why the non-preemptive Postfix queue manager was replaced
- o How the non-preemptive queue manager scheduler works
+The following sections document the Postfix 2.5 concurrency scheduler, after a
+discussion of the limitations of the existing concurrency scheduler. This is
+followed by results of medium-concurrency experiments, and a discussion of
+trade-offs between performance and robustness.
-And this document would not be complete without:
+The material is organized as follows:
- * Credits
+ * Drawbacks of the existing concurrency scheduler
+ * Summary of the Postfix 2.5 concurrency feedback algorithm
+ * Summary of the Postfix 2.5 "dead destination" detection algorithm
+ * Pseudocode for the Postfix 2.5 concurrency scheduler
+ * Results for delivery to concurrency limited servers
+ * Discussion of concurrency limited server results
+ * Limitations of less-than-1 per delivery feedback
+ * Concurrency configuration parameters
-C\bCo\bon\bnc\bcu\bur\brr\bre\ben\bnc\bcy\by s\bsc\bch\bhe\bed\bdu\bul\bli\bin\bng\bg
+D\bDr\bra\baw\bwb\bba\bac\bck\bks\bs o\bof\bf t\bth\bhe\be e\bex\bxi\bis\bst\bti\bin\bng\bg c\bco\bon\bnc\bcu\bur\brr\bre\ben\bnc\bcy\by s\bsc\bch\bhe\bed\bdu\bul\ble\ber\br
-This section documents the Postfix 2.5 concurrency scheduler. Prior Postfix
-versions used a simple but robust algorithm where the per-destination delivery
-concurrency was decremented by 1 after a delivery suffered connection or
-handshake failure, and was incremented by 1 otherwise. Of course the
-concurrency was never allowed to exceed the maximum per-destination concurrency
-limit. And when a destination's concurrency level dropped to zero, the
-destination was declared "dead" and delivery was suspended.
+From the start, Postfix has used a simple but robust algorithm where the per-
+destination delivery concurrency is decremented by 1 after a delivery suffered
+connection or handshake failure, and incremented by 1 otherwise. Of course the
+concurrency is never allowed to exceed the maximum per-destination concurrency
+limit. And when a destination's concurrency level drops to zero, the
+destination is declared "dead" and delivery is suspended.
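+
+As a minimal illustration (not the actual Postfix source; the names are
+made up), this +/-1 feedback rule amounts to:
+
+    /* Sketch of the pre-2.5 +/-1 concurrency feedback; illustrative only. */
+    typedef struct {
+        int concurrency;                /* current delivery concurrency */
+        int limit;                      /* maximum per-destination concurrency */
+        int dead;                       /* non-zero: delivery is suspended */
+    } DEST;
+
+    void feedback(DEST *d, int failed)  /* connection or handshake failure? */
+    {
+        if (failed) {
+            if (--d->concurrency <= 0) {
+                d->concurrency = 0;
+                d->dead = 1;            /* destination declared "dead" */
+            }
+        } else if (d->concurrency < d->limit) {
+            d->concurrency += 1;        /* never exceed the limit */
+        }
+    }
+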
-Drawbacks of the old +/-1 feedback per delivery are:
+Drawbacks of +/-1 concurrency feedback per delivery are:
* Overshoot due to exponential delivery concurrency growth with each pseudo-
- cohort(*). For example, with the default initial concurrency of 5,
- concurrency would proceed over time as (5-10-20).
+ cohort(*). This can be an issue with high-concurrency channels. For
+ example, with the default initial concurrency of 5, concurrency would
+ proceed over time as (5-10-20).
* Throttling down to zero concurrency after a single pseudo-cohort(*)
failure. This was especially an issue with low-concurrency channels where a
All results in the previous sections are based on the first delivery runs only;
they do not include any second etc. delivery attempts. The first two examples
-show that the feedback method matters little when concurrency is limited due to
-congestion. This is because the initial concurrency is already at the client's
-concurrency maximum, and because there is 10-100 times more positive than
-negative feedback. Under these conditions, the contribution from SMTP
-connection caching is negligible.
+show that the effect of feedback is negligible when concurrency is limited due
+to congestion. This is because the initial concurrency is already at the
+client's concurrency maximum, and because there is 10-100 times more positive
+than negative feedback. Under these conditions, it is no surprise that the
+contribution from SMTP connection caching is also negligible.
In the last example, the old +/-1 feedback per delivery will defer 50% of the
mail when confronted with an active (anvil-style) server concurrency limit,
P\bPr\bre\bee\bem\bmp\bpt\bti\biv\bve\be s\bsc\bch\bhe\bed\bdu\bul\bli\bin\bng\bg
-This is the beginning of documentation for a preemptive queue manager
-scheduling algorithm by Patrik Rak. For a long time, this code was made
-available under the name "nqmgr(8)" (new queue manager), as an optional module.
-As of Postfix 2.1 this is the default queue manager, which is always called
-"qmgr(8)". The old queue manager will for some time will be available under the
-name of "oqmgr(8)".
-
-W\bWh\bhy\by t\bth\bhe\be n\bno\bon\bn-\b-p\bpr\bre\bee\bem\bmp\bpt\bti\biv\bve\be P\bPo\bos\bst\btf\bfi\bix\bx q\bqu\bue\beu\bue\be m\bma\ban\bna\bag\bge\ber\br w\bwa\bas\bs r\bre\bep\bpl\bla\bac\bce\bed\bd
-
-The non-preemptive Postfix scheduler had several limitations due to unfortunate
-choices in its design.
-
- 1. Round-robin selection by destination for mail that is delivered via the
- same message delivery transport. The round-robin strategy was chosen with
- the intention to prevent a single (destination) site from using up too many
- mail delivery resources. However, that strategy penalized inbound mail on
- bi-directional gateways. The poor suffering inbound destination would be
- selected only 1/number-of-destinations of the time, even when it had more
- mail than other destinations, and thus mail could be delayed.
-
- Victor Duchovni found a workaround: use different message delivery
- transports, and thus avoid the starvation problem. The Patrik Rak scheduler
- solves this problem by using FIFO selection.
-
- 2. A second limitation of the old Postfix scheduler was that delivery of bulk
- mail would block all other deliveries, causing large delays. Patrik Rak's
- scheduler allows mail with fewer recipients to slip past bulk mail in an
- elegant manner.
-
-H\bHo\bow\bw t\bth\bhe\be n\bno\bon\bn-\b-p\bpr\bre\bee\bem\bmp\bpt\bti\biv\bve\be q\bqu\bue\beu\bue\be m\bma\ban\bna\bag\bge\ber\br s\bsc\bch\bhe\bed\bdu\bul\ble\ber\br w\bwo\bor\brk\bks\bs
-
-The following text is from Patrik Rak and should be read together with the
-postconf(5) manual that describes each configuration parameter in detail.
-
-From user's point of view, oqmgr(8) and qmgr(8) are both the same, except for
-how next message is chosen when delivery agent becomes available. You already
-know that oqmgr(8) uses round-robin by destination while qmgr(8) uses simple
-FIFO, except for some preemptive magic. The postconf(5) manual documents all
-the knobs the user can use to control this preemptive magic - there is nothing
-else to the preemption than the quite simple conditions described in there.
-
-As for programmer-level documentation, this will have to be extracted from all
-those emails we have exchanged with Wietse [rats! I hoped that Patrik would do
-the work for me -- Wietse] But I think there are no missing bits which we have
-not mentioned in our conversations.
-
-However, even from programmer's point of view, there is nothing more to add to
-the message scheduling idea itself. There are few things which make it look
-more complicated than it is, but the algorithm is the same as the user
-perceives it. The summary of the differences of the programmer's view from the
-user's view are:
-
- 1. Simplification of terms for users: The user knows about messages and
- recipients. The program itself works with jobs (one message is split among
- several jobs, one per each transport needed to deliver the message) and
- queue entries (each entry may group several recipients for same
- destination). Then there is the peer structure introduced by qmgr(8) which
- is simply per-job analog of the queue structure.
-
- 2. Dealing with concurrency limits: The actual implementation is complicated
- by the fact that the messages (resp. jobs) may not be delivered in the
- exactly scheduled order because of the concurrency limits. It is necessary
- to skip some "blocker" jobs when the concurrency limit is reached and get
- back to them again when the limit permits.
-
- 3. Dealing with resource limits: The actual implementation is complicated by
- the fact that not all recipients may be read in-core. Therefore each
- message has some recipients in-core and some may remain on-file. This means
- that a) the preemptive algorithm needs to work with recipient count
- estimates instead of exact counts, b) there is extra code which needs to
- manipulate the per-transport pool of recipients which may be read in-core
- at the same time, and c) there is extra code which needs to be able to read
- recipients into core in batches and which is triggered at appropriate
- moments.
-
- 4. Doing things efficiently: All important things I am aware of are done in
- the minimum time possible (either directly or at least when amortized
- complexity is used), but to choose which job is the best candidate for
- preempting the current job requires linear search of up to all transport
- jobs (the worst theoretical case - the reality is much better). As this is
- done every time the next queue entry to be delivered is about to be chosen,
- it seemed reasonable to add cache which minimizes the overhead. Maintenance
- of this candidate cache slightly obfuscates things.
-
-The points 2 and 3 are those which made the implementation (look) complicated
-and were the real coding work, but I believe that to understand the scheduling
-algorithm itself (which was the real thinking work) is fairly easy.
+This document attempts to describe the new queue manager and its preemptive
+scheduler algorithm. Note that the document was originally written to describe
+the changes between the new queue manager (in this text referred to as nqmgr,
+the name it was known by before it became the default queue manager) and the
+old queue manager (referred to as oqmgr). This is why it refers to oqmgr every
+so often.
+
+This document is divided into sections as follows:
+
+ * The structures used by nqmgr
+ * What happens when nqmgr picks up the message - how it is assigned to
+ transports, jobs, peers, entries
+ * How does the entry selection work
+ * How does the preemption work - what messages may be preempted and how and
+ what messages are chosen to preempt them
+ * How destination concurrency limits affect the scheduling algorithm
+ * Dealing with memory resource limits
+
+T\bTh\bhe\be s\bst\btr\bru\buc\bct\btu\bur\bre\bes\bs u\bus\bse\bed\bd b\bby\by n\bnq\bqm\bmg\bgr\br
+
+Let's start by recapitulating the structures and terms used when
+referring to the queue manager and how it operates. Many of these are
+partially described elsewhere, but it is nice to have a coherent
+overview in one place:
+
+ * Each message structure represents one mail message which Postfix is to
+ deliver. The message recipients specify to what destinations the message
+ is to be delivered and what transports are going to be used for the
+ delivery.
+
+ * Each recipient entry groups a batch of recipients of one message which are
+ all going to be delivered to the same destination.
+
+ * Each transport structure groups everything that is going to be delivered
+ by delivery agents dedicated to that transport. Each transport maintains a
+ set of queues (describing the destinations it shall talk to) and jobs
+ (referencing the messages it shall deliver).
+
+ * Each transport queue (not to be confused with the on-disk active queue or
+ incoming queue) groups everything that is going to be delivered to a
+ given destination (aka nexthop) by its transport. Each queue belongs to
+ one transport, so each destination may be referred to by several queues,
+ one for each transport. Each queue maintains a list of all recipient
+ entries (batches of message recipients) which shall be delivered to the
+ given destination (the todo list), and a list of recipient entries
+ already being delivered by the delivery agents (the busy list).
+
+ * Each queue corresponds to multiple peer structures. Each peer structure is
+ like the queue structure, belonging to one transport and referencing one
+ destination. The difference is that it lists only the recipient entries
+ which all originate from the same message, unlike the queue structure,
+ whose entries may originate from various messages. For messages with few
+ recipients, there is usually just one recipient entry for each destination,
+ resulting in one recipient entry per peer. But for large mailing list
+ messages the recipients may need to be split into multiple recipient
+ entries, in which case the peer structure may list many entries for a
+ single destination.
+
+ * Each transport job groups everything it takes to deliver one message via
+ its transport. Each job represents one message within the context of the
+ transport. The job belongs to one transport and message, so each message
+ may have multiple jobs, one for each transport. The job groups all the peer
+ structures, which describe the destinations the job's message has to be
+ delivered to.
+
+The first four structures are common to both nqmgr and oqmgr, the latter two
+were introduced by nqmgr.
+
+These terms are used extensively in the text below; feel free to look up
+the descriptions above any time you feel you have lost track of what is
+what.
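+
+The following sketch shows how these structures reference each other. The
+names are illustrative only, not the actual definitions from the qmgr
+sources:
+
+    /* Illustrative sketch of the nqmgr structures; not the real qmgr.h. */
+    struct message;                 /* one mail message to be delivered */
+    struct transport;               /* one delivery agent type */
+
+    struct entry {                  /* batch of recipients: one message, */
+        struct message *message;    /* one destination */
+        struct queue *queue;        /* listed on the queue's todo/busy list */
+        struct peer *peer;          /* also listed on one peer */
+    };
+
+    struct queue {                  /* per-transport, per-destination */
+        struct transport *transport;
+        struct entry *todo;         /* entries waiting for delivery */
+        struct entry *busy;         /* entries being delivered */
+    };
+
+    struct peer {                   /* per-job view of one destination */
+        struct job *job;
+        struct queue *queue;
+        struct entry *entries;      /* this job's entries for this queue */
+    };
+
+    struct job {                    /* one message within one transport */
+        struct transport *transport;
+        struct message *message;
+        struct peer *peers;         /* destinations of this job's message */
+    };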
+
+W\bWh\bha\bat\bt h\bha\bap\bpp\bpe\ben\bns\bs w\bwh\bhe\ben\bn n\bnq\bqm\bmg\bgr\br p\bpi\bic\bck\bks\bs u\bup\bp t\bth\bhe\be m\bme\bes\bss\bsa\bag\bge\be
+
+Whenever nqmgr moves a queue file into the active queue, the following happens:
+It reads all necessary information from the queue file as oqmgr does, and also
+reads as many recipients as possible - more on that later, for now let's just
+pretend it always reads all recipients.
+
+Then it resolves the recipients as oqmgr does, which means obtaining (address,
+nexthop, transport) triple for each recipient. For each triple, it finds the
+transport; if it does not exist yet, it instantiates it (unless it's dead).
+Within the transport, it finds the destination queue for the given
+nexthop; if it does not exist yet, it instantiates it (unless it's dead).
+The triple is then bound to the given destination queue. This happens in
+qmgr_resolve() and is
+basically the same as in oqmgr.
+
+Then for each triple which was bound to some queue (and thus transport), the
+program finds the job which represents the message within that transport's
+context; if it does not exist yet, it instantiates it. Within the job, it finds
+the peer which represents the bound destination queue within this job's
+context;
+if it does not exist yet, it instantiates it. Finally, it stores the address
+from the resolved triple to the recipient entry which is appended to
+both the queue entry list and the peer entry list. The addresses for the
+same nexthop are batched in the entries up to the destination recipient
+limit (transport_destination_recipient_limit) for that transport. This
+happens in qmgr_assign() and, apart from the handling of the job and
+peer structures, is basically the same as in oqmgr.
+
+When the job is instantiated, it is enqueued on the transport's job list based
+on the time its message was picked up by nqmgr. For the first batch of
+recipients this means it is appended to the end of the job list. The
+ordering of the job list by enqueue time is important, as we will see
+shortly.
+
+[Now you should have a pretty good idea of the state of the nqmgr after a
+couple of messages have been picked up, and of the relations between all
+those job, peer, queue and entry structures.]
+
+H\bHo\bow\bw d\bdo\boe\bes\bs t\bth\bhe\be e\ben\bnt\btr\bry\by s\bse\bel\ble\bec\bct\bti\bio\bon\bn w\bwo\bor\brk\bk
+
+Having prepared all the structures described above, the task of the
+nqmgr's scheduler is to choose the recipient entries one at a time and
+pass them to the delivery agent for the corresponding transport. Now how
+does this work?
+
+The first approximation of the new scheduling algorithm is like this:
+
+ foreach transport (round-robin-by-transport)
+ do
+ if transport busy continue
+ if transport process limit reached continue
+ foreach transport's job (in the order of the transport's job list)
+ do
+ foreach job's peer (round-robin-by-destination)
+ do
+ if peer->queue->concurrency < peer->queue->window
+ return next peer entry.
+ done
+ done
+ done
+
+Now what is the "order of the transport's job list"? As we know already, the
+job list is by default kept in the order the message was picked up by the
+nqmgr. So by default we get the top-level round-robin transport, and within
+each transport we get the FIFO message delivery. The round-robin of the
+peers by destination is perhaps of little importance in most real-life
+cases (unless the destination recipient limit is reached, there is only
+one recipient entry per destination within a job), but theoretically it
+makes sure that even within a single job, destinations are treated fairly.
+
+[By now you should have a feeling that you really know how the scheduler
+works under ideal conditions - that is, no recipient resource limits and
+no destination concurrency problems - except for the preemption.]
+
+H\bHo\bow\bw d\bdo\boe\bes\bs t\bth\bhe\be p\bpr\bre\bee\bem\bmp\bpt\bti\bio\bon\bn w\bwo\bor\brk\bk
+
+As you might perhaps expect by now, the transport's job list does not remain
+sorted by the job's message enqueue time all the time. The coolest thing
+about nqmgr is not the simple FIFO delivery, but that it is able to slip
+mail with few recipients past the mailing-list bulk mail. This is what the
+job preemption is about - shuffling the jobs on the transport's job list
+to get the best message delivery rates. Now how is it achieved?
+
+First I have to tell you that there are in fact two job lists in each
+transport. One is the scheduler's job list, which the scheduler is free to play
+with, while the other one keeps the jobs always listed in the order of
+the enqueue time and is used for the recipient pool management which we
+will discuss later.
+For now, we will deal with the scheduler's job list only.
+
+So, we have the job list, which is first ordered by the time the job's messages
+were enqueued, oldest messages first, the most recently picked one at the end.
+For now, let's assume that there are no destination concurrency problems.
+Without preemption, we pick some entry of the first (oldest) job on the
+queue, assign it to a delivery agent, pick another one from the same job,
+assign it again, and so on, until all the entries are used and the job is
+delivered. We would then move on to the next job, and so on and on. Now
+how do we manage to sneak in some entries from the recently added jobs
+when the first job on the job list belongs to a message going to a
+mailing list and has thousands of recipient entries?
+
+The nqmgr's answer is that we can artificially "inflate" the delivery time of
+that first job by some constant for free - it is basically the same trick you
+might remember as "accumulation of potential" from the amortized complexity
+lessons. For example, instead of delivering the entries of the first job
+on the job list every time a delivery agent becomes available, we can do
+it only every second time. If you view the moments the delivery agent
+becomes available
+on a timeline as "delivery slots", then instead of using every delivery slot
+for the first job, we can use only every other slot, and still the overall
+delivery efficiency of the first job remains the same. So the delivery 11112222
+becomes 1.1.1.1.2.2.2.2 (1 and 2 are the imaginary job numbers, . denotes the
+free slot). Now what do we do with free slots?
+
+As you might have guessed, we will use them for sneaking in the mail with
+few recipients. For example, if we have one four-recipient mail followed
+by four one-recipient mails, the delivery sequence (that is, the sequence
+in which the jobs are assigned to the delivery slots) might look like
+this: 12131415. Hmm, fine for sneaking in the single recipient mail, but
+how do we sneak in the mail with more than one recipient? Say we have one
+four-recipient mail followed by two two-recipient mails?
+
+The simple answer would be to use the delivery sequence 12121313. But the
+problem is that this does not scale well. Imagine you have mail with a
+thousand recipients followed by mail with a hundred recipients. It is
+tempting to suggest a delivery sequence like 121212...., but alas! Imagine
+there arrives another mail with say ten recipients. There are no free
+slots anymore, so it can't slip by, not even if it had just one recipient.
+It will be stuck until the hundred-recipient mail is delivered, which
+really sucks.
+
+So, it becomes obvious that while inflating the message to get free
+slots is a great idea, one has to be really careful about how the free
+slots are assigned, otherwise one might corner oneself. So, how does
+nqmgr really use the free slots?
+
+The key idea is that one does not have to generate the free slots in a uniform
+way. The delivery sequence 111...1 is no worse than 1.1.1.1, in fact, it is
+even better as some entries are in the first case selected earlier than in the
+second case, and none is selected later! So it is possible first to
+"accumulate" the free delivery slots and then use them all at once. It is even
+possible to accumulate some, then use them, then accumulate some more and use
+them again, as in 11..1.1 .
+
+Let's get back to the one hundred recipient example. We now know that we could
+first accumulate one hundred free slots, and only then preempt the
+first job and sneak the one hundred recipient mail in. Applying the algorithm
+recursively, we see the hundred recipient job can accumulate ten free delivery
+slots, and then we could preempt it and sneak in the ten recipient mail... Wait
+wait wait! Could we? Aren't we overinflating the original one thousand
+recipient mail?
+
+Well, although it looks that way at first glance, another trick will
+allow us to answer "no, we are not!". If we had said that we will inflate
+the delivery time twice at maximum, and then consider every other slot as
+a free slot, then we would overinflate in case of recursive preemption.
+BUT! The trick is that if we use only every n-th slot as a free slot for
+n>2, there is always some worst inflation factor which we can guarantee
+not to be breached, even if we apply the algorithm recursively. To be
+precise, if for every k>1 normally used slots we accumulate one free
+delivery slot, then the inflation factor is not
+worse than k/(k-1) no matter how many recursive preemptions happen. And it's
+not worse than (k+1)/k if only non-recursive preemption happens. Now, having
+got through the theory and the related math, let's see how nqmgr implements
+this.
+
+Each job has a so-called "available delivery slot" counter. Each
+transport has a transport_delivery_slot_cost parameter, which defaults to
+the default_delivery_slot_cost parameter (5 by default). This is the k
+from the paragraph above. Each time k entries of the job are selected for
+delivery, this counter is incremented by one. Once some slots have
+accumulated, a job which requires no more than that amount of slots to be
+fully delivered may preempt this job.
+
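+For example, with the default slot cost k = 5 just described, the bounds
+derived above work out as
+
+    k/(k-1) = 5/4 = 1.25    (worst case, with recursive preemption)
+    (k+1)/k = 6/5 = 1.20    (worst case, non-recursive preemption only)
+
+so a job with 1000 recipient entries is guaranteed to complete within
+1250 delivery slots, no matter how often it is preempted.
+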
+[Well, the truth is, the counter is incremented every time an entry is selected
+and it is divided by k when it is used. Or, more precisely, there is no
+division; the other side of the equation is multiplied by k instead. But
+for understanding it's good enough to use the above approximation of the
+truth.]
+
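+In code, the bookkeeping from the note above might look like this (an
+illustrative sketch, not the actual qmgr source):
+
+    /* Raw slot counter: incremented once per selected entry. The division
+     * by k is avoided by multiplying the other side of the test by k. */
+    typedef struct {
+        long slots;                 /* raw available-delivery-slot counter */
+        long entries;               /* recipient entries left to deliver */
+    } JOB;
+
+    void job_entry_selected(JOB *job)
+    {
+        job->slots += 1;
+        job->entries -= 1;
+    }
+
+    /* May "candidate" preempt "current", given slot cost k? */
+    int job_can_preempt(const JOB *current, const JOB *candidate, long k)
+    {
+        return candidate->entries * k <= current->slots;
+    }
+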
+OK, so now we know the conditions which must be satisfied so one job can
+preempt another one. But what job gets preempted, how do we choose what
+job preempts it if there are several valid candidates, and when exactly
+does all this happen?
+
+The answer for the first part is simple. The job whose entry was
+selected last is the so-called current job. Normally, it is the first
+job on the
+scheduler's job list, but destination concurrency limits may change this as we
+will see later. It is always only the current job which may get preempted.
+
+Now for the second part. The current job has a certain amount of
+recipient entries, and as such may accumulate at most a certain amount of
+available delivery slots. It might have already accumulated some, and
+perhaps even already used some when it was preempted before (remember a
+job can be preempted several times). In either case, we know how many are
+accumulated and how many are left to deliver, so we know how many it may
+yet accumulate at most. Every other job which may be delivered with less
+than that amount of slots is a valid candidate for preemption. How do we
+choose among them?
+
+The answer is - the one with the maximum enqueue_time/recipient_entry_count
+ratio, where enqueue_time measures how long the job has been waiting. That
+is, the older the job is, the more we should try to deliver it in order
+to get the best message delivery rates. These rates are of course subject
+to how many recipients the message has, hence the division by the
+recipient (entry) count. No one shall be surprised that a message with n
+recipients takes n times longer to deliver than a message with one
+recipient.
+
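+Extending the sketch above with an enqueue time, the selection rule might
+be coded like this (again an illustrative sketch in the spirit of
+qmgr_choose_candidate(), not its actual implementation):
+
+    #include <time.h>
+
+    typedef struct {
+        time_t enqueue_time;        /* when the job's message was picked up */
+        long entries;               /* estimated recipient entries left */
+    } CJOB;
+
+    /* Among the jobs fully deliverable within max_slots (the most the
+     * current job may yet accumulate), pick the one with the highest
+     * waiting time per recipient entry. The real scheduler also skips
+     * the current job itself and any blocker jobs. */
+    CJOB *choose_candidate(CJOB *jobs, int njobs, long max_slots, time_t now)
+    {
+        CJOB *best = 0;
+        double best_score = 0;
+
+        for (int i = 0; i < njobs; i++) {
+            if (jobs[i].entries <= 0 || jobs[i].entries > max_slots)
+                continue;           /* already done, or does not fit */
+            double score = (double) (now - jobs[i].enqueue_time)
+                / jobs[i].entries;
+            if (best == 0 || score > best_score) {
+                best = &jobs[i];
+                best_score = score;
+            }
+        }
+        return best;
+    }
+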
+Now let's recap the previous two paragraphs. Isn't it too complicated?
+Why aren't the candidates drawn only from the jobs which can be delivered
+within the amount of slots the current job has already accumulated? Why
+do we need to estimate how much it has yet to accumulate? If you found
+out the answer, congratulate yourself. If we did it the simple way, we
+would always choose the candidate with the fewest recipient entries. If
+there were enough single-recipient mails coming in, they would always
+slip by the bulk mail as soon as possible, and mail with two or more
+recipients would never get a chance, no matter how long it had been
+sitting around in the job list.
+
+This candidate selection has an interesting implication - when we choose
+the best candidate for preemption (this is done in
+qmgr_choose_candidate()), we may not be able to use it for preemption
+immediately. This leads to an answer to the last part of the original
+question - when does the preemption happen?
+
+The preemption attempt happens every time the transport's next recipient
+entry is to be chosen for delivery. To avoid needless overhead, the
+preemption is not attempted if the current job could never accumulate
+more than transport_minimum_delivery_slots (defaults to
+default_minimum_delivery_slots, which defaults to 3). If enough slots
+have already accumulated to preempt the current job with the chosen best
+candidate, this is done immediately: the candidate is moved in front of
+the current job on the scheduler's job list, and the accumulated slot
+counter is decreased by the amount used by the candidate. If there are
+not enough slots... well, I could say that nothing happens and another
+preemption is attempted the next time. But that's not the complete truth.
+
+The truth is that it turns out not to be really necessary to wait until
+the job's counter accumulates all the delivery slots in advance. Say we
+have a ten-recipient mail followed by two two-recipient mails. If the
+preemption happened only when enough delivery slots had accumulated
+(assuming slot cost 2), the delivery sequence would be 11112211113311.
+Now what would we get if we waited only for 50% of the necessary slots to
+accumulate, promising to wait for the remaining 50% later, after we get
+back to the preempted job? With such a slot loan, the delivery sequence
+becomes 11221111331111. As we can see, this is not considerably worse for
+the delivery of the ten-recipient mail, but it allows the small messages
+to be delivered sooner.
+
+The concept of these slot loans is where the transport_delivery_slot_discount
+and transport_delivery_slot_loan come from (they default to
+default_delivery_slot_discount and default_delivery_slot_loan, whose values are
+by default 50 and 3, respectively). The discount (resp. loan) specifies
+how many percent (resp. how many slots) one "gets in advance", when the
+amount of slots required to deliver the best candidate is compared with
+the amount of slots the current job has accumulated so far.
+
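+The main.cf parameters involved, shown with the default values mentioned
+in this text (per-transport overrides replace "default" with the
+transport name):
+
+    /etc/postfix/main.cf:
+        default_delivery_slot_cost = 5
+        default_minimum_delivery_slots = 3
+        default_delivery_slot_discount = 50
+        default_delivery_slot_loan = 3
+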
+And that pretty much concludes this chapter.
+
+[Now you should have a feeling that you pretty much understand the scheduler
+and the preemption, or at least that you will have it after you read the
+last chapter a couple more times. You shall clearly see the job list and
+the preemption happening at its head, in ideal delivery conditions. The
+feeling of understanding shall last until you start wondering what
+happens if some of the jobs are blocked, which you might eventually
+figure out correctly from what has been said already. But I would be
+surprised if your mental image of the scheduler's functionality is not
+completely shattered once you start wondering how it works when not all
+recipients may be read in-core. More on that later.]
+
+H\bHo\bow\bw d\bde\bes\bst\bti\bin\bna\bat\bti\bio\bon\bn c\bco\bon\bnc\bcu\bur\brr\bre\ben\bnc\bcy\by l\bli\bim\bmi\bit\bts\bs a\baf\bff\bfe\bec\bct\bt t\bth\bhe\be s\bsc\bch\bhe\bed\bdu\bul\bli\bin\bng\bg a\bal\blg\bgo\bor\bri\bit\bth\bhm\bm
+
+The nqmgr uses the same algorithm for destination concurrency control as oqmgr.
+Now what happens when the destination limits are reached and no more entries
+for that destination may be selected by the scheduler?
+
+From the user's point of view it is all simple. If some of the peers of
+a job can't
+be selected, those peers are simply skipped by the entry selection algorithm
+(the pseudo-code described before) and only the selectable ones are used. If
+none of the peers may be selected, the job is declared a "blocker job". Blocker
+jobs are skipped by the entry selection algorithm and they are also excluded
+from the candidates for preemption of current job. Thus the scheduler
+effectively behaves as if the blocker jobs didn't exist on the job list at all.
+As soon as at least one of the peers of a blocker job becomes unblocked (that
+is, the delivery agent handling the delivery of a recipient entry for
+that destination successfully finishes), the job's blocker status is
+removed and the
+job again participates in all further scheduler actions normally.
+
+So the summary is that users don't really have to be concerned about the
+interaction of the destination limits and the scheduling algorithm. It
+works well on its own, and there are no knobs they need to control it
+with.
+
+From a programmer's point of view, the blocker jobs complicate the scheduler
+quite a lot. Without them, the jobs on the job list would normally be
+delivered
+in strict FIFO order. If the current job is preempted, the job preempting it is
+completely delivered unless it is preempted itself. Without blockers, the
+current job is thus always either the first job on the job list, or the top of
+the stack of jobs preempting the first job on the job list.
+
+The visualization of the job list and the preemption stack without blockers
+would be like this:
+
+ first job-> 1--2--3--5--6--8--... <- job list
+ on job list |
+ 4 <- preemption stack
+ |
+ current job-> 7
+
+In the example above we see that job 1 was preempted by job 4 and then job 4
+was preempted by job 7. After job 7 is completed, remaining entries of job 4
+are selected, and once they are all selected, job 1 continues.
+
+As we see, it's all very clean and straightforward. Now how does this change
+because of blockers?
+
+The answer is: a lot. Any job may become a blocker job at any time, and
+may also become a normal job again at any time. This has several
+important implications:
+
+ 1. The jobs may be completed in arbitrary order. For example, in the example
+ above, if the current job 7 becomes blocked, the next job 4 may complete
+ before the job 7 becomes unblocked again. Or if both 7 and 4 are blocked,
+ then 1 is completed, then 7 becomes unblocked and is completed, then 2 is
+ completed and only after that 4 becomes unblocked and is completed... You
+ get the idea.
+
+ [Interesting side note: even when jobs are delivered out of order, from
+ a single destination's point of view the jobs are still delivered in the
+ expected order (that is, FIFO unless there was some preemption involved).
+ This is because whenever a destination queue becomes unblocked (the
+ destination limit allows selection of more recipient entries for that
+ destination), all jobs which have peers for that destination are unblocked
+ at once.]
+
+ 2. The idea of the preemption stack at the head of the job list is gone. That
+ is, it must be possible to preempt any job on the job list. For example, if
+ the jobs 7, 4, 1 and 2 in the example above become all blocked, job 3
+ becomes the current job. And of course we do not want the preemption to
+ be affected by whether or not there are blocked jobs. Therefore, if
+ it turns out that job 3 might be preempted by job 6, the implementation
+ shall make it possible.
+
+ 3. The idea of the linear preemption stack itself is gone. It's no longer true
+ that one job is always preempted by at most one job at a time (that is,
+ directly preempted, not counting the recursively nested jobs). For example,
+ in the example above, job 1 is directly preempted by only job 4, and job 4
+ by job 7. Now assume job 7 becomes blocked, and job 4 is being delivered.
+ If it accumulates enough delivery slots, it is natural that it might be
+ preempted for example by job 8. Now job 4 is preempted by both job 7 AND
+ job 8 at the same time.
+
+Now combine the points 2) and 3) with point 1) again and you realize that the
+relations on the once linear job list became pretty complicated. If we extend
+the point 3) example: jobs 7 and 8 preempt job 4, now job 8 becomes blocked
+too, then job 4 completes. Tricky, huh?
+
+If I illustrate the relations after the above mentioned examples (except
+those in point 1)), the situation would look like this:
+
+ v- parent
+
+ adoptive parent -> 1--2--3--5--... <- "stack" level 0
+ | |
+ parent gone -> ? 6 <- "stack" level 1
+ / \
+ children -> 7 8 ^- child <- "stack" level 2
+
+ ^- siblings
+
+Now how does nqmgr deal with all these complicated relations?
+
+Well, it maintains them all as described, but fortunately, all these relations
+are necessary only for purposes of proper counting of available delivery slots.
+For purposes of ordering the jobs for entry selection, the original rule still
+applies: "the job preempting the current job is moved in front of the current
+job on the job list". So for entry selection purposes, the job relations remain
+as simple as this:
+
+ 7--8--1--2--6--3--5--.. <- scheduler's job list order
+
+The job list order and the preemption parent/child/siblings relations are
+maintained separately. And because the selection works only with the job list,
+you can happily forget about those complicated relations unless you want to
+study the nqmgr sources. In that case the text above might provide some helpful
+introduction to the problem domain. Otherwise I suggest you just forget about
+all this and stick with the user's point of view: the blocker jobs are simply
+ignored.
+
+[By now, you should have a feeling that there are more things going on
+under the hood than you ever wanted to know. You decide that forgetting
+about this chapter is the best you can do for the sake of your mind's
+health, and you basically stick with the idea of how the scheduler works
+in ideal conditions, when there are no blockers, which is good enough.]
+
+D\bDe\bea\bal\bli\bin\bng\bg w\bwi\bit\bth\bh m\bme\bem\bmo\bor\bry\by r\bre\bes\bso\bou\bur\brc\bce\be l\bli\bim\bmi\bit\bts\bs
+
+When discussing the nqmgr scheduler, we have so far assumed that all
+recipients of all messages in the active queue are completely read into
+memory. This is simply not true. There is an upper bound on the amount of
+memory the nqmgr may use, and therefore it must impose some limits on the
+information it may store in memory at any given time.
+
+First of all, not all messages may be read in-core at once. At any time,
+at most qmgr_message_active_limit messages may be read in-core. When read
+into memory, the messages are picked from the incoming and deferred
+message queues and moved to the active queue (incoming having priority);
+if there are more than qmgr_message_active_limit messages destined for
+the active queue, the rest will have to wait until (some of) the messages
+in the active queue are completely delivered (or deferred).
+
+Even with the limited amount of in-core messages, there is another limit
+which must be imposed in order to avoid memory exhaustion. Each message
+may contain a huge number of recipients (tens or hundreds of thousands
+are not uncommon), so if nqmgr were to read all recipients of all
+messages in the active queue, it might easily run out of memory.
+Therefore there must be some upper bound on the number of message
+recipients which are read into memory at the same time.
+
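+The parameters involved, with what are believed to be the stock default
+values (verify with "postconf -d" on your system):
+
+    /etc/postfix/main.cf:
+        qmgr_message_active_limit = 20000
+        qmgr_message_recipient_limit = 20000
+        qmgr_message_recipient_minimum = 10
+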
+Before discussing how exactly nqmgr implements the recipient limits, let's see
+how the sole existence of the limits themselves affects the nqmgr and its
+scheduler.
+
+The message limit is straightforward - it just limits the size of the lookahead
+the nqmgr's scheduler has when choosing which message can preempt the current
+one. Messages not in the active queue simply are not considered at all.
+
+The recipient limit complicates things more. First of all, the message
+reading code must support reading the recipients in batches, which among
+other things means accessing the queue file several times and continuing
+where the last recipient batch ended. This is invoked by the scheduler
+whenever the current job runs out of in-core recipients and more are
+required. It is also done whenever all in-core recipients of the message
+have been dealt with (which may also mean they were deferred) but there
+are still more in the queue file.
+
+The second complication is that with some recipients left unread in the queue
+file, the scheduler can't operate with exact counts of recipient entries. With
+unread recipients, it is not clear how many recipient entries there will be, as
+they are subject to per-destination grouping. It is not even clear to what
+transports (and thus jobs) the recipients will be assigned. And with messages
+coming from the deferred queue, it is not even clear how many unread recipients
+are still to be delivered. This all means that the scheduler must use
+only estimates of how many recipient entries there will be. Fortunately,
+it is possible to estimate the minimum and maximum correctly, so the
+scheduler can always err on the safe side. Obviously, the better the
+estimates, the better the results, so it is best when we are able to read
+all recipients in-core and turn
+the estimates into exact counts, or at least try to read as many as possible to
+make the estimates as accurate as possible.
+
+The third complication is that it is no longer true that the scheduler is done
+with a job once all of its in-core recipients are delivered. It is possible
+that the job will be revived later, when another batch of recipients is read in
+core. It is also possible that some jobs will be created for the first time
+long after the first batch of recipients was read in core. The nqmgr code must
+be ready to handle all such situations.
+
+And finally, the fourth complication is that the nqmgr code must somehow impose
+the recipient limit itself. Now how does it achieve that?
+
+Perhaps the easiest solution would be to say that each message may have
+at most X recipients stored in-core, but such a solution would be poor
+for several reasons. With reasonable qmgr_message_active_limit values,
+the X would have to be quite low to maintain a reasonable memory
+footprint. And with a low X, lots of things would not work well. The
+nqmgr would have problems using the
+transport_destination_recipient_limit efficiently. The scheduler's
+preemption would be suboptimal, as the recipient count estimates would be
+inaccurate. The message queue file would have to be accessed many times
+to read in more recipients again and again.
+
+Therefore it seems reasonable to have a solution which does not use a
+limit imposed on a per-message basis, but which maintains a pool of
+available recipient slots that can be shared among all messages in the
+most efficient manner. And as we do not want separate transports to
+compete for resources whenever possible, it seems appropriate to maintain
+such a recipient pool for each transport separately. This is the general
+idea; now how does it work in practice?
+
+First we have to solve a little chicken-and-egg problem. If we want to
+use the per-transport recipient pools, we first need to know to what
+transport(s) the message is assigned. But we will find that out only
+after we have read in the recipients. So it is obvious that we first have
+to read in some recipients, use them to find out to what transports the
+message is to be assigned, and only after that can we use the
+per-transport recipient pools.
+
+Now how many recipients shall we read for the first time? This is what
+qmgr_message_recipient_minimum and qmgr_message_recipient_limit values control.
+The qmgr_message_recipient_minimum value specifies how many recipients of each
+message we will read for the first time, no matter what. It is necessary to
+read at least one recipients before we can assign the message to a transport
+and create the first job. However, reading only qmgr_message_recipient_minimum
+recipients even if there are only few messages with few messages in-core would
+be wasteful. Therefore if there is less than qmgr_message_recipient_limit
+recipients in-core so far, the first batch of recipients may be larger than
+qmgr_message_recipient_minimum - as large as is required to reach the
+qmgr_message_recipient_limit limit.
+
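+As a sketch (with illustrative names), the first batch size is:
+
+    /* First recipient batch for a newly picked-up message: at least the
+     * per-message minimum, and more while the global in-core total is
+     * still below qmgr_message_recipient_limit. Illustrative only. */
+    long first_batch_size(long recipient_minimum, long recipient_limit,
+                          long in_core_total)
+    {
+        long batch = recipient_minimum;
+
+        if (recipient_limit - in_core_total > batch)
+            batch = recipient_limit - in_core_total;
+        return batch;
+    }
+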
+Once the first batch of recipients has been read in core and the message
+jobs have been created, the size of the subsequent recipient batches (if
+any - of course it's
+best when all recipients are read in one batch) is based solely on the position
+of the message jobs on their corresponding transport's job lists. Each
+transport has a pool of transport_recipient_limit recipient slots which it can
+distribute among its jobs (how this is done is described later). The subsequent
+recipient batch may be as large as the sum of all recipient slots of all jobs
+of the message permits (plus the qmgr_message_recipient_minimum amount which
+always applies).
+
+For example, if a message has three jobs, the first with 1 recipient
+still in-core and 4 recipient slots, the second with 5 recipients in-core
+and 5 recipient slots, and the third with 2 recipients in-core and 0
+recipient slots, it has 1+5+2=7 recipients in-core and 4+5+0=9 jobs'
+recipient slots in total. This means that we could immediately read
+2+qmgr_message_recipient_minimum more recipients of that message in core.
+
+The above example illustrates several things which might be worth
+mentioning explicitly: first, note that although the per-transport slots
+are assigned to particular jobs, we can't guarantee that once the next
+batch of recipients is read in core, the corresponding amounts of
+recipients will be assigned to those jobs. The jobs lend their slots to
+the message as a whole, so it is possible that some jobs end up
+sponsoring other jobs of their message. For example, if in the example
+above the 2 newly read recipients were assigned to the second job, the
+first job sponsored the second job with 2 slots. The second notable thing
+is the third job, which has more recipients in-core than it has slots.
+Apart from the sponsoring by another job that we just saw, this can be
+the result of the first recipient batch, which is sponsored from the
+global recipient pool of qmgr_message_recipient_limit recipients. It can
+also be sponsored from the message recipient pool of
+qmgr_message_recipient_minimum recipients.
+
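+The arithmetic of the example above, as an illustrative sketch:
+
+    /* How many more recipients of a message may be read in core: the
+     * unused job slots (total slots minus recipients in-core) plus the
+     * always-available qmgr_message_recipient_minimum. */
+    long next_batch_size(const long *in_core, const long *slots,
+                         int njobs, long recipient_minimum)
+    {
+        long total_in_core = 0, total_slots = 0;
+
+        for (int i = 0; i < njobs; i++) {
+            total_in_core += in_core[i];
+            total_slots += slots[i];
+        }
+        if (total_slots < total_in_core)    /* sponsored jobs may hold */
+            total_slots = total_in_core;    /* more recipients than slots */
+        return total_slots - total_in_core + recipient_minimum;
+    }
+
+With the example values {1, 5, 2} recipients in-core and {4, 5, 0} slots,
+this yields 9 - 7 + qmgr_message_recipient_minimum, matching the
+2+qmgr_message_recipient_minimum above.
+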
+Now how does each transport distribute the recipient slots among its jobs? The
+strategy is quite simple. As most scheduler activity happens on the head of the
+job list, it is our intention to make sure that the scheduler has the best
+estimates of the recipient counts for those jobs. As we mentioned above, this
+means that we want to try to make sure that the messages of those jobs have all
+recipients read in-core. Therefore the transport distributes the slots "along"
+the job list from start to end. In this case the job list sorted by message
+enqueue time is used, because it doesn't change over time as the scheduler's
+job list does.
+
+More specifically, each time a job is created and appended to the job list, it
+gets all unused recipient slots from its transport's pool. It keeps them until
+all recipients of its message are read. When this happens, all unused
+recipient slots are transferred to the next job on the job list which
+still has some recipients unread (which is in fact now the first such
+job), or eventually back to
+the transport pool if there is no such job. Such transfer then also happens
+whenever a recipient entry of that job is delivered.
+
+There is also a scenario where a job is not appended to the end of the
+job list (for example, it was created as a result of a second or later
+recipient batch).
+Then it works exactly as above, except that if it was put in front of the first
+unread job (that is, the job of a message which still has some unread
+recipients in queue file), that job is first forced to return all of its unused
+recipient slots to the transport pool.
+
+The algorithm just described leads to the following state: The first unread job
+on the job list always gets all the remaining recipient slots of that transport
+(if there are any). The jobs queued before this job are completely read
+(that is, all recipients of their message were already read in core) and
+have at most as many slots as they still have recipients in-core (the
+maximum is there because of the sponsoring mentioned before), and the
+jobs after this job get nothing from the transport recipient pool (unless
+they got something before and the first unread job was later created and
+enqueued in front of them - in such a case they also get at most as many
+slots as they have recipients in-core).
+
+Things work fine in such a state most of the time, because the current
+job is either completely read in-core or has as many recipient slots as
+there are, but there is one situation which we still have to handle
+specially. Imagine that the current job is preempted by some unread job
+from the job list and there are no more recipient slots available, so
+this new current job could read only batches of
+qmgr_message_recipient_minimum recipients at a time. This would really
+degrade performance. For this reason, each transport has an extra pool of
+transport_extra_recipient_limit recipient slots, dedicated exactly to
+this situation. Each time an unread job preempts the current job, it gets
+half of the remaining recipient slots from the normal pool and this extra
+pool.
+
+And that's it. It sure does sound pretty complicated, but fortunately
+most people don't really have to care how exactly it works as long as it
+works. Perhaps the only important things to know for most people are the
+following upper-bound formulas:
+
+Each transport has at maximum
+
+ max(
+ qmgr_message_recipient_minimum * qmgr_message_active_limit
+ + *_recipient_limit + *_extra_recipient_limit,
+ qmgr_message_recipient_limit
+ )
+
+recipients in core.
+
+The total amount of recipients in core is
+
+ max(
+ qmgr_message_recipient_minimum * qmgr_message_active_limit
+ + sum( *_recipient_limit + *_extra_recipient_limit ),
+ qmgr_message_recipient_limit
+ )
+
+where the sum is over all used transports.
+
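+For example, with one transport and what are believed to be the stock
+defaults (qmgr_message_recipient_minimum = 10, qmgr_message_active_limit
+= 20000, *_recipient_limit = 20000, *_extra_recipient_limit = 1000;
+verify with "postconf -d"), the per-transport bound evaluates to
+
+    max(10 * 20000 + 20000 + 1000, 20000) = 221000
+
+recipients in core.
+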
+And this terribly complicated chapter concludes the documentation of the
+nqmgr scheduler.
+
+[By now you should theoretically know the nqmgr scheduler inside out. In
+practice, you still hope that you will never have to really understand the last
+or last two chapters completely, and fortunately most people really
+won't. Understanding how the scheduler works in ideal conditions is more
+than good enough for the vast majority of users.]
C\bCr\bre\bed\bdi\bit\bts\bs
site detection.
* These simplifications, and their modular implementation, helped to develop
further insights into the different roles that positive and negative
- concurrency feedback play, and helped to avoid all the known worst-case
+ concurrency feedback play, and helped to identify some worst-case
scenarios.
/etc/postfix/main.cf:
smtp_tls_CAfile = /etc/postfix/cacert.pem
smtp_tls_session_cache_database =
- btree:/var/spool/postfix/smtp_tls_session_cache
+ btree:/var/lib/postfix/smtp_tls_session_cache
smtp_use_tls = yes
smtpd_tls_CAfile = /etc/postfix/cacert.pem
smtpd_tls_cert_file = /etc/postfix/FOO-cert.pem
smtpd_tls_key_file = /etc/postfix/FOO-key.pem
smtpd_tls_received_header = yes
smtpd_tls_session_cache_database =
- btree:/var/spool/postfix/smtpd_tls_session_cache
+ btree:/var/lib/postfix/smtpd_tls_session_cache
tls_random_source = dev:/dev/urandom
# Postfix 2.3 and later
smtpd_tls_security_level = may
# ==========================================================================
smtp inet n - n - - smtpd
#submission inet n - n - - smtpd
-# -o smtpd_enforce_tls=yes
+# -o smtpd_tls_security_level=encrypt
# -o smtpd_sasl_auth_enable=yes
# -o smtpd_client_restrictions=permit_sasl_authenticated,reject
# -o milter_macro_daemon_name=ORIGINATING
<!doctype html public "-//W3C//DTD HTML 4.01 Transitional//EN"
- "http://www.w3.org/TR/html4/loose.dtd">
+ "http://www.w3.org/TR/html4/loose.dtd">
<html>
<body>
-<h1><img src="postfix-logo.jpg" width="203" height="98" ALT="">Postfix Queue Scheduler</h1>
+<h1><img src="postfix-logo.jpg" width="203" height="98" ALT="">Postfix
+Queue Scheduler</h1>
<hr>
the last delivery attempt. There are two major classes of mechanisms
that control the operation of the queue manager. </p>
-<p> The first class of mechanisms is concerned with the number of
-concurrent deliveries to a specific destination, including decisions
-on when to suspend deliveries after persistent failures: </p>
-
- <ul>
-
- <li> <a href="#concurrency"> Concurrency scheduling </a>
-
- <ul>
-
- <li> <a href="#concurrency_summary_2_5"> Summary of the
- Postfix 2.5 concurrency feedback algorithm </a>
-
- <li> <a href="#dead_summary_2_5"> Summary of the Postfix
- 2.5 "dead destination" detection algorithm </a>
-
- <li> <a href="#pseudo_code_2_5"> Pseudocode for the Postfix
- 2.5 concurrency scheduler </a>
+<ul>
- <li> <a href="#concurrency_results"> Results for delivery
- to concurrency limited servers </a>
+<li> <a href="#concurrency"> Concurrency scheduling </a> is concerned
+with the number of concurrent deliveries to a specific destination,
+including decisions on when to suspend deliveries after persistent
+failures.
- <li> <a href="#concurrency_discussion"> Discussion of
- concurrency limited server results </a>
+<li> <a href="#jobs"> Preemptive scheduling </a> is concerned with
+the selection of email messages and recipients for a given destination.
- <li> <a href="#concurrency_limitations"> Limitations of
- less-than-1 per delivery feedback </a>
+<li> <a href="#credits"> Credits </a>. This document would not be
+complete without them.
- <li> <a href="#concurrency_config"> Concurrency configuration
- parameters </a>
+</ul>
- </ul>
+<!--
- </ul>
+<p> Once started, the <a href="qmgr.8.html">qmgr(8)</a> process runs until "postfix reload"
+or "postfix stop". As a persistent process, the queue manager has
+to meet strict requirements with respect to code correctness and
+robustness. Unlike non-persistent daemon processes, the queue manager
+cannot benefit from Postfix's process rejuvenation mechanism that
+limits the impact from resource leaks and other coding errors
+(translation: replacing a process after a short time covers up bugs
+before they can become a problem). </p>
-<p> The second class of mechanisms is concerned with the selection
-of what mail to deliver to a given destination: </p>
+-->
- <ul>
+<h2> <a name="concurrency"> Concurrency scheduling </a> </h2>
- <li> <a href="#jobs"> Preemptive scheduling </a>
+<p> The following sections document the Postfix 2.5 concurrency
+scheduler, after a discussion of the limitations of the existing
+concurrency scheduler. This is followed by results of medium-concurrency
+experiments, and a discussion of trade-offs between performance and
+robustness. </p>
- <ul>
+<p> The material is organized as follows: </p>
- <li> <a href="#job_motivation"> Why the non-preemptive Postfix queue
- manager was replaced </a>
+<ul>
- <li> <a href="#job_design"> How the non-preemptive queue manager
- scheduler works </a>
+<li> <a href="#concurrency_drawbacks"> Drawbacks of the existing
+concurrency scheduler </a>
- </ul>
+<li> <a href="#concurrency_summary_2_5"> Summary of the Postfix 2.5
+concurrency feedback algorithm </a>
- </ul>
+<li> <a href="#dead_summary_2_5"> Summary of the Postfix 2.5 "dead
+destination" detection algorithm </a>
-<p> And this document would not be complete without: </p>
+<li> <a href="#pseudo_code_2_5"> Pseudocode for the Postfix 2.5
+concurrency scheduler </a>
- <ul>
+<li> <a href="#concurrency_results"> Results for delivery to
+concurrency limited servers </a>
- <li> <a href="#credits"> Credits </a>
+<li> <a href="#concurrency_discussion"> Discussion of concurrency
+limited server results </a>
- </ul>
+<li> <a href="#concurrency_limitations"> Limitations of less-than-1
+per delivery feedback </a>
-<!--
+<li> <a href="#concurrency_config"> Concurrency configuration
+parameters </a>
-<p> Once started, the <a href="qmgr.8.html">qmgr(8)</a> process runs until "postfix reload"
-or "postfix stop". As a persistent process, the queue manager has
-to meet strict requirements with respect to code correctness and
-robustness. Unlike non-persistent daemon processes, the queue manager
-cannot benefit from Postfix's process rejuvenation mechanism that
-limit the impact from resource leaks and other coding errors
-(translation: replacing a process after a short time covers up bugs
-before they can become a problem). </p>
-
--->
+</ul>
-<h2> <a name="concurrency"> Concurrency scheduling </a> </h2>
+<h3> <a name="concurrency_drawbacks"> Drawbacks of the existing
+concurrency scheduler </a> </h3>
-<p> This section documents the Postfix 2.5 concurrency scheduler.
-Prior Postfix versions used a simple but robust algorithm where the
-per-destination delivery concurrency was decremented by 1 after a
-delivery suffered connection or handshake failure, and was incremented
-by 1 otherwise. Of course the concurrency was never allowed to
-exceed the maximum per-destination concurrency limit. And when a
-destination's concurrency level dropped to zero, the destination
-was declared "dead" and delivery was suspended. </p>
+<p> From the start, Postfix has used a simple but robust algorithm
+where the per-destination delivery concurrency is decremented by 1
+after a delivery suffered connection or handshake failure, and
+incremented by 1 otherwise. Of course the concurrency is never
+allowed to exceed the maximum per-destination concurrency limit.
+And when a destination's concurrency level drops to zero, the
+destination is declared "dead" and delivery is suspended. </p>
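+
+<p> As a minimal illustrative sketch (not the actual qmgr(8) source
+code; all names are made up), the old +/-1 rule could be written
+like this: </p>
+
+<blockquote>
+<pre>
+/* Illustrative sketch of the historical +/-1 feedback rule. */
+void    update_concurrency(int *concurrency, int max_concurrency, int failed)
+{
+    if (failed)
+        *concurrency -= 1;              /* connection/handshake failure */
+    else if (*concurrency < max_concurrency)
+        *concurrency += 1;              /* success, capped at the limit */
+    if (*concurrency <= 0) {
+        *concurrency = 0;               /* destination declared "dead" */
+        /* ... suspend deliveries to this destination ... */
+    }
+}
+</pre>
+</blockquote>
+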
-<p> Drawbacks of the old +/-1 feedback per delivery are: <p>
+<p> Drawbacks of +/-1 concurrency feedback per delivery are: </p>
<ul>
<li> <p> Overshoot due to exponential delivery concurrency growth
-with each pseudo-cohort(*). For example, with the default initial
-concurrency of 5, concurrency would proceed over time as (5-10-20).
-</p>
+with each pseudo-cohort(*). This can be an issue with high-concurrency
+channels. For example, with the default initial concurrency of 5,
+concurrency would proceed over time as (5-10-20). </p>
<li> <p> Throttling down to zero concurrency after a single
pseudo-cohort(*) failure. This was especially an issue with
low-concurrency channels, where a single pseudo-cohort failure
was enough to mark a destination as "dead" and suspend further
deliveries. </p>
-<h3> <a name="concurrency_summary_2_5"> Summary of the Postfix 2.5 concurrency feedback algorithm </a> </h3>
+<h3> <a name="concurrency_summary_2_5"> Summary of the Postfix 2.5
+concurrency feedback algorithm </a> </h3>
<p> We want to increment a destination's delivery concurrency when
some (not necessarily consecutive) number of deliveries complete
<p> All results in the previous sections are based on the first
delivery runs only; they do not include any second etc. delivery
-attempts. The first two examples show that the feedback method
-matters little when concurrency is limited due to congestion. This
+attempts. The first two examples show that the effect of feedback
+is negligible when concurrency is limited due to congestion. This
is because the initial concurrency is already at the client's
concurrency maximum, and because there is 10-100 times more positive
-than negative feedback. Under these conditions, the contribution
-from SMTP connection caching is negligible. </p>
+than negative feedback. Under these conditions, it is no surprise
+that the contribution from SMTP connection caching is also negligible.
+</p>
<p> In the last example, the old +/-1 feedback per delivery will
defer 50% of the mail when confronted with an active (anvil-style)
<h2> <a name="jobs"> Preemptive scheduling </a> </h2>
-<p> This is the beginning of documentation for a preemptive queue
-manager scheduling algorithm by Patrik Rak. For a long time, this
-code was made available under the name "nqmgr(8)" (new queue manager),
-as an optional module. As of Postfix 2.1 this is the default queue
-manager, which is always called "<a href="qmgr.8.html">qmgr(8)</a>". The old queue manager
-will for some time will be available under the name of "<a href="qmgr.8.html">oqmgr(8)</a>".
+<p>
+
+This document attempts to describe the new queue manager and its
+preemptive scheduler algorithm. Note that the document was originally
+written to describe the changes between the new queue manager (in
+this text referred to as <tt>nqmgr</tt>, the name it was known by
+before it became the default queue manager) and the old queue manager
+(referred to as <tt>oqmgr</tt>). This is why it refers to <tt>oqmgr</tt>
+every so often.
+
</p>
-<h3> <a name="job_motivation"> Why the non-preemptive Postfix queue manager was replaced </a> </h3>
+<p>
-<p> The non-preemptive Postfix scheduler had several limitations
-due to unfortunate choices in its design. </p>
+This document is divided into sections as follows:
-<ol>
+</p>
- <li> <p> Round-robin selection by destination for mail that is
- delivered via the same message delivery transport. The round-robin
- strategy was chosen with the intention to prevent a single
- (destination) site from using up too many mail delivery resources.
- However, that strategy penalized inbound mail on bi-directional
- gateways. The poor suffering inbound destination would be
- selected only 1/number-of-destinations of the time, even when
- it had more mail than other destinations, and thus mail could
- be delayed. </p>
-
- <p> Victor Duchovni found a workaround: use different message
- delivery transports, and thus avoid the starvation problem.
- The Patrik Rak scheduler solves this problem by using FIFO
- selection. </p>
-
- <li> <p> A second limitation of the old Postfix scheduler was
- that delivery of bulk mail would block all other deliveries,
- causing large delays. Patrik Rak's scheduler allows mail with
- fewer recipients to slip past bulk mail in an elegant manner.
- </p>
+<ul>
-</ol>
+<li> <a href="#<tt>nqmgr</tt>_structures"> The structures used by
+nqmgr </a>
+
+<li> <a href="#<tt>nqmgr</tt>_pickup"> What happens when nqmgr picks
+up the message </a> - how it is assigned to transports, jobs, peers,
+entries
+
+<li> <a href="#<tt>nqmgr</tt>_selection"> How does the entry selection
+work </a>
+
+<li> <a href="#<tt>nqmgr</tt>_preemption"> How does the preemption
+work </a> - what messages may be preempted and how and what messages
+are chosen to preempt them
+
+<li> <a href="#<tt>nqmgr</tt>_concurrency"> How destination concurrency
+limits affect the scheduling algorithm </a>
+
+<li> <a href="#<tt>nqmgr</tt>_memory"> Dealing with memory resource
+limits </a>
+
+</ul>
+
+<h3> <a name="<tt>nqmgr</tt>_structures"> The structures used by
+nqmgr </a> </h3>
+
+<p>
+
+Let's start by recapitulating the structures and terms used when
+referring to the queue manager and how it operates. Many of these are
+partially described elsewhere, but it is nice to have a coherent
+overview in one place:
+
+</p>
+
+<ul>
+
+<li> <p> Each message structure represents one mail message which
+Postfix is to deliver. The message recipients specify the destinations
+the message is to be delivered to and the transports that will be
+used for the delivery. </p>
+
+<li> <p> Each recipient entry groups a batch of recipients of one
+message which are all going to be delivered to the same destination.
+</p>
+
+<li> <p> Each transport structure groups everything that is going
+to be delivered by the delivery agents dedicated to that transport.
+Each transport maintains a set of queues (describing the destinations
+it shall talk to) and jobs (referencing the messages it shall
+deliver). </p>
+
+<li> <p> Each transport queue (not to be confused with the on-disk
+<a href="QSHAPE_README.html#active_queue">active queue</a> or <a href="QSHAPE_README.html#incoming_queue">incoming queue</a>) groups everything what is going be
+delivered to given destination (aka nexthop) by its transport. Each
+queue belongs to one transport, so each destination may be referred
+to by several queues, one for each transport. Each queue maintains
+a list of all recipient entries (batches of message recipients)
+which shall be delivered to given destination (the todo list), and
+a list of recipient entries already being delivered by the delivery
+agents (the busy list). </p>
+
+<li> <p> Each queue corresponds to multiple peer structures. Each
+peer structure is like the queue structure, belonging to one transport
+and referencing one destination. The difference is that it lists
+only the recipient entries which all originate from the same message,
+unlike the queue structure, whose entries may originate from various
+messages. For messages with few recipients, there is usually just
+one recipient entry for each destination, resulting in one recipient
+entry per peer. But for large mailing list messages the recipients
+may need to be split to multiple recipient entries, in which case
+the peer structure may list many entries for a single destination.
+</p>
+
+<li> <p> Each transport job groups everything it takes to deliver
+one message via its transport. Each job represents one message
+within the context of the transport. The job belongs to one transport
+and message, so each message may have multiple jobs, one for each
+transport. The job groups all the peer structures, which describe
+the destinations the job's message has to be delivered to. </p>
+
+</ul>
+
+<p>
+
+The first four structures are common to both <tt>nqmgr</tt> and
+<tt>oqmgr</tt>; the latter two were introduced by <tt>nqmgr</tt>.
+
+</p>
+
+<p>
+
+These terms are used extensively in the text below; feel free to
+look up the descriptions above whenever you feel you have lost
+track of what is what.
+
+</p>
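+
+<p> As an illustrative aid only (simplified, with assumed field
+names; the real declarations in the qmgr sources differ in naming
+and detail), the relationships above could be sketched in C like
+this: </p>
+
+<blockquote>
+<pre>
+/* Simplified sketch of the structures described above. */
+typedef struct MESSAGE   MESSAGE;       /* one mail message */
+typedef struct TRANSPORT TRANSPORT;     /* one delivery transport */
+typedef struct QUEUE     QUEUE;         /* per-transport destination queue */
+typedef struct JOB       JOB;           /* one message within one transport */
+typedef struct PEER      PEER;          /* one destination within one job */
+typedef struct ENTRY     ENTRY;         /* recipient batch, same destination */
+
+struct TRANSPORT { QUEUE *queues; JOB *job_list; };
+struct QUEUE     { TRANSPORT *transport; ENTRY *todo_list; ENTRY *busy_list; };
+struct JOB       { TRANSPORT *transport; MESSAGE *message; PEER *peers; };
+struct PEER      { JOB *job; QUEUE *queue; ENTRY *entries; };
+struct ENTRY     { QUEUE *queue; PEER *peer; /* plus the recipients */ };
+struct MESSAGE   { long enqueue_time; JOB *jobs; /* plus recipients etc. */ };
+</pre>
+</blockquote>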
+
+<h3> <a name="<tt>nqmgr</tt>_pickup"> What happens when nqmgr picks
+up the message </a> </h3>
+
+<p>
+
+Whenever <tt>nqmgr</tt> moves a queue file into the <a href="QSHAPE_README.html#active_queue">active queue</a>,
+the following happens: It reads all necessary information from the
+queue file as <tt>oqmgr</tt> does, and also reads as many recipients
+as possible - more on that later; for now, let's just pretend it
+always reads all recipients.
+
+</p>
+
+<p>
+
+Then it resolves the recipients as <tt>oqmgr</tt> does, which
+means obtaining an (address, nexthop, transport) triple for each
+recipient. For each triple, it finds the transport; if it does not
+exist yet, it instantiates it (unless it's dead). Within the
+transport, it finds the destination queue for the given nexthop; if it
+does not exist yet, it instantiates it (unless it's dead). The
+triple is then bound to the given destination queue. This happens in
+qmgr_resolve() and is basically the same as in <tt>oqmgr</tt>.
+
+</p>
+
+<p>
+
+Then for each triple which was bound to some queue (and thus
+transport), the program finds the job which represents the message
+within that transport's context; if it does not exist yet, it
+instantiates it. Within the job, it finds the peer which represents
+the bound destination queue within this job's context; if it does
+not exist yet, it instantiates it. Finally, it stores the address
+from the resolved triple in the recipient entry which is appended
+to both the queue entry list and the peer entry list. The addresses
+for the same nexthop are batched in the entries up to the
+recipient_concurrency limit for that transport. This happens in
+qmgr_assign() and, apart from the handling of the job and peer
+structures, is basically the same as in <tt>oqmgr</tt>.
+
+</p>
+
+<p>
+
+When the job is instantiated, it is enqueued on the transport's job
+list based on the time its message was picked up by <tt>nqmgr</tt>.
+For the first batch of recipients this means it is appended to the
+end of the job list, but the ordering of the job list by enqueue
+time is important, as we will see shortly.
+
+</p>
+
+<p>
+
+[Now you should have a pretty good idea of the state of the
+<tt>nqmgr</tt> after a couple of messages have been picked up, and
+of the relation between all those job, peer, queue and entry structures.]
+
+</p>
+
+<h3> <a name="<tt>nqmgr</tt>_selection"> How does the entry selection
+work </a> </h3>
+
+<p>
+
+Having prepared all the structures mentioned above, the task of
+the <tt>nqmgr</tt>'s scheduler is to choose the recipient entries
+one at a time and pass them to the delivery agent for the corresponding
+transport. Now how does this work?
+
+</p>
+
+<p>
+
+The first approximation of the new scheduling algorithm is like this:
+
+</p>
+
+<blockquote>
+<pre>
+foreach transport (round-robin-by-transport)
+do
+ if transport busy continue
+ if transport process limit reached continue
+ foreach transport's job (in the order of the transport's job list)
+ do
+        foreach job's peer (round-robin-by-destination)
+        do
+            if peer->queue->concurrency < peer->queue->window
+                return next peer entry.
+        done
+ done
+done
+</pre>
+</blockquote>
+
+<p>
+
+Now what is the "order of the transport's job list"? As we know
+already, the job list is by default kept in the order the message
+was picked up by the <tt>nqmgr</tt>. So by default we get round-robin
+selection among the transports at the top level, and FIFO message
+delivery within each transport. The round-robin selection of the peers
+by destination is perhaps of little importance in most real-life cases
+(unless the recipient_concurrency limit is reached, there is only
+one peer structure per destination in one job), but theoretically
+it makes sure that even within a single job, destinations are treated
+fairly.
+
+</p>
+
+<p>
+
+[By now you should have a feeling that you really know how the
+scheduler works under ideal conditions - that is, with no recipient
+resource limits and no destination concurrency problems - except
+for the preemption.]
+
+</p>
+
+<h3> <a name="<tt>nqmgr</tt>_preemption"> How does the preemption
+work </a> </h3>
+
+<p>
+
+As you might perhaps expect by now, the transport's job list does
+not remain sorted by the job's message enqueue time all the time.
+The coolest thing about <tt>nqmgr</tt> is not the simple FIFO
+delivery, but that it is able to slip mail with few recipients
+past the mailing-list bulk mail. This is what the job preemption
+is about - shuffling the jobs on the transport's job list to get
+the best message delivery rates. Now how is it achieved?
+
+</p>
+
+<p>
+
+First I have to tell you that there are in fact two job lists in
+each transport. One is the scheduler's job list, which the scheduler
+is free to play with, while the other one always keeps the jobs
+listed in the order of the enqueue time and is used for the recipient
+pool management we will discuss later. For now, we will deal with
+the scheduler's job list only.
+
+</p>
+
+<p>
+
+So, we have the job list, which is first ordered by the time the
+job's messages were enqueued, oldest messages first, the most recently
+picked one at the end. For now, let's assume that there are no
+destination concurrency problems. Without preemption, we pick some
+entry of the first (oldest) job on the queue, assign it to a delivery
+agent, pick another one from the same job, assign it again, and so
+on, until all the entries are used and the job is delivered. We
+would then move on to the next job, and so on. Now how do we
+manage to sneak in some entries from the recently added jobs when
+the first job on the job list belongs to a message going to a
+mailing list and has thousands of recipient entries?
+
+</p>
+
+<p>
+
+The <tt>nqmgr</tt>'s answer is that we can artificially "inflate"
+the delivery time of that first job by some constant for free - it
+is basically the same trick you might remember as "accumulation of
+potential" from the amortized complexity lessons. For example,
+instead of delivering the entries of the first job on the job list
+every time a delivery agent becomes available, we can do it only
+every second time. If you view the moments a delivery agent becomes
+available on a timeline as "delivery slots", then instead of using
+every delivery slot for the first job, we can use only every other
+slot, and still the overall delivery efficiency of the first job
+remains the same. So the delivery <tt>11112222</tt> becomes
+<tt>1.1.1.1.2.2.2.2</tt> (1 and 2 are the imaginary job numbers, .
+denotes the free slot). Now what do we do with free slots?
+
+</p>
+
+<p>
+
+As you might have guessed, we will use them for sneaking in the
+mail with few recipients. For example, if we have one four-recipient
+mail followed by four one-recipient mails, the delivery sequence
+(that is, the sequence in which the jobs are assigned to the
+delivery slots) might look like this: <tt>12131415</tt>. Hmm, fine
+for sneaking in the single-recipient mail, but how do we sneak in
+mail with more than one recipient? Say we have one four-recipient
+mail followed by two two-recipient mails?
+
+</p>
+
+<p>
+
+The simple answer would be to use the delivery sequence <tt>12121313</tt>.
+But the problem is that this does not scale well. Imagine you have
+mail with a thousand recipients followed by mail with a hundred
+recipients. It is tempting to suggest a delivery sequence like
+<tt>121212....</tt>, but alas! Imagine that another mail arrives
+with, say, ten recipients. There are no free slots anymore, so it
+can't slip by, not even if it had just one recipient. It will be
+stuck until the hundred-recipient mail is delivered, which really
+sucks.
+
+</p>
+
+<p>
+
+So, it becomes obvious that while inflating the message to get
+free slots is a great idea, one has to be really careful how the
+free slots are assigned, otherwise one might corner oneself. So,
+how does <tt>nqmgr</tt> really use the free slots?
+
+</p>
+
+<p>
+
+The key idea is that one does not have to generate the free slots
+in a uniform way. The delivery sequence <tt>111...1</tt> is no
+worse than <tt>1.1.1.1</tt>, in fact, it is even better as some
+entries are in the first case selected earlier than in the second
+case, and none is selected later! So it is possible to first
+"accumulate" the free delivery slots and then use them all at once.
+It is even possible to accumulate some, then use them, then accumulate
+some more and use them again, as in <tt>11..1.1</tt> .
+
+</p>
+
+<p>
+
+Let's get back to the hundred-recipient example. We now know
+that we could first accumulate one hundred free slots, and only
+then preempt the first job and sneak the hundred-recipient
+mail in. Applying the algorithm recursively, we see the
+hundred-recipient job can accumulate ten free delivery slots, and
+then we could preempt it and sneak in the ten-recipient mail...
+Wait wait wait! Could we? Aren't we overinflating the original one
+thousand recipient mail?
+
+</p>
+
+<p>
+
+Well, although it looks that way at first glance, another trick
+will allow us to answer "no, we are not!". If we had said that we will
+inflate the delivery time twice at maximum, and then we consider
+every other slot as a free slot, then we would overinflate in case
+of the recursive preemption. BUT! The trick is that if we use only
+every n-th slot as a free slot for n>2, there is always some worst
+inflation factor which we can guarantee not to be breached, even
+if we apply the algorithm recursively. To be precise, if for every
+k>1 normally used slots we accumulate one free delivery slot, then
+the inflation factor is not worse than k/(k-1) no matter how many
+recursive preemptions happen. And it's not worse than (k+1)/k if
+only non-recursive preemption happens. Now, having got through the
+theory and the related math, let's see how <tt>nqmgr</tt> implements
+this.
+
+</p>
+
+<p>
+
+Each job has a so-called "available delivery slot" counter. Each
+transport has a <a href="postconf.5.html#transport_delivery_slot_cost"><i>transport</i>_delivery_slot_cost</a> parameter, which
+defaults to the <a href="postconf.5.html#default_delivery_slot_cost">default_delivery_slot_cost</a> parameter, which is 5
+by default. This is the k from the paragraph above. Each time k
+entries of the job are selected for delivery, this counter is
+incremented by one. Once some slots are accumulated, a job which
+requires no more than that amount of slots to be fully delivered
+can preempt this job.
+
+</p>
+
+<p>
+
+[Well, the truth is, the counter is incremented every time an entry
+is selected, and it is divided by k when it is used. Or, more
+precisely, there is no division; the other side of the equation is
+multiplied by k. But for understanding, the above approximation of
+the truth is good enough.]
+
+</p>
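+
+<p> A small sketch of this bookkeeping in the multiply-by-k form
+just described (illustrative names and types, not the actual qmgr
+source): </p>
+
+<blockquote>
+<pre>
+/* Sketch of the "available delivery slot" bookkeeping. */
+typedef struct {
+    long    selected;           /* entries selected so far; this is k
+                                   times the accumulated-slot value */
+    long    entries_left;       /* entries still needed to finish */
+} SLOT_JOB;
+
+/* Called each time one of this job's entries is selected. */
+void    slot_select(SLOT_JOB *job)
+{
+    job->selected += 1;
+}
+
+/* May "candidate" preempt "current", given the slot cost k? */
+int     slot_can_preempt(SLOT_JOB *current, SLOT_JOB *candidate, long k)
+{
+    /* candidate->entries_left <= current->selected / k, no division */
+    return (candidate->entries_left * k <= current->selected);
+}
+</pre>
+</blockquote>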
+
+<p>
+
+OK, so now we know the conditions which must be satisfied for one
+job to preempt another one. But which job gets preempted, how do
+we choose which job preempts it if there are several valid candidates,
+and when exactly does all this happen?
+
+</p>
+
+<p>
+
+The answer to the first part is simple. The job whose entry was
+selected last is the so-called current job. Normally, it is
+the first job on the scheduler's job list, but destination concurrency
+limits may change this, as we will see later. It is always only the
+current job which may get preempted.
+
+</p>
+
+<p>
+
+Now for the second part. The current job has a certain amount of
+recipient entries, and as such may accumulate at maximum some amount
+of available delivery slots. It might have already accumulated some,
+and perhaps even already used some when it was preempted before
+(remember a job can be preempted several times). In either case,
+we know how many are accumulated and how many are left to deliver,
+so we know how many it may yet accumulate at maximum. Every other
+job which may be delivered by less than that amount of slots is a
+valid candidate for preemption. How do we choose among them?
+
+</p>
+
+<p>
+
+The answer is - the one with the maximum enqueue_time/recipient_entry_count.
+That is, the older the job is, the harder we should try to deliver
+it in order to get the best message delivery rates. These rates are
+of course subject to how many recipients the message has, therefore
+the division by the recipient (entry) count. No one shall be surprised
+that a message with n recipients takes n times longer to deliver than
+a message with one recipient.
+
+</p>
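+
+<p> A sketch of this comparison (illustrative only; "age" stands
+for the time the message has spent in the queue, and cross-multiplication
+keeps the test in integer arithmetic): </p>
+
+<blockquote>
+<pre>
+/* Prefer the candidate with the largest age/entry-count ratio. */
+typedef struct {
+    long    age;                /* how long the message has been queued */
+    long    entries;            /* (estimated) recipient entry count */
+} CANDIDATE;
+
+/* Nonzero when "a" is a better preemption candidate than "b". */
+int     better_candidate(CANDIDATE *a, CANDIDATE *b)
+{
+    /* a->age / a->entries > b->age / b->entries, cross-multiplied */
+    return (a->age * b->entries > b->age * a->entries);
+}
+</pre>
+</blockquote>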
+
+<p>
+
+Now let's recap the previous two paragraphs. Isn't it too complicated?
+Why aren't the candidates chosen only from among the jobs which can be
+delivered within the amount of slots the current job has already
+accumulated? Why do we need to estimate how much it has yet to
+accumulate? If you found out the answer, congratulate yourself. If
+we did it this simple way, we would always choose the candidate
+with the fewest recipient entries. If there were enough single-recipient
+mails coming in, they would always slip by the bulk mail as soon
+as possible, and mail with two or more recipients would never get
+a chance, no matter how long it has been sitting around in the
+job list.
+
+</p>
+
+<p>
+
+This candidate selection has an interesting implication - that when
+we choose the best candidate for preemption (this is done in
+qmgr_choose_candidate()), it may happen that we cannot use it for
+preemption immediately. This leads to an answer to the last part
+of the original question - when does the preemption happen?
+
+</p>
+
+<p>
+
+The preemption attempt happens every time the transport's next recipient
+entry is to be chosen for delivery. To avoid needless overhead, the
+preemption is not attempted if the current job could never accumulate
+more than <a href="postconf.5.html#transport_minimum_delivery_slots"><i>transport</i>_minimum_delivery_slots</a> (defaults to
+<a href="postconf.5.html#default_minimum_delivery_slots">default_minimum_delivery_slots</a> which defaults to 3). If enough
+slots have already accumulated to preempt the current job by the
+chosen best candidate, it is done immediately. This basically means
+that the candidate is moved in front of the current job on the
+scheduler's job list and the accumulated slot counter is decreased
+by the amount used by the candidate. If there are not enough slots...
+well, I could say that nothing happens and another preemption
+is attempted the next time. But that's not the complete truth.
+
+</p>
+
+<p>
+
+The truth is that it turns out that it is not really necessary to
+wait until the job's counter accumulates all the delivery slots in
+advance. Say we have a ten-recipient mail followed by two two-recipient
+mails. If the preemption happened when enough delivery slots accumulate
+(assuming slot cost 2), the delivery sequence becomes
+<tt>11112211113311</tt>. Now what would we get if we waited
+only for 50% of the necessary slots to accumulate, promising to
+wait for the remaining 50% later, after we get back
+to the preempted job? If we use such a slot loan, the delivery sequence
+becomes <tt>11221111331111</tt>. As we can see, this is not
+considerably worse for the delivery of the ten-recipient mail, but
+it allows the small messages to be delivered sooner.
+
+</p>
+
+<p>
+
+The concept of these slot loans is where the
+<a href="postconf.5.html#transport_delivery_slot_discount"><i>transport</i>_delivery_slot_discount</a> and
+<a href="postconf.5.html#transport_delivery_slot_loan"><i>transport</i>_delivery_slot_loan</a> come from (they default to
+<a href="postconf.5.html#default_delivery_slot_discount">default_delivery_slot_discount</a> and <a href="postconf.5.html#default_delivery_slot_loan">default_delivery_slot_loan</a>, whose
+values are by default 50 and 3, respectively). The discount (resp.
+loan) specifies how many percent (resp. how many slots) one "gets
+in advance", when the amount of slots required to deliver the best
+candidate is compared with the amount of slots the current job has
+accumulated so far.
+
+</p>
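+
+<p> A sketch of the resulting test, under the assumption that the
+discount and the loan both reduce the number of slots that must
+have accumulated before preemption may happen (illustrative, not
+the actual qmgr arithmetic): </p>
+
+<blockquote>
+<pre>
+/* Can we preempt now, given the discount (percent) and loan (slots)? */
+int     enough_slots(long accumulated, long needed,
+                             long discount_pct, long loan)
+{
+    long    required = needed - loan - (needed * discount_pct) / 100;
+
+    if (required < 0)
+        required = 0;
+    return (accumulated >= required);
+}
+</pre>
+</blockquote>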
+
+<p>
+
+And that pretty much concludes this chapter.
+
+</p>
+
+<p>
+
+[Now you should have a feeling that you pretty much understand the
+scheduler and the preemption, or at least that you will have it
+after you read the last chapter a couple more times. You shall clearly
+see the job list and the preemption happening at its head, in ideal
+delivery conditions. The feeling of understanding shall last until
+you start wondering what happens if some of the jobs are blocked,
+which you might eventually figure out correctly from what has been
+said already. But I would be surprised if your mental image of the
+scheduler's functionality is not completely shattered once you
+start wondering how it works when not all recipients may be read
+in-core. More on that later.]
+
+</p>
+
+<h3> <a name="<tt>nqmgr</tt>_concurrency"> How destination concurrency
+limits affect the scheduling algorithm </a> </h3>
+
+<p>
+
+The <tt>nqmgr</tt> uses the same algorithm for destination concurrency
+control as <tt>oqmgr</tt>. Now what happens when the destination
+limits are reached and no more entries for that destination may be
+selected by the scheduler?
+
+</p>
+
+<p>
+
+From the user's point of view it is all simple. If some of the peers
+of a job can't be selected, those peers are simply skipped by the
+entry selection algorithm (the pseudo-code described before) and
+only the selectable ones are used. If none of the peers may be
+selected, the job is declared a "blocker job". Blocker jobs are
+skipped by the entry selection algorithm and they are also excluded
+from the candidates for preemption of the current job. Thus the scheduler
+effectively behaves as if the blocker jobs didn't exist on the job
+list at all. As soon as at least one of the peers of a blocker job
+becomes unblocked (that is, the delivery agent handling the delivery
+of the recipient entry for the given destination successfully finishes),
+the job's blocker status is removed and the job again participates
+in all further scheduler actions normally.
+
+</p>
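+
+<p> A sketch of the blocker test just described (minimal illustrative
+types, not the qmgr declarations): </p>
+
+<blockquote>
+<pre>
+/* A job is a "blocker" when none of its peers may be selected. */
+typedef struct BQUEUE { int concurrency; int window; } BQUEUE;
+typedef struct BPEER { BQUEUE *queue; struct BPEER *next; } BPEER;
+typedef struct BJOB { BPEER *peers; } BJOB;
+
+int     job_is_blocker(BJOB *job)
+{
+    BPEER  *pp;
+
+    for (pp = job->peers; pp != 0; pp = pp->next)
+        if (pp->queue->concurrency < pp->queue->window)
+            return (0);                 /* selectable peer found */
+    return (1);                         /* all peers blocked */
+}
+</pre>
+</blockquote>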
+
+<p>
+
+So the summary is that users don't really have to be concerned
+about the interaction of the destination limits and the scheduling
+algorithm. It works well on its own and there are no knobs they
+would need to control it.
+
+</p>
+
+<p>
+
+From a programmer's point of view, the blocker jobs complicate the
+scheduler quite a lot. Without them, the jobs on the job list would
+be normally delivered in strict FIFO order. If the current job is
+preempted, the job preempting it is completely delivered unless it
+is preempted itself. Without blockers, the current job is thus
+always either the first job on the job list, or the top of the stack
+of jobs preempting the first job on the job list.
+
+</p>
+
+<p>
+
+The visualization of the job list and the preemption stack without
+blockers would be like this:
+
+</p>
+
+<blockquote>
+<pre>
+first job-> 1--2--3--5--6--8--... <- job list
+on job list |
+ 4 <- preemption stack
+ |
+current job-> 7
+</pre>
+</blockquote>
-<h3> <a name="job_design"> How the non-preemptive queue manager scheduler works </a> </h3>
+<p>
-<p> The following text is from Patrik Rak and should be read together
-with the <a href="postconf.5.html">postconf(5)</a> manual that describes each configuration
-parameter in detail. </p>
+In the example above we see that job 1 was preempted by job 4 and
+then job 4 was preempted by job 7. After job 7 is completed, the
+remaining entries of job 4 are selected, and once they are all
+selected, job 1 continues.
-<p> From user's point of view, <a href="qmgr.8.html">oqmgr(8)</a> and <a href="qmgr.8.html">qmgr(8)</a> are both the same,
-except for how next message is chosen when delivery agent becomes
-available. You already know that <a href="qmgr.8.html">oqmgr(8)</a> uses round-robin by destination
-while <a href="qmgr.8.html">qmgr(8)</a> uses simple FIFO, except for some preemptive magic.
-The <a href="postconf.5.html">postconf(5)</a> manual documents all the knobs the user
-can use to control this preemptive magic - there is nothing else
-to the preemption than the quite simple conditions described in there.
</p>
-<p> As for programmer-level documentation, this will have to be
-extracted from all those emails we have exchanged with Wietse [rats!
-I hoped that Patrik would do the work for me -- Wietse] But I think
-there are no missing bits which we have not mentioned in our
-conversations. </p>
+<p>
-<p> However, even from programmer's point of view, there is nothing
-more to add to the message scheduling idea itself. There are few
-things which make it look more complicated than it is, but the
-algorithm is the same as the user perceives it. The summary of the
-differences of the programmer's view from the user's view are: </p>
+As we see, it's all very clean and straightforward. Now how does
+this change because of blockers?
+
+</p>
+
+<p>
+
+The answer is: a lot. Any job may become a blocker job at any time,
+and become a normal job again at any time. This has several
+important implications:
+
+</p>
<ol>
- <li> <p> Simplification of terms for users: The user knows
- about messages and recipients. The program itself works with
- jobs (one message is split among several jobs, one per each
- transport needed to deliver the message) and queue entries
- (each entry may group several recipients for same destination).
- Then there is the peer structure introduced by <a href="qmgr.8.html">qmgr(8)</a> which is
- simply per-job analog of the queue structure. </p>
-
- <li> <p> Dealing with concurrency limits: The actual implementation
- is complicated by the fact that the messages (resp. jobs) may
- not be delivered in the exactly scheduled order because of the
- concurrency limits. It is necessary to skip some "blocker" jobs
- when the concurrency limit is reached and get back to them
- again when the limit permits. </p>
-
- <li> <p> Dealing with resource limits: The actual implementation is
- complicated by the fact that not all recipients may be read in-core.
- Therefore each message has some recipients in-core and some may
- remain on-file. This means that a) the preemptive algorithm needs
- to work with recipient count estimates instead of exact counts, b)
- there is extra code which needs to manipulate the per-transport
- pool of recipients which may be read in-core at the same time, and
- c) there is extra code which needs to be able to read recipients
- into core in batches and which is triggered at appropriate moments. </p>
-
- <li> <p> Doing things efficiently: All important things I am
- aware of are done in the minimum time possible (either directly
- or at least when amortized complexity is used), but to choose
- which job is the best candidate for preempting the current job
- requires linear search of up to all transport jobs (the worst
- theoretical case - the reality is much better). As this is done
- every time the next queue entry to be delivered is about to be
- chosen, it seemed reasonable to add cache which minimizes the
- overhead. Maintenance of this candidate cache slightly obfuscates
- things.
+<li> <p>
+
+The jobs may be completed in arbitrary order. For example, in the
+example above, if the current job 7 becomes blocked, the next job
+4 may complete before job 7 becomes unblocked again. Or if both
+7 and 4 are blocked, then 1 is completed, then 7 becomes unblocked
+and is completed, then 2 is completed and only after that 4 becomes
+unblocked and is completed... You get the idea.
+
+</p>
+
+<p>
+
+[Interesting side note: even when jobs are delivered out of order,
+from a single destination's point of view the jobs are still delivered
+in the expected order (that is, FIFO unless there was some preemption
+involved). This is because whenever a destination queue becomes
+unblocked (the destination limit allows selection of more recipient
+entries for that destination), all jobs which have peers for that
+destination are unblocked at once.]
+
+</p>
+
+<li> <p>
+
+The idea of the preemption stack at the head of the job list is
+gone. That is, it must be possible to preempt any job on the job
+list. For example, if jobs 7, 4, 1 and 2 in the example above
+all become blocked, job 3 becomes the current job. And of course
+we do not want the preemption to be affected by whether there
+are blocked jobs or not. Therefore, if it turns out that job
+3 might be preempted by job 6, the implementation shall make it
+possible.
+
+</p>
+
+<li> <p>
+
+The idea of the linear preemption stack itself is gone. It's no
+longer true that one job is always preempted by only one job at one
+time (that is, directly preempted, not counting the recursively
+nested jobs). For example, in the example above, job 1 is directly
+preempted by only job 4, and job 4 by job 7. Now assume job 7 becomes
+blocked, and job 4 is being delivered. If it accumulates enough
+delivery slots, it is natural that it might be preempted for example
+by job 8. Now job 4 is preempted by both job 7 AND job 8 at the
+same time.
+
+</p>
</ol>
-<p> The points 2 and 3 are those which made the implementation
-(look) complicated and were the real coding work, but I believe
-that to understand the scheduling algorithm itself (which was the
-real thinking work) is fairly easy. </p>
+<p>
+
+Now combine points 2) and 3) with point 1) again and you realize
+that the relations on the once linear job list become pretty
+complicated. If we extend the point 3) example: jobs 7 and 8 preempt
+job 4, now job 8 becomes blocked too, then job 4 completes. Tricky,
+huh?
+
+</p>
+
+<p>
+
+If I illustrate the relations after the above-mentioned examples
+(except those in point 1)), the situation would look like this:
+
+</p>
+
+<blockquote>
+<pre>
+ v- parent
+
+adoptive parent -> 1--2--3--5--... <- "stack" level 0
+ | |
+parent gone -> ? 6 <- "stack" level 1
+ / \
+children -> 7 8 ^- child <- "stack" level 2
+
+ ^- siblings
+</pre>
+</blockquote>
+
+<p>
+
+Now how does <tt>nqmgr</tt> deal with all these complicated relations?
+
+</p>
+
+<p>
+
+Well, it maintains them all as described, but fortunately, all these
+relations are necessary only for purposes of proper counting of
+available delivery slots. For purposes of ordering the jobs for
+entry selection, the original rule still applies: "the job preempting
+the current job is moved in front of the current job on the job
+list". So for entry selection purposes, the job relations remain
+as simple as this:
+
+</p>
+
+<blockquote>
+<pre>
+7--8--1--2--6--3--5--.. <- scheduler's job list order
+</pre>
+</blockquote>
+
+<p>
+
+The job list order and the preemption parent/child/siblings relations
+are maintained separately. And because the selection works only
+with the job list, you can happily forget about those complicated
+relations unless you want to study the <tt>nqmgr</tt> sources. In
+that case the text above might provide some helpful introduction
+to the problem domain. Otherwise I suggest you just forget about
+all this and stick with the user's point of view: the blocker jobs
+are simply ignored.
+
+</p>
+
+<p>
+
+[By now, you should have a feeling that there are more things going
+on under the hood than you ever wanted to know. You decide that
+forgetting about this chapter is the best you can do for the sake
+of your mind's health and you basically stick with the idea of how the
+scheduler works in ideal conditions, when there are no blockers,
+which is good enough.]
+
+</p>
+
+<h3> <a name="<tt>nqmgr</tt>_memory"> Dealing with memory resource
+limits </a> </h3>
+
+<p>
+
+When discussing the <tt>nqmgr</tt> scheduler, we have so far assumed
+that all recipients of all messages in the <a href="QSHAPE_README.html#active_queue">active queue</a> are completely
+read into memory. This is simply not true. There is an upper
+bound on the amount of memory the <tt>nqmgr</tt> may use, and
+therefore it must impose some limits on the information it may store
+in memory at any given time.
+
+</p>
+
+<p>
+
+First of all, not all messages may be read in-core at once. At any
+time, at most <a href="postconf.5.html#qmgr_message_active_limit">qmgr_message_active_limit</a> messages may be read
+in-core. When read into memory, the messages are picked from the
+<a href="QSHAPE_README.html#incoming_queue">incoming</a> and deferred message queues and moved to the <a href="QSHAPE_README.html#active_queue">active queue</a>
+(incoming having priority), so if more than
+<a href="postconf.5.html#qmgr_message_active_limit">qmgr_message_active_limit</a> messages are queued for delivery, the
+rest will have to wait until (some of) the messages in the active
+queue are completely delivered (or deferred).
+
+</p>
+
+<p>
+
+Even with the limited amount of in-core messages, there is another
+limit which must be imposed in order to avoid memory exhaustion.
+Each message may contain a huge number of recipients (tens or hundreds
+of thousands are not uncommon), so if <tt>nqmgr</tt> were to read all
+recipients of all messages in the <a href="QSHAPE_README.html#active_queue">active queue</a>, it might easily run
+out of memory. Therefore there must be some upper bound on the
+number of message recipients which are read into memory at the
+same time.
+
+</p>
+
+<p>
+
+Before discussing how exactly <tt>nqmgr</tt> implements the recipient
+limits, let's see how the sole existence of the limits themselves
+affects the <tt>nqmgr</tt> and its scheduler.
+
+</p>
+
+<p>
+
+The message limit is straightforward - it just limits the size of
+the lookahead the <tt>nqmgr</tt>'s scheduler has when choosing which
+message can preempt the current one. Messages not in the active
+queue are simply not considered at all.
+
+</p>
+
+<p>
+
+The recipient limit complicates more things. First of all, the
+message reading code must support reading the recipients in batches,
+which among other things means accessing the queue file several
+times and continuing where the last recipient batch ended. This is
+invoked by the scheduler whenever the current job runs out of in-core
+recipients and more are required. It is also done whenever all
+in-core recipients of the message are dealt with (which may also
+mean they were deferred) but there are still more in the queue file.
+
+</p>
+
+<p>
+
+The second complication is that with some recipients left unread
+in the queue file, the scheduler can't operate with exact counts
+of recipient entries. With unread recipients, it is not clear how
+many recipient entries there will be, as they are subject to
+per-destination grouping. It is not even clear to what transports
+(and thus jobs) the recipients will be assigned. And with messages
+coming from the <a href="QSHAPE_README.html#deferred_queue">deferred queue</a>, it is not even clear how many unread
+recipients are still to be delivered. This all means that the
+scheduler must use only estimates of how many recipient entries
+there will be. Fortunately, it is possible to estimate the minimum
+and maximum correctly, so the scheduler can always err on the safe
+side. Obviously, the better the estimates, the better the results, so
+it is best when we are able to read all recipients in-core and turn
+the estimates into exact counts, or at least try to read as many
+as possible to make the estimates as accurate as possible.
+
+</p>
+
+<p>
+
+The third complication is that it is no longer true that the scheduler
+is done with a job once all of its in-core recipients are delivered.
+It is possible that the job will be revived later, when another
+batch of recipients is read in core. It is also possible that some
+jobs will be created for the first time long after the first batch
+of recipients was read in core. The <tt>nqmgr</tt> code must be
+ready to handle all such situations.
+
+</p>
+
+<p>
+
+And finally, the fourth complication is that the <tt>nqmgr</tt>
+code must somehow impose the recipient limit itself. Now how does
+it achieve this?
+
+</p>
+
+<p>
+
+Perhaps the easiest solution would be to say that each message may
+have at maximum X recipients stored in-core, but such a solution would
+be poor for several reasons. With reasonable <a href="postconf.5.html#qmgr_message_active_limit">qmgr_message_active_limit</a>
+values, the X would have to be quite low to maintain a reasonable
+memory footprint. And with a low X, lots of things would not work well.
+The <tt>nqmgr</tt> would have problems using the
+<a href="postconf.5.html#transport_destination_recipient_limit"><i>transport</i>_destination_recipient_limit</a> efficiently. The
+scheduler's preemption would be suboptimal as the recipient count
+estimates would be inaccurate. The message queue file would have
+to be accessed many times to read in more recipients again and
+again.
+
+</p>
+
+<p>
+
+Therefore it seems reasonable to have a solution which does not use
+a limit imposed on a per-message basis, but which maintains a pool
+of available recipient slots, which can be shared among all messages
+in the most efficient manner. And as we do not want separate
+transports to compete for resources whenever possible, it seems
+appropriate to maintain such a recipient pool for each transport
+separately. This is the general idea, now how does it work in
+practice?
+
+</p>
+
+<p>
+
+First we have to solve a little chicken-and-egg problem. If we want
+to use the per-transport recipient pools, we first need to know to
+what transport(s) the message is assigned. But we will find that
+out only after we read in the recipients. So it is obvious
+that we first have to read in some recipients, use them to find out
+to what transports the message is to be assigned, and only after
+that can we use the per-transport recipient pools.
+
+</p>
+
+<p>
+
+Now how many recipients shall we read for the first time? This is
+what <a href="postconf.5.html#qmgr_message_recipient_minimum">qmgr_message_recipient_minimum</a> and <a href="postconf.5.html#qmgr_message_recipient_limit">qmgr_message_recipient_limit</a>
+values control. The <a href="postconf.5.html#qmgr_message_recipient_minimum">qmgr_message_recipient_minimum</a> value specifies
+how many recipients of each message we will read for the first time,
+no matter what. It is necessary to read at least one recipient
+before we can assign the message to a transport and create the first
+job. However, reading only <a href="postconf.5.html#qmgr_message_recipient_minimum">qmgr_message_recipient_minimum</a> recipients
+even if there are only a few messages with a few recipients in-core
+would be wasteful. Therefore if there are fewer than <a href="postconf.5.html#qmgr_message_recipient_limit">qmgr_message_recipient_limit</a>
+recipients in-core so far, the first batch of recipients may be
+larger than <a href="postconf.5.html#qmgr_message_recipient_minimum">qmgr_message_recipient_minimum</a> - as large as is required
+to reach the <a href="postconf.5.html#qmgr_message_recipient_limit">qmgr_message_recipient_limit</a> limit.
+
+</p>
+
+<p>
+
+Once the first batch of recipients has been read in core and the message
+jobs were created, the size of the subsequent recipient batches (if
+any - of course it's best when all recipients are read in one batch)
+is based solely on the position of the message jobs on their
+corresponding transport's job lists. Each transport has a pool of
+<a href="postconf.5.html#transport_recipient_limit"><i>transport</i>_recipient_limit</a> recipient slots which it can
+distribute among its jobs (how this is done is described later).
+The subsequent recipient batch may be as large as the sum of all
+recipient slots of all jobs of the message permits (plus the
+<a href="postconf.5.html#qmgr_message_recipient_minimum">qmgr_message_recipient_minimum</a> amount which always applies).
+
+</p>
+
+<p>
+
+For example, if a message has three jobs, the first with 1 recipient
+still in-core and 4 recipient slots, the second with 5 recipients
+in-core and 5 recipient slots, and the third with 2 recipients in-core
+and 0 recipient slots, it has 1+5+2=7 recipients in-core and 4+5+0=9
+job recipient slots in total. This means that we could immediately
+read 2+<a href="postconf.5.html#qmgr_message_recipient_minimum">qmgr_message_recipient_minimum</a> more recipients of that message
+in core.
+
+</p>
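+
+<p> The arithmetic of this example as a tiny self-contained sketch
+(the value 100 for qmgr_message_recipient_minimum is just an
+illustrative default): </p>
+
+<blockquote>
+<pre>
+#include &lt;stdio.h&gt;
+
+int     main(void)
+{
+    long    in_core[] = {1, 5, 2};      /* recipients in-core, per job */
+    long    slots[] = {4, 5, 0};        /* recipient slots, per job */
+    long    minimum = 100;              /* qmgr_message_recipient_minimum */
+    long    i, total_in_core = 0, total_slots = 0;
+
+    for (i = 0; i < 3; i++) {
+        total_in_core += in_core[i];
+        total_slots += slots[i];
+    }
+    /* 9 - 7 + minimum = 2 + minimum more recipients may be read. */
+    printf("may read %ld more recipients\n",
+           total_slots - total_in_core + minimum);
+    return (0);
+}
+</pre>
+</blockquote>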
+
+<p>
+
+The above example illustrates several things which might be worth
+mentioning explicitly: first, note that although the per-transport
+slots are assigned to particular jobs, we can't guarantee that,
+once the next batch of recipients is read in core, the corresponding
+amounts of recipients will be assigned to those jobs. The jobs lend
+their slots to the message as a whole, so it is possible that some
+jobs end up sponsoring other jobs of their message. For example,
+if in the example above the 2 newly read recipients were assigned
+to the second job, the first job sponsored the second job with 2
+slots. The second notable thing is the third job, which has more
+recipients in-core than it has slots. Apart from the sponsoring by
+another job which we just saw, this can be the result of the first
+recipient batch, which is sponsored from the global recipient pool
+of <a href="postconf.5.html#qmgr_message_recipient_limit">qmgr_message_recipient_limit</a>
+recipients. It can also be sponsored from the message recipient
+pool of <a href="postconf.5.html#qmgr_message_recipient_minimum">qmgr_message_recipient_minimum</a> recipients.
+
+</p>
+
+<p>
+
+Now how does each transport distribute the recipient slots among
+its jobs? The strategy is quite simple. As most scheduler activity
+happens on the head of the job list, it is our intention to make
+sure that the scheduler has the best estimates of the recipient
+counts for those jobs. As we mentioned above, this means that we
+want to try to make sure that the messages of those jobs have all
+recipients read in-core. Therefore the transport distributes the
+slots "along" the job list from start to end. In this case the job
+list sorted by message enqueue time is used, because it doesn't
+change over time as the scheduler's job list does.
+
+</p>
+
+<p>
+
+More specifically, each time a job is created and appended to the
+job list, it gets all unused recipient slots from its transport's
+pool. It keeps them until all recipients of its message are read.
+When this happens, all unused recipient slots are transferred to
+the next job (which is in fact now the first such job) on the job
+list which still has some recipients unread, or eventually back to
+the transport pool if there is no such job. Such a transfer also
+happens whenever a recipient entry of that job is delivered.
+
+</p>
+
+<p>
+
+There is also a scenario when a job is not appended to the end of
+the job list (for example, it was created as a result of a second or
+later recipient batch). Then it works exactly as above, except that
+if it was put in front of the first unread job (that is, the job
+of a message which still has some unread recipients in the queue file),
+that job is first forced to return all of its unused recipient slots
+to the transport pool.
+
+</p>
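+
+<p> A sketch of this transfer rule (illustrative types; the job
+list here is the one ordered by enqueue time): </p>
+
+<blockquote>
+<pre>
+/* When a job has read all recipients of its message, its unused
+ * slots move to the first job that still has unread recipients,
+ * or back to the transport pool when there is no such job. */
+typedef struct RJOB {
+    int     unread;             /* recipients still in the queue file */
+    long    slots;              /* unused recipient slots held */
+    struct RJOB *next;          /* job list ordered by enqueue time */
+} RJOB;
+
+void    transfer_slots(RJOB *job, long *transport_pool)
+{
+    RJOB   *jp;
+
+    if (job->unread > 0)
+        return;                         /* still reading: keep the slots */
+    for (jp = job->next; jp != 0; jp = jp->next) {
+        if (jp->unread > 0) {
+            jp->slots += job->slots;    /* first unread job gets them */
+            job->slots = 0;
+            return;
+        }
+    }
+    *transport_pool += job->slots;      /* no unread job: back to pool */
+    job->slots = 0;
+}
+</pre>
+</blockquote>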
+
+<p>
+
+The algorithm just described leads to the following state: The first
+unread job on the job list always gets all the remaining recipient
+slots of that transport (if there are any). The jobs queued before
+this job are completely read (that is, all recipients of their
+message were already read in core) and have at maximum as many slots
+as they still have recipients in-core (the maximum is there because
+of the sponsoring mentioned before) and the jobs after this job get
+nothing from the transport recipient pool (unless they got something
+before and then the first unread job was created and enqueued in
+front of them later - in which case they also get at maximum as many
+slots as they have recipients in-core).
+
+</p>
+
+<p>
+
+Things work fine in such a state for most of the time, because the
+current job is either completely read in-core or has as many recipient
+slots as there are, but there is one situation which we still have
+to take special care of. Imagine that the current job is preempted
+by some unread job from the job list and there are no more recipient
+slots available, so this new current job could read only batches
+of <a href="postconf.5.html#qmgr_message_recipient_minimum">qmgr_message_recipient_minimum</a> recipients at a time. This would
+really degrade performance. For this reason, each transport has an
+extra pool of <a href="postconf.5.html#transport_extra_recipient_limit"><i>transport</i>_extra_recipient_limit</a> recipient
+slots, dedicated exactly for this situation. Each time an unread
+job preempts the current job, it gets half of the remaining recipient
+slots from the normal pool and this extra pool.
+
+</p>
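+
+<p> A sketch of this rule, under the assumption that "half of the
+remaining recipient slots" means half of each pool (the text leaves
+the exact split open): </p>
+
+<blockquote>
+<pre>
+/* Slots granted to an unread job that preempts the current job. */
+long    preemption_slot_grant(long *normal_pool, long *extra_pool)
+{
+    long    grant = *normal_pool / 2 + *extra_pool / 2;
+
+    *normal_pool -= *normal_pool / 2;
+    *extra_pool -= *extra_pool / 2;
+    return (grant);
+}
+</pre>
+</blockquote>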
+
+<p>
+
+And that's it. It sure does sound pretty complicated, but fortunately
+most people don't really have to care how exactly it works as long
+as it works. Perhaps the only important things to know for most
+people are the following upper bound formulas:
+
+</p>
+
+<p>
+
+Each transport has at maximum
+
+</p>
+
+<blockquote>
+<pre>
+max(
+<a href="postconf.5.html#qmgr_message_recipient_minimum">qmgr_message_recipient_minimum</a> * <a href="postconf.5.html#qmgr_message_active_limit">qmgr_message_active_limit</a>
++ *_recipient_limit + *_extra_recipient_limit,
+<a href="postconf.5.html#qmgr_message_recipient_limit">qmgr_message_recipient_limit</a>
+)
+</pre>
+</blockquote>
+
+<p>
+
+recipients in core.
+
+</p>
+
+<p>
+
+The total amount of recipients in core is
+
+</p>
+
+<blockquote>
+<pre>
+max(
+<a href="postconf.5.html#qmgr_message_recipient_minimum">qmgr_message_recipient_minimum</a> * <a href="postconf.5.html#qmgr_message_active_limit">qmgr_message_active_limit</a>
++ sum( *_recipient_limit + *_extra_recipient_limit ),
+<a href="postconf.5.html#qmgr_message_recipient_limit">qmgr_message_recipient_limit</a>
+)
+</pre>
+</blockquote>
+
+<p>
+
+where the sum is over all used transports.
+
+</p>
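+
+<p> As a sketch, the total bound can be computed like this (the
+parameter values would come from main.cf; the names here are just
+function arguments): </p>
+
+<blockquote>
+<pre>
+/* Upper bound on recipients in core, per the formula above. */
+long    total_recipient_bound(long recipient_minimum, long active_limit,
+                                      long slot_sum, long extra_slot_sum,
+                                      long recipient_limit)
+{
+    long    bound = recipient_minimum * active_limit
+        + slot_sum + extra_slot_sum;
+
+    return (bound > recipient_limit ? bound : recipient_limit);
+}
+</pre>
+</blockquote>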
+
+<p>
+
+And this terribly complicated chapter concludes the documentation
+of the <tt>nqmgr</tt> scheduler.
+
+</p>
+
+<p>
+
+[By now you should theoretically know the <tt>nqmgr</tt> scheduler
+inside out. In practice, you still hope that you will never have
+to really understand the last or last two chapters completely, and
+fortunately most people really won't. Understanding how the scheduler
+works in ideal conditions is more than good enough for the vast
+majority of users.]
+
+</p>
<h2> <a name="credits"> Credits </a> </h2>
<li> These simplifications, and their modular implementation, helped
to develop further insights into the different roles that positive
-and negative concurrency feedback play, and helped to avoid all the
-known worst-case scenarios.
+and negative concurrency feedback play, and helped to identify some
+worst-case scenarios.
</ul>
/etc/postfix/<a href="postconf.5.html">main.cf</a>:
<a href="postconf.5.html#smtp_tls_CAfile">smtp_tls_CAfile</a> = /etc/postfix/cacert.pem
<a href="postconf.5.html#smtp_tls_session_cache_database">smtp_tls_session_cache_database</a> =
- btree:/var/spool/postfix/smtp_tls_session_cache
+ btree:/var/lib/postfix/smtp_tls_session_cache
<a href="postconf.5.html#smtp_use_tls">smtp_use_tls</a> = yes
<a href="postconf.5.html#smtpd_tls_CAfile">smtpd_tls_CAfile</a> = /etc/postfix/cacert.pem
<a href="postconf.5.html#smtpd_tls_cert_file">smtpd_tls_cert_file</a> = /etc/postfix/FOO-cert.pem
<a href="postconf.5.html#smtpd_tls_key_file">smtpd_tls_key_file</a> = /etc/postfix/FOO-key.pem
<a href="postconf.5.html#smtpd_tls_received_header">smtpd_tls_received_header</a> = yes
<a href="postconf.5.html#smtpd_tls_session_cache_database">smtpd_tls_session_cache_database</a> =
- btree:/var/spool/postfix/smtpd_tls_session_cache
+ btree:/var/lib/postfix/smtpd_tls_session_cache
<a href="postconf.5.html#tls_random_source">tls_random_source</a> = dev:/dev/urandom
# Postfix 2.3 and later
<a href="postconf.5.html#smtpd_tls_security_level">smtpd_tls_security_level</a> = may
<i>number</i> equal to "1", a destination's delivery concurrency
is decremented by 1 after each failed pseudo-cohort. </dd>
-<dt> <b><i>number</i> / sqrt_concurrency </b> </dt>
-
-<dd> Variable feedback of "<i>number</i> / sqrt(delivery concurrency)".
-The <i>number</i> must be in the range 0..1 inclusive. This setting
-may be removed in a future version. </dd>
-
</dl>
<p> A pseudo-cohort is the number of deliveries equal to a destination's
<i>number</i> equal to "1", a destination's delivery concurrency
is incremented by 1 after each successful pseudo-cohort. </dd>
-<dt> <b><i>number</i> / sqrt_concurrency </b> </dt>
-
-<dd> Variable feedback of "<i>number</i> / sqrt(delivery concurrency)".
-The <i>number</i> must be in the range 0..1 inclusive. This setting
-may be removed in a future version. </dd>
-
</dl>
<p> A pseudo-cohort is the number of deliveries equal to a destination's
The \fInumber\fR must be in the range 0..1 inclusive. With
\fInumber\fR equal to "1", a destination's delivery concurrency
is decremented by 1 after each failed pseudo-cohort.
-.IP "\fB\fInumber\fR / sqrt_concurrency \fR"
-Variable feedback of "\fInumber\fR / sqrt(delivery concurrency)".
-The \fInumber\fR must be in the range 0..1 inclusive. This setting
-may be removed in a future version.
.PP
A pseudo-cohort is the number of deliveries equal to a destination's
delivery concurrency.
The \fInumber\fR must be in the range 0..1 inclusive. With
\fInumber\fR equal to "1", a destination's delivery concurrency
is incremented by 1 after each successful pseudo-cohort.
-.IP "\fB\fInumber\fR / sqrt_concurrency \fR"
-Variable feedback of "\fInumber\fR / sqrt(delivery concurrency)".
-The \fInumber\fR must be in the range 0..1 inclusive. This setting
-may be removed in a future version.
.PP
A pseudo-cohort is the number of deliveries equal to a destination's
delivery concurrency.
# - Process input as text blocks separated by one or more empty
# (or all whitespace) lines.
#
+# - Skip text between <!-- and -->; each must be on a different line.
+#
# - Don't touch blocks that start with `<' in column zero.
#
# The only changes made are:
# Gobble up the next text block.
$block = "";
+ $comment = 0;
do {
$_ =~ s/\s+\n$/\n/;
$block .= $_;
- } while(($_ = <>) && /\S/);
+ if ($_ =~ /<!--/)
+ { $comment = 1; }
+ if ($comment && $_ =~ /-->/)
+ { $comment = 0; $block =~ s/<!--.*-->//sg; }
+ } while((($_ = <>) && /\S/) || $comment);
+
+ # Skip blanks after comment elimination.
+ if ($block =~ /^\s/) {
+ $block =~ s/^\s+//s;
+ next if ($block eq "");
+ }
# Don't touch a text block starting with < in column zero.
if ($block =~ /^</) {
<!doctype html public "-//W3C//DTD HTML 4.01 Transitional//EN"
- "http://www.w3.org/TR/html4/loose.dtd">
+ "http://www.w3.org/TR/html4/loose.dtd">
<html>
<body>
-<h1><img src="postfix-logo.jpg" width="203" height="98" ALT="">Postfix Queue Scheduler</h1>
+<h1><img src="postfix-logo.jpg" width="203" height="98" ALT="">Postfix
+Queue Scheduler</h1>
<hr>
the last delivery attempt. There are two major classes of mechanisms
that control the operation of the queue manager. </p>
-<p> The first class of mechanisms is concerned with the number of
-concurrent deliveries to a specific destination, including decisions
-on when to suspend deliveries after persistent failures: </p>
-
- <ul>
-
- <li> <a href="#concurrency"> Concurrency scheduling </a>
-
- <ul>
-
- <li> <a href="#concurrency_summary_2_5"> Summary of the
- Postfix 2.5 concurrency feedback algorithm </a>
-
- <li> <a href="#dead_summary_2_5"> Summary of the Postfix
- 2.5 "dead destination" detection algorithm </a>
-
- <li> <a href="#pseudo_code_2_5"> Pseudocode for the Postfix
- 2.5 concurrency scheduler </a>
+<ul>
- <li> <a href="#concurrency_results"> Results for delivery
- to concurrency limited servers </a>
+<li> <a href="#concurrency"> Concurrency scheduling </a> is concerned
+with the number of concurrent deliveries to a specific destination,
+including decisions on when to suspend deliveries after persistent
+failures.
- <li> <a href="#concurrency_discussion"> Discussion of
- concurrency limited server results </a>
+<li> <a href="#jobs"> Preemptive scheduling </a> is concerned with
+the selection of email messages and recipients for a given destination.
- <li> <a href="#concurrency_limitations"> Limitations of
- less-than-1 per delivery feedback </a>
+<li> <a href="#credits"> Credits </a>. This document would not be
+complete without.
- <li> <a href="#concurrency_config"> Concurrency configuration
- parameters </a>
+</ul>
- </ul>
+<!--
- </ul>
+<p> Once started, the qmgr(8) process runs until "postfix reload"
+or "postfix stop". As a persistent process, the queue manager has
+to meet strict requirements with respect to code correctness and
+robustness. Unlike non-persistent daemon processes, the queue manager
+cannot benefit from Postfix's process rejuvenation mechanism that
+limits the impact of resource leaks and other coding errors
+(translation: replacing a process after a short time covers up bugs
+before they can become a problem). </p>
-<p> The second class of mechanisms is concerned with the selection
-of what mail to deliver to a given destination: </p>
+-->
- <ul>
+<h2> <a name="concurrency"> Concurrency scheduling </a> </h2>
- <li> <a href="#jobs"> Preemptive scheduling </a>
+<p> The following sections document the Postfix 2.5 concurrency
+scheduler, after a discussion of the limitations of the existing
+concurrency scheduler. This is followed by results of medium-concurrency
+experiments, and a discussion of trade-offs between performance and
+robustness. </p>
- <ul>
+<p> The material is organized as follows: </p>
- <li> <a href="#job_motivation"> Why the non-preemptive Postfix queue
- manager was replaced </a>
+<ul>
- <li> <a href="#job_design"> How the non-preemptive queue manager
- scheduler works </a>
+<li> <a href="#concurrency_drawbacks"> Drawbacks of the existing
+concurrency scheduler </a>
- </ul>
+<li> <a href="#concurrency_summary_2_5"> Summary of the Postfix 2.5
+concurrency feedback algorithm </a>
- </ul>
+<li> <a href="#dead_summary_2_5"> Summary of the Postfix 2.5 "dead
+destination" detection algorithm </a>
-<p> And this document would not be complete without: </p>
+<li> <a href="#pseudo_code_2_5"> Pseudocode for the Postfix 2.5
+concurrency scheduler </a>
- <ul>
+<li> <a href="#concurrency_results"> Results for delivery to
+concurrency limited servers </a>
- <li> <a href="#credits"> Credits </a>
+<li> <a href="#concurrency_discussion"> Discussion of concurrency
+limited server results </a>
- </ul>
+<li> <a href="#concurrency_limitations"> Limitations of less-than-1
+per delivery feedback </a>
-<!--
+<li> <a href="#concurrency_config"> Concurrency configuration
+parameters </a>
-<p> Once started, the qmgr(8) process runs until "postfix reload"
-or "postfix stop". As a persistent process, the queue manager has
-to meet strict requirements with respect to code correctness and
-robustness. Unlike non-persistent daemon processes, the queue manager
-cannot benefit from Postfix's process rejuvenation mechanism that
-limit the impact from resource leaks and other coding errors
-(translation: replacing a process after a short time covers up bugs
-before they can become a problem). </p>
-
--->
+</ul>
-<h2> <a name="concurrency"> Concurrency scheduling </a> </h2>
+<h3> <a name="concurrency_drawbacks"> Drawbacks of the existing
+concurrency scheduler </a> </h3>
-<p> This section documents the Postfix 2.5 concurrency scheduler.
-Prior Postfix versions used a simple but robust algorithm where the
-per-destination delivery concurrency was decremented by 1 after a
-delivery suffered connection or handshake failure, and was incremented
-by 1 otherwise. Of course the concurrency was never allowed to
-exceed the maximum per-destination concurrency limit. And when a
-destination's concurrency level dropped to zero, the destination
-was declared "dead" and delivery was suspended. </p>
+<p> From the start, Postfix has used a simple but robust algorithm
+where the per-destination delivery concurrency is decremented by 1
+after a delivery suffered connection or handshake failure, and
+incremented by 1 otherwise. Of course the concurrency is never
+allowed to exceed the maximum per-destination concurrency limit.
+And when a destination's concurrency level drops to zero, the
+destination is declared "dead" and delivery is suspended. </p>
-<p> Drawbacks of the old +/-1 feedback per delivery are: <p>
+<p> Drawbacks of +/-1 concurrency feedback per delivery are: </p>
<ul>
<li> <p> Overshoot due to exponential delivery concurrency growth
-with each pseudo-cohort(*). For example, with the default initial
-concurrency of 5, concurrency would proceed over time as (5-10-20).
-</p>
+with each pseudo-cohort(*). This can be an issue with high-concurrency
+channels. For example, with the default initial concurrency of 5,
+concurrency would proceed over time as (5-10-20). </p>
<li> <p> Throttling down to zero concurrency after a single
pseudo-cohort(*) failure. This was especially an issue with
configurable number of pseudo-cohorts reports connection or handshake
failure. </p>
-<h3> <a name="concurrency_summary_2_5"> Summary of the Postfix 2.5 concurrency feedback algorithm </a> </h3>
+<h3> <a name="concurrency_summary_2_5"> Summary of the Postfix 2.5
+concurrency feedback algorithm </a> </h3>
<p> We want to increment a destination's delivery concurrency when
some (not necessarily consecutive) number of deliveries complete
<p> All results in the previous sections are based on the first
delivery runs only; they do not include any second etc. delivery
-attempts. The first two examples show that the feedback method
-matters little when concurrency is limited due to congestion. This
+attempts. The first two examples show that the effect of feedback
+is negligible when concurrency is limited due to congestion. This
is because the initial concurrency is already at the client's
concurrency maximum, and because there is 10-100 times more positive
-than negative feedback. Under these conditions, the contribution
-from SMTP connection caching is negligible. </p>
+than negative feedback. Under these conditions, it is no surprise
+that the contribution from SMTP connection caching is also negligible.
+</p>
<p> In the last example, the old +/-1 feedback per delivery will
defer 50% of the mail when confronted with an active (anvil-style)
<h2> <a name="jobs"> Preemptive scheduling </a> </h2>
-<p> This is the beginning of documentation for a preemptive queue
-manager scheduling algorithm by Patrik Rak. For a long time, this
-code was made available under the name "nqmgr(8)" (new queue manager),
-as an optional module. As of Postfix 2.1 this is the default queue
-manager, which is always called "qmgr(8)". The old queue manager
-will for some time will be available under the name of "oqmgr(8)".
+<p>
+
+This document attempts to describe the new queue manager and its
+preemptive scheduler algorithm. Note that the document was originally
+written to describe the changes between the new queue manager (in
+this text referred to as <tt>nqmgr</tt>, the name it was known by
+before it became the default queue manager) and the old queue manager
+(referred to as <tt>oqmgr</tt>). This is why it refers to <tt>oqmgr</tt>
+every so often.
+
</p>
-<h3> <a name="job_motivation"> Why the non-preemptive Postfix queue manager was replaced </a> </h3>
+<p>
-<p> The non-preemptive Postfix scheduler had several limitations
-due to unfortunate choices in its design. </p>
+This document is divided into sections as follows:
-<ol>
+</p>
- <li> <p> Round-robin selection by destination for mail that is
- delivered via the same message delivery transport. The round-robin
- strategy was chosen with the intention to prevent a single
- (destination) site from using up too many mail delivery resources.
- However, that strategy penalized inbound mail on bi-directional
- gateways. The poor suffering inbound destination would be
- selected only 1/number-of-destinations of the time, even when
- it had more mail than other destinations, and thus mail could
- be delayed. </p>
-
- <p> Victor Duchovni found a workaround: use different message
- delivery transports, and thus avoid the starvation problem.
- The Patrik Rak scheduler solves this problem by using FIFO
- selection. </p>
-
- <li> <p> A second limitation of the old Postfix scheduler was
- that delivery of bulk mail would block all other deliveries,
- causing large delays. Patrik Rak's scheduler allows mail with
- fewer recipients to slip past bulk mail in an elegant manner.
- </p>
+<ul>
-</ol>
+<li> <a href="#<tt>nqmgr</tt>_structures"> The structures used by
+nqmgr </a>
+
+<li> <a href="#<tt>nqmgr</tt>_pickup"> What happens when nqmgr picks
+up the message </a> - how it is assigned to transports, jobs, peers,
+entries
+
+<li> <a href="#<tt>nqmgr</tt>_selection"> How does the entry selection
+work </a>
+
+<li> <a href="#<tt>nqmgr</tt>_preemption"> How does the preemption
+work </a> - what messages may be preempted and how and what messages
+are chosen to preempt them
+
+<li> <a href="#<tt>nqmgr</tt>_concurrency"> How destination concurrency
+limits affect the scheduling algorithm </a>
+
+<li> <a href="#<tt>nqmgr</tt>_memory"> Dealing with memory resource
+limits </a>
+
+</ul>
+
+<h3> <a name="<tt>nqmgr</tt>_structures"> The structures used by
+nqmgr </a> </h3>
+
+<p>
+
+Let's start by recapitulating the structures and terms used when
+referring to the queue manager and how it operates. Many of these are
+partially described elsewhere, but it is nice to have a coherent
+overview in one place:
+
+</p>
+
+<ul>
+
+<li> <p> Each message structure represents one mail message which
+Postfix is to deliver. The message recipients specify to what
+destinations the message is to be delivered and what transports are
+going to be used for the delivery. </p>
+
+<li> <p> Each recipient entry groups a batch of recipients of one
+message which are all going to be delivered to the same destination.
+</p>
+
+<li> <p> Each transport structure groups everything that is going
+to be delivered by the delivery agents dedicated to that transport.
+Each transport maintains a set of queues (describing the destinations
+it shall talk to) and jobs (referencing the messages it shall
+deliver). </p>
+
+<li> <p> Each transport queue (not to be confused with the on-disk
+active queue or incoming queue) groups everything that is going to be
+delivered to a given destination (aka nexthop) by its transport. Each
+queue belongs to one transport, so each destination may be referred
+to by several queues, one for each transport. Each queue maintains
+a list of all recipient entries (batches of message recipients)
+which shall be delivered to the given destination (the todo list), and
+a list of recipient entries already being delivered by the delivery
+agents (the busy list). </p>
+
+<li> <p> Each queue corresponds to multiple peer structures. Each
+peer structure is like the queue structure, belonging to one transport
+and referencing one destination. The difference is that it lists
+only the recipient entries which all originate from the same message,
+unlike the queue structure, whose entries may originate from various
+messages. For messages with few recipients, there is usually just
+one recipient entry for each destination, resulting in one recipient
+entry per peer. But for large mailing list messages the recipients
+may need to be split into multiple recipient entries, in which case
+the peer structure may list many entries for a single destination.
+</p>
+
+<li> <p> Each transport job groups everything it takes to deliver
+one message via its transport. Each job represents one message
+within the context of the transport. The job belongs to one transport
+and message, so each message may have multiple jobs, one for each
+transport. The job groups all the peer structures, which describe
+the destinations the job's message has to be delivered to. </p>
+
+</ul>
+
+<p>
+
+The first four structures are common to both <tt>nqmgr</tt> and
+<tt>oqmgr</tt>; the latter two were introduced by <tt>nqmgr</tt>.
+
+</p>
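+
+<p>
+
+For readers who prefer code to prose, the relations above can be
+summarized by a sketch like the following. This is illustrative
+only; the type and field names are invented here and do not match
+the actual qmgr sources:
+
+</p>
+
+<blockquote>
+<pre>
+/* Illustrative sketch, not the actual Postfix declarations. */
+struct entry {                      /* recipient batch, one destination */
+    struct queue     *queue;        /* queue whose todo/busy list holds it */
+    struct peer      *peer;         /* per-job view of the same destination */
+    struct entry     *queue_next;   /* next entry on the queue's list */
+    struct entry     *peer_next;    /* next entry on the peer's list */
+};
+struct peer {                       /* one destination within one job */
+    struct job       *job;
+    struct queue     *queue;
+    struct entry     *entry_list;
+    struct peer      *next;
+};
+struct queue {                      /* one destination within one transport */
+    struct transport *transport;
+    struct entry     *todo_list;    /* waiting for a delivery agent */
+    struct entry     *busy_list;    /* handed to delivery agents */
+    struct queue     *next;
+};
+struct job {                        /* one message within one transport */
+    struct message   *message;
+    struct transport *transport;
+    struct peer      *peer_list;
+    struct job       *next;
+};
+struct transport {
+    struct queue     *queue_list;
+    struct job       *job_list;     /* ordering is discussed below */
+};
+struct message {
+    struct job       *job_list;     /* one job per transport used */
+};
+</pre>
+</blockquote>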
+
+<p>
+
+These terms are used extensively in the text below; feel free to
+look up the descriptions above any time you feel you have lost
+track of what is what.
+
+</p>
+
+<h3> <a name="<tt>nqmgr</tt>_pickup"> What happens when nqmgr picks
+up the message </a> </h3>
+
+<p>
+
+Whenever <tt>nqmgr</tt> moves a queue file into the active queue,
+the following happens: It reads all necessary information from the
+queue file as <tt>oqmgr</tt> does, and also reads as many recipients
+as possible - more on that later; for now, let's just pretend it
+always reads all recipients.
+
+</p>
+
+<p>
+
+Then it resolves the recipients as <tt>oqmgr</tt> does, which
+means obtaining an (address, nexthop, transport) triple for each
+recipient. For each triple, it finds the transport; if it does not
+exist yet, it instantiates it (unless it's dead). Within the
+transport, it finds the destination queue for the given nexthop; if it
+does not exist yet, it instantiates it (unless it's dead). The
+triple is then bound to the given destination queue. This happens in
+qmgr_resolve() and is basically the same as in <tt>oqmgr</tt>.
+
+</p>
+
+<p>
+
+Then for each triple which was bound to some queue (and thus
+transport), the program finds the job which represents the message
+within that transport's context; if it does not exist yet, it
+instantiates it. Within the job, it finds the peer which represents
+the bound destination queue within this job's context; if it does
+not exist yet, it instantiates it. Finally, it stores the address
+from the resolved triple to the recipient entry which is appended
+to both the queue entry list and the peer entry list. The addresses
+for the same nexthop are batched in the entries up to the recipient_concurrency
+limit for that transport. This happens in qmgr_assign() and, apart
+from the fact that it operates with job and peer structures, is basically the
+same as in <tt>oqmgr</tt>.
+
+</p>
+
+<p>
+
+When the job is instantiated, it is enqueued on the transport's job
+list based on the time its message was picked up by <tt>nqmgr</tt>.
+For the first batch of recipients this means it is appended to the
+end of the job list. The ordering of the job list by the enqueue
+time is important, as we will see shortly.
+
+</p>
+
+<p>
+
+[Now you should have a pretty good idea of the state of the
+<tt>nqmgr</tt> after a couple of messages have been picked up, and of
+the relation between all those job, peer, queue and entry structures.]
+
+</p>
+
+<h3> <a name="<tt>nqmgr</tt>_selection"> How does the entry selection
+work </a> </h3>
+
+<p>
+
+Having prepared all the above-mentioned structures, the task of
+the <tt>nqmgr</tt>'s scheduler is to choose the recipient entries
+one at a time and pass them to the delivery agent for the corresponding
+transport. Now how does this work?
+
+</p>
+
+<p>
+
+The first approximation of the new scheduling algorithm is like this:
+
+</p>
+
+<blockquote>
+<pre>
+foreach transport (round-robin-by-transport)
+do
+ if transport busy continue
+ if transport process limit reached continue
+ foreach transport's job (in the order of the transport's job list)
+ do
+ foreach job's peer (round-robin-by-destination)
+ if peer->queue->concurrency < peer->queue->window
+ return next peer entry.
+ done
+ done
+done
+</pre>
+</blockquote>
+
+<p>
+
+Now what is the "order of the transport's job list"? As we know
+already, the job list is by default kept in the order the message
+was picked up by the <tt>nqmgr</tt>. So by default we get round-robin
+selection at the transport level, and within each transport we get
+FIFO message delivery. The round-robin selection of the peers by
+destination is perhaps of little importance in most real-life cases
+(unless the recipient_concurrency limit is reached, there is only
+one peer structure per destination in each job), but theoretically
+it makes sure that even within single jobs, destinations are treated
+fairly.
+
+</p>
+
+<p>
+
+[By now you should have a feeling that you really know how the
+scheduler works under ideal conditions - that is, with no recipient
+resource limits and no destination concurrency problems - except
+for the preemption.]
+
+</p>
+
+<h3> <a name="<tt>nqmgr</tt>_preemption"> How does the preemption
+work </a> </h3>
+
+<p>
+
+As you might perhaps expect by now, the transport's job list does
+not remain sorted by the job's message enqueue time all the time.
+The coolest thing about <tt>nqmgr</tt> is not the simple FIFO
+delivery, but that it is able to slip mail with few recipients
+past the mailing-list bulk mail. This is what the job preemption
+is about - shuffling the jobs on the transport's job list to get
+the best message delivery rates. Now how is it achieved?
+
+</p>
+
+<p>
+
+First I have to tell you that there are in fact two job lists in
+each transport. One is the scheduler's job list, which the scheduler
+is free to play with, while the other one keeps the jobs always
+listed in the order of the enqueue time and is used for the recipient
+pool management we will discuss later. For now, we will deal with
+the scheduler's job list only.
+
+</p>
+
+<p>
+
+So, we have the job list, which is ordered by the time the
+jobs' messages were enqueued, oldest messages first and the most
+recently picked up one at the end. For now, let's assume that there are no
+destination concurrency problems. Without preemption, we pick some
+entry of the first (oldest) job on the list, assign it to a delivery
+agent, pick another one from the same job, assign it again, and so
+on, until all the entries are used and the job is delivered. We
+would then move on to the next job, and so on and on. Now how do we
+manage to sneak in some entries from the recently added jobs when
+the first job on the job list belongs to a message going to a
+mailing list and has thousands of recipient entries?
+
+</p>
+
+<p>
+
+The <tt>nqmgr</tt>'s answer is that we can artificially "inflate"
+the delivery time of that first job by some constant for free - it
+is basically the same trick you might remember as "accumulation of
+potential" from the amortized complexity lessons. For example,
+instead of delivering the entries of the first job on the job list
+every time a delivery agent becomes available, we can do it only
+every second time. If you view the moments the delivery agent becomes
+available on a timeline as "delivery slots", then instead of using
+every delivery slot for the first job, we can use only every other
+slot, and still the overall delivery efficiency of the first job
+remains the same. So the delivery <tt>11112222</tt> becomes
+<tt>1.1.1.1.2.2.2.2</tt> (1 and 2 are the imaginary job numbers, .
+denotes the free slot). Now what do we do with free slots?
+
+</p>
+
+<p>
+
+As you might have guessed, we will use them for sneaking in the mail
+with few recipients. For example, if we have one four-recipient
+mail followed by four one-recipient mails, the delivery sequence
+(that is, the sequence in which the jobs are assigned to the
+delivery slots) might look like this: <tt>12131415</tt>. Hmm, fine
+for sneaking in the single-recipient mail, but how do we sneak in
+mail with more than one recipient? Say we have one four-recipient
+mail followed by two two-recipient mails?
+
+</p>
+
+<p>
+
+The simple answer would be to use the delivery sequence <tt>12121313</tt>.
+But the problem is that this does not scale well. Imagine you have
+mail with a thousand recipients followed by mail with a hundred
+recipients. It is tempting to suggest a delivery sequence like
+<tt>121212....</tt>, but alas! Imagine another mail arrives with,
+say, ten recipients. There are no free slots anymore, so it can't
+slip by, not even if it had just one recipient. It will be stuck until
+the hundred-recipient mail is delivered, which really sucks.
+
+</p>
+
+<p>
+
+So, it becomes obvious that while inflating the message delivery
+time to get free slots is a great idea, one has to be really careful
+how the free slots are assigned, otherwise one might get cornered. So,
+how does <tt>nqmgr</tt> really use the free slots?
+
+</p>
+
+<p>
+
+The key idea is that one does not have to generate the free slots
+in a uniform way. The delivery sequence <tt>111...1</tt> is no
+worse than <tt>1.1.1.1</tt>, in fact, it is even better as some
+entries are in the first case selected earlier than in the second
+case, and none is selected later! So it is possible to first
+"accumulate" the free delivery slots and then use them all at once.
+It is even possible to accumulate some, then use them, then accumulate
+some more and use them again, as in <tt>11..1.1</tt> .
+
+</p>
+
+<p>
+
+Let's get back to the one hundred recipient example. We now know
+that we could first accumulate one hundred free slots, and only
+then preempt the first job and sneak the one hundred
+recipient mail in. Applying the algorithm recursively, we see the
+hundred-recipient job can accumulate ten free delivery slots, and
+then we could preempt it and sneak in the ten-recipient mail...
+Wait wait wait! Could we? Aren't we overinflating the original one
+thousand recipient mail?
+
+</p>
+
+<p>
+
+Well, although it may look like it at first glance, another trick will
+allow us to answer "no, we are not!". If we had said that we will
+inflate the delivery time twice at maximum, and then we consider
+every other slot as a free slot, then we would overinflate in case
+of the recursive preemption. BUT! The trick is that if we use only
+every n-th slot as a free slot for n>2, there is always some worst
+inflation factor which we can guarantee not to be breached, even
+if we apply the algorithm recursively. To be precise, if for every
+k>1 normally used slots we accumulate one free delivery slot, then
+the inflation factor is not worse than k/(k-1) no matter how many
+recursive preemptions happen. And it's not worse than (k+1)/k if
+only non-recursive preemption happens. Now, having got through the
+theory and the related math, let's see how <tt>nqmgr</tt> implements
+this.
+
+</p>
+
+<p>
+
+Each job has a so-called "available delivery slot" counter. Each
+transport has a <i>transport</i>_delivery_slot_cost parameter, which
+defaults to the default_delivery_slot_cost parameter, which is 5
+by default. This is the k from the paragraph above. Each time k
+entries of the job are selected for delivery, this counter is
+incremented by one. Once there are some slots accumulated, a job which
+requires no more than that amount of slots to be fully delivered
+can preempt this job.
+
+</p>
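+
+<p>
+
+As a minimal sketch (with invented names, and using the
+per-k-entries approximation above rather than the multiplied form
+the actual code uses), the bookkeeping could look like this:
+
+</p>
+
+<blockquote>
+<pre>
+/* Illustrative sketch only, not the qmgr sources. */
+#define SLOT_COST 5                     /* k; cf. default_delivery_slot_cost */
+
+struct sjob {
+    int     selected;                   /* entries selected for delivery */
+    int     avail_slots;                /* accumulated free delivery slots */
+    int     todo_entries;               /* (estimated) entries left to deliver */
+};
+
+void    entry_selected(struct sjob *job)
+{
+    if (++job->selected % SLOT_COST == 0)
+	job->avail_slots += 1;          /* one free slot per k selections */
+}
+
+int     can_preempt(struct sjob *current, struct sjob *candidate)
+{
+    /* The candidate must fit into the slots the current job accumulated. */
+    return (candidate->todo_entries <= current->avail_slots);
+}
+</pre>
+</blockquote>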
+
+<p>
+
+[Well, the truth is, the counter is incremented every time an entry
+is selected, and it is divided by k when it is used. Or, more
+precisely, there is no division; the other side of the comparison is
+multiplied by k. But for understanding, it's good enough to use
+the above approximation of the truth.]
+
+</p>
+
+<p>
+
+OK, so now we know the conditions which must be satisfied for one
+job to preempt another one. But which job gets preempted, how do
+we choose which job preempts it if there are several valid candidates,
+and when exactly does all this happen?
+
+</p>
+
+<p>
+
+The answer to the first part is simple. The job whose entry was
+selected last is the so-called current job. Normally, it is
+the first job on the scheduler's job list, but destination concurrency
+limits may change this, as we will see later. It is always only the
+current job which may get preempted.
+
+</p>
+
+<p>
+
+Now for the second part. The current job has a certain number of
+recipient entries, and as such may accumulate at most some number
+of available delivery slots. It might have already accumulated some,
+and perhaps even already used some when it was preempted before
+(remember, a job can be preempted several times). In either case,
+we know how many are accumulated and how many are left to deliver,
+so we know how many more it may accumulate at most. Every other
+job which can be delivered in fewer than that number of slots is a
+valid candidate for preemption. How do we choose among them?
+
+</p>
+
+<p>
+
+The answer is - the one with maximum enqueue_time/recipient_entry_count.
+That is, the older the job is, the harder we should try to deliver
+it in order to get the best message delivery rates. These rates are of
+course subject to how many recipients the message has, therefore
+the division by the recipient (entry) count. No one shall be surprised
+that a message with n recipients takes n times longer to deliver than
+a message with one recipient.
+
+</p>
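+
+<p>
+
+As a sketch (names invented, not the actual qmgr_choose_candidate()
+code), the "maximum enqueue_time/recipient_entry_count" rule can be
+evaluated without division by cross-multiplying; "age" here is the
+time the job's message has spent on the job list:
+
+</p>
+
+<blockquote>
+<pre>
+/* Illustrative sketch only.  Times are in seconds. */
+struct cjob {
+    long    enqueued;                   /* when the message was picked up */
+    long    entry_count;                /* (estimated) recipient entries */
+};
+
+/* Non-zero if job a is a better preemption candidate than job b. */
+int     better_candidate(struct cjob *a, struct cjob *b, long now)
+{
+    return ((now - a->enqueued) * b->entry_count >
+	    (now - b->enqueued) * a->entry_count);
+}
+</pre>
+</blockquote>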
+
+<p>
+
+Now let's recap the candidate selection. Isn't it too complicated?
+Why aren't the candidates chosen only from among the jobs which can
+be delivered within the number of slots the current job has already
+accumulated? Why do we need to estimate how much it has yet to
+accumulate? If you found out the answer, congratulate yourself. If
+we did it this simple way, we would always choose the candidate
+with the fewest recipient entries. If there were enough single-recipient
+mails coming in, they would always slip by the bulk mail as soon
+as possible, and mail with two or more recipients would never get
+a chance, no matter how long it has been sitting around in the
+job list.
+
+</p>
+
+<p>
+
+This candidate selection has an interesting implication - when
+we choose the best candidate for preemption (this is done in
+qmgr_choose_candidate()), it may happen that we cannot use it for
+preemption immediately. This leads to an answer to the last part
+of the original question - when does the preemption happen?
+
+</p>
+
+<p>
+
+The preemption attempt happens every time the transport's next recipient
+entry is to be chosen for delivery. To avoid needless overhead, the
+preemption is not attempted if the current job could never accumulate
+more than <i>transport</i>_minimum_delivery_slots (defaults to
+default_minimum_delivery_slots which defaults to 3). If there are
+already enough accumulated slots to preempt the current job by the
+chosen best candidate, it is done immediately. This basically means
+that the candidate is moved in front of the current job on the
+scheduler's job list and the accumulated slot counter is decreased
+by the amount used by the candidate. If there are not enough slots...
+well, I could say that nothing happens and another preemption
+is attempted the next time. But that's not the complete truth.
+
+</p>
+
+<p>
+
+The truth is that it turns out that it is not really necessary to
+wait until the job's counter accumulates all the delivery slots in
+advance. Say we have a ten-recipient mail followed by two two-recipient
+mails. If the preemption happened when enough delivery slots had
+accumulated (assuming slot cost 2), the delivery sequence would become
+<tt>11112211113311</tt>. Now what would we get if we waited
+only for 50% of the necessary slots to accumulate and promised to
+wait for the remaining 50% later, after we get back
+to the preempted job? If we use such a slot loan, the delivery sequence
+becomes <tt>11221111331111</tt>. As we can see, this is not
+considerably worse for the delivery of the ten-recipient mail, but
+it allows the small messages to be delivered sooner.
+
+</p>
+
+<p>
+
+The concept of these slot loans is where the
+<i>transport</i>_delivery_slot_discount and
+<i>transport</i>_delivery_slot_loan come from (they default to
+default_delivery_slot_discount and default_delivery_slot_loan, whose
+values are by default 50 and 3, respectively). The discount (resp.
+loan) specifies how many percent (resp. how many slots) one "gets
+in advance", when the number of slots required to deliver the best
+candidate is compared with the number of slots the current job has
+accumulated so far.
+
+</p>
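+
+<p>
+
+Put as a single condition (one plausible reading of the text above,
+with invented variable names), the chosen candidate may preempt the
+current job as soon as
+
+</p>
+
+<blockquote>
+<pre>
+needed_slots * (100 - discount) / 100 - loan <= accumulated_slots
+</pre>
+</blockquote>
+
+<p>
+
+where needed_slots is the number of slots required to fully deliver
+the candidate and accumulated_slots is the current job's counter.
+
+</p>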
+
+<p>
+
+And that pretty much concludes this chapter.
+
+</p>
+
+<p>
+
+[Now you should have a feeling that you pretty much understand the
+scheduler and the preemption, or at least that you will have it
+after you read the last chapter a couple more times. You shall clearly
+see the job list and the preemption happening at its head, in ideal
+delivery conditions. The feeling of understanding shall last until
+you start wondering what happens if some of the jobs are blocked,
+which you might eventually figure out correctly from what has been
+said already. But I would be surprised if your mental image of the
+scheduler's functionality is not completely shattered once you
+start wondering how it works when not all recipients may be read
+in-core. More on that later.]
+
+</p>
+
+<h3> <a name="<tt>nqmgr</tt>_concurrency"> How destination concurrency
+limits affect the scheduling algorithm </a> </h3>
+
+<p>
+
+The <tt>nqmgr</tt> uses the same algorithm for destination concurrency
+control as <tt>oqmgr</tt>. Now what happens when the destination
+limits are reached and no more entries for that destination may be
+selected by the scheduler?
+
+</p>
+
+<p>
+
+From the user's point of view, it is all simple. If some of the peers
+of a job can't be selected, those peers are simply skipped by the
+entry selection algorithm (the pseudo-code described before) and
+only the selectable ones are used. If none of the peers may be
+selected, the job is declared a "blocker job". Blocker jobs are
+skipped by the entry selection algorithm and they are also excluded
+from the candidates for preemption of the current job. Thus the scheduler
+effectively behaves as if the blocker jobs didn't exist on the job
+list at all. As soon as at least one of the peers of a blocker job
+becomes unblocked (that is, the delivery agent handling the delivery
+of the recipient entry for the given destination successfully finishes),
+the job's blocker status is removed and the job again participates
+in all further scheduler actions normally.
+
+</p>
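+
+<p>
+
+In outline (a sketch of the behavior just described, with invented
+names; in particular, the per-queue list of peers is assumed here
+purely for illustration):
+
+</p>
+
+<blockquote>
+<pre>
+/* Illustrative sketch only, not the qmgr sources. */
+struct bjob { int blocker; };
+struct bpeer { struct bjob *job; struct bpeer *next; };
+struct bqueue { struct bpeer *peers; };
+
+/* Called when no peer of the job may currently be selected. */
+void    mark_blocker(struct bjob *job)
+{
+    job->blocker = 1;               /* skip in selection and preemption */
+}
+
+/* Called when a delivery to this destination completes. */
+void    queue_unblocked(struct bqueue *queue)
+{
+    struct bpeer *peer;
+
+    /* All jobs with peers for this destination lose blocker status. */
+    for (peer = queue->peers; peer != 0; peer = peer->next)
+	peer->job->blocker = 0;
+}
+</pre>
+</blockquote>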
+
+<p>
+
+So the summary is that users don't really have to be concerned
+about the interaction of the destination limits and the scheduling
+algorithm. It works well on its own and there are no knobs they
+would need to use to control it.
+
+</p>
+
+<p>
+
+From a programmer's point of view, the blocker jobs complicate the
+scheduler quite a lot. Without them, the jobs on the job list would
+normally be delivered in strict FIFO order. If the current job is
+preempted, the job preempting it is completely delivered unless it
+is preempted itself. Without blockers, the current job is thus
+always either the first job on the job list, or the top of the stack
+of jobs preempting the first job on the job list.
+
+</p>
+
+<p>
+
+The visualization of the job list and the preemption stack without
+blockers would be like this:
+
+</p>
+
+<blockquote>
+<pre>
+first job-> 1--2--3--5--6--8--... <- job list
+on job list |
+ 4 <- preemption stack
+ |
+current job-> 7
+</pre>
+</blockquote>
-<h3> <a name="job_design"> How the non-preemptive queue manager scheduler works </a> </h3>
+<p>
-<p> The following text is from Patrik Rak and should be read together
-with the postconf(5) manual that describes each configuration
-parameter in detail. </p>
+In the example above we see that job 1 was preempted by job 4 and
+then job 4 was preempted by job 7. After job 7 is completed, the
+remaining entries of job 4 are selected, and once they are all
+selected, job 1 continues.
-<p> From user's point of view, oqmgr(8) and qmgr(8) are both the same,
-except for how next message is chosen when delivery agent becomes
-available. You already know that oqmgr(8) uses round-robin by destination
-while qmgr(8) uses simple FIFO, except for some preemptive magic.
-The postconf(5) manual documents all the knobs the user
-can use to control this preemptive magic - there is nothing else
-to the preemption than the quite simple conditions described in there.
</p>
-<p> As for programmer-level documentation, this will have to be
-extracted from all those emails we have exchanged with Wietse [rats!
-I hoped that Patrik would do the work for me -- Wietse] But I think
-there are no missing bits which we have not mentioned in our
-conversations. </p>
+<p>
-<p> However, even from programmer's point of view, there is nothing
-more to add to the message scheduling idea itself. There are few
-things which make it look more complicated than it is, but the
-algorithm is the same as the user perceives it. The summary of the
-differences of the programmer's view from the user's view are: </p>
+As we see, it's all very clean and straightforward. Now how does
+this change because of blockers?
+
+</p>
+
+<p>
+
+The answer is: a lot. Any job may become a blocker job at any time,
+and also become a normal job again at any time. This has several
+important implications:
+
+</p>
<ol>
- <li> <p> Simplification of terms for users: The user knows
- about messages and recipients. The program itself works with
- jobs (one message is split among several jobs, one per each
- transport needed to deliver the message) and queue entries
- (each entry may group several recipients for same destination).
- Then there is the peer structure introduced by qmgr(8) which is
- simply per-job analog of the queue structure. </p>
-
- <li> <p> Dealing with concurrency limits: The actual implementation
- is complicated by the fact that the messages (resp. jobs) may
- not be delivered in the exactly scheduled order because of the
- concurrency limits. It is necessary to skip some "blocker" jobs
- when the concurrency limit is reached and get back to them
- again when the limit permits. </p>
-
- <li> <p> Dealing with resource limits: The actual implementation is
- complicated by the fact that not all recipients may be read in-core.
- Therefore each message has some recipients in-core and some may
- remain on-file. This means that a) the preemptive algorithm needs
- to work with recipient count estimates instead of exact counts, b)
- there is extra code which needs to manipulate the per-transport
- pool of recipients which may be read in-core at the same time, and
- c) there is extra code which needs to be able to read recipients
- into core in batches and which is triggered at appropriate moments. </p>
-
- <li> <p> Doing things efficiently: All important things I am
- aware of are done in the minimum time possible (either directly
- or at least when amortized complexity is used), but to choose
- which job is the best candidate for preempting the current job
- requires linear search of up to all transport jobs (the worst
- theoretical case - the reality is much better). As this is done
- every time the next queue entry to be delivered is about to be
- chosen, it seemed reasonable to add cache which minimizes the
- overhead. Maintenance of this candidate cache slightly obfuscates
- things.
+<li> <p>
+
+The jobs may be completed in arbitrary order. For example, in the
+example above, if the current job 7 becomes blocked, the next job
+4 may complete before job 7 becomes unblocked again. Or if both
+7 and 4 are blocked, then 1 is completed, then 7 becomes unblocked
+and is completed, then 2 is completed and only after that 4 becomes
+unblocked and is completed... You get the idea.
+
+</p>
+
+<p>
+
+[Interesting side note: even when jobs are delivered out of order,
+from a single destination's point of view the jobs are still delivered
+in the expected order (that is, FIFO unless there was some preemption
+involved). This is because whenever a destination queue becomes
+unblocked (the destination limit allows selection of more recipient
+entries for that destination), all jobs which have peers for that
+destination are unblocked at once.]
+
+</p>
+
+<li> <p>
+
+The idea of the preemption stack at the head of the job list is
+gone. That is, it must be possible to preempt any job on the job
+list. For example, if jobs 7, 4, 1 and 2 in the example above
+all become blocked, job 3 becomes the current job. And of course
+we do not want the preemption to be affected by whether there
+are blocked jobs or not. Therefore, if it turns out that job
+3 might be preempted by job 6, the implementation shall make it
+possible.
+
+</p>
+
+<li> <p>
+
+The idea of the linear preemption stack itself is gone. It's no
+longer true that one job is always preempted by only one job at one
+time (that is directly preempted, not counting the recursively
+nested jobs). For example, in the example above, job 1 is directly
+preempted by only job 4, and job 4 by job 7. Now assume job 7 becomes
+blocked, and job 4 is being delivered. If it accumulates enough
+delivery slots, it is natural that it might be preempted for example
+by job 8. Now job 4 is preempted by both job 7 AND job 8 at the
+same time.
+
+</p>
</ol>
-<p> The points 2 and 3 are those which made the implementation
-(look) complicated and were the real coding work, but I believe
-that to understand the scheduling algorithm itself (which was the
-real thinking work) is fairly easy. </p>
+<p>
+
+Now combine points 2) and 3) with point 1) again and you realize
+that the relations on the once linear job list become pretty
+complicated. If we extend the point 3) example: jobs 7 and 8 preempt
+job 4, now job 8 becomes blocked too, then job 4 completes. Tricky,
+huh?
+
+</p>
+
+<p>
+
+If I were to illustrate the relations after the above-mentioned
+examples (excluding those in point 1), the situation would look like this:
+
+</p>
+
+<blockquote>
+<pre>
+ v- parent
+
+adoptive parent -> 1--2--3--5--... <- "stack" level 0
+ | |
+parent gone -> ? 6 <- "stack" level 1
+ / \
+children -> 7 8 ^- child <- "stack" level 2
+
+ ^- siblings
+</pre>
+</blockquote>
+
+<p>
+
+Now how does <tt>nqmgr</tt> deal with all these complicated relations?
+
+</p>
+
+<p>
+
+Well, it maintains them all as described, but fortunately, all these
+relations are necessary only for purposes of proper counting of
+available delivery slots. For purposes of ordering the jobs for
+entry selection, the original rule still applies: "the job preempting
+the current job is moved in front of the current job on the job
+list". So for entry selection purposes, the job relations remain
+as simple as this:
+
+</p>
+
+<blockquote>
+<pre>
+7--8--1--2--6--3--5--.. <- scheduler's job list order
+</pre>
+</blockquote>
+
+<p>
+
+The job list order and the preemption parent/child/sibling relations
+are maintained separately. And because the selection works only
+with the job list, you can happily forget about those complicated
+relations unless you want to study the <tt>nqmgr</tt> sources. In
+that case the text above might provide some helpful introduction
+to the problem domain. Otherwise I suggest you just forget about
+all this and stick with the user's point of view: the blocker jobs
+are simply ignored.
+
+</p>
+
+<p>
+
+[By now, you should have a feeling that there are more things going
+on under the hood than you ever wanted to know. You decide that
+forgetting about this chapter is the best you can do for the sake
+of your mind's health and you basically stick with the idea of how the
+scheduler works in ideal conditions, when there are no blockers,
+which is good enough.]
+
+</p>
+
+<h3> <a name="<tt>nqmgr</tt>_memory"> Dealing with memory resource
+limits </a> </h3>
+
+<p>
+
+When discussing the <tt>nqmgr</tt> scheduler, we have so far assumed
+that all recipients of all messages in the active queue are completely
+read into memory. This is simply not true. There is an upper
+bound on the amount of memory the <tt>nqmgr</tt> may use, and
+therefore it must impose some limits on the information it may store
+in memory at any given time.
+
+</p>
+
+<p>
+
+First of all, not all messages may be read in-core at once. At any
+time, at most qmgr_message_active_limit messages may be read in-core.
+When read into memory, the messages are picked from the
+incoming and deferred message queues and moved to the active queue
+(incoming having priority), so if there are more than
+qmgr_message_active_limit messages queued, the rest will have to
+wait until (some of) the messages in the active
+queue are completely delivered (or deferred).
+
+</p>
+
+<p>
+
+Even with the limited number of in-core messages, there is another
+limit which must be imposed in order to avoid memory exhaustion.
+Each message may contain a huge number of recipients (tens or hundreds
+of thousands are not uncommon), so if <tt>nqmgr</tt> were to read all
+recipients of all messages in the active queue, it might easily run
+out of memory. Therefore there must be some upper bound on the
+number of message recipients which are read into memory at the
+same time.
+
+</p>
+
+<p>
+
+Before discussing how exactly <tt>nqmgr</tt> implements the recipient
+limits, let's see how the sole existence of the limits themselves
+affects the <tt>nqmgr</tt> and its scheduler.
+
+</p>
+
+<p>
+
+The message limit is straightforward - it just limits the size of
+the lookahead the <tt>nqmgr</tt>'s scheduler has when choosing which
+message can preempt the current one. Messages not in the active
+queue are simply not considered at all.
+
+</p>
+
+<p>
+
+The recipient limit complicates things a bit more. First of all, the
+message reading code must support reading the recipients in batches,
+which among other things means accessing the queue file several
+times and continuing where the last recipient batch ended. This is
+invoked by the scheduler whenever the current job runs out of in-core
+recipients and more are required. It is also done any time all
+in-core recipients of the message are dealt with (which may also
+mean they were deferred) but there are still more in the queue file.
+
+</p>
+
+<p>
+
+The second complication is that with some recipients left unread
+in the queue file, the scheduler can't operate with exact counts
+of recipient entries. With unread recipients, it is not clear how
+many recipient entries there will be, as they are subject to
+per-destination grouping. It is not even clear to what transports
+(and thus jobs) the recipients will be assigned. And with messages
+coming from the deferred queue, it is not even clear how many unread
+recipients are still to be delivered. This all means that the
+scheduler must use only estimates of how many recipient entries
+there will be. Fortunately, it is possible to estimate the minimum
+and maximum correctly, so the scheduler can always err on the safe
+side. Obviously, the better the estimates, the better the results, so
+it is best when we are able to read all recipients in-core and turn
+the estimates into exact counts, or at least try to read as many
+as possible to make the estimates as accurate as possible.
+
+</p>
+
+<p>
+
+The third complication is that it is no longer true that the scheduler
+is done with a job once all of its in-core recipients are delivered.
+It is possible that the job will be revived later, when another
+batch of recipients is read in core. It is also possible that some
+jobs will be created for the first time long after the first batch
+of recipients was read in core. The <tt>nqmgr</tt> code must be
+ready to handle all such situations.
+
+</p>
+
+<p>
+
+And finally, the fourth complication is that the <tt>nqmgr</tt>
+code must somehow impose the recipient limit itself. Now how does
+it achieve it?
+
+</p>
+
+<p>
+
+Perhaps the easiest solution would be to say that each message may
+have at most X recipients stored in-core, but such a solution would
+be poor for several reasons. With reasonable qmgr_message_active_limit
+values, the X would have to be quite low to maintain a reasonable
+memory footprint. And with a low X, lots of things would not work well.
+The <tt>nqmgr</tt> would have problems using the
+<i>transport</i>_destination_recipient_limit efficiently. The
+scheduler's preemption would be suboptimal as the recipient count
+estimates would be inaccurate. The message queue file would have
+to be accessed many times to read in more recipients again and
+again.
+
+</p>
+
+<p>
+
+Therefore it seems reasonable to have a solution which does not use
+a limit imposed on a per-message basis, but which maintains a pool
+of available recipient slots, which can be shared among all messages
+in the most efficient manner. And as we do not want separate
+transports to compete for resources whenever possible, it seems
+appropriate to maintain such a recipient pool for each transport
+separately. This is the general idea; now how does it work in
+practice?
+
+</p>
+
+<p>
+
+First we have to solve a little chicken-and-egg problem. If we want
+to use the per-transport recipient pools, we first need to know to
+what transport(s) the message is assigned. But we will find that
+out only after we read in the recipients. So it is obvious
+that we first have to read in some recipients, use them to find out
+to what transports the message is to be assigned, and only after
+that can we use the per-transport recipient pools.
+
+</p>
+
+<p>
+
+Now how many recipients shall we read for the first time? This is
+what the qmgr_message_recipient_minimum and qmgr_message_recipient_limit
+values control. The qmgr_message_recipient_minimum value specifies
+how many recipients of each message we will read the first time,
+no matter what. It is necessary to read at least one recipient
+before we can assign the message to a transport and create the first
+job. However, reading only qmgr_message_recipient_minimum recipients
+even if there are only a few messages with few recipients in-core would
+be wasteful. Therefore, if there are fewer than qmgr_message_recipient_limit
+recipients in-core so far, the first batch of recipients may be
+larger than qmgr_message_recipient_minimum - as large as is required
+to reach the qmgr_message_recipient_limit limit.
+
+</p>
+
+<p>
+
+Once the first batch of recipients has been read in-core and the message
+jobs have been created, the size of the subsequent recipient batches (if
+any - of course it's best when all recipients are read in one batch)
+is based solely on the position of the message jobs on their
+corresponding transport's job lists. Each transport has a pool of
+<i>transport</i>_recipient_limit recipient slots which it can
+distribute among its jobs (how this is done is described later).
+The subsequent recipient batch may be as large as the sum of all
+recipient slots of all jobs of the message permits (plus the
+qmgr_message_recipient_minimum amount which always applies).
+
+</p>
+
+<p>
+
+For example, if a message has three jobs, the first with 1 recipient
+still in-core and 4 recipient slots, the second with 5 recipients in-core
+and 5 recipient slots, and the third with 2 recipients in-core and 0
+recipient slots, it has 1+5+2=7 recipients in-core and 4+5+0=9 job
+recipient slots in total. This means that we could immediately
+read 9-7=2 (plus qmgr_message_recipient_minimum) more recipients of
+that message in-core.
+
+</p>
+
+<p>
+
+The above example illustrates several things which might be worth
+mentioning explicitly: first, note that although the per-transport
+slots are assigned to particular jobs, we can't guarantee that once
+the next batch of recipients is read in-core, the corresponding
+numbers of recipients will be assigned to those jobs. The jobs lend
+their slots to the message as a whole, so it is possible that some
+jobs end up sponsoring other jobs of their message. For example,
+if in the example above the 2 newly read recipients were assigned
+to the second job, the first job sponsored the second job with 2
+slots. The second notable thing is the third job, which has more
+recipients in-core than it has slots. Apart from the sponsoring by
+another job we just saw, this can be a result of the first recipient batch,
+which is sponsored from the global recipient pool of qmgr_message_recipient_limit
+recipients. It can also be sponsored from the message recipient
+pool of qmgr_message_recipient_minimum recipients.
+
+</p>
+
+<p>
+
+Now how does each transport distribute the recipient slots among
+its jobs? The strategy is quite simple. As most scheduler activity
+happens at the head of the job list, it is our intention to make
+sure that the scheduler has the best estimates of the recipient
+counts for those jobs. As we mentioned above, this means that we
+want to try to make sure that the messages of those jobs have all
+recipients read in-core. Therefore the transport distributes the
+slots "along" the job list from start to end. In this case the job
+list sorted by message enqueue time is used, because it doesn't
+change over time as the scheduler's job list does.
+
+</p>
+
+<p>
+
+More specifically, each time a job is created and appended to the
+job list, it gets all unused recipient slots from its transport's
+pool. It keeps them until all recipients of its message are read.
+When this happens, all unused recipient slots are transferred to
+the next job (which is in fact now the first such job) on the job
+list which still has some recipients unread, or eventually back to
+the transport pool if there is no such job. Such a transfer also
+happens whenever a recipient entry of that job is delivered.
+
+</p>
+
+<p>
+
+There is also a scenario when a job is not appended to the end of
+the job list (for example, it was created as a result of a second or
+later recipient batch). Then it works exactly as above, except that
+if it was put in front of the first unread job (that is, the job
+of a message which still has some unread recipients in the queue file),
+that job is first forced to return all of its unused recipient slots
+to the transport pool.
+
+</p>
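+
+<p>
+
+A sketch of the slot handover just described (invented names; the
+helper next_unread_job() is hypothetical and stands for "the first
+job after this one on the enqueue-time job list whose message still
+has unread recipients"):
+
+</p>
+
+<blockquote>
+<pre>
+/* Illustrative sketch only, not the qmgr sources. */
+struct rjob {
+    int     unused_slots;               /* recipient slots held by this job */
+};
+struct rtransport {
+    int     pool;                       /* unused transport recipient slots */
+};
+
+struct rjob *next_unread_job(struct rtransport *tp, struct rjob *job);
+
+/* Hand a job's unused recipient slots onward, or back to the pool. */
+void    pass_slots_on(struct rtransport *tp, struct rjob *job)
+{
+    struct rjob *next = next_unread_job(tp, job);
+
+    if (next != 0)
+	next->unused_slots += job->unused_slots;
+    else
+	tp->pool += job->unused_slots;
+    job->unused_slots = 0;
+}
+</pre>
+</blockquote>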
+
+<p>
+
+The algorithm just described leads to the following state: The first
+unread job on the job list always gets all the remaining recipient
+slots of that transport (if there are any). The jobs queued before
+this job are completely read (that is, all recipients of their
+message were already read in-core) and have at most as many slots
+as they still have recipients in-core (the maximum is there because
+of the sponsoring mentioned before), and the jobs after this job get
+nothing from the transport recipient pool (unless they got something
+before and the first unread job was then created and enqueued in
+front of them - in such a case they also get at most as many
+slots as they have recipients in-core).
+
+</p>
+
+<p>
+
+Things work fine in such a state most of the time, because the
+current job is either completely read in-core or has as many recipient
+slots as there are, but there is one situation which we still have
+to take care of specially. Imagine that the current job is preempted
+by some unread job from the job list and there are no more recipient
+slots available, so this new current job could read only batches
+of qmgr_message_recipient_minimum recipients at a time. This would
+really degrade performance. For this reason, each transport has an
+extra pool of <i>transport</i>_extra_recipient_limit recipient
+slots, dedicated exactly to this situation. Each time an unread
+job preempts the current job, it gets half of the remaining recipient
+slots from the normal pool and this extra pool.
+
+</p>
+
+<p>
+
+And that's it. It sure does sound pretty complicated, but fortunately
+most people don't really have to care how exactly it works as long
+as it works. Perhaps the only important things to know for most
+people are the following upper bound formulas:
+
+</p>
+
+<p>
+
+Each transport has at most
+
+</p>
+
+<blockquote>
+<pre>
+max(
+qmgr_message_recipient_minimum * qmgr_message_active_limit
++ *_recipient_limit + *_extra_recipient_limit,
+qmgr_message_recipient_limit
+)
+</pre>
+</blockquote>
+
+<p>
+
+recipients in-core.
+
+</p>
+
+<p>
+
+The total number of recipients in-core is
+
+</p>
+
+<blockquote>
+<pre>
+max(
+qmgr_message_recipient_minimum * qmgr_message_active_limit
++ sum( *_recipient_limit + *_extra_recipient_limit ),
+qmgr_message_recipient_limit
+)
+</pre>
+</blockquote>
+
+<p>
+
+where the sum is over all used transports.
+
+</p>
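+
+<p>
+
+For illustration only, with made-up values qmgr_message_recipient_minimum
+= 10, qmgr_message_active_limit = 100, qmgr_message_recipient_limit =
+2000, and a single transport with *_recipient_limit = 1000 and
+*_extra_recipient_limit = 100, the per-transport bound works out as
+
+</p>
+
+<blockquote>
+<pre>
+max(10 * 100 + (1000 + 100), 2000) = max(2100, 2000) = 2100
+</pre>
+</blockquote>
+
+<p>
+
+recipients in-core at most.
+
+</p>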
+
+<p>
+
+And this terribly complicated chapter concludes the documentation
+of the <tt>nqmgr</tt> scheduler.
+
+</p>
+
+<p>
+
+[By now you should theoretically know the <tt>nqmgr</tt> scheduler
+inside out. In practice, you still hope that you will never have
+to really understand the last chapter or two completely, and
+fortunately most people really won't. Understanding how the scheduler
+works in ideal conditions is more than good enough for the vast majority
+of users.]
+
+</p>
<h2> <a name="credits"> Credits </a> </h2>
<li> These simplifications, and their modular implementation, helped
to develop further insights into the different roles that positive
-and negative concurrency feedback play, and helped to avoid all the
-known worst-case scenarios.
+and negative concurrency feedback play, and helped to identify some
+worst-case scenarios.
</ul>
/etc/postfix/main.cf:
smtp_tls_CAfile = /etc/postfix/cacert.pem
smtp_tls_session_cache_database =
- btree:/var/spool/postfix/smtp_tls_session_cache
+ btree:/var/lib/postfix/smtp_tls_session_cache
smtp_use_tls = yes
smtpd_tls_CAfile = /etc/postfix/cacert.pem
smtpd_tls_cert_file = /etc/postfix/FOO-cert.pem
smtpd_tls_key_file = /etc/postfix/FOO-key.pem
smtpd_tls_received_header = yes
smtpd_tls_session_cache_database =
- btree:/var/spool/postfix/smtpd_tls_session_cache
+ btree:/var/lib/postfix/smtpd_tls_session_cache
tls_random_source = dev:/dev/urandom
# Postfix 2.3 and later
smtpd_tls_security_level = may
# * The postconf2man tool leaves unrecognized HTML in place as a
# reminder that it is not supported.
#
+# * Text between <!-- and --> is stripped out. The <!-- and -->
+# must appear on separate lines.
+#
# Also:
#
# * All <dt> and <dd>text must be closed with </dt> and </dd>.
<i>number</i> equal to "1", a destination's delivery concurrency
is decremented by 1 after each failed pseudo-cohort. </dd>
+<!--
+
<dt> <b><i>number</i> / sqrt_concurrency </b> </dt>
<dd> Variable feedback of "<i>number</i> / sqrt(delivery concurrency)".
The <i>number</i> must be in the range 0..1 inclusive. This setting
may be removed in a future version. </dd>
+-->
+
</dl>
<p> A pseudo-cohort is the number of deliveries equal to a destination's
<i>number</i> equal to "1", a destination's delivery concurrency
is incremented by 1 after each successful pseudo-cohort. </dd>
+<!--
+
<dt> <b><i>number</i> / sqrt_concurrency </b> </dt>
<dd> Variable feedback of "<i>number</i> / sqrt(delivery concurrency)".
The <i>number</i> must be in the range 0..1 inclusive. This setting
may be removed in a future version. </dd>
+-->
+
</dl>
<p> A pseudo-cohort is the number of deliveries equal to a destination's
* Patches change both the patchlevel and the release date. Snapshots have no
* patchlevel; they change the release date only.
*/
-#define MAIL_RELEASE_DATE "20071207"
+#define MAIL_RELEASE_DATE "20071208"
#define MAIL_VERSION_NUMBER "2.5"
#ifdef SNAPSHOT
.c.o:; $(CC) $(CFLAGS) -c $*.c
$(PROG): $(OBJS) $(LIBS)
- $(CC) $(CFLAGS) -o $@ $(OBJS) $(LIBS) $(SYSLIBS) -lm
+ $(CC) $(CFLAGS) -o $@ $(OBJS) $(LIBS) $(SYSLIBS)
$(OBJS): ../../conf/makedefs.out
#define QMGR_FEEDBACK_IDX_NONE 0 /* no window dependence */
#define QMGR_FEEDBACK_IDX_WIN 1 /* 1/window dependence */
+#if 0
#define QMGR_FEEDBACK_IDX_SQRT_WIN 2 /* 1/sqrt(window) dependence */
+#endif
#ifdef QMGR_FEEDBACK_IDX_SQRT_WIN
#include <math.h>
* We can't simply force delivery on this queue: the transport's pending
* count may already be maxed out, and there may be other constraints
* that definitely should be none of our business. The best we can do is
- * to play by the same rules as everyone else: trigger *some* delivery
- * via qmgr_active_drain() and let round-robin selection work for us.
+ * to play by the same rules as everyone else: let qmgr_active_drain()
+ * and round-robin selection take care of message selection.
*/
queue->window = 1;
- if (queue->todo_refcount > 0)
- qmgr_active_drain();
/*
* Every event handler that leaves a queue in the "ready" state should
.c.o:; $(CC) $(CFLAGS) -c $*.c
$(PROG): $(OBJS) $(LIBS)
- $(CC) $(CFLAGS) -o $@ $(OBJS) $(LIBS) $(SYSLIBS) -lm
+ $(CC) $(CFLAGS) -o $@ $(OBJS) $(LIBS) $(SYSLIBS)
$(OBJS): ../../conf/makedefs.out
#define QMGR_FEEDBACK_IDX_NONE 0 /* no window dependence */
#define QMGR_FEEDBACK_IDX_WIN 1 /* 1/window dependence */
+#if 0
#define QMGR_FEEDBACK_IDX_SQRT_WIN 2 /* 1/sqrt(window) dependence */
+#endif
#ifdef QMGR_FEEDBACK_IDX_SQRT_WIN
#include <math.h>
* We can't simply force delivery on this queue: the transport's pending
* count may already be maxed out, and there may be other constraints
* that definitely should be none of our business. The best we can do is
- * to play by the same rules as everyone else: trigger *some* delivery
- * via qmgr_active_drain() and let round-robin selection work for us.
+ * to play by the same rules as everyone else: let qmgr_active_drain()
+ * and round-robin selection take care of message selection.
*/
queue->window = 1;
- if (queue->todo_refcount > 0)
- qmgr_active_drain();
/*
* Every event handler that leaves a queue in the "ready" state should