From 04ab676d96e68c422e660f16caab6368d5a4300b Mon Sep 17 00:00:00 2001 From: dgaudet Date: Tue, 30 Sep 1997 23:24:30 +0000 Subject: [PATCH] Incorporate my performance tuning document. Document a lot more stuff that changed. git-svn-id: https://svn.apache.org/repos/asf/httpd/httpd/trunk@79314 13f79535-47bb-0310-9956-ffa450edef68 --- docs/manual/misc/index.html | 12 +- docs/manual/misc/perf-tuning.html | 820 ++++++++++++++++++++++++++++++ docs/manual/platform/perf.html | 1 + 3 files changed, 832 insertions(+), 1 deletion(-) create mode 100644 docs/manual/misc/perf-tuning.html diff --git a/docs/manual/misc/index.html b/docs/manual/misc/index.html index 2b44c661318..77a7bb28abd 100644 --- a/docs/manual/misc/index.html +++ b/docs/manual/misc/index.html @@ -89,7 +89,17 @@ HREF="perf.html" >Performance Notes (General) -

Some generic notes about how to improve Apache performance +

Some generic notes about how to improve the performance of your + machine/OS. +

Notes about how to (run-time and compile-time) configure + Apache for highest performance. Notes explaining why Apache does + some things, and why it doesn't do other things (which make it + slower/faster).

+ +Apache Performance Notes + + + +

Apache Performance Notes


Author: Dean Gaudet

Introduction

Apache is a general webserver, which is designed to be correct first, and fast second. Even so, its performance is quite satisfactory. Most sites have less than 10 Mbit/s of outgoing bandwidth, which Apache can fill using only a low-end Pentium-based webserver. In practice, sites with more bandwidth require more than one machine to fill the bandwidth due to other constraints (such as CGI or database transaction overhead). For these reasons the development focus has been mostly on correctness and configurability.

Unfortunately many folks overlook these facts and cite raw performance numbers as if they are some indication of the quality of a web server product. There is a bare minimum performance that is acceptable; beyond that, extra speed only caters to a much smaller segment of the market. But in order to avoid this hurdle to the acceptance of Apache in some markets, effort was put into Apache 1.3 to bring performance up to a point where the difference with other high-end webservers is minimal.

Finally there are the folks who just plain want to see how fast something can go. The author falls into this category. The rest of this document is dedicated to these folks who want to squeeze every last bit of performance out of Apache's current model, and want to understand why it does some things which slow it down.

Note that this is tailored towards Apache 1.3 on Unix. Some of it applies to Apache on NT. Apache on NT has not been tuned for performance yet; in fact it probably performs very poorly because NT performance requires a different programming model.

Hardware and Operating System Issues


The single biggest hardware issue affecting webserver performance is RAM. A webserver should never ever have to swap: swapping increases the latency of each request beyond a point that users consider "fast enough". This causes users to hit stop and reload, further increasing the load. You can, and should, control the MaxClients setting so that your server does not spawn so many children it starts swapping.

Beyond that the rest is mundane: get a fast enough CPU, a fast enough network card, and fast enough disks, where "fast enough" is something that needs to be determined by experimentation.

Operating system choice is largely a matter of local concerns. But a general guideline is to always apply the latest vendor TCP/IP patches. HTTP serving completely breaks many of the assumptions built into Unix kernels up through 1994 and even 1995. Good choices include recent FreeBSD and Linux.

Run-Time Configuration Issues


HostnameLookups

Prior to Apache 1.3, HostnameLookups defaulted to On. This adds latency to every request because it requires a DNS lookup to complete before the request is finished. In Apache 1.3 this setting defaults to Off. However (1.3 or later), if you use any allow from domain or deny from domain directives then you will pay for a double reverse DNS lookup (a reverse, followed by a forward to make sure that the reverse is not being spoofed). So for the highest performance avoid using these directives (it's fine to use IP addresses rather than domain names).

Note that it's possible to scope the directives, such as within a <Location /server-status> section. In this case the DNS lookups are only performed on requests matching the criteria. Here's an example which disables lookups except for .html and .cgi files:

+HostnameLookups off
+<Files ~ "\.(html|cgi)$">
+    HostnameLookups on
+</Files>
+

But even still, if you just need DNS names in some CGIs you could consider doing the reverse lookup (for example with gethostbyaddr) in the specific CGIs that need it.
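The double reverse lookup triggered by allow from domain and deny from domain can be sketched with the POSIX resolver calls getnameinfo(3) and getaddrinfo(3). A CGI doing its own lookup would want the same check. This is a minimal sketch that assumes IPv4 only. The names double_reverse_lookup and forward_confirms are hypothetical and are not Apache functions. The forward-confirmation step is split out so it can be checked without a live DNS server.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <netdb.h>
#include <netinet/in.h>
#include <stddef.h>
#include <string.h>
#include <sys/socket.h>

/* Return 1 if any IPv4 address in a forward-lookup result equals the
 * client's address, 0 otherwise. */
int forward_confirms(const struct in_addr *client, const struct addrinfo *list)
{
    const struct addrinfo *ai;

    for (ai = list; ai != NULL; ai = ai->ai_next) {
        if (ai->ai_family != AF_INET || ai->ai_addr == NULL)
            continue;
        const struct sockaddr_in *sin = (const struct sockaddr_in *) ai->ai_addr;
        if (sin->sin_addr.s_addr == client->s_addr)
            return 1;
    }
    return 0;
}

/* Reverse lookup (address -> name), then forward lookup (name -> addresses).
 * The name is trusted only if the forward lookup maps back to the client.
 * Returns 0 with host filled in on success, -1 otherwise. */
int double_reverse_lookup(const struct sockaddr_in *client, char *host, size_t hostlen)
{
    struct addrinfo hints, *res = NULL;
    int ok;

    assert(client != NULL && host != NULL && hostlen > 0);
    /* NI_NAMEREQD: fail rather than return the numeric address */
    if (getnameinfo((const struct sockaddr *) client, sizeof *client,
                    host, (socklen_t) hostlen, NULL, 0, NI_NAMEREQD) != 0)
        return -1;

    memset(&hints, 0, sizeof hints);
    hints.ai_family = AF_INET;
    if (getaddrinfo(host, NULL, &hints, &res) != 0)
        return -1;
    ok = forward_confirms(&client->sin_addr, res);
    freeaddrinfo(res);
    return ok ? 0 : -1;
}
```

Both lookups hit the resolver, which is why the directives cost two DNS round trips per request rather than one.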

FollowSymLinks and SymLinksIfOwnerMatch

Wherever in your URL-space you do not have an Options FollowSymLinks, or you do have an Options SymLinksIfOwnerMatch, Apache will have to issue extra system calls to check up on symlinks: one extra call per filename component. For example, if you had:

+DocumentRoot /www/htdocs
+<Directory />
+    Options SymLinksIfOwnerMatch
+</Directory>
+

and a request is made for the URI /index.html, then Apache will perform lstat(2) on /www, /www/htdocs, and /www/htdocs/index.html. The results of these lstats are never cached, so they will occur on every single request. If you really desire the symlinks security checking you can do something like this:

+DocumentRoot /www/htdocs
+<Directory />
+    Options FollowSymLinks
+</Directory>
+<Directory /www/htdocs>
+    Options -FollowSymLinks +SymLinksIfOwnerMatch
+</Directory>
+

This at least avoids the extra checks for the DocumentRoot path. Note that you'll need to add similar sections if you have any Alias or RewriteRule paths outside of your document root. For highest performance, and no symlink protection, set FollowSymLinks everywhere, and never set SymLinksIfOwnerMatch.
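The per-component checking described above can be sketched as a walk over every prefix of the path, with one lstat(2) per prefix. This minimal sketch assumes SymLinksIfOwnerMatch means "a symlink is allowed only if it and its target have the same owner". The names walk_prefixes, owner_match_visit and count_visit are hypothetical, and Apache's real directory walk differs in detail. Because nothing is cached, every request repeats the whole walk.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>

#define WALK_MAX 1024   /* hypothetical path-length limit for this sketch */

/* Call visit() for every component prefix of an absolute path:
 * "/www/htdocs/index.html" yields "/www", "/www/htdocs" and the full path.
 * Stops early when visit() returns nonzero.  Returns the number of
 * prefixes visited, or -1 for a relative or over-long path. */
int walk_prefixes(const char *path,
                  int (*visit)(const char *prefix, void *ctx), void *ctx)
{
    char buf[WALK_MAX];
    size_t len, i;
    int count = 0;

    assert(path != NULL && visit != NULL);
    len = strlen(path);
    if (path[0] != '/' || len >= sizeof buf)
        return -1;
    memcpy(buf, path, len + 1);
    for (i = 1; i <= len; ++i) {
        if (buf[i] == '/' || buf[i] == '\0') {
            char saved = buf[i];
            buf[i] = '\0';              /* temporarily cut the path here */
            ++count;
            if (visit(buf, ctx) != 0)
                return count;
            buf[i] = saved;
        }
    }
    return count;
}

/* One lstat(2) per component; a symlink also costs a stat(2) of its
 * target.  Sets *(int *)ctx and stops on a missing component or a
 * symlink whose owner differs from its target's owner. */
int owner_match_visit(const char *prefix, void *ctx)
{
    struct stat lst, st;
    int *denied = ctx;

    if (lstat(prefix, &lst) != 0 ||
        (S_ISLNK(lst.st_mode) &&
         (stat(prefix, &st) != 0 || st.st_uid != lst.st_uid))) {
        *denied = 1;
        return 1;
    }
    return 0;
}

/* Usage helper: just count the prefixes (one lstat each in Apache). */
int count_visit(const char *prefix, void *ctx)
{
    (void) prefix;
    ++*(int *) ctx;
    return 0;
}
```

For the /www/htdocs/index.html example, walk_prefixes reports three prefixes, matching the three lstat calls listed above.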

AllowOverride


Wherever in your URL-space you allow overrides (typically .htaccess files) Apache will attempt to open .htaccess for each filename component. For example,

+DocumentRoot /www/htdocs
+<Directory />
+    AllowOverride all
+</Directory>
+

and a request is made for the URI /index.html, then Apache will attempt to open /.htaccess, /www/.htaccess, and /www/htdocs/.htaccess. The solutions are similar to the previous case of Options FollowSymLinks. For highest performance use AllowOverride None everywhere in your filesystem.
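Unlike the symlink walk, the .htaccess search starts at the filesystem root. The following minimal sketch lists the candidate files for one directory, assuming the default file name .htaccess. The helper htaccess_candidates is hypothetical. Each candidate it produces stands for one open(2) attempt on every request.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Fill out[] with the .htaccess path tried for each directory from "/"
 * down to dir (absolute, no trailing slash), e.g. "/www/htdocs" gives
 * "/.htaccess", "/www/.htaccess", "/www/htdocs/.htaccess".
 * Returns the number of candidates, or -1 if they do not fit. */
int htaccess_candidates(const char *dir, char out[][256], int max)
{
    size_t len = strlen(dir), i;
    int n = 0;

    assert(dir[0] == '/');
    if (max < 1)
        return -1;
    snprintf(out[n++], 256, "/.htaccess");      /* the root itself */
    if (len == 1)
        return n;                               /* dir was just "/" */
    for (i = 1; i <= len; ++i) {
        if (dir[i] == '/' || dir[i] == '\0') {
            if (n >= max)
                return -1;
            /* dir[0..i) names a directory; append "/.htaccess" */
            if (snprintf(out[n++], 256, "%.*s/.htaccess", (int) i, dir) >= 256)
                return -1;
        }
    }
    return n;
}
```

With AllowOverride None there is nothing to search, so none of these opens happen.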

Negotiation


If you're really interested in every last ounce of performance, avoid content negotiation if at all possible. In practice, though, the benefits of negotiation outweigh the performance penalties. There's one case where you can speed up the server. Instead of using a wildcard such as:

+DirectoryIndex index
+

Use a complete list of options:

+DirectoryIndex index.cgi index.pl index.shtml index.html
+

where you list the most common choice first.
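With an explicit list, the server only has to probe each name in order until one exists. That is why the most common choice should come first. The following minimal sketch shows that probing loop, assuming access(2) as the existence test. The name find_index is hypothetical and is not mod_dir's actual code.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Probe dir/names[0], dir/names[1], ... and return the index of the
 * first name that exists, or -1.  *probes counts the filesystem checks. */
int find_index(const char *dir, const char *names[], int n, int *probes)
{
    char path[1024];
    int i;

    assert(dir != NULL && names != NULL && probes != NULL);
    *probes = 0;
    for (i = 0; i < n; ++i) {
        snprintf(path, sizeof path, "%s/%s", dir, names[i]);
        ++*probes;                       /* one access(2) per candidate */
        if (access(path, F_OK) == 0)
            return i;                    /* first match wins */
    }
    return -1;
}
```

If only index.html exists, listing it last costs four probes per directory request; listing it first costs one.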

Process Creation


Prior to Apache 1.3 the MinSpareServers, MaxSpareServers, and StartServers settings all had drastic effects on benchmark results. In particular, Apache required a "ramp-up" period in order to reach a number of children sufficient to serve the load being applied. After the initial spawning of StartServers children, only one child per second would be created to satisfy the MinSpareServers setting. So a server being accessed by 100 simultaneous clients, using the default StartServers of 5, would take on the order of 95 seconds to spawn enough children to handle the load. This works fine in practice on real-life servers, because they aren't restarted frequently. But it does very poorly on benchmarks which might only run for ten minutes.

The one-per-second rule was implemented in an effort to avoid swamping the machine with the startup of new children. If the machine is busy spawning children it can't service requests. But it has such a drastic effect on the perceived performance of Apache that it had to be replaced. As of Apache 1.3, the code will relax the one-per-second rule. It will spawn one, wait a second, then spawn two, wait a second, then spawn four, and it will continue exponentially until it is spawning 32 children per second. It will stop whenever it satisfies the MinSpareServers setting.

This appears to be responsive enough that it's almost unnecessary to twiddle the MinSpareServers, MaxSpareServers and StartServers knobs. When more than 4 children are spawned per second, a message will be emitted to the ErrorLog. If you see a lot of these errors then consider tuning these settings. Use the mod_status output as a guide.
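The 1.3 schedule can be modelled in a few lines. This sketch assumes the rate starts at 1 per second and doubles each second up to 32, as described above. The names seconds_to_spawn and MAX_SPAWN_RATE are hypothetical and are not code from http_main.c. For the earlier example of 95 extra children, it gives 7 seconds instead of roughly 95.

```c
#include <assert.h>

#define MAX_SPAWN_RATE 32   /* cap on children spawned per second */

/* Seconds needed to spawn `needed` extra children when the spawn rate
 * starts at 1 per second and doubles each second up to MAX_SPAWN_RATE. */
int seconds_to_spawn(int needed)
{
    int rate = 1, seconds = 0;

    assert(needed >= 0);
    while (needed > 0) {
        needed -= (rate < needed) ? rate : needed;  /* this second's batch */
        ++seconds;
        if (rate < MAX_SPAWN_RATE)
            rate *= 2;
    }
    return seconds;
}
```

In this model the rate first exceeds 4 children per second in the fourth second (8 children). That is the point where the ErrorLog message mentioned above would begin to show up.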

Related to process creation is process death induced by the MaxRequestsPerChild setting. By default this is 30, which is probably far too low unless your server is using a module such as mod_perl which causes children to have bloated memory images. If your server is serving mostly static pages then consider raising this value to something like 10000. The code is robust enough that this shouldn't be a problem.

When keep-alives are in use, children will be kept busy doing nothing waiting for more requests on the already open connection. The default KeepAliveTimeout of 15 seconds attempts to minimize this effect. The tradeoff here is between network bandwidth and server resources. In no event should you raise this above about 60 seconds, as most of the benefits are lost.

Compile-Time Configuration Issues


mod_status and Rule STATUS=yes


If you include mod_status and you also set Rule STATUS=yes when building Apache, then on every request Apache will perform two calls to gettimeofday(2) (or times(2) depending on your operating system), and (pre-1.3) several extra calls to time(2). This is all done so that the status report contains timing indications. For highest performance, set Rule STATUS=no.

accept Serialization - multiple sockets


This discusses a shortcoming in the Unix socket API. Suppose your web server uses multiple Listen statements to listen on either multiple ports or multiple addresses. In order to test each socket to see if a connection is ready, Apache uses select(2). select(2) indicates whether a socket has no connections or at least one connection waiting on it. Apache's model includes multiple children, and all the idle ones test for new connections at the same time. A naive implementation looks something like this (these examples do not match the code; they're contrived for pedagogical purposes):

+    for (;;) {
+	for (;;) {
+	    fd_set accept_fds;
+
+	    FD_ZERO (&accept_fds);
+	    for (i = first_socket; i <= last_socket; ++i) {
+		FD_SET (i, &accept_fds);
+	    }
+	    rc = select (last_socket+1, &accept_fds, NULL, NULL, NULL);
+	    if (rc < 1) continue;
+	    new_connection = -1;
+	    for (i = first_socket; i <= last_socket; ++i) {
+		if (FD_ISSET (i, &accept_fds)) {
+		    new_connection = accept (i, NULL, NULL);
+		    if (new_connection != -1) break;
+		}
+	    }
+	    if (new_connection != -1) break;
+	}
+	process the new_connection;
+    }
+

But this naive implementation has a serious starvation problem. Recall that multiple children execute this loop at the same time, and so multiple children will block at select when they are in between requests. All those blocked children will awaken and return from select when a single request appears on any socket (the number of children which awaken varies depending on the operating system and timing issues). They will all then fall down into the loop and try to accept the connection. But only one will succeed (assuming there's still only one connection ready); the rest will be blocked in accept. This effectively locks those children into serving requests from that one socket and no other sockets, and they'll be stuck there until enough new requests appear on that socket to wake them all up. This starvation problem was first documented in PR#467. There are at least two solutions.

One solution is to make the sockets non-blocking. In this case the accept won't block the children, and they will be allowed to continue immediately. But this wastes CPU time. Suppose you have ten idle children in select, and one connection arrives. Then nine of those children will wake up, try to accept the connection, fail, and loop back into select, accomplishing nothing. Meanwhile none of those children are servicing requests that occurred on other sockets until they get back up to the select again. Overall this solution does not seem very fruitful unless you have as many idle CPUs (in a multiprocessor box) as you have idle children, not a very likely situation.
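The following minimal sketch shows the non-blocking variant, assuming POSIX fcntl(2) and accept(2). The names make_nonblocking, try_accept and open_loopback_listener are hypothetical. A child that loses the race does not sleep in accept. Instead it gets -1 with errno set to EAGAIN (or EWOULDBLOCK), and has to go back to select.

```c
#define _DEFAULT_SOURCE
#include <arpa/inet.h>
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Put a listening socket into non-blocking mode. */
int make_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL, 0);

    if (flags == -1)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK);
}

/* Try to take a connection without sleeping.  Returns the new fd, or -1;
 * *lost_race is set when nothing was waiting (another child got there
 * first), in which case the caller loops back to select(). */
int try_accept(int listen_fd, int *lost_race)
{
    int fd = accept(listen_fd, NULL, NULL);

    assert(lost_race != NULL);
    *lost_race = (fd == -1 && (errno == EAGAIN || errno == EWOULDBLOCK));
    return fd;
}

/* Usage helper: a listener on 127.0.0.1 with a kernel-chosen port. */
int open_loopback_listener(void)
{
    struct sockaddr_in sin;
    int fd = socket(AF_INET, SOCK_STREAM, 0);

    if (fd == -1)
        return -1;
    memset(&sin, 0, sizeof sin);
    sin.sin_family = AF_INET;
    sin.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    sin.sin_port = 0;
    if (bind(fd, (struct sockaddr *) &sin, sizeof sin) == -1 ||
        listen(fd, 16) == -1) {
        close(fd);
        return -1;
    }
    return fd;
}
```

Each losing child pays for a wakeup, a failed accept and another select, which is the wasted CPU time described above.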

Another solution, the one used by Apache, is to serialize entry into the inner loop. The loop looks like this (the differences are the accept_mutex_on and accept_mutex_off calls):

+    for (;;) {
+	accept_mutex_on ();
+	for (;;) {
+	    fd_set accept_fds;
+
+	    FD_ZERO (&accept_fds);
+	    for (i = first_socket; i <= last_socket; ++i) {
+		FD_SET (i, &accept_fds);
+	    }
+	    rc = select (last_socket+1, &accept_fds, NULL, NULL, NULL);
+	    if (rc < 1) continue;
+	    new_connection = -1;
+	    for (i = first_socket; i <= last_socket; ++i) {
+		if (FD_ISSET (i, &accept_fds)) {
+		    new_connection = accept (i, NULL, NULL);
+		    if (new_connection != -1) break;
+		}
+	    }
+	    if (new_connection != -1) break;
+	}
+	accept_mutex_off ();
+	process the new_connection;
+    }
+

The functions accept_mutex_on and accept_mutex_off implement a mutual exclusion semaphore. Only one child can have the mutex at any time. There are several choices for implementing these mutexes. The choice is defined in src/conf.h (pre-1.3) or src/main/conf.h (1.3 or later). Some architectures do not have any locking choice made; on these architectures it is unsafe to use multiple Listen directives.

USE_FLOCK_SERIALIZED_ACCEPT: This method uses the flock(2) system call to lock a lock file (located by the LockFile directive).
USE_FCNTL_SERIALIZED_ACCEPT: This method uses the fcntl(2) system call to lock a lock file (located by the LockFile directive).
USE_SYSVSEM_SERIALIZED_ACCEPT: (1.3 or later) This method uses SysV-style semaphores to implement the mutex. Unfortunately SysV-style semaphores have some bad side-effects. One is that it's possible Apache will die without cleaning up the semaphore (see the ipcs(8) man page). The other is that the semaphore API allows for a denial of service attack by any CGIs running under the same uid as the webserver (i.e. all CGIs unless you use something like suexec or cgiwrapper). For these reasons this method is not used on any architecture except IRIX (where the previous two are prohibitively expensive on most IRIX boxes).
USE_USLOCK_SERIALIZED_ACCEPT: (1.3 or later) This method is only available on IRIX, and uses usconfig(2) to create a mutex. While this method avoids the hassles of SysV-style semaphores, it is not the default for IRIX. This is because on single processor IRIX boxes (5.3 or 6.2) the uslock code is two orders of magnitude slower than the SysV-semaphore code. On multi-processor IRIX boxes the uslock code is an order of magnitude faster than the SysV-semaphore code. Kind of a messed up situation. So if you're using a multiprocessor IRIX box then you should rebuild your webserver with -DUSE_USLOCK_SERIALIZED_ACCEPT on the EXTRA_CFLAGS.
USE_PTHREADS_SERIALIZED_ACCEPT: (1.3 or later) This method uses POSIX mutexes and should work on any architecture implementing the full POSIX threads specification; however, it appears to work only on Solaris (2.5 or later). This is the default for Solaris 2.5 or later.
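The following minimal sketch shows the USE_FCNTL_SERIALIZED_ACCEPT approach, assuming POSIX record locks. Here accept_mutex_init and its mkstemp(3) path are hypothetical stand-ins for Apache's setup code and the LockFile directive. fcntl(2) locks belong to a process, so children that inherit the descriptor across fork() each contend for the lock. The flock(2) variant differs here: its locks attach to the open file description, so each child has to re-open the lock file after fork().

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

static int lock_fd = -1;   /* inherited by every child after fork() */

/* Create the lock file once, in the parent, before forking.  The file is
 * unlinked at once; the open descriptor is all the children need. */
int accept_mutex_init(char *path_template)
{
    lock_fd = mkstemp(path_template);
    if (lock_fd == -1)
        return -1;
    unlink(path_template);
    return 0;
}

/* Block until this process holds a write lock on the whole file. */
int accept_mutex_on(void)
{
    struct flock fl;

    assert(lock_fd != -1);
    memset(&fl, 0, sizeof fl);
    fl.l_type = F_WRLCK;
    fl.l_whence = SEEK_SET;          /* l_start = l_len = 0: whole file */
    while (fcntl(lock_fd, F_SETLKW, &fl) == -1) {
        if (errno != EINTR)          /* retry only if a signal interrupted */
            return -1;
    }
    return 0;
}

/* Release the lock so the next waiting child can enter select(). */
int accept_mutex_off(void)
{
    struct flock fl;

    memset(&fl, 0, sizeof fl);
    fl.l_type = F_UNLCK;
    fl.l_whence = SEEK_SET;
    return fcntl(lock_fd, F_SETLK, &fl);
}
```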


If your system has another method of serialization which isn't in the above list then it may be worthwhile adding code for it (and submitting a patch back to Apache).

Another solution that has been considered but never implemented is to partially serialize the loop, that is, let in a certain number of processes. This would only be of interest on multiprocessor boxes where it's possible multiple children could run simultaneously, and the serialization actually doesn't take advantage of the full bandwidth. This is a possible area of future investigation, but priority remains low because highly parallel web servers are not the norm.

Ideally you should run servers without multiple Listen statements if you want the highest performance. But read on.

accept Serialization - single socket


The above is fine and dandy for multiple socket servers, but what about single socket servers? In theory they shouldn't experience any of these same problems because all children can just block in accept(2) until a connection arrives, and no starvation results. In practice this hides almost the same "spinning" behaviour discussed above in the non-blocking solution. The way that most TCP stacks are implemented, the kernel actually wakes up all processes blocked in accept when a single connection arrives. One of those processes gets the connection and returns to user-space; the rest spin in the kernel and go back to sleep when they discover there's no connection for them. This spinning is hidden from the user-land code, but it's there nonetheless. This can result in the same load-spiking wasteful behaviour that a non-blocking solution to the multiple sockets case can.

For this reason we have found that many architectures behave more "nicely" if we serialize even the single socket case. So this is actually the default in almost all cases. Crude experiments under Linux (2.0.30 on a dual Pentium Pro 166 with 128MB of RAM) have shown that the serialization of the single socket case causes less than a 3% decrease in requests per second over unserialized single-socket. But unserialized single-socket showed an extra 100ms latency on each request. This latency is probably a wash on long haul lines, and only an issue on LANs. If you want to override the single socket serialization you can define SAFE_UNSERIALIZED_ACCEPT and then single-socket servers will not serialize at all.

Lingering Close


As discussed in draft-ietf-http-connection-00.txt section 8, in order for an HTTP server to reliably implement the protocol it needs to shut down each direction of the communication independently (recall that a TCP connection is bi-directional, and each half is independent of the other). This fact is often overlooked by other servers, but is correctly implemented in Apache as of 1.2.

When this feature was added to Apache it caused a flurry of problems on various versions of Unix because of a shortsightedness in their TCP implementations. The TCP specification does not state that the FIN_WAIT_2 state has a timeout, but it doesn't prohibit one. On systems without the timeout, Apache 1.2 induces many sockets stuck forever in the FIN_WAIT_2 state. In many cases this can be avoided by simply upgrading to the latest TCP/IP patches supplied by the vendor. In cases where the vendor has never released patches (e.g. SunOS4, although folks with a source license can patch it themselves), we have decided to disable this feature.

There are two ways of accomplishing this. One is the socket option SO_LINGER. But as fate would have it, this has never been implemented properly in most TCP/IP stacks. Even on those stacks with a proper implementation (e.g. Linux 2.0.31) this method proves to be more expensive (in CPU time) than the next solution.

For the most part, Apache implements this in a function called lingering_close (in http_main.c). The function looks roughly like this:

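+    void lingering_close (int s)
+    {
+	char junk_buffer[2048];
+	fd_set fds;
+	struct timeval tv;
+
+	/* shutdown the sending side (1 == SHUT_WR) */
+	shutdown (s, 1);
+
+	/* hard limit: lingering_death () runs if this takes over 30s */
+	signal (SIGALRM, lingering_death);
+	alarm (30);
+
+	for (;;) {
+	    /* wait up to 2 seconds for the client to send more */
+	    FD_ZERO (&fds);
+	    FD_SET (s, &fds);
+	    tv.tv_sec = 2;
+	    tv.tv_usec = 0;
+	    if (select (s + 1, &fds, NULL, NULL, &tv) <= 0)
+		break;	/* timeout or error */
+	    /* just toss away whatever is here; stop at EOF or error */
+	    if (read (s, junk_buffer, sizeof (junk_buffer)) <= 0)
+		break;
+	}
+
+	alarm (0);
+	close (s);
+    }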
+

This naturally adds some expense at the end of a connection, but it is required for a reliable implementation. As HTTP/1.1 becomes more prevalent, and all connections are persistent, this expense will be amortized over more requests. If you want to play with fire and disable this feature you can define NO_LINGCLOSE, but this is not recommended at all. In particular, as HTTP/1.1 pipelined persistent connections come into use, lingering_close is an absolute necessity (and pipelined connections are faster, so you want to support them).
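For comparison, the SO_LINGER alternative mentioned above amounts to a single socket option set before close(2). The following minimal sketch uses the standard struct linger. The helper set_linger is hypothetical. The 30-second value used with it here merely echoes the alarm (30) in lingering_close.

```c
#include <assert.h>
#include <sys/socket.h>
#include <unistd.h>

/* Ask the kernel to linger in close(2) for up to `seconds` while unsent
 * data drains, instead of lingering in user space. */
int set_linger(int fd, int seconds)
{
    struct linger l;

    assert(seconds >= 0);
    l.l_onoff = 1;
    l.l_linger = seconds;
    return setsockopt(fd, SOL_SOCKET, SO_LINGER, &l, sizeof l);
}
```

Calling set_linger(fd, 30) before close(fd) hands the whole job to the TCP stack. That is only as good as the stack's SO_LINGER implementation, which is the weakness described above.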

Scoreboard File


Apache's parent and children communicate with each other through something called the scoreboard. Ideally this should be implemented in shared memory. For those operating systems that we either have access to, or have been given detailed ports for, it typically is implemented using shared memory. The rest default to using an on-disk file. The on-disk file is not only slow, but it is unreliable (and less featured). Peruse the src/main/conf.h file for your architecture and look for either HAVE_MMAP or HAVE_SHMGET. Defining one of those two enables the supplied shared memory code. If your system has another type of shared memory then edit the file src/main/http_main.c and add the hooks necessary to use it in Apache. (Send us back a patch too, please.)
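The following minimal sketch illustrates the HAVE_MMAP idea: one anonymous shared mapping, created by the parent before fork(), that every child then sees. It assumes MAP_ANONYMOUS (or the BSD spelling MAP_ANON) is available. The score_slot layout and the scoreboard_create name are hypothetical and far simpler than Apache's real scoreboard.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#ifndef MAP_ANONYMOUS
#define MAP_ANONYMOUS MAP_ANON   /* older BSD spelling */
#endif

/* Hypothetical, much-simplified scoreboard entry, one per child. */
struct score_slot {
    pid_t pid;
    int status;                  /* e.g. 0 = dead, 1 = idle, 2 = busy */
    unsigned long requests;      /* requests served by this child */
};

/* Allocate a zero-filled scoreboard that stays shared with children
 * created by later fork() calls. */
struct score_slot *scoreboard_create(size_t nslots)
{
    void *p;

    assert(nslots > 0);
    p = mmap(NULL, nslots * sizeof(struct score_slot),
             PROT_READ | PROT_WRITE, MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    return p == MAP_FAILED ? NULL : (struct score_slot *) p;
}
```

A child updates its own slot with plain stores, and the parent reads them without any system call. The on-disk file has to pay for read and write calls, which is one reason it is slower.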

Historical note: The Linux port of Apache didn't start to use shared memory until version 1.2 of Apache. This oversight resulted in really poor and unreliable behaviour of earlier versions of Apache on Linux.

DYNAMIC_MODULE_LIMIT


If you have no intention of using dynamically loaded modules (you probably don't if you're reading this and tuning your server for every last ounce of performance) then you should add -DDYNAMIC_MODULE_LIMIT=0 when building your server. This will save RAM that's allocated only for supporting dynamically loaded modules.

Appendix: Detailed Analysis of a Trace

Here is a system call trace of Apache 1.3 running on Linux. The run-time configuration file is essentially the default plus:

+<Directory />
+    AllowOverride none
+    Options FollowSymLinks
+</Directory>
+

The file being requested is a static 6K file of no particular content. Traces of non-static requests or requests with content negotiation look wildly different (and quite ugly in some cases). First the entire trace, then we'll examine details. (This was generated by the strace program; other similar programs include truss, ktrace, and par.)

+accept(15, {sin_family=AF_INET, sin_port=htons(22283), sin_addr=inet_addr("127.0.0.1")}, [16]) = 3
+flock(18, LOCK_UN)                      = 0
+sigaction(SIGUSR1, {SIG_IGN}, {0x8059954, [], SA_INTERRUPT}) = 0
+getsockname(3, {sin_family=AF_INET, sin_port=htons(8080), sin_addr=inet_addr("127.0.0.1")}, [16]) = 0
+setsockopt(3, IPPROTO_TCP1, [1], 4)     = 0
+read(3, "GET /6k HTTP/1.0\r\nUser-Agent: "..., 4096) = 60
+sigaction(SIGUSR1, {SIG_IGN}, {SIG_IGN}) = 0
+time(NULL)                              = 873959960
+gettimeofday({873959960, 404935}, NULL) = 0
+stat("/home/dgaudet/ap/apachen/htdocs/6k", {st_mode=S_IFREG|0644, st_size=6144, ...}) = 0
+open("/home/dgaudet/ap/apachen/htdocs/6k", O_RDONLY) = 4
+mmap(0, 6144, PROT_READ, MAP_PRIVATE, 4, 0) = 0x400ee000
+writev(3, [{"HTTP/1.1 200 OK\r\nDate: Thu, 11"..., 245}, {"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 6144}], 2) = 6389
+close(4)                                = 0
+time(NULL)                              = 873959960
+write(17, "127.0.0.1 - - [10/Sep/1997:23:39"..., 71) = 71
+gettimeofday({873959960, 417742}, NULL) = 0
+times({tms_utime=5, tms_stime=0, tms_cutime=0, tms_cstime=0}) = 446747
+shutdown(3, 1 /* send */)               = 0
+oldselect(4, [3], NULL, [3], {2, 0})    = 1 (in [3], left {2, 0})
+read(3, "", 2048)                       = 0
+close(3)                                = 0
+sigaction(SIGUSR1, {0x8059954, [], SA_INTERRUPT}, {SIG_IGN}) = 0
+munmap(0x400ee000, 6144)                = 0
+flock(18, LOCK_EX)                      = 0
+


Notice the accept serialization:

+flock(18, LOCK_UN)                      = 0
+...
+flock(18, LOCK_EX)                      = 0
+

These two calls can be removed by defining SAFE_UNSERIALIZED_ACCEPT as described earlier.

Notice the SIGUSR1 manipulation:

+sigaction(SIGUSR1, {SIG_IGN}, {0x8059954, [], SA_INTERRUPT}) = 0
+...
+sigaction(SIGUSR1, {SIG_IGN}, {SIG_IGN}) = 0
+...
+sigaction(SIGUSR1, {0x8059954, [], SA_INTERRUPT}, {SIG_IGN}) = 0
+

This is caused by the implementation of graceful restarts. When the parent receives a SIGUSR1 it sends a SIGUSR1 to all of its children (and it also increments a "generation counter" in shared memory). Any children that are idle (between connections) will immediately die off when they receive the signal. Any children that are in keep-alive connections, but are in between requests, will die off immediately. But any children that have a connection and are still waiting for the first request will not die off immediately.

To see why this is necessary, consider how a browser reacts to a closed connection. If the connection was a keep-alive connection and the request being serviced was not the first request, then the browser will quietly reissue the request on a new connection. It has to do this because the server is always free to close a keep-alive connection in between requests (e.g. due to a timeout or because of a maximum number of requests). But if the connection is closed before the first response has been received, the typical browser will display a "document contains no data" dialogue (or a broken image icon). This is done on the assumption that the server is broken in some way (or maybe too overloaded to respond at all). So Apache tries to avoid ever deliberately closing the connection before it has sent a single response. This is the cause of those SIGUSR1 manipulations.

Note that it is theoretically possible to eliminate all three of these calls. But in rough tests the gain proved to be almost unnoticeable.
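The SIGUSR1 handling can be sketched with sigaction(2). This simplified sketch uses two steps, ignoring the signal after accept and restoring it after the response. The trace has three calls. The names restart_pending, install_usr1_handler, ignore_usr1 and restore_usr1 are hypothetical.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <signal.h>
#include <string.h>

static volatile sig_atomic_t restart_pending = 0;
static struct sigaction saved_usr1;

static void usr1_handler(int sig)
{
    (void) sig;
    restart_pending = 1;         /* the child exits once it is idle */
}

/* Install the graceful-restart handler (once per child). */
int install_usr1_handler(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = usr1_handler;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGUSR1, &sa, NULL);
}

/* After accept(): ignore SIGUSR1 until the first response has been sent,
 * remembering the previous disposition. */
int ignore_usr1(void)
{
    struct sigaction ign;

    memset(&ign, 0, sizeof ign);
    ign.sa_handler = SIG_IGN;
    sigemptyset(&ign.sa_mask);
    return sigaction(SIGUSR1, &ign, &saved_usr1);
}

/* After the response: put the graceful-restart handler back. */
int restore_usr1(void)
{
    return sigaction(SIGUSR1, &saved_usr1, NULL);
}
```

A restart signal that arrives while the child is ignoring SIGUSR1 is simply dropped in this sketch. The "generation counter" described above is what lets a real child notice the restart later.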

In order to implement virtual hosts, Apache needs to know the local socket address used to accept the connection:

+getsockname(3, {sin_family=AF_INET, sin_port=htons(8080), sin_addr=inet_addr("127.0.0.1")}, [16]) = 0
+

It is possible to eliminate this call in many situations (such as when there are no virtual hosts, or when Listen directives are used which do not have wildcard addresses). But no effort has yet been made to do these optimizations.
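The following minimal sketch shows the getsockname(2) call, assuming IPv4. The helper local_address is hypothetical. On a socket bound to a wildcard address, the local address is only known per connection, which is why the call is needed. When Listen names a specific address, the answer is already known.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Find the local address and port a socket is bound to; virtual host
 * selection starts from this address. */
int local_address(int fd, struct sockaddr_in *out)
{
    socklen_t len = sizeof *out;

    assert(out != NULL);
    if (getsockname(fd, (struct sockaddr *) out, &len) == -1)
        return -1;
    return out->sin_family == AF_INET ? 0 : -1;
}
```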

Apache turns off the Nagle algorithm:

+setsockopt(3, IPPROTO_TCP1, [1], 4)     = 0
+

because of problems described in a paper by John Heidemann.
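Turning Nagle off takes one setsockopt(2) call with TCP_NODELAY. The IPPROTO_TCP1 in the trace appears to be strace printing IPPROTO_TCP and the option number 1 (TCP_NODELAY's value on Linux) run together. The following is a minimal sketch; the helper disable_nagle is hypothetical.

```c
#include <assert.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>
#include <unistd.h>

/* Disable the Nagle algorithm so small writes are sent immediately
 * instead of waiting for outstanding data to be acknowledged. */
int disable_nagle(int fd)
{
    int on = 1;

    assert(fd >= 0);
    return setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &on, sizeof on);
}
```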

Notice the two time calls:

+time(NULL)                              = 873959960
+...
+time(NULL)                              = 873959960
+

One of these occurs at the beginning of the request, and the other occurs as a result of writing the log. At least one of these is required to properly implement the HTTP protocol. The second occurs because the Common Log Format dictates that the log record include a timestamp of the end of the request. A custom logging module could eliminate one of the calls.
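The following minimal sketch produces that log timestamp, assuming UTC output for simplicity. The trace's log line is written in the server's local zone and shows 10/Sep/1997:23:39 for the same instant, 873959960. The helper clf_timestamp is hypothetical.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <time.h>

/* Format a Common Log Format timestamp in UTC, e.g.
 * "[11/Sep/1997:06:39:20 +0000]".  gmtime_r keeps the offset at +0000;
 * the C locale's month abbreviations are assumed. */
size_t clf_timestamp(time_t t, char *buf, size_t len)
{
    struct tm tm;

    assert(buf != NULL && len > 0);
    if (gmtime_r(&t, &tm) == NULL)
        return 0;
    return strftime(buf, len, "[%d/%b/%Y:%H:%M:%S +0000]", &tm);
}
```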

As described earlier, Rule STATUS=yes causes two gettimeofday calls and a call to times:

+gettimeofday({873959960, 404935}, NULL) = 0
+...
+gettimeofday({873959960, 417742}, NULL) = 0
+times({tms_utime=5, tms_stime=0, tms_cutime=0, tms_cstime=0}) = 446747
+

These can be removed by either removing mod_status or setting Rule STATUS=no.

It might seem odd to call stat:

+stat("/home/dgaudet/ap/apachen/htdocs/6k", {st_mode=S_IFREG|0644, st_size=6144, ...}) = 0
+

This is part of the algorithm which calculates the PATH_INFO for use by CGIs. In fact if the request had been for the URI /cgi-bin/printenv/foobar then there would be two calls to stat: the first for /home/dgaudet/ap/apachen/cgi-bin/printenv/foobar, which does not exist, and the second for /home/dgaudet/ap/apachen/cgi-bin/printenv, which does exist. Regardless, at least one stat call is necessary when serving static files because the file size and modification times are used to generate HTTP headers (such as Content-Length, Last-Modified) and implement protocol features (such as If-Modified-Since). A somewhat more clever server could avoid the stat when serving non-static files; however, doing so in Apache is very difficult given the modular structure.
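The following minimal sketch shows what that single stat buys for a static file, assuming the RFC 1123 date format for Last-Modified. The helpers static_headers, http_date and not_modified are hypothetical and are not Apache's API.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <time.h>
#include <unistd.h>

/* Format t as an HTTP date, e.g. "Thu, 11 Sep 1997 06:39:20 GMT". */
size_t http_date(time_t t, char *buf, size_t len)
{
    struct tm tm;

    if (gmtime_r(&t, &tm) == NULL)
        return 0;
    return strftime(buf, len, "%a, %d %b %Y %H:%M:%S GMT", &tm);
}

/* One stat(2) per static file: st_size feeds Content-Length, st_mtime
 * feeds Last-Modified and the If-Modified-Since comparison. */
int static_headers(const char *path, char *buf, size_t len, time_t *mtime)
{
    struct stat st;
    char date[40];

    assert(path != NULL && buf != NULL && mtime != NULL);
    if (stat(path, &st) != 0 || !S_ISREG(st.st_mode))
        return -1;
    if (http_date(st.st_mtime, date, sizeof date) == 0)
        return -1;
    *mtime = st.st_mtime;
    snprintf(buf, len, "Content-Length: %lld\r\nLast-Modified: %s\r\n",
             (long long) st.st_size, date);
    return 0;
}

/* A 304 response is possible when the file is no newer than the client's
 * copy. */
int not_modified(time_t mtime, time_t if_modified_since)
{
    return mtime <= if_modified_since;
}
```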

All static files are served using mmap:

+mmap(0, 6144, PROT_READ, MAP_PRIVATE, 4, 0) = 0x400ee000
+...
+munmap(0x400ee000, 6144)                = 0
+

On some architectures it's slower to mmap small files than it is to simply read them. The define MMAP_THRESHOLD can be set to the minimum size required before using mmap. By default it's set to 0 (except on SunOS4, where experimentation has shown 8192 to be a better value). Using a tool such as lmbench you can determine the optimal setting for your environment. It may even be the case that mmap isn't used on your architecture; if so, then defining USE_MMAP_FILES might work (if it works then report back to us).
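The following minimal sketch shows the threshold decision. The function load_file and its parameters are hypothetical, with threshold playing the role of MMAP_THRESHOLD. Files below the threshold are read with pread(2) into a caller-supplied buffer instead of being mapped.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Return a pointer to `size` bytes of the file: mmap(2) it when size is
 * at least `threshold`, otherwise read it into buf.  *mapped tells the
 * caller whether to munmap() the result afterwards. */
const char *load_file(int fd, size_t size, size_t threshold,
                      char *buf, size_t buflen, int *mapped)
{
    size_t got = 0;

    assert(mapped != NULL);
    *mapped = 0;
    if (size > 0 && size >= threshold) {
        void *p = mmap(NULL, size, PROT_READ, MAP_PRIVATE, fd, 0);
        if (p != MAP_FAILED) {
            *mapped = 1;
            return p;
        }
        /* fall back to reading if the mapping fails */
    }
    if (size > buflen)
        return NULL;
    while (got < size) {
        ssize_t n = pread(fd, buf + got, size - got, (off_t) got);
        if (n <= 0)
            return NULL;
        got += (size_t) n;
    }
    return buf;
}
```

With threshold 8192, the 6K file in the trace would be read rather than mapped. With the default of 0, it is mapped, as the trace shows.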

Apache does its best to avoid copying bytes around in memory. The first write of any request typically is turned into a writev which combines both the headers and the first hunk of data:

+writev(3, [{"HTTP/1.1 200 OK\r\nDate: Thu, 11"..., 245}, {"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 6144}], 2) = 6389
+

When doing HTTP/1.1 chunked encoding Apache will generate writevs of up to four elements. The goal is to push the byte copying into the kernel, where it typically has to happen anyhow (to assemble network packets). On testing, various Unixes (BSDI 2.x, Solaris 2.5, Linux 2.0.31+) properly combine the elements into network packets. Pre-2.0.31 Linux will not combine, and will create a packet for each element, so upgrading is a good idea. Defining NO_WRITEV will disable this combining, but result in very poor chunked encoding performance.
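The following minimal sketch shows the header-plus-body writev(2) from the trace. The helper send_head_and_body is hypothetical, and short writes are not retried, although a real server would loop.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <string.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Send the response headers and the first hunk of the body with one
 * writev(2); the two buffers are never copied together in user space. */
ssize_t send_head_and_body(int fd, const char *head, size_t headlen,
                           const void *body, size_t bodylen)
{
    struct iovec iov[2];

    assert(head != NULL && body != NULL);
    iov[0].iov_base = (void *) head;   /* iov_base is not const-qualified */
    iov[0].iov_len = headlen;
    iov[1].iov_base = (void *) body;
    iov[1].iov_len = bodylen;
    return writev(fd, iov, 2);
}
```

In the trace this is a 245-byte header element plus the 6144-byte mapped file, and writev returns 6389.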

The log write:

+write(17, "127.0.0.1 - - [10/Sep/1997:23:39"..., 71) = 71
+

can be deferred by defining BUFFERED_LOGS. In this case up to PIPE_BUF bytes (a POSIX defined constant) of log entries are buffered before writing. At no time does it split a log entry across a PIPE_BUF boundary, because those writes may not be atomic (entries from multiple children could become mixed together). The code does its best to flush this buffer when a child dies.
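The following minimal sketch shows such a buffer, assuming one buffer per child writing to a pipe or similar. The names struct logbuf, log_append and log_flush are hypothetical. An entry is appended whole or not at all, and a full buffer is flushed before the entry that would overflow it.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <limits.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#ifndef PIPE_BUF
#define PIPE_BUF 512        /* POSIX minimum, if <limits.h> omits it */
#endif

/* Per-child log buffer; never holds more than PIPE_BUF bytes, so each
 * flush is a single write that stays atomic on a pipe. */
struct logbuf {
    int fd;
    size_t used;
    char data[PIPE_BUF];
};

int log_flush(struct logbuf *lb)
{
    if (lb->used > 0) {
        if (write(lb->fd, lb->data, lb->used) != (ssize_t) lb->used)
            return -1;
        lb->used = 0;
    }
    return 0;
}

/* Append one whole entry, flushing first if it would not fit.  Entries
 * longer than PIPE_BUF are written directly and may interleave. */
int log_append(struct logbuf *lb, const char *entry, size_t len)
{
    assert(lb != NULL && entry != NULL);
    if (lb->used + len > sizeof lb->data && log_flush(lb) != 0)
        return -1;
    if (len > sizeof lb->data)
        return write(lb->fd, entry, len) == (ssize_t) len ? 0 : -1;
    memcpy(lb->data + lb->used, entry, len);
    lb->used += len;
    return 0;
}
```

A child would call log_flush on its way out, which corresponds to the best-effort flush on child death mentioned above.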

The lingering close code causes four system calls:

+shutdown(3, 1 /* send */)               = 0
+oldselect(4, [3], NULL, [3], {2, 0})    = 1 (in [3], left {2, 0})
+read(3, "", 2048)                       = 0
+close(3)                                = 0
+

which were described earlier.

Let's apply some of these optimizations: -DSAFE_UNSERIALIZED_ACCEPT -DBUFFERED_LOGS and Rule STATUS=no. Here's the final trace:

+accept(15, {sin_family=AF_INET, sin_port=htons(22286), sin_addr=inet_addr("127.0.0.1")}, [16]) = 3
+sigaction(SIGUSR1, {SIG_IGN}, {0x8058c98, [], SA_INTERRUPT}) = 0
+getsockname(3, {sin_family=AF_INET, sin_port=htons(8080), sin_addr=inet_addr("127.0.0.1")}, [16]) = 0
+setsockopt(3, IPPROTO_TCP1, [1], 4)     = 0
+read(3, "GET /6k HTTP/1.0\r\nUser-Agent: "..., 4096) = 60
+sigaction(SIGUSR1, {SIG_IGN}, {SIG_IGN}) = 0
+time(NULL)                              = 873961916
+stat("/home/dgaudet/ap/apachen/htdocs/6k", {st_mode=S_IFREG|0644, st_size=6144, ...}) = 0
+open("/home/dgaudet/ap/apachen/htdocs/6k", O_RDONLY) = 4
+mmap(0, 6144, PROT_READ, MAP_PRIVATE, 4, 0) = 0x400e3000
+writev(3, [{"HTTP/1.1 200 OK\r\nDate: Thu, 11"..., 245}, {"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 6144}], 2) = 6389
+close(4)                                = 0
+time(NULL)                              = 873961916
+shutdown(3, 1 /* send */)               = 0
+oldselect(4, [3], NULL, [3], {2, 0})    = 1 (in [3], left {2, 0})
+read(3, "", 2048)                       = 0
+close(3)                                = 0
+sigaction(SIGUSR1, {0x8058c98, [], SA_INTERRUPT}, {SIG_IGN}) = 0
+munmap(0x400e3000, 6144)                = 0
+

That's 19 system calls, of which 4 remain relatively easy to remove, but don't seem worth the effort.

Appendix: The Pre-Forking Model


Apache (on Unix) is a pre-forking model server. The parent process is responsible only for forking child processes; it does not serve any requests or service any network sockets. The child processes actually process connections; each serves multiple connections (one at a time) before dying. The parent spawns new or kills off old children in response to changes in the load on the server (it does so by monitoring a scoreboard which the children keep up to date).

This model for servers offers a robustness that other models do not. In particular, the parent code is very simple, and with a high degree of confidence the parent will continue to do its job without error. The children are complex, and when you add in third party code via modules, you risk segmentation faults and other forms of corruption. Even should such a thing happen, it only affects one connection and the server continues serving requests. The parent quickly replaces the dead child.
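The following toy version of the parent's job assumes a fixed pool size instead of Apache's scoreboard-driven adjustment. The names preforking_parent and demo_child are hypothetical. Because the parent only forks and waits, a child that exits or crashes is simply reaped and replaced.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Keep `nchildren` workers running until `total` have been started in
 * all, replacing each child as soon as it exits for any reason.
 * child_main()'s return value becomes the child's exit status.
 * Returns the number of children reaped, or -1 on error. */
int preforking_parent(int nchildren, int total, int (*child_main)(void))
{
    int started = 0, running = 0, reaped = 0;

    assert(nchildren > 0 && child_main != NULL);
    while (reaped < total) {
        /* top the pool back up */
        while (running < nchildren && started < total) {
            pid_t pid = fork();
            if (pid == -1)
                return -1;
            if (pid == 0)
                _exit(child_main());   /* the child never returns here */
            ++started;
            ++running;
        }
        /* the parent does nothing but wait for a child to die */
        if (wait(NULL) == -1)
            return -1;
        --running;
        ++reaped;
    }
    return reaped;
}

/* Usage helper: a child that "serves" nothing and exits cleanly. */
int demo_child(void)
{
    return 0;
}
```

Calling preforking_parent(3, 7, demo_child) keeps at most three children alive and starts seven in total.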

Pre-forking is also very portable across dialects of Unix. Historically this has been an important goal for Apache, and it continues to remain so.

The pre-forking model comes under criticism for various performance aspects. Of particular concern are the overhead of forking a process, the overhead of context switches between processes, and the memory overhead of having multiple processes. Furthermore it does not offer as many opportunities for data-caching between requests (such as a pool of mmapped files). Various other models exist and extensive analysis can be found in the papers of the JAWS project. In practice all of these costs vary drastically depending on the operating system.

Apache's core code is already multithread aware, and Apache version 1.3 is multithreaded on NT. There have been at least two other experimental implementations of threaded Apache (one using the 1.3 code base on DCE, and one using a custom user-level threads package and the 1.0 code base); neither is publicly available. Part of our redesign for version 2.0 of Apache will include abstractions of the server model so that we can continue to support the pre-forking model, and also support various threaded models.

diff --git a/docs/manual/platform/perf.html b/docs/manual/platform/perf.html index 4b440ed9b7c..58b09f7a427 100644 --- a/docs/manual/platform/perf.html +++ b/docs/manual/platform/perf.html @@ -92,6 +92,7 @@ SGI

WebFORCE Web Server Tuning Guidelines for IRIX 5.3, <http://www.sgi.com/Products/WebFORCE/Resources/res_TuningGuide.html> +

Performance Tuning -- accept_mutex

-- 2.47.2