There are two major approaches to multiplexing communication channels on servers: thread-per-connect (where multiple threads each manage a connection with synchronous send and recv calls) and poll/select (where a multiplexing function is used to wait for connections to become readable or writable). The two in-house communications packages I've seen lately use a hybrid approach, combining poll/select with a threadpool to allow requests to be processed without blocking communication channels, but the fundamental dichotomy remains.
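To make the two models concrete, here is a minimal Python sketch of each as a tiny echo server. It is not the benchmark code discussed in this post; the function names (`handle_connection`, `serve_threaded`, `serve_poll`) and the echo behaviour are illustrative assumptions, and a real poll loop would normally use non-blocking sockets and watch for writability as well.

```python
import select
import socket
import threading

def handle_connection(conn):
    """Thread-per-connect: blocking recv/send calls on a single connection."""
    with conn:
        while True:
            data = conn.recv(4096)
            if not data:        # peer closed the connection
                return
            conn.sendall(data)  # echo the request back as the response

def serve_threaded(listener):
    """Accept forever, spawning one thread per accepted connection."""
    while True:
        conn, _addr = listener.accept()
        threading.Thread(target=handle_connection, args=(conn,), daemon=True).start()

def serve_poll(listener):
    """Single-threaded loop: poll() reports which sockets are ready to read."""
    poller = select.poll()
    poller.register(listener, select.POLLIN)
    conns = {}  # file descriptor -> connected socket
    while True:
        for fd, _event in poller.poll():
            if fd == listener.fileno():
                conn, _addr = listener.accept()
                conns[conn.fileno()] = conn
                poller.register(conn, select.POLLIN)
                continue
            conn = conns[fd]
            try:
                data = conn.recv(4096)
            except OSError:
                data = b""      # treat a reset like a close
            if data:
                conn.sendall(data)  # simplification: blocking send inside the loop
            else:
                poller.unregister(fd)
                conns.pop(fd).close()
```

Both functions take an already-listening socket, so the only difference between them is how readiness is discovered: by blocking in `recv` on a dedicated thread, or by asking `poll()` about every socket at once.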
So having used both approaches over the course of my career, and being now involved in work on yet another communication package, I was curious: which of the two performs faster on Linux? It was my belief that poll/select would significantly outperform thread-per-connect on a single processor box: poll/select seems like it should avoid the overhead of context switching required by thread-per-connect.
I was surprised to discover that my belief was wrong: unless I've screwed up in benchmarking this, thread-per-connect performance and poll loop performance appear to be equivalent - or at least too close to call.
The following table summarizes my results:
Num Clients | Requests per Client | Poll time (in ms) | Thread time (in ms) |
10 | 1000 | 213 | 227 |
50 | 200 | 229 | 216 |
100 | 100 | 212 | 225 |
150 | 66 | 220 | 239 |
200 | 50 | 223 | 252 |
300 | 33 | 232 | 280 |
380 | 26 | 250 | 293 |
In each of the test cases, I send about 10,000 request/response pairs from a client process to a server. The client process maintains some number of client connections - from 10 to 380 (that being about the maximum number of threads pthreads appears to support on my system) - and sends as many requests per connection as necessary to bring the total to about 10,000 (for example, with 10 connections we send 1000 requests on each connection).
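The per-connection request counts in the table are consistent with simple integer division of 10,000 by the number of connections. The sketch below assumes that rule; the function name `requests_per_connection` is made up for illustration.

```python
def requests_per_connection(clients, total=10_000):
    """Whole number of requests each connection sends so the sum is about `total`."""
    return total // clients

for clients in (10, 50, 100, 150, 200, 300, 380):
    per_conn = requests_per_connection(clients)
    # Third column is the actual total, which can fall slightly short of 10,000.
    print(clients, per_conn, clients * per_conn)
```

Under this rule the actual totals range from 9,880 (380 connections) to exactly 10,000, which is why the total is only "about" 10,000.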
In the "Poll" cases, the server is a single-threaded poll loop. In the "Thread" cases, the server spawns a new thread for every connection. Each test case was run 5 times. The times shown are the average wall-clock run times (not CPU time). In two cases, I culled two of the values from the poll tests because they were way out of the normal range (over one second, presumably due to system conditions).
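The measurement procedure described above (wall-clock timing, five runs, outliers dropped, then averaged) could be sketched roughly as follows. The helper names and the exact one-second cutoff parameter are assumptions for illustration, not the code that produced the table.

```python
import statistics
import time

def time_run_ms(run):
    """Wall-clock (not CPU) duration of one call to `run`, in milliseconds."""
    start = time.perf_counter()
    run()
    return (time.perf_counter() - start) * 1000.0

def culled_mean_ms(times_ms, cutoff_ms=1000.0):
    """Average the samples, dropping any at or above `cutoff_ms` as outliers."""
    kept = [t for t in times_ms if t < cutoff_ms]
    if not kept:
        raise ValueError("every sample was at or above the cutoff")
    return statistics.mean(kept)

def benchmark_ms(run, repeats=5):
    """Time `run` several times and return the culled average."""
    return culled_mean_ms([time_run_ms(run) for _ in range(repeats)])
```

`time.perf_counter` is used because it measures elapsed real time, which is what matters when most of the run is spent waiting on sockets rather than burning CPU.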
As you can see, there is no significant difference between the poll and thread cases. The thread cases do appear to degrade slightly as we get into the higher numbers of connections, but not by very much compared to the general variation in the timings.
What I'm curious to see now is how these numbers compare to a typical hybrid approach - where events from a poll loop are dispatched to a thread pool. Again, I can't imagine that a hybrid approach would perform anywhere near as well, but I've been proven wrong before :-).
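For reference, a hybrid server of the kind described here might look something like the sketch below: the poll loop only reads requests and hands them to a pool, and a pool thread does the work and sends the response. The names `serve_hybrid` and `process`, the pool size, and the upper-casing "work" are all placeholders, and the sketch ignores ordering between multiple in-flight requests on one connection.

```python
import select
import socket
from concurrent.futures import ThreadPoolExecutor

def process(conn, request):
    """Runs on a pool thread so slow work does not stall the poll loop."""
    try:
        conn.sendall(request.upper())  # stand-in for real request handling
    except OSError:
        pass  # connection was closed while the request was queued

def serve_hybrid(listener, workers=4):
    """Poll loop that dispatches each readable request to a thread pool."""
    pool = ThreadPoolExecutor(max_workers=workers)
    poller = select.poll()
    poller.register(listener, select.POLLIN)
    conns = {}  # file descriptor -> connected socket
    while True:
        for fd, _event in poller.poll():
            if fd == listener.fileno():
                conn, _addr = listener.accept()
                conns[conn.fileno()] = conn
                poller.register(conn, select.POLLIN)
                continue
            conn = conns[fd]
            try:
                data = conn.recv(4096)
            except OSError:
                data = b""
            if data:
                pool.submit(process, conn, data)  # hand off; keep polling
            else:
                poller.unregister(fd)
                conns.pop(fd).close()
```

The extra cost relative to a plain poll loop is the hand-off itself: every request crosses a queue and wakes a worker thread, which is exactly the overhead a benchmark of the hybrid approach would be measuring.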
I'm no expert on benchmarking, so if anyone has any input on the process or the results, I'd be very interested. The benchmarking software can be found here.