Introduction
The purpose of the current work is to analyze different methods for efficient delivery of network events from kernel mode to user mode. Five methods are examined: poll(), chosen as the best of the old-style methods, the standard /dev/poll interface, standard RT signals, RT signals with the one-sig-per-fd patch, and a new /dev/epoll that uses a quite different notification method. This work is composed of :
The new /dev/epoll kernel patch
The patch is quite simple and it adds notification callbacks to the 'struct file' data structure :
****** include/linux/fs.h
****** include/linux/fcblist.h
#define FCB_LOCAL_SIZE 4
/* called by lower layers to fire the callbacks registered on 'filep' */
void file_notify_event(struct file *filep, long *event);
/* releases the callback list when the file is torn down */
void file_notify_cleanup(struct file *filep);
The purpose of this callback list is to give lower I/O layers the ability to notify upper layers that have registered their "interests" with the file structure. Initialization and cleanup code has been added in fs/file_table.c, while the callback list handling code has been placed in fs/fcblist.c :
****** fs/file_table.c
****** fs/fcblist.c
The callbacks receive a 'long *' whose first element is one of the ION_* events, while the following elements can carry additional parameters whose meaning depends on the first one. This interface is a draft and I used it only to verify whether the transport method is efficient "enough" to be worth working on. At the current stage notifications have been plugged only into the socket files by adding :
****** include/net/sock.h
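To illustrate the convention described above, here is a rough sketch of how a lower layer could fire the callbacks; the ION_IN name and the helper itself are placeholders of mine, not necessarily the patch's exact code :

/* placeholder sketch: signal registered callbacks that data arrived;
 * ION_IN is assumed here, the patch defines its own ION_* codes */
static inline void sk_notify_data_ready(struct file *filep)
{
	long event[2];

	event[0] = ION_IN;	/* the event code comes first ... */
	event[1] = 0;		/* ... followed by event-dependent parameters */
	file_notify_event(filep, event);
}

On the user space side, the interface is used by first opening the device :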
if ((kdpfd = open("/dev/epoll", O_RDWR)) == -1) {
	perror("open /dev/epoll");
	exit(1);
}
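The interest area is then sized with an ioctl(2) call ( a sketch, assuming the patch's EP_ALLOC request ) :

/* reserve room inside the device for up to maxfds descriptors */
if (ioctl(kdpfd, EP_ALLOC, maxfds)) {
	perror("ioctl: EP_ALLOC");
	exit(1);
}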
where maxfds is the maximum number of file descriptors that are expected to be stored inside the polling device. Files are added to the interest set by :
struct pollfd pfd;
and removed with :
struct pollfd pfd;
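In both cases the pollfd is filled in and written to the device; a sketch of the two sequences follows ( POLLREMOVE is assumed here, as in the /dev/poll-style interfaces ) :

/* add fd to the interest set */
pfd.fd = fd;
pfd.events = POLLIN | POLLOUT | POLLERR | POLLHUP;
pfd.revents = 0;
if (write(kdpfd, &pfd, sizeof(pfd)) != sizeof(pfd))
	perror("write: add fd");

/* remove fd from the interest set */
pfd.fd = fd;
pfd.events = POLLREMOVE;
pfd.revents = 0;
if (write(kdpfd, &pfd, sizeof(pfd)) != sizeof(pfd))
	perror("write: remove fd");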
The core dispatching code looks like :
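A sketch of such a loop, based on the description below, could be ( struct evpoll, ep_timeout, mapsize and handle_event() are placeholders of mine; EP_POLL and ep_resoff are the names described in the text ) :

char *map;
struct pollfd *pfds;
struct evpoll evp;
int i, nfds;

/* map the double buffer where the kernel deposits the result sets */
if ((map = mmap(NULL, mapsize, PROT_READ, MAP_PRIVATE, kdpfd, 0)) == MAP_FAILED) {
	perror("mmap /dev/epoll");
	exit(1);
}
for (;;) {
	evp.ep_timeout = timeout;	/* how long EP_POLL may wait */
	evp.ep_resoff = 0;

	/* fetch the next result set; ep_resoff tells where it lives
	 * inside the mapped area */
	nfds = ioctl(kdpfd, EP_POLL, &evp);
	pfds = (struct pollfd *) (map + evp.ep_resoff);
	for (i = 0; i < nfds; i++)
		handle_event(&pfds[i]);	/* application-specific I/O */
}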
Basically the driver allocates two sets of pages that it uses as a double buffer to store file events. The ep_resoff field tells where, inside the map, the current result set resides, so while the application works on one set, the kernel can use the other one to store incoming events. There are no copy-to-userspace issues, events coming from the same file are collapsed into a single slot, and the EP_POLL function never does a linear scan of the interest set to perform a file->f_ops->poll(). To use the /dev/epoll interface you have to create the device node with major number 10 and minor number 124 :
# mknod /dev/epoll c 10 124
You can download the patch here :
The /dev/poll patch from Provos-Lever
There are very few things to say about this, other than that a virt_to_page() bug has been fixed to make the patch work. I also fixed a problem the patch had when it tried to resize the hash table by calling kmalloc() for a big chunk of memory that could not be satisfied; vmalloc() is now used for the hash table allocation. I modified a patch for 2.4.3 that I found at the CITI web site, which should be the port to 2.4.x of the original ( 2.2.x ) one used by Provos-Lever. You can download the patch here :
The RT signals one-sig-per-fd patch
This patch, coded by Vitaly Luban, implements RT signal collapsing and tries to avoid the SIGIO delivery that happens when the RT signal queue becomes full. You can download the patch here :
The HTTP server
The HTTP server is very simple(tm) and is based on event polling + coroutines, which makes the server quite efficient. The coroutine library implementation used inside the server has been taken from :
http://www.goron.de/~froese/coro/
It's very small, simple and fast. The default stack size used by the server is 8192 bytes and this, when trying to handle a lot of connections, may result in memory waste and VM thrashing. A stack size of 4096 should be sufficient with this ( empty ) HTTP server implementation. Another issue concerns the allocation method used by the coro library, which uses mmap() for stack allocation. This, when the rate of accept()/close() becomes high, may result in performance loss. I changed the library ( just one file, coro.c ) to use malloc()/free() instead of mmap()/munmap() ( a sketch of the idea follows below ). Again, the server is very simple and always emits the same HTTP response, whose size can be set by a command line parameter. Two other command line options let you set the listening port and the fd set size. You can download the server here :
Old version:
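The coro.c change mentioned above is tiny; a hypothetical illustration of the idea ( stack_alloc() and the constants are mine, not the library's ) :

#include <stdlib.h>
#include <sys/mman.h>

#define STACK_SIZE 4096

/* allocate a coroutine stack: malloc() avoids one mmap()/munmap() pair
 * per accepted connection when the accept()/close() rate is high */
static void *stack_alloc(void)
{
#ifdef USE_MMAP_STACKS
	void *stk = mmap(NULL, STACK_SIZE, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	return stk == MAP_FAILED ? NULL : stk;
#else
	return malloc(STACK_SIZE);
#endif
}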
The deadconn(tm) tool
If the server is simple, this is even simpler: its purpose is to create "dead" connections to the server to simulate a realistic load where a bunch of slow links are connected. You can download deadconn here :
The test
The test machine is a PIII 600MHz, 128 MB RAM, with an eepro100 network card connected to a 100Mbps fast ethernet switch. The kernel is 2.4.6 over a RH 6.2 installation and the coroutine library version is 1.1.0-pre2. I used a dual PIII 1GHz, 256 MB RAM, dual eepro100 machine to run httperf, while a dual PIII 900MHz, 256 MB RAM, dual eepro100 machine ran deadconn(tm). Since httperf, when used with a high number of num-conns, very quickly fills the fd space ( modified to 8000 ), I used this command line :
--think-timeout 5 --timeout 5 --num-calls 2500 --num-conns 100 --hog --rate 100
This basically allocates 100 connections that load the server under different numbers of dead connections. The other parameter I varied is the response size: 128, 512 and 1024 bytes. Another test, which better reflects the nature of internet sessions, is to have a burst of connections that are opened, make two HTTP requests and are then closed. This test is implemented with httperf by calling :
--think-timeout 5 --timeout 5 --num-calls 2 --num-conns 27000 --hog --rate 5000
Each of these numbers is the average of three runs. You can download httperf here :
The test shows that /dev/epoll is about 10-12% faster than the RT signals one-sig implementation and that both /dev/epoll and the two RT signals implementations stay flat as the dead connection load increases. The RT one-sig implementation is slightly faster than the simple RT signals one, but here only a couple of SIGIOs occurred during the test.
The system call interface ( aka sys_epoll )
The need for a system call interface to the event retrieval device drove the implementation of sys_epoll, which offers the same level of scalability through a simpler interface for the developer. The new system call interface introduces three new system calls that map to the corresponding user space calls :
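In mainline kernels these became epoll_create(2), epoll_ctl(2) and epoll_wait(2); a minimal usage sketch with that API follows ( details such as the struct epoll_event layout may differ from the 2.5.45 patch, and handle_fd() is a stub of mine ) :

#include <stdio.h>
#include <sys/epoll.h>

#define MAX_EVENTS 128

/* application-specific I/O on a ready descriptor ( stub for the sketch ) */
static void handle_fd(int fd) { (void) fd; }

int event_loop(int listen_fd)
{
	struct epoll_event ev, events[MAX_EVENTS];
	int epfd, nfds, i;

	if ((epfd = epoll_create(MAX_EVENTS)) == -1) {
		perror("epoll_create");
		return -1;
	}
	ev.events = EPOLLIN;
	ev.data.fd = listen_fd;
	if (epoll_ctl(epfd, EPOLL_CTL_ADD, listen_fd, &ev) == -1) {
		perror("epoll_ctl");
		return -1;
	}
	for (;;) {
		nfds = epoll_wait(epfd, events, MAX_EVENTS, -1);
		for (i = 0; i < nfds; i++)
			handle_fd(events[i].data.fd);
	}
}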
These functions are described in their manual pages :
Patches that implement the system call interface are available here. A library to access the new ( 2.5.45 ) epoll is available here :
A simple pipe-based epoll performance tester :
User space libraries that support epoll :
During the epoll test I quickly made a patch for thttpd :
Conclusion
These numbers show that the new /dev/epoll ( and sys_epoll ) improves the efficiency of the server both from a response rate point of view and from a CPU utilization point of view ( better CPU/load factor ). The response rate of the new /dev/epoll is completely independent of the number of connections, while the standard poll() and the old /dev/poll seem to suffer under the load. The standard deviation is also very low compared to poll() and the old /dev/poll, which makes me think that 1) there's more power to be extracted and 2) the method has a predictable response under high loads. Both the RT signals and RT one-sig implementations behave pretty flat over the dead connection load, with the one-sig version about 10-12% faster than the simple RT signals version. The RT signals implementations ( even if the one-sig one less so ) seem to suffer in the burst test that simulates a real internet load where a huge number of connections are alive. This is because of the limit of the RT signals queue which, even with the one-sig patch applied, fills up during the test.