Authors: CERN AFS Team, October 2012
Unpredictable, high access latency to files in home directories and workspaces has become a pressing issue for AFS users at CERN. It is caused by erratic overloads of file-servers and occasionally affects almost the entire user community in the laboratory. Access-time variations arise because AFS file-server resources are shared by hundreds (and sometimes thousands) of clients which access files interactively (e.g. ls, an editor, ...), from UNIX scripts and programs (e.g. find, grep -R, make, ...) and via large arrays of batch jobs.
The problem became more severe in 2012 due to a change of the AFS service architecture and size: larger servers mean more sharing of common resources at the level of a file-server, and hence less isolation between users. The service size in terms of provided data storage has doubled in the last year, but the scaling of the IO throughput (ops/s) is complex and difficult to assess.
To address the specific problem of access latency, in September 2012 the AFS team organized several targeted brainstorming sessions which included analysis of the OpenAFS source code, debugging of the OpenAFS network protocol, exploration of runtime traces of the file-server, and testing and experimenting with different server configurations.
Below is a short summary of the findings.
When access latency to files increases, the bare-metal hardware limits of the file servers are typically not reached (the hardware specifications below refer to the newly purchased afs2xx servers, which will become the new-generation fabric for the AFS service):
- disks are relatively idle and not maxed out (the maximum streaming performance of the raw disks is 150MB/s; maximum random-IO performance is around 50MB/s);
- network throughput of the file-server flat-tops in the range of 50-250MB/s (the link maximum being 10Gb/s);
- CPU utilization flat-tops at around 130% (maxing out one of the 4 cores available in the system).
The throughput is not limited by a network bottleneck between the clients and the server, as concluded from an iperf test using 30 random lxbatch nodes (maximum observed TCP throughput: 9Gb/s).
Hence the working hypothesis was that the limit comes from the software implementation of the OpenAFS file-server.
The basic characteristics of the OpenAFS file-server are as follows:
- the OpenAFS file-server is built upon the custom rx protocol based on UDP;
- the rx protocol (developed in the 1980s-1990s) implements TCP-like functionality and relies heavily on packet acknowledgment;
- the rx server uses one UDP socket for all client connections and calls;
- the rx server has one listener thread and a fixed-size pool of worker threads (246 out of 256 total file-server threads, 10 threads being reserved for special purposes);
- the listener thread multiplexes rx packets, analyzes the rx header and dispatches tasks to the appropriate worker threads;
- worker threads handle client requests and serve data;
- rx server optimizations include a hot-threads feature which allows the listener thread to handle a new client connection request itself, the listener role being passed to another thread from the worker pool; this feature is enabled by default for the file-servers;
- the OpenAFS client opens one rx connection per user;
- there may be up to 4 simultaneous calls per connection.
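The single-socket listener/worker architecture described above can be illustrated with a toy model. This is our own simplified Python sketch (names and structure are ours, not OpenAFS identifiers): one UDP socket shared by all clients, one listener thread demultiplexing packets, and a fixed pool of worker threads fed through a queue.

```python
import queue
import socket
import threading

WORKER_POOL_SIZE = 4  # the real file-server uses a pool of 246 worker threads

def serve_one_request(sock, data, addr):
    # A worker "handles the call"; here it simply acknowledges the payload.
    sock.sendto(b"ack:" + data, addr)

def worker(sock, tasks):
    while True:
        data, addr = tasks.get()
        if data is None:          # shutdown sentinel
            break
        serve_one_request(sock, data, addr)

def listener(sock, tasks):
    # A single thread reads every packet from the shared socket and
    # dispatches it to the worker pool (cf. the rx listener thread).
    while True:
        data, addr = sock.recvfrom(65535)
        if data == b"quit":       # toy shutdown signal, not part of rx
            break
        tasks.put((data, addr))

def run_server():
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(("127.0.0.1", 0))
    tasks = queue.Queue()
    workers = [threading.Thread(target=worker, args=(sock, tasks))
               for _ in range(WORKER_POOL_SIZE)]
    for w in workers:
        w.start()
    lt = threading.Thread(target=listener, args=(sock, tasks))
    lt.start()
    return sock, tasks, workers, lt
```

The model makes the bottleneck discussed later easy to see: every packet, including every acknowledgment, must pass through the single listener thread before any worker is woken up.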
Initial clues were given by profiling information (oprofile) and stack traces (gcore, gdb): snapshots of a production file-server were taken at times of increased access latency. The symptoms differ between the snapshots, indicating multiple possible root causes:
- worker thread shortage: all worker threads are busy and requests must be queued (rescheduled);
- lock contention: idle worker threads waiting on internal locks which are not signaled by the listener thread, with most time spent in the pthread library;
- listener thread becoming CPU-bound and using 100% CPU;
- listener thread disappearing sometimes with the hot-thread feature enabled.
The problem may be disentangled into two largely independent root causes.
Thread shortage was a known problem in the past and the OpenAFS code was patched at CERN to address it:
- by increasing the thread pool to 256 threads (from 128 in vanilla OpenAFS 1.4);
- by implementing thread throttling (rescheduling) to prevent one client from taking over all available worker threads.
The rescheduling policy is a trade-off between top performance for a few clients and fair access times for all clients. The latter patch was therefore further refined in early 2012 in favour of fair access; the algorithm currently used in production allows a single user to take:
- up to n2=60 threads per one accessed volume under normal load,
- up to n1=30 threads per one accessed volume under high load (high-load condition occurs when there are (other) calls waiting for a thread).
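The decision logic above can be sketched in a few lines. This is an illustrative Python sketch of the policy as described, not the actual OpenAFS/CERN patch code; the function and variable names are our own.

```python
# Production thresholds, per user per accessed volume (see text above).
N2_NORMAL_LOAD = 60   # max worker threads under normal load
N1_HIGH_LOAD = 30     # max worker threads under high load

def may_take_thread(threads_held_for_volume, calls_waiting_for_thread):
    """Decide whether a user's call on a given volume may occupy another
    worker thread, or must instead be rescheduled (queued)."""
    # High-load condition: other calls are already waiting for a thread.
    limit = N1_HIGH_LOAD if calls_waiting_for_thread > 0 else N2_NORMAL_LOAD
    return threads_held_for_volume < limit
```

For example, a user already holding 30 threads on one volume is throttled as soon as any other call is queued, but may grow to 60 threads on an otherwise idle server.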
This choice of thresholds also stems from the performance limitations of the IO subsystem: with a larger number of threads, the top performance becomes limited by disk performance, because concurrent data access typically occurs on one volume, which is stored entirely on one disk.
However, thread shortage is responsible for only a fraction of the access incidents. High access times frequently occur in spite of worker threads being available.
Synthetic stress testing made it possible to create file-server overload in a reproducible and controlled environment, using lxbatch test jobs and the rxperf utility. A parameter sweep was performed to test different scenarios:
- OpenAFS server 1.4-14 CERN-patched versus vanilla OpenAFS 1.6.1;
- enabling/disabling internal rx-features (hot threads);
- varying other runtime parameters such as network buffer sizes, jumbo-frames, etc.
The access latency becomes unacceptably high even with a modest number of clients (30 lxbatch jobs) concurrently reading the same large file. The test was constructed such that the entire working dataset fits in the server’s memory and the disk I/O subsystem remains idle (this rules out other hardware bottlenecks). The access latency increases despite many worker threads being available (and idle). The increase is particularly pronounced for writing files, while reading is less affected (but still unacceptably slow).
Disabling the hot-thread feature does not impact the performance (but makes the testing and analysis easier).
Synthetic testing of the rx protocol using the rxperf utility revealed that the listener thread becomes easily overloaded handling acknowledgment packets (as required by the rx protocol). The test also confirmed the general bandwidth limitation of the rx protocol at 1-2Gb/s (the actual value depends on network settings such as jumbo frames). This general rx bandwidth limitation is, however, not directly related to the large access latency, which is the inability of the file-server to process client requests in spite of an excess of hardware resources.
The unusually high UDP error count (netstat -s or /proc/net/snmp) observed in these tests leads to the following scenario:
- under high load, the listener thread receives packets at a high rate and is required to send acknowledgments at a similar rate;
- the listener thread is slowed down because packets are dropped by the UDP layer when the socket buffer is too small;
- the worker threads are not woken up by the listener thread at the rate of incoming packets, because the packets are not properly acknowledged (this appears as worker-thread lock contention in the profiling data);
- the UDP socket buffer overflows, the fraction of dropped UDP packets increases, and the slowdown effect is amplified.
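The UDP error counters referenced above can be read directly from /proc/net/snmp (the same counters that netstat -s reports). A minimal parser sketch, assuming the standard two-line "Udp:" header/value layout used by Linux:

```python
def parse_udp_counters(snmp_text):
    """Return the Udp counters from /proc/net/snmp as a dict.

    The file contains one "Udp:" line with counter names and a second
    "Udp:" line with the corresponding values.
    """
    udp_lines = [l for l in snmp_text.splitlines() if l.startswith("Udp:")]
    names = udp_lines[0].split()[1:]              # header: counter names
    values = [int(v) for v in udp_lines[1].split()[1:]]
    return dict(zip(names, values))

# Typical use on a live system (counter names such as InErrors are
# standard; RcvbufErrors is present on newer kernels):
#   counters = parse_udp_counters(open("/proc/net/snmp").read())
#   print(counters["InErrors"])
```

A steadily growing InErrors (or RcvbufErrors) count while the file-server is loaded is the fingerprint of the drop-and-amplify loop described above.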
Increasing the UDP socket buffer size from 2MB to 8MB (via sysctl and file-server options) improves the performance and reduces the average waiting time for serving requests from >10s to ~1s for 250 concurrent clients in the rxperf test. With the increased UDP socket buffer, OpenAFS 1.6 boosts the latency performance by an additional factor of 3.
Here we show the rxperf results for OpenAFS 1.4.14 (CERN), the latest 1.6 stable branch, and the latest OpenAFS unstable master branch in git. Latency results without the increased buffer are not shown since they are generally >20s and highly sporadic. With a 16MB buffer applied, we observe:
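The mechanics of the buffer increase can be demonstrated with a plain UDP socket. This is an illustrative Python sketch, not file-server code: the kernel silently caps SO_RCVBUF requests at net.core.rmem_max, so the granted size must be read back and checked (the file-server side is configured with its UDP buffer option, e.g. -udpsize, together with the sysctl limit).

```python
import socket

REQUESTED = 8 * 1024 * 1024  # 8MB, as in the test described above

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, REQUESTED)
granted = sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)
# On Linux, getsockopt reports double the requested value to account for
# bookkeeping overhead, capped by net.core.rmem_max. If granted is far
# below REQUESTED, the sysctl limit must be raised first, e.g.:
#   sysctl -w net.core.rmem_max=16777216
print("granted receive buffer:", granted, "bytes")
```

On an untuned host the printed value is typically a few hundred kB, which makes the silent capping easy to miss; this is exactly why verifying the granted size matters.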
client: master; server: 1.4.14 (CERN)
  RECV: threads 1, times 1, bytes 64000: 1088 msec [0.45 Mbps]
  SEND: threads 1, times 1, bytes 64000: 896 msec [0.54 Mbps]
  RPC: threads 1, times 1, write bytes 64000, read bytes 64000: 1714 msec [4.66 Mbps]

client: master; server: 1.6 stable
  RECV: threads 1, times 1, bytes 64000: 336 msec [1.45 Mbps]
  SEND: threads 1, times 1, bytes 64000: 248 msec [1.96 Mbps]
  RPC: threads 1, times 1, write bytes 64000, read bytes 64000: 453 msec [17.66 Mbps]

client: master; server: master
  RECV: threads 1, times 1, bytes 64000: 303 msec [1.61 Mbps]
  SEND: threads 1, times 1, bytes 64000: 262 msec [1.86 Mbps]
  RPC: threads 1, times 1, write bytes 64000, read bytes 64000: 475 msec [16.84 Mbps]
It is confirmed that with 250 concurrent reading clients, the latency is ~1s for the 1.4 server and ~300ms for the 1.6 and new servers.
It is also interesting to note that the throughput seen by the 250 reading processes varies with the OpenAFS version:
1.4-cern server:   250*RECV: threads 1, times 1, bytes 20000000: 65393 msec [2.33 Mbps]  (5000000000 B / 65.393 s ~= 0.612 Gbps)
1.6-stable server: 250*RECV: threads 1, times 1, bytes 20000000: 20455 msec [7.46 Mbps]  (5000000000 B / 20.455 s ~= 1.96 Gbps)
1.6-master server: 250*RECV: threads 1, times 1, bytes 20000000: 19397 msec [7.87 Mbps]  (5000000000 B / 19.397 s ~= 2.06 Gbps)
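The aggregate throughput figures quoted above can be cross-checked from the raw numbers (250 clients, 20,000,000 bytes each, divided by the elapsed wall time):

```python
def aggregate_gbps(clients, bytes_per_client, msec):
    """Aggregate throughput in Gbps for `clients` parallel transfers of
    `bytes_per_client` bytes completing in `msec` milliseconds."""
    total_bits = clients * bytes_per_client * 8
    return total_bits / (msec / 1000.0) / 1e9

print(round(aggregate_gbps(250, 20_000_000, 65393), 3))  # 1.4-cern   -> 0.612
print(round(aggregate_gbps(250, 20_000_000, 20455), 2))  # 1.6-stable -> 1.96
print(round(aggregate_gbps(250, 20_000_000, 19397), 2))  # 1.6-master -> 2.06
```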
The 1.6 and newer servers sustain more than 3x the throughput during this test.
The increased 16MB UDP buffer size provides good protection for the file-server as long as the worker thread pool is not entirely exhausted. This should eliminate the most common access latency incidents. The next scalability limit is related to thread shortage.
As demonstrated in the scalability testing (both for the CERN-patched AFS 1.4 and OpenAFS 1.6.1), if all threads are busy at all times the access times become unacceptably high (15-30s or more). This condition occurs when the total number of test clients exceeds 250 and 9 different volumes are accessed at the same time. In a production environment this corresponds to 9 different users each submitting at least 30 batch jobs that constantly hammer the file-server. However, in a more common mixed usage with 2 or 3 heavy batch users, there would still remain 150-180 threads free for other users. Even in an unlikely scenario of 4 concurrent calls issued by each client, these 150-180 threads should be sufficient to handle some 40 users performing heavy-duty operations at the unix prompt (such as multiple find or recursive grep) before thread starvation occurs.
Halving the rescheduling threshold (n2=15) makes it possible to support up to 700 clients heavily hammering 12 different volumes with only a moderate server slowdown (up to 2 seconds access latency) and with high-latency spikes reduced (but not entirely eliminated). Such an “extreme” load rarely occurs in the CERN AFS production system.
The access latency may be radically improved by increasing the UDP buffer size of the file-server, as demonstrated during synthetic testing of the file-server and of the rxperf utility. This eliminates one of the main bottlenecks in the current production system based on OpenAFS 1.4.14. Deployment of OpenAFS 1.6.1 should further decrease the average access latency by a factor of 3. The currently deployed thread-rescheduling algorithm should provide enough scaling under typical circumstances. If needed, the scaling may be further improved by changing the hard-coded rescheduling thresholds. The performance data gathered during this study will provide useful guidance for the further evolution of the AFS service at CERN.
The ultimate impact of this solution will be assessed after its roll-out in the production system.