Pontus Networks recently helped an investment bank benchmark and improve the latency of an FX pricing system without changing a single line of code or any of the target hardware. We reduced the pricing system’s peak latency, measured from receiving a raw market data tick to sending a FIX market data snapshot to 25 clients, by over 20 times by combining the following technologies:
- PontusVision ThreadManager™ – to manage the application threads,
- Red Hat’s real-time operating system – MRG,
- Azul Systems’ Zing Java virtual machine.
Predictability trumps lowest possible average latency
Both buy-side and sell-side organizations need predictable behaviour in their systems. If a sell-side market maker has a predictably slow system, prices can be adjusted by applying the appropriate level of spread to prevent the bank from being picked off in the market with the wrong prices. However, if a pricing system has a lower average latency but higher peaks than the predictably slow system, the sell-side organization can easily send inaccurate prices to the market and be picked off by a nimble buy-side organization. Thus, buy-side organizations should also strive for predictability in their systems to ensure that they don’t miss opportunities in the market.
Sell-side market makers have to send prices to their customers within 1ms of a new price to retain market share. Unlike the fast-paced world of equities, some of the primary FX markets tick at relatively low frequencies, with updates arriving roughly every 100ms or more. Within that period, the first 5ms is where the majority (around 90%) of the trades occur. Some of the markets will also batch and randomize orders in 1ms buckets: any orders received in the 1st ms are randomized, followed by any orders received in the 2nd ms, and so on.
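The batching scheme described above can be sketched as follows. This is an illustrative assumption of how such a venue might behave, not any venue’s actual matching logic; the function name and data shapes are made up for the example:

```python
import random
from collections import defaultdict

def bucket_and_randomize(orders, bucket_ms=1):
    """Group orders into arrival-time buckets and shuffle within each.

    `orders` is a list of (arrival_time_ms, order_id) tuples. Buckets are
    processed in arrival order, but ordering *within* a bucket is
    randomized, mimicking the 1ms batch-and-randomize scheme some FX
    venues apply to incoming orders.
    """
    buckets = defaultdict(list)
    for arrival_ms, order_id in orders:
        buckets[int(arrival_ms // bucket_ms)].append(order_id)
    sequence = []
    for bucket in sorted(buckets):
        batch = buckets[bucket][:]
        random.shuffle(batch)  # orders in the same ms lose queue priority
        sequence.extend(batch)
    return sequence
```

The practical consequence is the one the text draws: being anywhere inside the first 1ms bucket is as good as being first in it, but missing the bucket entirely pushes an order behind every order in it.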
Any serious market maker has to be able to reliably send out quotes to clients, ideally within the first 1ms; otherwise, they miss a large number of orders due to wide spreads or, worse, risk having stale prices and losing money.
The main goals of the exercise were to test and improve end-to-end latency on an FX trading system without changing a single line of code. Latency was measured by matching raw market data ticks with the FIX market data snapshots sent to 25 FIX clients, using a network switch with a high-precision clock.
The following figure illustrates the six test cases performed:
Because the applications could not be modified, only the operating system, the Java virtual machine (JVM), and the allocation of threads to cores (thread management) were changed. The operating systems (OSs) tested were Red Hat’s RHEL 6.4 Linux and Red Hat’s MRG 3.1. The JVMs tested were Oracle’s HotSpot 1.7.40 and Azul Systems’ Zing version 5.7.2. In addition to the OS and JVM changes, the PontusVision ThreadManager™ module was used to carefully pin application threads to specific CPU cores.
Because these tests were performed on the actual pricing system, at the client’s request we have normalized the results between 0 and 100 units of latency to mask the real figures (e.g. 100 could represent 1000ms, 100ms, or 10ms). The following graph shows the normalized results across 3 runs of each test case. The application under test was identical in all test cases, and so was the input data:
The graph shows percentile distributions of the test case results. As mentioned above, the Y axis shows latency in normalized units (e.g. 100 can mean any arbitrary multiple, such as 100ms, 10ms, or 1000ms), whereas the X axis shows the percentiles. The X axis also carries a time period representing the annualized time frame. Quite often, people dismiss the higher percentiles as very unlikely events; however, they still represent a significant amount of time over a whole year. As an example, extrapolating the test results over a whole year, the system would be expected to exhibit the latency figures to the right of the 99.99th percentile for 48 minutes and 40 seconds per year. Since latency spikes tend to occur around busy market periods, that represents quite a long period of exposure to (or loss of opportunity from) the markets.
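The annualization arithmetic is simple to sketch. The calculation below assumes a naive 24x365 year; the figure quoted in the text is somewhat lower, presumably because it reflects the venue’s active trading window rather than a full calendar year:

```python
def annual_exposure_seconds(percentile, active_seconds_per_year):
    """Seconds per year spent at latencies beyond the given percentile."""
    return (1.0 - percentile / 100.0) * active_seconds_per_year

# Under the naive 24x365 assumption, the tail beyond the 99.99th
# percentile is visited for roughly 3150 seconds (about 52 minutes)
# per year -- small as a fraction, long as wall-clock exposure.
tail_seconds = annual_exposure_seconds(99.99, 365 * 24 * 3600)
```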
The test results clearly show that the most predictable system behavior happened in test PV-006, with peak latencies over 20 times lower than in test case PV-001. Zing was the clear winner among the JVMs, and MRG was the most predictable OS; however, that predictability came at a cost of around 20% higher average latency than the plain RHEL 6.4 kernel. Lastly, the PontusVision ThreadManager™ also significantly helped improve performance in all cases.
The following graph highlights the large impact (a 270% difference in peak latency) that thread management made on the system when running standard RHEL and the HotSpot JVM:
Similarly, even with MRG and Zing, thread management still made a significant impact, reducing peak latency by 56%:
Why Manage Threads
Most non-real-time operating systems (OSs) are not very good at allocating threads for latency-sensitive applications. Modern OSs are typically configured to balance load across the available CPUs rather than to reduce application latency. For latency-sensitive applications, these strategies cause latency to increase significantly.
The increase in latency happens because of non-uniform memory access (NUMA). NUMA servers have memory modules local to each CPU. Whilst this significantly reduces the latency of access to memory local to a CPU, it makes it much more expensive for remote CPUs to access that same memory. Take as an example a simple multi-threaded application with reader, parser, and writer threads that receives data from a network card and sends data back out again on the same card. In a four-CPU NUMA machine where the OS simply distributes threads across CPUs, the number of times the data is moved can be several times greater than in a NUMA machine that has been properly tuned:
The diagram above illustrates the tug-of-war between constraining resources enough to reduce latency and not constraining them so much that the application is starved of CPU. The red rectangles on the left-hand side represent the various threads that may run on the system, including internal OS threads, in an un-tuned configuration. The green rectangles on the right-hand side represent the same application configured to use only a single CPU. The application on the right-hand side will have significantly lower latency than the one on the left-hand side simply because the data has to be copied fewer times. However, the reduction in latency comes at a cost: instead of having 4 CPUs to use, the application configured as on the right-hand side can only use one. This leads to the tug-of-war between CPU hops and CPU availability, and it is where PontusVision ThreadManager™ helped tremendously in reducing latency.
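A minimal sketch of this kind of pinning, using the Linux affinity syscalls exposed by Python’s standard library. The CPU numbers are arbitrary examples, and this is not how PontusVision ThreadManager™ itself is implemented:

```python
import os

def pin_to_cpus(cpus, pid=0):
    """Restrict a process (pid 0 = the caller) to a set of logical CPUs.

    Keeping an application's threads, its NIC interrupts, and its memory
    on one NUMA node avoids cross-socket copies of the data. Returns the
    resulting affinity mask, or None where the Linux-only syscall is
    unavailable or the request is not permitted.
    """
    if not hasattr(os, "sched_setaffinity"):
        return None
    try:
        os.sched_setaffinity(pid, set(cpus))
    except OSError:  # e.g. CPU not present in the allowed set
        return None
    return os.sched_getaffinity(pid)
```

In the reader/parser/writer example above, each thread would be pinned to a core on the socket whose PCIe lanes host the network card, so the packet data never crosses a QPI link.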
How PontusVision Thread Manager Helped
PontusVision ThreadManager™ reduced the time needed to find a good thread-allocation strategy from weeks of experiments to seconds of simulation. Traditionally, thread management is done empirically, with engineers observing system behavior and systematically changing the thread allocations. Because of the large number of permutations in the system, this usually takes several weeks of testing. The pricing application tested was significantly more complex than the small red/green squares example above, with 12 software components, around 200 threads, and 64 logical cores (or, more accurately, 32 cores with 2 hyper-threads each).
More solutions than the total number of atoms in the universe
At approximately 64 to the 200th power (the number of cores raised to the power of the number of threads), there are more candidate thread allocations for this problem than the total number of atoms in the observable universe.
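The arithmetic is easy to verify (10^80 is the commonly cited order-of-magnitude estimate for the atom count):

```python
# Each of ~200 threads can be assigned independently to any of 64
# logical cores, giving 64 ** 200 candidate allocations.
search_space = 64 ** 200
atoms_in_universe = 10 ** 80  # commonly cited order-of-magnitude estimate

# 64 ** 200 has 362 decimal digits (roughly 10 ** 361), dwarfing 10 ** 80.
digit_count = len(str(search_space))
```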
Obviously, engineers never exhaustively try every single permutation, but even with educated guesses, empirically tuning a system usually takes several weeks, if not months, of testing. What is worse, any change to the applications or environment can render the thread management strategy useless, so the strategy has to be continually updated after every change in the environment. PontusVision ThreadManager™ not only came up with the strategies quickly, but also provided all the controls to ensure that they were accurately executed and that every change to the environment was traceable.
How Zing and MRG Helped
In pricing systems, it’s often better to have predictability than the best performance. A close parallel to illustrate this is a commuter with two options for getting to work:
- The first option usually takes 20 minutes but is highly unpredictable, with a high chance of getting stuck in a tunnel for up to 6 hours at least two or three times a year.
- The second option usually takes 24 minutes but has had 100% punctuality, never faltering in 30 years of service.
Given the choices above, few commuters would choose the first option. Similarly, most market makers will prefer a pricing system with slightly slower but predictable performance (e.g. using the parameters from test PV-006) over the best average performance available (e.g. using the parameters from test PV-002) accompanied by large, unpredictable spikes of several times normal operation.
After coming up with the thread allocation strategy, we also had to find a deterministic OS and JVM to ensure that the application had predictable latency. As the above results show, Zing was far more effective at providing deterministic JVM behavior than Oracle’s HotSpot JVM. Because the trading system was largely Java-based, the Zing JVM’s deterministic garbage collection times significantly reduced outliers. However, even a superior JVM was not enough to eliminate outliers when running the application on a normal, non-deterministic operating system. That’s where Red Hat’s MRG was largely complementary to Zing.
MRG makes the system interruptions caused by various housekeeping OS tasks predictable. In a non-real-time operating system, interrupts can pause applications for large and indeterminate amounts of time; MRG ensures that any such pauses last no longer than a predictable, bounded amount of time. Note from the results that, on average, the trading system ran slightly faster on the normal OS than on the real-time OS; however, the real-time OS produced almost no outliers in the higher percentiles. As a market maker that has to react to price changes in the first 5ms to be profitable, which system would you choose?
System Under Test
The following diagram shows the physical topology of the hardware used in these tests (note that there is better equipment out in the market, but this is all the customer had available in their lab):
The Arista 7150 S64 is a robust, tried-and-tested 10/40Gb switch with built-in time-stamping capabilities. Though it is no longer the fastest switch on the market, its ease of use convinced us to use it over competitors such as Cisco’s newer 3548, which, on paper, can offer around 100 nanoseconds less latency at the expense of fewer ports.
The IBM 3750 has four Sandy Bridge CPUs interconnected by QPI links in a square topology, with no diagonal connections across the CPUs. The server exposes two separate PCIe chipsets externally, which gives architects more freedom when doing thread management.
EFX-Colo1 was the main system under test. This server hosted all the trading system applications, which were not simulations, but rather the real-world trading system apps. This server had a number of different 10Gb Solarflare network cards connected to the Arista switch. Some of the NICs were connected to CPU 0’s PCIe, and some were connected to CPU 1’s PCIe. CPU 0’s NICs were used for FIX engine connections to 25 simulated clients (which ran on EFX-Colo2), and CPU 1’s NICs were connected to the PontusVision Netview market data replayer (which ran on EFX-Colo3).
EFX-Colo2 ran only the 25 simulated FIX clients, which were simulated using the PontusVision Netview FIX client. Each simulated client subscribed to the same 15 currency pairs and received FIX ‘market data snapshot’ messages.
EFX-Colo3 ran the market data replayer, as well as an application used to monitor latency in the environment. To ensure reproducibility, the same pre-recorded market data was replayed using the PontusVision Netview player. The market data closely emulated data from one of the main FX venues for 30 minutes before and after a non-farm payroll announcement. The monitoring application used a Solarflare NIC connected to a mirrored port on EFX-Switch 1. The switch was configured to timestamp all the replayed market data and the ‘market data snapshot’ messages sent to the simulated FIX clients; all the timestamped data was then sent to the mirrored port where the monitoring application was connected. The monitoring application then matched the raw market data ticks with the market data snapshots received by the clients.
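The matching step can be sketched as follows. The record shapes and the idea of keying on a symbol-plus-sequence identifier are illustrative assumptions; the real monitoring application works from switch timestamps on the mirrored traffic:

```python
def match_latencies(ticks, snapshots):
    """Match raw ticks to client snapshots and compute per-pair latency.

    `ticks` maps a tick id (e.g. symbol plus sequence number) to its
    switch timestamp; `snapshots` is a list of (tick_id, client,
    timestamp) records for the snapshots fanned out to the FIX clients.
    Returns (tick_id, client, latency) triples; snapshots with no
    matching tick are ignored.
    """
    latencies = []
    for tick_id, client, snap_ts in snapshots:
        if tick_id in ticks:
            latencies.append((tick_id, client, snap_ts - ticks[tick_id]))
    return latencies
```

With 25 clients subscribed to the same currency pairs, each raw tick yields up to 25 latency samples, which is what makes the percentile distributions above possible.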
The three technologies above (thread management, a real-time OS, and a real-time JVM) can make a significant impact on system predictability and, in some cases (e.g. the thread management) also significantly reduce average latency. Thread management can be done manually, but using a tool like PontusVision ThreadManager™ reduces the amount of time and effort required to achieve a good result from weeks to minutes.