Thread Manager vs. numactl / taskset / migratepages / hwloc
Why Manage Threads
Most non-real-time operating systems (OSs) are not very good at allocating threads for latency-sensitive applications. Modern OSs are typically configured to balance the load across various CPUs rather than reduce the latency of the application. When dealing with latency-sensitive applications, these strategies cause the latency to increase significantly.
The increase in latency happens because of non-uniform memory access (NUMA). NUMA servers have memory modules local to each CPU. Whilst this significantly reduces the latency to memory local to a CPU, it makes it much more expensive for remote CPUs to access that same memory. Take as an example a simple multi-threaded application with reader, parser, and writer threads that receives data from a network card and sends data back out again on the same card. In a four-CPU NUMA machine where the OS just distributes threads across CPUs, the number of times the data is moved can be several times greater than in a NUMA machine that has been properly tuned:
The diagram above illustrates the tug-of-war between constraining resources to reduce latency and constraining them so much that latency suffers. The red rectangles on the left-hand side represent the various threads that may run on an untuned system, including internal OS threads. The green rectangles on the right-hand side represent the same application configured to use only a single CPU. The application on the right-hand side will have significantly lower latency than on the left-hand side simply because the data has to be copied fewer times. However, the reduction in latency comes at a cost: instead of having four CPUs to use, the application configured as on the right-hand side can use only one. This leads to the tug-of-war between CPU hops and CPU availability, and it is where Pontus Vision’s ThreadManager™ helps tremendously.
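The hop-counting argument can be made concrete with a small sketch. Everything in it is an illustrative assumption rather than ThreadManager™ output: the network card is assumed to be attached to CPU 0, and the two placements are hypothetical examples of an untuned and a tuned machine.

```python
# Hypothetical illustration: count how many times data crosses a CPU boundary
# as it flows network card -> reader -> parser -> writer -> network card.

NIC_CPU = 0  # assumption: the network card is attached to CPU 0


def count_cpu_hops(path):
    """Return how many adjacent stages in `path` sit on different CPUs."""
    return sum(1 for a, b in zip(path, path[1:]) if a != b)


def pipeline_path(placement):
    """Build the list of CPUs the data visits for a given thread placement."""
    return [
        NIC_CPU,
        placement["reader"],
        placement["parser"],
        placement["writer"],
        NIC_CPU,
    ]


# Untuned: the OS happens to spread the three threads over three other CPUs.
untuned = {"reader": 1, "parser": 2, "writer": 3}
# Tuned: all three threads are pinned to the CPU that owns the network card.
tuned = {"reader": 0, "parser": 0, "writer": 0}

print(count_cpu_hops(pipeline_path(untuned)))  # 4
print(count_cpu_hops(pipeline_path(tuned)))    # 0
```

The tuned placement removes every cross-CPU hop, but it also leaves the application with a single CPU, which is exactly the trade-off described above.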
ThreadManager™ uses a patent-pending technology to optimize the way software threads run on hardware cores. It tries to reduce the number of context switches (which add latency) and to keep applications that need to communicate with each other as close together as possible. The results of the simulation can then become a runnable script that users can use to start their applications, a script that modifies the pinning strategies of running applications, or even RPM files (on Linux) that automatically configure the servers to run the threads in an optimal way.
hwloc, numactl, taskset, migratepages, start & Thread Manager
One question that often arises is: why do I need ThreadManager™ if I already have all these OS commands to pin applications for me? The answer is simple: you need both because they are complementary, and not mutually exclusive.
These commands do one of three things. Some show the machine layout (e.g. hwloc on Linux will show the level 1, 2, and 3 cache relationships). Some pin the memory of a process (e.g. numactl for a starting process or migratepages for a running one on Linux, or start /node on Windows). Others pin the execution of threads to a particular core (e.g. taskset on Linux, or start /affinity on Windows). However, none of them uses knowledge of how the applications are related to each other to make these choices. That is where the Pontus Vision ThreadManager™ model helps to quickly find an optimal configuration. In fact, the simulation results will often be saved in templates that use these very same commands to implement the pinning strategies that the model created.
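As a rough sketch of what such a template might expand to on Linux, the helpers below build taskset and numactl command lines for a chosen placement. The helper names, the application path `./feed_handler`, and the core and node numbers are hypothetical; only the `taskset -c` option and the `numactl --cpunodebind` / `--membind` options are standard.

```python
import shlex


def taskset_command(cores, app_argv):
    """Command line that starts `app_argv` restricted to the given cores."""
    core_list = ",".join(str(c) for c in cores)
    return ["taskset", "-c", core_list, *app_argv]


def numactl_command(node, app_argv):
    """Command line that binds both CPU and memory of `app_argv` to one NUMA node."""
    return ["numactl", f"--cpunodebind={node}", f"--membind={node}", *app_argv]


# Hypothetical application and placement, for illustration only.
app = ["./feed_handler"]

print(shlex.join(taskset_command([2, 3], app)))
# taskset -c 2,3 ./feed_handler
print(shlex.join(numactl_command(0, app)))
# numactl --cpunodebind=0 --membind=0 ./feed_handler
```

The commands only apply a placement. Choosing which cores and nodes to pass in is the part that needs knowledge of how the applications relate to each other.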
The Pontus Vision ThreadManager™ GUI is what users use to get optimized thread affinity strategies for a given server. The GUI is composed of three main windows, described in the following sections:
Thread Management Window
The Thread Management window allows users to enter information for the thread manager model and to view the model’s results. The inputs are the target hardware platform and the software that will run on that hardware. These are the main inputs to the thread manager simulation, which is also shown in this window.
Here are the main parts:
The toolbar at the top has several buttons that control the main functionality of the window:
- Save Button: saves a simulation into the PontusVision configuration tree.
- Save As Button: saves an existing simulation under a different name.
- Open Button: opens a previously saved simulation from the PontusVision configuration tree.
- Clear Button: clears out the page, erasing any unsaved simulation details.
- Run Simulation: takes the currently configured hardware and software details, and runs a simulation for (by default) 10 seconds.
- Simulation Time Text Box: amount of time (in seconds) that the simulation will run for (normally 10 seconds is sufficient for quite good results).
- Simulation Algorithm: algorithm used by the simulation; you can experiment with various algorithms, but ‘Fit First’ usually produces a very good result.
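The internals of the ‘Fit First’ option are not described here. Purely as an assumption about what a heuristic of that name could resemble, the sketch below shows the classic first-fit strategy: each thread goes on the first core that still has enough spare capacity. The function name, the load figures, and the capacity units are all hypothetical.

```python
def first_fit(thread_loads, core_capacity, num_cores):
    """Place each thread on the first core with enough remaining capacity.

    thread_loads: dict mapping thread name -> load (same units as core_capacity).
    Returns a dict mapping thread name -> core index.
    """
    free = [core_capacity] * num_cores
    placement = {}
    for name, load in thread_loads.items():
        for core in range(num_cores):
            if load <= free[core]:
                free[core] -= load
                placement[name] = core
                break
        else:
            # No core had room for this thread.
            raise ValueError(f"no core can fit thread {name!r}")
    return placement


# Hypothetical loads, expressed as a percentage of one core.
loads = {"reader": 60, "parser": 50, "writer": 30}
print(first_fit(loads, core_capacity=100, num_cores=2))
# {'reader': 0, 'parser': 1, 'writer': 0}
```

First-fit is fast and order-dependent, which is one reason it is worth comparing against the other algorithms offered in the drop-down.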
Hardware and Software Trees
The hardware tree is used to drag and drop the hardware onto the canvas.
The software tree is used to drag and drop the software components.
Right side panels
The right-side panel has three main accordion boxes.
The top part of the hardware accordion box has a canvas where users drag and drop servers to set the model’s target hardware platform.
The bottom part of the hardware accordion box has a properties grid. When users click on the properties grid, they see the properties associated with the hardware:
- id – The name of the server
- contextSwitchCosts – a number from 0 to 100 used by the model to penalize context switches
- physicalCpuList – a list of the NUMA nodes on which each core is located.
- latencyMatrix – a matrix showing the costs of moving data between various cores in the hardware simulation
- junkCores – a list of one or more cores that are sacrificed to run background OS tasks and any non-latency-critical applications.
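To show how these properties could fit together, here is a minimal sketch of a hardware description and two simple queries on it. The field names come from the list above, but their data shapes are assumptions: physicalCpuList is taken to hold the NUMA node of each core by index, and latencyMatrix[i][j] the cost of moving data from core i to core j. The class, its methods, and all values are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class HardwareProperties:
    id: str                  # name of the server
    contextSwitchCosts: int  # 0 to 100 penalty applied to context switches
    physicalCpuList: list    # assumed shape: NUMA node of each core, by core index
    latencyMatrix: list      # assumed shape: latencyMatrix[i][j] = cost core i -> core j
    junkCores: list = field(default_factory=list)

    def usable_cores(self):
        """Cores left for latency-critical threads once junk cores are excluded."""
        return [c for c in range(len(self.physicalCpuList)) if c not in self.junkCores]

    def path_cost(self, cores):
        """Total cost of moving data through the given sequence of cores."""
        return sum(self.latencyMatrix[a][b] for a, b in zip(cores, cores[1:]))


# Hypothetical two-node, four-core server; core 0 is given up to the OS.
server = HardwareProperties(
    id="server-1",
    contextSwitchCosts=50,
    physicalCpuList=[0, 0, 1, 1],
    latencyMatrix=[
        [0, 1, 10, 10],
        [1, 0, 10, 10],
        [10, 10, 0, 1],
        [10, 10, 1, 0],
    ],
    junkCores=[0],
)

print(server.usable_cores())         # [1, 2, 3]
print(server.path_cost([1, 2, 3]))   # 11: one cross-node hop, then one local hop
print(server.path_cost([2, 3, 2]))   # 2: both hops stay on NUMA node 1
```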