For years, C# has been dismissed as the language of choice for low latency applications. I have been working in the financial industry for over 15 years and very often I heard that any low latency software such as HFT (High-Frequency Trading) had to be implemented in native languages such as C++.
I think the primary requirement for low latency is not just the performance but consistency and predictability of execution time. This is exactly why the “unpredictable” behaviour of GC puts developers off from using C# (and other .NET languages) in low latency scenarios. However, I have recently started to notice several positions being advertised where “Low Latency” featured together with one of the .NET languages. I have also worked as Trade Desk Algo Developer at Credit-Suisse developing various trading automation employing C# and .NET as my “weapons” of choice without any problems.
The language itself is not the only criteria for low latency applications. Very rarely (or never) low latency applications live in isolation. In the world of finance, you need to connect to various third-party services. Such as market data and order execution providers, and that is where the network connectivity and its performance also plays a significant role.
Another aspect of managed languages is the perceived overhead of various checks. In this article, I would try to examine to what degree these affect the performance.
The standard argument is that CLR (Common Language Runtime) employs a non-deterministic way of cleaning up unused memory – Garbage Collection (GC). “Nondeterministic” in this context means that the developer has no direct control over when the memory is freed. It is up to the runtime (CLR) to decide when is the most appropriate time to perform the garbage collection, based on available memory and the application behaviour. Although the developer can initiate GC by calling the “Collect()” method, this practice is commonly discouraged.
The CLR GC employs a tracing method of identifying and freeing up memory, whereby the algorithm traverses a graph of objects allocated on the heap and identifies the ones that are not then used. The CLR has to pause some threads during GC to be able to identify, free up and compact memory. Because the memory is compacted (aka defragmented), all the live references have also got to be updated.
All of that takes time and more importantly when the GC will start, causing a small pause in the execution of the application. This behaviour deemed unacceptable for low latency applications.
Just In Time compilation (JIT) is an approach to compilation, where each method is compiled on demand. This leads to a slower start up times and occasional pauses, when a method is accessed for the first time. However, this is only applicable to “cold starts” and is easily solved by running NGen utility against the assemblies, which generates native code for the same assembly. There is also another way – manually “touching” the methods at the beginning of the program – forcing JIT to do its job once and for all, rather than on demand basis. But this is more like a hack.
Despite its perceived disadvantages, GC eliminates a lot of potential bugs and enhances the productivity of a developer, who doesn’t have to be concerned about memory deallocation. This significantly reduces the possibility of memory related bugs and means that the software could be delivered much faster.
Because of the work being done by GC, memory allocations are extremely fast. One of the last stages of GC is memory compaction. Since the memory is compacted (there are few exceptions that leave “holes” in memory because of pinned object and Large Object Heap) allocation of new objects is extremely fast. All the CLR has to do it to advance the “next object pointer”. Whereas in C/C++ the memory would have to be scanned, to find an unused block for the new object.
Even though GC causes small pauses, they are usually unnoticeable. The GC team is working hard to make sure GC causes as little interruption to the program execution as possible. They have also added additional flavours of GC, giving the developer the option of specifying which behaviour they want depending on the application type. In addition more recent changes to the .NET Framework and C#, such as return by reference which opens the possibility of using large structs rather than reference types to save on a number of allocations. Before such practice would most probably backfire, as copying large structs is relatively expensive.
This mode is optimised for interactivity by frequent short garbage collects. Each collect is split up into smaller pieces and is interleaved with the managed applications threads. This improves responsiveness at the cost of CPU cycles. This is ideal for interactive desktop applications where a freezing application is an annoyance for the users and ideal CPU time is abundant when waiting for user input. Concurrent workstation mode improves the usability with perceived performance.
Note that interactive GC only applies for Gen2 (full heap) collects because Gen0 and Gen1 collects are in nature very fast.
Non-concurrent workstation mode works by suspending managed application threads when a GC is initiated. It provides better throughput but might appear as application hiccups where everything freezes for the users.
In server mode, a managed heap and a dedicated garbage collector thread are created for each CPU. This means the each CPU allocates memory in its own heap, therefore, results in lock-free allocation. When a collect is initiated all the managed application threads are suspended and all the GC threads collect in parallel.
Another thing to note is that the size of the managed heap segments is larger in server mode than workstation mode. A segment is a unit of which the memory is allocated on the managed heap.
It is possible to choose the type of GC for a managed application in the configuration file. Under the element add one of the below three settings and depending on the number of CPU, the garbage collector will run in the configured mode. Garbage Collection type settings
Running a standalone managed application the GC mode is by default concurrent workstation. Managed application hosts like ASP.Net and SQLCLR run with Server GC by default
.NET Assemblies contain IL code which is not yet executable and requires one last step – compilation to native instructions by JIT. Since it’s done on a machine where the code is going to run and not on a build server, JIT compiler has all the information about the hardware, primarily the CPU, to tailor the generated native instructions to the particular CPU architecture and its features, such as extensions, additional registers etc. Theoretically, this is ought to yield faster code. I haven’t seen any benchmarks, but my guess would be that the difference is very marginal.
Whereas in C++ you can embed assembler code, you cannot do so in C#. Personally, I don’t see the reason why assembler should be used in 2015 unless of course you’re writing something really low level.
Other features, like inlining, although present in a form of an attribute and a flag in C# – [MethodImpl(MethodImplOptions.AggressiveInlining)] it merely acts as a hint to the compiler. Aggressive inlining option tells it not to include the size of the method into the list of criteria when deciding whether to inline the method or not. Even so, there is no guarantee. CLR already does an excellent job of inlining methods or properties automatically where necessary. AggressiveInlining option should be used in very rare circumstances, forcing to inline large methods could lead to the less efficient use of CPU cache.
The beauty of managed code is that, together with the memory management, many checks have been introduced to eliminate the most common bugs.
One example is array boundary checks. The pro native language camp will say those checks significant performance. This is not exactly the case, while the checks are made, and this is an essential feature of CLR, they are not necessarily always performed at runtime. CLR will use static analysis to eliminate checks. Even if a check has to be performed on an array, the JIT team used a nice trick to reduce the number of instructions. Here is an example (x86):
cmp EDX, dword ptr [ECX+4] jae SHORT G_M60672_IG03 // unsigned comparison
EDX contains the array index, and [ECX+4] the length of the array. “jae” instruction performs the comparison that index < length. The trick here is that it seems that the code doesn’t check for negative index. JIT guys employ unsigned arithmetic properties and use “jae” instruction which performs an unsigned check. So, if EDX is, negative, casting it to an unsigned representation will yield the value of 2^31, thus causing it to fail validation.
There is quite a long list of features in native languages that are not available in C#. The questions are whether it makes a significant difference.
I would argue that the productivity and the speed at which the software can be developed plays a more important role, especially in these days when it cheaper to throw in more powerful hardware rather than spending more time on writing, optimising and testing the code.
Low Latency is not only dependent on the language the application and the framework are written in, but on other factors such as the performance of the network. For instance, you might want to disable Nagle’s algorithms, which optimises the traffic over TCP at the expense of latency. This can easily be done in C# by setting Socket.NoDelay option to true
Equipped with decent hardware and sufficient memory GC pauses would be imperceptible and there are more components in the chain that could lead to latency than the code itself.