This is getting quite interesting…

I Love C#

This post is written with respect to the features proposed for C# 8.0. It is based on the following video, in which Mads Torgersen explains the newest proposed C# 8.0 features to Seth Juarez. Please bear in mind that none of these features are released or finalized yet.

You can watch the video here.

Nullable Reference Types

I felt this is somewhat of a misnomer, since reference types are nullable by default: they can be null, and most developers expect them to be. Although, to be technically correct, nullability (is that a word? who knows!) is not a mandatory property of all reference types; we are just used to seeing it that way, and thus the naming is correct.

This specific feature means someone can actually declare reference type instances which are not supposed to be null. Even in…

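To illustrate, here is a rough sketch of the syntax discussed in the video (hedged: the feature was only a proposal at the time of writing, and the opt-in mechanism shown here is an assumption based on how it was being described):

#nullable enable  // opt in to the new nullability checking (assumed syntax)

string definitelyAString = null;   // warning: assigning null to a non-nullable reference
string? maybeAString = null;       // fine: explicitly declared as nullable

int length = maybeAString.Length;  // warning: possible null dereference

if (maybeAString != null)
{
    length = maybeAString.Length;  // no warning: flow analysis knows it is not null here
}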

As part of the series of posts announced at this initial blog post (.NET Application Architecture Guidance) that explores each of the architecture areas currently covered by our team, this current blog post focuses on “Microservices and Docker containers: Architecture, Patterns and Development guidance”.

Just as a reminder, the four introductory blog posts of this series will be the following:

The microservices architecture is emerging as an important approach for distributed mission-critical applications. In a microservice-based architecture, the application is built on a collection of services that can be developed, tested, deployed, and versioned independently. In addition, enterprises are increasingly realizing cost savings, solving deployment problems, and improving DevOps and production operations by using containers (Docker engine based as de facto standard).

Microsoft has been releasing container innovations for Windows and Linux by creating products like Azure Container Service and Azure Service Fabric, and by partnering with industry leaders like Docker, Mesosphere, and Kubernetes. These products deliver container solutions that help companies build and deploy applications at cloud speed and scale, whatever their choice of platform or tools…

https://blogs.msdn.microsoft.com/dotnet/2017/08/02/microservices-and-docker-containers-architecture-patterns-and-development-guidance/

We can use the Range method to build two integer sequences, join the two sequences using the Concat method, and see that the concatenated sequence holds all the numbers from the first and the second sequence.

via Concatenate two IEnumerable sequences in C# .NET — Exercises in .NET with Andras Nemes
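
A minimal sketch of what the excerpt describes, using Enumerable.Range and Concat from System.Linq (the concrete ranges are arbitrary):

using System;
using System.Linq;

var first = Enumerable.Range(1, 5);     // 1, 2, 3, 4, 5
var second = Enumerable.Range(6, 5);    // 6, 7, 8, 9, 10

var concatenated = first.Concat(second);

Console.WriteLine(string.Join(", ", concatenated)); // 1, 2, 3, 4, 5, 6, 7, 8, 9, 10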

In parallel programs it is very important to consider cache size and hit rates on a single CPU, but it's even more important to consider how the caches of multiple processors/cores interact. Let's consider a single representative example, which demonstrates an important cache optimisation and emphasises the value of good tools when it comes to performance optimisation in general.

Let's first examine the sequential method. It performs the rudimentary task of summing all the elements in a two-dimensional array of integers and returns the result:

public static int MatrixSumSequential(int[,] matrix)
{
    int sum = 0;
    // GetLength returns the number of elements in each dimension
    // (GetUpperBound returns the last index and would miss the final row/column).
    int rows = matrix.GetLength(0);
    int cols = matrix.GetLength(1);
    for (int i = 0; i < rows; i++)
    {
        for (int j = 0; j < cols; j++)
        {
            sum += matrix[i, j];
        }
    }
    return sum;
}

We could have used TPL but let’s ignore the huge arsenal of tools TPL provides in our simple example. The following attempt at parallelisation may appear sufficiently reasonable to harvest the fruits of multi-core execution, and even implements a crude aggregation to avoid synchronisation on the shared sum variable:

public static int MatrixSumParallel(int[,] matrix)
{
    int rows = matrix.GetLength(0);
    int cols = matrix.GetLength(1);
    const int THREADS = 4;
    int chunk = rows / THREADS;          // assumes rows divides evenly by THREADS
    int[] localSums = new int[THREADS];
    Thread[] threads = new Thread[THREADS];
    for (int i = 0; i < THREADS; i++)
    {
        int start = chunk * i;
        int end = chunk * (i + 1);
        int threadNum = i;
        threads[i] = new Thread(() =>
        {
            for (int row = start; row < end; row++)
            {
                for (int col = 0; col < cols; col++)
                {
                    localSums[threadNum] += matrix[row, col];
                }
            }
        });
        threads[i].Start();
    }
    // Wait for all threads to finish before aggregating.
    foreach (var thread in threads)
        thread.Join();
    return localSums.Sum(); // requires using System.Linq and System.Threading
}


Executing each of the two methods several times on an i7 machine with 6 cores produced the following results for a 2,000 x 2,000 matrix of integers:

  • 325ms average for the sequential method
  • 935ms average for the parallel method. Almost three times as slow as the sequential method!

The obvious question is: why?
This is not an example of too fine-grained parallelism, because the number of threads is only 4. However, if you accept the premise that the problem is somehow cache related, it would make sense to measure the number of cache misses introduced by the two methods above.

The Visual Studio profiler, when sampling the execution of each method with a 2,000 x 2,000 matrix, reported 963 exclusive samples in the parallel version and only 659 exclusive samples in the sequential version, with the vast majority of samples on the inner loop line that reads from the matrix.

Why would a line of code writing to localSums introduce so many cache misses compared to writing to the sum local variable? The answer is that the writes to the shared array invalidate cache lines on other processors/cores, causing every += operation to be a cache miss (an effect commonly known as false sharing).
When a processor writes to a memory location that is in another processor's/core's cache, the hardware issues a cache invalidation that marks the cache line as invalid. The next access to that line results in a cache miss.
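One way to address this (not from the original article, just a sketch) is to accumulate into a local variable inside each thread and write to the shared localSums array only once, so the hot loop never touches a cache line shared with other cores:

threads[i] = new Thread(() =>
{
    int localSum = 0;                      // thread-private accumulator
    for (int row = start; row < end; row++)
    {
        for (int col = 0; col < cols; col++)
        {
            localSum += matrix[row, col];
        }
    }
    localSums[threadNum] = localSum;       // single write to the shared array
});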

The moral of the story: do not blindly introduce parallelisation in the hope that it will automatically result in a performance increase. Always test both versions, you might be surprised by the results!

Fraction Implementation in C#

I'm not really sure why Microsoft has never bothered with implementing a Fraction primitive in .NET. I'm sure there are plenty of uses, as fractions allow you to preserve the maximum possible precision. I have therefore decided to create my own implementation (albeit somewhat primitive at this stage).

My implementation automatically simplifies the fraction, so if you were to create new Fraction(6, 3) it would be simplified to 2. The Fraction struct implements all the arithmetic operators on itself and on Int64, float, double and decimal.

Internally the Fraction is represented as two Int64 values, Numerator and Denominator, and is always simplified upon initialisation. I initially intended to make simplification optional; however, profiling showed that the cost of simplification is not that great, and the benefits outweigh the performance drawbacks.

Fraction has explicit conversions to Int64 (although that is bound to lose precision), float, double and decimal. It supports comparison with Int64, float, double and decimal and even supports the ++ and -- operators.

So far I have provided a more or less complete implementation with plenty of unit tests. Now the hard work of optimising the performance begins!

Design of Fractions

Fraction is implemented as a struct (a pretty obvious choice). It takes a numerator as the first argument and a denominator as the second; it then tries to simplify the fraction using the Euclidean algorithm, so if you were to specify 333/111 it would become 3.
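A minimal sketch of how such a constructor might normalise the fraction using the Euclidean algorithm (illustrative only, not the actual code from the repository):

using System;

public struct Fraction
{
    public long Numerator { get; }
    public long Denominator { get; }

    public Fraction(long numerator, long denominator)
    {
        if (denominator == 0) throw new DivideByZeroException("Denominator cannot be zero.");

        // Keep the sign on the numerator so the denominator is always positive.
        int sign = denominator < 0 ? -1 : 1;

        long gcd = Gcd(Math.Abs(numerator), Math.Abs(denominator));
        Numerator = sign * numerator / gcd;
        Denominator = sign * denominator / gcd;
    }

    // Euclidean algorithm: gcd(a, b) = gcd(b, a mod b) until b == 0.
    private static long Gcd(long a, long b)
    {
        while (b != 0)
        {
            (a, b) = (b, a % b);
        }
        return a;
    }

    public override string ToString() => $"{Numerator}/{Denominator}";
}

With this normalisation in place, new Fraction(333, 111) ends up stored as 3/1.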

The implementation supports all arithmetic operations with long, float, double and decimal, and a Fraction can also be converted to those types either by calling the corresponding methods or by using an explicit cast.

You can also create a Fraction from either a long, float, double or decimal. Conversion from a long is quite trivial; however, conversion from a float, double or decimal goes through a while loop that multiplies the floating point number until it has no decimal places. This method is relatively slow and therefore not recommended.
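The floating point conversion might look roughly like this (a sketch of the approach described above; it ignores overflow and the rounding noise inherent in binary floating point):

public static Fraction FromDouble(double value)
{
    long denominator = 1;

    // Keep multiplying by 10 until there is no fractional part left.
    // This loop is what makes the conversion relatively slow.
    while (value != Math.Floor(value))
    {
        value *= 10;
        denominator *= 10;
    }

    return new Fraction((long)value, denominator);  // the constructor simplifies the result
}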

Apart from that, the Fraction behaves like a first-class citizen: you can compare a Fraction to any other number, divide, multiply, add, subtract, increment, decrement etc.

For example:

var oneThird = new Fraction(1, 3);
var reciprocal = oneThird.Reciprocal();

Console.WriteLine(oneThird * reciprocal); // "1"
Console.WriteLine(oneThird * oneThird);   // "1/9"
Console.WriteLine(++oneThird);            // "4/3" - just like with an integer, ++ adds 1


Please feel free to contribute to the codebase if you feel like it

https://github.com/ebalynn/Balynn.Maths.Fraction

If you are about to begin implementing your own version of a thread-safe generic weak dictionary – STOP!
As of .NET 4.0 there is already a class that implements that functionality: it's called ConditionalWeakTable<TKey, TValue> and it lives in the System.Runtime.CompilerServices namespace.

There are however several limitations: both the key and the value have to be reference types (TKey : class and TValue : class).

Here is the comment from the source file:

** Description: Compiler support for runtime-generated "object fields."
**
** Lets DLR and other language compilers expose the ability to
** attach arbitrary "properties" to instanced managed objects at runtime.
**
** We expose this support as a dictionary whose keys are the
** instanced objects and the values are the "properties."
**
** Unlike a regular dictionary, ConditionalWeakTables will not
** keep keys alive.
**
**
** Lifetimes of keys and values:
**
** Inserting a key and value into the dictonary will not
** prevent the key from dying, even if the key is strongly reachable
** from the value.
**
** Prior to ConditionalWeakTable, the CLR did not expose
** the functionality needed to implement this guarantee.
**
** Once the key dies, the dictionary automatically removes
** the key/value entry.
**
**
** Relationship between ConditionalWeakTable and Dictionary:
**
** ConditionalWeakTable mirrors the form and functionality
** of the IDictionary interface for the sake of api consistency.
**
** Unlike Dictionary, ConditionalWeakTable is fully thread-safe
** and requires no additional locking to be done by callers.
**
** ConditionalWeakTable defines equality as Object.ReferenceEquals().
** ConditionalWeakTable does not invoke GetHashCode() overrides.
**
** It is not intended to be a general purpose collection
** and it does not formally implement IDictionary or
** expose the full public surface area.
**
**
**
** Thread safety guarantees:
**
** ConditionalWeakTable is fully thread-safe and requires no
** additional locking to be done by callers.
**
**
** OOM guarantees:
**
** Will not corrupt unmanaged handle table on OOM. No guarantees
** about managed weak table consistency. Native handles reclamation
** may be delayed until appdomain shutdown.

Just by looking at the comments alone we can extrapolate that what we have is equivalent to a generic, thread-safe weak dictionary! Further internet research confirms the findings.
There are several critical limitations though:
• TKey and TValue both have to be reference types
• The equality is defined using ReferenceEquals()
• GetHashCode() overrides are never called
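
A quick usage sketch (the Metadata type and the method names are purely illustrative): ConditionalWeakTable lets you attach extra state to any object without preventing it from being collected.

using System.Runtime.CompilerServices;

public class Metadata
{
    public string Tag { get; set; }
}

public static class ObjectTagger
{
    // Entries vanish automatically once the key object is garbage collected.
    private static readonly ConditionalWeakTable<object, Metadata> _metadata =
        new ConditionalWeakTable<object, Metadata>();

    public static void SetTag(object instance, string tag)
    {
        // GetValue is thread-safe and creates the value on first access.
        _metadata.GetValue(instance, _ => new Metadata()).Tag = tag;
    }

    public static string GetTag(object instance)
    {
        return _metadata.TryGetValue(instance, out Metadata data) ? data.Tag : null;
    }
}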

The JIT compiler logically determines which methods to inline, but sometimes we know better than it does. With AggressiveInlining, we give the compiler a hint that the method should be inlined. Actually, the only hint we give the compiler is to ignore the size restriction on the method or property we want to inline. Using this attribute does not guarantee that the method will be inlined. There are a thousand and one reasons why it might not be (being virtual, for one thing).

Example

This example benchmarks a method with no attribute, and with AggressiveInlining. The method body contains several lines of useless code. This makes the method large in bytes, so the JIT compiler may decide not to inline it.

We apply the MethodImplOptions.AggressiveInlining option (an enum value) to Method2.

using System;
using System.Diagnostics;
using System.Runtime.CompilerServices;
 
class Program
{
    const int _max = 10000000;
    static void Main()
    {
        // ... Compile the methods
        Method1();
        Method2();
        int sum = 0;
 
        var s1 = Stopwatch.StartNew();
        for (int i = 0; i < _max; i++)
        {
            sum += Method1();
        }
        s1.Stop();
        var s2 = Stopwatch.StartNew();
        for (int i = 0; i < _max; i++)
        {
            sum += Method2();
        }
        s2.Stop();
        
        Console.WriteLine(((double)(s1.Elapsed.TotalMilliseconds * 1000000) / _max).ToString("0.00 ns"));
    
        Console.WriteLine(((double)(s2.Elapsed.TotalMilliseconds * 1000000) / _max).ToString("0.00 ns"));
        Console.Read();
    }
 
    static int Method1()
    {
        // ... No inlining suggestion
        return "one".Length + "two".Length + "three".Length +
            "four".Length + "five".Length +   "six".Length +
            "seven".Length + "eight".Length + "nine".Length +
            "ten".Length;
    }
 
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    static int Method2()
    {
        // ... Aggressive inlining
        return "one".Length + "two".Length + "three".Length +
            "four".Length + "five".Length + "six".Length +
            "seven".Length + "eight".Length + "nine".Length +
            "ten".Length;
    }
}

Output

7.34 ns No options
0.32 ns MethodImplOptions.AggressiveInlining

We see that with no options, the method calls required seven nanoseconds each. But with inlining specified (with AggressiveInlining), the calls required less than one nanosecond each.

Tip 1: Consider for a moment all the things you could do with those seven nanoseconds.

Tip 2: If you are scheduling your life based on nanoseconds, please consider reducing your coffee intake.

Every decent .NET developer understands why we need destructors and how they work. In short, once a class declares a ‘~[ClassName]’ method (a finalizer), every instance of that class is placed in the finalization queue when it is created. Normally, classes that use destructors also implement the IDisposable interface, and the destructor is there only as a safety net – i.e. in case the developer forgets to call ‘Dispose()’ explicitly or implicitly (for example via a ‘using’ block).

However, even though it is clearly documented, not every developer realises the implications of having to put that object into the finalization queue. The popular train of thought is that if ‘GC.SuppressFinalize()’ is called there would be no performance penalty; at the end of the day we are removing the object from the finalisation queue and the GC has less work to do.
What many do not realise is that there is only one finalisation queue, and when working with multiple threads .NET has to synchronise access to that queue somehow. So when creating objects with destructors from multiple threads, we inadvertently create contention on the finalisation queue, and even the creation of such objects can be slower.
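For reference, the idiom the paragraph refers to looks roughly like this (a sketch of the common Dispose/finalizer pattern, not code from any particular library):

using System;

public class ResourceHolder : IDisposable
{
    private IntPtr _nativeHandle;   // stand-in for some unmanaged resource
    private bool _disposed;

    public void Dispose()
    {
        Dispose(true);
        // Removes the object from the finalization queue, but note that it was
        // still registered there the moment it was created.
        GC.SuppressFinalize(this);
    }

    // Safety net: runs only if Dispose() was never called.
    ~ResourceHolder()
    {
        Dispose(false);
    }

    protected virtual void Dispose(bool disposing)
    {
        if (_disposed) return;
        if (disposing)
        {
            // release managed resources here
        }
        // release unmanaged resources here (e.g. close _nativeHandle)
        _disposed = true;
    }
}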

Introduction

For years, C# has been dismissed as a language suitable for low latency applications. I have been working in the financial industry for over 15 years, and very often I have heard that any low latency software, such as HFT (High-Frequency Trading) systems, had to be implemented in a native language such as C++.

Low Latency

I think the primary requirement for low latency is not just raw performance but consistency and predictability of execution time. This is exactly why the “unpredictable” behaviour of the GC puts developers off using C# (and other .NET languages) in low latency scenarios. However, I have recently started to notice several positions being advertised where “low latency” featured together with one of the .NET languages. I have also worked as a Trade Desk Algo Developer at Credit Suisse, developing various trading automation with C# and .NET as my “weapons” of choice, without any problems.
The language itself is not the only criterion for low latency applications. Very rarely (if ever) do low latency applications live in isolation. In the world of finance, you need to connect to various third-party services, such as market data and order execution providers, and that is where network connectivity and its performance also play a significant role.
Another aspect of managed languages is the perceived overhead of various checks. In this article, I will try to examine to what degree these affect performance.

Disadvantages

Garbage Collector

The standard argument is that CLR (Common Language Runtime) employs a non-deterministic way of cleaning up unused memory – Garbage Collection (GC). “Nondeterministic” in this context means that the developer has no direct control over when the memory is freed. It is up to the runtime (CLR) to decide when is the most appropriate time to perform the garbage collection, based on available memory and the application behaviour. Although the developer can initiate GC by calling the “Collect()” method, this practice is commonly discouraged.
The CLR GC employs a tracing method of identifying and freeing up memory, whereby the algorithm traverses the graph of objects allocated on the heap and identifies the ones that are no longer used. The CLR has to pause some threads during GC to be able to identify, free up and compact memory. Because the memory is compacted (i.e. defragmented), all the live references also have to be updated.
All of that takes time and, more importantly, it is not known in advance when the GC will start, which causes small pauses in the execution of the application. This behaviour is deemed unacceptable for low latency applications.

Just In Time Compilation

Just In Time compilation (JIT) is an approach to compilation where each method is compiled on demand. This leads to slower start-up times and occasional pauses when a method is accessed for the first time. However, this only applies to “cold starts” and is easily solved by running the NGen utility against the assemblies, which generates native code for them. There is also another way – manually “touching” the methods at the beginning of the program, forcing the JIT to do its job once and for all rather than on an on-demand basis – but this is more of a hack.
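One way of doing that warm-up “by hand” is to force the JIT to compile methods up front via RuntimeHelpers.PrepareMethod (a sketch; generic and abstract methods need extra care and are simply skipped here):

using System;
using System.Reflection;
using System.Runtime.CompilerServices;

public static class Warmup
{
    public static void PreJit(Type type)
    {
        const BindingFlags all = BindingFlags.Public | BindingFlags.NonPublic |
                                 BindingFlags.Instance | BindingFlags.Static |
                                 BindingFlags.DeclaredOnly;

        foreach (MethodInfo method in type.GetMethods(all))
        {
            // Skip methods the JIT cannot compile without a concrete instantiation.
            if (method.IsAbstract || method.ContainsGenericParameters)
                continue;

            RuntimeHelpers.PrepareMethod(method.MethodHandle);
        }
    }
}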

Advantages

Garbage Collector

Despite its perceived disadvantages, GC eliminates a lot of potential bugs and enhances the productivity of a developer, who doesn’t have to be concerned about memory deallocation. This significantly reduces the possibility of memory related bugs and means that the software could be delivered much faster.
Because of the work being done by the GC, memory allocations are extremely fast. One of the last stages of GC is memory compaction. Since the memory is compacted (there are a few exceptions that leave “holes” in memory, because of pinned objects and the Large Object Heap), allocation of new objects is extremely fast: all the CLR has to do is advance the “next object pointer”. In C/C++, by contrast, the memory would have to be scanned to find an unused block for the new object.
Even though GC causes small pauses, they are usually unnoticeable. The GC team is working hard to make sure GC causes as little interruption to the program execution as possible. They have also added additional flavours of GC, giving developers the option of specifying which behaviour they want depending on the application type. In addition, more recent changes to the .NET Framework and C#, such as return by reference, open up the possibility of using large structs rather than reference types to save on the number of allocations. Previously such a practice would most probably have backfired, as copying large structs is relatively expensive.
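As an aside, the return-by-reference feature mentioned above looks roughly like this (a sketch; the type and sizes are made up):

public struct BigStruct
{
    public long A, B, C, D;
}

public class Grid
{
    private readonly BigStruct[] _cells = new BigStruct[1024];

    // Returns a reference into the array, so the caller can read or modify
    // the element in place without copying the whole struct.
    public ref BigStruct Cell(int index) => ref _cells[index];
}

// Usage:
// ref BigStruct cell = ref grid.Cell(10);
// cell.A = 42;   // writes straight into the array element, no copy is made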

GC Flavours

Source: http://www.lybecker.com/blog/2007/04/03/garbage-collection-flavors/

Concurrent Workstation Garbage Collection

This mode is optimised for interactivity through frequent, short garbage collections. Each collection is split up into smaller pieces and is interleaved with the managed application's threads. This improves responsiveness at the cost of CPU cycles. It is ideal for interactive desktop applications, where a freezing application is an annoyance to users and idle CPU time is abundant while waiting for user input. Concurrent workstation mode improves usability through perceived performance.
Note that the concurrent behaviour only applies to Gen2 (full heap) collections, because Gen0 and Gen1 collections are by nature very fast.

Non-concurrent Workstation Garbage Collection

Non-concurrent workstation mode works by suspending managed application threads when a GC is initiated. It provides better throughput but might appear as application hiccups where everything freezes for the users.

Server Garbage Collection

In server mode, a managed heap and a dedicated garbage collector thread are created for each CPU. This means each CPU allocates memory in its own heap, which results in lock-free allocation. When a collection is initiated, all the managed application threads are suspended and all the GC threads collect in parallel.
Another thing to note is that the size of the managed heap segments is larger in server mode than workstation mode. A segment is a unit of which the memory is allocated on the managed heap.

It is possible to choose the type of GC for a managed application in the configuration file. Under the <runtime> element, add one of the settings shown below and, depending on the number of CPUs, the garbage collector will run in the configured mode.
When running a standalone managed application, the GC mode is concurrent workstation by default. Managed application hosts like ASP.NET and SQLCLR run with server GC by default.
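For .NET Framework applications, the relevant settings live under the <runtime> element of app.config; roughly, the three modes map to the following (a sketch, with only one of the settings normally active at a time):

<configuration>
  <runtime>
    <!-- Concurrent workstation GC (the default for standalone applications) -->
    <gcConcurrent enabled="true"/>

    <!-- Non-concurrent workstation GC -->
    <!-- <gcConcurrent enabled="false"/> -->

    <!-- Server GC (the default for hosts such as ASP.NET) -->
    <!-- <gcServer enabled="true"/> -->
  </runtime>
</configuration>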

JIT

.NET assemblies contain IL code, which is not directly executable and requires one last step – compilation to native instructions by the JIT. Since this is done on the machine where the code is going to run, and not on a build server, the JIT compiler has all the information about the hardware, primarily the CPU, to tailor the generated native instructions to the particular CPU architecture and its features, such as instruction set extensions, additional registers etc. Theoretically, this ought to yield faster code. I haven't seen any benchmarks, but my guess would be that the difference is very marginal.

Less Control over Optimisations

Whereas in C++ you can embed assembler code, you cannot do so in C#. Personally, I don't see a reason why assembler should be used in 2015, unless of course you're writing something really low level.
Other features, like inlining, although present in the form of an attribute and a flag in C# – [MethodImpl(MethodImplOptions.AggressiveInlining)] – merely act as a hint to the compiler. The aggressive inlining option tells it not to include the size of the method in the list of criteria when deciding whether to inline it or not. Even so, there is no guarantee. The CLR already does an excellent job of inlining methods or properties automatically where necessary. The AggressiveInlining option should be used in very rare circumstances; forcing large methods to be inlined could lead to less efficient use of the CPU cache.

Automatic Checks

The beauty of managed code is that, together with memory management, many checks have been introduced to eliminate the most common bugs.
One example is array boundary checks. The pro-native-language camp will say those checks cost significant performance. This is not exactly the case: while the checks are made, and this is an essential feature of the CLR, they are not necessarily always performed at runtime. The CLR will use static analysis to eliminate checks where it can. Even when a check has to be performed on an array access, the JIT team used a nice trick to reduce the number of instructions. Here is an example (x86):

cmp     EDX, dword ptr [ECX+4]         
jae     SHORT G_M60672_IG03            // unsigned comparison

EDX contains the array index, and [ECX+4] the length of the array. The “jae” instruction performs an unsigned comparison and branches to the out-of-range path when index >= length. The trick is that the code does not appear to check for a negative index: the JIT relies on the properties of unsigned arithmetic, so if EDX is negative, reinterpreting it as an unsigned value yields a number of at least 2^31, causing it to fail the validation anyway.
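To illustrate the static-analysis point, the canonical case where the bounds check disappears entirely is a loop whose bound is the array's own Length (a sketch; the exact elision behaviour depends on the JIT version):

static long Sum(int[] values)
{
    long sum = 0;
    // Because the loop condition is i < values.Length, the JIT can prove every
    // index is in range and omit the per-access bounds check.
    for (int i = 0; i < values.Length; i++)
    {
        sum += values[i];
    }
    return sum;
}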

Does it matter?

There is quite a long list of features in native languages that are not available in C#. The question is whether this makes a significant difference.
I would argue that productivity and the speed at which software can be developed play a more important role, especially these days when it is cheaper to throw more powerful hardware at a problem than to spend more time writing, optimising and testing code.
Low latency is not only dependent on the language the application and the framework are written in, but on other factors such as the performance of the network. For instance, you might want to disable Nagle's algorithm, which optimises traffic over TCP at the expense of latency. This can easily be done in C# by setting the Socket.NoDelay option to true, as shown in the snippet below.
Equipped with decent hardware and sufficient memory, GC pauses should be imperceptible, and there are more components in the chain that could introduce latency than the code itself.
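As mentioned above, disabling Nagle's algorithm from C# is essentially a one-liner (a sketch):

using System.Net.Sockets;

var client = new TcpClient();
client.NoDelay = true;   // disables Nagle's algorithm on the underlying socket

// Or directly on a Socket instance:
// socket.SetSocketOption(SocketOptionLevel.Tcp, SocketOptionName.NoDelay, true);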

Linq and extension methods are wonderful features, since they allow extending existing types without having to modify them. Extension methods are a special kind of static method, but they are called as if they were instance methods. The most prominent examples are in Linq to Objects, where a plethora of methods is available for any IEnumerable<T>.

However, since many types implement the IEnumerable<T> interface, this inadvertently creates a problem. For example, an array of int is IEnumerable<int>, so we can call the Count() extension method on it. However, this approach has significant performance drawbacks in comparison to simply calling .Length on the array. The problem becomes obvious if we look at the implementation of the Count() extension method:

public static int Count<TSource>(this IEnumerable<TSource> source)
{
    if (source == null)
    {
        throw Error.ArgumentNull(nameof(source));
    }

    if (source is ICollection<TSource> collectionoft)
    {
        return collectionoft.Count;
    }

    if (source is IIListProvider<TSource> listProv)
    {
        return listProv.GetCount(onlyIfCheap: false);
    }

    if (source is ICollection collection)
    {
        return collection.Count;
    }

    int count = 0;
    using (IEnumerator<TSource> e = source.GetEnumerator())
    {
        checked
        {
            while (e.MoveNext())
            {
                 count++;
            }
        }
    }
    
    return count;
}

Because the extension method is applied to any IEnumerable<T>, it has to be quite clever in how it computes the count: if the source is a collection, it can simply return .Count; otherwise it has to enumerate all the elements to determine the count. Just because of its size, this method will not be inlined by the JIT compiler, and the cost of using .Count() vs .Length on an array type is going to be very high.

One solution to this problem is to implement the Linq extensions for array types separately, in which case the implementation of .Count() becomes very simple and is likely to be inlined:

public static int Count<TSource>(this TSource [] items)
{
     return items.Length;
}

But then there are many other types where the count could be obtained cheaply by calling the .Count property, such as List<T>, IList<T>, ICollection<T>, ImmutableList<T> etc. The proper solution would be to implement all the Linq to Objects extension methods for each of these types separately to ensure the best performance, as sketched below. Even in the case of List<T> vs IList<T>, an implementation for the specific type would be slightly faster.
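For example, the specialised overloads for concrete collection types are equally small and trivially inlinable (a sketch of what such extensions might look like; requires System.Collections.Generic):

public static int Count<TSource>(this List<TSource> items)
{
    return items.Count;   // direct property access, no interface dispatch
}

public static int Count<TSource>(this ICollection<TSource> items)
{
    return items.Count;
}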

In other Linq to Objects extension methods we could improve performance by substituting foreach with for, which is normally slightly faster.

However, this inevitably results in an explosion of the codebase that has to be maintained. One plausible solution, which keeps all the benefits of having an extension method for each type, is to write a template and then use T4 files to generate a specific implementation for each type.

Setting the implementation challenges aside, integrating such a solution into an existing application is trivial. We just need to add the assembly and, provided it uses the same namespace as the original Linq to Objects extensions, recompiling the application is enough for the compiler to pick up the “better” suited extension methods, resulting in immediate performance benefits.

I have started such a project on GitHub:

https://github.com/ebalynn/FasterLinq