How to Reduce Cyclomatic Complexity Part 8: Map-Reduce Queries

by Zoran Horvat

Map-Reduce was first devised as an architectural pattern. When faced with enormous amounts of data, developers had trouble calculating a single result. The principal problem was getting the data to the processing server and then processing all of it on that server.

With Map-Reduce, things changed. Processing was moved closer to the data. Instead of downloading all the data to the processing server, the processing algorithm was executed directly on the node that holds the data. Only the calculated result would then be sent to the processing server.

But the results produced at the data nodes were partial. One more step was needed to conclude the operation – sending all the partial results to the processing server. At that point, the processing server would aggregate the partial results into one integral result.

Map-Reduce on Smaller Scale

As time progressed, programmers became aware that the map-reduce principle could have wider application. It had already proved its power on big data, but the same principle could be applied to smaller amounts of data – the data we regularly process in applications.

In many cases we are simply facing a collection of objects from which we need to produce one object or one value. We can easily apply the map-reduce principle to any such collection: the first step is to map all objects to intermediate results; the second step is to reduce the collection of intermediate results to a single value.

Language Integrated Query (LINQ)

That is the basic purpose of the LINQ library. We can use it directly on storage through LINQ to SQL or LINQ to Entities, or we can apply it to in-memory data through LINQ to Objects.

The common feature of all methods in the LINQ library is that they support the map-reduce model. Selection methods – Select, Where, Skip, Take, etc. – map a collection of objects into another collection of objects. Methods like Join, GroupBy, and OrderBy are also mapping methods; each of them produces another collection of objects. Finally, aggregation methods, such as Sum, Max, and even First and Single, are there to reduce the collection to a single object.
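
For example, a query can chain several mapping methods and end with a single reducing call. Here is a small sketch (the numbers are arbitrary):

int[] numbers = { 4, 7, 1, 9, 6 };

int largestEvenSquare = numbers
    .Where(n => n % 2 == 0)   // mapping: collection to collection
    .Select(n => n * n)       // mapping: collection to collection
    .Max();                   // reducing: collection to a single value

The result is 36 – the squares of the even numbers are produced by the mapping steps, and Max reduces them to one value.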

On a related note, if you have ever wondered why there is no ForEach method in the LINQ library, here is the answer: all methods in the library are query methods. They have no side effects; they only produce the result of a query, leaving the system state intact.

A ForEach method would be the odd one out. Such a method would return void, and its sole purpose would be to cause side effects. That is why it was not included in the LINQ library.
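
To make the distinction concrete, here is roughly what such a method would look like if we wrote it ourselves as an extension method – a sketch for illustration only, not part of the LINQ library:

public static class EnumerableExtensions
{
    public static void ForEach<T>(this IEnumerable<T> source, Action<T> action)
    {
        foreach (T item in source)
            action(item);  // the side effect happens here; nothing is returned
    }
}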

Example

Suppose that we want to test whether a given integer is prime. Remember, a number is prime if it is greater than 1 and its only positive divisors are 1 and itself. Here is the function which tests that:

bool IsPrime(int number)
{
    if (number < 2)
        return false;

    for (int i = 2; i * i <= number; i++)
        if (number % i == 0)
            return false;

    return true;
}

This function iterates through all numbers up to the square root of the tested number and checks whether any of them divides it. If such a divisor is found, the function returns false.

The way this function is implemented is, so to speak, the classic approach. It is procedural, because we have explicitly stated which steps should be executed to reach the correct Boolean result.

We can rewrite this method to work in a completely different way. The first step is to build a collection of candidate divisors – the numbers from 2 up to the square root of the tested number:

IEnumerable<int> candidates =
    Enumerable.Range(2, (int)Math.Sqrt(number) - 1);

These are all the potential divisors. Now we test whether all of them produce a non-zero result in the modulo operation:

bool isPrime =
    Enumerable.Range(2, (int)Math.Sqrt(number) - 1)
    .All(divisor => number % divisor != 0);

In this example there is no mapping step – all numbers in the collection are taken as they are. The second line, the call to the All method, is the reduce step. It reduces the collection to a single value of type Boolean.

Finally, we just have to rule out numbers smaller than 2, because those are not prime by definition. That is all it takes to write the map-reduce version of the prime test:

bool IsPrime(int number)
{
    return
        number >= 2 &&
        Enumerable.Range(2, (int)Math.Sqrt(number) - 1)
            .All(divisor => number % divisor != 0);
}

Not only is this function shorter, but it also communicates its intention more clearly. We can almost read it in English: a number is prime if it is greater than or equal to two and none of the numbers up to its square root divides it.
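
As a quick sanity check, we can run the new function on a few sample values (the inputs are arbitrary):

foreach (int n in new[] { 1, 2, 4, 16, 17 })
    Console.WriteLine("{0} -> {1}", n, IsPrime(n));
// 1 -> False, 2 -> True, 4 -> False, 16 -> False, 17 -> True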

More Complicated Example

Here is a more complicated example. Suppose that we want to find the most frequent value in an unsorted array of integers. This value is called the mode of the array, and it is important in some applications.

We could solve this problem in a traditional way by iterating through the array and building an index of all the values we find, counting each of them. In the end, we would return the value with the highest associated count. Here is the code:

struct Mode
{
    public int Value;
    public int Count;
}

Mode FindMode(int[] array)
{
    Mode best = new Mode();
    Dictionary<int, int> hashtable =
        new Dictionary<int, int>();

    foreach (int value in array)
    {
        int curCount;
        int count = 1;
        if (hashtable.TryGetValue(value, out curCount))
            count = curCount + 1;

        hashtable[value] = count;

        if (count > best.Count)
        {
            best = new Mode()
            {
                Value = value,
                Count = count
            };
        }
    }

    return best;
}

When faced with an implementation like this, we have to stop for a moment and think about how the method works. That becomes especially difficult when there is an indication that the method might contain a bug. Then we really have to focus on its internals to figure out whether it works correctly, in which case we have to look for the problem elsewhere, or whether it really does mix something up.

A map-reduce approach can prove much more valuable in cases like this. We take the input array as the starting collection. Then we implement a mapping step that puts equal values together – this is the indexing step. Finally, we reduce the resulting collection to a single value by picking the element with the highest count.

Here is the implementation using LINQ to Objects:

Mode FindMode(int[] array)
{
    return
        array
            .GroupBy(value => value)
            .Select(group =>
                new Mode()
                {
                    Value = group.Key,
                    Count = group.Count()
                })
            .OrderByDescending(mode => mode.Count)
            .First();
}

Once again, we have reached an implementation which is significantly shorter. As a side note, this implementation also reads exactly like the idea we used to build it: group the array elements by their values, associate a count with each value, sort the groups so that those with the highest counts come first, and then pick the first one.

What happens if someone suspects that this function contains a bug? Nothing special – we just walk through the sequence of transformations and see whether it fits the requirements. There are no loops or branching statements that are hard to analyze. There are no boundary conditions here; they are all buried deep inside the LINQ operators.
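
For completeness, here is a small usage sketch with made-up data; both implementations return the same result on this input:

Mode mode = FindMode(new[] { 3, 7, 3, 1, 7, 3 });
Console.WriteLine("{0} appears {1} times", mode.Value, mode.Count);
// 3 appears 3 times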

Real World Example

Let’s get back to the e-commerce application that we have been refactoring throughout this series of articles. The domain model defines a discount:

public interface IDiscount
{
    decimal Apply(decimal price);
}

A discount object simply receives a price and returns a modified price. For example, a 2% discount would multiply the price by 0.98 and return the new price.
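
For illustration, a percentage-based implementation might look like this (the class name and its constructor are hypothetical, not part of the domain model shown in this series):

public class PercentageDiscount : IDiscount
{
    private readonly decimal factor;

    public PercentageDiscount(decimal percentage)
    {
        this.factor = 1 - percentage / 100;  // e.g. 2% -> 0.98
    }

    public decimal Apply(decimal price)
    {
        return price * this.factor;
    }
}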

Implementations of the IDiscount interface are used in the Product class, which has to report its price:

namespace Store.Domain.Implementation
{
    public class Product: IProduct
    {
        ...
        public IProduct ApplyDiscounts(IEnumerable<IDiscount> discounts)
        {
            decimal price = this.Price;

            foreach (IDiscount discount in discounts)
                price = discount.Apply(price);

            return new Product(this.Name)
            {
                Price = price
            };
        }
    }
}

When the application executes a purchase request, it first asks the product for its price with all discounts applied. The implementation is once again old-fashioned: a loop through the collection of discounts, accumulating the result in the price variable.

We can apply the map-reduce principle here as well. Once again there is no mapping step, because every discount would only be mapped to itself. The reduce step, however, has to be implemented. Code speaks louder than words, so here it is:

namespace Store.Domain.Implementation
{
    public class Product: IProduct
    {
        ...
        public IProduct ApplyDiscounts(IEnumerable<IDiscount> discounts)
        {
            return new Product(this.Name)
            {
                Price = discounts.Aggregate(this.Price,
                                            (cur, discount) => discount.Apply(cur))
            };
        }
    }
}

This implementation relies on the Aggregate method from the LINQ to Objects library. The first argument to the Aggregate method is the seed. The second argument is the aggregation function, which takes the running result and the current object from the collection and produces the next running result. The aggregation function is called once for each object in the collection. The end result is the final running result, and that is what Aggregate returns.

The shorter version of this explanation is that the Aggregate method applies all the discounts in sequence, gradually reducing the product price, and returns the final price with all discounts applied.
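
To see the folding mechanics in isolation, here is the same idea applied to plain numbers (the discount factors are made up):

decimal finalPrice = new[] { 0.98m, 0.95m }  // a 2% and then a 5% discount
    .Aggregate(100.00m, (price, factor) => price * factor);
// 100.00 -> 98.00 -> 93.10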

Needless to say, this solution is much shorter than the first one, which relied on explicit looping and accumulating the price.

Conclusion

Whenever you write a loop in your code, you become responsible for everything that might go wrong in it. This becomes even more obvious when the loop contains branching statements, break and return statements, and the like.

Edge cases, boundary conditions, and condition tests can all become very hard to see in code, because they are often mixed with scaffolding code such as loops and if-then-else statements.

By using the LINQ to Objects library, we remove all that scaffolding from our code. The modified code is easier to read and has a natural flow. Map-reduce queries simply define a sequence of transformations that are performed in a row to reach the final result. Domain logic is exposed in such code, while scaffolding is kept out of it.



About

Zoran Horvat

Zoran Horvat is the Principal Consultant at Coding Helmet, a speaker and author of 100+ articles, and an independent trainer on the .NET technology stack. He can often be found speaking at conferences and user groups, promoting object-oriented and functional development style and clean coding practices and techniques that improve the longevity of complex business applications.
