How to Reduce Cyclomatic Complexity: MapReduce Queries by Zoran Horvat @zoranh75
MapReduce was first devised as an architectural pattern. When faced with
enormous amounts of data, developers had trouble calculating a single result.
The principal problem was getting the data to the processing server and then
processing all of it on that server.
With MapReduce, things changed. Processing was moved closer to the data. Instead
of downloading all the data to the processing server, the processing algorithm
was executed directly on the node holding the data. Only the calculated result
would then be sent to the processing server.
But the results produced at the data nodes were partial. One more step was
needed to conclude the operation – sending all partial results to the
processing server. At that point, the processing server would be able to
aggregate all partial results into one integral result.
MapReduce on a Smaller Scale
As time progressed, programmers became aware that the MapReduce principle could
have wider application. It had already proved its power on big data, but the
same principle could be applied to smaller amounts of data – the data that we
regularly process in applications.
In many cases we are simply facing a collection of objects from which we need
to produce one object or one value. We can easily apply the MapReduce principle
to any collection. The first step is to map all objects to intermediate
results. The second step is to reduce the collection of intermediate results to
one value.
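A minimal sketch of this two-step shape in C# (the collection and the squaring function are invented for illustration):

```csharp
using System;
using System.Linq;

int[] measurements = { 3, 1, 4, 1, 5 };

// Map: every object is turned into an intermediate result.
var squares = measurements.Select(x => x * x);   // 9, 1, 16, 1, 25

// Reduce: the intermediate results collapse into a single value.
int sumOfSquares = squares.Sum();

Console.WriteLine(sumOfSquares);                 // 52
```

Any pipeline of this kind follows the same outline: one or more mapping steps, followed by a single reducing step.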
Language Integrated Query (LINQ)
That is the basic purpose of the LINQ library. We can use it directly on
storage through LINQ to SQL or LINQ to Entities. Alternatively, we can apply it
to in-memory data through LINQ to Objects.
A common feature of all methods in the LINQ library is that they support the
MapReduce model. Selection methods – Select, Where, Skip, Take, etc. – map a
collection of objects into another collection of objects. Methods like Join,
GroupBy and OrderBy are also mapping methods; each of them produces another
collection of objects. Finally, aggregation methods, such as Sum, Max, even
First and Single, are there to reduce the collection to a single object.
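To make the distinction concrete, here is a small sketch (the numbers are made up for the example) where mapping methods keep producing collections until a reducing method collapses the result:

```csharp
using System;
using System.Linq;

int[] numbers = { 7, 2, 9, 4 };

// Mapping methods: each one produces another collection.
var filtered = numbers.Where(n => n > 3);    // 7, 9, 4
var doubled  = filtered.Select(n => n * 2);  // 14, 18, 8
var ordered  = doubled.OrderBy(n => n);      // 8, 14, 18

// Reducing methods: each one collapses a collection into a single value.
Console.WriteLine(ordered.Max());    // 18
Console.WriteLine(ordered.First());  // 8
```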
On a related note, if you ever wondered why there is no ForEach method in the
LINQ library, here is the answer. All methods in the library are query methods.
They do not have side effects; they only produce the result of a query, leaving
the system state intact.
A ForEach method would be the odd one out. Such a method would presumably
return void, and its sole purpose would be to have side effects. Therefore,
this method was not included in the LINQ library.
Example
Suppose that we want to test whether a given integer is prime. Remember, a
number is prime if it is greater than 1 and its only positive divisors are 1
and itself. Here is the function which tests that:
bool IsPrime(int number)
{
    if (number < 2)
        return false;

    for (int i = 2; i * i <= number; i++)
        if (number % i == 0)
            return false;

    return true;
}
This function iterates through all numbers up to the square root of the tested
number and checks whether any of them divides the number. If any such divisor
is found, the function returns false.
The way this function is implemented is, so to speak, the classic approach. It
is procedural, because we have explicitly stated which steps should be executed
to reach the correct Boolean result.
We can rewrite this method to work in a completely different way. The first
step would be to build a collection of potential divisors – those would be the
numbers up to square root of the tested number:
IEnumerable<int> candidates =
    Enumerable.Range(2, (int)Math.Sqrt(number) - 1);
These are all the potential divisors. Now we test whether all of them produce a
nonzero result in the modulo operation:
bool isPrime =
    Enumerable.Range(2, (int)Math.Sqrt(number) - 1)
              .All(divisor => number % divisor != 0);
In this example there is no mapping step – all numbers in the collection are
taken as they are. The second line, the call to the All method, is the reduce
step. It reduces the collection to a single value of type Boolean.
Finally, we just have to rule out numbers smaller than 2, because those are not
prime by definition. That is all it takes to write the MapReduce version of the
prime test:
bool IsPrime(int number)
{
    return
        number >= 2 &&
        Enumerable.Range(2, (int)Math.Sqrt(number) - 1)
                  .All(divisor => number % divisor != 0);
}
Not only is this function shorter, but it also communicates the intention more
clearly. We can almost read it in English: a number is prime if it is greater
than or equal to two and none of the numbers up to its square root divides it.
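A quick sanity check of the rewritten function on a few values:

```csharp
using System;
using System.Linq;

bool IsPrime(int number) =>
    number >= 2 &&
    Enumerable.Range(2, (int)Math.Sqrt(number) - 1)
              .All(divisor => number % divisor != 0);

foreach (int n in new[] { 1, 2, 9, 17, 25 })
    Console.WriteLine($"{n}: {IsPrime(n)}");
// 1: False, 2: True, 9: False, 17: True, 25: False
```

Note that for 2 and 3 the Range call produces an empty sequence, and All over an empty sequence returns true, which is exactly the behavior we want.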
More Complicated Example
Here is one more complicated example. Suppose that we want to find the most
frequent number in the unsorted array of integers. This is called mode of the
array and it is important in some applications.
We could solve this problem in the traditional way by iterating through the
array and building an index of all values we find, counting each of them. In
the end, we would return the value with the highest associated count. Here is
the code:
struct Mode
{
    public int Value;
    public int Count;
}

Mode FindMode(int[] array)
{
    Mode best = new Mode();
    Dictionary<int, int> hashtable =
        new Dictionary<int, int>();

    foreach (int value in array)
    {
        int curCount;
        int count = 1;

        if (hashtable.TryGetValue(value, out curCount))
            count = curCount + 1;

        hashtable[value] = count;

        if (count > best.Count)
        {
            best = new Mode()
            {
                Value = value,
                Count = count
            };
        }
    }

    return best;
}
When faced with an implementation like this, we have to stop for a moment and
think about how the method works. This becomes especially difficult if there is
an indication that the method could contain a bug. Then we really have to focus
on its internals to figure out whether it is working fine, in which case we
have to look for the problem elsewhere, or it really does mix something up.
The MapReduce approach can prove much more valuable in cases like this. We
could take the input array as the starting collection. Then we could implement
a mapping step, so that equal values are put close together – this is the
indexing step. Finally, we could reduce this collection to a single value by
picking the element with the highest count.
Here is the implementation using LINQ to Objects:
Mode FindMode(int[] array)
{
    return
        array
            .GroupBy(value => value)
            .Select(group =>
                new Mode()
                {
                    Value = group.Key,
                    Count = group.Count()
                })
            .OrderByDescending(mode => mode.Count)
            .First();
}
Once again, we have reached an implementation which is significantly shorter.
As a side note, this implementation once again reads exactly like the idea we
used to build it: group the array elements by their values, associate a count
with each value, sort the collection so that elements with the highest counts
come first, and then just pick the first one.
What happens if someone suspects that this function contains a bug? Nothing
special – just walk through the sequence of transformations and see whether it
fits the requirements. There are no loops or branching statements that are hard
to analyze. There are no boundary conditions here; those are all buried deep
inside the LINQ operators.
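For instance, running the pipeline on a small invented array:

```csharp
using System;
using System.Linq;

Mode FindMode(int[] array) =>
    array
        .GroupBy(value => value)                     // map: index equal values together
        .Select(group => new Mode
        {
            Value = group.Key,
            Count = group.Count()
        })                                           // map: pair each value with its count
        .OrderByDescending(mode => mode.Count)
        .First();                                    // reduce: pick the best candidate

Mode mode = FindMode(new[] { 4, 1, 4, 2, 4, 2 });
Console.WriteLine($"{mode.Value} appears {mode.Count} times");  // 4 appears 3 times

struct Mode
{
    public int Value;
    public int Count;
}
```

One caveat worth knowing: First throws on an empty sequence, so this version, like the procedural one, assumes a non-empty input array.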
Real World Example
Let’s get back to the e-commerce application that we have been refactoring
throughout this series of articles. The domain model defines a discount:
public interface IDiscount
{
    decimal Apply(decimal price);
}
A discount object simply receives a price and returns a modified price. For
example, a 2% discount would just multiply the price by 0.98 and return the new
price.
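As an illustration, a percentage-based implementation of the interface might look like this (the PercentageDiscount class is hypothetical, not part of the application's domain model):

```csharp
using System;

IDiscount discount = new PercentageDiscount(2);
Console.WriteLine(discount.Apply(100m));  // discount.Apply(100m) == 98m

interface IDiscount
{
    decimal Apply(decimal price);
}

// Hypothetical implementation: reduces the price by a fixed percentage.
class PercentageDiscount : IDiscount
{
    private readonly decimal _percent;

    public PercentageDiscount(decimal percent) =>
        _percent = percent;

    public decimal Apply(decimal price) =>
        price * (1 - _percent / 100);
}
```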
Implementations of the IDiscount interface are used in the Product class, which
has to report its price:
namespace Store.Domain.Implementation
{
    public class Product: IProduct
    {
        ...
        public IProduct ApplyDiscounts(IEnumerable<IDiscount> discounts)
        {
            decimal price = this.Price;

            foreach (IDiscount discount in discounts)
                price = discount.Apply(price);

            return new Product(this.Name)
            {
                Price = price
            };
        }
    }
}
When the application executes a purchase request, it first asks the product for
its price with all discounts applied. The implementation is once again
old-fashioned: there is a loop through the collection of discounts, and the
discounts are accumulated in the price variable.
We can apply the MapReduce principle here as well. There would again be no
mapping step, because every discount would only be mapped to itself. However,
the reduce step would have to be implemented. The code tells more than words,
so here it is:
namespace Store.Domain.Implementation
{
    public class Product: IProduct
    {
        ...
        public IProduct ApplyDiscounts(IEnumerable<IDiscount> discounts)
        {
            return new Product(this.Name)
            {
                Price = discounts.Aggregate(this.Price,
                    (cur, discount) => discount.Apply(cur))
            };
        }
    }
}
This implementation relies on the Aggregate method from the LINQ to Objects
library. The first argument to the Aggregate method is the seed. The second
argument is the aggregation function, which takes the running result and the
current object from the collection and produces the next running result. The
aggregation function will be called once for each object in the collection. The
end result is the final running result, and that is what the Aggregate method
returns.
The shorter version of this explanation is that the Aggregate method applies
all discounts in sequence, gradually reducing the product price, and returns
the final price with all discounts applied.
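The mechanics of Aggregate are easiest to see on plain numbers (the prices and factors here are invented): multiplying a seed price by a sequence of discount factors is the same folding pattern the Product class uses.

```csharp
using System;
using System.Linq;

decimal[] discountFactors = { 0.98m, 0.95m };  // 2% off, then a further 5% off

// The seed is the starting price; each step feeds the running price
// into the next factor, just as Aggregate feeds it into Apply above.
decimal finalPrice = discountFactors.Aggregate(
    100m,
    (runningPrice, factor) => runningPrice * factor);

Console.WriteLine(finalPrice == 93.10m);  // True: 100 * 0.98 * 0.95
```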
Needless to say, this solution is much shorter than the first one, which relied
on explicit looping and aggregating the price.
Conclusion
Whenever you write a loop in your code, you become responsible for everything
that might go wrong in it. This becomes even more obvious when the loop
contains branching statements, break and return statements, and the like.
Edge cases, boundary conditions, condition tests – all of these can become very
hard to see in code because they are often mixed together with scaffolding
code, such as loop or if-then-else statements.
By using the LINQ to Objects library, we remove all the scaffolding from our
code. The modified code is easier to read and has a natural flow. MapReduce
sequences of statements simply define a series of transformations that are
performed in a row to reach the final result. Domain logic is exposed in such
code, while scaffolding is moved out of the way.
Published: May 26, 2015; Modified: Apr 30, 2015
ZORAN HORVAT Zoran is a software architect dedicated to clean design and CTO of a growing software company. Since 2014 Zoran has been an author at Pluralsight, where he prepares a series of courses on design patterns, writing unit and integration tests, and applying methods to improve code design and long-term maintainability. Follow him on Twitter @zoranh75 to receive updates and links to new articles.
