http://www.codinghelmet.com/  

Wear a helmet. Even when coding.

exercises > array-intersection

Exercise #20: Finding Intersection of Two Unsorted Arrays
by Zoran Horvat @zoranh75

Problem Statement

Given two unsorted arrays containing integer numbers, write a function that returns the third array or ot suitable collection (e.g. a list) which is the intersection of the two arrays. This means that the resulting array should only contain elements that appear in both input arrays. Order of elements in the resulting array is irrelevant.

Example: The first array is 3, 1, 2, 3, 4, 2, 3. The second array is 2, 3, 6, 2, 5, 2, 2, 3. Intersection of these two arrays is 3, 2, 3, 2.

Problem Analysis

In this task we need to traverse through two arrays, initially unsorted, and find matching elements that appear in both arrays. We do not want to presume any particular traversal, but it would certainly be beneficial if number of times that we access array elements is kept low. On the other hand, space required to run the algorithm should also be reduced, if possible.

Indexing Solution

Let’s start cracking the problem by providing some very simple solution. For example, we could index somehow one of the arrays, then iterate through the other array. Each element that is found in the index is pushed to the output and simultaneously removed from the index. Critical characteristics of this solution are then space taken by the index and time it takes to insert, find and remove items from it. Most of the data structures that could be used here take space proportional to number of items stored. In terms of time, hash table is probably the best option. It requires O(1) time to insert, find and remove items. If array lengths are n and m, it takes O(n) total time to index one array. Pass through the second array, including finding elements in the hash table would take O(m) time. Finally, removing elements that are found in the hash table would take O(min(m, n)) time in worst case. From structure size point of view, it makes sense to index the shorter of the arrays. Bottom line is that this solution requires O(n+m) time and O(min(n, m)) space. Here is the pseudocode which implements this solution:

function Intersection(a, b, n, m)
    -- a - array
    -- b - array
    -- n - length of a
    -- m - length of b
begin

    -- intersection - resulting array

    if (n > m)
        intersection = Intersection(b, a, m, n);
    else
        begin

            -- hash – hash table

            for i = 1 to n
            if hash contains a[i] then
                hash[a[i]] = hash[a[i]] + 1
            else
                add mapping a[i] -> 1 to hash

            for j = 1 to m
            begin

                if hash contains b[i] AND hash[b[i]] > 1 then
            append b[i] to intersection
            hash[b[i]] = hash[b[i]] - 1
        else if hash contains b[i] AND hash[b[i]] = 0 then
            append b[i] to intersection
            remove b[i] from hash

            end

        end

    return intersection;

end

Note that hash table in this implementation is used as a multiset. Each element added to the hash table is associated with its count. When element is first added to the hash table, its count is one. Each next time, the count for the element is incremented by one. Similarly, whenever an element is found in the hash table, its count is decremented by one. Should the count fall down to zero for any element, corresponding entry is removed from the hash table.

Quality of this solution is that it solves the problem in time proportional to just iterating through both arrays - hardly that we could expect anything more efficient than that. However, with memory consumption proportional to size of the array, there is plenty of room for improvements.

Brute Force Solution

In order to reduce memory footprint we could now search for an in-place solution. One very simple algorithm would be to compare each element of one array with each element of the other array. Time complexity of this solution is O(n*m), which is much worse than linear time provided by the hashing solution. But space consumption is now reduced to O(1). Here is the pseudocode:

function Intersection(a, b, n, m, out k)
    -- a – array
    -- b – array
    -- n – length of a
    -- m – length of b
    -- k – on output contains length of intersection
begin

    k = 0;

    for i = 1 to n
        begin

            for j = k + 1 to m
                begin

                    if a[i] = b[j] then
                        begin
                            k = k + 1
                            tmp = b[k]
                            b[k] = b[j]
                            b[j] = tmp
                            j = m    -- exit the loop
                        end

                end

        end

    intersection = empty array
    if k > 0 then
        intersection = subarray(b, 1, k)    -- first k elements

    return intersection

end

There are a couple of details in this function which require explanations. When element is found in the array, it is moved towards the beginning of the array and observed start of the array is increased by one. In this way, we guarantee that the same element is not going to be found again. But when that process completes, we note that elements at the beginning of the array are the intersection we were looking for.

Sorting Solution

We could now try to enhance the in-place intersecting algorithm presented in previous section. Our goal is to keep space requirement at O(1), but to reduce time complexity below quadratic time. One way to proceed is to sort both arrays and then to iterate through them in parallel, extracting equal elements along the way. Here is the pseudocode which does precisely that.

function] Intersection(a, b, n, m)
    -- a - array
    -- b - array
    -- n - length of a
    -- m - length of b
begin

    sort a
    sort b

    i = 1
    k = 1

    intersection - empty array

    while i <= n AND j <= m
        begin

        if a[i] = b[i] then
            append a[i] to intersection
        i = i + 1
        j = j + 1
        else if a[i] < b[j] then
            i = i + 1
        else
            j = j + 1

    end

    return intersection

end

This solution possesses several positive properties. First of all, it is more efficient: it takes O(nlogn) time to sort array a, O(mlogm) time to sort array b and O(n+m) time to iterate through both arrays and produce the resulting array. Total execution time is then O(nlogn+mlogm). Space required is still O(1).

Drawback of this solution is that sorting both arrays completely might be an overkill. Consider arrays [1000, 999, 998, ..., 1] and [2000, 1999, 1998, ..., 1001]. We could sort these arrays in roughly 20,000 comparisons (2*1,000*log2(1,000)). But it is obvious that intersection of these two arrays is an empty set, simply because maximum value from the first array is smaller than the minimum value from the second array. Since minimum and maximum values of two unsorted arrays can be found in O(n+m) steps, it looks like there is at least some room for improvement. That is what we are going to try in the following section.

Partial Sorting Solution

Suppose that we know the median of the first array. We could partition the array around the median in the same way as it is done in QuickSort and similar algorithms. The goal of the partitioning step is to produce partial ordering of the array. All values less than the chosen value (pivot) should be moved towards the beginning of the array. All values greater than the pivot should be moved towards the end of the array. And middle positions should be populated by elements that are equal to the pivot value.

The trick at this point is to partition the second array using exactly the same pivot value. In this way we are producing such state in which whole sections of the arrays are guaranteed to contain disjoint values.

Partially sorted arrays

The gain obtained this way is twofold. First, we don't need to compare elements from crossing sections of the arrays: we know that these sections are disjoint and there are no elements that would contribute to the intersection. Second, more important benefit, is that from time to time one of the sections will be empty. In that case the whole corresponding section in the other array can be ignored. That is the real benefit from partially sorting the arrays: parts that are obviously disjoint will quickly be removed from further processing.

Below is the pseudocode for this solution.

function IntersectionPartialSort(a, b, n, m)
    -- a - array
    -- b - array
    -- n - length of a
    -- m - length of b
begin
    return IntersectionRecursive(a, b, 1, n, 1, m)
end

function IntersectionRecursive(a, b, aStart, aEnd, bStart, bEnd)
    -- a - array
    -- b - array
    -- aStart - first position in a to intersect
    -- aEnd - last position in a to intersect
    -- bStart - first position in b to intersect
    -- bEnd - last position in b to intersect
begin

    intersection - resulting array

    pivot = MedianOfMedians(a, aStart, aEnd);

    Partition(a, aStart, aEnd, pivot, out aPivotStart, out aPivotLen)
    Partition(b, bStart, bEnd, pivot, out bPivotStart, out bPivotLen)

    -- Find intersection of pivots
    for i = 1 to min(aPivotLen, bPivotLen)
        append pivot to intersection

    -- Recursively solve left partitions
    if aPivotStart > aStart AND bPivotStart > bStart then
        begin
        part = IntersectionRecursive(a, b, aStart, aPivotStart - 1,
                                     bStart, bPivotStart - 1)
        append part to intersection
    end

    -- Recursively solve right partitions
    if aPivotStrat + aPivotLen < aEnd AND
       bPivotStart + bPivotLen < bEnd then
        begin
            part = IntersectionRecursive(a, b, aPivotStart + aPivotLen, aEnd,
                                         bPivotStart + bPivotLen, bEnd)
            append part to intersection
    end

end

function Partition(a, start, end, pivot, out pivotStart, out pivotLen)
    -- a - array
    -- start - starting position to partition
    -- end - ending position to partition
    -- pivot - pivot value around which to partition
    -- pivotStrat - on output first position at which pivot occurs
    -- pivotLen - on output total number of occurrences of pivot
begin

    i = start

    while i <= end
        begin

        if a[i] < pivot then
            a[start] = a[i]
        i = i + 1
        start = start + 1
        else if a[i] = pivot then
            i = i + 1
        else
            tmp = a[end]
        a[end] = a[i]
        a[i] = tmp
        end = end - 1

    end

    for i = start to end
        a[i] = pivot

    pivotStart = start
    pivotLen = end - start + 1

end

function MedianOfMedians(a, int start, int finish)
    -- a - array
    -- strat - starting index from which median is calcuated
    -- finish - ending index until which median is calculated
begin

    while finish > start + 1
        begin

        pos = start

            for segmentStart = start to finish step 5
                begin

            segmentEnd = segmentStart + 5
                    if segmentEnd > end then
                        segmentEnd = end;

                    for i = segmentStart to segmentEnd - 1
                        for j = i + 1 to segmentEnd
                            if a[i] > a[j] then
                    begin
                                    tmp = a[i]
                                    a[i] = a[j]
                    a[j] = tmp
                end

                    middlePos = (segmentStart + segmentEnd) / 2
                    tmp = a[pos]
                    a[pos] = a[middlePos]
                    a[middlePos] = tmp

                    pos = pos + 1

                end

             finish = pos;

        end

    return a[start];

end

Worst case running time of this solution is the same as for full sorting solution: O(nlogn+mlogm). What we are hoping for is improved average running time thanks to stopping condition which ends the processing of parts of the array that do not have an overlapping part in another array.

Implementation

In this section we are providing implementation in C# for full sorting and for partial sorting solution.

using System;
using System.Collections;
using System.Collections.Generic;

namespace Intersection
{

    public class Program
    {

        static List<int> IntersectionPartialSort(int[] a,int[] b)
        {

            List<int> intersection = new List<int>();

            IntersectionRecursive(a, b, 0, a.Length - 1,
                                  0, b.Length - 1, intersection);

            return intersection;

        }

        static void IntersectionRecursive(int[] a,int[] b,
                                         int aStart,int aEnd,
                                         int bStart,int bEnd,
                                          List<int> intersection)
        {

            int pivot = MedianOfMedians(a, aStart, aEnd);

            int aPivotStart;
            int aPivotLen;
            int bPivotStart;
            int bPivotLen;

            Partition(a, aStart, aEnd, pivot,
                      out aPivotStart, out aPivotLen);

            Partition(b, bStart, bEnd, pivot,
                      out bPivotStart, out bPivotLen);

            // Find intersection of pivots
            for (int i = 0; i < aPivotLen && i < bPivotLen; i++)
                intersection.Add(pivot);

            // Recursively solve left partitions
            if (aPivotStart > aStart && bPivotStart > bStart)
                IntersectionRecursive(a, b, aStart, aPivotStart - 1,
                                      bStart, bPivotStart - 1,
                                      intersection);

            // Recursively solve right partitions
            if (aPivotStart + aPivotLen - 1 < aEnd &&
                bPivotStart + bPivotLen - 1 < bEnd)
                IntersectionRecursive(a, b, aPivotStart + aPivotLen, aEnd,
                                      bPivotStart + bPivotLen, bEnd,
                                      intersection);

        }

        static void Partition(int[] a,int start,int end,int pivot,
                              out int pivotStart, out int pivotLen)
        {

            int i = start;

            while (i <= end)
            {

                if (a[i] < pivot)
                {
                    a[start++] = a[i++];
                }
                else if (a[i] == pivot)
                {
                    i++;
                }
                else
                {
                   int tmp = a[end];
                    a[end] = a[i];
                    a[i] = tmp;
                    end--;
                }

            }

            for (i = start; i <= end; i++)
                a[i] = pivot;

            pivotStart = start;
            pivotLen = end - start + 1;

        }

        static int MedianOfMedians(int[] a,int start,int end)
        {

            while (end > start + 1)
            {

               int pos = start;

                for (int segment = start; segment < end; segment += 5)
                {

                   int tmp;

                   int segmentEnd = segment + 5;
                    if (segmentEnd > end)
                        segmentEnd = end;

                    for (int i = segment; i < segmentEnd - 1; i++)
                        for (int j = i + 1; j < segmentEnd; j++)
                            if (a[i] > a[j])
                            {
                                tmp = a[i];
                                a[i] = a[j];
                                a[j] = tmp;
                            }

                    int middlePos = (segment + segmentEnd - 1) / 2;
                    tmp = a[pos];
                    a[pos] = a[middlePos];
                    a[middlePos] = tmp;
                    pos++;

                }

                end = pos;

            }

            return a[start];

        }

        static List<int> IntersectionFullSort(int[] a,int[] b)
        {

            Array.Sort(a);
            Array.Sort(b);

            List<int> intersection = new List<int>();

            int i = 0;
            int j = 0;

            while (i < a.Length && j < b.Length)
            {
                if (a[i] == b[j])
                {
                    intersection.Add(a[i]);
                    i++;
                    j++;
                }
                else if (a[i] < b[j])
                {
                    i++;
                }
                else
                {
                    j++;
                }
            }

            return intersection;

        }

        static List<int> IntersectionIndexing(int[] a,int[] b)
        {

            List<int> intersection = null;

            if (a.Length > b.Length)
            {
                intersection = IntersectionIndexing(b, a);
            }
            else
            {

                Hashtable hash = new Hashtable();

                intersection = new List<int>();

                for (int i = 0; i < a.Length; i++)
                {
                    if (hash.Contains(a[i]))
                        hash[a[i]] = (int)hash[a[i]] + 1;
                    else
                        hash.Add(a[i], 1);
                }

                for (int i = 0; i < b.Length; i++)
                {

                   int count = 0;
                    if (hash.Contains(b[i]))
                        count = (int)hash[b[i]];

                    if (count > 1)
                    {
                        intersection.Add(b[i]);
                        hash[b[i]] = count - 1;
                    }
                    else if (count == 1)
                    {
                        intersection.Add(b[i]);
                        hash.Remove(b[i]);
                    }
                }
            }

            return intersection;

        }

        static int[] CopyArray(int[] a)
        {
           int[] b = newint[a.Length];
            a.CopyTo(b, 0);
            return b;
        }

        static int[] CreateArray(int len,int rangeLow,int rangeHigh)
        {

           int[] a = newint[len];

            for (int i = 0; i < len; i++)
                a[i] = _rnd.Next(rangeLow, rangeHigh);

            return a;

        }

        delegate List<int> IntersectionAlgorithm(int[] a,int[] b);

        static void RunPerformanceTest()
        {

           int range = 50000;
           int length = 300000;
           int repeatsCount = 10;

            double[,] timeSpent = new double[3, 101];

            IntersectionAlgorithm[] algorithms =
                new IntersectionAlgorithm[3]
                {
                    IntersectionFullSort,
                    IntersectionPartialSort,
                    IntersectionIndexing
                };

            for (int repeat = 0; repeat < repeatsCount; repeat++)
            {
                for (int overlap = 0; overlap <= 100; overlap++)
                {

                   int snap = range * (100 - overlap)  100;

                   int[] a = CreateArray(length, 0, range);
                   int[] b = CreateArray(length, snap, range + snap);

                    for (int alg = 0; alg < algorithms.Length; alg++)
                    {

                       int[] aCopy = CopyArray(a);
                       int[] bCopy = CopyArray(b);

                        DateTime tsStart = DateTime.UtcNow;
                        algorithms[alg](aCopy, bCopy);
                        DateTime tsEnd = DateTime.UtcNow;

                        timeSpent[alg, overlap] +=
                            (tsEnd - tsStart).TotalMilliseconds;

                    }

                }

            }

            for (int percent = 0; percent <= 100; percent++)
            {
                Console.Write("{0,4}%", percent);
                for (int i = 0; i < 3; i++)
                    Console.Write("\t{0:0}",
                                  timeSpent[i, percent]  repeatsCount);
                Console.WriteLine();
            }

        }

        static int[] ReadArray()
        {

            Console.Write("Enter array elements (ENTER to exit): ");

            string[] elements =
                Console
                .ReadLine()
                .Split(new char[] { ' ' },
                       StringSplitOptions.RemoveEmptyEntries);

           int[] a = null;

            if (elements.Length > 0)
            {
                a = new int[elements.Length];
                for (int i = 0; i < elements.Length; i++)
                    a[i] =int.Parse(elements[i]);
            }

            return a;

        }

        static void PrintList(List<int> list)
        {
            foreach (int value in list)
                Console.Write("{0,4}", value);
            Console.WriteLine();
        }

        static void Main(string[] args)
        {

            Console.Write("Execute performance test? (y/n) ");
            if (Console.ReadLine().ToLower() == "y")
                RunPerformanceTest();

            while (true)
            {

               int[] a = ReadArray();
                if (a == null)
                    break;

               int[] b = ReadArray();
                if (b == null)
                    break;

                List<int> intersection = IntersectionPartialSort(a, b);

                Console.WriteLine("Array intersection is:");
                PrintList(intersection);

            }

        }

        static Random _rnd = new Random();

    }
}

Demonstration

When this application is run, it produces the following output:

Execute performance test? (y/n) n

Enter array elements (ENTER to exit): 3 1 2 3 4 2 3
Enter array elements (ENTER to exit): 2 3 6 2 5 2 2 3
Array intersection is:    2   2   3   3

Enter array elements (ENTER to exit):

Output shows that partial sorting algorithm correctly identifies values that appear in both input arrays. This is not a surprise. But what might surprise many is the diagram that follows.

Execution time

Picture above shows running times of three intersection algorithms. Horizontal axis indicates how much data from the two arrays are overlapping. For example, if first array contains random data in range 1-1000 and second array random data in range 701-1700, then overlapping is 30%. The first line represents full sorting algorithm with asymptotic complexity O(nlogn). Second line represents partial sorting algorithm with same complexity O(nlogn) and optimized value selection and. Finally, the third line represents indexing algorithm with complexity O(n). While asymptotic complexity indicates that running time might be lower for indexing algorithm, this solution actually performs much worse than solutions with higher complexity.

There are several reasons that lead to this consequence. Probably most important one has to do with CPU architecture. Here is the very short explanation of what happens when we iterate through two arrays simultaneously, as it is done in the IntersectionFullSort. (The fact that arrays are sorted plays no role in this analysis.) What CPU internally does when it accesses an array element is to immediately request values from operating memory which immediately follows the accessed location. This is done because there is expectation that application might ask for the nearby data soon, which is absolutely true when iterating through an array. These prefetched values are stored in CPU cache - small but very fast portion of memory, located physically near or on the same chip with CPU. Accessing data from cache is multiple times faster than reading the same data from the operating memory.

Now let's turn to the indexing solution. It first adds all elements of one array to a hash table. Then it iterates through the other array and queries the hash table for each of the values it encounters. While doing that, function pays heavy penalty by not being able to profit on spatial locality of data. Values are scattered around the hash table (that is how hash table works in the first place). Result is that any given value is likely not to be found in CPU cache. This is simply because portion of the hash table which contains the requested value is probably not very near to any other part visited recently.

These processes are very complex in their nature and it is often hard to foresee their effects. But when comparing time needed to iterate through an array with time needed to query the hash table same number of times, result becomes much more predictable. Simple iteration takes advantage of using CPU cache and executes several times faster than the competing hashing algorithm.

Conclusion is that constant factor in indexing solution is so large that it makes the whole function slower than any other, despite the optimistic asymptotic complexity.

But what about the partial sorting solution? It makes less comparisons searching for common values, thanks to its ability to discard whole sections of arrays. But in order to discover which sections can be discarded, this function performs quite a lot of analysis: finding a good pivot and partitioning arrays. Result is again high constant factor, making this function consistently slower than the full sorting solution.

What happened here is what often happens when theory hits the reality. Execution time is not the same as asymptotic complexity. It is not even proportional to the expression depicted by the asymptotic complexity, but the proportion varies with dimension of data set at hand. Shortest and most elegant solution has won the competition. It relies on sorting function borrowed from the framework, function which is beyond any doubt highly optimized.

Moral of the story is that most optimized algorithm is not necessarily the fastest one. It's often the other way around. Short code with simple flow, like iteration, which relies on built-in functions is often the fastest one in practice, despite the slight handicap in terms of asymptotic complexity.

See also:

Published: Dec 25, 2013; Modified: Sep 16, 2015

ZORAN HORVAT

Zoran is software architect dedicated to clean design and CTO in a growing software company. Since 2014 Zoran is an author at Pluralsight where he is preparing a series of courses on object-oriented and functional design, design patterns, writing unit and integration tests and applying methods to improve code design and long-term maintainability.

Follow him on Twitter @zoranh75 to receive updates and links to new articles.

Watch Zoran's video courses at pluralsight.com (requires registration):

Making Your C# Code More Object-Oriented

This course will help leverage your conceptual understanding to produce proper object-oriented code, where objects will completely replace procedural code for the sake of flexibility and maintainability. More...

Advanced Defensive Programming Techniques

This course will lead you step by step through the process of developing defensive design practices, which can substitute common defensive coding, for the better of software design and implementation. More...

Tactical Design Patterns in .NET: Creating Objects

This course sheds light on issues that arise when implementing creational design patterns and then provides practical solutions that will make our code easier to write and more stable when running. More...

Tactical Design Patterns in .NET: Managing Responsibilities

Applying a design pattern to a real-world problem is not as straight-forward as literature implicitly tells us. It is a more engaged process. This course gives an insight to tactical decisions we need to make when applying design patterns that have to do with separating and implementing class responsibilities. More...

Tactical Design Patterns in .NET: Control Flow

Improve your skills in writing simpler and safer code by applying coding practices and design patterns that are affecting control flow. More...

Writing Highly Maintainable Unit Tests

This course will teach you how to develop maintainable and sustainable tests as your production code grows and develops. More...

Improving Testability Through Design

This course tackles the issues of designing a complex application so that it can be covered with high quality tests. More...

Share this article

webmasters