Regex and LINQ Query to Split Text into Distinct Words

by Zoran Horvat

Suppose that we want to split a (multiline) text into a collection of distinct words.

We can do that by combining a regular expression, which extracts the individual words, with a LINQ query, which turns the matches into a collection of distinct strings.

Here is an extension method that turns a given string into a collection of the distinct words that appear in it, converted to lowercase:

static class TextToWordsExtensions
{
    public static IEnumerable<string> SplitIntoWords(this string text)
    {

        string pattern = @"\b[\p{L}]+\b";

        return
            Regex.Matches(text, pattern)
                .Cast<Match>()                          // Extract matches
                .Select(match => match.Value.ToLower()) // Change to same case
                .Distinct();                            // Remove duplicates

    }
}
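
As a quick illustration of how the method might be called, here is a minimal sketch. The sample string and the UsageDemo class are made up for this example; the extension method is repeated only so that the snippet compiles on its own.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

static class TextToWordsExtensions
{
    public static IEnumerable<string> SplitIntoWords(this string text)
    {
        string pattern = @"\b[\p{L}]+\b";

        return Regex.Matches(text, pattern)
            .Cast<Match>()
            .Select(match => match.Value.ToLower())
            .Distinct();
    }
}

// Hypothetical demo class, not part of the original code
static class UsageDemo
{
    static void Main()
    {
        // Made-up multiline sample with repeated words in different case
        string sample = "The sea, the SEA\nand the old man.";

        // Punctuation and line breaks are dropped, duplicates collapse
        Console.WriteLine(string.Join(" ", sample.SplitIntoWords()));
        // prints: the sea and old man
    }
}
```

The words come back in order of first appearance here, which is how LINQ to Objects behaves in practice; sort explicitly with OrderBy if the order matters.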

This function can be changed to accommodate a different set of rules about what counts as a word. For example, we could allow numbers and underscores as parts of words:

string pattern = @"\b[\p{L}\p{N}_]+\b";
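
To see what this change means in practice, the sketch below runs both patterns over the same made-up sample. The Words helper and the PatternComparison class are hypothetical names introduced only for this comparison; the helper is the same pipeline as SplitIntoWords with the pattern passed in as a parameter.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical class for illustration only
static class PatternComparison
{
    // Same pipeline as SplitIntoWords, but the pattern is a parameter
    public static IEnumerable<string> Words(string text, string pattern)
    {
        return Regex.Matches(text, pattern)
            .Cast<Match>()
            .Select(match => match.Value.ToLower())
            .Distinct();
    }

    static void Main()
    {
        string sample = "file_name2 and x86 build";

        // Letters only: tokens that contain a digit or an underscore are
        // skipped entirely, because \b never falls between a letter and
        // another word character such as '_' or '8'
        Console.WriteLine(string.Join(" ", Words(sample, @"\b[\p{L}]+\b")));
        // prints: and build

        // Letters, numbers and underscores: whole identifiers are kept
        Console.WriteLine(string.Join(" ", Words(sample, @"\b[\p{L}\p{N}_]+\b")));
        // prints: file_name2 and x86 build
    }
}
```

Note that the letters-only pattern does not extract "file" or "name" from "file_name2"; it drops the whole token.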

Another possible modification is to support hyphenated words, so that a word such as "well-known" is kept as a single word rather than split in two:

string pattern = @"\b[\p{L}]+(-[\p{L}]+)*\b";
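
The effect of the optional hyphenated groups can be checked with a small sketch like the one below. The sample sentence, the Words helper and the HyphenComparison class are assumptions made for this example, not part of the original code.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical class for illustration only
static class HyphenComparison
{
    // Same pipeline as SplitIntoWords, but the pattern is a parameter
    public static IEnumerable<string> Words(string text, string pattern)
    {
        return Regex.Matches(text, pattern)
            .Cast<Match>()
            .Select(match => match.Value.ToLower())
            .Distinct();
    }

    static void Main()
    {
        string sample = "A well-known state-of-the-art method - really";

        // Letters only: each hyphen acts as a separator
        Console.WriteLine(string.Join(" ", Words(sample, @"\b[\p{L}]+\b")));
        // prints: a well known state of the art method really

        // Hyphen-aware: runs of letters joined by single hyphens stay
        // together, while the standalone dash is still ignored
        Console.WriteLine(string.Join(" ", Words(sample, @"\b[\p{L}]+(-[\p{L}]+)*\b")));
        // prints: a well-known state-of-the-art method really
    }
}
```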

Demonstration

Below is a console application that turns two segments of text into collections of words.

The first text is a quote from Ernest Hemingway in English. The second text is a quote from Fyodor Dostoyevsky in Russian. This demonstrates that the pattern extracts words written in any Unicode script, not only words in the ASCII character set.

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Text.RegularExpressions;

namespace ConsoleDemo
{

    static class TextToWordsExtensions
    {

        public static IEnumerable<string> SplitIntoWords(this string text)
        {

            string pattern = @"\b[\p{L}]+(-[\p{L}]+)*\b";

            return
                Regex.Matches(text, pattern)
                    .Cast<Match>()                          // Extract matches
                    .Select(match => match.Value.ToLower()) // Change to same case
                    .Distinct();                            // Remove duplicates

        }

        public static IEnumerable<string> PrintWords(this IEnumerable<string> words,
                                                     int columnWidth)
        {

            // Encoding.Unicode is UTF-16; it lets the console print the Cyrillic words
            Console.OutputEncoding = Encoding.Unicode;

            int count = words.Count();
            int maxLength = words.DefaultIfEmpty(string.Empty).Max(word => word.Length);
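            // Number of padded words (plus one separator each) that fit in a row;
            // at least one, so the modulo below never divides by zero
            int wordsPerRow = Math.Max(1, (columnWidth + 1) / (maxLength + 1));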

            words
                .Select((word, pos) => new
                {
                    Index = pos,
                    Word = word.PadRight(maxLength)
                })
                .ToList()
                .ForEach(element =>
                {

                    char delimiter = ' ';
                    if (element.Index == count - 1)
                        delimiter = '\n';
                    else if ((element.Index + 1) % wordsPerRow == 0)
                        delimiter = '\n';

                    Console.Write("{0}{1}", element.Word, delimiter);

                });

            Console.WriteLine();

            return words;

        }
    }

    class Program
    {

        static void Main(string[] args)
        {

            string text =
                "He always thought of the sea as 'la mar'\n" +
                "which is what people call her in Spanish\n" +
                "when they love her. Sometimes those who\n" +
                "love her say bad things of her but they\n" +
                "are always said as though she were a woman.\n" +
                "Some of the younger fishermen, those who\n" +
                "used buoys as floats for their lines and\n" +
                "had motorboats, bought when the shark\n" +
                "livers had brought much money, spoke of\n" +
                "her as 'el mar' which is masculine.\n" +
                "They spoke of her as a contestant or a\n" +
                "place or even an enemy. But the old man\n" +
                "always thought of her as feminine and\n" +
                "as something that gave or withheld\n" +
                "great favours, and if she did wild or\n" +
                "wicked things it was because she\n" +
                "could not help them. The moon affects\n" +
                "her as it does a woman, he thought.";

            string text1 =
                "Но, однако ж, прибавлю, что во всякой\n" +
                "гениальной или новой человеческой мысли,\n" +
                "или просто даже во всякой серьезной\n" +
                "человеческой мысли, зарождающейся в\n" +
                "чьей-нибудь голове, всегда остается\n" +
                "нечто такое, чего никак нельзя передать\n" +
                "другим людям, хотя бы вы исписали целые\n" +
                "томы и растолковывали вашу мысль тридцать\n" +
                "пять лет; всегда останется нечто, что ни\n" +
                "за что не захочет выйти изпод вашего\n" +
                "черепа и останется при вас навеки;\n" +
                "с тем вы и умрете, не передав никому,\n" +
                "может быть, самого-то главного из\n" +
                "вашей идеи.";

            text
                .SplitIntoWords()
                .OrderBy(word => word)
                .PrintWords(50);

            text1.SplitIntoWords()
                .OrderBy(word => word)
                .PrintWords(50);

            Console.WriteLine("Press ENTER to continue...");
            Console.ReadLine();

        }
    }
}

When this application is run, it prints the distinct words from both segments of text in alphabetical order, first those in Latin and then those in Cyrillic letters:

            
a          affects    always     an
and        are        as         bad
because    bought     brought    buoys
but        call       contestant could
did        does       el         enemy
even       favours    feminine   fishermen
floats     for        gave       great
had        he         help       her
if         in         is         it
la         lines      livers     love
man        mar        masculine  money
moon       much       not        of
old        or         people     place
said       say        sea        shark
she        some       something  sometimes
spanish    spoke      that       the
their      them       they       things
those      though     thought    used
was        were       what       when
which      who        wicked     wild
withheld   woman      younger

бы             быть           в
вас            вашего         вашей
вашу           во             всегда
всякой         вы             выйти
гениальной     главного       голове
даже           другим         ж
за             зарождающейся  захочет
и              идеи           из
изпод          или            исписали
лет            людям          может
мысли          мысль          навеки
не             нельзя         нечто
ни             никак          никому
но             новой          однако
остается       останется      передав
передать       при            прибавлю
просто         пять           растолковывали
с              самого-то      серьезной
такое          тем            томы
тридцать       умрете         хотя
целые          чего           человеческой
черепа         что            чьей-нибудь

Press ENTER to continue...
                
    

