Regex and LINQ Query to Split Text into Distinct Words
by Zoran Horvat @zoranh75

Suppose that we want to split a (multiline) text into a collection of distinct words.

We can do that by combining a regular expression that extracts individual words with a LINQ query that turns the matches into a collection of strings.

Here is an extension method that turns a given string into a collection of the distinct words that appear in it:

static class TextToWordsExtensions
{
    public static IEnumerable<string> SplitIntoWords(this string text)
    {

        string pattern = @"\b[\p{L}]+\b";

        return
            Regex.Matches(text, pattern)
                .Cast<Match>()                          // Extract matches
                .Select(match => match.Value.ToLower()) // Change to same case
                .Distinct();                            // Remove duplicates

    }
}
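
As a quick check of the method above, the following sketch applies it to a short two-line sentence. The sample text, the Demo class and its SortedWords helper are made up for illustration and are not part of the original code; the extension method is repeated only so that the snippet compiles on its own.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

static class TextToWordsExtensions
{
    public static IEnumerable<string> SplitIntoWords(this string text)
    {
        string pattern = @"\b[\p{L}]+\b";

        return
            Regex.Matches(text, pattern)
                .Cast<Match>()                          // Extract matches
                .Select(match => match.Value.ToLower()) // Change to same case
                .Distinct();                            // Remove duplicates
    }
}

static class Demo
{
    // Hypothetical helper: sorts the distinct words so the result is easy to compare.
    public static string[] SortedWords(string text) =>
        text.SplitIntoWords()
            .OrderBy(word => word, StringComparer.Ordinal)
            .ToArray();

    static void Main()
    {
        string sample = "The cat saw the other cat.\nThe END.";

        // Prints: cat end other saw the
        Console.WriteLine(string.Join(" ", SortedWords(sample)));
    }
}
```

Punctuation and the line break are skipped, and "The", "the" and "END" are folded to lower case before duplicates are removed.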

This function can be adapted to different rules for what counts as a word. For example, we could allow numbers and underscores as parts of words:

string pattern = @"\b[\p{L}\p{N}_]+\b";
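
To see what this change does in practice, here is a minimal sketch that runs both patterns over the same text. The ExtractWords helper, the PatternComparison class and the sample string are assumptions introduced for the comparison; the helper follows the same pipeline as SplitIntoWords but takes the pattern as a parameter and sorts the result.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

static class PatternComparison
{
    // Hypothetical helper: same steps as SplitIntoWords, with the pattern passed in.
    public static List<string> ExtractWords(string text, string pattern) =>
        Regex.Matches(text, pattern)
            .Cast<Match>()
            .Select(match => match.Value.ToLower())
            .Distinct()
            .OrderBy(word => word, StringComparer.Ordinal)
            .ToList();

    static void Main()
    {
        string sample = "max_value is 42 and x2 wins";

        // Letters only. Prints: and, is, wins
        Console.WriteLine(string.Join(", ", ExtractWords(sample, @"\b[\p{L}]+\b")));

        // Letters, numbers and underscores. Prints: 42, and, is, max_value, wins, x2
        Console.WriteLine(string.Join(", ", ExtractWords(sample, @"\b[\p{L}\p{N}_]+\b")));
    }
}
```

Note that the letters-only pattern does not return "max", "value" or "x" as separate words. The \b anchor treats digits and underscores as word characters, so there is no word boundary between "max" and "_" or between "x" and "2", and those tokens are skipped entirely.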

Another possible modification is to support words with dashes:

string pattern = @"\b[\p{L}]+(-[\p{L}]+)*\b";
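
The effect of the dash-aware pattern can be sketched the same way. The HyphenDemo class, its ExtractWords helper and the sample sentence are hypothetical names used only for this illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

static class HyphenDemo
{
    // Hypothetical helper: extracts distinct lower-case matches and sorts them.
    public static List<string> ExtractWords(string text, string pattern) =>
        Regex.Matches(text, pattern)
            .Cast<Match>()
            .Select(match => match.Value.ToLower())
            .Distinct()
            .OrderBy(word => word, StringComparer.Ordinal)
            .ToList();

    static void Main()
    {
        string sample = "A well-known, state-of-the-art approach - mostly.";
        string pattern = @"\b[\p{L}]+(-[\p{L}]+)*\b";

        // Prints: a, approach, mostly, state-of-the-art, well-known
        Console.WriteLine(string.Join(", ", ExtractWords(sample, pattern)));
    }
}
```

A dash is kept only when it sits between two runs of letters, so the free-standing dash before "mostly" is not reported as a word or attached to a neighbor.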

Demonstration

Below is a console application which turns two segments of text into collections of words.

The first text is a quote from Ernest Hemingway in English. The second text is a quote from Fyodor Dostoyevsky in Russian. This demonstrates that the pattern extracts words written in any Unicode letters, not only those in the ASCII character set.

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Text.RegularExpressions;

namespace ConsoleDemo
{

    static class TextToWordsExtensions
    {

        public static IEnumerable<string> SplitIntoWords(this string text)
        {

            string pattern = @"\b[\p{L}]+(-[\p{L}]+)*\b";

            return
                Regex.Matches(text, pattern)
                    .Cast<Match>()                          // Extract matches
                    .Select(match => match.Value.ToLower()) // Change to same case
                    .Distinct();                            // Remove duplicates

        }

        public static IEnumerable<string> PrintWords(this IEnumerable<string> words,
                                                     int columnWidth)
        {

            Console.OutputEncoding = Encoding.Unicode; // UTF-16; the Unicode property is defined on Encoding

            int count = words.Count();
            int maxLength = words.DefaultIfEmpty(string.Empty).Max(word => word.Length);
            int wordsPerRow = (columnWidth + 1) / (maxLength + 1);

            words
                .Select((word, pos) => new
                {
                    Index = pos,
                    Word = word.PadRight(maxLength)
                })
                .ToList()
                .ForEach(element =>
                {

                    char delimiter = ' ';
                    if (element.Index == count - 1)
                        delimiter = '\n';
                    else if ((element.Index + 1) % wordsPerRow == 0)
                        delimiter = '\n';

                    Console.Write("{0}{1}", element.Word, delimiter);

                });

            Console.WriteLine();

            return words;

        }
    }

    class Program
    {

        static void Main(string[] args)
        {

            string text =
                "He always thought of the sea as 'la mar'\n" +
                "which is what people call her in Spanish\n" +
                "when they love her. Sometimes those who\n" +
                "love her say bad things of her but they\n" +
                "are always said as though she were a woman.\n" +
                "Some of the younger fishermen, those who\n" +
                "used buoys as floats for their lines and\n" +
                "had motorboats, bought when the shark\n" +
                "livers had brought much money, spoke of\n" +
                "her as 'el mar' which is masculine.\n" +
                "They spoke of her as a contestant or a\n" +
                "place or even an enemy. But the old man\n" +
                "always thought of her as feminine and\n" +
                "as something that gave or withheld\n" +
                "great favours, and if she did wild or\n" +
                "wicked things it was because she\n" +
                "could not help them. The moon affects\n" +
                "her as it does a woman, he thought.";

            string text1 =
                "Но, однако ж, прибавлю, что во всякой\n" +
                "гениальной или новой человеческой мысли,\n" +
                "или просто даже во всякой серьезной\n" +
                "человеческой мысли, зарождающейся в\n" +
                "чьей-нибудь голове, всегда остается\n" +
                "нечто такое, чего никак нельзя передать\n" +
                "другим людям, хотя бы вы исписали целые\n" +
                "томы и растолковывали вашу мысль тридцать\n" +
                "пять лет; всегда останется нечто, что ни\n" +
                "за что не захочет выйти изпод вашего\n" +
                "черепа и останется при вас навеки;\n" +
                "с тем вы и умрете, не передав никому,\n" +
                "может быть, самого-то главного из\n" +
                "вашей идеи.";

            text
                .SplitIntoWords()
                .OrderBy(word => word)
                .PrintWords(50);

            text1.SplitIntoWords()
                .OrderBy(word => word)
                .PrintWords(50);

            Console.WriteLine("Press ENTER to continue...");
            Console.ReadLine();

        }
    }
}

When this application is run, it prints the distinct words from both segments of text, first in Latin and then in Cyrillic letters:

a          affects    always     an
and        are        as         bad
because    bought     brought    buoys
but        call       contestant could
did        does       el         enemy
even       favours    feminine   fishermen
floats     for        gave       great
had        he         help       her
if         in         is         it
la         lines      livers     love
man        mar        masculine  money
moon       much       not        of
old        or         people     place
said       say        sea        shark
she        some       something  sometimes
spanish    spoke      that       the
their      them       they       things
those      though     thought    used
was        were       what       when
which      who        wicked     wild
withheld   woman      younger

бы             быть           в
вас            вашего         вашей
вашу           во             всегда
всякой         вы             выйти
гениальной     главного       голове
даже           другим         ж
за             зарождающейся  захочет
и              идеи           из
изпод          или            исписали
лет            людям          может
мысли          мысль          навеки
не             нельзя         нечто
ни             никак          никому
но             новой          однако
остается       останется      передав
передать       при            прибавлю
просто         пять           растолковывали
с              самого-то      серьезной
такое          тем            томы
тридцать       умрете         хотя
целые          чего           человеческой
черепа         что            чьей-нибудь

Press ENTER to continue...

Published: Feb 18, 2015
