Regex and LINQ Query to Split Text into Distinct Words
by Zoran Horvat @zoranh75

Suppose that we want to split a (multiline) text into a collection of distinct words.

We can do that by combining a regular expression, which extracts the individual words, with a LINQ query, which turns the matches into a collection of strings.

Here is the extension method which turns a given string into a collection of words that appear in it:

static class TextToWordsExtensions
{
    public static IEnumerable<string> SplitIntoWords(this string text)
    {
        // One or more Unicode letters delimited by word boundaries
        string pattern = @"\b[\p{L}]+\b";

        return
            Regex.Matches(text, pattern)
                .Cast<Match>()                          // Extract matches
                .Select(match => match.Value.ToLower()) // Change to same case
                .Distinct();                            // Remove duplicates
    }
}
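
As a quick illustration, a minimal sketch of calling the extension method on a short two-line string might look like this. The sample sentence and the Demo class are made up for the example, and the method body is repeated only so that the snippet compiles on its own. The result is sorted with an ordinal comparer to make the printed order predictable.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

static class TextToWordsExtensions
{
    public static IEnumerable<string> SplitIntoWords(this string text)
    {
        string pattern = @"\b[\p{L}]+\b";

        return
            Regex.Matches(text, pattern)
                .Cast<Match>()
                .Select(match => match.Value.ToLower())
                .Distinct();
    }
}

static class Demo
{
    static void Main()
    {
        // Hypothetical sample input, not taken from the article
        string text = "The cat saw the Cat.\nThe end.";

        // "The", "the" and "Cat", "cat" collapse into single entries
        IEnumerable<string> words =
            text.SplitIntoWords().OrderBy(word => word, StringComparer.Ordinal);

        Console.WriteLine(string.Join(", ", words));
        // Prints: cat, end, saw, the
    }
}
```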


The pattern can be changed to accommodate different rules about what counts as a word. For example, we could allow digits and underscores as parts of words:

string pattern = @"\b[\p{L}\p{N}_]+\b";
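
To see what the extended pattern changes, here is a small sketch that runs both patterns over the same input. The PatternDemo class, the Words helper and the sample string are hypothetical and exist only for this illustration. Note that the letters-only pattern does not merely trim tokens such as user_42. It skips them entirely, because \b finds no word boundary between a letter and an adjacent digit or underscore.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class PatternDemo
{
    // Hypothetical helper: returns every match of the pattern, in order
    public static IEnumerable<string> Words(string text, string pattern)
    {
        return Regex.Matches(text, pattern)
            .Cast<Match>()
            .Select(match => match.Value);
    }

    static void Main()
    {
        // Made-up sample input
        string text = "user_42 logged in from host7";

        string lettersOnly = @"\b[\p{L}]+\b";
        string withDigitsAndUnderscores = @"\b[\p{L}\p{N}_]+\b";

        Console.WriteLine(string.Join(", ", Words(text, lettersOnly)));
        // Prints: logged, in, from

        Console.WriteLine(string.Join(", ", Words(text, withDigitsAndUnderscores)));
        // Prints: user_42, logged, in, from, host7
    }
}
```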

Another possible modification is to support words with dashes:

string pattern = @"\b[\p{L}]+(-[\p{L}]+)*\b";
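
The effect of the dash-aware pattern can be sketched the same way. Again, the DashDemo class, the Words helper and the sample string are assumptions made for this example. A dash is accepted only when letters follow it, so a trailing dash ("pre-") or a free-standing dash is left out of the match.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class DashDemo
{
    // Hypothetical helper: returns every match of the pattern, in order
    public static IEnumerable<string> Words(string text, string pattern)
    {
        return Regex.Matches(text, pattern)
            .Cast<Match>()
            .Select(match => match.Value);
    }

    static void Main()
    {
        // Made-up sample input
        string text = "pre- and post-war state-of-the-art - really";

        string lettersOnly = @"\b[\p{L}]+\b";
        string withDashes = @"\b[\p{L}]+(-[\p{L}]+)*\b";

        Console.WriteLine(string.Join(", ", Words(text, lettersOnly)));
        // Prints: pre, and, post, war, state, of, the, art, really

        Console.WriteLine(string.Join(", ", Words(text, withDashes)));
        // Prints: pre, and, post-war, state-of-the-art, really
    }
}
```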


Below is a console application which turns two segments of text into collections of words.

The first text is a quote from Ernest Hemingway in English. The second text is a quote from Fyodor Dostoyevsky in Russian. This demonstrates that words can be extracted from any Unicode text, not only from text in the ASCII character set.

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Text.RegularExpressions;

namespace ConsoleDemo
{

    static class TextToWordsExtensions
    {

        public static IEnumerable<string> SplitIntoWords(this string text)
        {

            string pattern = @"\b[\p{L}]+(-[\p{L}]+)*\b";

            return
                Regex.Matches(text, pattern)
                    .Cast<Match>()                          // Extract matches
                    .Select(match => match.Value.ToLower()) // Change to same case
                    .Distinct();                            // Remove duplicates

        }

        public static IEnumerable<string> PrintWords(this IEnumerable<string> words,
                                                     int columnWidth)
        {

            // Allow non-ASCII characters on the console
            Console.OutputEncoding = Encoding.Unicode;

            int count = words.Count();
            int maxLength = words.DefaultIfEmpty(string.Empty).Max(word => word.Length);

            // Each word takes maxLength characters plus one delimiter;
            // print at least one word per row, even if it exceeds the width
            int wordsPerRow = Math.Max(1, (columnWidth + 1) / (maxLength + 1));

            words
                .Select((word, pos) => new
                {
                    Index = pos,
                    Word = word.PadRight(maxLength)
                })
                .ToList()
                .ForEach(element =>
                {

                    // End the line after the last word and after each full row
                    char delimiter = ' ';
                    if (element.Index == count - 1)
                        delimiter = '\n';
                    else if ((element.Index + 1) % wordsPerRow == 0)
                        delimiter = '\n';

                    Console.Write("{0}{1}", element.Word, delimiter);

                });

            return words;

        }
    }

    class Program
    {

        static void Main(string[] args)
        {

            string text =
                "He always thought of the sea as 'la mar'\n" +
                "which is what people call her in Spanish\n" +
                "when they love her. Sometimes those who\n" +
                "love her say bad things of her but they\n" +
                "are always said as though she were a woman.\n" +
                "Some of the younger fishermen, those who\n" +
                "used buoys as floats for their lines and\n" +
                "had motorboats, bought when the shark\n" +
                "livers had brought much money, spoke of\n" +
                "her as 'el mar' which is masculine.\n" +
                "They spoke of her as a contestant or a\n" +
                "place or even an enemy. But the old man\n" +
                "always thought of her as feminine and\n" +
                "as something that gave or withheld\n" +
                "great favours, and if she did wild or\n" +
                "wicked things it was because she\n" +
                "could not help them. The moon affects\n" +
                "her as it does a woman, he thought.";

            string text1 =
                "Но, однако ж, прибавлю, что во всякой\n" +
                "гениальной или новой человеческой мысли,\n" +
                "или просто даже во всякой серьезной\n" +
                "человеческой мысли, зарождающейся в\n" +
                "чьей-нибудь голове, всегда остается\n" +
                "нечто такое, чего никак нельзя передать\n" +
                "другим людям, хотя бы вы исписали целые\n" +
                "томы и растолковывали вашу мысль тридцать\n" +
                "пять лет; всегда останется нечто, что ни\n" +
                "за что не захочет выйти изпод вашего\n" +
                "черепа и останется при вас навеки;\n" +
                "с тем вы и умрете, не передав никому,\n" +
                "может быть, самого-то главного из\n" +
                "вашей идеи.";

            // The width of 50 is an assumed value; any width from 44 to 53
            // gives the four-column and three-column layout shown in the output
            text
                .SplitIntoWords()
                .OrderBy(word => word)
                .PrintWords(50);

            Console.WriteLine();

            text1
                .SplitIntoWords()
                .OrderBy(word => word)
                .PrintWords(50);

            Console.WriteLine();
            Console.WriteLine("Press ENTER to continue...");
            Console.ReadLine();

        }
    }
}

When this application is run, it prints the distinct words from both segments of text in alphabetical order, first in Latin and then in Cyrillic letters:

a          affects    always     an
and        are        as         bad
because    bought     brought    buoys
but        call       contestant could
did        does       el         enemy
even       favours    feminine   fishermen
floats     for        gave       great
had        he         help       her
if         in         is         it
la         lines      livers     love
man        mar        masculine  money
moon       much       not        of
old        or         people     place
said       say        sea        shark
she        some       something  sometimes
spanish    spoke      that       the
their      them       they       things
those      though     thought    used
was        were       what       when
which      who        wicked     wild
withheld   woman      younger

бы             быть           в
вас            вашего         вашей
вашу           во             всегда
всякой         вы             выйти
гениальной     главного       голове
даже           другим         ж
за             зарождающейся  захочет
и              идеи           из
изпод          или            исписали
лет            людям          может
мысли          мысль          навеки
не             нельзя         нечто
ни             никак          никому
но             новой          однако
остается       останется      передав
передать       при            прибавлю
просто         пять           растолковывали
с              самого-то      серьезной
такое          тем            томы
тридцать       умрете         хотя
целые          чего           человеческой
черепа         что            чьей-нибудь

Press ENTER to continue...
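
The English list is laid out in four columns and the Russian list in three because PrintWords divides the available width by the length of the longest word plus one delimiter. A small sketch of that arithmetic follows. The column width of 50 is an assumption rather than a value confirmed by the article, and the RowWidthDemo class and WordsPerRow helper are hypothetical names used only here.

```csharp
using System;

public static class RowWidthDemo
{
    // Integer division: each word occupies maxLength characters
    // plus one delimiter character
    public static int WordsPerRow(int columnWidth, int maxLength)
    {
        return (columnWidth + 1) / (maxLength + 1);
    }

    static void Main()
    {
        int columnWidth = 50; // assumed width, not confirmed by the article

        // Longest English word in the output: "contestant" (10 letters)
        Console.WriteLine(WordsPerRow(columnWidth, "contestant".Length));
        // Prints: 4

        // Longest Russian word in the output: "растолковывали" (14 letters)
        Console.WriteLine(WordsPerRow(columnWidth, "растолковывали".Length));
        // Prints: 3
    }
}
```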

Published: Feb 18, 2015


Zoran is a software architect dedicated to clean design and the CTO of a growing software company. Since 2014, Zoran has been an author at Pluralsight, where he is preparing a series of courses on design patterns, writing unit and integration tests, and applying methods to improve code design and long-term maintainability.
