by Zoran Horvat
Suppose that we want to split a (multiline) text into a collection of distinct words.
We can do that by combining a regular expression, which extracts the individual words, with a LINQ query, which turns the matches into a collection of strings.
Here is the extension method which turns a given string into a collection of words that appear in it:
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

static class TextToWordsExtensions
{
    public static IEnumerable<string> SplitIntoWords(this string text)
    {
        // One or more Unicode letters, delimited by word boundaries
        string pattern = @"\b[\p{L}]+\b";
        return
            Regex.Matches(text, pattern)
                .Cast<Match>()                          // Extract matches
                .Select(match => match.Value.ToLower()) // Change to same case
                .Distinct();                            // Remove duplicates
    }
}
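A quick way to see what the method returns is to feed it a short sentence. The sketch below repeats the extension method so that it compiles on its own; the `SplitIntoWordsDemo` class and its `Sample` method are hypothetical names introduced only for this illustration.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

static class TextToWordsExtensions
{
    public static IEnumerable<string> SplitIntoWords(this string text)
    {
        string pattern = @"\b[\p{L}]+\b";
        return
            Regex.Matches(text, pattern)
                .Cast<Match>()                          // Extract matches
                .Select(match => match.Value.ToLower()) // Change to same case
                .Distinct();                            // Remove duplicates
    }
}

static class SplitIntoWordsDemo
{
    // Hypothetical helper: splits a two-line sample text and joins
    // the distinct words with commas so the result is easy to inspect.
    public static string Sample()
    {
        string text = "The cat saw the other Cat.\nThe END";
        return string.Join(", ", text.SplitIntoWords());
    }
}
```

Calling `SplitIntoWordsDemo.Sample()` should yield `the, cat, saw, other, end`: punctuation and the line break are dropped, and "The", "the" and "Cat" collapse into words that were already seen.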
This function can be changed to accommodate different rules for what counts as a word. For example, we could allow numbers and underscores as parts of words:
string pattern = @"\b[\p{L}\p{N}_]+\b";
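To see why this change matters, it may help to run both patterns on a token that mixes letters, an underscore and digits. In .NET, `\b` treats digits and the underscore as word characters, so the letters-only pattern finds no boundary inside such a token and skips it entirely. The sketch below is only an illustration; `IdentifierPatternDemo` and its methods are hypothetical names.

```csharp
using System.Linq;
using System.Text.RegularExpressions;

static class IdentifierPatternDemo
{
    // Hypothetical helper: returns all matches of a pattern joined by '|'
    public static string Matches(string text, string pattern) =>
        string.Join("|",
            Regex.Matches(text, pattern)
                .Cast<Match>()
                .Select(match => match.Value));

    // Original pattern: letters only
    public static string LettersOnly(string text) =>
        Matches(text, @"\b[\p{L}]+\b");

    // Modified pattern: letters, numbers and underscores
    public static string WithDigitsAndUnderscore(string text) =>
        Matches(text, @"\b[\p{L}\p{N}_]+\b");
}
```

For the input `"user_id42 is set"`, `LettersOnly` should return `is|set`, while `WithDigitsAndUnderscore` should return `user_id42|is|set`.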
Another possible modification is to support words with dashes:
string pattern = @"\b[\p{L}]+(-[\p{L}]+)*\b";
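The effect of the dash-aware pattern can be checked in the same way. The following sketch, with the hypothetical `DashedWordsDemo` class, compares it with the letters-only pattern; note that a dash surrounded by spaces is not part of any word and is ignored by both.

```csharp
using System.Linq;
using System.Text.RegularExpressions;

static class DashedWordsDemo
{
    // Hypothetical helper: returns all matches of a pattern joined by '|'
    static string Matches(string text, string pattern) =>
        string.Join("|",
            Regex.Matches(text, pattern)
                .Cast<Match>()
                .Select(match => match.Value));

    // Letters only: a dash always ends the word
    public static string LettersOnly(string text) =>
        Matches(text, @"\b[\p{L}]+\b");

    // Letters, optionally followed by dash-separated groups of letters
    public static string WithDashes(string text) =>
        Matches(text, @"\b[\p{L}]+(-[\p{L}]+)*\b");
}
```

For the input `"well-known state-of-the-art - yes"`, `LettersOnly` should return `well|known|state|of|the|art|yes`, while `WithDashes` should return `well-known|state-of-the-art|yes`.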
Below is a console application which turns two segments of text into collections of words.
The first text is a quote from Ernest Hemingway in English. The second text is a quote from Fyodor Dostoyevsky in Russian. This demonstrates that words can be extracted from any Unicode text, not only from text in the ASCII character set.
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Text.RegularExpressions;

namespace ConsoleDemo
{
    static class TextToWordsExtensions
    {
        public static IEnumerable<string> SplitIntoWords(this string text)
        {
            string pattern = @"\b[\p{L}]+(-[\p{L}]+)*\b";
            return
                Regex.Matches(text, pattern)
                    .Cast<Match>()                          // Extract matches
                    .Select(match => match.Value.ToLower()) // Change to same case
                    .Distinct();                            // Remove duplicates
        }

        public static IEnumerable<string> PrintWords(this IEnumerable<string> words,
                                                     int columnWidth)
        {
            // Use UTF-8 output so that non-ASCII letters print correctly
            Console.OutputEncoding = Encoding.UTF8;
            int count = words.Count();
            int maxLength = words.DefaultIfEmpty(string.Empty).Max(word => word.Length);
            // Number of padded words that fit into one row (at least one)
            int wordsPerRow = Math.Max(1, (columnWidth + 1) / (maxLength + 1));
            words
                .Select((word, pos) => new
                {
                    Index = pos,
                    Word = word.PadRight(maxLength)
                })
                .ToList()
                .ForEach(element =>
                {
                    // End the row after the last word and after every full row
                    char delimiter = ' ';
                    if (element.Index == count - 1)
                        delimiter = '\n';
                    else if ((element.Index + 1) % wordsPerRow == 0)
                        delimiter = '\n';
                    Console.Write("{0}{1}", element.Word, delimiter);
                });
            Console.WriteLine();
            return words;
        }
    }
    class Program
    {
        static void Main(string[] args)
        {
            string text =
                "He always thought of the sea as 'la mar'\n" +
                "which is what people call her in Spanish\n" +
                "when they love her. Sometimes those who\n" +
                "love her say bad things of her but they\n" +
                "are always said as though she were a woman.\n" +
                "Some of the younger fishermen, those who\n" +
                "used buoys as floats for their lines and\n" +
                "had motorboats, bought when the shark\n" +
                "livers had brought much money, spoke of\n" +
                "her as 'el mar' which is masculine.\n" +
                "They spoke of her as a contestant or a\n" +
                "place or even an enemy. But the old man\n" +
                "always thought of her as feminine and\n" +
                "as something that gave or withheld\n" +
                "great favours, and if she did wild or\n" +
                "wicked things it was because she\n" +
                "could not help them. The moon affects\n" +
                "her as it does a woman, he thought.";
            string text1 =
                "Но, однако ж, прибавлю, что во всякой\n" +
                "гениальной или новой человеческой мысли,\n" +
                "или просто даже во всякой серьезной\n" +
                "человеческой мысли, зарождающейся в\n" +
                "чьей-нибудь голове, всегда остается\n" +
                "нечто такое, чего никак нельзя передать\n" +
                "другим людям, хотя бы вы исписали целые\n" +
                "томы и растолковывали вашу мысль тридцать\n" +
                "пять лет; всегда останется нечто, что ни\n" +
                "за что не захочет выйти изпод вашего\n" +
                "черепа и останется при вас навеки;\n" +
                "с тем вы и умрете, не передав никому,\n" +
                "может быть, самого-то главного из\n" +
                "вашей идеи.";

            text
                .SplitIntoWords()
                .OrderBy(word => word)
                .PrintWords(50);

            text1
                .SplitIntoWords()
                .OrderBy(word => word)
                .PrintWords(50);

            Console.WriteLine("Press ENTER to continue...");
            Console.ReadLine();
        }
    }
}
When this application is run, it prints the distinct words from both segments of text in alphabetical order, in both Latin and Cyrillic letters:
a affects always an
and are as bad
because bought brought buoys
but call contestant could
did does el enemy
even favours feminine fishermen
floats for gave great
had he help her
if in is it
la lines livers love
man mar masculine money
moon much not of
old or people place
said say sea shark
she some something sometimes
spanish spoke that the
their them they things
those though thought used
was were what when
which who wicked wild
withheld woman younger
бы быть в
вас вашего вашей
вашу во всегда
всякой вы выйти
гениальной главного голове
даже другим ж
за зарождающейся захочет
и идеи из
изпод или исписали
лет людям может
мысли мысль навеки
не нельзя нечто
ни никак никому
но новой однако
остается останется передав
передать при прибавлю
просто пять растолковывали
с самого-то серьезной
такое тем томы
тридцать умрете хотя
целые чего человеческой
черепа что чьей-нибудь
Press ENTER to continue...
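The Cyrillic words in this output are found because `\p{L}` matches any Unicode letter. As a rough illustration of the difference, the sketch below compares it with an ASCII-only character class; `UnicodeLettersDemo` and its methods are hypothetical names, and the ASCII pattern is shown only for contrast.

```csharp
using System.Linq;
using System.Text.RegularExpressions;

static class UnicodeLettersDemo
{
    // Hypothetical helper: returns all matches of a pattern joined by '|'
    static string Matches(string text, string pattern) =>
        string.Join("|",
            Regex.Matches(text, pattern)
                .Cast<Match>()
                .Select(match => match.Value));

    // ASCII letters only: words in other scripts are not matched
    public static string AsciiLetters(string text) =>
        Matches(text, @"\b[A-Za-z]+\b");

    // Any Unicode letter, as used throughout this article
    public static string UnicodeLetters(string text) =>
        Matches(text, @"\b[\p{L}]+\b");
}
```

For the input `"мысль thought"`, `AsciiLetters` should return only `thought`, while `UnicodeLetters` should return `мысль|thought`.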
Zoran Horvat is the Principal Consultant at Coding Helmet, speaker and author of 100+ articles, and independent trainer on .NET technology stack. He can often be found speaking at conferences and user groups, promoting object-oriented and functional development style and clean coding practices and techniques that improve longevity of complex business applications.