LINQ Query for Blacklist-Based Spam Filter

by Zoran Horvat

In many cases we need a feature that blocks unwanted messages from passing through the system. We can rely on elaborate solutions, but very often that is overkill. Most unwanted messages can be removed by a simple blacklist filter: a list of forbidden words.

If we are not willing to invest in a larger solution, we can implement a simple LINQ expression that detects blacklisted words in a block of text:

// Requires System.Collections.Generic, System.Linq and System.Text.RegularExpressions.
bool IsSpam(string text, IEnumerable<string> wordBlacklist)
{

    string pattern = @"\b[\p{L}]+\b";

    return
        Regex.Matches(text, pattern)
            .Cast<Match>()                                // Extract matches
            .Select(match => match.Value.ToLower())       // Convert to lower case
            .Where(word => wordBlacklist.Contains(word))  // Find in blacklist
            .Any();                                       // Stop when first match found

}

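This implementation is based on a regular expression that detects words in plain text: the pattern \b[\p{L}]+\b matches each run of Unicode letters delimited by word boundaries. Every word is converted to lower case before the lookup, so the blacklist entries themselves must be written in lower case. The expression can be changed to fit different needs. Please refer to Regex and LINQ Query to Split Text into Distinct Words for more options.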
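As one illustration of changing the expression, the sketch below passes the word pattern in as a parameter and compares the original pattern with a variant that also accepts digits and apostrophes. The class name SpamFilterSketch, the constant names and the second pattern are assumptions made for this example only; they are not part of the original implementation.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

static class SpamFilterSketch
{
    // Pattern used in the article: runs of Unicode letters only.
    public const string LettersOnly = @"\b[\p{L}]+\b";

    // Hypothetical variant: letters and decimal digits, with optional
    // apostrophe-joined parts, so "don't" and "v1agra" count as single words.
    public const string LettersDigitsApostrophes =
        @"\b[\p{L}\p{Nd}]+(?:'[\p{L}\p{Nd}]+)*\b";

    // Same query as IsSpam, but the caller chooses the word pattern.
    public static bool IsSpam(
        string text, IEnumerable<string> wordBlacklist, string pattern)
    {
        return Regex.Matches(text, pattern)
            .Cast<Match>()                                      // Extract matches
            .Select(match => match.Value.ToLowerInvariant())    // Culture-independent lower case
            .Any(word => wordBlacklist.Contains(word));         // Stop at first blacklisted word
    }
}
```

With the letters-only pattern, a token such as "V1agra" yields no word at all, because no word boundary falls between a letter and a digit. The variant pattern extracts it as one word, so a blacklist entry "v1agra" would then be detected.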

Demonstration

Let’s try this implementation on a passage from Ernest Hemingway’s "The Old Man and the Sea". In this demonstration, we assume that messages containing the words "purse", "masculine" or "buy" are spam and should be eliminated.

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

namespace SpamFilterDemo
{

    class Program
    {

        static bool IsSpam(string text, IEnumerable<string> wordBlacklist)
        {

            string pattern = @"\b[\p{L}]+\b";

            return
                Regex.Matches(text, pattern)
                    .Cast<Match>()                                // Extract matches
                    .Select(match => match.Value.ToLower())       // Convert to lower case
                    .Where(word => wordBlacklist.Contains(word))  // Find in blacklist
                    .Any();                                       // Stop when first match found

        }

        static void Main(string[] args)
        {

            string text =
                "He always thought of the sea as 'la mar'\n" +
                "which is what people call her in Spanish\n" +
                "when they love her. Sometimes those who\n" +
                "love her say bad things of her but they\n" +
                "are always said as though she were a woman.\n" +
                "Some of the younger fishermen, those who\n" +
                "used buoys as floats for their lines and\n" +
                "had motorboats, bought when the shark\n" +
                "livers had brought much money, spoke of\n" +
                "her as 'el mar' which is masculine.\n" +
                "They spoke of her as a contestant or a\n" +
                "place or even an enemy. But the old man\n" +
                "always thought of her as feminine and\n" +
                "as something that gave or withheld\n" +
                "great favours, and if she did wild or\n" +
                "wicked things it was because she\n" +
                "could not help them. The moon affects\n" +
                "her as it does a woman, he thought.";

            // Blacklist entries must be lower case, because IsSpam lower-cases each word
            string[] blacklist = { "purse", "masculine", "buy" };

            if (IsSpam(text, blacklist))
                Console.WriteLine("Ernest Hemingway is marked as spammer.");

            Console.ReadLine();

        }
    }
}

When this code is run, it produces the following output:

Ernest Hemingway is marked as spammer.
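
The passage is flagged because it contains the word "masculine". The words "buoys" and "bought" do not trigger the entry "buy", because the filter compares whole words rather than substrings. To see which entries caused a match, the query can return the offending words instead of a Boolean. The following is a minimal sketch under stated assumptions: the class SpamReport and the method FindBlacklistedWords are hypothetical names, and the case-insensitive HashSet is a substitution for the original ToLower() call followed by a linear Contains().

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

static class SpamReport
{
    // Hypothetical helper: returns the distinct blacklisted words found in the text.
    public static IReadOnlyList<string> FindBlacklistedWords(
        string text, IEnumerable<string> wordBlacklist)
    {
        // Assumption: ordinal case-insensitive comparison is acceptable here.
        // The set gives constant-time lookups instead of scanning the blacklist.
        var blacklist = new HashSet<string>(wordBlacklist, StringComparer.OrdinalIgnoreCase);

        return Regex.Matches(text, @"\b[\p{L}]+\b")
            .Cast<Match>()                                  // Extract matches
            .Select(match => match.Value)                   // Keep the word as written
            .Where(word => blacklist.Contains(word))        // Keep blacklisted words only
            .Distinct(StringComparer.OrdinalIgnoreCase)     // Report each word once
            .ToList();
    }
}
```

Unlike IsSpam, this version scans the whole text before returning, so it suits logging or diagnostics better than a fast reject path.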
