LINQ Query for Blacklist-Based Spam Filter
by Zoran Horvat @zoranh75

We often need to block unwanted messages from passing through a system. Elaborate solutions exist, but very often they are overkill. Most unwanted messages can be removed by a simple blacklist filter – a list of forbidden words.

If we are not willing to invest in a large solution, we can implement a simple LINQ expression that detects blacklisted words in a block of text:

bool IsSpam(string text, IEnumerable<string> wordBlacklist)
{

    string pattern = @"\b[\p{L}]+\b";

    return
        Regex.Matches(text, pattern)
            .Cast<Match>()                                // Extract matches
            .Select(match => match.Value.ToLower())       // Convert to lower case
            .Where(word => wordBlacklist.Contains(word))  // Find in blacklist
            .Any();                                       // Stop when first match found

}

This implementation is based on a regular expression that detects words in plain text: \p{L} matches any Unicode letter, and \b marks a word boundary. The expression can be changed to fit different needs. Please refer to the article "Regex and LINQ Query to Split Text into Distinct Words" for more options.
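
To see what the pattern treats as a word, the matching step can be isolated from the blacklist lookup. The sketch below is illustrative only: the WordExtractor class and its ExtractWords method are hypothetical names that do not appear in the original code, and the sample sentence in the note after it is made up.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

static class WordExtractor
{
    // Hypothetical helper (not part of the original article): returns the
    // lower-cased words that the pattern finds, in order of appearance.
    public static List<string> ExtractWords(string text)
    {
        string pattern = @"\b[\p{L}]+\b";   // Same pattern as in IsSpam

        return
            Regex.Matches(text, pattern)
                .Cast<Match>()                            // Extract matches
                .Select(match => match.Value.ToLower())   // Convert to lower case
                .ToList();                                // Materialize the words
    }
}
```

For the input "Buy now, it's 50% off!" this helper yields the words buy, now, it, s and off. Punctuation and digits never become part of a word, and the apostrophe splits "it's" into two tokens, which is worth keeping in mind when choosing blacklist entries.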

Demonstration

Let’s try this implementation on a passage from Ernest Hemingway’s "The Old Man and the Sea". In this demonstration, we assume that messages containing the words "purse", "masculine" or "buy" are spam and should be eliminated.

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

namespace MinWeightPath
{

    class Program
    {

        static bool IsSpam(string text, IEnumerable<string> wordBlacklist)
        {

            string pattern = @"\b[\p{L}]+\b";

            return
                Regex.Matches(text, pattern)
                    .Cast<Match>()                                // Extract matches
                    .Select(match => match.Value.ToLower())       // Convert to lower case
                    .Where(word => wordBlacklist.Contains(word))  // Find in blacklist
                    .Any();                                       // Stop when first match found

        }

        static void Main(string[] args)
        {

            string text =
                "He always thought of the sea as 'la mar'\n" +
                "which is what people call her in Spanish\n" +
                "when they love her. Sometimes those who\n" +
                "love her say bad things of her but they\n" +
                "are always said as though she were a woman.\n" +
                "Some of the younger fishermen, those who\n" +
                "used buoys as floats for their lines and\n" +
                "had motorboats, bought when the shark\n" +
                "livers had brought much money, spoke of\n" +
                "her as 'el mar' which is masculine.\n" +
                "They spoke of her as a contestant or a\n" +
                "place or even an enemy. But the old man\n" +
                "always thought of her as feminine and\n" +
                "as something that gave or withheld\n" +
                "great favours, and if she did wild or\n" +
                "wicked things it was because she\n" +
                "could not help them. The moon affects\n" +
                "her as it does a woman, he thought.";

            string[] blacklist = { "purse", "masculine", "buy" };

            if (IsSpam(text, blacklist))
                Console.WriteLine("Ernest Hemingway is marked as spammer.");

            Console.ReadLine();

        }
    }
}

When this code is run, it produces the following output:

Ernest Hemingway is marked as spammer.
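
The passage is flagged because of the word "masculine". The word "buoys" does not trigger the "buy" entry, since only whole words are compared. One thing to note is that, when the blacklist is an array as in this demonstration, the Contains call scans the blacklist for every word in the text. For longer blacklists, one possible refinement is to copy the words into a HashSet with a case-insensitive comparer, which also makes the lower-casing step unnecessary. The following is only a sketch of that idea: the FastSpamFilter class and the IsSpamFast method are hypothetical names, not part of the original code.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

static class FastSpamFilter
{
    // Hypothetical variant of IsSpam that looks words up in a hash set.
    public static bool IsSpamFast(string text, IEnumerable<string> wordBlacklist)
    {
        string pattern = @"\b[\p{L}]+\b";

        // Case-insensitive set: "Buy" and "buy" count as the same word
        var blacklist =
            new HashSet<string>(wordBlacklist, StringComparer.OrdinalIgnoreCase);

        return
            Regex.Matches(text, pattern)
                .Cast<Match>()                                    // Extract matches
                .Any(match => blacklist.Contains(match.Value));   // Stop when first match found
    }
}
```

Building the set costs one pass over the blacklist on every call, so this variant is likely to pay off mainly for long texts or large blacklists. If the same blacklist is reused across many messages, the set could be built once and passed in instead.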

Published: Sep 1, 2015; Modified: Jun 8, 2015
