Configurable String Matching Solution

Introduction

In previous article titled "Efficient String Matching Algorithm with Use of Wildcard Characters" (http://www.c-sharpcorner.com/UploadFile/b81385/8832/)  we have presented one method to compare strings against patterns which contain wildcard characters. In this article we are presenting classes which can be used to formalize the string comparison. Applications can offer several comparison methods and then let the caller decide which one to use in every call to a function. Classes shown in this article can help build such functionality almost without effort.

Please note that latest documentation and source code for this library is available at address http://www.sysexpand.com/?path=net/sysexpand/text/strmatch/manual/intro. All subsequent changes and enhancements will be performed at this address rather than in the article.


Problem Statement

It is a common requirement in application programming that application should at some point test whether a string matches certain pattern or to extract members of a set of strings which match the pattern. In most of the cases in practice pattern matching can be defined as one of the three types:

  • Exact - string must exactly match the pattern,
  • Matching with pattern which contains wildcards - pattern may contain wildcards that replace one or multiple characters from the string, or
  • Matching the regular expression - string is required to match the given regular expression.

What often arises as the problem when developing such applications is that matching method is not determined in advance, but it becomes known only at run time, typically being decided by the application user.

This article presents ready to use classes that implement these three string matching methods and their combinations, so that matching method can be changed dynamically at run time without changing the rest of the application.

In previous article named Efficient Algorithm for String Matching with Use of Wildcard Characters we have already explained a method applied to match strings against patterns that contain wildcards. Function presented in that article will be used in implementation of the corresponding class in this article.

Design

Classes that will be used in string matching will all implement interface named IStringMatcher, which is declared like this:

public interface IStringMatcher
{
    string Pattern { get; set; }
    bool IsMatch(string value);
}

All types that implement this interface will expose a read-write property Pattern. This property is writeable so that one pattern matching object can be used on multiple patterns and multiple input strings during its lifetime. Making this property writeable allows application to create only one instance of appropriate class which implements IStringMatcher interface and then to use it for all matching. In other words, every class implementing IStringMatcher interface only captures the matching algorithm, rather than a particular pattern against which strings are matched using that algorithm. The second member of the interface is the IsMatch method, which is used to test whether the contained pattern matches the string passed as the parameter.

We are providing three basic implementations of this interface, named ExactMatcher, WildcardMatcher and RegexMatcher. As their names imply, these classes implement the three matching methods listed in the introduction. In addition to members defined by the IStringMatcher interface, WildcardMatcher class also provides properties that can be used to set wildcard characters used in the pattern, because these characters may vary across different systems.

There is one more class provided, named CombinedStringMatcher, which also implements IStringMatcher interface. Instances of this class can contain multiple instances of IStringMatcher and its IsMatch method returns true if at least one contained matcher's IsMatch method returns true. Each of the contained matchers can generally be of different class and can have its own pattern different than the others. In this way we can ask whether a given string can be matched against any of the patterns given.

Implementation

In this section we are providing full source code of all utility classes that implement IStringMatcher interface. For explanations on the way in which WildcardMatcher implements IsMatch function, please refer to article Efficient Algorithm for String Matching with Use of Wildcard Characters.

using System.Text.RegularExpressions;
using System.Collections.Generic;
using System.Text;

namespace SysExpand.Text.StringMatching
{

    public class ExactMatcher: IStringMatcher
    {

        public ExactMatcher() { }
        public ExactMatcher(string pattern) { _pattern = pattern; }

        public string Pattern
        {
            Get { return _pattern ?? string.Empty; }
            Set { _pattern = value; }
        }
        public bool IsMatch(string value)
        {
            return Pattern == (value ?? string.Empty);
        }
        private string _pattern;
    }

    public class WildcardMatcher: IStringMatcher
    {
        public WildcardMatcher(): this(null, DefaultSingleWildcard, DefaultMultipleWildcard) { }
        public WildcardMatcher(string pattern): this(pattern, DefaultSingleWildcard, DefaultMultipleWildcard) { }
        public WildcardMatcher(string pattern, char singleWildcard, char multipleWildcard)
        {
            _pattern = pattern;
            _singleWildcard = singleWildcard;
            _multipleWildcard = multipleWildcard;
        }

        public string Pattern
        {
            Get { return _pattern ?? string.Empty; }
            Set { _pattern = value; }
        }

        public bool IsMatch(string value)
        {

            int[] inputPosStack = new int[(value.Length + 1) * (Pattern.Length + 1)];   // Stack containing input positions that should be tested for further matching
            int[] patternPosStack = new int[inputPosStack.Length];                      // Stack containing pattern positions that should be tested for further matching
            int stackPos = -1;                                                          // Points to last occupied entry in stack; -1 indicates that stack is empty
            bool[,] pointTested = new bool[value.Length + 1, Pattern.Length + 1];       // Each true value indicates that input position vs. pattern position has been tested
 
            int inputPos = 0;   // Position in input matched up to the first multiple wildcard in pattern
            int patternPos = 0;
// Position in pattern matched up to the first multiple wildcard in pattern

            if (_pattern == null)
                _pattern = string.Empty;

            // Match beginning of the string until first multiple wildcard in pattern
            while (inputPos < value.Length && patternPos < Pattern.Length && Pattern[patternPos] != MultipleWildcard && (value[inputPos] == Pattern[patternPos] || Pattern[patternPos] == SingleWildcard))
            {
                inputPos++;
                patternPos++;
            }

            // Push this position to stack if it points to end of pattern or to a general wildcard character
            if (patternPos == _pattern.Length || _pattern[patternPos] == _multipleWildcard)
            {
                pointTested[inputPos, patternPos] = true;
                inputPosStack[++stackPos] = inputPos;
                patternPosStack[stackPos] = patternPos;
            }

            bool matched = false;

            // Repeat matching until either string is matched against the pattern or no more parts remain on stack to test
            while (stackPos >= 0 && !matched)
            {

                inputPos = inputPosStack[stackPos];         // Pop input and pattern positions from stack
                patternPos = patternPosStack[stackPos--];   // Matching will succeed if rest of the input string matches rest of the pattern
 
                if (inputPos == value.Length && patternPos == Pattern.Length)
                    matched = true;     // Reached end of both pattern and input string, hence matching is successful
                else
                {
                    // First character in next pattern block is guaranteed to be multiple wildcard
                   
// So skip it and search for all matches in value string until next multiple wildcard character is reached in pattern

                    for (int curInputStart = inputPos; curInputStart < value.Length; curInputStart++)
                    {

                        int curInputPos = curInputStart;
                        int curPatternPos = patternPos + 1;

                        if (curPatternPos == Pattern.Length)
                        {   // Pattern ends with multiple wildcard, hence rest of the input string is matched with that character
                            curInputPos = value.Length;
                        }
                        else
                        {

                            while (curInputPos < value.Length && curPatternPos < Pattern.Length && Pattern[curPatternPos] != MultipleWildcard &&
                                (value[curInputPos] == Pattern[curPatternPos] || Pattern[curPatternPos] == SingleWildcard))
                            {
                                curInputPos++;
                                curPatternPos++;
                            }

                        }

                        // If we have reached next multiple wildcard character in pattern without breaking the matching sequence, then we have another candidate for full match
                        // This candidate should be pushed to stack for further processing
                        // At the same time, pair (input position, pattern position) will be marked as tested, so that it will not be pushed to stack later again
                        if (((curPatternPos == Pattern.Length && curInputPos == value.Length) || (curPatternPos < Pattern.Length && Pattern[curPatternPos] == MultipleWildcard))
                            && !pointTested[curInputPos, curPatternPos])
                        {
                            pointTested[curInputPos, curPatternPos] = true;
                            inputPosStack[++stackPos] = curInputPos;
                            patternPosStack[stackPos] = curPatternPos;
                        }

                    }
                }

            }
 
            return matched;

        }

        public char SingleWildcard
        {
            Get { return _singleWildcard; }
            Set { _singleWildcard = value; }
        }
        public char MultipleWildcard
        {
            Get { return _multipleWildcard; }
            Set { _multipleWildcard = value; }
        }

        /// </summary>
        private string _pattern;
        private char _singleWildcard;
        private char _multipleWildcard;
        public const char DefaultSingleWildcard = '?';
        public const char DefaultMultipleWildcard = '*';

    }

    public class RegexMatcher: IStringMatcher
    {
        public RegexMatcher() { }
        public RegexMatcher(string pattern) { _pattern = pattern; }

        public string Pattern
        {
            Get { return _pattern ?? string.Empty; }
            Set { _pattern = value; }
        }

        public bool IsMatch(string value)
        {
            if (string.IsNullOrEmpty(_pattern))
                throw new System.InvalidOperationException("Cannot match strings before regular expression pattern is set.");
            Regex reg = new Regex(_pattern);
            return reg.IsMatch(value);
        }
        private string _pattern;
    }

    public class CombinedStringMatcher: IStringMatcher
    {
        public CombinedStringMatcher() { _matchers = new List<IStringMatcher>(); }
        public CombinedStringMatcher(IStringMatcher[] matchers): this() { AddRange(matchers); }

        public void Add(IStringMatcher matcher)
        {

            if (matcher == null)
                throw new System.ArgumentNullException("Cannot add null reference string matcher.");
            _matchers.Add(matcher);
        }
        public void AddRange(IStringMatcher[] matchers)
        {

            if (matchers == null)
                throw new System.ArgumentNullException("Array of matchers cannot be null.");

            for (int i = 0; i < matchers.Length; i++)
                Add(matchers[i]);   // May throw new System.ArgumentNullException
        }

        public string Pattern
        {
            get
            {

                StringBuilder sb = new StringBuilder();
                bool first = true;

                foreach (IStringMatcher matcher in _matchers)
                {
                    sb.AppendFormat("{0}({1})", first ? string.Empty : " OR ", matcher.Pattern);
                    first = false;
                }

                return sb.ToString();

            }
            set
            {
                throw new System.InvalidOperationException("Cannot set pattern on CombinedStringMatcher.");
            }
        }

        public bool IsMatch(string value)
        {

            bool matched = false;

            foreach (IStringMatcher matcher in _matchers)
                if (matcher.IsMatch(value))
                {
                    matched = true;
                    break;
                }

            return matched;

        }
        List<IStringMatcher> _matchers;

    }

}

How to use

In this section we will give a short example which shows how to use string matching classes. Code in the example lists all files from a given directory that match one of several name patterns.

using System;
using SysExpand.Text.StringMatching;
using System.IO;

namespace ConsoleApplication
{
    class Program
    {

        static void ListFiles(string directory, IStringMatcher matcher)
        {

            DirectoryInfo dir = new DirectoryInfo(directory);
            FileInfo[] files = dir.GetFiles();
            int count = 0;
            long size = 0;

            Console.WriteLine("{0} {1}", matcher.GetType().Name, matcher.Pattern);
            Console.WriteLine("----------------------------------");

            for (int i = 0; i < files.Length; i++)
                if (matcher.IsMatch(files[i].Name.ToLower()))
                {
                    Console.WriteLine(files[i].Name);
                    count++;
                    size += files[i].Length;
                }

            Console.WriteLine();
            Console.WriteLine("{0} file(s) ({1} total)", count, files.Length);
            Console.WriteLine("{0} byte(s)", size);
            Console.WriteLine("----------------------------------");
            Console.WriteLine();
 
        }

        static void Main(string[] args)
        {
           
            string path = Environment.GetFolderPath(Environment.SpecialFolder.System);

            IStringMatcher matcher =
                new CombinedStringMatcher(
                        new IStringMatcher[]
                        {
                            new ExactMatcher("notepad.exe"),
                            new WildcardMatcher("??ta*.*"),
                            new RegexMatcher(".*\\.(com|ico)")
                        }
                    );

            ListFiles(path, new WildcardMatcher("sys*.exe"));
            ListFiles(path, matcher);
            Console.ReadLine();

        }
    }
}


This example relies on the ListFiles function which receives directory path and string matcher. Function lists all files from the directory that are matched with specified matcher. Note that matching algorithm is not known to this function due to encapsulation provided by IStringMatcher interface.

The fact that ListFiles operates on any string matching algorithm is then used to list files in two distinct ways: first all executables having name starting with "sys" are listed; in second path complex pattern is created so that notepad.exe, files having "ta" on positions 2 and 3 in the name and all *.com, *.ico files are listed.

Output of the program may look like this:

WildcardMatcher sys*.exe
----------------------------------
syskey.exe
systeminfo.exe
SystemPropertiesAdvanced.exe
SystemPropertiesComputerName.exe
SystemPropertiesDataExecutionPrevention.exe
SystemPropertiesHardware.exe
SystemPropertiesPerformance.exe
SystemPropertiesProtection.exe
SystemPropertiesRemote.exe
systray.exe

10 file(s) (2281 total)
686080 byte(s)
----------------------------------

CombinedStringMatcher (notepad.exe) OR (??ta*.*) OR (.*\.(com|ico))
----------------------------------
chcp.com
dataclen.dll
diskcomp.com
diskcopy.com
format.com
mode.com
more.com
mstask.dll
netapi32.dll
notepad.exe
PerfCenterCpl.ico
tree.com

12 file(s) (2281 total)
715328 byte(s)
----------------------------------

Conclusion

In this article we have provided full solution to strings matching problem when application must be able to change the string matching rules at run time. Using these classes is quite simple. They allow application to be coded in terms of IStringMatcher interface on all places where string matching functionality is required. Particular implementation of string matcher can then be selected at run time without affecting the code which relies on string matching outcome.

Once again, please refer to the library home page http://www.sysexpand.com/?path=net/sysexpand/text/strmatch/manual/intro at which latest version of code and documentation for this library are available.