Calculating the normalized compression distance between two strings


This article describes the Normalized Compression Distance (NCD).  The NCD is an approach that is used for clustering. It's based on algorithmic complexity developed by Kolmogorov, called Normalized Information Distance (NID). NCD can be used to cluster objects of any kind, such as music, texts, or gene sequences (microarray classification). Other areas are plagiarism detection and visual data mining.

 

This article shows how to use a compressor such as GZip to calculate the NCD. Microsoft Visual C# is used to implement the sample program.

 

Suppose we have two string sequences x and y, and the concatenated sequence xy, and a compressor C with C(s) a function returning the number of bytes of the compressed strings.

 

C(xy) will have (almost) the same number of bytes as C(x) when x = y. The more y looks like x the more redundancy will be met by the compressor, resulting in C(xy) bytes coming closer to the number of bytes of C(x). One can extract a measure of similarity between two objects based on this phenomenon.

 

Precisely, NCD(x,y)  =  C(xy) - min{C(x), C(y)} / max{C(x), C(y)}, 

 

0 <= NCD(x, y) <= 1.

 

When NCD(x, y) = 0, then x and y are similar, if NCD(x,y) = 1, they are dissimilar. The distance is used to cluster objects. Note: the size of the concatenated string must lie within the block size of the compressor. We will choose a block size of 900000 bytes.

 

For this program we need a compressor GZip in Microsoft Visual C#. We will use the free DLL SharpZipLib.dll that ships with ICSharpCode. Download the assemblies from the site http://www.icsharpcode.net/OpenSource/SharpZipLib/. In Visual C# we need to add a reference to the DLL to include the ICSharpCode compressor namespace in the project. Go to the solution explorer and right click References. Select Add References. Browse to the location of the unzipped Icsharpcode.SharpZipLib.dll (NET 2.0).

 

using System;

using System.Collections.Generic;

using System.ComponentModel;

using System.Data;

using System.Drawing;

using System.Linq;

using System.Text;

using System.Windows.Forms;

using ICSharpCode.SharpZipLib.GZip;

using System.IO

 

To perform the calculation two rich text boxes are used onto one form. Each rich text box can accept a piece of text. Just paste text into it. When the calculate button is hit each piece of text is concatenated and the resulting concatenation will be compressed. Now the NCD can be easily calculated. Suppose the NCD is close to zero, this will indicate similarity between the two pieces of text, e.g. is it written by the same author?

 

Pasting text into the rich text box (RichTextBox) means that we have to implement a context menu (ContextMenuStrip) for each rich text box. Drag a ContextMenuStrip from the ToolBox onto the form. You can edit/add the menu items just as with the MenuStrip. Add a menu item 'Paste' to the context menu. Now we have to associate the context menu with the rich text box controls. Click on the rich text box control and go to the ContextMenuStrip property. Select the implemented context menu.

 

At this time we have to implement a method that retrieves copied text from the clipboard and paste it to one of the rich text boxes (rtf format). 

 

private void pasteToolStripMenuItem_Click(object sender, EventArgs e)

{

// Which RichTextBox invoked the ContextMenu?

          RichTextBox richTextBox =

 (RichTextBox)contextMenuStrip.SourceControl;

 

      if (Clipboard.ContainsText(TextDataFormat.Rtf))

          {

richTextBox.SelectedRtf =

 Clipboard.GetData(DataFormats.Rtf).ToString();

            Clipboard.Clear();

}

} 

 

We also need a copy function. Add 'copy' to the ContextMenuStrip. The method of the copy function looks like this: 

 

private void copyToolStripMenuItem_Click(object sender, EventArgs e)

{

          // Which RichTextBox invoked the ContextMenu?

RichTextBox richTextBox =

 (RichTextBox)contextMenuStrip.SourceControl;

 

      Clipboard.SetData(DataFormats.Rtf, richTextBox.SelectedRtf);

} 

 

Three buttons are dragged from the Visual toolbox onto the NCD form. One to calculate the NCD, one to clear both rich text boxes, and one to close the form. 

 

private void buttonClose_Click(object sender, EventArgs e)

{

     this.Close();

}

 

private void buttonClear_Click(object sender, EventArgs e)

{

          richTextBoxStringX.Clear();

          richTextBoxStringY.Clear();

          textBoxNCD.Text = "";

} 

 

At last, the calculation method needs to be implemented. Two questions arise. Can we convert the rich text format into a string? Can we convert it into a byte array to be compressed by Gzip? Yes, we can, look at the following implementation. Note: The string type represents a sequence of zero or more Unicode characters. ASCII encoding will be used to denote the content of the strings. 

 

public const int BLOCK_SIZE = 900000;

 

private void buttonCalculate_Click(object sender, EventArgs e)

{

          string x = richTextBoxStringX.Text.ToString();

          string y = richTextBoxStringY.Text.ToString();

 

          string xy = x + y;

 

          Directory.CreateDirectory(@"\NCD");

       

          // Note: Encoding must be the same for text comparison

          System.Text.ASCIIEncoding enc = new System.Text.ASCIIEncoding();

 

          byte[] bytes_x = enc.GetBytes(x.ToCharArray());

          byte[] bytes_y = enc.GetBytes(y.ToCharArray());

 

          byte[] bytes_xy = enc.GetBytes(xy.ToCharArray());

 

          // Throw exception when xy > blocksize

          if (bytes_xy.Length > BLOCK_SIZE)

          {

          throw new

Exception("Concatenation exceeds block size of compressor.");

          }

 

      GZipOutputStream bzos;

 

          // Compress x

          bzos = new GZipOutputStream(File.Create(@"\NCD\x.zip"), BLOCK_SIZE);

          bzos.Write(bytes_x, 0, bytes_x.Length);

bzos.Close();

           

          // Get file size compressed x

          System.IO.FileInfo fileInfo;

          fileInfo = new System.IO.FileInfo(@"\NCD\x.zip");

          long bytes_Cx = fileInfo.Length;

 

          // Compress y

          bzos = new GZipOutputStream(File.Create(@"\NCD\y.zip"), BLOCK_SIZE);

          bzos.Write(bytes_y, 0, bytes_y.Length);

          bzos.Close();

 

          // Get file size compressed y

          fileInfo = new System.IO.FileInfo(@"\NCD\y.zip");

          long bytes_Cy = fileInfo.Length;

 

          // Compress xy

          bzos = new GZipOutputStream(File.Create(@"\NCD\xy.zip"),

                                                          BLOCK_SIZE);

          bzos.Write(bytes_xy, 0, bytes_xy.Length);

          bzos.Close();

 

          // Get file size compressed xy

          fileInfo = new System.IO.FileInfo(@"\NCD\xy.zip");

          long bytes_Cxy = fileInfo.Length;

 

           // Determine C(xy)-min{C(x),C(y)}

          long max_C;

          if (bytes_Cx > bytes_Cy)

          {

         max_C = bytes_Cxy - bytes_Cy;

          }

          else

          {

          max_C = bytes_Cxy - bytes_Cx;

          }

 

          // Determine max{C(x), C(y)}           

long max_Cx_Cy;

          if (bytes_Cx > bytes_Cy)

          {

                   max_Cx_Cy = bytes_Cx;

          }

          else

          {

          max_Cx_Cy = bytes_Cy;

          }

 

          // NCD =  C(xy)-min{C(x),C(y)} / max{C(x), C(y)}

          float NCD = (float)max_C /(float)max_Cx_Cy;

          textBoxNCD.Text = NCD.ToString();

}

 

The NCD as already mentioned can be used to cluster similar objects, e.g. virus families (SARS). The exact way of clustering is described in the original paper by Paul Vitanyi "Clustering by Compression". See also http://www.complearn.org/.

 

For the use of the appropriate compressor see the article "Common Pitfalls Using The Normalized Compression Distance: What To Watch Out For In A Compressor", Communications in Information Systems, Vol. 5, No. 4 pp. 367-384, 2005.  


Similar Articles