An Introduction to Phonetic Search in Java

This article introduces phonetic search algorithms and their usage in Java based applications without reinventing the wheel.

Introduction
 
As you know, search algorithms have taken a new turn in the modern age. Now-a-days if you search for a specific word, you may get results along with similar words in various search engines. Let us consider a typical example like "Facebook", if you search for taht name then you will get a list of names having slightly different spellings. Similarly for the LinkedIn website. It is a different topic how the Facebook search engine works, but I want to highlight the concept of phonetic search. It is a type of search where the string or words have the similar pronunciation. The primary objective of the applications is to provide better search capability and thereby it provides more intuitive information to the user. Sometimes it is also considered as a misspelled searching technique. Therefore modern applications have adopted the principles of exact searches as well as similar words searches. This article introduces phonetic search algorithms and their usage in Java based applications without reinventing the wheel.
 
Technicalities
 
It is a common practice for developers to provide search implementations based upon the exact string matching. If the string does not match then the result becomes null. Before we move into phonetic search, we need to understand the word "phonetic". Phonetic is a wing of linguistics that deals with the sounds of human speech. Basically it is more about the word that you pronounce. In the case of a phonetic search, it is a technique to look up a word with the exact spelling along with the words having similar sounds. Let us use a few examples of names having similar sounds but they differ in their spellings.
  • Caret and Carat
  • Nelson, Neilson and Neelson
  • Neekita and Nikita
  • Cup and Cop
The preceding sample words have similar sounds in the English language; it is also possible they may have different sounds in other languages. It is out of scope for the explanation about other languages. To do phonetic search capability there are various algorithms and also algorithms specific to a language. In this case I provide below some famous phonetic search algorithms and for some algorithms there are already Java implementations in the easiest manner for the smooth usage in our applications.
  • Soundex
  • Metaphone
  • Double Metaphone
  • Metaphone 3
  • Caverphone
  • NYSIIS
  • Daitch-Mokotoff
Let me provide you the outline of each algorithm very briefly.
 
Soundex
 
This algorithm was developed by Robert Russell in 1910 for the words in English. As per this algorithm, words are compared based upon their index value. The main principle behind this algorithm is that consonants are grouped depending on the ordinal numbers and finally encoded into a value against which others are matched. This algorithm is very popular and widely used in many applications.
 
Metaphone
 
This algorithm was developed by Lawrence Philips in 1990 for encoding words correspondng to English pronunciation rules. It is considered to be better than Soundex. In this case words are grouped and the resulting value is also a word unlikely in the case of soundex. This algorithm seems to be more complicated.
 
Double Metaphone
 
The developer of the Metaphone algorithm provided an improved version called "Doube Metaphone" in the year 2000 by providing support to other European languages. It is called "Double" because it provides both a primary and a secondary code for a word and code can be up to 4 characters. The Double Metaphone rule engine is a bit more complex than the others.
 
Metaphone 3
 
The same developer of Metaphone Lawrence Philips provided an improved version of an algorithm called Metaphone 3 in 2009. In this algorithm, various sounds like soft and hard were taken into consideration. It provided more support to Slavik languages like Russian.
 
Caverphone
 
The Caverphone algorithm was developed by David Hood in 2002 as part of a New Zealand project called "Caversham Project" to match the data in the old and new electoral lists. Although this algorithm is applicable for English words, it provides much more specific recognition of accents of words of New Zealand.
 
NYSIIS
 
This algorithm was developed in 1970 as part of the "New York State Identification and Intelligence System". It promises to provide better result and accuracy, up to 2.7%, over the Soundex algorithm. Again it is more specific to American names.
 
Daitch-Mokotoff
 
This algorithm was developed by two Jewish genealogists Gary Mokotoff and Randy Daitch in the year 1985. This algorithm is similar to Soundex but provides more accuracy for Russian and Jewish names.
 
Now let us get to the technical implementation of the phonetic search algorithm. As I have already explained, some of the famous algorithms are available freely in the form library that can be easily plugged into our application. Apache "commons-codec" provides the implementations for the following algorithms.
  • Metaphone
  • Double Metaphone
  • Soundex
You can freely download the common-codec jar file and use it in your application. I provide below the brief code snippet.
  1. package com.ddlab.rnd.phonetic;  
  2. import org.apache.commons.codec.language.DoubleMetaphone;  
  3. import org.apache.commons.codec.language.Metaphone;  
  4. import org.apache.commons.codec.language.Soundex;  
  5. /** 
  6.  * The Class Test is used to provide the examples on Common-Codec about the 
  7.  * various phonetic algorithm usage 
  8.  * @author <a href="mailto:debadatta.mishra@gmail.com">Debadatta Mishra</a> 
  9.  * @since 2013 
  10.  */  
  11. public class Test {  
  12.     /** 
  13.      * The main method. 
  14.      * @param args 
  15.      *            the arguments 
  16.      * @throws Exception 
  17.      *             the exception 
  18.      */  
  19.     public static void main(String[] args) throws Exception {  
  20.         String s1 = "the sky"// "ice cream";//"carat";//"Nelson";  
  21.         String s2 = "this guy"// "I scream";//"caret";//"Neilson";  
  22.         String str1 = "Nelson";  
  23.         String str2 = "Neilson";  
  24.         String name1 = "debadatta";  
  25.         String name2 = "debadutta";  
  26.         // Using Metaphone  
  27.         Metaphone metaPhone = new Metaphone();  
  28.         System.out.println(s1 + " and " + s2 + " are equal ?---->"  
  29.             +  
  30.             metaPhone.isMetaphoneEqual(s1, s2));  
  31.         // Using Double Metaphone  
  32.         DoubleMetaphone doubleMetaphone = new DoubleMetaphone();  
  33.         System.out.println(str1 + " and " + str2 + " are similar ?"  
  34.             +  
  35.             doubleMetaphone.isDoubleMetaphoneEqual(str1, str2));  
  36.         // Using Soundex  
  37.         Soundex soundex = new Soundex();  
  38.         System.out.println(name1 + " and " + name2 + " are equal ?---->"  
  39.             +  
  40.             soundex.soundex(s1).equals(soundex.soundex(s2)));  
  41.     }  
  42. }  
Apart from that I have just created a simulated project for the employees working in an organization for better searching of names. Assume that you have a list of employee names and a user is searching for a specific name, you can retrieve from the employee database and you can use algorithm implementations available in commons-codec to provide the result. In this project I have stored all the names in a flat file rather than using any database for your convenience.
 
Configuration
 
To have a clear understanding and usage of the preceding concept download the source code from dropbox "https://www.dropbox.com/s/88pbpk3fr7gvm9g/phoneticsearch.zip ". Configure the project in the Eclipse editor. Refer to the test Java classes available in the test source folder.
 
Conclusion
 
I hope you have enjoyed my small article about phonetic searches in Java. Download the complete project and go through the source code to understand the concept and its usage. Based upon the complexity and design , you can decide whether to use this concept. For any kind of issues and error you can contact me at debadatta.mishra@gmail.com.
 
Resource and Reference