Parsing US Postal Addresses from a Text File and Separating the Address Fields

This article shows how to parse US-format postal addresses from a text file or any other source.


Many services on the internet parse addresses: you give them an address and they split it into its parts. Extracting the parts of a US-format postal address is useful in many programs, for example in marketing, to gather data about people in a specific state.

I wrote the parser to help a friend, and I am sharing it here.

The parser relies on two main techniques: regular expressions (RegEx) and text processing.

Here are some of the formats a US postal address can take:

JEREMY MARTINSON
455 LARKSPUR DR
CALIFORNIA SPRINGS CA 92926-4601

----

MARY ROE
MOBILE SYSTEMS
455 E DRAGMAN K
TUCSON AZ 85705-4589
USA

----

MARY ROE
MOBILE SYSTEMS
500 E DRAGMAN SUITE 5A
TUCSON AZ 85705-4601
USA

----

JOHN DOE
CENTER FOR FINANCIAL ASSISTANCE TO DEPOSED NIGERIAN ROYALTY
421 E DRACHMAN
TUCSON AZ 85705-7445
USA

Now we need a program that can identify the parts of these addresses and store them in a consistent, CSV-like format, so that it is easy to sort or search the data by any part of the address, for example by postal code or area.

I will store the parts of each address in a class called Address, and then write every address found in the input text file to another text file, with the fields separated by the pipe symbol '|'. You could use a database here instead.

Since this is a text processing program, regular expressions and a little text-processing thinking go a long way here.

We can start by examining the address formats present in the input text file, which we want to convert into another text file in a cleaner format that other programs can read.

The first step is to check whether the address contains a US postal code. If it does not, it is not a standard address and we cannot process it. If it does, we work out which layout the address uses (the samples above have three or five lines), so that we know which line holds the street and which holds the city, state, and postal code, and then proceed to the next steps.
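
As a minimal sketch of this first check, the snippet below tests whether a block of address text contains something shaped like a US postal code (five digits, optionally followed by a hyphen and four more digits) and counts its lines. The names `AddressCheck`, `HasUsPostalCode`, and `CountLines` are hypothetical and chosen only for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AddressCheck
{
    // Five digits, optionally followed by "-" and four digits (ZIP+4).
    private static readonly Regex PostalCode =
        new Regex(@"\b[0-9]{5}(?:-[0-9]{4})?\b");

    // True when the text contains something shaped like a US postal code.
    public static bool HasUsPostalCode(string address)
    {
        return PostalCode.IsMatch(address);
    }

    // Number of non-empty lines in one address block.
    public static int CountLines(string address)
    {
        return address.Split(new[] { "\r\n", "\n" },
                             StringSplitOptions.RemoveEmptyEntries).Length;
    }
}
```

The pattern only checks the shape of the text: a five-digit house number would match as well, so this is a first filter rather than a real validation.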

Here is my code, which identifies the address formats listed above and parses the parts of each address from a text file containing many addresses.

I wrote an Address class to hold the parsed information:

    using System;
    using System.IO;

    public class Address
    {
        /* one field per column of the output file */
        public string Street;
        public string Locality;
        public string City;
        public string State;
        public string PostalCode;
        public string Country;

        public Address()
        {
            ClearObject();
        }

        /* reset every field so the object can be reused for the next address */
        public void ClearObject()
        {
            Street = "";
            Locality = "";
            City = "";
            State = "";
            PostalCode = "";
            Country = "";
        }

        public string _Street
        {
            get { return Street; }
            set { Street = value; }
        }

        public string _Country
        {
            get { return Country; }
            set { Country = value; }
        }

        public string _PostalCode
        {
            get { return PostalCode; }
            set { PostalCode = value; }
        }

        public string _State
        {
            get { return State; }
            set { State = value; }
        }

        public string _Locality
        {
            get { return Locality; }
            set { Locality = value; }
        }

        /* append the address to formatted_data.txt as one pipe-separated line */
        public void WriteAddress()
        {
            /* "using" closes the file even if the write throws */
            using (StreamWriter sw = new StreamWriter("formatted_data.txt", true))
            {
                sw.Write(String.Format("{0}|{1}|{2}|{3}|{4}|{5}\r\n", Street, Locality, City, State, PostalCode, Country));
            }
        }
    }

The WriteAddress method appends the parsed address, in a consistent format, to another text file called formatted_data.txt.

Now comes the main code, which parses every address in the input file with a single click. The addresses in the file are separated by blank lines.

If the code looks hard to follow, work through it with the example addresses above, since some parts of it are not obvious at first glance.

/* requires: using System; using System.IO; using System.Text.RegularExpressions; */
private void button1_Click(object sender, EventArgs e)
{
    /* read the whole file */
    string Data = File.ReadAllText("sample.txt");

    /* a blank line separates two addresses: replace it with a single delimiter */
    Data = Data.Replace("\r\n\r\n", "|");
    string[] AddressList = Data.Split('|');

    /* US postal code: five digits, optionally followed by -NNNN */
    Regex rex = new Regex(@"\b[0-9]{5}(?:-[0-9]{4})?\b");
    Address obj = new Address();

    /* visit every entry, including the last one */
    for (int i = 0; i < AddressList.Length; i++)
    {
        /* one array element per line of the address */
        string[] Fields = AddressList[i].Trim().Replace("\r\n", "|").Split('|');

        /* look for the postal code from the last line upwards, so that a
           five-digit house number is not mistaken for it */
        int zipLine = -1;
        for (int j = Fields.Length - 1; j >= 1; j--)
        {
            if (rex.IsMatch(Fields[j])) { zipLine = j; break; }
        }

        /* no US postal code: not a standard address, skip it */
        if (zipLine < 0) continue;

        Match zip = rex.Match(Fields[zipLine]);
        obj.ClearObject();
        obj._Country = "USA";
        obj._PostalCode = zip.Value;

        /* the three characters in front of the postal code:
           the state abbreviation plus the separating space */
        if (zip.Index >= 3)
        {
            obj._State = Fields[zipLine].Substring(zip.Index - 3, 3);
        }

        /* get locality: the first word of the city line */
        obj._Locality = Fields[zipLine].Split(' ')[0];

        /* the street is the line directly above the city line */
        obj._Street = Fields[zipLine - 1];

        obj.WriteAddress();
    }
}
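
The handler splits the file by replacing blank lines with '|', which relies on Windows line endings (\r\n) and on the data itself never containing a '|'. As a hedged alternative, the sketch below splits the text into address blocks directly. It assumes blocks are separated by one or more blank lines and accepts both \r\n and \n line endings; `AddressFile` and `SplitBlocks` are hypothetical names, not part of the program above.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class AddressFile
{
    // Splits raw text into address blocks; each block is the array of its non-empty lines.
    public static List<string[]> SplitBlocks(string text)
    {
        var blocks = new List<string[]>();

        // Normalise line endings so that only "\n" remains.
        string normalised = text.Replace("\r\n", "\n");

        // One or more blank lines (possibly containing spaces) separate two blocks.
        foreach (string block in Regex.Split(normalised, @"\n\s*\n"))
        {
            string[] lines = block.Trim().Split(
                new[] { '\n' }, StringSplitOptions.RemoveEmptyEntries);

            // Skip blocks that contained nothing but whitespace.
            if (lines.Length > 0)
            {
                blocks.Add(lines);
            }
        }
        return blocks;
    }
}
```

Each returned array plays the role of the Fields array in the handler, without a temporary delimiter.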


When you click the button, every field of each US address is written, separated by "|", to the file formatted_data.txt in the same folder as the .exe. It will contain results like these:


455 LARKSPUR DR|CALIFORNIA||CA |92926-4601|USA
455 E DRAGMAN K|TUCSON||AZ |85705-4589|USA
500 E DRAGMAN SUITE 5A|TUCSON||AZ |85705-4601|USA
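
In this output the locality holds only the first word of the city (CALIFORNIA rather than CALIFORNIA SPRINGS) and the state keeps a trailing space. One possible refinement, sketched below, is to split a line of the form CITY ST ZIP with a single regular expression and named groups. The sketch assumes the state is always a two-letter uppercase abbreviation and that the postal code ends the line; `CityLine` and `TryParse` are hypothetical names, not part of the program above.

```csharp
using System.Text.RegularExpressions;

public static class CityLine
{
    // City (any text), two-letter state, then ZIP or ZIP+4 at the end of the line.
    private static readonly Regex Pattern = new Regex(
        @"^\s*(?<city>.+?)\s+(?<state>[A-Z]{2})\s+(?<zip>[0-9]{5}(?:-[0-9]{4})?)\s*$");

    // Returns false when the line does not look like "CITY ST ZIP".
    public static bool TryParse(string line, out string city,
                                out string state, out string zip)
    {
        Match m = Pattern.Match(line);
        city = m.Success ? m.Groups["city"].Value : "";
        state = m.Success ? m.Groups["state"].Value : "";
        zip = m.Success ? m.Groups["zip"].Value : "";
        return m.Success;
    }
}
```

For the line CALIFORNIA SPRINGS CA 92926-4601 this yields the city CALIFORNIA SPRINGS, the state CA without a trailing space, and the postal code 92926-4601. A street line such as 455 E DRAGMAN K does not match and is rejected.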


Now it is easy to get any field of an address from the text file, because every address is stored in the same consistent format :)
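
As an illustration of how the consistent format helps, the sketch below takes pipe-separated lines in the column order used above (street, locality, city, state, postal code, country) and sorts them by postal code. `AddressQuery` and `SortByPostalCode` are hypothetical names, and the ordering is a plain ordinal string comparison.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class AddressQuery
{
    // Column order: street|locality|city|state|postal code|country
    private const int PostalCodeColumn = 4;

    // Returns the lines that have all six columns, ordered by postal code.
    public static List<string> SortByPostalCode(IEnumerable<string> lines)
    {
        return lines
            .Select(line => line.Split('|'))
            .Where(cols => cols.Length == 6)
            .OrderBy(cols => cols[PostalCodeColumn], StringComparer.Ordinal)
            .Select(cols => string.Join("|", cols))
            .ToList();
    }
}
```

The input lines could come from File.ReadAllLines("formatted_data.txt"). Changing the column index sorts by another field, such as the state.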