Flat File Parsed to XML Using C#

I ran across an interesting problem today where I had to parse a flat file (csv or tab delimited) into an xml document. The solution I arrived at is flexible enough for reuse, so I thought I'd share the library along with some of my development notes.

Download Files:
FlatFileParser.zip

Code overview:

Use:

This is a static class with two public methods used to parse an input string representing tab delimited or comma delimited data into an XmlDocument:

public static XmlDocument ParseCsvToXml(string input, string topElementName, string recordElementName, params string[] recordItemElementName)

public static XmlDocument ParseTabToXml(string input, string topElementName, string recordElementName, params string[] recordItemElementName)

The first thing that I'd like to point out is the signature of the publicly facing methods.  Notice the params keyword on the last parameter of the ParseCsvToXml() method.  It lets me pass a variable number of arguments at the end of the call, one for each xml node name, so we add as many node names as there are columns in our input string.

XmlDocument result = Parser.ParseCsvToXml(input, "TopElement", "Record", "Field1", "Field2", "Field3");

The following document will be built.

<?xml version="1.0" encoding="utf-8" ?>
<TopElement>
  <Record>
    <Field1>data</Field1>
    <Field2>data</Field2>
    <Field3>data</Field3>
  </Record>
</TopElement>

There must be a node name specified for each column in the input string.
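
To make the params behaviour concrete, here is a minimal sketch. CountNames is a hypothetical helper that is not part of the library; it only mirrors the tail of the ParseCsvToXml() signature to show the ways a params array can be supplied.

```csharp
using System;

public static class ParamsDemo
{
    // Hypothetical helper (not in the article's library): it mirrors the last
    // two parameters of ParseCsvToXml() and reports how many names arrived.
    public static int CountNames(string recordElementName, params string[] recordItemElementName)
    {
        return recordItemElementName.Length;
    }

    public static void Main()
    {
        // Names listed inline: the compiler packs them into a string[].
        Console.WriteLine(CountNames("Record", "Field1", "Field2", "Field3")); // 3

        // An existing array can be passed directly in the params position.
        string[] names = { "Field1", "Field2", "Field3" };
        Console.WriteLine(CountNames("Record", names)); // 3

        // With no trailing arguments the method receives an empty array, not null.
        Console.WriteLine(CountNames("Record")); // 0
    }
}
```

Being able to pass an array is handy when the column names are only known at run time, for example when they come from a header row.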

Development notes:

I'm using two main steps in the conversion process: (1) disassembling the flat file into a 2D matrix of strings and then (2) constructing an xml document from the matrix. There are also a pre-process step and (in the case of the csv conversion) a post-process step that have to happen in order to clean up the data.
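
The two steps can be sketched roughly as follows. This is not the code from the download: TwoStepSketch, ToMatrix and ToXml are names made up for illustration, and the splitting is deliberately naive (it assumes no quoted fields or embedded delimiters, which is exactly what the pre-process and post-process steps exist to handle).

```csharp
using System;
using System.Xml;

public static class TwoStepSketch
{
    // Step 1: disassemble the flat file into rows and columns.
    // Naive on purpose: no handling of quoted fields or embedded delimiters.
    public static string[][] ToMatrix(string input, char delimiter)
    {
        string[] lines = input.Split(new[] { "\r\n", "\n" }, StringSplitOptions.RemoveEmptyEntries);
        string[][] matrix = new string[lines.Length][];
        for (int i = 0; i < lines.Length; i++)
            matrix[i] = lines[i].Split(delimiter);
        return matrix;
    }

    // Step 2: build the xml document, one record element per row
    // and one item element per column.
    public static XmlDocument ToXml(string[][] matrix, string topElementName,
        string recordElementName, params string[] recordItemElementName)
    {
        XmlDocument doc = new XmlDocument();
        XmlElement top = doc.CreateElement(topElementName);
        doc.AppendChild(top);

        foreach (string[] row in matrix)
        {
            // There must be a node name for each column.
            if (row.Length != recordItemElementName.Length)
                throw new ArgumentException("Each column needs exactly one element name.");

            XmlElement record = doc.CreateElement(recordElementName);
            for (int col = 0; col < row.Length; col++)
            {
                XmlElement item = doc.CreateElement(recordItemElementName[col]);
                item.InnerText = row[col];
                record.AppendChild(item);
            }
            top.AppendChild(record);
        }
        return doc;
    }

    public static void Main()
    {
        string[][] matrix = ToMatrix("a,b,c\n1,2,3", ',');
        XmlDocument doc = ToXml(matrix, "TopElement", "Record", "Field1", "Field2", "Field3");
        Console.WriteLine(doc.OuterXml);
    }
}
```

Keeping the matrix as an intermediate form means the xml building step does not need to know which delimiter the input used.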

If it's worth doing twice, it's worth doing once

Originally I had two separate recursive methods for post processing the csv and tab delimited data that now lives in the nodes that were built.  The point of post processing is to remove any surrounding double quotes and to put back the commas that were embedded in double quotes in the csv input.

The first method I wrote was to recursively post process the tab-delimited data.  This works really well because XmlDocument inherits from XmlNode, so a single method can accept either the document itself or any node inside it.

private static void PostProcessTabNode(XmlNode node)
{
    if (!String.IsNullOrEmpty(node.Value) && m_quotesOnBothEnds.IsMatch(node.Value))
        node.Value = node.Value.Substring(1, node.Value.Length - 2);

    foreach (XmlNode subNode in node.ChildNodes)
        PostProcessTabNode(subNode);
}
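
One detail worth seeing in isolation is why the recursion has to reach every node. In System.Xml the Value property is null for document and element nodes, and the actual data sits in the child text nodes, which is why the method checks String.IsNullOrEmpty before touching anything. The sketch below uses a hypothetical Describe helper to show this.

```csharp
using System;
using System.Xml;

public static class NodeValueDemo
{
    // Hypothetical helper: shows a node's type and its Value (or a marker if null).
    public static string Describe(XmlNode node)
    {
        return node.NodeType + ": " + (node.Value ?? "(null)");
    }

    public static void Main()
    {
        XmlDocument doc = new XmlDocument();
        doc.LoadXml("<Record><Field1>\"data\"</Field1></Record>");

        XmlNode element = doc.DocumentElement.FirstChild; // the <Field1> element
        XmlNode text = element.FirstChild;                // the text node inside it

        Console.WriteLine(Describe(doc));     // Document: (null)
        Console.WriteLine(Describe(element)); // Element: (null)
        Console.WriteLine(Describe(text));    // Text: "data"
    }
}
```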

Next I wrote the method to recursively post process the comma-delimited data:

private static void PostProcessCsvNode(XmlNode node)
{
    if (!String.IsNullOrEmpty(node.Value))
        node.Value = node.Value.Replace(strTemporaryPlaceholder, strComma);

    foreach (XmlNode subNode in node.ChildNodes)
        PostProcessCsvNode(subNode);
}

What I ended up with was two methods with some code repeated at the end.  Any time I see code repeated a shudder goes down my spine because it screams out "MAINTENANCE AND CONSISTENCY NIGHTMARE".  I may eventually have more types of data I'd like to parse into xml, so I thought it would be worth refactoring at this point.

I moved to a "controller" method that will be responsible for the recursion.

private static void PostProcess(XmlNode node, Action<XmlNode> process)
{
    process(node);

    foreach (XmlNode subNode in node.ChildNodes)
        PostProcess(subNode, process);
}

The Action<XmlNode> is a predefined delegate that I'll use to point to a method with the same signature that will actually do the work. 

private static void PostProcessTabNode(XmlNode node)
{
    if (!String.IsNullOrEmpty(node.Value) && m_quotesOnBothEnds.IsMatch(node.Value))
        node.Value = node.Value.Substring(1, node.Value.Length - 2);
}

private static void PostProcessCsvNode(XmlNode node)
{
    if (!String.IsNullOrEmpty(node.Value))
        node.Value = node.Value.Replace(strTemporaryPlaceholder, strComma);
}

The nice thing about this refactoring is that all my methods are now more cohesive (each method has a distinct purpose), which translates into ease of maintenance and ease of understanding.
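
As a rough illustration of what the refactoring buys, the sketch below adds a third cleanup step without writing any new recursion. TrimNode and Clean are hypothetical names that are not in the library, and PostProcess is copied here (and made public) only so the example is self-contained.

```csharp
using System;
using System.Xml;

public static class PostProcessDemo
{
    // Same shape as the controller method above; it owns the recursion.
    public static void PostProcess(XmlNode node, Action<XmlNode> process)
    {
        process(node);

        foreach (XmlNode subNode in node.ChildNodes)
            PostProcess(subNode, process);
    }

    // Hypothetical extra cleanup: strip leading and trailing whitespace.
    public static void TrimNode(XmlNode node)
    {
        if (!String.IsNullOrEmpty(node.Value))
            node.Value = node.Value.Trim();
    }

    // Loads the xml, runs the cleanup over every node, and returns the result.
    public static string Clean(string xml)
    {
        XmlDocument doc = new XmlDocument();
        doc.LoadXml(xml);
        PostProcess(doc, TrimNode);
        return doc.OuterXml;
    }

    public static void Main()
    {
        Console.WriteLine(Clean("<Top><Record><Field1>  a  </Field1></Record></Top>"));
    }
}
```

The new step is a single small method; the traversal logic stays in one place.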

When I call the PostProcess() method I'll pass in the document to be cleaned up and the name of the method to do the cleaning.  The compiler is smart enough to know that a new delegate of type Action<XmlNode> needs to be created so I don't have to specify it.

PostProcess(doc, PostProcessTabNode);

I could have called this method in the following way with the exact same results but to me it is much harder to read and understand at a quick glance:

PostProcess(doc, new Action<XmlNode>(PostProcessTabNode));
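
To check that the call forms really are interchangeable, here is a small sketch that counts the nodes visited. CountNode, Run and the Visited counter are made up for the demonstration, and the lambda form is an extra variation that the article itself does not use.

```csharp
using System;
using System.Xml;

public static class DelegateFormsDemo
{
    public static int Visited;

    // Hypothetical handler: just counts how many nodes the recursion reaches.
    public static void CountNode(XmlNode node)
    {
        Visited++;
    }

    public static void PostProcess(XmlNode node, Action<XmlNode> process)
    {
        process(node);

        foreach (XmlNode subNode in node.ChildNodes)
            PostProcess(subNode, process);
    }

    // Runs the given handler over a small fixed document and returns the count.
    public static int Run(Action<XmlNode> process)
    {
        XmlDocument doc = new XmlDocument();
        doc.LoadXml("<Top><Record><Field1>a</Field1></Record></Top>");
        Visited = 0;
        PostProcess(doc, process);
        return Visited;
    }

    public static void Main()
    {
        // Document, Top, Record, Field1 and the text node: 5 nodes each time.
        Console.WriteLine(Run(CountNode));                      // method group
        Console.WriteLine(Run(new Action<XmlNode>(CountNode))); // explicit delegate
        Console.WriteLine(Run(node => Visited++));              // lambda
    }
}
```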

Strings are Evil

Having a good handle on where strings are in our code is pretty important.  Because they are immutable, every operation that changes a string allocates a new one, which can get expensive.  When the same string literal appears more than once in the code, the CLR will "intern" it: a single memory space holds the string value and multiple references to that memory space are handed out.

http://msdn2.microsoft.com/en-us/library/system.string.intern(vs.80).aspx

For me, declaring re-used strings as constants ensures I'm not accidentally using a different spelling or an extra space in my strings.  Identical literals already share a single interned instance, so the real benefit is consistency: a typo can't quietly introduce a second, slightly different literal for the CLR to intern, and a value only has to be changed in one place.

private const string
    strComma = ",",
    strTemporaryPlaceholder = "~~`~~",
    strTab = "\t";
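
The interning behaviour can be observed with a small sketch like the one below. InternDemo, SameInstance and BuildAtRunTime are names invented for the example; only the constant is borrowed from the declarations above.

```csharp
using System;
using System.Text;

public static class InternDemo
{
    private const string strTemporaryPlaceholder = "~~`~~";

    public static bool SameInstance(string a, string b)
    {
        return Object.ReferenceEquals(a, b);
    }

    // Builds the same characters at run time, so the result is a new object.
    public static string BuildAtRunTime()
    {
        return new StringBuilder().Append("~~").Append("`~~").ToString();
    }

    public static void Main()
    {
        string literal = "~~`~~";
        string built = BuildAtRunTime();

        // The constant and the identical literal are the same string instance.
        Console.WriteLine(SameInstance(literal, strTemporaryPlaceholder)); // True

        // Equal content, but a separate object on the heap.
        Console.WriteLine(literal == built);              // True
        Console.WriteLine(SameInstance(literal, built));  // False

        // String.Intern maps equal values to one pooled reference.
        Console.WriteLine(SameInstance(String.Intern(literal), String.Intern(built))); // True
    }
}
```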

Anyway, that's about it for the general overview.  Other code you might be interested in includes the disassembly and xml building methods in the source code. The unit tests are pretty rough (I used them for a general visual check of the output), but I included them with the code anyway.

I hope you find the library useful.

Until next time,
Happy coding

 About the author
 
Matthew Cochran
 Comments
Source Code by howardbash On July 1, 2007
I wish that clicking the source code link at the end of this article would start the download instead of directing me to another site with its own search, which did not find this title. I would like a copy of the source to learn from... Thanks, Howard
Re: Source Code by Matthew On July 1, 2007
Try accessing the source code from the link on the upper right-hand side of the following page and let me know if you have any luck: http://www.c-sharpcorner.com/UploadFile/rmcochran/FlatFileToXmlDocument06302007111353AM/FlatFileToXmlDocument.aspx
compare 2 xml document by Adina On July 31, 2007
Hi, please help! I have two xml documents and I want to see whether they are identical or not. I could not find a solution that walks both documents: find a node, return its value and compare, then go to its children, return their values and compare, find the next node, and so on. It needs to be general, without specifying the node names. Thanks, Adina.
Location of XML Document by Doug On November 19, 2008
How do you write the document out to a specific location? e.g. - C:\XMLDocs\
XML parser by Eric On May 21, 2009
very nice thanks for the code
 © 2012  contents copyright of their authors. Rest everything copyright Mindcracker. All rights reserved.