
Flat File Parsed to XML Using C#

I ran across an interesting problem today where I had to parse a flat file (CSV or tab-delimited) into an XML document. The solution I arrived at is flexible enough for reuse, so I thought I'd share the library along with some of my development notes.

Technologies: .NET 1.0/1.1, XML,Visual C# .NET

Code overview:

Use:

This is a static class with two public methods that parse an input string of tab-delimited or comma-delimited data into an XmlDocument:

public static XmlDocument ParseCsvToXml(string input, string topElementName, string recordElementName, params string[] recordItemElementName)

public static XmlDocument ParseTabToXml(string input, string topElementName, string recordElementName, params string[] recordItemElementName)

The first thing I'd like to point out is the signature of the public methods. Notice the params keyword on the last parameter of ParseCsvToXml(). It lets me pass a variable number of arguments at the end of the call, one XML node name for each column in the input string:

XmlDocument result = Parser.ParseCsvToXml(input, "TopElement", "Record", "Field1", "Field2", "Field3");

The following document will be built.

<?xml version="1.0" encoding="utf-8" ?>
<TopElement>
  <Record>
    <Field1>data</Field1>
    <Field2>data</Field2>
    <Field3>data</Field3>
  </Record>
</TopElement>

There must be a node name specified for each column in the input string.
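
To make the params behaviour concrete, here is a minimal sketch. NameCheck and both of its methods are hypothetical names used only for illustration; they are not part of the library, and how the library itself reacts to a mismatch between columns and node names is not shown in this article.

```csharp
using System;

static class NameCheck
{
    // Hypothetical helper: the trailing arguments arrive as one string[].
    public static int CountNames(string recordElementName, params string[] recordItemElementName)
    {
        return recordItemElementName.Length;
    }

    // Hypothetical guard: every column in a record needs a node name.
    public static void EnsureNameForEachColumn(string[] record, params string[] recordItemElementName)
    {
        if (record.Length != recordItemElementName.Length)
            throw new ArgumentException(
                "Expected " + record.Length + " node names but got " +
                recordItemElementName.Length + ".");
    }
}
```

Calling NameCheck.CountNames("Record", "Field1", "Field2", "Field3") and NameCheck.CountNames("Record", new string[] { "Field1", "Field2", "Field3" }) both hand the method the same three-element array, so a caller can either list the names inline or build them up front.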

Development notes:

I'm using two main steps in the conversion process: (1) disassembling the flat file into a 2D matrix of strings and then (2) constructing an XML document from the matrix. There is also a pre-process step and (in the case of the CSV conversion) a post-process step that have to happen in order to clean up the data.
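
The two main steps can be sketched roughly as follows. This is a simplified illustration and not the library's actual code: TwoStepSketch, Disassemble and Build are hypothetical names, the matrix is a jagged string[][] rather than a true 2D array, and the sketch assumes there are no quoted fields (so no pre- or post-processing is needed).

```csharp
using System;
using System.Xml;

static class TwoStepSketch
{
    // Step 1 (simplified): split the input into rows, then each row into columns.
    public static string[][] Disassemble(string input, char delimiter)
    {
        string[] lines = input.Split(
            new string[] { "\r\n", "\n" }, StringSplitOptions.RemoveEmptyEntries);
        string[][] matrix = new string[lines.Length][];
        for (int i = 0; i < lines.Length; i++)
            matrix[i] = lines[i].Split(delimiter);
        return matrix;
    }

    // Step 2: one record element per row, one child element per column.
    public static XmlDocument Build(string[][] matrix, string topElementName,
        string recordElementName, params string[] recordItemElementName)
    {
        XmlDocument doc = new XmlDocument();
        XmlElement top = doc.CreateElement(topElementName);
        doc.AppendChild(top);

        foreach (string[] row in matrix)
        {
            // Assumed behaviour: reject rows that lack a node name per column.
            if (row.Length != recordItemElementName.Length)
                throw new ArgumentException("Each column needs a node name.");

            XmlElement record = doc.CreateElement(recordElementName);
            for (int c = 0; c < row.Length; c++)
            {
                XmlElement item = doc.CreateElement(recordItemElementName[c]);
                item.InnerText = row[c];
                record.AppendChild(item);
            }
            top.AppendChild(record);
        }
        return doc;
    }
}
```

For example, TwoStepSketch.Build(TwoStepSketch.Disassemble("a,b\nc,d", ','), "TopElement", "Record", "Field1", "Field2") yields a document with two Record elements, each holding a Field1 and a Field2 child.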

If it's worth doing twice, it's worth doing once

Originally I had two separate recursive methods for post-processing the CSV and tab-delimited data once it is living in the nodes that were built. Basically the point is to remove any surrounding double quotes and to put back commas that were embedded in double quotes in the CSV input.
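
The placeholder round trip described above can be sketched like this. PlaceholderSketch and its two methods are hypothetical names; only the placeholder token is taken from the article's constants. The sketch works on plain strings rather than XML nodes and does not try to handle escaped ("" doubled) quotes.

```csharp
using System;
using System.Text;

static class PlaceholderSketch
{
    private const string Comma = ",";
    private const string Placeholder = "~~`~~"; // same token the article uses

    // Pre-process: swap commas that sit inside double quotes for the
    // placeholder so a plain Split(',') does not break a quoted field apart.
    public static string MaskQuotedCommas(string line)
    {
        StringBuilder result = new StringBuilder();
        bool insideQuotes = false;
        foreach (char c in line)
        {
            if (c == '"')
                insideQuotes = !insideQuotes;

            if (c == ',' && insideQuotes)
                result.Append(Placeholder);
            else
                result.Append(c);
        }
        return result.ToString();
    }

    // Post-process: strip the surrounding quotes and put the commas back.
    public static string Restore(string field)
    {
        if (field.Length >= 2 && field[0] == '"' && field[field.Length - 1] == '"')
            field = field.Substring(1, field.Length - 2);
        return field.Replace(Placeholder, Comma);
    }
}
```

Masking the line 1,"Smith, John",x and then splitting on commas gives three fields instead of four, and Restore() turns the middle field back into Smith, John.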

The first method I wrote recursively post-processes the tab-delimited data. This works really well because XmlDocument inherits from XmlNode, so one method can accept either a node in the document or the document itself.

private static void PostProcessTabNode(XmlNode node)
{
    if (!String.IsNullOrEmpty(node.Value) && m_quotesOnBothEnds.IsMatch(node.Value))
        node.Value = node.Value.Substring(1, node.Value.Length - 2);

    foreach (XmlNode subNode in node.ChildNodes)
        PostProcessTabNode(subNode);
}

Next I wrote the method to recursively post-process the comma-delimited data:

private static void PostProcessCsvNode(XmlNode node)
{
    if (!String.IsNullOrEmpty(node.Value))
        node.Value = node.Value.Replace(strTemporaryPlaceholder, strComma);

    foreach (XmlNode subNode in node.ChildNodes)
        PostProcessCsvNode(subNode);
}

What I ended up with was two methods with some code repeated at the end. Anytime I see repeated code, a shudder goes down my spine because it screams out "MAINTENANCE AND CONSISTENCY NIGHTMARE". I may eventually have more types of data I'd like to parse into XML, so I thought it would be worth refactoring at this point.

I moved to a "controller" method that will be responsible for the recursion.

private static void PostProcess(XmlNode node, Action<XmlNode> process)
{
    process(node);

    foreach (XmlNode subNode in node.ChildNodes)
        PostProcess(subNode, process);
}

The Action<XmlNode> is a predefined delegate that I'll use to point to a method with the same signature that will actually do the work. 

private static void PostProcessTabNode(XmlNode node)
{
    if (!String.IsNullOrEmpty(node.Value) && m_quotesOnBothEnds.IsMatch(node.Value))
        node.Value = node.Value.Substring(1, node.Value.Length - 2);
}

private static void PostProcessCsvNode(XmlNode node)
{
    if (!String.IsNullOrEmpty(node.Value))
        node.Value = node.Value.Replace(strTemporaryPlaceholder, strComma);
}

The nice thing about this refactoring is that all my methods are now more cohesive (each method has a distinct purpose), which makes them easier to maintain and easier to understand.

When I call the PostProcess() method I'll pass in the document to be cleaned up and the name of the method to do the cleaning.  The compiler is smart enough to know that a new delegate of type Action<XmlNode> needs to be created so I don't have to specify it.

PostProcess(doc, PostProcessTabNode);

I could have called this method in the following way with the exact same results but to me it is much harder to read and understand at a quick glance:

PostProcess(doc, new Action<XmlNode>(PostProcessTabNode));
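
For a self-contained run of the same pattern, here is a small sketch. The PostProcess() controller has the same shape as the one above; DelegateSketch, TrimNode and Run are hypothetical names added for illustration, and the anonymous method is just one more way to supply the Action<XmlNode>.

```csharp
using System;
using System.Xml;

static class DelegateSketch
{
    // Same controller shape as in the article: visit the node, then recurse.
    static void PostProcess(XmlNode node, Action<XmlNode> process)
    {
        process(node);

        foreach (XmlNode subNode in node.ChildNodes)
            PostProcess(subNode, process);
    }

    // Hypothetical worker: trims whitespace. Value is null for elements and
    // for the document itself, so only text-like nodes are changed.
    static void TrimNode(XmlNode node)
    {
        if (!String.IsNullOrEmpty(node.Value))
            node.Value = node.Value.Trim();
    }

    public static string Run(out int textNodeCount)
    {
        XmlDocument doc = new XmlDocument();
        doc.LoadXml("<TopElement><Record><Field1> data </Field1></Record></TopElement>");

        // Method group: the compiler creates the Action<XmlNode> for us.
        PostProcess(doc, TrimNode);

        // Anonymous method: a one-off worker that counts text nodes.
        int count = 0;
        PostProcess(doc, delegate(XmlNode n)
        {
            if (n.NodeType == XmlNodeType.Text)
                count++;
        });

        textNodeCount = count;
        return doc.OuterXml;
    }
}
```

Because the recursion lives in one place, adding another kind of clean-up only means writing another small worker with the void (XmlNode) signature.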

Strings are Evil

Having a good handle on where strings are in our code is pretty important.  Because they are immutable, every modification creates a new string, so they can be very expensive.  If the same string literal appears more than once in the code, the CLR will "intern" it: a single copy of the value is held in memory and every use gets a reference to that copy.

http://msdn2.microsoft.com/en-us/library/system.string.intern(vs.80).aspx
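
A small sketch of what interning does, using strings built at run time so that they start out as separate objects. InternSketch and Run are hypothetical names for illustration; String.Intern() and object.ReferenceEquals() are the standard framework calls.

```csharp
using System;

static class InternSketch
{
    public static bool[] Run()
    {
        // Two equal strings built at run time are two separate objects.
        string a = new string(new char[] { 'x', 'y', 'z' });
        string b = new string(new char[] { 'x', 'y', 'z' });

        bool sameValue = a == b;                               // compares characters
        bool sameObjectBefore = object.ReferenceEquals(a, b);  // compares references

        // String.Intern returns the single pooled reference for a given value.
        bool sameObjectAfter =
            object.ReferenceEquals(String.Intern(a), String.Intern(b));

        return new bool[] { sameValue, sameObjectBefore, sameObjectAfter };
    }
}
```

The values are equal both times; only after interning do both variables resolve to the same object in memory.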

For me, declaring re-used strings as const or readonly fields ensures I'm not accidentally using a different spelling or an extra space somewhere, so it helps keep the warts off the generated IL. It also keeps the number of distinct literal strings in the assembly down, and fewer literal strings to intern means less work for the CLR to do on behalf of my assembly.

        private const string
            strComma = ",",
            strTemporaryPlaceholder = "~~`~~",
            strTab = "\t";

Anyways, that's about it for the general overview.  You might also be interested in the disassembly and XML-building methods in the source code. The unit tests are pretty rough (I used them for a general visual check of the output), but I included them with the code anyways.

I hope you find the library useful.

Until next time,
Happy coding


 About the author
 
Matthew Cochran
 Comments
Source Code by howardbash On July 1, 2007
I wish that clicking the source code link at the end of this article would start the download instead of directing me to another site with its own search, which did not find this title. I would like a copy of the source to learn from... Thanks, Howard
Re: Source Code by Matthew On July 1, 2007
Try accessing the source code from the link on the upper right-hand side of the following page & let me know if you have any luck: http://www.c-sharpcorner.com/UploadFile/rmcochran/FlatFileToXmlDocument06302007111353AM/FlatFileToXmlDocument.aspx
compare 2 xml document by Adina On July 31, 2007
Hi, Please help! I have 2 XML documents and I want to see whether they are identical or not. I could not find a solution that walks both documents together: find a node, return its value and compare, then go to the children, return each value and compare, then find the next node, and so on. I need something general that does not require specifying the node names. Thanks, Adina.
Location of XML Document by Doug On November 19, 2008
How do you write the document out to a specific location? e.g. - C:\XMLDocs\
XML parser by Eric On May 21, 2009
very nice thanks for the code

 © 1999 - 2009  Mindcracker LLC. All Rights Reserved