Easily Find Tags and Values in a Large XML Document Using XmlTextReader in C#

Liju Gopalan

Introduction

XML (eXtensible Markup Language) is a widely used format for structuring and storing data in a hierarchical manner. It consists of elements enclosed in tags, which can have attributes and contain text or other elements. However, processing large XML files can be memory-intensive, especially if you load the entire document into memory at once.

Use XmlTextReader to parse large XML documents

using System.Xml;

public void FindParticularNodesUsingTextReader()
{
    string xmlFilePath = @"C:\Documents and Settings\Administrator\Desktop\sampleXmlDoc.xml";

    using (XmlTextReader txtReader = new XmlTextReader(xmlFilePath))
    {
        txtReader.WhitespaceHandling = WhitespaceHandling.None;

        while (txtReader.Read())
        {
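            // Match the opening <TotalPrice> tag only, not its closing tag.
            if (txtReader.NodeType == XmlNodeType.Element && txtReader.Name == "TotalPrice")
            {
                txtReader.Read(); // advance to the text node inside the element
                richTextBox1.AppendText(txtReader.Value + " "); // space-separate the values
            }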
        }
    }
}

Output

12.36 11.99 7.97

For faster, read-only, XPath query-based access to data, use XPathDocument and XPathNavigator together with an XPath query.

using System.Xml.XPath;

public void FindTagsUsingXPathNavigatorAndXPathDocumentNew()
{
    string xmlFilePath = @"C:\Documents and Settings\Administrator\Desktop\sampleXmlDoc.xml";

    XPathDocument xpDoc = new XPathDocument(xmlFilePath);
    XPathNavigator xpNav = xpDoc.CreateNavigator();
    XPathExpression xpExpression = xpNav.Compile("/Orders/Order/TotalPrice");
    XPathNodeIterator xpIter = xpNav.Select(xpExpression);

    while (xpIter.MoveNext())
    {
        richTextBox1.AppendText(xpIter.Current.Value + " "); // space-separate the values
    }
}

Output

12.36 11.99 7.97

Combine XmlReader with XmlDocument: stream through the file with XmlReader, and use its MoveToContent and Skip methods to skip unwanted items.

using System.Xml;

public void UseXmlReaderAndXmlDocument()
{
    string xmlFilePath = @"C:\Documents and Settings\Administrator\Desktop\sampleXmlDoc.xml";

    using (XmlReader rdrObj = XmlReader.Create(xmlFilePath))
    {
        while (rdrObj.Read())
        {
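            // NodeType is Element only on an opening tag, so no further check is needed.
            if (rdrObj.NodeType == XmlNodeType.Element &&
                rdrObj.Name == "TotalPrice")
            {
                rdrObj.Read(); // advance to the text node inside the element
                richTextBox1.AppendText(rdrObj.Value + " "); // space-separate the values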
            }
        }
    }
}

Output

12.36 11.99 7.97

using System.Xml;

public void UseXmlReaderAndXmlDocumentNew()
{
    string xmlFilePath = @"C:\Documents and Settings\Administrator\Desktop\sampleXmlDoc.xml";

    using (XmlReader rdrObj = XmlReader.Create(xmlFilePath))
    {
        XmlDocument xmlDocObj = new XmlDocument();
        
        while (rdrObj.Read())
        {
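            // NodeType is Element only on an opening tag, so no further check is needed.
            if (rdrObj.NodeType == XmlNodeType.Element &&
                rdrObj.Name == "TotalPrice")
            {
                rdrObj.Read(); // advance to the text node inside the element
                richTextBox1.AppendText(rdrObj.Value + " "); // space-separate the values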
            }
        }

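        rdrObj.Close(); // release the file before XmlDocument opens it again

        // Load() builds the whole document in memory, so use it only when the full DOM is needed.
        xmlDocObj.Load(xmlFilePath);
        richTextBox1.Text = xmlDocObj.InnerText; // replaces the values appended above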
    }
}
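
The combination described above can also be sketched with MoveToContent, Skip, and XmlDocument.ReadNode, so that only the wanted subtrees are ever materialized as DOM nodes. This is a minimal sketch, not the article's original method: the class and method names are hypothetical, and the document shape (an Orders root containing Order elements with a TotalPrice child) is an assumption taken from the XPath query used earlier.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Xml;

public static class ReadNodeSketch
{
    // Streams through the input and turns only <Order> elements into DOM nodes.
    // Assumed shape: <Orders><Order><TotalPrice>..</TotalPrice></Order>..</Orders>
    public static List<string> ReadTotalPrices(TextReader input)
    {
        var prices = new List<string>();
        var doc = new XmlDocument();

        using (XmlReader reader = XmlReader.Create(input))
        {
            reader.MoveToContent(); // skip the prolog and land on the root element
            reader.Read();          // step inside the root

            while (!reader.EOF)
            {
                if (reader.NodeType == XmlNodeType.Element && reader.Name == "Order")
                {
                    // ReadNode consumes the whole <Order> subtree and leaves the
                    // reader on the node that follows it, so no extra Read() here.
                    XmlNode order = doc.ReadNode(reader);
                    XmlNode price = order["TotalPrice"];
                    if (price != null) prices.Add(price.InnerText);
                }
                else if (reader.NodeType == XmlNodeType.Element)
                {
                    reader.Skip(); // jump over an unwanted subtree
                }
                else
                {
                    reader.Read(); // whitespace, end tags, and so on
                }
            }
        }
        return prices;
    }
}
```

Because each Order node is discarded after its price is read, memory use stays roughly proportional to one Order rather than to the whole file.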

Design Considerations

Avoid XML where possible.
Avoid processing large documents.
Avoid validation. XmlValidatingReader is 2-3x slower than XmlTextReader.
Avoid DTD, especially IDs and entity references.
Use streaming interfaces such as XmlReader or SAXdotnet.
Consider hard-coded processing, including validation.
Keep node names short.
Consider sharing a NameTable, but only when the documents are likely to have most names in common. As more and more irrelevant names accumulate, it becomes slower and slower.
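
As an illustration of the NameTable tip, the following sketch shares one NameTable across several readers and compares element names by reference. The class and method names are hypothetical, and the approach assumes the documents use largely the same element names.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Xml;

public static class SharedNameTableSketch
{
    // Counts elements with the given local name across several documents,
    // reusing one NameTable so each name is atomized only once.
    public static int CountElements(IEnumerable<string> documents, string localName)
    {
        var names = new NameTable();
        object wanted = names.Add(localName); // atomized string instance
        var settings = new XmlReaderSettings { NameTable = names };

        int count = 0;
        foreach (string xml in documents)
        {
            using (XmlReader reader = XmlReader.Create(new StringReader(xml), settings))
            {
                while (reader.Read())
                {
                    // Reference comparison works because the reader returns
                    // names atomized through the same NameTable.
                    if (reader.NodeType == XmlNodeType.Element &&
                        (object)reader.LocalName == wanted)
                    {
                        count++;
                    }
                }
            }
        }
        return count;
    }
}
```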

Parsing XML

Use XmlTextReader and avoid validating readers.
When only a node is required, consider using XmlDocument.ReadNode() rather than loading the entire document with Load().
Set the XmlResolver property to null on readers that expose it, to avoid access to external resources.
Make full use of MoveToContent() and Skip(). They avoid extraneous name creation. However, the benefit is almost nil when you use XmlValidatingReader.
Avoid accessing Value on Text/CDATA nodes unless you need it.
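
A few of these parsing tips can be sketched with XmlReaderSettings. This is a minimal example under assumptions: it uses XmlReader.Create rather than XmlTextReader, and the class and method names are hypothetical.

```csharp
using System.IO;
using System.Xml;

public static class SafeReaderSketch
{
    // Returns the text of the first element named elementName, or null if none exists.
    public static string FirstValue(TextReader input, string elementName)
    {
        var settings = new XmlReaderSettings
        {
            XmlResolver = null,                     // never fetch external resources
            DtdProcessing = DtdProcessing.Prohibit, // reject documents that carry a DTD
            IgnoreWhitespace = true,
            IgnoreComments = true
        };

        using (XmlReader reader = XmlReader.Create(input, settings))
        {
            // ReadToFollowing skips everything up to the next matching element.
            return reader.ReadToFollowing(elementName)
                ? reader.ReadElementContentAsString()
                : null;
        }
    }
}
```

No validating reader is involved, so the reader does only the work needed to reach the requested element.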

Validating XML

Avoid extraneous validation.
Consider caching schemas.
Avoid identity constraints. They store keys/fields for the entire document, and the keys are boxed.
Avoid extraneous strong typing. It results in calls to XmlSchemaDatatype.ParseValue(), although it can also let you avoid accessing the Value string.
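
The schema-caching tip might look like the sketch below, which compiles an XmlSchemaSet once and reuses it for every validation call. The inline schema and the class and method names are hypothetical, chosen only to keep the example self-contained.

```csharp
using System.IO;
using System.Xml;
using System.Xml.Schema;

public static class SchemaCacheSketch
{
    // Compiled once and shared by every validation call.
    private static readonly XmlSchemaSet CachedSchemas = BuildSchemas();

    private static XmlSchemaSet BuildSchemas()
    {
        // Hypothetical schema: a single decimal-valued TotalPrice element.
        const string xsd = @"<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>
  <xs:element name='TotalPrice' type='xs:decimal'/>
</xs:schema>";

        var set = new XmlSchemaSet();
        using (XmlReader schemaReader = XmlReader.Create(new StringReader(xsd)))
        {
            set.Add(null, schemaReader); // null: use the schema's own target namespace
        }
        set.Compile();
        return set;
    }

    // Returns false if the document violates the cached schema.
    // Malformed XML still throws XmlException.
    public static bool IsValid(string xml)
    {
        bool valid = true;
        var settings = new XmlReaderSettings
        {
            ValidationType = ValidationType.Schema,
            Schemas = CachedSchemas
        };
        settings.ValidationEventHandler += (sender, e) => valid = false;

        using (XmlReader reader = XmlReader.Create(new StringReader(xml), settings))
        {
            while (reader.Read()) { } // validation happens as the reader advances
        }
        return valid;
    }
}
```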

Writing XML

Write output directly whenever possible.
To save documents, XmlTextWriter without indentation is better than the TextWriter/Stream/file output options (which all indent), unless the output is meant for human reading.
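
Writing directly without indentation can be sketched as follows. The example uses XmlWriter.Create with XmlWriterSettings instead of XmlTextWriter, which is a substitution on my part; the class name, method name, and element names are assumptions for illustration.

```csharp
using System.Collections.Generic;
using System.Text;
using System.Xml;

public static class DirectWriteSketch
{
    // Writes the totals straight to the writer, with no intermediate DOM and no indentation.
    public static string WriteOrders(IEnumerable<decimal> totals)
    {
        var settings = new XmlWriterSettings { Indent = false, OmitXmlDeclaration = true };
        var sb = new StringBuilder();

        using (XmlWriter writer = XmlWriter.Create(sb, settings))
        {
            writer.WriteStartElement("Orders");
            foreach (decimal total in totals)
            {
                writer.WriteStartElement("Order");
                // XmlConvert formats the number independently of the current culture.
                writer.WriteElementString("TotalPrice", XmlConvert.ToString(total));
                writer.WriteEndElement();
            }
            writer.WriteEndElement();
        }
        return sb.ToString();
    }
}
```

For a large document the same calls would target a Stream or file instead of a StringBuilder, so the output never has to fit in memory.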

DOM Processing

Avoid InnerXml. It internally creates XmlTextReader/XmlTextWriter. InnerText is fine.
Avoid PreviousSibling. XmlDocument is very inefficient for backward traversal.
Append nodes as early as possible. Adding a big subtree results in a longer extraneous pass to check ID attributes.
Prefer FirstChild/NextSibling and avoid accessing ChildNodes. It creates an XmlNodeList, which is otherwise not instantiated.
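
The FirstChild/NextSibling pattern can be sketched as a simple forward loop. The class and method names are hypothetical.

```csharp
using System.Collections.Generic;
using System.Xml;

public static class SiblingWalkSketch
{
    // Collects the names of the element children of parent, walking forward only
    // and never touching ChildNodes or PreviousSibling.
    public static List<string> ChildElementNames(XmlNode parent)
    {
        var names = new List<string>();
        for (XmlNode child = parent.FirstChild; child != null; child = child.NextSibling)
        {
            if (child.NodeType == XmlNodeType.Element)
            {
                names.Add(child.Name);
            }
        }
        return names;
    }
}
```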

XPath Processing

Consider using XPathDocument, but only when you need the entire document. With XmlDocument you can use ReadNode(), but XPathDocument has no equivalent.
Avoid preceding-sibling and preceding axes queries, especially over XmlDocument. They would result in sorting, and for XmlDocument, they need access to PreviousSibling.
Avoid // (descendant). The returned nodes are most likely to be irrelevant.
Avoid position(), last() and positional predicates (especially things like foo[last()-1]).
Compile the XPath string to XPathExpression and reuse it for frequent queries.
Don't run XPath queries frequently. Each query is costly because it always has to Clone() XPathNavigators.
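
Compiling an expression once and reusing it might look like this sketch. The class and method names are hypothetical, and the query is the one used earlier in the article.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Xml.XPath;

public static class CompiledXPathSketch
{
    // Parsed once; reused for every document that is queried.
    private static readonly XPathExpression TotalPriceExpr =
        XPathExpression.Compile("/Orders/Order/TotalPrice");

    public static List<string> TotalPrices(TextReader input)
    {
        var doc = new XPathDocument(input);
        XPathNavigator nav = doc.CreateNavigator();

        var result = new List<string>();
        XPathNodeIterator it = nav.Select(TotalPriceExpr);
        while (it.MoveNext())
        {
            result.Add(it.Current.Value);
        }
        return result;
    }
}
```

The query uses an absolute path with no // steps and no positional predicates, in line with the tips above.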

XSLT Processing

Reuse (cache) XslTransform objects.
Avoid key() in XSLT. It can return all kinds of nodes, which prevents node-type-based optimization.
Avoid document(), especially with nonstatic arguments.
Pull style (e.g. xsl:for-each) is usually better than template matching.
Minimize output size. More importantly, minimize input.
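
Caching a transform object can be sketched as below. Note the substitution: the sketch uses XslCompiledTransform, the newer replacement for the XslTransform class named above. The inline stylesheet and the class and method names are hypothetical.

```csharp
using System.IO;
using System.Xml;
using System.Xml.Xsl;

public static class XsltCacheSketch
{
    // Loaded and compiled once; reused for every transformation.
    private static readonly XslCompiledTransform Cached = LoadTransform();

    private static XslCompiledTransform LoadTransform()
    {
        // Hypothetical stylesheet: pull style (xsl:for-each), text output.
        const string xslt = @"<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>
  <xsl:output method='text'/>
  <xsl:template match='/'>
    <xsl:for-each select='Orders/Order/TotalPrice'>
      <xsl:value-of select='.'/>
      <xsl:text> </xsl:text>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>";

        var transform = new XslCompiledTransform();
        using (XmlReader reader = XmlReader.Create(new StringReader(xslt)))
        {
            transform.Load(reader);
        }
        return transform;
    }

    // Applies the cached stylesheet to the given XML and returns the text output.
    public static string Run(string xml)
    {
        using (XmlReader input = XmlReader.Create(new StringReader(xml)))
        using (var output = new StringWriter())
        {
            Cached.Transform(input, null, output);
            return output.ToString();
        }
    }
}
```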