Easily Find Tags and Values in a Large XML Document Using XmlTextReader in C#

Introduction

XML (eXtensible Markup Language) is a widely used format for structuring and storing data in a hierarchical manner. It consists of elements enclosed in tags, which can have attributes and contain text or other elements. However, processing large XML files can be memory-intensive, especially if you load the entire document into memory at once.

Use XmlTextReader to parse large XML documents

using System.Xml;

public void FindParticularNodesUsingTextReader()
{
    string xmlFilePath = @"C:\Documents and Settings\Administrator\Desktop\sampleXmlDoc.xml";

    using (XmlTextReader txtReader = new XmlTextReader(xmlFilePath))
    {
        txtReader.WhitespaceHandling = WhitespaceHandling.None;

        while (txtReader.Read())
        {
            if (txtReader.Name.Equals("TotalPrice") && txtReader.IsStartElement())
            {
                txtReader.Read();
                richTextBox1.AppendText(txtReader.Value + " "); // space-separate the values, as in the output shown
            }
        }
    }
}

Output

12.36 11.99 7.97

For faster, read-only, XPath query-based access to data, use XPathDocument and XPathNavigator together with an XPath query.

using System.Xml.XPath;

public void FindTagsUsingXPathNavigatorAndXPathDocumentNew()
{
    string xmlFilePath = @"C:\Documents and Settings\Administrator\Desktop\sampleXmlDoc.xml";

    XPathDocument xpDoc = new XPathDocument(xmlFilePath);
    XPathNavigator xpNav = xpDoc.CreateNavigator();
    XPathExpression xpExpression = xpNav.Compile("/Orders/Order/TotalPrice");
    XPathNodeIterator xpIter = xpNav.Select(xpExpression);

    while (xpIter.MoveNext())
    {
        richTextBox1.AppendText(xpIter.Current.Value + " "); // space-separate the values, as in the output shown
    }
}

Output

12.36 11.99 7.97

Combine XmlReader and XmlDocument. On the XmlReader, use the MoveToContent and Skip methods to skip unwanted items.

using System.Xml;

public void UseXmlReaderAndXmlDocument()
{
    string xmlFilePath = @"C:\Documents and Settings\Administrator\Desktop\sampleXmlDoc.xml";

    using (XmlReader rdrObj = XmlReader.Create(xmlFilePath))
    {
        while (rdrObj.Read())
        {
            if (rdrObj.NodeType == XmlNodeType.Element &&
                rdrObj.Name.Equals("TotalPrice") &&
                rdrObj.IsStartElement())
            {
                rdrObj.Read();
                richTextBox1.AppendText(rdrObj.Value + " "); // space-separate the values, as in the output shown
            }
        }
    }
}

Output

12.36 11.99 7.97

The variant below runs the same streaming pass, then loads the whole file into an XmlDocument and displays its InnerText.

using System.Xml;

public void UseXmlReaderAndXmlDocumentNew()
{
    string xmlFilePath = @"C:\Documents and Settings\Administrator\Desktop\sampleXmlDoc.xml";

    using (XmlReader rdrObj = XmlReader.Create(xmlFilePath))
    {
        XmlDocument xmlDocObj = new XmlDocument();
        
        while (rdrObj.Read())
        {
            if (rdrObj.NodeType == XmlNodeType.Element &&
                rdrObj.Name.Equals("TotalPrice") &&
                rdrObj.IsStartElement())
            {
                rdrObj.Read();
                richTextBox1.AppendText(rdrObj.Value);
            }
        }

        rdrObj.Close(); // Close the XmlReader before loading into XmlDocument
        
        xmlDocObj.Load(xmlFilePath);
        // Replaces the box contents (including the prices appended above) with the document's text
        richTextBox1.Text = xmlDocObj.InnerText;
    }
}
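
The listings above read every node and never call MoveToContent or Skip. The following is a minimal sketch of how those two methods could be used, assuming the /Orders/Order/TotalPrice layout implied by the XPath query earlier. The class and method names are hypothetical, and the input is passed as a TextReader so the example does not depend on a file path.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Xml;

public static class SkipSketch
{
    // Collects the text of each Order's TotalPrice child and skips
    // every other subtree without visiting the nodes inside it.
    public static List<string> ReadTotalPrices(TextReader input)
    {
        var prices = new List<string>();
        using (XmlReader reader = XmlReader.Create(input))
        {
            // MoveToContent() jumps over the declaration, comments and
            // whitespace, and returns None at the end of the input.
            while (reader.MoveToContent() != XmlNodeType.None)
            {
                if (reader.NodeType != XmlNodeType.Element)
                {
                    reader.Read(); // end tags and stray text: just advance
                    continue;
                }

                switch (reader.Name)
                {
                    case "Orders":
                    case "Order":
                        reader.Read(); // descend into the container
                        break;
                    case "TotalPrice":
                        // Reads the text and moves past </TotalPrice>.
                        prices.Add(reader.ReadElementContentAsString());
                        break;
                    default:
                        reader.Skip(); // jump past the whole unwanted subtree
                        break;
                }
            }
        }
        return prices;
    }
}
```

Calling SkipSketch.ReadTotalPrices(File.OpenText(xmlFilePath)) would return the prices as strings. Because Skip() and ReadElementContentAsString() both advance the reader, the loop does not call Read() after them.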

Design Considerations

  • Avoid XML where possible.
  • Avoid processing large documents.
  • Avoid validation. XmlValidatingReader is 2-3x slower than XmlTextReader.
  • Avoid DTD, especially IDs and entity references.
  • Use streaming interfaces such as XmlReader or SAXdotnet.
  • Consider hard-coded processing, including validation.
  • Shorten node name length.
  • Consider sharing a NameTable, but only when the same names are likely to recur across documents. The more irrelevant names the table accumulates, the slower it becomes.
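
As an illustration of the NameTable point, here is a small sketch that passes one NameTable to several readers through XmlReaderSettings and compares element names by reference. The class and method names are hypothetical; XmlReaderSettings.NameTable and XmlNameTable.Add are the real APIs.

```csharp
using System.IO;
using System.Xml;

public static class NameTableSketch
{
    // Counts elements with the given name, using a caller-supplied
    // NameTable so that several readers can share atomized names.
    public static int CountElements(string xml, string elementName, XmlNameTable sharedNames)
    {
        // Add() returns the single atomized instance of the string.
        object wanted = sharedNames.Add(elementName);
        var settings = new XmlReaderSettings { NameTable = sharedNames };

        int count = 0;
        using (XmlReader reader = XmlReader.Create(new StringReader(xml), settings))
        {
            while (reader.Read())
            {
                // The reader atomizes names through the same table, so a
                // reference comparison replaces a character-by-character one.
                if (reader.NodeType == XmlNodeType.Element &&
                    (object)reader.LocalName == wanted)
                {
                    count++;
                }
            }
        }
        return count;
    }
}
```

A caller would create one NameTable and pass it to every call that parses documents with the same vocabulary.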

Parsing XML

  • Use XmlTextReader and avoid validating readers.
  • When a node is required, consider using XmlDocument.ReadNode(), not the entire Load().
  • Set the XmlResolver property to null on readers that expose it, to avoid access to external resources.
  • Make full use of MoveToContent() and Skip(). They avoid extraneous name creation. However, the benefit shrinks to almost nothing when you use XmlValidatingReader.
  • Avoid accessing Value for Text/CDATA nodes whenever possible.
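
The ReadNode() and XmlResolver points above can be sketched together. This example streams through the input and materializes one Order element at a time with XmlDocument.ReadNode() instead of calling Load() on the whole file. The element names follow the XPath query used earlier; the class and method names are hypothetical.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Xml;

public static class ReadNodeSketch
{
    public static List<string> ReadTotalPrices(TextReader input)
    {
        // A null resolver keeps the reader from fetching external resources.
        var settings = new XmlReaderSettings { XmlResolver = null };
        var doc = new XmlDocument { XmlResolver = null };
        var prices = new List<string>();

        using (XmlReader reader = XmlReader.Create(input, settings))
        {
            while (!reader.EOF)
            {
                if (reader.NodeType == XmlNodeType.Element && reader.Name == "Order")
                {
                    // ReadNode() builds a DOM subtree for this one element and
                    // leaves the reader just after </Order>, so no Read() here.
                    XmlNode order = doc.ReadNode(reader);
                    XmlElement price = order["TotalPrice"];
                    if (price != null)
                    {
                        prices.Add(price.InnerText);
                    }
                }
                else
                {
                    reader.Read();
                }
            }
        }
        return prices;
    }
}
```

Each Order subtree becomes eligible for garbage collection once the loop moves on, so memory use should track the size of one Order rather than the whole document.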

Validating XML

  • Avoid extraneous validation.
  • Consider caching schemas.
  • Avoid identity constraints. They store keys/fields for the entire document, and the keys are boxed.
  • Avoid unnecessary strong typing. It triggers XmlSchemaDatatype.ParseValue() calls, although it can also let you avoid accessing the Value string.
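
One way to cache schemas is to compile an XmlSchemaSet once and hand the same instance to every validating reader. The sketch below assumes a tiny inline schema for the Orders layout; the schema text and all class and member names are hypothetical, and a real application would load its own XSD.

```csharp
using System.IO;
using System.Xml;
using System.Xml.Schema;

public static class SchemaCacheSketch
{
    // Hypothetical schema for illustration only.
    private const string Xsd =
        "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>" +
        "<xs:element name='Orders'><xs:complexType><xs:sequence>" +
        "<xs:element name='Order' maxOccurs='unbounded'><xs:complexType><xs:sequence>" +
        "<xs:element name='TotalPrice' type='xs:decimal'/>" +
        "</xs:sequence></xs:complexType></xs:element>" +
        "</xs:sequence></xs:complexType></xs:element>" +
        "</xs:schema>";

    // Parsed and compiled once, then reused for every document.
    private static readonly XmlSchemaSet CachedSchemas = BuildSchemas();

    private static XmlSchemaSet BuildSchemas()
    {
        var schemas = new XmlSchemaSet();
        using (XmlReader schemaReader = XmlReader.Create(new StringReader(Xsd)))
        {
            schemas.Add(null, schemaReader); // null: use the schema's own target namespace
        }
        schemas.Compile();
        return schemas;
    }

    public static bool IsValid(string xml)
    {
        bool valid = true;
        var settings = new XmlReaderSettings
        {
            ValidationType = ValidationType.Schema,
            Schemas = CachedSchemas
        };
        settings.ValidationEventHandler += (sender, e) => valid = false;

        using (XmlReader reader = XmlReader.Create(new StringReader(xml), settings))
        {
            while (reader.Read()) { } // validation happens as the reader advances
        }
        return valid;
    }
}
```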

Writing XML

  • Write output directly whenever possible.
  • To save documents, use XmlTextWriter without indentation. It is better than TextWriter/Stream/file output (all of which indent), except when the output is meant for human reading.
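
A brief sketch of writing output directly with an unindented XmlTextWriter, without building a DOM first. The element names mirror the sample document; the class and method names are hypothetical.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Xml;

public static class DirectWriteSketch
{
    public static string WriteOrders(IEnumerable<string> totalPrices)
    {
        var output = new StringWriter();
        using (var writer = new XmlTextWriter(output))
        {
            writer.Formatting = Formatting.None; // no indentation (this is the default)

            writer.WriteStartElement("Orders");
            foreach (string price in totalPrices)
            {
                writer.WriteStartElement("Order");
                writer.WriteElementString("TotalPrice", price);
                writer.WriteEndElement(); // </Order>
            }
            writer.WriteEndElement(); // </Orders>
        }
        return output.ToString();
    }
}
```

For file output, the same calls would work with an XmlTextWriter constructed over a file name and encoding in place of the StringWriter.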

DOM Processing

  • Avoid InnerXml. It internally creates XmlTextReader/XmlTextWriter. InnerText is fine.
  • Avoid PreviousSibling. XmlDocument is very inefficient for backward traversal.
  • Append nodes as soon as possible. Adding a big subtree results in a longer extraneous run to check ID attributes.
  • Prefer FirstChild/NextSibling and avoid accessing ChildNodes. Accessing it creates an XmlNodeList, which is not instantiated until then.
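
The FirstChild/NextSibling advice can be sketched as a forward-only loop over a node's children. The class and method names are hypothetical.

```csharp
using System.Collections.Generic;
using System.Xml;

public static class SiblingWalkSketch
{
    // Lists the names of the element children of a node without
    // touching ChildNodes or PreviousSibling.
    public static List<string> ChildElementNames(XmlNode parent)
    {
        var names = new List<string>();
        for (XmlNode child = parent.FirstChild; child != null; child = child.NextSibling)
        {
            if (child.NodeType == XmlNodeType.Element)
            {
                names.Add(child.Name);
            }
        }
        return names;
    }
}
```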

XPath Processing

  • Consider using XPathDocument, but only when you need the entire document. With XmlDocument you can use ReadNode(), but XPathDocument has no equivalent.
  • Avoid preceding-sibling and preceding axes queries, especially over XmlDocument. They would result in sorting, and for XmlDocument, they need access to PreviousSibling.
  • Avoid // (descendant). Most of the returned nodes are likely to be irrelevant.
  • Avoid position(), last() and positional predicates (especially things like foo[last()-1]).
  • Compile the XPath string to XPathExpression and reuse it for frequent queries.
  • Don't run XPath queries frequently. Each query is costly because it always has to Clone() XPathNavigators.
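
To illustrate reusing a compiled expression, the sketch below compiles the query once with the static XPathExpression.Compile method and applies it to any navigator. The class and member names are hypothetical. It assumes single-threaded use.

```csharp
using System.Collections.Generic;
using System.Xml.XPath;

public static class CompiledXPathSketch
{
    // Compiled once and reused for every document that is queried.
    private static readonly XPathExpression TotalPriceQuery =
        XPathExpression.Compile("/Orders/Order/TotalPrice");

    public static List<string> SelectTotalPrices(XPathNavigator navigator)
    {
        var values = new List<string>();
        XPathNodeIterator iterator = navigator.Select(TotalPriceQuery);
        while (iterator.MoveNext())
        {
            values.Add(iterator.Current.Value);
        }
        return values;
    }
}
```

A caller would pass new XPathDocument(xmlFilePath).CreateNavigator() for each file, and the query string is parsed only once.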

XSLT Processing

  • Reuse (cache) XslTransform objects.
  • Avoid key() in XSLT. It can return all kinds of nodes, which prevents node-type-based optimization.
  • Avoid document(), especially with nonstatic arguments.
  • Pull style (e.g. xsl:for-each) is usually better than template match.
  • Minimize output size. More importantly, minimize input.
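
A possible way to cache transforms is sketched below. It uses XslCompiledTransform rather than XslTransform, because XslTransform has been marked obsolete since .NET 2.0; that substitution is an assumption of this example, not something the list above prescribes. Keying the cache by stylesheet text is only for illustration (a file path would be a more typical key), and the class and method names are hypothetical.

```csharp
using System.Collections.Concurrent;
using System.IO;
using System.Xml;
using System.Xml.Xsl;

public static class TransformCacheSketch
{
    private static readonly ConcurrentDictionary<string, XslCompiledTransform> Cache =
        new ConcurrentDictionary<string, XslCompiledTransform>();

    // Loads and compiles a stylesheet the first time it is seen,
    // then returns the cached instance on later calls.
    public static XslCompiledTransform GetTransform(string stylesheet)
    {
        return Cache.GetOrAdd(stylesheet, text =>
        {
            var transform = new XslCompiledTransform();
            using (XmlReader reader = XmlReader.Create(new StringReader(text)))
            {
                transform.Load(reader);
            }
            return transform;
        });
    }

    public static string Apply(string stylesheet, string inputXml)
    {
        var output = new StringWriter();
        using (XmlReader input = XmlReader.Create(new StringReader(inputXml)))
        {
            GetTransform(stylesheet).Transform(input, null, output);
        }
        return output.ToString();
    }
}
```

A stylesheet that follows the pull-style advice would select the prices with xsl:for-each and xsl:value-of inside a single template.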

