XML Diff and Patch using LINQ to Xml and LINQ to Objects


Overview

This article focuses on working with XML and on getting the best out of LINQ to XML and LINQ to Objects. Its aim is to show you the power of LINQ to XML and to get you started on LINQ with practical examples. We will discuss the specific scenario of finding the differences between two XML files and applying the changes. Instead of building a tool or an end-to-end solution, I have picked the areas you are likely to hit and show how to solve a few problems using different techniques.

All the queries in this article have been tested with the sample XML provided, so it is easy to try them by copying the XML and the sample queries. Most of the queries are independent, except for the ones in the Task 6 and Task 7 sections.

I have intentionally structured this article as a series of tasks and solutions, so that developers can get started on LINQ quickly and focus on the areas they are interested in.

The details covered in this article are limited to LINQ to XML and a little LINQ to Objects. There is more to LINQ than can be covered here. I have tried to keep the details simple and easy to follow so that they make a good starting point if you are new to LINQ.

Problem: Given two XML files, find the differences between them and apply those differences to the source file [merging changes].

The following sections focus on solving the key problems in XML diff/patch. This requires querying, comparing and modifying XML, so I take each task in turn and give options, with sample LINQ code, for solving it. This way you can focus on just the tasks you are interested in.

Why LINQ?

LINQ is a very powerful, declarative way of solving problems with less code. With LINQ, we just say what we want and don't bother so much about how to get it, as we would in imperative style. It is great that Microsoft introduced LINQ to .NET. When I first heard about it, I thought it was one of those Microsoft tricks that make things easy for the developer but are not good from a performance point of view. But it is amazing to see how well LINQ performs.

Disclaimer: All the details provided here come from my practical experience on projects. The focus of this article is to quickly show the power of LINQ to XML and other key benefits of using LINQ. Depending on your requirements, you may find some options more suitable than others. Please refer to the MSDN documentation and blogs on LINQ if you want to dig deeper into a specific area of LINQ.
 
Solving the problem:

Sample input: When you try any code block in the subsequent sections, always use the sample input data below.

File1 Data: Master file (referred to as "A XML" in this article)


<?xml version="1.0" encoding="utf-8" ?>
<Employees>
  <Employee ID="1" Name="JayT" JoinDate="3/1/2009">
    <Projects>
      <Project ID="1" Name="Project1" StartDate="2/1/2009" EndDate="4/1/2009" />
      <Project ID="2" Name="Project2" StartDate="3/5/2009" EndDate="4/1/2010" />
    </Projects>
  </Employee>
  <Employee ID="2" Name="Kim" JoinDate="4/1/2009">
    <Projects>
      <Project ID="3" Name="Project3" StartDate="5/1/2009" EndDate="10/1/2009" />
      <Project ID="2" Name="Project2" StartDate="3/5/2009" EndDate="4/1/2010" />
      <Project ID="6" Name="Project6" StartDate="3/5/2009" EndDate="4/1/2010" />
    </Projects>
  </Employee>
  <Employee ID="3" Name="Tom" JoinDate="6/1/2009">
    <Projects>
    </Projects>
  </Employee>
</Employees>

File2 Data: Modified File (Referred to as "B XML" in the document)


<?xml version="1.0" encoding="utf-8" ?>
<Employees>
  <Employee ID="1" Name="JayT" JoinDate="3/1/2009">
    <Projects>
      <Project ID="6" Name="Project6" StartDate="5/1/2009" EndDate="11/1/2009" />
    </Projects>
  </Employee>
  <Employee ID="3" Name="Tom" JoinDate="6/1/2009">
    <Projects>
      <Project ID="5" Name="Project5" StartDate="7/1/2009" EndDate="12/1/2009" />
    </Projects>
  </Employee>
</Employees>

Task 1: Knowing LINQ

  • Think differently: LINQ is declarative programming. The whole point is that you cut down on imperative-style code, and most of the time a simple query gets you what you want. It takes some effort to stay away from the old DOM/XPath approaches and think the LINQ way.
  • LINQ to XML classes start with X [for every key Xml* class you will find a matching X… class]. Example: XDocument for XmlDocument, XNode for XmlNode. I think this was done to make those of us who have used the Xml* classes for a long time feel at home. But LINQ is a declarative programming model, compared to the procedural/imperative approach we use with the DOM.
  • There are a few useful static methods on XNode, XElement and XDocument. XContainer is the base class for XElement and XDocument, and XNode is the base class for XContainer.
  • You can mix and match XPath expressions with LINQ queries [through the extension methods in System.Xml.XPath]. So picking up LINQ gives you the best of many worlds, as you can also mix in LINQ to Objects. As long as you use XPath selectors where they make sense, that is fine.
  • XElement can also be serialized with XmlSerializer or DataContractSerializer, which makes it a first-class citizen for your WCF data contract members.
  • LINQ is simple and fast if you use equi-joins [even with composite keys]. Things are a little trickier if you need non-equi joins, but there is enough support to get around this. More on this in Task 4.
  • By now you already know LINQ. Just try the samples in the rest of this article and you will find it very easy to pick up the other concepts.

Task 2: Loading XML

Loading from files, stream objects or URIs:

XDocument.Load [loads the whole document]

XElement.Load [loads the root element and everything under it]
The Load methods have a number of overloads, and one of them will most likely match what you are trying to do.
Loading from a string:

XDocument.Parse(xmlString);
XElement.Parse(xmlString);
See also: the optional LoadOptions parameter, which controls whether whitespace, base URI and line information are preserved.
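
To make the LoadOptions remark a little more concrete, here is a minimal sketch of what the flags change when parsing a string. The XML literal is a made-up, cut-down sample [not the full File1.xml], and the variable names are only for illustration.

```csharp
using System;
using System.Linq;
using System.Xml;
using System.Xml.Linq;

// Hypothetical cut-down sample, not the full File1.xml from this article.
string xml = "<Employees>\n  <Employee ID='1' Name='JayT' />\n</Employees>";

// Default options: insignificant whitespace between elements is discarded.
XElement plain = XElement.Parse(xml);

// PreserveWhitespace keeps that whitespace as text nodes;
// SetLineInfo records where each node appeared in the source text.
XElement detailed = XElement.Parse(
    xml, LoadOptions.PreserveWhitespace | LoadOptions.SetLineInfo);

int plainNodeCount = plain.Nodes().Count();        // the Employee element only
int detailedNodeCount = detailed.Nodes().Count();  // Employee plus surrounding whitespace

// Line information is exposed through the IXmlLineInfo interface.
IXmlLineInfo lineInfo = detailed.Element("Employee");

Console.WriteLine("plain: " + plainNodeCount + ", detailed: " + detailedNodeCount);
Console.WriteLine("Employee starts on line " + lineInfo.LineNumber);
```

Line information can be handy in a diff tool when you want to report where in the file a difference was found.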
 
Task 3: Selecting/Querying data from XML

// Load XML using any of the methods explained in Task 2
XDocument aDoc = XDocument.Load("File1.xml");
XDocument bDoc = XDocument.Load("File2.xml");


Sample Query 1: Simple select queries

      // Select all employees with join date > 3/1/2009
      // (DateTime.Parse uses the current culture; the sample dates are in M/d/yyyy format)
      var result1 = from elm in aDoc.Descendants("Employee")
                    where DateTime.Parse(elm.Attribute("JoinDate").Value) >
                          DateTime.Parse("3/1/2009")
                    select elm;


Sample Query 2: Selecting data with multiple conditions

     
      // Select employees with ID >= 2 working on project "Project2"
      var result2 = from prj in aDoc.Descendants("Project")
                    let emp = prj.Parent.Parent
                    where Int32.Parse(emp.Attribute("ID").Value) >= 2 &&
                          prj.Attribute("Name").Value == "Project2"
                    select emp;


As you can see, LINQ lives up to its name [Language Integrated Query] and reads much like SQL. The only issue you might stumble across is that the select clause comes at the end and the from clause at the beginning. This may confuse a few people, but once you realize that the from clause is like a foreach statement picking up one item of the collection at a time, you should find it easy.
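
The "from is like foreach" idea can be sketched by writing the same selection both ways. This is only an illustration on a made-up, cut-down sample; the names viaQuery and viaLoop are mine.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

// Hypothetical cut-down sample, not the full File1.xml from this article.
XElement employees = XElement.Parse(@"<Employees>
  <Employee ID='1' Name='JayT' />
  <Employee ID='2' Name='Kim' />
  <Employee ID='3' Name='Tom' />
</Employees>");

// Query syntax: "from" introduces the item variable, "select" comes last.
List<string> viaQuery = (from emp in employees.Elements("Employee")
                         where Int32.Parse(emp.Attribute("ID").Value) >= 2
                         select emp.Attribute("Name").Value).ToList();

// The same logic as an explicit loop: foreach plays the role of "from",
// the if plays the role of "where", and Add plays the role of "select".
List<string> viaLoop = new List<string>();
foreach (XElement element in employees.Elements("Employee"))
{
    if (Int32.Parse(element.Attribute("ID").Value) >= 2)
    {
        viaLoop.Add(element.Attribute("Name").Value);
    }
}

Console.WriteLine(string.Join(", ", viaQuery)); // Kim, Tom
Console.WriteLine(string.Join(", ", viaLoop));  // Kim, Tom
```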

Simple tricks to learn LINQ: The following are a few simple tricks for picking up LINQ syntax quickly:

  • Always start by writing something like "var result = from xx in <your collection>". Once you get past this line, you will find it very easy to write the rest of the query.
  • Don't pay much attention to the fact that we declare the result with "var"; the compiler infers the actual type.
  • Treat the result as a collection of whatever you selected in "select".

TIP 1: Creating custom objects on the fly:

You don't always have to live with the XElement/XDocument returned by the query; you can build custom objects on the fly using anonymous types, as shown in "SAMPLE QUERY 3".

      // SAMPLE QUERY 3:
      // Complex queries with custom functions to evaluate
      // conditions or select attribute values
      var result3 = from emp in aDoc.Descendants("Employee")
                    where IsValidEmployee(emp) &&
                          IsValidProject(emp.Elements("Projects").First())
                    select new
                    {
                        EmployeeID = GetEmployeeID(emp),
                        Name = GetEmployeeName(emp),
                        Projects = GetProjects(emp.Element("Projects")),
                        EmployeeRef = emp
                    };

      // The following are quick sample implementations for demo purposes only.
      // In practice you would do something more meaningful.
      static bool IsValidEmployee(XElement emp) { return true; }

      static bool IsValidProject(XElement prj) { return true; }

      static int GetEmployeeID(XElement emp)
      {
          return Int32.Parse(emp.Attribute("ID").Value);
      }

      static string GetEmployeeName(XElement emp)
      {
          return emp.Attribute("Name").Value;
      }

      static List<string> GetProjects(XElement prjs) { return new List<string>(); }

TIP 2: Keeping a reference to the XElement in the selected items helps in some cases

Converting the result to a custom object structure and writing code purely against the custom objects' properties is not always efficient or flexible. For example, if you later want to add attributes to the XML based on some checks and you did not keep a reference to the XElement, you have to query for it again. Similarly, if you only need certain XElement attributes for conditional checks in specific cases, it is not worth reading all of them up front.

I will cover more on modifying XML in later sections.
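
As a rough sketch of why the kept reference is useful, the snippet below selects an anonymous type that carries the source element along and later uses it to mark one employee. The "Reviewed" attribute is hypothetical and only there for illustration, as is the cut-down sample XML.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

// Hypothetical cut-down sample, not the full File1.xml from this article.
XElement employees = XElement.Parse(@"<Employees>
  <Employee ID='1' Name='JayT' JoinDate='3/1/2009' />
  <Employee ID='2' Name='Kim' JoinDate='4/1/2009' />
</Employees>");

// ToList() runs the query once, so the loop below works on a fixed list.
var rows = (from emp in employees.Elements("Employee")
            select new
            {
                EmployeeID = Int32.Parse(emp.Attribute("ID").Value),
                EmployeeRef = emp // keep a handle on the source element
            }).ToList();

foreach (var row in rows)
{
    if (row.EmployeeID == 2)
    {
        // No second query needed: the element is already at hand.
        // "Reviewed" is a made-up attribute name for this sketch.
        row.EmployeeRef.SetAttributeValue("Reviewed", "true");
    }
}

Console.WriteLine(employees);
```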

TIP 3: Creating an instance of a known type in select, instead of using an anonymous type

You don't always have to create an anonymous type in select. You can select the XElement directly, or you can create an instance of a class that is already defined, as follows:

select new XXXClass { Member1 = value1, Member2 = value2 };


In some cases, creating instances of a known type is useful because you need them for other business rules down the line. In that scenario there is no need to create anonymous types first and then copy them into business entities.
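
A small sketch of selecting into a known type follows. EmployeeInfo is a hypothetical business entity invented for this example [it stands in for XXXClass above], and the XML is a cut-down sample.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

// Hypothetical cut-down sample, not the full File1.xml from this article.
XElement employees = XElement.Parse(@"<Employees>
  <Employee ID='1' Name='JayT' />
  <Employee ID='2' Name='Kim' />
</Employees>");

// The select clause fills a named class through an object initializer,
// so the result can be passed straight to other code as List<EmployeeInfo>.
List<EmployeeInfo> infos = (from emp in employees.Elements("Employee")
                            select new EmployeeInfo
                            {
                                Id = Int32.Parse(emp.Attribute("ID").Value),
                                Name = emp.Attribute("Name").Value
                            }).ToList();

foreach (EmployeeInfo info in infos)
{
    Console.WriteLine(info.Id + ": " + info.Name);
}

// Hypothetical business entity, defined only for this sketch.
class EmployeeInfo
{
    public int Id { get; set; }
    public string Name { get; set; } = "";
}
```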

Task 4: Comparing XML from two XML files

Now comes the interesting part: comparing two XML files to find the differences or the common elements. There is no single approach that works best; depending on your requirements, you may find a combination of the following useful. LINQ is very good and efficient with equi-join scenarios. We need to be careful with non-equi join scenarios [!=]. I will try to cover different scenarios below, with a few tips. One thing to keep in mind when writing queries is that it is easy to write a bad query and then blame LINQ for the performance.

I will start with simple options and then get into more involved queries:

XDocument aDoc = XDocument.Load("File1.xml");
XDocument bDoc = XDocument.Load("File2.xml");

Finding common elements (Elements that are in both A and B)

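// COMMON ELEMENTS QUERY (one-liner)
// Get Projects from aDoc that are the same in bDoc.
// XNode.EqualityComparer compares elements by value [deep equality]. Without it,
// Intersect compares object references, and elements from two documents never match.

var commonfromA = aDoc.Descendants("Project")
                      .Intersect(bDoc.Descendants("Project"), XNode.EqualityComparer);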

Finding different elements (elements that are in A but not in B)

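// DIFFERENT ELEMENTS QUERY (one-liner)
// Get Projects in aDoc that are not in bDoc.
// XNode.EqualityComparer compares elements by value [deep equality]. Without it,
// Except compares object references and returns every Project in aDoc.

var diffinA = aDoc.Descendants("Project")
                  .Except(bDoc.Descendants("Project"), XNode.EqualityComparer);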
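
One thing worth checking with Intersect and Except on XElement sequences is which notion of equality they use. The sketch below, on made-up cut-down samples, contrasts the default comparer [object references] with XNode.EqualityComparer [deep, value-based comparison].

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

// Hypothetical cut-down samples, not the full File1.xml/File2.xml.
XElement a = XElement.Parse(@"<Projects>
  <Project ID='2' Name='Project2' />
  <Project ID='6' Name='Project6' />
</Projects>");
XElement b = XElement.Parse(@"<Projects>
  <Project ID='6' Name='Project6' />
  <Project ID='5' Name='Project5' />
</Projects>");

// Default comparer: two XElement objects are equal only if they are the
// same object, so elements parsed from two sources never match.
int commonByReference = a.Elements("Project")
                         .Intersect(b.Elements("Project"))
                         .Count();

// XNode.EqualityComparer compares name, attributes and content.
var commonByValue = a.Elements("Project")
                     .Intersect(b.Elements("Project"), XNode.EqualityComparer)
                     .ToList();
var onlyInA = a.Elements("Project")
               .Except(b.Elements("Project"), XNode.EqualityComparer)
               .ToList();

Console.WriteLine(commonByReference);                      // 0
Console.WriteLine(commonByValue[0].Attribute("ID").Value); // 6
Console.WriteLine(onlyInA[0].Attribute("ID").Value);       // 2
```

Keep in mind that deep equality looks at every attribute, so two Project elements with the same ID but different dates count as different. When you only care about a key, the join queries in this section are the better fit.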

Comparing using Custom Query

// JOIN QUERY
// Custom compare based on a specific attribute

var prjsAB = from aPrj in aDoc.Descendants("Project")
             join bPrj in bDoc.Descendants("Project")
                 on aPrj.Attribute("ID").Value equals bPrj.Attribute("ID").Value
             select new
             {
                 aProject = aPrj,
                 bProject = bPrj
             };


Join with composite Key

// JOIN WITH COMPOSITE KEY
// Custom compare based on a combination of attributes

var prjsA = from aPrj in aDoc.Descendants("Project")
            join bPrj in bDoc.Descendants("Project")
                on new
                {
                    ID = aPrj.Attribute("ID").Value,
                    Name = aPrj.Attribute("Name").Value
                }
                equals new
                {
                    ID = bPrj.Attribute("ID").Value,
                    Name = bPrj.Attribute("Name").Value
                }
            select aPrj;


Cross-Join (Returning elements in both A and B)

// CROSS JOIN (RETURN elements in both XML documents)
// IMPORTANT: Use cross-joins only when you need cross-join results.
// Always prefer regular joins as they are faster. If you need non-equi joins,
// there are other options than using a cross-join with a where condition.
// You might get into performance issues if you use cross-joins for non-equi join cases.

var allPrjsinAB = from aPrj in aDoc.Descendants("Project")
                  from bPrj in bDoc.Descendants("Project")
                  where aPrj.Parent.Parent.Attribute("ID").Value ==
                        bPrj.Parent.Parent.Attribute("ID").Value
                  select new { ProjectA = aPrj, ProjectB = bPrj };


As you can see in the cross-join example above, you get very different results depending on whether you use a where clause and on which conditions you put in it. You can also use this for non-equi join scenarios in some cases. But I found a few cases where multiple regular join queries ran faster than a single cross-join that returned the non-equi join results I expected. So pay attention to this area when you use cross-joins.

TIP 4: Join with composite keys

When you need an equi-join comparison on composite keys, prefer the join clause with composite keys [as shown in the composite key example above] over a cross-join with composite conditions in the where clause.

TIP 5: Non-equi joins with multiple queries

// NON-EQUI JOIN scenario
// Find all Projects that are new in B Document/XML

// Get all Project IDs in B XML that are not in A XML, using Except
var bProjIDs = (from prjID in bDoc.Descendants("Project")
                select prjID.Attribute("ID").Value
               ).Except(from prjID in aDoc.Descendants("Project")
                        select prjID.Attribute("ID").Value);

// Now get the full project elements from B XML matching the new project IDs
// You don't have to create a new type and can just select newPrj directly

var bNewPrjs = from newPrj in bDoc.Descendants("Project")
               join prjID in bProjIDs on newPrj.Attribute("ID").Value equals prjID
               select new { ProjectID = prjID, ProjectRef = newPrj };


As you can see in the non-equi join scenario above, we did not actually use a non-equi join directly but relied on the Except method. We also split the query in two, instead of using a cross-join with non-equi conditions in the where clause. You can try both yourself and see which option gives you better performance. When you need to do repeated non-equi joins with different conditions, splitting the query like this makes things easier, as you can store intermediate results and reuse them in multiple queries.
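
For comparison, here is one more way to get "in B but not in A" in a single query: a group join ["join … into"] followed by a check for an empty group. This is an alternative sketch on made-up cut-down samples, not the approach used above, and I have not measured how it performs against the two-step Except version.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

// Hypothetical cut-down samples, not the full File1.xml/File2.xml.
XElement a = XElement.Parse(@"<Projects>
  <Project ID='1' /><Project ID='2' /><Project ID='6' />
</Projects>");
XElement b = XElement.Parse(@"<Projects>
  <Project ID='6' /><Project ID='5' />
</Projects>");

// For each B project, "into matches" collects the A projects with the same ID.
// A B project whose group is empty has no counterpart in A, so it is new.
var newInB = (from bPrj in b.Descendants("Project")
              join aPrj in a.Descendants("Project")
                  on bPrj.Attribute("ID").Value equals aPrj.Attribute("ID").Value
                  into matches
              where !matches.Any()
              select bPrj).ToList();

foreach (XElement prj in newInB)
{
    Console.WriteLine("New project ID: " + prj.Attribute("ID").Value);
}
```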

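Useful compare methods in XDocument/XElement:

Comparing XML as strings is not accurate and leads to many issues [formatting and whitespace differences, for example]. The XNode class therefore provides useful static compare methods, which are also accessible through XElement and XDocument. The following are a few that you may find useful in comparison scenarios:

      // true when both nodes have the same name, attributes and content
      XElement.DeepEquals(node1, node2);
      XDocument.DeepEquals(node1, node2);

      // relative position of two nodes that belong to the same tree
      XElement.CompareDocumentOrder(node1, node2);
      XDocument.CompareDocumentOrder(node1, node2);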
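
A short sketch of how these methods behave, on made-up fragments. It shows two pieces of XML that differ as text but are equal as XML, and the ordering of two siblings in one tree.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

// Same element written with different quoting and spacing.
string textA = "<Project ID='6' Name='Project6' />";
string textB = "<Project   ID=\"6\"   Name=\"Project6\"/>";

bool sameText = textA == textB;
bool sameXml = XNode.DeepEquals(XElement.Parse(textA), XElement.Parse(textB));

// A changed attribute value makes the elements different.
XElement renamed = XElement.Parse("<Project ID='6' Name='Renamed' />");
bool stillSame = XNode.DeepEquals(XElement.Parse(textA), renamed);

// CompareDocumentOrder works on nodes of the same tree:
// negative when the first node comes before the second.
XElement root = XElement.Parse("<Projects><Project ID='1' /><Project ID='2' /></Projects>");
XElement first = root.Elements("Project").First();
XElement second = root.Elements("Project").Last();
int order = XNode.CompareDocumentOrder(first, second);

Console.WriteLine(sameText);  // False
Console.WriteLine(sameXml);   // True
Console.WriteLine(stillSame); // False
Console.WriteLine(order < 0); // True
```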


Task 5: Displaying results

In all the cases above, the result of the LINQ query is assigned to a variable declared with "var". You iterate through the results just as you would iterate through any other collection.

Example: Iterating through results

      foreach(var prj in bNewPrjs)
      {
          Console.WriteLine("Project ID: " + prj.ProjectID);
      }


As you can see, even though the item is declared as "var", the compiler correctly identifies the members of the type the item was created with. If you used a well-known type in your select, you can use the same type in the foreach too. In either case, Visual Studio IntelliSense lists all the members.

Task 6: Modifying XML

We are looking at the scenario of finding the differences between two XML files and applying the changes to one of them, so the following discussion is limited to that scope. In general, though, you can build all sorts of XML in a single nested statement with LINQ.

One thing to be careful about when modifying XML is that, while iterating through a collection, we must not make changes that modify the collection itself. This is a general rule when iterating through any collection. There are no issues if you change properties of the item inside the loop, as long as this does not affect the query results or modify the collection.

Pay extra attention when you need to remove elements. One option is to use the power of LINQ to extract the elements you are interested in and push them into another generic collection. You can then iterate through that list and make the required changes [removing elements etc.].

Adding New Elements from B File

// Get all Employee elements in A XML for which new Projects were added in B XML
// We are assuming that employee elements are the same in both files [for simplicity]
// REUSE: Here we are re-using the new-project query results [bNewPrjs] from TIP 5
// ToList() runs the query up front, so aDoc is not modified while the query is still reading it

var aEmployees = (from aEmp in aDoc.Descendants("Employee")
                  join newPrj in bNewPrjs on aEmp.Attribute("ID").Value
                      equals newPrj.ProjectRef.Parent.Parent.Attribute("ID").Value
                  select new { Employee = aEmp, Project = newPrj.ProjectRef }).ToList();

// Now add the new projects under each employee's <Projects> element
// [Add copies an element that already belongs to another tree, so bDoc is unchanged]
foreach (var employee in aEmployees)
{
    employee.Employee.Element("Projects").Add(employee.Project);
    Debug.WriteLine(employee.Employee.ToString()); // Make sure the new item was added
}

// aDoc now has the updates
// Just verify

Debug.WriteLine(aDoc.ToString());

Removing Elements

The Remove method on an XElement instance removes it from its parent. Normally it is best to remove elements from a clean copy, or to first push all the elements that need to be removed into a separate collection and then call .Remove() on the elements in that collection.
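
A minimal sketch of the "collect first, remove afterwards" idea, on a made-up cut-down sample. It also shows the Remove() extension method that LINQ to XML provides for sequences of nodes, which copies the matching nodes to a list internally before removing them.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

// Hypothetical cut-down sample, not the full File1.xml from this article.
XElement employees = XElement.Parse(@"<Employees>
  <Employee ID='1'><Projects>
    <Project ID='1' /><Project ID='2' />
  </Projects></Employee>
  <Employee ID='2'><Projects>
    <Project ID='3' /><Project ID='2' /><Project ID='6' />
  </Projects></Employee>
</Employees>");

// Option 1: snapshot the matches with ToList(), then remove them one by one.
List<XElement> toRemove = employees.Descendants("Project")
    .Where(p => p.Attribute("ID").Value == "2")
    .ToList();
foreach (XElement prj in toRemove)
{
    prj.Remove();
}

// Option 2: call Remove() on the query itself; the extension method
// takes its own snapshot before it starts removing.
employees.Descendants("Project")
    .Where(p => p.Attribute("ID").Value == "6")
    .Remove();

List<string> remainingIds = employees.Descendants("Project")
    .Select(p => p.Attribute("ID").Value)
    .ToList();
Console.WriteLine(string.Join(", ", remainingIds)); // 1, 3
```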

TIP 6: Keeping two copies of XML

Normally, when we need to query two XML documents and merge the changes into one of them, it is best to keep a separate copy of the source XML: run the queries against one copy and apply the changes only to the other. This keeps us clear of all sorts of issues with modifying collections while they are being read.
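
One way to get such a copy is the XDocument copy constructor, which makes a deep copy of the document. The sketch below uses a made-up cut-down sample; the names source and working are mine.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

// Hypothetical cut-down sample, not the full File1.xml from this article.
XDocument source = XDocument.Parse(
    "<Employees><Employee ID='3' Name='Tom'><Projects /></Employee></Employees>");

// Deep copy: "working" gets its own tree, independent of "source".
XDocument working = new XDocument(source);

// Query "source", change "working".
foreach (XElement emp in source.Descendants("Employee"))
{
    string id = emp.Attribute("ID").Value;
    XElement target = working.Descendants("Employee")
        .Single(e => e.Attribute("ID").Value == id);
    target.Element("Projects").Add(
        new XElement("Project",
            new XAttribute("ID", "5"),
            new XAttribute("Name", "Project5")));
}

int projectsInSource = source.Descendants("Project").Count();
int projectsInWorking = working.Descendants("Project").Count();
Console.WriteLine(projectsInSource + " / " + projectsInWorking); // 0 / 1
```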

Task 7: Using XPath methods with LINQ

Finally, we cannot ignore our good old friend XPath. There are cases where we have the exact query path to an element or a set of elements and want to query in XPath style. The following are a few samples:

// The XPath extension methods need: using System.Xml.XPath;

// Get Employee with ID = 1
XElement empElement = aDoc.XPathSelectElement("/Employees/Employee[@ID=\"1\"]");

// Get Project with Employee ID = 1 and Project ID = 2
XElement prjElement = aDoc.XPathSelectElement("/Employees/Employee[@ID=\"1\"]/Projects/Project[@ID=\"2\"]");

// Get all projects where Project ID > 2
var projects = aDoc.XPathSelectElements("/Employees//Projects/Project[@ID > 2]");
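
Since XPathSelectElements returns an ordinary sequence of XElement, its result can feed straight into a LINQ query. Here is a small sketch of that mix on a made-up cut-down sample: XPath narrows the selection, LINQ sorts and shapes it.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;
using System.Xml.XPath;

// Hypothetical cut-down sample, not the full File1.xml from this article.
XDocument doc = XDocument.Parse(@"<Employees>
  <Employee ID='1' Name='JayT'><Projects>
    <Project ID='1' Name='Project1' /><Project ID='2' Name='Project2' />
  </Projects></Employee>
  <Employee ID='2' Name='Kim'><Projects>
    <Project ID='3' Name='Project3' /><Project ID='6' Name='Project6' />
  </Projects></Employee>
</Employees>");

// XPath picks the projects with ID > 2; the LINQ query orders them
// by ID [highest first] and selects just the names.
List<string> names = (from prj in doc.XPathSelectElements(
                          "/Employees/Employee/Projects/Project[@ID > 2]")
                      orderby Int32.Parse(prj.Attribute("ID").Value) descending
                      select prj.Attribute("Name").Value).ToList();

Console.WriteLine(string.Join(", ", names)); // Project6, Project3
```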

Conclusion:

As you can see, the LINQ to XML samples above combine many familiar techniques from LINQ to Objects, SQL and the general language skills we already have. This is one of the reasons LINQ shines: it is easy to learn. With its declarative programming style, LINQ makes things a lot simpler and, in some cases, even faster. What is covered in this article is only a fraction of what you can do with LINQ, but I hope it helps and encourages developers who are new to LINQ.