Convert HTML To XML

In this article, I explain how to convert an HTML file to XML. We shall proceed step by step.

Let us first look at some facts about HTML files.

Like XML, HTML is a tag-based language, but it does not conform to the XML standard. The non-conformance concerns tags that do not require closing, such as img. Several characters and character sequences that are valid in HTML are also illegal in XML, e.g. the entity &nbsp;.

Therefore, our first task is to clean the HTML file so that it can be parsed as XML. A variety of tools for cleaning HTML files are available on the Internet. I referred to a blog post and the code it provided, and modified that code to fit my own needs.
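
To make the cleaning step more concrete, here is a minimal sketch of the two problems mentioned above: unclosed tags and the &nbsp; entity. It is not the code from the blog I referred to. The class name HtmlCleanupSketch and the list of tags are my own assumptions for illustration, and the regular expression only copes with simple, well-quoted markup. For real-world pages, a dedicated tool such as HTML Tidy or the Html Agility Pack is likely a safer choice.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration only; not the code from the referenced blog.
public static class HtmlCleanupSketch
{
    // A few HTML elements that may appear without a closing tag (assumed list, not exhaustive).
    private static readonly Regex VoidTag = new Regex(
        @"<(img|br|hr|input|meta|link)\b([^>]*?)\s*/?>",
        RegexOptions.IgnoreCase);

    public static string Clean(string html)
    {
        // &nbsp; is not predefined in XML, so use the numeric character reference instead.
        string result = html.Replace("&nbsp;", "&#160;");

        // Rewrite <img ...> and <img ... /> alike as self-closing <img ... />.
        return VoidTag.Replace(result, "<$1$2 />");
    }
}
```

For example, HtmlCleanupSketch.Clean("<p>a&nbsp;b<br></p>") yields a string that XmlDocument.LoadXml can parse. Unquoted attribute values, mismatched tag case, and a '>' inside an attribute value are not handled by this sketch.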

The next step is to extract the required information. Assuming a given HTML page uses a fixed layout to show a report or order information, we can extract the data with any standard XML parser. In my case, I used the XML parser from the .NET Framework.

    // Requires: using System; using System.Xml;
    string xmlContents;
    // Container for the extracted rows; it is serialized later.
    BOM bom = new BOM();
    try
    {
        XmlDocument doc = new XmlDocument();
        doc.Load(outputFileTextBox.Text);
        // The 4th table element contains the required order information.
        XmlNode node = doc.GetElementsByTagName("table")[3];
        // Start at 1 and stop one short of the end: the first and last rows are skipped.
        for (int i = 1; i < node.ChildNodes.Count - 1; i++)
        {
            XmlNode row = node.ChildNodes[i];
            // Column index 3 is not read here.
            Order order = new Order()
            {
                Part_Number = row.ChildNodes[0].InnerText ?? string.Empty,
                Customer_Part_Number = row.ChildNodes[1].InnerText ?? string.Empty,
                Supplier_Part_Number = row.ChildNodes[2].InnerText ?? string.Empty,
                Supplier_Name = row.ChildNodes[4].InnerText ?? string.Empty,
                Type = row.ChildNodes[5].InnerText ?? string.Empty,
                Material = row.ChildNodes[6].InnerText ?? string.Empty,
                Unit_of_Measure = row.ChildNodes[7].InnerText ?? string.Empty,
                Quantity = row.ChildNodes[8].InnerText ?? string.Empty
            };
            bom.BomList.Add(order);
        }
    }
    catch (XmlException exception)
    {
        Console.WriteLine("xml parsing failed {0}", exception.Message);
    }

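
Indexing into ChildNodes depends on the exact nesting of the cleaned page. As an alternative that I did not use in my project, the rows can also be selected with an XPath query. The following is only a sketch under assumptions: the class TableReaderSketch is a made-up name, the document is assumed to declare no default XML namespace, and the table is assumed to contain no nested tables.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

// Hypothetical helper for illustration; it swaps ChildNodes indexing for XPath.
public static class TableReaderSketch
{
    // Returns the trimmed cell texts of every row of the first table that has <td> cells.
    public static List<string[]> ReadRows(string xhtml)
    {
        var doc = new XmlDocument();
        doc.LoadXml(xhtml);

        var rows = new List<string[]>();
        // tr[td] keeps only rows with <td> children, so header rows built from <th> are skipped.
        foreach (XmlNode row in doc.SelectNodes("(//table)[1]//tr[td]"))
        {
            XmlNodeList cells = row.SelectNodes("td");
            var values = new string[cells.Count];
            for (int i = 0; i < cells.Count; i++)
            {
                values[i] = cells[i].InnerText.Trim();
            }
            rows.Add(values);
        }
        return rows;
    }
}
```

To read the fourth table, as in my case, the query would start with (//table)[4] instead. If the cleaned page declares the XHTML namespace, the query needs an XmlNamespaceManager and prefixed element names.
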
To transform the data into XML, the extracted information must first be stored. This is a crucial step, and it is worth thinking it over and relying on existing features of the .NET Framework. I created a class that can be serialized using XmlSerializer.

    using System;
    using System.Collections.Generic;
    using System.Xml.Serialization;

    /// <summary>
    /// Order Model
    /// </summary>
    [Serializable]
    public class Order
    {
        public string Part_Number { get; set; }
        public string Customer_Part_Number { get; set; }
        public string Supplier_Part_Number { get; set; }
        public string Supplier_Name { get; set; }
        public string Type { get; set; }
        public string Color { get; set; }
        public string Material { get; set; }
        public string Unit_of_Measure { get; set; }
        public string Quantity { get; set; }
    }

    /// <summary>
    /// Bill of materials: a list of orders, each written as an Order element.
    /// </summary>
    [XmlInclude(typeof(Order))]
    public class BOM
    {
        [XmlElement(ElementName = "Order")]
        public List<Order> BomList = new List<Order>();
    }

The advantage of this approach is that you get the whole serialized XML in a string, which can then easily be written to an XML file. There are many possible ways to achieve the same result, but I personally found this one the easiest.

    // Requires: using System; using System.IO; using System.Xml.Serialization;
    // Serialize the bill of materials list; the serializer also needs the Order type.
    XmlSerializer xmlSerializer = new XmlSerializer(typeof(BOM), new Type[] { typeof(Order) });

    // Serialize to a string using XmlSerializer.
    // Note: StringWriter reports UTF-16, so the XML declaration says encoding="utf-16".
    using (StringWriter writer = new StringWriter())
    {
        xmlSerializer.Serialize(writer, bom);
        xmlContents = writer.ToString();
    }

    // Write the serialized contents to the XML file.
    using (StreamWriter fileWriter = new StreamWriter(outputFileTextBox.Text))
    {
        fileWriter.Write(xmlContents);
    }

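
A quick way to check the result is to read the serialized string back into objects. The sketch below does such a round trip. SampleOrder and SampleBom are reduced, hypothetical stand-ins for the Order and BOM classes so that the example compiles on its own, and RoundTripSketch is likewise a name made up for this illustration.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml.Serialization;

// Reduced stand-ins for the article's Order and BOM classes (assumed names).
public class SampleOrder
{
    public string Part_Number { get; set; }
    public string Quantity { get; set; }
}

public class SampleBom
{
    [XmlElement(ElementName = "Order")]
    public List<SampleOrder> BomList = new List<SampleOrder>();
}

public static class RoundTripSketch
{
    // Serializes the list to an XML string, as in the article.
    public static string Serialize(SampleBom bom)
    {
        var serializer = new XmlSerializer(typeof(SampleBom));
        using (var writer = new StringWriter())
        {
            serializer.Serialize(writer, bom);
            return writer.ToString();
        }
    }

    // Reads the XML string back into objects to verify the output.
    public static SampleBom Deserialize(string xml)
    {
        var serializer = new XmlSerializer(typeof(SampleBom));
        using (var reader = new StringReader(xml))
        {
            return (SampleBom)serializer.Deserialize(reader);
        }
    }
}
```

If the deserialized list holds the same number of orders with the same values as the original, the XML was written as intended.
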
Note: For this article, I used an HTML file provided by one of the users on C-SharpCorner. To be on the safe side: I am not responsible for any data in that HTML file.

 
