Convert HTML Files To DOCX Files With MariGold.OpenXHTML

MariGold.OpenXHTML is a GitHub project that converts HTML to Open XML Word documents. It is a wrapper library built on the Open XML SDK 2.5. It extracts each element from the HTML document and creates the corresponding Open XML element in the Word file. For example, div elements in the HTML document are converted to paragraphs in the Word document. Not every HTML/CSS behavior can be reproduced in Word. A floated div, for instance, may not render the same way because Word documents do not support floated paragraphs.
 
This article demonstrates how to create a Windows application that converts an HTML file to an Open XML Word document. The sample application is a simple form with one text box and two buttons. One button selects an HTML file through an Open dialog box, and the other converts that HTML file and saves the resulting Word document through a Save dialog box.
 
Set up the application
 
This article uses Visual Studio 2015 Community Edition and .NET Framework 4.5 to create the application. Open the "New Project" dialog box and select the "Windows Forms Application" template.



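Modify the default Form1 form so that it contains a text box for the HTML file path, a "Browse" button, and a "Convert" button. The code in this article also expects an OpenFileDialog and a SaveFileDialog component on the form, and it refers to these controls as txtFile, openFileDialog1, and saveFileDialog1.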
 
 
 
Next, add a reference to the MariGold.OpenXHTML library, which is available as a NuGet package. In Visual Studio, select Tools -> NuGet Package Manager -> Package Manager Console and enter the following command. Alternatively, you can use the "Manage NuGet Packages" menu from the "References" project folder.
 
 Install-Package MariGold.OpenXHTML
 
This will also install the following dependencies.
  • DocumentFormat.OpenXml - The Open XML SDK 2.5 library, used to create Open XML Word documents.
  • MariGold.HtmlParser - Parses the input file and extracts the HTML elements.
How the code works
 
The "Browse" button simply selects an HTML file and copies the file path into the text box.
  // Show the Open dialog and copy the chosen HTML file path into the text box.
  if (openFileDialog1.ShowDialog() == DialogResult.OK)
  {
      txtFile.Text = openFileDialog1.FileName;
  }
The "Convert" button first lets the user choose the output Word file. It then creates a WordDocument object bound to that file, reads the HTML file content with a StreamReader, and converts the HTML into an Open XML document.
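  // Needs System.IO (StreamReader) and MariGold.OpenXHTML (WordDocument, HtmlParser).
  if (saveFileDialog1.ShowDialog() == DialogResult.OK)
  {
      // Bind the document to the output path chosen by the user.
      WordDocument doc = new WordDocument(saveFileDialog1.FileName);

      // Read the whole HTML file and convert it into Open XML elements.
      using (StreamReader sr = new StreamReader(txtFile.Text))
      {
          doc.Process(new HtmlParser(sr.ReadToEnd()));
      }

      // Commit the changes to the Word file.
      doc.Save();
  }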
The WordDocument class encapsulates the whole HTML to DOCX conversion. Its constructor accepts either a memory stream or the file path of the target Word document, as in the example above. The Process method accepts an HTML parser implementation, uses it to parse the given HTML, and builds the Open XML document from the result. Finally, the Save method commits all the changes to the Word document.
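The memory stream option mentioned above is useful when the DOCX should not be written straight to disk, for example when the bytes are sent somewhere else. The following is a minimal sketch of that variant, not code from the sample application. It assumes the WordDocument constructor accepts a MemoryStream as described; the class name InMemoryConversionSketch, the helper ConvertToDocxBytes, and the output file name sample.docx are hypothetical names chosen for illustration.

```csharp
using System.IO;
using MariGold.OpenXHTML;

// Hypothetical console sketch; the class and method names are illustrative only.
class InMemoryConversionSketch
{
    // Converts an HTML string to the bytes of a DOCX file, building the document in memory.
    static byte[] ConvertToDocxBytes(string html)
    {
        using (MemoryStream mem = new MemoryStream())
        {
            // Assumption: WordDocument can be bound to a MemoryStream instead of a file path.
            WordDocument doc = new WordDocument(mem);
            doc.Process(new HtmlParser(html));
            doc.Save();

            // Copy out the stream contents once the document has been saved.
            return mem.ToArray();
        }
    }

    static void Main()
    {
        byte[] docx = ConvertToDocxBytes("<div>sample text</div>");

        // Writing to disk here only makes the result easy to inspect.
        File.WriteAllBytes("sample.docx", docx);
    }
}
```

The flow mirrors the file-based version: construct, Process, Save. Only the destination changes.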
 
The WordDocument class also provides a few methods and properties that control how the Word document is built, such as how relative image URLs and link URLs are handled. Refer to the GitHub project home page for more information.