About Development on Two-Way RTF to XML/XHTML Converter Components and Services

OVERVIEW - Part 1

Microsoft introduced the Rich Text Format (RTF) for specifying simple formatted text with embedded graphics. Initially intended to transfer such data between different applications on different operating systems, today this format is commonly used in Windows for enhanced editing capabilities.

New challenges of RTF:

  • Extraction of text without consideration of any format information;

  • Extraction and conversion of embedded image information;

  • Conversion of the RTF layout and/or data into another format such as XML or HTML;

  • Transferring RTF data into a custom data model.

Goals of designing the component:

To develop an application that can convert RTF to Text/XML/HTML:

  • Support for the current RTF;

  • Open source C#.NET code;

  • Unlimited usage in console, WinForms, WPF, and ASP.NET applications;

  • Independence of third party components;

  • Unicode support;

  • Separation of parsing and the actual interpretation of the RTF data;

  • Providing simple predefined conversion modules for text, images, XML, and HTML;

  • Ready-to-Use RTF converter applications for text, images, XML, and HTML;

  • Open architecture for simple creation of RTF converters.

Weak points

  • The component offers no high-level functionality to create RTF content;

  • The present RTF interpreter is restricted to content data and basic formatting options;

  • There is no special support for the following RTF layout elements:

    Tables
    Lists
    Automatic numbering
    All features that depend on how Microsoft Word interprets them.

In general, this should not pose a big problem for many areas of use. A conforming RTF writer should always write content with older readers in mind, that is, readers which do not know about tags and features introduced later in the standard's history. As a consequence, much of the content in an RTF document is stored several times (at least if the writer cares about other applications). The interpreter takes advantage of this and simply focuses on the visual content. Some writers in common use, however, support this alternate representation improperly, which results in differences in the output.

Thanks to its open architecture, the RTF parser is a solid base for development of an RTF converter which focuses on layout.

2wayRTF2XML/XHTML - RTF Parser and Conversion from RTF to XML, XHTML

The actual parsing of the data is done by the class RtfParser. Apart from the tag recognition, it also handles (a first level of) character encoding and Unicode support. The RTF parser classifies the RTF data into the following basic elements:

  • RTF Group: A group of RTF elements;

  • RTF Tag: The name and value of an RTF tag;

  • RTF Text: Arbitrary text content (not necessarily visible!).

Untitled-1.gif
Figure 1.
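
The classification into groups, tags, and text can be illustrated with a small tokenizer. This is only a minimal sketch and not the component's RtfParser: the class RtfTokenSketch and its string-based token format are hypothetical, and character encoding, Unicode decoding, and binary data are left out.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical sketch (not the component's RtfParser): splits RTF input
// into group-begin, group-end, tag, and text tokens.
public static class RtfTokenSketch
{
    public static List<string> Tokenize( string rtf )
    {
        List<string> tokens = new List<string>();
        StringBuilder text = new StringBuilder();
        int i = 0;
        while ( i < rtf.Length )
        {
            char c = rtf[ i ];
            if ( c == '{' || c == '}' )
            {
                Flush( text, tokens );
                tokens.Add( c == '{' ? "GroupBegin" : "GroupEnd" );
                i++;
            }
            else if ( c == '\\' && i + 1 < rtf.Length && IsAsciiLetter( rtf[ i + 1 ] ) )
            {
                Flush( text, tokens );
                int start = ++i;
                while ( i < rtf.Length && IsAsciiLetter( rtf[ i ] ) ) i++;   // tag name
                string name = rtf.Substring( start, i - start );
                start = i;
                if ( i + 1 < rtf.Length && rtf[ i ] == '-' && char.IsDigit( rtf[ i + 1 ] ) ) i++;
                while ( i < rtf.Length && char.IsDigit( rtf[ i ] ) ) i++;    // optional value
                string value = rtf.Substring( start, i - start );
                if ( i < rtf.Length && rtf[ i ] == ' ' ) i++;                // delimiter space
                tokens.Add( "Tag:" + name + value );
            }
            else if ( c == '\\' && i + 1 < rtf.Length )
            {
                char next = rtf[ i + 1 ];
                if ( next == '{' || next == '}' || next == '\\' )
                {
                    text.Append( next );                                     // escaped literal
                    i += 2;
                }
                else if ( next == '\'' && i + 3 < rtf.Length )
                {
                    Flush( text, tokens );
                    tokens.Add( "Tag:'" + rtf.Substring( i + 2, 2 ) );       // \'hh, left undecoded
                    i += 4;
                }
                else
                {
                    Flush( text, tokens );
                    tokens.Add( "Tag:" + next );                             // other control symbol
                    i += 2;
                }
            }
            else
            {
                if ( c != '\r' && c != '\n' ) text.Append( c );              // line breaks are not content
                i++;
            }
        }
        Flush( text, tokens );
        return tokens;
    }

    // Emits the pending text run (if any) as a single text token.
    private static void Flush( StringBuilder text, List<string> tokens )
    {
        if ( text.Length == 0 ) return;
        tokens.Add( "Text:" + text.ToString() );
        text.Length = 0;
    }

    private static bool IsAsciiLetter( char c )
    {
        return ( c >= 'a' && c <= 'z' ) || ( c >= 'A' && c <= 'Z' );
    }
}
```

For the input {\rtf1foobar} used in the example further down, this sketch yields a group begin, the tag rtf1, the text foobar, and a group end.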

The actual parsing process can be monitored by ParserListeners (Observer Pattern), which offer an opportunity to react to specific events and perform corresponding actions.

The integrated parser listener RtfParserListenerFileLogger can be used to write the structure of the RTF elements into a log file (mainly intended for use during development). The produced output can be customized using its RtfParserLoggerSettings. The additional RtfParserListenerLogger parser listener can be used to log the parsing process to any ILogger implementation (see System functions).

The parser listener RtfParserListenerStructureBuilder generates the Structure Model from the RTF elements encountered during parsing. That model represents the basic elements as instances of IRtfGroup, IRtfTag, and IRtfText. Access to the hierarchical structure can be gained through the RTF group available in RtfParserListenerStructureBuilder.StructureRoot. Based on the Visitor Pattern, the structure model can easily be examined with any IRtfElementVisitor implementation:

//-------------------------------------------------------------------------

public class MyVisitor : IRtfElementVisitor
{
      // ----------------------------------------------------------------------

      void RtfWriteStructureModel()
      {
            RtfParserListenerFileLogger logger =
                  new RtfParserListenerFileLogger( @"c:\temp\RtfParser.log" );
            IRtfGroup structureRoot =
                  RtfParserTool.Parse( @"{\rtf1foobar}", logger );
            structureRoot.Visit( this );
      } // RtfWriteStructureModel

      // ----------------------------------------------------------------------

      void IRtfElementVisitor.VisitTag( IRtfTag tag )
      {
            Console.WriteLine( "Tag: " + tag.FullName );
      } // IRtfElementVisitor.VisitTag

      // ----------------------------------------------------------------------

      void IRtfElementVisitor.VisitGroup( IRtfGroup group )
      {
            Console.WriteLine( "Group: " + group.Destination );
            foreach ( IRtfElement child in group.Contents )
            {
                  child.Visit( this ); // recursive
            }
      } // IRtfElementVisitor.VisitGroup

      // ----------------------------------------------------------------------

      void IRtfElementVisitor.VisitText( IRtfText text )
      {
            Console.WriteLine( "Text: " + text.Text );
      } // IRtfElementVisitor.VisitText

} // MyVisitor

Untitled-2.gif 
Figure 2.

Note, however, that the same result for such simple functionality could be achieved by writing a custom IRtfParserListener (see below). This can, in some cases, be useful to avoid the overhead of creating the structure model in memory.

The utility class RtfParserTool can receive RTF data from a multitude of sources, such as string, TextReader, and Stream. Through its IRtfSource interface, it handles all these (and even other) scenarios in a uniform way.

The interface IRtfParserListener with its base utility implementation RtfParserListenerBase offers a way to react in custom ways to specific events during the parsing process:

// ------------------------------------------------------------------------

public class MyParserListener : RtfParserListenerBase
{
      // ----------------------------------------------------------------------
      protected override void DoParseBegin()
      {
            Console.WriteLine( "parse begin" );
      } // DoParseBegin 

      // ----------------------------------------------------------------------

      protected override void DoGroupBegin()
      {
            Console.WriteLine( "group begin -level " + Level.ToString() );
      } // DoGroupBegin

      // ----------------------------------------------------------------------

      protected override void DoTagFound( IRtfTag tag )
      {
            Console.WriteLine( "tag " + tag.FullName );
      } // DoTagFound

      // ----------------------------------------------------------------------

      protected override void DoTextFound( IRtfText text )
      {
            Console.WriteLine( "text " + text.Text );
      } // DoTextFound 

      // ----------------------------------------------------------------------

      protected override void DoGroupEnd()
      {
            Console.WriteLine( "group end -level " + Level.ToString() );
      } // DoGroupEnd 

      // ----------------------------------------------------------------------
      protected override void DoParseSuccess()
      {
            Console.WriteLine( "parse success" );
      } // DoParseSuccess

      // ----------------------------------------------------------------------

      protected override void DoParseFail( RtfException reason )
      {
            Console.WriteLine( "parse failed: " + reason.Message );
      } // DoParseFail 

      // ----------------------------------------------------------------------

      protected override void DoParseEnd()
      {
            Console.WriteLine( "parse end" );
      } // DoParseEnd 

} // MyParserListener

Note that the base class already provides (empty) implementations for all the interface methods, so only the ones which are required for a specific purpose need to be overridden.

RTF Interpreter

Once an RTF document has been parsed into a structure model, it is subject to interpretation through the RTF interpreter. One obvious way to interpret the structure is to build a Document Model which provides high-level access to the meaning of the document's contents. A very simple document model is part of this component, and consists of the following building blocks:

  • Document info: title, subject, author etc.

  • User properties

  • Color information

  • Font information

  • Text formats

  • Visuals:
             Text with associated formatting information
             Breaks: line, paragraph, section, page
             Special characters: tabulator, paragraph begin/end, dash, space, bullet, quote, hyphen
             Images

    Untitled-3i.gif
    Figure 3.   Rtf Converter for WPF

The various Visuals represent the recognized visible RTF elements, and can be examined with any IRtfVisualVisitor implementation.

Analogous to the possibilities of the RTF parser, the provided RtfInterpreter supports monitoring the interpretation process with InterpreterListeners for specific purposes.

Analyzing documents might be simplified by using the RtfInterpreterListenerFileLogger interpreter listener, which writes the recognized RTF elements into a log file. Its output can be customized through its RtfInterpreterLoggerSettings. The additional RtfInterpreterListenerLogger interpreter listener can be used to log the interpretation process to any ILogger implementation (see System functions).

Untitled-4i.gif
Figure 4.   RTF Converter for Windows Forms

..\RtfConverter_exe_src_article\RtfWinForms\bin\Debug\PhS.Solutions.Community.RtfConverter.RtfWinForms.exe

Construction of the document model is also achieved through such an interpreter listener (RtfInterpreterListenerDocumentBuilder) which, in the end, delivers an instance of an IRtfDocument.

The following example shows how to make use of the high-level API of the document model:

// ----------------------------------------------------------------------

void RtfWriteDocumentModel( Stream rtfStream )
{
    RtfInterpreterListenerFileLogger logger =
        new RtfInterpreterListenerFileLogger( @"c:\temp\RtfInterpreter.log" );
    IRtfDocument document = RtfInterpreterTool.BuildDoc( rtfStream, logger );
    RtfWriteDocument( document );
} // RtfWriteDocumentModel

// ----------------------------------------------------------------------

void RtfWriteDocument( IRtfDocument document )
{
    Console.WriteLine( "RTF Version: " + document.RtfVersion.ToString() );

    // document info
    Console.WriteLine( "Title: " + document.DocumentInfo.Title );
    Console.WriteLine( "Subject: " + document.DocumentInfo.Subject );
    Console.WriteLine( "Author: " + document.DocumentInfo.Author );
    // ...

    // fonts
    foreach ( IRtfFont font in document.FontTable )
    {
        Console.WriteLine( "Font: " + font.Name );
    }

    // colors
    foreach ( IRtfColor color in document.ColorTable )
    {
        Console.WriteLine( "Color: " + color.AsDrawingColor.ToString() );
    }

    // user properties
    foreach ( IRtfDocumentProperty documentProperty in document.UserProperties )
    {
        Console.WriteLine( "User property: " + documentProperty.Name );
    }

    // visuals (preferably handled through an according visitor)
    foreach ( IRtfVisual visual in document.VisualContent )
    {
        switch ( visual.Kind )
        {
            case RtfVisualKind.Text:
                Console.WriteLine( "Text: " + ((IRtfVisualText)visual).Text );
                break;
            case RtfVisualKind.Break:
                Console.WriteLine( "Tag: " +
                    ((IRtfVisualBreak)visual).BreakKind.ToString() );
                break;
            case RtfVisualKind.Special:
                Console.WriteLine( "Text: " +
                    ((IRtfVisualSpecialChar)visual).CharKind.ToString() );
                break;
            case RtfVisualKind.Image:
                IRtfVisualImage image = (IRtfVisualImage)visual;
                Console.WriteLine( "Image: " + image.Format.ToString() +
                    " " + image.Width.ToString() + "x" + image.Height.ToString() );
                break;
        }
    }
} // RtfWriteDocument


As with the parser, the class RtfInterpreterTool offers convenience functionality for easy interpretation of RTF data and creation of a corresponding IRtfDocument. In case no IRtfGroup is yet available, it also provides for passing any source to the RtfParserTool for automatic on-the-fly parsing.

The interface IRtfInterpreterListener, with its base utility implementation RtfInterpreterListenerBase, offers the necessary foundation for a custom interpreter listener:

// ------------------------------------------------------------------------

public class MyInterpreterListener : RtfInterpreterListenerBase
{
      // ----------------------------------------------------------------------
      protected override void DoBeginDocument( IRtfInterpreterContext context )
      {
            // custom action
      } // DoBeginDocument

       // ----------------------------------------------------------------------

      protected override void DoInsertText( IRtfInterpreterContext context, string text )
      {
            // custom action
      } // DoInsertText

       // ----------------------------------------------------------------------

      protected override void DoInsertSpecialChar( IRtfInterpreterContext context,
            RtfVisualSpecialCharKind kind )
      {
            // custom action
      } // DoInsertSpecialChar 

      // ----------------------------------------------------------------------

      protected override void DoInsertBreak( IRtfInterpreterContext context,
            RtfVisualBreakKind kind )
      {
            // custom action
      } // DoInsertBreak 

      // ----------------------------------------------------------------------

      protected override void DoInsertImage( IRtfInterpreterContext context,
            RtfVisualImageFormat format,
            int width, int height, int desiredWidth, int desiredHeight,
            int scaleWidthPercent, int scaleHeightPercent,
            string imageDataHex
            )
      {
            // custom action
      } // DoInsertImage 

      // ----------------------------------------------------------------------

      protected override void DoEndDocument( IRtfInterpreterContext context )
      {
            // custom action
      } // DoEndDocument 
} // MyInterpreterListener

The IRtfInterpreterContext passed to all of these methods contains the document information which is available at the very moment (colors, fonts, formats, etc.) as well as information about the state of the interpretation.

RTF Base Converters

As a foundation for the development of more complex converters, there are four base converters available for text, images, XML, and HTML. They are designed to be extended by inheritance.

Untitled-5.gif
Figure 5.

Text Converter

The RtfTextConverter can be used to extract plain text from an RTF document. Its RtfTextConvertSettings determines how to represent special characters, tabulators, white space, breaks (line, page, etc.), and what to do with them.

// ----------------------------------------------------------------------
void ConvertRtf2Text( Stream rtfStream )
{
    // logger
    RtfInterpreterListenerFileLogger logger =
        new RtfInterpreterListenerFileLogger( @"c:\temp\RtfInterpreter.log" );

    // text converter
    RtfTextConvertSettings textConvertSettings = new RtfTextConvertSettings();
    textConvertSettings.BulletText = "-"; // replace default bullet text '°'
    RtfTextConverter textConverter = new RtfTextConverter( textConvertSettings );

    // interpreter
    RtfInterpreterTool.Interpret( rtfStream, logger, textConverter );
    Console.WriteLine( textConverter.PlainText );
} // ConvertRtf2Text

Image Converter

The RtfImageConverter offers a way to extract images from an RTF document. The images can be saved unscaled or scaled to the size at which they appear in the RTF document. Optionally, the format of the image can be converted to another ImageFormat. File name, type, and size can be controlled by an IRtfVisualImageAdapter. The RtfImageConvertSettings determines the storage location as well as any scaling.

// ----------------------------------------------------------------------
void ConvertRtf2Image( Stream rtfStream )
{
    // logger
    RtfInterpreterListenerFileLogger logger =
        new RtfInterpreterListenerFileLogger( @"c:\temp\RtfInterpreter.log" );

    // image converter
    // convert all images to JPG
    RtfVisualImageAdapter imageAdapter = new RtfVisualImageAdapter( ImageFormat.Jpeg );
    RtfImageConvertSettings imageConvertSettings =
        new RtfImageConvertSettings( imageAdapter );
    imageConvertSettings.ImagesPath = @"c:\temp\images\";
    imageConvertSettings.ScaleImage = true; // scale images
    RtfImageConverter imageConverter = new RtfImageConverter( imageConvertSettings );

    // interpreter
    RtfInterpreterTool.Interpret( rtfStream, logger, imageConverter );

    // all images are saved to the path 'c:\temp\images\'
} // ConvertRtf2Image

XML Converter

The RtfXmlConverter converts the recognized RTF visuals into an XML document. Its RtfXmlConvertSettings lets you specify the XML namespace to use and the corresponding prefix.

// ----------------------------------------------------------------------

void ConvertRtf2Xml( Stream rtfStream )
{
    // logger
    RtfInterpreterListenerFileLogger logger =
        new RtfInterpreterListenerFileLogger( @"c:\temp\RtfInterpreter.log" );

    // interpreter
    IRtfDocument rtfDocument = RtfInterpreterTool.BuildDoc( rtfStream, logger );

    // XML convert
    XmlWriterSettings xmlWriterSettings = new XmlWriterSettings();
    xmlWriterSettings.Indent = true;
    xmlWriterSettings.IndentChars = ( "  " );
    string fileName = @"c:\temp\Rtf.xml";
    using ( XmlWriter writer = XmlWriter.Create( fileName, xmlWriterSettings ) )
    {
        RtfXmlConverter xmlConverter = new RtfXmlConverter( rtfDocument, writer );
        xmlConverter.Convert();
        writer.Flush();
    }
} // ConvertRtf2Xml

HTML Converter

The RtfHtmlConverter converts the recognized RTF visuals into an HTML document. File names, type, and size of any encountered images can be controlled through an IRtfVisualImageAdapter, while the RtfHtmlConvertSettings determines storage location, stylesheets, and other HTML document information.

// ----------------------------------------------------------------------
void ConvertRtf2Html( Stream rtfStream )
{
    // logger
    RtfInterpreterListenerFileLogger logger =
        new RtfInterpreterListenerFileLogger( @"c:\temp\RtfInterpreter.log" );

    // image converter
    // convert all images to JPG
    RtfVisualImageAdapter imageAdapter = new RtfVisualImageAdapter( ImageFormat.Jpeg );
    RtfImageConvertSettings imageConvertSettings =
        new RtfImageConvertSettings( imageAdapter );
    imageConvertSettings.ScaleImage = true; // scale images
    RtfImageConverter imageConverter =
        new RtfImageConverter( imageConvertSettings );

    // interpreter (BuildDoc delivers the IRtfDocument, as in the examples above)
    IRtfDocument rtfDocument = RtfInterpreterTool.BuildDoc( rtfStream,
        logger, imageConverter );

    // html converter
    RtfHtmlConvertSettings htmlConvertSettings =
        new RtfHtmlConvertSettings( imageAdapter );
    htmlConvertSettings.StyleSheetLinks.Add( "default.css" );
    RtfHtmlConverter htmlConverter = new RtfHtmlConverter( rtfDocument,
        htmlConvertSettings );
    Console.WriteLine( htmlConverter.Convert() );
} // ConvertRtf2Html


HTML Styles can be integrated in two ways:

  • Inline through RtfHtmlCssStyle in RtfHtmlConvertSettings.Styles

  • Link through RtfHtmlConvertSettings.StyleSheetLinks

The RtfHtmlConvertScope lets you restrict the target range:

  • RtfHtmlConvertScope.All: complete HTML document (=Default)

  • ...
     
  • RtfHtmlConvertScope.Content: only paragraphs

RTF Converter Applications

The console applications Rtf2Raw, Rtf2Xml, and Rtf2Html demonstrate the range of functionality of the corresponding base converters, and offer a starting point for the development of your own RTF converter.

Rtf2Raw

The command line application Rtf2Raw converts an RTF document into plain text and images:
 

Rtf2Raw source-file [destination] [/IT:format] [/CE:encoding]
        [/IS+] [/ST-] [/SI-] [/LD:path] [/LP] [/LI] [/D] [/O] [/?]

source-file      source rtf file
destination      destination directory (default=source-file directory)
/IT:format       images type format: bmp, emf, exif, gif, icon, jpg,
                 png, tiff or wmf (default=original)
/CE:encoding     character encoding: ASCII, UTF7, UTF8, Unicode,
                 BigEndianUnicode, UTF32, OperatingSystem (default=UTF8)
/IS+             image scale (default=off)
/ST-             don't save text to the destination (default=on)
/SI-             don't save images to the destination (default=on)
/LD:path         log file directory (default=destination directory)
/LP              write rtf parser log (default=off)
/LI              write rtf interpreter log (default=off)
/D               write text to screen (default=off)
/O               open text in associated application (default=off)
/?               this help

Samples:
Rtf2Raw MyText.rtf
Rtf2Raw MyText.rtf c:\temp
Rtf2Raw MyText.rtf c:\temp /CSS:MyCompany.css
Rtf2Raw MyText.rtf c:\temp /CSS:MyCompany.css,ThisProject.css
Rtf2Raw MyText.rtf c:\temp /CSS:MyCompany.css,ThisProject.css /IT:png
Rtf2Raw MyText.rtf c:\temp /CSS:MyCompany.css,ThisProject.css /IT:png
        /LD:log /LP /LI
 

Rtf2Xml

The command line application Rtf2Xml converts an RTF document into an XML document:

Rtf2Xml source-file [destination] [/CE:encoding] [/P:prefix]
        [/NS:namespace] [/LD:path] [/LP] [/LI] [/?]

source-file      source rtf file
destination      destination directory (default=source-file directory)
/CE:encoding     character encoding: ASCII, UTF7, UTF8, Unicode,
                 BigEndianUnicode, UTF32, OperatingSystem (default=UTF8)
/P:prefix        xml prefix (default=none)
/NS:namespace    xml namespace (default=none)
/LD:path         log file directory (default=destination directory)
/LP              write rtf parser log (default=off)
/LI              write rtf interpreter log (default=off)
/?               this help

Samples:
Rtf2Xml MyText.rtf
Rtf2Xml MyText.rtf c:\temp
Rtf2Xml MyText.rtf c:\temp /NS:MyNs
Rtf2Xml MyText.rtf c:\temp /LD:log /LP /LI

 

Rtf2Html

The command line application Rtf2Html converts an RTF document into an HTML document:

Rtf2Html source-file [destination] [/CSS:names] [/ID:path] [/IT:format] [/CE:encoding]
         [/CS:charset] [/SH-] [/SI-] [/LD:path] [/LP] [/LI] [/D] [/O] [/?]

source-file        source rtf file
destination        destination directory (default=source-file directory)
/CSS:name1,name2   style sheet names (default=none)
/ID:path           images directory (default=destination directory)
/IT:format         images type format: jpg, gif or png (default=jpg)
/CE:encoding       character encoding: ASCII, UTF7, UTF8, Unicode,
                   BigEndianUnicode, UTF32, OperatingSystem (default=UTF8)
/CS:charset        document character set used for the HTML header meta-tag
                   'content' (default=UTF-8)
/SH-               don't save HTML to the destination (default=on)
/SI-               don't save images to the destination (default=on)
/LD:path           log file directory (default=destination directory)
/LP                write rtf parser log file (default=off)
/LI                write rtf interpreter log file (default=off)
/D                 display HTML text on screen (default=off)
/O                 open HTML in associated application (default=off)
/?                 this help

Samples:
Rtf2Html MyText.rtf
Rtf2Html MyText.rtf c:\temp
Rtf2Html MyText.rtf c:\temp /CSS:MyCompany.css
Rtf2Html MyText.rtf c:\temp /CSS:MyCompany.css,ThisProject.css
Rtf2Html MyText.rtf c:\temp /CSS:MyCompany.css,ThisProject.css
         /ID:images /IT:png
Rtf2Html MyText.rtf c:\temp /CSS:MyCompany.css,ThisProject.css
         /ID:images /IT:png /LD:log /LP /LI
 

Projects

The following projects are provided in the RTF converter component:
 

Sys                System functions. See below for a short description.
Parser             Parsing of RTF data.
ParserTests        Unit tests for Parser.
Interpreter        Interpretation of parsed RTF data. Functionality for conversion of RTF to plain text and images.
InterpreterTests   Unit tests for Interpreter.
ConverterXml       Functionality for conversion of RTF to XML.
ConverterHtml      Functionality for conversion of RTF to HTML.
Rtf2Raw            Command line application to convert from RTF to text and image data.
Rtf2Xml            Command line application to convert from RTF to XML.
Rtf2Html           Command line application to convert from RTF to HTML.
RtfWinForms        Windows Forms sample application which demonstrates a simple conversion from RTF to Text/XML/HTML.
RtfWindows         WPF sample application which demonstrates a simple conversion from RTF to Text/XML/HTML.

System Functions

HashTool               Functions to simplify implementing overrides of the object.GetHashCode() method.
StringTool             Functions for string formatting.
CollectionTool         Functions to simplify handling of collections.
ApplicationArguments   Functions to interpret command line arguments. Offers support for the argument types ToggleArgument, ValueArgument, and NamedValueArgument.
Logging                Functionality to abstract the embedding of a logger facility. Supports the logger types LoggerNone, LoggerTrace, and LoggerLog4net.
Test                   Functionality to build a unit-based test application.
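
To illustrate the three argument styles that appear in the usage listings above, here is a minimal sketch of how such command line arguments might be classified. It does not use the component's ApplicationArguments classes: the class ArgumentSketch, its Classify method, and the mapping of "/name+" and "/name-" to toggles, "/name:value" to named values, and everything else to plain values are assumptions inferred from the usage texts.

```csharp
using System;

// Hypothetical sketch (not the component's ApplicationArguments):
// classifies one command line argument by its syntax.
public static class ArgumentSketch
{
    public static string Classify( string arg )
    {
        // anything that does not start with '/' is a plain value, e.g. a file name
        if ( string.IsNullOrEmpty( arg ) || arg[ 0 ] != '/' )
        {
            return "Value:" + arg;
        }

        string body = arg.Substring( 1 );

        // "/IT:png" style: a name and a value separated by ':'
        int colon = body.IndexOf( ':' );
        if ( colon > 0 )
        {
            return "NamedValue:" + body.Substring( 0, colon ) + "=" + body.Substring( colon + 1 );
        }

        // "/IS+" or "/ST-" style: an explicit on/off toggle
        if ( body.Length > 1 )
        {
            char last = body[ body.Length - 1 ];
            if ( last == '+' || last == '-' )
            {
                string name = body.Substring( 0, body.Length - 1 );
                return "Toggle:" + name + ( last == '+' ? "=on" : "=off" );
            }
        }

        // "/LP" style: a bare switch, treated here as a toggle that is switched on
        return "Toggle:" + body + "=on";
    }
}
```

Applied to the sample call "Rtf2Html MyText.rtf c:\temp /IT:png /LP", the sketch would classify the first two arguments as values, /IT:png as a named value, and /LP as a toggle.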

2wayRTF2XML/XHTML - XHTML to RTF Conversion

  • RtfConverter_exe_src_article > https://sourceforge.net/project/downloading.php?group_id=226892&use_mirror=osdn&filename=RtfConverter_exe_src_article.zip

  • RtfConverterPhS_src > https://sourceforge.net/project/downloading.php?group_id=226892&use_mirror=osdn&filename=RtfConverterPhS_src.zip


  • XHTML2RTF_PhS > https://sourceforge.net/project/downloading.php?group_id=226892&use_mirror=osdn&filename=XHTML2RTF_PhS.rar

  • About development of 2wayRTF2XM_XHTML converter on CS_ASP.rtf > https://sourceforge.net/project/downloading.php?group_id=226892&use_mirror=osdn&filename=article.zip

OVERVIEW - Part 2

This component's documentation describes the two-way 2wayRTF2XML/XHTML converter. One module takes an HTML document as input and generates a Microsoft Word document for printing; a second module converts a Microsoft Office Word document into an XHTML document.

It is a web-based application for printing official documents. Standardization efforts are in progress (both at the W3C with XHTML-Print and at the IEEE with the Print Working Group), and there are some good tools for printing HTML (HTML Print from Bersoft, ScriptX from MeadCo). Nevertheless, we keep our Web-based application and reuse the generated HTML to feed a printer and the wwwroot directory.

Formatting HTML documents for printing, with specific fonts, sizes, headers, footers, and margins, is NOT a simple task, because the HTML format is not appropriate for printing. However, tools exist to convert HTML documents into Microsoft Word format and vice versa, which gives a format suitable for printing or for publishing on the internet (HTTP, IIS).

FEATURES

  • The XHTML2RTF conversion tool: Converts XHTML documents into RTF documents

  • Generated RTF can be previewed and printed by Microsoft Word (commercialware) and Word Viewer (freeware)

  • Uses an XSL stylesheet and Microsoft XML SDK 3.0

  • Runs on Windows XP and Windows 2000 Server (and probably others) 

  • Can be plugged into Web-based (ASP) or Batch (WSH) applications

  • Is highly extensible and customizable: new tags can be supported easily, and direct RTF commands can be sent to the output (with no rendering in the HTML flow) with the <xhtml2rtf:raw> tag.

  • Supports RTF-specific fields like page numbering and total number of pages via the <xhtml2rtf:page_number> and <xhtml2rtf:total_number_of_pages> tags.

INTRODUCTION

The XHTML2RTF conversion tool uses an XSL stylesheet to convert an XHTML document into an RTF document, suitable for previewing and printing with Word (or Word Viewer).

XHTML = HTML + XML

The Extensible HyperText Markup Language (XHTML) is a family of current and future document types and modules that reproduce, subset, and extend HTML, reformulated in XML. XHTML Family document types are all XML-based, and ultimately are designed to work in conjunction with XML-based user agents. XHTML is the successor of HTML, and a series of specifications has been developed for XHTML.
=> The XHTML2RTF conversion tool reads XHTML documents as input.
 
As a consequence, you have to adapt your application in order to use this tool.

XSL

XSL stands for eXtensible Stylesheet Language. It is a family of recommendations for defining XML document transformation and presentation. It consists of three parts:

  • A programming language for transforming XML documents: XSL Transformations (XSLT);

  • An expression language used by XSLT to access or refer to parts of an XML document: XML Path Language (XPath). This language provides pattern matching (xsl:template match), conditional statements (xsl:when test), loops (for-each), etc.;

  • An XML vocabulary for specifying formatting semantics: similar to W3C cascading style sheets (CSS), this vocabulary provides enhanced presentation features.

For more about XSL, please refer to XSL references pages.

=> The XHTML2RTF conversion tool uses XSL to transform XHTML documents (XML documents) into RTF documents.
This is the core of the tool; everything else is just glue for building your application. Everything is in the XSL stylesheet.
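
The idea of letting a stylesheet emit RTF can be tried out with a few lines of C#. The following is only a sketch under stated assumptions: it uses .NET's XslCompiledTransform instead of the Microsoft XML SDK 3.0 that the tool itself relies on, and the embedded miniature stylesheet (which handles only <p> and <b>) is hypothetical and is not xhtml2rtf.xsl. The class name XslSketch is likewise made up for this example.

```csharp
using System;
using System.IO;
using System.Xml;
using System.Xml.Xsl;

// Hypothetical sketch: transforms a tiny XHTML document into RTF text
// with a miniature stylesheet (not the tool's xhtml2rtf.xsl).
public static class XslSketch
{
    private const string Stylesheet = @"<xsl:stylesheet version=""1.0""
    xmlns:xsl=""http://www.w3.org/1999/XSL/Transform""
    xmlns:h=""http://www.w3.org/1999/xhtml"">
  <xsl:output method=""text""/>
  <xsl:template match=""/"">
    <xsl:text>{\rtf1 </xsl:text>
    <xsl:apply-templates select=""//h:body""/>
    <xsl:text>}</xsl:text>
  </xsl:template>
  <xsl:template match=""h:p"">
    <xsl:apply-templates/>
    <xsl:text>\par </xsl:text>
  </xsl:template>
  <xsl:template match=""h:b"">
    <xsl:text>{\b </xsl:text>
    <xsl:apply-templates/>
    <xsl:text>}</xsl:text>
  </xsl:template>
</xsl:stylesheet>";

    public static string Transform( string xhtml )
    {
        // compile the stylesheet
        XslCompiledTransform xslt = new XslCompiledTransform();
        using ( XmlReader styleReader = XmlReader.Create( new StringReader( Stylesheet ) ) )
        {
            xslt.Load( styleReader );
        }

        // run the transformation and collect the text output
        using ( XmlReader input = XmlReader.Create( new StringReader( xhtml ) ) )
        using ( StringWriter output = new StringWriter() )
        {
            xslt.Transform( input, null, output );
            return output.ToString();
        }
    }
}
```

Each template matches one XHTML element in the XHTML namespace and wraps its content in the corresponding RTF control words; supporting a further tag would mean adding one more template of the same shape.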

Microsoft XML SDK 3.0

Microsoft provides an XML SDK for processing XML and XSL documents. It's often installed with the operating system, but you can download and install the latest SDK.

=> The XHTML2RTF conversion tool uses XML SDK objects and methods to process XHTML and transform it into RTF.
The XML SDK API is available to Web applications as well as batch applications, and so is the XHTML2RTF conversion tool.

Microsoft Rich Text Format (RTF)

Microsoft created an exchange format for Word documents: Rich Text Format (RTF). Unlike the native Word format, it is documented; moreover, RTF has been around for some time (so you can view RTF documents with good old Word 97). There is also a free RTF viewer (Word 97/2000 Viewer), and even WordPad (installed with most Windows releases) can open, view, and edit RTF documents.

XHTML to RTF component

The XHTML to RTF converter consists of an XSL stylesheet for parsing XHTML tags and generating their RTF equivalents.

USAGE

From HTML to XHTML

You have to adapt your application to generate XHTML documents if you want to use the XHTML2RTF conversion tool:

Include an XML declaration at the beginning of the document:
<?xml version="1.0" encoding="iso-8859-1" ?>


Include the XHTML namespace declaration (the default) and the XHTML2RTF namespace declaration in the <html> tag:

<html
xmlns="http://www.w3.org/1999/xhtml"
xmlns:xhtml2rtf="http://www.lutecia.info/download/xmlns/xhtml2rtf">
...
</html>
 

Use lower case for both tag names and attribute names

<P></P> becomes <p></p>
<A HREF="...">...</A> becomes <a href="...">...</a>
etc...

Add termination for all tags (XHTML is more strict than HTML):
<link rel="stylesheet" href="..."> becomes <link rel="stylesheet" href="..." />
<hr> becomes <hr />
<br> becomes <br />

Quote all attribute values:
<table class=noprint> becomes <table class="noprint">
<a href=mypage.asp> becomes <a href="mypage.asp">

Use encoded characters for non-ASCII and/or special characters:
& becomes &amp;
é becomes &#233;
è becomes &#232;
etc...

Replace HTML character entities by their code (XML knows very few character entity references -use character codes instead):
&nbsp; becomes &#160;
&egrave; becomes &#232;
&eacute; becomes &#233;
&ecirc; becomes &#234;
etc...
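
The entity replacement can be automated before the document reaches the converter. The following is a minimal sketch in Python, not part of the XHTML2RTF tool: the function name numeric_entities is hypothetical, and the entity table comes from Python's standard html.entities module. The five entities that XML predefines are left untouched.

```python
import re
from html.entities import name2codepoint

# Entities that XML itself predefines; these must stay as they are.
XML_PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}

_ENTITY = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def numeric_entities(markup):
    """Replace HTML named entities with numeric character references."""
    def replace(match):
        name = match.group(1)
        if name in XML_PREDEFINED or name not in name2codepoint:
            return match.group(0)  # leave XML entities and unknown names alone
        return "&#%d;" % name2codepoint[name]
    return _ENTITY.sub(replace, markup)


print(numeric_entities("caf&eacute;&nbsp;&amp;&nbsp;th&egrave;"))
```

Unknown entity names are passed through unchanged so that the XML parser, rather than this helper, reports them.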

Do not use inline styles on tags (use a class and an external CSS stylesheet instead):
<div style='background:#c0c0c0; font-size: 125%; padding:1.0pt 10.0pt 1.0pt 10.0pt;'>
becomes
<div class="mydivstyle">
Thus, you will be able to customize the RTF output for your class (it's much too hard to parse an HTML style declaration within an XSL stylesheet).

Spaces in HTML and RTF

In HTML, runs of spaces are not significant: most browsers collapse or ignore them when they render the document. Microsoft Word (and RTF), on the other hand, renders every space as a visible character. Be careful when building your HTML document: do not generate extra spaces or they will be shown in your Word document.
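
To illustrate the difference between collapsing whitespace and collapsing plus trimming (the distinction the my-normalize-space and normalize-space parameters below describe), here is a minimal sketch in Python. It is not the stylesheet's code; both function names are hypothetical, and the trimming variant follows the behavior of the XPath normalize-space() function.

```python
import re

# XML whitespace: space, tab, carriage return, line feed.
_WS = re.compile(r"[ \t\r\n]+")


def collapse(text):
    """Collapse each whitespace run to one space; keep leading/trailing space."""
    return _WS.sub(" ", text)


def collapse_and_trim(text):
    """Collapse whitespace runs and strip both ends, like XPath normalize-space()."""
    return collapse(text).strip(" ")


sample = "  Hello,\n    World!  "
print(repr(collapse(sample)))
print(repr(collapse_and_trim(sample)))
```

Keeping the leading and trailing space matters for inline markup: trimming the text on both sides of a <b> element would glue the bold word to its neighbors.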

Header and Footer in HTML and RTF

  • The default header in the RTF document contains the HTML <title> (from the <head> section). You can change the header by setting the parameters header-font-size-default, header-distance-from-edge, and header-indentation-left (see parameters below).
    You can also create your own header by using the classes "rtf_header" and "rtf_header_first" in your HTML document:

  • rtf_header_first defines a complete HTML content for the header on the first page of the document

  • rtf_header defines a complete HTML content for the header on all other pages of the document

The default footer in the RTF document contains the page number and document date (current date and time; i.e. print date and time). You can change the footer by setting the parameters footer-font-size-default, footer-distance-from-edge and use-default-footer (see parameters below).

Page Break

To force a page break in your RTF document, you can use the CSS style "page-break-before" or "page-break-after" with value "always":

This is on page 1
<p style="page-break-before:always"/>
This is on page 2

Note that other values for these CSS styles (left, right, auto...) are not supported (only "always" is supported).
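
The rule above can be sketched as a small function. This is an illustration in Python under stated assumptions, not the logic of xhtml2rtf.xsl: the function name is hypothetical, and emitting the RTF control word \page (a required page break) is my assumption about a reasonable output.

```python
def page_break_rtf(style):
    """Return the RTF page-break control word if the style asks for one.

    Only the value "always" is honored; left, right, auto, etc. yield "".
    """
    for declaration in style.split(";"):
        name, _, value = declaration.partition(":")
        if (name.strip().lower() in ("page-break-before", "page-break-after")
                and value.strip().lower() == "always"):
            return "\\page "
    return ""


print(repr(page_break_rtf("page-break-before:always")))
print(repr(page_break_rtf("page-break-after: left")))
```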

XSL Stylesheet Parameters

The XSL stylesheet xhtml2rtf.xsl provides a set of parameters so that you can change the stylesheet default behavior:

  • page-start-number: Page start number (default: 1)

  • page-setup-paper-width: Paper width in TWIPS (default: 11907 TWIPS = 21 cm, i.e. A4 format)

  • page-setup-paper-height: Paper height in TWIPS (default: 16840 TWIPS = 29.7 cm, i.e. A4 format)

  • page-setup-margin-top: Top margin in TWIPS (default: 1440 TWIPS = 1 inch = 2.54 cm)

  • page-setup-margin-bottom: Bottom margin in TWIPS (default: 1440 TWIPS = 1 inch = 2.54 cm)

  • page-setup-margin-left: Left margin in TWIPS (default: 1134 TWIPS = 2 cm)

  • page-setup-margin-right: Right margin in TWIPS (default: 1134 TWIPS = 2 cm)

  • font-size-default: Default font size in half-points (default: 20 half-points = 10 pt.)

  • font-name-default: Default font name (default: 'Times New Roman')

  • font-name-fixed: Default font name for fixed-width text, like PRE or CODE (default: 'Courier New')

  • font-name-barcode: Barcode font name (default: '3 of 9 Barcode')

  • header-font-size-default: Header default font size in half-points (default: 14 half-points = 7 pt.)

  • header-distance-from-edge: Default distance between top of page and top of header, in TWIPS (default: 720 TWIPS = 1.27 cm)

  • header-indentation-left: Header left indentation in TWIPS (default: 0)

  • footer-font-size-default: Footer default font size in half-points (default: 14 half-points = 7 pt.)

  • footer-distance-from-edge: Default distance between bottom of page and bottom of footer, in TWIPS (default: 720 TWIPS = 1.27 cm)

  • use-default-footer: Boolean flag: 1 to use default footer (page number and date) or 0 no footer (default: 1)

  • document-protected: Boolean flag: 1 protected (cannot be modified) or 0 unprotected (default: 1)

  • normalize-space: Boolean flag: 1 spaces are normalized and trimmed, or 0 no normalization no trim (default: 0)

  • my-normalize-space: Boolean flag: 1 spaces are normalized (NOT TRIMMED), or 0 no normalization (default: 1)
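
The page and margin values are easier to choose with a small unit converter. A twip is 1/20 of a point, so there are 1440 twips per inch. The font size defaults listed above (20 for 10 pt., 14 for 7 pt.) correspond to half-points, the unit RTF uses for font sizes. The helper names below are hypothetical; this is a sketch in Python, not part of the tool.

```python
TWIPS_PER_INCH = 1440   # 20 twips per point, 72 points per inch
CM_PER_INCH = 2.54


def cm_to_twips(cm):
    """Convert centimeters to the nearest whole number of twips."""
    return round(cm / CM_PER_INCH * TWIPS_PER_INCH)


def twips_to_cm(twips):
    """Convert twips to centimeters."""
    return twips / TWIPS_PER_INCH * CM_PER_INCH


def points_to_half_points(points):
    """Convert a font size in points to RTF half-points."""
    return round(points * 2)


print(cm_to_twips(2))                  # left/right margin default
print(round(twips_to_cm(11907), 1))    # A4 paper width
print(points_to_half_points(10))       # default font size
```

Note that converting 21 cm with this formula gives a value slightly below 11907; the stylesheet default rounds to the same A4 width within a fraction of a millimeter.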

Batch mode (WSH)

I wrote a batch program (XHTML2RTF.BAT) that relies on Windows Script Host (WSH) to call the XML DOM SDK and transform an HTML file into its RTF equivalent (the output is written to stdout).
To use this component from batch, call XHTML2RTF.BAT with the HTML file name as parameter. The RTF document is written to stdout, so you should redirect the output with the > operator. Then you can open the generated file with Microsoft Word (or WordPad):

C:\> XHTML2RTF.BAT Readme.htm > Readme.rtf
C:\> START WINWORD Readme.rtf

To pass parameters to the XHTML2RTF program, use the -p flag followed by a parameter name and value.
Example (from a command prompt: Run... / cmd):

C:\> XHTML2RTF.BAT -p page-start-number=5 -p document-protected=0 -p font-name-default='Arial' Readme.htm > Readme.rtf
C:\> START WINWORD Readme.rtf
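
If the batch program is driven from another script, the same command line can be assembled programmatically. The sketch below is in Python and assumes XHTML2RTF.BAT is reachable from the current directory or PATH on a Windows machine; the function names build_command and convert are hypothetical.

```python
import subprocess


def build_command(html_file, params=None, script="XHTML2RTF.BAT"):
    """Return the argument list: one '-p name=value' pair per parameter."""
    args = [script]
    for name, value in (params or {}).items():
        args += ["-p", "%s=%s" % (name, value)]
    args.append(html_file)
    return args


def convert(html_file, rtf_file, params=None):
    """Run the batch file through cmd.exe and redirect its stdout to rtf_file."""
    with open(rtf_file, "wb") as out:
        subprocess.run(["cmd", "/c"] + build_command(html_file, params),
                       stdout=out, check=True)


print(build_command("Readme.htm", {"page-start-number": 5,
                                   "document-protected": 0}))
```

Calling convert("Readme.htm", "Readme.rtf", {"page-start-number": 5}) would then play the role of the redirected command shown above.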

Figure 6.

Web-Based (ASP)

I wrote a simple ASP library to call the component from an ASP page, producing an RTF document from live, dynamic content (results from a SQL database request, for example). Create a virtual directory called "xhtml2rtf_phs" in your IIS service, then browse to http://localhost:81/xhtml2rtf_phs/HelloWorld1.asp or http://localhost:81/xhtml2rtf_phs/HelloWorld2.asp (Demo: http://212.50.77.132:81/xhtml2rtf_phs/HelloWorld1.asp)

Figure 7.

To use this component from a Web page, you have to include the XHTML2RTF.inc file in your page, and call function XHTMLString2RTF(), passing the XHTML document (as a string):

<!--#include file="XHTML2RTF.inc"-->

var strXHTML = " \
<html xmlns=\"http://www.w3.org/1999/xhtml\" xmlns:xhtml2rtf=\"http://www.lutecia.info/download/xmlns/xhtml2rtf\"> \
<head> \
<title>Hello, World! from string</title> \
</head> \
<body> \
<h1>Hello, World!</h1> \
</body> \
</html> \
";

XHTMLString2RTF(strXHTML);
 

Figure 8.

Note:
The real production system uses SQL requests, generates XML output, transforms it into XHTML via a first XSL stylesheet, and then transforms it into an RTF document. The example above is just that: an example for demonstration purposes. Please do not generate HTML via strings on your production system ;-)
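
Short of a full XML/XSL pipeline, one way to avoid string concatenation is to build the XHTML through a DOM-style API, which handles escaping and well-formedness for you. The following is a minimal sketch in Python using the standard xml.etree.ElementTree module; it is not the ASP code used by this component, and the function name build_document is hypothetical.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"

# Serialize XHTML elements with a default namespace instead of a prefix.
ET.register_namespace("", XHTML_NS)


def _tag(name):
    """Return the namespace-qualified tag name ElementTree expects."""
    return "{%s}%s" % (XHTML_NS, name)


def build_document(title, heading):
    """Build a tiny XHTML document and return it as a string."""
    html = ET.Element(_tag("html"))
    head = ET.SubElement(html, _tag("head"))
    ET.SubElement(head, _tag("title")).text = title
    body = ET.SubElement(html, _tag("body"))
    ET.SubElement(body, _tag("h1")).text = heading
    return ET.tostring(html, encoding="unicode")


print(build_document("Hello, World! from a tree", "Hello, World!"))
```

Special characters such as & and < in the text are escaped automatically during serialization, which is exactly what string concatenation tends to get wrong.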

Raw RTF output

The XHTML2RTF conversion provides a direct RTF output with no rendering in XHTML. The tool processes a special tag (<xhtml2rtf:raw>) to send RTF directly. For example, this code will send a TAB character in the RTF output:
<xhtml2rtf:raw class="rtf">\tab </xhtml2rtf:raw>

This code will not be rendered in your Web browser, since the class "rtf" is defined in the CSS stylesheet as "display:none".

There are many uses for this raw output. In particular, you can work around most of the current limitations of the conversion tool (as listed in the TO DO LIST section). For example, you can send the RTF code for an image, even if the conversion tool doesn't handle images yet:

<xhtml2rtf:raw class="rtf">
{
\*\shppict{\pict\picw3043\pich3043\picwgoal1725\pichgoal1725\pngblip89504e470d0a1a0a0000
000d49484452000000730000007308020000002421aab1000000017352474200aece1ce90000000467414d
410000b18f0bfc61050000...}
}
</xhtml2rtf:raw>
 

To find out what RTF code is appropriate for this image, I just used Word to edit a document with a picture, and then saved it in RTF format. I opened the resulting file as text, and copied/pasted the RTF code into the XHTML output, within <xhtml2rtf:raw> tags.
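
Instead of copying the picture data out of a Word file by hand, the hexadecimal block can be generated from the PNG file itself. The sketch below, in Python, is an assumption-laden illustration rather than part of the tool: the function name is hypothetical, it emits only the \picwgoal and \pichgoal size control words (desired size in twips), and I have not verified that every RTF reader accepts a picture group without the \picw and \pich values Word writes.

```python
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def png_to_raw_rtf(png, width_twips, height_twips):
    """Wrap PNG bytes in an RTF picture group inside <xhtml2rtf:raw>."""
    if not png.startswith(PNG_SIGNATURE):
        raise ValueError("not a PNG file")
    hex_data = png.hex()
    # Break the hex dump into short lines, as Word does in its RTF output.
    lines = [hex_data[i:i + 78] for i in range(0, len(hex_data), 78)]
    header = "{\\*\\shppict{\\pict\\picwgoal%d\\pichgoal%d\\pngblip" % (
        width_twips, height_twips)
    return ('<xhtml2rtf:raw class="rtf">\n' + header + "\n"
            + "\n".join(lines) + "}}\n</xhtml2rtf:raw>")


# Demonstration with a truncated byte string; a real call would pass the
# full contents of a PNG file read in binary mode.
print(png_to_raw_rtf(PNG_SIGNATURE + b"\x00\x00", 1725, 1725))
```

The hex dump starts with 89504e470d0a1a0a, the PNG signature, which is also how the sample above begins.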

RTF-specific fields

Some RTF-specific fields are available in the conversion tool:

Page Number

You can display the current page number in your RTF document via <xhtml2rtf:page_number>:

PAGE <xhtml2rtf:page_number/>

Total Number of Pages

You can display the total number of pages in your RTF document via <xhtml2rtf:total_number_of_pages>:

PAGE <xhtml2rtf:page_number/> / <xhtml2rtf:total_number_of_pages/>
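
For readers curious about what such fields look like on the RTF side, here is a sketch in Python that builds generic RTF field groups. The exact RTF that the stylesheet emits for these tags is not shown in this article, so treat this as an assumption: PAGE and NUMPAGES are the Word field instructions for the current page and the page count, and the function name rtf_field is hypothetical.

```python
def rtf_field(instruction, placeholder="1"):
    """Return an RTF field group; the reader recomputes the placeholder."""
    return "{\\field{\\*\\fldinst %s}{\\fldrslt %s}}" % (instruction,
                                                         placeholder)


# "PAGE n / total", built from two fields.
footer = "PAGE " + rtf_field("PAGE") + " / " + rtf_field("NUMPAGES")
print(footer)
```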

IMPLEMENTATION

  • The XHTML to RTF converter consists of an XSL stylesheet that parses XHTML tags and generates their RTF equivalents.

TO DO LIST

  • Full support for XHTML tags <ul>, <li>, <ol> (not fully supported)
  • Full support for XHTML tags <table>, <tr>, <td> (not fully supported)
  • Support XHTML Objects (<object>), Images (<img>), and Applets (<applet>) (not supported yet)
  • Support XHTML attribute <title> with RTF annotations (bugs in current version)
  • Support XHTML hyphen and soft hyphen characters
  • Support XHTML INS and DEL elements
  • Support XHTML Lists (<ul>, <ol>, &l

