Text to HTML Parser

Introduction

If you develop web applications, you have probably noticed that when you display multi-line data from a database, the spacing and line breaks between the lines are lost. In addition, in applications such as forums, users can post HTML content directly, which can lead to serious problems. For example, a user can post an HTML image tag such as <img src="http://myserver.com/mypic.jpg"> and when someone views the post, the actual image is displayed instead of the tag text. Someone could just as easily post a link to a malicious page, making every reader an easy target, which has serious security implications.

Problem

The problem that I have described above is divided into 2 parts.

  1. Formatting Problem: In HTML, any run of white space between two characters is collapsed into a single white space. The Carriage Return '\r' and Line Feed '\n' characters also have no effect on HTML formatting. As a result, a multi-line post is displayed as one continuous line.
  2. HTML Content: This can be either a problem or a boon, depending on the users of your application. When content from the database is displayed, the HTML engine of the client browser parses any HTML it contains. As a result, the tags are rendered as HTML elements instead of being displayed as text.

Solution

There is a common solution to both problems: parse the text content from the database into the corresponding HTML markup.

  1. Formatting Solution: In HTML, &nbsp; denotes a non-breaking space, which is not collapsed. So every two consecutive white spaces should be replaced by a single white space followed by &nbsp;.
    Also, every line terminator should be replaced by the break tag <br>, so that the following text starts on a new line.
  2. HTML Content: The solution here is a bit trickier. In HTML, every valid tag is enclosed in the < and > brackets. So, to stop the HTML tags in a post from being rendered, change the < and > characters to their HTML counterparts &lt; and &gt; respectively. One further change is needed: the double quotation mark " has to be changed into its HTML equivalent &quot;.
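The substitution rules above can be sketched with plain string replacements. This is only an illustration of the rules, not the parser presented later in the article; the class name SubstitutionSketch and the method ToHtml are hypothetical names chosen for this example.

```csharp
using System;

public static class SubstitutionSketch
{
    // Hypothetical helper (not part of the article's parser):
    // applies the substitution rules listed above, one after another.
    public static string ToHtml(string text, bool allowHtml)
    {
        string result = text;

        if (!allowHtml)
        {
            // Neutralise tags and double quotes so they display as text.
            result = result.Replace("<", "&lt;")
                           .Replace(">", "&gt;")
                           .Replace("\"", "&quot;");
        }

        // Two consecutive spaces become one space followed by &nbsp;.
        result = result.Replace("  ", " &nbsp;");

        // Every line terminator becomes a <br> tag. "\r\n" is handled
        // first so that it yields a single <br>, not two.
        result = result.Replace("\r\n", "<br>")
                       .Replace("\n", "<br>")
                       .Replace("\r", "<br>");

        return result;
    }

    public static void Main()
    {
        // Prints: a &nbsp;b<br>&lt;i&gt;&quot;c&quot;&lt;/i&gt;
        Console.WriteLine(ToHtml("a  b\n<i>\"c\"</i>", false));

        // Prints: a &nbsp;b<br><i>"c"</i>
        Console.WriteLine(ToHtml("a  b\n<i>\"c\"</i>", true));
    }
}
```

The tag characters are escaped before the <br> tags are inserted; doing it the other way round would escape the freshly inserted <br> tags as well.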

Text to HTML parser

On the .NET platform the String object is immutable, i.e. once you create a String object you cannot change its contents. Since our parser needs to do some heavyweight string manipulation, I use the StringBuilder class from the System.Text namespace, which provides a mutable string object. For streaming access to the textual content I use the StringReader and StringWriter classes from the System.IO namespace.
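A small sketch of how these classes behave may help. It is a standalone illustration, not part of the parser; MutableStringDemo and CountLines are hypothetical names used only here.

```csharp
using System;
using System.IO;
using System.Text;

public static class MutableStringDemo
{
    // Hypothetical helper: counts the lines that StringReader finds.
    // ReadLine treats "\r\n", "\n" and "\r" as line terminators and
    // returns null at the end of the text.
    public static int CountLines(string text)
    {
        int count = 0;
        using (StringReader reader = new StringReader(text))
        {
            while (reader.ReadLine() != null)
            {
                count++;
            }
        }
        return count;
    }

    public static void Main()
    {
        // String.Replace returns a new string; the original is untouched.
        string original = "a  b";
        string replaced = original.Replace("  ", " &nbsp;");
        Console.WriteLine(original);   // a  b
        Console.WriteLine(replaced);   // a &nbsp;b

        // StringBuilder.Replace modifies the builder's own buffer.
        StringBuilder sb = new StringBuilder("a  b");
        sb.Replace("  ", " &nbsp;");
        Console.WriteLine(sb.ToString());   // a &nbsp;b

        // StringWriter collects everything written to it in an
        // internal StringBuilder.
        StringWriter writer = new StringWriter();
        writer.Write("one");
        writer.Write("<br>");
        Console.WriteLine(writer.ToString());   // one<br>

        Console.WriteLine(CountLines("one\r\ntwo\nthree\rfour"));   // 4
    }
}
```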


Example: Normal post


Some sample text with lots of extra white spacing .. ... and some text on a new line. lastly the HTML textbox tag


Example: Parsed Text with HTML posting allowed


Some sample text with lots of   extra   white spacing .. ...
and some text on a new line.
lastly the HTML textbox tag



Example: Parsed Text with HTML posting disabled (exactly the same as posted)


Some sample text with lots of   extra   white spacing .. ...
and some text on a new line.
lastly the HTML textbox tag <input type="text">


Source Code

1) parsetext method: the method that converts text into HTML

public string parsetext(string text, bool allow)
{
    //Create a StringBuilder object from the string input
    //parameter
    StringBuilder sb = new StringBuilder(text);
    //Replace every pair of consecutive white spaces with a
    //single white space followed by &nbsp;
    sb.Replace("  ", " &nbsp;");
    //Check if HTML tags are not allowed
    if (!allow)
    {
        //Convert the brackets into HTML equivalents
        sb.Replace("<", "&lt;");
        sb.Replace(">", "&gt;");
        //Convert the double quote
        sb.Replace("\"", "&quot;");
    }
    //Create a StringReader from the processed string of
    //the StringBuilder
    StringReader sr = new StringReader(sb.ToString());
    StringWriter sw = new StringWriter();
    //Loop while a next character exists
    while (sr.Peek() > -1)
    {
        //Read a line from the string and store it in a temp
        //variable
        string temp = sr.ReadLine();
        //Write the line followed by the HTML break tag
        //Note: the Write method appends to an internal StringBuilder
        //object that the StringWriter creates automatically
        sw.Write(temp + "<br>");
    }
    //Return the final processed text
    return sw.GetStringBuilder().ToString();
}
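
The method above escapes <, > and the double quote, but not the ampersand, so text such as &lt; typed by a user would still be read by the browser as an entity. As a possible alternative for the escaping step, the framework's own encoder can be used. The sketch below assumes .NET 4 or later, where System.Net.WebUtility.HtmlEncode is available; the names BuiltInEncodingSketch and EncodeForDisplay are hypothetical and not part of the article's code.

```csharp
using System;
using System.Net;

public static class BuiltInEncodingSketch
{
    // Hypothetical variant for the "HTML posting disabled" case:
    // lets the framework escape the markup characters, then applies
    // the article's spacing and line-break rules.
    public static string EncodeForDisplay(string text)
    {
        // HtmlEncode escapes <, >, " and also &.
        string encoded = WebUtility.HtmlEncode(text);

        // Two consecutive spaces become one space followed by &nbsp;.
        encoded = encoded.Replace("  ", " &nbsp;");

        // Line terminators become <br> tags ("\r\n" first).
        return encoded.Replace("\r\n", "<br>")
                      .Replace("\n", "<br>")
                      .Replace("\r", "<br>");
    }

    public static void Main()
    {
        string post = "Tom & Jerry <input type=\"text\">\nnext  line";

        // Prints:
        // Tom &amp; Jerry &lt;input type=&quot;text&quot;&gt;<br>next &nbsp;line
        Console.WriteLine(EncodeForDisplay(post));
    }
}
```

Encoding has to happen before the &nbsp; and <br> substitutions here, otherwise the encoder would escape the inserted markup too.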

2) textparser.aspx - A sample consumer for the Text to HTML parser
<%@ Page Language="C#" %>
<%@ Import namespace="System.Text" %>
<%@ Import Namespace="System.IO" %>
<html>
<head>
<script language="C#" runat=server >
private void Post_Text(object sender, EventArgs e)
{
    //Check if there is some text inside the TextBox
    if (mess.Text != "")
    {
        //Check if option to Parse Text is selected
        if (parse.Checked)
        {
            //Call the parsetext method. Pass the text content from
            //the textbox and the state of the "allow posting of HTML
            //content" checkbox: true leaves HTML tags intact, false
            //converts them to text
            postmess.Text = parsetext(mess.Text, htmlpost.Checked);
        }
        else
        {
            //Just post the text without any parsing
            postmess.Text = mess.Text;
        }
    }
}
//Method to parse Text into HTML
public string parsetext(string text, bool allow)
{
    //Create a StringBuilder object from the string input
    //parameter
    StringBuilder sb = new StringBuilder(text);
    //Replace every pair of consecutive white spaces with a
    //single white space followed by &nbsp;
    sb.Replace("  ", " &nbsp;");
    //Check if HTML tags are not allowed
    if (!allow)
    {
        //Convert the brackets into HTML equivalents
        sb.Replace("<", "&lt;");
        sb.Replace(">", "&gt;");
        //Convert the double quote
        sb.Replace("\"", "&quot;");
    }
    //Create a StringReader from the processed string of
    //the StringBuilder object
    StringReader sr = new StringReader(sb.ToString());
    StringWriter sw = new StringWriter();
    //Loop while a next character exists
    while (sr.Peek() > -1)
    {
        //Read a line from the string and store it in a temp
        //variable
        string temp = sr.ReadLine();
        //Write the line followed by the HTML break tag
        //Note: the Write method appends to an internal StringBuilder
        //object that the StringWriter creates automatically
        sw.Write(temp + "<br>");
    }
    //Return the final processed text
    return sw.GetStringBuilder().ToString();
}
</script>
</head>
<body>
<center>
<h3>Welcome to Saurabh's Text to HTML Parser</h3>
<br>
<form runat=server ID="Form2">
<table border=1>
<tr>
<td valign=top>Your message</td>
<td>
<asp:label text="&nbsp;" id=postmess runat=server />
</td></tr>
<tr><td valign=top>Enter Message </td>
<td><asp:textbox Columns="50" Rows="20" TextMode="MultiLine" id=mess runat=server /></td></tr>
<tr><td colspan=2><asp:checkbox id=parse text="Select to Parse the Text into HTML" runat=server /><br>
<asp:checkbox id=htmlpost text="Select to allow posting of HTML content" runat=server />
</td></tr>
<tr><td colspan=2><asp:button onClick="Post_Text" text="Post Text" runat=server ID="Button1" NAME="Button1"/></td></tr>
</table>
</form>
</center>
</body>
</html>

