Reading Contents From PDF, Word, Text Files In C#

Applications often need to read text from different types of files. This article describes how to read text from text files, Word documents, and PDF documents in C#. Let's go through them one by one.

Read text from PDF files

In this section we will discuss how to read text from PDF files. Follow the steps below:

Step 1

Download the iTextSharp assembly from the URL below. Extract the archive (itextsharp-dll-core) and add a reference to iTextSharp.dll in your project: http://sourceforge.net/projects/itextsharp/

Step 2

Add the following namespaces for iTextSharp (System.Text is needed for StringBuilder):

  using System.Text;
  using iTextSharp.text;
  using iTextSharp.text.pdf;
  using iTextSharp.text.pdf.parser;

Step 3

Add the following code to read text from a PDF file. The method below returns the extracted text as a string.

Code

  private string GetTextFromPDF()
  {
      StringBuilder text = new StringBuilder();

      // PdfReader is disposed automatically at the end of the using block.
      using (PdfReader reader = new PdfReader("D:\\RentReceiptFormat.pdf"))
      {
          // PDF page numbers are 1-based.
          for (int i = 1; i <= reader.NumberOfPages; i++)
          {
              text.Append(PdfTextExtractor.GetTextFromPage(reader, i));
          }
      }

      return text.ToString();
  }
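
The method above hard-codes the file path. As a minimal sketch, the same loop can take the path as a parameter and put each page on its own line. The class name PdfTextReader and the method name ReadAll are illustrative names, not part of iTextSharp, and the sketch assumes an iTextSharp 5.x release in which PdfReader can be used in a using block, as in the code above.

```csharp
using System;
using System.IO;
using System.Text;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

// Hypothetical helper class for illustration; not part of iTextSharp.
public static class PdfTextReader
{
    // Returns the text of every page in the PDF, one page per line group.
    public static string ReadAll(string pdfPath)
    {
        if (!File.Exists(pdfPath))
        {
            throw new FileNotFoundException("PDF file not found.", pdfPath);
        }

        StringBuilder text = new StringBuilder();

        using (PdfReader reader = new PdfReader(pdfPath))
        {
            // PDF page numbers are 1-based.
            for (int page = 1; page <= reader.NumberOfPages; page++)
            {
                // Use a fresh strategy per page so text from one page
                // does not accumulate into the next.
                ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
                text.AppendLine(PdfTextExtractor.GetTextFromPage(reader, page, strategy));
            }
        }

        return text.ToString();
    }
}
```

A call would look like string content = PdfTextReader.ReadAll(@"D:\RentReceiptFormat.pdf");. Scanned PDFs that contain only images will typically yield little or no text this way, because no OCR is involved.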

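The same can be done with other third-party libraries such as PDFlib or PDFBox. Their licensing terms differ (PDFlib, for example, is a commercial product), so I used the freely available iTextSharp assembly. Check each library's license before using it in your project.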

Read text from Word documents

In this section we will discuss how to read text from a Word document.

Step 1

Add a reference to the Microsoft.Office.Interop.Word assembly to the project. Please refer to the following snapshot.

Step 2

After adding the assembly, add the following namespace to the class/code-behind file.

  using Microsoft.Office.Interop.Word;

Then write the following code to read text from a Word document. It returns the content as a string.

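Code

  /// <summary>
  /// Reads the text of every paragraph in a Word document.
  /// </summary>
  /// <returns>The document text as a single string.</returns>
  private string GetTextFromWord()
  {
      StringBuilder text = new StringBuilder();
      Microsoft.Office.Interop.Word.Application word = new Microsoft.Office.Interop.Word.Application();
      object miss = System.Reflection.Missing.Value;
      object path = @"D:\Articles2.docx";
      object readOnly = true;
      object saveChanges = false;
      Microsoft.Office.Interop.Word.Document docs = null;

      try
      {
          docs = word.Documents.Open(ref path, ref miss, ref readOnly, ref miss, ref miss, ref miss, ref miss, ref miss, ref miss, ref miss, ref miss, ref miss, ref miss, ref miss, ref miss, ref miss);

          // Word collections are 1-based, so iterate from 1 to Count.
          for (int i = 1; i <= docs.Paragraphs.Count; i++)
          {
              text.Append(" \r\n " + docs.Paragraphs[i].Range.Text);
          }
      }
      finally
      {
          // Close the document and quit Word so that no WINWORD.EXE process is left running.
          // The casts avoid the compiler's ambiguity warning between the Close/Quit methods and events.
          if (docs != null)
          {
              ((Microsoft.Office.Interop.Word._Document)docs).Close(ref saveChanges, ref miss, ref miss);
          }
          ((Microsoft.Office.Interop.Word._Application)word).Quit(ref saveChanges, ref miss, ref miss);
      }

      return text.ToString();
  }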

Read text from text files

In this section we will discuss how to read text from text files.

Add the namespace (using System.IO;). The following code reads content from plain-text files such as text (.txt), XML (.xml), and HTML (.html) files.

Code

  /// <summary>
  /// Reads the entire content of a text file.
  /// </summary>
  /// <returns>The file content as a single string.</returns>
  private string GetTextFromText()
  {
      // ReadAllText opens the file, reads it to the end, and closes it.
      string text = System.IO.File.ReadAllText(@"D:\Articles2.txt");

      return text;
  }
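
Two common variations on the method above are passing an explicit encoding and reading a large file line by line instead of loading it all at once. The following is a minimal sketch of both, assuming .NET Framework 4 or later. The class name TextFileReader and its method names are illustrative and do not come from the original code.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical helper class for illustration.
public static class TextFileReader
{
    // Reads the whole file into one string, decoding it with the given
    // encoding (UTF-8 when none is supplied).
    public static string ReadAll(string path, Encoding encoding = null)
    {
        if (!File.Exists(path))
        {
            throw new FileNotFoundException("Text file not found.", path);
        }

        return File.ReadAllText(path, encoding ?? Encoding.UTF8);
    }

    // Counts non-empty lines without loading the whole file into memory;
    // File.ReadLines streams the file one line at a time.
    public static int CountNonEmptyLines(string path)
    {
        int count = 0;

        foreach (string line in File.ReadLines(path))
        {
            if (line.Trim().Length > 0)
            {
                count++;
            }
        }

        return count;
    }
}
```

For example, TextFileReader.ReadAll(@"D:\Articles2.txt") returns the same content as the method above for a UTF-8 file, while CountNonEmptyLines is a better fit for files too large to hold in memory comfortably.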

Hope this helps you.

Happy Coding!!