Read, Compare And Create PDF Files In .NET Core

Introduction

 
In this tutorial we will learn how to work with PDF files in .NET Core. I believe this is an important topic. Many PDF libraries are available, both on the market and in the NuGet repository. I will use the open source library iText7. The tutorial is divided into 3 cases and 1 scenario. First, let us look at iText7.
 

iText7

 
iText7 is the successor to iTextSharp (the .NET port of iText 5). It is an enterprise-grade PDF library that handles the PDF standards and related technologies, so developers can focus on business logic. It offers an abstract document model and gives us the ability to structure, read, delete, and insert PDF content. It comes with comprehensive documentation, is designed for high performance, and is extensible through add-ons from the iText7 Suite. It is available for Java and .NET and is supported by the community. The iText7 source code is open source and available on GitHub.
 
Free Licenses
 
The iText7 Community edition is free to use under the open source AGPL license. The license has some restrictions. For commercial use, please read the official licensing documentation.
 

What will be covered in the tutorial?

 
In this tutorial I will create 3 functions: one reads a PDF file using iText7, one compares the content of two files using diff_match_patch (offered by Google as free source on GitHub), and one creates a PDF file using iText7 to store the result. Overall, we will compare 2 PDF files in one real-life scenario. So, let's start!
 
Case 1
 
Reading A Pdf File To Text, using iText7
/*
..
using System.Text;
using iText.Kernel.Pdf;
using iText.Kernel.Pdf.Canvas.Parser;
using iText.Kernel.Pdf.Canvas.Parser.Listener;
*/
public string ReadFile(string pdfPath) {
    var pageText = new StringBuilder();
    // Disposing the PdfDocument also releases the underlying reader.
    using (PdfDocument pdfDocument = new PdfDocument(new PdfReader(pdfPath))) {
        var pageCount = pdfDocument.GetNumberOfPages();
        // iText page numbers are 1-based.
        for (int i = 1; i <= pageCount; i++) {
            // A fresh strategy per page keeps the text of each page separate.
            LocationTextExtractionStrategy strategy = new LocationTextExtractionStrategy();
            PdfCanvasProcessor parser = new PdfCanvasProcessor(strategy);
            // Process page i, not only the first page.
            parser.ProcessPageContent(pdfDocument.GetPage(i));
            pageText.Append(strategy.GetResultantText());
        }
    }
    return pageText.ToString();
}
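As a shorter variant of ReadFile, iText7 also provides the PdfTextExtractor helper, which wraps the processor setup. The following is a minimal sketch, assuming the itext7 NuGet package is referenced. The class name PdfTextReader and the method ReadPages are made up for illustration and are not part of the attached source code. It returns the text per page, which can be handy if differences should later be reported with page numbers.

```csharp
using System.Collections.Generic;
using iText.Kernel.Pdf;
using iText.Kernel.Pdf.Canvas.Parser;
using iText.Kernel.Pdf.Canvas.Parser.Listener;

// Hypothetical helper class, not part of the article's source code.
public static class PdfTextReader
{
    // Returns the extracted text of each page, in page order.
    public static List<string> ReadPages(string pdfPath)
    {
        var pages = new List<string>();
        using (var pdfDocument = new PdfDocument(new PdfReader(pdfPath)))
        {
            // iText page numbers are 1-based.
            for (int i = 1; i <= pdfDocument.GetNumberOfPages(); i++)
            {
                // A new strategy per page keeps page texts separate.
                var strategy = new LocationTextExtractionStrategy();
                pages.Add(PdfTextExtractor.GetTextFromPage(pdfDocument.GetPage(i), strategy));
            }
        }
        return pages;
    }
}
```

Calling string.Join("\n", PdfTextReader.ReadPages(path)) gives a single string comparable to what ReadFile returns.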
Case 2
 
Reading and comparing two PDF files, using the function above and Google's diff_match_patch, which is open source and available on GitHub. We simply download the source file, add it to the project, and write the code below.
 
Diff Match Patch is a powerful library for comparing and manipulating plain text.
/*
..
using DiffMatchPatch;
using System.IO;
using System.Reflection;
using System.Text;
*/
public string ComparePdfFiles() {
    // The sample PDFs are expected in a SamplePdfs folder next to the executable.
    string baseDir = Path.GetDirectoryName(Assembly.GetEntryAssembly().Location);
    string pdfPath1 = Path.Combine(baseDir, "SamplePdfs", "original.pdf");
    string pdfPath2 = Path.Combine(baseDir, "SamplePdfs", "altered.pdf");
    StringBuilder compareResult = new StringBuilder();
    var text1 = ReadFile(pdfPath1);
    var text2 = ReadFile(pdfPath2);
    diff_match_patch dmp = new diff_match_patch();
    // diff_main returns a list of Diff objects, each holding an operation and the affected text.
    var diff = dmp.diff_main(text1, text2);
    foreach (var d in diff) {
        // Write one diff entry per line.
        compareResult.Append(d + "\n");
    }
    return compareResult.ToString();
}
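The raw Diff entries include every unchanged stretch of text, which makes the output long. The following is a minimal sketch of one way to list only the changes, assuming the DiffMatchPatch C# source file is part of the project. The class DiffFormatter and the method FormatChanges are hypothetical names chosen for illustration. The "-" and "+" prefixes are just one possible convention.

```csharp
using System.Collections.Generic;
using System.Text;
using DiffMatchPatch;

// Hypothetical helper, not part of the article's source code.
public static class DiffFormatter
{
    // Lists only the changed parts: deletions prefixed with "-", insertions with "+".
    public static string FormatChanges(string text1, string text2)
    {
        var dmp = new diff_match_patch();
        List<Diff> diffs = dmp.diff_main(text1, text2);
        // Merge character-level fragments into more human-readable chunks.
        dmp.diff_cleanupSemantic(diffs);

        var sb = new StringBuilder();
        foreach (Diff d in diffs)
        {
            switch (d.operation)
            {
                case Operation.DELETE:
                    sb.Append("- ").Append(d.text).Append('\n');
                    break;
                case Operation.INSERT:
                    sb.Append("+ ").Append(d.text).Append('\n');
                    break;
                // Operation.EQUAL: unchanged text is skipped.
            }
        }
        return sb.ToString();
    }
}
```

The texts returned by ReadFile for the two PDFs can be passed in directly, for example DiffFormatter.FormatChanges(text1, text2).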
Case 3 - Creating a PDF file, using iText7
 
Now we will generate a PDF file, writing each line of text as its own paragraph.
/*
..
using System.IO;
using iText.Kernel.Pdf;
using iText.Layout;
using iText.Layout.Element;
*/
public void GeneratePdf(string[] paragraphs, string destination) {
    // Remove any previous output; PdfWriter creates the file itself.
    FileInfo file = new FileInfo(destination);
    file.Delete();
    PdfDocument pdfdoc = new PdfDocument(new PdfWriter(file.FullName));
    // Mark the document as a tagged PDF.
    pdfdoc.SetTagged();
    using (Document document = new Document(pdfdoc)) {
        foreach (var para in paragraphs) {
            document.Add(new Paragraph(para));
        }
        // Close() flushes the content and also closes the underlying PdfDocument.
        document.Close();
    }
}
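GeneratePdf writes plain paragraphs. If the output PDF should show differences at a glance, one option is to color the text by diff operation. The following is a minimal sketch, assuming the itext7 NuGet package and the DiffMatchPatch source file are both available. The class ColoredDiffPdf and its Write method are hypothetical names, and the red and green color choice is only an example.

```csharp
using System.Collections.Generic;
using DiffMatchPatch;
using iText.Kernel.Colors;
using iText.Kernel.Pdf;
using iText.Layout;
using iText.Layout.Element;

// Hypothetical helper, not part of the article's source code.
public static class ColoredDiffPdf
{
    // Writes the diff as one paragraph: deletions in red, insertions in green.
    public static void Write(List<Diff> diffs, string destination)
    {
        using (var document = new Document(new PdfDocument(new PdfWriter(destination))))
        {
            var paragraph = new Paragraph();
            foreach (Diff d in diffs)
            {
                var text = new Text(d.text);
                if (d.operation == Operation.DELETE)
                {
                    text.SetFontColor(ColorConstants.RED);
                }
                else if (d.operation == Operation.INSERT)
                {
                    text.SetFontColor(ColorConstants.GREEN);
                }
                // Unchanged text keeps the default color.
                paragraph.Add(text);
            }
            document.Add(paragraph);
        }
    }
}
```

The list returned by diff_main can be passed to Write as is, so unchanged text stays in place and gives context to the colored changes.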
Now, let us take a scenario. A person was preparing a Q4 revenue report. His friend changed the revenue figures just for fun, but the author had saved one PDF copy as a backup. So, let us compare the two files to see what has been changed.
 
App Code
PdfFileHandler reader = new PdfFileHandler();  // Name of wrapper class for all methods shown above
// Each diff entry becomes one paragraph; the empty entry after the last newline is dropped.
var result = reader.ComparePdfFiles().Split('\n', StringSplitOptions.RemoveEmptyEntries);
reader.GeneratePdf(result, "compare.pdf");
Console.WriteLine("Done..");
Console.ReadKey(true);
Original File
 
[Image: the original PDF file]
 
Altered File
 
[Image: the altered PDF file]
 
Comparison Result File
 
[Image: the generated comparison result PDF file]
 
That's it! We can see the generated comparison result. This type of file can be difficult to interpret, so you can customize the result format to suit your needs or your client's needs. Source code is attached for reference. Please feel free to discuss and ask questions :)