Udai Mathur

Paragraph Reading in PDF

Jul 1 2019 3:46 AM
In my code I need to read the content of a PDF file and, based on some specific requirements, insert that content into a SQL Server database.
I used iTextSharp to read the PDF. It reads well when it finds a complete line in the PDF.
The problem comes when it finds a table inside the PDF.

It first goes into column 1 and reads a line, then jumps to column 2 and reads a line there, and so on.
The problem is that column 1 contains a paragraph and column 2 contains a different paragraph. The reader breaks those paragraphs into separate single lines, which have no meaning on their own.

I want it to work like this: go to column 1 and read the paragraph, and if there is a new paragraph after a newline, read that paragraph starting from the new line.
After processing column 1, it should then move on to column 2.
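
One direction that might give this column-by-column order is to restrict extraction to one page region at a time. The sketch below is untested against the attached file and rests on assumptions: it supposes iTextSharp 5.x, a table with exactly two columns, and a column boundary at the horizontal middle of the page (the class name ColumnExtractionSketch and the two rectangles are made up for illustration). It only keeps each column's lines together; splitting a column's text into paragraphs would still need separate handling.

```csharp
using System;
using iTextSharp.text;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

class ColumnExtractionSketch
{
    static void Main()
    {
        PdfReader reader = new PdfReader(@"D:\pdf1.pdf");
        try
        {
            for (int i = 1; i <= reader.NumberOfPages; i++)
            {
                Rectangle page = reader.GetPageSize(i);

                // ASSUMPTION: two columns that meet at the middle of the page.
                // Real table cells would need their actual coordinates here.
                float middle = page.Left + page.Width / 2;
                Rectangle[] columns =
                {
                    new Rectangle(page.Left, page.Bottom, middle, page.Top),
                    new Rectangle(middle, page.Bottom, page.Right, page.Top)
                };

                foreach (Rectangle column in columns)
                {
                    // Only text that falls inside this rectangle reaches the strategy
                    RenderFilter[] filters = { new RegionTextRenderFilter(column) };
                    ITextExtractionStrategy strategy = new FilteredTextRenderListener(
                        new LocationTextExtractionStrategy(), filters);

                    string columnText = PdfTextExtractor.GetTextFromPage(reader, i, strategy);
                    Console.WriteLine(columnText);
                }
            }
        }
        finally
        {
            reader.Close();
        }
    }
}
```

With this layout, column 1 of a page would be printed in full before column 2. A line of text that crosses the assumed boundary could show up in both regions, so the rectangles would have to match the real cell edges.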
 
I am attaching the PDF_File and PDF_Content screenshots. There you can see that it merges two different paragraphs from different cells.

Currently I am using the code below:

using System.Text;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

PdfReader reader = new PdfReader(@"D:\pdf1.pdf");
int PageNum = reader.NumberOfPages;

// Accumulates the extracted text of all pages
StringBuilder text = new StringBuilder();

for (int i = 1; i <= PageNum; i++)
{
    ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
    string currentText = PdfTextExtractor.GetTextFromPage(reader, i, strategy);

    // GetTextFromPage already returns a .NET string, so no re-encoding is needed
    text.Append(currentText);
}

reader.Close();

// Split into lines once, after all pages have been read
string[] sentence = text.ToString().Split('\n');

Attachment: PDF.rar