Extract Text from Word Documents in Java

C# Curator
3y

Introduction

Extracting text from documents is a common task, whether the goal is to analyze their content or to retrieve specific information. Word documents are widely used for storing and processing text, so this article focuses on extracting text from Word documents in Java using Free Spire.Doc for Java.

A Word document can contain a wide range of elements, such as sections, paragraphs, tables, and bookmarks. This article shows how to extract text from a whole Word document as well as from individual elements within it:

Extract Text from a Whole Word Document
Extract Text from a Section or Paragraph in a Word Document
Extract Text from Paragraphs that Use Specific Styles in a Word Document
Extract Text from a Table in a Word Document
Extract Text from a Bookmark in a Word Document

Add Dependencies

If you are using Maven, you can add Free Spire.Doc for Java to your application by adding the following configuration to your project's pom.xml file.

<repositories>
    <repository>
        <id>com.e-iceblue</id>
        <name>e-iceblue</name>
        <url>https://repo.e-iceblue.com/nexus/content/groups/public/</url>
    </repository>
</repositories>
<dependencies>
    <dependency>
        <groupId>e-iceblue</groupId>
        <artifactId>spire.doc.free</artifactId>
        <version>5.2.0</version>
    </dependency>
</dependencies>

If you are not using Maven, you can download Free Spire.Doc for Java from the official website, extract the zip file, and then import the Spire.Doc.jar file from the lib folder into your project as a dependency.

Extract Text from a Whole Word Document in Java

Extracting text from a whole Word document takes just four steps:

Initialize an instance of the Document class.
Load a Word document using the Document.loadFromFile() method.
Get the text of the document using the Document.getText() method.
Write the text to a .txt file.

import com.spire.doc.Document;

import java.io.File;
import java.io.FileWriter;

public class ExtractTextFromDocument {
    public static void main(String []args) throws Exception {
        //Initialize an instance of the Document class
        Document document = new Document();
        //Load a Word document
        document.loadFromFile("Input.docx");

        //Get text from the whole document
        String content = document.getText();

        //Initialize an instance of the File class
        File output = new File("Document.txt");
        //Initialize an instance of the FileWriter class
        FileWriter writer = new FileWriter(output);
        //Write the text into a .txt file
        writer.write(content);
        writer.flush();
        writer.close();
    }
}
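
The example above writes the output with FileWriter, which relies on the JVM's default charset and is not closed if write() throws. As a minimal sketch of an alternative for the final step, the helper below writes a string as UTF-8 inside a try-with-resources block using only the standard library. The names TextFileUtil and writeUtf8 are hypothetical and are not part of Free Spire.Doc for Java.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

//Hypothetical helper for illustration, not part of Free Spire.Doc for Java
class TextFileUtil {
    //Write the text as UTF-8, replacing any existing file at the given path
    static void writeUtf8(Path path, String text) throws IOException {
        //try-with-resources flushes and closes the writer even if write() throws
        try (BufferedWriter writer = Files.newBufferedWriter(path, StandardCharsets.UTF_8)) {
            writer.write(text);
        }
    }
}
```

With such a helper, the File and FileWriter lines in the example could be replaced by a single call such as TextFileUtil.writeUtf8(Paths.get("Document.txt"), content).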


Extract Text from a Section or Paragraph in a Word Document in Java

A Word document can contain one or more sections, and a section can contain one or more paragraphs.

You can extract text from a specific paragraph in a section, or extract text from a section by iterating through all paragraphs in it and then extracting text from them.

The following steps show you how to extract text from a specific paragraph in a section:

Initialize an instance of the Document class.
Load a Word document using the Document.loadFromFile() method.
Get the desired section by its index using the Document.getSections().get(int) method.
Get the desired paragraph in the section by its index using the Section.getParagraphs().get(int) method.
Get the text of the paragraph using the Paragraph.getText() method.
Write the text to a .txt file.

import com.spire.doc.Document;
import com.spire.doc.Section;
import com.spire.doc.documents.Paragraph;

import java.io.File;
import java.io.FileWriter;

public class ExtractTextFromParagraph {
    public static void main(String []args) throws Exception {
        //Initialize an instance of the Document class
        Document document = new Document();
        //Load a Word document
        document.loadFromFile("Input.docx");

        //Get the first section
        Section section = document.getSections().get(0);

        //Get the second paragraph in the section
        Paragraph paragraph = section.getParagraphs().get(1);
        //Get the text of the paragraph
        String text = paragraph.getText();

        //Initialize an instance of the File class
        File output = new File("Paragraphs.txt");
        //Initialize an instance of the FileWriter class
        FileWriter writer = new FileWriter(output);
        //Write the text into a .txt file
        writer.write(text);
        writer.flush();
        writer.close();
    }
}
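
The example above handles a single paragraph. Extracting the text of a whole section, as mentioned earlier, means visiting every paragraph by index and collecting its text. The sketch below shows that loop in a library-independent form: SectionTextJoiner and joinByIndex are hypothetical names, and the array in main is only a stand-in for the paragraphs of a section.

```java
import java.util.function.IntFunction;

//Hypothetical helper for illustration, not part of Free Spire.Doc for Java
class SectionTextJoiner {
    //Collect the text of items 0..count-1, one item per line
    static String joinByIndex(int count, IntFunction<String> textAt) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < count; i++) {
            sb.append(textAt.apply(i)).append(System.lineSeparator());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        //Stand-in for the paragraphs of one section
        String[] paragraphs = {"First paragraph", "Second paragraph"};
        System.out.print(joinByIndex(paragraphs.length, i -> paragraphs[i]));
    }
}
```

Assuming the same accessors used in this article, the call for a real section would look like joinByIndex(section.getParagraphs().getCount(), i -> section.getParagraphs().get(i).getText()).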


Extract Text from Paragraphs that Use Specific Styles in a Word Document in Java

Paragraphs in a Word document can use different styles, such as Heading 1, Heading 2, Heading 3, or a custom style.

Free Spire.Doc for Java provides the ability to extract text from paragraphs that use specific styles in a Word document. The following are the main steps to do so:

Initialize an instance of the Document class.
Load a Word document using the Document.loadFromFile() method.
Initialize an instance of the StringBuilder class.
Iterate through all sections in the document.
Iterate through all paragraphs in each section.
Check whether the current paragraph uses a specific style by comparing the value returned by the Paragraph.getStyleName() method with the desired style name.
Get the text of the paragraph using the Paragraph.getText() method.
Append the text to the StringBuilder.
Write the text in the StringBuilder to a .txt file.

import com.spire.doc.Document;
import com.spire.doc.documents.Paragraph;

import java.io.File;
import java.io.FileWriter;

public class ExtractTextFromParagraphsWithSpecificStyles {
    public static void main(String []args) throws Exception {
        //Initialize an instance of the Document class
        Document document = new Document();
        //Load a Word document
        document.loadFromFile("Input.docx");

        //Initialize an instance of the StringBuilder class
        StringBuilder sb = new StringBuilder();

        //Loop through all sections in the document
        for (int i = 0; i < document.getSections().getCount(); i++) {
            //Loop through the paragraphs in each section
            for (int j = 0; j < document.getSections().get(i).getParagraphs().getCount(); j++) {
                //Get the current paragraph
                Paragraph paragraph = document.getSections().get(i).getParagraphs().get(j);
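                //Check if the paragraph uses the Heading 1 style, matched here by the style name "Heading1"
                //Putting the constant first avoids a NullPointerException if no style name is returned
                if ("Heading1".equals(paragraph.getStyleName())) {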
                    //Get the text of the paragraph
                    String text = paragraph.getText();
                    //Save the text into the StringBuilder
                    sb.append(text + "\n");
                }
            }
        }

        //Initialize an instance of the File class
        File output = new File("ParagraphsWithStyles.txt");
        //Initialize an instance of the FileWriter class
        FileWriter writer = new FileWriter(output);
        //Write the text in the StringBuilder into a .txt file
        writer.write(sb.toString());
        writer.flush();
        writer.close();
    }
}
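
The example above matches a single style name. To collect paragraphs that use any of several styles, the comparison can be replaced by a set lookup. The following is a minimal sketch using only the standard library; StyleNameFilter is a hypothetical class, and style names other than "Heading1" (for example "Heading2") are assumptions that should be checked against the names your document actually reports.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

//Hypothetical helper for illustration, not part of Free Spire.Doc for Java
class StyleNameFilter {
    private final Set<String> wanted;

    //Accept any number of style names to match
    StyleNameFilter(String... styleNames) {
        this.wanted = new HashSet<>(Arrays.asList(styleNames));
    }

    //Return true if the style name is one of the wanted names; null never matches
    boolean matches(String styleName) {
        return styleName != null && wanted.contains(styleName);
    }
}
```

Inside the paragraph loop, the condition would then read headings.matches(paragraph.getStyleName()), where headings is created once before the loop with new StyleNameFilter("Heading1", "Heading2").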


Extract Text from a Table in a Word Document in Java

A table is made up of cells. To extract text from a table, you need to access the cells in the table and then get the text from them. The following are the detailed steps:

Initialize an instance of the Document class.
Load a Word document using the Document.loadFromFile() method.
Get the desired section by its index using the Document.getSections().get(int) method.
Get the desired table in the section by its index using the Section.getTables().get(int) method.
Initialize an instance of the StringBuilder class.
Iterate through the rows in the table.
Iterate through the cells in each row.
Iterate through the paragraphs in each cell.
Get the text of each paragraph using the Paragraph.getText() method and append the result to the StringBuilder.
Write the text in the StringBuilder to a .txt file.

import com.spire.doc.Document;
import com.spire.doc.Section;
import com.spire.doc.TableCell;
import com.spire.doc.TableRow;
import com.spire.doc.documents.Paragraph;
import com.spire.doc.interfaces.ITable;

import java.io.File;
import java.io.FileWriter;

public class ExtractTextFromTable {
    public static void main(String []args) throws Exception {
        //Initialize an instance of the Document class
        Document document = new Document();
        //Load a Word document
        document.loadFromFile("Table.docx");

        //Get the first section
        Section section = document.getSections().get(0);

        //Get the first table in the first section
        ITable table = section.getTables().get(0);

        //Initialize an instance of the StringBuilder class
        StringBuilder sb = new StringBuilder();

        //Iterate through the rows in the table
        for (int i = 0; i < table.getRows().getCount(); i++) {
            TableRow row = table.getRows().get(i);
            //Iterate through the cells in each row
            for (int j = 0; j < row.getCells().getCount(); j++) {
                TableCell cell = row.getCells().get(j);
                //Iterate through the paragraphs in each cell
                for (int k = 0; k < cell.getParagraphs().getCount(); k++) {
                    //Extract text from each paragraph
                    Paragraph paragraph = cell.getParagraphs().get(k);
                    String text = paragraph.getText();
                    //Append the text to the StringBuilder
                    sb.append(text+ "\t");
                }
            }
            sb.append("\r\n");
        }

        //Initialize an instance of the File class
        File output = new File("Table.txt");
        //Initialize an instance of the FileWriter class
        FileWriter writer = new FileWriter(output);
        //Write the text in the StringBuilder into a .txt file
        writer.write(sb.toString());
        writer.flush();
        writer.close();
    }
}
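
The example above appends a tab after every paragraph, so each output row ends with a trailing tab and a cell that holds several paragraphs spreads over several columns. If the output is meant to be read as tab-separated text, one option is to join the paragraphs of a cell first and then join the cells. The sketch below illustrates this with plain lists standing in for rows, cells, and paragraphs. TableTextFormatter is a hypothetical name, and joining paragraphs with a space is a design assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

//Hypothetical helper for illustration, not part of Free Spire.Doc for Java
class TableTextFormatter {
    //Input shape: rows -> cells -> paragraph texts
    static String format(List<List<List<String>>> rows) {
        StringBuilder sb = new StringBuilder();
        for (List<List<String>> row : rows) {
            List<String> cells = new ArrayList<>();
            for (List<String> paragraphs : row) {
                //Join the paragraphs of one cell with a space so the cell stays in one column
                cells.add(String.join(" ", paragraphs));
            }
            //Tabs go between cells only, so there is no trailing tab
            sb.append(String.join("\t", cells)).append("\r\n");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<List<List<String>>> rows = Arrays.asList(
            Arrays.asList(Arrays.asList("Name"), Arrays.asList("Notes")),
            Arrays.asList(Arrays.asList("Alice"), Arrays.asList("Line one", "Line two")));
        System.out.print(format(rows));
    }
}
```

In the Spire-based loop, the same effect could be achieved by collecting each cell's paragraph texts into a list before appending anything to the StringBuilder.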


Extract Text from a Bookmark in a Word Document in Java

In Word, text can be bookmarked to enable readers to quickly navigate to its location.

You can retrieve the text of a specific bookmark in a Word document by following the steps below:

Initialize an instance of the Document class.
Load a Word document using the Document.loadFromFile() method.
Initialize an instance of the BookmarksNavigator class.
Find the specific bookmark by its name using the BookmarksNavigator.moveToBookmark(String) method.
Get the content of the bookmark using the BookmarksNavigator.getBookmarkContent() method.
Initialize an instance of the StringBuilder class.
Iterate through the items in the bookmark content.
Check whether the current item is of Paragraph type.
Iterate through the child objects in the paragraph.
Check whether the current child object is of TextRange type.
Get the text of the text range using the TextRange.getText() method and append the result to the StringBuilder.
Write the text in the StringBuilder to a .txt file.

import com.spire.doc.Document;
import com.spire.doc.documents.BookmarksNavigator;
import com.spire.doc.documents.Paragraph;
import com.spire.doc.documents.TextBodyPart;
import com.spire.doc.fields.TextRange;

import java.io.File;
import java.io.FileWriter;

public class ExtractTextFromBookmark {
    public static void main(String []args) throws Exception {
        //Initialize an instance of the Document class
        Document document = new Document();
        //Load a Word document
        document.loadFromFile("Bookmark.docx");
        
        //Initialize an instance of the BookmarksNavigator class
        BookmarksNavigator navigator = new BookmarksNavigator(document);
        //Find the specific bookmark by its name
        navigator.moveToBookmark("MyFirstBookmark");
        //Get the content of the bookmark
        TextBodyPart textBodyPart = navigator.getBookmarkContent();

        //Initialize an instance of the StringBuilder class
        StringBuilder sb = new StringBuilder();

        //Iterate through the items in the bookmark content
        for (Object item : textBodyPart.getBodyItems()) {
            //Check if the current item is of Paragraph type
            if ((item instanceof Paragraph)) {
                //Iterate through the child objects in the paragraph
                for (Object childObject : ((Paragraph)(item)).getChildObjects()) {
                    //Check if the current child object is of TextRange type
                    if ((childObject instanceof TextRange)) {
                        //Get the text of the text range and save the results into the StringBuilder
                        TextRange range = ((TextRange)(childObject));
                        sb.append(range.getText() + "\n");
                    }
                }
            }
        }

        //Initialize an instance of the File class
        File output = new File("Bookmark.txt");
        //Initialize an instance of the FileWriter class
        FileWriter writer = new FileWriter(output);
        //Write the text in the StringBuilder into a .txt file
        writer.write(sb.toString());
        writer.flush();
        writer.close();
    }
}
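
The example above appends a line break after every text range. If a bookmarked paragraph is made up of several text ranges (for instance where the formatting changes mid-sentence, which is an assumption about how the document was authored), its sentence ends up split across lines. The sketch below shows the alternative of concatenating the ranges of each paragraph and adding one line break per paragraph. It uses plain lists as stand-ins, and RunJoiner is a hypothetical name.

```java
import java.util.Arrays;
import java.util.List;

//Hypothetical helper for illustration, not part of Free Spire.Doc for Java
class RunJoiner {
    //Input shape: paragraphs -> text of each text range in that paragraph
    static String joinRuns(List<List<String>> paragraphs) {
        StringBuilder sb = new StringBuilder();
        for (List<String> runs : paragraphs) {
            for (String run : runs) {
                //No separator between the text ranges of one paragraph
                sb.append(run);
            }
            //One line break per paragraph
            sb.append("\n");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<List<String>> paragraphs = Arrays.asList(
            Arrays.asList("Hello, ", "world"),
            Arrays.asList("Second paragraph"));
        System.out.print(joinRuns(paragraphs));
    }
}
```

In the Spire-based loop, this corresponds to appending range.getText() without "\n" and appending the line break once after the inner loop finishes.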
