Extract Text from Documents using Python (with and without AI)

Varun Setia
Sep 16

Introduction

In this article, we will look at how to use a Python package to extract text from different types of documents. Extracting text is one of the most common requirements when working with files such as Word reports, PowerPoint slides, PDFs, or even emails. Traditionally, each file type needed its own library or custom parsing logic, which quickly became complicated and time-consuming.

With the MarkItDown Python package, this process becomes much easier. It provides a single, consistent interface for multiple document formats, so developers can read files and convert them into structured text (Markdown) with the same call. This makes it useful for building applications such as document search, summarization tools, or data preprocessing pipelines.

We will start by learning how to set up the Python environment, install the required package, and then move on to extracting text from documents without using AI. Later, we will also explore how to enhance extraction for formats like images using AI assistance. By the end of this walkthrough, you will have a clear idea of how to integrate MarkItDown into your workflow for handling different file formats in Python.

Code Walkthrough

Set up a Python environment and download packages

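Set up a virtual environment (the activation command below is for Windows; on macOS or Linux, run source venv/bin/activate instead)

python -m venv venv
venv\Scripts\activate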

Download packages

pip install "markitdown[all]"
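
Before running the scripts below, it can help to confirm that the virtual environment is active and that the package is importable. The following is a minimal sketch using only the standard library; the helper names in_virtualenv and is_installed are illustrative and are not part of MarkItDown.

```python
import importlib.util
import sys


def in_virtualenv() -> bool:
    """Return True when the interpreter runs inside a venv-style environment."""
    # Inside a venv, sys.prefix points at the environment, while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


def is_installed(module_name: str) -> bool:
    """Return True when a top-level module can be located without importing it."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    print("Inside a virtual environment:", in_virtualenv())
    print("markitdown importable:", is_installed("markitdown"))
```

If the second line reports False, the install step above most likely ran against a different interpreter than the one executing the script.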

Extracting text from documents without AI

Create a file named extractor_simple_main.py

In this file, we will read the text content of Word (docx), PowerPoint (pptx), and email (eml and msg) files.

from markitdown import MarkItDown


def print_prev(text: str) -> None:
    """Print the first 50 characters of the text, followed by an ellipsis."""
    print(text[:50] + "...")


md = MarkItDown(enable_plugins=False)  # Set to True to enable plugins

# Read a Word file
result = md.convert("2024_Annual_Report.docx")
print("Doc file text")
print_prev(result.text_content)

# Read a PowerPoint file
result = md.convert("SlidesFY24Q3.pptx")
print("PPT file text")
print_prev(result.text_content)

# Read an email in .eml format
result = md.convert("Congratulations_sample.eml")
print("Email file text")
print_prev(result.text_content)

# Read an email in .msg format
result = md.convert("Congratulations_sample2.msg")
print("Email file text")
print_prev(result.text_content)
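
The four conversions above repeat the same steps, so they can be folded into a loop. Here is a minimal sketch, assuming you want the script to keep going when one file is missing or cannot be converted. The helper preview_files is hypothetical; it takes the conversion function as a parameter, so it works with md.convert but can also be tried with a stand-in.

```python
from typing import Callable, Dict, Iterable


def preview_files(
    paths: Iterable[str],
    convert: Callable[[str], object],
    length: int = 50,
) -> Dict[str, str]:
    """Map each path to a short preview of its text, or to an error note."""
    previews: Dict[str, str] = {}
    for path in paths:
        try:
            text = convert(path).text_content
        except Exception as exc:  # record the failure and continue
            previews[path] = f"[failed: {exc}]"
            continue
        # Unlike print_prev, add the ellipsis only when text was cut off.
        previews[path] = text[:length] + ("..." if len(text) > length else "")
    return previews


def main() -> None:
    # Imported here so preview_files itself has no third-party dependency.
    from markitdown import MarkItDown

    md = MarkItDown(enable_plugins=False)
    files = ["2024_Annual_Report.docx", "SlidesFY24Q3.pptx"]
    for name, preview in preview_files(files, md.convert).items():
        print(name, "->", preview)
```

Calling main() at the end of the script runs the loop against the sample files used in this walkthrough.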

Run command

python extractor_simple_main.py

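Output

For each file, the script prints its label followed by the first 50 characters of the extracted text and a trailing ellipsis.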

Now, can we extract image content directly?

from markitdown import MarkItDown

md = MarkItDown(enable_plugins=False) # Set to True to enable plugins
result = md.convert("run_img.png")
print(result.text_content)

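Not directly!

The code above does not print any text from the image. Without an LLM client, MarkItDown has nothing to describe the image with, so the result is empty or, at most, limited to image metadata.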
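
Because a conversion can come back with little or no text, it is worth checking the result before passing it further along. The following is a small sketch; has_usable_text and its min_chars threshold are hypothetical and chosen only for illustration.

```python
from typing import Optional


def has_usable_text(text_content: Optional[str], min_chars: int = 1) -> bool:
    """Return True when the text has at least min_chars non-whitespace characters."""
    if not text_content:
        return False
    # Drop all whitespace so blank lines and padding do not count as content.
    visible = "".join(text_content.split())
    return len(visible) >= min_chars


if __name__ == "__main__":
    for sample in ["", "   \n", "Quarterly results"]:
        print(repr(sample), "->", has_usable_text(sample))
```

In the image example, this check could be applied to result.text_content to decide whether the file needs the AI-assisted path described next.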

Now, let's try to extract text from the image with AI assistance

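import os

from markitdown import MarkItDown
from openai import OpenAI

# Read the API key from an environment variable instead of hard-coding it
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# The LLM client and model are used to describe the image content
md = MarkItDown(llm_client=client, llm_model="gpt-5-nano")
result = md.convert("run_img.png")
print(result.text_content)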
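
Keeping the key in an environment variable also makes it easy to fall back to plain extraction when no key is configured. This is a minimal sketch, assuming the variable is named OPENAI_API_KEY. The helpers read_api_key and build_converter are hypothetical; only the MarkItDown and OpenAI constructor arguments already shown in this article are used.

```python
import os
from typing import Mapping, Optional


def read_api_key(
    env: Optional[Mapping[str, str]] = None,
    name: str = "OPENAI_API_KEY",
) -> Optional[str]:
    """Return the API key from the environment, or None if it is unset or blank."""
    source = os.environ if env is None else env
    key = source.get(name, "").strip()
    return key or None


def build_converter(api_key: Optional[str], model: str = "gpt-5-nano"):
    """Build an AI-assisted converter when a key exists, a plain one otherwise."""
    # Imported lazily so read_api_key can be used without these packages.
    from markitdown import MarkItDown

    if api_key is None:
        return MarkItDown(enable_plugins=False)

    from openai import OpenAI

    return MarkItDown(llm_client=OpenAI(api_key=api_key), llm_model=model)
```

A script could then call build_converter(read_api_key()) and use the returned object's convert method exactly as in the examples above.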

This library can read the contents of several formats:

PDF
PowerPoint
Word
Excel
Images (EXIF metadata and OCR)
Audio (EXIF metadata and speech transcription)
HTML
Text-based formats (CSV, JSON, XML)
ZIP files (iterates over contents)
YouTube URLs
EPubs
Emails
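
When a folder contains mixed files, it can be handy to label each one with the format family from the list above before converting it. The sketch below does that by file extension. The extension map is an assumption made for illustration: it is not exhaustive, and it is not how MarkItDown itself detects formats.

```python
from pathlib import Path
from typing import Optional

# Hypothetical mapping from common extensions to the families listed above.
# YouTube URLs are omitted because they are not files on disk.
FORMAT_FAMILIES = {
    ".pdf": "PDF",
    ".pptx": "PowerPoint",
    ".docx": "Word",
    ".xlsx": "Excel",
    ".png": "Images",
    ".jpg": "Images",
    ".jpeg": "Images",
    ".mp3": "Audio",
    ".wav": "Audio",
    ".html": "HTML",
    ".htm": "HTML",
    ".csv": "Text-based formats",
    ".json": "Text-based formats",
    ".xml": "Text-based formats",
    ".zip": "ZIP files",
    ".epub": "EPubs",
    ".eml": "Emails",
    ".msg": "Emails",
}


def format_family(path: str) -> Optional[str]:
    """Return the format family for a path, or None for unknown extensions."""
    return FORMAT_FAMILIES.get(Path(path).suffix.lower())


if __name__ == "__main__":
    for name in ["2024_Annual_Report.docx", "run_img.png", "notes.unknown"]:
        print(name, "->", format_family(name))
```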

Currently, the library's LLM integration supports OpenAI, and support for other LLMs is expected to follow.

Thanks for reading till the end! Hope this was informative.