How to Extract Text and Images from a PDF File

PDF (Portable Document Format) is a widely used file format for documents, eBooks, and other files. However, extracting text and images from a PDF can be challenging, especially with large files or complex document layouts. In this tutorial, we’ll show you how to use Python to extract text and images from a PDF file.

Find the GitHub repository of this tutorial at https://github.com/shamim-akhtar/extract-pdf-text-images.

Prerequisites

Before we start, you’ll need to install Python on your computer. If you haven’t installed Python, download it from the official website (https://www.python.org/downloads/). You’ll also need to install the following libraries:

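  • PyPDF2: A library for working with PDF files.
  • PyMuPDF: A Python binding for the MuPDF PDF rendering library. It is imported under the name fitz.

You can install both libraries using pip in your command prompt or terminal. Note that the package to install is PyMuPDF; the package named fitz on PyPI is an unrelated project:

pip install PyPDF2 PyMuPDF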
Code language: Shell Session (shell)
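Before running the examples, it can help to confirm that both modules are importable. The following is a minimal sketch using only the standard library; the helper name check_modules is hypothetical, and it assumes the modules are imported as PyPDF2 and fitz, as in the rest of this tutorial.

```python
import importlib.util


def check_modules(names):
    """Return a dict mapping each module name to True if it can be imported."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


# PyPDF2 and fitz are the module names imported later in this tutorial.
status = check_modules(["PyPDF2", "fitz"])
for name, available in status.items():
    print(f"{name}: {'found' if available else 'missing'}")
```

If either module is reported as missing, rerun the pip command above in the same Python environment.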

Extracting Text from a PDF

Let’s start by extracting the text from a PDF file using Python. Here’s the code:

import PyPDF2

# Set the input and output paths
input_path = 'C:/path/to/input.pdf'
output_path = 'C:/path/to/output.txt'

# Open the PDF file
with open(input_path, 'rb') as pdf_file:
    # Read the PDF file (PdfReader replaces the older PdfFileReader class)
    pdf_reader = PyPDF2.PdfReader(pdf_file)

    # Extract text from all pages
    text = ''
    for page in pdf_reader.pages:
        # Guard against pages that yield no text, such as scanned images
        text += page.extract_text() or ''

# Save the text to a file
with open(output_path, 'w', encoding='utf-8') as output_file:
    output_file.write(text)
Code language: Python (python)

Let’s go over the code step by step:

  1. First, we import the PyPDF2 library.
  2. We set the input and output paths. Set the input_path variable to the path of the PDF file from which you want to extract text. Set the output_path variable to the path of the file where you want to save the extracted text.
  3. We open the PDF file using Python’s open function and the 'rb' mode, which stands for read binary.
  4. We create a PdfReader object using PyPDF2. This object represents the PDF file that we just opened. Older PyPDF2 releases called this class PdfFileReader; that name was removed in PyPDF2 3.0.
  5. We use a for loop to iterate over all the pages in the PDF file.
  6. For each page, we use the extract_text method to extract the text.
  7. We append the extracted text to the text variable.
  8. After all the pages have been processed, we open the output file using Python’s open function and the 'w' mode, which stands for write.
  9. We write the text variable to the output file using the write method.

That’s it! The extracted text should now be saved to the file specified by the output_path variable.
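If you need to know which page each piece of text came from, the same loop can collect the text per page instead of concatenating it. The sketch below is one possible variation, not part of the original script. Both function names are hypothetical, and collect_page_texts assumes only a reader object with a pages sequence whose items offer extract_text(), as the PyPDF2 reader above does.

```python
def collect_page_texts(pdf_reader):
    """Return a list with one string per page, in page order."""
    texts = []
    for page in pdf_reader.pages:
        # Pages without a text layer (for example, scans) may yield nothing.
        texts.append(page.extract_text() or "")
    return texts


def format_with_page_markers(texts):
    """Join page texts, prefixing each with a '--- Page N ---' marker line."""
    parts = []
    for number, text in enumerate(texts, start=1):
        parts.append(f"--- Page {number} ---\n{text}")
    return "\n\n".join(parts)
```

Called inside the with block while the PDF file is still open, format_with_page_markers(collect_page_texts(pdf_reader)) would produce the string to write to output_path.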

Extracting Images from a PDF

Next, let’s look at how to extract images from a PDF file using Python. Here’s the code:

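import fitz  # provided by the PyMuPDF package
import os

# Set the input and output paths
input_path = 'C:/path/to/input.pdf'
output_path = 'C:/path/to/images'

# Open the PDF file with fitz
pdf_doc = fitz.open(input_path)

# Get a list of images on the first page of the PDF
first_page_images = pdf_doc.get_page_images(0)
print(f'Images detected on the first page: {len(first_page_images)}')

# Make sure the output folder exists
os.makedirs(output_path, exist_ok=True)

# Save each image to a file
for index, image in enumerate(first_page_images, start=1):
    # The first item of each entry is the image's cross-reference number (xref)
    xref = image[0]
    image_info = pdf_doc.extract_image(xref)
    # The file naming scheme here is an assumption; 'ext' is the format PyMuPDF reports
    image_path = os.path.join(output_path, f"image_{index}.{image_info['ext']}")
    with open(image_path, 'wb') as image_file:
        image_file.write(image_info['image'])

pdf_doc.close()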
Code language: Python (python)
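Each entry returned by get_page_images is a tuple describing one image. In PyMuPDF, its first four items are the image's xref, the xref of its soft mask, its width, and its height. The sketch below shows how those fields could be used to skip tiny images such as icons or bullets. The function name and the 32-pixel threshold are assumptions for illustration, and the sample tuples are hand-written stand-ins rather than real output.

```python
def filter_small_images(page_images, min_side=32):
    """Keep only entries whose width and height are both at least min_side pixels."""
    kept = []
    for entry in page_images:
        # Only the first four items are used: xref, smask, width, height.
        xref, _smask, width, height = entry[:4]
        if width >= min_side and height >= min_side:
            kept.append(entry)
    return kept


# Example with hand-written tuples shaped like get_page_images entries.
sample = [
    (12, 0, 640, 480, 8, "DeviceRGB", "", "Im1", "DCTDecode"),
    (15, 0, 16, 16, 8, "DeviceRGB", "", "Im2", "FlateDecode"),
]
print([entry[0] for entry in filter_small_images(sample)])  # → [12]
```

Passing the filtered list to the saving loop instead of the full list would write only the larger images.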

Code Explanation

The code is a Python script (Jupyter Notebook) that extracts text and images from a PDF file. It imports three libraries: fitz, os, and PyPDF2. It sets the input and output paths, opens the PDF file with PyPDF2, extracts the text from its pages, and saves that text to the specified output file. It then opens the PDF file again using fitz, gets a list of the images on the first page, saves each image to a file in the specified output folder, and prints the number of images detected on the first page.

To use the code, set the input_path variable to the path of the PDF file from which you want to extract text and images. Set the output_path variable to the location where you want the output saved: a text file for the text script and a folder for the image script. Running the script then extracts the text and images from the PDF and saves them at those locations.
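Rather than editing two paths by hand, the output locations could be derived from the input path. This is a small sketch using pathlib. The helper name and the naming scheme (a .txt file and an _images folder next to the PDF) are assumptions, not something the original script does.

```python
from pathlib import Path


def derive_output_paths(input_path):
    """Return (text_file, image_folder) placed next to the given PDF."""
    pdf_path = Path(input_path)
    # Same folder and base name as the PDF, with a .txt extension.
    text_file = pdf_path.with_suffix(".txt")
    # Same folder, with "_images" appended to the PDF's base name.
    image_folder = pdf_path.with_name(pdf_path.stem + "_images")
    return text_file, image_folder


text_file, image_folder = derive_output_paths("reports/input.pdf")
print(text_file)
print(image_folder)
```

The returned Path objects can be passed directly to open and os.makedirs.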

Note that the image-extraction code only handles the first page of the PDF. To extract images from other pages, you must modify the code accordingly.
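One way to cover the remaining pages is to loop over every page index and reuse the same two calls, get_page_images and extract_image. The following is a sketch under stated assumptions rather than the tutorial's own code. The function name and the file naming scheme are hypothetical, and pdf_doc is expected to behave like the document returned by fitz.open.

```python
import os


def save_all_images(pdf_doc, output_dir):
    """Save every distinct image in pdf_doc to output_dir; return the saved paths."""
    os.makedirs(output_dir, exist_ok=True)
    seen_xrefs = set()
    saved = []
    for page_index in range(len(pdf_doc)):
        for entry in pdf_doc.get_page_images(page_index):
            xref = entry[0]
            if xref in seen_xrefs:
                continue  # the same image can be referenced from several pages
            seen_xrefs.add(xref)
            info = pdf_doc.extract_image(xref)
            # Hypothetical naming scheme: page number, xref, and reported extension.
            name = f"page{page_index + 1}_xref{xref}.{info['ext']}"
            path = os.path.join(output_dir, name)
            with open(path, "wb") as image_file:
                image_file.write(info["image"])
            saved.append(path)
    return saved
```

With PyMuPDF installed, save_all_images(fitz.open(input_path), output_path) would process the whole document. Tracking the xrefs already seen avoids writing the same image twice when it appears on several pages, for example a logo in a page header.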
