PDF (Portable Document Format) is a widely used file format for documents, eBooks, and other files. However, extracting text and images from a PDF can be challenging, especially when dealing with large files or complex document layouts. In this tutorial, we’ll show you how to use Python to extract text and images from a PDF file.
Find the GitHub repository of this tutorial at https://github.com/shamim-akhtar/extract-pdf-text-images.
Prerequisites
Before we start, you’ll need to install Python on your computer. If you haven’t installed Python, download it from the official website (https://www.python.org/downloads/). You’ll also need to install the following libraries:
- PyPDF2: A library for working with PDF files.
- PyMuPDF: Python bindings for the MuPDF PDF rendering library. It is imported under the name fitz.
You can install both libraries using pip in your command prompt or terminal:
pip install PyPDF2 PyMuPDF
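Before running the examples, it can help to confirm that both libraries are importable. The following is a minimal sketch that uses only the standard library. The helper name missing_modules is hypothetical, and the sketch assumes the two libraries are imported under the module names PyPDF2 and fitz, as in this tutorial:

```python
import importlib.util


def missing_modules(module_names):
    """Return the names from module_names that cannot be found for import."""
    return [
        name for name in module_names
        if importlib.util.find_spec(name) is None
    ]


if __name__ == "__main__":
    # PyPDF2 and fitz are the import names used in this tutorial
    missing = missing_modules(["PyPDF2", "fitz"])
    if missing:
        print("Not installed:", ", ".join(missing))
    else:
        print("All required libraries are available.")
```

If the script reports a missing name, rerun the pip command above before continuing.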
Extracting Text from a PDF
Let’s start by extracting the text from a PDF file using Python. Here’s the code:
import PyPDF2
import os
# Set the input and output paths
input_path = 'C:/path/to/input.pdf'
output_path = 'C:/path/to/output.txt'
# Open the PDF file
with open(input_path, 'rb') as pdf_file:
    # Read the PDF file (PdfReader replaces PdfFileReader, which PyPDF2 3.0 removed)
    pdf_reader = PyPDF2.PdfReader(pdf_file)
    # Extract text from all pages
    text = ''
    for page in pdf_reader.pages:
        text += page.extract_text()
# Save the text to a file
with open(output_path, 'w', encoding='utf-8') as output_file:
    output_file.write(text)
Let’s go over the code step by step:
- First, we import the PyPDF2 and os libraries.
- We set the input and output paths. Replace the input_path variable with the path of the PDF file that you want to extract text from. Replace the output_path variable with the file path where you want to save the extracted text.
- We open the PDF file using Python’s open function and the 'rb' mode, which stands for read binary.
- We create a PdfReader object using PyPDF2 (this class was called PdfFileReader in older PyPDF2 releases). This object represents the PDF file that we just opened.
- We use a for loop to iterate over all the pages in the PDF file.
- For each page, we use the extract_text method to extract the text.
- We append the extracted text to the text variable.
- After all the pages have been processed, we save the extracted text to a file using Python’s open function and the 'w' mode, which stands for write.
- We write the text variable to the output file using the write method.
That’s it! The extracted text should now be saved to the file specified by the output_path variable.
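The loop above joins pages with nothing between them, and pages without a text layer (scanned pages, for example) may contribute no text at all. The sketch below shows one possible way to handle both cases. The function names join_page_text and extract_pdf_text are hypothetical, and the blank-line separator is an assumption rather than part of the original script:

```python
def join_page_text(pages, separator="\n\n"):
    """Concatenate text from page objects, skipping pages that yield no text.

    Each item in pages only needs an extract_text() method, which is what
    the page objects in PyPDF2 provide.
    """
    chunks = []
    for page in pages:
        page_text = page.extract_text() or ""  # treat None as empty text
        if page_text.strip():
            chunks.append(page_text.strip())
    return separator.join(chunks)


def extract_pdf_text(input_path):
    """Read a PDF with PyPDF2 and return the text of all pages."""
    import PyPDF2  # imported here so join_page_text works without PyPDF2

    with open(input_path, "rb") as pdf_file:
        pdf_reader = PyPDF2.PdfReader(pdf_file)
        return join_page_text(pdf_reader.pages)
```

Calling extract_pdf_text with the same input_path as above returns a single string that can be written to output_path exactly as before.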
Extracting Images from a PDF
Next, let’s look at how to extract images from a PDF file using Python. Here’s the code:
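import fitz  # provided by the PyMuPDF package
import os
# Set the input and output paths
input_path = 'C:/path/to/input.pdf'
output_path = 'C:/path/to/images'
# Create the output folder if it does not exist
os.makedirs(output_path, exist_ok=True)
# Open the PDF file with fitz
pdf_doc = fitz.open(input_path)
# Get a list of images on the first page of the PDF
first_page_images = pdf_doc.get_page_images(0)
# Save each image to a file
for index, image in enumerate(first_page_images, start=1):
    xref = image[0]  # the first item of each entry is the image's xref number
    image_data = pdf_doc.extract_image(xref)
    image_path = os.path.join(output_path, f"image_{index}.{image_data['ext']}")
    with open(image_path, 'wb') as image_file:
        image_file.write(image_data['image'])
# Print the number of images detected on the first page
print(f'Found {len(first_page_images)} image(s) on the first page')
pdf_doc.close()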
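Each entry returned by get_page_images is a tuple describing one image. As the PyMuPDF documentation describes it, the first item is the image's xref number and the third and fourth items are its width and height in pixels. The sketch below unpacks those fields. The helper name summarize_image_entry and the sample tuple are made up for illustration only:

```python
def summarize_image_entry(entry):
    """Pick the xref, width, and height out of one get_page_images entry.

    Assumes the entry starts with (xref, smask, width, height, ...).
    """
    xref, _smask, width, height = entry[:4]
    return {"xref": xref, "width": width, "height": height}


# A made-up entry shaped like the tuples PyMuPDF returns
sample_entry = (12, 0, 640, 480, 8, "DeviceRGB", "", "Im1", "DCTDecode")
summary = summarize_image_entry(sample_entry)
print(summary)
```

A summary like this can be used to skip very small images, such as icons or separators, before saving anything to disk.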
Code Explanation
The code is a Python script (Jupyter Notebook) that extracts text and images from a PDF file. It imports three libraries: fitz, os, and PyPDF2. It sets the input and output paths, opens the PDF file with PyPDF2, extracts the text from every page, and saves that text to the output file. The script then opens the PDF file again using fitz, gets a list of the images on the first page, saves each image to a file in the specified output folder, and prints the number of images detected on that page.
To use the code, set the input_path variable to the path of the PDF file you want to extract text and images from, and set the output_path variable to the location where you want the output saved (a text file for the extracted text, a folder for the images). Running the script then extracts the text and images from the PDF and saves them there.
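One way to avoid editing two paths by hand is to derive the output locations from the input path. This is a sketch using pathlib. The function name default_output_paths and the naming scheme (a .txt file and an _images folder next to the PDF) are assumptions, not part of the original script:

```python
from pathlib import Path


def default_output_paths(input_path):
    """Return (text_file, image_folder) located next to the given PDF."""
    pdf_path = Path(input_path)
    text_file = pdf_path.with_suffix(".txt")
    image_folder = pdf_path.with_name(pdf_path.stem + "_images")
    return text_file, image_folder


text_file, image_folder = default_output_paths("C:/path/to/input.pdf")
print(text_file)     # the .txt file next to the PDF
print(image_folder)  # the folder for extracted images
```

Both returned values are Path objects, which can be passed directly to open and os.makedirs.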
Note that the image-extraction code only processes the first page of the PDF. To extract images from other pages, you must modify the code accordingly.
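As a rough sketch of that modification, the loop below visits every page instead of only the first. It assumes PyMuPDF is installed and importable as fitz. The function names, the file-naming scheme, and the choice to skip images already saved from an earlier page are illustrative assumptions:

```python
import os


def image_filename(page_number, image_index, extension):
    """Build a file name such as page003_img01.png (1-based numbers)."""
    return f"page{page_number:03d}_img{image_index:02d}.{extension}"


def extract_all_images(input_path, output_dir):
    """Save the images from every page of a PDF and return how many were written."""
    import fitz  # PyMuPDF; imported here so image_filename works without it

    os.makedirs(output_dir, exist_ok=True)
    seen_xrefs = set()  # an image reused on several pages shares one xref
    saved = 0
    pdf_doc = fitz.open(input_path)
    try:
        for page_index in range(len(pdf_doc)):
            page_images = pdf_doc.get_page_images(page_index)
            for image_index, image in enumerate(page_images, start=1):
                xref = image[0]
                if xref in seen_xrefs:
                    continue  # already written from an earlier page
                seen_xrefs.add(xref)
                image_data = pdf_doc.extract_image(xref)
                name = image_filename(page_index + 1, image_index, image_data["ext"])
                with open(os.path.join(output_dir, name), "wb") as image_file:
                    image_file.write(image_data["image"])
                saved += 1
    finally:
        pdf_doc.close()
    return saved
```

Calling extract_all_images with the input_path and output_path values used earlier writes one file per distinct image and returns the number of files written.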