How to extract data from PDF files using Python

Published on Aug. 22, 2023, 12:15 p.m.

To extract data from PDF files using Python, you can use several libraries, including PyPDF2, pdfminer, pdfplumber, PDFQuery and PyMuPDF. Here’s an example using the PyPDF2 library:

import PyPDF2

# Open the PDF file in binary read mode
pdfFileObj = open('filename.pdf', 'rb')

# Create a PdfReader object
pdfReader = PyPDF2.PdfReader(pdfFileObj)

# Get the number of pages in the document
num_pages = len(pdfReader.pages)

# Extract text from each page and store it in a list
text_list = []

for page_no in range(num_pages):
    # Get the page
    pageObj = pdfReader.pages[page_no]

    # Extract text from the page and append it to the list
    text_list.append(pageObj.extract_text())

# Close the PDF file object
pdfFileObj.close()

# Print the extracted text
for text in text_list:
    print(text)

This code opens the PDF file in binary read mode, creates a PdfReader object to read it, gets the number of pages in the document, and extracts the text from each page into a list. Finally, it closes the file and prints the extracted text.
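The same steps can also be wrapped in a reusable function. The following is a minimal sketch rather than a definitive implementation: the names `collect_page_texts` and `read_pdf_texts` are hypothetical, PyPDF2 3.0.0 or later is assumed, and the import sits inside the function so the helper can be tried even where the library is not installed.

```python
from typing import Iterable, List


def collect_page_texts(pages: Iterable) -> List[str]:
    """Return the text of each page, using "" when a page yields nothing."""
    texts = []
    for page in pages:
        # extract_text() may give back an empty result (for example on
        # image-only pages); None is normalised to "" as a precaution.
        texts.append(page.extract_text() or "")
    return texts


def read_pdf_texts(path: str) -> List[str]:
    """Open a PDF with PyPDF2 (assumed >= 3.0.0) and return per-page text."""
    import PyPDF2  # imported lazily; requires `pip install PyPDF2`

    # The with-statement closes the file even if extraction raises.
    with open(path, "rb") as pdf_file:
        reader = PyPDF2.PdfReader(pdf_file)
        return collect_page_texts(reader.pages)
```

Calling `read_pdf_texts('filename.pdf')` would then return one string per page, which can be printed or joined as needed.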

Note that the process of extracting data from PDFs can be complex and may require trial and error to get the desired results. Different libraries may work better for different types of PDFs and data extraction methods.
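One way to approach this trial and error is to try several libraries in turn and keep the first result that contains any text. The sketch below is only an illustration under stated assumptions: `first_usable_text` and the two `extract_with_...` wrappers are hypothetical names, PyPDF2 3.0.0 or later and pdfplumber are assumed to be installed if their wrappers are used, and "usable" is taken to mean simply "not blank".

```python
from typing import Callable, Iterable, Optional


def extract_with_pypdf2(path: str) -> str:
    """Extract text with PyPDF2 (assumed >= 3.0.0)."""
    import PyPDF2  # lazy import; needs `pip install PyPDF2`

    with open(path, "rb") as pdf_file:
        reader = PyPDF2.PdfReader(pdf_file)
        return "\n".join(page.extract_text() or "" for page in reader.pages)


def extract_with_pdfplumber(path: str) -> str:
    """Extract text with pdfplumber."""
    import pdfplumber  # lazy import; needs `pip install pdfplumber`

    with pdfplumber.open(path) as pdf:
        return "\n".join(page.extract_text() or "" for page in pdf.pages)


def first_usable_text(
    path: str, extractors: Iterable[Callable[[str], str]]
) -> Optional[str]:
    """Try each extractor in order; return the first non-blank result."""
    for extract in extractors:
        try:
            text = extract(path)
        except Exception:
            # The library is missing or cannot read this file: try the next.
            continue
        if text and text.strip():
            return text
    return None
```

For example, `first_usable_text('filename.pdf', [extract_with_pypdf2, extract_with_pdfplumber])` returns the PyPDF2 result if it contains text and falls back to pdfplumber otherwise. Catching every exception keeps the sketch short; real code would probably log the failure.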

Please note that reader.numPages is deprecated and no longer works in PyPDF2 version 3.0.0 and above; use len(reader.pages) instead. Likewise, reader.getPage(n) has been replaced by reader.pages[n], and extractText() by extract_text().

Here are the steps to install PyPDF2 using pip:

  1. Open a command prompt or terminal window.
  2. Enter the following command to install PyPDF2:
pip install PyPDF2
  3. Wait for the installation to complete.
  4. Verify that PyPDF2 is installed by running the following command:
pip show PyPDF2

This will display information about the PyPDF2 package, including the version number and installed location.
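The same check can be made from inside Python, which is useful for deciding whether the 3.0.0 API changes apply. This is a small sketch using the standard library's `importlib.metadata` (Python 3.8 or later); the helper name `installed_version` is hypothetical.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a package, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_version("PyPDF2")
    if version is None:
        print("PyPDF2 is not installed")
    else:
        print("PyPDF2 version:", version)
```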

Note that if you’re using a specific Python distribution, such as Anaconda or Miniconda, you may need to use the distribution’s specific package manager instead of pip.

