Extract Table from PDF with Python

Yulia Nudelman
Feb 10, 2021

Find and extract a table from a PDF file with the pdfplumber library.

Python provides several libraries for PDF table extraction. Libraries like camelot, tabula-py and excalibur-py can easily find and extract well-defined tables. But sometimes all of these powerful libraries fail when you try to extract non-formatted tables.

pdfplumber is a Python library for text and table extraction.

pdfplumber finds:

  • explicitly defined lines
  • intersections of those lines
  • cells that use these intersections

It then groups bordering cells into tables.
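
To make these steps concrete, here is a minimal pure-Python sketch of the idea. It is a simplified illustration, not pdfplumber's actual implementation: the function names and the tuple formats used for lines are assumptions made for this example, and it only checks that the four corners of a cell exist, not that edges actually connect them.

```python
from itertools import product


def find_intersections(h_lines, v_lines):
    """Return the (x, y) points where a horizontal and a vertical line cross.

    h_lines: iterable of (y, x0, x1); v_lines: iterable of (x, y0, y1).
    """
    points = set()
    for (y, x0, x1), (x, y0, y1) in product(h_lines, v_lines):
        if x0 <= x <= x1 and y0 <= y <= y1:
            points.add((x, y))
    return points


def find_cells(points):
    """Return the smallest rectangles (x0, y0, x1, y1) whose corners are all intersections."""
    cells = []
    for x0, y0 in sorted(points):
        rights = sorted(x for x, y in points if y == y0 and x > x0)
        belows = sorted(y for x, y in points if x == x0 and y > y0)
        # Nearest row below that closes a rectangle with the nearest column to the right.
        cell = next(
            ((x0, y0, x1, y1) for y1 in belows for x1 in rights if (x1, y1) in points),
            None,
        )
        if cell is not None:
            cells.append(cell)
    return cells


def corners(cell):
    """Return the four corner points of a cell."""
    x0, y0, x1, y1 = cell
    return {(x0, y0), (x1, y0), (x0, y1), (x1, y1)}


def group_cells(cells):
    """Group cells that share at least one corner into tables (lists of cells)."""
    tables = []
    remaining = list(cells)
    while remaining:
        table = [remaining.pop(0)]
        table_corners = corners(table[0])
        grew = True
        while grew:
            grew = False
            for cell in list(remaining):
                if table_corners & corners(cell):
                    table.append(cell)
                    table_corners |= corners(cell)
                    remaining.remove(cell)
                    grew = True
        tables.append(table)
    return tables


# A hypothetical 2x2 grid: three horizontal and three vertical ruling lines.
h_lines = [(0, 0, 20), (10, 0, 20), (20, 0, 20)]
v_lines = [(0, 0, 20), (10, 0, 20), (20, 0, 20)]
points = find_intersections(h_lines, v_lines)
cells = find_cells(points)
tables = group_cells(cells)
print(len(points), len(cells), len(tables))  # 9 4 1
```

This also hints at why tables without ruling lines are harder: with no explicit lines there are no intersections to build cells from.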

Table for Extraction

example.pdf

Install

!pip install pdfplumber

Import

import pdfplumber

Load PDF file

filename="example.pdf"
pdf=pdfplumber.open(filename)

Extract table

table=pdf.pages[0].extract_table()

  • pdf.pages: the list of pages in the PDF.
  • page.extract_table(): returns the text extracted from the largest table on the page, represented as a list of lists (one inner list per row). It returns None if no table is found.
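
The call above reads only the first page. A minimal sketch of looping over every page might look like the following. `extract_all_tables` is a hypothetical helper name, not part of pdfplumber; it relies only on `pdf.pages` and `page.extract_table()`, and skips pages where `extract_table()` returns None because no table was found.

```python
def extract_all_tables(pdf):
    """Collect the largest table from each page, skipping pages without one.

    Returns a list of (page_number, table) pairs, with pages numbered from 1.
    """
    tables = []
    for page_number, page in enumerate(pdf.pages, start=1):
        table = page.extract_table()
        if table is not None:
            tables.append((page_number, table))
    return tables


# Intended use with pdfplumber (shown as comments, not executed here):
#   import pdfplumber
#   with pdfplumber.open("example.pdf") as pdf:
#       tables = extract_all_tables(pdf)
```

Opening the file in a `with` block closes it automatically once the tables have been read.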

Create columns list

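header=1  # index of the header row in the extracted table
columns=list()
for column in table[header]:
    # skip empty cells and single-character fragments
    if column is not None and len(column)>1:
        columns.append(column)
print(columns)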
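
To see what this filter keeps and drops, here is the same logic applied to a made-up header row. `clean_header` and `sample_row` are hypothetical names and data used only for illustration.

```python
def clean_header(row):
    """Keep header cells that are not None and are longer than one character."""
    return [cell for cell in row if cell is not None and len(cell) > 1]


# Hypothetical header row with a missing cell, an empty string and a stray dash.
sample_row = [None, "Name", "", "Price", "-", "Quantity"]
print(clean_header(sample_row))  # ['Name', 'Price', 'Quantity']
```

Note that the length check also drops legitimate one-character headers such as "#", so the threshold may need adjusting for other tables.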

Convert to DataFrame

import pandas as pd
df=pd.DataFrame(table[header+1:])
df.index=df[0]
df=df.rename_axis(index=None)
del df[0]
df.columns=columns
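
The DataFrame step uses the first cell of each data row as the row label and the remaining cells as values under the cleaned column names. The plain-Python sketch below mirrors that reshaping without pandas, only to make the shape requirement visible. `rows_to_records` and the sample table are hypothetical. pandas raises a length-mismatch ValueError when `df.columns=columns` receives the wrong number of names, and the sketch checks the same condition explicitly.

```python
def rows_to_records(table, columns, header=1):
    """Map each data row's first cell to a dict of its remaining cells."""
    records = {}
    for row in table[header + 1:]:
        label, values = row[0], row[1:]
        if len(values) != len(columns):
            raise ValueError(
                f"row {label!r} has {len(values)} values but {len(columns)} column names"
            )
        records[label] = dict(zip(columns, values))
    return records


# Hypothetical extracted table: a title row, a header row, then data rows.
sample_table = [
    ["Inventory", None, None],
    [None, "Price", "Quantity"],
    ["Apples", "1.20", "10"],
    ["Pears", "0.80", "25"],
]
records = rows_to_records(sample_table, columns=["Price", "Quantity"])
print(records["Pears"]["Quantity"])  # 25
```

If the header filter removed more cells than expected, this check fails early instead of producing misaligned columns.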

Save as CSV

df.to_csv(filename.split('.')[0]+'.csv',encoding='utf-8')
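
One caveat: `filename.split('.')[0]` cuts at the first dot, so a path such as `reports/v1.2/example.pdf` would lose part of its name. Below is a small sketch using the standard library's `pathlib`, offered as an alternative rather than the author's method. `csv_name` is a hypothetical helper.

```python
from pathlib import PurePosixPath


def csv_name(filename):
    """Replace only the final extension of filename with .csv."""
    # PurePosixPath keeps the output identical on every OS;
    # pathlib.Path behaves the same way for local files.
    return str(PurePosixPath(filename).with_suffix(".csv"))


print(csv_name("example.pdf"))               # example.csv
print(csv_name("reports/v1.2/example.pdf"))  # reports/v1.2/example.csv
print("reports/v1.2/example.pdf".split(".")[0] + ".csv")  # reports/v1.csv
```

With this helper the save step could be written as `df.to_csv(csv_name(filename), encoding='utf-8')`.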

Result

example.csv

Source code is here.
