Extract Table from PDF with Python

Find and extract a table from a PDF file with the pdfplumber library.

Python provides several libraries for PDF table extraction. Libraries like camelot, tabula-py and excalibur-py can easily find and extract well-defined tables. But sometimes all of these powerful libraries fail when you try to extract non-formatted tables.

pdfplumber is a Python library for text and table extraction.

pdfplumber finds:

  • explicitly defined lines
  • intersections of those lines
  • cells that use these intersections

It then groups bordering cells into tables.
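
To make the three steps above concrete, here is a minimal pure-Python sketch of the idea. It is a simplified illustration, not pdfplumber's actual implementation: the function names `find_intersections` and `find_cells` and the line tuples are hypothetical, and merged cells are not handled.

```python
from itertools import product

def find_intersections(h_lines, v_lines):
    """Return the points where a horizontal and a vertical line cross.

    h_lines holds (y, x0, x1) tuples, v_lines holds (x, y0, y1) tuples.
    """
    points = set()
    for (y, x0, x1), (x, y0, y1) in product(h_lines, v_lines):
        if x0 <= x <= x1 and y0 <= y <= y1:
            points.add((x, y))
    return points

def find_cells(points):
    """Return (x0, y0, x1, y1) boxes whose four corners are all intersections."""
    xs = sorted({x for x, _ in points})
    ys = sorted({y for _, y in points})
    cells = []
    # Only neighbouring grid coordinates are tried, so merged cells are missed.
    for (x0, x1), (y0, y1) in product(zip(xs, xs[1:]), zip(ys, ys[1:])):
        corners = {(x0, y0), (x1, y0), (x0, y1), (x1, y1)}
        if corners <= points:
            cells.append((x0, y0, x1, y1))
    return cells

# Hypothetical ruling lines describing a 2 x 2 grid.
h_lines = [(0, 0, 20), (10, 0, 20), (20, 0, 20)]
v_lines = [(0, 0, 20), (10, 0, 20), (20, 0, 20)]

points = find_intersections(h_lines, v_lines)
cells = find_cells(points)
print(len(points), "intersections,", len(cells), "cells")
```

Cells that share an edge would then be grouped into one table. This is also why tables drawn without ruling lines are harder to detect: with no lines there are no intersections to build cells from.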

example.pdf
!pip install pdfplumber
import pdfplumber

filename = "example.pdf"
pdf = pdfplumber.open(filename)
# Extract the largest table on the first page as a list of rows
table = pdf.pages[0].extract_table()
  • pdf.pages: returns the list of pages.
  • page.extract_table(): returns the text extracted from the largest table on the page, represented as a list of lists.
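
Since the result is a plain list of lists, it can be tidied with ordinary Python before further processing. The sketch below assumes that empty cells may come back as None and that cell text may contain line breaks. The helper name `clean_table` and the sample rows are hypothetical and are not taken from example.pdf.

```python
def clean_table(table):
    """Return a copy of `table` with None replaced by "" and whitespace collapsed."""
    return [
        [" ".join(cell.split()) if cell else "" for cell in row]
        for row in table
    ]

# Hypothetical rows in the shape extract_table() returns: one list per row.
raw = [
    [None, "Unit\nprice", "Qty"],
    ["Apple", "1.20", " 3 "],
    ["Pear", "0.80", None],
]

cleaned = clean_table(raw)
for row in cleaned:
    print(row)
```

pdfplumber.open() can also be used as a context manager (with pdfplumber.open(filename) as pdf:), which closes the file when the block ends, and page.extract_tables() returns every table on the page instead of only the largest one.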
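header = 1  # index of the row that holds the column names
columns = []
for column in table[header]:
    # Skip empty cells and single-character strings
    if column is not None and len(column) > 1:
        columns.append(column)
print(columns)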
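import os
import pandas as pd

df = pd.DataFrame(table[header + 1:])
df.index = df[0]  # use the first column as the row index
df = df.rename_axis(index=None)
del df[0]
df.columns = columns
# splitext also works when a directory name in the path contains a dot
df.to_csv(os.path.splitext(filename)[0] + '.csv', encoding='utf-8')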
example.csv
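
If pandas is not available, a similar CSV can be produced with the standard library alone. This is a simplified alternative rather than the approach used above: it keeps every column, does not turn the first column into an index, and the function name `table_to_csv` and the sample rows are hypothetical.

```python
import csv
import io

def table_to_csv(table, header=1):
    """Return the header row and the rows after it as CSV text.

    `table` is a list of lists in the shape extract_table() returns;
    `header` is the index of the row that holds the column names.
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for row in table[header:]:
        # Write missing cells (None) as empty strings.
        writer.writerow(["" if cell is None else cell for cell in row])
    return buffer.getvalue()

# Hypothetical sample data, not taken from example.pdf.
sample = [
    ["Report", None, None],
    ["Item", "Qty", "Price"],
    ["Apple", "3", "1.20"],
    ["Pear", None, "0.80"],
]

csv_text = table_to_csv(sample)
print(csv_text)
```

To save the result, write csv_text to a file opened with encoding='utf-8'.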

Source code is here.
