Extract Table from PDF with Python

Yulia Nudelman
2 min read · Feb 10, 2021


Find and extract a table from a PDF file with the pdfplumber library.

Photo by Mika Baumeister on Unsplash

Python provides several libraries for PDF table extraction. Libraries like camelot, tabula-py and excalibur-py can easily find and extract well-defined tables. But sometimes all of these powerful libraries fail when you try to extract tables that lack clear formatting.

pdfplumber is a Python library for text and table extraction.

pdfplumber finds:

  • explicitly defined lines
  • intersections of those lines
  • cells that use these intersections

It then groups bordering cells into tables.
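
To make these steps concrete, here is a rough, simplified sketch of the idea in plain Python. It is not pdfplumber's actual implementation: the function names find_intersections and find_cells and the line tuples are made up for illustration, and merged cells and tolerance handling are ignored.

```python
from itertools import product


def find_intersections(h_lines, v_lines):
    """Return the set of (x, y) points where a horizontal and a vertical line cross.

    h_lines: iterable of (y, x_start, x_end)
    v_lines: iterable of (x, y_start, y_end)
    """
    points = set()
    for (y, x_start, x_end), (x, y_start, y_end) in product(h_lines, v_lines):
        if x_start <= x <= x_end and y_start <= y <= y_end:
            points.add((x, y))
    return points


def find_cells(points):
    """Return rectangles (x0, y0, x1, y1) whose four corners are all intersections.

    Simplification: only neighbouring grid coordinates are checked, so merged
    cells are not detected, and the edges between corners are not verified.
    """
    xs = sorted({x for x, _ in points})
    ys = sorted({y for _, y in points})
    cells = []
    for x0, x1 in zip(xs, xs[1:]):
        for y0, y1 in zip(ys, ys[1:]):
            if {(x0, y0), (x1, y0), (x0, y1), (x1, y1)} <= points:
                cells.append((x0, y0, x1, y1))
    return cells


# A made-up 2x2 grid: three horizontal and three vertical lines.
h_lines = [(0, 0, 20), (10, 0, 20), (20, 0, 20)]
v_lines = [(0, 0, 20), (10, 0, 20), (20, 0, 20)]

points = find_intersections(h_lines, v_lines)
cells = find_cells(points)
print(len(points), "intersections,", len(cells), "cells")
```

Cells that share an edge would then be grouped into one table, which is the last step in the list above.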

Table for Extraction

example.pdf

Install

!pip install pdfplumber

Import

import pdfplumber

Load PDF file

filename = "example.pdf"
pdf = pdfplumber.open(filename)
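
pdfplumber.open() can also be used as a context manager, which closes the file when the block ends. A minimal sketch: the helper name count_pages is hypothetical, and pdfplumber is imported inside the function only so that the snippet loads even where the library is not installed.

```python
def count_pages(path):
    """Open a PDF, return its number of pages, and close the file afterwards."""
    import pdfplumber  # third-party library, see the Install step

    with pdfplumber.open(path) as pdf:
        return len(pdf.pages)
```

Calling count_pages("example.pdf") would open the file, count the pages and release the file handle again.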

Extract table

table = pdf.pages[0].extract_table()

  • pdf.pages: the list of pages in the PDF.
  • page.extract_table(): returns the text extracted from the largest table on the page, represented as a list of lists (one inner list per row).
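
To illustrate the list-of-lists shape, here is a small sketch with made-up data (not the contents of example.pdf). Cells without text can come back as None, so it can help to normalize them before further processing. page.extract_table() itself returns None when no table is found on the page, so the hypothetical helper clean_table checks for that first.

```python
# Hypothetical result of page.extract_table(); the values are invented.
sample_table = [
    ["Sales report", None, None],
    ["Product", "Quantity", "Price"],
    ["Apple", "3", " 1.20"],
    ["Pear", None, "0.80 "],
]


def clean_table(table):
    """Replace None cells with empty strings and strip surrounding whitespace."""
    if table is None:  # no table was detected on the page
        return []
    return [[(cell or "").strip() for cell in row] for row in table]


cleaned = clean_table(sample_table)
for row in cleaned:
    print(row)
```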

Create columns list

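header = 1  # index of the row that holds the column names
columns = list()
for column in table[header]:
    # keep real column names only: skip empty cells and one-character strings
    if column is not None and len(column) > 1:
        columns.append(column)
print(columns)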
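
The same filter can be written as a list comprehension. The sketch below uses an invented header row, and header_names is a hypothetical helper, not part of pdfplumber.

```python
def header_names(row):
    """Keep cells that are not None and are longer than one character."""
    return [cell for cell in row if cell is not None and len(cell) > 1]


# Invented header row: the first cell is empty because it sits above the row labels.
header_row = [None, "Quantity", "Price", "-"]
names = header_names(header_row)
print(names)
```

Because the filter drops cells, it is worth checking that the number of names left matches the number of data columns you plan to label.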

Convert to DataFrame

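import pandas as pd

df = pd.DataFrame(table[header + 1:])  # rows below the header row
df.index = df[0]                       # use the first column as row labels
df = df.rename_axis(index=None)        # drop the index name left over from column 0
del df[0]
df.columns = columns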

Save as CSV

df.to_csv(filename.split('.')[0]+'.csv',encoding='utf-8')
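
Note that filename.split('.')[0] keeps only the text before the first dot, so a name such as my.report.pdf would be saved as my.csv. A minimal alternative sketch with the standard library's pathlib, which replaces only the final extension (csv_name is a hypothetical helper):

```python
from pathlib import Path


def csv_name(pdf_name):
    """Return pdf_name with its last extension replaced by .csv."""
    return str(Path(pdf_name).with_suffix(".csv"))


print(csv_name("example.pdf"))
print(csv_name("my.report.pdf"))
```

The result could then be passed to to_csv in place of the split expression, for example df.to_csv(csv_name(filename), encoding='utf-8').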

Result

example.csv

Source code is here.
