Extract Table from PDF with Python

Feb 10, 2021

Find and extract tables from a PDF file with the pdfplumber library.

Python provides several libraries for PDF table extraction. Libraries such as camelot, tabula-py and excalibur-py can easily find and extract well-defined tables. But sometimes even these powerful libraries fail when you try to extract tables that have no clear formatting.

pdfplumber is a Python library for text and table extraction.

pdfplumber finds:

explicitly defined lines
intersections of those lines
cells that use these intersections

It then groups bordering cells into tables.
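
The idea of going from lines to intersections to cells can be illustrated with a small toy model. This is only a sketch of the concept, not pdfplumber's actual implementation: the function name `cells_from_lines` and the coordinates are made up for illustration, and it assumes every vertical line crosses every horizontal line, which real pages do not guarantee.

```python
from itertools import product


def cells_from_lines(xs, ys):
    """Build rectangular cells from vertical lines at xs and horizontal lines at ys."""
    xs, ys = sorted(set(xs)), sorted(set(ys))
    # Each (x, y) pair is the intersection of a vertical and a horizontal line.
    intersections = list(product(xs, ys))
    # A cell is bounded by two neighbouring xs and two neighbouring ys.
    cells = [
        (x0, y0, x1, y1)
        for y0, y1 in zip(ys, ys[1:])
        for x0, x1 in zip(xs, xs[1:])
    ]
    return intersections, cells


# Three vertical and three horizontal lines form a 2 x 2 grid of cells.
intersections, cells = cells_from_lines(xs=[0, 50, 120], ys=[0, 20, 40])
print(len(intersections), "intersections,", len(cells), "cells")
```

Cells that share an edge would then be grouped into one table, which is the last step in the list above.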

Table for Extraction

Install

!pip install pdfplumber

Import

import pdfplumber

Load PDF file

filename="example.pdf"
pdf=pdfplumber.open(filename)

Extract table

table=pdf.pages[0].extract_table()

pdf.pages: the list of pages in the document.
page.extract_table(): returns the text extracted from the largest table on the page, represented as a list of lists. Each inner list is one row, and empty cells can come back as None.
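
The two calls can also be wrapped in a small helper that closes the file when it is done. This is a minimal sketch: `extract_first_table` and its `page_number` parameter are names invented here, and it relies on `pdfplumber.open()` working as a context manager.

```python
def extract_first_table(path, page_number=0):
    """Return the largest table on one page as a list of rows, or None."""
    # Imported here so that defining the helper does not require pdfplumber.
    import pdfplumber

    # The with-block closes the underlying file even if extraction fails.
    with pdfplumber.open(path) as pdf:
        page = pdf.pages[page_number]
        return page.extract_table()
```

Calling `extract_first_table("example.pdf")` should give the same list of lists as the snippet above. If the page has no detectable table, expect None instead of a list, so check the result before indexing into it.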

Create columns list

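header = 1  # index of the row in table that holds the column names
columns = []
for column in table[header]:
    # skip empty cells (None) and names shorter than two characters
    if column is not None and len(column) > 1:
        columns.append(column)
print(columns)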
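
Filtering the header row can leave fewer names than there are data columns. One way to keep the two aligned is to remember the positions of the usable header cells and apply the same positions to every data row. The sample `table` below is hypothetical data shaped like an `extract_table()` result, and the names `keep`, `rows` and `index` are introduced only for this sketch.

```python
# Hypothetical extract_table() output: a title row, a header row, then data.
table = [
    ["Quarterly report", None, None, None],
    ["", "Region", None, "Sales"],
    ["2020", "North", None, "120"],
    ["2021", "South", None, "95"],
]
header = 1  # index of the row that holds the column names

# Keep the positions of the usable header cells, not only their text.
keep = [
    i for i, name in enumerate(table[header])
    if name is not None and len(name) > 1
]
columns = [table[header][i] for i in keep]

# The first cell of each data row serves as the row label.
index = [row[0] for row in table[header + 1:]]
# Apply the same positions to the data rows so names and values line up.
rows = [[row[i] for i in keep] for row in table[header + 1:]]

print(columns)  # ['Region', 'Sales']
print(index)    # ['2020', '2021']
print(rows)     # [['North', '120'], ['South', '95']]
```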

Convert to DataFrame

import pandas as pd
df=pd.DataFrame(table[header+1:])
df.index=df[0]
df=df.rename_axis(index=None)
del df[0]
df.columns=columns
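
When the header row lines up with the data, the same result can be reached in fewer steps. The sketch below uses a small hypothetical table in place of real `extract_table()` output, and the column names "Year", "Region" and "Sales" are invented for the example.

```python
import pandas as pd

# Hypothetical rows as extract_table() might return them, header row first.
table = [
    ["Year", "Region", "Sales"],
    ["2020", "North", "120"],
    ["2021", "South", "95"],
]

# Use the header row for column names and the first column as the index.
df = pd.DataFrame(table[1:], columns=table[0]).set_index("Year")
df = df.rename_axis(index=None)

# Extracted cells are strings, so convert numeric columns explicitly.
df["Sales"] = pd.to_numeric(df["Sales"])
print(df)
```

The conversion step matters because every cell arrives as text. Without it, sums and comparisons on the Sales column would operate on strings.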

Save as CSV

df.to_csv(filename.rsplit('.',1)[0]+'.csv',encoding='utf-8')
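
An alternative way to build the output name is pathlib, which replaces only the final extension and so copes with dots elsewhere in the path. `csv_path_for` is a hypothetical helper name, and the second path is made up to show that case.

```python
from pathlib import Path


def csv_path_for(pdf_path):
    """Return the PDF path with its final extension replaced by .csv."""
    return Path(pdf_path).with_suffix(".csv")


print(csv_path_for("example.pdf").name)          # example.csv
print(csv_path_for("reports/q1.2021.pdf").name)  # q1.2021.csv
```

The returned path can be passed straight to `df.to_csv()`.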

Result

Source code is here.