Extract Table from PDF with Python
Feb 10, 2021
Find and extract a table from a PDF file with the pdfplumber library.
Python provides several libraries for PDF table extraction. Libraries like camelot, tabula-py and excalibur-py can easily find and extract well-defined tables. But sometimes all of these powerful libraries fail when you try to extract non-formatted tables.
pdfplumber is a Python library for text and table extraction. To detect a table, pdfplumber finds:
- explicitly defined lines
- intersections of those lines
- cells that use these intersections
It then groups bordering cells into tables.
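To make these steps concrete, here is a toy sketch in plain Python. It is not pdfplumber's implementation: it assumes every vertical line crosses every horizontal line (a complete grid), and the function name grid_cells and the coordinates are made up for illustration.

```python
from itertools import product


def grid_cells(xs, ys):
    """Return the intersections and cell boxes of a complete grid of lines.

    xs: x-positions of vertical lines, ys: y-positions of horizontal lines.
    Assumption: every vertical line crosses every horizontal line.
    """
    xs, ys = sorted(set(xs)), sorted(set(ys))
    # Every (x, y) pair is a point where two lines cross.
    intersections = list(product(xs, ys))
    # A cell is the box between two neighbouring lines in each direction.
    cells = [
        (x0, y0, x1, y1)
        for y0, y1 in zip(ys, ys[1:])
        for x0, x1 in zip(xs, xs[1:])
    ]
    return intersections, cells


intersections, cells = grid_cells(xs=[0, 50, 120], ys=[0, 20, 40])
print(len(intersections), len(cells))  # 9 4
```

Three vertical and three horizontal lines give nine intersections and four cells. Since all four cells border each other, they would form a single 2x2 table.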
Table for Extraction
Install
!pip install pdfplumber
Import
import pdfplumber
Load PDF file
filename="example.pdf"
pdf=pdfplumber.open(filename)
Extract table
table=pdf.pages[0].extract_table()
- pdf.pages: returns the list of pages.
- page.extract_table(): returns the text extracted from the largest table on the page, represented as a list of lists (one list per row), or None if no table is found.
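If the table is not on the first page, or has no drawn borders, a small helper can try each page in turn. The following is a minimal sketch: the helper names first_table and tables_from_pages are made up for this example, and the fallback assumes that pdfplumber's "text" strategies (passed through table_settings) suit your PDF, which is not always the case.

```python
def first_table(page):
    """Try the default line-based detection, then fall back to text alignment."""
    table = page.extract_table()
    if table is None:
        # The "text" strategies infer cell borders from how words line up,
        # instead of relying on lines drawn in the PDF.
        table = page.extract_table(
            table_settings={
                "vertical_strategy": "text",
                "horizontal_strategy": "text",
            }
        )
    return table


def tables_from_pages(pages):
    """Return (page_number, table) pairs for every page where a table was found."""
    found = []
    for number, page in enumerate(pages, start=1):
        table = first_table(page)
        if table is not None:
            found.append((number, table))
    return found


# Usage (assumes pdfplumber is installed and example.pdf exists):
# import pdfplumber
# with pdfplumber.open("example.pdf") as pdf:
#     for number, table in tables_from_pages(pdf.pages):
#         print(number, len(table))
```

Opening the file in a with block also closes it once the extraction is done.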
Create columns list
header=1  # index of the row that holds the column names
columns=list()
for column in table[header]:
    # skip empty cells and single-character cells
    if column is not None and len(column)>1:
        columns.append(column)
print(columns)
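The loop above drops empty header cells, which works when the only empty cell sits above the index column. On other tables it can leave fewer names than data columns. As an alternative sketch (the function clean_header and the column_N placeholder names are my own, not part of pdfplumber), you can keep one name per column instead:

```python
def clean_header(row):
    """Return one usable name per column of a raw header row."""
    names = []
    for position, cell in enumerate(row):
        # Collapse line breaks and repeated spaces inside a header cell.
        text = " ".join(cell.split()) if cell else ""
        # Empty or missing cells get a positional placeholder name.
        names.append(text or f"column_{position}")
    return names


print(clean_header(["Name", None, "Unit\nprice", ""]))
# ['Name', 'column_1', 'Unit price', 'column_3']
```

Because the result always has as many names as the row has cells, it can be assigned to the columns of a DataFrame built from the same table without a length mismatch.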
Convert to DataFrame
import pandas as pd
df=pd.DataFrame(table[header+1:])
df.index=df[0]
df=df.rename_axis(index=None)
del df[0]
df.columns=columns
Save as CSV
df.to_csv(filename.split('.')[0]+'.csv',encoding='utf-8')
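Note that filename.split('.')[0] cuts the name at the first dot, so report.v2.pdf would be saved as report.csv. A sketch that swaps only the final extension, and writes the raw list of lists with the standard csv module instead of pandas, could look like this (the function name save_table_csv is hypothetical):

```python
import csv
from pathlib import Path


def save_table_csv(table, pdf_path):
    """Write the extracted rows next to the PDF, replacing only the last suffix."""
    csv_path = Path(pdf_path).with_suffix(".csv")
    # newline="" lets the csv module control line endings itself.
    with csv_path.open("w", newline="", encoding="utf-8") as handle:
        # None cells are written as empty fields.
        csv.writer(handle).writerows(table)
    return csv_path


# Usage (assumes `table` came from page.extract_table()):
# save_table_csv(table, "example.pdf")  # writes example.csv
```

If you already have the DataFrame, df.to_csv(Path(filename).with_suffix('.csv'), encoding='utf-8') applies the same path fix.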
Result
Source code is here.