How to read tables from PDF: tabula-py | Big Data Science

How to read tables from PDF: tabula-py
Sometimes the raw data for analysis is stored in PDF documents. To extract tables from this format straight into a dataframe, try tabula-py. It is a simple Python wrapper for tabula-java that reads tables from a PDF into pandas DataFrames and can also convert them to CSV, TSV, or JSON files.
First install it with pip (tabula-java also needs a Java runtime on the machine):
pip install tabula-py
Then import it into your Python script:
import tabula as tb
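Then read the tables from the page you need:
file = 'DataFile.pdf'
tables = tb.read_pdf(file, pages='12')  # returns a list of DataFrames, one per detected table
df = tables[0]  # first table on page 12; the list is empty if no table was detected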
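Since read_pdf returns a list of DataFrames, a table that spans several pages comes back in pieces. Here is a minimal sketch of one way to stitch the pieces together. The helper names combine_tables and read_pdf_tables are hypothetical (they are not part of tabula-py), and the sketch assumes all the tables share the same columns.

```python
import pandas as pd


def combine_tables(tables):
    """Stack a list of DataFrames into one (hypothetical helper, not part of tabula-py)."""
    # Drop empty frames so they do not affect the result.
    non_empty = [t for t in tables if not t.empty]
    if not non_empty:
        # No tables detected: return an empty DataFrame instead of raising.
        return pd.DataFrame()
    # Assumes the tables share the same columns; otherwise missing cells become NaN.
    return pd.concat(non_empty, ignore_index=True)


def read_pdf_tables(path, pages="all"):
    """Read every table on the given pages and return them as a single DataFrame."""
    # Imported here so combine_tables can be used without tabula-py and Java installed.
    import tabula

    tables = tabula.read_pdf(path, pages=pages, multiple_tables=True)
    return combine_tables(tables)
```

For example, df = read_pdf_tables('DataFile.pdf', pages='12') would give one dataframe for page 12. If you only need a file on disk rather than a dataframe, tabula.convert_into(file, 'output.csv', output_format='csv', pages='all') writes the tables out directly.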
Examples: https://medium.com/codestorm/how-to-read-and-scrape-data-from-pdf-file-using-python-2f2a2fe73ae7
Documentation: https://tabula-py.readthedocs.io/en/latest/

Big Data Science