Intro guide: Extract PDF tables with Camelot for Python and Robin

Tables make up a large share of the data we need to extract from PDFs, and the Camelot package for Python helps a lot. It can be difficult to install, both because of the way Python installations on Windows handle the path and because the steps differ depending on whether you use ‘conda’ or ‘pip’, but it’s well worth it.
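
Before fighting with the install, it can help to see what your Python interpreter can actually find. The following is a minimal sketch using only the standard library. `check_camelot_prereqs` is a hypothetical helper name, and the Ghostscript executable names (`gswin64c`, `gswin32c`, `gs`) are assumptions about a typical install, so adjust them to your system.

```python
import importlib.util
import shutil
import sys


def check_camelot_prereqs():
    """Report which Camelot prerequisites this interpreter can find."""
    # Assumed executable names: 'gswin64c'/'gswin32c' on Windows, 'gs' elsewhere.
    gs_names = ("gswin64c", "gswin32c", "gs")
    return {
        # The interpreter being checked; compare it with the path Robin will call.
        "python": sys.executable,
        # find_spec returns None when a top-level module cannot be located.
        "camelot": importlib.util.find_spec("camelot") is not None,
        "tkinter": importlib.util.find_spec("tkinter") is not None,
        # True only if a Ghostscript executable is on the PATH.
        "ghostscript": any(shutil.which(name) for name in gs_names),
    }


for name, found in check_camelot_prereqs().items():
    print(f"{name}: {found}")
```

A `False` for `ghostscript` only means no executable with one of those names is on the PATH; it may still be installed somewhere else. Run the check with the same `python.exe` that Robin will launch, since packages installed for one interpreter are not visible to another.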

Here’s a screenshot of the code for extracting the table provided in the Camelot mini-tutorial on the docs page at https://camelot-py.readthedocs.io/en/master/ (modify it for your purposes, of course):

Here’s a shot of the Excel sheet produced:

Code:

#*************************************************************************
# This script uses Camelot for Python. It can be complicated to install,
# especially on Win10, which puts packages in a weird place. It's worth the effort.
# Please review the instructions at https://camelot-py.readthedocs.io/en/master/
# CAREFULLY. Installation can be frustrating. If you don't have 'conda' on
# your system (I don't), you will have to install camelot-py[cv] AFTER
# installing dependencies, which are tkinter and ghostscript.
#
# Grab 'foo.pdf' here: https://camelot-py.readthedocs.io/en/master/_static/pdf/foo.pdf
#
# This demo closely follows the instructions at 
# https://camelot-py.readthedocs.io/en/master/
#*************************************************************************
set pythonPath to "C:\Users\UserName\AppData\Local\Programs\Python\Python37\python.exe"

# Set the path for the PDF file with tables to read.
set pdfFile to "C:\Users\UserName\Documents\Robin\9.2\pdf\camelot\foo.pdf"

# Set the working directory for 'System.RunApplicationAndWaitToComplete'
set workDir to 'C:\Users\UserName\Documents\Robin\9.2\pdf\camelot'

# Set the Python script name (sys.argv[0]), the PDF file to process (sys.argv[1]), and the
# Excel output file (sys.argv[2])
set pyScript to 'camelot3.py '
set pyArg1 to 'foo.pdf '
set pyArg2 to 'robintest.xlsx'

# Delete the output .xlsx file if it already exists
if (File.Exists File: (workDir + "\\" + pyArg2)) then
    Console.Write Message: "Deleting pre-existing output ..."
    File.Delete Files: (workDir + "\\" + pyArg2) 
else
    Console.Write Message: "No pre-existing output, processing."
end

# That's all it takes!
System.RunApplicationAndWaitToComplete      ApplicationPath:  pythonPath \
                                            CommandLineArguments: pyScript + pyArg1 + pyArg2 \
                                            WorkingDirectory: workDir \
                                            WindowStyle:System.ProcessWindowStyle.Normal \
                                            Timeout:0 \
                                            ProcessId=> ProcessId \
                                            ExitCode=> ExitCode
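
For testing the Python side outside Robin, roughly the same steps can be approximated from Python itself. This is only a sketch under stated assumptions: `run_extraction` is a hypothetical helper name, and it assumes that `Timeout:0` in the Robin action means "wait indefinitely", which is what `subprocess.run` does when no timeout is given.

```python
import subprocess
import sys
from pathlib import Path


def run_extraction(script, pdf_name, xlsx_name, work_dir, python_exe=sys.executable):
    """Run the extraction script in work_dir, wait for it, and return its exit code."""
    work_dir = Path(work_dir)
    output = work_dir / xlsx_name
    # Same clean-up as the Robin script: remove a stale output file first.
    if output.exists():
        print("Deleting pre-existing output ...")
        output.unlink()
    else:
        print("No pre-existing output, processing.")
    # Arguments go in a list, so no trailing spaces are needed to separate them.
    completed = subprocess.run(
        [python_exe, script, pdf_name, xlsx_name],
        cwd=str(work_dir),
    )
    return completed.returncode


# Example call (paths taken from the Robin script above):
# run_extraction("camelot3.py", "foo.pdf", "robintest.xlsx",
#                r"C:\Users\UserName\Documents\Robin\9.2\pdf\camelot")
```

A non-zero return value plays the same role as `ExitCode` in the Robin call: it signals that the Python script failed, for example because Camelot could not be imported.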

And here’s the (really short :slightly_smiling_face:) Python script needed to produce this output:

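import sys

import camelot

# sys.argv[1] is the PDF to read, sys.argv[2] is the Excel file to write.
tables = camelot.read_pdf(sys.argv[1])
# Export the first table found; to_json and to_html work the same way.
tables[0].to_excel(sys.argv[2])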

As you can see, output to JSON and HTML is also possible.
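
One way to use those other formats without editing the script each time is to pick the export method from the output file's extension. The sketch below is an illustration, not part of the original script: `EXPORTERS`, `exporter_name` and `extract_first_table` are hypothetical names, and `to_csv` is included on the assumption that your Camelot version provides it alongside `to_excel`, `to_json` and `to_html`.

```python
from pathlib import Path

# Map output extensions to the name of the Camelot table method that writes them.
EXPORTERS = {
    ".xlsx": "to_excel",
    ".json": "to_json",
    ".html": "to_html",
    ".csv": "to_csv",
}


def exporter_name(out_path):
    """Return the table method name for out_path, based on its extension."""
    suffix = Path(out_path).suffix.lower()
    if suffix not in EXPORTERS:
        raise ValueError(f"Unsupported output extension: {suffix!r}")
    return EXPORTERS[suffix]


def extract_first_table(pdf_path, out_path):
    """Read pdf_path with Camelot and write its first table to out_path."""
    method = exporter_name(out_path)  # fail early, before parsing the PDF
    import camelot  # imported here so exporter_name works without Camelot

    tables = camelot.read_pdf(pdf_path)
    if len(tables) == 0:
        raise RuntimeError(f"No tables found in {pdf_path}")
    # Look up the chosen method on the first table and call it.
    getattr(tables[0], method)(out_path)


# Example: extract_first_table("foo.pdf", "robintest.json")
```

With this in place, changing `pyArg2` in the Robin script to `'robintest.json'` would be enough to switch the output format, provided the Python script calls `extract_first_table(sys.argv[1], sys.argv[2])`.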

Regards,
burque505

@burque505 Great job! :+1:

Great guide @burque505!!!
Looking forward to the next one!

Best regards,
J.
