Intro guide: Extract AcroForm PDF data with Python and Robin

AcroForms present both challenges and opportunities.

For this demo, I chose (US) IRS form 1040.

Steps:

  1. Convert the completed AcroForm to a Python ‘OrderedDict’ using ‘PyPDF2’ (needs Python >= 3.6, if I’m not mistaken).
  2. Convert the dictionary to JSON.
  3. Write the JSON to a file.
  4. Read the JSON file to a Robin variable.
  5. Convert the Robin variable to a CustomObject.
  6. Process the output (not much processing here!) :slightly_smiling_face:
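
Steps 1 to 3 boil down to a dictionary-to-file round trip. Here is a minimal sketch of just that part, with a hand-made dictionary standing in for what PyPDF2 returns. The field values and the `fields_to_json_file` helper name are made up for illustration; only the key names are taken from the Robin code further down.

```python
import json
import os
import tempfile
from collections import OrderedDict


def fields_to_json_file(fields, out_path):
    """Serialize a field-name -> value mapping to a JSON file (hypothetical helper)."""
    with open(out_path, "w", encoding="utf-8") as fh:
        json.dump(fields, fh)


# Stand-in for the PyPDF2 result; the values are invented for this sketch.
sample_fields = OrderedDict([
    ("f1_05[0]", "Jane"),
    ("f1_06[0]", "Doe"),
])

out_path = os.path.join(tempfile.mkdtemp(), "robintest.json")
fields_to_json_file(sample_fields, out_path)

# Reading the file back mirrors what Robin does in steps 4 and 5.
with open(out_path, encoding="utf-8") as fh:
    round_trip = json.load(fh)

print(round_trip["f1_05[0]"], round_trip["f1_06[0]"])
```

The keys survive the round trip unchanged, so the same strings can be used to index the CustomObject on the Robin side.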

Here’s what the PDF looks like (apologies to Lada Gaga, I doubt she lives there and I’m sure I pegged her income way low) :smiling_imp:

A screenshot of the Robin code, suitably redacted:

The output:

The Robin code (also redacted - sort of):

#*************************************************************************
# This script uses python to convert an AcroForm PDF to json,
# reads it into a Robin variable, converts that to a CustomObject.
# Some very basic processing is then performed for demo purposes.
#*************************************************************************

#*************************************************************************
# For this demonstration, variables are set outside the main Robin 
# processing code.
# Hopefully this will make it easier to adapt the script for your own paths.
#*************************************************************************

# As the Python script needs 'PyPDF2', Python >= 3.6 is required
set pythonPath to "C:\Users\You\AppData\Local\Programs\Python\Python37\python.exe"

# Set the path for the json file to be produced
set jsonFile to "C:\Users\You\Documents\Robin\9.2\json\robintest.json"

# Set the working directory for 'System.RunApplicationAndWaitToComplete'
set workDir to 'C:\Users\You\Documents\Robin\9.2\json'

# Set the python script name (sys.argv[0]), the PDF file to process (sys.argv[1]), and the
# json output file (sys.argv[2])
set pyScript to 'pdftojson1.py '
set pyArg1 to 'f1040robin.pdf '
set pyArg2 to 'robintest.json'

# Delete the output .json file if it already exists (uses the jsonFile variable set above)
if (File.Exists File: jsonFile) then
    Console.Write Message: "It exists, deleting ..."
    File.Delete Files: jsonFile
else
    Console.Write Message: "It does not exist"
end

# Create a .json file from a PDF using python
System.RunApplicationAndWaitToComplete      ApplicationPath:  pythonPath \
                                            CommandLineArguments: pyScript + pyArg1 + pyArg2 \
                                            WorkingDirectory: workDir \
                                            WindowStyle:System.ProcessWindowStyle.Normal \
                                            Timeout:0 \
                                            ProcessId=> ProcessId \
                                            ExitCode=> ExitCode

# Read the .json created to a Robin variable
File.ReadText                               File:  jsonFile \
                                            Encoding:File.TextFileEncoding.UTF8 \
                                            Content=> Content

# Convert to Custom Object
variables.ConvertJsonToCustomObject         Json:  Content \
                                            CustomObject=> Job

# Basic output processing.
Console.Write Message: Job["f1_05[0]"] + " " + Job["f1_06[0]"]
Console.Write Message: Job["f1_08[0]"] + " Ste " + Job["f1_09[0]"] + ", " + Job["f1_10[0]"]

… and the Python script I used (I know I got some of this from SO, but I don’t remember where and can’t find it again. I did make some additions to it):

# -*- coding: utf-8 -*-

from collections import OrderedDict
from PyPDF2 import PdfFileWriter, PdfFileReader
import json
import sys


def _getFields(obj, tree=None, retval=None, fileobj=None):
    """
    The *tree* and *retval* parameters are for recursive use.

    :param fileobj: A file object (usually a text file) to write
        a report to on all interactive form fields found.
    :return: A dictionary where each key is a field name, and each
        value is a :class:`Field<PyPDF2.generic.Field>` object. By
        default, the mapping name is used for keys.
    :rtype: dict, or ``None`` if form data could not be located.
    """
    fieldAttributes = {'/FT': 'Field Type', '/Parent': 'Parent', '/T': 'Field Name', '/TU': 'Alternate Field Name',
                       '/TM': 'Mapping Name', '/Ff': 'Field Flags', '/V': 'Value', '/DV': 'Default Value'}
    if retval is None:
        retval = OrderedDict()
        catalog = obj.trailer["/Root"]
        # get the AcroForm tree
        if "/AcroForm" in catalog:
            tree = catalog["/AcroForm"]
        else:
            return None
    if tree is None:
        return retval

    obj._checkKids(tree, retval, fileobj)
    for attr in fieldAttributes:
        if attr in tree:
            # Tree is a field
            obj._buildField(tree, retval, fileobj, fieldAttributes)
            break

    if "/Fields" in tree:
        fields = tree["/Fields"]
        for f in fields:
            field = f.getObject()
            obj._buildField(field, retval, fileobj, fieldAttributes)

    return retval


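def get_form_fields(infile):
    # Keep the PDF open while PyPDF2 reads the field objects
    with open(infile, 'rb') as fh:
        reader = PdfFileReader(fh)
        # _getFields returns None when the PDF has no AcroForm
        fields = _getFields(reader) or OrderedDict()
        return OrderedDict((k, v.get('/V', '')) for k, v in fields.items())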



if __name__ == '__main__':
    from pprint import pprint
    # pdf file name
    pdf_file_name = sys.argv[1]

    # pprint(get_form_fields(pdf_file_name))
    val = get_form_fields(pdf_file_name)
    # pprint(val)
    y = json.dumps(val)
    pprint(y)
    # write the output file ("w" overwrites, so a stale file is never appended to)
    with open(sys.argv[2], "w") as f:
        f.write(y)

Some final notes:
You might well want to “prettyprint” the JSON before you try to write your Robin code for processing the CustomObject.

I’ve included a prettyprint line in the Python code that you can modify if you want to send the results to a file.
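
For what it’s worth, `json.dumps` can also do the pretty-printing itself through its `indent` parameter. A minimal sketch, with made-up field values rather than real output from the form:

```python
import json
from collections import OrderedDict

# Invented values; only the key style mirrors the 1040 field names.
val = OrderedDict([("f1_05[0]", "Jane"), ("f1_06[0]", "Doe")])

# indent=2 puts each key on its own line, which makes it easier to
# spot the field names needed for the Robin CustomObject.
pretty = json.dumps(val, indent=2)
print(pretty)
```

Writing `pretty` to the output file instead of the compact string should make no difference to Robin, since the whitespace is not significant in JSON.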

There is a lot of initial work involved to create your process for an AcroForm, but forms like 1040 don’t change - you could process millions of them once you’ve debugged your script.

Regards,
burque505

Excellent work @burque505!
Thank you very much!

Best regards,
J.

Can you explain your code? I’m getting some errors which I’m not able to solve because I haven’t understood the code properly.

@Yash_bitla99, I’ll be happy to try. As far as the Python script goes, I merely modified a script referenced here and here, and explained in great detail and very helpfully here. If you can post your errors, perhaps I can help. Most of the heavy lifting here is done by Python, with Robin just converting “robintest.json”.

Questions:
A) Is robintest.json being written to your file system before the script fails?
B) If so, can you comment out the ‘Console.Write’ lines to see if the errors lie in display, rather than processing?
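
Question A can also be checked from the Python side once the script has run. The following is only a sketch: `check_json_output` is a made-up helper name, and the demo writes a tiny stand-in file so that it runs anywhere. Point it at the real robintest.json instead.

```python
import json
import os
import tempfile


def check_json_output(path):
    """Return the field names in the JSON file, or None if it is missing or invalid."""
    if not os.path.isfile(path):
        return None
    try:
        with open(path, encoding="utf-8") as fh:
            data = json.load(fh)
    except json.JSONDecodeError:
        return None
    return list(data) if isinstance(data, dict) else None


# Demo with a stand-in file; replace demo_path with the path to robintest.json.
demo_path = os.path.join(tempfile.mkdtemp(), "robintest.json")
with open(demo_path, "w", encoding="utf-8") as fh:
    fh.write('{"f1_05[0]": "Jane"}')

print(check_json_output(demo_path))
print(check_json_output(demo_path + ".missing"))
```

If the file exists but the function still returns `None`, the JSON itself is probably malformed, for example because the output of two runs ended up in the same file.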

Regarding the rest of the code, without knowing what your errors are, I can only offer the following.

  1. Make sure Python is >= 3.6. I use 3.7 stock.
  2. Make sure PyPDF2 is correctly installed. Run some easy tests to make sure.
  3. If you’re using the 1040 form I did, please ensure it’s the same one I used.
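
Points 1 and 2 can be checked with a few lines of standard-library Python. This is a rough sketch that only tests whether the interpreter is recent enough and whether a package named `PyPDF2` can be found; it does not prove that the installed PyPDF2 version still provides the classes the script imports.

```python
import importlib.util
import sys

# The script is said to need Python >= 3.6.
python_ok = sys.version_info >= (3, 6)

# find_spec returns None when the package cannot be found; it does not import it.
pypdf2_found = importlib.util.find_spec("PyPDF2") is not None

print("Python version OK:", python_ok)
print("PyPDF2 found:", pypdf2_found)
```

Run it with the same python.exe that `pythonPath` points to in the Robin script, since another interpreter on the machine may have a different set of packages installed.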

Best of luck,
burque505
