Use 'pdftotext.exe' to extract text from a non-image PDF

This is inspired by @nutan’s question regarding extracting text from PDFs. (Note: updated to reflect version 4.02 of xpdf-tools).

If it’s not an image/scanned PDF, you don’t have to use a resource-heavy solution to grab text from the PDF.

For this little guide you’ll need the XPDF command line tools, available from the Xpdf download page (https://www.xpdfreader.com/download.html). “pdftotext.exe” may already be on your system, as it’s so widely used.

The code below uses a little trick to get the output written to stdout: append ’ - ’ to the command line after the input file argument, in the position where the output file name would normally go.

Here’s the code (the full listing follows below); adjust your file paths accordingly.

Here’s the content of ‘helloworld.pdf’:

[image: pdfcontent]

and here’s the output of the code:

[image: output]

Code listing:

# You will need "pdftotext.exe"
# One place to find it is https://www.xpdfreader.com/download.html
# Some version may be on your system - latest version is 4.02
set cmd to 'C:\Users\you\Documents\Robin\9.2\pdf\pdftotext.exe'
set arg to ' -layout C:\Users\you\Documents\Robin\9.2\pdf\helloworld.pdf - '
# The ' -layout' arg keeps the original physical layout of the text as far as possible.

System.RunDOSCommand                        DOSCommandOrApplication:  cmd + arg \
                                            WorkingDirectory:'' \
                                            StandardOutput=> StandardOutput \
                                            StandardError=> StandardError \
                                            ExitCode=> ExitCode

Console.Write                               Message: StandardOutput

The listing above will duplicate the results shown earlier. There are more command line tools; as I find them useful I’ll update or reply to this guide. I encourage you to read the docs in the ‘docs’ folder of the version 4.02 xpdf-tools.
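
For readers working outside Robin, the same stdout trick can be sketched in Python. This is only a minimal sketch under stated assumptions: the function names `build_pdftotext_command` and `extract_text` are made up for illustration, and the output is decoded as UTF-8 with replacement characters, which may not match the encoding your pdftotext build actually emits.

```python
import subprocess


def build_pdftotext_command(exe_path, pdf_path, keep_layout=True):
    """Build the pdftotext argument list so the text goes to stdout."""
    command = [exe_path]
    if keep_layout:
        command.append("-layout")  # keep the original physical layout
    command.append(pdf_path)
    command.append("-")  # "-" in the output-file position means stdout
    return command


def extract_text(exe_path, pdf_path, keep_layout=True):
    """Run pdftotext and return whatever it wrote to stdout."""
    result = subprocess.run(
        build_pdftotext_command(exe_path, pdf_path, keep_layout),
        capture_output=True,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    # Assumption: UTF-8 output; undecodable bytes are replaced, not fatal.
    return result.stdout.decode("utf-8", errors="replace")


# Example call (adjust both paths to your own system):
# print(extract_text(r"C:\path\to\pdftotext.exe", r"C:\path\to\helloworld.pdf"))
```

Passing the arguments as a list avoids quoting problems when a path contains spaces.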

Regards,
burque505

But can I extract specific data from the text that was captured (‘StandardOutput’) and store it in an Excel sheet?

@Yash_bitla99, I’ll be working on just that over the weekend. Automating PDF-related processes is a major part of RPA for me, so I’ll be focusing on it. Other tools I’m experimenting with right now are Tabula and Poppler for Windows.

Poppler has a ‘pdftohtml’ tool also, with a ‘-xml’ option. This is reportedly a good pre-processor for further processing with C#, so I’m hopeful.
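
To give a rough idea of what that pre-processing could look like, here is a small Python sketch (Python rather than C#, purely for brevity). The sample XML is hypothetical: its `page` and `text` elements and their `number`, `top` and `left` attributes are assumed from typical `pdftohtml -xml` output and should be checked against what your Poppler build really produces. The helper names are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; real "pdftohtml -xml" output should be verified.
SAMPLE_XML = """<pdf2xml>
  <page number="1" height="1188" width="918">
    <text top="100" left="90" width="80" height="16" font="0">Hello</text>
    <text top="100" left="180" width="90" height="16" font="0"><b>World</b></text>
    <text top="130" left="90" width="120" height="16" font="0">Second line</text>
  </page>
</pdf2xml>"""


def text_fragments(xml_string):
    """Return one dict per <text> element with its page, position and text."""
    root = ET.fromstring(xml_string)
    fragments = []
    for page in root.iter("page"):
        page_number = int(page.get("number"))
        for node in page.iter("text"):
            fragments.append({
                "page": page_number,
                "top": int(node.get("top")),
                "left": int(node.get("left")),
                # itertext() also picks up text nested in <b> or <i>
                "text": "".join(node.itertext()),
            })
    return fragments


def group_into_lines(fragments):
    """Join fragments that share a page and top coordinate, left to right."""
    lines = {}
    ordered = sorted(fragments, key=lambda f: (f["page"], f["top"], f["left"]))
    for frag in ordered:
        lines.setdefault((frag["page"], frag["top"]), []).append(frag["text"])
    return [" ".join(parts) for parts in lines.values()]
```

Having coordinates for every fragment is what makes this format attractive: rows and columns can be rebuilt from `top` and `left` instead of guessing from whitespace.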

For now, all I’ve been able to do is grab the text; I haven’t done any processing of it yet.
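
As a possible starting point for that processing, here is a minimal Python sketch that pulls "Label: value" pairs out of the captured text and turns them into CSV, which Excel can open. It assumes the data of interest really does sit on lines of that shape; the pattern and the function names are illustrative only and would need adapting to a real document.

```python
import csv
import io
import re

# Assumed line shape: "Label: value", with optional surrounding whitespace.
FIELD_PATTERN = re.compile(r"^\s*([^:]+?)\s*:\s*(.+?)\s*$")


def extract_fields(text):
    """Return (label, value) pairs for every line that looks like 'Label: value'."""
    fields = []
    for line in text.splitlines():  # splitlines() copes with \r\n and \n
        match = FIELD_PATTERN.match(line)
        if match:
            fields.append((match.group(1), match.group(2)))
    return fields


def fields_to_csv(fields):
    """Render the pairs as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["field", "value"])
    writer.writerows(fields)
    return buffer.getvalue()
```

Note that this simple pattern also matches things like times ("12:30"), so a real extraction would likely need a list of expected labels.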

Regards,
burque505

I stumbled upon this post just now. I don’t know if this reply will help, but I think Tabula would be a good fit.
Here is basic two-line Python code for when a table has to be extracted from a PDF on its way to Excel:
```python
from tabula import convert_into
# Replace the two placeholder strings with your own PDF path and CSV path.
convert_into("Your PDF Path", "CSV Path", output_format="csv", pages="all")
```

Then the CSV data can be loaded into Excel, either using Robin or with extra Python code.

This is working perfectly for me.
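
The "extra code in Python" step could look roughly like the sketch below. It is not taken from the post above: it assumes the third-party `openpyxl` package is installed, the function names are made up for illustration, and every cell is written as a string exactly as it appears in the CSV.

```python
import csv


def read_csv_rows(csv_path):
    """Read the CSV produced by Tabula into a list of rows (lists of strings)."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        return list(csv.reader(handle))


def write_rows_to_xlsx(rows, xlsx_path):
    """Write the rows to the first sheet of a new workbook."""
    # Imported here so the CSV helper still works without openpyxl installed.
    from openpyxl import Workbook

    workbook = Workbook()
    sheet = workbook.active
    for row in rows:
        sheet.append(row)  # one CSV row becomes one worksheet row
    workbook.save(xlsx_path)


# Example call (placeholder paths):
# write_rows_to_xlsx(read_csv_rows("CSV Path"), "Your Excel Path.xlsx")
```

If the numbers need to be numeric in Excel, they would have to be converted before `sheet.append` is called.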

Very nice @Sahil, thanks for sharing it.
Regards,
burque505