This is inspired by @nutan’s question regarding extracting text from PDFs. (Note: updated to reflect version 4.02 of xpdf-tools).
If it’s not an image/scanned PDF, you don’t have to use a resource-heavy solution to grab text from the PDF.
For this little guide you’ll need the XPDF command line tools, available at https://www.xpdfreader.com/download.html. “pdftotext.exe” may already be on your system as it’s so widely used.
The code below uses a little trick to get the output written to stdout: append ' - ' to the command line after the input file argument. pdftotext treats '-' in the output file position as stdout.
Here’s the code, adjust your file paths accordingly:

    # You will need "pdftotext.exe"
    # One place to find it is https://www.xpdfreader.com/download.html
    # Some version may be on your system - latest version is 4.02
    set cmd to 'C:\Users\you\Documents\Robin\9.2\pdf\pdftotext.exe'
    set arg to ' -layout C:\Users\you\Documents\Robin\9.2\pdf\helloworld.pdf - '
    # The ' -layout' arg preserves newlines.
    System.RunDOSCommand DOSCommandOrApplication: cmd + arg \
        WorkingDirectory:'' \
        StandardOutput=> StandardOutput \
        StandardError=> StandardError \
        ExitCode=> ExitCode
    Console.Write Message: StandardOutput

Here’s the content of ‘helloworld.pdf’:
and here’s the output of the code above.
Running pdftotext.exe with the same arguments from a command prompt will duplicate the results of my code above. There are more command line tools; as I find them useful I’ll update or reply to this guide. I encourage you to read the docs in the ‘docs’ folder of the version 4.02 xpdf-tools.
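For anyone calling pdftotext from outside Robin, here is a rough Python sketch of the same stdout trick. It is only an illustration: the function names build_pdftotext_command and extract_text are made up for this example, the paths are the placeholders from the Robin code, and the latin-1 decoding assumes the tool is writing its default Latin1 output.

```python
import subprocess
from pathlib import Path


def build_pdftotext_command(exe, pdf, layout=True):
    """Build the pdftotext argument list, using '-' as the output file (stdout)."""
    cmd = [str(exe)]
    if layout:
        # Same ' -layout' argument the Robin code passes.
        cmd.append("-layout")
    # Input file first, then '-' so the text goes to stdout instead of a .txt file.
    cmd += [str(pdf), "-"]
    return cmd


def extract_text(exe, pdf, layout=True, encoding="latin-1"):
    """Run pdftotext and return whatever it wrote to stdout as a string.

    The default encoding is an assumption (xpdf's default Latin1 output);
    change it if your build or options produce something else.
    """
    result = subprocess.run(
        build_pdftotext_command(exe, pdf, layout),
        capture_output=True,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return result.stdout.decode(encoding, errors="replace")


if __name__ == "__main__":
    # Placeholder paths copied from the Robin example; adjust to your system.
    exe = Path(r"C:\Users\you\Documents\Robin\9.2\pdf\pdftotext.exe")
    pdf = Path(r"C:\Users\you\Documents\Robin\9.2\pdf\helloworld.pdf")
    if exe.exists() and pdf.exists():
        print(extract_text(exe, pdf))
    else:
        print("Adjust the exe and pdf paths before running.")
```

Passing the arguments as a list rather than one concatenated string avoids quoting problems when a path contains spaces.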