Intro guide: PDF OCR with Ghostscript and Tesseract

Grabbing text from a PDF isn’t so tough if the PDF wasn’t scanned.
For a scanned PDF, things are tougher but getting better all the time.
For this guide, I installed Tesseract alpha (don’t let the ‘alpha’ designation scare
you off; on Windows it’s pretty stable) from https://digi.bib.uni-mannheim.de/tesseract/.
This version was created October 30, 2019, so look for the build with that date. Feel free to install a
non-alpha version to see if it works, but YMMV.

This guide also requires Ghostscript. I installed v.9.50 64-bit from https://www.ghostscript.com/download/gsdnld.html.

Both Tesseract and Ghostscript have a bewildering number of command line options; I just picked what worked.

  1. Convert the PDF to TIFF with Ghostscript
  2. Convert the TIFF to hOCR with Tesseract
  3. Rename the .hocr file to .html
  4. Extract document.body.innerText with JavaScript
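
As a rough illustration of steps 1 to 3, here is a minimal Python sketch (Python rather than Robin, purely for illustration) that only builds the two command lines and the rename target. The function names `build_commands` and `hocr_to_html_name` are hypothetical, and the executable names assume both tools are on your PATH; the switches are the same ones used in this guide.

```python
from pathlib import Path


def build_commands(base, gs_exe="gswin64c", tess_exe="tesseract"):
    """Return the Ghostscript and Tesseract argument lists for one base file name."""
    pdf, tif = f"{base}.pdf", f"{base}.tif"
    # Step 1: PDF -> TIFF, using the same switches as the guide's command line
    gs_cmd = [gs_exe, "-q", "-dNOPAUSE", "-sDEVICE=tiffg4",
              f"-sOutputFile={tif}", pdf, "-c", "quit"]
    # Step 2: TIFF -> hOCR; Tesseract writes <base>.hocr
    tess_cmd = [tess_exe, tif, base, "hocr"]
    return gs_cmd, tess_cmd


def hocr_to_html_name(base):
    """Step 3: the file name the .hocr output gets renamed to."""
    return str(Path(f"{base}.hocr").with_suffix(".html"))


gs_cmd, tess_cmd = build_commands("ns")
print(" ".join(gs_cmd))
print(" ".join(tess_cmd))
print(hocr_to_html_name("ns"))
```

Nothing is executed here; the lists could be handed to whatever process runner you prefer.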

For your own testing, here are some suggested activities:

  1. Find a suitable scanned PDF on the web for downloading:
    a) find images here: https://www.ocrsdk.com/documentation/sample-images/
    b) http://www.mattmahoney.net/ocr/
    c) https://courses.cs.vt.edu/csonline/AI/Lessons/VisualProcessing/OCRscans_files/bowers.jpg
    d) http://courses.cs.vt.edu/csonline/AI/Lessons/VisualProcessing/OCRscans_files/
    e) https://www.topocr.com/samples.html
  2. With Robin, save it to a file.
  3. With Ghostscript, convert it to .tiff.
  4. With Tesseract, convert it to .hocr.
  5. With Robin, rename the .hocr file to .html.
  6. With Robin, launch a browser to display the pseudo-HTML.
  7. With JS, extract “document.body.innerText” to a Robin variable.
  8. Process from there.
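
If you'd rather not go through a browser for step 7, the text can also be pulled straight out of the hOCR markup. The following is a minimal Python sketch using only the standard library, not the method used in this guide. It assumes Tesseract's usual hOCR class names (`ocrx_word` for words, `ocr_line` and a few related classes for lines); the class `HocrText`, the function `hocr_to_text`, and the sample markup are hypothetical.

```python
from html.parser import HTMLParser

# Assumption: these are the line-level classes Tesseract emits in hOCR output
LINE_CLASSES = {"ocr_line", "ocr_textfloat", "ocr_header", "ocr_caption"}


class HocrText(HTMLParser):
    """Collect recognized words, grouped into one list per text line."""

    def __init__(self):
        super().__init__()
        self.lines = []
        self.in_word = False

    def handle_starttag(self, tag, attrs):
        classes = set((dict(attrs).get("class") or "").split())
        if classes & LINE_CLASSES:
            self.lines.append([])      # a new text line starts
        elif "ocrx_word" in classes:
            self.in_word = True        # following text belongs to a word

    def handle_endtag(self, tag):
        if tag == "span":
            self.in_word = False

    def handle_data(self, data):
        if self.in_word and data.strip():
            if not self.lines:         # word seen before any line element
                self.lines.append([])
            self.lines[-1].append(data.strip())


def hocr_to_text(hocr):
    parser = HocrText()
    parser.feed(hocr)
    parser.close()
    return "\n".join(" ".join(words) for words in parser.lines if words)


sample = """<html><body><div class='ocr_page'>
<span class='ocr_line' title='bbox 0 0 100 20'>
<span class='ocrx_word' title='bbox 0 0 40 20'>Hello</span>
<span class='ocrx_word' title='bbox 50 0 100 20'>world</span>
</span>
<span class='ocr_line' title='bbox 0 30 100 50'>
<span class='ocrx_word' title='bbox 0 30 60 50'>Second</span>
<span class='ocrx_word' title='bbox 70 30 100 50'><strong>line</strong></span>
</span>
</div></body></html>"""

print(hocr_to_text(sample))
```

The browser route gives you whatever innerText produces; this route gives you control over how words and lines are joined, at the cost of depending on the class names.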

For reference, here’s my Ghostscript command line (assuming ‘ns’ as a base file name):

gswin64c -q -dNOPAUSE -sDEVICE=tiffg4 -sOutputFile=ns.tif ns.pdf -c quit

[Screenshot: the code, REALLY small]

[Screenshot: the PDF before and after conversion, but before OCR. I couldn’t tell the difference between the PDF and the TIFF, not surprisingly.]

[Screenshot: the result in Notepad after processing]

A full code listing follows. The path names are placeholders and won’t work as-is, so adjust them for your machine:

# *****************************************************************
# Intro guide: Convert scanned PDF to text with Ghostscript
# and Tesseract.
# *****************************************************************

# Path to ghostscript executable
set appPath to 'C:\Program Files\gs\gs9.50\bin\gswin64c.exe'

# Set variable for working directory
set workDir to 'C:\Users\YourNameHere\Documents\Robin\9.2\ocr'

# Set filename WITHOUT extension
set fileName to 'ns'

# Set ghostscript arguments for reasonable image conversion.
# Note that we append '.tif' to the filename for the output, and '.pdf' to the filename for the input.
set argsv to ' -q -dNOPAUSE -sDEVICE=tiffg4 -sOutputFile=' + fileName + '.tif'  + ' ' + fileName + '.pdf' + ' -c quit'

# *****************************************************************
# First, we convert the PDF to TIF with Ghostscript
# *****************************************************************

System.RunApplicationAndWaitToComplete          ApplicationPath:  appPath \
                                                CommandLineArguments: argsv \
                                                WorkingDirectory: workDir \
                                                WindowStyle:System.ProcessWindowStyle.Normal \
                                                Timeout:0 \
                                                ProcessId=> ProcessId \
                                                ExitCode=> ExitCode

# *****************************************************************
# Now we use Tesseract to extract the text from the .TIF file
# *****************************************************************

# First, we set our path to Tesseract and our arguments to Tesseract
set tessPath to 'C:\Program Files\Tesseract-OCR\tesseract.exe'

set argsTess to ' ' + fileName + '.tif ' + fileName + ' hocr'


# Run Tesseract using the supplied path variable and args
System.RunApplicationAndWaitToComplete          ApplicationPath:  tessPath \
                                                CommandLineArguments: argsTess \
                                                WorkingDirectory: workDir \
                                                WindowStyle:System.ProcessWindowStyle.Normal \
                                                Timeout:0 \
                                                ProcessId=> ProcessId \
                                                ExitCode=> ExitCode

# This is a kludge which works great. A .hocr file is XML, which your browser can show if you rename it with a .html extension.
System.RunDOSCommand                            DOSCommandOrApplication: 'ren ' + fileName + '.hocr ' + fileName + '.html' \
                                                WorkingDirectory: workDir \
                                                StandardOutput=> StandardOutput \
                                                StandardError=> StandardError \
                                                ExitCode=> ExitCode

# We need to compose a URL for the browser to load.
set composedURL to workDir + "\\" + fileName + ".html"


# Launch the browser, minimized
WebAutomation.LaunchAutomationBrowser           Url:  composedURL \
                                                WindowState:WebAutomation.BrowserWindowState.Minimized \
                                                ClearCache:False \
                                                ClearCookies:False \
                                                CustomUserAgentString:'' \
                                                BrowserInstance=> Browser

# We'll use Javascript to extract the 'innerText', which is what we want.
WebAutomation.ExecuteJavascript                 BrowserInstance:  Browser \
                                                Javascript: \
                                                """
                                                function ExecuteScript()  
                                                {var a = document.body.innerText; return a;}
                                                """ \
                                                Result=> Result
# Show the result in the console.
Console.Write Message: Result

# Doesn't look great, but Notepad will work. We'll create a text file and open it.

File.WriteText                                  File:  workDir + '\\result.txt'\
                                                TextToWrite:  Result \
                                                AppendNewLine:True \
                                                IfFileExists:File.IfFileExists.Overwrite \
                                                Encoding:File.FileEncoding.Unicode



System.RunApplication                           ApplicationPath:  'notepad.exe' \
                                                CommandLineArguments: workDir + '\\result.txt' \
                                                WorkingDirectory:'' \
                                                WindowStyle:System.ProcessWindowStyle.Normal \
                                                ProcessId=> WordId

WebAutomation.CloseWebBrowser                   BrowserInstance: Browser
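
The listing captures an ExitCode from Ghostscript and Tesseract but never looks at it. As a hedged illustration of what checking it could look like, here is a minimal Python sketch (again Python rather than Robin). The helper name `run_step` is hypothetical, and the Python interpreter stands in for gswin64c or tesseract so the example runs anywhere.

```python
import subprocess
import sys


def run_step(args, cwd=None):
    """Run one external step, wait for it, and raise on a non-zero exit code."""
    done = subprocess.run(args, cwd=cwd, capture_output=True, text=True)
    if done.returncode != 0:
        raise RuntimeError(
            f"{args[0]} exited with code {done.returncode}: {done.stderr.strip()}"
        )
    return done.stdout


# Stand-in command: the Python interpreter instead of gswin64c or tesseract
print(run_step([sys.executable, "-c", "print('ok')"]).strip())
```

Stopping at the first failed step avoids renaming or opening a .hocr file that was never written.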

I hope this will provide a starting point for some useful OCR of PDFs.

Regards,
burque505

Great Job burque505!!! Congrats and thanks for sharing!

@nldavila, thanks for the kind words, and Merry Christmas!
Regards,
burque505

Yes, Merry Christmas to all the Robin Community!
