PDF data extraction

How can I extract data from a PDF using OCR?

Hi Neha, welcome to the Robin forum.

I created a bot that extracts data from a scanned PDF document using XpdfReader (to convert the PDF pages to images) and Tesseract (for OCR).

As part of the configuration, you have to download both tools and add their paths to the system PATH environment variable.
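
If you want to confirm the PATH setup before running the bot, a quick check outside Robin can help. The snippet below is only a sketch in Python (not Robin); the helper name `missing_tools` is made up for illustration, and it assumes the executables are named `pdfimages` and `tesseract`, as in the commands the script calls.

```python
import shutil


def missing_tools(tools=("pdfimages", "tesseract")):
    """Return the names of the tools that cannot be found on PATH."""
    return [name for name in tools if shutil.which(name) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH: " + ", ".join(missing))
    else:
        print("Both tools are available on PATH.")
```

If a tool is reported as missing, re-check the folder you added to PATH and open a new command prompt so the change is picked up.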

High-level steps:

Step 01: Convert the scanned PDF pages to images (JPEG).
Step 02: Loop over all the converted pages and extract the text using Tesseract OCR.
Step 03: Store the result in an output text file.
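
For anyone who wants to try the same three steps outside Robin, here is a rough Python sketch of the flow. It is not the Robin script itself: the function names are hypothetical, it assumes `pdfimages` and `tesseract` are on PATH, and it writes the output file as UTF-8 in one pass instead of appending to it.

```python
import subprocess
import tempfile
from pathlib import Path


def pdfimages_command(pdf_path, image_root="outImage"):
    """Step 01: command that extracts the embedded page images (-j saves JPEG data as .jpg)."""
    return ["pdfimages", "-j", str(pdf_path), image_root]


def tesseract_command(image_path, output_base):
    """Step 02: command that OCRs one image; Tesseract writes <output_base>.txt."""
    return ["tesseract", str(image_path), str(output_base)]


def ocr_pdf(pdf_path, output_text_file):
    """Run steps 01 to 03 and return the number of images processed."""
    pdf_path = Path(pdf_path).resolve()
    with tempfile.TemporaryDirectory() as work_dir:
        work = Path(work_dir)
        # Step 01: extract the images into a temporary working folder.
        subprocess.run(pdfimages_command(pdf_path), cwd=work, check=True)
        images = sorted(work.glob("outImage*"))
        with open(output_text_file, "w", encoding="utf-8") as out:
            for image in images:
                base = work / (image.stem + "_ocr")
                # Step 02: OCR the current image.
                subprocess.run(tesseract_command(image, base), check=True)
                # Step 03: add the recognized text to the output file.
                out.write(base.with_suffix(".txt").read_text(encoding="utf-8"))
                out.write("\n")
    return len(images)
```

Calling `ocr_pdf("scan.pdf", "ocr_output.txt")` (both file names are placeholders) would run the whole flow. `check=True` makes a non-zero exit code raise an exception, which plays the same role as the ExitCode checks in the Robin script.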

set inputFile to  #<< Input pdf file >>     
set outputTextFile to #<<  Text file to write OCR output >>

Console.Write Message: "Process started."
if (File.Exists File: outputTextFile) then
    Console.Write Message: "Deleting existing output file."
    File.Delete Files: outputTextFile
end

Folder.GetSpecialFolder SpecialFolder:Folder.SpecialFolder.DesktopDirectory SpecialFolderPath=> SpecialFolderPath
Folder.Create FolderPath:SpecialFolderPath+"\\"  FolderName:"Temp"  Folder=> CreatedFolder

# Download the PDF-to-image tool (XpdfReader) and add its bin folder path to the system PATH environment variable.
# Download link: https://www.xpdfreader.com/download.html

System.RunDOSCommand DOSCommandOrApplication:"pdfimages -j " + inputFile + " outImage"  WorkingDirectory:CreatedFolder \
                StandardOutput=> StandardOutput StandardError=> StandardError ExitCode=> ExitCode

if ExitCode <> 0 then
    Console.Write Message:  "Error : " + StandardError 
    Console.Write Message:  "OCR process ended with error."  
else
    

    Folder.GetFiles Folder: CreatedFolder  FileFilter:'*' IncludeSubfolders:False FailOnAccessDenied:True \
                    SortBy1:Folder.SortBy.NoSort SortDescending1:False SortBy2:Folder.SortBy.NoSort \
                    SortDescending2:False SortBy3:Folder.SortBy.FullName SortDescending3:False Files=> Files

    loop foreach file in Files
	
        File.GetTempPath TempFile=> TempFile

        # Prerequisites : Install Tesseract and add its installation folder to the system PATH environment variable.
        # Download link : https://github.com/UB-Mannheim/tesseract/wiki
        # Note          : To check the installation and the environment variable, open a command prompt and run tesseract --version.
        #                 If everything is configured properly, it displays the version information; otherwise it reports that the command is not recognized.

        System.RunDOSCommand DOSCommandOrApplication: 'tesseract '+ file +" " + TempFile \
                             StandardOutput=> StandardOutput \
                             StandardError=> StandardError \
                             ExitCode=> ExitCode

        if (ExitCode = 0) then
            File.ReadText File: TempFile+".txt" Encoding:File.TextFileEncoding.UTF8 Content=> OCRResult
            File.Delete Files: TempFile+".txt"
            File.WriteText File: outputTextFile TextToWrite: OCRResult AppendNewLine:True IfFileExists:File.IfFileExists.Append Encoding:File.FileEncoding.Unicode 
        else
            Console.Write Message: "Error : " + StandardError 
            Console.Write Message: "Process ended with an error." 
        end	
    end
end

Console.Write Message: "Process completed."
Folder.Delete Folder: CreatedFolder

Hope this helps. Happy automation!

Thanks,
Ranjith

Hello @Neha and welcome to our community!

Right now there are no modules available for PDF and OCR automation.
We are currently working on them, and they will be publicly available with our next release.

You can follow the great answer from @Ranjith as a workaround for now. :slight_smile:

Best regards,
J.

Hello Ranjith,
Thank you so much for the code.

But I am getting an empty image after running xpdf. Can you help me with this?

Hello jpap,

Ok.
Thank you for your reply.

Regards,
Neha

If possible, please share your input file (PDF).

Hi,
I am trying to extract the data from this PDF in JSON format.

As I am not able to share the PDF, I am sharing an image of it instead.

Thanks and Regards,
Neha

Have you released OCR automation?

Hello @Rinni and welcome to our community!
We have concluded our work on a dedicated PDF module, and it is currently in the testing phase.
Work on a dedicated OCR module has begun.

Best regards,
J.