How to extract data from a PDF using OCR?
Hi Neha, Welcome to the Robin forum.
I created a bot to extract data from a scanned PDF document using XpdfReader (to convert PDF pages to images) and Tesseract (OCR).
As part of the configuration, you have to download both tools and add their installation paths to the system PATH environment variable.
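If Python happens to be installed, a quick way to confirm that both tools are reachable from PATH is sketched below. This is only an illustration and not part of the Robin script; the executable names `pdfimages` and `tesseract` are taken from the commands used in the script, and the helper name `missing_tools` is made up for this example.

```python
import shutil

# Executable names assumed from the commands used in the script;
# adjust them if your installation uses different names.
REQUIRED_TOOLS = ("pdfimages", "tesseract")


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the names of the tools that cannot be found on PATH."""
    return [name for name in tools if which(name) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH: " + ", ".join(missing))
    else:
        print("Both tools are reachable from the command line.")
```

If a tool is reported as missing, re-check the PATH entry and open a new command prompt so the change is picked up.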
Step 01: Convert the scanned PDF pages to images (JPEG).
Step 02: Loop over all the converted pages and extract the text using Tesseract OCR.
Step 03: Store the result into an output text file.
set inputFile to #<< Input pdf file >>
set outputTextFile to #<< Text file to write OCR output >>

Console.Write Message: "Process started."

if (File.Exists File: outputTextFile) then
    Console.Write Message: "Deleting existing output file."
    File.Delete Files: outputTextFile
end

Folder.GetSpecialFolder SpecialFolder:Folder.SpecialFolder.DesktopDirectory SpecialFolderPath=> SpecialFolderPath
Folder.Create FolderPath:SpecialFolderPath+"\\" FolderName:"Temp" Folder=> CreatedFolder

# Download the PDF-to-image tool (XpdfReader) and add its bin folder path to the system variables.
# Link to download XpdfReader (https://www.xpdfreader.com/download.html)
System.RunDOSCommand DOSCommandOrApplication:"pdfimages -j " + inputFile + " outImage" WorkingDirectory:CreatedFolder \
    StandardOutput=> StandardOutput StandardError=> StandardError ExitCode=> ExitCode

if ExitCode <> 0 then
    Console.Write Message: "Error : " + StandardError
    Console.Write Message: "OCR process ended with error."
else
    Folder.GetFiles Folder: CreatedFolder FileFilter:'*' IncludeSubfolders:False FailOnAccessDenied:True \
        SortBy1:Folder.SortBy.NoSort SortDescending1:False SortBy2:Folder.SortBy.NoSort \
        SortDescending2:False SortBy3:Folder.SortBy.FullName SortDescending3:False Files=> Files

    loop foreach file in Files
        File.GetTempPath TempFile=> TempFile

        # Prerequisites: Install Tesseract and add the installation location to the system environment variable (PATH).
        # Download link: https://github.com/UB-Mannheim/tesseract/wiki
        # Note: To check the installation and environment variables, open a command prompt and run tesseract --version.
        # If everything is properly configured, it displays the version information; otherwise it reports that the command is not recognized.
        System.RunDOSCommand DOSCommandOrApplication: 'tesseract '+ file +" " + TempFile \
            StandardOutput=> StandardOutput \
            StandardError=> StandardError \
            ExitCode=> ExitCode

        if (ExitCode = 0) then
            File.ReadText File: TempFile+".txt" Encoding:File.TextFileEncoding.UTF8 Content=> OCRResult
            File.Delete Files: TempFile+".txt"
            File.WriteText File: outputTextFile TextToWrite: OCRResult AppendNewLine:True IfFileExists:File.IfFileExists.Append Encoding:File.FileEncoding.Unicode
        else
            Console.Write Message: "Error : " + StandardError
            Console.Write Message: "Process ended with an error."
        end
    end
end

Console.Write Message: "Process completed."
Folder.Delete Folder: CreatedFolder
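For anyone who wants to try the same three steps outside Robin, here is a rough Python sketch of the flow. It is an illustration under assumptions, not a translation of the Robin actions: the function names (`pdfimages_cmd`, `tesseract_cmd`, `ocr_pdf`) are made up for this example, it assumes `pdfimages` and `tesseract` are on PATH, it writes the output as UTF-8, and it overwrites the output file instead of deleting and appending.

```python
import subprocess
import tempfile
from pathlib import Path


def pdfimages_cmd(pdf_path, image_root):
    # "-j" asks pdfimages to write DCT-encoded images as JPEG files.
    return ["pdfimages", "-j", str(pdf_path), str(image_root)]


def tesseract_cmd(image_path, output_base):
    # Tesseract appends ".txt" to the output base name on its own.
    return ["tesseract", str(image_path), str(output_base)]


def ocr_pdf(pdf_path, output_text_file, run=subprocess.run):
    """Steps 01-03: extract page images, OCR each one, collect the text."""
    pdf_path = Path(pdf_path).resolve()
    with tempfile.TemporaryDirectory() as work_dir:
        work = Path(work_dir)
        # Step 01: extract the images into a temporary folder.
        run(pdfimages_cmd(pdf_path, work / "outImage"), check=True)
        with open(output_text_file, "w", encoding="utf-8") as out:
            # Step 02: OCR the extracted images in file-name order.
            for image in sorted(work.glob("outImage*")):
                base = image.with_suffix("")
                run(tesseract_cmd(image, base), check=True)
                # Step 03: add this page's text to the output file.
                out.write(base.with_suffix(".txt").read_text(encoding="utf-8"))
                out.write("\n")
```

A call would look like `ocr_pdf("scan.pdf", "result.txt")`, where both file names are placeholders. Passing each command as a list of arguments rather than one concatenated string means file paths containing spaces do not need extra quoting.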
Hope this helps. Happy automation!
Hello @Neha and welcome to our community!
Right now there are no modules available for PDF and OCR automation.
We are currently working on them and they are going to be publicly available with our next release.
You can follow the great answer from @Ranjith as a workaround for now.
Thank you so much for the code.
But I am getting an empty image after using Xpdf. Can you help me with this?
Thank you for your reply.
If possible, please share your input file (PDF).
I am trying to extract the data from this PDF in JSON format.
As I am not able to share the PDF, I am sharing an image of it instead.
Thanks and Regards,
Have you released OCR automation?
Hello @Rinni and welcome to our community!
We have concluded our work on a dedicated PDF module and it is currently in the testing phase.
Work on a dedicated OCR module has begun.