Custom Module for PDF text extraction

I won’t progress with this unless I get some feedback from interested users or the powers that be. :grinning: You know who you are. :grinning:

This Custom Module will use PdfPig. (What a name, eh?) PdfPig is a port of PDFBox, a mature Java library from Apache. But as @jpap said here, the team is working on a dedicated OCR module and a dedicated PDF module. So it’s unlikely that whatever I do with this will ever be anything other than a stopgap until the next release.

All I’ve done here is use a couple of examples from the PdfPig GitHub page. A note of caution if you’re going to try this at home … :grinning: DO NOT use the NuGet package available from VS. You’ll have to build PdfPig yourself first and then add all the libraries listed in the code as references.

I have referenced everything including the kitchen sink, although only a couple of these libraries are actually used now:

  • UglyToad.Examples
  • UglyToad.PdfPig.Core
  • UglyToad.PdfPig.DocumentLayoutAnalysis
  • UglyToad.PdfPig.Fonts
  • UglyToad.PdfPig.Tokenization
  • UglyToad.PdfPig.Tokens

That said, here is code for “ExtractText”:

using System;
using Robin.Core;
using Robin.Core.Attributes;
using System.ComponentModel;
using UglyToad.PdfPig;
using UglyToad.PdfPig.Content;
using UglyToad.PdfPig.DocumentLayoutAnalysis;
using UglyToad.PdfPig.DocumentLayoutAnalysis.WordExtractor;

namespace Modules.PdfPig
{
    [Action(Order = 1)]
    [Throws("ActionError")] // TODO: change error name (or delete if not needed)
    public class ExtractText : ActionBase
    {
        #region Properties

        // NOTE: You can find sample description and friendly name entries in Resources

        [InputArgument]
        public string InputFile { get; set; }

        [OutputArgument]
        public string TextOut { get; set; }

        #endregion

        #region Methods Overrides

        public override void Execute(ActionContext context)
        {
            try
            {
                using (PdfDocument document = PdfDocument.Open(InputFile))
                {
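                    foreach (Page page in document.GetPages())
                    {
                        // Append each word as-is. No separator is added, so the
                        // words run together (see the first line of the script
                        // output below).
                        foreach (Word word in page.GetWords())
                        {
                            TextOut += word.Text;
                        }
                    }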
                }
            }
            catch (Exception e)
            {
                if (e is ActionException) throw;

                throw new ActionException("ActionError", e.Message, e.InnerException);
            }

            // TODO: set values to Output Arguments here
        }

        #endregion
    }
}
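
If you want spaces between the words, a minimal sketch could look like the following. JoinWords is a hypothetical helper of my own naming, not part of PdfPig or the Robin SDK; the only PdfPig members it relies on (in the usage comment) are page.GetWords() and Word.Text, both already used above. I have not run this inside a Robin module, so treat it as a starting point.

```csharp
using System.Collections.Generic;
using System.Linq;

public static class WordJoiner
{
    // Hypothetical helper: joins the word texts with a single space,
    // skipping null or empty entries, so the words no longer run together.
    public static string JoinWords(IEnumerable<string> words)
    {
        return string.Join(" ", words.Where(w => !string.IsNullOrEmpty(w)));
    }
}

// Assumed usage inside Execute, replacing the inner foreach loop:
//     TextOut += WordJoiner.JoinWords(page.GetWords().Select(w => w.Text));
```

This still puts everything on one line per page; it only restores the spaces between words.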

And here is the code for ExtractTextWithNewLines:

using System;
using Robin.Core;
using Robin.Core.Attributes;
using System.ComponentModel;
using UglyToad.PdfPig;
using UglyToad.PdfPig.Content;
using UglyToad.PdfPig.DocumentLayoutAnalysis;
using UglyToad.PdfPig.DocumentLayoutAnalysis.TextExtractor;


namespace Modules.PdfPig
{
    [Action(Order = 1)]
    [Throws("ActionError")] // TODO: change error name (or delete if not needed)
    public class ExtractTextWithNewLines : ActionBase
    {
        #region Properties

        // NOTE: You can find sample description and friendly name entries in Resources

        [InputArgument]
        public string InputFile { get; set; }

        [OutputArgument]
        public string TextOut { get; set; }

        #endregion

        #region Methods Overrides

        public override void Execute(ActionContext context)
        {
            try
            {
                using (var document = PdfDocument.Open(InputFile))
                {
                    foreach (var page in document.GetPages())
                    {
                        // Extract the page text in content order; passing true asks
                        // the extractor to put a double newline between blocks.
                        TextOut += ContentOrderTextExtractor.GetText(page, true);
                    }
                }
            }
            catch (Exception e)
            {
                if (e is ActionException) throw;

                throw new ActionException("ActionError", e.Message, e.InnerException);
            }

            // TODO: set values to Output Arguments here
        }

        #endregion
    }
}

Here is the code for my test script:

PdfPig.ExtractText \
            InputFile: "C:\work\RobinTests\Lorem2.pdf" \
            TextOut=> TextOut

Console.Write Message: TextOut

PdfPig.ExtractTextWithNewLines \
            InputFile: "C:\work\RobinTests\Lorem2.pdf" \
            TextOut=> TextOut

Console.Write Message: TextOut

PdfPig.ExtractTextWithNewLines \
            InputFile: "C:\work\RobinTests\PigTest2.pdf" \
            TextOut=> TextOut

Console.Write Message: TextOut

Here is the (lengthy) output of the script.

Checking script...
Loading robot...
Running script...
Loremipsumdolorsitamet,consectetueradipiscingelit.Maecenasporttitorconguemassa.Fusceposuere,magnasedpulvinarultricies,puruslectusmalesuadalibero,sitametcommodomagnaerosquisurna.Nuncviverraimperdietenim.Fusceest.Vivamusatellus.Pellentesquehabitantmorbitristiquesenectusetnetusetmalesuadafamesacturpisegestas.Proinpharetranonummypede.Maurisetorci.Aeneanneclorem.Inporttitor.Doneclaoreetnonummyaugue.Suspendisseduipurus,scelerisqueat,vulputatevitae,pretiummattis,nunc.Maurisegetnequeatsemvenenatiseleifend.Utnonummy.
Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Maecenas porttitor congue massa.
Fusce posuere, magna sed pulvinar ultricies, purus lectus malesuada libero, sit amet commodo
magna eros quis urna.

Nunc viverra imperdiet enim. Fusce est. Vivamus a tellus.

Pellentesque habitant morbi tristique senectus et netus et malesuada fames ac turpis egestas.
Proin pharetra nonummy pede. Mauris et orci.

Aenean nec lorem. In porttitor. Donec laoreet nonummy augue.

Suspendisse dui purus, scelerisque at, vulputate vitae, pretium mattis, nunc. Mauris eget neque at
sem venenatis eleifend. Ut nonummy.  
Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Maecenas porttitor congue massa.
Fusce posuere, magna sed pulvinar ultricies, purus lectus malesuada libero, sit amet commodo
magna eros quis urna.

Nunc viverra imperdiet enim. Fusce est. Vivamus a tellus.

Pellentesque habitant morbi tristique senectus et netus et malesuada fames ac turpis egestas.
Proin pharetra nonummy pede. Mauris et orci.

Aenean nec lorem. In porttitor. Donec laoreet nonummy augue.

Suspendisse dui purus, scelerisque at, vulputate vitae, pretium mattis, nunc. Mauris eget neque at
sem venenatis eleifend. Ut nonummy.

Lorem ipsum dolor sit amet, consectetur adipisicing elit. Cuttlefish moth ibis antelope kingfisher,
caiman rhinoceros cockroach, goat. Havanese capybara giraffe! Meerkat cuttlefish otter moose
chinchilla, bird peacock kingfisher? Fossa akita octopus lizard penguin mule otter, serval
bulldog! Chamois, kakapo insect zebra dragonfly bison magpie, snowshoe wildebeest elephant
labradoodle. Pheasant whippet mongoose zebra butterfly many echidna hare. 

Reindeer, baboon chipmunk ferret tapir mongrel gopher bulldog? Tropicbird tortoise lion koala
moth dog panther rat deer newt, hedgehog gar. Sloth to pike cichlid, walrus tapir kudu! Moth
parrot jaguar dolphin, duck mongrel a porcupine. Whippet deer oyster caterpillar. Cuscus,
leopard cuttlefish numbat? 

Flamingo catfish bandicoot nightingale lynx, wildebeest gar chinook chinchilla lizard?
Greyhound, molly bat tapir goat centipede kangaroo a. Heron hamster pelican nightingale
badger, bullfrog, llama? Meerkat guppy gar elephant other baboon? Armadillo ant macaw eagle
akita wolf dunker quokka with centipede. Peacock warthog wolverine, echidna crab ferret
macaw ostrich. Eagle burmese grouse indri fox parrot chimpanzee kakapo goat porcupine lion.

Lorem ipsum dolor sit amet. Seakale shoot collard artichoke endive kohlrabi tomato broccoli
zucchini seed! Scallion seed leek. Together a turnip. 

Pea fennel gram cucumber artichoke bean brussels, kombu eggplant lettuce scallion. Swiss
aubergine very broccoli. Taro turnip scallion coriander, eggplant beetroot, eggplant? Sprout
onion swiss silver grape a watercress beetroot summer. Lettuce avocado earthnut chestnut with
swiss many shallot. Summer broccoli scallion okra? Chestnut onion greens quandong lotus
turnip sorrel. Watercress, pumpkin turnip desert endive cabbage.  Together rutabaga celery onion burdock sprout, beet! Pepper fava seed avocado collard wakame
avocado earthnut gram. Cucumber very sorrel maize. Scallion, salsify chicory fennel. Fennel
dandelion parsley? Bean wakame sea celery winter ricebean.

Lorem ipsum dolor sit amet! Captain scimitar to mast gold. Har plank cannon-ball, shaft curse,
mermaid cannon. Crate ahoy to bow other head. Jolly Roger shaft, captain together. Davey east
be plank, runnin', legend kill. Bread, ye cannon ashore bucket, embark crate dispatch raft? Wreck
craft island shipment sword Jolly Roger shipment bar, sails, strike sword? 

To her, east together bottle dispatch, keel drunk! Off harbor, scurvy bar? Mate scalawag be sail
rum Flying Dutchmen south mermaid Kraken food barge north? Bottle mermaid planks, sunken
south, pirates sink. Planks locker sail ahoy runnin' crate? Devil cannon-ball, tharrr myth direction
booty, drinkin' direction locker arrrrr mate. Raid bunks death island, ahoy, explosion sails
cannon parrot. Together ongoin' other ahoy pub death myth bow Jolly Roger crate ongoin'.
Sword sunken shaft to flee drunk many sink, wreck plank! 

Shaft beard, ambush death sea other. Barrel pub coast, 'til, crate those bunks. Captain, head
scalawag, direction scimitar code planks other. Head ahoy flee rope rudder be Jolly Roger. Bow
south gold sink, myth? Rope 'til flee sails, storage direction sea raid island One-Eyed Willy.
Ashore craft with mate mermaid thar boom west treasure kill legend much!

AHOY, MATEY! THE END IS NIGH! ROBIN SHALL RULE THE SEAS! 
Execution completed successfully.

These are promising results. I can’t get over how easy it is to create Custom Modules in Robin!

Regards,
burque505

Please retry now @burque505.

Best regards,
J.

@jpap Hello James, is there any info on the next release and on open-sourcing it? Please let us know so that we can continue our PoCs for our use cases :slightly_smiling_face: Thanks.

@jpap, fixed! Thank you.
Regards,
burque505

@burque505 thank you :slightly_smiling_face: Could you please publish steps or screenshots showing how to build a custom module? It would be useful for a .NET noob like me.

Hi @Murali, unfortunately the extensive guide by @jpap that used to be here has disappeared. @jpap, would it be possible to bring that back?

I will try to create a guide, though, and hope that the forum will now accept screenshots again. If not, I’ll have to post the screenshots to Imgur, and that gets complicated. Please bear with me while I try to work through it.

Regards,
burque505

@burque505 sure. Thanks.

@burque505, the documentation is up and running. Some URLs have changed after our latest updates.
Here you go:
https://docs.robin-language.org/extending-robin-using-the-sdk/

Best regards,
J.

From now on, please upload only .png files.
If you want to share any .zip files or .gifs, use an external service and paste the URL.
Thank you.

Best regards,
J.

Got it! Thanks, @jpap.
Regards,
burque505
