Tutorial: Extracting data from a web page and saving it as an excel file

(v6 of the guide - updated script according to Robin v 0.9.2)

GitHub: Robin examples: Web Data Extraction and Storing in Excel file


By utilizing UISpy we can extract various information not only from our desktop/apps but also from the web.
All (if not most) of web related actions are part of the WebAutomation module.
I am going to use the URL burque505 tried extracting data from.

Let’s begin…

Go to this website.
Then launch UISpy.
Click on Add Control.
Position your mouse over an element of the United States table.
Hold Ctrl + Shift and scroll down until you have selected the table element.

Then hold Ctrl and left click in order to capture the window element. Hit done.


Note that I have renamed the .appmask file, the application name, the window name and the selector for convenience.

Click on Edit Selectors option.


Note: You can always modify your selectors through the selector builder.
The default selector will be sufficient in most cases, such as this one.
The best approach is to first experiment with the default(initial selector generated by UISpy) and then modify it according to your needs.


Just to showcase the mods, I have included the whole path (from the body and downwards), and excluded all the class attributes except the one belonging to the table element at the bottom.
If you make any changes you can save them by clicking Update.
To follow the guide step-by-step leave the selector as it was generated.


Open Robin’s editor.
When using .appmask files we have to import them.
It is mandatory to use the import command at the top of your Robin script.
image

Now we have access to the selectors we have created.

To reference a selector the syntax goes as follows:
[name of the .appmask file].[name of the application you are extracting data from].[screen/window name].[control name]

In case you want to reference it multiple times in your script it is a good practice to store it in a variable.
image

Create a variable to hold the URL and then launch a browser instance.
image

Use the WebAutomation.LaunchChrome action and pass the Website variable as an input:
image

After that we can use the .ExtractHtmlTableInExcel action to extract the table we want and write the retrieved data to an excel instance.
image

Finally we save the excel file, terminate the excel application and close the browser.

By navigating to the folder you declared, you will see that a new .xlsx file has been created.
image

Open it and you will see the extracted data.

Here is a screenshot of the whole Robin script:


James Papadimitriou
Technical Evangelist

3 Likes

James, thank you very, very much! I was able to exactly duplicate your results, both on Windows 10 at 96dpi and on Win7 at 120dpi (although the selector outlining is skewed at 125% zoom, it works fine, both for UI automation and web automation). The tip about Ctrl-Shift + scroll for selecting is priceless - exactly what I needed.

I got the same duplication of headers shown in your example. Do you know if that might be avoided by tweaking the selector, possibly? As long as it’s consistent, it doesn’t really matter as the rows can be deleted either manually or (as I will attempt) by the robot, I suppose.

EDIT: This seems a little kludge-y, but definitely removes the offending rows :grinning:.
I put a delay in to make sure I could see it working, works fine without it. I’d be inclined to leave a small delay in there to make sure Excel keeps up.

Best regards,
burque505

Yes burque505.
The header is written in the excel multiple times due to the configuration I made to the selector.
It is not mandatory to modify the selector every time.
It depends on the case. Many times will be sufficient by default, just the way it was generated by UISpy.
This particular case is one of them.
So the best approach would be to first go with the default extractor and then modify it based on the result/need.

Generally speaking, based on the situation and the result you need, you might need to modify the extracted data (by meddling with the excel file, the list, whatever you might have stored the data into).

I am updating the guide.
Thanks for the feedback! :slight_smile:

Best regards,
J.


James Papadimitriou
Technical Evangelist

1 Like

Added link to GitHub repo.
There you can find the source code as well as the utilized .appmask file.
Note: You should modify the directory of the .appmask file for the example to work properly.

Best regards,
J.


James Papadimitriou
Technical Evangelist

1 Like

Updated script according to Robin v 0.9.2 changes.

Best regards,
J.


James Papadimitriou
Technical Evangelist

1 Like

Hi

I tried to follow your instruction but it throws an error. Please help as I really want to learn the table extraction. Thanks


Hi @jayson, if I’m not mistaken version 0.93 removed the ability to use variables as control names.
Also, you have the StocksAppMask set for a Bloomberg page, and the website is set to visit Yahoo.
Analagous code is below (working as of today, but Close and Save code omitted. That part of your code is fine).

# We begin by importing the .appmask file that contains the selector(s).
# In the example the .appmask file is saved in a dedicated folder under the root of the project.
import ".\StocksAppMask2.appmask"

# create a variable to hold the url
set Website to "https://www.bloomberg.com/markets/stocks/world-indexes/americas"

# We then launch a chrome instance with the url of choice.
WebAutomation.LaunchChrome  Url: Website \
                            BrowserInstance=> Browser


# We then run the ExtractHtmlTableInExcel action .
# We are using the Browser instance we have already launched,
# the variable that holds the control selector
# and save the extracted HTML table to an excel instance.
WebAutomation.DataExtraction.ExtractHtmlTableInExcel    BrowserInstance:  Browser \
                                                        Control: StocksAppMask2.StocksApp.BloombergWebPage.USAStocksTable \
                                                        ExcelInstance=> ExcelInstance

Regards,
burque505

Got it! I already made it work. You’re awesome!
Just wondering does the header not included? or just because of my selector?

Thank you again for the help.

1 Like

Hi @jayson, on that particular page the headers are separate from the table.
You might try experimenting with Javascript to extract the table. Consider this code:

data = document.querySelector('div[class*="data-tables first"]');

If you run this from DevTools in Chrome, then you can execute this code:

data.innerText;

And you should see this (different values beyond doubt). You can wrangle the data from there.

"United States
NAME	VALUE	NET CHANGE	% CHANGE	1 MONTH	1 YEAR	TIME (EDT)

CTRN:IND
NASDAQ TRANSPORTATION IX
	5,130.43	-115.58	-2.20%	-1.23%	+0.23%	10/16/2020

SML:IND
S&P 600 SMALLCAP INDEX
	923.71	-5.86	-0.63%	+6.10%	-2.81%	10/16/2020

RTY:IND
RUSSELL 2000 INDEX
	1,633.81	-5.07	-0.31%	+6.31%	+6.40%	10/16/2020

NBI:IND
NASDAQ BIOTECH INDEX
	4,392.63	+21.62	+0.49%	+2.68%	+36.68%	10/16/2020

INDU:IND
DOW JONES INDUS. AVG
	28,606.31	+112.11	+0.39%	+3.43%	+6.86%	10/16/2020

CBNK:IND
NASDAQ BANK INDEX
	2,753.78	-0.03	0.00%	+5.24%	-24.64%	10/16/2020

BBREIT:IND
BBG US REITS
	278.61	-2.06	-0.73%	+1.05%	-15.15%	10/16/2020

NYA:IND
NYSE COMPOSITE INDEX
	13,169.32	+32.07	+0.24%	+2.62%	+1.25%	10/16/2020

NDX:IND
NASDAQ 100 STOCK INDX
	11,852.17	-46.40	-0.39%	+8.37%	+50.63%	10/16/2020

CFIN:IND
NASDAQ OTHER FINANCIAL
	9,952.05	-26.79	-0.27%	+4.94%	+10.50%	10/16/2020

CINS:IND
NASDAQ INSURANCE INDEX
	9,296.75	-11.65	-0.13%	+5.21%	-8.72%	10/16/2020

TRAN:IND
DOW JONES TRANS. AVG
	11,836.43	-152.40	-1.27%	+3.54%	+12.63%	10/16/2020

CUTL:IND
NASDAQ TELECOMM INDEX
	416.91	+0.56	+0.13%	+2.47%	+9.14%	10/16/2020

SPX:IND
S&P 500 INDEX
	3,483.81	+0.47	+0.01%	+4.95%	+16.66%	10/16/2020

CCMP:IND
NASDAQ COMPOSITE
	11,671.56	-42.31	-0.36%	+8.14%	+44.28%	10/16/2020

UTIL:IND
DOW JONES UTILITY AVG
	884.52	+9.30	+1.06%	+10.85%	+2.14%	10/16/2020

BKX:IND
KBW BANK INDEX
	78.07	-0.02	-0.03%	+2.36%	-22.70%	10/16/2020

RIY:IND
RUSSELL 1000 INDEX
	1,947.02	-0.57	-0.03%	+5.36%	+18.09%	10/16/2020

NDF:IND
NASDAQ FINANCIAL INDEX
	4,605.22	-7.39	-0.16%	+5.04%	-5.53%	10/16/2020

RAY:IND
RUSSELL 3000 INDEX
	2,051.06	-0.93	-0.05%	+5.41%	+17.33%	10/16/2020

IXK:IND
NASDAQ COMPUTER INDEX
	8,134.34	-20.63	-0.25%	+10.13%	+59.06%	10/16/2020

MID:IND
S&P 400 MIDCAP INDEX
	1,997.34	-5.48	-0.27%	+7.05%	+3.13%	10/16/2020

CIND:IND
NASDAQ INDUSTRIAL INDEX
	9,201.45	-75.69	-0.82%	+8.25%	+43.25%	10/16/2020"
headerz